Platform v2 Components
Notifications
NotificationPolicy & EscalationPolicy
SLO & SLA
ServiceLevelObjective & IncidentSLA
Approvals
ApprovalPolicy & ApprovalRequest
Multi-Cluster
ClusterRegistration & Federation
Audit
AuditEvent (immutable trail)
Chaos Engineering
ChaosExperiment with safety checks
REST API & Dashboard
In addition to gRPC, the operator now exposes a REST HTTP API on port 8090 with 30+ endpoints covering incidents, SLOs, runbooks, approvals, postmortems, analytics, clusters, and audit. Authentication uses the X-API-Key header with role mapping (viewer/operator/admin) and is rate-limited to 100 requests per minute per key. A Web Dashboard is embedded and served at /.
For the complete reference, see the API Reference.
Pipeline Overview
Internal Components
1. WatcherBridge (watcher_bridge.go)
The WatcherBridge is the pipeline entry point. It implements the controller-runtime manager.Runnable interface and runs as a manager-managed goroutine.
Responsibilities:
| Function | Description |
|---|---|
| Start() | Starts the polling loop (30s) with cancelable context |
| poll() | Queries GetAlerts and creates Anomaly CRs |
| discoverAndConnect() | Discovers server via Instance CRs in the cluster |
| createAnomaly() | Converts alert -> Anomaly CR with reference labels |
| alertHash() | SHA256(type\|deployment\|namespace) for dedup |
| InvalidateDedupForResource() | Removes dedup entries for a deployment+namespace |
| sanitizeK8sName() | Ensures valid names for K8s objects (63 chars, lowercase, no special characters) |
- No temporal component: A continuous problem (e.g., CrashLoopBackOff) generates only one Anomaly
- TTL: 2 hours — expired hashes are pruned automatically
- Invalidation: When an Issue reaches a terminal state (Resolved/Escalated), dedup entries for the affected resource are invalidated, allowing immediate recurrence detection
- Result: Avoids duplicates during an active problem; detects recurrence after resolution
2. AnomalyReconciler (anomaly_controller.go)
Watches Anomaly CRs and correlates them into Issues.
Flow:
3. CorrelationEngine (correlation.go)
Correlation engine that groups anomalies into incidents.
Correlation Algorithm:
| Signal | Weight | Justification |
|---|---|---|
| oom_kill | 30 | Indicates severe memory problem |
| error_rate | 25 | Direct impact on users |
| deploy_failing | 25 | Service unavailability |
| latency_spike | 20 | Performance degradation |
| pod_restart | 20 | Pod instability |
| pod_not_ready | 20 | Reduced capacity |
Example: oom_kill (30) + pod_restart (20) = risk score 50 -> Medium. Adding error_rate (25) raises the score to 75 -> High.
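The arithmetic in this example can be reproduced with a small sketch. The weights are copied from the table; the function name `riskScore` and the rule that each distinct signal counts once are assumptions made for illustration. The score-to-severity thresholds are not specified in this document, so the sketch stops at the numeric score.

```go
package main

import "fmt"

// signalWeights copies the weight table above.
var signalWeights = map[string]int{
	"oom_kill":       30,
	"error_rate":     25,
	"deploy_failing": 25,
	"latency_spike":  20,
	"pod_restart":    20,
	"pod_not_ready":  20,
}

// riskScore sums the weights of the distinct signals it is given.
// Counting each signal type once is an assumption of this sketch.
func riskScore(signals []string) int {
	seen := make(map[string]bool)
	score := 0
	for _, s := range signals {
		if seen[s] {
			continue
		}
		seen[s] = true
		score += signalWeights[s] // unknown signals contribute 0
	}
	return score
}

func main() {
	fmt.Println(riskScore([]string{"oom_kill", "pod_restart"}))
	fmt.Println(riskScore([]string{"oom_kill", "pod_restart", "error_rate"}))
}
```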
Source Mapping:
| Anomaly Source | Issue Source |
|---|---|
| watcher | watcher |
| prometheus | prometheus |
| manual | manual |
4. IssueReconciler (issue_controller.go)
Manages the complete lifecycle of an Issue through a state machine.
States and Transitions:
handleDetected()
- Sets `detectedAt` and `maxRemediationAttempts` (default: 5, configurable via Instance `aiops.maxRemediationAttempts`)
- Creates AIInsight CR with owner reference (Issue -> AIInsight)
- Transitions to `Analyzing`
- Requeues after 10 seconds
handleAnalyzing()
- Checks if AIInsight has `Analysis` populated
- Searches for matching manual Runbook (`findMatchingRunbook`, tiered matching)
- If manual Runbook found -> `createRemediationPlan()` (manual has precedence)
- If no manual Runbook but AIInsight has `SuggestedActions` -> `generateRunbookFromAI()` -> `createRemediationPlan()` using the auto-generated Runbook
- If none -> `createAgenticRemediationPlan()` (AgenticMode=true, no pre-defined actions; AI decides each step)
- Transitions to `Remediating`
findMatchingRunbook() -- Tiered Matching
- Tier 1: SignalType + Severity + ResourceKind (exact match, preferred)
- Tier 2: Severity + ResourceKind (fallback when signal doesn't match)

`SignalType` is resolved from `issue.Spec.SignalType`, falling back to `issue.Labels["platform.chatcli.io/signal"]`.
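The two tiers can be sketched as two passes over the candidate Runbooks. The `runbook` and `trigger` types below are simplified stand-ins for the real CRD types, and the signature of `findMatchingRunbook` here is an assumption; only the tier rules and the label key come from the text.

```go
package main

import "fmt"

// trigger and runbook are simplified, hypothetical stand-ins for the CRD types.
type trigger struct {
	SignalType, Severity, ResourceKind string
}

type runbook struct {
	Name    string
	Trigger trigger
}

// resolveSignalType prefers the spec field and falls back to the label.
func resolveSignalType(specSignal string, labels map[string]string) string {
	if specSignal != "" {
		return specSignal
	}
	return labels["platform.chatcli.io/signal"]
}

// findMatchingRunbook tries an exact match first (Tier 1) and then
// falls back to Severity + ResourceKind only (Tier 2).
func findMatchingRunbook(runbooks []runbook, signal, severity, kind string) (runbook, bool) {
	for _, rb := range runbooks {
		t := rb.Trigger
		if signal != "" && t.SignalType == signal && t.Severity == severity && t.ResourceKind == kind {
			return rb, true
		}
	}
	for _, rb := range runbooks {
		t := rb.Trigger
		if t.Severity == severity && t.ResourceKind == kind {
			return rb, true
		}
	}
	return runbook{}, false
}

func main() {
	runbooks := []runbook{
		{Name: "generic-high-deploy", Trigger: trigger{Severity: "High", ResourceKind: "Deployment"}},
		{Name: "oom-high-deploy", Trigger: trigger{SignalType: "oom_kill", Severity: "High", ResourceKind: "Deployment"}},
	}
	signal := resolveSignalType("", map[string]string{"platform.chatcli.io/signal": "oom_kill"})
	if rb, ok := findMatchingRunbook(runbooks, signal, "High", "Deployment"); ok {
		fmt.Println("matched:", rb.Name)
	}
}
```

Running Tier 1 as a separate pass means an exact match wins even when a more generic Runbook appears earlier in the list.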
generateRunbookFromAI()
- Materializes `SuggestedActions` from AI as a reusable Runbook CR
- Name: `auto-{signal}-{severity}-{kind}` (sanitized)
- Labels: `platform.chatcli.io/auto-generated=true`
- Trigger: SignalType + Severity + ResourceKind (for future reuse)
- Uses `CreateOrUpdate` for idempotency
handleRemediating()
- Finds the most recent RemediationPlan (`findLatestRemediationPlan`)
- If `Completed` -> Issue `Resolved` + invalidates dedup for the resource
  - If agentic plan: generates PostMortem CR (timeline, root cause, impact, lessons) + reusable Runbook from successful steps
- If `Failed` and remaining attempts -> re-analysis: collects failure evidence (`collectFailureEvidence`), clears AIInsight analysis, returns to `Analyzing` state with failure context
- If `Failed` and max attempts -> `Escalated` + invalidates dedup for the resource
- Each retry triggers AI re-analysis with context from previous failures
- AI receives
previous_failure_contextwith evidence from failed attempts - The prompt instructs: “Do not repeat the same actions. Analyze why they failed and suggest a fundamentally different approach”
- Generates new auto-generated Runbook with different strategy (name includes attempt)
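The branching in `handleRemediating()` can be summarized as a pure decision function. This is a sketch under assumptions: the `decision` type and the `decide` function are hypothetical, `attempts` is taken to mean the number of attempts already used, and a plan that is neither `Completed` nor `Failed` is assumed to leave the Issue in `Remediating`.

```go
package main

import "fmt"

// Plan and Issue state names as used in the text above.
const (
	planCompleted = "Completed"
	planFailed    = "Failed"

	issueResolved    = "Resolved"
	issueAnalyzing   = "Analyzing"
	issueEscalated   = "Escalated"
	issueRemediating = "Remediating"
)

// decision is a hypothetical result type for this sketch.
type decision struct {
	NextState       string
	InvalidateDedup bool // terminal states clear dedup entries for the resource
	Reanalyze       bool // retry with failure context from previous attempts
}

// decide maps the latest plan state and the attempt counters to the next step.
func decide(planState string, attempts, maxAttempts int) decision {
	switch {
	case planState == planCompleted:
		return decision{NextState: issueResolved, InvalidateDedup: true}
	case planState == planFailed && attempts < maxAttempts:
		return decision{NextState: issueAnalyzing, Reanalyze: true}
	case planState == planFailed:
		return decision{NextState: issueEscalated, InvalidateDedup: true}
	default:
		// Assumption: a plan still in progress keeps the Issue in Remediating.
		return decision{NextState: issueRemediating}
	}
}

func main() {
	fmt.Printf("%+v\n", decide(planFailed, 2, 5))
	fmt.Printf("%+v\n", decide(planFailed, 5, 5))
	fmt.Printf("%+v\n", decide(planCompleted, 1, 5))
}
```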
5. AIInsightReconciler (aiinsight_controller.go)
Watches AIInsight CRs and calls the AnalyzeIssue RPC to populate the analysis.
Flow:
- Collects K8s context via `KubernetesContextBuilder` (deployment, pods, events, revisions).
- Reads failure context from annotation `platform.chatcli.io/failure-context` (if re-analysis).

k8s_context.go:
Collects real cluster context for Deployments, StatefulSets, DaemonSets, Jobs, CronJobs, and HPAs (max 12000 chars):
- Resource Status: replicas, conditions, containers, images + resources (each type has a dedicated context builder)
- StatefulSet: replicas, update strategy, partition, PodManagementPolicy, VolumeClaimTemplates
- DaemonSet: desired/current/ready/available/unavailable, nodeSelector, tolerations
- Job/CronJob: active/succeeded/failed, completions, parallelism, schedule, lastSuccessful
- HPA: min/max replicas, current/desired, target utilization, current metrics, maxed-out detection
- Pod Details (up to 5 pods, unhealthy first): phase, restart count, container states
- Recent Events (last 15): type, reason, message, count
- Revision History: Last 5 revisions (ReplicaSets) with image diff between revisions
log_analyzer.go:
Advanced application log analysis (beyond the basic 50-line tail):
- Stack Trace Extraction: detects and extracts stack traces from Java (Exception/Caused by), Go (panic/goroutine), Python (Traceback), Node.js (Error at)
- Error Pattern Detection: 24+ critical patterns categorized (crash, connectivity, dns, auth, storage, tls, database, cache, messaging)
- Structured Log Parsing: extracts error/warn entries from JSON logs (fields level, msg, error, timestamp, logger)
- Init Container Logs: analyzes init container logs (reveals startup failures)
- Sidecar Logs: analyzes sidecar logs (istio-proxy, envoy, datadog-agent, etc.)
- Critical Lines: extracts FATAL/PANIC lines with 3 lines of context before/after
- Temporal Window: fetches logs by temporal window (10min before the incident), not just tail
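The "Critical Lines" item above can be illustrated with a short sketch that returns each FATAL/PANIC line together with three lines of surrounding context. The function name `criticalSnippets` and the case-sensitive substring match are assumptions; the real analyzer may detect these lines differently.

```go
package main

import (
	"fmt"
	"strings"
)

// criticalSnippets returns one snippet per line containing FATAL or PANIC,
// including up to `around` lines before and after it. Overlapping windows
// are not merged in this sketch, so nearby hits may repeat lines.
func criticalSnippets(logText string, around int) []string {
	lines := strings.Split(logText, "\n")
	var snippets []string
	for i, line := range lines {
		if !strings.Contains(line, "FATAL") && !strings.Contains(line, "PANIC") {
			continue
		}
		start := i - around
		if start < 0 {
			start = 0
		}
		end := i + around + 1
		if end > len(lines) {
			end = len(lines)
		}
		snippets = append(snippets, strings.Join(lines[start:end], "\n"))
	}
	return snippets
}

func main() {
	logText := strings.Join([]string{
		"starting server",
		"loading config",
		"connecting to db",
		"FATAL could not connect to db",
		"shutting down",
	}, "\n")
	for _, s := range criticalSnippets(logText, 3) {
		fmt.Println(s)
		fmt.Println("---")
	}
}
```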
metrics_collector.go:
Prometheus queries for quantitative data during analysis:
- CPU/Memory: usage trends 30min before → during → 15min after the incident
- Request/Error Rate: HTTP requests and 5xx per second
- Latency: P50, P95, P99 histogram percentiles
- HPA Metrics: current vs desired replicas, CPU target
- Network: receive/transmit bytes/s
- Trend Analysis: detects spikes, drops, sustained_high/low with % change calculation
- Enabled via: `PROMETHEUS_URL` env var on the operator
gitops_detector.go:
Detects and integrates with GitOps tools:
- Helm Releases: detects via Secrets of type `helm.sh/release.v1`, status (deployed/failed/pending-upgrade), chart version, previous revision for rollback
- ArgoCD Applications: sync status (Synced/OutOfSync), health (Healthy/Degraded), conditions, last sync result
- Flux Kustomizations: ready status, source ref, conditions, last applied
source_controller.go:
Code-aware diagnostics when SourceRepository CRD is configured:
- Git Correlation: finds commits in the 30min before the incident
- Suspected Commit: identifies the most likely commit (score by temporal proximity + volume of changes)
- Code Extraction: extracts code snippets referenced in stack traces (file path + line number → source code)
- Config Analysis: reads Dockerfile, values.yaml, Chart.yaml for deploy context
cascade_analyzer.go:
Cross-service cascade failure analysis:
- Dependency Graph: discovers dependencies via Services + EndpointSlices
- Temporal Correlation: finds active issues in the same namespace and cross-namespace within a 15-20min window
- Cascade Chain: orders services by detection time (first = root cause)
- Root Cause Service: identifies the service that originated the cascade
blast_radius.go:
Impact prediction before action execution:
- PDB Check: verifies if the action would violate PodDisruptionBudgets
- Quota Check: verifies ResourceQuotas (>90% used = warning)
- Node Capacity: counts pods on node for cordon/drain actions
- Affected Services: discovers which Services would be impacted
- Risk Level: classifies as low/medium/high/critical
| Field | Source | Description |
|---|---|---|
| issue_name | Issue.Name | Issue name |
| namespace | Issue.Namespace | Namespace |
| resource_kind | Issue.Spec.Resource.Kind | Resource type (Deployment) |
| resource_name | Issue.Spec.Resource.Name | Deployment name |
| signal_type | Issue.Spec.SignalType / labels | Signal type |
| severity | Issue.Spec.Severity | Severity |
| description | Issue.Spec.Description | Problem description |
| risk_score | Issue.Spec.RiskScore | Risk score |
| provider | AIInsight.Spec.Provider | LLM provider |
| model | AIInsight.Spec.Model | LLM model |
| kubernetes_context | 6 enrichers combined | K8s status (Deploy/STS/DS/Job/CronJob/HPA) + log analysis (stack traces, error patterns) + Prometheus metrics (trends) + GitOps (Helm/ArgoCD/Flux) + source code (commits, code snippets) + cascade analysis + RCA enrichment |
| previous_failure_context | Annotation on AIInsight | Evidence from previous attempts (retries) |
6. RemediationReconciler (remediation_controller.go)
Executes the actions defined in a RemediationPlan.
Supported Actions (54 types across 9 categories):
Deployment / Generic (19 actions):
| Category | Type | What It Does | Key Parameters |
|---|---|---|---|
| Workload | ScaleDeployment | Adjusts Deployment replicas | replicas |
| Workload | RestartDeployment | Rollout restart via annotation | — |
| Workload | RollbackDeployment | Rollback to previous/healthy/specific revision (via ReplicaSet) | toRevision |
| Workload | PatchConfig | Updates ConfigMap data | configmap, key=value |
| Workload | AdjustResources | Adjusts CPU/memory on Deployment containers | container, memory_limit, cpu_limit, etc. |
| Workload | DeletePod | Removes the sickest pod (auto-selects) | pod (optional) |
| Workload | RestartStatefulSetPod | Restart specific StatefulSet pod or rolling restart | pod (optional) |
| GitOps | HelmRollback | Rollback Helm release | revision |
| GitOps | ArgoSyncApp | Trigger ArgoCD sync | revision |
| Autoscaling | AdjustHPA | Modifies HPA min/max/target | minReplicas, maxReplicas, targetCPUUtilization |
| Infra | CordonNode | Marks node unschedulable | node |
| Infra | DrainNode | Cordons and evicts pods from node | node |
| Storage | ResizePVC | Expands PVC (no shrinking) | pvc, size |
| Security | RotateSecret | Updates Secret values or copies from source | secret, sourceSecret or key=value |
| Networking | UpdateIngress | Modifies Ingress backend/annotations | ingress, backendService, backendPort |
| Networking | PatchNetworkPolicy | Adds ports to NetworkPolicy ingress rules | networkPolicy, allowPort, protocol |
| Advanced | ApplyManifest | Applies JSON manifest from ConfigMap | configmap, key |
| Advanced | ExecDiagnostic | Runs a command from a read-only allowlist inside a pod | command (see allowlist), pod, container |
| — | Custom | Blocked — requires manual approval | — |
StatefulSet (9 actions):
| Type | What It Does | Key Parameters |
|---|---|---|
| ScaleStatefulSet | Ordered replica scaling | replicas |
| RestartStatefulSet | Rolling restart via annotation (ordered) | — |
| RollbackStatefulSet | Rollback via ControllerRevision (not ReplicaSet) | toRevision (previous\|N) |
| AdjustStatefulSetResources | Adjusts CPU/memory on StatefulSet containers | container, memory_limit, cpu_limit, etc. |
| DeleteStatefulSetPod | Deletes specific or unhealthiest pod (preserves PVC identity) | pod (optional) |
| ForceDeleteStatefulSetPod | Force-delete stuck Terminating pod (grace=0) | pod (REQUIRED) |
| UpdateStatefulSetStrategy | Changes updateStrategy type | type (RollingUpdate\|OnDelete), maxUnavailable |
| RecreateStatefulSetPVC | Deletes stuck PVC for recreation | pvc, confirm=true (REQUIRED) |
| PartitionStatefulSetUpdate | Sets partition for canary rollout | partition |
DaemonSet (7 actions):
| Type | What It Does | Key Parameters |
|---|---|---|
| RestartDaemonSet | Rolling restart of all DaemonSet pods across nodes | — |
| RollbackDaemonSet | Rollback via ControllerRevision | toRevision (previous\|N) |
| AdjustDaemonSetResources | Adjusts CPU/memory on DaemonSet containers | container, memory_limit, cpu_limit, etc. |
| DeleteDaemonSetPod | Deletes pod (optionally on specific node) | pod or node (optional) |
| UpdateDaemonSetStrategy | Changes update strategy | type, maxUnavailable, maxSurge |
| PauseDaemonSetRollout | Pauses rollout (sets maxUnavailable=0) | — |
| CordonAndDeleteDaemonSetPod | Cordons node + deletes DaemonSet pod on it | node (REQUIRED) |
Job (9 actions):
| Type | What It Does | Key Parameters |
|---|---|---|
| RetryJob | Deletes failed Job + recreates from spec | — |
| AdjustJobResources | Adjusts CPU/memory on Job template | container, memory_limit, cpu_limit, etc. |
| DeleteFailedJob | Cleans up a failed Job and its pods | — |
| SuspendJob | Pauses a running Job (suspend=true) | — |
| ResumeJob | Resumes a suspended Job (suspend=false) | — |
| AdjustJobParallelism | Changes Job parallelism | parallelism |
| AdjustJobDeadline | Changes activeDeadlineSeconds | activeDeadlineSeconds |
| AdjustJobBackoffLimit | Changes backoffLimit | backoffLimit |
| ForceDeleteJobPods | Force-deletes all pods of a Job (grace=0) | — |
CronJob (10 actions):
| Type | What It Does | Key Parameters |
|---|---|---|
| SuspendCronJob | Pauses CronJob scheduling (suspend=true) | — |
| ResumeCronJob | Resumes CronJob scheduling (suspend=false) | — |
| TriggerCronJob | Creates a Job from CronJob template immediately | — |
| AdjustCronJobResources | Adjusts CPU/memory on jobTemplate containers | container, memory_limit, cpu_limit, etc. |
| AdjustCronJobSchedule | Changes cron schedule expression | schedule |
| AdjustCronJobDeadline | Changes startingDeadlineSeconds | startingDeadlineSeconds |
| AdjustCronJobHistory | Changes success/failure history limits | successfulJobsHistoryLimit, failedJobsHistoryLimit |
| AdjustCronJobConcurrency | Changes concurrencyPolicy | concurrencyPolicy (Allow\|Forbid\|Replace) |
| DeleteCronJobActiveJobs | Kills all currently running Jobs | — |
| ReplaceCronJobTemplate | Replaces jobTemplate from ConfigMap JSON | configmap, key |
7. ServerClient (grpc_client.go)
Shared gRPC client between WatcherBridge and AIInsightReconciler.
| Method | Description |
|---|---|
| NewServerClient() | Creates instance (no connection) |
| Connect(addr) | Connects via gRPC insecure (10s timeout) |
| GetAlerts(namespace) | Fetches alerts from the watcher |
| AnalyzeIssue(req) | Sends issue for AI analysis |
| AgenticStep(req) | Executes one step of the agentic loop (context + history -> next action) |
| IsConnected() | Checks if connection is active |
| Close() | Closes gRPC connection |
Server and Operator Interaction
GetAlerts RPC
The server exposes K8s Watcher alerts via gRPC. It queries the ObservabilityStore of each MultiWatcher target, filters by namespace if specified, and returns active alerts.
AnalyzeIssue RPC
The server receives the Issue context and calls the LLM for analysis. The prompt includes:
- Issue context (name, namespace, resource, severity, risk score, description)
- List of 19 available actions organized by category (Workload, GitOps, Autoscaling, Infra, Storage, Security, Networking, Advanced)
- Instructions to return structured JSON with `analysis`, `confidence`, `recommendations`, and `actions` fields

The response is then processed as follows:
- Removes markdown code blocks (```json ... ```)
- Parses JSON into `analysisResult`
- Clamps confidence between 0.0 and 1.0
- If parsing fails -> uses raw response as analysis with confidence 0.5
AgenticStep RPC
The server receives the Issue context, history of previous steps, and updated K8s context, and decides the next action. The prompt includes:
- Role + Issue details: incident context (type, severity, resource)
- Kubernetes context: real cluster state (refreshed at each step via KubernetesContextBuilder)
- Tool definitions: 18 available mutating actions + "Observe" (no action, wait for next context)
- Conversation history: each previous step formatted with reasoning -> action -> observation
- Instructions: respond JSON, budget (step N of M), safety rules

When `resolved=true`, the response includes data for PostMortem generation (summary, root_cause, impact, lessons_learned, prevention_actions).
PostMortem Generation
When any remediation resolves an Issue (standard or agentic), the IssueReconciler automatically generates:
PostMortem CR
Created via `generatePostMortem()`:
| Field | Source |
|---|---|
| timeline | Issue.DetectedAt + each step from AgenticHistory + resolved |
| actionsExecuted | Steps with Action != nil (includes result) |
| summary | Annotation platform.chatcli.io/postmortem-summary (AI-generated) |
| rootCause | Annotation platform.chatcli.io/root-cause |
| impact | Annotation platform.chatcli.io/impact |
| lessonsLearned | Annotation platform.chatcli.io/lessons-learned |
| preventionActions | Annotation platform.chatcli.io/prevention-actions |
| duration | Calculated: resolvedAt - detectedAt |
- Trending: detection of recurring incidents (count in the last 30 days, related PostMortems)
- Cascade Chain: cascade failure chain if there are correlated cross-service issues
- Git Correlation: suspected commit (SHA, author, changed files, confidence)
- GitOps Context: Helm/ArgoCD/Flux state at the time of the incident
Auto-generated Runbook (Agentic)
Created via `generateAgenticRunbook()`:
- Name: `agentic-{signal}-{severity}-{kind}` (sanitized)
- Steps: only steps with successful actions
- Labels: `auto-generated=true`, `source=agentic`
- Uses `CreateOrUpdate` (reused for future incidents of the same type)
Operator Prometheus Metrics
The operator exposes Prometheus metrics for observability:

| Metric | Type | Description |
|---|---|---|
| chatcli_operator_issues_total | Counter | Total issues by severity and state |
| chatcli_operator_issue_resolution_duration_seconds | Histogram | Duration from detection to resolution |
| chatcli_operator_active_issues | Gauge | Number of unresolved issues |
Tests
The operator has 130 tests (185 with subtests) covering all components:

| Component | Tests | Coverage |
|---|---|---|
| InstanceReconciler | 15 | CRUD, watcher, persistence, replicas, RBAC, deletion, deepcopy |
| AnomalyReconciler | 4 | Creation, correlation, attachment to existing Issue |
| IssueReconciler | 12 | State machine, AI fallback, retry, agentic plan, PostMortem generation |
| RemediationReconciler | 38 | All 54 action types (Deployment + StatefulSet + DaemonSet + Job + CronJob), safety constraints, agentic loop, rollback, verification |
| AIInsightReconciler | 12 | Connectivity, mock RPC, analysis parsing, withAuth, TLS/token |
| PostMortemReconciler | 2 | State initialization, terminal state |
| WatcherBridge | 22 | Alert mapping, SHA256 dedup, hash, pruning, Anomaly creation, buildConnectionOpts (TLS, token, both) |
| CorrelationEngine | 4 | Risk scoring, severity, incident ID, related anomalies |
| Pipeline (E2E) | 3 | Complete flow: Anomaly->Issue->Insight->Plan->Resolved, escalation, correlation |
| MapActionType | 6+17 | All 54 string->enum mappings including StatefulSet, DaemonSet, Job, CronJob actions |
Run Tests
Ownership Diagram (Garbage Collection)
- Instance is the owner of all Kubernetes resources it creates (Deployment, Service, ConfigMap, SA, PVC)
- Issue is the owner of AIInsight, RemediationPlan, and PostMortem (cascade delete)
- Anomalies are independent (no owner) to preserve history
AIOps Deployment Checklist
Verify AIOps pipeline
- `kubectl get anomalies -A`: anomalies being detected
- `kubectl get issues -A`: issues being created
- `kubectl get aiinsights -A`: AI analyzing
Next Steps
K8s Operator
Configuration and examples
K8s Watcher
Collection and budget details
Server Mode
GetAlerts, AnalyzeIssue, and AgenticStep RPCs
K8s Monitoring
Recipe: K8s Monitoring with AI