Pipeline Overview
Internal Components
1. WatcherBridge (watcher_bridge.go)
The WatcherBridge is the pipeline entry point. It implements the controller-runtime manager.Runnable interface and runs as a manager-managed goroutine.
Responsibilities:
| Function | Description |
|---|---|
| `Start()` | Starts the polling loop (30s) with cancelable context |
| `poll()` | Queries GetAlerts and creates Anomaly CRs |
| `discoverAndConnect()` | Discovers the server via Instance CRs in the cluster |
| `createAnomaly()` | Converts alert -> Anomaly CR with reference labels |
| `alertHash()` | SHA256(type\|deployment\|namespace) for dedup |
| `InvalidateDedupForResource()` | Removes dedup entries for a deployment+namespace |
| `sanitizeK8sName()` | Ensures valid K8s object names (63 chars, lowercase, no special characters) |
- No temporal component: A continuous problem (e.g., CrashLoopBackOff) generates only one Anomaly
- TTL: 2 hours — expired hashes are pruned automatically
- Invalidation: When an Issue reaches a terminal state (Resolved/Escalated), dedup entries for the affected resource are invalidated, allowing immediate recurrence detection
- Result: Avoids duplicates during an active problem; detects recurrence after resolution
2. AnomalyReconciler (anomaly_controller.go)
Watches Anomaly CRs and correlates them into Issues.
Flow:
3. CorrelationEngine (correlation.go)
Correlation engine that groups anomalies into incidents.
Correlation Algorithm:
| Signal | Weight | Justification |
|---|---|---|
| oom_kill | 30 | Indicates severe memory problem |
| error_rate | 25 | Direct impact on users |
| deploy_failing | 25 | Service unavailability |
| latency_spike | 20 | Performance degradation |
| pod_restart | 20 | Pod instability |
| pod_not_ready | 20 | Reduced capacity |
Example: oom_kill (30) + pod_restart (20) = risk 50 -> Medium. Adding error_rate (25) brings the total to 75 -> High.
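The additive scoring can be sketched in Go. The weights come from the table above; the Medium/High threshold boundaries (40 and 70) are assumptions chosen only to reproduce the worked example, not values taken from the source.

```go
package main

import "fmt"

// signalWeights mirrors the correlation weight table above.
var signalWeights = map[string]int{
	"oom_kill":       30,
	"error_rate":     25,
	"deploy_failing": 25,
	"latency_spike":  20,
	"pod_restart":    20,
	"pod_not_ready":  20,
}

// riskScore sums the weights of the correlated signals, capped at 100.
func riskScore(signals []string) int {
	total := 0
	for _, s := range signals {
		total += signalWeights[s]
	}
	if total > 100 {
		total = 100
	}
	return total
}

// severityFor maps a score to a severity band. The boundaries are
// assumptions consistent with the example (50 -> Medium, 75 -> High).
func severityFor(score int) string {
	switch {
	case score >= 70:
		return "High"
	case score >= 40:
		return "Medium"
	default:
		return "Low"
	}
}

func main() {
	s := []string{"oom_kill", "pod_restart"}
	fmt.Println(riskScore(s), severityFor(riskScore(s))) // 50 Medium
	s = append(s, "error_rate")
	fmt.Println(riskScore(s), severityFor(riskScore(s))) // 75 High
}
```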
Source Mapping:
| Anomaly Source | Issue Source |
|---|---|
| watcher | watcher |
| prometheus | prometheus |
| manual | manual |
4. IssueReconciler (issue_controller.go)
Manages the complete lifecycle of an Issue through a state machine.
States and Transitions:
handleDetected()
- Sets `detectedAt` and `maxRemediationAttempts` (default: 3)
- Creates AIInsight CR with owner reference (Issue -> AIInsight)
- Transitions to `Analyzing`
- Requeues after 10 seconds
handleAnalyzing()
- Checks if the AIInsight has `Analysis` populated
- Searches for a matching manual Runbook (`findMatchingRunbook`, tiered matching)
- If a manual Runbook is found -> `createRemediationPlan()` (manual has precedence)
- If no manual Runbook but the AIInsight has `SuggestedActions` -> `generateRunbookFromAI()` -> `createRemediationPlan()` using the auto-generated Runbook
- If neither -> `createAgenticRemediationPlan()` (AgenticMode=true, no pre-defined actions; the AI decides each step)
- Transitions to `Remediating`
findMatchingRunbook() -- Tiered Matching
- Tier 1: SignalType + Severity + ResourceKind (exact match, preferred)
- Tier 2: Severity + ResourceKind (fallback when the signal doesn't match)
- `SignalType` is resolved from `issue.Spec.SignalType`, falling back to `issue.Labels["platform.chatcli.io/signal"]`
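The two-tier lookup can be sketched in plain Go. The `runbook` struct here is a stand-in for the Runbook CR's trigger fields; the real operator lists CRs via the API.

```go
package main

import "fmt"

// runbook models a Runbook CR's trigger fields (illustrative only).
type runbook struct {
	Name         string
	SignalType   string
	Severity     string
	ResourceKind string
}

// findMatchingRunbook models the tiered matching: Tier 1 requires an exact
// match on all three trigger fields; Tier 2 drops the signal as a fallback.
func findMatchingRunbook(books []runbook, signal, severity, kind string) *runbook {
	// Tier 1: SignalType + Severity + ResourceKind (preferred).
	for i := range books {
		b := &books[i]
		if b.SignalType == signal && b.Severity == severity && b.ResourceKind == kind {
			return b
		}
	}
	// Tier 2: Severity + ResourceKind (fallback).
	for i := range books {
		b := &books[i]
		if b.Severity == severity && b.ResourceKind == kind {
			return b
		}
	}
	return nil
}

func main() {
	books := []runbook{
		{Name: "fallback", Severity: "High", ResourceKind: "Deployment"},
		{Name: "exact", SignalType: "oom_kill", Severity: "High", ResourceKind: "Deployment"},
	}
	fmt.Println(findMatchingRunbook(books, "oom_kill", "High", "Deployment").Name)      // exact (tier 1)
	fmt.Println(findMatchingRunbook(books, "latency_spike", "High", "Deployment").Name) // fallback (tier 2)
}
```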
generateRunbookFromAI()
- Materializes `SuggestedActions` from the AI as a reusable Runbook CR
- Name: `auto-{signal}-{severity}-{kind}` (sanitized)
- Labels: `platform.chatcli.io/auto-generated=true`
- Trigger: SignalType + Severity + ResourceKind (for future reuse)
- Uses `CreateOrUpdate` for idempotency
handleRemediating()
- Finds the most recent RemediationPlan (`findLatestRemediationPlan`)
- If `Completed` -> Issue `Resolved` + invalidates dedup for the resource
  - If agentic plan: generates a PostMortem CR (timeline, root cause, impact, lessons) + a reusable Runbook from the successful steps
- If `Failed` with attempts remaining -> re-analysis: collects failure evidence (`collectFailureEvidence`), clears the AIInsight analysis, returns to the `Analyzing` state with failure context
- If `Failed` at max attempts -> `Escalated` + invalidates dedup for the resource

Retry behavior:
- Each retry triggers AI re-analysis with context from previous failures
- The AI receives `previous_failure_context` with evidence from the failed attempts
- The prompt instructs: "Do not repeat the same actions. Analyze why they failed and suggest a fundamentally different approach"
- Generates a new auto-generated Runbook with a different strategy (the name includes the attempt number)
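The state machine above can be condensed into a dispatch sketch. This is a simplified model: the `reconcile` signature and the boolean/counter parameters are illustrative, not the controller-runtime handler signatures, and real handlers create CRs and requeue as described.

```go
package main

import "fmt"

// Issue phases used by the IssueReconciler state machine.
const (
	phaseDetected    = "Detected"
	phaseAnalyzing   = "Analyzing"
	phaseRemediating = "Remediating"
	phaseResolved    = "Resolved"
	phaseEscalated   = "Escalated"
)

// reconcile models the per-phase dispatch: each handler returns the next
// phase; Resolved and Escalated are terminal.
func reconcile(phase string, planCompleted bool, attempts, maxAttempts int) string {
	switch phase {
	case phaseDetected:
		return phaseAnalyzing // handleDetected: create AIInsight, requeue
	case phaseAnalyzing:
		return phaseRemediating // handleAnalyzing: pick or generate a plan
	case phaseRemediating:
		if planCompleted {
			return phaseResolved // + dedup invalidation, possible PostMortem
		}
		if attempts < maxAttempts {
			return phaseAnalyzing // retry: re-analyze with failure context
		}
		return phaseEscalated // + dedup invalidation
	default:
		return phase // terminal states are left untouched
	}
}

func main() {
	fmt.Println(reconcile(phaseRemediating, false, 1, 3)) // Analyzing (retry)
	fmt.Println(reconcile(phaseRemediating, false, 3, 3)) // Escalated
	fmt.Println(reconcile(phaseRemediating, true, 1, 3))  // Resolved
}
```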
5. AIInsightReconciler (aiinsight_controller.go)
Watches AIInsight CRs and calls the AnalyzeIssue RPC to populate the analysis.
Flow:
1. Collects K8s context via `KubernetesContextBuilder` (deployment, pods, events, revisions).
2. Reads failure context from the annotation `platform.chatcli.io/failure-context` (if re-analysis).

KubernetesContextBuilder (k8s_context.go)
Collects 4 sections of real cluster context (max 8000 chars):
- Deployment Status: replicas (desired/ready/updated/unavailable), conditions, container images + resources
- Pod Details (up to 5 pods, unhealthy first): phase, restart count, container states (Waiting/Terminated with reason + exit code)
- Recent Events (last 15): type, reason, message, count
- Revision History: Last 5 revisions (ReplicaSets) with image diff between revisions
| Field | Source | Description |
|---|---|---|
| issue_name | Issue.Name | Issue name |
| namespace | Issue.Namespace | Namespace |
| resource_kind | Issue.Spec.Resource.Kind | Resource type (Deployment) |
| resource_name | Issue.Spec.Resource.Name | Deployment name |
| signal_type | Issue.Spec.SignalType / labels | Signal type |
| severity | Issue.Spec.Severity | Severity |
| description | Issue.Spec.Description | Problem description |
| risk_score | Issue.Spec.RiskScore | Risk score |
| provider | AIInsight.Spec.Provider | LLM provider |
| model | AIInsight.Spec.Model | LLM model |
| kubernetes_context | KubernetesContextBuilder | Deployment status, pods, events, revisions |
| previous_failure_context | Annotation on AIInsight | Evidence from previous attempts (retries) |
6. RemediationReconciler (remediation_controller.go)
Executes the actions defined in a RemediationPlan.
Supported Actions:
| Type | What It Does | Parameters |
|---|---|---|
| ScaleDeployment | `kubectl scale deployment/<name> --replicas=N` | replicas (required) |
| RestartDeployment | `kubectl rollout restart deployment/<name>` | — |
| RollbackDeployment | Rollback to previous, healthy, or specific revision | toRevision (optional: previous, healthy, number) |
| PatchConfig | Updates key(s) in a ConfigMap | configmap, key=value |
| AdjustResources | Adjusts CPU/memory requests/limits | memory_limit, memory_request, cpu_limit, cpu_request, container |
| DeletePod | Removes the sickest pod (CrashLoop > restarts) | pod (optional — auto-selects) |
| Custom | Blocked — requires manual approval | — |
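The dispatch can be sketched for a subset of these action types. The returned kubectl-style strings are only illustrative of the equivalent operation (the real reconciler patches objects through the Kubernetes API), and the `action` struct is a stand-in for the RemediationPlan action spec.

```go
package main

import (
	"errors"
	"fmt"
)

// action models a RemediationPlan action: a type plus free-form parameters.
type action struct {
	Type   string
	Params map[string]string
}

// execute dispatches on the action type, enforcing required parameters and
// the "Custom is blocked" safety rule from the table above.
func execute(a action, deployment string) (string, error) {
	switch a.Type {
	case "ScaleDeployment":
		r, ok := a.Params["replicas"]
		if !ok {
			return "", errors.New("replicas is required")
		}
		return fmt.Sprintf("scale deployment/%s --replicas=%s", deployment, r), nil
	case "RestartDeployment":
		return fmt.Sprintf("rollout restart deployment/%s", deployment), nil
	case "RollbackDeployment":
		return fmt.Sprintf("rollout undo deployment/%s", deployment), nil
	case "Custom":
		return "", errors.New("Custom actions are blocked: manual approval required")
	default:
		return "", fmt.Errorf("unsupported action type %q", a.Type)
	}
}

func main() {
	out, _ := execute(action{Type: "ScaleDeployment", Params: map[string]string{"replicas": "5"}}, "api")
	fmt.Println(out) // scale deployment/api --replicas=5
	_, err := execute(action{Type: "Custom"}, "api")
	fmt.Println(err != nil) // true: blocked
}
```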
7. ServerClient (grpc_client.go)
Shared gRPC client between WatcherBridge and AIInsightReconciler.
| Method | Description |
|---|---|
| `NewServerClient()` | Creates instance (no connection) |
| `Connect(addr)` | Connects via gRPC insecure (10s timeout) |
| `GetAlerts(namespace)` | Fetches alerts from the watcher |
| `AnalyzeIssue(req)` | Sends issue for AI analysis |
| `AgenticStep(req)` | Executes one step of the agentic loop (context + history -> next action) |
| `IsConnected()` | Checks if the connection is active |
| `Close()` | Closes the gRPC connection |
Server and Operator Interaction
GetAlerts RPC
The server exposes K8s Watcher alerts via gRPC: it reads the `ObservabilityStore` of each MultiWatcher target, filters by namespace if specified, and returns the active alerts.
AnalyzeIssue RPC
The server receives the Issue context and calls the LLM for analysis. The prompt includes:
- Issue context (name, namespace, resource, severity, risk score, description)
- The list of available actions (`ScaleDeployment`, `RestartDeployment`, `RollbackDeployment`, `PatchConfig`)
- Instructions to return structured JSON with `analysis`, `confidence`, `recommendations`, and `actions` fields
Response parsing:
- Removes markdown code fences (`` ```json ... ``` ``)
- Parses the JSON into `analysisResult`
- Clamps confidence between 0.0 and 1.0
- If parsing fails -> uses the raw response as the analysis with confidence 0.5
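These parsing steps can be sketched in Go. The `analysisResult` fields mirror the JSON contract described above; the fence handling is simplified relative to the real implementation.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// analysisResult mirrors the structured JSON the LLM is asked to return.
type analysisResult struct {
	Analysis        string   `json:"analysis"`
	Confidence      float64  `json:"confidence"`
	Recommendations []string `json:"recommendations"`
}

// parseAnalysis strips markdown fences, unmarshals the JSON, clamps the
// confidence to [0, 1], and falls back to the raw text on parse failure.
func parseAnalysis(raw string) analysisResult {
	s := strings.TrimSpace(raw)
	s = strings.TrimPrefix(s, "```json")
	s = strings.TrimPrefix(s, "```")
	s = strings.TrimSuffix(s, "```")
	var res analysisResult
	if err := json.Unmarshal([]byte(strings.TrimSpace(s)), &res); err != nil {
		// Parsing failed: keep the raw response with a neutral confidence.
		return analysisResult{Analysis: raw, Confidence: 0.5}
	}
	if res.Confidence < 0 {
		res.Confidence = 0
	}
	if res.Confidence > 1 {
		res.Confidence = 1
	}
	return res
}

func main() {
	r := parseAnalysis("```json\n{\"analysis\":\"OOM from low limits\",\"confidence\":1.4}\n```")
	fmt.Println(r.Analysis, r.Confidence) // OOM from low limits 1
	fmt.Println(parseAnalysis("not json").Confidence) // 0.5
}
```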
AgenticStep RPC
The server receives the Issue context, the history of previous steps, and the updated K8s context, and decides the next action. The prompt includes:
- Role + Issue details: incident context (type, severity, resource)
- Kubernetes context: real cluster state (refreshed at each step via KubernetesContextBuilder)
- Tool definitions: the 6 available mutating actions + "Observe" (no action, wait for the next context)
- Conversation history: each previous step formatted as reasoning -> action -> observation
- Instructions: respond in JSON, step budget (step N of M), safety rules

When the AI returns `resolved=true`, the response also includes the data for PostMortem generation (summary, root_cause, impact, lessons_learned, prevention_actions).
PostMortem Generation
When an agentic remediation resolves an Issue, the `IssueReconciler` automatically generates:
PostMortem CR
Created via `generatePostMortem()`:
| Field | Source |
|---|---|
| timeline | Issue.DetectedAt + each step from AgenticHistory + resolved |
| actionsExecuted | Steps with Action != nil (includes result) |
| summary | Annotation platform.chatcli.io/postmortem-summary (AI-generated) |
| rootCause | Annotation platform.chatcli.io/root-cause |
| impact | Annotation platform.chatcli.io/impact |
| lessonsLearned | Annotation platform.chatcli.io/lessons-learned |
| preventionActions | Annotation platform.chatcli.io/prevention-actions |
| duration | Calculated: resolvedAt - detectedAt |
Auto-generated Runbook (Agentic)
Created via `generateAgenticRunbook()`:
- Name: `agentic-{signal}-{severity}-{kind}` (sanitized)
- Steps: only the steps with successful actions
- Labels: `auto-generated=true`, `source=agentic`
- Uses `CreateOrUpdate` (reused for future incidents of the same type)
Operator Prometheus Metrics
The operator exposes Prometheus metrics for observability:

| Metric | Type | Description |
|---|---|---|
| chatcli_operator_issues_total | Counter | Total issues by severity and state |
| chatcli_operator_issue_resolution_duration_seconds | Histogram | Duration from detection to resolution |
| chatcli_operator_active_issues | Gauge | Number of unresolved issues |
Tests
The operator has 96 tests (125 with subtests) covering all components:

| Component | Tests | Coverage |
|---|---|---|
| InstanceReconciler | 15 | CRUD, watcher, persistence, replicas, RBAC, deletion, deepcopy |
| AnomalyReconciler | 4 | Creation, correlation, attachment to existing Issue |
| IssueReconciler | 12 | State machine, AI fallback, retry, agentic plan, PostMortem generation |
| RemediationReconciler | 16 | All action types, safety checks, agentic loop (first step, resolved, max steps, timeout, action failed, observation) |
| AIInsightReconciler | 12 | Connectivity, mock RPC, analysis parsing, withAuth, TLS/token |
| PostMortemReconciler | 2 | State initialization, terminal state |
| WatcherBridge | 22 | Alert mapping, SHA256 dedup, hash, pruning, Anomaly creation, buildConnectionOpts (TLS, token, both) |
| CorrelationEngine | 4 | Risk scoring, severity, incident ID, related anomalies |
| Pipeline (E2E) | 3 | Complete flow: Anomaly->Issue->Insight->Plan->Resolved, escalation, correlation |
| MapActionType | 6 | All string->enum mappings |
Run Tests
Ownership Diagram (Garbage Collection)
- Instance is the owner of all Kubernetes resources it creates (Deployment, Service, ConfigMap, SA, PVC)
- Issue is the owner of AIInsight, RemediationPlan, and PostMortem (cascade delete)
- Anomalies are independent (no owner) to preserve history
AIOps Deployment Checklist
Verify AIOps pipeline
- `kubectl get anomalies -A` — anomalies being detected
- `kubectl get issues -A` — issues being created
- `kubectl get aiinsights -A` — AI analyzing