The ChatCLI AIOps Platform is an autonomous system that detects problems in Kubernetes, analyzes root causes with AI, and executes automatic remediations — all orchestrated by native Kubernetes CRDs. This page covers the internal architecture in depth. For configuration and usage examples, see K8s Operator.

Pipeline Overview


Internal Components

1. WatcherBridge (watcher_bridge.go)

The WatcherBridge is the pipeline entry point. It implements the controller-runtime manager.Runnable interface and runs as a manager-managed goroutine. Responsibilities:
| Function | Description |
|---|---|
| Start() | Starts the 30s polling loop with a cancelable context |
| poll() | Queries GetAlerts and creates Anomaly CRs |
| discoverAndConnect() | Discovers the server via Instance CRs in the cluster |
| createAnomaly() | Converts an alert into an Anomaly CR with reference labels |
| alertHash() | SHA256(type\|deployment\|namespace) for dedup |
| InvalidateDedupForResource() | Removes dedup entries for a deployment+namespace |
| sanitizeK8sName() | Ensures valid names for K8s objects (63 chars, lowercase, no special characters) |
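A minimal sketch of what sanitizeK8sName does, based on the constraints described above (63 chars, lowercase, no special characters). This is a hypothetical implementation for illustration; the real function may differ in detail:

```go
package main

import (
	"fmt"
	"strings"
)

// sanitizeK8sName normalizes an arbitrary string into a valid
// Kubernetes object name: lowercase, alphanumerics and '-' only,
// truncated to 63 characters, with no leading/trailing '-'.
func sanitizeK8sName(name string) string {
	var b strings.Builder
	for _, r := range strings.ToLower(name) {
		if (r >= 'a' && r <= 'z') || (r >= '0' && r <= '9') || r == '-' {
			b.WriteRune(r)
		} else {
			b.WriteRune('-') // replace any disallowed character
		}
	}
	s := b.String()
	if len(s) > 63 {
		s = s[:63] // DNS-1123 label length limit
	}
	return strings.Trim(s, "-")
}

func main() {
	fmt.Println(sanitizeK8sName("My_App/Frontend")) // my-app-frontend
}
```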
SHA256 Dedup:
hash = SHA256(alertType | deployment | namespace)
  • No temporal component: A continuous problem (e.g., CrashLoopBackOff) generates only one Anomaly
  • TTL: 2 hours — expired hashes are pruned automatically
  • Invalidation: When an Issue reaches a terminal state (Resolved/Escalated), dedup entries for the affected resource are invalidated, allowing immediate recurrence detection
  • Result: Avoids duplicates during an active problem; detects recurrence after resolution
Server Discovery:
  1. Lists Instance CRs in the cluster
  2. Selects the first Instance with Status.Ready=true
  3. Connects via insecure gRPC (10s timeout)
  4. If the connection fails, retries on the next poll cycle

2. AnomalyReconciler (anomaly_controller.go)

Watches Anomaly CRs and correlates them into Issues. Flow:
  1. Receives an Anomaly CR: a newly created Anomaly with Status.Correlated = false
  2. Groups anomalies: calls CorrelationEngine.FindRelatedAnomalies()
  3. Calculates the risk score and severity
  4. Creates or updates the Issue CR
  5. Marks the Anomaly as correlated: sets Correlated = true with a reference to the Issue

3. CorrelationEngine (correlation.go)

Correlation engine that groups anomalies into incidents. Correlation Algorithm:
For each new anomaly:
  1. Generates incident_id = hash(resource_kind + resource_name + namespace + signal_type)
  2. Searches for existing Issue with the same incident_id
  3. If exists -> adds anomaly to Issue, recalculates risk score
  4. If not exists -> creates new Issue
Risk Scoring:
| Signal | Weight | Justification |
|---|---|---|
| oom_kill | 30 | Indicates severe memory problem |
| error_rate | 25 | Direct impact on users |
| deploy_failing | 25 | Service unavailability |
| latency_spike | 20 | Performance degradation |
| pod_restart | 20 | Pod instability |
| pod_not_ready | 20 | Reduced capacity |
Severity Classification:
risk_score >= 80 -> Critical
risk_score >= 60 -> High
risk_score >= 40 -> Medium
risk_score <  40 -> Low
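The weights and thresholds above can be expressed directly in Go. This is a sketch reproducing the documented table, not the engine's actual code:

```go
package main

import "fmt"

// signalWeights reproduces the weight table above.
var signalWeights = map[string]int{
	"oom_kill":       30,
	"error_rate":     25,
	"deploy_failing": 25,
	"latency_spike":  20,
	"pod_restart":    20,
	"pod_not_ready":  20,
}

// riskScore sums the weights of all signals attached to an incident.
func riskScore(signals []string) int {
	total := 0
	for _, s := range signals {
		total += signalWeights[s]
	}
	return total
}

// severity applies the classification thresholds above.
func severity(score int) string {
	switch {
	case score >= 80:
		return "Critical"
	case score >= 60:
		return "High"
	case score >= 40:
		return "Medium"
	default:
		return "Low"
	}
}

func main() {
	fmt.Println(severity(riskScore([]string{"oom_kill", "pod_restart"})))               // Medium (50)
	fmt.Println(severity(riskScore([]string{"oom_kill", "pod_restart", "error_rate"}))) // High (75)
}
```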
Example: A deployment with oom_kill (30) + pod_restart (20) = risk 50 -> Medium. Adding error_rate (25) = risk 75 -> High.

Source Mapping:
| Anomaly Source | Issue Source |
|---|---|
| watcher | watcher |
| prometheus | prometheus |
| manual | manual |

4. IssueReconciler (issue_controller.go)

Manages the complete lifecycle of an Issue through a state machine. States and Transitions:
When an Issue is first reconciled:
  1. Sets detectedAt and maxRemediationAttempts (default: 3)
  2. Creates an AIInsight CR with an owner reference (Issue -> AIInsight)
  3. Transitions to Analyzing
  4. Requeues after 10 seconds
In Analyzing:
  1. Checks whether the AIInsight has Analysis populated
  2. Searches for a matching manual Runbook (findMatchingRunbook, tiered matching)
  3. If a manual Runbook is found -> createRemediationPlan() (manual has precedence)
  4. If no manual Runbook but the AIInsight has SuggestedActions -> generateRunbookFromAI() -> createRemediationPlan() using the auto-generated Runbook
  5. If neither -> createAgenticRemediationPlan() (AgenticMode=true, no pre-defined actions: the AI decides each step)
  6. Transitions to Remediating
Tiered Runbook matching (findMatchingRunbook):
  • Tier 1: SignalType + Severity + ResourceKind (exact match, preferred)
  • Tier 2: Severity + ResourceKind (fallback when the signal doesn't match)
  • SignalType resolved from issue.Spec.SignalType, falling back to issue.Labels["platform.chatcli.io/signal"]
Auto-generated Runbook (generateRunbookFromAI):
  • Materializes the AI's SuggestedActions as a reusable Runbook CR
  • Name: auto-{signal}-{severity}-{kind} (sanitized)
  • Labels: platform.chatcli.io/auto-generated=true
  • Trigger: SignalType + Severity + ResourceKind (for future reuse)
  • Uses CreateOrUpdate for idempotency
In Remediating:
  1. Finds the most recent RemediationPlan (findLatestRemediationPlan)
  2. If Completed -> Issue Resolved + invalidates dedup for the resource
    • If it was an agentic plan: generates a PostMortem CR (timeline, root cause, impact, lessons) plus a reusable Runbook from the successful steps
  3. If Failed with attempts remaining -> re-analysis: collects failure evidence (collectFailureEvidence), clears the AIInsight analysis, and returns to Analyzing with the failure context
  4. If Failed at max attempts -> Escalated + invalidates dedup for the resource
Retry with Strategy Escalation:
  • Each retry triggers AI re-analysis with context from previous failures
  • AI receives previous_failure_context with evidence from failed attempts
  • The prompt instructs: “Do not repeat the same actions. Analyze why they failed and suggest a fundamentally different approach”
  • Generates new auto-generated Runbook with different strategy (name includes attempt)
Remediation Priority:
1. Existing manual Runbook (tiered match: SignalType+Severity+Kind -> Severity+Kind)
2. AI auto-generated Runbook (materialized as reusable CR)
3. Escalation (last resort)
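The tiered matching that drives this priority order can be sketched as plain lookup logic. The Runbook struct here is a simplified stand-in for the CR's trigger fields, not the actual API type:

```go
package main

import "fmt"

// Runbook is a simplified stand-in for the Runbook CR trigger fields.
type Runbook struct {
	Name         string
	SignalType   string
	Severity     string
	ResourceKind string
}

// findMatchingRunbook sketches the tiered matching described above:
// Tier 1 requires SignalType+Severity+ResourceKind; Tier 2 falls back
// to Severity+ResourceKind when no signal-exact match exists.
func findMatchingRunbook(books []Runbook, signal, severity, kind string) *Runbook {
	for i := range books { // Tier 1: exact match, preferred
		b := &books[i]
		if b.SignalType == signal && b.Severity == severity && b.ResourceKind == kind {
			return b
		}
	}
	for i := range books { // Tier 2: fallback without the signal
		b := &books[i]
		if b.Severity == severity && b.ResourceKind == kind {
			return b
		}
	}
	return nil // no manual Runbook: fall through to AI-generated plan
}

func main() {
	books := []Runbook{
		{Name: "generic-high-deploy", Severity: "High", ResourceKind: "Deployment"},
		{Name: "oom-high-deploy", SignalType: "oom_kill", Severity: "High", ResourceKind: "Deployment"},
	}
	fmt.Println(findMatchingRunbook(books, "oom_kill", "High", "Deployment").Name)
	fmt.Println(findMatchingRunbook(books, "latency_spike", "High", "Deployment").Name)
}
```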

5. AIInsightReconciler (aiinsight_controller.go)

Watches AIInsight CRs and calls the AnalyzeIssue RPC to populate the analysis. Flow:
  1. Checks whether Status.Analysis is already populated (skip if yes)
  2. Checks whether the server is connected (requeue 15s if not)
  3. Fetches the parent Issue for context
  4. Collects K8s context via KubernetesContextBuilder (deployment, pods, events, revisions)
  5. Reads the failure context from the annotation platform.chatcli.io/failure-context (if re-analysis)
  6. Builds the AnalyzeIssueRequest with Issue data + K8s context + failure context
  7. Calls the AnalyzeIssue RPC via ServerClient
  8. Populates Status.Analysis, Confidence, Recommendations, and SuggestedActions; clears the failure-context annotation after re-analysis completes
KubernetesContextBuilder (k8s_context.go): Collects 4 sections of real cluster context (max 8000 chars):
  • Deployment Status: replicas (desired/ready/updated/unavailable), conditions, container images + resources
  • Pod Details (up to 5 pods, unhealthy first): phase, restart count, container states (Waiting/Terminated with reason + exit code)
  • Recent Events (last 15): type, reason, message, count
  • Revision History: Last 5 revisions (ReplicaSets) with image diff between revisions
AnalyzeIssueRequest:
| Field | Source | Description |
|---|---|---|
| issue_name | Issue.Name | Issue name |
| namespace | Issue.Namespace | Namespace |
| resource_kind | Issue.Spec.Resource.Kind | Resource type (Deployment) |
| resource_name | Issue.Spec.Resource.Name | Deployment name |
| signal_type | Issue.Spec.SignalType / labels | Signal type |
| severity | Issue.Spec.Severity | Severity |
| description | Issue.Spec.Description | Problem description |
| risk_score | Issue.Spec.RiskScore | Risk score |
| provider | AIInsight.Spec.Provider | LLM provider |
| model | AIInsight.Spec.Model | LLM model |
| kubernetes_context | KubernetesContextBuilder | Deployment status, pods, events, revisions |
| previous_failure_context | Annotation on AIInsight | Evidence from previous attempts (retries) |

6. RemediationReconciler (remediation_controller.go)

Executes the actions defined in a RemediationPlan. Supported Actions:
| Type | What It Does | Parameters |
|---|---|---|
| ScaleDeployment | kubectl scale deployment/<name> --replicas=N | replicas (required) |
| RestartDeployment | kubectl rollout restart deployment/<name> | (none) |
| RollbackDeployment | Rollback to the previous, healthy, or a specific revision | toRevision (optional: previous, healthy, number) |
| PatchConfig | Updates key(s) in a ConfigMap | configmap, key=value |
| AdjustResources | Adjusts CPU/memory requests/limits | memory_limit, memory_request, cpu_limit, cpu_request, container |
| DeletePod | Removes the sickest pod (CrashLoop > restarts) | pod (optional; auto-selects) |
| Custom | Blocked; requires manual approval | (none) |
Safety Checks: Scale to 0 replicas is blocked. AdjustResources limit cannot be less than request. DeletePod refuses to delete if only 1 pod exists (prevents total outage). Custom actions are blocked. A pre-flight snapshot records the previous state for reference.
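The safety checks above amount to a small validation gate before execution. This is a hypothetical helper (the real controller enforces these rules inside each action handler, and resource quantities are K8s quantities rather than plain ints):

```go
package main

import (
	"errors"
	"fmt"
)

// validateAction sketches the safety checks listed above.
// params carries numeric parameters (e.g. replicas, memory in Mi).
func validateAction(actionType string, params map[string]int, podCount int) error {
	switch actionType {
	case "ScaleDeployment":
		if params["replicas"] == 0 {
			return errors.New("scale to 0 replicas is blocked")
		}
	case "AdjustResources":
		if params["memory_limit"] < params["memory_request"] {
			return errors.New("limit cannot be less than request")
		}
	case "DeletePod":
		if podCount <= 1 {
			return errors.New("refusing to delete the only pod (prevents total outage)")
		}
	case "Custom":
		return errors.New("Custom actions are blocked: manual approval required")
	}
	return nil // action passes the pre-flight gate
}

func main() {
	fmt.Println(validateAction("ScaleDeployment", map[string]int{"replicas": 0}, 3))
	fmt.Println(validateAction("DeletePod", nil, 1))
}
```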
Execution Flow (Standard):
Pending -> Executing -> (executes actions sequentially) -> Verifying -> Completed | Failed
Execution Flow (Agentic):
Pending -> Executing -> (agentic loop: AI decides -> executes -> observes -> repeat)
  -> Verifying -> Completed | Failed

Each reconcile = 1 step of the agentic loop:
  1. Refresh K8s context (KubernetesContextBuilder)
  2. Send history + context -> AgenticStep RPC
  3. AI responds: {reasoning, resolved, next_action}
  4. If resolved=true -> Verifying (+ annotations with PostMortem data)
  5. If next_action -> execute -> record observation -> requeue 5s
  6. If observation-only -> record -> requeue 10s
  Safety: max 10 steps, timeout 10 minutes

7. ServerClient (grpc_client.go)

Shared gRPC client between WatcherBridge and AIInsightReconciler.
| Method | Description |
|---|---|
| NewServerClient() | Creates an instance (no connection) |
| Connect(addr) | Connects via insecure gRPC (10s timeout) |
| GetAlerts(namespace) | Fetches alerts from the watcher |
| AnalyzeIssue(req) | Sends an issue for AI analysis |
| AgenticStep(req) | Executes one step of the agentic loop (context + history -> next action) |
| IsConnected() | Checks if the connection is active |
| Close() | Closes the gRPC connection |

Server and Operator Interaction

GetAlerts RPC

The server exposes K8s Watcher alerts via gRPC:
rpc GetAlerts(GetAlertsRequest) returns (GetAlertsResponse);

message AlertInfo {
  string alert_type = 1;    // HighRestartCount, OOMKilled, PodNotReady, DeploymentFailing
  string deployment = 2;
  string namespace = 3;
  string message = 4;
  string severity = 5;
  int64 timestamp = 6;
}
The server handler iterates over the ObservabilityStore of each MultiWatcher target, filters by namespace if specified, and returns active alerts.

AnalyzeIssue RPC

The server receives the Issue context and calls the LLM for analysis:
rpc AnalyzeIssue(AnalyzeIssueRequest) returns (AnalyzeIssueResponse);

message SuggestedAction {
  string name = 1;
  string action = 2;
  string description = 3;
  map<string, string> params = 4;
}

message AnalyzeIssueResponse {
  string analysis = 1;
  float confidence = 2;
  repeated string recommendations = 3;
  string provider = 4;
  string model = 5;
  repeated SuggestedAction suggested_actions = 6;
}
Structured Prompt: The server builds a prompt that includes:
  1. Issue context (name, namespace, resource, severity, risk score, description)
  2. List of available actions (ScaleDeployment, RestartDeployment, RollbackDeployment, PatchConfig)
  3. Instructions to return structured JSON with analysis, confidence, recommendations, and actions fields
Response Parsing:
  1. Removes markdown codeblocks (```json ... ```)
  2. Parses JSON into analysisResult
  3. Clamps confidence between 0.0 and 1.0
  4. If parsing fails -> uses raw response as analysis with confidence 0.5
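The parsing steps above can be sketched as follows. The analysisResult struct and parseAnalysis helper are illustrative; the server's actual fence stripping may be more robust:

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

type analysisResult struct {
	Analysis        string   `json:"analysis"`
	Confidence      float64  `json:"confidence"`
	Recommendations []string `json:"recommendations"`
}

// parseAnalysis strips markdown code fences, unmarshals the JSON,
// clamps confidence to [0.0, 1.0], and falls back to the raw response
// with confidence 0.5 when parsing fails.
func parseAnalysis(raw string) analysisResult {
	s := strings.TrimSpace(raw)
	s = strings.TrimPrefix(s, "```json")
	s = strings.TrimPrefix(s, "```")
	s = strings.TrimSuffix(strings.TrimSpace(s), "```")

	var res analysisResult
	if err := json.Unmarshal([]byte(s), &res); err != nil {
		return analysisResult{Analysis: raw, Confidence: 0.5}
	}
	if res.Confidence < 0 {
		res.Confidence = 0
	} else if res.Confidence > 1 {
		res.Confidence = 1
	}
	return res
}

func main() {
	out := parseAnalysis("```json\n{\"analysis\":\"OOM due to low limit\",\"confidence\":1.4}\n```")
	fmt.Println(out.Analysis, out.Confidence)
}
```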

AgenticStep RPC

The server receives the Issue context, history of previous steps, and updated K8s context, and decides the next action:
rpc AgenticStep(AgenticStepRequest) returns (AgenticStepResponse);

message AgenticStepRequest {
  string issue_name = 1;
  string namespace = 2;
  string resource_kind = 3;
  string resource_name = 4;
  string signal_type = 5;
  string severity = 6;
  string description = 7;
  int32 risk_score = 8;
  string provider = 9;
  string model = 10;
  string kubernetes_context = 11;   // refreshed at each step
  repeated AgenticHistoryEntry history = 12;
  int32 max_steps = 13;
  int32 current_step = 14;
}

message AgenticStepResponse {
  string reasoning = 1;              // AI reasoning (recorded in history)
  bool resolved = 2;                 // true = problem resolved
  SuggestedAction next_action = 3;   // null when resolved=true
  // Fields below only populated when resolved=true:
  string postmortem_summary = 4;
  string root_cause = 5;
  string impact = 6;
  repeated string lessons_learned = 7;
  repeated string prevention_actions = 8;
}
AgenticStep Prompt: The server builds a structured prompt with:
  1. Role + Issue details: incident context (type, severity, resource)
  2. Kubernetes context: real cluster state (refreshed at each step via KubernetesContextBuilder)
  3. Tool definitions: 6 available mutating actions + “Observe” (no action, wait for next context)
  4. Conversation history: each previous step formatted with reasoning -> action -> observation
  5. Instructions: respond JSON, budget (step N of M), safety rules
When resolved=true, the response includes data for PostMortem generation (summary, root_cause, impact, lessons_learned, prevention_actions).

PostMortem Generation

When an agentic remediation resolves an Issue, the IssueReconciler automatically generates:

PostMortem CR

Created via generatePostMortem():
| Field | Source |
|---|---|
| timeline | Issue.DetectedAt + each step from AgenticHistory + resolved |
| actionsExecuted | Steps with Action != nil (includes result) |
| summary | Annotation platform.chatcli.io/postmortem-summary (AI-generated) |
| rootCause | Annotation platform.chatcli.io/root-cause |
| impact | Annotation platform.chatcli.io/impact |
| lessonsLearned | Annotation platform.chatcli.io/lessons-learned |
| preventionActions | Annotation platform.chatcli.io/prevention-actions |
| duration | Calculated: resolvedAt - detectedAt |
The PostMortem CR is owned by the Issue (cascade delete).

Auto-generated Runbook (Agentic)

Created via generateAgenticRunbook():
  • Name: agentic-{signal}-{severity}-{kind} (sanitized)
  • Steps: only steps with successful actions
  • Labels: auto-generated=true, source=agentic
  • Uses CreateOrUpdate (reused for future incidents of the same type)

Operator Prometheus Metrics

The operator exposes Prometheus metrics for observability:
| Metric | Type | Description |
|---|---|---|
| chatcli_operator_issues_total | Counter | Total issues by severity and state |
| chatcli_operator_issue_resolution_duration_seconds | Histogram | Duration from detection to resolution |
| chatcli_operator_active_issues | Gauge | Number of unresolved issues |

Tests

The operator has 96 tests (125 with subtests) covering all components:
| Component | Tests | Coverage |
|---|---|---|
| InstanceReconciler | 15 | CRUD, watcher, persistence, replicas, RBAC, deletion, deepcopy |
| AnomalyReconciler | 4 | Creation, correlation, attachment to existing Issue |
| IssueReconciler | 12 | State machine, AI fallback, retry, agentic plan, PostMortem generation |
| RemediationReconciler | 16 | All action types, safety checks, agentic loop (first step, resolved, max steps, timeout, action failed, observation) |
| AIInsightReconciler | 12 | Connectivity, mock RPC, analysis parsing, withAuth, TLS/token |
| PostMortemReconciler | 2 | State initialization, terminal state |
| WatcherBridge | 22 | Alert mapping, SHA256 dedup, hash, pruning, Anomaly creation, buildConnectionOpts (TLS, token, both) |
| CorrelationEngine | 4 | Risk scoring, severity, incident ID, related anomalies |
| Pipeline (E2E) | 3 | Complete flow: Anomaly->Issue->Insight->Plan->Resolved, escalation, correlation |
| MapActionType | 6 | All string->enum mappings |

Run Tests

cd operator
go test ./... -v

Ownership Diagram (Garbage Collection)

  • Instance is the owner of all Kubernetes resources it creates (Deployment, Service, ConfigMap, SA, PVC)
  • Issue is the owner of AIInsight, RemediationPlan, and PostMortem (cascade delete)
  • Anomalies are independent (no owner) to preserve history

AIOps Deployment Checklist

  1. Install CRDs: kubectl apply -f operator/config/crd/bases/
  2. Install operator RBAC: kubectl apply -f operator/config/rbac/role.yaml
  3. Deploy the operator: kubectl apply -f operator/config/manager/manager.yaml
  4. Create the Secret with the API keys for your chosen LLM provider
  5. Create the Instance CR with watcher.enabled: true and configured targets
  6. Verify the server: kubectl get instances and confirm that the ChatCLI server is running
  7. Verify the AIOps pipeline:
    • kubectl get anomalies -A (anomalies being detected)
    • kubectl get issues -A (issues being created)
    • kubectl get aiinsights -A (AI analyzing)
  8. (Optional) Create manual Runbooks for specific scenarios
  9. Monitor operator metrics via Prometheus
