Manage ChatCLI instances and an autonomous AIOps platform on Kubernetes with 6 CRDs, anomaly correlation, AI analysis, and automatic remediation.
The ChatCLI Operator goes beyond instance management. It implements a complete AIOps platform that autonomously detects anomalies, correlates signals, requests AI analysis, and executes remediation — all without external dependencies beyond the LLM provider.
| Stage | Component | Description |
| --- | --- | --- |
| 1. Detection | WatcherBridge | Queries GetAlerts from the server every 30s. Creates Anomaly CRs (deduplicated by SHA256 hash). Invalidates dedup entries when the Issue reaches a terminal state. |
| 2. Correlation | AnomalyReconciler + CorrelationEngine | Groups anomalies by resource and time window. Calculates risk score and severity. Creates or updates Issue CRs with signalType. |
| 3. Analysis | AIInsightReconciler + KubernetesContextBuilder | Collects real K8s context (deployment, pods, events, revisions). Calls the AnalyzeIssue RPC with the enriched context. |
| 4. Remediation | IssueReconciler | Runbook-first: (a) uses a matching manual Runbook (tiered matching), (b) generates an auto Runbook from the AI analysis, or (c) falls back to agentic remediation (AI acts step by step). |
| 5. Execution | RemediationReconciler | Executes actions on the cluster: ScaleDeployment, RestartDeployment, RollbackDeployment, PatchConfig, AdjustResources, DeletePod. In agentic mode, the AI decides each action via an observe-decide-act loop. |
| 6. Resolution | IssueReconciler | Success -> Resolved (invalidates dedup). Failure -> re-analysis with failure context (different strategy) -> up to maxAttempts -> Escalated. |
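
The resolution rule in the last stage can be read as a small state transition. The sketch below is illustrative only: the `IssueState` type, its values other than Resolved and Escalated, and the `nextState` helper are hypothetical names, not the operator's actual API.

```go
package main

import "fmt"

// IssueState is a hypothetical type for this sketch; the real CRD may
// model its phases differently.
type IssueState string

const (
	StateAnalyzing IssueState = "Analyzing" // assumed name for the re-analysis phase
	StateResolved  IssueState = "Resolved"
	StateEscalated IssueState = "Escalated"
)

// nextState applies the rule described above: success resolves the issue,
// failure sends it back to analysis until maxAttempts is reached, after
// which the issue is escalated.
func nextState(succeeded bool, attempt, maxAttempts int) IssueState {
	if succeeded {
		return StateResolved
	}
	if attempt >= maxAttempts {
		return StateEscalated
	}
	return StateAnalyzing
}

func main() {
	fmt.Println(nextState(false, 1, 3)) // first failure: re-analyze
	fmt.Println(nextState(false, 3, 3)) // attempts exhausted: escalate
	fmt.Println(nextState(true, 2, 3))  // success on retry: resolve
}
```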
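gRPC multiplexes requests over persistent HTTP/2 connections. Because kube-proxy balances per connection rather than per request, a client behind a standard ClusterIP Service stays pinned to a single pod, leaving extra replicas idle.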
1 replica (default): Standard ClusterIP Service
Multiple replicas: Headless Service (ClusterIP: None) is created automatically, enabling client-side round-robin via gRPC dns:/// resolver
Keepalive: WatcherBridge pings every 30s (5s timeout) to detect inactive pods quickly. The server accepts pings with a minimum interval of 20s (EnforcementPolicy.MinTime)
Transition: When scaling from 1 to 2+ replicas (or back), the operator deletes and recreates the Service automatically (ClusterIP is immutable in Kubernetes)
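
As a rough illustration of the client side, the sketch below builds the two values a gRPC Go client would typically need for round-robin over a headless Service: a `dns:///` target and a service config selecting the `round_robin` balancer. The service name, namespace, port, and the `cluster.local` domain are assumptions for the example; only the standard library is used so the snippet runs without the gRPC module.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// usesHeadless mirrors the rule above: more than one replica means a
// headless Service. Hypothetical helper, not the operator's code.
func usesHeadless(replicas int) bool {
	return replicas > 1
}

// dialTarget builds a target with the dns:/// scheme so the gRPC resolver
// returns every pod IP behind the headless Service.
// The cluster.local suffix is the common default and is assumed here.
func dialTarget(service, namespace string, port int) string {
	return fmt.Sprintf("dns:///%s.%s.svc.cluster.local:%d", service, namespace, port)
}

// roundRobinServiceConfig returns the JSON service config that selects
// gRPC's built-in round_robin load-balancing policy.
func roundRobinServiceConfig() (string, error) {
	cfg := map[string]any{
		"loadBalancingConfig": []map[string]any{
			{"round_robin": map[string]any{}},
		},
	}
	b, err := json.Marshal(cfg)
	return string(b), err
}

func main() {
	cfg, err := roundRobinServiceConfig()
	if err != nil {
		panic(err)
	}
	// Service name, namespace, and port are placeholders.
	fmt.Println("headless:", usesHeadless(3))
	fmt.Println("target:  ", dialTarget("chatcli", "production", 50051))
	fmt.Println("config:  ", cfg)
}
```

In a real client these values would be passed to the gRPC dial call together with a default service config option and keepalive parameters matching the 30s ping and 5s timeout described above.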
The operator monitors changes in ConfigMaps and Secrets referenced by the Instance and triggers rolling updates automatically via hash annotations on the PodTemplate:
| Annotation | Source | When It Changes |
| --- | --- | --- |
| `chatcli.io/watch-config-hash` | ConfigMap `<name>-watch-config` | Watcher targets changed |
| `chatcli.io/configmap-hash` | ConfigMap `<name>` | Environment variables updated |
| `chatcli.io/secret-hash` | Secret referenced in `apiKeys.name` | API keys created or updated |
| `chatcli.io/tls-hash` | Secret referenced in `server.tls.secretName` | TLS certificates renewed |
Adding/removing targets in watcher.targets and applying the Instance causes automatic rollout. Creating or updating the API keys Secret and renewing TLS certificates also trigger rollout automatically.
The operator registers a watch (Watches) on Secrets in the Instance namespace. When a Secret referenced in apiKeys.name or server.tls.secretName is created or updated, the reconciler is triggered automatically, even if the Secret did not exist when the Instance was created.
ConfigMap and Secret envFrom: Marked as optional: true, allowing the Instance to be created before the Secret/ConfigMap
Flexible deploy order: Namespace -> Instance -> Secret/ConfigMap (any order after the namespace)
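
The hash-annotation technique works because any change to the PodTemplate, including an annotation value, causes the Deployment controller to roll the pods. A minimal sketch of the idea follows. The `hashData` helper and its field-separator scheme are assumptions made for illustration; the operator's actual hashing details are not documented here.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
)

// hashData returns a deterministic SHA-256 digest of a key/value map such
// as ConfigMap data. Keys are sorted so Go's random map iteration order
// cannot change the result.
func hashData(data map[string]string) string {
	keys := make([]string, 0, len(data))
	for k := range data {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	h := sha256.New()
	for _, k := range keys {
		// A zero byte separates fields so that ("ab","c") and ("a","bc")
		// do not produce the same digest.
		h.Write([]byte(k))
		h.Write([]byte{0})
		h.Write([]byte(data[k]))
		h.Write([]byte{0})
	}
	return hex.EncodeToString(h.Sum(nil))
}

func main() {
	before := hashData(map[string]string{"LOG_LEVEL": "info"})
	after := hashData(map[string]string{"LOG_LEVEL": "debug"})

	// Writing the new hash into the PodTemplate annotations changes the
	// template, which is what triggers the rolling update.
	annotations := map[string]string{"chatcli.io/configmap-hash": after}

	fmt.Println("rollout needed:", before != after)
	fmt.Println("annotation length:", len(annotations["chatcli.io/configmap-hash"]))
}
```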
Runbooks are operational procedures. Manual Runbooks take priority over every other remediation path. When no manual Runbook matches, the AI automatically generates a reusable Runbook CR from its suggested actions.
Manual Runbook

```yaml
apiVersion: platform.chatcli.io/v1alpha1
kind: Runbook
metadata:
  name: high-error-rate-deployment
  namespace: production
spec:
  description: "Standard procedure for high error rate incidents on Deployments"
  trigger:
    signalType: error_rate
    severity: high
    resourceKind: Deployment
  steps:
    - name: Scale up
      action: ScaleDeployment
      description: "Increase replicas to absorb the error spike"
      params:
        replicas: "4"
    - name: Rollback
      action: RollbackDeployment
      description: "Revert to previous stable version if scaling doesn't help"
  maxAttempts: 3
```
When there is no manual Runbook or AI-suggested actions, the operator creates an agentic plan. The AI acts as an agent with Kubernetes skills in an observe-decide-act loop:
Safety Guards: Maximum of 10 steps (configurable via agenticMaxSteps), timeout of 10 minutes. If an action fails, the observation reports “FAILED: error” and the loop continues — the AI receives the feedback and adapts.
On agentic resolution: The operator automatically generates:
PostMortem CR with timeline, root cause, impact, lessons learned
Reusable Runbook CR with successful steps (label source=agentic)
A PostMortem is an incident report generated automatically after an issue is resolved by agentic remediation. It contains the complete incident history: detection, analysis, executed actions, and resolution.
```yaml
apiVersion: platform.chatcli.io/v1alpha1
kind: PostMortem
metadata:
  name: pm-api-gateway-pod-restart-1771276354
  namespace: production
spec:
  issueRef:
    name: api-gateway-pod-restart-1771276354
  resource:
    kind: Deployment
    name: api-gateway
    namespace: production
  severity: high
status:
  state: Open  # Open | InReview | Closed
  summary: "OOMKilled containers caused cascading restarts on api-gateway"
  rootCause: "Memory limit (512Mi) insufficient for current workload pattern"
  impact: "Service degradation for 5 minutes, 30% error rate increase"
  timeline:
    - timestamp: "2026-02-16T10:30:00Z"
      type: detected
      detail: "Issue detected: pod_restart on api-gateway"
    - timestamp: "2026-02-16T10:31:00Z"
      type: action_executed
      detail: "ScaleDeployment to 5 replicas"
    - timestamp: "2026-02-16T10:31:35Z"
      type: action_executed
      detail: "AdjustResources memory_limit=1Gi"
    - timestamp: "2026-02-16T10:32:10Z"
      type: resolved
      detail: "All pods stable, issue resolved"
  lessonsLearned:
    - "Memory limits should account for peak workload patterns"
    - "Set up HPA to auto-scale on memory pressure"
  preventionActions:
    - "Configure HPA with min 3 replicas for api-gateway"
    - "Set memory limit to 1Gi across all environments"
  duration: "2m10s"
  generatedAt: "2026-02-16T10:32:10Z"
```
1. Existing manual Runbook (tiered match)
2. AI auto-generated Runbook (materialized as reusable CR)
3. Agentic AI remediation (observe-decide-act loop, generates PostMortem + Runbook)
4. Escalation (only when agentic fails after max attempts)
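
This priority order amounts to a simple fall-through decision. The sketch below is an assumption-laden illustration: the `Strategy` type, its values, and the `chooseStrategy` function are hypothetical names, and the real reconciler may track more state than three booleans.

```go
package main

import "fmt"

// Strategy is a hypothetical enum naming the four remediation paths.
type Strategy string

const (
	ManualRunbook Strategy = "ManualRunbook"
	AutoRunbook   Strategy = "AutoRunbook"
	Agentic       Strategy = "Agentic"
	Escalate      Strategy = "Escalate"
)

// chooseStrategy walks the priority list from top to bottom and returns
// the first path that applies.
func chooseStrategy(hasManualRunbook, hasAIActions, agenticExhausted bool) Strategy {
	switch {
	case hasManualRunbook:
		return ManualRunbook // a manual Runbook always wins
	case hasAIActions:
		return AutoRunbook // AI suggestions are materialized as a Runbook CR
	case !agenticExhausted:
		return Agentic // observe-decide-act loop
	default:
		return Escalate // only after agentic attempts are used up
	}
}

func main() {
	fmt.Println(chooseStrategy(true, true, false))
	fmt.Println(chooseStrategy(false, true, false))
	fmt.Println(chooseStrategy(false, false, false))
	fmt.Println(chooseStrategy(false, false, true))
}
```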
The WatcherBridge is the component that connects the ChatCLI server to the operator:
Polling: Queries GetAlerts from the server every 30 seconds
Discovery: Locates the server via Instance CRs (first Instance with a ready gRPC endpoint)
Dedup: SHA256 hash of type+deployment+namespace (no temporal component — a continuous problem generates only one Anomaly). 2-hour TTL
Dedup invalidation: When an Issue reaches a terminal state (Resolved/Escalated), dedup entries for the resource are removed, allowing immediate recurrence detection
```bash
cd operator

# Build
go build ./...

# Tests (96 functions, 125 with subtests)
go test ./... -v

# Docker (must be built from the repository root)
docker build -f operator/Dockerfile -t myregistry/chatcli-operator:dev .

# Install CRDs in the cluster
kubectl apply -f config/crd/bases/

# Deploy the operator
make deploy IMG=myregistry/chatcli-operator:dev
```