The ChatCLI Operator goes beyond instance management. It implements a complete AIOps platform that autonomously detects anomalies, correlates signals, requests AI analysis, and executes remediation — all without external dependencies beyond the LLM provider.

API Group and CRDs

The operator uses the API group platform.chatcli.io/v1alpha1 with 7 Custom Resource Definitions:
| CRD | Short Name | Description |
| --- | --- | --- |
| Instance | inst | ChatCLI server instance (Deployment, Service, RBAC, PVC) |
| Anomaly | anom | Raw signal from the K8s Watcher (restarts, OOM, deploy failures) |
| Issue | iss | Correlated incident grouping multiple anomalies |
| AIInsight | ai | AI-generated root cause analysis with suggested actions |
| RemediationPlan | rp | Concrete actions to resolve the problem (runbook or agentic AI) |
| Runbook | rb | Manual operational procedures (optional) |
| PostMortem | pm | Auto-generated incident report after agentic resolution |

Operator Installation

1. Install CRDs

kubectl apply -f operator/config/crd/bases/

2. Install RBAC and Manager

kubectl apply -f operator/config/rbac/role.yaml
kubectl apply -f operator/config/manager/manager.yaml
cd operator
make docker-build IMG=ghcr.io/diillson/chatcli-operator:latest
make docker-push IMG=ghcr.io/diillson/chatcli-operator:latest

AIOps Platform Architecture

Autonomous Pipeline

| Phase | Component | What It Does |
| --- | --- | --- |
| 1. Detection | WatcherBridge | Queries GetAlerts from the server every 30s. Creates Anomaly CRs (SHA-256 dedup). Invalidates dedup when the Issue reaches a terminal state. |
| 2. Correlation | AnomalyReconciler + CorrelationEngine | Groups anomalies by resource + time window. Calculates risk score and severity. Creates/updates Issue CRs with signalType. |
| 3. Analysis | AIInsightReconciler + KubernetesContextBuilder | Collects real K8s context (deployment, pods, events, revisions). Calls the AnalyzeIssue RPC with the enriched context. |
| 4. Remediation | IssueReconciler | Runbook-first: (a) manual Runbook (tiered matching), (b) auto-generated Runbook from AI, or (c) agentic remediation (AI acts step by step). |
| 5. Execution | RemediationReconciler | Executes actions on the cluster: ScaleDeployment, RestartDeployment, RollbackDeployment, PatchConfig, AdjustResources, DeletePod. Agentic mode: the AI decides each action via an observe-decide-act loop. |
| 6. Resolution | IssueReconciler | Success -> Resolved (invalidates dedup). Failure -> re-analysis with failure context (different strategy) -> up to maxAttempts -> Escalated. |
| 7. PostMortem | IssueReconciler | Agentic resolution -> auto-generated PostMortem CR (timeline, root cause, lessons learned) + reusable Runbook from the successful steps. |

Issue State Machine

Detected -> Analyzing -> Remediating -> Resolved. A failed remediation returns the Issue to Analyzing with failure context for a new strategy; after maxRemediationAttempts it transitions to Escalated.
CRD: Instance

The Instance manages ChatCLI server instances in the cluster.

Complete Specification

apiVersion: platform.chatcli.io/v1alpha1
kind: Instance
metadata:
  name: chatcli-prod
  namespace: chatcli          # The namespace must exist before creating the Instance
spec:
  replicas: 1
  provider: CLAUDEAI       # OPENAI, CLAUDEAI, GOOGLEAI, XAI, STACKSPOT, OLLAMA, COPILOT
  model: claude-sonnet-4-5

  image:
    repository: ghcr.io/diillson/chatcli
    tag: latest
    pullPolicy: IfNotPresent

  server:
    port: 50051
    tls:
      enabled: true
      secretName: chatcli-tls
    token:
      name: chatcli-auth
      key: token

  watcher:
    enabled: true
    interval: "30s"
    window: "2h"
    maxLogLines: 100
    maxContextChars: 32000
    targets:
      - deployment: api-gateway
        namespace: production
        metricsPort: 9090
        metricsFilter: ["http_requests_*", "http_request_duration_*"]
      - deployment: auth-service
        namespace: production
        metricsPort: 9090
      - deployment: worker
        namespace: batch

  resources:
    requests:
      cpu: 100m
      memory: 128Mi
    limits:
      cpu: 500m
      memory: 512Mi

  persistence:
    enabled: true
    size: 1Gi
    storageClassName: standard

  securityContext:
    runAsNonRoot: true
    runAsUser: 1000
    seccompProfile:
      type: RuntimeDefault

  apiKeys:
    name: chatcli-api-keys

Spec Fields

Root

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| replicas | int32 | No | 1 | Number of server replicas |
| provider | string | Yes | — | LLM provider |
| model | string | No | — | LLM model |
| image | ImageSpec | No | — | Image configuration |
| server | ServerSpec | No | — | gRPC server configuration |
| watcher | WatcherSpec | No | — | K8s Watcher configuration |
| resources | ResourceRequirements | No | — | CPU and memory requests/limits |
| persistence | PersistenceSpec | No | — | Session persistence |
| securityContext | PodSecurityContext | No | nonroot/1000 | Pod security context |
| apiKeys | SecretRefSpec | No | — | Secret with API keys |

WatcherSpec

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| enabled | bool | No | false | Enables the watcher |
| targets | []WatchTargetSpec | No | — | List of deployments (multi-target) |
| deployment | string | No | — | Single deployment (legacy) |
| namespace | string | No | — | Deployment namespace (legacy) |
| interval | string | No | "30s" | Collection interval |
| window | string | No | "2h" | Observation window |
| maxLogLines | int32 | No | 100 | Max log lines per pod |
| maxContextChars | int32 | No | 32000 | LLM context budget |

WatchTargetSpec

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| deployment | string | Yes | — | Deployment name |
| namespace | string | Yes | — | Deployment namespace |
| metricsPort | int32 | No | 0 | Prometheus port (0 = disabled) |
| metricsPath | string | No | /metrics | Prometheus endpoint path |
| metricsFilter | []string | No | — | Glob filters for metrics |

Resources Created by Instance

| Resource | Name | Description |
| --- | --- | --- |
| Deployment | <name> | ChatCLI server pods |
| Service | <name> | gRPC Service (automatically headless when replicas > 1, for client-side LB) |
| ConfigMap | <name> | Environment variables (provider, model, etc.) |
| ConfigMap | <name>-watch-config | Multi-target YAML (if targets defined) |
| ServiceAccount | <name> | Identity for RBAC |
| Role/ClusterRole | <name>-watcher | K8s watcher permissions |
| RoleBinding/ClusterRoleBinding | <name>-watcher | Binds the ServiceAccount to the Role |
| PVC | <name>-sessions | Persistence (if enabled) |

gRPC Load Balancing

gRPC uses long-lived HTTP/2 connections, and kube-proxy balances per connection rather than per request, so a single client connection pins to one pod and leaves extra replicas idle.
  • 1 replica (default): Standard ClusterIP Service
  • Multiple replicas: Headless Service (ClusterIP: None) is created automatically, enabling client-side round-robin via gRPC dns:/// resolver
  • Keepalive: WatcherBridge pings every 30s (5s timeout) to detect inactive pods quickly. The server accepts pings with a minimum interval of 20s (EnforcementPolicy.MinTime)
  • Transition: When scaling from 1 to 2+ replicas (or back), the operator deletes and recreates the Service automatically (ClusterIP is immutable in Kubernetes)
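
For reference, the headless Service created for a multi-replica Instance looks roughly like the sketch below. The label selector and port name are assumptions; the operator's actual manifest may differ:

```yaml
# Illustrative sketch of the auto-created headless Service (replicas >= 2).
apiVersion: v1
kind: Service
metadata:
  name: chatcli-prod
  namespace: chatcli
spec:
  clusterIP: None          # headless: DNS returns all ready pod IPs
  selector:
    app: chatcli-prod      # assumed label; the operator's selector may differ
  ports:
    - name: grpc
      port: 50051
      targetPort: 50051
```

Clients can then dial dns:///chatcli-prod.chatcli.svc.cluster.local:50051 and the gRPC DNS resolver round-robins across all pod IPs.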

Automatic RBAC

  • Single-namespace (all targets in the same namespace): Creates Role + RoleBinding
  • Multi-namespace (targets in different namespaces): Creates ClusterRole + ClusterRoleBinding automatically
  • On CR deletion, cluster-scoped resources are cleaned up by the finalizer

Auto-Rollout on Configuration Changes

The operator monitors changes in ConfigMaps and Secrets referenced by the Instance and triggers rolling updates automatically via hash annotations on the PodTemplate:
| Annotation | Source | When It Changes |
| --- | --- | --- |
| chatcli.io/watch-config-hash | ConfigMap <name>-watch-config | Watcher targets changed |
| chatcli.io/configmap-hash | ConfigMap <name> | Environment variables updated |
| chatcli.io/secret-hash | Secret referenced in apiKeys.name | API keys created or updated |
| chatcli.io/tls-hash | Secret referenced in server.tls.secretName | TLS certificates renewed |
Adding or removing targets in watcher.targets and re-applying the Instance triggers an automatic rollout. Creating or updating the API keys Secret, or renewing TLS certificates, triggers a rollout as well.
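
The mechanism can be sketched in a few lines. The annotation key matches the table above; the canonical-JSON hashing helper is illustrative, not the operator's actual implementation:

```python
import hashlib
import json

def config_hash(data: dict) -> str:
    """Stable SHA-256 over a ConfigMap/Secret data map (illustrative helper)."""
    canonical = json.dumps(data, sort_keys=True).encode()
    return hashlib.sha256(canonical).hexdigest()

# When the referenced ConfigMap changes, the annotation value changes, which
# mutates the PodTemplate and makes the Deployment perform a rolling update.
old = config_hash({"LLM_PROVIDER": "CLAUDEAI", "LLM_MODEL": "claude-sonnet-4-5"})
new = config_hash({"LLM_PROVIDER": "CLAUDEAI", "LLM_MODEL": "claude-opus-4"})

pod_template_annotations = {"chatcli.io/configmap-hash": new}
assert old != new  # any data change rolls the pods
```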

Secret and ConfigMap Observation

The operator watches Secrets in the Instance namespace (via a controller-runtime Watches handler). When a Secret referenced in apiKeys.name or server.tls.secretName is created or updated, the reconciler is triggered automatically, even if the Secret did not exist when the Instance was created.
  • ConfigMap and Secret envFrom: Marked as optional: true, allowing the Instance to be created before the Secret/ConfigMap
  • Flexible deploy order: Namespace -> Instance -> Secret/ConfigMap (any order after the namespace)
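
In the generated Deployment, this tolerance for missing references looks roughly like the following container envFrom fragment (names taken from the example Instance above; the exact shape is illustrative):

```yaml
envFrom:
  - configMapRef:
      name: chatcli-prod       # environment ConfigMap
      optional: true           # Instance can be created before the ConfigMap
  - secretRef:
      name: chatcli-api-keys   # API keys Secret
      optional: true           # pods start even if the Secret arrives later
```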

AIOps Platform CRDs

Anomaly

Represents a raw signal detected by the WatcherBridge.
apiVersion: platform.chatcli.io/v1alpha1
kind: Anomaly
metadata:
  name: watcher-highrestartcount-api-gateway-1234567890
  namespace: production
spec:
  signalType: pod_restart    # pod_restart | oom_kill | pod_not_ready | deploy_failing | error_rate | latency_spike
  source: watcher            # watcher | prometheus | manual
  severity: warning          # critical | high | medium | low | warning
  resource:
    kind: Deployment
    name: api-gateway
    namespace: production
  description: "HighRestartCount on api-gateway: container app restarted 8 times"
  detectedAt: "2026-02-16T10:30:00Z"
status:
  correlated: true
  issueRef:
    name: api-gateway-pod-restart-1771276354

Anomaly Spec Fields

| Field | Type | Description |
| --- | --- | --- |
| signalType | AnomalySignalType | Type of detected signal |
| source | AnomalySource | Detection origin (watcher, prometheus, manual) |
| severity | IssueSeverity | Signal severity |
| resource | ResourceRef | Affected K8s resource (kind, name, namespace) |
| description | string | Human-readable description of the problem |
| detectedAt | Time | Detection timestamp |

Signals Detected by Watcher

| AlertType (Server) | SignalType (Anomaly) | Description |
| --- | --- | --- |
| HighRestartCount | pod_restart | Pod with many restarts (CrashLoopBackOff) |
| OOMKilled | oom_kill | Container terminated due to lack of memory |
| PodNotReady | pod_not_ready | Pod is not in the Ready state |
| DeploymentFailing | deploy_failing | Deployment with Available=False |
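
The mapping is a direct lookup; a minimal sketch in Python (the dictionary mirrors the table, while the helper function is illustrative):

```python
# Mapping from server AlertType to Anomaly signalType (from the table above).
ALERT_TO_SIGNAL = {
    "HighRestartCount": "pod_restart",
    "OOMKilled": "oom_kill",
    "PodNotReady": "pod_not_ready",
    "DeploymentFailing": "deploy_failing",
}

def to_signal_type(alert_type: str) -> str:
    # Surface unknown alert types explicitly instead of guessing.
    try:
        return ALERT_TO_SIGNAL[alert_type]
    except KeyError:
        raise ValueError(f"unmapped alert type: {alert_type}")
```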

Issue

Correlated incident that groups anomalies and manages the remediation lifecycle.
apiVersion: platform.chatcli.io/v1alpha1
kind: Issue
metadata:
  name: api-gateway-pod-restart-1771276354
  namespace: production
spec:
  severity: high
  source: watcher
  signalType: pod_restart        # Propagated from Anomaly for tiered Runbook matching
  description: "Correlated incident: pod_restart on api-gateway"
  resource:
    kind: Deployment
    name: api-gateway
    namespace: production
  riskScore: 65
  correlatedAnomalies:
    - name: watcher-highrestartcount-api-gateway-1234567890
    - name: watcher-oomkilled-api-gateway-1234567891
status:
  state: Analyzing          # Detected | Analyzing | Remediating | Resolved | Escalated | Failed
  remediationAttempts: 0
  maxRemediationAttempts: 3
  detectedAt: "2026-02-16T10:30:00Z"
  conditions:
    - type: Analyzing
      status: "True"
      reason: AIInsightCreated

Issue States

| State | Description |
| --- | --- |
| Detected | Newly created Issue, awaiting analysis |
| Analyzing | AIInsight created, awaiting the AI response (or re-analysis with failure context) |
| Remediating | RemediationPlan in execution |
| Resolved | Successful remediation (dedup invalidated for recurrence detection) |
| Escalated | Max attempts reached or no available actions (dedup invalidated) |
| Failed | Terminal failure |
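
The allowed transitions can be sketched as a small table. This is an illustration inferred from the state descriptions and the resolution flow, not the operator's actual code:

```python
# Illegal moves raise; terminal states allow no further transitions.
TRANSITIONS = {
    "Detected":    {"Analyzing"},
    "Analyzing":   {"Remediating", "Escalated"},
    "Remediating": {"Resolved", "Analyzing", "Escalated", "Failed"},
    "Resolved":    set(),   # terminal: dedup invalidated
    "Escalated":   set(),   # terminal: dedup invalidated
    "Failed":      set(),   # terminal
}

def transition(state: str, target: str) -> str:
    if target not in TRANSITIONS[state]:
        raise ValueError(f"illegal transition {state} -> {target}")
    return target
```

Note the Remediating -> Analyzing edge: a failed remediation loops back for re-analysis with failure context until maxAttempts is reached.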

AIInsight

AI-generated root cause analysis with suggested actions for automatic remediation.
apiVersion: platform.chatcli.io/v1alpha1
kind: AIInsight
metadata:
  name: api-gateway-pod-restart-1771276354-insight
  namespace: production
spec:
  issueRef:
    name: api-gateway-pod-restart-1771276354
  provider: CLAUDEAI
  model: claude-sonnet-4-5
status:
  analysis: "High restart count caused by OOMKilled. Container memory limit (512Mi) is insufficient for the current workload pattern."
  confidence: 0.87
  recommendations:
    - "Increase memory limit to 1Gi"
    - "Investigate possible memory leak in the application"
    - "Monitor GC pressure metrics"
  suggestedActions:
    - name: "Restart deployment"
      action: RestartDeployment
      description: "Restart pods to reclaim leaked memory immediately"
    - name: "Scale up replicas"
      action: ScaleDeployment
      description: "Add more replicas to distribute memory pressure"
      params:
        replicas: "4"
  generatedAt: "2026-02-16T10:31:00Z"

AIInsight Status Fields

| Field | Type | Description |
| --- | --- | --- |
| analysis | string | AI-generated root cause analysis |
| confidence | float64 | Analysis confidence level (0.0-1.0) |
| recommendations | []string | Human-readable recommendations |
| suggestedActions | []SuggestedAction | Structured actions for automatic remediation |
| generatedAt | Time | When the analysis was generated |

SuggestedAction

| Field | Type | Description |
| --- | --- | --- |
| name | string | Human-readable action name |
| action | string | Action type: ScaleDeployment, RestartDeployment, RollbackDeployment, PatchConfig |
| description | string | Explanation of why this action is needed |
| params | map[string]string | Action parameters (e.g., replicas: "4") |

RemediationPlan

Concrete remediation plan automatically generated from a Runbook or AI actions.
apiVersion: platform.chatcli.io/v1alpha1
kind: RemediationPlan
metadata:
  name: api-gateway-pod-restart-1771276354-plan-1
  namespace: production
spec:
  issueRef:
    name: api-gateway-pod-restart-1771276354
  attempt: 1
  strategy: "Attempt 1 (AI-generated): High restart count caused by OOMKilled"
  actions:
    - type: RestartDeployment
    - type: ScaleDeployment
      params:
        replicas: "4"
  safetyConstraints:
    - "No delete operations"
    - "No destructive changes"
    - "Rollback on failure"
status:
  state: Completed           # Pending | Executing | Completed | Failed | RolledBack
  result: "Deployment restarted and scaled to 4 replicas successfully"
  startedAt: "2026-02-16T10:31:30Z"
  completedAt: "2026-02-16T10:32:15Z"

Action Types

| Type | Description | Parameters |
| --- | --- | --- |
| ScaleDeployment | Adjusts the number of replicas | replicas |
| RestartDeployment | Rollout restart of the deployment | — |
| RollbackDeployment | Undoes a rollout (previous, healthy, or specific revision) | toRevision (optional: previous, healthy, or a revision number) |
| PatchConfig | Updates keys of a ConfigMap | configmap, key=value |
| AdjustResources | Adjusts CPU/memory requests/limits for containers | memory_limit, memory_request, cpu_limit, cpu_request, container |
| DeletePod | Removes the sickest pod (CrashLoop > restarts) | pod (optional, auto-selects the sickest pod) |
| Custom | Custom action (blocked by safety checks) | — |

Runbook (Manual or Auto-generated)

Operational procedures. A manual Runbook always takes priority over every other remediation path. When no manual Runbook exists, the AI automatically generates a reusable Runbook CR from its suggested actions.
apiVersion: platform.chatcli.io/v1alpha1
kind: Runbook
metadata:
  name: high-error-rate-deployment
  namespace: production
spec:
  description: "Standard procedure for high error rate incidents on Deployments"
  trigger:
    signalType: error_rate
    severity: high
    resourceKind: Deployment
  steps:
    - name: Scale up
      action: ScaleDeployment
      description: "Increase replicas to absorb the error spike"
      params:
        replicas: "4"
    - name: Rollback
      action: RollbackDeployment
      description: "Revert to previous stable version if scaling doesn't help"
  maxAttempts: 3

RemediationPlan (Agentic Mode)

When there is no manual Runbook or AI-suggested actions, the operator creates an agentic plan. The AI acts as an agent with Kubernetes skills in an observe-decide-act loop:
apiVersion: platform.chatcli.io/v1alpha1
kind: RemediationPlan
metadata:
  name: api-gateway-pod-restart-plan-1
  namespace: production
spec:
  issueRef:
    name: api-gateway-pod-restart-1771276354
  attempt: 1
  strategy: "Agentic AI remediation"
  agenticMode: true
  agenticMaxSteps: 10
  agenticHistory:
    - stepNumber: 1
      aiMessage: "High restart count with OOMKilled. Scaling up to reduce memory pressure."
      action:
        type: ScaleDeployment
        params:
          replicas: "5"
      observation: "SUCCESS: ScaleDeployment executed successfully"
    - stepNumber: 2
      aiMessage: "Pods still restarting. Adjusting memory limits."
      action:
        type: AdjustResources
        params:
          memory_limit: "1Gi"
          memory_request: "512Mi"
      observation: "SUCCESS: AdjustResources executed successfully"
    - stepNumber: 3
      aiMessage: "All pods running stable. Issue resolved."
status:
  state: Completed
  agenticStepCount: 3
  agenticStartedAt: "2026-02-16T10:31:00Z"
Safety Guards: Maximum of 10 steps (configurable via agenticMaxSteps), timeout of 10 minutes. If an action fails, the observation reports “FAILED: error” and the loop continues — the AI receives the feedback and adapts.
On agentic resolution: The operator automatically generates:
  1. PostMortem CR with timeline, root cause, impact, lessons learned
  2. Reusable Runbook CR with successful steps (label source=agentic)
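
The loop and its guards can be sketched as follows. decide_fn and execute_fn are hypothetical stand-ins for the AI call and the cluster action executor; the real reconciler differs in details:

```python
def agentic_loop(decide_fn, execute_fn, max_steps: int = 10):
    """Observe-decide-act sketch: the AI (decide_fn) sees prior observations
    and proposes the next action. A failed action feeds back as 'FAILED: ...'
    so the AI can adapt, rather than aborting the loop."""
    history = []
    for step in range(1, max_steps + 1):
        decision = decide_fn(history)          # AI chooses next action, or done
        if decision.get("done"):
            return {"state": "Completed", "steps": step - 1, "history": history}
        try:
            execute_fn(decision["action"])
            observation = f"SUCCESS: {decision['action']['type']} executed successfully"
        except Exception as exc:
            observation = f"FAILED: {exc}"     # failure is fed back, loop continues
        history.append({"stepNumber": step,
                        "action": decision["action"],
                        "observation": observation})
    # Safety guard: step budget exhausted without resolution.
    return {"state": "Failed", "steps": max_steps, "history": history}
```

The 10-minute timeout guard would wrap this loop at a higher level (e.g., via context cancellation in the real Go reconciler).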

PostMortem (Auto-generated)

Incident report automatically generated after resolution by agentic remediation. Contains the complete incident history: detection, analysis, executed actions, and resolution.
apiVersion: platform.chatcli.io/v1alpha1
kind: PostMortem
metadata:
  name: pm-api-gateway-pod-restart-1771276354
  namespace: production
spec:
  issueRef:
    name: api-gateway-pod-restart-1771276354
  resource:
    kind: Deployment
    name: api-gateway
    namespace: production
  severity: high
status:
  state: Open              # Open | InReview | Closed
  summary: "OOMKilled containers caused cascading restarts on api-gateway"
  rootCause: "Memory limit (512Mi) insufficient for current workload pattern"
  impact: "Service degradation for 5 minutes, 30% error rate increase"
  timeline:
    - timestamp: "2026-02-16T10:30:00Z"
      type: detected
      detail: "Issue detected: pod_restart on api-gateway"
    - timestamp: "2026-02-16T10:31:00Z"
      type: action_executed
      detail: "ScaleDeployment to 5 replicas"
    - timestamp: "2026-02-16T10:31:35Z"
      type: action_executed
      detail: "AdjustResources memory_limit=1Gi"
    - timestamp: "2026-02-16T10:32:10Z"
      type: resolved
      detail: "All pods stable, issue resolved"
  lessonsLearned:
    - "Memory limits should account for peak workload patterns"
    - "Set up HPA to auto-scale on memory pressure"
  preventionActions:
    - "Configure HPA with min 3 replicas for api-gateway"
    - "Set memory limit to 1Gi across all environments"
  duration: "2m10s"
  generatedAt: "2026-02-16T10:32:10Z"

PostMortem Status Fields

| Field | Type | Description |
| --- | --- | --- |
| state | PostMortemState | State: Open, InReview, Closed |
| summary | string | AI-generated incident summary |
| rootCause | string | Root cause determined by AI |
| impact | string | Incident impact |
| timeline | []TimelineEvent | Timeline (detected, analyzed, action_executed, resolved) |
| actionsExecuted | []ActionRecord | Executed actions with result |
| lessonsLearned | []string | Lessons learned |
| preventionActions | []string | Suggested preventive actions |
| duration | string | Total incident duration |
| generatedAt | Time | When the PostMortem was generated |

Runbook Matching (Tiered)

Tier 1: SignalType + Severity + ResourceKind (exact match, preferred)
Tier 2: Severity + ResourceKind (fallback when signal doesn't match)
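
A minimal sketch of the two tiers (illustrative; the operator's matching code may differ in details):

```python
def match_runbook(runbooks, signal_type, severity, resource_kind):
    """Tiered Runbook matching: exact trigger match first, then a
    severity + resourceKind fallback when the signal doesn't match."""
    # Tier 1: signalType + severity + resourceKind (preferred)
    for rb in runbooks:
        t = rb["trigger"]
        if (t.get("signalType") == signal_type
                and t.get("severity") == severity
                and t.get("resourceKind") == resource_kind):
            return rb
    # Tier 2: severity + resourceKind only
    for rb in runbooks:
        t = rb["trigger"]
        if t.get("severity") == severity and t.get("resourceKind") == resource_kind:
            return rb
    return None  # fall through to AI-generated Runbook / agentic remediation
```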

Remediation Priority

1. Existing manual Runbook (tiered match)
2. AI auto-generated Runbook (materialized as reusable CR)
3. Agentic AI remediation (observe-decide-act loop, generates PostMortem + Runbook)
4. Escalation (only when agentic fails after max attempts)

Correlation Engine

The correlation engine groups anomalies into issues using:

Risk Scoring

Each signal type has a weight:
| Signal | Weight |
| --- | --- |
| oom_kill | 30 |
| error_rate | 25 |
| deploy_failing | 25 |
| latency_spike | 20 |
| pod_restart | 20 |
| pod_not_ready | 20 |
The risk score is the sum of correlated anomaly weights (maximum 100).

Severity Classification

| Risk Score | Severity |
| --- | --- |
| >= 80 | Critical |
| >= 60 | High |
| >= 40 | Medium |
| < 40 | Low |

Grouping

  • Anomalies on the same resource (deployment + namespace) within the same time window are grouped into the same Issue
  • Incident ID is deterministic: hash of resource + signal type (prevents duplicates)
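
Scoring, classification, and the deterministic ID can be sketched together. The weights and thresholds come from the tables above; the exact hash construction is an assumption:

```python
import hashlib

# Signal weights (from the Risk Scoring table above).
WEIGHTS = {"oom_kill": 30, "error_rate": 25, "deploy_failing": 25,
           "latency_spike": 20, "pod_restart": 20, "pod_not_ready": 20}

def risk_score(signals):
    """Sum of correlated anomaly weights, capped at 100."""
    return min(100, sum(WEIGHTS[s] for s in signals))

def severity(score):
    if score >= 80:
        return "Critical"
    if score >= 60:
        return "High"
    if score >= 40:
        return "Medium"
    return "Low"

def incident_id(resource, signal_type):
    """Deterministic incident ID from resource + signal type (illustrative
    hashing; the operator may derive names differently)."""
    key = f"{resource['namespace']}/{resource['name']}/{signal_type}"
    return hashlib.sha256(key.encode()).hexdigest()[:10]
```

With the weights above, a pod_restart plus an oom_kill on the same Deployment scores 50, which classifies as Medium.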

WatcherBridge

The WatcherBridge is the component that connects the ChatCLI server to the operator:
  • Polling: Queries GetAlerts from the server every 30 seconds
  • Discovery: Locates the server via Instance CRs (first Instance with a ready gRPC endpoint)
  • Dedup: SHA256 hash of type+deployment+namespace (no temporal component — a continuous problem generates only one Anomaly). 2-hour TTL
  • Dedup invalidation: When an Issue reaches a terminal state (Resolved/Escalated), dedup entries for the resource are removed, allowing immediate recurrence detection
  • Pruning: Removes expired hashes automatically (> 2h)
  • Creation: Converts alerts to Anomaly CRs with valid K8s names
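
The dedup behavior can be sketched as follows. The hashed fields (type + deployment + namespace) and the 2-hour TTL come from the description above; the separator and pruning details are assumptions:

```python
import hashlib
import time

TTL = 2 * 60 * 60  # 2-hour dedup TTL

class Dedup:
    """Sketch of WatcherBridge dedup: SHA-256 of type+deployment+namespace
    with no temporal component, so a continuous problem yields one Anomaly."""

    def __init__(self):
        self.seen = {}  # hash -> first-seen timestamp

    def key(self, alert_type, deployment, namespace):
        raw = f"{alert_type}|{deployment}|{namespace}"  # separator is assumed
        return hashlib.sha256(raw.encode()).hexdigest()

    def should_create(self, alert_type, deployment, namespace, now=None):
        if now is None:
            now = time.time()
        # Prune expired entries (> 2h).
        self.seen = {h: t for h, t in self.seen.items() if now - t <= TTL}
        k = self.key(alert_type, deployment, namespace)
        if k in self.seen:
            return False  # duplicate within TTL: skip Anomaly creation
        self.seen[k] = now
        return True

    def invalidate(self, alert_type, deployment, namespace):
        # Called when the Issue reaches a terminal state (Resolved/Escalated),
        # so a recurrence is detected immediately.
        self.seen.pop(self.key(alert_type, deployment, namespace), None)
```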

Usage Examples

apiVersion: platform.chatcli.io/v1alpha1
kind: Instance
metadata:
  name: chatcli-simple
spec:
  provider: OPENAI
  apiKeys:
    name: chatcli-api-keys

Status and Monitoring

kubectl get instances

NAME            READY   REPLICAS   PROVIDER   AGE
chatcli-aiops   true    1          CLAUDEAI   5m

kubectl get issues -A

NAME                                 SEVERITY   STATE         RISK   AGE
api-gateway-pod-restart-1771276354   high       Remediating   65     2m
worker-oom-kill-3847291023           critical   Analyzing     90     30s

kubectl get aiinsights -A

NAME                                         ISSUE                                PROVIDER   CONFIDENCE   AGE
api-gateway-pod-restart-1771276354-insight   api-gateway-pod-restart-1771276354   CLAUDEAI   0.87         1m

kubectl get remediationplans -A

NAME                                        ISSUE                                ATTEMPT   STATE       AGE
api-gateway-pod-restart-1771276354-plan-1   api-gateway-pod-restart-1771276354   1         Completed   1m

kubectl get postmortems -A

NAME                                    ISSUE                                SEVERITY   STATE   AGE
pm-api-gateway-pod-restart-1771276354   api-gateway-pod-restart-1771276354   high       Open    30s

kubectl get anomalies -A

NAME                                              SIGNAL        SOURCE    SEVERITY   AGE
watcher-highrestartcount-api-gateway-1234567890   pod_restart   watcher   warning    3m
watcher-oomkilled-worker-9876543210               oom_kill      watcher   critical   1m

Development

cd operator

# Build
go build ./...

# Tests (96 functions, 125 with subtests)
go test ./... -v

# Docker (must be built from the repository root)
docker build -f operator/Dockerfile -t myregistry/chatcli-operator:dev .

# Install CRDs in the cluster
kubectl apply -f config/crd/bases/

# Deploy the operator
make deploy IMG=myregistry/chatcli-operator:dev
