The ChatCLI Operator goes beyond instance management. It implements a complete AIOps platform that autonomously detects anomalies, correlates signals, requests AI analysis, and executes remediation — all without external dependencies beyond the LLM provider. The platform supports Deployments, StatefulSets, DaemonSets, Jobs, and CronJobs, integrates with Helm, ArgoCD, and Flux for GitOps-aware remediation, analyzes application logs with stack trace extraction (Java, Go, Python, Node.js), correlates Prometheus metrics with incidents, and allows linking source code repositories for code-aware diagnostics.

API Group and CRDs

The operator uses the API group platform.chatcli.io/v1alpha1 with 17 Custom Resource Definitions:
| CRD | Short Name | Description |
| --- | --- | --- |
| Instance | inst | ChatCLI server instance (Deployment, Service, RBAC, PVC) |
| Anomaly | anom | Raw signal from the K8s Watcher (restarts, OOM, deploy failures) |
| Issue | iss | Correlated incident grouping multiple anomalies |
| AIInsight | ai | AI-generated root cause analysis with enriched context (logs, metrics, code, GitOps) |
| RemediationPlan | rp | Concrete actions to resolve the problem (runbook or agentic AI) |
| Runbook | rb | Manual operational procedures (optional) |
| PostMortem | pm | Auto-generated incident report after resolution (all modes) |
| SourceRepository | srcrepo | Links workloads to git repositories for code-aware diagnostics |
| NotificationPolicy | np | Multi-channel notification routing with throttling and templates |
| EscalationPolicy | ep | Tiered escalation chains with timeouts (L1→L2→L3) |
| ServiceLevelObjective | slo | SLO with multi-window burn rate alerting (Google SRE model) |
| IncidentSLA | sla | Response/resolution SLA targets per severity with business hours |
| ApprovalPolicy | ap | Auto/manual/quorum approval policies with change windows |
| ApprovalRequest | ar | Approval workflow with blast radius assessment |
| ClusterRegistration | cr | Multi-cluster federation with kubeconfig and health checks |
| AuditEvent | ae | Immutable audit trail (append-only) |
| ChaosExperiment | chaos | Chaos engineering experiments with 7 types and safety checks |
For detailed documentation on each v2 CRD (NotificationPolicy, EscalationPolicy, SLO, SLA, ApprovalPolicy, ApprovalRequest, ClusterRegistration, AuditEvent, ChaosExperiment), see the AIOps Platform sub-pages.

Operator Installation

A single command installs everything: 17 CRDs + RBAC + Deployment + Service + Dashboard.
The operator chart (chatcli-operator) is separate from the server chart (chatcli). The operator manages the controllers and AIOps dashboard. The server is deployed via Instance CR or the chatcli chart with watcher enabled.
# With Prometheus for incident metrics
helm install chatcli-operator \
    oci://ghcr.io/diillson/charts/chatcli-operator \
  --namespace chatcli-system --create-namespace \
  --set prometheusUrl="http://prometheus-server.monitoring.svc:9090"

# With custom image
helm install chatcli-operator \
    oci://ghcr.io/diillson/charts/chatcli-operator \
  --namespace chatcli-system --create-namespace \
  --set image.repository=myregistry/chatcli-operator \
  --set image.tag=1.139.0

# With ServiceMonitor for Prometheus Operator
helm install chatcli-operator \
    oci://ghcr.io/diillson/charts/chatcli-operator \
  --namespace chatcli-system --create-namespace \
  --set serviceMonitor.enabled=true
| Value | Default | Description |
| --- | --- | --- |
| image.repository | ghcr.io/diillson/chatcli-operator | Operator image |
| image.tag | latest | Image tag |
| replicaCount | 1 | Replicas (leader election enabled by default) |
| api.port | 8090 | Web dashboard and REST API port |
| prometheusUrl | "" | Prometheus URL for incident metrics collection |
| leaderElect | true | Leader election for HA |
| serviceMonitor.enabled | false | Create Prometheus ServiceMonitor |
# Without Helm: apply the CRDs, RBAC, and manager manifests directly
kubectl apply -f operator/config/crd/bases/
kubectl apply -f operator/config/rbac/role.yaml
kubectl apply -f operator/config/manager/manager.yaml
# Build from the repo root
docker build -f operator/Dockerfile -t myregistry/chatcli-operator:dev .

# Or via Make
cd operator
make docker-build IMG=myregistry/chatcli-operator:dev
make docker-push IMG=myregistry/chatcli-operator:dev

AIOps Platform Architecture

Autonomous Pipeline

| Phase | Component | What It Does |
| --- | --- | --- |
| 1. Detection | WatcherBridge | Queries GetAlerts from the server every 30s. Creates Anomaly CRs (dedup SHA256). Invalidates dedup when Issue reaches terminal state. |
| 2. Correlation | AnomalyReconciler + CorrelationEngine | Groups anomalies by resource + time window. Calculates risk score and severity. Creates/updates Issue CRs with signalType. |
| 3. Analysis | AIInsightReconciler + 6 enrichers | Collects K8s context (Deployments, StatefulSets, DaemonSets, Jobs, CronJobs, HPAs), advanced log analysis (stack traces Java/Go/Python/Node.js, 24+ error patterns), Prometheus metrics (CPU/mem/latency trends), GitOps (Helm/ArgoCD/Flux status), source code (commit↔incident correlation), cascade analysis (cross-service). |
| 4. Remediation | IssueReconciler | AI-validated runbook selection: (a) finds ALL candidate runbooks (multi-runbook per trigger), (b) AI validates each against root cause (RUNBOOK_APPROVED: name or RUNBOOK_REJECTED), (c) if rejected or no candidates, generates new runbook from AI suggestions (with unique hash per root cause), or (d) agentic remediation (AI acts step-by-step). |
| 5. Execution | RemediationReconciler | 54 action types: workload (Scale, Restart, Rollback, AdjustResources, DeletePod, RestartStatefulSetPod), GitOps (HelmRollback, ArgoSyncApp), autoscaling (AdjustHPA), infra (CordonNode, DrainNode), storage (ResizePVC), security (RotateSecret), networking (UpdateIngress, PatchNetworkPolicy), advanced (ApplyManifest, ExecDiagnostic), statefulset (ScaleStatefulSet, RestartStatefulSet, RollbackStatefulSet, AdjustStatefulSetResources, DeleteStatefulSetPod, ForceDeleteStatefulSetPod, UpdateStatefulSetStrategy, RecreateStatefulSetPVC, PartitionStatefulSetUpdate), daemonset (RestartDaemonSet, RollbackDaemonSet, AdjustDaemonSetResources, DeleteDaemonSetPod, UpdateDaemonSetStrategy, PauseDaemonSetRollout, CordonAndDeleteDaemonSetPod), job (RetryJob, AdjustJobResources, DeleteFailedJob, SuspendJob, ResumeJob, AdjustJobParallelism, AdjustJobDeadline, AdjustJobBackoffLimit, ForceDeleteJobPods), cronjob (SuspendCronJob, ResumeCronJob, TriggerCronJob, AdjustCronJobResources, AdjustCronJobSchedule, AdjustCronJobDeadline, AdjustCronJobHistory, AdjustCronJobConcurrency, DeleteCronJobActiveJobs, ReplaceCronJobTemplate). Blast radius prediction before execution. |
| 6. Resolution | IssueReconciler | Success -> Resolved (invalidates dedup). Failure -> re-analysis with failure context (different strategy) -> up to maxAttempts -> Escalated. |
| 7. PostMortem | IssueReconciler | All remediations (not just agentic) generate PostMortem CR with timeline, root cause, lessons, metrics, git correlation, cascade chain, trending (recurring incidents), dev feedback. Successful remediations also generate reusable Runbooks (one per root cause, hash-based naming). |
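
The dedup behavior in phase 1 can be pictured as a small TTL cache keyed by a SHA-256 digest. The sketch below is illustrative only: the fields fed into the hash (alert type, namespace, resource name) and the names `alert_hash` and `DedupCache` are assumptions, not the operator's actual implementation. The 60-minute TTL mirrors the documented dedupTTLMinutes default.

```python
import hashlib


def alert_hash(alert_type: str, namespace: str, name: str) -> str:
    """Hypothetical dedup key: SHA-256 over the alert identity fields."""
    raw = f"{alert_type}|{namespace}|{name}".encode("utf-8")
    return hashlib.sha256(raw).hexdigest()


class DedupCache:
    """Remembers alert hashes for ttl_seconds; the caller supplies the clock."""

    def __init__(self, ttl_seconds: float = 60 * 60):
        self.ttl_seconds = ttl_seconds
        self._seen = {}  # hash -> time the alert was first recorded

    def should_create_anomaly(self, key: str, now: float) -> bool:
        first_seen = self._seen.get(key)
        if first_seen is not None and now - first_seen < self.ttl_seconds:
            return False  # duplicate inside the TTL window
        self._seen[key] = now
        return True

    def invalidate(self, key: str) -> None:
        """Called when the related Issue reaches a terminal state."""
        self._seen.pop(key, None)
```

Invalidating the key on a terminal state is what lets a recurrence of the same alert produce a fresh Anomaly instead of being swallowed by the cache.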

Issue State Machine

Create Secret with API Keys

Before creating an Instance, you need a Secret with the LLM provider API keys. The Instance references this Secret via apiKeys.name; without it, the server cannot call the AI.
kubectl create secret generic chatcli-api-keys \
  --namespace chatcli-system \
  --from-literal=OPENAI_API_KEY="sk-your-key-here"
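The Secret must exist in the same namespace as the Instance CR, so change --namespace in the command above if the Instance lives elsewhere (the Complete Specification example uses the chatcli namespace). The Secret name must match the apiKeys.name field in the Instance spec. Without this Secret, the server starts but cannot execute AI analysis or agentic remediation.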

CRD: Instance

The Instance manages ChatCLI server instances in the cluster.

Complete Specification

apiVersion: platform.chatcli.io/v1alpha1
kind: Instance
metadata:
  name: chatcli-prod
  namespace: chatcli          # The namespace must exist before creating the Instance
spec:
  replicas: 1
  provider: CLAUDEAI       # OPENAI, OPENAI_ASSISTANT, CLAUDEAI, BEDROCK, GOOGLEAI, XAI, ZAI, MINIMAX, MOONSHOT, OPENROUTER, STACKSPOT, OLLAMA, COPILOT, GITHUB_MODELS
  model: claude-sonnet-4-6

  image:
    repository: ghcr.io/diillson/chatcli
    tag: latest
    pullPolicy: IfNotPresent

  server:
    port: 50051
    tls:
      enabled: true
      secretName: chatcli-tls   # Must contain tls.crt, tls.key AND ca.crt (self-signed: ca.crt=tls.crt); see cookbook §2.1
    token:
      name: chatcli-auth
      key: token
    security:                   # Kept under the single server key (duplicate YAML keys are invalid)
      jwtSecretRef:
        name: chatcli-jwt
        key: secret
      rateLimitRps: 20
      # bindAddress: "0.0.0.0"  # Optional; auto-detected in Kubernetes

  watcher:
    enabled: true
    interval: "30s"
    window: "2h"
    maxLogLines: 100
    maxContextChars: 32000
    targets:
      - name: api-gateway
        namespace: production
        metricsPort: 9090
        metricsFilter: ["http_requests_*", "http_request_duration_*"]
      - name: auth-service
        namespace: production
        metricsPort: 9090
      - name: worker
        namespace: batch
      - name: postgres                  # Monitor a StatefulSet
        kind: StatefulSet
        namespace: production
      - name: fluentd-agent             # Monitor a DaemonSet
        kind: DaemonSet
        namespace: logging
      - name: etl-pipeline              # Monitor a CronJob
        kind: CronJob
        namespace: data

  resources:
    requests:
      cpu: 100m
      memory: 128Mi
    limits:
      cpu: 500m
      memory: 512Mi

  persistence:
    enabled: true
    size: 1Gi
    storageClassName: standard

  securityContext:
    runAsNonRoot: true
    runAsUser: 1000
    seccompProfile:
      type: RuntimeDefault

  apiKeys:
    name: chatcli-api-keys

  extraEnv:
    - name: CHATCLI_AGENT_SECURITY_MODE
      value: "strict"
    - name: CHATCLI_AUDIT_LOG_PATH
      value: "/var/log/chatcli/audit.jsonl"

Spec Fields

Root

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| replicas | int32 | No | 1 | Number of server replicas |
| provider | string | Yes | | LLM provider |
| model | string | No | | LLM model |
| image | ImageSpec | No | | Image configuration |
| server | ServerSpec | No | | gRPC server configuration |
| watcher | WatcherSpec | No | | K8s Watcher configuration |
| resources | ResourceRequirements | No | | CPU and memory requests/limits |
| persistence | PersistenceSpec | No | | Session persistence |
| securityContext | PodSecurityContext | No | nonroot/1000 | Pod security context |
| fallback | FallbackSpec | No | | LLM provider failover chain |
| apiKeys | SecretRefSpec | No | | Secret with API keys (all providers in fallback chain) |
| aiops | AIOpsSpec | No | | Autonomous incident management pipeline configuration |

AIOpsSpec

Configures the automatic remediation pipeline. All fields are optional with sensible defaults. AI auto-generated runbooks inherit maxRemediationAttempts from this configuration.
| Field | Type | Required | Default | Range | Description |
| --- | --- | --- | --- | --- | --- |
| maxRemediationAttempts | int32 | No | 5 | 1-10 | Maximum remediation attempts before escalating to human |
| resolutionCooldownMinutes | int32 | No | 10 | 0-120 | Minutes after resolving before accepting new anomalies for the same resource |
| dedupTTLMinutes | int32 | No | 60 | 5-1440 | How long (min) the dedup cache retains alert hashes |
| enableAutoResolve | bool | No | true | | Auto-resolve Escalated issues when the resource recovers |
| agenticMaxSteps | int32 | No | 10 | 3-30 | Maximum steps per agentic remediation attempt (each step = 1 AI call) |
spec:
  aiops:
    maxRemediationAttempts: 5
    resolutionCooldownMinutes: 10
    dedupTTLMinutes: 60
    enableAutoResolve: true
    agenticMaxSteps: 10
In agentic mode, the postmortem includes the full AI reasoning for each step: which action was chosen, why, and the observed result. This ensures a complete audit trail of autonomous AI decisions.

FallbackSpec

Configures automatic failover between LLM providers. When the primary provider fails (rate limit, timeout, server error), the system automatically tries the next provider in the chain.
| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| enabled | bool | Yes | | Activates the fallback chain |
| providers | []FallbackProviderEntry | Yes | | Ordered list of fallback providers (first = highest priority) |
| maxRetries | int32 | No | 2 | Retries per provider before moving to next |
| cooldownBase | string | No | "30s" | Initial cooldown after failure (exponential backoff) |
| cooldownMax | string | No | "5m" | Maximum cooldown duration |

FallbackProviderEntry

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| name | string | Yes | Provider name: OPENAI, OPENAI_ASSISTANT, CLAUDEAI, BEDROCK, GOOGLEAI, XAI, ZAI, MINIMAX, MOONSHOT, OPENROUTER, STACKSPOT, OLLAMA, COPILOT, GITHUB_MODELS |
| model | string | No | LLM model for this provider |
The primary provider (spec.provider) is always tried first. Providers in fallback.providers are tried in order when the primary fails. The Secret in apiKeys must contain API keys for all providers in the chain.
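
As a rough illustration of the failover settings above, the sketch below orders the providers and computes a capped exponential cooldown. The helper names are hypothetical, and the doubling factor is an assumption: the page states only that the backoff is exponential, starts at cooldownBase, and is capped at cooldownMax.

```python
def parse_duration(text: str) -> int:
    """Parse the simple '30s' / '5m' / '2h' forms used on this page into seconds."""
    units = {"s": 1, "m": 60, "h": 3600}
    return int(text[:-1]) * units[text[-1]]


def cooldown_seconds(failures: int, base: str = "30s", maximum: str = "5m") -> int:
    """Cooldown after the Nth consecutive failure (doubling is an assumption)."""
    return min(parse_duration(base) * 2 ** (failures - 1), parse_duration(maximum))


def provider_order(primary: str, fallbacks: list) -> list:
    """Primary first, then fallback.providers in the order they are listed."""
    return [primary, *fallbacks]
```

With the defaults, the cooldown under this assumption grows 30s, 60s, 120s, 240s and then stays at the 5-minute cap.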

WatcherSpec

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| enabled | bool | No | false | Enables the watcher |
| targets | []WatchTargetSpec | No | | List of resources to monitor (multi-target) |
| deployment | string | No | | Single deployment (legacy) |
| namespace | string | No | | Deployment namespace (legacy) |
| interval | string | No | "30s" | Collection interval |
| window | string | No | "2h" | Observation window |
| maxLogLines | int32 | No | 100 | Max log lines per pod |
| maxContextChars | int32 | No | 32000 | LLM context budget |

WatchTargetSpec

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| name | string | Yes* | | Resource name to monitor (e.g., postgres, fluentd) |
| deployment | string | No | | Deprecated alias for name, kept for backward compatibility |
| kind | string | No | Deployment | Resource kind: Deployment, StatefulSet, DaemonSet, Job, CronJob |
| namespace | string | Yes | | Resource namespace |
| metricsPort | int32 | No | 0 | Prometheus port (0 = disabled) |
| metricsPath | string | No | /metrics | Prometheus endpoint path |
| metricsFilter | []string | No | | Glob filters for metrics |
Use name + kind to monitor any Kubernetes workload type. When kind is omitted, it defaults to Deployment. The legacy deployment field still works as an alias for name. Examples:
targets:
  - name: api-gateway             # Deployment (default kind)
    namespace: production
  - name: postgres                # StatefulSet (database)
    kind: StatefulSet
    namespace: production
  - name: fluentd                 # DaemonSet (logging agent)
    kind: DaemonSet
    namespace: logging
  - name: etl-pipeline            # CronJob (scheduled batch)
    kind: CronJob
    namespace: data
The AIOps pipeline will automatically use resource-specific remediation actions (e.g., ScaleStatefulSet, RestartDaemonSet, SuspendCronJob) based on the detected resource kind.
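
The defaulting rules for a target entry (kind falls back to Deployment, deployment acts as an alias for name, metricsPort 0 disables scraping) can be sketched as a small normalization step. This is a hypothetical helper written for illustration; the function name and the error messages are assumptions and do not come from the operator.

```python
VALID_KINDS = {"Deployment", "StatefulSet", "DaemonSet", "Job", "CronJob"}


def normalize_target(target: dict) -> dict:
    """Apply the documented defaults to one watcher.targets entry."""
    name = target.get("name") or target.get("deployment")  # legacy alias
    if not name:
        raise ValueError("target needs 'name' (or the legacy 'deployment')")
    if not target.get("namespace"):
        raise ValueError("target needs 'namespace'")
    kind = target.get("kind", "Deployment")
    if kind not in VALID_KINDS:
        raise ValueError(f"unsupported kind: {kind}")
    return {
        "name": name,
        "kind": kind,
        "namespace": target["namespace"],
        "metricsPort": target.get("metricsPort", 0),  # 0 = metrics disabled
        "metricsPath": target.get("metricsPath", "/metrics"),
    }
```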

Resources Created by Instance

| Resource | Name | Description |
| --- | --- | --- |
| Deployment | <name> | ChatCLI server pods |
| Service | <name> | gRPC Service (automatic headless when replicas > 1 for client-side LB) |
| ConfigMap | <name> | Environment variables (provider, model, etc.) |
| ConfigMap | <name>-watch-config | Multi-target YAML (if targets defined) |
| ServiceAccount | <name> | Identity for RBAC |
| Role | <name>-watcher | K8s watcher permissions (single-namespace) |
| RoleBinding | <name>-watcher | SA to Role binding (single-namespace) |
| ClusterRoleBinding | <namespace>-<name>-watcher | SA binding to the shared ClusterRole (multi-namespace) |
| PVC | <name>-sessions | Persistence (if enabled) |

gRPC Load Balancing

gRPC uses persistent HTTP/2 connections that pin to a single pod via kube-proxy, leaving extra replicas idle.
  • 1 replica (default): Standard ClusterIP Service
  • Multiple replicas: Headless Service (ClusterIP: None) is created automatically, enabling client-side round-robin via gRPC dns:/// resolver
  • Keepalive: WatcherBridge pings every 30s (5s timeout) to detect inactive pods quickly. The server accepts pings with a minimum interval of 20s (EnforcementPolicy.MinTime)
  • Transition: When scaling from 1 to 2+ replicas (or back), the operator deletes and recreates the Service automatically (ClusterIP is immutable in Kubernetes)

Automatic RBAC

  • Same namespace (all targets in the same namespace as the Instance): Creates per-Instance Role + RoleBinding
  • Cross-namespace (targets in a different namespace than the Instance, or in multiple namespaces): Creates only a per-Instance ClusterRoleBinding pointing at the shared chatcli-watcher ClusterRole (pre-provisioned by the Helm chart / kustomize overlay)
  • On CR deletion, the finalizer removes the ClusterRoleBinding; the shared ClusterRole stays (owned by the release)
As of v1.139.0, the operator no longer creates ClusterRole resources at runtime (H5 hardening). Shared ClusterRoles — chatcli-watcher for the watcher and chatcli-role-{viewer,operator,admin,superadmin} for platform roles — are installed by the operator Helm chart. The operator’s ServiceAccount carries the bind verb restricted to those exact names via resourceNames, preventing privilege escalation even if the operator is compromised.
Upgrading from v1.105.0: clusters with pre-existing multi-namespace Instances had a ClusterRoleBinding pointing to a per-Instance ClusterRole (legacy shape). Because roleRef is immutable in Kubernetes, a direct helm upgrade used to freeze the reconcile with cannot change roleRef. As of v1.139.0, the operator detects the divergent roleRef at the top of reconcileClusterRBAC, deletes the stale binding, and recreates it pointing at chatcli-watcher — transparent migration, no manual intervention.
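
The scope decision described in this section can be summarized in a few lines. The sketch below is a simplification under stated assumptions: `rbac_objects` is a hypothetical name, and the real reconciler also handles finalizers and the roleRef migration, which are omitted here. The object names follow the "Resources Created by Instance" table.

```python
def rbac_objects(instance_name: str, instance_ns: str, target_namespaces: list) -> list:
    """Return (kind, name) pairs the operator would create for the watcher."""
    cross_namespace = any(ns != instance_ns for ns in target_namespaces)
    if cross_namespace:
        # Only a binding is created; it points at the shared chatcli-watcher ClusterRole.
        return [("ClusterRoleBinding", f"{instance_ns}-{instance_name}-watcher")]
    return [
        ("Role", f"{instance_name}-watcher"),
        ("RoleBinding", f"{instance_name}-watcher"),
    ]
```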

Server Image and Auto-Resolution

The server image tag (spec.image.tag) follows a three-step priority:
  1. Explicit pin in spec.image.tag — honored verbatim (GitOps-friendly).
  2. Omitted — the operator resolves it from the CHATCLI_OPERATOR_APP_VERSION env var, which the Helm chart injects automatically from .Chart.AppVersion. Effect: helm upgrade chatcli-operator rolls the server of every Instance that opted into auto-resolution, with no per-Instance patch.
  3. Fallback: latest is used when neither is present (e.g., make deploy without Helm).
apiVersion: platform.chatcli.io/v1alpha1
kind: Instance
metadata:
  name: chatcli-prod
spec:
  image:
    repository: ghcr.io/diillson/chatcli
    # tag omitted → inherits chart appVersion (recommended for Helm-upgrade environments)
  provider: CLAUDEAI
  model: claude-sonnet-4-6
  replicas: 2
For environments that want immutable versioning, keep spec.image.tag pinned and manage upgrades manually. For environments that want “helm upgrade = full upgrade,” omit the tag.
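
The three-step priority for the image tag reduces to a short lookup. The following is a minimal sketch, not the operator's code: `resolve_image_tag` is a hypothetical name, and only the CHATCLI_OPERATOR_APP_VERSION variable and the latest fallback come from this page.

```python
import os


def resolve_image_tag(spec_tag, environ=None) -> str:
    """Resolve the server image tag using the three-step priority."""
    env = os.environ if environ is None else environ
    if spec_tag:  # 1. explicit pin in spec.image.tag wins
        return spec_tag
    app_version = env.get("CHATCLI_OPERATOR_APP_VERSION", "")
    if app_version:  # 2. chart appVersion injected through the environment
        return app_version
    return "latest"  # 3. fallback when neither is present
```

Passing the environment as a parameter keeps the function easy to exercise without touching the real process environment.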

Auto-Rollout on Configuration Changes

The operator monitors changes in ConfigMaps and Secrets referenced by the Instance and triggers rolling updates automatically via hash annotations on the PodTemplate:
| Annotation | Source | When It Changes |
| --- | --- | --- |
| chatcli.io/watch-config-hash | ConfigMap <name>-watch-config | Watcher targets changed |
| chatcli.io/configmap-hash | ConfigMap <name> | Environment variables updated |
| chatcli.io/secret-hash | Secret referenced in apiKeys.name | API keys created or updated |
| chatcli.io/tls-hash | Secret referenced in server.tls.secretName | TLS certificates renewed |
Adding/removing targets in watcher.targets and applying the Instance causes automatic rollout. Creating or updating the API keys Secret and renewing TLS certificates also trigger rollout automatically.
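
The hash-annotation pattern works because any change to a PodTemplate annotation makes the Deployment controller roll the pods. A minimal sketch of the idea follows; the choice of SHA-256 over a sorted JSON dump and both function names are assumptions made for illustration, since the page does not specify how the hashes are computed.

```python
import hashlib
import json


def data_hash(data: dict) -> str:
    """Stable SHA-256 over a ConfigMap/Secret data map (key order does not matter)."""
    canonical = json.dumps(data, sort_keys=True).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()


def pod_template_annotations(configmap: dict, api_keys: dict) -> dict:
    """Annotations whose change makes the Deployment roll its pods."""
    return {
        "chatcli.io/configmap-hash": data_hash(configmap),
        "chatcli.io/secret-hash": data_hash(api_keys),
    }
```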

Secret and ConfigMap Observation

The operator watches (Watches) Secrets in the Instance namespace. When a Secret referenced in apiKeys.name or server.tls.secretName is created or updated, the reconciler is triggered automatically — even if the Secret did not exist when the Instance was created.
  • ConfigMap and Secret envFrom: Marked as optional: true, allowing the Instance to be created before the Secret/ConfigMap
  • Flexible deploy order: Namespace -> Instance -> Secret/ConfigMap (any order after the namespace)

AIOps Platform CRDs

Anomaly

Represents a raw signal detected by the WatcherBridge.
apiVersion: platform.chatcli.io/v1alpha1
kind: Anomaly
metadata:
  name: watcher-highrestartcount-api-gateway-1234567890
  namespace: production
spec:
  signalType: pod_restart    # pod_restart | oom_kill | pod_not_ready | deploy_failing | error_rate | latency_spike
  source: watcher            # watcher | prometheus | manual
  severity: warning          # critical | high | medium | low | warning
  resource:
    kind: Deployment
    name: api-gateway
    namespace: production
  description: "HighRestartCount on api-gateway: container app restarted 8 times"
  detectedAt: "2026-02-16T10:30:00Z"
status:
  correlated: true
  issueRef:
    name: api-gateway-pod-restart-1771276354

Anomaly Spec Fields

| Field | Type | Description |
| --- | --- | --- |
| signalType | AnomalySignalType | Type of detected signal |
| source | AnomalySource | Detection origin (watcher, prometheus, manual) |
| severity | IssueSeverity | Signal severity |
| resource | ResourceRef | Affected K8s resource (kind, name, namespace) |
| description | string | Human-readable description of the problem |
| detectedAt | Time | Detection timestamp |

Signals Detected (21 types)

Watcher signals:
| AlertType (Server) | SignalType (Anomaly) | Description |
| --- | --- | --- |
| HighRestartCount | pod_restart | Pod with many restarts (CrashLoopBackOff) |
| OOMKilled | oom_kill | Container terminated due to lack of memory |
| PodNotReady | pod_not_ready | Pod is not in the Ready state |
| DeploymentFailing | deploy_failing | Deployment with Available=False |
Additional signals (via Prometheus, webhooks, or internal detection):
| SignalType | Description |
| --- | --- |
| error_rate | Elevated HTTP error rate |
| latency | Latency above threshold |
| cpu_high | Elevated CPU usage |
| memory_high | Elevated memory usage |
| disk_pressure | Node with DiskPressure condition (disk full or nearly full) |
| node_not_ready | Node with NotReady condition (kubelet unresponsive, network or hardware failure) |
| memory_pressure | Node with MemoryPressure condition (insufficient memory for new pods) |
| pid_pressure | Node with PIDPressure condition (excessive processes, fork bomb risk) |
| network_unavailable | Node with network unavailable (CNI failure or interface down) |
| pvc_pending | PVC in Pending state |
| ingress_error | Ingress controller errors |
| hpa_maxed | HPA at maximum replicas |
| job_failed | Job failed |
| cronjob_missed | CronJob missed its schedule |
| certificate_expiring | TLS certificate expiring |
| image_pull_error | Error pulling container image |
| crashloop_backoff | Pod in CrashLoopBackOff |
| helm_release_failed | Helm release in failed state |
| argocd_degraded | ArgoCD Application degraded |
| config_drift | Configuration drift detected |

Node Monitoring

The watcher automatically monitors the health of nodes where target pods are running. On each collection cycle, it:
  1. Identifies nodes via label selector from the target’s pods
  2. Collects all 5 official Kubernetes conditions: Ready, DiskPressure, MemoryPressure, PIDPressure, NetworkUnavailable
  3. Collects node CPU/memory metrics (via metrics server)
  4. Counts active pods vs node pod capacity
  5. Checks if the node is cordoned (unschedulable)
| Condition | Severity | Signal | Available Action |
| --- | --- | --- | --- |
| Node NotReady | CRITICAL | node_not_ready | CordonNode, DrainNode |
| DiskPressure | CRITICAL | disk_pressure | CordonNode, DrainNode |
| MemoryPressure | CRITICAL | memory_high | CordonNode, DrainNode |
| PIDPressure | WARNING | node_not_ready | CordonNode |
| NetworkUnavailable | CRITICAL | node_not_ready | CordonNode, DrainNode |
| Cordoned (Unschedulable) | WARNING | node_not_ready | Informational |
| Pod capacity >90% | WARNING | node_not_ready | CordonNode |
Node information is included in the AI analysis context, enabling root cause correlation with infrastructure problems (e.g., “OOMKill caused by MemoryPressure on node X”).

Issue

Correlated incident that groups anomalies and manages the remediation lifecycle.
apiVersion: platform.chatcli.io/v1alpha1
kind: Issue
metadata:
  name: api-gateway-pod-restart-1771276354
  namespace: production
spec:
  severity: high
  source: watcher
  signalType: pod_restart        # Propagated from Anomaly for tiered Runbook matching
  description: "Correlated incident: pod_restart on api-gateway"
  resource:
    kind: Deployment
    name: api-gateway
    namespace: production
  riskScore: 65
  correlatedAnomalies:
    - name: watcher-highrestartcount-api-gateway-1234567890
    - name: watcher-oomkilled-api-gateway-1234567891
status:
  state: Analyzing          # Detected | Analyzing | Remediating | Resolved | Escalated | Failed
  remediationAttempts: 0
  maxRemediationAttempts: 5  # default: 5, configurable via Instance aiops.maxRemediationAttempts
  detectedAt: "2026-02-16T10:30:00Z"
  conditions:
    - type: Analyzing
      status: "True"
      reason: AIInsightCreated

Issue States

| State | Description |
| --- | --- |
| Detected | Newly created issue, awaiting analysis |
| Analyzing | AIInsight created, awaiting AI response (or re-analysis with failure context) |
| Remediating | RemediationPlan in execution |
| Resolved | Successful remediation (dedup invalidated for recurrence detection) |
| Escalated | Max attempts reached or no available actions (dedup invalidated) |
| Failed | Terminal failure |
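
The lifecycle implied by these states and by the Resolution phase of the pipeline can be written as a transition function. This is a hedged reading of the page, not the IssueReconciler's logic: the event names are invented for the sketch, and the conditions that lead to Failed are not described here, so that state is left unreachable.

```python
def next_state(state: str, event: str, attempts: int = 0, max_attempts: int = 5) -> str:
    """Hypothetical transition function for the Issue lifecycle."""
    if state == "Detected" and event == "insight_created":
        return "Analyzing"
    if state == "Analyzing" and event == "plan_created":
        return "Remediating"
    if state == "Analyzing" and event == "no_actions":
        return "Escalated"
    if state == "Remediating" and event == "remediation_succeeded":
        return "Resolved"
    if state == "Remediating" and event == "remediation_failed":
        # Retry with failure context until the attempt budget is spent.
        return "Analyzing" if attempts < max_attempts else "Escalated"
    if state == "Escalated" and event == "resource_recovered":
        return "Resolved"  # applies only when enableAutoResolve is true
    return state
```

Here `attempts` counts remediation attempts already made, including the one that just failed.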

AIInsight

AI-generated root cause analysis with suggested actions for automatic remediation.
apiVersion: platform.chatcli.io/v1alpha1
kind: AIInsight
metadata:
  name: api-gateway-pod-restart-1771276354-insight
  namespace: production
spec:
  issueRef:
    name: api-gateway-pod-restart-1771276354
  provider: CLAUDEAI
  model: claude-sonnet-4-6
status:
  analysis: "High restart count caused by OOMKilled. Container memory limit (512Mi) is insufficient for the current workload pattern."
  confidence: 0.87
  recommendations:
    - "Increase memory limit to 1Gi"
    - "Investigate possible memory leak in the application"
    - "Monitor GC pressure metrics"
  suggestedActions:
    - name: "Restart deployment"
      action: RestartDeployment
      description: "Restart pods to reclaim leaked memory immediately"
    - name: "Scale up replicas"
      action: ScaleDeployment
      description: "Add more replicas to distribute memory pressure"
      params:
        replicas: "4"
  generatedAt: "2026-02-16T10:31:00Z"

AIInsight Status Fields

| Field | Type | Description |
| --- | --- | --- |
| analysis | string | AI-generated root cause analysis |
| confidence | float64 | Analysis confidence level (0.0-1.0) |
| recommendations | []string | Human-readable recommendations |
| suggestedActions | []SuggestedAction | Structured actions for automatic remediation |
| generatedAt | Time | When the analysis was generated |

SuggestedAction

| Field | Type | Description |
| --- | --- | --- |
| name | string | Human-readable action name |
| action | string | Action type (54 action types available; see Action Types) |
| description | string | Explanation of why this action is needed |
| params | map[string]string | Action parameters (e.g., replicas: "4") |

RemediationPlan

Concrete remediation plan automatically generated from a Runbook or AI actions.
apiVersion: platform.chatcli.io/v1alpha1
kind: RemediationPlan
metadata:
  name: api-gateway-pod-restart-1771276354-plan-1
  namespace: production
spec:
  issueRef:
    name: api-gateway-pod-restart-1771276354
  attempt: 1
  strategy: "Attempt 1 (AI-generated): High restart count caused by OOMKilled"
  actions:
    - type: RestartDeployment
    - type: ScaleDeployment
      params:
        replicas: "4"
  safetyConstraints:
    - "No delete operations"
    - "No destructive changes"
    - "Rollback on failure"
status:
  state: Completed           # Pending | Executing | Completed | Failed | RolledBack
  result: "Deployment restarted and scaled to 4 replicas successfully"
  startedAt: "2026-02-16T10:31:30Z"
  completedAt: "2026-02-16T10:32:15Z"

Automatic Rollback and State Protection

The operator implements an automatic rollback system that ensures unsuccessful remediations do not leave the cluster in a worse state than before. Before executing any action, the complete resource state is captured in a structured restorable snapshot.

1. Pre-Remediation Snapshot

Before the first action, the RollbackEngine captures a structured ResourceSnapshot with: replicas, container images, CPU/memory requests and limits, HPA state (min/max replicas), and node state (schedulable/unschedulable). Works for Deployments, StatefulSets, DaemonSets, Nodes, and HPAs.

2. Per-Action Checkpoint

In plans with multiple actions, an ActionCheckpoint is captured before each individual action. This makes it possible to know exactly which action modified what and at which point the plan failed.

3. Automatic Rollback on Action Failure

If any action fails during execution, the operator automatically restores the resource to the PreflightSnapshot state. Replicas, images, resource requests/limits, and HPA state are reverted. The plan transitions to RolledBack state (not Failed).

4. Rollback on Verification Timeout

If all actions execute successfully but the resource does not become healthy within 90 seconds (verification timeout), the operator also performs automatic rollback to the pre-remediation state.

5. Post-Failure Health Check

After the rollback, the operator verifies whether the resource returned to a healthy state (PostFailureHealthy). This information is recorded in the plan status for auditing and retry decisions.
What is captured per resource type:
| Resource | Captured Fields |
| --- | --- |
| Deployment | replicas, container images, CPU/memory requests+limits, restart annotation |
| StatefulSet | replicas, container images, CPU/memory requests+limits |
| DaemonSet | container images, CPU/memory requests+limits |
| Node | schedulable state (to revert cordon/drain) |
| HPA | minReplicas, maxReplicas |
Example of a RemediationPlan with rollback executed:
apiVersion: platform.chatcli.io/v1alpha1
kind: RemediationPlan
metadata:
  name: api-gateway-plan-1
  namespace: production
status:
  state: RolledBack              # Action failed, automatic rollback executed
  result: "Action AdjustResources (index 1) failed: invalid memory_limit | Rollback: Rolled back production/api-gateway: replicas: 5 → 3; container app: memory_limit=1Gi | Post-rollback: resource healthy"
  rollbackPerformed: true
  rollbackResult: "Rolled back production/api-gateway: replicas: 5 → 3; container app: memory_limit=1Gi"
  postFailureHealthy: true       # Resource returned to normal after rollback
  preflightSnapshot:
    resourceKind: Deployment
    resourceName: api-gateway
    namespace: production
    replicas: 3
    containerImages:
      app: "ghcr.io/myorg/api-gateway:v2.1.0"
    containerResources:
      app:
        cpuRequest: "200m"
        cpuLimit: "1000m"
        memoryRequest: "256Mi"
        memoryLimit: "512Mi"
    hpaMinReplicas: 2
    hpaMaxReplicas: 8
    capturedAt: "2026-02-16T10:31:00Z"
  actionCheckpoints:
    - actionIndex: 0
      actionType: ScaleDeployment
      success: true
      timestamp: "2026-02-16T10:31:05Z"
    - actionIndex: 1
      actionType: AdjustResources
      success: false
      timestamp: "2026-02-16T10:31:10Z"
  evidence:
    - type: preflight_snapshot
      data: "Structured snapshot captured: kind=Deployment replicas=3 containers=1"
      timestamp: "2026-02-16T10:31:00Z"
    - type: action_completed
      data: "Action ScaleDeployment executed successfully"
      timestamp: "2026-02-16T10:31:05Z"
    - type: action_failed
      data: "Action AdjustResources failed: invalid memory_limit format"
      timestamp: "2026-02-16T10:31:10Z"
    - type: rollback
      data: "Rolled back production/api-gateway: replicas: 5 → 3; container app: memory_limit=1Gi"
      timestamp: "2026-02-16T10:31:11Z"
Automatic rollback restores the state prior to remediation; it does not fix the original issue. After the rollback, the IssueReconciler evaluates whether there are remaining attempts and triggers re-analysis with failure context, so the AI receives what failed and suggests a different strategy.
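
To make the snapshot-and-restore idea concrete, the sketch below compares a preflight snapshot with the live state and lists what a rollback would revert. It is an illustration under assumptions: `plan_rollback` is a hypothetical helper, the dictionaries reuse the preflightSnapshot field names from the example above, and container resource requests and limits are left out for brevity.

```python
def plan_rollback(snapshot: dict, current: dict) -> list:
    """List the fields that differ from the pre-remediation snapshot."""
    changes = []
    for field in ("replicas", "hpaMinReplicas", "hpaMaxReplicas"):
        if field in snapshot and current.get(field) != snapshot[field]:
            changes.append(f"{field}: {current.get(field)} → {snapshot[field]}")
    for container, image in snapshot.get("containerImages", {}).items():
        live = current.get("containerImages", {}).get(container)
        if live != image:
            changes.append(f"container {container}: image {live} → {image}")
    return changes
```

An empty result means the live resource already matches the snapshot and nothing needs to be reverted.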
[Diagram: complete flow on failure]

Status fields added to RemediationPlan:
| Field | Type | Description |
| --- | --- | --- |
| preflightSnapshot | ResourceSnapshot | Complete resource state before any action |
| actionCheckpoints | []ActionCheckpoint | Checkpoint before each action with result (success/fail) |
| rollbackPerformed | bool | Whether automatic rollback was executed |
| rollbackResult | string | Description of what was reverted (replicas, images, resources) |
| postFailureHealthy | *bool | Whether the resource is healthy after rollback |

Action Types (54 types)

Workload:
| Type | Description | Parameters |
| --- | --- | --- |
| ScaleDeployment | Adjusts the number of replicas | replicas |
| RestartDeployment | Rollout restart of the deployment | |
| RollbackDeployment | Undoes rollout (previous, healthy, or specific revision) | toRevision (optional: previous, healthy, or number) |
| PatchConfig | Updates keys of a ConfigMap | configmap, key=value |
| AdjustResources | Adjusts CPU/memory requests/limits for containers | memory_limit, memory_request, cpu_limit, cpu_request, container |
| DeletePod | Removes the sickest pod (CrashLoop > restarts) | pod (optional; auto-selects the sickest) |
| RestartStatefulSetPod | Restart of StatefulSet pod (preserves identity/storage) | pod (optional; omit for rolling restart of entire StatefulSet) |

GitOps:
| Type | Description | Parameters |
| --- | --- | --- |
| HelmRollback | Rollback of Helm release to previous revision | revision (optional; default: previous) |
| ArgoSyncApp | Trigger sync on ArgoCD Application | revision (optional; default: HEAD) |

Autoscaling:
| Type | Description | Parameters |
| --- | --- | --- |
| AdjustHPA | Modifies min/max replicas or target utilization of HPA | minReplicas, maxReplicas, targetCPUUtilization |

Infrastructure:
| Type | Description | Parameters |
| --- | --- | --- |
| CordonNode | Marks node as unschedulable | node |
| DrainNode | Cordon + evict pods from node | node |

Storage:
| Type | Description | Parameters |
| --- | --- | --- |
| ResizePVC | Expands PVC (expansion only, not reduction) | pvc, size (e.g., 20Gi) |

Security:
| Type | Description | Parameters |
| --- | --- | --- |
| RotateSecret | Updates Secret values or copies from source | secret, sourceSecret or key=value |

Networking:
| Type | Description | Parameters |
| --- | --- | --- |
| UpdateIngress | Modifies backend or annotations of Ingress | ingress, backendService, backendPort, annotation.* |
| PatchNetworkPolicy | Adds allowed ports to NetworkPolicy | networkPolicy, allowPort, protocol |

Advanced:
| Type | Description | Parameters |
| --- | --- | --- |
| ApplyManifest | Applies JSON manifest from a ConfigMap | configmap, key |
| ExecDiagnostic | Executes a command from a read-only allowlist inside a pod | command (exact string; see allowlist) |
| Custom | Custom action (blocked by safety checks) | |

ExecDiagnostic Allowlist

ExecDiagnostic does an exact-string match against a read-only command allowlist. Any variation (different flags, alternate host, etc.) is rejected with command "..." not in approved diagnostic commands whitelist. Default approved commands (~90):
| Category | Commands |
| --- | --- |
| Process / shell | env, whoami, id, hostname, pwd, uname -a, uname -r, ps aux, ps -ef, top -b -n1 |
| Filesystem / resources | df -h, df -i, free -m, free -h, mount, uptime, ls -la /, ls -la /tmp, ls -la /var/log, du -sh /tmp, du -sh /var/log |
| Cgroups v2 (modern pod) | cat /sys/fs/cgroup/memory.max, memory.current, memory.events, memory.stat, cpu.max, cpu.stat |
| Cgroups v1 (legacy pod) | cat /sys/fs/cgroup/memory/memory.{limit_in_bytes,usage_in_bytes,oom_control,stat}, cat /sys/fs/cgroup/cpu/cpu.{cfs_quota_us,cfs_period_us,stat} |
| /proc introspection | cat /proc/1/{cgroup,status,limits,cmdline,environ}, cat /proc/{meminfo,cpuinfo,loadavg,version}, cat /proc/net/{tcp,udp,sockstat} |
| Network (read-only) | netstat -tlnp/-an/-rn, ss -tlnp/-an/-s, ip addr, ip -s link, ip route, ip -6 route, ip neigh, ifconfig, arp -a |
| DNS / resolver | cat /etc/{hosts,resolv.conf,nsswitch.conf}, nslookup/getent/dig/host kubernetes.default.svc.cluster.local, nslookup kube-dns.kube-system.svc.cluster.local |
| Health / metrics / pprof | curl -s localhost{,:8080}/{health,healthz,ready,readyz,live,livez}, curl -s localhost:{8080,8081,9090,9091}/metrics, curl -s localhost:9090/-/{ready,healthy}, curl -s localhost:6060/debug/pprof/{,goroutine,heap}?debug=1 |
| Envoy / Istio sidecar | curl -s localhost:15000/ready, curl -s localhost:{15020,15021}/healthz/ready |
| wget fallback (Alpine) | wget -qO- http://localhost{,:8080}/{health,healthz,metrics} |
| TCP reachability | nc -zv kubernetes.default.svc.cluster.local 443, nc -zv kube-dns.kube-system.svc.cluster.local 53 |
Extend via the CHATCLI_ALLOWED_DIAGNOSTIC_COMMANDS env var (comma-separated, read once at startup):
spec:
  extraEnv:
    - name: CHATCLI_ALLOWED_DIAGNOSTIC_COMMANDS
      value: "dig +short redis.default.svc.cluster.local, nc -zv redis.default.svc.cluster.local 6379"
Each entry is matched exactly: nslookup <other-host> is rejected unless that full string is on the allowlist. If you need a specific host, add the full command string to the env var.
The AI is given this allowlist in the remediation prompt (server/handler_analysis.go) and is instructed to pick the right command per symptom: memory.events for OOM, cpu.stat for CPU throttling, getent/dig for DNS, pprof for stuck Go apps, nc -zv for external dependency reachability.
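The exact-match rule and the env var extension can be illustrated with a short Python sketch. It is not the operator's code: `DEFAULT_ALLOWED` holds only a handful of the defaults, `load_allowlist` and `check_command` are hypothetical names, and trimming whitespace around each comma-separated entry is an assumption about how the variable is parsed.

```python
import os

ENV_VAR = "CHATCLI_ALLOWED_DIAGNOSTIC_COMMANDS"

# Small illustrative subset of the default allowlist listed above.
DEFAULT_ALLOWED = {
    "env",
    "df -h",
    "cat /sys/fs/cgroup/memory.events",
    "nslookup kubernetes.default.svc.cluster.local",
}


def load_allowlist(environ=None):
    """Merge the defaults with comma-separated extras from the env var.

    Assumption: whitespace around each entry is trimmed and empty entries
    are ignored. Intended to be called once, at startup.
    """
    environ = os.environ if environ is None else environ
    extras = environ.get(ENV_VAR, "")
    allowed = set(DEFAULT_ALLOWED)
    allowed.update(entry.strip() for entry in extras.split(",") if entry.strip())
    return frozenset(allowed)


def check_command(command, allowed):
    """Return None when the command is approved, else the rejection message."""
    if command in allowed:  # exact string comparison: no prefix or glob matching
        return None
    return f'command "{command}" not in approved diagnostic commands whitelist'
```

With this shape, `df -h` passes while `df -h /data` is rejected, which is why a host-specific command has to be added as a complete string.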
StatefulSet:
| Type | Description | Parameters |
| --- | --- | --- |
| ScaleStatefulSet | Ordered replica scaling | replicas |
| RestartStatefulSet | Rolling restart via annotation | |
| RollbackStatefulSet | Rollback via ControllerRevision | toRevision |
| AdjustStatefulSetResources | Adjusts CPU/memory | container, memory_limit, cpu_limit |
| DeleteStatefulSetPod | Deletes specific or unhealthiest pod | pod (optional) |
| ForceDeleteStatefulSetPod | Force-delete stuck pod (grace=0) | pod (REQUIRED) |
| UpdateStatefulSetStrategy | Changes updateStrategy | type, maxUnavailable |
| RecreateStatefulSetPVC | Deletes stuck PVC | pvc, confirm=true |
| PartitionStatefulSetUpdate | Canary partition | partition |
DaemonSet:
| Type | Description | Parameters |
| --- | --- | --- |
| RestartDaemonSet | Rolling restart across all nodes | |
| RollbackDaemonSet | Rollback via ControllerRevision | toRevision |
| AdjustDaemonSetResources | Adjusts CPU/memory | container, memory_limit, cpu_limit |
| DeleteDaemonSetPod | Deletes pod (optionally on specific node) | pod, node (optional) |
| UpdateDaemonSetStrategy | Changes update strategy | type, maxUnavailable, maxSurge |
| PauseDaemonSetRollout | Pauses rollout (maxUnavailable=0) | |
| CordonAndDeleteDaemonSetPod | Cordons node + deletes pod | node (REQUIRED) |
Job:
| Type | Description | Parameters |
| --- | --- | --- |
| RetryJob | Deletes failed Job + recreates | |
| AdjustJobResources | Adjusts CPU/memory on template | container, memory_limit, cpu_limit |
| DeleteFailedJob | Cleans up failed Job | |
| SuspendJob | Pauses Job (suspend=true) | |
| ResumeJob | Resumes Job (suspend=false) | |
| AdjustJobParallelism | Changes parallelism | parallelism |
| AdjustJobDeadline | Changes deadline | activeDeadlineSeconds |
| AdjustJobBackoffLimit | Changes backoff limit | backoffLimit |
| ForceDeleteJobPods | Force-deletes all Job pods | |
CronJob:
| Type | Description | Parameters |
| --- | --- | --- |
| SuspendCronJob | Pauses scheduling | |
| ResumeCronJob | Resumes scheduling | |
| TriggerCronJob | Creates Job immediately | |
| AdjustCronJobResources | Adjusts CPU/memory on jobTemplate | container, memory_limit, cpu_limit |
| AdjustCronJobSchedule | Changes schedule | schedule |
| AdjustCronJobDeadline | Changes deadline | startingDeadlineSeconds |
| AdjustCronJobHistory | Changes history limits | successfulJobsHistoryLimit, failedJobsHistoryLimit |
| AdjustCronJobConcurrency | Changes concurrency policy | concurrencyPolicy |
| DeleteCronJobActiveJobs | Kills running Jobs | |
| ReplaceCronJobTemplate | Replaces template from ConfigMap | configmap, key |

RemediationPlan Examples with New Actions

apiVersion: platform.chatcli.io/v1alpha1
kind: RemediationPlan
metadata:
  name: checkout-helm-rollback-plan-1
  namespace: production
spec:
  issueRef:
    name: checkout-helm-release-failed-123
  attempt: 1
  strategy: "Rollback Helm release to previous stable revision"
  actions:
    - type: HelmRollback
      params:
        revision: "41"    # specific revision (omit for previous)

Runbook (Manual or Auto-generated)

Operational procedures. Manual Runbooks have priority over everything. When there is no manual Runbook, the AI automatically generates a reusable Runbook CR from the suggested actions.
apiVersion: platform.chatcli.io/v1alpha1
kind: Runbook
metadata:
  name: high-error-rate-deployment
  namespace: production
spec:
  description: "Standard procedure for high error rate incidents on Deployments"
  trigger:
    signalType: error_rate
    severity: high
    resourceKind: Deployment
  steps:
    - name: Scale up
      action: ScaleDeployment
      description: "Increase replicas to absorb the error spike"
      params:
        replicas: "4"
    - name: Rollback
      action: RollbackDeployment
      description: "Revert to previous stable version if scaling doesn't help"
  maxAttempts: 3

RemediationPlan (Agentic Mode)

When there is neither a manual Runbook nor a set of AI-suggested actions, the operator creates an agentic plan. The AI acts as an agent with Kubernetes skills in an observe-decide-act loop:
apiVersion: platform.chatcli.io/v1alpha1
kind: RemediationPlan
metadata:
  name: api-gateway-pod-restart-plan-1
  namespace: production
spec:
  issueRef:
    name: api-gateway-pod-restart-1771276354
  attempt: 1
  strategy: "Agentic AI remediation"
  agenticMode: true
  agenticMaxSteps: 10
  agenticHistory:
    - stepNumber: 1
      aiMessage: "High restart count with OOMKilled. Scaling up to reduce memory pressure."
      action:
        type: ScaleDeployment
        params:
          replicas: "5"
      observation: "SUCCESS: ScaleDeployment executed successfully"
    - stepNumber: 2
      aiMessage: "Pods still restarting. Adjusting memory limits."
      action:
        type: AdjustResources
        params:
          memory_limit: "1Gi"
          memory_request: "512Mi"
      observation: "SUCCESS: AdjustResources executed successfully"
    - stepNumber: 3
      aiMessage: "All pods running stable. Issue resolved."
status:
  state: Completed
  agenticStepCount: 3
  agenticStartedAt: "2026-02-16T10:31:00Z"
Safety guards: the loop runs for at most 10 steps (configurable via agenticMaxSteps) and times out after 10 minutes. If an action fails, the observation reports "FAILED: error" and the loop continues; the AI receives the feedback and adapts.
On agentic resolution: The operator automatically generates:
  1. PostMortem CR with timeline, root cause, impact, lessons learned
  2. Reusable Runbook CR with successful steps (label source=agentic)
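The observe-decide-act loop and its two safety guards can be sketched as follows. This is a simplified Python model rather than the operator's implementation: `decide` stands in for the AI call, `execute` for the action executor, and the `TimedOut` and `MaxStepsReached` return values are hypothetical labels (only `Completed` appears in the example above).

```python
import time


def run_agentic_loop(decide, execute, max_steps=10, timeout_seconds=600,
                     clock=time.monotonic):
    """Run an observe-decide-act loop bounded by a step limit and a deadline.

    decide(history) returns (ai_message, action), where action is None once
    the AI considers the issue resolved. execute(action) raises on failure.
    """
    history = []
    deadline = clock() + timeout_seconds
    for step in range(1, max_steps + 1):
        if clock() >= deadline:
            return "TimedOut", history
        message, action = decide(history)  # the AI sees all prior observations
        entry = {"stepNumber": step, "aiMessage": message}
        if action is None:
            history.append(entry)
            return "Completed", history
        entry["action"] = action
        try:
            execute(action)
            entry["observation"] = f"SUCCESS: {action['type']} executed successfully"
        except Exception as err:
            # A failed action does not abort the loop; it becomes feedback.
            entry["observation"] = f"FAILED: {err}"
        history.append(entry)
    return "MaxStepsReached", history
```

Because failures are recorded as observations instead of raised, the next `decide` call can pick a different action based on what went wrong.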

PostMortem (Auto-generated)

Incident report generated automatically after any remediation resolution (standard or agentic). It contains the complete incident history (detection, analysis, executed actions, resolution) plus metrics, git correlation, cascade chain, recurring-incident trending, and a developer feedback field.
apiVersion: platform.chatcli.io/v1alpha1
kind: PostMortem
metadata:
  name: pm-api-gateway-pod-restart-1771276354
  namespace: production
spec:
  issueRef:
    name: api-gateway-pod-restart-1771276354
  resource:
    kind: Deployment
    name: api-gateway
    namespace: production
  severity: high
status:
  state: Open              # Open | InReview | Closed
  summary: "OOMKilled containers caused cascading restarts on api-gateway"
  rootCause: "Memory limit (512Mi) insufficient for current workload pattern"
  impact: "Service degradation for 5 minutes, 30% error rate increase"
  timeline:
    - timestamp: "2026-02-16T10:30:00Z"
      type: detected
      detail: "Issue detected: pod_restart on api-gateway"
    - timestamp: "2026-02-16T10:31:00Z"
      type: action_executed
      detail: "ScaleDeployment to 5 replicas"
    - timestamp: "2026-02-16T10:31:35Z"
      type: action_executed
      detail: "AdjustResources memory_limit=1Gi"
    - timestamp: "2026-02-16T10:32:10Z"
      type: resolved
      detail: "All pods stable, issue resolved"
  lessonsLearned:
    - "Memory limits should account for peak workload patterns"
    - "Set up HPA to auto-scale on memory pressure"
  preventionActions:
    - "Configure HPA with min 3 replicas for api-gateway"
    - "Set memory limit to 1Gi across all environments"
  duration: "2m10s"
  generatedAt: "2026-02-16T10:32:10Z"
  # New enrichment fields (automatically populated by the operator)
  metricSnapshots:
    - name: "memory_usage"
      value: "498000000"
      timestamp: "2026-02-16T10:30:00Z"
      phase: "during"
    - name: "memory_usage"
      value: "312000000"
      timestamp: "2026-02-16T10:35:00Z"
      phase: "after"
  blastRadius:
    - resource:
        kind: Service
        name: api-gateway-svc
        namespace: production
      impact: "5xx responses during pod restarts"
      severity: "high"
  gitCorrelation:
    commitSHA: "a1b2c3d4e5f6"
    commitMessage: "feat: add webhook handler for notifications"
    author: "dev@team.com"
    timestamp: "2026-02-16T09:15:00Z"
    confidence: 0.82
    filesChanged:
      - "internal/webhook/handler.go"
      - "internal/webhook/handler_test.go"
  trending:
    occurrenceCount: 3
    windowDays: 30
    relatedPostMortems:
      - "pm-api-gateway-oom-20260205"
      - "pm-api-gateway-oom-20260210"
    pattern: "Recurring oom_kill on Deployment/api-gateway (3 occurrences in 30 days)"
  gitOpsContext: "Helm release 'api-gateway' chart=api-gw version=2.1.0 status=deployed revision=15"
  logAnalysisSummary: "1 Go panic stack trace; 8 critical error patterns (resource/connectivity)"
  cascadeChain:
    - "production/api-gateway(root_cause)"
    - "production/frontend(victim)"
  # Developer feedback (filled manually after review)
  feedback:
    overrideRootCause: ""          # empty = agrees with AI analysis
    remediationAccuracy: 4         # 1-5 scale
    comments: "Good analysis, but could have suggested AdjustResources before the restart"
    providedBy: "sre@team.com"
    providedAt: "2026-02-17T09:00:00Z"

PostMortem Status Fields

| Field | Type | Description |
| --- | --- | --- |
| state | PostMortemState | State: Open, InReview, Closed |
| summary | string | AI-generated incident summary |
| rootCause | string | Root cause determined by AI |
| impact | string | Incident impact |
| timeline | []TimelineEvent | Timeline (detected, analyzed, action_executed, resolved) |
| actionsExecuted | []ActionRecord | Executed actions with result |
| lessonsLearned | []string | Lessons learned |
| preventionActions | []string | Suggested preventive actions |
| duration | string | Total incident duration |
| generatedAt | Time | When the PostMortem was generated |
| reviewedAt | Time | When the PostMortem was reviewed by a human |
| metricSnapshots | []MetricSnapshot | Prometheus metrics captured before/during/after the incident |
| blastRadius | []BlastRadiusEntry | Services and resources impacted by the incident |
| gitCorrelation | GitCorrelation | Suspect commit correlated with the incident (SHA, author, files, confidence) |
| sliImpact | string | Impact on SLIs and error budgets |
| errorBudgetBurned | float64 | Percentage of error budget consumed |
| trending | TrendingInfo | Recurring pattern information (count, window, related PostMortems) |
| feedback | DevFeedback | Human feedback (root cause override, accuracy 1-5, comments) |
| gitOpsContext | string | Helm/ArgoCD/Flux state at the time of the incident |
| logAnalysisSummary | string | Summary of log analysis findings |
| cascadeChain | []string | Cascade failure chain if applicable |

Runbook Matching (Tiered)

Tier 1: SignalType + Severity + ResourceKind (exact match, preferred)
Tier 2: Severity + ResourceKind (fallback when signal doesn't match)
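The two tiers can be read as two passes over the available Runbooks. The Python sketch below illustrates that reading; `Trigger`, `Runbook`, and `match_runbook` are hypothetical names, and breaking ties by list order is an assumption, since the tie-breaking rule is not stated here.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Trigger:
    signal_type: str
    severity: str
    resource_kind: str


@dataclass(frozen=True)
class Runbook:
    name: str
    trigger: Trigger


def match_runbook(runbooks: Sequence[Runbook], signal_type: str,
                  severity: str, resource_kind: str) -> Optional[Runbook]:
    """Return the best-matching Runbook, or None when neither tier matches."""
    # Tier 1: exact match on SignalType + Severity + ResourceKind.
    for rb in runbooks:
        t = rb.trigger
        if (t.signal_type, t.severity, t.resource_kind) == (signal_type, severity, resource_kind):
            return rb
    # Tier 2: fall back to Severity + ResourceKind when no signal matches.
    for rb in runbooks:
        t = rb.trigger
        if (t.severity, t.resource_kind) == (severity, resource_kind):
            return rb
    return None
```

Running Tier 1 to completion before Tier 2 ensures an exact match always wins, even when a looser match appears earlier in the list.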

Remediation Priority

1. Existing manual Runbook (tiered match)
2. AI auto-generated Runbook (materialized as reusable CR)
3. Agentic AI remediation (observe-decide-act loop, generates PostMortem + Runbook)
4. Escalation (only when agentic fails after max attempts)

SourceRepository (Code-Aware Diagnostics)

Links a Kubernetes workload to its source code repository. When configured, the AI receives code context during incident analysis: recent commits correlated with the timestamp, code snippets referenced in stack traces, and configuration files (Dockerfile, values.yaml).
Input validation (hardening): spec.url only accepts the https://, ssh:// or git@host:path forms, and spec.branch is restricted to [A-Za-z0-9._/-] with no leading - — closing the git argument-injection vector (e.g. --upload-pack). Reads of code referenced by stack traces are confined to the repository clone via os.Root (no traversal, no symlink escape). Out-of-pattern URLs or branches fail the sync with an explicit status error.
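As an illustration of these rules, the Python sketch below validates a URL and a branch name. The regular expressions are one plausible reading of the constraints above, not the operator's actual patterns, and `is_allowed_url` and `is_allowed_branch` are hypothetical names.

```python
import re

# Branch: characters from [A-Za-z0-9._/-], and the first one must not be "-".
BRANCH_RE = re.compile(r"[A-Za-z0-9._/][A-Za-z0-9._/-]*")
# URL forms: https://..., ssh://..., or scp-like git@host:path (assumed shapes).
SCHEME_RE = re.compile(r"(https|ssh)://\S+")
SCP_LIKE_RE = re.compile(r"git@[A-Za-z0-9.-]+:\S+")


def is_allowed_url(url: str) -> bool:
    """Accept only the three documented URL forms."""
    return bool(SCHEME_RE.fullmatch(url) or SCP_LIKE_RE.fullmatch(url))


def is_allowed_branch(branch: str) -> bool:
    """Reject empty names, a leading '-', and any character outside the set."""
    return BRANCH_RE.fullmatch(branch) is not None
```

Rejecting a leading `-` matters because git would otherwise parse a value such as `--upload-pack=...` as a command-line option rather than a branch name.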
apiVersion: platform.chatcli.io/v1alpha1
kind: SourceRepository
metadata:
  name: api-gateway-repo
  namespace: production
spec:
  url: "https://github.com/myorg/api-gateway.git"
  branch: main
  authType: token
  secretRef: git-token       # Secret with key "token"
  resource:
    kind: Deployment
    name: api-gateway
    namespace: production
  paths: ["cmd/", "internal/"]
  dockerfile: "Dockerfile"
  language: "Go"
  syncIntervalMinutes: 30
---
apiVersion: v1
kind: Secret
metadata:
  name: git-token
  namespace: production
type: Opaque
stringData:
  token: "ghp_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
What the operator does with SourceRepository:
  1. Shallow clone of the repository (depth 50) and periodic sync
  2. Indexes detected languages, entrypoints (main.go, app.py, etc.), config files
  3. Temporal correlation: finds commits within 30 min before the incident
  4. Suspect commit: identifies the most likely commit to have caused the problem
  5. Code extraction: when stack traces reference files, extracts the relevant snippets
  6. Feed to AI: all context is included in the analysis prompt
The repository is cloned locally on the operator pod. For private repos, create a Secret with the key corresponding to the chosen authType and reference it in secretRef. The operator supports HTTPS repos (token/basic) and SSH (ssh-key).
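The temporal correlation step (commits within 30 minutes before the incident) can be sketched as a simple window filter. This Python example is illustrative only: the commit dicts and `commits_in_window` are hypothetical, and how the operator ranks candidates to pick the suspect commit is not described here, so the sketch just orders them newest first.

```python
from datetime import datetime, timedelta, timezone


def commits_in_window(commits, incident_at, window=timedelta(minutes=30)):
    """Return commits made within `window` before the incident, newest first."""
    start = incident_at - window
    hits = [c for c in commits if start <= c["timestamp"] <= incident_at]
    return sorted(hits, key=lambda c: c["timestamp"], reverse=True)


# Example input: hypothetical commits around an incident at 10:30 UTC.
incident = datetime(2026, 2, 16, 10, 30, tzinfo=timezone.utc)
example_commits = [
    {"sha": "a", "timestamp": datetime(2026, 2, 16, 10, 5, tzinfo=timezone.utc)},
    {"sha": "b", "timestamp": datetime(2026, 2, 16, 9, 15, tzinfo=timezone.utc)},
    {"sha": "c", "timestamp": datetime(2026, 2, 16, 10, 20, tzinfo=timezone.utc)},
    {"sha": "d", "timestamp": datetime(2026, 2, 16, 10, 45, tzinfo=timezone.utc)},
]
candidates = commits_in_window(example_commits, incident)
```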

Correlation Engine

The correlation engine groups anomalies into issues using:

Risk Scoring

Each signal type has a weight:
| Signal | Weight |
| --- | --- |
| oom_kill | 30 |
| error_rate | 25 |
| deploy_failing | 25 |
| latency_spike | 20 |
| pod_restart | 20 |
| pod_not_ready | 20 |
The risk score is the sum of correlated anomaly weights (maximum 100).
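As a worked example, the capped sum can be written in a few lines of Python. The weights come from the table above; `risk_score` is a hypothetical helper, and treating unknown signal types as weight 0 is an assumption.

```python
SIGNAL_WEIGHTS = {
    "oom_kill": 30,
    "error_rate": 25,
    "deploy_failing": 25,
    "latency_spike": 20,
    "pod_restart": 20,
    "pod_not_ready": 20,
}


def risk_score(signal_types):
    """Sum the weights of the correlated anomalies, capped at 100."""
    total = sum(SIGNAL_WEIGHTS.get(signal, 0) for signal in signal_types)
    return min(total, 100)
```

For instance, an Issue correlating pod_restart, error_rate, and latency_spike scores 20 + 25 + 20 = 65, while all six signals together sum to 140 and are capped at 100.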

Severity Classification

| Risk Score | Severity |
| --- | --- |
| >= 80 | Critical |
| >= 60 | High |
| >= 40 | Medium |
| < 40 | Low |
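The thresholds translate directly into a small function. The Python sketch below is illustrative; `classify_severity` is a hypothetical name.

```python
def classify_severity(score: int) -> str:
    """Map a risk score (0-100) to the severity bands listed above."""
    if score >= 80:
        return "Critical"
    if score >= 60:
        return "High"
    if score >= 40:
        return "Medium"
    return "Low"
```

A score of 65 therefore lands in High, and a score of 90 in Critical.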

Grouping

  • Anomalies on the same resource (deployment + namespace) within the same time window are grouped into the same Issue
  • Incident ID is deterministic: hash of resource + signal type (prevents duplicates)

WatcherBridge

The WatcherBridge is the component that connects the ChatCLI server to the operator:
  • Polling: Queries GetAlerts from the server every 30 seconds
  • Discovery: Locates the server via Instance CRs (first Instance with a ready gRPC endpoint)
  • Dedup: SHA256 hash of type+deployment+namespace (no temporal component — a continuous problem generates only one Anomaly). 2-hour TTL
  • Dedup invalidation: When an Issue reaches a terminal state (Resolved/Escalated), dedup entries for the resource are removed, allowing immediate recurrence detection
  • Pruning: Removes expired hashes automatically (> 2h)
  • Creation: Converts alerts to Anomaly CRs with valid K8s names
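The dedup, invalidation, and pruning behaviour can be modelled with a small in-memory cache. This Python sketch is not the operator's code: `DedupCache` and its methods are hypothetical, and the `|` separator used when building the hash input is an assumption, since only the hashed fields (type, deployment, namespace) are stated above.

```python
import hashlib

TTL_SECONDS = 2 * 60 * 60  # 2-hour TTL


class DedupCache:
    def __init__(self, ttl=TTL_SECONDS):
        self.ttl = ttl
        self._seen = {}  # hash -> (first_seen, deployment, namespace)

    @staticmethod
    def key(alert_type, deployment, namespace):
        # No timestamp in the key: a continuous problem maps to one hash.
        raw = f"{alert_type}|{deployment}|{namespace}"
        return hashlib.sha256(raw.encode()).hexdigest()

    def prune(self, now):
        """Drop entries older than the TTL."""
        self._seen = {k: v for k, v in self._seen.items() if now - v[0] <= self.ttl}

    def should_create(self, alert_type, deployment, namespace, now):
        """Return True when a new Anomaly should be created for this alert."""
        self.prune(now)
        k = self.key(alert_type, deployment, namespace)
        if k in self._seen:
            return False
        self._seen[k] = (now, deployment, namespace)
        return True

    def invalidate(self, deployment, namespace):
        """Forget a resource once its Issue reaches a terminal state."""
        self._seen = {k: v for k, v in self._seen.items()
                      if (v[1], v[2]) != (deployment, namespace)}
```

Invalidation is what lets a problem that recurs right after resolution be detected immediately instead of waiting for the TTL to expire.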

Usage Examples

apiVersion: platform.chatcli.io/v1alpha1
kind: Instance
metadata:
  name: chatcli-simple
spec:
  provider: OPENAI
  apiKeys:
    name: chatcli-api-keys

Status and Monitoring

kubectl get instances
NAME            READY   REPLICAS   PROVIDER    AGE
chatcli-aiops   true    1          CLAUDEAI    5m
kubectl get issues -A
NAME                                    SEVERITY   STATE         RISK   AGE
api-gateway-pod-restart-1771276354      high       Remediating   65     2m
worker-oom-kill-3847291023              critical   Analyzing     90     30s
kubectl get aiinsights -A
NAME                                           ISSUE                                   PROVIDER   CONFIDENCE   AGE
api-gateway-pod-restart-1771276354-insight      api-gateway-pod-restart-1771276354      CLAUDEAI   0.87         1m
kubectl get remediationplans -A
NAME                                          ISSUE                                   ATTEMPT   STATE       AGE
api-gateway-pod-restart-1771276354-plan-1      api-gateway-pod-restart-1771276354      1         Completed   1m
kubectl get postmortems -A
NAME                                          ISSUE                                   SEVERITY   STATE   AGE
pm-api-gateway-pod-restart-1771276354         api-gateway-pod-restart-1771276354      high       Open    30s
kubectl get anomalies -A
NAME                                               SIGNAL        SOURCE    SEVERITY   AGE
watcher-highrestartcount-api-gateway-1234567890     pod_restart   watcher   warning    3m
watcher-oomkilled-worker-9876543210                 oom_kill      watcher   critical   1m

Development

cd operator

# Build
go build ./...

# Tests (130 functions, 185 with subtests)
go test ./... -v

# Docker (must be built from the repository root)
docker build -f operator/Dockerfile -t myregistry/chatcli-operator:dev .

# Deploy via Helm (recommended)
helm install chatcli-operator ../deploy/helm/chatcli-operator/ \
  --namespace chatcli-system --create-namespace \
  --set image.repository=myregistry/chatcli-operator \
  --set image.tag=dev

# Or deploy via kubectl (alternative)
make deploy IMG=myregistry/chatcli-operator:dev

Security

The Operator implements multiple security layers by default, following the fail-closed principle (deny by default):

REST API Authentication

The REST API is fail-closed by default: unless dev mode is explicitly enabled, no request is served without authentication. Every request must include a valid X-API-Key header with a mapped role (viewer/operator/admin). API keys are loaded in the following priority order and hot-reloaded every 30 seconds:
  1. Secret chatcli-operator-secrets (priority) — api-keys field containing a YAML list of {key, role, description} entries
  2. ConfigMap chatcli-operator-config (fallback) — same api-keys field
  3. Reject the request (or accept in dev-mode if CHATCLI_OPERATOR_DEV_MODE=true)
There are two distinct Secrets in this project: do not confuse chatcli-operator-secrets with the LLM provider keys consumed by the chatcli server (chatcli-api-keys, referenced via Instance.spec.apiKeys.name). See the comparison table in Security: Operator Authentication.
The chatcli-operator-secrets Secret must live in the same namespace as the operator pod (the controller resolves it via the POD_NAMESPACE env var / ServiceAccount namespace file, falling back to chatcli-system). If you ran helm install --namespace <X>, create the Secret in <X>.
Changes to API keys — in either the Secret or the ConfigMap — are picked up automatically every 30s. No operator restart is needed.
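The lookup order can be sketched as below. This Python example rests on assumptions: `resolve_role` is a hypothetical helper, the ConfigMap is consulted only when the Secret yields no entries (one possible reading of "fallback"), the constant-time comparison is a common hardening choice rather than a documented behaviour, and the dev-mode branch is left out.

```python
import hmac


def resolve_role(presented_key, secret_entries, configmap_entries):
    """Return the role mapped to the presented key, or None to reject.

    Each entry is a dict shaped like {"key": ..., "role": ..., "description": ...}.
    """
    if not presented_key:
        return None  # missing X-API-Key header: fail closed
    # Secret has priority; ConfigMap is used only when the Secret is empty.
    entries = secret_entries if secret_entries else configmap_entries
    for entry in entries or []:
        if hmac.compare_digest(entry["key"].encode(), presented_key.encode()):
            return entry["role"]
    return None  # unknown key: fail closed
```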

Resource Type Allowlist

The Operator classifies Kubernetes resource types into two categories:
Pods, Deployments, StatefulSets, DaemonSets, Services, ConfigMaps, Ingresses, Jobs, CronJobs, ReplicaSets, Endpoints, PersistentVolumeClaims, HorizontalPodAutoscalers, NetworkPolicies, ServiceAccounts, Namespaces, Events.

Log Scrubbing

Before sending application logs to the LLM for analysis, the Operator removes 18 sensitive patterns, including:
  • JWT/Bearer tokens, API keys, passwords
  • Email addresses, internal IPs, URLs with credentials
  • Credit card numbers, SSNs, PEM certificates
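To make the idea concrete, here is a minimal Python sketch of pattern-based scrubbing. It covers only a few of the categories above, and the regular expressions and replacement markers are illustrative assumptions, not the operator's 18 patterns.

```python
import re

# (pattern, replacement) pairs, applied in order. Illustrative subset only.
SCRUB_PATTERNS = [
    # Bearer tokens in headers or log lines.
    (re.compile(r"(?i)\bbearer\s+[A-Za-z0-9._~+/=-]+"), "Bearer [REDACTED]"),
    # Bare JWTs: three base64url segments, the first starting with "eyJ".
    (re.compile(r"\beyJ[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+"), "[REDACTED_JWT]"),
    # key=value style secrets.
    (re.compile(r"(?i)\b(password|passwd|api[_-]?key)\s*[=:]\s*\S+"), r"\1=[REDACTED]"),
    # Credentials embedded in URLs (scheme://user:pass@host).
    (re.compile(r"://[^/\s:@]+:[^/\s@]+@"), "://[REDACTED]@"),
    # Email addresses.
    (re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"), "[REDACTED_EMAIL]"),
]


def scrub(line: str) -> str:
    """Apply every pattern in order and return the redacted line."""
    for pattern, replacement in SCRUB_PATTERNS:
        line = pattern.sub(replacement, line)
    return line
```

Order matters in this sketch: URL credentials are redacted before the email pattern runs, so `user:pass@host` is not partially matched as an address.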

TLS and RBAC

  • TLS 1.3 required on all ChatCLI server connections
  • ClusterRoles with least privilege (read-only by default)
  • NetworkPolicy configurable to restrict network traffic to the Operator namespace
  • H5 hardening — no runtime RBAC escalation: the operator never creates or mutates ClusterRole/ClusterRoleBinding at runtime. Shared ClusterRoles (chatcli-watcher, chatcli-role-*) are pre-provisioned by the Helm chart, and the operator’s SA holds the bind verb restricted to those exact names via resourceNames. A compromised operator cannot reference a more-privileged ClusterRole from a new ClusterRoleBinding.

Audit

  • AuditEvent CRD for immutable audit trail (append-only)
  • Structured logs with Request ID for correlation
  • Integration with CHATCLI_AUDIT_LOG_PATH via extraEnv
spec:
  extraEnv:
    - name: CHATCLI_AGENT_SECURITY_MODE
      value: "strict"
    - name: CHATCLI_AUDIT_LOG_PATH
      value: "/var/log/chatcli/audit.jsonl"
In strict mode, agent security blocks any cluster write operations not on the allowlist. This is recommended for production environments.

Next Steps

  • AIOps Platform: Deep-dive into the AIOps architecture
  • K8s Watcher: Collection and budget details
  • Server Mode: GetAlerts and AnalyzeIssue RPCs
  • K8s Monitoring: Recipe: K8s Monitoring with AI