Kubernetes Operator (AIOps)

The ChatCLI Operator goes beyond instance management. It implements a complete AIOps platform that autonomously detects anomalies, correlates signals, requests AI analysis, and executes remediation — all without external dependencies beyond the LLM provider. The platform supports Deployments, StatefulSets, DaemonSets, Jobs, and CronJobs, integrates with Helm, ArgoCD, and Flux for GitOps-aware remediation, analyzes application logs with stack trace extraction (Java, Go, Python, Node.js), correlates Prometheus metrics with incidents, and allows linking source code repositories for code-aware diagnostics.

API Group and CRDs

The operator uses the API group platform.chatcli.io/v1alpha1 with 17 Custom Resource Definitions:

CRD	Short Name	Description
Instance	`inst`	ChatCLI server instance (Deployment, Service, RBAC, PVC)
Anomaly	`anom`	Raw signal from the K8s Watcher (restarts, OOM, deploy failures)
Issue	`iss`	Correlated incident grouping multiple anomalies
AIInsight	`ai`	AI-generated root cause analysis with enriched context (logs, metrics, code, GitOps)
RemediationPlan	`rp`	Concrete actions to resolve the problem (runbook or agentic AI)
Runbook	`rb`	Manual operational procedures (optional)
PostMortem	`pm`	Auto-generated incident report after resolution (all modes)
SourceRepository	`srcrepo`	Links workloads to git repositories for code-aware diagnostics
NotificationPolicy	`np`	Multi-channel notification routing with throttling and templates
EscalationPolicy	`ep`	Tiered escalation chains with timeouts (L1→L2→L3)
ServiceLevelObjective	`slo`	SLO with multi-window burn rate alerting (Google SRE model)
IncidentSLA	`sla`	Response/resolution SLA targets per severity with business hours
ApprovalPolicy	`ap`	Auto/manual/quorum approval policies with change windows
ApprovalRequest	`ar`	Approval workflow with blast radius assessment
ClusterRegistration	`cr`	Multi-cluster federation with kubeconfig and health checks
AuditEvent	`ae`	Immutable audit trail (append-only)
ChaosExperiment	`chaos`	Chaos engineering experiments with 7 types and safety checks

For detailed documentation on each v2 CRD (NotificationPolicy, EscalationPolicy, SLO, SLA, ApprovalPolicy, ApprovalRequest, ClusterRegistration, AuditEvent, ChaosExperiment), see the AIOps Platform sub-pages.

Operator Installation

A single command installs everything: 17 CRDs + RBAC + Deployment + Service + Dashboard.

Via OCI Registry (recommended)
Via local path (if you cloned the repo)

Install directly from GHCR — no need to clone the repository:

helm install chatcli-operator \
  oci://ghcr.io/diillson/charts/chatcli-operator \
  --namespace chatcli-system \
  --create-namespace

To pin a specific version:

helm install chatcli-operator \
  oci://ghcr.io/diillson/charts/chatcli-operator \
  --version 1.139.0 \
  --namespace chatcli-system \
  --create-namespace

helm install chatcli-operator ./deploy/helm/chatcli-operator/ \
  --namespace chatcli-system \
  --create-namespace

The operator chart (chatcli-operator) is separate from the server chart (chatcli). The operator manages the controllers and AIOps dashboard. The server is deployed via Instance CR or the chatcli chart with watcher enabled.

Configurable values

# With Prometheus for incident metrics
helm install chatcli-operator \
    oci://ghcr.io/diillson/charts/chatcli-operator \
  --namespace chatcli-system --create-namespace \
  --set prometheusUrl="http://prometheus-server.monitoring.svc:9090"

# With custom image
helm install chatcli-operator \
    oci://ghcr.io/diillson/charts/chatcli-operator \
  --namespace chatcli-system --create-namespace \
  --set image.repository=myregistry/chatcli-operator \
  --set image.tag=1.139.0

# With ServiceMonitor for Prometheus Operator
helm install chatcli-operator \
    oci://ghcr.io/diillson/charts/chatcli-operator \
  --namespace chatcli-system --create-namespace \
  --set serviceMonitor.enabled=true

Value	Default	Description
`image.repository`	`ghcr.io/diillson/chatcli-operator`	Operator image
`image.tag`	`latest`	Image tag
`replicaCount`	`1`	Replicas (leader election enabled by default)
`api.port`	`8090`	Web dashboard and REST API port
`prometheusUrl`	`""`	Prometheus URL for incident metrics collection
`leaderElect`	`true`	Leader election for HA
`serviceMonitor.enabled`	`false`	Create Prometheus ServiceMonitor

Manual installation via kubectl (alternative)

kubectl apply -f operator/config/crd/bases/
kubectl apply -f operator/config/rbac/role.yaml
kubectl apply -f operator/config/manager/manager.yaml

Build via Docker (optional)

# Build from the repo root
docker build -f operator/Dockerfile -t myregistry/chatcli-operator:dev .

# Or via Make
cd operator
make docker-build IMG=myregistry/chatcli-operator:dev
make docker-push IMG=myregistry/chatcli-operator:dev

AIOps Platform Architecture

Autonomous Pipeline

Phase	Component	What It Does
1. Detection	WatcherBridge	Queries `GetAlerts` from the server every 30s. Creates Anomaly CRs (dedup SHA256). Invalidates dedup when Issue reaches terminal state.
2. Correlation	AnomalyReconciler + CorrelationEngine	Groups anomalies by resource + time window. Calculates risk score and severity. Creates/updates Issue CRs with `signalType`.
3. Analysis	AIInsightReconciler + 6 enrichers	Collects K8s context (Deployments, StatefulSets, DaemonSets, Jobs, CronJobs, HPAs), advanced log analysis (stack traces Java/Go/Python/Node.js, 24+ error patterns), Prometheus metrics (CPU/mem/latency trends), GitOps (Helm/ArgoCD/Flux status), source code (commit↔incident correlation), cascade analysis (cross-service).
4. Remediation	IssueReconciler	AI-validated runbook selection: (a) finds ALL candidate runbooks (multi-runbook per trigger), (b) AI validates each against root cause (`RUNBOOK_APPROVED: name` or `RUNBOOK_REJECTED`), (c) if rejected or no candidates, generates new runbook from AI suggestions (with unique hash per root cause), or (d) agentic remediation (AI acts step-by-step).
5. Execution	RemediationReconciler	54 action types: workload (Scale, Restart, Rollback, AdjustResources, DeletePod, RestartStatefulSetPod), GitOps (HelmRollback, ArgoSyncApp), autoscaling (AdjustHPA), infra (CordonNode, DrainNode), storage (ResizePVC), security (RotateSecret), networking (UpdateIngress, PatchNetworkPolicy), advanced (ApplyManifest, ExecDiagnostic), statefulset (ScaleStatefulSet, RestartStatefulSet, RollbackStatefulSet, AdjustStatefulSetResources, DeleteStatefulSetPod, ForceDeleteStatefulSetPod, UpdateStatefulSetStrategy, RecreateStatefulSetPVC, PartitionStatefulSetUpdate), daemonset (RestartDaemonSet, RollbackDaemonSet, AdjustDaemonSetResources, DeleteDaemonSetPod, UpdateDaemonSetStrategy, PauseDaemonSetRollout, CordonAndDeleteDaemonSetPod), job (RetryJob, AdjustJobResources, DeleteFailedJob, SuspendJob, ResumeJob, AdjustJobParallelism, AdjustJobDeadline, AdjustJobBackoffLimit, ForceDeleteJobPods), cronjob (SuspendCronJob, ResumeCronJob, TriggerCronJob, AdjustCronJobResources, AdjustCronJobSchedule, AdjustCronJobDeadline, AdjustCronJobHistory, AdjustCronJobConcurrency, DeleteCronJobActiveJobs, ReplaceCronJobTemplate). Blast radius prediction before execution.
6. Resolution	IssueReconciler	Success -> Resolved (invalidates dedup). Failure -> re-analysis with failure context (different strategy) -> up to maxAttempts -> Escalated.
7. PostMortem	IssueReconciler	All remediations (not just agentic) generate PostMortem CR with timeline, root cause, lessons, metrics, git correlation, cascade chain, trending (recurring incidents), dev feedback. Successful remediations also generate reusable Runbooks (one per root cause, hash-based naming).

Issue State Machine

Create Secret with API Keys

Before creating an Instance, you need a Secret with the LLM provider API keys. The Instance references this Secret via apiKeys.name — without it, the server cannot call the AI.

kubectl create secret generic chatcli-api-keys \
  --namespace chatcli-system \
  --from-literal=OPENAI_API_KEY="sk-your-key-here"

kubectl create secret generic chatcli-api-keys \
  --namespace chatcli-system \
  --from-literal=ANTHROPIC_API_KEY="sk-ant-your-key-here"

kubectl create secret generic chatcli-api-keys \
  --namespace chatcli-system \
  --from-literal=GOOGLEAI_API_KEY="your-key-here"

kubectl create secret generic chatcli-api-keys \
  --namespace chatcli-system \
  --from-literal=OPENROUTER_API_KEY="sk-or-your-key-here"

kubectl create secret generic chatcli-api-keys \
  --namespace chatcli-system \
  --from-literal=OPENAI_API_KEY="sk-xxx" \
  --from-literal=ANTHROPIC_API_KEY="sk-ant-xxx" \
  --from-literal=GOOGLEAI_API_KEY="xxx" \
  --from-literal=OPENROUTER_API_KEY="sk-or-xxx"

apiVersion: v1
kind: Secret
metadata:
  name: chatcli-api-keys
  namespace: chatcli-system
type: Opaque
stringData:
  OPENAI_API_KEY: "sk-your-key-here"
  # ANTHROPIC_API_KEY: "sk-ant-xxx"
  # GOOGLEAI_API_KEY: "xxx"
  # OPENROUTER_API_KEY: "sk-or-xxx"

The Secret must exist in the same namespace as the Instance CR. The Secret name must match the apiKeys.name field in the Instance spec. Without this Secret, the server starts but cannot execute AI analysis or agentic remediation.

CRD: Instance

The Instance manages ChatCLI server instances in the cluster.

Complete Specification

apiVersion: platform.chatcli.io/v1alpha1
kind: Instance
metadata:
  name: chatcli-prod
  namespace: chatcli          # The namespace must exist before creating the Instance
spec:
  replicas: 1
  provider: CLAUDEAI       # OPENAI, OPENAI_ASSISTANT, CLAUDEAI, BEDROCK, GOOGLEAI, XAI, ZAI, MINIMAX, MOONSHOT, OPENROUTER, STACKSPOT, OLLAMA, COPILOT, GITHUB_MODELS
  model: claude-sonnet-4-6

  image:
    repository: ghcr.io/diillson/chatcli
    tag: latest
    pullPolicy: IfNotPresent

  server:
    port: 50051
    tls:
      enabled: true
      secretName: chatcli-tls   # Must contain tls.crt, tls.key AND ca.crt (self-signed: ca.crt=tls.crt) — see cookbook §2.1
    token:
      name: chatcli-auth
      key: token

  watcher:
    enabled: true
    interval: "30s"
    window: "2h"
    maxLogLines: 100
    maxContextChars: 32000
    targets:
      - name: api-gateway
        namespace: production
        metricsPort: 9090
        metricsFilter: ["http_requests_*", "http_request_duration_*"]
      - name: auth-service
        namespace: production
        metricsPort: 9090
      - name: worker
        namespace: batch
      - name: postgres                  # Monitor a StatefulSet
        kind: StatefulSet
        namespace: production
      - name: fluentd-agent             # Monitor a DaemonSet
        kind: DaemonSet
        namespace: logging
      - name: etl-pipeline              # Monitor a CronJob
        kind: CronJob
        namespace: data

  resources:
    requests:
      cpu: 100m
      memory: 128Mi
    limits:
      cpu: 500m
      memory: 512Mi

  persistence:
    enabled: true
    size: 1Gi
    storageClassName: standard

  securityContext:
    runAsNonRoot: true
    runAsUser: 1000
    seccompProfile:
      type: RuntimeDefault

  apiKeys:
    name: chatcli-api-keys

  server:
    security:
      jwtSecretRef:
        name: chatcli-jwt
        key: secret
      rateLimitRps: 20
      # bindAddress: "0.0.0.0"  # Optional — auto-detected in Kubernetes

  extraEnv:
    - name: CHATCLI_AGENT_SECURITY_MODE
      value: "strict"
    - name: CHATCLI_AUDIT_LOG_PATH
      value: "/var/log/chatcli/audit.jsonl"

Spec Fields

Root

Field	Type	Required	Default	Description
`replicas`	int32	No	`1`	Number of server replicas
`provider`	string	Yes		LLM provider
`model`	string	No		LLM model
`image`	ImageSpec	No		Image configuration
`server`	ServerSpec	No		gRPC server configuration
`watcher`	WatcherSpec	No		K8s Watcher configuration
`resources`	ResourceRequirements	No		CPU and memory requests/limits
`persistence`	PersistenceSpec	No		Session persistence
`securityContext`	PodSecurityContext	No	nonroot/1000	Pod security context
`fallback`	FallbackSpec	No		LLM provider failover chain
`apiKeys`	SecretRefSpec	No		Secret with API keys (all providers in fallback chain)
`aiops`	AIOpsSpec	No		Autonomous incident management pipeline configuration

AIOpsSpec

Configures the automatic remediation pipeline. All fields are optional with sensible defaults. AI auto-generated runbooks inherit maxRemediationAttempts from this configuration.

Field	Type	Required	Default	Range	Description
`maxRemediationAttempts`	int32	No	`5`	1-10	Maximum remediation attempts before escalating to human
`resolutionCooldownMinutes`	int32	No	`10`	0-120	Minutes after resolving before accepting new anomalies for the same resource
`dedupTTLMinutes`	int32	No	`60`	5-1440	How long (min) the dedup cache retains alert hashes
`enableAutoResolve`	bool	No	`true`		Auto-resolve Escalated issues when the resource recovers
`agenticMaxSteps`	int32	No	`10`	3-30	Maximum steps per agentic remediation attempt (each step = 1 AI call)

spec:
  aiops:
    maxRemediationAttempts: 5
    resolutionCooldownMinutes: 10
    dedupTTLMinutes: 60
    enableAutoResolve: true
    agenticMaxSteps: 10

In agentic mode, the postmortem includes the full AI reasoning for each step — which action was chosen, why, and the observed result. This ensures complete audit trail of autonomous AI decisions.

FallbackSpec

Configures automatic failover between LLM providers. When the primary provider fails (rate limit, timeout, server error), the system automatically tries the next provider in the chain.

Field	Type	Required	Default	Description
`enabled`	bool	Yes		Activates the fallback chain
`providers`	[]FallbackProviderEntry	Yes		Ordered list of fallback providers (first = highest priority)
`maxRetries`	int32	No	`2`	Retries per provider before moving to next
`cooldownBase`	string	No	`"30s"`	Initial cooldown after failure (exponential backoff)
`cooldownMax`	string	No	`"5m"`	Maximum cooldown duration

FallbackProviderEntry

Field	Type	Required	Description
`name`	string	Yes	Provider name: OPENAI, OPENAI_ASSISTANT, CLAUDEAI, BEDROCK, GOOGLEAI, XAI, ZAI, MINIMAX, MOONSHOT, OPENROUTER, STACKSPOT, OLLAMA, COPILOT, GITHUB_MODELS
`model`	string	No	LLM model for this provider

The primary provider (spec.provider) is always tried first. Providers in fallback.providers are tried in order when the primary fails. The Secret in apiKeys must contain API keys for all providers in the chain.

WatcherSpec

Field	Type	Required	Default	Description
`enabled`	bool	No	`false`	Enables the watcher
`targets`	[]WatchTargetSpec	No		List of resources to monitor (multi-target)
`deployment`	string	No		Single deployment (legacy)
`namespace`	string	No		Deployment namespace (legacy)
`interval`	string	No	`"30s"`	Collection interval
`window`	string	No	`"2h"`	Observation window
`maxLogLines`	int32	No	`100`	Max log lines per pod
`maxContextChars`	int32	No	`32000`	LLM context budget

WatchTargetSpec

Field	Type	Required	Default	Description
`name`	string	Yes*		Resource name to monitor (e.g., `postgres`, `fluentd`)
`deployment`	string	No		Deprecated alias for `name` — kept for backward compatibility
`kind`	string	No	`Deployment`	Resource kind: `Deployment`, `StatefulSet`, `DaemonSet`, `Job`, `CronJob`
`namespace`	string	Yes		Resource namespace
`metricsPort`	int32	No	`0`	Prometheus port (0 = disabled)
`metricsPath`	string	No	`/metrics`	Prometheus endpoint path
`metricsFilter`	[]string	No		Glob filters for metrics

Use name + kind to monitor any Kubernetes workload type. When kind is omitted, it defaults to Deployment. The legacy deployment field still works as alias for name. Examples:

targets:
  - name: api-gateway             # Deployment (default kind)
    namespace: production
  - name: postgres                # StatefulSet (database)
    kind: StatefulSet
    namespace: production
  - name: fluentd                 # DaemonSet (logging agent)
    kind: DaemonSet
    namespace: logging
  - name: etl-pipeline            # CronJob (scheduled batch)
    kind: CronJob
    namespace: data

The AIOps pipeline will automatically use resource-specific remediation actions (e.g., ScaleStatefulSet, RestartDaemonSet, SuspendCronJob) based on the detected resource kind.

Resources Created by Instance

Resource	Name	Description
Deployment	`<name>`	ChatCLI server pods
Service	`<name>`	gRPC Service (automatic headless when replicas > 1 for client-side LB)
ConfigMap	`<name>`	Environment variables (provider, model, etc.)
ConfigMap	`<name>-watch-config`	Multi-target YAML (if `targets` defined)
ServiceAccount	`<name>`	Identity for RBAC
Role	`<name>-watcher`	K8s watcher permissions (single-namespace)
RoleBinding	`<name>-watcher`	SA to Role binding (single-namespace)
ClusterRoleBinding	`<namespace>-<name>-watcher`	SA binding to the shared ClusterRole (multi-namespace)
PVC	`<name>-sessions`	Persistence (if enabled)

gRPC Load Balancing

gRPC uses persistent HTTP/2 connections that pin to a single pod via kube-proxy, leaving extra replicas idle.

1 replica (default): Standard ClusterIP Service
Multiple replicas: Headless Service (ClusterIP: None) is created automatically, enabling client-side round-robin via gRPC dns:/// resolver
Keepalive: WatcherBridge pings every 30s (5s timeout) to detect inactive pods quickly. The server accepts pings with a minimum interval of 20s (EnforcementPolicy.MinTime)
Transition: When scaling from 1 to 2+ replicas (or back), the operator deletes and recreates the Service automatically (ClusterIP is immutable in Kubernetes)

Automatic RBAC

Same namespace (all targets in the same namespace as the Instance): Creates per-Instance Role + RoleBinding
Cross-namespace (targets in a different namespace than the Instance, or in multiple namespaces): Creates only a per-Instance ClusterRoleBinding pointing at the shared chatcli-watcher ClusterRole (pre-provisioned by the Helm chart / kustomize overlay)
On CR deletion, the finalizer removes the ClusterRoleBinding; the shared ClusterRole stays (owned by the release)

As of v1.139.0, the operator no longer creates ClusterRole resources at runtime (H5 hardening). Shared ClusterRoles — chatcli-watcher for the watcher and chatcli-role-{viewer,operator,admin,superadmin} for platform roles — are installed by the operator Helm chart. The operator’s ServiceAccount carries the bind verb restricted to those exact names via resourceNames, preventing privilege escalation even if the operator is compromised.

Upgrading from v1.105.0: clusters with pre-existing multi-namespace Instances had a ClusterRoleBinding pointing to a per-Instance ClusterRole (legacy shape). Because roleRef is immutable in Kubernetes, a direct helm upgrade used to freeze the reconcile with cannot change roleRef. As of v1.139.0, the operator detects the divergent roleRef at the top of reconcileClusterRBAC, deletes the stale binding, and recreates it pointing at chatcli-watcher — transparent migration, no manual intervention.

Server Image and Auto-Resolution

The server image tag (spec.image.tag) follows a three-step priority:

Explicit pin in spec.image.tag — honored verbatim (GitOps-friendly).
Omitted — the operator resolves it from the CHATCLI_OPERATOR_APP_VERSION env var, which the Helm chart injects automatically from .Chart.AppVersion. Effect: helm upgrade chatcli-operator rolls the server of every Instance that opted into auto-resolution, with no per-Instance patch.
Fallback — latest when neither is present (e.g., make deploy without Helm).

apiVersion: platform.chatcli.io/v1alpha1
kind: Instance
metadata:
  name: chatcli-prod
spec:
  image:
    repository: ghcr.io/diillson/chatcli
    # tag omitted → inherits chart appVersion (recommended for Helm-upgrade environments)
  provider: CLAUDEAI
  model: claude-sonnet-4-6
  replicas: 2

For environments that want immutable versioning, keep spec.image.tag pinned and manage upgrades manually. For environments that want “helm upgrade = full upgrade,” omit the tag.

Auto-Rollout on Configuration Changes

The operator monitors changes in ConfigMaps and Secrets referenced by the Instance and triggers rolling updates automatically via hash annotations on the PodTemplate:

Annotation	Source	When It Changes
`chatcli.io/watch-config-hash`	ConfigMap `<name>-watch-config`	Watcher targets changed
`chatcli.io/configmap-hash`	ConfigMap `<name>`	Environment variables updated
`chatcli.io/secret-hash`	Secret referenced in `apiKeys.name`	API keys created or updated
`chatcli.io/tls-hash`	Secret referenced in `server.tls.secretName`	TLS certificates renewed

Adding/removing targets in watcher.targets and applying the Instance causes automatic rollout. Creating or updating the API keys Secret and renewing TLS certificates also trigger rollout automatically.

Secret and ConfigMap Observation

The operator watches (Watches) Secrets in the Instance namespace. When a Secret referenced in apiKeys.name or server.tls.secretName is created or updated, the reconciler is triggered automatically — even if the Secret did not exist when the Instance was created.

ConfigMap and Secret envFrom: Marked as optional: true, allowing the Instance to be created before the Secret/ConfigMap
Flexible deploy order: Namespace -> Instance -> Secret/ConfigMap (any order after the namespace)

AIOps Platform CRDs

Anomaly

Represents a raw signal detected by the WatcherBridge.

apiVersion: platform.chatcli.io/v1alpha1
kind: Anomaly
metadata:
  name: watcher-highrestartcount-api-gateway-1234567890
  namespace: production
spec:
  signalType: pod_restart    # pod_restart | oom_kill | pod_not_ready | deploy_failing | error_rate | latency_spike
  source: watcher            # watcher | prometheus | manual
  severity: warning          # critical | high | medium | low | warning
  resource:
    kind: Deployment
    name: api-gateway
    namespace: production
  description: "HighRestartCount on api-gateway: container app restarted 8 times"
  detectedAt: "2026-02-16T10:30:00Z"
status:
  correlated: true
  issueRef:
    name: api-gateway-pod-restart-1771276354

Anomaly Spec Fields

Field	Type	Description
`signalType`	AnomalySignalType	Type of detected signal
`source`	AnomalySource	Detection origin (watcher, prometheus, manual)
`severity`	IssueSeverity	Signal severity
`resource`	ResourceRef	Affected K8s resource (kind, name, namespace)
`description`	string	Human-readable description of the problem
`detectedAt`	Time	Detection timestamp

Signals Detected (21 types)

Watcher signals:

AlertType (Server)	SignalType (Anomaly)	Description
`HighRestartCount`	`pod_restart`	Pod with many restarts (CrashLoopBackOff)
`OOMKilled`	`oom_kill`	Container terminated due to lack of memory
`PodNotReady`	`pod_not_ready`	Pod is not in the Ready state
`DeploymentFailing`	`deploy_failing`	Deployment with Available=False

Additional signals (via Prometheus, webhooks, or internal detection):

SignalType	Description
`error_rate`	Elevated HTTP error rate
`latency`	Latency above threshold
`cpu_high`	Elevated CPU usage
`memory_high`	Elevated memory usage
`disk_pressure`	Node with DiskPressure condition (disk full or nearly full)
`node_not_ready`	Node with NotReady condition (kubelet unresponsive, network or hardware failure)
`memory_pressure`	Node with MemoryPressure condition (insufficient memory for new pods)
`pid_pressure`	Node with PIDPressure condition (excessive processes, fork bomb risk)
`network_unavailable`	Node with network unavailable (CNI failure or interface down)
`pvc_pending`	PVC in Pending state
`ingress_error`	Ingress controller errors
`hpa_maxed`	HPA at maximum replicas
`job_failed`	Job failed
`cronjob_missed`	CronJob missed its schedule
`certificate_expiring`	TLS certificate expiring
`image_pull_error`	Error pulling container image
`crashloop_backoff`	Pod in CrashLoopBackOff
`helm_release_failed`	Helm release in failed state
`argocd_degraded`	ArgoCD Application degraded
`config_drift`	Configuration drift detected

Node Monitoring

The watcher automatically monitors the health of nodes where target pods are running. On each collection cycle, it:

Identifies nodes via label selector from the target’s pods
Collects all 5 official Kubernetes conditions: Ready, DiskPressure, MemoryPressure, PIDPressure, NetworkUnavailable
Collects node CPU/memory metrics (via metrics server)
Counts active pods vs node pod capacity
Checks if the node is cordoned (unschedulable)

Condition	Severity	Signal	Available Action
Node NotReady	CRITICAL	`node_not_ready`	`CordonNode`, `DrainNode`
DiskPressure	CRITICAL	`disk_pressure`	`CordonNode`, `DrainNode`
MemoryPressure	CRITICAL	`memory_high`	`CordonNode`, `DrainNode`
PIDPressure	WARNING	`node_not_ready`	`CordonNode`
NetworkUnavailable	CRITICAL	`node_not_ready`	`CordonNode`, `DrainNode`
Cordoned (Unschedulable)	WARNING	`node_not_ready`	Informational
Pod capacity >90%	WARNING	`node_not_ready`	`CordonNode`

Node information is included in the AI analysis context, enabling root cause correlation with infrastructure problems (e.g., “OOMKill caused by MemoryPressure on node X”).

Issue

Correlated incident that groups anomalies and manages the remediation lifecycle.

apiVersion: platform.chatcli.io/v1alpha1
kind: Issue
metadata:
  name: api-gateway-pod-restart-1771276354
  namespace: production
spec:
  severity: high
  source: watcher
  signalType: pod_restart        # Propagated from Anomaly for tiered Runbook matching
  description: "Correlated incident: pod_restart on api-gateway"
  resource:
    kind: Deployment
    name: api-gateway
    namespace: production
  riskScore: 65
  correlatedAnomalies:
    - name: watcher-highrestartcount-api-gateway-1234567890
    - name: watcher-oomkilled-api-gateway-1234567891
status:
  state: Analyzing          # Detected | Analyzing | Remediating | Resolved | Escalated | Failed
  remediationAttempts: 0
  maxRemediationAttempts: 5  # default: 5, configurable via Instance aiops.maxRemediationAttempts
  detectedAt: "2026-02-16T10:30:00Z"
  conditions:
    - type: Analyzing
      status: "True"
      reason: AIInsightCreated

Issue States

State	Description
`Detected`	Newly created issue, awaiting analysis
`Analyzing`	AIInsight created, awaiting AI response (or re-analysis with failure context)
`Remediating`	RemediationPlan in execution
`Resolved`	Successful remediation (dedup invalidated for recurrence detection)
`Escalated`	Max attempts reached or no available actions (dedup invalidated)
`Failed`	Terminal failure

AIInsight

AI-generated root cause analysis with suggested actions for automatic remediation.

apiVersion: platform.chatcli.io/v1alpha1
kind: AIInsight
metadata:
  name: api-gateway-pod-restart-1771276354-insight
  namespace: production
spec:
  issueRef:
    name: api-gateway-pod-restart-1771276354
  provider: CLAUDEAI
  model: claude-sonnet-4-6
status:
  analysis: "High restart count caused by OOMKilled. Container memory limit (512Mi) is insufficient for the current workload pattern."
  confidence: 0.87
  recommendations:
    - "Increase memory limit to 1Gi"
    - "Investigate possible memory leak in the application"
    - "Monitor GC pressure metrics"
  suggestedActions:
    - name: "Restart deployment"
      action: RestartDeployment
      description: "Restart pods to reclaim leaked memory immediately"
    - name: "Scale up replicas"
      action: ScaleDeployment
      description: "Add more replicas to distribute memory pressure"
      params:
        replicas: "4"
  generatedAt: "2026-02-16T10:31:00Z"

AIInsight Status Fields

Field	Type	Description
`analysis`	string	AI-generated root cause analysis
`confidence`	float64	Analysis confidence level (0.0-1.0)
`recommendations`	[]string	Human-readable recommendations
`suggestedActions`	[]SuggestedAction	Structured actions for automatic remediation
`generatedAt`	Time	When the analysis was generated

SuggestedAction

Field	Type	Description
`name`	string	Human-readable action name
`action`	string	Action type (54 action types available — see Action Types)
`description`	string	Explanation of why this action is needed
`params`	map[string]string	Action parameters (e.g., `replicas: "4"`)

RemediationPlan

Concrete remediation plan automatically generated from a Runbook or AI actions.

apiVersion: platform.chatcli.io/v1alpha1
kind: RemediationPlan
metadata:
  name: api-gateway-pod-restart-1771276354-plan-1
  namespace: production
spec:
  issueRef:
    name: api-gateway-pod-restart-1771276354
  attempt: 1
  strategy: "Attempt 1 (AI-generated): High restart count caused by OOMKilled"
  actions:
    - type: RestartDeployment
    - type: ScaleDeployment
      params:
        replicas: "4"
  safetyConstraints:
    - "No delete operations"
    - "No destructive changes"
    - "Rollback on failure"
status:
  state: Completed           # Pending | Executing | Completed | Failed | RolledBack
  result: "Deployment restarted and scaled to 4 replicas successfully"
  startedAt: "2026-02-16T10:31:30Z"
  completedAt: "2026-02-16T10:32:15Z"

Automatic Rollback and State Protection

The operator implements an automatic rollback system that ensures unsuccessful remediations do not leave the cluster in a worse state than before. Before executing any action, the complete resource state is captured in a structured restorable snapshot.

Pre-Remediation Snapshot

Before the first action, the RollbackEngine captures a structured ResourceSnapshot with: replicas, container images, CPU/memory requests and limits, HPA state (min/max replicas), and node state (schedulable/unschedulable). Works for Deployments, StatefulSets, DaemonSets, Nodes, and HPAs.

Per-Action Checkpoint

In plans with multiple actions, an ActionCheckpoint is captured before each individual action. This makes it possible to know exactly which action modified what and at which point the plan failed.

Automatic Rollback on Action Failure

If any action fails during execution, the operator automatically restores the resource to the PreflightSnapshot state. Replicas, images, resource requests/limits, and HPA state are reverted. The plan transitions to RolledBack state (not Failed).

Rollback on Verification Timeout

If all actions execute successfully but the resource does not become healthy within 90 seconds (verification timeout), the operator also performs automatic rollback to the pre-remediation state.

Post-Failure Health Check

After the rollback, the operator verifies whether the resource returned to a healthy state (PostFailureHealthy). This information is recorded in the plan status for auditing and retry decisions.

What is captured per resource type:

Resource	Captured Fields
Deployment	replicas, container images, CPU/memory requests+limits, restart annotation
StatefulSet	replicas, container images, CPU/memory requests+limits
DaemonSet	container images, CPU/memory requests+limits
Node	schedulable state (to revert cordon/drain)
HPA	minReplicas, maxReplicas

Example of a RemediationPlan with rollback executed:

apiVersion: platform.chatcli.io/v1alpha1
kind: RemediationPlan
metadata:
  name: api-gateway-plan-1
  namespace: production
status:
  state: RolledBack              # Action failed, automatic rollback executed
  result: "Action AdjustResources (index 1) failed: invalid memory_limit | Rollback: Rolled back production/api-gateway: replicas: 5 → 3; container app: memory_limit=1Gi | Post-rollback: resource healthy"
  rollbackPerformed: true
  rollbackResult: "Rolled back production/api-gateway: replicas: 5 → 3; container app: memory_limit=1Gi"
  postFailureHealthy: true       # Resource returned to normal after rollback
  preflightSnapshot:
    resourceKind: Deployment
    resourceName: api-gateway
    namespace: production
    replicas: 3
    containerImages:
      app: "ghcr.io/myorg/api-gateway:v2.1.0"
    containerResources:
      app:
        cpuRequest: "200m"
        cpuLimit: "1000m"
        memoryRequest: "256Mi"
        memoryLimit: "512Mi"
    hpaMinReplicas: 2
    hpaMaxReplicas: 8
    capturedAt: "2026-02-16T10:31:00Z"
  actionCheckpoints:
    - actionIndex: 0
      actionType: ScaleDeployment
      success: true
      timestamp: "2026-02-16T10:31:05Z"
    - actionIndex: 1
      actionType: AdjustResources
      success: false
      timestamp: "2026-02-16T10:31:10Z"
  evidence:
    - type: preflight_snapshot
      data: "Structured snapshot captured: kind=Deployment replicas=3 containers=1"
      timestamp: "2026-02-16T10:31:00Z"
    - type: action_completed
      data: "Action ScaleDeployment executed successfully"
      timestamp: "2026-02-16T10:31:05Z"
    - type: action_failed
      data: "Action AdjustResources failed: invalid memory_limit format"
      timestamp: "2026-02-16T10:31:10Z"
    - type: rollback
      data: "Rolled back production/api-gateway: replicas: 5 → 3; container app: memory_limit=1Gi"
      timestamp: "2026-02-16T10:31:11Z"

Automatic rollback restores the state prior to remediation, it does not fix the original issue. After the rollback, the IssueReconciler evaluates whether there are remaining attempts and triggers re-analysis with failure context — the AI receives what failed and suggests a different strategy.

Complete flow on failure: Status fields added to RemediationPlan:

Field	Type	Description
`preflightSnapshot`	ResourceSnapshot	Complete resource state before any action
`actionCheckpoints`	[]ActionCheckpoint	Checkpoint before each action with result (success/fail)
`rollbackPerformed`	bool	Whether automatic rollback was executed
`rollbackResult`	string	Description of what was reverted (replicas, images, resources)
`postFailureHealthy`	*bool	Whether the resource is healthy after rollback

Action Types (54 types)

Workload:

Type	Description	Parameters
`ScaleDeployment`	Adjusts the number of replicas	`replicas`
`RestartDeployment`	Rollout restart of the deployment	—
`RollbackDeployment`	Undoes rollout (previous, healthy, or specific revision)	`toRevision` (optional: `previous`, `healthy`, or number)
`PatchConfig`	Updates keys of a ConfigMap	`configmap`, `key=value`
`AdjustResources`	Adjusts CPU/memory requests/limits for containers	`memory_limit`, `memory_request`, `cpu_limit`, `cpu_request`, `container`
`DeletePod`	Removes the sickest pod (CrashLoop > restarts)	`pod` (optional — auto-selects the sickest)
`RestartStatefulSetPod`	Restart of StatefulSet pod (preserves identity/storage)	`pod` (optional — omit for rolling restart of entire StatefulSet)

GitOps:

Type	Description	Parameters
`HelmRollback`	Rollback of Helm release to previous revision	`revision` (optional — default: previous)
`ArgoSyncApp`	Trigger sync on ArgoCD Application	`revision` (optional — default: HEAD)

Autoscaling:

Type	Description	Parameters
`AdjustHPA`	Modifies min/max replicas or target utilization of HPA	`minReplicas`, `maxReplicas`, `targetCPUUtilization`

Infrastructure:

Type	Description	Parameters
`CordonNode`	Marks node as unschedulable	`node`
`DrainNode`	Cordon + evict pods from node	`node`

Storage:

Type	Description	Parameters
`ResizePVC`	Expands PVC (expansion only, not reduction)	`pvc`, `size` (e.g., `20Gi`)

Security:

Type	Description	Parameters
`RotateSecret`	Updates Secret values or copies from source	`secret`, `sourceSecret` or `key=value`

Networking:

Type	Description	Parameters
`UpdateIngress`	Modifies backend or annotations of Ingress	`ingress`, `backendService`, `backendPort`, `annotation.*`
`PatchNetworkPolicy`	Adds allowed ports to NetworkPolicy	`networkPolicy`, `allowPort`, `protocol`

Advanced:

Type	Description	Parameters
`ApplyManifest`	Applies JSON manifest from a ConfigMap	`configmap`, `key`
`ExecDiagnostic`	Executes a command from a read-only allowlist inside a pod	`command` (exact string — see allowlist)
`Custom`	Custom action (blocked by safety checks)	—

ExecDiagnostic Allowlist

ExecDiagnostic does an exact-string match against a read-only command allowlist. Any variation (different flags, alternate host, etc.) is rejected with command "..." not in approved diagnostic commands whitelist. Default approved commands (~90):

Category	Commands
Process / shell	`env`, `whoami`, `id`, `hostname`, `pwd`, `uname -a`, `uname -r`, `ps aux`, `ps -ef`, `top -b -n1`
Filesystem / resources	`df -h`, `df -i`, `free -m`, `free -h`, `mount`, `uptime`, `ls -la /`, `ls -la /tmp`, `ls -la /var/log`, `du -sh /tmp`, `du -sh /var/log`
Cgroups v2 (modern pod)	`cat /sys/fs/cgroup/memory.max`, `memory.current`, `memory.events`, `memory.stat`, `cpu.max`, `cpu.stat`
Cgroups v1 (legacy pod)	`cat /sys/fs/cgroup/memory/memory.{limit_in_bytes,usage_in_bytes,oom_control,stat}`, `cat /sys/fs/cgroup/cpu/cpu.{cfs_quota_us,cfs_period_us,stat}`
/proc introspection	`cat /proc/1/{cgroup,status,limits,cmdline,environ}`, `cat /proc/{meminfo,cpuinfo,loadavg,version}`, `cat /proc/net/{tcp,udp,sockstat}`
Network (read-only)	`netstat -tlnp`/`-an`/`-rn`, `ss -tlnp`/`-an`/`-s`, `ip addr`, `ip -s link`, `ip route`, `ip -6 route`, `ip neigh`, `ifconfig`, `arp -a`
DNS / resolver	`cat /etc/{hosts,resolv.conf,nsswitch.conf}`, `nslookup/getent/dig/host kubernetes.default.svc.cluster.local`, `nslookup kube-dns.kube-system.svc.cluster.local`
Health / metrics / pprof	`curl -s localhost{,:8080}/{health,healthz,ready,readyz,live,livez}`, `curl -s localhost:{8080,8081,9090,9091}/metrics`, `curl -s localhost:9090/-/{ready,healthy}`, `curl -s localhost:6060/debug/pprof/{,goroutine,heap}?debug=1`
Envoy / Istio sidecar	`curl -s localhost:15000/ready`, `curl -s localhost:{15020,15021}/healthz/ready`
wget fallback (Alpine)	`wget -qO- http://localhost{,:8080}/{health,healthz,metrics}`
TCP reachability	`nc -zv kubernetes.default.svc.cluster.local 443`, `nc -zv kube-dns.kube-system.svc.cluster.local 53`

Extend via the CHATCLI_ALLOWED_DIAGNOSTIC_COMMANDS env var (comma-separated, read once at startup):

spec:
  extraEnv:
    - name: CHATCLI_ALLOWED_DIAGNOSTIC_COMMANDS
      value: "dig +short redis.default.svc.cluster.local, nc -zv redis.default.svc.cluster.local 6379"

Each entry is matched exactly. nslookup <other-host> will not work — it is rejected. If you need a specific host, add the full string to the env var.

The AI is given this allowlist in the remediation prompt (server/handler_analysis.go) and is instructed to pick the right command per symptom: memory.events for OOM, cpu.stat for CPU throttling, getent/dig for DNS, pprof for stuck Go apps, nc -zv for external dependency reachability.

StatefulSet:

Type	Description	Parameters
`ScaleStatefulSet`	Ordered replica scaling	`replicas`
`RestartStatefulSet`	Rolling restart via annotation	—
`RollbackStatefulSet`	Rollback via ControllerRevision	`toRevision`
`AdjustStatefulSetResources`	Adjusts CPU/memory	`container`, `memory_limit`, `cpu_limit`
`DeleteStatefulSetPod`	Deletes specific or unhealthiest pod	`pod` (optional)
`ForceDeleteStatefulSetPod`	Force-delete stuck pod (grace=0)	`pod` (REQUIRED)
`UpdateStatefulSetStrategy`	Changes updateStrategy	`type`, `maxUnavailable`
`RecreateStatefulSetPVC`	Deletes stuck PVC	`pvc`, `confirm=true`
`PartitionStatefulSetUpdate`	Canary partition	`partition`

DaemonSet:

Type	Description	Parameters
`RestartDaemonSet`	Rolling restart across all nodes	—
`RollbackDaemonSet`	Rollback via ControllerRevision	`toRevision`
`AdjustDaemonSetResources`	Adjusts CPU/memory	`container`, `memory_limit`, `cpu_limit`
`DeleteDaemonSetPod`	Deletes pod (optionally on specific node)	`pod`, `node` (optional)
`UpdateDaemonSetStrategy`	Changes update strategy	`type`, `maxUnavailable`, `maxSurge`
`PauseDaemonSetRollout`	Pauses rollout (maxUnavailable=0)	—
`CordonAndDeleteDaemonSetPod`	Cordons node + deletes pod	`node` (REQUIRED)

Job:

Type	Description	Parameters
`RetryJob`	Deletes failed Job + recreates	—
`AdjustJobResources`	Adjusts CPU/memory on template	`container`, `memory_limit`, `cpu_limit`
`DeleteFailedJob`	Cleans up failed Job	—
`SuspendJob`	Pauses Job (suspend=true)	—
`ResumeJob`	Resumes Job (suspend=false)	—
`AdjustJobParallelism`	Changes parallelism	`parallelism`
`AdjustJobDeadline`	Changes deadline	`activeDeadlineSeconds`
`AdjustJobBackoffLimit`	Changes backoff limit	`backoffLimit`
`ForceDeleteJobPods`	Force-deletes all Job pods	—

CronJob:

Type	Description	Parameters
`SuspendCronJob`	Pauses scheduling	—
`ResumeCronJob`	Resumes scheduling	—
`TriggerCronJob`	Creates Job immediately	—
`AdjustCronJobResources`	Adjusts CPU/memory on jobTemplate	`container`, `memory_limit`, `cpu_limit`
`AdjustCronJobSchedule`	Changes schedule	`schedule`
`AdjustCronJobDeadline`	Changes deadline	`startingDeadlineSeconds`
`AdjustCronJobHistory`	Changes history limits	`successfulJobsHistoryLimit`, `failedJobsHistoryLimit`
`AdjustCronJobConcurrency`	Changes concurrency policy	`concurrencyPolicy`
`DeleteCronJobActiveJobs`	Kills running Jobs	—
`ReplaceCronJobTemplate`	Replaces template from ConfigMap	`configmap`, `key`

RemediationPlan Examples with New Actions

apiVersion: platform.chatcli.io/v1alpha1
kind: RemediationPlan
metadata:
  name: checkout-helm-rollback-plan-1
  namespace: production
spec:
  issueRef:
    name: checkout-helm-release-failed-123
  attempt: 1
  strategy: "Rollback Helm release to previous stable revision"
  actions:
    - type: HelmRollback
      params:
        revision: "41"    # specific revision (omit for previous)

apiVersion: platform.chatcli.io/v1alpha1
kind: RemediationPlan
metadata:
  name: api-gateway-argo-sync-plan-1
  namespace: production
spec:
  issueRef:
    name: api-gateway-argocd-degraded-456
  attempt: 1
  strategy: "Force ArgoCD sync to reconcile desired state"
  actions:
    - type: ArgoSyncApp
      params:
        revision: "abc123def"   # commit SHA (omit for HEAD)

apiVersion: platform.chatcli.io/v1alpha1
kind: RemediationPlan
metadata:
  name: postgres-recovery-plan-1
  namespace: databases
spec:
  issueRef:
    name: postgres-pod-restart-789
  attempt: 1
  strategy: "Restart unhealthy StatefulSet pod and adjust HPA"
  actions:
    - type: RestartStatefulSetPod
      params:
        pod: "postgres-2"   # specific pod (omit for rolling restart)
    - type: AdjustHPA
      params:
        minReplicas: "3"
        maxReplicas: "10"
        targetCPUUtilization: "70"

apiVersion: platform.chatcli.io/v1alpha1
kind: RemediationPlan
metadata:
  name: node-disk-pressure-plan-1
  namespace: kube-system
spec:
  issueRef:
    name: node-disk-pressure-issue-321
  attempt: 1
  strategy: "Cordon and drain node with disk pressure"
  actions:
    - type: CordonNode
      params:
        node: "worker-node-03"
    - type: DrainNode
      params:
        node: "worker-node-03"

apiVersion: platform.chatcli.io/v1alpha1
kind: RemediationPlan
metadata:
  name: storage-and-certs-plan-1
  namespace: production
spec:
  issueRef:
    name: pvc-full-cert-expired-555
  attempt: 1
  strategy: "Expand PVC and rotate expired certificate"
  actions:
    - type: ResizePVC
      params:
        pvc: "data-postgres-0"
        size: "50Gi"
    - type: RotateSecret
      params:
        secret: "tls-api-gateway"
        sourceSecret: "tls-api-gateway-renewed"

apiVersion: platform.chatcli.io/v1alpha1
kind: RemediationPlan
metadata:
  name: networking-fix-plan-1
  namespace: production
spec:
  issueRef:
    name: ingress-backend-error-777
  attempt: 1
  strategy: "Fix ingress backend and open blocked port"
  actions:
    - type: UpdateIngress
      params:
        ingress: "api-gateway-ingress"
        backendService: "api-gateway-v2"
        backendPort: "8080"
    - type: PatchNetworkPolicy
      params:
        networkPolicy: "api-gateway-netpol"
        allowPort: "8443"
        protocol: "TCP"

apiVersion: platform.chatcli.io/v1alpha1
kind: RemediationPlan
metadata:
  name: advanced-fix-plan-1
  namespace: production
spec:
  issueRef:
    name: complex-issue-999
  attempt: 1
  strategy: "Run diagnostics then apply fix manifest"
  actions:
    - type: ExecDiagnostic
      params:
        command: "df -h"
    - type: ApplyManifest
      params:
        configmap: "hotfix-configmap"
        key: "manifest.yaml"

Runbook (Manual or Auto-generated)

Operational procedures. Manual Runbooks have priority over everything. When there is no manual Runbook, the AI automatically generates a reusable Runbook CR from the suggested actions.

Manual Runbook
AI Auto-generated Runbook
Runbook: Helm + ArgoCD
Runbook: StatefulSet + Storage

apiVersion: platform.chatcli.io/v1alpha1
kind: Runbook
metadata:
  name: high-error-rate-deployment
  namespace: production
spec:
  description: "Standard procedure for high error rate incidents on Deployments"
  trigger:
    signalType: error_rate
    severity: high
    resourceKind: Deployment
  steps:
    - name: Scale up
      action: ScaleDeployment
      description: "Increase replicas to absorb the error spike"
      params:
        replicas: "4"
    - name: Rollback
      action: RollbackDeployment
      description: "Revert to previous stable version if scaling doesn't help"
  maxAttempts: 3

apiVersion: platform.chatcli.io/v1alpha1
kind: Runbook
metadata:
  name: auto-pod-restart-high-deployment
  labels:
    platform.chatcli.io/auto-generated: "true"
    platform.chatcli.io/source-issue: "api-gateway-pod-restart-1771276354"
spec:
  description: "Auto-generated: High restart count caused by OOMKilled..."
  trigger:
    signalType: pod_restart
    severity: high
    resourceKind: Deployment
  steps:
    - name: Restart deployment
      action: RestartDeployment
    - name: Scale up replicas
      action: ScaleDeployment
      params:
        replicas: "4"
  maxAttempts: 3

Auto-generated Runbooks are reused for future Issues with the same trigger, avoiding unnecessary LLM calls.

apiVersion: platform.chatcli.io/v1alpha1
kind: Runbook
metadata:
  name: helm-release-failed-runbook
  namespace: production
spec:
  description: "Helm release rollback and ArgoCD sync after deploy failure"
  trigger:
    signalType: helm_release_failed
    severity: critical
    resourceKind: Deployment
  steps:
    - name: Rollback Helm
      action: HelmRollback
      description: "Revert to the previous stable revision"
    - name: Sync ArgoCD
      action: ArgoSyncApp
      description: "Force ArgoCD sync to reconcile desired state"
  maxAttempts: 2

apiVersion: platform.chatcli.io/v1alpha1
kind: Runbook
metadata:
  name: statefulset-disk-full-runbook
  namespace: databases
spec:
  description: "Expand PVC and restart StatefulSet pod with full disk"
  trigger:
    signalType: disk_pressure
    severity: high
    resourceKind: StatefulSet
  steps:
    - name: Expand storage
      action: ResizePVC
      description: "Double the PVC size"
      params:
        pvc: "data-postgres-0"
        size: "100Gi"
    - name: Restart pod
      action: RestartStatefulSetPod
      description: "Restart the pod to recognize the new size"
      params:
        pod: "postgres-0"
  maxAttempts: 2

RemediationPlan (Agentic Mode)

When there is no manual Runbook or AI-suggested actions, the operator creates an agentic plan. The AI acts as an agent with Kubernetes skills in an observe-decide-act loop:

apiVersion: platform.chatcli.io/v1alpha1
kind: RemediationPlan
metadata:
  name: api-gateway-pod-restart-plan-1
  namespace: production
spec:
  issueRef:
    name: api-gateway-pod-restart-1771276354
  attempt: 1
  strategy: "Agentic AI remediation"
  agenticMode: true
  agenticMaxSteps: 10
  agenticHistory:
    - stepNumber: 1
      aiMessage: "High restart count with OOMKilled. Scaling up to reduce memory pressure."
      action:
        type: ScaleDeployment
        params:
          replicas: "5"
      observation: "SUCCESS: ScaleDeployment executed successfully"
    - stepNumber: 2
      aiMessage: "Pods still restarting. Adjusting memory limits."
      action:
        type: AdjustResources
        params:
          memory_limit: "1Gi"
          memory_request: "512Mi"
      observation: "SUCCESS: AdjustResources executed successfully"
    - stepNumber: 3
      aiMessage: "All pods running stable. Issue resolved."
status:
  state: Completed
  agenticStepCount: 3
  agenticStartedAt: "2026-02-16T10:31:00Z"

Safety Guards: Maximum of 10 steps (configurable via agenticMaxSteps), timeout of 10 minutes. If an action fails, the observation reports “FAILED: error” and the loop continues — the AI receives the feedback and adapts.

On agentic resolution: The operator automatically generates:

PostMortem CR with timeline, root cause, impact, lessons learned
Reusable Runbook CR with successful steps (label source=agentic)

PostMortem (Auto-generated)

Incident report automatically generated after any remediation resolution (standard or agentic). Contains the complete incident history: detection, analysis, executed actions, resolution, plus metrics, git correlation, cascade chain, recurring incident trending, and developer feedback field.

apiVersion: platform.chatcli.io/v1alpha1
kind: PostMortem
metadata:
  name: pm-api-gateway-pod-restart-1771276354
  namespace: production
spec:
  issueRef:
    name: api-gateway-pod-restart-1771276354
  resource:
    kind: Deployment
    name: api-gateway
    namespace: production
  severity: high
status:
  state: Open              # Open | InReview | Closed
  summary: "OOMKilled containers caused cascading restarts on api-gateway"
  rootCause: "Memory limit (512Mi) insufficient for current workload pattern"
  impact: "Service degradation for 5 minutes, 30% error rate increase"
  timeline:
    - timestamp: "2026-02-16T10:30:00Z"
      type: detected
      detail: "Issue detected: pod_restart on api-gateway"
    - timestamp: "2026-02-16T10:31:00Z"
      type: action_executed
      detail: "ScaleDeployment to 5 replicas"
    - timestamp: "2026-02-16T10:31:35Z"
      type: action_executed
      detail: "AdjustResources memory_limit=1Gi"
    - timestamp: "2026-02-16T10:32:10Z"
      type: resolved
      detail: "All pods stable, issue resolved"
  lessonsLearned:
    - "Memory limits should account for peak workload patterns"
    - "Set up HPA to auto-scale on memory pressure"
  preventionActions:
    - "Configure HPA with min 3 replicas for api-gateway"
    - "Set memory limit to 1Gi across all environments"
  duration: "2m10s"
  generatedAt: "2026-02-16T10:32:10Z"
  # New enrichment fields (automatically populated by the operator)
  metricSnapshots:
    - name: "memory_usage"
      value: "498000000"
      timestamp: "2026-02-16T10:30:00Z"
      phase: "during"
    - name: "memory_usage"
      value: "312000000"
      timestamp: "2026-02-16T10:35:00Z"
      phase: "after"
  blastRadius:
    - resource:
        kind: Service
        name: api-gateway-svc
        namespace: production
      impact: "5xx responses during pod restarts"
      severity: "high"
  gitCorrelation:
    commitSHA: "a1b2c3d4e5f6"
    commitMessage: "feat: add webhook handler for notifications"
    author: "dev@team.com"
    timestamp: "2026-02-16T09:15:00Z"
    confidence: 0.82
    filesChanged:
      - "internal/webhook/handler.go"
      - "internal/webhook/handler_test.go"
  trending:
    occurrenceCount: 3
    windowDays: 30
    relatedPostMortems:
      - "pm-api-gateway-oom-20260205"
      - "pm-api-gateway-oom-20260210"
    pattern: "Recurring oom_kill on Deployment/api-gateway (3 occurrences in 30 days)"
  gitOpsContext: "Helm release 'api-gateway' chart=api-gw version=2.1.0 status=deployed revision=15"
  logAnalysisSummary: "1 Go panic stack trace; 8 critical error patterns (resource/connectivity)"
  cascadeChain:
    - "production/api-gateway(root_cause)"
    - "production/frontend(victim)"
  # Developer feedback (filled manually after review)
  feedback:
    overrideRootCause: ""          # empty = agrees with AI analysis
    remediationAccuracy: 4         # 1-5 scale
    comments: "Good analysis, but could have suggested AdjustResources before the restart"
    providedBy: "sre@team.com"
    providedAt: "2026-02-17T09:00:00Z"

PostMortem Status Fields

Field	Type	Description
`state`	PostMortemState	State: Open, InReview, Closed
`summary`	string	AI-generated incident summary
`rootCause`	string	Root cause determined by AI
`impact`	string	Incident impact
`timeline`	[]TimelineEvent	Timeline (detected, analyzed, action_executed, resolved)
`actionsExecuted`	[]ActionRecord	Executed actions with result
`lessonsLearned`	[]string	Lessons learned
`preventionActions`	[]string	Suggested preventive actions
`duration`	string	Total incident duration
`generatedAt`	Time	When the PostMortem was generated
`reviewedAt`	Time	When the PostMortem was reviewed by a human
`metricSnapshots`	[]MetricSnapshot	Prometheus metrics captured before/during/after the incident
`blastRadius`	[]BlastRadiusEntry	Services and resources impacted by the incident
`gitCorrelation`	GitCorrelation	Suspect commit correlated with the incident (SHA, author, files, confidence)
`sliImpact`	string	Impact on SLIs and error budgets
`errorBudgetBurned`	float64	Percentage of error budget consumed
`trending`	TrendingInfo	Recurring pattern information (count, window, related PostMortems)
`feedback`	DevFeedback	Human feedback (root cause override, accuracy 1-5, comments)
`gitOpsContext`	string	Helm/ArgoCD/Flux state at the time of the incident
`logAnalysisSummary`	string	Summary of log analysis findings
`cascadeChain`	[]string	Cascade failure chain if applicable

Runbook Matching (Tiered)

Tier 1: SignalType + Severity + ResourceKind (exact match, preferred)
Tier 2: Severity + ResourceKind (fallback when signal doesn't match)

Remediation Priority

Existing manual Runbook (tiered match)
AI auto-generated Runbook (materialized as reusable CR)
Agentic AI remediation (observe-decide-act loop, generates PostMortem + Runbook)
Escalation (only when agentic fails after max attempts)

SourceRepository (Code-Aware Diagnostics)

Links a Kubernetes workload to its source code repository. When configured, the AI receives code context during incident analysis: recent commits correlated with the timestamp, code snippets referenced in stack traces, and configuration files (Dockerfile, values.yaml).

Input validation (hardening): spec.url only accepts the https://, ssh:// or git@host:path forms, and spec.branch is restricted to [A-Za-z0-9._/-] with no leading - — closing the git argument-injection vector (e.g. --upload-pack). Reads of code referenced by stack traces are confined to the repository clone via os.Root (no traversal, no symlink escape). Out-of-pattern URLs or branches fail the sync with an explicit status error.

Token Auth (GitHub PAT)
SSH Key
Basic Auth
Public Repo (no auth)
StatefulSet (Database)

apiVersion: platform.chatcli.io/v1alpha1
kind: SourceRepository
metadata:
  name: api-gateway-repo
  namespace: production
spec:
  url: "https://github.com/myorg/api-gateway.git"
  branch: main
  authType: token
  secretRef: git-token       # Secret with key "token"
  resource:
    kind: Deployment
    name: api-gateway
    namespace: production
  paths: ["cmd/", "internal/"]
  dockerfile: "Dockerfile"
  language: "Go"
  syncIntervalMinutes: 30
---
apiVersion: v1
kind: Secret
metadata:
  name: git-token
  namespace: production
type: Opaque
stringData:
  token: "ghp_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"

apiVersion: platform.chatcli.io/v1alpha1
kind: SourceRepository
metadata:
  name: payment-service-repo
  namespace: production
spec:
  url: "git@github.com:myorg/payment-service.git"
  branch: main
  authType: ssh
  secretRef: git-ssh-key     # Secret with key "ssh-key"
  resource:
    kind: Deployment
    name: payment-service
    namespace: production
  language: "Java"
---
apiVersion: v1
kind: Secret
metadata:
  name: git-ssh-key
  namespace: production
type: Opaque
data:
  ssh-key: <base64-encoded-private-key>

apiVersion: platform.chatcli.io/v1alpha1
kind: SourceRepository
metadata:
  name: worker-repo
  namespace: batch
spec:
  url: "https://gitlab.internal.com/team/worker.git"
  branch: develop
  authType: basic
  secretRef: git-basic-auth  # Secret with keys "username" and "password"
  resource:
    kind: Deployment
    name: worker
    namespace: batch
  paths: ["src/"]
  language: "Python"
---
apiVersion: v1
kind: Secret
metadata:
  name: git-basic-auth
  namespace: batch
type: Opaque
stringData:
  username: "deploy-bot"
  password: "xxxxxxxxxxxx"

apiVersion: platform.chatcli.io/v1alpha1
kind: SourceRepository
metadata:
  name: oss-frontend-repo
  namespace: staging
spec:
  url: "https://github.com/myorg/frontend.git"
  branch: main
  authType: none             # public repo, no auth needed
  resource:
    kind: Deployment
    name: frontend
    namespace: staging
  paths: ["src/", "pages/"]
  language: "TypeScript"
  syncIntervalMinutes: 60    # less frequent sync for staging

apiVersion: platform.chatcli.io/v1alpha1
kind: SourceRepository
metadata:
  name: postgres-migrations-repo
  namespace: databases
spec:
  url: "https://github.com/myorg/db-migrations.git"
  branch: main
  authType: token
  secretRef: git-token
  resource:
    kind: StatefulSet         # works with any resource kind
    name: postgres
    namespace: databases
  paths: ["migrations/", "schemas/"]
  dockerfile: "Dockerfile.migrate"

What the operator does with SourceRepository:

Shallow clone of the repository (depth 50) and periodic sync
Indexes detected languages, entrypoints (main.go, app.py, etc.), config files
Temporal correlation: finds commits within 30 min before the incident
Suspect commit: identifies the most likely commit to have caused the problem
Code extraction: when stack traces reference files, extracts the relevant snippets
Feed to AI: all context is included in the analysis prompt

The repository is cloned locally on the operator pod. For private repos, create a Secret with the key corresponding to the chosen authType and reference it in secretRef. The operator supports HTTPS repos (token/basic) and SSH (ssh-key).

Correlation Engine

The correlation engine groups anomalies into issues using:

Risk Scoring

Each signal type has a weight:

Signal	Weight
`oom_kill`	30
`error_rate`	25
`deploy_failing`	25
`latency_spike`	20
`pod_restart`	20
`pod_not_ready`	20

The risk score is the sum of correlated anomaly weights (maximum 100).

Severity Classification

Risk Score	Severity
>= 80	Critical
>= 60	High
>= 40	Medium
< 40	Low

Grouping

Anomalies on the same resource (deployment + namespace) within the same time window are grouped into the same Issue
Incident ID is deterministic: hash of resource + signal type (prevents duplicates)

WatcherBridge

The WatcherBridge is the component that connects the ChatCLI server to the operator:

Polling: Queries GetAlerts from the server every 30 seconds
Discovery: Locates the server via Instance CRs (first Instance with a ready gRPC endpoint)
Dedup: SHA256 hash of type+deployment+namespace (no temporal component — a continuous problem generates only one Anomaly). 2-hour TTL
Dedup invalidation: When an Issue reaches a terminal state (Resolved/Escalated), dedup entries for the resource are removed, allowing immediate recurrence detection
Pruning: Removes expired hashes automatically (> 2h)
Creation: Converts alerts to Anomaly CRs with valid K8s names

Usage Examples

Minimal (no AIOps)
Full AIOps
With Fallback Multi-Provider
Manual Runbook (optional)
API Keys Secret

apiVersion: platform.chatcli.io/v1alpha1
kind: Instance
metadata:
  name: chatcli-simple
spec:
  provider: OPENAI
  apiKeys:
    name: chatcli-api-keys

apiVersion: platform.chatcli.io/v1alpha1
kind: Instance
metadata:
  name: chatcli-aiops
spec:
  provider: CLAUDEAI
  apiKeys:
    name: chatcli-api-keys
  server:
    port: 50051
  watcher:
    enabled: true
    interval: "15s"
    maxContextChars: 32000
    targets:
      - deployment: api-gateway
        namespace: production
        metricsPort: 9090
        metricsFilter: ["http_*", "grpc_*"]
      - deployment: auth-service
        namespace: production
        metricsPort: 9090
      - deployment: worker
        namespace: batch
      - deployment: ml-inference
        namespace: ml
        metricsPort: 8080
  resources:
    requests:
      cpu: 200m
      memory: 256Mi
    limits:
      cpu: "1"
      memory: 1Gi
  persistence:
    enabled: true
    size: 5Gi

apiVersion: platform.chatcli.io/v1alpha1
kind: Instance
metadata:
  name: chatcli-resilient
  namespace: chatcli-system
spec:
  provider: OPENAI
  model: gpt-4o
  fallback:
    enabled: true
    providers:
      - name: CLAUDEAI
        model: claude-sonnet-4-20250514
      - name: GOOGLEAI
        model: gemini-2.0-flash
    maxRetries: 2
    cooldownBase: "30s"
    cooldownMax: "5m"
  apiKeys:
    name: chatcli-api-keys
  watcher:
    enabled: true
    targets:
      - deployment: api-gateway
        namespace: production
---
# Secret with keys for ALL providers in the fallback chain
apiVersion: v1
kind: Secret
metadata:
  name: chatcli-api-keys
  namespace: chatcli-system
type: Opaque
stringData:
  OPENAI_API_KEY: "sk-xxx"
  ANTHROPIC_API_KEY: "sk-ant-xxx"
  GOOGLEAI_API_KEY: "xxx"
  OPENROUTER_API_KEY: "sk-or-xxx"

When OpenAI fails (rate limit, timeout), the system automatically tries Claude, then Google AI. Each provider gets up to 2 retries before moving to the next. After failure, the provider enters exponential cooldown (30s → 1m → 2m → … up to 5m max).

apiVersion: platform.chatcli.io/v1alpha1
kind: Runbook
metadata:
  name: oom-kill-runbook
  namespace: production
spec:
  description: "Procedure for OOMKilled containers"
  trigger:
    signalType: oom_kill
    severity: critical
    resourceKind: Deployment
  steps:
    - name: Restart pods
      action: RestartDeployment
      description: "Restart to reclaim leaked memory"
    - name: Scale up
      action: ScaleDeployment
      description: "Add replicas to distribute memory pressure"
      params:
        replicas: "5"
  maxAttempts: 2

apiVersion: v1
kind: Secret
metadata:
  name: chatcli-api-keys
type: Opaque
stringData:
  ANTHROPIC_API_KEY: "sk-ant-xxx"
  # OPENAI_API_KEY: "sk-xxx"
  # GOOGLEAI_API_KEY: "xxx"

Status and Monitoring

Check Instances

kubectl get instances

NAME            READY   REPLICAS   PROVIDER    AGE
chatcli-aiops   true    1          CLAUDEAI    5m

Check Active Issues

kubectl get issues -A

NAME                                    SEVERITY   STATE         RISK   AGE
api-gateway-pod-restart-1771276354      high       Remediating   65     2m
worker-oom-kill-3847291023              critical   Analyzing     90     30s

Check AI Insights

kubectl get aiinsights -A

NAME                                           ISSUE                                   PROVIDER   CONFIDENCE   AGE
api-gateway-pod-restart-1771276354-insight      api-gateway-pod-restart-1771276354      CLAUDEAI   0.87         1m

Check Remediation Plans

kubectl get remediationplans -A

NAME                                          ISSUE                                   ATTEMPT   STATE       AGE
api-gateway-pod-restart-1771276354-plan-1      api-gateway-pod-restart-1771276354      1         Completed   1m

Check PostMortems

kubectl get postmortems -A

NAME                                          ISSUE                                   SEVERITY   STATE   AGE
pm-api-gateway-pod-restart-1771276354         api-gateway-pod-restart-1771276354      high       Open    30s

Check Anomalies

kubectl get anomalies -A

NAME                                               SIGNAL        SOURCE    SEVERITY   AGE
watcher-highrestartcount-api-gateway-1234567890     pod_restart   watcher   warning    3m
watcher-oomkilled-worker-9876543210                 oom_kill      watcher   critical   1m

Development

cd operator

# Build
go build ./...

# Tests (130 functions, 185 with subtests)
go test ./... -v

# Docker (must be built from the repository root)
docker build -f operator/Dockerfile -t myregistry/chatcli-operator:dev .

# Deploy via Helm (recommended)
helm install chatcli-operator ../deploy/helm/chatcli-operator/ \
  --namespace chatcli-system --create-namespace \
  --set image.repository=myregistry/chatcli-operator \
  --set image.tag=dev

# Or deploy via kubectl (alternative)
make deploy IMG=myregistry/chatcli-operator:dev

Security

The Operator implements multiple security layers by default, following the fail-closed principle (deny by default):

REST API Authentication

The REST API operates in fail-closed mode by default — there is no dev mode without authentication. Every request must include a valid X-API-Key header with a mapped role (viewer/operator/admin). API keys are loaded with the following priority order and hot-reloaded every 30 seconds:

Secret chatcli-operator-secrets (priority) — api-keys field containing a YAML list of {key, role, description} entries
ConfigMap chatcli-operator-config (fallback) — same api-keys field
Reject the request (or accept in dev-mode if CHATCLI_OPERATOR_DEV_MODE=true)

Two distinct Secrets in this project — do not confuse it with the LLM provider keys consumed by the chatcli server (chatcli-api-keys, referenced via Instance.spec.apiKeys.name). See the comparison table in Security — Operator Authentication.The chatcli-operator-secrets Secret must live in the same namespace as the operator pod (the controller resolves it via the POD_NAMESPACE env var / ServiceAccount namespace file, falling back to chatcli-system). If you ran helm install --namespace <X>, create the Secret in <X>.

Changes to API keys — in either the Secret or the ConfigMap — are picked up automatically every 30s. No operator restart is needed.

Resource Type Allowlist

The Operator classifies Kubernetes resource types into two categories:

17 Safe Types (allowed)
18 Dangerous Types (blocked)

Pods, Deployments, StatefulSets, DaemonSets, Services, ConfigMaps, Ingresses, Jobs, CronJobs, ReplicaSets, Endpoints, PersistentVolumeClaims, HorizontalPodAutoscalers, NetworkPolicies, ServiceAccounts, Namespaces, Events.

Log Scrubbing

Before sending application logs to the LLM for analysis, the Operator removes 18 sensitive patterns, including:

JWT/Bearer tokens, API keys, passwords
Email addresses, internal IPs, URLs with credentials
Credit card numbers, SSNs, PEM certificates

TLS and RBAC

TLS 1.3 required on all ChatCLI server connections
ClusterRoles with least privilege (read-only by default)
NetworkPolicy configurable to restrict network traffic to the Operator namespace
H5 hardening — no runtime RBAC escalation: the operator never creates or mutates ClusterRole/ClusterRoleBinding at runtime. Shared ClusterRoles (chatcli-watcher, chatcli-role-*) are pre-provisioned by the Helm chart, and the operator’s SA holds the bind verb restricted to those exact names via resourceNames. A compromised operator cannot reference a more-privileged ClusterRole from a new ClusterRoleBinding.

Audit

AuditEvent CRD for immutable audit trail (append-only)
Structured logs with Request ID for correlation
Integration with CHATCLI_AUDIT_LOG_PATH via extraEnv

spec:
  extraEnv:
    - name: CHATCLI_AGENT_SECURITY_MODE
      value: "strict"
    - name: CHATCLI_AUDIT_LOG_PATH
      value: "/var/log/chatcli/audit.jsonl"

In strict mode, agent security blocks any cluster write operations not on the allowlist. This is recommended for production environments.

Next Steps

AIOps Platform

Deep-dive into the AIOps architecture

K8s Watcher

Collection and budget details

Server Mode

GetAlerts and AnalyzeIssue RPCs

K8s Monitoring

Recipe: K8s Monitoring with AI

​API Group and CRDs

​Operator Installation

​AIOps Platform Architecture

​Autonomous Pipeline

​Issue State Machine

​Create Secret with API Keys

​CRD: Instance

​Complete Specification

​Spec Fields

​Root

​AIOpsSpec

​FallbackSpec

​FallbackProviderEntry

​WatcherSpec

​WatchTargetSpec

​Resources Created by Instance

​gRPC Load Balancing

​Automatic RBAC

​Server Image and Auto-Resolution

​Auto-Rollout on Configuration Changes

​Secret and ConfigMap Observation

​AIOps Platform CRDs

​Anomaly

​Anomaly Spec Fields

​Signals Detected (21 types)

​Node Monitoring

​Issue

​Issue States

​AIInsight

​AIInsight Status Fields

​SuggestedAction

​RemediationPlan

​Automatic Rollback and State Protection

​Action Types (54 types)

​ExecDiagnostic Allowlist

​RemediationPlan Examples with New Actions

​Runbook (Manual or Auto-generated)

​RemediationPlan (Agentic Mode)

​PostMortem (Auto-generated)

​PostMortem Status Fields

​Runbook Matching (Tiered)

​Remediation Priority

​SourceRepository (Code-Aware Diagnostics)

​Correlation Engine

​Risk Scoring

​Severity Classification

​Grouping

​WatcherBridge

​Usage Examples

​Status and Monitoring

​Development

​Security

​REST API Authentication

​Resource Type Allowlist

​Log Scrubbing

​TLS and RBAC

​Audit

​Next Steps

AIOps Platform

K8s Watcher

Server Mode

K8s Monitoring

API Group and CRDs

Operator Installation

AIOps Platform Architecture

Autonomous Pipeline

Issue State Machine

Create Secret with API Keys

CRD: Instance

Complete Specification

Spec Fields

Root

AIOpsSpec

FallbackSpec

FallbackProviderEntry

WatcherSpec

WatchTargetSpec

Resources Created by Instance

gRPC Load Balancing

Automatic RBAC

Server Image and Auto-Resolution

Auto-Rollout on Configuration Changes

Secret and ConfigMap Observation

AIOps Platform CRDs

Anomaly

Anomaly Spec Fields

Signals Detected (21 types)

Node Monitoring

Issue

Issue States

AIInsight

AIInsight Status Fields

SuggestedAction

RemediationPlan

Automatic Rollback and State Protection

Action Types (54 types)

ExecDiagnostic Allowlist

RemediationPlan Examples with New Actions

Runbook (Manual or Auto-generated)

RemediationPlan (Agentic Mode)

PostMortem (Auto-generated)

PostMortem Status Fields

Runbook Matching (Tiered)

Remediation Priority

SourceRepository (Code-Aware Diagnostics)

Correlation Engine

Risk Scoring

Severity Classification

Grouping

WatcherBridge

Usage Examples

Status and Monitoring

Development

Security

REST API Authentication

Resource Type Allowlist

Log Scrubbing

TLS and RBAC

Audit

Next Steps