API Group and CRDs
The operator uses the API group platform.chatcli.io/v1alpha1 with 17 Custom Resource Definitions:
| CRD | Short Name | Description |
|---|---|---|
| Instance | inst | ChatCLI server instance (Deployment, Service, RBAC, PVC) |
| Anomaly | anom | Raw signal from the K8s Watcher (restarts, OOM, deploy failures) |
| Issue | iss | Correlated incident grouping multiple anomalies |
| AIInsight | ai | AI-generated root cause analysis with enriched context (logs, metrics, code, GitOps) |
| RemediationPlan | rp | Concrete actions to resolve the problem (runbook or agentic AI) |
| Runbook | rb | Manual operational procedures (optional) |
| PostMortem | pm | Auto-generated incident report after resolution (all modes) |
| SourceRepository | srcrepo | Links workloads to git repositories for code-aware diagnostics |
| NotificationPolicy | np | Multi-channel notification routing with throttling and templates |
| EscalationPolicy | ep | Tiered escalation chains with timeouts (L1→L2→L3) |
| ServiceLevelObjective | slo | SLO with multi-window burn rate alerting (Google SRE model) |
| IncidentSLA | sla | Response/resolution SLA targets per severity with business hours |
| ApprovalPolicy | ap | Auto/manual/quorum approval policies with change windows |
| ApprovalRequest | ar | Approval workflow with blast radius assessment |
| ClusterRegistration | cr | Multi-cluster federation with kubeconfig and health checks |
| AuditEvent | ae | Immutable audit trail (append-only) |
| ChaosExperiment | chaos | Chaos engineering experiments with 7 types and safety checks |
For detailed documentation on each v2 CRD (NotificationPolicy, EscalationPolicy, SLO, SLA, ApprovalPolicy, ApprovalRequest, ClusterRegistration, AuditEvent, ChaosExperiment), see the AIOps Platform sub-pages.
Operator Installation
A single command installs everything: 17 CRDs + RBAC + Deployment + Service + Dashboard.
- Via OCI Registry (recommended)
- Via local path (if you cloned the repo)
Install directly from GHCR, with no need to clone the repository:
To pin a specific version:
Configurable values
| Value | Default | Description |
|---|---|---|
| image.repository | ghcr.io/diillson/chatcli-operator | Operator image |
| image.tag | latest | Image tag |
| replicaCount | 1 | Replicas (leader election enabled by default) |
| api.port | 8090 | Web dashboard and REST API port |
| prometheusUrl | "" | Prometheus URL for incident metrics collection |
| leaderElect | true | Leader election for HA |
| serviceMonitor.enabled | false | Create Prometheus ServiceMonitor |
Manual installation via kubectl (alternative)
Build via Docker (optional)
AIOps Platform Architecture
Autonomous Pipeline
| Phase | Component | What It Does |
|---|---|---|
| 1. Detection | WatcherBridge | Queries GetAlerts from the server every 30s. Creates Anomaly CRs (dedup SHA256). Invalidates dedup when Issue reaches terminal state. |
| 2. Correlation | AnomalyReconciler + CorrelationEngine | Groups anomalies by resource + time window. Calculates risk score and severity. Creates/updates Issue CRs with signalType. |
| 3. Analysis | AIInsightReconciler + 6 enrichers | Collects K8s context (Deployments, StatefulSets, DaemonSets, Jobs, CronJobs, HPAs), advanced log analysis (stack traces Java/Go/Python/Node.js, 24+ error patterns), Prometheus metrics (CPU/mem/latency trends), GitOps (Helm/ArgoCD/Flux status), source code (commit↔incident correlation), cascade analysis (cross-service). |
| 4. Remediation | IssueReconciler | AI-validated runbook selection: (a) finds ALL candidate runbooks (multi-runbook per trigger), (b) AI validates each against root cause (RUNBOOK_APPROVED: name or RUNBOOK_REJECTED), (c) if rejected or no candidates, generates new runbook from AI suggestions (with unique hash per root cause), or (d) agentic remediation (AI acts step-by-step). |
| 5. Execution | RemediationReconciler | 54 action types: workload (Scale, Restart, Rollback, AdjustResources, DeletePod, RestartStatefulSetPod), GitOps (HelmRollback, ArgoSyncApp), autoscaling (AdjustHPA), infra (CordonNode, DrainNode), storage (ResizePVC), security (RotateSecret), networking (UpdateIngress, PatchNetworkPolicy), advanced (ApplyManifest, ExecDiagnostic), statefulset (ScaleStatefulSet, RestartStatefulSet, RollbackStatefulSet, AdjustStatefulSetResources, DeleteStatefulSetPod, ForceDeleteStatefulSetPod, UpdateStatefulSetStrategy, RecreateStatefulSetPVC, PartitionStatefulSetUpdate), daemonset (RestartDaemonSet, RollbackDaemonSet, AdjustDaemonSetResources, DeleteDaemonSetPod, UpdateDaemonSetStrategy, PauseDaemonSetRollout, CordonAndDeleteDaemonSetPod), job (RetryJob, AdjustJobResources, DeleteFailedJob, SuspendJob, ResumeJob, AdjustJobParallelism, AdjustJobDeadline, AdjustJobBackoffLimit, ForceDeleteJobPods), cronjob (SuspendCronJob, ResumeCronJob, TriggerCronJob, AdjustCronJobResources, AdjustCronJobSchedule, AdjustCronJobDeadline, AdjustCronJobHistory, AdjustCronJobConcurrency, DeleteCronJobActiveJobs, ReplaceCronJobTemplate). Blast radius prediction before execution. |
| 6. Resolution | IssueReconciler | Success -> Resolved (invalidates dedup). Failure -> re-analysis with failure context (different strategy) -> up to maxAttempts -> Escalated. |
| 7. PostMortem | IssueReconciler | All remediations (not just agentic) generate PostMortem CR with timeline, root cause, lessons, metrics, git correlation, cascade chain, trending (recurring incidents), dev feedback. Successful remediations also generate reusable Runbooks (one per root cause, hash-based naming). |
Issue State Machine
Create Secret with API Keys
Before creating an Instance, you need a Secret with the LLM provider API keys. The Instance references this Secret via apiKeys.name; without it, the server cannot call the AI.
- OpenAI
- Anthropic (Claude)
- Google AI
- OpenRouter
- Multiple providers
- Via YAML
CRD: Instance
The Instance manages ChatCLI server instances in the cluster.
Complete Specification
Spec Fields
Root
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| replicas | int32 | No | 1 | Number of server replicas |
| provider | string | Yes | | LLM provider |
| model | string | No | | LLM model |
| image | ImageSpec | No | | Image configuration |
| server | ServerSpec | No | | gRPC server configuration |
| watcher | WatcherSpec | No | | K8s Watcher configuration |
| resources | ResourceRequirements | No | | CPU and memory requests/limits |
| persistence | PersistenceSpec | No | | Session persistence |
| securityContext | PodSecurityContext | No | nonroot/1000 | Pod security context |
| fallback | FallbackSpec | No | | LLM provider failover chain |
| apiKeys | SecretRefSpec | No | | Secret with API keys (all providers in fallback chain) |
| aiops | AIOpsSpec | No | | Autonomous incident management pipeline configuration |
AIOpsSpec
Configures the automatic remediation pipeline. All fields are optional with sensible defaults. AI auto-generated runbooks inherit maxRemediationAttempts from this configuration.
| Field | Type | Required | Default | Range | Description |
|---|---|---|---|---|---|
| maxRemediationAttempts | int32 | No | 5 | 1-10 | Maximum remediation attempts before escalating to human |
| resolutionCooldownMinutes | int32 | No | 10 | 0-120 | Minutes after resolving before accepting new anomalies for the same resource |
| dedupTTLMinutes | int32 | No | 60 | 5-1440 | How long (min) the dedup cache retains alert hashes |
| enableAutoResolve | bool | No | true | | Auto-resolve Escalated issues when the resource recovers |
| agenticMaxSteps | int32 | No | 10 | 3-30 | Maximum steps per agentic remediation attempt (each step = 1 AI call) |
In agentic mode, the postmortem includes the full AI reasoning for each step: which action was chosen, why, and the observed result. This ensures a complete audit trail of autonomous AI decisions.
FallbackSpec
Configures automatic failover between LLM providers. When the primary provider fails (rate limit, timeout, server error), the system automatically tries the next provider in the chain.
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| enabled | bool | Yes | | Activates the fallback chain |
| providers | []FallbackProviderEntry | Yes | | Ordered list of fallback providers (first = highest priority) |
| maxRetries | int32 | No | 2 | Retries per provider before moving to next |
| cooldownBase | string | No | "30s" | Initial cooldown after failure (exponential backoff) |
| cooldownMax | string | No | "5m" | Maximum cooldown duration |
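The cooldown fields above describe an exponential backoff bounded by a ceiling. The following is a minimal sketch of that behavior, not the operator's implementation: the doubling factor, the duration parser, and the helper names (parse_duration, cooldown_seconds, next_provider) are assumptions, since the source only states "exponential backoff".

```python
def parse_duration(text: str) -> float:
    """Parse a simple duration such as "30s" or "5m" into seconds."""
    units = {"s": 1, "m": 60, "h": 3600}
    return float(text[:-1]) * units[text[-1]]


def cooldown_seconds(failures: int, base: str = "30s", maximum: str = "5m") -> float:
    """Cooldown after `failures` consecutive failures (1-based).

    Assumption: the delay doubles per failure and is capped at `maximum`.
    """
    if failures < 1:
        return 0.0
    delay = parse_duration(base) * 2 ** (failures - 1)
    return min(delay, parse_duration(maximum))


def next_provider(providers, is_available):
    """Return the first provider, in priority order, that is not cooling down."""
    for name in providers:
        if is_available(name):
            return name
    return None
```

With the documented defaults, this sketch yields cooldowns of 30s, 60s, 120s, 240s, and then stays at the 5-minute ceiling.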
FallbackProviderEntry
| Field | Type | Required | Description |
|---|---|---|---|
| name | string | Yes | Provider name: OPENAI, OPENAI_ASSISTANT, CLAUDEAI, BEDROCK, GOOGLEAI, XAI, ZAI, MINIMAX, MOONSHOT, OPENROUTER, STACKSPOT, OLLAMA, COPILOT, GITHUB_MODELS |
| model | string | No | LLM model for this provider |
WatcherSpec
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| enabled | bool | No | false | Enables the watcher |
| targets | []WatchTargetSpec | No | | List of resources to monitor (multi-target) |
| deployment | string | No | | Single deployment (legacy) |
| namespace | string | No | | Deployment namespace (legacy) |
| interval | string | No | "30s" | Collection interval |
| window | string | No | "2h" | Observation window |
| maxLogLines | int32 | No | 100 | Max log lines per pod |
| maxContextChars | int32 | No | 32000 | LLM context budget |
WatchTargetSpec
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| name | string | Yes* | | Resource name to monitor (e.g., postgres, fluentd) |
| deployment | string | No | | Deprecated alias for name, kept for backward compatibility |
| kind | string | No | Deployment | Resource kind: Deployment, StatefulSet, DaemonSet, Job, CronJob |
| namespace | string | Yes | | Resource namespace |
| metricsPort | int32 | No | 0 | Prometheus port (0 = disabled) |
| metricsPath | string | No | /metrics | Prometheus endpoint path |
| metricsFilter | []string | No | | Glob filters for metrics |
Resources Created by Instance
| Resource | Name | Description |
|---|---|---|
| Deployment | <name> | ChatCLI server pods |
| Service | <name> | gRPC Service (automatic headless when replicas > 1 for client-side LB) |
| ConfigMap | <name> | Environment variables (provider, model, etc.) |
| ConfigMap | <name>-watch-config | Multi-target YAML (if targets defined) |
| ServiceAccount | <name> | Identity for RBAC |
| Role | <name>-watcher | K8s watcher permissions (single-namespace) |
| RoleBinding | <name>-watcher | SA to Role binding (single-namespace) |
| ClusterRoleBinding | <namespace>-<name>-watcher | SA binding to the shared ClusterRole (multi-namespace) |
| PVC | <name>-sessions | Persistence (if enabled) |
gRPC Load Balancing
gRPC uses persistent HTTP/2 connections that pin to a single pod via kube-proxy, leaving extra replicas idle.
- 1 replica (default): Standard ClusterIP Service
- Multiple replicas: Headless Service (ClusterIP: None) is created automatically, enabling client-side round-robin via the gRPC dns:/// resolver
- Keepalive: WatcherBridge pings every 30s (5s timeout) to detect inactive pods quickly. The server accepts pings with a minimum interval of 20s (EnforcementPolicy.MinTime)
- Transition: When scaling from 1 to 2+ replicas (or back), the operator deletes and recreates the Service automatically (ClusterIP is immutable in Kubernetes)
Automatic RBAC
- Same namespace (all targets in the same namespace as the Instance): Creates per-Instance Role + RoleBinding
- Cross-namespace (targets in a different namespace than the Instance, or in multiple namespaces): Creates only a per-Instance ClusterRoleBinding pointing at the shared chatcli-watcher ClusterRole (pre-provisioned by the Helm chart / kustomize overlay)
- On CR deletion, the finalizer removes the ClusterRoleBinding; the shared ClusterRole stays (owned by the release)
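The namespace rule above can be sketched as a small decision function. This is an illustration only: rbac_resources is a hypothetical helper, the resource names follow the "Resources Created by Instance" table, and treating an empty target list as the same-namespace case is an assumption.

```python
def rbac_resources(instance_namespace, target_namespaces, name):
    """Return (kind, name) pairs for the RBAC objects an Instance would need."""
    namespaces = set(target_namespaces)
    # Same-namespace mode: every target lives next to the Instance.
    if namespaces <= {instance_namespace}:
        return [("Role", f"{name}-watcher"), ("RoleBinding", f"{name}-watcher")]
    # Cross-namespace mode: only a binding to the shared chatcli-watcher ClusterRole.
    return [("ClusterRoleBinding", f"{instance_namespace}-{name}-watcher")]
```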
As of v1.139.0, the operator no longer creates ClusterRole resources at runtime (H5 hardening). Shared ClusterRoles (chatcli-watcher for the watcher and chatcli-role-{viewer,operator,admin,superadmin} for platform roles) are installed by the operator Helm chart. The operator's ServiceAccount carries the bind verb restricted to those exact names via resourceNames, preventing privilege escalation even if the operator is compromised.
Upgrading from v1.105.0: clusters with pre-existing multi-namespace Instances had a ClusterRoleBinding pointing to a per-Instance ClusterRole (legacy shape). Because roleRef is immutable in Kubernetes, a direct helm upgrade used to freeze the reconcile with cannot change roleRef. As of v1.139.0, the operator detects the divergent roleRef at the top of reconcileClusterRBAC, deletes the stale binding, and recreates it pointing at chatcli-watcher: a transparent migration with no manual intervention.
Server Image and Auto-Resolution
The server image tag (spec.image.tag) follows a three-step priority:
- Explicit pin in spec.image.tag: honored verbatim (GitOps-friendly).
- Omitted: the operator resolves it from the CHATCLI_OPERATOR_APP_VERSION env var, which the Helm chart injects automatically from .Chart.AppVersion. Effect: helm upgrade chatcli-operator rolls the server of every Instance that opted into auto-resolution, with no per-Instance patch.
- Fallback: latest when neither is present (e.g., make deploy without Helm).
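The three-step priority can be expressed as a short function. This is a sketch under the rules stated above, not the operator's code; resolve_image_tag is a hypothetical name, and the env parameter exists only to make the example testable.

```python
import os


def resolve_image_tag(spec_tag, env=None):
    """Resolve the server image tag using the documented priority order."""
    env = os.environ if env is None else env
    if spec_tag:  # 1. explicit pin in spec.image.tag
        return spec_tag
    app_version = env.get("CHATCLI_OPERATOR_APP_VERSION", "")
    if app_version:  # 2. version injected by the Helm chart
        return app_version
    return "latest"  # 3. fallback when neither is present
```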
Auto-Rollout on Configuration Changes
The operator monitors changes in ConfigMaps and Secrets referenced by the Instance and triggers rolling updates automatically via hash annotations on the PodTemplate:
| Annotation | Source | When It Changes |
|---|---|---|
| chatcli.io/watch-config-hash | ConfigMap <name>-watch-config | Watcher targets changed |
| chatcli.io/configmap-hash | ConfigMap <name> | Environment variables updated |
| chatcli.io/secret-hash | Secret referenced in apiKeys.name | API keys created or updated |
| chatcli.io/tls-hash | Secret referenced in server.tls.secretName | TLS certificates renewed |
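The hash-annotation technique works because any change to the PodTemplate, including an annotation value, causes the Deployment to roll. A minimal sketch follows; the source does not say which hash function or serialization the operator uses, so SHA-256 over canonical JSON and the helper names are assumptions.

```python
import hashlib
import json


def content_hash(data):
    """Hash a ConfigMap/Secret data map independently of key order."""
    canonical = json.dumps(data, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def pod_template_annotations(configmap_data, secret_data):
    """Build the annotations whose change triggers a rolling update."""
    return {
        "chatcli.io/configmap-hash": content_hash(configmap_data),
        "chatcli.io/secret-hash": content_hash(secret_data),
    }
```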
Secret and ConfigMap Observation
The operator watches (Watches) Secrets in the Instance namespace. When a Secret referenced in apiKeys.name or server.tls.secretName is created or updated, the reconciler is triggered automatically — even if the Secret did not exist when the Instance was created.
- ConfigMap and Secret envFrom: Marked as optional: true, allowing the Instance to be created before the Secret/ConfigMap
- Flexible deploy order: Namespace -> Instance -> Secret/ConfigMap (any order after the namespace)
AIOps Platform CRDs
Anomaly
Represents a raw signal detected by the WatcherBridge.
Anomaly Spec Fields
| Field | Type | Description |
|---|---|---|
| signalType | AnomalySignalType | Type of detected signal |
| source | AnomalySource | Detection origin (watcher, prometheus, manual) |
| severity | IssueSeverity | Signal severity |
| resource | ResourceRef | Affected K8s resource (kind, name, namespace) |
| description | string | Human-readable description of the problem |
| detectedAt | Time | Detection timestamp |
Signals Detected (21 types)
Watcher signals:
| AlertType (Server) | SignalType (Anomaly) | Description |
|---|---|---|
| HighRestartCount | pod_restart | Pod with many restarts (CrashLoopBackOff) |
| OOMKilled | oom_kill | Container terminated due to lack of memory |
| PodNotReady | pod_not_ready | Pod is not in the Ready state |
| DeploymentFailing | deploy_failing | Deployment with Available=False |
| SignalType | Description |
|---|---|
| error_rate | Elevated HTTP error rate |
| latency | Latency above threshold |
| cpu_high | Elevated CPU usage |
| memory_high | Elevated memory usage |
| disk_pressure | Node with DiskPressure condition (disk full or nearly full) |
| node_not_ready | Node with NotReady condition (kubelet unresponsive, network or hardware failure) |
| memory_pressure | Node with MemoryPressure condition (insufficient memory for new pods) |
| pid_pressure | Node with PIDPressure condition (excessive processes, fork bomb risk) |
| network_unavailable | Node with network unavailable (CNI failure or interface down) |
| pvc_pending | PVC in Pending state |
| ingress_error | Ingress controller errors |
| hpa_maxed | HPA at maximum replicas |
| job_failed | Job failed |
| cronjob_missed | CronJob missed its schedule |
| certificate_expiring | TLS certificate expiring |
| image_pull_error | Error pulling container image |
| crashloop_backoff | Pod in CrashLoopBackOff |
| helm_release_failed | Helm release in failed state |
| argocd_degraded | ArgoCD Application degraded |
| config_drift | Configuration drift detected |
Node Monitoring
The watcher automatically monitors the health of nodes where target pods are running. On each collection cycle, it:
- Identifies nodes via label selector from the target's pods
- Collects all 5 official Kubernetes conditions: Ready, DiskPressure, MemoryPressure, PIDPressure, NetworkUnavailable
- Collects node CPU/memory metrics (via metrics server)
- Counts active pods vs node pod capacity
- Checks if the node is cordoned (unschedulable)
| Condition | Severity | Signal | Available Action |
|---|---|---|---|
| Node NotReady | CRITICAL | node_not_ready | CordonNode, DrainNode |
| DiskPressure | CRITICAL | disk_pressure | CordonNode, DrainNode |
| MemoryPressure | CRITICAL | memory_high | CordonNode, DrainNode |
| PIDPressure | WARNING | node_not_ready | CordonNode |
| NetworkUnavailable | CRITICAL | node_not_ready | CordonNode, DrainNode |
| Cordoned (Unschedulable) | WARNING | node_not_ready | Informational |
| Pod capacity >90% | WARNING | node_not_ready | CordonNode |
Issue
Correlated incident that groups anomalies and manages the remediation lifecycle.
Issue States
| State | Description |
|---|---|
| Detected | Newly created issue, awaiting analysis |
| Analyzing | AIInsight created, awaiting AI response (or re-analysis with failure context) |
| Remediating | RemediationPlan in execution |
| Resolved | Successful remediation (dedup invalidated for recurrence detection) |
| Escalated | Max attempts reached or no available actions (dedup invalidated) |
| Failed | Terminal failure |
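The states above, combined with the pipeline's resolution rule (failure leads to re-analysis until maxAttempts, then escalation), can be sketched as a transition function. This is a simplified model for illustration: next_state and its parameters are hypothetical, and transitions such as auto-resolving an Escalated issue or entering Failed are left out.

```python
TERMINAL_STATES = {"Resolved", "Escalated", "Failed"}


def next_state(state, *, has_actions=True, remediation_succeeded=False,
               attempts=0, max_attempts=5):
    """Return the next Issue state for a simplified lifecycle.

    `attempts` counts remediation attempts made so far, including the
    one that just finished.
    """
    if state == "Detected":
        return "Analyzing"
    if state == "Analyzing":
        # No available actions means a human has to take over.
        return "Remediating" if has_actions else "Escalated"
    if state == "Remediating":
        if remediation_succeeded:
            return "Resolved"
        # Failure: re-analyze with failure context until attempts run out.
        return "Analyzing" if attempts < max_attempts else "Escalated"
    return state  # states in TERMINAL_STATES do not change here
```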
AIInsight
AI-generated root cause analysis with suggested actions for automatic remediation.
AIInsight Status Fields
| Field | Type | Description |
|---|---|---|
| analysis | string | AI-generated root cause analysis |
| confidence | float64 | Analysis confidence level (0.0-1.0) |
| recommendations | []string | Human-readable recommendations |
| suggestedActions | []SuggestedAction | Structured actions for automatic remediation |
| generatedAt | Time | When the analysis was generated |
SuggestedAction
| Field | Type | Description |
|---|---|---|
| name | string | Human-readable action name |
| action | string | Action type (54 action types available; see Action Types) |
| description | string | Explanation of why this action is needed |
| params | map[string]string | Action parameters (e.g., replicas: "4") |
RemediationPlan
Concrete remediation plan automatically generated from a Runbook or AI actions.
Automatic Rollback and State Protection
The operator implements an automatic rollback system that ensures unsuccessful remediations do not leave the cluster in a worse state than before. Before executing any action, the complete resource state is captured in a structured restorable snapshot.
Pre-Remediation Snapshot
Before the first action, the RollbackEngine captures a structured ResourceSnapshot with: replicas, container images, CPU/memory requests and limits, HPA state (min/max replicas), and node state (schedulable/unschedulable). Works for Deployments, StatefulSets, DaemonSets, Nodes, and HPAs.
Per-Action Checkpoint
In plans with multiple actions, an ActionCheckpoint is captured before each individual action. This makes it possible to know exactly which action modified what and at which point the plan failed.
Automatic Rollback on Action Failure
If any action fails during execution, the operator automatically restores the resource to the PreflightSnapshot state. Replicas, images, resource requests/limits, and HPA state are reverted. The plan transitions to RolledBack state (not Failed).
Rollback on Verification Timeout
If all actions execute successfully but the resource does not become healthy within 90 seconds (verification timeout), the operator also performs automatic rollback to the pre-remediation state.
| Resource | Captured Fields |
|---|---|
| Deployment | replicas, container images, CPU/memory requests+limits, restart annotation |
| StatefulSet | replicas, container images, CPU/memory requests+limits |
| DaemonSet | container images, CPU/memory requests+limits |
| Node | schedulable state (to revert cordon/drain) |
| HPA | minReplicas, maxReplicas |
| Field | Type | Description |
|---|---|---|
| preflightSnapshot | ResourceSnapshot | Complete resource state before any action |
| actionCheckpoints | []ActionCheckpoint | Checkpoint before each action with result (success/fail) |
| rollbackPerformed | bool | Whether automatic rollback was executed |
| rollbackResult | string | Description of what was reverted (replicas, images, resources) |
| postFailureHealthy | *bool | Whether the resource is healthy after rollback |
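The snapshot, checkpoint, and rollback flow described above can be sketched with plain dictionaries standing in for Kubernetes resources. This is an illustration, not the RollbackEngine: run_plan is a hypothetical function, the "Completed" state name is an assumption, and the 90-second health verification is omitted.

```python
import copy


def run_plan(resource, actions):
    """Apply actions in order; restore the preflight snapshot on failure."""
    preflight = copy.deepcopy(resource)  # state before any action
    checkpoints = []
    for action in actions:
        checkpoint = {
            "action": action.__name__,
            "before": copy.deepcopy(resource),  # per-action checkpoint
            "result": "success",
        }
        checkpoints.append(checkpoint)
        try:
            action(resource)
        except Exception as exc:
            checkpoint["result"] = f"fail: {exc}"
            # Roll back in place to the pre-remediation state.
            resource.clear()
            resource.update(copy.deepcopy(preflight))
            return {"state": "RolledBack", "rollbackPerformed": True,
                    "actionCheckpoints": checkpoints}
    return {"state": "Completed", "rollbackPerformed": False,
            "actionCheckpoints": checkpoints}
```

The per-action checkpoints record the state each action started from, which is what makes it possible to tell which action changed what before the failure.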
Action Types (54 types)
Workload:
| Type | Description | Parameters |
|---|---|---|
| ScaleDeployment | Adjusts the number of replicas | replicas |
| RestartDeployment | Rollout restart of the deployment | - |
| RollbackDeployment | Undoes rollout (previous, healthy, or specific revision) | toRevision (optional: previous, healthy, or number) |
| PatchConfig | Updates keys of a ConfigMap | configmap, key=value |
| AdjustResources | Adjusts CPU/memory requests/limits for containers | memory_limit, memory_request, cpu_limit, cpu_request, container |
| DeletePod | Removes the sickest pod (CrashLoop > restarts) | pod (optional; auto-selects the sickest) |
| RestartStatefulSetPod | Restart of StatefulSet pod (preserves identity/storage) | pod (optional; omit for rolling restart of entire StatefulSet) |
GitOps:
| Type | Description | Parameters |
|---|---|---|
| HelmRollback | Rollback of Helm release to previous revision | revision (optional, default: previous) |
| ArgoSyncApp | Trigger sync on ArgoCD Application | revision (optional, default: HEAD) |
Autoscaling:
| Type | Description | Parameters |
|---|---|---|
| AdjustHPA | Modifies min/max replicas or target utilization of HPA | minReplicas, maxReplicas, targetCPUUtilization |
Infra:
| Type | Description | Parameters |
|---|---|---|
| CordonNode | Marks node as unschedulable | node |
| DrainNode | Cordon + evict pods from node | node |
Storage:
| Type | Description | Parameters |
|---|---|---|
| ResizePVC | Expands PVC (expansion only, not reduction) | pvc, size (e.g., 20Gi) |
Security:
| Type | Description | Parameters |
|---|---|---|
| RotateSecret | Updates Secret values or copies from source | secret, sourceSecret or key=value |
Networking:
| Type | Description | Parameters |
|---|---|---|
| UpdateIngress | Modifies backend or annotations of Ingress | ingress, backendService, backendPort, annotation.* |
| PatchNetworkPolicy | Adds allowed ports to NetworkPolicy | networkPolicy, allowPort, protocol |
Advanced:
| Type | Description | Parameters |
|---|---|---|
| ApplyManifest | Applies JSON manifest from a ConfigMap | configmap, key |
| ExecDiagnostic | Executes a command from a read-only allowlist inside a pod | command (exact string; see allowlist) |
| Custom | Custom action (blocked by safety checks) | - |
ExecDiagnostic Allowlist
ExecDiagnostic does an exact-string match against a read-only command allowlist. Any variation (different flags, alternate host, etc.) is rejected with command "..." not in approved diagnostic commands whitelist.
Default approved commands (~90):
| Category | Commands |
|---|---|
| Process / shell | env, whoami, id, hostname, pwd, uname -a, uname -r, ps aux, ps -ef, top -b -n1 |
| Filesystem / resources | df -h, df -i, free -m, free -h, mount, uptime, ls -la /, ls -la /tmp, ls -la /var/log, du -sh /tmp, du -sh /var/log |
| Cgroups v2 (modern pod) | cat /sys/fs/cgroup/memory.max, memory.current, memory.events, memory.stat, cpu.max, cpu.stat |
| Cgroups v1 (legacy pod) | cat /sys/fs/cgroup/memory/memory.{limit_in_bytes,usage_in_bytes,oom_control,stat}, cat /sys/fs/cgroup/cpu/cpu.{cfs_quota_us,cfs_period_us,stat} |
| /proc introspection | cat /proc/1/{cgroup,status,limits,cmdline,environ}, cat /proc/{meminfo,cpuinfo,loadavg,version}, cat /proc/net/{tcp,udp,sockstat} |
| Network (read-only) | netstat -tlnp/-an/-rn, ss -tlnp/-an/-s, ip addr, ip -s link, ip route, ip -6 route, ip neigh, ifconfig, arp -a |
| DNS / resolver | cat /etc/{hosts,resolv.conf,nsswitch.conf}, nslookup/getent/dig/host kubernetes.default.svc.cluster.local, nslookup kube-dns.kube-system.svc.cluster.local |
| Health / metrics / pprof | curl -s localhost{,:8080}/{health,healthz,ready,readyz,live,livez}, curl -s localhost:{8080,8081,9090,9091}/metrics, curl -s localhost:9090/-/{ready,healthy}, curl -s localhost:6060/debug/pprof/{,goroutine,heap}?debug=1 |
| Envoy / Istio sidecar | curl -s localhost:15000/ready, curl -s localhost:{15020,15021}/healthz/ready |
| wget fallback (Alpine) | wget -qO- http://localhost{,:8080}/{health,healthz,metrics} |
| TCP reachability | nc -zv kubernetes.default.svc.cluster.local 443, nc -zv kube-dns.kube-system.svc.cluster.local 53 |
CHATCLI_ALLOWED_DIAGNOSTIC_COMMANDS env var (comma-separated, read once at startup):
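The exact-string check can be sketched as a set lookup. This is a hypothetical illustration: DEFAULT_COMMANDS holds only a small excerpt of the table above, the helper names are invented, and the sketch assumes the env var adds to the defaults (the source does not say whether it extends or replaces them).

```python
import os

# Small excerpt of the default allowlist, for illustration only.
DEFAULT_COMMANDS = frozenset({"env", "whoami", "df -h", "free -m", "ps aux", "ip addr"})


def load_allowlist(env=None):
    """Build the allowlist once, e.g. at startup (assumed: env var extends defaults)."""
    env = os.environ if env is None else env
    extra = env.get("CHATCLI_ALLOWED_DIAGNOSTIC_COMMANDS", "")
    return DEFAULT_COMMANDS | {c.strip() for c in extra.split(",") if c.strip()}


def check_command(command, allowlist):
    """Accept only exact matches; any variation of flags or hosts is rejected."""
    if command not in allowlist:
        raise PermissionError(
            f'command "{command}" not in approved diagnostic commands whitelist'
        )
    return command
```

Because the match is on the whole string, "df -h" passes while "df -h /" is rejected, which keeps arbitrary arguments out of the pod exec path.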
StatefulSet:
| Type | Description | Parameters |
|---|---|---|
| ScaleStatefulSet | Ordered replica scaling | replicas |
| RestartStatefulSet | Rolling restart via annotation | - |
| RollbackStatefulSet | Rollback via ControllerRevision | toRevision |
| AdjustStatefulSetResources | Adjusts CPU/memory | container, memory_limit, cpu_limit |
| DeleteStatefulSetPod | Deletes specific or unhealthiest pod | pod (optional) |
| ForceDeleteStatefulSetPod | Force-delete stuck pod (grace=0) | pod (REQUIRED) |
| UpdateStatefulSetStrategy | Changes updateStrategy | type, maxUnavailable |
| RecreateStatefulSetPVC | Deletes stuck PVC | pvc, confirm=true |
| PartitionStatefulSetUpdate | Canary partition | partition |
DaemonSet:
| Type | Description | Parameters |
|---|---|---|
| RestartDaemonSet | Rolling restart across all nodes | - |
| RollbackDaemonSet | Rollback via ControllerRevision | toRevision |
| AdjustDaemonSetResources | Adjusts CPU/memory | container, memory_limit, cpu_limit |
| DeleteDaemonSetPod | Deletes pod (optionally on specific node) | pod, node (optional) |
| UpdateDaemonSetStrategy | Changes update strategy | type, maxUnavailable, maxSurge |
| PauseDaemonSetRollout | Pauses rollout (maxUnavailable=0) | - |
| CordonAndDeleteDaemonSetPod | Cordons node + deletes pod | node (REQUIRED) |
Job:
| Type | Description | Parameters |
|---|---|---|
| RetryJob | Deletes failed Job + recreates | - |
| AdjustJobResources | Adjusts CPU/memory on template | container, memory_limit, cpu_limit |
| DeleteFailedJob | Cleans up failed Job | - |
| SuspendJob | Pauses Job (suspend=true) | - |
| ResumeJob | Resumes Job (suspend=false) | - |
| AdjustJobParallelism | Changes parallelism | parallelism |
| AdjustJobDeadline | Changes deadline | activeDeadlineSeconds |
| AdjustJobBackoffLimit | Changes backoff limit | backoffLimit |
| ForceDeleteJobPods | Force-deletes all Job pods | - |
CronJob:
| Type | Description | Parameters |
|---|---|---|
| SuspendCronJob | Pauses scheduling | - |
| ResumeCronJob | Resumes scheduling | - |
| TriggerCronJob | Creates Job immediately | - |
| AdjustCronJobResources | Adjusts CPU/memory on jobTemplate | container, memory_limit, cpu_limit |
| AdjustCronJobSchedule | Changes schedule | schedule |
| AdjustCronJobDeadline | Changes deadline | startingDeadlineSeconds |
| AdjustCronJobHistory | Changes history limits | successfulJobsHistoryLimit, failedJobsHistoryLimit |
| AdjustCronJobConcurrency | Changes concurrency policy | concurrencyPolicy |
| DeleteCronJobActiveJobs | Kills running Jobs | - |
| ReplaceCronJobTemplate | Replaces template from ConfigMap | configmap, key |
RemediationPlan Examples with New Actions
- GitOps: HelmRollback
- GitOps: ArgoSyncApp
- StatefulSet + HPA
- Infra: Node Drain
- Storage + Security
- Networking
- Advanced: Manifest + Diagnostic
Runbook (Manual or Auto-generated)
Operational procedures. Manual Runbooks have priority over everything. When there is no manual Runbook, the AI automatically generates a reusable Runbook CR from the suggested actions.
- Manual Runbook
- AI Auto-generated Runbook
- Runbook: Helm + ArgoCD
- Runbook: StatefulSet + Storage
RemediationPlan (Agentic Mode)
When there is no manual Runbook or AI-suggested actions, the operator creates an agentic plan. The AI acts as an agent with Kubernetes skills in an observe-decide-act loop:
Safety Guards: Maximum of 10 steps (configurable via agenticMaxSteps), timeout of 10 minutes. If an action fails, the observation reports "FAILED: error" and the loop continues: the AI receives the feedback and adapts.
- PostMortem CR with timeline, root cause, impact, lessons learned
- Reusable Runbook CR with successful steps (label source=agentic)
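The observe-decide-act loop with its step limit can be sketched as follows. This is a simplified model, not the operator's code: agentic_loop and its callback parameters (decide, act, is_healthy) are hypothetical, and the 10-minute timeout is left out for brevity.

```python
def agentic_loop(decide, act, is_healthy, max_steps=10):
    """Run decide/act steps until the resource is healthy or steps run out."""
    history = []
    for step in range(1, max_steps + 1):
        action = decide(history)  # the AI sees every earlier observation
        try:
            observation = act(action)
        except Exception as exc:
            # A failed action does not abort the loop; it becomes feedback.
            observation = f"FAILED: {exc}"
        history.append({"step": step, "action": action, "observation": observation})
        if is_healthy():
            return {"resolved": True, "history": history}
    return {"resolved": False, "history": history}
```

The recorded history is the kind of per-step trail (action, reason, observed result) that a postmortem can later draw on.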
PostMortem (Auto-generated)
Incident report automatically generated after any remediation resolution (standard or agentic). Contains the complete incident history: detection, analysis, executed actions, resolution, plus metrics, git correlation, cascade chain, recurring incident trending, and developer feedback field.
PostMortem Status Fields
| Field | Type | Description |
|---|---|---|
| state | PostMortemState | State: Open, InReview, Closed |
| summary | string | AI-generated incident summary |
| rootCause | string | Root cause determined by AI |
| impact | string | Incident impact |
| timeline | []TimelineEvent | Timeline (detected, analyzed, action_executed, resolved) |
| actionsExecuted | []ActionRecord | Executed actions with result |
| lessonsLearned | []string | Lessons learned |
| preventionActions | []string | Suggested preventive actions |
| duration | string | Total incident duration |
| generatedAt | Time | When the PostMortem was generated |
| reviewedAt | Time | When the PostMortem was reviewed by a human |
| metricSnapshots | []MetricSnapshot | Prometheus metrics captured before/during/after the incident |
| blastRadius | []BlastRadiusEntry | Services and resources impacted by the incident |
| gitCorrelation | GitCorrelation | Suspect commit correlated with the incident (SHA, author, files, confidence) |
| sliImpact | string | Impact on SLIs and error budgets |
| errorBudgetBurned | float64 | Percentage of error budget consumed |
| trending | TrendingInfo | Recurring pattern information (count, window, related PostMortems) |
| feedback | DevFeedback | Human feedback (root cause override, accuracy 1-5, comments) |
| gitOpsContext | string | Helm/ArgoCD/Flux state at the time of the incident |
| logAnalysisSummary | string | Summary of log analysis findings |
| cascadeChain | []string | Cascade failure chain if applicable |
Runbook Matching (Tiered)
Remediation Priority
SourceRepository (Code-Aware Diagnostics)
Links a Kubernetes workload to its source code repository. When configured, the AI receives code context during incident analysis: recent commits correlated with the timestamp, code snippets referenced in stack traces, and configuration files (Dockerfile, values.yaml).
Input validation (hardening): spec.url only accepts the https://, ssh:// or git@host:path forms, and spec.branch is restricted to [A-Za-z0-9._/-] with no leading -, closing the git argument-injection vector (e.g. --upload-pack). Reads of code referenced by stack traces are confined to the repository clone via os.Root (no traversal, no symlink escape). Out-of-pattern URLs or branches fail the sync with an explicit status error.
- Token Auth (GitHub PAT)
- SSH Key
- Basic Auth
- Public Repo (no auth)
- StatefulSet (Database)
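The input validation rules for spec.url and spec.branch can be sketched with two small checks. The regular expressions and function names below are assumptions written to match the documented rules; the operator's actual patterns may be stricter.

```python
import re

# Branch characters allowed by the documented rule.
_BRANCH_RE = re.compile(r"[A-Za-z0-9._/-]+")
# scp-like form: git@host:path (assumed shape, for illustration).
_SCP_RE = re.compile(r"git@[^\s:/]+:\S+")


def valid_url(url):
    """Accept only https://, ssh:// or git@host:path forms."""
    return url.startswith(("https://", "ssh://")) or bool(_SCP_RE.fullmatch(url))


def valid_branch(branch):
    """Accept only the allowed characters, with no leading dash."""
    return bool(_BRANCH_RE.fullmatch(branch)) and not branch.startswith("-")
```

Rejecting a leading dash matters because git would otherwise parse a value such as --upload-pack=... as an option rather than a branch name.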
- Shallow clone of the repository (depth 50) and periodic sync
- Indexes detected languages, entrypoints (main.go, app.py, etc.), config files
- Temporal correlation: finds commits within 30 min before the incident
- Suspect commit: identifies the most likely commit to have caused the problem
- Code extraction: when stack traces reference files, extracts the relevant snippets
- Feed to AI: all context is included in the analysis prompt
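The temporal correlation step in the list above can be sketched as a window filter over commit timestamps. The 30-minute window comes from the text; ranking the most recent commit first as the leading suspect is an assumption, since the source does not describe how the suspect is chosen, and suspect_commits is a hypothetical name.

```python
from datetime import datetime, timedelta


def suspect_commits(commits, incident_at, window=timedelta(minutes=30)):
    """Return commits made within `window` before the incident, newest first."""
    in_window = [
        c for c in commits
        if incident_at - window <= c["at"] <= incident_at
    ]
    return sorted(in_window, key=lambda c: c["at"], reverse=True)


incident = datetime(2024, 1, 1, 12, 0)
example_commits = [
    {"sha": "a", "at": datetime(2024, 1, 1, 11, 45)},
    {"sha": "b", "at": datetime(2024, 1, 1, 11, 0)},   # too old
    {"sha": "c", "at": datetime(2024, 1, 1, 11, 58)},
    {"sha": "d", "at": datetime(2024, 1, 1, 12, 5)},   # after the incident
]
suspects = suspect_commits(example_commits, incident)
```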
The repository is cloned locally on the operator pod. For private repos, create a Secret with the key corresponding to the chosen authType and reference it in secretRef. The operator supports HTTPS repos (token/basic) and SSH (ssh-key).
Correlation Engine
The correlation engine groups anomalies into issues using:
Risk Scoring
Each signal type has a weight:
| Signal | Weight |
|---|---|
| oom_kill | 30 |
| error_rate | 25 |
| deploy_failing | 25 |
| latency_spike | 20 |
| pod_restart | 20 |
| pod_not_ready | 20 |
Severity Classification
| Risk Score | Severity |
|---|---|
| >= 80 | Critical |
| >= 60 | High |
| >= 40 | Medium |
| < 40 | Low |
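The two tables above can be combined into a small Go sketch. The weights and thresholds are taken from the tables; how the operator combines weights into a risk score is not stated, so summing the weights of distinct signal types is an assumption, as are the function names RiskScore and Severity.

```go
package main

import "fmt"

// signalWeights reproduces the documented weight per signal type.
var signalWeights = map[string]int{
	"oom_kill":       30,
	"error_rate":     25,
	"deploy_failing": 25,
	"latency_spike":  20,
	"pod_restart":    20,
	"pod_not_ready":  20,
}

// RiskScore sums the weight of each distinct signal type (assumption).
// Unknown signal types contribute 0.
func RiskScore(signals []string) int {
	seen := make(map[string]bool)
	score := 0
	for _, s := range signals {
		if seen[s] {
			continue
		}
		seen[s] = true
		score += signalWeights[s]
	}
	return score
}

// Severity maps a risk score to the documented severity bands.
func Severity(score int) string {
	switch {
	case score >= 80:
		return "Critical"
	case score >= 60:
		return "High"
	case score >= 40:
		return "Medium"
	default:
		return "Low"
	}
}

func main() {
	signals := []string{"oom_kill", "pod_restart", "error_rate"}
	score := RiskScore(signals)
	fmt.Println(score, Severity(score))
}
```

Under this summing assumption, an OOM kill alone (30) stays Low, while an OOM kill combined with restarts and an elevated error rate (75) is classified as High.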
Grouping
- Anomalies on the same resource (deployment + namespace) within the same time window are grouped into the same Issue
- Incident ID is deterministic: hash of resource + signal type (prevents duplicates)
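A deterministic incident ID of the kind described above could look like the following Go sketch. Only the idea (a hash of resource plus signal type) comes from the text; the choice of SHA-256, the "/" separator, the "inc-" prefix, and the 12-character truncation are assumptions for illustration.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
)

// IncidentID derives a stable identifier from the resource and signal type,
// so repeated anomalies for the same pair map to the same Issue.
func IncidentID(namespace, deployment, signalType string) string {
	key := strings.Join([]string{namespace, deployment, signalType}, "/")
	sum := sha256.Sum256([]byte(key))
	// Prefix and truncation length are illustrative choices.
	return "inc-" + hex.EncodeToString(sum[:])[:12]
}

func main() {
	a := IncidentID("prod", "checkout", "oom_kill")
	b := IncidentID("prod", "checkout", "oom_kill")
	c := IncidentID("prod", "checkout", "latency_spike")
	fmt.Println(a, a == b, a == c)
}
```

Because the ID has no timestamp or random component, creating the same Issue twice resolves to the same name, which is what prevents duplicates.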
WatcherBridge
The WatcherBridge is the component that connects the ChatCLI server to the operator:
- Polling: Queries GetAlerts from the server every 30 seconds
- Discovery: Locates the server via Instance CRs (first Instance with a ready gRPC endpoint)
- Dedup: SHA256 hash of type+deployment+namespace (no temporal component — a continuous problem generates only one Anomaly). 2-hour TTL
- Dedup invalidation: When an Issue reaches a terminal state (Resolved/Escalated), dedup entries for the resource are removed, allowing immediate recurrence detection
- Pruning: Removes expired hashes automatically (> 2h)
- Creation: Converts alerts to Anomaly CRs with valid K8s names
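The dedup, invalidation and pruning behaviour listed above can be sketched as a small in-memory cache in Go. The type and method names, the "|" separator inside the hash input, and measuring the 2-hour TTL from the first sighting are assumptions; only the hash inputs, the TTL value, and the invalidate-on-terminal-state rule come from the list.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"time"
)

// dedupTTL is the documented 2-hour lifetime of a dedup entry.
const dedupTTL = 2 * time.Hour

type dedupEntry struct {
	resource string    // namespace/deployment, used for invalidation
	seenAt   time.Time // when the Anomaly was created
}

// DedupCache is a hypothetical in-memory dedup store.
type DedupCache struct {
	entries map[string]dedupEntry
}

func NewDedupCache() *DedupCache {
	return &DedupCache{entries: make(map[string]dedupEntry)}
}

// alertHash has no temporal component, so a continuous problem hashes the same.
func alertHash(alertType, deployment, namespace string) string {
	sum := sha256.Sum256([]byte(alertType + "|" + deployment + "|" + namespace))
	return hex.EncodeToString(sum[:])
}

// ShouldCreate reports whether a new Anomaly should be created, and records it.
func (c *DedupCache) ShouldCreate(alertType, deployment, namespace string, now time.Time) bool {
	h := alertHash(alertType, deployment, namespace)
	if e, ok := c.entries[h]; ok && now.Sub(e.seenAt) <= dedupTTL {
		return false
	}
	c.entries[h] = dedupEntry{resource: namespace + "/" + deployment, seenAt: now}
	return true
}

// Prune removes entries older than the TTL and returns how many were removed.
func (c *DedupCache) Prune(now time.Time) int {
	removed := 0
	for h, e := range c.entries {
		if now.Sub(e.seenAt) > dedupTTL {
			delete(c.entries, h)
			removed++
		}
	}
	return removed
}

// Invalidate drops all entries for a resource, e.g. when its Issue reaches
// Resolved or Escalated, so a recurrence is detected immediately.
func (c *DedupCache) Invalidate(deployment, namespace string) {
	resource := namespace + "/" + deployment
	for h, e := range c.entries {
		if e.resource == resource {
			delete(c.entries, h)
		}
	}
}

// minutesIn is a demo helper: a fixed start time plus n minutes.
func minutesIn(n int) time.Time {
	return time.Date(2024, time.January, 1, 0, 0, 0, 0, time.UTC).Add(time.Duration(n) * time.Minute)
}

func main() {
	cache := NewDedupCache()
	fmt.Println(cache.ShouldCreate("oom_kill", "api", "prod", minutesIn(0)))  // first sighting
	fmt.Println(cache.ShouldCreate("oom_kill", "api", "prod", minutesIn(60))) // within TTL
	cache.Invalidate("api", "prod")                                           // Issue resolved
	fmt.Println(cache.ShouldCreate("oom_kill", "api", "prod", minutesIn(61))) // recurrence
}
```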
Usage Examples
- Minimal (no AIOps)
- Full AIOps
- With Fallback Multi-Provider
- Manual Runbook (optional)
- API Keys Secret
Status and Monitoring
Check Instances
Check Active Issues
Check AI Insights
Check Remediation Plans
Check PostMortems
Check Anomalies
Development
Security
The Operator implements multiple security layers by default, following the fail-closed principle (deny by default):
REST API Authentication
The REST API operates in fail-closed mode by default: there is no dev mode without authentication. Every request must include a valid X-API-Key header with a mapped role (viewer/operator/admin).
API keys are loaded with the following priority order and hot-reloaded every 30 seconds:
- Secret chatcli-operator-secrets (priority): api-keys field containing a YAML list of {key, role, description} entries
- ConfigMap chatcli-operator-config (fallback): same api-keys field
- Reject the request (or accept in dev-mode if CHATCLI_OPERATOR_DEV_MODE=true)
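The priority order above can be sketched in Go as follows. This is a hedged illustration rather than the operator's implementation: the APIKey struct, the SelectKeys and Authenticate functions, the rule that the Secret replaces the ConfigMap wholesale when it provides keys, and the empty role returned in dev mode are all assumptions. YAML parsing and the 30-second hot reload are left out.

```go
package main

import (
	"crypto/subtle"
	"fmt"
)

// APIKey mirrors one {key, role, description} entry of the api-keys list.
type APIKey struct {
	Key         string
	Role        string
	Description string
}

var validRoles = map[string]bool{"viewer": true, "operator": true, "admin": true}

// SelectKeys applies the documented priority: Secret entries first,
// ConfigMap entries only as a fallback.
func SelectKeys(fromSecret, fromConfigMap []APIKey) []APIKey {
	if len(fromSecret) > 0 {
		return fromSecret
	}
	return fromConfigMap
}

// Authenticate returns the mapped role and whether the request is accepted.
// With no keys from either source it fails closed unless devMode is set
// (standing in for CHATCLI_OPERATOR_DEV_MODE=true).
func Authenticate(header string, keys []APIKey, devMode bool) (string, bool) {
	if len(keys) == 0 {
		return "", devMode // role granted in dev mode is not documented
	}
	if header == "" {
		return "", false
	}
	for _, k := range keys {
		// Constant-time comparison is a design choice of this sketch.
		if subtle.ConstantTimeCompare([]byte(header), []byte(k.Key)) == 1 && validRoles[k.Role] {
			return k.Role, true
		}
	}
	return "", false
}

func main() {
	secret := []APIKey{{Key: "s-admin", Role: "admin", Description: "on-call"}}
	configMap := []APIKey{{Key: "c-view", Role: "viewer", Description: "dashboards"}}
	keys := SelectKeys(secret, configMap)
	fmt.Println(Authenticate("s-admin", keys, false))
	fmt.Println(Authenticate("c-view", keys, false))
	fmt.Println(Authenticate("anything", nil, false))
}
```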
There are two distinct Secrets in this project: do not confuse chatcli-operator-secrets with the LLM provider keys consumed by the chatcli server (chatcli-api-keys, referenced via Instance.spec.apiKeys.name). See the comparison table in Security — Operator Authentication.
The chatcli-operator-secrets Secret must live in the same namespace as the operator pod (the controller resolves it via the POD_NAMESPACE env var / ServiceAccount namespace file, falling back to chatcli-system). If you ran helm install --namespace <X>, create the Secret in <X>.
Resource Type Allowlist
The Operator classifies Kubernetes resource types into two categories:
- 17 Safe Types (allowed)
- 18 Dangerous Types (blocked)
Safe types: Pods, Deployments, StatefulSets, DaemonSets, Services, ConfigMaps, Ingresses, Jobs, CronJobs, ReplicaSets, Endpoints, PersistentVolumeClaims, HorizontalPodAutoscalers, NetworkPolicies, ServiceAccounts, Namespaces, Events.
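An allowlist check over the 17 types listed above might look like this in Go. The map and function names and the lowercase plural spelling are assumptions; the sketch simply denies anything not on the list, in line with the fail-closed principle.

```go
package main

import (
	"fmt"
	"strings"
)

// safeResourceTypes holds the 17 documented safe types, in lowercase plural
// form (the spelling convention is an assumption of this sketch).
var safeResourceTypes = map[string]struct{}{
	"pods": {}, "deployments": {}, "statefulsets": {}, "daemonsets": {},
	"services": {}, "configmaps": {}, "ingresses": {}, "jobs": {},
	"cronjobs": {}, "replicasets": {}, "endpoints": {},
	"persistentvolumeclaims": {}, "horizontalpodautoscalers": {},
	"networkpolicies": {}, "serviceaccounts": {}, "namespaces": {},
	"events": {},
}

// IsAllowed reports whether a resource type is on the allowlist.
// Anything not listed is denied.
func IsAllowed(resourceType string) bool {
	_, ok := safeResourceTypes[strings.ToLower(resourceType)]
	return ok
}

func main() {
	for _, rt := range []string{"Pods", "deployments", "secrets"} {
		fmt.Printf("%s allowed=%v\n", rt, IsAllowed(rt))
	}
}
```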
Log Scrubbing
Before sending application logs to the LLM for analysis, the Operator removes 18 sensitive patterns, including:
- JWT/Bearer tokens, API keys, passwords
- Email addresses, internal IPs, URLs with credentials
- Credit card numbers, SSNs, PEM certificates
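To illustrate the idea of scrubbing logs before they reach the LLM, here is a minimal Go sketch covering a handful of the categories above. The regular expressions, replacement markers, and the Scrub function are illustrative assumptions and are not the operator's actual 18 patterns.

```go
package main

import (
	"fmt"
	"regexp"
)

// scrubRule pairs an illustrative pattern with its replacement text.
type scrubRule struct {
	name        string
	pattern     *regexp.Regexp
	replacement string
}

// scrubRules are applied in order; more specific patterns come first.
var scrubRules = []scrubRule{
	{"pem", regexp.MustCompile(`(?s)-----BEGIN [A-Z ]+-----.*?-----END [A-Z ]+-----`), "[REDACTED_PEM]"},
	{"jwt", regexp.MustCompile(`\beyJ[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+`), "[REDACTED_JWT]"},
	{"bearer", regexp.MustCompile(`(?i)\bbearer\s+[A-Za-z0-9._~+/=-]+`), "Bearer [REDACTED]"},
	{"url-credentials", regexp.MustCompile(`://[^/\s:@]+:[^/\s@]+@`), "://[REDACTED]@"},
	{"password", regexp.MustCompile(`(?i)\b(password|passwd|pwd)\s*[=:]\s*\S+`), "${1}=[REDACTED]"},
	{"email", regexp.MustCompile(`\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b`), "[REDACTED_EMAIL]"},
}

// Scrub applies every rule to the log line and returns the redacted text.
func Scrub(line string) string {
	for _, r := range scrubRules {
		line = r.pattern.ReplaceAllString(line, r.replacement)
	}
	return line
}

func main() {
	fmt.Println(Scrub("login failed for alice@example.com password=hunter2"))
	fmt.Println(Scrub("dial postgres://admin:s3cret@db:5432/app: timeout"))
}
```

Rule order matters in a design like this: running the JWT and credential-URL rules before the generic email rule avoids partially redacted tokens.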
TLS and RBAC
- TLS 1.3 required on all ChatCLI server connections
- ClusterRoles with least privilege (read-only by default)
- NetworkPolicy configurable to restrict network traffic to the Operator namespace
- H5 hardening (no runtime RBAC escalation): the operator never creates or mutates ClusterRole/ClusterRoleBinding at runtime. Shared ClusterRoles (chatcli-watcher, chatcli-role-*) are pre-provisioned by the Helm chart, and the operator's SA holds the bind verb restricted to those exact names via resourceNames. A compromised operator cannot reference a more-privileged ClusterRole from a new ClusterRoleBinding.
Audit
- AuditEvent CRD for immutable audit trail (append-only)
- Structured logs with Request ID for correlation
- Integration with CHATCLI_AUDIT_LOG_PATH via extraEnv
Next Steps
- AIOps Platform: Deep-dive into the AIOps architecture
- K8s Watcher: Collection and budget details
- Server Mode: GetAlerts and AnalyzeIssue RPCs
- K8s Monitoring: Recipe: K8s Monitoring with AI