Overview
The ChatCLI AIOps platform manages incidents through a well-defined state machine with 6 states for incidents and 6 states for remediation plans. Understanding this lifecycle is essential for operators who need to intervene when automatic remediation fails.
Incident States
| State | Description | Terminal? |
|---|---|---|
| Detected | Anomaly correlated into an incident, awaiting analysis | No |
| Analyzing | AI is performing root cause analysis | No |
| Remediating | A remediation plan is executing | No |
| Resolved | Incident successfully resolved | Yes |
| Escalated | All automatic retries exhausted — requires human intervention (auto-resolves if the resource recovers) | Semi-terminal |
| Failed | A single attempt failed with no retry configured | Yes |
State Machine Flow
```
Detected → Analyzing → Remediating → Resolved
               ↓            ↓
           Escalated   Failed/RolledBack
           (after max       ↓
            retries)   Re-analyze (up to 5x)
                            ↓
                        Escalated
                            ↓
                 Auto-resolve if resource
                 recovers (configurable)
```
Detection Phase (Detected)
When the watcher bridge detects anomalies, the correlation engine groups them into incidents:
- Signal scoring — each signal type has a weight (OOMKill=40, ErrorRate=30, PodRestart=25, etc.)
- Risk score calculation — aggregated from all correlated anomalies
- Severity determination — Critical (risk > 80), High (> 60), Medium (> 40), Low (otherwise)
- Incident ID generation — format: `INC-YYYYMMDD-NNN`
- Max remediation attempts set to 5 (default, configurable via the Instance field `aiops.maxRemediationAttempts`)
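The scoring-to-severity mapping above can be sketched as follows. This is a minimal illustration: the function names, and the default weight of 10 for signal types not listed in the document, are assumptions rather than the engine's actual code.

```python
# Signal weights from the documentation (OOMKill=40, ErrorRate=30,
# PodRestart=25); the default of 10 for other signals is an assumption.
SIGNAL_WEIGHTS = {"OOMKill": 40, "ErrorRate": 30, "PodRestart": 25}

def risk_score(signals):
    """Aggregate the weights of all correlated signals, capped at 100."""
    return min(100, sum(SIGNAL_WEIGHTS.get(s, 10) for s in signals))

def severity(score):
    """Map an aggregated risk score to a severity bucket."""
    if score > 80:
        return "Critical"
    if score > 60:
        return "High"
    if score > 40:
        return "Medium"
    return "Low"

print(severity(risk_score(["OOMKill", "ErrorRate", "PodRestart"])))  # prints "Critical"
```

With all three signals the aggregated score is 95, which crosses the Critical threshold of 80.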
Analysis Phase (Analyzing)
The system creates an AIInsight CR for AI-powered analysis. During detection, ALL matching runbooks are injected into the AI context for validation.
- Runbook candidate discovery (tiered):
  - Tier 1: all runbooks matching SignalType + Severity + ResourceKind
  - Tier 2: fallback on Severity + ResourceKind
  - Multiple runbooks can exist per trigger (different root causes produce different runbooks)
- AI validates candidates: the LLM receives all candidate runbooks and evaluates each against the current root cause analysis:
  - `RUNBOOK_APPROVED: <name>` → uses that specific runbook (fast path)
  - `RUNBOOK_REJECTED` → skips all candidates, uses AI suggestions or agentic mode
  - Neither → uses the first candidate as default (backward compatibility)
- If no candidates exist and the AI has suggested actions → generates a new runbook
- If no candidates and no AI actions → enters Agentic Mode (AI-driven step-by-step)
- Transitions to Remediating
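The validation outcomes above can be sketched as a small decision function. The function name, return values, and the exact verdict-marker parsing are illustrative assumptions; only the `RUNBOOK_APPROVED`/`RUNBOOK_REJECTED` markers come from the document.

```python
# Hypothetical sketch of runbook-validation verdict handling; not the
# operator's actual API.
def select_runbook(ai_response, candidates):
    """Return (runbook_or_None, mode) from the AI's validation verdict."""
    for line in ai_response.splitlines():
        line = line.strip()
        if line.startswith("RUNBOOK_APPROVED:"):
            name = line.split(":", 1)[1].strip()
            if name in candidates:
                return name, "runbook"            # fast path
        elif line.startswith("RUNBOOK_REJECTED"):
            return None, "ai_or_agentic"          # skip all candidates
    if candidates:
        return candidates[0], "runbook"           # backward compatibility
    return None, "agentic"                        # no candidates, no verdict
```

For example, `select_runbook("RUNBOOK_REJECTED", ["rb-a"])` falls through to AI suggestions or agentic mode even though a candidate exists.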
Remediation Phase (Remediating)
The remediation controller executes the plan using a ReAct loop (Reason-Act-Observe):
- Pre-flight snapshot captured for rollback capability
- For each action in the plan:
  - OBSERVE — checks whether the resource is already healthy (after the previous action). If so, stops immediately without executing the remaining actions (early exit)
  - ACT — executes the action with a checkpoint
  - If the action fails → automatic rollback to the pre-flight state
- Final health verification (polls for up to 90 seconds)
- On success → Resolved + a PostMortem is generated
- On failure → automatic rollback attempted → re-analyze with failure context
Example: a plan with 3 actions (AdjustResources + DeletePod + RollbackDeployment):
```
Action 1: AdjustResources → SUCCESS (memory 1Mi → 64Mi)
Action 2: OBSERVE → resource healthy (ReadyReplicas == Desired)
          → EARLY EXIT! Skips DeletePod and RollbackDeployment
          → Evidence: "Resource healthy after 1/3 actions — skipped remaining 2"
```
This prevents contradictory actions from being executed (e.g., AdjustResources followed by RollbackDeployment which would undo the fix) and reduces operational impact to the minimum necessary.
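The observe-then-act loop with early exit can be sketched as follows. `is_healthy` and `run_action` are stand-ins for the operator's health check and action runner; the evidence string mirrors the example above.

```python
def execute_plan(actions, is_healthy, run_action):
    """Minimal ReAct-style executor sketch: observe before each action after
    the first, and exit early once the resource is healthy."""
    executed = []
    for i, action in enumerate(actions):
        if i > 0 and is_healthy():   # OBSERVE after the previous action
            return executed, (f"Resource healthy after {i}/{len(actions)} "
                              f"actions — skipped remaining {len(actions) - i}")
        if not run_action(action):   # ACT (rollback handled elsewhere)
            return executed, f"{action} failed — rollback"
        executed.append(action)
    return executed, "all actions executed"

# Demo: pretend the first fix restores health, as in the example above.
actions = ["AdjustResources", "DeletePod", "RollbackDeployment"]
state = {"healthy": False}
def run_action(action):
    state["healthy"] = True
    return True

done, evidence = execute_plan(actions, lambda: state["healthy"], run_action)
print(evidence)  # Resource healthy after 1/3 actions — skipped remaining 2
```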
Retry Mechanism
When remediation fails:
- Attempt < MaxAttempts (5): The system re-analyzes with the failure context injected, potentially selecting a different runbook or strategy
- All attempts exhausted: transitions to Escalated
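The retry transition can be summarized in one function. This is a sketch of the rules stated above; the function name is hypothetical, and the single-attempt Failed case follows the incident-state table.

```python
def after_failed_attempt(attempt, max_attempts=5):
    """Next incident state after a failed remediation attempt."""
    if max_attempts <= 1:
        return "Failed"    # single attempt, no retry configured → terminal
    # Below the cap, re-analyze with the failure context injected.
    return "Analyzing" if attempt < max_attempts else "Escalated"
```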
Escalated State — What Operators Must Do
When an incident reaches Escalated, the system has exhausted all automatic options. Here’s what happens and what you need to do:
What the system does automatically:
- Triggers the EscalationPolicy matching the incident severity
- Sends notifications to L1 on-call (Slack, PagerDuty, etc.)
- If no acknowledgment within the configured timeout, escalates to L2, then L3
- Generates audit events for compliance
What operators must do:
1. Acknowledge the incident (stops escalation progression):

   ```bash
   curl -X POST https://operator:8090/api/v1/incidents/INC-20260319-001/acknowledge \
     -H "X-API-Key: $API_KEY" \
     -d '{"acknowledgedBy": "your-email@company.com"}'
   ```

2. Investigate and fix the issue manually

3. Resolve the incident via one of three methods:
Method 1: REST API (recommended for automation/scripts)
```bash
curl -X POST https://operator:8090/api/v1/incidents/INC-20260319-001/resolve \
  -H "X-API-Key: $API_KEY" \
  -d '{"resolution": "Fixed memory leak in payment-service v2.4.1, deployed hotfix manually"}'
```
Method 2: Web Dashboard
Navigate to the incident detail page and click the “Resolve” button. Enter the resolution description in the dialog.
Method 3: Kubernetes Direct (advanced)
```bash
kubectl patch issue INC-20260319-001 -n production --type=merge \
  -p '{"status":{"state":"Resolved","resolution":"Manual fix applied"}}'
```
Auto-Resolve for Escalated Issues
When an incident reaches Escalated, the system continues monitoring the resource every 30 seconds. If the resource recovers (all replicas healthy), the issue is automatically resolved with the message:
“Auto-resolved: resource recovered while awaiting human intervention”
This handles cases where:
- An operator fixes the issue manually (kubectl rollout undo, etc.) without using the API
- The resource self-heals (e.g., transient network issue resolves)
- A CI/CD pipeline deploys a fix while the incident is still open
Auto-resolve can be disabled via the Instance CRD: `spec.aiops.enableAutoResolve: false`
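The recovery check that the 30-second monitor performs can be sketched as a predicate. The function name and the replica-based health definition ("all replicas healthy") are taken from the document; everything else is illustrative.

```python
def should_auto_resolve(state, ready, desired, enabled=True):
    """True when an Escalated issue can close itself: auto-resolve is
    enabled and the resource reports all desired replicas ready."""
    return enabled and state == "Escalated" and desired > 0 and ready >= desired
```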
Configurable AIOps Parameters
All timing and retry parameters are configurable via the Instance CRD aiops section:
```yaml
apiVersion: platform.chatcli.io/v1alpha1
kind: Instance
metadata:
  name: chatcli-prod
spec:
  provider: OPENAI
  model: gpt-5.4
  aiops:
    maxRemediationAttempts: 5     # default: 5, range: 1-10
    resolutionCooldownMinutes: 10 # default: 10, range: 0-120
    dedupTTLMinutes: 60           # default: 60, range: 5-1440
    enableAutoResolve: true       # default: true
```
| Parameter | Default | Description |
|---|---|---|
| maxRemediationAttempts | 5 | How many times the AI can retry before escalating |
| resolutionCooldownMinutes | 10 | After resolving, how long to suppress new anomalies for the same resource |
| dedupTTLMinutes | 60 | How long the bridge dedup cache retains alert hashes |
| enableAutoResolve | true | Auto-resolve Escalated issues when the resource recovers |
| agenticMaxSteps | 10 | Maximum steps per agentic remediation attempt (range: 3-30) |
AI auto-generated runbooks (both standard and agentic) automatically inherit maxRemediationAttempts from the Instance configuration. Manually created runbooks via YAML or API use the CRD default (maxAttempts: 3) unless explicitly specified.
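The inheritance rule above amounts to a small defaulting function. This sketch assumes the function name and parameters; the values (Instance default 5, CRD default 3) come from the document.

```python
def runbook_max_attempts(auto_generated, instance_max=5, explicit=None):
    """Auto-generated runbooks inherit the Instance setting; manually
    created runbooks fall back to the CRD default (3) unless the field
    is explicitly specified."""
    if explicit is not None:
        return explicit
    return instance_max if auto_generated else 3
```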
Remediation Plan States
Each incident may have multiple remediation plans (one per attempt):
| State | Description |
|---|---|
| Pending | Safety validation in progress |
| Executing | Actions being executed sequentially |
| Verifying | Post-action health check (up to 90s) |
| Completed | All actions succeeded and health verified |
| Failed | Action failed, no rollback possible |
| RolledBack | Action failed, successfully rolled back to pre-flight state |
Agentic Mode
When no runbook matches, the system uses AI-driven agentic remediation:
- AI proposes an action via the AgenticStep RPC
- Action is executed and the result is observed
- AI analyzes the observation and proposes the next action
- Loop continues until resolved or convergence detected
Safety guardrails:
- Max steps: 10 (configurable via `agenticMaxSteps`)
- Max time: 10 minutes per agentic plan
- Convergence detection:
  - Last 3 observations identical → force stop
  - Alternating A→B→A→B pattern → force stop
  - 5 consecutive failed actions → force stop
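The three convergence guardrails above can be sketched as a single check. The function name and return values are hypothetical; the three stop conditions are the ones listed.

```python
def should_force_stop(observations, consecutive_failures):
    """Return the guardrail that fired, or None to continue the loop."""
    if len(observations) >= 3 and len(set(observations[-3:])) == 1:
        return "last 3 observations identical"
    if len(observations) >= 4:
        a, b, c, d = observations[-4:]
        if a == c and b == d and a != b:
            return "alternating A→B→A→B pattern"
    if consecutive_failures >= 5:
        return "5 consecutive failed actions"
    return None
```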
Decision Engine Confidence Thresholds
The decision engine determines whether remediation can proceed automatically:
| Severity | Auto-Approve Threshold | Action |
|---|---|---|
| Low | Confidence >= 0.95 | Auto-execute |
| Medium | Confidence >= 0.85 | Auto-execute + notify |
| High | Confidence >= 0.80 | Requires approval |
| Critical | Always | Manual approval required |
Adjustments: Historical success rate, pattern match, time of day, and active issue count all modify the base confidence score.
Circuit breaker: If 3+ remediations failed in the last hour, auto-remediation is blocked entirely.
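The threshold table and circuit breaker combine into a single decision. This sketch omits the confidence adjustments (history, pattern match, time of day, active issue count); the function name and outcome labels are assumptions.

```python
THRESHOLDS = {"Low": 0.95, "Medium": 0.85, "High": 0.80}
OUTCOMES = {"Low": "auto-execute", "Medium": "auto-execute+notify",
            "High": "requires-approval"}

def decide(severity, confidence, failures_last_hour=0):
    """Decision-engine sketch: circuit breaker first, then severity rules."""
    if failures_last_hour >= 3:
        return "blocked"             # circuit breaker: no auto-remediation
    if severity == "Critical":
        return "manual-approval"     # Critical always needs a human
    if confidence < THRESHOLDS[severity]:
        return "manual-approval"
    return OUTCOMES[severity]
```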
Rollback Engine
The rollback engine provides safety nets at two levels:
- Pre-flight snapshot — captured before ANY actions. Restores the entire resource state.
- Per-action checkpoints — captured before EACH action. Allows partial rollback.
Automatic rollback triggers:
- Action execution fails
- Health verification times out (90 seconds)
Supported rollback targets:
- Deployment: replicas, container images, resource limits
- StatefulSet: replicas, images, resources, partition
- DaemonSet: images, resources, max unavailable
- Job/CronJob: suspend, deadline, backoff limit, parallelism
- Node: uncordon (restore schedulable)
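The two safety levels can be sketched as a small class. The class and method names are hypothetical; only the pre-flight snapshot and per-action checkpoint semantics come from the document.

```python
class RollbackEngine:
    """Two-level rollback sketch: one pre-flight snapshot of the whole
    resource, plus a checkpoint captured before each action."""

    def __init__(self, preflight):
        self.preflight = preflight   # full resource state before ANY action
        self.checkpoints = []        # state before EACH action

    def before_action(self, state):
        self.checkpoints.append(state)

    def restore(self, partial=False):
        # Partial rollback undoes only the last action; otherwise the
        # entire pre-flight state is restored.
        if partial and self.checkpoints:
            return self.checkpoints[-1]
        return self.preflight
```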
Remediation Action Types
The platform supports 46+ remediation action types across resource kinds:
Deployment (18 actions)
ScaleDeployment, RollbackDeployment, RestartDeployment, PatchConfig, AdjustResources, DeletePod, HelmRollback, ArgoSyncApp, AdjustHPA, RestartStatefulSetPod, CordonNode, DrainNode, ResizePVC, RotateSecret, ExecDiagnostic, UpdateIngress, PatchNetworkPolicy, ApplyManifest
StatefulSet (9 actions)
ScaleStatefulSet, RestartStatefulSet, RollbackStatefulSet, AdjustStatefulSetResources, DeleteStatefulSetPod, ForceDeleteStatefulSetPod, UpdateStatefulSetStrategy, RecreateStatefulSetPVC, PartitionStatefulSetUpdate
DaemonSet (7 actions)
RestartDaemonSet, RollbackDaemonSet, AdjustDaemonSetResources, DeleteDaemonSetPod, UpdateDaemonSetStrategy, PauseDaemonSetRollout, CordonAndDeleteDaemonSetPod
Job (9 actions)
RetryJob, AdjustJobResources, DeleteFailedJob, SuspendJob, ResumeJob, AdjustJobParallelism, AdjustJobDeadline, AdjustJobBackoffLimit, ForceDeleteJobPods
CronJob (10 actions)
SuspendCronJob, ResumeCronJob, TriggerCronJob, AdjustCronJobResources, AdjustCronJobSchedule, AdjustCronJobDeadline, AdjustCronJobHistory, AdjustCronJobConcurrency, DeleteCronJobActiveJobs, ReplaceCronJobTemplate
Node-Level Remediation
When a node has problems, the watcher detects the condition and emits anomalies automatically:
```
Node MemoryPressure detected
→ Anomaly CR created (signal: memory_high, severity: critical)
→ Issue correlated with affected pods
→ AI analyzes: "Node worker-2 with MemoryPressure, echo-app pods impacted"
→ Remediation: CordonNode (prevent new pods) + DrainNode (evict existing pods)
→ Kubernetes re-schedules pods on healthy nodes
→ Verification: pods healthy on new nodes → Resolved
```
The CordonNode and DrainNode actions respect PodDisruptionBudgets and perform graceful eviction. Node context (CPU, memory, pod count, conditions) is included in the AI analysis, enabling more precise decisions.
Runbook Learning System
The platform builds a library of learned strategies over time. Each successful remediation generates a reusable runbook that can be applied to future incidents with the same root cause.
How Runbooks Are Named
Runbook names include a hash of the AI’s root cause analysis, ensuring different causes produce different runbooks:
```
auto-{signal}-{severity}-{kind}-{hash}
```
Examples:
```
auto-oom-kill-critical-deployment-a3f2b1   (cause: tail /dev/zero)
auto-oom-kill-critical-deployment-c7d4e9   (cause: memory limit too low)
auto-pod-not-ready-low-deployment-e8b3d2   (cause: bad image tag)
```
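A sketch of the naming scheme follows. The document does not specify the hash algorithm or length, so SHA-256 truncated to 6 hex characters is an assumption; only the `auto-{signal}-{severity}-{kind}-{hash}` shape is documented.

```python
import hashlib

def runbook_name(signal, severity, kind, root_cause):
    """Derive a runbook name whose hash suffix varies with the AI's
    root-cause text, so different causes yield different runbooks.
    Hash algorithm and length are assumptions for illustration."""
    digest = hashlib.sha256(root_cause.encode()).hexdigest()[:6]
    return f"auto-{signal}-{severity}-{kind}-{digest}"

n1 = runbook_name("oom-kill", "critical", "deployment", "tail /dev/zero")
n2 = runbook_name("oom-kill", "critical", "deployment", "memory limit too low")
```

The prefix is identical for both, but the suffix differs, so each root cause gets its own entry in the library.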
Multi-Runbook Selection
When multiple runbooks match the same trigger (signal + severity + kind), the AI receives ALL candidates and selects the most appropriate one:
```
New OOMKill incident on Deployment
        ↓
3 candidate runbooks found (different root causes)
        ↓
All 3 injected into AI context with their steps and descriptions
        ↓
AI analyzes current root cause and responds:
"RUNBOOK_APPROVED: auto-oom-kill-critical-deployment-c7d4e9"
(because this incident is caused by low memory limits, matching that runbook)
        ↓
Selected runbook executed → fast resolution without agentic loop
```
If none of the candidates match the current root cause, the AI responds with RUNBOOK_REJECTED, generates a new strategy from scratch, and a new runbook is created with a unique hash — expanding the library for future incidents.
Runbook Lifecycle
| Stage | What Happens |
|---|---|
| Created | Auto-generated after successful AI remediation |
| Matched | Found by trigger criteria (signal + severity + kind) |
| Validated | AI evaluates if the runbook fits the current root cause |
| Executed | Steps run sequentially with rollback capability |
| Library grows | Each new root cause adds a new runbook to the library |
Over time, the platform becomes faster and more accurate — common failure modes are resolved via runbooks (seconds) instead of full AI analysis (minutes).
PostMortem Generation
When an incident is resolved (automatically or manually), a PostMortem CR is auto-generated containing:
- Timeline — chronological events from detection to resolution
- Root cause analysis — AI-generated with confidence score
- Actions executed — complete remediation history
- Impact assessment — affected pods, services, SLOs
- Lessons learned — AI recommendations for prevention
- Git correlation — recent deployments that may have caused the issue
- Cascade analysis — related incidents across services
PostMortems can be reviewed and closed via the Review PostMortem and Close PostMortem API endpoints.
SLA Integration
Each incident severity can have an SLA configuration:
- Response time — max time from detection to first analysis
- Resolution time — max time from detection to resolution
- Business hours — optionally pause SLA clock outside business hours
- Escalation policy — automatically triggered on SLA breach
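The response and resolution deadlines can be sketched as a breach check. This is a minimal illustration with hypothetical names, and it omits the business-hours clock pausing and the escalation-policy trigger.

```python
from datetime import datetime, timedelta

def sla_breaches(detected_at, now, first_analysis_at,
                 response_min, resolution_min):
    """Return which SLA clocks have been breached for an open incident.
    Business-hours pausing is omitted in this sketch."""
    breaches = []
    # Response SLA: detection → first analysis (use `now` if not analyzed yet).
    responded_at = first_analysis_at or now
    if responded_at > detected_at + timedelta(minutes=response_min):
        breaches.append("response")
    # Resolution SLA: detection → resolution.
    if now > detected_at + timedelta(minutes=resolution_min):
        breaches.append("resolution")
    return breaches

t0 = datetime(2026, 3, 19, 12, 0)
print(sla_breaches(t0, t0 + timedelta(minutes=90),
                   t0 + timedelta(minutes=20), 15, 60))
```

In the usage above both clocks are breached: analysis started 20 minutes after detection (limit 15) and the incident is still open after 90 minutes (limit 60).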