
Overview

The ChatCLI AIOps platform manages incidents through a well-defined state machine with 6 states for incidents and 6 states for remediation plans. Understanding this lifecycle is essential for operators who need to intervene when automatic remediation fails.

Incident States

| State | Description | Terminal? |
|---|---|---|
| Detected | Anomaly correlated into an incident, awaiting analysis | No |
| Analyzing | AI is performing root cause analysis | No |
| Remediating | A remediation plan is executing | No |
| Resolved | Incident successfully resolved | Yes |
| Escalated | All automatic retries exhausted — requires human intervention (auto-resolves if resource recovers) | Semi-terminal |
| Failed | Single attempt failed with no retry configured | Yes |

State Machine Flow

Detected → Analyzing → Remediating → Resolved
                            ↓
                    Failed / RolledBack
                            ↓
               Re-analyze (up to 5 attempts)
                            ↓  after max retries
                        Escalated
                            ↓
               Auto-resolve if the resource
               recovers (configurable)

Detection Phase (Detected)

When the watcher bridge detects anomalies, the correlation engine groups them into incidents:
  1. Signal scoring — each signal type has a weight (OOMKill=40, ErrorRate=30, PodRestart=25, etc.)
  2. Risk score calculation — aggregated from all correlated anomalies
  3. Severity determination — Critical (risk > 80), High (> 60), Medium (> 40), Low (otherwise)
  4. Incident ID generation — format: INC-YYYYMMDD-NNN
  5. Max remediation attempts set to 5 (default, configurable via the Instance CRD field aiops.maxRemediationAttempts)
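
The scoring steps above can be sketched as follows. This is illustrative, not the controller's actual code: the signal weights and severity thresholds come from this page, while the aggregation rule (sum of weights, capped at 100) and the default weight of 10 for unlisted signals are assumptions.

```python
# Illustrative sketch of risk scoring and severity determination.
# Weights/thresholds from the docs; aggregation rule is an assumption.

SIGNAL_WEIGHTS = {"OOMKill": 40, "ErrorRate": 30, "PodRestart": 25}

def risk_score(signals):
    """Aggregate weights of all correlated anomaly signals (capped at 100)."""
    return min(100, sum(SIGNAL_WEIGHTS.get(s, 10) for s in signals))

def severity(score):
    """Critical (> 80), High (> 60), Medium (> 40), Low otherwise."""
    if score > 80:
        return "Critical"
    if score > 60:
        return "High"
    if score > 40:
        return "Medium"
    return "Low"
```

For example, an incident correlating OOMKill + ErrorRate + PodRestart scores 40 + 30 + 25 = 95, which lands in the Critical band.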

Analysis Phase (Analyzing)

The system creates an AIInsight CR for AI-powered analysis. During this phase, ALL matching runbooks are injected into the AI context for validation.
  1. Runbook candidate discovery (tiered):
    • Tier 1: All runbooks matching SignalType + Severity + ResourceKind
    • Tier 2: Fallback on Severity + ResourceKind
    • Multiple runbooks can exist per trigger (different root causes produce different runbooks)
  2. AI validates candidates: The LLM receives all candidate runbooks and evaluates each against the current root cause analysis:
    • RUNBOOK_APPROVED: <name> → uses that specific runbook (fast path)
    • RUNBOOK_REJECTED → skips all candidates, uses AI suggestions or agentic mode
    • Neither → uses first candidate as default (backward compatibility)
  3. If no candidates exist and AI has suggested actions → generates a new runbook
  4. If no candidates and no AI actions → enters Agentic Mode (AI-driven step-by-step)
  5. Transitions to Remediating
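
The three validation outcomes above can be sketched like this. The RUNBOOK_APPROVED / RUNBOOK_REJECTED markers are documented on this page; the response parsing itself is an assumption about the implementation.

```python
# Sketch of the runbook-validation decision (parsing details are assumed).

def select_runbook(ai_response, candidates):
    """Return the runbook name to execute, or None for AI/agentic fallback."""
    for line in ai_response.splitlines():
        line = line.strip()
        if line.startswith("RUNBOOK_APPROVED:"):
            name = line.split(":", 1)[1].strip()
            if name in candidates:
                return name   # fast path: execute the approved runbook
        elif line.startswith("RUNBOOK_REJECTED"):
            return None       # skip all candidates; AI suggestions or agentic mode
    # Neither marker present: first candidate (backward compatibility)
    return candidates[0] if candidates else None
```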

Remediation Phase (Remediating)

The remediation controller executes the plan using a ReAct loop (Reason-Act-Observe):
  1. Pre-flight snapshot captured for rollback capability
  2. For each action in the plan:
    • OBSERVE — checks if the resource is already healthy (after the previous action). If so, stops immediately without executing remaining actions (early exit)
    • ACT — executes the action with checkpoint
    • If the action fails → automatic rollback to pre-flight state
  3. Final health verification (polls for up to 90 seconds)
  4. On success → Resolved + PostMortem generated
  5. On failure → automatic rollback attempted → re-analyze with failure context
Example: plan with 3 actions (AdjustResources + DeletePod + RollbackDeployment)

  Action 1: AdjustResources → SUCCESS (memory 1Mi → 64Mi)
  Action 2: OBSERVE → resource healthy (ReadyReplicas == Desired)
            → EARLY EXIT! Skips DeletePod and RollbackDeployment
            → Evidence: "Resource healthy after 1/3 actions — skipped remaining 2"
This prevents contradictory actions from being executed (e.g., AdjustResources followed by RollbackDeployment which would undo the fix) and reduces operational impact to the minimum necessary.
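
The Observe-Act early-exit behavior can be sketched as follows, with `execute` and `is_healthy` standing in for the platform's real action executor and health check (all names here are illustrative):

```python
# Minimal sketch of the ReAct early-exit loop described above.

def run_plan(actions, execute, is_healthy):
    """Run actions in order, stopping as soon as the resource is healthy."""
    executed = []
    for i, action in enumerate(actions):
        # OBSERVE: after the first action, check health before acting again
        if i > 0 and is_healthy():
            return executed, f"Resource healthy after {i}/{len(actions)} actions"
        # ACT: a failure here would trigger rollback to the pre-flight snapshot
        if not execute(action):
            return executed, "rollback"
        executed.append(action)
    return executed, "verify"  # all actions ran → final health verification
```

With the 3-action plan from the example, AdjustResources restoring health ends the plan after 1/3 actions, and DeletePod / RollbackDeployment never run.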

Retry Mechanism

When remediation fails:
  • Attempt < MaxAttempts (5): The system re-analyzes with the failure context injected, potentially selecting a different runbook or strategy
  • All attempts exhausted: Transitions to Escalated
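
As a minimal sketch of this retry rule (attempt numbering assumed to be 1-based):

```python
# Sketch of the retry policy: re-analyze until MaxAttempts, then escalate.

def next_state(failed_attempt, max_attempts=5):
    """State after a failed remediation attempt."""
    if failed_attempt < max_attempts:
        return "Analyzing"   # re-analyze with failure context injected
    return "Escalated"       # all attempts exhausted
```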

Escalated State — What Operators Must Do

When an incident reaches Escalated, the system has exhausted all automatic options. Here’s what happens and what you need to do.

What the system does automatically:
  1. Triggers the EscalationPolicy matching the incident severity
  2. Sends notifications to L1 on-call (Slack, PagerDuty, etc.)
  3. If no acknowledgment within the configured timeout, escalates to L2, then L3
  4. Generates audit events for compliance
What operators must do:
  1. Acknowledge the incident (stops escalation progression):
    curl -X POST https://operator:8090/api/v1/incidents/INC-20260319-001/acknowledge \
      -H "X-API-Key: $API_KEY" \
      -d '{"acknowledgedBy": "your-email@company.com"}'
    
  2. Investigate and fix the issue manually
  3. Resolve the incident via one of three methods:
    Method 1: REST API (recommended for automation/scripts)
    curl -X POST https://operator:8090/api/v1/incidents/INC-20260319-001/resolve \
      -H "X-API-Key: $API_KEY" \
      -d '{"resolution": "Fixed memory leak in payment-service v2.4.1, deployed hotfix manually"}'
    
    Method 2: Web Dashboard
    Navigate to the incident detail page and click the “Resolve” button. Enter the resolution description in the dialog.
    Method 3: Kubernetes Direct (advanced)
    kubectl patch issue INC-20260319-001 -n production --type=merge \
      -p '{"status":{"state":"Resolved","resolution":"Manual fix applied"}}'
    

Auto-Resolve for Escalated Issues

When an incident reaches Escalated, the system continues monitoring the resource every 30 seconds. If the resource recovers (all replicas healthy), the issue is automatically resolved with the message:
“Auto-resolved: resource recovered while awaiting human intervention”
This handles cases where:
  • An operator fixes the issue manually (kubectl rollout undo, etc.) without using the API
  • The resource self-heals (e.g., transient network issue resolves)
  • A CI/CD pipeline deploys a fix while the incident is still open
Auto-resolve can be disabled via the Instance CRD: spec.aiops.enableAutoResolve: false
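
The auto-resolve decision can be sketched as follows. The resolution message comes from this page; the health predicate (ready replicas equal to desired) is an assumption based on the earlier ReadyReplicas example.

```python
# Sketch of the 30-second auto-resolve check for Escalated issues.

AUTO_RESOLVE_MSG = "Auto-resolved: resource recovered while awaiting human intervention"

def check_auto_resolve(state, ready_replicas, desired_replicas, enabled=True):
    """Return the resolution message if an Escalated issue can auto-close."""
    if enabled and state == "Escalated" and ready_replicas == desired_replicas:
        return AUTO_RESOLVE_MSG
    return None
```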

Configurable AIOps Parameters

All timing and retry parameters are configurable via the Instance CRD aiops section:
apiVersion: platform.chatcli.io/v1alpha1
kind: Instance
metadata:
  name: chatcli-prod
spec:
  provider: OPENAI
  model: gpt-5.4
  aiops:
    maxRemediationAttempts: 5     # default: 5, range: 1-10
    resolutionCooldownMinutes: 10 # default: 10, range: 0-120
    dedupTTLMinutes: 60           # default: 60, range: 5-1440
    enableAutoResolve: true       # default: true
    agenticMaxSteps: 10           # default: 10, range: 3-30
| Parameter | Default | Description |
|---|---|---|
| maxRemediationAttempts | 5 | How many times the AI can retry before escalating |
| resolutionCooldownMinutes | 10 | After resolving, how long to suppress new anomalies for the same resource |
| dedupTTLMinutes | 60 | How long the bridge dedup cache retains alert hashes |
| enableAutoResolve | true | Auto-resolve Escalated issues when the resource recovers |
| agenticMaxSteps | 10 | Maximum steps per agentic remediation attempt (range: 3-30) |
AI auto-generated runbooks (both standard and agentic) automatically inherit maxRemediationAttempts from the Instance configuration. Manually created runbooks via YAML or API use the CRD default (maxAttempts: 3) unless explicitly specified.

Remediation Plan States

Each incident may have multiple remediation plans (one per attempt):
| State | Description |
|---|---|
| Pending | Safety validation in progress |
| Executing | Actions being executed sequentially |
| Verifying | Post-action health check (up to 90s) |
| Completed | All actions succeeded and health verified |
| Failed | Action failed, no rollback possible |
| RolledBack | Action failed, successfully rolled back to pre-flight state |

Agentic Remediation Mode

When no runbook matches, the system uses AI-driven agentic remediation:
  1. AI proposes an action via the AgenticStep RPC
  2. Action is executed and the result is observed
  3. AI analyzes the observation and proposes the next action
  4. Loop continues until resolved or convergence detected
Safety guardrails:
  • Max steps: 10 (configurable via AgenticMaxSteps)
  • Max time: 10 minutes per agentic plan
  • Convergence detection:
    • Last 3 observations identical → force stop
    • Alternating A→B→A→B pattern → force stop
    • 5 consecutive failed actions → force stop
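
The convergence guardrails above can be sketched like this. The window sizes (3 identical observations, an A→B→A→B oscillation, 5 consecutive failures) come from this page; representing observations as comparable strings is an assumption.

```python
# Sketch of the agentic-loop convergence detection guardrails.

def should_stop(observations, consecutive_failures):
    """Return a force-stop reason, or None to continue the agentic loop."""
    if len(observations) >= 3 and len(set(observations[-3:])) == 1:
        return "last 3 observations identical"
    if len(observations) >= 4:
        a, b, c, d = observations[-4:]
        if a == c and b == d and a != b:
            return "alternating A-B-A-B pattern"
    if consecutive_failures >= 5:
        return "5 consecutive failed actions"
    return None
```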

Decision Engine Confidence Thresholds

The decision engine determines whether remediation can proceed automatically:
| Severity | Auto-Approve Threshold | Action |
|---|---|---|
| Low | Confidence >= 0.95 | Auto-execute |
| Medium | Confidence >= 0.85 | Auto-execute + notify |
| High | Confidence >= 0.80 | Requires approval |
| Critical | Always | Manual approval required |
Adjustments: Historical success rate, pattern match, time of day, and active issue count all modify the base confidence score.
Circuit breaker: If 3+ remediations failed in the last hour, auto-remediation is blocked entirely.
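
A sketch of the rules in the table above: the thresholds come from this page, while the sub-threshold fallback (requiring approval) and the assumption that `confidence` is the already-adjusted score are mine.

```python
# Sketch of the decision engine's auto-approval logic.

THRESHOLDS = {"Low": 0.95, "Medium": 0.85}

def decide(severity, confidence, failures_last_hour):
    if failures_last_hour >= 3:
        return "blocked"           # circuit breaker
    if severity == "Critical":
        return "manual-approval"   # always requires manual approval
    if severity == "High":
        return "requires-approval" # per the table, even at confidence >= 0.80
    if confidence >= THRESHOLDS[severity]:
        return "auto-execute+notify" if severity == "Medium" else "auto-execute"
    return "requires-approval"
```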

Rollback Engine

The rollback engine provides safety nets at two levels:
  1. Pre-flight snapshot — captured before ANY actions. Restores the entire resource state.
  2. Per-action checkpoints — captured before EACH action. Allows partial rollback.
Automatic rollback triggers:
  • Action execution fails
  • Health verification times out (90 seconds)
Supported rollback targets:
  • Deployment: replicas, container images, resource limits
  • StatefulSet: replicas, images, resources, partition
  • DaemonSet: images, resources, max unavailable
  • Job/CronJob: suspend, deadline, backoff limit, parallelism
  • Node: uncordon (restore schedulable)
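
The two-level safety net can be sketched with a dict-based resource model (illustrative only, not the operator's real types):

```python
# Sketch: one pre-flight snapshot for the whole plan, plus a checkpoint
# captured before each action.

import copy

def execute_with_rollback(resource, actions, apply):
    preflight = copy.deepcopy(resource)              # level 1: pre-flight snapshot
    checkpoints = []
    for action in actions:
        checkpoints.append(copy.deepcopy(resource))  # level 2: per-action checkpoint
        if not apply(resource, action):
            resource.clear()
            resource.update(preflight)               # failure → restore pre-flight state
            return "RolledBack"
    return "Completed"
```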

Remediation Action Types

The platform supports more than 50 remediation action types (53 listed below) across resource kinds:

Deployment (18 actions)

ScaleDeployment, RollbackDeployment, RestartDeployment, PatchConfig, AdjustResources, DeletePod, HelmRollback, ArgoSyncApp, AdjustHPA, RestartStatefulSetPod, CordonNode, DrainNode, ResizePVC, RotateSecret, ExecDiagnostic, UpdateIngress, PatchNetworkPolicy, ApplyManifest

StatefulSet (9 actions)

ScaleStatefulSet, RestartStatefulSet, RollbackStatefulSet, AdjustStatefulSetResources, DeleteStatefulSetPod, ForceDeleteStatefulSetPod, UpdateStatefulSetStrategy, RecreateStatefulSetPVC, PartitionStatefulSetUpdate

DaemonSet (7 actions)

RestartDaemonSet, RollbackDaemonSet, AdjustDaemonSetResources, DeleteDaemonSetPod, UpdateDaemonSetStrategy, PauseDaemonSetRollout, CordonAndDeleteDaemonSetPod

Job (9 actions)

RetryJob, AdjustJobResources, DeleteFailedJob, SuspendJob, ResumeJob, AdjustJobParallelism, AdjustJobDeadline, AdjustJobBackoffLimit, ForceDeleteJobPods

CronJob (10 actions)

SuspendCronJob, ResumeCronJob, TriggerCronJob, AdjustCronJobResources, AdjustCronJobSchedule, AdjustCronJobDeadline, AdjustCronJobHistory, AdjustCronJobConcurrency, DeleteCronJobActiveJobs, ReplaceCronJobTemplate

Runbook Learning System

Node Failure — Remediation Flow

When a node has problems, the watcher detects the condition and emits anomalies automatically:
Node MemoryPressure detected
  → Anomaly CR created (signal: memory_high, severity: critical)
    → Issue correlated with affected pods
      → AI analyzes: "Node worker-2 with MemoryPressure, echo-app pods impacted"
        → Remediation: CordonNode (prevent new pods) + DrainNode (evict existing pods)
          → Kubernetes re-schedules pods on healthy nodes
            → Verification: pods healthy on new nodes → Resolved
The CordonNode and DrainNode actions respect PodDisruptionBudgets and perform graceful eviction. Node context (CPU, memory, pod count, conditions) is included in the AI analysis, enabling more precise decisions.

The platform builds a library of learned strategies over time. Each successful remediation generates a reusable runbook that can be applied to future incidents with the same root cause.

How Runbooks Are Named

Runbook names include a hash of the AI’s root cause analysis, ensuring different causes produce different runbooks:
auto-{signal}-{severity}-{kind}-{hash}

Examples:
  auto-oom-kill-critical-deployment-a3f2b1  (cause: tail /dev/zero)
  auto-oom-kill-critical-deployment-c7d4e9  (cause: memory limit too low)
  auto-pod-not-ready-low-deployment-e8b3d2  (cause: bad image tag)
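
The naming scheme can be sketched as follows. Only the auto-{signal}-{severity}-{kind}-{hash} shape comes from this page; using a truncated SHA-256 of the root-cause text as the hash is an assumption.

```python
# Sketch of runbook naming: different root causes → different names.

import hashlib

def runbook_name(signal, severity, kind, root_cause):
    digest = hashlib.sha256(root_cause.encode()).hexdigest()[:6]
    return f"auto-{signal}-{severity}-{kind}-{digest}"

# Same trigger, different root causes, so the names differ:
name_a = runbook_name("oom-kill", "critical", "deployment", "tail /dev/zero")
name_b = runbook_name("oom-kill", "critical", "deployment", "memory limit too low")
```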

Multi-Runbook Selection

When multiple runbooks match the same trigger (signal + severity + kind), the AI receives ALL candidates and selects the most appropriate one:
New OOMKill incident on Deployment

3 candidate runbooks found (different root causes)

All 3 injected into AI context with their steps and descriptions

AI analyzes current root cause and responds:
  "RUNBOOK_APPROVED: auto-oom-kill-critical-deployment-c7d4e9"
  (because this incident is caused by low memory limits, matching that runbook)

Selected runbook executed → fast resolution without agentic loop
If none of the candidates match the current root cause, the AI responds with RUNBOOK_REJECTED, generates a new strategy from scratch, and a new runbook is created with a unique hash — expanding the library for future incidents.

Runbook Lifecycle

| Stage | What Happens |
|---|---|
| Created | Auto-generated after successful AI remediation |
| Matched | Found by trigger criteria (signal + severity + kind) |
| Validated | AI evaluates if the runbook fits the current root cause |
| Executed | Steps run sequentially with rollback capability |
| Library grows | Each new root cause adds a new runbook to the library |
Over time, the platform becomes faster and more accurate — common failure modes are resolved via runbooks (seconds) instead of full AI analysis (minutes).

PostMortem Generation

When an incident is resolved (automatically or manually), a PostMortem CR is auto-generated containing:
  • Timeline — chronological events from detection to resolution
  • Root cause analysis — AI-generated with confidence score
  • Actions executed — complete remediation history
  • Impact assessment — affected pods, services, SLOs
  • Lessons learned — AI recommendations for prevention
  • Git correlation — recent deployments that may have caused the issue
  • Cascade analysis — related incidents across services
PostMortems can be reviewed and closed via the Review PostMortem and Close PostMortem API endpoints.

SLA Integration

Each incident severity can have an SLA configuration:
  • Response time — max time from detection to first analysis
  • Resolution time — max time from detection to resolution
  • Business hours — optionally pause SLA clock outside business hours
  • Escalation policy — automatically triggered on SLA breach