
Overview

The ChatCLI AIOps platform manages incidents through a well-defined state machine with 6 states for incidents and 6 states for remediation plans. Understanding this lifecycle is essential for operators who need to intervene when automatic remediation fails.

Incident States

| State | Description | Terminal? |
|---|---|---|
| Detected | Anomaly correlated into an incident, awaiting analysis | No |
| Analyzing | AI is performing root cause analysis | No |
| Remediating | A remediation plan is executing | No |
| Resolved | Incident successfully resolved | Yes |
| Escalated | All automatic retries exhausted — requires human intervention (auto-resolves if resource recovers) | Semi-terminal |
| Failed | Single attempt failed with no retry configured | Yes |

State Machine Flow

Detected → Analyzing → Remediating → Resolved
                            ↓
                    Failed / RolledBack
                            ↓
               Re-analyze (up to 5 attempts)
                            ↓  after max retries
                        Escalated
                            ↓
               Auto-resolve if the resource
               recovers (configurable)

Detection Phase (Detected)

When the watcher bridge detects anomalies, the correlation engine groups them into incidents:
  1. Signal scoring — each signal type has a weight (OOMKill=40, ErrorRate=30, PodRestart=25, etc.)
  2. Risk score calculation — aggregated from all correlated anomalies
  3. Severity determination — Critical (risk > 80), High (> 60), Medium (> 40), Low (otherwise)
  4. Incident ID generation — format: INC-YYYYMMDD-NNN
  5. Max remediation attempts set to 5 (default, configurable via the Instance CRD field aiops.maxRemediationAttempts)
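
The scoring steps above can be sketched as follows. This is illustrative, not the controller's actual code: the signal weights and severity thresholds come from this page, while the aggregation rule (sum of weights, capped at 100) and the default weight of 10 for unlisted signals are assumptions.

```python
# Illustrative sketch of risk scoring and severity determination.
# Weights/thresholds from the docs; aggregation rule is an assumption.

SIGNAL_WEIGHTS = {"OOMKill": 40, "ErrorRate": 30, "PodRestart": 25}

def risk_score(signals):
    """Aggregate weights of all correlated anomaly signals (capped at 100)."""
    return min(100, sum(SIGNAL_WEIGHTS.get(s, 10) for s in signals))

def severity(score):
    """Critical (> 80), High (> 60), Medium (> 40), Low otherwise."""
    if score > 80:
        return "Critical"
    if score > 60:
        return "High"
    if score > 40:
        return "Medium"
    return "Low"
```

For example, an incident correlating OOMKill + ErrorRate + PodRestart scores 40 + 30 + 25 = 95, which lands in the Critical band.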

Analysis Phase (Analyzing)

The system creates an AIInsight CR for AI-powered analysis. During this phase, ALL matching runbooks are injected into the AI context for validation.
  1. Runbook candidate discovery (tiered):
    • Tier 1: All runbooks matching SignalType + Severity + ResourceKind
    • Tier 2: Fallback on Severity + ResourceKind
    • Multiple runbooks can exist per trigger (different root causes produce different runbooks)
  2. AI validates candidates: The LLM receives all candidate runbooks and evaluates each against the current root cause analysis:
    • RUNBOOK_APPROVED: <name> → uses that specific runbook (fast path)
    • RUNBOOK_REJECTED → skips all candidates, uses AI suggestions or agentic mode
    • Neither → uses first candidate as default (backward compatibility)
  3. If no candidates exist and AI has suggested actions → generates a new runbook
  4. If no candidates and no AI actions → enters Agentic Mode (AI-driven step-by-step)
  5. Transitions to Remediating
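
The three validation outcomes above can be sketched like this. The RUNBOOK_APPROVED / RUNBOOK_REJECTED markers are documented on this page; the response parsing itself is an assumption about the implementation.

```python
# Sketch of the runbook-validation decision (parsing details are assumed).

def select_runbook(ai_response, candidates):
    """Return the runbook name to execute, or None for AI/agentic fallback."""
    for line in ai_response.splitlines():
        line = line.strip()
        if line.startswith("RUNBOOK_APPROVED:"):
            name = line.split(":", 1)[1].strip()
            if name in candidates:
                return name   # fast path: execute the approved runbook
        elif line.startswith("RUNBOOK_REJECTED"):
            return None       # skip all candidates; AI suggestions or agentic mode
    # Neither marker present: first candidate (backward compatibility)
    return candidates[0] if candidates else None
```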

Remediation Phase (Remediating)

The remediation controller executes the plan using a ReAct loop (Reason-Act-Observe):
  1. Pre-flight snapshot captured for rollback capability
  2. For each action in the plan:
    • OBSERVE — checks if the resource is already healthy (after the previous action). If so, stops immediately without executing remaining actions (early exit)
    • ACT — executes the action with checkpoint
    • If the action fails → automatic rollback to pre-flight state
  3. Final health verification (polls for up to 90 seconds)
  4. On success → Resolved + PostMortem generated
  5. On failure → automatic rollback attempted → re-analyze with failure context
Example: plan with 3 actions (AdjustResources + DeletePod + RollbackDeployment)

  Action 1: AdjustResources → SUCCESS (memory 1Mi → 64Mi)
  Action 2: OBSERVE → resource healthy (ReadyReplicas == Desired)
            → EARLY EXIT! Skips DeletePod and RollbackDeployment
            → Evidence: "Resource healthy after 1/3 actions — skipped remaining 2"
This prevents contradictory actions from being executed (e.g., AdjustResources followed by RollbackDeployment which would undo the fix) and reduces operational impact to the minimum necessary.
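
The Observe-Act early-exit behavior can be sketched as follows, with `execute` and `is_healthy` standing in for the platform's real action executor and health check (all names here are illustrative):

```python
# Minimal sketch of the ReAct early-exit loop described above.

def run_plan(actions, execute, is_healthy):
    """Run actions in order, stopping as soon as the resource is healthy."""
    executed = []
    for i, action in enumerate(actions):
        # OBSERVE: after the first action, check health before acting again
        if i > 0 and is_healthy():
            return executed, f"Resource healthy after {i}/{len(actions)} actions"
        # ACT: a failure here would trigger rollback to the pre-flight snapshot
        if not execute(action):
            return executed, "rollback"
        executed.append(action)
    return executed, "verify"  # all actions ran → final health verification
```

With the 3-action plan from the example, AdjustResources restoring health ends the plan after 1/3 actions, and DeletePod / RollbackDeployment never run.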

Retry Mechanism

When remediation fails:
  • Attempt < MaxAttempts (5): The system re-analyzes with the failure context injected, potentially selecting a different runbook or strategy
  • All attempts exhausted: Transitions to Escalated
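
As a minimal sketch of this retry rule (attempt numbering assumed to be 1-based):

```python
# Sketch of the retry policy: re-analyze until MaxAttempts, then escalate.

def next_state(failed_attempt, max_attempts=5):
    """State after a failed remediation attempt."""
    if failed_attempt < max_attempts:
        return "Analyzing"   # re-analyze with failure context injected
    return "Escalated"       # all attempts exhausted
```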

Escalated State — What Operators Must Do

When an incident reaches Escalated, the system has exhausted all automatic options. Here’s what happens and what you need to do.

What the system does automatically:
  1. Triggers the EscalationPolicy matching the incident severity
  2. Sends notifications to L1 on-call (Slack, PagerDuty, etc.)
  3. If no acknowledgment within the configured timeout, escalates to L2, then L3
  4. Generates audit events for compliance
What operators must do:
  1. Acknowledge the incident (stops escalation progression):
    curl -X POST https://operator:8090/api/v1/incidents/INC-20260319-001/acknowledge \
      -H "X-API-Key: $API_KEY" \
      -d '{"acknowledgedBy": "your-email@company.com"}'
    
  2. Investigate and fix the issue manually
  3. Resolve the incident via one of three methods:
    Method 1: REST API (recommended for automation/scripts)
    curl -X POST https://operator:8090/api/v1/incidents/INC-20260319-001/resolve \
      -H "X-API-Key: $API_KEY" \
      -d '{"resolution": "Fixed memory leak in payment-service v2.4.1, deployed hotfix manually"}'
    
    Method 2: Web Dashboard
    Navigate to the incident detail page and click the “Resolve” button. Enter the resolution description in the dialog.
    Method 3: Kubernetes Direct (advanced)
    kubectl patch issue INC-20260319-001 -n production --type=merge \
      -p '{"status":{"state":"Resolved","resolution":"Manual fix applied"}}'
    

Auto-Resolve for Escalated Issues

When an incident reaches Escalated, the system continues monitoring the resource every 30 seconds. If the resource recovers (all replicas healthy), the issue is automatically resolved with the message:
“Auto-resolved: resource recovered while awaiting human intervention”
This handles cases where:
  • An operator fixes the issue manually (kubectl rollout undo, etc.) without using the API
  • The resource self-heals (e.g., transient network issue resolves)
  • A CI/CD pipeline deploys a fix while the incident is still open
Auto-resolve can be disabled via the Instance CRD: spec.aiops.enableAutoResolve: false
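
The auto-resolve decision can be sketched as follows. The resolution message comes from this page; the health predicate (ready replicas equal to desired) is an assumption based on the earlier ReadyReplicas example.

```python
# Sketch of the 30-second auto-resolve check for Escalated issues.

AUTO_RESOLVE_MSG = "Auto-resolved: resource recovered while awaiting human intervention"

def check_auto_resolve(state, ready_replicas, desired_replicas, enabled=True):
    """Return the resolution message if an Escalated issue can auto-close."""
    if enabled and state == "Escalated" and ready_replicas == desired_replicas:
        return AUTO_RESOLVE_MSG
    return None
```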

Configurable AIOps Parameters

All timing and retry parameters are configurable via the Instance CRD aiops section:
apiVersion: platform.chatcli.io/v1alpha1
kind: Instance
metadata:
  name: chatcli-prod
spec:
  provider: OPENAI
  model: gpt-5.4
  aiops:
    maxRemediationAttempts: 5     # default: 5, range: 1-10
    resolutionCooldownMinutes: 10 # default: 10, range: 0-120
    dedupTTLMinutes: 60           # default: 60, range: 5-1440
    enableAutoResolve: true       # default: true
    agenticMaxSteps: 10           # default: 10, range: 3-30
| Parameter | Default | Description |
|---|---|---|
| maxRemediationAttempts | 5 | How many times the AI can retry before escalating |
| resolutionCooldownMinutes | 10 | After resolving, how long to suppress new anomalies for the same resource |
| dedupTTLMinutes | 60 | How long the bridge dedup cache retains alert hashes |
| enableAutoResolve | true | Auto-resolve Escalated issues when the resource recovers |
| agenticMaxSteps | 10 | Maximum steps per agentic remediation attempt (range: 3-30) |
AI auto-generated runbooks (both standard and agentic) automatically inherit maxRemediationAttempts from the Instance configuration. Manually created runbooks via YAML or API use the CRD default (maxAttempts: 3) unless explicitly specified.

Remediation Plan States

Each incident may have multiple remediation plans (one per attempt):
| State | Description |
|---|---|
| Pending | Safety validation in progress |
| Executing | Actions being executed sequentially |
| Verifying | Post-action health check (up to 90s) |
| Completed | All actions succeeded and health verified |
| Failed | Action failed, no rollback possible |
| RolledBack | Action failed, successfully rolled back to pre-flight state |

Agentic Remediation Mode

When no runbook matches, the system uses AI-driven agentic remediation:
  1. AI proposes an action via the AgenticStep RPC
  2. Action is executed and the result is observed
  3. AI analyzes the observation and proposes the next action
  4. Loop continues until resolved or convergence detected
Safety guardrails:
  • Max steps: 10 (configurable via AgenticMaxSteps)
  • Max time: 10 minutes per agentic plan
  • Convergence detection:
    • Last 3 observations identical → force stop
    • Alternating A→B→A→B pattern → force stop
    • 5 consecutive failed actions → force stop
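
The convergence guardrails above can be sketched like this. The window sizes (3 identical observations, an A→B→A→B oscillation, 5 consecutive failures) come from this page; representing observations as comparable strings is an assumption.

```python
# Sketch of the agentic-loop convergence detection guardrails.

def should_stop(observations, consecutive_failures):
    """Return a force-stop reason, or None to continue the agentic loop."""
    if len(observations) >= 3 and len(set(observations[-3:])) == 1:
        return "last 3 observations identical"
    if len(observations) >= 4:
        a, b, c, d = observations[-4:]
        if a == c and b == d and a != b:
            return "alternating A-B-A-B pattern"
    if consecutive_failures >= 5:
        return "5 consecutive failed actions"
    return None
```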

Decision Engine Confidence Thresholds

The decision engine determines whether remediation can proceed automatically:
| Severity | Auto-Approve Threshold | Action |
|---|---|---|
| Low | Confidence >= 0.95 | Auto-execute |
| Medium | Confidence >= 0.85 | Auto-execute + notify |
| High | Confidence >= 0.80 | Requires approval |
| Critical | Always | Manual approval required |
Adjustments: Historical success rate, pattern match, time of day, and active issue count all modify the base confidence score.
Circuit breaker: If 3+ remediations failed in the last hour, auto-remediation is blocked entirely.
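
A sketch of the rules in the table above: the thresholds come from this page, while the sub-threshold fallback (requiring approval) and the assumption that `confidence` is the already-adjusted score are mine.

```python
# Sketch of the decision engine's auto-approval logic.

THRESHOLDS = {"Low": 0.95, "Medium": 0.85}

def decide(severity, confidence, failures_last_hour):
    if failures_last_hour >= 3:
        return "blocked"           # circuit breaker
    if severity == "Critical":
        return "manual-approval"   # always requires manual approval
    if severity == "High":
        return "requires-approval" # per the table, even at confidence >= 0.80
    if confidence >= THRESHOLDS[severity]:
        return "auto-execute+notify" if severity == "Medium" else "auto-execute"
    return "requires-approval"
```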

Rollback Engine

The rollback engine provides safety nets at two levels:
  1. Pre-flight snapshot — captured before ANY actions. Restores the entire resource state.
  2. Per-action checkpoints — captured before EACH action. Allows partial rollback.
Automatic rollback triggers:
  • Action execution fails
  • Health verification times out (90 seconds)
Supported rollback targets:
  • Deployment: replicas, container images, resource limits
  • StatefulSet: replicas, images, resources, partition
  • DaemonSet: images, resources, max unavailable
  • Job/CronJob: suspend, deadline, backoff limit, parallelism
  • Node: uncordon (restore schedulable)
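
The two-level safety net can be sketched with a dict-based resource model (illustrative only, not the operator's real types):

```python
# Sketch: one pre-flight snapshot for the whole plan, plus a checkpoint
# captured before each action.

import copy

def execute_with_rollback(resource, actions, apply):
    preflight = copy.deepcopy(resource)              # level 1: pre-flight snapshot
    checkpoints = []
    for action in actions:
        checkpoints.append(copy.deepcopy(resource))  # level 2: per-action checkpoint
        if not apply(resource, action):
            resource.clear()
            resource.update(preflight)               # failure → restore pre-flight state
            return "RolledBack"
    return "Completed"
```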

Remediation Action Types

The platform supports more than 50 remediation action types (53 listed below) across resource kinds:

Deployment (18 actions)

ScaleDeployment, RollbackDeployment, RestartDeployment, PatchConfig, AdjustResources, DeletePod, HelmRollback, ArgoSyncApp, AdjustHPA, RestartStatefulSetPod, CordonNode, DrainNode, ResizePVC, RotateSecret, ExecDiagnostic, UpdateIngress, PatchNetworkPolicy, ApplyManifest

StatefulSet (9 actions)

ScaleStatefulSet, RestartStatefulSet, RollbackStatefulSet, AdjustStatefulSetResources, DeleteStatefulSetPod, ForceDeleteStatefulSetPod, UpdateStatefulSetStrategy, RecreateStatefulSetPVC, PartitionStatefulSetUpdate

DaemonSet (7 actions)

RestartDaemonSet, RollbackDaemonSet, AdjustDaemonSetResources, DeleteDaemonSetPod, UpdateDaemonSetStrategy, PauseDaemonSetRollout, CordonAndDeleteDaemonSetPod

Job (9 actions)

RetryJob, AdjustJobResources, DeleteFailedJob, SuspendJob, ResumeJob, AdjustJobParallelism, AdjustJobDeadline, AdjustJobBackoffLimit, ForceDeleteJobPods

CronJob (10 actions)

SuspendCronJob, ResumeCronJob, TriggerCronJob, AdjustCronJobResources, AdjustCronJobSchedule, AdjustCronJobDeadline, AdjustCronJobHistory, AdjustCronJobConcurrency, DeleteCronJobActiveJobs, ReplaceCronJobTemplate

Runbook Learning System

Node Failure — Remediation Flow

When a node has problems, the watcher detects the condition and emits anomalies automatically:
Node MemoryPressure detected
  → Anomaly CR created (signal: memory_high, severity: critical)
    → Issue correlated with affected pods
      → AI analyzes: "Node worker-2 with MemoryPressure, echo-app pods impacted"
        → Remediation: CordonNode (prevent new pods) + DrainNode (evict existing pods)
          → Kubernetes re-schedules pods on healthy nodes
            → Verification: pods healthy on new nodes → Resolved
The CordonNode and DrainNode actions respect PodDisruptionBudgets and perform graceful eviction. Node context (CPU, memory, pod count, conditions) is included in the AI analysis, enabling more precise decisions.

The platform builds a library of learned strategies over time. Each successful remediation generates a reusable runbook that can be applied to future incidents with the same root cause.

How Runbooks Are Named

Runbook names include a hash of the AI’s root cause analysis, ensuring different causes produce different runbooks:
auto-{signal}-{severity}-{kind}-{hash}

Examples:
  auto-oom-kill-critical-deployment-a3f2b1  (cause: tail /dev/zero)
  auto-oom-kill-critical-deployment-c7d4e9  (cause: memory limit too low)
  auto-pod-not-ready-low-deployment-e8b3d2  (cause: bad image tag)
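
The naming scheme can be sketched as follows. Only the auto-{signal}-{severity}-{kind}-{hash} shape comes from this page; using a truncated SHA-256 of the root-cause text as the hash is an assumption.

```python
# Sketch of runbook naming: different root causes → different names.

import hashlib

def runbook_name(signal, severity, kind, root_cause):
    digest = hashlib.sha256(root_cause.encode()).hexdigest()[:6]
    return f"auto-{signal}-{severity}-{kind}-{digest}"

# Same trigger, different root causes, so the names differ:
name_a = runbook_name("oom-kill", "critical", "deployment", "tail /dev/zero")
name_b = runbook_name("oom-kill", "critical", "deployment", "memory limit too low")
```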

Multi-Runbook Selection

When multiple runbooks match the same trigger (signal + severity + kind), the AI receives ALL candidates and selects the most appropriate one:
New OOMKill incident on Deployment

3 candidate runbooks found (different root causes)

All 3 injected into AI context with their steps and descriptions

AI analyzes current root cause and responds:
  "RUNBOOK_APPROVED: auto-oom-kill-critical-deployment-c7d4e9"
  (because this incident is caused by low memory limits, matching that runbook)

Selected runbook executed → fast resolution without agentic loop
If none of the candidates match the current root cause, the AI responds with RUNBOOK_REJECTED, generates a new strategy from scratch, and a new runbook is created with a unique hash — expanding the library for future incidents.

Runbook Lifecycle

| Stage | What Happens |
|---|---|
| Created | Auto-generated after successful AI remediation |
| Matched | Found by trigger criteria (signal + severity + kind) |
| Validated | AI evaluates if the runbook fits the current root cause |
| Executed | Steps run sequentially with rollback capability |
| Library grows | Each new root cause adds a new runbook to the library |
Over time, the platform becomes faster and more accurate — common failure modes are resolved via runbooks (seconds) instead of full AI analysis (minutes).

PostMortem Generation

When an incident is resolved (automatically or manually), a PostMortem CR is auto-generated containing:
  • Timeline — chronological events from detection to resolution
  • Root cause analysis — AI-generated with confidence score
  • Actions executed — complete remediation history
  • Impact assessment — affected pods, services, SLOs
  • Lessons learned — AI recommendations for prevention
  • Git correlation — recent deployments that may have caused the issue
  • Cascade analysis — related incidents across services
PostMortems can be reviewed and closed via the Review PostMortem and Close PostMortem API endpoints.

SLA Integration

Each incident severity can have an SLA configuration:
  • Response time — max time from detection to first analysis
  • Resolution time — max time from detection to resolution
  • Business hours — optionally pause SLA clock outside business hours
  • Escalation policy — automatically triggered on SLA breach