This cookbook shows the complete flow of a real incident on the ChatCLI AIOps platform — from automatic detection to the post-mortem with lessons learned.
## Anatomy of an Incident

### Scenario: OOMKill in Production

#### 1. Automatic Detection

The Watcher detects that payment-service pods are being OOMKilled:
```bash
# Check detected anomalies
kubectl get anomalies -n chatcli-system

NAME                          SOURCE    SIGNAL        CORRELATED   AGE
anom-payment-oom-1710856400   watcher   oom_kill      true         2m
anom-payment-mem-1710856410   watcher   memory_high   true         90s
```
#### 2. Issue Created

The CorrelationEngine groups the anomalies into an Issue:
```bash
kubectl get issues -n chatcli-system

NAME               SEVERITY   STATE       RISK   AGE
INC-20260319-001   high       Analyzing   65     90s
```
```bash
kubectl get issue INC-20260319-001 -o yaml -n chatcli-system
```

```yaml
spec:
  severity: high
  source: watcher
  signalType: oom_kill
  resource:
    kind: Deployment
    name: payment-service
    namespace: production
  description: "Correlated 2 anomalies: oom_kill, memory_high for Deployment/payment-service"
  riskScore: 65
  correlationId: "INC-20260319-001"
status:
  state: Analyzing
  detectedAt: "2026-03-19T15:20:00Z"
  remediationAttempts: 0
  maxRemediationAttempts: 5
```
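The grouping and risk scoring above can be sketched roughly as follows. This is an illustrative model only: the per-signal weights, the default weight of 10, and the cap at 100 are assumptions chosen so the example reproduces the riskScore of 65, not the CorrelationEngine's real code.

```python
from dataclasses import dataclass

@dataclass
class Anomaly:
    signal: str    # e.g. "oom_kill", "memory_high"
    resource: str  # e.g. "Deployment/payment-service"

# Assumed per-signal weights; the real weighting lives in the operator.
SIGNAL_WEIGHTS = {"oom_kill": 50, "memory_high": 15, "crash_loop": 40}

def correlate(anomalies: list) -> dict:
    """Group same-resource anomalies into one Issue with a capped risk score."""
    resource = anomalies[0].resource
    signals = [a.signal for a in anomalies if a.resource == resource]
    risk = min(100, sum(SIGNAL_WEIGHTS.get(s, 10) for s in signals))
    return {
        "resource": resource,
        "signalType": signals[0],
        "riskScore": risk,
        "description": (f"Correlated {len(signals)} anomalies: "
                        f"{', '.join(signals)} for {resource}"),
    }

issue = correlate([Anomaly("oom_kill", "Deployment/payment-service"),
                   Anomaly("memory_high", "Deployment/payment-service")])
print(issue["riskScore"])  # 65 with the weights assumed above
```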
#### 3. Notification Sent

Slack automatically receives:
> High Severity: OOM Kill Detected
>
> | Severity | Resource | Namespace | State |
> |----------|----------|-----------|-------|
> | high | payment-service | production | Analyzing |
>
> Issue: INC-20260319-001 | 2026-03-19T15:20:00Z
#### 4. AI Analyzes the Problem
```bash
kubectl get aiinsight INC-20260319-001-insight -o yaml -n chatcli-system
```

```yaml
status:
  analysis: |
    The payment-service deployment is suffering repeated OOMKills.
    The main container is using 490Mi of its 512Mi limit, leaving too
    little memory for load spikes. Revision 12 (current) introduced an
    in-memory cache that increased base consumption by ~100Mi vs revision 11.
  confidence: 0.88
  recommendations:
    - "Increase memory limit to 1Gi and request to 768Mi"
    - "Consider rollback to revision 11 if the increase doesn't resolve it"
  suggestedActions:
    - name: "Increase memory"
      action: "AdjustResources"
      description: "Pod is OOMKilled, increase memory limit"
      params:
        memory_limit: "1Gi"
        memory_request: "768Mi"
```
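A plausible sketch of how the operator might gate on the insight's confidence before acting. The 0.8 threshold and the state names below (other than "Analyzing", which appears in the Issue above) are assumptions for illustration, not confirmed operator states:

```python
# Hedged sketch: gate automatic remediation on AI confidence.
CONFIDENCE_THRESHOLD = 0.8  # hypothetical operator setting

def next_state(confidence: float, requires_approval: bool) -> str:
    """Pick the next Issue state once an AIInsight is available."""
    if confidence < CONFIDENCE_THRESHOLD:
        return "AwaitingHuman"  # too uncertain to remediate automatically
    return "AwaitingApproval" if requires_approval else "Remediating"

# The insight above has confidence 0.88 and production requires approval:
print(next_state(0.88, requires_approval=True))  # AwaitingApproval
```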
#### 5. Approval Required
The ApprovalPolicy requires approval for resource changes in production:
```bash
kubectl get approvalrequests -n chatcli-system

NAME                        ISSUE              PLAN                    STATE     RULE
INC-20260319-001-approval   INC-20260319-001   INC-20260319-001-plan   Pending   quorum-production
```
**Option A — Approve via Web Dashboard:**

Go to http://localhost:8090 -> Approvals -> Approve with reason.

**Option B — Approve via REST API:**

```bash
curl -X POST http://localhost:8090/api/v1/approvals/INC-20260319-001-approval/approve \
  -H "X-API-Key: operator-key" \
  -H "Content-Type: application/json" \
  -d '{"approver": "edilson", "reason": "AI analysis confidence 0.88, OOM confirmed in logs"}'
```

**Option C — Approve via kubectl:**

```bash
kubectl annotate approvalrequest INC-20260319-001-approval \
  -n chatcli-system \
  "platform.chatcli.io/approve=edilson:OOM confirmed"
```
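The rule name `quorum-production` suggests approvals are counted against a quorum. A minimal sketch of how such an evaluation might work; the quorum size of 2, the veto-on-rejection semantics, and the approver names other than "edilson" are all assumptions:

```python
def evaluate(approvals: list, rejections: list, quorum: int = 2) -> str:
    """Resolve an ApprovalRequest: any rejection vetoes, else count distinct approvers."""
    if rejections:
        return "Rejected"
    if len(set(approvals)) >= quorum:
        return "Approved"
    return "Pending"

print(evaluate(["edilson"], []))             # Pending: one approver, quorum of 2
print(evaluate(["edilson", "maria"], []))    # Approved (hypothetical second approver)
print(evaluate(["edilson"], ["security"]))   # Rejected: a rejection vetoes the plan
```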
#### 6. Remediation Executed

After approval, the RemediationReconciler executes:
```bash
kubectl get remediationplans -n chatcli-system

NAME                      ISSUE              ATTEMPT   STATE       AGE
INC-20260319-001-plan-1   INC-20260319-001   1         Verifying   30s
```
The controller:

1. Captures a structured ResourceSnapshot (replicas, images, CPU/memory requests+limits, HPA min/max)
2. Creates an ActionCheckpoint before each action
3. Applies AdjustResources (memory 512Mi -> 1Gi)
4. Waits 90s, verifying deployment health
5. Marks the plan Completed once readyReplicas >= desired
**Automatic protection:** If an action fails (e.g., an invalid `memory_limit`), the operator automatically restores the resource to the snapshot state (replicas, images, resources), and the plan transitions to RolledBack instead of Failed. If verification times out (90s without the deployment becoming healthy), the rollback also runs. The `postFailureHealthy` field confirms whether the resource returned to normal. This ensures that a remediation never leaves the cluster in a worse state than it found it.
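The snapshot/checkpoint/rollback loop described above can be sketched as follows. Function and field names (`remediate`, `action`, `is_healthy`) are illustrative stand-ins, not the operator's real API:

```python
import time

def remediate(resource: dict, action, is_healthy, timeout_s: int = 90) -> dict:
    """Apply one remediation action with checkpoint-based rollback."""
    snapshot = dict(resource)  # stand-in for the structured ResourceSnapshot
    try:
        action(resource)       # e.g. AdjustResources: memory 512Mi -> 1Gi
    except Exception:
        resource.clear()
        resource.update(snapshot)  # restore checkpoint on failure
        return {"state": "RolledBack", "postFailureHealthy": is_healthy(resource)}
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:  # verification window
        if is_healthy(resource):        # readyReplicas >= desired
            return {"state": "Completed"}
        time.sleep(0.1)                 # the real controller re-queues instead
    resource.clear()
    resource.update(snapshot)           # verification expired -> roll back
    return {"state": "RolledBack", "postFailureHealthy": is_healthy(resource)}

plan = remediate({"memory": "512Mi"},
                 action=lambda r: r.update(memory="1Gi"),
                 is_healthy=lambda r: r["memory"] == "1Gi")
print(plan["state"])  # Completed
```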
#### 7. Issue Resolved
```bash
kubectl get issues -n chatcli-system

NAME               SEVERITY   STATE      RISK   AGE
INC-20260319-001   high       Resolved   65     8m
```
- Slack receives a resolution notification
- The Pattern Store records: "oom_kill + Deployment + high -> AdjustResources works"
- A 10-minute dedup cooldown is activated (configurable via `aiops.resolutionCooldownMinutes`)
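The dedup cooldown might look roughly like this; keying on the (signal, resource) pair is an assumption about how the operator identifies duplicates:

```python
from datetime import datetime, timedelta

COOLDOWN = timedelta(minutes=10)  # aiops.resolutionCooldownMinutes default
recently_resolved: dict = {}      # (signal, resource) -> resolution time

def should_open_issue(signal: str, resource: str, now: datetime) -> bool:
    """Suppress duplicate issues for a recently resolved (signal, resource) pair."""
    resolved_at = recently_resolved.get((signal, resource))
    return resolved_at is None or now - resolved_at > COOLDOWN

resolved = datetime(2026, 3, 19, 15, 28)
recently_resolved[("oom_kill", "production/payment-service")] = resolved
print(should_open_issue("oom_kill", "production/payment-service",
                        resolved + timedelta(minutes=5)))   # False: still in cooldown
print(should_open_issue("oom_kill", "production/payment-service",
                        resolved + timedelta(minutes=11)))  # True: cooldown elapsed
```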
#### 8. Automatic Post-Mortem
```bash
kubectl get postmortems -n chatcli-system

NAME                  ISSUE              SEVERITY   STATE   AGE
pm-INC-20260319-001   INC-20260319-001   high       Open    5m
```
```bash
kubectl get pm pm-INC-20260319-001 -o yaml -n chatcli-system
```

```yaml
status:
  state: Open
  summary: "Payment service pods OOMKilled due to insufficient memory limits after cache feature deployment"
  rootCause: "Revision 12 introduced in-memory cache increasing base memory by ~100Mi, exceeding 512Mi limit"
  impact: "Payment processing degraded for ~8 minutes, 3 pods affected"
  timeline:
    - timestamp: "2026-03-19T15:20:00Z"
      type: detected
      detail: "OOM kill detected on payment-service"
    - timestamp: "2026-03-19T15:20:10Z"
      type: analyzed
      detail: "AI analysis completed with 0.88 confidence"
    - timestamp: "2026-03-19T15:22:00Z"
      type: action_executed
      detail: "AdjustResources: memory_limit=1Gi, memory_request=768Mi"
    - timestamp: "2026-03-19T15:23:30Z"
      type: verified
      detail: "Deployment healthy: 3/3 replicas ready"
    - timestamp: "2026-03-19T15:23:30Z"
      type: resolved
      detail: "Remediation verified successfully"
  lessonsLearned:
    - "Memory limits should be reviewed after features that introduce caching"
    - "Monitor memory trends before deploys with architecture changes"
  preventionActions:
    - "Add memory capacity test to the CI/CD pipeline"
    - "Create memory usage SLO for payment-service"
  duration: "3m30s"
```
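As a quick sanity check, the duration field matches the timeline's detected and resolved timestamps:

```python
from datetime import datetime

detected = datetime.fromisoformat("2026-03-19T15:20:00+00:00")
resolved = datetime.fromisoformat("2026-03-19T15:23:30+00:00")
print(resolved - detected)  # 0:03:30 -> the 3m30s reported above
```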
#### 9. Review and Close
```bash
# Via API
curl -X POST http://localhost:8090/api/v1/postmortems/pm-INC-20260319-001/review

# After team review
curl -X POST http://localhost:8090/api/v1/postmortems/pm-INC-20260319-001/close
```
## Day-to-Day Operations

### Monitor via CLI
```bash
# Active issues
kubectl get issues -A --sort-by='.metadata.creationTimestamp'

# SLO status
kubectl get slos -n chatcli-system

# Pending approvals
kubectl get approvalrequests -n chatcli-system --field-selector status.state=Pending

# Audit trail
kubectl get auditevents -n chatcli-system --sort-by='.spec.timestamp' | tail -20

# Chaos experiments
kubectl get chaos -n chatcli-system
```
### Monitor via API
```bash
# Dashboard summary
curl http://localhost:8090/api/v1/analytics/summary | jq

# Top problematic resources
curl http://localhost:8090/api/v1/analytics/top-resources | jq

# MTTR trend
curl "http://localhost:8090/api/v1/analytics/mttr?window=7d" | jq

# Compliance report (export for SIEM)
curl http://localhost:8090/api/v1/audit/export > audit-$(date +%Y%m%d).json
```
## Custom Runbooks
```yaml
apiVersion: platform.chatcli.io/v1alpha1
kind: Runbook
metadata:
  name: restart-on-memory-leak
  namespace: chatcli-system
spec:
  description: "Restart deployment when memory leak detected"
  trigger:
    signalType: memory_high
    severity: medium
    resourceKind: Deployment
  steps:
    - name: "Restart pods"
      action: RestartDeployment
      description: "Rolling restart to clear leaked memory"
  maxAttempts: 2
```
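Trigger matching presumably works by comparing the trigger's fields against the incoming Issue; a minimal sketch under that assumption (exact-match semantics and the sample Issue are illustrative):

```python
def runbook_matches(trigger: dict, issue: dict) -> bool:
    """A runbook fires when every set trigger field equals the Issue's value."""
    return all(issue.get(key) == value for key, value in trigger.items())

trigger = {"signalType": "memory_high", "severity": "medium", "resourceKind": "Deployment"}
issue = {"signalType": "memory_high", "severity": "medium",
         "resourceKind": "Deployment", "name": "api-cache"}  # hypothetical Issue
print(runbook_matches(trigger, issue))                          # True
print(runbook_matches(trigger, {**issue, "severity": "high"}))  # False
```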
## Important Metrics
| Metric | What to Monitor | Alert When |
|--------|-----------------|------------|
| `chatcli_operator_active_issues` | Unresolved issues | > 5 |
| `chatcli_operator_slo_error_budget_remaining` | Remaining SLO budget | < 25% |
| `chatcli_operator_sla_compliance_percentage` | SLA compliance | < 99% |
| `chatcli_operator_slo_burn_rate{window="1h"}` | Fast burn rate | > 14.4x |
| `chatcli_operator_remediations_total{result="failed"}` | Failing remediations | > 3/hour |
| `chatcli_operator_notifications_failed_total` | Failing notifications | > 0 |
| `chatcli_operator_approvals_total{result="expired"}` | Expiring approvals | > 0 |
Configure alerts in Prometheus/Grafana for these metrics. The pre-configured dashboards in deploy/grafana/ already include panels for all of them.
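Where the 14.4x fast-burn threshold comes from: it is the classic multiwindow burn-rate alert value at which a service consumes about 2% of a 30-day error budget in a single hour. The arithmetic, assuming a 99.9% SLO (the SLO target is an assumption; the platform's actual target may differ):

```python
# Burn rate is "actual error rate / allowed error rate".
slo = 0.999
burn_rate = 14.4
error_rate_at_threshold = burn_rate * (1 - slo)   # ~1.44% of requests failing
hours_in_window = 30 * 24                         # 30-day budget window
budget_spent_in_1h = burn_rate / hours_in_window  # fraction of budget per hour
print(round(budget_spent_in_1h * 100, 1))         # 2.0 -> ~2% of budget in 1 hour
```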