This cookbook covers the complete setup of the ChatCLI AIOps platform for a real production environment — from installation to validation with chaos engineering.
Prerequisites
- [x] Kubernetes cluster 1.25+
- [x] Helm 3.x installed
- [x] Prometheus Operator (for ServiceMonitor)
- [x] Grafana (for dashboards)
- [x] At least one LLM API key (OpenAI, Claude, Google AI)
1. Install the Operator
Install the operator via Helm (this ships the CRDs, RBAC, controllers, and dashboard):

```bash
helm install chatcli-operator \
  oci://ghcr.io/diillson/charts/chatcli-operator \
  --namespace chatcli-system \
  --create-namespace
```

Verify the installed CRDs:

```bash
kubectl get crd | grep platform.chatcli.io
```
You should see 17 CRDs:

```
aiinsights.platform.chatcli.io
anomalies.platform.chatcli.io
approvalpolicies.platform.chatcli.io
approvalrequests.platform.chatcli.io
auditevents.platform.chatcli.io
chaosexperiments.platform.chatcli.io
clusterregistrations.platform.chatcli.io
escalationpolicies.platform.chatcli.io
incidentslas.platform.chatcli.io
instances.platform.chatcli.io
issues.platform.chatcli.io
notificationpolicies.platform.chatcli.io
postmortems.platform.chatcli.io
remediationplans.platform.chatcli.io
runbooks.platform.chatcli.io
servicelevelobjectives.platform.chatcli.io
sourcerepositories.platform.chatcli.io
```
Create a Secret with your LLM API keys:

```bash
kubectl create secret generic chatcli-api-keys \
  --namespace chatcli-system \
  --from-literal=ANTHROPIC_API_KEY=sk-ant-xxx \
  --from-literal=OPENAI_API_KEY=sk-xxx
```
2. Create ChatCLI Instance
```yaml
apiVersion: platform.chatcli.io/v1alpha1
kind: Instance
metadata:
  name: chatcli-prod
  namespace: chatcli-system
spec:
  replicas: 2
  provider: CLAUDEAI
  model: claude-sonnet-4-20250514
  server:
    port: 50051
    metricsPort: 9090
    tls:
      enabled: true
      secretName: chatcli-tls
    token:
      name: chatcli-auth
      key: token
  apiKeys:
    name: chatcli-api-keys
  watcher:
    enabled: true
    targets:
      - deployment: api-gateway
        namespace: production
        metricsPort: 8080
        metricsPath: /metrics
      - deployment: payment-service
        namespace: production
        metricsPort: 8080
      - deployment: user-service
        namespace: production
    interval: "30s"
    window: "2h"
  resources:
    requests:
      cpu: 500m
      memory: 512Mi
    limits:
      cpu: "2"
      memory: 1Gi
  persistence:
    enabled: true
    size: 5Gi
```
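With `interval: "30s"` and `window: "2h"`, the watcher holds roughly 240 samples per metric at any time. As an illustration only (the operator's actual detection logic is not documented here), a rolling z-score over that window is one simple way such a detector could flag outliers:

```python
import statistics

# Assumed illustration: samples kept per metric with interval=30s, window=2h.
WINDOW_SAMPLES = int(2 * 3600 / 30)  # 240

def is_anomalous(window: list[float], value: float, threshold: float = 3.0) -> bool:
    """Flag a new sample that deviates more than `threshold` standard
    deviations from the mean of the rolling window."""
    mean = statistics.fmean(window)
    stdev = statistics.pstdev(window)
    if stdev == 0:
        return value != mean
    return abs(value - mean) / stdev > threshold

baseline = [100.0, 102.0, 98.0, 101.0, 99.0] * 48  # 240 steady latency samples
print(WINDOW_SAMPLES)                 # 240
print(is_anomalous(baseline, 100.5))  # False: within normal variation
print(is_anomalous(baseline, 250.0))  # True: far outside the window
```

A longer `window` smooths out bursts at the cost of slower reaction; 2h is a reasonable middle ground for production services.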
To enable Prometheus metrics collection during incident analysis, set the PROMETHEUS_URL variable via Helm:

```bash
helm upgrade chatcli-operator oci://ghcr.io/diillson/charts/chatcli-operator \
  --set prometheusUrl="http://prometheus-server.monitoring.svc:9090"
```
2.1 Link Source Code Repositories (Optional)
Link your monitored applications’ source code repositories for code-aware diagnostics. The AI will receive context from recent commits, code snippets from stack traces, and configuration files.
```yaml
apiVersion: platform.chatcli.io/v1alpha1
kind: SourceRepository
metadata:
  name: api-gateway-repo
  namespace: chatcli-system
spec:
  url: "https://github.com/myorg/api-gateway.git"
  branch: main
  authType: token
  secretRef: git-token
  resource:
    kind: Deployment
    name: api-gateway
    namespace: production
  paths: ["cmd/", "internal/"]
  language: "Go"
---
apiVersion: platform.chatcli.io/v1alpha1
kind: SourceRepository
metadata:
  name: payment-service-repo
  namespace: chatcli-system
spec:
  url: "git@github.com:myorg/payment-service.git"
  branch: main
  authType: ssh
  secretRef: git-ssh-key
  resource:
    kind: Deployment
    name: payment-service
    namespace: production
  language: "Java"
```
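To make the "code snippets from stack traces" idea concrete, here is a hypothetical sketch (names and lookup logic are illustrative, not the operator's implementation) of mapping a stack-trace frame to a file inside a linked repository's configured `paths`:

```python
# Illustrative only: a registry keyed by monitored deployment, mirroring the
# SourceRepository specs above.
REPOS = {
    "api-gateway": {"paths": ["cmd/", "internal/"], "language": "Go"},
}

def locate_frame(deployment: str, frame_file: str):
    """Return (deployment, file) if the frame's file lives under one of the
    repository's configured paths, else None."""
    repo = REPOS.get(deployment)
    if repo is None:
        return None
    for prefix in repo["paths"]:
        if frame_file.startswith(prefix):
            return (deployment, frame_file)
    return None

print(locate_frame("api-gateway", "internal/router/handler.go"))
# ('api-gateway', 'internal/router/handler.go')
print(locate_frame("api-gateway", "vendor/lib.go"))  # None: outside `paths`
```

Restricting `paths` keeps vendored or generated code out of the AI's context, which both reduces noise and keeps prompts small.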
3. Configure Notifications

```yaml
apiVersion: platform.chatcli.io/v1alpha1
kind: NotificationPolicy
metadata:
  name: prod-notifications
  namespace: chatcli-system
spec:
  enabled: true
  channels:
    - name: slack-incidents
      type: slack
      config:
        webhook_url: "https://hooks.slack.com/services/T.../B.../xxx"
        channel: "#incidents"
    - name: pagerduty-critical
      type: pagerduty
      config:
        routing_key: "R0xxxxxxxxxxxxxxxxxxxx"
    - name: email-management
      type: email
      config:
        smtp_host: smtp.gmail.com
        smtp_port: "587"
        from: aiops@company.com
        to: "sre-team@company.com,management@company.com"
      secretRef:
        name: smtp-credentials
  rules:
    - name: critical-to-pagerduty
      severities: [critical]
      states: [Detected, Escalated]
      channels: [pagerduty-critical, slack-incidents]
    - name: high-to-slack
      severities: [critical, high]
      states: [Detected, Analyzing, Remediating, Resolved, Escalated]
      channels: [slack-incidents]
    - name: escalations-to-email
      states: [Escalated]
      channels: [email-management]
  throttle:
    maxPerHour: 20
    deduplicationWindow: "5m"
```
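The throttle block caps noise two ways: a per-key deduplication window and an hourly ceiling. A minimal sketch of the assumed semantics (this is not the operator's code, just an illustration of how `maxPerHour` and `deduplicationWindow` interact):

```python
from datetime import datetime, timedelta

class Throttle:
    """Drop a notification if the same key fired within dedup_window,
    and cap total sends in any rolling hour at max_per_hour."""

    def __init__(self, max_per_hour: int, dedup_window: timedelta):
        self.max_per_hour = max_per_hour
        self.dedup_window = dedup_window
        self.last_sent: dict[str, datetime] = {}
        self.sent_times: list[datetime] = []

    def allow(self, key: str, now: datetime) -> bool:
        last = self.last_sent.get(key)
        if last is not None and now - last < self.dedup_window:
            return False  # duplicate within the dedup window
        # Keep only sends from the last rolling hour.
        self.sent_times = [t for t in self.sent_times if now - t < timedelta(hours=1)]
        if len(self.sent_times) >= self.max_per_hour:
            return False  # hourly cap reached
        self.last_sent[key] = now
        self.sent_times.append(now)
        return True

t = Throttle(max_per_hour=20, dedup_window=timedelta(minutes=5))
start = datetime(2025, 1, 1, 12, 0)
print(t.allow("critical-to-pagerduty/issue-42", start))                         # True
print(t.allow("critical-to-pagerduty/issue-42", start + timedelta(minutes=2)))  # False
print(t.allow("critical-to-pagerduty/issue-42", start + timedelta(minutes=6)))  # True
```

Tune `deduplicationWindow` to roughly the cadence at which your detectors re-fire; 5m suppresses flapping without hiding genuinely new events.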
4. Configure Escalation

```yaml
apiVersion: platform.chatcli.io/v1alpha1
kind: EscalationPolicy
metadata:
  name: prod-escalation
  namespace: chatcli-system
spec:
  enabled: true
  severities: [critical, high]
  levels:
    - name: L1-OnCall
      timeoutMinutes: 5
      targets:
        - type: oncall
          name: primary-oncall
      notifyChannels: [slack-incidents, pagerduty-critical]
      repeatIntervalMinutes: 5
    - name: L2-SeniorSRE
      timeoutMinutes: 15
      targets:
        - type: team
          name: sre-senior
      notifyChannels: [slack-incidents, pagerduty-critical]
    - name: L3-Engineering-Lead
      timeoutMinutes: 30
      targets:
        - type: user
          name: eng-lead@company.com
      notifyChannels: [slack-incidents, email-management]
```
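Assuming each level fires when the previous one times out unacknowledged (the usual escalation-chain semantics), the levels above translate into a timeline, sketched here for illustration:

```python
levels = [
    {"name": "L1-OnCall", "timeoutMinutes": 5},
    {"name": "L2-SeniorSRE", "timeoutMinutes": 15},
    {"name": "L3-Engineering-Lead", "timeoutMinutes": 30},
]

def escalation_offsets(levels):
    """Minutes after detection at which each level is engaged if nobody
    acknowledges the incident."""
    offsets, elapsed = [], 0
    for level in levels:
        offsets.append((level["name"], elapsed))
        elapsed += level["timeoutMinutes"]
    return offsets

print(escalation_offsets(levels))
# [('L1-OnCall', 0), ('L2-SeniorSRE', 5), ('L3-Engineering-Lead', 20)]
```

So an unacknowledged critical incident reaches the engineering lead 20 minutes after detection, with L1 re-paged every 5 minutes (`repeatIntervalMinutes`) in the meantime.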
5. Define SLOs
```yaml
apiVersion: platform.chatcli.io/v1alpha1
kind: ServiceLevelObjective
metadata:
  name: api-gateway-availability
  namespace: chatcli-system
spec:
  serviceName: api-gateway
  description: "API Gateway must maintain 99.9% availability"
  enabled: true
  indicator:
    type: availability
    metricSource: issues
    resource:
      kind: Deployment
      name: api-gateway
      namespace: production
  target:
    percentage: 99.9
    window: "30d"
  alertPolicy:
    pageOnBudgetExhausted: true
    notificationPolicyRef: prod-notifications
    burnRateWindows:
      - shortWindow: "1h"
        longWindow: "6h"
        burnRateThreshold: 14.4
        severity: critical
      - shortWindow: "6h"
        longWindow: "72h"
        burnRateThreshold: 6
        severity: high
      - shortWindow: "24h"
        longWindow: "72h"
        burnRateThreshold: 3
        severity: medium
```
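The thresholds follow the Google SRE multiwindow burn-rate model: a burn rate of 1 spends the error budget exactly over the 30-day window, so a rate of 14.4 exhausts it in about two days. The arithmetic behind the numbers above:

```python
def error_budget_minutes(target_pct: float, window_days: int) -> float:
    """Total allowed downtime within the SLO window, in minutes."""
    return (1 - target_pct / 100) * window_days * 24 * 60

def hours_to_exhaustion(burn_rate: float, window_days: int) -> float:
    """At a constant burn rate, hours until the whole budget is spent."""
    return window_days * 24 / burn_rate

budget = error_budget_minutes(99.9, 30)   # 43.2 minutes over 30 days
fast = hours_to_exhaustion(14.4, 30)      # 50 hours  -> page (critical)
slow = hours_to_exhaustion(3, 30)         # 240 hours -> ticket (medium)
print(budget, fast, slow)
```

The short window (e.g. "1h") confirms the burn is still happening before alerting, which keeps a brief spike from paging after the problem has already recovered.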
6. Define SLAs
```yaml
apiVersion: platform.chatcli.io/v1alpha1
kind: IncidentSLA
metadata:
  name: p1-sla
  namespace: chatcli-system
spec:
  severity: critical
  responseTime: "5m"
  resolutionTime: "1h"
  escalationPolicyRef: prod-escalation
  notificationPolicyRef: prod-notifications
  businessHoursOnly: false
---
apiVersion: platform.chatcli.io/v1alpha1
kind: IncidentSLA
metadata:
  name: p2-sla
  namespace: chatcli-system
spec:
  severity: high
  responseTime: "15m"
  resolutionTime: "4h"
  escalationPolicyRef: prod-escalation
  businessHoursOnly: true
  businessHours:
    timezone: "America/Sao_Paulo"
    startHour: 9
    endHour: 18
    workDays: ["Monday","Tuesday","Wednesday","Thursday","Friday"]
```
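For the p2 SLA, the clock presumably only runs during business hours: a high-severity incident detected Friday evening starts its 15-minute response window Monday at 09:00. A simplified sketch of that deadline math (naive datetimes for brevity; the real controller would evaluate in the configured America/Sao_Paulo timezone):

```python
from datetime import datetime, timedelta

# Mirrors the businessHours block above; Monday == 0 in Python's weekday().
BUSINESS = {"startHour": 9, "endHour": 18, "workDays": {0, 1, 2, 3, 4}}

def next_business_instant(t: datetime) -> datetime:
    """Roll t forward to the next moment inside business hours."""
    while t.weekday() not in BUSINESS["workDays"] or not (
        BUSINESS["startHour"] <= t.hour < BUSINESS["endHour"]
    ):
        if t.weekday() not in BUSINESS["workDays"] or t.hour >= BUSINESS["endHour"]:
            t = (t + timedelta(days=1)).replace(
                hour=BUSINESS["startHour"], minute=0, second=0, microsecond=0)
        else:  # before opening on a workday
            t = t.replace(hour=BUSINESS["startHour"], minute=0, second=0, microsecond=0)
    return t

def response_deadline(detected: datetime, response_minutes: int) -> datetime:
    """Assumes the response window fits inside a single business day."""
    return next_business_instant(detected) + timedelta(minutes=response_minutes)

friday_evening = datetime(2025, 1, 3, 19, 0)  # Fri 19:00, after hours
print(response_deadline(friday_evening, 15))  # Mon 2025-01-06 09:15
```

Critical (p1) incidents deliberately skip this logic (`businessHoursOnly: false`), so their 5-minute response clock always runs.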
7. Configure Approval Policies

```yaml
apiVersion: platform.chatcli.io/v1alpha1
kind: ApprovalPolicy
metadata:
  name: prod-approvals
  namespace: chatcli-system
spec:
  enabled: true
  defaultMode: manual
  rules:
    - name: auto-low-confidence
      match:
        severities: [low]
        actionTypes: [RestartDeployment, DeletePod]
      mode: auto
      autoApproveConditions:
        minConfidence: 0.95
        maxSeverity: low
        historicalSuccessRate: 0.90
    - name: quorum-production-rollback
      match:
        severities: [critical, high]
        actionTypes: [RollbackDeployment, ScaleDeployment]
        namespaces: [production]
      mode: quorum
      requiredApprovers: 2
      timeoutMinutes: 15
      changeWindow:
        timezone: "America/Sao_Paulo"
        allowedDays: ["Monday","Tuesday","Wednesday","Thursday","Friday"]
        startHour: 9
        endHour: 17
    - name: manual-resource-changes
      match:
        actionTypes: [AdjustResources, PatchConfig]
      mode: manual
      timeoutMinutes: 30
```
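The assumed semantics of `autoApproveConditions` is that every condition must hold before a remediation runs unattended; anything else falls back to the rule's mode (or `defaultMode: manual`). A sketch of that gate, for illustration only:

```python
SEVERITY_RANK = {"low": 0, "medium": 1, "high": 2, "critical": 3}

def auto_approvable(action: dict, conditions: dict) -> bool:
    """All conditions must pass: AI confidence, severity ceiling, and the
    action type's historical success rate."""
    return (
        action["confidence"] >= conditions["minConfidence"]
        and SEVERITY_RANK[action["severity"]] <= SEVERITY_RANK[conditions["maxSeverity"]]
        and action["historicalSuccessRate"] >= conditions["historicalSuccessRate"]
    )

# Mirrors the auto-low-confidence rule above.
conditions = {"minConfidence": 0.95, "maxSeverity": "low", "historicalSuccessRate": 0.90}

print(auto_approvable(
    {"confidence": 0.97, "severity": "low", "historicalSuccessRate": 0.93},
    conditions))  # True: runs without a human
print(auto_approvable(
    {"confidence": 0.99, "severity": "high", "historicalSuccessRate": 0.99},
    conditions))  # False: severity above the ceiling -> needs approval
```

Note the conservative layering: only the safest action types (restart, pod delete) on low-severity issues are ever auto-approved, while production rollbacks require a two-person quorum inside a change window.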
8. Install Grafana Dashboards
```bash
# Create ConfigMap with dashboards
kubectl create configmap chatcli-grafana-dashboards \
  --from-file=deploy/grafana/ \
  -n monitoring \
  --dry-run=client -o yaml | kubectl apply -f -

# Add label for Grafana sidecar auto-discovery
kubectl label configmap chatcli-grafana-dashboards \
  grafana_dashboard=1 -n monitoring

# Install ServiceMonitors
kubectl apply -f deploy/grafana/dashboards-configmap.yaml
```
9. Validate with Chaos Engineering
Run chaos experiments only in environments with redundancy. Never on single-replica deployments.
```yaml
apiVersion: platform.chatcli.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: validate-api-gateway-resilience
  namespace: chatcli-system
spec:
  experimentType: pod_kill
  target:
    kind: Deployment
    name: api-gateway
    namespace: production
  duration: "2m"
  parameters:
    count: "1"
  dryRun: true  # Test in dry-run first!
  enabled: true
  safetyChecks:
    minHealthyPods: 2
    maxConcurrentExperiments: 1
    abortOnIssueDetected: true
    requireApproval: true
    blockedNamespaces: ["kube-system", "monitoring"]
  postExperiment:
    verifyRecovery: true
    recoveryTimeout: "5m"
    runRemediationTest: false
```
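The `safetyChecks` block acts as a pre-flight gate: if any check fails, the experiment should never start. A sketch of the assumed evaluation (illustrative, not the controller's actual code):

```python
def safe_to_run(target_ns: str, healthy_pods: int,
                running_experiments: int, safety: dict) -> bool:
    """Every safety check must pass before injecting any fault."""
    if target_ns in safety["blockedNamespaces"]:
        return False  # never touch protected namespaces
    if healthy_pods < safety["minHealthyPods"]:
        return False  # too little redundancy to absorb a pod kill
    if running_experiments >= safety["maxConcurrentExperiments"]:
        return False  # one experiment at a time
    return True

# Mirrors the safetyChecks above.
safety = {"minHealthyPods": 2, "maxConcurrentExperiments": 1,
          "blockedNamespaces": ["kube-system", "monitoring"]}

print(safe_to_run("production", healthy_pods=3, running_experiments=0, safety=safety))   # True
print(safe_to_run("kube-system", healthy_pods=3, running_experiments=0, safety=safety))  # False
print(safe_to_run("production", healthy_pods=1, running_experiments=0, safety=safety))   # False
```

`abortOnIssueDetected` adds a runtime guard on top of this pre-flight gate: an experiment already in progress is halted if the platform opens an unrelated Issue.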
Run in dry-run:

```bash
kubectl apply -f chaos-experiment.yaml
kubectl get chaos -w
```

Verify the result:

```bash
kubectl get chaos validate-api-gateway-resilience -o yaml
```

Run for real (after validation): set `dryRun: false` and reapply.
10. Access the Dashboard
```bash
# Port-forward to the REST API + Web UI
kubectl port-forward svc/chatcli-prod 8090:8090 -n chatcli-system

# Open in browser
open http://localhost:8090
```
The web dashboard shows:
- Overview with real-time stats
- Incidents with filters and actions (acknowledge, snooze)
- SLOs with error budget and burn rates
- Pending approvals
- PostMortems with timeline
- Federated clusters
- Searchable audit log
Production Checklist
- [x] Operator installed with 17 CRDs
- [x] Instance created with TLS and auth
- [x] Watcher monitoring target deployments
- [x] NotificationPolicy with Slack + PagerDuty
- [x] EscalationPolicy L1 -> L2 -> L3
- [x] SLOs with burn-rate alerting (Google SRE model)
- [x] SLAs with response/resolution times per severity
- [x] ApprovalPolicy with auto/quorum for production
- [x] Grafana dashboards installed
- [x] Chaos experiment validated in dry-run
- [x] Web dashboard accessible
- [x] REST API with authentication configured