In this recipe, you will configure ChatCLI to monitor a Kubernetes deployment and use AI to diagnose problems in real time.

Scenario

  • Production application: “myapp” running in production on Kubernetes
  • Quick diagnosis: the team needs to diagnose problems quickly
  • AI-powered analysis: use AI to analyze logs, events, and metrics
  • Automatic context: K8s context is included in all queries automatically

Option 1: Local Monitoring

Use this option when you have direct access to the cluster via kubectl.
1. Verify Cluster Access

# Verify connectivity
kubectl get pods -n production

# Verify permissions
kubectl auth can-i get pods -n production
kubectl auth can-i get pods/log -n production
kubectl auth can-i list events -n production
2. Start the Watcher

chatcli watch --deployment myapp --namespace production
You will see:
K8s Watcher starting...
  Deployment: myapp
  Namespace:  production
  Interval:   30s
  Window:     2h

Collecting initial data...
Initial data collected. Starting interactive mode.
[watch] chatcli>
3. Ask Questions

[watch] chatcli> Is the deployment healthy?

Based on the collected Kubernetes data:
- The myapp deployment has 3/3 replicas available
- All pods are in Running and Ready state
- There are no active alerts
- Average CPU at 35%, memory at 120Mi
The deployment is healthy and operating normally.

[watch] chatcli> /watch status

K8s Watcher Active
  Deployment:  myapp
  Namespace:   production
  Snapshots:   5
  Pods:        3
  Alerts:      0
4. Diagnose Problems

When something goes wrong:
[watch] chatcli> Why is the pod myapp-abc12 restarting?

Analyzing data for pod myapp-abc12:
- The pod had 5 restarts in the last hour
- Reason for last restart: OOMKilled
- Container was using 490Mi out of 512Mi limit
- Logs show: "java.lang.OutOfMemoryError: Java heap space"

Diagnosis: The container is exceeding the memory limit.
Recommendations:
1. Increase resources.limits.memory to 1Gi
2. Adjust the JVM: -Xmx384m to fit within the limit
3. Investigate possible memory leak in previous logs
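A fix along the lines of recommendations 1 and 2 would look like the following deployment fragment (a sketch; the `JAVA_OPTS` variable name is an assumption about how the app reads JVM flags):

```yaml
# Fragment of the myapp Deployment spec: raise the memory limit and
# cap the JVM heap below it so the container is not OOMKilled again.
containers:
  - name: myapp
    resources:
      requests:
        memory: 512Mi
      limits:
        memory: 1Gi          # recommendation 1: headroom above observed usage
    env:
      - name: JAVA_OPTS      # assumption: the app honors JAVA_OPTS
        value: "-Xmx768m"    # recommendation 2: heap sized to fit the limit
```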

Option 2: Server with Watcher (Team)

Use this option so the entire team has access to monitoring via a centralized server.
1. Deploy to Kubernetes

helm install chatcli deploy/helm/chatcli \
  --namespace monitoring --create-namespace \
  --set llm.provider=CLAUDEAI \
  --set secrets.anthropicApiKey=sk-ant-xxx \
  --set server.token=team-token \
  --set watcher.enabled=true \
  --set watcher.deployment=myapp \
  --set watcher.namespace=production \
  --set watcher.interval=15s
2. Team Connects

# Each dev configures
export CHATCLI_REMOTE_ADDR=chatcli.monitoring.svc:50051
export CHATCLI_REMOTE_TOKEN=team-token

# Via port-forward (development)
kubectl port-forward -n monitoring svc/chatcli 50051:50051
chatcli connect localhost:50051 --token team-token
3. Automatic Context

Any question asked by any dev automatically includes K8s context:
> What is happening with the deployment?

[The server automatically injects K8s Watcher data]

Workflow: Production Incident

1. Alert Triggered

You receive an alert from Grafana/PagerDuty/Slack about deployment issues.
2. Connect to ChatCLI

chatcli connect prod-chatcli:50051 --token ops-token
3. Get an Overview

> Summarize the current state of the deployment for a post-mortem
4. Investigate Root Cause

> What Warning events occurred in the last 30 minutes?
> Show the most recent error logs
> What changed since the last deploy?
5. Receive Recommendations

> Based on the data, what is the most likely root cause and what
  should I do to resolve it?
6. Validate Resolution

> After applying the fix, are the pods returning to normal?
> Compare the current state with 10 minutes ago

Fine-Tuning Parameters

Collection Interval

Scenario                Recommended Interval
Stable production       30s (default)
Active investigation    10s
Development             60s
CI/CD monitoring        15s
chatcli watch --deployment myapp --interval 10s

Observation Window

Scenario                Recommended Window
Quick debugging         30m
Normal analysis         2h (default)
Post-mortem             6h
Historical analysis     24h
chatcli watch --deployment myapp --window 6h

Log Lines

Scenario                Recommended Lines
Verbose apps            50
Normal                  100 (default)
Deep debugging          500
chatcli watch --deployment myapp --max-log-lines 500
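The three knobs compose; for a deep investigation you might combine them in a single invocation:

```shell
# Deep investigation: frequent snapshots, long window, verbose logs
chatcli watch --deployment myapp --namespace production \
  --interval 10s --window 6h --max-log-lines 500
```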

One-Shot for Scripts and Alerts

Integrate ChatCLI with your alerting system:
#!/bin/bash
# alert-handler.sh - Called when an alert fires

DEPLOYMENT=$1
NAMESPACE=$2

# Generate automatic analysis
ANALYSIS=$(chatcli watch \
  --deployment "$DEPLOYMENT" \
  --namespace "$NAMESPACE" \
  -p "Analyze the current state of the deployment and identify the root cause of the problem. Format: markdown.")

# Send to Slack (jq escapes quotes/newlines in the analysis so the JSON stays valid)
curl -X POST "$SLACK_WEBHOOK" \
  -H 'Content-type: application/json' \
  -d "$(jq -n --arg text "*ChatCLI K8s Analysis*"$'\n\n'"$ANALYSIS" '{text: $text}')"
Or via remote server:
chatcli connect prod-server:50051 --token ops-token \
  -p "The myapp deployment is having problems. Analyze and suggest a solution." --raw
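Alert-driven one-shots can be complemented with a scheduled report; a Kubernetes CronJob is a natural fit here (a sketch; the image name and schedule are assumptions):

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: chatcli-health-report
  namespace: monitoring
spec:
  schedule: "*/30 * * * *"          # every 30 minutes; adjust to taste
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: report
              image: chatcli:latest # assumption: a chatcli image is available
              args:
                - connect
                - prod-server:50051
                - --token
                - ops-token
                - -p
                - "Summarize the current health of myapp. Format: markdown."
                - --raw
```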

Advanced Tips

Save project documentation as context and attach it when using the watcher:
# Save project documentation as context
/context create myapp-docs ./docs --mode full --tags "k8s,ops"

# Attach when using with the watcher
/context attach myapp-docs

# Now the AI has K8s context + project documentation
> Based on the documentation and the cluster state, what could be wrong?
Use multi-target mode to monitor everything in a single instance:
# targets.yaml
interval: "15s"
window: "2h"
maxContextChars: 32000
targets:
  - deployment: frontend
    namespace: production
    metricsPort: 3000
    metricsFilter: ["next_*", "http_*"]
  - deployment: backend
    namespace: production
    metricsPort: 9090
    metricsFilter: ["http_requests_*", "db_*", "cache_*"]
  - deployment: database
    namespace: production
# Local
chatcli watch --config targets.yaml

# Or via server (the entire team has access)
chatcli server --watch-config targets.yaml
The AI receives detailed context from targets with issues and compact summaries from healthy ones, respecting the maxContextChars budget.
When metricsPort is configured, the watcher automatically scrapes the /metrics endpoint of the pods and includes the metrics in the analysis. Use metricsFilter with glob patterns to select only relevant metrics:
metricsFilter:
  - "http_requests_total"        # Exact metric
  - "http_request_duration_*"    # All HTTP duration metrics
  - "process_*"                  # Process metrics
  - "*_errors_total"             # Any error counter
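These patterns are plain globs; shell `case` patterns use the same syntax, so you can sanity-check a pattern against a metric name before committing it to the config:

```shell
#!/bin/sh
# Check whether a metric name would be kept by a metricsFilter glob.
# Shell `case` patterns use the same glob syntax as the filter.
matches() {
  metric=$1
  pattern=$2
  case "$metric" in
    $pattern) echo "keep" ;;
    *)        echo "drop" ;;
  esac
}

matches http_requests_total "http_requests_total"               # keep (exact)
matches http_request_duration_seconds "http_request_duration_*" # keep
matches db_errors_total "*_errors_total"                        # keep
matches go_goroutines "http_*"                                  # drop
```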

Option 3: Autonomous AIOps (Operator)

Use this option for automatic problem remediation without human intervention.
1. Install the Operator

# Install CRDs
kubectl apply -f operator/config/crd/bases/

# Install RBAC and Manager
kubectl apply -f operator/config/rbac/role.yaml
kubectl apply -f operator/config/manager/manager.yaml
2. Create Instance with Watcher

apiVersion: platform.chatcli.io/v1alpha1
kind: Instance
metadata:
  name: chatcli-aiops
  namespace: monitoring
spec:
  provider: CLAUDEAI
  apiKeys:
    name: chatcli-api-keys
  server:
    port: 50051
  watcher:
    enabled: true
    interval: "15s"
    targets:
      - deployment: api-gateway
        namespace: production
        metricsPort: 9090
      - deployment: backend
        namespace: production
        metricsPort: 9090
      - deployment: worker
        namespace: batch
3. Monitor the Pipeline

# Check detected anomalies
kubectl get anomalies -A --watch

# Check created issues
kubectl get issues -A --watch

# Check AI analyses
kubectl get aiinsights -A

# Check remediations
kubectl get remediationplans -A

# Check runbooks (manual and auto-generated)
kubectl get runbooks -A

# Check post-mortems (generated after agential resolution)
kubectl get postmortems -A
4. Autonomous Flow in Action

When a pod starts crashing:
1. WatcherBridge detects HighRestartCount -> creates Anomaly
2. AnomalyReconciler correlates -> creates Issue (risk: 20, severity: Low, signalType: pod_restart)
3. If OOMKilled also -> Issue updated (risk: 50, severity: Medium)
4. IssueReconciler creates AIInsight
5. AIInsightReconciler collects K8s context (pods, events, revisions)
   -> calls LLM with enriched context -> returns: "restart + scale to 4"
6. IssueReconciler looks up manual Runbook (tiered matching)
   -> if not found, generates auto Runbook CR from AI (reusable)
   -> if no Runbook and no AI actions -> Agential mode
   -> creates RemediationPlan with actions
7. RemediationReconciler executes:
   -> Normal mode: executes plan actions (restart, scale, etc.)
   -> Agential mode: observe-decide-act loop (AI decides each step)
     - Each reconcile = 1 step (max 10 steps, timeout 10min)
     - AI observes K8s state -> decides action -> executes -> next step
     - Actions: Scale, Restart, Rollback, PatchConfig, AdjustResources, DeletePod
8. Issue -> Resolved (dedup invalidated for recurrence detection)
   -> If agential mode: PostMortem CR auto-generated (timeline, root cause, lessons)
   -> Runbook auto-generated from successful agential steps
   If failed -> re-analysis with failure context -> different strategy
Everything happens automatically, without human intervention. Auto-generated Runbooks are reused for future occurrences of the same issue type. In agential mode, the AI acts as an autonomous agent with K8s “skills” and, on resolution, produces a PostMortem CR with a complete timeline plus a reusable Runbook.
5. (Optional) Add Runbooks

For specific scenarios where you want to control exactly what to do:
apiVersion: platform.chatcli.io/v1alpha1
kind: Runbook
metadata:
  name: oom-standard-procedure
  namespace: production
spec:
  description: "Standard OOMKill recovery for production"
  trigger:
    signalType: oom_kill
    severity: critical
    resourceKind: Deployment
  steps:
    - name: Restart pods
      action: RestartDeployment
      description: "Restart to reclaim leaked memory"
    - name: Scale up
      action: ScaleDeployment
      description: "Add replicas for redundancy"
      params:
        replicas: "5"
  maxAttempts: 2
Remediation priority: Manual Runbook > Auto-generated Runbook > Agential remediation > Escalation. When there is no manual Runbook, the AI automatically generates a reusable Runbook CR. If neither a Runbook nor AI actions are available, the operator enters agential mode: the AI acts as an autonomous agent in an observe-decide-act loop, and upon resolution, it generates a PostMortem CR and a reusable Runbook.

Deployment Checklist

  • Verify cluster access (kubectl get pods)
  • Verify RBAC permissions for pods, logs, events
  • Choose mode: local (chatcli watch) or server (chatcli server)
  • Define targets: single (--deployment) or multi (--config targets.yaml)
  • (Optional) Configure metricsPort for Prometheus scraping
  • Configure appropriate interval and window for the scenario
  • Adjust maxContextChars if needed (default: 32000)
  • Test with a simple question: “Is the deployment healthy?”
  • (Optional) Integrate with alerts for automatic analysis
  • (Optional) Distribute access to the team via token
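For the RBAC item, a minimal read-only Role is typically enough (a sketch; the name and the exact resource list may need adjusting for your setup):

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: chatcli-watcher      # illustrative name
  namespace: production
rules:
  - apiGroups: [""]
    resources: ["pods", "pods/log", "events"]
    verbs: ["get", "list", "watch"]
  - apiGroups: ["apps"]
    resources: ["deployments", "replicasets"]
    verbs: ["get", "list", "watch"]
```

Bind it to the ServiceAccount ChatCLI runs as with a matching RoleBinding; the `kubectl auth can-i` checks from Option 1 confirm the grant took effect.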