In this recipe, you will configure ChatCLI to monitor a Kubernetes deployment and use AI to diagnose problems in real time.

Scenario

  • Production application: “myapp” running in production on Kubernetes
  • Quick diagnosis: the team needs to diagnose problems quickly
  • AI-powered analysis: use AI to analyze logs, events, and metrics
  • Automatic context: K8s context is included in all queries automatically

Option 1: Local Monitoring

Use this option when you have direct access to the cluster via kubectl.
1. Verify Cluster Access

# Verify connectivity
kubectl get pods -n production

# Verify permissions
kubectl auth can-i get pods -n production
kubectl auth can-i get pods/log -n production
kubectl auth can-i list events -n production
2. Start the Watcher

chatcli watch --deployment myapp --namespace production
You will see:
K8s Watcher starting...
  Deployment: myapp
  Namespace:  production
  Interval:   30s
  Window:     2h

Collecting initial data...
Initial data collected. Starting interactive mode.
[watch] chatcli>
3. Ask Questions

[watch] chatcli> Is the deployment healthy?

Based on the collected Kubernetes data:
- The myapp deployment has 3/3 replicas available
- All pods are in Running and Ready state
- There are no active alerts
- Average CPU at 35%, memory at 120Mi
The deployment is healthy and operating normally.

[watch] chatcli> /watch status

K8s Watcher Active
  Deployment:  myapp
  Namespace:   production
  Snapshots:   5
  Pods:        3
  Alerts:      0
4. Diagnose Problems

When something goes wrong:
[watch] chatcli> Why is the pod myapp-abc12 restarting?

Analyzing data for pod myapp-abc12:
- The pod had 5 restarts in the last hour
- Reason for last restart: OOMKilled
- Container was using 490Mi out of 512Mi limit
- Logs show: "java.lang.OutOfMemoryError: Java heap space"

Diagnosis: The container is exceeding the memory limit.
Recommendations:
1. Increase resources.limits.memory to 1Gi
2. Adjust the JVM: -Xmx384m to fit within the limit
3. Investigate possible memory leak in previous logs
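A fix along the lines of recommendations 1 and 2 would look like the following deployment fragment (a sketch; the `JAVA_OPTS` variable name is an assumption about how the app reads JVM flags):

```yaml
# Fragment of the myapp Deployment spec: raise the memory limit and
# cap the JVM heap below it so the container is not OOMKilled again.
containers:
  - name: myapp
    resources:
      requests:
        memory: 512Mi
      limits:
        memory: 1Gi          # recommendation 1: headroom above observed usage
    env:
      - name: JAVA_OPTS      # assumption: the app honors JAVA_OPTS
        value: "-Xmx768m"    # recommendation 2: heap sized to fit the limit
```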

Option 2: Server with Watcher (Team)

Use this option so the entire team has access to monitoring via a centralized server.
1. Deploy to Kubernetes

helm install chatcli deploy/helm/chatcli \
  --namespace monitoring --create-namespace \
  --set llm.provider=CLAUDEAI \
  --set secrets.anthropicApiKey=sk-ant-xxx \
  --set server.token=team-token \
  --set watcher.enabled=true \
  --set watcher.deployment=myapp \
  --set watcher.namespace=production \
  --set watcher.interval=15s
2. Team Connects

# Each dev configures
export CHATCLI_REMOTE_ADDR=chatcli.monitoring.svc:50051
export CHATCLI_REMOTE_TOKEN=team-token

# Via port-forward (development)
kubectl port-forward -n monitoring svc/chatcli 50051:50051
chatcli connect localhost:50051 --token team-token
3. Automatic Context

Any question asked by any dev automatically includes K8s context:
> What is happening with the deployment?

[The server automatically injects K8s Watcher data]

Workflow: Production Incident

1. Alert Triggered

You receive an alert from Grafana/PagerDuty/Slack about deployment issues.
2. Connect to ChatCLI

chatcli connect prod-chatcli:50051 --token ops-token
3. Get an Overview

> Summarize the current state of the deployment for a post-mortem
4. Investigate Root Cause

> What Warning events occurred in the last 30 minutes?
> Show the most recent error logs
> What changed since the last deploy?
5. Receive Recommendations

> Based on the data, what is the most likely root cause and what
  should I do to resolve it?
6. Validate Resolution

> After applying the fix, are the pods returning to normal?
> Compare the current state with 10 minutes ago

Fine-Tuning Parameters

Collection Interval

Scenario                Recommended Interval
Stable production       30s (default)
Active investigation    10s
Development             60s
CI/CD monitoring        15s
chatcli watch --deployment myapp --interval 10s

Observation Window

Scenario                Recommended Window
Quick debugging         30m
Normal analysis         2h (default)
Post-mortem             6h
Historical analysis     24h
chatcli watch --deployment myapp --window 6h

Log Lines

Scenario                Recommended Lines
Verbose apps            50
Normal                  100 (default)
Deep debugging          500
chatcli watch --deployment myapp --max-log-lines 500
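The three knobs compose; for a deep investigation you might combine them in a single invocation:

```shell
# Deep investigation: frequent snapshots, long window, verbose logs
chatcli watch --deployment myapp --namespace production \
  --interval 10s --window 6h --max-log-lines 500
```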

One-Shot for Scripts and Alerts

Integrate ChatCLI with your alerting system:
#!/bin/bash
# alert-handler.sh - Called when an alert fires

DEPLOYMENT=$1
NAMESPACE=$2

# Generate automatic analysis
ANALYSIS=$(chatcli watch \
  --deployment "$DEPLOYMENT" \
  --namespace "$NAMESPACE" \
  -p "Analyze the current state of the deployment and identify the root cause of the problem. Format: markdown.")

# Send to Slack (jq escapes quotes/newlines in the analysis so the JSON stays valid)
curl -X POST "$SLACK_WEBHOOK" \
  -H 'Content-type: application/json' \
  -d "$(jq -n --arg text "*ChatCLI K8s Analysis*"$'\n\n'"$ANALYSIS" '{text: $text}')"
Or via remote server:
chatcli connect prod-server:50051 --token ops-token \
  -p "The myapp deployment is having problems. Analyze and suggest a solution." --raw
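Alert-driven one-shots can be complemented with a scheduled report; a Kubernetes CronJob is a natural fit here (a sketch; the image name and schedule are assumptions):

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: chatcli-health-report
  namespace: monitoring
spec:
  schedule: "*/30 * * * *"          # every 30 minutes; adjust to taste
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: report
              image: chatcli:latest # assumption: a chatcli image is available
              args:
                - connect
                - prod-server:50051
                - --token
                - ops-token
                - -p
                - "Summarize the current health of myapp. Format: markdown."
                - --raw
```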

Advanced Tips

Save project documentation as context and attach it when using the watcher:
# Save project documentation as context
/context create myapp-docs ./docs --mode full --tags "k8s,ops"

# Attach when using with the watcher
/context attach myapp-docs

# Now the AI has K8s context + project documentation
> Based on the documentation and the cluster state, what could be wrong?
Use multi-target mode to monitor everything in a single instance:
# targets.yaml
interval: "15s"
window: "2h"
maxContextChars: 32000
targets:
  - deployment: frontend
    namespace: production
    metricsPort: 3000
    metricsFilter: ["next_*", "http_*"]
  - deployment: backend
    namespace: production
    metricsPort: 9090
    metricsFilter: ["http_requests_*", "db_*", "cache_*"]
  - deployment: database
    namespace: production
# Local
chatcli watch --config targets.yaml

# Or via server (the entire team has access)
chatcli server --watch-config targets.yaml
The AI receives detailed context from targets with issues and compact summaries from healthy ones, respecting the maxContextChars budget.
When metricsPort is configured, the watcher automatically scrapes the /metrics endpoint of the pods and includes the metrics in the analysis. Use metricsFilter with glob patterns to select only relevant metrics:
metricsFilter:
  - "http_requests_total"        # Exact metric
  - "http_request_duration_*"    # All HTTP duration metrics
  - "process_*"                  # Process metrics
  - "*_errors_total"             # Any error counter
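These patterns are plain globs; shell `case` patterns use the same syntax, so you can sanity-check a pattern against a metric name before committing it to the config:

```shell
#!/bin/sh
# Check whether a metric name would be kept by a metricsFilter glob.
# Shell `case` patterns use the same glob syntax as the filter.
matches() {
  metric=$1
  pattern=$2
  case "$metric" in
    $pattern) echo "keep" ;;
    *)        echo "drop" ;;
  esac
}

matches http_requests_total "http_requests_total"               # keep (exact)
matches http_request_duration_seconds "http_request_duration_*" # keep
matches db_errors_total "*_errors_total"                        # keep
matches go_goroutines "http_*"                                  # drop
```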

Option 3: Autonomous AIOps (Operator)

Use this option for automatic problem remediation without human intervention.
1. Install the Operator

# Install CRDs
kubectl apply -f operator/config/crd/bases/

# Install RBAC and Manager
kubectl apply -f operator/config/rbac/role.yaml
kubectl apply -f operator/config/manager/manager.yaml
2. Create Instance with Watcher

apiVersion: platform.chatcli.io/v1alpha1
kind: Instance
metadata:
  name: chatcli-aiops
  namespace: monitoring
spec:
  provider: CLAUDEAI
  apiKeys:
    name: chatcli-api-keys
  server:
    port: 50051
  watcher:
    enabled: true
    interval: "15s"
    targets:
      - deployment: api-gateway
        namespace: production
        metricsPort: 9090
      - deployment: backend
        namespace: production
        metricsPort: 9090
      - deployment: worker
        namespace: batch
3. Monitor the Pipeline

# Check detected anomalies
kubectl get anomalies -A --watch

# Check created issues
kubectl get issues -A --watch

# Check AI analyses
kubectl get aiinsights -A

# Check remediations
kubectl get remediationplans -A

# Check runbooks (manual and auto-generated)
kubectl get runbooks -A

# Check post-mortems (generated after agential resolution)
kubectl get postmortems -A
4. Autonomous Flow in Action

When a pod starts crashing:
1. WatcherBridge detects HighRestartCount -> creates Anomaly
2. AnomalyReconciler correlates -> creates Issue (risk: 20, severity: Low, signalType: pod_restart)
3. If OOMKilled also -> Issue updated (risk: 50, severity: Medium)
4. IssueReconciler creates AIInsight
5. AIInsightReconciler collects K8s context (pods, events, revisions)
   -> calls LLM with enriched context -> returns: "restart + scale to 4"
6. IssueReconciler looks up manual Runbook (tiered matching)
   -> if not found, generates auto Runbook CR from AI (reusable)
   -> if no Runbook and no AI actions -> Agential mode
   -> creates RemediationPlan with actions
7. RemediationReconciler executes:
   -> Normal mode: executes plan actions (restart, scale, etc.)
   -> Agential mode: observe-decide-act loop (AI decides each step)
     - Each reconcile = 1 step (max 10 steps, timeout 10min)
     - AI observes K8s state -> decides action -> executes -> next step
     - Actions: Scale, Restart, Rollback, PatchConfig, AdjustResources, DeletePod
8. Issue -> Resolved (dedup invalidated for recurrence detection)
   -> If agential mode: PostMortem CR auto-generated (timeline, root cause, lessons)
   -> Runbook auto-generated from successful agential steps
   If failed -> re-analysis with failure context -> different strategy
Everything happens automatically, without human intervention. Auto-generated Runbooks are reused for future occurrences of the same issue type. In agential mode, the AI acts as an autonomous agent with K8s “skills” and, on resolution, produces a PostMortem CR with a complete timeline plus a reusable Runbook.
5. (Optional) Add Runbooks

For specific scenarios where you want to control exactly what to do:
apiVersion: platform.chatcli.io/v1alpha1
kind: Runbook
metadata:
  name: oom-standard-procedure
  namespace: production
spec:
  description: "Standard OOMKill recovery for production"
  trigger:
    signalType: oom_kill
    severity: critical
    resourceKind: Deployment
  steps:
    - name: Restart pods
      action: RestartDeployment
      description: "Restart to reclaim leaked memory"
    - name: Scale up
      action: ScaleDeployment
      description: "Add replicas for redundancy"
      params:
        replicas: "5"
  maxAttempts: 2
Remediation priority: Manual Runbook > Auto-generated Runbook > Agential remediation > Escalation. When there is no manual Runbook, the AI automatically generates a reusable Runbook CR. If neither a Runbook nor AI actions are available, the operator enters agential mode: the AI acts as an autonomous agent in an observe-decide-act loop, and upon resolution, it generates a PostMortem CR and a reusable Runbook.

Deployment Checklist

  • Verify cluster access (kubectl get pods)
  • Verify RBAC permissions for pods, logs, events
  • Choose mode: local (chatcli watch) or server (chatcli server)
  • Define targets: single (--deployment) or multi (--config targets.yaml)
  • (Optional) Configure metricsPort for Prometheus scraping
  • Configure appropriate interval and window for the scenario
  • Adjust maxContextChars if needed (default: 32000)
  • Test with a simple question: “Is the deployment healthy?”
  • (Optional) Integrate with alerts for automatic analysis
  • (Optional) Distribute access to the team via token
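For the RBAC item, a minimal read-only Role is typically enough (a sketch; the name and the exact resource list may need adjusting for your setup):

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: chatcli-watcher      # illustrative name
  namespace: production
rules:
  - apiGroups: [""]
    resources: ["pods", "pods/log", "events"]
    verbs: ["get", "list", "watch"]
  - apiGroups: ["apps"]
    resources: ["deployments", "replicasets"]
    verbs: ["get", "list", "watch"]
```

Bind it to the ServiceAccount ChatCLI runs as with a matching RoleBinding; the `kubectl auth can-i` checks from Option 1 confirm the grant took effect.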