The K8s Watcher allows ChatCLI to monitor multiple deployments simultaneously, collecting infrastructure and application metrics, logs, events, and pod status. The context is automatically injected into LLM prompts with intelligent budget management to avoid exceeding the context window.

Architecture

ChatCLI -> ResourceWatcher -> 6 Collectors -> ObservabilityStore -> Summarizer -> LLM
Each ResourceWatcher has its own collectors (including an optional PrometheusCollector), while all watchers share a single Kubernetes clientset to minimize connections.

Usage Modes

# Deployment (default)
chatcli watch --deployment myapp --namespace production

# StatefulSet
chatcli watch --deployment postgres --kind StatefulSet --namespace production

# DaemonSet
chatcli watch --deployment fluentd --kind DaemonSet --namespace logging

# CronJob
chatcli watch --deployment etl-pipeline --kind CronJob --namespace data

# One-shot with prompt
chatcli watch --deployment myapp -p "Is the deployment healthy?"

Multi-Target Configuration File

# targets.yaml
interval: "30s"           # Collection interval
window: "2h"              # Time window of retained data
maxLogLines: 100          # Log lines per pod per cycle
maxContextChars: 32000     # Maximum character budget for LLM context

targets:
  - deployment: api-gateway                                  # Deployment (default kind)
    namespace: production
    metricsPort: 9090                                        # Prometheus port
    metricsFilter: ["http_requests_total", "http_request_duration_*"]

  - deployment: auth-service
    namespace: production
    metricsPort: 9090

  - deployment: worker
    namespace: batch
    # No metricsPort = Prometheus disabled for this target

  - deployment: postgres                                     # StatefulSet (database)
    kind: StatefulSet
    namespace: production

  - deployment: fluentd                                      # DaemonSet (logging)
    kind: DaemonSet
    namespace: logging

  - deployment: etl-pipeline                                 # CronJob (scheduled batch)
    kind: CronJob
    namespace: data

Target Fields

| Field | Description | Required |
| --- | --- | --- |
| deployment | Resource name to monitor | Yes |
| kind | Resource kind: Deployment, StatefulSet, DaemonSet, Job, CronJob (default: Deployment) | No |
| namespace | Namespace (default: default) | No |
| metricsPort | Prometheus endpoint port (0 = disabled) | No |
| metricsPath | HTTP path for metrics (default: /metrics) | No |
| metricsFilter | Glob filters for metrics (empty = all) | No |
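
The defaults in the table can be read as a small normalization step applied to each decoded `targets` entry. The following is a minimal sketch of that idea, not ChatCLI's actual loader: the `Target` class, `parse_target` helper, and snake_case attribute names are hypothetical, and the input is assumed to be a mapping already decoded from YAML.

```python
from dataclasses import dataclass, field

VALID_KINDS = {"Deployment", "StatefulSet", "DaemonSet", "Job", "CronJob"}


@dataclass
class Target:
    """One entry of the `targets` list, carrying the documented defaults."""
    deployment: str                       # required: resource name to monitor
    kind: str = "Deployment"
    namespace: str = "default"
    metrics_port: int = 0                 # 0 = Prometheus scraping disabled
    metrics_path: str = "/metrics"
    metrics_filter: list = field(default_factory=list)  # empty = all metrics

    @property
    def prometheus_enabled(self) -> bool:
        return self.metrics_port > 0


def parse_target(raw: dict) -> Target:
    """Build a Target from a decoded YAML mapping (hypothetical helper)."""
    name = raw.get("deployment")
    if not name:
        raise ValueError("target is missing the required 'deployment' field")
    kind = raw.get("kind", "Deployment")
    if kind not in VALID_KINDS:
        raise ValueError(f"unsupported kind: {kind!r}")
    return Target(
        deployment=name,
        kind=kind,
        namespace=raw.get("namespace", "default"),
        metrics_port=int(raw.get("metricsPort", 0)),
        metrics_path=raw.get("metricsPath", "/metrics"),
        metrics_filter=list(raw.get("metricsFilter", [])),
    )


worker = parse_target({"deployment": "worker", "namespace": "batch"})
gateway = parse_target(
    {"deployment": "api-gateway", "namespace": "production", "metricsPort": 9090}
)
print(worker.kind, worker.prometheus_enabled)    # no metricsPort, so scraping is off
print(gateway.metrics_path, gateway.prometheus_enabled)
```

Keeping "disabled" as a port value of 0 mirrors the table, so a target without `metricsPort` needs no separate on/off switch.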

Complete Flags

chatcli watch

| Flag | Description | Default | Env Var |
| --- | --- | --- | --- |
| --config | Multi-target YAML file | | |
| --deployment | Resource name to monitor | | CHATCLI_WATCH_DEPLOYMENT |
| --kind | Resource kind: Deployment, StatefulSet, DaemonSet, Job, CronJob | Deployment | |
| --namespace | Resource namespace | default | CHATCLI_WATCH_NAMESPACE |
| --interval | Interval between collections | 30s | CHATCLI_WATCH_INTERVAL |
| --window | Data time window | 2h | CHATCLI_WATCH_WINDOW |
| --max-log-lines | Log lines per pod | 100 | CHATCLI_WATCH_MAX_LOG_LINES |
| --kubeconfig | Kubeconfig path | Auto-detected | CHATCLI_KUBECONFIG |
| --provider | LLM provider | .env | LLM_PROVIDER |
| --model | LLM model | .env | |
| -p <prompt> | One-shot: send and exit | | |
| --max-tokens | Token limit in response | | |

chatcli server (watcher flags)

| Flag | Description | Default | Env Var |
| --- | --- | --- | --- |
| --watch-config | Multi-target YAML file | | CHATCLI_WATCH_CONFIG |
| --watch-deployment | Single deployment (legacy) | | CHATCLI_WATCH_DEPLOYMENT |
| --watch-namespace | Namespace | default | CHATCLI_WATCH_NAMESPACE |
| --watch-interval | Collection interval | 30s | CHATCLI_WATCH_INTERVAL |
| --watch-window | Observation window | 2h | CHATCLI_WATCH_WINDOW |
| --watch-max-log-lines | Max log lines | 100 | CHATCLI_WATCH_MAX_LOG_LINES |
| --watch-kubeconfig | Kubeconfig path | Auto-detected | CHATCLI_KUBECONFIG |

What Is Collected

Collectors per Target

| Collector | Data Collected |
| --- | --- |
| Deployment | Replicas (ready/available/updated), strategy, conditions |
| Pod Status | Phase, readiness, restarts, termination info, container status |
| Events | K8s events (Warning/Normal), message, reason, timestamp |
| Logs | Last N lines per container per pod |
| Metrics | CPU and memory per pod (via metrics-server) |
| HPA | Min/max replicas, current metrics, desired replicas |
| Prometheus | Application metrics from the pod /metrics endpoint |
| Node Health | Node conditions (Ready, DiskPressure, MemoryPressure, PIDPressure, NetworkUnavailable), cordoned state, CPU/mem usage, pod count vs capacity, kubelet version |

Node Health Collector

The NodeCollector automatically monitors the health of nodes where the target’s pods are running:
  1. Discovers nodes — via pod label selector, identifies which nodes the pods are scheduled on
  2. Collects conditions — all 5 official Kubernetes conditions (Ready, DiskPressure, MemoryPressure, PIDPressure, NetworkUnavailable)
  3. Collects metrics — node CPU and memory via metrics-server (when available)
  4. Pod capacity — counts active pods vs node maximum capacity
  5. Cordoned — detects nodes marked as unschedulable
Alerts emitted:
| Condition | Severity | Example message |
| --- | --- | --- |
| NotReady | CRITICAL | Node worker-1 is NotReady |
| DiskPressure | CRITICAL | Node worker-1 has DiskPressure |
| MemoryPressure | CRITICAL | Node worker-1 has MemoryPressure |
| PIDPressure | WARNING | Node worker-1 has PIDPressure |
| NetworkUnavailable | CRITICAL | Node worker-1 has NetworkUnavailable |
| Cordoned | WARNING | Node worker-1 is cordoned (unschedulable) |
| Pod capacity >90% | WARNING | Node worker-1 pod capacity at 105/110 (>90%) |
Node context is included in the summary sent to the LLM, enabling the AI to correlate pod problems with infrastructure:
## Nodes (2)
  - worker-1: Ready cpu=1200m/4 mem=3Gi/8Gi pods=45/110 k8s=v1.31.4
  - worker-2: NOT READY [MemoryPressure] cpu=3800m/4 mem=7.8Gi/8Gi pods=98/110 k8s=v1.31.4
    MemoryPressure: kubelet has insufficient memory available
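
The mapping from node state to alerts can be sketched as a plain rule table. This is an illustration built only from the severities and messages listed above; the `Node` class, its field names, and `node_alerts` are hypothetical and do not reflect the NodeCollector's real types.

```python
from dataclasses import dataclass, field

# Severity per pressure condition, copied from the alert table above.
PRESSURE_SEVERITY = {
    "DiskPressure": "CRITICAL",
    "MemoryPressure": "CRITICAL",
    "PIDPressure": "WARNING",
    "NetworkUnavailable": "CRITICAL",
}


@dataclass
class Node:
    """Hypothetical node snapshot; field names are illustrative only."""
    name: str
    ready: bool = True
    active_conditions: list = field(default_factory=list)  # conditions currently True
    cordoned: bool = False
    pods: int = 0
    pod_capacity: int = 110


def node_alerts(node: Node) -> list:
    """Return (severity, message) pairs for one node."""
    alerts = []
    if not node.ready:
        alerts.append(("CRITICAL", f"Node {node.name} is NotReady"))
    for cond in node.active_conditions:
        if cond in PRESSURE_SEVERITY:
            alerts.append((PRESSURE_SEVERITY[cond], f"Node {node.name} has {cond}"))
    if node.cordoned:
        alerts.append(("WARNING", f"Node {node.name} is cordoned (unschedulable)"))
    # Capacity alert fires only above 90% of the node's pod limit.
    if node.pod_capacity > 0 and node.pods / node.pod_capacity > 0.90:
        alerts.append((
            "WARNING",
            f"Node {node.name} pod capacity at {node.pods}/{node.pod_capacity} (>90%)",
        ))
    return alerts


worker2 = Node("worker-2", ready=False, active_conditions=["MemoryPressure"], pods=98)
for severity, message in node_alerts(worker2):
    print(f"[{severity}] {message}")
```

For the worker-2 node from the sample output, this yields the NotReady and MemoryPressure alerts but no capacity warning, since 98/110 is roughly 89%.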

Prometheus Collector (New)

The PrometheusCollector scrapes Prometheus metrics directly from pods:
  • Discovers deployment pods and selects 1 Ready pod
  • Makes an HTTP GET to http://podIP:port/path (timeout: 5s)
  • Parses the Prometheus text exposition format (stdlib, no dependencies)
  • Filters by configured glob patterns
  • Ignores NaN, Inf, and comment lines
Glob filter examples:
metricsFilter:
  - "http_requests_*"          # Metrics whose name starts with http_requests_
  - "process_*"                # Process metrics
  - "go_goroutines"            # Specific metric
  - "*_duration_seconds_*"     # Any duration metric
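
To make the parse-and-filter behaviour concrete, here is a minimal Python sketch of the same idea using only the standard library. It is an approximation, not the PrometheusCollector's code: the `parse_metrics` function is hypothetical, it assumes filters are matched against the metric name only, and it drops labels, so series that differ only by labels overwrite each other.

```python
import math
from fnmatch import fnmatchcase


def parse_metrics(text, filters):
    """Parse Prometheus text exposition lines and keep names matching a glob.

    An empty filter list keeps every metric. Labels are discarded, so the
    last sample seen for a given metric name wins (a simplification).
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):      # blank, HELP/TYPE, comments
            continue
        if "{" in line:
            name = line[: line.index("{")]
            rest = line[line.rindex("}") + 1:]
        else:
            parts = line.split(None, 1)
            name = parts[0]
            rest = parts[1] if len(parts) > 1 else ""
        fields = rest.split()
        if not fields:
            continue
        try:
            value = float(fields[0])              # fields[1], if present, is a timestamp
        except ValueError:
            continue
        if math.isnan(value) or math.isinf(value):
            continue
        if filters and not any(fnmatchcase(name, pat) for pat in filters):
            continue
        samples[name] = value
    return samples


SAMPLE = """\
# HELP http_requests_total Total requests
# TYPE http_requests_total counter
http_requests_total{method="GET",code="200"} 1542000
http_request_duration_seconds_sum 8453
go_goroutines 245
go_gc_duration_seconds{quantile="0"} NaN
process_resident_memory_bytes 1.34e+08
"""

kept = parse_metrics(SAMPLE, ["http_requests_*", "go_goroutines", "*_duration_seconds_*"])
for name, value in sorted(kept.items()):
    print(name, value)
```

With these filters, `process_resident_memory_bytes` is excluded because no pattern matches it, and the NaN sample is skipped before filtering is even considered.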

Context Budget Management (MultiSummarizer)

With multiple targets, the MultiSummarizer ensures the context does not exceed the LLM window:

Algorithm

  1. Scores each target: 0 = healthy, 1 = warning, 2 = critical
       • Critical: CrashLoopBackOff, OOMKilled, critical alerts
       • Warning: replicas < desired, error logs, warning alerts
       • Healthy: everything ok
  2. Sorts by priority: critical first, then warning, then healthy.
  3. Allocates context:
       • Score >= 1: full context (~1-3 KB per target)
       • Score == 0: compact one-liner (~80 chars per target)
  4. Compresses if exceeding maxContextChars: compresses healthy targets first.
  5. Omits if still exceeding: omits healthy targets when necessary.
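
The five steps can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the MultiSummarizer's implementation: `TargetSummary` and `build_context` are hypothetical names, targets are assumed to arrive already scored, and "compression" is modelled as truncating a healthy one-liner to a hypothetical `compressed_len`, since the exact compressed form is not specified here.

```python
from dataclasses import dataclass

HEALTHY, WARNING, CRITICAL = 0, 1, 2


@dataclass
class TargetSummary:
    name: str
    score: int      # step 1 result: 0 = healthy, 1 = warning, 2 = critical
    full: str       # detailed context
    compact: str    # one-line summary


def build_context(targets, max_chars, compressed_len=40):
    # Step 2: critical first, then warning, then healthy (stable sort).
    ordered = sorted(targets, key=lambda t: t.score, reverse=True)
    # Step 3: full context for unhealthy targets, one-liner for healthy ones.
    parts = [t.full if t.score >= WARNING else t.compact for t in ordered]

    def size():
        return sum(len(p) for p in parts) + max(len(parts) - 1, 0)  # + newlines

    # Step 4: over budget, so shorten healthy one-liners first (assumed behaviour).
    if size() > max_chars:
        parts = [p if t.score >= WARNING else p[:compressed_len]
                 for t, p in zip(ordered, parts)]

    # Step 5: still over budget, so drop healthy targets from the end.
    omitted = 0
    while size() > max_chars and ordered and ordered[-1].score == HEALTHY:
        ordered.pop()
        parts.pop()
        omitted += 1
    if omitted:
        parts.append(f"... ({omitted} healthy targets omitted)")
    return "\n".join(parts)


fleet = [
    TargetSummary(f"prod/app-{i}", HEALTHY, "x" * 1000,
                  f"prod/app-{i}: healthy".ljust(80, "."))
    for i in range(18)
]
fleet += [
    TargetSummary("prod/api-gateway", CRITICAL, "A" * 1000, ""),
    TargetSummary("prod/worker", WARNING, "W" * 1000, ""),
]
context = build_context(fleet, max_chars=8000)
print(len(context))
```

In this sketch unhealthy targets are never shortened or dropped, so a fleet with many failing targets can still exceed the budget; how the real summarizer handles that case is not covered here.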

Example with 20 Targets (2 with issues)

[K8s Multi-Watcher: 20 targets monitored]

--- Targets Requiring Attention ---

[K8s Context: deployment/api-gateway in namespace/production]
Collected at: 2026-02-15T10:30:00Z

## Deployment Status
  Replicas: 2/3 ready, 3 updated, 2 available
  Strategy: RollingUpdate

## Pods (3 total)
  Total restarts: 12 (delta in window: 8)
  - api-gateway-abc12: Running [Ready] restarts=0 cpu=45m mem=128Mi
  - api-gateway-def34: Running [Ready] restarts=0 cpu=52m mem=135Mi
  - api-gateway-ghi56: Running [NOT READY] restarts=8 cpu=12m mem=95Mi
    Last terminated: OOMKilled (exit code 137) at 2026-02-15T10:28:00Z

## Application Metrics (4)
  http_request_duration_seconds_sum: 8453
  http_requests_total: 1.542e+06
  process_resident_memory_bytes: 1.34e+08
  go_goroutines: 245

## Active Alerts (2)
  [CRITICAL] CrashLoopBackOff: pod/api-gateway-ghi56
  [CRITICAL] OOMKilled: pod/api-gateway-ghi56

## Recent Error Logs (3)
  [10:27:45] api-gateway-ghi56/app: OutOfMemoryError: heap space
  [10:27:46] api-gateway-ghi56/app: Shutting down...
  [10:28:00] api-gateway-ghi56/app: Process exited with code 137

--- Healthy Targets ---
- production/auth-service: 3/3 pods ready | healthy | 0 alerts | 42 snapshots
- production/frontend: 2/2 pods ready | healthy | 0 alerts | 42 snapshots
- production/backend: 5/5 pods ready | healthy | 0 alerts | 42 snapshots
- batch/worker: 3/3 pods ready | healthy | 0 alerts | 42 snapshots
... (14 more compact targets)
Total budget: ~2 KB (detail) + 18 x 80 chars (~1.4 KB, compact) = ~3.5 KB, within an 8 KB limit.

Anomaly Detection

| Anomaly | Condition | Severity |
| --- | --- | --- |
| CrashLoopBackOff | Pod with more than 5 restarts | Critical |
| OOMKilled | Container terminated due to lack of memory | Critical |
| PodNotReady | Pod is not in the Ready state | Warning |
| DeploymentFailing | Deployment with Available=False | Critical |
Alerts are included in the context sent to the LLM and influence the budget priority of the MultiSummarizer.
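
The rules in the table above translate naturally into a small rule function. The sketch below is illustrative only: `PodState`, its fields, and `detect_anomalies` are hypothetical names, and every matching rule is reported, with no attempt to suppress PodNotReady when a more severe alert already exists for the same pod.

```python
from dataclasses import dataclass


@dataclass
class PodState:
    """Hypothetical pod snapshot; field names are illustrative only."""
    name: str
    ready: bool
    restarts: int
    last_termination_reason: str = ""


def detect_anomalies(pods, deployment_available=True, restart_threshold=5):
    """Return (severity, anomaly, subject) tuples following the table above."""
    alerts = []
    if not deployment_available:
        alerts.append(("CRITICAL", "DeploymentFailing", "deployment"))
    for pod in pods:
        subject = f"pod/{pod.name}"
        if pod.restarts > restart_threshold:          # "more than 5 restarts"
            alerts.append(("CRITICAL", "CrashLoopBackOff", subject))
        if pod.last_termination_reason == "OOMKilled":
            alerts.append(("CRITICAL", "OOMKilled", subject))
        if not pod.ready:
            alerts.append(("WARNING", "PodNotReady", subject))
    return alerts


pods = [
    PodState("api-gateway-abc12", ready=True, restarts=0),
    PodState("api-gateway-def34", ready=True, restarts=0),
    PodState("api-gateway-ghi56", ready=False, restarts=8,
             last_termination_reason="OOMKilled"),
]
for severity, anomaly, subject in detect_anomalies(pods):
    print(f"[{severity}] {anomaly}: {subject}")
```

The highest severity among a target's alerts is a natural input for the score used by the MultiSummarizer, although that wiring is an assumption here.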

Observability Store

Collected data is stored in a ring buffer per target with a configurable time window:
  • Snapshots: Complete periodic state (pods, deployment, HPA, events, metrics, app metrics)
  • Logs: Recent logs from each pod with classification (info/warning/error)
  • Alerts: Detected anomalies with severity and timestamps

Automatic Rotation

Data older than the time window (--window) is automatically discarded, so each target's memory usage stays bounded no matter how long the watcher runs.
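
Time-window rotation can be sketched with a double-ended queue that evicts old entries on every write. This is a minimal stand-in for the per-target buffer, not the ObservabilityStore itself; the `WindowedStore` class and its methods are hypothetical.

```python
from collections import deque
from datetime import datetime, timedelta


class WindowedStore:
    """Keeps only entries newer than `window`, oldest first."""

    def __init__(self, window: timedelta):
        self.window = window
        self._items = deque()   # (timestamp, snapshot) pairs

    def add(self, timestamp: datetime, snapshot) -> None:
        self._items.append((timestamp, snapshot))
        self._evict(now=timestamp)

    def _evict(self, now: datetime) -> None:
        cutoff = now - self.window
        # Entries are appended in time order, so stale ones sit at the left.
        while self._items and self._items[0][0] < cutoff:
            self._items.popleft()

    def __len__(self) -> int:
        return len(self._items)


store = WindowedStore(window=timedelta(hours=2))
start = datetime(2026, 2, 15, 8, 0, 0)
for i in range(300):                          # 300 cycles at a 30s interval
    store.add(start + timedelta(seconds=30 * i), {"cycle": i})
print(len(store))
```

With the default 30s interval and 2h window, the buffer settles at roughly 240 snapshots per target (2h / 30s), which is what keeps per-target memory bounded.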

/watch Command

Inside interactive ChatCLI (local or remote), use /watch to see the status:
/watch
K8s Watcher Active
  Deployment:  myapp
  Namespace:   production
  Snapshots:   42
  Pods:        3
  Alerts:      1

One-Shot with K8s Context

# Single deployment
chatcli watch --deployment myapp -p "Is the deployment healthy?"

# Multi-target
chatcli watch --config targets.yaml -p "Summarize the status of all deployments"

# Via remote server
chatcli connect myserver:50051 -p "Why are the pods restarting?"

Example Questions

> Is the deployment healthy?
> Which deployments need attention?
> Why is pod xyz restarting?
> Analyze the HTTP metrics of api-gateway. Is the latency acceptable?
> Compare the auth-service state with 30 minutes ago
> What warning events occurred in the last hour?
> Based on the Prometheus metrics, do I need to scale any deployment?
> Summarize the status of all targets for a team report

Requirements

  • Kubernetes Cluster: Access via kubeconfig or in-cluster config
  • RBAC Permissions: Read access to pods, events, logs, deployments, HPA, ingresses
  • metrics-server (optional): For CPU/memory collection
  • Prometheus endpoints (optional): Apps that expose /metrics in Prometheus text format

RBAC

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: chatcli-watcher
  namespace: production
rules:
  - apiGroups: [""]
    resources: ["pods", "pods/log", "events", "services", "endpoints"]
    verbs: ["get", "list", "watch"]
  - apiGroups: ["apps"]
    resources: ["deployments", "replicasets", "statefulsets", "daemonsets"]
    verbs: ["get", "list", "watch"]
  - apiGroups: ["autoscaling"]
    resources: ["horizontalpodautoscalers"]
    verbs: ["get", "list", "watch"]
  - apiGroups: ["networking.k8s.io"]
    resources: ["ingresses"]
    verbs: ["get", "list"]
  - apiGroups: ["metrics.k8s.io"]
    resources: ["pods"]
    verbs: ["get", "list"]

AIOps Integration

K8s Watcher alerts automatically feed into the Operator’s AIOps pipeline. When the Operator detects alerts via GetAlerts RPC, it creates Anomaly CRs that are correlated into Issues, analyzed by AI, and automatically remediated.
Alerts detected by Watcher -> Anomaly -> Issue -> AIInsight -> RemediationPlan -> Resolution
See AIOps Platform for the complete flow.
Starting with AIOps Platform v2, Watcher alerts also feed into:
  • NotificationPolicy for automatic routing to Slack, PagerDuty, OpsGenie, Email, Webhook and Teams
  • ApprovalPolicy for approval gates before production remediations
  • ServiceLevelObjective for burn rate and error budget calculation
  • NoiseReducer for suppression of repetitive, seasonal and flapping alerts
See the full AIOps Platform documentation for details.

Next Steps

  • Server Mode: Configure the server with watcher
  • K8s Operator: K8s Operator (AIOps)
  • AIOps Platform: AIOps Platform (deep-dive)
  • Deploy on Kubernetes