Architecture
- Single-Target (legacy)
- Multi-Target (current)
Each ResourceWatcher has its own collectors (including an optional PrometheusCollector), and all watchers share a single Kubernetes clientset, which minimizes connections.
Usage Modes
- Single Resource
- Multiple Resources (YAML)
- Server with Watcher
Multi-Target Configuration File
Target Fields
| Field | Description | Required |
|---|---|---|
| `deployment` | Resource name to monitor | Yes |
| `kind` | Resource kind: Deployment, StatefulSet, DaemonSet, Job, CronJob (default: Deployment) | No |
| `namespace` | Namespace (default: default) | No |
| `metricsPort` | Prometheus endpoint port (0 = disabled) | No |
| `metricsPath` | HTTP path for metrics (default: /metrics) | No |
| `metricsFilter` | Glob filters for metrics (empty = all) | No |
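The defaults in the table above can be illustrated with a minimal Python sketch. It is not ChatCLI's own loader: the `Target` class and `parse_target` helper are hypothetical names, and treating an omitted `metricsPort` as 0 (disabled) is an assumption based on the table.

```python
from dataclasses import dataclass, field

VALID_KINDS = {"Deployment", "StatefulSet", "DaemonSet", "Job", "CronJob"}


@dataclass
class Target:
    """Hypothetical in-memory form of one target entry."""
    deployment: str                  # required: resource name to monitor
    kind: str = "Deployment"         # default kind per the table
    namespace: str = "default"       # default namespace per the table
    metricsPort: int = 0             # assumption: omitted means 0 (disabled)
    metricsPath: str = "/metrics"    # default HTTP path per the table
    metricsFilter: list = field(default_factory=list)  # empty = all metrics


def parse_target(raw: dict) -> Target:
    """Apply defaults and reject entries the table marks as invalid."""
    if not raw.get("deployment"):
        raise ValueError("target is missing required field 'deployment'")
    target = Target(**raw)
    if target.kind not in VALID_KINDS:
        raise ValueError(f"unsupported kind: {target.kind}")
    return target
```

Only `deployment` has to be supplied; every other field falls back to the default listed in the table.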
Complete Flags
chatcli watch
| Flag | Description | Default | Env Var |
|---|---|---|---|
| `--config` | Multi-target YAML file | | |
| `--deployment` | Resource name to monitor | | `CHATCLI_WATCH_DEPLOYMENT` |
| `--kind` | Resource kind: Deployment, StatefulSet, DaemonSet, Job, CronJob | Deployment | |
| `--namespace` | Resource namespace | default | `CHATCLI_WATCH_NAMESPACE` |
| `--interval` | Interval between collections | 30s | `CHATCLI_WATCH_INTERVAL` |
| `--window` | Data time window | 2h | `CHATCLI_WATCH_WINDOW` |
| `--max-log-lines` | Log lines per pod | 100 | `CHATCLI_WATCH_MAX_LOG_LINES` |
| `--kubeconfig` | Kubeconfig path | Auto-detected | `CHATCLI_KUBECONFIG` |
| `--provider` | LLM provider | .env | `LLM_PROVIDER` |
| `--model` | LLM model | .env | |
| `-p <prompt>` | One-shot: send and exit | | |
| `--max-tokens` | Token limit in response | | |
chatcli server (watcher flags)
| Flag | Description | Default | Env Var |
|---|---|---|---|
| `--watch-config` | Multi-target YAML file | | `CHATCLI_WATCH_CONFIG` |
| `--watch-deployment` | Single deployment (legacy) | | `CHATCLI_WATCH_DEPLOYMENT` |
| `--watch-namespace` | Namespace | default | `CHATCLI_WATCH_NAMESPACE` |
| `--watch-interval` | Collection interval | 30s | `CHATCLI_WATCH_INTERVAL` |
| `--watch-window` | Observation window | 2h | `CHATCLI_WATCH_WINDOW` |
| `--watch-max-log-lines` | Max log lines | 100 | `CHATCLI_WATCH_MAX_LOG_LINES` |
| `--watch-kubeconfig` | Kubeconfig path | Auto-detected | `CHATCLI_KUBECONFIG` |
What Is Collected
Collectors per Target
| Collector | Data Collected |
|---|---|
| Deployment | Replicas (ready/available/updated), strategy, conditions |
| Pod Status | Phase, readiness, restarts, termination info, container status |
| Events | K8s events (Warning/Normal), message, reason, timestamp |
| Logs | Last N lines per container per pod |
| Metrics | CPU and memory per pod (via metrics-server) |
| HPA | Min/max replicas, current metrics, desired replicas |
| Prometheus | Application metrics from the pod /metrics endpoint |
| Node Health | Node conditions (Ready, DiskPressure, MemoryPressure, PIDPressure, NetworkUnavailable), cordoned state, CPU/mem usage, pod count vs capacity, kubelet version |
Node Health Collector
The NodeCollector automatically monitors the health of the nodes where the target's pods are running:
- Discovers nodes: uses the pod label selector to identify which nodes the pods are scheduled on
- Collects conditions: all 5 official Kubernetes conditions (`Ready`, `DiskPressure`, `MemoryPressure`, `PIDPressure`, `NetworkUnavailable`)
- Collects metrics: node CPU and memory via metrics-server (when available)
- Pod capacity: counts active pods against the node's maximum pod capacity
- Cordoned: detects nodes marked as `unschedulable`
| Condition | Severity | Example message |
|---|---|---|
| NotReady | CRITICAL | Node worker-1 is NotReady |
| DiskPressure | CRITICAL | Node worker-1 has DiskPressure |
| MemoryPressure | CRITICAL | Node worker-1 has MemoryPressure |
| PIDPressure | WARNING | Node worker-1 has PIDPressure |
| NetworkUnavailable | CRITICAL | Node worker-1 has NetworkUnavailable |
| Cordoned | WARNING | Node worker-1 is cordoned (unschedulable) |
| Pod capacity >90% | WARNING | Node worker-1 pod capacity at 100/110 (>90%) |
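The severity table can be read as a small rule set. The Python sketch below is an illustration only, not the NodeCollector's actual code: the `node_alerts` function and its argument shapes are hypothetical, and treating a missing or `Unknown` Ready condition as NotReady is an assumption.

```python
# Conditions that raise an alert when their status is "True",
# with the severities from the table above.
PRESSURE_SEVERITY = {
    "DiskPressure": "CRITICAL",
    "MemoryPressure": "CRITICAL",
    "PIDPressure": "WARNING",
    "NetworkUnavailable": "CRITICAL",
}
POD_CAPACITY_THRESHOLD = 0.90


def node_alerts(name, conditions, unschedulable, pod_count, pod_capacity):
    """Return (severity, message) pairs for one node.

    `conditions` maps a condition type to its status string
    ("True", "False" or "Unknown"), as reported by Kubernetes.
    """
    alerts = []
    # Assumption: anything other than Ready=True counts as NotReady.
    if conditions.get("Ready") != "True":
        alerts.append(("CRITICAL", f"Node {name} is NotReady"))
    for cond, severity in PRESSURE_SEVERITY.items():
        if conditions.get(cond) == "True":
            alerts.append((severity, f"Node {name} has {cond}"))
    if unschedulable:
        alerts.append(("WARNING", f"Node {name} is cordoned (unschedulable)"))
    if pod_capacity > 0 and pod_count / pod_capacity > POD_CAPACITY_THRESHOLD:
        alerts.append(
            ("WARNING",
             f"Node {name} pod capacity at {pod_count}/{pod_capacity} (>90%)")
        )
    return alerts
```

With a capacity of 110 pods, the capacity warning in this sketch first fires at 100 scheduled pods, since 99/110 is exactly 90%.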
Prometheus Collector (New)
The PrometheusCollector scrapes Prometheus metrics directly from pods:
- Discovers deployment pods and selects 1 Ready pod
- Makes an HTTP GET to `http://podIP:port/path` (timeout: 5s)
- Parses the Prometheus text exposition format (stdlib, no dependencies)
- Filters by configured glob patterns
- Ignores NaN, Inf, and comment lines
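The parsing and filtering steps can be sketched with the Python standard library alone. This is an illustrative approximation rather than the PrometheusCollector's code: `parse_metrics` is a hypothetical name, glob matching on the metric name (not the labels) is an assumption, and the sketch skips the HTTP request and works on text that has already been fetched.

```python
import math
from fnmatch import fnmatchcase


def parse_metrics(text, filters=()):
    """Parse Prometheus text exposition format into (series, value) pairs.

    Comment lines, unparsable lines, NaN and +/-Inf samples are dropped.
    An empty `filters` sequence keeps every metric.
    """
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank line or HELP/TYPE comment
        if "{" in line:
            # Labels may contain spaces, so split after the closing brace.
            end = line.rfind("}")
            if end == -1:
                continue
            series, rest = line[: end + 1], line[end + 1:].split()
        else:
            parts = line.split()
            series, rest = parts[0], parts[1:]
        if not rest:
            continue
        try:
            value = float(rest[0])  # an optional timestamp in rest[1] is ignored
        except ValueError:
            continue
        if math.isnan(value) or math.isinf(value):
            continue
        name = series.split("{", 1)[0]
        if filters and not any(fnmatchcase(name, pat) for pat in filters):
            continue
        samples.append((series, value))
    return samples
```

Calling `parse_metrics(body, ["http_*"])` would keep only series whose metric name starts with `http_`, mirroring a `metricsFilter` entry.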
Context Budget Management (MultiSummarizer)
With multiple targets, the MultiSummarizer ensures the context does not exceed the LLM window.
Algorithm
Scores each target (0 = healthy, 1 = warning, 2 = critical)
- Critical: CrashLoopBackOff, OOMKilled, critical alerts
- Warning: replicas < desired, error logs, warning alerts
- Healthy: everything ok
Allocates context
- Score >= 1: full context (~1-3 KB per target)
- Score == 0: compact one-liner (~80 chars per target)
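The score-then-allocate idea can be sketched in a few lines of Python. This is a simplified illustration, not the MultiSummarizer itself: the dictionary keys (`pod_states`, `alert_severities`, `ready`, `desired`, `error_logs`, `detail`) and the wording of the one-liner are hypothetical.

```python
HEALTHY, WARNING, CRITICAL = 0, 1, 2


def score_target(t: dict) -> int:
    """Score one target: 0 = healthy, 1 = warning, 2 = critical."""
    states = set(t.get("pod_states", []))
    severities = set(t.get("alert_severities", []))
    if states & {"CrashLoopBackOff", "OOMKilled"} or "critical" in severities:
        return CRITICAL
    if (t.get("ready", 0) < t.get("desired", 0)
            or t.get("error_logs", 0) > 0
            or "warning" in severities):
        return WARNING
    return HEALTHY


def build_context(targets: list) -> str:
    """Give full detail to targets with issues and one line to healthy ones."""
    sections = []
    for t in targets:
        if score_target(t) >= WARNING:
            # `detail` stands in for the pre-rendered full context (~1-3 KB).
            sections.append(t["detail"])
        else:
            line = f"{t['name']}: healthy, {t.get('ready', 0)}/{t.get('desired', 0)} ready"
            sections.append(line[:80])  # keep the compact form to ~80 chars
    return "\n".join(sections)
```

Because healthy targets cost a fixed, small amount of context, the total grows mainly with the number of targets that have issues.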
Example with 20 Targets (2 with issues)
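As a rough estimate of what this case costs, the sizes quoted above can be plugged into a short calculation. The sketch assumes 1 KB = 1,000 characters and takes the "~1-3 KB" and "~80 chars" figures at face value; `context_estimate` is a hypothetical helper, not part of ChatCLI.

```python
FULL_CHARS = (1_000, 3_000)  # ~1-3 KB per target with issues
COMPACT_CHARS = 80           # ~80 chars per healthy target


def context_estimate(total_targets, targets_with_issues):
    """Return the (low, high) estimated context size in characters."""
    healthy = total_targets - targets_with_issues
    compact = healthy * COMPACT_CHARS
    low = targets_with_issues * FULL_CHARS[0] + compact
    high = targets_with_issues * FULL_CHARS[1] + compact
    return low, high


print(context_estimate(20, 2))   # (3440, 7440)
print(context_estimate(20, 20))  # (20000, 60000)
```

Under these assumptions, 20 targets with 2 issues need roughly 3.4 to 7.4 KB of context, compared with 20 to 60 KB if every target received full detail.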
Anomaly Detection
| Anomaly | Condition | Severity |
|---|---|---|
| CrashLoopBackOff | Pod with more than 5 restarts | Critical |
| OOMKilled | Container terminated due to lack of memory | Critical |
| PodNotReady | Pod is not in the Ready state | Warning |
| DeploymentFailing | Deployment with Available=False | Critical |
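The four rules in the table translate directly into code. The following Python sketch is illustrative and does not reproduce the watcher's implementation: `detect_anomalies` and the pod dictionary keys (`restarts`, `last_termination_reason`, `ready`) are hypothetical.

```python
RESTART_THRESHOLD = 5  # "more than 5 restarts" per the table


def detect_anomalies(pods, deployment_conditions):
    """Return (anomaly, severity, subject) tuples for one target.

    `pods` is a list of dicts; `deployment_conditions` maps a condition
    type such as "Available" to its status string ("True" or "False").
    """
    anomalies = []
    for pod in pods:
        name = pod["name"]
        if pod.get("restarts", 0) > RESTART_THRESHOLD:
            anomalies.append(("CrashLoopBackOff", "Critical", name))
        if pod.get("last_termination_reason") == "OOMKilled":
            anomalies.append(("OOMKilled", "Critical", name))
        if not pod.get("ready", False):
            anomalies.append(("PodNotReady", "Warning", name))
    if deployment_conditions.get("Available") == "False":
        anomalies.append(("DeploymentFailing", "Critical", "deployment"))
    return anomalies
```

A single pod can trigger several anomalies at once, for example an OOM-killed container that is also not Ready.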
Observability Store
Collected data is stored in a ring buffer per target with a configurable time window:
- Snapshots: Complete periodic state (pods, deployment, HPA, events, metrics, app metrics)
- Logs: Recent logs from each pod with classification (info/warning/error)
- Alerts: Detected anomalies with severity and timestamps
Automatic Rotation
Data older than the time window (--window) is automatically discarded, so memory usage per target stays bounded no matter how long the watcher runs.
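Time-window rotation can be sketched with a double-ended queue. This is a minimal Python illustration rather than the store's actual data structure: `TargetStore` is a hypothetical class, and evicting on every insert is an assumption about when rotation happens.

```python
from collections import deque
from datetime import datetime, timedelta


class TargetStore:
    """Per-target buffer that keeps only snapshots inside the time window."""

    def __init__(self, window: timedelta):
        self.window = window
        self.snapshots = deque()  # (timestamp, snapshot), oldest first

    def add(self, timestamp: datetime, snapshot: dict) -> None:
        self.snapshots.append((timestamp, snapshot))
        self._evict(timestamp)

    def _evict(self, now: datetime) -> None:
        # Drop entries from the front until the oldest is inside the window.
        cutoff = now - self.window
        while self.snapshots and self.snapshots[0][0] < cutoff:
            self.snapshots.popleft()


# With the default 30s interval and 2h window, the buffer settles at
# about 2h / 30s = 240 snapshots per target.
store = TargetStore(window=timedelta(hours=2))
start = datetime(2024, 1, 1)
for i in range(1000):
    store.add(start + timedelta(seconds=30 * i), {"seq": i})
```

However long the loop runs, the number of retained snapshots stops growing once the window is full.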
/watch Command
Inside interactive ChatCLI (local or remote), use /watch to see the status:
- Single-Target
- Multi-Target
One-Shot with K8s Context
Example Questions
Requirements
- Kubernetes Cluster: Access via kubeconfig or in-cluster config
- RBAC Permissions: Read access to pods, events, logs, deployments, HPA, ingresses
- metrics-server (optional): For CPU/memory collection
- Prometheus endpoints (optional): Apps that expose `/metrics` in Prometheus text format
RBAC
- Single-namespace (Role + RoleBinding)
AIOps Integration
K8s Watcher alerts automatically feed into the Operator's AIOps pipeline. When the Operator detects alerts via the `GetAlerts` RPC, it creates Anomaly CRs that are correlated into Issues, analyzed by AI, and automatically remediated.
Starting with AIOps Platform v2, Watcher alerts also feed into:
- NotificationPolicy for automatic routing to Slack, PagerDuty, OpsGenie, Email, Webhook and Teams
- ApprovalPolicy for approval gates before production remediations
- ServiceLevelObjective for burn rate and error budget calculation
- NoiseReducer for suppression of repetitive, seasonal and flapping alerts
Next Steps
- Server Mode: Configure the server with watcher
- K8s Operator: K8s Operator (AIOps)
- AIOps Platform: AIOps Platform (deep-dive)
- Deploy on Kubernetes