Web Dashboard and Grafana

The ChatCLI AIOps platform includes a built-in web dashboard (self-contained SPA) and 4 pre-configured Grafana dashboards for complete observability of the autonomous operations pipeline.

Web Dashboard

Overview

The Web Dashboard is a Single Page Application embedded directly in the operator binary via Go embed.FS — it does not require Node.js, npm, or any separate frontend build.

Characteristic	Detail
Technology	Vanilla HTML/CSS/JS (zero external dependencies)
Packaging	Go `embed.FS` — compiled into the binary
Port	`8090` (configurable via `CHATCLI_AIOPS_PORT`)
Theme	Dark theme (dark background, light text)
Responsive	Adapts to desktop, tablet, and mobile
Auto-refresh	Configurable interval: Off, 10s, 30s (default), 1m, 2m, 5m — with manual refresh button
Sortable tables	Click any column header to sort (▲ ascending / ▼ descending)
Stable ordering	Lists maintain consistent order between refreshes (incidents/audit by timestamp, SLOs by name)
Authentication	Same API key as the REST API (header `X-API-Key`)
URL	`http://<operator-host>:8090/`

The dashboard consumes the same REST API documented in API Reference. All operations available in the dashboard (acknowledge, snooze, approve, reject) are authenticated REST calls.
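As a rough illustration of what "authenticated REST call" means here, the sketch below builds the kind of request the dashboard would send. The `build_request` helper is hypothetical, and the host and key are placeholder values. Only the `X-API-Key` header and the `/api/v1/aiinsights` path come from this page, and the response format is not assumed.

```python
import urllib.request


def build_request(base_url: str, path: str, api_key: str) -> urllib.request.Request:
    """Build an authenticated GET request for the operator REST API (hypothetical helper)."""
    url = base_url.rstrip("/") + path
    # The dashboard and the REST API share the same key, sent in the X-API-Key header.
    return urllib.request.Request(url, headers={"X-API-Key": api_key}, method="GET")


# Placeholder host and key; replace with real values before sending.
req = build_request("http://localhost:8090", "/api/v1/aiinsights", "your-secure-key")
print(req.get_method(), req.full_url)
```

Passing `req` to `urllib.request.urlopen` would send the call. The request is only constructed here so the example runs without a live operator.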

Architecture

+--------------------------------------------------------------+
|                       Operator Binary                        |
|                                                              |
|  +--------------------+  +-------------------------------+   |
|  |  embed.FS          |  |  HTTP Server (:8090)          |   |
|  |  +-- index.html    |  |  +-- /         -> SPA         |   |
|  |  +-- styles.css    |  |  +-- /api/v1/  -> REST API    |   |
|  |  +-- app.js        |  |  +-- /healthz  -> Health      |   |
|  |                    |  |  +-- /readyz   -> Ready       |   |
|  +--------------------+  +-------------------------------+   |
|                                                              |
|  +------------------------------------------------------+    |
|  |  Kubernetes Client (informers + watch)               |    |
|  |  Issues, Anomalies, AIInsights, Plans, PostMortems   |    |
|  +------------------------------------------------------+    |
+--------------------------------------------------------------+

Dashboard Views

The dashboard has 10 views accessible via tab navigation:

1. Overview

Platform overview with aggregated metrics.

Components:

Component	Description
Stats Cards	6 cards: Active Issues, Resolved, Remediations (success/total), Success Rate, PostMortems, Pending Approvals
Compliance & SLA	Real-time compliance percentage, MTTD, MTTR, SLA response/resolution violations, approval stats
Capacity Warnings	Alert banner showing resources at risk of exhaustion (urgent/plan), with recommendations
Remediation by Strategy	Horizontal bar chart showing success rate per action type (e.g., RestartDeployment 93%, RollbackDeployment 80%)
Pie Chart (Severity)	Incident distribution by severity (Critical/High/Medium/Low)
Recent Incidents	List of the 10 most recent incidents with state, severity, and timestamp
Timeline	Temporal chart of incidents in the last 24h (stacked bars by severity)

The Overview provides comprehensive situational awareness with compliance, capacity, and remediation effectiveness metrics.

2. Incidents

Interactive table of all incidents with filters and actions.

Features:

Feature	Description
Filters	Severity, state, namespace, period (dropdowns at top)
Table	Columns: Name, Severity, State, Resource, Namespace, Detected at, Duration
Sorting	Click column header to sort (asc/desc)
Expansion	Click row to expand details: description, resource, signal type, remediation attempts, resolution
AI Insight Preview	Expanded rows automatically load the AI analysis inline with confidence badge, provider/model, analysis summary (first 500 chars), and top 3 recommendations
Acknowledge	Button to acknowledge incident (requires `operator` role)
Snooze	Button with duration selector (30m, 1h, 2h, 4h, 24h)
Pagination	Page navigation with 20 items per page
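The inline AI Insight preview can be pictured as a simple trimming step: the first 500 characters of the analysis and the top 3 recommendations. The sketch below assumes an insight is a dictionary with `analysis` and `recommendations` keys. Those field names and the `insight_preview` helper are hypothetical and not taken from the API reference.

```python
from typing import Any


def insight_preview(insight: dict[str, Any], max_chars: int = 500, max_recs: int = 3) -> dict[str, Any]:
    """Trim an AI insight to what an expanded incident row would display."""
    analysis = insight.get("analysis", "")  # hypothetical field name
    return {
        "summary": analysis[:max_chars],
        "truncated": len(analysis) > max_chars,
        "recommendations": insight.get("recommendations", [])[:max_recs],  # hypothetical field name
    }


# Made-up sample data: a 620-character analysis and four recommendations.
sample = {"analysis": "x" * 620, "recommendations": ["a", "b", "c", "d"]}
preview = insight_preview(sample)
print(len(preview["summary"]), preview["recommendations"])
```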

Severity badges:

Severity	Color
Critical	Red (#ef4444)
High	Orange (#f97316)
Medium	Yellow (#eab308)
Low	Blue (#3b82f6)

State badges:

State	Color
Detected	Gray (#6b7280)
Analyzing	Blue (#3b82f6)
Remediating	Yellow (#eab308)
Resolved	Green (#22c55e)
Escalated	Red (#ef4444)

3. SLOs

SLO cards with visual indicators of error budget and burn rate.

Components per SLO:

Component	Description
Card	SLO name, service, SLI type, target vs. current
Error Budget Gauge	Circular progress bar showing % of error budget consumed. Green (<70%), Yellow (70-90%), Red (>90%)
Burn Rate Chips	4 chips: 1h, 6h, 24h, 72h. Color indicates if burn rate exceeds threshold (Google SRE model)
State	Badge: `healthy` (green), `at_risk` (yellow), `breached` (red)
History	Sparkline of last 7 days of compliance
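To make the gauge thresholds concrete, the sketch below derives the consumed error budget from a target and a current SLI ratio, then maps it to the gauge color. The formula is the standard error-budget definition (observed failure fraction divided by allowed failure fraction). Whether the operator computes the value exactly this way is an assumption, and both function names are hypothetical.

```python
def budget_consumed_pct(target: float, current: float) -> float:
    """Percentage of the error budget used, given target and current SLI ratios (0-1)."""
    budget = 1.0 - target  # allowed failure fraction
    if budget <= 0:
        raise ValueError("target must be below 1.0")
    used = max(0.0, 1.0 - current)  # observed failure fraction
    return used / budget * 100.0


def gauge_color(consumed_pct: float) -> str:
    """Map consumed budget to the gauge color: green <70%, yellow 70-90%, red >90%."""
    if consumed_pct > 90:
        return "red"
    if consumed_pct >= 70:
        return "yellow"
    return "green"


# Example: a 99.9% target with 99.92% measured availability.
pct = budget_consumed_pct(target=0.999, current=0.9992)
print(f"{pct:.0f}% consumed -> {gauge_color(pct)}")
```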

Burn Rate Thresholds (Google SRE):

Window	Threshold	Meaning
1h	14.4x	Consumes 100% of budget in about 2 days (30 / 14.4 ≈ 2.1)
6h	6.0x	Consumes 100% of budget in 5 days (confirmation)
24h	3.0x	Consumes 100% of budget in 10 days
72h	1.0x	Consumes 100% of budget in 30 days
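The arithmetic behind these thresholds can be checked with a short sketch, assuming a 30-day SLO window. A burn rate of N means the budget is spent N times faster than the rate that would exhaust it exactly at the end of the window. The function names are illustrative only.

```python
def days_to_exhaustion(burn_rate: float, slo_window_days: float = 30.0) -> float:
    """Days until the whole error budget is gone at a constant burn rate."""
    if burn_rate <= 0:
        raise ValueError("burn_rate must be positive")
    return slo_window_days / burn_rate


def budget_fraction_consumed(burn_rate: float, window_hours: float,
                             slo_window_days: float = 30.0) -> float:
    """Fraction of the total budget consumed during one alerting window."""
    return burn_rate * window_hours / (slo_window_days * 24.0)


# One row per (window in hours, threshold) pair from the table.
for window_hours, rate in [(1, 14.4), (6, 6.0), (24, 3.0), (72, 1.0)]:
    days = days_to_exhaustion(rate)
    frac = budget_fraction_consumed(rate, window_hours)
    print(f"{window_hours:>2}h @ {rate}x: {days:.1f} days to exhaustion, "
          f"{frac:.0%} of budget per window")
```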

4. Approvals

List of pending approvals with approve/reject actions.

Features:

Feature	Description
Pending List	Pending approvals at top, with visual highlighting
Context	Each approval shows: associated incident, proposed action, severity, AI confidence
Blast Radius	Risk level badge (CRITICAL/HIGH/MEDIUM/LOW) from blast radius prediction, showing potential impact before approval
Approve	Green button. Opens modal with required approver name field and optional reason field.
Reject	Red button. Opens modal with required name and reason fields.
History	Tab to view historical approvals (approved/rejected/expired)
Expiration	Countdown timer showing remaining time before expiration

5. AI Insights

View all AI-generated analyses to understand how the AI reasoned about each incident.

Features:

Feature	Description
Filter	Filter by incident name to see insights for a specific issue
Table	Columns: Incident, Provider, Model, Confidence, Recommendations, Actions, Generated
Confidence	Color-coded confidence score: green (≥85%), yellow (70-84%), red (<70%)
Expansion	Click row to expand: full AI analysis text, recommendations list, suggested actions with parameters
Log Analysis	Expanded view shows structured log findings (stack traces, error patterns)
Cascade Analysis	Shows cross-service cascade chain when detected
GitOps Context	Displays Helm/ArgoCD/Flux status at the time of analysis
Blast Radius	Shows predicted impact of suggested remediation actions

This view is essential when an incident is escalated to human action: it shows exactly what the AI found, why it recommended specific actions, and what enrichment data informed its analysis.

API endpoint: GET /api/v1/aiinsights

6. Remediations

Track all remediation plans with execution details, both runbook-based and agentic.

Features:

Feature	Description
Filters	State dropdown (Pending/Executing/Verifying/Completed/Failed/RolledBack), incident name filter
Table	Columns: Name, Incident, Attempt, State, Mode (Runbook/Agentic), Actions/Steps, Started, Duration
Mode indicator	Agentic mode highlighted in red accent; Runbook mode in default text
Expansion	Click row to expand: strategy description, planned actions list, result
Agentic details	For agentic plans: step count shown in table, full conversation in detail view
Duration	Auto-calculated from start to completion time
State badges	Same color scheme as incidents: Completed (green), Executing (yellow), Failed (red), RolledBack (orange)

Remediation modes explained:

Runbook mode: Displays the pre-defined action sequence from the matched runbook
Agentic mode: Shows step count in the table; use the Get Remediation Plan API for the full AI conversation history

API endpoint: GET /api/v1/remediations

7. Runbooks

View all runbooks, both manually created and AI-generated from successful remediations.

Features:

Feature	Description
Table	Columns: Name, Signal Type, Severity, Resource Kind, Steps, Max Attempts, Created
Signal badge	Color-coded badge showing the trigger signal type (oom_kill, pod_not_ready, deploy_failing, etc.)
Expansion	Click row to expand: full description, trigger match criteria, and ordered step list
Step details	Each step shows: action type badge, description, and parameters as JSON
Auto-generated	Runbooks are automatically created when the AI successfully remediates an incident — they capture the winning strategy for reuse

Runbooks serve as the AI’s “institutional memory”: when a similar incident occurs in the future, the platform matches it to an existing runbook instead of starting from scratch, which shortens MTTR.

API endpoint: GET /api/v1/runbooks

8. PostMortems

List of post-mortems with expandable details.

Features:

Feature	Description
List	All post-mortems with state (open/in_review/closed), associated incident, duration
Expansion	Click to expand: complete timeline, root cause, impact, executed actions
Lessons Learned	Section with lessons learned (AI-generated)
Prevention Actions	Checklist of suggested preventive actions
Developer Feedback	Inline form for the developer to rate the remediation (1-5 stars), override root cause, and add comments. Once submitted, displays the feedback with visual rating
Review	Button to mark as “in review”
Close	Button to close the post-mortem after review
Source	Badge indicating if generated by `agentic` or `standard` remediation

9. Clusters

Cards of monitored clusters with health status and federation overview.

Federation Panel:

Component	Description
Federation Status	Connected/disconnected cluster counts and total active issues across federation
Cross-Cluster Correlations	Issues correlated across clusters with severity badge, signal type, CASCADE/ELEVATED flags, and correlated cluster names

Components per cluster:

Component	Description
Card	Cluster name, provider (EKS/GKE/AKS), K8s version
Status	Badge: `healthy` (green), `degraded` (yellow), `unreachable` (red)
Metrics	Number of nodes, active incidents, monitored namespaces
Resources	CPU and memory usage bars (capacity vs. usage)
Targets	List of watcher targets with alert counters per namespace
Last Sync	Last synchronization timestamp with freshness indicator

API endpoints: GET /api/v1/federation/status, GET /api/v1/federation/correlations

10. Audit

Searchable audit log with export.

Features:

Feature	Description
Search	Text field to search in type, resource, actor, description
Filters	Event type, severity, period
Table	Columns: Timestamp, Type, Severity, Actor, Resource, Description
Severity	Icons and colors: info (blue), warning (yellow), critical (red)
Export	Button to export in JSON or CSV (requires `admin` role)
Pagination	50 items per page
Auto-scroll	New events appear at the top with highlight animation

Grafana Dashboards

The AIOps platform includes 4 pre-configured Grafana dashboards in JSON format, ready for import.

1. AIOps Overview (`aiops-overview.json`)

Main dashboard with operational overview. Panels:

Panel	Type	Description
Active Issues	Stat	Number of unresolved issues (color thresholds: green <5, yellow 5-15, red >15)
MTTR	Stat	Mean time to resolution in minutes
Success Rate	Gauge	Remediation success rate (0-100%)
Issues by Severity	Pie Chart	Issue distribution by severity (Critical/High/Medium/Low)
Issues by State	Bar Chart	Issue count by state (Detected/Analyzing/Remediating/Resolved/Escalated)
Remediation Actions	Time Series	Remediation actions over time, separated by type (Restart/Scale/Rollback/Adjust/Delete/Patch)
Resolution Duration	Histogram	Resolution time distribution with buckets of 1min, 2min, 5min, 10min, 30min
Issues Over Time	Time Series	Incidents created vs. resolved per hour

Template variables:

Variable	Type	Values
`namespace`	Query	All namespaces with issues
`severity`	Custom	All, Critical, High, Medium, Low
`interval`	Interval	1m, 5m, 15m, 1h

2. SLO Burn Rate (`slo-burn-rate.json`)

Dashboard dedicated to SLOs following the Google SRE model. Panels:

Panel	Type	Description
Error Budget Gauge	Gauge	Remaining error budget percentage per SLO. Thresholds: green >30%, yellow 10-30%, red <10%
Burn Rate 1h	Time Series	Burn rate in the 1-hour window with threshold line at 14.4x
Burn Rate 6h	Time Series	Burn rate in the 6-hour window with threshold line at 6.0x
Burn Rate 24h	Time Series	Burn rate in the 24-hour window with threshold line at 3.0x
Burn Rate 72h	Time Series	Burn rate in the 72-hour window with threshold line at 1.0x
SLA Compliance	Stat	Current compliance percentage per SLO
SLA Violations	Table	Violation list with timestamp, affected SLO, duration, and budget impact
Budget Consumption Over Time	Time Series	Cumulative error budget consumption over the 30-day window

Threshold lines (annotations): Each burn rate chart includes a dashed red horizontal line at the corresponding threshold (Google SRE multi-window, multi-burn-rate alerting model).

3. Incident Timeline (`incident-timeline.json`)

Dashboard focused on the temporal flow of incidents and notifications. Panels:

Panel	Type	Description
Critical Incidents	Stat	Count of active critical incidents (pulsing red if >0)
Escalated Incidents	Stat	Count of escalated incidents
Resolved Today	Stat	Incidents resolved in the last 24h
Incident Timeline	Timeline/Annotations	Temporal visualization of incidents with severity colors
Notifications by Channel	Bar Chart	Notification count by channel (Slack/PagerDuty/Webhook/Email)
Approval Decisions	Pie Chart	Approval decision distribution (Approved/Rejected/Expired)
Federation Status	Table	Federated cluster status with last sync, active incidents, and health
MTTD Over Time	Time Series	Mean Time to Detect over time

4. Remediation Stats (`remediation-stats.json`)

Detailed dashboard on remediation performance. Panels:

Panel	Type	Description
Success Rate Gauge	Gauge	Overall remediation success rate. Thresholds: green >90%, yellow 75-90%, red <75%
Actions by Type	Bar Chart (horizontal)	Total actions executed grouped by type (RestartDeployment, ScaleDeployment, etc.)
Actions by Result	Stacked Bar	Actions by result (success/failed) over time
Duration p50	Stat	Median remediation time
Duration p90	Stat	90th percentile remediation time
Duration p99	Stat	99th percentile remediation time
Duration Distribution	Histogram	Remediation time distribution with buckets
Operator Reconciliation	Time Series	Reconciliation count per controller (Issue/Anomaly/AIInsight/Remediation)
Reconciliation Errors	Time Series	Reconciliation errors per controller
Reconciliation Duration	Heatmap	Reconciliation duration per controller (detects bottlenecks)

Grafana Dashboard Installation

Via Grafana Sidecar (Recommended)

If you use the Grafana Helm chart with sidecar enabled, create ConfigMaps with the label grafana_dashboard: "1":

# Create ConfigMaps for each dashboard
kubectl create configmap grafana-aiops-overview \
  --from-file=aiops-overview.json=operator/dashboards/aiops-overview.json \
  -n monitoring

kubectl create configmap grafana-slo-burn-rate \
  --from-file=slo-burn-rate.json=operator/dashboards/slo-burn-rate.json \
  -n monitoring

kubectl create configmap grafana-incident-timeline \
  --from-file=incident-timeline.json=operator/dashboards/incident-timeline.json \
  -n monitoring

kubectl create configmap grafana-remediation-stats \
  --from-file=remediation-stats.json=operator/dashboards/remediation-stats.json \
  -n monitoring

# Add label for sidecar discovery
kubectl label configmap grafana-aiops-overview grafana_dashboard=1 -n monitoring
kubectl label configmap grafana-slo-burn-rate grafana_dashboard=1 -n monitoring
kubectl label configmap grafana-incident-timeline grafana_dashboard=1 -n monitoring
kubectl label configmap grafana-remediation-stats grafana_dashboard=1 -n monitoring

The Grafana sidecar automatically detects ConfigMaps with the label grafana_dashboard: "1" and imports the dashboards without restart.
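For a declarative alternative to the create-then-label commands, the sketch below builds a ConfigMap manifest that already carries the `grafana_dashboard: "1"` label. The `dashboard_configmap` helper is hypothetical, and the dashboard body is a placeholder rather than the real contents of `aiops-overview.json`.

```python
import json


def dashboard_configmap(name: str, filename: str, dashboard_json: str,
                        namespace: str = "monitoring") -> dict:
    """Build a ConfigMap manifest that the Grafana sidecar can discover by label."""
    return {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {
            "name": name,
            "namespace": namespace,
            # Label values must be strings, hence "1" rather than 1.
            "labels": {"grafana_dashboard": "1"},
        },
        "data": {filename: dashboard_json},
    }


# Placeholder dashboard body; in practice, read operator/dashboards/aiops-overview.json.
manifest = dashboard_configmap(
    "grafana-aiops-overview", "aiops-overview.json", '{"title": "AIOps Overview"}'
)
print(json.dumps(manifest, indent=2))
```

Because kubectl accepts JSON manifests, the printed output can be piped to `kubectl apply -f -`, which creates and labels the ConfigMap in a single step.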

Via Manual Import

1. Go to Grafana > Dashboards > Import
2. Upload the JSON file or paste the content
3. Select the Prometheus datasource
4. Click Import

ServiceMonitor for Prometheus Operator

Configure metrics scraping for the operator:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: chatcli-operator
  namespace: chatcli-system
  labels:
    app.kubernetes.io/name: chatcli-operator
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: chatcli-operator
  endpoints:
    - port: metrics
      interval: 30s
      path: /metrics
  namespaceSelector:
    matchNames:
      - chatcli-system

Prometheus Metrics Reference

The operator exposes the following Prometheus metrics to feed the Grafana dashboards:

Metric	Type	Labels	Description
`chatcli_operator_issues_total`	Counter	`severity`, `state`	Total issues created by severity and state
`chatcli_operator_active_issues`	Gauge	`namespace`	Number of active (unresolved) issues
`chatcli_operator_issue_resolution_duration_seconds`	Histogram	`severity`	Duration from detection to resolution
`chatcli_operator_remediation_actions_total`	Counter	`type`, `result`	Total remediation actions by type and result
`chatcli_operator_remediation_duration_seconds`	Histogram	`type`	Remediation action duration by type
`chatcli_operator_anomalies_total`	Counter	`signal_type`, `namespace`	Total anomalies detected
`chatcli_operator_anomalies_suppressed_total`	Counter	`strategy`	Anomalies suppressed by Noise Reducer per strategy
`chatcli_operator_ai_analysis_duration_seconds`	Histogram	`provider`, `model`	AI analysis duration
`chatcli_operator_ai_analysis_confidence`	Histogram	`provider`	Analysis confidence distribution
`chatcli_operator_ai_tokens_total`	Counter	`provider`, `direction`	Total tokens consumed (direction: input/output)
`chatcli_operator_ai_cost_dollars`	Counter	`provider`	Accumulated cost in dollars per provider
`chatcli_operator_slo_current_ratio`	Gauge	`slo_name`, `service`	Current SLO ratio (0-1)
`chatcli_operator_slo_error_budget_remaining`	Gauge	`slo_name`	Remaining error budget in minutes
`chatcli_operator_slo_burn_rate`	Gauge	`slo_name`, `window`	Burn rate per window (1h/6h/24h/72h)
`chatcli_operator_approvals_total`	Counter	`decision`	Total approval decisions (approved/rejected/expired)
`chatcli_operator_notifications_total`	Counter	`channel`, `result`	Notifications sent per channel and result
`chatcli_operator_postmortems_total`	Counter	`source`	Total post-mortems generated by source (agentic/standard)
`chatcli_operator_reconcile_total`	Counter	`controller`, `result`	Total reconciliations per controller and result
`chatcli_operator_reconcile_duration_seconds`	Histogram	`controller`	Reconciliation duration per controller
`chatcli_operator_reconcile_errors_total`	Counter	`controller`	Total reconciliation errors per controller
`chatcli_operator_cluster_health`	Gauge	`cluster`, `provider`	Cluster health (1=healthy, 0.5=degraded, 0=unreachable)
`chatcli_operator_capacity_cpu_usage_percent`	Gauge	`resource`, `namespace`	Current CPU usage percentage
`chatcli_operator_capacity_memory_usage_percent`	Gauge	`resource`, `namespace`	Current memory usage percentage
`chatcli_operator_capacity_exhaustion_days`	Gauge	`resource`, `namespace`, `type`	Projected days until exhaustion (type: cpu/memory, -1 if stable)

Useful Prometheus Queries

PromQL query examples for dashboards or alerts:

MTTR by severity (last 24h)

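# Median (p50) time to resolution for critical issues over the last 24h.
# For the arithmetic mean, divide rate(..._sum[24h]) by rate(..._count[24h]).
histogram_quantile(0.5,
  sum by (le) (
    rate(chatcli_operator_issue_resolution_duration_seconds_bucket{severity="critical"}[24h])
  )
)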

Remediation success rate

sum(chatcli_operator_remediation_actions_total{result="success"})
/
sum(chatcli_operator_remediation_actions_total)
* 100

SLO burn rate (multi-window alert)

# Alert: burn rate 1h > 14.4x AND burn rate 6h > 6x (Google SRE model)
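# ignoring(window) is required: the two sides differ in the window label and would otherwise never match
chatcli_operator_slo_burn_rate{window="1h"} > 14.4
and ignoring(window)
chatcli_operator_slo_burn_rate{window="6h"} > 6.0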
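The intent of the multi-window condition is that an SLO should alert only when both the 1h and the 6h burn rates exceed their thresholds for that same SLO. The sketch below expresses that logic over a plain dictionary. The SLO names and sample values are made up, and `slos_to_page` is a hypothetical helper, not part of the operator.

```python
# Thresholds per window, taken from the burn rate table (1h and 6h windows).
THRESHOLDS = {"1h": 14.4, "6h": 6.0}


def slos_to_page(samples: dict[tuple[str, str], float]) -> list[str]:
    """Return SLO names whose burn rate exceeds the threshold in every window.

    samples maps (slo_name, window) to the current burn rate.
    """
    names = {slo for slo, _ in samples}
    return sorted(
        slo for slo in names
        if all(samples.get((slo, window), 0.0) > limit
               for window, limit in THRESHOLDS.items())
    )


# Made-up sample data: "search" spikes in the 1h window only, so it is not paged.
samples = {
    ("checkout", "1h"): 20.0, ("checkout", "6h"): 7.0,
    ("search", "1h"): 20.0, ("search", "6h"): 2.0,
}
print(slos_to_page(samples))
```

Requiring the longer window as confirmation filters out short spikes that would not meaningfully deplete the budget.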

Accumulated LLM cost per hour

increase(chatcli_operator_ai_cost_dollars[1h])

Suppressed vs. processed anomalies

# Suppression ratio
sum(rate(chatcli_operator_anomalies_suppressed_total[1h]))
/
(sum(rate(chatcli_operator_anomalies_total[1h])) + sum(rate(chatcli_operator_anomalies_suppressed_total[1h])))
* 100

Resources near exhaustion (less than 7 days)

chatcli_operator_capacity_exhaustion_days > 0
and
chatcli_operator_capacity_exhaustion_days < 7

Accessing the Dashboard

Verify the operator

Confirm the operator is running:

kubectl get pods -n chatcli-system -l app.kubernetes.io/name=chatcli-operator

Port-forward (development)

For local access during development:

kubectl port-forward -n chatcli-system svc/chatcli-operator 8090:8090

Access: http://localhost:8090/

Ingress (production)

For production access, configure an Ingress:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: chatcli-dashboard
  namespace: chatcli-system
  annotations:
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
spec:
  tls:
    - hosts: ["aiops.company.com"]
      secretName: aiops-tls
  rules:
    - host: aiops.company.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: chatcli-operator
                port:
                  number: 8090

Configure API Key

Configure at least one API key before exposing the dashboard externally:

kubectl create configmap chatcli-api-keys -n chatcli-system \
  --from-literal=keys.json='{"keys":[{"key":"your-secure-key","role":"admin","description":"Dashboard admin"}]}'

Never expose the dashboard without authentication (API key) in production. Dev mode (without keys) allows unrestricted access, including write operations such as acknowledge, approve, and reject.
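Rather than typing a key by hand, the `keys.json` payload can be generated with a random key. The sketch below reproduces the JSON structure from the command above. The `build_keys_json` helper and the 32-byte key length are assumptions, not requirements of the operator.

```python
import json
import secrets


def build_keys_json(role: str, description: str, nbytes: int = 32) -> str:
    """Return a keys.json payload containing one randomly generated API key."""
    key = secrets.token_urlsafe(nbytes)  # URL-safe random text from the OS CSPRNG
    return json.dumps(
        {"keys": [{"key": key, "role": role, "description": description}]}
    )


payload = build_keys_json("admin", "Dashboard admin")
print(payload)
```

The printed string can be passed as the value of `--from-literal=keys.json=...` in place of the hand-written JSON.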

Next Steps

REST API Reference

Complete reference of all endpoints consumed by the dashboard.

Capacity & Costs

Details on the Capacity Planner, Noise Reducer, and Cost Tracker.

AIOps Platform

Complete architecture of the autonomous operations pipeline.

K8s Operator

Kubernetes operator configuration and deployment.