Web Dashboard
Overview
The Web Dashboard is a single-page application embedded directly in the operator binary via Go embed.FS. It does not require Node.js, npm, or any separate frontend build.
| Characteristic | Detail |
|---|---|
| Technology | HTML/CSS/JS vanilla (zero external dependencies) |
| Packaging | Go embed.FS — compiled into the binary |
| Port | 8090 (configurable via CHATCLI_AIOPS_PORT) |
| Theme | Dark theme (dark background, light text) |
| Responsive | Adapts to desktop, tablet, and mobile |
| Auto-refresh | Configurable interval: Off, 10s, 30s (default), 1m, 2m, 5m — with manual refresh button |
| Sortable tables | Click any column header to sort (▲ ascending / ▼ descending) |
| Stable ordering | Lists maintain consistent order between refreshes (incidents/audit by timestamp, SLOs by name) |
| Authentication | Same API key as the REST API (header X-API-Key) |
| URL | http://<operator-host>:8090/ |
The dashboard consumes the same REST API documented in API Reference. All operations available in the dashboard (acknowledge, snooze, approve, reject) are authenticated REST calls.
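Because the dashboard and external clients share the same REST API and API key, a dashboard operation can be reproduced from a script. The sketch below builds an authenticated request using only Python's standard library. The host, the key value, and the function names are illustrative assumptions, and the response body is assumed to be JSON.

```python
import json
import urllib.request

# Assumed values for illustration; replace with your operator host and key.
BASE_URL = "http://localhost:8090"
API_KEY = "changeme"


def build_request(path: str, api_key: str = API_KEY, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build a GET request carrying the X-API-Key header the dashboard uses."""
    return urllib.request.Request(
        url=base_url.rstrip("/") + path,
        headers={"X-API-Key": api_key, "Accept": "application/json"},
        method="GET",
    )


def fetch_json(path: str):
    """Send the request and decode the body (assumes a JSON response)."""
    with urllib.request.urlopen(build_request(path), timeout=10) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Against a reachable operator, `fetch_json("/api/v1/runbooks")` would return the decoded response of one of the endpoints the dashboard itself calls.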
Architecture
Dashboard Views
The dashboard has 10 views, accessible via tab navigation.
1. Overview
Platform overview with aggregated metrics. The Overview provides comprehensive situational awareness with compliance, capacity, and remediation effectiveness metrics.
Components:
| Component | Description |
|---|---|
| Stats Cards | 6 cards: Active Issues, Resolved, Remediations (success/total), Success Rate, PostMortems, Pending Approvals |
| Compliance & SLA | Real-time compliance percentage, MTTD, MTTR, SLA response/resolution violations, approval stats |
| Capacity Warnings | Alert banner showing resources at risk of exhaustion (urgent/plan), with recommendations |
| Remediation by Strategy | Horizontal bar chart showing success rate per action type (e.g., RestartDeployment 93%, RollbackDeployment 80%) |
| Pie Chart (Severity) | Incident distribution by severity (Critical/High/Medium/Low) |
| Recent Incidents | List of the 10 most recent incidents with state, severity, and timestamp |
| Timeline | Temporal chart of incidents in the last 24h (stacked bars by severity) |
2. Incidents
Interactive table of all incidents with filters and actions.
Features:
| Feature | Description |
|---|---|
| Filters | Severity, state, namespace, period (dropdowns at top) |
| Table | Columns: Name, Severity, State, Resource, Namespace, Detected at, Duration |
| Sorting | Click column header to sort (asc/desc) |
| Expansion | Click row to expand details: description, resource, signal type, remediation attempts, resolution |
| AI Insight Preview | Expanded rows automatically load the AI analysis inline with confidence badge, provider/model, analysis summary (first 500 chars), and top 3 recommendations |
| Acknowledge | Button to acknowledge incident (requires operator role) |
| Snooze | Button with duration selector (30m, 1h, 2h, 4h, 24h) |
| Pagination | Page navigation with 20 items per page |
Severity badges:
| Severity | Color |
|---|---|
| Critical | Red (#ef4444) |
| High | Orange (#f97316) |
| Medium | Yellow (#eab308) |
| Low | Blue (#3b82f6) |
State badges:
| State | Color |
|---|---|
| Detected | Gray (#6b7280) |
| Analyzing | Blue (#3b82f6) |
| Remediating | Yellow (#eab308) |
| Resolved | Green (#22c55e) |
| Escalated | Red (#ef4444) |
3. SLOs
SLO cards with visual indicators of error budget and burn rate.
Components per SLO:
| Component | Description |
|---|---|
| Card | SLO name, service, SLI type, target vs. current |
| Error Budget Gauge | Circular progress bar showing % of error budget consumed. Green (<70%), Yellow (70-90%), Red (>90%) |
| Burn Rate Chips | 4 chips: 1h, 6h, 24h, 72h. Color indicates if burn rate exceeds threshold (Google SRE model) |
| State | Badge: healthy (green), at_risk (yellow), breached (red) |
| History | Sparkline of last 7 days of compliance |
Burn Rate Thresholds (Google SRE):
| Window | Threshold | Meaning |
|---|---|---|
| 1h | 14.4x | Consumes 100% of budget in about 2 days (30 days / 14.4) |
| 6h | 6.0x | Consumes 100% of budget in 5 days (confirmation) |
| 24h | 3.0x | Consumes 100% of budget in 10 days |
| 72h | 1.0x | Consumes 100% of budget in 30 days |
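The "Meaning" column follows from simple arithmetic: at a constant burn rate b, a budget sized for a window of W days is exhausted after W / b days. The sketch below assumes a 30-day SLO window (which is what the 1.0x row implies); the function names are illustrative and not part of the operator.

```python
# Burn-rate thresholds per window, as listed in the table above.
THRESHOLDS = {"1h": 14.4, "6h": 6.0, "24h": 3.0, "72h": 1.0}

SLO_WINDOW_DAYS = 30  # assumption: 30-day SLO window


def days_to_exhaustion(burn_rate: float, window_days: float = SLO_WINDOW_DAYS) -> float:
    """Days until 100% of the error budget is consumed at a constant burn rate."""
    if burn_rate <= 0:
        return float("inf")  # the budget is not being consumed
    return window_days / burn_rate


def exceeded_windows(burn_rates: dict) -> list:
    """Return the windows whose observed burn rate is above its threshold."""
    return [w for w, limit in THRESHOLDS.items() if burn_rates.get(w, 0.0) > limit]


for window, limit in THRESHOLDS.items():
    print(f"{window}: {limit}x -> {days_to_exhaustion(limit):.2f} days")
```

Under the 30-day assumption this prints 2.08, 5.00, 10.00, and 30.00 days for the four windows. `exceeded_windows({"1h": 20.0, "6h": 2.0})` reports only the 1h window, which is why a second, longer window is typically used as confirmation before alerting.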
4. Approvals
List of pending approvals with approve/reject actions.
Features:
| Feature | Description |
|---|---|
| Pending List | Pending approvals at top, with visual highlighting |
| Context | Each approval shows: associated incident, proposed action, severity, AI confidence |
| Blast Radius | Risk level badge (CRITICAL/HIGH/MEDIUM/LOW) from blast radius prediction, showing potential impact before approval |
| Approve | Green button. Opens modal with required approver name field and optional reason field. |
| Reject | Red button. Opens modal with required name and reason fields. |
| History | Tab to view historical approvals (approved/rejected/expired) |
| Expiration | Countdown timer showing remaining time before expiration |
5. AI Insights
View all AI-generated analyses to understand how the AI reasoned about each incident.
Features:
| Feature | Description |
|---|---|
| Filter | Filter by incident name to see insights for a specific issue |
| Table | Columns: Incident, Provider, Model, Confidence, Recommendations, Actions, Generated |
| Confidence | Color-coded confidence score: green (≥85%), yellow (70-84%), red (<70%) |
| Expansion | Click row to expand: full AI analysis text, recommendations list, suggested actions with parameters |
| Log Analysis | Expanded view shows structured log findings (stack traces, error patterns) |
| Cascade Analysis | Shows cross-service cascade chain when detected |
| GitOps Context | Displays Helm/ArgoCD/Flux status at the time of analysis |
| Blast Radius | Shows predicted impact of suggested remediation actions |
This view is essential when an incident is escalated to human action: it shows exactly what the AI found, why it recommended specific actions, and what enrichment data informed its analysis.
API endpoint:
GET /api/v1/aiinsights
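The confidence color bands listed in the Features table can be expressed as a small helper. This is a sketch of the documented thresholds only; the function name is hypothetical, and how the dashboard treats fractional scores between 84% and 85% is an assumption (here they fall into yellow).

```python
def confidence_color(score_percent: float) -> str:
    """Map a confidence score (0-100) to the badge color described in the table."""
    if score_percent >= 85:
        return "green"
    if score_percent >= 70:
        return "yellow"
    return "red"
```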
6. Remediations
Track all remediation plans with execution details, both runbook-based and agentic.
Features:
| Feature | Description |
|---|---|
| Filters | State dropdown (Pending/Executing/Verifying/Completed/Failed/RolledBack), incident name filter |
| Table | Columns: Name, Incident, Attempt, State, Mode (Runbook/Agentic), Actions/Steps, Started, Duration |
| Mode indicator | Agentic mode highlighted in red accent; Runbook mode in default text |
| Expansion | Click row to expand: strategy description, planned actions list, result |
| Agentic details | For agentic plans: step count shown in table, full conversation in detail view |
| Duration | Auto-calculated from start to completion time |
| State badges | Same color scheme as incidents: Completed (green), Executing (yellow), Failed (red), RolledBack (orange) |
Remediation modes explained:
- Runbook mode: Displays the pre-defined action sequence from the matched runbook
- Agentic mode: Shows step count in the table; use the Get Remediation Plan API for the full AI conversation history
API endpoint:
GET /api/v1/remediations
7. Runbooks
View all runbooks, both manually created and AI-generated from successful remediations.
Features:
| Feature | Description |
|---|---|
| Table | Columns: Name, Signal Type, Severity, Resource Kind, Steps, Max Attempts, Created |
| Signal badge | Color-coded badge showing the trigger signal type (oom_kill, pod_not_ready, deploy_failing, etc.) |
| Expansion | Click row to expand: full description, trigger match criteria, and ordered step list |
| Step details | Each step shows: action type badge, description, and parameters as JSON |
| Auto-generated | Runbooks are automatically created when the AI successfully remediates an incident; they capture the winning strategy for reuse |
Runbooks serve as the AI’s “institutional memory”: when a similar incident occurs in the future, the platform matches it to an existing runbook instead of starting from scratch, significantly reducing MTTR.
API endpoint:
GET /api/v1/runbooks
8. PostMortems
List of post-mortems with expandable details.
Features:
| Feature | Description |
|---|---|
| List | All post-mortems with state (open/in_review/closed), associated incident, duration |
| Expansion | Click to expand: complete timeline, root cause, impact, executed actions |
| Lessons Learned | Section with lessons learned (AI-generated) |
| Prevention Actions | Checklist of suggested preventive actions |
| Developer Feedback | Inline form for the developer to rate the remediation (1-5 stars), override root cause, and add comments. Once submitted, displays the feedback with visual rating |
| Review | Button to mark as “in review” |
| Close | Button to close the post-mortem after review |
| Source | Badge indicating if generated by agentic or standard remediation |
9. Clusters
Cards of monitored clusters with health status and federation overview.
Federation Panel:
| Component | Description |
|---|---|
| Federation Status | Connected/disconnected cluster counts and total active issues across federation |
| Cross-Cluster Correlations | Issues correlated across clusters with severity badge, signal type, CASCADE/ELEVATED flags, and correlated cluster names |
Components per cluster:
| Component | Description |
|---|---|
| Card | Cluster name, provider (EKS/GKE/AKS), K8s version |
| Status | Badge: healthy (green), degraded (yellow), unreachable (red) |
| Metrics | Number of nodes, active incidents, monitored namespaces |
| Resources | CPU and memory usage bars (capacity vs. usage) |
| Targets | List of watcher targets with alert counters per namespace |
| Last Sync | Last synchronization timestamp with freshness indicator |
API endpoints:
GET /api/v1/federation/status, GET /api/v1/federation/correlations
10. Audit
Searchable audit log with export.
Features:
| Feature | Description |
|---|---|
| Search | Text field to search in type, resource, actor, description |
| Filters | Event type, severity, period |
| Table | Columns: Timestamp, Type, Severity, Actor, Resource, Description |
| Severity | Icons and colors: info (blue), warning (yellow), critical (red) |
| Export | Button to export in JSON or CSV (requires admin role) |
| Pagination | 50 items per page |
| Auto-scroll | New events appear at the top with highlight animation |
Grafana Dashboards
The AIOps platform includes 4 pre-configured Grafana dashboards in JSON format, ready for import.
1. AIOps Overview (aiops-overview.json)
Main dashboard with operational overview.
Panels:
| Panel | Type | Description |
|---|---|---|
| Active Issues | Stat | Number of unresolved issues (gauge with thresholds: green <5, yellow 5-15, red >15) |
| MTTR | Stat | Mean time to resolution in minutes |
| Success Rate | Gauge | Remediation success rate (0-100%) |
| Issues by Severity | Pie Chart | Issue distribution by severity (Critical/High/Medium/Low) |
| Issues by State | Bar Chart | Issue count by state (Detected/Analyzing/Remediating/Resolved/Escalated) |
| Remediation Actions | Time Series | Remediation actions over time, separated by type (Restart/Scale/Rollback/Adjust/Delete/Patch) |
| Resolution Duration | Histogram | Resolution time distribution with buckets of 1min, 2min, 5min, 10min, 30min |
| Issues Over Time | Time Series | Incidents created vs. resolved per hour |
Variables:
| Variable | Type | Values |
|---|---|---|
| namespace | Query | All namespaces with issues |
| severity | Custom | All, Critical, High, Medium, Low |
| interval | Interval | 1m, 5m, 15m, 1h |
2. SLO Burn Rate (slo-burn-rate.json)
Dashboard dedicated to SLOs following the Google SRE model.
Panels:
| Panel | Type | Description |
|---|---|---|
| Error Budget Gauge | Gauge | Remaining error budget percentage per SLO. Thresholds: green >30%, yellow 10-30%, red <10% |
| Burn Rate 1h | Time Series | Burn rate in the 1-hour window with threshold line at 14.4x |
| Burn Rate 6h | Time Series | Burn rate in the 6-hour window with threshold line at 6.0x |
| Burn Rate 24h | Time Series | Burn rate in the 24-hour window with threshold line at 3.0x |
| Burn Rate 72h | Time Series | Burn rate in the 72-hour window with threshold line at 1.0x |
| SLA Compliance | Stat | Current compliance percentage per SLO |
| SLA Violations | Table | Violation list with timestamp, affected SLO, duration, and budget impact |
| Budget Consumption Over Time | Time Series | Cumulative error budget consumption over the 30-day window |
3. Incident Timeline (incident-timeline.json)
Dashboard focused on the temporal flow of incidents and notifications.
Panels:
| Panel | Type | Description |
|---|---|---|
| Critical Incidents | Stat | Count of active critical incidents (pulsing red if >0) |
| Escalated Incidents | Stat | Count of escalated incidents |
| Resolved Today | Stat | Incidents resolved in the last 24h |
| Incident Timeline | Timeline/Annotations | Temporal visualization of incidents with severity colors |
| Notifications by Channel | Bar Chart | Notification count by channel (Slack/PagerDuty/Webhook/Email) |
| Approval Decisions | Pie Chart | Approval decision distribution (Approved/Rejected/Expired) |
| Federation Status | Table | Federated cluster status with last sync, active incidents, and health |
| MTTD Over Time | Time Series | Mean Time to Detect over time |
4. Remediation Stats (remediation-stats.json)
Detailed dashboard on remediation performance.
Panels:
| Panel | Type | Description |
|---|---|---|
| Success Rate Gauge | Gauge | Overall remediation success rate. Thresholds: green >90%, yellow 75-90%, red <75% |
| Actions by Type | Bar Chart (horizontal) | Total actions executed grouped by type (RestartDeployment, ScaleDeployment, etc.) |
| Actions by Result | Stacked Bar | Actions by result (success/failed) over time |
| Duration p50 | Stat | Median remediation time |
| Duration p90 | Stat | 90th percentile remediation time |
| Duration p99 | Stat | 99th percentile remediation time |
| Duration Distribution | Histogram | Remediation time distribution with buckets |
| Operator Reconciliation | Time Series | Reconciliation count per controller (Issue/Anomaly/AIInsight/Remediation) |
| Reconciliation Errors | Time Series | Reconciliation errors per controller |
| Reconciliation Duration | Heatmap | Reconciliation duration per controller (detects bottlenecks) |
Grafana Dashboard Installation
Via Grafana Sidecar (Recommended)
If you use the Grafana Helm chart with sidecar enabled, create ConfigMaps with the label grafana_dashboard: "1":
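One way to produce such a ConfigMap without hand-writing YAML is to wrap the dashboard JSON in a manifest programmatically. The sketch below is illustrative: the ConfigMap name, the monitoring namespace, and the stand-in dashboard body are hypothetical, and only the grafana_dashboard: "1" label comes from the text above. kubectl accepts JSON manifests, so the printed output can be piped to kubectl apply -f -.

```python
import json


def dashboard_configmap(name: str, namespace: str, filename: str, dashboard_json: str) -> dict:
    """Wrap one Grafana dashboard JSON document in a ConfigMap manifest."""
    json.loads(dashboard_json)  # fail early if the dashboard is not valid JSON
    return {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {
            "name": name,
            "namespace": namespace,
            # Label the Grafana sidecar watches for.
            "labels": {"grafana_dashboard": "1"},
        },
        "data": {filename: dashboard_json},
    }


# Hypothetical minimal dashboard body standing in for aiops-overview.json.
example = dashboard_configmap(
    name="aiops-overview-dashboard",
    namespace="monitoring",
    filename="aiops-overview.json",
    dashboard_json='{"title": "AIOps Overview", "panels": []}',
)
print(json.dumps(example, indent=2))
```

In practice the dashboard_json argument would be the contents of one of the four shipped JSON files, read from disk.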
The Grafana sidecar automatically detects ConfigMaps with the label grafana_dashboard: "1" and imports the dashboards without restart.
Via Manual Import
- Go to Grafana > Dashboards > Import
- Upload the JSON file or paste the content
- Select the Prometheus datasource
- Click Import
ServiceMonitor for Prometheus Operator
Configure metrics scraping for the operator:
Prometheus Metrics Reference
The operator exposes the following Prometheus metrics to feed the Grafana dashboards:
| Metric | Type | Labels | Description |
|---|---|---|---|
| chatcli_operator_issues_total | Counter | severity, state | Total issues created by severity and state |
| chatcli_operator_active_issues | Gauge | namespace | Number of active (unresolved) issues |
| chatcli_operator_issue_resolution_duration_seconds | Histogram | severity | Duration from detection to resolution |
| chatcli_operator_remediation_actions_total | Counter | type, result | Total remediation actions by type and result |
| chatcli_operator_remediation_duration_seconds | Histogram | type | Remediation action duration by type |
| chatcli_operator_anomalies_total | Counter | signal_type, namespace | Total anomalies detected |
| chatcli_operator_anomalies_suppressed_total | Counter | strategy | Anomalies suppressed by Noise Reducer per strategy |
| chatcli_operator_ai_analysis_duration_seconds | Histogram | provider, model | AI analysis duration |
| chatcli_operator_ai_analysis_confidence | Histogram | provider | Analysis confidence distribution |
| chatcli_operator_ai_tokens_total | Counter | provider, direction | Total tokens consumed (direction: input/output) |
| chatcli_operator_ai_cost_dollars | Counter | provider | Accumulated cost in dollars per provider |
| chatcli_operator_slo_current_ratio | Gauge | slo_name, service | Current SLO ratio (0-1) |
| chatcli_operator_slo_error_budget_remaining | Gauge | slo_name | Remaining error budget in minutes |
| chatcli_operator_slo_burn_rate | Gauge | slo_name, window | Burn rate per window (1h/6h/24h/72h) |
| chatcli_operator_approvals_total | Counter | decision | Total approval decisions (approved/rejected/expired) |
| chatcli_operator_notifications_total | Counter | channel, result | Notifications sent per channel and result |
| chatcli_operator_postmortems_total | Counter | source | Total post-mortems generated by source (agentic/standard) |
| chatcli_operator_reconcile_total | Counter | controller, result | Total reconciliations per controller and result |
| chatcli_operator_reconcile_duration_seconds | Histogram | controller | Reconciliation duration per controller |
| chatcli_operator_reconcile_errors_total | Counter | controller | Total reconciliation errors per controller |
| chatcli_operator_cluster_health | Gauge | cluster, provider | Cluster health (1=healthy, 0.5=degraded, 0=unreachable) |
| chatcli_operator_capacity_cpu_usage_percent | Gauge | resource, namespace | Current CPU usage percentage |
| chatcli_operator_capacity_memory_usage_percent | Gauge | resource, namespace | Current memory usage percentage |
| chatcli_operator_capacity_exhaustion_days | Gauge | resource, namespace, type | Projected days until exhaustion (type: cpu/memory, -1 if stable) |
Useful Prometheus Queries
PromQL query examples for dashboards or alerts:
MTTR by severity (last 24h)
Remediation success rate
SLO burn rate (multi-window alert)
Accumulated LLM cost per hour
Suppressed vs. processed anomalies
Resources near exhaustion (less than 7 days)
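The expressions below are a hedged sketch of the kind of query each heading describes, built only from the metric and label names in the Prometheus Metrics Reference. Label values such as result="success" and window="1h", the range windows, and the helper name query_url are assumptions; check them against the series your operator actually exposes. The helper builds a URL for the Prometheus HTTP API instant-query endpoint (/api/v1/query).

```python
from urllib.parse import urlencode

# Hedged PromQL sketches keyed by the headings above. Metric and label names
# come from the metrics reference; label values and ranges are assumptions.
QUERIES = {
    "MTTR by severity (last 24h)": (
        # Mean resolution time in seconds: histogram sum divided by count.
        "sum by (severity) (increase(chatcli_operator_issue_resolution_duration_seconds_sum[24h]))"
        " / sum by (severity) (increase(chatcli_operator_issue_resolution_duration_seconds_count[24h]))"
    ),
    "Remediation success rate": (
        'sum(rate(chatcli_operator_remediation_actions_total{result="success"}[1h]))'
        " / sum(rate(chatcli_operator_remediation_actions_total[1h]))"
    ),
    "SLO burn rate (multi-window alert)": (
        # Fire only when both the 1h and 6h windows exceed their thresholds.
        'chatcli_operator_slo_burn_rate{window="1h"} > 14.4'
        ' and on (slo_name) chatcli_operator_slo_burn_rate{window="6h"} > 6'
    ),
    "Accumulated LLM cost per hour": (
        "sum by (provider) (increase(chatcli_operator_ai_cost_dollars[1h]))"
    ),
    "Suppressed vs. processed anomalies": (
        "sum(rate(chatcli_operator_anomalies_suppressed_total[5m]))"
        " / sum(rate(chatcli_operator_anomalies_total[5m]))"
    ),
    "Resources near exhaustion (less than 7 days)": (
        # -1 means stable, so negative values are filtered out first.
        "chatcli_operator_capacity_exhaustion_days >= 0"
        " and chatcli_operator_capacity_exhaustion_days < 7"
    ),
}


def query_url(prometheus_base: str, expr: str) -> str:
    """Build an instant-query URL for the Prometheus HTTP API."""
    return prometheus_base.rstrip("/") + "/api/v1/query?" + urlencode({"query": expr})
```

Each expression can also be pasted directly into a Grafana panel or an alerting rule; the URL helper is only needed when querying Prometheus from a script.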
Accessing the Dashboard
Next Steps
REST API Reference
Complete reference of all endpoints consumed by the dashboard.
Capacity & Costs
Details on the Capacity Planner, Noise Reducer, and Cost Tracker.
AIOps Platform
Complete architecture of the autonomous operations pipeline.
K8s Operator
Kubernetes operator configuration and deployment.