The ChatCLI AIOps platform includes a built-in web dashboard (self-contained SPA) and 4 pre-configured Grafana dashboards for complete observability of the autonomous operations pipeline.

Web Dashboard

Overview

The Web Dashboard is a Single Page Application embedded directly in the operator binary via Go embed.FS — it does not require Node.js, npm, or any separate frontend build.
| Characteristic | Detail |
| --- | --- |
| Technology | Vanilla HTML/CSS/JS (zero external dependencies) |
| Packaging | Go embed.FS — compiled into the binary |
| Port | 8090 (configurable via CHATCLI_AIOPS_PORT) |
| Theme | Dark theme (dark background, light text) |
| Responsive | Adapts to desktop, tablet, and mobile |
| Auto-refresh | Configurable interval: Off, 10s, 30s (default), 1m, 2m, 5m — with manual refresh button |
| Sortable tables | Click any column header to sort (▲ ascending / ▼ descending) |
| Stable ordering | Lists maintain consistent order between refreshes (incidents/audit by timestamp, SLOs by name) |
| Authentication | Same API key as the REST API (header X-API-Key) |
| URL | http://<operator-host>:8090/ |
The dashboard consumes the same REST API documented in API Reference. All operations available in the dashboard (acknowledge, snooze, approve, reject) are authenticated REST calls.
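As an illustration of those authenticated calls, the sketch below builds a request carrying the X-API-Key header against one of the endpoints documented on this page (GET /api/v1/aiinsights). The host and key values are placeholders to replace with your own.

```python
import urllib.request

def api_request(host: str, path: str, api_key: str) -> urllib.request.Request:
    """Build an authenticated request for the operator's REST API (default port 8090)."""
    req = urllib.request.Request(f"http://{host}:8090{path}")
    req.add_header("X-API-Key", api_key)  # same key the dashboard itself uses
    return req

req = api_request("operator-host", "/api/v1/aiinsights", "your-secure-key")
# urllib.request.urlopen(req) would return the same JSON the dashboard renders
```

Every dashboard action (acknowledge, snooze, approve, reject) goes through this same authenticated API, so anything you can do in the UI you can also script.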

Architecture

+--------------------------------------------------------------+
|                       Operator Binary                        |
|                                                              |
|  +--------------------+  +-------------------------------+   |
|  |  embed.FS          |  |  HTTP Server (:8090)          |   |
|  |  +-- index.html    |  |  +-- /          -> SPA        |   |
|  |  +-- styles.css    |  |  +-- /api/v1/  -> REST API    |   |
|  |  +-- app.js        |  |  +-- /healthz  -> Health      |   |
|  |                    |  |  +-- /readyz   -> Ready       |   |
|  +--------------------+  +-------------------------------+   |
|                                                              |
|  +------------------------------------------------------+    |
|  |  Kubernetes Client (informers + watch)               |    |
|  |  Issues, Anomalies, AIInsights, Plans, PostMortems   |    |
|  +------------------------------------------------------+    |
+--------------------------------------------------------------+

Dashboard Views

The dashboard has 10 views accessible via tab navigation:
1. Overview

Platform overview with aggregated metrics.

Components:

| Component | Description |
| --- | --- |
| Stats Cards | 6 cards: Active Issues, Resolved, Remediations (success/total), Success Rate, PostMortems, Pending Approvals |
| Compliance & SLA | Real-time compliance percentage, MTTD, MTTR, SLA response/resolution violations, approval stats |
| Capacity Warnings | Alert banner showing resources at risk of exhaustion (urgent/plan), with recommendations |
| Remediation by Strategy | Horizontal bar chart showing success rate per action type (e.g., RestartDeployment 93%, RollbackDeployment 80%) |
| Pie Chart (Severity) | Incident distribution by severity (Critical/High/Medium/Low) |
| Recent Incidents | List of the 10 most recent incidents with state, severity, and timestamp |
| Timeline | Temporal chart of incidents in the last 24h (stacked bars by severity) |

The Overview provides comprehensive situational awareness with compliance, capacity, and remediation effectiveness metrics.
2. Incidents

Interactive table of all incidents with filters and actions.

Features:

| Feature | Description |
| --- | --- |
| Filters | Severity, state, namespace, period (dropdowns at top) |
| Table | Columns: Name, Severity, State, Resource, Namespace, Detected at, Duration |
| Sorting | Click column header to sort (asc/desc) |
| Expansion | Click row to expand details: description, resource, signal type, remediation attempts, resolution |
| AI Insight Preview | Expanded rows automatically load the AI analysis inline with confidence badge, provider/model, analysis summary (first 500 chars), and top 3 recommendations |
| Acknowledge | Button to acknowledge the incident (requires operator role) |
| Snooze | Button with duration selector (30m, 1h, 2h, 4h, 24h) |
| Pagination | Page navigation with 20 items per page |

Severity badges:

| Severity | Color |
| --- | --- |
| Critical | Red (#ef4444) |
| High | Orange (#f97316) |
| Medium | Yellow (#eab308) |
| Low | Blue (#3b82f6) |

State badges:

| State | Color |
| --- | --- |
| Detected | Gray (#6b7280) |
| Analyzing | Blue (#3b82f6) |
| Remediating | Yellow (#eab308) |
| Resolved | Green (#22c55e) |
| Escalated | Red (#ef4444) |
3. SLOs

SLO cards with visual indicators of error budget and burn rate.

Components per SLO:

| Component | Description |
| --- | --- |
| Card | SLO name, service, SLI type, target vs. current |
| Error Budget Gauge | Circular progress bar showing % of error budget consumed. Green (<70%), Yellow (70-90%), Red (>90%) |
| Burn Rate Chips | 4 chips: 1h, 6h, 24h, 72h. Color indicates if the burn rate exceeds its threshold (Google SRE model) |
| State | Badge: healthy (green), at_risk (yellow), breached (red) |
| History | Sparkline of the last 7 days of compliance |

Burn Rate Thresholds (Google SRE):

| Window | Threshold | Meaning |
| --- | --- | --- |
| 1h | 14.4x | Consumes 100% of the 30-day budget in ~2 days |
| 6h | 6.0x | Consumes 100% of the budget in 5 days (confirmation window) |
| 24h | 3.0x | Consumes 100% of the budget in 10 days |
| 72h | 1.0x | Consumes 100% of the budget in 30 days |
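These thresholds follow directly from budget arithmetic: at a constant burn rate B, a 30-day error budget is exhausted in 30/B days. A quick sketch of that relationship:

```python
def days_to_exhaustion(burn_rate: float, budget_window_days: float = 30.0) -> float:
    """At a constant burn rate, a budget covering `budget_window_days` lasts window/rate days."""
    return budget_window_days / burn_rate

fast = days_to_exhaustion(14.4)  # ~2.08 days: the whole monthly budget in about two days
slow = days_to_exhaustion(1.0)   # 30.0 days: budget consumed exactly on schedule
```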
4. Approvals

List of pending approvals with approve/reject actions.

Features:

| Feature | Description |
| --- | --- |
| Pending List | Pending approvals at top, with visual highlighting |
| Context | Each approval shows: associated incident, proposed action, severity, AI confidence |
| Blast Radius | Risk level badge (CRITICAL/HIGH/MEDIUM/LOW) from blast radius prediction, showing potential impact before approval |
| Approve | Green button. Opens a modal with a required approver name field and an optional reason field. |
| Reject | Red button. Opens a modal with required name and reason fields. |
| History | Tab to view historical approvals (approved/rejected/expired) |
| Expiration | Countdown timer showing remaining time before expiration |
5. AI Insights

View all AI-generated analyses to understand how the AI reasoned about each incident.

Features:

| Feature | Description |
| --- | --- |
| Filter | Filter by incident name to see insights for a specific issue |
| Table | Columns: Incident, Provider, Model, Confidence, Recommendations, Actions, Generated |
| Confidence | Color-coded confidence score: green (≥85%), yellow (70-84%), red (<70%) |
| Expansion | Click row to expand: full AI analysis text, recommendations list, suggested actions with parameters |
| Log Analysis | Expanded view shows structured log findings (stack traces, error patterns) |
| Cascade Analysis | Shows cross-service cascade chain when detected |
| GitOps Context | Displays Helm/ArgoCD/Flux status at the time of analysis |
| Blast Radius | Shows predicted impact of suggested remediation actions |

This view is essential when an incident is escalated to human action — it shows exactly what the AI found, why it recommended specific actions, and what enrichment data informed its analysis.

API endpoint: GET /api/v1/aiinsights
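The confidence color bands above amount to a simple threshold function. This is an illustrative sketch of the view's color logic, not code from the dashboard itself:

```python
def confidence_color(score: float) -> str:
    """Map a 0-1 confidence score to the badge colors used by the AI Insights view."""
    if score >= 0.85:
        return "green"   # high confidence (>=85%)
    if score >= 0.70:
        return "yellow"  # 70-84%
    return "red"         # low confidence (<70%)
```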
6. Remediations

Track all remediation plans with execution details, both runbook-based and agentic.

Features:

| Feature | Description |
| --- | --- |
| Filters | State dropdown (Pending/Executing/Verifying/Completed/Failed/RolledBack), incident name filter |
| Table | Columns: Name, Incident, Attempt, State, Mode (Runbook/Agentic), Actions/Steps, Started, Duration |
| Mode indicator | Agentic mode highlighted in red accent; Runbook mode in default text |
| Expansion | Click row to expand: strategy description, planned actions list, result |
| Agentic details | For agentic plans: step count shown in the table, full conversation in the detail view |
| Duration | Auto-calculated from start to completion time |
| State badges | Same color scheme as incidents: Completed (green), Executing (yellow), Failed (red), RolledBack (orange) |

Remediation modes explained:
  • Runbook mode: Displays the pre-defined action sequence from the matched runbook
  • Agentic mode: Shows the step count in the table; use the Get Remediation Plan API for the full AI conversation history

API endpoint: GET /api/v1/remediations
7. Runbooks

View all runbooks — both manually created and AI-generated from successful remediations.

Features:

| Feature | Description |
| --- | --- |
| Table | Columns: Name, Signal Type, Severity, Resource Kind, Steps, Max Attempts, Created |
| Signal badge | Color-coded badge showing the trigger signal type (oom_kill, pod_not_ready, deploy_failing, etc.) |
| Expansion | Click row to expand: full description, trigger match criteria, and ordered step list |
| Step details | Each step shows: action type badge, description, and parameters as JSON |
| Auto-generated | Runbooks are automatically created when the AI successfully remediates an incident — they capture the winning strategy for reuse |

Runbooks serve as the AI's "institutional memory" — when a similar incident occurs in the future, the platform matches it to an existing runbook instead of starting from scratch, significantly reducing MTTR.

API endpoint: GET /api/v1/runbooks
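That matching presumably keys on the trigger criteria shown in the table (signal type, resource kind, severity). A hypothetical sketch of the lookup, where the field names are illustrative only and not the operator's actual schema:

```python
# Hypothetical runbook matching; field names are illustrative, not the real CRD schema.
def find_runbook(runbooks: list, incident: dict):
    """Return the first runbook whose trigger criteria match the incident, else None."""
    for rb in runbooks:
        if (rb["signal_type"] == incident["signal_type"]
                and rb["resource_kind"] == incident["resource_kind"]):
            return rb
    return None

runbooks = [{"name": "restart-on-oom", "signal_type": "oom_kill", "resource_kind": "Deployment"}]
incident = {"signal_type": "oom_kill", "resource_kind": "Deployment"}
match = find_runbook(runbooks, incident)  # reuses the captured strategy instead of replanning
```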
8. Post-Mortems

List of post-mortems with expandable details.

Features:

| Feature | Description |
| --- | --- |
| List | All post-mortems with state (open/in_review/closed), associated incident, duration |
| Expansion | Click to expand: complete timeline, root cause, impact, executed actions |
| Lessons Learned | Section with lessons learned (AI-generated) |
| Prevention Actions | Checklist of suggested preventive actions |
| Developer Feedback | Inline form for the developer to rate the remediation (1-5 stars), override the root cause, and add comments. Once submitted, displays the feedback with a visual rating |
| Review | Button to mark as "in review" |
| Close | Button to close the post-mortem after review |
| Source | Badge indicating whether it was generated by agentic or standard remediation |
9. Clusters

Cards of monitored clusters with health status and a federation overview.

Federation Panel:

| Component | Description |
| --- | --- |
| Federation Status | Connected/disconnected cluster counts and total active issues across the federation |
| Cross-Cluster Correlations | Issues correlated across clusters with severity badge, signal type, CASCADE/ELEVATED flags, and correlated cluster names |

Components per cluster:

| Component | Description |
| --- | --- |
| Card | Cluster name, provider (EKS/GKE/AKS), K8s version |
| Status | Badge: healthy (green), degraded (yellow), unreachable (red) |
| Metrics | Number of nodes, active incidents, monitored namespaces |
| Resources | CPU and memory usage bars (capacity vs. usage) |
| Targets | List of watcher targets with alert counters per namespace |
| Last Sync | Last synchronization timestamp with freshness indicator |

API endpoints: GET /api/v1/federation/status, GET /api/v1/federation/correlations
10. Audit Log

Searchable audit log with export.

Features:

| Feature | Description |
| --- | --- |
| Search | Text field to search in type, resource, actor, description |
| Filters | Event type, severity, period |
| Table | Columns: Timestamp, Type, Severity, Actor, Resource, Description |
| Severity | Icons and colors: info (blue), warning (yellow), critical (red) |
| Export | Button to export as JSON or CSV (requires admin role) |
| Pagination | 50 items per page |
| Auto-scroll | New events appear at the top with a highlight animation |

Grafana Dashboards

The AIOps platform includes 4 pre-configured Grafana dashboards in JSON format, ready for import.

1. AIOps Overview (aiops-overview.json)

Main dashboard with operational overview. Panels:
| Panel | Type | Description |
| --- | --- | --- |
| Active Issues | Stat | Number of unresolved issues (gauge with thresholds: green <5, yellow 5-15, red >15) |
| MTTR | Stat | Mean time to resolution in minutes |
| Success Rate | Gauge | Remediation success rate (0-100%) |
| Issues by Severity | Pie Chart | Issue distribution by severity (Critical/High/Medium/Low) |
| Issues by State | Bar Chart | Issue count by state (Detected/Analyzing/Remediating/Resolved/Escalated) |
| Remediation Actions | Time Series | Remediation actions over time, separated by type (Restart/Scale/Rollback/Adjust/Delete/Patch) |
| Resolution Duration | Histogram | Resolution time distribution with buckets of 1min, 2min, 5min, 10min, 30min |
| Issues Over Time | Time Series | Incidents created vs. resolved per hour |

Template variables:

| Variable | Type | Values |
| --- | --- | --- |
| namespace | Query | All namespaces with issues |
| severity | Custom | All, Critical, High, Medium, Low |
| interval | Interval | 1m, 5m, 15m, 1h |

2. SLO Burn Rate (slo-burn-rate.json)

Dashboard dedicated to SLOs following the Google SRE model. Panels:
| Panel | Type | Description |
| --- | --- | --- |
| Error Budget Gauge | Gauge | Remaining error budget percentage per SLO. Thresholds: green >30%, yellow 10-30%, red <10% |
| Burn Rate 1h | Time Series | Burn rate in the 1-hour window with threshold line at 14.4x |
| Burn Rate 6h | Time Series | Burn rate in the 6-hour window with threshold line at 6.0x |
| Burn Rate 24h | Time Series | Burn rate in the 24-hour window with threshold line at 3.0x |
| Burn Rate 72h | Time Series | Burn rate in the 72-hour window with threshold line at 1.0x |
| SLA Compliance | Stat | Current compliance percentage per SLO |
| SLA Violations | Table | Violation list with timestamp, affected SLO, duration, and budget impact |
| Budget Consumption Over Time | Time Series | Cumulative error budget consumption over the 30-day window |
Threshold lines (annotations): Each burn rate chart includes a dashed red horizontal line at the corresponding threshold (Google SRE multi-window, multi-burn-rate alerting model).

3. Incident Timeline (incident-timeline.json)

Dashboard focused on the temporal flow of incidents and notifications. Panels:
| Panel | Type | Description |
| --- | --- | --- |
| Critical Incidents | Stat | Count of active critical incidents (pulsing red if >0) |
| Escalated Incidents | Stat | Count of escalated incidents |
| Resolved Today | Stat | Incidents resolved in the last 24h |
| Incident Timeline | Timeline/Annotations | Temporal visualization of incidents with severity colors |
| Notifications by Channel | Bar Chart | Notification count by channel (Slack/PagerDuty/Webhook/Email) |
| Approval Decisions | Pie Chart | Approval decision distribution (Approved/Rejected/Expired) |
| Federation Status | Table | Federated cluster status with last sync, active incidents, and health |
| MTTD Over Time | Time Series | Mean Time to Detect over time |

4. Remediation Stats (remediation-stats.json)

Detailed dashboard on remediation performance. Panels:
| Panel | Type | Description |
| --- | --- | --- |
| Success Rate Gauge | Gauge | Overall remediation success rate. Thresholds: green >90%, yellow 75-90%, red <75% |
| Actions by Type | Bar Chart (horizontal) | Total actions executed grouped by type (RestartDeployment, ScaleDeployment, etc.) |
| Actions by Result | Stacked Bar | Actions by result (success/failed) over time |
| Duration p50 | Stat | Median remediation time |
| Duration p90 | Stat | 90th percentile remediation time |
| Duration p99 | Stat | 99th percentile remediation time |
| Duration Distribution | Histogram | Remediation time distribution with buckets |
| Operator Reconciliation | Time Series | Reconciliation count per controller (Issue/Anomaly/AIInsight/Remediation) |
| Reconciliation Errors | Time Series | Reconciliation errors per controller |
| Reconciliation Duration | Heatmap | Reconciliation duration per controller (detects bottlenecks) |

Grafana Dashboard Installation

If you use the Grafana Helm chart with sidecar enabled, create ConfigMaps with the label grafana_dashboard: "1":
# Create ConfigMaps for each dashboard
kubectl create configmap grafana-aiops-overview \
  --from-file=aiops-overview.json=operator/dashboards/aiops-overview.json \
  -n monitoring

kubectl create configmap grafana-slo-burn-rate \
  --from-file=slo-burn-rate.json=operator/dashboards/slo-burn-rate.json \
  -n monitoring

kubectl create configmap grafana-incident-timeline \
  --from-file=incident-timeline.json=operator/dashboards/incident-timeline.json \
  -n monitoring

kubectl create configmap grafana-remediation-stats \
  --from-file=remediation-stats.json=operator/dashboards/remediation-stats.json \
  -n monitoring

# Add label for sidecar discovery
kubectl label configmap grafana-aiops-overview grafana_dashboard=1 -n monitoring
kubectl label configmap grafana-slo-burn-rate grafana_dashboard=1 -n monitoring
kubectl label configmap grafana-incident-timeline grafana_dashboard=1 -n monitoring
kubectl label configmap grafana-remediation-stats grafana_dashboard=1 -n monitoring
The Grafana sidecar automatically detects ConfigMaps labeled grafana_dashboard: "1" and imports the dashboards without restarting Grafana.

Via Manual Import

  1. Go to Grafana > Dashboards > Import
  2. Upload the JSON file or paste the content
  3. Select the Prometheus datasource
  4. Click Import

ServiceMonitor for Prometheus Operator

Configure metrics scraping for the operator:
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: chatcli-operator
  namespace: chatcli-system
  labels:
    app.kubernetes.io/name: chatcli-operator
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: chatcli-operator
  endpoints:
    - port: metrics
      interval: 30s
      path: /metrics
  namespaceSelector:
    matchNames:
      - chatcli-system

Prometheus Metrics Reference

The operator exposes the following Prometheus metrics to feed the Grafana dashboards:
| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| chatcli_operator_issues_total | Counter | severity, state | Total issues created by severity and state |
| chatcli_operator_active_issues | Gauge | namespace | Number of active (unresolved) issues |
| chatcli_operator_issue_resolution_duration_seconds | Histogram | severity | Duration from detection to resolution |
| chatcli_operator_remediation_actions_total | Counter | type, result | Total remediation actions by type and result |
| chatcli_operator_remediation_duration_seconds | Histogram | type | Remediation action duration by type |
| chatcli_operator_anomalies_total | Counter | signal_type, namespace | Total anomalies detected |
| chatcli_operator_anomalies_suppressed_total | Counter | strategy | Anomalies suppressed by the Noise Reducer per strategy |
| chatcli_operator_ai_analysis_duration_seconds | Histogram | provider, model | AI analysis duration |
| chatcli_operator_ai_analysis_confidence | Histogram | provider | Analysis confidence distribution |
| chatcli_operator_ai_tokens_total | Counter | provider, direction | Total tokens consumed (direction: input/output) |
| chatcli_operator_ai_cost_dollars | Counter | provider | Accumulated cost in dollars per provider |
| chatcli_operator_slo_current_ratio | Gauge | slo_name, service | Current SLO ratio (0-1) |
| chatcli_operator_slo_error_budget_remaining | Gauge | slo_name | Remaining error budget in minutes |
| chatcli_operator_slo_burn_rate | Gauge | slo_name, window | Burn rate per window (1h/6h/24h/72h) |
| chatcli_operator_approvals_total | Counter | decision | Total approval decisions (approved/rejected/expired) |
| chatcli_operator_notifications_total | Counter | channel, result | Notifications sent per channel and result |
| chatcli_operator_postmortems_total | Counter | source | Total post-mortems generated by source (agentic/standard) |
| chatcli_operator_reconcile_total | Counter | controller, result | Total reconciliations per controller and result |
| chatcli_operator_reconcile_duration_seconds | Histogram | controller | Reconciliation duration per controller |
| chatcli_operator_reconcile_errors_total | Counter | controller | Total reconciliation errors per controller |
| chatcli_operator_cluster_health | Gauge | cluster, provider | Cluster health (1=healthy, 0.5=degraded, 0=unreachable) |
| chatcli_operator_capacity_cpu_usage_percent | Gauge | resource, namespace | Current CPU usage percentage |
| chatcli_operator_capacity_memory_usage_percent | Gauge | resource, namespace | Current memory usage percentage |
| chatcli_operator_capacity_exhaustion_days | Gauge | resource, namespace, type | Projected days until exhaustion (type: cpu/memory, -1 if stable) |
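To sanity-check these counters outside Grafana, you can scrape the operator's /metrics endpoint and compute a rate by hand. The sketch below parses two sample lines in Prometheus exposition format; the values are made up for illustration:

```python
import re

# Hypothetical sample as it might appear on the operator's /metrics endpoint
sample = '''
chatcli_operator_remediation_actions_total{type="RestartDeployment",result="success"} 42
chatcli_operator_remediation_actions_total{type="RestartDeployment",result="failed"} 3
'''

def parse_metrics(text: str) -> dict:
    """Parse simple exposition-format lines into {(name, labels): value}."""
    out = {}
    for line in text.strip().splitlines():
        m = re.match(r'(\w+)\{(.*)\}\s+([\d.]+)$', line.strip())
        if m:
            name, labels, value = m.groups()
            out[(name, labels)] = float(value)
    return out

metrics = parse_metrics(sample)
success = sum(v for (_, labels), v in metrics.items() if 'result="success"' in labels)
total = sum(metrics.values())
success_rate = success / total * 100  # mirrors the PromQL success-rate query below
```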

Useful Prometheus Queries

PromQL query examples for dashboards or alerts:

# MTTR (p50) for critical issues over the last 24h
histogram_quantile(0.5,
  rate(chatcli_operator_issue_resolution_duration_seconds_bucket{severity="critical"}[24h])
)

# Overall remediation success rate (%)
sum(chatcli_operator_remediation_actions_total{result="success"})
/
sum(chatcli_operator_remediation_actions_total)
* 100

# Alert: burn rate 1h > 14.4x AND burn rate 6h > 6x (Google SRE model)
chatcli_operator_slo_burn_rate{window="1h"} > 14.4
and
chatcli_operator_slo_burn_rate{window="6h"} > 6.0

# AI spend per provider over the last hour
increase(chatcli_operator_ai_cost_dollars[1h])

# Noise Reducer suppression ratio (%)
sum(rate(chatcli_operator_anomalies_suppressed_total[1h]))
/
(sum(rate(chatcli_operator_anomalies_total[1h])) + sum(rate(chatcli_operator_anomalies_suppressed_total[1h])))
* 100

# Capacity: resources projected to exhaust within 7 days
chatcli_operator_capacity_exhaustion_days > 0
and
chatcli_operator_capacity_exhaustion_days < 7
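If you run the Prometheus Operator (as in the ServiceMonitor above), the fast-burn query can be packaged as a PrometheusRule. This is a sketch; the rule name, namespace, and labels are assumptions to adapt to your setup:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: chatcli-slo-alerts        # assumed name
  namespace: chatcli-system
spec:
  groups:
    - name: aiops-slo
      rules:
        - alert: SLOFastBurn
          # Multi-window condition: both 1h and 6h windows must exceed threshold
          expr: |
            chatcli_operator_slo_burn_rate{window="1h"} > 14.4
            and
            chatcli_operator_slo_burn_rate{window="6h"} > 6.0
          for: 2m
          labels:
            severity: critical
          annotations:
            summary: "SLO {{ $labels.slo_name }} is burning its error budget at critical speed"
```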

Accessing the Dashboard

1. Verify the operator

Confirm the operator is running:
kubectl get pods -n chatcli-system -l app.kubernetes.io/name=chatcli-operator
2. Port-forward (development)

For local access during development:
kubectl port-forward -n chatcli-system svc/chatcli-operator 8090:8090
Access: http://localhost:8090/
3. Ingress (production)

For production access, configure an Ingress:
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: chatcli-dashboard
  namespace: chatcli-system
  annotations:
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
spec:
  tls:
    - hosts: ["aiops.company.com"]
      secretName: aiops-tls
  rules:
    - host: aiops.company.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: chatcli-operator
                port:
                  number: 8090
4. Configure API Key

Configure at least one API key before exposing the dashboard externally:
kubectl create configmap chatcli-api-keys -n chatcli-system \
  --from-literal=keys.json='{"keys":[{"key":"your-secure-key","role":"admin","description":"Dashboard admin"}]}'
Never expose the dashboard without authentication (API key) in production. Dev mode (without keys) allows unrestricted access, including write operations such as acknowledge, approve, and reject.
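To generate a strong key for that ConfigMap, something along these lines works; the payload shape matches the keys.json format shown above:

```python
import json
import secrets

key = secrets.token_urlsafe(32)  # ~43 URL-safe characters of randomness
payload = json.dumps(
    {"keys": [{"key": key, "role": "admin", "description": "Dashboard admin"}]}
)
# use `payload` as the keys.json literal in the kubectl create configmap command above
```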

Next Steps

REST API Reference

Complete reference of all endpoints consumed by the dashboard.

Capacity & Costs

Details on the Capacity Planner, Noise Reducer, and Cost Tracker.

AIOps Platform

Complete architecture of the autonomous operations pipeline.

K8s Operator

Kubernetes operator configuration and deployment.