The ChatCLI AIOps platform provides native management of Service Level Objectives (SLOs) and Service Level Agreements (SLAs) via Kubernetes CRDs. The system implements Google SRE’s burn rate model for intelligent alerting and tracks SLA compliance with business hours support.

SLO vs SLA: Understanding the Difference

| Aspect | SLO (Service Level Objective) | SLA (Service Level Agreement) |
| --- | --- | --- |
| Definition | Internal reliability target for a service | Formal contract with customers/stakeholders |
| Who defines | Engineering team | Business + engineering + legal |
| Consequence of violation | Internal alert, deploy freeze, review | Contractual penalties, credits, fines |
| Example | "99.9% availability in 30 days" | "P1 incidents responded to within 5 minutes" |
| CRD | ServiceLevelObjective | IncidentSLA |
Best practice is to define SLOs that are more stringent than SLAs. If your SLA guarantees 99.9%, set the SLO at 99.95%. This creates a safety margin (internal error budget) that allows detecting degradations before the SLA is violated.

ServiceLevelObjective CRD

The ServiceLevelObjective defines a reliability target for a service, with alerts based on burn rate and error budget tracking.
apiVersion: platform.chatcli.io/v1alpha1
kind: ServiceLevelObjective
metadata:
  name: api-gateway-availability
  namespace: production
spec:
  service: api-gateway
  description: "API Gateway must maintain 99.9% availability in a 30-day window"

  indicator:
    type: availability
    prometheusQuery:
      goodQuery: 'sum(rate(http_requests_total{service="api-gateway",code=~"2..|3.."}[5m]))'
      totalQuery: 'sum(rate(http_requests_total{service="api-gateway"}[5m]))'

  target:
    percentage: 99.9
    window: 30d

  burnRateAlerts:
    - name: page-fast-burn
      shortWindow: 1h
      longWindow: 6h
      burnRateThreshold: 14.4
      severity: critical
      notificationPolicy: production-alerts

    - name: ticket-medium-burn
      shortWindow: 6h
      longWindow: 3d
      burnRateThreshold: 6.0
      severity: high
      notificationPolicy: production-alerts

    - name: ticket-slow-burn
      shortWindow: 24h
      longWindow: 3d
      burnRateThreshold: 3.0
      severity: medium

    - name: monitor-gradual-burn
      shortWindow: 72h
      longWindow: 30d
      burnRateThreshold: 1.0
      severity: low

  alertPolicy:
    multiWindowRequired: true
    pageOnBudgetExhausted: true
    budgetWarningThresholds: [50, 25, 10, 0]

status:
  currentValue: 99.92
  errorBudgetTotal: 0.001
  errorBudgetRemaining: 0.0008
  errorBudgetRemainingPercent: 80.0
  burnRate: 1.2
  lastCalculatedAt: "2026-03-19T14:00:00Z"
  condition: Met

Spec Fields

Root

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| service | string | Yes | Name of the monitored service |
| description | string | No | Human-readable description of the SLO |
| indicator | SLOIndicator | Yes | Service Level Indicator (SLI) definition |
| target | SLOTarget | Yes | Target and time window |
| burnRateAlerts | []BurnRateWindow | No | Multi-window alert configuration |
| alertPolicy | SLOAlertPolicy | No | General alert policy |

SLOIndicator

Defines what to measure. The type determines the semantics and required Prometheus queries.
| Field | Type | Required | Description |
| --- | --- | --- | --- |
| type | string | Yes | availability, latency, error_rate, throughput |
| prometheusQuery | PrometheusQuerySpec | Yes | PromQL queries to calculate the SLI |

Indicator types:

| Type | Good Events | Total Events | Calculation |
| --- | --- | --- | --- |
| availability | Successful requests (2xx, 3xx) | Total requests | good / total |
| latency | Requests below the latency threshold | Total requests | fast / total |
| error_rate | N/A (inverted) | Error requests | 1 - (errors / total) |
| throughput | Requests processed within budget | Requests received | processed / received |

PrometheusQuerySpec:

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| goodQuery | string | Yes | PromQL that returns the rate of "good" events |
| totalQuery | string | Yes | PromQL that returns the total event rate |

indicator:
  type: availability
  prometheusQuery:
    goodQuery: 'sum(rate(http_requests_total{service="api-gateway",code=~"2..|3.."}[5m]))'
    totalQuery: 'sum(rate(http_requests_total{service="api-gateway"}[5m]))'
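
As a rough illustration of the calculations in the indicator table, the sketch below turns event rates into an SLI percentage. The helper names `sli_percent` and `sli_percent_from_errors`, the sample numbers, and the choice to treat a window without traffic as fully met are assumptions made for this example; they are not part of the ChatCLI API.

```python
def sli_percent(good: float, total: float) -> float:
    """SLI as a percentage from 'good' and 'total' event rates.

    Hypothetical helper: 'good' and 'total' stand for the values returned
    by goodQuery and totalQuery.
    """
    if total <= 0:
        # Assumption: no traffic in the window counts as fully met.
        return 100.0
    return 100.0 * good / total


def sli_percent_from_errors(errors: float, total: float) -> float:
    """Inverted form used by error_rate: 1 - (errors / total)."""
    if total <= 0:
        return 100.0
    return 100.0 * (1.0 - errors / total)


availability = sli_percent(good=9_992, total=10_000)
inverted = sli_percent_from_errors(errors=8, total=10_000)
print(f"{availability:.2f}% {inverted:.2f}%")
```

Both calls describe the same traffic, once counted by successes and once by errors, so they yield the same SLI.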

SLOTarget

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| percentage | float64 | Yes | Target percentage (e.g., 99.9) |
| window | duration | Yes | Rolling time window (e.g., 30d, 7d, 24h) |

BurnRateWindow

Each entry defines an alert window based on burn rate.
| Field | Type | Required | Description |
| --- | --- | --- | --- |
| name | string | Yes | Alert identifier name |
| shortWindow | duration | Yes | Short observation window |
| longWindow | duration | Yes | Long observation window |
| burnRateThreshold | float64 | Yes | Burn rate threshold to trigger the alert |
| severity | string | Yes | critical, high, medium, low |
| notificationPolicy | string | No | NotificationPolicy name for routing |

SLOAlertPolicy

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| multiWindowRequired | bool | true | Requires BOTH windows (short AND long) to exceed the threshold |
| pageOnBudgetExhausted | bool | true | Sends a critical page when error budget reaches 0% |
| budgetWarningThresholds | []int | [50, 25, 10, 0] | Remaining budget percentages that trigger warnings |

How the Calculation Works (Google SRE Model)

The system implements the multi-window, multi-burn-rate alerting model described in Google's "The Site Reliability Workbook" (chapter "Alerting on SLOs").

Error Budget

The error budget is the maximum amount of “error” allowed within the SLO window.
Error Budget = 1 - (target / 100)

Example for a 99.9% SLO:
  Error Budget = 1 - (99.9 / 100) = 0.001 = 0.1%
In a 30-day window, this means:
Allowed downtime = 30 days x 24h x 60min x 0.001 = 43.2 minutes
| SLO Target | Error Budget | Downtime/30d |
| --- | --- | --- |
| 99% | 1.0% | 7h 12min |
| 99.5% | 0.5% | 3h 36min |
| 99.9% | 0.1% | 43.2 min |
| 99.95% | 0.05% | 21.6 min |
| 99.99% | 0.01% | 4.32 min |
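
The error budget arithmetic can be reproduced with a few lines of Python. This is a minimal sketch of the formula above; the function names are chosen for this example only.

```python
def error_budget(target_percent: float) -> float:
    """Error budget as a fraction: 1 - (target / 100)."""
    return 1.0 - target_percent / 100.0


def allowed_downtime_minutes(target_percent: float, window_days: float) -> float:
    """Minutes of full downtime that fit in the budget for the window."""
    return window_days * 24 * 60 * error_budget(target_percent)


for target in (99.0, 99.5, 99.9, 99.95, 99.99):
    minutes = allowed_downtime_minutes(target, window_days=30)
    print(f"{target}% -> {minutes:.2f} min")
```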

Burn Rate

The burn rate indicates how fast the error budget is being consumed.
Burn Rate = error_rate_in_window / error_budget

Where:
  error_rate_in_window = 1 - (good_events / total_events) [in the window]
  error_budget = 1 - (target / 100)
Step 1: Calculate error rate in the window

Using the Prometheus queries, the ratio of good events to total events in the specified window is calculated.
Example: In the last 1h, there were 10,000 requests, 9,950 successful.
error_rate = 1 - (9950 / 10000) = 0.005 = 0.5%

Step 2: Calculate burn rate

Divide the error rate by the error budget.
burn_rate = 0.005 / 0.001 = 5.0x

Interpretation: The budget is being consumed 5x faster than sustainable.
At this rate, the 30-day budget would be exhausted in 6 days.

Step 3: Verify multi-window

To trigger an alert, BOTH windows (short AND long) must exceed the threshold.
Alert "page-fast-burn" (threshold 14.4x):
  - Short window (1h): burn_rate = 16.2x  > 14.4  OK
  - Long window (6h):  burn_rate = 15.1x  > 14.4  OK
  -> ALERT FIRED (both exceed)

If short=16.2x but long=12.0x:
  -> DOES NOT fire (long below threshold)
  -> Indicates a temporary spike, not sustained degradation

Step 4: Classify and notify

Based on the configured severity, the alert is routed to the corresponding NotificationPolicy.
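
The burn rate and multi-window steps can be sketched in Python as follows. This is an illustration of the formulas in this section, not the operator's code; `burn_rate` and `should_fire` are hypothetical names, and the zero-traffic case is handled by assumption.

```python
def burn_rate(good: float, total: float, target_percent: float) -> float:
    """Burn rate = error rate in the window / error budget."""
    if total <= 0:
        # Assumption: no traffic means no budget is being consumed.
        return 0.0
    error_rate = 1.0 - good / total
    budget = 1.0 - target_percent / 100.0
    return error_rate / budget


def should_fire(short_burn: float, long_burn: float, threshold: float) -> bool:
    """Multi-window rule: BOTH windows must exceed the threshold."""
    return short_burn > threshold and long_burn > threshold


rate = burn_rate(good=9_950, total=10_000, target_percent=99.9)
print(f"burn rate: {rate:.1f}x, budget lasts {30 / rate:.0f} days")

print(should_fire(16.2, 15.1, 14.4))  # sustained degradation
print(should_fire(16.2, 12.0, 14.4))  # temporary spike
```

Requiring both windows keeps a short spike from paging anyone, while the short window lets the alert clear soon after the problem is fixed.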

Multi-Window Alerting: Default Thresholds

The default thresholds follow Google SRE’s recommendation for a 30-day SLO:
| Name | Short Window | Long Window | Burn Rate | Severity | Meaning |
| --- | --- | --- | --- | --- | --- |
| page-fast-burn | 1h | 6h | 14.4x | Critical | Budget exhausted in ~2 days. Requires immediate action. |
| ticket-medium-burn | 6h | 3d | 6.0x | High | Budget exhausted in ~5 days. Create an urgent ticket. |
| ticket-slow-burn | 24h | 3d | 3.0x | Medium | Budget exhausted in ~10 days. Investigate and plan. |
| monitor-gradual-burn | 72h | 30d | 1.0x | Low | Budget exactly at a sustainable pace. Monitor. |
The threshold follows from burn_rate_threshold = window_days / days_to_exhaustion. For a 30-day SLO that should alert when the budget would be exhausted in about 2 days (2.08 days, or 50 hours): 30 / 2.0833 ≈ 14.4x.
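
The relationship between a burn rate threshold and the time until the budget runs out can be checked in both directions. The sketch below only restates the formula; the function names are illustrative.

```python
def burn_rate_threshold(window_days: float, days_to_exhaustion: float) -> float:
    """Burn rate at which the whole budget is gone in days_to_exhaustion."""
    return window_days / days_to_exhaustion


def days_to_exhaustion(window_days: float, burn_rate: float) -> float:
    """Inverse: how long the budget lasts at a constant burn rate."""
    return window_days / burn_rate


for rate in (14.4, 6.0, 3.0, 1.0):
    days = days_to_exhaustion(30, rate)
    print(f"{rate}x -> {days:.2f} days ({days * 24:.0f} hours)")
```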

Complete Numerical Example

Consider a 99.9% availability SLO over 30 days for the api-gateway service:
Configuration:
  Target: 99.9%
  Window: 30 days
  Error Budget: 0.1% = 43.2 minutes of downtime

Current situation (measured by Prometheus):
  Last 24h: 99.85% availability (0.15% error rate)
  Last 6h:  99.80% availability (0.20% error rate)
  Last 1h:  99.70% availability (0.30% error rate)

Burn rate calculation per window:
  1h:  0.003 / 0.001 = 3.0x
  6h:  0.002 / 0.001 = 2.0x
  24h: 0.0015 / 0.001 = 1.5x

Alert evaluation:
  page-fast-burn (14.4x):  1h=3.0x < 14.4  -> DOES NOT fire
  ticket-medium-burn (6x): 6h=2.0x < 6.0   -> DOES NOT fire
  ticket-slow-burn (3x):   24h=1.5x < 3.0  -> DOES NOT fire
  monitor-gradual-burn (1x): 72h and 30d both > 1.0 (assumed; not listed above) -> FIRES (severity: low)

Result: Slow degradation detected. Not critical, but the budget is
being consumed 1.5x faster than sustainable. At this rate, the
43.2-minute budget would be exhausted in 20 days (instead of 30).

Remaining error budget:
  Consumed so far: ~22 minutes (estimated)
  Remaining: 43.2 - 22 = 21.2 minutes
  Remaining percentage: 49.1%
  -> 50% warning threshold crossed (49.1% < 50%)
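
The alert evaluation in this example can be replayed with a short script. It is a sketch under stated assumptions: the 1h, 6h and 24h availability figures come from the example, while the 72h/3d and 30d figures are hypothetical fill-ins, because the example does not list them. The `BurnRateAlert` class is an illustrative stand-in for the BurnRateWindow entries.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BurnRateAlert:
    name: str
    short_window: str
    long_window: str
    threshold: float


ALERTS = [
    BurnRateAlert("page-fast-burn", "1h", "6h", 14.4),
    BurnRateAlert("ticket-medium-burn", "6h", "3d", 6.0),
    BurnRateAlert("ticket-slow-burn", "24h", "3d", 3.0),
    BurnRateAlert("monitor-gradual-burn", "72h", "30d", 1.0),
]

ERROR_BUDGET = 0.001  # 99.9% target

# Availability per window, in percent. 72h/3d and 30d are hypothetical values.
availability = {
    "1h": 99.70, "6h": 99.80, "24h": 99.85,
    "72h": 99.87, "3d": 99.87, "30d": 99.89,
}


def burn(window: str) -> float:
    """Burn rate for one window: error rate divided by the error budget."""
    return (1.0 - availability[window] / 100.0) / ERROR_BUDGET


# An alert fires only when both of its windows exceed the threshold.
fired = [
    a.name for a in ALERTS
    if burn(a.short_window) > a.threshold and burn(a.long_window) > a.threshold
]
print(fired)
```

With these inputs only the low-severity alert fires, which matches the slow-degradation reading of the example.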

Error Budget Tracking

The ServiceLevelObjective status is periodically updated by the reconciler:
| Field | Type | Description |
| --- | --- | --- |
| currentValue | float64 | Current SLI value (e.g., 99.92%) |
| errorBudgetTotal | float64 | Total budget (e.g., 0.001 for 99.9%) |
| errorBudgetRemaining | float64 | Remaining budget |
| errorBudgetRemainingPercent | float64 | Remaining budget percentage |
| burnRate | float64 | Current burn rate (shortest window) |
| lastCalculatedAt | Time | Last calculation |
| condition | string | Met (within SLO), AtRisk (budget < 25%), Violated (budget exhausted) |

Budget warning thresholds: when budgetWarningThresholds is configured, the system sends a notification as the remaining budget reaches each threshold:

| Remaining Budget | Action |
| --- | --- |
| 50% | Informational notification |
| 25% | Warning: freeze non-essential deploys |
| 10% | Alert: full focus on stability |
| 0% | Critical page (if pageOnBudgetExhausted: true) |
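
To make the status fields concrete, the sketch below maps a remaining-budget percentage to a condition and to the warning thresholds already crossed. The boundaries follow the descriptions in this section (AtRisk below 25%, Violated when exhausted). Treating "reached" as "less than or equal" is an assumption, and the function names are hypothetical.

```python
def slo_condition(remaining_percent: float) -> str:
    """Map remaining error budget (percent) to Met, AtRisk or Violated."""
    if remaining_percent <= 0:
        return "Violated"
    if remaining_percent < 25:
        return "AtRisk"
    return "Met"


def crossed_thresholds(remaining_percent: float, thresholds=(50, 25, 10, 0)):
    """Warning thresholds that the remaining budget has reached (assumed <=)."""
    return [t for t in thresholds if remaining_percent <= t]


# 22 of 43.2 budget minutes consumed, as in the numerical example.
remaining = 100.0 * (43.2 - 22.0) / 43.2
print(f"{remaining:.1f}% remaining -> {slo_condition(remaining)}, "
      f"crossed: {crossed_thresholds(remaining)}")
```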

IncidentSLA CRD

The IncidentSLA defines response and resolution time contracts by severity, with business hours support and violation tracking.
apiVersion: platform.chatcli.io/v1alpha1
kind: IncidentSLA
metadata:
  name: production-sla
  namespace: production
spec:
  service: api-gateway
  description: "Production SLA for the API Gateway"

  responseTimes:
    - severity: critical
      maxResponseTime: "5m"
      maxResolutionTime: "1h"
    - severity: high
      maxResponseTime: "15m"
      maxResolutionTime: "4h"
    - severity: medium
      maxResponseTime: "1h"
      maxResolutionTime: "24h"
    - severity: low
      maxResponseTime: "4h"
      maxResolutionTime: "72h"

  businessHours:
    enabled: true
    timezone: "America/Sao_Paulo"
    startHour: 9
    endHour: 18
    workDays: [1, 2, 3, 4, 5]   # Monday to Friday (0=Sun, 6=Sat)
    holidays:
      - date: "2026-01-01"
        name: "New Year's Day"
      - date: "2026-04-03"
        name: "Good Friday"
      - date: "2026-12-25"
        name: "Christmas"

  violationPolicy:
    notificationPolicy: sla-breach-notifications
    escalationPolicy: p0-escalation
    autoEscalateOnBreach: true

status:
  activeIncidents: 2
  totalViolations: 3
  compliancePercentage: 97.5
  violations:
    - issueName: "api-gateway-oom-kill-1771276354"
      severity: critical
      type: resolution_time
      exceededBy: "12m"
      occurredAt: "2026-03-15T14:30:00Z"
  lastCalculatedAt: "2026-03-19T14:00:00Z"

Spec Fields

Root

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| service | string | Yes | Name of the service covered by the SLA |
| description | string | No | SLA description |
| responseTimes | []ResponseTimeConfig | Yes | Maximum times per severity |
| businessHours | BusinessHoursSpec | No | Business hours configuration |
| violationPolicy | ViolationPolicySpec | No | Actions on violation |

ResponseTimeConfig

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| severity | string | Yes | critical, high, medium, low |
| maxResponseTime | duration | Yes | Maximum time for first acknowledgement |
| maxResolutionTime | duration | Yes | Maximum time for complete resolution |

Response time is measured as the time between Issue creation (state Detected) and the first transition to Analyzing or Remediating. Resolution time is measured between Detected and Resolved.
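
A small sketch of how these two durations could be derived from an Issue's state history. The `(state, timestamp)` list and the `sla_times` helper are hypothetical representations chosen for this example, the timestamps are made up, and the calculation uses wall-clock time without any business hours adjustment.

```python
from datetime import datetime, timedelta


def sla_times(transitions):
    """Return (response_time, resolution_time) from ordered (state, ts) pairs.

    Either value is None if the corresponding transition has not happened yet.
    """
    detected = next(ts for state, ts in transitions if state == "Detected")
    first_response = next(
        (ts for state, ts in transitions if state in ("Analyzing", "Remediating")),
        None,
    )
    resolved = next((ts for state, ts in transitions if state == "Resolved"), None)
    response = first_response - detected if first_response else None
    resolution = resolved - detected if resolved else None
    return response, resolution


# Hypothetical state history for a critical issue.
history = [
    ("Detected", datetime(2026, 3, 15, 13, 18)),
    ("Analyzing", datetime(2026, 3, 15, 13, 21)),
    ("Remediating", datetime(2026, 3, 15, 13, 40)),
    ("Resolved", datetime(2026, 3, 15, 14, 30)),
]
response, resolution = sla_times(history)
exceeded_by = resolution - timedelta(hours=1)  # critical maxResolutionTime: 1h
print(response, resolution, exceeded_by)
```

Here the response time (3 minutes) is within the 5-minute limit, while the resolution time (1h12) exceeds the 1-hour limit by 12 minutes.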

BusinessHoursSpec

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| enabled | bool | Yes | Enable counting only during business hours |
| timezone | string | Yes | IANA timezone (e.g., America/Sao_Paulo) |
| startHour | int | Yes | Start hour (0-23) |
| endHour | int | Yes | End hour (0-23) |
| workDays | []int | Yes | Work days (0=Sunday, 6=Saturday) |
| holidays | []Holiday | No | Holidays (clock paused on these days) |

How the Business Hours Clock Works

The SLA clock only counts during business hours. Outside of business hours, the clock is automatically paused.
Step 1: Incident detected

Issue created at 17:45 (Friday). Clock starts.
Business hours: 09:00-18:00 (Mon-Fri), timezone America/Sao_Paulo

Step 2: Clock counts 15 minutes (Friday)

From 17:45 to 18:00 = 15 minutes of SLA clock. Clock pauses at 18:00 (end of business hours).

Step 3: Weekend: clock paused

All of Saturday and Sunday: clock remains paused. Accumulated SLA time: 15 minutes.

Step 4: Monday: clock resumes

Clock resumes at 09:00 on Monday. If the incident is resolved at 10:30 on Monday:
  • Friday: 15 minutes
  • Monday: 1h30 = 90 minutes
  • Total SLA: 105 minutes (1h45)

Step 5: Compliance evaluation

For critical severity with maxResolutionTime: 1h:
  • SLA time spent: 1h45 = 105 minutes
  • Limit: 60 minutes
  • VIOLATION: exceeded by 45 minutes
For high severity with maxResolutionTime: 4h:
  • SLA time spent: 105 minutes
  • Limit: 240 minutes
  • WITHIN SLA
For critical incidents, consider disabling business hours (enabled: false) and using a 24/7 clock. Critical production issues should not wait for the next business day.
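
The business hours clock can be approximated by summing, day by day, the overlap between the incident interval and each working day. The sketch below assumes that timestamps have already been converted to the SLA's timezone and are passed as naive local datetimes. `business_minutes` is a hypothetical helper, not the operator's implementation.

```python
from datetime import date, datetime, time, timedelta


def business_minutes(start, end, start_hour=9, end_hour=18,
                     work_days=(1, 2, 3, 4, 5), holidays=frozenset()):
    """Minutes of SLA clock between two naive local datetimes.

    work_days uses the CRD convention (0=Sunday ... 6=Saturday).
    """
    total = timedelta()
    day = start.date()
    while day <= end.date():
        crd_weekday = (day.weekday() + 1) % 7  # Python: Monday=0; CRD: Sunday=0
        if crd_weekday in work_days and day not in holidays:
            opens = datetime.combine(day, time(start_hour))
            closes = datetime.combine(day, time(end_hour))
            overlap_start = max(start, opens)
            overlap_end = min(end, closes)
            if overlap_end > overlap_start:
                total += overlap_end - overlap_start
        day += timedelta(days=1)
    return total.total_seconds() / 60


# Friday 17:45 to Monday 10:30, as in the walkthrough above.
detected = datetime(2026, 3, 20, 17, 45)
resolved = datetime(2026, 3, 23, 10, 30)
elapsed = business_minutes(detected, resolved)
print(elapsed)  # 105.0

# If that Monday were a holiday, only Friday's 15 minutes would count.
with_holiday = business_minutes(detected, resolved, holidays={date(2026, 3, 23)})
print(with_holiday)  # 15.0
```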

ViolationPolicySpec

| Field | Type | Description |
| --- | --- | --- |
| notificationPolicy | string | NotificationPolicy to send violation alerts |
| escalationPolicy | string | EscalationPolicy to escalate violations |
| autoEscalateOnBreach | bool | Automatically escalate when SLA is violated |

CompliancePercentage Calculation

CompliancePercentage = ((total_incidents - violations) / total_incidents) * 100

Example:
  Total incidents in the period: 120
  Violations: 3
  Compliance = ((120 - 3) / 120) * 100 = 97.5%
Compliance is calculated per severity and aggregated:
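| Severity | Incidents | Violations | Compliance |
| --- | --- | --- | --- |
| Critical | 5 | 1 | 80.0% |
| High | 15 | 2 | 86.7% |
| Medium | 40 | 0 | 100.0% |
| Low | 60 | 0 | 100.0% |
| Total | 120 | 3 | 97.5% |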
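
The per-severity and aggregate figures above can be reproduced with the compliance formula. This is a minimal sketch; the function name and the choice to report 100% when there are no incidents are assumptions.

```python
def compliance_percent(incidents: int, violations: int) -> float:
    """((incidents - violations) / incidents) * 100."""
    if incidents == 0:
        # Assumption: a period without incidents counts as fully compliant.
        return 100.0
    return 100.0 * (incidents - violations) / incidents


# (incidents, violations) per severity, taken from the table above.
per_severity = {
    "critical": (5, 1),
    "high": (15, 2),
    "medium": (40, 0),
    "low": (60, 0),
}

for severity, (incidents, violations) in per_severity.items():
    print(f"{severity}: {compliance_percent(incidents, violations):.1f}%")

total_incidents = sum(i for i, _ in per_severity.values())
total_violations = sum(v for _, v in per_severity.values())
overall = compliance_percent(total_incidents, total_violations)
print(f"total: {overall:.1f}%")
```

The aggregate is computed from the summed counts rather than by averaging the per-severity percentages, which is why a single critical violation moves the total far less than it moves the critical row.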

Complete YAML Examples

99.9% Availability SLO with Burn Rate Alerting

apiVersion: platform.chatcli.io/v1alpha1
kind: ServiceLevelObjective
metadata:
  name: api-gateway-availability-slo
  namespace: production
spec:
  service: api-gateway
  description: "99.9% API Gateway availability measured by HTTP success rate"

  indicator:
    type: availability
    prometheusQuery:
      goodQuery: |
        sum(rate(http_requests_total{
          service="api-gateway",
          code=~"2..|3.."
        }[5m]))
      totalQuery: |
        sum(rate(http_requests_total{
          service="api-gateway"
        }[5m]))

  target:
    percentage: 99.9
    window: 30d

  burnRateAlerts:
    - name: page-immediate
      shortWindow: 1h
      longWindow: 6h
      burnRateThreshold: 14.4
      severity: critical
      notificationPolicy: production-alerts

    - name: ticket-urgent
      shortWindow: 6h
      longWindow: 3d
      burnRateThreshold: 6.0
      severity: high
      notificationPolicy: production-alerts

    - name: ticket-normal
      shortWindow: 24h
      longWindow: 3d
      burnRateThreshold: 3.0
      severity: medium

    - name: monitor
      shortWindow: 72h
      longWindow: 30d
      burnRateThreshold: 1.0
      severity: low

  alertPolicy:
    multiWindowRequired: true
    pageOnBudgetExhausted: true
    budgetWarningThresholds: [50, 25, 10, 0]

SLA P1=5min Response / 1h Resolution (Business Hours)

apiVersion: platform.chatcli.io/v1alpha1
kind: IncidentSLA
metadata:
  name: api-gateway-production-sla
  namespace: production
spec:
  service: api-gateway
  description: "Production SLA per enterprise customer contract"

  responseTimes:
    - severity: critical
      maxResponseTime: "5m"
      maxResolutionTime: "1h"
    - severity: high
      maxResponseTime: "15m"
      maxResolutionTime: "4h"
    - severity: medium
      maxResponseTime: "2h"
      maxResolutionTime: "24h"
    - severity: low
      maxResponseTime: "8h"
      maxResolutionTime: "72h"

  businessHours:
    enabled: true
    timezone: "America/Sao_Paulo"
    startHour: 9
    endHour: 18
    workDays: [1, 2, 3, 4, 5]
    holidays:
      - date: "2026-01-01"
        name: "New Year's Day"
      - date: "2026-02-16"
        name: "Carnival"
      - date: "2026-02-17"
        name: "Carnival"
      - date: "2026-04-03"
        name: "Good Friday"
      - date: "2026-04-21"
        name: "Tiradentes Day"
      - date: "2026-05-01"
        name: "Labor Day"
      - date: "2026-09-07"
        name: "Independence Day"
      - date: "2026-10-12"
        name: "Our Lady of Aparecida"
      - date: "2026-11-02"
        name: "All Souls' Day"
      - date: "2026-11-15"
        name: "Republic Proclamation Day"
      - date: "2026-12-25"
        name: "Christmas"

  violationPolicy:
    notificationPolicy: sla-breach-notifications
    escalationPolicy: p0-escalation
    autoEscalateOnBreach: true
For critical severity, even with business hours enabled, consider creating a separate rule with businessHours.enabled: false. P1 issues typically require 24/7 response.

SLO with Custom PrometheusQuery (Latency Threshold)

apiVersion: platform.chatcli.io/v1alpha1
kind: ServiceLevelObjective
metadata:
  name: payment-service-latency-slo
  namespace: payments
spec:
  service: payment-service
  description: "99.5% of Payment Service requests must complete in under 500ms"

  indicator:
    type: latency
    prometheusQuery:
      goodQuery: |
        sum(rate(http_request_duration_seconds_bucket{
          service="payment-service",
          le="0.5"
        }[5m]))
      totalQuery: |
        sum(rate(http_request_duration_seconds_count{
          service="payment-service"
        }[5m]))

  target:
    percentage: 99.5
    window: 7d

  burnRateAlerts:
    - name: page-latency-spike
      shortWindow: 30m
      longWindow: 3h
      burnRateThreshold: 14.4
      severity: critical
      notificationPolicy: payments-alerts

    - name: ticket-latency-degradation
      shortWindow: 3h
      longWindow: 1d
      burnRateThreshold: 6.0
      severity: high

    - name: monitor-latency-trend
      shortWindow: 12h
      longWindow: 7d
      burnRateThreshold: 1.0
      severity: low

  alertPolicy:
    multiWindowRequired: true
    pageOnBudgetExhausted: true
    budgetWarningThresholds: [25, 10, 0]

Grafana Dashboards

The AIOps platform provides 4 pre-configured Grafana dashboards for SLO and SLA visualization:

SLO Overview

Unified panel with all SLOs, current values, remaining error budget, and burn rate. Includes a burn rate heatmap by service.

Error Budget Burn-Down

Error budget burn-down chart over time. Shows trends and exhaustion projections. Reference lines for each warning threshold.

SLA Compliance Report

Compliance report by severity and period. Table with each incident, response/resolution times, and compliance status. Exportable to PDF.

Incident Timeline

Incident timeline with detection, analysis, remediation, and resolution. Visual correlation with SLO burn rate and SLA clock.
Importing the dashboards:
# The dashboards are available as ConfigMaps
kubectl apply -f operator/config/grafana/dashboards/

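# Or import via the Grafana API. The endpoint expects a body shaped like
# {"dashboard": {...}, "overwrite": true}; wrap the files first if they
# contain bare dashboard models.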
for f in operator/config/grafana/dashboards/*.json; do
  curl -X POST -H "Content-Type: application/json" \
    -H "Authorization: Bearer $GRAFANA_API_KEY" \
    -d @"$f" \
    "https://grafana.company.com/api/dashboards/db"
done

Prometheus Metrics

The SLO and SLA system exposes detailed metrics:

SLO Metrics

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| chatcli_slo_current_value | Gauge | slo, service, namespace, indicator_type | Current SLI value (e.g., 99.92) |
| chatcli_slo_target_value | Gauge | slo, service | Configured target (e.g., 99.9) |
| chatcli_slo_error_budget_total | Gauge | slo, service | Total error budget (e.g., 0.001) |
| chatcli_slo_error_budget_remaining | Gauge | slo, service, namespace | Remaining error budget |
| chatcli_slo_error_budget_remaining_percent | Gauge | slo, service | Remaining budget percentage |
| chatcli_slo_burn_rate | Gauge | slo, service, window | Burn rate per window |
| chatcli_slo_alerts_fired_total | Counter | slo, service, severity, alert_name | Total burn rate alerts fired |
| chatcli_slo_condition | Gauge | slo, service, condition | Current state (1=active): Met, AtRisk, Violated |

SLA Metrics

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| chatcli_sla_violations_total | Counter | sla, service, severity, violation_type | Total SLA violations |
| chatcli_sla_compliance_percentage | Gauge | sla, service, severity | Current compliance percentage |
| chatcli_sla_response_time_seconds | Histogram | sla, service, severity | Response time distribution |
| chatcli_sla_resolution_time_seconds | Histogram | sla, service, severity | Resolution time distribution |
| chatcli_sla_active_incidents | Gauge | sla, service | Currently active incidents |
| chatcli_sla_business_hours_active | Gauge | sla, service | 1 if within business hours, 0 if outside |

Recommended Prometheus alerts:
groups:
  - name: chatcli-slo-sla
    rules:
      - alert: SLOBudgetExhausted
        expr: chatcli_slo_error_budget_remaining_percent <= 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Error budget exhausted for SLO {{ $labels.slo }}"
          description: "Service {{ $labels.service }} has exhausted its error budget. No additional downtime is allowed."

      - alert: SLOBudgetLow
        expr: chatcli_slo_error_budget_remaining_percent <= 10 and chatcli_slo_error_budget_remaining_percent > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Low error budget ({{ $value }}%) for SLO {{ $labels.slo }}"

      - alert: SLAComplianceBelow95
        expr: chatcli_sla_compliance_percentage < 95
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "SLA compliance below 95% for {{ $labels.service }}"
          description: "Current compliance: {{ $value }}%. Review recent incidents and take corrective action."

      - alert: SLAResponseTimeExceeded
        expr: histogram_quantile(0.95, rate(chatcli_sla_response_time_seconds_bucket[1h])) > 300
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "P95 SLA response time exceeds 5 minutes"

Next Steps

Notifications and Escalation

Multi-channel notification system and automatic escalation

Approval Workflow

Change control with approval policies and blast radius

AIOps Platform

Deep-dive into the AIOps architecture

K8s Operator

Operator configuration and CRDs