The ChatCLI AIOps platform includes three advanced subsystems that work together to optimize operations: Capacity Planner (resource exhaustion forecasting), Noise Reducer (intelligent alert suppression), and Cost Tracker (cost tracking and ROI calculation).

Capacity Planner

The Capacity Planner analyzes historical CPU and memory usage trends to predict when a cluster’s or namespace’s resources will be exhausted — enabling proactive action before incidents occur.

Linear Regression Algorithm

The Capacity Planner uses least-squares linear regression to project the resource exhaustion date.
Given a set of points (t_i, y_i) where:
  t_i = normalized timestamp (hours since first point)
  y_i = resource usage percentage (0-100)

Calculates:
  slope     = (n * sum(t*y) - sum(t) * sum(y)) / (n * sum(t^2) - sum(t)^2)
  intercept = (sum(y) - slope * sum(t)) / n

Exhaustion projection:
  If slope > 0:
    hours_until_full = (100 - current_usage) / slope
    exhaustion_date  = now + hours_until_full
The algorithm requires at least 3 data points to generate a reliable projection. With fewer than 3 points, the planner returns trend: insufficient_data.
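The regression and projection above can be sketched in Go; fitLine and hoursUntilFull are illustrative names, not the planner's actual API:

```go
package main

import "fmt"

// fitLine computes least-squares slope and intercept for points (t[i], y[i]),
// mirroring the formulas above.
func fitLine(t, y []float64) (slope, intercept float64) {
	n := float64(len(t))
	var sumT, sumY, sumTY, sumT2 float64
	for i := range t {
		sumT += t[i]
		sumY += y[i]
		sumTY += t[i] * y[i]
		sumT2 += t[i] * t[i]
	}
	slope = (n*sumTY - sumT*sumY) / (n*sumT2 - sumT*sumT)
	intercept = (sumY - slope*sumT) / n
	return
}

// hoursUntilFull projects hours until usage reaches 100%, or -1 for a
// flat or shrinking trend (no exhaustion projected).
func hoursUntilFull(currentUsage, slope float64) float64 {
	if slope <= 0 {
		return -1
	}
	return (100 - currentUsage) / slope
}

func main() {
	// Usage grows 0.5 %/hour: 60% -> 61% -> 62% over four hours.
	t := []float64{0, 2, 4}
	y := []float64{60, 61, 62}
	slope, intercept := fitLine(t, y)
	fmt.Printf("slope=%.2f %%/h intercept=%.1f hours_until_full=%.0f\n",
		slope, intercept, hoursUntilFull(y[len(y)-1], slope))
}
```

With three points growing at 0.5 %/hour from 62% usage, the projection is 76 hours until exhaustion.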

Data Structures

Represents a resource usage data point in time.
Field | Type | Description
Timestamp | time.Time | Collection timestamp
CPUPercent | float64 | CPU usage percentage (0-100)
MemoryPercent | float64 | Memory usage percentage (0-100)
CPUCores | float64 | Absolute CPU usage in cores
MemoryBytes | int64 | Absolute memory usage in bytes
Namespace | string | Source namespace
Resource | string | Resource name (deployment, node)
Linear regression result for a resource.
Field | Type | Description
Resource | string | Resource name
Namespace | string | Namespace
CPUSlope | float64 | CPU growth rate (%/hour)
MemorySlope | float64 | Memory growth rate (%/hour)
CPUIntercept | float64 | Regression intercept for CPU
MemoryIntercept | float64 | Regression intercept for memory
DataPoints | int | Number of points used in regression
R2Score | float64 | Coefficient of determination (fit quality)
Exhaustion projection with recommendations.
Field | Type | Description
Resource | string | Resource name
Namespace | string | Namespace
CPUExhaustionDate | *time.Time | Projected CPU exhaustion date (nil if stable)
MemoryExhaustionDate | *time.Time | Projected memory exhaustion date (nil if stable)
CPUCurrentPercent | float64 | Current CPU usage
MemoryCurrentPercent | float64 | Current memory usage
CPUGrowthRate | float64 | CPU growth rate (%/hour)
MemoryGrowthRate | float64 | Memory growth rate (%/hour)
Urgency | string | Classification: urgent, plan, stable
Recommendations | []string | List of recommendations
IsBottleneck | bool | Whether the resource is a bottleneck in active incidents

Correlation with Incidents

The ResourceIsBottleneck method checks if a resource is related to active incidents:
For each active Issue (state != Resolved && state != Escalated):
  If issue.resource.name == forecast.Resource
  AND issue.resource.namespace == forecast.Namespace
  Then: forecast.IsBottleneck = true
When IsBottleneck = true, the capacity recommendation is automatically prioritized and includes a reference to the active incident.

Recommendation Generation

The Capacity Planner generates recommendations based on projection urgency:
Condition | Urgency | Recommendation
Exhaustion in less than 7 days | urgent | "URGENT: {resource} in {namespace} projected to exhaust {type} on {date}. Immediate action required: scale horizontally or increase limits."
Exhaustion between 7 and 30 days | plan | "PLANNING: {resource} in {namespace} projected to exhaust {type} on {date}. Plan capacity increase in the coming weeks."
Exhaustion in more than 30 days, or negative/stable slope | stable | "Stable: {resource} in {namespace} shows no exhaustion trend within the 30-day horizon."

How to Use

planner := capacity.NewCapacityPlanner(k8sClient, logger)

// Collect historical data and analyze trends
forecasts, err := planner.AnalyzeResourceTrends(ctx, namespace)
if err != nil {
    logger.Error("capacity analysis failed", zap.Error(err))
    return
}

for _, forecast := range forecasts {
    if forecast.Urgency == "urgent" {
        // Create proactive alert or incident
        fields := []zap.Field{zap.String("resource", forecast.Resource)}
        // CPUExhaustionDate is nil when CPU is stable (urgency may be memory-driven)
        if forecast.CPUExhaustionDate != nil {
            fields = append(fields, zap.Time("cpu_exhaustion", *forecast.CPUExhaustionDate))
        }
        logger.Warn("resource at risk of exhaustion", fields...)
    }
}
The Capacity Planner collects data every reconciliation cycle (30 seconds) and stores history in a ConfigMap (chatcli-capacity-history). The regression is recalculated every 5 minutes.

Noise Reducer

The Noise Reducer implements four alert suppression strategies to reduce alert fatigue and improve the signal-to-noise ratio.

Strategy 1: Repetitive Suppression

Suppresses identical alerts that accumulate without any state change.
Suppression condition:
  - 5 or more identical alerts in the last 1 hour
  - No state change (severity, resource, signal_type are the same)
  - Identity hash: SHA256(signal_type + resource_name + namespace)

Result:
  - Alert suppressed (no new Anomaly CR generated)
  - Suppression counter incremented
  - Log: "repetitive alert suppressed (N occurrences in 1h)"
If the alert severity changes (e.g., high to critical), suppression is immediately disabled and the alert is processed normally.

Strategy 2: Seasonal Patterns

Identifies and suppresses alerts that occur at predictable times (e.g., cleanup jobs, scheduled deploys). SeasonalPattern struct:
Field | Type | Description
SignalType | string | Signal type (e.g., pod_restart)
Resource | string | Resource name
Namespace | string | Namespace
DayOfWeek | time.Weekday | Day of the week (0=Sunday)
HourOfDay | int | Hour of the day (0-23)
MinuteWindow | int | Tolerance window in minutes
Occurrences | int | Number of confirmed occurrences
Confidence | float64 | Pattern confidence (0-1)
LastSeen | time.Time | Last occurrence
Detection Algorithm:
For each alert received:
  1. Search for existing seasonal patterns for (signal_type, resource, namespace)
  2. For each pattern found:
     - Check if day_of_week == pattern.DayOfWeek
     - Check if the current time falls within the tolerance window (MinuteWindow, default 30 minutes) around pattern.HourOfDay
     - If match: increment Occurrences, update LastSeen
     - If Occurrences >= 3 AND Confidence >= 0.7: SUPPRESS the alert
  3. If no existing pattern matches:
     - Create new candidate SeasonalPattern (Occurrences=1, Confidence=0.3)
How patterns are learned: Pattern confidence grows with each confirmation:
Confidence = min(1.0, 0.3 + (Occurrences - 1) * 0.15)

Occurrences | Confidence | Suppress?
     1      |    0.30    |   No
     2      |    0.45    |   No
     3      |    0.60    |   No
     4      |    0.75    |   Yes (>= 0.7 and >= 3)
     5      |    0.90    |   Yes
     6+     |    1.00    |   Yes
Practical example:
  • ConfigMap update job runs every Monday at 03:00
  • Generates pod_restart alert in the jobs namespace
  • After 4 weeks: pattern identified (Monday, 03:00, confidence 0.75)
  • From the 5th week: alert automatically suppressed
Patterns are stored in the ConfigMap chatcli-seasonal-patterns.
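A minimal sketch of the confidence curve and the suppression decision, using the formula and thresholds given above (names are illustrative):

```go
package main

import "fmt"

// confidence implements the learning curve above: starts at 0.30 and
// gains 0.15 per confirmed occurrence, capped at 1.0.
func confidence(occurrences int) float64 {
	c := 0.3 + float64(occurrences-1)*0.15
	if c > 1.0 {
		return 1.0
	}
	return c
}

// shouldSuppressSeasonal applies the seasonal rule:
// Occurrences >= 3 AND Confidence >= 0.7.
func shouldSuppressSeasonal(occurrences int) bool {
	return occurrences >= 3 && confidence(occurrences) >= 0.7
}

func main() {
	for occ := 1; occ <= 6; occ++ {
		fmt.Printf("occ=%d conf=%.2f suppress=%v\n",
			occ, confidence(occ), shouldSuppressSeasonal(occ))
	}
}
```

Suppression kicks in at the fourth occurrence (confidence 0.75), matching the table above.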

Strategy 3: Flap Detection

Detects resources that oscillate between states (resolved -> detected -> resolved) repeatedly.
Flap condition:
  - 3 or more resolved -> detected cycles in the same 24-hour window
  - Same resource (resource_name + namespace)

Actions:
  1. Mark resource as "flapping"
  2. Suppress new alerts from the resource for 2 hours
  3. Generate consolidated alert: "Resource {name} is flapping -- {N} cycles in 24h"
  4. Recommend root cause investigation

Exiting flap:
  - If no new cycle in 2 hours: remove flap flag
  - New alerts resume normal processing

Strategy 4: Alert Fatigue Scoring

Calculates an alert fatigue score (0-100) to determine if alert volume is excessive.
fatigue_score = volume_score + resolve_rate_score + recency_score

Where:
  volume_score (0-40):
    - alerts_per_hour = total_alerts_last_6h / 6
    - volume_score = min(40, alerts_per_hour * 4)

  resolve_rate_score (0-30):
    - auto_resolve_rate = auto_resolved / total * 100
    - If auto_resolve_rate > 90%: resolve_rate_score = 30
    - If auto_resolve_rate > 70%: resolve_rate_score = 20
    - If auto_resolve_rate > 50%: resolve_rate_score = 10
    - Else: resolve_rate_score = 0

  recency_score (0-30):
    - If > 5 alerts in last 10 minutes: recency_score = 30
    - If > 3 alerts in last 10 minutes: recency_score = 20
    - If > 1 alert in last 10 minutes: recency_score = 10
    - Else: recency_score = 0
Classification:
Score | Classification | Action
0-25 | low | No action. Healthy volume.
26-50 | moderate | Warning log. Monitor trend.
51-75 | high | Activate aggressive suppression. Consolidate similar alerts.
76-100 | critical | Suppress everything except critical severity. Generate meta-alert for SRE.
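The three score components can be combined as below; fatigueScore and its signature are illustrative, with the thresholds taken from the definitions above:

```go
package main

import (
	"fmt"
	"math"
)

// fatigueScore = volume_score + resolve_rate_score + recency_score (0-100).
func fatigueScore(alertsLast6h, autoResolved, total, alertsLast10m int) float64 {
	// volume_score (0-40): 4 points per alert/hour, capped
	volume := math.Min(40, float64(alertsLast6h)/6*4)

	// resolve_rate_score (0-30): high auto-resolve rates suggest noise
	var resolve float64
	if total > 0 {
		rate := float64(autoResolved) / float64(total) * 100
		switch {
		case rate > 90:
			resolve = 30
		case rate > 70:
			resolve = 20
		case rate > 50:
			resolve = 10
		}
	}

	// recency_score (0-30): bursts in the last 10 minutes
	var recency float64
	switch {
	case alertsLast10m > 5:
		recency = 30
	case alertsLast10m > 3:
		recency = 20
	case alertsLast10m > 1:
		recency = 10
	}

	return volume + resolve + recency
}

func main() {
	// 48 alerts in 6h (8/h), 95% auto-resolved, 6 alerts in the last 10 min:
	// 32 + 30 + 30 = 92 -> critical
	fmt.Println(fatigueScore(48, 19, 20, 6))
}
```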

Cost Tracker

The Cost Tracker records operational costs (LLM usage plus downtime) and calculates the ROI of AIOps automation.

LLM Costs per Provider

The cost of each LLM call is calculated based on tokens consumed and configured prices per provider:
Provider | Input Cost (per 1M tokens) | Output Cost (per 1M tokens) | Reference Model
claude | $3.00 | $15.00 | Claude Sonnet
gpt-4 | $10.00 | $30.00 | GPT-4 Turbo
default | $1.00 | $3.00 | Any other

Cost Configuration

Prices are configurable via ConfigMap chatcli-cost-config:
apiVersion: v1
kind: ConfigMap
metadata:
  name: chatcli-cost-config
  namespace: chatcli-system
data:
  costs.json: |
    {
      "providers": {
        "claude": {"input_per_million": 3.00, "output_per_million": 15.00},
        "gpt-4": {"input_per_million": 10.00, "output_per_million": 30.00},
        "gpt-4o": {"input_per_million": 2.50, "output_per_million": 10.00},
        "gemini": {"input_per_million": 0.50, "output_per_million": 1.50},
        "default": {"input_per_million": 1.00, "output_per_million": 3.00}
      },
      "downtime_cost_per_minute": 10.00,
      "engineer_hourly_rate": 75.00
    }
If the ConfigMap does not exist, default values are used. ConfigMap updates are reflected in real time (watch on ConfigMap).

IncidentCost

Total cost of an incident, decomposed into components:
Field | Type | Description
IncidentName | string | Incident name
Namespace | string | Namespace
LLMCost | CostBreakdown | Cost of LLM calls
DowntimeCost | float64 | Downtime cost (minutes * cost/minute)
TotalCost | float64 | LLMCost.Total + DowntimeCost
Provider | string | LLM provider used
Model | string | LLM model used
Duration | time.Duration | Total incident duration
AutoRemediated | bool | Whether it was resolved automatically
CostBreakdown:
Field | Type | Description
InputTokens | int64 | Total input tokens
OutputTokens | int64 | Total output tokens
InputCost | float64 | Input token cost
OutputCost | float64 | Output token cost
Total | float64 | InputCost + OutputCost
Calls | int | Number of LLM calls
Calculation example:
Incident: CrashLoopBackOff on api-server
  - Provider: claude
  - LLM calls: 2 (AnalyzeIssue + 1 AgenticStep)
  - Input tokens: 8,500 | Output tokens: 2,200
  - InputCost:  8,500 / 1,000,000 * $3.00  = $0.0255
  - OutputCost: 2,200 / 1,000,000 * $15.00 = $0.0330
  - Total LLMCost: $0.0585

  - Downtime: 3 minutes
  - DowntimeCost: 3 * $10.00 = $30.00

  - TotalCost: $0.0585 + $30.00 ≈ $30.06
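The worked example above can be reproduced with a short snippet; llmCost is an illustrative helper, not the tracker's actual API:

```go
package main

import "fmt"

// llmCost applies per-million-token pricing to input and output tokens,
// as in the table of provider prices above.
func llmCost(inputTokens, outputTokens int64, inPerM, outPerM float64) (inCost, outCost, total float64) {
	inCost = float64(inputTokens) / 1_000_000 * inPerM
	outCost = float64(outputTokens) / 1_000_000 * outPerM
	return inCost, outCost, inCost + outCost
}

func main() {
	// claude pricing: $3.00/M input, $15.00/M output
	inC, outC, llm := llmCost(8500, 2200, 3.00, 15.00)
	downtime := 3 * 10.00 // 3 minutes at $10/minute = $30.00
	fmt.Printf("input=$%.4f output=$%.4f llm=$%.4f total=$%.2f\n",
		inC, outC, llm, llm+downtime)
}
```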

CostSummary

Cost aggregation for a period:
Field | Type | Description
Period | string | Aggregation period (e.g., 30d)
TotalLLMCost | float64 | Sum of all LLM costs
TotalDowntimeCost | float64 | Sum of all downtime costs
TotalCost | float64 | TotalLLMCost + TotalDowntimeCost
IncidentCount | int | Total incidents in the period
AutoRemediatedCount | int | Incidents resolved automatically
AvgCostPerIncident | float64 | TotalCost / IncidentCount
TopCostlyIncidents | []IncidentCost | Top 5 most expensive incidents
CostByProvider | map[string]float64 | Aggregated cost by LLM provider
CostByNamespace | map[string]float64 | Aggregated cost by namespace
ROI | ROICalculation | Return on investment calculation

ROI Calculation

ROI is calculated by comparing automation cost with the estimated cost of manual resolution:
Variables:
  autoRemediated    = number of automatically resolved incidents
  avgManualHours    = 2h (estimated manual resolution time per incident)
  engineerRate      = $75/hour (configurable via ConfigMap)
  avgDowntimeMinutes = average downtime avoided per auto-remediated incident
  downtimePrevented  = autoRemediated * avgDowntimeMinutes
  downtimeCostRate  = $10/minute (configurable via ConfigMap)

Calculation:
  engineerHoursSaved = autoRemediated * avgManualHours
  laborSavings       = engineerHoursSaved * engineerRate
  downtimeSavings    = downtimePrevented * downtimeCostRate
  totalSavings       = laborSavings + downtimeSavings
  totalLLMCost       = sum of all LLM costs in the period

  ROI% = ((totalSavings - totalLLMCost) / totalLLMCost) * 100
ROICalculation struct:
Field | Type | Description
EngineerHoursSaved | float64 | Engineer hours saved
LaborSavings | float64 | Labor savings ($)
DowntimePrevented | float64 | Minutes of downtime prevented
DowntimeSavings | float64 | Downtime savings ($)
TotalSavings | float64 | Total savings ($)
TotalLLMCost | float64 | Total LLM cost ($)
ROIPercent | float64 | Return on investment (%)
Monthly ROI example:
Monthly data:
  - 312 automatically resolved incidents
  - Total LLM cost: $18.72

Calculation:
  engineerHoursSaved = 312 * 2h = 624 hours
  laborSavings       = 624 * $75 = $46,800
  downtimePrevented  = 312 * 4.5min = 1,404 minutes
  downtimeSavings    = 1,404 * $10 = $14,040
  totalSavings       = $46,800 + $14,040 = $60,840
  totalLLMCost       = $18.72

  ROI% = ($60,840 - $18.72) / $18.72 * 100 ≈ 324,900%
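The arithmetic above can be checked with a short snippet (roiPercent is an illustrative helper):

```go
package main

import "fmt"

// roiPercent implements the ROI formula from the previous section.
func roiPercent(totalSavings, totalLLMCost float64) float64 {
	return (totalSavings - totalLLMCost) / totalLLMCost * 100
}

func main() {
	autoRemediated := 312.0
	laborSavings := autoRemediated * 2 * 75      // 624 engineer-hours * $75/h = $46,800
	downtimeSavings := autoRemediated * 4.5 * 10 // 1,404 min * $10/min = $14,040
	totalSavings := laborSavings + downtimeSavings
	fmt.Printf("ROI = %.0f%%\n", roiPercent(totalSavings, 18.72))
}
```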
ROI typically exceeds 100,000% because the cost of LLM calls ($0.03-0.10 per incident) is orders of magnitude lower than the cost of manual resolution (2h of engineer time + downtime).

Storage Architecture (ConfigMaps)

All Capacity Planner, Noise Reducer, and Cost Tracker data is persisted in ConfigMaps in the operator namespace:
ConfigMap | Data | Retention
chatcli-capacity-history | Historical CPU/memory usage per resource. Compact time series in JSON. | 7 days (rolling window)
chatcli-pattern-store | Noise Reducer dedup hashes, suppression counters, flap flags. | 24 hours (automatic pruning)
chatcli-seasonal-patterns | Learned seasonal patterns (SeasonalPattern structs in JSON). | Indefinite (patterns with Confidence < 0.3 and LastSeen > 30 days are removed)
chatcli-cost-ledger | Cost records per incident (IncidentCost in JSON). Used for aggregation. | 90 days (rolling window)
chatcli-cost-config | Per-provider pricing configuration, downtime cost, engineer rate. | Indefinite (user-managed)
ConfigMaps are limited to 1 MiB in Kubernetes. For clusters with high incident volume (>1,000 incidents/month), the Cost Tracker automatically compacts old records, keeping only daily aggregates for data older than 30 days.

Storage Format

apiVersion: v1
kind: ConfigMap
metadata:
  name: chatcli-cost-ledger
  namespace: chatcli-system
  labels:
    app.kubernetes.io/component: cost-tracker
    platform.chatcli.io/managed-by: operator
data:
  ledger.json: |
    {
      "version": 2,
      "lastCompaction": "2026-03-19T00:00:00Z",
      "entries": [
        {
          "incident": "issue-crashloop-api-server-production",
          "namespace": "production",
          "timestamp": "2026-03-19T14:13:00Z",
          "llmCost": 0.0585,
          "downtimeCost": 30.00,
          "totalCost": 30.06,
          "provider": "claude",
          "autoRemediated": true,
          "durationSeconds": 180
        }
      ],
      "dailyAggregates": [
        {
          "date": "2026-03-18",
          "totalLLMCost": 0.62,
          "totalDowntimeCost": 350.00,
          "incidentCount": 12,
          "autoRemediatedCount": 11
        }
      ]
    }

Integrations

REST API

Endpoints /api/v1/analytics/remediation-stats and /api/v1/analytics/summary expose cost and capacity data.

Web Dashboard

The Overview view displays ROI metrics and capacity projections in real time.

Grafana

The remediation-stats.json dashboard includes cost and ROI panels.

AIOps Platform

Complete AIOps pipeline architecture and how these subsystems integrate.