Capacity Planner
The Capacity Planner analyzes historical CPU and memory usage trends to predict when a cluster’s or namespace’s resources will be exhausted, enabling proactive action before incidents occur.
Linear Regression Algorithm
The Capacity Planner uses least-squares linear regression to project the resource exhaustion date. The algorithm requires at least 3 data points to generate a reliable projection. With fewer than 3 points, the planner returns trend: insufficient_data.
Data Structures
ResourceUsage
Represents a resource usage data point in time.
| Field | Type | Description |
|---|---|---|
| Timestamp | time.Time | Collection timestamp |
| CPUPercent | float64 | CPU usage percentage (0-100) |
| MemoryPercent | float64 | Memory usage percentage (0-100) |
| CPUCores | float64 | Absolute usage in cores |
| MemoryBytes | int64 | Absolute usage in bytes |
| Namespace | string | Source namespace |
| Resource | string | Resource name (deployment, node) |
ResourceTrend
Linear regression result for a resource.
| Field | Type | Description |
|---|---|---|
| Resource | string | Resource name |
| Namespace | string | Namespace |
| CPUSlope | float64 | CPU growth rate (%/hour) |
| MemorySlope | float64 | Memory growth rate (%/hour) |
| CPUIntercept | float64 | Regression intercept for CPU |
| MemoryIntercept | float64 | Regression intercept for memory |
| DataPoints | int | Number of points used in regression |
| R2Score | float64 | Coefficient of determination (fit quality) |
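The slope, intercept, and R2Score fields can be derived from an ordinary least-squares fit. The following is a minimal sketch under assumptions, not the planner's actual code: the `Fit` type, the `LinearFit` function, and the choice of hours since the first sample as the x-axis are all hypothetical. The 3-point minimum and the `insufficient_data` result come from the description above.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrInsufficientData mirrors the insufficient_data trend described above.
var ErrInsufficientData = errors.New("insufficient_data")

// Fit is a hypothetical holder for one fitted line y = Slope*x + Intercept.
type Fit struct {
	Slope      float64 // growth rate, %/hour when x is in hours
	Intercept  float64
	R2         float64 // coefficient of determination
	DataPoints int
}

// LinearFit runs ordinary least squares on paired samples.
// hours holds elapsed time since the first sample; percent holds usage.
func LinearFit(hours, percent []float64) (Fit, error) {
	n := len(hours)
	if n != len(percent) {
		return Fit{}, errors.New("hours and percent must have equal length")
	}
	if n < 3 {
		return Fit{}, ErrInsufficientData
	}
	var sumX, sumY, sumXY, sumXX float64
	for i := range hours {
		sumX += hours[i]
		sumY += percent[i]
		sumXY += hours[i] * percent[i]
		sumXX += hours[i] * hours[i]
	}
	fn := float64(n)
	denom := fn*sumXX - sumX*sumX
	if denom == 0 {
		// All samples share one timestamp, so no slope can be estimated.
		return Fit{}, ErrInsufficientData
	}
	slope := (fn*sumXY - sumX*sumY) / denom
	intercept := (sumY - slope*sumX) / fn

	// R2 = 1 - SSres/SStot; left at 0 when usage is constant (SStot == 0).
	mean := sumY / fn
	var ssRes, ssTot float64
	for i := range hours {
		pred := slope*hours[i] + intercept
		ssRes += (percent[i] - pred) * (percent[i] - pred)
		ssTot += (percent[i] - mean) * (percent[i] - mean)
	}
	r2 := 0.0
	if ssTot > 0 {
		r2 = 1 - ssRes/ssTot
	}
	return Fit{Slope: slope, Intercept: intercept, R2: r2, DataPoints: n}, nil
}

func main() {
	fit, err := LinearFit([]float64{0, 1, 2, 3}, []float64{10, 12, 14, 16})
	if err != nil {
		fmt.Println("trend:", err)
		return
	}
	// Prints: slope=2.00 intercept=10.00 r2=1.00 n=4
	fmt.Printf("slope=%.2f intercept=%.2f r2=%.2f n=%d\n",
		fit.Slope, fit.Intercept, fit.R2, fit.DataPoints)
}
```

Running the same fit once for CPU and once for memory would fill the CPU* and Memory* fields respectively.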
ForecastResult
Exhaustion projection with recommendations.
| Field | Type | Description |
|---|---|---|
| Resource | string | Resource name |
| Namespace | string | Namespace |
| CPUExhaustionDate | *time.Time | Projected CPU exhaustion date (nil if stable) |
| MemoryExhaustionDate | *time.Time | Projected memory exhaustion date (nil if stable) |
| CPUCurrentPercent | float64 | Current CPU usage |
| MemoryCurrentPercent | float64 | Current memory usage |
| CPUGrowthRate | float64 | CPU growth rate (%/hour) |
| MemoryGrowthRate | float64 | Memory growth rate (%/hour) |
| Urgency | string | Classification: urgent, plan, stable |
| Recommendations | []string | List of recommendations |
| IsBottleneck | bool | Whether the resource is a bottleneck in active incidents |
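An exhaustion date can be obtained by extrapolating the current usage along the growth rate. The sketch below assumes that "exhausted" means reaching 100% usage and that very slow growth beyond a cut-off horizon counts as stable. Both thresholds and the `TimeToExhaustion` name are illustrative assumptions, not documented behavior.

```go
package main

import (
	"fmt"
	"time"
)

const (
	// exhaustionThreshold is an assumed definition of "exhausted": 100% usage.
	exhaustionThreshold = 100.0
	// maxHorizonHours is an assumed cut-off (about ten years); slower growth
	// is treated as stable. It also keeps the time.Duration from overflowing.
	maxHorizonHours = 10 * 365 * 24
)

// TimeToExhaustion extrapolates current usage along the growth rate (%/hour).
// It returns false for a zero or negative slope, which corresponds to a nil
// exhaustion date in ForecastResult.
func TimeToExhaustion(currentPercent, slopePerHour float64) (time.Duration, bool) {
	if slopePerHour <= 0 {
		return 0, false
	}
	if currentPercent >= exhaustionThreshold {
		return 0, true // already at or above the threshold
	}
	hours := (exhaustionThreshold - currentPercent) / slopePerHour
	if hours > maxHorizonHours {
		return 0, false
	}
	return time.Duration(hours * float64(time.Hour)), true
}

func main() {
	now := time.Date(2025, time.January, 6, 12, 0, 0, 0, time.UTC)
	// 80% used, growing 0.5%/hour: 20 points of headroom last 40 hours.
	if d, ok := TimeToExhaustion(80, 0.5); ok {
		date := now.Add(d)
		fmt.Println(date.Format(time.RFC3339)) // 2025-01-08T04:00:00Z
	}
}
```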
Correlation with Incidents
The ResourceIsBottleneck method checks whether a resource is related to active incidents.
When IsBottleneck = true, the capacity recommendation is automatically prioritized and includes a reference to the active incident.
Recommendation Generation
The Capacity Planner generates recommendations based on projection urgency:
| Condition | Urgency | Recommendation |
|---|---|---|
| Exhaustion in less than 7 days | urgent | "URGENT: {resource} in {namespace} projected to exhaust {type} on {date}. Immediate action required: scale horizontally or increase limits." |
| Exhaustion between 7 and 30 days | plan | "PLANNING: {resource} in {namespace} projected to exhaust {type} on {date}. Plan capacity increase in the coming weeks." |
| Exhaustion in more than 30 days or negative/stable slope | stable | "Stable: {resource} in {namespace} shows no exhaustion trend within the 30-day horizon." |
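The table above maps directly onto a small classifier. This is a sketch only: the `Urgency` and `Recommendation` function names are hypothetical, and treating exactly 7 and exactly 30 days as "plan" is one reading of the boundaries. The message templates are copied from the table.

```go
package main

import "fmt"

const (
	urgentHorizonHours = 7 * 24  // below this: urgent
	planHorizonHours   = 30 * 24 // above this: stable
)

// Urgency classifies a projection. exhausts is false when the slope is
// negative or stable, i.e. when no exhaustion date exists.
func Urgency(hoursToExhaustion float64, exhausts bool) string {
	switch {
	case !exhausts || hoursToExhaustion > planHorizonHours:
		return "stable"
	case hoursToExhaustion < urgentHorizonHours:
		return "urgent"
	default:
		return "plan"
	}
}

// Recommendation fills the message template for the given urgency.
// kind is the exhausted resource type, for example "CPU" or "memory".
func Recommendation(urgency, resource, namespace, kind, date string) string {
	switch urgency {
	case "urgent":
		return fmt.Sprintf("URGENT: %s in %s projected to exhaust %s on %s. "+
			"Immediate action required: scale horizontally or increase limits.",
			resource, namespace, kind, date)
	case "plan":
		return fmt.Sprintf("PLANNING: %s in %s projected to exhaust %s on %s. "+
			"Plan capacity increase in the coming weeks.",
			resource, namespace, kind, date)
	default:
		return fmt.Sprintf("Stable: %s in %s shows no exhaustion trend within the 30-day horizon.",
			resource, namespace)
	}
}

func main() {
	// "checkout-api" and "prod" are made-up names for illustration.
	u := Urgency(40, true) // 40 hours away, so under 7 days
	fmt.Println(u)
	fmt.Println(Recommendation(u, "checkout-api", "prod", "CPU", "2025-01-08"))
}
```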
How to Use
The Capacity Planner collects data every reconciliation cycle (30 seconds) and stores history in a ConfigMap (chatcli-capacity-history). The regression is recalculated every 5 minutes.
Noise Reducer
The Noise Reducer implements four alert suppression strategies to reduce alert fatigue and improve the signal-to-noise ratio.
Strategy 1: Repetitive Suppression
Suppresses identical alerts when there is accumulation without state change.
Strategy 2: Seasonal Patterns
Identifies and suppresses alerts that occur at predictable times (e.g., cleanup jobs, scheduled deploys).
SeasonalPattern struct:
| Field | Type | Description |
|---|---|---|
| SignalType | string | Signal type (e.g., pod_restart) |
| Resource | string | Resource name |
| Namespace | string | Namespace |
| DayOfWeek | time.Weekday | Day of the week (0=Sunday) |
| HourOfDay | int | Hour of the day (0-23) |
| MinuteWindow | int | Tolerance window in minutes |
| Occurrences | int | Number of confirmed occurrences |
| Confidence | float64 | Pattern confidence (0-1) |
| LastSeen | time.Time | Last occurrence |
Example:
- A ConfigMap update job runs every Monday at 03:00
- It generates a pod_restart alert in the jobs namespace
- After 4 weeks: pattern identified (Monday, 03:00, confidence 0.75)
- From the 5th week: alert automatically suppressed
Patterns are persisted in the ConfigMap chatcli-seasonal-patterns.
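Matching an incoming alert against a learned pattern could look like the sketch below. The struct follows the SeasonalPattern table, but several points are assumptions: MinuteWindow is read as a symmetric tolerance around HourOfDay:00, wrap-around across midnight is ignored, and the `minConfidence` threshold, the method names, and the `configmap-updater` resource name are hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

// SeasonalPattern follows the field table above.
type SeasonalPattern struct {
	SignalType   string
	Resource     string
	Namespace    string
	DayOfWeek    time.Weekday
	HourOfDay    int
	MinuteWindow int
	Occurrences  int
	Confidence   float64
	LastSeen     time.Time
}

// MatchesAt reports whether the given weekday and time of day fall within
// MinuteWindow minutes of HourOfDay:00 on the pattern's weekday.
// Assumption: symmetric window, no wrap-around across midnight.
func (p SeasonalPattern) MatchesAt(day time.Weekday, hour, minute int) bool {
	if day != p.DayOfWeek {
		return false
	}
	delta := (hour*60 + minute) - p.HourOfDay*60
	if delta < 0 {
		delta = -delta
	}
	return delta <= p.MinuteWindow
}

// ShouldSuppress applies the pattern to one alert. minConfidence is a
// hypothetical threshold below which the pattern is not trusted.
func (p SeasonalPattern) ShouldSuppress(signalType, resource, namespace string,
	t time.Time, minConfidence float64) bool {
	if p.SignalType != signalType || p.Resource != resource || p.Namespace != namespace {
		return false
	}
	return p.Confidence >= minConfidence && p.MatchesAt(t.Weekday(), t.Hour(), t.Minute())
}

func main() {
	p := SeasonalPattern{
		SignalType: "pod_restart", Resource: "configmap-updater", Namespace: "jobs",
		DayOfWeek: time.Monday, HourOfDay: 3, MinuteWindow: 15,
		Occurrences: 4, Confidence: 0.75,
	}
	alertTime := time.Date(2025, time.January, 6, 3, 5, 0, 0, time.UTC) // a Monday
	fmt.Println(p.ShouldSuppress("pod_restart", "configmap-updater", "jobs", alertTime, 0.7)) // true
}
```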
Strategy 3: Flap Detection
Detects resources that oscillate between states (resolved -> detected -> resolved) repeatedly.
Strategy 4: Alert Fatigue Scoring
Calculates an alert fatigue score (0-100) to determine if alert volume is excessive.
| Score | Classification | Action |
|---|---|---|
| 0-25 | low | No action. Healthy volume. |
| 26-50 | moderate | Warning log. Monitor trend. |
| 51-75 | high | Activate aggressive suppression. Consolidate similar alerts. |
| 76-100 | critical | Suppress everything except critical severity. Generate meta-alert for SRE. |
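The score bands translate into a simple lookup. The sketch below assumes an integer score and clamps out-of-range values into the nearest band. `FatigueLevel` and `ShouldDeliver` are hypothetical names, and `ShouldDeliver` models only the critical row (suppress everything except critical severity), not the consolidation behavior of the high band.

```go
package main

import "fmt"

// FatigueLevel maps a 0-100 fatigue score to its classification.
// Scores outside the range fall into the nearest band (an assumption).
func FatigueLevel(score int) string {
	switch {
	case score <= 25:
		return "low"
	case score <= 50:
		return "moderate"
	case score <= 75:
		return "high"
	default:
		return "critical"
	}
}

// ShouldDeliver models the critical band only: once fatigue is critical,
// alerts pass through only if their own severity is critical.
func ShouldDeliver(score int, severity string) bool {
	return FatigueLevel(score) != "critical" || severity == "critical"
}

func main() {
	for _, s := range []int{10, 40, 60, 90} {
		fmt.Println(s, FatigueLevel(s))
	}
	fmt.Println(ShouldDeliver(90, "warning"))  // false
	fmt.Println(ShouldDeliver(90, "critical")) // true
}
```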
Cost Tracker
The Cost Tracker tracks operational costs (LLM + downtime) and calculates AIOps automation ROI.
LLM Costs per Provider
The cost of each LLM call is calculated based on tokens consumed and configured prices per provider:
| Provider | Input Cost (per 1M tokens) | Output Cost (per 1M tokens) | Reference Model |
|---|---|---|---|
| claude | $3.00 | $15.00 | Claude Sonnet |
| gpt-4 | $10.00 | $30.00 | GPT-4 Turbo |
| default | $1.00 | $3.00 | Any other |
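As a worked example of the per-call arithmetic, the sketch below prices one call from its token counts using the default prices in the table. The `Price` type, the `defaultPrices` map, and the `CallCost` function are hypothetical names. The result uses the CostBreakdown fields referenced by IncidentCost.

```go
package main

import "fmt"

// Price holds per-provider prices in dollars per 1M tokens.
type Price struct {
	InputPerMillion  float64
	OutputPerMillion float64
}

// defaultPrices mirrors the default values in the table above.
var defaultPrices = map[string]Price{
	"claude":  {InputPerMillion: 3.00, OutputPerMillion: 15.00},
	"gpt-4":   {InputPerMillion: 10.00, OutputPerMillion: 30.00},
	"default": {InputPerMillion: 1.00, OutputPerMillion: 3.00},
}

// CostBreakdown uses the field names from the documentation.
type CostBreakdown struct {
	InputTokens  int64
	OutputTokens int64
	InputCost    float64
	OutputCost   float64
	Total        float64
	Calls        int
}

// CallCost prices a single LLM call. Unknown providers fall back to the
// "default" row.
func CallCost(provider string, inputTokens, outputTokens int64) CostBreakdown {
	price, ok := defaultPrices[provider]
	if !ok {
		price = defaultPrices["default"]
	}
	in := float64(inputTokens) / 1e6 * price.InputPerMillion
	out := float64(outputTokens) / 1e6 * price.OutputPerMillion
	return CostBreakdown{
		InputTokens:  inputTokens,
		OutputTokens: outputTokens,
		InputCost:    in,
		OutputCost:   out,
		Total:        in + out,
		Calls:        1,
	}
}

func main() {
	// 10,000 input tokens at $3/1M plus 2,000 output tokens at $15/1M.
	c := CallCost("claude", 10000, 2000)
	fmt.Printf("$%.2f\n", c.Total) // $0.06
}
```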
Cost Configuration
Prices are configurable via the ConfigMap chatcli-cost-config.
If the ConfigMap does not exist, default values are used. ConfigMap updates are picked up in real time through a watch on the ConfigMap.
IncidentCost
Total cost of an incident, decomposed into components:
| Field | Type | Description |
|---|---|---|
| IncidentName | string | Incident name |
| Namespace | string | Namespace |
| LLMCost | CostBreakdown | Cost of LLM calls |
| DowntimeCost | float64 | Downtime cost (minutes * cost/minute) |
| TotalCost | float64 | LLMCost.Total + DowntimeCost |
| Provider | string | LLM provider used |
| Model | string | LLM model used |
| Duration | time.Duration | Total incident duration |
| AutoRemediated | bool | Whether it was resolved automatically |
CostBreakdown
| Field | Type | Description |
|---|---|---|
| InputTokens | int64 | Total input tokens |
| OutputTokens | int64 | Total output tokens |
| InputCost | float64 | Input token cost |
| OutputCost | float64 | Output token cost |
| Total | float64 | InputCost + OutputCost |
| Calls | int | Number of LLM calls |
CostSummary
Cost aggregation for a period:
| Field | Type | Description |
|---|---|---|
| Period | string | Aggregation period (e.g., 30d) |
| TotalLLMCost | float64 | Sum of all LLM costs |
| TotalDowntimeCost | float64 | Sum of all downtime costs |
| TotalCost | float64 | TotalLLMCost + TotalDowntimeCost |
| IncidentCount | int | Total incidents in the period |
| AutoRemediatedCount | int | Incidents resolved automatically |
| AvgCostPerIncident | float64 | TotalCost / IncidentCount |
| TopCostlyIncidents | []IncidentCost | Top 5 most expensive incidents |
| CostByProvider | map[string]float64 | Aggregated cost by LLM provider |
| CostByNamespace | map[string]float64 | Aggregated cost by namespace |
| ROI | ROICalculation | Return on investment calculation |
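The aggregation can be sketched as a single pass over the incident records. The structs below are trimmed versions of the documented types: LLMCost is flattened to a float64 standing in for LLMCost.Total, and Period and ROI are omitted. Two choices are assumptions: CostByProvider sums LLM cost only, while CostByNamespace sums total cost. `Summarize` is a hypothetical name.

```go
package main

import (
	"fmt"
	"sort"
)

// IncidentCost is a trimmed version of the documented struct.
type IncidentCost struct {
	IncidentName   string
	Namespace      string
	Provider       string
	LLMCost        float64 // stands in for LLMCost.Total
	DowntimeCost   float64
	AutoRemediated bool
}

// TotalCost is LLM cost plus downtime cost, as in the IncidentCost table.
func (c IncidentCost) TotalCost() float64 { return c.LLMCost + c.DowntimeCost }

// CostSummary is a trimmed version of the documented struct.
type CostSummary struct {
	TotalLLMCost        float64
	TotalDowntimeCost   float64
	TotalCost           float64
	IncidentCount       int
	AutoRemediatedCount int
	AvgCostPerIncident  float64
	TopCostlyIncidents  []IncidentCost
	CostByProvider      map[string]float64
	CostByNamespace     map[string]float64
}

// Summarize aggregates incident costs for one period.
func Summarize(incidents []IncidentCost) CostSummary {
	s := CostSummary{
		CostByProvider:  map[string]float64{},
		CostByNamespace: map[string]float64{},
	}
	for _, inc := range incidents {
		s.TotalLLMCost += inc.LLMCost
		s.TotalDowntimeCost += inc.DowntimeCost
		s.IncidentCount++
		if inc.AutoRemediated {
			s.AutoRemediatedCount++
		}
		s.CostByProvider[inc.Provider] += inc.LLMCost      // assumption: LLM cost only
		s.CostByNamespace[inc.Namespace] += inc.TotalCost() // assumption: total cost
	}
	s.TotalCost = s.TotalLLMCost + s.TotalDowntimeCost
	if s.IncidentCount > 0 { // avoid dividing by zero for an empty period
		s.AvgCostPerIncident = s.TotalCost / float64(s.IncidentCount)
	}
	// Copy before sorting so the caller's slice order is preserved.
	top := append([]IncidentCost(nil), incidents...)
	sort.SliceStable(top, func(i, j int) bool {
		return top[i].TotalCost() > top[j].TotalCost()
	})
	if len(top) > 5 {
		top = top[:5]
	}
	s.TopCostlyIncidents = top
	return s
}

func main() {
	s := Summarize([]IncidentCost{
		{IncidentName: "a", Namespace: "prod", Provider: "claude", LLMCost: 0.5, DowntimeCost: 100, AutoRemediated: true},
		{IncidentName: "b", Namespace: "dev", Provider: "gpt-4", LLMCost: 0.25, DowntimeCost: 50},
	})
	fmt.Println(s.TotalCost, s.AvgCostPerIncident) // 150.75 75.375
}
```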
ROI Calculation
ROI is calculated by comparing automation cost with the estimated cost of manual resolution:
| Field | Type | Description |
|---|---|---|
| EngineerHoursSaved | float64 | Engineer hours saved |
| LaborSavings | float64 | Labor savings ($) |
| DowntimePrevented | float64 | Minutes of downtime prevented |
| DowntimeSavings | float64 | Downtime savings ($) |
| TotalSavings | float64 | Total savings ($) |
| TotalLLMCost | float64 | Total LLM cost ($) |
| ROIPercent | float64 | Return on investment (%) |
ROI typically exceeds 100,000% because the cost of LLM calls ($0.03-0.10 per incident) is orders of magnitude lower than the cost of manual resolution (2h of engineer time + downtime).
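To make the order of magnitude concrete, the sketch below fills the ROICalculation fields. The exact formula is not stated here, so the conventional definition ROI% = (savings - cost) / cost * 100 is assumed. The `CalculateROI` name and the $100/hour engineer rate in the example are hypothetical.

```go
package main

import "fmt"

// ROICalculation uses the field names from the table above.
type ROICalculation struct {
	EngineerHoursSaved float64
	LaborSavings       float64
	DowntimePrevented  float64 // minutes
	DowntimeSavings    float64
	TotalSavings       float64
	TotalLLMCost       float64
	ROIPercent         float64
}

// CalculateROI assumes ROI% = (TotalSavings - TotalLLMCost) / TotalLLMCost * 100.
// engineerRate is dollars per hour; downtimeCostPerMinute is dollars per minute.
func CalculateROI(hoursSaved, engineerRate, downtimePrevented,
	downtimeCostPerMinute, llmCost float64) ROICalculation {
	r := ROICalculation{
		EngineerHoursSaved: hoursSaved,
		LaborSavings:       hoursSaved * engineerRate,
		DowntimePrevented:  downtimePrevented,
		DowntimeSavings:    downtimePrevented * downtimeCostPerMinute,
		TotalLLMCost:       llmCost,
	}
	r.TotalSavings = r.LaborSavings + r.DowntimeSavings
	if llmCost > 0 { // leave ROIPercent at 0 when no LLM cost was recorded
		r.ROIPercent = (r.TotalSavings - llmCost) / llmCost * 100
	}
	return r
}

func main() {
	// 2 engineer hours at a hypothetical $100/hour against $0.06 of LLM calls,
	// with downtime savings left out.
	r := CalculateROI(2, 100, 0, 0, 0.06)
	fmt.Printf("savings=$%.2f roi=%.0f%%\n", r.TotalSavings, r.ROIPercent)
}
```

Under these assumed inputs, labor savings alone give an ROI of roughly 333,000%, which is consistent with the figure quoted above.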
Storage Architecture (ConfigMaps)
All Capacity Planner, Noise Reducer, and Cost Tracker data is persisted in ConfigMaps in the operator namespace:
| ConfigMap | Data | Retention |
|---|---|---|
| chatcli-capacity-history | Historical CPU/memory usage per resource. Compact time series in JSON. | 7 days (rolling window) |
| chatcli-pattern-store | Noise Reducer dedup hashes, suppression counters, flap flags. | 24 hours (automatic pruning) |
| chatcli-seasonal-patterns | Learned seasonal patterns (SeasonalPattern structs in JSON). | Indefinite (patterns with Confidence < 0.3 and LastSeen > 30 days are removed) |
| chatcli-cost-ledger | Cost records per incident (IncidentCost in JSON). Used for aggregation. | 90 days (rolling window) |
| chatcli-cost-config | Per-provider pricing configuration, downtime cost, engineer rate. | Indefinite (user-managed) |
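The retention rule for chatcli-seasonal-patterns combines two conditions. A minimal sketch of that rule follows, reading "Confidence < 0.3 and LastSeen > 30 days" as requiring both conditions at once. The struct is trimmed to the fields the rule needs, and `Prune` is a hypothetical name.

```go
package main

import (
	"fmt"
	"time"
)

// SeasonalPattern is trimmed to the fields the retention rule needs.
type SeasonalPattern struct {
	SignalType string
	Confidence float64
	LastSeen   time.Time
}

const (
	minConfidence = 0.3                 // from the retention column
	maxAge        = 30 * 24 * time.Hour // "LastSeen > 30 days"
)

// Prune returns the patterns that survive the retention rule: a pattern is
// dropped only when it is both low-confidence and not seen for over 30 days.
func Prune(patterns []SeasonalPattern, now time.Time) []SeasonalPattern {
	kept := make([]SeasonalPattern, 0, len(patterns))
	for _, p := range patterns {
		stale := now.Sub(p.LastSeen) > maxAge
		if p.Confidence < minConfidence && stale {
			continue
		}
		kept = append(kept, p)
	}
	return kept
}

func main() {
	now := time.Date(2025, time.March, 1, 0, 0, 0, 0, time.UTC)
	patterns := []SeasonalPattern{
		{SignalType: "pod_restart", Confidence: 0.75, LastSeen: now.AddDate(0, 0, -45)},
		{SignalType: "oom_kill", Confidence: 0.2, LastSeen: now.AddDate(0, 0, -45)},
		{SignalType: "cpu_spike", Confidence: 0.2, LastSeen: now.AddDate(0, 0, -2)},
	}
	for _, p := range Prune(patterns, now) {
		fmt.Println(p.SignalType) // pod_restart, then cpu_spike
	}
}
```

The signal names other than pod_restart are made up for the example.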
Storage Format
Integrations
REST API
The endpoints /api/v1/analytics/remediation-stats and /api/v1/analytics/summary expose cost and capacity data.
Web Dashboard
The Overview view displays ROI metrics and capacity projections in real time.
Grafana
The remediation-stats.json dashboard includes cost and ROI panels.
AIOps Platform
Complete AIOps pipeline architecture and how these subsystems integrate.