The Decision Engine never acts blindly. Every decision goes through a pipeline of
confidence adjustments, circuit breaker checks, and pattern validation
before any action is executed.
Architecture Overview
Base Confidence (AIInsight)
The entire process starts with the confidence field of the AIInsight CR, which is generated by the LLM provider during root cause analysis. This value represents the AI’s certainty about the diagnosis and suggested actions.
High Confidence
0.90 - 1.00 — The AI identified the problem with high precision. Well-known
scenarios like OOMKilled, CrashLoopBackOff with invalid image.
Medium Confidence
0.70 - 0.89 — Probable diagnosis but with uncertainty. Performance
issues, resource pressure, intermittent dependencies.
Low Confidence
0.50 - 0.69 — The AI does not have sufficient certainty. Complex problems
with multiple possible causes.
Very Low Confidence
< 0.50 — Unknown scenario or insufficient data. Always requires
human intervention.
Confidence Adjustment Factors
The base confidence is never used directly. It goes through 5 adjustment factors that refine it based on the current operational context.
1. Historical Success Rate
Query the Pattern Store
The engine calculates the success rate of previous remediations for the same signal type (signalType).
2. Pattern Match
When the Pattern Store finds a previously resolved pattern that matches the current incident, confidence receives a significant boost.

| Condition | Adjustment |
|---|---|
| Pattern found with successful resolution | +0.15 |
| No matching pattern | 0.00 |
3. Time of Day
Automatic actions outside business hours carry additional risk because fewer engineers are available to intervene if something goes wrong.

| Condition | Adjustment |
|---|---|
| Within business hours (09:00-18:00 local) | 0.00 |
| Outside business hours | -0.05 |
4. Simultaneous Active Issues
When the cluster is under pressure with multiple active incidents, the engine becomes more conservative to avoid chained actions that could worsen the situation.

| Condition | Adjustment |
|---|---|
| Up to 3 active issues | 0.00 |
| Each issue beyond 3 | -0.02 per issue |
5. Incident Severity
The Issue CR severity applies a fixed modifier reflecting the inherent operational risk.

| Severity | Adjustment | Justification |
|---|---|---|
| critical | -0.10 | Production impact, requires maximum caution |
| high | -0.05 | Significant risk, moderate conservatism |
| medium | 0.00 | Standard level, no adjustment |
| low | +0.05 | Low risk, favors automation |
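
The tabulated factors above can be combined in a few lines. The following is a minimal sketch, not the engine's actual code: the `Context` type and `adjusted_confidence` function are hypothetical names, the size of the historical success rate adjustment is not given in this section and is therefore passed in as a plain number, the business-hours check treats 18:00 itself as outside hours, and clamping the result to the range 0.0 to 1.0 is an assumption.

```python
from dataclasses import dataclass

# Values copied from the adjustment tables above.
PATTERN_MATCH_BOOST = 0.15
OFF_HOURS_PENALTY = -0.05
ACTIVE_ISSUE_FREE_LIMIT = 3
PER_EXTRA_ISSUE_PENALTY = -0.02
SEVERITY_ADJUSTMENT = {"critical": -0.10, "high": -0.05, "medium": 0.00, "low": 0.05}


@dataclass
class Context:
    """Hypothetical container for the inputs of one evaluation."""
    base_confidence: float              # confidence field of the AIInsight CR
    severity: str                       # critical | high | medium | low
    hour: int                           # local hour of day, 0-23
    active_issues: int
    pattern_matched: bool
    historical_adjustment: float = 0.0  # assumption: value supplied by the caller


def adjusted_confidence(ctx: Context) -> float:
    total = ctx.base_confidence + ctx.historical_adjustment
    if ctx.pattern_matched:
        total += PATTERN_MATCH_BOOST
    if not 9 <= ctx.hour < 18:          # outside 09:00-18:00 local
        total += OFF_HOURS_PENALTY
    extra_issues = max(0, ctx.active_issues - ACTIVE_ISSUE_FREE_LIMIT)
    total += extra_issues * PER_EXTRA_ISSUE_PENALTY
    total += SEVERITY_ADJUSTMENT[ctx.severity]
    # Assumption: the final value is clamped to [0.0, 1.0].
    return round(min(1.0, max(0.0, total)), 2)


# 0.92 + 0.15 (pattern) - 0.05 (night) + 0.05 (low) = 1.07, clamped to 1.0
night_oom = Context(0.92, "low", hour=2, active_issues=1, pattern_matched=True)
print(adjusted_confidence(night_oom))
```

Keeping the table values as named constants makes it easy to compare the sketch against the tables line by line.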
Practical Calculation Example
Scenario: CrashLoopBackOff after deploy during business hours
Incident data:
- Base AIInsight confidence: 0.88
- Severity: high
- Time: 14:30 (business hours)
- Active issues: 2
- Pattern Store: pattern found (successful rollback 5 days ago)
- Historical success rate: 90%

Decision: Confidence 1.00 + severity high = Requires approval (threshold >=0.80 + high). Even with maximum confidence, high incidents always require human approval.
Scenario: Pod with OOMKilled in staging namespace at night
Incident data:
- Base AIInsight confidence: 0.92
- Severity: low
- Time: 02:15 (outside business hours)
- Active issues: 1
- Pattern Store: pattern found (successful memory adjustment)
- Historical success rate: 95%

Decision: Confidence 1.00 + severity low = Auto-remediation (threshold >=0.95 + low).
Scenario: Unknown problem in overloaded cluster
Incident data:
- Base AIInsight confidence: 0.65
- Severity: critical
- Time: 10:00 (business hours)
- Active issues: 8
- Pattern Store: no matching pattern
- Historical success rate: 30%

Decision: Confidence 0.35 + severity critical = Manual only (<0.70 or critical).
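
The Decision lines in the three scenarios above each quote one threshold rule. The sketch below encodes only those three rules. The function name `autonomy_level`, the returned labels, and the order in which the rules are checked are assumptions, and any combination the scenarios do not cover returns `None` rather than a guess.

```python
from typing import Optional


def autonomy_level(confidence: float, severity: str) -> Optional[str]:
    """Map final confidence and severity to an autonomy level.

    Only the rules quoted in the scenarios are encoded here.
    """
    # "Manual only (<0.70 or critical)"
    if confidence < 0.70 or severity == "critical":
        return "manual-only"
    # "Auto-remediation (threshold >=0.95 + low)"
    if confidence >= 0.95 and severity == "low":
        return "auto-remediation"
    # "Requires approval (threshold >=0.80 + high)"
    if confidence >= 0.80 and severity == "high":
        return "requires-approval"
    # Not covered by the rules quoted above.
    return None


for confidence, severity in [(1.00, "high"), (1.00, "low"), (0.35, "critical")]:
    print(confidence, severity, autonomy_level(confidence, severity))
```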
Decision Thresholds
The combination of final confidence and severity determines the allowed level of autonomy.
- Full Auto-Remediation
- Auto with Notification
- Requires Approval
- Manual Only

Full Auto-Remediation requirements: Confidence >= 0.95 and severity low.
The platform executes remediation automatically without any human intervention. The RemediationPlan is created and executed immediately.
Circuit Breaker
The circuit breaker is a safety mechanism that blocks all auto-remediations when it detects consecutive failures, preventing the platform from causing cascading damage.
Failure Monitoring
Each remediation failure is recorded with a timestamp. The circuit breaker
maintains a sliding window of 1 hour.
Circuit Breaker Trigger
When 3 or more failures occur within the 1-hour window, the circuit
breaker opens and blocks all auto-remediation in the namespace.
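The failure window and trigger described above can be sketched as a per-namespace queue of timestamps. This is an illustration only: the `CircuitBreaker` class and its method names are hypothetical, timestamps are assumed to arrive in order, and the breaker is assumed to close again once old failures age out of the window, which the text does not state.

```python
import time
from collections import defaultdict, deque
from typing import Deque, Dict, Optional

WINDOW_SECONDS = 3600   # 1-hour sliding window, from the text
FAILURE_THRESHOLD = 3   # 3 or more failures open the breaker


class CircuitBreaker:
    """Hypothetical per-namespace breaker over a sliding failure window."""

    def __init__(self) -> None:
        self._failures: Dict[str, Deque[float]] = defaultdict(deque)

    def record_failure(self, namespace: str, now: Optional[float] = None) -> None:
        timestamp = time.time() if now is None else now
        self._failures[namespace].append(timestamp)

    def is_open(self, namespace: str, now: Optional[float] = None) -> bool:
        now = time.time() if now is None else now
        window = self._failures[namespace]
        # Drop failures that have fallen out of the 1-hour window.
        while window and now - window[0] > WINDOW_SECONDS:
            window.popleft()
        return len(window) >= FAILURE_THRESHOLD


breaker = CircuitBreaker()
for second in (0, 600, 1200):
    breaker.record_failure("prod", now=second)
print(breaker.is_open("prod", now=1300))     # three failures inside the window
print(breaker.is_open("staging", now=1300))  # other namespaces are unaffected
```

Passing `now` explicitly keeps the sketch deterministic in tests; production code would rely on the default clock.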
Open State
While open, all RemediationPlan CRs are created with requiresApproval: true, regardless of the calculated confidence.
Pattern Store
The Pattern Store is the platform’s pattern learning system. It allows AIOps to “remember” past incidents and use that memory to make more informed decisions.
SHA256 Fingerprinting
Each pattern is identified by a unique fingerprint calculated as:

| Signal Type | Resource Kind | Severity | Fingerprint (truncated) |
|---|---|---|---|
| CrashLoopBackOff | Deployment | high | a3f8c2... |
| OOMKilled | Pod | medium | 7b1d9e... |
| FailedScheduling | Pod | low | c4e6a1... |
| ImagePullBackOff | Deployment | high | 2d8f5b... |
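
The exact fingerprint formula is not shown here. As a rough sketch, assuming the inputs are the three columns in the table above (signal type, resource kind, severity) joined with a `|` separator, a SHA256 fingerprint could be derived as follows. The function name and the separator are illustrative choices, so the output is not expected to match the truncated values in the table.

```python
import hashlib


def pattern_fingerprint(signal_type: str, resource_kind: str, severity: str) -> str:
    """Return a hex SHA256 digest over the three pattern attributes.

    Assumption: the fields are joined with '|' before hashing.
    """
    key = "|".join((signal_type, resource_kind, severity))
    return hashlib.sha256(key.encode("utf-8")).hexdigest()


fingerprint = pattern_fingerprint("CrashLoopBackOff", "Deployment", "high")
print(fingerprint[:6] + "...")  # truncated for display, as in the table
```

Because the digest depends only on the three attributes, two incidents with the same signal type, resource kind, and severity map to the same pattern entry.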
ConfigMap Storage
Patterns are persisted in a dedicated ConfigMap in the operator namespace.
RecordResolution and RecordFailure
Confidence Boost Calculation
The confidence boost derived from the Pattern Store is calculated directly from the success rate:

| Success Rate | Confidence Boost | Example |
|---|---|---|
| 100% (10/10) | +0.150 | All rollbacks successful |
| 80% (8/10) | +0.120 | Most resource adjustments worked |
| 50% (5/10) | +0.075 | Mixed results |
| 20% (2/10) | +0.030 | Most failed |
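
Every row in the table above is consistent with boost = 0.15 × success rate. The sketch below tracks outcomes per pattern and applies that relationship. The `PatternRecord` class is hypothetical, its method names only mirror RecordResolution and RecordFailure, and returning no boost for a pattern without history is an assumption.

```python
from dataclasses import dataclass

MAX_BOOST = 0.15  # boost at a 100% success rate, from the table


@dataclass
class PatternRecord:
    """Hypothetical success/failure counters for one fingerprint."""
    successes: int = 0
    failures: int = 0

    def record_resolution(self) -> None:
        self.successes += 1

    def record_failure(self) -> None:
        self.failures += 1

    def success_rate(self) -> float:
        total = self.successes + self.failures
        # Assumption: a pattern with no history contributes no boost.
        return self.successes / total if total else 0.0

    def confidence_boost(self) -> float:
        return round(MAX_BOOST * self.success_rate(), 3)


record = PatternRecord()
for _ in range(8):
    record.record_resolution()
for _ in range(2):
    record.record_failure()
print(record.confidence_boost())  # 8/10 successes, the +0.120 row of the table
```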
Scenario: Recent Similar Incident
'Similar incident resolved 3 days ago with rollback'
When the Pattern Store finds a match, the engine adds context to the AIInsight and the RemediationPlan. This information is displayed in the Issue CR so operators can quickly see that the problem has been resolved before and how.
Root Cause Analysis (RCA) Enrichment
Before making any decision, the engine enriches the incident context with additional cluster data. This enrichment feeds both the LLM (for better diagnosis) and the decision engine (for more precise adjustments).
DeploymentChange Detection
The engine checks if there was a recent deploy change by comparing ReplicaSet revisions.
ConfigChange Detection
The engine searches for Kubernetes events related to ConfigMap and Secret updates.
Related Issues
Lists active issues in the same namespace that may be correlated.
Dependency Status
Checks the health of Services and Endpoints that the affected resource depends on.
Time Correlation
The engine calculates the temporal correlation between detected changes and the incident start. Strong temporal correlation (< 5 min) automatically elevates the cause to the top of the PossibleCauses list, as the probability of a causal relationship is high.
PossibleCauses Ranking
All possible causes are ranked by probability based on the enrichment data.
Convergence Detector
The Convergence Detector is designed for the agentic remediation loop. It monitors the agent’s observations to determine if the situation is improving, stagnating, or worsening.
IsConverged
Checks if the last 3 observations are identical, indicating that the system has reached a stable state (for better or worse).
IsOscillating
Detects A-B-A-B oscillation patterns where the system alternates between two states without real progress.
ShouldStop
Main function that combines all agentic loop stop criteria:

| Criterion | Condition | Action |
|---|---|---|
| Convergence | 3 identical observations | Stops the loop, marks as resolved or not |
| Oscillation | A-B-A-B pattern | Stops the loop, escalates to human |
| Timeout | > 10 minutes | Stops the loop, escalates to human |
| Consecutive failures | >= 5 failures | Stops the loop, triggers circuit breaker |
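
The stop criteria in the table above translate into three small functions. This is a sketch under stated assumptions: observations are modelled as plain strings, the snake_case names are Python stand-ins for IsConverged, IsOscillating, and ShouldStop, and the order in which the criteria are checked is a choice made here, not something the table specifies.

```python
from typing import Optional, Sequence

CONVERGENCE_WINDOW = 3          # 3 identical observations
TIMEOUT_SECONDS = 10 * 60       # > 10 minutes
MAX_CONSECUTIVE_FAILURES = 5    # >= 5 failures


def is_converged(observations: Sequence[str]) -> bool:
    """True when the last 3 observations are identical."""
    if len(observations) < CONVERGENCE_WINDOW:
        return False
    recent = observations[-CONVERGENCE_WINDOW:]
    return all(item == recent[0] for item in recent)


def is_oscillating(observations: Sequence[str]) -> bool:
    """True when the last 4 observations form an A-B-A-B pattern."""
    if len(observations) < 4:
        return False
    a, b, c, d = observations[-4:]
    return a == c and b == d and a != b


def should_stop(observations: Sequence[str], elapsed_seconds: float,
                consecutive_failures: int) -> Optional[str]:
    """Return the stop reason, or None if the loop should continue."""
    if is_converged(observations):
        return "convergence"
    if is_oscillating(observations):
        return "oscillation"
    if elapsed_seconds > TIMEOUT_SECONDS:
        return "timeout"
    if consecutive_failures >= MAX_CONSECUTIVE_FAILURES:
        return "failures"
    return None


print(should_stop(["crash", "ready", "crash", "ready"], 120, 0))
```

Returning the reason as a string rather than a boolean lets the caller choose the follow-up action from the table (mark resolved, escalate, or trip the circuit breaker).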
EstimateProgress
Estimates agentic loop progress from 0.0 to 1.0, used for visual feedback and logging.
Complete Decision Flow
Decision Engine Metrics
The engine exposes Prometheus metrics for observability:

| Metric | Type | Description |
|---|---|---|
| decision_engine_evaluations_total | Counter | Total confidence evaluations |
| decision_engine_confidence_histogram | Histogram | Final confidence distribution |
| decision_engine_auto_remediations_total | Counter | Total auto-remediations by mode |
| decision_engine_circuit_breaker_state | Gauge | Circuit breaker state (0=closed, 1=open) |
| decision_engine_pattern_matches_total | Counter | Total Pattern Store matches |
| decision_engine_rca_enrichment_duration | Histogram | RCA enrichment time |
| decision_engine_convergence_stops_total | Counter | Total stops by type (convergence, oscillation, timeout, failures) |
Next Steps
Multi-Cluster Federation
See how the decision engine operates in multi-cluster environments with policies
per tier.
Chaos Engineering
Validate engine decisions with controlled chaos experiments.
Audit and Compliance
Every decision generates an immutable AuditEvent for traceability.
AIOps Platform
Return to the complete AIOps platform overview.