The Decision Engine is the central component that determines when and how the AIOps platform should act autonomously. It combines calculated confidence, historical patterns, root cause enrichment, and convergence detection to make safe decisions in production.
The Decision Engine never acts blindly. Every decision goes through a pipeline of confidence adjustments, circuit breaker checks, and pattern validation before any action is executed.

Architecture Overview

Base Confidence (AIInsight)

The entire process starts with the confidence field of the AIInsight CR, which is generated by the LLM provider during root cause analysis. This value represents the AI’s certainty about the diagnosis and suggested actions.

High Confidence

0.90 - 1.00 — The AI identified the problem with high precision. Well-known scenarios like OOMKilled, CrashLoopBackOff with invalid image.

Medium Confidence

0.70 - 0.89 — Probable diagnosis but with uncertainty. Performance issues, resource pressure, intermittent dependencies.

Low Confidence

0.50 - 0.69 — The AI does not have sufficient certainty. Complex problems with multiple possible causes.

Very Low Confidence

< 0.50 — Unknown scenario or insufficient data. Always requires human intervention.
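The bands above can be expressed as a small lookup. This is a minimal sketch rather than the platform's code: the function name confidenceBand is hypothetical, and it assumes half-open intervals so that a value between two listed bounds (for example 0.895) falls into the lower band.

```go
package main

import "fmt"

// confidenceBand maps a base confidence to the band names used above.
// Hypothetical helper: treating the bands as half-open intervals is an
// assumption, since the document lists bounds to two decimals only.
func confidenceBand(c float64) string {
	switch {
	case c >= 0.90:
		return "high"
	case c >= 0.70:
		return "medium"
	case c >= 0.50:
		return "low"
	default:
		return "very low"
	}
}

func main() {
	for _, c := range []float64{0.95, 0.82, 0.55, 0.31} {
		fmt.Printf("%.2f -> %s\n", c, confidenceBand(c))
	}
}
```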

Confidence Adjustment Factors

The base confidence is never used directly. It goes through 5 adjustment factors that refine it based on the current operational context.

1. Historical Success Rate

Step 1: Query the Pattern Store

The engine calculates the success rate of previous remediations for the same signal type (signalType).

Step 2: Apply the adjustment

  • High success rate (>80%): adjustment of +0.10
  • Low success rate (<40%): adjustment of -0.10
  • No history: no adjustment (0.00)
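// Historical adjustment calculation.
// hasHistory (illustrative name) is false when the Pattern Store has no prior
// remediations for this signalType. Without this guard, a zero successRate
// would be penalized even though "no history" means no adjustment.
if hasHistory {
    if successRate > 0.8 {
        adjustment += 0.10
    } else if successRate < 0.4 {
        adjustment -= 0.10
    }
}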

2. Pattern Match

When the Pattern Store finds a previously resolved pattern that matches the current incident, confidence receives a significant boost.
| Condition | Adjustment |
| --- | --- |
| Pattern found with successful resolution | +0.15 |
| No matching pattern | 0.00 |
Pattern Match is the most powerful factor. An identical incident resolved previously can raise confidence enough to allow auto-remediation even in scenarios that would normally require approval.

3. Time of Day

Automatic actions outside business hours carry additional risk because fewer engineers are available to intervene if something goes wrong.
| Condition | Adjustment |
| --- | --- |
| Within business hours (09:00-18:00 local) | 0.00 |
| Outside business hours | -0.05 |

4. Simultaneous Active Issues

When the cluster is under pressure with multiple active incidents, the engine becomes more conservative to avoid chain actions that could worsen the situation.
| Condition | Adjustment |
| --- | --- |
| Up to 3 active issues | 0.00 |
| Each issue beyond 3 | -0.02 per issue |
// Example: 7 active issues
// Adjustment = -(7 - 3) * 0.02 = -0.08
activeIssues := countActiveIssues(namespace)
if activeIssues > 3 {
    adjustment -= float64(activeIssues - 3) * 0.02
}
With 10 or more simultaneous active issues, the cumulative penalty (0.14 or more) makes the auto-remediation threshold much harder to reach and pushes decisions toward human review, which is the desired behavior during a cascade incident.

5. Incident Severity

The Issue CR severity applies a fixed modifier reflecting the inherent operational risk.
| Severity | Adjustment | Justification |
| --- | --- | --- |
| critical | -0.10 | Production impact, requires maximum caution |
| high | -0.05 | Significant risk, moderate conservatism |
| medium | 0.00 | Standard level, no adjustment |
| low | +0.05 | Low risk, favors automation |

Practical Calculation Example

Incident data:
  • Base AIInsight confidence: 0.88
  • Severity: high
  • Time: 14:30 (business hours)
  • Active issues: 2
  • Pattern Store: pattern found (successful rollback 5 days ago)
  • Historical success rate: 90%
Calculation:
Base:                    0.88
+ Historical success:   +0.10  (90% > 80%)
+ Pattern match:        +0.15  (pattern found)
+ Time of day:           0.00  (business hours)
+ Active issues:         0.00  (2 <= 3)
+ Severity (high):      -0.05
─────────────────────────────
Final confidence:        1.00  (capped at 1.0)
Decision: Confidence 1.00 + severity high = Requires approval (threshold >=0.80 + high). Even with maximum confidence, high-severity incidents always require human approval.

Incident data:
  • Base AIInsight confidence: 0.92
  • Severity: low
  • Time: 02:15 (outside business hours)
  • Active issues: 1
  • Pattern Store: pattern found (successful memory adjustment)
  • Historical success rate: 95%
Calculation:
Base:                    0.92
+ Historical success:   +0.10  (95% > 80%)
+ Pattern match:        +0.15  (pattern found)
+ Time of day:          -0.05  (outside business hours)
+ Active issues:         0.00  (1 <= 3)
+ Severity (low):       +0.05
─────────────────────────────
Final confidence:        1.00  (capped at 1.0)
Decision: Confidence 1.00 + severity low = Auto-remediation (threshold >=0.95 + low).

Incident data:
  • Base AIInsight confidence: 0.65
  • Severity: critical
  • Time: 10:00 (business hours)
  • Active issues: 8
  • Pattern Store: no matching pattern
  • Historical success rate: 30%
Calculation:
Base:                    0.65
+ Historical success:   -0.10  (30% < 40%)
+ Pattern match:         0.00  (no pattern)
+ Time of day:           0.00  (business hours)
+ Active issues:        -0.10  (8-3=5 x 0.02)
+ Severity (critical):  -0.10
─────────────────────────────
Final confidence:        0.35
Decision: Confidence 0.35 + severity critical = Manual only (<0.70 or critical).
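
The three calculations above can be reproduced with a short program. This is a sketch under stated assumptions, not the engine's implementation: the Incident type and finalConfidence function are hypothetical names, business hours are treated as the half-open range 09:00 to 18:00, the pattern match uses the flat +0.15 from the factor table, and the lower clamp at 0.0 is assumed (the examples only show the cap at 1.0).

```go
package main

import "fmt"

// Incident holds the inputs used by the five adjustment factors.
// Field names are illustrative, not the platform's actual types.
type Incident struct {
	BaseConfidence float64
	Severity       string // "critical", "high", "medium", "low"
	Hour           int    // local hour of the decision (0-23)
	ActiveIssues   int
	PatternFound   bool    // Pattern Store match with a successful resolution
	HasHistory     bool    // false when no prior remediations exist
	SuccessRate    float64 // 0.0-1.0, ignored when HasHistory is false
}

// finalConfidence applies the five adjustments and clamps the result to [0, 1].
func finalConfidence(in Incident) float64 {
	c := in.BaseConfidence

	// 1. Historical success rate
	if in.HasHistory {
		if in.SuccessRate > 0.8 {
			c += 0.10
		} else if in.SuccessRate < 0.4 {
			c -= 0.10
		}
	}
	// 2. Pattern match
	if in.PatternFound {
		c += 0.15
	}
	// 3. Time of day (assumed business hours: 09:00 <= t < 18:00)
	if in.Hour < 9 || in.Hour >= 18 {
		c -= 0.05
	}
	// 4. Simultaneous active issues
	if in.ActiveIssues > 3 {
		c -= float64(in.ActiveIssues-3) * 0.02
	}
	// 5. Severity
	switch in.Severity {
	case "critical":
		c -= 0.10
	case "high":
		c -= 0.05
	case "low":
		c += 0.05
	}

	if c > 1.0 {
		return 1.0
	}
	if c < 0.0 {
		return 0.0
	}
	return c
}

func main() {
	examples := []Incident{
		{BaseConfidence: 0.88, Severity: "high", Hour: 14, ActiveIssues: 2, PatternFound: true, HasHistory: true, SuccessRate: 0.90},
		{BaseConfidence: 0.92, Severity: "low", Hour: 2, ActiveIssues: 1, PatternFound: true, HasHistory: true, SuccessRate: 0.95},
		{BaseConfidence: 0.65, Severity: "critical", Hour: 10, ActiveIssues: 8, PatternFound: false, HasHistory: true, SuccessRate: 0.30},
	}
	for _, ex := range examples {
		fmt.Printf("%.2f\n", finalConfidence(ex)) // 1.00, 1.00, 0.35
	}
}
```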

Decision Thresholds

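The combination of final confidence and severity determines the allowed level of autonomy. The worked examples above use three modes: auto-remediation (confidence >= 0.95 and severity low), requires approval (confidence >= 0.80 and severity high), and manual only (confidence < 0.70 or severity critical).

Auto-remediation
Requirements: confidence >= 0.95 and severity low.
The platform executes remediation automatically without any human intervention. The RemediationPlan is created and executed immediately.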
# Automatically generated RemediationPlan
apiVersion: platform.chatcli.io/v1alpha1
kind: RemediationPlan
metadata:
  name: auto-fix-oom-api-server
  annotations:
    platform.chatcli.io/decision-mode: "auto"
    platform.chatcli.io/confidence: "0.97"
spec:
  issueRef:
    name: issue-oom-api-server
  actions:
    - type: AdjustResources
      target: deployment/api-server
      parameters:
        memoryLimit: "512Mi"
        memoryRequest: "256Mi"
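
The threshold rules quoted in the worked examples can be sketched as a single function. This is an illustration only: decisionMode is a hypothetical name, and only the three quoted rules are encoded. Every combination the document does not spell out (for example medium severity, or high severity with confidence between 0.70 and 0.80) falls back to approval here, which is a conservative assumption rather than documented behavior.

```go
package main

import "fmt"

// decisionMode returns the autonomy level for a final confidence and severity.
// Rules taken from the worked examples:
//   - manual:   confidence < 0.70 or severity critical
//   - auto:     confidence >= 0.95 and severity low
//   - approval: confidence >= 0.80 and severity high
// Anything else defaults to "approval" (assumption of this sketch).
func decisionMode(confidence float64, severity string) string {
	switch {
	case severity == "critical" || confidence < 0.70:
		return "manual"
	case confidence >= 0.95 && severity == "low":
		return "auto"
	default:
		return "approval"
	}
}

func main() {
	fmt.Println(decisionMode(1.00, "high"))     // approval
	fmt.Println(decisionMode(1.00, "low"))      // auto
	fmt.Println(decisionMode(0.35, "critical")) // manual
}
```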

Circuit Breaker

The circuit breaker is a safety mechanism that blocks all auto-remediations when it detects consecutive failures, preventing the platform from causing cascading damage.
Step 1: Failure Monitoring

Each remediation failure is recorded with a timestamp. The circuit breaker maintains a sliding window of 1 hour.

Step 2: Circuit Breaker Trigger

When 3 or more failures occur within the 1-hour window, the circuit breaker opens and blocks all auto-remediation in the namespace.

Step 3: Open State

While open, all RemediationPlan CRs are created with requiresApproval: true, regardless of the calculated confidence.

Step 4: Reset

The circuit breaker closes automatically after the cooldown period or when an operator performs a manual reset via annotation.

type CircuitBreaker struct {
    failures    []time.Time
    window      time.Duration  // 1 hour
    threshold   int            // 3 failures
    isOpen      bool
    mu          sync.Mutex
}

func (cb *CircuitBreaker) RecordFailure() {
    cb.mu.Lock()
    defer cb.mu.Unlock()
    now := time.Now()
    cb.failures = append(cb.failures, now)

    // Remove failures outside the window
    cutoff := now.Add(-cb.window)
    var recent []time.Time
    for _, f := range cb.failures {
        if f.After(cutoff) {
            recent = append(recent, f)
        }
    }
    cb.failures = recent

    if len(cb.failures) >= cb.threshold {
        cb.isOpen = true
    }
}

func (cb *CircuitBreaker) IsOpen() bool {
    cb.mu.Lock()
    defer cb.mu.Unlock()
    return cb.isOpen
}
When the circuit breaker is open, the annotation platform.chatcli.io/circuit-breaker: open is added to the namespace. This is visible via kubectl get ns <namespace> -o yaml for quick diagnosis.

Pattern Store

The Pattern Store is the platform’s pattern learning system. It allows AIOps to “remember” past incidents and use that memory to make more informed decisions.

SHA256 Fingerprinting

Each pattern is identified by a unique fingerprint calculated as:
SHA256(signalType | resourceKind | severity)
Fingerprint examples:
| Signal Type | Resource Kind | Severity | Fingerprint (truncated) |
| --- | --- | --- | --- |
| CrashLoopBackOff | Deployment | high | a3f8c2... |
| OOMKilled | Pod | medium | 7b1d9e... |
| FailedScheduling | Pod | low | c4e6a1... |
| ImagePullBackOff | Deployment | high | 2d8f5b... |
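
A fingerprint of this shape can be computed with Go's standard library. This is a minimal sketch: the function name fingerprint is hypothetical, and it assumes the "|" in the formula is a literal separator byte and that the digest is hex-encoded. The truncated values in the table are illustrative, so the output of this sketch is not expected to match them.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
)

// fingerprint returns the hex-encoded SHA-256 of the three fields joined by "|".
// Assumption: "|" is a literal separator and the digest is rendered as hex.
func fingerprint(signalType, resourceKind, severity string) string {
	joined := strings.Join([]string{signalType, resourceKind, severity}, "|")
	sum := sha256.Sum256([]byte(joined))
	return hex.EncodeToString(sum[:])
}

func main() {
	fp := fingerprint("CrashLoopBackOff", "Deployment", "high")
	// A SHA-256 digest is 32 bytes, so the hex string has 64 characters.
	fmt.Println(len(fp), fp[:6]+"...")
}
```

Because severity is part of the input, the same signal on the same resource kind yields a different pattern when its severity changes.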

ConfigMap Storage

Patterns are persisted in a dedicated ConfigMap in the operator namespace:
apiVersion: v1
kind: ConfigMap
metadata:
  name: chatcli-pattern-store
  namespace: chatcli-system
  labels:
    app.kubernetes.io/component: pattern-store
    platform.chatcli.io/managed-by: decision-engine
data:
  patterns.json: |
    {
      "a3f8c2...": {
        "signalType": "CrashLoopBackOff",
        "resourceKind": "Deployment",
        "severity": "high",
        "totalOccurrences": 12,
        "successfulResolutions": 10,
        "lastResolution": {
          "action": "Rollback",
          "timestamp": "2026-03-18T14:30:00Z",
          "durationSeconds": 45
        },
        "averageResolutionTime": "38s"
      }
    }

RecordResolution and RecordFailure

func (ps *PatternStore) RecordResolution(fingerprint string, action string) {
    ps.mu.Lock()
    defer ps.mu.Unlock()

    pattern, exists := ps.patterns[fingerprint]
    if !exists {
        pattern = &Pattern{Fingerprint: fingerprint}
        ps.patterns[fingerprint] = pattern
    }

    pattern.TotalOccurrences++
    pattern.SuccessfulResolutions++
    pattern.LastResolution = &Resolution{
        Action:    action,
        Timestamp: time.Now(),
    }

    ps.persistToConfigMap()
}
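
Only RecordResolution is shown above. The sketch below illustrates what the failure path and the success-rate calculation might look like. It is an assumption, not the platform's code: memoryStore is an in-memory stand-in without ConfigMap persistence, and RecordFailure is assumed to increment TotalOccurrences only, so that the success rate is SuccessfulResolutions / TotalOccurrences (consistent with 10 of 12 in the ConfigMap example).

```go
package main

import (
	"fmt"
	"sync"
)

// Pattern mirrors the two counters stored in patterns.json.
type Pattern struct {
	TotalOccurrences      int
	SuccessfulResolutions int
}

// SuccessRate returns SuccessfulResolutions / TotalOccurrences and reports
// false when the pattern has no recorded occurrences yet.
func (p *Pattern) SuccessRate() (float64, bool) {
	if p == nil || p.TotalOccurrences == 0 {
		return 0, false
	}
	return float64(p.SuccessfulResolutions) / float64(p.TotalOccurrences), true
}

// memoryStore is a hypothetical in-memory stand-in for PatternStore.
type memoryStore struct {
	mu       sync.Mutex
	patterns map[string]*Pattern
}

// RecordFailure counts an occurrence without a successful resolution.
// Assumption: a failure increments TotalOccurrences only.
func (ms *memoryStore) RecordFailure(fingerprint string) {
	ms.mu.Lock()
	defer ms.mu.Unlock()

	p, ok := ms.patterns[fingerprint]
	if !ok {
		p = &Pattern{}
		ms.patterns[fingerprint] = p
	}
	p.TotalOccurrences++
}

func main() {
	ms := &memoryStore{patterns: map[string]*Pattern{
		"a3f8c2...": {TotalOccurrences: 12, SuccessfulResolutions: 10},
	}}
	ms.RecordFailure("a3f8c2...")
	rate, _ := ms.patterns["a3f8c2..."].SuccessRate()
	fmt.Printf("%.2f\n", rate) // 10 of 13 occurrences
}
```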

Confidence Boost Calculation

The confidence boost derived from the Pattern Store is calculated directly from the success rate:
ConfidenceBoost = successRate * 0.15
| Success Rate | Confidence Boost | Example |
| --- | --- | --- |
| 100% (10/10) | +0.150 | All rollbacks successful |
| 80% (8/10) | +0.120 | Most resource adjustments worked |
| 50% (5/10) | +0.075 | Mixed results |
| 20% (2/10) | +0.030 | Most failed |
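
The formula is a single multiplication. The following sketch reproduces the table rows. The names confidenceBoost and maxPatternBoost are hypothetical, and clamping the success rate to the range 0 to 1 is a defensive assumption not stated in the document.

```go
package main

import "fmt"

// maxPatternBoost is the 0.15 factor from the ConfidenceBoost formula.
const maxPatternBoost = 0.15

// confidenceBoost scales the maximum boost by the pattern's success rate.
// Clamping successRate to [0, 1] is an assumption of this sketch.
func confidenceBoost(successRate float64) float64 {
	if successRate < 0 {
		successRate = 0
	} else if successRate > 1 {
		successRate = 1
	}
	return successRate * maxPatternBoost
}

func main() {
	for _, r := range []float64{1.0, 0.8, 0.5, 0.2} {
		fmt.Printf("%.0f%% -> +%.3f\n", r*100, confidenceBoost(r))
	}
}
```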

Scenario: Recent Similar Incident

When the Pattern Store finds a match, the engine adds context to the AIInsight and the RemediationPlan:
status:
  patternMatch:
    found: true
    fingerprint: "a3f8c2..."
    previousResolution:
      action: "Rollback"
      daysAgo: 3
      wasSuccessful: true
      message: "Similar incident resolved 3 days ago with rollback"
    confidenceBoost: 0.15
    successRate: 0.83
This information is displayed in the Issue CR so operators can quickly see that the problem has been resolved before and how.

Root Cause Analysis (RCA) Enrichment

Before making any decision, the engine enriches the incident context with additional cluster data. This enrichment feeds both the LLM (for better diagnosis) and the decision engine (for more precise adjustments).

DeploymentChange Detection

The engine checks whether the Deployment changed recently by comparing ReplicaSet revisions:
func (r *RCAEnricher) DetectDeploymentChange(ctx context.Context,
    deployment *appsv1.Deployment) (*DeploymentChange, error) {

    // List ReplicaSets for the deployment.
    // The error is returned instead of discarded, so a failed List call
    // cannot lead to a nil dereference on rsList below (requires "fmt").
    rsList, err := r.client.AppsV1().ReplicaSets(deployment.Namespace).List(ctx,
        metav1.ListOptions{
            LabelSelector: labels.SelectorFromSet(deployment.Spec.Selector.MatchLabels).String(),
        })
    if err != nil {
        return nil, fmt.Errorf("listing ReplicaSets for %s/%s: %w",
            deployment.Namespace, deployment.Name, err)
    }

    // Compare revisions (annotation deployment.kubernetes.io/revision)
    current, previous := findCurrentAndPrevious(rsList.Items)

    if current != nil && previous != nil {
        return &DeploymentChange{
            RevisionBefore: previous.Annotations["deployment.kubernetes.io/revision"],
            RevisionAfter:  current.Annotations["deployment.kubernetes.io/revision"],
            ImageBefore:    previous.Spec.Template.Spec.Containers[0].Image,
            ImageAfter:     current.Spec.Template.Spec.Containers[0].Image,
            Timestamp:      current.CreationTimestamp.Time,
        }, nil
    }
    return nil, nil
}
Enrichment result:
rcaEnrichment:
  deploymentChange:
    detected: true
    revisionBefore: "5"
    revisionAfter: "6"
    imageBefore: "api-server:v2.3.1"
    imageAfter: "api-server:v2.4.0"
    timestamp: "2026-03-19T10:15:00Z"

ConfigChange Detection

The engine searches for Kubernetes events related to ConfigMap and Secret updates:
rcaEnrichment:
  configChanges:
    - resource: "ConfigMap/api-config"
      field: "database.maxConnections"
      timestamp: "2026-03-19T10:12:00Z"
      reason: "Updated via kubectl"

Related Issues

Lists active issues in the same namespace that may be correlated:
rcaEnrichment:
  relatedIssues:
    - name: issue-high-latency-redis
      severity: medium
      signalType: HighLatency
      resource: deployment/redis-cache
    - name: issue-memory-pressure-node2
      severity: high
      signalType: MemoryPressure
      resource: node/worker-2

Dependency Status

Checks the health of Services and Endpoints that the affected resource depends on:
rcaEnrichment:
  dependencyStatus:
    - service: database-svc
      endpointsReady: 3
      endpointsTotal: 3
      healthy: true
    - service: redis-svc
      endpointsReady: 0
      endpointsTotal: 2
      healthy: false
      reason: "No endpoints ready"

Time Correlation

The engine calculates the temporal correlation between detected changes and the incident start:
rcaEnrichment:
  timeCorrelation:
    deploymentChange:
      minutesBefore: 3
      message: "Deploy changed 3 min before the incident"
      correlationStrength: "strong"
    configChange:
      minutesBefore: 12
      message: "ConfigMap updated 12 min before the incident"
      correlationStrength: "moderate"
Strong temporal correlation (< 5 min) automatically elevates the cause to the top of the PossibleCauses list, as the probability of a causal relationship is high.

PossibleCauses Ranking

All possible causes are ranked by probability based on the enrichment data:
rcaEnrichment:
  possibleCauses:
    - rank: 1
      cause: "New image version (v2.4.0) introduced a memory leak"
      confidence: 0.85
      evidence:
        - "Deploy occurred 3 min before the first OOMKilled"
        - "Image changed from v2.3.1 to v2.4.0"
        - "Similar pattern resolved with rollback 5 days ago"
    - rank: 2
      cause: "Memory limit insufficient for current load"
      confidence: 0.45
      evidence:
        - "Increasing memory usage over the last 2 hours"
        - "No recent change in limits"
    - rank: 3
      cause: "Degraded redis-svc dependency causing retry storm"
      confidence: 0.30
      evidence:
        - "redis-svc with 0/2 ready endpoints"
        - "Weak temporal correlation"
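
The ranking step can be sketched as a stable sort. This is an illustration under assumptions: PossibleCause and rankCauses are hypothetical names, strong temporal correlation (less than 5 minutes) is modeled as a boolean that sorts first, and remaining ties are ordered by descending confidence.

```go
package main

import (
	"fmt"
	"sort"
)

// PossibleCause is a simplified, illustrative version of a possibleCauses entry.
type PossibleCause struct {
	Rank       int
	Cause      string
	Confidence float64
	// StrongCorrelation marks a cause whose change happened less than
	// 5 minutes before the incident started.
	StrongCorrelation bool
}

// rankCauses places strongly correlated causes first, then orders by
// descending confidence, and assigns 1-based ranks. The input is not modified.
func rankCauses(causes []PossibleCause) []PossibleCause {
	ranked := append([]PossibleCause(nil), causes...)
	sort.SliceStable(ranked, func(i, j int) bool {
		if ranked[i].StrongCorrelation != ranked[j].StrongCorrelation {
			return ranked[i].StrongCorrelation
		}
		return ranked[i].Confidence > ranked[j].Confidence
	})
	for i := range ranked {
		ranked[i].Rank = i + 1
	}
	return ranked
}

func main() {
	causes := []PossibleCause{
		{Cause: "Degraded redis-svc dependency", Confidence: 0.30},
		{Cause: "Memory limit insufficient", Confidence: 0.45},
		{Cause: "New image version introduced a memory leak", Confidence: 0.85, StrongCorrelation: true},
	}
	for _, c := range rankCauses(causes) {
		fmt.Printf("%d. %s (%.2f)\n", c.Rank, c.Cause, c.Confidence)
	}
}
```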

Convergence Detector

The Convergence Detector is designed for the agentic remediation loop. It monitors the agent’s observations to determine if the situation is improving, stagnating, or worsening.

IsConverged

Checks if the last 3 observations are identical, indicating that the system has reached a stable state (for better or worse).
func (cd *ConvergenceDetector) IsConverged(observations []string) bool {
    if len(observations) < 3 {
        return false
    }
    last3 := observations[len(observations)-3:]
    return last3[0] == last3[1] && last3[1] == last3[2]
}

IsOscillating

Detects A-B-A-B oscillation patterns where the system alternates between two states without real progress.
func (cd *ConvergenceDetector) IsOscillating(observations []string) bool {
    if len(observations) < 4 {
        return false
    }
    last4 := observations[len(observations)-4:]
    // Pattern A->B->A->B
    return last4[0] == last4[2] && last4[1] == last4[3] && last4[0] != last4[1]
}
Oscillation is a strong signal that the remediation action is creating the problem it is trying to solve. When detected, the agentic loop is interrupted immediately and the incident is escalated for human intervention.
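
To see how the two checks behave on concrete data, the sketch below repeats both methods in a self-contained program and runs them on two observation sequences. The empty ConvergenceDetector struct and the observation strings are assumptions for illustration, since the document does not show the detector's fields or the observation format.

```go
package main

import "fmt"

// ConvergenceDetector is declared empty here; its real fields are not shown
// in the document.
type ConvergenceDetector struct{}

// IsConverged reports whether the last 3 observations are identical.
func (cd *ConvergenceDetector) IsConverged(observations []string) bool {
	if len(observations) < 3 {
		return false
	}
	last3 := observations[len(observations)-3:]
	return last3[0] == last3[1] && last3[1] == last3[2]
}

// IsOscillating reports whether the last 4 observations follow A->B->A->B.
func (cd *ConvergenceDetector) IsOscillating(observations []string) bool {
	if len(observations) < 4 {
		return false
	}
	last4 := observations[len(observations)-4:]
	return last4[0] == last4[2] && last4[1] == last4[3] && last4[0] != last4[1]
}

func main() {
	cd := &ConvergenceDetector{}
	stable := []string{"CrashLoopBackOff", "Running", "Running", "Running"}
	flapping := []string{"Running", "OOMKilled", "Running", "OOMKilled"}

	fmt.Println(cd.IsConverged(stable), cd.IsOscillating(stable))     // true false
	fmt.Println(cd.IsConverged(flapping), cd.IsOscillating(flapping)) // false true
}
```

Note that convergence says nothing about health: three identical "CrashLoopBackOff" observations also count as converged, which is why the caller still has to decide whether the stable state is the desired one.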

ShouldStop

Main function that combines all agentic loop stop criteria:
func (cd *ConvergenceDetector) ShouldStop(
    observations []string,
    startTime time.Time,
    consecutiveFailures int,
) (bool, string) {
    // 1. Convergence
    if cd.IsConverged(observations) {
        return true, "System converged (3 identical observations)"
    }

    // 2. Oscillation
    if cd.IsOscillating(observations) {
        return true, "Oscillation detected (A->B->A->B pattern)"
    }

    // 3. Timeout (10 minutes)
    if time.Since(startTime) > 10*time.Minute {
        return true, "Agentic loop timeout (10 min)"
    }

    // 4. Consecutive failures
    if consecutiveFailures >= 5 {
        return true, "5 consecutive failures"
    }

    return false, ""
}
| Criterion | Condition | Action |
| --- | --- | --- |
| Convergence | 3 identical observations | Stops the loop, marks as resolved or not |
| Oscillation | A-B-A-B pattern | Stops the loop, escalates to human |
| Timeout | > 10 minutes | Stops the loop, escalates to human |
| Consecutive failures | >= 5 failures | Stops the loop, triggers circuit breaker |

EstimateProgress

Estimates agentic loop progress from 0.0 to 1.0, used for visual feedback and logging:
func (cd *ConvergenceDetector) EstimateProgress(
    currentStep int,
    maxSteps int,
    lastObservation string,
    targetState string,
) float64 {
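    // Guard against a zero or negative step budget.
    if maxSteps <= 0 {
        return 0.0
    }

    // Base progress by step count, clamped to [0.0, 1.0]
    stepProgress := math.Max(0.0, math.Min(1.0, float64(currentStep)/float64(maxSteps)))

    // Adjust if observation indicates improvement.
    // "unhealthy" contains "healthy", so it is excluded explicitly.
    improving := strings.Contains(lastObservation, "running") ||
        (strings.Contains(lastObservation, "healthy") &&
            !strings.Contains(lastObservation, "unhealthy"))
    if improving {
        return math.Min(1.0, stepProgress+0.2)
    }

    return stepProgress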
}

Complete Decision Flow

Decision Engine Metrics

The engine exposes Prometheus metrics for observability:
| Metric | Type | Description |
| --- | --- | --- |
| decision_engine_evaluations_total | Counter | Total confidence evaluations |
| decision_engine_confidence_histogram | Histogram | Final confidence distribution |
| decision_engine_auto_remediations_total | Counter | Total auto-remediations by mode |
| decision_engine_circuit_breaker_state | Gauge | Circuit breaker state (0=closed, 1=open) |
| decision_engine_pattern_matches_total | Counter | Total Pattern Store matches |
| decision_engine_rca_enrichment_duration | Histogram | RCA enrichment time |
| decision_engine_convergence_stops_total | Counter | Total stops by type (convergence, oscillation, timeout, failures) |
# Prometheus alert example
groups:
  - name: decision-engine
    rules:
      - alert: CircuitBreakerOpen
        expr: decision_engine_circuit_breaker_state == 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Decision engine circuit breaker is open"
          description: "3+ remediation failures in the last hour. Auto-remediation blocked."
      - alert: LowPatternMatchRate
        expr: >
          rate(decision_engine_pattern_matches_total[1h])
          / rate(decision_engine_evaluations_total[1h]) < 0.1
        for: 24h
        labels:
          severity: info
        annotations:
          summary: "Low pattern match rate"
          description: "Less than 10% of incidents have a known pattern. Consider reviewing runbooks."

Next Steps

Multi-Cluster Federation

See how the decision engine operates in multi-cluster environments with policies per tier.

Chaos Engineering

Validate engine decisions with controlled chaos experiments.

Audit and Compliance

Every decision generates an immutable AuditEvent for traceability.

AIOps Platform

Return to the complete AIOps platform overview.