The Chaos Engineering module validates the resilience of Kubernetes workloads and the effectiveness of AIOps platform remediations. Unlike standalone chaos tools, experiments here are integrated into the AIOps pipeline, so you can verify that a remediation actually works under adverse conditions.
Each experiment is defined as a native Kubernetes custom resource (the ChaosExperiment CRD). All security controls are declarative and auditable, ensuring that chaos experiments never affect critical workloads without explicit approval.

Chaos Engineering in the AIOps Context

Remediation Validation

After fixing an incident, re-inject the failure to confirm that automatic remediation works.

Resilience Testing

Run recurring experiments to ensure the platform detects and responds to known failures.

Automated Game Days

Schedule experiments via cron to simulate regular game days without manual intervention.

Recovery Baseline

Measure actual recovery times to establish SLOs and identify bottlenecks.

ChaosExperiment CRD

Complete Specification

apiVersion: platform.chatcli.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: validate-api-server-recovery
  namespace: staging
spec:
  # Experiment type
  experimentType: pod_kill

  # Target
  target:
    kind: Deployment
    name: api-server
    namespace: staging

  # Type-specific parameters
  parameters:
    count: 2           # Number of pods to affect
    # Parameters per type (see detailed section below)

  # Maximum duration
  duration: 5m

  # DryRun: simulate without executing
  dryRun: false

  # Scheduling (cron, optional)
  schedule: ""          # E.g., "0 3 * * 1" (every Monday at 3am)

  # Issue reference (post-remediation validation)
  linkedIssueRef:
    name: issue-api-server-crashloop
    namespace: staging

  # Safety Checks
  safetyChecks:
    minHealthyPods: 2
    maxConcurrentExperiments: 1
    abortOnIssueDetected: true
    requireApproval: false
    allowedNamespaces:
      - staging
      - chaos-testing
    blockedNamespaces:
      - production
      - kube-system
      - chatcli-system

  # Post-experiment verification
  postExperiment:
    verifyRecovery: true
    recoveryTimeout: 3m
    runRemediationTest: false

status:
  phase: Completed     # Pending | Running | Completed | Failed | Aborted
  startTime: "2026-03-19T03:00:00Z"
  completionTime: "2026-03-19T03:04:30Z"
  affectedPods:
    - api-server-7d8f9c6b5-x2k4p
    - api-server-7d8f9c6b5-m9n3q
  recoveryVerified: true
  recoveryDuration: "45s"
  conditions:
    - type: SafetyChecksPassed
      status: "True"
    - type: ExperimentCompleted
      status: "True"
    - type: RecoveryVerified
      status: "True"

7 Experiment Types

1. Pod Kill

Deletes randomly selected pods, using a Fisher-Yates shuffle seeded by crypto/rand so the selection is unbiased and unpredictable.
func (e *PodKillExperiment) Execute(ctx context.Context, pods []corev1.Pod) error {
    // Fisher-Yates shuffle with crypto/rand
    shuffled := make([]corev1.Pod, len(pods))
    copy(shuffled, pods)
    for i := len(shuffled) - 1; i > 0; i-- {
        jBig, err := rand.Int(rand.Reader, big.NewInt(int64(i+1)))
        if err != nil {
            return fmt.Errorf("secure shuffle failed: %w", err)
        }
        j := jBig.Int64()
        shuffled[i], shuffled[j] = shuffled[j], shuffled[i]
    }

    // Delete the first N pods (forced, no graceful period)
    count := e.Parameters.Count
    for i := 0; i < count && i < len(shuffled); i++ {
        err := e.client.CoreV1().Pods(shuffled[i].Namespace).Delete(ctx,
            shuffled[i].Name,
            metav1.DeleteOptions{
                GracePeriodSeconds: pointer.Int64(0),
            })
        if err != nil {
            return fmt.Errorf("failed to delete pod %s: %w", shuffled[i].Name, err)
        }
    }
    return nil
}
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| count | int | 1 | Number of pods to delete |
Pod kill uses GracePeriodSeconds: 0, simulating an abrupt failure (e.g., node crash). For graceful termination, use pod_failure.

2. Pod Failure

Deletes pods gracefully, respecting the terminationGracePeriodSeconds configured in the PodSpec.
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| count | int | 1 | Number of pods to delete gracefully |
spec:
  experimentType: pod_failure
  target:
    kind: Deployment
    name: payment-service
  parameters:
    count: 1
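Functionally, pod_failure differs from pod_kill only in the grace period passed to the delete call. A minimal Go sketch of that choice (gracePeriodFor is an illustrative helper, not part of the actual controller API):

```go
package main

import "fmt"

// gracePeriodFor returns the grace period (in seconds) to pass to the
// pod delete call for a given experiment type. A nil return means
// "use the pod's own terminationGracePeriodSeconds".
func gracePeriodFor(experimentType string) *int64 {
	switch experimentType {
	case "pod_kill":
		zero := int64(0) // force delete: simulate an abrupt crash
		return &zero
	case "pod_failure":
		return nil // graceful delete: respect the PodSpec setting
	default:
		return nil
	}
}

func main() {
	if gp := gracePeriodFor("pod_kill"); gp != nil {
		fmt.Printf("pod_kill grace period: %ds\n", *gp)
	}
	if gracePeriodFor("pod_failure") == nil {
		fmt.Println("pod_failure: pod's own terminationGracePeriodSeconds applies")
	}
}
```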

3. CPU Stress

Creates a stress-ng pod on the same node as the target pod to simulate CPU contention.
spec:
  experimentType: cpu_stress
  target:
    kind: Deployment
    name: api-server
  parameters:
    cores: 4              # Number of cores to stress
    loadPercent: 80       # Load percentage per core
  duration: 2m
Generated stress pod:
apiVersion: v1
kind: Pod
metadata:
  name: chaos-cpu-stress-api-server-x7k2
  labels:
    platform.chatcli.io/chaos-experiment: validate-cpu-resilience
    platform.chatcli.io/chaos-type: cpu_stress
spec:
  nodeSelector:
    kubernetes.io/hostname: worker-node-3   # Same node as target
  containers:
    - name: stress
      image: alexeiled/stress-ng:latest
      command: ["stress-ng"]
      args: ["--cpu", "4", "--cpu-load", "80", "--timeout", "120"]
      resources:
        limits:
          cpu: "4"
  restartPolicy: Never
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| cores | int | 1 | Number of stress-ng CPU workers |
| loadPercent | int | 100 | Load percentage per core (0-100) |

4. Memory Stress

Creates a stress-ng pod that allocates memory on the same node as the target.
spec:
  experimentType: memory_stress
  target:
    kind: Deployment
    name: cache-service
  parameters:
    vmBytes: "256M"       # Amount of memory to allocate
  duration: 3m
Generated stress-ng command:
stress-ng --vm 1 --vm-bytes 256M --timeout 180
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| vmBytes | string | 128M | Amount of memory (format: 128M, 1G) |

5. Network Delay

Simulates network latency using annotations on the target pods. The sidecar or CNI plugin interprets the annotation to inject delay.
spec:
  experimentType: network_delay
  target:
    kind: Deployment
    name: api-gateway
  parameters:
    latencyMs: 500        # Additional latency in milliseconds
  duration: 5m
Applied annotation:
metadata:
  annotations:
    platform.chatcli.io/chaos-network-delay: "500ms"
    platform.chatcli.io/chaos-experiment-ref: "validate-latency-handling"
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| latencyMs | int | 100 | Additional latency in milliseconds |
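Since this experiment type only merges annotations into the target pods, the mutation itself is a simple map operation. A sketch of building the annotation set shown above (applyDelayAnnotations is an illustrative helper):

```go
package main

import "fmt"

// applyDelayAnnotations returns the annotations the controller would
// merge into the target pods for a network_delay experiment, preserving
// any annotations already present.
func applyDelayAnnotations(existing map[string]string, latencyMs int, experimentName string) map[string]string {
	out := make(map[string]string, len(existing)+2)
	for k, v := range existing {
		out[k] = v
	}
	out["platform.chatcli.io/chaos-network-delay"] = fmt.Sprintf("%dms", latencyMs)
	out["platform.chatcli.io/chaos-experiment-ref"] = experimentName
	return out
}

func main() {
	ann := applyDelayAnnotations(
		map[string]string{"app": "api-gateway"}, 500, "validate-latency-handling")
	fmt.Println(ann["platform.chatcli.io/chaos-network-delay"])
}
```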

6. Network Loss

Simulates network packet loss via annotations.
spec:
  experimentType: network_loss
  target:
    kind: Deployment
    name: api-gateway
  parameters:
    percent: 30           # Percentage of dropped packets
  duration: 2m
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| percent | int | 10 | Packet loss percentage (0-100) |

7. Disk Stress

Creates a stress-ng pod that generates intensive disk I/O on the same node.
spec:
  experimentType: disk_stress
  target:
    kind: Deployment
    name: database-proxy
  parameters:
    hdd: 2                # Number of disk workers
    hddBytes: "1G"        # Amount of data per worker
  duration: 3m
Generated stress-ng command:
stress-ng --hdd 2 --hdd-bytes 1G --timeout 180
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| hdd | int | 1 | Number of stress-ng HDD workers |
| hddBytes | string | 512M | Bytes written per worker |

Type Summary

| Type | Mechanism | Target | Reversible |
| --- | --- | --- | --- |
| pod_kill | Delete (force) | Randomly selected pods | Yes (ReplicaSet recreates) |
| pod_failure | Delete (graceful) | Selected pods | Yes (ReplicaSet recreates) |
| cpu_stress | stress-ng pod on same node | Node CPU | Yes (pod removed after duration) |
| memory_stress | stress-ng pod on same node | Node memory | Yes (pod removed after duration) |
| network_delay | Annotation on pod | Pod network | Yes (annotation removed) |
| network_loss | Annotation on pod | Pod network | Yes (annotation removed) |
| disk_stress | stress-ng pod on same node | Node disk | Yes (pod removed after duration) |

Safety Checks

Safety checks are the protection layer that prevents chaos experiments from causing real damage.

MinHealthyPods

Ensures a minimum number of pods remain healthy during the experiment.
func (sc *SafetyChecker) CheckMinHealthyPods(
    ctx context.Context,
    target *ExperimentTarget,
    minHealthy int,
    killCount int,
) error {
    pods, err := sc.listTargetPods(ctx, target)
    if err != nil {
        return fmt.Errorf("failed to list target pods: %w", err)
    }

    healthyPods := 0
    for _, pod := range pods {
        if isPodReady(&pod) {
            healthyPods++
        }
    }

    remainingHealthy := healthyPods - killCount
    if remainingHealthy < minHealthy {
        return fmt.Errorf(
            "safety check failed: %d healthy pods - %d kill = %d remaining, minimum required: %d",
            healthyPods, killCount, remainingHealthy, minHealthy,
        )
    }
    return nil
}

MaxConcurrentExperiments

Prevents chaos storms by limiting the number of simultaneous experiments in the namespace.
func (sc *SafetyChecker) CheckMaxConcurrent(
    ctx context.Context,
    namespace string,
    maxConcurrent int,
) error {
    running, err := sc.listRunningExperiments(ctx, namespace)
    if err != nil {
        return fmt.Errorf("failed to list running experiments: %w", err)
    }
    if len(running) >= maxConcurrent {
        return fmt.Errorf(
            "safety check failed: %d experiments running, maximum allowed: %d",
            len(running), maxConcurrent,
        )
    }
    return nil
}

AbortOnIssueDetected

If AIOps detects a new issue unrelated to the experiment during execution, the experiment is immediately aborted.
func (sc *SafetyChecker) MonitorForNewIssues(
    ctx context.Context,
    experiment *v1alpha1.ChaosExperiment,
    stopCh <-chan struct{},
) {
    ticker := time.NewTicker(10 * time.Second)
    defer ticker.Stop()

    for {
        select {
        case <-stopCh:
            return
        case <-ticker.C:
            issues, err := sc.listNewIssues(ctx, experiment.Namespace, experiment.Status.StartTime)
            if err != nil {
                continue // transient list failure; retry on the next tick
            }
            for _, issue := range issues {
                if !isRelatedToExperiment(&issue, experiment) {
                    sc.abortExperiment(ctx, experiment,
                        fmt.Sprintf("Unrelated issue detected: %s", issue.Name))
                    return
                }
            }
        }
    }
}

RequireApproval

Integrates with the ApprovalRequest system to require human approval before executing the experiment.
spec:
  safetyChecks:
    requireApproval: true
When enabled, the controller creates an ApprovalRequest CR and waits for approval before proceeding:
apiVersion: platform.chatcli.io/v1alpha1
kind: ApprovalRequest
metadata:
  name: approval-chaos-pod-kill-api-server
spec:
  resourceRef:
    kind: ChaosExperiment
    name: validate-api-server-recovery
  requiredRole: Operator
  expiresIn: 1h
  summary: |
    Chaos experiment: pod_kill on Deployment/api-server (staging)
    Affected pods: 2, MinHealthyPods: 2
    Duration: 5m

AllowedNamespaces / BlockedNamespaces

Allowlist of namespaces where experiments may run. If allowedNamespaces is defined, experiments are permitted only in those namespaces.
safetyChecks:
  allowedNamespaces:
    - staging
    - chaos-testing
    - development
blockedNamespaces takes precedence over allowedNamespaces. If a namespace appears in both lists, it is blocked. The kube-system and chatcli-system namespaces are always blocked, regardless of configuration.
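The precedence rules above can be expressed as a small pure function. A sketch of the check, under the assumption that the always-blocked set is hard-coded in the controller (namespaceAllowed is an illustrative helper name):

```go
package main

import "fmt"

// alwaysBlocked namespaces cannot be targeted regardless of configuration.
var alwaysBlocked = map[string]bool{"kube-system": true, "chatcli-system": true}

// namespaceAllowed applies the documented precedence: always-blocked
// first, then blockedNamespaces, then the allowlist (if non-empty).
func namespaceAllowed(ns string, allowed, blocked []string) bool {
	if alwaysBlocked[ns] {
		return false
	}
	for _, b := range blocked {
		if b == ns {
			return false // blocked wins even if also allowlisted
		}
	}
	if len(allowed) == 0 {
		return true // no allowlist: anything not blocked is permitted
	}
	for _, a := range allowed {
		if a == ns {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(namespaceAllowed("staging", []string{"staging"}, nil))
	fmt.Println(namespaceAllowed("production", nil, []string{"production"}))
	fmt.Println(namespaceAllowed("kube-system", []string{"kube-system"}, nil))
}
```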

Post-Experiment Verification

VerifyRecovery

After the experiment completes, the controller verifies whether the deployment returned to a healthy state.
func (v *PostExperimentVerifier) VerifyRecovery(
    ctx context.Context,
    experiment *v1alpha1.ChaosExperiment,
) (bool, time.Duration, error) {
    startCheck := time.Now()
    timeout := experiment.Spec.PostExperiment.RecoveryTimeout

    for time.Since(startCheck) < timeout {
        deployment, err := v.client.AppsV1().Deployments(
            experiment.Spec.Target.Namespace,
        ).Get(ctx, experiment.Spec.Target.Name, metav1.GetOptions{})

        if err == nil && deployment.Spec.Replicas != nil &&
            deployment.Status.ReadyReplicas == *deployment.Spec.Replicas {
            recoveryTime := time.Since(startCheck)
            return true, recoveryTime, nil
        }

        time.Sleep(5 * time.Second)
    }

    return false, timeout, fmt.Errorf("recovery timeout: deployment did not recover within %v", timeout)
}

RecoveryTimeout

Maximum wait time for recovery verification. If the deployment does not return to a healthy state within this period, the experiment is marked as Failed.

RunRemediationTest

When enabled together with linkedIssueRef, the controller:
  1. Re-injects the failure: executes the same experiment again to recreate the original incident scenario.
  2. Waits for detection: waits for the AIOps platform to automatically detect the anomaly.
  3. Verifies remediation: confirms that automatic remediation was triggered and resolved the problem.
  4. Records the result: updates the ChaosExperiment.Status with the validation result.
spec:
  linkedIssueRef:
    name: issue-api-server-crashloop
  postExperiment:
    verifyRecovery: true
    recoveryTimeout: 3m
    runRemediationTest: true   # Re-inject and validate automatic remediation

State Machine

| State | Description | Transitions |
| --- | --- | --- |
| Pending | CR created, awaiting safety checks or approval | Running, Failed |
| Running | Experiment in execution, failure being injected | Completed, Failed, Aborted |
| Completed | Experiment finished successfully, recovery verified | Terminal |
| Failed | Safety check failed, execution error, or recovery timeout | Terminal |
| Aborted | Interrupted by detected issue, timeout, or manual intervention | Terminal |
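The table encodes a small finite state machine; a controller would typically reject any phase change it does not list. A sketch of that guard (validTransitions and canTransition are illustrative names):

```go
package main

import "fmt"

// validTransitions encodes the state machine from the table above.
// Terminal states have no outgoing transitions.
var validTransitions = map[string][]string{
	"Pending":   {"Running", "Failed"},
	"Running":   {"Completed", "Failed", "Aborted"},
	"Completed": {},
	"Failed":    {},
	"Aborted":   {},
}

// canTransition reports whether moving from one phase to another is legal.
func canTransition(from, to string) bool {
	for _, next := range validTransitions[from] {
		if next == to {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(canTransition("Pending", "Running"))
	fmt.Println(canTransition("Completed", "Running"))
}
```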

DryRun Mode

DryRun mode executes all experiment logic (safety checks, pod selection, command generation) without applying any real changes to the cluster.
spec:
  experimentType: pod_kill
  dryRun: true
  target:
    kind: Deployment
    name: api-server
  parameters:
    count: 3
DryRun result:
status:
  phase: Completed
  dryRun: true
  dryRunResults:
    safetyChecksPassed: true
    podsSelected:
      - api-server-7d8f9c6b5-x2k4p
      - api-server-7d8f9c6b5-m9n3q
      - api-server-7d8f9c6b5-k8j7r
    actionsPlanned:
      - "DELETE pod api-server-7d8f9c6b5-x2k4p (GracePeriod: 0s)"
      - "DELETE pod api-server-7d8f9c6b5-m9n3q (GracePeriod: 0s)"
      - "DELETE pod api-server-7d8f9c6b5-k8j7r (GracePeriod: 0s)"
    warnings:
      - "3 of 5 pods would be deleted, leaving 2 (= minHealthyPods)"
Always run a DryRun before configuring a scheduled experiment. This validates that safety checks are correct and that targets are as expected.
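The actionsPlanned entries in the status above follow a fixed format, so a dry run can render them without touching the cluster. A sketch for the pod_kill case (plannedPodKillActions is an illustrative helper):

```go
package main

import "fmt"

// plannedPodKillActions renders the dry-run action list for a pod_kill
// experiment: one DELETE line per selected pod, with the forced grace period.
func plannedPodKillActions(pods []string) []string {
	actions := make([]string, 0, len(pods))
	for _, p := range pods {
		actions = append(actions, fmt.Sprintf("DELETE pod %s (GracePeriod: 0s)", p))
	}
	return actions
}

func main() {
	for _, a := range plannedPodKillActions([]string{"api-server-7d8f9c6b5-x2k4p"}) {
		fmt.Println(a)
	}
}
```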

Schedule (Recurring Experiments)

The schedule field accepts standard cron expressions for recurring execution:
spec:
  schedule: "0 3 * * 1"    # Every Monday at 03:00
  experimentType: pod_kill
  target:
    kind: Deployment
    name: api-server
  parameters:
    count: 1
  safetyChecks:
    minHealthyPods: 3
    abortOnIssueDetected: true
| Expression | Frequency |
| --- | --- |
| 0 3 * * 1 | Every Monday at 03:00 |
| 0 */6 * * * | Every 6 hours |
| 0 2 1 * * | First day of the month at 02:00 |
| 30 4 * * 1-5 | Weekdays at 04:30 |
Each scheduled execution creates a new ChaosExperiment CR with a timestamp suffix.

LinkedIssueRef

The linkedIssueRef field connects the experiment to a specific incident, allowing validation that the applied remediation actually works.
spec:
  linkedIssueRef:
    name: issue-api-server-crashloop
    namespace: staging
  experimentType: pod_kill
  parameters:
    count: 2
  postExperiment:
    verifyRecovery: true
    recoveryTimeout: 3m
    runRemediationTest: true
When linkedIssueRef is defined, the controller:
  1. Fetches the Issue CR and associated RemediationPlan
  2. Records the connection in the experiment status
  3. If runRemediationTest: true, validates that AIOps detects and remediates automatically
  4. Updates the Issue CR with the validation result

Complete YAML Examples

apiVersion: platform.chatcli.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: validate-api-server-pod-kill
  namespace: staging
  labels:
    team: platform
    experiment-type: resilience
spec:
  experimentType: pod_kill
  target:
    kind: Deployment
    name: api-server
    namespace: staging
  parameters:
    count: 2
  duration: 5m
  dryRun: false
  safetyChecks:
    minHealthyPods: 2
    maxConcurrentExperiments: 1
    abortOnIssueDetected: true
    requireApproval: false
    allowedNamespaces: [staging, chaos-testing]
    blockedNamespaces: [production, kube-system]
  postExperiment:
    verifyRecovery: true
    recoveryTimeout: 2m
    runRemediationTest: false
apiVersion: platform.chatcli.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: weekly-cpu-stress-api
  namespace: staging
spec:
  experimentType: cpu_stress
  schedule: "0 3 * * 1"     # Every Monday at 03:00
  target:
    kind: Deployment
    name: api-server
    namespace: staging
  parameters:
    cores: 4
    loadPercent: 90
  duration: 10m
  safetyChecks:
    minHealthyPods: 3
    maxConcurrentExperiments: 1
    abortOnIssueDetected: true
    blockedNamespaces: [production, kube-system]
  postExperiment:
    verifyRecovery: true
    recoveryTimeout: 5m
apiVersion: platform.chatcli.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: validate-crashloop-fix
  namespace: staging
spec:
  experimentType: pod_kill
  target:
    kind: Deployment
    name: payment-service
    namespace: staging
  parameters:
    count: 1
  duration: 3m
  linkedIssueRef:
    name: issue-payment-crashloop
    namespace: staging
  safetyChecks:
    minHealthyPods: 1
    abortOnIssueDetected: false   # We expect AIOps to detect
    requireApproval: true
  postExperiment:
    verifyRecovery: true
    recoveryTimeout: 3m
    runRemediationTest: true      # Validate automatic remediation
apiVersion: platform.chatcli.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: dryrun-memory-stress
  namespace: staging
spec:
  experimentType: memory_stress
  dryRun: true
  target:
    kind: Deployment
    name: cache-service
    namespace: staging
  parameters:
    vmBytes: "512M"
  duration: 5m
  safetyChecks:
    minHealthyPods: 2
    maxConcurrentExperiments: 1
    blockedNamespaces: [production]
  postExperiment:
    verifyRecovery: true
    recoveryTimeout: 2m

Metrics

The chaos engineering module exposes Prometheus metrics for observability and resilience tracking.
| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| chaos_experiments_total | Counter | type, result, namespace | Total experiments by type and result |
| chaos_experiments_active | Gauge | namespace | Currently running experiments |
| chaos_recovery_time_seconds | Histogram | type, target | Recovery time after experiment |
| chaos_pods_affected_total | Counter | type, namespace | Total pods affected by experiments |
| chaos_safety_checks_failed_total | Counter | check_type | Safety checks that blocked experiments |
| chaos_aborted_total | Counter | reason | Experiments aborted, by reason |
| chaos_remediation_validated_total | Counter | result | Remediation validations (pass/fail) |

Alert Examples

groups:
  - name: chaos-engineering
    rules:
      - alert: ChaosExperimentFailed
        expr: increase(chaos_experiments_total{result="failed"}[1h]) > 0
        labels:
          severity: warning
        annotations:
          summary: "Chaos experiment failed"
          description: >
            {{ $labels.type }} failed in namespace {{ $labels.namespace }}.
            Check if the deployment recovered correctly.

      - alert: HighRecoveryTime
        expr: >
          histogram_quantile(0.95, chaos_recovery_time_seconds_bucket) > 300
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "High recovery time after chaos"
          description: >
            P95 recovery time above 5 minutes. Workloads may have
            self-healing issues.

      - alert: RemediationValidationFailed
        expr: increase(chaos_remediation_validated_total{result="fail"}[24h]) > 0
        labels:
          severity: critical
        annotations:
          summary: "Remediation validation failed"
          description: >
            Automatic remediation did not work when the failure was
            re-injected. The AIOps platform may not be responding
            correctly to this type of incident.

Best Practices

  1. Start with DryRun: always run a DryRun before real experiments to validate safety checks and target selection.
  2. Staging first: run experiments in staging before enabling them in higher-tier environments, and use allowedNamespaces for enforcement.
  3. Conservative safety checks: configure minHealthyPods with margin. If the deployment has 5 replicas and needs 3 to operate, set minHealthyPods: 3.
  4. Schedule game days: use schedule for recurring experiments. Resilience is not a one-time test; it is a continuous practice.
  5. Validate remediations: after fixing an incident, use linkedIssueRef + runRemediationTest to confirm that the fix works under failure.

Next Steps

Decision Engine

Understand how chaos results influence the Pattern Store and engine confidence.

Multi-Cluster Federation

Run chaos experiments on specific clusters with per-tier policies.

Audit and Compliance

All experiments generate immutable AuditEvents for complete traceability.

AIOps Platform

Return to the AIOps platform overview.