The ChatCLI AIOps platform maintains an immutable audit trail of all actions — from anomaly detection to remediation execution. Combined with granular RBAC and automated compliance reports, the system meets governance requirements even in regulated environments.
Each AuditEvent is an immutable CRD (no status). Once created, it cannot be modified or deleted via controllers. This ensures record integrity for investigations and external audits.

Why Audit Trail for AIOps

When a platform makes autonomous decisions on production infrastructure, traceability is no longer optional:

Accountability

Who approved the remediation? Which AI recommended the action? When did the circuit breaker open? Every decision has a trail.

Post-Incident Investigation

Reconstruct the complete timeline of an incident — from the first signal to resolution — with precise timestamps.

Regulatory Compliance

SOC2, ISO 27001, PCI-DSS: demonstrate controls over automated actions with immutable records and documented RBAC.

Continuous Improvement

MTTD, MTTR, success rate, and SLA metrics automatically calculated from audit events.

AuditEvent CRD

The AuditEvent is an immutable CRD — it has only spec, no status. Once created, its content is permanent.

Complete Specification

apiVersion: platform.chatcli.io/v1alpha1
kind: AuditEvent
metadata:
  name: audit-1710856200-a7f3b2
  namespace: chatcli-system
  labels:
    platform.chatcli.io/event-type: RemediationExecuted
    platform.chatcli.io/severity: high
    platform.chatcli.io/correlation-id: inc-8f2a4b
  annotations:
    platform.chatcli.io/immutable: "true"
spec:
  # Event type
  eventType: RemediationExecuted

  # Precise timestamp
  timestamp: "2026-03-19T14:30:00.123Z"

  # Who performed the action
  actor:
    type: controller         # system | user | controller
    name: remediation-reconciler
    serviceAccount: chatcli-operator

  # Affected resource
  resource:
    apiVersion: platform.chatcli.io/v1alpha1
    kind: RemediationPlan
    name: plan-rollback-api-server
    namespace: production

  # Event-specific details
  details:
    action: Rollback
    target: deployment/api-server
    fromRevision: "6"
    toRevision: "5"
    confidence: "0.92"
    decisionMode: auto-notify
    duration: "12s"
    result: success

  # Correlation ID to group related events
  correlationID: inc-8f2a4b

  # Event severity
  severity: high

Event Types (EventType)

The platform defines 20+ event types covering the entire AIOps lifecycle. The following events are automatically recorded by the operator:
  • issue_created — when a new Issue is detected and transitions to Analyzing
  • issue_resolved — when a remediation successfully resolves the Issue
  • issue_escalated — when all remediation attempts fail
  • remediation_started — when a RemediationPlan begins execution
  • remediation_completed — when health verification confirms successful remediation
  • remediation_failed — when a remediation fails
| EventType | Description |
| --- | --- |
| AnomalyDetected | New anomaly detected by WatcherBridge |
| AnomalyCorrelated | Anomaly correlated with existing issue |
| IssueCreated | New issue created by AnomalyReconciler |
| IssueEscalated | Issue escalated (severity elevated) |
| IssueResolved | Issue marked as resolved |
| CrossClusterCorrelation | Cross-cluster correlation detected |
| CascadeDetected | Staging-to-production cascade detected |

AuditActor

The actor field identifies who or what performed the action:
| Type | Description | Example |
| --- | --- | --- |
| system | Automatic system action | watcher-bridge, correlation-engine |
| controller | Operator controller | remediation-reconciler, issue-reconciler |
| user | Human action (via kubectl or API) | john@company.com, admin |

AuditResource

The resource field identifies the affected Kubernetes resource:
type AuditResource struct {
    APIVersion string `json:"apiVersion"`
    Kind       string `json:"kind"`
    Name       string `json:"name"`
    Namespace  string `json:"namespace"`
}

Name Format

Each AuditEvent follows the name format:
audit-{unix-timestamp}-{random-6-chars}
Examples:
  • audit-1710856200-a7f3b2
  • audit-1710856245-c9d4e1
  • audit-1710856300-f2b8a6
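
The name format above can be sketched in Go as follows. This is a minimal illustration, not the operator's actual code: `newAuditEventName` and `isAuditEventName` are hypothetical helpers, and the choice of lowercase hex for the six random characters is an assumption based on the examples shown.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
	"regexp"
	"time"
)

// auditNamePattern matches audit-{unix-timestamp}-{6 lowercase hex chars}.
// Hex is an assumption inferred from the documented examples.
var auditNamePattern = regexp.MustCompile(`^audit-\d+-[0-9a-f]{6}$`)

// newAuditEventName builds a name from Unix seconds plus 3 random bytes,
// hex-encoded to 6 characters. Hypothetical helper for illustration.
func newAuditEventName(unixSeconds int64) (string, error) {
	suffix := make([]byte, 3)
	if _, err := rand.Read(suffix); err != nil {
		return "", fmt.Errorf("generating random suffix: %w", err)
	}
	return fmt.Sprintf("audit-%d-%s", unixSeconds, hex.EncodeToString(suffix)), nil
}

// isAuditEventName reports whether name follows the documented format.
func isAuditEventName(name string) bool {
	return auditNamePattern.MatchString(name)
}

func main() {
	name, err := newAuditEventName(time.Now().Unix())
	if err != nil {
		panic(err)
	}
	fmt.Println(name, isAuditEventName(name))
}
```

With a second-resolution timestamp and only three random bytes, two events created in the same second can collide, so a caller would likely retry on an "already exists" error.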

Immutability Annotation

Every AuditEvent is created with the annotation platform.chatcli.io/immutable: "true". An admission webhook can be configured to reject updates/deletes on resources with this annotation.
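
The decision such a webhook would make can be sketched as a pure function. This is an illustrative outline under stated assumptions: `admit` is a hypothetical name, and the operation strings mirror the Kubernetes admission operations (CREATE, UPDATE, DELETE, CONNECT). A real webhook would wrap this logic in an AdmissionReview handler.

```go
package main

import "fmt"

// immutableAnnotation is the annotation documented for AuditEvents.
const immutableAnnotation = "platform.chatcli.io/immutable"

// admit decides whether an operation is allowed on an object carrying the
// given annotations. For UPDATE, pass the annotations of the stored (old)
// object so that the annotation cannot simply be removed in the same request.
func admit(operation string, annotations map[string]string) (allowed bool, reason string) {
	if annotations[immutableAnnotation] != "true" {
		return true, ""
	}
	switch operation {
	case "UPDATE", "DELETE":
		return false, fmt.Sprintf("%s denied: resource is marked %s=true", operation, immutableAnnotation)
	default:
		return true, ""
	}
}

func main() {
	annotations := map[string]string{immutableAnnotation: "true"}
	for _, op := range []string{"CREATE", "UPDATE", "DELETE"} {
		allowed, reason := admit(op, annotations)
		fmt.Println(op, allowed, reason)
	}
}
```

If DELETE is rejected unconditionally, a retention job can no longer prune old events, so a deployment would probably exempt a dedicated service account from the check.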

Audit Recorder

The AuditRecorder is the central component that generates audit events. It offers 12 convenience functions for the most common scenarios.

Convenience Functions

type AuditRecorder struct {
    client    client.Client
    namespace string
}

// Detection
func (ar *AuditRecorder) RecordAnomalyDetected(anomaly *v1alpha1.Anomaly) error
func (ar *AuditRecorder) RecordIssueCreated(issue *v1alpha1.Issue) error
func (ar *AuditRecorder) RecordIssueResolved(issue *v1alpha1.Issue, resolution string) error

// Analysis
func (ar *AuditRecorder) RecordAIInsightCompleted(insight *v1alpha1.AIInsight) error
func (ar *AuditRecorder) RecordConfidenceCalculated(issue string, confidence float64, factors map[string]float64) error
func (ar *AuditRecorder) RecordPatternMatched(fingerprint string, issue string) error

// Remediation
func (ar *AuditRecorder) RecordRemediationExecuted(plan *v1alpha1.RemediationPlan, action string) error
func (ar *AuditRecorder) RecordRemediationResult(plan *v1alpha1.RemediationPlan, success bool) error
func (ar *AuditRecorder) RecordCircuitBreakerTriggered(namespace string, failures int) error

// Governance
func (ar *AuditRecorder) RecordApprovalDecision(request *v1alpha1.ApprovalRequest, decision string) error
func (ar *AuditRecorder) RecordRoleChange(user string, role string, action string) error
func (ar *AuditRecorder) RecordChaosExperiment(experiment *v1alpha1.ChaosExperiment, phase string) error

Automatic Generation by Controllers

Controllers automatically generate AuditEvents at key points in the pipeline:
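func (r *RemediationReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    plan := &v1alpha1.RemediationPlan{}
    if err := r.Get(ctx, req.NamespacedName, plan); err != nil {
        // The plan may have been deleted after the request was queued
        return ctrl.Result{}, client.IgnoreNotFound(err)
    }
    if len(plan.Spec.Actions) == 0 {
        // Nothing to execute, so there is nothing to audit
        return ctrl.Result{}, nil
    }

    // Record remediation start
    if err := r.auditRecorder.RecordRemediationExecuted(plan, plan.Spec.Actions[0].Type); err != nil {
        return ctrl.Result{}, err
    }

    // Execute the action
    if err := r.executeAction(ctx, plan); err != nil {
        // Record failure; the execution error takes precedence over any audit error
        _ = r.auditRecorder.RecordRemediationResult(plan, false)
        return ctrl.Result{}, err
    }

    // Record success
    return ctrl.Result{}, r.auditRecorder.RecordRemediationResult(plan, true)
}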

Generated Event Example

apiVersion: platform.chatcli.io/v1alpha1
kind: AuditEvent
metadata:
  name: audit-1710856200-a7f3b2
  namespace: chatcli-system
  annotations:
    platform.chatcli.io/immutable: "true"
spec:
  eventType: RemediationExecuted
  timestamp: "2026-03-19T14:30:00.123Z"
  actor:
    type: controller
    name: remediation-reconciler
    serviceAccount: chatcli-operator
  resource:
    apiVersion: platform.chatcli.io/v1alpha1
    kind: RemediationPlan
    name: plan-rollback-api-server
    namespace: production
  details:
    action: Rollback
    target: deployment/api-server
    fromRevision: "6"
    toRevision: "5"
  correlationID: inc-8f2a4b
  severity: high

Compliance Reporter

The ComplianceReporter generates automated reports from AuditEvents, calculating essential operational metrics.

GenerateReport

func (cr *ComplianceReporter) GenerateReport(
    ctx context.Context,
    namespace string,
    window time.Duration,  // E.g., 7*24*time.Hour for a weekly report
) (*ComplianceReport, error)

Report Metrics

Incident metrics measuring detection and resolution speed.
| Metric | Description | Calculation |
| --- | --- | --- |
| MTTD (Mean Time to Detect) | Average time between problem start and detection | avg(anomaly.detected - anomaly.started) |
| MTTR (Mean Time to Resolve) | Average time between detection and resolution | avg(issue.resolved - issue.created) |
| MeanRemediationAttempts | Average number of remediation attempts per issue | total_attempts / total_issues |
incidentMetrics:
  totalIncidents: 47
  mttd: "2m15s"
  mttr: "8m30s"
  meanRemediationAttempts: 1.3
  incidentsBySeverity:
    critical: 2
    high: 8
    medium: 22
    low: 15
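
The MTTD and MTTR formulas above are both a mean over time intervals. The sketch below shows that calculation in isolation; `interval`, `mustParse`, and `meanDuration` are hypothetical names, and the timestamps are made-up sample data rather than output from the platform.

```go
package main

import (
	"fmt"
	"time"
)

// interval is a pair of timestamps, for example anomaly start and detection
// (MTTD) or issue creation and resolution (MTTR).
type interval struct {
	start, end time.Time
}

// mustParse parses an RFC 3339 timestamp and panics on malformed input.
// Demo-only convenience.
func mustParse(ts string) time.Time {
	t, err := time.Parse(time.RFC3339, ts)
	if err != nil {
		panic(err)
	}
	return t
}

// meanDuration returns the average of (end - start) over all intervals,
// or zero when there are none.
func meanDuration(intervals []interval) time.Duration {
	if len(intervals) == 0 {
		return 0
	}
	var total time.Duration
	for _, iv := range intervals {
		total += iv.end.Sub(iv.start)
	}
	return total / time.Duration(len(intervals))
}

func main() {
	detection := []interval{
		{mustParse("2026-03-19T14:00:00Z"), mustParse("2026-03-19T14:02:00Z")}, // 2m0s
		{mustParse("2026-03-19T14:10:00Z"), mustParse("2026-03-19T14:12:30Z")}, // 2m30s
	}
	fmt.Println("MTTD:", meanDuration(detection)) // MTTD: 2m15s
}
```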

Audit Summary

The report includes a summary of audit events generated during the period:
auditSummary:
  totalEvents: 312
  eventsByType:
    AnomalyDetected: 89
    IssueCreated: 47
    AIInsightCompleted: 45
    RemediationExecuted: 52
    RemediationSucceeded: 46
    RemediationFailed: 6
    ApprovalRequested: 12
    ApprovalGranted: 8
    ApprovalRejected: 2
    ApprovalExpired: 2
    CircuitBreakerTriggered: 1
    ChaosExperimentCompleted: 4
    PatternMatched: 23
    IssueResolved: 44
  eventsBySeverity:
    critical: 8
    high: 45
    medium: 156
    low: 103
  eventsByActor:
    system: 134
    controller: 166
    user: 12
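
A summary of this shape can be produced by counting events along three dimensions. The following is a rough sketch, assuming a simplified `event` type that carries only the fields needed for grouping; it is not the ComplianceReporter's actual implementation.

```go
package main

import "fmt"

// event holds only the fields needed for the summary (hypothetical type).
type event struct {
	EventType string
	Severity  string
	ActorType string
}

// summary mirrors the totalEvents / eventsByType / eventsBySeverity /
// eventsByActor structure of the report.
type summary struct {
	Total      int
	ByType     map[string]int
	BySeverity map[string]int
	ByActor    map[string]int
}

// summarize counts events per type, severity, and actor type.
func summarize(events []event) summary {
	s := summary{
		ByType:     map[string]int{},
		BySeverity: map[string]int{},
		ByActor:    map[string]int{},
	}
	for _, ev := range events {
		s.Total++
		s.ByType[ev.EventType]++
		s.BySeverity[ev.Severity]++
		s.ByActor[ev.ActorType]++
	}
	return s
}

func main() {
	s := summarize([]event{
		{"RemediationExecuted", "high", "controller"},
		{"RemediationSucceeded", "high", "controller"},
		{"ApprovalGranted", "medium", "user"},
	})
	fmt.Printf("%+v\n", s)
}
```

Because each event is counted exactly once per dimension, the counts in each of the three maps add up to the total.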

RBAC Manager

The RBAC Manager implements granular access control with 4 predefined roles, mapped to Kubernetes ClusterRoles.

Role Definitions

Viewer — Read-only access to all AIOps resources.
| Resource | Permissions |
| --- | --- |
| Anomaly, Issue, AIInsight | get, list, watch |
| RemediationPlan, Runbook | get, list, watch |
| AuditEvent | get, list, watch |
| PostMortem | get, list, watch |
| ChaosExperiment | get, list, watch |
| ApprovalRequest | get, list, watch |
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: chatcli-aiops-viewer
  labels:
    platform.chatcli.io/rbac-role: viewer
rules:
  - apiGroups: ["platform.chatcli.io"]
    resources: ["*"]
    verbs: ["get", "list", "watch"]

EnsureRoles

The RBAC Manager ensures that ClusterRoles exist in the cluster:
func (rm *RBACManager) EnsureRoles(ctx context.Context) error {
    roles := []struct {
        name  string
        rules []rbacv1.PolicyRule
    }{
        {name: "chatcli-aiops-viewer", rules: viewerRules},
        {name: "chatcli-aiops-operator", rules: operatorRules},
        {name: "chatcli-aiops-admin", rules: adminRules},
        {name: "chatcli-aiops-superadmin", rules: superAdminRules},
    }

    for _, role := range roles {
        cr := &rbacv1.ClusterRole{
            ObjectMeta: metav1.ObjectMeta{Name: role.name},
            Rules:      role.rules,
        }
        _, err := controllerutil.CreateOrUpdate(ctx, rm.client, cr, func() error {
            cr.Rules = role.rules
            return nil
        })
        if err != nil {
            return fmt.Errorf("failed to create/update role %s: %w", role.name, err)
        }
    }
    return nil
}

GrantRole and RevokeRole

func (rm *RBACManager) GrantRole(ctx context.Context, user string, role string) error {
    binding := &rbacv1.ClusterRoleBinding{
        ObjectMeta: metav1.ObjectMeta{
            Name: fmt.Sprintf("chatcli-aiops-%s-%s", role, sanitize(user)),
        },
    }

    // Set the desired state inside the mutate function so that it is applied
    // both when the binding is created and when it already exists
    _, err := controllerutil.CreateOrUpdate(ctx, rm.client, binding, func() error {
        binding.RoleRef = rbacv1.RoleRef{
            APIGroup: "rbac.authorization.k8s.io",
            Kind:     "ClusterRole",
            Name:     fmt.Sprintf("chatcli-aiops-%s", role),
        }
        binding.Subjects = []rbacv1.Subject{
            {Kind: "User", Name: user, APIGroup: "rbac.authorization.k8s.io"},
        }
        return nil
    })
    if err != nil {
        return fmt.Errorf("failed to grant role %s to %s: %w", role, user, err)
    }

    // Record in audit trail
    return rm.auditRecorder.RecordRoleChange(user, role, "granted")
}
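
GrantRole relies on a `sanitize` helper that is not shown. One possible shape for it is sketched below, assuming the goal is a conservative, DNS-style binding name; the actual helper may behave differently, and `bindingName` is a hypothetical wrapper that only mirrors the name format used above.

```go
package main

import (
	"fmt"
	"strings"
)

// sanitize lowercases the user name and replaces every character outside
// [a-z0-9.-] with '-', then trims leading and trailing '-' or '.'.
// This is an assumed behavior, not the platform's documented one.
func sanitize(user string) string {
	var b strings.Builder
	for _, r := range strings.ToLower(user) {
		switch {
		case r >= 'a' && r <= 'z', r >= '0' && r <= '9', r == '-', r == '.':
			b.WriteRune(r)
		default:
			b.WriteByte('-')
		}
	}
	return strings.Trim(b.String(), "-.")
}

// bindingName mirrors the ClusterRoleBinding name format used by GrantRole.
func bindingName(role, user string) string {
	return fmt.Sprintf("chatcli-aiops-%s-%s", role, sanitize(user))
}

func main() {
	fmt.Println(bindingName("viewer", "john@company.com"))
	// chatcli-aiops-viewer-john-company.com
}
```

A lossy mapping like this can make two distinct users collide on the same binding name (for example `john@company.com` and `john-company.com`), so appending a short hash of the original user name may be worth considering.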

Audit REST API

The platform exposes REST endpoints for querying and exporting audit events.

GET /api/v1/audit

Query events with filters. Query parameters:
| Parameter | Type | Description | Example |
| --- | --- | --- | --- |
| type | string | Filter by EventType | RemediationExecuted |
| severity | string | Filter by severity | high |
| from | ISO 8601 | Start of time window | 2026-03-18T00:00:00Z |
| to | ISO 8601 | End of time window | 2026-03-19T23:59:59Z |
| actor | string | Filter by actor | john@company.com |
| correlation_id | string | Filter by correlation | inc-8f2a4b |
| limit | int | Maximum results (default: 100) | 50 |
| offset | int | Offset for pagination | 100 |
Request example:
curl -s "https://chatcli.example.com/api/v1/audit?\
type=RemediationExecuted&\
severity=high&\
from=2026-03-18T00:00:00Z&\
to=2026-03-19T23:59:59Z&\
limit=10" | jq .
Response example:
{
  "total": 15,
  "returned": 10,
  "events": [
    {
      "name": "audit-1710856200-a7f3b2",
      "eventType": "RemediationExecuted",
      "timestamp": "2026-03-19T14:30:00.123Z",
      "actor": {
        "type": "controller",
        "name": "remediation-reconciler"
      },
      "resource": {
        "kind": "RemediationPlan",
        "name": "plan-rollback-api-server",
        "namespace": "production"
      },
      "details": {
        "action": "Rollback",
        "target": "deployment/api-server",
        "result": "success"
      },
      "correlationID": "inc-8f2a4b",
      "severity": "high"
    }
  ]
}
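
For programmatic clients, the query parameters listed above can be assembled with the Go standard library so that values are escaped correctly. This is a small sketch: `auditQuery` and its `URL` method are hypothetical names, and only the endpoint path and parameter names come from this page.

```go
package main

import (
	"fmt"
	"net/url"
	"strconv"
)

// auditQuery holds the documented filters; zero values are omitted.
type auditQuery struct {
	Type, Severity, From, To, Actor, CorrelationID string
	Limit, Offset                                  int
}

// URL renders the query against base, e.g. "https://chatcli.example.com".
func (q auditQuery) URL(base string) (string, error) {
	u, err := url.Parse(base)
	if err != nil {
		return "", fmt.Errorf("parsing base URL: %w", err)
	}
	u.Path = "/api/v1/audit"

	v := url.Values{}
	set := func(key, value string) {
		if value != "" {
			v.Set(key, value)
		}
	}
	set("type", q.Type)
	set("severity", q.Severity)
	set("from", q.From)
	set("to", q.To)
	set("actor", q.Actor)
	set("correlation_id", q.CorrelationID)
	if q.Limit > 0 {
		v.Set("limit", strconv.Itoa(q.Limit))
	}
	if q.Offset > 0 {
		v.Set("offset", strconv.Itoa(q.Offset))
	}
	u.RawQuery = v.Encode() // keys are sorted, values are percent-encoded
	return u.String(), nil
}

func main() {
	q := auditQuery{Type: "RemediationExecuted", From: "2026-03-18T00:00:00Z", Limit: 10}
	s, err := q.URL("https://chatcli.example.com")
	if err != nil {
		panic(err)
	}
	fmt.Println(s)
}
```

To page through a result set, a client could keep increasing `Offset` by `Limit` until the number of events received reaches the `total` reported in the response.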

GET /api/v1/audit/export

Exports events as NDJSON (newline-delimited JSON, one event per line) for SIEM integration. The response contains every event in the requested period; the example below saves it as audit-export.json.
# Export last 24 hours
curl -s -o audit-export.json \
  "https://chatcli.example.com/api/v1/audit/export?\
from=2026-03-18T14:00:00Z&\
to=2026-03-19T14:00:00Z"

# Check export size
wc -l audit-export.json
# 312 lines (1 event per line, NDJSON format)
Export format (NDJSON):
{"name":"audit-1710856200-a7f3b2","eventType":"RemediationExecuted","timestamp":"2026-03-19T14:30:00.123Z","actor":{"type":"controller","name":"remediation-reconciler"},"resource":{"kind":"RemediationPlan","name":"plan-rollback-api-server","namespace":"production"},"details":{"action":"Rollback"},"correlationID":"inc-8f2a4b","severity":"high"}
{"name":"audit-1710856245-c9d4e1","eventType":"RemediationSucceeded","timestamp":"2026-03-19T14:30:12.456Z","actor":{"type":"controller","name":"remediation-reconciler"},"resource":{"kind":"RemediationPlan","name":"plan-rollback-api-server","namespace":"production"},"details":{"duration":"12s"},"correlationID":"inc-8f2a4b","severity":"high"}
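
Because the export is one JSON object per line, a consumer can decode it line by line instead of loading the whole file as a single document. The sketch below assumes only the fields visible in the sample lines above; `exportedEvent`, `readNDJSON`, and `parseNDJSON` are hypothetical names.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"io"
	"strings"
)

// exportedEvent covers a subset of the fields seen in the export sample.
type exportedEvent struct {
	Name          string `json:"name"`
	EventType     string `json:"eventType"`
	Timestamp     string `json:"timestamp"`
	CorrelationID string `json:"correlationID"`
	Severity      string `json:"severity"`
}

// readNDJSON decodes one event per non-empty line and reports the line
// number of the first malformed record.
func readNDJSON(r io.Reader) ([]exportedEvent, error) {
	var events []exportedEvent
	scanner := bufio.NewScanner(r)
	scanner.Buffer(make([]byte, 0, 64*1024), 1024*1024) // allow lines up to 1 MiB
	lineNo := 0
	for scanner.Scan() {
		lineNo++
		text := strings.TrimSpace(scanner.Text())
		if text == "" {
			continue
		}
		var ev exportedEvent
		if err := json.Unmarshal([]byte(text), &ev); err != nil {
			return nil, fmt.Errorf("line %d: %w", lineNo, err)
		}
		events = append(events, ev)
	}
	return events, scanner.Err()
}

// parseNDJSON is a convenience wrapper for in-memory data.
func parseNDJSON(data string) ([]exportedEvent, error) {
	return readNDJSON(strings.NewReader(data))
}

func main() {
	events, err := parseNDJSON(`{"name":"audit-1710856200-a7f3b2","eventType":"RemediationExecuted","severity":"high"}` + "\n")
	if err != nil {
		panic(err)
	}
	fmt.Println(len(events), events[0].EventType)
}
```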

SIEM Integration

The platform supports event export to SIEM (Security Information and Event Management) systems such as Splunk, Elastic, and Datadog.

Splunk

1. Configure HEC (HTTP Event Collector)

Create an HEC token in Splunk to receive events from the AIOps platform.
2. Create export CronJob

apiVersion: batch/v1
kind: CronJob
metadata:
  name: audit-export-splunk
  namespace: chatcli-system
spec:
  schedule: "*/15 * * * *"   # Every 15 minutes
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: exporter
              image: curlimages/curl:latest
              command:
                - /bin/sh
                - -c
                - |
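                  # Export last 20 minutes (5 min overlap for safety).
                  # The curl image is Alpine-based and ships BusyBox date, which does not
                  # parse relative strings such as '20 minutes ago', so derive FROM from epoch seconds.
                  FROM=$(date -u -d "@$(( $(date +%s) - 1200 ))" +%Y-%m-%dT%H:%M:%SZ)
                  TO=$(date -u +%Y-%m-%dT%H:%M:%SZ)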

                  curl -s "http://chatcli-server.chatcli-system:8080/api/v1/audit/export?from=$FROM&to=$TO" | \
                  while IFS= read -r line; do
                    curl -s -X POST \
                      "https://splunk.example.com:8088/services/collector/event" \
                      -H "Authorization: Splunk $SPLUNK_HEC_TOKEN" \
                      -d "{\"event\": $line, \"sourcetype\": \"chatcli:audit\"}"
                  done
              env:
                - name: SPLUNK_HEC_TOKEN
                  valueFrom:
                    secretKeyRef:
                      name: splunk-credentials
                      key: hec-token
          restartPolicy: OnFailure
3. Create index and dashboards

Configure a dedicated chatcli_audit index in Splunk and create dashboards to visualize events by type, severity, and actor.
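
The CronJob above splices each exported line into the HEC request body with shell string interpolation. An exporter written in Go could build the same envelope with validation, as in this sketch; `hecPayload` is a hypothetical helper, and the `event` and `sourcetype` keys are the ones used in the CronJob.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// hecPayload wraps one exported NDJSON line in the HEC event envelope.
// The raw event is embedded as-is (not re-encoded as a string), and invalid
// JSON is rejected before anything is sent.
func hecPayload(rawEvent string, sourcetype string) (string, error) {
	if !json.Valid([]byte(rawEvent)) {
		return "", errors.New("event is not valid JSON")
	}
	body, err := json.Marshal(struct {
		Event      json.RawMessage `json:"event"`
		Sourcetype string          `json:"sourcetype"`
	}{
		Event:      json.RawMessage(rawEvent),
		Sourcetype: sourcetype,
	})
	if err != nil {
		return "", fmt.Errorf("encoding HEC payload: %w", err)
	}
	return string(body), nil
}

func main() {
	payload, err := hecPayload(`{"name":"audit-1710856200-a7f3b2","severity":"high"}`, "chatcli:audit")
	if err != nil {
		panic(err)
	}
	fmt.Println(payload)
}
```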

Elasticsearch

apiVersion: batch/v1
kind: CronJob
metadata:
  name: audit-export-elastic
  namespace: chatcli-system
spec:
  schedule: "*/15 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: exporter
              image: curlimages/curl:latest
              command:
                - /bin/sh
                - -c
                - |
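                  # BusyBox date (shipped in the Alpine-based curl image) does not parse
                  # '20 minutes ago', so derive FROM from epoch seconds (20 min = 1200 s).
                  FROM=$(date -u -d "@$(( $(date +%s) - 1200 ))" +%Y-%m-%dT%H:%M:%SZ)
                  TO=$(date -u +%Y-%m-%dT%H:%M:%SZ)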

                  curl -s "http://chatcli-server.chatcli-system:8080/api/v1/audit/export?from=$FROM&to=$TO" | \
                  while IFS= read -r line; do
                    curl -s -X POST \
                      "https://elastic.example.com:9200/chatcli-audit/_doc" \
                      -H "Content-Type: application/json" \
                      -u "$ELASTIC_USER:$ELASTIC_PASS" \
                      -d "$line"
                  done
              env:
                - name: ELASTIC_USER
                  valueFrom:
                    secretKeyRef:
                      name: elastic-credentials
                      key: username
                - name: ELASTIC_PASS
                  valueFrom:
                    secretKeyRef:
                      name: elastic-credentials
                      key: password
          restartPolicy: OnFailure
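
Because each run re-exports a 5-minute overlap, posting to `_doc` without an ID indexes the overlapping events twice. One way to make the export idempotent, sketched here as an assumption rather than the platform's documented behavior, is to use the AuditEvent name as the Elasticsearch document ID and send a PUT to `/<index>/_doc/<id>`, which overwrites an existing document instead of adding a duplicate. `docURL` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// docURL builds the document endpoint for an event, using the event name as
// the document ID so that re-sending the same event replaces the earlier copy.
func docURL(baseURL, index, eventName string) string {
	return strings.TrimRight(baseURL, "/") + "/" +
		url.PathEscape(index) + "/_doc/" + url.PathEscape(eventName)
}

func main() {
	fmt.Println(docURL("https://elastic.example.com:9200", "chatcli-audit", "audit-1710856200-a7f3b2"))
	// https://elastic.example.com:9200/chatcli-audit/_doc/audit-1710856200-a7f3b2
}
```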

kubectl Commands

# List all audit events
kubectl get auditevents -n chatcli-system

# Filter by event type
kubectl get auditevents -n chatcli-system \
  -l platform.chatcli.io/event-type=RemediationExecuted

# Filter by severity
kubectl get auditevents -n chatcli-system \
  -l platform.chatcli.io/severity=critical

# Filter by correlation (all events for an incident)
kubectl get auditevents -n chatcli-system \
  -l platform.chatcli.io/correlation-id=inc-8f2a4b \
  --sort-by=.spec.timestamp

# View details of a specific event
kubectl get auditevent audit-1710856200-a7f3b2 -n chatcli-system -o yaml

# Count events by type (across all retained events)
kubectl get auditevents -n chatcli-system -o json | \
  jq '[.items[].spec.eventType] | group_by(.) | map({type: .[0], count: length})'

# Check configured RBAC roles
kubectl get clusterroles -l platform.chatcli.io/rbac-role

# List role bindings with their subjects (pipe to grep <user> to narrow to one user)
kubectl get clusterrolebindings -l platform.chatcli.io/rbac-role \
  -o jsonpath='{range .items[*]}{.metadata.name}: {.subjects[*].name}{"\n"}{end}'

# Generate compliance report (via API)
curl -s "https://chatcli.example.com/api/v1/compliance/report?\
namespace=production&window=7d" | jq .

Event Retention

AuditEvents are immutable but not eternal. Configure a retention policy to avoid excessive CR accumulation in etcd.
# CronJob for cleaning up old events (>90 days)
apiVersion: batch/v1
kind: CronJob
metadata:
  name: audit-retention
  namespace: chatcli-system
spec:
  schedule: "0 2 * * 0"    # Every Sunday at 02:00
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: chatcli-superadmin
          containers:
            - name: cleanup
              image: bitnami/kubectl:latest
              command:
                - /bin/sh
                - -c
                - |
                  CUTOFF=$(date -u -d '90 days ago' +%Y-%m-%dT%H:%M:%SZ)
                  kubectl get auditevents -n chatcli-system -o json | \
                  jq -r ".items[] | select(.spec.timestamp < \"$CUTOFF\") | .metadata.name" | \
                  xargs -r kubectl delete auditevent -n chatcli-system
          restartPolicy: OnFailure
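
The selection step of the cleanup job (keep everything newer than the cutoff, delete the rest) can also be expressed in Go, parsing the timestamps instead of comparing them as strings. This is an illustrative sketch: `storedEvent`, `expired`, and `expiredAt` are hypothetical names, and the 90-day window is the one used in the CronJob.

```go
package main

import (
	"fmt"
	"time"
)

// storedEvent carries the two fields the retention check needs.
type storedEvent struct {
	Name      string
	Timestamp string // RFC 3339, as in spec.timestamp
}

// expired returns the names of events whose timestamp is older than
// now minus retention.
func expired(events []storedEvent, now time.Time, retention time.Duration) ([]string, error) {
	cutoff := now.Add(-retention)
	var names []string
	for _, ev := range events {
		ts, err := time.Parse(time.RFC3339, ev.Timestamp) // fractional seconds are accepted
		if err != nil {
			return nil, fmt.Errorf("event %s: %w", ev.Name, err)
		}
		if ts.Before(cutoff) {
			names = append(names, ev.Name)
		}
	}
	return names, nil
}

// expiredAt is a convenience wrapper taking "now" as an RFC 3339 string
// and the retention period in days.
func expiredAt(events []storedEvent, now string, retentionDays int) ([]string, error) {
	t, err := time.Parse(time.RFC3339, now)
	if err != nil {
		return nil, fmt.Errorf("parsing now: %w", err)
	}
	return expired(events, t, time.Duration(retentionDays)*24*time.Hour)
}

func main() {
	events := []storedEvent{
		{"audit-old", "2025-12-01T00:00:00Z"},
		{"audit-new", "2026-03-19T14:30:00.123Z"},
	}
	names, err := expiredAt(events, "2026-03-20T00:00:00Z", 90)
	if err != nil {
		panic(err)
	}
	fmt.Println(names) // [audit-old]
}
```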

Next Steps

Decision Engine

See how every engine decision generates AuditEvents for complete traceability.

Multi-Cluster Federation

Cross-cluster AuditEvents with unified CorrelationID.

Chaos Engineering

Chaos experiments generate audit events for game day compliance.

AIOps Platform

Return to the AIOps platform overview.