In production environments, not every automatic remediation should be executed without human oversight. The ChatCLI Approval Workflow system allows defining granular policies that control which actions require approval, who can approve, and during which change windows actions are allowed.

Why Approval Workflows are Essential

Security

Prevents automatic remediation from causing greater impact than the original problem (e.g., accidental rollback in production)

Compliance

Complete audit trail of who approved, when, and why. Required for SOC2, PCI-DSS, HIPAA.

Trust

Teams adopt AIOps more easily when they know that critical actions require human approval.
Without approval workflows, an AI that detects a false positive could execute an unnecessary rollback, affecting a healthy deployment. With approval policies, high-impact actions are blocked until a human validates the analysis and blast radius.

Flow Overview

ApprovalPolicy CRD

The ApprovalPolicy defines rules that determine which remediation actions need approval, under which conditions, and who can approve.
apiVersion: platform.chatcli.io/v1alpha1
kind: ApprovalPolicy
metadata:
  name: production-approval-policy
  namespace: production
spec:
  rules:
    - name: auto-approve-low-risk
      match:
        severities: [low, medium]
        actionTypes: [RestartDeployment, ScaleDeployment]
        namespaces: [staging, development]
      mode: auto
      autoApproveConditions:
        minConfidence: 0.85
        maxSeverity: medium
        historicalSuccessRate: 0.90

    - name: manual-approve-rollback
      match:
        actionTypes: [RollbackDeployment]
        namespaces: [production, payments]
      mode: manual
      requiredApprovers: 1
      timeoutMinutes: 30

    - name: quorum-critical-production
      match:
        severities: [critical, high]
        namespaces: [production]
        resourceKinds: [Deployment, StatefulSet]
      mode: quorum
      requiredApprovers: 2
      timeoutMinutes: 15

    - name: block-critical-namespace-rollback
      match:
        actionTypes: [RollbackDeployment]
        namespaces: [payments, auth]
        severities: [critical]
      mode: manual
      requiredApprovers: 2
      timeoutMinutes: 10

  changeWindow:
    timezone: "America/Sao_Paulo"
    allowedDays: [1, 2, 3, 4, 5]    # Monday to Friday
    startHour: 9
    endHour: 18
    overrideForCritical: true         # Critical ignores change window
    blackoutDates:
      - date: "2026-03-20"
        reason: "Q1 pre-release freeze"
      - date: "2026-12-24"
        reason: "Christmas Eve"
      - date: "2026-12-31"
        reason: "New Year's Eve"

Spec Fields

ApprovalRule

Each rule defines a match + mode pair with specific configurations.
| Field | Type | Required | Description |
|---|---|---|---|
| name | string | Yes | Unique rule name within the policy |
| match | ApprovalMatch | Yes | Matching criteria |
| mode | string | Yes | auto, manual, quorum |
| requiredApprovers | int | For manual/quorum | Minimum number of approvers |
| timeoutMinutes | int | No | Timeout in minutes (default: 60) |
| autoApproveConditions | AutoApproveConditions | For auto | Conditions for auto-approval |

ApprovalMatch

Defines which remediations are covered by this rule. The logic is AND between fields and OR within each field.
| Field | Type | Description |
|---|---|---|
| severities | []string | critical, high, medium, low |
| actionTypes | []string | ScaleDeployment, RestartDeployment, RollbackDeployment, PatchConfig, AdjustResources, DeletePod, Custom |
| namespaces | []string | Affected K8s namespaces |
| resourceKinds | []string | Deployment, StatefulSet, DaemonSet |
When multiple rules match, the most restrictive rule prevails. The priority order is quorum > manual > auto. For example, if one rule requires quorum with 2 approvers and another requires manual with 1, the system applies quorum with 2.
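The matching semantics (AND between fields, OR within each field) can be sketched in a few lines of Python. This is an illustration only, not the operator's code: the `ApprovalMatch` dataclass and `rule_matches` helper are hypothetical names, and treating an unset field as a wildcard is an assumption inferred from the example policies, which omit fields they do not constrain.

```python
from dataclasses import dataclass, field


@dataclass
class ApprovalMatch:
    # An empty list stands for "field not set"; this sketch assumes
    # an unset field matches anything (wildcard).
    severities: list = field(default_factory=list)
    action_types: list = field(default_factory=list)
    namespaces: list = field(default_factory=list)
    resource_kinds: list = field(default_factory=list)


def rule_matches(match, severity, action_type, namespace, resource_kind):
    """AND across fields, OR within the values of each field."""
    checks = [
        (match.severities, severity),
        (match.action_types, action_type),
        (match.namespaces, namespace),
        (match.resource_kinds, resource_kind),
    ]
    return all(not allowed or value in allowed for allowed, value in checks)


rollback_rule = ApprovalMatch(
    action_types=["RollbackDeployment"],
    namespaces=["production", "payments"],
)
print(rule_matches(rollback_rule, "critical", "RollbackDeployment", "production", "Deployment"))  # True
print(rule_matches(rollback_rule, "critical", "RollbackDeployment", "staging", "Deployment"))     # False
```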

Three Approval Modes

Auto-approve: The system automatically approves if all autoApproveConditions are met. Otherwise, it escalates to manual.
| Condition | Type | Description |
|---|---|---|
| minConfidence | float64 | Minimum AI analysis confidence (0.0-1.0) |
| maxSeverity | string | Maximum severity for auto-approve |
| historicalSuccessRate | float64 | Minimum historical success rate for this action type |
mode: auto
autoApproveConditions:
  minConfidence: 0.90      # AI has >= 90% confidence
  maxSeverity: medium      # Up to medium severity
  historicalSuccessRate: 0.85  # >= 85% success in similar actions
Evaluation logic:
auto_approve = (
  ai_confidence >= minConfidence AND
  severity <= maxSeverity AND
  historical_success_rate >= historicalSuccessRate
)

If auto_approve = false -> escalates to manual mode (1 approver)
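The evaluation logic above compares severities with `<=`, which implies an ordering over the severity names. A minimal Python sketch, assuming the ordering low < medium < high < critical; the function and the `SEVERITY_RANK` table are hypothetical and only illustrate the rule:

```python
# Assumed ordering of the severity names used by the policy.
SEVERITY_RANK = {"low": 0, "medium": 1, "high": 2, "critical": 3}


def evaluate_auto_approve(ai_confidence, severity, historical_success_rate, conditions):
    """Return "auto" when every condition holds, otherwise escalate to "manual"."""
    ok = (
        ai_confidence >= conditions["minConfidence"]
        and SEVERITY_RANK[severity] <= SEVERITY_RANK[conditions["maxSeverity"]]
        and historical_success_rate >= conditions["historicalSuccessRate"]
    )
    return "auto" if ok else "manual"


conditions = {"minConfidence": 0.90, "maxSeverity": "medium", "historicalSuccessRate": 0.85}
print(evaluate_auto_approve(0.93, "low", 0.91, conditions))  # auto
print(evaluate_auto_approve(0.87, "low", 0.92, conditions))  # manual (confidence below 0.90)
```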

ChangeWindowSpec

Defines change windows that control when automatic remediation can be executed.
| Field | Type | Required | Description |
|---|---|---|---|
| timezone | string | Yes | IANA timezone (e.g., America/Sao_Paulo) |
| allowedDays | []int | Yes | Allowed days (0=Sunday, 6=Saturday) |
| startHour | int | Yes | Window start hour (0-23) |
| endHour | int | Yes | Window end hour (0-23) |
| overrideForCritical | bool | No | If true, critical severity ignores the change window |
| blackoutDates | []BlackoutDate | No | Specific dates with total freeze |
Outside the change window, remediation actions are queued, not discarded. They are executed automatically when the next window opens, provided the Issue is still active and the approval has not expired.
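A change-window check can be sketched as follows. Several points are assumptions, since the spec does not state them: `endHour` is treated as exclusive, `overrideForCritical` is assumed to bypass blackout dates as well as the day and hour checks, and the function name `in_change_window` is hypothetical. The timezone is passed as a `tzinfo`; for an IANA name such as America/Sao_Paulo you could pass `zoneinfo.ZoneInfo("America/Sao_Paulo")`.

```python
from datetime import datetime, timezone


def in_change_window(now, window, severity, tz):
    """Return True if an action may run at `now` (a timezone-aware datetime)."""
    local = now.astimezone(tz)
    # Assumption: the critical override bypasses every other check.
    if severity == "critical" and window.get("overrideForCritical", False):
        return True
    blackout = {b["date"] for b in window.get("blackoutDates", [])}
    if local.date().isoformat() in blackout:
        return False
    # Python's weekday() is 0=Monday; the spec uses 0=Sunday.
    day = (local.weekday() + 1) % 7
    if day not in window["allowedDays"]:
        return False
    # Assumption: the window covers startHour <= hour < endHour.
    return window["startHour"] <= local.hour < window["endHour"]


window = {
    "allowedDays": [1, 2, 3, 4, 5],
    "startHour": 9,
    "endHour": 18,
    "overrideForCritical": True,
    "blackoutDates": [{"date": "2026-03-27", "reason": "End of Q1 freeze"}],
}
thursday = datetime(2026, 3, 19, 14, 30, tzinfo=timezone.utc)
saturday = datetime(2026, 3, 21, 10, 0, tzinfo=timezone.utc)
print(in_change_window(thursday, window, "high", timezone.utc))      # True
print(in_change_window(saturday, window, "high", timezone.utc))      # False
print(in_change_window(saturday, window, "critical", timezone.utc))  # True
```

If blackout dates should freeze critical actions too, move the blackout check above the override check.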

ApprovalRequest CRD

The ApprovalRequest is automatically created by the RemediationReconciler when an action requires approval. It contains all the information needed for the approver to make an informed decision.
apiVersion: platform.chatcli.io/v1alpha1
kind: ApprovalRequest
metadata:
  name: approve-api-gateway-rollback-1234
  namespace: production
  labels:
    platform.chatcli.io/issue: api-gateway-oom-kill-1771276354
    platform.chatcli.io/action-type: RollbackDeployment
    platform.chatcli.io/severity: critical
spec:
  issueRef:
    name: api-gateway-oom-kill-1771276354
  remediationPlanRef:
    name: api-gateway-oom-kill-plan-1
  requestedAction:
    type: RollbackDeployment
    params:
      toRevision: "previous"
  policyRef:
    name: production-approval-policy
    rule: manual-approve-rollback
  requiredApprovers: 1
  timeoutMinutes: 30

  blastRadius:
    affectedPods: 5
    affectedServices:
      - name: api-gateway
        namespace: production
        endpoints: 3
      - name: api-gateway-internal
        namespace: production
        endpoints: 2
    affectedIngresses:
      - name: api-gateway-ingress
        namespace: production
    riskLevel: high
    estimatedDowntime: "30s"
    rollbackAvailable: true

  evidence:
    aiConfidence: 0.87
    analysis: "High restart count caused by OOMKilled. Container memory limit (512Mi) insufficient."
    historicalSuccessRate: 0.92
    similarIncidents: 3
    lastSimilarResolution: "RollbackDeployment to revision 5 (2 days ago, success)"

status:
  state: Pending            # Pending | Approved | Rejected | Expired
  decisions: []
  createdAt: "2026-03-19T14:30:00Z"
  expiresAt: "2026-03-19T15:00:00Z"

Spec Fields

Root

| Field | Type | Required | Description |
|---|---|---|---|
| issueRef | ObjectRef | Yes | Reference to the Issue that originated the request |
| remediationPlanRef | ObjectRef | Yes | Reference to the paused RemediationPlan |
| requestedAction | ActionSpec | Yes | Action that requires approval |
| policyRef | PolicyRef | Yes | Reference to the policy and rule that triggered it |
| requiredApprovers | int | Yes | Minimum number of approvers |
| timeoutMinutes | int | Yes | Time until expiration |
| blastRadius | BlastRadiusAssessment | Yes | Impact assessment |
| evidence | ApprovalEvidence | Yes | Evidence for decision-making |

BlastRadiusAssessment

| Field | Type | Description |
|---|---|---|
| affectedPods | int | Number of pods that will be affected by the action |
| affectedServices | []ServiceRef | Services that route to the affected pods |
| affectedIngresses | []IngressRef | Ingresses that expose the affected services |
| riskLevel | string | critical, high, medium, low (calculated) |
| estimatedDowntime | string | Estimated downtime during the action |
| rollbackAvailable | bool | Whether the action can be reverted |

ApprovalEvidence

| Field | Type | Description |
|---|---|---|
| aiConfidence | float64 | AI analysis confidence level (0.0-1.0) |
| analysis | string | AI analysis summary |
| historicalSuccessRate | float64 | Success rate of similar actions in history |
| similarIncidents | int | Number of similar incidents in the past |
| lastSimilarResolution | string | Description of the last similar resolution |

ApprovalDecision

Each approval or rejection is recorded as a decision in the status:
| Field | Type | Description |
|---|---|---|
| approver | string | Approver identifier (user or system) |
| decision | string | approved or rejected |
| reason | string | Justification for the decision |
| timestamp | Time | When the decision was made |

ApprovalRequest States

A single rejection is sufficient to block the action, regardless of the number of approvals. This ensures that any team member can veto a risky action.
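The veto rule can be expressed as a small state function over the recorded decisions. This is a sketch under assumptions, not the reconciler's implementation: `resolve_state` is a hypothetical name, approvals are counted per distinct approver, and approval is checked before expiry.

```python
from datetime import datetime, timezone


def resolve_state(decisions, required_approvers, now, expires_at):
    """Derive Pending | Approved | Rejected | Expired from the recorded decisions."""
    if any(d["decision"] == "rejected" for d in decisions):
        return "Rejected"  # a single rejection vetoes the action
    # Assumption: the same approver approving twice counts once.
    approvers = {d["approver"] for d in decisions if d["decision"] == "approved"}
    if len(approvers) >= required_approvers:
        return "Approved"
    if now >= expires_at:
        return "Expired"
    return "Pending"


created = datetime(2026, 3, 19, 14, 30, tzinfo=timezone.utc)
expires = datetime(2026, 3, 19, 15, 0, tzinfo=timezone.utc)
decisions = [
    {"approver": "edilson", "decision": "approved"},
    {"approver": "maria", "decision": "rejected"},
]
print(resolve_state(decisions, 2, created, expires))      # Rejected
print(resolve_state(decisions[:1], 1, created, expires))  # Approved
print(resolve_state([], 1, expires, expires))             # Expired
```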

Blast Radius Calculator

The blast radius calculator evaluates the potential impact of a remediation action before requesting approval.

How It Works

1. Query deployment pods

The calculator lists all pods managed by the target deployment using label selectors.
# Count the pods selected by the deployment's labels
affectedPods=$(kubectl get pods -l app=api-gateway -n production --no-headers | wc -l)  # e.g., 5
2. Find services routing to the pods

For each Service in the namespace, checks if the selector matches the deployment pod labels.
for service in namespace.services:
  if service.selector matches pod.labels:
    affectedServices.append(service)
3. Find ingresses exposing the services

For each Ingress in the namespace, checks if it references any of the affected services.
for ingress in namespace.ingresses:
  for rule in ingress.rules:
    if rule.backend.service in affectedServices:
      affectedIngresses.append(ingress)
4. Calculate risk level

The risk level is determined by the number of affected pods:
if affectedPods > 10:    riskLevel = "critical"
elif affectedPods > 5:   riskLevel = "high"
elif affectedPods > 2:   riskLevel = "medium"
else:                    riskLevel = "low"
5. Estimate downtime

Based on the action type:
| Action | Estimated Downtime |
|---|---|
| ScaleDeployment (up) | 0s (no pods removed) |
| RestartDeployment | ~30s (rolling update) |
| RollbackDeployment | ~30-60s (rolling update) |
| AdjustResources | ~30s (rolling update) |
| DeletePod | ~10s (recreation by ReplicaSet) |
| PatchConfig | 0s (no restart) |
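The first four steps above can be combined into one self-contained Python sketch. Plain dictionaries stand in for Kubernetes objects, so the field names (`labels`, `selector`, `backends`) and the function names are hypothetical; only the selector-matching rule and the pod-count thresholds come from the steps described here. Downtime estimation is left out because it is a simple lookup by action type.

```python
def selector_matches(selector, labels):
    """A Service selects a pod when every selector key/value appears in the pod labels."""
    return bool(selector) and all(labels.get(k) == v for k, v in selector.items())


def risk_level(affected_pods):
    # Thresholds from the "Calculate risk level" step, checked from highest to lowest.
    if affected_pods > 10:
        return "critical"
    if affected_pods > 5:
        return "high"
    if affected_pods > 2:
        return "medium"
    return "low"


def blast_radius(pods, services, ingresses):
    affected_services = [
        s["name"] for s in services
        if any(selector_matches(s["selector"], p["labels"]) for p in pods)
    ]
    affected_ingresses = [
        i["name"] for i in ingresses
        if any(backend in affected_services for backend in i["backends"])
    ]
    return {
        "affectedPods": len(pods),
        "affectedServices": affected_services,
        "affectedIngresses": affected_ingresses,
        "riskLevel": risk_level(len(pods)),
    }


pods = [{"labels": {"app": "api-gateway"}} for _ in range(6)]
services = [
    {"name": "api-gateway", "selector": {"app": "api-gateway"}},
    {"name": "worker", "selector": {"app": "worker"}},
]
ingresses = [{"name": "api-gateway-ingress", "backends": ["api-gateway"]}]
print(blast_radius(pods, services, ingresses))
```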

Integration with RemediationReconciler

Complete Flow

Control Annotation

The RemediationReconciler uses the platform.chatcli.io/approval-pending annotation to control the flow:
metadata:
  annotations:
    platform.chatcli.io/approval-pending: "approve-api-gateway-rollback-1234"
When this annotation is present:
  1. The reconciler does not execute any action
  2. Queries the status of the referenced ApprovalRequest
  3. Removes the annotation only when the request is Approved
  4. If Rejected or Expired, marks the plan as Failed
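The gate described by this list can be modeled as a pure function. The sketch below is an assumption-laden illustration rather than the reconciler's code: `reconcile_gate` and its return values (`execute`, `wait`, `fail`) are hypothetical, and a request that cannot be found is treated as still pending.

```python
ANNOTATION = "platform.chatcli.io/approval-pending"


def reconcile_gate(plan_annotations, request_states):
    """Return (next_step, annotations) for a plan, given ApprovalRequest states by name."""
    request_name = plan_annotations.get(ANNOTATION)
    if request_name is None:
        return "execute", plan_annotations
    # Assumption: an unknown request is treated as still pending.
    state = request_states.get(request_name, "Pending")
    if state == "Approved":
        remaining = {k: v for k, v in plan_annotations.items() if k != ANNOTATION}
        return "execute", remaining  # annotation removed only on approval
    if state in ("Rejected", "Expired"):
        return "fail", plan_annotations
    return "wait", plan_annotations


annotations = {ANNOTATION: "approve-api-gateway-rollback-1234"}
print(reconcile_gate(annotations, {"approve-api-gateway-rollback-1234": "Pending"}))
print(reconcile_gate(annotations, {"approve-api-gateway-rollback-1234": "Approved"}))
```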

How to Approve

Via kubectl

The most direct way to approve is using annotations:
# Approve
kubectl annotate approvalrequest approve-api-gateway-rollback-1234 \
  -n production \
  platform.chatcli.io/approve="edilson:LGTM, acceptable blast radius"

# Reject
kubectl annotate approvalrequest approve-api-gateway-rollback-1234 \
  -n production \
  platform.chatcli.io/reject="edilson:Risk too high, investigate memory leak first"
Annotation format:
platform.chatcli.io/approve="<user>:<reason>"
platform.chatcli.io/reject="<user>:<reason>"
The ApprovalRequest reconciler detects the annotation, records the decision in the status, and removes the annotation.
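Parsing the `<user>:<reason>` value is straightforward if you split on the first colon only, so that reasons containing colons survive intact. A minimal sketch; `parse_decision_annotation` is a hypothetical helper, and splitting on the first colon is an assumption about how the format is meant to be read:

```python
def parse_decision_annotation(key, value):
    """Turn an approve/reject annotation into a decision record."""
    prefix = "platform.chatcli.io/"
    verbs = {"approve": "approved", "reject": "rejected"}
    verb = key[len(prefix):] if key.startswith(prefix) else ""
    if verb not in verbs:
        raise ValueError(f"not a decision annotation: {key}")
    # Split on the first colon only, so the reason may itself contain colons.
    user, sep, reason = value.partition(":")
    if not sep or not user.strip():
        raise ValueError("expected '<user>:<reason>'")
    return {"approver": user.strip(), "decision": verbs[verb], "reason": reason.strip()}


print(parse_decision_annotation(
    "platform.chatcli.io/approve", "edilson:LGTM, acceptable blast radius"
))
```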

Via REST API

The operator exposes a REST API for integrations:
curl -X POST \
  http://localhost:8090/api/v1/approvals/approve-api-gateway-rollback-1234/approve \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $TOKEN" \
  -d '{
    "approver": "edilson",
    "reason": "LGTM, acceptable blast radius. AI confidence 87% with success history."
  }'
API response (example):
{
  "name": "approve-api-gateway-rollback-1234",
  "namespace": "production",
  "state": "Approved",
  "requestedAction": {
    "type": "RollbackDeployment",
    "params": {"toRevision": "previous"}
  },
  "blastRadius": {
    "affectedPods": 5,
    "riskLevel": "high"
  },
  "decisions": [
    {
      "approver": "edilson",
      "decision": "approved",
      "reason": "LGTM, acceptable blast radius",
      "timestamp": "2026-03-19T14:35:00Z"
    }
  ]
}
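The same call can be made from Python using only the standard library. The sketch below builds the request without sending it; the URL path, headers, and body fields mirror the curl example, while `build_approve_request` and the `example-token` placeholder are hypothetical.

```python
import json
import urllib.request


def build_approve_request(base_url, name, approver, reason, token):
    """Build (but do not send) the POST request shown in the curl example."""
    url = f"{base_url}/api/v1/approvals/{name}/approve"
    body = json.dumps({"approver": approver, "reason": reason}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json", "Authorization": f"Bearer {token}"},
    )


req = build_approve_request(
    "http://localhost:8090",
    "approve-api-gateway-rollback-1234",
    "edilson",
    "LGTM, acceptable blast radius",
    "example-token",  # placeholder; use a real token in practice
)
print(req.get_method(), req.full_url)

# To send it (requires a reachable operator):
# with urllib.request.urlopen(req, timeout=10) as resp:
#     print(json.load(resp)["state"])
```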

Via Slack (interactive)

When Slack is configured as a channel via NotificationPolicy, the Slack message for each ApprovalRequest includes interactive Block Kit buttons:
  • Approve: Records approval with the Slack user as approver
  • Reject: Opens a dialog for rejection reason
  • Details: Expands blast radius and AI evidence
The interactive Slack integration requires additional configuration of a Slack App with Interactive Components enabled and a callback endpoint pointing to the operator.
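For orientation, a Block Kit message with three buttons has roughly the shape below. This is a hand-written sketch, not the payload the operator sends: the `action_id` values, the summary text, and the `approval_blocks` helper are all hypothetical, and only the block and button structure follows Slack's Block Kit format.

```python
def approval_blocks(request_name, action_type, namespace, risk_level):
    """Build a Block Kit layout: a summary section plus Approve/Reject/Details buttons."""
    summary = f"*{action_type}* in `{namespace}` needs approval (risk: {risk_level})"

    def button(label, action_id, style=None):
        element = {
            "type": "button",
            "text": {"type": "plain_text", "text": label},
            "action_id": action_id,  # hypothetical identifiers
            "value": request_name,   # lets the callback find the ApprovalRequest
        }
        if style:
            element["style"] = style
        return element

    return [
        {"type": "section", "text": {"type": "mrkdwn", "text": summary}},
        {
            "type": "actions",
            "elements": [
                button("Approve", "approval_approve", "primary"),
                button("Reject", "approval_reject", "danger"),
                button("Details", "approval_details"),
            ],
        },
    ]


blocks = approval_blocks(
    "approve-api-gateway-rollback-1234", "RollbackDeployment", "production", "high"
)
print([e["text"]["text"] for e in blocks[1]["elements"]])  # ['Approve', 'Reject', 'Details']
```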

Complete YAML Examples

Auto-approve for Low Severity + High Confidence

apiVersion: platform.chatcli.io/v1alpha1
kind: ApprovalPolicy
metadata:
  name: staging-auto-approve
  namespace: staging
spec:
  rules:
    - name: auto-approve-all-staging
      match:
        severities: [low, medium]
        actionTypes:
          - RestartDeployment
          - ScaleDeployment
          - AdjustResources
          - DeletePod
        namespaces: [staging]
      mode: auto
      autoApproveConditions:
        minConfidence: 0.80
        maxSeverity: medium
        historicalSuccessRate: 0.75

    - name: manual-for-rollback-staging
      match:
        actionTypes: [RollbackDeployment]
        namespaces: [staging]
      mode: manual
      requiredApprovers: 1
      timeoutMinutes: 60

  changeWindow:
    timezone: "America/Sao_Paulo"
    allowedDays: [0, 1, 2, 3, 4, 5, 6]   # All days
    startHour: 0
    endHour: 23

Quorum of 2 Approvers for Production

apiVersion: platform.chatcli.io/v1alpha1
kind: ApprovalPolicy
metadata:
  name: production-strict
  namespace: production
spec:
  rules:
    - name: quorum-all-production-actions
      match:
        severities: [critical, high, medium]
        namespaces: [production]
      mode: quorum
      requiredApprovers: 2
      timeoutMinutes: 15

    - name: auto-low-severity-restart
      match:
        severities: [low]
        actionTypes: [RestartDeployment]
        namespaces: [production]
      mode: auto
      autoApproveConditions:
        minConfidence: 0.95
        maxSeverity: low
        historicalSuccessRate: 0.98

  changeWindow:
    timezone: "America/Sao_Paulo"
    allowedDays: [1, 2, 3, 4, 5]
    startHour: 9
    endHour: 18
    overrideForCritical: true

Change Window Weekdays 9-18 UTC

apiVersion: platform.chatcli.io/v1alpha1
kind: ApprovalPolicy
metadata:
  name: change-window-policy
  namespace: production
spec:
  rules:
    - name: all-actions-require-approval
      match:
        namespaces: [production]
      mode: manual
      requiredApprovers: 1
      timeoutMinutes: 120

  changeWindow:
    timezone: "UTC"
    allowedDays: [1, 2, 3, 4, 5]    # Monday to Friday
    startHour: 9
    endHour: 18
    overrideForCritical: true
    blackoutDates:
      - date: "2026-03-27"
        reason: "End of Q1 freeze"
      - date: "2026-03-28"
        reason: "End of Q1 freeze"
      - date: "2026-06-30"
        reason: "End of Q2 freeze"
Use overrideForCritical: true to allow critical incidents to be remediated outside the change window. Without this, a critical incident at 3am would be queued until 9am.

RollbackDeployment Block in Critical Namespaces

apiVersion: platform.chatcli.io/v1alpha1
kind: ApprovalPolicy
metadata:
  name: critical-namespace-protection
  namespace: payments
spec:
  rules:
    - name: block-rollback-payments
      match:
        actionTypes: [RollbackDeployment]
        namespaces: [payments, auth, billing]
      mode: quorum
      requiredApprovers: 2
      timeoutMinutes: 10

    - name: block-delete-pod-payments
      match:
        actionTypes: [DeletePod]
        namespaces: [payments]
      mode: manual
      requiredApprovers: 1
      timeoutMinutes: 15

    - name: auto-scale-only
      match:
        actionTypes: [ScaleDeployment]
        namespaces: [payments]
      mode: auto
      autoApproveConditions:
        minConfidence: 0.90
        maxSeverity: high
        historicalSuccessRate: 0.95

  changeWindow:
    timezone: "America/Sao_Paulo"
    allowedDays: [1, 2, 3, 4]    # Mon-Thu (no Friday for pre-weekend freeze)
    startHour: 10
    endHour: 16
    overrideForCritical: true
    blackoutDates:
      - date: "2026-03-31"
        reason: "Month-end close"
      - date: "2026-04-30"
        reason: "Month-end close"

Auditing and Compliance

All approval decisions are recorded in the ApprovalRequest CR status, creating a complete audit trail:
# View approval history (shows only the first decision of each request)
kubectl get approvalrequests -n production \
  -o custom-columns='NAME:.metadata.name,STATE:.status.state,APPROVER:.status.decisions[0].approver,REASON:.status.decisions[0].reason,TIME:.status.decisions[0].timestamp'

# Output:
# NAME                                  STATE      APPROVER   REASON              TIME
# approve-api-gw-rollback-1234          Approved   edilson    LGTM                2026-03-19T14:35:00Z
# approve-worker-scale-5678             Approved   system     Auto-approved       2026-03-19T15:00:00Z
# approve-payment-restart-9012          Rejected   maria      Risk too high       2026-03-19T15:30:00Z
# approve-auth-rollback-3456            Expired    -          Timeout (15min)     2026-03-19T16:00:00Z
For SOC2 and PCI-DSS compliance, export ApprovalRequests periodically:
kubectl get approvalrequests -A -o json | jq '.items[] | {
  name: .metadata.name,
  namespace: .metadata.namespace,
  action: .spec.requestedAction.type,
  state: .status.state,
  decisions: .status.decisions,
  blastRadius: .spec.blastRadius.riskLevel,
  created: .status.createdAt
}' > approval-audit-$(date +%Y%m%d).json
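Quorum requests carry more than one decision, so an audit export may be easier to review with one row per decision. The Python sketch below flattens the JSON produced by `kubectl get approvalrequests -A -o json` into CSV. It assumes the field layout shown in the ApprovalRequest example; `decisions_to_csv` is a hypothetical helper.

```python
import csv
import io


def decisions_to_csv(approval_list):
    """One CSV row per decision; requests without decisions still get one row."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(
        ["name", "namespace", "action", "state", "approver", "decision", "reason", "timestamp"]
    )
    for item in approval_list.get("items", []):
        meta, spec, status = item["metadata"], item["spec"], item.get("status", {})
        base = [
            meta["name"],
            meta["namespace"],
            spec["requestedAction"]["type"],
            status.get("state", ""),
        ]
        for d in status.get("decisions") or [{}]:
            writer.writerow(
                base + [d.get(k, "") for k in ("approver", "decision", "reason", "timestamp")]
            )
    return out.getvalue()


sample = {
    "items": [
        {
            "metadata": {"name": "approve-api-gateway-rollback-1234", "namespace": "production"},
            "spec": {"requestedAction": {"type": "RollbackDeployment"}},
            "status": {
                "state": "Approved",
                "decisions": [
                    {
                        "approver": "edilson",
                        "decision": "approved",
                        "reason": "LGTM, acceptable blast radius",
                        "timestamp": "2026-03-19T14:35:00Z",
                    }
                ],
            },
        }
    ]
}
print(decisions_to_csv(sample))
```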

Prometheus Metrics

The approval workflow system exposes metrics for monitoring:
| Metric | Type | Labels | Description |
|---|---|---|---|
| chatcli_approvals_total | Counter | policy, rule, namespace, decision | Total approvals by decision (approved/rejected/expired/auto) |
| chatcli_approval_duration_seconds | Histogram | policy, rule, decision | Time between request creation and decision |
| chatcli_approvals_pending | Gauge | policy, namespace | Number of pending ApprovalRequests |
| chatcli_approval_auto_approved_total | Counter | policy, rule, namespace | Total auto-approvals |
| chatcli_approval_auto_escalated_total | Counter | policy, rule, namespace | Auto-approvals that escalated to manual |
| chatcli_approval_blast_radius_pods | Histogram | namespace, action_type | Distribution of affected pods in requests |
| chatcli_change_window_blocked_total | Counter | policy, namespace | Actions blocked by change window |
Recommended Prometheus alerts:
groups:
  - name: chatcli-approvals
    rules:
      - alert: ApprovalRequestPendingTooLong
        # Uses only the documented gauge; fires when requests have been pending for 10 minutes
        expr: chatcli_approvals_pending > 0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "ApprovalRequest pending for more than 10 minutes"
          description: "{{ $labels.policy }}/{{ $labels.namespace }} has requests awaiting approval"

      - alert: HighRejectionRate
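        # Aggregate both sides to the same labels so the division matches series correctly
        expr: |
          sum by (policy, namespace) (rate(chatcli_approvals_total{decision="rejected"}[1h]))
            / sum by (policy, namespace) (rate(chatcli_approvals_total[1h])) > 0.3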
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Approval rejection rate above 30%"
          description: "May indicate false positives in AI analysis or an overly permissive policy"

      - alert: ApprovalTimeoutRate
        expr: rate(chatcli_approvals_total{decision="expired"}[1h]) > 0.1
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Approvals expiring due to timeout"
          description: "Teams may not be receiving notifications or timeouts are too short"

Next Steps

Notifications and Escalation

Multi-channel notification system and escalation policies

SLOs and SLAs

Service Level Objectives management with burn rate alerting

AIOps Platform

Deep-dive into the complete AIOps architecture

K8s Operator

Operator configuration and CRDs