The AIOps platform notification system allows alerts, Issue state changes, and SLA violations to be automatically communicated to the right teams, on the right channels, at the right time. Combined with escalation policies, it ensures that no critical incident goes unnoticed.

Overview

The NotificationEngine is triggered whenever:
Event               | Description
Issue state change  | State transition (Detected, Analyzing, Remediating, Resolved, Escalated)
SLO burn rate alert | Burn rate exceeds the threshold in both short and long windows
SLA violation       | Response or resolution time exceeded the limit
Remediation failure | RemediationPlan failed and reached max attempts
Approval request    | ApprovalRequest created and awaiting approval

NotificationPolicy CRD

The NotificationPolicy defines which events trigger notifications, to which channels, and with which throttling rules.
apiVersion: platform.chatcli.io/v1alpha1
kind: NotificationPolicy
metadata:
  name: production-alerts
  namespace: production
spec:
  rules:
    - name: critical-incidents
      match:
        severities: [critical, high]
        signalTypes: [oom_kill, deploy_failing, error_rate]
        namespaces: [production, payments]
        resourceKinds: [Deployment, StatefulSet]
        states: [Detected, Escalated]
      channels:
        - type: slack
          config:
            webhook_url: "https://hooks.slack.com/services/T00/B00/xxxxx"
            channel: "#incidents-critical"
            mention: "@oncall-sre"
        - type: pagerduty
          config:
            routing_key: "R0xxxxxxxxxxxxxxxxxxxxxxxxxxxx"
            severity_mapping:
              critical: critical
              high: error
        - type: opsgenie
          config:
            api_key: "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
            priority_mapping:
              critical: P1
              high: P2
            responders:
              - type: team
                name: platform-sre
            tags: [production, aiops]

    - name: low-severity-digest
      match:
        severities: [medium, low]
        states: [Resolved]
      channels:
        - type: email
          config:
            smtp_host: smtp.company.com
            smtp_port: 587
            from: "aiops@company.com"
            to: ["sre-team@company.com"]
            subject_template: "[ChatCLI AIOps] {{.Severity}} - {{.ResourceName}}"
            tls_skip_verify: false

    - name: all-events-webhook
      match:
        severities: [critical, high, medium, low]
      channels:
        - type: webhook
          config:
            url: "https://internal-api.company.com/aiops/events"
            secret: "whsec_xxxxxxxxxxxxxxxxxxxxxxxx"
            headers:
              X-Source: chatcli-aiops
              X-Environment: production

    - name: teams-infra
      match:
        severities: [critical, high]
        namespaces: [infrastructure]
      channels:
        - type: teams
          config:
            webhook_url: "https://outlook.office.com/webhook/xxx/IncomingWebhook/yyy/zzz"

  throttle:
    deduplicationWindow: "5m"
    maxPerHour: 60
    groupBy: [namespace, resourceName, severity]

Spec Fields

NotificationRule

Each rule defines a match + channels pair. Multiple rules can be defined in the same policy.
Field    | Type              | Required | Description
name     | string            | Yes      | Unique rule name within the policy
match    | NotificationMatch | Yes      | Matching criteria
channels | []ChannelConfig   | Yes      | List of destination channels

NotificationMatch

All fields are optional. If omitted, it acts as a wildcard (match all). When multiple fields are defined, the logic is AND between fields and OR within each field.
Field         | Type     | Description
severities    | []string | critical, high, medium, low
signalTypes   | []string | oom_kill, pod_restart, pod_not_ready, deploy_failing, error_rate, latency_spike
namespaces    | []string | K8s namespaces to monitor
resourceKinds | []string | Deployment, StatefulSet, DaemonSet
states        | []string | Detected, Analyzing, Remediating, Resolved, Escalated, Failed

Tip: combine severities with states for fine-grained control. For example, notify critical only on Detected and Escalated to avoid noise from intermediate transitions.
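Under these semantics, a rule matches an Issue only when every defined field accepts it. A minimal sketch of that logic (illustrative Python; the `matches` helper is hypothetical, but the field semantics follow the table above):

```python
def matches(match: dict, issue: dict) -> bool:
    """Return True if the Issue satisfies every defined match field.

    Each field lists accepted values (OR within a field); an omitted
    field acts as a wildcard; all defined fields must pass (AND).
    """
    field_map = {
        "severities": "severity",
        "signalTypes": "signal_type",
        "namespaces": "namespace",
        "resourceKinds": "resource_kind",
        "states": "state",
    }
    for match_field, issue_field in field_map.items():
        allowed = match.get(match_field)
        if allowed and issue[issue_field] not in allowed:
            return False
    return True

rule = {"severities": ["critical", "high"], "states": ["Detected", "Escalated"]}
issue = {"severity": "critical", "signal_type": "oom_kill",
         "namespace": "production", "resource_kind": "Deployment",
         "state": "Detected"}
# severity is in [critical, high] AND state is in [Detected, Escalated]
```

Note that an empty `match: {}` passes every Issue, which is how the all-events-webhook rule above catches everything.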

ThrottleConfig

Controls the frequency and deduplication of notifications to prevent alert fatigue.
Field               | Type     | Default                   | Description
deduplicationWindow | duration | 5m                        | Temporal window for deduplication. Identical notifications within this window are suppressed.
maxPerHour          | int      | 120                       | Maximum notifications per hour per policy. Excess notifications are queued.
groupBy             | []string | [namespace, resourceName] | Fields used to group notifications. Notifications from the same group are consolidated.
Dedup key = hash(policy_name + rule_name + groupBy_values + severity)

Example with groupBy=[namespace, resourceName, severity]:
  key = hash("production-alerts" + "critical-incidents" + "production" + "api-gateway" + "critical")
Setting maxPerHour too low (e.g., 5) may suppress critical alerts. Use values >= 30 for policies covering critical and high severities. The throttle never blocks the first notification of a new incident.
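The key derivation above can be sketched as follows (hypothetical helper; the actual hash function and concatenation order are internal to the operator):

```python
import hashlib

def dedup_key(policy: str, rule: str, group_values: list, severity: str) -> str:
    """Concatenate policy name, rule name, groupBy values and severity,
    then hash, producing a stable identifier for deduplication."""
    raw = policy + rule + "".join(group_values) + severity
    return hashlib.sha256(raw.encode()).hexdigest()

key = dedup_key("production-alerts", "critical-incidents",
                ["production", "api-gateway"], "critical")
```

Two notifications that share all four components collapse into one within the deduplication window; changing any component (e.g. severity escalating from high to critical) yields a new key and a fresh notification.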

Notification Channels

1. Slack

Sends notifications via Slack Incoming Webhooks using Block Kit for rich formatting.
FieldTypeRequiredDescription
webhook_urlstringYesSlack Incoming Webhook URL
channelstringNoChannel override (requires webhook with permission)
mentionstringNoMention (@user, @here, @channel, @oncall-group)
usernamestringNoBot name (default: ChatCLI AIOps)
icon_emojistringNoBot emoji (default: :robot_face:)
Block Kit colors by severity:
Severity | Color (hex) | Visual
Critical | #E74C3C     | Intense red
High     | #E67E22     | Orange
Medium   | #F1C40F     | Yellow
Low      | #2ECC71     | Green
Block Kit payload sent:
{
  "channel": "#incidents-critical",
  "username": "ChatCLI AIOps",
  "icon_emoji": ":robot_face:",
  "blocks": [
    {
      "type": "header",
      "text": {
        "type": "plain_text",
        "text": "CRITICAL: OOMKilled on api-gateway"
      }
    },
    {
      "type": "section",
      "fields": [
        {"type": "mrkdwn", "text": "*Namespace:*\nproduction"},
        {"type": "mrkdwn", "text": "*Resource:*\nDeployment/api-gateway"},
        {"type": "mrkdwn", "text": "*Signal:*\noom_kill"},
        {"type": "mrkdwn", "text": "*Risk Score:*\n85/100"},
        {"type": "mrkdwn", "text": "*State:*\nDetected"},
        {"type": "mrkdwn", "text": "*Confidence:*\n0.92"}
      ]
    },
    {
      "type": "section",
      "text": {
        "type": "mrkdwn",
        "text": "*Analysis:*\nMemory limit (512Mi) insufficient for current workload..."
      }
    },
    {
      "type": "context",
      "elements": [
        {"type": "mrkdwn", "text": "Issue: `api-gateway-oom-kill-1771276354` | <https://grafana.company.com/d/aiops|Dashboard>"}
      ]
    }
  ],
  "attachments": [{"color": "#E74C3C"}]
}
Minimal example:
channels:
  - type: slack
    config:
      webhook_url: "https://hooks.slack.com/services/T00/B00/xxxxx"

2. PagerDuty

Integrates with PagerDuty via Events API v2 for on-call incident management.
Field              | Type   | Required | Description
routing_key        | string | Yes      | PagerDuty service Integration Key (Events API v2)
severity_mapping   | map    | No       | Mapping of ChatCLI severities to PagerDuty severities
dedup_key_template | string | No       | Template for dedup_key (default: {{.IssueName}})
custom_details     | map    | No       | Extra fields in the payload
Default severity mapping:
ChatCLI  | PagerDuty | Behavior
critical | critical  | Triggers on-call immediately
high     | error     | High priority
medium   | warning   | Moderate priority
low      | info      | Informational
Deduplication: The dedup_key ensures that updates to the same incident do not create duplicate alerts in PagerDuty. The default uses the Issue name, but it can be customized:
dedup_key_template: "{{.Namespace}}-{{.ResourceName}}-{{.SignalType}}"
Payload sent (Events API v2):
{
  "routing_key": "R0xxxxxxxxxxxxxxxxxxxxxxxxxxxx",
  "event_action": "trigger",
  "dedup_key": "api-gateway-oom-kill-1771276354",
  "payload": {
    "summary": "[CRITICAL] OOMKilled on production/api-gateway",
    "severity": "critical",
    "source": "chatcli-aiops",
    "component": "api-gateway",
    "group": "production",
    "class": "oom_kill",
    "custom_details": {
      "risk_score": 85,
      "confidence": 0.92,
      "analysis": "Memory limit (512Mi) insufficient...",
      "issue_name": "api-gateway-oom-kill-1771276354",
      "remediation_plan": "RestartDeployment + AdjustResources"
    }
  }
}
Automatic resolution: When the Issue transitions to Resolved, the NotificationEngine sends event_action: resolve with the same dedup_key, automatically closing the incident in PagerDuty.
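The trigger/resolve pairing can be sketched as payload construction (illustrative Python; `pagerduty_event` is a hypothetical helper, and posting the event to the Events API v2 endpoint is omitted):

```python
def pagerduty_event(routing_key: str, dedup_key: str, action: str,
                    summary: str = "", severity: str = "critical") -> dict:
    """Build an Events API v2 event body.

    A "trigger" opens an incident; a later "resolve" with the same
    dedup_key closes that same incident automatically."""
    event = {
        "routing_key": routing_key,
        "event_action": action,  # "trigger", "acknowledge" or "resolve"
        "dedup_key": dedup_key,
    }
    if action == "trigger":
        event["payload"] = {
            "summary": summary,
            "severity": severity,
            "source": "chatcli-aiops",
        }
    return event

open_evt = pagerduty_event("R0-key", "api-gateway-oom", "trigger",
                           "[CRITICAL] OOMKilled on production/api-gateway")
close_evt = pagerduty_event("R0-key", "api-gateway-oom", "resolve")
```

Because `open_evt` and `close_evt` share the dedup_key, PagerDuty correlates them to one incident rather than opening a second one.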

3. OpsGenie

Integrates with OpsGenie for alerts and on-call management with P1-P4 priorities.
Field            | Type        | Required | Description
api_key          | string      | Yes      | OpsGenie API Key
api_url          | string      | No       | API URL (default: https://api.opsgenie.com)
priority_mapping | map         | No       | Mapping of severities to priorities
responders       | []Responder | No       | Responsible teams or users
tags             | []string    | No       | Tags to categorize alerts
visible_to       | []Responder | No       | Who can see the alert
actions          | []string    | No       | Custom actions on the alert
Default priority mapping:
ChatCLI  | OpsGenie | Description
critical | P1       | Critical - triggers on-call immediately
high     | P2       | High priority
medium   | P3       | Moderate priority
low      | P4       | Low priority
Responder types:
responders:
  - type: team       # Entire team
    name: platform-sre
  - type: user       # Specific user
    username: edilson@company.com
  - type: escalation # OpsGenie escalation policy
    name: sre-escalation
  - type: schedule   # On-call schedule
    name: sre-oncall

4. Email

Sends notifications via SMTP with STARTTLS support and HTML templates.
Field            | Type      | Required | Description
smtp_host        | string    | Yes      | SMTP server host
smtp_port        | int       | Yes      | SMTP port (587 for STARTTLS, 465 for SSL)
from             | string    | Yes      | Sender address
to               | []string  | Yes      | List of recipients
cc               | []string  | No       | Carbon copy
bcc              | []string  | No       | Blind carbon copy
username         | string    | No       | SMTP credential (if auth required)
password_secret  | SecretRef | No       | Reference to the Secret containing the SMTP password
subject_template | string    | No       | Go template for the subject
tls_skip_verify  | bool      | No       | Skip TLS verification (default: false)
html_template    | string    | No       | Custom HTML template (Go template)
Variables available in templates:
Variable          | Description
{{.Severity}}     | Alert severity
{{.ResourceName}} | K8s resource name
{{.Namespace}}    | Namespace
{{.SignalType}}   | Signal type
{{.State}}        | Current Issue state
{{.RiskScore}}    | Risk score (0-100)
{{.Analysis}}     | AI analysis
{{.IssueName}}    | Issue CR name
{{.Timestamp}}    | ISO 8601 timestamp
Example with STARTTLS:
channels:
  - type: email
    config:
      smtp_host: smtp.company.com
      smtp_port: 587
      from: "aiops@company.com"
      to: ["sre-team@company.com", "platform-leads@company.com"]
      cc: ["vp-engineering@company.com"]
      username: "aiops@company.com"
      password_secret:
        name: smtp-credentials
        key: password
      subject_template: "[{{.Severity}}] {{.SignalType}} on {{.Namespace}}/{{.ResourceName}}"
      tls_skip_verify: false
Never put SMTP credentials directly in the NotificationPolicy YAML. Always use password_secret pointing to a Kubernetes Secret.
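Assembling a message from a channel config like the one above can be sketched with the Python standard library (illustrative only; `build_alert_email` is a hypothetical helper, and the commented-out STARTTLS send mirrors the smtp_port: 587 setup):

```python
import smtplib
from email.message import EmailMessage

def build_alert_email(cfg: dict, subject: str, body: str) -> EmailMessage:
    """Assemble the message from the from/to/cc fields of the config."""
    msg = EmailMessage()
    msg["From"] = cfg["from"]
    msg["To"] = ", ".join(cfg["to"])
    if cfg.get("cc"):
        msg["Cc"] = ", ".join(cfg["cc"])
    msg["Subject"] = subject
    msg.set_content(body)
    return msg

cfg = {"smtp_host": "smtp.company.com", "smtp_port": 587,
       "from": "aiops@company.com", "to": ["sre-team@company.com"]}
msg = build_alert_email(cfg, "[critical] oom_kill on production/api-gateway",
                        "Memory limit (512Mi) insufficient for current workload.")

# Sending with STARTTLS (requires network access; sketch only,
# with the password loaded from the referenced Kubernetes Secret):
# with smtplib.SMTP(cfg["smtp_host"], cfg["smtp_port"]) as s:
#     s.starttls()
#     s.login("aiops@company.com", password_from_secret)
#     s.send_message(msg)
```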

5. Webhook

Sends notifications to arbitrary HTTP endpoints with HMAC-SHA256 signing.
Field          | Type     | Required | Description
url            | string   | Yes      | Destination endpoint URL
secret         | string   | No       | Key for HMAC-SHA256 signing
headers        | map      | No       | Custom HTTP headers
method         | string   | No       | HTTP method (default: POST)
timeout        | duration | No       | Request timeout (default: 10s)
retry_count    | int      | No       | Number of retries on failure (default: 3)
retry_interval | duration | No       | Interval between retries (default: 5s)
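The retry behavior described by retry_count and retry_interval can be sketched as a generic retry loop (illustrative Python, not the operator's code; `deliver_with_retry` is a hypothetical helper):

```python
import time

def deliver_with_retry(send, retry_count: int = 3,
                       retry_interval: float = 5.0) -> bool:
    """Call send(); on exception, retry up to retry_count more times,
    sleeping retry_interval seconds between attempts."""
    for attempt in range(retry_count + 1):
        try:
            send()
            return True
        except Exception:
            if attempt < retry_count:
                time.sleep(retry_interval)
    return False
```

With the defaults this gives one initial attempt plus three retries spaced 5s apart before the notification is marked failed.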
HMAC-SHA256 signing: When secret is defined, every request includes the X-ChatCLI-Signature header with the HMAC-SHA256 signature of the body:
X-ChatCLI-Signature: sha256=<hex(HMAC-SHA256(secret, body))>
Validation on the receiver:
import hmac, hashlib

def verify_signature(payload: bytes, signature: str, secret: str) -> bool:
    expected = hmac.new(
        secret.encode(), payload, hashlib.sha256
    ).hexdigest()
    return hmac.compare_digest(f"sha256={expected}", signature)
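The sender side is the mirror image: compute the HMAC over the exact bytes of the serialized body and prefix it with sha256= (sketch; `sign_payload` is a hypothetical helper):

```python
import hmac, hashlib, json

def sign_payload(secret: str, body: bytes) -> str:
    """Produce the value carried in the X-ChatCLI-Signature header."""
    digest = hmac.new(secret.encode(), body, hashlib.sha256).hexdigest()
    return f"sha256={digest}"

# Sign the serialized body, then send those exact bytes: any
# re-serialization on either side would change the signature.
body = json.dumps({"event_type": "issue.state_changed"}).encode()
signature = sign_payload("whsec_example", body)
```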
JSON payload sent:
{
  "event_type": "issue.state_changed",
  "timestamp": "2026-03-19T14:30:00Z",
  "issue": {
    "name": "api-gateway-oom-kill-1771276354",
    "namespace": "production",
    "severity": "critical",
    "state": "Detected",
    "signal_type": "oom_kill",
    "risk_score": 85,
    "resource": {
      "kind": "Deployment",
      "name": "api-gateway"
    }
  },
  "analysis": {
    "text": "Memory limit (512Mi) insufficient...",
    "confidence": 0.92,
    "recommendations": ["Increase memory limit to 1Gi"]
  },
  "remediation": {
    "plan_name": "api-gateway-oom-kill-plan-1",
    "actions": ["RestartDeployment", "AdjustResources"]
  }
}

6. Microsoft Teams

Sends notifications to Microsoft Teams channels via Adaptive Cards and Incoming Webhooks.
Field          | Type   | Required | Description
webhook_url    | string | Yes      | Teams Incoming Webhook URL
title_template | string | No       | Template for the card title
theme_color    | string | No       | Theme color (hex, without #)
Generated Adaptive Card: The NotificationEngine builds an Adaptive Card with sections for:
  • Header with colored severity
  • Resource details (namespace, kind, name)
  • AI analysis (if available)
  • Suggested actions
  • Link to the Grafana dashboard
Card colors by severity:
Severity | theme_color
Critical | E74C3C
High     | E67E22
Medium   | F1C40F
Low      | 2ECC71
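While the engine emits Adaptive Cards, the simplest payload for manually testing a Teams incoming webhook is the legacy MessageCard format, which also uses themeColor. A minimal sketch (assumed format; adapt to your connector):

```python
def teams_card(title: str, text: str, theme_color: str = "E74C3C") -> dict:
    """Minimal legacy MessageCard payload for a Teams incoming webhook."""
    return {
        "@type": "MessageCard",
        "@context": "https://schema.org/extensions",
        "themeColor": theme_color,  # hex without '#', as in the table above
        "title": title,
        "text": text,
    }

card = teams_card("CRITICAL: OOMKilled on api-gateway",
                  "Memory limit (512Mi) insufficient for current workload.")
# POST this dict as JSON to the webhook_url to verify connectivity.
```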

EscalationPolicy CRD

The EscalationPolicy defines the automatic escalation chain when an alert is not acknowledged within the defined timeout.
apiVersion: platform.chatcli.io/v1alpha1
kind: EscalationPolicy
metadata:
  name: production-escalation
  namespace: production
spec:
  match:
    severities: [critical, high]
    namespaces: [production, payments]
  levels:
    - name: L1 - On-Call SRE
      timeout: "5m"
      targets:
        - type: channel
          channel:
            type: slack
            config:
              webhook_url: "https://hooks.slack.com/services/T00/B00/l1-hook"
              channel: "#sre-oncall"
              mention: "@oncall-sre"
        - type: channel
          channel:
            type: pagerduty
            config:
              routing_key: "R0-l1-routing-key"

    - name: L2 - SRE Lead + Platform Team
      timeout: "15m"
      targets:
        - type: user
          user: "sre-lead@company.com"
        - type: team
          team: "platform-engineering"
        - type: channel
          channel:
            type: opsgenie
            config:
              api_key: "xxxxxxxx"
              priority_mapping:
                critical: P1
                high: P1

    - name: L3 - VP Engineering + Incident Commander
      timeout: "30m"
      targets:
        - type: user
          user: "vp-eng@company.com"
        - type: oncall
          oncall:
            schedule: "incident-commander"
            provider: opsgenie
        - type: channel
          channel:
            type: email
            config:
              smtp_host: smtp.company.com
              smtp_port: 587
              from: "aiops-critical@company.com"
              to: ["exec-team@company.com"]

  repeatInterval: "30m"
  maxRepeats: 3

Spec Fields

Field          | Type              | Required | Description
match          | EscalationMatch   | Yes      | Criteria for applying this escalation
levels         | []EscalationLevel | Yes      | Ordered chain of escalation levels
repeatInterval | duration          | No       | Interval to repeat the last level (default: 30m)
maxRepeats     | int               | No       | Maximum repetitions of the last level (default: 3)

EscalationLevel

Field   | Type               | Required | Description
name    | string             | Yes      | Descriptive name of the level
timeout | duration           | Yes      | Time without acknowledgement before escalating to the next level
targets | []EscalationTarget | Yes      | Notification destinations at this level

EscalationTarget

Field   | Type          | Description
type    | string        | channel, user, team, oncall
channel | ChannelConfig | Channel configuration (when type=channel)
user    | string        | User email (when type=user)
team    | string        | Team name (when type=team)
oncall  | OnCallRef     | Reference to the on-call schedule (when type=oncall)

How Escalation Works

Tracking via annotations: The EscalationPolicy reconciler tracks escalation state using annotations on the Issue CR:
Annotation                                     | Description
platform.chatcli.io/escalation-level           | Current level (0=L1, 1=L2, 2=L3)
platform.chatcli.io/escalation-started-at      | Timestamp of escalation start
platform.chatcli.io/escalation-acknowledged    | true when acknowledged
platform.chatcli.io/escalation-acknowledged-by | Who acknowledged
platform.chatcli.io/escalation-repeat-count    | Repeat count of the last level
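The reconciler's level selection can be sketched as walking the cumulative level timeouts against the time elapsed since escalation-started-at (illustrative only; `current_level` is a hypothetical helper):

```python
def current_level(elapsed_seconds: float, timeouts: list) -> int:
    """Return the escalation level index (0=L1) after elapsed time.

    Each unacknowledged level escalates to the next once its timeout
    passes; past the last timeout we stay at the final level (which
    repeatInterval/maxRepeats then re-notify)."""
    cumulative = 0.0
    for level, timeout in enumerate(timeouts):
        cumulative += timeout
        if elapsed_seconds < cumulative:
            return level
    return len(timeouts) - 1

# With timeouts 5m, 15m, 30m: at t=4m we are at L1,
# at t=12m we have escalated to L2, and at t=60m we sit at L3.
timeouts = [300, 900, 1800]
```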
Acknowledgement: To stop the escalation chain, the on-call must acknowledge the alert:
kubectl annotate issue api-gateway-oom-kill-1771276354 \
  platform.chatcli.io/escalation-acknowledged=true \
  platform.chatcli.io/escalation-acknowledged-by=edilson
Or via PagerDuty/OpsGenie (the return webhook updates the annotation automatically).

Complete Examples

Notification Policy: Slack + PagerDuty

apiVersion: platform.chatcli.io/v1alpha1
kind: NotificationPolicy
metadata:
  name: critical-alerts-multi-channel
  namespace: production
spec:
  rules:
    - name: critical-to-slack-and-pagerduty
      match:
        severities: [critical]
        states: [Detected, Escalated]
      channels:
        - type: slack
          config:
            webhook_url: "https://hooks.slack.com/services/T00/B00/critical-hook"
            channel: "#p0-incidents"
            mention: "@here"
        - type: pagerduty
          config:
            routing_key: "R0-critical-routing-key"
            severity_mapping:
              critical: critical

    - name: high-to-slack
      match:
        severities: [high]
        states: [Detected, Remediating, Escalated]
      channels:
        - type: slack
          config:
            webhook_url: "https://hooks.slack.com/services/T00/B00/high-hook"
            channel: "#incidents"

    - name: resolved-to-slack
      match:
        states: [Resolved]
      channels:
        - type: slack
          config:
            webhook_url: "https://hooks.slack.com/services/T00/B00/resolved-hook"
            channel: "#incidents"

  throttle:
    deduplicationWindow: "3m"
    maxPerHour: 100
    groupBy: [namespace, resourceName]

Escalation Policy L1 -> L2 -> L3

apiVersion: platform.chatcli.io/v1alpha1
kind: EscalationPolicy
metadata:
  name: p0-escalation
  namespace: production
spec:
  match:
    severities: [critical]
  levels:
    - name: L1 - Primary On-Call
      timeout: "5m"
      targets:
        - type: channel
          channel:
            type: pagerduty
            config:
              routing_key: "R0-primary-oncall"
        - type: channel
          channel:
            type: slack
            config:
              webhook_url: "https://hooks.slack.com/services/T00/B00/oncall"
              channel: "#sre-oncall"
              mention: "@oncall-primary"

    - name: L2 - Secondary On-Call + SRE Manager
      timeout: "10m"
      targets:
        - type: oncall
          oncall:
            schedule: "secondary-oncall"
            provider: pagerduty
        - type: user
          user: "sre-manager@company.com"
        - type: channel
          channel:
            type: opsgenie
            config:
              api_key: "xxx"
              priority_mapping:
                critical: P1

    - name: L3 - Engineering Leadership
      timeout: "20m"
      targets:
        - type: team
          team: engineering-leadership
        - type: channel
          channel:
            type: email
            config:
              smtp_host: smtp.company.com
              smtp_port: 587
              from: "aiops-escalation@company.com"
              to: ["cto@company.com", "vp-eng@company.com"]
              subject_template: "[P0 ESCALATED] {{.ResourceName}} - {{.SignalType}} unresolved"

  repeatInterval: "15m"
  maxRepeats: 5

Email for SLA Breaches

apiVersion: platform.chatcli.io/v1alpha1
kind: NotificationPolicy
metadata:
  name: sla-breach-notifications
  namespace: production
spec:
  rules:
    - name: sla-violation-email
      match:
        severities: [critical, high]
        signalTypes: [sla_violation]
      channels:
        - type: email
          config:
            smtp_host: smtp.company.com
            smtp_port: 587
            from: "sla-alerts@company.com"
            to:
              - "sre-team@company.com"
              - "service-owners@company.com"
            cc:
              - "vp-eng@company.com"
            username: "sla-alerts@company.com"
            password_secret:
              name: smtp-credentials
              key: password
            subject_template: "[SLA BREACH] {{.Severity}} - {{.Namespace}}/{{.ResourceName}} exceeds SLA"
            tls_skip_verify: false
        - type: slack
          config:
            webhook_url: "https://hooks.slack.com/services/T00/B00/sla-hook"
            channel: "#sla-violations"
            mention: "@service-owners"

  throttle:
    deduplicationWindow: "30m"
    maxPerHour: 20
    groupBy: [namespace, resourceName]

Troubleshooting

Diagnostic checklist:
  1. Verify that the NotificationPolicy exists in the correct namespace:
kubectl get notificationpolicies -A
  2. Check the operator logs for dispatch errors:
kubectl logs -l app=chatcli-operator -n chatcli-system | grep "notification"
  3. Confirm that the matching is correct:
kubectl get issues -n production -o yaml | grep -A5 "severity\|state\|signalType"
  4. Verify that throttling is not suppressing notifications:
kubectl logs -l app=chatcli-operator -n chatcli-system | grep "throttled\|deduplicated"

Slack:
  • Confirm that the webhook_url is correct and the Slack app is installed in the workspace
  • Verify that the channel exists and the bot has permission to post
  • Test the webhook manually:
curl -X POST -H 'Content-type: application/json' \
  --data '{"text":"ChatCLI AIOps Test"}' \
  "https://hooks.slack.com/services/T00/B00/xxxxx"

PagerDuty:
  • Confirm that the routing_key is an Integration Key (not an API Key)
  • Verify that the service in PagerDuty is active
  • Validate the payload in the PagerDuty Event Debugger
  • Confirm that the event is not being deduplicated by the dedup_key

Email:
  • Verify SMTP connectivity:
kubectl exec -it deploy/chatcli-operator -n chatcli-system -- \
  nc -zv smtp.company.com 587
  • Confirm the credentials in the Secret referenced by password_secret
  • Verify that tls_skip_verify is false and the server certificate is valid
  • Check the recipients' spam folder

Escalation not progressing:
  • Check the Issue annotations:
kubectl get issue <name> -o yaml | grep "escalation"
  • Confirm that escalation-acknowledged is not set to true
  • Check the EscalationPolicy reconciler logs
  • Confirm that the level timeout has actually elapsed since escalation-started-at

Webhook signature:
  • Confirm that the secret in the policy is the same one the receiver uses for verification
  • Verify that the receiver reads the raw body before parsing JSON
  • Use hmac.compare_digest (or equivalent) to avoid timing attacks

Prometheus Metrics

The notification system exposes metrics for full observability:
Metric                                         | Type      | Labels                              | Description
chatcli_notifications_sent_total               | Counter   | channel, severity, rule, namespace  | Total notifications sent successfully
chatcli_notifications_failed_total             | Counter   | channel, severity, rule, error_type | Total notifications that failed
chatcli_notifications_throttled_total          | Counter   | rule, reason                        | Notifications suppressed by throttle or dedup
chatcli_notification_dispatch_duration_seconds | Histogram | channel                             | Dispatch latency per channel
chatcli_escalation_level_reached               | Gauge     | policy, namespace                   | Current escalation level per policy
chatcli_escalation_acknowledged_total          | Counter   | policy, level                       | Total escalations acknowledged per level
chatcli_escalation_timeout_total               | Counter   | policy, level                       | Total escalation timeouts per level
Recommended Prometheus alerts:
groups:
  - name: chatcli-notifications
    rules:
      - alert: NotificationChannelFailing
        expr: rate(chatcli_notifications_failed_total[5m]) > 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Notification channel {{ $labels.channel }} is failing"
          description: "Failure rate > 0.1/s in the last 5 minutes"

      - alert: EscalationReachedL3
        expr: chatcli_escalation_level_reached >= 2
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Escalation reached L3 for policy {{ $labels.policy }}"
          description: "Unacknowledged incident reached the last escalation level"

      - alert: HighThrottleRate
        expr: rate(chatcli_notifications_throttled_total[10m]) > 1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High throttling rate on rule {{ $labels.rule }}"

Next Steps

SLOs and SLAs

Service Level Objectives management with burn rate alerting

Approval Workflow

Change control with approval policies and blast radius

AIOps Platform

Deep-dive into the AIOps architecture

K8s Operator

Operator configuration and CRDs