Overview
The NotificationEngine is triggered whenever one of the following events occurs:
| Event | Description |
|---|---|
| Issue state change | State transition (Detected, Analyzing, Remediating, Resolved, Escalated) |
| SLO burn rate alert | Burn rate exceeds threshold in short+long windows |
| SLA violation | Response or resolution time exceeded the limit |
| Remediation failure | RemediationPlan failed and reached max attempts |
| Approval request | ApprovalRequest created awaiting approval |
NotificationPolicy CRD
The NotificationPolicy defines which events trigger notifications, which channels receive them, and which throttling rules apply.
Spec Fields
NotificationRule
Each rule defines a match + channels pair. Multiple rules can be defined in the same policy.

| Field | Type | Required | Description |
|---|---|---|---|
| name | string | Yes | Unique rule name within the policy |
| match | NotificationMatch | Yes | Matching criteria |
| channels | []ChannelConfig | Yes | List of destination channels |
NotificationMatch
All fields are optional. An omitted field acts as a wildcard (it matches everything). When multiple fields are defined, the logic is AND between fields and OR among the values within each field.

| Field | Type | Description |
|---|---|---|
| severities | []string | critical, high, medium, low |
| signalTypes | []string | oom_kill, pod_restart, pod_not_ready, deploy_failing, error_rate, latency_spike |
| namespaces | []string | K8s namespaces to monitor |
| resourceKinds | []string | Deployment, StatefulSet, DaemonSet |
| states | []string | Detected, Analyzing, Remediating, Resolved, Escalated, Failed |
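The AND/OR semantics above can be sketched in a few lines. This is an illustrative model only, not the operator's implementation: the `Match` class, the `matches` function, and the event dictionary keys are hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Match:
    """Hypothetical in-memory form of NotificationMatch. An empty list is a wildcard."""
    severities: list = field(default_factory=list)
    signal_types: list = field(default_factory=list)
    namespaces: list = field(default_factory=list)
    resource_kinds: list = field(default_factory=list)
    states: list = field(default_factory=list)


def matches(match: Match, event: dict) -> bool:
    """Return True when the event satisfies every non-empty field of the match."""
    criteria = {
        "severity": match.severities,
        "signalType": match.signal_types,
        "namespace": match.namespaces,
        "resourceKind": match.resource_kinds,
        "state": match.states,
    }
    # AND across fields: every defined field must accept the event.
    # OR within a field: any one of the listed values is enough.
    return all(
        not allowed or event.get(key) in allowed
        for key, allowed in criteria.items()
    )
```

With `Match(severities=["critical", "high"], namespaces=["prod"])`, a `high` event in `prod` matches, while a `high` event in `staging` or a `low` event in `prod` does not.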
ThrottleConfig
Controls the frequency and deduplication of notifications to prevent alert fatigue.

| Field | Type | Default | Description |
|---|---|---|---|
| deduplicationWindow | duration | 5m | Time window for deduplication. Identical notifications within this window are suppressed. |
| maxPerHour | int | 120 | Maximum notifications per hour per policy. Excess notifications are queued. |
| groupBy | []string | [namespace, resourceName] | Fields used to group notifications. Notifications from the same group are consolidated. |
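A minimal sketch of how these three settings could interact, assuming that "identical" means the same group plus the same signal type and state (the source does not define the fingerprint). The `Throttle` class and its method names are hypothetical.

```python
import time
from collections import deque


class Throttle:
    """Illustrative throttle: dedup window first, then an hourly cap."""

    def __init__(self, dedup_window=300.0, max_per_hour=120,
                 group_by=("namespace", "resourceName")):
        self.dedup_window = dedup_window   # seconds (5m default)
        self.max_per_hour = max_per_hour
        self.group_by = group_by
        self._last_sent = {}               # fingerprint -> time of last send
        self._sent = deque()               # send times within the last hour

    def group_key(self, notification: dict) -> tuple:
        return tuple(notification.get(f) for f in self.group_by)

    def decide(self, notification: dict, now=None) -> str:
        """Return "send", "suppressed" (duplicate) or "queued" (over the cap)."""
        now = time.time() if now is None else now
        # Assumption: the fingerprint is the group plus signal type and state.
        fingerprint = self.group_key(notification) + (
            notification.get("signalType"), notification.get("state"))
        last = self._last_sent.get(fingerprint)
        if last is not None and now - last < self.dedup_window:
            return "suppressed"
        # Drop send times that have fallen out of the one-hour window.
        while self._sent and now - self._sent[0] >= 3600:
            self._sent.popleft()
        if len(self._sent) >= self.max_per_hour:
            return "queued"
        self._last_sent[fingerprint] = now
        self._sent.append(now)
        return "send"
```

Passing `now` explicitly keeps the sketch deterministic for testing; a real dispatcher would also need to drain the queue once capacity frees up.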
Notification Channels
1. Slack
Sends notifications via Slack Incoming Webhooks using Block Kit for rich formatting.
Full Slack configuration
| Field | Type | Required | Description |
|---|---|---|---|
| webhook_url | string | Yes | Slack Incoming Webhook URL |
| channel | string | No | Channel override (requires webhook with permission) |
| mention | string | No | Mention (@user, @here, @channel, @oncall-group) |
| username | string | No | Bot name (default: ChatCLI AIOps) |
| icon_emoji | string | No | Bot emoji (default: :robot_face:) |
| Severity | Color (hex) | Visual |
|---|---|---|
| Critical | #E74C3C | Intense red |
| High | #E67E22 | Orange |
| Medium | #F1C40F | Yellow |
| Low | #2ECC71 | Green |
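As a rough illustration, the color table and the defaults above could be combined into a webhook payload as follows. The payload layout (a colored attachment wrapping a Block Kit section) is one common way to get a severity bar in Slack; it is an assumption here, not the documented output of the NotificationEngine. The `issue` keys and function names are hypothetical, and `mention` handling is left out.

```python
import json

SEVERITY_COLORS = {
    "critical": "#E74C3C",
    "high": "#E67E22",
    "medium": "#F1C40F",
    "low": "#2ECC71",
}


def build_slack_payload(issue: dict, cfg: dict) -> dict:
    """Build an Incoming Webhook payload with a severity-colored attachment."""
    headline = (
        f"*[{issue['severity'].upper()}]* {issue['resourceName']} "
        f"in `{issue['namespace']}`: {issue['state']}"
    )
    payload = {
        "username": cfg.get("username", "ChatCLI AIOps"),
        "icon_emoji": cfg.get("icon_emoji", ":robot_face:"),
        "attachments": [{
            "color": SEVERITY_COLORS[issue["severity"]],
            "blocks": [{
                "type": "section",
                "text": {"type": "mrkdwn", "text": headline},
            }],
        }],
    }
    # The channel override is only included when explicitly configured.
    if cfg.get("channel"):
        payload["channel"] = cfg["channel"]
    return payload


def encode(payload: dict) -> bytes:
    """Serialize the payload as the JSON body of the webhook POST."""
    return json.dumps(payload).encode("utf-8")
```

The encoded bytes would be sent as the body of a POST to `webhook_url` with `Content-Type: application/json`.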
2. PagerDuty
Integrates with PagerDuty via Events API v2 for on-call incident management.
Full PagerDuty configuration
| Field | Type | Required | Description |
|---|---|---|---|
| routing_key | string | Yes | PagerDuty service Integration Key (Events API v2) |
| severity_mapping | map | No | Mapping of ChatCLI severities to PagerDuty |
| dedup_key_template | string | No | Template for dedup_key (default: {{.IssueName}}) |
| custom_details | map | No | Extra fields in the payload |
| ChatCLI | PagerDuty | Behavior |
|---|---|---|
| critical | critical | Triggers on-call immediately |
| high | error | High priority |
| medium | warning | Moderate priority |
| low | info | Informational |
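The sketch below shows how an Issue might be translated into an Events API v2 event using the default severity mapping and the default `dedup_key` (the Issue name), including the `resolve` action for Issues in the Resolved state. The `issue` dictionary keys and the function name are hypothetical; the top-level event fields (`routing_key`, `event_action`, `dedup_key`, `payload`) follow the public Events API v2 format.

```python
PD_SEVERITY = {
    "critical": "critical",
    "high": "error",
    "medium": "warning",
    "low": "info",
}


def build_pagerduty_event(issue: dict, routing_key: str, custom_details=None) -> dict:
    """Build a trigger or resolve event for the same incident."""
    event = {
        "routing_key": routing_key,
        # Default dedup_key_template is {{.IssueName}}: one incident per Issue.
        "dedup_key": issue["issueName"],
    }
    if issue["state"] == "Resolved":
        # A resolve event only needs the routing key and the dedup key.
        event["event_action"] = "resolve"
        return event
    event["event_action"] = "trigger"
    event["payload"] = {
        "summary": f"{issue['signalType']} on {issue['resourceName']} ({issue['state']})",
        "source": f"{issue['namespace']}/{issue['resourceName']}",
        "severity": PD_SEVERITY[issue["severity"]],
        "custom_details": dict(custom_details or {}),
    }
    return event
```

Because both events carry the same `dedup_key`, repeated triggers update one incident and the resolve event closes it.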
dedup_key ensures that updates to the same incident do not create duplicate alerts in PagerDuty. The default uses the Issue name, but it can be customized with dedup_key_template.
When the Issue reaches the Resolved state, the NotificationEngine sends event_action: resolve with the same dedup_key, automatically closing the incident in PagerDuty.
3. OpsGenie
Integrates with OpsGenie for alerts and on-call management with P1-P4 priorities.
Full OpsGenie configuration
| Field | Type | Required | Description |
|---|---|---|---|
| api_key | string | Yes | OpsGenie API Key |
| api_url | string | No | API URL (default: https://api.opsgenie.com) |
| priority_mapping | map | No | Mapping of severities to priorities |
| responders | []Responder | No | Responsible teams or users |
| tags | []string | No | Tags to categorize alerts |
| visible_to | []Responder | No | Who can see the alert |
| actions | []string | No | Custom actions on the alert |
| ChatCLI | OpsGenie | Description |
|---|---|---|
| critical | P1 | Critical - triggers on-call immediately |
| high | P2 | High priority |
| medium | P3 | Moderate priority |
| low | P4 | Low priority |
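For illustration, the default priority mapping and the optional fields above could be assembled into an OpsGenie alert body like this. Using the Issue name as the alert `alias` is an assumption made for the example, as are the `issue` keys and the function name; the shape of each responder entry is passed through unchanged because the Responder type is not detailed here.

```python
OG_PRIORITY = {"critical": "P1", "high": "P2", "medium": "P3", "low": "P4"}

# Config field name -> field name in the alert body.
_OPTIONAL_FIELDS = (
    ("responders", "responders"),
    ("tags", "tags"),
    ("visible_to", "visibleTo"),
    ("actions", "actions"),
)


def build_opsgenie_alert(issue: dict, cfg: dict) -> dict:
    """Build an alert body with the mapped priority and optional fields."""
    mapping = cfg.get("priority_mapping") or OG_PRIORITY
    message = f"[{issue['severity']}] {issue['signalType']} on {issue['resourceName']}"
    alert = {
        "message": message[:130],        # OpsGenie limits message length
        "alias": issue["issueName"],     # assumption: one alert per Issue
        "priority": mapping[issue["severity"]],
    }
    for cfg_key, body_key in _OPTIONAL_FIELDS:
        if cfg.get(cfg_key):
            alert[body_key] = cfg[cfg_key]
    return alert
```

A custom `priority_mapping` in the configuration replaces the default table entirely in this sketch.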
4. Email
Sends notifications via SMTP with STARTTLS support and HTML templates.
Full Email configuration
| Field | Type | Required | Description |
|---|---|---|---|
| smtp_host | string | Yes | SMTP server host |
| smtp_port | int | Yes | SMTP port (587 for STARTTLS, 465 for SSL) |
| from | string | Yes | Sender address |
| to | []string | Yes | List of recipients |
| cc | []string | No | Carbon copy |
| bcc | []string | No | Blind carbon copy |
| username | string | No | SMTP credential (if auth required) |
| password_secret | SecretRef | No | Reference to the Secret containing the SMTP password |
| subject_template | string | No | Go template for the subject |
| tls_skip_verify | bool | No | Skip TLS verification (default: false) |
| html_template | string | No | Custom HTML template (Go template) |
| Variable | Description |
|---|---|
| {{.Severity}} | Alert severity |
| {{.ResourceName}} | K8s resource name |
| {{.Namespace}} | Namespace |
| {{.SignalType}} | Signal type |
| {{.State}} | Current Issue state |
| {{.RiskScore}} | Risk score (0-100) |
| {{.Analysis}} | AI analysis |
| {{.IssueName}} | Issue CR name |
| {{.Timestamp}} | ISO 8601 timestamp |
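To show how the template variables and the SMTP fields fit together, here is a rough sketch. The templates themselves are Go templates; the `render` helper below is only a stand-in that substitutes plain `{{.Field}}` references and supports nothing else from Go's template language. All function names and the `cfg` dictionary layout are hypothetical.

```python
import re
import smtplib
import ssl
from email.message import EmailMessage

# Stand-in for Go templates: handles plain {{.Field}} references only.
_FIELD = re.compile(r"\{\{\s*\.(\w+)\s*\}\}")


def render(template: str, values: dict) -> str:
    return _FIELD.sub(lambda m: str(values[m.group(1)]), template)


def build_email(cfg: dict, values: dict) -> EmailMessage:
    """Build a multipart message with a plain-text part and an HTML part."""
    msg = EmailMessage()
    msg["From"] = cfg["from"]
    msg["To"] = ", ".join(cfg["to"])
    if cfg.get("cc"):
        msg["Cc"] = ", ".join(cfg["cc"])
    msg["Subject"] = render(cfg["subject_template"], values)
    msg.set_content(render("{{.Analysis}}", values))
    msg.add_alternative(render(cfg["html_template"], values), subtype="html")
    return msg


def recipients(cfg: dict) -> list:
    """Envelope recipients. bcc is delivered to but never written as a header."""
    return list(cfg["to"]) + list(cfg.get("cc", [])) + list(cfg.get("bcc", []))


def send(cfg: dict, msg: EmailMessage, password=None) -> None:
    """Send over implicit TLS on port 465, otherwise upgrade with STARTTLS."""
    context = ssl.create_default_context()
    if cfg.get("tls_skip_verify"):
        context.check_hostname = False
        context.verify_mode = ssl.CERT_NONE
    if cfg["smtp_port"] == 465:
        server = smtplib.SMTP_SSL(cfg["smtp_host"], cfg["smtp_port"], context=context)
    else:
        server = smtplib.SMTP(cfg["smtp_host"], cfg["smtp_port"])
        server.starttls(context=context)
    with server:
        if cfg.get("username"):
            server.login(cfg["username"], password)
        server.send_message(msg, to_addrs=recipients(cfg))
```

In this sketch the password is passed in by the caller, standing in for the value resolved from `password_secret`.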
5. Webhook
Sends notifications to arbitrary HTTP endpoints with HMAC-SHA256 signing.
Full Webhook configuration
| Field | Type | Required | Description |
|---|---|---|---|
| url | string | Yes | Destination endpoint URL |
| secret | string | No | Key for HMAC-SHA256 signing |
| headers | map | No | Custom HTTP headers |
| method | string | No | HTTP method (default: POST) |
| timeout | duration | No | Request timeout (default: 10s) |
| retry_count | int | No | Number of retries on failure (default: 3) |
| retry_interval | duration | No | Interval between retries (default: 5s) |
When secret is defined, every request includes the X-ChatCLI-Signature header with the HMAC-SHA256 signature of the body.
6. Microsoft Teams
Sends notifications to Microsoft Teams channels via Adaptive Cards and Incoming Webhooks.
Full Microsoft Teams configuration
| Field | Type | Required | Description |
|---|---|---|---|
| webhook_url | string | Yes | Teams Incoming Webhook URL |
| title_template | string | No | Template for the card title |
| theme_color | string | No | Theme color (hex, without #) |
The card includes:
- Header with colored severity
- Resource details (namespace, kind, name)
- AI analysis (if available)
- Suggested actions
- Link to the Grafana dashboard
| Severity | theme_color |
|---|---|
| Critical | E74C3C |
| High | E67E22 |
| Medium | F1C40F |
| Low | 2ECC71 |
EscalationPolicy CRD
The EscalationPolicy defines the automatic escalation chain used when an alert is not acknowledged within the defined timeout.
Spec Fields
| Field | Type | Required | Description |
|---|---|---|---|
| match | EscalationMatch | Yes | Criteria for applying this escalation |
| levels | []EscalationLevel | Yes | Ordered chain of escalation levels |
| repeatInterval | duration | No | Interval to repeat the last level (default: 30m) |
| maxRepeats | int | No | Maximum repetitions of the last level (default: 3) |
EscalationLevel
| Field | Type | Required | Description |
|---|---|---|---|
| name | string | Yes | Descriptive name of the level |
| timeout | duration | Yes | Time without acknowledgement before escalating to the next level |
| targets | []EscalationTarget | Yes | Notification destinations at this level |
EscalationTarget
| Field | Type | Description |
|---|---|---|
| type | string | channel, user, team, oncall |
| channel | ChannelConfig | Channel configuration (when type=channel) |
| user | string | User email (when type=user) |
| team | string | Team name (when type=team) |
| oncall | OnCallRef | Reference to the on-call schedule (when type=oncall) |
How Escalation Works
Tracking via annotations: The EscalationPolicy reconciler tracks escalation state using annotations on the Issue CR:
| Annotation | Description |
|---|---|
| platform.chatcli.io/escalation-level | Current level (0=L1, 1=L2, 2=L3) |
| platform.chatcli.io/escalation-started-at | Timestamp of escalation start |
| platform.chatcli.io/escalation-acknowledged | true when acknowledged |
| platform.chatcli.io/escalation-acknowledged-by | Who acknowledged |
| platform.chatcli.io/escalation-repeat-count | Repeat count of the last level |
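The timing rules can be sketched as a pure function that derives the values of `escalation-level` and `escalation-repeat-count` from the start timestamp. This is a simplified reading, not the reconciler's code: it assumes timeouts accumulate from `escalation-started-at`, and that the last level repeats every `repeatInterval` from the moment it is reached. The function name and signature are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def escalation_status(started_at, now, level_timeouts, acknowledged=False,
                      repeat_interval=timedelta(minutes=30), max_repeats=3):
    """Return (level_index, repeat_count), or None once acknowledged."""
    if not level_timeouts:
        raise ValueError("at least one escalation level is required")
    if acknowledged:
        return None  # escalation stops as soon as someone acknowledges
    elapsed = now - started_at
    # Each level except the last must time out before the next one starts.
    for level, timeout in enumerate(level_timeouts[:-1]):
        if elapsed < timeout:
            return level, 0
        elapsed -= timeout
    last = len(level_timeouts) - 1
    # Assumption: the last level repeats every repeat_interval, up to the cap.
    repeats = min(elapsed // repeat_interval, max_repeats)
    return last, repeats
```

For a chain with timeouts of 5, 10, and 15 minutes, an unacknowledged Issue is at level 0 for the first 5 minutes, at level 1 until minute 15, and at level 2 afterwards, with the repeat count capped at `maxRepeats`.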
Complete Examples
Notification Policy: Slack + PagerDuty
Escalation Policy L1 -> L2 -> L3
Email for SLA Breaches
Troubleshooting
Notifications are not being sent
Diagnostic checklist:
- Verify that the NotificationPolicy exists in the correct namespace
- Check the operator logs for dispatch errors
- Confirm that the rule's match criteria are correct
- Verify that throttling is not suppressing notifications
Slack returns 404 or invalid_payload error
- Confirm that the webhook_url is correct and the Slack app is installed in the workspace
- Verify that the channel exists and the bot has permission to post
- Test the webhook manually
PagerDuty does not create incidents
- Confirm that the routing_key is an Integration Key (not an API Key)
- Verify that the service in PagerDuty is active
- Validate the payload in the PagerDuty Event Debugger
- Confirm that the event is not being deduplicated by the dedup_key
Emails are not arriving
- Verify SMTP connectivity
- Confirm credentials in the Secret referenced by password_secret
- Verify that tls_skip_verify is false and the server certificate is valid
- Check the recipients’ spam folder
Escalation does not advance to the next level
- Check Issue annotations
- Confirm that escalation-acknowledged is not set to true
- Check the EscalationPolicy reconciler logs
- Confirm that the level timeout is not greater than the time since creation
Webhook returns signature error
- Confirm that the secret in the policy is the same used by the receiver for verification
- Verify that the receiver is reading the raw body before parsing JSON
- Use hmac.compare_digest (or equivalent) to avoid timing attacks
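The checklist above corresponds to a receiver along these lines. The sketch assumes the X-ChatCLI-Signature header carries a lowercase hex digest with no prefix; the encoding is not specified here, so adjust it to whatever the sender actually emits. The function names are hypothetical.

```python
import hashlib
import hmac


def sign(secret: str, body: bytes) -> str:
    """HMAC-SHA256 of the raw request body, hex-encoded (assumed encoding)."""
    return hmac.new(secret.encode("utf-8"), body, hashlib.sha256).hexdigest()


def verify(secret: str, raw_body: bytes, signature_header: str) -> bool:
    """Check the signature against the raw bytes, before any JSON parsing."""
    expected = sign(secret, raw_body)
    # Constant-time comparison; comparing bytes also tolerates odd header input.
    return hmac.compare_digest(
        expected.encode("utf-8"), signature_header.encode("utf-8"))
```

Verification must run on the exact bytes received: re-serializing parsed JSON can change whitespace or key order and produce a different digest.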
Prometheus Metrics
The notification system exposes the following Prometheus metrics:

| Metric | Type | Labels | Description |
|---|---|---|---|
| chatcli_notifications_sent_total | Counter | channel, severity, rule, namespace | Total notifications sent successfully |
| chatcli_notifications_failed_total | Counter | channel, severity, rule, error_type | Total notifications that failed |
| chatcli_notifications_throttled_total | Counter | rule, reason | Notifications suppressed by throttle or dedup |
| chatcli_notification_dispatch_duration_seconds | Histogram | channel | Dispatch latency per channel |
| chatcli_escalation_level_reached | Gauge | policy, namespace | Current escalation level per policy |
| chatcli_escalation_acknowledged_total | Counter | policy, level | Total escalations acknowledged per level |
| chatcli_escalation_timeout_total | Counter | policy, level | Total escalation timeouts per level |
Next Steps
- SLOs and SLAs: Service Level Objectives management with burn rate alerting
- Approval Workflow: Change control with approval policies and blast radius
- AIOps Platform: Deep-dive into the AIOps architecture
- K8s Operator: Operator configuration and CRDs