AIOps: Complete Production Setup

This cookbook covers the complete setup of the ChatCLI AIOps platform for a real production environment — from installation to validation with chaos engineering.

Prerequisites

- [x] Kubernetes cluster 1.25+
- [x] Helm 3.x installed
- [x] Prometheus Operator (for ServiceMonitor)
- [x] Grafana (for dashboards)
- [x] At least one LLM API key (OpenAI, Claude, Google AI)

1. Install the Operator

Install Operator via Helm (CRDs + RBAC + Controllers + Dashboard)

helm install chatcli-operator \
  oci://ghcr.io/diillson/charts/chatcli-operator \
  --namespace chatcli-system \
  --create-namespace

Verify installed CRDs

kubectl get crd | grep platform.chatcli.io

You should see 17 CRDs:

aiinsights.platform.chatcli.io
anomalies.platform.chatcli.io
approvalpolicies.platform.chatcli.io
approvalrequests.platform.chatcli.io
auditevents.platform.chatcli.io
chaosexperiments.platform.chatcli.io
clusterregistrations.platform.chatcli.io
escalationpolicies.platform.chatcli.io
incidentslas.platform.chatcli.io
instances.platform.chatcli.io
issues.platform.chatcli.io
notificationpolicies.platform.chatcli.io
postmortems.platform.chatcli.io
remediationplans.platform.chatcli.io
runbooks.platform.chatcli.io
servicelevelobjectives.platform.chatcli.io
sourcerepositories.platform.chatcli.io

Create Secret with API Keys

kubectl create secret generic chatcli-api-keys \
  --namespace chatcli-system \
  --from-literal=ANTHROPIC_API_KEY=sk-ant-xxx \
  --from-literal=OPENAI_API_KEY=sk-xxx

2. Create ChatCLI Instance

apiVersion: platform.chatcli.io/v1alpha1
kind: Instance
metadata:
  name: chatcli-prod
  namespace: chatcli-system
spec:
  replicas: 2
  provider: CLAUDEAI
  model: claude-sonnet-4-20250514
  server:
    port: 50051
    metricsPort: 9090
    tls:
      enabled: true
      secretName: chatcli-tls
    token:
      name: chatcli-auth
      key: token
  apiKeys:
    name: chatcli-api-keys
  watcher:
    enabled: true
    targets:
      - deployment: api-gateway
        namespace: production
        metricsPort: 8080
        metricsPath: /metrics
      - deployment: payment-service
        namespace: production
        metricsPort: 8080
      - deployment: user-service
        namespace: production
    interval: "30s"
    window: "2h"
  resources:
    requests:
      cpu: 500m
      memory: 512Mi
    limits:
      cpu: "2"
      memory: 1Gi
  persistence:
    enabled: true
    size: 5Gi

To enable Prometheus metrics collection during incident analysis, add the PROMETHEUS_URL variable via Helm:

helm upgrade chatcli-operator oci://ghcr.io/diillson/charts/chatcli-operator \
  --namespace chatcli-system \
  --set prometheusUrl="http://prometheus-server.monitoring.svc:9090"

2.1 TLS Secret: SANs and CA

This is where most installs fail silently. The Instance CR references secretName: chatcli-tls, but the Secret must be generated with two details that openssl req -x509 does not produce by default.

Generate the cert with `subjectAltName`

Without SANs covering the DNS name the operator uses to dial gRPC, the handshake fails with:

transport: authentication handshake failed: x509: certificate is not valid for any names, but wanted to match chatcli-prod.chatcli-system.svc.cluster.local

Use an explicit openssl.cnf:

cat > openssl.cnf <<'EOF'
[req]
distinguished_name = req_dn
x509_extensions    = v_ext
prompt             = no

[req_dn]
CN = chatcli-prod.chatcli-system.svc.cluster.local

[v_ext]
subjectAltName = @alt_names

[alt_names]
DNS.1 = chatcli-prod.chatcli-system.svc.cluster.local
DNS.2 = chatcli-prod.chatcli-system.svc
DNS.3 = chatcli-prod
DNS.4 = localhost
EOF

openssl req -x509 -newkey rsa:4096 -sha256 -days 825 -nodes \
  -keyout tls.key -out tls.crt -config openssl.cnf -extensions v_ext

Verify with:

openssl x509 -in tls.crt -noout -text | grep -A1 'Subject Alternative Name'
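
The grep above only shows the SAN line for visual inspection. A small sketch that turns it into a pass/fail check follows; the helper name has_sans is hypothetical, and it assumes the usual openssl text layout in which the SAN entries sit on the line after the "Subject Alternative Name" header:

```shell
# Read `openssl x509 -noout -text` output on stdin and succeed only if every
# DNS name passed as an argument is listed as a Subject Alternative Name.
has_sans() {
  sans=$(grep -A1 'Subject Alternative Name' | tail -n 1)
  for name in "$@"; do
    printf '%s\n' "$sans" | tr ',' '\n' | sed 's/^ *//; s/ *$//' \
      | grep -Fqx "DNS:$name" || return 1
  done
}

# Usage against the certificate generated above:
#   openssl x509 -in tls.crt -noout -text | has_sans \
#     chatcli-prod.chatcli-system.svc.cluster.local \
#     chatcli-prod.chatcli-system.svc chatcli-prod \
#     && echo "SANs OK"
```

Matching whole entries with grep -Fqx avoids a false pass where one name is merely a prefix of another.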

Include `ca.crt` in the Secret

A self-signed cert is its own CA. If the Secret only contains tls.crt and tls.key, the operator connects but fails with:

transport: authentication handshake failed: x509: certificate signed by unknown authority

The WatcherBridge automatically reads the ca.crt key from the Secret referenced by the Instance and uses it as the trust root — so the Secret needs all three keys:

kubectl -n chatcli-system create secret generic chatcli-tls \
  --from-file=tls.crt=tls.crt \
  --from-file=tls.key=tls.key \
  --from-file=ca.crt=tls.crt   # self-signed: cert is its own CA
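
To confirm that the Secret really carries all three keys, something along these lines can help. The helper name missing_tls_keys is hypothetical; the kubectl go-template in the usage comment simply lists the key names under .data:

```shell
# Read Secret key names on stdin (one per line) and print any of the three
# required keys that are absent. Prints nothing when the Secret is complete.
missing_tls_keys() {
  keys=$(cat)
  for want in tls.crt tls.key ca.crt; do
    printf '%s\n' "$keys" | grep -Fqx "$want" || echo "$want"
  done
}

# List the keys of the live Secret and check them:
#   kubectl -n chatcli-system get secret chatcli-tls \
#     -o go-template='{{range $k, $v := .data}}{{$k}}{{"\n"}}{{end}}' \
#     | missing_tls_keys
```

Empty output means the Secret is complete; a printed ca.crt points at the "unknown authority" failure described above.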

With ca.crt inside the Secret, you do not need to mount a CA ConfigMap or set SSL_CERT_FILE / CHATCLI_GRPC_TLS_CA on the operator deployment. Setting the CA through an environment variable is a secondary path for multi-Instance setups that share a CA, and it requires mounting the CA file manually (extraEnv + volume).

What if the cert is issued by cert-manager or ACM?

§2.1 above covers the fragile self-signed case. With cert-manager or AWS ACM the setup is simpler, but each issuer has its own gotcha:

| Issuer | `ca.crt` in Secret? | Where SAN must match | `spec.server.address` points to… |
| --- | --- | --- | --- |
| cert-manager + Let’s Encrypt / public ACME | No: CA already in the system trust store | Public FQDN (e.g. `chatcli.example.com`) | Public FQDN via Ingress/NLB with gRPC passthrough |
| cert-manager + internal ClusterIssuer (CA) | Yes: cert-manager writes `ca.crt` into the Secret automatically | `dnsNames` in the `Certificate` CR; include in-cluster names | In-cluster Service (`<svc>.<ns>.svc.cluster.local`) |
| AWS ACM Public | N/A: private key is not exportable | Public FQDN | Public FQDN via ALB/NLB (TLS terminates at the LB) |
| AWS ACM Private CA | Yes: include the Private CA bundle as `ca.crt` | Set at issuance; include in-cluster names | In-cluster Service |
| Self-signed (manual openssl, §2.1) | Yes: `ca.crt=tls.crt` (cert is its own CA) | Set via `subjectAltName` in `openssl.cnf` | In-cluster Service |

Key notes:

Publicly trusted cert → trust already exists. The operator code (grpc_client.go) only sets RootCAs when a custom CA is provided; without one, Go uses the container’s ca-certificates bundle. That’s why Let’s Encrypt and ACM Public “just work” on the CA side — but spec.server.address must be the public FQDN, not the in-cluster Service, or the SAN won’t match.

cert-manager with an internal CA is the cleanest K8s path. The Certificate CR below emits everything ready for WatcherBridge auto-trust — no manual openssl:

apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: chatcli-tls
  namespace: chatcli-system
spec:
  secretName: chatcli-tls          # Same Secret referenced in the Instance CR
  issuerRef:
    name: internal-ca              # ClusterIssuer backed by a CA (spec.ca)
    kind: ClusterIssuer
  commonName: chatcli-prod.chatcli-system.svc.cluster.local
  dnsNames:
    - chatcli-prod.chatcli-system.svc.cluster.local
    - chatcli-prod.chatcli-system.svc
    - chatcli-prod
  duration: 8760h                  # 1 year
  renewBefore: 720h                # renew 30 days before expiry

Because the referenced ClusterIssuer is a CA issuer (the issuerRef.kind itself stays ClusterIssuer), cert-manager automatically writes ca.crt into the generated Secret, and WatcherBridge picks it up directly with no extra config.

ACM Public does not fit pod-to-pod gRPC. The private key is not exportable, so use it only when TLS terminates at the ALB/NLB and the operator dials the public endpoint.
ACM Private CA — export the Private CA bundle (aws acm-pca get-certificate-authority-certificate) and include it as ca.crt in the Secret. From there on, it follows the auto-trust path.

2.2 Link Source Code Repositories (Optional)

Link your monitored applications’ source code repositories for code-aware diagnostics. The AI will receive context from recent commits, code snippets from stack traces, and configuration files.

apiVersion: platform.chatcli.io/v1alpha1
kind: SourceRepository
metadata:
  name: api-gateway-repo
  namespace: chatcli-system
spec:
  url: "https://github.com/myorg/api-gateway.git"
  branch: main
  authType: token
  secretRef: git-token
  resource:
    kind: Deployment
    name: api-gateway
    namespace: production
  paths: ["cmd/", "internal/"]
  language: "Go"
---
apiVersion: platform.chatcli.io/v1alpha1
kind: SourceRepository
metadata:
  name: payment-service-repo
  namespace: chatcli-system
spec:
  url: "git@github.com:myorg/payment-service.git"
  branch: main
  authType: ssh
  secretRef: git-ssh-key
  resource:
    kind: Deployment
    name: payment-service
    namespace: production
  language: "Java"

3. Configure Notifications

apiVersion: platform.chatcli.io/v1alpha1
kind: NotificationPolicy
metadata:
  name: prod-notifications
  namespace: chatcli-system
spec:
  enabled: true
  channels:
    - name: slack-incidents
      type: slack
      config:
        webhook_url: "https://hooks.slack.com/services/T.../B.../xxx"
        channel: "#incidents"
    - name: pagerduty-critical
      type: pagerduty
      config:
        routing_key: "R0xxxxxxxxxxxxxxxxxxxx"
    - name: email-management
      type: email
      config:
        smtp_host: smtp.gmail.com
        smtp_port: "587"
        from: aiops@company.com
        to: "sre-team@company.com,management@company.com"
      secretRef:
        name: smtp-credentials
  rules:
    - name: critical-to-pagerduty
      severities: [critical]
      states: [Detected, Escalated]
      channels: [pagerduty-critical, slack-incidents]
    - name: high-to-slack
      severities: [critical, high]
      states: [Detected, Analyzing, Remediating, Resolved, Escalated]
      channels: [slack-incidents]
    - name: escalations-to-email
      states: [Escalated]
      channels: [email-management]
  throttle:
    maxPerHour: 20
    deduplicationWindow: "5m"

4. Configure Escalation

apiVersion: platform.chatcli.io/v1alpha1
kind: EscalationPolicy
metadata:
  name: prod-escalation
  namespace: chatcli-system
spec:
  enabled: true
  severities: [critical, high]
  levels:
    - name: L1-OnCall
      timeoutMinutes: 5
      targets:
        - type: oncall
          name: primary-oncall
      notifyChannels: [slack-incidents, pagerduty-critical]
      repeatIntervalMinutes: 5
    - name: L2-SeniorSRE
      timeoutMinutes: 15
      targets:
        - type: team
          name: sre-senior
      notifyChannels: [slack-incidents, pagerduty-critical]
    - name: L3-Engineering-Lead
      timeoutMinutes: 30
      targets:
        - type: user
          name: eng-lead@company.com
      notifyChannels: [slack-incidents, email-management]

5. Define SLOs

apiVersion: platform.chatcli.io/v1alpha1
kind: ServiceLevelObjective
metadata:
  name: api-gateway-availability
  namespace: chatcli-system
spec:
  serviceName: api-gateway
  description: "API Gateway must maintain 99.9% availability"
  enabled: true
  indicator:
    type: availability
    metricSource: issues
    resource:
      kind: Deployment
      name: api-gateway
      namespace: production
  target:
    percentage: 99.9
    window: "30d"
  alertPolicy:
    pageOnBudgetExhausted: true
    notificationPolicyRef: prod-notifications
    burnRateWindows:
      - shortWindow: "1h"
        longWindow: "6h"
        burnRateThreshold: 14.4
        severity: critical
      - shortWindow: "6h"
        longWindow: "72h"
        burnRateThreshold: 6
        severity: high
      - shortWindow: "24h"
        longWindow: "72h"
        burnRateThreshold: 3
        severity: medium
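
The thresholds above are easier to reason about once translated into error budget. The sketch below is plain arithmetic using the standard burn-rate definition (budget consumed = burn rate × elapsed time / SLO window); it makes no claim about how the operator evaluates the two windows internally, and the function names are illustrative:

```shell
# Minutes of allowed unavailability for an SLO target (percent) over a window (days).
error_budget_minutes() {
  awk -v t="$1" -v d="$2" 'BEGIN { printf "%.1f\n", (100 - t) / 100 * d * 1440 }'
}

# Percent of the error budget consumed when a given burn rate is sustained
# for `hours`, with the SLO window expressed in hours (30d = 720h).
budget_consumed_pct() {
  awk -v r="$1" -v h="$2" -v w="$3" 'BEGIN { printf "%.1f\n", r * h / w * 100 }'
}

error_budget_minutes 99.9 30     # 43.2
budget_consumed_pct 14.4 1 720   # 2.0
budget_consumed_pct 6 6 720      # 5.0
budget_consumed_pct 3 24 720     # 10.0
```

In other words, a 99.9% target over 30 days allows about 43 minutes of unavailability, and a burn rate of 14.4 held for one hour spends 2% of it.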

6. Define SLAs

apiVersion: platform.chatcli.io/v1alpha1
kind: IncidentSLA
metadata:
  name: p1-sla
  namespace: chatcli-system
spec:
  severity: critical
  responseTime: "5m"
  resolutionTime: "1h"
  escalationPolicyRef: prod-escalation
  notificationPolicyRef: prod-notifications
  businessHoursOnly: false
---

apiVersion: platform.chatcli.io/v1alpha1
kind: IncidentSLA
metadata:
  name: p2-sla
  namespace: chatcli-system
spec:
  severity: high
  responseTime: "15m"
  resolutionTime: "4h"
  escalationPolicyRef: prod-escalation
  businessHoursOnly: true
  businessHours:
    timezone: "America/Sao_Paulo"
    startHour: 9
    endHour: 18
    workDays: ["Monday","Tuesday","Wednesday","Thursday","Friday"]

7. Configure Approvals

apiVersion: platform.chatcli.io/v1alpha1
kind: ApprovalPolicy
metadata:
  name: prod-approvals
  namespace: chatcli-system
spec:
  enabled: true
  defaultMode: manual
  rules:
    - name: auto-low-confidence
      match:
        severities: [low]
        actionTypes: [RestartDeployment, DeletePod]
      mode: auto
      autoApproveConditions:
        minConfidence: 0.95
        maxSeverity: low
        historicalSuccessRate: 0.90
    - name: quorum-production-rollback
      match:
        severities: [critical, high]
        actionTypes: [RollbackDeployment, ScaleDeployment]
        namespaces: [production]
      mode: quorum
      requiredApprovers: 2
      timeoutMinutes: 15
      changeWindow:
        timezone: "America/Sao_Paulo"
        allowedDays: ["Monday","Tuesday","Wednesday","Thursday","Friday"]
        startHour: 9
        endHour: 17
    - name: manual-resource-changes
      match:
        actionTypes: [AdjustResources, PatchConfig]
      mode: manual
      timeoutMinutes: 30

8. Install Grafana Dashboards

# Create ConfigMap with dashboards
kubectl create configmap chatcli-grafana-dashboards \
  --from-file=deploy/grafana/ \
  -n monitoring \
  --dry-run=client -o yaml | kubectl apply -f -

# Add label for Grafana sidecar auto-discovery
kubectl label configmap chatcli-grafana-dashboards \
  grafana_dashboard=1 -n monitoring

# Install ServiceMonitors
kubectl apply -f deploy/grafana/dashboards-configmap.yaml

9. Validate with Chaos Engineering

Run chaos experiments only in environments with redundancy. Never on single-replica deployments.

apiVersion: platform.chatcli.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: validate-api-gateway-resilience
  namespace: chatcli-system
spec:
  experimentType: pod_kill
  target:
    kind: Deployment
    name: api-gateway
    namespace: production
  duration: "2m"
  parameters:
    count: "1"
  dryRun: true  # Test in dry-run first!
  enabled: true
  safetyChecks:
    minHealthyPods: 2
    maxConcurrentExperiments: 1
    abortOnIssueDetected: true
    requireApproval: true
    blockedNamespaces: ["kube-system", "monitoring"]
  postExperiment:
    verifyRecovery: true
    recoveryTimeout: "5m"
    runRemediationTest: false
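
Before applying the experiment, it can be worth checking the redundancy requirement by hand. The sketch below assumes minHealthyPods means the number of pods that must stay ready while the experiment runs; that reading and the helper name enough_headroom are assumptions, so check the operator documentation for the exact semantics:

```shell
# Succeed only if killing `count` pods still leaves at least `min_healthy` ready.
# Assumption: minHealthyPods is the number of pods that must stay healthy
# while the experiment runs.
enough_headroom() {
  ready=$1 count=$2 min_healthy=$3
  [ $((ready - count)) -ge "$min_healthy" ]
}

# Usage with the values from the manifest above (count=1, minHealthyPods=2):
#   ready=$(kubectl get deployment api-gateway -n production \
#     -o jsonpath='{.status.readyReplicas}')
#   enough_headroom "${ready:-0}" 1 2 || echo "not enough ready replicas"
```

Under that assumption, api-gateway needs at least three ready replicas for a one-pod kill to pass the check.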

Run in DryRun

kubectl apply -f chaos-experiment.yaml
kubectl get chaos -n chatcli-system -w

Verify result

kubectl get chaos validate-api-gateway-resilience -n chatcli-system -o yaml

Run for real (after validation)

Set dryRun: false in chaos-experiment.yaml and reapply the manifest.
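
One way to make that switch without hand-editing is sketched here. The helper name disable_dry_run is hypothetical, and the file name matches the one used in the dry-run step:

```shell
# Print the manifest with dryRun switched from true to false, leaving the
# file on disk untouched so the change can be reviewed before applying.
disable_dry_run() {
  sed -E 's/^([[:space:]]*dryRun:[[:space:]]*)true/\1false/' "$1"
}

# Review, then apply:
#   disable_dry_run chaos-experiment.yaml | kubectl apply -f -
```

Because the committed manifest keeps dryRun: true, a later plain reapply falls back to the safe mode.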

10. Access the Dashboard

# Port-forward to the REST API + Web UI
kubectl port-forward svc/chatcli-operator 8090:8090 -n chatcli-system

# Open in browser (macOS; use xdg-open on Linux)
open http://localhost:8090

The web dashboard shows:

Overview with real-time stats
Incidents with filters and actions (acknowledge, snooze)
SLOs with error budget and burn rates
Pending approvals
PostMortems with timeline
Federated clusters
Searchable audit log

10.1 Expose the Dashboard via Ingress (alternative to port-forward)

To reach the dashboard from outside the cluster, create an Ingress pointing at the operator Service. When mounting under a sub-path, rewrite-target with a capture group is required — the dashboard’s static assets are served from / and would 404 otherwise:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: chatcli-dashboard
  namespace: chatcli-system
  annotations:
    nginx.ingress.kubernetes.io/rewrite-target: /$2
spec:
  ingressClassName: nginx
  rules:
    - host: chatcli.example.com
      http:
        paths:
          - path: /chatcli(/|$)(.*)
            pathType: ImplementationSpecific
            backend:
              service:
                name: chatcli-operator
                port:
                  number: 8090
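
To see what the capture group does to incoming paths, the same pattern and replacement can be tried locally with sed. This is only an approximation, since NGINX uses PCRE while sed -E uses POSIX extended regular expressions, but for this pattern the two should agree. The helper name rewrite_path is illustrative:

```shell
# Apply the same pattern and replacement as the Ingress above to a request
# path, printing the path the backend would receive.
rewrite_path() {
  printf '%s\n' "$1" | sed -E 's#^/chatcli(/|$)(.*)#/\2#'
}

rewrite_path /chatcli                 # /
rewrite_path /chatcli/                # /
rewrite_path /chatcli/assets/app.js   # /assets/app.js
```

The second capture group drops the /chatcli prefix, so asset requests reach the dashboard at the root path it serves them from.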

11. Common Troubleshooting

| Operator log error | Cause | Fix |
| --- | --- | --- |
| `x509: certificate is not valid for any names` | gRPC server cert missing SAN for `spec.server.address` | Regenerate cert with `openssl.cnf` + `subjectAltName` for the Service FQDN (see §2.1) |
| `x509: certificate signed by unknown authority` | Self-signed cert with no trust configured | Add `ca.crt` key to the `chatcli-tls` Secret referenced by the Instance (see §2.1) |
| `no ready Instance found` | Instance not `Ready` or in a different namespace | `kubectl describe instance chatcli-prod -n chatcli-system`, then inspect status and events |
| `connection refused` after TLS succeeds | Service has no endpoints or wrong gRPC port | `kubectl get endpoints chatcli-prod -n chatcli-system` must list pod IPs |

Production Checklist

- [x] Operator installed with 17 CRDs
- [x] Instance created with TLS and auth
- [x] Secret chatcli-tls contains tls.crt, tls.key and ca.crt (self-signed: ca.crt=tls.crt)
- [x] tls.crt has SANs for <instance>.<ns>.svc.cluster.local, <instance>.<ns>.svc and <instance>
- [x] spec.server.address in the Instance matches one of the cert SANs
- [x] Operator logs show Connected to Instance with no x509: errors within ~30s of the Instance becoming Ready
- [x] Watcher monitoring target deployments
- [x] NotificationPolicy with Slack + PagerDuty
- [x] EscalationPolicy L1->L2->L3
- [x] SLOs with burn rate alerting (Google SRE model)
- [x] SLAs with response/resolution time per severity
- [x] ApprovalPolicy with auto/quorum for production
- [x] Grafana dashboards installed
- [x] Chaos experiment validated in dry-run
- [x] Web Dashboard accessible
- [x] REST API with authentication configured
