This cookbook covers the complete setup of the ChatCLI AIOps platform for a real production environment — from installation to validation with chaos engineering.

Prerequisites

  • [x] Kubernetes cluster 1.25+
  • [x] Helm 3.x installed
  • [x] Prometheus Operator (for ServiceMonitor)
  • [x] Grafana (for dashboards)
  • [x] At least one LLM API key (OpenAI, Claude, Google AI)

1. Install the Operator

Step 1. Install Operator via Helm (CRDs + RBAC + Controllers + Dashboard)

helm install chatcli-operator \
  oci://ghcr.io/diillson/charts/chatcli-operator \
  --namespace chatcli-system \
  --create-namespace

Step 2. Verify installed CRDs

kubectl get crd | grep platform.chatcli.io
You should see 17 CRDs:
aiinsights.platform.chatcli.io
anomalies.platform.chatcli.io
approvalpolicies.platform.chatcli.io
approvalrequests.platform.chatcli.io
auditevents.platform.chatcli.io
chaosexperiments.platform.chatcli.io
clusterregistrations.platform.chatcli.io
escalationpolicies.platform.chatcli.io
incidentslas.platform.chatcli.io
instances.platform.chatcli.io
issues.platform.chatcli.io
notificationpolicies.platform.chatcli.io
postmortems.platform.chatcli.io
remediationplans.platform.chatcli.io
runbooks.platform.chatcli.io
servicelevelobjectives.platform.chatcli.io
sourcerepositories.platform.chatcli.io
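
To make this check scriptable (for example in CI), the count can be asserted instead of read by eye. The following is a minimal sketch that assumes CRD names are piped in from kubectl get crd -o name; the helper name count_chatcli_crds is illustrative and not part of ChatCLI.

```shell
# count_chatcli_crds: reads CRD names on stdin (one per line, e.g. from
# `kubectl get crd -o name`) and prints how many end in platform.chatcli.io.
# The function name is illustrative, not part of ChatCLI.
count_chatcli_crds() {
  # grep -c exits non-zero when the count is 0; `|| true` keeps callers
  # that use `set -e` running while still printing "0".
  grep -c 'platform\.chatcli\.io$' || true
}

# Usage against a live cluster (17 is the count listed above):
#   [ "$(kubectl get crd -o name | count_chatcli_crds)" -eq 17 ] && echo "all CRDs present"
```

A count below 17 usually means the Helm release did not finish installing; rerunning the grep from the step above shows which names are missing.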

Step 3. Create Secret with API Keys

kubectl create secret generic chatcli-api-keys \
  --namespace chatcli-system \
  --from-literal=ANTHROPIC_API_KEY=sk-ant-xxx \
  --from-literal=OPENAI_API_KEY=sk-xxx

2. Create ChatCLI Instance

apiVersion: platform.chatcli.io/v1alpha1
kind: Instance
metadata:
  name: chatcli-prod
  namespace: chatcli-system
spec:
  replicas: 2
  provider: CLAUDEAI
  model: claude-sonnet-4-20250514
  server:
    port: 50051
    metricsPort: 9090
    tls:
      enabled: true
      secretName: chatcli-tls
    token:
      name: chatcli-auth
      key: token
  apiKeys:
    name: chatcli-api-keys
  watcher:
    enabled: true
    targets:
      - deployment: api-gateway
        namespace: production
        metricsPort: 8080
        metricsPath: /metrics
      - deployment: payment-service
        namespace: production
        metricsPort: 8080
      - deployment: user-service
        namespace: production
    interval: "30s"
    window: "2h"
  resources:
    requests:
      cpu: 500m
      memory: 512Mi
    limits:
      cpu: "2"
      memory: 1Gi
  persistence:
    enabled: true
    size: 5Gi
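
The manifest above references a Secret named chatcli-auth with a token key, which none of the earlier steps create. The sketch below generates a random value for it. It assumes the server accepts any opaque shared token (the expected token format is not stated here), and gen_token is an illustrative helper, not a ChatCLI command.

```shell
# gen_token: prints 32 random bytes as 64 lowercase hex characters.
# Assumption: the server treats the token as an opaque shared secret.
gen_token() {
  head -c 32 /dev/urandom | od -An -tx1 | tr -d ' \n'
}

# Usage: create the Secret referenced by spec.server.token
#   kubectl create secret generic chatcli-auth \
#     --namespace chatcli-system \
#     --from-literal=token="$(gen_token)"
```

Generating the value inline keeps the token out of shell history files, although it is still visible in the process list while kubectl runs.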
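To enable Prometheus metrics collection during incident analysis, set the prometheusUrl chart value (which supplies the PROMETHEUS_URL variable) via Helm. Pass the same namespace used at install time so Helm finds the release:
helm upgrade chatcli-operator oci://ghcr.io/diillson/charts/chatcli-operator \
  --namespace chatcli-system \
  --set prometheusUrl="http://prometheus-server.monitoring.svc:9090"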

2.1 TLS Secret: SANs and CA

This is where most installs fail silently. The Instance CR references secretName: chatcli-tls, but that Secret needs two things a default openssl req -x509 run does not give you: a certificate with subjectAltName entries, and a ca.crt key alongside tls.crt and tls.key.

Generate the cert with subjectAltName

Without SANs covering the DNS name the operator uses to dial gRPC, the handshake fails with:
transport: authentication handshake failed: x509: certificate is not valid for any names, but wanted to match chatcli-prod.chatcli-system.svc.cluster.local
Use an explicit openssl.cnf:
cat > openssl.cnf <<'EOF'
[req]
distinguished_name = req_dn
x509_extensions    = v_ext
prompt             = no

[req_dn]
CN = chatcli-prod.chatcli-system.svc.cluster.local

[v_ext]
subjectAltName = @alt_names

[alt_names]
DNS.1 = chatcli-prod.chatcli-system.svc.cluster.local
DNS.2 = chatcli-prod.chatcli-system.svc
DNS.3 = chatcli-prod
DNS.4 = localhost
EOF

openssl req -x509 -newkey rsa:4096 -sha256 -days 825 -nodes \
  -keyout tls.key -out tls.crt -config openssl.cnf -extensions v_ext
Verify with:
openssl x509 -in tls.crt -noout -text | grep -A1 'Subject Alternative Name'
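
The grep above shows the SAN line but leaves the comparison to the reader, and a substring match can mislead (chatcli-prod is a prefix of the longer names). The sketch below checks for one exact DNS entry. It assumes the text output of openssl x509 -noout -text is piped in; has_dns_san is an illustrative helper name.

```shell
# has_dns_san NAME: reads `openssl x509 -noout -text` output on stdin and
# succeeds only if the SAN list contains exactly DNS:NAME.
has_dns_san() {
  # SAN entries are comma-separated on one indented line; split them,
  # strip leading whitespace, then require a whole-line fixed-string match.
  tr ',' '\n' | sed 's/^[[:space:]]*//' | grep -Fxq "DNS:$1"
}

# Usage:
#   openssl x509 -in tls.crt -noout -text \
#     | has_dns_san chatcli-prod.chatcli-system.svc.cluster.local \
#     && echo "SAN present"
```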

Include ca.crt in the Secret

A self-signed cert is its own CA. If the Secret only contains tls.crt and tls.key, the operator reaches the server but the TLS handshake fails with:
transport: authentication handshake failed: x509: certificate signed by unknown authority
The WatcherBridge automatically reads the ca.crt key from the Secret referenced by the Instance and uses it as the trust root — so the Secret needs all three keys:
kubectl -n chatcli-system create secret generic chatcli-tls \
  --from-file=tls.crt=tls.crt \
  --from-file=tls.key=tls.key \
  --from-file=ca.crt=tls.crt   # self-signed: cert is its own CA
With ca.crt inside the Secret, you do not need to mount a CA ConfigMap or set SSL_CERT_FILE / CHATCLI_GRPC_TLS_CA on the operator deployment. The environment-variable route is a secondary path for multi-Instance setups that share a CA, and it requires manual volume mounting (extraEnv + volume).
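
A quick way to confirm the Secret has all three keys is to list its data keys and report any that are absent. This is a minimal sketch that assumes the key names arrive one per line on stdin; missing_tls_keys is an illustrative helper name.

```shell
# missing_tls_keys: reads Secret data key names on stdin (one per line)
# and prints each of tls.crt, tls.key, ca.crt that is absent.
missing_tls_keys() {
  present=$(cat)
  for key in tls.crt tls.key ca.crt; do
    printf '%s\n' "$present" | grep -Fxq "$key" || printf '%s\n' "$key"
  done
}

# Usage: an empty result means all three keys are present
#   kubectl -n chatcli-system get secret chatcli-tls \
#     -o go-template='{{range $k, $v := .data}}{{$k}}{{"\n"}}{{end}}' \
#     | missing_tls_keys
```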

What if the cert is issued by cert-manager or ACM?

§2.1 above covers the fragile self-signed case. With cert-manager or AWS ACM the setup is simpler, but each issuer has its own gotcha:
| Issuer | ca.crt in Secret? | Where SAN must match | spec.server.address points to… |
| --- | --- | --- | --- |
| cert-manager + Let’s Encrypt / public ACME | No: CA already in the system trust store | Public FQDN (e.g. chatcli.example.com) | Public FQDN via Ingress/NLB with gRPC passthrough |
| cert-manager + internal ClusterIssuer (CA) | Yes: cert-manager writes ca.crt into the Secret automatically | dnsNames in the Certificate CR; include in-cluster names | In-cluster Service (<svc>.<ns>.svc.cluster.local) |
| AWS ACM Public | N/A: private key is not exportable | Public FQDN | Public FQDN via ALB/NLB (TLS terminates at the LB) |
| AWS ACM Private CA | Yes: include the Private CA bundle as ca.crt | Set at issuance; include in-cluster names | In-cluster Service |
| Self-signed (manual openssl, §2.1) | Yes: ca.crt=tls.crt (cert is its own CA) | Set via subjectAltName in openssl.cnf | In-cluster Service |
Key notes:
  • Publicly trusted cert → trust already exists. The operator code (grpc_client.go) only sets RootCAs when a custom CA is provided; without one, Go uses the container’s ca-certificates bundle. That’s why Let’s Encrypt and ACM Public “just work” on the CA side — but spec.server.address must be the public FQDN, not the in-cluster Service, or the SAN won’t match.
  • cert-manager with an internal CA is the cleanest K8s path. The Certificate CR below emits everything ready for WatcherBridge auto-trust — no manual openssl:
    apiVersion: cert-manager.io/v1
    kind: Certificate
    metadata:
      name: chatcli-tls
      namespace: chatcli-system
    spec:
      secretName: chatcli-tls          # Same Secret referenced in the Instance CR
      issuerRef:
        name: internal-ca              # ClusterIssuer of type CA (spec.ca)
        kind: ClusterIssuer
      commonName: chatcli-prod.chatcli-system.svc.cluster.local
      dnsNames:
        - chatcli-prod.chatcli-system.svc.cluster.local
        - chatcli-prod.chatcli-system.svc
        - chatcli-prod
      duration: 8760h                  # 1 year
      renewBefore: 720h                # renew 30 days before expiry
    
    Because the referenced ClusterIssuer is a CA-type issuer, cert-manager automatically writes ca.crt into the generated Secret, so WatcherBridge picks it up directly with no extra config.
  • ACM Public does not fit pod-to-pod gRPC. The private key is not exportable; only use when TLS terminates at the ALB/NLB and the operator dials the public endpoint.
  • ACM Private CA — export the Private CA bundle (aws acm-pca get-certificate-authority-certificate) and include it as ca.crt in the Secret. From there on, it follows the auto-trust path.

Link Source Repositories

Link your monitored applications’ source code repositories for code-aware diagnostics. The AI will receive context from recent commits, code snippets from stack traces, and configuration files.
apiVersion: platform.chatcli.io/v1alpha1
kind: SourceRepository
metadata:
  name: api-gateway-repo
  namespace: chatcli-system
spec:
  url: "https://github.com/myorg/api-gateway.git"
  branch: main
  authType: token
  secretRef: git-token
  resource:
    kind: Deployment
    name: api-gateway
    namespace: production
  paths: ["cmd/", "internal/"]
  language: "Go"
---
apiVersion: platform.chatcli.io/v1alpha1
kind: SourceRepository
metadata:
  name: payment-service-repo
  namespace: chatcli-system
spec:
  url: "git@github.com:myorg/payment-service.git"
  branch: main
  authType: ssh
  secretRef: git-ssh-key
  resource:
    kind: Deployment
    name: payment-service
    namespace: production
  language: "Java"

3. Configure Notifications

apiVersion: platform.chatcli.io/v1alpha1
kind: NotificationPolicy
metadata:
  name: prod-notifications
  namespace: chatcli-system
spec:
  enabled: true
  channels:
    - name: slack-incidents
      type: slack
      config:
        webhook_url: "https://hooks.slack.com/services/T.../B.../xxx"
        channel: "#incidents"
    - name: pagerduty-critical
      type: pagerduty
      config:
        routing_key: "R0xxxxxxxxxxxxxxxxxxxx"
    - name: email-management
      type: email
      config:
        smtp_host: smtp.gmail.com
        smtp_port: "587"
        from: aiops@company.com
        to: "sre-team@company.com,management@company.com"
      secretRef:
        name: smtp-credentials
  rules:
    - name: critical-to-pagerduty
      severities: [critical]
      states: [Detected, Escalated]
      channels: [pagerduty-critical, slack-incidents]
    - name: high-to-slack
      severities: [critical, high]
      states: [Detected, Analyzing, Remediating, Resolved, Escalated]
      channels: [slack-incidents]
    - name: escalations-to-email
      states: [Escalated]
      channels: [email-management]
  throttle:
    maxPerHour: 20
    deduplicationWindow: "5m"

4. Configure Escalation

apiVersion: platform.chatcli.io/v1alpha1
kind: EscalationPolicy
metadata:
  name: prod-escalation
  namespace: chatcli-system
spec:
  enabled: true
  severities: [critical, high]
  levels:
    - name: L1-OnCall
      timeoutMinutes: 5
      targets:
        - type: oncall
          name: primary-oncall
      notifyChannels: [slack-incidents, pagerduty-critical]
      repeatIntervalMinutes: 5
    - name: L2-SeniorSRE
      timeoutMinutes: 15
      targets:
        - type: team
          name: sre-senior
      notifyChannels: [slack-incidents, pagerduty-critical]
    - name: L3-Engineering-Lead
      timeoutMinutes: 30
      targets:
        - type: user
          name: eng-lead@company.com
      notifyChannels: [slack-incidents, email-management]

5. Define SLOs

apiVersion: platform.chatcli.io/v1alpha1
kind: ServiceLevelObjective
metadata:
  name: api-gateway-availability
  namespace: chatcli-system
spec:
  serviceName: api-gateway
  description: "API Gateway must maintain 99.9% availability"
  enabled: true
  indicator:
    type: availability
    metricSource: issues
    resource:
      kind: Deployment
      name: api-gateway
      namespace: production
  target:
    percentage: 99.9
    window: "30d"
  alertPolicy:
    pageOnBudgetExhausted: true
    notificationPolicyRef: prod-notifications
    burnRateWindows:
      - shortWindow: "1h"
        longWindow: "6h"
        burnRateThreshold: 14.4
        severity: critical
      - shortWindow: "6h"
        longWindow: "72h"
        burnRateThreshold: 6
        severity: high
      - shortWindow: "24h"
        longWindow: "72h"
        burnRateThreshold: 3
        severity: medium
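
The thresholds above are easier to reason about as arithmetic. In the Google SRE burn-rate model, a burn rate of 1 spends the error budget exactly over the SLO window, so a rate of N spends it N times faster. The sketch below works through the numbers for the 99.9% / 30d target. The helper names are illustrative, and how the operator evaluates these windows internally is not specified here.

```shell
# error_budget_minutes TARGET_PCT WINDOW_DAYS
# Prints the allowed "bad" minutes in the window.
error_budget_minutes() {
  awk -v t="$1" -v d="$2" 'BEGIN { printf "%.1f\n", (100 - t) / 100 * d * 24 * 60 }'
}

# hours_to_exhaust WINDOW_DAYS BURN_RATE
# Prints how many hours the full budget lasts at a sustained burn rate.
hours_to_exhaust() {
  awk -v d="$1" -v b="$2" 'BEGIN { printf "%.1f\n", d * 24 / b }'
}

error_budget_minutes 99.9 30   # prints 43.2
hours_to_exhaust 30 14.4       # prints 50.0  (critical threshold)
hours_to_exhaust 30 6          # prints 120.0 (high threshold)
hours_to_exhaust 30 3          # prints 240.0 (medium threshold)
```

Read this way, the critical rule fires when the service is on track to spend its entire 43.2-minute monthly budget in about two days.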

6. Define SLAs

apiVersion: platform.chatcli.io/v1alpha1
kind: IncidentSLA
metadata:
  name: p1-sla
  namespace: chatcli-system
spec:
  severity: critical
  responseTime: "5m"
  resolutionTime: "1h"
  escalationPolicyRef: prod-escalation
  notificationPolicyRef: prod-notifications
  businessHoursOnly: false
---
apiVersion: platform.chatcli.io/v1alpha1
kind: IncidentSLA
metadata:
  name: p2-sla
  namespace: chatcli-system
spec:
  severity: high
  responseTime: "15m"
  resolutionTime: "4h"
  escalationPolicyRef: prod-escalation
  businessHoursOnly: true
  businessHours:
    timezone: "America/Sao_Paulo"
    startHour: 9
    endHour: 18
    workDays: ["Monday","Tuesday","Wednesday","Thursday","Friday"]

7. Configure Approvals

apiVersion: platform.chatcli.io/v1alpha1
kind: ApprovalPolicy
metadata:
  name: prod-approvals
  namespace: chatcli-system
spec:
  enabled: true
  defaultMode: manual
  rules:
    - name: auto-low-confidence
      match:
        severities: [low]
        actionTypes: [RestartDeployment, DeletePod]
      mode: auto
      autoApproveConditions:
        minConfidence: 0.95
        maxSeverity: low
        historicalSuccessRate: 0.90
    - name: quorum-production-rollback
      match:
        severities: [critical, high]
        actionTypes: [RollbackDeployment, ScaleDeployment]
        namespaces: [production]
      mode: quorum
      requiredApprovers: 2
      timeoutMinutes: 15
      changeWindow:
        timezone: "America/Sao_Paulo"
        allowedDays: ["Monday","Tuesday","Wednesday","Thursday","Friday"]
        startHour: 9
        endHour: 17
    - name: manual-resource-changes
      match:
        actionTypes: [AdjustResources, PatchConfig]
      mode: manual
      timeoutMinutes: 30

8. Install Grafana Dashboards

# Create ConfigMap with dashboards
kubectl create configmap chatcli-grafana-dashboards \
  --from-file=deploy/grafana/ \
  -n monitoring \
  --dry-run=client -o yaml | kubectl apply -f -

# Add label for Grafana sidecar auto-discovery
kubectl label configmap chatcli-grafana-dashboards \
  grafana_dashboard=1 -n monitoring --overwrite

# Install ServiceMonitors
kubectl apply -f deploy/grafana/dashboards-configmap.yaml

9. Validate with Chaos Engineering

Run chaos experiments only against workloads with redundancy, never against single-replica deployments.
apiVersion: platform.chatcli.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: validate-api-gateway-resilience
  namespace: chatcli-system
spec:
  experimentType: pod_kill
  target:
    kind: Deployment
    name: api-gateway
    namespace: production
  duration: "2m"
  parameters:
    count: "1"
  dryRun: true  # Test in dry-run first!
  enabled: true
  safetyChecks:
    minHealthyPods: 2
    maxConcurrentExperiments: 1
    abortOnIssueDetected: true
    requireApproval: true
    blockedNamespaces: ["kube-system", "monitoring"]
  postExperiment:
    verifyRecovery: true
    recoveryTimeout: "5m"
    runRemediationTest: false
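
Before applying the experiment, it can help to confirm by hand that the target has enough ready pods to survive the kill. The sketch below uses a conservative reading of minHealthyPods (ready pods minus killed pods must stay at or above the minimum). That interpretation is an assumption, as is the helper name safe_to_kill; the operator's own safety check may differ.

```shell
# safe_to_kill READY_PODS KILL_COUNT MIN_HEALTHY
# Succeeds when killing KILL_COUNT pods still leaves at least MIN_HEALTHY ready.
# Assumption: this mirrors minHealthyPods conservatively; the operator's
# actual check is not documented here.
safe_to_kill() {
  [ "$(( $1 - $2 ))" -ge "$3" ]
}

# Usage (readyReplicas is omitted from status when zero, hence the default):
#   ready=$(kubectl get deployment api-gateway -n production \
#     -o jsonpath='{.status.readyReplicas}')
#   safe_to_kill "${ready:-0}" 1 2 || echo "not enough ready pods, skipping"
```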

Step 1. Run in dry-run mode

kubectl apply -f chaos-experiment.yaml
kubectl get chaos -n chatcli-system -w

Step 2. Verify result

kubectl get chaos validate-api-gateway-resilience -n chatcli-system -o yaml

Step 3. Run for real (after validation)

Set dryRun: false in chaos-experiment.yaml and reapply.

10. Access the Dashboard

# Port-forward to the REST API + Web UI (served by the operator Service)
kubectl port-forward svc/chatcli-operator 8090:8090 -n chatcli-system

# Open in browser (macOS; use xdg-open on Linux)
open http://localhost:8090
The web dashboard shows:
  • Overview with real-time stats
  • Incidents with filters and actions (acknowledge, snooze)
  • SLOs with error budget and burn rates
  • Pending approvals
  • PostMortems with timeline
  • Federated clusters
  • Searchable audit log

10.1 Expose the Dashboard via Ingress (alternative to port-forward)

To reach the dashboard from outside the cluster, create an Ingress pointing at the operator Service. When mounting under a sub-path, rewrite-target with a capture group is required — the dashboard’s static assets are served from / and would 404 otherwise:
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: chatcli-dashboard
  namespace: chatcli-system
  annotations:
    nginx.ingress.kubernetes.io/rewrite-target: /$2
spec:
  ingressClassName: nginx
  rules:
    - host: chatcli.example.com
      http:
        paths:
          - path: /chatcli(/|$)(.*)
            pathType: ImplementationSpecific
            backend:
              service:
                name: chatcli-operator
                port:
                  number: 8090
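
To see what the path regex and rewrite-target do to a request, the same pattern can be tried locally with sed. This is only an approximation of the nginx rewrite, meant for checking which paths match and what the backend receives; rewrite_path is an illustrative helper name.

```shell
# rewrite_path PATH: applies the Ingress pattern /chatcli(/|$)(.*) and
# rewrites to /$2, leaving non-matching paths unchanged.
rewrite_path() {
  printf '%s\n' "$1" | sed -E 's#^/chatcli(/|$)(.*)#/\2#'
}

rewrite_path /chatcli/assets/app.js   # prints /assets/app.js
rewrite_path /chatcli                 # prints /
rewrite_path /chatclix                # prints /chatclix (no match)
```

The (/|$) group is what stops /chatclix from matching while still accepting both /chatcli and /chatcli/.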

11. Common Troubleshooting

| Operator log error | Cause | Fix |
| --- | --- | --- |
| x509: certificate is not valid for any names | gRPC server cert missing SAN for spec.server.address | Regenerate cert with openssl.cnf + subjectAltName for the Service FQDN (see §2.1) |
| x509: certificate signed by unknown authority | Self-signed cert with no trust configured | Add ca.crt key to the chatcli-tls Secret referenced by the Instance (see §2.1) |
| no ready Instance found | Instance not Ready or in a different namespace | kubectl describe instance chatcli-prod -n chatcli-system, then inspect status and events |
| connection refused after TLS succeeds | Service has no endpoints or wrong gRPC port | kubectl get endpoints chatcli-prod -n chatcli-system must list pod IPs |
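
The table can also be applied mechanically to a captured log. The sketch below scans log text on stdin for the four error strings and prints the matching hint. The helper name diagnose_operator_log is illustrative, and the Deployment name in the usage comment is an assumption based on the chatcli-operator Service name.

```shell
# diagnose_operator_log: reads operator log text on stdin and prints one
# hint per known error string found (patterns taken from the table above).
diagnose_operator_log() {
  log=$(cat)
  while IFS='|' read -r pattern hint; do
    if printf '%s\n' "$log" | grep -Fq "$pattern"; then
      printf '%s\n' "$hint"
    fi
  done <<'EOF'
certificate is not valid for any names|SAN mismatch: regenerate the cert with subjectAltName for the Service FQDN
certificate signed by unknown authority|Missing trust root: add ca.crt to the chatcli-tls Secret
no ready Instance found|Instance not Ready or in another namespace: kubectl describe instance chatcli-prod -n chatcli-system
connection refused|No endpoints or wrong gRPC port: kubectl get endpoints chatcli-prod -n chatcli-system
EOF
}

# Usage (Deployment name is an assumption):
#   kubectl logs deploy/chatcli-operator -n chatcli-system | diagnose_operator_log
```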

Production Checklist

  • [x] Operator installed with 17 CRDs
  • [x] Instance created with TLS and auth
  • [x] Secret chatcli-tls contains tls.crt, tls.key and ca.crt (self-signed: ca.crt=tls.crt)
  • [x] tls.crt has SANs for <instance>.<ns>.svc.cluster.local, <instance>.<ns>.svc and <instance>
  • [x] spec.server.address in the Instance matches one of the cert SANs
  • [x] Operator logs show Connected to Instance with no x509: errors within ~30s of the Instance becoming Ready
  • [x] Watcher monitoring target deployments
  • [x] NotificationPolicy with Slack + PagerDuty
  • [x] EscalationPolicy L1->L2->L3
  • [x] SLOs with burn rate alerting (Google SRE model)
  • [x] SLAs with response/resolution time per severity
  • [x] ApprovalPolicy with auto/quorum for production
  • [x] Grafana dashboards installed
  • [x] Chaos experiment validated in dry-run
  • [x] Web Dashboard accessible
  • [x] REST API with authentication configured