This cookbook covers the complete configuration of the ChatCLI AIOps platform for a real production environment, from installation through validation with chaos engineering.

Prerequisites

  • Kubernetes cluster 1.25+
  • Helm 3.x installed
  • Prometheus Operator (for ServiceMonitor)
  • Grafana (for dashboards)
  • At least one LLM API key (OpenAI, Claude, Google AI)

1. Install the Operator

Step 1. Install the Operator via Helm (CRDs + RBAC + Controllers + Dashboard)

helm install chatcli-operator \
  oci://ghcr.io/diillson/charts/chatcli-operator \
  --namespace chatcli-system \
  --create-namespace
Step 2. Verify the installed CRDs

kubectl get crd | grep platform.chatcli.io
17 CRDs should be listed:
aiinsights.platform.chatcli.io
anomalies.platform.chatcli.io
approvalpolicies.platform.chatcli.io
approvalrequests.platform.chatcli.io
auditevents.platform.chatcli.io
chaosexperiments.platform.chatcli.io
clusterregistrations.platform.chatcli.io
escalationpolicies.platform.chatcli.io
incidentslas.platform.chatcli.io
instances.platform.chatcli.io
issues.platform.chatcli.io
notificationpolicies.platform.chatcli.io
postmortems.platform.chatcli.io
remediationplans.platform.chatcli.io
runbooks.platform.chatcli.io
servicelevelobjectives.platform.chatcli.io
sourcerepositories.platform.chatcli.io
Step 3. Create a Secret with the API keys

kubectl create secret generic chatcli-api-keys \
  --namespace chatcli-system \
  --from-literal=ANTHROPIC_API_KEY=sk-ant-xxx \
  --from-literal=OPENAI_API_KEY=sk-xxx
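
The CRD check in step 2 can also be scripted, which is handy in CI. The following is a minimal Python sketch and not part of ChatCLI: `missing_crds` is a hypothetical helper name, and it only assumes that each line of `kubectl get crd` output starts with the CRD name.

```python
# Hypothetical helper (not part of ChatCLI): compare `kubectl get crd` output
# against the 17 CRD names listed in step 2.
import subprocess

GROUP = "platform.chatcli.io"
EXPECTED_CRDS = [
    "aiinsights", "anomalies", "approvalpolicies", "approvalrequests",
    "auditevents", "chaosexperiments", "clusterregistrations",
    "escalationpolicies", "incidentslas", "instances", "issues",
    "notificationpolicies", "postmortems", "remediationplans", "runbooks",
    "servicelevelobjectives", "sourcerepositories",
]


def missing_crds(kubectl_output: str) -> list[str]:
    """Return the expected CRD names that are absent from the first column."""
    present = {
        line.split()[0] for line in kubectl_output.splitlines() if line.strip()
    }
    expected = [f"{name}.{GROUP}" for name in EXPECTED_CRDS]
    return [name for name in expected if name not in present]


if __name__ == "__main__":
    # Requires kubectl on PATH and access to the cluster.
    output = subprocess.run(
        ["kubectl", "get", "crd", "--no-headers"],
        capture_output=True, text=True, check=True,
    ).stdout
    missing = missing_crds(output)
    print("All 17 CRDs present" if not missing else f"Missing: {missing}")
```

An empty result means all 17 CRDs are registered; anything else usually points to an incomplete Helm install.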

2. Create a ChatCLI Instance

apiVersion: platform.chatcli.io/v1alpha1
kind: Instance
metadata:
  name: chatcli-prod
  namespace: chatcli-system
spec:
  replicas: 2
  provider: CLAUDEAI
  model: claude-sonnet-4-20250514
  server:
    port: 50051
    metricsPort: 9090
    tls:
      enabled: true
      secretName: chatcli-tls
    token:
      name: chatcli-auth
      key: token
  apiKeys:
    name: chatcli-api-keys
  watcher:
    enabled: true
    targets:
      - deployment: api-gateway
        namespace: production
        metricsPort: 8080
        metricsPath: /metrics
      - deployment: payment-service
        namespace: production
        metricsPort: 8080
      - deployment: user-service
        namespace: production
    interval: "30s"
    window: "2h"
  resources:
    requests:
      cpu: 500m
      memory: 512Mi
    limits:
      cpu: "2"
      memory: 1Gi
  persistence:
    enabled: true
    size: 5Gi
To enable Prometheus metrics collection during incident analysis, add the PROMETHEUS_URL variable to the ConfigMap or pass it via Helm:
helm upgrade chatcli-operator oci://ghcr.io/diillson/charts/chatcli-operator \
  --namespace chatcli-system \
  --set prometheusUrl="http://prometheus-server.monitoring.svc:9090"

2.1 Link Source Code Repositories (Optional)

Link the source code repositories of the monitored applications to enable code-aware diagnosis. The AI will receive context from recent commits, code snippets referenced in stack traces, and configuration files.
apiVersion: platform.chatcli.io/v1alpha1
kind: SourceRepository
metadata:
  name: api-gateway-repo
  namespace: chatcli-system
spec:
  url: "https://github.com/myorg/api-gateway.git"
  branch: main
  authType: token
  secretRef: git-token
  resource:
    kind: Deployment
    name: api-gateway
    namespace: production
  paths: ["cmd/", "internal/"]
  language: "Go"
---
apiVersion: platform.chatcli.io/v1alpha1
kind: SourceRepository
metadata:
  name: payment-service-repo
  namespace: chatcli-system
spec:
  url: "git@github.com:myorg/payment-service.git"
  branch: main
  authType: ssh
  secretRef: git-ssh-key
  resource:
    kind: Deployment
    name: payment-service
    namespace: production
  language: "Java"

3. Configure Notifications

apiVersion: platform.chatcli.io/v1alpha1
kind: NotificationPolicy
metadata:
  name: prod-notifications
  namespace: chatcli-system
spec:
  enabled: true
  channels:
    - name: slack-incidents
      type: slack
      config:
        webhook_url: "https://hooks.slack.com/services/T.../B.../xxx"
        channel: "#incidents"
    - name: pagerduty-critical
      type: pagerduty
      config:
        routing_key: "R0xxxxxxxxxxxxxxxxxxxx"
    - name: email-management
      type: email
      config:
        smtp_host: smtp.gmail.com
        smtp_port: "587"
        from: aiops@company.com
        to: "sre-team@company.com,management@company.com"
      secretRef:
        name: smtp-credentials
  rules:
    - name: critical-to-pagerduty
      severities: [critical]
      states: [Detected, Escalated]
      channels: [pagerduty-critical, slack-incidents]
    - name: high-to-slack
      severities: [critical, high]
      states: [Detected, Analyzing, Remediating, Resolved, Escalated]
      channels: [slack-incidents]
    - name: escalations-to-email
      states: [Escalated]
      channels: [email-management]
  throttle:
    maxPerHour: 20
    deduplicationWindow: "5m"
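
To see which channels a given incident would reach, the rules above can be modeled in a few lines. This is a sketch under stated assumptions, not the operator's actual matching code: it assumes that an omitted `severities` or `states` list matches everything, that all matching rules apply, and that duplicate channels are notified once. `Rule` and `channels_for` are hypothetical names.

```python
# Sketch of rule routing for the NotificationPolicy above (assumed semantics).
from dataclasses import dataclass, field


@dataclass
class Rule:
    name: str
    channels: list[str]
    severities: list[str] = field(default_factory=list)  # empty = any (assumption)
    states: list[str] = field(default_factory=list)      # empty = any (assumption)


RULES = [
    Rule("critical-to-pagerduty", ["pagerduty-critical", "slack-incidents"],
         severities=["critical"], states=["Detected", "Escalated"]),
    Rule("high-to-slack", ["slack-incidents"],
         severities=["critical", "high"],
         states=["Detected", "Analyzing", "Remediating", "Resolved", "Escalated"]),
    Rule("escalations-to-email", ["email-management"], states=["Escalated"]),
]


def channels_for(severity: str, state: str, rules: list[Rule] = RULES) -> list[str]:
    """Return the de-duplicated channels of every rule matching the incident."""
    selected: list[str] = []
    for rule in rules:
        if rule.severities and severity not in rule.severities:
            continue
        if rule.states and state not in rule.states:
            continue
        for channel in rule.channels:
            if channel not in selected:
                selected.append(channel)
    return selected
```

Under these assumptions, `channels_for("critical", "Escalated")` reaches all three channels, while `channels_for("medium", "Detected")` reaches none, which may or may not be what you want for medium-severity incidents.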

4. Configure Escalation

apiVersion: platform.chatcli.io/v1alpha1
kind: EscalationPolicy
metadata:
  name: prod-escalation
  namespace: chatcli-system
spec:
  enabled: true
  severities: [critical, high]
  levels:
    - name: L1-OnCall
      timeoutMinutes: 5
      targets:
        - type: oncall
          name: primary-oncall
      notifyChannels: [slack-incidents, pagerduty-critical]
      repeatIntervalMinutes: 5
    - name: L2-SeniorSRE
      timeoutMinutes: 15
      targets:
        - type: team
          name: sre-senior
      notifyChannels: [slack-incidents, pagerduty-critical]
    - name: L3-Engineering-Lead
      timeoutMinutes: 30
      targets:
        - type: user
          name: eng-lead@company.com
      notifyChannels: [slack-incidents, email-management]
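
It helps to work out when each level would fire for an incident nobody acknowledges. The sketch below assumes that `timeoutMinutes` is counted from the moment a level becomes active and that the next level starts as soon as it expires; the source does not confirm these semantics, and `escalation_timeline` is a hypothetical helper.

```python
# Sketch: cumulative escalation timeline for the policy above (assumed semantics).
LEVELS = [
    ("L1-OnCall", 5),
    ("L2-SeniorSRE", 15),
    ("L3-Engineering-Lead", 30),
]


def escalation_timeline(levels):
    """Return (level, minute it becomes active, minute its timeout expires)."""
    timeline = []
    start = 0
    for name, timeout_minutes in levels:
        timeline.append((name, start, start + timeout_minutes))
        start += timeout_minutes
    return timeline


for name, starts_at, expires_at in escalation_timeline(LEVELS):
    print(f"{name}: active from minute {starts_at} to minute {expires_at}")
```

With the values above, L2 would be paged 5 minutes after detection and L3 after 20 minutes, so compare these numbers with the response and resolution times you commit to in your SLAs.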

5. Define SLOs

apiVersion: platform.chatcli.io/v1alpha1
kind: ServiceLevelObjective
metadata:
  name: api-gateway-availability
  namespace: chatcli-system
spec:
  serviceName: api-gateway
  description: "API Gateway must maintain 99.9% availability"
  enabled: true
  indicator:
    type: availability
    metricSource: issues
    resource:
      kind: Deployment
      name: api-gateway
      namespace: production
  target:
    percentage: 99.9
    window: "30d"
  alertPolicy:
    pageOnBudgetExhausted: true
    notificationPolicyRef: prod-notifications
    burnRateWindows:
      - shortWindow: "1h"
        longWindow: "6h"
        burnRateThreshold: 14.4
        severity: critical
      - shortWindow: "6h"
        longWindow: "72h"
        burnRateThreshold: 6
        severity: high
      - shortWindow: "24h"
        longWindow: "72h"
        burnRateThreshold: 3
        severity: medium
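
The burn rate thresholds are easier to reason about as a share of the error budget. The sketch below is plain arithmetic, not operator code: it computes the budget for the target above and the fraction of it that is spent if a burn rate is sustained for a given window. How the operator actually combines `shortWindow` and `longWindow` is not covered here, and the function names are hypothetical.

```python
# Sketch: error budget and burn rate arithmetic for the SLO above.
WINDOW_UNITS_IN_HOURS = {"h": 1, "d": 24}


def window_hours(window: str) -> float:
    """Convert a window such as "6h" or "30d" to hours."""
    return float(window[:-1]) * WINDOW_UNITS_IN_HOURS[window[-1]]


def error_budget_minutes(target_percentage: float, window: str) -> float:
    """Allowed minutes of unavailability within the SLO window."""
    return window_hours(window) * 60 * (100 - target_percentage) / 100


def budget_consumed(burn_rate: float, sustained_for: str, slo_window: str) -> float:
    """Fraction of the error budget spent if burn_rate holds for sustained_for."""
    return burn_rate * window_hours(sustained_for) / window_hours(slo_window)


print(round(error_budget_minutes(99.9, "30d"), 1), "minutes of budget")
for rate, window in [(14.4, "1h"), (6, "6h"), (3, "24h")]:
    spent = budget_consumed(rate, window, "30d")
    print(f"burn rate {rate} for {window}: {spent:.0%} of the budget")
```

For a 99.9% target over 30 days the budget is about 43.2 minutes. A burn rate of 14.4 sustained for 1h spends 2% of it, 6 for 6h spends 5%, and 3 for 24h spends 10%.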

6. Define SLAs

apiVersion: platform.chatcli.io/v1alpha1
kind: IncidentSLA
metadata:
  name: p1-sla
  namespace: chatcli-system
spec:
  severity: critical
  responseTime: "5m"
  resolutionTime: "1h"
  escalationPolicyRef: prod-escalation
  notificationPolicyRef: prod-notifications
  businessHoursOnly: false
---
apiVersion: platform.chatcli.io/v1alpha1
kind: IncidentSLA
metadata:
  name: p2-sla
  namespace: chatcli-system
spec:
  severity: high
  responseTime: "15m"
  resolutionTime: "4h"
  escalationPolicyRef: prod-escalation
  businessHoursOnly: true
  businessHours:
    timezone: "America/Sao_Paulo"
    startHour: 9
    endHour: 18
    workDays: ["Monday","Tuesday","Wednesday","Thursday","Friday"]
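
To make the SLA timers concrete, the sketch below turns `responseTime` and `resolutionTime` into absolute deadlines. It assumes both timers start at detection and ignores business hours, so it matches `p1-sla` (`businessHoursOnly: false`) only; `parse_duration` and `sla_deadlines` are hypothetical helpers, not ChatCLI APIs.

```python
# Sketch: absolute SLA deadlines from single-unit durations (assumed semantics).
from datetime import datetime, timedelta, timezone

_UNITS = {"s": "seconds", "m": "minutes", "h": "hours", "d": "days"}


def parse_duration(text: str) -> timedelta:
    """Parse a single-unit duration such as "5m", "1h" or "30d"."""
    value, unit = int(text[:-1]), text[-1]
    if unit not in _UNITS:
        raise ValueError(f"unsupported duration unit in {text!r}")
    return timedelta(**{_UNITS[unit]: value})


def sla_deadlines(detected_at: datetime, response_time: str, resolution_time: str):
    """Return (respond_by, resolve_by), both counted from detection."""
    return (
        detected_at + parse_duration(response_time),
        detected_at + parse_duration(resolution_time),
    )


detected = datetime(2025, 1, 6, 12, 0, tzinfo=timezone.utc)
respond_by, resolve_by = sla_deadlines(detected, "5m", "1h")
print("respond by:", respond_by.isoformat())
print("resolve by:", resolve_by.isoformat())
```

For `p2-sla`, the timers would additionally have to pause outside the 9 to 18 window in America/Sao_Paulo, which this sketch deliberately leaves out.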

7. Configure Approvals

apiVersion: platform.chatcli.io/v1alpha1
kind: ApprovalPolicy
metadata:
  name: prod-approvals
  namespace: chatcli-system
spec:
  enabled: true
  defaultMode: manual
  rules:
    - name: auto-low-confidence
      match:
        severities: [low]
        actionTypes: [RestartDeployment, DeletePod]
      mode: auto
      autoApproveConditions:
        minConfidence: 0.95
        maxSeverity: low
        historicalSuccessRate: 0.90
    - name: quorum-production-rollback
      match:
        severities: [critical, high]
        actionTypes: [RollbackDeployment, ScaleDeployment]
        namespaces: [production]
      mode: quorum
      requiredApprovers: 2
      timeoutMinutes: 15
      changeWindow:
        timezone: "America/Sao_Paulo"
        allowedDays: ["Monday","Tuesday","Wednesday","Thursday","Friday"]
        startHour: 9
        endHour: 17
    - name: manual-resource-changes
      match:
        actionTypes: [AdjustResources, PatchConfig]
      mode: manual
      timeoutMinutes: 30
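
The `autoApproveConditions` of the first rule can be read as a three-part gate. The sketch below assumes that all conditions must hold at once, that thresholds are inclusive, and that severities are ordered low < medium < high < critical; none of this is confirmed by the source, and `auto_approve` is a hypothetical function.

```python
# Sketch: evaluating autoApproveConditions (assumed AND semantics).
SEVERITY_ORDER = ["low", "medium", "high", "critical"]


def auto_approve(
    confidence: float,
    severity: str,
    success_rate: float,
    min_confidence: float = 0.95,
    max_severity: str = "low",
    min_success_rate: float = 0.90,
) -> bool:
    """Return True only if every auto-approve condition is satisfied."""
    severity_ok = SEVERITY_ORDER.index(severity) <= SEVERITY_ORDER.index(max_severity)
    return (
        confidence >= min_confidence
        and severity_ok
        and success_rate >= min_success_rate
    )


print(auto_approve(confidence=0.97, severity="low", success_rate=0.93))
print(auto_approve(confidence=0.90, severity="low", success_rate=0.93))
```

Under these assumptions, a remediation that fails any single condition falls back to whatever another matching rule or `defaultMode: manual` requires.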

8. Install Grafana Dashboards

# Create a ConfigMap with the dashboards
kubectl create configmap chatcli-grafana-dashboards \
  --from-file=deploy/grafana/ \
  -n monitoring \
  --dry-run=client -o yaml | kubectl apply -f -

# Add the label used by the Grafana sidecar for auto-discovery
kubectl label configmap chatcli-grafana-dashboards \
  grafana_dashboard=1 -n monitoring

# Install ServiceMonitors
kubectl apply -f deploy/grafana/dashboards-configmap.yaml

9. Validate with Chaos Engineering

Run chaos experiments only in environments with redundancy. Never run them against single-replica Deployments.
apiVersion: platform.chatcli.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: validate-api-gateway-resilience
  namespace: chatcli-system
spec:
  experimentType: pod_kill
  target:
    kind: Deployment
    name: api-gateway
    namespace: production
  duration: "2m"
  parameters:
    count: "1"
  dryRun: true  # Test in dry-run first!
  enabled: true
  safetyChecks:
    minHealthyPods: 2
    maxConcurrentExperiments: 1
    abortOnIssueDetected: true
    requireApproval: true
    blockedNamespaces: ["kube-system", "monitoring"]
  postExperiment:
    verifyRecovery: true
    recoveryTimeout: "5m"
    runRemediationTest: false
Step 1. Run in dry-run mode

kubectl apply -f chaos-experiment.yaml
kubectl get chaos -n chatcli-system -w
Step 2. Check the result

kubectl get chaos validate-api-gateway-resilience -n chatcli-system -o yaml
Step 3. Run for real (after validation)

Set dryRun: false in the manifest and reapply it.
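
Before switching off dry-run, it is worth checking the replica math behind the redundancy warning. The sketch below assumes that `minHealthyPods` must still be satisfied after the pods are killed; the operator may evaluate it differently, and `safe_to_kill` is a hypothetical helper.

```python
# Sketch: replica headroom check for a pod_kill experiment (assumed semantics).
def safe_to_kill(ready_pods: int, kill_count: int, min_healthy_pods: int) -> bool:
    """True if killing kill_count pods still leaves min_healthy_pods ready."""
    return ready_pods - kill_count >= min_healthy_pods


def min_ready_required(kill_count: int, min_healthy_pods: int) -> int:
    """Smallest number of ready pods for which the experiment stays safe."""
    return kill_count + min_healthy_pods


print(min_ready_required(kill_count=1, min_healthy_pods=2))
print(safe_to_kill(ready_pods=2, kill_count=1, min_healthy_pods=2))
```

With `count: "1"` and `minHealthyPods: 2`, this reading requires at least three ready api-gateway pods before the experiment runs for real.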

10. Configure the Dashboard API Keys

apiVersion: v1
kind: ConfigMap
metadata:
  name: chatcli-operator-config
  namespace: chatcli-system
data:
  api-keys: |
    - key: "ck_live_admin_YOUR_KEY"
      role: admin
      description: "SRE Team"
    - key: "ck_live_viewer_YOUR_KEY"
      role: viewer
      description: "Read-only NOC"
kubectl apply -f operator-config.yaml
kubectl rollout restart deployment chatcli-operator -n chatcli-system
Without this ConfigMap, the REST API runs in dev mode (no authentication). Always configure API keys before exposing it externally.
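
The placeholder keys in the ConfigMap should be replaced with long random values. One way to generate them is sketched below with Python's `secrets` module. The `ck_live_<role>_` prefix is copied from the placeholders above; whether the operator requires or validates that format is an assumption, and `generate_api_key` is a hypothetical helper.

```python
# Sketch: generating random dashboard API keys (prefix format is an assumption).
import secrets

ROLES = ("admin", "viewer")  # the two roles used in the ConfigMap above


def generate_api_key(role: str) -> str:
    """Return a random key that reuses the ck_live_<role>_ prefix from the example."""
    if role not in ROLES:
        raise ValueError(f"unknown role: {role!r}")
    # token_urlsafe(32) encodes 32 random bytes as URL-safe text.
    return f"ck_live_{role}_{secrets.token_urlsafe(32)}"


if __name__ == "__main__":
    for role in ROLES:
        print(role, generate_api_key(role))
```

Store the generated values outside version control and paste them into the `api-keys` entry before applying the ConfigMap.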

11. Access the Dashboard

# Port-forward the REST API + Web UI
kubectl port-forward svc/chatcli-operator 8090:8090 -n chatcli-system

# Open in the browser (macOS; use xdg-open on Linux)
open http://localhost:8090
The web dashboard shows:
  • Overview with real-time stats
  • Incidents with filters and actions (acknowledge, snooze)
  • SLOs with error budget and burn rates
  • Pending Approvals
  • PostMortems with timeline
  • Federated Clusters
  • Searchable Audit log

Production Checklist

  • Operator installed with 17 CRDs
  • Instance created with TLS and auth
  • Watcher monitoring the target deployments
  • NotificationPolicy with Slack + PagerDuty
  • EscalationPolicy with levels L1, L2, and L3
  • SLOs with burn rate alerting (Google SRE model)
  • SLAs with response/resolution times per severity
  • ApprovalPolicy with auto/quorum modes for production
  • Grafana dashboards installed
  • Chaos experiment validated in dry-run
  • Operator API keys configured (ConfigMap chatcli-operator-config)
  • Web Dashboard accessible
  • REST API with authentication configured (X-API-Key header)