Multi-Cluster Federation allows the ChatCLI AIOps platform to manage multiple Kubernetes clusters from a single control plane. Incidents are correlated across clusters, cascades are detected automatically, and remediation policies respect each environment’s tier.
Federation does not require a service mesh or external tools. The operator connects directly to each cluster via kubeconfig stored in Secrets.
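As a quick sketch, such a Secret can be created with kubectl before registering the cluster (the file path here is illustrative):

```shell
# Store a remote cluster's kubeconfig where the operator can read it.
# ./prod-us-east-1.kubeconfig is an illustrative path.
kubectl create secret generic cluster-prod-us-east-1-kubeconfig \
  --namespace chatcli-system \
  --from-file=kubeconfig=./prod-us-east-1.kubeconfig
```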

Why Multi-Cluster Federation?

In modern production environments, infrastructure is rarely limited to a single cluster:

  • Multi-Region: Clusters in us-east-1, eu-west-1, and ap-southeast-1 for latency and regional compliance.
  • Multi-Environment: Staging, production, and DR in separate clusters with different security policies.
  • Multi-Tenant: Dedicated clusters per team or product with strong workload isolation.

Without federation, each cluster is a silo. AIOps loses the ability to:
  • Detect that the same problem affects 5 clusters simultaneously
  • Correlate a staging deploy with a production failure
  • Apply differentiated remediation policies by cluster importance
  • Aggregate health metrics into a global view

ClusterRegistration CRD

The ClusterRegistration CRD is the entry point for adding clusters to the federation.

Complete Specification

apiVersion: platform.chatcli.io/v1alpha1
kind: ClusterRegistration
metadata:
  name: prod-us-east-1
  namespace: chatcli-system
spec:
  # Reference to the Secret containing the kubeconfig
  kubeconfigSecretRef:
    name: cluster-prod-us-east-1-kubeconfig
    key: kubeconfig

  # Cluster metadata
  region: us-east-1
  environment: production
  tier: critical          # critical | standard | non-critical

  # Monitoring configuration
  healthCheckInterval: 30s
  capabilities:
    - monitoring
    - remediation
    - chaos-engineering

  # Safety limits
  maxConcurrentRemediations: 2

status:
  # Automatically populated by the controller
  connected: true
  lastHealthCheck: "2026-03-19T14:30:00Z"
  kubernetesVersion: "v1.29.2"
  nodeCount: 12
  namespaceCount: 34
  conditions:
    - type: Connected
      status: "True"
      lastTransitionTime: "2026-03-19T10:00:00Z"
      reason: HealthCheckSucceeded
      message: "Cluster accessible, 12 nodes, 34 namespaces"
    - type: RemediationCapable
      status: "True"
      lastTransitionTime: "2026-03-19T10:00:00Z"
      reason: RBACConfigured
      message: "ServiceAccount with remediation permissions"

Spec Fields

| Field | Type | Required | Description |
|---|---|---|---|
| kubeconfigSecretRef.name | string | Yes | Name of the Secret with the kubeconfig |
| kubeconfigSecretRef.key | string | Yes | Key within the Secret (usually kubeconfig) |
| region | string | Yes | Geographic region of the cluster |
| environment | string | Yes | Environment: staging, production, dr, development |
| tier | string | Yes | Importance: critical, standard, non-critical |
| healthCheckInterval | duration | No | Health check interval (default: 30s) |
| capabilities | []string | No | Enabled capabilities: monitoring, remediation, chaos-engineering |
| maxConcurrentRemediations | int | No | Maximum concurrent remediations (default: 3) |

Status Fields

| Field | Type | Description |
|---|---|---|
| connected | bool | Whether the cluster is accessible |
| lastHealthCheck | timestamp | Last successful health check |
| kubernetesVersion | string | Remote cluster Kubernetes version |
| nodeCount | int | Number of active nodes |
| namespaceCount | int | Number of namespaces |
| conditions | []Condition | Detailed conditions (Connected, RemediationCapable) |

How Federation Works

Kubeconfig Parsing

The controller reads the kubeconfig from the referenced Secret and creates a Kubernetes client configured for the remote cluster.
func (r *FederationReconciler) buildRemoteClient(
    ctx context.Context,
    reg *v1alpha1.ClusterRegistration,
) (kubernetes.Interface, error) {
    // 1. Fetch the Secret with the kubeconfig
    secret := &corev1.Secret{}
    err := r.client.Get(ctx, types.NamespacedName{
        Name:      reg.Spec.KubeconfigSecretRef.Name,
        Namespace: reg.Namespace,
    }, secret)
    if err != nil {
        return nil, fmt.Errorf("secret not found: %w", err)
    }

    // 2. Extract and parse the kubeconfig
    kubeconfigData := secret.Data[reg.Spec.KubeconfigSecretRef.Key]
    config, err := clientcmd.RESTConfigFromKubeConfig(kubeconfigData)
    if err != nil {
        return nil, fmt.Errorf("invalid kubeconfig: %w", err)
    }

    // 3. Create the client
    return kubernetes.NewForConfig(config)
}

Remote Client Cache

Clients are stored in a sync.Map for reuse, avoiding unnecessary reconnections:
type FederationManager struct {
    // clients maps cluster name -> kubernetes.Interface.
    // sync.Map already handles concurrent access, so no separate mutex is needed.
    clients sync.Map
}

func (fm *FederationManager) GetClient(clusterName string) (kubernetes.Interface, bool) {
    client, ok := fm.clients.Load(clusterName)
    if !ok {
        return nil, false
    }
    return client.(kubernetes.Interface), true
}

func (fm *FederationManager) RegisterClient(clusterName string, client kubernetes.Interface) {
    fm.clients.Store(clusterName, client)
}
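One subtlety: two reconciles can race to build a client for the same cluster. A hedged sketch using LoadOrStore keeps exactly one winner (a string type stands in for kubernetes.Interface so the example is self-contained; GetOrRegister is an illustrative name, not the operator's actual API):

```go
package main

import (
	"fmt"
	"sync"
)

// Client stands in for kubernetes.Interface to keep the sketch self-contained.
type Client string

type FederationManager struct {
	clients sync.Map // map[string]Client
}

// GetOrRegister returns the cached client for clusterName, or stores a
// freshly built one; if two goroutines race, only one client is kept.
func (fm *FederationManager) GetOrRegister(clusterName string, build func() Client) Client {
	if c, ok := fm.clients.Load(clusterName); ok {
		return c.(Client)
	}
	actual, _ := fm.clients.LoadOrStore(clusterName, build())
	return actual.(Client)
}

func main() {
	fm := &FederationManager{}
	a := fm.GetOrRegister("prod-us-east-1", func() Client { return Client("client-A") })
	b := fm.GetOrRegister("prod-us-east-1", func() Client { return Client("client-B") })
	fmt.Println(a, b) // prints "client-A client-A": the second build is discarded
}
```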

Health Check Loop

The controller executes periodic health checks on each registered cluster:

1. List Nodes: executes List Nodes on the remote cluster to verify connectivity and count active nodes.

   nodes, err := remoteClient.CoreV1().Nodes().List(ctx, metav1.ListOptions{})

2. List Namespaces: executes List Namespaces to count namespaces and verify RBAC permissions.

   namespaces, err := remoteClient.CoreV1().Namespaces().List(ctx, metav1.ListOptions{})

3. Update Status: updates the ClusterRegistration.Status with the results, including connected, nodeCount, namespaceCount, and kubernetesVersion.

4. Generate Metrics: exports Prometheus metrics with the cluster state.

Cross-Cluster Correlation

One of the most powerful federation features is the ability to correlate incidents across clusters.

Automatic Severity Elevation

When the same signalType is detected in 3 or more clusters within a time window, the platform automatically elevates severity to critical:
func (ce *CrossClusterCorrelator) Evaluate(issues []FederatedIssue) []Correlation {
    // Group by signalType
    bySignal := make(map[string][]FederatedIssue)
    for _, issue := range issues {
        bySignal[issue.SignalType] = append(bySignal[issue.SignalType], issue)
    }

    var correlations []Correlation
    for signalType, clusterIssues := range bySignal {
        uniqueClusters := countUniqueClusters(clusterIssues)
        if uniqueClusters >= 3 {
            correlations = append(correlations, Correlation{
                SignalType:       signalType,
                AffectedClusters: uniqueClusters,
                ElevateTo:        "critical",
                CorrelationID:    generateCorrelationID(),
            })
        }
    }
    return correlations
}
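The helpers countUniqueClusters and generateCorrelationID are not shown above. A minimal sketch of the first, with FederatedIssue trimmed to the one field it needs, could look like:

```go
package main

import "fmt"

// FederatedIssue is trimmed to the single field the helper needs.
type FederatedIssue struct {
	ClusterName string
}

// countUniqueClusters returns how many distinct clusters appear in issues,
// so duplicate issues from the same cluster never trigger a correlation.
func countUniqueClusters(issues []FederatedIssue) int {
	seen := make(map[string]struct{}, len(issues))
	for _, issue := range issues {
		seen[issue.ClusterName] = struct{}{}
	}
	return len(seen)
}

func main() {
	issues := []FederatedIssue{
		{ClusterName: "prod-us-east-1"},
		{ClusterName: "prod-eu-west-1"},
		{ClusterName: "prod-us-east-1"}, // duplicate: same cluster twice
		{ClusterName: "prod-ap-southeast-1"},
	}
	fmt.Println(countUniqueClusters(issues)) // prints 3
}
```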

CorrelationID Annotation

When incidents are correlated across clusters, they all receive the same correlationID annotation for traceability:
apiVersion: platform.chatcli.io/v1alpha1
kind: Issue
metadata:
  name: issue-oom-api-server
  namespace: production
  annotations:
    platform.chatcli.io/correlation-id: "cross-7f8a2b3c"
    platform.chatcli.io/correlated-clusters: "prod-us-east-1,prod-eu-west-1,prod-ap-southeast-1"
    platform.chatcli.io/elevated-severity: "true"
    platform.chatcli.io/original-severity: "medium"
spec:
  severity: critical  # Elevated from medium to critical
  signalType: OOMKilled
The correlationID allows operators to find all related incidents across all clusters. Label selectors cannot match annotations, so filter on the annotation with jq:
kubectl get issues -A -o json | \
  jq -r '.items[]
    | select(.metadata.annotations["platform.chatcli.io/correlation-id"] == "cross-7f8a2b3c")
    | "\(.metadata.namespace)/\(.metadata.name)"'

Cascade Detection

Cascade detection identifies when a problem in a lower-tier environment (staging) may be about to affect a higher-tier environment (production).

Staging to Production

func (cd *CascadeDetector) DetectStagingToProd(
    stagingIssues []FederatedIssue,
    prodIssues []FederatedIssue,
) []CascadeAlert {
    var alerts []CascadeAlert

    for _, staging := range stagingIssues {
        for _, prod := range prodIssues {
            // Same signalType AND same resourceKind
            if staging.SignalType == prod.SignalType &&
               staging.ResourceKind == prod.ResourceKind {
                // Staging occurred before production
                if staging.DetectedAt.Before(prod.DetectedAt) {
                    alerts = append(alerts, CascadeAlert{
                        SourceCluster: staging.ClusterName,
                        TargetCluster: prod.ClusterName,
                        SignalType:    staging.SignalType,
                        TimeDelta:     prod.DetectedAt.Sub(staging.DetectedAt),
                    })
                }
            }
        }
    }
    return alerts
}
When a cascade is detected, the annotation platform.chatcli.io/cascade-detected: true is added to the production Issue:
metadata:
  annotations:
    platform.chatcli.io/cascade-detected: "true"
    platform.chatcli.io/cascade-source: "staging-us-east-1"
    platform.chatcli.io/cascade-signal: "CrashLoopBackOff"
    platform.chatcli.io/cascade-delta: "15m"
Detected cascades automatically elevate the issue’s priority and add extra context to the LLM prompt, including the incident history from the source cluster. This allows the AI to recommend preventive actions based on what happened in staging.

Global Status Aggregation

The operator maintains an aggregated status of the entire federation, accessible via CRD and API:
apiVersion: platform.chatcli.io/v1alpha1
kind: FederationStatus
metadata:
  name: global-status
  namespace: chatcli-system
status:
  totalClusters: 5
  connectedClusters: 4
  disconnectedClusters:
    - name: dr-us-west-2
      lastSeen: "2026-03-19T12:00:00Z"
      reason: "Network timeout"
  totalActiveIssues: 12
  issuesBySeverity:
    critical: 1
    high: 3
    medium: 5
    low: 3
  issuesByCluster:
    prod-us-east-1: 4
    prod-eu-west-1: 3
    prod-ap-southeast-1: 2
    staging-us-east-1: 3
  crossClusterCorrelations: 2
  cascadesDetected: 1
  remediationsInProgress: 3
  lastUpdated: "2026-03-19T14:35:00Z"
Query the federation from the control cluster:

# Federation overview
kubectl get federationstatus global-status -n chatcli-system -o yaml

# List all registered clusters
kubectl get clusterregistrations -n chatcli-system

# See disconnected clusters
kubectl get clusterregistrations -n chatcli-system \
  -o jsonpath='{range .items[?(@.status.connected==false)]}{.metadata.name}{"\n"}{end}'

Remediation Policy per Tier

Each cluster has a remediation policy based on its tier, which controls the level of autonomy allowed by the Decision Engine.

Policy Definitions

| Tier | Severity | Policy | Justification |
|---|---|---|---|
| critical | Any | Manual with approval | Zero risk of automatic action on critical infra |
| standard | critical/high | Manual with approval | Conservatism for high severities |
| standard | medium/low | Auto-remediation | Automation for lower-impact problems |
| non-critical | Any | Auto-remediation | Maximum automation in dev/test environments |
func (pm *PolicyManager) GetRemediationPolicy(
    tier string,
    severity string,
) RemediationPolicy {
    switch tier {
    case "critical":
        // Critical clusters: ALWAYS manual
        return RemediationPolicy{
            Mode:            "manual",
            RequiresApproval: true,
            RequiredRole:    "Admin",
            Reason:          "Critical tier cluster: all remediations require approval",
        }

    case "standard":
        switch severity {
        case "critical", "high":
            return RemediationPolicy{
                Mode:            "manual",
                RequiresApproval: true,
                RequiredRole:    "Operator",
                Reason:          "High severity in standard cluster",
            }
        default: // medium, low
            return RemediationPolicy{
                Mode:            "auto",
                RequiresApproval: false,
                Reason:          "Medium/low severity in standard cluster",
            }
        }

    case "non-critical":
        // Dev/test: auto for everything
        return RemediationPolicy{
            Mode:            "auto",
            RequiresApproval: false,
            Reason:          "Non-critical cluster: auto-remediation for all severities",
        }
    }

    // Safe fallback
    return RemediationPolicy{
        Mode:            "manual",
        RequiresApproval: true,
    }
}
The per-tier policy is evaluated before the Decision Engine calculates confidence. If the tier requires manual approval, the confidence calculation is still performed (for logging and auditing), but the result does not change the decision.
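This ordering can be sketched as follows. It is a simplified stand-in for the Decision Engine, not the actual implementation; decide, getPolicy, and the 0.8 confidence threshold are illustrative assumptions:

```go
package main

import "fmt"

type RemediationPolicy struct {
	Mode             string
	RequiresApproval bool
}

// getPolicy condenses the per-tier table above (illustrative helper).
func getPolicy(tier, severity string) RemediationPolicy {
	switch {
	case tier == "critical":
		return RemediationPolicy{Mode: "manual", RequiresApproval: true}
	case tier == "standard" && (severity == "critical" || severity == "high"):
		return RemediationPolicy{Mode: "manual", RequiresApproval: true}
	case tier == "standard" || tier == "non-critical":
		return RemediationPolicy{Mode: "auto"}
	}
	return RemediationPolicy{Mode: "manual", RequiresApproval: true} // safe fallback
}

// decide shows the order: the tier policy gates first; confidence is
// still computed and logged, but cannot override a manual gate.
func decide(tier, severity string, confidence float64) string {
	policy := getPolicy(tier, severity)
	fmt.Printf("audit: tier=%s severity=%s confidence=%.2f\n", tier, severity, confidence)
	if policy.RequiresApproval {
		return "await-approval"
	}
	if confidence >= 0.8 { // illustrative threshold
		return "auto-remediate"
	}
	return "await-approval"
}

func main() {
	fmt.Println(decide("critical", "low", 0.99))    // await-approval: tier gate wins
	fmt.Println(decide("standard", "medium", 0.95)) // auto-remediate
}
```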

YAML Examples

Register a Production Cluster

First, create the Secret holding the cluster's kubeconfig; the matching ClusterRegistration for prod-us-east-1 is shown in the Complete Specification above.

apiVersion: v1
kind: Secret
metadata:
  name: cluster-prod-us-east-1-kubeconfig
  namespace: chatcli-system
type: Opaque
data:
  # Value must be the base64-encoded kubeconfig; comments are not
  # allowed inside the value itself
  kubeconfig: YXBpVmVyc2lvbjogdjEKa2lu...

Register a Staging Cluster

apiVersion: platform.chatcli.io/v1alpha1
kind: ClusterRegistration
metadata:
  name: staging-us-east-1
  namespace: chatcli-system
  labels:
    environment: staging
    region: us-east-1
spec:
  kubeconfigSecretRef:
    name: cluster-staging-us-east-1-kubeconfig
    key: kubeconfig
  region: us-east-1
  environment: staging
  tier: non-critical
  healthCheckInterval: 60s
  capabilities:
    - monitoring
    - remediation
    - chaos-engineering   # Chaos enabled only in staging
  maxConcurrentRemediations: 5

Complete Multi-Region Setup

# Production US
apiVersion: platform.chatcli.io/v1alpha1
kind: ClusterRegistration
metadata:
  name: prod-us-east-1
  namespace: chatcli-system
spec:
  kubeconfigSecretRef:
    name: kubeconfig-prod-us
    key: kubeconfig
  region: us-east-1
  environment: production
  tier: critical
  healthCheckInterval: 15s
  capabilities: [monitoring, remediation]
  maxConcurrentRemediations: 2
---
# Production EU
apiVersion: platform.chatcli.io/v1alpha1
kind: ClusterRegistration
metadata:
  name: prod-eu-west-1
  namespace: chatcli-system
spec:
  kubeconfigSecretRef:
    name: kubeconfig-prod-eu
    key: kubeconfig
  region: eu-west-1
  environment: production
  tier: critical
  healthCheckInterval: 15s
  capabilities: [monitoring, remediation]
  maxConcurrentRemediations: 2
---
# Production APAC
apiVersion: platform.chatcli.io/v1alpha1
kind: ClusterRegistration
metadata:
  name: prod-ap-southeast-1
  namespace: chatcli-system
spec:
  kubeconfigSecretRef:
    name: kubeconfig-prod-ap
    key: kubeconfig
  region: ap-southeast-1
  environment: production
  tier: critical
  healthCheckInterval: 15s
  capabilities: [monitoring, remediation]
  maxConcurrentRemediations: 2
---
# Staging (shared)
apiVersion: platform.chatcli.io/v1alpha1
kind: ClusterRegistration
metadata:
  name: staging-global
  namespace: chatcli-system
spec:
  kubeconfigSecretRef:
    name: kubeconfig-staging
    key: kubeconfig
  region: us-east-1
  environment: staging
  tier: non-critical
  healthCheckInterval: 60s
  capabilities: [monitoring, remediation, chaos-engineering]
  maxConcurrentRemediations: 10
---
# DR (Disaster Recovery)
apiVersion: platform.chatcli.io/v1alpha1
kind: ClusterRegistration
metadata:
  name: dr-us-west-2
  namespace: chatcli-system
spec:
  kubeconfigSecretRef:
    name: kubeconfig-dr
    key: kubeconfig
  region: us-west-2
  environment: dr
  tier: standard
  healthCheckInterval: 60s
  capabilities: [monitoring]
  maxConcurrentRemediations: 1

Federation Monitoring

Prometheus Metrics

| Metric | Type | Labels | Description |
|---|---|---|---|
| federation_clusters_total | Gauge | status | Total clusters by status (connected/disconnected) |
| federation_health_check_duration_seconds | Histogram | cluster | Health check time per cluster |
| federation_cluster_nodes | Gauge | cluster, region | Number of nodes per cluster |
| cross_cluster_issues_total | Counter | signal_type | Total cross-cluster correlated issues |
| cross_cluster_correlations_active | Gauge | - | Currently active correlations |
| cascade_detected_total | Counter | source_tier, target_tier | Total cascades detected |
| federation_remediation_policy_applied | Counter | tier, mode | Policies applied by tier and mode |
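As a sketch of how these metrics combine, two illustrative PromQL expressions (the metric names come from the table; the expressions themselves are assumptions, not shipped recording rules):

```promql
# Fraction of registered clusters currently connected (0.0 to 1.0)
sum(federation_clusters_total{status="connected"})
  / sum(federation_clusters_total)

# p95 health check latency per cluster over 5m windows
histogram_quantile(0.95,
  sum by (cluster, le) (rate(federation_health_check_duration_seconds_bucket[5m])))
```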
Grafana Dashboard

{
  "panels": [
    {
      "title": "Connected Clusters",
      "type": "stat",
      "targets": [{
        "expr": "federation_clusters_total{status='connected'}"
      }]
    },
    {
      "title": "Issues by Cluster",
      "type": "barchart",
      "targets": [{
        "expr": "sum by (cluster) (aiops_active_issues)"
      }]
    },
    {
      "title": "Cascades Detected (24h)",
      "type": "stat",
      "targets": [{
        "expr": "increase(cascade_detected_total[24h])"
      }]
    },
    {
      "title": "Health Check Latency",
      "type": "timeseries",
      "targets": [{
        "expr": "histogram_quantile(0.95, federation_health_check_duration_seconds_bucket)"
      }]
    }
  ]
}
Alerting Rules

groups:
  - name: federation
    rules:
      - alert: ClusterDisconnected
        expr: federation_clusters_total{status="disconnected"} > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Federated cluster disconnected"
          description: >
            {{ $value }} cluster(s) disconnected for more than 5 minutes.
            Check network connectivity and credentials.

      - alert: CrossClusterIncident
        expr: cross_cluster_correlations_active > 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Cross-cluster correlated incident"
          description: >
            {{ $value }} active cross-cluster correlation(s).
            Same problem detected in 3+ clusters.

      - alert: CascadeDetected
        expr: increase(cascade_detected_total[1h]) > 0
        labels:
          severity: warning
        annotations:
          summary: "Staging-to-production cascade detected"
          description: >
            Problem detected in staging is propagating to production.
            Check if the same deploy was applied in both environments.

Network Architecture

The control cluster needs network connectivity to the API server of each remote cluster. In restricted network environments, consider using a bastion host or dedicated VPN for management traffic.

Next Steps

Decision Engine

Understand how confidence is calculated and how per-tier policies affect decisions.

Chaos Engineering

Run chaos experiments on specific clusters with safety checks per tier.

Audit and Compliance

Complete audit trail of cross-cluster actions with correlationID.

AIOps Platform

Return to the AIOps platform overview.