## Why Use Server Mode?

- **Centralization**: A single server with configured API keys serves multiple clients
- **Security**: API keys stay on the server and are never exposed on client terminals
- **Flexibility**: Clients can use their own credentials (API key or OAuth) if desired
- **Performance**: Communication over gRPC with TLS support and progressive streaming
## Starting the Server
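As a sketch, assuming the server is launched via a `serve` subcommand (the exact invocation may differ in your build), a startup with token authentication and TLS could look like:

```shell
# Start the gRPC server with token auth and TLS enabled
chatcli serve \
  --port 50051 \
  --token "my-secret-token" \
  --tls-cert /etc/chatcli/tls.crt \
  --tls-key /etc/chatcli/tls.key
```

The equivalent environment variables (`CHATCLI_SERVER_PORT`, `CHATCLI_SERVER_TOKEN`, and so on) can be used instead of flags.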
### Available Flags

| Flag | Description | Default | Env Var |
|---|---|---|---|
| `--port` | gRPC server port | `50051` | `CHATCLI_SERVER_PORT` |
| `--token` | Authentication token (empty = no auth) | `""` | `CHATCLI_SERVER_TOKEN` |
| `--tls-cert` | TLS certificate file | `""` | `CHATCLI_SERVER_TLS_CERT` |
| `--tls-key` | TLS key file | `""` | `CHATCLI_SERVER_TLS_KEY` |
| `--provider` | Default LLM provider | Auto-detected | `LLM_PROVIDER` |
| `--model` | Default LLM model | Auto-detected | |
| `--metrics-port` | HTTP port for Prometheus metrics (`0` = disable) | `9090` | `CHATCLI_METRICS_PORT` |
### Fallback Flags (optional)

| Flag | Description | Default | Env Var |
|---|---|---|---|
| `--fallback-providers` | Comma-separated list of providers for failover | `""` | `CHATCLI_FALLBACK_PROVIDERS` |
| `--fallback-max-retries` | Attempts per provider before advancing | `2` | `CHATCLI_FALLBACK_MAX_RETRIES` |
| `--fallback-cooldown-base` | Base cooldown after failure | `30s` | `CHATCLI_FALLBACK_COOLDOWN_BASE` |
| `--fallback-cooldown-max` | Maximum cooldown (exponential backoff) | `5m` | `CHATCLI_FALLBACK_COOLDOWN_MAX` |
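For illustration, a failover chain across two providers (the provider names and `serve` subcommand here are placeholders; adjust to your setup) might be configured as:

```shell
# Try OPENAI first, then CLAUDEAI; two attempts each, with exponential cooldown
chatcli serve \
  --fallback-providers "OPENAI,CLAUDEAI" \
  --fallback-max-retries 2 \
  --fallback-cooldown-base 30s \
  --fallback-cooldown-max 5m
```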
### MCP Flag (optional)

| Flag | Description | Default | Env Var |
|---|---|---|---|
| `--mcp-config` | MCP configuration JSON file | `""` | `CHATCLI_MCP_CONFIG` |
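For example, assuming an MCP configuration file at a path of your choosing (the path and `serve` subcommand are illustrative):

```shell
# Load MCP tool servers from a JSON config at startup
chatcli serve --mcp-config ~/.chatcli/mcp.json

# Or via the environment variable
CHATCLI_MCP_CONFIG=~/.chatcli/mcp.json chatcli serve
```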
## Prometheus Metrics

The server exposes Prometheus metrics at `http://localhost:9090/metrics` by default. Metrics include:

- gRPC: `chatcli_grpc_requests_total`, `chatcli_grpc_request_duration_seconds`, `chatcli_grpc_in_flight_requests`
- LLM: `chatcli_llm_requests_total`, `chatcli_llm_request_duration_seconds`, `chatcli_llm_errors_total`
- Watcher: `chatcli_watcher_collection_duration_seconds`, `chatcli_watcher_alerts_total`, `chatcli_watcher_pods_ready`
- Session: `chatcli_session_active_total`, `chatcli_session_operations_total`
- Server: `chatcli_server_uptime_seconds`, `chatcli_server_info`
- Go runtime: goroutines, memory, GC (via GoCollector/ProcessCollector)

To disable the metrics endpoint, start the server with `--metrics-port 0`.
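A quick way to verify the endpoint (assuming the default metrics port) is to scrape it manually:

```shell
# Fetch all metrics and filter for the server uptime gauge
curl -s http://localhost:9090/metrics | grep chatcli_server_uptime_seconds
```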
### Security Variables

| Env Var | Description | Default |
|---|---|---|
| `CHATCLI_GRPC_REFLECTION` | Enables gRPC reflection for debugging. Keep disabled in production. | `false` |
| `CHATCLI_DISABLE_VERSION_CHECK` | Disables the automatic version check on startup. | `false` |
### K8s Watcher Flags (optional)

| Flag | Description | Default | Env Var |
|---|---|---|---|
| `--watch-config` | Multi-target YAML file | `""` | `CHATCLI_WATCH_CONFIG` |
| `--watch-deployment` | Single deployment (legacy) | `""` | `CHATCLI_WATCH_DEPLOYMENT` |
| `--watch-namespace` | Deployment namespace | `"default"` | `CHATCLI_WATCH_NAMESPACE` |
| `--watch-interval` | Collection interval | `30s` | `CHATCLI_WATCH_INTERVAL` |
| `--watch-window` | Observation window | `2h` | `CHATCLI_WATCH_WINDOW` |
| `--watch-max-log-lines` | Max log lines per pod | `100` | `CHATCLI_WATCH_MAX_LOG_LINES` |
| `--watch-kubeconfig` | Kubeconfig path | Auto-detected | `CHATCLI_KUBECONFIG` |
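A sketch of a single-deployment watcher with a tuned interval and window (the deployment and namespace names, and the `serve` subcommand, are placeholders):

```shell
# Watch one deployment, collecting every 15s over a 1h window
chatcli serve \
  --watch-deployment my-app \
  --watch-namespace production \
  --watch-interval 15s \
  --watch-window 1h \
  --watch-max-log-lines 200
```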
Use `--watch-config` to monitor multiple deployments simultaneously with Prometheus metrics. See K8s Watcher for the YAML file format.

## Server Authentication
- No Authentication
- With Token
- TLS (HTTPS)
By default, the server does not require authentication. Any client can connect:
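For example (the `serve` subcommand and the client-side `--server` flag are illustrative; check your build's help output):

```shell
# Server side: no --token, so no authentication is required
chatcli serve --port 50051

# Client side: connect directly, no credentials needed
chatcli --server localhost:50051
```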
## Credential Modes

The server supports multiple LLM credential modes, providing full flexibility:

### 1. Server Credentials (Default)

The server uses its own API keys configured via environment variables. No additional client configuration is needed.
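A sketch of server-side credentials, using `OPENAI_API_KEY` as an illustrative provider variable:

```shell
# Keys live only on the server host; clients never see them
export LLM_PROVIDER=OPENAI
export OPENAI_API_KEY="sk-..."
chatcli serve --port 50051
```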
### 2. Client Credentials (API Key)

The client can send its own API key, which the server uses instead of its own:
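Hypothetically, with a client-side flag such as `--api-key` (the actual flag name may differ; consult the client's help output):

```shell
# --api-key is illustrative; the server forwards requests using this key
chatcli --server localhost:50051 --provider OPENAI --api-key "sk-client-..."
```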
### 3. Client Credentials (Local OAuth)

The client can use OAuth tokens from the local auth store (`~/.chatcli/auth-profiles.json`):
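For example, reusing tokens from the local auth store (the `--server` flag is illustrative):

```shell
# Send OAuth tokens stored in ~/.chatcli/auth-profiles.json to the server
chatcli --server localhost:50051 --use-local-auth
```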
### 4. StackSpot Credentials

For the StackSpot provider, send the complete credentials:
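A sketch, assuming StackSpot credentials are passed via environment variables (the variable names here are illustrative and may differ in your setup):

```shell
# StackSpot requires a client ID, secret, and agent slug
export CLIENT_ID="your-client-id"
export CLIENT_SECRET="your-client-secret"
export SLUG_NAME="your-agent-slug"
chatcli --server localhost:50051 --provider STACKSPOT
```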
### 5. GitHub Copilot (Local OAuth)

To use GitHub Copilot, log in via Device Flow and connect with `--use-local-auth`:
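As an illustration (the login subcommand, provider name, and `--server` flag are placeholders; check your client's help output):

```shell
# One-time login via GitHub Device Flow (subcommand name is illustrative)
chatcli login github

# Then connect using the locally stored OAuth token
chatcli --server localhost:50051 --provider GITHUB_COPILOT --use-local-auth
```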
### 6. Ollama (No Credentials)

For local models via Ollama, just provide the URL:
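For example, pointing at a local Ollama instance (the environment variable name and `--server` flag are illustrative):

```shell
# Ollama needs no credentials, only a reachable endpoint
export OLLAMA_BASE_URL="http://localhost:11434"
chatcli --server localhost:50051 --provider OLLAMA
```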
## gRPC Architecture

The server implements a gRPC service with the following RPCs:

| RPC | Description |
|---|---|
| `SendPrompt` | Sends a prompt and receives the complete response |
| `StreamPrompt` | Sends a prompt and receives the response in progressive chunks |
| `InteractiveSession` | Bidirectional streaming for interactive sessions |
| `ListSessions` | Lists sessions saved on the server |
| `LoadSession` | Loads a saved session |
| `SaveSession` | Saves the current session |
| `Health` | Server health check |
| `GetServerInfo` | Server information (version, provider, model, watcher) |
| `GetWatcherStatus` | K8s Watcher status (if active) |
| `ListRemotePlugins` | Lists plugins available on the server |
| `ListRemoteAgents` | Lists agents available on the server |
| `ListRemoteSkills` | Lists skills available on the server |
| `GetAgentDefinition` | Returns the complete content of an agent (markdown + frontmatter) |
| `GetSkillContent` | Returns the complete content of a skill |
| `ExecuteRemotePlugin` | Executes a plugin on the server and returns the result |
| `DownloadPlugin` | Streaming download of a plugin binary |
| `GetAlerts` | Returns active alerts from the K8s Watcher (used by the Operator) |
| `AnalyzeIssue` | Sends Issue context to the LLM and returns analysis + suggested actions |
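With reflection enabled, the RPC surface can be inspected with `grpcurl` (development only; keep `CHATCLI_GRPC_REFLECTION=false` in production). The `serve` subcommand is illustrative:

```shell
# Enable reflection on the server (dev only)
CHATCLI_GRPC_REFLECTION=true chatcli serve --port 50051

# List exposed services and methods via reflection
grpcurl -plaintext localhost:50051 list
grpcurl -plaintext localhost:50051 describe
```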
## gRPC with Multiple Replicas

gRPC uses persistent HTTP/2 connections that, by default, pin to a single pod via kube-proxy. For scenarios with multiple replicas in Kubernetes:

- **1 replica**: Standard ClusterIP Service, no extra configuration needed
- **Multiple replicas**: Use a headless Service (`ClusterIP: None`) so that DNS returns individual pod IPs, enabling client-side round-robin load balancing via the gRPC `dns:///` resolver
- The ChatCLI client already has built-in keepalive (ping every 10s) and round-robin support
- In the Helm chart, enable `service.headless: true` when `replicaCount > 1`
- In the Operator, headless mode is activated automatically when `spec.replicas > 1`

For more details, see the K8s Operator documentation and Helm deployment.
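As a sketch, with a headless Service in place (the Service name, namespace, and client `--server` flag below are placeholders):

```shell
# Headless Service DNS returns one A record per pod
nslookup chatcli-server.default.svc.cluster.local

# Point the client at a dns:/// target so the resolver sees all pod IPs
chatcli --server dns:///chatcli-server.default.svc.cluster.local:50051
```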
## Progressive Streaming

The `StreamPrompt` RPC splits the response into ~200-character chunks at natural boundaries (paragraphs, lines, sentences), providing a progressive response experience on the client.
## Resource Discovery RPCs

The `ListRemotePlugins`, `ListRemoteAgents`, `ListRemoteSkills`, `GetAgentDefinition`, `GetSkillContent`, `ExecuteRemotePlugin`, and `DownloadPlugin` RPCs allow connected clients to discover and use resources installed on the server.

- **Plugins**: Executed on the server via `ExecuteRemotePlugin` or downloaded via `DownloadPlugin` (binary streaming)
- **Agents/Skills**: Markdown content transferred to the client via `GetAgentDefinition`/`GetSkillContent` for local prompt composition
## AIOps Platform RPCs

The `GetAlerts` and `AnalyzeIssue` RPCs are used by the AIOps Operator to feed the autonomous remediation pipeline.

### GetAlerts

Returns active alerts detected by the K8s Watcher.

### AnalyzeIssue

Sends Issue context to the LLM and returns structured analysis with suggested actions.

## K8s Watcher Integration

When the server is started with `--watch-config` or `--watch-deployment`, the K8s Watcher continuously monitors deployments and automatically injects the Kubernetes context into all prompts from remote clients.
- Single-Target (legacy)
- Multi-Target (recommended)
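The two modes can be contrasted as follows (the `serve` subcommand and the deployment/file names are placeholders):

```shell
# Single-target (legacy): one deployment selected via flags
chatcli serve --watch-deployment my-app --watch-namespace default

# Multi-target (recommended): several deployments declared in a YAML file
chatcli serve --watch-config ./watch-targets.yaml
```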