Server Mode transforms ChatCLI into a high-performance gRPC service that can be accessed remotely from any terminal. This allows centralizing AI access on a server (bare-metal, VM, Docker, or Kubernetes) and connecting from anywhere.

Why Use Server Mode?

  • Centralization: A single server with configured API keys serves multiple clients
  • Security: API keys stay on the server, never exposed on client terminals
  • Flexibility: Clients can use their own credentials (API key or OAuth) if desired
  • Performance: Communication via gRPC with TLS support and progressive streaming
Server mode offers native integration with the K8s Watcher for Kubernetes deployment monitoring.

Starting the Server

1. Simplest mode

Server on the default port (50051):
chatcli server
2. With custom port and authentication

chatcli server --port 8080 --token my-secret-token
3. With TLS enabled

chatcli server --tls-cert cert.pem --tls-key key.pem
4. With integrated K8s Watcher (optional)

# Single-target (legacy)
chatcli server --watch-deployment myapp --watch-namespace production

# Multi-target + Prometheus metrics
chatcli server --watch-config targets.yaml
5. With provider fallback (optional)

chatcli server --fallback-providers OPENAI,CLAUDEAI,GOOGLEAI,ZAI,MINIMAX,MOONSHOT,OPENROUTER,COPILOT
6. With MCP (optional)

chatcli server --mcp-config ~/.chatcli/mcp_servers.json

Available Flags

| Flag | Description | Default | Env Var |
| --- | --- | --- | --- |
| --port | gRPC server port | 50051 | CHATCLI_SERVER_PORT |
| --token | Authentication token (empty = no auth) | "" | CHATCLI_SERVER_TOKEN |
| --tls-cert | TLS certificate file | "" | CHATCLI_SERVER_TLS_CERT |
| --tls-key | TLS key file | "" | CHATCLI_SERVER_TLS_KEY |
| --provider | Default LLM provider | Auto-detected | LLM_PROVIDER |
| --model | Default LLM model | Auto-detected | |
| --metrics-port | HTTP port for Prometheus metrics (0 = disable) | 9090 | CHATCLI_METRICS_PORT |

Fallback Flags (optional)

| Flag | Description | Default | Env Var |
| --- | --- | --- | --- |
| --fallback-providers | Comma-separated list of providers for failover | "" | CHATCLI_FALLBACK_PROVIDERS |
| --fallback-max-retries | Attempts per provider before advancing | 2 | CHATCLI_FALLBACK_MAX_RETRIES |
| --fallback-cooldown-base | Base cooldown after failure | 30s | CHATCLI_FALLBACK_COOLDOWN_BASE |
| --fallback-cooldown-max | Maximum cooldown (exponential backoff) | 5m | CHATCLI_FALLBACK_COOLDOWN_MAX |
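
The cooldown flags describe an exponential backoff between a base and a maximum. The following is a minimal sketch of how such a schedule could be computed, not the server's actual implementation. It assumes the delay doubles with each consecutive failure; the function name `cooldown` and the doubling factor are illustrative assumptions.

```python
from datetime import timedelta


def cooldown(consecutive_failures: int,
             base: timedelta = timedelta(seconds=30),
             maximum: timedelta = timedelta(minutes=5)) -> timedelta:
    """Return how long a failed provider is skipped before it is retried.

    Assumption: the delay doubles per consecutive failure and is capped
    at `maximum`. The real multiplier used by the server may differ.
    """
    if consecutive_failures < 1:
        return timedelta(0)
    # Cap the exponent so very long failure streaks stay within timedelta range.
    exponent = min(consecutive_failures - 1, 32)
    return min(base * (2 ** exponent), maximum)


if __name__ == "__main__":
    for failures in range(1, 7):
        print(failures, cooldown(failures))
```

With the default 30s base and 5m maximum, this schedule reaches the cap on the fifth consecutive failure.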

MCP Flag (optional)

| Flag | Description | Default | Env Var |
| --- | --- | --- | --- |
| --mcp-config | MCP configuration JSON file | "" | CHATCLI_MCP_CONFIG |

Prometheus Metrics

The server exposes Prometheus metrics at http://localhost:9090/metrics by default. Metrics include:
  • gRPC: chatcli_grpc_requests_total, chatcli_grpc_request_duration_seconds, chatcli_grpc_in_flight_requests
  • LLM: chatcli_llm_requests_total, chatcli_llm_request_duration_seconds, chatcli_llm_errors_total
  • Watcher: chatcli_watcher_collection_duration_seconds, chatcli_watcher_alerts_total, chatcli_watcher_pods_ready
  • Session: chatcli_session_active_total, chatcli_session_operations_total
  • Server: chatcli_server_uptime_seconds, chatcli_server_info
  • Go runtime: goroutines, memory, GC (via GoCollector/ProcessCollector)
To disable, use --metrics-port 0.
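
To inspect these metrics without a full Prometheus setup, the text exposition format can be read directly. The sketch below is a simplified parser that assumes samples carry no trailing timestamp; `parse_metrics`, `scrape`, and the sample values are illustrative and are not real server output.

```python
from urllib.request import urlopen


def parse_metrics(text: str) -> dict[str, float]:
    """Parse Prometheus text exposition lines into {series: value}.

    Simplification: each sample is assumed to be "<series> <value>"
    with no trailing timestamp. Comment lines (starting with '#') are skipped.
    """
    samples: dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        series, _, value = line.rpartition(" ")
        samples[series] = float(value)
    return samples


def scrape(url: str = "http://localhost:9090/metrics", timeout: float = 5.0) -> dict[str, float]:
    """Fetch and parse the metrics endpoint (requires a running server)."""
    with urlopen(url, timeout=timeout) as resp:
        return parse_metrics(resp.read().decode("utf-8"))


# Illustrative input only; the numbers are made up for the example.
SAMPLE = """\
# illustrative values, not real server output
chatcli_server_uptime_seconds 120
chatcli_grpc_in_flight_requests 2
"""

if __name__ == "__main__":
    print(parse_metrics(SAMPLE))
```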

Security Variables

| Env Var | Description | Default |
| --- | --- | --- |
| CHATCLI_GRPC_REFLECTION | Enables gRPC reflection for debugging. Requires BOTH the --grpc-reflection flag AND this variable set to true. Keep disabled in production. Configurable via Helm with server.grpcReflection. | false |
| CHATCLI_DISABLE_VERSION_CHECK | Disables automatic version check on startup. | false |
| CHATCLI_BIND_ADDRESS | Server bind address. Defaults to 127.0.0.1 (local); in Kubernetes, auto-detects via KUBERNETES_SERVICE_HOST and defaults to 0.0.0.0. Explicit value always takes precedence. | 127.0.0.1 / 0.0.0.0 (K8s) |
gRPC reflection now requires two conditions: the --grpc-reflection flag AND the CHATCLI_GRPC_REFLECTION=true variable. This prevents accidental exposure in production. See the security documentation for all hardening measures.
The default bind address is 127.0.0.1 (secure for local use). In Kubernetes, the server auto-detects the environment via KUBERNETES_SERVICE_HOST and automatically binds to 0.0.0.0 — no additional configuration needed. An explicit CHATCLI_BIND_ADDRESS value always takes precedence.
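
The precedence rules above can be summarized as a short decision procedure. This is a sketch of the documented behavior, not the server's code; the function names `resolve_bind_address` and `reflection_enabled` are hypothetical, and the case-insensitive comparison with "true" is an assumption.

```python
from typing import Mapping


def resolve_bind_address(env: Mapping[str, str]) -> str:
    """Pick the bind address: explicit value, then Kubernetes, then loopback."""
    explicit = env.get("CHATCLI_BIND_ADDRESS", "").strip()
    if explicit:
        return explicit  # an explicit value always takes precedence
    if env.get("KUBERNETES_SERVICE_HOST"):
        return "0.0.0.0"  # running inside a Kubernetes pod
    return "127.0.0.1"  # secure default for local use


def reflection_enabled(flag_set: bool, env: Mapping[str, str]) -> bool:
    """gRPC reflection needs BOTH the flag and the environment variable."""
    # Assumption: the variable is compared case-insensitively against "true".
    value = env.get("CHATCLI_GRPC_REFLECTION", "").strip().lower()
    return flag_set and value == "true"


if __name__ == "__main__":
    import os
    print(resolve_bind_address(os.environ))
    print(reflection_enabled(False, os.environ))
```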

K8s Watcher Flags (optional)

| Flag | Description | Default | Env Var |
| --- | --- | --- | --- |
| --watch-config | Multi-target YAML file | "" | CHATCLI_WATCH_CONFIG |
| --watch-deployment | Single deployment (legacy) | "" | CHATCLI_WATCH_DEPLOYMENT |
| --watch-namespace | Deployment namespace | "default" | CHATCLI_WATCH_NAMESPACE |
| --watch-interval | Collection interval | 30s | CHATCLI_WATCH_INTERVAL |
| --watch-window | Observation window | 2h | CHATCLI_WATCH_WINDOW |
| --watch-max-log-lines | Max log lines per pod | 100 | CHATCLI_WATCH_MAX_LOG_LINES |
| --watch-kubeconfig | Kubeconfig path | Auto-detected | CHATCLI_KUBECONFIG |
Use --watch-config to monitor multiple deployments simultaneously with Prometheus metrics. See K8s Watcher for the YAML file format.

Server Authentication

By default, the server does not require authentication. Any client can connect:
chatcli server  # no --token = open access

Credential Modes

The server supports multiple LLM credential modes, providing full flexibility:
The server uses its own API keys configured via environment variables:
export OPENAI_API_KEY=sk-xxx
export LLM_PROVIDER=OPENAI
chatcli server
No additional client configuration needed.
The client can send its own API key, which the server uses instead of its own:
chatcli connect server:50051 --llm-key sk-my-key --provider OPENAI
The client can use OAuth tokens from the local auth store (~/.chatcli/auth-profiles.json):
# First, log in with OAuth locally
/auth login anthropic

# Then, connect using local credentials
chatcli connect server:50051 --use-local-auth
For the StackSpot provider, send the complete credentials:
chatcli connect server:50051 --provider STACKSPOT \
  --client-id <id> --client-key <key> --realm <realm> --agent-id <agent>
To use GitHub Copilot, log in via Device Flow and connect with --use-local-auth:
# First, log in to GitHub Copilot
/auth login github-copilot

# Connect using local credentials
chatcli connect server:50051 --use-local-auth --provider COPILOT
For local models via Ollama, just provide the URL:
chatcli connect server:50051 --provider OLLAMA --ollama-url http://gpu-server:11434

gRPC Architecture

The server implements a gRPC service with the following RPCs:
| RPC | Description |
| --- | --- |
| SendPrompt | Sends a prompt and receives the complete response |
| StreamPrompt | Sends a prompt and receives the response in progressive chunks |
| InteractiveSession | Bidirectional streaming for interactive sessions |
| ListSessions | Lists sessions saved on the server |
| LoadSession | Loads a saved session |
| SaveSession | Saves the current session |
| Health | Server health check |
| GetServerInfo | Server information (version, provider, model, watcher) |
| GetWatcherStatus | K8s Watcher status (if active) |
| ListRemotePlugins | Lists plugins available on the server |
| ListRemoteAgents | Lists agents available on the server |
| ListRemoteSkills | Lists skills available on the server |
| GetAgentDefinition | Returns the complete content of an agent (markdown + frontmatter) |
| GetSkillContent | Returns the complete content of a skill |
| ExecuteRemotePlugin | Executes a plugin on the server and returns the result |
| DownloadPlugin | Streaming download of a plugin binary |
| GetAlerts | Returns active alerts from the K8s Watcher (used by the Operator) |
| AnalyzeIssue | Sends Issue context to the LLM and returns analysis + suggested actions |

gRPC with Multiple Replicas

gRPC uses persistent HTTP/2 connections that, by default, pin to a single pod via kube-proxy. For scenarios with multiple replicas in Kubernetes:
  • 1 replica: Standard ClusterIP Service — no extra configuration needed
  • Multiple replicas: Use a headless Service (clusterIP: None) so that DNS returns individual pod IPs, enabling client-side round-robin load balancing via the gRPC dns:/// resolver
  • The ChatCLI client already has built-in keepalive (ping every 10s) and round-robin support
  • In the Helm chart, enable service.headless: true when replicaCount > 1
  • In the Operator, headless is activated automatically when spec.replicas > 1
For more details, see the K8s Operator documentation and Helm deployment.
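
The ChatCLI client handles this internally, but a custom gRPC client talking to a headless Service needs equivalent settings. The sketch below only builds the target string and channel options; the helper name `round_robin_channel_args` and the host name are hypothetical, and the 10-second keepalive mirrors the value mentioned above.

```python
import json


def round_robin_channel_args(host: str, port: int = 50051, keepalive_seconds: int = 10):
    """Build a gRPC target and channel options for client-side round-robin.

    The dns:/// scheme makes the client resolve every pod IP behind the
    headless Service; the service config selects the round_robin policy.
    """
    target = f"dns:///{host}:{port}"
    service_config = {"loadBalancingConfig": [{"round_robin": {}}]}
    options = [
        ("grpc.service_config", json.dumps(service_config)),
        ("grpc.keepalive_time_ms", keepalive_seconds * 1000),
    ]
    return target, options


if __name__ == "__main__":
    # Hypothetical headless Service name, for illustration only.
    target, options = round_robin_channel_args("chatcli-headless.default.svc.cluster.local")
    print(target)
    print(options)
```

With the grpcio package installed, these values could be passed to `grpc.insecure_channel(target, options=options)` or to `grpc.secure_channel` when TLS is enabled.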

Progressive Streaming

The StreamPrompt RPC splits the response into ~200 character chunks at natural boundaries (paragraphs, lines, sentences), providing a progressive response experience on the client.
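
One way to picture this chunking is a splitter that prefers the strongest boundary near the size limit. The sketch below is an assumption about how such a splitter could work, not the server's algorithm: `split_chunks`, the boundary priority order, and the rule that a boundary must fall in the second half of the window are all illustrative choices.

```python
def split_chunks(text: str, target: int = 200) -> list[str]:
    """Split text into chunks of at most `target` characters.

    Boundaries are tried in priority order: paragraph, line, sentence, word.
    A boundary is only used if it falls in the second half of the window,
    so chunks stay close to the target size. Otherwise the text is cut hard.
    """
    boundaries = ("\n\n", "\n", ". ", "! ", "? ", " ")
    chunks: list[str] = []
    rest = text
    while len(rest) > target:
        window = rest[:target]
        cut = target  # fallback: hard cut at the size limit
        for sep in boundaries:
            idx = window.rfind(sep)
            if idx >= target // 2:
                cut = idx + len(sep)  # keep the separator with the chunk
                break
        chunks.append(rest[:cut])
        rest = rest[cut:]
    if rest:
        chunks.append(rest)
    return chunks


if __name__ == "__main__":
    demo = "Sentence number one. " * 30
    for chunk in split_chunks(demo):
        print(len(chunk), repr(chunk[:40]))
```

Because separators stay attached to the preceding chunk, concatenating the chunks reproduces the original response exactly.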

Resource Discovery RPCs

The ListRemotePlugins, ListRemoteAgents, ListRemoteSkills, GetAgentDefinition, GetSkillContent, ExecuteRemotePlugin, and DownloadPlugin RPCs allow connected clients to discover and use resources installed on the server.
  • Plugins: Executed on the server via ExecuteRemotePlugin or downloaded via DownloadPlugin (binary streaming)
  • Agents/Skills: Markdown content transferred to the client via GetAgentDefinition/GetSkillContent for local prompt composition

AIOps Platform RPCs

The GetAlerts and AnalyzeIssue RPCs are used by the AIOps Operator to feed the autonomous remediation pipeline.

GetAlerts

Returns active alerts detected by the K8s Watcher:
rpc GetAlerts(GetAlertsRequest) returns (GetAlertsResponse);

message GetAlertsRequest {
  string namespace = 1;     // Filter by namespace (empty = all)
  string deployment = 2;    // Filter by deployment (empty = all)
}

message AlertInfo {
  string alert_type = 1;    // HighRestartCount, OOMKilled, PodNotReady, DeploymentFailing
  string deployment = 2;
  string namespace = 3;
  string message = 4;
  string severity = 5;      // critical, warning
  int64 timestamp = 6;
}
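
The filter semantics of GetAlertsRequest (an empty field matches everything) can be illustrated with a small in-memory model. This is a sketch only: the Python `AlertInfo` class mirrors the proto fields, while `filter_alerts` and the sample alerts are hypothetical and not produced by a real server.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertInfo:
    """Mirror of the AlertInfo proto message fields."""
    alert_type: str
    deployment: str
    namespace: str
    message: str
    severity: str
    timestamp: int


def filter_alerts(alerts, namespace: str = "", deployment: str = ""):
    """Apply GetAlertsRequest-style filters: an empty string means 'all'."""
    return [
        a for a in alerts
        if (not namespace or a.namespace == namespace)
        and (not deployment or a.deployment == deployment)
    ]


# Illustrative data only.
ALERTS = [
    AlertInfo("OOMKilled", "myapp", "production", "container killed", "critical", 1700000000),
    AlertInfo("PodNotReady", "myapp", "staging", "pod not ready", "warning", 1700000060),
    AlertInfo("HighRestartCount", "api-gateway", "production", "restarts rising", "warning", 1700000120),
]

if __name__ == "__main__":
    for alert in filter_alerts(ALERTS, namespace="production"):
        print(alert.alert_type, alert.deployment, alert.severity)
```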

AnalyzeIssue

Sends Issue context to the LLM and returns structured analysis with suggested actions:
rpc AnalyzeIssue(AnalyzeIssueRequest) returns (AnalyzeIssueResponse);

message AnalyzeIssueRequest {
  string issue_name = 1;
  string namespace = 2;
  string resource_kind = 3;
  string resource_name = 4;
  string signal_type = 5;
  string severity = 6;
  string description = 7;
  int32 risk_score = 8;
  string provider = 9;
  string model = 10;
}

message SuggestedAction {
  string name = 1;
  string action = 2;
  string description = 3;
  map<string, string> params = 4;
}

message AnalyzeIssueResponse {
  string analysis = 1;
  float confidence = 2;     // 0.0-1.0
  repeated string recommendations = 3;
  string provider = 4;
  string model = 5;
  repeated SuggestedAction suggested_actions = 6;
}

REST API Gateway

In addition to gRPC, the operator exposes a REST HTTP API on port 8090 with:
  • 30+ endpoints covering incidents, SLOs, runbooks, approvals, postmortems, analytics, clusters and audit
  • Authentication via X-API-Key with role mapping (viewer/operator/admin)
  • Rate limiting at 100 req/min per key
  • Web Dashboard embedded and served at /
For the complete reference, see the API Reference.

Remote Commands via InteractiveSession

When connecting to a server via chatcli connect, the interactive session supports commands executed directly on the server:
| Command | Description |
| --- | --- |
| /status | Server information (version, provider, model, uptime) |
| /watcher status | K8s Watcher details (targets, snapshots, alerts) |
| /plugins list | Lists plugins available on the server |
| /agents list | Lists agents available on the server |
| /skills list | Lists skills available on the server |
These commands are processed by the server and return results via bidirectional gRPC streaming (InteractiveSession).

K8s Watcher Integration

When the server is started with --watch-config or --watch-deployment, the K8s Watcher continuously monitors deployments and automatically injects the Kubernetes context into all prompts from remote clients.
chatcli server --watch-deployment myapp --watch-namespace production
Any connected user can ask questions about the deployments without additional configuration:
Connected to ChatCLI server (version: 1.0.0, provider: OPENAI, model: gpt-4o)
K8s watcher active: 5 targets (interval: 30s)

> Which deployments need attention?
> Analyze the HTTP metrics of api-gateway

Rate Limiting

The server implements per-client rate limiting using a token bucket to protect against abuse:
| Variable | Description | Default |
| --- | --- | --- |
| CHATCLI_RATE_LIMIT_RPS | Requests per second per client | 10 |
| CHATCLI_RATE_LIMIT_BURST | Maximum allowed burst | 30 |
When the limit is reached, the server returns gRPC ResourceExhausted with a Retry-After header indicating how many seconds the client should wait.
export CHATCLI_RATE_LIMIT_RPS=20
export CHATCLI_RATE_LIMIT_BURST=50
chatcli server
In environments with multiple legitimate clients, increase the burst to accommodate usage spikes. RPS controls the sustained rate.
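
The interaction between RPS and burst is easiest to see in a token bucket model. The following is a minimal sketch of the general technique, not the server's limiter: the class name `TokenBucket`, the injectable clock, and the way the retry delay is computed are assumptions for illustration.

```python
import time


class TokenBucket:
    """Token bucket: `rate` tokens are added per second, up to `burst` tokens."""

    def __init__(self, rate: float = 10.0, burst: int = 30, clock=time.monotonic):
        self.rate = float(rate)
        self.capacity = float(burst)
        self.tokens = float(burst)  # start full so an initial burst is allowed
        self.clock = clock
        self.updated = clock()

    def allow(self):
        """Return (allowed, retry_after_seconds) for one request."""
        now = self.clock()
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True, 0.0
        # Time until one full token is available again.
        return False, (1.0 - self.tokens) / self.rate


if __name__ == "__main__":
    bucket = TokenBucket(rate=10, burst=30)
    results = [bucket.allow()[0] for _ in range(35)]
    print(results.count(True), "allowed,", results.count(False), "rejected")
```

A per-client limiter would keep one such bucket per client identity, for example in a dictionary keyed by token or peer address.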

SSRF Prevention

The server validates all URLs configured in provider_config before use, blocking:
  • Private IPs: 10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16
  • Cloud metadata: 169.254.169.254 (AWS, GCP, Azure)
  • Link-local: 169.254.0.0/16, fe80::/10
  • Loopback: 127.0.0.0/8, ::1
This prevents malicious LLM providers or misconfigurations from accessing internal network resources. Validation occurs before any HTTP request is sent.
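
A check of this kind can be sketched with the standard ipaddress module. This is an illustration of the technique under stated assumptions, not the server's validator: `is_blocked_ip`, `resolve_host`, and `url_is_allowed` are hypothetical names, and the handling of IPv4-mapped IPv6 addresses is an added precaution not described above.

```python
import ipaddress
import socket
from urllib.parse import urlsplit

# Ranges listed above; 169.254.169.254 is covered by the link-local range.
BLOCKED_NETWORKS = [ipaddress.ip_network(cidr) for cidr in (
    "10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16",  # private
    "169.254.0.0/16", "fe80::/10",                    # link-local, cloud metadata
    "127.0.0.0/8", "::1/128",                         # loopback
)]


def is_blocked_ip(address: str) -> bool:
    """Return True if the IP address falls in a blocked range."""
    ip = ipaddress.ip_address(address)
    # Unwrap IPv4-mapped IPv6 (e.g. ::ffff:127.0.0.1) before checking.
    if ip.version == 6 and ip.ipv4_mapped is not None:
        ip = ip.ipv4_mapped
    return any(ip in net for net in BLOCKED_NETWORKS)


def resolve_host(host: str) -> list[str]:
    """Resolve a host name to all of its IP addresses."""
    return sorted({info[4][0] for info in socket.getaddrinfo(host, None)})


def url_is_allowed(url: str, resolve=resolve_host) -> bool:
    """Validate a URL before any HTTP request is sent to it."""
    host = urlsplit(url).hostname
    if not host:
        return False
    try:
        addresses = [str(ipaddress.ip_address(host))]  # literal IP in the URL
    except ValueError:
        addresses = resolve(host)  # host name: check every resolved address
    return bool(addresses) and not any(is_blocked_ip(a) for a in addresses)


if __name__ == "__main__":
    for candidate in ("http://169.254.169.254/latest/meta-data", "https://8.8.8.8/v1"):
        print(candidate, url_is_allowed(candidate))
```

Checking every resolved address, rather than only the first, avoids a host name that resolves to both a public and an internal address slipping through.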

Message Size Limits

| Variable | Description | Default |
| --- | --- | --- |
| CHATCLI_MAX_RECV_MSG_SIZE | Maximum received message size | 50MB |
| CHATCLI_MAX_SEND_MSG_SIZE | Maximum sent message size | 50MB |
| CHATCLI_MAX_CONCURRENT_STREAMS | Concurrent streams per connection | 100 |
These limits protect against resource exhaustion attacks and ensure server stability under load.

Audit Logging

The server can generate audit logs in JSON-lines format for complete traceability:
| Variable | Description | Default |
| --- | --- | --- |
| CHATCLI_AUDIT_LOG_PATH | Audit log file path (empty = disabled) | "" |
Each request receives a unique Request ID for correlation. Recorded events include:
  • Authentication (success/failure)
  • Prompt and plugin execution
  • Session operations (save/load/delete)
  • Configuration changes
export CHATCLI_AUDIT_LOG_PATH=/var/log/chatcli/audit.jsonl
chatcli server
The JSON-lines format facilitates integration with tools like jq, Elasticsearch, Loki, and Splunk. Each line is an independent JSON object with timestamp, request ID, action, and result.
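
Because each line is a standalone JSON object, a few lines of code are enough to summarize an audit file. The sketch below assumes the keys are named `timestamp`, `request_id`, `action`, and `result`; the actual key names are not specified here, and the sample events are invented for illustration.

```python
import json
from collections import Counter


def summarize_audit(lines) -> Counter:
    """Count audit events by (action, result) from JSON-lines input."""
    counts: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        event = json.loads(line)
        # Key names are assumptions; adjust them to the real log schema.
        key = (event.get("action", "unknown"), event.get("result", "unknown"))
        counts[key] += 1
    return counts


# Invented sample events, not real server output.
SAMPLE = [
    '{"timestamp": "2025-01-01T00:00:00Z", "request_id": "req-1", "action": "auth", "result": "success"}',
    '{"timestamp": "2025-01-01T00:00:05Z", "request_id": "req-2", "action": "auth", "result": "failure"}',
    '{"timestamp": "2025-01-01T00:00:09Z", "request_id": "req-3", "action": "prompt", "result": "success"}',
]

if __name__ == "__main__":
    for (action, result), count in sorted(summarize_audit(SAMPLE).items()):
        print(action, result, count)
```

To process a real file, pass an open file handle, for example `summarize_audit(open("/var/log/chatcli/audit.jsonl"))`.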

Log Rotation

| Variable | Description | Default |
| --- | --- | --- |
| CHATCLI_LOG_FILE | Main log file path | "" (stdout) |
| CHATCLI_LOG_MAX_SIZE_MB | Maximum size before rotating | 100 |
| CHATCLI_LOG_MAX_BACKUPS | Number of old backups to keep | 5 |
| CHATCLI_LOG_MAX_AGE_DAYS | Maximum retention days | 30 |
| CHATCLI_LOG_COMPRESS | Compress backups with gzip | true |
export CHATCLI_LOG_FILE=/var/log/chatcli/server.log
export CHATCLI_LOG_MAX_SIZE_MB=50
export CHATCLI_LOG_MAX_BACKUPS=10
chatcli server

Environment Variables

All environment variables used by local ChatCLI also work on the server:
# Server
CHATCLI_SERVER_PORT=50051
CHATCLI_SERVER_TOKEN=my-token
CHATCLI_SERVER_TLS_CERT=/path/to/cert.pem
CHATCLI_SERVER_TLS_KEY=/path/to/key.pem
CHATCLI_BIND_ADDRESS=127.0.0.1  # Auto-detected as 0.0.0.0 in Kubernetes

# Security
CHATCLI_GRPC_REFLECTION=false
CHATCLI_DISABLE_VERSION_CHECK=false
CHATCLI_JWT_SECRET=my-secret-key
CHATCLI_JWT_ISSUER=chatcli-server
CHATCLI_JWT_AUDIENCE=chatcli-api

# Rate Limiting
CHATCLI_RATE_LIMIT_RPS=10
CHATCLI_RATE_LIMIT_BURST=30

# Message Limits
CHATCLI_MAX_RECV_MSG_SIZE=52428800
CHATCLI_MAX_SEND_MSG_SIZE=52428800
CHATCLI_MAX_CONCURRENT_STREAMS=100

# mTLS
CHATCLI_TLS_CLIENT_CERT=/path/to/client-cert.pem
CHATCLI_TLS_CLIENT_KEY=/path/to/client-key.pem

# Audit and Logs
CHATCLI_AUDIT_LOG_PATH=/var/log/chatcli/audit.jsonl
CHATCLI_LOG_FILE=/var/log/chatcli/server.log
CHATCLI_LOG_MAX_SIZE_MB=100

# LLM
LLM_PROVIDER=CLAUDEAI
ANTHROPIC_API_KEY=sk-ant-xxx
ANTHROPIC_MODEL=claude-sonnet-4-6

# K8s Watcher (optional)
CHATCLI_WATCH_DEPLOYMENT=myapp
CHATCLI_WATCH_NAMESPACE=production
CHATCLI_WATCH_INTERVAL=30s
CHATCLI_WATCH_WINDOW=2h
CHATCLI_WATCH_MAX_LOG_LINES=100

Next Steps

  • Remote Connection: Connect to the server remotely
  • K8s Watcher: Multi-target + Prometheus
  • K8s Operator: K8s Operator (AIOps)
  • Deploy: Deploy with Docker and Helm