Why Use Server Mode?
- Centralization: A single server with configured API keys serves multiple clients
- Security: API keys stay on the server, never exposed on client terminals
- Flexibility: Clients can use their own credentials (API key or OAuth) if desired
- Performance: Communication via gRPC with TLS support and progressive streaming
Starting the Server
Available Flags
| Flag | Description | Default | Env Var |
|---|---|---|---|
| `--port` | gRPC server port | 50051 | `CHATCLI_SERVER_PORT` |
| `--token` | Authentication token (empty = no auth) | "" | `CHATCLI_SERVER_TOKEN` |
| `--tls-cert` | TLS certificate file | "" | `CHATCLI_SERVER_TLS_CERT` |
| `--tls-key` | TLS key file | "" | `CHATCLI_SERVER_TLS_KEY` |
| `--provider` | Default LLM provider | Auto-detected | `LLM_PROVIDER` |
| `--model` | Default LLM model | Auto-detected | |
| `--metrics-port` | HTTP port for Prometheus metrics (0 = disable) | 9090 | `CHATCLI_METRICS_PORT` |
Fallback Flags (optional)
| Flag | Description | Default | Env Var |
|---|---|---|---|
| `--fallback-providers` | Comma-separated list of providers for failover | "" | `CHATCLI_FALLBACK_PROVIDERS` |
| `--fallback-max-retries` | Attempts per provider before advancing | 2 | `CHATCLI_FALLBACK_MAX_RETRIES` |
| `--fallback-cooldown-base` | Base cooldown after failure | 30s | `CHATCLI_FALLBACK_COOLDOWN_BASE` |
| `--fallback-cooldown-max` | Maximum cooldown (exponential backoff) | 5m | `CHATCLI_FALLBACK_COOLDOWN_MAX` |
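
The two cooldown flags describe an exponential backoff bounded by a maximum. The sketch below shows one plausible schedule using the documented defaults (30s base, 5m cap). The doubling factor and the name `cooldown_seconds` are assumptions made for illustration; the server's actual schedule may differ.

```python
def cooldown_seconds(failures: int, base: float = 30.0, maximum: float = 300.0) -> float:
    """Cooldown after `failures` consecutive failures (assumed doubling schedule)."""
    if failures < 1:
        return 0.0
    # Cap the exponent so very large failure counts cannot overflow the float.
    exponent = min(failures - 1, 32)
    return min(base * (2 ** exponent), maximum)


# Cooldowns for the first six consecutive failures of one provider.
schedule = [cooldown_seconds(n) for n in range(1, 7)]
print(schedule)  # [30.0, 60.0, 120.0, 240.0, 300.0, 300.0]
```

Under this assumption the cap is reached on the fifth consecutive failure, after which the provider is retried at most once every five minutes.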
MCP Flag (optional)
| Flag | Description | Default | Env Var |
|---|---|---|---|
| `--mcp-config` | MCP configuration JSON file | "" | `CHATCLI_MCP_CONFIG` |
Prometheus Metrics
The server exposes Prometheus metrics at `http://localhost:9090/metrics` by default. Metrics include:
- gRPC: `chatcli_grpc_requests_total`, `chatcli_grpc_request_duration_seconds`, `chatcli_grpc_in_flight_requests`
- LLM: `chatcli_llm_requests_total`, `chatcli_llm_request_duration_seconds`, `chatcli_llm_errors_total`
- Watcher: `chatcli_watcher_collection_duration_seconds`, `chatcli_watcher_alerts_total`, `chatcli_watcher_pods_ready`
- Session: `chatcli_session_active_total`, `chatcli_session_operations_total`
- Server: `chatcli_server_uptime_seconds`, `chatcli_server_info`
- Go runtime: goroutines, memory, GC (via GoCollector/ProcessCollector)
To disable metrics, set `--metrics-port 0`.
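
As a quick way to check which ChatCLI metrics an endpoint exposes, the sketch below filters a Prometheus text-format payload for names with the `chatcli_` prefix. The sample payload, its values, and the `method` label are made up for illustration; only the metric names come from the list above. The helper name `chatcli_metric_names` is hypothetical.

```python
def chatcli_metric_names(exposition: str) -> list[str]:
    """Return the sorted, de-duplicated chatcli_* sample names in a text-format payload."""
    names = set()
    for line in exposition.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        # The sample name ends at the first "{" (labels) or at the first whitespace.
        name = line.split("{", 1)[0].split(None, 1)[0]
        if name.startswith("chatcli_"):
            names.add(name)
    return sorted(names)


# Illustrative payload only; real output comes from the /metrics endpoint.
SAMPLE = """\
# TYPE chatcli_server_uptime_seconds gauge
chatcli_server_uptime_seconds 12.5
chatcli_grpc_requests_total{method="SendPrompt"} 3
go_goroutines 18
"""

print(chatcli_metric_names(SAMPLE))
```

Against a running server, the payload could be fetched with `urllib.request.urlopen("http://localhost:9090/metrics").read().decode()` and passed to the same function.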
Security Variables
| Env Var | Description | Default |
|---|---|---|
| `CHATCLI_GRPC_REFLECTION` | Enables gRPC reflection for debugging. Requires both the `--grpc-reflection` flag and this variable set to `true`. Keep disabled in production. Configurable via Helm with `server.grpcReflection`. | false |
| `CHATCLI_DISABLE_VERSION_CHECK` | Disables automatic version check on startup. | false |
| `CHATCLI_BIND_ADDRESS` | Server bind address. Defaults to `127.0.0.1` (local); in Kubernetes, the server detects the environment via `KUBERNETES_SERVICE_HOST` and defaults to `0.0.0.0`. An explicit value always takes precedence. | 127.0.0.1 / 0.0.0.0 (K8s) |
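
The precedence described for `CHATCLI_BIND_ADDRESS` can be sketched as a small function. This is only an illustration of the documented order (explicit value, then Kubernetes detection, then the local default), not the server's actual code; `resolve_bind_address` is a hypothetical name.

```python
def resolve_bind_address(env: dict) -> str:
    """Pick the bind address from an environment mapping (e.g. os.environ)."""
    explicit = env.get("CHATCLI_BIND_ADDRESS", "").strip()
    if explicit:
        # An explicit value always wins.
        return explicit
    if env.get("KUBERNETES_SERVICE_HOST"):
        # Set inside Kubernetes pods: listen on all interfaces.
        return "0.0.0.0"
    # Local default: loopback only.
    return "127.0.0.1"


print(resolve_bind_address({}))
print(resolve_bind_address({"KUBERNETES_SERVICE_HOST": "10.96.0.1"}))
print(resolve_bind_address({"KUBERNETES_SERVICE_HOST": "10.96.0.1",
                            "CHATCLI_BIND_ADDRESS": "192.0.2.10"}))
```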
The default bind address is `127.0.0.1` (secure for local use). In Kubernetes, the server auto-detects the environment via `KUBERNETES_SERVICE_HOST` and automatically binds to `0.0.0.0`, so no additional configuration is needed. An explicit `CHATCLI_BIND_ADDRESS` value always takes precedence.
K8s Watcher Flags (optional)
| Flag | Description | Default | Env Var |
|---|---|---|---|
| `--watch-config` | Multi-target YAML file | "" | `CHATCLI_WATCH_CONFIG` |
| `--watch-deployment` | Single deployment (legacy) | "" | `CHATCLI_WATCH_DEPLOYMENT` |
| `--watch-namespace` | Deployment namespace | "default" | `CHATCLI_WATCH_NAMESPACE` |
| `--watch-interval` | Collection interval | 30s | `CHATCLI_WATCH_INTERVAL` |
| `--watch-window` | Observation window | 2h | `CHATCLI_WATCH_WINDOW` |
| `--watch-max-log-lines` | Max log lines per pod | 100 | `CHATCLI_WATCH_MAX_LOG_LINES` |
| `--watch-kubeconfig` | Kubeconfig path | Auto-detected | `CHATCLI_KUBECONFIG` |
Use `--watch-config` to monitor multiple deployments simultaneously with Prometheus metrics. See K8s Watcher for the YAML file format.
Server Authentication
- No Authentication
- With Token
- JWT with RBAC
- TLS (HTTPS)
By default, the server does not require authentication. Any client can connect:
Credential Modes
The server supports multiple LLM credential modes:
1. Server Credentials (Default)
The server uses its own API keys, configured via environment variables:
No additional client configuration is needed.
2. Client Credentials (API Key)
The client can send its own API key, which the server uses instead of its own:
3. Client Credentials (Local OAuth)
The client can use OAuth tokens from the local auth store (`~/.chatcli/auth-profiles.json`):
4. StackSpot Credentials
For the StackSpot provider, send the complete credentials:
5. GitHub Copilot (Local OAuth)
To use GitHub Copilot, log in via Device Flow and connect with `--use-local-auth`:
6. Ollama (No Credentials)
For local models via Ollama, just provide the URL:
gRPC Architecture
The server implements a gRPC service with the following RPCs:
| RPC | Description |
|---|---|
| `SendPrompt` | Sends a prompt and receives the complete response |
| `StreamPrompt` | Sends a prompt and receives the response in progressive chunks |
| `InteractiveSession` | Bidirectional streaming for interactive sessions |
| `ListSessions` | Lists sessions saved on the server |
| `LoadSession` | Loads a saved session |
| `SaveSession` | Saves the current session |
| `Health` | Server health check |
| `GetServerInfo` | Server information (version, provider, model, watcher) |
| `GetWatcherStatus` | K8s Watcher status (if active) |
| `ListRemotePlugins` | Lists plugins available on the server |
| `ListRemoteAgents` | Lists agents available on the server |
| `ListRemoteSkills` | Lists skills available on the server |
| `GetAgentDefinition` | Returns the complete content of an agent (markdown + frontmatter) |
| `GetSkillContent` | Returns the complete content of a skill |
| `ExecuteRemotePlugin` | Executes a plugin on the server and returns the result |
| `DownloadPlugin` | Streaming download of a plugin binary |
| `GetAlerts` | Returns active alerts from the K8s Watcher (used by the Operator) |
| `AnalyzeIssue` | Sends Issue context to the LLM and returns analysis + suggested actions |
gRPC with Multiple Replicas
gRPC uses persistent HTTP/2 connections that, by default, pin to a single pod via kube-proxy. For scenarios with multiple replicas in Kubernetes:
- 1 replica: Standard ClusterIP Service; no extra configuration needed
- Multiple replicas: Use a headless Service (`clusterIP: None`) so that DNS returns individual pod IPs, enabling client-side round-robin load balancing via the gRPC `dns:///` resolver
- The ChatCLI client already has built-in keepalive (ping every 10s) and round-robin support
- In the Helm chart, enable `service.headless: true` when `replicaCount > 1`
- In the Operator, headless is activated automatically when `spec.replicas > 1`
For more details, see the K8s Operator documentation and Helm deployment.
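
The ChatCLI client handles this itself, but a custom gRPC client pointed at a headless Service needs equivalent settings. The sketch below builds the target string and channel arguments for such a client using standard gRPC channel options. The host name is a placeholder and `round_robin_channel_args` is a hypothetical helper, not part of ChatCLI.

```python
import json


def round_robin_channel_args(host: str, port: int = 50051, keepalive_s: int = 10):
    """Build a dns:/// target and channel options for client-side round-robin."""
    # dns:/// makes the gRPC resolver return every pod IP behind the headless Service.
    target = f"dns:///{host}:{port}"
    service_config = json.dumps({"loadBalancingConfig": [{"round_robin": {}}]})
    options = [
        ("grpc.service_config", service_config),         # spread calls across resolved addresses
        ("grpc.keepalive_time_ms", keepalive_s * 1000),  # keepalive ping interval
    ]
    return target, options


# Placeholder Service name; substitute the headless Service of your deployment.
target, options = round_robin_channel_args("chatcli-headless.default.svc.cluster.local")
print(target)
# With grpcio installed, the values would be passed as:
#   grpc.secure_channel(target, grpc.ssl_channel_credentials(), options=options)
```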
Progressive Streaming
The `StreamPrompt` RPC splits the response into ~200 character chunks at natural boundaries (paragraphs, lines, sentences), providing a progressive response experience on the client.
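
To make the chunking behaviour concrete, here is a minimal sketch of boundary-aware splitting. It assumes a simple preference order (paragraph break, line break, sentence end, then any space) inside a 200-character window; the server's real heuristics are not documented here and may differ, and `split_chunks` is a hypothetical name.

```python
def split_chunks(text: str, target: int = 200) -> list[str]:
    """Split text into chunks of at most `target` characters at natural boundaries."""
    boundaries = ("\n\n", "\n", ". ", " ")  # assumed preference order
    chunks = []
    rest = text
    while len(rest) > target:
        window = rest[:target]
        cut = -1
        for sep in boundaries:
            pos = window.rfind(sep)
            if pos > 0:
                cut = pos + len(sep)  # keep the separator with the chunk it ends
                break
        if cut == -1:
            cut = target  # no boundary in the window: hard cut
        chunks.append(rest[:cut])
        rest = rest[cut:]
    if rest:
        chunks.append(rest)
    return chunks


demo = "Sentence one is here. " * 30
for chunk in split_chunks(demo):
    print(len(chunk), repr(chunk[-12:]))
```

Concatenating the chunks always reproduces the original text, so a client can append them to the display as they arrive. One known weakness of this simple version is that an early paragraph break produces a short chunk.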
Resource Discovery RPCs
The `ListRemotePlugins`, `ListRemoteAgents`, `ListRemoteSkills`, `GetAgentDefinition`, `GetSkillContent`, `ExecuteRemotePlugin`, and `DownloadPlugin` RPCs allow connected clients to discover and use resources installed on the server.
- Plugins: Executed on the server via `ExecuteRemotePlugin` or downloaded via `DownloadPlugin` (binary streaming)
- Agents/Skills: Markdown content transferred to the client via `GetAgentDefinition`/`GetSkillContent` for local prompt composition
AIOps Platform RPCs
The `GetAlerts` and `AnalyzeIssue` RPCs are used by the AIOps Operator to feed the autonomous remediation pipeline.
GetAlerts
Returns active alerts detected by the K8s Watcher:
AnalyzeIssue
Sends Issue context to the LLM and returns a structured analysis with suggested actions:
REST API Gateway
In addition to gRPC, the operator exposes a REST HTTP API on port `:8090` with:
- 30+ endpoints covering incidents, SLOs, runbooks, approvals, postmortems, analytics, clusters, and audit
- Authentication via `X-API-Key` with role mapping (viewer/operator/admin)
- Rate limiting at 100 req/min per key
- Web Dashboard embedded and served at `/`
Remote Commands via InteractiveSession
When connecting to a server via `chatcli connect`, the interactive session supports commands executed directly on the server:
| Command | Description |
|---|---|
| `/status` | Server information (version, provider, model, uptime) |
| `/watcher status` | K8s Watcher details (targets, snapshots, alerts) |
| `/plugins list` | Lists plugins available on the server |
| `/agents list` | Lists agents available on the server |
| `/skills list` | Lists skills available on the server |
These commands run on the server (via `InteractiveSession`).
K8s Watcher Integration
When the server is started with `--watch-config` or `--watch-deployment`, the K8s Watcher continuously monitors deployments and automatically injects the Kubernetes context into all prompts from remote clients.
- Single-Target (legacy)
- Multi-Target (recommended)
Rate Limiting
The server implements per-client rate limiting using a token bucket to protect against abuse:
| Variable | Description | Default |
|---|---|---|
| `CHATCLI_RATE_LIMIT_RPS` | Requests per second per client | 10 |
| `CHATCLI_RATE_LIMIT_BURST` | Maximum allowed burst | 30 |
Requests that exceed the limit are rejected with `ResourceExhausted` and a `Retry-After` header indicating how many seconds the client should wait.
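
The token bucket technique named above can be sketched in a few lines. The class below uses the documented defaults (10 requests per second, burst of 30) and returns a suggested wait time when a request is refused. It is a generic illustration of a token bucket, not the server's implementation; the class name and the injectable clock are assumptions made so the example runs deterministically.

```python
import time


class TokenBucket:
    """Single-client token bucket: refills at `rate` tokens/s up to `burst` tokens."""

    def __init__(self, rate: float = 10.0, burst: int = 30, now=time.monotonic):
        self.rate = rate
        self.capacity = float(burst)
        self.tokens = float(burst)  # start full so an initial burst is admitted
        self._now = now
        self.updated = now()

    def allow(self) -> tuple[bool, float]:
        """Return (admitted, seconds_to_wait); the wait is 0.0 when admitted."""
        now = self._now()
        # Refill in proportion to the elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True, 0.0
        return False, (1.0 - self.tokens) / self.rate


# Deterministic demo with a fake clock instead of real time.
clock = [0.0]
bucket = TokenBucket(rate=10, burst=30, now=lambda: clock[0])
decisions = [bucket.allow() for _ in range(31)]
print(sum(ok for ok, _ in decisions))  # 30 requests admitted from the initial burst
clock[0] += 0.5                        # half a second later, 5 tokens have refilled
print(bucket.allow())
```

A per-client limiter would keep one such bucket per client identity, for example in a dictionary keyed by token or peer address.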
SSRF Prevention
The server validates all URLs configured in `provider_config` before use, blocking:
- Private IPs: `10.0.0.0/8`, `172.16.0.0/12`, `192.168.0.0/16`
- Cloud metadata: `169.254.169.254` (AWS, GCP, Azure)
- Link-local: `169.254.0.0/16`, `fe80::/10`
- Loopback: `127.0.0.0/8`, `::1`
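
The blocked ranges listed above map directly onto a check with Python's `ipaddress` module. The sketch below only handles URLs whose host is an IP literal; a complete validator would also resolve host names and check every resolved address. The function names are hypothetical, and this is not the server's actual validation code.

```python
import ipaddress
from urllib.parse import urlsplit

# Ranges from the list above; 169.254.169.254 is covered by 169.254.0.0/16.
BLOCKED_NETWORKS = [ipaddress.ip_network(n) for n in (
    "10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16",  # private
    "169.254.0.0/16", "fe80::/10",                    # link-local and cloud metadata
    "127.0.0.0/8", "::1/128",                         # loopback
)]


def is_blocked_ip(address: str) -> bool:
    """True if the IP literal falls inside any blocked range."""
    ip = ipaddress.ip_address(address)
    return any(ip.version == net.version and ip in net for net in BLOCKED_NETWORKS)


def is_blocked_url(url: str) -> bool:
    """True if the URL has no host or its host is a blocked IP literal."""
    host = urlsplit(url).hostname
    if host is None:
        return True  # no usable host: reject
    try:
        return is_blocked_ip(host)
    except ValueError:
        # Host name, not an IP literal: DNS resolution is out of scope for this sketch.
        return False


for url in ("http://169.254.169.254/latest/meta-data",
            "http://[::1]:11434",
            "https://8.8.8.8/v1"):
    print(url, is_blocked_url(url))
```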
Message Size Limits
| Variable | Description | Default |
|---|---|---|
| `CHATCLI_MAX_RECV_MSG_SIZE` | Maximum received message size | 50MB |
| `CHATCLI_MAX_SEND_MSG_SIZE` | Maximum sent message size | 50MB |
| `CHATCLI_MAX_CONCURRENT_STREAMS` | Concurrent streams per connection | 100 |
Audit Logging
The server can generate audit logs in JSON-lines format for complete traceability:
| Variable | Description | Default |
|---|---|---|
| `CHATCLI_AUDIT_LOG_PATH` | Audit log file path (empty = disabled) | "" |
Audited events include:
- Authentication (success/failure)
- Prompt and plugin execution
- Session operations (save/load/delete)
- Configuration changes
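
Because the audit log is JSON-lines, it can be processed one line at a time with a few lines of code. The sketch below counts entries by action for a given result. The field names (`timestamp`, `request_id`, `action`, `result`), their values, and the sample entries are assumptions for illustration; check a real log file for the actual keys.

```python
import json
from collections import Counter


def count_by_action(lines, result: str = "failure") -> Counter:
    """Count audit entries with the given result, grouped by action."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        entry = json.loads(line)  # each line is an independent JSON object
        if entry.get("result") == result:
            counts[entry.get("action", "unknown")] += 1
    return counts


# Hypothetical entries; real field names and values may differ.
sample = [
    '{"timestamp": "2025-01-01T00:00:00Z", "request_id": "r1", "action": "auth", "result": "failure"}',
    '{"timestamp": "2025-01-01T00:00:01Z", "request_id": "r2", "action": "auth", "result": "success"}',
    '{"timestamp": "2025-01-01T00:00:02Z", "request_id": "r3", "action": "auth", "result": "failure"}',
    '{"timestamp": "2025-01-01T00:00:03Z", "request_id": "r4", "action": "session.save", "result": "success"}',
]
print(count_by_action(sample))
```

For a real file, pass an open file handle instead of the list: `count_by_action(open(path, encoding="utf-8"))`, where `path` is the value of `CHATCLI_AUDIT_LOG_PATH`.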
The JSON-lines format facilitates integration with tools like `jq`, Elasticsearch, Loki, and Splunk. Each line is an independent JSON object with timestamp, request ID, action, and result.
Log Rotation
| Variable | Description | Default |
|---|---|---|
| `CHATCLI_LOG_FILE` | Main log file path | "" (stdout) |
| `CHATCLI_LOG_MAX_SIZE_MB` | Maximum size before rotating | 100 |
| `CHATCLI_LOG_MAX_BACKUPS` | Number of old backups to keep | 5 |
| `CHATCLI_LOG_MAX_AGE_DAYS` | Maximum retention days | 30 |
| `CHATCLI_LOG_COMPRESS` | Compress backups with gzip | true |
Environment Variables
All environment variables used by local ChatCLI also work on the server:
Next Steps
- Remote Connection: Connect to the server remotely
- K8s Watcher: Multi-target + Prometheus
- K8s Operator: K8s Operator (AIOps)
- Deploy: Deploy with Docker and Helm