## Why Use Server Mode?

- **Centralization**: A single server with configured API keys serves multiple clients
- **Security**: API keys stay on the server and are never exposed on client terminals
- **Flexibility**: Clients can use their own credentials (API key or OAuth) if desired
- **Performance**: Communication over gRPC with TLS support and progressive streaming
## Starting the Server
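As a sketch, assuming the server is launched via a `serve` subcommand (the exact invocation may differ in your build), a startup with token authentication and TLS could look like:

```shell
# Start the gRPC server with token auth and TLS enabled
chatcli serve \
  --port 50051 \
  --token "my-secret-token" \
  --tls-cert /etc/chatcli/tls.crt \
  --tls-key /etc/chatcli/tls.key
```

The equivalent environment variables (`CHATCLI_SERVER_PORT`, `CHATCLI_SERVER_TOKEN`, and so on) can be used instead of flags.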
### Available Flags

| Flag | Description | Default | Env Var |
|---|---|---|---|
| `--port` | gRPC server port | `50051` | `CHATCLI_SERVER_PORT` |
| `--token` | Authentication token (empty = no auth) | `""` | `CHATCLI_SERVER_TOKEN` |
| `--tls-cert` | TLS certificate file | `""` | `CHATCLI_SERVER_TLS_CERT` |
| `--tls-key` | TLS key file | `""` | `CHATCLI_SERVER_TLS_KEY` |
| `--provider` | Default LLM provider | Auto-detected | `LLM_PROVIDER` |
| `--model` | Default LLM model | Auto-detected | |
| `--metrics-port` | HTTP port for Prometheus metrics (`0` = disable) | `9090` | `CHATCLI_METRICS_PORT` |
### Fallback Flags (optional)

| Flag | Description | Default | Env Var |
|---|---|---|---|
| `--fallback-providers` | Comma-separated list of providers for failover | `""` | `CHATCLI_FALLBACK_PROVIDERS` |
| `--fallback-max-retries` | Attempts per provider before advancing | `2` | `CHATCLI_FALLBACK_MAX_RETRIES` |
| `--fallback-cooldown-base` | Base cooldown after failure | `30s` | `CHATCLI_FALLBACK_COOLDOWN_BASE` |
| `--fallback-cooldown-max` | Maximum cooldown (exponential backoff) | `5m` | `CHATCLI_FALLBACK_COOLDOWN_MAX` |
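For illustration, a failover chain across two providers (the provider names and `serve` subcommand here are placeholders; adjust to your setup) might be configured as:

```shell
# Try OPENAI first, then CLAUDEAI; two attempts each, with exponential cooldown
chatcli serve \
  --fallback-providers "OPENAI,CLAUDEAI" \
  --fallback-max-retries 2 \
  --fallback-cooldown-base 30s \
  --fallback-cooldown-max 5m
```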
### MCP Flag (optional)

| Flag | Description | Default | Env Var |
|---|---|---|---|
| `--mcp-config` | MCP configuration JSON file | `""` | `CHATCLI_MCP_CONFIG` |
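For example, assuming an MCP configuration file at a path of your choosing (the path and `serve` subcommand are illustrative):

```shell
# Load MCP tool servers from a JSON config at startup
chatcli serve --mcp-config ~/.chatcli/mcp.json

# Or via the environment variable
CHATCLI_MCP_CONFIG=~/.chatcli/mcp.json chatcli serve
```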
## Prometheus Metrics

The server exposes Prometheus metrics at `http://localhost:9090/metrics` by default. Metrics include:

- gRPC: `chatcli_grpc_requests_total`, `chatcli_grpc_request_duration_seconds`, `chatcli_grpc_in_flight_requests`
- LLM: `chatcli_llm_requests_total`, `chatcli_llm_request_duration_seconds`, `chatcli_llm_errors_total`
- Watcher: `chatcli_watcher_collection_duration_seconds`, `chatcli_watcher_alerts_total`, `chatcli_watcher_pods_ready`
- Session: `chatcli_session_active_total`, `chatcli_session_operations_total`
- Server: `chatcli_server_uptime_seconds`, `chatcli_server_info`
- Go runtime: goroutines, memory, GC (via GoCollector/ProcessCollector)

To disable the metrics endpoint, start the server with `--metrics-port 0`.
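A quick way to verify the endpoint (assuming the default metrics port) is to scrape it manually:

```shell
# Fetch all metrics and filter for the server uptime gauge
curl -s http://localhost:9090/metrics | grep chatcli_server_uptime_seconds
```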
### Security Variables

| Env Var | Description | Default |
|---|---|---|
| `CHATCLI_GRPC_REFLECTION` | Enables gRPC reflection for debugging. Keep disabled in production. | `false` |
| `CHATCLI_DISABLE_VERSION_CHECK` | Disables the automatic version check on startup. | `false` |
### K8s Watcher Flags (optional)

| Flag | Description | Default | Env Var |
|---|---|---|---|
| `--watch-config` | Multi-target YAML file | `""` | `CHATCLI_WATCH_CONFIG` |
| `--watch-deployment` | Single deployment (legacy) | `""` | `CHATCLI_WATCH_DEPLOYMENT` |
| `--watch-namespace` | Deployment namespace | `"default"` | `CHATCLI_WATCH_NAMESPACE` |
| `--watch-interval` | Collection interval | `30s` | `CHATCLI_WATCH_INTERVAL` |
| `--watch-window` | Observation window | `2h` | `CHATCLI_WATCH_WINDOW` |
| `--watch-max-log-lines` | Max log lines per pod | `100` | `CHATCLI_WATCH_MAX_LOG_LINES` |
| `--watch-kubeconfig` | Kubeconfig path | Auto-detected | `CHATCLI_KUBECONFIG` |
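A sketch of a single-deployment watcher with a tuned interval and window (the deployment and namespace names, and the `serve` subcommand, are placeholders):

```shell
# Watch one deployment, collecting every 15s over a 1h window
chatcli serve \
  --watch-deployment my-app \
  --watch-namespace production \
  --watch-interval 15s \
  --watch-window 1h \
  --watch-max-log-lines 200
```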
Use `--watch-config` to monitor multiple deployments simultaneously with Prometheus metrics. See K8s Watcher for the YAML file format.

## Server Authentication
- No Authentication
- With Token
- TLS (HTTPS)
By default, the server does not require authentication. Any client can connect:
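For example (the `serve` subcommand and the client-side `--server` flag are illustrative; check your build's help output):

```shell
# Server side: no --token, so no authentication is required
chatcli serve --port 50051

# Client side: connect directly, no credentials needed
chatcli --server localhost:50051
```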
## Credential Modes

The server supports multiple LLM credential modes, providing full flexibility:

### 1. Server Credentials (Default)

The server uses its own API keys configured via environment variables. No additional client configuration is needed.
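A sketch of server-side credentials, using `OPENAI_API_KEY` as an illustrative provider variable:

```shell
# Keys live only on the server host; clients never see them
export LLM_PROVIDER=OPENAI
export OPENAI_API_KEY="sk-..."
chatcli serve --port 50051
```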
### 2. Client Credentials (API Key)

The client can send its own API key, which the server uses instead of its own:
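Hypothetically, with a client-side flag such as `--api-key` (the actual flag name may differ; consult the client's help output):

```shell
# --api-key is illustrative; the server forwards requests using this key
chatcli --server localhost:50051 --provider OPENAI --api-key "sk-client-..."
```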
### 3. Client Credentials (Local OAuth)

The client can use OAuth tokens from the local auth store (`~/.chatcli/auth-profiles.json`):
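For example, reusing tokens from the local auth store (the `--server` flag is illustrative):

```shell
# Send OAuth tokens stored in ~/.chatcli/auth-profiles.json to the server
chatcli --server localhost:50051 --use-local-auth
```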
### 4. StackSpot Credentials

For the StackSpot provider, send the complete credentials:
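A sketch, assuming StackSpot credentials are passed via environment variables (the variable names here are illustrative and may differ in your setup):

```shell
# StackSpot requires a client ID, secret, and agent slug
export CLIENT_ID="your-client-id"
export CLIENT_SECRET="your-client-secret"
export SLUG_NAME="your-agent-slug"
chatcli --server localhost:50051 --provider STACKSPOT
```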
### 5. GitHub Copilot (Local OAuth)

To use GitHub Copilot, log in via Device Flow and connect with `--use-local-auth`:
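As an illustration (the login subcommand, provider name, and `--server` flag are placeholders; check your client's help output):

```shell
# One-time login via GitHub Device Flow (subcommand name is illustrative)
chatcli login github

# Then connect using the locally stored OAuth token
chatcli --server localhost:50051 --provider GITHUB_COPILOT --use-local-auth
```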
### 6. Ollama (No Credentials)

For local models via Ollama, just provide the URL:
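For example, pointing at a local Ollama instance (the environment variable name and `--server` flag are illustrative):

```shell
# Ollama needs no credentials, only a reachable endpoint
export OLLAMA_BASE_URL="http://localhost:11434"
chatcli --server localhost:50051 --provider OLLAMA
```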
## gRPC Architecture

The server implements a gRPC service with the following RPCs:

| RPC | Description |
|---|---|
| `SendPrompt` | Sends a prompt and receives the complete response |
| `StreamPrompt` | Sends a prompt and receives the response in progressive chunks |
| `InteractiveSession` | Bidirectional streaming for interactive sessions |
| `ListSessions` | Lists sessions saved on the server |
| `LoadSession` | Loads a saved session |
| `SaveSession` | Saves the current session |
| `Health` | Server health check |
| `GetServerInfo` | Server information (version, provider, model, watcher) |
| `GetWatcherStatus` | K8s Watcher status (if active) |
| `ListRemotePlugins` | Lists plugins available on the server |
| `ListRemoteAgents` | Lists agents available on the server |
| `ListRemoteSkills` | Lists skills available on the server |
| `GetAgentDefinition` | Returns the complete content of an agent (markdown + frontmatter) |
| `GetSkillContent` | Returns the complete content of a skill |
| `ExecuteRemotePlugin` | Executes a plugin on the server and returns the result |
| `DownloadPlugin` | Streaming download of a plugin binary |
| `GetAlerts` | Returns active alerts from the K8s Watcher (used by the Operator) |
| `AnalyzeIssue` | Sends Issue context to the LLM and returns analysis + suggested actions |
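With reflection enabled, the RPC surface can be inspected with `grpcurl` (development only; keep `CHATCLI_GRPC_REFLECTION=false` in production). The `serve` subcommand is illustrative:

```shell
# Enable reflection on the server (dev only)
CHATCLI_GRPC_REFLECTION=true chatcli serve --port 50051

# List exposed services and methods via reflection
grpcurl -plaintext localhost:50051 list
grpcurl -plaintext localhost:50051 describe
```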
## gRPC with Multiple Replicas

gRPC uses persistent HTTP/2 connections that, by default, pin to a single pod via kube-proxy. For scenarios with multiple replicas in Kubernetes:

- **1 replica**: Standard ClusterIP Service, no extra configuration needed
- **Multiple replicas**: Use a headless Service (`ClusterIP: None`) so that DNS returns individual pod IPs, enabling client-side round-robin load balancing via the gRPC `dns:///` resolver
- The ChatCLI client already has built-in keepalive (ping every 10s) and round-robin support
- In the Helm chart, enable `service.headless: true` when `replicaCount > 1`
- In the Operator, headless mode is activated automatically when `spec.replicas > 1`

For more details, see the K8s Operator documentation and Helm deployment.
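As a sketch, with a headless Service in place (the Service name, namespace, and client `--server` flag below are placeholders):

```shell
# Headless Service DNS returns one A record per pod
nslookup chatcli-server.default.svc.cluster.local

# Point the client at a dns:/// target so the resolver sees all pod IPs
chatcli --server dns:///chatcli-server.default.svc.cluster.local:50051
```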
## Progressive Streaming

The `StreamPrompt` RPC splits the response into ~200-character chunks at natural boundaries (paragraphs, lines, sentences), providing a progressive response experience on the client.
## Resource Discovery RPCs

The `ListRemotePlugins`, `ListRemoteAgents`, `ListRemoteSkills`, `GetAgentDefinition`, `GetSkillContent`, `ExecuteRemotePlugin`, and `DownloadPlugin` RPCs allow connected clients to discover and use resources installed on the server.

- **Plugins**: Executed on the server via `ExecuteRemotePlugin` or downloaded via `DownloadPlugin` (binary streaming)
- **Agents/Skills**: Markdown content transferred to the client via `GetAgentDefinition`/`GetSkillContent` for local prompt composition
## AIOps Platform RPCs

The `GetAlerts` and `AnalyzeIssue` RPCs are used by the AIOps Operator to feed the autonomous remediation pipeline.

### GetAlerts

Returns active alerts detected by the K8s Watcher.

### AnalyzeIssue

Sends Issue context to the LLM and returns structured analysis with suggested actions.

## K8s Watcher Integration

When the server is started with `--watch-config` or `--watch-deployment`, the K8s Watcher continuously monitors deployments and automatically injects the Kubernetes context into all prompts from remote clients.
- Single-Target (legacy)
- Multi-Target (recommended)
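The two modes can be contrasted as follows (the `serve` subcommand and the deployment/file names are placeholders):

```shell
# Single-target (legacy): one deployment selected via flags
chatcli serve --watch-deployment my-app --watch-namespace default

# Multi-target (recommended): several deployments declared in a YAML file
chatcli serve --watch-config ./watch-targets.yaml
```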