#4 RAG + HyDE — Semantic Retrieval

HyDE (Hypothetical Document Embeddings) is the classic technique of generating a hypothetical answer to the user’s question and using that answer as an additional retrieval signal. In ChatCLI, HyDE operates in two complementary phases: 3a expands keywords via LLM hypothesis, 3b adds vector cosine search.

HyDE is opt-in (CHATCLI_QUALITY_HYDE_ENABLED=true), so the default configuration incurs no additional cost. Phase 3a costs one extra cheap LLM call; Phase 3b requires configuring an embedding provider.

The problem HyDE solves

Before this pipeline, memory.Fact retrieval was keyword-only: the scorer matches tokens extracted from recent messages against the tags and content of stored facts. That works well when the vocabulary matches exactly, and fails when the user uses synonyms or asks abstract questions. An example of the gap:

Without HyDE

User: how to do X in Go?
Extracted keywords: [do, go]
Stored fact: "use goroutines for concurrency in X pipelines"
Match: ❌ (“do” and “go” don’t literally appear in the fact)

With HyDE 3a

User: how to do X in Go?
LLM hypothesis: "In Go you would use goroutines and channels for X, leveraging the concurrency primitives..."
Expanded keywords: [do, go, goroutines, channels, concurrency, primitives]
Match: ✅

Phase 3a — Hypothesis-based keyword expansion

User types query

The query enters cli_llm.go or agent_mode.go.

HyDEAugmenter.Augment

augmenter := memory.NewHyDEAugmenter(cfg, llmCallback, logger)
expanded := augmenter.Augment(ctx, query, originalHints)

LLM generates short hypothesis

Prompt: “Write a 2-4 sentence plausible answer that uses the technical nouns that would appear in any matching note. Bilingual if the query mixes languages.”

ExtractKeywords from the hypothesis

The same extractor already used in chat mode (en+pt stop words, min 3 chars).

Merge unique + lower-case

Original keywords + top-N from hypothesis, cap configurable via CHATCLI_QUALITY_HYDE_NUM_KEYWORDS (default 5).

FactIndex.Search uses the expanded set

Existing keyword-based scorer operates over richer hints → much higher recall.

Phase 3a works without configuring an embedding provider. It’s the recommended default if the cost of +1 light LLM call is acceptable.
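The extract-and-merge steps above can be sketched in a few lines of Go. This is a minimal illustration, not the ChatCLI implementation: `extractKeywords`, `mergeHints`, and the tiny stop-word list are hypothetical names invented here, and "top-N" is approximated as "first N unseen keywords in order of appearance", which is an assumption about how the real extractor ranks terms.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// stopWords is a tiny illustrative subset (assumption); the real extractor
// is described as using English and Portuguese stop-word lists.
var stopWords = map[string]bool{
	"the": true, "and": true, "for": true, "you": true,
	"would": true, "use": true, "how": true,
}

// extractKeywords lower-cases text, splits on non-alphanumeric runes, and
// drops stop words, duplicates, and tokens shorter than minLen bytes.
// Order of first appearance is preserved.
func extractKeywords(text string, minLen int) []string {
	tokens := strings.FieldsFunc(strings.ToLower(text), func(r rune) bool {
		return !unicode.IsLetter(r) && !unicode.IsDigit(r)
	})
	seen := map[string]bool{}
	var out []string
	for _, t := range tokens {
		if len(t) < minLen || stopWords[t] || seen[t] {
			continue
		}
		seen[t] = true
		out = append(out, t)
	}
	return out
}

// mergeHints keeps every original hint (lower-cased, de-duplicated) and then
// appends at most maxNew hypothesis keywords that are not already present.
func mergeHints(original, hypothesis []string, maxNew int) []string {
	seen := map[string]bool{}
	var out []string
	for _, k := range original {
		k = strings.ToLower(k)
		if !seen[k] {
			seen[k] = true
			out = append(out, k)
		}
	}
	added := 0
	for _, k := range hypothesis {
		if added >= maxNew {
			break
		}
		k = strings.ToLower(k)
		if !seen[k] {
			seen[k] = true
			out = append(out, k)
			added++
		}
	}
	return out
}

func main() {
	hypothesis := "In Go you would use goroutines and channels for X, leveraging the concurrency primitives"
	hints := mergeHints([]string{"do", "go"}, extractKeywords(hypothesis, 3), 5)
	fmt.Println(hints)
}
```

The cap is applied only to the new keywords, so the original hints are never displaced by the hypothesis.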

Phase 3b — Vector embeddings + blended ranking

Adds cosine similarity search over fact embeddings — and, since the ranking refactor, fuses the cosine score with the lexical and temporal signals into a single ranking.

What changed (PR #1027): previously the vector hits were dissolved back into keywords and the cosine score was discarded, so you paid for an embedding call and then threw away the semantic signal it produced. Now the cosine score flows straight into the SearchBlended ranker, so a fact matched only by paraphrase (zero lexical overlap) can still rank.

Architecture

┌──────────────────┐
│ User query       │
└────────┬─────────┘
         │  Phase 3a: LLM hypothesis → expanded keywords
         ▼
┌──────────────────────────┐
│ EmbeddingProvider.Embed  │  (Voyage / OpenAI / Bedrock / Null)
└────────┬─────────────────┘
         │  vector float32
         ▼
┌──────────────────────────────────────┐
│ vindex.Index.Search(q, k, floor)     │  cosine top-K, O(n log k), pure-Go
└────────┬─────────────────────────────┘
         │  []Hit{ID, Score}   ← the cosine score is PRESERVED
         ▼
┌────────────────────────────────────────────────┐
│ FactIndex.SearchBlended(keywords, semantic, w)  │
│   final = wSem·cosine + wLex·keyword            │
│         + wTemp·recency    (min-max normalized)  │
└────────┬───────────────────────────────────────┘
         │
         ▼
    Facts ranked by FUSED relevance

Blended ranking (`SearchBlended`)

Three independent, complementary signals, each min-max normalized across the candidate set before the weighted sum:

Signal	Source	Captures
semantic	cosine from the vector store	synonymy, paraphrase
lexical	keyword/tag overlap	exact terms, identifiers, file names
temporal	recency decay × access frequency	what the user actually uses

Default weights (DefaultRankWeights): semantic 0.55 · lexical 0.30 · temporal 0.15 — semantic-first, because the embedding call was already paid for. Fusion is additive (not multiplicative) on purpose: a fact with high cosine and zero keyword overlap still ranks — a product would zero it out. Min-max normalization is what makes the weights provider-agnostic: Voyage 1024-d and OpenAI 1536-d cosine both land in [0,1] after normalizing.

Cosine floor (MinCosineScore, default 0.25): embeddings over normalized text are almost always weakly positive, so the old > 0 cutoff admitted near-orthogonal noise. The floor keeps only genuinely related facts in the top-K.
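To make the additive fusion concrete, here is a small Go sketch of min-max normalization followed by the weighted sum. It assumes a simplified candidate shape (`Candidate`, `Ranked`, `blend`, and `minMax` are illustrative names, not the actual SearchBlended signature), and it assumes that a signal with no spread across candidates normalizes to zero.

```go
package main

import (
	"fmt"
	"sort"
)

// Weights holds the per-signal weights; main uses the documented defaults.
type Weights struct{ Sem, Lex, Temp float64 }

// Candidate carries the raw, un-normalized signals for one fact.
type Candidate struct {
	ID     string
	Cosine float64 // semantic signal from the vector search
	Lex    float64 // keyword/tag overlap score
	Temp   float64 // recency/access signal
}

// Ranked is one fact with its fused score.
type Ranked struct {
	ID    string
	Score float64
}

// minMax rescales xs into [0,1]. A column with no spread maps to all zeros
// (an assumption of this sketch).
func minMax(xs []float64) []float64 {
	out := make([]float64, len(xs))
	if len(xs) == 0 {
		return out
	}
	lo, hi := xs[0], xs[0]
	for _, x := range xs {
		if x < lo {
			lo = x
		}
		if x > hi {
			hi = x
		}
	}
	if hi == lo {
		return out
	}
	for i, x := range xs {
		out[i] = (x - lo) / (hi - lo)
	}
	return out
}

// blend normalizes each signal across the candidate set, applies the
// additive weighted sum, and returns candidates sorted by fused score.
func blend(cands []Candidate, w Weights) []Ranked {
	cos := make([]float64, len(cands))
	lex := make([]float64, len(cands))
	tmp := make([]float64, len(cands))
	for i, c := range cands {
		cos[i], lex[i], tmp[i] = c.Cosine, c.Lex, c.Temp
	}
	cos, lex, tmp = minMax(cos), minMax(lex), minMax(tmp)
	out := make([]Ranked, len(cands))
	for i, c := range cands {
		out[i] = Ranked{ID: c.ID, Score: w.Sem*cos[i] + w.Lex*lex[i] + w.Temp*tmp[i]}
	}
	sort.SliceStable(out, func(i, j int) bool { return out[i].Score > out[j].Score })
	return out
}

func main() {
	cands := []Candidate{
		{ID: "paraphrase-only", Cosine: 0.82, Lex: 0, Temp: 0.1},
		{ID: "keyword-heavy", Cosine: 0.30, Lex: 3, Temp: 0.5},
		{ID: "recent-but-weak", Cosine: 0.26, Lex: 1, Temp: 0.9},
	}
	for _, r := range blend(cands, Weights{Sem: 0.55, Lex: 0.30, Temp: 0.15}) {
		fmt.Printf("%-16s %.3f\n", r.ID, r.Score)
	}
}
```

In this toy set the paraphrase-only fact has zero keyword overlap yet ranks first, which is the behavior a multiplicative fusion would lose.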

Supported providers

Voyage (recommended)
OpenAI
Bedrock (Titan + Cohere)
Null (default)

export CHATCLI_QUALITY_HYDE_ENABLED=true
export CHATCLI_QUALITY_HYDE_USE_VECTORS=true
export CHATCLI_EMBED_PROVIDER=voyage
export VOYAGE_API_KEY=pa-...
# Optional: pick model (default voyage-3, 1024-dim)
export CHATCLI_EMBED_MODEL=voyage-3

Why Voyage: it’s the provider recommended by Anthropic for Claude workflows, offers high quality/$ for retrieval, and the default model (voyage-3, 1024-dim) is the general-purpose sweet spot.

export CHATCLI_QUALITY_HYDE_ENABLED=true
export CHATCLI_QUALITY_HYDE_USE_VECTORS=true
export CHATCLI_EMBED_PROVIDER=openai
export OPENAI_API_KEY=sk-...
# Optional: pick model and custom dimension
export CHATCLI_EMBED_MODEL=text-embedding-3-small
export CHATCLI_EMBED_DIMENSIONS=512   # can reduce to save storage

Models: text-embedding-3-small (default, 1536-dim, $0.02/1M tokens) or text-embedding-3-large (3072-dim, higher quality). The native dimension is auto-detected from the model — only set CHATCLI_EMBED_DIMENSIONS if you want to truncate via Matryoshka (e.g. 512 to save storage).

export CHATCLI_QUALITY_HYDE_ENABLED=true
export CHATCLI_QUALITY_HYDE_USE_VECTORS=true
export CHATCLI_EMBED_PROVIDER=bedrock

# Reuses the chat client's credentials chain — no extra API key.
# Set region/profile if not already in the environment.
export BEDROCK_REGION=us-east-1
export AWS_PROFILE=my-profile        # optional

# Optional: pick model
export CHATCLI_EMBED_MODEL=amazon.titan-embed-text-v2:0   # default
export CHATCLI_EMBED_DIMENSIONS=1024                      # Titan v2: 256/512/1024

Why Bedrock: same AWS credentials chain as the chat, which means single billing, AWS compliance (CloudTrail, guardrails), and VPC endpoints. Ideal for companies already running chat through Bedrock and wanting retrieval inside the same perimeter.

Supported families:

Model ID	Family	Supported dims	Native batching
`amazon.titan-embed-text-v2:0` (default)	Titan v2	256, 512, 1024	No — parallelized with 8-worker pool
`amazon.titan-embed-text-v1`	Titan v1	1536 (fixed)	No — parallelized
`cohere.embed-english-v3`	Cohere v3 (en)	1024 (fixed)	Yes — one call per batch
`cohere.embed-multilingual-v3`	Cohere v3 (multi)	1024 (fixed)	Yes — one call per batch

Dispatch between Titan and Cohere is automatic by the model id prefix. CHATCLI_EMBED_DIMENSIONS only applies to Titan v2 (256/512/1024); invalid values are rejected in NewBedrock before any AWS call.

No Converse API for embeddings: embeddings only live in InvokeModel, so each family keeps its dedicated body schema, unlike chat, where Converse covers nearly everything.
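The prefix dispatch and up-front dimension check could look roughly like the following. This is a sketch under stated assumptions: `resolveFamily` and `resolveDims` are hypothetical helpers, not the actual NewBedrock code, and treating an unset dimension (0) as 1024 for Titan v2 is an assumption based on the model's native size.

```go
package main

import (
	"fmt"
	"strings"
)

type family int

const (
	titanV1 family = iota
	titanV2
	cohereV3
)

// resolveFamily picks the request-body family from the model id prefix.
func resolveFamily(modelID string) (family, error) {
	switch {
	case strings.HasPrefix(modelID, "amazon.titan-embed-text-v2"):
		return titanV2, nil
	case strings.HasPrefix(modelID, "amazon.titan-embed-text-v1"):
		return titanV1, nil
	case strings.HasPrefix(modelID, "cohere.embed-"):
		return cohereV3, nil
	}
	return 0, fmt.Errorf("unsupported embedding model %q", modelID)
}

// resolveDims returns the effective output dimension for a model.
// requested == 0 means "not set". Only Titan v2 honors a custom value;
// fixed-dimension families ignore it.
func resolveDims(modelID string, requested int) (int, error) {
	fam, err := resolveFamily(modelID)
	if err != nil {
		return 0, err
	}
	switch fam {
	case titanV2:
		switch requested {
		case 0:
			return 1024, nil // assumed default when unset
		case 256, 512, 1024:
			return requested, nil
		}
		return 0, fmt.Errorf("titan v2 supports 256, 512 or 1024 dimensions, got %d", requested)
	case titanV1:
		return 1536, nil
	default: // cohereV3
		return 1024, nil
	}
}

func main() {
	for _, tc := range []struct {
		model string
		dims  int
	}{
		{"amazon.titan-embed-text-v2:0", 512},
		{"amazon.titan-embed-text-v2:0", 300},
		{"cohere.embed-multilingual-v3", 512},
	} {
		d, err := resolveDims(tc.model, tc.dims)
		fmt.Println(tc.model, tc.dims, "->", d, err)
	}
}
```

Validating before the first network call means a typo in the dimension surfaces as a local configuration error rather than a remote API failure.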

Hot-swap via /reload: changing CHATCLI_EMBED_PROVIDER / CHATCLI_EMBED_MODEL / CHATCLI_EMBED_DIMENSIONS in .env and running /reload rebuilds the provider on the spot — rewiring /context --rag semantic retrieval and the memory vector index without restarting ChatCLI. The index auto-migrates when provider/dimension change.

See AWS Bedrock for the full credentials chain config (SSO, IAM role, corporate proxy, VPC endpoint).

If CHATCLI_EMBED_PROVIDER isn’t set, NullProvider is returned. Embed() returns an error — the code checks embedding.IsNull(p) before calling, so it gracefully falls back to keyword-only. No surprises, no silent failures.
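The null-object pattern described here can be illustrated with a local re-creation. Everything below is an assumption made for the example: the `Provider` interface shape, `nullProvider`, `isNull`, `embedQuery`, and `retrievalMode` are stand-ins written for this sketch, not the real `embedding` package API.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// Provider is a hypothetical, minimal embedding interface.
type Provider interface {
	Name() string
	Embed(ctx context.Context, texts []string) ([][]float32, error)
}

// nullProvider is returned when no provider is configured. Embed always
// fails, so callers are expected to check isNull first.
type nullProvider struct{}

func (nullProvider) Name() string { return "null" }
func (nullProvider) Embed(context.Context, []string) ([][]float32, error) {
	return nil, errors.New("embedding: no provider configured")
}

// fakeProvider stands in for a real backend in this demo.
type fakeProvider struct{}

func (fakeProvider) Name() string { return "fake" }
func (fakeProvider) Embed(_ context.Context, texts []string) ([][]float32, error) {
	out := make([][]float32, len(texts))
	for i := range texts {
		out[i] = []float32{1, 0, 0}
	}
	return out, nil
}

// isNull reports whether p is absent or the null provider.
func isNull(p Provider) bool {
	if p == nil {
		return true
	}
	_, ok := p.(nullProvider)
	return ok
}

// embedQuery returns the query vector, or ok=false when the caller should
// fall back to keyword-only retrieval (null provider or embedding error).
func embedQuery(ctx context.Context, p Provider, query string) ([]float32, bool) {
	if isNull(p) {
		return nil, false
	}
	vecs, err := p.Embed(ctx, []string{query})
	if err != nil || len(vecs) != 1 {
		return nil, false
	}
	return vecs[0], true
}

// retrievalMode names the path a query would take with provider p.
func retrievalMode(p Provider) string {
	if _, ok := embedQuery(context.Background(), p, "how to do X in Go?"); ok {
		return "blended"
	}
	return "keyword-only"
}

func main() {
	fmt.Println(retrievalMode(nullProvider{})) // no provider configured
	fmt.Println(retrievalMode(fakeProvider{})) // provider available
}
```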

Pure-Go vector store — generic `vindex` primitive

No CGO, no SQLite-vec, no external deps. Just []float32 + cosine + JSON persistence in ~/.chatcli/memory/vector_index.json.

The cosine index lives in a generic, reusable package, llm/embedding/vindex, extracted once a second consumer (the semantic /context retrieval) appeared. memory is just a thin adapter over it — no vector machinery duplicated per package:

// llm/embedding/vindex — the single, agnostic primitive
type Index struct { /* provider, dim, entries map[string][]float32, path */ }
func (x *Index) Upsert(ctx context.Context, items map[string]string) error
func (x *Index) Search(query []float32, k int, minScore float64) []Hit  // top-K via min-heap
func (x *Index) Prune(keep map[string]struct{})                         // never serve stale vectors

// cli/workspace/memory/vector_store.go — fact-domain adapter
type VectorIndex struct { idx *vindex.Index; provider embedding.Provider }
type ScoredFact struct { ID string; Score float64 }

Top-K selection uses a min-heap, O(n log k), instead of a full O(n log n) sort, so the selection overhead grows with log k rather than log n while the scan itself stays linear in corpus size. The scale ceiling is therefore no longer “hundreds of facts”. For the typical chatcli case a linear scan completes in microseconds, so no HNSW or IVFFlat index is needed.
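For readers unfamiliar with the pattern, a bounded min-heap top-K over cosine scores can be written with the standard library's `container/heap`. This is an independent sketch, not the vindex source: `topK`, `cosine`, and the in-memory `entries` map are names chosen for illustration.

```go
package main

import (
	"container/heap"
	"fmt"
	"math"
)

// Hit is one scored entry.
type Hit struct {
	ID    string
	Score float64
}

// minHeap keeps the lowest score at the root so it can be evicted cheaply.
type minHeap []Hit

func (h minHeap) Len() int           { return len(h) }
func (h minHeap) Less(i, j int) bool { return h[i].Score < h[j].Score }
func (h minHeap) Swap(i, j int)      { h[i], h[j] = h[j], h[i] }
func (h *minHeap) Push(x any)        { *h = append(*h, x.(Hit)) }
func (h *minHeap) Pop() any {
	old := *h
	n := len(old)
	x := old[n-1]
	*h = old[:n-1]
	return x
}

// cosine returns the cosine similarity, or 0 for mismatched or zero vectors.
func cosine(a, b []float32) float64 {
	if len(a) != len(b) || len(a) == 0 {
		return 0
	}
	var dot, na, nb float64
	for i := range a {
		dot += float64(a[i]) * float64(b[i])
		na += float64(a[i]) * float64(a[i])
		nb += float64(b[i]) * float64(b[i])
	}
	if na == 0 || nb == 0 {
		return 0
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

// topK scans every entry once and keeps the k best scores at or above
// minScore. The heap never holds more than k items, giving O(n log k).
func topK(entries map[string][]float32, query []float32, k int, minScore float64) []Hit {
	if k <= 0 {
		return nil
	}
	h := &minHeap{}
	for id, vec := range entries {
		s := cosine(query, vec)
		if s < minScore {
			continue
		}
		if h.Len() < k {
			heap.Push(h, Hit{ID: id, Score: s})
			continue
		}
		if s > (*h)[0].Score {
			(*h)[0] = Hit{ID: id, Score: s} // replace current worst
			heap.Fix(h, 0)
		}
	}
	// Pop yields ascending scores; fill from the back for descending order.
	out := make([]Hit, h.Len())
	for i := len(out) - 1; i >= 0; i-- {
		out[i] = heap.Pop(h).(Hit)
	}
	return out
}

func main() {
	entries := map[string][]float32{
		"a": {1, 0},
		"b": {0.8, 0.6},
		"c": {0, 1},
		"d": {0.6, 0.8},
	}
	for _, hit := range topK(entries, []float32{1, 0}, 2, 0.25) {
		fmt.Printf("%s %.2f\n", hit.ID, hit.Score)
	}
}
```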

Provider/dimension auto-migration

Switching provider or dimension (Voyage 1024 → OpenAI 1536, or voyage→cohere at the same 1024) no longer needs a manual rm. On load, the index detects the mismatch — of dimension (cosine between different arities is undefined) or of provider (two embedding spaces aren’t comparable, even at the same dimension) — and auto-clears the cache, removing the file so backfill repopulates:

WARN vindex provider changed — auto-clearing for re-embed  on_disk_provider=voyage provider=cohere
WARN vindex dimension mismatch — auto-clearing for re-embed  on_disk_dim=1024 provider_dim=1536

The provider→provider case at the same dim (e.g. voyage 1024 → cohere 1024) was previously served silently as garbage — cosine between different spaces is meaningless. It is now detected and re-embedded.
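The load-time check can be sketched as follows. The on-disk layout here is hypothetical (`indexFile` and its JSON field names are assumptions, since the real file schema is not shown), and checking the provider before the dimension is an arbitrary choice for the example.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// indexFile is an assumed on-disk shape: the embedding space identity
// (provider + dimension) stored next to the vectors.
type indexFile struct {
	Provider string               `json:"provider"`
	Dim      int                  `json:"dim"`
	Entries  map[string][]float32 `json:"entries"`
}

// staleReason returns a non-empty reason when on-disk vectors cannot be
// compared with vectors from the active provider.
func staleReason(onDisk indexFile, provider string, dim int) string {
	switch {
	case onDisk.Provider != provider:
		return "provider changed" // different embedding space, even at equal dim
	case onDisk.Dim != dim:
		return "dimension mismatch" // cosine across arities is undefined
	}
	return ""
}

// loadOrReset returns the usable index plus the reason it was cleared, if any.
// A cleared index starts empty so that backfill can repopulate it.
func loadOrReset(raw []byte, provider string, dim int) (indexFile, string) {
	fresh := indexFile{Provider: provider, Dim: dim, Entries: map[string][]float32{}}
	var onDisk indexFile
	if err := json.Unmarshal(raw, &onDisk); err != nil {
		return fresh, "unreadable index"
	}
	if reason := staleReason(onDisk, provider, dim); reason != "" {
		return fresh, reason
	}
	if onDisk.Entries == nil {
		onDisk.Entries = map[string][]float32{}
	}
	return onDisk, ""
}

func main() {
	raw := []byte(`{"provider":"voyage","dim":1024,"entries":{"f1":[0.1,0.2]}}`)
	idx, reason := loadOrReset(raw, "cohere", 1024)
	fmt.Printf("entries=%d reason=%q\n", len(idx.Entries), reason)
}
```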

Lazy backfill

When retrieving a fact, if it has no vector (fact predates embeddings activation), the index spawns a detached goroutine to embed the top-500 visible facts:

// cli/workspace/memory/store.go:120
go func(items map[string]string) { //#nosec G118 -- detached on purpose
    if err := m.vectors.BackfillFacts(context.Background(), items); err != nil {
        m.logger.Warn("vector backfill failed", zap.Error(err))
    }
}(items)

Backfill is bounded and configurable: at most Config.BackfillBatchMax facts per retrieve (default 500). With Voyage (~$0.05/M tokens) that costs roughly US$0.001 for a fully cold cache. In a normal session, most of the index is embedded on the first interaction.
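The bounded selection that feeds the detached goroutine might be built along these lines. This is a sketch only: `Fact`, `pendingBackfill`, and the `hasVector` callback are hypothetical, and it assumes the visible facts arrive already ordered by priority.

```go
package main

import "fmt"

// Fact is a minimal stand-in for a stored memory fact.
type Fact struct {
	ID   string
	Text string
}

// pendingBackfill collects up to batchMax facts that have no vector yet,
// as an ID -> text map ready to hand to an embedding call.
func pendingBackfill(visible []Fact, hasVector func(id string) bool, batchMax int) map[string]string {
	items := make(map[string]string)
	for _, f := range visible {
		if len(items) >= batchMax {
			break // cap reached; remaining facts wait for the next retrieve
		}
		if hasVector(f.ID) {
			continue
		}
		items[f.ID] = f.Text
	}
	return items
}

func main() {
	embedded := map[string]bool{"f2": true}
	facts := []Fact{
		{ID: "f1", Text: "use goroutines for concurrency in X pipelines"},
		{ID: "f2", Text: "use Edit tool for large files"},
		{ID: "f3", Text: "prefer table-driven tests"},
		{ID: "f4", Text: "wrap errors with context"},
	}
	items := pendingBackfill(facts, func(id string) bool { return embedded[id] }, 2)
	fmt.Println(len(items), "facts queued for embedding")
}
```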

Evaluation — proving retrieval (not assuming it)

There used to be no way to measure whether retrieval was any good. Now there is a dependency-free evaluation harness in cli/workspace/memory/eval — standard macro-averaged IR metrics:

Metric	What it measures
recall@k	fraction of relevant facts retrieved in the top-k
precision@k	fraction of the top-k that was relevant
MRR	mean reciprocal rank of the first hit
nDCG@k	normalized discounted cumulative gain

The versioned A/B (in ranking_test.go, with a deterministic embedding provider) compares keyword-only vs. blended ranking on queries phrased with synonyms absent from the fact text:

baseline (keyword): recall@1=0.2857  MRR=0.2857  nDCG@1=0.2857
candidate (blended): recall@1=1.0000  MRR=1.0000  nDCG@1=1.0000
delta              : recall@1=+0.7143

The harness runs the same way in CI (deterministic provider, reproducible) or against a real backend — it never imports a provider, so it stays neutral across all 14 supported ones. It is what turns “looks good” into “is good, measured”, and guards against regressions.
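For reference, the four metrics in the table follow their standard definitions and can be computed as below. This is a self-contained sketch with binary relevance, not the code in cli/workspace/memory/eval; the function names are illustrative. Macro-averaging means computing each metric per query and then taking the mean across queries.

```go
package main

import (
	"fmt"
	"math"
)

// head returns the first k ranked IDs, clamped to the slice bounds.
func head(ranked []string, k int) []string {
	if k < 0 {
		k = 0
	}
	if k > len(ranked) {
		k = len(ranked)
	}
	return ranked[:k]
}

// hitsAtK counts relevant IDs within the top k.
func hitsAtK(ranked []string, relevant map[string]bool, k int) int {
	n := 0
	for _, id := range head(ranked, k) {
		if relevant[id] {
			n++
		}
	}
	return n
}

// recallAtK is the fraction of relevant facts found in the top k.
func recallAtK(ranked []string, relevant map[string]bool, k int) float64 {
	if len(relevant) == 0 {
		return 0
	}
	return float64(hitsAtK(ranked, relevant, k)) / float64(len(relevant))
}

// precisionAtK is the fraction of the top k that is relevant.
func precisionAtK(ranked []string, relevant map[string]bool, k int) float64 {
	if k <= 0 {
		return 0
	}
	return float64(hitsAtK(ranked, relevant, k)) / float64(k)
}

// reciprocalRank is 1/rank of the first relevant hit, or 0 if none.
func reciprocalRank(ranked []string, relevant map[string]bool) float64 {
	for i, id := range ranked {
		if relevant[id] {
			return 1 / float64(i+1)
		}
	}
	return 0
}

// ndcgAtK is DCG over the top k divided by the ideal DCG (binary gains).
func ndcgAtK(ranked []string, relevant map[string]bool, k int) float64 {
	var dcg, idcg float64
	for i, id := range head(ranked, k) {
		if relevant[id] {
			dcg += 1 / math.Log2(float64(i+2))
		}
	}
	ideal := len(relevant)
	if ideal > k {
		ideal = k
	}
	for i := 0; i < ideal; i++ {
		idcg += 1 / math.Log2(float64(i+2))
	}
	if idcg == 0 {
		return 0
	}
	return dcg / idcg
}

// mean macro-averages per-query values.
func mean(xs []float64) float64 {
	if len(xs) == 0 {
		return 0
	}
	var sum float64
	for _, x := range xs {
		sum += x
	}
	return sum / float64(len(xs))
}

func main() {
	ranked := []string{"f3", "f1", "f7"}
	relevant := map[string]bool{"f1": true, "f9": true}
	fmt.Printf("recall@3=%.4f precision@3=%.4f RR=%.4f nDCG@3=%.4f\n",
		recallAtK(ranked, relevant, 3), precisionAtK(ranked, relevant, 3),
		reciprocalRank(ranked, relevant), ndcgAtK(ranked, relevant, 3))
	fmt.Printf("macro recall@3 over two queries=%.4f\n", mean([]float64{0.5, 1.0}))
}
```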

Ranking tunables (no new env vars)

The blended-ranking parameters live as Config fields with strong defaults — no new env vars were introduced (and the pre-existing CHATCLI_MEMORY_* vars, previously ignored on the structured path, are now applied via ConfigFromEnv + clamping):

`Config` field	Default	Controls
`RankWeights`	`{0.55, 0.30, 0.15}`	semantic / lexical / temporal weights
`MinCosineScore`	`0.25`	cosine floor in the top-K
`VectorTopK`	`12`	vector candidates per query
`BackfillBatchMax`	`500`	cap of facts embedded per retrieve

Full configuration

Env var	Default	Effect
`CHATCLI_QUALITY_HYDE_ENABLED`	`false`	Master switch (phase 3a)
`CHATCLI_QUALITY_HYDE_USE_VECTORS`	`false`	Enable phase 3b (requires provider)
`CHATCLI_QUALITY_HYDE_NUM_KEYWORDS`	`5`	Hypothesis keyword cap in phase 3a
`CHATCLI_EMBED_PROVIDER`	—	`voyage` / `openai` / `bedrock` / `null` — single source of truth for provider selection
`CHATCLI_EMBED_MODEL`	provider default	Voyage: `voyage-3`. OpenAI: `text-embedding-3-small` / `-large`. Bedrock: `amazon.titan-embed-text-v2:0` (default), `amazon.titan-embed-text-v1`, `cohere.embed-english-v3`, `cohere.embed-multilingual-v3`.
`CHATCLI_EMBED_DIMENSIONS`	model native	OpenAI: truncate via Matryoshka. Bedrock Titan v2: 256 / 512 / 1024 (rejects others). Bedrock Titan v1 / Cohere v3: fixed dimension, ignored.
`BEDROCK_REGION` / `AWS_REGION`	`us-east-1`	AWS region — only used when `CHATCLI_EMBED_PROVIDER=bedrock`.
`AWS_PROFILE`	—	AWS profile — only used when `CHATCLI_EMBED_PROVIDER=bedrock`.

`/config quality` surfaces state

── RAG + HyDE (#4)
  CHATCLI_QUALITY_HYDE_ENABLED    : enabled
  CHATCLI_QUALITY_HYDE_USE_VECTORS: enabled
  CHATCLI_EMBED_PROVIDER          : bedrock
  CHATCLI_EMBED_MODEL             : amazon.titan-embed-text-v2:0
  CHATCLI_EMBED_DIMENSIONS        : 1024
  CHATCLI_QUALITY_HYDE_NUM_KEYWORDS: 5
  Vector provider                : bedrock:amazon.titan-embed-text-v2:0
  Vector entries                 : 127

Integration with Reflexion

HyDE amplifies Reflexion’s value: lessons persisted by #3 are retrieved with much higher recall when the next task doesn’t use the exact same keywords. Workflow:

Turn 1: auth.go refactor fails (timeout)

Reflexion persists lesson: "use Edit tool for large files", tags [go, refactor, edit-tool].

Turn 5 (days later): 'help me split pkg/engine'

Query doesn’t contain refactor or edit. Keyword-only would miss the lesson.

HyDE 3a generates hypothesis

"To split a Go package, identify logical groupings and use refactor patterns with Edit tool for surgical changes..."
Extracted keywords: [split, package, refactor, edit, patterns, …]

Match!

Lesson appears in system prompt. Coder picks Edit over write from the start.

Caveats and tuning

Token cost of phase 3a: ~200 tokens per retrieval turn. In workflows with many read turns, the cost compounds. Use CHATCLI_QUALITY_HYDE_NUM_KEYWORDS=3 for a tighter budget.

Privacy: the user query is sent to the embedding provider. For sensitive workloads in corporate environments, Bedrock is the preferred path — it stays inside the AWS perimeter (CloudTrail, VPC endpoint, IAM). For local workloads with no network, consider self-hosting (roadmap: Ollama-embedding provider).

Graceful fallback: if the LLM fails or the embedding provider returns an error, retrieval falls back to keyword-only silently. No turn is aborted by HyDE failure.

See also

#3 Reflexion

The lessons that HyDE retrieves with higher recall.

Bootstrap Memory

The layer underneath: how memory.Fact is populated and maintained.

Persistent Context

/context attach for explicit file contexts.

Full configuration

All envs and slashes.