HyDE (Hypothetical Document Embeddings) is the classic technique of generating a hypothetical answer to the user’s question and using that answer as an additional retrieval signal. In ChatCLI, HyDE operates in two complementary phases: 3a expands keywords via LLM hypothesis, 3b adds vector cosine search.
HyDE is opt-in (CHATCLI_QUALITY_HYDE_ENABLED=true), so the default configuration incurs no additional cost. Phase 3a costs one extra inexpensive LLM call; Phase 3b requires configuring an embedding provider.

The problem HyDE solves

Before this pipeline, memory.Fact retrieval was keyword-only: the scorer matches tokens extracted from recent messages against the tags and content of stored facts. This works well when the vocabulary matches exactly, but fails when the user uses synonyms or asks abstract questions. Gap example:
User: how to do X in Go?
Extracted keywords: [do, go]
Stored fact: "use goroutines for concurrency in X pipelines"
Match: ❌ (neither “do” nor “go” appears as a token in the fact).
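The gap can be reproduced with a minimal token-overlap scorer. This is only a sketch: `overlap` is a hypothetical helper written for illustration, not the scorer ChatCLI ships, and it ignores tags and weighting.

```go
package main

import (
	"fmt"
	"strings"
)

// overlap counts how many keywords appear as whole tokens in the fact text.
// Hypothetical helper: the real scorer also looks at tags and other signals.
func overlap(keywords []string, fact string) int {
	tokens := map[string]bool{}
	for _, t := range strings.Fields(strings.ToLower(fact)) {
		tokens[t] = true
	}
	n := 0
	for _, k := range keywords {
		if tokens[strings.ToLower(k)] {
			n++
		}
	}
	return n
}

func main() {
	fact := "use goroutines for concurrency in X pipelines"
	fmt.Println(overlap([]string{"do", "go"}, fact))                  // 0
	fmt.Println(overlap([]string{"goroutines", "concurrency"}, fact)) // 2
}
```

The second call shows what retrieval needs: keywords drawn from the vocabulary of the answer rather than from the question.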

Phase 3a — Hypothesis-based keyword expansion

1. User types query. The query enters cli_llm.go or agent_mode.go.
2. HyDEAugmenter.Augment
   augmenter := memory.NewHyDEAugmenter(cfg, llmCallback, logger)
   expanded := augmenter.Augment(ctx, query, originalHints)
3. LLM generates short hypothesis. Prompt: “Write a 2-4 sentence plausible answer that uses the technical nouns that would appear in any matching note. Bilingual if the query mixes languages.”
4. ExtractKeywords from the hypothesis. The same extractor already used in chat mode (en+pt stop words, min 3 chars).
5. Merge unique + lower-case. Original keywords + top-N from hypothesis, cap configurable via CHATCLI_QUALITY_HYDE_NUM_KEYWORDS (default 5).
6. FactIndex.Search uses the expanded set. Existing keyword-based scorer operates over richer hints → much higher recall.
Phase 3a works without configuring an embedding provider. It’s the recommended default if the cost of +1 light LLM call is acceptable.
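The extract-and-merge steps of Phase 3a can be sketched as follows. This is an illustrative approximation, not ChatCLI's code: `extractKeywords`, `expandHints`, and the tiny stop-word list are hypothetical, and taking the first N hypothesis keywords in order of appearance is an assumption about what "top-N" means.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// stopWords is a tiny illustrative list; the real extractor uses en+pt lists.
var stopWords = map[string]bool{
	"the": true, "and": true, "for": true, "with": true, "use": true,
	"how": true, "para": true, "com": true, "que": true,
}

// extractKeywords lower-cases text, splits on non-alphanumerics, and keeps
// unique tokens of at least 3 characters that are not stop words.
func extractKeywords(text string) []string {
	fields := strings.FieldsFunc(strings.ToLower(text), func(r rune) bool {
		return !unicode.IsLetter(r) && !unicode.IsDigit(r)
	})
	seen := map[string]bool{}
	var out []string
	for _, f := range fields {
		if len([]rune(f)) < 3 || stopWords[f] || seen[f] {
			continue
		}
		seen[f] = true
		out = append(out, f)
	}
	return out
}

// expandHints merges the original hints (lower-cased, de-duplicated) with at
// most maxNew previously unseen keywords taken from the hypothesis.
func expandHints(original []string, hypothesis string, maxNew int) []string {
	seen := map[string]bool{}
	var out []string
	for _, h := range original {
		h = strings.ToLower(h)
		if !seen[h] {
			seen[h] = true
			out = append(out, h)
		}
	}
	added := 0
	for _, kw := range extractKeywords(hypothesis) {
		if added >= maxNew {
			break
		}
		if !seen[kw] {
			seen[kw] = true
			out = append(out, kw)
			added++
		}
	}
	return out
}

func main() {
	hyp := "Use goroutines and channels for concurrency in pipelines."
	fmt.Println(expandHints([]string{"Go"}, hyp, 5))
	// [go goroutines channels concurrency pipelines]
}
```

With the expanded set, the stored fact from the gap example now shares "goroutines", "concurrency", and "pipelines" with the query hints.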

Phase 3b — Vector embeddings + blended ranking

Adds cosine similarity search over fact embeddings — and, since the ranking refactor, fuses the cosine score with the lexical and temporal signals into a single ranking.
What changed (PR #1027): previously the vector hits were dissolved back into keywords and the cosine score was discarded, so you paid for an embedding call and then threw away the semantic signal it produced. Now the cosine score flows straight into the SearchBlended ranker, so a fact matched only by paraphrase (zero lexical overlap) can still rank.

Architecture

┌──────────────────┐
│ User query       │
└────────┬─────────┘
         │  Phase 3a: LLM hypothesis → expanded keywords
         ▼
┌──────────────────────────┐
│ EmbeddingProvider.Embed  │  (Voyage / OpenAI / Bedrock / Null)
└────────┬─────────────────┘
         │  vector float32
         ▼
┌──────────────────────────────────────┐
│ vindex.Index.Search(q, k, floor)     │  cosine top-K, O(n log k), pure-Go
└────────┬─────────────────────────────┘
         │  []Hit{ID, Score}   ← the cosine score is PRESERVED
         ▼
┌──────────────────────────────────────────────────┐
│ FactIndex.SearchBlended(keywords, semantic, w)   │
│   final = wSem·cosine + wLex·keyword             │
│         + wTemp·recency    (min-max normalized)  │
└────────┬─────────────────────────────────────────┘
         │
         ▼
    Facts ranked by FUSED relevance

Blended ranking (SearchBlended)

Three independent, complementary signals, each min-max normalized across the candidate set before the weighted sum:
| Signal | Source | Captures |
|---|---|---|
| semantic | cosine from the vector store | synonymy, paraphrase |
| lexical | keyword/tag overlap | exact terms, identifiers, file names |
| temporal | recency decay × access frequency | what the user actually uses |
Default weights (DefaultRankWeights): semantic 0.55 · lexical 0.30 · temporal 0.15 — semantic-first, because the embedding call was already paid for. Fusion is additive (not multiplicative) on purpose: a fact with high cosine and zero keyword overlap still ranks — a product would zero it out. Min-max normalization is what makes the weights provider-agnostic: Voyage 1024-d and OpenAI 1536-d cosine both land in [0,1] after normalizing.
Cosine floor (MinCosineScore, default 0.25): embeddings over normalized text are almost always weakly positive, so the old > 0 cutoff admitted near-orthogonal noise. The floor keeps only genuinely related facts in the top-K.
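The additive fusion can be illustrated with a small sketch. The types and functions below (`candidate`, `weights`, `minMax`, `blend`) are hypothetical stand-ins, not the SearchBlended implementation; in particular, mapping a constant column to 0 is an assumption about how the degenerate min-max case is handled. Candidates are assumed to have already passed the cosine floor.

```go
package main

import (
	"fmt"
	"sort"
)

// candidate carries the three raw signals for one fact (hypothetical type).
type candidate struct {
	ID                       string
	Cosine, Keyword, Recency float64
}

type weights struct{ Sem, Lex, Temp float64 }

// minMax rescales values to [0,1] across the candidate set.
// Assumption: a constant column maps to 0 so it contributes nothing.
func minMax(vals []float64) []float64 {
	out := make([]float64, len(vals))
	if len(vals) == 0 {
		return out
	}
	lo, hi := vals[0], vals[0]
	for _, v := range vals {
		if v < lo {
			lo = v
		}
		if v > hi {
			hi = v
		}
	}
	if hi == lo {
		return out
	}
	for i, v := range vals {
		out[i] = (v - lo) / (hi - lo)
	}
	return out
}

// blend returns fact IDs ordered by wSem*cosine + wLex*keyword + wTemp*recency,
// with each signal min-max normalized first.
func blend(cands []candidate, w weights) []string {
	n := len(cands)
	cos, kw, rec := make([]float64, n), make([]float64, n), make([]float64, n)
	for i, c := range cands {
		cos[i], kw[i], rec[i] = c.Cosine, c.Keyword, c.Recency
	}
	cos, kw, rec = minMax(cos), minMax(kw), minMax(rec)

	type scored struct {
		id    string
		score float64
	}
	s := make([]scored, n)
	for i := range cands {
		s[i] = scored{cands[i].ID, w.Sem*cos[i] + w.Lex*kw[i] + w.Temp*rec[i]}
	}
	sort.SliceStable(s, func(a, b int) bool { return s[a].score > s[b].score })

	ids := make([]string, n)
	for i := range s {
		ids[i] = s[i].id
	}
	return ids
}

func main() {
	cands := []candidate{
		{"paraphrase", 0.82, 0, 0.2}, // high cosine, zero keyword overlap
		{"exact-term", 0.30, 3, 0.9}, // strong lexical and temporal signals
		{"stale", 0.26, 1, 0.1},
	}
	fmt.Println(blend(cands, weights{0.55, 0.30, 0.15}))
	// [paraphrase exact-term stale]
}
```

With the default weights the paraphrase-only fact scores about 0.57 against 0.49 for the exact-term fact, which is the behavior the additive design is meant to preserve; a multiplicative fusion would have scored it 0.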

Supported providers

Pure-Go vector store — generic vindex primitive

No CGO, no SQLite-vec, no external deps. Just float32[] + cosine + JSON persistence in ~/.chatcli/memory/vector_index.json.
The cosine index lives in a generic, reusable package, llm/embedding/vindex, extracted once a second consumer (the semantic /context retrieval) appeared. memory is just a thin adapter over it — no vector machinery duplicated per package:
// llm/embedding/vindex — the single, agnostic primitive
type Index struct { /* provider, dim, entries map[string][]float32, path */ }
func (x *Index) Upsert(ctx context.Context, items map[string]string) error
func (x *Index) Search(query []float32, k int, minScore float64) []Hit  // top-K via min-heap
func (x *Index) Prune(keep map[string]struct{})                         // never serve stale vectors

// cli/workspace/memory/vector_store.go — fact-domain adapter
type VectorIndex struct { idx *vindex.Index; provider embedding.Provider }
type ScoredFact struct { ID string; Score float64 }
Top-K selection uses a min-heap, O(n log k), rather than a full O(n log n) sort, so the selection overhead depends on k rather than on corpus size; the cosine scan itself remains linear in n. The scale ceiling is no longer “hundreds of facts”. For the typical chatcli case the linear search completes in microseconds, so no HNSW or IVFFlat index is needed.
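A minimal sketch of top-K cosine selection with a bounded min-heap, using Go's container/heap. The names (`hit`, `minHeap`, `cosine`, `topK`) are illustrative and do not mirror the vindex source; only the general technique is the same.

```go
package main

import (
	"container/heap"
	"fmt"
	"math"
	"sort"
)

type hit struct {
	ID    string
	Score float64
}

// minHeap keeps the lowest-scoring hit at the root so it can be evicted cheaply.
type minHeap []hit

func (h minHeap) Len() int           { return len(h) }
func (h minHeap) Less(i, j int) bool { return h[i].Score < h[j].Score }
func (h minHeap) Swap(i, j int)      { h[i], h[j] = h[j], h[i] }
func (h *minHeap) Push(x any)        { *h = append(*h, x.(hit)) }
func (h *minHeap) Pop() any {
	old := *h
	n := len(old)
	x := old[n-1]
	*h = old[:n-1]
	return x
}

// cosine returns the cosine similarity, or 0 for mismatched or zero vectors.
func cosine(a, b []float32) float64 {
	if len(a) != len(b) {
		return 0
	}
	var dot, na, nb float64
	for i := range a {
		dot += float64(a[i]) * float64(b[i])
		na += float64(a[i]) * float64(a[i])
		nb += float64(b[i]) * float64(b[i])
	}
	if na == 0 || nb == 0 {
		return 0
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

// topK scans every entry once and keeps the k best scores at or above minScore.
// The heap never holds more than k items, giving O(n log k) overall.
func topK(entries map[string][]float32, q []float32, k int, minScore float64) []hit {
	if k <= 0 {
		return nil
	}
	h := &minHeap{}
	for id, v := range entries {
		s := cosine(q, v)
		if s < minScore {
			continue
		}
		if h.Len() < k {
			heap.Push(h, hit{id, s})
		} else if s > (*h)[0].Score {
			(*h)[0] = hit{id, s} // replace the current worst, then restore order
			heap.Fix(h, 0)
		}
	}
	out := []hit(*h)
	sort.Slice(out, func(i, j int) bool { return out[i].Score > out[j].Score })
	return out
}

func main() {
	entries := map[string][]float32{
		"a": {1, 0}, "b": {0.9, 0.1}, "c": {0, 1}, "d": {0.5, 0.5},
	}
	// "c" is dropped by the floor; "d" passes the floor but not the top 2.
	for _, h := range topK(entries, []float32{1, 0}, 2, 0.25) {
		fmt.Printf("%s %.3f\n", h.ID, h.Score)
	}
}
```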

Provider/dimension auto-migration

Switching provider or dimension (Voyage 1024 → OpenAI 1536, or voyage→cohere at the same 1024) no longer needs a manual rm. On load, the index detects the mismatch — of dimension (cosine between different arities is undefined) or of provider (two embedding spaces aren’t comparable, even at the same dimension) — and auto-clears the cache, removing the file so backfill repopulates:
WARN vindex provider changed — auto-clearing for re-embed  on_disk_provider=voyage provider=cohere
WARN vindex dimension mismatch — auto-clearing for re-embed  on_disk_dim=1024 provider_dim=1536
The provider→provider case at the same dim (e.g. voyage 1024 → cohere 1024) was previously served silently as garbage — cosine between different spaces is meaningless. It is now detected and re-embedded.
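The load-time check reduces to comparing two fields. The sketch below assumes a hypothetical `header` record holding the provider name and dimension stored alongside the vectors; the actual on-disk layout and function names are not shown in this page.

```go
package main

import "fmt"

// header is a hypothetical summary of what the index file was built with.
type header struct {
	Provider string
	Dim      int
}

// staleReason returns a non-empty reason when cached vectors cannot be
// compared against the active provider and must be re-embedded.
// Provider is checked first so a same-dimension provider switch is caught.
func staleReason(onDisk, active header) string {
	switch {
	case onDisk.Provider != active.Provider:
		return "provider changed"
	case onDisk.Dim != active.Dim:
		return "dimension mismatch"
	default:
		return ""
	}
}

func main() {
	fmt.Printf("%q\n", staleReason(header{"voyage", 1024}, header{"cohere", 1024})) // "provider changed"
	fmt.Printf("%q\n", staleReason(header{"voyage", 1024}, header{"voyage", 512}))  // "dimension mismatch"
	fmt.Printf("%q\n", staleReason(header{"voyage", 1024}, header{"voyage", 1024})) // ""
}
```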

Lazy backfill

When retrieving a fact, if it has no vector (fact predates embeddings activation), the index spawns a detached goroutine to embed the top-500 visible facts:
// cli/workspace/memory/store.go:120
go func(items map[string]string) { //#nosec G118 -- detached on purpose
    if err := m.vectors.BackfillFacts(context.Background(), items); err != nil {
        m.logger.Warn("vector backfill failed", zap.Error(err))
    }
}(items)
Backfill is bounded and configurable: at most Config.BackfillBatchMax facts per retrieve (default 500). With Voyage (~$0.05/M tokens), that costs roughly US$0.001 for a fully cold cache. In a normal session, most of the index is embedded on the first interaction.
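The cold-cache estimate can be checked with simple arithmetic. The figure of about 40 tokens per fact is an assumption made here so the numbers line up; the page does not state an average fact length, and `backfillCostUSD` is a hypothetical helper.

```go
package main

import "fmt"

// backfillCostUSD estimates the embedding cost of one backfill batch.
// tokensPerFact is an assumed average, not a measured value.
func backfillCostUSD(facts, tokensPerFact int, usdPerMTokens float64) float64 {
	return float64(facts*tokensPerFact) / 1e6 * usdPerMTokens
}

func main() {
	// 500 facts x 40 tokens = 20,000 tokens at $0.05 per million tokens.
	fmt.Printf("%.4f\n", backfillCostUSD(500, 40, 0.05)) // 0.0010
}
```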

Evaluation — proving retrieval (not assuming it)

There used to be no way to measure whether retrieval was any good. Now there is a dependency-free evaluation harness in cli/workspace/memory/eval — standard macro-averaged IR metrics:
| Metric | What it measures |
|---|---|
| recall@k | fraction of relevant facts retrieved in the top-k |
| precision@k | fraction of the top-k that was relevant |
| MRR | mean reciprocal rank of the first hit |
| nDCG@k | normalized discounted cumulative gain |
The versioned A/B (in ranking_test.go, with a deterministic embedding provider) compares keyword-only vs. blended ranking on queries phrased with synonyms absent from the fact text:
baseline (keyword): recall@1=0.2857  MRR=0.2857  nDCG@1=0.2857
candidate (blended): recall@1=1.0000  MRR=1.0000  nDCG@1=1.0000
delta              : recall@1=+0.7143
The harness runs the same way in CI (deterministic provider, reproducible) or against a real backend — it never imports a provider, so it stays neutral across all 14 supported ones. It is what turns “looks good” into “is good, measured”, and guards against regressions.
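For reference, the four metrics can be computed per query as in the sketch below, assuming binary relevance. These functions are written for illustration and are not the API of cli/workspace/memory/eval; the macro average is then the mean of each metric across queries.

```go
package main

import (
	"fmt"
	"math"
)

// topk returns the first k ranked IDs, clamped to the list length.
func topk(ranked []string, k int) []string {
	if k < 0 {
		k = 0
	}
	if k > len(ranked) {
		k = len(ranked)
	}
	return ranked[:k]
}

func hitsAtK(ranked []string, relevant map[string]struct{}, k int) int {
	n := 0
	for _, id := range topk(ranked, k) {
		if _, ok := relevant[id]; ok {
			n++
		}
	}
	return n
}

// recallAtK: fraction of relevant facts that appear in the top-k.
func recallAtK(ranked []string, relevant map[string]struct{}, k int) float64 {
	if len(relevant) == 0 {
		return 0
	}
	return float64(hitsAtK(ranked, relevant, k)) / float64(len(relevant))
}

// precisionAtK: fraction of the top-k that is relevant.
func precisionAtK(ranked []string, relevant map[string]struct{}, k int) float64 {
	if k <= 0 {
		return 0
	}
	return float64(hitsAtK(ranked, relevant, k)) / float64(k)
}

// reciprocalRank: 1/rank of the first relevant fact, 0 if none is retrieved.
func reciprocalRank(ranked []string, relevant map[string]struct{}) float64 {
	for i, id := range ranked {
		if _, ok := relevant[id]; ok {
			return 1 / float64(i+1)
		}
	}
	return 0
}

// ndcgAtK: DCG of the ranking divided by the DCG of an ideal ranking.
func ndcgAtK(ranked []string, relevant map[string]struct{}, k int) float64 {
	var dcg, idcg float64
	for i, id := range topk(ranked, k) {
		if _, ok := relevant[id]; ok {
			dcg += 1 / math.Log2(float64(i+2))
		}
	}
	ideal := len(relevant)
	if ideal > k {
		ideal = k
	}
	for i := 0; i < ideal; i++ {
		idcg += 1 / math.Log2(float64(i+2))
	}
	if idcg == 0 {
		return 0
	}
	return dcg / idcg
}

func main() {
	ranked := []string{"f1", "f2", "f3"}
	relevant := map[string]struct{}{"f2": {}}
	fmt.Printf("recall@1=%.4f precision@3=%.4f RR=%.4f nDCG@3=%.4f\n",
		recallAtK(ranked, relevant, 1), precisionAtK(ranked, relevant, 3),
		reciprocalRank(ranked, relevant), ndcgAtK(ranked, relevant, 3))
}
```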

Ranking tunables (no new env vars)

The blended-ranking parameters live as Config fields with strong defaults — no new env vars were introduced (and the pre-existing CHATCLI_MEMORY_* vars, previously ignored on the structured path, are now applied via ConfigFromEnv + clamping):
| Config field | Default | Controls |
|---|---|---|
| RankWeights | {0.55, 0.30, 0.15} | semantic / lexical / temporal weights |
| MinCosineScore | 0.25 | cosine floor in the top-K |
| VectorTopK | 12 | vector candidates per query |
| BackfillBatchMax | 500 | cap of facts embedded per retrieve |

Full configuration

| Env var | Default | Effect |
|---|---|---|
| CHATCLI_QUALITY_HYDE_ENABLED | false | Master switch (phase 3a) |
| CHATCLI_QUALITY_HYDE_USE_VECTORS | false | Enable phase 3b (requires provider) |
| CHATCLI_QUALITY_HYDE_NUM_KEYWORDS | 5 | Hypothesis keyword cap in phase 3a |
| CHATCLI_EMBED_PROVIDER | voyage / openai / bedrock / null | single source of truth for provider selection |
| CHATCLI_EMBED_MODEL | provider default | Voyage: voyage-3. OpenAI: text-embedding-3-small / -large. Bedrock: amazon.titan-embed-text-v2:0 (default), amazon.titan-embed-text-v1, cohere.embed-english-v3, cohere.embed-multilingual-v3. |
| CHATCLI_EMBED_DIMENSIONS | model native | OpenAI: truncate via Matryoshka. Bedrock Titan v2: 256 / 512 / 1024 (rejects others). Bedrock Titan v1 / Cohere v3: fixed dimension, ignored. |
| BEDROCK_REGION / AWS_REGION | us-east-1 | AWS region; only used when CHATCLI_EMBED_PROVIDER=bedrock. |
| AWS_PROFILE | | AWS profile; only used when CHATCLI_EMBED_PROVIDER=bedrock. |

/config quality surfaces state

── RAG + HyDE (#4)
  CHATCLI_QUALITY_HYDE_ENABLED    : enabled
  CHATCLI_QUALITY_HYDE_USE_VECTORS: enabled
  CHATCLI_EMBED_PROVIDER          : bedrock
  CHATCLI_EMBED_MODEL             : amazon.titan-embed-text-v2:0
  CHATCLI_EMBED_DIMENSIONS        : 1024
  CHATCLI_QUALITY_HYDE_NUM_KEYWORDS: 5
  Vector provider                : bedrock:amazon.titan-embed-text-v2:0
  Vector entries                 : 127

Integration with Reflexion

HyDE amplifies Reflexion’s value: lessons persisted by #3 are retrieved with much higher recall when the next task doesn’t use the exact same keywords. Workflow:
1. Turn 1: auth.go refactor fails (timeout). Reflexion persists lesson: "use Edit tool for large files", tags [go, refactor, edit-tool].
2. Turn 5 (days later): 'help me split pkg/engine'. Query doesn’t contain refactor or edit. Keyword-only would miss the lesson.
3. HyDE 3a generates hypothesis: "To split a Go package, identify logical groupings and use refactor patterns with Edit tool for surgical changes..." Extracted keywords: [split, package, refactor, edit, patterns, …]
4. Match! Lesson appears in system prompt. Coder picks Edit over write from the start.

Caveats and tuning

Token cost of phase 3a: ~200 tokens per retrieval turn. In workflows with many read turns, the cost compounds. Use CHATCLI_QUALITY_HYDE_NUM_KEYWORDS=3 for a tighter budget.
Privacy: the user query is sent to the embedding provider. For sensitive workloads in corporate environments, Bedrock is the preferred path — it stays inside the AWS perimeter (CloudTrail, VPC endpoint, IAM). For local workloads with no network, consider self-hosting (roadmap: Ollama-embedding provider).
Graceful fallback: if the LLM fails or the embedding provider returns an error, retrieval falls back to keyword-only silently. No turn is aborted by HyDE failure.

See also

#3 Reflexion

The lessons that HyDE retrieves with higher recall.

Bootstrap Memory

The layer underneath: how memory.Fact is populated and maintained.

Persistent Context

/context attach for explicit file contexts.

Full configuration

All envs and slashes.