#3 Reflexion — Learn from Failures

Reflexion closes the learning loop: when an agent fails or produces low-quality output, instead of losing the experience, the pipeline generates a structured Lesson and persists it to long-term memory. On the next similar task, that lesson naturally surfaces via RAG+HyDE.

Reflexion is the only post-hook enabled by default: it fires only in exceptional conditions (error, discrepancy), and lesson generation never blocks the user turn.

Durable mode (default since Apr 2026): triggers flow through a WAL-backed queue with worker pool and dead-letter queue. Lessons survive process crashes via replay on next boot. See Durable Queue.

What is a Lesson

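A Lesson is a small record: three free-text fields (situation, mistake, correction) plus tags, the trigger, and a timestamp. It is rendered as four lines when persisted: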

type Lesson struct {
    Situation  string   // "When editing large Go files..."
    Mistake    string   // "Tried to rewrite the whole file at once"
    Correction string   // "Use Edit tool with specific old_string/new_string"
    Tags       []string // ["go", "edit-file", "large-file", "reflexion"]
    Trigger    string   // "error" | "hallucination" | "low_quality" | "manual"
    CreatedAt  time.Time
}

When persisted as memory.Fact, Content becomes:

LESSON: When editing large Go files
MISTAKE: Tried to rewrite the whole file at once
CORRECTION: Use Edit tool with specific old_string/new_string
TRIGGER: error

The Fact category is lesson and tags include reflexion + trigger:<x> + domain-specific tags. This enables precise queries: “show me all lessons about edit-file” becomes a regular memory search.
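The mapping from a Lesson to the persisted Content and tags can be sketched as follows. This is a minimal illustration, not the project's actual code: FactContent and FactTags are hypothetical helper names, and the de-duplication of tags is an assumption.

```go
package main

import (
	"fmt"
	"strings"
	"time"
)

// Lesson mirrors the struct shown above.
type Lesson struct {
	Situation  string
	Mistake    string
	Correction string
	Tags       []string
	Trigger    string
	CreatedAt  time.Time
}

// FactContent renders the four-line Content described in the text.
// (Hypothetical helper name.)
func FactContent(l Lesson) string {
	return strings.Join([]string{
		"LESSON: " + l.Situation,
		"MISTAKE: " + l.Mistake,
		"CORRECTION: " + l.Correction,
		"TRIGGER: " + l.Trigger,
	}, "\n")
}

// FactTags builds "reflexion" + "trigger:<x>" + domain tags.
// Dropping duplicates and empty tags is an assumption of this sketch.
func FactTags(l Lesson) []string {
	seen := map[string]bool{}
	var out []string
	all := append([]string{"reflexion", "trigger:" + l.Trigger}, l.Tags...)
	for _, t := range all {
		if t != "" && !seen[t] {
			seen[t] = true
			out = append(out, t)
		}
	}
	return out
}

func main() {
	l := Lesson{
		Situation:  "When editing large Go files",
		Mistake:    "Tried to rewrite the whole file at once",
		Correction: "Use Edit tool with specific old_string/new_string",
		Tags:       []string{"go", "edit-file", "large-file", "reflexion"},
		Trigger:    "error",
		CreatedAt:  time.Now(),
	}
	fmt.Println(FactContent(l))
	fmt.Println(FactTags(l))
}
```

With tags shaped like this, a query for a single tag such as edit-file only needs a plain tag match on the stored facts.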

Four triggers

OnError
OnHallucination
OnLowQuality
Manual via /reflect

if cfg.OnError && result.Error != nil {
    return "error"
}

The worker returned Error != nil. Examples: timeout, invalid tool call, provider crash. Default: ON.

if cfg.OnHallucination && result.MetadataFlag("verified_with_discrepancy") {
    return "hallucination"
}

The VerifyHook (#6 CoVe) flagged a discrepancy between the draft and the verification answers. Default: ON.

if cfg.OnLowQuality && result.MetadataFlag("refine_low_quality") {
    return "low_quality"
}

The RefineHook (#5) gave the original draft a low score. Default: OFF (too noisy by default).

if result.MetadataFlag(MetaForceReflexion) {
    return "manual"
}

The /reflect <free-text> slash command persists the lesson directly, without an LLM call. Always available.

Flow — durable mode (default)

With CHATCLI_QUALITY_REFLEXION_QUEUE_ENABLED=true (default), triggers flow through a persistent queue. The hook never blocks the turn and the process can crash without losing the lesson:

PostRun inspects the trigger

ReflexionHook.PostRun(ctx, hc, result) looks at result.Metadata and result.Error; if no gate matches, it returns within microseconds.

WAL Append (synchronous, sub-ms)

The hook calls enqueuer.Enqueue(req). The Runner computes the content-addressed JobID = sha256(normalized(task) | trigger | attempt | outcome)[:16], writes a record to the WAL (~/.chatcli/reflexion/wal/<id>.wal) via tmp → fsync → atomic rename → dir fsync, then pushes the job onto the in-memory queue.

Immediate return to the pipeline

PostRun returns nil; the user’s turn continues without waiting. Added latency is the fsync (typically < 1 ms).

Worker pool processes async

One of N workers (default 2) dequeues, calls GenerateLesson with per-job timeout (default 2 min), and persists to memory.Fact unless the LLM emits <skip>.

Outcome classification

Success or Skipped → ACK (delete WAL record). Transient error (timeout, 429/503) → reschedule with exponential backoff + jitter. Permanent (parser error) → move to DLQ immediately.

Replay on boot

Next session, Runner.Replay() runs async and re-queues every pending record from the WAL (discarding those older than StaleAfter, default 7 days).

Observable turn latency: a local fsync on SSD is typically < 1 ms. Lesson generation (LLM call) happens after the turn responds — users never wait.

Fallback: legacy mode (detached goroutine)

If CHATCLI_QUALITY_REFLEXION_QUEUE_ENABLED=false, the hook reverts to the original behavior:

go h.runReflexion(context.Background(), req)  // fire-and-forget

Zero filesystem dependency, but in-flight lessons vanish if the process is killed. Kept for backward compatibility and for users who prefer simplicity over durability.

Durable Queue — WAL + Worker Pool + DLQ

The queue is implemented in cli/agent/quality/lessonq/ and is built from the following pieces:

WAL (Write-Ahead Log)

Each pending lesson is a .wal file in ~/.chatcli/reflexion/wal/ — one per Job ID. Binary layout:

[4B magic 'LSN1'][4B length BE][4B CRC32 payload][N bytes JSON payload][4B CRC32 trailer]

Double CRC detects torn writes (crash mid-fsync). Corrupt records are discarded on replay, and chatcli_lessonq_wal_corruption_total is incremented.
Atomic rename: write to <id>.tmp.<pid>.<seq> → fsync → rename → dir fsync. A reader never sees a partial record.
O(1) ACK: a single unlink removes the record. No background compaction.

Worker Pool

Queue (min-heap by NextAttemptAt)
      │
      ├─► Worker 1 ─► GenerateLesson ─► persist ─► ACK
      ├─► Worker 2 ─► GenerateLesson ─► persist ─► ACK
      └─► Worker N ─► ...

Each worker:

Blocking Dequeue (waits until NextAttemptAt ≤ now).
Bounded per-job timeout (doesn’t inherit turn ctx — reflexion outlives the turn by design).
Panic recovery: if the processor panics, the job goes straight to the DLQ, since retrying a deterministic bug would only loop.
Emits chatcli_lessonq_processing_duration_seconds{outcome}.

Dead Letter Queue

Permanent failures or retry exhaustion go to ~/.chatcli/reflexion/dlq/ (same WAL format, read-only to the process). Operator inspects and decides:

/reflect failed              # list with last error
/reflect retry <job-id>      # re-queue (resets Attempts=0)
/reflect purge <job-id>      # remove permanently

Retry with Jitter

Transient errors (ctx timeout, provider 429/503, temp fs error) become reschedules:

delay = InitialDelay × Multiplier^(attempt-1)
delay = min(delay, MaxDelay)
delay = delay × uniform(1-JitterFraction, 1+JitterFraction)

Defaults: 1s initial delay, 5min cap, 2.0 multiplier, ±20% jitter, 5 attempts. The jitter spreads retries out and prevents a thundering herd when the provider recovers.

Idempotency

JobID is content-addressed: sha256(normalized(task) | trigger | attempt | outcome)[:16]. Re-triggering the same situation while the job is in-flight is a no-op (WAL exists → Runner skips queue insert). Whitespace is normalized to avoid inflation from trivial churn.

Drain + Graceful Shutdown

On exit (cli.cleanup()), the Runner enters DrainAndShutdown(30s):

Queue closes — no new dequeues.
Workers finish in-flight (or get cancelled on timeout).
WAL/DLQ close.

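Jobs still queued survive in the WAL and are reprocessed on next boot. Because every job is written to the WAL before Enqueue returns, enqueued jobs are not lost on SIGTERM; after kill -9 the drain never runs, but any record already in the WAL is still replayed.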

`/reflect` — Commands

/reflect                     # queue depth + DLQ size + subcommands hint

All subcommands have Tab autocomplete. /reflect retry and /reflect purge list live DLQ IDs with task preview + last error.

Files and layout

~/.chatcli/reflexion/
├── wal/                          # active queue (pending + in-flight)
│   ├── a3f8...bc.wal            # one file per Job ID
│   └── ...
└── dlq/                          # dead letter queue (permanent failures)
    ├── 9e2c...7a.wal
    └── ...

Path configurable via CHATCLI_QUALITY_REFLEXION_QUEUE_BASE_DIR (default: <workspace>/.chatcli/reflexion).

Operators can ls the directory for quick triage without special tools. Each record is JSON inside the binary framing — xxd + the lessonq protocol docs help in forensics.

Lesson generator protocol

The system prompt instructs the model to be general, not one-off:

Rules:
- A "lesson" must be GENERAL enough to apply next time a similar task
  comes up — not one-off and not a play-by-play.
- If there is genuinely nothing to learn (e.g. the task was trivial and
  the failure was a transient network blip), reply with exactly:
  <skip>nothing actionable</skip>
- Otherwise emit ALL of the following blocks. Keep each to ONE line.
- "tags" is a comma-separated list of 2-5 short keywords (lowercase,
  hyphenated if needed) that future similar tasks will likely contain.

OUTPUT:
<situation>brief description of when this lesson applies</situation>
<mistake>what went wrong this time</mistake>
<correction>what to do differently next time</correction>
<tags>tag1, tag2, tag3</tags>

The <skip> block exists precisely to avoid memory pollution with “lessons” from transient failures. The model can refuse to generate a lesson at zero persistence cost.

`/reflect` — manual path without LLM

When you know the lesson and don’t need an LLM distilling:

/reflect when editing large Go files use Edit, not full rewrite

Goes straight into memory.Fact:

LESSON: when editing large Go files use Edit, not full rewrite
MISTAKE: (user-supplied lesson; no automatic mistake detection)
CORRECTION: when editing large Go files use Edit, not full rewrite
TRIGGER: manual

Generated tags: ["reflexion", "trigger:manual", "user-supplied"].

The manual path does not make an LLM call — it’s cheap, synchronous, and ideal for capturing learnings during a session.

How the lesson “comes back”

Once persisted, the lesson is a regular fact in the index. It surfaces via:

Hint-based retrieval: if the next task mentions keywords in Tags, the relevance-based scorer surfaces it.
HyDE amplifies: with CHATCLI_QUALITY_HYDE_ENABLED=true, the generated hypothesis covers similar concepts, increasing match chance.
Vector search: with embeddings configured, the lesson is searched by cosine proximity.

The next turn’s system prompt contains the ## Long-term Memory section with the lesson text, and the model has all the cues to not repeat the mistake.

Environment variables

Gates (when to fire)

| Env var | Default | What it does |
| --- | --- | --- |
| `CHATCLI_QUALITY_REFLEXION_ENABLED` | `true` | Master switch |
| `CHATCLI_QUALITY_REFLEXION_ON_ERROR` | `true` | Fire on tool error |
| `CHATCLI_QUALITY_REFLEXION_ON_HALLUCINATION` | `true` | Fire on `verified_with_discrepancy` |
| `CHATCLI_QUALITY_REFLEXION_ON_LOW_QUALITY` | `false` | Fire on `refine_low_quality` |
| `CHATCLI_QUALITY_REFLEXION_PERSIST` | `true` | Write to memory.Fact (false = log-only) |

Durable queue (WAL + worker pool + DLQ)

| Env var | Default | Effect |
| --- | --- | --- |
| `CHATCLI_QUALITY_REFLEXION_QUEUE_ENABLED` | `true` | Queue master switch. `false` falls back to legacy (detached goroutine) |
| `CHATCLI_QUALITY_REFLEXION_QUEUE_WORKERS` | `2` | Worker pool size. Reflexion is I/O-bound on the LLM call |
| `CHATCLI_QUALITY_REFLEXION_QUEUE_CAPACITY` | `1000` | Max in-memory depth before the overflow policy kicks in |
| `CHATCLI_QUALITY_REFLEXION_QUEUE_DROP_OLDEST` | `false` | Overflow policy: `true` drops the oldest job; `false` blocks with a timeout |
| `CHATCLI_QUALITY_REFLEXION_QUEUE_BLOCK_TIMEOUT` | `5s` | How long Enqueue waits when full (if `DROP_OLDEST=false`) |
| `CHATCLI_QUALITY_REFLEXION_QUEUE_MAX_ATTEMPTS` | `5` | Total attempts before moving to DLQ |
| `CHATCLI_QUALITY_REFLEXION_QUEUE_INITIAL_DELAY` | `1s` | First retry delay |
| `CHATCLI_QUALITY_REFLEXION_QUEUE_MAX_DELAY` | `5m` | Cap on exponential retry delay |
| `CHATCLI_QUALITY_REFLEXION_QUEUE_JITTER` | `0.2` | Jitter fraction in [0, 0.5], applied as ± around the computed delay |
| `CHATCLI_QUALITY_REFLEXION_QUEUE_JOB_TIMEOUT` | `2m` | Per-processor-call timeout (LLM + persist) |
| `CHATCLI_QUALITY_REFLEXION_QUEUE_STALE_AFTER` | `168h` | WAL records older than this are discarded on replay (7 days) |
| `CHATCLI_QUALITY_REFLEXION_QUEUE_BASE_DIR` | `<workspace>/.chatcli/reflexion` | Override of the root dir (WAL + DLQ) |

Prometheus metrics

The queue emits 10 metrics under chatcli_lessonq_*:

| Metric | Type | Labels | Meaning |
| --- | --- | --- | --- |
| `enqueue_total` | Counter | `outcome` | accepted, rejected_full, deduped, dropped_oldest |
| `queue_depth` | Gauge | (none) | In-memory pending jobs |
| `processing_duration_seconds` | Histogram | `outcome` | dequeue→outcome time |
| `attempts_total` | Counter | `outcome` | success, skipped, transient, permanent |
| `retry_total` | Counter | `attempt` | retries bucketed by attempt number |
| `dlq_size` | Gauge | (none) | Jobs in DLQ |
| `wal_segments` | Gauge | (none) | Active `.wal` files |
| `wal_corruption_total` | Counter | (none) | Records rejected for CRC mismatch/torn write |
| `stale_discarded_total` | Counter | (none) | Records dropped at replay due to age |
| `persist_failures_total` | Counter | (none) | memory.Fact callback failures |

Full cycle example

User asks for a task that fails

/coder refactor pkg/engine to extract Close method

CoderAgent tries full rewrite

File has 2000 lines, provider responds with timeout.

PostRun detects result.Error != nil

OnError trigger matched.

Worker: GenerateLesson

Model emits:

<situation>Refactoring large Go files (>1000 lines)</situation>
<mistake>Attempted full rewrite via @coder write</mistake>
<correction>Use @coder patch or Edit tool for surgical changes</correction>
<tags>go, refactor, large-file, edit-tool</tags>

Persists in memory.Fact

Category=lesson, workspace=current project.

Next week, user asks for a similar refactor

/coder refactor pkg/auth/manager.go split into smaller files

RAG+HyDE brings the lesson

Tags refactor + large-file match. Lesson appears in the system prompt.

Coder picks the right approach from the start

Emits multiple @coder patch instead of write. Task done without timeout.

Inspect stored lessons

# Lessons already persisted (materialized to memory.Fact)
/memory longterm | grep -A3 "^LESSON:"
cat ~/.chatcli/memory/memory_index.json | jq '.[] | select(.category=="lesson")'

# Durable queue — live pending + DLQ
/reflect list               # pending + DLQ
/reflect failed             # DLQ only (triage)
/config quality             # hook state + queue depth + dlq size

Useful Prometheus snapshots

# General queue health
chatcli_lessonq_queue_depth
chatcli_lessonq_dlq_size
sum(rate(chatcli_lessonq_attempts_total[5m])) by (outcome)

# Regression detection: DLQ growing without new success
rate(chatcli_lessonq_attempts_total{outcome="permanent"}[5m]) > 0

# Alert on WAL corruption (signal of unstable fs)
increase(chatcli_lessonq_wal_corruption_total[1h]) > 0

# Processor percentile latency
histogram_quantile(0.95,
  rate(chatcli_lessonq_processing_duration_seconds_bucket[5m]))

Legacy inspection (pre-queue)

# All lessons (same as above, shown for backward-compat)
/memory longterm | grep -A3 "^LESSON:"
cat ~/.chatcli/memory/memory_index.json | jq '.[] | select(.category=="lesson")'

# Or via /config
/config quality
# → shows total registered post-hooks (reflexion appears if Enabled=true)

See also

#4 RAG + HyDE

How lessons are retrieved in future tasks via semantic retrieval.

#6 CoVe

The verifier generates the verified_with_discrepancy signal that Reflexion consumes.

Bootstrap Memory

The layer underneath: how memory.Fact is populated and maintained.

Memory Commands

/memory load, /memory show, /memory longterm.