From da225314b0c0e122fae51b09edef897cf6015447 Mon Sep 17 00:00:00 2001
From: tlancas25
Date: Fri, 19 Jun 2026 11:16:37 -0700
Subject: [PATCH] docs: note output-trimming UX issue for self-hosted models + proposed fix
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Smaller/local models driving Claude Code as a harness read the cooler's
compact summary as data loss and abandon ctx_execute (falling back to raw
Bash), losing both the token savings and the guardrails. Document the root
cause (hard-coded trim defaults in compactDefault / execute, plus no
truncation signal in the response) and a proposed fix (env-configurable
caps + a retrieval hint) to implement later.

Co-Authored-By: Claude Opus 4.8 (1M context)
---
 docs/SELF-HOSTED-MODELS.md | 90 ++++++++++++++++++++++++++++++++++++++
 1 file changed, 90 insertions(+)
 create mode 100644 docs/SELF-HOSTED-MODELS.md

diff --git a/docs/SELF-HOSTED-MODELS.md b/docs/SELF-HOSTED-MODELS.md
new file mode 100644
index 0000000..84bf2a0
--- /dev/null
+++ b/docs/SELF-HOSTED-MODELS.md
@@ -0,0 +1,90 @@
+# Self-Hosted / Small-Model Output Trimming
+
+**Status:** Known issue + proposed fix. *Not yet implemented*; documented here so we can
+come back to it. Interim mitigation is harness-side only (see below).
+
+## Who this affects
+
+Setups that drive the Claude Code TUI as a **harness for a self-hosted / local model**
+(ollama, vLLM, LM Studio, llama.cpp via LiteLLM, etc.) instead of a frontier Anthropic
+model. Context Cooler is most valuable exactly here (local context windows are small and
+fill fast), so losing the model's trust in the tool is costly.
+
+## Observed problem
+
+A local model (`huihui-q8`, a quantized Qwen) **stopped using `ctx_execute` entirely**
+after it saw the tool "swallowing" output, and fell back to raw `Bash`. Net result: zero
+token savings *and* the loss of the cooler's guardrails, the opposite of the tool's
+purpose.
+
+The model interpreted a trimmed summary as **"my output was lost,"** when in fact the full
+output is captured and indexed (FTS5) and is retrievable with `ctx_search`. Frontier models
+tend to infer "this is a summary, the rest is on disk"; smaller models do not, and react by
+abandoning the tool.
+
+## Root cause
+
+Two things compound:
+
+1. **Trim defaults are aggressive and hard-coded** (not configurable):
+   - `src/lib/filter.ts` → `compactDefault()`:
+     - arrays → `data.slice(0, 5)` (`filter.ts:197`)
+     - objects → scalar fields only, else first 5 keys (`filter.ts:202-204`)
+   - `src/lib/filter.ts` → `filterByIntent()` default `limit = 5` for arrays.
+   - `src/tools/execute.ts` → non-JSON text → `String(summary).slice(0, 5000)` (`execute.ts:170`).
+2. **The response gives no signal that trimming happened or that the full output is
+   retrievable.** There is no `truncated` flag and no pointer to `ctx_search`, so a model
+   that doesn't already know the cooler's design can't tell a summary from data loss.
+
+## Why it matters
+
+The savings pitch ("70–98% fewer tokens") only pays off if the model *keeps* routing heavy
+output through the cooler. For the smaller models that need the savings most, an opaque trim
+reads as unreliability and triggers abandonment.
+
+## Interim mitigation (already in place, harness-side only)
+
+The harness operating profile was updated to tell the model explicitly that a trimmed
+summary means "the rest is on disk, retrievable with `ctx_search`," with sharper use/don't-use
+rules (cooler for *volume to skim*, plain tools for *small or exact* output). This is a
+prompt-side band-aid and does **not** help operators who haven't tuned their profile, hence
+the code-side fix below.
+
+## Proposed fix (to implement)
+
+1. **Make the caps configurable via env**, following the existing `CTX_*` convention parsed
+   in `src/lib/env.ts` (cf. `CTX_SNAPSHOT_BUDGET`, `CTX_FTS_ENABLED`).
+   Defaults = today's values, so behavior is unchanged unless an operator opts in:
+   - `CTX_MAX_ARRAY_ITEMS` (default `5`): used by `compactDefault` + `filterByIntent`.
+   - `CTX_MAX_OBJECT_KEYS` (default `5`): used by `compactDefault`.
+   - `CTX_MAX_TEXT_CHARS` (default `5000`): used by the non-JSON path in `execute.ts`.
+   This lets a self-hosted operator set e.g. `CTX_MAX_TEXT_CHARS=20000` / `CTX_MAX_ARRAY_ITEMS=25`
+   to trade some savings for less surprise.
+
+2. **Add a retrieval hint to the `ctx_execute` response when trimming actually occurred.**
+   When `summary_bytes < raw_bytes` (or an array/object/text was cut), include a small,
+   cache-friendly note, e.g.:
+   ```json
+   { "truncated": true, "indexed": true,
+     "retrieve_with": "ctx_search(\"