context + repl + config: route-aware context compression (closes #87)

Small local models effectively use a fraction of their advertised
context window. Per-request compression for routes that hit a
local-compress-flagged model preset: keeps only the last N turns and
tail-truncates oversized content. Cloud routes get the full context
unchanged.

Changes:
- context.lua _compress_turns(turns, keep, max_chars): returns a new
  list (self.turns NEVER mutated) with the last `keep` turns preserved
  + content tail-truncated to `max_chars`. Defensive: drops tool turns
  at the slice head (orphaned without their
  assistant-with-tool_calls anchor; strict chat templates would reject
  them; same gotcha PHASE0 §6 warned about for user/user).
- Context:to_messages(opts): opts.compress = { keep_turns,
  max_turn_chars } swaps the turn iterable for the compressed view.
  Affects BOTH the use_tool_role=true path and the use_tool_role=false
  fallback (PHASE2.md Q18 strict-template workaround). Persistence +
  display via :history see the full uncompressed ctx.turns.
- repl.lua ask_ai: when req_cfg (the routed model's cfg) has
  `local_compress = true`, build compress_opts from
  config.context.compress (defaults keep_turns=2, max_turn_chars=800).
  Pass through ctx:to_messages alongside the existing
  system_prompt_override (#86); the two are orthogonal opts that
  compose.
- Norris unaffected: safety.norris_step builds its own messages array;
  the planner needs full history per PHASE3 design.
- config.lua gains a header comment explaining the per-model opt-in +
  the context.compress defaults block + the documented tool-turn
  truncation trade-off.
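
For illustration only, the behaviour described for _compress_turns can be sketched roughly as below. This is not the code from context.lua: the standalone function name `compress_turns`, the turn shape (`role`, `content`) and the reading of "tail-truncated" as "keep the last `max_chars` bytes" are assumptions, the last one inferred from the config comment that the tail is usually the relevant bit.

```lua
-- Hypothetical sketch of the compression step; names and turn shape are
-- assumptions, not the actual context.lua implementation.
local function compress_turns(turns, keep, max_chars)
    local out = {}

    -- Index of the first turn inside the "last `keep` turns" slice.
    local first = math.max(1, #turns - keep + 1)

    -- Drop tool turns at the head of the slice: their assistant anchor
    -- (the turn carrying tool_calls) fell outside the window.
    while first <= #turns and turns[first].role == "tool" do
        first = first + 1
    end

    for i = first, #turns do
        local src = turns[i]

        -- Shallow copy so the caller's turn tables are never modified.
        local copy = {}
        for k, v in pairs(src) do
            copy[k] = v
        end

        -- Assumed meaning of tail-truncation: keep only the final
        -- `max_chars` bytes. Note this counts bytes, not UTF-8 characters.
        local content = src.content
        if type(content) == "string" and #content > max_chars then
            copy.content = content:sub(#content - max_chars + 1)
        end

        out[#out + 1] = copy
    end

    return out
end

-- Usage: a 4-turn history rendered with keep=2, max_chars=4.
local turns = {
    { role = "user",      content = "list files" },
    { role = "assistant", content = "CMD: ls" },
    { role = "tool",      content = "a.txt\nb.txt" },
    { role = "user",      content = string.rep("x", 10) .. "TAIL" },
}
local view = compress_turns(turns, 2, 4)
```

In this sketch the slice of the last two turns starts on the tool turn, so it is dropped and only the final user turn survives, cut down to its last four bytes. The original `turns` table is left as it was.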

13 unit cases verified:
- no opts -> full turn list (no regression)
- keep_turns=2 -> exactly last 2 emitted
- long content tail-truncated to max_chars
- self.turns unchanged after render
- orphan tool-turn at slice head dropped (no chat-template violation)
- tool turn included WITH its assistant anchor when keep_turns >= 3

E2E against live local broker:
- models.fast.local_compress = true; keep_turns=1; max=200
- 4-turn session: each broker call sees ONLY the current turn (verified
  by short coherent CMD replies despite no cross-turn memory available
  to the model). FR-promised small-model friendliness in action;
  conversation continuity is the documented trade-off.

Regression: test_safety 87/87, test_router_model 31/31, repl loads.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-17 07:50:07 +00:00
parent 74e4bffb37
commit c55077bc07
3 changed files with 90 additions and 5 deletions
@@ -285,6 +285,28 @@ return {
    --     probe_grammar      = [[root ::= ("YES" | "NO")]],
    -- },

+    -- ── Issue #87 (route-aware context compression).
+    -- When a routed model preset has `local_compress = true`, each
+    -- broker call against THAT preset gets a compressed view of
+    -- ctx.turns: only the last `keep_turns` turns; any turn whose
+    -- content exceeds `max_turn_chars` is tail-truncated. The full
+    -- context lives on (visible via :history); compression is purely
+    -- per-request for small models that effectively use a fraction
+    -- of their advertised context window.
+    --
+    -- Set the per-model opt-in on models[<name>]:
+    --     models.fast = { ..., local_compress = true }
+    -- Defaults live under context.compress:
+    --     context = {
+    --         ...
+    --         compress = { keep_turns = 2, max_turn_chars = 800 },
+    --     }
+    --
+    -- Trade-off documented in the FR: tool turns lose information
+    -- when tail-truncated. Acceptable for shell-output blocks (the
+    -- tail is usually the relevant bit); known limitation for
+    -- structured tool results. Disable per-model if it bites.
+
    -- ── Phase 5 context summarization on sliding-window eviction.
    -- Set INSIDE the context = { ... } block above to enable:
    --     context = {