context + repl + config: route-aware context compression (closes #87)
Small local models effectively use only a fraction of their
advertised context window. This commit adds per-request compression
for routes that resolve to a model preset flagged with
`local_compress = true`: only the last N turns are kept and
oversized content is tail-truncated. Cloud routes get the full
context unchanged.
Changes:
- context.lua: _compress_turns(turns, keep, max_chars) returns a
new list (self.turns is NEVER mutated) containing the last `keep`
turns, each with its content tail-truncated to `max_chars`.
Defensive: drops tool turns at the slice head, which are orphaned
without their assistant-with-tool_calls anchor and would be
rejected by strict chat templates (the same gotcha PHASE0 §6
warned about for user/user).
- Context:to_messages(opts): opts.compress = { keep_turns,
max_turn_chars } swaps the turn iterable for the compressed
view. Affects BOTH the use_tool_role=true path and the
use_tool_role=false fallback (PHASE2.md Q18 strict-template
workaround). Persistence and the :history display still see the
full, uncompressed ctx.turns.
- repl.lua ask_ai: when req_cfg (the routed model's cfg) has
`local_compress = true`, build compress_opts from
config.context.compress (defaults: keep_turns=2, max_turn_chars=800)
and pass it to ctx:to_messages alongside the existing
system_prompt_override (#86); the two opts are orthogonal and
compose.
- Norris unaffected: safety.norris_step builds its own messages
array; the planner needs full history per PHASE3 design.
- config.lua gains a header comment explaining the per-model opt-in
+ the context.compress defaults block + the documented tool-turn
truncation trade-off.
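
A minimal sketch of the slicing behavior described for _compress_turns, not the code in context.lua. It assumes turns are tables with `role` and `content` fields, and that "tail-truncated" means cutting everything after the first `max_chars` bytes; both are assumptions, as is the standalone function name `compress_turns`.

```lua
-- Hypothetical stand-in for context.lua's _compress_turns.
-- Assumed turn shape: { role = "user"|"assistant"|"tool", content = "..." }.
local function compress_turns(turns, keep, max_chars)
  -- Index of the first turn in the "last `keep` turns" slice.
  local first = math.max(1, #turns - keep + 1)

  -- Tool turns at the head of the slice lost their assistant anchor
  -- when the history was cut, so skip past them.
  while first <= #turns and turns[first].role == "tool" do
    first = first + 1
  end

  local out = {}
  for i = first, #turns do
    -- Shallow-copy each turn so the caller's list is never mutated.
    local copy = {}
    for k, v in pairs(turns[i]) do
      copy[k] = v
    end
    -- Assumption: truncation drops the end of the content. Note that
    -- `#` and string.sub count bytes, not UTF-8 characters.
    if type(copy.content) == "string" and #copy.content > max_chars then
      copy.content = copy.content:sub(1, max_chars)
    end
    out[#out + 1] = copy
  end
  return out
end
```

With keep = 2, a history ending in assistant, tool, user yields only the user turn, because the tool turn at the slice head is dropped; with keep = 3 the assistant anchor is inside the slice and the tool turn survives.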
13 unit cases verified, including:
- no opts -> full turn list (no regression)
- keep_turns=2 -> exactly last 2 emitted
- long content tail-truncated to max_chars
- self.turns unchanged after render
- orphan tool-turn at slice head dropped (no chat-template violation)
- tool turn included WITH its assistant anchor when keep_turns >= 3
E2E against live local broker:
- models.fast.local_compress = true; keep_turns=1; max_turn_chars=200
- 4-turn session: each broker call sees ONLY the current turn
(verified by short, coherent CMD replies despite no cross-turn
memory being available to the model). FR-promised small-model
friendliness in action; the loss of conversation continuity is
the documented trade-off.
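
For reference, the E2E setup above might correspond to a config fragment along these lines. Only `local_compress`, `context.compress`, `keep_turns`, and `max_turn_chars` come from this commit; the surrounding layout (a `models` table holding a `fast` preset) is an assumption inferred from the `models.fast` name, and the 200-character limit is taken to be `max_turn_chars`.

```lua
-- Sketch only: the overall shape of the config table is assumed.
local config = {
  models = {
    fast = {
      local_compress = true, -- per-model opt-in to compression
    },
  },
  context = {
    compress = {
      keep_turns = 1,       -- E2E value; the default is 2
      max_turn_chars = 200, -- E2E value; the default is 800
    },
  },
}
```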
Regression: test_safety 87/87, test_router_model 31/31, repl loads.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -1007,6 +1007,20 @@ function M.run(config)
             and config.routing.grammars
             and req_class
             and config.routing.grammars[req_class]
+        -- #87: route-aware context compression. When the routed model
+        -- preset has `local_compress = true`, ctx:to_messages keeps only
+        -- the last N turns and tail-truncates oversized content for
+        -- THIS request. Cloud routes (model_cfg.local_compress nil/false)
+        -- get the full context unchanged. Defaults from cfg.context.compress;
+        -- per-model opt-in keeps the design surface predictable.
+        local compress_opts
+        if req_cfg and req_cfg.local_compress then
+            local cc = (config.context and config.context.compress) or {}
+            compress_opts = {
+                keep_turns = cc.keep_turns or 2,
+                max_turn_chars = cc.max_turn_chars or 800,
+            }
+        end
 
         local depth = 0
         local final_resp = ""
@@ -1017,7 +1031,10 @@ function M.run(config)
         local tool_calls_seen = {}
         local redact_mode = secrets_mode_for(req_cfg)
         local scrubbed_msgs = scrub_messages(
-            ctx:to_messages({ system_prompt_override = sys_override }),
+            ctx:to_messages({
+                system_prompt_override = sys_override,
+                compress = compress_opts,
+            }),
             redact_mode)
         -- Streaming rehydrator wraps the on_delta so the user sees real
         -- values; text_parts accumulates the REHYDRATED chunks so
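
The opt-in logic in the hunk above can be exercised in isolation. The sketch below lifts it into a hypothetical helper, `resolve_compress_opts` (not a function in repl.lua), to show why cloud routes need no special casing: in Lua, a table field assigned nil is simply absent, so `compress = compress_opts` with an unset `compress_opts` presumably leaves to_messages with only `system_prompt_override`.

```lua
-- Hypothetical helper mirroring the diff; only local_compress,
-- context.compress, keep_turns and max_turn_chars come from the commit.
local function resolve_compress_opts(config, req_cfg)
  -- No opt-in on the routed model preset: no compression for this request.
  if not (req_cfg and req_cfg.local_compress) then
    return nil
  end
  local cc = (config.context and config.context.compress) or {}
  return {
    keep_turns = cc.keep_turns or 2,
    max_turn_chars = cc.max_turn_chars or 800,
  }
end

-- Cloud route: the resolver returns nil, so the `compress` key never
-- exists in the opts table.
local cloud_opts = {
  system_prompt_override = "sys",
  compress = resolve_compress_opts({}, { local_compress = false }),
}

-- Local route with a partial override: unset fields fall back to defaults.
local local_opts = {
  system_prompt_override = "sys",
  compress = resolve_compress_opts(
    { context = { compress = { keep_turns = 1 } } },
    { local_compress = true }),
}
```

Because `0` is truthy in Lua, a configured `keep_turns = 0` is respected by the `or 2` fallback instead of being replaced by the default.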