From 4453b93ab586e66b89e87d5cee1b078375be50bb Mon Sep 17 00:00:00 2001
From: Markus Fritsche <mfritsche@reauktion.de>
Date: Wed, 13 May 2026 11:11:26 +0000
Subject: [PATCH] =?UTF-8?q?docs/PHASE5:=20formulate=20=E2=80=94=20multi-mo?=
 =?UTF-8?q?del=20routing=20+=20cloud=20fallback=20+=20summarize-on-evict?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Phase 5 formulate manifest. Three pillars per PHASE0 §11 row 5:
heuristic-based per-request model routing, single-hop cloud fallback
on local transport failure, and fast-model summarization at sliding-
window eviction time.

Resolutions baked in via §2:
  - Routing trigger: per-request in repl.ask_ai, gated by
    cfg.routing.auto (default off)
  - Classification: pure-Lua heuristics (length, keywords, code-fence
    detection, exception markers) — no LLM probe in v1
  - Classes: code → deep, reasoning → cloud, default → keep active
  - Fallback trigger: string-match on err for HTTP 5xx /
    model_not_found / "Connection refused" / DNS / timeout
  - Fallback: one retry against cfg.routing.fallback_model (default
    "cloud" if configured); status line on every retry
  - Summarize: enforce_budget invokes summarize_fn callback wired
    by repl.lua to broker.chat with the fast model
  - Summary turn: single rolling _summary at turns[1], appended to
    on each eviction, re-summarized when it exceeds max_summary_chars

Open questions (Q37-Q42) in §10:
  Q37 routing for :ask explicit ask
  Q38 summary turn vs system-role alternation
  Q39 fallback under Norris (proposal: single-request only)
  Q40 summary re-summarize fidelity loss (lossy by design)
  Q41 HTTP 408 pattern eligibility (default yes)
  Q42 routing inside tool-call sub-loop (proposal: fix at entry)

5-commit roadmap in §11. No new module files; mostly repl.lua and
router.lua growth.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 docs/PHASE5.md | 329 +++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 329 insertions(+)
 create mode 100644 docs/PHASE5.md

diff --git a/docs/PHASE5.md b/docs/PHASE5.md
new file mode 100644
index 0000000..fae4f8c
--- /dev/null
+++ b/docs/PHASE5.md
@@ -0,0 +1,329 @@
+# aish — Phase 5 Manifest
+
+**Project:** aish — AI-augmented conversational shell
+**Document:** Phase 5 Requirements, Architecture & Design Decisions
+**Status:** Formulate (pre-analyze)
+**Date:** 2026-05-13
+
+PHASE0 is the locked substrate; PHASE1-4 are layered on top. This manifest
+specifies what Phase 5 adds — **multi-model routing**, **cloud fallback**,
+and **context summarization on eviction**.
+
+---
+
+## 1. Scope of Phase 5
+
+Three pillars per PHASE0.md §11 row 5:
+
+1. **Multi-model routing by task type** — `router.lua` extended with a
+   per-request `classify_model(text, cfg)` that suggests a model
+   preset based on lightweight heuristics over the user input.
+   Opt-in via `cfg.routing.auto = true`; default off (explicit `:model`
+   stays the only switch).
+
+2. **Cloud fallback on local failure** — when the active broker call
+   returns `nil, err` for a transport reason that looks like
+   "local backend down" (HTTP 502 / 503 / 404 model-not-found /
+   libcurl connection-refused / timeout), automatically retry once
+   against the configured `cloud` preset, surfacing a status line so
+   the user knows what happened. Opt-in via `cfg.routing.cloud_fallback = true`;
+   default off (single-shot only).
+
+3. **Context summarization on eviction** — when
+   `context.enforce_budget()` would evict the oldest turn pair, instead
+   send those turns to the `fast` model (or `cfg.context.summarizer_model`)
+   with "summarize these turns in 2-3 sentences", then replace them
+   with a synthetic `role:"system"`-adjacent turn carrying the summary.
+   Subsequent evictions append to or re-summarize the rolling summary.
+   Opt-in via `cfg.context.summarize_on_evict = true`; default off
+   (Phase 0 silent eviction stays the default).
+
+**Phase 5 is done when:**
+
+- With `cfg.routing.auto = true`, a prompt like "explain this Python
+  traceback ..." gets routed to `deep` while "ls /tmp" or "what time
+  is it?" stays on `fast` — visible status `[aish] routed to deep`.
+- With `cfg.routing.cloud_fallback = true`, killing the local
+  llama.cpp upstream and asking a question yields a single retry on
+  the cloud preset + a status line.
+- With `cfg.context.summarize_on_evict = true`, a long conversation
+  that exceeds `max_turns` no longer silently drops history — the
+  evicted span is summarized into a single rolling turn the model
+  still sees.
+- Existing configs without `cfg.routing` or `cfg.context.summarize_on_evict`
+  behave exactly like Phase 4 (Phase 4 regression coverage).
+
+---
+
+## 2. Technology Decisions (delta from Phase 4)
+
+| Decision | Choice | Rationale |
+|---|---|---|
+| Routing trigger | Per-request, in `repl.ask_ai`, BEFORE the broker call | Same hook point as the tool-sub-loop entry. Decision is one function call (`router.classify_model`) that returns the resolved (name, cfg) pair OR nil = keep current. |
+| Classification mechanism | **Pure-Lua heuristics** in `router.classify_model` — keyword/length thresholds, no LLM call | Fast (no network), deterministic, debuggable. An LLM-based classifier is overkill v1; can be added in Phase 6+ if heuristics drift. |
+| Routing classes (v1) | `code`, `reasoning`, `default` → mapped to model presets via `cfg.routing.classes` | Three is enough for the first cut. The defaults map `code → deep`, `reasoning → cloud`, `default → cfg.default_model`. User can remap. |
+| Fallback trigger | Transport-error pattern match against `err` string — HTTP 5xx, model_not_found, "Connection refused", "Couldn't resolve host", "Timeout was reached" | These are the four shapes the broker actually emits. Library-error patterns are stable enough that string-match is fine for v1. |
+| Fallback target | `cfg.routing.fallback_model` (default `"cloud"` when present) | One-hop fallback only; if cloud also fails, surface the error normally. No retry loops. |
+| Fallback announcement | Status line `[aish] local <name> failed (<reason>); retrying via <fallback_name>` | Visibility — user always knows when a fallback fired. |
+| Summarize trigger | Inside `context.enforce_budget()`, when it would otherwise `table.remove` | Same place the eviction status fires. The summarize is a *replacement* not an addition; total turn count stays bounded. |
+| Summary turn shape | Single rolling `{role = "system", content = "[earlier conversation]\n<summary>", _summary = true}` turn at index 1 (after the system prompt) | One synthetic turn carries all evicted history. New evictions either *append* to it (cheap) or trigger a re-summarize when the summary itself exceeds a char cap (default 2000). |
+| Summary model | `cfg.context.summarizer_model` (default `"fast"`) | Same pattern as `cfg.memory.summarizer_model`. Fast model is cheap enough to summarize on every eviction. |
+| Summary failure handling | If broker returns nil, fall back to *silent eviction* (Phase 0 behavior) and status-log once. Don't block the user's main request. | Best-effort; never let summarization break the REPL. |
+
+---
+
+## 3. Module Changes
+
+| File | State after Phase 4 | Phase 5 changes |
+|---|---|---|
+| `router.lua` | `classify(line, config)` → `(kind, payload)` for shell/AI/meta dispatch | Add `M.classify_model(text, cfg) -> name | nil`. Heuristics: line length > N, presence of code-fence backticks, keywords like "traceback", "stacktrace", "explain", "why does", etc. Returns the model NAME (string) or nil = keep current. |
+| `context.lua` | turns + memory_items + Norris suffix | Extend `enforce_budget()` to invoke a callback (passed in via `Context.new(opts.summarize_fn)`) when about to evict. If callback returns text, prepend a `{role="system", _summary=true, content=...}` turn (replacing prior `_summary` turn if present). If callback returns nil, fall back to silent eviction. The summarize callback itself lives in repl.lua (because it needs broker.chat). |
+| `repl.lua` | tool-sub-loop + meta + memory injection | (a) Pre-broker hook: if `cfg.routing.auto`, call `router.classify_model(text, cfg)` and switch `active_cfg` for THIS request only (revert after). (b) Post-broker error hook: if err matches a fallback pattern AND `cfg.routing.cloud_fallback`, retry against the fallback model once. (c) Wire `Context.new` with a `summarize_fn = function(turns) ... end` closure that calls `broker.chat(cfg.models[cfg.context.summarizer_model], ..., {max_tokens=300})`. |
+| `broker.lua` | streaming + opts.tools/max_tokens/timeout_ms | Unchanged — Phase 5 composes on top of the existing surface. |
+| `config.lua` | example with mcp/safety/memory blocks | Add commented-out `routing = {...}` and `context.summarize_on_evict = true` example. |
+
+No new module files. All Phase 5 functionality grows existing files —
+mostly `repl.lua` and `router.lua`.
+
+---
+
+## 4. Routing Heuristics (v1)
+
+`router.classify_model(text, cfg)` returns a model NAME (looked up in
+`cfg.routing.classes`) or `nil` (use the user-set active model).
+
+Heuristics, in order — first hit wins:
+
+1. **Code class** if any of:
+   - Triple-backtick code fence anywhere
+   - Token "traceback" / "stacktrace" / "stack trace" (case-insensitive)
+   - Token "error:" or "exception:" near beginning
+   - Text contains a path-like `./|/usr|~/` + `.py|.lua|.c|.js|.go|.rs`
+   - More than 4 lines AND has indentation (looks like a paste)
+
+2. **Reasoning class** if any of:
+   - Token "explain" / "why" / "how does" / "compare"
+   - Question mark + > 100 chars total
+
+3. **Default class** otherwise.
+
+Each class maps to a model name via `cfg.routing.classes`:
+
+```lua
+routing = {
+    auto = true,
+    classes = {
+        code      = "deep",     -- code questions to deep
+        reasoning = "cloud",    -- reasoning to cloud (best quality)
+        default   = nil,        -- nil = keep current active model
+    },
+    cloud_fallback = true,
+    fallback_model = "cloud",
+}
+```
+
+When `auto = false`, `classify_model` returns nil always — equivalent to
+not setting a routing block. The heuristic functions live behind the
+flag.
+
+---
+
+## 5. Cloud Fallback Flow
+
+In `repl.ask_ai` after the broker call:
+
+```lua
+local ok, err = broker.chat_stream(active_cfg, msgs, on_delta, opts)
+if not ok and should_fallback(err, cfg) then
+    renderer.status(("local %s failed (%s); retrying via %s")
+                    :format(active_name, fallback_reason(err),
+                            cfg.routing.fallback_model))
+    local fb_cfg = cfg.models[cfg.routing.fallback_model]
+    if fb_cfg then
+        ok, err = broker.chat_stream(fb_cfg, msgs, on_delta, opts)
+    end
+end
+```
+
+`should_fallback(err, cfg)` matches `err` against fallback patterns
+ONLY when `cfg.routing.cloud_fallback == true`. Otherwise returns false.
+
+### Fallback-eligible error patterns
+
+| Pattern | Meaning |
+|---|---|
+| `HTTP 5%d%d` | server-side error (502 Bad Gateway, 503 Unavailable, 504 Timeout) |
+| `HTTP 404.*model_not_found` | the routed model isn't loaded on the local backend |
+| `Couldn'?t resolve host` | DNS / unreachable local broker |
+| `Connection refused` | broker not listening |
+| `Timeout was reached` | broker too slow |
+
+Errors NOT matched (and therefore NOT retried):
+- HTTP 401 / 403 (auth failure — won't get better on cloud)
+- HTTP 400 (bad request — schema issue)
+- Lua-level errors (broker pipeline bug, not transport)
+
+---
+
+## 6. Context Summarization on Eviction
+
+`Context.new(opts)` accepts an optional `summarize_fn(turns) -> string |
+nil` closure. When set AND `enforce_budget` would evict, the callback
+is invoked with the evicted slice; the returned summary (if non-nil)
+replaces the rolling summary turn.
+
+### Storage shape
+
+Index 1 of `ctx.turns` is reserved for the summary turn when present:
+
+```lua
+{ role = "system", _summary = true,
+  content = "[earlier conversation summary]\n<summary text>" }
+```
+
+`to_messages()` renders this normally (system role; model sees it).
+
+### Summary update flow
+
+1. enforce_budget identifies the oldest 2 turns to evict (user + assistant).
+2. If `summarize_fn` is set, call it with those 2 turns.
+3. If summary text returned:
+   - If `ctx.turns[1]._summary` exists, append to its content
+     (truncate at `max_summary_chars` default 2000 — re-summarize via
+     same callback if exceeded).
+   - Else insert a new `_summary` turn at index 1.
+4. Remove the evicted turns from `ctx.turns`.
+5. If callback returned nil → silent eviction (Phase 0 behavior).
+
+### Failure handling
+
+Inside the callback (in `repl.lua`):
+
+```lua
+local summary, err = broker.chat(summarizer_cfg, {
+    {role="system", content="Summarize the following conversation in 2-3 sentences."},
+    {role="user",   content=render_turns_compact(evicted)},
+}, {max_tokens=300, timeout_ms=30000})
+return summary  -- nil propagates; context.lua falls back to silent eviction
+```
+
+---
+
+## 7. Meta Commands (Phase 5 additions)
+
+| Command | Action |
+|---|---|
+| `:route on` / `:route off` | Toggle `cfg.routing.auto` at runtime (overrides config) |
+| `:route classes` | Show the active class → model mapping |
+| `:route check <text>` | Print which class a given text would be routed to (debug aid) |
+| `:fallback on` / `:fallback off` | Toggle `cfg.routing.cloud_fallback` at runtime |
+
+`:help` updated.
+
+---
+
+## 8. Migration from Phase 4
+
+User-visible:
+- New `:route` and `:fallback` meta commands.
+- With `cfg.routing.auto`, the active model may CHANGE per-request as
+  the heuristic fires. Prompt color tag could vary (Phase 6 maybe).
+- With `cfg.context.summarize_on_evict`, eviction now spends a fast-
+  model round-trip instead of silently dropping turns.
+
+Existing configs without `routing` or `context.summarize_on_evict`
+continue exactly as Phase 4 — defaults are OFF.
+
+Substrate (PHASE0.md §3) invariants: unchanged. The `CMD:` extraction
+marker, `cd` interception, and the entire system-prompt suffix order
+from Phase 4 stay the same.
+
+---
+
+## 9. Out of Scope (Phase 5)
+
+Per PHASE0.md §11 these belong to Phase 6:
+- Tree-sitter syntax highlighting hooks
+- Diff-aware code injection
+- Project-level context (file tree summary)
+
+Specifically out of Phase 5:
+- LLM-based classification (heuristics-only v1).
+- Multi-hop fallback chains (one retry only).
+- Per-class temperature overrides (use the model preset's default).
+- Cost accounting for cloud calls (Q-list candidate).
+- Auto-router learning from user `:model` overrides (Phase 6+).
+
+---
+
+## 10. Open Questions
+
+| # | Question | Impact | Resolve by |
+|---|---|---|---|
+| Q37 | Should routing apply to `:ask <text>` (explicit AI route) the same way it does to bare prompts? Yes seems obvious but worth documenting. | repl.lua | Phase 5 (plan) |
+| Q38 | Summary turn placement: index 1 (right after a putative system prompt — but to_messages prepends the system prompt fresh each call) vs index 0 in self.turns. Index 1 in self.turns means the system message comes from `to_messages()` and the `_summary` turn lives next as a `role:"system"` message. Strict templates may reject system/system back-to-back. | context.lua | Phase 5 (analyze) |
+| Q39 | When fallback fires AND the user is in Norris mode, does the fallback model also drive the planning loop? Or single-request-only fallback? v1 says single-request only (Norris stays on its configured model). | repl.lua + safety.lua | Phase 5 (plan) |
+| Q40 | Summarizer recursion: the summary itself might be summarized later when it grows past max_summary_chars. Does the re-summarize lose fidelity? Probably yes; acceptable trade-off. Note the lossy-by-design contract in §6. | context.lua | Phase 5 (verify) |
+| Q41 | Eligibility patterns for fallback: should HTTP 408 (Request Timeout, distinct from Timeout was reached) be matched? Some servers emit it. Default: yes — pattern `HTTP 408`. Phase 7 verify will adjust. | repl.lua | Phase 7 (verify) |
+| Q42 | Auto-router decisions inside the tool-call sub-loop: does each sub-iteration re-classify, or does the first user turn fix the model for the whole sub-loop? Proposal: fix at sub-loop entry — model switching mid-tool-call would confuse the model AND cost tokens by rebuilding context. | repl.lua | Phase 5 (plan) |
+
+---
+
+## 11. Implementation Plan (commit-by-commit)
+
+Five commits expected:
+
+1. **`router.lua` — `classify_model`.** Pure-Lua heuristics; no IO. Returns
+   model name or nil. Module-local pattern set so tests can introspect.
+   **Test in isolation**: ~30-case corpus of (input → expected class).
+
+2. **`context.lua` — eviction callback.** Add `opts.summarize_fn`,
+   `_summary` index-1 turn convention, `to_messages()` rendering
+   (which Just Works since `_summary` turns have `role` + `content`).
+   **Test in isolation**: mock summarize_fn returning "(summary N)",
+   build a context that exceeds budget, verify the summary turn
+   appears and accumulates.
+
+3. **`repl.lua` — fallback + routing wiring.** Pre-broker
+   classify_model hook (gated by cfg.routing.auto); post-error
+   fallback retry (gated by cfg.routing.cloud_fallback); wire
+   summarize_fn at Context.new time. **Test against hossenfelder**:
+   prompt classified as "code" → routes to deep; deliberately
+   misconfigure local endpoint → fallback fires.
+
+4. **`:route` and `:fallback` meta commands.** Standalone — config
+   toggles via runtime cmds. **End-to-end**: boot, `:route on`,
+   issue a query, observe routing status; `:route off`, query
+   again, no routing.
+
+5. **`config.lua` — routing + summarize_on_evict example.**
+   Documentation-only; commented-out example block. Final commit.
+
+### Risk / non-obvious
+
+- **Heuristic false positives**: a normal conversational question
+  containing the word "explain" gets routed to cloud. Conservative
+  defaults (`reasoning → nil` by default? then user opts in
+  explicitly per class) might be safer. Default mapping in §4 is
+  aggressive; tone down at plan if user prefers.
+- **Active-model state after routing**: the per-request routing
+  switches `active_cfg` momentarily. The `prompt()` function reads
+  `active_name` which IS reverted post-request, so the prompt label
+  stays accurate.
+- **Fallback during streaming**: if the local broker fails MID-stream
+  (e.g. emits some text then 5xx), the user has already seen partial
+  text. Retrying via cloud means duplicated prefix. v1 only retries
+  on errors BEFORE any deltas arrived (we can detect by tracking
+  whether on_delta was called).
+- **Summarize during Norris**: Norris's planning loop generates many
+  turns. Eviction during Norris means summarizing mid-plan — the
+  model loses context about its earlier steps. Risky. v1 disables
+  summarize when ctx.norris_active.
+- **Memory items + summary turn**: both are dynamic system-context
+  additions. The summary is `role:"system"` in turns[1]; memory
+  is the `[background]` block in the actual system message.
+  Compatible — no overlap.
+
+---
+
+*End of Phase 5 Manifest — aish*