aish/docs/PHASE5.md

# aish — Phase 5 Manifest

**Project:** aish — AI-augmented conversational shell
**Document:** Phase 5 Requirements, Architecture & Design Decisions
**Status:** Plan (review fold-in 2026-05-13 — callback signature, Norris suppression, cost defaults resolved)
**Date:** 2026-05-13

**Review fold-in (2026-05-13):**

R-B1. **Summary callback signature canonical**: the closure is
    `summarize_fn(prior_summary, evicted_turns) -> string | nil`.
    `prior_summary` is `nil` on the first ever summarize; otherwise
    the current `ctx.summary` string. `evicted_turns` is `nil` for
    the re-summarize-compress trigger (C1 resolution); otherwise the
    array of evicted turn tables. The closure dispatches:
      first-time:  prior=nil,    evicted=[...]   → "summarize these turns"
      additive:    prior=str,    evicted=[...]   → "extend the prior summary"
      compress:    prior=str,    evicted=nil     → "compress the prior summary"

R-C2. **Routing taken once per ask_ai**: the model decision happens
    on entry to `ask_ai`. The chosen `active_cfg` is used for every
    iteration of the tool-call sub-loop. Original `active_cfg` is
    restored after `ask_ai` returns. NOT per-broker-call.

R-C3. **AUTO-routing does NOT fire inside Norris**: `run_norris`
    operates on a fixed model (whatever the user set via `:model`
    before launching). The auto-router would otherwise switch models
    mid-plan, which loses planning continuity and costs tokens
    rebuilding context. State explicit in §4 + §10.

R-C4. **Summary block suppressed under Norris**: mirrors Phase 4
    R-C1 ([background] suppression). Both blocks are "earlier context"
    the planner generally doesn't need mid-iteration. §6 + §3 reflect.

R-C5. **Fallback pattern coverage**:
    - Add `HTTP 408` to §5 patterns (Q41 moves from open to resolved).
    - Add `Operation timed out` (curl variant of "Timeout was reached").
    - Drop "HTTP response code said error" from A2 — FAILONERROR was
      removed in Phase 4 commit `f26cbd9`, this shape no longer fires.

NITs folded:
  N1. `:route check <text>` always runs the heuristic regardless of
      `cfg.routing.auto` — debug aid surfaces the class + would-be
      model + "(routing currently disabled)" suffix when auto is off.
  N2. **`reasoning → nil` by default** — the v1 heuristic that maps
      "explain" / "why" / "how does" to a model is too aggressive
      paired with `nil = keep current` semantics. User must
      EXPLICITLY map `routing.classes.reasoning = "cloud"` to send
      reasoning prompts to paid API. Same cost-safety rationale as
      `cfg.routing.auto = false`.
  N3. "Retry only when no deltas have arrived" promoted to normative
      rule in §5 (was in §11 risk row).
  N4. Config key renamed `cfg.routing.cloud_fallback` →
      `cfg.routing.fallback` to align with the `:fallback` meta verb.
      Single-source naming.

**Analyze findings (2026-05-13):**

**Analyze findings (2026-05-13):**

A1. **router.lua surface clean** — already a pure-Lua module with
    `M.classify(line, config) -> (kind, payload)`. Adding
    `M.classify_model(text, cfg) -> name | nil` is a natural sibling.
    No structural refactor.

A2. **broker error message shapes** all carry transport-stage prefixes
    that the fallback matcher must account for. The actual shapes
    callers see:
      "transport: HTTP %d%d%d: <body-snippet>"   (from post_sse status>=400)
      "transport: Timeout was reached"
      "transport: Couldn't resolve host"
      "transport: Connection refused"
      "transport: HTTP response code said error"  (rare; from FAILONERROR)
      "api: <error.message>"                      (SSE-framed error envelope)
      "broker: model_cfg.endpoint and .model required" (config bug)
    Fallback patterns in §5 should match against the "transport: "
    prefix explicitly. "api: ..." errors don't fall back (they're
    semantic — bad request shape, not server failure). "broker: ..."
    errors don't fall back either (config bug).

A3. **Q38 resolved at analyze** — placing the rolling summary as
    `turns[1]` with `role:"system"` would produce system/system
    back-to-back in to_messages output (msg[1] is the composed
    system prompt; msg[2] would be the summary as another system
    message). Strict templates may reject this same way they reject
    user/user (PHASE0 §6). Resolution: render the summary INSIDE the
    composed system message (same pattern as the [background] and
    NORRIS blocks). Storage stays simple — keep `_summary` text on
    `ctx.summary` (NOT in `ctx.turns`), append to the system prompt
    in `to_messages` alongside the [background] and NORRIS blocks.
    §6 + §3 reflect.

PHASE0 is the locked substrate; PHASE1-4 are layered on top. This manifest
specifies what Phase 5 adds — **multi-model routing**, **cloud fallback**,
and **context summarization on eviction**.

---

## 1. Scope of Phase 5

Three pillars per PHASE0.md §11 row 5:

1. **Multi-model routing by task type** — `router.lua` extended with a
   per-request `classify_model(text, cfg)` that suggests a model
   preset based on lightweight heuristics over the user input.
   Opt-in via `cfg.routing.auto = true`; default off (explicit `:model`
   stays the only switch).

2. **Cloud fallback on local failure** — when the active broker call
   returns `nil, err` for a transport reason that looks like
   "local backend down" (HTTP 502 / 503 / 404 model-not-found /
   libcurl connection-refused / timeout), automatically retry once
   against the configured `cloud` preset, surfacing a status line so
   the user knows what happened. Opt-in via `cfg.routing.cloud_fallback = true`;
   default off (single-shot only).

3. **Context summarization on eviction** — when
   `context.enforce_budget()` would evict the oldest turn pair, instead
   send those turns to the `fast` model (or `cfg.context.summarizer_model`)
   with "summarize these turns in 2-3 sentences", then replace them
   with a synthetic `role:"system"`-adjacent turn carrying the summary.
   Subsequent evictions append to or re-summarize the rolling summary.
   Opt-in via `cfg.context.summarize_on_evict = true`; default off
   (Phase 0 silent eviction stays the default).

**Phase 5 is done when:**

- With `cfg.routing.auto = true`, a prompt like "explain this Python
  traceback ..." gets routed to `deep` while "ls /tmp" or "what time
  is it?" stays on `fast` — visible status `[aish] routed to deep`.
- With `cfg.routing.cloud_fallback = true`, killing the local
  llama.cpp upstream and asking a question yields a single retry on
  the cloud preset + a status line.
- With `cfg.context.summarize_on_evict = true`, a long conversation
  that exceeds `max_turns` no longer silently drops history — the
  evicted span is summarized into a single rolling turn the model
  still sees.
- Existing configs without `cfg.routing` or `cfg.context.summarize_on_evict`
  behave exactly like Phase 4 (Phase 4 regression coverage).

---

## 2. Technology Decisions (delta from Phase 4)

| Decision | Choice | Rationale |
|---|---|---|
| Routing trigger | Per-request, in `repl.ask_ai`, BEFORE the broker call | Same hook point as the tool-sub-loop entry. Decision is one function call (`router.classify_model`) that returns the resolved (name, cfg) pair OR nil = keep current. |
| Classification mechanism | **Pure-Lua heuristics** in `router.classify_model` — keyword/length thresholds, no LLM call | Fast (no network), deterministic, debuggable. An LLM-based classifier is overkill v1; can be added in Phase 6+ if heuristics drift. |
| Routing classes (v1) | `code`, `reasoning`, `default` → mapped to model presets via `cfg.routing.classes` | Three classes for the first cut. **Defaults (N2 fold-in)**: `code → "deep"`, `reasoning → nil` (heuristic still fires but no override unless user maps it), `default → nil`. The aggressive `reasoning → "cloud"` default sent ordinary "why does ..." prompts to a paid API; user must opt in explicitly to pay for reasoning. Same cost-safety rationale as `cfg.routing.auto = false`. |
| Routing cost-safety | `cfg.routing.auto = false` default | Same rationale as `confirm_cmd = true` and `llm_second_opinion = true`: a default-on routing maps "explain ..." prompts to whatever class maps to `"cloud"`, spending paid-API tokens on prompts the user typed for what they thought was their local model. Default off; user opts in. |
| Fallback trigger | Transport-error pattern match against `err` string — HTTP 5xx, model_not_found, "Connection refused", "Couldn't resolve host", "Timeout was reached" | These are the four shapes the broker actually emits. Library-error patterns are stable enough that string-match is fine for v1. |
| Fallback target | `cfg.routing.fallback_model` (default `"cloud"` when present) | One-hop fallback only; if cloud also fails, surface the error normally. No retry loops. |
| Fallback timing | **Only retry when no deltas have arrived yet** (N3 fold-in) | If the local broker emits partial text then 5xx's mid-stream, the user has seen prose; retrying via cloud would duplicate the prefix and confuse the user. The retry path checks an `any_delta` flag in the on_delta callback; only retries when false. |
| Fallback announcement | Status line `[aish] local <name> failed (<reason>); retrying via <fallback_name>` | Visibility — user always knows when a fallback fired. |
| Summarize trigger | Inside `context.enforce_budget()`, when it would otherwise `table.remove` | Same place the eviction status fires. The summarize is a *replacement* not an addition; total turn count stays bounded. |
| Summary turn shape | Single rolling `{role = "system", content = "[earlier conversation]\n<summary>", _summary = true}` turn at index 1 (after the system prompt) | One synthetic turn carries all evicted history. New evictions either *append* to it (cheap) or trigger a re-summarize when the summary itself exceeds a char cap (default 2000). |
| Summary model | `cfg.context.summarizer_model` (default `"fast"`) | Same pattern as `cfg.memory.summarizer_model`. Fast model is cheap enough to summarize on every eviction. |
| Summary failure handling | If broker returns nil, fall back to *silent eviction* (Phase 0 behavior) and status-log once. Don't block the user's main request. | Best-effort; never let summarization break the REPL. |

---

## 3. Module Changes

| File | State after Phase 4 | Phase 5 changes |
|---|---|---|
| `router.lua` | `classify(line, config)` → `(kind, payload)` for shell/AI/meta dispatch | Add `M.classify_model(text, cfg) -> name | nil`. Heuristics: line length > N, presence of code-fence backticks, keywords like "traceback", "stacktrace", "explain", "why does", etc. Returns the model NAME (string) or nil = keep current. |
| `context.lua` | turns + memory_items + Norris suffix | Extend `enforce_budget()` to invoke a callback (passed via `Context.new(opts.summarize_fn)`) when about to evict. Store the returned summary as `ctx.summary` (string) — NOT a turn (A3 — avoids system/system alternation). `to_messages` composes it into the system message alongside `[background]` and NORRIS, between them: `system → [background] → [earlier summary] → NORRIS`. New evictions append to `ctx.summary`; when its length exceeds `max_summary_chars` (default 2000), the callback is invoked AGAIN with `(prior_summary, new_evicted_turns)` to re-summarize. Silent eviction is the fallback when the callback returns nil. |
| `repl.lua` | tool-sub-loop + meta + memory injection | (a) Pre-broker hook: if `cfg.routing.auto`, call `router.classify_model(text, cfg)` and switch `active_cfg` for THIS request only (revert after). (b) Post-broker error hook: if err matches a fallback pattern AND `cfg.routing.cloud_fallback`, retry against the fallback model once. (c) Wire `Context.new` with a `summarize_fn = function(turns) ... end` closure that calls `broker.chat(cfg.models[cfg.context.summarizer_model], ..., {max_tokens=300})`. |
| `broker.lua` | streaming + opts.tools/max_tokens/timeout_ms | Unchanged — Phase 5 composes on top of the existing surface. |
| `config.lua` | example with mcp/safety/memory blocks | Add commented-out `routing = {...}` and `context.summarize_on_evict = true` example. |

No new module files. All Phase 5 functionality grows existing files —
mostly `repl.lua` and `router.lua`.

---

## 4. Routing Heuristics (v1)

`router.classify_model(text, cfg)` returns a model NAME (looked up in
`cfg.routing.classes`) or `nil` (use the user-set active model).

Heuristics, in order — first hit wins:

1. **Code class** if any of:
   - Triple-backtick code fence anywhere
   - Token "traceback" / "stacktrace" / "stack trace" (case-insensitive)
   - Token "error:" or "exception:" near beginning
   - Text contains a path-like `./|/usr|~/` + `.py|.lua|.c|.js|.go|.rs`
   - More than 4 lines AND has indentation (looks like a paste)

2. **Reasoning class** if any of:
   - Token "explain" / "why" / "how does" / "compare"
   - Question mark + > 100 chars total

3. **Default class** otherwise.

Each class maps to a model name via `cfg.routing.classes`:

```lua
routing = {
    auto = true,
    classes = {
        code      = "deep",     -- code questions to deep
        reasoning = "cloud",    -- reasoning to cloud (best quality)
        default   = nil,        -- nil = keep current active model
    },
    cloud_fallback = true,
    fallback_model = "cloud",
}
```

When `auto = false`, `classify_model` returns nil always — equivalent to
not setting a routing block. The heuristic functions live behind the
flag.

---

## 5. Cloud Fallback Flow

In `repl.ask_ai` after the broker call:

```lua
local ok, err = broker.chat_stream(active_cfg, msgs, on_delta, opts)
if not ok and should_fallback(err, cfg) then
    renderer.status(("local %s failed (%s); retrying via %s")
                    :format(active_name, fallback_reason(err),
                            cfg.routing.fallback_model))
    local fb_cfg = cfg.models[cfg.routing.fallback_model]
    if fb_cfg then
        ok, err = broker.chat_stream(fb_cfg, msgs, on_delta, opts)
    end
end
```

`should_fallback(err, cfg)` matches `err` against fallback patterns
ONLY when `cfg.routing.cloud_fallback == true`. Otherwise returns false.

### Fallback-eligible error patterns

All patterns match against the err string AS IT ARRIVES from broker.lua,
which is prefixed `"transport: "` for libcurl/HTTP issues (A2 confirmed).
The matcher strips the prefix before testing.

| Pattern (after prefix strip) | Meaning |
|---|---|
| `^HTTP 5%d%d` | server-side error (502 Bad Gateway, 503 Unavailable, 504 Timeout) |
| `^HTTP 404.*model_not_found` | the routed model isn't loaded on the local backend |
| `^HTTP 408` | Request Timeout (gateway-level; some proxies emit this — Q41 resolved) |
| `Couldn'?t resolve host` | DNS / unreachable local broker |
| `Connection refused` | broker not listening |
| `Timeout was reached` | libcurl's internal timeout phrasing |
| `Operation timed out` | curl variant of timeout (libcurl version-dependent) |

Errors NOT matched (NOT retried):
- HTTP 401 / 403 (auth failure — won't get better on cloud)
- HTTP 400 (bad request — schema issue)
- `^api:` errors (semantic — bad request shape)
- `^broker:` errors (config bug — endpoint/model missing)
- Lua-level errors (broker pipeline bug, not transport)

---

## 6. Context Summarization on Eviction

`Context.new(opts)` accepts an optional `summarize_fn(turns) -> string |
nil` closure. When set AND `enforce_budget` would evict, the callback
is invoked with the evicted slice; the returned summary (if non-nil)
replaces the rolling summary turn.

### Storage shape (post-A3 resolution)

The rolling summary lives on `ctx.summary` (a string), NOT in `ctx.turns`:

```lua
ctx.summary = "Earlier conversation: user discussed X, asked about Y, "
           .. "agreed to Z. Later asked..."
```

`to_messages()` composes it into the system message between `[background]`
and the NORRIS suffix:

```
DEFAULT_SYSTEM_PROMPT

[background] (memory items)
- (fact) ...

[earlier conversation summary]
<ctx.summary>

[NORRIS MODE] (if active)
...
```

No new role:"system" message at turns[1] — avoids system/system alternation.

### Summary update flow

1. enforce_budget identifies the oldest 2 turns to evict (user + assistant).
2. If `summarize_fn` is set, call it with `(prior_summary, evicted_turns)`.
3. If summary text returned:
   - Replace `ctx.summary` with the new text.
   - If `#ctx.summary > max_summary_chars` (default 2000), invoke the
     callback once more with `(ctx.summary, {})` to re-summarize for
     compactness. Lossy by design — Q40 documents this trade-off.
4. Remove the evicted turns from `ctx.turns`.
5. If callback returned nil → silent eviction; `ctx.summary` unchanged.

### Failure handling

Inside the callback (in `repl.lua`):

```lua
local summary, err = broker.chat(summarizer_cfg, {
    {role="system", content="Summarize the following conversation in 2-3 sentences."},
    {role="user",   content=render_turns_compact(evicted)},
}, {max_tokens=300, timeout_ms=30000})
return summary  -- nil propagates; context.lua falls back to silent eviction
```

---

## 7. Meta Commands (Phase 5 additions)

| Command | Action |
|---|---|
| `:route on` / `:route off` | Toggle `cfg.routing.auto` at runtime (overrides config) |
| `:route classes` | Show the active class → model mapping |
| `:route check <text>` | Print which class a given text would be routed to (debug aid) |
| `:fallback on` / `:fallback off` | Toggle `cfg.routing.cloud_fallback` at runtime |

`:help` updated.

---

## 8. Migration from Phase 4

User-visible:
- New `:route` and `:fallback` meta commands.
- With `cfg.routing.auto`, the active model may CHANGE per-request as
  the heuristic fires. Prompt color tag could vary (Phase 6 maybe).
- With `cfg.context.summarize_on_evict`, eviction now spends a fast-
  model round-trip instead of silently dropping turns.

Existing configs without `routing` or `context.summarize_on_evict`
continue exactly as Phase 4 — defaults are OFF.

Substrate (PHASE0.md §3) invariants: unchanged. The `CMD:` extraction
marker, `cd` interception, and the entire system-prompt suffix order
from Phase 4 stay the same.

---

## 9. Out of Scope (Phase 5)

Per PHASE0.md §11 these belong to Phase 6:
- Tree-sitter syntax highlighting hooks
- Diff-aware code injection
- Project-level context (file tree summary)

Specifically out of Phase 5:
- LLM-based classification (heuristics-only v1).
- Multi-hop fallback chains (one retry only).
- Per-class temperature overrides (use the model preset's default).
- Cost accounting for cloud calls (Q-list candidate).
- Auto-router learning from user `:model` overrides (Phase 6+).

---

## 10. Open Questions

| # | Question | Impact | Resolve by |
|---|---|---|---|
| Q37 | Should routing apply to `:ask <text>` (explicit AI route) the same way it does to bare prompts? Yes seems obvious but worth documenting. | repl.lua | Phase 5 (plan) |
| Q38 | ~~Summary turn placement: index 1 vs index 0~~ | context.lua | **Resolved at analyze (A3)**: NEITHER — summary lives on `ctx.summary` (string) and composes into the SYSTEM MESSAGE alongside [background] and NORRIS suffix. No new role:"system" message; no alternation risk. |
| Q39 | ~~Fallback under Norris~~ | repl.lua + safety.lua | **Resolved at review (R-C3)**: AUTO-routing does NOT fire inside the Norris loop. The model is fixed at `:norris <goal>` launch time; the planner stays on it for every iteration. Per-iteration fallback (if a local broker call inside Norris fails) is still gated by `cfg.routing.fallback`; that retries the failed call against cloud but doesn't permanently switch the planner. |
| Q40 | Summarizer recursion: the summary itself might be summarized later when it grows past max_summary_chars. Does the re-summarize lose fidelity? Probably yes; acceptable trade-off. Note the lossy-by-design contract in §6. | context.lua | Phase 5 (verify) |
| Q41 | ~~HTTP 408 / Operation timed out eligibility~~ | repl.lua | **Resolved at review (R-C5)**: both added to §5 patterns. |
| Q42 | Auto-router decisions inside the tool-call sub-loop: does each sub-iteration re-classify, or does the first user turn fix the model for the whole sub-loop? Proposal: fix at sub-loop entry — model switching mid-tool-call would confuse the model AND cost tokens by rebuilding context. | repl.lua | Phase 5 (plan) |

---

## 11. Implementation Plan (commit-by-commit)

Five commits expected:

1. **`router.lua` — `classify_model`.** Pure-Lua heuristics; no IO. Returns
   model name or nil. Module-local pattern set so tests can introspect.
   **Test in isolation**: ~30-case corpus of (input → expected class).

2. **`context.lua` — eviction callback.** Add `opts.summarize_fn`,
   `_summary` index-1 turn convention, `to_messages()` rendering
   (which Just Works since `_summary` turns have `role` + `content`).
   **Test in isolation**: mock summarize_fn returning "(summary N)",
   build a context that exceeds budget, verify the summary turn
   appears and accumulates.

3. **`repl.lua` — fallback + routing wiring.** Pre-broker
   classify_model hook (gated by cfg.routing.auto); post-error
   fallback retry (gated by cfg.routing.cloud_fallback); wire
   summarize_fn at Context.new time. **Test against hossenfelder**:
   prompt classified as "code" → routes to deep; deliberately
   misconfigure local endpoint → fallback fires.

4. **`:route` and `:fallback` meta commands.** Standalone — config
   toggles via runtime cmds. **End-to-end**: boot, `:route on`,
   issue a query, observe routing status; `:route off`, query
   again, no routing.

5. **`config.lua` — routing + summarize_on_evict example.**
   Documentation-only; commented-out example block. Final commit.

### Risk / non-obvious

- **Heuristic false positives**: a normal conversational question
  containing the word "explain" gets routed to cloud. Conservative
  defaults (`reasoning → nil` by default? then user opts in
  explicitly per class) might be safer. Default mapping in §4 is
  aggressive; tone down at plan if user prefers.
- **Active-model state after routing**: the per-request routing
  switches `active_cfg` momentarily. The `prompt()` function reads
  `active_name` which IS reverted post-request, so the prompt label
  stays accurate.
- **Fallback during streaming**: if the local broker fails MID-stream
  (e.g. emits some text then 5xx), the user has already seen partial
  text. Retrying via cloud means duplicated prefix. v1 only retries
  on errors BEFORE any deltas arrived (we can detect by tracking
  whether on_delta was called).
- **Summarize during Norris**: Norris's planning loop generates many
  turns. Eviction during Norris means summarizing mid-plan — the
  model loses context about its earlier steps. Risky. v1 disables
  summarize when ctx.norris_active.
- **Memory items + summary turn**: both are dynamic system-context
  additions. The summary is `role:"system"` in turns[1]; memory
  is the `[background]` block in the actual system message.
  Compatible — no overlap.

---

*End of Phase 5 Manifest — aish*