
marfrit cf4d79dd9d docs/PHASE3: analyze + baseline — \C-n mechanics, LLM latency, module pre-state

Analyze findings folded into the manifest:

  A1. \C-n binding can't toggle mid-prompt without rl_insert_text /
      rl_redisplay. Solution: bind those (one cdef + 2 wrappers in
      ffi/readline.lua) so \C-n inserts ":norris " at the cursor; user
      types goal + Enter. Routes through existing meta dispatch.

  A2. broker has no max_tokens passthrough. Add opts.max_tokens for
      the LLM second-opinion path (terminates at ~2 tokens; verified
      proxy honors it).

  A3. Phase 2 tool-sub-loop pattern IS the planner shape. safety.norris_step
      is the per-iteration extraction; driver loop in repl.lua.
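A3 describes the planner as a driver loop around a per-iteration `safety.norris_step`. The following is a minimal sketch of that shape under stated assumptions, not the aish code. The argument names (`norris_step`, `is_destructive`, `confirm`, `run_command`), the `{ command = ..., done = ... }` step result, and the confirm-before-destructive rule are all hypothetical. Only the 16-step cap comes from the baseline's budget figure.

```lua
-- Hypothetical sketch of a Norris driver loop (the repl.lua side).
-- Collaborators are passed in so the loop can be exercised with stubs;
-- their real signatures in aish are not shown in this commit.
local MAX_STEPS = 16  -- step cap used in the baseline's latency budget

local function run_norris(goal, norris_step, is_destructive, confirm, run_command)
  local history = {}
  for step = 1, MAX_STEPS do
    -- One planner iteration: propose the next command given the goal and
    -- the results so far, or report completion with done = true.
    local proposal = norris_step(goal, history)
    if proposal.done then
      return true, history
    end
    local cmd = proposal.command
    -- Assumed rule: destructive commands run only after user confirmation.
    if is_destructive(cmd) and not confirm(cmd) then
      return false, history, "declined at step " .. step
    end
    history[#history + 1] = { command = cmd, output = run_command(cmd) }
  end
  return false, history, "step budget exhausted"
end
```

Keeping the per-step logic in a separate function, as A3 describes, is what lets the loop stay this small.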

Module-changes table (§3) updated with the rl_insert_text and
max_tokens rows.

Baseline doc (PHASE3-baseline.md, 80 lines) captures:
  - LLM second-opinion latency: 425-1162ms per probe, all 5 test
    cases correct. Worst-case 16-step Norris = ~20s overhead; with
    static-pattern fast-path + session cache, ~5s realistic.
  - Module pre-state at commit f26cbd9 (Phase 2 tip): LOC + state
    per file before Phase 3 edits.
  - Six static-pattern Lua-match sanity checks (all correct).
  - Carries: aish#15 (still open), aish#14, aish#32/#33.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-12 22:37:58 +00:00

4.3 KiB


Phase 3 Baseline — pre-implementation measurements

Date: 2026-05-12
Target probed: hossenfelder.fritz.box:8082 (OpenAI-compatible broker in front of a local qwen2.5-coder-1.5b-q4_k_m.gguf).

This is the Phase 7 (verify) anchor for Phase 3. It captures the state of the code and the broker just before the Norris/destructive-heuristic implementation lands.

1. LLM second-opinion latency (Q23 budget check)

Probe settings: fast preset, temperature=0, max_tokens=4, system prompt "Reply YES or NO only":

| Command | Reply | Latency |
|---|---|---|
| `rm -rf /tmp/foo` | YES | 1162 ms |
| `ls /tmp` | NO | 666 ms |
| `truncate -s 0 important.log` | YES | 475 ms |
| `git push --force origin main` | YES | 451 ms |
| `cat /etc/hostname` | NO | 425 ms |

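Five of five answers correct. Median 475 ms; maximum 1162 ms. Five samples are too few for a real 95th percentile, so ~1200 ms (the observed maximum, rounded up) serves as the worst-case figure below. The first request was the slowest (likely a cold cache); the rest settled below 700 ms.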
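The summary figures can be recomputed directly from the table. A small sketch in plain Lua 5.1+; `median` and `maximum` are helper names introduced here, not aish functions.

```lua
-- Probe latencies in ms, copied from the table above.
local samples = { 1162, 666, 475, 451, 425 }

-- Median of a list of numbers; sorts a copy so the input order is kept.
local function median(xs)
  local s = {}
  for i, v in ipairs(xs) do s[i] = v end
  table.sort(s)
  local n = #s
  if n % 2 == 1 then
    return s[math.floor((n + 1) / 2)]
  end
  local h = math.floor(n / 2)
  return (s[h] + s[h + 1]) / 2
end

-- Largest value in a non-empty list of numbers.
local function maximum(xs)
  local m = xs[1]
  for i = 2, #xs do
    if xs[i] > m then m = xs[i] end
  end
  return m
end

print(median(samples), maximum(samples))  -- prints 475 and 1162
```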

Budget implication for a 16-step Norris session

Worst case (no static-pattern hits, every query sent to the LLM, no cache): 16 × 1200 ms ≈ 19 s of additional latency over the Norris run.

With a realistic mix (static patterns catch the obvious cases without an LLM call, and repeated commands hit the session cache): ~5 s typical, dominated by genuinely novel command tokens.
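The budget arithmetic as a function: extra latency is the number of steps that fall through to the LLM times the per-call latency. `norris_overhead_ms` is a hypothetical helper. The second call's split (4 static hits, 2 cache hits) is purely illustrative, not a measurement.

```lua
-- Hypothetical helper: added latency for a Norris run in which every step
-- not caught by a static pattern or the session cache costs one LLM call.
local function norris_overhead_ms(steps, static_hits, cache_hits, per_call_ms)
  local llm_calls = steps - static_hits - cache_hits
  if llm_calls < 0 then llm_calls = 0 end
  return llm_calls * per_call_ms
end

print(norris_overhead_ms(16, 0, 0, 1200))  -- 19200: the ~19 s worst case
print(norris_overhead_ms(16, 4, 2, 475))   -- 4750: 10 calls at the median
```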

Conclusion: LLM second-opinion is workable as a default-on feature. The session-scoped cache (§12 commit #2) is the right mitigation; an additional async pre-check on the static patterns first means most calls never reach the LLM.
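The mitigation above implies a fixed order for each check: static patterns, then the session cache, then the LLM. A minimal synchronous sketch, assuming `is_destructive` takes the raw command string and `ask_llm` wraps a YES/NO probe like the one in section 1. `new_checker`, `ask_llm`, and the `stats` counters are hypothetical names. A static miss does not mean "safe", so it falls through to the cache and then to the LLM.

```lua
-- The three patterns exercised in section 3.
local STATIC_PATTERNS = {
  "rm%s+.-%-rf?",
  "git%s+push%s+.-%-%-force",
  "find%s+.-%-delete",
}

-- Returns an is_destructive(cmd) function plus counters showing which
-- layer answered each query.
local function new_checker(ask_llm)
  local cache = {}  -- session-scoped: command string -> boolean verdict
  local stats = { static = 0, cache = 0, llm = 0 }

  local function is_destructive(cmd)
    for _, pat in ipairs(STATIC_PATTERNS) do
      if cmd:match(pat) then
        stats.static = stats.static + 1
        return true
      end
    end
    local cached = cache[cmd]
    if cached ~= nil then
      stats.cache = stats.cache + 1
      return cached
    end
    stats.llm = stats.llm + 1
    local verdict = ask_llm(cmd)
    cache[cmd] = verdict
    return verdict
  end

  return is_destructive, stats
end
```

Caching both YES and NO verdicts is what keeps repeated safe commands (`ls`, `cat`) from costing a second LLM call.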

2. Module pre-state (Phase 2 head `f26cbd9` + cosmetic fix `3fa6279`)

| Module | LOC | State |
|---|---|---|
| `safety.lua` | 55 | `confirm_tool_call` only; `is_destructive` and `norris_step` raise `error()` |
| `renderer.lua` | 110 | exec frame + tool-call frame + assistant streaming + status; no Norris frames |
| `repl.lua` | (post-Phase 2) | tool-sub-loop + `:mcp` meta + `\C-n` no-op placeholder |
| `context.lua` | (post-Phase 2) | static `system_prompt` (Phase 0 + Phase 2 MCP block); no Norris suffix wiring |
| `broker.lua` | 96 | `chat_stream(cfg, msgs, on_delta, opts)` with `opts.tools`; no `opts.max_tokens` |
| `ffi/readline.lua` | (Phase 1) | `rl_bind_keyseq` + `M.bind` wrapper; no `rl_insert_text` or `rl_redisplay` |
| `config.lua` | (Phase 2) | `mcp` example block; no `safety` example block |

After Phase 3 lands, `git diff main..post-phase-3 --stat` should show:

- `safety.lua`: substantial growth (~150 LOC for `is_destructive` + `norris_step`)
- `renderer.lua`: modest growth (~30 LOC for Norris frames)
- `repl.lua`: modest growth (Norris driver + `:norris` meta)
- `context.lua`: one-line addition (system prompt suffix builder)
- `broker.lua`: 4-line addition (`opts.max_tokens`)
- `ffi/readline.lua`: 6-line addition (`rl_insert_text` + `rl_redisplay`)
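The `broker.lua` change amounts to copying one optional field into the request. A minimal sketch, assuming the broker builds a Lua table that is later JSON-encoded into an OpenAI-compatible chat-completions body (`model`, `messages`, `stream`, `temperature`, `tools`, `max_tokens`). `build_request`, `parse_yes_no`, `cfg.model`, and `cfg.temperature` are hypothetical names, and `"fast"` is a placeholder model value. Encoding and the HTTP/SSE transport are omitted.

```lua
-- Hypothetical request construction with the opts.max_tokens passthrough.
local function build_request(cfg, msgs, opts)
  opts = opts or {}
  local body = {
    model = cfg.model,
    messages = msgs,
    stream = true,
    temperature = cfg.temperature,  -- nil leaves the server default
  }
  if opts.tools then
    body.tools = opts.tools  -- existing Phase 2 passthrough
  end
  if opts.max_tokens then
    body.max_tokens = opts.max_tokens  -- new: sent only when the caller asks
  end
  return body
end

-- Map the model's short reply to a verdict: true for YES, false for NO,
-- nil for anything else, leaving the fallback to the caller.
local function parse_yes_no(reply)
  local word = reply:match("^%s*(%a+)")
  if not word then return nil end
  word = word:upper()
  if word == "YES" then return true end
  if word == "NO" then return false end
  return nil
end

-- The second-opinion probe from section 1.
local probe = build_request(
  { model = "fast", temperature = 0 },
  {
    { role = "system", content = "Reply YES or NO only" },
    { role = "user", content = "rm -rf /tmp/foo" },
  },
  { max_tokens = 4 }
)
```

Ordinary chat turns never set `opts.max_tokens`, so their request bodies are unchanged.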

3. Static-pattern hit-rate sanity check

Three patterns from §5 of the manifest, each exercised against one destructive and one safe command (six checks):

| Pattern | Test command | Expected | Result |
|---|---|---|---|
| `rm%s+.-%-rf?` | `rm -rf /tmp/x` | YES | HIT (pre-implementation Lua check) |
| `rm%s+.-%-rf?` | `rm /tmp/x.log` | NO | MISS (correct: no -r/-f flags) |
| `git%s+push%s+.-%-%-force` | `git push --force origin main` | YES | HIT |
| `git%s+push%s+.-%-%-force` | `git push origin main` | NO | MISS |
| `find%s+.-%-delete` | `find . -name '*.log' -delete` | YES | HIT |
| `find%s+.-%-delete` | `find . -name '*.log'` | NO | MISS |

All six results match the intent. Each was checked with Lua's string.match on the test string. The implementation in safety.is_destructive will use the same pattern syntax.
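The six checks can be re-run with any Lua 5.1+ interpreter. The loop asserts each expected result. `hits` is a helper name introduced here.

```lua
-- { pattern, test command, expected hit? } from the table above.
local cases = {
  { "rm%s+.-%-rf?",             "rm -rf /tmp/x",                true  },
  { "rm%s+.-%-rf?",             "rm /tmp/x.log",                false },
  { "git%s+push%s+.-%-%-force", "git push --force origin main", true  },
  { "git%s+push%s+.-%-%-force", "git push origin main",         false },
  { "find%s+.-%-delete",        "find . -name '*.log' -delete", true  },
  { "find%s+.-%-delete",        "find . -name '*.log'",         false },
}

-- Unanchored match: true if the pattern occurs anywhere in cmd.
local function hits(pattern, cmd)
  return string.match(cmd, pattern) ~= nil
end

for i, c in ipairs(cases) do
  local pattern, cmd, expected = c[1], c[2], c[3]
  local got = hits(pattern, cmd)
  print(string.format("%d  %-26s %-30s %s", i, pattern, cmd, got and "HIT" or "MISS"))
  assert(got == expected, "unexpected result for: " .. cmd)
end
```

Running the same patterns on a few more strings shows their edges. The `rm` pattern needs a hyphen immediately followed by `r`, so `rm -fr /tmp/x` misses, while `rm -f some-random-file` hits on `-random`. The `git` pattern does not catch the short flag in `git push -f origin main`.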

4. Known carries from earlier phases

- Issue #15: hossenfelder SSE buffering bug. Open. Affects Norris streaming visibility (the model's plan/explanation streams in one batch). Workaround: none on the aish side; the fix is upstream.
- Issue #14: a `:model` swap should re-render `Context.system_prompt`. Phase 3 makes this more relevant because the Norris suffix is composed dynamically: if the user runs `:model deep` and then `:norris <goal>`, the new system prompt must take effect on the next broker call.
- Issues #32 / #33: Phase 2 follow-ups (tool-name validation, auto_approve typo warning). Not blocking Phase 3.
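One way to satisfy issue #14 is to compose the system prompt from current session state on every broker call instead of caching a rendered string. A minimal sketch of that idea, not the `context.lua` code. The `session` fields, the handler names, the placeholder base prompt, and the suffix wording are all hypothetical.

```lua
-- Hypothetical session state; field names are illustrative only.
local session = {
  base_prompt = "You are aish, a shell assistant.",  -- placeholder text
  model = "fast",
  norris_goal = nil,
}

-- Rebuild the prompt from state each time; nothing is pre-rendered.
local function build_system_prompt(state)
  local parts = { state.base_prompt }
  if state.norris_goal then
    parts[#parts + 1] = "Norris mode on model " .. state.model
      .. ". Goal: " .. state.norris_goal
  end
  return table.concat(parts, "\n\n")
end

-- Meta-command handlers only mutate state.
local function on_model(name) session.model = name end         -- :model <name>
local function on_norris(goal) session.norris_goal = goal end  -- :norris <goal>

-- Every broker call starts from a freshly composed system message.
local function next_messages(user_text)
  return {
    { role = "system", content = build_system_prompt(session) },
    { role = "user", content = user_text },
  }
end
```

Because the handlers only record state, the order `:model deep` then `:norris <goal>` (or the reverse) cannot leave a stale prompt behind.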

End of Phase 3 Baseline — aish
