marfrit c5116bf129 docs/PHASE2-baseline: pre-implementation measurements

Phase 7 (verify) anchor. Captures:

- MCP RPC round-trip timings against boltzmann lmcp v0.5.4 (all sub-100ms
  on LAN; LLM is the latency floor, not the transport).
- 6 fixture responses saved to /tmp/aish-baseline/ covering initialize,
  notifications/initialized, tools/list, tools/call success, isError,
  and JSON-RPC unknown-tool error.
- Baseline design finding: boltzmann's read_file returns isError:false
  even on failure (error text in content). aish should treat content as
  authoritative, isError as advisory; feed both to the model. PHASE2.md
  §4's "pass-through" stance already accommodates; no manifest amendment
  needed.
- Streaming tool_calls delta shape verified against hossenfelder; matches
  PHASE2.md §5.
- Pre-MCP aish behavior snapshot: loaded model emits markdown code-fence
  ignoring the CMD: contract — once MCP tools exist the model gets a
  structured path that doesn't depend on prose-formatting compliance.
- Module pre-state at Phase 1 head 5878f73: LOC + capability snapshot
  per module so Phase 2 diff has a reference frame.
- Two boltzmann-proxy blockers (SSE buffering, model-field routing)
  carried explicitly into Phase 7.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-12 12:34:32 +00:00

6.9 KiB


# Phase 2 Baseline: pre-implementation measurements

Date: 2026-05-12

Targets probed: lmcp v0.5.4 on boltzmann.fritz.box:8080/mcp; OpenAI-compat broker on hossenfelder.fritz.box:8082.

This is the Phase 7 (verify) anchor. It captures the observed behavior just before Phase 2 implementation lands, so that post-implementation behavior can be compared against it. Companion to PHASE2.md (manifest).

## 1. MCP RPC round-trip timings (cold path, single warm-up)

| RPC | Latency |
|---|---|
| `initialize` | 19 ms |
| `notifications/initialized` (HTTP 202, no body) | 11 ms |
| `tools/list` | 17 ms |
| `tools/call` `list_dir({path:"/tmp"})` (success, ~1 KB result) | 72 ms |
| `tools/call` `read_file({path:"/nonexistent/..."})` (handler-caught failure) | 12 ms |
| `tools/call` `nope_tool` (JSON-RPC -32601 unknown tool) | 12 ms |

LAN-local; every call completes in under 100 ms, and everything except the ~1 KB directory listing (72 ms) returns in under 20 ms. Phase 2's sequential tool-call dispatch won't be the bottleneck; the LLM is.
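A minimal sketch of how round-trip timings like these could be collected. It is written in Python for brevity (aish itself is Lua), and `build_request`, `time_rpc`, and the injected `send` callable are hypothetical names, not part of aish or lmcp. The transport is passed in so the timing logic stays independent of how the request is actually posted.

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple


def build_request(method: str, params: Optional[Dict[str, Any]] = None,
                  req_id: Optional[int] = None) -> Dict[str, Any]:
    """Build a JSON-RPC 2.0 envelope; omitting req_id makes it a notification."""
    req: Dict[str, Any] = {"jsonrpc": "2.0", "method": method}
    if req_id is not None:
        req["id"] = req_id
    if params is not None:
        req["params"] = params
    return req


def time_rpc(send: Callable[[Dict[str, Any]], Any],
             request: Dict[str, Any]) -> Tuple[Any, float]:
    """Send one request and return (response, elapsed milliseconds)."""
    start = time.perf_counter()
    response = send(request)
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return response, elapsed_ms


def fake_send(request: Dict[str, Any]) -> Dict[str, Any]:
    """Stand-in transport for illustration; a real one would POST to the MCP endpoint."""
    return {"jsonrpc": "2.0", "id": request.get("id"), "result": {}}


if __name__ == "__main__":
    _, ms = time_rpc(fake_send, build_request("tools/list", req_id=1))
    print(f"tools/list: {ms:.1f} ms")
```

Running one untimed call before the measured one would correspond to the single warm-up mentioned in the section title.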

## 2. Fixtures (saved to `/tmp/aish-baseline/`)

| File | Shape |
|---|---|
| `01_initialize.json` | `{result:{protocolVersion, serverInfo:{name,version}, capabilities:{tools:{listChanged:false}}}}` |
| `02_notif_init.body` | empty (HTTP 202) |
| `03_tools_list.json` | `{result:{tools:[{name, description, inputSchema}...]}}` (7 tools on boltzmann) |
| `04_tools_call_ok.json` | `{result:{isError:false, content:[{type:"text", text:"<listing>"}]}}` |
| `05_tools_call_iserror.json` | see §3 finding |
| `06_tools_call_unknown.json` | `{error:{code:-32601, message:"Tool not found: nope_tool"}}` |

### Initialize response (compact)

```json
{"id":1,"jsonrpc":"2.0","result":{
    "serverInfo":{"version":"0.1.0","name":"boltzmann-tools"},
    "protocolVersion":"2025-03-26",
    "capabilities":{"tools":{"listChanged":false}}}}
```

### Unknown-tool error (JSON-RPC-level failure)

```json
{"id":5,"jsonrpc":"2.0","error":{
    "message":"Tool not found: nope_tool","code":-32601}}
```

## 3. Baseline finding: `isError` is not a complete failure signal

`read_file({path:"/nonexistent/baseline-probe"})` returned:

```json
{"id":4,"jsonrpc":"2.0","result":{
    "isError":false,
    "content":[{"type":"text","text":"Error: could not read /nonexistent/baseline-probe"}]}}
```

isError: false despite an obvious failure. The handler caught the error and put it in content text but didn't set the flag.

Implication for Phase 2 design: aish cannot rely solely on result.isError to decide success/failure of a tool call. The model must read the text content. This actually simplifies Phase 2: just feed content straight back as the role:"tool" turn body regardless of isError. The flag is advisory; the model is the discriminator. (No PHASE2.md amendment needed — §4's "pass-through to the model" stance already accommodates this.)

This is a per-tool boltzmann-lmcp implementation quirk, not a spec issue. Other lmcp deployments may set isError: true correctly; aish should still pass content through and not crash on either shape.
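The pass-through stance can be sketched as follows, in Python for brevity (aish itself is Lua). `tool_turn_body` and `tool_message` are hypothetical helper names. How aish reports a JSON-RPC `error` object to the model is an assumption here, since this baseline only specifies the `result` path.

```python
from typing import Any, Dict


def tool_turn_body(response: Dict[str, Any]) -> str:
    """Flatten a tools/call response into text for a role:"tool" turn."""
    if "error" in response:
        # JSON-RPC-level failure (e.g. -32601): there is no result to pass through.
        # Assumption: the error text is surfaced to the model in this form.
        err = response["error"]
        return f"Error {err.get('code')}: {err.get('message')}"
    result = response.get("result") or {}
    # isError is deliberately ignored: the content text is authoritative.
    parts = [item.get("text", "")
             for item in result.get("content") or []
             if item.get("type") == "text"]
    return "\n".join(parts)


def tool_message(tool_call_id: str, response: Dict[str, Any]) -> Dict[str, str]:
    """Build the role:"tool" message that answers one tool call."""
    return {
        "role": "tool",
        "tool_call_id": tool_call_id,
        "content": tool_turn_body(response),
    }
```

With the `read_file` fixture above, the body is the literal error text from `content`, so the model sees the failure even though `isError` is `false`. A missing or `null` `content` array yields an empty string rather than raising.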

## 4. Streaming `tool_calls` delta shape (verified against hossenfelder)

For stream: true requests with tools declared, observed deltas:

```
data: {"choices":[{"delta":{"role":"assistant","content":null}}]}
data: {"choices":[{"delta":{"tool_calls":[{"index":0,"id":"...","type":"function",
                                            "function":{"name":"get_weather","arguments":""}}]}}]}
data: {"choices":[{"delta":{"tool_calls":[{"index":0,"function":{"arguments":"{"}}]}}]}
data: {"choices":[{"delta":{"tool_calls":[{"index":0,"function":{"arguments":"\""}}]}}]}
data: {"choices":[{"delta":{"tool_calls":[{"index":0,"function":{"arguments":"city"}}]}}]}
...
data: {"choices":[{"finish_reason":"tool_calls","delta":{}}]}
data: [DONE]
```

Accumulator rules confirmed:

- On the first delta containing `tool_calls[i]`: capture `id`, `type`, `function.name`. `arguments` may be the empty string `""`.
- On subsequent deltas with the same `index`: concatenate `function.arguments` onto the running buffer.
- `finish_reason: "tool_calls"` closes the set; the `arguments` buffer is parsed as JSON at that point.

Matches PHASE2.md §5 design.
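The accumulator rules above can be sketched as a small function over already-decoded `data:` payloads. This is Python for illustration only (aish itself is Lua), and `accumulate_tool_calls` is a hypothetical name. Treating an empty arguments buffer as `{}` is an assumption, not something the baseline observed.

```python
import json
from typing import Any, Dict, List


def accumulate_tool_calls(chunks: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Merge streamed tool_calls deltas into complete calls, ordered by index.

    Returns an empty list if no chunk carried finish_reason "tool_calls".
    """
    calls: Dict[int, Dict[str, Any]] = {}
    finished = False
    for chunk in chunks:
        for choice in chunk.get("choices", []):
            delta = choice.get("delta") or {}
            for tc in delta.get("tool_calls") or []:
                fn = tc.get("function") or {}
                call = calls.get(tc["index"])
                if call is None:
                    # First delta for this index: capture id, type, function.name.
                    call = {"id": tc.get("id"), "type": tc.get("type"),
                            "name": fn.get("name"), "arguments": ""}
                    calls[tc["index"]] = call
                # Every delta, including the first, may carry an arguments fragment.
                call["arguments"] += fn.get("arguments") or ""
            if choice.get("finish_reason") == "tool_calls":
                finished = True
    if not finished:
        return []
    done = []
    for _, call in sorted(calls.items()):
        buf = call["arguments"]
        # Assumption: an empty buffer means "no arguments" and maps to {}.
        done.append(dict(call, arguments=json.loads(buf) if buf else {}))
    return done
```

Parsing is deferred until the finish marker because intermediate buffers such as `{"city` are not valid JSON. A malformed final buffer raises `json.JSONDecodeError`, which a caller would need to handle.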

## 5. Baseline aish behavior (pre-MCP, what Phase 1 does today)

Sent to hossenfelder with the standard system prompt and no tools field:

user: List the files in /tmp

Response (qwen2.5-coder-1.5b via hossenfelder, sans tools):

```cmd
dir /tmp
```

`finish_reason: stop`, `tool_calls: null`, 9 completion tokens.

The loaded model emits Windows shell syntax in a markdown code-fence, ignoring the system prompt's `CMD:` extraction contract. **No tool_calls path is exercised today** because no tools are declared. This is the empirical "before" of Phase 2 — once MCP servers are wired and a real tool exists (`list_dir({path:"/tmp"})`), the model has a structured path that doesn't depend on getting `CMD:` formatting right.

---

## 6. Known blockers carried into Phase 7 (verify)

Both live in the **boltzmann proxy** (`hossenfelder.fritz.box:8082`), not in aish:

| # | Bug | Affects | Tracking |
|---|---|---|---|
| 1 | SSE buffering — proxy sets `Content-Length` on `text/event-stream` and flushes the whole response at once | streaming visibility (Phase 1) AND streaming tool_calls deltas (Phase 2) | [aish#15](https://git.reauktion.de/marfrit/aish/issues/15) + [[reference-hossenfelder-sse-buffering]] |
| 2 | `model` field routing — every request returns chunks tagged `qwen2.5-coder-1.5b-q4_k_m.gguf` regardless of requested `model`, suggesting the proxy ignores the field | Phase 2 testing against mistral-nemo specifically (the strict-chat-template canary for Q18); also any `:model deep` / `:model cloud` switch | side-finding in #15 triage; needs its own issue when Phase 7 hits it |

Phase 2 implement/verify will proceed against whatever model is loaded.
Full template-strictness verification of Q18 (`role:"tool"` acceptance on
mistral-nemo) waits for bug #2 to be fixed in the boltzmann proxy code.
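Both symptoms can be checked mechanically during Phase 7. The sketch below is Python for illustration (aish itself is Lua); `sse_looks_buffered` and `routing_ignored` are hypothetical helper names, and the inputs are assumed to have been collected by whatever HTTP client the verifier uses.

```python
from typing import Mapping


def sse_looks_buffered(headers: Mapping[str, str]) -> bool:
    """True when a text/event-stream response also carries Content-Length.

    A fixed length on an event stream suggests the proxy assembled the
    whole body before sending it (blocker #1).
    """
    lowered = {k.lower(): v for k, v in headers.items()}
    is_sse = lowered.get("content-type", "").lower().startswith("text/event-stream")
    return is_sse and "content-length" in lowered


def routing_ignored(reported: Mapping[str, str]) -> bool:
    """True when different requested models all came back with the same tag.

    `reported` maps each requested model name to the `model` value seen in
    the response chunks (blocker #2). At least two distinct requests are
    needed before the check can say anything.
    """
    return len(reported) >= 2 and len(set(reported.values())) == 1
```

Comparing tags across requests, rather than comparing each tag to the requested name, avoids false alarms when the proxy legitimately reports a file name that differs from the requested alias.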

---

## 7. Module pre-state (Phase 1 head: `5878f73`)

| Module | LOC (incl. comments) | State |
|---|---|---|
| `broker.lua` | 92 | chat + chat_stream, no `tools` field |
| `context.lua` | (per Phase 1) | `pending_exec_output` buffer; no `role:"tool"`; no `tool_calls` on assistant turns |
| `executor.lua` | (per Phase 1) | PTY-backed, `CMD:` extract, no tool dispatch |
| `repl.lua` | 287 | meta cmds, ask_ai stream loop, no `:mcp …`, no tool-call sub-loop |
| `renderer.lua` | 79 | exec frame, streaming text; no tool-call frame |
| `safety.lua` | (per PHASE0 §4) | stub — only the file exists |
| `mcp.lua` | — | does not exist yet |
| `config.lua` | (per user's edits) | models registry; no `mcp = { servers = {...} }` section |

After Phase 2 lands, `git diff main..post-phase-2 --stat` should show:
a new, substantial `mcp.lua`; modest growth in `broker.lua`, `context.lua`,
`repl.lua`, and `renderer.lua`; and a `safety.lua` that is no longer a stub.

---

*End of Phase 2 Baseline — aish*
