diff --git a/docs/PHASE8-baseline.md b/docs/PHASE8-baseline.md new file mode 100644 index 0000000..0e679de --- /dev/null +++ b/docs/PHASE8-baseline.md @@ -0,0 +1,110 @@ +# Phase 8 Baseline — pre-implementation measurements + +**Date:** 2026-05-16 +**Tree probed:** `1a136d8` (PHASE8 formulate + analyze + pillar-5 addition). +**Broker probed:** `hossenfelder.fritz.box:8082` (local `qwen-coder-7b-snappy-8k` was the active local model at probe time). + +--- + +## B1. `/tokenize` ignores the `model` request field + +Probed three variants of the same request: + +| Request body | Response | +|---|---| +| `{"model":"qwen-coder-7b-snappy-8k","content":"hello world"}` | `{"tokens":[14990,1879]}` | +| `{"model":"Qwen2.5-7B-Instruct-Q4_K_M.gguf","content":"hello world"}` | `{"tokens":[14990,1879]}` (identical) | +| `{"content":"hello world"}` (no model) | `{"tokens":[14990,1879]}` (identical) | + +**Q-T5 RESOLVED**: hossenfelder's `/tokenize` does NOT switch +tokenizer based on the request's `model` field. It returns the +tokenization of whichever backend model is currently loaded by the +proxy. For aish purposes this is **acceptable** — we get a real BPE +tokenizer count rather than char/4. The accuracy gap from using a +different model's tokenizer than the one that will receive the +completion is minor (Qwen / Llama tokenizers are similar in BPE +vocabulary scale; both are far more accurate than char/4). + +**Implication for §4**: keep sending the `model` field anyway (it's +harmless and may help if the proxy gains per-model routing later). +Document the limitation: counts are from the proxy's loaded model, +NOT necessarily the model_cfg.model requested. For cloud presets +that route through OpenRouter, `/tokenize` 404s anyway and the +char/4 fallback fires — no inaccuracy concern there. + +--- + +## B2. `/tokenize` round-trip latency + +Five probes against `hossenfelder.fritz.box:8082` for random-base64 +payloads of varying sizes: + +| Input size (chars) | Tokens returned | Round-trip (ms) | +|---|---|---| +| 50 | 39 | 23 | +| 500 | 369 | 34 | +| 2000 | 1509 | 32 | +| 5000 | 3741 | 24 | + +**Latency is flat at ~25-35ms** across the size range, dominated by +network round-trip (not tokenizer cost). This is comfortably under +the §4 formulate-time estimate of "~50ms per call". + +**Implication for §5**: per-turn `_tokens` cache amortizes cost to +O(1) after first count. Worst case fresh session with 40 cached +turns: 40 × 30ms = 1.2s one-time cost for `enforce_budget`'s first +call (after that, cached). Acceptable. + +The total tokens count for random base64 input is unusually high +(~74% chars-to-tokens vs ~25% for natural prose). This is because +base64 lacks the common-token patterns BPE compresses. Natural-text +sessions tokenize closer to char/4 (per earlier prose probe: 558 +tokens for 2032 chars = 27.5%). + +--- + +## B3. `/tokenize` body shape — `{tokens: [int, int, ...]}` + +Confirmed across all probes: response is `{"tokens": [N1, N2, ...]}` +where each `Ni` is the token ID (integer). For aish purposes we only +need the count (`#response.tokens`), so the token IDs themselves are +discarded. + +The response is JSON (not SSE), so `ffi.curl.M.post` (blocking POST) +is the right call — not `M.post_sse`. + +--- + +## B4. No /tokenize on cloud (OpenRouter) — char/4 fallback path validated + +Already probed during formulate-time: + +``` +curl http://hossenfelder.fritz.box:8082/v1/tokenize -> 404 +curl http://hossenfelder.fritz.box:8082/tokenize ... model=anthropic/... + -> 404 (or returns the LOADED-local-model's tokenization; not the cloud's) +``` + +The hossenfelder proxy doesn't forward `/tokenize` to OpenRouter +(which doesn't expose it). Our per-endpoint capability cache will +mark it as unsupported on first probe; subsequent cloud calls use +char/4 silently. + +**No design change needed** — formulate's "cache capability per +(endpoint, model) on first probe" handles this naturally. + +--- + +## Summary + +| Finding | Affects | Resolution | +|---|---|---| +| B1 /tokenize ignores `model` field | §4 token_count accuracy gap | Document; acceptable — BPE >> char/4 even with wrong tokenizer | +| B2 ~25-35ms latency, flat over size | §5 per-turn cache strategy | Per-turn cache amortizes; worst case 1.2s on first enforce_budget | +| B3 `{tokens: [...]}` body shape | §4 broker.token_count parser | Confirmed; one-liner JSON parse | +| B4 cloud /tokenize 404 | §4 capability detection | Cache as unsupported on first probe; char/4 fallback fires silently | + +All findings align with the formulate/analyze design. No +structural changes needed. Ready for plan. + +**Q-T5 RESOLVED** per B1. All open questions now resolved.