Files
aish/docs/PHASE8-baseline.md
marfrit 79bd40db79 docs/PHASE8-baseline: live /tokenize probes
Four findings, all align with formulate/analyze:

B1. /tokenize IGNORES the `model` request field — returns the
    tokenization of whichever model is currently loaded on the
    proxy backend, NOT the requested model. Acceptable: a real BPE
    count is still much better than char/4, and the gap between
    Qwen/Llama tokenizers is small. Cloud (OpenRouter) 404s
    regardless, so cloud falls back to char/4 via the capability
    cache.

B2. Latency 23-34ms per call, FLAT across input sizes 50-5000 chars.
    Network round-trip dominates. Per-turn _tokens cache amortizes
    to O(1); worst case 40 cached turns × ~30ms = 1.2s one-time
    cost on first enforce_budget call. Acceptable.

B3. Response shape confirmed: `{"tokens":[N1,N2,...]}` (token IDs;
    we use #response.tokens for count, discard the IDs). JSON not
    SSE; ffi.curl.M.post is the right call.

B4. Cloud /tokenize 404s as expected. Capability cache marks it
    unsupported on first probe; char/4 fallback silent thereafter.
    No design change.

Q-T5 RESOLVED per B1. All open questions now resolved.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 23:22:05 +00:00

111 lines
4.3 KiB
Markdown
Raw Permalink Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# Phase 8 Baseline — pre-implementation measurements
**Date:** 2026-05-16
**Tree probed:** `1a136d8` (PHASE8 formulate + analyze + pillar-5 addition).
**Broker probed:** `hossenfelder.fritz.box:8082` (local `qwen-coder-7b-snappy-8k` was the active local model at probe time).
---
## B1. `/tokenize` ignores the `model` request field
Probed three variants of the same request:
| Request body | Response |
|---|---|
| `{"model":"qwen-coder-7b-snappy-8k","content":"hello world"}` | `{"tokens":[14990,1879]}` |
| `{"model":"Qwen2.5-7B-Instruct-Q4_K_M.gguf","content":"hello world"}` | `{"tokens":[14990,1879]}` (identical) |
| `{"content":"hello world"}` (no model) | `{"tokens":[14990,1879]}` (identical) |
**Q-T5 RESOLVED**: hossenfelder's `/tokenize` does NOT switch
tokenizer based on the request's `model` field. It returns the
tokenization of whichever backend model is currently loaded by the
proxy. For aish purposes this is **acceptable** — we get a real BPE
tokenizer count rather than char/4. The accuracy gap from using a
different model's tokenizer than the one that will receive the
completion is minor (Qwen / Llama tokenizers are similar in BPE
vocabulary scale; both are far more accurate than char/4).
**Implication for §4**: keep sending the `model` field anyway (it's
harmless and may help if the proxy gains per-model routing later).
Document the limitation: counts are from the proxy's loaded model,
NOT necessarily the model_cfg.model requested. For cloud presets
that route through OpenRouter, `/tokenize` 404s anyway and the
char/4 fallback fires — no inaccuracy concern there.
---
## B2. `/tokenize` round-trip latency
Five probes against `hossenfelder.fritz.box:8082` for random-base64
payloads of varying sizes:
| Input size (chars) | Tokens returned | Round-trip (ms) |
|---|---|---|
| 50 | 39 | 23 |
| 500 | 369 | 34 |
| 2000 | 1509 | 32 |
| 5000 | 3741 | 24 |
**Latency is flat at ~25-35ms** across the size range, dominated by
network round-trip (not tokenizer cost). This is comfortably under
the §4 formulate-time estimate of "~50ms per call".
**Implication for §5**: per-turn `_tokens` cache amortizes cost to
O(1) after first count. Worst case fresh session with 40 cached
turns: 40 × 30ms = 1.2s one-time cost for `enforce_budget`'s first
call (after that, cached). Acceptable.
The total tokens count for random base64 input is unusually high
(~74% chars-to-tokens vs ~25% for natural prose). This is because
base64 lacks the common-token patterns BPE compresses. Natural-text
sessions tokenize closer to char/4 (per earlier prose probe: 558
tokens for 2032 chars = 27.5%).
---
## B3. `/tokenize` body shape — `{tokens: [int, int, ...]}`
Confirmed across all probes: response is `{"tokens": [N1, N2, ...]}`
where each `Ni` is the token ID (integer). For aish purposes we only
need the count (`#response.tokens`), so the token IDs themselves are
discarded.
The response is JSON (not SSE), so `ffi.curl.M.post` (blocking POST)
is the right call — not `M.post_sse`.
---
## B4. No /tokenize on cloud (OpenRouter) — char/4 fallback path validated
Already probed during formulate-time:
```
curl http://hossenfelder.fritz.box:8082/v1/tokenize -> 404
curl http://hossenfelder.fritz.box:8082/tokenize ... model=anthropic/...
-> 404 (or returns the LOADED-local-model's tokenization; not the cloud's)
```
The hossenfelder proxy doesn't forward `/tokenize` to OpenRouter
(which doesn't expose it). Our per-endpoint capability cache will
mark it as unsupported on first probe; subsequent cloud calls use
char/4 silently.
**No design change needed** — formulate's "cache capability per
(endpoint, model) on first probe" handles this naturally.
---
## Summary
| Finding | Affects | Resolution |
|---|---|---|
| B1 /tokenize ignores `model` field | §4 token_count accuracy gap | Document; acceptable — BPE >> char/4 even with wrong tokenizer |
| B2 ~25-35ms latency, flat over size | §5 per-turn cache strategy | Per-turn cache amortizes; worst case 1.2s on first enforce_budget |
| B3 `{tokens: [...]}` body shape | §4 broker.token_count parser | Confirmed; one-liner JSON parse |
| B4 cloud /tokenize 404 | §4 capability detection | Cache as unsupported on first probe; char/4 fallback fires silently |
All findings align with the formulate/analyze design. No
structural changes needed. Ready for plan.
**Q-T5 RESOLVED** per B1. All open questions now resolved.