Files
aish/docs/PHASE8-baseline.md
T
marfrit 79bd40db79 docs/PHASE8-baseline: live /tokenize probes
Four findings, all align with formulate/analyze:

B1. /tokenize IGNORES the `model` request field — returns the
    tokenization of whichever model is currently loaded on the
    proxy backend, NOT the requested model. Acceptable: a real BPE
    count is still much better than char/4, and the gap between
    Qwen/Llama tokenizers is small. Cloud (OpenRouter) 404s
    regardless, so cloud falls back to char/4 via the capability
    cache.

B2. Latency 23-34ms per call, FLAT across input sizes 50-5000 chars.
    Network round-trip dominates. Per-turn _tokens cache amortizes
    to O(1); worst case 40 cached turns × ~30ms = 1.2s one-time
    cost on first enforce_budget call. Acceptable.

B3. Response shape confirmed: `{"tokens":[N1,N2,...]}` (token IDs;
    we use #response.tokens for count, discard the IDs). JSON not
    SSE; ffi.curl.M.post is the right call.

B4. Cloud /tokenize 404s as expected. Capability cache marks it
    unsupported on first probe; char/4 fallback silent thereafter.
    No design change.

Q-T5 RESOLVED per B1. All open questions now resolved.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 23:22:05 +00:00

4.3 KiB
Raw Blame History

Phase 8 Baseline — pre-implementation measurements

Date: 2026-05-16 Tree probed: 1a136d8 (PHASE8 formulate + analyze + pillar-5 addition). Broker probed: hossenfelder.fritz.box:8082 (local qwen-coder-7b-snappy-8k was the active local model at probe time).


B1. /tokenize ignores the model request field

Probed three variants of the same request:

Request body Response
{"model":"qwen-coder-7b-snappy-8k","content":"hello world"} {"tokens":[14990,1879]}
{"model":"Qwen2.5-7B-Instruct-Q4_K_M.gguf","content":"hello world"} {"tokens":[14990,1879]} (identical)
{"content":"hello world"} (no model) {"tokens":[14990,1879]} (identical)

Q-T5 RESOLVED: hossenfelder's /tokenize does NOT switch tokenizer based on the request's model field. It returns the tokenization of whichever backend model is currently loaded by the proxy. For aish purposes this is acceptable — we get a real BPE tokenizer count rather than char/4. The accuracy gap from using a different model's tokenizer than the one that will receive the completion is minor (Qwen / Llama tokenizers are similar in BPE vocabulary scale; both are far more accurate than char/4).

Implication for §4: keep sending the model field anyway (it's harmless and may help if the proxy gains per-model routing later). Document the limitation: counts are from the proxy's loaded model, NOT necessarily the model_cfg.model requested. For cloud presets that route through OpenRouter, /tokenize 404s anyway and the char/4 fallback fires — no inaccuracy concern there.


B2. /tokenize round-trip latency

Five probes against hossenfelder.fritz.box:8082 for random-base64 payloads of varying sizes:

Input size (chars) Tokens returned Round-trip (ms)
50 39 23
500 369 34
2000 1509 32
5000 3741 24

Latency is flat at ~25-35ms across the size range, dominated by network round-trip (not tokenizer cost). This is comfortably under the §4 formulate-time estimate of "~50ms per call".

Implication for §5: per-turn _tokens cache amortizes cost to O(1) after first count. Worst case fresh session with 40 cached turns: 40 × 30ms = 1.2s one-time cost for enforce_budget's first call (after that, cached). Acceptable.

The total tokens count for random base64 input is unusually high (~74% chars-to-tokens vs ~25% for natural prose). This is because base64 lacks the common-token patterns BPE compresses. Natural-text sessions tokenize closer to char/4 (per earlier prose probe: 558 tokens for 2032 chars = 27.5%).


B3. /tokenize body shape — {tokens: [int, int, ...]}

Confirmed across all probes: response is {"tokens": [N1, N2, ...]} where each Ni is the token ID (integer). For aish purposes we only need the count (#response.tokens), so the token IDs themselves are discarded.

The response is JSON (not SSE), so ffi.curl.M.post (blocking POST) is the right call — not M.post_sse.


B4. No /tokenize on cloud (OpenRouter) — char/4 fallback path validated

Already probed during formulate-time:

curl http://hossenfelder.fritz.box:8082/v1/tokenize  -> 404
curl http://hossenfelder.fritz.box:8082/tokenize ... model=anthropic/...
  -> 404 (or returns the LOADED-local-model's tokenization; not the cloud's)

The hossenfelder proxy doesn't forward /tokenize to OpenRouter (which doesn't expose it). Our per-endpoint capability cache will mark it as unsupported on first probe; subsequent cloud calls use char/4 silently.

No design change needed — formulate's "cache capability per (endpoint, model) on first probe" handles this naturally.


Summary

Finding Affects Resolution
B1 /tokenize ignores model field §4 token_count accuracy gap Document; acceptable — BPE >> char/4 even with wrong tokenizer
B2 ~25-35ms latency, flat over size §5 per-turn cache strategy Per-turn cache amortizes; worst case 1.2s on first enforce_budget
B3 {tokens: [...]} body shape §4 broker.token_count parser Confirmed; one-liner JSON parse
B4 cloud /tokenize 404 §4 capability detection Cache as unsupported on first probe; char/4 fallback fires silently

All findings align with the formulate/analyze design. No structural changes needed. Ready for plan.

Q-T5 RESOLVED per B1. All open questions now resolved.