Files

T

marfrit 79bd40db79 docs/PHASE8-baseline: live /tokenize probes

Four findings, all align with formulate/analyze:

B1. /tokenize IGNORES the `model` request field — returns the
    tokenization of whichever model is currently loaded on the
    proxy backend, NOT the requested model. Acceptable: a real BPE
    count is still much better than char/4, and the gap between
    Qwen/Llama tokenizers is small. Cloud (OpenRouter) 404s
    regardless, so cloud falls back to char/4 via the capability
    cache.

B2. Latency 23-34ms per call, FLAT across input sizes 50-5000 chars.
    Network round-trip dominates. Per-turn _tokens cache amortizes
    to O(1); worst case 40 cached turns × ~30ms = 1.2s one-time
    cost on first enforce_budget call. Acceptable.

B3. Response shape confirmed: `{"tokens":[N1,N2,...]}` (token IDs;
    we use #response.tokens for count, discard the IDs). JSON not
    SSE; ffi.curl.M.post is the right call.

B4. Cloud /tokenize 404s as expected. Capability cache marks it
    unsupported on first probe; char/4 fallback silent thereafter.
    No design change.

Q-T5 RESOLVED per B1. All open questions now resolved.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-16 23:22:05 +00:00

4.3 KiB

Raw Blame History

Phase 8 Baseline — pre-implementation measurements

Date: 2026-05-16 Tree probed: 1a136d8 (PHASE8 formulate + analyze + pillar-5 addition). Broker probed: hossenfelder.fritz.box:8082 (local qwen-coder-7b-snappy-8k was the active local model at probe time).

B1. `/tokenize` ignores the `model` request field

Probed three variants of the same request:

Request body	Response
`{"model":"qwen-coder-7b-snappy-8k","content":"hello world"}`	`{"tokens":[14990,1879]}`
`{"model":"Qwen2.5-7B-Instruct-Q4_K_M.gguf","content":"hello world"}`	`{"tokens":[14990,1879]}` (identical)
`{"content":"hello world"}` (no model)	`{"tokens":[14990,1879]}` (identical)

Q-T5 RESOLVED: hossenfelder's /tokenize does NOT switch tokenizer based on the request's model field. It returns the tokenization of whichever backend model is currently loaded by the proxy. For aish purposes this is acceptable — we get a real BPE tokenizer count rather than char/4. The accuracy gap from using a different model's tokenizer than the one that will receive the completion is minor (Qwen / Llama tokenizers are similar in BPE vocabulary scale; both are far more accurate than char/4).

Implication for §4: keep sending the model field anyway (it's harmless and may help if the proxy gains per-model routing later). Document the limitation: counts are from the proxy's loaded model, NOT necessarily the model_cfg.model requested. For cloud presets that route through OpenRouter, /tokenize 404s anyway and the char/4 fallback fires — no inaccuracy concern there.

B2. `/tokenize` round-trip latency

Five probes against hossenfelder.fritz.box:8082 for random-base64 payloads of varying sizes:

Input size (chars)	Tokens returned	Round-trip (ms)
50	39	23
500	369	34
2000	1509	32
5000	3741	24

Latency is flat at ~25-35ms across the size range, dominated by network round-trip (not tokenizer cost). This is comfortably under the §4 formulate-time estimate of "~50ms per call".

Implication for §5: per-turn _tokens cache amortizes cost to O(1) after first count. Worst case fresh session with 40 cached turns: 40 × 30ms = 1.2s one-time cost for enforce_budget's first call (after that, cached). Acceptable.

The total tokens count for random base64 input is unusually high (~74% chars-to-tokens vs ~25% for natural prose). This is because base64 lacks the common-token patterns BPE compresses. Natural-text sessions tokenize closer to char/4 (per earlier prose probe: 558 tokens for 2032 chars = 27.5%).

B3. `/tokenize` body shape — `{tokens: [int, int, ...]}`

Confirmed across all probes: response is {"tokens": [N1, N2, ...]} where each Ni is the token ID (integer). For aish purposes we only need the count (#response.tokens), so the token IDs themselves are discarded.

The response is JSON (not SSE), so ffi.curl.M.post (blocking POST) is the right call — not M.post_sse.

B4. No /tokenize on cloud (OpenRouter) — char/4 fallback path validated

Already probed during formulate-time:

curl http://hossenfelder.fritz.box:8082/v1/tokenize  -> 404
curl http://hossenfelder.fritz.box:8082/tokenize ... model=anthropic/...
  -> 404 (or returns the LOADED-local-model's tokenization; not the cloud's)

The hossenfelder proxy doesn't forward /tokenize to OpenRouter (which doesn't expose it). Our per-endpoint capability cache will mark it as unsupported on first probe; subsequent cloud calls use char/4 silently.

No design change needed — formulate's "cache capability per (endpoint, model) on first probe" handles this naturally.

Summary

Finding	Affects	Resolution
B1 /tokenize ignores `model` field	§4 token_count accuracy gap	Document; acceptable — BPE >> char/4 even with wrong tokenizer
B2 ~25-35ms latency, flat over size	§5 per-turn cache strategy	Per-turn cache amortizes; worst case 1.2s on first enforce_budget
B3 `{tokens: [...]}` body shape	§4 broker.token_count parser	Confirmed; one-liner JSON parse
B4 cloud /tokenize 404	§4 capability detection	Cache as unsupported on first probe; char/4 fallback fires silently

All findings align with the formulate/analyze design. No structural changes needed. Ready for plan.

Q-T5 RESOLVED per B1. All open questions now resolved.

4.3 KiB Raw Blame History Unescape Escape