Four findings, all align with formulate/analyze:
B1. /tokenize IGNORES the `model` request field — returns the
tokenization of whichever model is currently loaded on the
proxy backend, NOT the requested model. Acceptable: a real BPE
count is still much better than char/4, and the gap between
Qwen/Llama tokenizers is small. Cloud (OpenRouter) 404s
regardless, so cloud falls back to char/4 via the capability
cache.
B2. Latency 23-34ms per call, FLAT across input sizes 50-5000 chars.
Network round-trip dominates. Per-turn _tokens cache amortizes
to O(1); worst case 40 cached turns × ~30ms = 1.2s one-time
cost on first enforce_budget call. Acceptable.
B3. Response shape confirmed: `{"tokens":[N1,N2,...]}` (token IDs;
we use #response.tokens for count, discard the IDs). JSON not
SSE; ffi.curl.M.post is the right call.
B4. Cloud /tokenize 404s as expected. Capability cache marks it
unsupported on first probe; char/4 fallback silent thereafter.
No design change.
Q-T5 RESOLVED per B1. All open questions now resolved.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>