Real probes against hossenfelder.fritz.box:8082 against both backends.
Five findings, all align with the formulate/analyze design — no
structural changes.
B1. `stream_options.include_usage = true` is safely accepted by
both backends. REQUIRED for local llama.cpp to emit usage;
no-op for cloud (which emits anyway). Default-true is correct.
B2. Two emission patterns observed:
- Cloud (Bedrock): usage rides the FINAL delta chunk with
non-empty `choices` carrying finish_reason.
- Local: usage rides a SEPARATE chunk with `choices: []`
preceding `[DONE]`.
Both shapes are handled by the same `if doc.usage then ...`
check; the existing on_event choices-branch short-circuits
safely when choices is empty.
B3. `cost` field is dollar-denominated (number) and cloud-only.
Local returns `timings` instead (perf, not cost). Accumulator
captures `usage.cost` as-is; nil treated as 0. :cost detail
annotates local lines so $0 isn't misread.
B4. `doc.model` in the usage event reflects the upstream-API-version
(e.g., Bedrock rewrites `anthropic/claude-haiku-4.5` to
`anthropic/claude-4.5-haiku-20251001`). Accumulator keys by
caller-intended `model_cfg.model`, NOT `doc.model`, for stable
cross-call comparison.
B5. Usage event is always the LAST data event before `[DONE]`.
Emission of `on_delta("usage", ...)` happens after curl.post_sse
returns — one call per stream, after all text + tool_calls.
Q-C4 RESOLVED: hossenfelder forwards `stream_options.include_usage`
to all backends correctly.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
5.8 KiB
Phase 7 Baseline — pre-implementation measurements
Date: 2026-05-16
Tree probed: f0bccde (PHASE7 formulate + analyze).
Broker probed: hossenfelder.fritz.box:8082 (local qwen-coder-7b-snappy-8k, cloud anthropic/claude-haiku-4.5).
This is the Phase 7 (verify) anchor for the cost/usage observability work. Captures the world just before broker.lua / context.lua / repl.lua edits land.
B1. stream_options.include_usage = true is safely accepted everywhere
Probed both backends with and without the flag in the request body:
| Backend | Without flag | With flag | Notes |
|---|---|---|---|
| Cloud (Anthropic via Bedrock through OpenRouter) | usage IS in final chunk | usage IS in final chunk | OpenRouter emits usage by default; the flag is a no-op there |
| Local llama.cpp (qwen-coder-7b-snappy-8k via hossenfelder) | NO usage emitted | usage IS in final chunk | The flag is required for local; hossenfelder forwards it correctly to llama.cpp |
Implication for §2 / §4: the formulate-time decision to default
opts.include_usage = true is correct. Without the flag we'd silently
miss local-model usage tracking. With the flag both backends emit
usage reliably. No need for a per-backend opt-out in v1.
B2. Usage payload shape — TWO emission patterns
Cloud (Anthropic/Bedrock): usage rides the FINAL delta chunk that
ALSO carries the closing finish_reason. choices is non-empty.
{
"id": "gen-...",
"object": "chat.completion.chunk",
"model": "anthropic/claude-4.5-haiku-20251001",
"provider": "Amazon Bedrock",
"choices": [{
"index": 0,
"delta": { "content": "", "role": "assistant" },
"finish_reason": "length"
}],
"usage": {
"prompt_tokens": 8,
"completion_tokens": 4,
"total_tokens": 12,
"cost": 0.000028, // dollars
"cost_details": {
"upstream_inference_cost": 0.000028,
"upstream_inference_prompt_cost": 0.000008,
"upstream_inference_completions_cost": 0.00002
},
"prompt_tokens_details": { "cached_tokens": 0, "cache_write_tokens": 0, ... },
"completion_tokens_details": { "reasoning_tokens": 0, ... }
}
}
Local (llama.cpp): usage rides a SEPARATE final chunk where
choices: []. Then [DONE] marker.
{
"id": "chatcmpl-...",
"object": "chat.completion.chunk",
"model": "qwen-coder-7b-snappy-8k",
"choices": [],
"usage": {
"prompt_tokens": 30,
"completion_tokens": 6,
"total_tokens": 36,
"prompt_tokens_details": { "cached_tokens": 29 }
},
"timings": {
"cache_n": 29, "prompt_n": 1, "prompt_ms": 152.391,
"predicted_n": 6, "predicted_ms": 758.778, ...
}
}
data: [DONE]
Implication for §4 extraction algorithm: if doc.usage then final_usage = doc.usage end works for BOTH shapes (cloud-style
non-empty-choices chunk AND local-style empty-choices chunk). The
existing on_event branch on choices and choices[1] and delta is
short-circuited safely when choices is empty.
B3. cost field is dollar-denominated and present on cloud only
| Provider | usage.cost |
usage.cost_details |
|---|---|---|
| Anthropic via Bedrock (OpenRouter) | ✓ (number, USD) | ✓ (upstream_inference_cost / _prompt_cost / _completions_cost) |
| Local llama.cpp | absent | absent |
The local model has timings instead — useful for perf observability
but NOT cost. Implication: in the accumulator, capture
usage.cost as-is when present; treat nil as 0 (matches the
formulate-time "local: free" framing). :cost detail annotates
local lines as (local) so the displayed $0 isn't misread.
B4. Model identifier in usage events — choose source carefully
Cloud's usage event carries:
doc.model = "anthropic/claude-4.5-haiku-20251001"(the resolved upstream-API-version)
But the REQUEST was "model": "anthropic/claude-haiku-4.5". The
broker / OpenRouter rewrote the model name to the dated version.
Implication: the accumulator should key by the CALLER-INTENDED model
name (i.e., model_cfg.model from the request, NOT doc.model from the
response). This keeps :cost detail output stable across upstream API
version bumps. Documented in §5 of the manifest already (uses
model_name).
For local the two match (model_cfg.model == doc.model), so this is a cloud-only consideration.
B5. Multi-chunk vs single-chunk delivery
Cloud (Bedrock) returns the whole 4-token response in ~3 chunks (median
27 chars each per B2 of Phase 6 baseline). Local returns ~6 chunks of
~4 chars each. In both cases the usage event is the LAST data event
before [DONE]. So the post-curl.post_sse emission of
on_delta("usage", ...) in chat_stream is the right place to fire —
it happens once per stream, after all text/tool_calls have been
delivered.
Summary
| Finding | Affects | Resolution |
|---|---|---|
| B1 stream_options safe + required for local | §4 opts.include_usage default |
Default true; no per-backend opt-out needed |
| B2 two emission patterns (non-empty vs empty choices) | broker.on_event branch | if doc.usage then final_usage = doc.usage end works for both |
| B3 cost dollar-denominated, cloud-only | accumulator + :cost detail | Capture as-is; nil→0; annotate local lines |
| B4 model identifier rewrite by upstream | accumulator keying | Key by model_cfg.model (caller-intended) not doc.model |
| B5 usage is last event before [DONE] | emission placement | Fire on_delta("usage", ...) after curl.post_sse returns |
All findings align with the formulate/analyze design. No structural changes needed. The implementation can proceed to plan.
Q-C4 RESOLVED (was: does the hossenfelder broker forward
stream_options to all backends?): YES — local llama.cpp receives
and honors the flag; cloud emits usage with or without (the flag is
a no-op there). Both confirmed via real probes against
hossenfelder.fritz.box:8082.