Files

T

marfrit 2244a3f1ee docs/PHASE7-baseline: live broker probes for usage shape

Real probes against hossenfelder.fritz.box:8082 against both backends.
Five findings, all align with the formulate/analyze design — no
structural changes.

B1. `stream_options.include_usage = true` is safely accepted by
    both backends. REQUIRED for local llama.cpp to emit usage;
    no-op for cloud (which emits anyway). Default-true is correct.

B2. Two emission patterns observed:
    - Cloud (Bedrock): usage rides the FINAL delta chunk with
      non-empty `choices` carrying finish_reason.
    - Local: usage rides a SEPARATE chunk with `choices: []`
      preceding `[DONE]`.
    Both shapes are handled by the same `if doc.usage then ...`
    check; the existing on_event choices-branch short-circuits
    safely when choices is empty.

B3. `cost` field is dollar-denominated (number) and cloud-only.
    Local returns `timings` instead (perf, not cost). Accumulator
    captures `usage.cost` as-is; nil treated as 0. :cost detail
    annotates local lines so $0 isn't misread.

B4. `doc.model` in the usage event reflects the upstream-API-version
    (e.g., Bedrock rewrites `anthropic/claude-haiku-4.5` to
    `anthropic/claude-4.5-haiku-20251001`). Accumulator keys by
    caller-intended `model_cfg.model`, NOT `doc.model`, for stable
    cross-call comparison.

B5. Usage event is always the LAST data event before `[DONE]`.
    Emission of `on_delta("usage", ...)` happens after curl.post_sse
    returns — one call per stream, after all text + tool_calls.

Q-C4 RESOLVED: hossenfelder forwards `stream_options.include_usage`
to all backends correctly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-16 22:49:53 +00:00

5.8 KiB

Raw Blame History

Phase 7 Baseline — pre-implementation measurements

Date: 2026-05-16 Tree probed: f0bccde (PHASE7 formulate + analyze). Broker probed: hossenfelder.fritz.box:8082 (local qwen-coder-7b-snappy-8k, cloud anthropic/claude-haiku-4.5).

This is the Phase 7 (verify) anchor for the cost/usage observability work. Captures the world just before broker.lua / context.lua / repl.lua edits land.

B1. `stream_options.include_usage = true` is safely accepted everywhere

Probed both backends with and without the flag in the request body:

Backend	Without flag	With flag	Notes
Cloud (Anthropic via Bedrock through OpenRouter)	usage IS in final chunk	usage IS in final chunk	OpenRouter emits usage by default; the flag is a no-op there
Local llama.cpp (qwen-coder-7b-snappy-8k via hossenfelder)	NO usage emitted	usage IS in final chunk	The flag is required for local; hossenfelder forwards it correctly to llama.cpp

Implication for §2 / §4: the formulate-time decision to default opts.include_usage = true is correct. Without the flag we'd silently miss local-model usage tracking. With the flag both backends emit usage reliably. No need for a per-backend opt-out in v1.

B2. Usage payload shape — TWO emission patterns

Cloud (Anthropic/Bedrock): usage rides the FINAL delta chunk that ALSO carries the closing finish_reason. choices is non-empty.

{
  "id": "gen-...",
  "object": "chat.completion.chunk",
  "model": "anthropic/claude-4.5-haiku-20251001",
  "provider": "Amazon Bedrock",
  "choices": [{
    "index": 0,
    "delta": { "content": "", "role": "assistant" },
    "finish_reason": "length"
  }],
  "usage": {
    "prompt_tokens": 8,
    "completion_tokens": 4,
    "total_tokens": 12,
    "cost": 0.000028,                                  // dollars
    "cost_details": {
      "upstream_inference_cost": 0.000028,
      "upstream_inference_prompt_cost": 0.000008,
      "upstream_inference_completions_cost": 0.00002
    },
    "prompt_tokens_details": { "cached_tokens": 0, "cache_write_tokens": 0, ... },
    "completion_tokens_details": { "reasoning_tokens": 0, ... }
  }
}

Local (llama.cpp): usage rides a SEPARATE final chunk where choices: []. Then [DONE] marker.

{
  "id": "chatcmpl-...",
  "object": "chat.completion.chunk",
  "model": "qwen-coder-7b-snappy-8k",
  "choices": [],
  "usage": {
    "prompt_tokens": 30,
    "completion_tokens": 6,
    "total_tokens": 36,
    "prompt_tokens_details": { "cached_tokens": 29 }
  },
  "timings": {
    "cache_n": 29, "prompt_n": 1, "prompt_ms": 152.391,
    "predicted_n": 6, "predicted_ms": 758.778, ...
  }
}
data: [DONE]

Implication for §4 extraction algorithm: if doc.usage then final_usage = doc.usage end works for BOTH shapes (cloud-style non-empty-choices chunk AND local-style empty-choices chunk). The existing on_event branch on choices and choices[1] and delta is short-circuited safely when choices is empty.

B3. `cost` field is dollar-denominated and present on cloud only

Provider	`usage.cost`	`usage.cost_details`
Anthropic via Bedrock (OpenRouter)	✓ (number, USD)	✓ (upstream_inference_cost / _prompt_cost / _completions_cost)
Local llama.cpp	absent	absent

The local model has timings instead — useful for perf observability but NOT cost. Implication: in the accumulator, capture usage.cost as-is when present; treat nil as 0 (matches the formulate-time "local: free" framing). :cost detail annotates local lines as (local) so the displayed $0 isn't misread.

B4. Model identifier in usage events — choose source carefully

Cloud's usage event carries:

doc.model = "anthropic/claude-4.5-haiku-20251001" (the resolved upstream-API-version)

But the REQUEST was "model": "anthropic/claude-haiku-4.5". The broker / OpenRouter rewrote the model name to the dated version.

Implication: the accumulator should key by the CALLER-INTENDED model name (i.e., model_cfg.model from the request, NOT doc.model from the response). This keeps :cost detail output stable across upstream API version bumps. Documented in §5 of the manifest already (uses model_name).

For local the two match (model_cfg.model == doc.model), so this is a cloud-only consideration.

B5. Multi-chunk vs single-chunk delivery

Cloud (Bedrock) returns the whole 4-token response in ~3 chunks (median 27 chars each per B2 of Phase 6 baseline). Local returns ~6 chunks of ~4 chars each. In both cases the usage event is the LAST data event before [DONE]. So the post-curl.post_sse emission of on_delta("usage", ...) in chat_stream is the right place to fire — it happens once per stream, after all text/tool_calls have been delivered.

Summary

Finding	Affects	Resolution
B1 stream_options safe + required for local	§4 `opts.include_usage` default	Default true; no per-backend opt-out needed
B2 two emission patterns (non-empty vs empty choices)	broker.on_event branch	`if doc.usage then final_usage = doc.usage end` works for both
B3 cost dollar-denominated, cloud-only	accumulator + :cost detail	Capture as-is; nil→0; annotate local lines
B4 model identifier rewrite by upstream	accumulator keying	Key by `model_cfg.model` (caller-intended) not `doc.model`
B5 usage is last event before [DONE]	emission placement	Fire `on_delta("usage", ...)` after curl.post_sse returns

All findings align with the formulate/analyze design. No structural changes needed. The implementation can proceed to plan.

Q-C4 RESOLVED (was: does the hossenfelder broker forward stream_options to all backends?): YES — local llama.cpp receives and honors the flag; cloud emits usage with or without (the flag is a no-op there). Both confirmed via real probes against hossenfelder.fritz.box:8082.

5.8 KiB Raw Blame History