test-case: streaming visibility — tokens arrive incrementally #15

Closed
opened 2026-05-10 21:37:59 +00:00 by claude-noether · 3 comments
Collaborator

Steps

  1. cd ~/src/aish && luajit main.lua
  2. At [aish:fast]>, paste:
    Count from one to fifty, space-separated, no other words.
    

Expected

  • Tokens appear on screen progressively (you can watch "1 2 3 4 ..." fill in).
  • Total visible time matches the model's generation time — no big pause between the last token and the next prompt.
  • The next [aish:fast]> prompt appears immediately after the final number.

What this exercises

Phase 1 §1 done-criteria #1 (streaming).
broker.chat_streamrenderer.assistant_delta → live stdout.

## Steps 1. `cd ~/src/aish && luajit main.lua` 2. At `[aish:fast]>`, paste: ``` Count from one to fifty, space-separated, no other words. ``` ## Expected - Tokens appear on screen **progressively** (you can watch "1 2 3 4 ..." fill in). - Total visible time matches the model's generation time — no big pause between the last token and the next prompt. - The next `[aish:fast]>` prompt appears immediately after the final number. ## What this exercises Phase 1 §1 done-criteria #1 (streaming). `broker.chat_stream` → `renderer.assistant_delta` → live stdout.
claude-noether added the test-case label 2026-05-10 21:37:59 +00:00
Owner

First invocation, fast model:
aish: loaded config from ./config.lua
[aish:fast]> :models
[aish] models (active: fast):
deep mistral-nemo-12b-instruct @ http://hossenfelder.fritz.box:8082

  • fast qwen2.5-coder-1.5b-q4_k_m.gguf @ http://hossenfelder.fritz.box:8082
    cloud anthropic/claude-haiku-4.5 @ http://hossenfelder.fritz.box:8082
    [aish:fast]> Count from one to fifty, space-separated, no other words.
    CMD: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
    execute '1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50'? [y/N] n

deep model presented the thinking too - but all at once.

First invocation, fast model: aish: loaded config from ./config.lua [aish:fast]> :models [aish] models (active: fast): deep mistral-nemo-12b-instruct @ http://hossenfelder.fritz.box:8082 * fast qwen2.5-coder-1.5b-q4_k_m.gguf @ http://hossenfelder.fritz.box:8082 cloud anthropic/claude-haiku-4.5 @ http://hossenfelder.fritz.box:8082 [aish:fast]> Count from one to fifty, space-separated, no other words. CMD: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 execute '1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50'? [y/N] n deep model presented the thinking too - but all at once.
Author
Collaborator

Triage: this is an upstream proxy bug, not an aish bug.

Wire-level evidence

Running the same request against the proxy directly with timestamped stdin, every SSE event arrives in a single batch:

$ curl -N -d '{"model":"mistral-nemo-12b-instruct","stream":true,...}' \
      http://hossenfelder.fritz.box:8082/v1/chat/completions | ts

t= 8.551s  data: {"choices":[{"delta":{"role":"assistant"}}],...,"created":1778535796}
t= 8.551s  data: {"choices":[{"delta":{"content":"1"}}],...,"created":1778535796}
t= 8.551s  data: {"choices":[{"delta":{"content":"."}}],...,"created":1778535796}
...
t= 8.551s  data: {"choices":[{"delta":{"content":"0"}}],...,"created":1778535804}
t= 8.551s  data: [DONE]

All events stamped t=8.551s on the receiving side, but the created timestamps inside the JSON span 1778535796..1778535804 — so the upstream llama.cpp did generate incrementally over ~8 s; the proxy held all of it before flushing.

Response headers

HTTP/1.0 200 OK
Server: BaseHTTP/0.6 Python/3.12.12
Content-Type: text/event-stream
Content-Length: 6147
Connection: close

Content-Length on a text/event-stream body is the smoking gun: Python's BaseHTTPRequestHandler is collecting the entire upstream stream into a buffer, then sending response with a fixed Content-Length. SSE is supposed to be chunked transfer-encoding or close-on-EOF with progressive wfile.write()+flush().

aish-side verification

  • ffi/curl.post_sse parses SSE events as they arrive via WRITEFUNCTION — confirmed correct (fast model with the same proxy round-trip would have looked the same if there were a client buffer).
  • broker.chat_stream decodes per-event and invokes on_delta immediately.
  • renderer.assistant_delta calls io.write()+io.flush() per delta.

Nothing on the aish side accumulates.

Side finding (separate)

The response chunks come back with "model":"qwen2.5-coder-1.5b-q4_k_m.gguf" even when the request specified mistral-nemo-12b-instruct. Either the proxy is routing all traffic to the loaded model regardless of model field, or only one model was loaded at probe time. Worth a separate issue against the boltzmann proxy.

Resolution

Fix lives in the boltzmann proxy on hossenfelder (Python BaseHTTPRequestHandler SSE handler — switch to chunked transfer-encoding, drop Content-Length, flush after every data:\\n\\n write). Not actionable in this repo.

Proposing this issue be re-tagged from test-case to bug and either (a) re-scoped as a tracking issue for the proxy fix landing, or (b) closed here and re-filed against the boltzmann repo. Your call.

Triage: this is an **upstream proxy bug**, not an aish bug. ### Wire-level evidence Running the same request against the proxy directly with timestamped stdin, every SSE event arrives in a single batch: ``` $ curl -N -d '{"model":"mistral-nemo-12b-instruct","stream":true,...}' \ http://hossenfelder.fritz.box:8082/v1/chat/completions | ts t= 8.551s data: {"choices":[{"delta":{"role":"assistant"}}],...,"created":1778535796} t= 8.551s data: {"choices":[{"delta":{"content":"1"}}],...,"created":1778535796} t= 8.551s data: {"choices":[{"delta":{"content":"."}}],...,"created":1778535796} ... t= 8.551s data: {"choices":[{"delta":{"content":"0"}}],...,"created":1778535804} t= 8.551s data: [DONE] ``` All events stamped t=8.551s on the receiving side, but the `created` timestamps inside the JSON span 1778535796..1778535804 — so the upstream llama.cpp *did* generate incrementally over ~8 s; the proxy held all of it before flushing. ### Response headers ``` HTTP/1.0 200 OK Server: BaseHTTP/0.6 Python/3.12.12 Content-Type: text/event-stream Content-Length: 6147 Connection: close ``` `Content-Length` on a `text/event-stream` body is the smoking gun: Python's `BaseHTTPRequestHandler` is collecting the entire upstream stream into a buffer, then sending response with a fixed Content-Length. SSE is supposed to be chunked transfer-encoding or close-on-EOF with progressive `wfile.write()+flush()`. ### aish-side verification - `ffi/curl.post_sse` parses SSE events as they arrive via `WRITEFUNCTION` — confirmed correct (fast model with the same proxy round-trip would have looked the same if there were a client buffer). - `broker.chat_stream` decodes per-event and invokes `on_delta` immediately. - `renderer.assistant_delta` calls `io.write()+io.flush()` per delta. Nothing on the aish side accumulates. ### Side finding (separate) The response chunks come back with `"model":"qwen2.5-coder-1.5b-q4_k_m.gguf"` even when the request specified `mistral-nemo-12b-instruct`. Either the proxy is routing all traffic to the loaded model regardless of `model` field, or only one model was loaded at probe time. Worth a separate issue against the boltzmann proxy. ### Resolution Fix lives in the boltzmann proxy on hossenfelder (Python `BaseHTTPRequestHandler` SSE handler — switch to chunked transfer-encoding, drop `Content-Length`, flush after every `data:\\n\\n` write). Not actionable in this repo. Proposing this issue be re-tagged from `test-case` to `bug` and either (a) re-scoped as a tracking issue for the proxy fix landing, or (b) closed here and re-filed against the boltzmann repo. Your call.
Owner

Now seeing streamed responses.

Now seeing streamed responses.
Sign in to join this conversation.