lmcp-hub serializes all requests; a long probe blocks the whole endpoint #1

Closed
opened 2026-04-20 09:28:32 +00:00 by marfrit · 1 comment
Owner

Originally tracked as 7d00216 in his:todos on DokuWiki; migrating to Gitea per canonical-channel decision.

Symptom

The Lua hub's main loop is accept() → handle synchronously → loop. A single remote_list_hosts force=true call re-probes every backend serially (lmcp timeout + ssh timeout per DOWN host, ~16 s each, ~2 min across the current 8 DOWN hosts). During that window every other incoming request waits in the TCP listen backlog, and the client times out before the hub gets to it.

Observed while fixing the sibling 2f41f2d bug (meitner param-name mismatch) — a force=true re-probe left the hub unresponsive to subsequent tools/list calls for the duration of the probe.

Cause

Not a bug in any one tool — a hot-path design limitation of lmcp's single-threaded HTTP server:

  • lmcp.lua's run() loop is server:accept() → serve_request() → return.
  • serve_request() dispatches the JSON-RPC synchronously; a tool handler that does slow I/O (network probes to every backend, say) blocks the accept loop for the full probe duration.
  • No request-level concurrency, no async, no coroutine dispatch.

Fixes worth considering

  1. Background probe thread — periodically refresh the per-backend cache so the hot path is always O(cache lookup), independent of individual backend latency.
  2. Coroutine dispatch per request — still single-threaded but lets slow handlers yield; new requests can be accepted and dispatched while others wait for I/O.
  3. Parallelise the probe loop inside remote_list_hosts — independent backend probes are embarrassingly parallel; force-probe would return in slowest-backend time, not sum-of-backends time.

(1) has the highest leverage and is the smallest change to the hub; (2) benefits every tool but is a wider refactor.
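
To make the sum-versus-slowest distinction in option 3 concrete, here is a small arithmetic sketch. The function names are made up for illustration and do not exist in lmcp; the inputs are the rough per-host timeouts quoted in the Symptom section.

```bash
# Illustrative only: these helpers are not part of lmcp.

# Worst-case wall-clock for a serial probe loop: the sum of per-backend latencies.
serial_wall_clock() {
    local total=0 t
    for t in "$@"; do
        total=$((total + t))
    done
    echo "$total"
}

# Worst-case wall-clock when all probes run concurrently: the slowest single probe.
parallel_wall_clock() {
    local max=0 t
    for t in "$@"; do
        if [ "$t" -gt "$max" ]; then max=$t; fi
    done
    echo "$max"
}

# Eight DOWN hosts at ~16 s each.
serial_wall_clock 16 16 16 16 16 16 16 16     # prints 128 (about 2 min)
parallel_wall_clock 16 16 16 16 16 16 16 16   # prints 16
```

With equal timeouts the parallel case saves a factor of eight here; with uneven latencies the saving depends on how dominant the slowest backend is.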

Priority

Low. force=true is rarely needed; steady-state polling hits the 30 s cache and returns in ~100 ms. Raise priority if vitruvius or another client starts to misbehave during cold-start or stale-cache windows.

Author
Owner

Already fixed in v0.5.3 (commit 17af91a, "hub hardening — hard ssh timeout, parallel probes, sticky DOWN cache"), tagged 2026-04-21 — one day after this issue was filed.

The fan-out is in probe_all_parallel (hub.lua:273): for every backend with an lmcp_url, the hub builds one bash script of (curl … &) per backend + wait, runs it through io.popen, and parses results. Wall-clock is bounded by PROBE_BUDGET (default 3 s, env LMCP_HUB_PROBE_BUDGET), not by sum-of-backends.

remote_list_hosts calls probe_all_parallel(force) directly — the original 8-DOWN × 16 s = ~2 min cascade is gone.
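
For readers without hub.lua at hand, the general shape of such a fan-out might look like the sketch below. This is not the script the hub generates: the shell functions, the backend names and URLs in the usage line, and the choice to pass the budget to curl as `--max-time` are all assumptions made for illustration. Only the `(curl … &)` plus `wait` pattern and the `LMCP_HUB_PROBE_BUDGET` variable come from the description above.

```bash
# Sketch only: not the script hub.lua builds. Function names and the use of
# --max-time as the budget mechanism are assumptions.

PROBE_BUDGET="${LMCP_HUB_PROBE_BUDGET:-3}"   # seconds; default of 3 per the comment above

# Probe one backend and print "<name> UP" or "<name> DOWN".
probe_one() {
    local name="$1" url="$2"
    if curl --silent --fail --output /dev/null --max-time "$PROBE_BUDGET" "$url"; then
        echo "$name UP"
    else
        echo "$name DOWN"
    fi
}

# Start one background probe per "name=url" argument, then wait for all of them.
# Total wall-clock is roughly the slowest probe, capped by PROBE_BUDGET.
probe_all_parallel() {
    local entry
    for entry in "$@"; do
        probe_one "${entry%%=*}" "${entry#*=}" &
    done
    wait
}

# Hypothetical usage (backend names and URLs are placeholders):
#   probe_all_parallel alpha=http://alpha.example:8080/ beta=http://beta.example:8080/
```

Result lines arrive in completion order rather than argument order, so a caller that parses them should key on the backend name and not on position.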

Two things this does not fix, intentionally:

  1. The wider single-thread limitation in lmcp.lua run() — any slow synchronous handler still blocks the accept loop. Coroutine dispatch (option 2 in the original issue) remains a wider refactor; reopen if a real client misbehaves on cold-start.
  2. SSH-only backends (no lmcp_url) are not background-probed at all — they only get probed lazily on direct lookup. Acceptable today; revisit when the registry has more SSH-only entries.

Closing as fixed.


Reference: marfrit/lmcp#1