lmcp-hub serializes all requests; a long probe blocks the whole endpoint #1
Reference in New Issue
Block a user
Delete Branch "%!s()"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Originally tracked as
7d00216inhis:todoson DokuWiki; migrating to Gitea per canonical-channel decision.Symptom
The Lua hub's main loop is
accept()→ handle synchronously → loop. A singleremote_list_hosts force=truecall re-probes every backend serially (lmcp timeout + ssh timeout per DOWN host, ~16 s each, ~2 min across the current 8 DOWN hosts). During that window every other incoming request queues at TCP and the client times out before the hub gets to it.Observed while fixing the sibling
2f41f2dbug (meitner param-name mismatch) — aforce=truere-probe left the hub unresponsive to subsequenttools/listcalls for the duration of the probe.Cause
Not a bug in any one tool — a hot-path design limitation of
lmcp's single-threaded HTTP server:lmcp.lua'srun()loop isserver:accept()→serve_request()→ return.serve_request()dispatches the JSON-RPC synchronously; a tool handler that does slow I/O (network probes to every backend, say) holds the handler for the full probe duration.Fixes worth considering
remote_list_hosts— independent backend probes are embarrassingly parallel; force-probe would return in slowest-backend time, not sum-of-backends time.(1) has the highest leverage and is the smallest change to the hub; (2) benefits every tool but is a wider refactor.
Priority
Low.
force=trueis rarely needed; steady-state polling hits the 30 s cache and returns in ~100 ms. Raise priority if vitruvius or another client starts to misbehave during cold-start or stale-cache windows.Already fixed in v0.5.3 (commit
17af91a, "hub hardening — hard ssh timeout, parallel probes, sticky DOWN cache"), tagged 2026-04-21 — one day after this issue was filed.The fan-out is in
probe_all_parallel(hub.lua:273): for every backend with anlmcp_url, the hub builds one bash script of(curl … &)per backend +wait, runs it throughio.popen, and parses results. Wall-clock is bounded byPROBE_BUDGET(default 3 s, envLMCP_HUB_PROBE_BUDGET), not by sum-of-backends.remote_list_hostscallsprobe_all_parallel(force)directly — the original 8-DOWN × 16 s = ~2 min cascade is gone.Two things this does not fix, intentionally:
lmcp.luarun()— any slow synchronous handler still blocks the accept loop. Coroutine dispatch (option 2 in the original issue) remains a wider refactor; reopen if a real client misbehaves on cold-start.lmcp_url) are not background-probed at all — they only get probed lazily on direct lookup. Acceptable today; revisit when the registry has more SSH-only entries.Closing as fixed.