-
v0.5.3: hub hardening — hard ssh timeout, parallel probes, sticky DOWN cache
released this
2026-04-21 11:58:44 +00:00 | 9 commits to master since this releaseThree fixes addressing the recurring "hub wedges on offline backends" class
of failures (2 incidents in 24h, root-caused to single-threaded Lua + an
uninterruptible os.execute ssh call):-
Hard wall-clock cap on ssh fallback via GNU
timeout --kill-after=2 30.
ConnectTimeout alone only bounds TCP connect; a half-dead sshd (auth
stall, remote bash-s hang) used to freeze the whole event loop
indefinitely. Configurable via LMCP_HUB_SSH_HARD_TIMEOUT. Also adds
ServerAliveInterval=5/Count=2 so an established-but-dead tunnel dies. -
Parallel lmcp probes for remote_list_hosts. Shells out a single bash
fan-out of curl -m 3 calls, bounded by PROBE_BUDGET. Wall clock for a
full 12-backend probe went from ~28 s (sum of per-host ssh connect
timeouts) to ~3 s. -
Probe is lmcp-only — ssh is no longer used as health check. The hub
exists to absorb lots of offline hosts, so an expensive ssh per probe
was the exact wrong tradeoff. Actual remote_* tool calls still fall
through to ssh fallback when lmcp is down. -
Sticky DOWN cache with exponential backoff: 60 → 120 → 240 → 480 →
900 s. Prevents a sleeping fleet from burning probe budget on every
health check. UP hosts still use 30 s TTL. Tunable via
LMCP_HUB_PROBE_TTL_{UP,DOWN_MIN,DOWN_MAX}. -
Per-request logging to stderr (tool, host, via, elapsed) — invisible
before, now captured in journal for the next hang's RCA.
Co-Authored-By: Claude Opus 4.7 (1M context) noreply@anthropic.com
Downloads
-