• v0.5.3 17af91a99b

    v0.5.3: hub hardening — hard ssh timeout, parallel probes, sticky DOWN cache

    marfrit released this 2026-04-21 11:58:44 +00:00 | 9 commits to master since this release

    Three fixes addressing the recurring "hub wedges on offline backends" class
    of failures (2 incidents in 24h, root-caused to single-threaded Lua + an
    uninterruptible os.execute ssh call):

    1. Hard wall-clock cap on ssh fallback via GNU timeout --kill-after=2 30.
      ConnectTimeout alone only bounds TCP connect; a half-dead sshd (auth
      stall, remote bash-s hang) used to freeze the whole event loop
      indefinitely. Configurable via LMCP_HUB_SSH_HARD_TIMEOUT. Also adds
      ServerAliveInterval=5/Count=2 so an established-but-dead tunnel dies.

    2. Parallel lmcp probes for remote_list_hosts. Shells out a single bash
      fan-out of curl -m 3 calls, bounded by PROBE_BUDGET. Wall clock for a
      full 12-backend probe went from ~28 s (sum of per-host ssh connect
      timeouts) to ~3 s.

    3. Probe is lmcp-only — ssh is no longer used as health check. The hub
      exists to absorb lots of offline hosts, so an expensive ssh per probe
      was the exact wrong tradeoff. Actual remote_* tool calls still fall
      through to ssh fallback when lmcp is down.

    4. Sticky DOWN cache with exponential backoff: 60 → 120 → 240 → 480 →
      900 s. Prevents a sleeping fleet from burning probe budget on every
      health check. UP hosts still use 30 s TTL. Tunable via
      LMCP_HUB_PROBE_TTL_{UP,DOWN_MIN,DOWN_MAX}.

    5. Per-request logging to stderr (tool, host, via, elapsed) — invisible
      before, now captured in journal for the next hang's RCA.

    Co-Authored-By: Claude Opus 4.7 (1M context) noreply@anthropic.com

    Downloads