Add fetch tool — HTTP GET with bounded output and optional HTML→plain rendering #3

Closed
opened 2026-05-17 15:09:01 +00:00 by claude-noether · 1 comment
Collaborator

Goal

First-class web-fetch primitive in lmcp so MCP clients (notably aish) can pull URLs without going through the generic shell tool. Cleaner schema for the model to introspect, safer (no shell-injection surface), and the byte-cap / render mode are part of the contract rather than ad-hoc pipe stages.

Motivation

Web research is currently possible via shell + curl … | pandoc -f html -t plain | head -c 50000. Three problems with that:

  1. Schema opacity — the model sees "execute a shell command". A typed fetch tool with url + render + max_bytes parameters is much easier for the model to pick correctly, and surfaces in :mcp tools as a discrete capability.
  2. Output unbounded by design — the user/model has to remember to chain head -c N. Easy to forget; OOM-grade pages then poison context.
  3. Renderer availability is per-host — pandoc / lynx may or may not be installed on the lmcp box. A fetch handler can pick whatever is available (or shell out internally) and present one consistent interface to the client.

API sketch

tool: fetch
params:
  url        : string (required) — http(s) URL
  method     : string (default "GET") — GET | HEAD only in v1; POST deferred
  render     : string (default "plain") — "plain" | "html" | "raw"
               plain = HTML→text via pandoc/lynx/w3m if available, fallback raw
               html  = strip <script>/<style> only, preserve markup
               raw   = bytes verbatim
  max_bytes  : integer (default 65536) — body cap; output truncated with marker
  timeout_s  : integer (default 20) — total request timeout
  user_agent : string (default "lmcp-fetch/<version>") — UA override
returns:
  on success: { status: int, content_type: string, bytes_read: int, body: string, truncated: bool }
  on failure: error string (curl-style)

Follow redirects by default (up to 5). Reject file:// / gopher:// / non-http(s) schemes. No cookie persistence.
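
As a rough illustration of the contract above, the sketch below applies the listed defaults and rejects non-http(s) schemes before any request is made. It is written in Python purely for readability. The function name `validate_request` and the exact error messages are hypothetical and not taken from lmcp.

```python
from urllib.parse import urlsplit

ALLOWED_SCHEMES = {"http", "https"}
ALLOWED_METHODS = {"GET", "HEAD"}          # POST deferred, per the sketch
ALLOWED_RENDERS = {"plain", "html", "raw"}


def validate_request(params: dict) -> dict:
    """Apply the defaults from the API sketch and reject unsupported values.

    Hypothetical helper: names and messages are illustrative only.
    """
    url = params.get("url")
    if not isinstance(url, str) or not url:
        raise ValueError("url is required")

    parts = urlsplit(url)
    # file://, gopher:// and scheme-less strings all fail this check.
    if parts.scheme.lower() not in ALLOWED_SCHEMES:
        raise ValueError(f"unsupported scheme: {parts.scheme or '(none)'}")
    if not parts.hostname:
        raise ValueError("url has no host")

    method = str(params.get("method", "GET")).upper()
    if method not in ALLOWED_METHODS:
        raise ValueError(f"unsupported method: {method}")

    render = params.get("render", "plain")
    if render not in ALLOWED_RENDERS:
        raise ValueError(f"unsupported render mode: {render}")

    max_bytes = params.get("max_bytes", 65536)
    timeout_s = params.get("timeout_s", 20)
    if not isinstance(max_bytes, int) or max_bytes <= 0:
        raise ValueError("max_bytes must be a positive integer")
    if not isinstance(timeout_s, int) or timeout_s <= 0:
        raise ValueError("timeout_s must be a positive integer")

    return {
        "url": url,
        "method": method,
        "render": render,
        "max_bytes": max_bytes,
        "timeout_s": timeout_s,
    }
```

Calling `validate_request({"url": "https://example.com/"})` returns the request with every default filled in, while `validate_request({"url": "file:///etc/passwd"})` raises `ValueError`.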

Implementation notes

  • lmcp already shells out (shell tool), so handler can be curl -sSL --max-time T -A UA -w '\n__HTTP_STATUS__=%{http_code}\n__CONTENT_TYPE__=%{content_type}\n' … and post-process the body. No new C dependency.
  • For render="plain": try pandoc -f html -t plain first, then lynx -stdin -dump -nolist, then w3m -dump -T text/html, then raw. Cache which renderer worked across calls (per-process).
  • Cap fetch at max_bytes before rendering — curl --max-filesize plus a Lua-side defensive trim.
  • Return a structured table (lmcp already encodes returns through json.lua), not just a string, so the model can see the status code without parsing.
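
The post-processing step described in these notes could look roughly like the sketch below: split off the trailer that the `-w` format string appends, trim the body to `max_bytes`, and return a structured result. It is Python for illustration only. The names `split_curl_output` and `build_result`, the marker text, and the choice to report `bytes_read` as the number of body bytes received before trimming are all assumptions.

```python
import re

# Matches the trailer produced by the -w format string quoted in the notes.
# Anchored at the end so marker-like text inside the body is not mistaken for it.
TRAILER_RE = re.compile(
    rb"\n__HTTP_STATUS__=(\d{3})\n__CONTENT_TYPE__=([^\n]*)\n\Z"
)

# Hypothetical marker text; the issue only says "truncated with marker".
TRUNCATION_MARKER = "\n[truncated]"


def split_curl_output(raw: bytes):
    """Separate the response body from the status/content-type trailer."""
    match = TRAILER_RE.search(raw)
    if match is None:
        raise ValueError("curl trailer not found")
    body = raw[: match.start()]
    status = int(match.group(1))
    content_type = match.group(2).decode("latin-1")
    return body, status, content_type


def build_result(raw: bytes, max_bytes: int) -> dict:
    """Trim the body defensively and build the structured return value."""
    body, status, content_type = split_curl_output(raw)
    truncated = len(body) > max_bytes
    # Slicing bytes can cut a multi-byte character; errors="replace" absorbs that.
    text = body[:max_bytes].decode("utf-8", errors="replace")
    if truncated:
        text += TRUNCATION_MARKER
    return {
        "status": status,
        "content_type": content_type,
        "bytes_read": len(body),  # assumption: counted before trimming
        "body": text,
        "truncated": truncated,
    }
```

Trimming happens on bytes before decoding, so the cap holds regardless of the page's character encoding.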

Priority

Medium. The shell-based workaround exists, but every aish session that wants web research re-derives the same curl | pandoc | head incantation, and the model picks the wrong cap roughly half the time. Worth ~half a day to land cleanly.

Out of scope (defer to a follow-up)

  • POST/PUT/DELETE — only GET/HEAD in v1.
  • Cookie jars / sessions.
  • JavaScript rendering (would pull in headless Chromium — different tool entirely).
  • Authenticated fetches (Bearer / Basic). Reasonable v1.1 add via an auth param; left out of v1 to keep the surface small.
Author
Collaborator

Implemented in server.lua. fetch tool: HTTP GET/HEAD via curl with --max-filesize (mid-stream cap), --max-time, structured {ok, status, content_type, bytes_read, body, truncated, renderer, error?} return, renderer chain pandoc → lynx → w3m → pure-Lua strip, RFC-3986 URL whitelist. Verified live across 12 cases incl. truncation, transport failures, 404, HEAD, malformed URLs.
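
For readers who have not opened server.lua, the renderer chain can be pictured roughly as follows. This is a Python sketch, not the Lua code that landed. The command lines are the ones listed in the issue; the function names, the 10-second subprocess timeout, and the regex-based fallback standing in for the pure-Lua strip are assumptions.

```python
import html
import re
import shutil
import subprocess

# Command lines taken from the issue text, tried in this order.
RENDERERS = [
    ("pandoc", ["pandoc", "-f", "html", "-t", "plain"]),
    ("lynx", ["lynx", "-stdin", "-dump", "-nolist"]),
    ("w3m", ["w3m", "-dump", "-T", "text/html"]),
]


def strip_tags(markup: str) -> str:
    """Last-resort fallback: drop script/style blocks, then all tags."""
    text = re.sub(r"(?is)<(script|style)\b.*?</\1\s*>", " ", markup)
    text = re.sub(r"(?s)<[^>]*>", " ", text)
    text = html.unescape(text)
    return re.sub(r"\s+", " ", text).strip()


def render_plain(markup: str, which=shutil.which):
    """Return (text, renderer_name), trying each external tool in turn.

    `which` is injectable so the lookup can be stubbed out in tests.
    """
    for name, argv in RENDERERS:
        if not which(argv[0]):
            continue  # tool not installed on this host
        try:
            proc = subprocess.run(
                argv,
                input=markup.encode("utf-8"),
                capture_output=True,
                timeout=10,   # assumed limit, not from the issue
                check=True,
            )
        except (OSError, subprocess.SubprocessError):
            continue  # tool failed or timed out; fall through to the next one
        return proc.stdout.decode("utf-8", errors="replace"), name
    return strip_tags(markup), "strip"
```

Returning the renderer name alongside the text mirrors the `renderer` field in the structured return, so a caller can tell which path produced the output.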

Also patched json.lua to combine UTF-16 surrogate pairs (issue #4 context — same change-cycle benefits both tools).
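
For context on the surrogate-pair change: JSON escapes characters outside the Basic Multilingual Plane as two `\uXXXX` units, which a decoder has to merge into one code point. The Python sketch below shows the arithmetic. It is not the json.lua patch itself, and substituting U+FFFD for a lone surrogate is an assumption about one reasonable policy.

```python
def combine_surrogates(hi: int, lo: int) -> int:
    """Merge a UTF-16 high/low surrogate pair into one Unicode code point."""
    if not 0xD800 <= hi <= 0xDBFF:
        raise ValueError("not a high surrogate")
    if not 0xDC00 <= lo <= 0xDFFF:
        raise ValueError("not a low surrogate")
    return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)


def decode_utf16_units(units: list) -> str:
    """Turn a list of 16-bit code units (as found in \\uXXXX escapes) into text."""
    out = []
    i = 0
    while i < len(units):
        unit = units[i]
        has_next = i + 1 < len(units)
        if 0xD800 <= unit <= 0xDBFF and has_next and 0xDC00 <= units[i + 1] <= 0xDFFF:
            out.append(chr(combine_surrogates(unit, units[i + 1])))
            i += 2
        elif 0xD800 <= unit <= 0xDFFF:
            out.append("\ufffd")  # assumed policy for an unpaired surrogate
            i += 1
        else:
            out.append(chr(unit))
            i += 1
    return "".join(out)
```

For example, the escape sequence `\uD83D\uDE00` corresponds to the units `0xD83D, 0xDE00`, which combine to U+1F600.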

Memory: project_runtime_lua.md captures the LuaJIT vs Lua 5.4 os.execute portability rule the renderer probe relies on.


Reference: marfrit/lmcp#3