Add web_search tool — pluggable backend (SearXNG / DDG-html / Tavily / Brave) #4

Closed
opened 2026-05-17 15:09:22 +00:00 by claude-noether · 1 comment
Collaborator

Goal

First-class web-search primitive in lmcp returning [{title, url, snippet}]. Pairs with the proposed fetch tool (issue #3) — search yields URLs, fetch reads them.

Motivation

aish (and any LLM-driven MCP client) needs a search step before it can use fetch for anything that isn't a URL the user already typed. Today there is no way to search from inside aish; the user has to leave the shell, search, and paste a URL back. A structured web_search tool inside lmcp closes that loop.

Unlike fetch, search is not implementable purely with curl + post-processing — it requires a search backend. So this is a real new capability rather than a shell-wrapping convenience.

API sketch

tool: web_search
params:
  query        : string (required)
  max_results  : integer (default 8, hard cap 25)
  region       : string (default "") — backend-specific; e.g. "de-de", "us-en"
  time_range   : string (default "") — "" | "day" | "week" | "month" | "year"
  safesearch   : string (default "moderate") — "off" | "moderate" | "strict"
returns:
  { backend: string, results: [ { title, url, snippet, age? } ] }
  on failure: error string

Results are ranked by the upstream backend; lmcp does no re-ranking.
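To make the parameter rules in the sketch concrete, here is a minimal validation sketch in Python (lmcp itself is Lua, so this is illustration only). The names `SearchParams` and `normalize_params` are hypothetical, and clamping an out-of-range `max_results` instead of rejecting it is an assumption; the issue only states the default and the hard cap.

```python
from dataclasses import dataclass

DEFAULT_MAX_RESULTS = 8
HARD_CAP = 25
TIME_RANGES = {"", "day", "week", "month", "year"}
SAFESEARCH_LEVELS = {"off", "moderate", "strict"}


@dataclass(frozen=True)
class SearchParams:
    query: str
    max_results: int = DEFAULT_MAX_RESULTS
    region: str = ""
    time_range: str = ""
    safesearch: str = "moderate"


def normalize_params(raw: dict) -> SearchParams:
    """Validate a raw params dict against the API sketch; raise ValueError on bad input."""
    query = str(raw.get("query", "")).strip()
    if not query:
        raise ValueError("query is required")

    max_results = int(raw.get("max_results", DEFAULT_MAX_RESULTS))
    # Assumption: out-of-range values are clamped into [1, HARD_CAP], not rejected.
    max_results = max(1, min(max_results, HARD_CAP))

    time_range = raw.get("time_range", "")
    if time_range not in TIME_RANGES:
        raise ValueError(f"invalid time_range: {time_range!r}")

    safesearch = raw.get("safesearch", "moderate")
    if safesearch not in SAFESEARCH_LEVELS:
        raise ValueError(f"invalid safesearch: {safesearch!r}")

    # region is backend-specific, so it is passed through unchecked.
    return SearchParams(query, max_results, raw.get("region", ""), time_range, safesearch)
```

Calling `normalize_params({"query": "lua json", "max_results": 100})` would yield `max_results == 25` under the clamping assumption.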

Backend options (configurable at server start)

  1. SearXNG: recommended default backend for self-hosters. GET <instance>/search?format=json&q=… (the instance must have the json output format enabled in its settings). Free, no API key, no rate ceiling beyond the instance's own. mfritsche could host one at e.g. searx.fritz.box and point lmcp at it via LMCP_SEARXNG_URL.
  2. DDG-html scrape — fallback when no SearXNG / API key is available. POST to https://html.duckduckgo.com/html/ and parse the result. Brittle (DDG changes the HTML occasionally) but zero-config. Useful so the tool works out of the box even without setup.
  3. Tavily (https://api.tavily.com/search) — API key, paid past free tier, returns JSON natively. Quality is good; cost is real.
  4. Brave Search API (https://api.search.brave.com/res/v1/web/search) — API key, free tier exists. JSON native.
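As an illustration of the per-backend work for the SearXNG option, the sketch below builds the request URL and maps a JSON response onto the common `{title, url, snippet}` shape. It is written in Python purely for illustration. The helper names are hypothetical, `searx.example` is a placeholder host, and the `results[].content` field and the numeric `safesearch` levels reflect my understanding of SearXNG's search API, so they should be checked against the instance in use.

```python
from urllib.parse import urlencode

# Assumed mapping from the tool's safesearch levels to SearXNG's numeric levels.
SEARXNG_SAFESEARCH = {"off": 0, "moderate": 1, "strict": 2}


def searxng_url(base_url: str, query: str, safesearch: str = "moderate") -> str:
    """Build the search URL for a SearXNG instance, requesting JSON output."""
    qs = urlencode({
        "q": query,
        "format": "json",
        "safesearch": SEARXNG_SAFESEARCH[safesearch],
    })
    return f"{base_url.rstrip('/')}/search?{qs}"


def normalize_searxng(payload: dict, max_results: int = 8) -> list:
    """Map a decoded SearXNG JSON payload onto the common result shape."""
    results = []
    for item in payload.get("results", []):
        url = item.get("url")
        if not url:
            continue  # a result without a URL is useless to fetch
        results.append({
            "title": item.get("title", ""),
            "url": url,
            "snippet": item.get("content", ""),
        })
        if len(results) >= max_results:
            break
    return results
```

The other backends would need their own equivalent of `normalize_searxng`; only the field names on the upstream side change.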

Backend selected by env at lmcp start:

LMCP_SEARCH_BACKEND=searxng|ddg|tavily|brave   # default: searxng if URL set, else ddg
LMCP_SEARXNG_URL=https://searx.fritz.box
LMCP_TAVILY_API_KEY=…
LMCP_BRAVE_API_KEY=…

If the explicitly selected backend isn't configured (missing URL or key), the tool surfaces a clean error rather than silently falling back, so the operator always knows which backend is being used.
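The selection rule can be sketched as follows, in Python for illustration only. The env variable names come from the block above; `resolve_backend` and `BackendConfigError` are hypothetical names, and rejecting an unknown backend name is an assumption the issue does not spell out.

```python
# Env variable each backend needs before it can be used (None means zero-config).
REQUIRED_ENV = {
    "searxng": "LMCP_SEARXNG_URL",
    "ddg": None,
    "tavily": "LMCP_TAVILY_API_KEY",
    "brave": "LMCP_BRAVE_API_KEY",
}


class BackendConfigError(Exception):
    """Raised when the selected backend is unknown or lacks its URL or key."""


def resolve_backend(env: dict) -> str:
    """Pick the backend from an env-like mapping, without silent fallback."""
    explicit = env.get("LMCP_SEARCH_BACKEND", "").strip().lower()
    if not explicit:
        # Default from the issue: searxng if a URL is set, else ddg.
        return "searxng" if env.get("LMCP_SEARXNG_URL") else "ddg"
    if explicit not in REQUIRED_ENV:
        raise BackendConfigError(f"unknown backend {explicit!r}")
    needed = REQUIRED_ENV[explicit]
    if needed and not env.get(needed):
        # An explicit choice that is misconfigured is an error, never a fallback.
        raise BackendConfigError(f"backend {explicit!r} selected but {needed} is not set")
    return explicit
```

In a real process this would be called with `os.environ`; taking a plain mapping keeps the rule easy to test.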

Implementation notes

  • All four backends are reached over HTTPS; three return JSON natively and DDG goes through a thin HTML→JSON parser. The handler is curl -sS + json.lua + a small per-backend normalization to the common result shape.
  • Pre-flight check during lmcp.new() startup: log which backend is active and whether config is complete; don't error out (operator may add the tool but not configure it yet — that's fine, errors land at call time).
  • Snippets should be plain text with HTML entities decoded; truncate to ~280 chars to keep returned payloads small.
  • Honour safesearch — pass-through to backends that support it (SearXNG, DDG, Brave); silently ignore on backends that don't (Tavily).
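The snippet rule above (plain text, entities decoded, roughly 280 characters) might look like this minimal Python sketch. `clean_snippet` is a hypothetical name, and the regex-based tag stripping and the trailing ellipsis on truncation are assumptions, not something the issue specifies.

```python
import html
import re

_TAG_RE = re.compile(r"<[^>]+>")
_WS_RE = re.compile(r"\s+")


def clean_snippet(raw: str, limit: int = 280) -> str:
    """Return a plain-text snippet of at most `limit` characters."""
    # Strip markup first, then decode entities, so an escaped "&lt;b&gt;"
    # survives as literal text instead of being removed as a tag.
    text = _TAG_RE.sub("", raw)
    text = html.unescape(text)
    text = _WS_RE.sub(" ", text).strip()
    if len(text) <= limit:
        return text
    # Assumption: mark truncation with a single ellipsis character.
    return text[: limit - 1].rstrip() + "…"
```

For example, `clean_snippet("<b>Lua</b> &amp; JSON\n  parsing")` returns `"Lua & JSON parsing"`.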

Priority

Medium-low. fetch (issue #3) is the higher-leverage one — once that lands, the model can do research on URLs the user pastes. web_search is needed for the autonomous flow where the user just says "find me an example of X" and Norris mode goes off and gathers URLs itself.

Out of scope

  • News-specific / image / video search verticals.
  • Result caching (re-issued identical queries hit upstream every time). Operators can put a caching proxy such as nginx in front of the backend if needed.
  • Backend round-robin / multiplexing — one backend per lmcp instance in v1.

Related

  • Issue #3 (lmcp fetch tool) — pairs with this; search→fetch is the canonical flow.
Author
Collaborator

Implemented in server.lua as a web_search tool with pluggable backends: an explicit LMCP_SEARCH_BACKEND env wins; otherwise it auto-picks the first configured of SEARXNG_URL/TAVILY_API_KEY/BRAVE_API_KEY; with none set it falls back to zero-config DDG-HTML. Results come back in a structured {ok, backend, query, results:[{title,url,snippet,age?}], error?} envelope.
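A rough Python rendering of that envelope, for illustration only (the real code lives in server.lua and is not shown in this thread). `run_search` is a hypothetical wrapper, and returning an empty `results` list alongside `error` on failure is an assumption.

```python
from typing import Callable


def run_search(backend: str, query: str, search_fn: Callable[[str], list]) -> dict:
    """Call a backend function and wrap its outcome in the structured envelope."""
    envelope = {"ok": True, "backend": backend, "query": query, "results": []}
    try:
        envelope["results"] = search_fn(query)
    except Exception as exc:
        # Failures become data in the envelope, so the caller always sees one shape.
        envelope["ok"] = False
        envelope["error"] = str(exc)
    return envelope
```

A caller can then branch on `ok` alone and only look at `error` when it is false.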

DDG parser: per-block iteration (avoids title↔snippet mispair); per-block URL unwrap from uddg=; drops results with un-decodable href; surfaces anti-bot 202 as structured {ok=false, error="ddg parser matched no results"} rather than silent empty list.
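The URL unwrap and the "no silent empty list" behaviour described above could be sketched like this in Python. The function names are hypothetical, and the redirect form `//duckduckgo.com/l/?uddg=<encoded target>` is an assumption based on the `uddg=` parameter named in the comment; the actual Lua parser may differ.

```python
from typing import Optional
from urllib.parse import parse_qs, urlsplit


def unwrap_ddg_href(href: str) -> Optional[str]:
    """Return the target URL behind a DDG redirect href, or None if undecodable."""
    if href.startswith("//"):
        href = "https:" + href  # protocol-relative link
    parts = urlsplit(href)
    if parts.netloc.endswith("duckduckgo.com") and parts.path.startswith("/l/"):
        targets = parse_qs(parts.query).get("uddg")  # parse_qs percent-decodes
        if not targets:
            return None
        target = targets[0]
    else:
        target = href
    return target if urlsplit(target).scheme in ("http", "https") else None


def ddg_envelope(query: str, hrefs: list) -> dict:
    """Drop undecodable hrefs; report an error instead of an empty result list."""
    urls = [u for u in map(unwrap_ddg_href, hrefs) if u]
    if not urls:
        return {"ok": False, "backend": "ddg", "query": query,
                "results": [], "error": "ddg parser matched no results"}
    return {"ok": True, "backend": "ddg", "query": query,
            "results": [{"url": u} for u in urls]}
```

Titles and snippets are left out here to keep the focus on the href handling; per the comment, the real parser extracts them per result block.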

Memory: project_search_backends.md captures that DDG is anti-bot-blocked from the deployment host, so a self-hosted SearXNG is the recommended path. Phase 5 reviewer caught a Phase 0 loopback (DDG worked once, then returned 202 within the same session); success criterion was honestly re-anchored.

Reference: marfrit/lmcp#4