safety: LLM second-opinion + session cache (Phase 3 commit #2)

Phase 3 commit #2 per docs/PHASE3.md §12. Adds the LLM-probe gate on top of commit #1's static patterns. Together they form is_destructive.

broker.lua extension:
  - opts.max_tokens (A2): passed through to the request body. Phase 3 probes cap at 4 tokens for YES/NO replies.
  - opts.timeout_ms: overrides model_cfg.timeout_ms per call. The probe uses a 15000ms cap regardless of the model's normal timeout (the user's deep model has 1800000ms for long generations; the probe must stay snappy).
  - M.chat now accepts an opts table (same shape as chat_stream's). Backwards compatible: existing callers passing (cfg, msgs) are unaffected.

safety.lua additions:
  - llm_probe(cfg, system, cmd): single broker.chat call returning "YES"/"NO"/"YES_FAILSAFE"/"YES_UNPARSEABLE", with fail-safe defaults.
  - llm_second_opinion(cmd, cfg): two-probe protocol per R-B2.
      Probe 1: "Is this destructive?" YES → flag.
      Probe 2 (only if probe 1 said NO): the inverted question, "Is this safe?" NO → flag (disagreement = HALT).
      Both NO → safe.
  - Session-scoped cache _llm_cache keyed by normalized command (lowercased + whitespace-collapsed). Mitigates Q23 latency for repeated commands within a Norris run.
  - Model-selection precedence: cfg.safety.llm_model (explicit) → cfg.models.deep (independent local class) → cfg.models[default]. Fail-safe YES if none is configured.
  - is_destructive(cmd, cfg): always runs the static patterns first, then the LLM if cfg is present and the probe is not explicitly opted out. cfg=nil yields static-only mode (handy for tests).

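The two-probe protocol and session cache described in the safety.lua notes can be sketched roughly as follows. This is an illustration only, not the code in safety.lua (which this diff view does not show): `normalize`, `second_opinion`, and the `probe` callback are hypothetical names. It also assumes each raw reply is judged against its own question, so anything other than a clear "NO" to probe 1 or a clear "YES" to probe 2 flags the command.

```lua
-- Illustrative sketch only; names and reply handling are assumptions.
local cache = {}  -- session-scoped: lives as long as this chunk is loaded

-- Normalize as the commit message describes: lowercase, collapse whitespace.
local function normalize(cmd)
    return (cmd:lower():gsub("%s+", " "))
end

-- probe(question, cmd) is a hypothetical callback that returns the model's
-- reply as a string. Returns true when the command should be flagged.
local function second_opinion(cmd, probe)
    local key = normalize(cmd)
    if cache[key] ~= nil then return cache[key] end

    local flagged
    if probe("destructive", cmd) ~= "NO" then
        flagged = true   -- YES or any fail-safe/unparseable reply
    elseif probe("safe", cmd) ~= "YES" then
        flagged = true   -- the inverted question disagrees: halt
    else
        flagged = false  -- both probes agree the command is harmless
    end

    cache[key] = flagged  -- assumption: every verdict is cached
    return flagged
end

-- Usage with a stub probe that counts how often it is called.
local calls = 0
local function stub(question, cmd)
    calls = calls + 1
    if question == "destructive" then return "NO" end
    return "YES"
end

local first = second_opinion("cat  /etc/hostname", stub)
print(first, calls)   -- false, after 2 probe calls
local second = second_opinion("CAT /etc/hostname", stub)
print(second, calls)  -- false again, still 2 calls (cache hit)
```

Keying the cache on the normalized command means cosmetic differences in spacing or case do not trigger a second round trip to the model, which is the latency saving the commit message attributes to the cache.
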
End-to-end verified against hossenfelder using qwen-coder-7b-32k as the deep probe (qwen3-30b-a3b-instruct in the repo's config.lua isn't currently loaded on the local backend):

  cat /etc/hostname               → hit=false (LLM: NO, NO inverted = safe)
  rm /tmp/x.log                   → hit=true (LLM flagged; static missed because no -r/-f flags)
  cp /etc/passwd /tmp/passwd.bak  → hit=false (safe copy)
  cache: second probe on same cmd → 0s wall time
  static-only (cfg=nil): rm -rf /tmp/x → static hit, no LLM call
  opt-out (llm_second_opinion=false): cp x y → hit=false, no probe

Test corpus (test_safety.lua, 87 cases) still passes in full: cfg=nil preserves the static-only behavior.

Note: production config.lua currently has `deep = qwen3-30b-a3b-instruct`, which isn't loaded on the proxy backend right now. Norris users will hit the fail-safe (everything flagged destructive) until either the deep model is brought up or cfg.safety.llm_model = "cloud" is set to route the probe through anthropic/claude-haiku-4.5. Update the config or the model deployment before production use; this is covered by the Phase 3 verify test case.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 23:36:06 +00:00
parent bd59ce7243
commit 2abd5da3a6
2 changed files with 134 additions and 11 deletions
@@ -27,7 +27,7 @@ local function build_headers(model_cfg)
    return h
 end

-local function build_request(model_cfg, messages, stream, tools)
+local function build_request(model_cfg, messages, stream, tools, max_tokens)
    if not (model_cfg and model_cfg.endpoint and model_cfg.model) then
        return nil, "broker: model_cfg.endpoint and .model are required"
    end
@@ -41,6 +41,10 @@ local function build_request(model_cfg, messages, stream, tools)
    -- Per PHASE2.md §12 risk row "Empty tools array": some servers reject
    -- "tools": []. Only set the field when the list has entries.
    if tools and #tools > 0 then req.tools = tools end
+    -- Phase 3 (A2): max_tokens passthrough — used by safety.is_destructive
+    -- to cap YES/NO probes at ~4 tokens. Omitted when nil (Phase 1/2
+    -- callers unaffected — model defaults still apply).
+    if max_tokens then req.max_tokens = max_tokens end
    return url, json.encode(req), build_headers(model_cfg),
           (model_cfg.timeout_ms or 60000)
 end
@@ -59,8 +63,13 @@ end
 function M.chat_stream(model_cfg, messages, on_delta, opts)
    opts = opts or {}
    local url, body, headers, timeout_ms =
-        build_request(model_cfg, messages, true, opts.tools)
+        build_request(model_cfg, messages, true, opts.tools, opts.max_tokens)
    if not url then return nil, body end  -- url slot carries err on bad cfg
+    -- Phase 3: opts.timeout_ms overrides the model's default. Used by
+    -- safety.is_destructive's LLM probe to cap YES/NO checks at ~15s even
+    -- when the model's normal timeout is much higher (e.g. user's deep
+    -- model has 1800000ms for long generations).
+    if opts.timeout_ms then timeout_ms = opts.timeout_ms end

    local done = false
    local api_err
@@ -152,11 +161,11 @@ end
 -- Returns:
 --   assistant_content_string         on success
 --   nil, errmsg                       on transport / decode / API failure
-function M.chat(model_cfg, messages)
+function M.chat(model_cfg, messages, opts)
    local parts = {}
    local ok, err = M.chat_stream(model_cfg, messages, function(kind, payload)
        if kind == "text" then parts[#parts + 1] = payload end
-    end)
+    end, opts)
    if not ok then return nil, err end
    return table.concat(parts)
 end
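
The timeout precedence that this diff spreads across build_request and chat_stream can be condensed into a small standalone sketch. `effective_timeout` is a hypothetical helper written for illustration; it does not exist in broker.lua, and the `require("broker")` line in the comment assumes the module is loaded under that name.

```lua
-- Hypothetical helper mirroring the diff's precedence:
-- opts.timeout_ms, then model_cfg.timeout_ms, then the 60000ms default.
local function effective_timeout(model_cfg, opts)
    opts = opts or {}
    return opts.timeout_ms or model_cfg.timeout_ms or 60000
end

local deep = { timeout_ms = 1800000 }  -- long-generation model from the commit message

print(effective_timeout(deep, { timeout_ms = 15000 }))  -- 15000: the probe cap wins
print(effective_timeout(deep))                          -- 1800000: existing callers unchanged
print(effective_timeout({}, nil))                       -- 60000: built-in default

-- A probe-style call would then look something like this (not run here,
-- since it needs a live endpoint):
--   local broker = require("broker")
--   local reply, err = broker.chat(model_cfg, messages,
--       { max_tokens = 4, timeout_ms = 15000 })
```

Because opts is optional and both fields are only applied when set, callers that still pass two arguments keep the model's own timeout and token defaults.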