Wire #13 secret redaction into safety.lua broker call sites #52

Closed
opened 2026-05-16 21:43:21 +00:00 by claude-noether · 0 comments
Collaborator

Context

Issue #13 (commits e4b818b + d852aca) landed secret redaction across the main egress points:

  • ask_ai main broker call
  • MCP dispatch_tool_call (rehydrate args before sess:call_tool)
  • DELEGATE: and :delegate
  • Phase 5 summarize-on-evict callback
  • :memory summarize

Two broker call sites in safety.lua were deliberately deferred from that PR because the scrubbing convention is owned by repl.lua's closure (the secrets_session local) and safety.lua is a separate module that doesn't see it. This issue tracks closing that gap.

What's missing

1. safety.norris_step main broker call

safety.lua:321 — the Norris autonomous-mode planner round-trips through broker.chat_stream(model_cfg, ctx:to_messages(), ...) directly. No scrub on outbound; no rehydrate on the stream. A user running Norris on a goal that touches secrets in context (e.g. a goal whose pre-state includes cat ~/.env) sends them plain to the broker even with #13 configured.

2. safety.is_destructive LLM second-opinion probe

safety.lua:177 — the deep-model probe sends the user's literal CMD: line to broker.chat(model_cfg, ...) for YES/NO destructive classification. If the model emitted a CMD: that quotes a secret (e.g. CMD: curl -H "Authorization: Bearer sk-or-v1-..."), that whole line goes plain to the probe model.

This is especially leaky because the recommended setting is safety.llm_model = "cloud" (better verdict reliability per the small-model false-positive issue documented in reference_safety_destructive_patterns.md) — so the probe is the most likely paid-cloud destination on the fleet.

Why it wasn't covered in #13

safety.lua is invoked from repl.lua via the helpers table for norris_step, and via a direct require("safety") call for is_destructive. Neither path currently carries the secrets session — adding it means either:

(a) thread helpers.secrets_session through safety.norris_step and safety.is_destructive, OR
(b) provide helpers.scrub_msgs(messages, mode) and helpers.rehydrate(text) callbacks the way helpers.exec_cmd is already threaded.

Option (b) is the cleaner separation — safety.lua doesn't need to know about the secrets module; it just calls a callback. Matches the existing helpers convention.

Fix sketch (option b)

In repl.lua where the helpers table is built for safety.norris_step:

local helpers = {
    -- ... existing entries ...
    scrub_msgs = function(msgs, mode_cfg)
        return scrub_messages(msgs, secrets_mode_for(mode_cfg or active_cfg))
    end,
    rehydrate  = function(text)
        return secrets_session and secrets_session:rehydrate(text) or text
    end,
    streaming_rehydrator = function()
        return secrets_session
            and secrets.streaming_rehydrator(secrets_session)
            or  nil
    end,
}

In safety.norris_step:

local msgs = helpers.scrub_msgs and helpers.scrub_msgs(ctx:to_messages(), model_cfg)
             or ctx:to_messages()
local rehy = helpers.streaming_rehydrator and helpers.streaming_rehydrator() or nil
-- wrap on_delta to push through rehy:push if present; flush at end
local ok, err = broker.chat_stream(model_cfg, msgs, wrapped, opts)

For safety.is_destructive: the probe is called from repl.lua's CMD: gate. Add cfg.helpers access OR pass the scrub callbacks explicitly. The function is small enough that explicit pass via opts is fine:

M.is_destructive(cmd, cfg, opts)
  opts.scrub_msgs(...)
  opts.rehydrate(...)

Acceptance criteria

  • safety.norris_step scrubs outbound messages per the active model's redact policy; the streamed reply is rehydrated for display, accumulator stores rehydrated text (parity with ask_ai).
  • safety.is_destructive LLM probe scrubs the command string before send; the YES/NO verdict response is rehydrated before parsing (the verdict text itself shouldn't contain placeholders, but rehydration is a safe no-op there).
  • No new module-level dependencies in safety.lua (it stays clean of require("secrets")); the integration goes through the helpers/opts callback convention.
  • :secrets status reporting accurately covers Norris (e.g. mapping count grows when a Norris session scrubs new values).
  • Smoke test: a Norris goal that includes a vault literal in its prose scrubs to a placeholder in the broker payload, rehydrates in displayed output.

Scope notes

This is the last gap from the #13 design. Out of scope for THIS issue:

  • The helpers.exec_cmd path itself doesn't need scrubbing (it runs locally, no broker).
  • Tool result content in Norris is already covered (it appends to ctx and re-scrubs at the next safety.norris_step outbound call).
  • dispatch_tool in Norris's helpers table inherits from repl.lua's dispatch_tool_call which already rehydrates args (#13). Confirm this still works once the scrub-via-helpers wiring lands.

Estimated size: ~80 LOC across safety.lua + repl.lua helpers expansion.

References

  • #13 — secret redaction implementation (closed)
  • secrets.lua — the module to plug into
  • docs/PHASE3.md §4 — Norris broker contract
  • reference_safety_destructive_patterns.md (project memory) — why llm_model = "cloud" is recommended
## Context Issue #13 (commits `e4b818b` + `d852aca`) landed secret redaction across the main egress points: - `ask_ai` main broker call - MCP `dispatch_tool_call` (rehydrate args before `sess:call_tool`) - `DELEGATE:` and `:delegate` - Phase 5 summarize-on-evict callback - `:memory summarize` Two broker call sites in `safety.lua` were deliberately deferred from that PR because the scrubbing convention is owned by `repl.lua`'s closure (the `secrets_session` local) and `safety.lua` is a separate module that doesn't see it. This issue tracks closing that gap. ## What's missing ### 1. `safety.norris_step` main broker call `safety.lua:321` — the Norris autonomous-mode planner round-trips through `broker.chat_stream(model_cfg, ctx:to_messages(), ...)` directly. No scrub on outbound; no rehydrate on the stream. A user running Norris on a goal that touches secrets in context (e.g. a goal whose pre-state includes `cat ~/.env`) sends them plain to the broker even with #13 configured. ### 2. `safety.is_destructive` LLM second-opinion probe `safety.lua:177` — the deep-model probe sends the user's literal `CMD:` line to `broker.chat(model_cfg, ...)` for YES/NO destructive classification. If the model emitted a CMD: that quotes a secret (e.g. `CMD: curl -H "Authorization: Bearer sk-or-v1-..."`), that whole line goes plain to the probe model. This is especially leaky because the recommended setting is `safety.llm_model = "cloud"` (better verdict reliability per the small-model false-positive issue documented in `reference_safety_destructive_patterns.md`) — so the probe is the most likely paid-cloud destination on the fleet. ## Why it wasn't covered in #13 `safety.lua` is invoked from `repl.lua` via the `helpers` table for `norris_step`, and via a direct `require("safety")` call for `is_destructive`. Neither path currently carries the secrets session — adding it means either: (a) thread `helpers.secrets_session` through `safety.norris_step` and `safety.is_destructive`, OR (b) provide `helpers.scrub_msgs(messages, mode)` and `helpers.rehydrate(text)` callbacks the way `helpers.exec_cmd` is already threaded. Option (b) is the cleaner separation — `safety.lua` doesn't need to know about the secrets module; it just calls a callback. Matches the existing helpers convention. ## Fix sketch (option b) In `repl.lua` where the helpers table is built for `safety.norris_step`: ```lua local helpers = { -- ... existing entries ... scrub_msgs = function(msgs, mode_cfg) return scrub_messages(msgs, secrets_mode_for(mode_cfg or active_cfg)) end, rehydrate = function(text) return secrets_session and secrets_session:rehydrate(text) or text end, streaming_rehydrator = function() return secrets_session and secrets.streaming_rehydrator(secrets_session) or nil end, } ``` In `safety.norris_step`: ```lua local msgs = helpers.scrub_msgs and helpers.scrub_msgs(ctx:to_messages(), model_cfg) or ctx:to_messages() local rehy = helpers.streaming_rehydrator and helpers.streaming_rehydrator() or nil -- wrap on_delta to push through rehy:push if present; flush at end local ok, err = broker.chat_stream(model_cfg, msgs, wrapped, opts) ``` For `safety.is_destructive`: the probe is called from `repl.lua`'s CMD: gate. Add `cfg.helpers` access OR pass the scrub callbacks explicitly. The function is small enough that explicit pass via opts is fine: ```lua M.is_destructive(cmd, cfg, opts) opts.scrub_msgs(...) opts.rehydrate(...) ``` ## Acceptance criteria - [ ] `safety.norris_step` scrubs outbound messages per the active model's redact policy; the streamed reply is rehydrated for display, accumulator stores rehydrated text (parity with `ask_ai`). - [ ] `safety.is_destructive` LLM probe scrubs the command string before send; the YES/NO verdict response is rehydrated before parsing (the verdict text itself shouldn't contain placeholders, but rehydration is a safe no-op there). - [ ] No new module-level dependencies in `safety.lua` (it stays clean of `require("secrets")`); the integration goes through the helpers/opts callback convention. - [ ] `:secrets status` reporting accurately covers Norris (e.g. mapping count grows when a Norris session scrubs new values). - [ ] Smoke test: a Norris goal that includes a vault literal in its prose scrubs to a placeholder in the broker payload, rehydrates in displayed output. ## Scope notes This is the last gap from the #13 design. Out of scope for THIS issue: - The `helpers.exec_cmd` path itself doesn't need scrubbing (it runs locally, no broker). - Tool result content in Norris is already covered (it appends to ctx and re-scrubs at the next `safety.norris_step` outbound call). - `dispatch_tool` in Norris's helpers table inherits from `repl.lua`'s `dispatch_tool_call` which already rehydrates args (#13). Confirm this still works once the scrub-via-helpers wiring lands. Estimated size: ~80 LOC across `safety.lua` + `repl.lua` helpers expansion. ## References - #13 — secret redaction implementation (closed) - `secrets.lua` — the module to plug into - `docs/PHASE3.md` §4 — Norris broker contract - `reference_safety_destructive_patterns.md` (project memory) — why `llm_model = "cloud"` is recommended
claude-noether added the feature requestsecurity labels 2026-05-16 21:43:21 +00:00
Sign in to join this conversation.