Commit Graph

7 Commits

Author SHA1 Message Date
marfrit 17e62c0326 safety: permission policy DSL — allow/confirm/deny rule lists (closes #9)
The confirm_cmd boolean was too coarse: true interrupts every harmless
ls; false ungates everything. Most workflows want trust for read-only
ops while still gating writes/network/sudo.

New config:

    permissions = {
        allow   = { "^ls%s", "^cat%s", "^git status" },
        confirm = { "^rm%s", "^git push", "^docker%s", "^sudo%s" },
        deny    = { "^ssh%s+root@", "^curl%s+http[^s]" },
    }

Verdict order: deny > confirm > allow. First match in the chosen
category wins. Unmatched defaults to "confirm". Patterns are Lua
patterns (not regex) per PHASE0.md §3 — no compiled extensions.
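
The verdict order above can be sketched as a small classifier. This is a minimal illustration, not the project's safety.classify_command: the function name `classify` and its two-value return (verdict, matching pattern) are assumptions made for the example.

```lua
-- Hypothetical sketch of deny > confirm > allow over Lua patterns.
-- classify() and its return shape are assumptions, not the real API.
local CATEGORY_ORDER = { "deny", "confirm", "allow" }

local function classify(cmd, permissions)
    for _, category in ipairs(CATEGORY_ORDER) do
        for _, pattern in ipairs(permissions[category] or {}) do
            if cmd:find(pattern) then
                return category, pattern   -- first match in the category wins
            end
        end
    end
    return "confirm", nil                  -- unmatched commands default to "confirm"
end

local permissions = {
    allow   = { "^ls%s", "^cat%s", "^git status" },
    confirm = { "^rm%s", "^git push", "^docker%s", "^sudo%s" },
    deny    = { "^ssh%s+root@", "^curl%s+http[^s]" },
}

print(classify("ls -la", permissions))
print(classify("curl http://example.com", permissions))
print(classify("make test", permissions))
```

Note that a bare `ls` with no arguments does not match `^ls%s`, because the pattern requires a whitespace character after the command name; under this sketch it falls through to the "confirm" default.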

Verdict behavior in the interactive CMD: loop:
  - allow   → run without prompt
  - deny    → status line, skip
  - confirm → [y/N] prompt (same UX as legacy confirm_cmd=true)

Backward compat:
  - permissions unset + confirm_cmd=true  → always confirm
  - permissions unset + confirm_cmd=false → always allow
  - permissions set                        → policy table is authoritative
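
The three backward-compat cases can be expressed as one resolution function. This is a sketch under assumptions: `resolve_verdict` is a hypothetical name, the classifier is injected as a parameter, and treating an unset confirm_cmd as "confirm" is a conservative guess that the commit message does not state.

```lua
-- Hypothetical sketch of the legacy fallback. classify is injected so the
-- example stays self-contained; it is only called when a policy table exists.
local function resolve_verdict(cmd, cfg, classify)
    if cfg.permissions ~= nil then
        return classify(cmd, cfg.permissions)   -- policy table is authoritative
    end
    if cfg.confirm_cmd == false then
        return "allow"                          -- legacy: gate switched off
    end
    return "confirm"                            -- legacy confirm_cmd=true (and, by assumption, unset)
end

print(resolve_verdict("ls -la", { confirm_cmd = true }))
print(resolve_verdict("ls -la", { confirm_cmd = false }))
```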

Scope deliberately limited to the interactive AI-suggested CMD: gate.
Norris autonomous mode keeps its own safety.is_destructive machinery
(combining the two would double-gate or replace the LLM probe — both
non-obvious behavioral changes that belong in their own issues).
User-typed shell-routed lines (`router.classify → "shell"`) and
:exec also bypass the policy by design — those are direct user intent.

New introspection:
  :perms list           — show the configured rule lists
  :perms check <cmd>    — report verdict + matching rule (debug)

safety.classify_command is exported and unit-tested with 12 cases
covering each category, priority order (deny > allow on overlap),
and both fallback paths.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 21:20:56 +00:00
marfrit 11b1f566b3 safety: norris_step planner (Phase 3 commit #4)
Phase 3 commit #4 per docs/PHASE3.md §12. Single-iteration planner.
The driver loop in repl.lua (commit #5) calls this in a while loop,
advancing step_n on every "continue" return.
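
A minimal sketch of such a driver loop follows. It is not the code in repl.lua: `run_norris` is a hypothetical name, the step function is injected, and its call shape is simplified to (ctx, opts) for the example. Any status other than "continue" is treated as terminal.

```lua
-- Hypothetical driver sketch. step_fn stands in for safety.norris_step with
-- a simplified signature; passing step_n via opts is an assumption.
local function run_norris(step_fn, ctx, max_steps)
    local step_n = 1
    while true do
        local status = step_fn(ctx, { step_n = step_n, max_steps = max_steps })
        if status ~= "continue" then
            return status, step_n      -- any other status ends the run
        end
        step_n = step_n + 1            -- advance only on "continue"
    end
end

-- Mock planner that only exercises the budget check.
local function mock_step(_, opts)
    if opts.step_n >= opts.max_steps then
        return "budget_exhausted"
    end
    return "continue"
end

print(run_norris(mock_step, {}, 3))
```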

M.norris_step(ctx, model_cfg, helpers, opts):
  1. One broker.chat_stream round-trip — text + tool_calls collected,
     text streamed via helpers.render_assistant_delta.
  2. Parse actions from response: tool_calls (already collected),
     CMD: lines (via helpers.extract_cmd_lines), GOAL: complete
     sentinel (line-level exact match per R-C5).
  3. Record the assistant turn (with tool_calls if any) and log it.
     If no actions AND no goal_done → status="stalled".
  4. Dispatch tool_calls (structured route first):
       - is_destructive check on serialized call.
       - If destructive → halt_fn(proceed/skip/abort).
       - Else → auto_approve lookup; absent → halt for consent (R-C6:
         Norris is conservative; auto_approve is the only consent
         bypass).
       - On skip: synthesize role:tool turn "[aish] tool call
         skipped by user" — alternation preserved per C5/C7.
       - On abort: return status="aborted".
       - On proceed: dispatch via helpers.dispatch_tool, append
         role:tool turn with result content.
       - Argument JSON parse failure also synthesizes a tool turn
         (same alternation rationale).
  5. Dispatch CMD: lines (legacy route):
       - is_destructive check.
       - Destructive → halt_fn.
       - Non-destructive → run directly (Norris user accepted
         autonomy for non-destructive shell).
       - skip → ctx:append_exec_output "[aish] CMD skipped by user".
       - proceed → exec via helpers.exec_cmd, frame via
         render_exec_begin/end.
  6. Skip-budget escalation (R-C1): after dispatch, if
     ctx.norris_consecutive_skips >= 3 → escalation halt; abort exits,
     proceed resets counter.
  7. Goal-done check AFTER all dispatch (R-C2 / Q25 resolution).
  8. Budget check: step_n >= max_steps → status="budget_exhausted".
  9. Otherwise → status="continue", driver advances.
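
The line-level exact match for the GOAL: complete sentinel in step 2 might look like the sketch below. The function name `has_goal_sentinel` is hypothetical, and the choice to do no whitespace trimming at all is an assumption; R-C5 may define the match differently.

```lua
-- Hypothetical sketch: the sentinel counts only when it is an entire line.
-- No trimming is applied (assumption), so indented or embedded text is ignored.
local SENTINEL = "GOAL: complete"

local function has_goal_sentinel(text)
    for line in (text .. "\n"):gmatch("([^\n]*)\n") do
        if line == SENTINEL then
            return true
        end
    end
    return false
end

print(has_goal_sentinel("All files copied.\nGOAL: complete"))
print(has_goal_sentinel("I will print GOAL: complete when finished"))
```

Matching whole lines rather than substrings keeps the model from ending a run by merely mentioning the sentinel in a sentence.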

Helpers are passed in as injected functions rather than directly
requiring repl/renderer/executor — keeps safety.lua's coupling clean
and norris_step testable with a mocked helpers table.

State carried across iterations on the ctx:
  - ctx.norris_consecutive_skips (resets on any successful proceed)
  - ctx.norris_goal / ctx.norris_active (set/cleared by the driver)
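
The skip counter carried on the ctx can be sketched as two small helpers. The names `record_verdict` and `needs_escalation` are hypothetical; only the field ctx.norris_consecutive_skips, the threshold of 3, and the reset on proceed come from the commit message.

```lua
-- Hypothetical sketch of the skip-budget bookkeeping.
local SKIP_LIMIT = 3

-- verdict is the user's answer to a halt: "proceed", "skip" or "abort".
local function record_verdict(ctx, verdict)
    if verdict == "skip" then
        ctx.norris_consecutive_skips = (ctx.norris_consecutive_skips or 0) + 1
    elseif verdict == "proceed" then
        ctx.norris_consecutive_skips = 0     -- a proceed resets the streak
    end
end

local function needs_escalation(ctx)
    return (ctx.norris_consecutive_skips or 0) >= SKIP_LIMIT
end

local ctx = {}
record_verdict(ctx, "skip")
record_verdict(ctx, "skip")
record_verdict(ctx, "skip")
print(needs_escalation(ctx))
```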

Existing test_safety.lua corpus (87 cases) still passes — norris_step
addition doesn't touch is_destructive's behavior.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 23:37:53 +00:00
marfrit 2abd5da3a6 safety: LLM second-opinion + session cache (Phase 3 commit #2)
Phase 3 commit #2 per docs/PHASE3.md §12. Adds the LLM-probe gate on
top of commit #1's static patterns. Together they form is_destructive.

broker.lua extension:
  - opts.max_tokens (A2) — passed through to the request body. Phase 3
    probes cap at 4 tokens for YES/NO replies.
  - opts.timeout_ms — overrides model_cfg.timeout_ms per-call. Probe
    uses 15000ms cap regardless of the model's normal timeout
    (the user's deep model has 1800000ms for long generations; the
    probe must stay snappy).
  - M.chat now accepts an opts table (same shape as chat_stream's).
    Backwards compatible — existing callers passing (cfg, msgs)
    unaffected.
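
How the two options might feed into a request can be sketched as follows. This is illustrative only: `build_request` is a hypothetical helper, and the body fields `model` and `messages` assume an OpenAI-style chat payload, which the commit message does not spell out.

```lua
-- Hypothetical sketch of opts handling; not the actual broker.lua code.
local function build_request(model_cfg, messages, opts)
    opts = opts or {}                        -- chat(cfg, msgs) callers pass no opts
    local body = { model = model_cfg.model, messages = messages }
    if opts.max_tokens then
        body.max_tokens = opts.max_tokens    -- passed through to the request body
    end
    -- A per-call timeout overrides the model's configured timeout.
    local timeout_ms = opts.timeout_ms or model_cfg.timeout_ms
    return body, timeout_ms
end

local deep = { model = "deep", timeout_ms = 1800000 }
local body, timeout_ms = build_request(deep, {}, { max_tokens = 4, timeout_ms = 15000 })
print(body.max_tokens, timeout_ms)
```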

safety.lua additions:
  - llm_probe(cfg, system, cmd): single broker.chat call returning
    "YES"/"NO"/"YES_FAILSAFE"/"YES_UNPARSEABLE" — fail-safe defaults.
  - llm_second_opinion(cmd, cfg): two-probe protocol per R-B2.
    Probe 1: "Is this destructive?" YES → flag.
    Probe 2 (only if probe 1 said NO): the inverted question
    "Is this safe?" A raw NO → flag (disagreement = HALT).
    Probe 2's answer is inverted before it is reported, so
    NO from both probes → safe.
  - Session-scoped cache _llm_cache keyed by normalized command
    (lowercased + whitespace-collapsed). Mitigates Q23 latency for
    repeated commands within a Norris run.
  - Model-selection precedence: cfg.safety.llm_model (explicit)
    → cfg.models.deep (independent local class) → cfg.models[default].
    Fail-safe YES if none configured.
  - is_destructive(cmd, cfg): runs static patterns first (always),
    then LLM if cfg present + not explicitly opted-out. cfg=nil
    yields static-only mode (handy for tests).
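
The two-probe protocol and the session cache can be sketched together. This is a minimal illustration under stated assumptions: `second_opinion` and `normalize` are hypothetical names, the probe is an injected function that returns the model's raw answer to each question, and treating anything other than a clean "YES" on the second probe as a flag is a conservative guess.

```lua
-- Hypothetical sketch; probe(question, cmd) stands in for llm_probe and
-- returns the raw answer ("YES", "NO", or a fail-safe variant).
local llm_cache = {}

local function normalize(cmd)
    -- Lowercase and collapse whitespace runs so repeated commands share a key.
    local key = cmd:lower():gsub("%s+", " ")
    return key
end

local function second_opinion(cmd, probe)
    local key = normalize(cmd)
    if llm_cache[key] ~= nil then
        return llm_cache[key]                 -- cache hit: no probe round-trip
    end
    local flagged = true                      -- fail-safe default
    if probe("destructive", cmd) == "NO" then
        -- Only a clean YES to "Is this safe?" clears the command.
        flagged = probe("safe", cmd) ~= "YES"
    end
    llm_cache[key] = flagged
    return flagged
end

local function mock_probe(question)
    if question == "destructive" then return "NO" end
    return "YES"
end

print(second_opinion("cat /etc/hostname", mock_probe))
```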

End-to-end verified against hossenfelder using qwen-coder-7b-32k as
the deep probe (qwen3-30b-a3b-instruct in repo's config.lua isn't
currently loaded on the local backend):
  cat /etc/hostname              → hit=false (LLM: NO, NO inverted = safe)
  rm /tmp/x.log                  → hit=true  (LLM flagged; static missed
                                              because no -r/-f flags)
  cp /etc/passwd /tmp/passwd.bak → hit=false (safe copy)
  cache: second probe on same cmd → 0s wall time
  static-only (cfg=nil): rm -rf /tmp/x → static hit, no LLM call
  opt-out (llm_second_opinion=false): cp x y → hit=false, no probe

Test corpus (test_safety.lua, 87 cases) still passes in full: cfg=nil
preserves the static-only behavior.

Note: production config.lua currently has `deep = qwen3-30b-a3b-instruct`
which isn't loaded on the proxy backend right now; Norris users will
hit the fail-safe (everything flagged destructive) until either the
deep model is brought up OR cfg.safety.llm_model = "cloud" is set
to route the probe through anthropic/claude-haiku-4.5. Update the
config or model deployment for production use — covered by Phase 3
verify test case.
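
The model-selection precedence behind this note can be sketched as a lookup chain. The function name `pick_probe_model` is hypothetical, as is the key `cfg.default_model` used to stand in for "default"; the sketch also assumes cfg.safety.llm_model names an entry in cfg.models. A nil result is where the caller would apply the fail-safe YES.

```lua
-- Hypothetical sketch of the precedence chain; key names are assumptions.
local function pick_probe_model(cfg)
    local models = cfg.models or {}
    local explicit = cfg.safety and cfg.safety.llm_model
    if explicit and models[explicit] then
        return models[explicit]                -- 1. explicit override
    end
    if models.deep then
        return models.deep                     -- 2. independent local class
    end
    if cfg.default_model and models[cfg.default_model] then
        return models[cfg.default_model]       -- 3. session default
    end
    return nil                                 -- none configured: caller fails safe
end

local cfg = {
    safety = { llm_model = "cloud" },
    models = { cloud = { name = "cloud" }, deep = { name = "deep" } },
}
print(pick_probe_model(cfg).name)
```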

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 23:36:06 +00:00
marfrit bd59ce7243 safety: is_destructive static pattern matcher (Phase 3 commit #1)
Phase 3 commit #1 per docs/PHASE3.md §12. Static-pattern destructive-op
heuristic; no LLM second-opinion yet (lands in commit #2).

Implementation:
  - 34 patterns in DESTRUCTIVE_PATTERNS table, grouped:
      9 shell-wrapper patterns (R-B1 — bash -c / sh -c / zsh -c / eval /
        python -c / perl -e / pipe-to-sh both forms / pipe-to-bash both
        forms / xargs ... rm). HALT on the wrapper itself; user reads
        the inner before proceeding.
     10 filesystem destructive (rm -rf, find -delete, dd to device, mkfs,
        shred, wipefs, truncate -s 0, ...).
      5 version-control destructive (git push --force/-f, git reset
        --hard, git clean -fd, git branch -D).
      5 database/process (DROP TABLE/DATABASE, TRUNCATE TABLE,
        kill/pkill -9).
      2 permission (chmod 777, chown on root path).
  - ci=true flag for case-insensitive SQL patterns; rule patterns must
    be lowercase when ci is set (matcher lowercases input).
  - pkill -9 ordered BEFORE kill -9; kill rule uses %f[%w] frontier so
    "pkill -9 nginx" reports "pkill -9", not a "kill -9" substring match.
  - M._patterns exposes the rule table for :safety patterns meta (Phase
    3 commit #5) and for the test corpus.
  - M.norris_step stub stays — lands in commit #4.
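
The matcher mechanics described above (rule order, the ci flag, and the frontier on the kill rule) can be sketched with a handful of rules. The four patterns below are illustrative stand-ins, not the actual entries of DESTRUCTIVE_PATTERNS, and `match_static` is a hypothetical name.

```lua
-- Hypothetical sketch with four illustrative rules.
local RULES = {
    { name = "rm -rf",     pattern = "rm%s+%-rf" },
    { name = "drop table", pattern = "drop%s+table", ci = true },  -- lowercase pattern
    { name = "pkill -9",   pattern = "pkill%s+%-9" },              -- ordered before kill
    { name = "kill -9",    pattern = "%f[%w]kill%s+%-9" },         -- frontier: word start
}

local function match_static(cmd)
    for _, rule in ipairs(RULES) do
        -- ci rules lowercase the input; their patterns are already lowercase.
        local subject = rule.ci and cmd:lower() or cmd
        if subject:find(rule.pattern) then
            return true, rule.name
        end
    end
    return false, nil
end

print(match_static("pkill -9 nginx"))
print(match_static("DROP TABLE users"))
print(match_static("kill 1234"))
```

The frontier `%f[%w]` only matches where a word character follows a non-word character, so the "kill" inside "pkill" cannot start a match for the kill rule even if the rule order were changed.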

Test corpus (test_safety.lua, 87 cases):
  - 49 destructive cases across all categories (incl. all 11 wrapper
    forms, the canonical curl|sh end-of-string bypass, sudo-prefixed
    rm -rf, etc.).
  - 38 safe cases (read-only commands, non-destructive variants
    of risky verbs like "git push" without --force, "find" without
    -delete, "chmod 644", "kill 1234" without -9, etc.).
  - Documented one accepted false positive: echo "rm -rf /" matches
    the rm pattern by substring — Norris user can proceed after
    reading; tradeoff between false positives and false negatives,
    biased toward false positives per §5.
  - Run from repo root: `luajit test_safety.lua`. Exit 0 on pass.
  - Verified all 87 pass at commit time.
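
A table-driven corpus of this kind can be run with a few lines. This is a sketch, not the contents of test_safety.lua: `run_corpus`, the case fields `cmd` and `expect`, and the stand-in checker are all hypothetical.

```lua
-- Hypothetical corpus runner; the checker is injected so the sketch is
-- self-contained.
local function run_corpus(cases, is_destructive)
    local failures = 0
    for _, case in ipairs(cases) do
        local hit = is_destructive(case.cmd) and true or false
        if hit ~= case.expect then
            failures = failures + 1
            io.stderr:write(("FAIL: %q expected %s\n"):format(case.cmd, tostring(case.expect)))
        end
    end
    return failures
end

-- Stand-in checker for the demo only.
local function demo_checker(cmd)
    return cmd:find("rm%s+%-rf") ~= nil
end

local cases = {
    { cmd = "rm -rf /tmp/x", expect = true },
    { cmd = "ls -la",        expect = false },
}
print(run_corpus(cases, demo_checker))
```

In a standalone test script the failure count would typically drive the exit status, for example `os.exit(failures == 0 and 0 or 1)`.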

R-C4 / readline rebind, broker opts.max_tokens, LLM second-opinion,
norris_step planner, repl driver, and the wider Norris UX land in
subsequent commits per §12.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 22:47:10 +00:00
marfrit f26cbd9a3a phase2 amend: __ separator (Bedrock-safe) + post_sse error diagnostics
Phase 7 verify finding from TC #26 against :model cloud:
  HTTP 400 from openrouter→Amazon Bedrock:
  "tools.0.custom.name: String should match pattern
   '^[a-zA-Z0-9_-]{1,128}$'"

Anthropic via Bedrock validates tool names against that regex and
rejects dots. PHASE2 originally chose "." as the namespace separator
("boltzmann.list_dir"); OpenAI tolerated it, Bedrock does not.
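
The Bedrock constraint can be checked locally before a request is sent. Lua patterns have no `{1,128}` repetition, so the sketch below pairs a character-class match with an explicit length check; `is_wire_safe` is a hypothetical helper, not part of the codebase.

```lua
-- Hypothetical pre-flight check mirroring '^[a-zA-Z0-9_-]{1,128}$'.
local function is_wire_safe(name)
    return #name >= 1
        and #name <= 128
        and name:find("^[a-zA-Z0-9_%-]+$") ~= nil
end

print(is_wire_safe("boltzmann.list_dir"))
print(is_wire_safe("boltzmann__list_dir"))
```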

Separator switched to "__" (two underscores) everywhere — internal
API matches on-wire shape, no transformation layer:

  - repl.lua:
    - tools_schema builds "alias__name"
    - dispatch_tool_call splits via "^(.-)__(.+)$" (non-greedy → leftmost __)
    - :mcp tool parser uses same split
    - :mcp tools formatter prints "alias__name"
    - HELP block shows <alias__name>
  - safety.lua confirm_tool_call: alias.* glob → alias__* glob
  - config.lua example block: keys rewritten
  - docs/PHASE2.md: amendment header added; §1, §2 row, §3 config.lua
    row, §5 wire-shape JSON examples, §6 auto_approve schema, §7
    meta-cmd table, §12 plan all updated. Original "." references
    preserved in commit history.

Constraint: aliases must not themselves contain "__" so the parse
stays unambiguous. Tool names from MCP servers may have underscores
freely.
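
The split pattern and the alias constraint can be demonstrated in isolation. The pattern `^(.-)__(.+)$` is the one quoted above; the helper names `join_tool_name` and `split_tool_name` are hypothetical.

```lua
-- Hypothetical helpers around the "__" separator.
local function join_tool_name(alias, tool)
    -- Plain-text find (4th arg true): aliases must not contain the separator.
    assert(not alias:find("__", 1, true), "alias must not contain '__'")
    return alias .. "__" .. tool
end

local function split_tool_name(full)
    -- Non-greedy first capture: split at the leftmost "__".
    return full:match("^(.-)__(.+)$")
end

print(split_tool_name("boltzmann__read_file"))
print(split_tool_name("srv__my__tool"))
```

Because the first capture is non-greedy, a tool name that itself contains "__" stays intact in the second capture, which is why only the alias needs to be restricted.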

Second fix bundled — uninformative broker error:
  Previously "broker error: transport: HTTP response code said error"
  Now      "broker error: transport: HTTP 400: {full body snippet}"

ffi/curl.lua M.post_sse changes:
  - FAILONERROR no longer set (was hiding the response body).
  - raw_body accumulator added alongside the SSE buffer; captures
    every byte regardless of SSE shape.
  - After perform, check status_code via curl_easy_getinfo. On >=400,
    return (nil, "HTTP <code>: <body[:400]>"). 2xx unchanged.
  - End-of-stream SSE flush only runs on 2xx (no false event on
    error bodies that aren't SSE-shaped).
  - Phase 1 callers reading just first return slot stay correct.
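
The status handling after the transfer can be sketched as a pure function, leaving the libcurl FFI calls out. `finish_post_sse` and its parameters are hypothetical; the real M.post_sse obtains the status code from curl and may structure its returns differently.

```lua
-- Hypothetical sketch of the post-perform decision.
local BODY_SNIPPET_LEN = 400

local function finish_post_sse(status_code, raw_body, events)
    if status_code >= 400 then
        -- Surface the status and the start of the body instead of a bare
        -- transport error.
        return nil, ("HTTP %d: %s"):format(status_code, raw_body:sub(1, BODY_SNIPPET_LEN))
    end
    return events      -- success path: first return slot as before
end

print(finish_post_sse(400, '{"error":"bad tool name"}', {}))
```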

End-to-end verified:
  - :model cloud + tools=[boltzmann__read_file ...] +
    "Use boltzmann__read_file with path=/etc/hostname" →
    Claude emits tool_call with name="boltzmann__read_file",
    args='{"path": "/etc/hostname"}'. ok=true, transport clean.
  - Force-bad tool name "bad.name.with.dots" → err string carries
    the full bedrock 400 with the regex-pattern message visible.

TC #26 (sub-loop end-to-end) is now testable against cloud — the
error that blocked it is resolved.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 20:04:57 +00:00
marfrit 0fde77fe35 safety: confirm_tool_call gate with auto-approve policy
Phase 2 commit #2 per docs/PHASE2.md §12. Implements just the per-call
confirm-gate surface; Phase 3 stubs (is_destructive, norris_step) stay
unimplemented with their error() bodies.

M.confirm_tool_call(name, args, cfg) checks cfg.mcp.auto_approve for:
  - exact match on "<alias>.<tool>"
  - "<alias>.*" glob covering a whole server

Miss falls back to a [y/N] readline prompt. Empty or non-"y" answer
rejects (matches the existing confirm_cmd UX from PHASE0 §10).
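
The lookup and the prompt rule can be sketched as below, using the "." separator this commit describes (the later phase2 amendment switches it to "__"). The helper names are hypothetical, and modelling cfg.mcp.auto_approve as a table keyed by name with `true` values is an assumption about its shape.

```lua
-- Hypothetical sketch of the auto-approve lookup.
local function is_auto_approved(name, cfg)
    local approve = cfg and cfg.mcp and cfg.mcp.auto_approve
    if not approve then
        return false                              -- nil cfg or no policy: prompt
    end
    if approve[name] == true then
        return true                               -- exact "<alias>.<tool>"
    end
    local alias = name:match("^([^.]+)%.")
    return alias ~= nil and approve[alias .. ".*"] == true   -- "<alias>.*" glob
end

-- Only a literal "y" accepts; empty input, "n", or EOF (nil) rejects.
local function answer_accepts(answer)
    return answer == "y"
end

local cfg = { mcp = { auto_approve = { ["boltzmann.read_file"] = true } } }
print(is_auto_approved("boltzmann.read_file", cfg))
print(is_auto_approved("boltzmann.write_file", cfg))
```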

Pretty-printing renders args as compact JSON, truncated at 80 chars
with "..." suffix so one-line prompts stay readable.
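
The truncation rule might be implemented as in the sketch below. `truncate_args` is a hypothetical name, and reading "truncated at 80 chars" as "keep the first 80 bytes, then append the suffix" is an assumption; the cut is byte-based and not UTF-8 aware.

```lua
-- Hypothetical sketch of the one-line args rendering.
local MAX_ARGS_LEN = 80

local function truncate_args(json_text)
    if #json_text <= MAX_ARGS_LEN then
        return json_text
    end
    return json_text:sub(1, MAX_ARGS_LEN) .. "..."
end

print(truncate_args('{"path": "/etc/hostname"}'))
print(#truncate_args(("x"):rep(200)))
```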

Smoke-test passes all eight cases per §12 verify-row #2:
  exact match / alias glob → auto-approve, no prompt
  miss + y / n / empty / nil-cfg → prompt shown, expected verdict
  empty args / long args → clean rendering, truncation works

Note: PHASE0 §4 module-layout had a "lands in Phase 2" hint on the
norris_step stub; the actual landing is Phase 3 per PHASE0 §11 row 3.
Comment in safety.lua updated to clarify.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 13:07:57 +00:00
claude-noether 4310207738 Phase 0: scaffold tree + manifest
- README, .gitignore, CLAUDE.md (project conventions)
- docs/PHASE0.md — full Phase 0 manifest (locked substrate)
- 10 root .lua modules + 4 ffi/ bindings, all stubs raising NotImplemented
  with module-scoped responsibilities matching the manifest
- config.lua wired to current dirac/hossenfelder endpoints (qwen-coder-7b
  snappy/32k + cloud via OpenRouter through hossenfelder)

File names match docs/PHASE0.md §4 exactly. Module bodies fill in across
later phases; the tree shape is locked.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 23:16:07 +00:00