Bug #5 Phase 1 metric + Phase 0 anchor receipts #6

Merged
marfrit merged 1 commit from claude-noether-4 into main 2026-05-07 14:37:29 +00:00
Collaborator

Phase 0 anchored at N=3 reps under 4 MB/s pv-cap on 2.4GHz newton.

Headline

  • Throughput regression confirmed at N=3: rep1 = 725 KB/s, rep2 = 663 KB/s, rep3 = 75 KB/s with link death at ~9 min in (passive mode, beacon-loss cascade).
  • Hot symbol: _raw_spin_unlock_irqrestore at ~20 % of CPU in BOTH healthy and failed reps, callstack process_one_work → wsm_configuration → wsm_cmd_send → bes2600_bh.isra.0 → unlock. Lock-cycling cost is the floor of throughput.
  • Source pin: bes2600/wsm.c:98 (wsm_cmd_send()), bes2600/bh.c:435,559,847 (vif_list_lock cycling in BH dispatcher).

Phase 1 metric (locked)

Reduce the bes2600 BH path's spin-unlock-irqrestore cost so that under sustained 4 MB/s sender pressure on 2.4GHz newton, ohm sustains ≥ 2 MB/s observed RX throughput AND survives ≥ 30 min continuous load without link cascade.

Three measurable outcomes: throughput floor ≥ 2 MB/s, lock-cycle ceiling < 10 % CPU, no cascade in 30 min.
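The three outcomes can be checked mechanically per rep. The following is a minimal sketch, not part of the branch: the `RepResult` type, its field names, and the assumption that 1 MB/s equals 1000 KB/s are all illustrative choices.

```python
from dataclasses import dataclass

# Thresholds taken from the Phase 1 metric above.
THROUGHPUT_FLOOR_KB_S = 2000.0   # >= 2 MB/s (assumes 1 MB = 1000 KB)
LOCK_CPU_CEILING_PCT = 10.0      # lock-cycle share must stay below this
SURVIVAL_FLOOR_MIN = 30.0        # minutes of continuous load without cascade


@dataclass
class RepResult:
    """One measurement rep (hypothetical record layout)."""
    rx_kb_per_s: float         # observed RX throughput
    lock_cycle_cpu_pct: float  # CPU share in _raw_spin_unlock_irqrestore
    survived_min: float        # minutes under load before cascade or run end


def phase1_outcomes(rep: RepResult) -> dict:
    """Return pass/fail for each of the three Phase 1 outcomes."""
    return {
        "throughput": rep.rx_kb_per_s >= THROUGHPUT_FLOOR_KB_S,
        "lock_cpu": rep.lock_cycle_cpu_pct < LOCK_CPU_CEILING_PCT,
        "survival": rep.survived_min >= SURVIVAL_FLOOR_MIN,
    }


def phase1_pass(rep: RepResult) -> bool:
    """A rep passes only if all three outcomes hold."""
    return all(phase1_outcomes(rep).values())


# Phase 0 rep3 from the headline: 75 KB/s, ~20 % CPU, link death at ~9 min.
rep3 = RepResult(rx_kb_per_s=75.0, lock_cycle_cpu_pct=20.0, survived_min=9.0)
print(phase1_outcomes(rep3), phase1_pass(rep3))
```

Run against the Phase 0 rep3 figures, all three outcomes fail, which is the baseline any Phase 4 candidate has to move.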

Phase 4 candidates (preliminary)

  • B5-1: shrink wsm_cmd_send lock scope (smallest, fastest)
  • B5-2: coalesce vif_list_lock in BH dispatcher (medium)
  • B5-3: SPSC ringbuffer for WSM commands (largest rework, biggest potential)

Asks

  1. Lock B5-1 first as the shortest path to data, or jump to B5-3 for the bigger ceiling lift?
  2. Add tracepoints around wsm_cmd_send for lock-holdtime distribution before committing to any patch?
  3. Anything in the Phase 0 anchor that screams "you missed a confounder" before we lock Phase 1?
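For ask 2, once tracepoints exist, the hold-time distribution could be summarized offline along these lines. This is a sketch under assumptions: the `(timestamp_us, kind)` event format is hypothetical (no such tracepoints exist in the driver yet), it handles a single lock in time order, and the sample events are synthetic, not measurements.

```python
import math


def hold_times(events):
    """Pair each "lock" with the next "unlock" and return hold durations.

    events: iterable of (timestamp_us, kind), kind in {"lock", "unlock"},
    for one lock, sorted by time (assumed trace format).
    """
    held_since = None
    durations = []
    for ts, kind in events:
        if kind == "lock":
            held_since = ts
        elif kind == "unlock" and held_since is not None:
            durations.append(ts - held_since)
            held_since = None
    return durations


def summarize(durations):
    """Lock-cycle count plus nearest-rank p50/p99 and max hold time."""
    if not durations:
        return {"count": 0, "p50": None, "p99": None, "max": None}
    ordered = sorted(durations)
    n = len(ordered)

    def pct(p):
        # Nearest-rank percentile: smallest value with >= p % of samples at or below it.
        return ordered[max(0, math.ceil(p / 100 * n) - 1)]

    return {"count": n, "p50": pct(50), "p99": pct(99), "max": ordered[-1]}


# Synthetic example events (microseconds), for illustration only.
sample = [(0, "lock"), (5, "unlock"), (10, "lock"), (12, "unlock"),
          (20, "lock"), (50, "unlock")]
print(summarize(hold_times(sample)))
```

The `count` field doubles as the number of lock cycles in the traced window, so the same pass yields both the hold-time distribution and how often the lock is cycled.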

🤖 Generated with Claude Code

claude-noether added 1 commit 2026-05-07 14:33:14 +00:00
Phase 0 anchored at N=3 reps (10min @ 4MB/s pv-cap on 2.4GHz):
- rep1+2: ~700 KB/s sustained (10% of link capacity)
- rep3: link death at ~9 min in (passive mode, beacon-loss cascade)

Hot symbol identified: _raw_spin_unlock_irqrestore at ~20% CPU in both
healthy and failed reps, callstack process_one_work → wsm_configuration
→ wsm_cmd_send → bes2600_bh.isra.0 → spin-unlock.

Phase 1 metric locked: ≥2 MB/s sustained throughput, <10% CPU in lock-
cycling, no link death under 30 min continuous load.

Three Phase 4 candidates drafted (B5-1: shrink wsm_cmd_send lock scope;
B5-2: coalesce vif_list_lock in BH dispatcher; B5-3: SPSC ringbuffer for
WSM commands). Locking pending review.
Owner

Lock B5-1 first as the shortest path to data, or jump to B5-3 for the bigger ceiling lift? Data first.
Add tracepoints around wsm_cmd_send for lock-holdtime distribution before committing to any patch? Yes. And, if I understand you correctly, the lock is taken many times over before it is released, so the lock count should also be measured: how many locks before release? I have a feeling the driver was written with a "when in doubt, add a lock" kind of approach. Also backlog a full architect review by Sonnet of the thing we call a driver. Judging by your tracing, the quality of the code is... not so good.
Anything in the Phase 0 anchor that screams "you missed a confounder" before we lock Phase 1? No, but I see you are admitting to a pattern, which is well received by "the user".

marfrit merged commit 425eb92456 into main 2026-05-07 14:37:29 +00:00
Reference: marfrit/besser#6