Files
besser/notes/phase1-bug5-2026-05-07.md
T
claude-noether 1830c17891 notes: Bug #5 Phase 1 metric + Phase 0 anchor receipts
Phase 0 anchored at N=3 reps (10min @ 4MB/s pv-cap on 2.4GHz):
- rep1+2: ~700 KB/s sustained (10% of link capacity)
- rep3: link death at ~9 min in (passive mode, beacon-loss cascade)

Hot symbol identified: _raw_spin_unlock_irqrestore at ~20% CPU in both
healthy and failed reps, callstack process_one_work → wsm_configuration
→ wsm_cmd_send → bes2600_bh.isra.0 → spin-unlock.

Phase 1 metric locked: ≥2 MB/s sustained throughput, <10% CPU in lock-
cycling, no link death under 30 min continuous load.

Three Phase 4 candidates drafted (B5-1: shrink wsm_cmd_send lock scope;
B5-2: coalesce vif_list_lock in BH dispatcher; B5-3: SPSC ringbuffer for
WSM commands). Locking pending review.
2026-05-07 16:32:45 +02:00

6.3 KiB
Raw Blame History

Bug #5 campaign — Phase 1 metric + Phase 2 situation

Date assembled: 2026-05-07 Module under test (baseline): bes2600.ko srcversion 1B3B3ED096AAD7217FEDE11 (cleanups + Patch A + Patch B)

Phase 0 anchor at N=3 reps (10 min each, 4 MB/s pv-cap on boltzmann → ohm 2.4GHz 5b:32) reproduces the throughput regression and traces it to lock- cycling cost in the bes2600 BH path. See notes/observed-bugs.md Bug #5 for the original report. This document locks Phase 1 and prepares Phase 4 candidates.


Phase 1 — measurable target (locked)

Reduce the bes2600 BH path's spin-unlock-irqrestore cost so that under sustained 4 MB/s sender pressure on a healthy 2.4GHz link (signal -55 to -65 dBm), ohm sustains ≥ 2 MB/s observed RX throughput (vs 663725 KB/s baseline at N=3) AND the link survives ≥ 30 min continuous load without cascading into beacon-loss disconnect (vs rep3's failure at ~9 min).

Three measurable outcomes, single sentence:

  • (a) Throughput floor: ≥ 2 MB/s sustained RX at ohm
  • (b) Lock-cycle ceiling: % CPU in _raw_spin_unlock_irqrestore from bes2600_bh.isra.0 callstack drops to < 10 % (currently ~10 % rep1, ~16 % rep3)
  • (c) Cascade prevention: no link death under continuous 30 min @ 4 MB/s

Phase 0 anchor — receipts

Reproduction protocol (same units as Phase 7 will use)

  1. boltzmann: pv -L 4M -q < /dev/zero | nc ohm.fritz.box 12345
  2. ohm: sudo $RUN/rep-trace.sh 600 (10 min capture window)
  3. Rep dirs: bug5/rep-<ts>/{mmc.log, perf.data, rx_bytes.tsv, start.txt, end.txt}
  4. N=3 reps with 60 s cooldowns

Observed (2026-05-07 15:3616:08)

rep duration avg KB/s near-zero ticks end state
1 600 s 725 1/119 associated, MCS 6
2 600 s 663 5/119 associated, MCS 4
3 600 s 75 53/119 passive (link died at sample ~82, ~9 min in)

mmc transaction rate: rep1 = 5793/s sustained, rep3 = 6000/s for first ~10s then collapse to <100/s.

Hot-symbol receipts (perf record on bes_sdio | bes2600_bh kworkers)

symbol rep 1 (healthy) rep 3 (cascade)
_raw_spin_unlock_irqrestore (sum across kworker variants) ~19 % ~21 %
handle_softirqs 5.4 % 4.3 %
__tasklet_schedule 2.4 % 2.0 %
dw_mci_start_command (SDIO host) 1.5 % < 0.6 %
bes2600_sw_retry_requeue 0.79 % 0.70 %

Top callchain leading to _raw_spin_unlock_irqrestore:

process_one_work → worker_thread → wsm_configuration → wsm_cmd_send → bes2600_bh.isra.0

Phase 2 — situation analysis

Relevant source pins

  • bes2600/wsm.c:98wsm_cmd_send(), the function in the hot callstack. Body:
    • holds &hw_priv->wsm_cmd.lock (spin) only briefly to fill the cmd struct (lines ~145152)
    • calls bes2600_bh_wakeup() then wait_event_interruptible_timeout for the response
    • outer lock: down(&hw_priv->wsm_cmd_sema) from callers (wsm_cmd_lock() at wsm.c:105)
  • bes2600/bh.c:435,559,847bes2600_bh_work() takes &hw_priv->vif_list_lock 3× per pass through the dispatcher
  • bes2600/bh.c:172,195,487,581,861,1361,1427 — multiple wait_event* calls; the BH thread cycles through wait/wake/dispatch

What the lock-cycling cost is buying

Each WSM command from the host (mac80211, NM, kernel scan etc.) goes through the same path:

  1. caller acquires wsm_cmd_sema (outer)
  2. wsm_cmd_send() acquires wsm_cmd.lock, fills the struct, releases the lock
  3. bes2600_bh_wakeup() schedules the BH
  4. BH dispatches the command, takes vif_list_lock to walk vifs
  5. BH talks to chip via SDIO
  6. response arrives, BH wakes the waiter
  7. caller releases wsm_cmd_sema

Under sustained TCP RX, mac80211 issues lots of small WSM commands (TX-scheduling hints, rate updates, etc.) — every one cycles through this path. The spin-unlock-irqrestore cost is the floor on this cycle rate.

What's been ruled out

  • AP-side bug (AVM Fritz!Box, reliable per Markus's testimony — see feedback_phase7_stress_ramp.md reasoning in the campaign-so-far prior).
  • Patch A / Patch B (target different triggers; would not address lock cost).
  • Decrypt-failure storm (Patch A handles; this regression occurs in rep1 with zero decrypt-fails).
  • mac80211 scan-fail / scan-comeback (cosmetic; doesn't account for the throughput floor).

Phase 4 — candidate plans (preliminary, not locked)

Three candidates surfaced from the perf data. Listed cheapest to most-invasive.

B5-1 — Reduce wsm_cmd_send lock scope

The spin_lock around the cmd-struct fill (wsm.c:145152) can probably be replaced with WRITE_ONCE of a single sentinel field, since the BH thread reads these fields cooperatively (BH only reads after bh_wakeup schedules it, and only writes back via the response path). Eliminates the per-command spin-lock cycle for the host side.

Risk: race with BH if the protocol isn't strictly happens-before. Need to read bh.c:486-500 (where BH reads wsm_cmd.ptr) carefully.

Predicted delta: small but measurable. Maybe 23 % CPU off the lock floor.

B5-2 — Coalesce vif_list_lock in BH dispatcher

bh.c takes vif_list_lock 3× per dispatcher pass. If these 3 critical sections are within a single iteration, they should be merged into one acquire/release.

Risk: vif teardown might depend on releasing the lock between iterations to allow concurrent vif removal. Needs careful audit.

Predicted delta: significant under multi-vif (we're single-vif STA today, so probably small immediate gain).

B5-3 — Move WSM send out of process context, use ringbuffer

Replace the wsm_cmd_sema + wsm_cmd struct mechanism with an SPSC ringbuffer between caller and BH. Caller writes to ring, no lock needed (one producer); BH reads from ring, no lock needed (one consumer).

Risk: significant rework. cw1200 ancestor doesn't have this; we'd be inventing it.

Predicted delta: large — could halve the lock cost. But cost-to-implement is also large.


Open question for the reviewer

Which Phase 4 candidate to lock? My ranking by ROI:

  1. B5-1 (smallest, fastest, cleanest source pin) — try first
  2. B5-2 (medium, conditional on multi-vif applicability)
  3. B5-3 (largest rework, biggest potential)

Or: instrument deeper before committing to a fix (e.g., add tracepoints around wsm_cmd_send enter/exit to measure lock holdtime distribution, not just CPU%).