Files
besser/notes/phase1-bug5-2026-05-07.md
T
claude-noether 1830c17891 notes: Bug #5 Phase 1 metric + Phase 0 anchor receipts
Phase 0 anchored at N=3 reps (10min @ 4MB/s pv-cap on 2.4GHz):
- rep1+2: ~700 KB/s sustained (10% of link capacity)
- rep3: link death at ~9 min in (passive mode, beacon-loss cascade)

Hot symbol identified: _raw_spin_unlock_irqrestore at ~20% CPU in both
healthy and failed reps, callstack process_one_work → wsm_configuration
→ wsm_cmd_send → bes2600_bh.isra.0 → spin-unlock.

Phase 1 metric locked: ≥2 MB/s sustained throughput, <10% CPU in lock-
cycling, no link death under 30 min continuous load.

Three Phase 4 candidates drafted (B5-1: shrink wsm_cmd_send lock scope;
B5-2: coalesce vif_list_lock in BH dispatcher; B5-3: SPSC ringbuffer for
WSM commands). Locking pending review.
2026-05-07 16:32:45 +02:00

136 lines
6.3 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# Bug #5 campaign — Phase 1 metric + Phase 2 situation
Date assembled: 2026-05-07
Module under test (baseline): bes2600.ko srcversion `1B3B3ED096AAD7217FEDE11`
(cleanups + Patch A + Patch B)
Phase 0 anchor at N=3 reps (10 min each, 4 MB/s pv-cap on boltzmann → ohm
2.4GHz 5b:32) reproduces the throughput regression and traces it to lock-
cycling cost in the bes2600 BH path. See `notes/observed-bugs.md` Bug #5
for the original report. This document locks Phase 1 and prepares Phase 4
candidates.
---
## Phase 1 — measurable target (locked)
> *Reduce the bes2600 BH path's spin-unlock-irqrestore cost so that under sustained 4 MB/s sender pressure on a healthy 2.4GHz link (signal -55 to -65 dBm), ohm sustains ≥ 2 MB/s observed RX throughput (vs 663725 KB/s baseline at N=3) AND the link survives ≥ 30 min continuous load without cascading into beacon-loss disconnect (vs rep3's failure at ~9 min).*
Three measurable outcomes, single sentence:
- **(a) Throughput floor**: ≥ 2 MB/s sustained RX at ohm
- **(b) Lock-cycle ceiling**: % CPU in `_raw_spin_unlock_irqrestore` from `bes2600_bh.isra.0` callstack drops to < 10 % (currently ~10 % rep1, ~16 % rep3)
- **(c) Cascade prevention**: no link death under continuous 30 min @ 4 MB/s
---
## Phase 0 anchor — receipts
### Reproduction protocol (same units as Phase 7 will use)
1. boltzmann: `pv -L 4M -q < /dev/zero | nc ohm.fritz.box 12345`
2. ohm: `sudo $RUN/rep-trace.sh 600` (10 min capture window)
3. Rep dirs: `bug5/rep-<ts>/{mmc.log, perf.data, rx_bytes.tsv, start.txt, end.txt}`
4. N=3 reps with 60 s cooldowns
### Observed (2026-05-07 15:3616:08)
| rep | duration | avg KB/s | near-zero ticks | end state |
|---|---|---|---|---|
| 1 | 600 s | 725 | 1/119 | associated, MCS 6 |
| 2 | 600 s | 663 | 5/119 | associated, MCS 4 |
| 3 | 600 s | 75 | 53/119 | **passive (link died at sample ~82, ~9 min in)** |
mmc transaction rate: rep1 = 5793/s sustained, rep3 = 6000/s for first ~10s then collapse to <100/s.
### Hot-symbol receipts (perf record on `bes_sdio | bes2600_bh` kworkers)
| symbol | rep 1 (healthy) | rep 3 (cascade) |
|---|---|---|
| `_raw_spin_unlock_irqrestore` (sum across kworker variants) | **~19 %** | **~21 %** |
| `handle_softirqs` | 5.4 % | 4.3 % |
| `__tasklet_schedule` | 2.4 % | 2.0 % |
| `dw_mci_start_command` (SDIO host) | 1.5 % | < 0.6 % |
| `bes2600_sw_retry_requeue` | 0.79 % | 0.70 % |
Top callchain leading to `_raw_spin_unlock_irqrestore`:
```
process_one_work → worker_thread → wsm_configuration → wsm_cmd_send → bes2600_bh.isra.0
```
---
## Phase 2 — situation analysis
### Relevant source pins
- `bes2600/wsm.c:98``wsm_cmd_send()`, the function in the hot callstack. Body:
- holds `&hw_priv->wsm_cmd.lock` (spin) only briefly to fill the cmd struct (lines ~145152)
- calls `bes2600_bh_wakeup()` then `wait_event_interruptible_timeout` for the response
- outer lock: `down(&hw_priv->wsm_cmd_sema)` from callers (`wsm_cmd_lock()` at wsm.c:105)
- `bes2600/bh.c:435,559,847``bes2600_bh_work()` takes `&hw_priv->vif_list_lock` 3× per pass through the dispatcher
- `bes2600/bh.c:172,195,487,581,861,1361,1427` — multiple `wait_event*` calls; the BH thread cycles through wait/wake/dispatch
### What the lock-cycling cost is buying
Each WSM command from the host (mac80211, NM, kernel scan etc.) goes through the same path:
1. caller acquires `wsm_cmd_sema` (outer)
2. `wsm_cmd_send()` acquires `wsm_cmd.lock`, fills the struct, releases the lock
3. `bes2600_bh_wakeup()` schedules the BH
4. BH dispatches the command, takes `vif_list_lock` to walk vifs
5. BH talks to chip via SDIO
6. response arrives, BH wakes the waiter
7. caller releases `wsm_cmd_sema`
Under sustained TCP RX, mac80211 issues lots of small WSM commands (TX-scheduling hints, rate updates, etc.) — every one cycles through this path. The spin-unlock-irqrestore cost is the floor on this cycle rate.
### What's been ruled out
- AP-side bug (AVM Fritz!Box, reliable per Markus's testimony — see `feedback_phase7_stress_ramp.md` reasoning in the campaign-so-far prior).
- Patch A / Patch B (target different triggers; would not address lock cost).
- Decrypt-failure storm (Patch A handles; this regression occurs in rep1 with zero decrypt-fails).
- mac80211 scan-fail / scan-comeback (cosmetic; doesn't account for the throughput floor).
---
## Phase 4 — candidate plans (preliminary, not locked)
Three candidates surfaced from the perf data. Listed cheapest to most-invasive.
### B5-1 — Reduce `wsm_cmd_send` lock scope
The spin_lock around the cmd-struct fill (wsm.c:145152) can probably be replaced with `WRITE_ONCE` of a single sentinel field, since the BH thread reads these fields cooperatively (BH only reads after `bh_wakeup` schedules it, and only writes back via the response path). Eliminates the per-command spin-lock cycle for the host side.
**Risk**: race with BH if the protocol isn't strictly happens-before. Need to read bh.c:486-500 (where BH reads wsm_cmd.ptr) carefully.
**Predicted delta**: small but measurable. Maybe 23 % CPU off the lock floor.
### B5-2 — Coalesce `vif_list_lock` in BH dispatcher
bh.c takes `vif_list_lock` 3× per dispatcher pass. If these 3 critical sections are within a single iteration, they should be merged into one acquire/release.
**Risk**: vif teardown might depend on releasing the lock between iterations to allow concurrent vif removal. Needs careful audit.
**Predicted delta**: significant under multi-vif (we're single-vif STA today, so probably small immediate gain).
### B5-3 — Move WSM send out of process context, use ringbuffer
Replace the wsm_cmd_sema + wsm_cmd struct mechanism with an SPSC ringbuffer between caller and BH. Caller writes to ring, no lock needed (one producer); BH reads from ring, no lock needed (one consumer).
**Risk**: significant rework. cw1200 ancestor doesn't have this; we'd be inventing it.
**Predicted delta**: large — could halve the lock cost. But cost-to-implement is also large.
---
## Open question for the reviewer
Which Phase 4 candidate to lock? My ranking by ROI:
1. B5-1 (smallest, fastest, cleanest source pin) — try first
2. B5-2 (medium, conditional on multi-vif applicability)
3. B5-3 (largest rework, biggest potential)
Or: instrument deeper before committing to a fix (e.g., add `tracepoints around wsm_cmd_send` enter/exit to measure lock holdtime distribution, not just CPU%).