notes: Bug #5 Phase 1 metric + Phase 0 anchor receipts
Phase 0 anchored at N=3 reps (10min @ 4MB/s pv-cap on 2.4GHz): - rep1+2: ~700 KB/s sustained (10% of link capacity) - rep3: link death at ~9 min in (passive mode, beacon-loss cascade) Hot symbol identified: _raw_spin_unlock_irqrestore at ~20% CPU in both healthy and failed reps, callstack process_one_work → wsm_configuration → wsm_cmd_send → bes2600_bh.isra.0 → spin-unlock. Phase 1 metric locked: ≥2 MB/s sustained throughput, <10% CPU in lock- cycling, no link death under 30 min continuous load. Three Phase 4 candidates drafted (B5-1: shrink wsm_cmd_send lock scope; B5-2: coalesce vif_list_lock in BH dispatcher; B5-3: SPSC ringbuffer for WSM commands). Locking pending review.
This commit is contained in:
@@ -0,0 +1,135 @@
|
||||
# Bug #5 campaign — Phase 1 metric + Phase 2 situation
|
||||
|
||||
Date assembled: 2026-05-07
|
||||
Module under test (baseline): bes2600.ko srcversion `1B3B3ED096AAD7217FEDE11`
|
||||
(cleanups + Patch A + Patch B)
|
||||
|
||||
Phase 0 anchor at N=3 reps (10 min each, 4 MB/s pv-cap on boltzmann → ohm
|
||||
2.4GHz 5b:32) reproduces the throughput regression and traces it to lock-
|
||||
cycling cost in the bes2600 BH path. See `notes/observed-bugs.md` Bug #5
|
||||
for the original report. This document locks Phase 1 and prepares Phase 4
|
||||
candidates.
|
||||
|
||||
---
|
||||
|
||||
## Phase 1 — measurable target (locked)
|
||||
|
||||
> *Reduce the bes2600 BH path's spin-unlock-irqrestore cost so that under sustained 4 MB/s sender pressure on a healthy 2.4GHz link (signal -55 to -65 dBm), ohm sustains ≥ 2 MB/s observed RX throughput (vs 663–725 KB/s baseline at N=3) AND the link survives ≥ 30 min continuous load without cascading into beacon-loss disconnect (vs rep3's failure at ~9 min).*
|
||||
|
||||
Three measurable outcomes, single sentence:
|
||||
|
||||
- **(a) Throughput floor**: ≥ 2 MB/s sustained RX at ohm
|
||||
- **(b) Lock-cycle ceiling**: % CPU in `_raw_spin_unlock_irqrestore` from `bes2600_bh.isra.0` callstack drops to < 10 % (currently ~10 % rep1, ~16 % rep3)
|
||||
- **(c) Cascade prevention**: no link death under continuous 30 min @ 4 MB/s
|
||||
|
||||
---
|
||||
|
||||
## Phase 0 anchor — receipts
|
||||
|
||||
### Reproduction protocol (same units as Phase 7 will use)
|
||||
|
||||
1. boltzmann: `pv -L 4M -q < /dev/zero | nc ohm.fritz.box 12345`
|
||||
2. ohm: `sudo $RUN/rep-trace.sh 600` (10 min capture window)
|
||||
3. Rep dirs: `bug5/rep-<ts>/{mmc.log, perf.data, rx_bytes.tsv, start.txt, end.txt}`
|
||||
4. N=3 reps with 60 s cooldowns
|
||||
|
||||
### Observed (2026-05-07 15:36–16:08)
|
||||
|
||||
| rep | duration | avg KB/s | near-zero ticks | end state |
|
||||
|---|---|---|---|---|
|
||||
| 1 | 600 s | 725 | 1/119 | associated, MCS 6 |
|
||||
| 2 | 600 s | 663 | 5/119 | associated, MCS 4 |
|
||||
| 3 | 600 s | 75 | 53/119 | **passive (link died at sample ~82, ~9 min in)** |
|
||||
|
||||
mmc transaction rate: rep1 = 5793/s sustained, rep3 = 6000/s for first ~10s then collapse to <100/s.
|
||||
|
||||
### Hot-symbol receipts (perf record on `bes_sdio | bes2600_bh` kworkers)
|
||||
|
||||
| symbol | rep 1 (healthy) | rep 3 (cascade) |
|
||||
|---|---|---|
|
||||
| `_raw_spin_unlock_irqrestore` (sum across kworker variants) | **~19 %** | **~21 %** |
|
||||
| `handle_softirqs` | 5.4 % | 4.3 % |
|
||||
| `__tasklet_schedule` | 2.4 % | 2.0 % |
|
||||
| `dw_mci_start_command` (SDIO host) | 1.5 % | < 0.6 % |
|
||||
| `bes2600_sw_retry_requeue` | 0.79 % | 0.70 % |
|
||||
|
||||
Top callchain leading to `_raw_spin_unlock_irqrestore`:
|
||||
|
||||
```
|
||||
process_one_work → worker_thread → wsm_configuration → wsm_cmd_send → bes2600_bh.isra.0
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Phase 2 — situation analysis
|
||||
|
||||
### Relevant source pins
|
||||
|
||||
- `bes2600/wsm.c:98` — `wsm_cmd_send()`, the function in the hot callstack. Body:
|
||||
- holds `&hw_priv->wsm_cmd.lock` (spin) only briefly to fill the cmd struct (lines ~145–152)
|
||||
- calls `bes2600_bh_wakeup()` then `wait_event_interruptible_timeout` for the response
|
||||
- outer lock: `down(&hw_priv->wsm_cmd_sema)` from callers (`wsm_cmd_lock()` at wsm.c:105)
|
||||
- `bes2600/bh.c:435,559,847` — `bes2600_bh_work()` takes `&hw_priv->vif_list_lock` 3× per pass through the dispatcher
|
||||
- `bes2600/bh.c:172,195,487,581,861,1361,1427` — multiple `wait_event*` calls; the BH thread cycles through wait/wake/dispatch
|
||||
|
||||
### What the lock-cycling cost is buying
|
||||
|
||||
Each WSM command from the host (mac80211, NM, kernel scan etc.) goes through the same path:
|
||||
1. caller acquires `wsm_cmd_sema` (outer)
|
||||
2. `wsm_cmd_send()` acquires `wsm_cmd.lock`, fills the struct, releases the lock
|
||||
3. `bes2600_bh_wakeup()` schedules the BH
|
||||
4. BH dispatches the command, takes `vif_list_lock` to walk vifs
|
||||
5. BH talks to chip via SDIO
|
||||
6. response arrives, BH wakes the waiter
|
||||
7. caller releases `wsm_cmd_sema`
|
||||
|
||||
Under sustained TCP RX, mac80211 issues lots of small WSM commands (TX-scheduling hints, rate updates, etc.) — every one cycles through this path. The spin-unlock-irqrestore cost is the floor on this cycle rate.
|
||||
|
||||
### What's been ruled out
|
||||
|
||||
- AP-side bug (AVM Fritz!Box, reliable per Markus's testimony — see `feedback_phase7_stress_ramp.md` reasoning in the campaign-so-far prior).
|
||||
- Patch A / Patch B (target different triggers; would not address lock cost).
|
||||
- Decrypt-failure storm (Patch A handles; this regression occurs in rep1 with zero decrypt-fails).
|
||||
- mac80211 scan-fail / scan-comeback (cosmetic; doesn't account for the throughput floor).
|
||||
|
||||
---
|
||||
|
||||
## Phase 4 — candidate plans (preliminary, not locked)
|
||||
|
||||
Three candidates surfaced from the perf data. Listed cheapest to most-invasive.
|
||||
|
||||
### B5-1 — Reduce `wsm_cmd_send` lock scope
|
||||
|
||||
The spin_lock around the cmd-struct fill (wsm.c:145–152) can probably be replaced with `WRITE_ONCE` of a single sentinel field, since the BH thread reads these fields cooperatively (BH only reads after `bh_wakeup` schedules it, and only writes back via the response path). Eliminates the per-command spin-lock cycle for the host side.
|
||||
|
||||
**Risk**: race with BH if the protocol isn't strictly happens-before. Need to read bh.c:486-500 (where BH reads wsm_cmd.ptr) carefully.
|
||||
|
||||
**Predicted delta**: small but measurable. Maybe 2–3 % CPU off the lock floor.
|
||||
|
||||
### B5-2 — Coalesce `vif_list_lock` in BH dispatcher
|
||||
|
||||
bh.c takes `vif_list_lock` 3× per dispatcher pass. If these 3 critical sections are within a single iteration, they should be merged into one acquire/release.
|
||||
|
||||
**Risk**: vif teardown might depend on releasing the lock between iterations to allow concurrent vif removal. Needs careful audit.
|
||||
|
||||
**Predicted delta**: significant under multi-vif (we're single-vif STA today, so probably small immediate gain).
|
||||
|
||||
### B5-3 — Move WSM send out of process context, use ringbuffer
|
||||
|
||||
Replace the wsm_cmd_sema + wsm_cmd struct mechanism with an SPSC ringbuffer between caller and BH. Caller writes to ring, no lock needed (one producer); BH reads from ring, no lock needed (one consumer).
|
||||
|
||||
**Risk**: significant rework. cw1200 ancestor doesn't have this; we'd be inventing it.
|
||||
|
||||
**Predicted delta**: large — could halve the lock cost. But cost-to-implement is also large.
|
||||
|
||||
---
|
||||
|
||||
## Open question for the reviewer
|
||||
|
||||
Which Phase 4 candidate to lock? My ranking by ROI:
|
||||
|
||||
1. B5-1 (smallest, fastest, cleanest source pin) — try first
|
||||
2. B5-2 (medium, conditional on multi-vif applicability)
|
||||
3. B5-3 (largest rework, biggest potential)
|
||||
|
||||
Or: instrument deeper before committing to a fix (e.g., add `tracepoints around wsm_cmd_send` enter/exit to measure lock holdtime distribution, not just CPU%).
|
||||
Reference in New Issue
Block a user