notes: Bug #5 RX-degradation campaign — Phase 0 plan (#13)

2026-05-07 21:52:14 +00:00
parent fde41fcdd4 6bae531917
commit fc88ff41c3
1 changed files with 108 additions and 0 deletions
@@ -0,0 +1,108 @@
+# Bug #5 RX-degradation campaign — Phase 0
+
+**Date:** 2026-05-07
+**Module under test:** v3 + F (`bes2600.ko` srcversion `371C6606B73AF19299228CA`)
+**Hardware:** ohm (PineTab2, RK3566 + BES2600 SDIO), wired enu1 fallback path live.
+
+---
+
+## Research question (locked)
+
+> **Why does the bes2600 RX path collapse from ~2 MB/s sustained @ fresh-chip uptime to ~180 B/s @ ~28-min uptime, with periodic `wsm_generic_confirm failed for request 0x0007` + `ieee80211 phy0: [SCAN] Scan failed (-22)` every 300 s in the intervening window?**
+
+Reproduces on Patch B, Patch F, and Patch C v3 alike — independent of the relay/race issues v3 addressed. Side-effect that was masked by the throughput floor while v2's race was the dominant variable.
+
+## Predecessor data (reference, not anchor)
+
+| source | observation |
+|---|---|
+| Patch C v3 N=3 (uptime 200/391/582 s) | mean 2.352 MB/s @ 4 MB/s sender |
+| v3 single rep at uptime ~28 min (rep 2 of 2026-05-07 22:23) | 180 KB / 5 min = 600 B/s, sender saw "Connection reset by peer" |
+| v3 single rep at uptime ~47 min (N=3 first attempt 22:42) | 55 KB / 5 min = 180 B/s, sender timed out (exit 124) |
+| dmesg pattern observed at 47-min uptime | scan failures every 301-302 s starting at uptime 778 s (~13 min) |
+
+The shape: **fresh chip → linear data flow at ~2 MB/s sustained → sometime around 13 min uptime, NetworkManager-triggered scans start failing → sometime around 28 min uptime, data throughput collapses to <1 KB/s while link still shows associated.**
+
+Predecessor data is reference. Phase 0 will re-anchor at N=1 long-trace + 5 in-window stress probes; if the pattern doesn't reproduce, that's the campaign result.
+
+## Mechanism candidates (Phase 4 will discriminate)
+
+1. **Firmware-side resource exhaustion.** Per-scan or per-WSM-event accumulation in chip-side state. Scan-failed -22 (EINVAL) suggests firmware refusing the request — possibly out of scan handles, scan-buffer slots, or some other limit.
+2. **NetworkManager scan-fail recovery loop.** Each failed scan triggers NM retry. If retry overhead dominates the bh thread, data path starves. Verifiable by suppressing NM scans.
+3. **AP-side rate limiting.** Newton (AVM) AP could be applying QoS / fairness / probation after sustained 4 MB/s burst. Verifiable by Fritz!Box log access (Markus has it) or by switching to a different AP.
+4. **PSM state machine deadlock.** c7's `pm_unsupported` self-detect was supposed to handle this, but the latch state could become stale if a real PM_IND arrives mid-operation. Verifiable by `chip_pm_state` debugfs read at degradation onset.
+5. **SDIO bus clock degradation / mmc retune.** SDIO retune with `retune_protected` flag interacts with bes2600's data path. Verifiable by ftrace `mmc/mmc_request_*` event correlation with throughput drop.
+6. **Power-management busy-event accumulation.** `bes2600_pwr_set_busy_event` counters might leak — busy events not cleared lock the chip awake (no PSM) but also exhaust event capacity. Verifiable by `bes2600_pwr_busy_event_record` dump.
+
+## Phase 0 measurement protocol (rig armed 2026-05-07 23:18:58 CEST, T0=1778188738)
+
+Capturing for 35 minutes from fresh boot. All capture lives in `/root/bes2600-samples/run-20260507-bug5-degradation-rig/` on ohm.
+
+### Always-on streams
+
+| stream | tool | output |
+|---|---|---|
+| ftrace events | per-event `enable=1` | `trace.log` (via `trace_pipe`) |
+| cfg80211 events | `iw event -t -f` | `iw-event.log` |
+| kernel printks | `dmesg -wT` | `dmesg.log` |
+| netdev counters | per-30s shell loop | `snap.log` |
+
+### ftrace event set
+
+- `workqueue/workqueue_execute_start` — work dispatches
+- `workqueue/workqueue_queue_work` — work submissions
+- `mac80211/api_beacon_loss` — driver beacon-loss events
+- `mac80211/api_connection_loss` — driver-side conn-loss
+- `mac80211/api_disconnect` — driver-side disconnect
+- `mac80211/drv_hw_scan` — mac80211 → driver scan dispatch
+- `mac80211/drv_set_key` — key state changes
+- `cfg80211/rdev_assoc` — assoc requests
+- `cfg80211/rdev_deauth` — deauth requests
+- `cfg80211/rdev_disassoc` — disassoc requests
+- `cfg80211/cfg80211_assoc_comeback` — AP-side assoc-busy throttling
+- `cfg80211/cfg80211_send_auth_timeout` — auth timeouts
+- `cfg80211/cfg80211_scan_done` — scan completions
+- `power/suspend_resume` — PM transitions
+- `mmc/mmc_request_start` / `mmc_request_done` — bus-level transactions
+
+### Scheduled stress probes
+
+Sender on boltzmann (`/tmp/bug5-probe-loop.sh`) fires `pv -L 4m | nc ohm 12345` for 30 s at T+5/10/15/20/25 min. Each probe brackets uptime, RX-bytes pre, RX-bytes post, elapsed. Throughput-vs-uptime curve falls out of the snap.log + probe boundaries.
+
+Probe markers logged via `logger -t bes2600-bug5 PROBE_N_START/END` so they appear in dmesg.log timeline.
+
+## Anti-theatre receipts (must tick before claiming Phase 0 done)
+
+- [ ] In-session baseline: long-capture across degradation window, N=1 for now; re-run if anomalous
+- [ ] ftrace events actually firing (verify by tail of trace.log mid-capture)
+- [ ] dmesg captures the scan-failure pattern timestamp (expected ~uptime 778 s)
+- [ ] Probes actually transferred data at fresh chip (T+5 should be > 1 MB/s)
+- [ ] At least one probe in-window after scan-failure onset (expected: T+15 or T+20)
+- [ ] Snap.log shows monotonic counter behaviour (no rx_bytes going backwards)
+
+## Phase 1 hypothesis (provisional, refine after Phase 3 data)
+
+Metric candidate: **probe throughput as function of uptime, with state-transition markers (first `wsm_generic_confirm 0x0007 failed`, first `[SCAN] Scan failed (-22)`, first NetworkManager-deauth-and-reassociate)**.
+
+Discriminator question: does throughput collapse abruptly at the first scan failure, or gradually over a window? Abrupt = single-event causation; gradual = accumulator.
+
+## Phase 4 candidates (post-Phase-3)
+
+Depending on which mechanism (1-6) Phase 3 surfaces:
+- (1) firmware resource exhaustion: report to upstream; possibly disable NetworkManager scans pending firmware fix.
+- (2) NM scan-fail loop: configure `wpa_supplicant` to skip scans; or add scan-failure handling in driver to dampen retry cascade.
+- (3) AP-side: switch APs for testing; report to AVM if reproducible.
+- (4) PSM deadlock: extend c7 latch with timeout-or-progress recovery.
+- (5) SDIO retune: ftrace correlation guides the lock-ordering fix.
+- (6) PWR busy-event leak: audit set/clear pairs; add a warning-when-stale.
+
+## Out-of-scope
+
+- Patch C v3 closure (PR #5 merged, Phase 7 done).
+- Patch C2 (`ieee80211_rx_list` batch) — gated on Task #19 kerneldoc.
+- Patch D / E independent.
+- Reproduction at higher rates (8 MB/s ramp) — defer to Phase 4 once mechanism identified.
+
+---
+
+*Phase 0 plan written 2026-05-07 23:21 CEST by Claude (noether), at the close of Patch C v3 Phase 7.  Rig armed; long capture in flight; probes scheduled at T+5/10/15/20/25 min.  Post-capture analysis will populate Phase 3 results before Phase 4 plan branches off.*