# Bug #5 RX-degradation campaign — Phase 0 **Date:** 2026-05-07 **Module under test:** v3 + F (`bes2600.ko` srcversion `371C6606B73AF19299228CA`) **Hardware:** ohm (PineTab2, RK3566 + BES2600 SDIO), wired enu1 fallback path live. --- ## Research question (locked) > **Why does the bes2600 RX path collapse from ~2 MB/s sustained @ fresh-chip uptime to ~180 B/s @ ~28-min uptime, with periodic `wsm_generic_confirm failed for request 0x0007` + `ieee80211 phy0: [SCAN] Scan failed (-22)` every 300 s in the intervening window?** Reproduces on Patch B, Patch F, and Patch C v3 alike — independent of the relay/race issues v3 addressed. Side-effect that was masked by the throughput floor while v2's race was the dominant variable. ## Predecessor data (reference, not anchor) | source | observation | |---|---| | Patch C v3 N=3 (uptime 200/391/582 s) | mean 2.352 MB/s @ 4 MB/s sender | | v3 single rep at uptime ~28 min (rep 2 of 2026-05-07 22:23) | 180 KB / 5 min = 600 B/s, sender saw "Connection reset by peer" | | v3 single rep at uptime ~47 min (N=3 first attempt 22:42) | 55 KB / 5 min = 180 B/s, sender timed out (exit 124) | | dmesg pattern observed at 47-min uptime | scan failures every 301-302 s starting at uptime 778 s (~13 min) | The shape: **fresh chip → linear data flow at ~2 MB/s sustained → sometime around 13 min uptime, NetworkManager-triggered scans start failing → sometime around 28 min uptime, data throughput collapses to <1 KB/s while link still shows associated.** Predecessor data is reference. Phase 0 will re-anchor at N=1 long-trace + 5 in-window stress probes; if the pattern doesn't reproduce, that's the campaign result. ## Mechanism candidates (Phase 4 will discriminate) 1. **Firmware-side resource exhaustion.** Per-scan or per-WSM-event accumulation in chip-side state. Scan-failed -22 (EINVAL) suggests firmware refusing the request — possibly out of scan handles, scan-buffer slots, or some other limit. 2. **NetworkManager scan-fail recovery loop.** Each failed scan triggers NM retry. If retry overhead dominates the bh thread, data path starves. Verifiable by suppressing NM scans. 3. **AP-side rate limiting.** Newton (AVM) AP could be applying QoS / fairness / probation after sustained 4 MB/s burst. Verifiable by Fritz!Box log access (Markus has it) or by switching to a different AP. 4. **PSM state machine deadlock.** c7's `pm_unsupported` self-detect was supposed to handle this, but the latch state could become stale if a real PM_IND arrives mid-operation. Verifiable by `chip_pm_state` debugfs read at degradation onset. 5. **SDIO bus clock degradation / mmc retune.** SDIO retune with `retune_protected` flag interacts with bes2600's data path. Verifiable by ftrace `mmc/mmc_request_*` event correlation with throughput drop. 6. **Power-management busy-event accumulation.** `bes2600_pwr_set_busy_event` counters might leak — busy events not cleared lock the chip awake (no PSM) but also exhaust event capacity. Verifiable by `bes2600_pwr_busy_event_record` dump. ## Phase 0 measurement protocol (rig armed 2026-05-07 23:18:58 CEST, T0=1778188738) Capturing for 35 minutes from fresh boot. All capture lives in `/root/bes2600-samples/run-20260507-bug5-degradation-rig/` on ohm. ### Always-on streams | stream | tool | output | |---|---|---| | ftrace events | per-event `enable=1` | `trace.log` (via `trace_pipe`) | | cfg80211 events | `iw event -t -f` | `iw-event.log` | | kernel printks | `dmesg -wT` | `dmesg.log` | | netdev counters | per-30s shell loop | `snap.log` | ### ftrace event set - `workqueue/workqueue_execute_start` — work dispatches - `workqueue/workqueue_queue_work` — work submissions - `mac80211/api_beacon_loss` — driver beacon-loss events - `mac80211/api_connection_loss` — driver-side conn-loss - `mac80211/api_disconnect` — driver-side disconnect - `mac80211/drv_hw_scan` — mac80211 → driver scan dispatch - `mac80211/drv_set_key` — key state changes - `cfg80211/rdev_assoc` — assoc requests - `cfg80211/rdev_deauth` — deauth requests - `cfg80211/rdev_disassoc` — disassoc requests - `cfg80211/cfg80211_assoc_comeback` — AP-side assoc-busy throttling - `cfg80211/cfg80211_send_auth_timeout` — auth timeouts - `cfg80211/cfg80211_scan_done` — scan completions - `power/suspend_resume` — PM transitions - `mmc/mmc_request_start` / `mmc_request_done` — bus-level transactions ### Scheduled stress probes Sender on boltzmann (`/tmp/bug5-probe-loop.sh`) fires `pv -L 4m | nc ohm 12345` for 30 s at T+5/10/15/20/25 min. Each probe brackets uptime, RX-bytes pre, RX-bytes post, elapsed. Throughput-vs-uptime curve falls out of the snap.log + probe boundaries. Probe markers logged via `logger -t bes2600-bug5 PROBE_N_START/END` so they appear in dmesg.log timeline. ## Anti-theatre receipts (must tick before claiming Phase 0 done) - [ ] In-session baseline: long-capture across degradation window, N=1 for now; re-run if anomalous - [ ] ftrace events actually firing (verify by tail of trace.log mid-capture) - [ ] dmesg captures the scan-failure pattern timestamp (expected ~uptime 778 s) - [ ] Probes actually transferred data at fresh chip (T+5 should be > 1 MB/s) - [ ] At least one probe in-window after scan-failure onset (expected: T+15 or T+20) - [ ] Snap.log shows monotonic counter behaviour (no rx_bytes going backwards) ## Phase 1 hypothesis (provisional, refine after Phase 3 data) Metric candidate: **probe throughput as function of uptime, with state-transition markers (first `wsm_generic_confirm 0x0007 failed`, first `[SCAN] Scan failed (-22)`, first NetworkManager-deauth-and-reassociate)**. Discriminator question: does throughput collapse abruptly at the first scan failure, or gradually over a window? Abrupt = single-event causation; gradual = accumulator. ## Phase 4 candidates (post-Phase-3) Depending on which mechanism (1-6) Phase 3 surfaces: - (1) firmware resource exhaustion: report to upstream; possibly disable NetworkManager scans pending firmware fix. - (2) NM scan-fail loop: configure `wpa_supplicant` to skip scans; or add scan-failure handling in driver to dampen retry cascade. - (3) AP-side: switch APs for testing; report to AVM if reproducible. - (4) PSM deadlock: extend c7 latch with timeout-or-progress recovery. - (5) SDIO retune: ftrace correlation guides the lock-ordering fix. - (6) PWR busy-event leak: audit set/clear pairs; add a warning-when-stale. ## Out-of-scope - Patch C v3 closure (PR #5 merged, Phase 7 done). - Patch C2 (`ieee80211_rx_list` batch) — gated on Task #19 kerneldoc. - Patch D / E independent. - Reproduction at higher rates (8 MB/s ramp) — defer to Phase 4 once mechanism identified. --- *Phase 0 plan written 2026-05-07 23:21 CEST by Claude (noether), at the close of Patch C v3 Phase 7. Rig armed; long capture in flight; probes scheduled at T+5/10/15/20/25 min. Post-capture analysis will populate Phase 3 results before Phase 4 plan branches off.*