After Patch C v3 closed (PR #5 merged, Phase 7 N=3 verified at +73% throughput vs Patch B baseline), the post-13-min RX-degradation pattern remains. Reproduces on Patch B, F, and v3 alike — independent of the relay/race issues v3 addressed. Side-effect that was masked by the throughput floor while v2's race was the dominant variable. Research question (locked): Why does the bes2600 RX path collapse from ~2 MB/s sustained @ fresh-chip uptime to ~180 B/s @ ~28-min uptime, with periodic wsm_generic_confirm failed for request 0x0007 + ieee80211 phy0: [SCAN] Scan failed (-22) every 300 s in the intervening window? Phase 0 protocol: - long-capture rig armed on ohm at uptime 0 (fresh boot 23:13 CEST) - ftrace events: workqueue, mac80211, cfg80211, mmc, sdhci, power - iw event (cfg80211 reason codes), dmesg follow, per-30s netdev counter snap, 5 stress probes at T+5/10/15/20/25 min Phase 0 will: - re-anchor the predecessor data via the long capture (in-session N=1; re-run if anomalous) - characterize state transitions (first scan-fail, first throughput drop) via cfg80211/mac80211 ftrace + iw event correlation - feed Phase 1 metric formulation Mechanism candidates (Phase 4 will discriminate): 1. Firmware-side resource exhaustion (per-scan accumulator) 2. NetworkManager scan-fail recovery loop competing with data 3. AP-side rate limiting / fairness probation 4. PSM state machine deadlock (c7 latch stale) 5. SDIO bus retune interaction 6. Power-management busy-event accumulator leak Out of scope: Patch C2/D/E, higher-rate ramp, reproducing on different APs. Independent campaign from Patch C closure.
6.8 KiB
Bug #5 RX-degradation campaign — Phase 0
Date: 2026-05-07
Module under test: v3 + F (bes2600.ko srcversion 371C6606B73AF19299228CA)
Hardware: ohm (PineTab2, RK3566 + BES2600 SDIO), wired enu1 fallback path live.
Research question (locked)
Why does the bes2600 RX path collapse from ~2 MB/s sustained @ fresh-chip uptime to ~180 B/s @ ~28-min uptime, with periodic
wsm_generic_confirm failed for request 0x0007+ieee80211 phy0: [SCAN] Scan failed (-22)every 300 s in the intervening window?
Reproduces on Patch B, Patch F, and Patch C v3 alike — independent of the relay/race issues v3 addressed. Side-effect that was masked by the throughput floor while v2's race was the dominant variable.
Predecessor data (reference, not anchor)
| source | observation |
|---|---|
| Patch C v3 N=3 (uptime 200/391/582 s) | mean 2.352 MB/s @ 4 MB/s sender |
| v3 single rep at uptime ~28 min (rep 2 of 2026-05-07 22:23) | 180 KB / 5 min = 600 B/s, sender saw "Connection reset by peer" |
| v3 single rep at uptime ~47 min (N=3 first attempt 22:42) | 55 KB / 5 min = 180 B/s, sender timed out (exit 124) |
| dmesg pattern observed at 47-min uptime | scan failures every 301-302 s starting at uptime 778 s (~13 min) |
The shape: fresh chip → linear data flow at ~2 MB/s sustained → sometime around 13 min uptime, NetworkManager-triggered scans start failing → sometime around 28 min uptime, data throughput collapses to <1 KB/s while link still shows associated.
Predecessor data is reference. Phase 0 will re-anchor at N=1 long-trace + 5 in-window stress probes; if the pattern doesn't reproduce, that's the campaign result.
Mechanism candidates (Phase 4 will discriminate)
- Firmware-side resource exhaustion. Per-scan or per-WSM-event accumulation in chip-side state. Scan-failed -22 (EINVAL) suggests firmware refusing the request — possibly out of scan handles, scan-buffer slots, or some other limit.
- NetworkManager scan-fail recovery loop. Each failed scan triggers NM retry. If retry overhead dominates the bh thread, data path starves. Verifiable by suppressing NM scans.
- AP-side rate limiting. Newton (AVM) AP could be applying QoS / fairness / probation after sustained 4 MB/s burst. Verifiable by Fritz!Box log access (Markus has it) or by switching to a different AP.
- PSM state machine deadlock. c7's
pm_unsupportedself-detect was supposed to handle this, but the latch state could become stale if a real PM_IND arrives mid-operation. Verifiable bychip_pm_statedebugfs read at degradation onset. - SDIO bus clock degradation / mmc retune. SDIO retune with
retune_protectedflag interacts with bes2600's data path. Verifiable by ftracemmc/mmc_request_*event correlation with throughput drop. - Power-management busy-event accumulation.
bes2600_pwr_set_busy_eventcounters might leak — busy events not cleared lock the chip awake (no PSM) but also exhaust event capacity. Verifiable bybes2600_pwr_busy_event_recorddump.
Phase 0 measurement protocol (rig armed 2026-05-07 23:18:58 CEST, T0=1778188738)
Capturing for 35 minutes from fresh boot. All capture lives in /root/bes2600-samples/run-20260507-bug5-degradation-rig/ on ohm.
Always-on streams
| stream | tool | output |
|---|---|---|
| ftrace events | per-event enable=1 |
trace.log (via trace_pipe) |
| cfg80211 events | iw event -t -f |
iw-event.log |
| kernel printks | dmesg -wT |
dmesg.log |
| netdev counters | per-30s shell loop | snap.log |
ftrace event set
workqueue/workqueue_execute_start— work dispatchesworkqueue/workqueue_queue_work— work submissionsmac80211/api_beacon_loss— driver beacon-loss eventsmac80211/api_connection_loss— driver-side conn-lossmac80211/api_disconnect— driver-side disconnectmac80211/drv_hw_scan— mac80211 → driver scan dispatchmac80211/drv_set_key— key state changescfg80211/rdev_assoc— assoc requestscfg80211/rdev_deauth— deauth requestscfg80211/rdev_disassoc— disassoc requestscfg80211/cfg80211_assoc_comeback— AP-side assoc-busy throttlingcfg80211/cfg80211_send_auth_timeout— auth timeoutscfg80211/cfg80211_scan_done— scan completionspower/suspend_resume— PM transitionsmmc/mmc_request_start/mmc_request_done— bus-level transactions
Scheduled stress probes
Sender on boltzmann (/tmp/bug5-probe-loop.sh) fires pv -L 4m | nc ohm 12345 for 30 s at T+5/10/15/20/25 min. Each probe brackets uptime, RX-bytes pre, RX-bytes post, elapsed. Throughput-vs-uptime curve falls out of the snap.log + probe boundaries.
Probe markers logged via logger -t bes2600-bug5 PROBE_N_START/END so they appear in dmesg.log timeline.
Anti-theatre receipts (must tick before claiming Phase 0 done)
- In-session baseline: long-capture across degradation window, N=1 for now; re-run if anomalous
- ftrace events actually firing (verify by tail of trace.log mid-capture)
- dmesg captures the scan-failure pattern timestamp (expected ~uptime 778 s)
- Probes actually transferred data at fresh chip (T+5 should be > 1 MB/s)
- At least one probe in-window after scan-failure onset (expected: T+15 or T+20)
- Snap.log shows monotonic counter behaviour (no rx_bytes going backwards)
Phase 1 hypothesis (provisional, refine after Phase 3 data)
Metric candidate: probe throughput as function of uptime, with state-transition markers (first wsm_generic_confirm 0x0007 failed, first [SCAN] Scan failed (-22), first NetworkManager-deauth-and-reassociate).
Discriminator question: does throughput collapse abruptly at the first scan failure, or gradually over a window? Abrupt = single-event causation; gradual = accumulator.
Phase 4 candidates (post-Phase-3)
Depending on which mechanism (1-6) Phase 3 surfaces:
- (1) firmware resource exhaustion: report to upstream; possibly disable NetworkManager scans pending firmware fix.
- (2) NM scan-fail loop: configure
wpa_supplicantto skip scans; or add scan-failure handling in driver to dampen retry cascade. - (3) AP-side: switch APs for testing; report to AVM if reproducible.
- (4) PSM deadlock: extend c7 latch with timeout-or-progress recovery.
- (5) SDIO retune: ftrace correlation guides the lock-ordering fix.
- (6) PWR busy-event leak: audit set/clear pairs; add a warning-when-stale.
Out-of-scope
- Patch C v3 closure (PR #5 merged, Phase 7 done).
- Patch C2 (
ieee80211_rx_listbatch) — gated on Task #19 kerneldoc. - Patch D / E independent.
- Reproduction at higher rates (8 MB/s ramp) — defer to Phase 4 once mechanism identified.
Phase 0 plan written 2026-05-07 23:21 CEST by Claude (noether), at the close of Patch C v3 Phase 7. Rig armed; long capture in flight; probes scheduled at T+5/10/15/20/25 min. Post-capture analysis will populate Phase 3 results before Phase 4 plan branches off.