From 6bae5319178837745b090097cffe6434024035aa Mon Sep 17 00:00:00 2001 From: "Claude (noether)" Date: Thu, 7 May 2026 23:23:31 +0200 Subject: [PATCH] =?UTF-8?q?notes:=20Bug=20#5=20RX-degradation=20campaign?= =?UTF-8?q?=20=E2=80=94=20Phase=200=20plan=20+=20research=20question?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit After Patch C v3 closed (PR #5 merged, Phase 7 N=3 verified at +73% throughput vs Patch B baseline), the post-13-min RX-degradation pattern remains. Reproduces on Patch B, F, and v3 alike — independent of the relay/race issues v3 addressed. Side-effect that was masked by the throughput floor while v2's race was the dominant variable. Research question (locked): Why does the bes2600 RX path collapse from ~2 MB/s sustained @ fresh-chip uptime to ~180 B/s @ ~28-min uptime, with periodic wsm_generic_confirm failed for request 0x0007 + ieee80211 phy0: [SCAN] Scan failed (-22) every 300 s in the intervening window? Phase 0 protocol: - long-capture rig armed on ohm at uptime 0 (fresh boot 23:13 CEST) - ftrace events: workqueue, mac80211, cfg80211, mmc, sdhci, power - iw event (cfg80211 reason codes), dmesg follow, per-30s netdev counter snap, 5 stress probes at T+5/10/15/20/25 min Phase 0 will: - re-anchor the predecessor data via the long capture (in-session N=1; re-run if anomalous) - characterize state transitions (first scan-fail, first throughput drop) via cfg80211/mac80211 ftrace + iw event correlation - feed Phase 1 metric formulation Mechanism candidates (Phase 4 will discriminate): 1. Firmware-side resource exhaustion (per-scan accumulator) 2. NetworkManager scan-fail recovery loop competing with data 3. AP-side rate limiting / fairness probation 4. PSM state machine deadlock (c7 latch stale) 5. SDIO bus retune interaction 6. Power-management busy-event accumulator leak Out of scope: Patch C2/D/E, higher-rate ramp, reproducing on different APs. Independent campaign from Patch C closure. --- .../bug5-rx-degradation-phase0-2026-05-07.md | 108 ++++++++++++++++++ 1 file changed, 108 insertions(+) create mode 100644 notes/bug5-rx-degradation-phase0-2026-05-07.md diff --git a/notes/bug5-rx-degradation-phase0-2026-05-07.md b/notes/bug5-rx-degradation-phase0-2026-05-07.md new file mode 100644 index 000000000..b7cb652a6 --- /dev/null +++ b/notes/bug5-rx-degradation-phase0-2026-05-07.md @@ -0,0 +1,108 @@ +# Bug #5 RX-degradation campaign — Phase 0 + +**Date:** 2026-05-07 +**Module under test:** v3 + F (`bes2600.ko` srcversion `371C6606B73AF19299228CA`) +**Hardware:** ohm (PineTab2, RK3566 + BES2600 SDIO), wired enu1 fallback path live. + +--- + +## Research question (locked) + +> **Why does the bes2600 RX path collapse from ~2 MB/s sustained @ fresh-chip uptime to ~180 B/s @ ~28-min uptime, with periodic `wsm_generic_confirm failed for request 0x0007` + `ieee80211 phy0: [SCAN] Scan failed (-22)` every 300 s in the intervening window?** + +Reproduces on Patch B, Patch F, and Patch C v3 alike — independent of the relay/race issues v3 addressed. Side-effect that was masked by the throughput floor while v2's race was the dominant variable. + +## Predecessor data (reference, not anchor) + +| source | observation | +|---|---| +| Patch C v3 N=3 (uptime 200/391/582 s) | mean 2.352 MB/s @ 4 MB/s sender | +| v3 single rep at uptime ~28 min (rep 2 of 2026-05-07 22:23) | 180 KB / 5 min = 600 B/s, sender saw "Connection reset by peer" | +| v3 single rep at uptime ~47 min (N=3 first attempt 22:42) | 55 KB / 5 min = 180 B/s, sender timed out (exit 124) | +| dmesg pattern observed at 47-min uptime | scan failures every 301-302 s starting at uptime 778 s (~13 min) | + +The shape: **fresh chip → linear data flow at ~2 MB/s sustained → sometime around 13 min uptime, NetworkManager-triggered scans start failing → sometime around 28 min uptime, data throughput collapses to <1 KB/s while link still shows associated.** + +Predecessor data is reference. Phase 0 will re-anchor at N=1 long-trace + 5 in-window stress probes; if the pattern doesn't reproduce, that's the campaign result. + +## Mechanism candidates (Phase 4 will discriminate) + +1. **Firmware-side resource exhaustion.** Per-scan or per-WSM-event accumulation in chip-side state. Scan-failed -22 (EINVAL) suggests firmware refusing the request — possibly out of scan handles, scan-buffer slots, or some other limit. +2. **NetworkManager scan-fail recovery loop.** Each failed scan triggers NM retry. If retry overhead dominates the bh thread, data path starves. Verifiable by suppressing NM scans. +3. **AP-side rate limiting.** Newton (AVM) AP could be applying QoS / fairness / probation after sustained 4 MB/s burst. Verifiable by Fritz!Box log access (Markus has it) or by switching to a different AP. +4. **PSM state machine deadlock.** c7's `pm_unsupported` self-detect was supposed to handle this, but the latch state could become stale if a real PM_IND arrives mid-operation. Verifiable by `chip_pm_state` debugfs read at degradation onset. +5. **SDIO bus clock degradation / mmc retune.** SDIO retune with `retune_protected` flag interacts with bes2600's data path. Verifiable by ftrace `mmc/mmc_request_*` event correlation with throughput drop. +6. **Power-management busy-event accumulation.** `bes2600_pwr_set_busy_event` counters might leak — busy events not cleared lock the chip awake (no PSM) but also exhaust event capacity. Verifiable by `bes2600_pwr_busy_event_record` dump. + +## Phase 0 measurement protocol (rig armed 2026-05-07 23:18:58 CEST, T0=1778188738) + +Capturing for 35 minutes from fresh boot. All capture lives in `/root/bes2600-samples/run-20260507-bug5-degradation-rig/` on ohm. + +### Always-on streams + +| stream | tool | output | +|---|---|---| +| ftrace events | per-event `enable=1` | `trace.log` (via `trace_pipe`) | +| cfg80211 events | `iw event -t -f` | `iw-event.log` | +| kernel printks | `dmesg -wT` | `dmesg.log` | +| netdev counters | per-30s shell loop | `snap.log` | + +### ftrace event set + +- `workqueue/workqueue_execute_start` — work dispatches +- `workqueue/workqueue_queue_work` — work submissions +- `mac80211/api_beacon_loss` — driver beacon-loss events +- `mac80211/api_connection_loss` — driver-side conn-loss +- `mac80211/api_disconnect` — driver-side disconnect +- `mac80211/drv_hw_scan` — mac80211 → driver scan dispatch +- `mac80211/drv_set_key` — key state changes +- `cfg80211/rdev_assoc` — assoc requests +- `cfg80211/rdev_deauth` — deauth requests +- `cfg80211/rdev_disassoc` — disassoc requests +- `cfg80211/cfg80211_assoc_comeback` — AP-side assoc-busy throttling +- `cfg80211/cfg80211_send_auth_timeout` — auth timeouts +- `cfg80211/cfg80211_scan_done` — scan completions +- `power/suspend_resume` — PM transitions +- `mmc/mmc_request_start` / `mmc_request_done` — bus-level transactions + +### Scheduled stress probes + +Sender on boltzmann (`/tmp/bug5-probe-loop.sh`) fires `pv -L 4m | nc ohm 12345` for 30 s at T+5/10/15/20/25 min. Each probe brackets uptime, RX-bytes pre, RX-bytes post, elapsed. Throughput-vs-uptime curve falls out of the snap.log + probe boundaries. + +Probe markers logged via `logger -t bes2600-bug5 PROBE_N_START/END` so they appear in dmesg.log timeline. + +## Anti-theatre receipts (must tick before claiming Phase 0 done) + +- [ ] In-session baseline: long-capture across degradation window, N=1 for now; re-run if anomalous +- [ ] ftrace events actually firing (verify by tail of trace.log mid-capture) +- [ ] dmesg captures the scan-failure pattern timestamp (expected ~uptime 778 s) +- [ ] Probes actually transferred data at fresh chip (T+5 should be > 1 MB/s) +- [ ] At least one probe in-window after scan-failure onset (expected: T+15 or T+20) +- [ ] Snap.log shows monotonic counter behaviour (no rx_bytes going backwards) + +## Phase 1 hypothesis (provisional, refine after Phase 3 data) + +Metric candidate: **probe throughput as function of uptime, with state-transition markers (first `wsm_generic_confirm 0x0007 failed`, first `[SCAN] Scan failed (-22)`, first NetworkManager-deauth-and-reassociate)**. + +Discriminator question: does throughput collapse abruptly at the first scan failure, or gradually over a window? Abrupt = single-event causation; gradual = accumulator. + +## Phase 4 candidates (post-Phase-3) + +Depending on which mechanism (1-6) Phase 3 surfaces: +- (1) firmware resource exhaustion: report to upstream; possibly disable NetworkManager scans pending firmware fix. +- (2) NM scan-fail loop: configure `wpa_supplicant` to skip scans; or add scan-failure handling in driver to dampen retry cascade. +- (3) AP-side: switch APs for testing; report to AVM if reproducible. +- (4) PSM deadlock: extend c7 latch with timeout-or-progress recovery. +- (5) SDIO retune: ftrace correlation guides the lock-ordering fix. +- (6) PWR busy-event leak: audit set/clear pairs; add a warning-when-stale. + +## Out-of-scope + +- Patch C v3 closure (PR #5 merged, Phase 7 done). +- Patch C2 (`ieee80211_rx_list` batch) — gated on Task #19 kerneldoc. +- Patch D / E independent. +- Reproduction at higher rates (8 MB/s ramp) — defer to Phase 4 once mechanism identified. + +--- + +*Phase 0 plan written 2026-05-07 23:21 CEST by Claude (noether), at the close of Patch C v3 Phase 7. Rig armed; long capture in flight; probes scheduled at T+5/10/15/20/25 min. Post-capture analysis will populate Phase 3 results before Phase 4 plan branches off.*