From 6bae5319178837745b090097cffe6434024035aa Mon Sep 17 00:00:00 2001
From: "Claude (noether)" <claude@reauktion.de>
Date: Thu, 7 May 2026 23:23:31 +0200
Subject: [PATCH] =?UTF-8?q?notes:=20Bug=20#5=20RX-degradation=20campaign?=
 =?UTF-8?q?=20=E2=80=94=20Phase=200=20plan=20+=20research=20question?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

After Patch C v3 closed (PR #5 merged, Phase 7 N=3 verified at +73%
throughput vs Patch B baseline), the post-13-min RX-degradation
pattern remains.  Reproduces on Patch B, F, and v3 alike — independent
of the relay/race issues v3 addressed.  Side-effect that was masked
by the throughput floor while v2's race was the dominant variable.

Research question (locked):

  Why does the bes2600 RX path collapse from ~2 MB/s sustained @
  fresh-chip uptime to ~180 B/s @ ~28-min uptime, with periodic
  wsm_generic_confirm failed for request 0x0007 + ieee80211 phy0:
  [SCAN] Scan failed (-22) every 300 s in the intervening window?

Phase 0 protocol:

  - long-capture rig armed on ohm at uptime 0 (fresh boot 23:13 CEST)
  - ftrace events: workqueue, mac80211, cfg80211, mmc, sdhci, power
  - iw event (cfg80211 reason codes), dmesg follow, per-30s netdev
    counter snap, 5 stress probes at T+5/10/15/20/25 min

Phase 0 will:

  - re-anchor the predecessor data via the long capture (in-session
    N=1; re-run if anomalous)
  - characterize state transitions (first scan-fail, first throughput
    drop) via cfg80211/mac80211 ftrace + iw event correlation
  - feed Phase 1 metric formulation

Mechanism candidates (Phase 4 will discriminate):

  1. Firmware-side resource exhaustion (per-scan accumulator)
  2. NetworkManager scan-fail recovery loop competing with data
  3. AP-side rate limiting / fairness probation
  4. PSM state machine deadlock (c7 latch stale)
  5. SDIO bus retune interaction
  6. Power-management busy-event accumulator leak

Out of scope: Patch C2/D/E, higher-rate ramp, reproducing on different
APs.  Independent campaign from Patch C closure.
---
 .../bug5-rx-degradation-phase0-2026-05-07.md  | 108 ++++++++++++++++++
 1 file changed, 108 insertions(+)
 create mode 100644 notes/bug5-rx-degradation-phase0-2026-05-07.md

diff --git a/notes/bug5-rx-degradation-phase0-2026-05-07.md b/notes/bug5-rx-degradation-phase0-2026-05-07.md
new file mode 100644
index 000000000..b7cb652a6
--- /dev/null
+++ b/notes/bug5-rx-degradation-phase0-2026-05-07.md
@@ -0,0 +1,108 @@
+# Bug #5 RX-degradation campaign — Phase 0
+
+**Date:** 2026-05-07
+**Module under test:** v3 + F (`bes2600.ko` srcversion `371C6606B73AF19299228CA`)
+**Hardware:** ohm (PineTab2, RK3566 + BES2600 SDIO), wired enu1 fallback path live.
+
+---
+
+## Research question (locked)
+
+> **Why does the bes2600 RX path collapse from ~2 MB/s sustained @ fresh-chip uptime to ~180 B/s @ ~28-min uptime, with periodic `wsm_generic_confirm failed for request 0x0007` + `ieee80211 phy0: [SCAN] Scan failed (-22)` every 300 s in the intervening window?**
+
+Reproduces on Patch B, Patch F, and Patch C v3 alike — independent of the relay/race issues v3 addressed. Side-effect that was masked by the throughput floor while v2's race was the dominant variable.
+
+## Predecessor data (reference, not anchor)
+
+| source | observation |
+|---|---|
+| Patch C v3 N=3 (uptime 200/391/582 s) | mean 2.352 MB/s @ 4 MB/s sender |
+| v3 single rep at uptime ~28 min (rep 2 of 2026-05-07 22:23) | 180 KB / 5 min = 600 B/s, sender saw "Connection reset by peer" |
+| v3 single rep at uptime ~47 min (N=3 first attempt 22:42) | 55 KB / 5 min = 180 B/s, sender timed out (exit 124) |
+| dmesg pattern observed at 47-min uptime | scan failures every 301-302 s starting at uptime 778 s (~13 min) |
+
+The shape: **fresh chip → linear data flow at ~2 MB/s sustained → sometime around 13 min uptime, NetworkManager-triggered scans start failing → sometime around 28 min uptime, data throughput collapses to <1 KB/s while link still shows associated.**
+
+Predecessor data is reference. Phase 0 will re-anchor at N=1 long-trace + 5 in-window stress probes; if the pattern doesn't reproduce, that's the campaign result.
+
+## Mechanism candidates (Phase 4 will discriminate)
+
+1. **Firmware-side resource exhaustion.** Per-scan or per-WSM-event accumulation in chip-side state. Scan-failed -22 (EINVAL) suggests firmware refusing the request — possibly out of scan handles, scan-buffer slots, or some other limit.
+2. **NetworkManager scan-fail recovery loop.** Each failed scan triggers NM retry. If retry overhead dominates the bh thread, data path starves. Verifiable by suppressing NM scans.
+3. **AP-side rate limiting.** Newton (AVM) AP could be applying QoS / fairness / probation after sustained 4 MB/s burst. Verifiable by Fritz!Box log access (Markus has it) or by switching to a different AP.
+4. **PSM state machine deadlock.** c7's `pm_unsupported` self-detect was supposed to handle this, but the latch state could become stale if a real PM_IND arrives mid-operation. Verifiable by `chip_pm_state` debugfs read at degradation onset.
+5. **SDIO bus clock degradation / mmc retune.** SDIO retune with `retune_protected` flag interacts with bes2600's data path. Verifiable by ftrace `mmc/mmc_request_*` event correlation with throughput drop.
+6. **Power-management busy-event accumulation.** `bes2600_pwr_set_busy_event` counters might leak — busy events not cleared lock the chip awake (no PSM) but also exhaust event capacity. Verifiable by `bes2600_pwr_busy_event_record` dump.
+
+## Phase 0 measurement protocol (rig armed 2026-05-07 23:18:58 CEST, T0=1778188738)
+
+Capturing for 35 minutes from fresh boot. All capture lives in `/root/bes2600-samples/run-20260507-bug5-degradation-rig/` on ohm.
+
+### Always-on streams
+
+| stream | tool | output |
+|---|---|---|
+| ftrace events | per-event `enable=1` | `trace.log` (via `trace_pipe`) |
+| cfg80211 events | `iw event -t -f` | `iw-event.log` |
+| kernel printks | `dmesg -wT` | `dmesg.log` |
+| netdev counters | per-30s shell loop | `snap.log` |
+
+### ftrace event set
+
+- `workqueue/workqueue_execute_start` — work dispatches
+- `workqueue/workqueue_queue_work` — work submissions
+- `mac80211/api_beacon_loss` — driver beacon-loss events
+- `mac80211/api_connection_loss` — driver-side conn-loss
+- `mac80211/api_disconnect` — driver-side disconnect
+- `mac80211/drv_hw_scan` — mac80211 → driver scan dispatch
+- `mac80211/drv_set_key` — key state changes
+- `cfg80211/rdev_assoc` — assoc requests
+- `cfg80211/rdev_deauth` — deauth requests
+- `cfg80211/rdev_disassoc` — disassoc requests
+- `cfg80211/cfg80211_assoc_comeback` — AP-side assoc-busy throttling
+- `cfg80211/cfg80211_send_auth_timeout` — auth timeouts
+- `cfg80211/cfg80211_scan_done` — scan completions
+- `power/suspend_resume` — PM transitions
+- `mmc/mmc_request_start` / `mmc_request_done` — bus-level transactions
+
+### Scheduled stress probes
+
+Sender on boltzmann (`/tmp/bug5-probe-loop.sh`) fires `pv -L 4m | nc ohm 12345` for 30 s at T+5/10/15/20/25 min. Each probe brackets uptime, RX-bytes pre, RX-bytes post, elapsed. Throughput-vs-uptime curve falls out of the snap.log + probe boundaries.
+
+Probe markers logged via `logger -t bes2600-bug5 PROBE_N_START/END` so they appear in dmesg.log timeline.
+
+## Anti-theatre receipts (must tick before claiming Phase 0 done)
+
+- [ ] In-session baseline: long-capture across degradation window, N=1 for now; re-run if anomalous
+- [ ] ftrace events actually firing (verify by tail of trace.log mid-capture)
+- [ ] dmesg captures the scan-failure pattern timestamp (expected ~uptime 778 s)
+- [ ] Probes actually transferred data at fresh chip (T+5 should be > 1 MB/s)
+- [ ] At least one probe in-window after scan-failure onset (expected: T+15 or T+20)
+- [ ] Snap.log shows monotonic counter behaviour (no rx_bytes going backwards)
+
+## Phase 1 hypothesis (provisional, refine after Phase 3 data)
+
+Metric candidate: **probe throughput as function of uptime, with state-transition markers (first `wsm_generic_confirm 0x0007 failed`, first `[SCAN] Scan failed (-22)`, first NetworkManager-deauth-and-reassociate)**.
+
+Discriminator question: does throughput collapse abruptly at the first scan failure, or gradually over a window? Abrupt = single-event causation; gradual = accumulator.
+
+## Phase 4 candidates (post-Phase-3)
+
+Depending on which mechanism (1-6) Phase 3 surfaces:
+- (1) firmware resource exhaustion: report to upstream; possibly disable NetworkManager scans pending firmware fix.
+- (2) NM scan-fail loop: configure `wpa_supplicant` to skip scans; or add scan-failure handling in driver to dampen retry cascade.
+- (3) AP-side: switch APs for testing; report to AVM if reproducible.
+- (4) PSM deadlock: extend c7 latch with timeout-or-progress recovery.
+- (5) SDIO retune: ftrace correlation guides the lock-ordering fix.
+- (6) PWR busy-event leak: audit set/clear pairs; add a warning-when-stale.
+
+## Out-of-scope
+
+- Patch C v3 closure (PR #5 merged, Phase 7 done).
+- Patch C2 (`ieee80211_rx_list` batch) — gated on Task #19 kerneldoc.
+- Patch D / E independent.
+- Reproduction at higher rates (8 MB/s ramp) — defer to Phase 4 once mechanism identified.
+
+---
+
+*Phase 0 plan written 2026-05-07 23:21 CEST by Claude (noether), at the close of Patch C v3 Phase 7.  Rig armed; long capture in flight; probes scheduled at T+5/10/15/20/25 min.  Post-capture analysis will populate Phase 3 results before Phase 4 plan branches off.*