Compare commits
1 Commits
| Author | SHA1 | Date | |
|---|---|---|---|
| 6bae531917 |
@@ -0,0 +1,108 @@
|
|||||||
|
# Bug #5 RX-degradation campaign — Phase 0
|
||||||
|
|
||||||
|
**Date:** 2026-05-07
|
||||||
|
**Module under test:** v3 + F (`bes2600.ko` srcversion `371C6606B73AF19299228CA`)
|
||||||
|
**Hardware:** ohm (PineTab2, RK3566 + BES2600 SDIO), wired enu1 fallback path live.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Research question (locked)
|
||||||
|
|
||||||
|
> **Why does the bes2600 RX path collapse from ~2 MB/s sustained @ fresh-chip uptime to ~180 B/s @ ~28-min uptime, with periodic `wsm_generic_confirm failed for request 0x0007` + `ieee80211 phy0: [SCAN] Scan failed (-22)` every 300 s in the intervening window?**
|
||||||
|
|
||||||
|
Reproduces on Patch B, Patch F, and Patch C v3 alike — independent of the relay/race issues v3 addressed. Side-effect that was masked by the throughput floor while v2's race was the dominant variable.
|
||||||
|
|
||||||
|
## Predecessor data (reference, not anchor)
|
||||||
|
|
||||||
|
| source | observation |
|
||||||
|
|---|---|
|
||||||
|
| Patch C v3 N=3 (uptime 200/391/582 s) | mean 2.352 MB/s @ 4 MB/s sender |
|
||||||
|
| v3 single rep at uptime ~28 min (rep 2 of 2026-05-07 22:23) | 180 KB / 5 min = 600 B/s, sender saw "Connection reset by peer" |
|
||||||
|
| v3 single rep at uptime ~47 min (N=3 first attempt 22:42) | 55 KB / 5 min = 180 B/s, sender timed out (exit 124) |
|
||||||
|
| dmesg pattern observed at 47-min uptime | scan failures every 301-302 s starting at uptime 778 s (~13 min) |
|
||||||
|
|
||||||
|
The shape: **fresh chip → linear data flow at ~2 MB/s sustained → sometime around 13 min uptime, NetworkManager-triggered scans start failing → sometime around 28 min uptime, data throughput collapses to <1 KB/s while link still shows associated.**
|
||||||
|
|
||||||
|
Predecessor data is reference. Phase 0 will re-anchor at N=1 long-trace + 5 in-window stress probes; if the pattern doesn't reproduce, that's the campaign result.
|
||||||
|
|
||||||
|
## Mechanism candidates (Phase 4 will discriminate)
|
||||||
|
|
||||||
|
1. **Firmware-side resource exhaustion.** Per-scan or per-WSM-event accumulation in chip-side state. Scan-failed -22 (EINVAL) suggests firmware refusing the request — possibly out of scan handles, scan-buffer slots, or some other limit.
|
||||||
|
2. **NetworkManager scan-fail recovery loop.** Each failed scan triggers NM retry. If retry overhead dominates the bh thread, data path starves. Verifiable by suppressing NM scans.
|
||||||
|
3. **AP-side rate limiting.** Newton (AVM) AP could be applying QoS / fairness / probation after sustained 4 MB/s burst. Verifiable by Fritz!Box log access (Markus has it) or by switching to a different AP.
|
||||||
|
4. **PSM state machine deadlock.** c7's `pm_unsupported` self-detect was supposed to handle this, but the latch state could become stale if a real PM_IND arrives mid-operation. Verifiable by `chip_pm_state` debugfs read at degradation onset.
|
||||||
|
5. **SDIO bus clock degradation / mmc retune.** SDIO retune with `retune_protected` flag interacts with bes2600's data path. Verifiable by ftrace `mmc/mmc_request_*` event correlation with throughput drop.
|
||||||
|
6. **Power-management busy-event accumulation.** `bes2600_pwr_set_busy_event` counters might leak — busy events not cleared lock the chip awake (no PSM) but also exhaust event capacity. Verifiable by `bes2600_pwr_busy_event_record` dump.
|
||||||
|
|
||||||
|
## Phase 0 measurement protocol (rig armed 2026-05-07 23:18:58 CEST, T0=1778188738)
|
||||||
|
|
||||||
|
Capturing for 35 minutes from fresh boot. All capture lives in `/root/bes2600-samples/run-20260507-bug5-degradation-rig/` on ohm.
|
||||||
|
|
||||||
|
### Always-on streams
|
||||||
|
|
||||||
|
| stream | tool | output |
|
||||||
|
|---|---|---|
|
||||||
|
| ftrace events | per-event `enable=1` | `trace.log` (via `trace_pipe`) |
|
||||||
|
| cfg80211 events | `iw event -t -f` | `iw-event.log` |
|
||||||
|
| kernel printks | `dmesg -wT` | `dmesg.log` |
|
||||||
|
| netdev counters | per-30s shell loop | `snap.log` |
|
||||||
|
|
||||||
|
### ftrace event set
|
||||||
|
|
||||||
|
- `workqueue/workqueue_execute_start` — work dispatches
|
||||||
|
- `workqueue/workqueue_queue_work` — work submissions
|
||||||
|
- `mac80211/api_beacon_loss` — driver beacon-loss events
|
||||||
|
- `mac80211/api_connection_loss` — driver-side conn-loss
|
||||||
|
- `mac80211/api_disconnect` — driver-side disconnect
|
||||||
|
- `mac80211/drv_hw_scan` — mac80211 → driver scan dispatch
|
||||||
|
- `mac80211/drv_set_key` — key state changes
|
||||||
|
- `cfg80211/rdev_assoc` — assoc requests
|
||||||
|
- `cfg80211/rdev_deauth` — deauth requests
|
||||||
|
- `cfg80211/rdev_disassoc` — disassoc requests
|
||||||
|
- `cfg80211/cfg80211_assoc_comeback` — AP-side assoc-busy throttling
|
||||||
|
- `cfg80211/cfg80211_send_auth_timeout` — auth timeouts
|
||||||
|
- `cfg80211/cfg80211_scan_done` — scan completions
|
||||||
|
- `power/suspend_resume` — PM transitions
|
||||||
|
- `mmc/mmc_request_start` / `mmc_request_done` — bus-level transactions
|
||||||
|
|
||||||
|
### Scheduled stress probes
|
||||||
|
|
||||||
|
Sender on boltzmann (`/tmp/bug5-probe-loop.sh`) fires `pv -L 4m | nc ohm 12345` for 30 s at T+5/10/15/20/25 min. Each probe brackets uptime, RX-bytes pre, RX-bytes post, elapsed. Throughput-vs-uptime curve falls out of the snap.log + probe boundaries.
|
||||||
|
|
||||||
|
Probe markers logged via `logger -t bes2600-bug5 PROBE_N_START/END` so they appear in dmesg.log timeline.
|
||||||
|
|
||||||
|
## Anti-theatre receipts (must tick before claiming Phase 0 done)
|
||||||
|
|
||||||
|
- [ ] In-session baseline: long-capture across degradation window, N=1 for now; re-run if anomalous
|
||||||
|
- [ ] ftrace events actually firing (verify by tail of trace.log mid-capture)
|
||||||
|
- [ ] dmesg captures the scan-failure pattern timestamp (expected ~uptime 778 s)
|
||||||
|
- [ ] Probes actually transferred data at fresh chip (T+5 should be > 1 MB/s)
|
||||||
|
- [ ] At least one probe in-window after scan-failure onset (expected: T+15 or T+20)
|
||||||
|
- [ ] Snap.log shows monotonic counter behaviour (no rx_bytes going backwards)
|
||||||
|
|
||||||
|
## Phase 1 hypothesis (provisional, refine after Phase 3 data)
|
||||||
|
|
||||||
|
Metric candidate: **probe throughput as function of uptime, with state-transition markers (first `wsm_generic_confirm 0x0007 failed`, first `[SCAN] Scan failed (-22)`, first NetworkManager-deauth-and-reassociate)**.
|
||||||
|
|
||||||
|
Discriminator question: does throughput collapse abruptly at the first scan failure, or gradually over a window? Abrupt = single-event causation; gradual = accumulator.
|
||||||
|
|
||||||
|
## Phase 4 candidates (post-Phase-3)
|
||||||
|
|
||||||
|
Depending on which mechanism (1-6) Phase 3 surfaces:
|
||||||
|
- (1) firmware resource exhaustion: report to upstream; possibly disable NetworkManager scans pending firmware fix.
|
||||||
|
- (2) NM scan-fail loop: configure `wpa_supplicant` to skip scans; or add scan-failure handling in driver to dampen retry cascade.
|
||||||
|
- (3) AP-side: switch APs for testing; report to AVM if reproducible.
|
||||||
|
- (4) PSM deadlock: extend c7 latch with timeout-or-progress recovery.
|
||||||
|
- (5) SDIO retune: ftrace correlation guides the lock-ordering fix.
|
||||||
|
- (6) PWR busy-event leak: audit set/clear pairs; add a warning-when-stale.
|
||||||
|
|
||||||
|
## Out-of-scope
|
||||||
|
|
||||||
|
- Patch C v3 closure (PR #5 merged, Phase 7 done).
|
||||||
|
- Patch C2 (`ieee80211_rx_list` batch) — gated on Task #19 kerneldoc.
|
||||||
|
- Patch D / E independent.
|
||||||
|
- Reproduction at higher rates (8 MB/s ramp) — defer to Phase 4 once mechanism identified.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
*Phase 0 plan written 2026-05-07 23:21 CEST by Claude (noether), at the close of Patch C v3 Phase 7. Rig armed; long capture in flight; probes scheduled at T+5/10/15/20/25 min. Post-capture analysis will populate Phase 3 results before Phase 4 plan branches off.*
|
||||||
@@ -1,94 +0,0 @@
|
|||||||
# Patch C v3 Phase 7 — N=3 verification results
|
|
||||||
|
|
||||||
**Date:** 2026-05-07
|
|
||||||
**Module:** `bes2600.ko` srcversion `371C6606B73AF19299228CA` (cleanups+F+v3)
|
|
||||||
**Rig:** ohm (PineTab2, RK3566 + BES2600 SDIO), wired enu1 path for telemetry
|
|
||||||
**Stress:** netcat sender from boltzmann, `pv -L 4m` rate cap (4 MB/s), 3-min window per rep
|
|
||||||
**Boot:** fresh — uptime 200 s / 391 s / 582 s at rep 1/2/3 starts (all within fresh-chip window before the ~13-min Bug #5 RX-degradation point)
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Results table
|
|
||||||
|
|
||||||
| rep | elapsed (s) | RX bytes | RX MB | MB/s | sdio_rx_work | sdio_tx_work | bes2600_bh_work redispatches |
|
|
||||||
|---:|---:|---:|---:|---:|---:|---:|---:|
|
|
||||||
| 1 | 180.72 | 447,758,333 | 427.0 | **2.363** | 0 | 368 | 0 |
|
|
||||||
| 2 | 180.67 | 490,669,836 | 467.9 | **2.590** | 0 | 20 | 0 |
|
|
||||||
| 3 | 180.69 | 398,224,992 | 379.8 | **2.102** | 0 | 39 | 0 |
|
|
||||||
|
|
||||||
**N=3 stats:** mean 2.352 MB/s · median 2.363 MB/s · min 2.102 MB/s · max 2.590 MB/s
|
|
||||||
|
|
||||||
## Comparison to baselines
|
|
||||||
|
|
||||||
### vs Patch B baseline (`run-20260507-patchC-preflight`, N=1, 5 min @ 4 MB/s, fresh chip)
|
|
||||||
|
|
||||||
| | Patch B | v3 mean | Δ |
|
|
||||||
|---|---:|---:|---:|
|
|
||||||
| throughput | 1.362 MB/s | 2.352 MB/s | **+73%** |
|
|
||||||
|
|
||||||
### vs original Bug #5 baseline (`run-20260506-0659-fresh`, N=3, decay over time)
|
|
||||||
|
|
||||||
Bug #5 anchor was 725 / 663 / **75** KB/s — rep 3 saw link-death at ~9 min.
|
|
||||||
|
|
||||||
| | Bug #5 floor (rep 3) | v3 floor (rep 3) | Δ |
|
|
||||||
|---|---:|---:|---:|
|
|
||||||
| throughput | 0.075 MB/s | 2.102 MB/s | **28× improvement** |
|
|
||||||
|
|
||||||
### vs Phase 4 v3 plan §4.5 predictions
|
|
||||||
|
|
||||||
| metric | predicted | observed | verdict |
|
|
||||||
|---|---|---|---|
|
|
||||||
| sdio_rx_work dispatch rate | → 0/s (high confidence) | 0/s all 3 reps | ✅ |
|
|
||||||
| `bes2600_bh_work` redispatches | → 0 (high confidence) | 0 all 3 reps | ✅ |
|
|
||||||
| observed RX @ 4 MB/s | floor lifts toward ≥ 1 MB/s sustained (medium) | 2.10 MB/s floor | ✅ exceeds prediction |
|
|
||||||
| `_raw_spin_unlock_irqrestore` CPU% | 20% → 12-15% (medium) | not measured | deferred — perf-record run can confirm |
|
|
||||||
|
|
||||||
## Workqueue dispatch rate collapse
|
|
||||||
|
|
||||||
Patch B baseline (per `run-20260507-patchC-preflight`):
|
|
||||||
- sdio_rx_work: 86.4/s
|
|
||||||
- sdio_tx_work: 276.1/s
|
|
||||||
- bes2600_bh_work redispatches: 0
|
|
||||||
|
|
||||||
v3 N=3 mean:
|
|
||||||
- **sdio_rx_work: 0.0/s** (function deleted)
|
|
||||||
- **sdio_tx_work: 0.8/s** (post-tx queue_work → self->irq_handler call; the chip-side TX driver no longer needs to wake a separate workqueue)
|
|
||||||
- bes2600_bh_work redispatches: 0 (preserved invariant; bh thread still single long-lived work item)
|
|
||||||
|
|
||||||
The 99.7% reduction in `sdio_tx_work` dispatch rate is a side-effect of v3's IRQ→bh-direct rewiring: the post-TX `queue_work(self->sdio_wq, &self->rx_work)` call I replaced with `self->irq_handler()` was actually firing more often than I'd assumed (276/s on Patch B). Folding it into the bh wake-up cuts 275/s of workqueue dispatches that weren't doing anything useful.
|
|
||||||
|
|
||||||
## Risks observed
|
|
||||||
|
|
||||||
- **Bug #5 RX-degradation after ~13-min uptime is independent of v3.** Same scan-failure pattern observed (`wsm_generic_confirm failed for request 0x0007` + `[SCAN] Scan failed (-22)` every 300s) on v3 as on Patch B. v3 did NOT fix Bug #5; it fixed the v2-race that was ALSO present. RX-degradation is firmware-side, likely needs a separate campaign.
|
|
||||||
- **N=3 reps were 3 minutes each instead of 5** to fit within the fresh-chip window. Direct comparison with Patch B's 5-min baseline is approximate; chip-side throughput in 3-min vs 5-min should be similar given the bug fires on uptime, not on transferred-bytes.
|
|
||||||
- **No regression observed in 3×3 min = 9 min of stress.** The v2 race that wedged Patch C v1 within 13 s did NOT reproduce. v3's structural fix held.
|
|
||||||
|
|
||||||
## Phase 8 — lesson distilled
|
|
||||||
|
|
||||||
**The cw1200 mining was decisive.** Patch C v2 (atomic_t prep + direct-deliver on top of relay, PR #10 closed) would have worked correctly but kept the structural relay that was the source of the race. v3 removed the relay entirely — restoring single-writer-from-bh invariant by construction, no atomic_t needed, and delivering a 73% throughput improvement as side benefit.
|
|
||||||
|
|
||||||
Without the cw1200 history mine (`~/src/linux-rockchip`, 228 cw1200 commits over 16 years), v2's atomic_t prep would have shipped. The structural fix is upstream-grade because it matches the reference driver. v2's atomic_t wrapper would have been bes2600-specific bookkeeping with no upstream parallel — defensible as a fix, but worse to maintain.
|
|
||||||
|
|
||||||
**Memory entry:** *When you have an upstream-ancestral driver still in the kernel tree, mine its bug-fix history before patching the inherited fork. The architectural answer may already be there; you just have to look.*
|
|
||||||
|
|
||||||
## Receipts checklist (Phase 7 done)
|
|
||||||
|
|
||||||
- [x] N=3 reps captured at fresh-chip uptime (200/391/582 s)
|
|
||||||
- [x] Same instrumentation pre/post (workqueue ftrace + rx_packets/rx_bytes counters)
|
|
||||||
- [x] Predicted delta matched (sdio_rx_work → 0; bh redispatches → 0; throughput ≥ 1 MB/s sustained)
|
|
||||||
- [x] No WARN/BUG/oops during stress on any rep
|
|
||||||
- [x] Wired-rig telemetry collection (would have caught a wedge if v3 had one)
|
|
||||||
- [x] Receiver `nc` listener restarted fresh per rep (avoiding rep-2-style TCP race)
|
|
||||||
- [x] Stress-ramp memory honored: not steady-state low-rate; saw 4 MB/s saturate
|
|
||||||
|
|
||||||
## Out-of-scope follow-ups
|
|
||||||
|
|
||||||
- Patch C2 — `ieee80211_rx_list` batch delivery — gated on Task #19 kerneldoc verification.
|
|
||||||
- Patch D — ba_lock atomicization — independent.
|
|
||||||
- Patch E — ps_state_lock skip when pm_unsupported — independent.
|
|
||||||
- Bug #5 RX-degradation after 13-min uptime — separate campaign, scan-failure pattern is the entry point.
|
|
||||||
- Task #24 — observe whether `bh.c` `asm volatile("nop")` / commented-out `__bes2600_irq_enable(1)` / BUG_ON in hot path are still load-bearing post-v3. Already partially answered: `__bes2600_irq_enable` is a stub (PR #11 comment). The other artifacts can be re-read fresh.
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
*Phase 7 results captured 2026-05-07 by Claude (noether). v3 (PR #5) closes Patch C campaign with structural improvement + race fix + measurable throughput win.*
|
|
||||||
Reference in New Issue
Block a user