notes: Bug #5 Phase 1 metric + Phase 0 anchor receipts

Phase 0 anchored at N=3 reps (10min @ 4MB/s pv-cap on 2.4GHz): - rep1+2: ~700 KB/s sustained (10% of link capacity) - rep3: link death at ~9 min in (passive mode, beacon-loss cascade) Hot symbol identified: _raw_spin_unlock_irqrestore at ~20% CPU in both healthy and failed reps, callstack process_one_work → wsm_configuration → wsm_cmd_send → bes2600_bh.isra.0 → spin-unlock. Phase 1 metric locked: ≥2 MB/s sustained throughput, <10% CPU in lock- cycling, no link death under 30 min continuous load. Three Phase 4 candidates drafted (B5-1: shrink wsm_cmd_send lock scope; B5-2: coalesce vif_list_lock in BH dispatcher; B5-3: SPSC ringbuffer for WSM commands). Locking pending review.
notes: phase 7 verdict — Patch A confirmed, Patch B dormant
2026-05-07 16:32:45 +02:00 · 2026-05-07 15:18:36 +02:00 · 2026-05-07 13:56:36 +02:00 · 2026-05-07 10:45:28 +00:00
3 changed files with 274 additions and 0 deletions
@@ -82,6 +82,49 @@ without board power-cycle").
 **Status**: task c3 (indirectly, via bes_chardev removal which currently
 gates the signal/nosignal mode switch path).

+## Bug #5 — RX path degrades under attempted-throughput pressure
+
+**Suspect file**: bes2600 RX path (`txrx.c bes2600_rx_cb`, `bh.c bes2600_bh_work`,
+SDIO RX scheduling) — pinpoint pending.
+
+**Symptom (observed 2026-05-07 13:43, srcversion `1B3B3ED0` = c-stack +
+Patch A + Patch B, ohm @ -57 dBm 2.4GHz ch11 5b:32, idle save for the
+netcat load):**
+
+```
+sender cap 1 MB/s  →  ohm receives 1015 KB/s,  signal -57 dBm,  RX MCS 4
+sender cap 4 MB/s  →  ohm receives  563 KB/s,  signal -67 dBm,  RX MCS 3
+                     (Send-Q on boltzmann backed up to 1.16 MB)
+```
+
+Pushing the sender-side cap from 1 MB/s to 4 MB/s **decreased** observed
+throughput at the receiver and degraded the link metrics. Signal dropped
+~10 dB and the chip downshifted MCS, suggesting the chip can't sustain
+the higher RX rate even with the link physically capable of more (link
+bitrate 65 Mb/s = ~8 MB/s theoretical).
+
+**Hypothesis (Markus, 2026-05-07): driver/firmware locks itself to death
+under busy reads** — possibly a busy-wait loop or lock contention on the
+RX SDIO path that prevents draining at line rate. Plausible reason it
+didn't surface for the c-stack tasks: those operated at typical
+browse-rate traffic, well below the saturation threshold this bug needs
+to fire.
+
+**May explain**: original Phase-0 observation that **YouTube DASH chunks
+drop ~10 frames per chunk fetch** on hardware-decoder playback. A chunk
+fetch is a brief burst at near-link-rate; if the driver throttles itself
+down during high-RX, the player buffer underruns for the duration of
+the fetch.
+
+**How to drill (when prioritized)**:
+- Capture trace_pipe with `mmc:*` and `sdio*` events enabled during a
+  controlled rate-ramp (e.g., pv -L 500K, 1M, 2M, 4M each for 60 s).
+- Watch `/proc/sys/kernel/sched_*` and the `bes2600_bh_work` kworker for
+  CPU saturation.
+- `perf top -p $(pgrep -f bes_sdio)` during 4 MB/s load.
+
+**Status**: backlog. No patch yet.
+
 ## Bug #4 — scan_complete_cb constant loop

 **File**: `scan.c:883-909` — `bes2600_scan_complete_cb()`.
@@ -0,0 +1,135 @@
+# Bug #5 campaign — Phase 1 metric + Phase 2 situation
+
+Date assembled: 2026-05-07
+Module under test (baseline): bes2600.ko srcversion `1B3B3ED096AAD7217FEDE11`
+                              (cleanups + Patch A + Patch B)
+
+Phase 0 anchor at N=3 reps (10 min each, 4 MB/s pv-cap on boltzmann → ohm
+2.4GHz 5b:32) reproduces the throughput regression and traces it to lock-
+cycling cost in the bes2600 BH path. See `notes/observed-bugs.md` Bug #5
+for the original report. This document locks Phase 1 and prepares Phase 4
+candidates.
+
+---
+
+## Phase 1 — measurable target (locked)
+
+> *Reduce the bes2600 BH path's spin-unlock-irqrestore cost so that under sustained 4 MB/s sender pressure on a healthy 2.4GHz link (signal -55 to -65 dBm), ohm sustains ≥ 2 MB/s observed RX throughput (vs 663–725 KB/s baseline at N=3) AND the link survives ≥ 30 min continuous load without cascading into beacon-loss disconnect (vs rep3's failure at ~9 min).*
+
+Three measurable outcomes, single sentence:
+
+- **(a) Throughput floor**: ≥ 2 MB/s sustained RX at ohm
+- **(b) Lock-cycle ceiling**: % CPU in `_raw_spin_unlock_irqrestore` from `bes2600_bh.isra.0` callstack drops to < 10 % (currently ~10 % rep1, ~16 % rep3)
+- **(c) Cascade prevention**: no link death under continuous 30 min @ 4 MB/s
+
+---
+
+## Phase 0 anchor — receipts
+
+### Reproduction protocol (same units as Phase 7 will use)
+
+1. boltzmann: `pv -L 4M -q < /dev/zero | nc ohm.fritz.box 12345`
+2. ohm: `sudo $RUN/rep-trace.sh 600` (10 min capture window)
+3. Rep dirs: `bug5/rep-<ts>/{mmc.log, perf.data, rx_bytes.tsv, start.txt, end.txt}`
+4. N=3 reps with 60 s cooldowns
+
+### Observed (2026-05-07 15:36–16:08)
+
+| rep | duration | avg KB/s | near-zero ticks | end state |
+|---|---|---|---|---|
+| 1 | 600 s | 725 | 1/119 | associated, MCS 6 |
+| 2 | 600 s | 663 | 5/119 | associated, MCS 4 |
+| 3 | 600 s |  75 | 53/119 | **passive (link died at sample ~82, ~9 min in)** |
+
+mmc transaction rate: rep1 = 5793/s sustained, rep3 = 6000/s for first ~10s then collapse to <100/s.
+
+### Hot-symbol receipts (perf record on `bes_sdio | bes2600_bh` kworkers)
+
+| symbol | rep 1 (healthy) | rep 3 (cascade) |
+|---|---|---|
+| `_raw_spin_unlock_irqrestore` (sum across kworker variants) | **~19 %** | **~21 %** |
+| `handle_softirqs` | 5.4 % | 4.3 % |
+| `__tasklet_schedule` | 2.4 % | 2.0 % |
+| `dw_mci_start_command` (SDIO host) | 1.5 % | < 0.6 % |
+| `bes2600_sw_retry_requeue` | 0.79 % | 0.70 % |
+
+Top callchain leading to `_raw_spin_unlock_irqrestore`:
+
+```
+process_one_work → worker_thread → wsm_configuration → wsm_cmd_send → bes2600_bh.isra.0
+```
+
+---
+
+## Phase 2 — situation analysis
+
+### Relevant source pins
+
+- `bes2600/wsm.c:98` — `wsm_cmd_send()`, the function in the hot callstack. Body:
+  - holds `&hw_priv->wsm_cmd.lock` (spin) only briefly to fill the cmd struct (lines ~145–152)
+  - calls `bes2600_bh_wakeup()` then `wait_event_interruptible_timeout` for the response
+  - outer lock: `down(&hw_priv->wsm_cmd_sema)` from callers (`wsm_cmd_lock()` at wsm.c:105)
+- `bes2600/bh.c:435,559,847` — `bes2600_bh_work()` takes `&hw_priv->vif_list_lock` 3× per pass through the dispatcher
+- `bes2600/bh.c:172,195,487,581,861,1361,1427` — multiple `wait_event*` calls; the BH thread cycles through wait/wake/dispatch
+
+### What the lock-cycling cost is buying
+
+Each WSM command from the host (mac80211, NM, kernel scan etc.) goes through the same path:
+1. caller acquires `wsm_cmd_sema` (outer)
+2. `wsm_cmd_send()` acquires `wsm_cmd.lock`, fills the struct, releases the lock
+3. `bes2600_bh_wakeup()` schedules the BH
+4. BH dispatches the command, takes `vif_list_lock` to walk vifs
+5. BH talks to chip via SDIO
+6. response arrives, BH wakes the waiter
+7. caller releases `wsm_cmd_sema`
+
+Under sustained TCP RX, mac80211 issues lots of small WSM commands (TX-scheduling hints, rate updates, etc.) — every one cycles through this path. The spin-unlock-irqrestore cost is the floor on this cycle rate.
+
+### What's been ruled out
+
+- AP-side bug (AVM Fritz!Box, reliable per Markus's testimony — see `feedback_phase7_stress_ramp.md` reasoning in the campaign-so-far prior).
+- Patch A / Patch B (target different triggers; would not address lock cost).
+- Decrypt-failure storm (Patch A handles; this regression occurs in rep1 with zero decrypt-fails).
+- mac80211 scan-fail / scan-comeback (cosmetic; doesn't account for the throughput floor).
+
+---
+
+## Phase 4 — candidate plans (preliminary, not locked)
+
+Three candidates surfaced from the perf data. Listed cheapest to most-invasive.
+
+### B5-1 — Reduce `wsm_cmd_send` lock scope
+
+The spin_lock around the cmd-struct fill (wsm.c:145–152) can probably be replaced with `WRITE_ONCE` of a single sentinel field, since the BH thread reads these fields cooperatively (BH only reads after `bh_wakeup` schedules it, and only writes back via the response path). Eliminates the per-command spin-lock cycle for the host side.
+
+**Risk**: race with BH if the protocol isn't strictly happens-before. Need to read bh.c:486-500 (where BH reads wsm_cmd.ptr) carefully.
+
+**Predicted delta**: small but measurable. Maybe 2–3 % CPU off the lock floor.
+
+### B5-2 — Coalesce `vif_list_lock` in BH dispatcher
+
+bh.c takes `vif_list_lock` 3× per dispatcher pass. If these 3 critical sections are within a single iteration, they should be merged into one acquire/release.
+
+**Risk**: vif teardown might depend on releasing the lock between iterations to allow concurrent vif removal. Needs careful audit.
+
+**Predicted delta**: significant under multi-vif (we're single-vif STA today, so probably small immediate gain).
+
+### B5-3 — Move WSM send out of process context, use ringbuffer
+
+Replace the wsm_cmd_sema + wsm_cmd struct mechanism with an SPSC ringbuffer between caller and BH. Caller writes to ring, no lock needed (one producer); BH reads from ring, no lock needed (one consumer).
+
+**Risk**: significant rework. cw1200 ancestor doesn't have this; we'd be inventing it.
+
+**Predicted delta**: large — could halve the lock cost. But cost-to-implement is also large.
+
+---
+
+## Open question for the reviewer
+
+Which Phase 4 candidate to lock? My ranking by ROI:
+
+1. B5-1 (smallest, fastest, cleanest source pin) — try first
+2. B5-2 (medium, conditional on multi-vif applicability)
+3. B5-3 (largest rework, biggest potential)
+
+Or: instrument deeper before committing to a fix (e.g., add `tracepoints around wsm_cmd_send` enter/exit to measure lock holdtime distribution, not just CPU%).
@@ -0,0 +1,96 @@
+# BES2600 WiFi-stability campaign — Phase 7 verdict (Patches A + B)
+
+Date assembled: 2026-05-07
+Module under test: bes2600.ko srcversion `1B3B3ED096AAD7217FEDE11`
+                   (cleanups + Patch A + Patch B)
+Run dir: `/root/bes2600-samples/run-20260507-1248-patchB/` on ohm
+
+Phase 7 verification window: 2026-05-07 12:48 → ~15:13 CEST (≈ 2 h 25 m)
+of which: ~50 min @ 1 MB/s pv-cap, ~1 h 30 m @ 4 MB/s pv-cap on 2.4 GHz
+newton (5b:32, signal -57 to -67 dBm).
+
+---
+
+## Result table (vs the Phase 4 predicted delta)
+
+### Patch A — decrypt-storm fast-recover (Trigger B)
+
+| metric | Phase 3 baseline | Phase 4 prediction | Phase 7-of-B observed |
+|---|---|---|---|
+| decrypt-burst rate | 8.18/h | unchanged | 2 bursts in ~22 min once 4MB/s pressure was on |
+| AP-deauth-6 rate following burst | 100 % escalation | ≤ 0.2 × baseline | **0/2 = 0 % escalation** |
+| recovery time given burst | up to 109 s | < 5 s | **~1 s** (×2) |
+
+**Verdict: predicted delta CONFIRMED at N=2.** CLAUDE.md ideal is N=3; we're directionally locked at 2 reproductions, both behaving as predicted (threshold trip → `[bes2600] decrypt-storm fast-recover: forcing reassoc` log line → mac80211 disassoc → userspace reauth in ≈1 s).
+
+#### Receipts (verbatim)
+
+```
+13:47:56  bes2600_wlan: [bes2600] decrypt-storm fast-recover: forcing reassoc
+13:47:57  wlan0: associated to cc:ce:1e:2b:74:17     (cross-BSSID, 1 s)
+13:49:26  bes2600_wlan: [bes2600] decrypt-storm fast-recover: forcing reassoc
+13:49:27  wlan0: associated to c0:25:06:e6:5b:32     (back home, 1 s)
+```
+
+`DecryptStormRecoveries: 2` exposed via debugfs at `/sys/kernel/debug/ieee80211/phy0/bes2600/vif_0/status`.
+
+### Patch B — connection-loss-storm bus_reset (Trigger A)
+
+| metric | Phase 7-of-A observed | Phase 4 prediction | Phase 7-of-B observed |
+|---|---|---|---|
+| api_connection_loss rate | 0.86/h | unchanged | 2 events in ~2 h (≈ 1/h) |
+| ConnectionLossStormRecoveries | n/a | trips on 3-in-60s bursts | **0** |
+| Threshold trip events | n/a | (when burst occurs) | **0** (events spaced 91 s apart) |
+
+**Verdict: installed but UNTRIGGERED.** The 3-in-60s threshold was never reached (max-cluster observed: 2-in-91s). No false positives, no spurious bus_resets. Predicted delta unobserved — same shape as Patch A's first Phase 7 run.
+
+The threshold may be too conservative for typical event rates (we'd need a true api_connection_loss flood to trip it). Tuning is a future Phase-1 question if more reproductions accumulate.
+
+### Trigger C — AP unprotected-deauth-6 cluster without preceding storm
+
+```
+12:53:10.475 → 12:53:11.756  AP fires 17 unprotected-deauth-6 from 5b:32 over 1.3 s
+                              (2 mgmt-TX no-ack from our chip in the middle)
+12:53:12.309  kernel: deauthenticating ... reason 2 = PREV_AUTH_NOT_VALID
+12:53:14–15  reauth via 61:b0 → 5b:32, recovery in ~4 s
+```
+
+Neither Patch A (zero decrypt-fails preceded) nor Patch B (zero api_connection_loss) fired. Background: AVM Fritz!Boxes (newton) are reliable; the AP correctly classified ohm's frames as Class 2 from non-auth, meaning **bes2600 sent something the AP couldn't authenticate**. New backlog entry: `notes/observed-bugs.md` Bug #5 (RX path under throughput pressure) is the leading hypothesis surface.
+
+Recovery was fast (4 s) so this isn't a P0 — but a Patch C investigation is warranted when prioritized.
+
+---
+
+## Bug #5 — RX path degradation under attempted-throughput pressure (NEW)
+
+```
+sender 1 MB/s  →  ohm receives 1015 KB/s,  -57 dBm,  RX MCS 4
+sender 4 MB/s  →  ohm receives  563 KB/s,  -67 dBm,  RX MCS 3
+```
+
+Higher attempted-throughput on the sender side → LOWER observed throughput at ohm. Signal degraded ~10 dB, MCS dropped a notch. Link-physical max is ~8 MB/s; we're getting ~7 % of that under load.
+
+**Hypothesis (Markus): driver/firmware locks itself to death under busy reads.** Plausibly the same root-cause as the Phase 0 YouTube DASH chunk-fetch drops (~10 frames per chunk fetch on hardware-decoder playback). Documented as Bug #5 in `notes/observed-bugs.md`.
+
+---
+
+## Lessons captured for memory (Phase 8 anchor)
+
+1. **Stress-rate matters for verification.** Patch A's predicted delta only became observable when the netcat cap went 1 → 4 MB/s. The previous Phase 7 (10h30m @ 1 MB/s) saw zero decrypt-storms. Future Phase 7 protocols should plan a stress ramp from steady to near-saturation, not just the steady setting.
+2. **"Untriggered, no harm" is a valid Phase 7 verdict** for installed patches. Patch B fits this exactly. The patch is ready; the trigger pattern just doesn't fire often enough in this RF / load regime to verify the recovery delta. Don't let unobserved verifications block the loop.
+3. **Build infrastructure on `cleanups` not `mobian`.** The Phase 6 attempt to base Patch B on mobian forced a refactor mid-flight; the c-stack lives on cleanups, and re-using c5.2's `bes2600_chrdev_do_bus_reset` requires that. The cleanups branch is the campaign's working trunk.
+4. **AP-side bug is unlikely on AVM hardware.** AVM Fritz!Boxes don't fire spurious deauth-6 storms. When ohm sees AP-deauth-6 unprovoked, the suspect chain is bes2600 sending something the AP can't authenticate. The bias toward "bes2600 is the broken thing" is empirically validated.
+5. **AP-deauth-6 can fire without our local triggers.** Trigger C is a real failure mode neither Patch A nor B addresses. Adding a Phase-1-style metric for "AP-deauth-6 rate without preceding decrypt-storm or api_connection_loss" would surface Trigger C cleanly.
+6. **`pv -L` cap interacts with TCP retransmit recovery.** When the link can't sustain the cap, TCP backs off and pv blocks. Observed throughput is then a **floor on chip RX capacity at that signal level**, not the sender's intent. Useful for chip-load-characterization, but the cap should be set based on observed pull-rate, not on the link's nominal MCS rate.
+
+---
+
+## Loop status
+
+- Phase 7: closed.
+- Patch A: confirmed (N=2). Stays in.
+- Patch B: installed, dormant in this regime, no harm. Stays in.
+- Bug #5: backlog, no patch yet. Documented.
+- Trigger C: backlog candidate, no patch yet. Documented.
+
+Next campaign cycle would be re-anchoring Phase 0 around Bug #5 or Trigger C.
Author	SHA1	Message	Date
claude-noether	1830c17891	notes: Bug #5 Phase 1 metric + Phase 0 anchor receipts Phase 0 anchored at N=3 reps (10min @ 4MB/s pv-cap on 2.4GHz): - rep1+2: ~700 KB/s sustained (10% of link capacity) - rep3: link death at ~9 min in (passive mode, beacon-loss cascade) Hot symbol identified: _raw_spin_unlock_irqrestore at ~20% CPU in both healthy and failed reps, callstack process_one_work → wsm_configuration → wsm_cmd_send → bes2600_bh.isra.0 → spin-unlock. Phase 1 metric locked: ≥2 MB/s sustained throughput, <10% CPU in lock- cycling, no link death under 30 min continuous load. Three Phase 4 candidates drafted (B5-1: shrink wsm_cmd_send lock scope; B5-2: coalesce vif_list_lock in BH dispatcher; B5-3: SPSC ringbuffer for WSM commands). Locking pending review.	2026-05-07 16:32:45 +02:00
claude-noether	69a1d0f8b1	notes: phase 7 verdict — Patch A confirmed, Patch B dormant Phase 7 verification of cleanups + Patch A + Patch B (srcversion 1B3B3ED0) on ohm 2026-05-07 12:48 → 15:13 CEST under netcat load ramped 1 MB/s → 4 MB/s on 2.4GHz newton. Patch A: predicted delta CONFIRMED at N=2 reproductions. - 13:47:56 storm → 1 s reassoc, no AP-deauth-6 escalation - 13:49:26 storm → 1 s reassoc, no AP-deauth-6 escalation Patch B: installed, untriggered. 2 api_connection_loss events spaced 91 s apart, never tripping the 3-in-60s threshold. No false positives, no spurious bus_resets. Recovery delta unobserved (no harm done). Trigger C: 17-frame AP-deauth-6 cluster at 12:53 with no patch hooks firing — bes2600 TX-side glitch suspect. Recovery via mac80211 reauth in ~4 s. New backlog item. Bug #5 documented separately (RX path degrades under throughput pressure; possible root of the original Phase-0 YouTube frame drops).	2026-05-07 15:18:36 +02:00
claude-noether	458ad36f8b	notes: backlog Bug #5 — RX path degrades under throughput pressure Observed 2026-05-07: bumping the netcat sender from 1 MB/s to 4 MB/s DECREASED ohm's observed RX rate (1015 KB/s → 563 KB/s) and degraded the link (signal -57 → -67 dBm, MCS 4 → 3). Chip can't sustain near- link-rate RX even though theoretical capacity is ~8 MB/s. Hypothesis: driver/firmware lock contention or busy-wait on the RX SDIO path. Plausibly explains the original Phase-0 observation that YouTube DASH chunks drop ~10 frames per chunk fetch — chunk fetch is a brief near-line-rate burst that this bug would be triggered by.	2026-05-07 13:56:36 +02:00
marfrit	ea509e810f	Merge pull request 'Phase 4 plan: Patch B (Trigger A / api_connection_loss)' (#5 ) from claude-noether-3 into main Reviewed-on: #5	2026-05-07 10:45:28 +00:00