diff --git a/notes/phase7-2026-05-07.md b/notes/phase7-2026-05-07.md
new file mode 100644
index 000000000..c6c829045
--- /dev/null
+++ b/notes/phase7-2026-05-07.md
@@ -0,0 +1,96 @@
+# BES2600 WiFi-stability campaign — Phase 7 verdict (Patches A + B)
+
+Date assembled: 2026-05-07
+Module under test: bes2600.ko srcversion `1B3B3ED096AAD7217FEDE11`
+  (cleanups + Patch A + Patch B)
+Run dir: `/root/bes2600-samples/run-20260507-1248-patchB/` on ohm
+
+Phase 7 verification window: 2026-05-07 12:48 → ~15:13 CEST (≈ 2 h 25 m)
+of which: ~50 min @ 1 MB/s pv-cap, ~1 h 30 m @ 4 MB/s pv-cap on 2.4 GHz
+newton (5b:32, signal -57 to -67 dBm).
+
+---
+
+## Result table (vs the Phase 4 predicted delta)
+
+### Patch A — decrypt-storm fast-recover (Trigger B)
+
+| metric | Phase 3 baseline | Phase 4 prediction | Phase 7-of-B observed |
+|---|---|---|---|
+| decrypt-burst rate | 8.18/h | unchanged | 2 bursts in ~22 min once 4MB/s pressure was on |
+| AP-deauth-6 rate following burst | 100 % escalation | ≤ 0.2 × baseline | **0/2 = 0 % escalation** |
+| recovery time given burst | up to 109 s | < 5 s | **~1 s** (×2) |
+
+**Verdict: predicted delta CONFIRMED at N=2.** CLAUDE.md ideal is N=3; we're directionally locked at 2 reproductions, both behaving as predicted (threshold trip → `[bes2600] decrypt-storm fast-recover: forcing reassoc` log line → mac80211 disassoc → userspace reauth in ≈1 s).
+
+#### Receipts (verbatim)
+
+```
+13:47:56 bes2600_wlan: [bes2600] decrypt-storm fast-recover: forcing reassoc
+13:47:57 wlan0: associated to cc:ce:1e:2b:74:17 (cross-BSSID, 1 s)
+13:49:26 bes2600_wlan: [bes2600] decrypt-storm fast-recover: forcing reassoc
+13:49:27 wlan0: associated to c0:25:06:e6:5b:32 (back home, 1 s)
+```
+
+`DecryptStormRecoveries: 2` exposed via debugfs at `/sys/kernel/debug/ieee80211/phy0/bes2600/vif_0/status`.
+
+### Patch B — connection-loss-storm bus_reset (Trigger A)
+
+| metric | Phase 7-of-A observed | Phase 4 prediction | Phase 7-of-B observed |
+|---|---|---|---|
+| api_connection_loss rate | 0.86/h | unchanged | 2 events in ~2 h (≈ 1/h) |
+| ConnectionLossStormRecoveries | n/a | trips on 3-in-60s bursts | **0** |
+| Threshold trip events | n/a | (when burst occurs) | **0** (events spaced 91 s apart) |
+
+**Verdict: installed but UNTRIGGERED.** The 3-in-60s threshold was never reached (max-cluster observed: 2-in-91s). No false positives, no spurious bus_resets. Predicted delta unobserved — same shape as Patch A's first Phase 7 run.
+
+The threshold may be too conservative for typical event rates (we'd need a true api_connection_loss flood to trip it). Tuning is a future Phase-1 question if more reproductions accumulate.
+
+### Trigger C — AP unprotected-deauth-6 cluster without preceding storm
+
+```
+12:53:10.475 → 12:53:11.756 AP fires 17 unprotected-deauth-6 from 5b:32 over 1.3 s
+                            (2 mgmt-TX no-ack from our chip in the middle)
+12:53:12.309 kernel: deauthenticating ... reason 2 = PREV_AUTH_NOT_VALID
+12:53:14–15 reauth via 61:b0 → 5b:32, recovery in ~4 s
+```
+
+Neither Patch A (zero decrypt-fails preceded) nor Patch B (zero api_connection_loss) fired. Background: AVM Fritz!Boxes (newton) are reliable; the AP correctly classified ohm's frames as Class 2 from non-auth, meaning **bes2600 sent something the AP couldn't authenticate**. New backlog entry: `notes/observed-bugs.md` Bug #5 (RX path under throughput pressure) is the leading hypothesis surface.
+
+Recovery was fast (4 s) so this isn't a P0 — but a Patch C investigation is warranted when prioritized.
+
+---
+
+## Bug #5 — RX path degradation under attempted-throughput pressure (NEW)
+
+```
+sender 1 MB/s → ohm receives 1015 KB/s, -57 dBm, RX MCS 4
+sender 4 MB/s → ohm receives 563 KB/s, -67 dBm, RX MCS 3
+```
+
+Higher attempted-throughput on the sender side → LOWER observed throughput at ohm. Signal degraded ~10 dB, MCS dropped a notch. Link-physical max is ~8 MB/s; we're getting ~7 % of that under load.
+
+**Hypothesis (Markus): driver/firmware locks itself to death under busy reads.** Plausibly the same root-cause as the Phase 0 YouTube DASH chunk-fetch drops (~10 frames per chunk fetch on hardware-decoder playback). Documented as Bug #5 in `notes/observed-bugs.md`.
+
+---
+
+## Lessons captured for memory (Phase 8 anchor)
+
+1. **Stress-rate matters for verification.** Patch A's predicted delta only became observable when the netcat cap went 1 → 4 MB/s. The previous Phase 7 (10h30m @ 1 MB/s) saw zero decrypt-storms. Future Phase 7 protocols should plan a stress ramp from steady to near-saturation, not just the steady setting.
+2. **"Untriggered, no harm" is a valid Phase 7 verdict** for installed patches. Patch B fits this exactly. The patch is ready; the trigger pattern just doesn't fire often enough in this RF / load regime to verify the recovery delta. Don't let unobserved verifications block the loop.
+3. **Build infrastructure on `cleanups` not `mobian`.** The Phase 6 attempt to base Patch B on mobian forced a refactor mid-flight; the c-stack lives on cleanups, and re-using c5.2's `bes2600_chrdev_do_bus_reset` requires that. The cleanups branch is the campaign's working trunk.
+4. **AP-side bug is unlikely on AVM hardware.** AVM Fritz!Boxes don't fire spurious deauth-6 storms. When ohm sees AP-deauth-6 unprovoked, the suspect chain is bes2600 sending something the AP can't authenticate. The bias toward "bes2600 is the broken thing" is empirically validated.
+5. **AP-deauth-6 can fire without our local triggers.** Trigger C is a real failure mode neither Patch A nor B addresses. Adding a Phase-1-style metric for "AP-deauth-6 rate without preceding decrypt-storm or api_connection_loss" would surface Trigger C cleanly.
+6. **`pv -L` cap interacts with TCP retransmit recovery.** When the link can't sustain the cap, TCP backs off and pv blocks. Observed throughput is then a **floor on chip RX capacity at that signal level**, not the sender's intent. Useful for chip-load-characterization, but the cap should be set based on observed pull-rate, not on the link's nominal MCS rate.
+
+---
+
+## Loop status
+
+- Phase 7: closed.
+- Patch A: confirmed (N=2). Stays in.
+- Patch B: installed, dormant in this regime, no harm. Stays in.
+- Bug #5: backlog, no patch yet. Documented.
+- Trigger C: backlog candidate, no patch yet. Documented.
+
+Next campaign cycle would be re-anchoring Phase 0 around Bug #5 or Trigger C.
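
The metric proposed in lesson 5 (AP-deauth-6 rate without a preceding decrypt-storm or api_connection_loss) could be prototyped roughly as follows. This is a sketch under assumptions, not campaign tooling: the `Event` record, the event-kind strings, the 60 s lookback and the 5 s cluster gap are all hypothetical choices, and it expects events that have already been parsed out of the logs. The timeline is synthetic: one cluster shaped like the 12:53:10 Trigger C burst (17 frames over about 1.3 s, roughly 310 s into the run), plus a made-up deauth that follows a decrypt-storm, included only to show the filter excluding it.

```python
from dataclasses import dataclass

# Hypothetical event kinds; the real campaign logs may label these differently.
DEAUTH6 = "ap_deauth_6"
LOCAL_TRIGGERS = {"decrypt_storm", "api_connection_loss"}


@dataclass(frozen=True)
class Event:
    t: float   # seconds since the start of the run
    kind: str  # one of the kinds above


def unprovoked_deauth6(events, lookback_s=60.0, cluster_gap_s=5.0):
    """Return start times of AP-deauth-6 clusters that have no local
    trigger (decrypt-storm or api_connection_loss) in the preceding
    lookback_s seconds. Frames closer together than cluster_gap_s are
    counted as one cluster. Both windows are assumed values."""
    last_trigger = None  # time of the most recent local trigger
    last_deauth = None   # time of the most recent deauth-6 frame
    starts = []
    for ev in sorted(events, key=lambda e: e.t):
        if ev.kind in LOCAL_TRIGGERS:
            last_trigger = ev.t
        elif ev.kind == DEAUTH6:
            new_cluster = last_deauth is None or ev.t - last_deauth > cluster_gap_s
            provoked = last_trigger is not None and ev.t - last_trigger <= lookback_s
            if new_cluster and not provoked:
                starts.append(ev.t)
            last_deauth = ev.t
    return starts


def rate_per_hour(count, window_s):
    """Convert an event count over window_s seconds to events per hour."""
    return count * 3600.0 / window_s


# Synthetic timeline: 17 deauth-6 frames in a burst, no trigger before it.
timeline = [Event(310.475 + i * 0.08, DEAUTH6) for i in range(17)]
# Made-up provoked case (did not occur in this run): storm, then deauth.
timeline += [Event(3596.0, "decrypt_storm"), Event(3600.0, DEAUTH6)]

starts = unprovoked_deauth6(timeline)
print(starts)                                        # [310.475]
print(f"{rate_per_hour(len(starts), 8700.0):.2f}/h")  # 0.41/h over 2 h 25 m
```

Counting clusters rather than individual frames keeps a single 17-frame burst from inflating the rate; whether 5 s is the right gap would need checking against more captures.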