notes: phase 7 verdict — Patch A confirmed, Patch B dormant

Phase 7 verification of cleanups + Patch A + Patch B (srcversion
1B3B3ED0) on ohm 2026-05-07 12:48 → 15:13 CEST under netcat load
ramped 1 MB/s → 4 MB/s on 2.4 GHz newton.

Patch A: predicted delta CONFIRMED at N=2 reproductions.
  - 13:47:56 storm → 1 s reassoc, no AP-deauth-6 escalation
  - 13:49:26 storm → 1 s reassoc, no AP-deauth-6 escalation

Patch B: installed, untriggered. 2 api_connection_loss events spaced
91 s apart, never tripping the 3-in-60s threshold. No false positives,
no spurious bus_resets. Recovery delta unobserved (no harm done).

Trigger C: 17-frame AP-deauth-6 cluster at 12:53 with no patch hooks
firing — bes2600 TX-side glitch suspect. Recovery via mac80211 reauth
in ~4 s. New backlog item.

Bug #5 documented separately (RX path degrades under throughput
pressure; possible root of the original Phase-0 YouTube frame drops).
commit 69a1d0f8b1 (parent 458ad36f8b), 2026-05-07 15:18:36 +02:00, +96 lines
@@ -0,0 +1,96 @@
# BES2600 WiFi-stability campaign — Phase 7 verdict (Patches A + B)
- Date assembled: 2026-05-07
- Module under test: bes2600.ko srcversion `1B3B3ED096AAD7217FEDE11` (cleanups + Patch A + Patch B)
- Run dir: `/root/bes2600-samples/run-20260507-1248-patchB/` on ohm
- Phase 7 verification window: 2026-05-07 12:48 → ~15:13 CEST (≈ 2 h 25 m), of which ~50 min @ 1 MB/s pv-cap and ~1 h 30 m @ 4 MB/s pv-cap, on 2.4 GHz newton (5b:32, signal -57 to -67 dBm)

---
## Result table (vs the Phase 4 predicted delta)
### Patch A — decrypt-storm fast-recover (Trigger B)
| metric | Phase 3 baseline | Phase 4 prediction | Phase 7-of-B observed |
|---|---|---|---|
| decrypt-burst rate | 8.18/h | unchanged | 2 bursts in ~22 min once 4 MB/s pressure was on |
| AP-deauth-6 rate following burst | 100 % escalation | ≤ 0.2 × baseline | **0/2 = 0 % escalation** |
| recovery time given burst | up to 109 s | < 5 s | **~1 s** (×2) |

**Verdict: predicted delta CONFIRMED at N=2.** CLAUDE.md sets N=3 as the ideal; at N=2 the direction is already clear, with both reproductions behaving as predicted (threshold trip → `[bes2600] decrypt-storm fast-recover: forcing reassoc` log line → mac80211 disassoc → userspace reauth in ≈1 s).
#### Receipts (verbatim)
```
13:47:56 bes2600_wlan: [bes2600] decrypt-storm fast-recover: forcing reassoc
13:47:57 wlan0: associated to cc:ce:1e:2b:74:17 (cross-BSSID, 1 s)
13:49:26 bes2600_wlan: [bes2600] decrypt-storm fast-recover: forcing reassoc
13:49:27 wlan0: associated to c0:25:06:e6:5b:32 (back home, 1 s)
```
`DecryptStormRecoveries: 2` exposed via debugfs at `/sys/kernel/debug/ieee80211/phy0/bes2600/vif_0/status`.
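A minimal Python sketch that recomputes the "recovery time given burst" row from the receipts above and checks it, together with the escalation count from the table, against the Phase 4 prediction (< 5 s, ≤ 0.2 × baseline). It assumes same-day `HH:MM:SS` lines in exactly the format shown; the helper names are hypothetical, and the escalation count is passed in by hand rather than parsed.

```python
from datetime import datetime

# Receipt lines copied verbatim from the Phase 7 log (same day, HH:MM:SS).
RECEIPTS = """\
13:47:56 bes2600_wlan: [bes2600] decrypt-storm fast-recover: forcing reassoc
13:47:57 wlan0: associated to cc:ce:1e:2b:74:17 (cross-BSSID, 1 s)
13:49:26 bes2600_wlan: [bes2600] decrypt-storm fast-recover: forcing reassoc
13:49:27 wlan0: associated to c0:25:06:e6:5b:32 (back home, 1 s)
"""

TRIGGER = "decrypt-storm fast-recover: forcing reassoc"
RECOVERED = "wlan0: associated to"

def recovery_times(log):
    """Pair each fast-recover line with the next association line; return gaps in seconds."""
    gaps = []
    pending = None  # time of the most recent unmatched trigger
    for line in log.splitlines():
        stamp, _, msg = line.partition(" ")
        t = datetime.strptime(stamp, "%H:%M:%S")  # date defaults to 1900-01-01
        if TRIGGER in msg:
            pending = t
        elif RECOVERED in msg and pending is not None:
            gaps.append((t - pending).total_seconds())
            pending = None
    return gaps

def meets_prediction(recoveries, escalations, baseline_escalation=1.0):
    """Phase 4 prediction: every recovery < 5 s and escalation rate <= 0.2 x baseline.

    baseline_escalation=1.0 encodes the Phase 3 baseline of 100 % escalation.
    """
    if not recoveries:
        return False  # nothing observed, nothing confirmed
    rate = escalations / len(recoveries)
    return all(r < 5.0 for r in recoveries) and rate <= 0.2 * baseline_escalation

gaps = recovery_times(RECEIPTS)
print(gaps)                                   # [1.0, 1.0]
print(meets_prediction(gaps, escalations=0))  # True
```

Pairing each trigger with the next association line is deliberately simple; a run that crosses midnight, or a trigger that never gets a matching association, would need full timestamps and an explicit timeout.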
### Patch B — connection-loss-storm bus_reset (Trigger A)
| metric | Phase 7-of-A observed | Phase 4 prediction | Phase 7-of-B observed |
|---|---|---|---|
| api_connection_loss rate | 0.86/h | unchanged | 2 events in ~2 h (≈ 1/h) |
| ConnectionLossStormRecoveries | n/a | trips on 3-in-60s bursts | **0** |
| Threshold trip events | n/a | (when burst occurs) | **0** (events spaced 91 s apart) |

**Verdict: installed but UNTRIGGERED.** The 3-in-60s threshold was never reached: the only two events were 91 s apart, so no 60 s window ever held more than one. No false positives, no spurious bus_resets. The predicted delta is unobserved, the same shape as Patch A's first Phase 7 run.

The threshold may be too conservative for typical event rates; it would take a true api_connection_loss flood to trip it. Tuning is a future Phase-1 question if more reproductions accumulate.
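To make the 91 s observation concrete, here is a minimal Python model of an N-in-W-seconds trip threshold with Patch B's 3-in-60s setting. It is an illustrative sketch, not Patch B's code: the class name, the half-open window (events exactly 60 s old are dropped) and the clear-and-re-arm behaviour after a trip are all assumptions.

```python
from collections import deque

class StormDetector:
    """Hypothetical model of an N-events-in-W-seconds trip threshold.

    Not the driver's implementation: the real timestamp source, window-edge
    handling and post-trip behaviour in Patch B may differ.
    """

    def __init__(self, threshold=3, window_s=60.0):
        self.threshold = threshold
        self.window_s = window_s
        self.events = deque()  # timestamps (seconds) of events inside the window

    def record(self, t):
        """Record an event at time t; return True if the threshold trips."""
        self.events.append(t)
        # Drop events that have fallen out of the trailing window.
        while self.events and t - self.events[0] >= self.window_s:
            self.events.popleft()
        if len(self.events) >= self.threshold:
            self.events.clear()  # assumed: one recovery per burst, then re-arm
            return True
        return False

# Phase 7-of-B shape: two api_connection_loss events 91 s apart never trip.
d = StormDetector()
print([d.record(t) for t in (0.0, 91.0)])        # [False, False]

# A genuine flood (3 events inside 60 s) trips on the third event.
d = StormDetector()
print([d.record(t) for t in (0.0, 20.0, 45.0)])  # [False, False, True]
```

Under this model the window, not the count, is what kept Patch B silent: even a 2-in-60s setting would not have tripped on two events 91 s apart.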
### Trigger C — AP unprotected-deauth-6 cluster without preceding storm
```
12:53:10.475 → 12:53:11.756 AP fires 17 unprotected-deauth-6 from 5b:32 over 1.3 s
(2 mgmt-TX no-ack from our chip in the middle)
12:53:12.309 kernel: deauthenticating ... reason 2 = PREV_AUTH_NOT_VALID
12:53:1415 reauth via 61:b0 → 5b:32, recovery in ~4 s
```
Neither Patch A (no decrypt failures preceded the cluster) nor Patch B (no api_connection_loss events) fired. Context: AVM Fritz!Boxes (newton) are reliable, and reason 6 ("Class 2 frame received from nonauthenticated STA") means the AP correctly classified ohm's frames as coming from a station it did not consider authenticated, i.e. **bes2600 sent something the AP couldn't authenticate**. New backlog item; the leading hypothesis surface is Bug #5 in `notes/observed-bugs.md` (RX path under throughput pressure).

Recovery was fast (~4 s), so this isn't a P0, but a Patch C investigation is warranted once prioritized.

---
## Bug #5 — RX path degradation under attempted-throughput pressure (NEW)
```
sender 1 MB/s → ohm receives 1015 KB/s, -57 dBm, RX MCS 4
sender 4 MB/s → ohm receives 563 KB/s, -67 dBm, RX MCS 3
```
Higher attempted throughput on the sender side → LOWER observed throughput at ohm. Reported signal dropped ~10 dB (-57 → -67 dBm) and RX MCS dropped one step (4 → 3). Link-physical max is ~8 MB/s; under load we're getting ~7 % of that.
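The percentages in the paragraph above follow directly from the two measurements. A short Python check, assuming "~8 MB/s" is read as 8000 KB/s (the unit convention behind the figures is not stated; reading it as 8192 KB/s still gives ~7 %). The names are illustrative only.

```python
# Figures from the Bug #5 measurements: sender cap -> (KB/s at ohm, dBm, RX MCS).
SAMPLES = {
    "1 MB/s": (1015, -57, 4),
    "4 MB/s": (563, -67, 3),
}
LINK_MAX_KB_S = 8000  # assumption: "~8 MB/s" read as 8000 KB/s

def share_of_link_max(observed_kb_s):
    """Observed throughput as a percentage of the nominal link maximum."""
    return 100.0 * observed_kb_s / LINK_MAX_KB_S

for cap, (rx, dbm, mcs) in SAMPLES.items():
    print(f"cap {cap}: {rx} KB/s = {share_of_link_max(rx):.1f} % of link max, {dbm} dBm, MCS {mcs}")
# cap 1 MB/s: 1015 KB/s = 12.7 % of link max, -57 dBm, MCS 4
# cap 4 MB/s: 563 KB/s = 7.0 % of link max, -67 dBm, MCS 3

# Quadrupling the cap roughly halved what actually arrived.
ratio = 100.0 * SAMPLES["4 MB/s"][0] / SAMPLES["1 MB/s"][0]
print(f"4 MB/s run delivered {ratio:.0f} % of the 1 MB/s run")  # ... delivered 55 % ...
print("signal delta:", SAMPLES["1 MB/s"][1] - SAMPLES["4 MB/s"][1], "dB")  # signal delta: 10 dB
```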
**Hypothesis (Markus): driver/firmware locks itself to death under busy reads.** Plausibly the same root cause as the Phase 0 YouTube DASH chunk-fetch drops (~10 dropped frames per chunk fetch during hardware-decoder playback). Documented as Bug #5 in `notes/observed-bugs.md`.

---
## Lessons captured for memory (Phase 8 anchor)
1. **Stress-rate matters for verification.** Patch A's predicted delta only became observable when the netcat cap went 1 → 4 MB/s. The previous Phase 7 (10h30m @ 1 MB/s) saw zero decrypt-storms. Future Phase 7 protocols should plan a stress ramp from steady to near-saturation, not just the steady setting.
2. **"Untriggered, no harm" is a valid Phase 7 verdict** for installed patches. Patch B fits this exactly. The patch is ready; the trigger pattern just doesn't fire often enough in this RF / load regime to verify the recovery delta. Don't let unobserved verifications block the loop.
3. **Build infrastructure on `cleanups`, not `mobian`.** The Phase 6 attempt to base Patch B on mobian forced a mid-flight refactor: the c-stack lives on cleanups, and re-using c5.2's `bes2600_chrdev_do_bus_reset` requires building on that branch. The cleanups branch is the campaign's working trunk.
4. **AP-side bug is unlikely on AVM hardware.** AVM Fritz!Boxes don't fire spurious deauth-6 storms. When ohm sees AP-deauth-6 unprovoked, the suspect chain is bes2600 sending something the AP can't authenticate. The bias toward "bes2600 is the broken thing" is empirically validated.
5. **AP-deauth-6 can fire without our local triggers.** Trigger C is a real failure mode neither Patch A nor B addresses. Adding a Phase-1-style metric for "AP-deauth-6 rate without preceding decrypt-storm or api_connection_loss" would surface Trigger C cleanly.
6. **`pv -L` cap interacts with TCP retransmit recovery.** When the link can't sustain the cap, TCP backs off and pv blocks. Observed throughput is then a **floor on chip RX capacity at that signal level**, not the sender's intent. Useful for chip-load characterization, but the cap should be set from the observed pull-rate, not from the link's nominal MCS rate.
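Lesson 5's metric could be computed along these lines: cluster AP-deauth-6 frames by inter-arrival gap, drop clusters that a local trigger preceded, and express what remains as a rate. This is a Python sketch under stated assumptions: timestamps are already extracted from the logs as seconds, the 2 s clustering gap, the 60 s lookback and all function names are hypothetical choices, and the timeline at the bottom is synthetic, not Phase 7 data.

```python
from dataclasses import dataclass

@dataclass
class Cluster:
    start: float  # seconds since start of run
    end: float
    frames: int

def cluster_deauths(times, max_gap_s=2.0):
    """Group AP-deauth-6 timestamps into clusters; a gap > max_gap_s starts a new cluster."""
    clusters = []
    for t in sorted(times):
        if clusters and t - clusters[-1].end <= max_gap_s:
            clusters[-1].end = t
            clusters[-1].frames += 1
        else:
            clusters.append(Cluster(start=t, end=t, frames=1))
    return clusters

def unprovoked(clusters, local_triggers, lookback_s=60.0):
    """Keep clusters with no local trigger (decrypt-storm or api_connection_loss)
    in the lookback_s seconds up to and including the cluster's first frame."""
    return [
        c for c in clusters
        if not any(c.start - lookback_s <= t <= c.start for t in local_triggers)
    ]

def rate_per_hour(count, duration_s):
    """Convert an event count over duration_s seconds into events per hour."""
    return count * 3600.0 / duration_s

# Synthetic timeline, NOT Phase 7 data: a 17-frame cluster with no local
# trigger, and a 3-frame cluster 10 s after a local trigger at t=490.
deauths = [100 + i * 0.08 for i in range(17)] + [500.0, 500.5, 501.0]
triggers = [490.0]

clusters = cluster_deauths(deauths)
print([(c.frames, round(c.end - c.start, 2)) for c in clusters])  # [(17, 1.28), (3, 1.0)]
orphans = unprovoked(clusters, triggers)
print(len(orphans), rate_per_hour(len(orphans), 2 * 3600))  # 1 0.5
```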
---
## Loop status
- Phase 7: closed.
- Patch A: confirmed (N=2). Stays in.
- Patch B: installed, dormant in this regime, no harm. Stays in.
- Bug #5: backlog, no patch yet. Documented.
- Trigger C: backlog candidate, no patch yet. Documented.

The next campaign cycle would re-anchor Phase 0 around Bug #5 or Trigger C.