diff --git a/notes/phase4-2026-05-06.md b/notes/phase4-2026-05-06.md
new file mode 100644
index 000000000..00e9152eb
--- /dev/null
+++ b/notes/phase4-2026-05-06.md
@@ -0,0 +1,104 @@
+# BES2600 WiFi-stability campaign — Phase 4 plan artifact
+
+Date assembled: 2026-05-06
+Run dir: /root/bes2600-samples/run-20260506-0659-fresh/ on ohm
+Module under test: bes2600.ko srcversion 461AFB369355AE598D79BDF (c-final + c5.2.1)
+
+This is the Phase 4 plan hand-off. Per project CLAUDE.md: paste verbatim, do not curate.
+
+Drafted after the Phase 5 review of the artifact merged as PR #3 (notes/phase5-2026-05-06.md). Reviewer feedback and a new in-rig finding (Trigger A pinned to mac80211 `api_connection_loss`) are folded into the Phase 1 revision below.
+
+---
+
+## Phase 1 — revised metric (folds review feedback + Trigger-A discovery)
+
+> Per hour of operation, count three event classes:
+>
+> (a) `WSM_STATUS_DECRYPTFAILURE` bursts (≥4 events / 60 s)
+> (b) mac80211 `api_connection_loss` events
+> (c) AP-side unprotected-deauth-reason-6 frames
+>
+> For each, also report the conditional probability that the event escalates to a recovery blackhole > 5 s.
+
+Reviewer feedback applied:
+- "Also count AP-deauth" → class (c) added
+- "N=1 idle is fine" → no further idle reps required to lock
+- AP-side capture not needed → confirmed; no Fritz!Box logging required
+
+New finding since the merged Phase 5 artifact (today, ftrace-instrumented suspend/resume reps):
+
+- **Trigger B (decrypt storm path)**: `decrypt-fail × N → AP unprotected-deauth-6 → kernel local reason-2 deauth`. Receipts: 07:13 (77 events / 24 s), 11:03 (8 events / 9 s), 12:00–13:28 load run (12 bursts).
+- **Trigger A (beacon-loss path)**: `mac80211 api_connection_loss → kernel local reason-2 deauth`. Receipts: 17:23 and 18:03 today, both following ftrace `api_connection_loss`; yesterday 22:33 (presumed same path; not instrumented at the time).
+
+---
+
+## Phase 4 — Plan
+
+### Patch A — Decrypt-storm fast-recover (Trigger B)
+
+#### What will be touched
+
+- `bes2600/txrx.c`, at the `bes_warn` for `WSM_STATUS_DECRYPTFAILURE / goto drop` site (currently line 1696).
+- A new sliding-window counter on `bes2600_common` (or equivalent struct) tracking decrypt-fail timestamps.
+- On threshold (proposed: ≥5 within 5 s), `schedule_work(&priv->reassoc_work)` that calls `ieee80211_connection_loss(vif)` so mac80211 enters its clean-reassoc path.
+- A small struct field for the counter, plus init in probe and reset on assoc.
+
+#### What will NOT be touched
+
+- `bes2600/bes2600_sdio.c` bus-level paths (no SDIO change).
+- Any of the c5.x or c7 stacks (PM, scan defer, LMAC monitor, multi-func reset).
+- Firmware. The fix is host-side recovery, not chip- or AP-side.
+- mac80211 / cfg80211 core. Only `ieee80211_connection_loss` is called; no kernel API addition.
+
+#### Predicted delta on Phase 1 metric (same units as Phase 3 receipts)
+
+- Decrypt-burst rate **(a)**: UNCHANGED. We don't address the root cause of why decrypts fail; we only catch the storm earlier.
+- AP-deauth-6 rate **(c)**: DECREASES toward zero, because we pre-empt the AP by initiating a clean reassoc before the AP fires its unprotected deauth. Predicted: c_after / c_before ≤ 0.2.
+- Conditional probability of >5 s blackhole given a burst: DECREASES from current 100 % (idle baseline, N=1) toward ≤ 10 %. Recovery time falls from 109 s (worst observed) to <5 s.
+
+Dimensions match Phase 3's idle/load comparison table; numbers are predictions, to be verified in Phase 7.
+
+#### API contracts to confirm before writing code
+
+Per project CLAUDE.md "contract before code":
+
+- `ieee80211_connection_loss(vif)` — semantics + caller-context constraints. Header: `linux/mac80211.h`. Must be called from process context (work item is fine), must NOT be called from interrupt or with rx-skb lock held.
+- `bes2600_vif` / `bes2600_common` struct fields available for the counter — counter must be safe to update from the `wsm_handle_rx` path.
+- cw1200 / cw1260 ancestor: any pre-existing storm-recovery logic? If yes, follow that pattern; if no, this is a clean addition.
+- Existing bes2600 work-item plumbing (e.g., `bes2600_chrdev_do_bus_reset` from c5.2) — same shape, same allocation rules.
+
+These will be cited in the commit message body and in the patch header comment per Phase 6 rules.
+
+#### Risk
+
+- If `ieee80211_connection_loss` is called too aggressively, normal occasional decrypt fails (e.g., one-off MIC failures on bad RX) could trigger spurious reassocs. Threshold (5 in 5 s) is chosen to be stricter than the steady-state decrypt-fail rate observed (60+/h under load ≈ 1/min, never 5/5 s outside a true storm).
+- If the chip's RX path is the actual cause of the storm, the reassoc will hit the same chip-level issue. The patch may move the symptom from "stuck for 109 s" to "rapidly cycling reassocs". That itself would be visible in Phase 7 measurement.
+
+---
+
+### Patch B — Beacon-loss fast-recover (Trigger A) — PARKED
+
+Not part of this Phase 4. Locked behind one more diagnostic rep:
+
+- Add to the snap loop (rig is live): track wlan0 station-dump `beacon loss` counter at 10-second cadence (currently 60 s). Want to see the per-tick increase before `api_connection_loss` fires.
+- Goal: distinguish "chip silently drops beacons" from "real beacon loss in the air" before committing to a host-side patch.
+- Requires no new instrumentation install — just a snap-loop cadence change. Estimate two reps with this finer cadence will make the picture clean enough.
+- Once that data lands, Patch B becomes its own Phase 4 plan + Phase 5 review.
+
+---
+
+## Receipt checklist for Phase 4
+
+- [x] What will and will not be touched: stated above
+- [x] Predicted delta in Phase 3 units: stated above (a/c rate predictions, conditional-probability prediction, recovery-time prediction)
+- [x] Out-of-scope items explicitly listed
+- [x] Risk items explicitly listed
+
+---
+
+## Asks of the reviewer
+
+1. Is the threshold (≥5 decrypt-fails in 5 s) the right shape? Should it be more conservative (≥10 in 10 s)? More aggressive (≥3 in 3 s)? The 12 observed bursts ranged from 4 to 9 events per 60 s window (the Phase 1 looser definition). The patch threshold will fire on the same bursts under any of those choices; pick the one most defensible against false positives.
+2. Is `ieee80211_connection_loss(vif)` the right kernel API? Alternative: `cfg80211_disconnected` with a reason code. Which is cleaner per mac80211 contract for a host-driven preemptive reassoc?
+3. Should Patch A include a debugfs counter exposing how many storms it has caught, so Phase 7 verification has a host-side counter rather than relying on journal grep alone?
+4. Patch B parked correctly, or fold it into this same Phase 4?
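
Note on ask 1: the sliding-window counter proposed for Patch A can be sketched in userspace C, which may help when comparing the candidate thresholds (≥5 in 5 s, ≥10 in 10 s, ≥3 in 3 s). This is only a sketch under stated assumptions. The names `storm_window`, `storm_window_init`, `storm_window_reset`, and `storm_window_record` are hypothetical and do not exist in bes2600, timestamps are plain monotonic milliseconds rather than jiffies, and no locking is shown even though the real counter would have to be safe on the `wsm_handle_rx` path.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical userspace model; not bes2600 code. */
#define STORM_RING_MAX 16

struct storm_window {
    uint64_t stamps_ms[STORM_RING_MAX]; /* ring of the newest event times */
    size_t head;        /* next slot to overwrite */
    size_t count;       /* valid entries, capped at threshold */
    size_t threshold;   /* events needed to declare a storm */
    uint64_t window_ms; /* span those events must fit inside */
};

/* Forget all recorded events (the plan resets the counter on assoc). */
void storm_window_reset(struct storm_window *w)
{
    w->head = 0;
    w->count = 0;
}

void storm_window_init(struct storm_window *w, size_t threshold,
                       uint64_t window_ms)
{
    assert(threshold >= 1 && threshold <= STORM_RING_MAX);
    w->threshold = threshold;
    w->window_ms = window_ms;
    storm_window_reset(w);
}

/*
 * Record one decrypt failure at now_ms (assumed monotonic).
 * Returns true when the newest `threshold` events, including this
 * one, span at most window_ms.
 */
bool storm_window_record(struct storm_window *w, uint64_t now_ms)
{
    /* Only the first `threshold` slots of the ring are used. */
    w->stamps_ms[w->head] = now_ms;
    w->head = (w->head + 1) % w->threshold;
    if (w->count < w->threshold)
        w->count++;
    if (w->count < w->threshold)
        return false;
    /* Once full, head points at the oldest of the last `threshold` stamps. */
    return now_ms - w->stamps_ms[w->head] <= w->window_ms;
}
```

In a driver, a `true` return would be the point at which the plan schedules the work item that calls `ieee80211_connection_loss(vif)`; the sketch stops short of that. Replaying journal timestamps through `storm_window_record` with each candidate (threshold, window) pair is one way to check the false-positive question before picking a value.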