Drafts Patch A (decrypt-storm fast-recover, Trigger B) at txrx.c:1696 with sliding-window threshold + ieee80211_connection_loss reassoc. Patch B (beacon-loss / Trigger A) parked behind one more diagnostic rep with 10s snap-loop cadence on the beacon-loss counter. Folds reviewer feedback from PR #3 + the new Trigger-A finding (post-resume P1 = api_connection_loss-driven, two reps captured today at 17:23 and 18:03) into a revised Phase 1 metric counting three event classes. Pending Phase 5 second-model review of the plan before Phase 6 implementation. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
BES2600 WiFi-stability campaign — Phase 4 plan artifact
Date assembled: 2026-05-06
Run dir: /root/bes2600-samples/run-20260506-0659-fresh/ on ohm
Module under test: bes2600.ko srcversion 461AFB369355AE598D79BDF (c-final + c5.2.1)
This is the Phase 4 plan hand-off. Per project CLAUDE.md: paste verbatim, do not curate.
Drafted after the Phase 5 review of the artifact merged as PR #3 (notes/phase5-2026-05-06.md). Reviewer feedback and a new in-rig finding (Trigger A pinned to mac80211 api_connection_loss) are folded into the Phase 1 revision below.
Phase 1 — revised metric (folds review feedback + Trigger-A discovery)
Per hour of operation, count three event classes:
(a) WSM_STATUS_DECRYPTFAILURE bursts (≥4 events / 60 s)
(b) mac80211 api_connection_loss events
(c) AP-side unprotected-deauth-reason-6 frames
For each, also report the conditional probability that the event escalates to a recovery blackhole > 5 s.
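A minimal sketch of how class (a) and the conditional probability could be computed offline from extracted event timestamps. The function names, the seconds-based timestamps, and the segmentation rule (a burst opens at the first event of a qualifying 60 s window and absorbs every event inside that window) are assumptions for illustration; the plan fixes only the "≥4 events / 60 s" threshold.

```c
#include <stddef.h>
#include <stdint.h>

#define BURST_MIN_EVENTS 4   /* class (a): at least 4 events ... */
#define BURST_WINDOW_S   60  /* ... within 60 s */

/*
 * Count non-overlapping bursts in a sorted array of event times (seconds).
 * Assumption: one burst absorbs all events within 60 s of its first event.
 */
static size_t count_bursts(const uint32_t *t, size_t n)
{
    size_t bursts = 0;
    size_t i = 0;

    while (i + BURST_MIN_EVENTS <= n) {
        if (t[i + BURST_MIN_EVENTS - 1] - t[i] <= BURST_WINDOW_S) {
            uint32_t end = t[i] + BURST_WINDOW_S;

            bursts++;
            /* Skip every event that falls inside this burst's window. */
            while (i < n && t[i] <= end)
                i++;
        } else {
            i++;
        }
    }
    return bursts;
}

/*
 * Fraction of events that escalated to a recovery blackhole > 5 s.
 * Returns -1.0 when there were no events, since the ratio is undefined.
 */
static double escalation_probability(size_t events, size_t escalated)
{
    if (events == 0)
        return -1.0;
    return (double)escalated / (double)events;
}
```

With a single observed burst that escalated, escalation_probability(1, 1) returns 1.0, which corresponds to a 100 % conditional probability at N=1.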
Reviewer feedback applied:
- "Also count AP-deauth" → class (c) added
- "N=1 idle is fine" → no further idle reps required to lock
- AP-side capture not needed → confirmed; no Fritz!Box logging required
New finding since the merged Phase 5 artifact (today, ftrace-instrumented suspend/resume reps):
- Trigger B (decrypt storm path): decrypt-fail × N → AP unprotected-deauth-6 → kernel local reason-2 deauth. Receipts: 07:13 (77 events / 24 s), 11:03 (8 events / 9 s), 12:00–13:28 load run (12 bursts).
- Trigger A (beacon-loss path): mac80211 api_connection_loss → kernel local reason-2 deauth. Receipts: 17:23 and 18:03 today, both following ftrace api_connection_loss; yesterday 22:33 (presumed same path; not instrumented at the time).
Phase 4 — Plan
Patch A — Decrypt-storm fast-recover (Trigger B)
What will be touched
- bes2600/txrx.c, at the bes_warn for WSM_STATUS_DECRYPTFAILURE / goto drop site (currently line 1696).
- A new sliding-window counter on bes2600_common (or equivalent struct) tracking decrypt-fail timestamps.
- On threshold (proposed: ≥5 within 5 s), schedule_work(&priv->reassoc_work) that calls ieee80211_connection_loss(vif) so mac80211 enters its clean-reassoc path.
- A small struct field for the counter, plus init in probe and reset on assoc.
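The sliding-window counter described above could look roughly like the following userspace sketch. It is not the driver code: struct decrypt_storm, decrypt_storm_note(), and the millisecond timestamps are hypothetical names and units chosen so the logic can be tested outside the kernel. In the driver, the state would sit on bes2600_common (or equivalent), the timestamps would presumably come from jiffies or ktime, and updates from the RX path would need whatever locking that path requires. A true return marks the point where the plan would call schedule_work(&priv->reassoc_work).

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define STORM_THRESHOLD 5      /* proposed: >= 5 decrypt failures ... */
#define STORM_WINDOW_MS 5000u  /* ... within 5 s */

/* Hypothetical state; holds the most recent STORM_THRESHOLD failure times. */
struct decrypt_storm {
    uint64_t stamp_ms[STORM_THRESHOLD]; /* ring buffer of timestamps */
    unsigned int head;                  /* next slot to overwrite (the oldest) */
    unsigned int count;                 /* valid entries, saturates at threshold */
};

/* Clear the window; the plan calls for this in probe and on assoc. */
static void decrypt_storm_reset(struct decrypt_storm *s)
{
    memset(s, 0, sizeof(*s));
}

/*
 * Record one decrypt failure at now_ms. Returns true when this failure and
 * the STORM_THRESHOLD - 1 before it all fall within STORM_WINDOW_MS.
 */
static bool decrypt_storm_note(struct decrypt_storm *s, uint64_t now_ms)
{
    uint64_t oldest;

    s->stamp_ms[s->head] = now_ms;
    s->head = (s->head + 1) % STORM_THRESHOLD;
    if (s->count < STORM_THRESHOLD)
        s->count++;
    if (s->count < STORM_THRESHOLD)
        return false;

    /* After the write, head indexes the oldest retained timestamp. */
    oldest = s->stamp_ms[s->head];
    if (now_ms - oldest > STORM_WINDOW_MS)
        return false;

    /* Re-arm so that one storm requests one recovery. */
    decrypt_storm_reset(s);
    return true;
}
```

Storing only the last five timestamps keeps the per-event cost constant and needs no allocation on the RX path. Resetting after a hit is a design choice in this sketch, not something the plan specifies: it prevents a long storm from queueing the recovery work on every subsequent failure.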
What will NOT be touched
- bes2600/bes2600_sdio.c bus-level paths (no SDIO change).
- Any of the c5.x or c7 stacks (PM, scan defer, LMAC monitor, multi-func reset).
- Firmware. The fix is host-side recovery, not chip- or AP-side.
- mac80211 / cfg80211 core. Only ieee80211_connection_loss is called; no kernel API addition.
Predicted delta on Phase 1 metric (same units as Phase 3 receipts)
- Decrypt-burst rate (a): UNCHANGED. We don't address the root cause of why decrypts fail; we only catch the storm earlier.
- AP-deauth-6 rate (c): DECREASES toward zero, because we pre-empt the AP by initiating a clean reassoc before the AP fires its unprotected deauth. Predicted: c_after / c_before ≤ 0.2.
- Conditional probability of >5 s blackhole given a burst: DECREASES from current 100 % (idle baseline, N=1) toward ≤ 10 %. Recovery time falls from 109 s (worst observed) to <5 s.
Dimensions match Phase 3's idle/load comparison table; numbers are predictions, to be verified in Phase 7.
API contracts to confirm before writing code
Per project CLAUDE.md "contract before code":
- ieee80211_connection_loss(vif): semantics + caller-context constraints. Header: net/mac80211.h. Must be called from process context (work item is fine), must NOT be called from interrupt or with rx-skb lock held.
- bes2600_vif / bes2600_common struct fields available for the counter; the counter must be safe to update from the wsm_handle_rx path.
- cw1200 / cw1260 ancestor: any pre-existing storm-recovery logic? If yes, follow that pattern; if no, this is a clean addition.
- Existing bes2600 work-item plumbing (e.g., bes2600_chrdev_do_bus_reset from c5.2): same shape, same allocation rules.
These will be cited in the commit message body and in the patch header comment per Phase 6 rules.
Risk
- If ieee80211_connection_loss is called too aggressively, normal occasional decrypt fails (e.g., one-off MIC failures on bad RX) could trigger spurious reassocs. Threshold (5 in 5 s) is chosen to be stricter than the steady-state decrypt-fail rate observed (60+/h under load ≈ 1/min, never 5/5 s outside a true storm).
- If the chip's RX path is the actual cause of the storm, the reassoc will hit the same chip-level issue. The patch may move the symptom from "stuck for 109 s" to "rapidly cycling reassocs". That itself would be visible in Phase 7 measurement.
Patch B — Beacon-loss fast-recover (Trigger A) — PARKED
Not part of this Phase 4. Locked behind one more diagnostic rep:
- Add to the snap loop (rig is live): track wlan0 station-dump beacon loss counter at 10-second cadence (currently 60 s). Want to see the per-tick increase before api_connection_loss fires.
- Goal: distinguish "chip silently drops beacons" from "real beacon loss in the air" before committing to a host-side patch.
- Requires no new instrumentation install — just a snap-loop cadence change. Estimate two reps with this finer cadence will make the picture clean enough.
- Once that data lands, Patch B becomes its own Phase 4 plan + Phase 5 review.
Receipt checklist for Phase 4
- What will and will not be touched: stated above
- Predicted delta in Phase 3 units: stated above (a/c rate predictions, conditional-probability prediction, recovery-time prediction)
- Out-of-scope items explicitly listed
- Risk items explicitly listed
Asks of the reviewer
- Is the threshold (≥5 decrypt-fails in 5 s) the right shape? Should it be more conservative (≥10 in 10 s)? More aggressive (≥3 in 3 s)? The 12 observed bursts ranged from 4 to 9 events per 60 s window (the Phase 1 looser definition). The patch threshold will fire on the same bursts under any of those choices; pick the one most defensible against false positives.
- Is ieee80211_connection_loss(vif) the right kernel API? Alternative: cfg80211_disconnected with a reason code. Which is cleaner per mac80211 contract for a host-driven preemptive reassoc?
- Should Patch A include a debugfs counter exposing how many storms it has caught, so Phase 7 verification has a host-side counter rather than relying on journal grep alone?
- Patch B parked correctly, or fold it into this same Phase 4?