notes: phase 4 plan artifact for BES2600 wifi-stability campaign
Drafts Patch A (decrypt-storm fast-recover, Trigger B) at txrx.c:1696 with sliding-window threshold + ieee80211_connection_loss reassoc. Patch B (beacon-loss / Trigger A) parked behind one more diagnostic rep with 10s snap-loop cadence on the beacon-loss counter. Folds reviewer feedback from PR #3 + the new Trigger-A finding (post-resume P1 = api_connection_loss-driven, two reps captured today at 17:23 and 18:03) into a revised Phase 1 metric counting three event classes. Pending Phase 5 second-model review of the plan before Phase 6 implementation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -0,0 +1,104 @@
# BES2600 WiFi-stability campaign — Phase 4 plan artifact
Date assembled: 2026-05-06
Run dir: /root/bes2600-samples/run-20260506-0659-fresh/ on ohm
Module under test: bes2600.ko srcversion 461AFB369355AE598D79BDF (c-final + c5.2.1)

This is the Phase 4 plan hand-off. Per project CLAUDE.md: paste verbatim, do not curate.

Drafted after the Phase 5 review of the artifact merged as PR #3 (notes/phase5-2026-05-06.md). Reviewer feedback and a new in-rig finding (Trigger A pinned to mac80211 `api_connection_loss`) are folded into the Phase 1 revision below.

---
## Phase 1 — revised metric (folds review feedback + Trigger-A discovery)
> Per hour of operation, count three event classes:
>
> (a) `WSM_STATUS_DECRYPTFAILURE` bursts (≥4 events / 60 s)
> (b) mac80211 `api_connection_loss` events
> (c) AP-side unprotected-deauth-reason-6 frames
>
> For each, also report the conditional probability that the event escalates to a recovery blackhole > 5 s.

Reviewer feedback applied:
- "Also count AP-deauth" → class (c) added
- "N=1 idle is fine" → no further idle reps required to lock
- AP-side capture not needed → confirmed; no Fritz!Box logging required

New finding since the merged Phase 5 artifact (today, ftrace-instrumented suspend/resume reps):

- **Trigger B (decrypt storm path)**: `decrypt-fail × N → AP unprotected-deauth-6 → kernel local reason-2 deauth`. Receipts: 07:13 (77 events / 24 s), 11:03 (8 events / 9 s), 12:00–13:28 load run (12 bursts).
- **Trigger A (beacon-loss path)**: `mac80211 api_connection_loss → kernel local reason-2 deauth`. Receipts: 17:23 and 18:03 today, both following ftrace `api_connection_loss`; yesterday 22:33 (presumed same path; not instrumented at the time).
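A minimal sketch of how the revised Phase 1 tally could be computed offline from journal timestamps, written as userspace C rather than driver code. The grouping rule is an assumption, since the metric only says ≥4 events / 60 s: a cluster starts at the first unclaimed event and absorbs every later event up to 60 s after that start. The function names and the per-event blackhole array are hypothetical. Under this rule the 07:13 receipt (77 events / 24 s) and the 11:03 receipt (8 events / 9 s) each count as one burst.

```c
#include <assert.h>
#include <stddef.h>

#define BURST_MIN_EVENTS 4     /* Phase 1 class (a): >=4 events ... */
#define BURST_WINDOW_S   60.0  /* ... within 60 s */
#define BLACKHOLE_MIN_S  5.0   /* escalation: recovery blackhole > 5 s */

/*
 * Count bursts in an ascending array of event timestamps (seconds).
 * Assumed grouping rule: a cluster starts at the first unclaimed event and
 * absorbs every later event no more than BURST_WINDOW_S after that start.
 */
size_t count_bursts(const double *ts, size_t n)
{
    size_t bursts = 0;
    size_t i = 0;

    while (i < n) {
        size_t j = i + 1;

        while (j < n && ts[j] - ts[i] <= BURST_WINDOW_S) {
            assert(ts[j] >= ts[j - 1]);   /* input must be sorted */
            j++;
        }
        if (j - i >= BURST_MIN_EVENTS)
            bursts++;
        i = j;                            /* next cluster starts after this one */
    }
    return bursts;
}

/*
 * Fraction of events followed by a blackhole longer than BLACKHOLE_MIN_S.
 * blackhole_s[k] is the outage measured after event k (0.0 if none).
 * Returns -1.0 when n == 0, because the ratio is undefined.
 */
double escalation_probability(const double *blackhole_s, size_t n)
{
    size_t escalated = 0;
    size_t k;

    if (n == 0)
        return -1.0;
    for (k = 0; k < n; k++) {
        if (blackhole_s[k] > BLACKHOLE_MIN_S)
            escalated++;
    }
    return (double)escalated / (double)n;
}
```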
---
## Phase 4 — Plan
### Patch A — Decrypt-storm fast-recover (Trigger B)
#### What will be touched

- `bes2600/txrx.c`, at the site that logs `WSM_STATUS_DECRYPTFAILURE` with `bes_warn` and then does `goto drop` (currently line 1696).
- A new sliding-window counter on `bes2600_common` (or equivalent struct) tracking decrypt-fail timestamps.
- On threshold (proposed: ≥5 within 5 s), a `schedule_work(&priv->reassoc_work)` call; the work handler calls `ieee80211_connection_loss(vif)` so mac80211 enters its clean-reassoc path.
- A small struct field for the counter, plus init in probe and reset on assoc.
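The sliding-window counter described above can be modelled in userspace before touching the driver. This is a sketch under stated assumptions, not the patch: the struct and function names are hypothetical, time is a plain millisecond value instead of jiffies, locking for the `wsm_handle_rx` path is left out, and re-arming the window after it fires is a design choice the plan does not specify.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define STORM_EVENTS    5     /* proposed threshold: >=5 decrypt-fails ... */
#define STORM_WINDOW_MS 5000  /* ... within 5 s */

/* Hypothetical state; in the driver it would sit on bes2600_common or similar. */
struct decrypt_storm {
    uint64_t stamp_ms[STORM_EVENTS]; /* ring of the most recent failure times */
    unsigned int head;               /* next slot to overwrite */
    unsigned int count;              /* valid entries, saturates at STORM_EVENTS */
};

/* Intended for probe time and for every successful association. */
void decrypt_storm_reset(struct decrypt_storm *s)
{
    memset(s, 0, sizeof(*s));
}

/*
 * Record one decrypt failure at now_ms (monotonic clock).
 * Returns true when this failure is the STORM_EVENTS-th within
 * STORM_WINDOW_MS, i.e. the point where the reassoc work would be scheduled.
 */
bool decrypt_storm_note(struct decrypt_storm *s, uint64_t now_ms)
{
    s->stamp_ms[s->head] = now_ms;
    s->head = (s->head + 1) % STORM_EVENTS;
    if (s->count < STORM_EVENTS)
        s->count++;

    if (s->count == STORM_EVENTS) {
        /* Once the ring is full, head points at the oldest stored stamp. */
        uint64_t oldest = s->stamp_ms[s->head];

        if (now_ms - oldest <= STORM_WINDOW_MS) {
            decrypt_storm_reset(s);  /* assumption: one trigger per storm */
            return true;
        }
    }
    return false;
}
```

In the driver, a `true` return is where `schedule_work(&priv->reassoc_work)` would go. A ring of exactly five timestamps is enough because the check only ever compares the newest failure with the fifth-newest.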
#### What will NOT be touched

- `bes2600/bes2600_sdio.c` bus-level paths (no SDIO change).
- Any of the c5.x or c7 stacks (PM, scan defer, LMAC monitor, multi-func reset).
- Firmware. The fix is host-side recovery, not chip- or AP-side.
- mac80211 / cfg80211 core. Only `ieee80211_connection_loss` is called; no kernel API addition.

#### Predicted delta on Phase 1 metric (same units as Phase 3 receipts)

- Decrypt-burst rate **(a)**: UNCHANGED. We don't address the root cause of why decrypts fail; we only catch the storm earlier.
- AP-deauth-6 rate **(c)**: DECREASES toward zero, because we pre-empt the AP by initiating a clean reassoc before the AP fires its unprotected deauth. Predicted: c_after / c_before ≤ 0.2.
- Conditional probability of >5 s blackhole given a burst: DECREASES from current 100 % (idle baseline, N=1) toward ≤ 10 %. Recovery time falls from 109 s (worst observed) to <5 s.

Dimensions match Phase 3's idle/load comparison table; numbers are predictions, to be verified in Phase 7.
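The three quantified predictions can be written down as a pass/fail check so that Phase 7 compares like with like. A sketch only: the struct layout and all names are hypothetical, and class (a) is left out because the plan gives no tolerance for UNCHANGED.

```c
#include <stdbool.h>

/* Hypothetical container for one measurement run in Phase 1 units. */
struct phase1_sample {
    double deauth6_per_hour;        /* class (c) rate */
    double p_blackhole_given_burst; /* 0.0 .. 1.0 */
    double worst_recovery_s;        /* longest observed recovery, seconds */
};

struct prediction_result {
    bool deauth_ratio_ok;   /* c_after / c_before <= 0.2 */
    bool blackhole_prob_ok; /* P(>5 s blackhole | burst) <= 10 % */
    bool recovery_ok;       /* recovery time < 5 s */
};

struct prediction_result
check_patch_a_predictions(const struct phase1_sample *before,
                          const struct phase1_sample *after)
{
    struct prediction_result r;

    /* Written without a division so that c_before == 0 cannot fault. */
    r.deauth_ratio_ok = after->deauth6_per_hour <= 0.2 * before->deauth6_per_hour;
    r.blackhole_prob_ok = after->p_blackhole_given_burst <= 0.10;
    r.recovery_ok = after->worst_recovery_s < 5.0;
    return r;
}
```

Keeping the three flags separate, rather than collapsing them into one boolean, lets a Phase 7 report show which prediction missed.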
#### API contracts to confirm before writing code

Per project CLAUDE.md "contract before code":

- `ieee80211_connection_loss(vif)`: semantics and caller-context constraints. Header: `net/mac80211.h` (`include/net/mac80211.h` in the kernel tree). Working assumption, to be confirmed against the mac80211 source: call it from process context (a work item is fine), not from interrupt context and not with the rx-skb lock held.
- `bes2600_vif` / `bes2600_common` struct fields available for the counter: the counter must be safe to update from the `wsm_handle_rx` path.
- cw1200 / cw1260 ancestor: any pre-existing storm-recovery logic? If yes, follow that pattern; if no, this is a clean addition.
- Existing bes2600 work-item plumbing (e.g., `bes2600_chrdev_do_bus_reset` from c5.2): same shape, same allocation rules.

These will be cited in the commit message body and in the patch header comment per Phase 6 rules.
#### Risk

- If `ieee80211_connection_loss` is called too aggressively, occasional benign decrypt failures (e.g., one-off MIC failures on bad RX) could trigger spurious reassocs. The threshold (5 in 5 s) is chosen to be stricter than the steady-state decrypt-fail rate observed (60+/h under load ≈ 1/min, never 5/5 s outside a true storm).
- If the chip's RX path is the actual cause of the storm, the reassoc will hit the same chip-level issue. The patch may move the symptom from "stuck for 109 s" to "rapidly cycling reassocs". That itself would be visible in Phase 7 measurement.

---
### Patch B — Beacon-loss fast-recover (Trigger A) — PARKED
Not part of this Phase 4. Locked behind one more diagnostic rep:

- Add to the snap loop (rig is live): track the wlan0 station-dump `beacon loss` counter at 10-second cadence (currently 60 s). The aim is to see the per-tick increase before `api_connection_loss` fires.
- Goal: distinguish "chip silently drops beacons" from "real beacon loss in the air" before committing to a host-side patch.
- Requires no new instrumentation install, just a snap-loop cadence change. Estimate: two reps at this finer cadence should make the picture clean enough.
- Once that data lands, Patch B becomes its own Phase 4 plan + Phase 5 review.
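For the finer snap-loop cadence, the per-tick increase could be extracted as sketched below. It assumes each snapshot contains a `beacon loss:` field followed by an unsigned integer, as the station dump referenced above is expected to print. The function names are hypothetical, and treating a counter that went backwards as a restart from zero (for example after a reassociation) is an assumption.

```c
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/*
 * Extract the "beacon loss:" counter from one station-dump snapshot.
 * Returns true and stores the value in *out on success; returns false if
 * the field is missing or is not followed by a number.
 */
bool parse_beacon_loss(const char *dump, unsigned long *out)
{
    static const char key[] = "beacon loss:";
    const char *p = strstr(dump, key);
    char *end;
    unsigned long v;

    if (!p)
        return false;
    p += sizeof(key) - 1;
    v = strtoul(p, &end, 10);   /* strtoul skips the separating whitespace */
    if (end == p)
        return false;           /* no digits after the key */
    *out = v;
    return true;
}

/*
 * Increase between two consecutive 10-second ticks. If the counter went
 * backwards, assume it was reset and count from zero.
 */
unsigned long beacon_loss_delta(unsigned long prev, unsigned long cur)
{
    return cur >= prev ? cur - prev : cur;
}
```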
---

## Receipt checklist for Phase 4

- [x] What will and will not be touched: stated above
- [x] Predicted delta in Phase 3 units: stated above (a/c rate predictions, conditional-probability prediction, recovery-time prediction)
- [x] Out-of-scope items explicitly listed
- [x] Risk items explicitly listed

---

## Asks of the reviewer
1. Is the threshold (≥5 decrypt-fails in 5 s) the right shape? Should it be more conservative (≥10 in 10 s)? More aggressive (≥3 in 3 s)? The 12 observed bursts ranged from 4 to 9 events per 60 s window (the Phase 1 looser definition). The working assumption is that the patch threshold fires on the same bursts under any of those choices, but that depends on per-event timing inside each burst: a 4-event burst cannot reach ≥5, and none of the 12 bursts reaches ≥10. Pick the one most defensible against false positives.
2. Is `ieee80211_connection_loss(vif)` the right kernel API? Alternative: `cfg80211_disconnected` with a reason code. Which is cleaner per mac80211 contract for a host-driven preemptive reassoc?
3. Should Patch A include a debugfs counter exposing how many storms it has caught, so Phase 7 verification has a host-side counter rather than relying on journal grep alone?
4. Is Patch B parked correctly, or should it be folded into this same Phase 4?