Compare commits
4 Commits
| Author | SHA1 | Date | |
|---|---|---|---|
| | e53aad5013 | | |
| | 4acba3e707 | | |
| | f6a25d811f | | |
| | 07a7d4b3af | | |
@@ -0,0 +1,104 @@
# BES2600 WiFi-stability campaign — Phase 4 plan artifact
Date assembled: 2026-05-06
Run dir: /root/bes2600-samples/run-20260506-0659-fresh/ on ohm
Module under test: bes2600.ko srcversion 461AFB369355AE598D79BDF (c-final + c5.2.1)

This is the Phase 4 plan hand-off. Per project CLAUDE.md: paste verbatim, do not curate.

Drafted after the Phase 5 review of the artifact merged as PR #3 (notes/phase5-2026-05-06.md). Reviewer feedback and a new in-rig finding (Trigger A pinned to mac80211 `api_connection_loss`) are folded into the Phase 1 revision below.

---
## Phase 1 — revised metric (folds review feedback + Trigger-A discovery)
> Per hour of operation, count three event classes:
>
> (a) `WSM_STATUS_DECRYPTFAILURE` bursts (≥4 events / 60 s)
> (b) mac80211 `api_connection_loss` events
> (c) AP-side unprotected-deauth-reason-6 frames
>
> For each, also report the conditional probability that the event escalates to a recovery blackhole > 5 s.

Reviewer feedback applied:

- "Also count AP-deauth" → class (c) added
- "N=1 idle is fine" → no further idle reps required to lock
- AP-side capture not needed → confirmed; no Fritz!Box logging required

New finding since the merged Phase 5 artifact (today, ftrace-instrumented suspend/resume reps):

- **Trigger B (decrypt storm path)**: `decrypt-fail × N → AP unprotected-deauth-6 → kernel local reason-2 deauth`. Receipts: 07:13 (77 events / 24 s), 11:03 (8 events / 9 s), 12:00–13:28 load run (12 bursts).
- **Trigger A (beacon-loss path)**: `mac80211 api_connection_loss → kernel local reason-2 deauth`. Receipts: 17:23 and 18:03 today, both following ftrace `api_connection_loss`; yesterday 22:33 (presumed same path; not instrumented at the time).
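A minimal sketch of how the revised Phase 1 metric could be computed offline from event timestamps (seconds since the start of a rep). The function names, the greedy burst grouping, and the 120 s attribution horizon are assumptions made for illustration; the plan itself only fixes the ≥4 events / 60 s burst definition and the > 5 s blackhole cutoff.

```python
BURST_MIN_EVENTS = 4     # class (a): at least 4 decrypt-fails ...
BURST_WINDOW_S = 60.0    # ... within 60 s
BLACKHOLE_MIN_S = 5.0    # a blackhole counts only if longer than 5 s


def find_bursts(timestamps, min_events=BURST_MIN_EVENTS, window_s=BURST_WINDOW_S):
    """Greedily group sorted timestamps into (start, end, count) bursts."""
    ts = sorted(timestamps)
    bursts = []
    i = 0
    while i < len(ts):
        j = i
        while j < len(ts) and ts[j] - ts[i] <= window_s:
            j += 1
        if j - i >= min_events:
            bursts.append((ts[i], ts[j - 1], j - i))
            i = j          # events already in a burst are not reused
        else:
            i += 1
    return bursts


def per_hour(count, duration_s):
    """Event rate normalised to one hour of operation."""
    return count / (duration_s / 3600.0)


def escalation_probability(event_times, blackholes, horizon_s=120.0):
    """Fraction of events followed by a > 5 s blackhole within horizon_s.

    blackholes is a list of (start_s, duration_s). The horizon is an
    ASSUMPTION: the plan does not say how an event is tied to a blackhole.
    """
    if not event_times:
        return None
    escalated = sum(
        1
        for t in event_times
        if any(0 <= start - t <= horizon_s and dur > BLACKHOLE_MIN_S
               for start, dur in blackholes)
    )
    return escalated / len(event_times)
```

The same two helpers apply to all three event classes; only the event list changes.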

---
## Phase 4 — Plan
### Patch A — Decrypt-storm fast-recover (Trigger B)
#### What will be touched

- `bes2600/txrx.c`, at the `bes_warn` for `WSM_STATUS_DECRYPTFAILURE / goto drop` site (currently line 1696).
- A new sliding-window counter on `bes2600_common` (or equivalent struct) tracking decrypt-fail timestamps.
- On threshold (proposed: ≥5 within 5 s), call `schedule_work(&priv->reassoc_work)`; the work handler calls `ieee80211_connection_loss(vif)` so mac80211 enters its clean-reassoc path.
- A small struct field for the counter, plus init in probe and reset on assoc.
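The sliding-window counter described above can be modelled in a few lines. This is a behavioural sketch in Python, not driver code: the class name, the clear-on-fire behaviour, and the use of a deque are assumptions, and an in-kernel version would more likely keep a small fixed array of jiffies under whatever locking the RX path requires.

```python
from collections import deque


class DecryptStormDetector:
    """Model of the proposed counter: fire at >= threshold events in window_s."""

    def __init__(self, threshold=5, window_s=5.0):
        self.threshold = threshold
        self.window_s = window_s
        self._stamps = deque()

    def record(self, now_s):
        """Record one decrypt-fail; return True if recovery should be scheduled."""
        self._stamps.append(now_s)
        # Drop timestamps that have aged out of the window.
        while self._stamps and now_s - self._stamps[0] > self.window_s:
            self._stamps.popleft()
        if len(self._stamps) >= self.threshold:
            # ASSUMPTION: clear on fire so one storm schedules one recovery.
            self._stamps.clear()
            return True
        return False

    def reset(self):
        """Corresponds to the 'reset on assoc' step in the plan."""
        self._stamps.clear()
```

Feeding the model one event per minute never fires it, while five events inside four seconds fire it once, which matches the intent of the proposed threshold.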
#### What will NOT be touched

- `bes2600/bes2600_sdio.c` bus-level paths (no SDIO change).
- Any of the c5.x or c7 stacks (PM, scan defer, LMAC monitor, multi-func reset).
- Firmware. The fix is host-side recovery, not chip- or AP-side.
- mac80211 / cfg80211 core. Only `ieee80211_connection_loss` is called; no kernel API addition.

#### Predicted delta on Phase 1 metric (same units as Phase 3 receipts)

- Decrypt-burst rate **(a)**: UNCHANGED. We don't address the root cause of why decrypts fail; we only catch the storm earlier.
- AP-deauth-6 rate **(c)**: DECREASES toward zero, because we pre-empt the AP by initiating a clean reassoc before the AP fires its unprotected deauth. Predicted: c_after / c_before ≤ 0.2.
- Conditional probability of >5 s blackhole given a burst: DECREASES from current 100 % (idle baseline, N=1) toward ≤ 10 %. Recovery time falls from 109 s (worst observed) to <5 s.

Dimensions match Phase 3's idle/load comparison table; numbers are predictions, to be verified in Phase 7.
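For Phase 7, the three predictions can be turned into explicit pass/fail checks. The sketch below is only an illustration: the function name and result keys are made up here, and the inputs in the usage example are hypothetical, not measurements.

```python
def check_patch_a_predictions(c_before, c_after, p_blackhole_after, worst_recovery_s):
    """Compare Phase 7 measurements against the Patch A predictions.

    c_before / c_after: AP-deauth-6 rate (events per hour) before and after.
    p_blackhole_after:  P(>5 s blackhole | burst) after the patch, 0..1.
    worst_recovery_s:   worst observed recovery time after the patch.
    """
    # With no AP-deauth-6 events in the baseline the ratio is undefined,
    # so report None rather than a misleading pass.
    ratio_ok = None if c_before == 0 else (c_after / c_before) <= 0.2
    return {
        "ap_deauth_ratio_le_0.2": ratio_ok,
        "p_blackhole_le_10pct": p_blackhole_after <= 0.10,
        "recovery_lt_5s": worst_recovery_s < 5.0,
    }


# Hypothetical numbers, for illustration only.
example = check_patch_a_predictions(
    c_before=1.0, c_after=0.1, p_blackhole_after=0.05, worst_recovery_s=3.2
)
```

Returning `None` for an undefined ratio keeps a rep without any AP-deauth-6 events from being read as a confirmation.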
#### API contracts to confirm before writing code

Per project CLAUDE.md "contract before code":

- `ieee80211_connection_loss(vif)`: semantics + caller-context constraints. Header: `net/mac80211.h`. Must be called from process context (work item is fine), must NOT be called from interrupt or with rx-skb lock held.
- `bes2600_vif` / `bes2600_common` struct fields available for the counter: the counter must be safe to update from the `wsm_handle_rx` path.
- cw1200 / cw1260 ancestor: any pre-existing storm-recovery logic? If yes, follow that pattern; if no, this is a clean addition.
- Existing bes2600 work-item plumbing (e.g., `bes2600_chrdev_do_bus_reset` from c5.2): same shape, same allocation rules.

These will be cited in the commit message body and in the patch header comment per Phase 6 rules.

#### Risk

- If `ieee80211_connection_loss` is called too aggressively, normal occasional decrypt fails (e.g., one-off MIC failures on bad RX) could trigger spurious reassocs. Threshold (5 in 5 s) is chosen to be stricter than the steady-state decrypt-fail rate observed (60+/h under load ≈ 1/min, never 5/5 s outside a true storm).
- If the chip's RX path is the actual cause of the storm, the reassoc will hit the same chip-level issue. The patch may move the symptom from "stuck for 109 s" to "rapidly cycling reassocs". That itself would be visible in Phase 7 measurement.

---
### Patch B — Beacon-loss fast-recover (Trigger A) — PARKED
Not part of this Phase 4. Locked behind one more diagnostic rep:

- Add to the snap loop (rig is live): track wlan0 station-dump `beacon loss` counter at 10-second cadence (currently 60 s). The aim is to see the per-tick increase before `api_connection_loss` fires.
- Goal: distinguish "chip silently drops beacons" from "real beacon loss in the air" before committing to a host-side patch.
- Requires no new instrumentation install, just a snap-loop cadence change. Estimate: two reps at this finer cadence will make the picture clean enough.
- Once that data lands, Patch B becomes its own Phase 4 plan + Phase 5 review.

---
## Receipt checklist for Phase 4

- [x] What will and will not be touched: stated above
- [x] Predicted delta in Phase 3 units: stated above (a/c rate predictions, conditional-probability prediction, recovery-time prediction)
- [x] Out-of-scope items explicitly listed
- [x] Risk items explicitly listed

---
## Asks of the reviewer

1. Is the threshold (≥5 decrypt-fails in 5 s) the right shape? Should it be more conservative (≥10 in 10 s)? More aggressive (≥3 in 3 s)? The 12 observed bursts ranged from 4 to 9 events per 60 s window (the looser Phase 1 definition). A burst at the 4-event end of that range cannot reach a ≥5 or ≥10 count, so the three choices will not all fire on the same bursts; pick the one most defensible against false positives.
2. Is `ieee80211_connection_loss(vif)` the right kernel API? Alternative: `cfg80211_disconnected` with a reason code. Which is cleaner per mac80211 contract for a host-driven preemptive reassoc?
3. Should Patch A include a debugfs counter exposing how many storms it has caught, so Phase 7 verification has a host-side counter rather than relying on journal grep alone?
4. Is Patch B parked correctly, or should it be folded into this same Phase 4?
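To make the first ask concrete, the three candidate threshold shapes can be replayed against the receipts. The sketch below is a rough aid only: the plan records event counts and spans, not per-event timestamps, so the events are ASSUMED to be evenly spaced, which real bursts probably are not. The 4-events-in-60-s case is a hypothetical low-end burst taken from the range quoted in ask 1.

```python
def fires(timestamps, threshold, window_s):
    """True if any `threshold` consecutive events fit inside `window_s` seconds."""
    ts = sorted(timestamps)
    return any(
        ts[i + threshold - 1] - ts[i] <= window_s
        for i in range(len(ts) - threshold + 1)
    )


def evenly_spaced(count, span_s):
    """ASSUMPTION: `count` events spread evenly over `span_s` seconds."""
    step = span_s / (count - 1)
    return [k * step for k in range(count)]


SHAPES = {"3 in 3 s": (3, 3.0), "5 in 5 s": (5, 5.0), "10 in 10 s": (10, 10.0)}

RECEIPTS = {
    "07:13 (77 events / 24 s)": evenly_spaced(77, 24.0),
    "11:03 (8 events / 9 s)": evenly_spaced(8, 9.0),
    "low-end burst (4 events / 60 s)": evenly_spaced(4, 60.0),
}


def evaluate():
    """Map each receipt to {shape: fired?} under the even-spacing assumption."""
    return {
        receipt: {shape: fires(ts, n, w) for shape, (n, w) in SHAPES.items()}
        for receipt, ts in RECEIPTS.items()
    }
```

Under this assumption the shapes disagree on the smaller receipts, so the choice matters. Replaying the real journal timestamps through `fires` would settle it; the synthetic spacing here should not be used to draw conclusions.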
@@ -0,0 +1,153 @@
# BES2600 WiFi-stability campaign — Phase 4 plan (Patch B / Trigger A)
Date assembled: 2026-05-07
Run dir: /root/bes2600-samples/run-20260506-2113-patchA/ on ohm
Module: bes2600.ko srcversion 21BD07B3782B144D478CE43 (c-stack + Patch A merged)

This is the Phase 4 plan for **Patch B (Trigger A: beacon-loss / mac80211 `api_connection_loss` chain)**, drafted after the Phase 7 verification of Patch A on 2026-05-07. Per project CLAUDE.md: paste verbatim, do not curate.

---
## What changed since the merged Patch-A plan (`notes/phase4-2026-05-06.md`)

Patch A is **landed (PR #1)** and **active on ohm** (srcversion `21BD07B3`). Phase 7 verification:
```
duration: 10h30m sustained 1 MB/s load on 2.4GHz (5b:32)
DecryptStormRecoveries: 0
Decrypt-fails total: 183 (~1 every 3.5 min — never bursted ≥5/5s)
api_connection_loss events: 9 ← Trigger A
unprotected deauth (AP): 7 ← AP-deauth-6 cluster at 02:42:11
mac80211 reason 4 deauth: yes ← inactivity (Trigger A flavor)
mac80211 reason 2 deauth: no ← what Patch A handles
```
**Patch A's predicted delta is unobserved** (no decrypt-storm fired during Phase 7). Patch A is dormant but caused no harm. This Phase 4 pivots to **Trigger A**, the dominant failure path in the overnight rep.

---
## Phase 1 — revised metric (Trigger-A scope)
> Per hour of operation: count `mac80211 api_connection_loss` events and the conditional probability that each escalates into a > 5 s reauth blackhole (assoc-comeback timeouts followed by AP unprotected-deauth-6 cluster).

Observed rate from the Phase 7 rep: 9 events over 10h30m = **0.86/h** under sustained load. Not all of them escalated to a visible blackhole; some apparently recovered cleanly. But the 02:42 cluster (1/9 = 11 % escalation rate in this rep) shows the catastrophic shape.
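The arithmetic behind the two figures above, written out as a short check. The Wilson score interval at the end is an addition made here, not part of the plan; it is included only to show how loosely a single escalation in nine events pins down the true escalation rate.

```python
import math
from datetime import timedelta


def events_per_hour(events, duration):
    """Event count normalised to one hour."""
    return events / (duration.total_seconds() / 3600.0)


def wilson_interval(successes, n, z=1.96):
    """Approximate 95 % Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1.0 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


rate = events_per_hour(9, timedelta(hours=10, minutes=30))  # rounds to 0.86
escalation = 1 / 9                                          # rounds to 11 %
low, high = wilson_interval(1, 9)
```

With N = 9 the interval is wide, so the 11 % figure is best read as a single-rep observation rather than a baseline.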

---
## Phase 0 / 3 receipt — the 02:42 chain (verbatim from iw-event)
```
02:40:32 scan started, full-band
02:40:34 scan aborted
02:40:45 del station 5b:32
02:40:45 kernel: deauth 8a:2e:77:1f:ec:05 → 5b:32 reason 4
         (Disassociated due to inactivity) ← TRIGGER A
02:40:45 cfg80211: disconnected (local request) reason 4
02:40:45 scan started → finished: 2462 2412, "newton"
02:40:45 new station 61:b0
02:40:45 AP→ohm: auth status 0: Successful
02:40:45 AP→ohm: assoc comeback bssid 61:b0 timeout 1000 ← BSS load mgmt
02:40:47 del station 61:b0
02:40:47 assoc: timed out
02:40:47 scan → new station cc:ce:1e:2b:74:17
02:40:48 auth: timed out
02:40:49 scan → new station 5b:32 (back to where we started)
02:40:49 AP→ohm: auth status 0
02:40:49 AP→ohm: assoc comeback bssid 5b:32 timeout 881
02:40:51 assoc: timed out
02:42:11 ── AP-deauth-6 cluster (×7 within 1 ms) from 61:b0 ──
         reason 6: Class 2 frame received from non-authenticated station
02:42:11 reason 9 = STA_REQ_ASSOC_WITHOUT_AUTH
```
**Net**: ~86 s in reauth-blackhole, recovery via cross-channel fallback. Same shape as Phase 3's 11:03 blackhole (~109 s), but trigger here is **inactivity-deauth → assoc-comeback rejection**, not decrypt-storm.
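The ~86 s figure follows from the first and last timestamps of the chain. A small sketch of that calculation, assuming both timestamps fall on the same day (the helper name is illustrative):

```python
from datetime import datetime

FMT = "%H:%M:%S"


def elapsed_s(start, end):
    """Seconds between two same-day HH:MM:SS timestamps from the iw-event log."""
    return (datetime.strptime(end, FMT) - datetime.strptime(start, FMT)).total_seconds()


# Reason-4 deauth (Trigger A) to the AP-deauth-6 cluster.
blackhole_s = elapsed_s("02:40:45", "02:42:11")

# First assoc-comeback notice to the matching "assoc: timed out".
first_comeback_gap_s = elapsed_s("02:40:45", "02:40:47")
```

The same helper can be applied to each step in the chain to see where the time goes; most of the interval lies between the last `assoc: timed out` at 02:40:51 and the cluster at 02:42:11.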

---
## Hypothesis on the mechanism

Three plausible chains for why reauth blackholes after `api_connection_loss`:

1. **AP's assoc-comeback timer disrespected.** The AP says "wait 1000 TU before retrying", but mac80211 / wpa_supplicant retries fast. AP keeps deferring; eventually a stale frame triggers the AP's "Class 2 from unauth" reaction.

2. **Driver state stale across deauth.** After `ieee80211_connection_loss` is called (the entry point that emits the `api_connection_loss` trace event), the bes2600 driver's per-vif state (link_id, key state, queues) may not be fully scrubbed. Subsequent reassoc starts with mixed state; AP rejects.

3. **Chip-level RX wedge.** The chip's RX state machine got stuck during the inactivity period; reauth sends out frames, but RX of AP's responses is lossy. mac80211 perceives a timeout when frames actually arrived but were dropped. Hard to verify without monitor mode (which the chip doesn't support concurrently with managed mode).

Each hypothesis suggests a different fix surface.
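For hypothesis 1 it helps to have the comeback timeouts in seconds. One 802.11 time unit (TU) is 1024 µs, so the two timeouts in the receipt are close to one second each. The helper names below are illustrative, and the retry delay passed in the usage line is a hypothetical value, not a measurement.

```python
TU_MICROSECONDS = 1024  # one IEEE 802.11 time unit


def tu_to_seconds(tu):
    """Convert a timeout expressed in TUs to seconds."""
    return tu * TU_MICROSECONDS / 1_000_000


def retried_early(comeback_tu, retry_delay_s):
    """True if a retry was sent before the AP's comeback timeout expired."""
    return retry_delay_s < tu_to_seconds(comeback_tu)


first_timeout_s = tu_to_seconds(1000)   # comeback from 61:b0
second_timeout_s = tu_to_seconds(881)   # comeback from 5b:32

# Hypothetical: a retry 0.2 s after the comeback notice would be too early.
too_early = retried_early(1000, 0.2)
```

The iw-event timestamps in the receipt have one-second resolution, so they cannot by themselves show whether a retry landed inside a roughly 1 s comeback window; finer-grained timing would be needed to confirm or rule out this hypothesis.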

---
## Phase 4 — Plan candidates
### Candidate B-1 — Chip-level reset on api_connection_loss flood
**What it touches:**
- New work-item on `bes2600_common`: `api_connection_loss_recover_work`.
- mac80211 → driver `disconnect()` op → bump a sliding-window counter.
- On threshold (e.g., 3 events within 60 s): schedule the work that calls `bes2600_chrdev_do_bus_reset()` (the existing c5.2 LMAC-wedge recovery path).
- After bus reset, NM auto-reconnects from a fresh chip state.

**Why this candidate:** reuses the c5.2 infrastructure already deployed; small surface; if hypothesis 3 (chip wedge) is right, this fixes the root cause. If hypothesis 1 or 2 is right, this is overkill but harmless (a brief reset).

**API contracts to confirm:**
- `bes2600_chrdev_do_bus_reset()` re-entrancy and worker-context safety
- mac80211 ops or callbacks around `ieee80211_connection_loss`/`disconnect`
- cw1200/cw1260 ancestor for any similar pattern

**Predicted delta (Phase 7 units):**
- `api_connection_loss` rate: unchanged (we don't address the trigger)
- conditional escalation to >5 s blackhole: target ≤ 30 % (placeholder; a realistic target is still needed, given the 1/9 = 11 % observed in the Phase 7 rep)
- worst-case recovery: 86 s → < 10 s
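A quick way to sanity-check the B-1 threshold is to replay `api_connection_loss` timestamps and see when a bus reset would be scheduled. The function below is a model only: its name is made up, clearing the window after a reset is an assumption, and the evenly spaced replay is synthetic because the plan gives the event count (9 in 10h30m) but not the individual times.

```python
def bus_reset_times(loss_times, threshold=3, window_s=60.0):
    """Return the timestamps at which a bus reset would be scheduled."""
    recent = []
    resets = []
    for t in sorted(loss_times):
        # Keep only events still inside the sliding window, then add this one.
        recent = [s for s in recent if t - s <= window_s]
        recent.append(t)
        if len(recent) >= threshold:
            resets.append(t)
            recent = []  # ASSUMPTION: start counting afresh after a reset
    return resets


# Synthetic replay: 9 events spread evenly over 10h30m (37 800 s).
evenly_spread = [k * 4200.0 for k in range(9)]
resets_for_even_spread = bus_reset_times(evenly_spread)

# A flood of three events in 20 s schedules exactly one reset.
resets_for_flood = bus_reset_times([0.0, 10.0, 20.0])
```

If the nine Phase 7 events were spread out rather than clustered, a 3-in-60-s threshold would not have fired at all, so it is worth checking the real event times against the threshold before locking it.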
### Candidate B-2 — Respect assoc-comeback timer
**What it touches:**
- Possibly NOT in bes2600: this looks like a mac80211 / wpa_supplicant concern.
- If the driver does anything assoc-related itself, audit for racing the comeback timer.

**Status:** out of scope for a bes2600 patch unless the driver is observed sending frames during the comeback window.
### Candidate B-3 — Audit and scrub vif state on disconnect
**What it touches:**
- `bes2600_unjoin_work`: verify link_id, keys, queues all reset
- Add explicit reset on `ieee80211_disconnect`/`disconnect` ops

**Status:** speculative without further instrumentation.

---
## Lock proposal

**Lock Candidate B-1 first.** It has:
- the cleanest re-use (c5.2's bus_reset)
- the smallest patch surface
- a measurable predicted delta on escalation probability and worst-case recovery (the `api_connection_loss` rate itself is predicted unchanged)

If Phase 7-of-B-1 shows the rate unchanged but escalation still high → loop back to B-2/B-3 hypothesis space.

---
## What will NOT be touched

- mac80211 / cfg80211 core: host-side STA driver only
- The c-stack patches (c5.x, c6.x, c7): independent recovery paths
- Patch A: stays in place, untouched
- AP / firmware

---
## Receipt checklist for Phase 4

- [x] What will and will not be touched: stated above
- [x] Predicted delta in Phase 3 units: stated for Candidate B-1
- [x] Out-of-scope items explicitly listed
- [x] Risk items: bus_reset has a known multi-function-SDIO consideration (handled by c5.2.1)

## Asks of the reviewer

1. Is Candidate B-1 (bus_reset on api_connection_loss flood) the right scope, or should we instrument deeper before committing?
2. Threshold (3 events / 60 s): too aggressive (false-positive bus_resets on transient RF issues) or about right?
3. Should bus_reset be conditional on ALSO seeing post-deauth assoc-comeback timeouts, to avoid resetting on benign connection_loss events?
4. Hypothesis 1 (assoc-comeback disrespected): is this a mac80211/wpa_supplicant bug rather than a bes2600 bug? If yes, we file it elsewhere.