2 Commits

Author SHA1 Message Date
test0r f6a25d811f notes: phase 4 plan artifact for BES2600 wifi-stability campaign
Drafts Patch A (decrypt-storm fast-recover, Trigger B) at txrx.c:1696
with sliding-window threshold + ieee80211_connection_loss reassoc.
Patch B (beacon-loss / Trigger A) parked behind one more diagnostic
rep with 10s snap-loop cadence on the beacon-loss counter.

Folds reviewer feedback from PR #3 + the new Trigger-A finding
(post-resume P1 = api_connection_loss-driven, two reps captured today
at 17:23 and 18:03) into a revised Phase 1 metric counting three
event classes.

Pending Phase 5 second-model review of the plan before Phase 6
implementation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-06 19:10:12 +02:00
marfrit 07a7d4b3af Merge pull request 'Phase 5 review: BES2600 WiFi-stability campaign artifact' (#3) from claude-noether into main
Reviewed-on: #3
2026-05-06 13:37:16 +00:00
+104
@@ -0,0 +1,104 @@
# BES2600 WiFi-stability campaign — Phase 4 plan artifact
Date assembled: 2026-05-06
Run dir: /root/bes2600-samples/run-20260506-0659-fresh/ on ohm
Module under test: bes2600.ko srcversion 461AFB369355AE598D79BDF (c-final + c5.2.1)
This is the Phase 4 plan hand-off. Per project CLAUDE.md: paste verbatim, do not curate.
Drafted after the Phase 5 review of the artifact merged as PR #3 (notes/phase5-2026-05-06.md). Reviewer feedback and a new in-rig finding (Trigger A pinned to mac80211 `api_connection_loss`) are folded into the Phase 1 revision below.
---
## Phase 1 — revised metric (folds review feedback + Trigger-A discovery)
> Per hour of operation, count three event classes:
>
> (a) `WSM_STATUS_DECRYPTFAILURE` bursts (≥4 events / 60 s)
> (b) mac80211 `api_connection_loss` events
> (c) AP-side unprotected-deauth-reason-6 frames
>
> For each, also report the conditional probability that the event escalates to a recovery blackhole > 5 s.
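The metric above can be sketched as an offline calculation over event timestamps. This is a minimal sketch, not the rig's actual tooling: the greedy burst grouping, the `link_window_s` rule that ties a blackhole to a preceding event, and all function names are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Optional

BURST_MIN_EVENTS = 4      # class (a): >= 4 events ...
BURST_WINDOW_S = 60.0     # ... within 60 s
BLACKHOLE_S = 5.0         # escalation cutoff: recovery blackhole > 5 s


def group_bursts(timestamps, min_events=BURST_MIN_EVENTS, window_s=BURST_WINDOW_S):
    """Return start times of class (a) bursts: >= min_events within window_s.

    Assumption: bursts are consumed greedily, so every event within window_s
    of a burst's first event belongs to that burst.
    """
    ts = sorted(timestamps)
    bursts, i = [], 0
    while i < len(ts):
        j = i
        while j < len(ts) and ts[j] - ts[i] <= window_s:
            j += 1
        if j - i >= min_events:
            bursts.append(ts[i])
            i = j          # skip the events this burst consumed
        else:
            i += 1
    return bursts


@dataclass
class ClassStats:
    per_hour: float
    p_blackhole: Optional[float]   # None when the class has no events


def class_stats(event_times, blackholes, hours, link_window_s=60.0):
    """Rate per hour plus P(blackhole > 5 s | event) for one event class.

    blackholes: list of (start_time_s, duration_s) recovery gaps.
    Hypothetical linking rule: an event escalates if a blackhole longer than
    BLACKHOLE_S starts within link_window_s after it.
    """
    escalated = sum(
        any(0 <= start - t <= link_window_s and dur > BLACKHOLE_S
            for start, dur in blackholes)
        for t in event_times
    )
    n = len(event_times)
    return ClassStats(n / hours, escalated / n if n else None)
```

The same `class_stats` call would be made once per class: burst start times for (a), `api_connection_loss` times for (b), and AP deauth-6 times for (c).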
Reviewer feedback applied:
- "Also count AP-deauth" → class (c) added
- "N=1 idle is fine" → no further idle reps required to lock
- AP-side capture not needed → confirmed; no Fritz!Box logging required
New finding since the merged Phase 5 artifact (today, ftrace-instrumented suspend/resume reps):
- **Trigger B (decrypt storm path)**: `decrypt-fail × N → AP unprotected-deauth-6 → kernel local reason-2 deauth`. Receipts: 07:13 (77 events / 24 s), 11:03 (8 events / 9 s), 12:00–13:28 load run (12 bursts).
- **Trigger A (beacon-loss path)**: `mac80211 api_connection_loss → kernel local reason-2 deauth`. Receipts: 17:23 and 18:03 today, both following ftrace `api_connection_loss`; yesterday 22:33 (presumed same path; not instrumented at the time).
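The two paths above differ only in what precedes the local reason-2 deauth, so receipts can be labeled mechanically. A minimal sketch, assuming the journal and ftrace lines have already been reduced to `(time_s, kind)` tuples; the kind labels and the 30 s lookback are hypothetical, not taken from the rig.

```python
def classify_deauths(events, lookback_s=30.0):
    """Label each local reason-2 deauth as Trigger 'A', 'B', or 'unknown'.

    events: iterable of (time_s, kind) with hypothetical kind labels:
      'decrypt_fail', 'ap_deauth_6', 'api_connection_loss', 'local_deauth_2'.
    """
    events = sorted(events)
    labels = []
    for t, kind in events:
        if kind != "local_deauth_2":
            continue
        # Everything seen in the lookback window before this deauth.
        window = [(u, k) for (u, k) in events if t - lookback_s <= u < t]
        ap_times = [u for u, k in window if k == "ap_deauth_6"]
        # Trigger B needs decrypt failures at or before the last AP deauth-6.
        storm_first = bool(ap_times) and any(
            k == "decrypt_fail" and u <= ap_times[-1] for u, k in window
        )
        if storm_first:
            labels.append((t, "B"))
        elif any(k == "api_connection_loss" for _, k in window):
            labels.append((t, "A"))
        else:
            labels.append((t, "unknown"))
    return labels
```

An "unknown" label corresponds to receipts like yesterday's 22:33 event, where the preceding path was not instrumented.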
---
## Phase 4 — Plan
### Patch A — Decrypt-storm fast-recover (Trigger B)
#### What will be touched
- `bes2600/txrx.c`, at the `bes_warn` for `WSM_STATUS_DECRYPTFAILURE / goto drop` site (currently line 1696).
- A new sliding-window counter on `bes2600_common` (or equivalent struct) tracking decrypt-fail timestamps.
- On threshold (proposed: ≥5 within 5 s), `schedule_work(&priv->reassoc_work)`; the work handler calls `ieee80211_connection_loss(vif)` so mac80211 enters its clean-reassoc path.
- A small struct field for the counter, plus init in probe and reset on assoc.
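The sliding-window logic described above can be modeled offline before any driver code is written. This is a sketch in Python rather than kernel C, so it says nothing about locking or context; the class name, the reset-after-trigger behavior, and the inclusive `<=` comparison are assumptions.

```python
from collections import deque


class DecryptStormDetector:
    """Offline model of a sliding-window decrypt-fail counter."""

    def __init__(self, threshold=5, window_s=5.0):
        self.threshold = threshold
        self.window_s = window_s
        # Bounded like a fixed-size ring buffer that a driver struct could hold.
        self._stamps = deque(maxlen=threshold)

    def reset(self):
        """Clear history, mirroring the planned reset on association."""
        self._stamps.clear()

    def on_decrypt_fail(self, now_s):
        """Record one failure; return True when the threshold is crossed."""
        self._stamps.append(now_s)
        if len(self._stamps) < self.threshold:
            return False
        # The deque holds the last `threshold` stamps; the oldest is at index 0.
        if now_s - self._stamps[0] <= self.window_s:
            self.reset()   # one trigger per storm; caller would schedule the work
            return True
        return False
```

Keeping only the last `threshold` timestamps is enough: the threshold is met exactly when the newest and the oldest of those stamps are no more than `window_s` apart.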
#### What will NOT be touched
- `bes2600/bes2600_sdio.c` bus-level paths (no SDIO change).
- Any of the c5.x or c7 stacks (PM, scan defer, LMAC monitor, multi-func reset).
- Firmware. The fix is host-side recovery, not chip- or AP-side.
- mac80211 / cfg80211 core. Only `ieee80211_connection_loss` is called; no kernel API addition.
#### Predicted delta on Phase 1 metric (same units as Phase 3 receipts)
- Decrypt-burst rate **(a)**: UNCHANGED. We don't address the root cause of why decrypts fail; we only catch the storm earlier.
- AP-deauth-6 rate **(c)**: DECREASES toward zero, because we pre-empt the AP by initiating a clean reassoc before the AP fires its unprotected deauth. Predicted: c_after / c_before ≤ 0.2.
- Conditional probability of >5 s blackhole given a burst: DECREASES from current 100 % (idle baseline, N=1) toward ≤ 10 %. Recovery time falls from 109 s (worst observed) to <5 s.
Dimensions match Phase 3's idle/load comparison table; numbers are predictions, to be verified in Phase 7.
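For Phase 7, the three predictions can be written down as a pass/fail check so the comparison is mechanical. A minimal sketch; the function and key names are hypothetical, and the limits are the ones stated in the list above.

```python
def check_patch_a_predictions(c_before, c_after, p_blackhole_after, worst_recovery_s):
    """Compare Phase 7 measurements with the Patch A predictions."""
    return {
        # (c) AP-deauth-6 rate: predicted c_after / c_before <= 0.2
        "ap_deauth_ratio_ok": c_before > 0 and c_after / c_before <= 0.2,
        # P(blackhole > 5 s | burst): predicted <= 10 %
        "blackhole_prob_ok": p_blackhole_after <= 0.10,
        # Recovery time: predicted < 5 s
        "recovery_time_ok": worst_recovery_s < 5.0,
    }
```

The decrypt-burst rate (a) is deliberately absent from the check because it is predicted to stay unchanged.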
#### API contracts to confirm before writing code
Per project CLAUDE.md "contract before code":
- `ieee80211_connection_loss(vif)`: semantics and caller-context constraints. Header: `net/mac80211.h`. Working assumption, to be confirmed against the header's kernel-doc: call it from process context (a work item is fine), and NOT from interrupt context or with the rx-skb lock held.
- `bes2600_vif` / `bes2600_common` struct fields available for the counter — counter must be safe to update from the `wsm_handle_rx` path.
- cw1200 / cw1260 ancestor: any pre-existing storm-recovery logic? If yes, follow that pattern; if no, this is a clean addition.
- Existing bes2600 work-item plumbing (e.g., `bes2600_chrdev_do_bus_reset` from c5.2) — same shape, same allocation rules.
These will be cited in the commit message body and in the patch header comment per Phase 6 rules.
#### Risk
- If `ieee80211_connection_loss` is called too aggressively, normal occasional decrypt fails (e.g., one-off MIC failures on bad RX) could trigger spurious reassocs. The threshold (5 in 5 s) is chosen to sit well above the steady-state decrypt-fail rate observed (60+/h under load ≈ 1/min, never 5 within 5 s outside a true storm).
- If the chip's RX path is the actual cause of the storm, the reassoc will hit the same chip-level issue. The patch may move the symptom from "stuck for 109 s" to "rapidly cycling reassocs". That itself would be visible in Phase 7 measurement.
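A rough number for the false-positive risk in the first bullet: if steady-state decrypt fails arrived independently at 60/h (a Poisson assumption that is not from the receipts, and likely optimistic because real failures cluster), the chance of seeing 5 or more in one given 5 s window can be computed directly.

```python
import math


def prob_at_least(k, rate_per_hour, window_s):
    """P(N >= k) for a Poisson count in one window of window_s seconds."""
    lam = rate_per_hour / 3600.0 * window_s   # expected events per window
    below = sum(math.exp(-lam) * lam**i / math.factorial(i) for i in range(k))
    return 1.0 - below


# Steady-state rate from the plan: about 60 decrypt fails per hour under load.
p_window = prob_at_least(5, rate_per_hour=60.0, window_s=5.0)
```

Under that assumption `p_window` comes out on the order of 1e-8 per window, which supports the threshold choice only to the extent that steady-state failures really are independent.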
---
### Patch B — Beacon-loss fast-recover (Trigger A) — PARKED
Not part of this Phase 4. Locked behind one more diagnostic rep:
- Add to the snap loop (rig is live): track wlan0 station-dump `beacon loss` counter at 10-second cadence (currently 60 s). Want to see the per-tick increase before `api_connection_loss` fires.
- Goal: distinguish "chip silently drops beacons" from "real beacon loss in the air" before committing to a host-side patch.
- Requires no new instrumentation install — just a snap-loop cadence change. Estimate two reps with this finer cadence will make the picture clean enough.
- Once that data lands, Patch B becomes its own Phase 4 plan + Phase 5 review.
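The per-tick increase can be derived from the snap-loop samples with a few lines. A minimal sketch, assuming each snapshot stores the text of `iw dev wlan0 station dump` and that the counter appears on a line of the form `beacon loss: N`; the function names and the counter-reset handling are assumptions.

```python
import re

# Assumption: the station dump prints a line like "beacon loss:  3".
_BEACON_LOSS_RE = re.compile(r"beacon loss:\s*(\d+)")


def parse_beacon_loss(station_dump_text):
    """Extract the beacon-loss counter, or None if the line is missing."""
    m = _BEACON_LOSS_RE.search(station_dump_text)
    return int(m.group(1)) if m else None


def per_tick_increase(samples):
    """samples: list of (time_s, counter). Returns (time_s, delta) per tick.

    A counter that goes backwards (for example after a reassociation) is
    treated as a reset, and the new value is reported as the delta.
    """
    out = []
    for (_, prev), (t, cur) in zip(samples, samples[1:]):
        out.append((t, cur - prev if cur >= prev else cur))
    return out
```

At 10 s cadence, a run of non-zero deltas in the ticks just before `api_connection_loss` is the signal the diagnostic rep is meant to capture.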
---
## Receipt checklist for Phase 4
- [x] What will and will not be touched: stated above
- [x] Predicted delta in Phase 3 units: stated above (a/c rate predictions, conditional-probability prediction, recovery-time prediction)
- [x] Out-of-scope items explicitly listed
- [x] Risk items explicitly listed
---
## Asks of the reviewer
1. Is the threshold (≥5 decrypt-fails in 5 s) the right shape? Should it be more conservative (≥10 in 10 s)? More aggressive (≥3 in 3 s)? The 12 observed bursts ranged from 4 to 9 events per 60 s window (the Phase 1 looser definition). The patch threshold will fire on the same bursts under any of those choices; pick the one most defensible against false positives.
2. Is `ieee80211_connection_loss(vif)` the right kernel API? Alternative: `cfg80211_disconnected` with a reason code. Which is cleaner per mac80211 contract for a host-driven preemptive reassoc?
3. Should Patch A include a debugfs counter exposing how many storms it has caught, so Phase 7 verification has a host-side counter rather than relying on journal grep alone?
4. Patch B parked correctly, or fold it into this same Phase 4?
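For ask 1, the candidate thresholds can be compared by replaying each burst's per-event timestamps. A minimal sketch with hypothetical function names; the candidate list is the three shapes named in the ask. Whether all 12 bursts trip every candidate depends on those timestamps: a burst containing only 4 events cannot reach a ≥5 threshold, whatever the window.

```python
def max_events_in_window(timestamps, window_s):
    """Largest number of events falling inside any window of window_s seconds."""
    ts = sorted(timestamps)
    best, lo = 0, 0
    for hi, t in enumerate(ts):
        # Advance the left edge until the window [ts[lo], t] fits in window_s.
        while t - ts[lo] > window_s:
            lo += 1
        best = max(best, hi - lo + 1)
    return best


CANDIDATES = [(3, 3.0), (5, 5.0), (10, 10.0)]   # (min events, window seconds)


def thresholds_that_fire(timestamps, candidates=CANDIDATES):
    """Return the candidates that a burst with these timestamps would trip."""
    return [(n, w) for n, w in candidates
            if max_events_in_window(timestamps, w) >= n]
```

Running `thresholds_that_fire` over each of the 12 load-run bursts and over the steady-state background gives a per-candidate hit and false-positive count to put in front of the reviewer.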