besser/notes/phase4-2026-05-06.md
test0r f6a25d811f notes: phase 4 plan artifact for BES2600 wifi-stability campaign
Drafts Patch A (decrypt-storm fast-recover, Trigger B) at txrx.c:1696
with sliding-window threshold + ieee80211_connection_loss reassoc.
Patch B (beacon-loss / Trigger A) parked behind one more diagnostic
rep with 10s snap-loop cadence on the beacon-loss counter.

Folds reviewer feedback from PR #3 + the new Trigger-A finding
(post-resume P1 = api_connection_loss-driven, two reps captured today
at 17:23 and 18:03) into a revised Phase 1 metric counting three
event classes.

Pending Phase 5 second-model review of the plan before Phase 6
implementation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-06 19:10:12 +02:00

BES2600 WiFi-stability campaign — Phase 4 plan artifact

Date assembled: 2026-05-06
Run dir: /root/bes2600-samples/run-20260506-0659-fresh/ on ohm
Module under test: bes2600.ko srcversion 461AFB369355AE598D79BDF (c-final + c5.2.1)

This is the Phase 4 plan hand-off. Per project CLAUDE.md: paste verbatim, do not curate.

Drafted after the Phase 5 review of the artifact merged as PR #3 (notes/phase5-2026-05-06.md). Reviewer feedback and a new in-rig finding (Trigger A pinned to mac80211 api_connection_loss) are folded into the Phase 1 revision below.


Phase 1 — revised metric (folds review feedback + Trigger-A discovery)

Per hour of operation, count three event classes:

(a) WSM_STATUS_DECRYPTFAILURE bursts (≥4 events / 60 s)
(b) mac80211 api_connection_loss events
(c) AP-side unprotected-deauth-reason-6 frames

For each, also report the conditional probability that the event escalates to a recovery blackhole > 5 s.
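
The burst definition and the conditional probability can be sketched in user-space C as below. This is a minimal illustration, not the campaign's actual tooling: the function names are hypothetical, event times are assumed to be sorted milliseconds, and the way a burst is segmented (open on 4 events within 60 s, then absorb later events until a gap exceeds 60 s) is an assumption, since the plan only states the ≥4 / 60 s criterion.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BURST_MIN_EVENTS 4      /* Phase 1: at least 4 events ... */
#define BURST_WINDOW_MS  60000  /* ... within 60 s */

/*
 * Count bursts in a sorted list of event times (ms).
 * ASSUMED segmentation: a burst opens when BURST_MIN_EVENTS events fit in
 * BURST_WINDOW_MS, then absorbs later events until a gap exceeds the window.
 */
static size_t count_bursts(const uint64_t *t_ms, size_t n)
{
    size_t bursts = 0, i = 0;

    assert(t_ms != NULL || n == 0);
    while (i + BURST_MIN_EVENTS <= n) {
        if (t_ms[i + BURST_MIN_EVENTS - 1] - t_ms[i] > BURST_WINDOW_MS) {
            i++;                /* no burst starts at this event */
            continue;
        }
        bursts++;
        i += BURST_MIN_EVENTS;
        while (i < n && t_ms[i] - t_ms[i - 1] <= BURST_WINDOW_MS)
            i++;                /* still part of the same burst */
    }
    return bursts;
}

/* Events per hour of operation; 0.0 if no time was observed. */
static double rate_per_hour(size_t events, double hours)
{
    return hours > 0.0 ? (double)events / hours : 0.0;
}

/* P(recovery blackhole > 5 s | event); -1.0 when no events were seen. */
static double p_blackhole(size_t escalated, size_t events)
{
    return events ? (double)escalated / (double)events : -1.0;
}
```

The same three helpers would apply to each event class; classes (b) and (c) are single events, so only class (a) needs count_bursts.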

Reviewer feedback applied:

  • "Also count AP-deauth" → class (c) added
  • "N=1 idle is fine" → no further idle reps required to lock
  • AP-side capture not needed → confirmed; no Fritz!Box logging required

New finding since the merged Phase 5 artifact (today, ftrace-instrumented suspend/resume reps):

  • Trigger B (decrypt storm path): decrypt-fail × N → AP unprotected-deauth-6 → kernel local reason-2 deauth. Receipts: 07:13 (77 events / 24 s), 11:03 (8 events / 9 s), 12:00-13:28 load run (12 bursts).
  • Trigger A (beacon-loss path): mac80211 api_connection_loss → kernel local reason-2 deauth. Receipts: 17:23 and 18:03 today, both following ftrace api_connection_loss; yesterday 22:33 (presumed same path; not instrumented at the time).

Phase 4 — Plan

Patch A — Decrypt-storm fast-recover (Trigger B)

What will be touched

  • bes2600/txrx.c, at the bes_warn for WSM_STATUS_DECRYPTFAILURE / goto drop site (currently line 1696).
  • A new sliding-window counter on bes2600_common (or equivalent struct) tracking decrypt-fail timestamps.
  • On threshold (proposed: ≥5 within 5 s), schedule_work(&priv->reassoc_work) that calls ieee80211_connection_loss(vif) so mac80211 enters its clean-reassoc path.
  • A small struct field for the counter, plus init in probe and reset on assoc.

What will NOT be touched

  • bes2600/bes2600_sdio.c bus-level paths (no SDIO change).
  • Any of the c5.x or c7 stacks (PM, scan defer, LMAC monitor, multi-func reset).
  • Firmware. The fix is host-side recovery, not chip- or AP-side.
  • mac80211 / cfg80211 core. Only ieee80211_connection_loss is called; no kernel API addition.

Predicted delta on Phase 1 metric (same units as Phase 3 receipts)

  • Decrypt-burst rate (a): UNCHANGED. We don't address the root cause of why decrypts fail; we only catch the storm earlier.
  • AP-deauth-6 rate (c): DECREASES toward zero, because we pre-empt the AP by initiating a clean reassoc before the AP fires its unprotected deauth. Predicted: c_after / c_before ≤ 0.2.
  • Conditional probability of >5 s blackhole given a burst: DECREASES from current 100 % (idle baseline, N=1) toward ≤ 10 %. Recovery time falls from 109 s (worst observed) to <5 s.

Dimensions match Phase 3's idle/load comparison table; numbers are predictions, to be verified in Phase 7.

API contracts to confirm before writing code

Per project CLAUDE.md "contract before code":

  • ieee80211_connection_loss(vif): semantics + caller-context constraints. Header: net/mac80211.h. Must be called from process context (work item is fine), must NOT be called from interrupt or with rx-skb lock held.
  • bes2600_vif / bes2600_common struct fields available for the counter — counter must be safe to update from the wsm_handle_rx path.
  • cw1200 / cw1260 ancestor: any pre-existing storm-recovery logic? If yes, follow that pattern; if no, this is a clean addition.
  • Existing bes2600 work-item plumbing (e.g., bes2600_chrdev_do_bus_reset from c5.2) — same shape, same allocation rules.

These will be cited in the commit message body and in the patch header comment per Phase 6 rules.

Risk

  • If ieee80211_connection_loss is called too aggressively, normal occasional decrypt fails (e.g., one-off MIC failures on bad RX) could trigger spurious reassocs. Threshold (5 in 5 s) is chosen to be stricter than the steady-state decrypt-fail rate observed (60+/h under load ≈ 1/min, never 5/5 s outside a true storm).
  • If the chip's RX path is the actual cause of the storm, the reassoc will hit the same chip-level issue. The patch may move the symptom from "stuck for 109 s" to "rapidly cycling reassocs". That itself would be visible in Phase 7 measurement.

Patch B — Beacon-loss fast-recover (Trigger A) — PARKED

Not part of this Phase 4. Locked behind one more diagnostic rep:

  • Add to the snap loop (rig is live): track wlan0 station-dump beacon loss counter at 10-second cadence (currently 60 s). Want to see the per-tick increase before api_connection_loss fires.
  • Goal: distinguish "chip silently drops beacons" from "real beacon loss in the air" before committing to a host-side patch.
  • Requires no new instrumentation install, just a snap-loop cadence change. Two reps at this finer cadence should be enough to make the picture clear.
  • Once that data lands, Patch B becomes its own Phase 4 plan + Phase 5 review.
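
The per-tick increase the diagnostic rep is after can be derived from successive snapshots of the counter, roughly as follows. This is a sketch with hypothetical function names; it assumes the snap loop has already extracted the beacon loss value from each station dump, and it treats a drop in the counter as a reset (for example after a reassociation), which is an assumption about how the counter behaves.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Per-tick increase of a counter sampled at a fixed cadence.
 * delta[i] covers the interval between sample i and sample i + 1.
 * ASSUMPTION: a sample lower than its predecessor means the counter was
 * reset, so the new value itself is taken as the increase.
 * Returns the number of deltas written (n - 1, or 0 if n < 2).
 */
static size_t counter_deltas(const uint32_t *samples, size_t n, uint32_t *delta)
{
    if (n < 2)
        return 0;
    assert(samples != NULL && delta != NULL);
    for (size_t i = 1; i < n; i++)
        delta[i - 1] = samples[i] >= samples[i - 1]
                     ? samples[i] - samples[i - 1]
                     : samples[i];
    return n - 1;
}

/* Index of the first tick with a non-zero increase, or n_delta if none. */
static size_t first_rising_tick(const uint32_t *delta, size_t n_delta)
{
    for (size_t i = 0; i < n_delta; i++)
        if (delta[i] != 0)
            return i;
    return n_delta;
}
```

At a 10 s cadence, the index returned by first_rising_tick times 10 s gives the lead time between the first counted beacon loss and the start of the capture, which can then be compared with the ftrace timestamp of api_connection_loss.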

Receipt checklist for Phase 4

  • What will and will not be touched: stated above
  • Predicted delta in Phase 3 units: stated above (a/c rate predictions, conditional-probability prediction, recovery-time prediction)
  • Out-of-scope items explicitly listed
  • Risk items explicitly listed

Asks of the reviewer

  1. Is the threshold (≥5 decrypt-fails in 5 s) the right shape? Should it be more conservative (≥10 in 10 s)? More aggressive (≥3 in 3 s)? The 12 observed bursts ranged from 4 to 9 events per 60 s window (the looser Phase 1 definition), so not every one of them can trip every candidate: a 4-event burst cannot reach ≥5 or ≥10, and a larger burst fires only if its events cluster inside the window. Pick the choice most defensible against false positives that still catches the true storms.
  2. Is ieee80211_connection_loss(vif) the right kernel API? Alternative: cfg80211_disconnected with a reason code. Which is cleaner per mac80211 contract for a host-driven preemptive reassoc?
  3. Should Patch A include a debugfs counter exposing how many storms it has caught, so Phase 7 verification has a host-side counter rather than relying on journal grep alone?
  4. Patch B parked correctly, or fold it into this same Phase 4?
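
To help with question 1, the candidate thresholds can be replayed against a list of event times. The helper below is a sketch with hypothetical names. The plan records only totals for the receipts (for example 77 events / 24 s and 8 events / 9 s), so evenly_spaced() fabricates a regular spacing purely for illustration; the real per-event timestamps from the journal should be used for any decision.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* True if any `need` consecutive events (sorted, ms) span at most window_ms. */
static bool threshold_fires(const uint64_t *t_ms, size_t n,
                            size_t need, uint64_t window_ms)
{
    assert(need > 0);
    for (size_t i = 0; i + need <= n; i++)
        if (t_ms[i + need - 1] - t_ms[i] <= window_ms)
            return true;
    return false;
}

/*
 * Fill t_ms with n times spread evenly over span_ms, first event at 0.
 * ASSUMPTION for illustration only: real bursts are unlikely to be this regular.
 */
static void evenly_spaced(uint64_t *t_ms, size_t n, uint64_t span_ms)
{
    for (size_t i = 0; i < n; i++)
        t_ms[i] = n > 1 ? i * span_ms / (n - 1) : 0;
}
```

Under that even-spacing assumption, 77 events over 24 s trip all three candidates, while 8 events over 9 s (about 1.29 s apart) trip only ≥3 in 3 s: five such events span about 5.1 s, just outside the 5 s window, and there are fewer than 10 events in total. Real bursts are probably clumpier than this, so the result is a prompt to check the journal timestamps rather than a finding.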