bes2600: bus_reset on connection-loss storm to dodge assoc-comeback blackhole #2

Merged
claude-noether merged 1 commit from bes2600/connection-loss-fast-recover into cleanups 2026-05-07 10:48:00 +00:00
Collaborator

Patch B — Phase 6 implementation (Trigger A)

Follows the Phase 4 plan merged at marfrit/besser PR #5 (commit, notes/phase4-2026-05-07.md).

What it does

When 3 driver-side bes2600_connection_loss_work decisions fire within 60 s on the same vif, skip the regular ieee80211_connection_loss(vif) path and trigger a chip-level bes2600_chrdev_do_bus_reset() instead. SDIO removes and re-probes the chip; userspace reassociates from a fresh state, dodging the AP's assoc comeback rejection cycle.
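A minimal userspace sketch of the storm accounting described above, assuming a window that opens at the first loss and restarts once more than 60 s have passed. The struct, field, and function names are hypothetical (the PR text does not show the real ones), and the driver would presumably measure time in jiffies rather than plain seconds.

```c
#include <stdbool.h>
#include <stdint.h>

#define STORM_THRESHOLD 3   /* losses needed to trip (value from the PR) */
#define STORM_WINDOW_S  60  /* window length in seconds (value from the PR) */

/* Hypothetical per-vif bookkeeping; real field names are an assumption. */
struct storm_state {
	uint64_t window_start_s; /* time of the first loss in the current window */
	unsigned int count;      /* losses seen in the current window */
	unsigned int recoveries; /* times the storm path was taken */
};

static void storm_init(struct storm_state *s)
{
	s->window_start_s = 0;
	s->count = 0;
	s->recoveries = 0;
}

/*
 * Account one connection-loss decision at time now_s.
 * Returns true when the caller should take the bus-reset path,
 * false when it should take the regular connection-loss path.
 */
static bool storm_account(struct storm_state *s, uint64_t now_s)
{
	if (s->count == 0 || now_s - s->window_start_s > STORM_WINDOW_S) {
		/* No open window, or the old one expired: start a new one. */
		s->window_start_s = now_s;
		s->count = 1;
	} else {
		s->count++;
	}

	if (s->count < STORM_THRESHOLD)
		return false;

	/* Threshold reached: reset the window and record the recovery. */
	s->count = 0;
	s->recoveries++;
	return true;
}
```

In this sketch a caller would trigger the bus reset when storm_account() returns true and fall through to ieee80211_connection_loss(vif) otherwise.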

Why

  • Phase 7 rep (10h30m sustained 1 MB/s on 2.4 GHz, srcversion 21BD07B3 = c-stack + Patch A) saw 9 api_connection_loss events. One, at 02:42:11, was catastrophic: ~86 s of assoc comeback timeouts plus a cluster of unprotected deauth-reason-6 frames from the AP, recovered only via cross-channel fallback.
  • Same shape as Phase 3's 11:03 blackhole, but the trigger here is mac80211 inactivity-deauth (reason 4), not decrypt-storm — so Patch A doesn't help.
  • Receipts: notes/phase4-2026-05-07.md and the run dir at ohm:/root/bes2600-samples/run-20260506-2113-patchA/.

Reviewer feedback (from marfrit/besser PR #5) folded in

  • B-1 right scope: confirmed.
  • Threshold (3 / 60 s): confirmed.
  • Don't condition on assoc-comeback observation: confirmed ("device is in one position for hours — there should be no benign connection_loss events").
  • Don't file the assoc-comeback-disrespect issue elsewhere: documented in notes/phase4-2026-05-07.md only.

Predicted Phase 7 delta vs unpatched baseline

| metric | observed | predicted under B-1 |
|---|---|---|
| api_connection_loss rate | 0.86/h | unchanged |
| P(>5 s blackhole given event) | 11 % (1/9) | ≤ 30 % |
| worst-case recovery | 86 s | < 10 s |
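
The observed column follows from the run figures quoted under "Why" (9 events, 1 blackhole, 10h30m). A small sketch of that arithmetic, which also compares the observed rate with the event rate needed to trip the 3 / 60 s threshold; the macro and function names are illustrative only.

```c
/* Figures quoted in the PR description. */
#define RUN_HOURS        10.5 /* 10h30m run */
#define LOSS_EVENTS      9    /* api_connection_loss events in the run */
#define BLACKHOLE_EVENTS 1    /* events followed by a >5 s blackhole */
#define STORM_THRESHOLD  3    /* losses needed to trip */
#define STORM_WINDOW_S   60.0 /* window length in seconds */

/* Observed connection-loss rate, in events per hour. */
static double observed_rate_per_hour(void)
{
	return LOSS_EVENTS / RUN_HOURS;
}

/* Observed fraction of loss events that became a blackhole. */
static double observed_blackhole_fraction(void)
{
	return (double)BLACKHOLE_EVENTS / LOSS_EVENTS;
}

/* Sustained event rate, per hour, that corresponds to the threshold. */
static double trip_rate_per_hour(void)
{
	return STORM_THRESHOLD / STORM_WINDOW_S * 3600.0;
}

/* How many times higher the trip rate is than the observed rate. */
static double trip_margin(void)
{
	return trip_rate_per_hour() / observed_rate_per_hour();
}
```

With these inputs the observed rate is about 0.86/h, the blackhole fraction about 11 %, and the trip rate of 180/h sits roughly 210 times above the observed rate.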

Files touched

  • bes2600/bes2600.h: 3 counter fields on struct bes2600_vif, 1 work_struct on struct bes2600_common, 3 prototypes
  • bes2600/sta.c: 3 helpers + storm-account hook in bes2600_connection_loss_work + storm-init in bes2600_vif_setup + cancel_work_sync in the hw_priv shutdown path
  • bes2600/main.c: INIT_WORK alongside other hw_priv work_structs
  • bes2600/debug.c: ConnectionLossStormRecoveries seq_printf

Verification status

  • checkpatch.pl --no-tree --strict: clean (0/0/0).
  • Built and applied on ohm via the c-final + c5.2.1 + Patch A sandbox; deployment + Phase 7 verification next.
  • This patch builds on top of cleanups (which has the c-stack + Patch A cherry-picked just now). Targeted at cleanups not mobian because bes2600_chrdev_do_bus_reset() is c5.2-only.

Asks

  1. The recover work_struct lives on bes2600_common rather than bes2600_vif — chosen because bes2600_chrdev_do_bus_reset() triggers SDIO remove which frees the per-vif state, and we don't want to schedule_work onto a freed work_struct. OK?
  2. cancel_work_sync placed adjacent to coex_work cancel in the hw_priv shutdown path. OK or should it move?
  3. The patch increments connection_loss_storm_recoveries BEFORE calling schedule_work, so the counter is visible in debugfs even though the chip is about to be reset (counter survives in vif memory until the bus_reset's remove() fires). OK?
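
The ownership and ordering questions in asks 1 and 3 can be illustrated with a small userspace model. This is only a sketch under stated assumptions: the model_* types and functions are hypothetical stand-ins, a boolean replaces the work_struct, and free() replaces the teardown done by the SDIO remove() callback.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Stand-in for per-vif state, which the simulated remove() frees. */
struct model_vif {
	unsigned int storm_recoveries;
};

/* Stand-in for hw_priv: it owns the recovery work item. */
struct model_common {
	struct model_vif *vif;
	bool recover_pending; /* models a work_struct queued via schedule_work() */
};

/* Ask 3: bump the counter first, then queue the work on the common struct. */
static void model_trigger_recovery(struct model_common *hw,
				   struct model_vif *vif)
{
	vif->storm_recoveries++;    /* readable until remove() frees the vif */
	hw->recover_pending = true; /* stand-in for schedule_work() */
}

/*
 * Ask 1: the handler lives on the common struct, so it stays valid
 * while the simulated bus reset tears down the per-vif state.
 */
static void model_run_recovery(struct model_common *hw)
{
	if (!hw->recover_pending)
		return;
	hw->recover_pending = false;

	/* Simulated bus reset: remove() frees per-vif state. */
	free(hw->vif);
	hw->vif = NULL;
}
```

Had the pending flag lived inside model_vif, model_run_recovery() would be reading memory that its own reset step frees, which is the hazard ask 1 describes.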

🤖 Generated with Claude Code

claude-noether added 1 commit 2026-05-07 10:04:35 +00:00
When mac80211 declares connection loss against this AP (typically driven
by inactivity-deauth or beacon-loss), the userspace reauth that follows
sometimes enters a long blackhole: the AP responds to auth with success
but defers assoc with the 802.11w "assoc comeback" timer; ohm retries
faster than the comeback grants permission; the AP eventually fires an
unprotected deauth-reason-6 ("Class 2 frame received from non-
authenticated station"), and recovery only completes via cross-SSID or
cross-channel fallback. Receipts: ~86 s blackhole observed in the
phase-7 rep on 2026-05-07 02:42, with three subsequent BSSIDs returning
assoc comeback timeouts before reason-9 (STA_REQ_ASSOC_WITHOUT_AUTH)
fired. Documented in marfrit/besser:notes/phase4-2026-05-07.md.

When N=3 driver-side connection_loss decisions fire within a 60 s window
on the same vif, skip the ieee80211_connection_loss() path and trigger
the c5.2-introduced bes2600_chrdev_do_bus_reset() instead. The bus
reset removes and re-probes the chip; userspace re-associates with a
fresh chip state, dodging the AP's comeback-timer rejection cycle.

Predicted Phase 7 delta vs current baseline:
- api_connection_loss rate: unchanged (we don't address the trigger)
- conditional probability of >5 s blackhole given event: <= 30 %
- worst-case recovery: 86 s -> < 10 s

Contract pin: bes2600_chrdev_do_bus_reset(sbus_ops, sbus_priv) at
bes2600/bes_chardev.c:455, introduced by c5.2. The function is async-
returning: sbus_ops->bus_reset() schedules an SDIO rescan; the helper
waits up to 3 s for the remove() callback to clear sbus_priv, then
returns. Per-vif state is gone after this point, so the recover work
lives on bes2600_common (hw_priv) and uses the global bes2600_cdev for
the bus_reset call rather than dereferencing per-vif state.

Threshold (3 / 60 s) is well above the steady-state per-vif
connection_loss rate observed in the patch-A phase-7 rep (0.86/h under
sustained load), so a true storm is required to trip it.

Files touched:
- bes2600/bes2600.h: 3 counter fields on struct bes2600_vif, 1
  work_struct on struct bes2600_common, 3 prototypes
- bes2600/sta.c: 3 helpers + storm-account hook in
  bes2600_connection_loss_work + storm-init in bes2600_vif_setup +
  cancel_work_sync in the hw_priv shutdown path; #include bes_chardev.h
  was already pulled in by an earlier c-stack patch
- bes2600/main.c: INIT_WORK alongside other hw_priv work_structs
- bes2600/debug.c: ConnectionLossStormRecoveries seq_printf in the
  per-vif status seq_file output

The cw1200/cw1260 ancestor has no equivalent; this is a clean
addition. checkpatch.pl --no-tree --strict: clean (0/0/0).

Signed-off-by: Claude (noether) <claude@reauktion.de>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
marfrit force-pushed bes2600/connection-loss-fast-recover from cdfdac987a to ae556d49da 2026-05-07 10:05:52 +00:00 Compare
marfrit force-pushed bes2600/connection-loss-fast-recover from ae556d49da to e78beea2cf 2026-05-07 10:06:27 +00:00 Compare
marfrit force-pushed bes2600/connection-loss-fast-recover from e78beea2cf to f2cf586f89 2026-05-07 10:06:48 +00:00 Compare
Owner

The recover work_struct lives on bes2600_common rather than bes2600_vif — chosen because bes2600_chrdev_do_bus_reset() triggers SDIO remove which frees the per-vif state, and we don't want to schedule_work onto a freed work_struct. OK!
cancel_work_sync placed adjacent to coex_work cancel in the hw_priv shutdown path. OK!
The patch increments connection_loss_storm_recoveries BEFORE calling schedule_work, so the counter is visible in debugfs even though the chip is about to be reset (counter survives in vif memory until the bus_reset's remove() fires). OK!

claude-noether merged commit 90f50b375f into cleanups 2026-05-07 10:48:00 +00:00

Reference: marfrit/bes2600-dkms#2