claude-noether e53aad5013 notes: phase 4 plan for Patch B (Trigger A / api_connection_loss)

Drafted after Phase 7 verification of Patch A (PR #1, srcversion
21BD07B3). 10h30m sustained load on 2.4GHz produced:
- 0 DecryptStormRecoveries (Patch A dormant; no decrypt-storm fired)
- 9 mac80211 api_connection_loss events
- 1 catastrophic blackhole at 02:42 (reason 4 inactivity → reauth
  with assoc-comeback timeouts → AP unprotected-deauth-6 cluster)

Phase 4 pivots to Trigger A (Patch B). Candidate B-1 lock proposal:
extend c5.2 bus_reset infrastructure to fire on N consecutive
api_connection_loss events; reuses existing recovery path.

Pending Phase 5 review before Phase 6 implementation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-07 10:22:34 +02:00

BES2600 WiFi-stability campaign — Phase 4 plan (Patch B / Trigger A)

Date assembled: 2026-05-07
Run dir: /root/bes2600-samples/run-20260506-2113-patchA/ on ohm
Module: bes2600.ko srcversion 21BD07B3782B144D478CE43 (c-stack + Patch A merged)

This is the Phase 4 plan for Patch B (Trigger A: beacon-loss / mac80211 api_connection_loss chain), drafted after the Phase 7 verification of Patch A on 2026-05-07. Per project CLAUDE.md: paste verbatim, do not curate.

What changed since the merged Patch-A plan (`notes/phase4-2026-05-06.md`)

Patch A is landed (PR #1) and active on ohm (srcversion 21BD07B3). Phase 7 verification:

duration:                     10h30m sustained 1 MB/s load on 2.4GHz (5b:32)
DecryptStormRecoveries:       0
Decrypt-fails total:          183  (~1 every 3.5 min — never bursted ≥5/5s)
api_connection_loss events:   9    ← Trigger A
unprotected deauth (AP):      7    ← AP-deauth-6 cluster at 02:42:11
mac80211 reason 4 deauth:     yes  ← inactivity (Trigger A flavor)
mac80211 reason 2 deauth:     no   ← what Patch A handles

Patch A's predicted delta is unobserved (no decrypt-storm fired during Phase 7). Patch A is dormant but caused no harm. This Phase 4 pivots to Trigger A — the dominant failure path in the overnight rep.

Phase 1 — revised metric (Trigger-A scope)

Per hour of operation: count mac80211 api_connection_loss events and the conditional probability that each escalates into a > 5 s reauth blackhole (assoc-comeback timeouts followed by AP unprotected-deauth-6 cluster).

Observed rate from the Phase 7 rep: 9 events over 10h30m = 0.86/h under sustained load. Not all of them escalated to a visible blackhole — some apparently recovered cleanly. But the 02:42 cluster (1/9 = 11 % escalation rate in this rep) shows the catastrophic shape.
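The arithmetic above, plus a rough sense of how little a single 1/9 observation pins down, can be sketched as follows. The Wilson score interval is an addition here, not part of the original analysis, and the helper names are illustrative. With n = 9 the 95 % interval runs from about 2 % to above 40 %, so one rep cannot separate an 11 % escalation rate from a much higher one.

```python
import math

def events_per_hour(events, hours):
    """Plain event rate, e.g. 9 api_connection_loss events over 10.5 h."""
    return events / hours

def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95 %)."""
    p = successes / trials
    denom = 1 + z * z / trials
    centre = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return centre - half, centre + half

rate = events_per_hour(9, 10.5)      # Phase 7 rep: 9 events in 10h30m
low, high = wilson_interval(1, 9)    # 1 of 9 events escalated to a blackhole

print(f"rate: {rate:.2f} events/h")
print(f"escalation: {1 / 9:.1%} (95% Wilson interval {low:.1%} to {high:.1%})")
```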

Phase 0 / 3 receipt — the 02:42 chain (verbatim from iw-event)

02:40:32  scan started, full-band
02:40:34  scan aborted
02:40:45  del station 5b:32
02:40:45  kernel: deauth 8a:2e:77:1f:ec:05 → 5b:32 reason 4
          (Disassociated due to inactivity)        ← TRIGGER A
02:40:45  cfg80211: disconnected (local request) reason 4
02:40:45  scan started → finished: 2462 2412, "newton"
02:40:45  new station 61:b0
02:40:45  AP→ohm: auth status 0: Successful
02:40:45  AP→ohm: assoc comeback bssid 61:b0 timeout 1000  ← BSS load mgmt
02:40:47  del station 61:b0
02:40:47  assoc: timed out
02:40:47  scan → new station cc:ce:1e:2b:74:17
02:40:48  auth: timed out
02:40:49  scan → new station 5b:32 (back to where we started)
02:40:49  AP→ohm: auth status 0
02:40:49  AP→ohm: assoc comeback bssid 5b:32 timeout 881
02:40:51  assoc: timed out
02:42:11  ── AP-deauth-6 cluster (×7 within 1 ms) from 61:b0 ──
          reason 6: Class 2 frame received from non-authenticated station
02:42:11  reason 9 = STA_REQ_ASSOC_WITHOUT_AUTH

Net: ~86 s in reauth-blackhole, recovery via cross-channel fallback. Same shape as Phase 3's 11:03 blackhole (~109 s), but trigger here is inactivity-deauth → assoc-comeback rejection, not decrypt-storm.
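The ~86 s figure can be rechecked directly from the timestamps in the chain. A minimal sketch, assuming all three timestamps fall on the same day and that the excerpt is complete between 02:40:51 and 02:42:11; the variable names are illustrative. It also shows that most of the blackhole is a quiet gap after the last assoc timeout rather than active retrying.

```python
from datetime import datetime

def seconds_between(start, end):
    """Elapsed whole seconds between two HH:MM:SS stamps on the same day."""
    fmt = "%H:%M:%S"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds())

trigger = "02:40:45"             # reason 4 deauth (Trigger A)
last_assoc_timeout = "02:40:51"  # final "assoc: timed out" in the excerpt
deauth_cluster = "02:42:11"      # AP-deauth-6 cluster

blackhole_s = seconds_between(trigger, deauth_cluster)
active_retry_s = seconds_between(trigger, last_assoc_timeout)
quiet_gap_s = seconds_between(last_assoc_timeout, deauth_cluster)

print(blackhole_s, active_retry_s, quiet_gap_s)
```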

Hypothesis on the mechanism

Three plausible chains for why reauth after api_connection_loss ends in a blackhole:

1. AP's assoc-comeback timer not honored. The AP says "wait 1000 TU before retrying", but mac80211 / wpa_supplicant retries sooner. The AP keeps deferring; eventually a stale frame triggers the AP's "Class 2 from unauth" reaction.
2. Driver state stale across deauth. After ieee80211_connection_loss is called (the driver-to-mac80211 notification that shows up as api_connection_loss), the bes2600 driver's per-vif state (link_id, key state, queues) may not be fully scrubbed. The subsequent reassoc starts with mixed state and the AP rejects it.
3. Chip-level RX wedge. The chip's RX state machine got stuck during the inactivity period; reauth frames go out, but RX of the AP's responses is lossy. mac80211 perceives a timeout when the frames actually arrived but were dropped. Hard to verify without monitor mode (which the chip doesn't support concurrently with managed mode).
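For the assoc-comeback hypothesis it helps to know how long the AP actually asked the station to wait. A minimal sketch, assuming the `timeout` values printed by iw-event are in 802.11 time units (1 TU = 1024 µs); `tu_to_ms` is an illustrative helper, not part of any tool.

```python
TU_MICROSECONDS = 1024  # one 802.11 time unit

def tu_to_ms(tu):
    """Convert 802.11 time units to milliseconds."""
    return tu * TU_MICROSECONDS / 1000

# Comeback timeouts seen in the 02:42 chain (bssid 61:b0 and 5b:32).
comeback_ms = {tu: tu_to_ms(tu) for tu in (1000, 881)}

for tu, ms in comeback_ms.items():
    print(f"{tu} TU = {ms:.3f} ms")
```

Both requested waits are on the order of one second, which is the same as the resolution of the iw-event timestamps. The excerpt alone therefore cannot show whether the retries came before or after the comeback deadline.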

Each hypothesis suggests a different fix surface.

Phase 4 — Plan candidates

Candidate B-1 — Chip-level reset on api_connection_loss flood

What it touches:

New work-item on bes2600_common: api_connection_loss_recover_work.
mac80211 → driver disconnect() op → bump a sliding-window counter.
On threshold (e.g., 3 events within 60 s): schedule the work that calls bes2600_chrdev_do_bus_reset() (the existing c5.2 LMAC-wedge recovery path).
After bus reset, NM auto-reconnects from a fresh chip state.
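The counter-and-threshold logic can be modeled in userspace before it is written as kernel C. A minimal sketch, assuming the example threshold of 3 events within 60 s. `ConnectionLossWindow` and `record` are hypothetical names, and the real driver would need kernel timekeeping, locking, and a scheduled work item instead of a return value.

```python
from collections import deque

class ConnectionLossWindow:
    """Userspace model of a sliding-window trigger (hypothetical, not driver code)."""

    def __init__(self, threshold=3, window_s=60.0):
        self.threshold = threshold
        self.window_s = window_s
        self._events = deque()

    def record(self, now_s):
        """Record one api_connection_loss at time now_s (seconds).

        Returns True when the caller should schedule the recovery work.
        """
        self._events.append(now_s)
        # Drop events older than the window.
        while self._events and now_s - self._events[0] > self.window_s:
            self._events.popleft()
        if len(self._events) >= self.threshold:
            # Assumption: clear the window so one flood schedules one reset.
            self._events.clear()
            return True
        return False

window = ConnectionLossWindow()
flood = [window.record(t) for t in (0, 30, 59, 60)]

sparse_window = ConnectionLossWindow()
sparse = [sparse_window.record(t) for t in (0, 100, 200)]

print(flood, sparse)
```

Clearing the window after firing is a design choice in this sketch: without it, every further event during the same flood would schedule another bus reset. For scale, the Phase 7 rep averaged well under one event per hour, so whether three events ever land within 60 s is something the run logs should confirm before the threshold is fixed.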

Why this candidate: reuses the c5.2 infrastructure already deployed; small surface; if hypothesis 3 (chip wedge) is right, this fixes the root cause. If hypothesis 1 or 2 is right, this is overkill but harmless (a brief reset).

API contracts to confirm:

bes2600_chrdev_do_bus_reset() re-entrancy and worker-context safety
mac80211 ops or callbacks around ieee80211_connection_loss/disconnect
cw1200/cw1260 ancestor for any similar pattern

Predicted delta (Phase 7 units):

api_connection_loss rate: unchanged (we don't address the trigger)
conditional escalation to >5 s blackhole: target ≤ 30 % (placeholder; a realistic target is still needed, given that the single Phase 7 rep observed 1/9 = 11 %)
worst-case recovery: 86 s → < 10 s

Candidate B-2 — Respect assoc-comeback timer

What it touches:

Possibly NOT in bes2600 — this looks like a mac80211 / wpa_supplicant concern.
If the driver does anything assoc-related itself, audit for racing the comeback timer.

Status: out of scope for a bes2600 patch unless the driver is observed sending frames during the comeback window.

Candidate B-3 — Audit and scrub vif state on disconnect

What it touches:

bes2600_unjoin_work — verify link_id, keys, queues all reset
Add explicit reset on ieee80211_disconnect/disconnect ops

Status: speculative without further instrumentation.

Lock proposal

Lock Candidate B-1 first. It has:

the cleanest re-use (c5.2's bus_reset)
the smallest patch surface
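a measurable predicted delta against the Phase 7 baseline (conditional escalation and worst-case recovery time; the api_connection_loss rate itself is expected to stay unchanged)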

If Phase 7-of-B-1 shows the rate unchanged but escalation still high → loop back to B-2/B-3 hypothesis space.

What will NOT be touched

mac80211 / cfg80211 core — host-side STA driver only
The c-stack patches (c5.x, c6.x, c7) — independent recovery paths
Patch A — stays in place, untouched
AP / firmware

Receipt checklist for Phase 4

What will and will not be touched: stated above
Predicted delta in Phase 7 units: stated for Candidate B-1
Out-of-scope items explicitly listed
Risk items: bus_reset has a known multi-function-SDIO consideration (handled by c5.2.1)

Asks of the reviewer

Is Candidate B-1 (bus_reset on api_connection_loss flood) the right scope, or should we instrument deeper before committing?
Threshold (3 events / 60 s): too aggressive (false-positive bus_resets on transient RF issues) or about right?
Should bus_reset be conditional on ALSO seeing post-deauth assoc-comeback timeouts, to avoid resetting on benign connection_loss events?
Hypothesis 1 (assoc-comeback disrespected) — is this a mac80211/wpa_supplicant bug rather than a bes2600 bug? If yes, we file it elsewhere.
