Compare commits
6 Commits
| Author | SHA1 | Date |
|---|---|---|
| | 1830c17891 | |
| | 69a1d0f8b1 | |
| | 458ad36f8b | |
| | ea509e810f | |
| | e53aad5013 | |
| | 4acba3e707 | |
@@ -82,6 +82,49 @@ without board power-cycle").
**Status**: task c3 (indirectly, via bes_chardev removal which currently
gates the signal/nosignal mode switch path).

## Bug #5 — RX path degrades under attempted-throughput pressure

**Suspect file**: bes2600 RX path (`txrx.c bes2600_rx_cb`, `bh.c bes2600_bh_work`,
SDIO RX scheduling); pinpoint pending.

**Symptom (observed 2026-05-07 13:43, srcversion `1B3B3ED0` = c-stack +
Patch A + Patch B, ohm @ -57 dBm 2.4GHz ch11 5b:32, idle save for the
netcat load):**

```
sender cap 1 MB/s → ohm receives 1015 KB/s, signal -57 dBm, RX MCS 4
sender cap 4 MB/s → ohm receives 563 KB/s, signal -67 dBm, RX MCS 3
                    (Send-Q on boltzmann backed up to 1.16 MB)
```

Pushing the sender-side cap from 1 MB/s to 4 MB/s **decreased** observed
throughput at the receiver and degraded the link metrics. Signal dropped
~10 dB and the chip downshifted MCS, suggesting the chip can't sustain
the higher RX rate even with the link physically capable of more (link
bitrate 65 Mb/s = ~8 MB/s theoretical).

**Hypothesis (Markus, 2026-05-07): driver/firmware locks itself to death
under busy reads**, possibly a busy-wait loop or lock contention on the
RX SDIO path that prevents draining at line rate. A plausible reason it
didn't surface for the c-stack tasks: those operated at typical
browse-rate traffic, well below the saturation threshold this bug needs
to fire.

**May explain**: the original Phase-0 observation that **YouTube DASH chunks
drop ~10 frames per chunk fetch** on hardware-decoder playback. A chunk
fetch is a brief burst at near-link-rate; if the driver throttles itself
down during high-RX, the player buffer underruns for the duration of
the fetch.

**How to drill (when prioritized)**:
- Capture trace_pipe with `mmc:*` and `sdio*` events enabled during a
  controlled rate-ramp (e.g., `pv -L` 500K, 1M, 2M, 4M, each for 60 s).
- Watch `/proc/sys/kernel/sched_*` and the `bes2600_bh_work` kworker for
  CPU saturation.
- `perf top -p $(pgrep -f bes_sdio)` during 4 MB/s load.
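
The rate ramp in the first bullet could be scripted along these lines. This is a minimal sketch, not part of the campaign tooling: the host and port are borrowed from the netcat sender command used elsewhere in these notes, and bounding each step with `head -c` (instead of killing `pv` after 60 s) is an assumption. Whether `nc` exits at end of input depends on the netcat variant installed.

```python
# Sketch: build one bounded sender command per ramp step.
# HOST/PORT and the head -c bounding trick are assumptions, not campaign tooling.
HOST, PORT = "ohm.fritz.box", 12345
STEP_SECONDS = 60
RAMP = ["500K", "1M", "2M", "4M"]  # values passed to pv -L

UNITS = {"K": 1024, "M": 1024 ** 2}  # treating pv suffixes as binary multiples


def rate_to_bytes(rate: str) -> int:
    """Convert a pv-style rate such as '500K' to bytes per second."""
    return int(rate[:-1]) * UNITS[rate[-1].upper()]


def ramp_commands(ramp=RAMP, seconds=STEP_SECONDS):
    """Return (rate, total_bytes, shell_command) for each ramp step."""
    plan = []
    for rate in ramp:
        total = rate_to_bytes(rate) * seconds  # bytes needed to fill one step
        cmd = f"head -c {total} /dev/zero | pv -q -L {rate} | nc {HOST} {PORT}"
        plan.append((rate, total, cmd))
    return plan


if __name__ == "__main__":
    for rate, total, cmd in ramp_commands():
        print(f"# {rate} for {STEP_SECONDS} s ({total} bytes)")
        print(cmd)
```

Running the steps back to back while the trace capture is active on ohm gives one 60 s plateau per rate, which makes it easier to line up throughput collapse with the `mmc:*` event rate.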

**Status**: backlog. No patch yet.

## Bug #4 — scan_complete_cb constant loop

**File**: `scan.c:883-909`, `bes2600_scan_complete_cb()`.
@@ -0,0 +1,135 @@
# Bug #5 campaign — Phase 1 metric + Phase 2 situation

Date assembled: 2026-05-07
Module under test (baseline): bes2600.ko srcversion `1B3B3ED096AAD7217FEDE11`
(cleanups + Patch A + Patch B)

Phase 0 anchor at N=3 reps (10 min each, 4 MB/s pv-cap on boltzmann → ohm
2.4GHz 5b:32) reproduces the throughput regression and traces it to
lock-cycling cost in the bes2600 BH path. See `notes/observed-bugs.md` Bug #5
for the original report. This document locks Phase 1 and prepares Phase 4
candidates.

---

## Phase 1 — measurable target (locked)

> *Reduce the bes2600 BH path's spin-unlock-irqrestore cost so that under sustained 4 MB/s sender pressure on a healthy 2.4GHz link (signal -55 to -65 dBm), ohm sustains ≥ 2 MB/s observed RX throughput (vs 663–725 KB/s baseline at N=3) AND the link survives ≥ 30 min continuous load without cascading into beacon-loss disconnect (vs rep3's failure at ~9 min).*

Three measurable outcomes, single sentence:

- **(a) Throughput floor**: ≥ 2 MB/s sustained RX at ohm
- **(b) Lock-cycle ceiling**: % CPU in `_raw_spin_unlock_irqrestore` from `bes2600_bh.isra.0` callstack drops to < 10 % (currently ~10 % rep1, ~16 % rep3)
- **(c) Cascade prevention**: no link death under continuous 30 min @ 4 MB/s

---

## Phase 0 anchor — receipts

### Reproduction protocol (same units as Phase 7 will use)

1. boltzmann: `pv -L 4M -q < /dev/zero | nc ohm.fritz.box 12345`
2. ohm: `sudo $RUN/rep-trace.sh 600` (10 min capture window)
3. Rep dirs: `bug5/rep-<ts>/{mmc.log, perf.data, rx_bytes.tsv, start.txt, end.txt}`
4. N=3 reps with 60 s cooldowns

### Observed (2026-05-07 15:36–16:08)

| rep | duration | avg KB/s | near-zero ticks | end state |
|---|---|---|---|---|
| 1 | 600 s | 725 | 1/119 | associated, MCS 6 |
| 2 | 600 s | 663 | 5/119 | associated, MCS 4 |
| 3 | 600 s | 75 | 53/119 | **passive (link died at sample ~82, ~9 min in)** |

mmc transaction rate: rep1 = 5793/s sustained, rep3 = 6000/s for first ~10 s then collapse to <100/s.
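
The near-zero column can be recomputed from a byte-counter log. The following is a minimal sketch, assuming `rx_bytes.tsv` holds time-ordered (timestamp, cumulative rx_bytes) samples and that "near-zero" means a per-interval rate below some threshold. Both the file layout and the 10 KB/s default are assumptions and are not taken from `rep-trace.sh`.

```python
def near_zero_ticks(samples, threshold_bps=10_000):
    """Count intervals whose average RX rate falls below threshold_bps.

    samples: list of (timestamp_seconds, cumulative_rx_bytes), time-ordered.
    Returns (near_zero_count, total_intervals).
    """
    near_zero = 0
    for (t0, b0), (t1, b1) in zip(samples, samples[1:]):
        rate = (b1 - b0) / (t1 - t0)  # bytes per second over this tick
        if rate < threshold_bps:
            near_zero += 1
    return near_zero, len(samples) - 1


def fraction_pct(count, total):
    """Express a tick count as a percentage, one decimal place."""
    return round(100.0 * count / total, 1)


# Tick counts from the Observed table, as percentages of the capture window.
table = {"rep1": (1, 119), "rep2": (5, 119), "rep3": (53, 119)}
summary = {rep: fraction_pct(*ticks) for rep, ticks in table.items()}
```

Expressed this way, rep 1 and rep 2 spend under 5 % of ticks near zero, while rep 3 spends roughly 45 %, which is the cascade the Phase 1 target is meant to prevent.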

### Hot-symbol receipts (perf record on `bes_sdio | bes2600_bh` kworkers)

| symbol | rep 1 (healthy) | rep 3 (cascade) |
|---|---|---|
| `_raw_spin_unlock_irqrestore` (sum across kworker variants) | **~19 %** | **~21 %** |
| `handle_softirqs` | 5.4 % | 4.3 % |
| `__tasklet_schedule` | 2.4 % | 2.0 % |
| `dw_mci_start_command` (SDIO host) | 1.5 % | < 0.6 % |
| `bes2600_sw_retry_requeue` | 0.79 % | 0.70 % |

Top callchain leading to `_raw_spin_unlock_irqrestore`:

```
process_one_work → worker_thread → wsm_configuration → wsm_cmd_send → bes2600_bh.isra.0
```

---

## Phase 2 — situation analysis

### Relevant source pins

- `bes2600/wsm.c:98`: `wsm_cmd_send()`, the function in the hot callstack. Body:
  - holds `&hw_priv->wsm_cmd.lock` (spin) only briefly to fill the cmd struct (lines ~145–152)
  - calls `bes2600_bh_wakeup()` then `wait_event_interruptible_timeout` for the response
  - outer lock: `down(&hw_priv->wsm_cmd_sema)` from callers (`wsm_cmd_lock()` at wsm.c:105)
- `bes2600/bh.c:435,559,847`: `bes2600_bh_work()` takes `&hw_priv->vif_list_lock` 3× per pass through the dispatcher
- `bes2600/bh.c:172,195,487,581,861,1361,1427`: multiple `wait_event*` calls; the BH thread cycles through wait/wake/dispatch

### What the lock-cycling cost is buying

Each WSM command from the host (mac80211, NM, kernel scan etc.) goes through the same path:

1. caller acquires `wsm_cmd_sema` (outer)
2. `wsm_cmd_send()` acquires `wsm_cmd.lock`, fills the struct, releases the lock
3. `bes2600_bh_wakeup()` schedules the BH
4. BH dispatches the command, takes `vif_list_lock` to walk vifs
5. BH talks to chip via SDIO
6. response arrives, BH wakes the waiter
7. caller releases `wsm_cmd_sema`

Under sustained TCP RX, mac80211 issues many small WSM commands (TX-scheduling hints, rate updates, etc.), and every one cycles through this path. The spin-unlock-irqrestore cost is the floor on this cycle rate.
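
The seven steps can be mimicked with a small userspace toy to make the per-command lock traffic countable. This is a sketch under stated assumptions, not driver code: the class and attribute names are hypothetical stand-ins for the kernel objects, the SDIO round-trip is faked, and the BH is modeled as taking `vif_list_lock` once per command even though the source pins above note three acquisitions per dispatcher pass.

```python
import threading


class WsmModel:
    """Toy model of the caller <-> BH handshake (hypothetical names throughout)."""

    def __init__(self):
        self.cmd_sema = threading.Semaphore(1)  # stands in for wsm_cmd_sema
        self.cmd_lock = threading.Lock()        # stands in for wsm_cmd.lock
        self.vif_list_lock = threading.Lock()   # stands in for vif_list_lock
        self.wake = threading.Event()           # stands in for bes2600_bh_wakeup()
        self.done = threading.Event()           # response wakes the waiter
        self.pending = None
        self.response = None
        self.lock_cycles = 0
        self._stop = False
        self._bh = threading.Thread(target=self._bh_loop, daemon=True)
        self._bh.start()

    def _bh_loop(self):
        while True:
            self.wake.wait()
            self.wake.clear()
            if self._stop:
                return
            with self.vif_list_lock:            # step 4: walk vifs
                self.lock_cycles += 1
            self.response = ("ok", self.pending)  # step 5: fake SDIO round-trip
            self.done.set()                     # step 6: wake the waiter

    def send(self, cmd):
        with self.cmd_sema:                     # steps 1 and 7
            with self.cmd_lock:                 # step 2: fill the cmd struct
                self.pending = cmd
                self.lock_cycles += 1
            self.wake.set()                     # step 3
            self.done.wait()
            self.done.clear()
            return self.response

    def close(self):
        self._stop = True
        self.wake.set()
        self._bh.join()
```

In this model every command costs a fixed number of lock cycles, so the lock cost scales linearly with command rate. That is the property the Phase 4 candidates try to attack, either by shrinking the per-command count or by removing the locks from the path.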

### What's been ruled out

- AP-side bug (AVM Fritz!Box, reliable per Markus's testimony; see `feedback_phase7_stress_ramp.md` reasoning in the campaign-so-far prior).
- Patch A / Patch B (target different triggers; would not address lock cost).
- Decrypt-failure storm (Patch A handles; this regression occurs in rep1 with zero decrypt-fails).
- mac80211 scan-fail / scan-comeback (cosmetic; doesn't account for the throughput floor).

---

## Phase 4 — candidate plans (preliminary, not locked)

Three candidates surfaced from the perf data, listed from cheapest to most invasive.

### B5-1 — Reduce `wsm_cmd_send` lock scope

The spin_lock around the cmd-struct fill (wsm.c:145–152) can probably be replaced with `WRITE_ONCE` of a single sentinel field, since the BH thread reads these fields cooperatively (BH only reads after `bh_wakeup` schedules it, and only writes back via the response path). This would eliminate the per-command spin-lock cycle on the host side.

**Risk**: race with BH if the protocol isn't strictly happens-before. Need to read bh.c:486-500 (where BH reads wsm_cmd.ptr) carefully.

**Predicted delta**: small but measurable. Maybe 2–3 % CPU off the lock floor.

### B5-2 — Coalesce `vif_list_lock` in BH dispatcher

bh.c takes `vif_list_lock` 3× per dispatcher pass. If these 3 critical sections are within a single iteration, they should be merged into one acquire/release.

**Risk**: vif teardown might depend on releasing the lock between iterations to allow concurrent vif removal. Needs careful audit.

**Predicted delta**: significant under multi-vif (we're single-vif STA today, so probably small immediate gain).

### B5-3 — Move WSM send out of process context, use ringbuffer

Replace the wsm_cmd_sema + wsm_cmd struct mechanism with an SPSC ringbuffer between caller and BH. Caller writes to ring, no lock needed (one producer); BH reads from ring, no lock needed (one consumer).

**Risk**: significant rework. cw1200 ancestor doesn't have this; we'd be inventing it.

**Predicted delta**: large; could halve the lock cost. But cost-to-implement is also large.
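
For B5-3, the index discipline of a single-producer/single-consumer ring looks roughly like this. It is an illustrative sketch in Python, with a hypothetical class name, and says nothing about how the driver would actually implement it: in kernel C the index updates would need release/acquire ordering, which Python does not model.

```python
class SpscRing:
    """Single-producer/single-consumer ring with free-running indices.

    Illustrative only: None doubles as the "empty" marker, so None cannot
    be used as a payload in this sketch.
    """

    def __init__(self, capacity: int):
        if capacity <= 0 or capacity & (capacity - 1):
            raise ValueError("capacity must be a power of two")
        self._buf = [None] * capacity
        self._mask = capacity - 1
        self._head = 0  # written only by the producer (caller side)
        self._tail = 0  # written only by the consumer (BH side)

    def push(self, cmd) -> bool:
        """Producer: enqueue cmd, or return False if the ring is full."""
        if self._head - self._tail == len(self._buf):
            return False
        self._buf[self._head & self._mask] = cmd
        self._head += 1  # publish only after the slot is filled
        return True

    def pop(self):
        """Consumer: dequeue the oldest cmd, or return None if empty."""
        if self._tail == self._head:
            return None
        cmd = self._buf[self._tail & self._mask]
        self._tail += 1  # hand the slot back to the producer
        return cmd
```

Each index has exactly one writer, which is what removes the need for a lock. Two caveats apply to the plan above. First, the "one producer" property only holds if something still serializes the callers, so an outer serialization such as the semaphore may have to stay. Second, a ring removes the lock cycle but not the wait for the chip's response. It may also be worth checking whether the kernel's existing `kfifo`, which is documented as lock-free for one reader and one writer, fits before inventing a new structure.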

---

## Open question for the reviewer

Which Phase 4 candidate to lock? My ranking by ROI:

1. B5-1 (smallest, fastest, cleanest source pin): try first
2. B5-2 (medium, conditional on multi-vif applicability)
3. B5-3 (largest rework, biggest potential)

Or: instrument deeper before committing to a fix (e.g., add tracepoints around `wsm_cmd_send` enter/exit to measure the lock hold-time distribution, not just CPU%).
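
If the instrumentation route is taken, the post-processing could be as small as the sketch below. It assumes the tracepoints emit timestamped enter/exit records from a single context; that record format is hypothetical, since no such tracepoints exist yet.

```python
import math


def hold_times_ns(events):
    """Pair enter/exit timestamps into hold durations.

    events: time-ordered list of (timestamp_ns, "enter" | "exit") from a
    single context; an exit is matched with the most recent enter.
    """
    durations, entered = [], None
    for ts, kind in events:
        if kind == "enter":
            entered = ts
        elif kind == "exit" and entered is not None:
            durations.append(ts - entered)
            entered = None
    return durations


def percentile(values, pct):
    """Nearest-rank percentile (pct in (0, 100]) of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]
```

Comparing the median against the 99th percentile separates "many cheap cycles" from "a few long holds", which the CPU% figure alone cannot do.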

@@ -0,0 +1,153 @@
# BES2600 WiFi-stability campaign — Phase 4 plan (Patch B / Trigger A)

Date assembled: 2026-05-07
Run dir: /root/bes2600-samples/run-20260506-2113-patchA/ on ohm
Module: bes2600.ko srcversion 21BD07B3782B144D478CE43 (c-stack + Patch A merged)

This is the Phase 4 plan for **Patch B (Trigger A: beacon-loss / mac80211 `api_connection_loss` chain)**, drafted after the Phase 7 verification of Patch A on 2026-05-07. Per project CLAUDE.md: paste verbatim, do not curate.

---

## What changed since the merged Patch-A plan (`notes/phase4-2026-05-06.md`)

Patch A is **landed (PR #1)** and **active on ohm** (srcversion `21BD07B3`). Phase 7 verification:

```
duration: 10h30m sustained 1 MB/s load on 2.4GHz (5b:32)
DecryptStormRecoveries: 0
Decrypt-fails total: 183 (~1 every 3.5 min — never bursted ≥5/5s)
api_connection_loss events: 9 ← Trigger A
unprotected deauth (AP): 7 ← AP-deauth-6 cluster at 02:42:11
mac80211 reason 4 deauth: yes ← inactivity (Trigger A flavor)
mac80211 reason 2 deauth: no ← what Patch A handles
```

**Patch A's predicted delta is unobserved** (no decrypt-storm fired during Phase 7). Patch A is dormant but caused no harm. This Phase 4 pivots to **Trigger A**, the dominant failure path in the overnight rep.

---

## Phase 1 — revised metric (Trigger-A scope)

> Per hour of operation: count `mac80211 api_connection_loss` events and the conditional probability that each escalates into a > 5 s reauth blackhole (assoc-comeback timeouts followed by AP unprotected-deauth-6 cluster).

Observed rate from the Phase 7 rep: 9 events over 10h30m = **0.86/h** under sustained load. Not all of them escalated to a visible blackhole; some apparently recovered cleanly. But the 02:42 cluster (1/9 = 11 % escalation rate in this rep) shows the catastrophic shape.
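
The two figures in this metric follow directly from the counts; a short worked check, with helper names chosen only for illustration:

```python
def events_per_hour(events: int, hours: float) -> float:
    """Event rate normalized to one hour of operation."""
    return events / hours


def escalation_rate(escalated: int, events: int) -> float:
    """Share of events that escalated into a > 5 s blackhole."""
    return escalated / events


rate = events_per_hour(9, 10.5)     # 9 events over 10h30m
escalation = escalation_rate(1, 9)  # only the 02:42 cluster escalated
print(f"{rate:.2f}/h, {escalation:.0%} escalation")  # prints "0.86/h, 11% escalation"
```

With a single escalation in nine events, the 11 % figure carries wide uncertainty, so it is better read as an order of magnitude than as a baseline to beat.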

---

## Phase 0 / 3 receipt — the 02:42 chain (verbatim from iw-event)

```
02:40:32 scan started, full-band
02:40:34 scan aborted
02:40:45 del station 5b:32
02:40:45 kernel: deauth 8a:2e:77:1f:ec:05 → 5b:32 reason 4
         (Disassociated due to inactivity) ← TRIGGER A
02:40:45 cfg80211: disconnected (local request) reason 4
02:40:45 scan started → finished: 2462 2412, "newton"
02:40:45 new station 61:b0
02:40:45 AP→ohm: auth status 0: Successful
02:40:45 AP→ohm: assoc comeback bssid 61:b0 timeout 1000 ← BSS load mgmt
02:40:47 del station 61:b0
02:40:47 assoc: timed out
02:40:47 scan → new station cc:ce:1e:2b:74:17
02:40:48 auth: timed out
02:40:49 scan → new station 5b:32 (back to where we started)
02:40:49 AP→ohm: auth status 0
02:40:49 AP→ohm: assoc comeback bssid 5b:32 timeout 881
02:40:51 assoc: timed out
02:42:11 ── AP-deauth-6 cluster (×7 within 1 ms) from 61:b0 ──
         reason 6: Class 2 frame received from non-authenticated station
02:42:11 reason 9 = STA_REQ_ASSOC_WITHOUT_AUTH
```

**Net**: ~86 s in reauth-blackhole, recovery via cross-channel fallback. Same shape as Phase 3's 11:03 blackhole (~109 s), but trigger here is **inactivity-deauth → assoc-comeback rejection**, not decrypt-storm.

---

## Hypothesis on the mechanism

Three plausible chains for why reauth blackholes after `api_connection_loss`:

1. **AP's assoc-comeback timer disrespected.** The AP says "wait 1000 TU before retrying", but mac80211 / wpa_supplicant retries fast. AP keeps deferring; eventually a stale frame triggers the AP's "Class 2 from unauth" reaction.

2. **Driver state stale across deauth.** After mac80211 fires `ieee80211_connection_loss`, the bes2600 driver's per-vif state (link_id, key state, queues) may not be fully scrubbed. Subsequent reassoc starts with mixed state; AP rejects.

3. **Chip-level RX wedge.** The chip's RX state machine got stuck during the inactivity period; reauth sends out frames, but RX of the AP's responses is lossy. mac80211 perceives a timeout when the frames actually arrived but were dropped. Hard to verify without monitor mode (which the chip doesn't support concurrently with managed mode).

Each hypothesis suggests a different fix surface.

---

## Phase 4 — Plan candidates

### Candidate B-1 — Chip-level reset on api_connection_loss flood

**What touches:**
- New work-item on `bes2600_common`: `api_connection_loss_recover_work`.
- mac80211 → driver `disconnect()` op → bump a sliding-window counter.
- On threshold (e.g., 3 events within 60 s): schedule the work that calls `bes2600_chrdev_do_bus_reset()` (the existing c5.2 LMAC-wedge recovery path).
- After bus reset, NM auto-reconnects from a fresh chip state.
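
The sliding-window counter in the second and third bullets can be sketched as follows. This is an illustration of the thresholding logic only, written in Python with a hypothetical class name; the driver version would work on kernel timestamps and schedule the recovery work instead of returning a flag. Clearing the window after a trip is a design assumption, made so that one burst schedules one reset.

```python
from collections import deque


class SlidingWindowTrigger:
    """Fire when `threshold` events land within `window_s` seconds."""

    def __init__(self, threshold: int = 3, window_s: float = 60.0):
        self.threshold = threshold
        self.window_s = window_s
        self._stamps = deque()

    def record(self, now_s: float) -> bool:
        """Register one event at time now_s; return True if the threshold trips."""
        self._stamps.append(now_s)
        # Drop events that have aged out of the window.
        while now_s - self._stamps[0] >= self.window_s:
            self._stamps.popleft()
        if len(self._stamps) >= self.threshold:
            self._stamps.clear()  # re-arm so one burst schedules one reset
            return True
        return False
```

Feeding it three events 20 s apart trips on the third; two events 91 s apart never trip, because the first has aged out before the second arrives.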

**Why this candidate:** reuses the c5.2 infrastructure already deployed; small surface; if hypothesis 3 (chip wedge) is right, this fixes the root cause. If hypothesis 1 or 2 is right, this is overkill but harmless (a brief reset).

**API contracts to confirm:**
- `bes2600_chrdev_do_bus_reset()` re-entrancy and worker-context safety
- mac80211 ops or callbacks around `ieee80211_connection_loss`/`disconnect`
- cw1200/cw1260 ancestor for any similar pattern

**Predicted delta (Phase 7 units):**
- `api_connection_loss` rate: unchanged (we don't address the trigger)
- conditional escalation to >5 s blackhole: target ≤ 30 % (need realistic)
- worst-case recovery: 86 s → < 10 s

### Candidate B-2 — Respect assoc-comeback timer

**What touches:**
- Possibly NOT in bes2600; this looks like a mac80211 / wpa_supplicant concern.
- If the driver does anything assoc-related itself, audit for racing the comeback timer.

**Status:** out of scope for a bes2600 patch unless the driver is observed sending frames during the comeback window.

### Candidate B-3 — Audit and scrub vif state on disconnect

**What touches:**
- `bes2600_unjoin_work`: verify link_id, keys, queues all reset
- Add explicit reset on `ieee80211_disconnect`/`disconnect` ops

**Status:** speculative without further instrumentation.

---

## Lock proposal

**Lock Candidate B-1 first.** It has:
- the cleanest re-use (c5.2's bus_reset)
- the smallest patch surface
- a measurable predicted delta against Phase 7's `api_connection_loss` rate

If Phase 7-of-B-1 shows the rate unchanged but escalation still high → loop back to B-2/B-3 hypothesis space.

---

## What will NOT be touched

- mac80211 / cfg80211 core: host-side STA driver only
- The c-stack patches (c5.x, c6.x, c7): independent recovery paths
- Patch A: stays in place, untouched
- AP / firmware

---

## Receipt checklist for Phase 4

- [x] What will and will not be touched: stated above
- [x] Predicted delta in Phase 3 units: stated for Candidate B-1
- [x] Out-of-scope items explicitly listed
- [x] Risk items: bus_reset has a known multi-function-SDIO consideration (handled by c5.2.1)

## Asks of the reviewer

1. Is Candidate B-1 (bus_reset on api_connection_loss flood) the right scope, or should we instrument deeper before committing?
2. Threshold (3 events / 60 s): too aggressive (false-positive bus_resets on transient RF issues) or about right?
3. Should bus_reset be conditional on ALSO seeing post-deauth assoc-comeback timeouts, to avoid resetting on benign connection_loss events?
4. Hypothesis 1 (assoc-comeback disrespected): is this a mac80211/wpa_supplicant bug rather than a bes2600 bug? If yes, we file it elsewhere.
@@ -0,0 +1,96 @@
# BES2600 WiFi-stability campaign — Phase 7 verdict (Patches A + B)

Date assembled: 2026-05-07
Module under test: bes2600.ko srcversion `1B3B3ED096AAD7217FEDE11`
(cleanups + Patch A + Patch B)
Run dir: `/root/bes2600-samples/run-20260507-1248-patchB/` on ohm

Phase 7 verification window: 2026-05-07 12:48 → ~15:13 CEST (≈ 2 h 25 m)
of which: ~50 min @ 1 MB/s pv-cap, ~1 h 30 m @ 4 MB/s pv-cap on 2.4 GHz
newton (5b:32, signal -57 to -67 dBm).

---

## Result table (vs the Phase 4 predicted delta)

### Patch A — decrypt-storm fast-recover (Trigger B)

| metric | Phase 3 baseline | Phase 4 prediction | Phase 7-of-B observed |
|---|---|---|---|
| decrypt-burst rate | 8.18/h | unchanged | 2 bursts in ~22 min once 4 MB/s pressure was on |
| AP-deauth-6 rate following burst | 100 % escalation | ≤ 0.2 × baseline | **0/2 = 0 % escalation** |
| recovery time given burst | up to 109 s | < 5 s | **~1 s** (×2) |

**Verdict: predicted delta CONFIRMED at N=2.** CLAUDE.md ideal is N=3; we're directionally locked at 2 reproductions, both behaving as predicted (threshold trip → `[bes2600] decrypt-storm fast-recover: forcing reassoc` log line → mac80211 disassoc → userspace reauth in ≈1 s).

#### Receipts (verbatim)

```
13:47:56 bes2600_wlan: [bes2600] decrypt-storm fast-recover: forcing reassoc
13:47:57 wlan0: associated to cc:ce:1e:2b:74:17 (cross-BSSID, 1 s)
13:49:26 bes2600_wlan: [bes2600] decrypt-storm fast-recover: forcing reassoc
13:49:27 wlan0: associated to c0:25:06:e6:5b:32 (back home, 1 s)
```

`DecryptStormRecoveries: 2` exposed via debugfs at `/sys/kernel/debug/ieee80211/phy0/bes2600/vif_0/status`.

### Patch B — connection-loss-storm bus_reset (Trigger A)

| metric | Phase 7-of-A observed | Phase 4 prediction | Phase 7-of-B observed |
|---|---|---|---|
| api_connection_loss rate | 0.86/h | unchanged | 2 events in ~2 h (≈ 1/h) |
| ConnectionLossStormRecoveries | n/a | trips on 3-in-60s bursts | **0** |
| Threshold trip events | n/a | (when burst occurs) | **0** (events spaced 91 s apart) |

**Verdict: installed but UNTRIGGERED.** The 3-in-60s threshold was never reached (max-cluster observed: 2-in-91s). No false positives, no spurious bus_resets. Predicted delta unobserved; same shape as Patch A's first Phase 7 run.

The threshold may be too conservative for typical event rates (we'd need a true api_connection_loss flood to trip it). Tuning is a future Phase-1 question if more reproductions accumulate.

### Trigger C — AP unprotected-deauth-6 cluster without preceding storm

```
12:53:10.475 → 12:53:11.756 AP fires 17 unprotected-deauth-6 from 5b:32 over 1.3 s
             (2 mgmt-TX no-ack from our chip in the middle)
12:53:12.309 kernel: deauthenticating ... reason 2 = PREV_AUTH_NOT_VALID
12:53:14–15 reauth via 61:b0 → 5b:32, recovery in ~4 s
```

Neither Patch A (zero decrypt-fails preceded) nor Patch B (zero api_connection_loss) fired. Background: AVM Fritz!Boxes (newton) are reliable; the AP correctly classified ohm's frames as Class 2 from non-auth, meaning **bes2600 sent something the AP couldn't authenticate**. New backlog entry: `notes/observed-bugs.md` Bug #5 (RX path under throughput pressure) is the leading hypothesis surface.

Recovery was fast (4 s) so this isn't a P0, but a Patch C investigation is warranted when prioritized.

---

## Bug #5 — RX path degradation under attempted-throughput pressure (NEW)

```
sender 1 MB/s → ohm receives 1015 KB/s, -57 dBm, RX MCS 4
sender 4 MB/s → ohm receives 563 KB/s, -67 dBm, RX MCS 3
```

Higher attempted throughput on the sender side → LOWER observed throughput at ohm. Signal degraded ~10 dB, MCS dropped a notch. Link-physical max is ~8 MB/s; we're getting ~7 % of that under load.
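
The ~7 % figure can be checked from the reported 65 Mb/s link bitrate. A short worked calculation, assuming 1 KB = 1000 bytes and ignoring protocol overhead (so the real achievable ceiling is lower than the number used here):

```python
LINK_BITRATE_MBIT = 65  # reported link bitrate, Mb/s
observed_kb_s = {"1 MB/s cap": 1015, "4 MB/s cap": 563}

# 65 Mb/s -> 8125 KB/s, ignoring protocol overhead
link_max_kb_s = LINK_BITRATE_MBIT * 1000 / 8

for label, kb_s in observed_kb_s.items():
    share = kb_s / link_max_kb_s
    print(f"{label}: {kb_s} KB/s = {share:.1%} of link bitrate")
# prints:
# 1 MB/s cap: 1015 KB/s = 12.5% of link bitrate
# 4 MB/s cap: 563 KB/s = 6.9% of link bitrate
```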

**Hypothesis (Markus): driver/firmware locks itself to death under busy reads.** Plausibly the same root-cause as the Phase 0 YouTube DASH chunk-fetch drops (~10 frames per chunk fetch on hardware-decoder playback). Documented as Bug #5 in `notes/observed-bugs.md`.

---

## Lessons captured for memory (Phase 8 anchor)

1. **Stress-rate matters for verification.** Patch A's predicted delta only became observable when the netcat cap went 1 → 4 MB/s. The previous Phase 7 (10h30m @ 1 MB/s) saw zero decrypt-storms. Future Phase 7 protocols should plan a stress ramp from steady to near-saturation, not just the steady setting.
2. **"Untriggered, no harm" is a valid Phase 7 verdict** for installed patches. Patch B fits this exactly. The patch is ready; the trigger pattern just doesn't fire often enough in this RF / load regime to verify the recovery delta. Don't let unobserved verifications block the loop.
3. **Build infrastructure on `cleanups` not `mobian`.** The Phase 6 attempt to base Patch B on mobian forced a refactor mid-flight; the c-stack lives on cleanups, and re-using c5.2's `bes2600_chrdev_do_bus_reset` requires that. The cleanups branch is the campaign's working trunk.
4. **AP-side bug is unlikely on AVM hardware.** AVM Fritz!Boxes don't fire spurious deauth-6 storms. When ohm sees AP-deauth-6 unprovoked, the suspect chain is bes2600 sending something the AP can't authenticate. The bias toward "bes2600 is the broken thing" is empirically validated.
5. **AP-deauth-6 can fire without our local triggers.** Trigger C is a real failure mode neither Patch A nor B addresses. Adding a Phase-1-style metric for "AP-deauth-6 rate without preceding decrypt-storm or api_connection_loss" would surface Trigger C cleanly.
6. **`pv -L` cap interacts with TCP retransmit recovery.** When the link can't sustain the cap, TCP backs off and pv blocks. Observed throughput is then a **floor on chip RX capacity at that signal level**, not the sender's intent. Useful for chip-load-characterization, but the cap should be set based on observed pull-rate, not on the link's nominal MCS rate.
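
The metric proposed in lesson 5 could be prototyped offline before any driver change. The sketch below assumes the logs have already been reduced to time-ordered (timestamp, kind) tuples; the event labels and the 120 s lookback are hypothetical choices, the latter picked so that a chain like the ~86 s blackhole seen after a connection loss would still count as provoked.

```python
def unprovoked_deauth6(events, lookback_s=120.0):
    """Return timestamps of AP-deauth-6 events with no local trigger before them.

    events: time-ordered list of (timestamp_s, kind), where kind is one of
    "deauth6", "decrypt_storm", "connection_loss" (labels are hypothetical).
    """
    local_triggers = {"decrypt_storm", "connection_loss"}
    last_trigger = None
    unprovoked = []
    for ts, kind in events:
        if kind in local_triggers:
            last_trigger = ts
        elif kind == "deauth6":
            if last_trigger is None or ts - last_trigger > lookback_s:
                unprovoked.append(ts)
    return unprovoked
```

A deauth-6 cluster arrives as many frames within about a second, so a real version would first collapse each cluster into one event before counting a per-hour rate.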

---

## Loop status

- Phase 7: closed.
- Patch A: confirmed (N=2). Stays in.
- Patch B: installed, dormant in this regime, no harm. Stays in.
- Bug #5: backlog, no patch yet. Documented.
- Trigger C: backlog candidate, no patch yet. Documented.

Next campaign cycle would be re-anchoring Phase 0 around Bug #5 or Trigger C.