notes: Bug #5 Phase 1 metric + Phase 0 anchor receipts

Phase 0 anchored at N=3 reps (10min @ 4MB/s pv-cap on 2.4GHz):
- rep1+2: ~700 KB/s sustained (10% of link capacity)
- rep3: link death at ~9 min in (passive mode, beacon-loss cascade)

Hot symbol identified: _raw_spin_unlock_irqrestore at ~20% CPU in both healthy and failed reps, callstack process_one_work → wsm_configuration → wsm_cmd_send → bes2600_bh.isra.0 → spin-unlock.

Phase 1 metric locked: ≥2 MB/s sustained throughput, <10% CPU in lock-cycling, no link death under 30 min continuous load.

Three Phase 4 candidates drafted (B5-1: shrink wsm_cmd_send lock scope; B5-2: coalesce vif_list_lock in BH dispatcher; B5-3: SPSC ringbuffer for WSM commands). Locking pending review.
notes: phase 7 verdict — Patch A confirmed, Patch B dormant
5 changed files with 531 additions and 0 deletions
@@ -82,6 +82,49 @@ without board power-cycle").
 **Status**: task c3 (indirectly, via bes_chardev removal which currently
 gates the signal/nosignal mode switch path).
 ## Bug #5 — RX path degrades under attempted-throughput pressure
 **Suspect file**: bes2600 RX path (`txrx.c bes2600_rx_cb`, `bh.c bes2600_bh_work`,
 SDIO RX scheduling) — pinpoint pending.
 **Symptom (observed 2026-05-07 13:43, srcversion `1B3B3ED0` = c-stack +
 Patch A + Patch B, ohm @ -57 dBm 2.4GHz ch11 5b:32, idle save for the
 netcat load):**
 ```
 sender cap 1 MB/s  →  ohm receives 1015 KB/s,  signal -57 dBm,  RX MCS 4
 sender cap 4 MB/s  →  ohm receives  563 KB/s,  signal -67 dBm,  RX MCS 3
                     (Send-Q on boltzmann backed up to 1.16 MB)
 ```
 Pushing the sender-side cap from 1 MB/s to 4 MB/s **decreased** observed
 throughput at the receiver and degraded the link metrics. Signal dropped
 ~10 dB and the chip downshifted MCS, suggesting the chip can't sustain
 the higher RX rate even with the link physically capable of more (link
 bitrate 65 Mb/s = ~8 MB/s theoretical).
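A quick arithmetic check of the capacity figures above, as a sketch. It assumes decimal units (1 MB = 1000 KB), which the notes do not state, and it ignores all protocol overhead, so the percentages are upper-level estimates only. The function names are illustrative, not part of any tool used in the campaign.

```python
# Assumption: decimal units (1 MB = 1000 KB). The notes do not say whether
# "KB/s" means 1000 or 1024 bytes per second.
LINK_BITRATE_MBIT = 65  # link bitrate quoted in the notes, in Mb/s


def link_capacity_kb_per_s(bitrate_mbit: float) -> float:
    """Convert a PHY bitrate in Mb/s to KB/s, ignoring protocol overhead."""
    return bitrate_mbit * 1000 / 8


def utilisation(observed_kb_per_s: float,
                bitrate_mbit: float = LINK_BITRATE_MBIT) -> float:
    """Fraction of the theoretical link capacity actually delivered."""
    return observed_kb_per_s / link_capacity_kb_per_s(bitrate_mbit)


if __name__ == "__main__":
    # Observed receive rates from the symptom block above.
    for cap, observed in (("1 MB/s", 1015), ("4 MB/s", 563)):
        print(f"sender cap {cap}: {utilisation(observed):.1%} of capacity")
```

With these assumptions the 4 MB/s case lands at roughly 7 % of the theoretical link rate, and the 1 MB/s case at roughly 12 %.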
 **Hypothesis (Markus, 2026-05-07): driver/firmware locks itself to death
 under busy reads** — possibly a busy-wait loop or lock contention on the
 RX SDIO path that prevents draining at line rate. Plausible reason it
 didn't surface for the c-stack tasks: those operated at typical
 browse-rate traffic, well below the saturation threshold this bug needs
 to fire.
 **May explain**: original Phase-0 observation that **YouTube DASH chunks
 drop ~10 frames per chunk fetch** on hardware-decoder playback. A chunk
 fetch is a brief burst at near-link-rate; if the driver throttles itself
 down during high-RX, the player buffer underruns for the duration of
 the fetch.
 **How to drill (when prioritized)**:
 - Capture trace_pipe with `mmc:*` and `sdio*` events enabled during a
  controlled rate-ramp (e.g., pv -L 500K, 1M, 2M, 4M each for 60 s).
 - Watch `/proc/sys/kernel/sched_*` and the `bes2600_bh_work` kworker for
  CPU saturation.
 - `perf top -p $(pgrep -f bes_sdio)` during 4 MB/s load.
 **Status**: backlog. No patch yet.
 ## Bug #4 — scan_complete_cb constant loop
 **File**: `scan.c:883-909` — `bes2600_scan_complete_cb()`.
@@ -0,0 +1,135 @@
 # Bug #5 campaign — Phase 1 metric + Phase 2 situation
 Date assembled: 2026-05-07
 Module under test (baseline): bes2600.ko srcversion `1B3B3ED096AAD7217FEDE11`
                              (cleanups + Patch A + Patch B)
 Phase 0 anchor at N=3 reps (10 min each, 4 MB/s pv-cap on boltzmann → ohm
 2.4GHz 5b:32) reproduces the throughput regression and traces it to
 lock-cycling cost in the bes2600 BH path. See `notes/observed-bugs.md` Bug #5
 for the original report. This document locks Phase 1 and prepares Phase 4
 candidates.
 ---
 ## Phase 1 — measurable target (locked)
 > *Reduce the bes2600 BH path's spin-unlock-irqrestore cost so that under sustained 4 MB/s sender pressure on a healthy 2.4GHz link (signal -55 to -65 dBm), ohm sustains ≥ 2 MB/s observed RX throughput (vs 663–725 KB/s in the two healthy baseline reps; rep3 collapsed to 75 KB/s) AND the link survives ≥ 30 min continuous load without cascading into beacon-loss disconnect (vs rep3's failure at ~9 min).*
 Three measurable outcomes, single sentence:
 - **(a) Throughput floor**: ≥ 2 MB/s sustained RX at ohm
 - **(b) Lock-cycle ceiling**: % CPU in `_raw_spin_unlock_irqrestore` from `bes2600_bh.isra.0` callstack drops to < 10 % (currently ~10 % rep1, ~16 % rep3)
 - **(c) Cascade prevention**: no link death under continuous 30 min @ 4 MB/s
 ---
 ## Phase 0 anchor — receipts
 ### Reproduction protocol (same units as Phase 7 will use)
 1. boltzmann: `pv -L 4M -q < /dev/zero | nc ohm.fritz.box 12345`
 2. ohm: `sudo $RUN/rep-trace.sh 600` (10 min capture window)
 3. Rep dirs: `bug5/rep-<ts>/{mmc.log, perf.data, rx_bytes.tsv, start.txt, end.txt}`
 4. N=3 reps with 60 s cooldowns
 ### Observed (2026-05-07 15:36–16:08)
 | rep | duration | avg KB/s | near-zero ticks | end state |
 |---|---|---|---|---|
 | 1 | 600 s | 725 | 1/119 | associated, MCS 6 |
 | 2 | 600 s | 663 | 5/119 | associated, MCS 4 |
 | 3 | 600 s |  75 | 53/119 | **passive (link died at sample ~82, ~9 min in)** |
 mmc transaction rate: rep1 = 5793/s sustained, rep3 = 6000/s for first ~10s then collapse to <100/s.
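A minimal sketch of how the "avg KB/s" and "near-zero ticks" columns could be derived from a cumulative RX byte counter. The layout of `rx_bytes.tsv` is not shown in these notes, so the `(timestamp, cumulative_bytes)` input shape, the 10 KB/s near-zero cutoff, and the function name are all assumptions. The 119 ticks per 600 s rep suggest a cadence of about 5 s, but that is inferred, not stated.

```python
from typing import List, Tuple


def summarize_rx(samples: List[Tuple[float, int]],
                 near_zero_kb_s: float = 10.0) -> Tuple[float, int, int]:
    """Summarize cumulative RX byte-counter samples.

    samples: (timestamp_seconds, cumulative_rx_bytes) pairs in time order
             (hypothetical format, the real rx_bytes.tsv may differ).
    Returns (average KB/s over the run, near-zero tick count, tick count).
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    rates = []
    for (t0, b0), (t1, b1) in zip(samples, samples[1:]):
        dt = t1 - t0
        if dt <= 0:
            raise ValueError("timestamps must be strictly increasing")
        rates.append((b1 - b0) / dt / 1000.0)  # per-tick rate in KB/s
    total_dt = samples[-1][0] - samples[0][0]
    avg = (samples[-1][1] - samples[0][1]) / total_dt / 1000.0
    near_zero = sum(1 for r in rates if r < near_zero_kb_s)
    return avg, near_zero, len(rates)
```

Averaging over the whole run (rather than averaging the per-tick rates) keeps the result correct even if the sampling cadence jitters.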
 ### Hot-symbol receipts (perf record on `bes_sdio | bes2600_bh` kworkers)
 | symbol | rep 1 (healthy) | rep 3 (cascade) |
 |---|---|---|
 | `_raw_spin_unlock_irqrestore` (sum across kworker variants) | **~19 %** | **~21 %** |
 | `handle_softirqs` | 5.4 % | 4.3 % |
 | `__tasklet_schedule` | 2.4 % | 2.0 % |
 | `dw_mci_start_command` (SDIO host) | 1.5 % | < 0.6 % |
 | `bes2600_sw_retry_requeue` | 0.79 % | 0.70 % |
 Top callchain leading to `_raw_spin_unlock_irqrestore`:
 ```
 worker_thread → process_one_work → wsm_configuration → wsm_cmd_send → bes2600_bh.isra.0
 ```
 ---
 ## Phase 2 — situation analysis
 ### Relevant source pins
 - `bes2600/wsm.c:98` — `wsm_cmd_send()`, the function in the hot callstack. Body:
  - holds `&hw_priv->wsm_cmd.lock` (spin) only briefly to fill the cmd struct (lines ~145–152)
  - calls `bes2600_bh_wakeup()` then `wait_event_interruptible_timeout` for the response
  - outer lock: `down(&hw_priv->wsm_cmd_sema)` from callers (`wsm_cmd_lock()` at wsm.c:105)
 - `bes2600/bh.c:435,559,847` — `bes2600_bh_work()` takes `&hw_priv->vif_list_lock` 3× per pass through the dispatcher
 - `bes2600/bh.c:172,195,487,581,861,1361,1427` — multiple `wait_event*` calls; the BH thread cycles through wait/wake/dispatch
 ### What the lock-cycling cost is buying
 Each WSM command from the host (mac80211, NM, kernel scan etc.) goes through the same path:
 1. caller acquires `wsm_cmd_sema` (outer)
 2. `wsm_cmd_send()` acquires `wsm_cmd.lock`, fills the struct, releases the lock
 3. `bes2600_bh_wakeup()` schedules the BH
 4. BH dispatches the command, takes `vif_list_lock` to walk vifs
 5. BH talks to chip via SDIO
 6. response arrives, BH wakes the waiter
 7. caller releases `wsm_cmd_sema`
 Under sustained TCP RX, mac80211 issues lots of small WSM commands (TX-scheduling hints, rate updates, etc.) — every one cycles through this path. The spin-unlock-irqrestore cost is the floor on this cycle rate.
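The seven steps above can be modelled in userspace to make the per-command cost visible. This is a toy sketch, not the driver: Python threading primitives stand in for `wsm_cmd_sema`, `wsm_cmd.lock`, `vif_list_lock` and `bes2600_bh_wakeup()`, and the class, counters and `run_commands` helper are hypothetical. It only illustrates that lock cycles scale one-for-one with command count, whatever the payload size.

```python
import threading


class WsmPathModel:
    """Toy model of the command path; all names are illustrative."""

    def __init__(self):
        self.cmd_sema = threading.Semaphore(1)    # stands in for wsm_cmd_sema
        self.cmd_lock = threading.Lock()          # stands in for wsm_cmd.lock
        self.vif_list_lock = threading.Lock()     # stands in for vif_list_lock
        self.bh_wakeup = threading.Event()        # stands in for bes2600_bh_wakeup()
        self.response_ready = threading.Event()   # response notification
        self.pending = None
        self.response = None
        self.host_lock_cycles = 0
        self.bh_lock_cycles = 0
        self.stopping = False

    def send(self, cmd, timeout=2.0):
        with self.cmd_sema:                       # steps 1 and 7
            with self.cmd_lock:                   # step 2
                self.pending = cmd
                self.host_lock_cycles += 1
            self.response_ready.clear()
            self.bh_wakeup.set()                  # step 3
            if not self.response_ready.wait(timeout):
                raise TimeoutError(f"no response for {cmd!r}")
            return self.response

    def bh_loop(self):
        while True:
            self.bh_wakeup.wait()
            self.bh_wakeup.clear()
            if self.stopping:
                return
            with self.cmd_lock:                   # fetch the pending command
                cmd, self.pending = self.pending, None
            with self.vif_list_lock:              # step 4
                self.bh_lock_cycles += 1
            self.response = ("ack", cmd)          # step 5 (chip I/O stand-in)
            self.response_ready.set()             # step 6

    def stop(self):
        self.stopping = True
        self.bh_wakeup.set()


def run_commands(n):
    """Send n commands through the model and return (model, replies)."""
    model = WsmPathModel()
    bh = threading.Thread(target=model.bh_loop)
    bh.start()
    replies = [model.send(i) for i in range(n)]
    model.stop()
    bh.join()
    return model, replies
```

In this model every command costs at least one host-side and one BH-side lock cycle, so halving the command rate would halve the cycle count; the real driver's ratio depends on code not reproduced here.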
 ### What's been ruled out
 - AP-side bug (AVM Fritz!Box, reliable per Markus's testimony — see `feedback_phase7_stress_ramp.md` reasoning in the campaign-so-far prior).
 - Patch A / Patch B (target different triggers; would not address lock cost).
 - Decrypt-failure storm (Patch A handles; this regression occurs in rep1 with zero decrypt-fails).
 - mac80211 scan-fail / scan-comeback (cosmetic; doesn't account for the throughput floor).
 ---
 ## Phase 4 — candidate plans (preliminary, not locked)
 Three candidates surfaced from the perf data. Listed cheapest to most-invasive.
 ### B5-1 — Reduce `wsm_cmd_send` lock scope
 The spin_lock around the cmd-struct fill (wsm.c:145–152) can probably be replaced with `WRITE_ONCE` of a single sentinel field, since the BH thread reads these fields cooperatively (BH only reads after `bh_wakeup` schedules it, and only writes back via the response path). Eliminates the per-command spin-lock cycle for the host side.
 **Risk**: race with BH if the protocol isn't strictly happens-before. Need to read bh.c:486-500 (where BH reads wsm_cmd.ptr) carefully.
 **Predicted delta**: small but measurable. Maybe 2–3 % CPU off the lock floor.
 ### B5-2 — Coalesce `vif_list_lock` in BH dispatcher
 bh.c takes `vif_list_lock` 3× per dispatcher pass. If these 3 critical sections are within a single iteration, they should be merged into one acquire/release.
 **Risk**: vif teardown might depend on releasing the lock between iterations to allow concurrent vif removal. Needs careful audit.
 **Predicted delta**: significant under multi-vif (we're single-vif STA today, so probably small immediate gain).
 ### B5-3 — Move WSM send out of process context, use ringbuffer
 Replace the wsm_cmd_sema + wsm_cmd struct mechanism with an SPSC ringbuffer between caller and BH. Caller writes to ring, no lock needed (one producer); BH reads from ring, no lock needed (one consumer).
 **Risk**: significant rework. cw1200 ancestor doesn't have this; we'd be inventing it.
 **Predicted delta**: large — could halve the lock cost. But cost-to-implement is also large.
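For orientation, a minimal single-producer/single-consumer ring in Python. This is a sketch of the general technique only: the class and method names are hypothetical, and a kernel version would additionally need ordered index updates (for example `smp_store_release()` / `smp_load_acquire()`), which Python does not model.

```python
from typing import Any, List, Optional


class SpscRing:
    """Fixed-size single-producer/single-consumer ring.

    Only the producer writes `head`; only the consumer writes `tail`.
    One slot is kept empty so that head == tail always means "empty".
    None is reserved as the "empty" return value and cannot be stored.
    """

    def __init__(self, capacity: int):
        if capacity < 1:
            raise ValueError("capacity must be >= 1")
        self._size = capacity + 1
        self._slots: List[Optional[Any]] = [None] * self._size
        self.head = 0  # next slot to write (producer-owned)
        self.tail = 0  # next slot to read (consumer-owned)

    def try_push(self, item: Any) -> bool:
        """Producer side: return False instead of blocking when full."""
        nxt = (self.head + 1) % self._size
        if nxt == self.tail:
            return False
        self._slots[self.head] = item
        self.head = nxt  # publish only after the slot is written
        return True

    def try_pop(self) -> Optional[Any]:
        """Consumer side: return None when the ring is empty."""
        if self.tail == self.head:
            return None
        item = self._slots[self.tail]
        self._slots[self.tail] = None
        self.tail = (self.tail + 1) % self._size
        return item
```

The lock-free property holds only while exactly one context pushes. If several callers can submit WSM commands concurrently, the ring would need either one ring per caller or a producer-side lock, which would give back part of the predicted gain.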
 ---
 ## Open question for the reviewer
 Which Phase 4 candidate to lock? My ranking by ROI:
 1. B5-1 (smallest, fastest, cleanest source pin) — try first
 2. B5-2 (medium, conditional on multi-vif applicability)
 3. B5-3 (largest rework, biggest potential)
 Or: instrument deeper before committing to a fix (e.g., add tracepoints around `wsm_cmd_send` enter/exit to measure the lock hold-time distribution, not just CPU %).
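If the deeper-instrumentation route is taken, the post-processing could look like the sketch below. The `(timestamp_ns, "enter" | "exit")` event shape is an assumption, since no tracepoints exist yet, and the pairing logic assumes events from a single thread; interleaved callers would need keying by thread id.

```python
import math
from typing import Iterable, List, Tuple


def pair_hold_times(events: Iterable[Tuple[int, str]]) -> List[int]:
    """Pair each "enter" with the next "exit"; return durations in ns."""
    durations = []
    entered_at = None
    for ts, kind in events:
        if kind == "enter":
            entered_at = ts
        elif kind == "exit" and entered_at is not None:
            durations.append(ts - entered_at)
            entered_at = None  # unmatched exits after this are ignored
    return durations


def percentile(values: List[int], pct: float) -> int:
    """Nearest-rank percentile; pct in (0, 100]."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]
```

Reporting p50, p99 and max separates "many cheap cycles" from "a few long holds", which CPU % alone cannot distinguish.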
@@ -0,0 +1,104 @@
 # BES2600 WiFi-stability campaign — Phase 4 plan artifact
 Date assembled: 2026-05-06
 Run dir: /root/bes2600-samples/run-20260506-0659-fresh/ on ohm
 Module under test: bes2600.ko srcversion 461AFB369355AE598D79BDF (c-final + c5.2.1)
 This is the Phase 4 plan hand-off. Per project CLAUDE.md: paste verbatim, do not curate.
 Drafted after the Phase 5 review of the artifact merged as PR #3 (notes/phase5-2026-05-06.md). Reviewer feedback and a new in-rig finding (Trigger A pinned to mac80211 `api_connection_loss`) are folded into the Phase 1 revision below.
 ---
 ## Phase 1 — revised metric (folds review feedback + Trigger-A discovery)
 > Per hour of operation, count three event classes:
 >
 > (a) `WSM_STATUS_DECRYPTFAILURE` bursts (≥4 events / 60 s)
 > (b) mac80211 `api_connection_loss` events
 > (c) AP-side unprotected-deauth-reason-6 frames
 >
 > For each, also report the conditional probability that the event escalates to a recovery blackhole > 5 s.
 Reviewer feedback applied:
 - "Also count AP-deauth" → class (c) added
 - "N=1 idle is fine" → no further idle reps required to lock
 - AP-side capture not needed → confirmed; no Fritz!Box logging required
 New finding since the merged Phase 5 artifact (today, ftrace-instrumented suspend/resume reps):
 - **Trigger B (decrypt storm path)**: `decrypt-fail × N → AP unprotected-deauth-6 → kernel local reason-2 deauth`. Receipts: 07:13 (77 events / 24 s), 11:03 (8 events / 9 s), 12:00–13:28 load run (12 bursts).
 - **Trigger A (beacon-loss path)**: `mac80211 api_connection_loss → kernel local reason-2 deauth`. Receipts: 17:23 and 18:03 today, both following ftrace `api_connection_loss`; yesterday 22:33 (presumed same path; not instrumented at the time).
 ---
 ## Phase 4 — Plan
 ### Patch A — Decrypt-storm fast-recover (Trigger B)
 #### What will be touched
 - `bes2600/txrx.c`, at the `bes_warn` for `WSM_STATUS_DECRYPTFAILURE / goto drop` site (currently line 1696).
 - A new sliding-window counter on `bes2600_common` (or equivalent struct) tracking decrypt-fail timestamps.
 - On threshold (proposed: ≥5 within 5 s), `schedule_work(&priv->reassoc_work)` that calls `ieee80211_connection_loss(vif)` so mac80211 enters its clean-reassoc path.
 - A small struct field for the counter, plus init in probe and reset on assoc.
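A userspace model of the proposed sliding-window counter, useful for replaying recorded decrypt-fail timestamps against candidate thresholds before writing the driver code. The class name and the clear-on-trip behaviour are assumptions; a kernel implementation would likely keep jiffies-based timestamps rather than float seconds.

```python
from collections import deque


class SlidingWindowTrip:
    """Trip when `threshold` events fall within `window_s` seconds."""

    def __init__(self, threshold: int, window_s: float):
        self.threshold = threshold
        self.window_s = window_s
        self._stamps = deque()

    def record(self, now: float) -> bool:
        """Record one event at time `now`; return True when the window trips."""
        self._stamps.append(now)
        # Drop timestamps that have aged out of the window.
        while self._stamps and now - self._stamps[0] > self.window_s:
            self._stamps.popleft()
        if len(self._stamps) >= self.threshold:
            self._stamps.clear()  # re-arm after a trip (assumed behaviour)
            return True
        return False

    def reset(self) -> None:
        """Forget all recorded events, e.g. on a fresh association."""
        self._stamps.clear()
```

Replaying with `SlidingWindowTrip(5, 5.0)`, a steady one-per-minute event stream never trips, while five events in consecutive seconds trip on the fifth.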
 #### What will NOT be touched
 - `bes2600/bes2600_sdio.c` bus-level paths (no SDIO change).
 - Any of the c5.x or c7 stacks (PM, scan defer, LMAC monitor, multi-func reset).
 - Firmware. The fix is host-side recovery, not chip- or AP-side.
 - mac80211 / cfg80211 core. Only `ieee80211_connection_loss` is called; no kernel API addition.
 #### Predicted delta on Phase 1 metric (same units as Phase 3 receipts)
 - Decrypt-burst rate **(a)**: UNCHANGED. We don't address the root cause of why decrypts fail; we only catch the storm earlier.
 - AP-deauth-6 rate **(c)**: DECREASES toward zero, because we pre-empt the AP by initiating a clean reassoc before the AP fires its unprotected deauth. Predicted: c_after / c_before ≤ 0.2.
 - Conditional probability of >5 s blackhole given a burst: DECREASES from current 100 % (idle baseline, N=1) toward ≤ 10 %. Recovery time falls from 109 s (worst observed) to <5 s.
 Dimensions match Phase 3's idle/load comparison table; numbers are predictions, to be verified in Phase 7.
 #### API contracts to confirm before writing code
 Per project CLAUDE.md "contract before code":
 - `ieee80211_connection_loss(vif)` — semantics + caller-context constraints. Header: `linux/mac80211.h`. Must be called from process context (work item is fine), must NOT be called from interrupt or with rx-skb lock held.
 - `bes2600_vif` / `bes2600_common` struct fields available for the counter — counter must be safe to update from the `wsm_handle_rx` path.
 - cw1200 / cw1260 ancestor: any pre-existing storm-recovery logic? If yes, follow that pattern; if no, this is a clean addition.
 - Existing bes2600 work-item plumbing (e.g., `bes2600_chrdev_do_bus_reset` from c5.2) — same shape, same allocation rules.
 These will be cited in the commit message body and in the patch header comment per Phase 6 rules.
 #### Risk
 - If `ieee80211_connection_loss` is called too aggressively, normal occasional decrypt fails (e.g., one-off MIC failures on bad RX) could trigger spurious reassocs. Threshold (5 in 5 s) is chosen to be stricter than the steady-state decrypt-fail rate observed (60+/h under load ≈ 1/min, never 5/5 s outside a true storm).
 - If the chip's RX path is the actual cause of the storm, the reassoc will hit the same chip-level issue. The patch may move the symptom from "stuck for 109 s" to "rapidly cycling reassocs". That itself would be visible in Phase 7 measurement.
 ---
 ### Patch B — Beacon-loss fast-recover (Trigger A) — PARKED
 Not part of this Phase 4. Locked behind one more diagnostic rep:
 - Add to the snap loop (rig is live): track wlan0 station-dump `beacon loss` counter at 10-second cadence (currently 60 s). Want to see the per-tick increase before `api_connection_loss` fires.
 - Goal: distinguish "chip silently drops beacons" from "real beacon loss in the air" before committing to a host-side patch.
 - Requires no new instrumentation install — just a snap-loop cadence change. Estimate two reps with this finer cadence will make the picture clean enough.
 - Once that data lands, Patch B becomes its own Phase 4 plan + Phase 5 review.
 ---
 ## Receipt checklist for Phase 4
 - [x] What will and will not be touched: stated above
 - [x] Predicted delta in Phase 3 units: stated above (a/c rate predictions, conditional-probability prediction, recovery-time prediction)
 - [x] Out-of-scope items explicitly listed
 - [x] Risk items explicitly listed
 ---
 ## Asks of the reviewer
 1. Is the threshold (≥5 decrypt-fails in 5 s) the right shape? Should it be more conservative (≥10 in 10 s)? More aggressive (≥3 in 3 s)? The 12 observed bursts ranged from 4 to 9 events per 60 s window (the Phase 1 looser definition). The patch threshold will fire on the same bursts under any of those choices; pick the one most defensible against false positives.
 2. Is `ieee80211_connection_loss(vif)` the right kernel API? Alternative: `cfg80211_disconnected` with a reason code. Which is cleaner per mac80211 contract for a host-driven preemptive reassoc?
 3. Should Patch A include a debugfs counter exposing how many storms it has caught, so Phase 7 verification has a host-side counter rather than relying on journal grep alone?
 4. Patch B parked correctly, or fold it into this same Phase 4?
@@ -0,0 +1,153 @@
 # BES2600 WiFi-stability campaign — Phase 4 plan (Patch B / Trigger A)
 Date assembled: 2026-05-07
 Run dir: /root/bes2600-samples/run-20260506-2113-patchA/ on ohm
 Module: bes2600.ko srcversion 21BD07B3782B144D478CE43 (c-stack + Patch A merged)
 This is the Phase 4 plan for **Patch B (Trigger A: beacon-loss / mac80211 `api_connection_loss` chain)**, drafted after the Phase 7 verification of Patch A on 2026-05-07. Per project CLAUDE.md: paste verbatim, do not curate.
 ---
 ## What changed since the merged Patch-A plan (`notes/phase4-2026-05-06.md`)
 Patch A is **landed (PR #1)** and **active on ohm** (srcversion `21BD07B3`). Phase 7 verification:
 ```
 duration:                     10h30m sustained 1 MB/s load on 2.4GHz (5b:32)
 DecryptStormRecoveries:       0
 Decrypt-fails total:          183  (~1 every 3.5 min — never bursted ≥5/5s)
 api_connection_loss events:   9    ← Trigger A
 unprotected deauth (AP):      7    ← AP-deauth-6 cluster at 02:42:11
 mac80211 reason 4 deauth:     yes  ← inactivity (Trigger A flavor)
 mac80211 reason 2 deauth:     no   ← what Patch A handles
 ```
 **Patch A's predicted delta is unobserved** (no decrypt-storm fired during Phase 7). Patch A is dormant but caused no harm. This Phase 4 pivots to **Trigger A** — the dominant failure path in the overnight rep.
 ---
 ## Phase 1 — revised metric (Trigger-A scope)
 > Per hour of operation: count `mac80211 api_connection_loss` events and the conditional probability that each escalates into a > 5 s reauth blackhole (assoc-comeback timeouts followed by AP unprotected-deauth-6 cluster).
 Observed rate from the Phase 7 rep: 9 events over 10h30m = **0.86/h** under sustained load. Not all of them escalated to a visible blackhole — some apparently recovered cleanly. But the 02:42 cluster (1/9 = 11 % escalation rate in this rep) shows the catastrophic shape.
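The two figures above follow from simple ratios; a small sketch so later reps can be reported in the same units. The helper names are illustrative only.

```python
def events_per_hour(count: int, hours: int, minutes: int = 0) -> float:
    """Event rate over a run of the given duration."""
    duration_h = hours + minutes / 60
    if duration_h <= 0:
        raise ValueError("duration must be positive")
    return count / duration_h


def escalation_rate(escalated: int, total: int) -> float:
    """Share of events that escalated into a blackhole."""
    if total <= 0:
        raise ValueError("need at least one event")
    return escalated / total


if __name__ == "__main__":
    print(f"{events_per_hour(9, 10, 30):.2f}/h")   # 9 events over 10h30m
    print(f"{escalation_rate(1, 9):.0%}")          # 1 of 9 escalated
```

With a single escalation in nine events, the escalation share is a point estimate from one rep and should be read with that in mind.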
 ---
 ## Phase 0 / 3 receipt — the 02:42 chain (verbatim from iw-event)
 ```
 02:40:32  scan started, full-band
 02:40:34  scan aborted
 02:40:45  del station 5b:32
 02:40:45  kernel: deauth 8a:2e:77:1f:ec:05 → 5b:32 reason 4
          (Disassociated due to inactivity)        ← TRIGGER A
 02:40:45  cfg80211: disconnected (local request) reason 4
 02:40:45  scan started → finished: 2462 2412, "newton"
 02:40:45  new station 61:b0
 02:40:45  AP→ohm: auth status 0: Successful
 02:40:45  AP→ohm: assoc comeback bssid 61:b0 timeout 1000  ← BSS load mgmt
 02:40:47  del station 61:b0
 02:40:47  assoc: timed out
 02:40:47  scan → new station cc:ce:1e:2b:74:17
 02:40:48  auth: timed out
 02:40:49  scan → new station 5b:32 (back to where we started)
 02:40:49  AP→ohm: auth status 0
 02:40:49  AP→ohm: assoc comeback bssid 5b:32 timeout 881
 02:40:51  assoc: timed out
 02:42:11  ── AP-deauth-6 cluster (×7 within 1 ms) from 61:b0 ──
          reason 6: Class 2 frame received from non-authenticated station
 02:42:11  reason 9 = STA_REQ_ASSOC_WITHOUT_AUTH
 ```
 **Net**: ~86 s in reauth-blackhole, recovery via cross-channel fallback. Same shape as Phase 3's 11:03 blackhole (~109 s), but trigger here is **inactivity-deauth → assoc-comeback rejection**, not decrypt-storm.
 ---
 ## Hypothesis on the mechanism
 Three plausible chains for why post-`api_connection_loss` reauth blackholes:
 1. **AP's assoc-comeback timer disrespected.** The AP says "wait 1000 TU before retrying", but mac80211 / wpa_supplicant retries fast. AP keeps deferring; eventually a stale frame triggers the AP's "Class 2 from unauth" reaction.
 2. **Driver state stale across deauth.** After mac80211 fires `ieee80211_connection_loss`, the bes2600 driver's per-vif state (link_id, key state, queues) may not be fully scrubbed. Subsequent reassoc starts with mixed state; AP rejects.
 3. **Chip-level RX wedge.** The chip's RX state machine got stuck during the inactivity period; reauth sends out frames, but RX of AP's responses is lossy. mac80211 perceives timeout when actually frames arrived but were dropped. Hard to verify without monitor mode (which the chip doesn't support concurrent with managed).
 Each hypothesis suggests a different fix surface.
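To test hypothesis 1 against a log, the comeback timeouts need converting from 802.11 time units to wall-clock time (1 TU = 1024 µs). A minimal sketch; the helper names are hypothetical, and the iw-event receipts above only have 1 s resolution, so sub-second verdicts would need finer-grained timestamps.

```python
TU_MICROSECONDS = 1024  # one 802.11 time unit


def tu_to_seconds(tu: int) -> float:
    """Convert a duration in time units (TU) to seconds."""
    return tu * TU_MICROSECONDS / 1_000_000


def retry_respects_comeback(comeback_tu: int, retry_delay_s: float) -> bool:
    """True if a retry sent `retry_delay_s` after the comeback response
    waited at least as long as the AP asked for."""
    return retry_delay_s >= tu_to_seconds(comeback_tu)


if __name__ == "__main__":
    for tu in (1000, 881):  # comeback values from the 02:42 chain
        print(f"{tu} TU = {tu_to_seconds(tu):.3f} s")
```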
 ---
 ## Phase 4 — Plan candidates
 ### Candidate B-1 — Chip-level reset on api_connection_loss flood
 **What touches:**
 - New work-item on `bes2600_common`: `api_connection_loss_recover_work`.
 - mac80211 → driver `disconnect()` op → bump a sliding-window counter.
 - On threshold (e.g., 3 events within 60 s): schedule the work that calls `bes2600_chrdev_do_bus_reset()` (the existing c5.2 LMAC-wedge recovery path).
 - After bus reset, NM auto-reconnects from a fresh chip state.
 **Why this candidate:** reuses the c5.2 infrastructure already deployed; small surface; if hypothesis 3 (chip wedge) is right, this fixes the root cause. If hypothesis 1 or 2 is right, this is overkill but harmless (a brief reset).
 **API contracts to confirm:**
 - `bes2600_chrdev_do_bus_reset()` re-entrancy and worker-context safety
 - mac80211 ops or callbacks around `ieee80211_connection_loss`/`disconnect`
 - cw1200/cw1260 ancestor for any similar pattern
 **Predicted delta (Phase 7 units):**
 - `api_connection_loss` rate: unchanged (we don't address the trigger)
 - conditional escalation to >5 s blackhole: target ≤ 30 % (the target needs to be realistic)
 - worst-case recovery: 86 s → < 10 s
 ### Candidate B-2 — Respect assoc-comeback timer
 **What touches:**
 - Possibly NOT in bes2600 — this looks like a mac80211 / wpa_supplicant concern.
 - If the driver does anything assoc-related itself, audit for racing the comeback timer.
 **Status:** out of scope for a bes2600 patch unless the driver is observed sending frames during the comeback window.
 ### Candidate B-3 — Audit and scrub vif state on disconnect
 **What touches:**
 - `bes2600_unjoin_work` — verify link_id, keys, queues all reset
 - Add explicit reset on `ieee80211_disconnect`/`disconnect` ops
 **Status:** speculative without further instrumentation.
 ---
 ## Lock proposal
 **Lock Candidate B-1 first.** It has:
 - the cleanest re-use (c5.2's bus_reset)
 - the smallest patch surface
 - a measurable predicted delta against Phase 7's escalation rate and worst-case recovery time (the `api_connection_loss` rate itself is predicted unchanged)
 If Phase 7-of-B-1 shows the rate unchanged but escalation still high → loop back to B-2/B-3 hypothesis space.
 ---
 ## What will NOT be touched
 - mac80211 / cfg80211 core — host-side STA driver only
 - The c-stack patches (c5.x, c6.x, c7) — independent recovery paths
 - Patch A — stays in place, untouched
 - AP / firmware
 ---
 ## Receipt checklist for Phase 4
 - [x] What will and will not be touched: stated above
 - [x] Predicted delta in Phase 3 units: stated for Candidate B-1
 - [x] Out-of-scope items explicitly listed
 - [x] Risk items: bus_reset has a known multi-function-SDIO consideration (handled by c5.2.1)
 ## Asks of the reviewer
 1. Candidate B-1 (bus_reset on api_connection_loss flood) the right scope, or should we instrument deeper before committing?
 2. Threshold (3 events / 60 s): too aggressive (false-positive bus_resets on transient RF issues) or about right?
 3. Should bus_reset be conditional on ALSO seeing post-deauth assoc-comeback timeouts, to avoid resetting on benign connection_loss events?
 4. Hypothesis 1 (assoc-comeback disrespected) — is this a mac80211/wpa_supplicant bug rather than a bes2600 bug? If yes, we file it elsewhere.
@@ -0,0 +1,96 @@
 # BES2600 WiFi-stability campaign — Phase 7 verdict (Patches A + B)
 Date assembled: 2026-05-07
 Module under test: bes2600.ko srcversion `1B3B3ED096AAD7217FEDE11`
                   (cleanups + Patch A + Patch B)
 Run dir: `/root/bes2600-samples/run-20260507-1248-patchB/` on ohm
 Phase 7 verification window: 2026-05-07 12:48 → ~15:13 CEST (≈ 2 h 25 m)
 of which: ~50 min @ 1 MB/s pv-cap, ~1 h 30 m @ 4 MB/s pv-cap on 2.4 GHz
 newton (5b:32, signal -57 to -67 dBm).
 ---
 ## Result table (vs the Phase 4 predicted delta)
 ### Patch A — decrypt-storm fast-recover (Trigger B)
 | metric | Phase 3 baseline | Phase 4 prediction | Phase 7-of-B observed |
 |---|---|---|---|
 | decrypt-burst rate | 8.18/h | unchanged | 2 bursts in ~22 min once 4MB/s pressure was on |
 | AP-deauth-6 rate following burst | 100 % escalation | ≤ 0.2 × baseline | **0/2 = 0 % escalation** |
 | recovery time given burst | up to 109 s | < 5 s | **~1 s** (×2) |
 **Verdict: predicted delta CONFIRMED at N=2.** CLAUDE.md ideal is N=3; we're directionally locked at 2 reproductions, both behaving as predicted (threshold trip → `[bes2600] decrypt-storm fast-recover: forcing reassoc` log line → mac80211 disassoc → userspace reauth in ≈1 s).
 #### Receipts (verbatim)
 ```
 13:47:56  bes2600_wlan: [bes2600] decrypt-storm fast-recover: forcing reassoc
 13:47:57  wlan0: associated to cc:ce:1e:2b:74:17     (cross-BSSID, 1 s)
 13:49:26  bes2600_wlan: [bes2600] decrypt-storm fast-recover: forcing reassoc
 13:49:27  wlan0: associated to c0:25:06:e6:5b:32     (back home, 1 s)
 ```
 `DecryptStormRecoveries: 2` exposed via debugfs at `/sys/kernel/debug/ieee80211/phy0/bes2600/vif_0/status`.
 ### Patch B — connection-loss-storm bus_reset (Trigger A)
 | metric | Phase 7-of-A observed | Phase 4 prediction | Phase 7-of-B observed |
 |---|---|---|---|
 | api_connection_loss rate | 0.86/h | unchanged | 2 events in ~2 h (≈ 1/h) |
 | ConnectionLossStormRecoveries | n/a | trips on 3-in-60s bursts | **0** |
 | Threshold trip events | n/a | (when burst occurs) | **0** (events spaced 91 s apart) |
 **Verdict: installed but UNTRIGGERED.** The 3-in-60s threshold was never reached (max-cluster observed: 2-in-91s). No false positives, no spurious bus_resets. Predicted delta unobserved — same shape as Patch A's first Phase 7 run.
 The threshold may be too conservative for typical event rates (we'd need a true api_connection_loss flood to trip it). Tuning is a future Phase-1 question if more reproductions accumulate.
 ### Trigger C — AP unprotected-deauth-6 cluster without preceding storm
 ```
 12:53:10.475 → 12:53:11.756  AP fires 17 unprotected-deauth-6 from 5b:32 over 1.3 s
                              (2 mgmt-TX no-ack from our chip in the middle)
 12:53:12.309  kernel: deauthenticating ... reason 2 = PREV_AUTH_NOT_VALID
 12:53:14–15  reauth via 61:b0 → 5b:32, recovery in ~4 s
 ```
 Neither Patch A (zero decrypt-fails preceded) nor Patch B (zero api_connection_loss) fired. Background: AVM Fritz!Boxes (newton) are reliable; the AP correctly classified ohm's frames as Class 2 from non-auth, meaning **bes2600 sent something the AP couldn't authenticate**. New backlog entry: `notes/observed-bugs.md` Bug #5 (RX path under throughput pressure) is the leading hypothesis surface.
 Recovery was fast (4 s) so this isn't a P0 — but a Patch C investigation is warranted when prioritized.
 ---
 ## Bug #5 — RX path degradation under attempted-throughput pressure (NEW)
 ```
 sender 1 MB/s  →  ohm receives 1015 KB/s,  -57 dBm,  RX MCS 4
 sender 4 MB/s  →  ohm receives  563 KB/s,  -67 dBm,  RX MCS 3
 ```
 Higher attempted-throughput on the sender side → LOWER observed throughput at ohm. Signal degraded ~10 dB, MCS dropped a notch. Link-physical max is ~8 MB/s; we're getting ~7 % of that under load.
 **Hypothesis (Markus): driver/firmware locks itself to death under busy reads.** Plausibly the same root-cause as the Phase 0 YouTube DASH chunk-fetch drops (~10 frames per chunk fetch on hardware-decoder playback). Documented as Bug #5 in `notes/observed-bugs.md`.
 ---
 ## Lessons captured for memory (Phase 8 anchor)
 1. **Stress-rate matters for verification.** Patch A's predicted delta only became observable when the netcat cap went 1 → 4 MB/s. The previous Phase 7 (10h30m @ 1 MB/s) saw zero decrypt-storms. Future Phase 7 protocols should plan a stress ramp from steady to near-saturation, not just the steady setting.
 2. **"Untriggered, no harm" is a valid Phase 7 verdict** for installed patches. Patch B fits this exactly. The patch is ready; the trigger pattern just doesn't fire often enough in this RF / load regime to verify the recovery delta. Don't let unobserved verifications block the loop.
 3. **Build infrastructure on `cleanups` not `mobian`.** The Phase 6 attempt to base Patch B on mobian forced a refactor mid-flight; the c-stack lives on cleanups, and re-using c5.2's `bes2600_chrdev_do_bus_reset` requires that. The cleanups branch is the campaign's working trunk.
 4. **AP-side bug is unlikely on AVM hardware.** AVM Fritz!Boxes don't fire spurious deauth-6 storms. When ohm sees AP-deauth-6 unprovoked, the suspect chain is bes2600 sending something the AP can't authenticate. The bias toward "bes2600 is the broken thing" is empirically validated.
 5. **AP-deauth-6 can fire without our local triggers.** Trigger C is a real failure mode neither Patch A nor B addresses. Adding a Phase-1-style metric for "AP-deauth-6 rate without preceding decrypt-storm or api_connection_loss" would surface Trigger C cleanly.
 6. **`pv -L` cap interacts with TCP retransmit recovery.** When the link can't sustain the cap, TCP backs off and pv blocks. Observed throughput is then a **floor on chip RX capacity at that signal level**, not the sender's intent. Useful for chip-load-characterization, but the cap should be set based on observed pull-rate, not on the link's nominal MCS rate.
 ---
 ## Loop status
 - Phase 7: closed.
 - Patch A: confirmed (N=2). Stays in.
 - Patch B: installed, dormant in this regime, no harm. Stays in.
 - Bug #5: backlog, no patch yet. Documented.
 - Trigger C: backlog candidate, no patch yet. Documented.
 Next campaign cycle would be re-anchoring Phase 0 around Bug #5 or Trigger C.
Author	SHA1	Message	Date
claude-noether	1830c17891	notes: Bug #5 Phase 1 metric + Phase 0 anchor receipts Phase 0 anchored at N=3 reps (10min @ 4MB/s pv-cap on 2.4GHz): - rep1+2: ~700 KB/s sustained (10% of link capacity) - rep3: link death at ~9 min in (passive mode, beacon-loss cascade) Hot symbol identified: _raw_spin_unlock_irqrestore at ~20% CPU in both healthy and failed reps, callstack process_one_work → wsm_configuration → wsm_cmd_send → bes2600_bh.isra.0 → spin-unlock. Phase 1 metric locked: ≥2 MB/s sustained throughput, <10% CPU in lock-cycling, no link death under 30 min continuous load. Three Phase 4 candidates drafted (B5-1: shrink wsm_cmd_send lock scope; B5-2: coalesce vif_list_lock in BH dispatcher; B5-3: SPSC ringbuffer for WSM commands). Locking pending review.	2026-05-07 16:32:45 +02:00
claude-noether	69a1d0f8b1	notes: phase 7 verdict — Patch A confirmed, Patch B dormant Phase 7 verification of cleanups + Patch A + Patch B (srcversion 1B3B3ED0) on ohm 2026-05-07 12:48 → 15:13 CEST under netcat load ramped 1 MB/s → 4 MB/s on 2.4GHz newton. Patch A: predicted delta CONFIRMED at N=2 reproductions. - 13:47:56 storm → 1 s reassoc, no AP-deauth-6 escalation - 13:49:26 storm → 1 s reassoc, no AP-deauth-6 escalation Patch B: installed, untriggered. 2 api_connection_loss events spaced 91 s apart, never tripping the 3-in-60s threshold. No false positives, no spurious bus_resets. Recovery delta unobserved (no harm done). Trigger C: 17-frame AP-deauth-6 cluster at 12:53 with no patch hooks firing — bes2600 TX-side glitch suspect. Recovery via mac80211 reauth in ~4 s. New backlog item. Bug #5 documented separately (RX path degrades under throughput pressure; possible root of the original Phase-0 YouTube frame drops).	2026-05-07 15:18:36 +02:00
claude-noether	458ad36f8b	notes: backlog Bug #5 — RX path degrades under throughput pressure Observed 2026-05-07: bumping the netcat sender from 1 MB/s to 4 MB/s DECREASED ohm's observed RX rate (1015 KB/s → 563 KB/s) and degraded the link (signal -57 → -67 dBm, MCS 4 → 3). Chip can't sustain near-link-rate RX even though theoretical capacity is ~8 MB/s. Hypothesis: driver/firmware lock contention or busy-wait on the RX SDIO path. Plausibly explains the original Phase-0 observation that YouTube DASH chunks drop ~10 frames per chunk fetch — chunk fetch is a brief near-line-rate burst that this bug would be triggered by.	2026-05-07 13:56:36 +02:00
marfrit	ea509e810f	Merge pull request 'Phase 4 plan: Patch B (Trigger A / api_connection_loss)' (#5) from claude-noether-3 into main Reviewed-on: #5	2026-05-07 10:45:28 +00:00
claude-noether	e53aad5013	notes: phase 4 plan for Patch B (Trigger A / api_connection_loss) Drafted after Phase 7 verification of Patch A (PR #1, srcversion 21BD07B3). 10h30m sustained load on 2.4GHz produced: - 0 DecryptStormRecoveries (Patch A dormant; no decrypt-storm fired) - 9 mac80211 api_connection_loss events - 1 catastrophic blackhole at 02:42 (reason 4 inactivity → reauth with assoc-comeback timeouts → AP unprotected-deauth-6 cluster) Phase 4 pivots to Trigger A (Patch B). Candidate B-1 lock proposal: extend c5.2 bus_reset infrastructure to fire on N consecutive api_connection_loss events; reuses existing recovery path. Pending Phase 5 review before Phase 6 implementation. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-07 10:22:34 +02:00
marfrit	4acba3e707	Merge pull request #4: Phase 4 plan: decrypt-storm fast-recover (Trigger B), with revised Phase 1	2026-05-06 17:30:48 +00:00
test0r	f6a25d811f	notes: phase 4 plan artifact for BES2600 wifi-stability campaign Drafts Patch A (decrypt-storm fast-recover, Trigger B) at txrx.c:1696 with sliding-window threshold + ieee80211_connection_loss reassoc. Patch B (beacon-loss / Trigger A) parked behind one more diagnostic rep with 10s snap-loop cadence on the beacon-loss counter. Folds reviewer feedback from PR #3 + the new Trigger-A finding (post-resume P1 = api_connection_loss-driven, two reps captured today at 17:23 and 18:03) into a revised Phase 1 metric counting three event classes. Pending Phase 5 second-model review of the plan before Phase 6 implementation. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-06 19:10:12 +02:00
marfrit	07a7d4b3af	Merge pull request 'Phase 5 review: BES2600 WiFi-stability campaign artifact' (#3) from claude-noether into main Reviewed-on: #3	2026-05-06 13:37:16 +00:00