notes: Sonnet architect review for Bug #5 — ranked restructuring map

Sonnet (general-purpose subagent, model=sonnet) reviewed ~/src/besser/bes2600-dkms-mobian/bes2600/ given the Phase 0 measurement context. Output: 8-item ranked restructuring map, file:line cited. Headline: - Item 1: collapse sdio_rx_work relay into BH loop (~5x workqueue dispatch reduction, medium effort) - Item 2: batch deliver via ieee80211_rx_list (small effort, removes per-frame softirq) - Items 1 + 2 together collapse "9 workqueue events per delivered frame" to ~1. Items 3-5 clean up next-layer overhead (TX-side queue_work, per-frame ba_lock, ps_state_lock under known-dead PSM). Items 6-8 are follow-ons to be re-measured after 1-3 land. Phase 4 plan locking the lead candidate(s) follows in a separate PR.
notes: Bug #5 root cause refined — workqueue-per-SDIO-transaction is the floor
2026-05-07 17:38:16 +02:00 · 2026-05-07 17:31:31 +02:00 · 2026-05-07 16:38:58 +02:00 · 2026-05-07 14:37:29 +00:00 · 2026-05-07 16:32:45 +02:00 · 2026-05-07 15:18:36 +02:00
5 changed files with 649 additions and 0 deletions
@@ -0,0 +1,180 @@
+# BES2600 architecture review — Bug #5
+
+Date assembled: 2026-05-07
+Reviewer: Claude Sonnet (general-purpose subagent, model=sonnet)
+Driver source: `~/src/besser/bes2600-dkms-mobian/bes2600/` on boltzmann
+
+This is the architect-review pass requested in `notes/observed-bugs.md` after the Phase 0 measurement showed the throughput floor is set by per-SDIO-transaction workqueue dispatch overhead. The reviewer was given the measurement summary, source location, and a focused brief; output is a ranked restructuring map with file:line citations for every concrete claim.
+
+---
+
+## Measurement context (input to the reviewer)
+
+```
+Reproduction:                pv -L 4M < /dev/zero | nc ohm 12345
+Module under test:           bes2600.ko srcversion 1B3B3ED0...  (cleanups + Patch A + Patch B)
+Hardware:                    PineTab2, RK3566 Cortex-A55 ARMv8.5, kernel 6.19.10-danctnix1
+Link rate:                   65 Mb/s ≈ 8 MB/s theoretical
+Observed throughput:         725 KB/s  (Phase 0 anchor at N=3)
+                             rep 3 cascaded into beacon-loss disconnect at ~9 min in
+
+Per-second event rates (3-min capture under 4 MB/s pv-cap):
+  workqueue_execute_start:  5,643/sec    ← architectural floor
+  bes2600_rx_cb:              611/sec
+  bes2600_bh_wakeup:          267/sec
+  wsm_cmd_send:                13/sec    (host-to-chip command rate, surprisingly low)
+  lock contention_begin:       50/sec    (modest)
+  mmc_request_start:        ~5,800/sec   (matches workqueue rate — every SDIO transaction is its own work item)
+
+perf record top symbol:      _raw_spin_unlock_irqrestore  (~20 % CPU samples)
+Dominant callstack:          process_one_work → wsm_configuration → wsm_cmd_send → bes2600_bh.isra.0
+```
+
+The implication: ~9 workqueue dispatches fire per frame delivered to mac80211. Items below address that ratio in descending order of predicted leverage.
+
+---
+
+## Item 1 — Two-hop workqueue relay: SDIO IRQ → `sdio_rx_work` → BH loop → mac80211
+
+**File:line:** `bes2600_sdio.c:416` (IRQ handler dispatches `rx_work`); `bes2600_sdio.c:829` (`sdio_rx_work` body); `bh.c:1330–1538` (BH main loop, `BES2600_RX_IN_BH` path); `bes2600_sdio.c:1267` (`bes_sdio` workqueue, `max_active=2`).
+
+**Current shape:** Every SDIO interrupt fires `queue_work(sdio_wq, &rx_work)`. `sdio_rx_work` reads up to `BES_SDIO_RX_MULTIPLE_NUM=16` frames (`hwio.h:294`) into per-frame SKBs, enqueues each onto `sbus_priv.rx_queue` under `rx_queue_lock`, then returns. Meanwhile the BH kthread (one work item queued at boot in `bh.c:93`, running an infinite loop inside `bes2600_bh()`) calls `pipe_read()` → `spin_lock(rx_queue_lock)` → `skb_dequeue()` → `wsm_handle_rx()` → `ieee80211_rx_irqsafe()` one frame at a time. When `pipe_read()` returns NULL and pending TX exists, `bes2600_sdio_pipe_read()` at `bes2600_sdio.c:941` re-dispatches `rx_work` — so a sustained RX stream fires **one `queue_work` per BH wakeup, not per burst**. That explains why `bh_wakeup` events are only 267/sec while `workqueue_execute_start` is 5,643/sec: the SDIO layer is firing a new `rx_work` item for every frame the BH loop drains.
+
+**Proposed shape:** Collapse `sdio_rx_work` and `pipe_read()` into the BH loop directly. The BH already runs in a dedicated `WQ_HIGHPRI | WQ_CPU_INTENSIVE` workqueue (`bh.c:66`) and (with `BES2600_RX_IN_BH` defined per `Makefile:159`) `bes2600_bh_rx_helper()` already dequeues from `rx_queue`. Merge `sdio_rx_work` into a function called synchronously from `bes2600_bh_rx_helper()` before the dequeue, guarded by a trylock so re-entry is safe. This eliminates O(N) `queue_work` calls per burst while keeping the BH as the single SDIO-access context.
+
+**Predicted delta vs Phase 1 metric:** Eliminates ~5 of the ~9 redundant workqueue dispatches per frame. 2–4× throughput improvement and a proportional drop in `_raw_spin_unlock_irqrestore` CPU cost.
+
+**Effort:** Medium. SDIO host-lock protocol (`sdio_claim_host`/`sdio_release_host`) is already managed inside `sdio_rx_work`; moving the body is mechanical but requires care around the `sdio_wq` `max_active=2` concurrency assumption.
+
+**Risks:** `sdio_rx_work` runs with `sdio_claim_host` held for the entire burst. Inside the BH it serialises all SDIO access fine. Watch `bes2600_sdio.c:1889` — flushes `rx_work` during teardown; that path must remain.
+
+---
+
+## Item 2 — `ieee80211_rx_irqsafe` instead of `ieee80211_rx` (pre-NAPI cw1200 ancestor pattern)
+
+**File:line:** `txrx.c:1947`, `txrx.c:1950`, `ap.c:99`, `sta.c:1487`, `wsm.c:2416`.
+
+**Current shape:** Every RX frame is delivered via `ieee80211_rx_irqsafe()`. This function enqueues the SKB onto a per-cpu `tasklet_rx` list and schedules a software IRQ. Under sustained load: one softirq wakeup per frame — 611 softirq wakeups/sec on top of the workqueue overhead.
+
+**Proposed shape:** Switch to `ieee80211_rx_ni()` (process context, which `wsm_handle_rx` is already in) or, better, batch-deliver frames using `ieee80211_rx_list()` (introduced in kernel 5.12, available in 6.19). Accumulate frames from a single `sdio_rx_work` burst into a `list_head`, then call `ieee80211_rx_list()` once per burst.
+
+**mac80211 contract:** `ieee80211_rx_list()` is safe from process context with the same `ieee80211_rx_status` rules as `ieee80211_rx_ni()`. Per `include/net/mac80211.h` — kerneldoc states it takes the RX path atomically only when called from softirq context; from process context it uses the same path as `ieee80211_rx_ni()`.
+
+**Predicted delta:** Reduces per-frame softirq overhead. Hard to isolate independently of item 1, but combined the two deliver the < 10 % CPU-in-lock target.
+
+**Effort:** Small (once item 1 is done — the batch list naturally exists at the burst boundary).
+
+**Risks:** Must hold `rcu_read_lock()` at call site; `skb->cb` (`IEEE80211_SKB_RXCB`) must be filled before the call, as today. The `early_data` path at `txrx.c:1942` uses `skb_queue_tail` into a per-link queue before calling `ieee80211_rx_irqsafe` — that path must be excluded from batch collection.
+
+---
+
+## Item 3 — Per-frame `queue_work(sdio_wq, &tx_work)` in the TX send path
+
+**File:line:** `bes2600_sdio.c:1236` (inside `bes2600_sdio_pipe_send()`).
+
+**Current shape:** Every call to `bes2600_sdio_pipe_send()` appends one descriptor to `tx_bufferlist` and immediately calls `queue_work(sdio_wq, &tx_work)`. `sdio_tx_work` then drains the list with scatterlist batching (up to `BES_SDIO_TX_MULTIPLE_NUM=16` frames per SDIO transfer). At low rates the workqueue's pending-but-not-started dedup means only one dispatch fires; at high TX rates — especially after `atomic_add(1, &hw_priv->bh_tx)` in `bh.c` reschedules TX — successive `pipe_send` calls each hit `queue_work` before the previous fires, multiplying dispatches.
+
+**Proposed shape:** Stage all frames into `tx_bufferlist` in the BH TX loop, then flush `sdio_tx_work` synchronously (call the work function body directly) before returning to the wait-event. The TX mirror of item 1.
+
+**Predicted delta:** Removes redundant TX-side `queue_work` calls. Lower priority than RX side given current TX rate (13 `wsm_cmd_send`/sec is host→chip control plane; data-plane TX is also limited by firmware buffer count `numInpChBufs`), but it does remove one source of the 5,643/sec workqueue count.
+
+**Effort:** Small.
+
+**Risks:** `sdio_tx_work` calls `sdio_claim_host`/`sdio_release_host` internally. Running directly from BH context requires confirming no deadlock with the SDIO bus claim that `sdio_rx_work` (now merged per item 1) holds. The TX flush must happen after the RX burst, matching the existing BH loop structure (`rx:` → `tx:` ordering in `bh.c:1444`).
+
+---
+
+## Item 4 — `ba_lock` per-frame acquisition in the RX path
+
+**File:line:** `txrx.c:998–1005` (`bes2600_rx_h_ba_stat()`); `txrx.c:1159–1164` and `txrx.c:1682–1698` (`tsm_lock`, `CONFIG_BES2600_TESTMODE`-gated).
+
+**Current shape:** `bes2600_rx_cb()` calls `bes2600_rx_h_ba_stat()` for every non-multicast data frame. That function acquires `ba_lock` under BH (`spin_lock_bh`) to increment `ba_acc_rx` and `ba_cnt_rx`, then sets a timer. At 611 frames/sec that's 611 lock acquisitions/sec on `ba_lock` alone.
+
+**Proposed shape:** Replace per-frame `ba_lock` with a per-cpu counter (or `atomic64_t`) for `ba_acc_rx` and `ba_cnt_rx`. The timer arm (`mod_timer`) is the actual reason for the lock — a `READ_ONCE`/`cmpxchg` on a flag to detect first-frame-in-interval is sufficient.
+
+**Predicted delta:** Removes 611 lock acquisitions/sec from the RX hot path. Not the dominant cost but the next bottleneck after items 1–2 land.
+
+**Effort:** Small.
+
+**Risks:** `ba_lock` also serialises TX-side block-ack accounting (`txrx.c:1632`). The per-cpu approach requires a fold step in the timer callback — cheap.
+
+---
+
+## Item 5 — Skip `ps_state_lock` acquisitions when PSM is known-disabled
+
+**File:line:** `bes2600.h:320` (decl); `txrx.c:1340–1365` (`vif_lock`); `txrx.c:1415–1426` (`ps_state_lock`); `txrx.c:1942–1948` (RX-side `ps_state_lock`).
+
+**Current shape:** `ps_state_lock` is taken on every TX frame if powersave is active. Per memory `reference_bes2600_firmware_no_psm.md`, **PSM is non-functional on this firmware** — c7 already self-detects this and latches `pm_unsupported = true`. The `ps_state_lock` guards in the RX callback and TX path are therefore taking dead overhead.
+
+**Proposed shape:** Add a `READ_ONCE()` check on `powersave_enabled` before taking `ps_state_lock`; if false, skip the lock and the PSM state update entirely. Since c7's `pm_unsupported` latches, this is safe.
+
+**Predicted delta:** Small absolute gain at current TX rate, but prevents fast path from regressing as throughput improves.
+
+**Effort:** Small.
+
+**Risks:** `powersave_enabled` is written from process context (`bh.c:403`). `READ_ONCE` without lock is safe — at worst one spurious PSM notification, not a state corruption.
+
+---
+
+## Item 6 — Firmware block-read size cap (`EFFECTIVE_BUF_SIZE = 8190 bytes`)
+
+**File:line:** `bh.c:33–36` (defines); `bes2600_sdio.c:721–783` (`bes2600_sdio_extract_packets()`); `hwio.h:294` (`BES_SDIO_RX_MULTIPLE_NUM=16`).
+
+**Current shape:** `BES_SDIO_RX_MULTIPLE_NUM=16` and `BES_SDIO_OPTIMIZED_LEN` both defined (`Makefile:90,92`). The RX burst reads `PACKET_TOTAL_LEN(ctrl_reg)` bytes in a single CMD53; each sub-packet bounded by `EFFECTIVE_BUF_SIZE = (0x1000-4)*2 - 2 = 8190` bytes. At 611 frames/sec ÷ 267 BH wakeups/sec ≈ **2.3 frames per wakeup** — well under the 16-frame limit. **Not the bottleneck today.**
+
+**Proposed shape:** No change needed now. Re-evaluate after items 1–3 land if throughput rises past ~3 MB/s. Verify ctrl_reg `PACKET_TOTAL_LEN` field values during high load — requires firmware-trace observation we don't currently have.
+
+**Effort:** N/A.
+
+**Risks:** Increasing beyond 16 requires a larger DMA allocation (currently `1632 × 16 = 26 KB`). Cortex-M4F firmware side is opaque.
+
+---
+
+## Item 7 — Duplicate workqueues (`hw_priv->workqueue` vs `hw_priv->bh_workqueue` vs `sbus_priv->sdio_wq`)
+
+**File:line:** `bes2600.h:323` (`workqueue`); `bes2600.h:385` (`bh_workqueue`); `bes2600_sdio.c:63` (`sdio_wq`). `txrx.c` has 10 `queue_work(hw_priv->workqueue, ...)` calls for control-plane work.
+
+**Current shape:** Three distinct workqueues. The 5,643 `workqueue_execute_start`/sec are dominated by `sdio_wq` items, not `workqueue`. `workqueue` items are control-plane events at rates well below the data-plane.
+
+**Proposed shape:** After item 1 (merging `sdio_rx_work` into BH), `sdio_wq` only carries `tx_work`. After item 3 (synchronous TX flush from BH), `sdio_wq` is idle during normal data-plane and could be replaced with `system_highpri_wq`.
+
+**Effort:** Small (follow-on to items 1 and 3).
+
+**Risks:** None if items 1 and 3 land first.
+
+---
+
+## Item 8 — `BH_RX_CONT_LIMIT=3` cap on RX burst per BH wakeup
+
+**File:line:** `bh.c:1380–1405` (timeout detection); `bh.c:1330` (`BH_RX_CONT_LIMIT=3`); `bh.c:1331` (`BH_TX_CONT_LIMIT=20`).
+
+**Current shape:** BH loop limits RX burst to 3 consecutive iterations before breaking back to wait-event. At 611 frames/sec ÷ 267 wakeups/sec ≈ 2.3 frames per wakeup → not the bottleneck today. **After items 1–3 land**, per-burst frame rate will rise and `BH_RX_CONT_LIMIT=3` becomes the ceiling.
+
+**Proposed shape:** Raise to 16 (matching `BES_SDIO_RX_MULTIPLE_NUM`) after items 1–3 are deployed and re-measured.
+
+**Effort:** Trivial (constant change), but must wait for Phase 7 measurements post 1–3.
+
+**Risks:** Too high a limit under firmware anomaly (corrupted ctrl_reg) can spin BH long enough to miss beacon ACK deadline. Bound to `BES_SDIO_RX_MULTIPLE_NUM` as safe ceiling.
+
+---
+
+## Ranking summary
+
+| Rank | Item | Predicted gain | Effort |
+|------|------|----------------|--------|
+| 1 | Collapse `sdio_rx_work` relay into BH loop | ~5x workqueue dispatch reduction | Medium |
+| 2 | Batch deliver via `ieee80211_rx_list()` | Removes per-frame softirq | Small |
+| 3 | Synchronous TX flush from BH | Removes TX-side dispatch noise | Small |
+| 4 | Replace `ba_lock` per-frame with atomic/per-cpu | Removes 611 lock/sec from RX hot path | Small |
+| 5 | Skip `ps_state_lock` when PSM-known-disabled | Removes dead overhead | Small |
+| 6 | Raise `BH_RX_CONT_LIMIT` after 1–3 land | Unlocks residual throughput | Trivial |
+| 7 | Consolidate workqueues post-items 1&3 | Cleanup | Small |
+| 8 | Firmware block-read size | Not bottleneck at current rates | N/A |
+
+**Items 1 + 2 together are the structural answer to the measurement**: ~9 workqueue events per delivered frame collapse to ~1, and the per-frame softirq cost disappears. Items 3–5 clean up the next layer. The beacon-loss cascade at 9 minutes is almost certainly starvation of the BH wait-event under the per-frame workqueue storm — item 1 removes the mechanism that makes the cascade possible.
+
+---
+
+## Next campaign step
+
+A Phase 4 plan locking item 1 (and possibly item 2) follows in a separate PR. The remaining items go on the campaign backlog as follow-on patches once the Phase 7 verification of item-1-or-1-plus-2 confirms the predicted delta.
@@ -82,6 +82,91 @@ without board power-cycle").
 **Status**: task c3 (indirectly, via bes_chardev removal which currently
 gates the signal/nosignal mode switch path).

+## Architect review — now BUG-#5-blocking (was backlog)
+
+The Phase 0 perf trace for Bug #5 first exposed a "when in doubt, add a
+lock" pattern (~20 % CPU in `_raw_spin_unlock_irqrestore`). The
+follow-up ftrace measurement (2026-05-07 17:00) refined the root cause
+to an architectural problem: **the bes2600 driver dispatches every
+SDIO transaction through the kernel workqueue**. Numbers from a 3-min
+4 MB/s ohm capture (post-reboot, srcversion `1B3B3ED0`):
+
+```
+wsm_cmd_send:               13/sec    (host-to-chip command rate, surprisingly low)
+bes2600_rx_cb:             611/sec
+bes2600_bh_wakeup:         267/sec
+lock contention_begin:      50/sec
+workqueue_execute_start: 5,643/sec    ← DOMINATES; matches the mmc
+                                       transaction rate from earlier perf
+```
+
+5.6 k workqueue dispatches per second is the throughput floor — not a
+specific lock, not WSM-command rate, not decrypt-state. A surgical fix
+to any single function won't move the floor; the architecture needs
+to be restructured to amortise SDIO transactions across fewer work-
+items (or move SDIO RX out of the workqueue entirely).
+
+This is where the **Claude Sonnet architect review** belongs: a
+top-to-bottom assessment of `~/src/besser/bes2600-dkms-mobian/bes2600/`
+focused on:
+
+- the workqueue dispatch shape (most actionable)
+- needless lock proliferation (the original signal)
+- BH / RX scheduling boundaries
+- error-handling coverage and dead-code from the cw1200 ancestor
+- API contract violations relative to mainline mac80211
+
+Output: ranked list of restructuring targets, with predicted-delta
+estimates against the Phase 1 metric (≥ 2 MB/s sustained @ 4 MB/s cap,
+< 10 % CPU in lock-cycling, no link cascade in 30 min).
+
+**Status**: now blocking on Bug #5 (was independent track). Surgical
+patches B5-1, B5-2, B5-3 from the original Phase 4 candidate list are
+all DEFERRED until the architect review's restructuring map is in.
+
+## Bug #5 — RX path degrades under attempted-throughput pressure
+
+**Suspect file**: bes2600 RX path (`txrx.c bes2600_rx_cb`, `bh.c bes2600_bh_work`,
+SDIO RX scheduling) — pinpoint pending.
+
+**Symptom (observed 2026-05-07 13:43, srcversion `1B3B3ED0` = c-stack +
+Patch A + Patch B, ohm @ -57 dBm 2.4GHz ch11 5b:32, idle save for the
+netcat load):**
+
+```
+sender cap 1 MB/s  →  ohm receives 1015 KB/s,  signal -57 dBm,  RX MCS 4
+sender cap 4 MB/s  →  ohm receives  563 KB/s,  signal -67 dBm,  RX MCS 3
+                     (Send-Q on boltzmann backed up to 1.16 MB)
+```
+
+Pushing the sender-side cap from 1 MB/s to 4 MB/s **decreased** observed
+throughput at the receiver and degraded the link metrics. Signal dropped
+~10 dB and the chip downshifted MCS, suggesting the chip can't sustain
+the higher RX rate even with the link physically capable of more (link
+bitrate 65 Mb/s = ~8 MB/s theoretical).
+
+**Hypothesis (Markus, 2026-05-07): driver/firmware locks itself to death
+under busy reads** — possibly a busy-wait loop or lock contention on the
+RX SDIO path that prevents draining at line rate. Plausible reason it
+didn't surface for the c-stack tasks: those operated at typical
+browse-rate traffic, well below the saturation threshold this bug needs
+to fire.
+
+**May explain**: original Phase-0 observation that **YouTube DASH chunks
+drop ~10 frames per chunk fetch** on hardware-decoder playback. A chunk
+fetch is a brief burst at near-link-rate; if the driver throttles itself
+down during high-RX, the player buffer underruns for the duration of
+the fetch.
+
+**How to drill (when prioritized)**:
+- Capture trace_pipe with `mmc:*` and `sdio*` events enabled during a
+  controlled rate-ramp (e.g., pv -L 500K, 1M, 2M, 4M each for 60 s).
+- Watch `/proc/sys/kernel/sched_*` and the `bes2600_bh_work` kworker for
+  CPU saturation.
+- `perf top -p $(pgrep -f bes_sdio)` during 4 MB/s load.
+
+**Status**: backlog. No patch yet.
+
 ## Bug #4 — scan_complete_cb constant loop

 **File**: `scan.c:883-909` — `bes2600_scan_complete_cb()`.
@@ -0,0 +1,135 @@
+# Bug #5 campaign — Phase 1 metric + Phase 2 situation
+
+Date assembled: 2026-05-07
+Module under test (baseline): bes2600.ko srcversion `1B3B3ED096AAD7217FEDE11`
+                              (cleanups + Patch A + Patch B)
+
+Phase 0 anchor at N=3 reps (10 min each, 4 MB/s pv-cap on boltzmann → ohm
+2.4GHz 5b:32) reproduces the throughput regression and traces it to lock-
+cycling cost in the bes2600 BH path. See `notes/observed-bugs.md` Bug #5
+for the original report. This document locks Phase 1 and prepares Phase 4
+candidates.
+
+---
+
+## Phase 1 — measurable target (locked)
+
+> *Reduce the bes2600 BH path's spin-unlock-irqrestore cost so that under sustained 4 MB/s sender pressure on a healthy 2.4GHz link (signal -55 to -65 dBm), ohm sustains ≥ 2 MB/s observed RX throughput (vs 663–725 KB/s baseline at N=3) AND the link survives ≥ 30 min continuous load without cascading into beacon-loss disconnect (vs rep3's failure at ~9 min).*
+
+Three measurable outcomes, single sentence:
+
+- **(a) Throughput floor**: ≥ 2 MB/s sustained RX at ohm
+- **(b) Lock-cycle ceiling**: % CPU in `_raw_spin_unlock_irqrestore` from `bes2600_bh.isra.0` callstack drops to < 10 % (currently ~10 % rep1, ~16 % rep3)
+- **(c) Cascade prevention**: no link death under continuous 30 min @ 4 MB/s
+
+---
+
+## Phase 0 anchor — receipts
+
+### Reproduction protocol (same units as Phase 7 will use)
+
+1. boltzmann: `pv -L 4M -q < /dev/zero | nc ohm.fritz.box 12345`
+2. ohm: `sudo $RUN/rep-trace.sh 600` (10 min capture window)
+3. Rep dirs: `bug5/rep-<ts>/{mmc.log, perf.data, rx_bytes.tsv, start.txt, end.txt}`
+4. N=3 reps with 60 s cooldowns
+
+### Observed (2026-05-07 15:36–16:08)
+
+| rep | duration | avg KB/s | near-zero ticks | end state |
+|---|---|---|---|---|
+| 1 | 600 s | 725 | 1/119 | associated, MCS 6 |
+| 2 | 600 s | 663 | 5/119 | associated, MCS 4 |
+| 3 | 600 s |  75 | 53/119 | **passive (link died at sample ~82, ~9 min in)** |
+
+mmc transaction rate: rep1 = 5793/s sustained, rep3 = 6000/s for first ~10s then collapse to <100/s.
+
+### Hot-symbol receipts (perf record on `bes_sdio | bes2600_bh` kworkers)
+
+| symbol | rep 1 (healthy) | rep 3 (cascade) |
+|---|---|---|
+| `_raw_spin_unlock_irqrestore` (sum across kworker variants) | **~19 %** | **~21 %** |
+| `handle_softirqs` | 5.4 % | 4.3 % |
+| `__tasklet_schedule` | 2.4 % | 2.0 % |
+| `dw_mci_start_command` (SDIO host) | 1.5 % | < 0.6 % |
+| `bes2600_sw_retry_requeue` | 0.79 % | 0.70 % |
+
+Top callchain leading to `_raw_spin_unlock_irqrestore`:
+
+```
+process_one_work → worker_thread → wsm_configuration → wsm_cmd_send → bes2600_bh.isra.0
+```
+
+---
+
+## Phase 2 — situation analysis
+
+### Relevant source pins
+
+- `bes2600/wsm.c:98` — `wsm_cmd_send()`, the function in the hot callstack. Body:
+  - holds `&hw_priv->wsm_cmd.lock` (spin) only briefly to fill the cmd struct (lines ~145–152)
+  - calls `bes2600_bh_wakeup()` then `wait_event_interruptible_timeout` for the response
+  - outer lock: `down(&hw_priv->wsm_cmd_sema)` from callers (`wsm_cmd_lock()` at wsm.c:105)
+- `bes2600/bh.c:435,559,847` — `bes2600_bh_work()` takes `&hw_priv->vif_list_lock` 3× per pass through the dispatcher
+- `bes2600/bh.c:172,195,487,581,861,1361,1427` — multiple `wait_event*` calls; the BH thread cycles through wait/wake/dispatch
+
+### What the lock-cycling cost is buying
+
+Each WSM command from the host (mac80211, NM, kernel scan etc.) goes through the same path:
+1. caller acquires `wsm_cmd_sema` (outer)
+2. `wsm_cmd_send()` acquires `wsm_cmd.lock`, fills the struct, releases the lock
+3. `bes2600_bh_wakeup()` schedules the BH
+4. BH dispatches the command, takes `vif_list_lock` to walk vifs
+5. BH talks to chip via SDIO
+6. response arrives, BH wakes the waiter
+7. caller releases `wsm_cmd_sema`
+
+Under sustained TCP RX, mac80211 issues lots of small WSM commands (TX-scheduling hints, rate updates, etc.) — every one cycles through this path. The spin-unlock-irqrestore cost is the floor on this cycle rate.
+
+### What's been ruled out
+
+- AP-side bug (AVM Fritz!Box, reliable per Markus's testimony — see `feedback_phase7_stress_ramp.md` reasoning in the campaign-so-far prior).
+- Patch A / Patch B (target different triggers; would not address lock cost).
+- Decrypt-failure storm (Patch A handles; this regression occurs in rep1 with zero decrypt-fails).
+- mac80211 scan-fail / scan-comeback (cosmetic; doesn't account for the throughput floor).
+
+---
+
+## Phase 4 — candidate plans (preliminary, not locked)
+
+Three candidates surfaced from the perf data. Listed cheapest to most-invasive.
+
+### B5-1 — Reduce `wsm_cmd_send` lock scope
+
+The spin_lock around the cmd-struct fill (wsm.c:145–152) can probably be replaced with `WRITE_ONCE` of a single sentinel field, since the BH thread reads these fields cooperatively (BH only reads after `bh_wakeup` schedules it, and only writes back via the response path). Eliminates the per-command spin-lock cycle for the host side.
+
+**Risk**: race with BH if the protocol isn't strictly happens-before. Need to read bh.c:486-500 (where BH reads wsm_cmd.ptr) carefully.
+
+**Predicted delta**: small but measurable. Maybe 2–3 % CPU off the lock floor.
+
+### B5-2 — Coalesce `vif_list_lock` in BH dispatcher
+
+bh.c takes `vif_list_lock` 3× per dispatcher pass. If these 3 critical sections are within a single iteration, they should be merged into one acquire/release.
+
+**Risk**: vif teardown might depend on releasing the lock between iterations to allow concurrent vif removal. Needs careful audit.
+
+**Predicted delta**: significant under multi-vif (we're single-vif STA today, so probably small immediate gain).
+
+### B5-3 — Move WSM send out of process context, use ringbuffer
+
+Replace the wsm_cmd_sema + wsm_cmd struct mechanism with an SPSC ringbuffer between caller and BH. Caller writes to ring, no lock needed (one producer); BH reads from ring, no lock needed (one consumer).
+
+**Risk**: significant rework. cw1200 ancestor doesn't have this; we'd be inventing it.
+
+**Predicted delta**: large — could halve the lock cost. But cost-to-implement is also large.
+
+---
+
+## Open question for the reviewer
+
+Which Phase 4 candidate to lock? My ranking by ROI:
+
+1. B5-1 (smallest, fastest, cleanest source pin) — try first
+2. B5-2 (medium, conditional on multi-vif applicability)
+3. B5-3 (largest rework, biggest potential)
+
+Or: instrument deeper before committing to a fix (e.g., add `tracepoints around wsm_cmd_send` enter/exit to measure lock holdtime distribution, not just CPU%).
@@ -0,0 +1,153 @@
+# BES2600 WiFi-stability campaign — Phase 4 plan (Patch B / Trigger A)
+
+Date assembled: 2026-05-07
+Run dir: /root/bes2600-samples/run-20260506-2113-patchA/ on ohm
+Module: bes2600.ko srcversion 21BD07B3782B144D478CE43 (c-stack + Patch A merged)
+
+This is the Phase 4 plan for **Patch B (Trigger A: beacon-loss / mac80211 `api_connection_loss` chain)**, drafted after the Phase 7 verification of Patch A on 2026-05-07. Per project CLAUDE.md: paste verbatim, do not curate.
+
+---
+
+## What changed since the merged Patch-A plan (`notes/phase4-2026-05-06.md`)
+
+Patch A is **landed (PR #1)** and **active on ohm** (srcversion `21BD07B3`). Phase 7 verification:
+
+```
+duration:                     10h30m sustained 1 MB/s load on 2.4GHz (5b:32)
+DecryptStormRecoveries:       0
+Decrypt-fails total:          183  (~1 every 3.5 min — never bursted ≥5/5s)
+api_connection_loss events:   9    ← Trigger A
+unprotected deauth (AP):      7    ← AP-deauth-6 cluster at 02:42:11
+mac80211 reason 4 deauth:     yes  ← inactivity (Trigger A flavor)
+mac80211 reason 2 deauth:     no   ← what Patch A handles
+```
+
+**Patch A's predicted delta is unobserved** (no decrypt-storm fired during Phase 7). Patch A is dormant but caused no harm. This Phase 4 pivots to **Trigger A** — the dominant failure path in the overnight rep.
+
+---
+
+## Phase 1 — revised metric (Trigger-A scope)
+
+> Per hour of operation: count `mac80211 api_connection_loss` events and the conditional probability that each escalates into a > 5 s reauth blackhole (assoc-comeback timeouts followed by AP unprotected-deauth-6 cluster).
+
+Observed rate from the Phase 7 rep: 9 events over 10h30m = **0.86/h** under sustained load. Not all of them escalated to a visible blackhole — some apparently recovered cleanly. But the 02:42 cluster (1/9 = 11 % escalation rate in this rep) shows the catastrophic shape.
+
+---
+
+## Phase 0 / 3 receipt — the 02:42 chain (verbatim from iw-event)
+
+```
+02:40:32  scan started, full-band
+02:40:34  scan aborted
+02:40:45  del station 5b:32
+02:40:45  kernel: deauth 8a:2e:77:1f:ec:05 → 5b:32 reason 4
+          (Disassociated due to inactivity)        ← TRIGGER A
+02:40:45  cfg80211: disconnected (local request) reason 4
+02:40:45  scan started → finished: 2462 2412, "newton"
+02:40:45  new station 61:b0
+02:40:45  AP→ohm: auth status 0: Successful
+02:40:45  AP→ohm: assoc comeback bssid 61:b0 timeout 1000  ← BSS load mgmt
+02:40:47  del station 61:b0
+02:40:47  assoc: timed out
+02:40:47  scan → new station cc:ce:1e:2b:74:17
+02:40:48  auth: timed out
+02:40:49  scan → new station 5b:32 (back to where we started)
+02:40:49  AP→ohm: auth status 0
+02:40:49  AP→ohm: assoc comeback bssid 5b:32 timeout 881
+02:40:51  assoc: timed out
+02:42:11  ── AP-deauth-6 cluster (×7 within 1 ms) from 61:b0 ──
+          reason 6: Class 2 frame received from non-authenticated station
+02:42:11  reason 9 = STA_REQ_ASSOC_WITHOUT_AUTH
+```
+
+**Net**: ~86 s in reauth-blackhole, recovery via cross-channel fallback. Same shape as Phase 3's 11:03 blackhole (~109 s), but trigger here is **inactivity-deauth → assoc-comeback rejection**, not decrypt-storm.
+
+---
+
+## Hypothesis on the mechanism
+
+Three plausible chains for why post-`api_connection_loss` reauth blackholes:
+
+1. **AP's assoc-comeback timer disrespected.** The AP says "wait 1000 TU before retrying", but mac80211 / wpa_supplicant retries fast. AP keeps deferring; eventually a stale frame triggers the AP's "Class 2 from unauth" reaction.
+
+2. **Driver state stale across deauth.** After mac80211 fires `ieee80211_connection_loss`, the bes2600 driver's per-vif state (link_id, key state, queues) may not be fully scrubbed. Subsequent reassoc starts with mixed state; AP rejects.
+
+3. **Chip-level RX wedge.** The chip's RX state machine got stuck during the inactivity period; reauth sends out frames, but RX of AP's responses is lossy. mac80211 perceives timeout when actually frames arrived but were dropped. Hard to verify without monitor mode (which the chip doesn't support concurrent with managed).
+
+Each hypothesis suggests a different fix surface.
+
+---
+
+## Phase 4 — Plan candidates
+
+### Candidate B-1 — Chip-level reset on api_connection_loss flood
+
+**What touches:**
+- New work-item on `bes2600_common`: `api_connection_loss_recover_work`.
+- mac80211 → driver `disconnect()` op → bump a sliding-window counter.
+- On threshold (e.g., 3 events within 60 s): schedule the work that calls `bes2600_chrdev_do_bus_reset()` (the existing c5.2 LMAC-wedge recovery path).
+- After bus reset, NM auto-reconnects from a fresh chip state.
+
+**Why this candidate:** reuses the c5.2 infrastructure already deployed; small surface; if hypothesis 3 (chip wedge) is right, this fixes the root cause. If hypothesis 1 or 2 are right, this is overkill but harmless (a brief reset).
+
+**API contracts to confirm:**
+- `bes2600_chrdev_do_bus_reset()` re-entrancy and worker-context safety
+- mac80211 ops or callbacks around `ieee80211_connection_loss`/`disconnect`
+- cw1200/cw1260 ancestor for any similar pattern
+
+**Predicted delta (Phase 7 units):**
+- `api_connection_loss` rate: unchanged (we don't address the trigger)
+- conditional escalation to >5 s blackhole: target ≤ 30 % (need realistic)
+- worst-case recovery: 86 s → < 10 s
+
+### Candidate B-2 — Respect assoc-comeback timer
+
+**What touches:**
+- Possibly NOT in bes2600 — this looks like a mac80211 / wpa_supplicant concern.
+- If the driver does anything assoc-related itself, audit for racing the comeback timer.
+
+**Status:** out of scope for a bes2600 patch unless the driver is observed sending frames during the comeback window.
+
+### Candidate B-3 — Audit and scrub vif state on disconnect
+
+**What touches:**
+- `bes2600_unjoin_work` — verify link_id, keys, queues all reset
+- Add explicit reset on `ieee80211_disconnect`/`disconnect` ops
+
+**Status:** speculative without further instrumentation.
+
+---
+
+## Lock proposal
+
+**Lock Candidate B-1 first.** It has:
+- the cleanest re-use (c5.2's bus_reset)
+- the smallest patch surface
+- a measurable predicted delta against Phase 7's `api_connection_loss` rate
+
+If Phase 7-of-B-1 shows the rate unchanged but escalation still high → loop back to B-2/B-3 hypothesis space.
+
+---
+
+## What will NOT be touched
+
+- mac80211 / cfg80211 core — host-side STA driver only
+- The c-stack patches (c5.x, c6.x, c7) — independent recovery paths
+- Patch A — stays in place, untouched
+- AP / firmware
+
+---
+
+## Receipt checklist for Phase 4
+
+- [x] What will and will not be touched: stated above
+- [x] Predicted delta in Phase 3 units: stated for Candidate B-1
+- [x] Out-of-scope items explicitly listed
+- [x] Risk items: bus_reset has a known multi-function-SDIO consideration (handled by c5.2.1)
+
+## Asks of the reviewer
+
+1. Candidate B-1 (bus_reset on api_connection_loss flood) the right scope, or should we instrument deeper before committing?
+2. Threshold (3 events / 60 s): too aggressive (false-positive bus_resets on transient RF issues) or about right?
+3. Should bus_reset be conditional on ALSO seeing post-deauth assoc-comeback timeouts, to avoid resetting on benign connection_loss events?
+4. Hypothesis 1 (assoc-comeback disrespected) — is this a mac80211/wpa_supplicant bug rather than a bes2600 bug? If yes, we file it elsewhere.
@@ -0,0 +1,96 @@
+# BES2600 WiFi-stability campaign — Phase 7 verdict (Patches A + B)
+
+Date assembled: 2026-05-07
+Module under test: bes2600.ko srcversion `1B3B3ED096AAD7217FEDE11`
+                   (cleanups + Patch A + Patch B)
+Run dir: `/root/bes2600-samples/run-20260507-1248-patchB/` on ohm
+
+Phase 7 verification window: 2026-05-07 12:48 → ~15:13 CEST (≈ 2 h 25 m)
+of which: ~50 min @ 1 MB/s pv-cap, ~1 h 30 m @ 4 MB/s pv-cap on 2.4 GHz
+newton (5b:32, signal -57 to -67 dBm).
+
+---
+
+## Result table (vs the Phase 4 predicted delta)
+
+### Patch A — decrypt-storm fast-recover (Trigger B)
+
+| metric | Phase 3 baseline | Phase 4 prediction | Phase 7-of-B observed |
+|---|---|---|---|
+| decrypt-burst rate | 8.18/h | unchanged | 2 bursts in ~22 min once 4MB/s pressure was on |
+| AP-deauth-6 rate following burst | 100 % escalation | ≤ 0.2 × baseline | **0/2 = 0 % escalation** |
+| recovery time given burst | up to 109 s | < 5 s | **~1 s** (×2) |
+
+**Verdict: predicted delta CONFIRMED at N=2.** CLAUDE.md ideal is N=3; we're directionally locked at 2 reproductions, both behaving as predicted (threshold trip → `[bes2600] decrypt-storm fast-recover: forcing reassoc` log line → mac80211 disassoc → userspace reauth in ≈1 s).
+
+#### Receipts (verbatim)
+
+```
+13:47:56  bes2600_wlan: [bes2600] decrypt-storm fast-recover: forcing reassoc
+13:47:57  wlan0: associated to cc:ce:1e:2b:74:17     (cross-BSSID, 1 s)
+13:49:26  bes2600_wlan: [bes2600] decrypt-storm fast-recover: forcing reassoc
+13:49:27  wlan0: associated to c0:25:06:e6:5b:32     (back home, 1 s)
+```
+
+`DecryptStormRecoveries: 2` exposed via debugfs at `/sys/kernel/debug/ieee80211/phy0/bes2600/vif_0/status`.
+
+### Patch B — connection-loss-storm bus_reset (Trigger A)
+
+| metric | Phase 7-of-A observed | Phase 4 prediction | Phase 7-of-B observed |
+|---|---|---|---|
+| api_connection_loss rate | 0.86/h | unchanged | 2 events in ~2 h (≈ 1/h) |
+| ConnectionLossStormRecoveries | n/a | trips on 3-in-60s bursts | **0** |
+| Threshold trip events | n/a | (when burst occurs) | **0** (events spaced 91 s apart) |
+
+**Verdict: installed but UNTRIGGERED.** The 3-in-60s threshold was never reached (max-cluster observed: 2-in-91s). No false positives, no spurious bus_resets. Predicted delta unobserved — same shape as Patch A's first Phase 7 run.
+
+The threshold may be too conservative for typical event rates (we'd need a true api_connection_loss flood to trip it). Tuning is a future Phase-1 question if more reproductions accumulate.
+
+### Trigger C — AP unprotected-deauth-6 cluster without preceding storm
+
+```
+12:53:10.475 → 12:53:11.756  AP fires 17 unprotected-deauth-6 from 5b:32 over 1.3 s
+                              (2 mgmt-TX no-ack from our chip in the middle)
+12:53:12.309  kernel: deauthenticating ... reason 2 = PREV_AUTH_NOT_VALID
+12:53:14–15  reauth via 61:b0 → 5b:32, recovery in ~4 s
+```
+
+Neither Patch A (zero decrypt-fails preceded) nor Patch B (zero api_connection_loss) fired. Background: AVM Fritz!Boxes (newton) are reliable; the AP correctly classified ohm's frames as Class 2 from non-auth, meaning **bes2600 sent something the AP couldn't authenticate**. New backlog entry: `notes/observed-bugs.md` Bug #5 (RX path under throughput pressure) is the leading hypothesis surface.
+
+Recovery was fast (4 s) so this isn't a P0 — but a Patch C investigation is warranted when prioritized.
+
+---
+
+## Bug #5 — RX path degradation under attempted-throughput pressure (NEW)
+
+```
+sender 1 MB/s  →  ohm receives 1015 KB/s,  -57 dBm,  RX MCS 4
+sender 4 MB/s  →  ohm receives  563 KB/s,  -67 dBm,  RX MCS 3
+```
+
+Higher attempted-throughput on the sender side → LOWER observed throughput at ohm. Signal degraded ~10 dB, MCS dropped a notch. Link-physical max is ~8 MB/s; we're getting ~7 % of that under load.
+
+**Hypothesis (Markus): driver/firmware locks itself to death under busy reads.** Plausibly the same root-cause as the Phase 0 YouTube DASH chunk-fetch drops (~10 frames per chunk fetch on hardware-decoder playback). Documented as Bug #5 in `notes/observed-bugs.md`.
+
+---
+
+## Lessons captured for memory (Phase 8 anchor)
+
+1. **Stress-rate matters for verification.** Patch A's predicted delta only became observable when the netcat cap went 1 → 4 MB/s. The previous Phase 7 (10h30m @ 1 MB/s) saw zero decrypt-storms. Future Phase 7 protocols should plan a stress ramp from steady to near-saturation, not just the steady setting.
+2. **"Untriggered, no harm" is a valid Phase 7 verdict** for installed patches. Patch B fits this exactly. The patch is ready; the trigger pattern just doesn't fire often enough in this RF / load regime to verify the recovery delta. Don't let unobserved verifications block the loop.
+3. **Build infrastructure on `cleanups` not `mobian`.** The Phase 6 attempt to base Patch B on mobian forced a refactor mid-flight; the c-stack lives on cleanups, and re-using c5.2's `bes2600_chrdev_do_bus_reset` requires that. The cleanups branch is the campaign's working trunk.
+4. **AP-side bug is unlikely on AVM hardware.** AVM Fritz!Boxes don't fire spurious deauth-6 storms. When ohm sees AP-deauth-6 unprovoked, the suspect chain is bes2600 sending something the AP can't authenticate. The bias toward "bes2600 is the broken thing" is empirically validated.
+5. **AP-deauth-6 can fire without our local triggers.** Trigger C is a real failure mode neither Patch A nor B addresses. Adding a Phase-1-style metric for "AP-deauth-6 rate without preceding decrypt-storm or api_connection_loss" would surface Trigger C cleanly.
+6. **`pv -L` cap interacts with TCP retransmit recovery.** When the link can't sustain the cap, TCP backs off and pv blocks. Observed throughput is then a **floor on chip RX capacity at that signal level**, not the sender's intent. Useful for chip-load-characterization, but the cap should be set based on observed pull-rate, not on the link's nominal MCS rate.
+
+---
+
+## Loop status
+
+- Phase 7: closed.
+- Patch A: confirmed (N=2). Stays in.
+- Patch B: installed, dormant in this regime, no harm. Stays in.
+- Bug #5: backlog, no patch yet. Documented.
+- Trigger C: backlog candidate, no patch yet. Documented.
+
+Next campaign cycle would be re-anchoring Phase 0 around Bug #5 or Trigger C.
Author	SHA1	Message	Date
claude-noether	679083d1aa	notes: Sonnet architect review for Bug #5 — ranked restructuring map Sonnet (general-purpose subagent, model=sonnet) reviewed ~/src/besser/bes2600-dkms-mobian/bes2600/ given the Phase 0 measurement context. Output: 8-item ranked restructuring map, file:line cited. Headline: - Item 1: collapse sdio_rx_work relay into BH loop (~5x workqueue dispatch reduction, medium effort) - Item 2: batch deliver via ieee80211_rx_list (small effort, removes per-frame softirq) - Items 1 + 2 together collapse "9 workqueue events per delivered frame" to ~1. Items 3-5 clean up next-layer overhead (TX-side queue_work, per-frame ba_lock, ps_state_lock under known-dead PSM). Items 6-8 are follow-ons to be re-measured after 1-3 land. Phase 4 plan locking the lead candidate(s) follows in a separate PR.	2026-05-07 17:38:16 +02:00
claude-noether	594f73c6b4	notes: Bug #5 root cause refined — workqueue-per-SDIO-transaction is the floor Follow-up ftrace measurement (post-reboot, 3-min 4MB/s capture): - workqueue_execute_start: 5,643/sec ← dominates - wsm_cmd_send: only 13/sec (host-to-chip command path NOT the hotspot) - lock contention: 50/sec (modest) The throughput floor is set by per-SDIO-transaction workqueue dispatch overhead. Surgical patches B5-1/B5-2/B5-3 from the prior Phase 4 plan all targeted the wrong layer; deferring those until an architectural restructuring map is produced. Promoting the Sonnet architect review from "backlog" to "blocking on Bug #5" — the next step is a restructuring assessment, not another patch.	2026-05-07 17:31:31 +02:00
claude-noether	928268f477	notes: backlog Sonnet architect review of bes2600 driver Per PR #6 review feedback. Independent track from Bug #5; scheduled once the Bug #5 measurement pass finishes.	2026-05-07 16:38:58 +02:00
marfrit	425eb92456	Merge pull request 'Bug #5 Phase 1 metric + Phase 0 anchor receipts' (#6 ) from claude-noether-4 into main Reviewed-on: #6	2026-05-07 14:37:29 +00:00
claude-noether	1830c17891	notes: Bug #5 Phase 1 metric + Phase 0 anchor receipts Phase 0 anchored at N=3 reps (10min @ 4MB/s pv-cap on 2.4GHz): - rep1+2: ~700 KB/s sustained (10% of link capacity) - rep3: link death at ~9 min in (passive mode, beacon-loss cascade) Hot symbol identified: _raw_spin_unlock_irqrestore at ~20% CPU in both healthy and failed reps, callstack process_one_work → wsm_configuration → wsm_cmd_send → bes2600_bh.isra.0 → spin-unlock. Phase 1 metric locked: ≥2 MB/s sustained throughput, <10% CPU in lock- cycling, no link death under 30 min continuous load. Three Phase 4 candidates drafted (B5-1: shrink wsm_cmd_send lock scope; B5-2: coalesce vif_list_lock in BH dispatcher; B5-3: SPSC ringbuffer for WSM commands). Locking pending review.	2026-05-07 16:32:45 +02:00
claude-noether	69a1d0f8b1	notes: phase 7 verdict — Patch A confirmed, Patch B dormant Phase 7 verification of cleanups + Patch A + Patch B (srcversion 1B3B3ED0) on ohm 2026-05-07 12:48 → 15:13 CEST under netcat load ramped 1 MB/s → 4 MB/s on 2.4GHz newton. Patch A: predicted delta CONFIRMED at N=2 reproductions. - 13:47:56 storm → 1 s reassoc, no AP-deauth-6 escalation - 13:49:26 storm → 1 s reassoc, no AP-deauth-6 escalation Patch B: installed, untriggered. 2 api_connection_loss events spaced 91 s apart, never tripping the 3-in-60s threshold. No false positives, no spurious bus_resets. Recovery delta unobserved (no harm done). Trigger C: 17-frame AP-deauth-6 cluster at 12:53 with no patch hooks firing — bes2600 TX-side glitch suspect. Recovery via mac80211 reauth in ~4 s. New backlog item. Bug #5 documented separately (RX path degrades under throughput pressure; possible root of the original Phase-0 YouTube frame drops).	2026-05-07 15:18:36 +02:00
claude-noether	458ad36f8b	notes: backlog Bug #5 — RX path degrades under throughput pressure Observed 2026-05-07: bumping the netcat sender from 1 MB/s to 4 MB/s DECREASED ohm's observed RX rate (1015 KB/s → 563 KB/s) and degraded the link (signal -57 → -67 dBm, MCS 4 → 3). Chip can't sustain near- link-rate RX even though theoretical capacity is ~8 MB/s. Hypothesis: driver/firmware lock contention or busy-wait on the RX SDIO path. Plausibly explains the original Phase-0 observation that YouTube DASH chunks drop ~10 frames per chunk fetch — chunk fetch is a brief near-line-rate burst that this bug would be triggered by.	2026-05-07 13:56:36 +02:00
marfrit	ea509e810f	Merge pull request 'Phase 4 plan: Patch B (Trigger A / api_connection_loss)' (#5 ) from claude-noether-3 into main Reviewed-on: #5	2026-05-07 10:45:28 +00:00
claude-noether	e53aad5013	notes: phase 4 plan for Patch B (Trigger A / api_connection_loss) Drafted after Phase 7 verification of Patch A (PR #1, srcversion 21BD07B3). 10h30m sustained load on 2.4GHz produced: - 0 DecryptStormRecoveries (Patch A dormant; no decrypt-storm fired) - 9 mac80211 api_connection_loss events - 1 catastrophic blackhole at 02:42 (reason 4 inactivity → reauth with assoc-comeback timeouts → AP unprotected-deauth-6 cluster) Phase 4 pivots to Trigger A (Patch B). Candidate B-1 lock proposal: extend c5.2 bus_reset infrastructure to fire on N consecutive api_connection_loss events; reuses existing recovery path. Pending Phase 5 review before Phase 6 implementation. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-07 10:22:34 +02:00
marfrit	4acba3e707	Merge pull request #4 : Phase 4 plan: decrypt-storm fast-recover (Trigger B), with revised Phase 1	2026-05-06 17:30:48 +00:00