notes: Sonnet architect review for Bug #5 — ranked restructuring map

Sonnet (general-purpose subagent, model=sonnet) reviewed ~/src/besser/bes2600-dkms-mobian/bes2600/ given the Phase 0 measurement context. Output: 8-item ranked restructuring map, file:line cited. Headline: - Item 1: collapse sdio_rx_work relay into BH loop (~5x workqueue dispatch reduction, medium effort) - Item 2: batch deliver via ieee80211_rx_list (small effort, removes per-frame softirq) - Items 1 + 2 together collapse "9 workqueue events per delivered frame" to ~1. Items 3-5 clean up next-layer overhead (TX-side queue_work, per-frame ba_lock, ps_state_lock under known-dead PSM). Items 6-8 are follow-ons to be re-measured after 1-3 land. Phase 4 plan locking the lead candidate(s) follows in a separate PR.
notes: Bug #5 root cause refined — workqueue-per-SDIO-transaction is the floor
2026-05-07 17:38:16 +02:00 · 2026-05-07 17:31:31 +02:00 · 2026-05-07 16:38:58 +02:00 · 2026-05-07 14:37:29 +00:00 · 2026-05-07 16:32:45 +02:00 · 2026-05-07 15:18:36 +02:00
4 changed files with 496 additions and 0 deletions
@@ -0,0 +1,180 @@
 # BES2600 architecture review — Bug #5
 Date assembled: 2026-05-07
 Reviewer: Claude Sonnet (general-purpose subagent, model=sonnet)
 Driver source: `~/src/besser/bes2600-dkms-mobian/bes2600/` on boltzmann
 This is the architect-review pass requested in `notes/observed-bugs.md` after the Phase 0 measurement showed the throughput floor is set by per-SDIO-transaction workqueue dispatch overhead. The reviewer was given the measurement summary, source location, and a focused brief; output is a ranked restructuring map with file:line citations for every concrete claim.
 ---
 ## Measurement context (input to the reviewer)
 ```
 Reproduction:                pv -L 4M < /dev/zero | nc ohm 12345
 Module under test:           bes2600.ko srcversion 1B3B3ED0...  (cleanups + Patch A + Patch B)
 Hardware:                    PineTab2, RK3566 Cortex-A55 ARMv8.5, kernel 6.19.10-danctnix1
 Link rate:                   65 Mb/s ≈ 8 MB/s theoretical
 Observed throughput:         725 KB/s  (Phase 0 anchor at N=3)
                             rep 3 cascaded into beacon-loss disconnect at ~9 min in
 Per-second event rates (3-min capture under 4 MB/s pv-cap):
  workqueue_execute_start:  5,643/sec    ← architectural floor
  bes2600_rx_cb:              611/sec
  bes2600_bh_wakeup:          267/sec
  wsm_cmd_send:                13/sec    (host-to-chip command rate, surprisingly low)
  lock contention_begin:       50/sec    (modest)
  mmc_request_start:        ~5,800/sec   (matches workqueue rate — every SDIO transaction is its own work item)
 perf record top symbol:      _raw_spin_unlock_irqrestore  (~20 % CPU samples)
 Dominant callstack:          process_one_work → wsm_configuration → wsm_cmd_send → bes2600_bh.isra.0
 ```
 The implication: ~9 workqueue dispatches fire per frame delivered to mac80211. Items below address that ratio in descending order of predicted leverage.
 ---
 ## Item 1 — Two-hop workqueue relay: SDIO IRQ → `sdio_rx_work` → BH loop → mac80211
 **File:line:** `bes2600_sdio.c:416` (IRQ handler dispatches `rx_work`); `bes2600_sdio.c:829` (`sdio_rx_work` body); `bh.c:1330–1538` (BH main loop, `BES2600_RX_IN_BH` path); `bes2600_sdio.c:1267` (`bes_sdio` workqueue, `max_active=2`).
 **Current shape:** Every SDIO interrupt fires `queue_work(sdio_wq, &rx_work)`. `sdio_rx_work` reads up to `BES_SDIO_RX_MULTIPLE_NUM=16` frames (`hwio.h:294`) into per-frame SKBs, enqueues each onto `sbus_priv.rx_queue` under `rx_queue_lock`, then returns. Meanwhile the BH kthread (one work item queued at boot in `bh.c:93`, running an infinite loop inside `bes2600_bh()`) calls `pipe_read()` → `spin_lock(rx_queue_lock)` → `skb_dequeue()` → `wsm_handle_rx()` → `ieee80211_rx_irqsafe()` one frame at a time. When `pipe_read()` returns NULL and pending TX exists, `bes2600_sdio_pipe_read()` at `bes2600_sdio.c:941` re-dispatches `rx_work` — so a sustained RX stream fires **one `queue_work` per BH wakeup, not per burst**. That explains why `bh_wakeup` events are only 267/sec while `workqueue_execute_start` is 5,643/sec: the SDIO layer is firing a new `rx_work` item for every frame the BH loop drains.
 **Proposed shape:** Collapse `sdio_rx_work` and `pipe_read()` into the BH loop directly. The BH already runs in a dedicated `WQ_HIGHPRI | WQ_CPU_INTENSIVE` workqueue (`bh.c:66`) and (with `BES2600_RX_IN_BH` defined per `Makefile:159`) `bes2600_bh_rx_helper()` already dequeues from `rx_queue`. Merge `sdio_rx_work` into a function called synchronously from `bes2600_bh_rx_helper()` before the dequeue, guarded by a trylock so re-entry is safe. This eliminates O(N) `queue_work` calls per burst while keeping the BH as the single SDIO-access context.
 **Predicted delta vs Phase 1 metric:** Eliminates ~5 of the ~9 redundant workqueue dispatches per frame. 2–4× throughput improvement and a proportional drop in `_raw_spin_unlock_irqrestore` CPU cost.
 **Effort:** Medium. SDIO host-lock protocol (`sdio_claim_host`/`sdio_release_host`) is already managed inside `sdio_rx_work`; moving the body is mechanical but requires care around the `sdio_wq` `max_active=2` concurrency assumption.
 **Risks:** `sdio_rx_work` runs with `sdio_claim_host` held for the entire burst. Inside the BH it serialises all SDIO access fine. Watch `bes2600_sdio.c:1889` — flushes `rx_work` during teardown; that path must remain.
 ---
 ## Item 2 — `ieee80211_rx_irqsafe` instead of `ieee80211_rx` (pre-NAPI cw1200 ancestor pattern)
 **File:line:** `txrx.c:1947`, `txrx.c:1950`, `ap.c:99`, `sta.c:1487`, `wsm.c:2416`.
 **Current shape:** Every RX frame is delivered via `ieee80211_rx_irqsafe()`. This function enqueues the SKB onto a per-cpu `tasklet_rx` list and schedules a software IRQ. Under sustained load: one softirq wakeup per frame — 611 softirq wakeups/sec on top of the workqueue overhead.
 **Proposed shape:** Switch to `ieee80211_rx_ni()` (process context, which `wsm_handle_rx` is already in) or, better, batch-deliver frames using `ieee80211_rx_list()` (introduced in kernel 5.12, available in 6.19). Accumulate frames from a single `sdio_rx_work` burst into a `list_head`, then call `ieee80211_rx_list()` once per burst.
 **mac80211 contract:** `ieee80211_rx_list()` is safe from process context with the same `ieee80211_rx_status` rules as `ieee80211_rx_ni()`. Per `include/net/mac80211.h` — kerneldoc states it takes the RX path atomically only when called from softirq context; from process context it uses the same path as `ieee80211_rx_ni()`.
 **Predicted delta:** Reduces per-frame softirq overhead. Hard to isolate independently of item 1, but combined the two deliver the < 10 % CPU-in-lock target.
 **Effort:** Small (once item 1 is done — the batch list naturally exists at the burst boundary).
 **Risks:** Must hold `rcu_read_lock()` at call site; `skb->cb` (`IEEE80211_SKB_RXCB`) must be filled before the call, as today. The `early_data` path at `txrx.c:1942` uses `skb_queue_tail` into a per-link queue before calling `ieee80211_rx_irqsafe` — that path must be excluded from batch collection.
 ---
 ## Item 3 — Per-frame `queue_work(sdio_wq, &tx_work)` in the TX send path
 **File:line:** `bes2600_sdio.c:1236` (inside `bes2600_sdio_pipe_send()`).
 **Current shape:** Every call to `bes2600_sdio_pipe_send()` appends one descriptor to `tx_bufferlist` and immediately calls `queue_work(sdio_wq, &tx_work)`. `sdio_tx_work` then drains the list with scatterlist batching (up to `BES_SDIO_TX_MULTIPLE_NUM=16` frames per SDIO transfer). At low rates the workqueue's pending-but-not-started dedup means only one dispatch fires; at high TX rates — especially after `atomic_add(1, &hw_priv->bh_tx)` in `bh.c` reschedules TX — successive `pipe_send` calls each hit `queue_work` before the previous fires, multiplying dispatches.
 **Proposed shape:** Stage all frames into `tx_bufferlist` in the BH TX loop, then flush `sdio_tx_work` synchronously (call the work function body directly) before returning to the wait-event. The TX mirror of item 1.
 **Predicted delta:** Removes redundant TX-side `queue_work` calls. Lower priority than RX side given current TX rate (13 `wsm_cmd_send`/sec is host→chip control plane; data-plane TX is also limited by firmware buffer count `numInpChBufs`), but it does remove one source of the 5,643/sec workqueue count.
 **Effort:** Small.
 **Risks:** `sdio_tx_work` calls `sdio_claim_host`/`sdio_release_host` internally. Running directly from BH context requires confirming no deadlock with the SDIO bus claim that `sdio_rx_work` (now merged per item 1) holds. The TX flush must happen after the RX burst, matching the existing BH loop structure (`rx:` → `tx:` ordering in `bh.c:1444`).
 ---
 ## Item 4 — `ba_lock` per-frame acquisition in the RX path
 **File:line:** `txrx.c:998–1005` (`bes2600_rx_h_ba_stat()`); `txrx.c:1159–1164` and `txrx.c:1682–1698` (`tsm_lock`, `CONFIG_BES2600_TESTMODE`-gated).
 **Current shape:** `bes2600_rx_cb()` calls `bes2600_rx_h_ba_stat()` for every non-multicast data frame. That function acquires `ba_lock` under BH (`spin_lock_bh`) to increment `ba_acc_rx` and `ba_cnt_rx`, then sets a timer. At 611 frames/sec that's 611 lock acquisitions/sec on `ba_lock` alone.
 **Proposed shape:** Replace per-frame `ba_lock` with a per-cpu counter (or `atomic64_t`) for `ba_acc_rx` and `ba_cnt_rx`. The timer arm (`mod_timer`) is the actual reason for the lock — a `READ_ONCE`/`cmpxchg` on a flag to detect first-frame-in-interval is sufficient.
 **Predicted delta:** Removes 611 lock acquisitions/sec from the RX hot path. Not the dominant cost but the next bottleneck after items 1–2 land.
 **Effort:** Small.
 **Risks:** `ba_lock` also serialises TX-side block-ack accounting (`txrx.c:1632`). The per-cpu approach requires a fold step in the timer callback — cheap.
 ---
 ## Item 5 — Skip `ps_state_lock` acquisitions when PSM is known-disabled
 **File:line:** `bes2600.h:320` (decl); `txrx.c:1340–1365` (`vif_lock`); `txrx.c:1415–1426` (`ps_state_lock`); `txrx.c:1942–1948` (RX-side `ps_state_lock`).
 **Current shape:** `ps_state_lock` is taken on every TX frame if powersave is active. Per memory `reference_bes2600_firmware_no_psm.md`, **PSM is non-functional on this firmware** — c7 already self-detects this and latches `pm_unsupported = true`. The `ps_state_lock` guards in the RX callback and TX path are therefore taking dead overhead.
 **Proposed shape:** Add a `READ_ONCE()` check on `powersave_enabled` before taking `ps_state_lock`; if false, skip the lock and the PSM state update entirely. Since c7's `pm_unsupported` latches, this is safe.
 **Predicted delta:** Small absolute gain at current TX rate, but prevents fast path from regressing as throughput improves.
 **Effort:** Small.
 **Risks:** `powersave_enabled` is written from process context (`bh.c:403`). `READ_ONCE` without lock is safe — at worst one spurious PSM notification, not a state corruption.
 ---
 ## Item 6 — Firmware block-read size cap (`EFFECTIVE_BUF_SIZE = 8190 bytes`)
 **File:line:** `bh.c:33–36` (defines); `bes2600_sdio.c:721–783` (`bes2600_sdio_extract_packets()`); `hwio.h:294` (`BES_SDIO_RX_MULTIPLE_NUM=16`).
 **Current shape:** `BES_SDIO_RX_MULTIPLE_NUM=16` and `BES_SDIO_OPTIMIZED_LEN` both defined (`Makefile:90,92`). The RX burst reads `PACKET_TOTAL_LEN(ctrl_reg)` bytes in a single CMD53; each sub-packet bounded by `EFFECTIVE_BUF_SIZE = (0x1000-4)*2 - 2 = 8190` bytes. At 611 frames/sec ÷ 267 BH wakeups/sec ≈ **2.3 frames per wakeup** — well under the 16-frame limit. **Not the bottleneck today.**
 **Proposed shape:** No change needed now. Re-evaluate after items 1–3 land if throughput rises past ~3 MB/s. Verify ctrl_reg `PACKET_TOTAL_LEN` field values during high load — requires firmware-trace observation we don't currently have.
 **Effort:** N/A.
 **Risks:** Increasing beyond 16 requires a larger DMA allocation (currently `1632 × 16 = 26 KB`). Cortex-M4F firmware side is opaque.
 ---
 ## Item 7 — Duplicate workqueues (`hw_priv->workqueue` vs `hw_priv->bh_workqueue` vs `sbus_priv->sdio_wq`)
 **File:line:** `bes2600.h:323` (`workqueue`); `bes2600.h:385` (`bh_workqueue`); `bes2600_sdio.c:63` (`sdio_wq`). `txrx.c` has 10 `queue_work(hw_priv->workqueue, ...)` calls for control-plane work.
 **Current shape:** Three distinct workqueues. The 5,643 `workqueue_execute_start`/sec are dominated by `sdio_wq` items, not `workqueue`. `workqueue` items are control-plane events at rates well below the data-plane.
 **Proposed shape:** After item 1 (merging `sdio_rx_work` into BH), `sdio_wq` only carries `tx_work`. After item 3 (synchronous TX flush from BH), `sdio_wq` is idle during normal data-plane and could be replaced with `system_highpri_wq`.
 **Effort:** Small (follow-on to items 1 and 3).
 **Risks:** None if items 1 and 3 land first.
 ---
 ## Item 8 — `BH_RX_CONT_LIMIT=3` cap on RX burst per BH wakeup
 **File:line:** `bh.c:1380–1405` (timeout detection); `bh.c:1330` (`BH_RX_CONT_LIMIT=3`); `bh.c:1331` (`BH_TX_CONT_LIMIT=20`).
 **Current shape:** BH loop limits RX burst to 3 consecutive iterations before breaking back to wait-event. At 611 frames/sec ÷ 267 wakeups/sec ≈ 2.3 frames per wakeup → not the bottleneck today. **After items 1–3 land**, per-burst frame rate will rise and `BH_RX_CONT_LIMIT=3` becomes the ceiling.
 **Proposed shape:** Raise to 16 (matching `BES_SDIO_RX_MULTIPLE_NUM`) after items 1–3 are deployed and re-measured.
 **Effort:** Trivial (constant change), but must wait for Phase 7 measurements post 1–3.
 **Risks:** Too high a limit under firmware anomaly (corrupted ctrl_reg) can spin BH long enough to miss beacon ACK deadline. Bound to `BES_SDIO_RX_MULTIPLE_NUM` as safe ceiling.
 ---
 ## Ranking summary
 | Rank | Item | Predicted gain | Effort |
 |------|------|----------------|--------|
 | 1 | Collapse `sdio_rx_work` relay into BH loop | ~5x workqueue dispatch reduction | Medium |
 | 2 | Batch deliver via `ieee80211_rx_list()` | Removes per-frame softirq | Small |
 | 3 | Synchronous TX flush from BH | Removes TX-side dispatch noise | Small |
 | 4 | Replace `ba_lock` per-frame with atomic/per-cpu | Removes 611 lock/sec from RX hot path | Small |
 | 5 | Skip `ps_state_lock` when PSM-known-disabled | Removes dead overhead | Small |
 | 6 | Raise `BH_RX_CONT_LIMIT` after 1–3 land | Unlocks residual throughput | Trivial |
 | 7 | Consolidate workqueues post-items 1&3 | Cleanup | Small |
 | 8 | Firmware block-read size | Not bottleneck at current rates | N/A |
 **Items 1 + 2 together are the structural answer to the measurement**: ~9 workqueue events per delivered frame collapse to ~1, and the per-frame softirq cost disappears. Items 3–5 clean up the next layer. The beacon-loss cascade at 9 minutes is almost certainly starvation of the BH wait-event under the per-frame workqueue storm — item 1 removes the mechanism that makes the cascade possible.
 ---
 ## Next campaign step
 A Phase 4 plan locking item 1 (and possibly item 2) follows in a separate PR. The remaining items go on the campaign backlog as follow-on patches once the Phase 7 verification of item-1-or-1-plus-2 confirms the predicted delta.
@@ -82,6 +82,91 @@ without board power-cycle").
 **Status**: task c3 (indirectly, via bes_chardev removal which currently
 gates the signal/nosignal mode switch path).
 ## Architect review — now BUG-#5-blocking (was backlog)
 The Phase 0 perf trace for Bug #5 first exposed a "when in doubt, add a
 lock" pattern (~20 % CPU in `_raw_spin_unlock_irqrestore`). The
 follow-up ftrace measurement (2026-05-07 17:00) refined the root cause
 to an architectural problem: **the bes2600 driver dispatches every
 SDIO transaction through the kernel workqueue**. Numbers from a 3-min
 4 MB/s ohm capture (post-reboot, srcversion `1B3B3ED0`):
 ```
 wsm_cmd_send:               13/sec    (host-to-chip command rate, surprisingly low)
 bes2600_rx_cb:             611/sec
 bes2600_bh_wakeup:         267/sec
 lock contention_begin:      50/sec
 workqueue_execute_start: 5,643/sec    ← DOMINATES; matches the mmc
                                       transaction rate from earlier perf
 ```
 5.6 k workqueue dispatches per second is the throughput floor — not a
 specific lock, not WSM-command rate, not decrypt-state. A surgical fix
 to any single function won't move the floor; the architecture needs
 to be restructured to amortise SDIO transactions across fewer work-
 items (or move SDIO RX out of the workqueue entirely).
 This is where the **Claude Sonnet architect review** belongs: a
 top-to-bottom assessment of `~/src/besser/bes2600-dkms-mobian/bes2600/`
 focused on:
 - the workqueue dispatch shape (most actionable)
 - needless lock proliferation (the original signal)
 - BH / RX scheduling boundaries
 - error-handling coverage and dead-code from the cw1200 ancestor
 - API contract violations relative to mainline mac80211
 Output: ranked list of restructuring targets, with predicted-delta
 estimates against the Phase 1 metric (≥ 2 MB/s sustained @ 4 MB/s cap,
 < 10 % CPU in lock-cycling, no link cascade in 30 min).
 **Status**: now blocking on Bug #5 (was independent track). Surgical
 patches B5-1, B5-2, B5-3 from the original Phase 4 candidate list are
 all DEFERRED until the architect review's restructuring map is in.
 ## Bug #5 — RX path degrades under attempted-throughput pressure
 **Suspect file**: bes2600 RX path (`txrx.c bes2600_rx_cb`, `bh.c bes2600_bh_work`,
 SDIO RX scheduling) — pinpoint pending.
 **Symptom (observed 2026-05-07 13:43, srcversion `1B3B3ED0` = c-stack +
 Patch A + Patch B, ohm @ -57 dBm 2.4GHz ch11 5b:32, idle save for the
 netcat load):**
 ```
 sender cap 1 MB/s  →  ohm receives 1015 KB/s,  signal -57 dBm,  RX MCS 4
 sender cap 4 MB/s  →  ohm receives  563 KB/s,  signal -67 dBm,  RX MCS 3
                     (Send-Q on boltzmann backed up to 1.16 MB)
 ```
 Pushing the sender-side cap from 1 MB/s to 4 MB/s **decreased** observed
 throughput at the receiver and degraded the link metrics. Signal dropped
 ~10 dB and the chip downshifted MCS, suggesting the chip can't sustain
 the higher RX rate even with the link physically capable of more (link
 bitrate 65 Mb/s = ~8 MB/s theoretical).
 **Hypothesis (Markus, 2026-05-07): driver/firmware locks itself to death
 under busy reads** — possibly a busy-wait loop or lock contention on the
 RX SDIO path that prevents draining at line rate. Plausible reason it
 didn't surface for the c-stack tasks: those operated at typical
 browse-rate traffic, well below the saturation threshold this bug needs
 to fire.
 **May explain**: original Phase-0 observation that **YouTube DASH chunks
 drop ~10 frames per chunk fetch** on hardware-decoder playback. A chunk
 fetch is a brief burst at near-link-rate; if the driver throttles itself
 down during high-RX, the player buffer underruns for the duration of
 the fetch.
 **How to drill (when prioritized)**:
 - Capture trace_pipe with `mmc:*` and `sdio*` events enabled during a
  controlled rate-ramp (e.g., pv -L 500K, 1M, 2M, 4M each for 60 s).
 - Watch `/proc/sys/kernel/sched_*` and the `bes2600_bh_work` kworker for
  CPU saturation.
 - `perf top -p $(pgrep -f bes_sdio)` during 4 MB/s load.
 **Status**: backlog. No patch yet.
 ## Bug #4 — scan_complete_cb constant loop
 **File**: `scan.c:883-909` — `bes2600_scan_complete_cb()`.
@@ -0,0 +1,135 @@
 # Bug #5 campaign — Phase 1 metric + Phase 2 situation
 Date assembled: 2026-05-07
 Module under test (baseline): bes2600.ko srcversion `1B3B3ED096AAD7217FEDE11`
                              (cleanups + Patch A + Patch B)
 Phase 0 anchor at N=3 reps (10 min each, 4 MB/s pv-cap on boltzmann → ohm
 2.4GHz 5b:32) reproduces the throughput regression and traces it to lock-
 cycling cost in the bes2600 BH path. See `notes/observed-bugs.md` Bug #5
 for the original report. This document locks Phase 1 and prepares Phase 4
 candidates.
 ---
 ## Phase 1 — measurable target (locked)
 > *Reduce the bes2600 BH path's spin-unlock-irqrestore cost so that under sustained 4 MB/s sender pressure on a healthy 2.4GHz link (signal -55 to -65 dBm), ohm sustains ≥ 2 MB/s observed RX throughput (vs 663–725 KB/s baseline at N=3) AND the link survives ≥ 30 min continuous load without cascading into beacon-loss disconnect (vs rep3's failure at ~9 min).*
 Three measurable outcomes, single sentence:
 - **(a) Throughput floor**: ≥ 2 MB/s sustained RX at ohm
 - **(b) Lock-cycle ceiling**: % CPU in `_raw_spin_unlock_irqrestore` from `bes2600_bh.isra.0` callstack drops to < 10 % (currently ~10 % rep1, ~16 % rep3)
 - **(c) Cascade prevention**: no link death under continuous 30 min @ 4 MB/s
 ---
 ## Phase 0 anchor — receipts
 ### Reproduction protocol (same units as Phase 7 will use)
 1. boltzmann: `pv -L 4M -q < /dev/zero | nc ohm.fritz.box 12345`
 2. ohm: `sudo $RUN/rep-trace.sh 600` (10 min capture window)
 3. Rep dirs: `bug5/rep-<ts>/{mmc.log, perf.data, rx_bytes.tsv, start.txt, end.txt}`
 4. N=3 reps with 60 s cooldowns
 ### Observed (2026-05-07 15:36–16:08)
 | rep | duration | avg KB/s | near-zero ticks | end state |
 |---|---|---|---|---|
 | 1 | 600 s | 725 | 1/119 | associated, MCS 6 |
 | 2 | 600 s | 663 | 5/119 | associated, MCS 4 |
 | 3 | 600 s |  75 | 53/119 | **passive (link died at sample ~82, ~9 min in)** |
 mmc transaction rate: rep1 = 5793/s sustained, rep3 = 6000/s for first ~10s then collapse to <100/s.
 ### Hot-symbol receipts (perf record on `bes_sdio | bes2600_bh` kworkers)
 | symbol | rep 1 (healthy) | rep 3 (cascade) |
 |---|---|---|
 | `_raw_spin_unlock_irqrestore` (sum across kworker variants) | **~19 %** | **~21 %** |
 | `handle_softirqs` | 5.4 % | 4.3 % |
 | `__tasklet_schedule` | 2.4 % | 2.0 % |
 | `dw_mci_start_command` (SDIO host) | 1.5 % | < 0.6 % |
 | `bes2600_sw_retry_requeue` | 0.79 % | 0.70 % |
 Top callchain leading to `_raw_spin_unlock_irqrestore`:
 ```
 process_one_work → worker_thread → wsm_configuration → wsm_cmd_send → bes2600_bh.isra.0
 ```
 ---
 ## Phase 2 — situation analysis
 ### Relevant source pins
 - `bes2600/wsm.c:98` — `wsm_cmd_send()`, the function in the hot callstack. Body:
  - holds `&hw_priv->wsm_cmd.lock` (spin) only briefly to fill the cmd struct (lines ~145–152)
  - calls `bes2600_bh_wakeup()` then `wait_event_interruptible_timeout` for the response
  - outer lock: `down(&hw_priv->wsm_cmd_sema)` from callers (`wsm_cmd_lock()` at wsm.c:105)
 - `bes2600/bh.c:435,559,847` — `bes2600_bh_work()` takes `&hw_priv->vif_list_lock` 3× per pass through the dispatcher
 - `bes2600/bh.c:172,195,487,581,861,1361,1427` — multiple `wait_event*` calls; the BH thread cycles through wait/wake/dispatch
 ### What the lock-cycling cost is buying
 Each WSM command from the host (mac80211, NM, kernel scan etc.) goes through the same path:
 1. caller acquires `wsm_cmd_sema` (outer)
 2. `wsm_cmd_send()` acquires `wsm_cmd.lock`, fills the struct, releases the lock
 3. `bes2600_bh_wakeup()` schedules the BH
 4. BH dispatches the command, takes `vif_list_lock` to walk vifs
 5. BH talks to chip via SDIO
 6. response arrives, BH wakes the waiter
 7. caller releases `wsm_cmd_sema`
 Under sustained TCP RX, mac80211 issues lots of small WSM commands (TX-scheduling hints, rate updates, etc.) — every one cycles through this path. The spin-unlock-irqrestore cost is the floor on this cycle rate.
 ### What's been ruled out
 - AP-side bug (AVM Fritz!Box, reliable per Markus's testimony — see `feedback_phase7_stress_ramp.md` reasoning in the campaign-so-far prior).
 - Patch A / Patch B (target different triggers; would not address lock cost).
 - Decrypt-failure storm (Patch A handles; this regression occurs in rep1 with zero decrypt-fails).
 - mac80211 scan-fail / scan-comeback (cosmetic; doesn't account for the throughput floor).
 ---
 ## Phase 4 — candidate plans (preliminary, not locked)
 Three candidates surfaced from the perf data. Listed cheapest to most-invasive.
 ### B5-1 — Reduce `wsm_cmd_send` lock scope
 The spin_lock around the cmd-struct fill (wsm.c:145–152) can probably be replaced with `WRITE_ONCE` of a single sentinel field, since the BH thread reads these fields cooperatively (BH only reads after `bh_wakeup` schedules it, and only writes back via the response path). Eliminates the per-command spin-lock cycle for the host side.
 **Risk**: race with BH if the protocol isn't strictly happens-before. Need to read bh.c:486-500 (where BH reads wsm_cmd.ptr) carefully.
 **Predicted delta**: small but measurable. Maybe 2–3 % CPU off the lock floor.
 ### B5-2 — Coalesce `vif_list_lock` in BH dispatcher
 bh.c takes `vif_list_lock` 3× per dispatcher pass. If these 3 critical sections are within a single iteration, they should be merged into one acquire/release.
 **Risk**: vif teardown might depend on releasing the lock between iterations to allow concurrent vif removal. Needs careful audit.
 **Predicted delta**: significant under multi-vif (we're single-vif STA today, so probably small immediate gain).
 ### B5-3 — Move WSM send out of process context, use ringbuffer
 Replace the wsm_cmd_sema + wsm_cmd struct mechanism with an SPSC ringbuffer between caller and BH. Caller writes to ring, no lock needed (one producer); BH reads from ring, no lock needed (one consumer).
 **Risk**: significant rework. cw1200 ancestor doesn't have this; we'd be inventing it.
 **Predicted delta**: large — could halve the lock cost. But cost-to-implement is also large.
 ---
 ## Open question for the reviewer
 Which Phase 4 candidate to lock? My ranking by ROI:
 1. B5-1 (smallest, fastest, cleanest source pin) — try first
 2. B5-2 (medium, conditional on multi-vif applicability)
 3. B5-3 (largest rework, biggest potential)
 Or: instrument deeper before committing to a fix (e.g., add `tracepoints around wsm_cmd_send` enter/exit to measure lock holdtime distribution, not just CPU%).
@@ -0,0 +1,96 @@
 # BES2600 WiFi-stability campaign — Phase 7 verdict (Patches A + B)
 Date assembled: 2026-05-07
 Module under test: bes2600.ko srcversion `1B3B3ED096AAD7217FEDE11`
                   (cleanups + Patch A + Patch B)
 Run dir: `/root/bes2600-samples/run-20260507-1248-patchB/` on ohm
 Phase 7 verification window: 2026-05-07 12:48 → ~15:13 CEST (≈ 2 h 25 m)
 of which: ~50 min @ 1 MB/s pv-cap, ~1 h 30 m @ 4 MB/s pv-cap on 2.4 GHz
 newton (5b:32, signal -57 to -67 dBm).
 ---
 ## Result table (vs the Phase 4 predicted delta)
 ### Patch A — decrypt-storm fast-recover (Trigger B)
 | metric | Phase 3 baseline | Phase 4 prediction | Phase 7-of-B observed |
 |---|---|---|---|
 | decrypt-burst rate | 8.18/h | unchanged | 2 bursts in ~22 min once 4MB/s pressure was on |
 | AP-deauth-6 rate following burst | 100 % escalation | ≤ 0.2 × baseline | **0/2 = 0 % escalation** |
 | recovery time given burst | up to 109 s | < 5 s | **~1 s** (×2) |
 **Verdict: predicted delta CONFIRMED at N=2.** CLAUDE.md ideal is N=3; we're directionally locked at 2 reproductions, both behaving as predicted (threshold trip → `[bes2600] decrypt-storm fast-recover: forcing reassoc` log line → mac80211 disassoc → userspace reauth in ≈1 s).
 #### Receipts (verbatim)
 ```
 13:47:56  bes2600_wlan: [bes2600] decrypt-storm fast-recover: forcing reassoc
 13:47:57  wlan0: associated to cc:ce:1e:2b:74:17     (cross-BSSID, 1 s)
 13:49:26  bes2600_wlan: [bes2600] decrypt-storm fast-recover: forcing reassoc
 13:49:27  wlan0: associated to c0:25:06:e6:5b:32     (back home, 1 s)
 ```
 `DecryptStormRecoveries: 2` exposed via debugfs at `/sys/kernel/debug/ieee80211/phy0/bes2600/vif_0/status`.
 ### Patch B — connection-loss-storm bus_reset (Trigger A)
 | metric | Phase 7-of-A observed | Phase 4 prediction | Phase 7-of-B observed |
 |---|---|---|---|
 | api_connection_loss rate | 0.86/h | unchanged | 2 events in ~2 h (≈ 1/h) |
 | ConnectionLossStormRecoveries | n/a | trips on 3-in-60s bursts | **0** |
 | Threshold trip events | n/a | (when burst occurs) | **0** (events spaced 91 s apart) |
 **Verdict: installed but UNTRIGGERED.** The 3-in-60s threshold was never reached (max-cluster observed: 2-in-91s). No false positives, no spurious bus_resets. Predicted delta unobserved — same shape as Patch A's first Phase 7 run.
 The threshold may be too conservative for typical event rates (we'd need a true api_connection_loss flood to trip it). Tuning is a future Phase-1 question if more reproductions accumulate.
 ### Trigger C — AP unprotected-deauth-6 cluster without preceding storm
 ```
 12:53:10.475 → 12:53:11.756  AP fires 17 unprotected-deauth-6 from 5b:32 over 1.3 s
                              (2 mgmt-TX no-ack from our chip in the middle)
 12:53:12.309  kernel: deauthenticating ... reason 2 = PREV_AUTH_NOT_VALID
 12:53:14–15  reauth via 61:b0 → 5b:32, recovery in ~4 s
 ```
 Neither Patch A (zero decrypt-fails preceded) nor Patch B (zero api_connection_loss) fired. Background: AVM Fritz!Boxes (newton) are reliable; the AP correctly classified ohm's frames as Class 2 from non-auth, meaning **bes2600 sent something the AP couldn't authenticate**. New backlog entry: `notes/observed-bugs.md` Bug #5 (RX path under throughput pressure) is the leading hypothesis surface.
 Recovery was fast (4 s) so this isn't a P0 — but a Patch C investigation is warranted when prioritized.
 ---
 ## Bug #5 — RX path degradation under attempted-throughput pressure (NEW)
 ```
 sender 1 MB/s  →  ohm receives 1015 KB/s,  -57 dBm,  RX MCS 4
 sender 4 MB/s  →  ohm receives  563 KB/s,  -67 dBm,  RX MCS 3
 ```
 Higher attempted-throughput on the sender side → LOWER observed throughput at ohm. Signal degraded ~10 dB, MCS dropped a notch. Link-physical max is ~8 MB/s; we're getting ~7 % of that under load.
 **Hypothesis (Markus): driver/firmware locks itself to death under busy reads.** Plausibly the same root-cause as the Phase 0 YouTube DASH chunk-fetch drops (~10 frames per chunk fetch on hardware-decoder playback). Documented as Bug #5 in `notes/observed-bugs.md`.
 ---
 ## Lessons captured for memory (Phase 8 anchor)
 1. **Stress-rate matters for verification.** Patch A's predicted delta only became observable when the netcat cap went 1 → 4 MB/s. The previous Phase 7 (10h30m @ 1 MB/s) saw zero decrypt-storms. Future Phase 7 protocols should plan a stress ramp from steady to near-saturation, not just the steady setting.
 2. **"Untriggered, no harm" is a valid Phase 7 verdict** for installed patches. Patch B fits this exactly. The patch is ready; the trigger pattern just doesn't fire often enough in this RF / load regime to verify the recovery delta. Don't let unobserved verifications block the loop.
 3. **Build infrastructure on `cleanups` not `mobian`.** The Phase 6 attempt to base Patch B on mobian forced a refactor mid-flight; the c-stack lives on cleanups, and re-using c5.2's `bes2600_chrdev_do_bus_reset` requires that. The cleanups branch is the campaign's working trunk.
 4. **AP-side bug is unlikely on AVM hardware.** AVM Fritz!Boxes don't fire spurious deauth-6 storms. When ohm sees AP-deauth-6 unprovoked, the suspect chain is bes2600 sending something the AP can't authenticate. The bias toward "bes2600 is the broken thing" is empirically validated.
 5. **AP-deauth-6 can fire without our local triggers.** Trigger C is a real failure mode neither Patch A nor B addresses. Adding a Phase-1-style metric for "AP-deauth-6 rate without preceding decrypt-storm or api_connection_loss" would surface Trigger C cleanly.
 6. **`pv -L` cap interacts with TCP retransmit recovery.** When the link can't sustain the cap, TCP backs off and pv blocks. Observed throughput is then a **floor on chip RX capacity at that signal level**, not the sender's intent. Useful for chip-load-characterization, but the cap should be set based on observed pull-rate, not on the link's nominal MCS rate.
 ---
 ## Loop status
 - Phase 7: closed.
 - Patch A: confirmed (N=2). Stays in.
 - Patch B: installed, dormant in this regime, no harm. Stays in.
 - Bug #5: backlog, no patch yet. Documented.
 - Trigger C: backlog candidate, no patch yet. Documented.
 Next campaign cycle would be re-anchoring Phase 0 around Bug #5 or Trigger C.
Author	SHA1	Message	Date
claude-noether	679083d1aa	notes: Sonnet architect review for Bug #5 — ranked restructuring map Sonnet (general-purpose subagent, model=sonnet) reviewed ~/src/besser/bes2600-dkms-mobian/bes2600/ given the Phase 0 measurement context. Output: 8-item ranked restructuring map, file:line cited. Headline: - Item 1: collapse sdio_rx_work relay into BH loop (~5x workqueue dispatch reduction, medium effort) - Item 2: batch deliver via ieee80211_rx_list (small effort, removes per-frame softirq) - Items 1 + 2 together collapse "9 workqueue events per delivered frame" to ~1. Items 3-5 clean up next-layer overhead (TX-side queue_work, per-frame ba_lock, ps_state_lock under known-dead PSM). Items 6-8 are follow-ons to be re-measured after 1-3 land. Phase 4 plan locking the lead candidate(s) follows in a separate PR.	2026-05-07 17:38:16 +02:00
claude-noether	594f73c6b4	notes: Bug #5 root cause refined — workqueue-per-SDIO-transaction is the floor Follow-up ftrace measurement (post-reboot, 3-min 4MB/s capture): - workqueue_execute_start: 5,643/sec ← dominates - wsm_cmd_send: only 13/sec (host-to-chip command path NOT the hotspot) - lock contention: 50/sec (modest) The throughput floor is set by per-SDIO-transaction workqueue dispatch overhead. Surgical patches B5-1/B5-2/B5-3 from the prior Phase 4 plan all targeted the wrong layer; deferring those until an architectural restructuring map is produced. Promoting the Sonnet architect review from "backlog" to "blocking on Bug #5" — the next step is a restructuring assessment, not another patch.	2026-05-07 17:31:31 +02:00
claude-noether	928268f477	notes: backlog Sonnet architect review of bes2600 driver Per PR #6 review feedback. Independent track from Bug #5; scheduled once the Bug #5 measurement pass finishes.	2026-05-07 16:38:58 +02:00
marfrit	425eb92456	Merge pull request 'Bug #5 Phase 1 metric + Phase 0 anchor receipts' (#6 ) from claude-noether-4 into main Reviewed-on: #6	2026-05-07 14:37:29 +00:00
claude-noether	1830c17891	notes: Bug #5 Phase 1 metric + Phase 0 anchor receipts Phase 0 anchored at N=3 reps (10min @ 4MB/s pv-cap on 2.4GHz): - rep1+2: ~700 KB/s sustained (10% of link capacity) - rep3: link death at ~9 min in (passive mode, beacon-loss cascade) Hot symbol identified: _raw_spin_unlock_irqrestore at ~20% CPU in both healthy and failed reps, callstack process_one_work → wsm_configuration → wsm_cmd_send → bes2600_bh.isra.0 → spin-unlock. Phase 1 metric locked: ≥2 MB/s sustained throughput, <10% CPU in lock- cycling, no link death under 30 min continuous load. Three Phase 4 candidates drafted (B5-1: shrink wsm_cmd_send lock scope; B5-2: coalesce vif_list_lock in BH dispatcher; B5-3: SPSC ringbuffer for WSM commands). Locking pending review.	2026-05-07 16:32:45 +02:00
claude-noether	69a1d0f8b1	notes: phase 7 verdict — Patch A confirmed, Patch B dormant Phase 7 verification of cleanups + Patch A + Patch B (srcversion 1B3B3ED0) on ohm 2026-05-07 12:48 → 15:13 CEST under netcat load ramped 1 MB/s → 4 MB/s on 2.4GHz newton. Patch A: predicted delta CONFIRMED at N=2 reproductions. - 13:47:56 storm → 1 s reassoc, no AP-deauth-6 escalation - 13:49:26 storm → 1 s reassoc, no AP-deauth-6 escalation Patch B: installed, untriggered. 2 api_connection_loss events spaced 91 s apart, never tripping the 3-in-60s threshold. No false positives, no spurious bus_resets. Recovery delta unobserved (no harm done). Trigger C: 17-frame AP-deauth-6 cluster at 12:53 with no patch hooks firing — bes2600 TX-side glitch suspect. Recovery via mac80211 reauth in ~4 s. New backlog item. Bug #5 documented separately (RX path degrades under throughput pressure; possible root of the original Phase-0 YouTube frame drops).	2026-05-07 15:18:36 +02:00
claude-noether	458ad36f8b	notes: backlog Bug #5 — RX path degrades under throughput pressure Observed 2026-05-07: bumping the netcat sender from 1 MB/s to 4 MB/s DECREASED ohm's observed RX rate (1015 KB/s → 563 KB/s) and degraded the link (signal -57 → -67 dBm, MCS 4 → 3). Chip can't sustain near- link-rate RX even though theoretical capacity is ~8 MB/s. Hypothesis: driver/firmware lock contention or busy-wait on the RX SDIO path. Plausibly explains the original Phase-0 observation that YouTube DASH chunks drop ~10 frames per chunk fetch — chunk fetch is a brief near-line-rate burst that this bug would be triggered by.	2026-05-07 13:56:36 +02:00
marfrit	ea509e810f	Merge pull request 'Phase 4 plan: Patch B (Trigger A / api_connection_loss)' (#5 ) from claude-noether-3 into main Reviewed-on: #5	2026-05-07 10:45:28 +00:00