besser/notes/architect-review-bug5-2026-05-07.md

# BES2600 architecture review — Bug #5

Date assembled: 2026-05-07
Reviewer: Claude Sonnet (general-purpose subagent, model=sonnet)
Driver source: `~/src/besser/bes2600-dkms-mobian/bes2600/` on boltzmann

This is the architect-review pass requested in `notes/observed-bugs.md` after the Phase 0 measurement showed the throughput floor is set by per-SDIO-transaction workqueue dispatch overhead. The reviewer was given the measurement summary, source location, and a focused brief; output is a ranked restructuring map with file:line citations for every concrete claim.

---

## Measurement context (input to the reviewer)

```
Reproduction:                pv -L 4M < /dev/zero | nc ohm 12345
Module under test:           bes2600.ko srcversion 1B3B3ED0...  (cleanups + Patch A + Patch B)
Hardware:                    PineTab2, RK3566 Cortex-A55 ARMv8.5, kernel 6.19.10-danctnix1
Link rate:                   65 Mb/s ≈ 8 MB/s theoretical
Observed throughput:         725 KB/s  (Phase 0 anchor at N=3)
                             rep 3 cascaded into beacon-loss disconnect at ~9 min in

Per-second event rates (3-min capture under 4 MB/s pv-cap):
  workqueue_execute_start:  5,643/sec    ← architectural floor
  bes2600_rx_cb:              611/sec
  bes2600_bh_wakeup:          267/sec
  wsm_cmd_send:                13/sec    (host-to-chip command rate, surprisingly low)
  lock contention_begin:       50/sec    (modest)
  mmc_request_start:        ~5,800/sec   (matches workqueue rate — every SDIO transaction is its own work item)

perf record top symbol:      _raw_spin_unlock_irqrestore  (~20 % CPU samples)
Dominant callstack:          process_one_work → wsm_configuration → wsm_cmd_send → bes2600_bh.isra.0
```

The implication: ~9 workqueue dispatches fire per frame delivered to mac80211. Items below address that ratio in descending order of predicted leverage.

---

## Item 1 — Two-hop workqueue relay: SDIO IRQ → `sdio_rx_work` → BH loop → mac80211

**File:line:** `bes2600_sdio.c:416` (IRQ handler dispatches `rx_work`); `bes2600_sdio.c:829` (`sdio_rx_work` body); `bh.c:1330–1538` (BH main loop, `BES2600_RX_IN_BH` path); `bes2600_sdio.c:1267` (`bes_sdio` workqueue, `max_active=2`).

**Current shape:** Every SDIO interrupt fires `queue_work(sdio_wq, &rx_work)`. `sdio_rx_work` reads up to `BES_SDIO_RX_MULTIPLE_NUM=16` frames (`hwio.h:294`) into per-frame SKBs, enqueues each onto `sbus_priv.rx_queue` under `rx_queue_lock`, then returns. Meanwhile the BH kthread (one work item queued at boot in `bh.c:93`, running an infinite loop inside `bes2600_bh()`) calls `pipe_read()` → `spin_lock(rx_queue_lock)` → `skb_dequeue()` → `wsm_handle_rx()` → `ieee80211_rx_irqsafe()` one frame at a time. When `pipe_read()` returns NULL and pending TX exists, `bes2600_sdio_pipe_read()` at `bes2600_sdio.c:941` re-dispatches `rx_work` — so a sustained RX stream fires **one `queue_work` per BH wakeup, not per burst**. That explains why `bh_wakeup` events are only 267/sec while `workqueue_execute_start` is 5,643/sec: the SDIO layer is firing a new `rx_work` item for every frame the BH loop drains.

**Proposed shape:** Collapse `sdio_rx_work` and `pipe_read()` into the BH loop directly. The BH already runs in a dedicated `WQ_HIGHPRI | WQ_CPU_INTENSIVE` workqueue (`bh.c:66`) and (with `BES2600_RX_IN_BH` defined per `Makefile:159`) `bes2600_bh_rx_helper()` already dequeues from `rx_queue`. Merge `sdio_rx_work` into a function called synchronously from `bes2600_bh_rx_helper()` before the dequeue, guarded by a trylock so re-entry is safe. This eliminates O(N) `queue_work` calls per burst while keeping the BH as the single SDIO-access context.

**Predicted delta vs Phase 1 metric:** Eliminates ~5 of the ~9 redundant workqueue dispatches per frame. 2–4× throughput improvement and a proportional drop in `_raw_spin_unlock_irqrestore` CPU cost.

**Effort:** Medium. SDIO host-lock protocol (`sdio_claim_host`/`sdio_release_host`) is already managed inside `sdio_rx_work`; moving the body is mechanical but requires care around the `sdio_wq` `max_active=2` concurrency assumption.

**Risks:** `sdio_rx_work` runs with `sdio_claim_host` held for the entire burst. Inside the BH it serialises all SDIO access fine. Watch `bes2600_sdio.c:1889` — flushes `rx_work` during teardown; that path must remain.

---

## Item 2 — `ieee80211_rx_irqsafe` instead of `ieee80211_rx` (pre-NAPI cw1200 ancestor pattern)

**File:line:** `txrx.c:1947`, `txrx.c:1950`, `ap.c:99`, `sta.c:1487`, `wsm.c:2416`.

**Current shape:** Every RX frame is delivered via `ieee80211_rx_irqsafe()`. This function enqueues the SKB onto a per-cpu `tasklet_rx` list and schedules a software IRQ. Under sustained load: one softirq wakeup per frame — 611 softirq wakeups/sec on top of the workqueue overhead.

**Proposed shape:** Switch to `ieee80211_rx_ni()` (process context, which `wsm_handle_rx` is already in) or, better, batch-deliver frames using `ieee80211_rx_list()` (introduced in kernel 5.12, available in 6.19). Accumulate frames from a single `sdio_rx_work` burst into a `list_head`, then call `ieee80211_rx_list()` once per burst.

**mac80211 contract:** `ieee80211_rx_list()` is safe from process context with the same `ieee80211_rx_status` rules as `ieee80211_rx_ni()`. Per `include/net/mac80211.h` — kerneldoc states it takes the RX path atomically only when called from softirq context; from process context it uses the same path as `ieee80211_rx_ni()`.

**Predicted delta:** Reduces per-frame softirq overhead. Hard to isolate independently of item 1, but combined the two deliver the < 10 % CPU-in-lock target.

**Effort:** Small (once item 1 is done — the batch list naturally exists at the burst boundary).

**Risks:** Must hold `rcu_read_lock()` at call site; `skb->cb` (`IEEE80211_SKB_RXCB`) must be filled before the call, as today. The `early_data` path at `txrx.c:1942` uses `skb_queue_tail` into a per-link queue before calling `ieee80211_rx_irqsafe` — that path must be excluded from batch collection.

---

## Item 3 — Per-frame `queue_work(sdio_wq, &tx_work)` in the TX send path

**File:line:** `bes2600_sdio.c:1236` (inside `bes2600_sdio_pipe_send()`).

**Current shape:** Every call to `bes2600_sdio_pipe_send()` appends one descriptor to `tx_bufferlist` and immediately calls `queue_work(sdio_wq, &tx_work)`. `sdio_tx_work` then drains the list with scatterlist batching (up to `BES_SDIO_TX_MULTIPLE_NUM=16` frames per SDIO transfer). At low rates the workqueue's pending-but-not-started dedup means only one dispatch fires; at high TX rates — especially after `atomic_add(1, &hw_priv->bh_tx)` in `bh.c` reschedules TX — successive `pipe_send` calls each hit `queue_work` before the previous fires, multiplying dispatches.

**Proposed shape:** Stage all frames into `tx_bufferlist` in the BH TX loop, then flush `sdio_tx_work` synchronously (call the work function body directly) before returning to the wait-event. The TX mirror of item 1.

**Predicted delta:** Removes redundant TX-side `queue_work` calls. Lower priority than RX side given current TX rate (13 `wsm_cmd_send`/sec is host→chip control plane; data-plane TX is also limited by firmware buffer count `numInpChBufs`), but it does remove one source of the 5,643/sec workqueue count.

**Effort:** Small.

**Risks:** `sdio_tx_work` calls `sdio_claim_host`/`sdio_release_host` internally. Running directly from BH context requires confirming no deadlock with the SDIO bus claim that `sdio_rx_work` (now merged per item 1) holds. The TX flush must happen after the RX burst, matching the existing BH loop structure (`rx:` → `tx:` ordering in `bh.c:1444`).

---

## Item 4 — `ba_lock` per-frame acquisition in the RX path

**File:line:** `txrx.c:998–1005` (`bes2600_rx_h_ba_stat()`); `txrx.c:1159–1164` and `txrx.c:1682–1698` (`tsm_lock`, `CONFIG_BES2600_TESTMODE`-gated).

**Current shape:** `bes2600_rx_cb()` calls `bes2600_rx_h_ba_stat()` for every non-multicast data frame. That function acquires `ba_lock` under BH (`spin_lock_bh`) to increment `ba_acc_rx` and `ba_cnt_rx`, then sets a timer. At 611 frames/sec that's 611 lock acquisitions/sec on `ba_lock` alone.

**Proposed shape:** Replace per-frame `ba_lock` with a per-cpu counter (or `atomic64_t`) for `ba_acc_rx` and `ba_cnt_rx`. The timer arm (`mod_timer`) is the actual reason for the lock — a `READ_ONCE`/`cmpxchg` on a flag to detect first-frame-in-interval is sufficient.

**Predicted delta:** Removes 611 lock acquisitions/sec from the RX hot path. Not the dominant cost but the next bottleneck after items 1–2 land.

**Effort:** Small.

**Risks:** `ba_lock` also serialises TX-side block-ack accounting (`txrx.c:1632`). The per-cpu approach requires a fold step in the timer callback — cheap.

---

## Item 5 — Skip `ps_state_lock` acquisitions when PSM is known-disabled

**File:line:** `bes2600.h:320` (decl); `txrx.c:1340–1365` (`vif_lock`); `txrx.c:1415–1426` (`ps_state_lock`); `txrx.c:1942–1948` (RX-side `ps_state_lock`).

**Current shape:** `ps_state_lock` is taken on every TX frame if powersave is active. Per memory `reference_bes2600_firmware_no_psm.md`, **PSM is non-functional on this firmware** — c7 already self-detects this and latches `pm_unsupported = true`. The `ps_state_lock` guards in the RX callback and TX path are therefore taking dead overhead.

**Proposed shape:** Add a `READ_ONCE()` check on `powersave_enabled` before taking `ps_state_lock`; if false, skip the lock and the PSM state update entirely. Since c7's `pm_unsupported` latches, this is safe.

**Predicted delta:** Small absolute gain at current TX rate, but prevents fast path from regressing as throughput improves.

**Effort:** Small.

**Risks:** `powersave_enabled` is written from process context (`bh.c:403`). `READ_ONCE` without lock is safe — at worst one spurious PSM notification, not a state corruption.

---

## Item 6 — Firmware block-read size cap (`EFFECTIVE_BUF_SIZE = 8190 bytes`)

**File:line:** `bh.c:33–36` (defines); `bes2600_sdio.c:721–783` (`bes2600_sdio_extract_packets()`); `hwio.h:294` (`BES_SDIO_RX_MULTIPLE_NUM=16`).

**Current shape:** `BES_SDIO_RX_MULTIPLE_NUM=16` and `BES_SDIO_OPTIMIZED_LEN` both defined (`Makefile:90,92`). The RX burst reads `PACKET_TOTAL_LEN(ctrl_reg)` bytes in a single CMD53; each sub-packet bounded by `EFFECTIVE_BUF_SIZE = (0x1000-4)*2 - 2 = 8190` bytes. At 611 frames/sec ÷ 267 BH wakeups/sec ≈ **2.3 frames per wakeup** — well under the 16-frame limit. **Not the bottleneck today.**

**Proposed shape:** No change needed now. Re-evaluate after items 1–3 land if throughput rises past ~3 MB/s. Verify ctrl_reg `PACKET_TOTAL_LEN` field values during high load — requires firmware-trace observation we don't currently have.

**Effort:** N/A.

**Risks:** Increasing beyond 16 requires a larger DMA allocation (currently `1632 × 16 = 26 KB`). Cortex-M4F firmware side is opaque.

---

## Item 7 — Duplicate workqueues (`hw_priv->workqueue` vs `hw_priv->bh_workqueue` vs `sbus_priv->sdio_wq`)

**File:line:** `bes2600.h:323` (`workqueue`); `bes2600.h:385` (`bh_workqueue`); `bes2600_sdio.c:63` (`sdio_wq`). `txrx.c` has 10 `queue_work(hw_priv->workqueue, ...)` calls for control-plane work.

**Current shape:** Three distinct workqueues. The 5,643 `workqueue_execute_start`/sec are dominated by `sdio_wq` items, not `workqueue`. `workqueue` items are control-plane events at rates well below the data-plane.

**Proposed shape:** After item 1 (merging `sdio_rx_work` into BH), `sdio_wq` only carries `tx_work`. After item 3 (synchronous TX flush from BH), `sdio_wq` is idle during normal data-plane and could be replaced with `system_highpri_wq`.

**Effort:** Small (follow-on to items 1 and 3).

**Risks:** None if items 1 and 3 land first.

---

## Item 8 — `BH_RX_CONT_LIMIT=3` cap on RX burst per BH wakeup

**File:line:** `bh.c:1380–1405` (timeout detection); `bh.c:1330` (`BH_RX_CONT_LIMIT=3`); `bh.c:1331` (`BH_TX_CONT_LIMIT=20`).

**Current shape:** BH loop limits RX burst to 3 consecutive iterations before breaking back to wait-event. At 611 frames/sec ÷ 267 wakeups/sec ≈ 2.3 frames per wakeup → not the bottleneck today. **After items 1–3 land**, per-burst frame rate will rise and `BH_RX_CONT_LIMIT=3` becomes the ceiling.

**Proposed shape:** Raise to 16 (matching `BES_SDIO_RX_MULTIPLE_NUM`) after items 1–3 are deployed and re-measured.

**Effort:** Trivial (constant change), but must wait for Phase 7 measurements post 1–3.

**Risks:** Too high a limit under firmware anomaly (corrupted ctrl_reg) can spin BH long enough to miss beacon ACK deadline. Bound to `BES_SDIO_RX_MULTIPLE_NUM` as safe ceiling.

---

## Ranking summary

| Rank | Item | Predicted gain | Effort |
|------|------|----------------|--------|
| 1 | Collapse `sdio_rx_work` relay into BH loop | ~5x workqueue dispatch reduction | Medium |
| 2 | Batch deliver via `ieee80211_rx_list()` | Removes per-frame softirq | Small |
| 3 | Synchronous TX flush from BH | Removes TX-side dispatch noise | Small |
| 4 | Replace `ba_lock` per-frame with atomic/per-cpu | Removes 611 lock/sec from RX hot path | Small |
| 5 | Skip `ps_state_lock` when PSM-known-disabled | Removes dead overhead | Small |
| 6 | Raise `BH_RX_CONT_LIMIT` after 1–3 land | Unlocks residual throughput | Trivial |
| 7 | Consolidate workqueues post-items 1&3 | Cleanup | Small |
| 8 | Firmware block-read size | Not bottleneck at current rates | N/A |

**Items 1 + 2 together are the structural answer to the measurement**: ~9 workqueue events per delivered frame collapse to ~1, and the per-frame softirq cost disappears. Items 3–5 clean up the next layer. The beacon-loss cascade at 9 minutes is almost certainly starvation of the BH wait-event under the per-frame workqueue storm — item 1 removes the mechanism that makes the cascade possible.

---

## Next campaign step

A Phase 4 plan locking item 1 (and possibly item 2) follows in a separate PR. The remaining items go on the campaign backlog as follow-on patches once the Phase 7 verification of item-1-or-1-plus-2 confirms the predicted delta.