Files
besser/notes/architect-review-bug5-2026-05-07.md
T
claude-noether 679083d1aa notes: Sonnet architect review for Bug #5 — ranked restructuring map
Sonnet (general-purpose subagent, model=sonnet) reviewed
~/src/besser/bes2600-dkms-mobian/bes2600/ given the Phase 0 measurement
context. Output: 8-item ranked restructuring map, file:line cited.

Headline:
- Item 1: collapse sdio_rx_work relay into BH loop (~5x workqueue
  dispatch reduction, medium effort)
- Item 2: batch deliver via ieee80211_rx_list (small effort, removes
  per-frame softirq)
- Items 1 + 2 together collapse "9 workqueue events per delivered
  frame" to ~1.

Items 3-5 clean up next-layer overhead (TX-side queue_work,
per-frame ba_lock, ps_state_lock under known-dead PSM). Items 6-8
are follow-ons to be re-measured after 1-3 land.

Phase 4 plan locking the lead candidate(s) follows in a separate PR.
2026-05-07 17:38:16 +02:00

181 lines
14 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# BES2600 architecture review — Bug #5
Date assembled: 2026-05-07
Reviewer: Claude Sonnet (general-purpose subagent, model=sonnet)
Driver source: `~/src/besser/bes2600-dkms-mobian/bes2600/` on boltzmann
This is the architect-review pass requested in `notes/observed-bugs.md` after the Phase 0 measurement showed the throughput floor is set by per-SDIO-transaction workqueue dispatch overhead. The reviewer was given the measurement summary, source location, and a focused brief; output is a ranked restructuring map with file:line citations for every concrete claim.
---
## Measurement context (input to the reviewer)
```
Reproduction: pv -L 4M < /dev/zero | nc ohm 12345
Module under test: bes2600.ko srcversion 1B3B3ED0... (cleanups + Patch A + Patch B)
Hardware: PineTab2, RK3566 Cortex-A55 ARMv8.5, kernel 6.19.10-danctnix1
Link rate: 65 Mb/s ≈ 8 MB/s theoretical
Observed throughput: 725 KB/s (Phase 0 anchor at N=3)
rep 3 cascaded into beacon-loss disconnect at ~9 min in
Per-second event rates (3-min capture under 4 MB/s pv-cap):
workqueue_execute_start: 5,643/sec ← architectural floor
bes2600_rx_cb: 611/sec
bes2600_bh_wakeup: 267/sec
wsm_cmd_send: 13/sec (host-to-chip command rate, surprisingly low)
lock contention_begin: 50/sec (modest)
mmc_request_start: ~5,800/sec (matches workqueue rate — every SDIO transaction is its own work item)
perf record top symbol: _raw_spin_unlock_irqrestore (~20 % CPU samples)
Dominant callstack: process_one_work → wsm_configuration → wsm_cmd_send → bes2600_bh.isra.0
```
The implication: ~9 workqueue dispatches fire per frame delivered to mac80211. Items below address that ratio in descending order of predicted leverage.
---
## Item 1 — Two-hop workqueue relay: SDIO IRQ → `sdio_rx_work` → BH loop → mac80211
**File:line:** `bes2600_sdio.c:416` (IRQ handler dispatches `rx_work`); `bes2600_sdio.c:829` (`sdio_rx_work` body); `bh.c:13301538` (BH main loop, `BES2600_RX_IN_BH` path); `bes2600_sdio.c:1267` (`bes_sdio` workqueue, `max_active=2`).
**Current shape:** Every SDIO interrupt fires `queue_work(sdio_wq, &rx_work)`. `sdio_rx_work` reads up to `BES_SDIO_RX_MULTIPLE_NUM=16` frames (`hwio.h:294`) into per-frame SKBs, enqueues each onto `sbus_priv.rx_queue` under `rx_queue_lock`, then returns. Meanwhile the BH kthread (one work item queued at boot in `bh.c:93`, running an infinite loop inside `bes2600_bh()`) calls `pipe_read()``spin_lock(rx_queue_lock)``skb_dequeue()``wsm_handle_rx()``ieee80211_rx_irqsafe()` one frame at a time. When `pipe_read()` returns NULL and pending TX exists, `bes2600_sdio_pipe_read()` at `bes2600_sdio.c:941` re-dispatches `rx_work` — so a sustained RX stream fires **one `queue_work` per BH wakeup, not per burst**. That explains why `bh_wakeup` events are only 267/sec while `workqueue_execute_start` is 5,643/sec: the SDIO layer is firing a new `rx_work` item for every frame the BH loop drains.
**Proposed shape:** Collapse `sdio_rx_work` and `pipe_read()` into the BH loop directly. The BH already runs in a dedicated `WQ_HIGHPRI | WQ_CPU_INTENSIVE` workqueue (`bh.c:66`) and (with `BES2600_RX_IN_BH` defined per `Makefile:159`) `bes2600_bh_rx_helper()` already dequeues from `rx_queue`. Merge `sdio_rx_work` into a function called synchronously from `bes2600_bh_rx_helper()` before the dequeue, guarded by a trylock so re-entry is safe. This eliminates O(N) `queue_work` calls per burst while keeping the BH as the single SDIO-access context.
**Predicted delta vs Phase 1 metric:** Eliminates ~5 of the ~9 redundant workqueue dispatches per frame. 24× throughput improvement and a proportional drop in `_raw_spin_unlock_irqrestore` CPU cost.
**Effort:** Medium. SDIO host-lock protocol (`sdio_claim_host`/`sdio_release_host`) is already managed inside `sdio_rx_work`; moving the body is mechanical but requires care around the `sdio_wq` `max_active=2` concurrency assumption.
**Risks:** `sdio_rx_work` runs with `sdio_claim_host` held for the entire burst. Inside the BH it serialises all SDIO access fine. Watch `bes2600_sdio.c:1889` — flushes `rx_work` during teardown; that path must remain.
---
## Item 2 — `ieee80211_rx_irqsafe` instead of `ieee80211_rx` (pre-NAPI cw1200 ancestor pattern)
**File:line:** `txrx.c:1947`, `txrx.c:1950`, `ap.c:99`, `sta.c:1487`, `wsm.c:2416`.
**Current shape:** Every RX frame is delivered via `ieee80211_rx_irqsafe()`. This function enqueues the SKB onto a per-cpu `tasklet_rx` list and schedules a software IRQ. Under sustained load: one softirq wakeup per frame — 611 softirq wakeups/sec on top of the workqueue overhead.
**Proposed shape:** Switch to `ieee80211_rx_ni()` (process context, which `wsm_handle_rx` is already in) or, better, batch-deliver frames using `ieee80211_rx_list()` (introduced in kernel 5.12, available in 6.19). Accumulate frames from a single `sdio_rx_work` burst into a `list_head`, then call `ieee80211_rx_list()` once per burst.
**mac80211 contract:** `ieee80211_rx_list()` is safe from process context with the same `ieee80211_rx_status` rules as `ieee80211_rx_ni()`. Per `include/net/mac80211.h` — kerneldoc states it takes the RX path atomically only when called from softirq context; from process context it uses the same path as `ieee80211_rx_ni()`.
**Predicted delta:** Reduces per-frame softirq overhead. Hard to isolate independently of item 1, but combined the two deliver the < 10 % CPU-in-lock target.
**Effort:** Small (once item 1 is done — the batch list naturally exists at the burst boundary).
**Risks:** Must hold `rcu_read_lock()` at call site; `skb->cb` (`IEEE80211_SKB_RXCB`) must be filled before the call, as today. The `early_data` path at `txrx.c:1942` uses `skb_queue_tail` into a per-link queue before calling `ieee80211_rx_irqsafe` — that path must be excluded from batch collection.
---
## Item 3 — Per-frame `queue_work(sdio_wq, &tx_work)` in the TX send path
**File:line:** `bes2600_sdio.c:1236` (inside `bes2600_sdio_pipe_send()`).
**Current shape:** Every call to `bes2600_sdio_pipe_send()` appends one descriptor to `tx_bufferlist` and immediately calls `queue_work(sdio_wq, &tx_work)`. `sdio_tx_work` then drains the list with scatterlist batching (up to `BES_SDIO_TX_MULTIPLE_NUM=16` frames per SDIO transfer). At low rates the workqueue's pending-but-not-started dedup means only one dispatch fires; at high TX rates — especially after `atomic_add(1, &hw_priv->bh_tx)` in `bh.c` reschedules TX — successive `pipe_send` calls each hit `queue_work` before the previous fires, multiplying dispatches.
**Proposed shape:** Stage all frames into `tx_bufferlist` in the BH TX loop, then flush `sdio_tx_work` synchronously (call the work function body directly) before returning to the wait-event. The TX mirror of item 1.
**Predicted delta:** Removes redundant TX-side `queue_work` calls. Lower priority than RX side given current TX rate (13 `wsm_cmd_send`/sec is host→chip control plane; data-plane TX is also limited by firmware buffer count `numInpChBufs`), but it does remove one source of the 5,643/sec workqueue count.
**Effort:** Small.
**Risks:** `sdio_tx_work` calls `sdio_claim_host`/`sdio_release_host` internally. Running directly from BH context requires confirming no deadlock with the SDIO bus claim that `sdio_rx_work` (now merged per item 1) holds. The TX flush must happen after the RX burst, matching the existing BH loop structure (`rx:``tx:` ordering in `bh.c:1444`).
---
## Item 4 — `ba_lock` per-frame acquisition in the RX path
**File:line:** `txrx.c:9981005` (`bes2600_rx_h_ba_stat()`); `txrx.c:11591164` and `txrx.c:16821698` (`tsm_lock`, `CONFIG_BES2600_TESTMODE`-gated).
**Current shape:** `bes2600_rx_cb()` calls `bes2600_rx_h_ba_stat()` for every non-multicast data frame. That function acquires `ba_lock` under BH (`spin_lock_bh`) to increment `ba_acc_rx` and `ba_cnt_rx`, then sets a timer. At 611 frames/sec that's 611 lock acquisitions/sec on `ba_lock` alone.
**Proposed shape:** Replace per-frame `ba_lock` with a per-cpu counter (or `atomic64_t`) for `ba_acc_rx` and `ba_cnt_rx`. The timer arm (`mod_timer`) is the actual reason for the lock — a `READ_ONCE`/`cmpxchg` on a flag to detect first-frame-in-interval is sufficient.
**Predicted delta:** Removes 611 lock acquisitions/sec from the RX hot path. Not the dominant cost but the next bottleneck after items 12 land.
**Effort:** Small.
**Risks:** `ba_lock` also serialises TX-side block-ack accounting (`txrx.c:1632`). The per-cpu approach requires a fold step in the timer callback — cheap.
---
## Item 5 — Skip `ps_state_lock` acquisitions when PSM is known-disabled
**File:line:** `bes2600.h:320` (decl); `txrx.c:13401365` (`vif_lock`); `txrx.c:14151426` (`ps_state_lock`); `txrx.c:19421948` (RX-side `ps_state_lock`).
**Current shape:** `ps_state_lock` is taken on every TX frame if powersave is active. Per memory `reference_bes2600_firmware_no_psm.md`, **PSM is non-functional on this firmware** — c7 already self-detects this and latches `pm_unsupported = true`. The `ps_state_lock` guards in the RX callback and TX path are therefore taking dead overhead.
**Proposed shape:** Add a `READ_ONCE()` check on `powersave_enabled` before taking `ps_state_lock`; if false, skip the lock and the PSM state update entirely. Since c7's `pm_unsupported` latches, this is safe.
**Predicted delta:** Small absolute gain at current TX rate, but prevents fast path from regressing as throughput improves.
**Effort:** Small.
**Risks:** `powersave_enabled` is written from process context (`bh.c:403`). `READ_ONCE` without lock is safe — at worst one spurious PSM notification, not a state corruption.
---
## Item 6 — Firmware block-read size cap (`EFFECTIVE_BUF_SIZE = 8190 bytes`)
**File:line:** `bh.c:3336` (defines); `bes2600_sdio.c:721783` (`bes2600_sdio_extract_packets()`); `hwio.h:294` (`BES_SDIO_RX_MULTIPLE_NUM=16`).
**Current shape:** `BES_SDIO_RX_MULTIPLE_NUM=16` and `BES_SDIO_OPTIMIZED_LEN` both defined (`Makefile:90,92`). The RX burst reads `PACKET_TOTAL_LEN(ctrl_reg)` bytes in a single CMD53; each sub-packet bounded by `EFFECTIVE_BUF_SIZE = (0x1000-4)*2 - 2 = 8190` bytes. At 611 frames/sec ÷ 267 BH wakeups/sec ≈ **2.3 frames per wakeup** — well under the 16-frame limit. **Not the bottleneck today.**
**Proposed shape:** No change needed now. Re-evaluate after items 13 land if throughput rises past ~3 MB/s. Verify ctrl_reg `PACKET_TOTAL_LEN` field values during high load — requires firmware-trace observation we don't currently have.
**Effort:** N/A.
**Risks:** Increasing beyond 16 requires a larger DMA allocation (currently `1632 × 16 = 26 KB`). Cortex-M4F firmware side is opaque.
---
## Item 7 — Duplicate workqueues (`hw_priv->workqueue` vs `hw_priv->bh_workqueue` vs `sbus_priv->sdio_wq`)
**File:line:** `bes2600.h:323` (`workqueue`); `bes2600.h:385` (`bh_workqueue`); `bes2600_sdio.c:63` (`sdio_wq`). `txrx.c` has 10 `queue_work(hw_priv->workqueue, ...)` calls for control-plane work.
**Current shape:** Three distinct workqueues. The 5,643 `workqueue_execute_start`/sec are dominated by `sdio_wq` items, not `workqueue`. `workqueue` items are control-plane events at rates well below the data-plane.
**Proposed shape:** After item 1 (merging `sdio_rx_work` into BH), `sdio_wq` only carries `tx_work`. After item 3 (synchronous TX flush from BH), `sdio_wq` is idle during normal data-plane and could be replaced with `system_highpri_wq`.
**Effort:** Small (follow-on to items 1 and 3).
**Risks:** None if items 1 and 3 land first.
---
## Item 8 — `BH_RX_CONT_LIMIT=3` cap on RX burst per BH wakeup
**File:line:** `bh.c:13801405` (timeout detection); `bh.c:1330` (`BH_RX_CONT_LIMIT=3`); `bh.c:1331` (`BH_TX_CONT_LIMIT=20`).
**Current shape:** BH loop limits RX burst to 3 consecutive iterations before breaking back to wait-event. At 611 frames/sec ÷ 267 wakeups/sec ≈ 2.3 frames per wakeup → not the bottleneck today. **After items 13 land**, per-burst frame rate will rise and `BH_RX_CONT_LIMIT=3` becomes the ceiling.
**Proposed shape:** Raise to 16 (matching `BES_SDIO_RX_MULTIPLE_NUM`) after items 13 are deployed and re-measured.
**Effort:** Trivial (constant change), but must wait for Phase 7 measurements post 13.
**Risks:** Too high a limit under firmware anomaly (corrupted ctrl_reg) can spin BH long enough to miss beacon ACK deadline. Bound to `BES_SDIO_RX_MULTIPLE_NUM` as safe ceiling.
---
## Ranking summary
| Rank | Item | Predicted gain | Effort |
|------|------|----------------|--------|
| 1 | Collapse `sdio_rx_work` relay into BH loop | ~5x workqueue dispatch reduction | Medium |
| 2 | Batch deliver via `ieee80211_rx_list()` | Removes per-frame softirq | Small |
| 3 | Synchronous TX flush from BH | Removes TX-side dispatch noise | Small |
| 4 | Replace `ba_lock` per-frame with atomic/per-cpu | Removes 611 lock/sec from RX hot path | Small |
| 5 | Skip `ps_state_lock` when PSM-known-disabled | Removes dead overhead | Small |
| 6 | Raise `BH_RX_CONT_LIMIT` after 13 land | Unlocks residual throughput | Trivial |
| 7 | Consolidate workqueues post-items 1&3 | Cleanup | Small |
| 8 | Firmware block-read size | Not bottleneck at current rates | N/A |
**Items 1 + 2 together are the structural answer to the measurement**: ~9 workqueue events per delivered frame collapse to ~1, and the per-frame softirq cost disappears. Items 35 clean up the next layer. The beacon-loss cascade at 9 minutes is almost certainly starvation of the BH wait-event under the per-frame workqueue storm — item 1 removes the mechanism that makes the cascade possible.
---
## Next campaign step
A Phase 4 plan locking item 1 (and possibly item 2) follows in a separate PR. The remaining items go on the campaign backlog as follow-on patches once the Phase 7 verification of item-1-or-1-plus-2 confirms the predicted delta.