Sonnet (general-purpose subagent, model=sonnet) reviewed ~/src/besser/bes2600-dkms-mobian/bes2600/ given the Phase 0 measurement context. Output: 8-item ranked restructuring map, file:line cited. Headline: - Item 1: collapse sdio_rx_work relay into BH loop (~5x workqueue dispatch reduction, medium effort) - Item 2: batch deliver via ieee80211_rx_list (small effort, removes per-frame softirq) - Items 1 + 2 together collapse "9 workqueue events per delivered frame" to ~1. Items 3-5 clean up next-layer overhead (TX-side queue_work, per-frame ba_lock, ps_state_lock under known-dead PSM). Items 6-8 are follow-ons to be re-measured after 1-3 land. Phase 4 plan locking the lead candidate(s) follows in a separate PR.
14 KiB
BES2600 architecture review — Bug #5
Date assembled: 2026-05-07
Reviewer: Claude Sonnet (general-purpose subagent, model=sonnet)
Driver source: ~/src/besser/bes2600-dkms-mobian/bes2600/ on boltzmann
This is the architect-review pass requested in notes/observed-bugs.md after the Phase 0 measurement showed the throughput floor is set by per-SDIO-transaction workqueue dispatch overhead. The reviewer was given the measurement summary, source location, and a focused brief; output is a ranked restructuring map with file:line citations for every concrete claim.
Measurement context (input to the reviewer)
Reproduction: pv -L 4M < /dev/zero | nc ohm 12345
Module under test: bes2600.ko srcversion 1B3B3ED0... (cleanups + Patch A + Patch B)
Hardware: PineTab2, RK3566 Cortex-A55 ARMv8.5, kernel 6.19.10-danctnix1
Link rate: 65 Mb/s ≈ 8 MB/s theoretical
Observed throughput: 725 KB/s (Phase 0 anchor at N=3)
rep 3 cascaded into beacon-loss disconnect at ~9 min in
Per-second event rates (3-min capture under 4 MB/s pv-cap):
workqueue_execute_start: 5,643/sec ← architectural floor
bes2600_rx_cb: 611/sec
bes2600_bh_wakeup: 267/sec
wsm_cmd_send: 13/sec (host-to-chip command rate, surprisingly low)
lock contention_begin: 50/sec (modest)
mmc_request_start: ~5,800/sec (matches workqueue rate — every SDIO transaction is its own work item)
perf record top symbol: _raw_spin_unlock_irqrestore (~20 % CPU samples)
Dominant callstack: process_one_work → wsm_configuration → wsm_cmd_send → bes2600_bh.isra.0
The implication: ~9 workqueue dispatches fire per frame delivered to mac80211. Items below address that ratio in descending order of predicted leverage.
Item 1 — Two-hop workqueue relay: SDIO IRQ → sdio_rx_work → BH loop → mac80211
File:line: bes2600_sdio.c:416 (IRQ handler dispatches rx_work); bes2600_sdio.c:829 (sdio_rx_work body); bh.c:1330–1538 (BH main loop, BES2600_RX_IN_BH path); bes2600_sdio.c:1267 (bes_sdio workqueue, max_active=2).
Current shape: Every SDIO interrupt fires queue_work(sdio_wq, &rx_work). sdio_rx_work reads up to BES_SDIO_RX_MULTIPLE_NUM=16 frames (hwio.h:294) into per-frame SKBs, enqueues each onto sbus_priv.rx_queue under rx_queue_lock, then returns. Meanwhile the BH kthread (one work item queued at boot in bh.c:93, running an infinite loop inside bes2600_bh()) calls pipe_read() → spin_lock(rx_queue_lock) → skb_dequeue() → wsm_handle_rx() → ieee80211_rx_irqsafe() one frame at a time. When pipe_read() returns NULL and pending TX exists, bes2600_sdio_pipe_read() at bes2600_sdio.c:941 re-dispatches rx_work — so a sustained RX stream fires one queue_work per BH wakeup, not per burst. That explains why bh_wakeup events are only 267/sec while workqueue_execute_start is 5,643/sec: the SDIO layer is firing a new rx_work item for every frame the BH loop drains.
Proposed shape: Collapse sdio_rx_work and pipe_read() into the BH loop directly. The BH already runs in a dedicated WQ_HIGHPRI | WQ_CPU_INTENSIVE workqueue (bh.c:66) and (with BES2600_RX_IN_BH defined per Makefile:159) bes2600_bh_rx_helper() already dequeues from rx_queue. Merge sdio_rx_work into a function called synchronously from bes2600_bh_rx_helper() before the dequeue, guarded by a trylock so re-entry is safe. This eliminates O(N) queue_work calls per burst while keeping the BH as the single SDIO-access context.
Predicted delta vs Phase 1 metric: Eliminates ~5 of the ~9 redundant workqueue dispatches per frame. 2–4× throughput improvement and a proportional drop in _raw_spin_unlock_irqrestore CPU cost.
Effort: Medium. SDIO host-lock protocol (sdio_claim_host/sdio_release_host) is already managed inside sdio_rx_work; moving the body is mechanical but requires care around the sdio_wq max_active=2 concurrency assumption.
Risks: sdio_rx_work runs with sdio_claim_host held for the entire burst. Inside the BH it serialises all SDIO access fine. Watch bes2600_sdio.c:1889 — flushes rx_work during teardown; that path must remain.
Item 2 — ieee80211_rx_irqsafe instead of ieee80211_rx (pre-NAPI cw1200 ancestor pattern)
File:line: txrx.c:1947, txrx.c:1950, ap.c:99, sta.c:1487, wsm.c:2416.
Current shape: Every RX frame is delivered via ieee80211_rx_irqsafe(). This function enqueues the SKB onto a per-cpu tasklet_rx list and schedules a software IRQ. Under sustained load: one softirq wakeup per frame — 611 softirq wakeups/sec on top of the workqueue overhead.
Proposed shape: Switch to ieee80211_rx_ni() (process context, which wsm_handle_rx is already in) or, better, batch-deliver frames using ieee80211_rx_list() (introduced in kernel 5.12, available in 6.19). Accumulate frames from a single sdio_rx_work burst into a list_head, then call ieee80211_rx_list() once per burst.
mac80211 contract: ieee80211_rx_list() is safe from process context with the same ieee80211_rx_status rules as ieee80211_rx_ni(). Per include/net/mac80211.h — kerneldoc states it takes the RX path atomically only when called from softirq context; from process context it uses the same path as ieee80211_rx_ni().
Predicted delta: Reduces per-frame softirq overhead. Hard to isolate independently of item 1, but combined the two deliver the < 10 % CPU-in-lock target.
Effort: Small (once item 1 is done — the batch list naturally exists at the burst boundary).
Risks: Must hold rcu_read_lock() at call site; skb->cb (IEEE80211_SKB_RXCB) must be filled before the call, as today. The early_data path at txrx.c:1942 uses skb_queue_tail into a per-link queue before calling ieee80211_rx_irqsafe — that path must be excluded from batch collection.
Item 3 — Per-frame queue_work(sdio_wq, &tx_work) in the TX send path
File:line: bes2600_sdio.c:1236 (inside bes2600_sdio_pipe_send()).
Current shape: Every call to bes2600_sdio_pipe_send() appends one descriptor to tx_bufferlist and immediately calls queue_work(sdio_wq, &tx_work). sdio_tx_work then drains the list with scatterlist batching (up to BES_SDIO_TX_MULTIPLE_NUM=16 frames per SDIO transfer). At low rates the workqueue's pending-but-not-started dedup means only one dispatch fires; at high TX rates — especially after atomic_add(1, &hw_priv->bh_tx) in bh.c reschedules TX — successive pipe_send calls each hit queue_work before the previous fires, multiplying dispatches.
Proposed shape: Stage all frames into tx_bufferlist in the BH TX loop, then flush sdio_tx_work synchronously (call the work function body directly) before returning to the wait-event. The TX mirror of item 1.
Predicted delta: Removes redundant TX-side queue_work calls. Lower priority than RX side given current TX rate (13 wsm_cmd_send/sec is host→chip control plane; data-plane TX is also limited by firmware buffer count numInpChBufs), but it does remove one source of the 5,643/sec workqueue count.
Effort: Small.
Risks: sdio_tx_work calls sdio_claim_host/sdio_release_host internally. Running directly from BH context requires confirming no deadlock with the SDIO bus claim that sdio_rx_work (now merged per item 1) holds. The TX flush must happen after the RX burst, matching the existing BH loop structure (rx: → tx: ordering in bh.c:1444).
Item 4 — ba_lock per-frame acquisition in the RX path
File:line: txrx.c:998–1005 (bes2600_rx_h_ba_stat()); txrx.c:1159–1164 and txrx.c:1682–1698 (tsm_lock, CONFIG_BES2600_TESTMODE-gated).
Current shape: bes2600_rx_cb() calls bes2600_rx_h_ba_stat() for every non-multicast data frame. That function acquires ba_lock under BH (spin_lock_bh) to increment ba_acc_rx and ba_cnt_rx, then sets a timer. At 611 frames/sec that's 611 lock acquisitions/sec on ba_lock alone.
Proposed shape: Replace per-frame ba_lock with a per-cpu counter (or atomic64_t) for ba_acc_rx and ba_cnt_rx. The timer arm (mod_timer) is the actual reason for the lock — a READ_ONCE/cmpxchg on a flag to detect first-frame-in-interval is sufficient.
Predicted delta: Removes 611 lock acquisitions/sec from the RX hot path. Not the dominant cost but the next bottleneck after items 1–2 land.
Effort: Small.
Risks: ba_lock also serialises TX-side block-ack accounting (txrx.c:1632). The per-cpu approach requires a fold step in the timer callback — cheap.
Item 5 — Skip ps_state_lock acquisitions when PSM is known-disabled
File:line: bes2600.h:320 (decl); txrx.c:1340–1365 (vif_lock); txrx.c:1415–1426 (ps_state_lock); txrx.c:1942–1948 (RX-side ps_state_lock).
Current shape: ps_state_lock is taken on every TX frame if powersave is active. Per memory reference_bes2600_firmware_no_psm.md, PSM is non-functional on this firmware — c7 already self-detects this and latches pm_unsupported = true. The ps_state_lock guards in the RX callback and TX path are therefore taking dead overhead.
Proposed shape: Add a READ_ONCE() check on powersave_enabled before taking ps_state_lock; if false, skip the lock and the PSM state update entirely. Since c7's pm_unsupported latches, this is safe.
Predicted delta: Small absolute gain at current TX rate, but prevents fast path from regressing as throughput improves.
Effort: Small.
Risks: powersave_enabled is written from process context (bh.c:403). READ_ONCE without lock is safe — at worst one spurious PSM notification, not a state corruption.
Item 6 — Firmware block-read size cap (EFFECTIVE_BUF_SIZE = 8190 bytes)
File:line: bh.c:33–36 (defines); bes2600_sdio.c:721–783 (bes2600_sdio_extract_packets()); hwio.h:294 (BES_SDIO_RX_MULTIPLE_NUM=16).
Current shape: BES_SDIO_RX_MULTIPLE_NUM=16 and BES_SDIO_OPTIMIZED_LEN both defined (Makefile:90,92). The RX burst reads PACKET_TOTAL_LEN(ctrl_reg) bytes in a single CMD53; each sub-packet bounded by EFFECTIVE_BUF_SIZE = (0x1000-4)*2 - 2 = 8190 bytes. At 611 frames/sec ÷ 267 BH wakeups/sec ≈ 2.3 frames per wakeup — well under the 16-frame limit. Not the bottleneck today.
Proposed shape: No change needed now. Re-evaluate after items 1–3 land if throughput rises past ~3 MB/s. Verify ctrl_reg PACKET_TOTAL_LEN field values during high load — requires firmware-trace observation we don't currently have.
Effort: N/A.
Risks: Increasing beyond 16 requires a larger DMA allocation (currently 1632 × 16 = 26 KB). Cortex-M4F firmware side is opaque.
Item 7 — Duplicate workqueues (hw_priv->workqueue vs hw_priv->bh_workqueue vs sbus_priv->sdio_wq)
File:line: bes2600.h:323 (workqueue); bes2600.h:385 (bh_workqueue); bes2600_sdio.c:63 (sdio_wq). txrx.c has 10 queue_work(hw_priv->workqueue, ...) calls for control-plane work.
Current shape: Three distinct workqueues. The 5,643 workqueue_execute_start/sec are dominated by sdio_wq items, not workqueue. workqueue items are control-plane events at rates well below the data-plane.
Proposed shape: After item 1 (merging sdio_rx_work into BH), sdio_wq only carries tx_work. After item 3 (synchronous TX flush from BH), sdio_wq is idle during normal data-plane and could be replaced with system_highpri_wq.
Effort: Small (follow-on to items 1 and 3).
Risks: None if items 1 and 3 land first.
Item 8 — BH_RX_CONT_LIMIT=3 cap on RX burst per BH wakeup
File:line: bh.c:1380–1405 (timeout detection); bh.c:1330 (BH_RX_CONT_LIMIT=3); bh.c:1331 (BH_TX_CONT_LIMIT=20).
Current shape: BH loop limits RX burst to 3 consecutive iterations before breaking back to wait-event. At 611 frames/sec ÷ 267 wakeups/sec ≈ 2.3 frames per wakeup → not the bottleneck today. After items 1–3 land, per-burst frame rate will rise and BH_RX_CONT_LIMIT=3 becomes the ceiling.
Proposed shape: Raise to 16 (matching BES_SDIO_RX_MULTIPLE_NUM) after items 1–3 are deployed and re-measured.
Effort: Trivial (constant change), but must wait for Phase 7 measurements post 1–3.
Risks: Too high a limit under firmware anomaly (corrupted ctrl_reg) can spin BH long enough to miss beacon ACK deadline. Bound to BES_SDIO_RX_MULTIPLE_NUM as safe ceiling.
Ranking summary
| Rank | Item | Predicted gain | Effort |
|---|---|---|---|
| 1 | Collapse sdio_rx_work relay into BH loop |
~5x workqueue dispatch reduction | Medium |
| 2 | Batch deliver via ieee80211_rx_list() |
Removes per-frame softirq | Small |
| 3 | Synchronous TX flush from BH | Removes TX-side dispatch noise | Small |
| 4 | Replace ba_lock per-frame with atomic/per-cpu |
Removes 611 lock/sec from RX hot path | Small |
| 5 | Skip ps_state_lock when PSM-known-disabled |
Removes dead overhead | Small |
| 6 | Raise BH_RX_CONT_LIMIT after 1–3 land |
Unlocks residual throughput | Trivial |
| 7 | Consolidate workqueues post-items 1&3 | Cleanup | Small |
| 8 | Firmware block-read size | Not bottleneck at current rates | N/A |
Items 1 + 2 together are the structural answer to the measurement: ~9 workqueue events per delivered frame collapse to ~1, and the per-frame softirq cost disappears. Items 3–5 clean up the next layer. The beacon-loss cascade at 9 minutes is almost certainly starvation of the BH wait-event under the per-frame workqueue storm — item 1 removes the mechanism that makes the cascade possible.
Next campaign step
A Phase 4 plan locking item 1 (and possibly item 2) follows in a separate PR. The remaining items go on the campaign backlog as follow-on patches once the Phase 7 verification of item-1-or-1-plus-2 confirms the predicted delta.