Files
besser/notes/architect-review-bug5-2026-05-07.md
T
claude-noether 679083d1aa notes: Sonnet architect review for Bug #5 — ranked restructuring map
Sonnet (general-purpose subagent, model=sonnet) reviewed
~/src/besser/bes2600-dkms-mobian/bes2600/ given the Phase 0 measurement
context. Output: 8-item ranked restructuring map, file:line cited.

Headline:
- Item 1: collapse sdio_rx_work relay into BH loop (~5x workqueue
  dispatch reduction, medium effort)
- Item 2: batch deliver via ieee80211_rx_list (small effort, removes
  per-frame softirq)
- Items 1 + 2 together collapse "9 workqueue events per delivered
  frame" to ~1.

Items 3-5 clean up next-layer overhead (TX-side queue_work,
per-frame ba_lock, ps_state_lock under known-dead PSM). Items 6-8
are follow-ons to be re-measured after 1-3 land.

Phase 4 plan locking the lead candidate(s) follows in a separate PR.
2026-05-07 17:38:16 +02:00

14 KiB
Raw Blame History

BES2600 architecture review — Bug #5

Date assembled: 2026-05-07 Reviewer: Claude Sonnet (general-purpose subagent, model=sonnet) Driver source: ~/src/besser/bes2600-dkms-mobian/bes2600/ on boltzmann

This is the architect-review pass requested in notes/observed-bugs.md after the Phase 0 measurement showed the throughput floor is set by per-SDIO-transaction workqueue dispatch overhead. The reviewer was given the measurement summary, source location, and a focused brief; output is a ranked restructuring map with file:line citations for every concrete claim.


Measurement context (input to the reviewer)

Reproduction:                pv -L 4M < /dev/zero | nc ohm 12345
Module under test:           bes2600.ko srcversion 1B3B3ED0...  (cleanups + Patch A + Patch B)
Hardware:                    PineTab2, RK3566 Cortex-A55 ARMv8.5, kernel 6.19.10-danctnix1
Link rate:                   65 Mb/s ≈ 8 MB/s theoretical
Observed throughput:         725 KB/s  (Phase 0 anchor at N=3)
                             rep 3 cascaded into beacon-loss disconnect at ~9 min in

Per-second event rates (3-min capture under 4 MB/s pv-cap):
  workqueue_execute_start:  5,643/sec    ← architectural floor
  bes2600_rx_cb:              611/sec
  bes2600_bh_wakeup:          267/sec
  wsm_cmd_send:                13/sec    (host-to-chip command rate, surprisingly low)
  lock contention_begin:       50/sec    (modest)
  mmc_request_start:        ~5,800/sec   (matches workqueue rate — every SDIO transaction is its own work item)

perf record top symbol:      _raw_spin_unlock_irqrestore  (~20 % CPU samples)
Dominant callstack:          process_one_work → wsm_configuration → wsm_cmd_send → bes2600_bh.isra.0

The implication: ~9 workqueue dispatches fire per frame delivered to mac80211. Items below address that ratio in descending order of predicted leverage.


Item 1 — Two-hop workqueue relay: SDIO IRQ → sdio_rx_work → BH loop → mac80211

File:line: bes2600_sdio.c:416 (IRQ handler dispatches rx_work); bes2600_sdio.c:829 (sdio_rx_work body); bh.c:13301538 (BH main loop, BES2600_RX_IN_BH path); bes2600_sdio.c:1267 (bes_sdio workqueue, max_active=2).

Current shape: Every SDIO interrupt fires queue_work(sdio_wq, &rx_work). sdio_rx_work reads up to BES_SDIO_RX_MULTIPLE_NUM=16 frames (hwio.h:294) into per-frame SKBs, enqueues each onto sbus_priv.rx_queue under rx_queue_lock, then returns. Meanwhile the BH kthread (one work item queued at boot in bh.c:93, running an infinite loop inside bes2600_bh()) calls pipe_read()spin_lock(rx_queue_lock)skb_dequeue()wsm_handle_rx()ieee80211_rx_irqsafe() one frame at a time. When pipe_read() returns NULL and pending TX exists, bes2600_sdio_pipe_read() at bes2600_sdio.c:941 re-dispatches rx_work — so a sustained RX stream fires one queue_work per BH wakeup, not per burst. That explains why bh_wakeup events are only 267/sec while workqueue_execute_start is 5,643/sec: the SDIO layer is firing a new rx_work item for every frame the BH loop drains.

Proposed shape: Collapse sdio_rx_work and pipe_read() into the BH loop directly. The BH already runs in a dedicated WQ_HIGHPRI | WQ_CPU_INTENSIVE workqueue (bh.c:66) and (with BES2600_RX_IN_BH defined per Makefile:159) bes2600_bh_rx_helper() already dequeues from rx_queue. Merge sdio_rx_work into a function called synchronously from bes2600_bh_rx_helper() before the dequeue, guarded by a trylock so re-entry is safe. This eliminates O(N) queue_work calls per burst while keeping the BH as the single SDIO-access context.

Predicted delta vs Phase 1 metric: Eliminates ~5 of the ~9 redundant workqueue dispatches per frame. 24× throughput improvement and a proportional drop in _raw_spin_unlock_irqrestore CPU cost.

Effort: Medium. SDIO host-lock protocol (sdio_claim_host/sdio_release_host) is already managed inside sdio_rx_work; moving the body is mechanical but requires care around the sdio_wq max_active=2 concurrency assumption.

Risks: sdio_rx_work runs with sdio_claim_host held for the entire burst. Inside the BH it serialises all SDIO access fine. Watch bes2600_sdio.c:1889 — flushes rx_work during teardown; that path must remain.


Item 2 — ieee80211_rx_irqsafe instead of ieee80211_rx (pre-NAPI cw1200 ancestor pattern)

File:line: txrx.c:1947, txrx.c:1950, ap.c:99, sta.c:1487, wsm.c:2416.

Current shape: Every RX frame is delivered via ieee80211_rx_irqsafe(). This function enqueues the SKB onto a per-cpu tasklet_rx list and schedules a software IRQ. Under sustained load: one softirq wakeup per frame — 611 softirq wakeups/sec on top of the workqueue overhead.

Proposed shape: Switch to ieee80211_rx_ni() (process context, which wsm_handle_rx is already in) or, better, batch-deliver frames using ieee80211_rx_list() (introduced in kernel 5.12, available in 6.19). Accumulate frames from a single sdio_rx_work burst into a list_head, then call ieee80211_rx_list() once per burst.

mac80211 contract: ieee80211_rx_list() is safe from process context with the same ieee80211_rx_status rules as ieee80211_rx_ni(). Per include/net/mac80211.h — kerneldoc states it takes the RX path atomically only when called from softirq context; from process context it uses the same path as ieee80211_rx_ni().

Predicted delta: Reduces per-frame softirq overhead. Hard to isolate independently of item 1, but combined the two deliver the < 10 % CPU-in-lock target.

Effort: Small (once item 1 is done — the batch list naturally exists at the burst boundary).

Risks: Must hold rcu_read_lock() at call site; skb->cb (IEEE80211_SKB_RXCB) must be filled before the call, as today. The early_data path at txrx.c:1942 uses skb_queue_tail into a per-link queue before calling ieee80211_rx_irqsafe — that path must be excluded from batch collection.


Item 3 — Per-frame queue_work(sdio_wq, &tx_work) in the TX send path

File:line: bes2600_sdio.c:1236 (inside bes2600_sdio_pipe_send()).

Current shape: Every call to bes2600_sdio_pipe_send() appends one descriptor to tx_bufferlist and immediately calls queue_work(sdio_wq, &tx_work). sdio_tx_work then drains the list with scatterlist batching (up to BES_SDIO_TX_MULTIPLE_NUM=16 frames per SDIO transfer). At low rates the workqueue's pending-but-not-started dedup means only one dispatch fires; at high TX rates — especially after atomic_add(1, &hw_priv->bh_tx) in bh.c reschedules TX — successive pipe_send calls each hit queue_work before the previous fires, multiplying dispatches.

Proposed shape: Stage all frames into tx_bufferlist in the BH TX loop, then flush sdio_tx_work synchronously (call the work function body directly) before returning to the wait-event. The TX mirror of item 1.

Predicted delta: Removes redundant TX-side queue_work calls. Lower priority than RX side given current TX rate (13 wsm_cmd_send/sec is host→chip control plane; data-plane TX is also limited by firmware buffer count numInpChBufs), but it does remove one source of the 5,643/sec workqueue count.

Effort: Small.

Risks: sdio_tx_work calls sdio_claim_host/sdio_release_host internally. Running directly from BH context requires confirming no deadlock with the SDIO bus claim that sdio_rx_work (now merged per item 1) holds. The TX flush must happen after the RX burst, matching the existing BH loop structure (rx:tx: ordering in bh.c:1444).


Item 4 — ba_lock per-frame acquisition in the RX path

File:line: txrx.c:9981005 (bes2600_rx_h_ba_stat()); txrx.c:11591164 and txrx.c:16821698 (tsm_lock, CONFIG_BES2600_TESTMODE-gated).

Current shape: bes2600_rx_cb() calls bes2600_rx_h_ba_stat() for every non-multicast data frame. That function acquires ba_lock under BH (spin_lock_bh) to increment ba_acc_rx and ba_cnt_rx, then sets a timer. At 611 frames/sec that's 611 lock acquisitions/sec on ba_lock alone.

Proposed shape: Replace per-frame ba_lock with a per-cpu counter (or atomic64_t) for ba_acc_rx and ba_cnt_rx. The timer arm (mod_timer) is the actual reason for the lock — a READ_ONCE/cmpxchg on a flag to detect first-frame-in-interval is sufficient.

Predicted delta: Removes 611 lock acquisitions/sec from the RX hot path. Not the dominant cost but the next bottleneck after items 12 land.

Effort: Small.

Risks: ba_lock also serialises TX-side block-ack accounting (txrx.c:1632). The per-cpu approach requires a fold step in the timer callback — cheap.


Item 5 — Skip ps_state_lock acquisitions when PSM is known-disabled

File:line: bes2600.h:320 (decl); txrx.c:13401365 (vif_lock); txrx.c:14151426 (ps_state_lock); txrx.c:19421948 (RX-side ps_state_lock).

Current shape: ps_state_lock is taken on every TX frame if powersave is active. Per memory reference_bes2600_firmware_no_psm.md, PSM is non-functional on this firmware — c7 already self-detects this and latches pm_unsupported = true. The ps_state_lock guards in the RX callback and TX path are therefore taking dead overhead.

Proposed shape: Add a READ_ONCE() check on powersave_enabled before taking ps_state_lock; if false, skip the lock and the PSM state update entirely. Since c7's pm_unsupported latches, this is safe.

Predicted delta: Small absolute gain at current TX rate, but prevents fast path from regressing as throughput improves.

Effort: Small.

Risks: powersave_enabled is written from process context (bh.c:403). READ_ONCE without lock is safe — at worst one spurious PSM notification, not a state corruption.


Item 6 — Firmware block-read size cap (EFFECTIVE_BUF_SIZE = 8190 bytes)

File:line: bh.c:3336 (defines); bes2600_sdio.c:721783 (bes2600_sdio_extract_packets()); hwio.h:294 (BES_SDIO_RX_MULTIPLE_NUM=16).

Current shape: BES_SDIO_RX_MULTIPLE_NUM=16 and BES_SDIO_OPTIMIZED_LEN both defined (Makefile:90,92). The RX burst reads PACKET_TOTAL_LEN(ctrl_reg) bytes in a single CMD53; each sub-packet bounded by EFFECTIVE_BUF_SIZE = (0x1000-4)*2 - 2 = 8190 bytes. At 611 frames/sec ÷ 267 BH wakeups/sec ≈ 2.3 frames per wakeup — well under the 16-frame limit. Not the bottleneck today.

Proposed shape: No change needed now. Re-evaluate after items 13 land if throughput rises past ~3 MB/s. Verify ctrl_reg PACKET_TOTAL_LEN field values during high load — requires firmware-trace observation we don't currently have.

Effort: N/A.

Risks: Increasing beyond 16 requires a larger DMA allocation (currently 1632 × 16 = 26 KB). Cortex-M4F firmware side is opaque.


Item 7 — Duplicate workqueues (hw_priv->workqueue vs hw_priv->bh_workqueue vs sbus_priv->sdio_wq)

File:line: bes2600.h:323 (workqueue); bes2600.h:385 (bh_workqueue); bes2600_sdio.c:63 (sdio_wq). txrx.c has 10 queue_work(hw_priv->workqueue, ...) calls for control-plane work.

Current shape: Three distinct workqueues. The 5,643 workqueue_execute_start/sec are dominated by sdio_wq items, not workqueue. workqueue items are control-plane events at rates well below the data-plane.

Proposed shape: After item 1 (merging sdio_rx_work into BH), sdio_wq only carries tx_work. After item 3 (synchronous TX flush from BH), sdio_wq is idle during normal data-plane and could be replaced with system_highpri_wq.

Effort: Small (follow-on to items 1 and 3).

Risks: None if items 1 and 3 land first.


Item 8 — BH_RX_CONT_LIMIT=3 cap on RX burst per BH wakeup

File:line: bh.c:13801405 (timeout detection); bh.c:1330 (BH_RX_CONT_LIMIT=3); bh.c:1331 (BH_TX_CONT_LIMIT=20).

Current shape: BH loop limits RX burst to 3 consecutive iterations before breaking back to wait-event. At 611 frames/sec ÷ 267 wakeups/sec ≈ 2.3 frames per wakeup → not the bottleneck today. After items 13 land, per-burst frame rate will rise and BH_RX_CONT_LIMIT=3 becomes the ceiling.

Proposed shape: Raise to 16 (matching BES_SDIO_RX_MULTIPLE_NUM) after items 13 are deployed and re-measured.

Effort: Trivial (constant change), but must wait for Phase 7 measurements post 13.

Risks: Too high a limit under firmware anomaly (corrupted ctrl_reg) can spin BH long enough to miss beacon ACK deadline. Bound to BES_SDIO_RX_MULTIPLE_NUM as safe ceiling.


Ranking summary

Rank Item Predicted gain Effort
1 Collapse sdio_rx_work relay into BH loop ~5x workqueue dispatch reduction Medium
2 Batch deliver via ieee80211_rx_list() Removes per-frame softirq Small
3 Synchronous TX flush from BH Removes TX-side dispatch noise Small
4 Replace ba_lock per-frame with atomic/per-cpu Removes 611 lock/sec from RX hot path Small
5 Skip ps_state_lock when PSM-known-disabled Removes dead overhead Small
6 Raise BH_RX_CONT_LIMIT after 13 land Unlocks residual throughput Trivial
7 Consolidate workqueues post-items 1&3 Cleanup Small
8 Firmware block-read size Not bottleneck at current rates N/A

Items 1 + 2 together are the structural answer to the measurement: ~9 workqueue events per delivered frame collapse to ~1, and the per-frame softirq cost disappears. Items 35 clean up the next layer. The beacon-loss cascade at 9 minutes is almost certainly starvation of the BH wait-event under the per-frame workqueue storm — item 1 removes the mechanism that makes the cascade possible.


Next campaign step

A Phase 4 plan locking item 1 (and possibly item 2) follows in a separate PR. The remaining items go on the campaign backlog as follow-on patches once the Phase 7 verification of item-1-or-1-plus-2 confirms the predicted delta.