From 594f73c6b4db7827790cf9eb2c47e146932e88ff Mon Sep 17 00:00:00 2001
From: "Claude (noether)"
Date: Thu, 7 May 2026 17:31:31 +0200
Subject: [PATCH] =?UTF-8?q?notes:=20Bug=20#5=20root=20cause=20refined=20?=
 =?UTF-8?q?=E2=80=94=20workqueue-per-SDIO-transaction=20is=20the=20floor?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Follow-up ftrace measurement (post-reboot, 3-min 4 MB/s capture):
- workqueue_execute_start: 5,643/sec ← dominates
- wsm_cmd_send: only 13/sec (host-to-chip command path NOT the hotspot)
- lock contention: 50/sec (modest)

The throughput floor is set by per-SDIO-transaction workqueue dispatch
overhead. Surgical patches B5-1/B5-2/B5-3 from the prior Phase 4 plan
all targeted the wrong layer; deferring those until an architectural
restructuring map is produced.

Promoting the Sonnet architect review from "backlog" to "blocking on
Bug #5" — the next step is a restructuring assessment, not another
patch.
---
 notes/observed-bugs.md | 50 ++++++++++++++++++++++++++++++------------
 1 file changed, 36 insertions(+), 14 deletions(-)

diff --git a/notes/observed-bugs.md b/notes/observed-bugs.md
index 7fbe8e7df..8016e5da0 100644
--- a/notes/observed-bugs.md
+++ b/notes/observed-bugs.md
@@ -82,25 +82,47 @@ without board power-cycle").
 **Status**: task c3 (indirectly, via bes_chardev removal which currently
 gates the signal/nosignal mode switch path).
 
-## Backlog — full architect review of bes2600 driver code quality
+## Architect review — now BUG-#5-blocking (was backlog)
 
-The Phase 0 perf trace for Bug #5 exposes a "when in doubt, add a lock"
-pattern in the BH path (~20 % CPU in `_raw_spin_unlock_irqrestore` even
-during healthy throughput). Markus has flagged this for a separate
-architect-review pass: have Claude Sonnet (or equivalent reviewer) do a
-top-to-bottom code-quality review of the bes2600 sources we have on
-boltzmann (`~/src/besser/bes2600-dkms-mobian/bes2600/`), looking for:
+The Phase 0 perf trace for Bug #5 first exposed a "when in doubt, add a
+lock" pattern (~20 % CPU in `_raw_spin_unlock_irqrestore`). The
+follow-up ftrace measurement (2026-05-07 17:00) refined the root cause
+to an architectural problem: **the bes2600 driver dispatches every
+SDIO transaction through the kernel workqueue**. Numbers from a 3-min
+4 MB/s ohm capture (post-reboot, srcversion `1B3B3ED0`):
 
-- needless lock proliferation
-- BH / workqueue dispatch shape
-- error-handling coverage
-- dead code / leftover-from-cw1200 cruft
+```
+wsm_cmd_send:               13/sec   (host-to-chip command rate, surprisingly low)
+bes2600_rx_cb:             611/sec
+bes2600_bh_wakeup:         267/sec
+lock contention_begin:      50/sec
+workqueue_execute_start: 5,643/sec   ← DOMINATES; matches the mmc
+                                       transaction rate from earlier perf
+```
+
+5.6 k workqueue dispatches per second is the throughput floor — not a
+specific lock, not WSM-command rate, not decrypt-state. A surgical fix
+to any single function won't move the floor; the architecture needs
+to be restructured to amortise SDIO transactions across fewer
+work-items (or move SDIO RX out of the workqueue entirely).
+
+This is where the **Claude Sonnet architect review** belongs: a
+top-to-bottom assessment of `~/src/besser/bes2600-dkms-mobian/bes2600/`
+focused on:
+
+- the workqueue dispatch shape (most actionable)
+- needless lock proliferation (the original signal)
+- BH / RX scheduling boundaries
+- error-handling coverage and dead-code from the cw1200 ancestor
 - API contract violations relative to mainline mac80211
 
-Output: ranked list of cleanup targets that would make later patch series
-land more cleanly. Not blocking on Bug #5 — independent track.
+Output: ranked list of restructuring targets, with predicted-delta
+estimates against the Phase 1 metric (≥ 2 MB/s sustained @ 4 MB/s cap,
+< 10 % CPU in lock-cycling, no link cascade in 30 min).
 
-**Status**: backlog. Schedule when Bug #5's measurement pass finishes.
+**Status**: now blocking on Bug #5 (was independent track). Surgical
+patches B5-1, B5-2, B5-3 from the original Phase 4 candidate list are
+all DEFERRED until the architect review's restructuring map is in.
 
 ## Bug #5 — RX path degrades under attempted-throughput pressure
 
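Reviewer aside, not part of the patch: the per-second rates quoted in the hunk can be cross-checked with a few lines of arithmetic. The sketch below is illustrative only. The rates are copied from the capture above; everything else is an assumption made for the example, in particular the helper names, the reading of 1 MB as 10^6 bytes, the use of the Phase 1 target (2 MB/s) as a stand-in throughput, and the batch factor of 8. The capture does not state the throughput actually achieved, so the bytes-per-dispatch figure is a what-if, not a measurement.

```python
# Rates per second, copied from the ftrace capture quoted in the patch.
RATES_PER_SEC = {
    "wsm_cmd_send": 13,
    "bes2600_rx_cb": 611,
    "bes2600_bh_wakeup": 267,
    "lock contention_begin": 50,
    "workqueue_execute_start": 5643,
}

CAPTURE_SECONDS = 3 * 60  # the capture is described as "3-min"


def dispatches_per(event: str) -> float:
    """Workqueue dispatches observed per occurrence of `event`."""
    return RATES_PER_SEC["workqueue_execute_start"] / RATES_PER_SEC[event]


def total_dispatches() -> int:
    """Total workqueue dispatches implied over the whole capture."""
    return RATES_PER_SEC["workqueue_execute_start"] * CAPTURE_SECONDS


def bytes_per_dispatch(throughput_bytes_per_sec: float) -> float:
    """Payload moved per dispatch at an ASSUMED throughput (hypothetical)."""
    return throughput_bytes_per_sec / RATES_PER_SEC["workqueue_execute_start"]


def dispatch_rate_if_batched(batch: int) -> float:
    """Hypothetical dispatch rate if `batch` transactions shared one work item."""
    return RATES_PER_SEC["workqueue_execute_start"] / batch


if __name__ == "__main__":
    print(f"dispatches per rx_cb:     {dispatches_per('bes2600_rx_cb'):.1f}")
    print(f"dispatches per bh_wakeup: {dispatches_per('bes2600_bh_wakeup'):.1f}")
    print(f"dispatches in capture:    {total_dispatches():,}")
    # Assumption: Phase 1 target of 2 MB/s, with 1 MB = 10**6 bytes.
    print(f"bytes/dispatch @ 2 MB/s:  {bytes_per_dispatch(2_000_000):.0f}")
    # Assumption: batch factor of 8, chosen arbitrarily for illustration.
    print(f"rate if batched x8:       {dispatch_rate_if_batched(8):.0f}/sec")
```

Read this way, the capture shows roughly nine workqueue dispatches per `bes2600_rx_cb` call, which is the ratio any restructuring that amortises transactions across work items would need to bring down.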