Files
besser/notes/patch-c-phase4-plan-2026-05-07.md
T
claude-noether f232476240 notes: Patch C Phase 4 plan — collapse sdio_rx_work into BH (item 1 only)
Per merged PR #8 inline review: items 1 and 2 split, sequential. Patch C
is item-1-only (collapse the sdio_rx_work → rx_queue → bh_work
indirection). Patch C2 (ieee80211_rx_list batch delivery) is split out
and gated on Task #19 kerneldoc contract verification.

Approach choice: Option A (sdio_rx_work delivers directly into
wsm_handle_rx, removing rx_queue and its two synchronization points per
frame) over Option B (subsume into bh thread). Option A has a smaller
diff and clearer bisection story; the residual per-IRQ workqueue
dispatch is preserved as a measurable Phase 7 data point that motivates
or doesn't motivate a follow-on Option-B patch.

Predicted delta in Phase 3 units, with confidence levels stated
explicitly. §4.6 lists 6 risks, of which 2 require Phase 6 contract
citations (wsm_handle_rx callability from sdio_wq context;
wsm_release_tx_buffer's bh-wake invariant). §4.8 mandates a stress
ramp in Phase 7, not a steady cap, per feedback_phase7_stress_ramp.

Symptom-shaped findings (asm nop, commented-out IRQ re-enable, BUG_ON
in hot path) explicitly deferred to Task #24 per
feedback_dont_patch_downstream_artifacts.

Awaiting Phase 5 second-model review on DokuWiki.
2026-05-07 19:04:53 +02:00

13 KiB
Raw Blame History

Patch C — Phase 4 Plan: collapse sdio_rx_work into BH

Author: Claude (noether) Status: Phase 4 — pending Phase 5 second-model review before any Phase 6 code. Scope: item 1 only (per merged PR #8 inline review: "do it sequentially; we're not on the clock"). Item 2 (batch deliver via ieee80211_rx_list) splits to Patch C2, gated on Task #19 kerneldoc verification.


§0 Substrate — anchored

Bug #5 anchor (recorded 2026-05-07, see notes/phase1-bug5-2026-05-07.md):

  • Sender: netcat-over-WiFi, 4 MB/s cap, 2.4 GHz, AVM AP, single-STA
  • Receiver: ohm (PineTab2, RK3566 + BES2600WM-SDIO)
  • N=3 baseline reps: 725 / 663 / 75 KB/s (rep 3 saw link-death at ~9 min)
  • perf record -g during 4MB/s window: _raw_spin_unlock_irqrestore ≈ 20% CPU
  • ftrace lock-instrumentation, system-wide: workqueue_execute_start ≈ 5,643/sec
  • Driver-side count: wsm_cmd_send 13/sec — wsm command path is not the dispatch source; the contributor is the per-SDIO-transaction relay.

Root cause traced in PR #7 (Sonnet review) and concurred in PR #8 (Opus review): RX path adds two synchronization points per frame and one wait-queue wake-up per IRQ batch via sdio_rx_workrx_queuebh_work indirection.

§1 Goal (locked)

Reduce per-RX-frame overhead enough that observed receive ≥ 1.0 MB/s sustained @ 4 MB/s sender, with _raw_spin_unlock_irqrestore < 15 % CPU during the 4 MB/s window. No 30-min cascade to link-death.

(This is a partial step toward Phase 1's full target of ≥ 2 MB/s sustained @ 4 MB/s with < 10 % lock CPU. The full target is jointly addressed by Patch C + Patch C2; Patch C alone should cross half the gap.)

§2 Situation

  • bes2600.ko srcversion 1B3B3ED0… deployed on ohm (c-stack + Patch A + Patch B).
  • cleanups branch on marfrit/bes2600-dkms is the current source-of-truth.
  • Build sandbox /var/tmp/c6-sandbox/ on ohm, native make -j4.
  • BES2600_RX_IN_BH is defined in Makefile — bes2600_bh_rx_helper is the active RX consumer.
  • ohm reachable. Markus pushes the reboot button; never me.
  • Test rig under /root/bes2600-samples/rep-trace.sh per-rep capture script.

§3 Baseline measurements

Reused from Bug #5 Phase 0 (above). No re-anchor needed for Patch C — same regime.

Specific Phase-3-units that this plan's predictions reference:

metric tool current value (4MB/s window)
observed receive throughput netcat receiver byte-count 75725 KB/s, rep-variance high
_raw_spin_unlock_irqrestore CPU% perf record / report ~20%
workqueue_execute_start/sec ftrace workqueue:workqueue_execute_start ~5,643/sec system-wide
bes_sdio workqueue dispatches cat /sys/kernel/tracing/events/workqueue/.../filter filtered by bes_sdio not measured pre-patch — TODO before Phase 6
RX SKB rate at mac80211 boundary trace mac80211:drv_rx_irqsafe count not measured pre-patch — TODO before Phase 6

Phase 6 must not start until the two TODOs above are filled in — otherwise Phase 7 has no reference point for the predicted-delta comparison.

§4 Plan

§4.1 What will be touched

  • bes2600_sdio.c::sdio_rx_work — the relay loop. After this patch, it still drains the SDIO bus into SKBs but delivers SKBs directly into wsm_handle_rx instead of skb_queue_tail-ing them onto self->rx_queue.
  • bes2600_sdio.c::bes2600_sdio_extract_packets — the inner per-SKB extractor. Changes the in-loop action from skb_queue_tail(&self->rx_queue, skb) to a direct call (or callback) into the wsm dispatcher.
  • bes2600_sdio.c::bes2600_sdio_pipe_read — becomes unused, removed.
  • bh.c::bes2600_bh_rx_helper — its BES_SDIO_RX_MULTIPLE_ENABLE branch is no longer reachable for RX (RX path no longer feeds bh). Either gate the helper, or remove the helper outright if bh_rx atomic is no longer raised on RX.

§4.2 What will NOT be touched

  • ieee80211_rx_irqsafe() call sites — that's Patch C2 (item 2).
  • TX path — sdio_tx_work, bes2600_bh_tx_helper, etc. Untouched.
  • sdio_wq workqueue alloc — stays. After patch it hosts only tx_work + scan_work + (briefly during patch) rx_work. Renaming is cosmetic and out of scope.
  • The bh thread itself — still runs, still handles TX, still watches the timeouts.
  • bh.c #if 0 graveyard — separate hygiene patch, not bundled.
  • __bes2600_irq_enable(1) commented-out / asm volatile("nop") placeholder — deferred per feedback_dont_patch_downstream_artifacts. These are symptom-shaped; Patch C may dissolve them. Re-evaluate at Task #24 (post-Patch-E observation).
  • bh_rx / bh_tx atomic split — out of scope.

§4.3 Approach choice — Option A (sdio_rx_work direct delivery)

Two structural options surveyed in PR #8 §2.1; recap:

Option A: direct delivery from sdio_rx_work Option B: subsume sdio_rx_work into bh thread
diff size small medium
eliminates rx_queue->lock × 2 per frame yes yes
eliminates sdio_wq.rx_work workqueue dispatch per IRQ no yes
changes who calls wsm_handle_rx sdio_wq context (already process context) bh thread
TX/RX SDIO bus contention unchanged (sdio_rx_work and sdio_tx_work already share bes2600_sdio_lock) adds bh ↔ sdio_tx_work contention on the SDIO mutex
bisection isolation clean: only the rx_queue handoff is removed mixes "remove handoff" with "subsume thread"

Choosing Option A. Reasons:

  1. Smaller diff = clearer Phase-7 attribution. If RX KB/s rises, we know it was the rx_queue handoff, not the workqueue topology.
  2. Per Markus's PR #8 review: split was for bisection clarity. Option A is narrower than Option B.
  3. The remaining cost (per-IRQ sdio_wq.rx_work dispatch) is ≤ 1 dispatch per IRQ batch; multi-RX coalescing means several frames per dispatch. If Phase 7 of Patch C shows that dispatch IS the residual cost, that becomes a concrete data point and motivates a measured Option-B follow-up, not a speculative one.

§4.4 Implementation sketch (preview — actual code in Phase 6)

Today (bes2600_sdio.c:783831):

static int bes2600_sdio_extract_packets(...) {
    for each packet:
        skb = dev_alloc_skb(...);
        memcpy(skb->data, &data[pos], packet_len);
        spin_lock(&self->rx_queue_lock);
        skb_queue_tail(&self->rx_queue, skb);    // ← handoff
        spin_unlock(&self->rx_queue_lock);
}
static void sdio_rx_work(...) {
    bes2600_sdio_extract_packets(...);
    self->irq_handler(self->irq_priv);           // ← wakes bh_wq
}
// bh thread later: pipe_read = skb_dequeue(rx_queue) → wsm_handle_rx(skb)

After patch (sketch):

static int bes2600_sdio_extract_packets(struct sbus_priv *self, u32 ctrl_reg, u8 *data) {
    for each packet:
        skb = dev_alloc_skb(...);
        memcpy(skb->data, &data[pos], packet_len);
        ret = wsm_handle_rx(self->core, wsm_id_from(skb), wsm_hdr_of(skb), &skb);
        if (skb) dev_kfree_skb(skb);
        // no rx_queue, no spinlock, no wake-up
}
static void sdio_rx_work(...) {
    bes2600_sdio_extract_packets(...);
    // self->irq_handler(...) is no longer called for RX-only wakes
    // (it remains called for TX-confirm-completion paths, if any)
}

Caveats discovered during sketch:

  • wsm_handle_rx's signature wants (hw_priv, id, wsm_hdr*, **skb). extract_packets doesn't currently parse the wsm header — we either parse it inline (cheap; the cost is one __le16_to_cpu) or defer parsing into a new bes2600_sdio_deliver_rx(skb) helper that wraps it.
  • hw_priv is reachable as self->core.
  • Need to verify wsm_handle_rx is callable from sdio_wq context. Hypothesis: yes, because today's bh thread is also process-context-via-workqueue and that's where wsm_handle_rx already runs. Phase 6 contract-cite from wsm.h / call-graph confirms.
  • The irq_handler(self->irq_priv) wakeup at sdio_rx_work:902 — keep it, but confirm whether bh actually has remaining work after RX is gone. Possibilities: TX-confirm completions (wsm_release_tx_buffer) still need a bh wake. Verify in Phase 6.

§4.5 Predicted delta (Phase 3 units)

Conservative because Patch C is item 1 only, not items 1+2.

metric predicted change confidence
rx_queue->lock acquire/release rate → 0 (lock is removed entirely; struct field deleted) high
RX-path wait-queue wakes (bh_wq from sdio_rx_work for RX) → 0 (TX-confirm wakes remain) high
_raw_spin_unlock_irqrestore CPU% 20 % → 1215 % medium — the rx_queue lock is one of several contributors; I don't have per-lock breakdown pre-patch
workqueue_execute_start/sec marginal change (≤ 5 %) high — sdio_wq dispatch still happens per IRQ
observed receive @ 4 MB/s floor lifts from 75 KB/s → ≥ 1.0 MB/s; rep-variance shrinks medium — rep 3's link death has multiple causes (decrypt-storm path is Patch A's territory; AP-side aid 30 rejection is also possible)
Phase 7 N=3 outcome all reps ≥ 1 MB/s sustained for 30 min @ 4 MB/s medium

Honest acknowledgement: the medium-confidence predictions are the ones where Phase 7 either confirms the model or surfaces a new bug. If _raw_spin_unlock_irqrestore only drops to 18 %, the next-largest contributor was something else — pool->lock (workqueue infrastructure) or ba_lock — and Patch D/E/C2 become the answer.

§4.6 Risks

  1. wsm_handle_rx not callable from sdio_wq: low probability (process context, same shape as today's bh), but a cite-failure here means revert to Option B. Phase 6 must produce a wsm.h contract citation before code lands.
  2. TX-confirm wake-ups stop firing: if wsm_handle_rx was the only thing that ultimately bumped bh_tx, removing it from bh's input causes TX-confirm starvation. Mitigation: keep irq_handler(irq_priv) call in sdio_rx_work for now; let the bh's wait_event re-evaluate bh_tx on every wake. Verify in Phase 6 that wsm_release_tx_buffer still wakes bh.
  3. SKB allocation under memory pressure: dev_alloc_skb in extract_packets currently msleep(100) retries up to 10×. Calling wsm_handle_rx directly from extract_packets keeps us in sdio_wq context during sleep; that's the same as today, so no new risk.
  4. rcu / locking invariants in wsm_handle_rx: it traverses priv->vif_list, may grab priv->vif_lock. Currently called from bh thread. After patch: called from sdio_wq context. Both are process context, both can sleep. No new risk unless there's a held lock at sdio_wq level that wsm_handle_rx tries to re-acquire. Phase 6 lock-graph audit required.
  5. bes2600_chrdev_is_bus_error() early-return: currently checked in pipe_read. After patch, must move into extract_packets or sdio_rx_work so RX during a bus-error window still gets dropped, not passed to mac80211.
  6. Multi-vif RX serialization: the rx_queue is per-sbus_priv, not per-vif. After patch, multi-vif demux happens inside wsm_handle_rx (same as today). No new risk; same ceiling.

§4.7 Phase 5 review handover

Goal/Situation/Measurements/Plan paste verbatim into DokuWiki when Markus initiates handover. Do not curate the plan for the reviewer — including the "medium-confidence" predictions and the §4.6 risk list verbatim. Reviewer should see the same uncertainty I have.

§4.8 Phase 7 protocol (after Phase 6 lands)

Per feedback_phase7_stress_ramp.mdstress ramp, not steady cap:

  1. Pre-patch baseline (re-anchor): 5 min @ 1 MB/s, 10 min @ 2 MB/s, 30 min @ 4 MB/s. Capture ftrace workqueue/, lock/, mac80211/, mmc/. perf record during the 4 MB/s window.
  2. Apply Patch C, install, reboot (Markus pushes).
  3. Post-patch: identical ramp, identical instrumentation.
  4. Compute deltas in the same units as §3 baseline. Compare to §4.5 predictions. Any unexplained delta is a finding, not a footnote — log it and loop back to Phase 4 if the model is wrong.
  5. N=3 reps post-patch. The user's stress-ramp memory and the receipts checklist both require this.
  6. Capture sdio_work_debug output and dmesg if any storm fires (Patch A's counter should hold steady).
  7. If Phase 7 numbers match prediction → Phase 8 memory update + proceed to Patch C2.
  8. If they don't match → loop back to Phase 4. Don't paper-fix.

§5 Out-of-scope items recorded for follow-on patches

  • Patch C2: items 2 — ieee80211_rx_list batch delivery. Gated on Task #19 kerneldoc verification.
  • Patch D: ba_lock atomicization at txrx.c:998-1005, 1632. Independent.
  • Patch E: ps_state_lock skip when pm_unsupported = true at txrx.c:1942-1948. Independent, gated on c7 latch.
  • Task #24: post-Patch-E observation of bh.c asm volatile("nop"), commented-out __bes2600_irq_enable(1), BUG_ON in steady-state hot path. Symptom-shaped; observe before patching.
  • Task #25: measure sw_mci_check_r1_ready on RK3566 during testing.

Plan written 2026-05-07 by Claude (noether). Awaiting Phase 5 second-model review on DokuWiki, initiated by Markus.