Files
besser/notes/patch-c2-phase4-plan-2026-05-08.md
claude-noether 722434414a notes: Patch C2 Phase 4 plan — ieee80211_rx_irqsafe to ieee80211_rx_list
After Patch C v3 / D / E / F / G all merged, the remaining cleanup
target is the per-RX-frame tasklet defer that ieee80211_rx_irqsafe
introduces.  Patch C2 migrates all 6 call sites in bes2600 to
ieee80211_rx_list, the process-context API verified per the
kerneldoc audit (Task #19, mainline include/net/mac80211.h:5324-5345).

Key constraints from kerneldoc:
  - cannot mix _list and _irqsafe for the same hardware
    (=> all 6 sites convert atomically)
  - requires local_bh_disable + rcu_read_lock wrap
  - calls must be synchronized for a single hardware
    (=> bh-thread-as-sole-RX-context post-v3 satisfies trivially)

Plan §4.2 design decision: per-batch wrap (Option B), wrapping
bes2600_sdio_read_rx_batch outer loop, rather than per-call wrap.
Captures the actual batch benefit.

Open questions for the Phase 5 reviewer:

  1. rx_list draining semantics — does mainline expect explicit
     netif_receive_skb_list at end-of-batch, or does mac80211
     internal-deliver?  Need to verify by reading mt76 / iwl_pcie
     usage before Phase 6 lands.
  2. beacon path (wsm.c:2415) SKB ownership — hw_priv->beacon is
     long-lived; after _rx_list consumes it, the field would be
     dangling.  Audit before Phase 6.

Predicted throughput delta: +5-15% over v3 N=3 baseline (2.352 MB/s),
medium confidence.  Smaller-than-expected delta = "marginal but no
regression, ship for upstream-cleanliness".

Phase 7 N=3 ramp uses wired enu1 path + per-rep fresh nc listener
per the rig-failure-is-finding lesson.
2026-05-08 00:42:50 +02:00

9.5 KiB

Patch C2 — Phase 4 Plan: migrate ieee80211_rx_irqsafe → ieee80211_rx_list

Author: Claude (noether) Status: Phase 4 — pending Phase 5 PR review before any Phase 6 code. Predecessor: Patch C v3 (PR #5 merged, +73% throughput, no-relay architecture); Patch D + E + F + G also landed. Cleanups branch tip = 42fd0ce. Task #19 contract: ieee80211_rx_list callable from process context, requires local_bh_disable() + rcu_read_lock() wrap, cannot mix with ieee80211_rx_irqsafe() for the same hardware → all 6 sites convert in one shot.


§0 Substrate

After Patch C v3:

  • bh thread is the sole RX-delivery context (no relay, no sdio_rx_work)
  • Per-frame work runs in process context (sleepable)
  • Single-writer-from-bh invariant covers hw_bufs_used and friends

ieee80211_rx_irqsafe is currently called from process context. Per kerneldoc (include/net/mac80211.h:5399-5411):

Like ieee80211_rx() but can be called in IRQ context (internally defers to a tasklet.)

The tasklet hop is the cost we pay today for delivering each RX frame from process context. ieee80211_rx_list is the process-context replacement.

§1 Goal

Per-frame: skip the tasklet hop. Batch: process multiple SKBs from one SDIO read inside a single local_bh_disable()/rcu_read_lock() window.

Phase 1 metric: RX throughput @ 4 MB/s sender, with v3 N=3 baseline = 2.352 MB/s. Hypothesis: small to moderate uplift (<10%) from removing the tasklet deferral. Larger improvement would be surprising — if observed, that's a finding to investigate.

§2 Situation

  • 6 call sites in bes2600 currently use ieee80211_rx_irqsafe:
    • ap.c:96 (AP-mode link-id RX queue drain)
    • sta.c:1487 (link-id rx_queue drain in ?)
    • txrx.c:1960 (early-data + pm_unsupported branch — Patch E added)
    • txrx.c:1967 (early-data + LINK_SOFT-not-set branch)
    • txrx.c:1971 (normal RX path)
    • wsm.c:2415 (beacon SKB delivery from bes2600_beacon_handler?)
  • All 6 must convert together (kerneldoc: cannot mix per hardware)
  • bh thread is single-writer post-v3 → _rx_list's "calls must be synchronized" satisfied trivially
  • bh thread is process context → _rx_list callable

§3 Baseline (carry forward)

From notes/phase7-v3-2026-05-07.md (v3 N=3 ramp, Phase 7 closed):

metric v3 fresh-chip N=3
RX throughput @ 4 MB/s mean 2.352 MB/s, min 2.102, max 2.590
sdio_rx_work dispatches 0/s
bh_work redispatches 0

Phase 7 of C2 will compare against this baseline.

§4 Plan

§4.1 Conversion shape

Per call site:

ieee80211_rx_irqsafe(priv->hw, skb);

becomes:

ieee80211_rx_list(priv->hw, NULL, skb, &priv->rx_list);

Where priv->rx_list is a struct list_head initialized once.

Wrap requirement: local_bh_disable() + rcu_read_lock() must be held across the call. Per the kerneldoc, that's also needed for batch correctness.

§4.2 Wrap placement (the design decision)

Option A — per-call wrap. Wrap each individual ieee80211_rx_list() call. Simple but loses the batch benefit (each call's wrap+unwrap costs as much as the avoided tasklet defer).

Option B — per-batch wrap. Wrap the OUTER frame-iteration loop (e.g., the for in bes2600_sdio_extract_packets). All 16 SKBs from one SDIO read get delivered inside one wrap. This is the upstream-idiomatic pattern (mt76, iwl_pcie do this).

Choosing Option B. Concrete shape:

  • bes2600_sdio_read_rx_batch (the per-SDIO-batch entry point added in Patch C v3) wraps the read+extract+deliver phase:
    rcu_read_lock();
    local_bh_disable();
    // existing read + extract_packets that calls bh_handle_rx_skb per frame
    local_bh_enable();
    rcu_read_unlock();
    
  • Inside bes2600_bh_handle_rx_skb, the single ieee80211_rx_irqsafe swap becomes ieee80211_rx_list(priv->hw, NULL, skb, &priv->rx_list).
  • The OTHER 5 call sites (in ap.c, sta.c, txrx.c's branches, wsm.c) need the same treatment, but they're called from the bh thread (post-v3) so they're already in the right context. Each gets its own narrow wrap (Option A applied selectively because those paths process one frame at a time, not a batch).

§4.3 The rx_list field

Add struct list_head rx_list to either struct bes2600_common (driver-wide) or struct bes2600_vif (per-vif). Per-vif is cleaner because the existing priv->hw parameter implies vif scope.

INIT_LIST_HEAD(&priv->rx_list) at vif setup; no teardown needed (mac80211 owns the SKBs once handed off).

Open question for reviewer: does the rx_list need to be drained explicitly after the batch (e.g., via a list_for_each_entry_safe + netif_receive_skb_list_internal)? Looking at mainline mt76 / iwl_pcie usage will clarify. Phase 6 must answer this before code lands.

§4.4 What will NOT be touched

  • The 6 call sites change atomically (all-or-nothing per kerneldoc) — no per-site progressive migration
  • wsm.c:2415 beacon path: same conversion shape, but beacon delivery is once-per-beacon-interval (not hot path); could stay _irqsafe if upstream allows mixing per-SKB-type. Re-read kerneldoc carefully — it says "per hardware", not per-call-site, so we can't keep _irqsafe even on the slow paths.
  • bh thread structure (Patch C v3 stands)
  • atomic_t counters from Patch D
  • pm_unsupported lock-skip from Patch E
  • mac80211 batch-delivery semantics (mainline owns this; we just call the API)

§4.5 Predicted delta in Phase 3 units

metric predicted
rx_irqsafe tasklet schedule rate → 0 (function no longer called)
RX throughput @ 4 MB/s sustained 2.352 → +5-15% (medium confidence)
_raw_spin_unlock_irqrestore CPU% small drop (no tasklet schedule lock contribution)

Honest acknowledgment: I don't have data on how much the tasklet hop actually costs. The improvement might be smaller than predicted if tasklet defer was already cheap on this kernel. If <2%, Phase 7 says "marginal but no regression" and we ship anyway for upstream-cleanliness.

§4.6 Risks

  1. ieee80211_rx_list semantics surprise. mainline drivers I have access to (mt76, iwl_pcie) use this via NAPI infrastructure. bes2600 doesn't have NAPI; we're doing process-context-direct. The kerneldoc says callable that way but we should verify a few mainline drivers actually do it. Phase 6 contract-cite from at least one upstream caller before code lands.

  2. rx_list lifetime in cross-batch / cross-vif scenarios. Multiple vifs (P2P_MULTIVIF=y in Makefile) might race on the same hw's rx_list. The kerneldoc says "for a single hardware" — the list is per-call destination, which means each call appends to its argument list. Per-vif rx_list per-call is the natural shape. No per-hw aggregator needed.

  3. local_bh_disable cost in batch wrap. Not free. If the batch is small (1-2 SKBs), the wrap might dominate. Estimated breakeven: 2-3 SKBs per wrap. Phase 7 should look at SKB-per-batch distribution to confirm.

  4. rcu_read_lock across SDIO read. SDIO read can take multi-ms (multi-block transfers). RCU reader-cs across that is fine (no preemption blocked) but it's a longer reader-cs than typical. Verifiable but not a blocker — kerneldoc requires it.

  5. wsm.c:2415 (beacon) is a different SKB lifecyclehw_priv->beacon is owned by hw_priv, not allocated per-call. After _rx_list consumes it (by passing ownership to mac80211), hw_priv->beacon is dangling. Phase 6 must verify the beacon path either reallocates after delivery or wasn't actually transferring ownership. Risk #5 is the biggest open question.

§4.7 Phase 5 review handover

PR on git.reauktion.de/marfrit/besser with this artifact. Specifically request reviewer focus on:

  • §4.2 wrap-placement choice (Option B vs A)
  • §4.3 rx_list scoping (per-vif)
  • §4.6 risks #1 (mainline-caller verification) and #5 (beacon path SKB ownership)

Don't curate.

§4.8 Phase 6 implementation order

  1. Branch off cleanups: bes2600/rx-list-batch-delivery
  2. Add struct list_head rx_list to struct bes2600_vif, INIT_LIST_HEAD in vif setup
  3. Convert all 6 call sites: ieee80211_rx_irqsafe(...)ieee80211_rx_list(...)
  4. Wrap bes2600_sdio_read_rx_batch outer loop with rcu_read_lock + local_bh_disable / local_bh_enable + rcu_read_unlock
  5. For the non-bh-thread call sites (ap.c, sta.c, wsm.c beacon): per-call narrow wrap
  6. Verify beacon path in wsm.c:2415 (Risk #5)
  7. Build, install, smoke-test
  8. Phase 7 N=3 stress ramp — compare to v3 baseline

§4.9 Phase 7 protocol (per feedback_phase7_stress_ramp)

  • N=3 reps, 30s each at 4 MB/s, fresh-chip (uptime <15 min)
  • Use wired path (ssh mfritsche@192.168.88.80) for telemetry
  • Fresh nc listener per rep (per feedback_rig_failure_is_finding)
  • Compare: throughput delta + tasklet schedule rate (ftrace irq:tasklet_* events)
  • If predicted delta met → close C2 + memory entry
  • If NO delta → marginal patch but no regression; ship for upstream-cleanliness

§5 Out of scope

  • Patch D / E already shipped (PR #7, #8 merged)
  • Patch G already shipped (PR #6 merged)
  • bh.c #if 0 graveyard removal (Task #24 hygiene)
  • Allwinner sw_mci_check_r1_ready (Task #25)

§6 Summary

C2 is a 6-site mechanical migration with ONE design decision (per-batch wrap), TWO open questions for the reviewer (rx_list draining + beacon path SKB ownership), and SMALL expected throughput delta (<15%). Risk-low, upstream-prep-high. Worth shipping for the kernel.org submission story even if the throughput delta is marginal.


Plan written 2026-05-08 by Claude (noether). Phase 5 review on PR. Phase 6 contingent on review passing.