diff --git a/notes/patch-c2-phase4-plan-2026-05-08.md b/notes/patch-c2-phase4-plan-2026-05-08.md new file mode 100644 index 000000000..60dbadd35 --- /dev/null +++ b/notes/patch-c2-phase4-plan-2026-05-08.md @@ -0,0 +1,171 @@ +# Patch C2 — Phase 4 Plan: migrate ieee80211_rx_irqsafe → ieee80211_rx_list + +**Author:** Claude (noether) +**Status:** Phase 4 — pending Phase 5 PR review before any Phase 6 code. +**Predecessor:** Patch C v3 (PR #5 merged, +73% throughput, no-relay architecture); Patch D + E + F + G also landed. Cleanups branch tip = 42fd0ce. +**Task #19 contract**: `ieee80211_rx_list` callable from process context, **requires `local_bh_disable()` + `rcu_read_lock()` wrap**, **cannot mix with `ieee80211_rx_irqsafe()` for the same hardware** → all 6 sites convert in one shot. + +--- + +## §0 Substrate + +After Patch C v3: +- bh thread is the sole RX-delivery context (no relay, no sdio_rx_work) +- Per-frame work runs in process context (sleepable) +- Single-writer-from-bh invariant covers `hw_bufs_used` and friends + +`ieee80211_rx_irqsafe` is currently called from process context. Per kerneldoc (`include/net/mac80211.h:5399-5411`): + +> **Like ieee80211_rx() but can be called in IRQ context** (internally defers to a tasklet.) + +The tasklet hop is the cost we pay today for delivering each RX frame from process context. `ieee80211_rx_list` is the process-context replacement. + +## §1 Goal + +Per-frame: skip the tasklet hop. Batch: process multiple SKBs from one SDIO read inside a single `local_bh_disable()`/`rcu_read_lock()` window. + +Phase 1 metric: **RX throughput @ 4 MB/s sender**, with v3 N=3 baseline = 2.352 MB/s. Hypothesis: small to moderate uplift (<10%) from removing the tasklet deferral. Larger improvement would be surprising — if observed, that's a finding to investigate. + +## §2 Situation + +- 6 call sites in bes2600 currently use `ieee80211_rx_irqsafe`: + - `ap.c:96` (AP-mode link-id RX queue drain) + - `sta.c:1487` (link-id rx_queue drain in ?) + - `txrx.c:1960` (early-data + pm_unsupported branch — Patch E added) + - `txrx.c:1967` (early-data + LINK_SOFT-not-set branch) + - `txrx.c:1971` (normal RX path) + - `wsm.c:2415` (beacon SKB delivery from `bes2600_beacon_handler`?) +- All 6 must convert together (kerneldoc: cannot mix per hardware) +- bh thread is single-writer post-v3 → `_rx_list`'s "calls must be synchronized" satisfied trivially +- bh thread is process context → `_rx_list` callable + +## §3 Baseline (carry forward) + +From `notes/phase7-v3-2026-05-07.md` (v3 N=3 ramp, Phase 7 closed): + +| metric | v3 fresh-chip N=3 | +|---|---| +| RX throughput @ 4 MB/s | mean 2.352 MB/s, min 2.102, max 2.590 | +| sdio_rx_work dispatches | 0/s | +| bh_work redispatches | 0 | + +Phase 7 of C2 will compare against this baseline. + +## §4 Plan + +### §4.1 Conversion shape + +Per call site: +```c +ieee80211_rx_irqsafe(priv->hw, skb); +``` +becomes: +```c +ieee80211_rx_list(priv->hw, NULL, skb, &priv->rx_list); +``` + +Where `priv->rx_list` is a `struct list_head` initialized once. + +**Wrap requirement:** `local_bh_disable()` + `rcu_read_lock()` must be held across the call. Per the kerneldoc, that's also needed for batch correctness. + +### §4.2 Wrap placement (the design decision) + +**Option A — per-call wrap.** Wrap each individual `ieee80211_rx_list()` call. Simple but loses the batch benefit (each call's wrap+unwrap costs as much as the avoided tasklet defer). + +**Option B — per-batch wrap.** Wrap the OUTER frame-iteration loop (e.g., the `for` in `bes2600_sdio_extract_packets`). All 16 SKBs from one SDIO read get delivered inside one wrap. This is the upstream-idiomatic pattern (mt76, iwl_pcie do this). + +Choosing **Option B**. Concrete shape: + +- `bes2600_sdio_read_rx_batch` (the per-SDIO-batch entry point added in Patch C v3) wraps the read+extract+deliver phase: + ```c + rcu_read_lock(); + local_bh_disable(); + // existing read + extract_packets that calls bh_handle_rx_skb per frame + local_bh_enable(); + rcu_read_unlock(); + ``` +- Inside `bes2600_bh_handle_rx_skb`, the single `ieee80211_rx_irqsafe` swap becomes `ieee80211_rx_list(priv->hw, NULL, skb, &priv->rx_list)`. +- The OTHER 5 call sites (in `ap.c`, `sta.c`, `txrx.c`'s branches, `wsm.c`) need the same treatment, but they're called from the bh thread (post-v3) so they're already in the right context. Each gets its own narrow wrap (Option A applied selectively because those paths process one frame at a time, not a batch). + +### §4.3 The `rx_list` field + +Add `struct list_head rx_list` to either `struct bes2600_common` (driver-wide) or `struct bes2600_vif` (per-vif). Per-vif is cleaner because the existing `priv->hw` parameter implies vif scope. + +`INIT_LIST_HEAD(&priv->rx_list)` at vif setup; no teardown needed (mac80211 owns the SKBs once handed off). + +**Open question for reviewer:** does the `rx_list` need to be drained explicitly after the batch (e.g., via a `list_for_each_entry_safe` + `netif_receive_skb_list_internal`)? Looking at mainline mt76 / iwl_pcie usage will clarify. Phase 6 must answer this before code lands. + +### §4.4 What will NOT be touched + +- The 6 call sites change atomically (all-or-nothing per kerneldoc) — no per-site progressive migration +- `wsm.c:2415` beacon path: same conversion shape, but beacon delivery is once-per-beacon-interval (not hot path); could stay `_irqsafe` if upstream allows mixing per-SKB-type. Re-read kerneldoc carefully — it says "per hardware", not per-call-site, so we can't keep _irqsafe even on the slow paths. +- bh thread structure (Patch C v3 stands) +- atomic_t counters from Patch D +- `pm_unsupported` lock-skip from Patch E +- mac80211 batch-delivery semantics (mainline owns this; we just call the API) + +### §4.5 Predicted delta in Phase 3 units + +| metric | predicted | +|---|---| +| `rx_irqsafe` tasklet schedule rate | → 0 (function no longer called) | +| RX throughput @ 4 MB/s sustained | 2.352 → +5-15% (medium confidence) | +| `_raw_spin_unlock_irqrestore` CPU% | small drop (no tasklet schedule lock contribution) | + +**Honest acknowledgment:** I don't have data on how much the tasklet hop actually costs. The improvement might be smaller than predicted if tasklet defer was already cheap on this kernel. If <2%, Phase 7 says "marginal but no regression" and we ship anyway for upstream-cleanliness. + +### §4.6 Risks + +1. **`ieee80211_rx_list` semantics surprise.** mainline drivers I have access to (mt76, iwl_pcie) use this via NAPI infrastructure. bes2600 doesn't have NAPI; we're doing process-context-direct. The kerneldoc says callable that way but we should verify a few mainline drivers actually do it. **Phase 6 contract-cite from at least one upstream caller** before code lands. + +2. **`rx_list` lifetime in cross-batch / cross-vif scenarios.** Multiple vifs (P2P_MULTIVIF=y in Makefile) might race on the same hw's `rx_list`. The kerneldoc says "for a single hardware" — the list is per-call destination, which means each call appends to its argument list. Per-vif `rx_list` per-call is the natural shape. No per-hw aggregator needed. + +3. **`local_bh_disable` cost in batch wrap.** Not free. If the batch is small (1-2 SKBs), the wrap might dominate. Estimated breakeven: 2-3 SKBs per wrap. Phase 7 should look at SKB-per-batch distribution to confirm. + +4. **`rcu_read_lock` across SDIO read.** SDIO read can take multi-ms (multi-block transfers). RCU reader-cs across that is fine (no preemption blocked) but it's a longer reader-cs than typical. Verifiable but not a blocker — kerneldoc requires it. + +5. **wsm.c:2415 (beacon) is a different SKB lifecycle** — `hw_priv->beacon` is owned by hw_priv, not allocated per-call. After `_rx_list` consumes it (by passing ownership to mac80211), `hw_priv->beacon` is dangling. **Phase 6 must verify the beacon path either reallocates after delivery or wasn't actually transferring ownership.** Risk #5 is the biggest open question. + +### §4.7 Phase 5 review handover + +PR on `git.reauktion.de/marfrit/besser` with this artifact. Specifically request reviewer focus on: +- §4.2 wrap-placement choice (Option B vs A) +- §4.3 rx_list scoping (per-vif) +- §4.6 risks #1 (mainline-caller verification) and #5 (beacon path SKB ownership) + +Don't curate. + +### §4.8 Phase 6 implementation order + +1. Branch off cleanups: `bes2600/rx-list-batch-delivery` +2. Add `struct list_head rx_list` to `struct bes2600_vif`, `INIT_LIST_HEAD` in vif setup +3. Convert all 6 call sites: `ieee80211_rx_irqsafe(...)` → `ieee80211_rx_list(...)` +4. Wrap `bes2600_sdio_read_rx_batch` outer loop with `rcu_read_lock + local_bh_disable / local_bh_enable + rcu_read_unlock` +5. For the non-bh-thread call sites (ap.c, sta.c, wsm.c beacon): per-call narrow wrap +6. Verify beacon path in wsm.c:2415 (Risk #5) +7. Build, install, smoke-test +8. Phase 7 N=3 stress ramp — compare to v3 baseline + +### §4.9 Phase 7 protocol (per `feedback_phase7_stress_ramp`) + +- N=3 reps, 30s each at 4 MB/s, fresh-chip (uptime <15 min) +- Use wired path (`ssh mfritsche@192.168.88.80`) for telemetry +- Fresh nc listener per rep (per `feedback_rig_failure_is_finding`) +- Compare: throughput delta + tasklet schedule rate (ftrace `irq:tasklet_*` events) +- If predicted delta met → close C2 + memory entry +- If NO delta → marginal patch but no regression; ship for upstream-cleanliness + +## §5 Out of scope + +- Patch D / E already shipped (PR #7, #8 merged) +- Patch G already shipped (PR #6 merged) +- bh.c `#if 0` graveyard removal (Task #24 hygiene) +- Allwinner `sw_mci_check_r1_ready` (Task #25) + +## §6 Summary + +C2 is a 6-site mechanical migration with ONE design decision (per-batch wrap), TWO open questions for the reviewer (rx_list draining + beacon path SKB ownership), and SMALL expected throughput delta (<15%). Risk-low, upstream-prep-high. Worth shipping for the kernel.org submission story even if the throughput delta is marginal. + +--- + +*Plan written 2026-05-08 by Claude (noether). Phase 5 review on PR. Phase 6 contingent on review passing.*