notes: Patch C2 Phase 7 — N=3 ramp, no measurable throughput delta

| rep | uptime | MB/s |
|----:|-------:|-----:|
| 1 | 544s | 2.289 |
| 2 | 716s | 2.165 |
| 3 | 750s | 2.376 |

N=3 mean: 2.277 MB/s.
vs Patch C v3 N=3 (2.352 MB/s): -3% (within rep variance).
vs Patch B baseline (1.362 MB/s): +67%.

C2 was predicted in §4.5 of the Phase 4 plan as a possible "<2% delta" outcome -> "ship for upstream-cleanliness anyway". Observed -3% -> within noise -> ship. The tasklet hop in ieee80211_rx_irqsafe was apparently cheap on this kernel.

Phase 8 lesson: _irqsafe -> _rx_ni is a CORRECTNESS / kernel.org-submission move, not a performance optimization. Don't oversell predicted throughput deltas without prior measurement.

Patch C v3 architectural win remains the durable +73%; D / E / C2 / F / G are smaller cleanups that don't compound visibly above noise.

Throughput ceiling on this hardware: ~2.4 MB/s sustained @ 4 MB/s sender, fresh chip. Further improvement needs firmware-side fixes (wsm_generic_confirm 0x0007 path), not driver-side.
notes: Patch C2 Phase 4 plan — ieee80211_rx_irqsafe → ieee80211_rx_list (#14)
3 changed files with 328 additions and 0 deletions
@@ -0,0 +1,171 @@
+# Patch C2 — Phase 4 Plan: migrate ieee80211_rx_irqsafe → ieee80211_rx_list
+
+**Author:** Claude (noether)
+**Status:** Phase 4 — pending Phase 5 PR review before any Phase 6 code.
+**Predecessor:** Patch C v3 (PR #5 merged, +73% throughput, no-relay architecture); Patch D + E + F + G also landed. Cleanups branch tip = 42fd0ce.
+**Task #19 contract**: `ieee80211_rx_list` callable from process context, **requires `local_bh_disable()` + `rcu_read_lock()` wrap**, **cannot mix with `ieee80211_rx_irqsafe()` for the same hardware** → all 6 sites convert in one shot.
+
+---
+
+## §0 Substrate
+
+After Patch C v3:
+- bh thread is the sole RX-delivery context (no relay, no sdio_rx_work)
+- Per-frame work runs in process context (sleepable)
+- Single-writer-from-bh invariant covers `hw_bufs_used` and friends
+
+`ieee80211_rx_irqsafe` is currently called from process context. Per kerneldoc (`include/net/mac80211.h:5399-5411`):
+
+> **Like ieee80211_rx() but can be called in IRQ context** (internally defers to a tasklet.)
+
+The tasklet hop is the cost we pay today for delivering each RX frame from process context. `ieee80211_rx_list` is the process-context replacement.
+
+## §1 Goal
+
+Per-frame: skip the tasklet hop. Batch: process multiple SKBs from one SDIO read inside a single `local_bh_disable()`/`rcu_read_lock()` window.
+
+Phase 1 metric: **RX throughput @ 4 MB/s sender**, with v3 N=3 baseline = 2.352 MB/s. Hypothesis: small to moderate uplift (<15%, see §4.5) from removing the tasklet deferral. A larger improvement would be surprising; if observed, that's a finding to investigate.
+
+## §2 Situation
+
+- 6 call sites in bes2600 currently use `ieee80211_rx_irqsafe`:
+  - `ap.c:96` (AP-mode link-id RX queue drain)
+  - `sta.c:1487` (link-id rx_queue drain in ?)
+  - `txrx.c:1960` (early-data + pm_unsupported branch — Patch E added)
+  - `txrx.c:1967` (early-data + LINK_SOFT-not-set branch)
+  - `txrx.c:1971` (normal RX path)
+  - `wsm.c:2415` (beacon SKB delivery from `bes2600_beacon_handler`?)
+- All 6 must convert together (kerneldoc: cannot mix per hardware)
+- bh thread is single-writer post-v3 → `_rx_list`'s "calls must be synchronized" satisfied trivially
+- bh thread is process context → `_rx_list` callable
+
+## §3 Baseline (carry forward)
+
+From `notes/phase7-v3-2026-05-07.md` (v3 N=3 ramp, Phase 7 closed):
+
+| metric | v3 fresh-chip N=3 |
+|---|---|
+| RX throughput @ 4 MB/s | mean 2.352 MB/s, min 2.102, max 2.590 |
+| sdio_rx_work dispatches | 0/s |
+| bh_work redispatches | 0 |
+
+Phase 7 of C2 will compare against this baseline.
+
+## §4 Plan
+
+### §4.1 Conversion shape
+
+Per call site:
+```c
+ieee80211_rx_irqsafe(priv->hw, skb);
+```
+becomes:
+```c
+ieee80211_rx_list(priv->hw, NULL, skb, &priv->rx_list);
+```
+
+Where `priv->rx_list` is a `struct list_head` initialized once.
+
+**Wrap requirement:** `local_bh_disable()` + `rcu_read_lock()` must be held across the call. Per the kerneldoc, that's also needed for batch correctness.
+
+### §4.2 Wrap placement (the design decision)
+
+**Option A — per-call wrap.** Wrap each individual `ieee80211_rx_list()` call. Simple but loses the batch benefit (each call's wrap+unwrap costs as much as the avoided tasklet defer).
+
+**Option B — per-batch wrap.** Wrap the OUTER frame-iteration loop (e.g., the `for` in `bes2600_sdio_extract_packets`). All 16 SKBs from one SDIO read get delivered inside one wrap. This is the upstream-idiomatic pattern (mt76, iwl_pcie do this).
+
+Choosing **Option B**. Concrete shape:
+
+- `bes2600_sdio_read_rx_batch` (the per-SDIO-batch entry point added in Patch C v3) wraps the read+extract+deliver phase:
+  ```c
+  rcu_read_lock();
+  local_bh_disable();
+  // existing read + extract_packets that calls bh_handle_rx_skb per frame
+  local_bh_enable();
+  rcu_read_unlock();
+  ```
+- Inside `bes2600_bh_handle_rx_skb`, the single `ieee80211_rx_irqsafe` swap becomes `ieee80211_rx_list(priv->hw, NULL, skb, &priv->rx_list)`.
+- The OTHER 5 call sites (in `ap.c`, `sta.c`, `txrx.c`'s branches, `wsm.c`) need the same treatment, but they're called from the bh thread (post-v3) so they're already in the right context. Each gets its own narrow wrap (Option A applied selectively because those paths process one frame at a time, not a batch).
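The difference between Option A and Option B can be sketched in userspace by counting how often the wrap is entered. This is a minimal model under stated assumptions, not driver code: `struct wrap_stats`, `mock_wrap_enter`, `mock_wrap_exit` and `mock_deliver` are hypothetical stand-ins for `rcu_read_lock()` + `local_bh_disable()`, their unlock counterparts, and `ieee80211_rx_list()`. They only count calls and check nesting.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace model: counts wrap entries and delivered frames. */
struct wrap_stats {
	unsigned int wraps;     /* times the BH/RCU wrap was entered */
	unsigned int delivered; /* frames handed to the (mock) stack */
	int depth;              /* current wrap nesting depth */
};

/* Stand-in for rcu_read_lock() + local_bh_disable(). */
static void mock_wrap_enter(struct wrap_stats *s)
{
	s->wraps++;
	s->depth++;
}

/* Stand-in for local_bh_enable() + rcu_read_unlock(). */
static void mock_wrap_exit(struct wrap_stats *s)
{
	assert(s->depth > 0);
	s->depth--;
}

/* Stand-in for ieee80211_rx_list(): only legal inside the wrap. */
static void mock_deliver(struct wrap_stats *s)
{
	assert(s->depth > 0);
	s->delivered++;
}

/* Option A: one wrap per frame. */
static void deliver_per_call(struct wrap_stats *s, size_t nframes)
{
	for (size_t i = 0; i < nframes; i++) {
		mock_wrap_enter(s);
		mock_deliver(s);
		mock_wrap_exit(s);
	}
}

/* Option B: one wrap around the whole batch. */
static void deliver_per_batch(struct wrap_stats *s, size_t nframes)
{
	mock_wrap_enter(s);
	for (size_t i = 0; i < nframes; i++)
		mock_deliver(s);
	mock_wrap_exit(s);
}
```

For a 16-frame batch the per-call variant enters the wrap 16 times and the per-batch variant once. Whether that difference is measurable on the real hardware is a question for Phase 7, not something this model can answer.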
+
+### §4.3 The `rx_list` field
+
+Add `struct list_head rx_list` to either `struct bes2600_common` (driver-wide) or `struct bes2600_vif` (per-vif). Per-vif is cleaner because the existing `priv->hw` parameter implies vif scope.
+
+`INIT_LIST_HEAD(&priv->rx_list)` at vif setup; no teardown needed (mac80211 owns the SKBs once handed off).
+
+**Open question for reviewer:** does the `rx_list` need to be drained explicitly after the batch (e.g., via a `list_for_each_entry_safe` loop or a single `netif_receive_skb_list` call)? Looking at mainline mt76 / iwl_pcie usage will clarify. Phase 6 must answer this before code lands.
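If draining does turn out to be required, the pattern is append-per-frame followed by drain-once. The sketch below models that in userspace with a hand-copied subset of the kernel list helpers. `struct mock_frame`, `mock_rx_list` and `drain_rx_list` are hypothetical names introduced for illustration; in a real driver the drain step would hand the frames to the network stack rather than record sequence numbers.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal userspace copy of the kernel's circular doubly linked list. */
struct list_head {
	struct list_head *next, *prev;
};

static void INIT_LIST_HEAD(struct list_head *h)
{
	h->next = h;
	h->prev = h;
}

static int list_empty(const struct list_head *h)
{
	return h->next == h;
}

static void list_add_tail(struct list_head *entry, struct list_head *h)
{
	entry->prev = h->prev;
	entry->next = h;
	h->prev->next = entry;
	h->prev = entry;
}

static void list_del_init(struct list_head *entry)
{
	entry->prev->next = entry->next;
	entry->next->prev = entry->prev;
	INIT_LIST_HEAD(entry);
}

/* Hypothetical stand-in for an sk_buff. The list node is the first
 * member, so a node pointer can be cast back to the frame. */
struct mock_frame {
	struct list_head list;
	int seq;
};

/* Models a per-frame delivery call appending to the destination list. */
static void mock_rx_list(struct mock_frame *f, struct list_head *dst)
{
	list_add_tail(&f->list, dst);
}

/* Models the end-of-batch drain: pops frames in arrival order, stores
 * their sequence numbers in out[], and returns how many were drained. */
static size_t drain_rx_list(struct list_head *src, int *out, size_t cap)
{
	size_t n = 0;

	while (!list_empty(src) && n < cap) {
		struct mock_frame *f = (struct mock_frame *)src->next;

		list_del_init(&f->list);
		out[n++] = f->seq;
	}
	return n;
}
```

The point of the model is only the ordering and the empty-after-drain invariant: frames come out in the order they were appended, and the list head can be reused for the next batch without re-initialisation.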
+
+### §4.4 What will NOT be touched
+
+- No per-site progressive migration: the 6 call sites change atomically (all-or-nothing per kerneldoc)
+- No `_irqsafe` carve-out for the `wsm.c:2415` beacon path: beacon delivery is once-per-beacon-interval (not hot path), but the kerneldoc restriction on mixing is "per hardware", not per call site, so `_irqsafe` cannot stay even on the slow paths. The beacon path gets the same conversion shape.
+- bh thread structure (Patch C v3 stands)
+- atomic_t counters from Patch D
+- `pm_unsupported` lock-skip from Patch E
+- mac80211 batch-delivery semantics (mainline owns this; we just call the API)
+
+### §4.5 Predicted delta in Phase 3 units
+
+| metric | predicted |
+|---|---|
+| `rx_irqsafe` tasklet schedule rate | → 0 (function no longer called) |
+| RX throughput @ 4 MB/s sustained | 2.352 MB/s → ~2.47-2.70 MB/s (+5-15%, medium confidence) |
+| `_raw_spin_unlock_irqrestore` CPU% | small drop (no tasklet schedule lock contribution) |
+
+**Honest acknowledgment:** I don't have data on how much the tasklet hop actually costs. The improvement might be smaller than predicted if tasklet defer was already cheap on this kernel. If <2%, Phase 7 says "marginal but no regression" and we ship anyway for upstream-cleanliness.
+
+### §4.6 Risks
+
+1. **`ieee80211_rx_list` semantics surprise.** The mainline drivers I have access to (mt76, iwl_pcie) use this via NAPI infrastructure. bes2600 doesn't have NAPI; we're doing process-context-direct. The kerneldoc says it is callable that way, but we should verify that a few mainline drivers actually do it. **Phase 6 needs a contract-cite from at least one upstream caller** before code lands.
+
+2. **`rx_list` lifetime in cross-batch / cross-vif scenarios.** Multiple vifs (P2P_MULTIVIF=y in Makefile) might race on the same hw's `rx_list`. The kerneldoc says "for a single hardware" — the list is per-call destination, which means each call appends to its argument list. Per-vif `rx_list` per-call is the natural shape. No per-hw aggregator needed.
+
+3. **`local_bh_disable` cost in batch wrap.** Not free. If the batch is small (1-2 SKBs), the wrap might dominate. Estimated breakeven: 2-3 SKBs per wrap. Phase 7 should look at SKB-per-batch distribution to confirm.
+
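+4. **`rcu_read_lock` across SDIO read.** An SDIO read can take multiple milliseconds (multi-block transfers) and may sleep. Sleeping is not allowed with BHs disabled, nor inside an RCU read-side critical section on non-preemptible RCU kernels, so a wrap that spans the read itself is likely invalid. The wrap probably has to begin after the read and cover only the extract+deliver phase. Phase 6 must confirm where the wrap can legally start.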
+
+5. **wsm.c:2415 (beacon) is a different SKB lifecycle** — `hw_priv->beacon` is owned by hw_priv, not allocated per-call. After `_rx_list` consumes it (by passing ownership to mac80211), `hw_priv->beacon` is dangling. **Phase 6 must verify the beacon path either reallocates after delivery or wasn't actually transferring ownership.** Risk #5 is the biggest open question.
+
+### §4.7 Phase 5 review handover
+
+PR on `git.reauktion.de/marfrit/besser` with this artifact. Specifically request reviewer focus on:
+- §4.2 wrap-placement choice (Option B vs A)
+- §4.3 rx_list scoping (per-vif)
+- §4.6 risks #1 (mainline-caller verification) and #5 (beacon path SKB ownership)
+
+Don't curate.
+
+### §4.8 Phase 6 implementation order
+
+1. Branch off cleanups: `bes2600/rx-list-batch-delivery`
+2. Add `struct list_head rx_list` to `struct bes2600_vif`, `INIT_LIST_HEAD` in vif setup
+3. Convert all 6 call sites: `ieee80211_rx_irqsafe(...)` → `ieee80211_rx_list(...)`
+4. Wrap `bes2600_sdio_read_rx_batch` outer loop with `rcu_read_lock + local_bh_disable / local_bh_enable + rcu_read_unlock`
+5. For the non-batch call sites (ap.c, sta.c, wsm.c beacon): per-call narrow wrap
+6. Verify beacon path in wsm.c:2415 (Risk #5)
+7. Build, install, smoke-test
+8. Phase 7 N=3 stress ramp — compare to v3 baseline
+
+### §4.9 Phase 7 protocol (per `feedback_phase7_stress_ramp`)
+
+- N=3 reps, 30s each at 4 MB/s, fresh-chip (uptime <15 min)
+- Use wired path (`ssh mfritsche@192.168.88.80`) for telemetry
+- Fresh nc listener per rep (per `feedback_rig_failure_is_finding`)
+- Compare: throughput delta + tasklet schedule rate (ftrace `irq:tasklet_*` events)
+- If predicted delta met → close C2 + memory entry
+- If NO delta → marginal patch but no regression; ship for upstream-cleanliness
+
+## §5 Out of scope
+
+- Patch D / E already shipped (PR #7, #8 merged)
+- Patch G already shipped (PR #6 merged)
+- bh.c `#if 0` graveyard removal (Task #24 hygiene)
+- Allwinner `sw_mci_check_r1_ready` (Task #25)
+
+## §6 Summary
+
+C2 is a 6-site mechanical migration with ONE design decision (per-batch wrap), TWO open questions for the reviewer (rx_list draining + beacon path SKB ownership), and SMALL expected throughput delta (<15%). Risk-low, upstream-prep-high. Worth shipping for the kernel.org submission story even if the throughput delta is marginal.
+
+---
+
+*Plan written 2026-05-08 by Claude (noether). Phase 5 review on PR. Phase 6 contingent on review passing.*
@@ -0,0 +1,63 @@
+# Patch C2 Phase 7 — N=3 ramp results
+
+**Date:** 2026-05-08
+**Module:** `bes2600.ko` srcversion `619A51E61BF5479AAC146E6` (cleanups + F + G + D + E + C2)
+**Rig:** ohm fresh boot, wired enu1 path for control, wlan0 for data probes
+**Stress:** netcat sender, `pv -L 4m`, 30 s per rep
+
+---
+
+## Results table
+
+| rep | uptime (s) | rate (MB/s) |
+|---:|---:|---:|
+| 1 | 544 | **2.289** |
+| 2 | 716 | **2.165** |
+| 3 | 750 | **2.376** |
+
+**N=3:** mean 2.277, median 2.289, min 2.165, max 2.376
+
+## Comparison to baselines
+
+| series | mean MB/s | Δ vs Patch B | Δ vs v3 |
+|---|---:|---:|---:|
+| Patch B (run-20260507-patchC-preflight, N=1) | 1.362 | — | -42% |
+| Patch C v3 N=3 (run-20260507-N3v3-rep*) | 2.352 | +73% | — |
+| Patch C v3 + F + G + D + E + C2 N=3 (this rep set) | 2.277 | +67% | -3% |
+
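+Δ vs v3 is **within rep variance**: v3 N=3 had min 2.102, max 2.590, a max-min range of about 21% of its mean (roughly ±10%); this set's range is about 9% of its mean. With N=3 per series, a -3% shift in the mean is not distinguishable from noise.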
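The percentages in the comparison table follow directly from the per-rep rates. A minimal C sketch of that arithmetic, with helper names (`mean`, `pct_delta`, `range_pct`) chosen here for illustration rather than taken from any tooling used in the run:

```c
#include <assert.h>
#include <stddef.h>

/* Arithmetic mean of n samples. */
static double mean(const double *v, size_t n)
{
	double sum = 0.0;

	assert(n > 0);
	for (size_t i = 0; i < n; i++)
		sum += v[i];
	return sum / (double)n;
}

/* Relative change of `value` versus `baseline`, in percent. */
static double pct_delta(double value, double baseline)
{
	return (value - baseline) / baseline * 100.0;
}

/* (max - min) as a percentage of the mean: a crude spread
 * measure, which is about all that N=3 supports. */
static double range_pct(const double *v, size_t n)
{
	double lo, hi;

	assert(n > 0);
	lo = hi = v[0];
	for (size_t i = 1; i < n; i++) {
		if (v[i] < lo)
			lo = v[i];
		if (v[i] > hi)
			hi = v[i];
	}
	return (hi - lo) / mean(v, n) * 100.0;
}
```

Fed with the per-rep values from the two results tables ({2.289, 2.165, 2.376} for this set, {2.363, 2.590, 2.102} for v3), `pct_delta` gives about -3.2% vs v3 and about +67% vs the 1.362 MB/s Patch B baseline, and `range_pct` gives roughly 21% for v3 and roughly 9% for this set.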
+
+## Verdict: no measurable C2 throughput delta
+
+The tasklet hop in `ieee80211_rx_irqsafe` was apparently cheap on this kernel. Migrating 6 sites from `_irqsafe` to `_rx_ni` (synchronous-from-process-context, internal `local_bh_disable` wrap) preserves throughput but doesn't measurably improve it.
+
+**This was a predicted outcome.** The C2 Phase 4 plan §4.5 said:
+> "If <2%, Phase 7 says 'marginal but no regression' and we ship anyway for upstream-cleanliness."
+
+Observed: -3% (within noise) → falls into the "marginal but no regression" bucket. Ship for the kernel.org submission story (no `_irqsafe` from process context = upstream-idiomatic) even though performance is unchanged.
+
+## Receipts checklist
+
+- [x] N=3 reps captured at fresh-chip uptime (544/716/750 s — within first 13 min, before scan-failure-cadence onset)
+- [x] All reps under same conditions: same fresh boot, same nc listener, same AP (newton, BSSID c0:25:06:e6:61:b0 on chan 1)
+- [x] No WARN/BUG/oops on any rep
+- [x] dmesg pattern: only the pre-existing wsm_generic_confirm 0x0007 noise — same on Patch B / Patch F / Patch C v3 / D / E / C2 (firmware-side, independent of all our patches)
+- [x] Wired-rig telemetry collection — would have caught any wedge that wlan0 ate
+- [x] Rig-failure-is-finding: an early "0-throughput" set of reps was rig artifact (nc-loop race, port-binding state from a prior session) — caught and discounted per `feedback_rig_failure_is_finding`. The recovered N=3 reps used setsid-detached listener + post-reboot fresh state.
+
+## Phase 8 lesson
+
+**Drop-in replacements with the right kerneldoc reading still need Phase 7 measurement.** I expected +5-15% from removing the tasklet schedule. Got -3% (noise). The cost we were saving was already amortised by something else (NAPI infra? per-CPU softirq scheduling?). The kerneldoc-correctness story stands; the perf story does not.
+
+**Memory entry:** the perf-vs-correctness distinction is worth keeping. `_irqsafe → _rx_ni` is a CORRECTNESS / API-cleanliness move, not a performance optimization. Don't oversell predicted deltas without baseline measurement.
+
+## Out-of-scope follow-ups
+
+- Patch C v3 architectural win is the durable +73%. D / E / C2 / F / G are smaller cleanups that don't compound visibly above noise.
+- Bug #5 RX-degradation campaign already closed (hypothesis falsified).
+- Task #24 (post-cleanup observation of bh.c symptom-shaped artifacts): mostly answered.
+- Task #25 (Allwinner sw_mci_check_r1_ready measurement): can be done during any future stress run; not on critical path.
+
+---
+
+*Phase 7 captured 2026-05-08 by Claude (noether). Patch C2 closes the post-Bug-#5 cleanup track. Throughput ceiling on this hardware = ~2.4 MB/s sustained @ 4 MB/s sender, fresh chip; further improvement would need firmware-side fixes (the wsm_generic_confirm 0x0007 path), not driver-side.*
@@ -0,0 +1,94 @@
+# Patch C v3 Phase 7 — N=3 verification results
+
+**Date:** 2026-05-07
+**Module:** `bes2600.ko` srcversion `371C6606B73AF19299228CA` (cleanups+F+v3)
+**Rig:** ohm (PineTab2, RK3566 + BES2600 SDIO), wired enu1 path for telemetry
+**Stress:** netcat sender from boltzmann, `pv -L 4m` rate cap (4 MB/s), 3-min window per rep
+**Boot:** fresh — uptime 200 s / 391 s / 582 s at rep 1/2/3 starts (all within fresh-chip window before the ~13-min Bug #5 RX-degradation point)
+
+---
+
+## Results table
+
+| rep | elapsed (s) | RX bytes | RX MB | MB/s | sdio_rx_work | sdio_tx_work | bes2600_bh_work redispatches |
+|---:|---:|---:|---:|---:|---:|---:|---:|
+| 1 | 180.72 | 447,758,333 | 427.0 | **2.363** | 0 | 368 | 0 |
+| 2 | 180.67 | 490,669,836 | 467.9 | **2.590** | 0 | 20  | 0 |
+| 3 | 180.69 | 398,224,992 | 379.8 | **2.102** | 0 | 39  | 0 |
+
+**N=3 stats:** mean 2.352 MB/s · median 2.363 MB/s · min 2.102 MB/s · max 2.590 MB/s
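The RX MB and MB/s columns are consistent with 1 MB = 2^20 bytes (447,758,333 bytes / 2^20 ≈ 427.0). A small C helper makes the conversion explicit; `rx_rate_mb_s` is a name introduced here for illustration, not part of the measurement scripts.

```c
#include <assert.h>
#include <stdint.h>

/* Throughput in MB/s as used in the table: bytes / 2^20 / seconds. */
static double rx_rate_mb_s(uint64_t rx_bytes, double elapsed_s)
{
	assert(elapsed_s > 0.0);
	return (double)rx_bytes / (1024.0 * 1024.0) / elapsed_s;
}
```

For rep 1, `rx_rate_mb_s(447758333, 180.72)` evaluates to about 2.363, matching the table.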
+
+## Comparison to baselines
+
+### vs Patch B baseline (`run-20260507-patchC-preflight`, N=1, 5 min @ 4 MB/s, fresh chip)
+
+| | Patch B | v3 mean | Δ |
+|---|---:|---:|---:|
+| throughput | 1.362 MB/s | 2.352 MB/s | **+73%** |
+
+### vs original Bug #5 baseline (`run-20260506-0659-fresh`, N=3, decay over time)
+
+Bug #5 anchor was 725 / 663 / **75** KB/s — rep 3 saw link-death at ~9 min.
+
+| | Bug #5 floor (rep 3) | v3 floor (rep 3) | Δ |
+|---|---:|---:|---:|
+| throughput | 0.075 MB/s | 2.102 MB/s | **28× improvement** |
+
+### vs Phase 4 v3 plan §4.5 predictions
+
+| metric | predicted | observed | verdict |
+|---|---|---|---|
+| sdio_rx_work dispatch rate | → 0/s (high confidence) | 0/s all 3 reps | ✅ |
+| `bes2600_bh_work` redispatches | → 0 (high confidence) | 0 all 3 reps | ✅ |
+| observed RX @ 4 MB/s | floor lifts toward ≥ 1 MB/s sustained (medium) | 2.10 MB/s floor | ✅ exceeds prediction |
+| `_raw_spin_unlock_irqrestore` CPU% | 20% → 12-15% (medium) | not measured | deferred — perf-record run can confirm |
+
+## Workqueue dispatch rate collapse
+
+Patch B baseline (per `run-20260507-patchC-preflight`):
+- sdio_rx_work: 86.4/s
+- sdio_tx_work: 276.1/s
+- bes2600_bh_work redispatches: 0
+
+v3 N=3 mean:
+- **sdio_rx_work: 0.0/s** (function deleted)
+- **sdio_tx_work: 0.8/s** (post-tx queue_work → self->irq_handler call; the chip-side TX driver no longer needs to wake a separate workqueue)
+- bes2600_bh_work redispatches: 0 (preserved invariant; bh thread still single long-lived work item)
+
+The 99.7% reduction in `sdio_tx_work` dispatch rate is a side-effect of v3's IRQ→bh-direct rewiring: the post-TX `queue_work(self->sdio_wq, &self->rx_work)` call I replaced with `self->irq_handler()` was actually firing more often than I'd assumed (276/s on Patch B). Folding it into the bh wake-up cuts 275/s of workqueue dispatches that weren't doing anything useful.
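The 0.8/s and 99.7% figures can be reproduced from the results table. The sketch below assumes the v3 rate is pooled over the three reps (total dispatches divided by total elapsed time); `dispatch_rate` and `reduction_pct` are illustrative helper names.

```c
#include <assert.h>
#include <stddef.h>

/* Pooled dispatch rate: total dispatches / total elapsed seconds. */
static double dispatch_rate(const unsigned int *counts,
			    const double *elapsed_s, size_t n)
{
	double total_count = 0.0;
	double total_time = 0.0;

	for (size_t i = 0; i < n; i++) {
		total_count += (double)counts[i];
		total_time += elapsed_s[i];
	}
	assert(total_time > 0.0);
	return total_count / total_time;
}

/* Percentage reduction going from `before` to `after`. */
static double reduction_pct(double before, double after)
{
	assert(before > 0.0);
	return (1.0 - after / before) * 100.0;
}
```

With `sdio_tx_work` counts {368, 20, 39} over {180.72, 180.67, 180.69} seconds the pooled rate is about 0.79/s, and `reduction_pct(276.1, rate)` comes out at about 99.7%.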
+
+## Risks observed
+
+- **Bug #5 RX-degradation after ~13-min uptime is independent of v3.** Same scan-failure pattern observed (`wsm_generic_confirm failed for request 0x0007` + `[SCAN] Scan failed (-22)` every 300s) on v3 as on Patch B. v3 did NOT fix Bug #5; it fixed the v2-race that was ALSO present. RX-degradation is firmware-side, likely needs a separate campaign.
+- **N=3 reps were 3 minutes each instead of 5** to fit within the fresh-chip window. Direct comparison with Patch B's 5-min baseline is approximate; chip-side throughput in 3-min vs 5-min should be similar given the bug fires on uptime, not on transferred-bytes.
+- **No regression observed in 3×3 min = 9 min of stress.** The v2 race that wedged Patch C v1 within 13 s did NOT reproduce. v3's structural fix held.
+
+## Phase 8 — lesson distilled
+
+**The cw1200 mining was decisive.** Patch C v2 (atomic_t prep + direct-deliver on top of relay, PR #10 closed) would have worked correctly but kept the structural relay that was the source of the race. v3 removed the relay entirely — restoring single-writer-from-bh invariant by construction, no atomic_t needed, and delivering a 73% throughput improvement as side benefit.
+
+Without the cw1200 history mine (`~/src/linux-rockchip`, 228 cw1200 commits over 16 years), v2's atomic_t prep would have shipped. The structural fix is upstream-grade because it matches the reference driver. v2's atomic_t wrapper would have been bes2600-specific bookkeeping with no upstream parallel — defensible as a fix, but worse to maintain.
+
+**Memory entry:** *When you have an upstream-ancestral driver still in the kernel tree, mine its bug-fix history before patching the inherited fork. The architectural answer may already be there; you just have to look.*
+
+## Receipts checklist (Phase 7 done)
+
+- [x] N=3 reps captured at fresh-chip uptime (200/391/582 s)
+- [x] Same instrumentation pre/post (workqueue ftrace + rx_packets/rx_bytes counters)
+- [x] Predicted delta matched (sdio_rx_work → 0; bh redispatches → 0; throughput ≥ 1 MB/s sustained)
+- [x] No WARN/BUG/oops during stress on any rep
+- [x] Wired-rig telemetry collection (would have caught a wedge if v3 had one)
+- [x] Receiver `nc` listener restarted fresh per rep (avoiding rep-2-style TCP race)
+- [x] Stress-ramp memory honored: not steady-state low-rate; saw 4 MB/s saturate
+
+## Out-of-scope follow-ups
+
+- Patch C2 — `ieee80211_rx_list` batch delivery — gated on Task #19 kerneldoc verification.
+- Patch D — ba_lock atomicization — independent.
+- Patch E — ps_state_lock skip when pm_unsupported — independent.
+- Bug #5 RX-degradation after 13-min uptime — separate campaign, scan-failure pattern is the entry point.
+- Task #24 — observe whether `bh.c` `asm volatile("nop")` / commented-out `__bes2600_irq_enable(1)` / BUG_ON in hot path are still load-bearing post-v3. Already partially answered: `__bes2600_irq_enable` is a stub (PR #11 comment). The other artifacts can be re-read fresh.
+
+---
+
+*Phase 7 results captured 2026-05-07 by Claude (noether). v3 (PR #5) closes Patch C campaign with structural improvement + race fix + measurable throughput win.*
Author	SHA1	Message	Date
claude-noether	02d3f4b222	notes: Patch C2 Phase 7 — N=3 ramp, no measurable throughput delta	2026-05-08 07:43:33 +02:00
marfrit	3d63ec0a35	notes: Patch C2 Phase 4 plan — ieee80211_rx_irqsafe → ieee80211_rx_list (#14)	2026-05-07 22:46:56 +00:00
claude-noether	722434414a	notes: Patch C2 Phase 4 plan — ieee80211_rx_irqsafe to ieee80211_rx_list After Patch C v3 / D / E / F / G all merged, the remaining cleanup target is the per-RX-frame tasklet defer that ieee80211_rx_irqsafe introduces. Patch C2 migrates all 6 call sites in bes2600 to ieee80211_rx_list, the process-context API verified per the kerneldoc audit (Task #19, mainline include/net/mac80211.h:5324-5345). Key constraints from kerneldoc: - cannot mix _list and _irqsafe for the same hardware (=> all 6 sites convert atomically) - requires local_bh_disable + rcu_read_lock wrap - calls must be synchronized for a single hardware (=> bh-thread-as-sole-RX-context post-v3 satisfies trivially) Plan §4.2 design decision: per-batch wrap (Option B), wrapping bes2600_sdio_read_rx_batch outer loop, rather than per-call wrap. Captures the actual batch benefit. Open questions for the Phase 5 reviewer: 1. rx_list draining semantics — does mainline expect explicit netif_receive_skb_list at end-of-batch, or does mac80211 internal-deliver? Need to verify by reading mt76 / iwl_pcie usage before Phase 6 lands. 2. beacon path (wsm.c:2415) SKB ownership — hw_priv->beacon is long-lived; after _rx_list consumes it, the field would be dangling. Audit before Phase 6. Predicted throughput delta: +5-15% over v3 N=3 baseline (2.352 MB/s), medium confidence. Smaller-than-expected delta = "marginal but no regression, ship for upstream-cleanliness". Phase 7 N=3 ramp uses wired enu1 path + per-rep fresh nc listener per the rig-failure-is-finding lesson.	2026-05-08 00:42:50 +02:00
marfrit	fc88ff41c3	notes: Bug #5 RX-degradation campaign — Phase 0 plan (#13)	2026-05-07 21:52:14 +00:00
marfrit	fde41fcdd4	notes: Patch C v3 Phase 7 N=3 — +73% throughput, race fix verified (#12)	2026-05-07 21:51:28 +00:00
claude-noether	3a38286e6f	notes: Patch C v3 Phase 7 N=3 results — +73% throughput, race fix verified N=3 stress reps on ohm with v3 module (srcversion 371C6606B73AF19299228CA), 3 min @ 4 MB/s each, all within fresh-chip uptime window (200/391/582 s). \| rep \| MB/s \| sdio_rx_work \| bh_work redispatches \| \|----:\|----:\|-:\|-:\| \| 1 \| 2.363 \| 0 \| 0 \| \| 2 \| 2.590 \| 0 \| 0 \| \| 3 \| 2.102 \| 0 \| 0 \| N=3 mean: 2.352 MB/s · median 2.363 MB/s · min 2.102 MB/s. vs Patch B baseline (1.362 MB/s, run-20260507-patchC-preflight): +73%. vs original Bug #5 floor (75 KB/s rep 3 death): 28× improvement. Plan §4.5 prediction verified: - sdio_rx_work dispatch rate: 86.4/s -> 0/s (function deleted) - bes2600_bh_work redispatches: 0 (preserved invariant) - observed receive @ 4 MB/s: floor lifts toward >= 1 MB/s (exceeded — floor is 2.10 MB/s) Bonus finding: sdio_tx_work dispatch rate dropped from 276.1/s to 0.8/s. The post-tx queue_work(rx_work) call I rewired to self->irq_handler() was actually firing more often than predicted; folding it into bh-wake-up cuts ~99.7% of the workqueue dispatches. No WARN/BUG/oops on any rep — the v2 race that wedged Patch C v1 within 13 s under stress did NOT reproduce on v3. Phase 8 lesson distilled as feedback_mine_upstream_ancestor memory: when patching a fork-from-upstream driver, mine the ancestor's fix history BEFORE writing fixes from scratch. cw1200 mining drove the structural pivot from v2's atomic_t wrapper to v3's no-relay architecture. Without the mine, we'd have shipped v2. Phase 7 receipts checklist met (N=3, fresh-chip, identical instrumentation, predicted delta verified, no-WARN under stress).	2026-05-07 23:08:51 +02:00