f232476240
Per merged PR #8 inline review: items 1 and 2 split, sequential. Patch C is item-1-only (collapse the sdio_rx_work → rx_queue → bh_work indirection). Patch C2 (ieee80211_rx_list batch delivery) is split out and gated on Task #19 kerneldoc contract verification. Approach choice: Option A (sdio_rx_work delivers directly into wsm_handle_rx, removing rx_queue and its two synchronization points per frame) over Option B (subsume into bh thread). Option A has a smaller diff and clearer bisection story; the residual per-IRQ workqueue dispatch is preserved as a measurable Phase 7 data point that motivates or doesn't motivate a follow-on Option-B patch. Predicted delta in Phase 3 units, with confidence levels stated explicitly. §4.6 lists 6 risks, of which 2 require Phase 6 contract citations (wsm_handle_rx callability from sdio_wq context; wsm_release_tx_buffer's bh-wake invariant). §4.8 mandates a stress ramp in Phase 7, not a steady cap, per feedback_phase7_stress_ramp. Symptom-shaped findings (asm nop, commented-out IRQ re-enable, BUG_ON in hot path) explicitly deferred to Task #24 per feedback_dont_patch_downstream_artifacts. Awaiting Phase 5 second-model review on DokuWiki.
185 lines
13 KiB
Markdown
185 lines
13 KiB
Markdown
# Patch C — Phase 4 Plan: collapse sdio_rx_work into BH
|
||
|
||
**Author:** Claude (noether)
|
||
**Status:** Phase 4 — pending Phase 5 second-model review before any Phase 6 code.
|
||
**Scope:** **item 1 only** (per merged PR #8 inline review: "do it sequentially; we're not on the clock").
|
||
**Item 2** (batch deliver via `ieee80211_rx_list`) splits to **Patch C2**, gated on Task #19 kerneldoc verification.
|
||
|
||
---
|
||
|
||
## §0 Substrate — anchored
|
||
|
||
Bug #5 anchor (recorded 2026-05-07, see `notes/phase1-bug5-2026-05-07.md`):
|
||
|
||
- Sender: netcat-over-WiFi, 4 MB/s cap, 2.4 GHz, AVM AP, single-STA
|
||
- Receiver: ohm (PineTab2, RK3566 + BES2600WM-SDIO)
|
||
- N=3 baseline reps: 725 / 663 / 75 KB/s (rep 3 saw link-death at ~9 min)
|
||
- `perf record -g` during 4MB/s window: `_raw_spin_unlock_irqrestore` ≈ 20% CPU
|
||
- ftrace lock-instrumentation, system-wide: `workqueue_execute_start` ≈ 5,643/sec
|
||
- Driver-side count: `wsm_cmd_send` 13/sec — wsm command path is *not* the dispatch source; the contributor is the per-SDIO-transaction relay.
|
||
|
||
Root cause traced in PR #7 (Sonnet review) and concurred in PR #8 (Opus review): RX path adds two synchronization points per frame and one wait-queue wake-up per IRQ batch via `sdio_rx_work` → `rx_queue` → `bh_work` indirection.
|
||
|
||
## §1 Goal (locked)
|
||
|
||
Reduce per-RX-frame overhead enough that observed receive ≥ 1.0 MB/s sustained @ 4 MB/s sender, with `_raw_spin_unlock_irqrestore` < 15 % CPU during the 4 MB/s window. No 30-min cascade to link-death.
|
||
|
||
(This is a partial step toward Phase 1's full target of ≥ 2 MB/s sustained @ 4 MB/s with < 10 % lock CPU. The full target is jointly addressed by Patch C + Patch C2; Patch C alone should *cross half the gap*.)
|
||
|
||
## §2 Situation
|
||
|
||
- `bes2600.ko` srcversion `1B3B3ED0…` deployed on ohm (c-stack + Patch A + Patch B).
|
||
- `cleanups` branch on `marfrit/bes2600-dkms` is the current source-of-truth.
|
||
- Build sandbox `/var/tmp/c6-sandbox/` on ohm, native `make -j4`.
|
||
- `BES2600_RX_IN_BH` is **defined** in Makefile — `bes2600_bh_rx_helper` is the active RX consumer.
|
||
- ohm reachable. Markus pushes the reboot button; never me.
|
||
- Test rig under `/root/bes2600-samples/` — `rep-trace.sh` per-rep capture script.
|
||
|
||
## §3 Baseline measurements
|
||
|
||
Reused from Bug #5 Phase 0 (above). No re-anchor needed for Patch C — same regime.
|
||
|
||
**Specific Phase-3-units that this plan's predictions reference:**
|
||
|
||
| metric | tool | current value (4MB/s window) |
|
||
|---|---|---|
|
||
| observed receive throughput | netcat receiver byte-count | 75–725 KB/s, rep-variance high |
|
||
| `_raw_spin_unlock_irqrestore` CPU% | perf record / report | ~20% |
|
||
| `workqueue_execute_start`/sec | ftrace `workqueue:workqueue_execute_start` | ~5,643/sec system-wide |
|
||
| `bes_sdio` workqueue dispatches | `cat /sys/kernel/tracing/events/workqueue/.../filter` filtered by `bes_sdio` | not measured pre-patch — **TODO before Phase 6** |
|
||
| RX SKB rate at mac80211 boundary | trace `mac80211:drv_rx_irqsafe` count | not measured pre-patch — **TODO before Phase 6** |
|
||
|
||
Phase 6 must not start until the two TODOs above are filled in — otherwise Phase 7 has no reference point for the predicted-delta comparison.
|
||
|
||
## §4 Plan
|
||
|
||
### §4.1 What will be touched
|
||
|
||
- `bes2600_sdio.c::sdio_rx_work` — the relay loop. After this patch, it still drains the SDIO bus into SKBs but **delivers SKBs directly into `wsm_handle_rx`** instead of `skb_queue_tail`-ing them onto `self->rx_queue`.
|
||
- `bes2600_sdio.c::bes2600_sdio_extract_packets` — the inner per-SKB extractor. Changes the in-loop action from `skb_queue_tail(&self->rx_queue, skb)` to a direct call (or callback) into the wsm dispatcher.
|
||
- `bes2600_sdio.c::bes2600_sdio_pipe_read` — becomes unused, removed.
|
||
- `bh.c::bes2600_bh_rx_helper` — its `BES_SDIO_RX_MULTIPLE_ENABLE` branch is no longer reachable for RX (RX path no longer feeds bh). Either gate the helper, or remove the helper outright if `bh_rx` atomic is no longer raised on RX.
|
||
|
||
### §4.2 What will NOT be touched
|
||
|
||
- `ieee80211_rx_irqsafe()` call sites — that's Patch C2 (item 2).
|
||
- TX path — `sdio_tx_work`, `bes2600_bh_tx_helper`, etc. Untouched.
|
||
- `sdio_wq` workqueue alloc — stays. After patch it hosts only `tx_work` + `scan_work` + (briefly during patch) `rx_work`. Renaming is cosmetic and out of scope.
|
||
- The bh thread itself — still runs, still handles TX, still watches the timeouts.
|
||
- `bh.c` `#if 0` graveyard — separate hygiene patch, not bundled.
|
||
- `__bes2600_irq_enable(1)` commented-out / `asm volatile("nop")` placeholder — **deferred** per `feedback_dont_patch_downstream_artifacts`. These are symptom-shaped; Patch C may dissolve them. Re-evaluate at Task #24 (post-Patch-E observation).
|
||
- `bh_rx` / `bh_tx` atomic split — out of scope.
|
||
|
||
### §4.3 Approach choice — Option A (sdio_rx_work direct delivery)
|
||
|
||
Two structural options surveyed in PR #8 §2.1; recap:
|
||
|
||
| | Option A: direct delivery from sdio_rx_work | Option B: subsume sdio_rx_work into bh thread |
|
||
|---|---|---|
|
||
| diff size | small | medium |
|
||
| eliminates `rx_queue->lock` × 2 per frame | yes | yes |
|
||
| eliminates `sdio_wq.rx_work` workqueue dispatch per IRQ | no | yes |
|
||
| changes who calls `wsm_handle_rx` | sdio_wq context (already process context) | bh thread |
|
||
| TX/RX SDIO bus contention | unchanged (sdio_rx_work and sdio_tx_work already share `bes2600_sdio_lock`) | adds bh ↔ sdio_tx_work contention on the SDIO mutex |
|
||
| bisection isolation | clean: only the rx_queue handoff is removed | mixes "remove handoff" with "subsume thread" |
|
||
|
||
**Choosing Option A.** Reasons:
|
||
1. Smaller diff = clearer Phase-7 attribution. If RX KB/s rises, we know it was the rx_queue handoff, not the workqueue topology.
|
||
2. Per Markus's PR #8 review: split was for bisection clarity. Option A is narrower than Option B.
|
||
3. The remaining cost (per-IRQ `sdio_wq.rx_work` dispatch) is ≤ 1 dispatch per IRQ batch; multi-RX coalescing means several frames per dispatch. If Phase 7 of Patch C shows that dispatch IS the residual cost, that becomes a concrete data point and motivates a *measured* Option-B follow-up, not a speculative one.
|
||
|
||
### §4.4 Implementation sketch (preview — actual code in Phase 6)
|
||
|
||
**Today** (`bes2600_sdio.c:783–831`):
|
||
```c
|
||
static int bes2600_sdio_extract_packets(...) {
|
||
for each packet:
|
||
skb = dev_alloc_skb(...);
|
||
memcpy(skb->data, &data[pos], packet_len);
|
||
spin_lock(&self->rx_queue_lock);
|
||
skb_queue_tail(&self->rx_queue, skb); // ← handoff
|
||
spin_unlock(&self->rx_queue_lock);
|
||
}
|
||
static void sdio_rx_work(...) {
|
||
bes2600_sdio_extract_packets(...);
|
||
self->irq_handler(self->irq_priv); // ← wakes bh_wq
|
||
}
|
||
// bh thread later: pipe_read = skb_dequeue(rx_queue) → wsm_handle_rx(skb)
|
||
```
|
||
|
||
**After patch** (sketch):
|
||
```c
|
||
static int bes2600_sdio_extract_packets(struct sbus_priv *self, u32 ctrl_reg, u8 *data) {
|
||
for each packet:
|
||
skb = dev_alloc_skb(...);
|
||
memcpy(skb->data, &data[pos], packet_len);
|
||
ret = wsm_handle_rx(self->core, wsm_id_from(skb), wsm_hdr_of(skb), &skb);
|
||
if (skb) dev_kfree_skb(skb);
|
||
// no rx_queue, no spinlock, no wake-up
|
||
}
|
||
static void sdio_rx_work(...) {
|
||
bes2600_sdio_extract_packets(...);
|
||
// self->irq_handler(...) is no longer called for RX-only wakes
|
||
// (it remains called for TX-confirm-completion paths, if any)
|
||
}
|
||
```
|
||
|
||
Caveats discovered during sketch:
|
||
- `wsm_handle_rx`'s signature wants `(hw_priv, id, wsm_hdr*, **skb)`. `extract_packets` doesn't currently parse the wsm header — we either parse it inline (cheap; the cost is one `__le16_to_cpu`) or defer parsing into a new `bes2600_sdio_deliver_rx(skb)` helper that wraps it.
|
||
- `hw_priv` is reachable as `self->core`.
|
||
- Need to verify `wsm_handle_rx` is callable from sdio_wq context. **Hypothesis:** yes, because today's bh thread is also process-context-via-workqueue and that's where wsm_handle_rx already runs. Phase 6 contract-cite from `wsm.h` / call-graph confirms.
|
||
- The `irq_handler(self->irq_priv)` wakeup at sdio_rx_work:902 — keep it, but confirm whether bh actually has remaining work after RX is gone. Possibilities: TX-confirm completions (`wsm_release_tx_buffer`) still need a bh wake. Verify in Phase 6.
|
||
|
||
### §4.5 Predicted delta (Phase 3 units)
|
||
|
||
Conservative because Patch C is item 1 only, not items 1+2.
|
||
|
||
| metric | predicted change | confidence |
|
||
|---|---|---|
|
||
| `rx_queue->lock` acquire/release rate | → 0 (lock is removed entirely; struct field deleted) | high |
|
||
| RX-path wait-queue wakes (`bh_wq` from sdio_rx_work for RX) | → 0 (TX-confirm wakes remain) | high |
|
||
| `_raw_spin_unlock_irqrestore` CPU% | 20 % → 12–15 % | **medium** — the rx_queue lock is one of several contributors; I don't have per-lock breakdown pre-patch |
|
||
| `workqueue_execute_start`/sec | marginal change (≤ 5 %) | high — sdio_wq dispatch still happens per IRQ |
|
||
| observed receive @ 4 MB/s | floor lifts from 75 KB/s → ≥ 1.0 MB/s; rep-variance shrinks | **medium** — rep 3's link death has multiple causes (decrypt-storm path is Patch A's territory; AP-side `aid 30` rejection is also possible) |
|
||
| Phase 7 N=3 outcome | all reps ≥ 1 MB/s sustained for 30 min @ 4 MB/s | **medium** |
|
||
|
||
**Honest acknowledgement:** the medium-confidence predictions are the ones where Phase 7 either confirms the model or surfaces a new bug. If `_raw_spin_unlock_irqrestore` only drops to 18 %, the next-largest contributor was something else — `pool->lock` (workqueue infrastructure) or `ba_lock` — and Patch D/E/C2 become the answer.
|
||
|
||
### §4.6 Risks
|
||
|
||
1. **`wsm_handle_rx` not callable from sdio_wq**: low probability (process context, same shape as today's bh), but a cite-failure here means revert to Option B. **Phase 6 must produce a `wsm.h` contract citation** before code lands.
|
||
2. **TX-confirm wake-ups stop firing**: if `wsm_handle_rx` was the only thing that ultimately bumped `bh_tx`, removing it from bh's input causes TX-confirm starvation. Mitigation: keep `irq_handler(irq_priv)` call in sdio_rx_work for now; let the bh's wait_event re-evaluate `bh_tx` on every wake. **Verify in Phase 6 that `wsm_release_tx_buffer` still wakes bh.**
|
||
3. **SKB allocation under memory pressure**: `dev_alloc_skb` in extract_packets currently `msleep(100)` retries up to 10×. Calling `wsm_handle_rx` directly from extract_packets keeps us in sdio_wq context during sleep; that's the same as today, so no new risk.
|
||
4. **rcu / locking invariants in `wsm_handle_rx`**: it traverses `priv->vif_list`, may grab `priv->vif_lock`. Currently called from bh thread. After patch: called from sdio_wq context. Both are process context, both can sleep. No new risk *unless* there's a held lock at sdio_wq level that wsm_handle_rx tries to re-acquire. **Phase 6 lock-graph audit required.**
|
||
5. **`bes2600_chrdev_is_bus_error()` early-return**: currently checked in `pipe_read`. After patch, must move into `extract_packets` or `sdio_rx_work` so RX during a bus-error window still gets dropped, not passed to mac80211.
|
||
6. **Multi-vif RX serialization**: the `rx_queue` is per-sbus_priv, not per-vif. After patch, multi-vif demux happens inside `wsm_handle_rx` (same as today). No new risk; same ceiling.
|
||
|
||
### §4.7 Phase 5 review handover
|
||
|
||
Goal/Situation/Measurements/Plan paste verbatim into DokuWiki when Markus initiates handover. **Do not curate** the plan for the reviewer — including the "medium-confidence" predictions and the §4.6 risk list verbatim. Reviewer should see the same uncertainty I have.
|
||
|
||
### §4.8 Phase 7 protocol (after Phase 6 lands)
|
||
|
||
Per `feedback_phase7_stress_ramp.md` — **stress ramp, not steady cap**:
|
||
|
||
1. Pre-patch baseline (re-anchor): 5 min @ 1 MB/s, 10 min @ 2 MB/s, 30 min @ 4 MB/s. Capture ftrace `workqueue/`, `lock/`, `mac80211/`, `mmc/`. perf record during the 4 MB/s window.
|
||
2. Apply Patch C, install, reboot (Markus pushes).
|
||
3. Post-patch: identical ramp, identical instrumentation.
|
||
4. Compute deltas in **the same units** as §3 baseline. Compare to §4.5 predictions. Any unexplained delta is a finding, not a footnote — log it and loop back to Phase 4 if the model is wrong.
|
||
5. **N=3 reps** post-patch. The user's stress-ramp memory and the receipts checklist both require this.
|
||
6. Capture `sdio_work_debug` output and `dmesg` if any storm fires (Patch A's counter should hold steady).
|
||
7. If Phase 7 numbers match prediction → Phase 8 memory update + proceed to Patch C2.
|
||
8. If they don't match → loop back to Phase 4. Don't paper-fix.
|
||
|
||
## §5 Out-of-scope items recorded for follow-on patches
|
||
|
||
- **Patch C2**: items 2 — `ieee80211_rx_list` batch delivery. Gated on Task #19 kerneldoc verification.
|
||
- **Patch D**: ba_lock atomicization at `txrx.c:998-1005, 1632`. Independent.
|
||
- **Patch E**: ps_state_lock skip when `pm_unsupported = true` at `txrx.c:1942-1948`. Independent, gated on c7 latch.
|
||
- **Task #24**: post-Patch-E observation of bh.c `asm volatile("nop")`, commented-out `__bes2600_irq_enable(1)`, BUG_ON in steady-state hot path. Symptom-shaped; observe before patching.
|
||
- **Task #25**: measure `sw_mci_check_r1_ready` on RK3566 during testing.
|
||
|
||
---
|
||
|
||
*Plan written 2026-05-07 by Claude (noether). Awaiting Phase 5 second-model review on DokuWiki, initiated by Markus.*
|