notes: Patch C v2 Phase 4 plan — atomic_t prep + direct-deliver (re-after-failure) #10

Merged
marfrit merged 1 commits from claude-noether-8 into main 2026-05-07 18:56:12 +00:00
Collaborator

Replan after Patch C Phase 7 failed at boot -1 20:18:10 with a wsm_release_tx_buffer thread-safety race. PR #3 on bes2600-dkms closed with full verdict comment.

What's different from Patch C plan

  • §2 shared-state delta table — every field bes2600_bh_handle_rx_skb mutates directly or transitively, with current protection and required action. 3 fields need atomic_t (hw_bufs_used, hw_bufs_used_vif[], wsm_tx_pending[]); the rest are already concurrency-safe. This is the table that should have been in the original Patch C plan and wasn't.
  • §3 read-site survey (~58 atomic_read swaps).
  • Two-step landing: Patch C-prep (NFC atomic_t refactor) lands first, must show zero Phase 7 delta from baseline; Patch C v2 (direct-deliver) lands on top. Clean bisection: if v2 regresses, we know it's the structural move not the atomic ordering.
  • §6 6 risks listed including memory-ordering choices (atomic_fetch_sub_release vs _acquire paired reads) and the inc/dec_pending_count timer-decision race.
  • §9 Phase 7 v2 protocol uses ohm's new wired interface (enu1 at 192.168.88.80) so we can capture dmesg / ftrace / journalctl from a wedged ohm without forcing reboot — the data we lost in Patch C boot -1 because wlan0 was the only path.

Decision points for review

  • §4 atomic ordering: I chose atomic_fetch_sub_release for the decrement. Reviewer should sanity-check whether the corresponding reads need _acquire to pair correctly, especially in the bh main loop's idle predicate.
  • §6 risk #2: wsm_tx_pending[] decrement-then-timer-decide is racy under atomic_t alone. Plan flags it; doesn't fix it yet. Either a separate lock or moving inc/dec to single context is needed. Specifically request reviewer judgment on this.
  • The two-step landing requires Phase 7 of C-prep to be a true no-op delta. If we observe any throughput change on C-prep alone, the atomic ordering is wrong and we loop here, not to C v2.

Rig status

  • Rollback module on ohm: /lib/modules/.../extra/bes2600.ko = Patch B (1B3B3ED0…). Currently RUNNING is still Patch C broken (84239431…) until Markus reboots.
  • Patch C broken module preserved at /var/tmp/bes2600.patchC-broken.ko (forensics).
  • enu1 (USB-Ethernet) at 192.168.88.80 survives bes2600 wedges; lmcp ohm host still goes through wlan0, so I'll use ssh mfritsche@192.168.88.80 directly via Bash for any Phase 7 work.

What's NOT in this PR

  • No code yet. Phase 6 is contingent on Phase 5 passing.
  • No Patch C v2 code yet.
  • No re-attempt of the Phase 7 N=3 stress ramp until C-prep + C v2 both have been written and reviewed.
Replan after Patch C Phase 7 failed at boot -1 20:18:10 with a `wsm_release_tx_buffer` thread-safety race. PR #3 on bes2600-dkms closed with full verdict comment. ## What's different from Patch C plan - §2 **shared-state delta table** — every field `bes2600_bh_handle_rx_skb` mutates directly or transitively, with current protection and required action. 3 fields need atomic_t (`hw_bufs_used`, `hw_bufs_used_vif[]`, `wsm_tx_pending[]`); the rest are already concurrency-safe. This is the table that should have been in the original Patch C plan and wasn't. - §3 read-site survey (~58 atomic_read swaps). - Two-step landing: **Patch C-prep** (NFC atomic_t refactor) lands first, must show zero Phase 7 delta from baseline; **Patch C v2** (direct-deliver) lands on top. Clean bisection: if v2 regresses, we know it's the structural move not the atomic ordering. - §6 **6 risks listed** including memory-ordering choices (`atomic_fetch_sub_release` vs `_acquire` paired reads) and the `inc/dec_pending_count` timer-decision race. - §9 Phase 7 v2 protocol uses ohm's new wired interface (`enu1` at `192.168.88.80`) so we can capture dmesg / ftrace / journalctl from a wedged ohm without forcing reboot — the data we lost in Patch C boot -1 because wlan0 was the only path. ## Decision points for review - §4 atomic ordering: I chose `atomic_fetch_sub_release` for the decrement. Reviewer should sanity-check whether the corresponding reads need `_acquire` to pair correctly, especially in the bh main loop's idle predicate. - §6 risk #2: `wsm_tx_pending[]` decrement-then-timer-decide is racy under atomic_t alone. Plan flags it; doesn't fix it yet. Either a separate lock or moving inc/dec to single context is needed. **Specifically request reviewer judgment on this.** - The two-step landing requires Phase 7 of C-prep to be a true no-op delta. If we observe any throughput change on C-prep alone, the atomic ordering is wrong and we loop here, not to C v2. ## Rig status - Rollback module on ohm: `/lib/modules/.../extra/bes2600.ko` = Patch B (`1B3B3ED0…`). Currently RUNNING is still Patch C broken (`84239431…`) until Markus reboots. - Patch C broken module preserved at `/var/tmp/bes2600.patchC-broken.ko` (forensics). - `enu1` (USB-Ethernet) at `192.168.88.80` survives bes2600 wedges; lmcp `ohm` host still goes through wlan0, so I'll use `ssh mfritsche@192.168.88.80` directly via Bash for any Phase 7 work. ## What's NOT in this PR - No code yet. Phase 6 is contingent on Phase 5 passing. - No Patch C v2 code yet. - No re-attempt of the Phase 7 N=3 stress ramp until C-prep + C v2 both have been written and reviewed.
claude-noether added 1 commit 2026-05-07 18:51:06 +00:00
Phase 7 of Patch C (PR #9 → bes2600-dkms PR #3 → boot -1 of ohm
20:18:10) failed with a thread-safety race: wsm_release_tx_buffer's
unlocked R-M-W on hw_bufs_used races against wsm_alloc_tx_buffer in
the bh thread when Patch C moved the RX-confirm decrement into
sdio_rx_work.  WARN storm at +13s under stress, chip wedges, host
off-network.

Phase 6 contract analysis cited wsm_handle_rx's sleepability and
held-lock invariants but stopped at the function signature.  Did not
enumerate hw_bufs_used as shared state mutated by the callee.  Lesson
saved as feedback_phase6_contract_threadsafety memory.

Phase 4 v2 designs around that gap.  Two-step:

1. Patch C-prep: NFC refactor — convert hw_bufs_used,
   hw_bufs_used_vif[], wsm_tx_pending[] from int / int[] to atomic_t /
   atomic_t[].  Use atomic_fetch_sub_release in wsm_release_tx_buffer
   (returns prior value for the >= numInpChBufs - 1 predicate).
   Mechanical atomic_read swap at ~58 read sites.  Lands first;
   Phase 7 should show zero delta from baseline.

2. Patch C v2: re-apply the sdio_rx_work direct-deliver on top of
   C-prep.  Identical structural change to the closed PR #3, but now
   the racing counter is safe.  Contract block in
   bes2600_bh_handle_rx_skb expanded to include the shared-state
   delta table.

Plan §2 is the shared-state delta table — every field
bes2600_bh_handle_rx_skb mutates directly or transitively, with
current protection and required action.  3 fields need atomic_t,
the rest are already concurrency-safe or stay single-writer.

Plan §6 lists 6 risks including memory-ordering choices, the
inc/dec_pending_count timer-decision race, and the new wired-rig
fallback (enu1 192.168.88.80) that survives bes2600 wedges so Phase 7
can capture dmesg / ftrace from a wedged ohm without reboot.

PR superseded #3 closed with full verdict comment.  Phase B rolled
back on ohm at /lib/modules/.../extra/bes2600.ko.  Markus's reboot
button to land Patch B again before C-prep work begins.
marfrit merged commit 993117a108 into main 2026-05-07 18:56:12 +00:00
Sign in to join this conversation.
No Reviewers
No Label
1 Participants
Notifications
Due Date
No due date set.
Dependencies

No dependencies set.

Reference: marfrit/besser#10