bes2600: take pending_record_lock with _bh() — fix SOFTIRQ-safe → -unsafe inversion (closes besser#18) #11

Merged
marfrit merged 3 commits from bes2600/queue-pending-record-lock-bh-fix into cleanups 2026-05-18 19:18:08 +00:00
Owner

Phase 5 review request — besser#18 lockdep fix

Splat being fixed (full text in besser#18)

WARNING: SOFTIRQ-safe -> SOFTIRQ-unsafe lock order detected
kworker/u16:1 is trying to acquire:
  &hw_priv->tx_loop.pending_record_lock at bes2600_queue_clear+0x80
and this task is already holding:
  &queue->lock at bes2600_queue_clear+0x60

which would create a new lock dependency:
  (&queue->lock){+.-.} -> (&hw_priv->tx_loop.pending_record_lock){+.+.}

but this new dependency connects a SOFTIRQ-irq-safe lock:
  (&queue->lock){+.-.}
... which became SOFTIRQ-irq-safe at:
  bes2600_tx -> ieee80211_handle_wake_tx_queue -> tasklet_action
to a SOFTIRQ-irq-unsafe lock:
  (&hw_priv->tx_loop.pending_record_lock){+.+.}
... which became SOFTIRQ-irq-unsafe at:
  bes2600_queue_get_skb -> bes2600_join_work -> process_one_work

Locking topology (Phase 2)

  • queue->lock is taken with spin_lock_bh() at 22 sites in queue.c. Always SOFTIRQ-safe.

  • pending_record_lock is taken at 5 sites in the driver:

    site context was now
    queue.c:289 nested inside outer queue->lock_bh held at line 285 spin_lock (BH already disabled by outer) unchanged — outer _bh makes it safe
    queue.c:295 release for above spin_unlock unchanged
    queue.c:832 process context (bes2600_queue_get_skbbes2600_join_work workqueue), no outer queue->lock_bh spin_lock spin_lock_bh
    queue.c:839 early release for above spin_unlock spin_unlock_bh
    queue.c:844 normal release for above spin_unlock spin_unlock_bh
    tx_loop.c:112 TX-loop context, no outer queue->lock_bh spin_lock spin_lock_bh
    tx_loop.c:114 release spin_unlock spin_unlock_bh

    Result: lock is now consistently SOFTIRQ-safe at every acquisition site. No inversion class possible.

Why the conservative fix

  • The lock is small and held for short bounded windows in all paths (list-iterate-and-remove operations).
  • Disabling BH at the outer-most acquisitions has negligible cost vs the inversion-risk window.
  • The deeper alternative — restructure bes2600_join_work and bes2600_tx_loop_item_pending_check to not need this lock in process context — is much more invasive and not required by the splat.

Contract

  • Documentation/locking/locktypes.rst: spin_lock_bh() is the canonical way to make a non-IRQ spinlock safe against softirq preemption that might re-enter the same lock.
  • Same shape as queue->lock in this driver and as is_drv->lock in the cw1200 ancestor.

Phase 3 baseline (skipped in-session; explicit per CLAUDE.md)

In-session N=3 reproduction was skipped because it requires a PROVE_LOCKING-enabled kernel build, which depends on the pkgrel=4 migration pipeline being operational first (chicken-and-egg). Used the splat in besser#18 issue body as the reference baseline — fired at ~5140s uptime under regular wifi activity, no specific trigger needed.

Phase 7 verification plan

  • Build the migrated linux-pinetab2-danctnix-besser kernel with CONFIG_PROVE_LOCKING=y, CONFIG_LOCKDEP=y, CONFIG_DEBUG_LOCK_ALLOC=y in config (one-off debug flavor).
  • Install on ohm.
  • Uptime >= 5140s (~1.4h) with normal wifi activity; assert no SOFTIRQ-safe -> SOFTIRQ-unsafe lock order detected for pending_record_lockqueue->lock.

Predicted Phase 7 delta

Lockdep splat: 1+/5140s uptime → 0/equivalent-or-greater uptime.

Closes: besser#18

Companion deployment

  • This fix will ship as linux-pinetab2-danctnix-besser pkgrel=5 via marfrit/marfrit-packages once pkgrel=4 (the kernel-agent migration) lands. Patch will be added as patches/driver/bes2600/queue-pending-record-lock-bh-danctnix/ in kernel-agent.
## Phase 5 review request — besser#18 lockdep fix ## Splat being fixed (full text in besser#18) ``` WARNING: SOFTIRQ-safe -> SOFTIRQ-unsafe lock order detected kworker/u16:1 is trying to acquire: &hw_priv->tx_loop.pending_record_lock at bes2600_queue_clear+0x80 and this task is already holding: &queue->lock at bes2600_queue_clear+0x60 which would create a new lock dependency: (&queue->lock){+.-.} -> (&hw_priv->tx_loop.pending_record_lock){+.+.} but this new dependency connects a SOFTIRQ-irq-safe lock: (&queue->lock){+.-.} ... which became SOFTIRQ-irq-safe at: bes2600_tx -> ieee80211_handle_wake_tx_queue -> tasklet_action to a SOFTIRQ-irq-unsafe lock: (&hw_priv->tx_loop.pending_record_lock){+.+.} ... which became SOFTIRQ-irq-unsafe at: bes2600_queue_get_skb -> bes2600_join_work -> process_one_work ``` ## Locking topology (Phase 2) - `queue->lock` is taken with `spin_lock_bh()` at 22 sites in `queue.c`. Always SOFTIRQ-safe. - `pending_record_lock` is taken at 5 sites in the driver: | site | context | was | now | |---|---|---|---| | `queue.c:289` | nested inside outer `queue->lock_bh` held at line 285 | `spin_lock` (BH already disabled by outer) | unchanged — outer `_bh` makes it safe | | `queue.c:295` | release for above | `spin_unlock` | unchanged | | `queue.c:832` | process context (`bes2600_queue_get_skb` ← `bes2600_join_work` workqueue), no outer `queue->lock_bh` | `spin_lock` | **`spin_lock_bh`** | | `queue.c:839` | early release for above | `spin_unlock` | **`spin_unlock_bh`** | | `queue.c:844` | normal release for above | `spin_unlock` | **`spin_unlock_bh`** | | `tx_loop.c:112` | TX-loop context, no outer `queue->lock_bh` | `spin_lock` | **`spin_lock_bh`** | | `tx_loop.c:114` | release | `spin_unlock` | **`spin_unlock_bh`** | Result: lock is now consistently SOFTIRQ-safe at every acquisition site. No inversion class possible. ## Why the conservative fix - The lock is small and held for short bounded windows in all paths (list-iterate-and-remove operations). - Disabling BH at the outer-most acquisitions has negligible cost vs the inversion-risk window. - The deeper alternative — restructure `bes2600_join_work` and `bes2600_tx_loop_item_pending_check` to not need this lock in process context — is much more invasive and not required by the splat. ## Contract - `Documentation/locking/locktypes.rst`: `spin_lock_bh()` is the canonical way to make a non-IRQ spinlock safe against softirq preemption that might re-enter the same lock. - Same shape as `queue->lock` in this driver and as `is_drv->lock` in the cw1200 ancestor. ## Phase 3 baseline (skipped in-session; explicit per CLAUDE.md) In-session N=3 reproduction was skipped because it requires a PROVE_LOCKING-enabled kernel build, which depends on the pkgrel=4 migration pipeline being operational first (chicken-and-egg). Used the splat in besser#18 issue body as the reference baseline — fired at ~5140s uptime under regular wifi activity, no specific trigger needed. ## Phase 7 verification plan - Build the migrated `linux-pinetab2-danctnix-besser` kernel with `CONFIG_PROVE_LOCKING=y`, `CONFIG_LOCKDEP=y`, `CONFIG_DEBUG_LOCK_ALLOC=y` in `config` (one-off debug flavor). - Install on ohm. - Uptime >= 5140s (~1.4h) with normal wifi activity; assert no `SOFTIRQ-safe -> SOFTIRQ-unsafe lock order detected` for `pending_record_lock` ↔ `queue->lock`. ## Predicted Phase 7 delta Lockdep splat: 1+/5140s uptime → 0/equivalent-or-greater uptime. Closes: besser#18 ## Companion deployment - This fix will ship as `linux-pinetab2-danctnix-besser` pkgrel=5 via `marfrit/marfrit-packages` once pkgrel=4 (the kernel-agent migration) lands. Patch will be added as `patches/driver/bes2600/queue-pending-record-lock-bh-danctnix/` in kernel-agent.
marfrit added 3 commits 2026-05-18 14:59:19 +00:00
The BES2600 firmware refuses WSM start-scan for 5 GHz with status 2
("rejected by policy").  This shows up in dmesg as the recurring

    wsm_generic_confirm failed for request 0x0007.
    [SCAN] Scan failed (-22).

pattern (besser issue #1, ~14-16/h on ohm/PineTab2 baseline).

Trace shows every reject is the second of a back-to-back pair: mac80211
splits multi-band hw_scan requests per band when the driver does not
set IEEE80211_HW_SINGLE_SCAN_ON_ALL_BANDS (we don't), then re-invokes
drv_hw_scan from __ieee80211_scan_completed for each subsequent band.
The 2.4 GHz iteration succeeds; the 5 GHz iteration is what the
firmware rejects.  See ieee80211_prep_hw_scan in net/mac80211/scan.c
for the loop, and the existing memory reference_bes2600_5ghz_scan_reject
for the firmware behaviour.

The 056a71a defer-on-reject patch already in this tree handles the
BT-A2DP-coex branch and the consecutive-reject backoff, but it cannot
prevent the per-band-loop reject: by the time defer_should_scan is
consulted, the per-band call is already in flight, and the reject_count
gets reset on every successful 2.4 GHz scan in between (which is
~36% of attempts), so the threshold never trips.

The fix: refuse the 5 GHz iteration upfront in bes2600_hw_scan.  The
2.4 GHz scan still runs normally.  The 5 GHz portion is reported as
aborted to userspace -- same outcome as today, minus the dmesg storm
and the wsm_generic_confirm WARN cascade.

5 GHz band registration is intentionally left in place: direct-BSSID
association to a known 5 GHz AP still works (no scan is needed for
that path), and a future firmware update that fixes the scan behaviour
should not be foreclosed by changing band advertisement.

Contract: per include/net/mac80211.h ieee80211_ops.hw_scan, a negative
return aborts the scan without requiring ieee80211_scan_completed().
-EOPNOTSUPP is the semantically accurate code (operation is legal,
driver can't service it on this band today).

Phase 3 evidence:
- baseline N=3: rate ~14.3-23.6/h converged at 14.3/h (matches OP)
- back-to-back scan gap: 6/6 rejected pairs <200us, 1/1 successful
  pair was 114ms (single-band-only, no 5 GHz leg)
- defer log fires: 0/9 in 30-min window (056a71a structurally bypassed)

Predicted Phase 7 delta: Pattern A 14/h -> 0/h.
The original Patch I refused EVERY 5 GHz scan request unconditionally
(req->n_channels > 0 && band == NL80211_BAND_5GHZ).  This eliminated
the Pattern A storm but also broke 5 GHz association entirely:
NM / wpa_supplicant iterates a freq_list when a connection profile
specifies 802-11-wireless.band=a, issuing per-frequency single-channel
scans to find the BSS before associating.  Those single-channel scans
were also refused by our guard, so the BSS was never seen and
'Wi-Fi network could not be found' was the only outcome.

Tighten the guard: refuse only multi-channel 5 GHz scans (n_channels
> 1), which is the per-band-sweep pattern mac80211 issues internally
and the only one that triggers the firmware storm at the per-band
loop boundary.  Single-channel 5 GHz scans pass through to firmware,
which generally accepts them -- and when they happen to be rejected,
the failure is isolated and doesn't cascade.

Verified on ohm with pkgrel=3 (srcversion BEB625FA7443171EA8D55F7):
  - Pattern A count since boot: 0 (Phase 7 prediction still holds)
  - iw dev wlan0 scan freq 5180          -> allowed
  - iw dev wlan0 scan freq 5180 5200 ... -> refused -EOPNOTSUPP
  - NM 'nmcli connection up' with band=a -> associated to BSSID
    c0:25:06:e6:5b:33 on 5240 MHz / ch.48 in ~1 second
  - TX bitrate 150 Mbit/s MCS 7 40MHz short-GI (vs 72.2 Mbit/s
    HT20 on 2.4 GHz) -- ~2x throughput recovered

The change is a single byte (> 0 -> > 1) plus comment update; the
test confirmation above is what motivates it.

Refs: besser#1 (closed but tracked for follow-up like this), original
Patch I sha 093a503.
PROVE_LOCKING reports:

  WARNING: SOFTIRQ-safe -> SOFTIRQ-unsafe lock order detected
  kworker/u16:1 is trying to acquire:
    &hw_priv->tx_loop.pending_record_lock at bes2600_queue_clear+0x80
  and this task is already holding:
    &queue->lock at bes2600_queue_clear+0x60

  which would create a new lock dependency:
    (&queue->lock){+.-.}   -> (&hw_priv->tx_loop.pending_record_lock){+.+.}

  but this new dependency connects a SOFTIRQ-irq-safe lock:
    (&queue->lock){+.-.}
  ... which became SOFTIRQ-irq-safe at:
    bes2600_tx -> ieee80211_handle_wake_tx_queue -> tasklet_action
  to a SOFTIRQ-irq-unsafe lock:
    (&hw_priv->tx_loop.pending_record_lock){+.+.}
  ... which became SOFTIRQ-irq-unsafe at:
    bes2600_queue_get_skb -> bes2600_join_work -> process_one_work

queue->lock is taken consistently with spin_lock_bh() at 22 sites;
the nested acquisition of pending_record_lock at queue.c:289 (inside
the outer queue->lock_bh held at line 285) had it implicitly BH-safe
via the outer scope. But pending_record_lock is ALSO taken from
non-BH-disabled contexts:

  bes2600_queue_get_skb  (queue.c:832)  — process context via
    bes2600_join_work (workqueue), no outer queue->lock held
  bes2600_tx_loop_item_pending_check (tx_loop.c:112)
                                     — TX-loop context, no outer
                                     queue->lock held

When CPU0 holds pending_record_lock from one of those non-BH paths
and a softirq fires that wants queue->lock, and CPU1 in softirq has
queue->lock and is about to acquire pending_record_lock — classic AB-BA
SOFTIRQ deadlock.

The fix is the conservative one: take pending_record_lock with _bh()
at every site that's not already inside a queue->lock_bh-held scope.
That makes the lock consistently SOFTIRQ-safe, eliminating the
inversion. queue.c:289/295 stays as plain spin_lock because BH is
already disabled by the outer queue->lock_bh acquired at queue.c:285.

Five sites converted:
  bes2600/queue.c:832 -- spin_lock      -> spin_lock_bh
  bes2600/queue.c:839 -- spin_unlock    -> spin_unlock_bh
  bes2600/queue.c:844 -- spin_unlock    -> spin_unlock_bh
  bes2600/tx_loop.c:112 -- spin_lock    -> spin_lock_bh
  bes2600/tx_loop.c:114 -- spin_unlock  -> spin_unlock_bh

Contract:
  - Documentation/locking/locktypes.rst spelling: spin_lock_bh() is
    the canonical way to make a non-IRQ spinlock safe against
    softirq preemption that might re-enter the same lock.
  - Same shape as queue->lock in this driver and as is_drv->lock
    in the cw1200 ancestor.

Closes: besser#18
Fixes: <bes2600 base import>
Signed-off-by: Markus Fritsche <fritsche.markus@gmail.com>
marfrit merged commit fc327b2ff6 into cleanups 2026-05-18 19:18:08 +00:00
Sign in to join this conversation.
No Reviewers
No Label
1 Participants
Notifications
Due Date
No due date set.
Dependencies

No dependencies set.

Reference: marfrit/bes2600-dkms#11