lockdep: SOFTIRQ-safe -> SOFTIRQ-unsafe between bes2600_queue->lock and tx_loop.pending_record_lock #18

Closed
opened 2026-05-09 13:54:24 +00:00 by marfrit · 1 comment
Owner

Symptom

First-ever PROVE_LOCKING run on the besser stack on ohm (RK3566 PineTab2) surfaces a latent SOFTIRQ-safe → SOFTIRQ-unsafe lock-order warning in bes2600. Found incidentally during a vb2_dma_resv RFC v2 PROVE_LOCKING validation run; the splat is unrelated to that work (no vb2_*, dma_fence, dma_resv, or videobuf symbols on the call chain).

Splat (verbatim)

WARNING: SOFTIRQ-safe -> SOFTIRQ-unsafe lock order detected
7.0.0-danctnix1-3-pinetab2-danctnix-besser #1 Tainted: G         C

kworker/u16:1/6228 [HC0[0]:SC0[2]:HE1:SE0] is trying to acquire:
  ffff00011d0953b8 (&hw_priv->tx_loop.pending_record_lock){+.+.}-{3:3},
     at: bes2600_queue_clear+0x80/0x400 [bes2600]

and this task is already holding:
  ffff00011d093218 (&queue->lock){+.-.}-{3:3},
     at: bes2600_queue_clear+0x60/0x400 [bes2600]

which would create a new lock dependency:
  (&queue->lock){+.-.}-{3:3} -> (&hw_priv->tx_loop.pending_record_lock){+.+.}-{3:3}

but this new dependency connects a SOFTIRQ-irq-safe lock:
  (&queue->lock){+.-.}-{3:3}

... which became SOFTIRQ-irq-safe at:
  _raw_spin_lock_bh
  bes2600_queue_lock
  bes2600_tx
  ieee80211_handle_wake_tx_queue [mac80211]
  drv_wake_tx_queue [mac80211]
  _ieee80211_wake_txqs [mac80211]
  ieee80211_wake_txqs [mac80211]
  tasklet_action_common
  tasklet_action
  handle_softirqs

to a SOFTIRQ-irq-unsafe lock:
  (&hw_priv->tx_loop.pending_record_lock){+.+.}-{3:3}

... which became SOFTIRQ-irq-unsafe at:
  _raw_spin_lock
  bes2600_queue_get_skb
  bes2600_join_work
  process_one_work
  worker_thread
  kthread

Possible interrupt-unsafe scenario (per lockdep)

       CPU0                          CPU1
       ----                          ----
  lock(&hw_priv->tx_loop.pending_record_lock);
                                local_irq_disable();
                                lock(&queue->lock);
                                lock(&hw_priv->tx_loop.pending_record_lock);
  <Interrupt>
    lock(&queue->lock);

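In other words: CPU0 takes pending_record_lock with softirqs still enabled (plain spin_lock, reached from bes2600_join_work's workqueue context). CPU1, in the flush path, holds queue->lock and spins on pending_record_lock. A softirq then interrupts CPU0 and its tx path spins on queue->lock. CPU0 cannot release pending_record_lock until the softirq returns, the softirq waits for CPU1, and CPU1 waits for CPU0: deadlock.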
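The rule lockdep applies here can be sketched with a toy userspace model. This is only an illustration of the usage-bit idea, not the kernel's implementation: the struct, the helper names, and the replayed sequence are assumptions made up for this sketch, and real lockdep checks dependencies incrementally rather than at the end.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model of lockdep's softirq usage bits; NOT the kernel's implementation. */
struct lock_class {
    const char *name;
    bool taken_in_softirq;      /* ever acquired from softirq context ("softirq-safe") */
    bool taken_with_bh_enabled; /* ever acquired with softirqs enabled ("softirq-unsafe") */
};

/* Record one acquisition and the context it happened in. */
static void note_acquire(struct lock_class *c, bool in_softirq, bool bh_disabled)
{
    assert(c != NULL);
    if (in_softirq)
        c->taken_in_softirq = true;
    else if (!bh_disabled)
        c->taken_with_bh_enabled = true;
}

/*
 * A dependency held -> wanted is dangerous when a softirq may spin on `held`
 * after interrupting a CPU that holds `wanted` with softirqs enabled.
 */
static bool dep_is_unsafe(const struct lock_class *held,
                          const struct lock_class *wanted)
{
    return held->taken_in_softirq && wanted->taken_with_bh_enabled;
}

/* Replay the three paths from the splat (hypothetical helper for this sketch). */
static bool splat_scenario_is_flagged(bool join_work_uses_bh)
{
    struct lock_class queue_lock = { .name = "&queue->lock" };
    struct lock_class pending = { .name = "&hw_priv->tx_loop.pending_record_lock" };

    /* tx path: a tasklet (softirq context) takes queue->lock. */
    note_acquire(&queue_lock, true, true);
    /* join_work path: workqueue context takes pending_record_lock. */
    note_acquire(&pending, false, join_work_uses_bh);
    /* flush path: queue->lock taken with _bh, pending_record_lock nested inside. */
    note_acquire(&queue_lock, false, true);
    note_acquire(&pending, false, true);

    /* The flush path creates the dependency queue->lock -> pending_record_lock. */
    return dep_is_unsafe(&queue_lock, &pending);
}
```

Calling splat_scenario_is_flagged(false) models the reported state and reports the dependency as unsafe; passing true models the workqueue path disabling bottom halves first, and the same dependency is then accepted.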

Locks held when triggered (5)

#0: ((wq_completion)events_unbound)              process_one_work
#1: ((work_completion)(&rdev->wiphy_work))       process_one_work
#2: (&rdev->wiphy.mtx)                           cfg80211_wiphy_work [cfg80211]
#3: (&priv->vif_lock)                            __bes2600_flush [bes2600]
#4: (&queue->lock)                               bes2600_queue_clear [bes2600] ← held while trying to acquire pending_record_lock

Trigger path: cfg80211 wiphy work → bes2600 flush → queue clear, while another CPU is in softirq tx path.

Reproducer

Enable on a besser pkgrel build:

CONFIG_PROVE_LOCKING=y
CONFIG_LOCKDEP=y
CONFIG_DEBUG_LOCK_ALLOC=y

Then run anything that triggers a wifi flush concurrently with normal wifi tx. In this case the splat fired at ~5140s of uptime (~1h25m after boot) without a specific trigger. The nesting likely happens routinely under regular wifi activity, but only PROVE_LOCKING surfaces it.

Likely fix

Normalize lock ordering — either:

  • Take both locks with _bh() (i.e. spin_lock_bh on pending_record_lock everywhere), or
  • Always acquire in (pending_record_lock → queue->lock) order, never the reverse, or
  • Lift one of the spinlocks higher in the stack so they don't nest.

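The bes2600_join_work path is the SOFTIRQ-unsafe culprit: per the splat, bes2600_queue_get_skb takes pending_record_lock there through plain _raw_spin_lock (i.e. spin_lock). Switching that acquisition to spin_lock_bh would resolve the warning, provided the surrounding context tolerates disabling bottom halves.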

Why filing here

This is a pre-existing latent bug in the bes2600 driver / its tx-loop locking, surfaced by enabling PROVE_LOCKING for the first time on a besser stack. Filing as a non-fourier issue so it can be triaged on the bes2600 side independently.

Kernel: linux-pinetab2-danctnix-besser 7.0.danctnix1-3 (= 7.0.danctnix1 base + besser cumulative bes2600 patch + vb2_dma_resv RFC patches + lockdep annotations). Splat is in bes2600 only — base kernel + besser patches; vb2/dma_resv work doesn't touch this code path.

Author
Owner

Closed — Phase 7 verified

Built linux-pinetab2-danctnix-besser-lockdep (sibling install of production pkgrel=5) with CONFIG_PROVE_LOCKING=y + CONFIG_LOCKDEP=y + CONFIG_DEBUG_LOCK_ALLOC=y, same bes2600 source bytes as non-lockdep pkgrel=5 (srcversion 26B0003FE9F2B05DCE838C4).

Installed on ohm, ran 4h46m active uptime with normal wifi use + 5 suspend/resume cycles. The issue body's reported spontaneous-fire was at ~5140s uptime (~1h25m), so we're at 3.4× that window plus PM-path exercise.

Lockdep splat count (specifically the SOFTIRQ-safe.*SOFTIRQ-unsafe, lock order detected, possible recursive lock patterns this issue is about):

$ sudo dmesg | grep -ciE "SOFTIRQ-safe.*SOFTIRQ-unsafe|lock order detected|possible recursive lock"
0

Fix:

  • bes2600/queue.c:832/839/844 and bes2600/tx_loop.c:112/114: spin_lock(&pending_record_lock) → spin_lock_bh().
  • queue.c:289/295 unchanged: these sites are nested inside the outer spin_lock_bh(&queue->lock), so bottom halves are already disabled there.
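The two cases in this list can be illustrated with a userspace model. Everything below is a stand-in written for this sketch: the spinlock type, the spin_lock*() stubs, the bh_disable_depth counter, and both functions are hypothetical and are not the driver's code. In the kernel the real primitives come from <linux/spinlock.h>.

```c
#include <assert.h>

/* Userspace stand-ins so the pattern compiles outside the kernel (assumptions). */
typedef struct { int held; } spinlock_t;
static int bh_disable_depth; /* models this CPU's softirq-disable nesting count */

static void spin_lock(spinlock_t *l)      { assert(!l->held); l->held = 1; }
static void spin_unlock(spinlock_t *l)    { l->held = 0; }
static void spin_lock_bh(spinlock_t *l)   { bh_disable_depth++; spin_lock(l); }
static void spin_unlock_bh(spinlock_t *l) { spin_unlock(l); bh_disable_depth--; }

struct tx_loop_model { spinlock_t pending_record_lock; int pending; };
struct queue_model   { spinlock_t lock; };

/*
 * Standalone acquisition from process context (the converted sites):
 * the _bh variant keeps softirqs off this CPU while the lock is held.
 * Returns the BH-disable depth observed inside the critical section.
 */
static int record_pending_standalone(struct tx_loop_model *t)
{
    int depth_inside;

    spin_lock_bh(&t->pending_record_lock);
    t->pending++;
    depth_inside = bh_disable_depth;
    spin_unlock_bh(&t->pending_record_lock);
    return depth_inside;
}

/*
 * Nested acquisition (the unchanged sites): the outer spin_lock_bh() on the
 * queue lock already disabled bottom halves, so a plain spin_lock() suffices.
 */
static int clear_pending_nested(struct queue_model *q, struct tx_loop_model *t)
{
    int depth_inside;

    spin_lock_bh(&q->lock);
    spin_lock(&t->pending_record_lock);
    t->pending = 0;
    depth_inside = bh_disable_depth;
    spin_unlock(&t->pending_record_lock);
    spin_unlock_bh(&q->lock);
    return depth_inside;
}
```

In both functions the depth seen inside the critical section is 1, which is the property the fix relies on: pending_record_lock is never held on a CPU where a softirq could still run and spin on queue->lock.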

Fix first integrated in linux-pinetab2-danctnix-besser pkgrel=4. Phase 7 ran on the lockdep sibling of pkgrel=5 (the current production on ohm). bes2600-dkms commit d95453c on branch bes2600/queue-pending-record-lock-bh-fix.

Phase 7 incidentally surfaced a separate, pre-existing lockdep finding in bes2600_join_work (suspicious RCU usage at net/wireless/util.c:1078 via ieee80211_bss_get_elem outside rcu_read_lock). Tracked for future filing as its own issue; not a regression introduced by this fix.

Closing.

Reference: marfrit/besser#18