lockdep: SOFTIRQ-safe -> SOFTIRQ-unsafe between bes2600_queue->lock and tx_loop.pending_record_lock #18
Symptom
The first-ever PROVE_LOCKING run on the besser stack on ohm (RK3566 PineTab2) surfaced a latent SOFTIRQ-safe → SOFTIRQ-unsafe lock-order warning in bes2600. It was found incidentally during a vb2_dma_resv RFC v2 PROVE_LOCKING validation run; the splat is unrelated to that work (no vb2_*, dma_fence, dma_resv, or videobuf symbols on the call chain).
Splat (verbatim)
Possible interrupt-unsafe scenario (per lockdep)
If CPU0 holds pending_record_lock (taken without _bh() from bes2600_join_work's workqueue context) and a softirq then fires on CPU0 that wants queue->lock, while CPU1 in softirq context already holds queue->lock and is waiting for pending_record_lock, the two CPUs deadlock.
Locks held when triggered (5)
Trigger path: cfg80211 wiphy work → bes2600 flush → queue clear, while another CPU is in softirq tx path.
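The rule lockdep applies in the scenario above can be modelled in a few lines of userspace C. This is a simplified sketch, not lockdep's implementation: struct lock_class, note_acquire(), softirq_safe_to_unsafe() and report_scenario_warns() are hypothetical names invented for illustration, and the real checker also follows transitive dependencies. A class counts as SOFTIRQ-safe once it has been taken from softirq context, and as SOFTIRQ-unsafe once it has been taken with softirqs enabled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for a lockdep lock class. */
struct lock_class {
    const char *name;
    bool used_in_softirq;   /* ever acquired from softirq context */
    bool taken_with_bh_on;  /* ever acquired with softirqs enabled */
};

/* Record one acquisition together with the context it happened in. */
static void note_acquire(struct lock_class *l, bool in_softirq, bool bh_enabled)
{
    assert(l != NULL);
    if (in_softirq)
        l->used_in_softirq = true;   /* class becomes SOFTIRQ-safe */
    else if (bh_enabled)
        l->taken_with_bh_on = true;  /* class becomes SOFTIRQ-unsafe */
    /* process context with BH disabled sets neither flag */
}

/* True if taking 'next' while holding 'held' is a safe -> unsafe dependency. */
static bool softirq_safe_to_unsafe(const struct lock_class *held,
                                   const struct lock_class *next)
{
    return held->used_in_softirq && next->taken_with_bh_on;
}

/*
 * Replay the acquisitions described in this report.
 * join_work_uses_bh selects spin_lock_bh (true) or plain spin_lock (false)
 * in the workqueue path. Returns true if the model would warn.
 */
static bool report_scenario_warns(bool join_work_uses_bh)
{
    struct lock_class queue_lock = { "bes2600_queue->lock", false, false };
    struct lock_class pending = { "tx_loop.pending_record_lock", false, false };

    /* Softirq tx path takes queue->lock. */
    note_acquire(&queue_lock, true, false);
    /* Workqueue path (bes2600_join_work) takes pending_record_lock. */
    note_acquire(&pending, false, !join_work_uses_bh);
    /* Flush path takes pending_record_lock nested inside queue->lock,
     * which was taken with _bh, so softirqs are already disabled. */
    note_acquire(&pending, false, false);

    return softirq_safe_to_unsafe(&queue_lock, &pending);
}
```

In this model report_scenario_warns(false) flags the dependency, and report_scenario_warns(true) does not, because pending_record_lock is then never observed with softirqs enabled.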
Reproducer
Enable on a besser pkgrel build:
Then run anything that triggers a wifi flush concurrently with normal wifi tx. In this case the splat fired at ~5140s of system uptime (about 1.4h after boot) without a specific trigger. It likely happens routinely under regular wifi activity, but only PROVE_LOCKING surfaces it.
Likely fix
Normalize lock ordering, either:
_bh() (i.e. spin_lock_bh on pending_record_lock everywhere), or
The bes2600_join_work path is the SOFTIRQ-unsafe culprit (it uses plain _raw_spin_lock on pending_record_lock); making that _raw_spin_lock_bh would resolve the warning if the surrounding context tolerates it.
Why filing here
This is a pre-existing latent bug in the bes2600 driver / its tx-loop locking, surfaced by enabling PROVE_LOCKING for the first time on a besser stack. Filing as a non-fourier issue so it can be triaged on the bes2600 side independently.
Kernel: linux-pinetab2-danctnix-besser 7.0.danctnix1-3 (= 7.0.danctnix1 base + besser cumulative bes2600 patch + vb2_dma_resv RFC patches + lockdep annotations). The splat is in bes2600 only (base kernel + besser patches); the vb2/dma_resv work doesn't touch this code path.
Closed: Phase 7 verified
Built linux-pinetab2-danctnix-besser-lockdep (sibling install of production pkgrel=5) with CONFIG_PROVE_LOCKING=y + CONFIG_LOCKDEP=y + CONFIG_DEBUG_LOCK_ALLOC=y, same bes2600 source bytes as non-lockdep pkgrel=5 (srcversion 26B0003FE9F2B05DCE838C4).
Installed on ohm and ran for 4h46m of active uptime with normal wifi use + 5 suspend/resume cycles. The issue body's reported spontaneous fire was at ~5140s uptime (~1h25m), so we're at 3.4× that window plus PM-path exercise.
Lockdep splat count (specifically the SOFTIRQ-safe.*SOFTIRQ-unsafe, lock order detected, and possible recursive lock patterns this issue is about):
Fix:
bes2600/queue.c:832/839/844 and bes2600/tx_loop.c:112/114: spin_lock(&pending_record_lock) → spin_lock_bh(). queue.c:289/295 are unchanged: they are nested inside the outer queue->lock _bh acquisition, so BH is already disabled.
The fix was first integrated in linux-pinetab2-danctnix-besser pkgrel=4. Phase 7 ran on the lockdep sibling of pkgrel=5 (the current production on ohm). bes2600-dkms commit d95453c on branch bes2600/queue-pending-record-lock-bh-fix.
Phase 7 incidentally surfaced a separate, pre-existing lockdep finding in bes2600_join_work (suspicious RCU usage at net/wireless/util.c:1078 via ieee80211_bss_get_elem outside rcu_read_lock). Tracked for future filing as its own issue; not a regression introduced by this fix.
Closing.
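For reference, the shape of the change described under Fix can be sketched in userspace C. This is a minimal sketch under stated assumptions, not the driver code: spinlock_t, spin_lock(), spin_lock_bh() and bh_disable_depth below are toy stand-ins for the kernel primitives, and standalone_site() and nested_site() are hypothetical functions that only mirror the two kinds of call site named above. Each returns the BH-disable depth seen while pending_record_lock is held.

```c
#include <assert.h>

/* Userspace stand-ins for kernel primitives; NOT the real implementations. */
typedef struct { int locked; } spinlock_t;

static int bh_disable_depth;  /* models the softirq-disable nesting count */

static void spin_lock(spinlock_t *l)   { assert(!l->locked); l->locked = 1; }
static void spin_unlock(spinlock_t *l) { assert(l->locked);  l->locked = 0; }

static void spin_lock_bh(spinlock_t *l)
{
    bh_disable_depth++;  /* softirqs off before the lock is taken */
    spin_lock(l);
}

static void spin_unlock_bh(spinlock_t *l)
{
    spin_unlock(l);
    assert(bh_disable_depth > 0);
    bh_disable_depth--;  /* softirqs may run again at depth 0 */
}

static spinlock_t queue_lock;
static spinlock_t pending_record_lock;

/* Hypothetical standalone site: the lock is the outermost one taken. */
static int standalone_site(void)
{
    int depth;

    spin_lock_bh(&pending_record_lock);  /* was: spin_lock() */
    depth = bh_disable_depth;
    spin_unlock_bh(&pending_record_lock);
    return depth;
}

/* Hypothetical nested site: already inside the queue lock's _bh section. */
static int nested_site(void)
{
    int depth;

    spin_lock_bh(&queue_lock);
    spin_lock(&pending_record_lock);  /* plain lock: BH is already off */
    depth = bh_disable_depth;
    spin_unlock(&pending_record_lock);
    spin_unlock_bh(&queue_lock);
    return depth;
}
```

Both functions report a non-zero depth while the lock is held, which is the property the fix establishes: pending_record_lock is never held with softirqs enabled, whether it is taken on its own or nested inside queue->lock.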