Compare commits

7 Commits

Author SHA1 Message Date
marfrit 64fc309e26 Merge pull request 'bes2600: reset firmware state on wsm_join_confirm failure' (#12) from bes2600/wsm-join-confirm-reset into cleanups
Reviewed-on: #12
2026-05-21 08:45:22 +00:00
test0r cdb6bd07d3 bes2600: reset firmware state on wsm_join_confirm failure
When wsm_join_confirm() returned a status other than WSM_STATUS_SUCCESS
(ret 1), the driver cleared its bookkeeping but did not reset the
firmware interface, leaving it in an intermediate post-rejection state.
A rapid second JOIN attempt (e.g. wpa_supplicant retrying after the
PREV_AUTH_NOT_VALID deauth that mac80211 emits to clean up) then hits an
inconsistent firmware context, causing bes2600_sdio_read_rx_batch to
return an SDIO error, which cascades into wifi_force_close:

  wsm_join_confirm ret 1
  deauthenticating from <bssid> by local choice (Reason: 2=PREV_AUTH_NOT_VALID)
  [~10 min later]
  bes2600_sdio_read_rx_batch sdio read error
  WARNING: at bes2600_tx_loop_set_enable / bes2600_chrdev_wifi_force_close

Two additions to the failure path in bes2600_join_work():

1. wsm_reset (WSM_REQ_ID_RESET, 0x000A) with reset_statistics=false.
   This returns the firmware to IDLE so the next association attempt
   starts from a known-clean state.  bes2600_unjoin_work() performs the
   same reset, but gates it on join_status != PASSIVE; after a failed
   JOIN join_status stays PASSIVE, so that path never fires.  Call
   wsm_reset directly here instead.

   Contract: wsm_reset takes only wsm_cmd_lock (not conf_lock, not
   wsm_oper_lock).  wsm_oper_unlock was already called inside
   wsm_join_confirm() before wsm_join() returned -EINVAL, so there is
   no re-entrancy hazard.  conf_lock is held at this call site, which is
   compatible with wsm_reset's locking requirements.

2. queue_work(workqueue, &priv->unjoin_work) instead of a direct
   wsm_unlock_tx().  This serialises the next association attempt
   through the workqueue so it cannot race against lingering
   firmware-side effects of the failure.  If unjoin_work is already
   queued (queue_work() returns false), release TX immediately
   (matching cw1200 ancestor sta.c:1344 comment "Tx lock still held,
   unjoin will clear it.").

Ancestor reference: drivers/net/wireless/st/cw1200/sta.c, function
cw1200_join_work(), lines 1339-1344.  cw1200 queues unjoin_work on join
failure for the same reason.  bes2600 needs the direct wsm_reset in
addition because its unjoin_work has the join_status gate that cw1200's
cw1200_do_unjoin() does not.
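
The control flow described above can be sketched as a small user-space
model.  This is an illustration only, not driver code: every demo_*
name is hypothetical, the firmware reset is reduced to a counter, and
the model assumes that the unjoin work always releases the TX lock, as
the cw1200 comment quoted above suggests.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the driver state; not the real structs. */
enum demo_join_status { DEMO_PASSIVE, DEMO_JOINED };

struct demo_priv {
	enum demo_join_status join_status;
	bool unjoin_queued;     /* models a pending unjoin_work item */
	bool tx_locked;         /* models the TX lock held across join */
	unsigned int fw_resets; /* counts modelled wsm_reset calls */
};

/* Models queue_work(): returns false if the work was already queued. */
static bool demo_queue_unjoin(struct demo_priv *p)
{
	if (p->unjoin_queued)
		return false;
	p->unjoin_queued = true;
	return true;
}

/* Models the unjoin work: reset is gated on join_status, TX is released. */
static void demo_unjoin_work(struct demo_priv *p)
{
	p->unjoin_queued = false;
	if (p->join_status != DEMO_PASSIVE) {
		p->fw_resets++;
		p->join_status = DEMO_PASSIVE;
	}
	p->tx_locked = false;
}

/* Models the failure tail of the join work after this patch. */
static void demo_join_failed(struct demo_priv *p)
{
	assert(p->tx_locked);
	p->fw_resets++;               /* direct reset; the gated one would not fire */
	if (!demo_queue_unjoin(p))
		p->tx_locked = false; /* already queued: release TX now */
}
```

In this model a failed join from PASSIVE produces exactly one reset
(the direct one), and the TX lock stays held until the queued unjoin
work runs.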

Signed-off-by: Claude (noether) <claude@reauktion.de>
2026-05-21 10:43:42 +02:00
marfrit fc327b2ff6 Merge pull request 'bes2600: take pending_record_lock with _bh() — fix SOFTIRQ-safe → -unsafe inversion (closes besser#18)' (#11) from bes2600/queue-pending-record-lock-bh-fix into cleanups
Reviewed-on: #11
2026-05-18 19:18:08 +00:00
test0r d95453c98e bes2600: take pending_record_lock with _bh() to fix SOFTIRQ-safe → -unsafe inversion (besser#18)
PROVE_LOCKING reports:

  WARNING: SOFTIRQ-safe -> SOFTIRQ-unsafe lock order detected
  kworker/u16:1 is trying to acquire:
    &hw_priv->tx_loop.pending_record_lock at bes2600_queue_clear+0x80
  and this task is already holding:
    &queue->lock at bes2600_queue_clear+0x60

  which would create a new lock dependency:
    (&queue->lock){+.-.}   -> (&hw_priv->tx_loop.pending_record_lock){+.+.}

  but this new dependency connects a SOFTIRQ-irq-safe lock:
    (&queue->lock){+.-.}
  ... which became SOFTIRQ-irq-safe at:
    bes2600_tx -> ieee80211_handle_wake_tx_queue -> tasklet_action
  to a SOFTIRQ-irq-unsafe lock:
    (&hw_priv->tx_loop.pending_record_lock){+.+.}
  ... which became SOFTIRQ-irq-unsafe at:
    bes2600_queue_get_skb -> bes2600_join_work -> process_one_work

queue->lock is taken consistently with spin_lock_bh() at 22 sites;
the nested acquisition of pending_record_lock at queue.c:289 (inside
the outer queue->lock taken with _bh() at line 285) was therefore
implicitly BH-safe via the outer scope. But pending_record_lock is
ALSO taken from contexts that do not disable BH:

  bes2600_queue_get_skb (queue.c:832): process context via
    bes2600_join_work (workqueue), no outer queue->lock held
  bes2600_tx_loop_item_pending_check (tx_loop.c:112): TX-loop
    context, no outer queue->lock held

If CPU0 holds pending_record_lock from one of those non-BH paths and
a softirq fires there that wants queue->lock, while CPU1 in softirq
holds queue->lock and is about to acquire pending_record_lock, the
result is a classic AB-BA SOFTIRQ deadlock.

The fix is the conservative one: take pending_record_lock with _bh()
at every site that's not already inside a queue->lock_bh-held scope.
That makes the lock consistently SOFTIRQ-safe, eliminating the
inversion. queue.c:289/295 stays as plain spin_lock because BH is
already disabled by the outer queue->lock_bh acquired at queue.c:285.
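
As a rough illustration of the rule PROVE_LOCKING applies here, the
sketch below models each lock class with two usage flags and reports
an inversion when the outer lock is ever taken in softirq context and
the inner lock is ever taken with BH enabled.  The struct and function
names are hypothetical, and real lockdep tracks whole dependency
chains rather than a single outer/inner pair.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, heavily simplified lock-class record. */
struct demo_lock_class {
	const char *name;
	bool used_in_softirq;       /* ever acquired from softirq context */
	bool taken_with_bh_enabled; /* ever acquired with BH still enabled */
};

/*
 * True if nesting `inner` inside `outer` would create a
 * SOFTIRQ-safe -> SOFTIRQ-unsafe dependency.
 */
static bool demo_softirq_inversion(const struct demo_lock_class *outer,
				   const struct demo_lock_class *inner)
{
	assert(outer != NULL && inner != NULL);
	return outer->used_in_softirq && inner->taken_with_bh_enabled;
}
```

With queue->lock modelled as used_in_softirq = true, the nesting is
flagged while pending_record_lock has taken_with_bh_enabled = true
(plain spin_lock in process context) and is no longer flagged once
every such site uses the _bh() variants.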

Five sites converted:
  bes2600/queue.c:832 -- spin_lock      -> spin_lock_bh
  bes2600/queue.c:839 -- spin_unlock    -> spin_unlock_bh
  bes2600/queue.c:844 -- spin_unlock    -> spin_unlock_bh
  bes2600/tx_loop.c:112 -- spin_lock    -> spin_lock_bh
  bes2600/tx_loop.c:114 -- spin_unlock  -> spin_unlock_bh

Contract:
  - Documentation/locking/locktypes.rst spelling: spin_lock_bh() is
    the canonical way to make a non-IRQ spinlock safe against
    softirq preemption that might re-enter the same lock.
  - Same shape as queue->lock in this driver and as is_drv->lock
    in the cw1200 ancestor.

Closes: besser#18
Fixes: <bes2600 base import>
Signed-off-by: Markus Fritsche <fritsche.markus@gmail.com>
2026-05-18 16:58:49 +02:00
test0r 8cd10f487c bes2600: scan-filter-5ghz: allow targeted single-channel scans (besser#1 follow-up)
The original Patch I refused EVERY 5 GHz scan request unconditionally
(req->n_channels > 0 && band == NL80211_BAND_5GHZ).  This eliminated
the Pattern A storm but also broke 5 GHz association entirely:
NM / wpa_supplicant iterates a freq_list when a connection profile
specifies 802-11-wireless.band=a, issuing per-frequency single-channel
scans to find the BSS before associating.  Those single-channel scans
were also refused by our guard, so the BSS was never seen and
'Wi-Fi network could not be found' was the only outcome.

Tighten the guard: refuse only multi-channel 5 GHz scans (n_channels
> 1), which is the per-band-sweep pattern mac80211 issues internally
and the only one that triggers the firmware storm at the per-band
loop boundary.  Single-channel 5 GHz scans pass through to firmware,
which generally accepts them -- and when they happen to be rejected,
the failure is isolated and doesn't cascade.
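
The difference between the two guards can be shown with a minimal
user-space sketch.  The demo_* types are hypothetical stand-ins for
the cfg80211 scan request and band enum; only the two comparisons
mirror the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for the nl80211 band enum. */
enum demo_band { DEMO_BAND_2GHZ, DEMO_BAND_5GHZ };

struct demo_scan_req {
	size_t n_channels;
	enum demo_band first_band; /* models req->channels[0]->band */
};

/* Original guard: refuses every 5 GHz request (n_channels > 0). */
static int demo_guard_v1(const struct demo_scan_req *req)
{
	assert(req != NULL);
	if (req->n_channels > 0 && req->first_band == DEMO_BAND_5GHZ)
		return -EOPNOTSUPP;
	return 0;
}

/* Tightened guard: refuses only multi-channel 5 GHz sweeps. */
static int demo_guard_v2(const struct demo_scan_req *req)
{
	assert(req != NULL);
	if (req->n_channels > 1 && req->first_band == DEMO_BAND_5GHZ)
		return -EOPNOTSUPP;
	return 0;
}
```

A single-channel 5 GHz request is refused by the first guard and
passed by the second; multi-channel 5 GHz requests are refused by
both, and 2.4 GHz requests by neither.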

Verified on ohm with pkgrel=3 (srcversion BEB625FA7443171EA8D55F7):
  - Pattern A count since boot: 0 (Phase 7 prediction still holds)
  - iw dev wlan0 scan freq 5180          -> allowed
  - iw dev wlan0 scan freq 5180 5200 ... -> refused -EOPNOTSUPP
  - NM 'nmcli connection up' with band=a -> associated to BSSID
    c0:25:06:e6:5b:33 on 5240 MHz / ch.48 in ~1 second
  - TX bitrate 150 Mbit/s MCS 7 40MHz short-GI (vs 72.2 Mbit/s
    HT20 on 2.4 GHz) -- ~2x throughput recovered

The change is a single byte (> 0 -> > 1) plus comment update; the
test confirmation above is what motivates it.

Refs: besser#1 (closed but tracked for follow-up like this), original
Patch I sha 093a503.
2026-05-18 15:56:34 +02:00
test0r 093a5038b8 bes2600: filter 5 GHz scans at the driver boundary (besser#1)
The BES2600 firmware refuses WSM start-scan for 5 GHz with status 2
("rejected by policy").  This shows up in dmesg as the recurring

    wsm_generic_confirm failed for request 0x0007.
    [SCAN] Scan failed (-22).

pattern (besser issue #1, ~14-16/h on ohm/PineTab2 baseline).

Trace shows every reject is the second of a back-to-back pair: mac80211
splits multi-band hw_scan requests per band when the driver does not
set IEEE80211_HW_SINGLE_SCAN_ON_ALL_BANDS (we don't), then re-invokes
drv_hw_scan from __ieee80211_scan_completed for each subsequent band.
The 2.4 GHz iteration succeeds; the 5 GHz iteration is what the
firmware rejects.  See ieee80211_prep_hw_scan in net/mac80211/scan.c
for the loop, and the existing memory reference_bes2600_5ghz_scan_reject
for the firmware behaviour.
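
A simplified model of that per-band split, assuming a flat channel
array and hypothetical demo_* names (this is not the mac80211
implementation, only the filtering step it performs for each band):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the band enum and channel struct. */
enum demo_band { DEMO_BAND_2GHZ, DEMO_BAND_5GHZ };

struct demo_chan {
	unsigned int freq_mhz;
	enum demo_band band;
};

/*
 * Copy the channels of one band from the full request into `out`.
 * Returns the number of channels copied (at most out_cap).
 */
static size_t demo_prep_band(const struct demo_chan *all, size_t n_all,
			     enum demo_band band,
			     struct demo_chan *out, size_t out_cap)
{
	size_t n = 0;

	assert(all != NULL && out != NULL);
	for (size_t i = 0; i < n_all && n < out_cap; i++) {
		if (all[i].band == band)
			out[n++] = all[i];
	}
	return n;
}
```

Calling this once per band on a mixed request yields one 2.4 GHz-only
list and one 5 GHz-only list, which is why each driver-side hw_scan
call sees channels from a single band.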

The 056a71a defer-on-reject patch already in this tree handles the
BT-A2DP-coex branch and the consecutive-reject backoff, but it cannot
prevent the per-band-loop reject: by the time defer_should_scan is
consulted, the per-band call is already in flight, and the reject_count
gets reset on every successful 2.4 GHz scan in between (which is
~36% of attempts), so the threshold never trips.
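
Why a consecutive-reject counter cannot trip under this pattern can be
seen in a small model.  The counter layout, the demo_* names and the
threshold value used with it are assumptions made for illustration;
the actual logic in 056a71a is not reproduced here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a consecutive-reject backoff counter. */
struct demo_defer {
	unsigned int reject_count;
	unsigned int threshold;
};

/* Record one scan result; returns true while the threshold is reached. */
static bool demo_defer_record(struct demo_defer *d, bool rejected)
{
	assert(d != NULL);
	if (rejected)
		d->reject_count++;
	else
		d->reject_count = 0; /* any success clears the streak */
	return d->reject_count >= d->threshold;
}

/* Feed a sequence of results; returns how often the threshold was hit. */
static unsigned int demo_defer_run(struct demo_defer *d,
				   const bool *rejected, size_t n)
{
	unsigned int trips = 0;

	for (size_t i = 0; i < n; i++) {
		if (demo_defer_record(d, rejected[i]))
			trips++;
	}
	return trips;
}
```

Feeding it an alternating success/reject sequence keeps reject_count
at 0 or 1, so any threshold above 1 is never reached; only an
uninterrupted run of rejects trips it.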

The fix: refuse the 5 GHz iteration upfront in bes2600_hw_scan.  The
2.4 GHz scan still runs normally.  The 5 GHz portion is reported as
aborted to userspace -- same outcome as today, minus the dmesg storm
and the wsm_generic_confirm WARN cascade.

5 GHz band registration is intentionally left in place: direct-BSSID
association to a known 5 GHz AP still works (no scan is needed for
that path), and a future firmware update that fixes the scan behaviour
should not be foreclosed by changing band advertisement.

Contract: per include/net/mac80211.h ieee80211_ops.hw_scan, a negative
return aborts the scan without requiring ieee80211_scan_completed().
-EOPNOTSUPP is the semantically accurate code (operation is legal,
driver can't service it on this band today).

Phase 3 evidence:
- baseline N=3: rate ~14.3-23.6/h converged at 14.3/h (matches OP)
- back-to-back scan gap: 6/6 rejected pairs <200us, 1/1 successful
  pair was 114ms (single-band-only, no 5 GHz leg)
- defer log fires: 0/9 in 30-min window (056a71a structurally bypassed)

Predicted Phase 7 delta: Pattern A 14/h -> 0/h.
2026-05-18 11:27:40 +02:00
marfrit 87a3d65960 bes2600: Patch H — bh.c hygiene cleanup (drop fossil blocks, dead stubs) (#10)
2026-05-08 06:30:40 +00:00
4 changed files with 78 additions and 9 deletions
+3 -3
@@ -829,19 +829,19 @@ int bes2600_queue_get_skb(struct bes2600_queue *queue, u32 packetID,
 	bes2600_queue_parse_id(packetID, &queue_generation, &queue_id,
 				&item_generation, &item_id, &if_id, &link_id);
-	spin_lock(&queue->stats->hw_priv->tx_loop.pending_record_lock);
+	spin_lock_bh(&queue->stats->hw_priv->tx_loop.pending_record_lock);
 	if (!list_empty(&queue->stats->hw_priv->tx_loop.pending_record_list)) {
 		list_for_each_entry_safe(record_item, temp_record_item, &queue->stats->hw_priv->tx_loop.pending_record_list, head) {
 			if (record_item->packetID == packetID) {
 				list_del(&record_item->head);
 				dev_kfree_skb(record_item->skb);
 				kfree(record_item);
-				spin_unlock(&queue->stats->hw_priv->tx_loop.pending_record_lock);
+				spin_unlock_bh(&queue->stats->hw_priv->tx_loop.pending_record_lock);
 				return -EINVAL;
 			}
 		}
 	}
-	spin_unlock(&queue->stats->hw_priv->tx_loop.pending_record_lock);
+	spin_unlock_bh(&queue->stats->hw_priv->tx_loop.pending_record_lock);
 	item = &queue->pool[item_id];
+30
@@ -238,6 +238,36 @@ int bes2600_hw_scan(struct ieee80211_hw *hw,
/* Scan when P2P_GO corrupt firmware MiniAP mode */
if (priv->join_status == BES2600_JOIN_STATUS_AP)
return -EOPNOTSUPP;
/*
* Firmware refuses WSM start-scan for 5 GHz with status 2 ("rejected
* by policy"); see besser issue #1. mac80211 splits multi-band
* hw_scan requests per-band when the driver does not set
* IEEE80211_HW_SINGLE_SCAN_ON_ALL_BANDS (we don't -- see
* ieee80211_hw_set() calls in bes2600_main.c), so each per-band call
* has req->channels[] from one band only (see ieee80211_prep_hw_scan
* in net/mac80211/scan.c). Refuse the 5 GHz iteration at the driver
* boundary so userspace gets a clean aborted-scan for that portion
* rather than waiting for the firmware reject to cascade up.
*
* Only the multi-channel case is refused (n_channels > 1): that's
* the per-band-sweep pattern mac80211 issues internally and the
* one that triggers the firmware storm at the per-band loop
* boundary. Single-channel 5 GHz scans (BSS verification, NM's
* per-freq iteration when 802-11-wireless.band=a is set) pass
* through to firmware, which generally accepts them since the
* storm is the back-to-back per-band issue, not a blanket 5 GHz
* reject. This preserves 5 GHz association via the
* "wpa_supplicant iterates freq_list per channel" path.
*
* Contract: per include/net/mac80211.h struct ieee80211_ops.hw_scan
* documentation, a negative return aborts the scan without requiring
* ieee80211_scan_completed().
*/
if (req->n_channels > 1 &&
req->channels[0]->band == NL80211_BAND_5GHZ)
return -EOPNOTSUPP;
#if 0
if (work_pending(&priv->offchannel_work) ||
(hw_priv->roc_if_id != -1)) {
+43 -4
@@ -2209,9 +2209,10 @@ void bes2600_join_work(struct work_struct *work)
 	struct wsm_template_frame probe_tmp = {
 		.frame_type = WSM_FRAME_TYPE_PROBE_REQUEST,
 	};
-	/*struct wsm_reset reset = {
-		.reset_statistics = true,
-	};*/
+	struct wsm_reset join_fail_reset = {
+		.reset_statistics = false,
+	};
+	bool join_failed = false;
 	BUG_ON(queueId >= 4);
@@ -2390,6 +2391,33 @@ void bes2600_join_work(struct work_struct *work)
#endif /*CONFIG_BES2600_TESTMODE*/
cancel_delayed_work_sync(&priv->join_timeout);
bes2600_pwr_clear_busy_event(priv->hw_priv, BES_PWR_LOCK_ON_JOIN);
/*
* Firmware rejected WSM_JOIN (wsm_join_confirm ret 1).
* Issue wsm_reset so the firmware returns to a clean
* IDLE state before the next association attempt.
*
* Without this reset the firmware sits in an
* intermediate post-reject state. A rapid second
* JOIN (e.g. wpa_supplicant retrying after the
* PREV_AUTH_NOT_VALID deauth that follows) hits an
* inconsistent firmware context, causing
* bes2600_sdio_read_rx_batch to return SDIO error
* which cascades into wifi_force_close.
*
* cw1200 ancestor (drivers/net/wireless/st/cw1200/
* sta.c:1339) queues unjoin_work on join failure for
* the same reason; bes2600_unjoin_work gates its
* wsm_reset on join_status != PASSIVE, so after a
* failed JOIN (join_status stays PASSIVE) that path
* never fires — call wsm_reset directly here instead.
*
* Contract: wsm_reset takes only wsm_cmd_lock; safe
* to call while conf_lock is held. wsm_oper_unlock
* was already called in wsm_join_confirm() before
* wsm_join() returned the error.
*/
WARN_ON(wsm_reset(hw_priv, &join_fail_reset, priv->if_id));
join_failed = true;
} else {
/* Upload keys */
#ifdef CONFIG_BES2600_TESTMODE
@@ -2414,7 +2442,18 @@ void bes2600_join_work(struct work_struct *work)
 	up(&hw_priv->conf_lock);
 	if (bss)
 		cfg80211_put_bss(hw_priv->hw->wiphy, bss);
-	wsm_unlock_tx(hw_priv);
+	/*
+	 * On join failure: queue unjoin_work so the next association
+	 * attempt is serialised after any lingering cleanup, matching
+	 * cw1200 sta.c:1344 "Tx lock still held, unjoin will clear it."
+	 * If unjoin_work is already queued, release TX immediately.
+	 */
+	if (join_failed) {
+		if (queue_work(hw_priv->workqueue, &priv->unjoin_work) <= 0)
+			wsm_unlock_tx(hw_priv);
+	} else {
+		wsm_unlock_tx(hw_priv);
+	}
 }
 void bes2600_join_timeout(struct work_struct *work)
+2 -2
@@ -109,9 +109,9 @@ void bes2600_tx_loop_set_enable(struct bes2600_common *hw_priv, bool need_warn)
 		bes2600_queue_iterate_pending_packet(&hw_priv->tx_queue[i],
 			bes2600_tx_loop_item_pending_item);
 	}
-	spin_lock(&hw_priv->tx_loop.pending_record_lock);
+	spin_lock_bh(&hw_priv->tx_loop.pending_record_lock);
 	bes2600_queue_iterate_record_pending_packet(hw_priv, bes2600_tx_loop_item_pending_item);
-	spin_unlock(&hw_priv->tx_loop.pending_record_lock);
+	spin_unlock_bh(&hw_priv->tx_loop.pending_record_lock);
 	if (atomic_read(&hw_priv->bh_rx) > 0)
 		wake_up(&hw_priv->bh_wq);