bes2600: reset firmware state on wsm_join_confirm failure #12

Merged
marfrit merged 1 commit from bes2600/wsm-join-confirm-reset into cleanups 2026-05-21 08:45:22 +00:00

Fixes besser#25.

Problem

When wsm_join_confirm() returns status 1 (firmware-rejected JOIN), the driver clears its bookkeeping but does not reset the firmware. A rapid second JOIN attempt (e.g. wpa_supplicant retrying after the PREV_AUTH_NOT_VALID deauth that mac80211 emits to clean up) then hits an inconsistent firmware context, causing a bes2600_sdio_read_rx_batch SDIO read error followed by the wifi_force_close / tx_loop_set_enable WARN_ON cascade.

Reproduced 2026-05-21 on pkgrel=5 (per-series reconstruction), boot -1: two consecutive JOIN failures to the wohnzimmer 5 GHz AP (c0:25:06:e6:5b:32) 10 minutes apart, the second triggering the full cascade.

cw1200 ancestor

drivers/net/wireless/st/cw1200/sta.c, cw1200_join_work() lines 1339-1344:

if (wsm_join(priv, &join)) {
    pr_err("[STA] cw1200_join_work: wsm_join failed!\n");
    cancel_delayed_work_sync(&priv->join_timeout);
    cw1200_update_listening(priv, priv->listening);
    /* Tx lock still held, unjoin will clear it. */
    if (queue_work(priv->workqueue, &priv->unjoin_work) <= 0)
        wsm_unlock_tx(priv);
}

cw1200 queues unjoin_work on join failure. cw1200_do_unjoin() calls wsm_reset() when join_status == STA, which in cw1200's call sequence is the usual state after a failed JOIN (prior association existed).

bes2600 divergence requiring direct wsm_reset

bes2600_unjoin_work() gates its wsm_reset call on priv->join_status != BES2600_JOIN_STATUS_PASSIVE. After a failed JOIN in bes2600, join_status stays PASSIVE (it is only set to STA in the success branch). Queuing unjoin_work alone therefore skips the reset.
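
The effect of that gate can be illustrated with a small user-space model. This is only a sketch under stated assumptions: every name prefixed `model_` or `MODEL_` is hypothetical and stands in for driver state, and the model mirrors only the behaviour described above (reset gated on a non-PASSIVE status, STA set only on success), not the real bes2600 code.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for BES2600_JOIN_STATUS_PASSIVE / ..._STA. */
enum model_join_status { MODEL_PASSIVE = 0, MODEL_STA };

struct model_vif {
    enum model_join_status join_status;
    int resets_sent;                /* counts simulated wsm_reset() calls */
};

/* Models the gate: unjoin work resets only when the status is not PASSIVE. */
void model_unjoin_work(struct model_vif *vif)
{
    if (vif->join_status != MODEL_PASSIVE) {
        vif->resets_sent++;
        vif->join_status = MODEL_PASSIVE;
    }
}

/* Models a JOIN attempt: the status becomes STA only in the success branch. */
void model_join(struct model_vif *vif, bool firmware_accepts)
{
    if (firmware_accepts)
        vif->join_status = MODEL_STA;
    /* On rejection the status is left as it was, i.e. PASSIVE. */
}

/* Number of resets issued by unjoin work after one JOIN attempt. */
int model_resets_after_join(bool firmware_accepts)
{
    struct model_vif vif = { MODEL_PASSIVE, 0 };

    model_join(&vif, firmware_accepts);
    model_unjoin_work(&vif);
    return vif.resets_sent;
}

/* Self-check: a rejected JOIN never reaches the reset, an accepted one does. */
void model_gate_selfcheck(void)
{
    assert(model_resets_after_join(false) == 0);
    assert(model_resets_after_join(true) == 1);
}
```

In this model a rejected JOIN followed by unjoin work issues zero resets, which is the gap the direct reset in the failure path is meant to close.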

This PR therefore does both halves of the fix:

  1. Direct wsm_reset(hw_priv, &join_fail_reset, priv->if_id) in the failure path, compensating for the PASSIVE gate in bes2600_unjoin_work(). Contract: wsm_reset takes only wsm_cmd_lock; conf_lock held here is compatible; wsm_oper_unlock was already called inside wsm_join_confirm() before wsm_join() returned -EINVAL.

  2. queue_work(workqueue, &priv->unjoin_work) instead of direct wsm_unlock_tx(). This serialises the next association attempt through the workqueue so it cannot race against any lingering firmware-side effects. If unjoin_work is already queued, the TX lock is released immediately, matching the cw1200 comment "Tx lock still held, unjoin will clear it.".
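
The two halves can be sketched as a user-space model of the TX lock hand-off. The assumptions are loud ones: all `model_` names are hypothetical, the TX lock is reduced to a nesting counter, and the queued worker is assumed to drop one TX lock reference when it runs, as the cw1200 comment implies. It is not the patch itself.

```c
#include <assert.h>
#include <stdbool.h>

struct model_hw {
    bool unjoin_pending;    /* unjoin work already sitting on the workqueue */
    int tx_lock_depth;      /* models wsm_lock_tx()/wsm_unlock_tx() nesting */
    int resets_sent;        /* counts simulated wsm_reset() calls */
};

/* Mimics queue_work(): returns false if the work item was already pending. */
bool model_queue_unjoin(struct model_hw *hw)
{
    if (hw->unjoin_pending)
        return false;
    hw->unjoin_pending = true;
    return true;
}

/* Models the worker: when it runs, it drops one TX lock reference. */
void model_run_unjoin(struct model_hw *hw)
{
    if (!hw->unjoin_pending)
        return;
    hw->unjoin_pending = false;
    hw->tx_lock_depth--;
}

/* Failure path, entered with one TX lock reference held by the join attempt. */
void model_join_failure_path(struct model_hw *hw)
{
    hw->resets_sent++;                  /* half 1: direct reset */
    if (!model_queue_unjoin(hw))        /* half 2: hand the lock to unjoin */
        hw->tx_lock_depth--;            /* already queued: release now */
}

/* TX lock depth once the failure path and the queued worker have both run. */
int model_final_tx_depth(bool unjoin_already_queued)
{
    /* If work was already queued, its queuer holds a second reference. */
    struct model_hw hw = { unjoin_already_queued,
                           unjoin_already_queued ? 2 : 1, 0 };

    model_join_failure_path(&hw);
    model_run_unjoin(&hw);
    return hw.tx_lock_depth;
}

/* Self-check: the lock is balanced whether or not work was already queued. */
void model_failure_path_selfcheck(void)
{
    assert(model_final_tx_depth(false) == 0);
    assert(model_final_tx_depth(true) == 0);
}
```

The point of the model is the balance: either the queued worker releases the reference taken by the join attempt, or the failure path releases it itself when queueing reports that the work was already pending.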

Testing

Soak test pending — branch bes2600/wsm-join-confirm-reset to be integrated into the next per-series build iteration. Acceptance:

  • 8h uptime near the wohnzimmer 5 GHz AP
  • wsm_join_confirm ret 1 may still occur (firmware-side reject is independent), but no bes2600_sdio_read_rx_batch sdio read error and no wifi_force_close cascade following it
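
The second criterion could be checked mechanically against captured kernel log lines. The following is a minimal sketch, assuming the log has already been split into lines; the function name `soak_log_passes` and the sample arrays are hypothetical, and the matched substrings are the ones quoted in this PR.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Returns true when no cascade marker appears after a
 * "wsm_join_confirm ret 1" line. Markers seen before any reject are
 * ignored, since the criterion only covers what follows a reject.
 */
bool soak_log_passes(const char *const *lines, size_t n)
{
    bool seen_reject = false;

    for (size_t i = 0; i < n; i++) {
        if (strstr(lines[i], "wsm_join_confirm ret 1"))
            seen_reject = true;
        else if (seen_reject &&
                 (strstr(lines[i], "bes2600_sdio_read_rx_batch sdio read error") ||
                  strstr(lines[i], "wifi_force_close")))
            return false;
    }
    return true;
}

/* Sample inputs assembled from the log strings quoted in this PR. */
const char *const sample_clean[] = {
    "wsm_join_confirm ret 1",
    "deauthenticating from <bssid> by local choice (Reason: 2=PREV_AUTH_NOT_VALID)",
};
const char *const sample_cascade[] = {
    "wsm_join_confirm ret 1",
    "bes2600_sdio_read_rx_batch sdio read error",
};

/* Self-check: a reject alone passes, a reject followed by the SDIO error fails. */
void soak_log_selfcheck(void)
{
    assert(soak_log_passes(sample_clean, 2));
    assert(!soak_log_passes(sample_cascade, 2));
}
```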

Mobian flavor (this branch, on top of cleanups) and danctnix flavor (bes2600/join-confirm-failure-reset on top of bes2600/besser-danctnix-v3) are bit-identical except for the surrounding context per the dual-tree-flavors convention.

References

  • cw1200 ancestor: drivers/net/wireless/st/cw1200/sta.c:1339-1344
  • Issue: besser#25
  • Observed cascade: pkgrel=5 boot ID f70b826de75645dcb4b015a44cd4f785, kernel log 07:53-08:04

Signed-off-by: Claude (noether) <claude@reauktion.de>

marfrit added 1 commit 2026-05-21 08:44:19 +00:00
When wsm_join_confirm() returns status != WSM_STATUS_SUCCESS (ret 1),
the driver cleared its bookkeeping but did not reset the firmware
interface, leaving it in an intermediate post-rejection state.  A rapid
second JOIN attempt (e.g. wpa_supplicant retrying after the
PREV_AUTH_NOT_VALID deauth that mac80211 emits to clean up) hits an
inconsistent firmware context, causing bes2600_sdio_read_rx_batch to
return an SDIO error, which cascades into wifi_force_close:

  wsm_join_confirm ret 1
  deauthenticating from <bssid> by local choice (Reason: 2=PREV_AUTH_NOT_VALID)
  [~10 min later]
  bes2600_sdio_read_rx_batch sdio read error
  WARNING: at bes2600_tx_loop_set_enable / bes2600_chrdev_wifi_force_close

Two additions to the failure path in bes2600_join_work():

1. wsm_reset (WSM_REQ_ID_RESET, 0x000A) with reset_statistics=false.
   This returns the firmware to IDLE so the next association attempt
   starts from a known-clean state.  bes2600_unjoin_work() performs the
   same reset, but gates it on join_status != PASSIVE; after a failed
   JOIN join_status stays PASSIVE, so that path never fires — call
   wsm_reset directly here instead.

   Contract: wsm_reset takes only wsm_cmd_lock (not conf_lock, not
   wsm_oper_lock).  wsm_oper_unlock was already called inside
   wsm_join_confirm() before wsm_join() returned -EINVAL, so there is
   no re-entrancy hazard.  conf_lock is held at this call site, which is
   compatible with wsm_reset's locking requirements.

2. queue_work(workqueue, &priv->unjoin_work) instead of direct
   wsm_unlock_tx().  Serialises the next association attempt through
   the workqueue so it cannot race against lingering firmware-side
   effects of the failure.  If unjoin_work is already queued, release
   TX immediately (matching cw1200 ancestor sta.c:1344 comment "Tx lock
   still held, unjoin will clear it.").

Ancestor reference: drivers/net/wireless/st/cw1200/sta.c, function
cw1200_join_work(), lines 1339-1344.  cw1200 queues unjoin_work on join
failure for the same reason.  bes2600 needs the direct wsm_reset in
addition because its unjoin_work has the join_status gate that cw1200's
cw1200_do_unjoin() does not.

Signed-off-by: Claude (noether) <claude@reauktion.de>
marfrit merged commit 64fc309e26 into cleanups 2026-05-21 08:45:22 +00:00
marfrit deleted branch bes2600/wsm-join-confirm-reset 2026-05-21 08:45:22 +00:00

Reference: marfrit/bes2600-dkms#12