Compare commits

3 Commits

Author SHA1 Message Date
test0r 822a5f1bab bes2600: short-circuit wake handshake when chip is confirmed ACTIVE
The previous patch ("bes2600: gate PM indication completion on pending
request and track chip state") added enum bes2600_chip_pm_state and the
chip_pm_state field tracking what the host has *seen the firmware
confirm*. This patch makes the wake side use it.

Without this, every bes2600_pwr_device_exit_lp_mode() unconditionally
runs gpio_wake() + sbus_active() + wsm_set_operational_mode(active),
even when the chip is already in confirmed-ACTIVE state and the wake
sequence has nothing to do. The visible failure mode on PineTab2:

  bes2600_pwr_enter_lp_mode, wait pm ind timeout
  repeat set gpio_wake_flag, sub_sys:0
  bes2600_sdio_active failed, subsys:0
  bes2600_pwr_device_exit_lp_mode, active mcu fail

cycling every ~9 s, ~22 cycles in 10 minutes. Five pieces:

  1. enter_lp_mode timed out (firmware indication lost). With c6.1,
     chip_pm_state is now UNKNOWN.
  2. lock_device fires exit_lp_mode.
  3. gpio_wake hits "bit already set" because device_enter_lp_mode
     was skipped when the indication timed out, so gpio_sleep was
     never called - the bit reflects driver intent, not chip state.
     gpio_wake silently no-ops (no GPIO edge), bit stays set.
  4. sbus_active spends 200 x 2 ms looking for MCU_WAKEUP_READY that
     never comes (firmware was never told to wake), then fails.
  5. Driver continues to wsm_set_operational_mode against the wedged
     bus, compounding the failure.
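
The no-op in step 3 can be reproduced in a small user-space model. This
is a sketch, not driver code: struct wake_model and the model_* functions
are hypothetical stand-ins, and the only behaviour assumed is what this
series describes (one flag bit per subsystem, a rising edge only when the
GPIO was low, GPIO driven low only when no holder remains).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the driver's wake-flag bookkeeping. */
struct wake_model {
	uint32_t flags;      /* one bit per subsystem holding the wake */
	bool gpio_high;      /* modelled level of the wake GPIO */
	unsigned int edges;  /* rising edges actually driven */
};

/* Request wake for one subsystem; true if a rising edge was driven. */
bool model_gpio_wake(struct wake_model *m, unsigned int flag)
{
	assert(flag < 32);
	if (m->flags & (UINT32_C(1) << flag))
		return false;        /* bit already set: no-op, no edge */
	m->flags |= UINT32_C(1) << flag;
	if (m->gpio_high)
		return false;        /* another holder already keeps it high */
	m->gpio_high = true;
	m->edges++;
	return true;
}

/* Drop one subsystem's hold; the GPIO goes low only when nobody holds it. */
void model_gpio_sleep(struct wake_model *m, unsigned int flag)
{
	assert(flag < 32);
	m->flags &= ~(UINT32_C(1) << flag);
	if (m->flags == 0)
		m->gpio_high = false;
}
```

Calling model_gpio_wake() twice for the same subsystem without a
model_gpio_sleep() in between returns false the second time: no edge is
driven, which is the condition that leaves sbus_active polling for
MCU_WAKEUP_READY.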

This patch's three moves:

  * bes2600_pwr_device_exit_lp_mode() reads chip_pm_state at entry.
    On BES2600_CHIP_PM_ACTIVE, log at devel level and skip the
    gpio_wake / sbus_active handshake. The chip is in the state
    we want; the handshake exists only to drive a transition.
    wsm_set_operational_mode still runs: it is not part of the
    handshake, and skipping it wedges the bus on PineTab2.

  * On BES2600_CHIP_PM_LP or BES2600_CHIP_PM_UNKNOWN, run the wake
    handshake as before, but on sbus_active() failure set
    chip_pm_state = UNKNOWN and log at err level, so the next
    exit_lp_mode call runs the full wake sequence again.
    wsm_set_operational_mode is still sent afterwards, matching
    the previous behaviour; the WSM may succeed even if the SDIO
    active confirm was lost.

  * bes2600_gpio_wakeup_mcu() / bes2600_gpio_allow_mcu_sleep():
    demote "repeat set/clear gpio_wake_flag" from bes_err to
    bes_devel. Multi-subsystem wake-hold (e.g. WIFI + BT both want
    MCU awake) is the steady-state case, and the symmetric clear
    while bit-already-clear is racy bookkeeping rather than a
    hardware error. The wake-side early-return path also now ORs
    the bit in again. That is a no-op today, since the bit is
    already set whenever this path runs, but it keeps the
    bookkeeping explicit if the structure ever grows to per-flag
    refcounts.

Net effect on the cycle:

  * If chip is genuinely ACTIVE (chip_pm_state == ACTIVE), wake skips
    cleanly. Storm goes silent.
  * If chip is genuinely LP, behaviour is unchanged.
  * If chip is UNKNOWN (post-timeout state), one wake attempt is
    made; on failure, state stays UNKNOWN and the next call
    retries the full handshake. Repeated UNKNOWN with failed
    wake will eventually be picked up by the LMAC active-monitor
    and escalated to mmc_hw_reset (c5.2).

No new locks, no new state. Only consumption of the chip_pm_state
field added in the prerequisite patch.

Signed-off-by: Markus Fritsche <fritsche.markus@gmail.com>
2026-04-28 16:11:08 +02:00
test0r c57c77e446 bes2600: gate PM indication completion on pending request and track chip state
When mac80211 toggles PSM on the BES2600, the host sends WSM set_pm
and waits up to 5 s on bes_power.pm_enter_cmpl for a firmware-side
PM-changed indication confirming the transition. Three sequenced
flaws make the wait-and-confirm racy and leave host/chip bookkeeping
desynced when anything misfires:

  1) bes2600_pwr_notify_ps_changed() unconditionally fires
     complete(pm_enter_cmpl) for any non-active psmode. It does not
     check whether a host-initiated set_pm is actually pending. A
     spontaneous indication (firmware-internal coex move,
     idle-driven aging) primes the completion, and the next host-
     driven enter_lp_mode sees a false success on its first
     wait_for_completion_timeout.

  2) The wait/reinit ordering in bes2600_pwr_enter_lp_mode is

         status = wait_for_completion_timeout(...);
         atomic_set(pm_set_in_process, 0);
         reinit_completion(...);

     If an indication arrives between wait_for_completion_timeout
     returning with status==1 and reinit_completion, the next
     enter_lp_mode iteration's wait can also see false success. The
     reinit must happen *before* we start the new request, not
     after handling the previous one.

  3) On wait_pm_ind timeout, the driver returns -ETIMEDOUT and walks
     away. It does not record that the firmware's actual PM state
     is no longer known to the host. Subsequent wake paths
     (gpio_wake / sbus_active) assume the chip is still active and
     hit deterministic SDIO failures when the firmware has
     transitioned anyway.

This patch is the safe-prerequisite half of a wider fix:

  * bes_pwr.h gains enum bes2600_chip_pm_state {ACTIVE, LP, UNKNOWN}
    and bes_power.chip_pm_state. Its job is to track what the host
    has *seen the firmware confirm*, not what the host has
    requested. Initialised to ACTIVE in bes2600_pwr_init().

  * bes2600_pwr_notify_ps_changed() unconditionally updates
    chip_pm_state on every indication, but only fires
    complete(pm_enter_cmpl) when atomic_cmpxchg(pm_set_in_process,
    1, 0) succeeds. A spontaneous indication can no longer prime a
    waiter that will only set up its request afterwards.

  * bes2600_pwr_enter_lp_mode() now reinit_completion()s before
    setting pm_set_in_process and sending wsm_set_pm. After a
    timeout, it cmpxchgs pm_set_in_process back to 0 (so a late
    indication cannot prime the next iteration) and on the win-
    cmpxchg branch records chip_pm_state=UNKNOWN.
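
The gating rule in the bullets above can be sketched with C11 atomics in
user space. Everything here is a hypothetical model: struct pm_model,
claim_pending() and the completions counter stand in for bes_power,
atomic_cmpxchg(pm_set_in_process, 1, 0) and complete(pm_enter_cmpl). It
is single-threaded and only illustrates who is allowed to fire the
completion.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

enum chip_state { CHIP_ACTIVE, CHIP_LP, CHIP_UNKNOWN };

/* Hypothetical model of the fields this patch touches. */
struct pm_model {
	atomic_int pending;  /* 1 while a host set_pm awaits its indication */
	atomic_int state;    /* last state the firmware confirmed */
	int completions;     /* stands in for complete(pm_enter_cmpl) */
};

void pm_model_init(struct pm_model *m)
{
	atomic_init(&m->pending, 0);
	atomic_init(&m->state, CHIP_ACTIVE);
	m->completions = 0;
}

/* Swap pending 1 -> 0; true only for the caller that wins the swap. */
bool claim_pending(struct pm_model *m)
{
	int expected = 1;
	return atomic_compare_exchange_strong(&m->pending, &expected, 0);
}

/* Host side: reinit first, then mark the request as pending. */
void host_send_set_pm(struct pm_model *m)
{
	m->completions = 0;
	atomic_store(&m->pending, 1);
}

/* Firmware indication: always record state, complete only if pending. */
void fw_indication(struct pm_model *m, bool low_power)
{
	atomic_store(&m->state, low_power ? CHIP_LP : CHIP_ACTIVE);
	if (low_power && claim_pending(m))
		m->completions++;
}

/* Host timeout: true for a real timeout, false if the indication won. */
bool host_timeout(struct pm_model *m)
{
	if (!claim_pending(m))
		return false;
	atomic_store(&m->state, CHIP_UNKNOWN);
	return true;
}
```

A spontaneous fw_indication() before host_send_set_pm() updates the
state but leaves completions at zero, and an indication that arrives
after host_timeout() has claimed the flag cannot prime the next wait.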

A follow-up patch consumes chip_pm_state on the wake side
(bes2600_pwr_device_exit_lp_mode + bes2600_gpio_wakeup_mcu) to fix
the deterministic "active mcu fail" cycle, which this state record
makes fixable. Splitting the work this way keeps the lock-free
race fix small and reviewable on its own.

No new locks, no behaviour change on the success path. Only the
recovery path (timeout + spontaneous indication) gains correctness.

Signed-off-by: Markus Fritsche <fritsche.markus@gmail.com>
2026-04-28 16:11:08 +02:00
test0r db4ea70fb5 bes2600: widen scan-defer backoff to 30s and decay count on quiet
The scan-defer logic added in the previous patch ("bes2600: defer
scan and soften WARN on firmware reject") used a 10-second backoff
window and never cleared reject_count outside of a successful scan.
Field testing on a PineTab2 (linux-pinetab2 6.19.10-danctnix1) shows
two distinct mac80211 scan-retry cadences in practice:

  * Idle background scans every ~5 minutes when associated -- well
    outside any plausible backoff, the defer guard correctly falls
    through to a real WSM scan attempt.

  * Roam-evaluation bursts triggered when mac80211 wants to find a
    candidate AP for handover (signal degradation, beacon loss,
    locally-generated DEAUTH_LEAVING reason=3). Cadence is ~12 s, and
    one boot reproduced 14 such rejected scans in 3 minutes during a
    single burst, none of which engaged the defer guard because every
    retry landed just outside the 10 s window.

Two-line behaviour change to fix that:

  1. BES2600_SCAN_BACKOFF_JIFFIES grows from 10*HZ to 30*HZ, so a
     12 s-cadence burst stays inside the window across consecutive
     rejects and the third reject in the burst trips the threshold
     guard. The 5 min idle case is still naturally past the window
     and is unaffected.

  2. bes2600_scan_should_defer() resets reject_count to 0 when
     time_after(jiffies, backoff_until). Without this, reject_count
     accumulated indefinitely across the slow-cadence rejects, so an
     isolated reject after long quiet would have tripped the
     threshold the moment it arrived. After the change, count is
     latched only inside an active burst and decays cleanly when the
     burst ends.

Net effect on a roam burst:

  * t=0   reject #1 (count 1, backoff_until = t0 + 30s)
  * t=12  reject #2 (count 2, backoff_until = t1 + 30s)
  * t=24  reject #3 (count 3, threshold met, next scan deferred)
  * t=36  defer fires, no WSM round-trip, reject not sent
  * ...   defers continue until the firmware-policy state clears
  * scan succeeds -> reject_count = 0, normal cadence resumes
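
The first four attempts of this timeline can be checked with a short
simulation. It is a sketch under stated assumptions: times are whole
seconds rather than jiffies, every non-deferred scan is rejected by the
firmware, and struct scan_model, scan_attempt() and run_burst() are
hypothetical names. Only the threshold, the window lengths and the
reset-on-expiry rule come from this patch.

```c
#include <assert.h>
#include <stdbool.h>

#define REJECT_THRESHOLD 3  /* mirrors BES2600_SCAN_REJECT_THRESHOLD */

/* Hypothetical model of the scan-defer bookkeeping. */
struct scan_model {
	long window;         /* backoff length in seconds (10 or 30) */
	long backoff_until;  /* end of the current backoff window */
	int reject_count;    /* consecutive rejects inside the burst */
	int rejects;         /* scans sent to firmware and rejected */
	int defers;          /* scans dropped by the defer guard */
};

/* Same shape as the patched guard: decay on expiry, then threshold check. */
bool should_defer(struct scan_model *s, long now)
{
	if (now > s->backoff_until)       /* time_after() */
		s->reject_count = 0;
	return s->reject_count >= REJECT_THRESHOLD &&
	       now < s->backoff_until;    /* time_before() */
}

/* One scan request; assumes the firmware rejects anything we send. */
void scan_attempt(struct scan_model *s, long now)
{
	if (should_defer(s, now)) {
		s->defers++;
		return;
	}
	s->rejects++;
	s->reject_count++;
	s->backoff_until = now + s->window;
}

/* Replay a burst of evenly spaced scan requests starting at t = 0. */
void run_burst(struct scan_model *s, long window, int attempts, long cadence)
{
	*s = (struct scan_model){ .window = window };
	for (int i = 0; i < attempts; i++)
		scan_attempt(s, i * cadence);
}
```

run_burst(&s, 30, 4, 12) ends with three rejects and one defer, matching
the timeline; with a 10 s window the same four attempts are all sent and
all rejected.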

WSM 0x0007 confirm rejections in a burst drop from ~14 to ~3 (just
the scans needed to reach the threshold). wpa_supplicant's reason=3
locally-generated disconnects driven by exhausted roam candidates
during the same burst window also drop.

No new state, no new symbols, no change to mac80211-facing semantics:
the deferred scan still completes via the existing fail: path with
status=-EBUSY, the same response a real firmware-busy would produce.

Signed-off-by: Markus Fritsche <fritsche.markus@gmail.com>
2026-04-28 14:33:00 +02:00
4 changed files with 177 additions and 20 deletions
+13 -2
@@ -1388,7 +1388,14 @@ static void bes2600_gpio_wakeup_mcu(struct sbus_priv *self, int flag)
 	/* error check */
 	if((self->gpio_wakup_flags & BIT(flag)) != 0) {
-		bes_err( "repeat set gpio_wake_flag, sub_sys:%d", flag);
+		/*
+		 * Multiple subsystems holding wake is the steady-state case
+		 * (e.g. WIFI + BT both want MCU awake). Demoted from bes_err
+		 * to bes_devel since it isn't an error - the GPIO is already
+		 * asserted high and the subsystem is now also tracked.
+		 */
+		bes_devel("repeat set gpio_wake_flag, sub_sys:%d\n", flag);
+		self->gpio_wakup_flags |= BIT(flag);
 		mutex_unlock(&self->io_mutex);
 		return;
 	}
@@ -1420,7 +1427,11 @@ static void bes2600_gpio_allow_mcu_sleep(struct sbus_priv *self, int flag)
 	/* error check */
 	if((self->gpio_wakup_flags & BIT(flag)) == 0) {
-		bes_err( "repeat clear gpio_wake_flag, sub_sys:%d", flag);
+		/*
+		 * Mirror of the wake path: a clear when the bit is already
+		 * clear is racy bookkeeping, not a hardware error.
+		 */
+		bes_devel("repeat clear gpio_wake_flag, sub_sys:%d\n", flag);
 		mutex_unlock(&self->io_mutex);
 		return;
 	}
+134 -16
@@ -524,7 +524,17 @@ static int bes2600_pwr_enter_lp_mode(struct bes2600_common *hw_priv)
 				bes_devel("%s, psMode:%s, fastPsmIdlePeriod:%d apPsmChangePeriod:%d minAutoPsPollPeriod:%d\n",
 					__func__, bes2600_get_ps_mode_str(priv->powersave_mode.pmMode), priv->powersave_mode.fastPsmIdlePeriod,
 					priv->powersave_mode.apPsmChangePeriod, priv->powersave_mode.minAutoPsPollPeriod);
+				/*
+				 * Reinit BEFORE the WSM goes out, so a stale
+				 * indication from a previous cycle cannot have
+				 * primed pm_enter_cmpl. From here until the
+				 * indication callback's cmpxchg(1->0) on
+				 * pm_set_in_process, only the indication for
+				 * THIS request can complete the wait.
+				 */
+				reinit_completion(&hw_priv->bes_power.pm_enter_cmpl);
 				atomic_set(&hw_priv->bes_power.pm_set_in_process, 1);
 				ret = bes2600_set_pm(priv, &priv->powersave_mode);
 				if (ret) {
 					atomic_set(&hw_priv->bes_power.pm_set_in_process, 0);
@@ -535,11 +545,33 @@ static int bes2600_pwr_enter_lp_mode(struct bes2600_common *hw_priv)
 				/* wait power save mode changed indication */
 				status = wait_for_completion_timeout(&hw_priv->bes_power.pm_enter_cmpl, 5 * HZ);
-				atomic_set(&hw_priv->bes_power.pm_set_in_process, 0);
-				reinit_completion(&hw_priv->bes_power.pm_enter_cmpl);
 				if (!status) {
-					bes_devel("%s, wait pm ind timeout\n", __func__);
-					timeouts++;
+					/*
+					 * The indication callback only fires
+					 * complete() when it observes
+					 * pm_set_in_process == 1; cmpxchg it
+					 * to 0 here so a late indication
+					 * cannot prime the next wait.
+					 *
+					 * If we win the cmpxchg, this is a
+					 * real timeout: the firmware's PS
+					 * state is unknown to us. Mark it as
+					 * such so the next wake path can
+					 * probe before assuming the chip is
+					 * still active.
+					 *
+					 * If we lose the cmpxchg, the
+					 * indication arrived between the
+					 * wait timing out and us getting
+					 * here; treat as success.
+					 */
+					if (atomic_cmpxchg(&hw_priv->bes_power.pm_set_in_process,
+							   1, 0) == 1) {
+						bes_devel("%s, wait pm ind timeout\n", __func__);
+						atomic_set(&hw_priv->bes_power.chip_pm_state,
+							   BES2600_CHIP_PM_UNKNOWN);
+						timeouts++;
+					}
 				}
 			} else {
 				bes_devel("skip enter lp mode\n");
@@ -554,10 +586,34 @@ static int bes2600_pwr_enter_lp_mode(struct bes2600_common *hw_priv)
 	 * in an inconsistent state that cascades into SDIO TX errors on
 	 * the BES2600.
 	 */
-	if (timeouts == 0)
+	if (timeouts == 0) {
 		bes2600_pwr_device_enter_lp_mode(hw_priv);
-	else
+	} else {
+		/*
+		 * device_enter_lp_mode() was skipped (one or more VIFs
+		 * timed out waiting for the firmware indication) so its
+		 * gpio_sleep(MCU) - which drops the wake-flag bit and, if
+		 * no other subsystem holds the wake, drives the GPIO low -
+		 * never ran. Without it the bit stays asserted, and the
+		 * next bes2600_pwr_device_exit_lp_mode() calls
+		 * gpio_wake(MCU) into a "bit already set" no-op: the GPIO
+		 * never re-edges, sbus_active() exhausts its 200x2ms
+		 * MCU_WAKEUP_READY budget against an unwoken chip, and
+		 * the first TX after idle stalls for several seconds.
+		 *
+		 * Drop the MCU wake-flag bit explicitly here so the next
+		 * wake injects a real GPIO edge. gpio_allow_mcu_sleep
+		 * preserves multi-subsystem semantics: it only drives the
+		 * GPIO low when no other subsystem still holds wake; if
+		 * BT or another holder is keeping the chip awake, the
+		 * GPIO stays high and the bit clear here is purely
+		 * bookkeeping (so the next gpio_wake doesn't no-op).
+		 */
+		if (hw_priv->sbus_ops->gpio_sleep)
+			hw_priv->sbus_ops->gpio_sleep(hw_priv->sbus_priv,
+						      GPIO_WAKE_FLAG_MCU);
 		ret = -ETIMEDOUT;
+	}
 	return ret;
 }
@@ -565,19 +621,61 @@ static int bes2600_pwr_enter_lp_mode(struct bes2600_common *hw_priv)
 static void bes2600_pwr_device_exit_lp_mode(struct bes2600_common *hw_priv)
 {
 	int ret = 0;
+	enum bes2600_chip_pm_state state;
 	struct wsm_operational_mode mode = {
 		.power_mode = wsm_power_mode_active,
 		.disableMoreFlagUsage = true,
 	};
-	bes_devel("host lock lmac\n");
-	if(hw_priv->sbus_ops->gpio_wake)
-		hw_priv->sbus_ops->gpio_wake(hw_priv->sbus_priv, GPIO_WAKE_FLAG_MCU);
+	/*
+	 * Consult chip_pm_state set by bes2600_pwr_notify_ps_changed().
+	 * If we last saw the firmware confirm ACTIVE, skip ONLY the
+	 * gpio_wake + sbus_active wake handshake - the GPIO is already
+	 * asserted high and the SDIO MCU subsystem is already running,
+	 * so another sbus_active() round-trip just hits its 200x2ms
+	 * timeout because the firmware has nothing to do.
+	 *
+	 * wsm_set_operational_mode() below is NOT part of the wake
+	 * handshake; it is the operational-mode setter the firmware
+	 * tracks per call. Skipping it leaves the chip's SDIO state
+	 * machine without a fresh operational-mode update, which on
+	 * PineTab2 wedges the bus (-EBUSY on next sdio_rx_work read)
+	 * within a few seconds of probe completion. So it must run
+	 * unconditionally.
+	 */
+	state = atomic_read(&hw_priv->bes_power.chip_pm_state);
+	if (state == BES2600_CHIP_PM_ACTIVE) {
+		bes_devel("device_exit_lp_mode: chip already ACTIVE, skipping wake handshake\n");
+	} else {
+		bes_devel("host lock lmac\n");
+		if (hw_priv->sbus_ops->gpio_wake)
+			hw_priv->sbus_ops->gpio_wake(hw_priv->sbus_priv,
+						     GPIO_WAKE_FLAG_MCU);
-	if(hw_priv->sbus_ops->sbus_active) {
-		ret = hw_priv->sbus_ops->sbus_active(hw_priv->sbus_priv, SUBSYSTEM_MCU);
-		if (ret)
-			bes_err("%s, active mcu fail\n", __func__);
+		if (hw_priv->sbus_ops->sbus_active) {
+			ret = hw_priv->sbus_ops->sbus_active(hw_priv->sbus_priv,
+							     SUBSYSTEM_MCU);
+			if (ret) {
+				/*
+				 * MCU_WAKEUP_READY did not arrive within
+				 * the SDIO handshake window. Record state
+				 * as UNKNOWN so the next exit_lp_mode call
+				 * also runs the full wake sequence (no
+				 * skip), but still send operational_mode
+				 * below to match pre-c6 behaviour - the
+				 * WSM may succeed even if the SDIO active
+				 * confirm was lost, and if it fails too,
+				 * we just emit a second devel-level error.
+				 * Repeated UNKNOWN is the signal for the
+				 * LMAC active-monitor to eventually
+				 * escalate to bus_reset (c5.2's
+				 * mmc_hw_reset path).
+				 */
+				bes_err("%s, active mcu fail\n", __func__);
+				atomic_set(&hw_priv->bes_power.chip_pm_state,
+					   BES2600_CHIP_PM_UNKNOWN);
+			}
+		}
 	}
 	ret = wsm_set_operational_mode(hw_priv, &mode, 0);
@@ -833,6 +931,7 @@ void bes2600_pwr_init(struct bes2600_common *hw_priv)
 	hw_priv->bes_power.power_up_task = NULL;
 	mutex_init(&hw_priv->bes_power.pwr_mutex);
 	atomic_set(&hw_priv->bes_power.dev_state, 0);
+	atomic_set(&hw_priv->bes_power.chip_pm_state, BES2600_CHIP_PM_UNKNOWN);
 	init_completion(&hw_priv->bes_power.pm_enter_cmpl);
 	sema_init(&hw_priv->bes_power.sync_lock, 1);
 	device_set_wakeup_capable(hw_priv->pdev, true);
@@ -1213,9 +1312,28 @@ int bes2600_pwr_clear_busy_event(struct bes2600_common *hw_priv, u32 event)
 void bes2600_pwr_notify_ps_changed(struct bes2600_common *hw_priv, u8 psmode)
 {
-	if((psmode & 0x01) != WSM_PSM_ACTIVE) {
-		bes_devel("complete pm_enter_cmpl\n");
-		complete(&hw_priv->bes_power.pm_enter_cmpl);
+	/*
+	 * The firmware sends a PM-changed indication for every transition,
+	 * including ones we didn't ask for (firmware-internal coex moves,
+	 * idle-driven aging). Update chip_pm_state unconditionally so the
+	 * wake path can use it, but only fire pm_enter_cmpl when a host-
+	 * initiated set_pm is actually in flight - otherwise a stale
+	 * indication can prime a future wait against a freshly
+	 * reinit_completion()'ed state.
+	 */
+	if ((psmode & 0x01) != WSM_PSM_ACTIVE) {
+		atomic_set(&hw_priv->bes_power.chip_pm_state,
+			   BES2600_CHIP_PM_LP);
+		if (atomic_cmpxchg(&hw_priv->bes_power.pm_set_in_process,
+				   1, 0) == 1) {
+			bes_devel("complete pm_enter_cmpl\n");
+			complete(&hw_priv->bes_power.pm_enter_cmpl);
+		} else {
+			bes_devel("PM ind (LP) without pending wait; state recorded\n");
+		}
+	} else {
+		atomic_set(&hw_priv->bes_power.chip_pm_state,
+			   BES2600_CHIP_PM_ACTIVE);
 	}
 }
+15
@@ -64,6 +64,20 @@ enum power_down_state
 	POWER_DOWN_STATE_UNLOCKED,
 };
+/*
+ * Confirmed PM state of the firmware-side chip. Tracks what the host
+ * has *seen* the firmware acknowledge, not what the host has
+ * requested. UNKNOWN means a host-initiated transition timed out
+ * before the firmware indication arrived; the next wake path should
+ * treat it as "we don't know" and probe before issuing GPIO/SDIO
+ * wakeup ops.
+ */
+enum bes2600_chip_pm_state {
+	BES2600_CHIP_PM_ACTIVE = 0,
+	BES2600_CHIP_PM_LP,
+	BES2600_CHIP_PM_UNKNOWN,
+};
 typedef void (*bes_pwr_enter_lp_cb)(struct bes2600_common *hw_priv);
 typedef void (*bes_pwr_exit_lp_cb)(struct bes2600_common *hw_priv);
@@ -106,6 +120,7 @@ struct bes2600_pwr_t
 	bool ap_lp_bad;
 	struct bes2600_pwr_event_t pwr_events[BES2600_DELAY_EVENT_NUM];
 	atomic_t pm_set_in_process;
+	atomic_t chip_pm_state;
 };
 #ifdef CONFIG_BES2600_WOWLAN
+15 -2
@@ -22,9 +22,17 @@
  * After this many consecutive WSM scan rejections from firmware, stop
  * issuing new scans for BES2600_SCAN_BACKOFF_JIFFIES and let the state
  * that's rejecting them (coex window, firmware-internal busy) clear.
+ *
+ * The backoff has to be at least as long as the natural mac80211 scan-
+ * retry cadence, otherwise the next attempt lands outside the window
+ * and bypasses the defer guard. Observed in the wild on PineTab2:
+ * roam-evaluation bursts at ~12 s cadence, idle background scans at
+ * ~5 min cadence. 30 s catches the burst and leaves the slow case
+ * alone (the firmware-policy state has had minutes to clear by then
+ * anyway).
  */
 #define BES2600_SCAN_REJECT_THRESHOLD 3
-#define BES2600_SCAN_BACKOFF_JIFFIES (10 * HZ)
+#define BES2600_SCAN_BACKOFF_JIFFIES (30 * HZ)
 static void bes2600_scan_restart_delayed(struct bes2600_vif *priv);
@@ -40,7 +48,9 @@ static void bes2600_scan_restart_delayed(struct bes2600_vif *priv);
  * 2. We already saw >= BES2600_SCAN_REJECT_THRESHOLD consecutive
  *    rejections on recent scan attempts and the backoff window has
  *    not yet elapsed. Whatever was rejecting them is likely still
- *    rejecting them; give it time.
+ *    rejecting them; give it time. If the backoff has elapsed without
+ *    a fresh reject refreshing it, the burst is over and we reset the
+ *    count so an isolated reject doesn't immediately re-trip.
  *
  * Returns true if the caller should abandon the scan iteration.
  */
@@ -51,6 +61,9 @@ static bool bes2600_scan_should_defer(struct bes2600_common *hw_priv)
 		return true;
 #endif
+	if (time_after(jiffies, hw_priv->scan.backoff_until))
+		hw_priv->scan.reject_count = 0;
 	if (hw_priv->scan.reject_count >= BES2600_SCAN_REJECT_THRESHOLD &&
 	    time_before(jiffies, hw_priv->scan.backoff_until))
 		return true;