bes2600: gate PM indication completion on pending request and track chip state

When mac80211 toggles PSM on the BES2600, the host sends WSM set_pm and
waits up to 5 s on bes_power.pm_enter_cmpl for a firmware-side
PM-changed indication confirming the transition. Three flaws make this
wait-and-confirm racy and leave host/chip bookkeeping out of sync when
anything misfires:

1) bes2600_pwr_notify_ps_changed() unconditionally fires
   complete(pm_enter_cmpl) for any non-active psmode. It does not check
   whether a host-initiated set_pm is actually pending. A spontaneous
   indication (firmware-internal coex move, idle-driven aging) primes
   the completion, and the next host-driven enter_lp_mode sees a false
   success on its first wait_for_completion_timeout().

2) The wait/reinit ordering in bes2600_pwr_enter_lp_mode() is

       status = wait_for_completion_timeout(...);
       atomic_set(pm_set_in_process, 0);
       reinit_completion(...);

   If an indication arrives between wait_for_completion_timeout()
   returning with status==1 and reinit_completion(), the next
   enter_lp_mode iteration's wait can also see a false success. The
   reinit must happen *before* we start the new request, not after
   handling the previous one.

3) On a wait_pm_ind timeout, the driver returns -ETIMEDOUT and walks
   away. It does not record that the firmware's actual PM state is no
   longer known to the host. Subsequent wake paths (gpio_wake /
   sbus_active) assume the chip is still active and hit deterministic
   SDIO failures when the firmware has transitioned anyway.

This patch is the safe-prerequisite half of a wider fix:

* bes_pwr.h gains enum bes2600_chip_pm_state {ACTIVE, LP, UNKNOWN} and
  bes_power.chip_pm_state. Its job is to track what the host has *seen
  the firmware confirm*, not what the host has requested. Initialised
  to ACTIVE in bes2600_pwr_init().

* bes2600_pwr_notify_ps_changed() unconditionally updates chip_pm_state
  on every indication, but only fires complete(pm_enter_cmpl) when
  atomic_cmpxchg(pm_set_in_process, 1, 0) succeeds. A spontaneous
  indication can no longer prime a waiter that will only set up its
  request afterwards.

* bes2600_pwr_enter_lp_mode() now calls reinit_completion() before
  setting pm_set_in_process and sending wsm_set_pm. After a timeout, it
  cmpxchgs pm_set_in_process back to 0 (so a late indication cannot
  prime the next iteration) and, on the branch that wins the cmpxchg,
  records chip_pm_state=UNKNOWN.

A follow-up patch consumes chip_pm_state on the wake side
(bes2600_pwr_device_exit_lp_mode + bes2600_gpio_wakeup_mcu) to fix the
deterministic "active mcu fail" cycle; recording the state here is what
makes that fix possible. Splitting the work this way keeps the
lock-free race fix small and reviewable on its own.

No new locks, no behaviour change on the success path. Only the
recovery path (timeout + spontaneous indication) gains correctness.

Signed-off-by: Markus Fritsche <fritsche.markus@gmail.com>
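
The gating described in the message above can be modelled outside the kernel. The following is a minimal userspace sketch, not the driver code: it assumes C11 atomics in place of atomic_t, a plain counter in place of struct completion, and hypothetical names (pm_model, model_*) that do not exist in the driver. It only shows who is allowed to fire the completion and when the state becomes UNKNOWN.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Mirrors enum bes2600_chip_pm_state; the names here are illustrative. */
enum chip_pm_state { CHIP_PM_ACTIVE, CHIP_PM_LP, CHIP_PM_UNKNOWN };

/* Hypothetical stand-in for the relevant bes_power fields. */
struct pm_model {
	atomic_int set_in_process; /* 1 while a host set_pm awaits its indication */
	atomic_int chip_pm_state;  /* last state the firmware was seen to confirm */
	atomic_int done;           /* stand-in for the completion's "done" count */
};

static void model_init(struct pm_model *m, enum chip_pm_state initial)
{
	atomic_init(&m->set_in_process, 0);
	atomic_init(&m->chip_pm_state, initial);
	atomic_init(&m->done, 0);
}

/* Equivalent of atomic_cmpxchg(pm_set_in_process, 1, 0) == 1. */
static bool model_claim_pending(struct pm_model *m)
{
	int expected = 1;

	return atomic_compare_exchange_strong(&m->set_in_process, &expected, 0);
}

/* Host side: reinit first, then mark the request pending, then send set_pm. */
static void model_begin_request(struct pm_model *m)
{
	atomic_store(&m->done, 0);           /* reinit_completion() */
	atomic_store(&m->set_in_process, 1);
}

/* Firmware indication: always record state, complete only a pending request. */
static void model_indication(struct pm_model *m, bool low_power)
{
	atomic_store(&m->chip_pm_state, low_power ? CHIP_PM_LP : CHIP_PM_ACTIVE);
	if (low_power && model_claim_pending(m))
		atomic_fetch_add(&m->done, 1); /* complete() */
}

/*
 * Host side after the wait expired. Returns true for a real timeout (the
 * host won the cmpxchg and the chip state is now UNKNOWN), false if the
 * indication slipped in first and already claimed the request.
 */
static bool model_wait_timed_out(struct pm_model *m)
{
	if (!model_claim_pending(m))
		return false;
	atomic_store(&m->chip_pm_state, CHIP_PM_UNKNOWN);
	return true;
}
```

In this model exactly one side can move set_in_process from 1 to 0, so an indication that arrives with no request pending, or after the host has already given up, updates chip_pm_state but never touches the completion counter.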
bes2600: widen scan-defer backoff to 30s and decay count on quiet
2026-04-28 16:11:08 +02:00 · 2026-04-28 14:33:00 +02:00 · 2026-04-24 23:53:05 +02:00
5 changed files with 196 additions and 11 deletions
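
The scan-defer policy named in the second commit title (a 30 s backoff after repeated rejections, with the count reset once the window lapses quietly) can be sketched in userspace as follows. This is an illustration under assumptions, not the driver code: the tick rate, the scan_backoff struct, and the scan_* and tick_after helpers are hypothetical, the BT coex check is left out, and the driver additionally refreshes the window on deferred iterations.

```c
#include <stdbool.h>

#define TICKS_PER_SEC    100UL                   /* assumed tick rate, stands in for HZ */
#define REJECT_THRESHOLD 3U                      /* consecutive rejects before deferring */
#define BACKOFF_TICKS    (30UL * TICKS_PER_SEC)  /* 30 s window */

/* Hypothetical stand-in for the two fields added to struct bes2600_scan. */
struct scan_backoff {
	unsigned int reject_count;
	unsigned long backoff_until;
};

/* Wrap-safe "a is later than b", the same idea as the kernel's time_after(). */
static bool tick_after(unsigned long a, unsigned long b)
{
	return (long)(b - a) < 0;
}

/* True if the caller should skip this scan attempt. */
static bool scan_should_defer(struct scan_backoff *s, unsigned long now)
{
	/* The window lapsed with no fresh reject: the burst is over, decay. */
	if (tick_after(now, s->backoff_until))
		s->reject_count = 0;

	return s->reject_count >= REJECT_THRESHOLD &&
	       tick_after(s->backoff_until, now);
}

/* A rejection bumps the count and restarts the window from "now". */
static void scan_note_reject(struct scan_backoff *s, unsigned long now)
{
	s->reject_count++;
	s->backoff_until = now + BACKOFF_TICKS;
}

/* A successful scan start clears the count immediately. */
static void scan_note_success(struct scan_backoff *s)
{
	s->reject_count = 0;
}
```

With these definitions, three rejections in quick succession cause later attempts to be deferred until 30 s after the last rejection; the first attempt after that point resets the count, so one isolated rejection later on does not re-trip the guard.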
@@ -524,7 +524,17 @@ static int bes2600_pwr_enter_lp_mode(struct bes2600_common *hw_priv)
 				bes_devel("%s, psMode:%s, fastPsmIdlePeriod:%d apPsmChangePeriod:%d minAutoPsPollPeriod:%d\n",
 						__func__, bes2600_get_ps_mode_str(priv->powersave_mode.pmMode), priv->powersave_mode.fastPsmIdlePeriod,
 						priv->powersave_mode.apPsmChangePeriod, priv->powersave_mode.minAutoPsPollPeriod);
+				/*
+				 * Reinit BEFORE the WSM goes out, so a stale
+				 * indication from a previous cycle cannot have
+				 * primed pm_enter_cmpl. From here until the
+				 * indication callback's cmpxchg(1->0) on
+				 * pm_set_in_process, only the indication for
+				 * THIS request can complete the wait.
+				 */
+				reinit_completion(&hw_priv->bes_power.pm_enter_cmpl);
 				atomic_set(&hw_priv->bes_power.pm_set_in_process, 1);
+
 				ret = bes2600_set_pm(priv, &priv->powersave_mode);
 				if (ret) {
 					atomic_set(&hw_priv->bes_power.pm_set_in_process, 0);
@@ -535,11 +545,33 @@ static int bes2600_pwr_enter_lp_mode(struct bes2600_common *hw_priv)

 				/* wait power save mode changed indication */
 				status = wait_for_completion_timeout(&hw_priv->bes_power.pm_enter_cmpl, 5 * HZ);
-				atomic_set(&hw_priv->bes_power.pm_set_in_process, 0);
-				reinit_completion(&hw_priv->bes_power.pm_enter_cmpl);
 				if (!status) {
-					bes_devel("%s, wait pm ind timeout\n", __func__);
-					timeouts++;
+					/*
+					 * The indication callback only fires
+					 * complete() when it observes
+					 * pm_set_in_process == 1; cmpxchg it
+					 * to 0 here so a late indication
+					 * cannot prime the next wait.
+					 *
+					 * If we win the cmpxchg, this is a
+					 * real timeout: the firmware's PS
+					 * state is unknown to us. Mark it as
+					 * such so the next wake path can
+					 * probe before assuming the chip is
+					 * still active.
+					 *
+					 * If we lose the cmpxchg, the
+					 * indication arrived between the
+					 * wait timing out and us getting
+					 * here; treat as success.
+					 */
+					if (atomic_cmpxchg(&hw_priv->bes_power.pm_set_in_process,
+							   1, 0) == 1) {
+						bes_devel("%s, wait pm ind timeout\n", __func__);
+						atomic_set(&hw_priv->bes_power.chip_pm_state,
+							   BES2600_CHIP_PM_UNKNOWN);
+						timeouts++;
+					}
 				}
 			} else {
 				bes_devel("skip enter lp mode\n");
@@ -554,10 +586,34 @@ static int bes2600_pwr_enter_lp_mode(struct bes2600_common *hw_priv)
 	 * in an inconsistent state that cascades into SDIO TX errors on
 	 * the BES2600.
 	 */
-	if (timeouts == 0)
+	if (timeouts == 0) {
 		bes2600_pwr_device_enter_lp_mode(hw_priv);
-	else
+	} else {
+		/*
+		 * device_enter_lp_mode() was skipped (one or more VIFs
+		 * timed out waiting for the firmware indication) so its
+		 * gpio_sleep(MCU) - which drops the wake-flag bit and, if
+		 * no other subsystem holds the wake, drives the GPIO low -
+		 * never ran. Without it the bit stays asserted, and the
+		 * next bes2600_pwr_device_exit_lp_mode() calls
+		 * gpio_wake(MCU) into a "bit already set" no-op: the GPIO
+		 * never re-edges, sbus_active() exhausts its 200x2ms
+		 * MCU_WAKEUP_READY budget against an unwoken chip, and
+		 * the first TX after idle stalls for several seconds.
+		 *
+		 * Drop the MCU wake-flag bit explicitly here so the next
+		 * wake injects a real GPIO edge. gpio_allow_mcu_sleep
+		 * preserves multi-subsystem semantics: it only drives the
+		 * GPIO low when no other subsystem still holds wake; if
+		 * BT or another holder is keeping the chip awake, the
+		 * GPIO stays high and the bit clear here is purely
+		 * bookkeeping (so the next gpio_wake doesn't no-op).
+		 */
+		if (hw_priv->sbus_ops->gpio_sleep)
+			hw_priv->sbus_ops->gpio_sleep(hw_priv->sbus_priv,
+						      GPIO_WAKE_FLAG_MCU);
 		ret = -ETIMEDOUT;
+	}

 	return ret;
 }
@@ -833,6 +889,7 @@ void bes2600_pwr_init(struct bes2600_common *hw_priv)
        hw_priv->bes_power.power_up_task = NULL;
        mutex_init(&hw_priv->bes_power.pwr_mutex);
        atomic_set(&hw_priv->bes_power.dev_state, 0);
+	atomic_set(&hw_priv->bes_power.chip_pm_state, BES2600_CHIP_PM_UNKNOWN);
        init_completion(&hw_priv->bes_power.pm_enter_cmpl);
        sema_init(&hw_priv->bes_power.sync_lock, 1);
        device_set_wakeup_capable(hw_priv->pdev, true);
@@ -1213,9 +1270,28 @@ int bes2600_pwr_clear_busy_event(struct bes2600_common *hw_priv, u32 event)

 void bes2600_pwr_notify_ps_changed(struct bes2600_common *hw_priv, u8 psmode)
 {
-	if((psmode & 0x01) != WSM_PSM_ACTIVE) {
-		bes_devel("complete pm_enter_cmpl\n");
-		complete(&hw_priv->bes_power.pm_enter_cmpl);
+	/*
+	 * The firmware sends a PM-changed indication for every transition,
+	 * including ones we didn't ask for (firmware-internal coex moves,
+	 * idle-driven aging). Update chip_pm_state unconditionally so the
+	 * wake path can use it, but only fire pm_enter_cmpl when a host-
+	 * initiated set_pm is actually in flight - otherwise a stale
+	 * indication can prime a future wait against a freshly
+	 * reinit_completion()'ed state.
+	 */
+	if ((psmode & 0x01) != WSM_PSM_ACTIVE) {
+		atomic_set(&hw_priv->bes_power.chip_pm_state,
+			   BES2600_CHIP_PM_LP);
+		if (atomic_cmpxchg(&hw_priv->bes_power.pm_set_in_process,
+				   1, 0) == 1) {
+			bes_devel("complete pm_enter_cmpl\n");
+			complete(&hw_priv->bes_power.pm_enter_cmpl);
+		} else {
+			bes_devel("PM ind (LP) without pending wait; state recorded\n");
+		}
+	} else {
+		atomic_set(&hw_priv->bes_power.chip_pm_state,
+			   BES2600_CHIP_PM_ACTIVE);
 	}
 }

@@ -64,6 +64,20 @@ enum power_down_state
        POWER_DOWN_STATE_UNLOCKED,
 };

+/*
+ * Confirmed PM state of the firmware-side chip. Tracks what the host
+ * has *seen* the firmware acknowledge, not what the host has
+ * requested. UNKNOWN means a host-initiated transition timed out
+ * before the firmware indication arrived; the next wake path should
+ * treat it as "we don't know" and probe before issuing GPIO/SDIO
+ * wakeup ops.
+ */
+enum bes2600_chip_pm_state {
+	BES2600_CHIP_PM_ACTIVE = 0,
+	BES2600_CHIP_PM_LP,
+	BES2600_CHIP_PM_UNKNOWN,
+};
+
 typedef void (*bes_pwr_enter_lp_cb)(struct bes2600_common *hw_priv);
 typedef void (*bes_pwr_exit_lp_cb)(struct bes2600_common *hw_priv);

@@ -106,6 +120,7 @@ struct bes2600_pwr_t
        bool ap_lp_bad;
        struct bes2600_pwr_event_t pwr_events[BES2600_DELAY_EVENT_NUM];
        atomic_t pm_set_in_process;
+	atomic_t chip_pm_state;
 };

 #ifdef CONFIG_BES2600_WOWLAN
@@ -14,11 +14,63 @@
 #include "scan.h"
 #include "sta.h"
 #include "pm.h"
+#include "epta_coex.h"
 #include "epta_request.h"
 #include "bes_pwr.h"

+/*
+ * After this many consecutive WSM scan rejections from firmware, stop
+ * issuing new scans for BES2600_SCAN_BACKOFF_JIFFIES and let the state
+ * that's rejecting them (coex window, firmware-internal busy) clear.
+ *
+ * The backoff has to be at least as long as the natural mac80211 scan-
+ * retry cadence, otherwise the next attempt lands outside the window
+ * and bypasses the defer guard. Observed in the wild on PineTab2:
+ * roam-evaluation bursts at ~12 s cadence, idle background scans at
+ * ~5 min cadence. 30 s catches the burst and leaves the slow case
+ * alone (the firmware-policy state has had minutes to clear by then
+ * anyway).
+ */
+#define BES2600_SCAN_REJECT_THRESHOLD	3
+#define BES2600_SCAN_BACKOFF_JIFFIES	(30 * HZ)
+
 static void bes2600_scan_restart_delayed(struct bes2600_vif *priv);

+/*
+ * Decide whether to skip sending the next WSM scan command without
+ * bothering the firmware. Two triggers:
+ *
+ *  1. BT A2DP is streaming in non-FDD coex mode. The firmware is
+ *     known to reject scan requests during that window; short-
+ *     circuiting here saves a WSM round-trip and avoids the
+ *     wsm_generic_confirm / scan_work warning cascade that follows.
+ *
+ *  2. We already saw >= BES2600_SCAN_REJECT_THRESHOLD consecutive
+ *     rejections on recent scan attempts and the backoff window has
+ *     not yet elapsed. Whatever was rejecting them is likely still
+ *     rejecting them; give it time. If the backoff has elapsed without
+ *     a fresh reject refreshing it, the burst is over and we reset the
+ *     count so an isolated reject doesn't immediately re-trip.
+ *
+ * Returns true if the caller should abandon the scan iteration.
+ */
+static bool bes2600_scan_should_defer(struct bes2600_common *hw_priv)
+{
+#ifdef WIFI_BT_COEXIST_EPTA_ENABLE
+	if (!coex_is_fdd_mode() && coex_is_bt_a2dp())
+		return true;
+#endif
+
+	if (time_after(jiffies, hw_priv->scan.backoff_until))
+		hw_priv->scan.reject_count = 0;
+
+	if (hw_priv->scan.reject_count >= BES2600_SCAN_REJECT_THRESHOLD &&
+	    time_before(jiffies, hw_priv->scan.backoff_until))
+		return true;
+
+	return false;
+}
+
 #ifdef CONFIG_BES2600_TESTMODE
 static int bes2600_advance_scan_start(struct bes2600_common *hw_priv)
 {
@@ -702,10 +754,29 @@ void bes2600_scan_work(struct work_struct *work)
 				wsm_unlock_tx(hw_priv);
 		} else
 #endif
+		{
+			if (bes2600_scan_should_defer(hw_priv)) {
+				hw_priv->scan.status = -EBUSY;
+				hw_priv->scan.reject_count++;
+				hw_priv->scan.backoff_until =
+					jiffies + BES2600_SCAN_BACKOFF_JIFFIES;
+				wiphy_dbg(priv->hw->wiphy,
+					  "[SCAN] deferred (coex/backoff, reject_count=%u)\n",
+					  hw_priv->scan.reject_count);
+				kfree(scan.ch);
+				goto fail;
+			}
 			hw_priv->scan.status = bes2600_scan_start(priv, &scan);
+		}
 		kfree(scan.ch);
-		if (WARN_ON(hw_priv->scan.status))
+		if (hw_priv->scan.status) {
+			hw_priv->scan.reject_count++;
+			hw_priv->scan.backoff_until =
+				jiffies + BES2600_SCAN_BACKOFF_JIFFIES;
+			/* Lower callers already logged the reason at wiphy_warn. */
 			goto fail;
+		}
+		hw_priv->scan.reject_count = 0;
 		hw_priv->scan.curr = it;
 	}
 	up(&hw_priv->conf_lock);
@@ -42,6 +42,17 @@ struct bes2600_scan {
 	struct delayed_work probe_work;
 	int direct_probe;
 	u8 if_id;
+	/*
+	 * Track consecutive firmware-side WSM scan rejections so we can
+	 * back off briefly instead of re-issuing the same scan on every
+	 * mac80211 background-scan tick. Firmware returns WSM status != 0
+	 * for a handful of transient conditions (BT A2DP active in non-
+	 * FDD coex, firmware-internal busy windows) and keeps rejecting
+	 * until the state clears; retrying at full cadence just floods
+	 * dmesg.
+	 */
+	unsigned int reject_count;
+	unsigned long backoff_until;
 };

 int bes2600_hw_scan(struct ieee80211_hw *hw,
@@ -134,8 +134,20 @@ static int wsm_generic_confirm(struct bes2600_common *hw_priv,
 				 struct wsm_buf *buf)
 {
 	u32 status = WSM_GET32(buf);
-	if (WARN(status != WSM_STATUS_SUCCESS, "wsm_generic_confirm ret %u", status))
+
+	/*
+	 * A non-SUCCESS status here is a firmware-side policy decision for
+	 * the command whose confirm this is -- commonly WSM status 2 for
+	 * scan (0x0407) rejected because of a coex window or transient
+	 * firmware-busy state. It is not a driver/kernel bug, so avoid the
+	 * WARN()/stack-trace treatment; the caller already emits a
+	 * wiphy_warn identifying the request id and will propagate the
+	 * error to mac80211.
+	 */
+	if (status != WSM_STATUS_SUCCESS) {
+		bes_devel("%s ret %u\n", __func__, status);
 		return -EINVAL;
+	}
 	return 0;

 underflow: