bes2600: self-detect when firmware does not honor PSM and skip the cycle

The c6 series fixed several host-side bookkeeping bugs around PSM
transitions, but didn't address the underlying contract: this chip's
firmware (BES2600 with the Bestechnic Dec 2023 build that ships on
PineTab2 and most danctnix images) silently drops every WSM_set_pm
request without emitting the corresponding PM_INDICATION.

The driver's own power_down_work delayed work calls
bes2600_pwr_enter_lp_mode every ~10s; without firmware acknowledgment
each call burns 5s on wait_for_completion_timeout(pm_enter_cmpl, 5*HZ)
and produces a recurring three-line cascade in dmesg:

    bes2600_pwr_enter_lp_mode, wait pm ind timeout
    bes2600_sdio_active failed, subsys:0
    bes2600_pwr_device_exit_lp_mode, active mcu fail

Confirmed by tripwire instrumentation on PineTab2 (linux-pinetab2
6.19.10-danctnix1, ohm) running the c5+c6 stack: zero
wsm_set_pm_indication() invocations across an entire boot, while
bes2600_pwr_enter_lp_mode timed out repeatedly, and
bes2600_sdio_active() consistently saw BES_SLAVE_STATUS_REG_ID return
0x2f (every "ready" bit set except MCU_WAKEUP_READY (bit 4) - the
firmware reports "I'm awake, there's nothing to wake from").

This patch makes the driver self-heal:

* struct bes2600_pwr_t gains pm_unsupported (bool) and
  pm_consecutive_timeouts (unsigned int). Both initialised to 0/false.

* bes2600_pwr_enter_lp_mode early-returns -EOPNOTSUPP when
  pm_unsupported is set. Skips the per-VIF set_pm round-trip and the
  wait_for_completion entirely.

* On the cmpxchg-success branch of the timeout path, we increment
  pm_consecutive_timeouts. When it crosses
  BES2600_PM_UNSUPPORTED_THRESHOLD (3, ~15s of trying), we latch
  pm_unsupported = true and force chip_pm_state = ACTIVE so that
  bes2600_pwr_device_exit_lp_mode's c6.2 skip branch covers the wake
  side (no gpio_wake / sbus_active / WSM_set_operational_mode reissue
  past the first one).

* bes2600_pwr_notify_ps_changed resets pm_consecutive_timeouts to 0 on
  any incoming PM indication, and clears pm_unsupported if it was
  previously latched. So a firmware update that fixes PM_IND delivery
  automatically re-enables PSM transitions without a driver rebuild.

mac80211's PSM requests via bes2600_set_pm() still flow to the firmware
unchanged; they just don't have host-side timeouts so they remain
silent regardless of firmware acknowledgment.

Power consumption goes up if the firmware actually CAN do PSM (we'd be
keeping the chip awake unnecessarily), but on a chip where the counter
trips this trade-off is forced anyway: the chip stayed awake under the
broken cascade as well, just with constant SDIO churn.

Net effect on dmesg: after ~15s of boot, the three-line cascade stops
firing entirely. The firmware-side wedge is observed once per boot
(captured by the pm_unsupported latch) instead of per-cycle.

Signed-off-by: Markus Fritsche <fritsche.markus@gmail.com>
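
The latch described in this commit message can be modelled in plain userspace C. What follows is a minimal sketch and not the driver code: struct pm_model, pm_model_enter_lp() and pm_model_notify_ps_changed() are hypothetical names, the firmware is reduced to a single fw_acked flag, and "crosses the threshold" is assumed to mean "reaches 3".

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Value quoted in the commit message (3 timeouts, ~15 s of trying). */
#define PM_UNSUPPORTED_THRESHOLD 3

enum chip_pm_state { CHIP_PM_ACTIVE, CHIP_PM_LP, CHIP_PM_UNKNOWN };

/* Hypothetical stand-in for the relevant fields of struct bes2600_pwr_t. */
struct pm_model {
	bool pm_unsupported;
	unsigned int pm_consecutive_timeouts;
	enum chip_pm_state chip_pm_state;
};

/* Models an incoming PM indication: any indication re-arms PSM. */
void pm_model_notify_ps_changed(struct pm_model *pm, enum chip_pm_state seen)
{
	assert(pm);
	pm->pm_consecutive_timeouts = 0;
	pm->pm_unsupported = false;
	pm->chip_pm_state = seen;
}

/*
 * Models one enter-LP attempt. fw_acked says whether the (simulated)
 * firmware delivered its PM indication before the wait expired.
 */
int pm_model_enter_lp(struct pm_model *pm, bool fw_acked)
{
	assert(pm);

	/* Latched: skip the set_pm round-trip and the wait entirely. */
	if (pm->pm_unsupported)
		return -EOPNOTSUPP;

	if (fw_acked) {
		pm_model_notify_ps_changed(pm, CHIP_PM_LP);
		return 0;
	}

	/* Timeout path: the host no longer knows the chip's state. */
	pm->chip_pm_state = CHIP_PM_UNKNOWN;
	if (++pm->pm_consecutive_timeouts >= PM_UNSUPPORTED_THRESHOLD) {
		pm->pm_unsupported = true;
		/* Forced ACTIVE so the wake side can skip its handshake. */
		pm->chip_pm_state = CHIP_PM_ACTIVE;
	}
	return -ETIMEDOUT;
}
```

In this model, three unacknowledged attempts return -ETIMEDOUT, every later attempt returns -EOPNOTSUPP immediately, and a single call to pm_model_notify_ps_changed() clears the latch again.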
9 changed files with 4 additions and 313 deletions
@@ -511,9 +511,6 @@ struct bes2600_common {
 	struct list_head coex_event_list;
 	spinlock_t coex_event_lock;

-	/* Connection-loss-storm fast-recover (Trigger A). See sta.c. */
-	struct work_struct connection_loss_storm_recover_work;
-
 	/* member for low power */
 	struct bes2600_pwr_t bes_power;

@@ -599,11 +596,6 @@ struct bes2600_vif {
        unsigned long           rx_timestamp;
        u32                     cipherType;

-	/* Decrypt-storm fast-recover (Trigger B). See txrx.c. */
-	unsigned long			decrypt_storm_window_start;
-	unsigned int			decrypt_storm_count;
-	unsigned int			decrypt_storm_recoveries;
-	struct work_struct		decrypt_storm_recover_work;

 	/* AP powersave */
 	u32			link_id_map;
@@ -630,10 +622,6 @@ struct bes2600_vif {
 	/* CQM Implementation */
 	struct delayed_work	bss_loss_work;
 	struct delayed_work	connection_loss_work;
-	/* Connection-loss-storm fast-recover (Trigger A). See sta.c. */
-	unsigned long		connection_loss_storm_window_start;
-	unsigned int		connection_loss_storm_count;
-	unsigned int		connection_loss_storm_recoveries;
 	struct work_struct	tx_failure_work;
 	int			delayed_link_loss;
 	spinlock_t		bss_loss_lock;
@@ -868,13 +856,4 @@ int bes2600_btusb_setup_pipes(struct sbus_priv *sbus_priv);
 void bes2600_btusb_uninit(struct usb_interface *interface);
 #endif

-/* Decrypt-storm fast-recover helpers — see txrx.c. */
-void bes2600_decrypt_storm_init(struct bes2600_vif *priv);
-void bes2600_decrypt_storm_account(struct bes2600_vif *priv);
-
-/* Connection-loss-storm fast-recover helpers — see sta.c. */
-void bes2600_connection_loss_storm_init(struct bes2600_vif *priv);
-bool bes2600_connection_loss_storm_account(struct bes2600_vif *priv);
-void bes2600_connection_loss_storm_recover(struct work_struct *work);
-
 #endif /* BES2600_H */
@@ -16,7 +16,6 @@
 #include <linux/mmc/host.h>
 #include <linux/mmc/sdio_func.h>
 #include <linux/mmc/card.h>
-#include <linux/mmc/core.h>
 #include <linux/mmc/sdio.h>
 #include <linux/spinlock.h>
 #include <net/mac80211.h>
@@ -1789,55 +1788,6 @@ static void bes2600_sdio_halt_device(struct sbus_priv *self)
 	sdio_work_debug(self);
 }

-/*
- * Trigger an SDIO bus reset via mmc_hw_reset().
- *
- * With multiple SDIO functions probed (PineTab2 binds func 1 for WLAN and
- * func 2 for the BT-companion path) mmc_sdio_hw_reset() takes the
- * remove-and-rescan path: it marks the card removed and schedules
- * mmc_rescan, which tears down the bound function drivers and re-detects
- * the card on the next sweep, in turn reinvoking bes2600_sdio_probe().
- *
- * With a single function probed it instead invokes mmc_power_cycle()
- * directly, which on PineTab2 toggles the wifi-reset GPIO via sdio_pwrseq.
- *
- * In both cases the chip ends up in a freshly reset state, which is the
- * goal of the recovery path.
- *
- * mmc_hw_reset() must be called without holding the SDIO host claim --
- * the multi-func remove-and-rescan path acquires the host claim via the
- * mmc workqueue.
- */
-static int bes2600_sdio_bus_reset(struct sbus_priv *self)
-{
-	struct mmc_host *host;
-	int ret;
-
-	if (!self || !self->func || !self->func->card)
-		return -EINVAL;
-
-	host = self->func->card->host;
-	ret = mmc_hw_reset(self->func->card);
-
-	/*
-	 * On multi-function SDIO cards (BES2600 has WLAN func 1 + BT
-	 * companion func 2), mmc_sdio_hw_reset() removes the card and
-	 * returns 1 to signal "remove happened, caller must trigger
-	 * rescan". The kernel does NOT auto-rescan in this case;
-	 * single-function cards take the rescan path inline and return 0.
-	 * Treat any non-negative return as success and force a rescan if
-	 * mmc_hw_reset signalled the multi-function path - otherwise the
-	 * card stays removed indefinitely after a wedge recovery,
-	 * leaving wifi (and the BT companion) silent until reboot.
-	 */
-	if (ret > 0) {
-		bes_info("multi-func mmc_hw_reset removed card; scheduling rescan\n");
-		mmc_detect_change(host, 0);
-		ret = 0;
-	}
-	return ret;
-}
-
 static bool bes2600_sdio_wakeup_source(struct sbus_priv *self)
 {
 	struct bes2600_platform_data_sdio *pdata = bes2600_get_platform_data();
@@ -1876,7 +1826,6 @@ static struct sbus_ops bes2600_sdio_sbus_ops = {
 	.gpio_sleep		= bes2600_gpio_allow_mcu_sleep,
 	.halt_device		= bes2600_sdio_halt_device,
 	.wakeup_source		= bes2600_sdio_wakeup_source,
-	.bus_reset		= bes2600_sdio_bus_reset,
 };

 static void bes2600_sdio_en_lp_cb(struct bes2600_common *hw_priv)
@@ -442,60 +442,6 @@ int bes2600_chrdev_do_system_close(const struct sbus_ops *sbus_ops, struct sbus_
 	return ret;
 }

-/*
- * Hard-reset the bus and wait for the bus core to remove the chip.
- *
- * Used by the firmware-wedge recovery path on platforms where the normal
- * power_switch(0) sequence has no effective chip-reset signal. The bus
- * implementation triggers an asynchronous re-detect; this helper waits for
- * the resulting remove() callback to clear bes2600_cdev.sbus_priv so that a
- * subsequent bes2600_switch_wifi(true) sees a clean state and can wait on
- * the fresh probe.
- */
-int bes2600_chrdev_do_bus_reset(const struct sbus_ops *sbus_ops, struct sbus_priv *priv)
-{
-	int ret;
-	long status;
-
-	if (!sbus_ops || !priv)
-		return -EINVAL;
-
-	if (!sbus_ops->bus_reset)
-		return -EOPNOTSUPP;
-
-	bes_info("trigger bus reset to recover wedged firmware.\n");
-
-	ret = sbus_ops->bus_reset(priv);
-	if (ret) {
-		bes_err("bus_reset failed: %d\n", ret);
-		return ret;
-	}
-
-	/*
-	 * The bus reset is asynchronous: the bus core schedules a rescan
-	 * which removes the bound function drivers and then re-detects the
-	 * chip. Wait for the remove callback to clear sbus_priv. Do not
-	 * dereference 'priv' after this point -- it may already be freed.
-	 */
-	status = wait_event_timeout(bes2600_cdev.probe_done_wq,
-				    !bes2600_cdev.sbus_priv, HZ * 3);
-	WARN_ON(status <= 0);
-
-	return 0;
-}
-
-/*
- * Trigger bes2600_chrdev_do_bus_reset() against the file-global
- * bes2600_cdev. Used by host-side recovery paths outside this
- * compilation unit (e.g. sta.c connection-loss-storm fast-recover) so
- * those callers do not need to reach the static bes2600_cdev directly.
- */
-int bes2600_chrdev_trigger_bus_reset(void)
-{
-	return bes2600_chrdev_do_bus_reset(bes2600_cdev.sbus_ops,
-					   bes2600_cdev.sbus_priv);
-}
-
 bool bes2600_chrdev_is_wifi_opened(void)
 {
 	bool wifi_opened = false;
@@ -594,21 +540,8 @@ static void bes2600_chrdev_wifi_force_close_work(struct work_struct *work)
 		/* unregister wifi */
 		bes2600_switch_wifi(0);

-		/*
-		 * Hard exception with a bus_reset implementation: tear the
-		 * bus down via mmc_hw_reset() (or equivalent) so the next
-		 * bringup probes a freshly reset chip. On PineTab2 this is
-		 * the only effective recovery path -- the existing
-		 * power_switch(0)/(1) sequence has no chip-reset signal of
-		 * its own (sdio_pwrseq owns wifi_reset).
-		 *
-		 * Soft close, or hard close on a board without bus_reset:
-		 * fall back to the legacy power_switch(0) sequence.
-		 */
-		if (bes2600_cdev.halt_dev && bes2600_cdev.sbus_ops->bus_reset) {
-			bes2600_chrdev_do_bus_reset(bes2600_cdev.sbus_ops,
-						    bes2600_cdev.sbus_priv);
-		} else if (bes2600_chrdev_check_system_close()) {
+		/* power down device if wifi is only opened */
+		if (bes2600_chrdev_check_system_close()) {
 			bes2600_chrdev_do_system_close(bes2600_cdev.sbus_ops,
 						bes2600_cdev.sbus_priv);
 		}
@@ -60,8 +60,6 @@ struct sbus_priv *bes2600_chrdev_get_sbus_priv_data(void);
 /* used to control device power down */
 int bes2600_chrdev_check_system_close(void);
 int bes2600_chrdev_do_system_close(const struct sbus_ops *sbus_ops, struct sbus_priv *priv);
-int bes2600_chrdev_do_bus_reset(const struct sbus_ops *sbus_ops, struct sbus_priv *priv);
-int bes2600_chrdev_trigger_bus_reset(void);
 void bes2600_chrdev_wakeup_bt(void);
 void bes2600_chrdev_wifi_force_close(struct bes2600_common *hw_priv, bool halt_dev);
 void bes2600_chrdev_usb_remove(struct bes2600_common *hw_priv);
@@ -542,10 +542,6 @@ static int bes2600_status_show_priv(struct seq_file *seq, void *v)
 		priv->listening ? " (listening)" : "");
 	seq_printf(seq, "Assoc:      %s\n",
 		bes2600_debug_join_status[priv->join_status]);
-	seq_printf(seq, "DecryptStormRecoveries: %u\n",
-		   priv->decrypt_storm_recoveries);
-	seq_printf(seq, "ConnectionLossStormRecoveries: %u\n",
-		   priv->connection_loss_storm_recoveries);
 	if (priv->rx_filter.promiscuous)
 		seq_puts(seq,   "Filter:     promisc\n");
 	else if (priv->rx_filter.fcs)
@@ -484,8 +484,6 @@ static struct ieee80211_hw *bes2600_init_common(size_t hw_priv_data_len)
 	spin_lock_init(&hw_priv->rtsvalue_lock);
 	INIT_WORK(&hw_priv->dynamic_opt_txrx_work, bes2600_dynamic_opt_txrx_work);
 	INIT_WORK(&hw_priv->tx_policy_upload_work, tx_policy_upload_work);
-	INIT_WORK(&hw_priv->connection_loss_storm_recover_work,
-		  bes2600_connection_loss_storm_recover);
 	spin_lock_init(&hw_priv->event_queue_lock);
 	INIT_LIST_HEAD(&hw_priv->event_queue);
 	INIT_WORK(&hw_priv->event_handler, bes2600_event_handler);
@@ -75,14 +75,6 @@ struct sbus_ops {
 	void (*halt_device)(struct sbus_priv *self);
 	bool (*wakeup_source)(struct sbus_priv *self);
 	int (*reboot)(struct sbus_priv *self);
-	/*
-	 * Force the host bus to re-detect and re-probe the chip. Called
-	 * from the firmware-wedge recovery path when power_switch() has no
-	 * effective chip-reset signal of its own (e.g. PineTab2, where the
-	 * wifi-reset GPIO is owned by sdio_pwrseq, not the bes2600 node).
-	 * Returns 0 on success or a negative errno.
-	 */
-	int (*bus_reset)(struct sbus_priv *self);
 };

 void bes2600_irq_handler(struct bes2600_common *priv);
@@ -266,7 +266,6 @@ void bes2600_stop(struct ieee80211_hw *dev, bool suspend)
 	cancel_work_sync(&hw_priv->coex_work);
 	coex_stop(hw_priv);
 #endif
-	cancel_work_sync(&hw_priv->connection_loss_storm_recover_work);

 	bes2600_wifi_stop(hw_priv);

@@ -449,7 +448,6 @@ void bes2600_remove_interface(struct ieee80211_hw *dev,
 	cancel_delayed_work_sync(&priv->join_timeout);
 	cancel_delayed_work_sync(&priv->set_cts_work);
 	cancel_delayed_work_sync(&priv->pending_offchanneltx_work);
-	cancel_work_sync(&priv->decrypt_storm_recover_work);

 	del_timer_sync(&priv->mcast_timeout);
 	/* TODO:COMBO: May be reset of these variables "delayed_link_loss and
@@ -1660,70 +1658,6 @@ report:
 	spin_unlock(&priv->bss_loss_lock);
 }

-/*
- * Connection-loss-storm fast-recover (Trigger A).
- *
- * bes2600_connection_loss_work below is the driver's own decision-point
- * to give up on a BSS (after bss-loss detection accumulates beyond
- * tolerance) and tell mac80211 via ieee80211_connection_loss(). On the
- * deployed pinetab2 stack a single ieee80211_connection_loss() event
- * sometimes triggers a userspace reauth blackhole (assoc-comeback
- * timeouts followed by AP unprotected-deauth-reason-6) that ends only
- * via cross-channel/cross-SSID fallback and can take 80+ s. Receipts at
- * https://git.reauktion.de/marfrit/besser, notes/phase4-2026-05-07.md.
- *
- * When N connection-loss decisions land within WINDOW on the same vif,
- * skip the ieee80211_connection_loss() path and trigger a chip-level
- * bus_reset (the c5.2-introduced bes2600_chrdev_do_bus_reset). The chip
- * is removed and re-probed; userspace re-associates from a fresh state,
- * dodging the assoc-comeback loop.
- *
- * Threshold (3 / 60 s) is chosen well above the steady-state per-vif
- * connection-loss rate observed in the patch-A Phase-7 rep
- * (0.86/h under sustained load), so a true storm is required.
- *
- * The recover work_struct lives on bes2600_common (hw_priv) so that
- * scheduling it does not race with vif teardown after bus_reset frees
- * the per-vif state.
- */
-#define BES2600_CONNECTION_LOSS_STORM_THRESHOLD	3
-#define BES2600_CONNECTION_LOSS_STORM_WINDOW_MS	60000
-
-void bes2600_connection_loss_storm_recover(struct work_struct *work)
-{
-	bes_warn("[bes2600] connection-loss-storm fast-recover: bus_reset\n");
-	bes2600_chrdev_trigger_bus_reset();
-	/*
-	 * After bes2600_chrdev_do_bus_reset() returns, the SDIO core has
-	 * scheduled a remove + rescan; per-vif state may already be gone.
-	 * Do not dereference any per-vif pointer here.
-	 */
-}
-
-void bes2600_connection_loss_storm_init(struct bes2600_vif *priv)
-{
-	priv->connection_loss_storm_window_start = 0;
-	priv->connection_loss_storm_count = 0;
-	priv->connection_loss_storm_recoveries = 0;
-}
-
-bool bes2600_connection_loss_storm_account(struct bes2600_vif *priv)
-{
-	unsigned long now = jiffies;
-	unsigned long window =
-		msecs_to_jiffies(BES2600_CONNECTION_LOSS_STORM_WINDOW_MS);
-
-	if (priv->connection_loss_storm_window_start == 0 ||
-	    time_after(now, priv->connection_loss_storm_window_start + window)) {
-		priv->connection_loss_storm_window_start = now;
-		priv->connection_loss_storm_count = 1;
-		return false;
-	}
-
-	return ++priv->connection_loss_storm_count >=
-	       BES2600_CONNECTION_LOSS_STORM_THRESHOLD;
-}
-
 void bes2600_connection_loss_work(struct work_struct *work)
 {
 	struct bes2600_vif *priv =
@@ -1733,21 +1667,9 @@ void bes2600_connection_loss_work(struct work_struct *work)

 	bes_devel("[CQM] Reporting connection loss.\n");
 	bes2600_pwr_clear_busy_event(priv->hw_priv, BES_PWR_LOCK_ON_BSS_LOST);
-
-	if (bes2600_connection_loss_storm_account(priv)) {
-		bes_warn("[bes2600] connection-loss storm: %u in %u s, scheduling bus reset\n",
-			 priv->connection_loss_storm_count,
-			 BES2600_CONNECTION_LOSS_STORM_WINDOW_MS / 1000);
-		priv->connection_loss_storm_count = 0;
-		priv->connection_loss_storm_recoveries++;
-		schedule_work(&hw_priv->connection_loss_storm_recover_work);
-		/* bus_reset will tear the chip down; skip the mac80211 path. */
-		return;
-	}
-
-	if (bes2600_suspend_status_get(hw_priv))
+	if(bes2600_suspend_status_get(hw_priv)) {
 		bes2600_pending_unjoin_set(hw_priv, priv->if_id);
-	else
+	} else
 		ieee80211_connection_loss(priv->vif);
 #ifdef WIFI_BT_COEXIST_EPTA_ENABLE
 	// set disconnected in BSS_CHANGED_ASSOC
@@ -2697,8 +2619,6 @@ int bes2600_vif_setup(struct bes2600_vif *priv)

 	/* Setup per vif workitems and locks */
 	spin_lock_init(&priv->vif_lock);
-	bes2600_decrypt_storm_init(priv);
-	bes2600_connection_loss_storm_init(priv);
 	INIT_WORK(&priv->join_work, bes2600_join_work);
 	INIT_DELAYED_WORK(&priv->join_timeout, bes2600_join_timeout);
 	INIT_WORK(&priv->unjoin_work, bes2600_unjoin_work);
@@ -25,78 +25,6 @@

 #define BES2600_INVALID_RATE_ID (0xFF)

-/*
- * Decrypt-storm fast-recover (Trigger B).
- *
- * When the BES2600 firmware reports WSM_STATUS_DECRYPTFAILURE for a
- * burst of received frames (typically because the host's PTK or GTK
- * has fallen out of sync with the AP), the AP eventually concludes that
- * the STA is not authenticated and emits an unprotected deauth-reason-6
- * ("Class 2 frame received from non-authenticated station"). On the
- * deployed pinetab2 + bes2600 stack this AP-initiated deauth has been
- * observed to leave the link blackholed for up to 109 s before
- * userspace finds a different SSID/channel to recover on. (Receipts at
- * https://git.reauktion.de/marfrit/besser, notes/phase5-2026-05-06.md.)
- *
- * Recovery here pre-empts the AP: when we see THRESHOLD decrypt
- * failures within WINDOW, we ask mac80211 for a clean reassoc via
- * ieee80211_connection_loss(), which causes immediate disassociation
- * and lets userspace auto-reconnect with fresh keys.
- *
- * mac80211 contract: ieee80211_connection_loss() may be called
- * regardless of IEEE80211_HW_CONNECTION_MONITOR; it causes immediate
- * disassociation without driver-side recovery attempts. See
- * include/net/mac80211.h for the canonical doc-comment.
- *
- * The threshold is set well above the steady-state per-vif
- * decrypt-fail rate observed in measurement (~1/min even under
- * sustained 1 MB/s load), so a true storm is required to trip it.
- */
-#define BES2600_DECRYPT_STORM_THRESHOLD	5
-#define BES2600_DECRYPT_STORM_WINDOW_MS	5000
-
-static void bes2600_decrypt_storm_recover_work(struct work_struct *work)
-{
-	struct bes2600_vif *priv = container_of(work, struct bes2600_vif,
-						decrypt_storm_recover_work);
-
-	if (!priv->vif)
-		return;
-
-	bes_warn("[bes2600] decrypt-storm fast-recover: forcing reassoc\n");
-	ieee80211_connection_loss(priv->vif);
-	priv->decrypt_storm_recoveries++;
-}
-
-void bes2600_decrypt_storm_init(struct bes2600_vif *priv)
-{
-	INIT_WORK(&priv->decrypt_storm_recover_work,
-		  bes2600_decrypt_storm_recover_work);
-	priv->decrypt_storm_window_start = 0;
-	priv->decrypt_storm_count = 0;
-	priv->decrypt_storm_recoveries = 0;
-}
-
-void bes2600_decrypt_storm_account(struct bes2600_vif *priv)
-{
-	unsigned long now = jiffies;
-	unsigned long window = msecs_to_jiffies(BES2600_DECRYPT_STORM_WINDOW_MS);
-
-	if (priv->decrypt_storm_window_start == 0 ||
-	    time_after(now, priv->decrypt_storm_window_start + window)) {
-		priv->decrypt_storm_window_start = now;
-		priv->decrypt_storm_count = 1;
-		return;
-	}
-
-	if (++priv->decrypt_storm_count >= BES2600_DECRYPT_STORM_THRESHOLD) {
-		priv->decrypt_storm_count = 0;
-		/* Skew the window so we don't re-fire on the same storm. */
-		priv->decrypt_storm_window_start = now + window;
-		schedule_work(&priv->decrypt_storm_recover_work);
-	}
-}
-
 #ifdef CONFIG_BES2600_TESTMODE
 #include "bes_nl80211_testmode_msg.h"
 #endif /* CONFIG_BES2600_TESTMODE */
@@ -1744,8 +1672,6 @@ void bes2600_rx_cb(struct bes2600_vif *priv,
 			goto drop;
 		} else {
 			bes_warn("[RX] Receive failure: %d.\n", arg->status);
-			if (arg->status == WSM_STATUS_DECRYPTFAILURE)
-				bes2600_decrypt_storm_account(priv);
 			goto drop;
 		}
 	}
Author	SHA1	Message	Date
Markus Fritsche	f12e870025	bes2600: self-detect when firmware does not honor PSM and skip the cycle	2026-04-28 17:56:08 +02:00
Markus Fritsche	822a5f1bab	bes2600: short-circuit wake handshake when chip is confirmed ACTIVE	2026-04-28 16:11:08 +02:00

The previous patch ("bes2600: gate PM indication completion on pending
request and track chip state") added enum bes2600_chip_pm_state and the
chip_pm_state field tracking what the host has seen the firmware
confirm. This patch makes the wake side use it.

Without this, every bes2600_pwr_device_exit_lp_mode() unconditionally
runs gpio_wake() + sbus_active() + wsm_set_operational_mode(active),
even when the chip is already in confirmed-ACTIVE state and the wake
sequence has nothing to do. The visible failure mode on PineTab2:

    bes2600_pwr_enter_lp_mode, wait pm ind timeout
    repeat set gpio_wake_flag, sub_sys:0
    bes2600_sdio_active failed, subsys:0
    bes2600_pwr_device_exit_lp_mode, active mcu fail

cycling every ~9 s, ~22 cycles in 10 minutes. The failure sequence:

1. enter_lp_mode timed out (firmware indication lost). With c6.1,
   chip_pm_state is now UNKNOWN.

2. lock_device fires exit_lp_mode.

3. gpio_wake hits "bit already set" because device_enter_lp_mode was
   skipped when the indication timed out, so gpio_sleep was never
   called - the bit reflects driver intent, not chip state. gpio_wake
   silently no-ops (no GPIO edge), bit stays set.

4. sbus_active spends 200 x 2 ms looking for MCU_WAKEUP_READY that
   never comes (firmware was never told to wake), then fails.

5. Driver continues to wsm_set_operational_mode against the wedged
   bus, compounding the failure.

This patch's three moves:

* bes2600_pwr_device_exit_lp_mode() reads chip_pm_state at entry. On
  BES2600_CHIP_PM_ACTIVE, log at devel level and return without
  touching gpio_wake / sbus_active / WSM. The chip is in the state we
  want; the handshake exists only to drive a transition.

* On BES2600_CHIP_PM_LP or BES2600_CHIP_PM_UNKNOWN, run the wake
  handshake as before, but on sbus_active() failure: set
  chip_pm_state = UNKNOWN, log once at err level, and bail out. Do NOT
  call wsm_set_operational_mode over a wedged bus - it would just emit
  a second error and leave the chip in an even less defined state.

* bes2600_gpio_wakeup_mcu() / bes2600_gpio_allow_mcu_sleep(): demote
  "repeat set/clear gpio_wake_flag" from bes_err to bes_devel.
  Multi-subsystem wake-hold (e.g. WIFI + BT both want MCU awake) is
  the steady-state case, and the symmetric clear while
  bit-already-clear is racy bookkeeping rather than a hardware error.
  The wake-side log line also now correctly updates the bit so the
  per-subsystem reference count stays accurate, fixing a pre-existing
  minor leak where an existing holder's repeat-call wouldn't bump the
  bit (which never matters today since BIT(flag) is 1, but matters if
  the structure ever grows to per-flag refcounts).

Net effect on the cycle:

* If chip is genuinely ACTIVE (chip_pm_state == ACTIVE), wake skips
  cleanly. Storm goes silent.

* If chip is genuinely LP, behaviour is unchanged.

* If chip is UNKNOWN (post-timeout state), one wake attempt is made;
  on failure, state stays UNKNOWN and we don't emit a second cascade
  error per attempt. Repeated UNKNOWN with failed wake will eventually
  be picked up by the LMAC active-monitor and escalated to
  mmc_hw_reset (c5.2).

No new locks, no new state. Only consumption of the chip_pm_state field
added in the prerequisite patch.

Signed-off-by: Markus Fritsche <fritsche.markus@gmail.com>
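
The wake-side decision described in this commit message can be sketched as a small userspace C model. This is an illustration under stated assumptions and not the driver code: struct wake_model and wake_model_exit_lp() are hypothetical names, the -EIO return value is an arbitrary choice, and the model treats a successful handshake as confirmation of ACTIVE, whereas the driver records state from firmware indications.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

enum chip_pm_state { CHIP_PM_ACTIVE, CHIP_PM_LP, CHIP_PM_UNKNOWN };

/* Hypothetical model of the wake path; counters exist only for inspection. */
struct wake_model {
	enum chip_pm_state chip_pm_state;
	bool bus_responds;         /* would the modelled sbus_active() succeed? */
	unsigned int handshakes;   /* wake handshakes attempted */
	unsigned int wsm_requests; /* operational-mode requests sent */
};

int wake_model_exit_lp(struct wake_model *w)
{
	assert(w);

	/* Confirmed ACTIVE: the handshake only exists to drive a transition. */
	if (w->chip_pm_state == CHIP_PM_ACTIVE)
		return 0;

	/* LP or UNKNOWN: run the wake handshake once. */
	w->handshakes++;
	if (!w->bus_responds) {
		/* Bail out without sending WSM traffic over a wedged bus. */
		w->chip_pm_state = CHIP_PM_UNKNOWN;
		return -EIO; /* assumed error code, chosen for the sketch */
	}

	w->wsm_requests++;
	w->chip_pm_state = CHIP_PM_ACTIVE; /* simplification, see lead-in */
	return 0;
}
```

The point of the model is the ordering: the state check comes first, and the operational-mode request is only counted after the bus has answered.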
Markus Fritsche	c57c77e446	bes2600: gate PM indication completion on pending request and track chip state	2026-04-28 16:11:08 +02:00

When mac80211 toggles PSM on the BES2600, the host sends WSM set_pm and
waits up to 5 s on bes_power.pm_enter_cmpl for a firmware-side
PM-changed indication confirming the transition. Three sequenced flaws
make the wait-and-confirm racy and leave host/chip bookkeeping desynced
when anything misfires:

1) bes2600_pwr_notify_ps_changed() unconditionally fires
   complete(pm_enter_cmpl) for any non-active psmode. It does not check
   whether a host-initiated set_pm is actually pending. A spontaneous
   indication (firmware-internal coex move, idle-driven aging) primes
   the completion, and the next host-driven enter_lp_mode sees a false
   success on its first wait_for_completion_timeout.

2) The wait/reinit ordering in bes2600_pwr_enter_lp_mode is

       status = wait_for_completion_timeout(...);
       atomic_set(pm_set_in_process, 0);
       reinit_completion(...);

   If an indication arrives between wait_for_completion_timeout
   returning with status==1 and reinit_completion, the next
   enter_lp_mode iteration's wait can also see false success. The
   reinit must happen before we start the new request, not after
   handling the previous one.

3) On wait_pm_ind timeout, the driver returns -ETIMEDOUT and walks
   away. It does not record that the firmware's actual PM state is no
   longer known to the host. Subsequent wake paths (gpio_wake /
   sbus_active) assume the chip is still active and hit deterministic
   SDIO failures when the firmware has transitioned anyway.

This patch is the safe-prerequisite half of a wider fix:

* bes_pwr.h gains enum bes2600_chip_pm_state {ACTIVE, LP, UNKNOWN} and
  bes_power.chip_pm_state. Its job is to track what the host has seen
  the firmware confirm, not what the host has requested. Initialised
  to ACTIVE in bes2600_pwr_init().

* bes2600_pwr_notify_ps_changed() unconditionally updates
  chip_pm_state on every indication, but only fires
  complete(pm_enter_cmpl) when atomic_cmpxchg(pm_set_in_process, 1, 0)
  succeeds. A spontaneous indication can no longer prime a waiter that
  will only set up its request afterwards.

* bes2600_pwr_enter_lp_mode() now reinit_completion()s before setting
  pm_set_in_process and sending wsm_set_pm. After a timeout, it
  cmpxchgs pm_set_in_process back to 0 (so a late indication cannot
  prime the next iteration) and on the win-cmpxchg branch records
  chip_pm_state=UNKNOWN.

A follow-up patch consumes chip_pm_state on the wake side
(bes2600_pwr_device_exit_lp_mode + bes2600_gpio_wakeup_mcu) to fix the
deterministic "active mcu fail" cycle; the state record added here is
what makes that fix possible. Splitting the work this way keeps the
lock-free race fix small and reviewable on its own.

No new locks, no behaviour change on the success path. Only the
recovery path (timeout + spontaneous indication) gains correctness.

Signed-off-by: Markus Fritsche <fritsche.markus@gmail.com>
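
The compare-and-exchange gating described in this commit message can be illustrated with C11 atomics in userspace. The sketch below is a single-threaded model and not the driver code: struct pm_gate and its functions are hypothetical names, a plain bool stands in for struct completion, and the model completes on any indication while a request is pending rather than distinguishing psmode values.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

enum chip_pm_state { CHIP_PM_ACTIVE, CHIP_PM_LP, CHIP_PM_UNKNOWN };

/* Hypothetical model; completion_done stands in for struct completion. */
struct pm_gate {
	atomic_int pm_set_in_process;
	bool completion_done;
	enum chip_pm_state chip_pm_state;
};

void pm_gate_init(struct pm_gate *g)
{
	assert(g);
	atomic_init(&g->pm_set_in_process, 0);
	g->completion_done = false;
	g->chip_pm_state = CHIP_PM_ACTIVE;
}

/* Host starts a request: re-arm the completion BEFORE marking it pending. */
void pm_gate_begin_request(struct pm_gate *g)
{
	assert(g);
	g->completion_done = false;
	atomic_store(&g->pm_set_in_process, 1);
}

/* Firmware indication: always record state, complete only a pending request. */
void pm_gate_indication(struct pm_gate *g, enum chip_pm_state seen)
{
	int expected = 1;

	assert(g);
	g->chip_pm_state = seen;
	if (atomic_compare_exchange_strong(&g->pm_set_in_process, &expected, 0))
		g->completion_done = true;
}

/* Host finishes waiting; returns true if the request was confirmed. */
bool pm_gate_finish_request(struct pm_gate *g)
{
	int expected = 1;

	assert(g);
	if (g->completion_done)
		return true;
	/* Timeout: retire the request so a late indication cannot prime the next one. */
	if (atomic_compare_exchange_strong(&g->pm_set_in_process, &expected, 0))
		g->chip_pm_state = CHIP_PM_UNKNOWN;
	return false;
}
```

A spontaneous indication before pm_gate_begin_request(), or a late one after a timed-out pm_gate_finish_request(), updates chip_pm_state but leaves completion_done untouched, which is the property the patch is after.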
Markus Fritsche	db4ea70fb5	bes2600: widen scan-defer backoff to 30s and decay count on quiet	2026-04-28 14:33:00 +02:00

The scan-defer logic added in the previous patch ("bes2600: defer scan
and soften WARN on firmware reject") used a 10-second backoff window
and never cleared reject_count outside of a successful scan. Field
testing on a PineTab2 (linux-pinetab2 6.19.10-danctnix1) shows two
distinct mac80211 scan-retry cadences in practice:

* Idle background scans every ~5 minutes when associated -- well
  outside any plausible backoff, the defer guard correctly falls
  through to a real WSM scan attempt.

* Roam-evaluation bursts triggered when mac80211 wants to find a
  candidate AP for handover (signal degradation, beacon loss,
  locally-generated DEAUTH_LEAVING reason=3). Cadence is ~12 s, and
  one boot reproduced 14 such rejected scans in 3 minutes during a
  single burst, none of which engaged the defer guard because every
  retry landed just outside the 10 s window.

Two-line behaviour change to fix that:

1. BES2600_SCAN_BACKOFF_JIFFIES grows from 10*HZ to 30*HZ, so a
   12 s-cadence burst stays inside the window across consecutive
   rejects and the third reject in the burst trips the threshold
   guard. The 5 min idle case is still naturally past the window and
   is unaffected.

2. bes2600_scan_should_defer() resets reject_count to 0 when
   time_after(jiffies, backoff_until). Without this, reject_count
   accumulated indefinitely across the slow-cadence rejects, so an
   isolated reject after long quiet would have tripped the threshold
   the moment it arrived. After the change, count is latched only
   inside an active burst and decays cleanly when the burst ends.

Net effect on a roam burst:

* t=0   reject #1 (count 1, backoff_until = t0 + 30s)
* t=12  reject #2 (count 2, backoff_until = t1 + 30s)
* t=24  reject #3 (count 3, threshold met, next scan deferred)
* t=36  defer fires, no WSM round-trip, reject not sent
* ...   defers continue until the firmware-policy state clears
* scan succeeds -> reject_count = 0, normal cadence resumes

WSM 0x0007 confirm rejections in a burst drop from ~14 to ~3 (just the
scans needed to reach the threshold). wpa_supplicant's reason=3
locally-generated disconnects driven by exhausted roam candidates
during the same burst window also drop.

No new state, no new symbols, no change to mac80211-facing semantics:
the deferred scan still completes via the existing fail: path with
status=-EBUSY, the same response a real firmware-busy would produce.

Signed-off-by: Markus Fritsche <fritsche.markus@gmail.com>
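
The backoff-and-decay rule from this commit message can be modelled with plain seconds instead of jiffies. This is a rough sketch and not the driver code: struct scan_defer and its helpers are hypothetical names, the threshold of 3 is inferred from the timeline above, and the model assumes that a deferred scan does not extend the backoff window.

```c
#include <assert.h>
#include <stdbool.h>

#define SCAN_BACKOFF_S        30 /* 30*HZ in the commit, modelled in seconds */
#define SCAN_REJECT_THRESHOLD 3  /* assumed: the third reject trips the guard */

/* Hypothetical model of the scan-defer bookkeeping. */
struct scan_defer {
	unsigned int reject_count;
	long backoff_until; /* seconds, same clock as the 'now' arguments */
};

/* Decide whether a scan requested at time 'now' should be deferred. */
bool scan_should_defer(struct scan_defer *s, long now)
{
	assert(s);
	if (now > s->backoff_until) {
		/* Quiet period elapsed: let the count decay to zero. */
		s->reject_count = 0;
		return false;
	}
	return s->reject_count >= SCAN_REJECT_THRESHOLD;
}

/* Record a firmware reject seen at time 'now'. */
void scan_note_reject(struct scan_defer *s, long now)
{
	assert(s);
	s->reject_count++;
	s->backoff_until = now + SCAN_BACKOFF_S;
}

/* Record a successful scan. */
void scan_note_success(struct scan_defer *s)
{
	assert(s);
	s->reject_count = 0;
}
```

Feeding the model the 12 s roam cadence reproduces the timeline: rejects at t=0, 12 and 24 are attempted, the request at t=36 is deferred, and a request arriving after the window has lapsed resets the count and is attempted again.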