Commit Graph

21 Commits

Author SHA1 Message Date
test0r dd01be0162 bes2600: Patch E — skip ps_state_lock when PSM-known-disabled
Per the Opus structural critique (PR #8 §2.4) and Sonnet review item 5.
The per-RX-frame early-data path takes ps_state_lock to double-check
whether a link entry transitioned to BES2600_LINK_SOFT (AP-side
power-save state machine, soft-link transition).

When c7 has latched pm_unsupported = true (firmware does not honor
PSM, see feedback_bes2600_firmware_no_psm memory), the AP power-save
state machine is dead and link entries never transition to LINK_SOFT.
The per-frame spin_lock_bh + double-check is wasted work.

This patch gates the lock acquisition on !pm_unsupported.  When the
latch is on (the steady state on the production-shipped bes2600
firmware), early_data RX frames bypass the spin_lock_bh and go
directly to ieee80211_rx_irqsafe.

If a future firmware drop fixes PSM, c7 self-clears pm_unsupported on
the first real PM_INDICATION and the locked path resumes.

Scope is narrower than Sonnet originally framed: only the per-RX-frame
hot path (txrx.c:1945-1951 in cleanups+G+D) is touched.  Other
ps_state_lock sites in txrx.c (lines 657, 1256, 1420, 1528) are TX
submission / multicast-start / link-id paths, not per-frame RX, and
not on the Bug #5 hot path.  Leave those alone.

Build verified: srcversion B5922B4933590F33207EE97 on ohm sandbox.
2026-05-20 20:17:58 +02:00
test0r 93f2aab656 bes2600: Patch D — atomicize ba_lock counters, drop the spinlock
The block-ack policy uses 4 int counters (ba_acc, ba_cnt, ba_acc_rx,
ba_cnt_rx) bumped per data frame in the TX and RX hot paths under
spin_lock_bh(&hw_priv->ba_lock).  The lock was the heaviest per-frame
synchronization cost remaining after Patch C v3 (which fixed the
sdio_rx_work relay).  Per the Opus structural critique (PR #8), this
pattern matches mac80211 driver convention for per-frame statistics:
atomic_t suffices, no lock needed.

Field-by-field changes in struct bes2600_common:
  ba_acc, ba_cnt, ba_acc_rx, ba_cnt_rx: int -> atomic_t
  ba_armed:                              new atomic_t (timer-arm flag)
  ba_ena:                                bool -> atomic_t
  ba_lock:                               removed (spinlock_t deleted)
  ba_hist:                               int (single-writer = ba_timer)

Producer hot path (txrx.c TX submit + RX receive):
  - atomic_add for the byte accumulator
  - atomic_inc for the frame counter
  - atomic_cmpxchg(&ba_armed, 0, 1) to claim the once-per-window
    mod_timer arm — at most ONE producer succeeds; race-free
  - no spin_lock_bh

Consumer paths (sta.c bes2600_ba_timer, sta.c disconnect-reset, sta.c
bes2600_ba_work, debug.c debugfs reader):
  - atomic_read snapshots all 4 counters into locals; the threshold
    predicate (acc/cnt >= THLD) tolerates approximate snapshots — the
    timer fires periodically, a single misclassification just delays
    the policy update by one tick
  - atomic_set zeroes the counters at end of timer-callback window;
    racing producer increments after the snapshot are lost (acceptable
    for stats; same approximation the original lock allowed under
    contention)
  - atomic_set(&ba_armed, 0) re-enables the next window's arm

Followup-amenable simplification: ba_hist remains int because only
the single ba_timer callback writes it; multiple writers would need
to upgrade it too.

This patch follows the cw1200-mainline-idiom established by Patch C v3
(structural fix, not bandaid).  The cw1200 reference doesn't have a
similar lock to compare; bes2600 inherited this from a later
Bestechnic addition rather than the upstream tree.
2026-05-20 20:17:58 +02:00
test0r a02f8b7629 bes2600: Patch G — restore SPDX identifiers + ST-Ericsson attribution
The bes2600 driver is a fork of the upstream cw1200 driver
(drivers/net/wireless/st/cw1200/, ST-Ericsson, Dmitry Tarnyagin
2010-2011).  The fork's file headers have three GPL-compliance issues:

  1. NO SPDX-License-Identifier on any of 48 source files (cw1200
     mainline has them on all 25).  kernel.org-mandated since 2017.

  2. Original "Copyright (c) 2010, ST-Ericsson" lines stripped from
     all files inherited from cw1200, replaced with
     "Copyright (c) 2010, Bestechnic" — factually impossible
     (Bestechnic did not author the 2010 work) and a GPL-2.0 §1
     attribution-preservation violation.

  3. The "GPL version 2 as published by the Free Software Foundation"
     boilerplate paragraph is redundant alongside SPDX and is the
     legacy form modern kernel sources have replaced.

This patch corrects all three for the 48 .c/.h files in bes2600/:

  - Adds `// SPDX-License-Identifier: GPL-2.0-only` (or `/* ... */`
    for headers) as line 1 of every file.
  - Restores `Copyright (c) 2010, ST-Ericsson` + `Author: Dmitry
    Tarnyagin <dmitry.tarnyagin@lockless.no>` as the FIRST copyright
    chain entry on all 22 files derived from cw1200 (bh.{c,h},
    debug.{c,h}, fwio.{c,h}, hwio.{c,h}, main.c, pm.{c,h},
    queue.{c,h}, scan.{c,h}, sta.{c,h}, txrx.{c,h}, wsm.{c,h}).
  - Keeps `Copyright (c) 2022, Bestechnic (Beijing) Co., Ltd.` as
    the SECOND chain entry where Bestechnic genuinely contributed.
  - Notes "Derived from cw1200_sdio.c" + ST-Ericsson copyright on
    bes2600_sdio.c (heavy derivation, not a literal rename).
  - Notes "Replaces hwbus.h from cw1200/" + ST-Ericsson copyright
    on sbus.h.
  - Preserves the prism54/islsm authorship chain on main.c and
    bes2600.h (Michael Wu 2006 + Jean-Baptiste Note 2004-2006).
  - Drops the GPL-2.0 boilerplate paragraph in favour of SPDX.

No code changes — only file-header comment blocks.  Module build is
unaffected (verified by header-only diff scope).

This is a prerequisite for any kernel.org submission attempt.  The
existing MODULE_LICENSE("GPL") + MODULE_AUTHOR(Tarnyagin@stericsson.com)
declarations were already present and are unchanged here; the
mismatch between MODULE_AUTHOR and the (since-corrected) per-file
copyrights is now resolved.
2026-05-20 20:17:58 +02:00
test0r 73191b7bc1 bes2600: drop sdio_rx_work relay, IRQ→bh-direct (no-relay architecture)
Patch C v3 — match cw1200 mainline architecture
(drivers/net/wireless/st/cw1200/).  Eliminates the
sdio_rx_work workqueue relay that introduced a thread-safety
race on hw_priv->hw_bufs_used in v1 (PR #3 closed) and that
v2's atomic_t prep was a workaround for (PR #10 superseded by
v3 plan PR #11).

Architectural changes:

  - bes2600_gpio_irq_handler: now calls self->irq_handler()
    directly instead of queue_work(self->sdio_wq, &self->rx_work).
    Bumps bh_rx atomic + wakes bh_wq.
  - bes2600_bh_rx_helper (BES_SDIO_RX_MULTIPLE_ENABLE branch):
    now calls priv->sbus_ops->bus_rx_batch() to do the SDIO read
    inline.  No pipe_read, no skb_dequeue.
  - bes2600_sdio_read_rx_batch (new): the SDIO read sequence
    extracted from sdio_rx_work, registered as
    sbus_ops->bus_rx_batch.  Runs in bh thread context.
  - bes2600_sdio_extract_packets: calls
    bes2600_bh_handle_rx_skb() directly per parsed SKB.  No
    skb_queue_tail, no rx_queue.
  - bes2600_bh_handle_rx_skb (new in bh.c): the per-SKB
    bookkeeping that bh_rx_helper used to do post-pipe_read
    (seq# check, exception, confirm-condition, wsm_handle_rx).
    Wakes bh thread for tx-burst via atomic_inc(&priv->bh_tx)
    instead of bes2600_bh_wakeup() — we ARE the bh thread.
  - Post-tx queue_work(rx_work) site: replaced with
    self->irq_handler() to wake bh for piggyback RX check.

Deleted infrastructure:

  - struct sbus_priv: rx_queue, rx_queue_lock, rx_work fields
  - bes2600_sdio_pipe_read: function deleted (unused)
  - sdio_rx_work: function deleted (unused)
  - sbus_ops->pipe_read assignment: removed for SDIO bus
  - skb_queue_head_init(&self->rx_queue), spin_lock_init(...),
    INIT_WORK(rx_work): probe-time setup removed
  - cancel_work_sync(rx_work) + drain loop in empty_work: removed
  - flush_work(rx_work) in drain helper: replaced with msleep(2)
  - work_pending(rx_work) check in suspend predicate: removed

Concurrency invariant restored:

  - hw_priv->hw_bufs_used: single-writer (bh thread only)
    by construction.  No atomic_t needed.
  - hw_priv->hw_bufs_used_vif[]: ditto.
  - hw_priv->wsm_tx_pending[]: ditto.
  - All other shared state: unchanged or already protected.

Phase 7 partial verification (rep 1, 2026-05-07):

  - Module loads clean, srcversion 371C6606B73AF19299228CA
  - Link associates, no WARN/BUG/oops
  - sdio_rx_work dispatches: 0 (function deleted)
  - bes2600_bh_work redispatches: 0 (single long-lived
    invariant preserved)
  - Chip handled stress traffic without wedge

Phase 7 full N=3 stress ramp deferred to follow-up rep series
(rep 2 had a TCP-level nc race; not a bes2600 issue but
invalidated rep 2's throughput number).
2026-05-20 20:17:58 +02:00
test0r 9e38ac5523 bes2600: fix concurrency UAF in bes2600_hw_scan and sched_scan
bes2600_bss_info_changed() and bes2600_hw_scan() can run concurrently.
The probe-request SKB allocated by ieee80211_probereq_get() before
scan.lock + conf_lock are taken can be touched by a concurrent
bss_info_changed (via wsm_set_template_frame's path) while we hold no
lock.  Reorder to acquire both locks BEFORE the SKB allocation.

Also reorder cleanup paths so dev_kfree_skb() runs BEFORE up() —
otherwise a small window exists where the SKB has been touched but the
lock has been released, allowing concurrent code to also touch it.

Three sites fixed:
  - bes2600_hw_scan: lock-take + ENOMEM cleanup + wsm_set_template_frame
    error cleanup + success-path SKB free + lock release order
  - bes2600_sched_scan_start (#ifdef ROAM_OFFLOAD): same three sub-fixes
    (compiled-out at default build, fixed for consistency)
  - All success/error paths: dev_kfree_skb before up()

Backport of cw1200 mainline commit 86760e0dfe36 ("cw1200: Fix
concurrency use-after-free bugs in cw1200_hw_scan()", 2018-12-14),
which fixed the identical bug in the same code shape we inherited.
That commit was merged from upstream 4f68ef64cd7f.

Cherry-picked from upstream Linux:
  86760e0dfe36 cw1200: Fix concurrency use-after-free bugs in cw1200_hw_scan()
  Author: Jia-Ju Bai <baijiaju1990@gmail.com>
  Link: https://lore.kernel.org/r/20181214035521.7575-1-baijiaju1990@gmail.com
2026-05-20 20:17:58 +02:00
test0r 77f966df25 bes2600: fix missing destroy_workqueue() on error in init_common
Two error paths between create_singlethread_workqueue() (~main.c:489)
and the success-path destroy_workqueue() in unregister_common (~609)
return without cleaning up the workqueue, leaking it on probe failure:

  1. bes2600_queue_stats_init() failure
  2. bes2600_queue_init() failure (any of the 4 TID queues)

Both call ieee80211_free_hw(hw); return NULL — without first
destroy_workqueue(hw_priv->workqueue).  Add it.

Backport of cw1200 mainline commit 7ec8a926188e ("cw1200: fix missing
destroy_workqueue() on error in cw1200_init_common", 2020-11-19),
which fixed the identical bug in the same code shape we inherited.
Reported on cw1200 by Hulk Robot.

Cherry-picked from upstream Linux:
  7ec8a926188e cw1200: fix missing destroy_workqueue() on error
  Author: Qinglang Miao <miaoqinglang@huawei.com>
  Reported-by: Hulk Robot <hulkci@huawei.com>
  Signed-off-by: Kalle Valo <kvalo@codeaurora.org>
  Link: https://lore.kernel.org/r/20201119070842.1011-1-miaoqinglang@huawei.com
  Fixes: a910e4a94f69 ("cw1200: add driver for the ST-E CW1100 & CW1200 WLAN chipsets")
2026-05-20 20:17:58 +02:00
test0r d9268b433a bes2600: replace a set of atomic_add()
Backport of cw1200 mainline commit 07f995ca1951 ("cw1200: replace a set
of atomic_add()", 2020-11-10).  atomic_inc() reads more naturally than
atomic_add(1, &x).  Mechanical change, no functional impact.

7 sites: 6 in bh.c (bh_term, bh_rx x2, bh_tx x3) and 1 in itp.c
(awaiting_confirm).  Two of the bh_rx and three of the bh_tx sites are
inside the cw1200-ancestor #if 0 block; replaced anyway to keep the
file consistent with cw1200 mainline source style.

Cherry-picked from upstream Linux:
  07f995ca1951 cw1200: replace a set of atomic_add()
  Author: Yejune Deng <yejune.deng@gmail.com>
  Signed-off-by: Kalle Valo <kvalo@codeaurora.org>
  Link: https://lore.kernel.org/r/1604991491-27908-1-git-send-email-yejune.deng@gmail.com
2026-05-20 20:17:58 +02:00
claude-noether a7e232738d bes2600: bus_reset on connection-loss storm to dodge assoc-comeback blackhole
When mac80211 declares connection loss against this AP (typically driven
by inactivity-deauth or beacon-loss), the userspace reauth that follows
sometimes enters a long blackhole: the AP responds to auth with success
but defers assoc with the 802.11v "assoc comeback" timer; ohm retries
faster than the comeback grants permission; the AP eventually fires an
unprotected deauth-reason-6 ("Class 2 frame received from non-
authenticated station"), and recovery only completes via cross-SSID or
cross-channel fallback. Receipts: ~86 s blackhole observed in the
phase-7 rep on 2026-05-07 02:42, with three subsequent BSSIDs returning
assoc comeback timeouts before reason-9 (STA_REQ_ASSOC_WITHOUT_AUTH)
fired. Documented in marfrit/besser:notes/phase4-2026-05-07.md.

When N=3 driver-side connection_loss decisions fire within a 60 s window
on the same vif, skip the ieee80211_connection_loss() path and trigger
the c5.2-introduced bes2600_chrdev_do_bus_reset() instead. The bus
reset removes and re-probes the chip; userspace re-associates with a
fresh chip state, dodging the AP's comeback-timer rejection cycle.

Predicted Phase 7 delta vs current baseline:
- api_connection_loss rate: unchanged (we don't address the trigger)
- conditional probability of >5 s blackhole given event: <= 30 %
- worst-case recovery: 86 s -> < 10 s

Contract pin: bes2600_chrdev_do_bus_reset(sbus_ops, sbus_priv) at
bes2600/bes_chardev.c:455, introduced by c5.2. The function is async-
returning: sbus_ops->bus_reset() schedules an SDIO rescan; the helper
waits up to 3 s for the remove() callback to clear sbus_priv, then
returns. Per-vif state is gone after this point, so the recover work
lives on bes2600_common (hw_priv) and uses the global bes2600_cdev for
the bus_reset call rather than dereferencing per-vif state.

Threshold (3 / 60 s) is well above the steady-state per-vif
connection_loss rate observed in the patch-A phase-7 rep (0.86/h under
sustained load), so a true storm is required to trip it.

Files touched:
- bes2600/bes2600.h: 3 counter fields on struct bes2600_vif, 1
  work_struct on struct bes2600_common, 3 prototypes
- bes2600/sta.c: 3 helpers + storm-account hook in
  bes2600_connection_loss_work + storm-init in bes2600_vif_setup +
  cancel_work_sync in the hw_priv shutdown path; #include bes_chardev.h
  was already pulled in by an earlier c-stack patch
- bes2600/main.c: INIT_WORK alongside other hw_priv work_structs
- bes2600/debug.c: ConnectionLossStormRecoveries seq_printf in the
  per-vif status seq_file output

The cw1200/cw1260 ancestor has no equivalent; this is a clean
addition. checkpatch.pl --no-tree --strict: clean (0/0/0).

Signed-off-by: Claude (noether) <claude@reauktion.de>
2026-05-20 20:17:58 +02:00
claude-noether 3b4239ad2b bes2600: pre-empt AP-deauth-6 with mac80211 reassoc on decrypt-fail storm
When the BES2600 firmware reports WSM_STATUS_DECRYPTFAILURE for a burst
of received frames (typically because the host's PTK or GTK has fallen
out of sync with the AP), the AP eventually concludes that the STA is
not authenticated and emits an unprotected deauth-reason-6 ("Class 2
frame received from non-authenticated station"). On the deployed
pinetab2 + bes2600 stack this AP-initiated deauth has been observed to
leave the link blackholed for up to 109 s before userspace finds a
different SSID/channel to recover on. (Receipts at
https://git.reauktion.de/marfrit/besser, notes/phase5-2026-05-06.md.)

Add a sliding-window counter on each bes2600_vif: when 5 decrypt
failures fire within 5 s, schedule a worker that calls
ieee80211_connection_loss(vif). mac80211 then performs immediate
disassociation; userspace (NetworkManager / wpa_supplicant) reconnects
with fresh keys before the AP gets a chance to fire its unprotected
deauth.

Predicted Phase 7 delta vs the unpatched baseline:
- decrypt-burst rate: unchanged (this does not address root cause)
- AP-deauth-6 rate: <= 0.2 of baseline
- conditional probability of >5s blackhole given a burst:
  100% -> <= 10%
- worst-case recovery time: 109s -> <5s

Contract pin: ieee80211_connection_loss() per
include/net/mac80211.h: "may also be called if the connection needs to
be terminated for some other reason... will cause immediate change to
disassociated state, without connection recovery attempts." Userspace
recovery is the existing NM/wpa_supplicant path. The worker context
satisfies the implicit process-context expectation.

Files touched:
- bes2600/bes2600.h: 4 new fields on struct bes2600_vif + 2 prototypes
- bes2600/txrx.c: new helpers + the call site at the existing
  WSM_STATUS_DECRYPTFAILURE log point (the unconditional "goto drop"
  branch in bes2600_rx_cb)
- bes2600/sta.c: bes2600_decrypt_storm_init() in bes2600_vif_setup;
  cancel_work_sync() in bes2600_remove_interface, alongside the
  existing per-vif cancel_*_work_sync block. Safe under the kernel
  cancel_work_sync contract: the work_struct is INIT_WORK'd in setup,
  so the call is valid; it blocks until any in-flight handler returns,
  ensuring no use-after-free of priv when mac80211 frees the vif; and
  it is idempotent (subsequent calls just return false).
- bes2600/debug.c: DecryptStormRecoveries seq_printf in the per-vif
  status seq_file output

Threshold (5/5s) is set well above the steady-state per-vif decrypt-
fail rate observed in measurement (~1/min even under sustained 1 MB/s
load), so a true storm is required to trip it. The cw1200/cw1260
ancestor has no equivalent storm-recovery; this is a clean addition.

checkpatch.pl --no-tree --strict: clean (0/0/0).

Signed-off-by: Claude (noether) <claude@reauktion.de>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 20:17:58 +02:00
test0r d48f2ae73c bes2600: handle multi-function SDIO cards in mmc_hw_reset bus_reset
c5.2 (recover-wedged-firmware-via-mmc-hw-reset) wraps mmc_hw_reset()
and treats any non-zero return as a recovery failure. On
single-function SDIO cards mmc_hw_reset returns 0 after doing the
remove + rescan inline. On multi-function cards (BES2600 has WLAN
func 1 + BT companion func 2) the kernel's mmc_sdio_hw_reset() does
NOT do the rescan: it tears the card down and returns 1 to signal
"caller must trigger rescan".

Field observation on PineTab2 (linux-pinetab2 6.19.10-danctnix1):
when a real LMAC wedge fired bes2600_chrdev_wifi_force_close ->
bes2600_chrdev_do_bus_reset, mmc_hw_reset returned 1, c5.2's wrapper
treated that as "bus_reset failed: 1", logged the error, and gave
up. The card was already removed (mmc2: card 0001 removed) but
nothing scheduled a rescan; wifi (and the BT companion which shares
the same SDIO host) stayed silent until the user rebooted four
minutes later.

Fix:

  - Capture the mmc_host pointer before calling mmc_hw_reset (the
    card pointer is invalid after the remove).
  - On positive return (multi-function path), log informationally
    and call mmc_detect_change(host, 0) to schedule a rescan.
    Return 0 so callers see the recovery as successful.
  - Negative return is still treated as failure as before.

The mmc_detect_change side effect is asynchronous; the chrdev's
wait_event_timeout(probe_done_wq, !sbus_priv) still observes the
remove half synchronously, and the rescan + re-probe runs out of
the host detect work afterwards.

Signed-off-by: Markus Fritsche <fritsche.markus@gmail.com>
2026-05-20 20:17:58 +02:00
test0r 9a0a4c0a46 bes2600: self-detect when firmware does not honor PSM and skip the cycle
The c6 series fixed several host-side bookkeeping bugs around PSM
transitions, but didn't address the underlying contract: this chip's
firmware (BES2600 with the Bestechnic Dec 2023 build that ships on
PineTab2 and most danctnix images) silently drops every WSM_set_pm
request without emitting the corresponding PM_INDICATION. The driver's
own power_down_work delayed work calls bes2600_pwr_enter_lp_mode every
~10s; without firmware acknowledgment each call burns 5s on
wait_for_completion_timeout(pm_enter_cmpl, 5*HZ) and produces a
recurring three-line cascade in dmesg:

  bes2600_pwr_enter_lp_mode, wait pm ind timeout
  bes2600_sdio_active failed, subsys:0
  bes2600_pwr_device_exit_lp_mode, active mcu fail

Confirmed by tripwire instrumentation on PineTab2 (linux-pinetab2
6.19.10-danctnix1, ohm) running the c5+c6 stack: zero
wsm_set_pm_indication() invocations across an entire boot, while
bes2600_pwr_enter_lp_mode timed out repeatedly, and
bes2600_sdio_active() consistently saw BES_SLAVE_STATUS_REG_ID return
0x2f (every "ready" bit set except MCU_WAKEUP_READY (bit 4) - the
firmware reports "I'm awake, there's nothing to wake from").

This patch makes the driver self-heal:

  * struct bes2600_pwr_t gains pm_unsupported (bool) and
    pm_consecutive_timeouts (unsigned int). Both initialised to
    0/false.

  * bes2600_pwr_enter_lp_mode early-returns -EOPNOTSUPP when
    pm_unsupported is set. Skips the per-VIF set_pm round-trip and
    the wait_for_completion entirely.

  * On the cmpxchg-success branch of the timeout path, we increment
    pm_consecutive_timeouts. When it crosses
    BES2600_PM_UNSUPPORTED_THRESHOLD (3, ~15s of trying), we latch
    pm_unsupported = true and force chip_pm_state = ACTIVE so that
    bes2600_pwr_device_exit_lp_mode's c6.2 skip branch covers the
    wake side (no gpio_wake / sbus_active / WSM_set_operational_mode
    reissue past the first one).

  * bes2600_pwr_notify_ps_changed resets pm_consecutive_timeouts to 0
    on any incoming PM indication, and clears pm_unsupported if it
    was previously latched. So a firmware update that fixes PM_IND
    delivery automatically re-enables PSM transitions without a
    driver rebuild.

mac80211's PSM requests via bes2600_set_pm() still flow to the
firmware unchanged; they just don't have host-side timeouts so they
remain silent regardless of firmware acknowledgment. Power
consumption goes up if the firmware actually CAN do PSM (we'd be
keeping the chip awake unnecessarily), but on a chip where the
counter trips this trade-off is forced anyway: the chip stayed awake
under the broken cascade as well, just with constant SDIO churn.

Net effect on dmesg: after ~15s of boot, the three-line cascade stops
firing entirely. The firmware-side wedge is observed once per boot
(captured by the pm_unsupported latch) instead of per-cycle.

Signed-off-by: Markus Fritsche <fritsche.markus@gmail.com>
2026-05-20 20:17:58 +02:00
test0r 51d46a2e25 bes2600: short-circuit wake handshake when chip is confirmed ACTIVE
The previous patch ("bes2600: gate PM indication completion on pending
request and track chip state") added enum bes2600_chip_pm_state and the
chip_pm_state field tracking what the host has *seen the firmware
confirm*. This patch makes the wake side use it.

Without this, every bes2600_pwr_device_exit_lp_mode() unconditionally
runs gpio_wake() + sbus_active() + wsm_set_operational_mode(active),
even when the chip is already in confirmed-ACTIVE state and the wake
sequence has nothing to do. The visible failure mode on PineTab2:

  bes2600_pwr_enter_lp_mode, wait pm ind timeout
  repeat set gpio_wake_flag, sub_sys:0
  bes2600_sdio_active failed, subsys:0
  bes2600_pwr_device_exit_lp_mode, active mcu fail

cycling every ~9 s, ~22 cycles in 10 minutes. Three pieces:

  1. enter_lp_mode timed out (firmware indication lost). With c6.1,
     chip_pm_state is now UNKNOWN.
  2. lock_device fires exit_lp_mode.
  3. gpio_wake hits "bit already set" because device_enter_lp_mode
     was skipped when the indication timed out, so gpio_sleep was
     never called - the bit reflects driver intent, not chip state.
     gpio_wake silently no-ops (no GPIO edge), bit stays set.
  4. sbus_active spends 200 x 2 ms looking for MCU_WAKEUP_READY that
     never comes (firmware was never told to wake), then fails.
  5. Driver continues to wsm_set_operational_mode against the wedged
     bus, compounding the failure.

This patch's three moves:

  * bes2600_pwr_device_exit_lp_mode() reads chip_pm_state at entry.
    On BES2600_CHIP_PM_ACTIVE, log at devel level and return without
    touching gpio_wake / sbus_active / WSM. The chip is in the state
    we want; the handshake exists only to drive a transition.

  * On BES2600_CHIP_PM_LP or BES2600_CHIP_PM_UNKNOWN, run the wake
    handshake as before, but on sbus_active() failure: set
    chip_pm_state = UNKNOWN, log once at err level, and bail out.
    Do NOT call wsm_set_operational_mode over a wedged bus - it
    would just emit a second error and leave the chip in an even
    less defined state.

  * bes2600_gpio_wakeup_mcu() / bes2600_gpio_allow_mcu_sleep():
    demote "repeat set/clear gpio_wake_flag" from bes_err to
    bes_devel. Multi-subsystem wake-hold (e.g. WIFI + BT both want
    MCU awake) is the steady-state case, and the symmetric clear
    while bit-already-clear is racy bookkeeping rather than a
    hardware error. The wake-side log line also now correctly
    updates the bit so the per-subsystem reference count stays
    accurate, fixing a pre-existing minor leak where an existing
    holder's repeat-call wouldn't bump the bit (which never matters
    today since BIT(flag) is 1, but matters if the structure ever
    grows to per-flag refcounts).

Net effect on the cycle:

  * If chip is genuinely ACTIVE (chip_pm_state == ACTIVE), wake skips
    cleanly. Storm goes silent.
  * If chip is genuinely LP, behaviour is unchanged.
  * If chip is UNKNOWN (post-timeout state), one wake attempt is
    made; on failure, state stays UNKNOWN and we don't emit a
    second cascade error per attempt. Repeated UNKNOWN with failed
    wake will eventually be picked up by the LMAC active-monitor
    and escalated to mmc_hw_reset (c5.2).

No new locks, no new state. Only consumption of the chip_pm_state
field added in the prerequisite patch.

Signed-off-by: Markus Fritsche <fritsche.markus@gmail.com>
2026-05-20 20:17:58 +02:00
test0r 7c4ad3b1d6 bes2600: gate PM indication completion on pending request and track chip state
When mac80211 toggles PSM on the BES2600, the host sends WSM set_pm
and waits up to 5 s on bes_power.pm_enter_cmpl for a firmware-side
PM-changed indication confirming the transition. Three sequenced
flaws make the wait-and-confirm racy and leave host/chip bookkeeping
desynced when anything misfires:

  1) bes2600_pwr_notify_ps_changed() unconditionally fires
     complete(pm_enter_cmpl) for any non-active psmode. It does not
     check whether a host-initiated set_pm is actually pending. A
     spontaneous indication (firmware-internal coex move,
     idle-driven aging) primes the completion, and the next host-
     driven enter_lp_mode sees a false success on its first
     wait_for_completion_timeout.

  2) The wait/reinit ordering in bes2600_pwr_enter_lp_mode is

         status = wait_for_completion_timeout(...);
         atomic_set(pm_set_in_process, 0);
         reinit_completion(...);

     If an indication arrives between wait_for_completion_timeout
     returning with status==1 and reinit_completion, the next
     enter_lp_mode iteration's wait can also see false success. The
     reinit must happen *before* we start the new request, not
     after handling the previous one.

  3) On wait_pm_ind timeout, the driver returns -ETIMEDOUT and walks
     away. It does not record that the firmware's actual PM state
     is no longer known to the host. Subsequent wake paths
     (gpio_wake / sbus_active) assume the chip is still active and
     hit deterministic SDIO failures when the firmware has
     transitioned anyway.

This patch is the safe-prerequisite half of a wider fix:

  * bes_pwr.h gains enum bes2600_chip_pm_state {ACTIVE, LP, UNKNOWN}
    and bes_power.chip_pm_state. Its job is to track what the host
    has *seen the firmware confirm*, not what the host has
    requested. Initialised to ACTIVE in bes2600_pwr_init().

  * bes2600_pwr_notify_ps_changed() unconditionally updates
    chip_pm_state on every indication, but only fires
    complete(pm_enter_cmpl) when atomic_cmpxchg(pm_set_in_process,
    1, 0) succeeds. A spontaneous indication can no longer prime a
    waiter that will only set up its request afterwards.

  * bes2600_pwr_enter_lp_mode() now reinit_completion()s before
    setting pm_set_in_process and sending wsm_set_pm. After a
    timeout, it cmpxchgs pm_set_in_process back to 0 (so a late
    indication cannot prime the next iteration) and on the win-
    cmpxchg branch records chip_pm_state=UNKNOWN.

A follow-up patch consumes chip_pm_state on the wake side
(bes2600_pwr_device_exit_lp_mode + bes2600_gpio_wakeup_mcu) to fix
the deterministic "active mcu fail" cycle this state-record
enables a fix for. Splitting the work this way keeps the lock-free
race fix small and reviewable on its own.

No new locks, no behaviour change on the success path. Only the
recovery path (timeout + spontaneous indication) gains correctness.

Signed-off-by: Markus Fritsche <fritsche.markus@gmail.com>
2026-05-20 20:17:47 +02:00
test0r e0f664cbc9 bes2600: recover wedged firmware via mmc_hw_reset on link break
When the LMAC active monitor detects 'link break between lmac and host'
(the hw_buf_used==pending watchdog in bes2600_bh_lmac_active_monitor),
bes2600_chrdev_wifi_force_close(hw_priv, true) is invoked to tear the
device down and prepare for a fresh probe. On the wifi_force_close_work
side this calls bes2600_chrdev_do_system_close() which dispatches
sbus_ops->power_switch(0).

On PineTab2 (RK3566 + BES2600WM over SDIO) this recovery path is a
no-op:

  * bes2600_sdio_power_down() writes a SYSTEM_CLOSE host-int message,
    clears MMC_CAP_NONREMOVABLE, and schedules sdio_scan_work, which is
    the literal one-line stub bes_warn("...this function does
    nothing\n").
  * bes2600_sdio_on() (the eventual power_switch(1) counterpart)
    toggles pdata->powerup, which is NULL on PineTab2 because the
    wifi-reset GPIO is owned by sdio_pwrseq, not the bes2600 device
    tree node (see arch/arm64/boot/dts/rockchip/rk3566-pinetab2.dtsi:
    'The reset pin is claimed by sdio_mmcseq, It is better to move it
    to U-Boot so the OS can use it.').

Net result: the chip is never reset. The function drivers are not
removed (the SDIO core has no signal that the card is gone), the
firmware stays wedged, and a subsequent rmmod bes2600 leaves the SDIO
function in a half-torn-down state. modprobe bes2600 then fails with
'probe with driver bes2600_wlan failed with error -123' (-ENOMEDIUM)
on both functions (:1 wifi, :2 BT-companion) until a full system
reboot.

Observed on PineTab2 (linux-pinetab2 6.19.10-danctnix1-1) after ~150
minutes of background-scan rejects (wsm_generic_confirm 0x0007,
[SCAN] Scan failed (-22)) accumulating until the LMAC stopped
acknowledging TX buffers (hw_buf_used:24 pending:24). Reproducible
under sustained scan pressure.

Add a sbus operation bus_reset() that the recovery path can call when
power_switch() has no effective chip-reset signal of its own. Provide
an SDIO implementation that calls mmc_hw_reset(self->func->card),
which on a multi-function SDIO card (PineTab2 binds func 1 for WLAN
and func 2 for the BT-companion path) takes the remove-and-rescan
path: mmc_sdio_hw_reset() marks the card removed and schedules
mmc_rescan, which tears down the bound function drivers and re-detects
the card on the next sweep, in turn reinvoking bes2600_sdio_probe().
With a single function probed it instead invokes mmc_power_cycle()
directly, which on PineTab2 toggles the wifi-reset GPIO via
sdio_pwrseq.

Add bes2600_chrdev_do_bus_reset() as the chrdev-side helper. It
invokes the bus op and then waits on probe_done_wq for the SDIO
remove() callback to clear sbus_priv, mirroring the wait pattern
already used by bes2600_chrdev_do_system_close() so that a subsequent
bes2600_switch_wifi(true) sees a clean state and can wait on the
fresh probe.

Wire it into bes2600_chrdev_wifi_force_close_work(): when halt_dev is
set (the hard-exception path used by both
bes2600_bh_lmac_active_monitor and bes2600_bh_mcu_active_monitor) and
the underlying bus implements bus_reset, take the new recovery path;
otherwise fall back to the legacy power_switch(0) sequence so this
patch is a no-op on USB or any other future bus that does not provide
bus_reset.

mmc_hw_reset() is exported by the MMC core and is the canonical
recovery primitive; calling it without holding the SDIO host claim is
correct because the multi-func remove-and-rescan path acquires the
host claim via the mmc workqueue, and the single-func mmc_power_cycle
path does not require the host claim.

No DT change is required: this works against the existing PineTab2
DTS, where the wifi-reset GPIO and the optional sdio_pwrkey GPIO (on
v2.0 boards) are both already configured as MMC pwrseq resets.

Signed-off-by: Markus Fritsche <fritsche.markus@gmail.com>
2026-05-20 20:16:59 +02:00
test0r bdb0450bdf bes2600: widen scan-defer backoff to 30s and decay count on quiet
The scan-defer logic added in the previous patch ("bes2600: defer
scan and soften WARN on firmware reject") used a 10-second backoff
window and never cleared reject_count outside of a successful scan.
Field testing on a PineTab2 (linux-pinetab2 6.19.10-danctnix1) shows
two distinct mac80211 scan-retry cadences in practice:

  * Idle background scans every ~5 minutes when associated -- well
    outside any plausible backoff, the defer guard correctly falls
    through to a real WSM scan attempt.

  * Roam-evaluation bursts triggered when mac80211 wants to find a
    candidate AP for handover (signal degradation, beacon loss,
    locally-generated DEAUTH_LEAVING reason=3). Cadence is ~12 s, and
    one boot reproduced 14 such rejected scans in 3 minutes during a
    single burst, none of which engaged the defer guard because every
    retry landed just outside the 10 s window.

Two-line behaviour change to fix that:

  1. BES2600_SCAN_BACKOFF_JIFFIES grows from 10*HZ to 30*HZ, so a
     12 s-cadence burst stays inside the window across consecutive
     rejects and the third reject in the burst trips the threshold
     guard. The 5 min idle case is still naturally past the window
     and is unaffected.

  2. bes2600_scan_should_defer() resets reject_count to 0 when
     time_after(jiffies, backoff_until). Without this, reject_count
     accumulated indefinitely across the slow-cadence rejects, so an
     isolated reject after long quiet would have tripped the
     threshold the moment it arrived. After the change, count is
     latched only inside an active burst and decays cleanly when the
     burst ends.

Net effect on a roam burst:

  * t=0   reject #1 (count 1, backoff_until = t0 + 30s)
  * t=12  reject #2 (count 2, backoff_until = t1 + 30s)
  * t=24  reject #3 (count 3, threshold met, next scan deferred)
  * t=36  defer fires, no WSM round-trip, reject not sent
  * ...   defers continue until the firmware-policy state clears
  * scan succeeds -> reject_count = 0, normal cadence resumes

WSM 0x0007 confirm rejections in a burst drop from ~14 to ~3 (just
the scans needed to reach the threshold). wpa_supplicant's reason=3
locally-generated disconnects driven by exhausted roam candidates
during the same burst window also drop.

No new state, no new symbols, no change to mac80211-facing semantics:
the deferred scan still completes via the existing fail: path with
status=-EBUSY, the same response a real firmware-busy would produce.

Signed-off-by: Markus Fritsche <fritsche.markus@gmail.com>
2026-05-20 20:16:59 +02:00
test0r 4fec8b2ecc bes2600: defer scan and soften WARN on firmware reject
On a BES2600-based PineTab2, mac80211's background-scan cadence
(about every 30 s when associated) triggers a two-step WARN splat
pattern, visible in dmesg roughly 30 times per 10 min of regular
WiFi use:

  wsm_generic_confirm ret 2
  WARNING: at wsm_handle_rx+0x8a4/0xf30 [bes2600]
  ... full stack trace ...
  ieee80211 phy0: wsm_generic_confirm failed for request 0x0007.
  WARNING: at bes2600_scan_work+0x5d4/0x810 [bes2600]
  ... full stack trace ...
  ieee80211 phy0: [SCAN] Scan failed (-22).

0x0007 is the WSM start-scan request; status 2 is the firmware's
rejected-by-policy response, which it returns for at least two
conditions:

  a) BT A2DP streaming in non-FDD coex mode -- the coex arbiter
     in firmware won't grant an off-channel window while a SCO/
     A2DP link is queued.
  b) A firmware-internal busy state whose exact trigger the
     driver cannot observe directly (confirmed on ohm with BT
     disconnected -- rejection still fires). Likely transient
     firmware-PM transitions.

Both are protocol-level policy responses, not kernel bugs, so the
full stack-trace WARN treatment is counterproductive: it buries
real problems and gets new users convinced the driver is broken.

Three-part fix:

  1. struct bes2600_scan grows two fields -- reject_count and
     backoff_until -- zero-initialised via the existing
     ieee80211_alloc_hw()-provided kzalloc.

  2. bes2600_scan_work() now consults bes2600_scan_should_defer()
     before calling bes2600_scan_start(). The helper short-
     circuits in two cases:

       - coex_is_bt_a2dp() is true and coex is not in FDD mode,
         since we already know the firmware will reject;
       - BES2600_SCAN_REJECT_THRESHOLD (3) consecutive rejections
         have fired and the BES2600_SCAN_BACKOFF_JIFFIES (10 s)
         backoff window has not yet elapsed.

     On defer or on a real firmware rejection, reject_count is
     bumped and backoff_until is refreshed. A successful scan
     clears reject_count.

  3. The WARN_ON(hw_priv->scan.status) at the scan_start() call
     site is replaced with a plain branch into the existing
     fail: label. wsm_generic_confirm()'s WARN() becomes a
     bes_devel() -- the per-request wiphy_warn in wsm_handle_rx
     (which includes the offending request id) is kept, so real
     debugging information is still on tape.

Net behaviour:

  - Expected rejections no longer produce stack traces. The only
    log line that remains on a rejected background scan is the
    upstream-caller's wiphy_warn identifying request 0x0007 or
    equivalent.
  - The driver stops hammering the firmware with doomed scan
    requests -- 3 rejections trigger a 10 s pause, during which
    bes2600_scan_work() returns without issuing WSM 0x0007.
  - The scan-completion path is unchanged; mac80211 sees the
    scan complete with no results and reissues on its normal
    cadence.
  - Real protocol-layer bugs (unexpected underflow in the
    confirm buffer) still WARN_ON at the 'underflow:' label.

Verified on ohm (PineTab2, linux-pinetab2 6.19.10-danctnix1-1):
WARN splat count dropped from 32 to 0 per 10 min uptime. WiFi
stays associated. No regression in other counters (KFENCE,
sdio_tx_work, RX failure, PS Mode Error, factory cali fail all
remain 0).

Signed-off-by: Markus Fritsche <fritsche.markus@gmail.com>
2026-05-20 20:16:59 +02:00
test0r e0d752aae9 sync bes2600/ to v7.0-danctnix1 baseline (rebasing reference) 2026-05-19 09:04:33 +02:00
Manuel Traut fe73571183 d/control: Fix packagename of fw dependency
Signed-off-by: Manuel Traut <manut@mecka.net>
2025-12-09 13:42:27 +00:00
Julian 624fa34bf8 Depend on firmware 2025-11-27 09:02:49 +01:00
Julian 70f1551c94 WIP: Fix autopkgtest 2025-09-18 11:44:54 +02:00
Julian ba20341e70 Upload
Source: https://github.com/cringeops/bes2600
Source: https://github.com/cringeops/bes2600/pull/14
Source: https://github.com/cringeops/bes2600/pull/17
Source: https://github.com/cringeops/bes2600/pull/20
2025-09-17 16:35:45 +02:00