Recurring wsm_generic_confirm 0x0007 (SCAN_REQ) failures + forced auth churn → VPN drops #1

Closed
opened 2026-05-03 20:32:28 +00:00 by marfrit · 4 comments
Owner

Symptom

WiFi on ohm (PineTab2, BES2600) goes through periodic disruption that's invisible to NetworkManager (still "connected") but kills longer-lived TCP sessions. Today this knocked the OpenVPN tunnel to shannon offline (lmcp ohm-tools MCP went silent in a Claude session); LAN reachability via ohm.fritz.box was unaffected, so the radio link itself recovers but mid-flight sessions die.

System

  • Host: ohm (PineTab2, RK3566 + BES2600)
  • Kernel: 6.19.10-danctnix1-1-pinetab2 #1 SMP PREEMPT_DYNAMIC Sat, 28 Mar 2026 02:45:08 +0000 aarch64 (DanctNIX)
  • Driver: out-of-tree bes2600.ko, srcversion 461AFB369355AE598D79BDF, /lib/modules/6.19.10-danctnix1-1-pinetab2/extra/bes2600.ko
  • Uptime when sampled: 1d 06:30
  • Current link (sample moment): SSID newton, AP c0:25:06:e6:5b:33, 5 GHz ch.40, signal -43 dBm, MCS 7 / 150 Mb/s — i.e. a good link, not a marginal-RF problem.
  • AP environment: Fritz!Box 7530 AX + 2× repeaters, three BSSIDs on the same SSID (c0:25:06:e6:5b:32, c0:25:06:e6:5b:33, cc:ce:1e:2b:74:17/c0:25:06:e6:61:b0).

Repeating signatures (counts over current dmesg buffer, ~22 h window)

Pattern Count
wsm_generic_confirm failed for request 0x0007 (SCAN_REQ rejected by FW) 345
wsm_join_confirm ret 1 (FW rejects driver JOIN) 90
Reason: 2=PREV_AUTH_NOT_VALID (driver-initiated forced roam) 32
total wlan0/ieee80211 lines 1794

First occurrence at [29840.916960] and still firing at [109373.800451] — pattern is steady-state, not a one-off boot glitch.

Pattern A — background scan rejected by firmware

Every ~300 s, mac80211 issues a roaming background scan that the BES2600 firmware refuses with -EINVAL:

[NNN] ieee80211 phy0: wsm_generic_confirm failed for request 0x0007.
[NNN] ieee80211 phy0: [SCAN] Scan failed (-22).

0x0007 is WSM_REQ_ID_SCAN_START (per bes2600/wsm.h). The FW returns a non-zero confirm status on what should be a routine scheduled scan while the STA is associated. mac80211's view: scan failed; whatever was driving the scan (background roam scoring, scheduled scan, ROAM trigger) gets no answer.

Pattern B — JOIN confirm error → auth timeout

When mac80211 then tries to switch BSSID (or just re-auth), the firmware rejects the JOIN:

bes2600_wlan mmc2:0001:1: wsm_join_confirm ret 1
wlan0: send auth to <bssid> (try N/3)
... 3× ...
wlan0: authentication with <bssid> timed out

wsm_join_confirm ret 1WSM_STATUS_FAILURE from FW on WSM_REQ_ID_JOIN. Driver retries 3× per AP, eventually mac80211 picks another BSSID candidate and retries; usually one of the three BSSIDs eventually accepts.

Pattern C — RX failures bursts

Correlated with the above:

bes2600_wlan mmc2:0001:1: [RX] Receive failure: 4.

Clumps of 2-5 lines, in the same windows as the JOIN failures.

Pattern D — PREV_AUTH_NOT_VALID forced roam

wlan0: deauthenticating from <bssid> by local choice (Reason: 2=PREV_AUTH_NOT_VALID)
wlan0: authenticate with <other-bssid> ...

The driver/FW evicts the current AP context as if PMKSA went stale, and mac80211 has to redo the auth/assoc dance. Happens ~1.5× per hour averaged across the buffer, sometimes back-to-back.

Hypothesis

Likely root cause is in the WSM SCAN_REQ handling path while the STA is associated:

  1. Background scan fires while connected.
  2. Driver may not be pausing TX / dropping into PS correctly before issuing WSM_REQ_ID_SCAN_START, FW returns -EINVAL.
  3. Failed scan leaves driver state inconsistent; next mac80211-initiated WSM_REQ_ID_JOIN (for a roam target or refresh) hits a FW that still thinks it's mid-scan or otherwise busy → wsm_join_confirm ret 1.
  4. mac80211's auth retry loop runs out, falls back to deauth + try another AP, producing the PREV_AUTH_NOT_VALID churn.

Worth checking:

  • The pre-scan TX-pause / power-save handshake in bes2600/scan.c against current mac80211 callbacks.
  • Whether WSM_REQ_ID_SCAN_START is correctly setting the scan params (channel list, dwell, SSID list, numProbeRequests) — -EINVAL from FW is the most common "params look wrong to me" answer.
  • Whether BES2600 firmware on this device has known scan-while-associated bugs we've already RE'd in another branch.

Sample dmesg slice (last storm)

[105714.185945] ieee80211 phy0: wsm_generic_confirm failed for request 0x0007.
[105714.186819] ieee80211 phy0: [SCAN] Scan failed (-22).
[105928.888083] wlan0: deauthenticating from c0:25:06:e6:5b:32 by local choice (Reason: 2=PREV_AUTH_NOT_VALID)
[105929.420027] wlan0: authenticate with c0:25:06:e6:61:b0 (local address=8a:2e:77:1f:ec:05)
[105929.420048] wlan0: send auth to c0:25:06:e6:61:b0 (try 1/3)
[105929.574112] wlan0: send auth to c0:25:06:e6:61:b0 (try 2/3)
[105929.577359] wlan0: authenticated
[105929.578138] wlan0: associate with c0:25:06:e6:61:b0 (try 1/3)
[105929.686105] wlan0: associate with c0:25:06:e6:61:b0 (try 2/3)
[105929.693216] wlan0: RX AssocResp from c0:25:06:e6:61:b0 (capab=0x1431 status=30 aid=0)
[105929.693260] wlan0: c0:25:06:e6:61:b0 rejected association temporarily; comeback duration 1000 TU (1024 ms)
[105930.726128] wlan0: associate with c0:25:06:e6:61:b0 (try 3/3)
[105930.878069] wlan0: association with c0:25:06:e6:61:b0 timed out

Impact

  • VPN tunnel drops (today's trigger — Claude ohm-tools MCP went silent).
  • Any long-running TCP connection through ohm is unreliable.
  • Background scan failure means roam decisions are blind, so the device can stick to a bad AP if it ever has one.
  • LAN-side reachability is mostly fine because re-association recovers within ~1-2 s, but during that window everything stalls.

Next steps (proposed)

  • Confirm WSM opcode 0x0007 mapping in current wsm.h (sanity-check against this report).
  • Capture an iw event -t trace alongside dmesg during a failure window for cross-correlation with mac80211 state.
  • If reproducible at will: enable bes2600 dyndbg + WSM tracing and capture one full SCAN_REQ → confirm cycle to pin down the FW arg that's invalid.
  • Cross-check against any newer besser patches not yet on this build.
## Symptom WiFi on `ohm` (PineTab2, BES2600) goes through periodic disruption that's invisible to NetworkManager (still "connected") but kills longer-lived TCP sessions. Today this knocked the OpenVPN tunnel to `shannon` offline (lmcp `ohm-tools` MCP went silent in a Claude session); LAN reachability via `ohm.fritz.box` was unaffected, so the radio link itself recovers but mid-flight sessions die. ## System - **Host:** ohm (PineTab2, RK3566 + BES2600) - **Kernel:** `6.19.10-danctnix1-1-pinetab2 #1 SMP PREEMPT_DYNAMIC Sat, 28 Mar 2026 02:45:08 +0000 aarch64` (DanctNIX) - **Driver:** out-of-tree `bes2600.ko`, `srcversion 461AFB369355AE598D79BDF`, `/lib/modules/6.19.10-danctnix1-1-pinetab2/extra/bes2600.ko` - **Uptime when sampled:** 1d 06:30 - **Current link (sample moment):** SSID `newton`, AP `c0:25:06:e6:5b:33`, 5 GHz ch.40, signal -43 dBm, MCS 7 / 150 Mb/s — i.e. a *good* link, not a marginal-RF problem. - **AP environment:** Fritz!Box 7530 AX + 2× repeaters, three BSSIDs on the same SSID (`c0:25:06:e6:5b:32`, `c0:25:06:e6:5b:33`, `cc:ce:1e:2b:74:17`/`c0:25:06:e6:61:b0`). ## Repeating signatures (counts over current dmesg buffer, ~22 h window) | Pattern | Count | |---|---| | `wsm_generic_confirm failed for request 0x0007` (SCAN_REQ rejected by FW) | **345** | | `wsm_join_confirm ret 1` (FW rejects driver JOIN) | **90** | | `Reason: 2=PREV_AUTH_NOT_VALID` (driver-initiated forced roam) | **32** | | total wlan0/ieee80211 lines | 1794 | First occurrence at `[29840.916960]` and still firing at `[109373.800451]` — pattern is steady-state, not a one-off boot glitch. ## Pattern A — background scan rejected by firmware Every ~300 s, mac80211 issues a roaming background scan that the BES2600 firmware refuses with `-EINVAL`: ``` [NNN] ieee80211 phy0: wsm_generic_confirm failed for request 0x0007. [NNN] ieee80211 phy0: [SCAN] Scan failed (-22). ``` 0x0007 is `WSM_REQ_ID_SCAN_START` (per `bes2600/wsm.h`). The FW returns a non-zero confirm status on what should be a routine scheduled scan while the STA is associated. mac80211's view: scan failed; whatever was driving the scan (background roam scoring, scheduled scan, ROAM trigger) gets no answer. ## Pattern B — JOIN confirm error → auth timeout When mac80211 then tries to switch BSSID (or just re-auth), the firmware rejects the JOIN: ``` bes2600_wlan mmc2:0001:1: wsm_join_confirm ret 1 wlan0: send auth to <bssid> (try N/3) ... 3× ... wlan0: authentication with <bssid> timed out ``` `wsm_join_confirm ret 1` ⇒ `WSM_STATUS_FAILURE` from FW on `WSM_REQ_ID_JOIN`. Driver retries 3× per AP, eventually mac80211 picks another BSSID candidate and retries; usually one of the three BSSIDs eventually accepts. ## Pattern C — RX failures bursts Correlated with the above: ``` bes2600_wlan mmc2:0001:1: [RX] Receive failure: 4. ``` Clumps of 2-5 lines, in the same windows as the JOIN failures. ## Pattern D — PREV_AUTH_NOT_VALID forced roam ``` wlan0: deauthenticating from <bssid> by local choice (Reason: 2=PREV_AUTH_NOT_VALID) wlan0: authenticate with <other-bssid> ... ``` The driver/FW evicts the current AP context as if PMKSA went stale, and mac80211 has to redo the auth/assoc dance. Happens ~1.5× per hour averaged across the buffer, sometimes back-to-back. ## Hypothesis Likely root cause is in the WSM `SCAN_REQ` handling path while the STA is associated: 1. Background scan fires while connected. 2. Driver may not be pausing TX / dropping into PS correctly before issuing `WSM_REQ_ID_SCAN_START`, FW returns `-EINVAL`. 3. Failed scan leaves driver state inconsistent; next mac80211-initiated `WSM_REQ_ID_JOIN` (for a roam target or refresh) hits a FW that still thinks it's mid-scan or otherwise busy → `wsm_join_confirm ret 1`. 4. mac80211's auth retry loop runs out, falls back to deauth + try another AP, producing the `PREV_AUTH_NOT_VALID` churn. Worth checking: - The pre-scan TX-pause / power-save handshake in `bes2600/scan.c` against current mac80211 callbacks. - Whether `WSM_REQ_ID_SCAN_START` is correctly setting the scan params (channel list, dwell, SSID list, `numProbeRequests`) — `-EINVAL` from FW is the most common "params look wrong to me" answer. - Whether BES2600 firmware on this device has known scan-while-associated bugs we've already RE'd in another branch. ## Sample dmesg slice (last storm) ``` [105714.185945] ieee80211 phy0: wsm_generic_confirm failed for request 0x0007. [105714.186819] ieee80211 phy0: [SCAN] Scan failed (-22). [105928.888083] wlan0: deauthenticating from c0:25:06:e6:5b:32 by local choice (Reason: 2=PREV_AUTH_NOT_VALID) [105929.420027] wlan0: authenticate with c0:25:06:e6:61:b0 (local address=8a:2e:77:1f:ec:05) [105929.420048] wlan0: send auth to c0:25:06:e6:61:b0 (try 1/3) [105929.574112] wlan0: send auth to c0:25:06:e6:61:b0 (try 2/3) [105929.577359] wlan0: authenticated [105929.578138] wlan0: associate with c0:25:06:e6:61:b0 (try 1/3) [105929.686105] wlan0: associate with c0:25:06:e6:61:b0 (try 2/3) [105929.693216] wlan0: RX AssocResp from c0:25:06:e6:61:b0 (capab=0x1431 status=30 aid=0) [105929.693260] wlan0: c0:25:06:e6:61:b0 rejected association temporarily; comeback duration 1000 TU (1024 ms) [105930.726128] wlan0: associate with c0:25:06:e6:61:b0 (try 3/3) [105930.878069] wlan0: association with c0:25:06:e6:61:b0 timed out ``` ## Impact - VPN tunnel drops (today's trigger — Claude `ohm-tools` MCP went silent). - Any long-running TCP connection through `ohm` is unreliable. - Background scan failure means roam decisions are blind, so the device can stick to a bad AP if it ever has one. - LAN-side reachability is mostly fine because re-association recovers within ~1-2 s, but during that window everything stalls. ## Next steps (proposed) - [ ] Confirm WSM opcode 0x0007 mapping in current `wsm.h` (sanity-check against this report). - [ ] Capture an `iw event -t` trace alongside dmesg during a failure window for cross-correlation with mac80211 state. - [ ] If reproducible at will: enable `bes2600` dyndbg + WSM tracing and capture one full `SCAN_REQ → confirm` cycle to pin down the FW arg that's invalid. - [ ] Cross-check against any newer `besser` patches not yet on this build.
Author
Owner

2026-05-17 update — same pattern still firing, with new band/perf context

Re-verified on ohm today, BESser-c5x stack (kernel 7.0.0-pinetab2-danctnix-besser, srcversion 978FDDE6…). The WSM 0x0007 reject pattern from the OP is unchanged — though now showing as a slightly different operational symptom because ohm has since drifted off the 5 GHz attachment.

(Filing this as a comment here; my earlier #19 was a duplicate of this issue, closing that one with a pointer back.)

Band attachment shifted since May

  • 2026-05-03 (OP): ohm on 5 GHz ch.40, BSSID c0:25:06:e6:5b:33, signal -43 dBm, MCS 7 / 150 Mb/s.
  • 2026-05-17 (now): ohm on 2.4 GHz ch.11, BSSID c0:25:06:e6:5b:32, signal -44 dBm, MCS 7 / 72.2 Mb/s HT20 short-GI. Same SSID newton, combined-band setup. Moved next to AP — still 2.4 GHz attachment, nmcli connection down/up re-associates to the same 2.4 GHz BSSID.

Combined-SSID setup means client-driven band selection is at the mercy of whatever the scan can see, which is part of how Pattern A manifests as a slow drift toward 2.4 GHz over time.

Chip vs driver vs firmware — pinned

iw phy on ohm enumerates Band 2 with 37 channels spanning 5170-5875 MHz, HT20/HT40 caps 0x87e, MCS 0-7 single-stream. No VHT. Regdomain DE/DFS-ETSI permits all UNII-1/2/2e/3 bands at 20-26 dBm. So the chip + driver + regdomain are all 5 GHz-capable; the OP's Pattern A is the limiting layer.

Scan-reject in current state (today)

ieee80211 phy0: bes2600_scan_complete_cb status: 0
ieee80211 phy0: [SCAN] Scan completed.
ieee80211 phy0: wsm_generic_confirm failed for request 0x0007.
ieee80211 phy0: [SCAN] Scan failed (-22).

iw dev wlan0 scan returns zero BSSIDs, not even the currently-associated newton. So the failed scan currently drops the entire result set, including the 2.4 GHz portion that probably succeeded. (mac80211 still has the assoc state, hence the link stays up, but userspace iw scan sees nothing.)

Patch status from OP's hypothesis

Patch 056a71a "bes2600: defer scan and soften WARN on firmware reject" landed in danctnix-7.0.6 (Markus-authored, cherry-picked by Danct12). It implements:

  • Coex-aware deferral: bes2600_scan_should_defer() short-circuits if coex_is_bt_a2dp() and coex isn't FDD.
  • Rate-limited fallback: BES2600_SCAN_REJECT_THRESHOLD=3 rejections in a row triggers a 10s backoff window.
  • WARN demotion: full stack traces replaced with rate-limited single-line bes_devel-level prints.

What it does NOT do: persuade the firmware to actually accept the scan in the "firmware-internal busy state" branch the commit message acknowledges. So Patterns A → B → D in this issue still fire whenever that branch triggers, just without the dmesg storm.

TCP perf measurement (today, 2026-05-17)

LAN-direct download from boltzmann, post-reconnect, signal -44 dBm:

Bytes Time Throughput Tx retry rate
1,073,741,824 (1 GiB, md5 verified) 3:35.36 4.99 MB/s = 40 Mbit/s 16.25 %

Pre-reconnect at -57 dBm (was sat near a wifi extender) was 3.12 MB/s with worse retry. So the radio link can do ~5 MB/s sustained at MCS 7 HT20 2.4 GHz; not great, not crippling. A clean 5 GHz HT40 attachment would unlock most of the gap to peer-driver reports of ~80 Mbit/s — but we can't get there reliably while the scan rejects continue.

Workarounds

  • Direct-attach to a 5 GHz BSSID if the SSID layout permits (combined-band SSID makes this awkward but doable with bssid pin):
    nmcli connection add con-name newton-5g ifname wlan0 type wifi \
          ssid newton 802-11-wireless.band a 802-11-wireless.bssid c0:25:06:e6:5b:33
    
  • Disable BT briefly before forcing roam — clears the coex-driven reject path the patch already detects.

Both are workarounds, not fixes for the OP's root cause.

Proposed depth-of-fix taxonomy

  1. Smarter scan-deferral — gate scan attempts on observable firmware-state signals (PM state, BT activity beyond just A2DP). Today's 056a71a uses a fixed 10s backoff and a single coex check; the "firmware-internal busy" reject branch isn't observed before issue.
  2. BT-coex tuning — investigate whether the FDD coex mode can be entered earlier or held longer, opening the off-channel window. Firmware exposes bt_drv_config_coex_mode + BTC patch hooks (per reference_bes2600_firmware_first_pass).
  3. Per-band scan splitting — issue separate WSM_REQ_ID_SCAN_START requests per band so a 5 GHz-rejected scan doesn't poison 2.4 GHz results. Match how mac80211 issues multi-band scans.

All three need a longer firmware-state instrumentation pass on the bes2600 coex/PM behavior before they can be properly scoped.

Memory cross-refs

  • reference_bes2600_5ghz_scan_reject (today's session, will get tightened with this issue's data)
  • reference_bes2600_firmware_first_pass — coex/BTC symbol leaks for depth-3 fix
  • reference_bes2600_firmware_no_psm — sibling firmware-policy gotcha

Status

Leaving this open — no fix in flight, just documentation drift. Patterns B/C/D from the OP haven't been re-verified today (current ohm isn't in a roam-cycle right now), but the OP's evidence stands. If anyone has a clean reproduction of Pattern B (wsm_join_confirm ret 1) with bes2600 dyndbg enabled they could capture the full WSM exchange — that's the entry point for any of the three fix paths above.

## 2026-05-17 update — same pattern still firing, with new band/perf context Re-verified on ohm today, BESser-c5x stack (`kernel 7.0.0-pinetab2-danctnix-besser`, srcversion `978FDDE6…`). The WSM 0x0007 reject pattern from the OP is unchanged — though now showing as a slightly different operational symptom because ohm has since drifted off the 5 GHz attachment. (Filing this as a comment here; my earlier #19 was a duplicate of this issue, closing that one with a pointer back.) ## Band attachment shifted since May - **2026-05-03 (OP)**: ohm on 5 GHz ch.40, BSSID `c0:25:06:e6:5b:33`, signal -43 dBm, MCS 7 / 150 Mb/s. - **2026-05-17 (now)**: ohm on **2.4 GHz ch.11**, BSSID `c0:25:06:e6:5b:32`, signal -44 dBm, MCS 7 / 72.2 Mb/s HT20 short-GI. Same SSID `newton`, combined-band setup. Moved next to AP — still 2.4 GHz attachment, `nmcli connection down/up` re-associates to the same 2.4 GHz BSSID. Combined-SSID setup means client-driven band selection is at the mercy of whatever the scan can see, which is part of how Pattern A manifests as a slow drift toward 2.4 GHz over time. ## Chip vs driver vs firmware — pinned `iw phy` on ohm enumerates **Band 2 with 37 channels** spanning 5170-5875 MHz, HT20/HT40 caps `0x87e`, MCS 0-7 single-stream. No VHT. Regdomain DE/DFS-ETSI permits all UNII-1/2/2e/3 bands at 20-26 dBm. So the chip + driver + regdomain are all 5 GHz-capable; the OP's Pattern A is the limiting layer. ## Scan-reject in current state (today) ``` ieee80211 phy0: bes2600_scan_complete_cb status: 0 ieee80211 phy0: [SCAN] Scan completed. ieee80211 phy0: wsm_generic_confirm failed for request 0x0007. ieee80211 phy0: [SCAN] Scan failed (-22). ``` `iw dev wlan0 scan` returns **zero BSSIDs**, not even the currently-associated `newton`. So the failed scan currently drops the entire result set, including the 2.4 GHz portion that probably succeeded. (mac80211 still has the assoc state, hence the link stays up, but userspace `iw scan` sees nothing.) ## Patch status from OP's hypothesis Patch `056a71a` "bes2600: defer scan and soften WARN on firmware reject" landed in `danctnix-7.0.6` (Markus-authored, cherry-picked by Danct12). It implements: - Coex-aware deferral: `bes2600_scan_should_defer()` short-circuits if `coex_is_bt_a2dp()` and coex isn't FDD. - Rate-limited fallback: `BES2600_SCAN_REJECT_THRESHOLD=3` rejections in a row triggers a 10s backoff window. - WARN demotion: full stack traces replaced with rate-limited single-line `bes_devel`-level prints. What it does NOT do: persuade the firmware to actually accept the scan in the "firmware-internal busy state" branch the commit message acknowledges. So Patterns A → B → D in this issue still fire whenever that branch triggers, just without the dmesg storm. ## TCP perf measurement (today, 2026-05-17) LAN-direct download from boltzmann, post-reconnect, signal -44 dBm: | Bytes | Time | Throughput | Tx retry rate | |---|---|---|---| | 1,073,741,824 (1 GiB, md5 verified) | 3:35.36 | **4.99 MB/s = 40 Mbit/s** | **16.25 %** | Pre-reconnect at -57 dBm (was sat near a wifi extender) was 3.12 MB/s with worse retry. So the radio link can do ~5 MB/s sustained at MCS 7 HT20 2.4 GHz; not great, not crippling. A clean 5 GHz HT40 attachment would unlock most of the gap to peer-driver reports of ~80 Mbit/s — but we can't get there reliably while the scan rejects continue. ## Workarounds - **Direct-attach to a 5 GHz BSSID** if the SSID layout permits (combined-band SSID makes this awkward but doable with `bssid` pin): ``` nmcli connection add con-name newton-5g ifname wlan0 type wifi \ ssid newton 802-11-wireless.band a 802-11-wireless.bssid c0:25:06:e6:5b:33 ``` - Disable BT briefly before forcing roam — clears the coex-driven reject path the patch already detects. Both are workarounds, not fixes for the OP's root cause. ## Proposed depth-of-fix taxonomy 1. **Smarter scan-deferral** — gate scan attempts on observable firmware-state signals (PM state, BT activity beyond just A2DP). Today's 056a71a uses a fixed 10s backoff and a single coex check; the "firmware-internal busy" reject branch isn't observed before issue. 2. **BT-coex tuning** — investigate whether the FDD coex mode can be entered earlier or held longer, opening the off-channel window. Firmware exposes `bt_drv_config_coex_mode` + BTC patch hooks (per `reference_bes2600_firmware_first_pass`). 3. **Per-band scan splitting** — issue separate `WSM_REQ_ID_SCAN_START` requests per band so a 5 GHz-rejected scan doesn't poison 2.4 GHz results. Match how mac80211 issues multi-band scans. All three need a longer firmware-state instrumentation pass on the bes2600 coex/PM behavior before they can be properly scoped. ## Memory cross-refs - `reference_bes2600_5ghz_scan_reject` (today's session, will get tightened with this issue's data) - `reference_bes2600_firmware_first_pass` — coex/BTC symbol leaks for depth-3 fix - `reference_bes2600_firmware_no_psm` — sibling firmware-policy gotcha ## Status Leaving this open — no fix in flight, just documentation drift. Patterns B/C/D from the OP haven't been re-verified today (current ohm isn't in a roam-cycle right now), but the OP's evidence stands. If anyone has a clean reproduction of Pattern B (`wsm_join_confirm ret 1`) with `bes2600` dyndbg enabled they could capture the full WSM exchange — that's the entry point for any of the three fix paths above.
Author
Owner

Phase 5 review artifact — 2026-05-18

Posting the Phase 0–4 work-product verbatim for review. Per feedback_phase5_surface_is_pr, this is the gitea-equivalent of the "second model review" step. Patch is NOT written yet — looking for pressure-test on root-cause analysis and patch shape before Phase 6.

Goal (Phase 1)

On ohm with BESser-c5x stack (kernel 7.0.0-…-besser, srcversion 978FDDE6, no BT controller), reduce besser#1 Pattern A wsm_generic_confirm 0x0007 fail rate from baseline ~14-16/h to ≤1/h over a 4h ambient-roam observation window, without introducing new dmesg WARN/BUG or new VPN-stability regressions; Patterns B/D (downstream of A) are expected to fall to 0 by construction when A is fixed, but track separately to confirm.

Threshold rationale: ≤1/h (not zero) allows the firmware to legitimately reject a scan during real PM transitions we shouldn't fight. 4 h window is conservative — at clean ≤1/h, a 1h window can't distinguish from luck.

Situation (Phase 2)

  • Stack: kernel 7.0.0-danctnix1-5-pinetab2-danctnix-besser, bes2600 srcversion 978FDDE6000F06D5721FB26. Includes patch 056a71a "defer scan and soften WARN on firmware reject" (Markus-authored, cherry-picked by Danct12).
  • BT subsystem: bluetoothd running, bluetoothctl show → "No default controller available". So the BT-A2DP-coex branch of 056a71a's bes2600_scan_should_defer() is dead — coex_is_bt_a2dp() returns false.
  • Wifi: 2.4 GHz ch.11, MCS 7 single-stream HT20, signal -44 dBm, associated to newton BSSID c0:25:06:e6:5b:32.
  • Wired rescue path (enu1 / 192.168.88.80) currently missing — flag for Phase 6 (module reload would lose ssh).

Measurements (Phase 0 + Phase 3)

Baseline (N=3, dmesg-derived, per CLAUDE.md "rig-failure-is-finding" rule):

Rep Window Pattern A count Rate
1 7.6 min 3 23.6 /h
2 22.8 min 6 15.8 /h
3 37.8 min 9 14.3 /h

Converges to ~14-16/h, matches OP's 22-h baseline of 15.7/h. Patterns B/C/D all 0 across all reps — the downstream cascade is not firing today, just standalone scan-rejects.

Phase 3 capture (bes2600 dyndbg +pmf, mac80211/cfg80211 ftrace events, 30-min window):

  • Scans issued: 14
  • Pattern A rejects: 9 → 64% reject rate
  • 056a71a "defer" log fires: 0 (structurally bypassed — see below)
  • Patterns B/D: 0

The decisive evidence — back-to-back scan pairs:

Looking at one rejected pair ([630–632]):

[630.570442] drv_hw_scan: phy0 vif:wlan0(2)         # scan #1 from wpa_supplicant
[630.570507] bes2600_hw_scan: Scan request for 1 SSIDs   # driver entry
[631.991585] bes2600_scan_complete_cb status: 0     # FW returns success
[631.992]    bes2600_hw_scan: Scan request for 1 SSIDs   # scan #2 entry, 177 µs after FW completion
[631.993737] api_scan_completed: phy0 aborted:0     # mac80211 cleanup of #1
[631.993874] drv_hw_scan: phy0 vif:wlan0(2)         # mac80211 sees #2
[632.002753] wsm_generic_confirm failed 0x0007      # scan #2 REJECTED
[632.004909] [SCAN] Scan failed (-22)
[632.005176] api_scan_completed: phy0 aborted:1     # #2 aborted

The one back-to-back pair that both succeeded ([2002–2003]):

[2002.579] drv_hw_scan
[2002.748] api_scan_completed:0
[2002.752] cfg80211_scan_done: aborted:false        # scan #1 fully done
[2002.866] drv_hw_scan                              # scan #2 entry, 114 ms after cfg80211_scan_done
[2003.057] api_scan_completed:0                     # scan #2 also succeeds
[2003.060] cfg80211_scan_done: aborted:false

Threshold is empirically obvious: 6/6 rejected pairs had inter-scan gap <200 µs. The one successful pair had 114 ms gap.

Root cause

Firmware needs ~100 ms between consecutive scans to clean up internal state. Driver doesn't enforce that cooldown. Whenever mac80211 chains two scans (wpa_supplicant background-scan + NetworkManager roam-probe both fire on the ~300 s cadence), the second scan arrives sub-ms after the first's FW completion, before FW is ready. FW returns status 2 ("rejected by policy"); driver maps to -22.

OP's roam-cascade hypothesis (Pattern A → B → D) is incorrect for the normal case. Today's measurement shows Patterns B/D at 0 despite Pattern A at 14/h. The cascade likely happens only when there's additional roam pressure, but the upstream Pattern A is independent of join state.

056a71a's defer logic is structurally bypassed:

  • BT-coex branch dead (no controller).
  • Threshold branch (reject_count >= 3 && time_before(jiffies, backoff_until)) doesn't fire because reject_count resets to 0 on every successful scan (line 767 in danctnix-7.0.6 scan.c). With 64% reject rate, the successes interleave often enough to keep reject_count below threshold.

Upstream-ancestor check (per feedback_mine_upstream_ancestor)

cw1200 has a different shape that would dodge this:

  • Enum has 7 join states (PASSIVE/MONITOR/JOINING/PRE_STA/STA/IBSS/AP); bes2600 has only 4 (PASSIVE/MONITOR/STA/AP) — JOINING/PRE_STA intermediate states were dropped in the bes2600 fork.
  • cw1200_scan_start has an upfront if (priv->join_status == CW1200_JOIN_STATUS_PRE_STA || CW1200_JOIN_STATUS_JOINING) return -EBUSY; guard.

That guard isn't the right fix for tonight's reject branch (we're not mid-join when the rejects fire). But it's the correct precedent for the driver doing its own upfront refusal rather than letting the firmware reject — which is the philosophical shape we want.

Plan (Phase 4)

Patch: bes2600 — gate scan with inter-scan cooldown

Three changes, ~15 lines total:

  1. Add field to scan state struct (in bes2600.h or scan.h depending on where struct scan lives):

    unsigned long last_complete_jiffies;   /* set by bes2600_scan_complete; read by bes2600_hw_scan */
    
  2. In bes2600_scan_complete() (scan.c around line 858), set the timestamp right after ieee80211_scan_completed() fires. (bes2600_scan_complete is the cleanup-everywhere path; setting it here covers both "all bands done" and "scan canceled" cases.)

    hw_priv->scan.last_complete_jiffies = jiffies;
    
  3. In bes2600_hw_scan() (scan.c around line 175, right after the JOIN_STATUS_AP early return), gate with:

    if (hw_priv->scan.last_complete_jiffies &&
        time_before(jiffies, hw_priv->scan.last_complete_jiffies +
                    msecs_to_jiffies(BES2600_SCAN_COOLDOWN_MS)))
        return -EBUSY;
    

    Define BES2600_SCAN_COOLDOWN_MS = 100 near the top of scan.c.

mac80211 handles -EBUSY from drv_hw_scan by aborting the scan and notifying userspace; wpa_supplicant retries on its own cadence. Not a hidden behavior — the contract is documented in include/net/mac80211.h:struct ieee80211_ops.hw_scan.

What this fix does NOT do (deliberately out of scope):

  • Doesn't touch wsm_generic_confirm status preservation — not needed for this reject branch.
  • Doesn't re-introduce JOINING/PRE_STA enum values — Pattern A isn't a join-state issue.
  • Doesn't address 056a71a's threshold-reset bug — separate patch, BT-coex branch is still useful as defense-in-depth for the rare A2DP-coex case.

Predicted delta: Pattern A drops from 14-16/h to ≤1/h. Pattern B/D unchanged (already 0 on this stack today; if they reappear under roam load it's a separate issue).

Why 100 ms: matches the empirically-observed successful pair (114 ms gap). Comfortable margin over the <200 µs failing pairs. Smaller (e.g. 50 ms) might also work but lacks safety margin against firmware variance.

Flavors per feedback_bes2600_dual_tree_flavors

Two patch flavors required:

  • danctnix-intree flavor: against linux-pinetab2 danctnix-7.0.6 branch (uses timer_container_of style timers — though this patch doesn't touch timers, the surrounding code uses newer APIs).
  • Mobian-DKMS flavor: against bes2600-dkms-mobian bes2600/bh-c-fossil-cleanup branch (older kernel base).

Both should land before Phase 7 (deploy danctnix flavor to ohm for measurement; Mobian flavor is for downstream parity).

Operational concerns for Phase 6

  • enu1 wired rescue is missing — recovering it (or accepting the risk) before module reload, per feedback_user_pushes_reboot_button.
  • Deploy to canonical ~/src/besser/marfrit-besser/danctnix-besser-pkgbuild/ (per reference_danctnix_besser_pkgbuild_canonical), not the orphan checkout.

Asks for the reviewer

  1. Is the 100 ms cooldown the right value? Could be lower if firmware-side cleanup is faster than 114 ms suggests; could be higher if there's variance under load.
  2. Is bes2600_scan_complete the right place to set last_complete_jiffies? Or should it be in bes2600_scan_work cleanup branch / bes2600_scan_complete_cb? The choice matters for "canceled scan" path.
  3. Did I miss an alternative root cause for the back-to-back scan pattern? e.g. is mac80211 supposed to serialize and bes2600 is at fault for accepting the second scan before signaling completion? Could be a BUG_ON race I'm not seeing.
  4. Is -EBUSY the right return code from bes2600_hw_scan? Check mac80211 ops contract for what return codes are tolerated. (Pretty sure -EBUSY is fine but worth confirming.)
  5. Should I bundle the 056a71a threshold-reset fix into this patch or keep separate? Markus's preference per feedback_dont_patch_downstream_artifacts is "keep separate".
## Phase 5 review artifact — 2026-05-18 Posting the Phase 0–4 work-product verbatim for review. Per `feedback_phase5_surface_is_pr`, this is the gitea-equivalent of the "second model review" step. **Patch is NOT written yet** — looking for pressure-test on root-cause analysis and patch shape before Phase 6. ### Goal (Phase 1) > On ohm with BESser-c5x stack (kernel `7.0.0-…-besser`, srcversion `978FDDE6`, no BT controller), reduce besser#1 Pattern A `wsm_generic_confirm 0x0007 fail` rate from baseline ~14-16/h to ≤1/h over a 4h ambient-roam observation window, without introducing new dmesg WARN/BUG or new VPN-stability regressions; Patterns B/D (downstream of A) are expected to fall to 0 by construction when A is fixed, but track separately to confirm. Threshold rationale: ≤1/h (not zero) allows the firmware to legitimately reject a scan during real PM transitions we shouldn't fight. 4 h window is conservative — at clean ≤1/h, a 1h window can't distinguish from luck. ### Situation (Phase 2) - Stack: kernel `7.0.0-danctnix1-5-pinetab2-danctnix-besser`, bes2600 srcversion `978FDDE6000F06D5721FB26`. Includes patch `056a71a` "defer scan and soften WARN on firmware reject" (Markus-authored, cherry-picked by Danct12). - BT subsystem: `bluetoothd` running, `bluetoothctl show` → "No default controller available". So the BT-A2DP-coex branch of 056a71a's `bes2600_scan_should_defer()` is dead — `coex_is_bt_a2dp()` returns false. - Wifi: 2.4 GHz ch.11, MCS 7 single-stream HT20, signal -44 dBm, associated to `newton` BSSID `c0:25:06:e6:5b:32`. - Wired rescue path (enu1 / 192.168.88.80) currently missing — flag for Phase 6 (module reload would lose ssh). ### Measurements (Phase 0 + Phase 3) **Baseline (N=3, dmesg-derived, per CLAUDE.md "rig-failure-is-finding" rule):** | Rep | Window | Pattern A count | Rate | |---|---|---|---| | 1 | 7.6 min | 3 | 23.6 /h | | 2 | 22.8 min | 6 | 15.8 /h | | 3 | 37.8 min | 9 | 14.3 /h | Converges to **~14-16/h**, matches OP's 22-h baseline of 15.7/h. Patterns B/C/D all 0 across all reps — *the downstream cascade is not firing today*, just standalone scan-rejects. **Phase 3 capture (bes2600 dyndbg `+pmf`, mac80211/cfg80211 ftrace events, 30-min window):** - Scans issued: 14 - Pattern A rejects: 9 → **64% reject rate** - 056a71a "defer" log fires: **0** (structurally bypassed — see below) - Patterns B/D: 0 **The decisive evidence — back-to-back scan pairs:** Looking at one rejected pair (`[630–632]`): ``` [630.570442] drv_hw_scan: phy0 vif:wlan0(2) # scan #1 from wpa_supplicant [630.570507] bes2600_hw_scan: Scan request for 1 SSIDs # driver entry [631.991585] bes2600_scan_complete_cb status: 0 # FW returns success [631.992] bes2600_hw_scan: Scan request for 1 SSIDs # scan #2 entry, 177 µs after FW completion [631.993737] api_scan_completed: phy0 aborted:0 # mac80211 cleanup of #1 [631.993874] drv_hw_scan: phy0 vif:wlan0(2) # mac80211 sees #2 [632.002753] wsm_generic_confirm failed 0x0007 # scan #2 REJECTED [632.004909] [SCAN] Scan failed (-22) [632.005176] api_scan_completed: phy0 aborted:1 # #2 aborted ``` The one back-to-back pair that **both succeeded** (`[2002–2003]`): ``` [2002.579] drv_hw_scan [2002.748] api_scan_completed:0 [2002.752] cfg80211_scan_done: aborted:false # scan #1 fully done [2002.866] drv_hw_scan # scan #2 entry, 114 ms after cfg80211_scan_done [2003.057] api_scan_completed:0 # scan #2 also succeeds [2003.060] cfg80211_scan_done: aborted:false ``` **Threshold is empirically obvious**: 6/6 rejected pairs had inter-scan gap <200 µs. The one successful pair had 114 ms gap. ### Root cause **Firmware needs ~100 ms between consecutive scans to clean up internal state.** Driver doesn't enforce that cooldown. Whenever mac80211 chains two scans (wpa_supplicant background-scan + NetworkManager roam-probe both fire on the ~300 s cadence), the second scan arrives sub-ms after the first's FW completion, before FW is ready. FW returns status 2 ("rejected by policy"); driver maps to `-22`. **OP's roam-cascade hypothesis (Pattern A → B → D) is incorrect for the normal case.** Today's measurement shows Patterns B/D at 0 despite Pattern A at 14/h. The cascade likely happens only when there's *additional* roam pressure, but the upstream Pattern A is independent of join state. **056a71a's defer logic is structurally bypassed:** - BT-coex branch dead (no controller). - Threshold branch (`reject_count >= 3 && time_before(jiffies, backoff_until)`) doesn't fire because `reject_count` resets to 0 on every successful scan (line 767 in danctnix-7.0.6 scan.c). With 64% reject rate, the successes interleave often enough to keep `reject_count` below threshold. ### Upstream-ancestor check (per `feedback_mine_upstream_ancestor`) cw1200 has a different shape that *would* dodge this: - Enum has 7 join states (`PASSIVE/MONITOR/JOINING/PRE_STA/STA/IBSS/AP`); bes2600 has only 4 (`PASSIVE/MONITOR/STA/AP`) — JOINING/PRE_STA intermediate states were dropped in the bes2600 fork. - `cw1200_scan_start` has an upfront `if (priv->join_status == CW1200_JOIN_STATUS_PRE_STA || CW1200_JOIN_STATUS_JOINING) return -EBUSY;` guard. That guard *isn't* the right fix for tonight's reject branch (we're not mid-join when the rejects fire). But it's the correct precedent for **the driver doing its own upfront refusal rather than letting the firmware reject** — which is the philosophical shape we want. ### Plan (Phase 4) **Patch: bes2600 — gate scan with inter-scan cooldown** Three changes, ~15 lines total: 1. Add field to scan state struct (in `bes2600.h` or `scan.h` depending on where `struct scan` lives): ```c unsigned long last_complete_jiffies; /* set by bes2600_scan_complete; read by bes2600_hw_scan */ ``` 2. In `bes2600_scan_complete()` (scan.c around line 858), set the timestamp right after `ieee80211_scan_completed()` fires. (`bes2600_scan_complete` is the cleanup-everywhere path; setting it here covers both "all bands done" and "scan canceled" cases.) ```c hw_priv->scan.last_complete_jiffies = jiffies; ``` 3. In `bes2600_hw_scan()` (scan.c around line 175, right after the `JOIN_STATUS_AP` early return), gate with: ```c if (hw_priv->scan.last_complete_jiffies && time_before(jiffies, hw_priv->scan.last_complete_jiffies + msecs_to_jiffies(BES2600_SCAN_COOLDOWN_MS))) return -EBUSY; ``` Define `BES2600_SCAN_COOLDOWN_MS = 100` near the top of scan.c. mac80211 handles `-EBUSY` from `drv_hw_scan` by aborting the scan and notifying userspace; wpa_supplicant retries on its own cadence. Not a hidden behavior — the contract is documented in `include/net/mac80211.h:struct ieee80211_ops.hw_scan`. **What this fix does NOT do** (deliberately out of scope): - Doesn't touch `wsm_generic_confirm` status preservation — not needed for this reject branch. - Doesn't re-introduce JOINING/PRE_STA enum values — Pattern A isn't a join-state issue. - Doesn't address `056a71a`'s threshold-reset bug — separate patch, BT-coex branch is still useful as defense-in-depth for the rare A2DP-coex case. **Predicted delta**: Pattern A drops from 14-16/h to ≤1/h. Pattern B/D unchanged (already 0 on this stack today; if they reappear under roam load it's a separate issue). **Why 100 ms**: matches the empirically-observed successful pair (114 ms gap). Comfortable margin over the <200 µs failing pairs. Smaller (e.g. 50 ms) might also work but lacks safety margin against firmware variance. ### Flavors per `feedback_bes2600_dual_tree_flavors` Two patch flavors required: - **danctnix-intree** flavor: against `linux-pinetab2 danctnix-7.0.6` branch (uses `timer_container_of` style timers — though this patch doesn't touch timers, the surrounding code uses newer APIs). - **Mobian-DKMS** flavor: against `bes2600-dkms-mobian bes2600/bh-c-fossil-cleanup` branch (older kernel base). Both should land before Phase 7 (deploy danctnix flavor to ohm for measurement; Mobian flavor is for downstream parity). ### Operational concerns for Phase 6 - `enu1` wired rescue is missing — recovering it (or accepting the risk) before module reload, per `feedback_user_pushes_reboot_button`. - Deploy to canonical `~/src/besser/marfrit-besser/danctnix-besser-pkgbuild/` (per `reference_danctnix_besser_pkgbuild_canonical`), not the orphan checkout. ### Asks for the reviewer 1. **Is the 100 ms cooldown the right value?** Could be lower if firmware-side cleanup is faster than 114 ms suggests; could be higher if there's variance under load. 2. **Is `bes2600_scan_complete` the right place to set `last_complete_jiffies`?** Or should it be in `bes2600_scan_work` cleanup branch / `bes2600_scan_complete_cb`? The choice matters for "canceled scan" path. 3. **Did I miss an alternative root cause** for the back-to-back scan pattern? e.g. is mac80211 supposed to serialize and bes2600 is at fault for *accepting* the second scan before signaling completion? Could be a `BUG_ON` race I'm not seeing. 4. **Is `-EBUSY` the right return code** from `bes2600_hw_scan`? Check mac80211 ops contract for what return codes are tolerated. (Pretty sure -EBUSY is fine but worth confirming.) 5. **Should I bundle the 056a71a threshold-reset fix** into this patch or keep separate? Markus's preference per `feedback_dont_patch_downstream_artifacts` is "keep separate".
Author
Owner

Phase 5 amendment — reviewer was right, root cause confirmed

Going through the reviewer's load-bearing claims one by one:

Claim Evidence Verdict
bes2600 doesn't set SINGLE_SCAN_ON_ALL_BANDS drivers/staging/bes2600/main.c:386-394 — 9 ieee80211_hw_set calls, this flag is not among them confirmed
bes2600 registers both 2.4 + 5 GHz bands main.c:439-440hw->wiphy->bands[NL80211_BAND_2GHZ] and [NL80211_BAND_5GHZ] both assigned confirmed
mac80211 issues per-band drv_hw_scan loop when SINGLE_SCAN unset net/mac80211/scan.c:395-422 (ieee80211_prep_hw_scan) iterates local->hw_scan_band++ per band, returns true while channels remain; scan.c:474-482 re-invokes drv_hw_scan from __ieee80211_scan_completed while more bands queue confirmed
Firmware refuses 5 GHz scans Pre-existing memory reference_bes2600_5ghz_scan_reject documented this from earlier work (2026-05-17) pre-known
The trace's kworker/u16:* driving the second drv_hw_scan is mac80211's scan worker Source: __ieee80211_scan_completed runs in wiphy_work context, calls prep_hw_scan + drv_hw_scan inline consistent

The mechanism is now:

  1. wpa_supplicant requests a scan covering both bands (default behavior when associated to a multi-band SSID for roam discovery).
  2. mac80211 splits per-band because SINGLE_SCAN_ON_ALL_BANDS is unset.
  3. mac80211 calls drv_hw_scan for band 0 (2.4 GHz, ~13 channels).
  4. bes2600 issues wsm_scan; firmware accepts.
  5. On bes2600_scan_complete_cb, mac80211's __ieee80211_scan_completed calls prep_hw_scan for band 1 (5 GHz, ~37 channels) and immediately re-invokes drv_hw_scan.
  6. bes2600 issues wsm_scan for 5 GHz; firmware rejects with status 2 (per existing reference_bes2600_5ghz_scan_reject).
  7. wsm_generic_confirm collapses to -22; mac80211 sees an aborted scan; userspace gets a half-complete scan result.

Cooldown patch is dead. The 114 ms "successful" pair at [2002–2003] was almost certainly a single-band-only scan request from wpa_supplicant — prep_hw_scan returned false after the first band (no more channels to scan), so no chained drv_hw_scan ever fired. The microsecond timing has nothing to do with firmware cleanup; it's just how fast mac80211 executes the next iteration in the same wiphy_work context.

Revised Phase 4 — band filter, not cooldown

Reviewer pre-approved this if hypothesis held: "filter 5 GHz channels in bes2600_hw_scan before passing to wsm_scan".

Patch shape:

In bes2600_hw_scan() (scan.c around line 175, before down(&hw_priv->scan.lock)):

/*
 * Firmware refuses WSM start-scan for 5 GHz with status 2
 * ("rejected by policy"), causing a -EINVAL cascade up to mac80211.
 * Until we understand which firmware state we'd need to negotiate
 * to make 5 GHz scans actually work, refuse them at the driver
 * boundary so userspace gets a clean refusal instead of a half-
 * aborted multi-band scan. 2.4 GHz scans are unaffected; direct
 * BSSID association to 5 GHz APs is also unaffected (no scan
 * needed for that path). See besser#1.
 */
if (req->n_channels == 1 && req->channels[0]->band == NL80211_BAND_5GHZ)
    return -EINVAL;     /* every channel is 5 GHz — refuse upfront */

The check req->n_channels == 1 && channels[0]->band == NL80211_BAND_5GHZ catches mac80211's per-band split when it iterates to the 5 GHz band — at that point, every channel in req->channels[] is 5 GHz, and we refuse before issuing wsm_scan. The 2.4 GHz iteration is untouched.

Why not just de-register the 5 GHz band entirely?

  • We lose the truthful iw phy chip-capability advertisement
  • Direct BSSID association to a known 5 GHz BSSID would also break (currently works per OP's 2026-05-03 measurement)
  • A future firmware update / coex state change might unblock 5 GHz scans — band registration shouldn't bake in the current limitation

Why -EINVAL and not -EOPNOTSUPP or -EBUSY?

  • Need to verify against mac80211 contract. include/net/mac80211.h:struct ieee80211_ops.hw_scan documentation: "Returns 0 on success. Returns 1 if the driver needs to fall back to software scan. Returns negative on error." So any negative is treated as error → aborted scan. Per reviewer's note, mac80211 finalizes the scan as aborted regardless of which negative is returned.
  • Userspace experience: scan still completes (the 2.4 GHz portion already succeeded), but the 5 GHz portion is reported aborted. Same as current behavior, minus the dmesg storm.

Predicted delta: Pattern A drops from 14-16/h to 0/h. Pattern B/D unchanged (already 0 in this measurement window; tied to real roam activity, separate concern).

Thread safety (per reviewer's hole-poke on the cooldown patch): the new approach has zero shared state. Just an early-return check on the request struct that mac80211 owns. No race surface.

No 056a71a interaction: my patch returns before the existing defer logic even runs. 056a71a's coex check stays in place as defense-in-depth for the rare BT-A2DP-coex case (still a different reject branch). 056a71a's threshold-reset bug is unchanged — separate follow-up issue worth filing, but not blocking.

Updated answers to the 5 asks

  1. 100 ms cooldown value — moot, no cooldown.
  2. Right place for last_complete_jiffies — moot, no timestamp.
  3. Alternative root cause — confirmed correct: 5 GHz per-band leg.
  4. -EINVAL return — fine per mac80211 contract; any negative produces same userspace effect (aborted scan). Optimizing the return code further has no observable benefit.
  5. Bundle 056a71a fix — no. Separate issue, separate patch. This patch is purely a band filter.

Phase 6 readiness checklist

  • Restore wired rescue path (enu1) before module reload, or pre-stage a recovery script — per feedback_user_pushes_reboot_button
  • Patch lands in canonical ~/src/besser/marfrit-besser/ per reference_danctnix_besser_pkgbuild_canonical
  • Both flavors per feedback_bes2600_dual_tree_flavors (danctnix-intree primary; Mobian-DKMS for parity)
  • Verify mac80211 hw_scan return-code contract by reading include/net/mac80211.h for the .hw_scan op
  • Phase 7 must include a "5 GHz reachability didn't get worse" check, not just a "Pattern A count went down" check
## Phase 5 amendment — reviewer was right, root cause confirmed Going through the reviewer's load-bearing claims one by one: | Claim | Evidence | Verdict | |---|---|---| | bes2600 doesn't set `SINGLE_SCAN_ON_ALL_BANDS` | `drivers/staging/bes2600/main.c:386-394` — 9 `ieee80211_hw_set` calls, this flag is not among them | ✅ confirmed | | bes2600 registers both 2.4 + 5 GHz bands | `main.c:439-440` — `hw->wiphy->bands[NL80211_BAND_2GHZ]` and `[NL80211_BAND_5GHZ]` both assigned | ✅ confirmed | | mac80211 issues per-band `drv_hw_scan` loop when SINGLE_SCAN unset | `net/mac80211/scan.c:395-422` (`ieee80211_prep_hw_scan`) iterates `local->hw_scan_band++` per band, returns true while channels remain; `scan.c:474-482` re-invokes `drv_hw_scan` from `__ieee80211_scan_completed` while more bands queue | ✅ confirmed | | Firmware refuses 5 GHz scans | Pre-existing memory `reference_bes2600_5ghz_scan_reject` documented this from earlier work (2026-05-17) | ✅ pre-known | | The trace's `kworker/u16:*` driving the second `drv_hw_scan` is mac80211's scan worker | Source: `__ieee80211_scan_completed` runs in wiphy_work context, calls `prep_hw_scan` + `drv_hw_scan` inline | ✅ consistent | **The mechanism is now**: 1. wpa_supplicant requests a scan covering both bands (default behavior when associated to a multi-band SSID for roam discovery). 2. mac80211 splits per-band because `SINGLE_SCAN_ON_ALL_BANDS` is unset. 3. mac80211 calls `drv_hw_scan` for band 0 (2.4 GHz, ~13 channels). 4. bes2600 issues `wsm_scan`; firmware accepts. 5. On `bes2600_scan_complete_cb`, mac80211's `__ieee80211_scan_completed` calls `prep_hw_scan` for band 1 (5 GHz, ~37 channels) and immediately re-invokes `drv_hw_scan`. 6. bes2600 issues `wsm_scan` for 5 GHz; **firmware rejects with status 2** (per existing `reference_bes2600_5ghz_scan_reject`). 7. `wsm_generic_confirm` collapses to `-22`; mac80211 sees an aborted scan; userspace gets a half-complete scan result. **Cooldown patch is dead.** The 114 ms "successful" pair at `[2002–2003]` was almost certainly a single-band-only scan request from wpa_supplicant — `prep_hw_scan` returned false after the first band (no more channels to scan), so no chained `drv_hw_scan` ever fired. The microsecond timing has nothing to do with firmware cleanup; it's just how fast mac80211 executes the next iteration in the same wiphy_work context. ## Revised Phase 4 — band filter, not cooldown Reviewer pre-approved this if hypothesis held: *"filter 5 GHz channels in bes2600_hw_scan before passing to wsm_scan"*. **Patch shape:** In `bes2600_hw_scan()` (`scan.c` around line 175, before `down(&hw_priv->scan.lock)`): ```c /* * Firmware refuses WSM start-scan for 5 GHz with status 2 * ("rejected by policy"), causing a -EINVAL cascade up to mac80211. * Until we understand which firmware state we'd need to negotiate * to make 5 GHz scans actually work, refuse them at the driver * boundary so userspace gets a clean refusal instead of a half- * aborted multi-band scan. 2.4 GHz scans are unaffected; direct * BSSID association to 5 GHz APs is also unaffected (no scan * needed for that path). See besser#1. */ if (req->n_channels == 1 && req->channels[0]->band == NL80211_BAND_5GHZ) return -EINVAL; /* every channel is 5 GHz — refuse upfront */ ``` The check `req->n_channels == 1 && channels[0]->band == NL80211_BAND_5GHZ` catches mac80211's per-band split when it iterates to the 5 GHz band — at that point, every channel in `req->channels[]` is 5 GHz, and we refuse before issuing wsm_scan. The 2.4 GHz iteration is untouched. **Why not just de-register the 5 GHz band entirely?** - We lose the truthful `iw phy` chip-capability advertisement - Direct BSSID association to a known 5 GHz BSSID would also break (currently works per OP's 2026-05-03 measurement) - A future firmware update / coex state change might unblock 5 GHz scans — band registration shouldn't bake in the current limitation **Why `-EINVAL` and not `-EOPNOTSUPP` or `-EBUSY`?** - Need to verify against mac80211 contract. `include/net/mac80211.h:struct ieee80211_ops.hw_scan` documentation: "Returns 0 on success. Returns 1 if the driver needs to fall back to software scan. Returns negative on error." So any negative is treated as error → aborted scan. Per reviewer's note, mac80211 finalizes the scan as aborted regardless of which negative is returned. - Userspace experience: scan still completes (the 2.4 GHz portion already succeeded), but the 5 GHz portion is reported aborted. Same as current behavior, minus the dmesg storm. **Predicted delta**: Pattern A drops from 14-16/h to **0/h**. Pattern B/D unchanged (already 0 in this measurement window; tied to real roam activity, separate concern). **Thread safety** (per reviewer's hole-poke on the cooldown patch): the new approach has zero shared state. Just an early-return check on the request struct that mac80211 owns. No race surface. **No `056a71a` interaction**: my patch returns before the existing defer logic even runs. 056a71a's coex check stays in place as defense-in-depth for the rare BT-A2DP-coex case (still a different reject branch). 056a71a's threshold-reset bug is unchanged — separate follow-up issue worth filing, but not blocking. ## Updated answers to the 5 asks 1. ~~100 ms cooldown value~~ — moot, no cooldown. 2. ~~Right place for `last_complete_jiffies`~~ — moot, no timestamp. 3. **Alternative root cause** — confirmed correct: 5 GHz per-band leg. 4. **`-EINVAL` return** — fine per mac80211 contract; any negative produces same userspace effect (aborted scan). Optimizing the return code further has no observable benefit. 5. **Bundle 056a71a fix** — no. Separate issue, separate patch. This patch is purely a band filter. ## Phase 6 readiness checklist - [ ] Restore wired rescue path (enu1) before module reload, or pre-stage a recovery script — per `feedback_user_pushes_reboot_button` - [ ] Patch lands in canonical `~/src/besser/marfrit-besser/` per `reference_danctnix_besser_pkgbuild_canonical` - [ ] Both flavors per `feedback_bes2600_dual_tree_flavors` (danctnix-intree primary; Mobian-DKMS for parity) - [ ] Verify mac80211 hw_scan return-code contract by reading `include/net/mac80211.h` for the .hw_scan op - [ ] Phase 7 must include a "5 GHz reachability didn't get worse" check, not just a "Pattern A count went down" check
Author
Owner

Phase 7 verification — PASS, closing this issue

Patch deployed on ohm 2026-05-18 as linux-pinetab2-danctnix-besser-7.0.danctnix1-2. bes2600 module srcversion changed 978FDDE6000F06D5721FB262B29904248C3CB6820A4218.

30-min observation window

Pattern Pre-patch baseline Post-patch (1827 s)
A (wsm_generic_confirm failed for request 0x0007) 14.3 / h 0 / h
B (wsm_join_confirm ret 1) 0 (today) 0
C (Receive failure:) 0 (today) 0
D (PREV_AUTH_NOT_VALID) 0 (today) 0
SCAN failed : completed 9 : 5 (64 % fail) 0 : 12 (0 % fail)
WARN / BUG / Oops none none

Verdict: PASS. Phase 4 prediction was Pattern A → ≤ 1/h; achieved 0/h.

Behavioral confirmation (earlier ad-hoc tests)

  • iw dev wlan0 scan freq 5180command failed: Operation not supported (-95) — the driver refuses the 5 GHz iteration at the boundary, no firmware round-trip, no log noise. Exactly the EOPNOTSUPP the patch returns.
  • iw dev wlan0 scan freq 2462 → 5 BSSIDs returned for newton SSID. 2.4 GHz scan works normally.
  • iw dev wlan0 scan (multi-band) → still aborted, as Phase 5 reviewer correctly predicted. mac80211 marks the whole scan aborted when any per-band leg returns negative, so the 2.4 GHz BSSes it has already collected are discarded. Same userspace outcome as before the patch — but without the dmesg storm. Single-band scans are the path forward for userspace tools that want 2.4 GHz BSS discovery via iw.

Known limitation surfaced during Phase 6

Build environment on boltzmann (GCC 15.2.1 + kernel 7.0 + CONFIG_SHADOW_CALL_STACK=y) triggers an unrelated arm_neon.h pragma error in arch/arm64/lib/xor-neon.c. Worked around for this build by setting CONFIG_SHADOW_CALL_STACK=n in the config — security hardening regression, not a runtime bug in this patch. The pkgrel=2 kernel on ohm has SCS off as a result. Tracked separately for a future build-env fix (downgrade GCC, or arm_neon.h patch). The 5 GHz scan filter itself is SCS-agnostic.

Patches in flight

  • Source patch (Mobian-DKMS source-of-truth): commit 093a503 on branch bes2600/scan-filter-5ghz of marfrit/bes2600-dkms (local on boltzmann). Authored as Markus per reference_git_persona_claude_noether.
  • PKGBUILD-side (danctnix flavor): commit ae175f9 on branch claude-noether-14 of marfrit/besser (local). Adds 0002-bes2600-filter-5ghz-scan.patch, bumps pkgrel to 2. Plus build-env workaround patch 0003-arm64-xor-neon-ffixed-x18-build-fix.patch and config SCS-off (uncommitted).
  • Both flavors will need a Mobian-DKMS-side test deploy before considering the patch fully merged, per feedback_bes2600_dual_tree_flavors.

Closing

Closing besser#1 as resolved by this patch. Patterns B/D never re-fired during baseline measurement (today's environment) but are explicitly downstream of A per the OP's analysis, so if they reappear under heavier roam stress that's a fresh symptom worth a new issue rather than re-opening this one.

Memory updates landed:

  • reference_bes2600_5ghz_scan_reject — softened "intermittent" framing to "filtered at driver boundary as of 2026-05-18"; behavior + workaround code path documented
  • project_bes2600_c5x_deployed — new srcversion, pkgrel=2 with SCS-off caveat
## Phase 7 verification — PASS, closing this issue Patch deployed on ohm 2026-05-18 as `linux-pinetab2-danctnix-besser-7.0.danctnix1-2`. bes2600 module srcversion changed `978FDDE6000F06D5721FB26` → `2B29904248C3CB6820A4218`. ### 30-min observation window | Pattern | Pre-patch baseline | Post-patch (1827 s) | |---|---|---| | A (`wsm_generic_confirm failed for request 0x0007`) | 14.3 / h | **0 / h** | | B (`wsm_join_confirm ret 1`) | 0 (today) | 0 | | C (`Receive failure:`) | 0 (today) | 0 | | D (`PREV_AUTH_NOT_VALID`) | 0 (today) | 0 | | SCAN failed : completed | 9 : 5 (64 % fail) | **0 : 12 (0 % fail)** | | WARN / BUG / Oops | none | none | **Verdict: PASS.** Phase 4 prediction was Pattern A → ≤ 1/h; achieved **0/h**. ### Behavioral confirmation (earlier ad-hoc tests) - `iw dev wlan0 scan freq 5180` → `command failed: Operation not supported (-95)` — the driver refuses the 5 GHz iteration at the boundary, no firmware round-trip, no log noise. Exactly the EOPNOTSUPP the patch returns. - `iw dev wlan0 scan freq 2462` → 5 BSSIDs returned for `newton` SSID. 2.4 GHz scan works normally. - `iw dev wlan0 scan` (multi-band) → still aborted, as Phase 5 reviewer correctly predicted. mac80211 marks the whole scan aborted when any per-band leg returns negative, so the 2.4 GHz BSSes it has already collected are discarded. **Same userspace outcome as before the patch** — but without the dmesg storm. Single-band scans are the path forward for userspace tools that want 2.4 GHz BSS discovery via `iw`. ### Known limitation surfaced during Phase 6 Build environment on boltzmann (GCC 15.2.1 + kernel 7.0 + `CONFIG_SHADOW_CALL_STACK=y`) triggers an unrelated `arm_neon.h` pragma error in `arch/arm64/lib/xor-neon.c`. Worked around for this build by setting `CONFIG_SHADOW_CALL_STACK=n` in the config — security hardening regression, **not a runtime bug in this patch**. The pkgrel=2 kernel on ohm has SCS off as a result. Tracked separately for a future build-env fix (downgrade GCC, or arm_neon.h patch). The 5 GHz scan filter itself is SCS-agnostic. ### Patches in flight - **Source patch** (Mobian-DKMS source-of-truth): commit `093a503` on branch `bes2600/scan-filter-5ghz` of `marfrit/bes2600-dkms` (local on boltzmann). Authored as Markus per `reference_git_persona_claude_noether`. - **PKGBUILD-side** (danctnix flavor): commit `ae175f9` on branch `claude-noether-14` of `marfrit/besser` (local). Adds `0002-bes2600-filter-5ghz-scan.patch`, bumps pkgrel to 2. Plus build-env workaround patch `0003-arm64-xor-neon-ffixed-x18-build-fix.patch` and config SCS-off (uncommitted). - Both flavors will need a Mobian-DKMS-side test deploy before considering the patch fully merged, per `feedback_bes2600_dual_tree_flavors`. ### Closing Closing besser#1 as resolved by this patch. Patterns B/D never re-fired during baseline measurement (today's environment) but are explicitly downstream of A per the OP's analysis, so if they reappear under heavier roam stress that's a fresh symptom worth a new issue rather than re-opening this one. Memory updates landed: - `reference_bes2600_5ghz_scan_reject` — softened "intermittent" framing to "filtered at driver boundary as of 2026-05-18"; behavior + workaround code path documented - `project_bes2600_c5x_deployed` — new srcversion, pkgrel=2 with SCS-off caveat
Sign in to join this conversation.
No Label
1 Participants
Notifications
Due Date
No due date set.
Dependencies

No dependencies set.

Reference: marfrit/besser#1