Recurring wsm_generic_confirm 0x0007 (SCAN_REQ) failures + forced auth churn → VPN drops #1
Reference in New Issue
Block a user
Delete Branch "%!s()"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Symptom
WiFi on
ohm(PineTab2, BES2600) goes through periodic disruption that's invisible to NetworkManager (still "connected") but kills longer-lived TCP sessions. Today this knocked the OpenVPN tunnel toshannonoffline (lmcpohm-toolsMCP went silent in a Claude session); LAN reachability viaohm.fritz.boxwas unaffected, so the radio link itself recovers but mid-flight sessions die.System
6.19.10-danctnix1-1-pinetab2 #1 SMP PREEMPT_DYNAMIC Sat, 28 Mar 2026 02:45:08 +0000 aarch64(DanctNIX)bes2600.ko,srcversion 461AFB369355AE598D79BDF,/lib/modules/6.19.10-danctnix1-1-pinetab2/extra/bes2600.konewton, APc0:25:06:e6:5b:33, 5 GHz ch.40, signal -43 dBm, MCS 7 / 150 Mb/s — i.e. a good link, not a marginal-RF problem.c0:25:06:e6:5b:32,c0:25:06:e6:5b:33,cc:ce:1e:2b:74:17/c0:25:06:e6:61:b0).Repeating signatures (counts over current dmesg buffer, ~22 h window)
wsm_generic_confirm failed for request 0x0007(SCAN_REQ rejected by FW)wsm_join_confirm ret 1(FW rejects driver JOIN)Reason: 2=PREV_AUTH_NOT_VALID(driver-initiated forced roam)First occurrence at
[29840.916960]and still firing at[109373.800451]— pattern is steady-state, not a one-off boot glitch.Pattern A — background scan rejected by firmware
Every ~300 s, mac80211 issues a roaming background scan that the BES2600 firmware refuses with
-EINVAL:0x0007 is
WSM_REQ_ID_SCAN_START(perbes2600/wsm.h). The FW returns a non-zero confirm status on what should be a routine scheduled scan while the STA is associated. mac80211's view: scan failed; whatever was driving the scan (background roam scoring, scheduled scan, ROAM trigger) gets no answer.Pattern B — JOIN confirm error → auth timeout
When mac80211 then tries to switch BSSID (or just re-auth), the firmware rejects the JOIN:
wsm_join_confirm ret 1⇒WSM_STATUS_FAILUREfrom FW onWSM_REQ_ID_JOIN. Driver retries 3× per AP, eventually mac80211 picks another BSSID candidate and retries; usually one of the three BSSIDs eventually accepts.Pattern C — RX failures bursts
Correlated with the above:
Clumps of 2-5 lines, in the same windows as the JOIN failures.
Pattern D — PREV_AUTH_NOT_VALID forced roam
The driver/FW evicts the current AP context as if PMKSA went stale, and mac80211 has to redo the auth/assoc dance. Happens ~1.5× per hour averaged across the buffer, sometimes back-to-back.
Hypothesis
Likely root cause is in the WSM
SCAN_REQhandling path while the STA is associated:WSM_REQ_ID_SCAN_START, FW returns-EINVAL.WSM_REQ_ID_JOIN(for a roam target or refresh) hits a FW that still thinks it's mid-scan or otherwise busy →wsm_join_confirm ret 1.PREV_AUTH_NOT_VALIDchurn.Worth checking:
bes2600/scan.cagainst current mac80211 callbacks.WSM_REQ_ID_SCAN_STARTis correctly setting the scan params (channel list, dwell, SSID list,numProbeRequests) —-EINVALfrom FW is the most common "params look wrong to me" answer.Sample dmesg slice (last storm)
Impact
ohm-toolsMCP went silent).ohmis unreliable.Next steps (proposed)
wsm.h(sanity-check against this report).iw event -ttrace alongside dmesg during a failure window for cross-correlation with mac80211 state.bes2600dyndbg + WSM tracing and capture one fullSCAN_REQ → confirmcycle to pin down the FW arg that's invalid.besserpatches not yet on this build.2026-05-17 update — same pattern still firing, with new band/perf context
Re-verified on ohm today, BESser-c5x stack (
kernel 7.0.0-pinetab2-danctnix-besser, srcversion978FDDE6…). The WSM 0x0007 reject pattern from the OP is unchanged — though now showing as a slightly different operational symptom because ohm has since drifted off the 5 GHz attachment.(Filing this as a comment here; my earlier #19 was a duplicate of this issue, closing that one with a pointer back.)
Band attachment shifted since May
c0:25:06:e6:5b:33, signal -43 dBm, MCS 7 / 150 Mb/s.c0:25:06:e6:5b:32, signal -44 dBm, MCS 7 / 72.2 Mb/s HT20 short-GI. Same SSIDnewton, combined-band setup. Moved next to AP — still 2.4 GHz attachment,nmcli connection down/upre-associates to the same 2.4 GHz BSSID.Combined-SSID setup means client-driven band selection is at the mercy of whatever the scan can see, which is part of how Pattern A manifests as a slow drift toward 2.4 GHz over time.
Chip vs driver vs firmware — pinned
iw phyon ohm enumerates Band 2 with 37 channels spanning 5170-5875 MHz, HT20/HT40 caps0x87e, MCS 0-7 single-stream. No VHT. Regdomain DE/DFS-ETSI permits all UNII-1/2/2e/3 bands at 20-26 dBm. So the chip + driver + regdomain are all 5 GHz-capable; the OP's Pattern A is the limiting layer.Scan-reject in current state (today)
iw dev wlan0 scanreturns zero BSSIDs, not even the currently-associatednewton. So the failed scan currently drops the entire result set, including the 2.4 GHz portion that probably succeeded. (mac80211 still has the assoc state, hence the link stays up, but userspaceiw scansees nothing.)Patch status from OP's hypothesis
Patch
056a71a"bes2600: defer scan and soften WARN on firmware reject" landed indanctnix-7.0.6(Markus-authored, cherry-picked by Danct12). It implements:bes2600_scan_should_defer()short-circuits ifcoex_is_bt_a2dp()and coex isn't FDD.BES2600_SCAN_REJECT_THRESHOLD=3rejections in a row triggers a 10s backoff window.bes_devel-level prints.What it does NOT do: persuade the firmware to actually accept the scan in the "firmware-internal busy state" branch the commit message acknowledges. So Patterns A → B → D in this issue still fire whenever that branch triggers, just without the dmesg storm.
TCP perf measurement (today, 2026-05-17)
LAN-direct download from boltzmann, post-reconnect, signal -44 dBm:
Pre-reconnect at -57 dBm (was sat near a wifi extender) was 3.12 MB/s with worse retry. So the radio link can do ~5 MB/s sustained at MCS 7 HT20 2.4 GHz; not great, not crippling. A clean 5 GHz HT40 attachment would unlock most of the gap to peer-driver reports of ~80 Mbit/s — but we can't get there reliably while the scan rejects continue.
Workarounds
bssidpin):Both are workarounds, not fixes for the OP's root cause.
Proposed depth-of-fix taxonomy
bt_drv_config_coex_mode+ BTC patch hooks (perreference_bes2600_firmware_first_pass).WSM_REQ_ID_SCAN_STARTrequests per band so a 5 GHz-rejected scan doesn't poison 2.4 GHz results. Match how mac80211 issues multi-band scans.All three need a longer firmware-state instrumentation pass on the bes2600 coex/PM behavior before they can be properly scoped.
Memory cross-refs
reference_bes2600_5ghz_scan_reject(today's session, will get tightened with this issue's data)reference_bes2600_firmware_first_pass— coex/BTC symbol leaks for depth-3 fixreference_bes2600_firmware_no_psm— sibling firmware-policy gotchaStatus
Leaving this open — no fix in flight, just documentation drift. Patterns B/C/D from the OP haven't been re-verified today (current ohm isn't in a roam-cycle right now), but the OP's evidence stands. If anyone has a clean reproduction of Pattern B (
wsm_join_confirm ret 1) withbes2600dyndbg enabled they could capture the full WSM exchange — that's the entry point for any of the three fix paths above.Phase 5 review artifact — 2026-05-18
Posting the Phase 0–4 work-product verbatim for review. Per
feedback_phase5_surface_is_pr, this is the gitea-equivalent of the "second model review" step. Patch is NOT written yet — looking for pressure-test on root-cause analysis and patch shape before Phase 6.Goal (Phase 1)
Threshold rationale: ≤1/h (not zero) allows the firmware to legitimately reject a scan during real PM transitions we shouldn't fight. 4 h window is conservative — at clean ≤1/h, a 1h window can't distinguish from luck.
Situation (Phase 2)
7.0.0-danctnix1-5-pinetab2-danctnix-besser, bes2600 srcversion978FDDE6000F06D5721FB26. Includes patch056a71a"defer scan and soften WARN on firmware reject" (Markus-authored, cherry-picked by Danct12).bluetoothdrunning,bluetoothctl show→ "No default controller available". So the BT-A2DP-coex branch of 056a71a'sbes2600_scan_should_defer()is dead —coex_is_bt_a2dp()returns false.newtonBSSIDc0:25:06:e6:5b:32.Measurements (Phase 0 + Phase 3)
Baseline (N=3, dmesg-derived, per CLAUDE.md "rig-failure-is-finding" rule):
Converges to ~14-16/h, matches OP's 22-h baseline of 15.7/h. Patterns B/C/D all 0 across all reps — the downstream cascade is not firing today, just standalone scan-rejects.
Phase 3 capture (bes2600 dyndbg
+pmf, mac80211/cfg80211 ftrace events, 30-min window):The decisive evidence — back-to-back scan pairs:
Looking at one rejected pair (
[630–632]):The one back-to-back pair that both succeeded (
[2002–2003]):Threshold is empirically obvious: 6/6 rejected pairs had inter-scan gap <200 µs. The one successful pair had 114 ms gap.
Root cause
Firmware needs ~100 ms between consecutive scans to clean up internal state. Driver doesn't enforce that cooldown. Whenever mac80211 chains two scans (wpa_supplicant background-scan + NetworkManager roam-probe both fire on the ~300 s cadence), the second scan arrives sub-ms after the first's FW completion, before FW is ready. FW returns status 2 ("rejected by policy"); driver maps to
-22.OP's roam-cascade hypothesis (Pattern A → B → D) is incorrect for the normal case. Today's measurement shows Patterns B/D at 0 despite Pattern A at 14/h. The cascade likely happens only when there's additional roam pressure, but the upstream Pattern A is independent of join state.
056a71a's defer logic is structurally bypassed:
reject_count >= 3 && time_before(jiffies, backoff_until)) doesn't fire becausereject_countresets to 0 on every successful scan (line 767 in danctnix-7.0.6 scan.c). With 64% reject rate, the successes interleave often enough to keepreject_countbelow threshold.Upstream-ancestor check (per
feedback_mine_upstream_ancestor)cw1200 has a different shape that would dodge this:
PASSIVE/MONITOR/JOINING/PRE_STA/STA/IBSS/AP); bes2600 has only 4 (PASSIVE/MONITOR/STA/AP) — JOINING/PRE_STA intermediate states were dropped in the bes2600 fork.cw1200_scan_starthas an upfrontif (priv->join_status == CW1200_JOIN_STATUS_PRE_STA || CW1200_JOIN_STATUS_JOINING) return -EBUSY;guard.That guard isn't the right fix for tonight's reject branch (we're not mid-join when the rejects fire). But it's the correct precedent for the driver doing its own upfront refusal rather than letting the firmware reject — which is the philosophical shape we want.
Plan (Phase 4)
Patch: bes2600 — gate scan with inter-scan cooldown
Three changes, ~15 lines total:
Add field to scan state struct (in
bes2600.horscan.hdepending on wherestruct scanlives):In
bes2600_scan_complete()(scan.c around line 858), set the timestamp right afterieee80211_scan_completed()fires. (bes2600_scan_completeis the cleanup-everywhere path; setting it here covers both "all bands done" and "scan canceled" cases.)In
bes2600_hw_scan()(scan.c around line 175, right after theJOIN_STATUS_APearly return), gate with:Define
BES2600_SCAN_COOLDOWN_MS = 100near the top of scan.c.mac80211 handles
-EBUSYfromdrv_hw_scanby aborting the scan and notifying userspace; wpa_supplicant retries on its own cadence. Not a hidden behavior — the contract is documented ininclude/net/mac80211.h:struct ieee80211_ops.hw_scan.What this fix does NOT do (deliberately out of scope):
wsm_generic_confirmstatus preservation — not needed for this reject branch.056a71a's threshold-reset bug — separate patch, BT-coex branch is still useful as defense-in-depth for the rare A2DP-coex case.Predicted delta: Pattern A drops from 14-16/h to ≤1/h. Pattern B/D unchanged (already 0 on this stack today; if they reappear under roam load it's a separate issue).
Why 100 ms: matches the empirically-observed successful pair (114 ms gap). Comfortable margin over the <200 µs failing pairs. Smaller (e.g. 50 ms) might also work but lacks safety margin against firmware variance.
Flavors per
feedback_bes2600_dual_tree_flavorsTwo patch flavors required:
linux-pinetab2 danctnix-7.0.6branch (usestimer_container_ofstyle timers — though this patch doesn't touch timers, the surrounding code uses newer APIs).bes2600-dkms-mobian bes2600/bh-c-fossil-cleanupbranch (older kernel base).Both should land before Phase 7 (deploy danctnix flavor to ohm for measurement; Mobian flavor is for downstream parity).
Operational concerns for Phase 6
enu1wired rescue is missing — recovering it (or accepting the risk) before module reload, perfeedback_user_pushes_reboot_button.~/src/besser/marfrit-besser/danctnix-besser-pkgbuild/(perreference_danctnix_besser_pkgbuild_canonical), not the orphan checkout.Asks for the reviewer
bes2600_scan_completethe right place to setlast_complete_jiffies? Or should it be inbes2600_scan_workcleanup branch /bes2600_scan_complete_cb? The choice matters for "canceled scan" path.BUG_ONrace I'm not seeing.-EBUSYthe right return code frombes2600_hw_scan? Check mac80211 ops contract for what return codes are tolerated. (Pretty sure -EBUSY is fine but worth confirming.)feedback_dont_patch_downstream_artifactsis "keep separate".Phase 5 amendment — reviewer was right, root cause confirmed
Going through the reviewer's load-bearing claims one by one:
SINGLE_SCAN_ON_ALL_BANDSdrivers/staging/bes2600/main.c:386-394— 9ieee80211_hw_setcalls, this flag is not among themmain.c:439-440—hw->wiphy->bands[NL80211_BAND_2GHZ]and[NL80211_BAND_5GHZ]both assigneddrv_hw_scanloop when SINGLE_SCAN unsetnet/mac80211/scan.c:395-422(ieee80211_prep_hw_scan) iterateslocal->hw_scan_band++per band, returns true while channels remain;scan.c:474-482re-invokesdrv_hw_scanfrom__ieee80211_scan_completedwhile more bands queuereference_bes2600_5ghz_scan_rejectdocumented this from earlier work (2026-05-17)kworker/u16:*driving the seconddrv_hw_scanis mac80211's scan worker__ieee80211_scan_completedruns in wiphy_work context, callsprep_hw_scan+drv_hw_scaninlineThe mechanism is now:
SINGLE_SCAN_ON_ALL_BANDSis unset.drv_hw_scanfor band 0 (2.4 GHz, ~13 channels).wsm_scan; firmware accepts.bes2600_scan_complete_cb, mac80211's__ieee80211_scan_completedcallsprep_hw_scanfor band 1 (5 GHz, ~37 channels) and immediately re-invokesdrv_hw_scan.wsm_scanfor 5 GHz; firmware rejects with status 2 (per existingreference_bes2600_5ghz_scan_reject).wsm_generic_confirmcollapses to-22; mac80211 sees an aborted scan; userspace gets a half-complete scan result.Cooldown patch is dead. The 114 ms "successful" pair at
[2002–2003]was almost certainly a single-band-only scan request from wpa_supplicant —prep_hw_scanreturned false after the first band (no more channels to scan), so no chaineddrv_hw_scanever fired. The microsecond timing has nothing to do with firmware cleanup; it's just how fast mac80211 executes the next iteration in the same wiphy_work context.Revised Phase 4 — band filter, not cooldown
Reviewer pre-approved this if hypothesis held: "filter 5 GHz channels in bes2600_hw_scan before passing to wsm_scan".
Patch shape:
In
bes2600_hw_scan()(scan.caround line 175, beforedown(&hw_priv->scan.lock)):The check
req->n_channels == 1 && channels[0]->band == NL80211_BAND_5GHZcatches mac80211's per-band split when it iterates to the 5 GHz band — at that point, every channel inreq->channels[]is 5 GHz, and we refuse before issuing wsm_scan. The 2.4 GHz iteration is untouched.Why not just de-register the 5 GHz band entirely?
iw phychip-capability advertisementWhy
-EINVALand not-EOPNOTSUPPor-EBUSY?include/net/mac80211.h:struct ieee80211_ops.hw_scandocumentation: "Returns 0 on success. Returns 1 if the driver needs to fall back to software scan. Returns negative on error." So any negative is treated as error → aborted scan. Per reviewer's note, mac80211 finalizes the scan as aborted regardless of which negative is returned.Predicted delta: Pattern A drops from 14-16/h to 0/h. Pattern B/D unchanged (already 0 in this measurement window; tied to real roam activity, separate concern).
Thread safety (per reviewer's hole-poke on the cooldown patch): the new approach has zero shared state. Just an early-return check on the request struct that mac80211 owns. No race surface.
No
056a71ainteraction: my patch returns before the existing defer logic even runs. 056a71a's coex check stays in place as defense-in-depth for the rare BT-A2DP-coex case (still a different reject branch). 056a71a's threshold-reset bug is unchanged — separate follow-up issue worth filing, but not blocking.Updated answers to the 5 asks
100 ms cooldown value— moot, no cooldown.Right place for— moot, no timestamp.last_complete_jiffies-EINVALreturn — fine per mac80211 contract; any negative produces same userspace effect (aborted scan). Optimizing the return code further has no observable benefit.Phase 6 readiness checklist
feedback_user_pushes_reboot_button~/src/besser/marfrit-besser/perreference_danctnix_besser_pkgbuild_canonicalfeedback_bes2600_dual_tree_flavors(danctnix-intree primary; Mobian-DKMS for parity)include/net/mac80211.hfor the .hw_scan opPhase 7 verification — PASS, closing this issue
Patch deployed on ohm 2026-05-18 as
linux-pinetab2-danctnix-besser-7.0.danctnix1-2. bes2600 module srcversion changed978FDDE6000F06D5721FB26→2B29904248C3CB6820A4218.30-min observation window
wsm_generic_confirm failed for request 0x0007)wsm_join_confirm ret 1)Receive failure:)PREV_AUTH_NOT_VALID)Verdict: PASS. Phase 4 prediction was Pattern A → ≤ 1/h; achieved 0/h.
Behavioral confirmation (earlier ad-hoc tests)
iw dev wlan0 scan freq 5180→command failed: Operation not supported (-95)— the driver refuses the 5 GHz iteration at the boundary, no firmware round-trip, no log noise. Exactly the EOPNOTSUPP the patch returns.iw dev wlan0 scan freq 2462→ 5 BSSIDs returned fornewtonSSID. 2.4 GHz scan works normally.iw dev wlan0 scan(multi-band) → still aborted, as Phase 5 reviewer correctly predicted. mac80211 marks the whole scan aborted when any per-band leg returns negative, so the 2.4 GHz BSSes it has already collected are discarded. Same userspace outcome as before the patch — but without the dmesg storm. Single-band scans are the path forward for userspace tools that want 2.4 GHz BSS discovery viaiw.Known limitation surfaced during Phase 6
Build environment on boltzmann (GCC 15.2.1 + kernel 7.0 +
CONFIG_SHADOW_CALL_STACK=y) triggers an unrelatedarm_neon.hpragma error inarch/arm64/lib/xor-neon.c. Worked around for this build by settingCONFIG_SHADOW_CALL_STACK=nin the config — security hardening regression, not a runtime bug in this patch. The pkgrel=2 kernel on ohm has SCS off as a result. Tracked separately for a future build-env fix (downgrade GCC, or arm_neon.h patch). The 5 GHz scan filter itself is SCS-agnostic.Patches in flight
093a503on branchbes2600/scan-filter-5ghzofmarfrit/bes2600-dkms(local on boltzmann). Authored as Markus perreference_git_persona_claude_noether.ae175f9on branchclaude-noether-14ofmarfrit/besser(local). Adds0002-bes2600-filter-5ghz-scan.patch, bumps pkgrel to 2. Plus build-env workaround patch0003-arm64-xor-neon-ffixed-x18-build-fix.patchand config SCS-off (uncommitted).feedback_bes2600_dual_tree_flavors.Closing
Closing besser#1 as resolved by this patch. Patterns B/D never re-fired during baseline measurement (today's environment) but are explicitly downstream of A per the OP's analysis, so if they reappear under heavier roam stress that's a fresh symptom worth a new issue rather than re-opening this one.
Memory updates landed:
reference_bes2600_5ghz_scan_reject— softened "intermittent" framing to "filtered at driver boundary as of 2026-05-18"; behavior + workaround code path documentedproject_bes2600_c5x_deployed— new srcversion, pkgrel=2 with SCS-off caveat