pkgrel=6 (per-series reconstruction): SDIO timeout cascade after 1-6h uptime — wifi_force_close path WARN_ONs #22

Closed
opened 2026-05-20 09:07:18 +00:00 by marfrit · 2 comments
Owner

Symptom

Under linux-pinetab2-danctnix-besser pkgrel=6 (the per-series reconstruction landed via kernel-agent#33 then reverted), the bes2600 chip wedges after 1–6h of uptime in a consistent failure cascade:

[t≈0…1h] normal operation, c7 PSM-skip-latch fires at ~60s as expected
[t≈1-6h]
  bes2600_wlan: [RX] Receive failure: 4.    (×3-6 in a burst)
  bes2600_wlan: [bes2600] decrypt-storm fast-recover: forcing reassoc   ← Patch A fires
  wlan0: deauthenticating from <BSSID> by local choice (Reason: 3=DEAUTH_LEAVING)
  bes2600_wlan: bes_sdio_memcpy_io_helper, err=-110(…)
  bes2600_wlan: bes2600_sdio_read_rx_batch,912 error=-110
  bes2600_wlan: sdio_work_debug rx/tx dumps
  bes2600_wlan: realtime ctrl=1
  WARNING: at bes2600_tx_loop_set_enable+0x140/0x158 [bes2600], CPU#N: kworker/N:NH
    Workqueue: bes2600_bh bes2600_bh_work
    lr : bes2600_chrdev_wifi_force_close+0xa4/0xd8 [bes2600]
    bes2600_sdio_read_rx_batch+0x41c/0x530 [bes2600]
    bes2600_bh.isra.0+0x25c/0x880 [bes2600]
    bes2600_bh_work+0x18/0x28 [bes2600]
  WARNING: at bes2600_sdio_read_rx_batch+0x41c/0x530 [bes2600]
  bes2600_wlan: bes2600_bh_rx_helper fail
  bes2600_wlan: startup timeout!!!
[after timeout] chip dead until rmmod+modprobe or full reboot:
  bes2600_wlan mmc2:0001:1: probe with driver bes2600_wlan failed with error -123

Reproductions

  • pkgrel=6 production (boot ID 21d89a7e, Tue 2026-05-19 23:39:54 → Wed 2026-05-20 05:54:57, 6h15m uptime): wedged at 05:53:56.
  • pkgrel=6-lockdep (boot ID a197770b, Wed 2026-05-20 05:55:19 → 08:54:34, 2h59m uptime): wedged at 07:41:42.
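As a cross-check on the "1–6h" window, the time from boot to wedge can be computed from the timestamps above. This is a minimal sketch: the helper names are hypothetical, and the wedge dates are an assumption inferred from the boot windows, since the issue gives only the time of day for each wedge.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

# (label, boot time, wedge time) copied from the two reproductions above.
# The wedge dates are inferred from the boot windows (assumption).
RUNS = [
    ("pkgrel=6 production", "2026-05-19 23:39:54", "2026-05-20 05:53:56"),
    ("pkgrel=6-lockdep", "2026-05-20 05:55:19", "2026-05-20 07:41:42"),
]


def time_to_wedge(boot: str, wedge: str) -> int:
    """Seconds elapsed between boot and the wedge timestamp."""
    delta = datetime.strptime(wedge, FMT) - datetime.strptime(boot, FMT)
    return int(delta.total_seconds())


def fmt_hms(seconds: int) -> str:
    """Format a duration in seconds as e.g. '6h14m02s'."""
    hours, rem = divmod(seconds, 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours}h{minutes:02d}m{secs:02d}s"


for label, boot, wedge in RUNS:
    print(f"{label}: wedged after {fmt_hms(time_to_wedge(boot, wedge))}")
```

Under those assumptions the two runs wedge after roughly 6h14m and 1h46m, which is consistent with the 1–6h range quoted in the symptom.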

In both cases the chain is identical. Decrypt-storm fast-recover (Patch A) fires after a burst of RX-status-4 failures. Minutes to hours later the SDIO bus stops responding (err=-110, ETIMEDOUT), and the BH worker reaches bes2600_chrdev_wifi_force_close from the recovery path, which then WARN_ONs in bes2600_tx_loop_set_enable.

pkgrel=5 (c5x interim cumulative) does NOT exhibit this — multi-day uptimes are clean.
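The chain described above can be checked mechanically against a kernel log. The following is a rough sketch, not part of any existing tooling: the stage names and the `cascade_progress` / `is_wedged` helpers are hypothetical, and the regular expressions are derived only from the log excerpt in this issue, so they may need adjusting for other driver builds.

```python
import re

# Ordered cascade stages; patterns are taken from the log excerpt in this
# issue. Stage names are made up for illustration.
CASCADE = [
    ("rx_failure", re.compile(r"\[RX\] Receive failure: 4")),
    ("fast_recover", re.compile(r"decrypt-storm fast-recover")),
    ("sdio_timeout", re.compile(r"err(?:or)?=-110")),
    ("force_close", re.compile(r"bes2600_chrdev_wifi_force_close")),
    ("startup_timeout", re.compile(r"startup timeout")),
]


def cascade_progress(lines):
    """Return the names of the cascade stages that appear, in order."""
    seen = []
    stage = 0
    for line in lines:
        # Only the next expected stage is matched, so out-of-order
        # lines do not advance the state machine.
        if stage < len(CASCADE) and CASCADE[stage][1].search(line):
            seen.append(CASCADE[stage][0])
            stage += 1
    return seen


def is_wedged(lines):
    """True if every stage of the cascade was observed in order."""
    return len(cascade_progress(lines)) == len(CASCADE)
```

One way to use it would be to feed it the lines of `journalctl -k -b <boot-id>` for a suspect boot; a partial result (for example only the first two stages) would indicate that Patch A fired without the SDIO bus wedging afterwards.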

Likely cause

The per-series reconstruction (kernel-agent#33, then redone in #36 via rebase onto v7.0-danctnix1 baseline) had to resolve a conflict on the remove-chardev-user-interface commit because danctnix's bes2600_btuart.c depends on chardev utility symbols. My conflict resolution re-added bes2600_chrdev_switch_subsys_glb, bes2600_chrdev_is_bus_error, and bes2600_switch_bt to make the build link. But the c5x-interim hand-curated cumulative kept a slightly different set of internal helpers in bes_chardev.c, including the specific state paths that bes2600_chrdev_wifi_force_close relies on for emergency-close. My re-add doesn't match c5x-interim's recovery-path invariants, so when Patch A's recovery flow eventually calls _wifi_force_close, the function hits inconsistent state and WARN_ONs.

Actions taken (immediate)

  • ohm rolled back to pkgrel=5 (commit 2299d7a02 in marfrit-packages, reverts pkgrel=6 commit 31da35a54).
  • kernel-agent main reverted PR #33's merge (commit 588350c). fleet/ohm.yaml is back to using cumulative-c5x-danctnix/.
  • kernel-agent#36 (the redone reconstruction) closed without merge, comment recording the regression.
  • kernel-agent#29 (the original "reconstruct per-series" issue) stays OPEN — it's still real work.

Future redo acceptance criteria

For a future per-series reconstruction to land, it MUST:

  • Pass an N-hour soak test (N ≥ 8h, ideally 24h) with normal wifi load on ohm.
  • Not have Phase-7 verification declared complete on the basis of short-window (≤10 min) functional checks. The pkgrel=5 baseline shows multi-day clean uptime; that's the bar.
  • Either keep the chardev re-add byte-for-byte identical to c5x-interim's recovery-path helpers, or rework bes2600_chrdev_wifi_force_close with a tested replacement path. Either way, the failure mode in this issue must be reproduced and fixed before merge, not deferred.
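The soak criterion could be enforced by a small gate check rather than by eye. This is only a sketch under stated assumptions: `parse_uptime`, `soak_passes`, and the marker list are hypothetical names chosen for illustration, and the markers are the failure strings quoted in this issue, not an exhaustive list.

```python
import re

MIN_SOAK_SECONDS = 8 * 3600  # the N >= 8h gate from the criteria above

# Failure strings quoted in this issue; any hit fails the soak.
FAILURE_MARKERS = (
    "err=-110",
    "bes2600_chrdev_wifi_force_close",
    "startup timeout",
)


def parse_uptime(text: str) -> int:
    """Parse an uptime such as '6h53m', '24h' or '45m' into seconds."""
    match = re.fullmatch(r"(?:(\d+)h)?(?:(\d+)m)?", text)
    if not match or not any(match.groups()):
        raise ValueError(f"unrecognised uptime: {text!r}")
    hours, minutes = (int(g) if g else 0 for g in match.groups())
    return hours * 3600 + minutes * 60


def soak_passes(uptime: str, log_lines) -> bool:
    """True only if the uptime meets the gate and no failure marker appears."""
    hits = [ln for ln in log_lines if any(m in ln for m in FAILURE_MARKERS)]
    return parse_uptime(uptime) >= MIN_SOAK_SECONDS and not hits
```

Keeping the duration check and the log check in one function means a clean but short run cannot be reported as a pass, which is the failure of process this issue is about.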

Related

  • kernel-agent#29 (the original "reconstruct per-series" issue, stays open)
  • kernel-agent#33 (the broken merge that was reverted)
  • kernel-agent#36 (the redone attempt that was closed without merge)
  • marfrit-packages: pkgrel=6 commit 31da35a54 reverted as 2299d7a02 (no separate issue)
Author
Owner

Phase 7 verified - closing. pkgrel=5 srcversion 91E5C5F1BFAF70BDE3A1970 passed 6h53m soak: zero wifi_force_close cascade, zero err=-110, zero KFENCE OOB. Bounce-buffer fix (2f9b4c7) was missing from pkgrel=4 per-series (staging-prep filter exclusion); added as patch 0021 for pkgrel=5. Full details in memory file project_besser22_closed.md.

Author
Owner

Build: linux-pinetab2-danctnix-besser 7.0.danctnix1-5, srcversion 91E5C5F1BFAF70BDE3A1970

Soak: 6h53m, clean, ending in a graceful shutdown (user-initiated reboot).

Results against acceptance criteria:

  • No bes2600_chrdev_wifi_force_close WARN_ON cascade
  • No bes_sdio_memcpy_io_helper err=-110 SDIO timeout
  • No SDIO timeout → "startup timeout!!!" chain
  • 6h53m uptime (fell 1h07m short of the 8h gate due to user reboot; zero kernel events throughout)

Regression found and fixed during soak: The initial per-series (pkgrel=4, 20 patches) was missing commit 2f9b4c7 (bounce SDIO TX buffers to avoid DMA OOB read). The commit is present in the cumulative single-patch but was excluded from the per-series reconstruction because it was classified as staging-prep. Without it, KFENCE caught OOB reads in bes_sdio_memcpy_to_io_helper every ~10 minutes, causing TX workqueue stalls and latency scatter (~50% packet loss at times). It was added as patch 0021 in pkgrel=5, after which KFENCE hits dropped to zero.

Final per-series: 21 patches, branch bes2600/besser-danctnix-v3 in marfrit/bes2600-dkms. PKGBUILD at noether/readme-pkgrel4-kernel-agent-flow in marfrit/besser, commit 818d7b8.

Lesson logged: When reconstructing per-series from a cumulative, diff the cumulative final state vs per-series final state per file. A srcversion mismatch is the signal to audit -- if cumulative and per-series produce different source trees, the diff tells you what was missed.
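The per-file comparison from the lesson above could look roughly like the sketch below. It assumes both patch sets have already been applied to separate checkouts; the `tree_diff` helper and the directory arguments are hypothetical and not part of the kernel-agent tooling.

```python
import filecmp
from pathlib import Path


def tree_diff(cumulative: Path, per_series: Path) -> dict:
    """Compare two source trees file by file.

    Returns relative paths that exist on only one side, plus paths
    present on both sides whose contents differ.
    """
    def files(root: Path) -> set:
        return {p.relative_to(root) for p in root.rglob("*") if p.is_file()}

    a, b = files(cumulative), files(per_series)
    return {
        "only_cumulative": sorted(p.as_posix() for p in a - b),
        "only_per_series": sorted(p.as_posix() for p in b - a),
        # shallow=False forces a byte-for-byte content comparison
        # instead of trusting size and mtime.
        "differ": sorted(
            p.as_posix()
            for p in a & b
            if not filecmp.cmp(cumulative / p, per_series / p, shallow=False)
        ),
    }
```

An empty report in all three lists would mean the two trees are identical; any entry under "differ" or "only_cumulative" names a file where the per-series reconstruction may have dropped a change.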

Reference: marfrit/besser#22