Observed BES2600 driver bugs on PineTab2 (ohm)
Compiled from on-device dmesg + Pine64 wiki + community reports. Cross-references the patch series.
Bug #1 — factory.txt path mismatch + filp_open antipattern (FIXED in c1)
File: bes2600_factory.c:148-170 (read), :188-200 (create)
Symptom (pre-fix):
(NULL device *): read and check /lib/firmware/bes2600_factory.txt error
Root cause: the Makefile macro FACTORY_PATH is hardcoded to /lib/firmware/bes2600_factory.txt,
but the real file ships at /lib/firmware/bes2600/bes2600_factory.txt. Worse, the read uses
filp_open + kernel_read directly, bypassing the firmware-class infrastructure.
Fix: c1 patch — request_firmware() for the read path, repointed Makefile macro
to firmware-class name bes2600/bes2600_factory.txt.
Bug #1.5 — factory.txt parse failure (NEW, c5 to investigate)
File: bes2600_factory.c factory_parse()
Symptom (post-c1):
bes2600_factory.txt parse fail
read and check bes2600/bes2600_factory.txt error
factory cali data get failed.
How discovered: c1 fix exposed a deeper bug — factory_parse() chokes on the data
that request_firmware() now successfully returns. The original bug masked this
because the read always failed first.
Hypotheses: null-termination assumption mismatch (request_firmware doesn't
null-terminate), FACTORY_MEMBER_NUM=30/31 count discrepancy, kmalloc not
zero-initialized, parser strict on trailing %%\n delimiter.
Status: investigation pending (task c5). Driver falls back to defaults; WiFi functional but TX power is uncalibrated (all channels at 0x1400).
Bug #2 — PM low-power handshake timeout (recurring)
File: bes_pwr.c:470-558 — bes2600_pwr_enter_lp_mode(). Error at line 538.
Symptom:
bes2600_wlan mmc2:0001:1: bes2600_pwr_enter_lp_mode, wait pm ind timeout
Fires every 5–10s in steady state when associated. Floods dmesg, likely correlates with bug #3 (SDIO TX stack splat) and bad battery life.
Root cause: wait_for_completion_timeout(&pm_enter_cmpl, 5*HZ) waits
for firmware to acknowledge a PM mode change; firmware never sends ACK.
Driver proceeds to bes2600_pwr_device_enter_lp_mode() regardless.
Mobian == danctnix: both trees carry an identical bes_pwr.c (1447 lines, 0-hunk diff), so there is no upstream fix to borrow. We would have to write one ourselves (gate device-LP entry on the completion and add a retry).
Status: task c2.
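The proposed fix (gate device-LP entry on the completion, plus a retry) can be sketched as a small userspace model. It is written in Python rather than driver C so the control flow can be exercised on its own. The names enter_lp_mode, wait_for_pm_ind and device_enter_lp, and the retry budget, are hypothetical and not taken from bes_pwr.c; only the 5 s timeout mirrors the 5*HZ wait described above.

```python
from typing import Callable


def enter_lp_mode(
    wait_for_pm_ind: Callable[[float], bool],
    device_enter_lp: Callable[[], None],
    retries: int = 2,        # assumption: illustrative retry budget
    timeout_s: float = 5.0,  # mirrors the 5*HZ wait in the driver
) -> bool:
    """Enter device low-power mode only after the firmware acknowledges.

    wait_for_pm_ind(timeout_s) stands in for wait_for_completion_timeout()
    and returns True when the PM indication arrived in time.
    """
    for _attempt in range(1 + retries):
        if wait_for_pm_ind(timeout_s):
            device_enter_lp()  # gated: reached only after an ACK
            return True
    # No ACK after all attempts: stay in the current power state.
    return False
```

The difference from the current behaviour is the final return: on timeout the model reports failure instead of proceeding to device LP entry regardless. Whether the real firmware ever answers a retried request is untested.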
Bug #3 — SDIO TX scatter-gather panic / WARN
File: bes2600_sdio.c:952-1200 — bes_sdio_memcpy_to_io_helper,
sdio_tx_work.
Symptom:
[RX] Receive failure: 4.
bes_sdio_memcpy_to_io_helper+0x18c/0x288 [bes2600]
sdio_tx_work+0x2b4/0x4a0 [bes2600]
Workqueue: bes_sdio sdio_tx_work [bes2600]
Recurring under TX load. Can wedge the chip irrecoverably (per Pine64 wiki: "Power/reset circuitry not properly implemented; hard reset impossible without board power-cycle").
Status: task c3 (indirectly, via bes_chardev removal which currently gates the signal/nosignal mode switch path).
Architect review — now BUG-#5-blocking (was backlog)
The Phase 0 perf trace for Bug #5 first exposed a "when in doubt, add a
lock" pattern (~20 % CPU in _raw_spin_unlock_irqrestore). The
follow-up ftrace measurement (2026-05-07 17:00) refined the root cause
to an architectural problem: the bes2600 driver dispatches every
SDIO transaction through the kernel workqueue. Numbers from a 3-min
4 MB/s ohm capture (post-reboot, srcversion 1B3B3ED0):
wsm_cmd_send: 13/sec (host-to-chip command rate, surprisingly low)
bes2600_rx_cb: 611/sec
bes2600_bh_wakeup: 267/sec
lock contention_begin: 50/sec
workqueue_execute_start: 5,643/sec ← DOMINATES; matches the mmc
transaction rate from earlier perf
5.6 k workqueue dispatches per second is the throughput floor: not a specific lock, not the WSM-command rate, not decrypt-state. A surgical fix to any single function won't move the floor; the architecture needs to be restructured to amortise SDIO transactions across fewer work items (or move SDIO RX out of the workqueue entirely).
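To put the amortisation argument in numbers, here is a back-of-the-envelope sketch using the rates from the capture above. It assumes that today each work item handles exactly one SDIO transaction (the notes only say the dispatch rate matches the mmc transaction rate). The batch sizes are hypothetical; whether the driver can batch at all is a question for the architect review.

```python
# Event rates (per second) from the 3-min, 4 MB/s ftrace capture above.
RATES = {
    "wsm_cmd_send": 13,
    "bes2600_rx_cb": 611,
    "bes2600_bh_wakeup": 267,
    "lock_contention_begin": 50,
    "workqueue_execute_start": 5643,
}


def dispatches_per(event: str) -> float:
    """Workqueue dispatches per occurrence of another traced event."""
    return RATES["workqueue_execute_start"] / RATES[event]


def batched_dispatch_rate(batch: int) -> int:
    """Dispatches/sec if each work item carried `batch` SDIO transactions.

    Assumption: one work item currently handles one SDIO transaction.
    """
    if batch < 1:
        raise ValueError("batch must be >= 1")
    # Ceiling division: a partially filled batch still costs one dispatch.
    return -(-RATES["workqueue_execute_start"] // batch)


if __name__ == "__main__":
    print(f"dispatches per rx_cb: {dispatches_per('bes2600_rx_cb'):.1f}")
    for batch in (1, 4, 8, 16):
        print(f"batch={batch:2d}: {batched_dispatch_rate(batch)} dispatches/sec")
```

On these figures there are roughly nine workqueue dispatches per bes2600_rx_cb call, which is the ratio a restructuring would need to shrink.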
This is where the Claude Sonnet architect review belongs: a
top-to-bottom assessment of ~/src/besser/bes2600-dkms-mobian/bes2600/
focused on:
- the workqueue dispatch shape (most actionable)
- needless lock proliferation (the original signal)
- BH / RX scheduling boundaries
- error-handling coverage and dead-code from the cw1200 ancestor
- API contract violations relative to mainline mac80211
Output: ranked list of restructuring targets, with predicted-delta estimates against the Phase 1 metric (≥ 2 MB/s sustained @ 4 MB/s cap, < 10 % CPU in lock-cycling, no link cascade in 30 min).
Status: now blocking on Bug #5 (was independent track). Surgical patches B5-1, B5-2, B5-3 from the original Phase 4 candidate list are all DEFERRED until the architect review's restructuring map is in.
Bug #5 — RX path degrades under attempted-throughput pressure
Suspect file: bes2600 RX path (txrx.c bes2600_rx_cb, bh.c bes2600_bh_work,
SDIO RX scheduling) — pinpoint pending.
Symptom (observed 2026-05-07 13:43, srcversion 1B3B3ED0 = c-stack +
Patch A + Patch B, ohm @ -57 dBm 2.4GHz ch11 5b:32, idle save for the
netcat load):
sender cap 1 MB/s → ohm receives 1015 KB/s, signal -57 dBm, RX MCS 4
sender cap 4 MB/s → ohm receives 563 KB/s, signal -67 dBm, RX MCS 3
(Send-Q on boltzmann backed up to 1.16 MB)
Pushing the sender-side cap from 1 MB/s to 4 MB/s decreased observed throughput at the receiver and degraded the link metrics. Signal dropped ~10 dB and the chip downshifted MCS, suggesting the chip can't sustain the higher RX rate even with the link physically capable of more (link bitrate 65 Mb/s = ~8 MB/s theoretical).
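The arithmetic behind that comparison, as a short sketch. It assumes the reported KB/s figures are decimal kilobytes (1000 bytes); if the tool reported KiB, the percentages shift by about 2 %.

```python
LINK_MBIT_PER_S = 65.0  # link bitrate reported for this association


def mbit_to_mbyte(mbit_per_s: float) -> float:
    """Convert Mb/s to MB/s (8 bits per byte, no protocol overhead)."""
    return mbit_per_s / 8.0


def link_utilisation(kb_per_s: float, link_mbit: float = LINK_MBIT_PER_S) -> float:
    """Observed receive rate as a fraction of the theoretical link rate."""
    return (kb_per_s / 1000.0) / mbit_to_mbyte(link_mbit)


def throughput_drop(before_kb: float, after_kb: float) -> float:
    """Relative loss in receive throughput between two runs."""
    return 1.0 - after_kb / before_kb


if __name__ == "__main__":
    print(f"theoretical: {mbit_to_mbyte(LINK_MBIT_PER_S):.3f} MB/s")
    for label, kb in (("1 MB/s cap", 1015), ("4 MB/s cap", 563)):
        print(f"{label}: {link_utilisation(kb):.1%} of link rate")
    print(f"drop: {throughput_drop(1015, 563):.1%}")
```

Quadrupling the sender cap cut receive throughput by roughly 45 %, leaving it near 7 % of the theoretical link rate.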
Hypothesis (Markus, 2026-05-07): driver/firmware locks itself to death under busy reads — possibly a busy-wait loop or lock contention on the RX SDIO path that prevents draining at line rate. Plausible reason it didn't surface for the c-stack tasks: those operated at typical browse-rate traffic, well below the saturation threshold this bug needs to fire.
May explain: original Phase-0 observation that YouTube DASH chunks drop ~10 frames per chunk fetch on hardware-decoder playback. A chunk fetch is a brief burst at near-link-rate; if the driver throttles itself down during high-RX, the player buffer underruns for the duration of the fetch.
How to drill (when prioritized):
- Capture trace_pipe with mmc:* and sdio* events enabled during a controlled rate-ramp (e.g., pv -L 500K, 1M, 2M, 4M each for 60 s).
- Watch /proc/sys/kernel/sched_* and the bes2600_bh_work kworker for CPU saturation.
- perf top -p $(pgrep -f bes_sdio) during 4 MB/s load.
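The rate-ramp can be scripted so that every step uses the same pipeline. The sketch below only builds the sender-side command lines and does not run them. The host name and port are hypothetical, and it assumes the receiver runs a netcat listener that discards the data and that coreutils timeout is available on the sender.

```python
import shlex
from typing import List

RAMP_RATES = ["500K", "1M", "2M", "4M"]  # pv -L steps from the list above
STEP_SECONDS = 60                        # duration of each step


def ramp_commands(host: str, port: int) -> List[str]:
    """Return one shell pipeline per ramp step (sender side).

    host and port are placeholders for the receiver under test.
    """
    cmds = []
    for rate in RAMP_RATES:
        cmds.append(
            f"timeout {STEP_SECONDS} pv -q -L {rate} /dev/zero"
            f" | nc {shlex.quote(host)} {int(port)}"
        )
    return cmds


if __name__ == "__main__":
    # Hypothetical receiver address; print the plan for review.
    for cmd in ramp_commands("ohm.local", 5001):
        print(cmd)
```

netcat variants differ in whether they exit once pv stops feeding them, so each step may need a variant-specific flag or a fresh listener on the receiver.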
Status: backlog. No patch yet.
Bug #4 — scan_complete_cb constant loop
File: scan.c:883-909 — bes2600_scan_complete_cb().
Symptom:
ieee80211 phy0: bes2600_scan_complete_cb status: 0
Fires every 2–10 s. status=0 means success, but the frequency suggests that background scanning runs continuously while the interface is associated and idle.
Most likely a NetworkManager scheduling artifact, not a driver bug. Low priority; if it matters, suppress the wiphy_dbg print or skip scanning while associated.