besser/notes/observed-bugs.md
claude-noether 928268f477 notes: backlog Sonnet architect review of bes2600 driver
Per PR #6 review feedback. Independent track from Bug #5; scheduled
once the Bug #5 measurement pass finishes.
2026-05-07 16:38:58 +02:00


Observed BES2600 driver bugs on PineTab2 (ohm)

Compiled from on-device dmesg + Pine64 wiki + community reports. Cross-references the patch series.

Bug #1 — factory.txt path mismatch + filp_open antipattern (FIXED in c1)

File: bes2600_factory.c:148-170 (read), :188-200 (create)

Symptom (pre-fix):

(NULL device *): read and check /lib/firmware/bes2600_factory.txt error

Root cause: hardcoded FACTORY_PATH=/lib/firmware/bes2600_factory.txt Makefile macro; real file ships at /lib/firmware/bes2600/bes2600_factory.txt. Worse, the read uses filp_open + kernel_read directly, bypassing the firmware-class infrastructure.

Fix: c1 patch — request_firmware() for the read path, repointed Makefile macro to firmware-class name bes2600/bes2600_factory.txt.

Bug #1.5 — factory.txt parse failure (NEW, c5 to investigate)

File: bes2600_factory.c factory_parse()

Symptom (post-c1):

bes2600_factory.txt parse fail
read and check bes2600/bes2600_factory.txt error
factory cali data get failed.

How discovered: c1 fix exposed a deeper bug — factory_parse() chokes on the data that request_firmware() now successfully returns. The original bug masked this because the read always failed first.

Hypotheses: null-termination assumption mismatch (request_firmware() does not NUL-terminate the buffer it returns), FACTORY_MEMBER_NUM=30/31 count discrepancy, kmalloc buffer not zero-initialized, parser too strict about the trailing %%\n delimiter.
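
A minimal userspace sketch of the first hypothesis, assuming the parser relies on C-string functions: a length-delimited blob has to be copied into a buffer one byte larger and terminated explicitly before any string parsing touches it. The helper name blob_to_cstring is hypothetical and does not exist in the driver. In-kernel, kmemdup_nul() should do the same job. This only illustrates the hypothesis; it is not a confirmed root cause.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper (not part of the driver): return a heap copy of a
 * length-delimited blob with a terminating NUL appended, so that
 * string-based parsing cannot run past the end of the data.
 * Returns NULL on allocation failure; the caller frees the result.
 */
char *blob_to_cstring(const unsigned char *data, size_t size)
{
	char *buf = malloc(size + 1);

	if (!buf)
		return NULL;
	memcpy(buf, data, size);
	buf[size] = '\0'; /* the blob itself carries no terminator */
	return buf;
}
```

If factory_parse() then succeeds on the terminated copy, the null-termination hypothesis is the likely culprit; if it still fails, the remaining hypotheses stay on the table.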

Status: investigation pending (task c5). Driver falls back to defaults; WiFi functional but TX power is uncalibrated (all channels at 0x1400).

Bug #2 — PM low-power handshake timeout (recurring)

File: bes_pwr.c:470-558, bes2600_pwr_enter_lp_mode(). Error at line 538.

Symptom:

bes2600_wlan mmc2:0001:1: bes2600_pwr_enter_lp_mode, wait pm ind timeout

Fires every 5-10 s in steady state when associated. Floods dmesg; likely correlates with Bug #3 (SDIO TX stack splat) and poor battery life.

Root cause: wait_for_completion_timeout(&pm_enter_cmpl, 5*HZ) waits for firmware to acknowledge a PM mode change; firmware never sends ACK. Driver proceeds to bes2600_pwr_device_enter_lp_mode() regardless.

Mobian == danctnix: bes_pwr.c is identical in both trees (1447 lines, 0-hunk diff), so no upstream fix exists. We would have to write one: gate device-LP entry on the completion and add a retry.
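
A rough userspace sketch of that "gate plus retry" idea, under the assumption that the driver should stay awake when the firmware never acknowledges. Every name here (lp_ops, enter_lp_gated, fake_fw and the fake_* callbacks) is hypothetical and stands in for driver internals. The real change would sit around the wait_for_completion_timeout() call in bes2600_pwr_enter_lp_mode().

```c
#include <stdbool.h>

/* Hypothetical stand-ins for driver internals; none of these exist in bes_pwr.c. */
struct lp_ops {
	bool (*wait_pm_ack)(void *ctx);     /* true if the firmware acked before the timeout */
	void (*device_enter_lp)(void *ctx); /* puts the device into low-power mode */
};

/*
 * Gate low-power entry on the firmware ack, retrying up to max_tries times.
 * Returns 0 once the device has entered LP mode, -1 if no ack ever arrived.
 */
int enter_lp_gated(const struct lp_ops *ops, void *ctx, int max_tries)
{
	for (int attempt = 0; attempt < max_tries; attempt++) {
		if (ops->wait_pm_ack(ctx)) {
			ops->device_enter_lp(ctx);
			return 0;
		}
	}
	return -1; /* stay awake rather than sleep without an ack */
}

/* Fake firmware for illustration: acks only from the Nth request onward. */
struct fake_fw {
	int ack_from_request; /* 1-based request index at which acks start; 0 = never */
	int requests;         /* number of ack waits performed so far */
	int lp_entries;       /* number of times LP mode was entered */
};

static bool fake_wait_pm_ack(void *ctx)
{
	struct fake_fw *fw = ctx;

	fw->requests++;
	return fw->ack_from_request != 0 && fw->requests >= fw->ack_from_request;
}

static void fake_device_enter_lp(void *ctx)
{
	struct fake_fw *fw = ctx;

	fw->lp_entries++;
}

static const struct lp_ops fake_ops = {
	.wait_pm_ack = fake_wait_pm_ack,
	.device_enter_lp = fake_device_enter_lp,
};
```

With a fake firmware that acks on the second request, enter_lp_gated(&fake_ops, &fw, 3) enters LP mode once after two waits. With one that never acks, it gives up after three waits and never calls device_enter_lp. Whether skipping LP entry is acceptable for power draw is an open design question for task c2.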

Status: task c2.

Bug #3 — SDIO TX scatter-gather panic / WARN

File: bes2600_sdio.c:952-1200, bes_sdio_memcpy_to_io_helper, sdio_tx_work.

Symptom:

[RX] Receive failure: 4.
 bes_sdio_memcpy_to_io_helper+0x18c/0x288 [bes2600]
 sdio_tx_work+0x2b4/0x4a0 [bes2600]
Workqueue: bes_sdio sdio_tx_work [bes2600]

Recurring under TX load. Can wedge the chip irrecoverably (per Pine64 wiki: "Power/reset circuitry not properly implemented; hard reset impossible without board power-cycle").

Status: task c3 (indirectly, via bes_chardev removal which currently gates the signal/nosignal mode switch path).

Backlog — full architect review of bes2600 driver code quality

The Phase 0 perf trace for Bug #5 exposes a "when in doubt, add a lock" pattern in the BH path (~20 % CPU in _raw_spin_unlock_irqrestore even during healthy throughput). Markus has flagged this for a separate architect-review pass: have Claude Sonnet (or equivalent reviewer) do a top-to-bottom code-quality review of the bes2600 sources we have on boltzmann (~/src/besser/bes2600-dkms-mobian/bes2600/), looking for:

  • needless lock proliferation
  • BH / workqueue dispatch shape
  • error-handling coverage
  • dead code / leftover-from-cw1200 cruft
  • API contract violations relative to mainline mac80211

Output: ranked list of cleanup targets that would make later patch series land more cleanly. Not blocking on Bug #5 — independent track.

Status: backlog. Schedule when Bug #5's measurement pass finishes.

Bug #5 — RX path degrades under attempted-throughput pressure

Suspect file: bes2600 RX path (txrx.c bes2600_rx_cb, bh.c bes2600_bh_work, SDIO RX scheduling) — pinpoint pending.

Symptom (observed 2026-05-07 13:43, srcversion 1B3B3ED0 = c-stack + Patch A + Patch B, ohm @ -57 dBm 2.4 GHz ch11 5b:32, idle save for the netcat load):

sender cap 1 MB/s  →  ohm receives 1015 KB/s,  signal -57 dBm,  RX MCS 4
sender cap 4 MB/s  →  ohm receives  563 KB/s,  signal -67 dBm,  RX MCS 3
                     (Send-Q on boltzmann backed up to 1.16 MB)

Pushing the sender-side cap from 1 MB/s to 4 MB/s decreased observed throughput at the receiver and degraded the link metrics. Signal dropped ~10 dB and the chip downshifted MCS, suggesting the chip can't sustain the higher RX rate even with the link physically capable of more (link bitrate 65 Mb/s = ~8 MB/s theoretical).
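
The gap between the link rate and the observed throughput can be put in numbers with a small sketch. The function names are hypothetical, and the conversion assumes decimal prefixes (1 MB = 1000 KB) and ignores 802.11 protocol overhead, so the result is an upper bound on what the link could deliver.

```c
/* Convert a link rate in Mb/s to MB/s (8 bits per byte, decimal prefixes). */
double mbit_to_mbyte(double mbit_per_s)
{
	return mbit_per_s / 8.0;
}

/*
 * Fraction of the theoretical link rate actually received.
 * Assumes 1 MB = 1000 KB and no protocol overhead.
 */
double link_utilisation(double observed_kb_per_s, double link_mbit_per_s)
{
	return observed_kb_per_s / (mbit_to_mbyte(link_mbit_per_s) * 1000.0);
}
```

For the measurements above, 65 Mb/s works out to 8.125 MB/s, so 1015 KB/s is roughly 12.5 % of the theoretical rate and 563 KB/s roughly 6.9 %. Both runs sit far below what the link nominally allows.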

Hypothesis (Markus, 2026-05-07): driver/firmware locks itself to death under busy reads — possibly a busy-wait loop or lock contention on the RX SDIO path that prevents draining at line rate. Plausible reason it didn't surface for the c-stack tasks: those operated at typical browse-rate traffic, well below the saturation threshold this bug needs to fire.

May explain: original Phase-0 observation that YouTube DASH chunks drop ~10 frames per chunk fetch on hardware-decoder playback. A chunk fetch is a brief burst at near-link-rate; if the driver throttles itself down during high-RX, the player buffer underruns for the duration of the fetch.

How to drill (when prioritized):

  • Capture trace_pipe with mmc:* and sdio* events enabled during a controlled rate-ramp (e.g., pv -L 500K, 1M, 2M, 4M each for 60 s).
  • Watch /proc/sys/kernel/sched_* and the bes2600_bh_work kworker for CPU saturation.
  • perf top -p $(pgrep -f bes_sdio) during 4 MB/s load.

Status: backlog. No patch yet.

Bug #4 — scan_complete_cb constant loop

File: scan.c:883-909, bes2600_scan_complete_cb().

Symptom:

ieee80211 phy0: bes2600_scan_complete_cb status: 0

Fires every 2-10 s (status=0 means success, but the frequency suggests background scanning runs continuously while associated and idle).

Most likely a NetworkManager scheduling artifact, not a driver bug. Low priority; if it matters, suppress the wiphy_dbg print or skip scanning while associated.