besser/notes/observed-bugs.md
claude-noether 594f73c6b4 notes: Bug #5 root cause refined — workqueue-per-SDIO-transaction is the floor
Follow-up ftrace measurement (post-reboot, 3-min 4MB/s capture):
- workqueue_execute_start: 5,643/sec  ← dominates
- wsm_cmd_send: only 13/sec (host-to-chip command path NOT the hotspot)
- lock contention: 50/sec (modest)

The throughput floor is set by per-SDIO-transaction workqueue dispatch
overhead. Surgical patches B5-1/B5-2/B5-3 from the prior Phase 4 plan
all targeted the wrong layer; deferring those until an architectural
restructuring map is produced.

Promoting the Sonnet architect review from "backlog" to
"blocking on Bug #5" — the next step is a restructuring assessment,
not another patch.
2026-05-07 17:31:31 +02:00

# Observed BES2600 driver bugs on PineTab2 (ohm)
Compiled from on-device dmesg + Pine64 wiki + community reports. Cross-references the patch series.
## Bug #1 — factory.txt path mismatch + filp_open antipattern (FIXED in c1)
**File**: `bes2600_factory.c:148-170` (read), `:188-200` (create)
**Symptom (pre-fix)**:
```
(NULL device *): read and check /lib/firmware/bes2600_factory.txt error
```
**Root cause**: hardcoded `FACTORY_PATH=/lib/firmware/bes2600_factory.txt` Makefile macro;
real file ships at `/lib/firmware/bes2600/bes2600_factory.txt`. Worse, the read uses
`filp_open` + `kernel_read` directly, bypassing the firmware-class infrastructure.
**Fix**: c1 patch — `request_firmware()` for the read path, repointed Makefile macro
to firmware-class name `bes2600/bes2600_factory.txt`.
## Bug #1.5 — factory.txt parse failure (NEW, c5 to investigate)
**File**: `bes2600_factory.c factory_parse()`
**Symptom (post-c1)**:
```
bes2600_factory.txt parse fail
read and check bes2600/bes2600_factory.txt error
factory cali data get failed.
```
**How discovered**: c1 fix exposed a deeper bug — `factory_parse()` chokes on the data
that `request_firmware()` now successfully returns. The original bug masked this
because the read always failed first.
**Hypotheses**: null-termination assumption mismatch (`request_firmware` doesn't
null-terminate), `FACTORY_MEMBER_NUM=30/31` count discrepancy, kmalloc not
zero-initialized, parser strict on trailing `%%\n` delimiter.
**Status**: investigation pending (task c5). Driver falls back to defaults; WiFi
functional but TX power is uncalibrated (all channels at 0x1400).
## Bug #2 — PM low-power handshake timeout (recurring)
**File**: `bes_pwr.c:470-558`, `bes2600_pwr_enter_lp_mode()`. Error at line 538.
**Symptom**:
```
bes2600_wlan mmc2:0001:1: bes2600_pwr_enter_lp_mode, wait pm ind timeout
```
Fires every 5-10 s in steady state when associated. Floods dmesg and likely
correlates with Bug #3 (SDIO TX stack splat) and poor battery life.
**Root cause**: `wait_for_completion_timeout(&pm_enter_cmpl, 5*HZ)` waits
for firmware to acknowledge a PM mode change; firmware never sends ACK.
Driver proceeds to `bes2600_pwr_device_enter_lp_mode()` regardless.
**Mobian == danctnix**: identical bes_pwr.c (1447 lines, 0-hunk diff). No
upstream fix exists; we'd invent it (gate device-LP entry on completion +
add retry).
**Status**: task c2.
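The proposed fix (gate device-LP entry on the completion, plus a retry) can be
modelled in user space. The sketch below is Python, not driver code:
`threading.Event` stands in for the kernel completion `pm_enter_cmpl`, and
`send_pm_request`, `enter_device_lp`, the retry count, and the timeout value
are all hypothetical placeholders chosen for illustration.

```python
import threading
from typing import Callable


def enter_lp_mode(
    send_pm_request: Callable[[], None],
    pm_ind: threading.Event,
    enter_device_lp: Callable[[], None],
    retries: int = 3,
    timeout_s: float = 5.0,
) -> bool:
    """Enter device low-power mode only after the firmware acknowledges.

    Returns True if the ACK arrived and device LP was entered, False if
    every attempt timed out (device LP entry is then skipped).
    """
    for _ in range(retries):
        pm_ind.clear()              # analogous to reinit_completion()
        send_pm_request()           # ask firmware for the PM mode change
        if pm_ind.wait(timeout_s):  # analogous to wait_for_completion_timeout()
            enter_device_lp()       # gated: runs only on a confirmed ACK
            return True
    return False                    # never acknowledged: stay out of LP
```

The design point is the early return: the current driver falls through to
device-LP entry on timeout, whereas this shape makes the timeout path skip it.
Whether the real firmware tolerates a repeated request is untested.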
## Bug #3 — SDIO TX scatter-gather panic / WARN
**File**: `bes2600_sdio.c:952-1200`, `bes_sdio_memcpy_to_io_helper`,
`sdio_tx_work`.
**Symptom**:
```
[RX] Receive failure: 4.
bes_sdio_memcpy_to_io_helper+0x18c/0x288 [bes2600]
sdio_tx_work+0x2b4/0x4a0 [bes2600]
Workqueue: bes_sdio sdio_tx_work [bes2600]
```
Recurring under TX load. Can wedge the chip irrecoverably (per Pine64 wiki:
"Power/reset circuitry not properly implemented; hard reset impossible
without board power-cycle").
**Status**: task c3 (indirectly, via bes_chardev removal which currently
gates the signal/nosignal mode switch path).
## Architect review — now BUG-#5-blocking (was backlog)
The Phase 0 perf trace for Bug #5 first exposed a "when in doubt, add a
lock" pattern (~20 % CPU in `_raw_spin_unlock_irqrestore`). The
follow-up ftrace measurement (2026-05-07 17:00) refined the root cause
to an architectural problem: **the bes2600 driver dispatches every
SDIO transaction through the kernel workqueue**. Numbers from a 3-min
4 MB/s ohm capture (post-reboot, srcversion `1B3B3ED0`):
```
wsm_cmd_send: 13/sec (host-to-chip command rate, surprisingly low)
bes2600_rx_cb: 611/sec
bes2600_bh_wakeup: 267/sec
lock contention_begin: 50/sec
workqueue_execute_start: 5,643/sec ← DOMINATES; matches the mmc
transaction rate from earlier perf
```
5.6 k workqueue dispatches per second is the throughput floor — not a
specific lock, not WSM-command rate, not decrypt-state. A surgical fix
to any single function won't move the floor; the architecture needs
to be restructured to amortise SDIO transactions across fewer work-
items (or move SDIO RX out of the workqueue entirely).
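A rough worked example of why batching is the lever. It assumes, purely for
illustration, that every counted work item carries RX payload and that the
563 KB/s receive rate recorded under Bug #5 applies to the same load; the
capture establishes neither. 1 KB is taken as 1000 bytes and the 8x batching
factor is an arbitrary example value.

```python
DISPATCHES_PER_SEC = 5_643         # workqueue_execute_start rate from the capture
OBSERVED_RX_BPS = 563 * 1000       # 563 KB/s from Bug #5 (assumed same load)
TARGET_RX_BPS = 2 * 1000 * 1000    # Phase 1 metric: 2 MB/s sustained


def bytes_per_dispatch(rx_bps: float, dispatches_per_sec: float) -> float:
    """Average payload per work item if all items carried RX data."""
    return rx_bps / dispatches_per_sec


def dispatch_rate_needed(target_bps: float, payload_per_dispatch: float) -> float:
    """Work items per second required at a given payload per item."""
    return target_bps / payload_per_dispatch


per_item = bytes_per_dispatch(OBSERVED_RX_BPS, DISPATCHES_PER_SEC)
print(f"{per_item:.0f} B per work item at the observed rate")
print(f"{dispatch_rate_needed(TARGET_RX_BPS, per_item):.0f} items/s for 2 MB/s, unbatched")
print(f"{dispatch_rate_needed(TARGET_RX_BPS, 8 * per_item):.0f} items/s with 8x batching")
```

Under those assumptions each work item moves on the order of 100 bytes, so
reaching the Phase 1 target without batching would need several times the
dispatch rate that already dominates the trace.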
This is where the **Claude Sonnet architect review** belongs: a
top-to-bottom assessment of `~/src/besser/bes2600-dkms-mobian/bes2600/`
focused on:
- the workqueue dispatch shape (most actionable)
- needless lock proliferation (the original signal)
- BH / RX scheduling boundaries
- error-handling coverage and dead-code from the cw1200 ancestor
- API contract violations relative to mainline mac80211
Output: ranked list of restructuring targets, with predicted-delta
estimates against the Phase 1 metric (≥ 2 MB/s sustained @ 4 MB/s cap,
< 10 % CPU in lock-cycling, no link cascade in 30 min).
**Status**: now blocking on Bug #5 (was independent track). Surgical
patches B5-1, B5-2, B5-3 from the original Phase 4 candidate list are
all DEFERRED until the architect review's restructuring map is in.
## Bug #5 — RX path degrades under attempted-throughput pressure
**Suspect file**: bes2600 RX path (`txrx.c bes2600_rx_cb`, `bh.c bes2600_bh_work`,
SDIO RX scheduling) — pinpoint pending.
**Symptom (observed 2026-05-07 13:43, srcversion `1B3B3ED0` = c-stack +
Patch A + Patch B, ohm @ -57 dBm 2.4GHz ch11 5b:32, idle save for the
netcat load):**
```
sender cap 1 MB/s → ohm receives 1015 KB/s, signal -57 dBm, RX MCS 4
sender cap 4 MB/s → ohm receives 563 KB/s, signal -67 dBm, RX MCS 3
(Send-Q on boltzmann backed up to 1.16 MB)
```
Pushing the sender-side cap from 1 MB/s to 4 MB/s **decreased** observed
throughput at the receiver and degraded the link metrics. Signal dropped
~10 dB and the chip downshifted MCS, suggesting the chip can't sustain
the higher RX rate even with the link physically capable of more (link
bitrate 65 Mb/s = ~8 MB/s theoretical).
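To put the two measurements against the link rate, a small sketch of the
arithmetic. It takes the 65 Mb/s bitrate from the note as-is, ignores all
protocol overhead, and treats 1 KB as 1000 bytes; the helper names are
illustrative only.

```python
LINK_BITRATE_MBPS = 65  # reported link bitrate, Mb/s


def link_capacity_kb_per_s(bitrate_mbps: float) -> float:
    """Convert a bitrate in Mb/s to KB/s (1 KB = 1000 B), no overhead."""
    return bitrate_mbps * 1_000_000 / 8 / 1000


def utilisation(rx_kb_per_s: float, bitrate_mbps: float = LINK_BITRATE_MBPS) -> float:
    """Fraction of the theoretical link rate actually received."""
    return rx_kb_per_s / link_capacity_kb_per_s(bitrate_mbps)


for cap, rx in (("1 MB/s", 1015), ("4 MB/s", 563)):
    print(f"sender cap {cap}: {rx} KB/s = {utilisation(rx):.1%} of link rate")
```

Both runs sit far below the theoretical rate, and the higher sender cap lands
at roughly half the utilisation of the lower one.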
**Hypothesis (Markus, 2026-05-07): driver/firmware locks itself to death
under busy reads** — possibly a busy-wait loop or lock contention on the
RX SDIO path that prevents draining at line rate. Plausible reason it
didn't surface for the c-stack tasks: those operated at typical
browse-rate traffic, well below the saturation threshold this bug needs
to fire.
**May explain**: original Phase-0 observation that **YouTube DASH chunks
drop ~10 frames per chunk fetch** on hardware-decoder playback. A chunk
fetch is a brief burst at near-link-rate; if the driver throttles itself
down during high-RX, the player buffer underruns for the duration of
the fetch.
**How to drill (when prioritized)**:
- Capture `trace_pipe` with `mmc:*` and `sdio*` events enabled during a
controlled rate-ramp (e.g., `pv -L` at 500K, 1M, 2M, 4M, each for 60 s).
- Watch `/proc/sys/kernel/sched_*` and the `bes2600_bh_work` kworker for
CPU saturation.
- `perf top -p $(pgrep -f bes_sdio)` during 4 MB/s load.
**Status**: backlog. No patch yet.
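For the trace capture step, a minimal sketch of reducing saved trace text to
per-event rates like those quoted in the architect review. It assumes the
default trace-event text layout (`<timestamp>: <event_name>: <args>`); lines
in other layouts, such as function-tracer output, are skipped. The function
name and regex are illustrative, not part of any existing tooling.

```python
import re
from collections import Counter

# Assumed layout of a trace line: "... <seconds>.<micros>: <event_name>: <args>"
LINE_RE = re.compile(r"\s(\d+\.\d+):\s+(\w+):")


def event_rates(lines):
    """Return {event_name: events per second} over the span of the capture."""
    counts = Counter()
    first = last = None
    for line in lines:
        m = LINE_RE.search(line)
        if not m:
            continue  # skip headers and lines in other formats
        ts = float(m.group(1))
        first = ts if first is None else first
        last = ts
        counts[m.group(2)] += 1
    if not counts or last <= first:
        return {}  # nothing matched, or zero-length capture
    span = last - first
    return {name: n / span for name, n in counts.items()}
```

Usage would be along the lines of `event_rates(open("trace.txt"))`, where
`trace.txt` is a hypothetical file saved from `trace_pipe` during one step of
the rate-ramp; comparing the dictionaries across steps shows which event rate
scales with offered load.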
## Bug #4 — scan_complete_cb constant loop
**File**: `scan.c:883-909`, `bes2600_scan_complete_cb()`.
**Symptom**:
```
ieee80211 phy0: bes2600_scan_complete_cb status: 0
```
Fires every 2-10 s (status=0 = success, but the FREQUENCY suggests background
scanning runs continuously when associated + idle).
Most likely a NetworkManager scheduling artifact, not a driver bug. Low
priority; suppress the wiphy_dbg print or skip scan-on-assoc'd if it
matters.