besser/notes/observed-bugs.md
claude-noether 458ad36f8b notes: backlog Bug #5 — RX path degrades under throughput pressure
Observed 2026-05-07: bumping the netcat sender from 1 MB/s to 4 MB/s
DECREASED ohm's observed RX rate (1015 KB/s → 563 KB/s) and degraded
the link (signal -57 → -67 dBm, MCS 4 → 3). Chip can't sustain near-
link-rate RX even though theoretical capacity is ~8 MB/s.

Hypothesis: driver/firmware lock contention or busy-wait on the RX
SDIO path. Plausibly explains the original Phase-0 observation that
YouTube DASH chunks drop ~10 frames per chunk fetch — chunk fetch is
a brief near-line-rate burst that this bug would be triggered by.
2026-05-07 13:56:36 +02:00

# Observed BES2600 driver bugs on PineTab2 (ohm)
Compiled from on-device dmesg + Pine64 wiki + community reports. Cross-references the patch series.
## Bug #1 — factory.txt path mismatch + filp_open antipattern (FIXED in c1)
**File**: `bes2600_factory.c:148-170` (read), `:188-200` (create)
**Symptom (pre-fix)**:
```
(NULL device *): read and check /lib/firmware/bes2600_factory.txt error
```
**Root cause**: hardcoded `FACTORY_PATH=/lib/firmware/bes2600_factory.txt` Makefile macro;
real file ships at `/lib/firmware/bes2600/bes2600_factory.txt`. Worse, the read uses
`filp_open` + `kernel_read` directly, bypassing the firmware-class infrastructure.
**Fix**: c1 patch — `request_firmware()` for the read path, repointed Makefile macro
to firmware-class name `bes2600/bes2600_factory.txt`.
## Bug #1.5 — factory.txt parse failure (NEW, c5 to investigate)
**File**: `bes2600_factory.c factory_parse()`
**Symptom (post-c1)**:
```
bes2600_factory.txt parse fail
read and check bes2600/bes2600_factory.txt error
factory cali data get failed.
```
**How discovered**: c1 fix exposed a deeper bug — `factory_parse()` chokes on the data
that `request_firmware()` now successfully returns. The original bug masked this
because the read always failed first.
**Hypotheses**: null-termination assumption mismatch (`request_firmware` doesn't
null-terminate), `FACTORY_MEMBER_NUM=30/31` count discrepancy, kmalloc not
zero-initialized, parser strict on trailing `%%\n` delimiter.
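
Two of these hypotheses can be checked against the file itself before touching the parser. A minimal sketch, assuming the blob on disk is byte-for-byte what `request_firmware()` returns; the helper name `inspect_factory_blob` and the reported fields are illustrative, not part of the driver.

```python
from pathlib import Path

FACTORY_PATH = Path("/lib/firmware/bes2600/bes2600_factory.txt")  # path from Bug #1


def inspect_factory_blob(data: bytes) -> dict:
    """Report the properties the parse-failure hypotheses depend on."""
    return {
        "size": len(data),
        # The firmware blob is just these bytes plus a size; no NUL is appended.
        "contains_nul": b"\x00" in data,
        # Hypothesis: the parser insists on a trailing "%%\n" delimiter.
        "ends_with_delimiter": data.endswith(b"%%\n"),
        # Last few bytes, for eyeballing stray whitespace or line endings.
        "tail": data[-8:],
    }


if __name__ == "__main__" and FACTORY_PATH.is_file():
    print(inspect_factory_blob(FACTORY_PATH.read_bytes()))
```

If `contains_nul` is false, any string routine in `factory_parse()` that expects a terminator would need a length-bounded, explicitly terminated copy of the data.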
**Status**: investigation pending (task c5). Driver falls back to defaults; WiFi
functional but TX power is uncalibrated (all channels at 0x1400).
## Bug #2 — PM low-power handshake timeout (recurring)
**File**: `bes_pwr.c:470-558`, `bes2600_pwr_enter_lp_mode()`. Error at line 538.
**Symptom**:
```
bes2600_wlan mmc2:0001:1: bes2600_pwr_enter_lp_mode, wait pm ind timeout
```
Fires every 5-10 s in steady state while associated. It floods dmesg and likely
correlates with Bug #3 (SDIO TX stack splat) and poor battery life.
**Root cause**: `wait_for_completion_timeout(&pm_enter_cmpl, 5*HZ)` waits
for firmware to acknowledge a PM mode change; firmware never sends ACK.
Driver proceeds to `bes2600_pwr_device_enter_lp_mode()` regardless.
**Mobian == danctnix**: identical bes_pwr.c (1447 lines, 0-hunk diff). No
upstream fix exists; we'd invent it (gate device-LP entry on completion +
add retry).
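
The proposed gating can be modelled outside the kernel to pin down the intended control flow. This is a Python model, not driver code: the three callables are hypothetical stand-ins for sending the PM request, for `wait_for_completion_timeout()`, and for `bes2600_pwr_device_enter_lp_mode()`. Re-sending the request on each retry and the retry count of 2 are assumptions.

```python
from typing import Callable


def enter_lp_mode(request_lp: Callable[[], None],
                  wait_for_ack: Callable[[float], bool],
                  enter_device_lp: Callable[[], None],
                  retries: int = 2,
                  timeout_s: float = 5.0) -> bool:
    """Enter device low-power mode only after the firmware acknowledges.

    Returns True if the device was put into LP mode, False if every
    attempt timed out (in which case the device is left untouched).
    """
    for _attempt in range(1 + retries):
        request_lp()
        if wait_for_ack(timeout_s):
            # Gate: only reached when the ACK arrived within the timeout.
            enter_device_lp()
            return True
    return False
```

The difference from the current behaviour is the `False` path: on timeout the device-side LP entry is skipped rather than performed regardless.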
**Status**: task c2.
## Bug #3 — SDIO TX scatter-gather panic / WARN
**File**: `bes2600_sdio.c:952-1200`, `bes_sdio_memcpy_to_io_helper`,
`sdio_tx_work`.
**Symptom**:
```
[RX] Receive failure: 4.
bes_sdio_memcpy_to_io_helper+0x18c/0x288 [bes2600]
sdio_tx_work+0x2b4/0x4a0 [bes2600]
Workqueue: bes_sdio sdio_tx_work [bes2600]
```
Recurring under TX load. Can wedge the chip irrecoverably (per Pine64 wiki:
"Power/reset circuitry not properly implemented; hard reset impossible
without board power-cycle").
**Status**: task c3 (indirectly, via bes_chardev removal which currently
gates the signal/nosignal mode switch path).
## Bug #5 — RX path degrades under attempted-throughput pressure
**Suspect file**: bes2600 RX path (`txrx.c bes2600_rx_cb`, `bh.c bes2600_bh_work`,
SDIO RX scheduling) — pinpoint pending.
**Symptom (observed 2026-05-07 13:43, srcversion `1B3B3ED0` = c-stack +
Patch A + Patch B, ohm @ -57 dBm 2.4GHz ch11 5b:32, idle save for the
netcat load):**
```
sender cap 1 MB/s → ohm receives 1015 KB/s, signal -57 dBm, RX MCS 4
sender cap 4 MB/s → ohm receives 563 KB/s, signal -67 dBm, RX MCS 3
(Send-Q on boltzmann backed up to 1.16 MB)
```
Pushing the sender-side cap from 1 MB/s to 4 MB/s **decreased** observed
throughput at the receiver and degraded the link metrics. Signal dropped
~10 dB and the chip downshifted MCS, suggesting the chip can't sustain
the higher RX rate even with the link physically capable of more (link
bitrate 65 Mb/s = ~8 MB/s theoretical).
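
The capacity arithmetic above, as a quick sanity check. A minimal sketch assuming decimal units (1 KB = 1000 B, 1 MB = 1000 KB) and treating the 65 Mb/s PHY bitrate as an upper bound; achievable goodput sits below it because of MAC and protocol overhead. The function names are illustrative.

```python
def phy_rate_to_mbytes(mbit_per_s: float) -> float:
    """Convert a PHY bitrate in Mb/s to MB/s (8 bits per byte, decimal units)."""
    return mbit_per_s / 8.0


def utilisation(observed_kbytes_per_s: float, phy_mbit_per_s: float) -> float:
    """Observed RX throughput as a fraction of the raw PHY rate."""
    return (observed_kbytes_per_s / 1000.0) / phy_rate_to_mbytes(phy_mbit_per_s)


PHY_MBIT = 65.0  # link bitrate from the notes
# Sender cap -> KB/s received on ohm, both values from the 2026-05-07 run.
OBSERVED = {"1 MB/s": 1015.0, "4 MB/s": 563.0}

for cap, kbytes in OBSERVED.items():
    share = utilisation(kbytes, PHY_MBIT)
    print(f"cap {cap}: {kbytes:.0f} KB/s = {share:.1%} of "
          f"{phy_rate_to_mbytes(PHY_MBIT):.3f} MB/s")
```

Under these assumptions the two runs use roughly 12% and 7% of the PHY rate, so the drop at the 4 MB/s cap happens far below what the link could nominally carry.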
**Hypothesis (Markus, 2026-05-07): driver/firmware locks itself to death
under busy reads** — possibly a busy-wait loop or lock contention on the
RX SDIO path that prevents draining at line rate. Plausible reason it
didn't surface for the c-stack tasks: those operated at typical
browse-rate traffic, well below the saturation threshold this bug needs
to fire.
**May explain**: original Phase-0 observation that **YouTube DASH chunks
drop ~10 frames per chunk fetch** on hardware-decoder playback. A chunk
fetch is a brief burst at near-link-rate; if the driver throttles itself
down during high-RX, the player buffer underruns for the duration of
the fetch.
**How to drill (when prioritized)**:
- Capture trace_pipe with `mmc:*` and `sdio*` events enabled during a
controlled rate-ramp (e.g., pv -L 500K, 1M, 2M, 4M each for 60 s).
- Watch `/proc/sys/kernel/sched_*` and the `bes2600_bh_work` kworker for
CPU saturation.
- `perf top -p $(pgrep -f bes_sdio)` during 4 MB/s load.
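
The rate-ramp from the first bullet could be scripted roughly as follows. A sketch only: the listener host `ohm`, port `5001`, and the use of `timeout` to end each step are assumptions, and `find_inversions` is a hypothetical helper for flagging the symptom in the collected numbers. Note that `pv -L` treats `K`/`M` as multiples of 1024.

```python
import shlex


def ramp_commands(rates, host, port, duration_s=60):
    """Build one shell command per ramp step: pv-limited zeros piped into nc."""
    commands = []
    for rate in rates:
        # pv -L caps the transfer rate; -q suppresses the progress display.
        pipeline = (f"pv -q -L {shlex.quote(str(rate))} /dev/zero"
                    f" | nc {shlex.quote(host)} {int(port)}")
        # timeout ends each step after duration_s seconds.
        commands.append(f"timeout {int(duration_s)} sh -c {shlex.quote(pipeline)}")
    return commands


def find_inversions(samples):
    """Return adjacent step pairs where observed RX fell as offered load rose.

    samples: iterable of (offered_kb_s, observed_kb_s) tuples.
    """
    ordered = sorted(samples)
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if b[1] < a[1]]


if __name__ == "__main__":
    for cmd in ramp_commands(["500K", "1M", "2M", "4M"], host="ohm", port=5001):
        print(cmd)
    # The two data points from the notes, with nominal caps in KB/s.
    print(find_inversions([(1000, 1015), (4000, 563)]))
```

Run the generated commands on the sender while tracing on ohm; feeding the per-step receive rates into `find_inversions` marks the step at which throughput starts falling.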
**Status**: backlog. No patch yet.
## Bug #4 — scan_complete_cb constant loop
**File**: `scan.c:883-909`, `bes2600_scan_complete_cb()`.
**Symptom**:
```
ieee80211 phy0: bes2600_scan_complete_cb status: 0
```
Fires every 2-10 s (status=0 = success, but the frequency suggests background
scanning runs continuously when associated + idle).
Most likely a NetworkManager scheduling artifact, not a driver bug. Low
priority; suppress the wiphy_dbg print or skip scan-on-assoc'd if it
matters.
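
To put a number on the cadence before deciding whether it matters, the gaps between matching dmesg lines can be measured. A minimal sketch, assuming the default dmesg timestamp prefix (`[ seconds.microseconds]`); the helper name `message_intervals` is illustrative.

```python
import re

# Default dmesg prefix: "[" + optional padding + seconds.microseconds + "]".
_TIMESTAMP = re.compile(r"^\[\s*(\d+\.\d+)\]")


def message_intervals(dmesg_lines, needle):
    """Return the gaps in seconds between consecutive lines containing needle."""
    stamps = []
    for line in dmesg_lines:
        match = _TIMESTAMP.match(line)
        if match and needle in line:
            stamps.append(float(match.group(1)))
    return [round(later - earlier, 6) for earlier, later in zip(stamps, stamps[1:])]
```

Feeding it the lines of a saved `dmesg` capture with `needle="bes2600_scan_complete_cb"` gives the scan cadence; the same helper with `"wait pm ind timeout"` would measure the Bug #2 message rate.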