notes: phase 5 review artifact for BES2600 wifi-stability campaign

Captures Phase 0-3 receipts as of 2026-05-06: three Pattern-P1 events
reproduced (07:13, 11:03, yesterday 22:33), decrypt-failure metric locked
as Phase 1 with source pins (txrx.c:1696, wsm.h:620, wsm.c:1484), rig built
(snap loop + tcpdump filtered ring + iw event + dynamic_debug + netcat 1MB/s),
idle-vs-load comparison shows 35x burst-rate elevation under load with
conditional-escalation flip (100% idle / 0% load).

Pending Phase 5 second-model review before Phase 4 plan.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-06 15:23:24 +02:00
parent a6cd2a80fd
commit 1a21212744
+217
@@ -0,0 +1,217 @@
# BES2600 WiFi-stability campaign — Phase 5 review artifact
Date assembled: 2026-05-06 (rig started 06:59 CEST)
Run dir: /root/bes2600-samples/run-20260506-0659-fresh/ on ohm
Module under test: bes2600.ko srcversion 461AFB369355AE598D79BDF (c-final + c5.2.1)
This is the Phase 5 hand-off artifact. Per project CLAUDE.md: paste verbatim,
do not curate. Anomalies, contradictions and small-N caveats are stated as-is.
---
## Phase 0 — Substrate / Motivation / Inventory
### Triggering observation (user-reported, 2026-05-06)
1. After hours of operation at near-100% WiFi quality, the association eventually drops.
2. YouTube with hardware decoder shows ~10 dropped frames per DASH chunk fetch.
### Hardware-in-the-loop
- ohm = PineTab2 (RK3566 SoC, BES2600 over SDIO)
- Reachable as ohm.fritz.box (192.168.88.168 on home AP, 10.141.179.63 on fallback)
- Kernel: 6.19.10-danctnix1-1-pinetab2
- bes2600 module: srcversion 461AFB369355AE598D79BDF (c5.1+c5.1.1+c5.2+c6.1+c6.2+c7+c5.2.1)
- bes2600_btuart: srcversion 8FF920B9C068EA2E7DB9BA8 (unchanged)
### What is measurable
- journald (kernel + NetworkManager + wpa_supplicant)
- bes2600 dynamic_debug: 13 callsites enabled with +pmf flag
- iw event -t -f (cfg80211 events with reason codes)
- tcpdump on wlan0 (managed mode — sees data + EAPOL + ARP/ICMP, NOT raw 802.11 mgmt)
- per-60s snapshot loop: iw link, station dump, /proc/net/wireless, /sys/class/net/wlan0/statistics
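One tick of the snapshot loop could look roughly like the sketch below. The rig's actual script is not reproduced in this artifact, so `read_counters`, `format_snapshot` and the one-line output layout are hypothetical names chosen for illustration. Only the statistics directory is read here; the `iw` outputs would need a separate capture step.

```python
import os
import time

STATS_DIR = "/sys/class/net/wlan0/statistics"  # path taken from the list above

def read_counters(stats_dir=STATS_DIR):
    """Return {counter_name: int} for every readable numeric file in stats_dir."""
    counters = {}
    for name in sorted(os.listdir(stats_dir)):
        path = os.path.join(stats_dir, name)
        try:
            with open(path) as fh:
                counters[name] = int(fh.read().strip())
        except (OSError, ValueError):
            continue  # skip directories, unreadable files, non-numeric content
    return counters

def format_snapshot(counters, now=None):
    """One log line per tick: local timestamp followed by name=value pairs."""
    stamp = time.strftime("%Y-%m-%dT%H:%M:%S", time.localtime(now))
    body = " ".join(f"{k}={v}" for k, v in counters.items())
    return f"{stamp} {body}"

# Take a single snapshot only when the interface exists on this machine.
if os.path.isdir(STATS_DIR):
    print(format_snapshot(read_counters()))
```

Calling this every 60 s and appending the line to a log file would reproduce the cadence described above, under the stated assumptions.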
### Predecessor anchor
None — campaign re-anchored from in-session reps. User pre-rig claims partially replicated:
- A5 (c7 latch trip count rises over session) — does NOT replicate. 0 hits in 12h boot for "PSM not honored" / "pm_unsupported" / "switching to skip".
- A2 (~4h to drop) — does NOT cleanly replicate as a periodicity; intervals vary minutes to hours.
### Receipt checklist (Phase 0)
- [x] Predecessor data treated as reference, not anchor — A5 explicitly falsified above
- [ ] In-session baseline rep N=3: NOT yet — N=3 events observed across multiple boots, only N=1 idle bin and N=12 load bursts in current boot
---
## Phase 1 — Goal formulation (locked 2026-05-06 11:56 CEST)
### Measurable target
> Quantify the rate of WSM_STATUS_DECRYPTFAILURE bursts (≥4 events within 60s) per hour of operation, AND the conditional probability of an AP-side unprotected-deauth-reason-6 within 30s of such a burst.
Locked artifact: /root/bes2600-samples/run-20260506-0659-fresh/PHASE1.md
Journal marker: 2026-05-06T11:56:14+02:00 — bes2600-test PHASE1_LOCKED
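The metric can be sketched in a few lines. The locked PHASE1.md is not reproduced here, so two details are assumptions: bursts are counted as greedy, non-overlapping 60 s windows opened at the first unassigned event, and the 30 s escalation window is measured from the last event of the burst. The sample timestamps are the Event 2 lines from Phase 3.

```python
from datetime import datetime, timedelta

BURST_MIN_EVENTS = 4
BURST_WINDOW = timedelta(seconds=60)
ESCALATION_WINDOW = timedelta(seconds=30)

def find_bursts(failures):
    """Return (start, last, count) for each window holding >= 4 events.

    Assumption: a window opens at the first event not yet in a burst and
    collects every event less than 60 s after that start.
    """
    events = sorted(failures)
    bursts, i = [], 0
    while i < len(events):
        start, j = events[i], i
        while j < len(events) and events[j] - start < BURST_WINDOW:
            j += 1
        if j - i >= BURST_MIN_EVENTS:
            bursts.append((start, events[j - 1], j - i))
            i = j          # events in a burst are not reused
        else:
            i += 1         # too sparse: try the next event as a start
    return bursts

def escalation_stats(bursts, deauths):
    """Count bursts followed by a deauth no later than 30 s after their last event."""
    escalated = sum(
        any(start <= d <= last + ESCALATION_WINDOW for d in deauths)
        for start, last, _ in bursts
    )
    return escalated, len(bursts)

def ts(hms):
    return datetime.strptime("2026-05-06 " + hms, "%Y-%m-%d %H:%M:%S")

# Event 2: six failures at 11:03:10, one at 11:03:11, two at 11:03:19.
failures = [ts("11:03:10")] * 6 + [ts("11:03:11")] + [ts("11:03:19")] * 2
deauths = [ts("11:03:21")]  # AP unprotected deauth reason 6

bursts = find_bursts(failures)
escalated, total = escalation_stats(bursts, deauths)
print(f"{total} burst(s), {escalated} escalated")
```

Dividing the burst count by the observation time in hours gives the rate half of the metric; `escalated / total` gives the conditional-probability half.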
### Why this metric (not the original "assoc up/down per hour")
Source pin: bes2600/txrx.c:1696 — bes_warn for [RX] Receive failure.
The status field comes from the firmware's WSM RX-indication, parsed in
wsm.c:1484 wsm_receive_indication via WSM_GET32(buf).
Status code 4 = WSM_STATUS_DECRYPTFAILURE per wsm.h:620.
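On the host side, the same status code can be pulled back out of the journal text. This is a sketch that assumes the message keeps the `[RX] Receive failure: <decimal>` shape seen in the Phase 3 excerpts; the sample lines are illustrative, modeled on those excerpts, and `count_statuses` is a hypothetical helper name.

```python
import re
from collections import Counter

RX_FAIL = re.compile(r"\[RX\] Receive failure: (\d+)")
WSM_STATUS_DECRYPTFAILURE = 4  # value per wsm.h:620

def count_statuses(lines):
    """Tally the status code of every receive-failure line."""
    counts = Counter()
    for line in lines:
        match = RX_FAIL.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts

# Illustrative input only, not a verbatim journal dump.
sample = [
    "11:03:10 bes2600_wlan: [RX] Receive failure: 4",
    "11:03:11 bes2600_wlan: [RX] Receive failure: 4",
    "11:03:22 kernel: wlan0: deauthenticating from 5b:32 reason 2",
]
counts = count_statuses(sample)
print("decrypt failures:", counts[WSM_STATUS_DECRYPTFAILURE])
```

Keeping a per-code tally rather than a single counter means any other status value that shows up in the same log line stays visible.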
Two Pattern-P1 events observed on 2026-05-06 followed this chain:
1. decrypt-failure storm
2. AP unprotected-deauth-6 ("Class 2 frame received from non-authenticated station")
3. kernel local PREV_AUTH_NOT_VALID
4. reauth-stall on same channel; recovery via different SSID/channel
A third P1 event (yesterday 2026-05-05 22:33) had ZERO decrypt-failures preceding — different trigger (post-resume). The metric will discriminate.
---
## Phase 2 — Situation Analysis
### Rig built (live as of artifact assembly)
- snap loop PID 5712, ~6h25m elapsed, writing to snapshots/snap.log (60s cadence)
- tcpdump filtered ring PID 10174, ~4h20m elapsed, writing to tcpdump/cap.pcap files
- iw event PID 9852, ~4h20m elapsed, writing to iw-event.log
- dynamic_debug bes2600 13 callsites enabled
- nc listener loop PID 17037 wrapper, 17039 active listener, port 12345
### tcpdump filter applied
`arp || icmp || icmp6 || ether proto 0x888e || port 67 || port 68 || port 53 || port 5353 || port 546 || port 547 || (tcp[tcpflags] & (tcp-syn|tcp-fin|tcp-rst) != 0)`
(So we capture ARP, ICMP/ICMPv6, EAPOL, DHCPv4/v6, DNS, mDNS, and TCP segments with SYN, FIN or RST set. Bulk data is dropped.)
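A sketch of how such a filtered ring could be assembled follows. The rig's actual command line is not recorded in this artifact, so the ring size (`file_mb`, `files`) and the helper name `tcpdump_argv` are hypothetical; `-i`, `-n`, `-C`, `-W` and `-w` are standard tcpdump options. The TCP-flag clause uses the bitmask form documented in pcap-filter.

```python
FILTER_TERMS = [
    "arp", "icmp", "icmp6",
    "ether proto 0x888e",      # EAPOL
    "port 67", "port 68",      # DHCPv4
    "port 546", "port 547",    # DHCPv6
    "port 53", "port 5353",    # DNS, mDNS
    "(tcp[tcpflags] & (tcp-syn|tcp-fin|tcp-rst) != 0)",  # TCP control segments
]

def tcpdump_argv(iface="wlan0", out="tcpdump/cap.pcap", file_mb=10, files=20):
    """Build an argv for a size-rotated capture ring (sizes are assumptions)."""
    bpf = " or ".join(FILTER_TERMS)
    return [
        "tcpdump", "-i", iface, "-n",
        "-C", str(file_mb),    # rotate after this many million bytes
        "-W", str(files),      # keep at most this many files in the ring
        "-w", out,
        bpf,
    ]

argv = tcpdump_argv()
print(" ".join(argv[:-1]), repr(argv[-1]))
```

Passing the list to `subprocess.Popen` unchanged avoids shell quoting problems with the brackets and pipes in the filter expression.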
### Anti-theatre receipts verified
- dynamic_debug honored: 13 bes2600 callsites flipped to +p
- journald persistent: /var/log/journal exists, ~376 MB
- snap loop ticks accumulating
- iw event capturing
- tcpdump rotating
- loopback self-test of nc listener: 2 MB through, OK
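The dynamic_debug receipt can be re-checked by parsing the control file. This sketch assumes the usual `filename:lineno [module]function flags format` layout of `/sys/kernel/debug/dynamic_debug/control`; the sample lines, including their paths and function names, are hypothetical stand-ins, not copies of the real file.

```python
def enabled_callsites(control_text, module="bes2600"):
    """Count callsites of `module` whose flags field has 'p' (print) set."""
    enabled = 0
    for line in control_text.splitlines():
        parts = line.split(None, 3)  # location, [module]function, flags, format
        if len(parts) < 3:
            continue
        modfunc, flags = parts[1], parts[2]
        if not modfunc.startswith(f"[{module}]"):
            continue
        if flags.startswith("=") and "p" in flags[1:]:
            enabled += 1
    return enabled

# Hypothetical control-file excerpt for illustration.
sample = """# filename:lineno [module]function flags format
example/txrx.c:10 [bes2600]rx_example =pmf "rx example"
example/wsm.c:20 [bes2600]wsm_example =_ "wsm example"
example/mlme.c:30 [mac80211]other_example =p "other example"
"""
print(enabled_callsites(sample))
```

On the device, feeding the real control file through the same function should report the 13 enabled callsites if the receipt above still holds.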
### Known limits of the rig
- Monitor mode is NOT available concurrently with managed mode (per the valid interface combinations in iw phy phy0 info). Raw 802.11 mgmt frames are invisible to tcpdump.
- iw event partially compensates: deauth/auth frame headers are visible there with reason codes.
- ftrace not enabled — would add bottom-half scheduling latency data; deferred.
### Receipt checklist (Phase 2)
- [x] Re-read CLAUDE.md
- [x] Re-read relevant memory entries
- [x] Verified ohm reachable
- [x] Verified dynamic_debug honored for bes2600
- [x] N/A: hardware UART not used
---
## Phase 3 — Baseline measurements (the wall)
### Three Pattern-P1 events captured
#### Event 1 — 2026-05-06 07:13 (varied use, no suspend, on 2.4 GHz)
```
07:13:16 bes2600_wlan: [RX] Receive failure: 4 (×6 in 1s)
... 77 events total over 24 seconds ...
07:13:41 iw-event: AP→ohm unprotected deauth reason 6
("Class 2 frame received from non-authenticated station")
07:13:41 kernel: wlan0: deauthenticating from 5b:32 reason 2 = PREV_AUTH_NOT_VALID
07:13:42 kernel: wlan0: associated → 5b:33 (newton 5 GHz)
```
Recovery: 1 second, cross-band.
#### Event 2 — 2026-05-06 11:03 (idle on newton 2.4 GHz)
```
11:03:10 bes2600_wlan: [RX] Receive failure: 4 (×6 in 1s)
11:03:11 bes2600_wlan: [RX] Receive failure: 4
11:03:19 bes2600_wlan: [RX] Receive failure: 4 (×2 in 1s)
11:03:21 iw-event: AP→ohm unprotected deauth reason 6
11:03:22 kernel: wlan0: deauthenticating from 5b:32 reason 2 = PREV_AUTH_NOT_VALID
11:03:22 → 11:05:11 9 auth attempts on 3 newton BSSIDs ALL TIMED OUT
(iw-event shows AP returned auth status 0 Successful twice
but assoc step never completed)
11:05:11 kernel: wlan0: associated → 4e:64:5c:d8:11:62 (dingdongkingkong, ch 13)
full handshake auth+assoc+connect in 110ms
```
Recovery: 109 seconds, cross-SSID + cross-channel only.
EAPOL frames in entire 11:03:00 → 11:05:30 window: 0 (4WHS never attempted).
Inbound packets to .168 in that window: 0.
#### Event 3 — 2026-05-05 22:33 (post-resume from lid-close)
```
22:31:22 NM: state activated → deactivating (reason 'sleeping')
22:31:22 kernel: wlan0: deauthenticating from 5b:32 reason 3 = DEAUTH_LEAVING
22:31:27 kernel: PM: suspend entry (deep)
22:31:31 kernel: PM: suspend exit (4s suspend)
22:31:35 kernel: wlan0: associated → 5b:32
[97 seconds of normal operation]
22:33:12 kernel: wlan0: deauthenticating from 5b:32 reason 2 = PREV_AUTH_NOT_VALID
22:33:13 kernel: wlan0: associated → 5b:33 (newton 5 GHz)
```
ZERO [RX] Receive failure events in the 22:31:35 → 22:33:12 window. Different trigger path. Recovery: 1s, cross-band.
### Comparison table (idle vs load)
| Period | Duration | Decrypt-fails | Rate/hr | Bursts (≥4 in 60s) | Burst rate/hr | Escalations to AP-deauth-6 |
|---|---|---|---|---|---|---|
| 06:39-11:03 idle/varied | 4h24m | 8 | 1.8 | 1 | 0.23 | 1/1 = 100% |
| 12:00-13:28 sustained 1MB/s | 1h28m | 104 | 70.9 | 12 | 8.18 | 0/12 = 0% |
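Burst-rate elevation under load: ~36x from the unrounded counts (12 bursts / 1.467 h vs 1 burst / 4.4 h); ~35x when computed from the rounded table rates (8.18 / 0.23).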
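The derived columns can be recomputed from the raw counts and durations in the table. This is plain arithmetic with no assumptions beyond the table values; the dictionary keys are labels chosen for the sketch.

```python
def per_hour(count, hours, minutes):
    """Convert a count over an h:m duration into a per-hour rate."""
    return count / (hours + minutes / 60)

periods = {
    "idle/varied": {"dur": (4, 24), "fails": 8, "bursts": 1},
    "load 1MB/s": {"dur": (1, 28), "fails": 104, "bursts": 12},
}

rates = {}
for name, p in periods.items():
    h, m = p["dur"]
    rates[name] = (per_hour(p["fails"], h, m), per_hour(p["bursts"], h, m))
    print(f"{name}: {rates[name][0]:.1f} fails/hr, {rates[name][1]:.2f} bursts/hr")

# Ratio of unrounded burst rates (load over idle).
elevation = rates["load 1MB/s"][1] / rates["idle/varied"][1]
print(f"burst-rate elevation: {elevation:.1f}x")
```

The unrounded ratio comes out at 36; dividing the rounded table values 8.18 and 0.23 instead gives about 35.6, which is where a ~35x figure comes from.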
### List of 12 load-period bursts (start time, count within 60s)
```
12:06:01 9 events
12:17:57 6 events
12:19:37 4 events
12:20:55 5 events
12:22:17 5 events
12:25:58 4 events
12:30:02 7 events
12:32:09 9 events
13:14:41 8 events
13:17:22 4 events
13:22:53 5 events
13:28:16 8 events
```
### Open contradictions / things the loop has NOT yet resolved
1. Event-3 (post-resume) had no decrypt-failures yet still ended in PREV_AUTH_NOT_VALID. There is a second P1 trigger path we have not pinned a mechanism for.
2. Conditional probability flips between idle (100% of 1 burst escalates) and load (0% of 12 bursts escalate). Hypothesis: AP "Class 2 from unauth STA" heuristic is silenced by sustained host TX. NOT yet verified by AP-side capture (no AP logs available).
3. Cause of decrypt failures themselves remains hypothetical: PTK or GTK drift, or replay-counter mismatch. EAPOL group-rekey frames were NOT captured before either P1 event; a wider tcpdump window is needed.
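4. Newton AP issued a temporary association rejection (saw "comeback duration 1000 TU" at 12:03:23). That comeback timer is normally part of the 802.11w (PMF) SA Query mechanism rather than an 802.11v BSS-load feature; the attribution is unverified. Reason 9 (STA_REQ_ASSOC_WITHOUT_AUTH) at 12:03:24 shows AP-side state churn even within 1s of auth-success. May or may not be relevant to the P1 blackhole.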
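For item 2, the small-N caveat can be made concrete with exact binomial bounds. The sketch below uses the closed-form Clopper-Pearson limits, which exist only for the all-or-none cases (0 of n, or n of n); the 95% level is an assumption, and `exact_interval_all_or_none` is a hypothetical helper name.

```python
def exact_interval_all_or_none(successes, n, alpha=0.05):
    """Clopper-Pearson interval for the edge cases successes == 0 or == n."""
    if n <= 0:
        raise ValueError("n must be positive")
    tail = (alpha / 2) ** (1 / n)
    if successes == n:
        return (tail, 1.0)        # lower bound only; upper is 1
    if successes == 0:
        return (0.0, 1 - tail)    # upper bound only; lower is 0
    raise ValueError("closed form covers only 0/n and n/n")

idle = exact_interval_all_or_none(1, 1)    # 1 of 1 idle burst escalated
load = exact_interval_all_or_none(0, 12)   # 0 of 12 load bursts escalated

print(f"idle escalation rate: {idle[0]:.3f} to {idle[1]:.3f}")
print(f"load escalation rate: {load[0]:.3f} to {load[1]:.3f}")
print("intervals overlap:", idle[0] <= load[1])
```

With these counts the idle interval spans almost the whole range (lower bound 0.025) while the load interval tops out at about 0.26, so the two still overlap at N=1 idle.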
### Source citations (Phase 6 contract pins, recorded for review reference)
- bes2600/txrx.c:1696 — bes_warn for [RX] Receive failure
- bes2600/wsm.h:620 — WSM_STATUS_DECRYPTFAILURE = 4
- bes2600/wsm.c:1484 — wsm_receive_indication (parses status from firmware)
### Receipt checklist (Phase 3)
- [x] Trace files / dmesg / regdump pasted verbatim above
- [x] Raw before derived (event listings precede rates)
- [x] Rig failure findings honestly recorded (monitor mode unavailable, ftrace deferred, no AP logs)
---
## What we are explicitly NOT yet at
- Phase 4 plan: not started. Pending review.
- Phase 6 implementation: not started. Pending plan + review.
## Asks of the reviewer
1. Is the Phase 1 metric the right discriminator? In particular, does the conditional-probability column (escalation given burst) capture what we want, or should the metric also count "AP-deauth-6 with no preceding burst" (the Event-3 path)?
2. The escalation-rate flip (100% idle, 0% load) is at N=1 idle vs N=12 load. Is N=1 idle adequate to report this finding, or do we need ≥3 idle bursts before locking?
3. Anything in Phase 3 above flagged as "not yet measured" that would be cheap to add before Phase 4? Specifically:
- Should ftrace mac80211/cfg80211 events be enabled before next rep? Cost: ~10x journal volume. Benefit: bottom-half timing for the 100ms-scale stalls in the 109s blackhole.
- Should the tcpdump ring retain a moving 5-min window of EAPOL frames before each P1 event (the filter already matches ether proto 0x888e) so we see the group-rekey directly?
4. Is the lack of AP-side capture (no Fritz!Box logs) a blocking gap, or can the campaign proceed without it for now?