besser/notes/phase5-2026-05-06.md
test0r 1a21212744 notes: phase 5 review artifact for BES2600 wifi-stability campaign
Captures Phase 0-3 receipts as of 2026-05-06: three Pattern-P1 events
reproduced (07:13, 11:03, yesterday 22:33), decrypt-failure metric locked
as Phase 1 with source pins (txrx.c:1696, wsm.h:620, wsm.c:1484), rig built
(snap loop + tcpdump filtered ring + iw event + dynamic_debug + netcat 1MB/s),
idle-vs-load comparison shows 35x burst-rate elevation under load with
conditional-escalation flip (100% idle / 0% load).

Pending Phase 5 second-model review before Phase 4 plan.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-06 15:23:24 +02:00

# BES2600 WiFi-stability campaign — Phase 5 review artifact
Date assembled: 2026-05-06 (rig started 06:59 CEST)
Run dir: /root/bes2600-samples/run-20260506-0659-fresh/ on ohm
Module under test: bes2600.ko srcversion 461AFB369355AE598D79BDF (c-final + c5.2.1)
This is the Phase 5 hand-off artifact. Per project CLAUDE.md: paste verbatim,
do not curate. Anomalies, contradictions and small-N caveats are stated as-is.
---
## Phase 0 — Substrate / Motivation / Inventory
### Triggering observation (user-reported, 2026-05-06)
1. After hours of operation, near-100% WiFi quality eventually drops association.
2. YouTube with hardware decoder shows ~10 dropped frames per DASH chunk fetch.
### Hardware-in-the-loop
- ohm = PineTab2 (RK3566 SoC, BES2600 over SDIO)
- Reachable as ohm.fritz.box (192.168.88.168 on home AP, 10.141.179.63 on fallback)
- Kernel: 6.19.10-danctnix1-1-pinetab2
- bes2600 module: srcversion 461AFB369355AE598D79BDF (c5.1+c5.1.1+c5.2+c6.1+c6.2+c7+c5.2.1)
- bes2600_btuart: srcversion 8FF920B9C068EA2E7DB9BA8 (unchanged)
### What is measurable
- journald (kernel + NetworkManager + wpa_supplicant)
- bes2600 dynamic_debug: 13 callsites enabled with +pmf flag
- iw event -t -f (cfg80211 events with reason codes)
- tcpdump on wlan0 (managed mode — sees data + EAPOL + ARP/ICMP, NOT raw 802.11 mgmt)
- per-60s snapshot loop: iw link, station dump, /proc/net/wireless, /sys/class/net/wlan0/statistics
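
A minimal sketch of how the `/proc/net/wireless` portion of each snapshot could be turned into numbers, assuming the standard wireless-extensions column layout (status, link, level, noise, then the discarded-packet counters nwid, crypt, frag, retry, misc, and missed beacons). The function name and the sample text are illustrative and not taken from the rig. Whether bes2600 actually populates the `crypt` counter has not been checked here.

```python
# Column names follow the usual /proc/net/wireless header (assumed layout).
FIELDS = ("status", "link", "level", "noise",
          "nwid", "crypt", "frag", "retry", "misc", "beacon")

def parse_proc_net_wireless(text):
    """Return {interface: {field: value}} for each interface row."""
    result = {}
    for line in text.splitlines()[2:]:          # skip the two header lines
        if ":" not in line:
            continue
        iface, rest = line.split(":", 1)
        values = rest.split()
        if len(values) < len(FIELDS):
            continue                            # truncated row, ignore it
        row = {"status": values[0]}             # status stays a hex string
        for name, raw in zip(FIELDS[1:], values[1:]):
            # Quality values are printed with a trailing dot, e.g. "70."
            row[name] = int(float(raw.rstrip(".")))
        result[iface.strip()] = row
    return result

# Made-up sample row, for illustration only (not a rig capture).
SAMPLE = (
    "Inter-| sta-|   Quality        |   Discarded packets               | Missed | WE\n"
    " face | tus | link level noise |  nwid  crypt   frag  retry   misc | beacon | 22\n"
    " wlan0: 0000   70.  -40.  -256        0      3      0      0      0        0\n"
)
parsed = parse_proc_net_wireless(SAMPLE)
```

Differencing the `crypt` value between consecutive 60s snapshots would give a second, driver-independent view of decrypt failures, if the driver reports it.
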
### Predecessor anchor
None — campaign re-anchored from in-session reps. User pre-rig claims partially replicated:
- A5 (c7 latch trip count rises over session) — does NOT replicate. 0 hits in 12h boot for "PSM not honored" / "pm_unsupported" / "switching to skip".
- A2 (~4h to drop) — does NOT cleanly replicate as a periodicity; intervals vary minutes to hours.
### Receipt checklist (Phase 0)
- [x] Predecessor data treated as reference, not anchor — A5 explicitly falsified above
- [ ] In-session baseline rep N=3: NOT yet — N=3 events observed across multiple boots, only N=1 idle bin and N=12 load bursts in current boot
---
## Phase 1 — Goal formulation (locked 2026-05-06 11:56 CEST)
### Measurable target
> Quantify the rate of WSM_STATUS_DECRYPTFAILURE bursts (≥4 events within 60s) per hour of operation, AND the conditional probability of an AP-side unprotected-deauth-reason-6 within 30s of such a burst.
Locked artifact: /root/bes2600-samples/run-20260506-0659-fresh/PHASE1.md
Journal marker: 2026-05-06T11:56:14+02:00 — bes2600-test PHASE1_LOCKED
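
The metric leaves some details open, so the following is only a sketch under stated assumptions: a burst is anchored at its first event and absorbs everything within 60s of it, bursts do not overlap, and "within 30s of such a burst" is read as within 30s after the burst's last event. The function names are hypothetical. The input timestamps are those of the 11:03 event listed under Phase 3.

```python
from bisect import bisect_left

def hms(text):
    """Convert 'HH:MM:SS' to seconds since midnight."""
    h, m, s = (int(part) for part in text.split(":"))
    return h * 3600 + m * 60 + s

def find_bursts(times, min_events=4, window=60):
    """Group event times into non-overlapping bursts.

    Returns a list of (first_event, last_event, event_count) tuples.
    """
    times = sorted(times)
    bursts = []
    i = 0
    while i < len(times):
        j = i
        while j < len(times) and times[j] - times[i] <= window:
            j += 1
        if j - i >= min_events:
            bursts.append((times[i], times[j - 1], j - i))
            i = j           # events inside a burst are not reused
        else:
            i += 1          # too sparse, try anchoring at the next event
    return bursts

def escalation_fraction(bursts, deauth_times, horizon=30):
    """Count bursts followed by a deauth no later than `horizon` s after their last event."""
    deauths = sorted(deauth_times)
    hits = 0
    for first, last, _count in bursts:
        k = bisect_left(deauths, first)
        if k < len(deauths) and deauths[k] <= last + horizon:
            hits += 1
    return hits, len(bursts)

# 11:03 event: 6 failures at 11:03:10, 1 at 11:03:11, 2 at 11:03:19,
# AP deauth reason 6 seen at 11:03:21.
failures = [hms("11:03:10")] * 6 + [hms("11:03:11")] + [hms("11:03:19")] * 2
bursts = find_bursts(failures)
hits, total = escalation_fraction(bursts, [hms("11:03:21")])
```

Other readings of the 30s rule (for example, measured from the burst start) would change the escalation count for long bursts, so the chosen reading should be stated alongside any reported number.
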
### Why this metric (not the original "assoc up/down per hour")
Source pin: bes2600/txrx.c:1696 — bes_warn for [RX] Receive failure.
The status field comes from the firmware's WSM RX-indication, parsed in
wsm.c:1484 wsm_receive_indication via WSM_GET32(buf).
Status code 4 = WSM_STATUS_DECRYPTFAILURE per wsm.h:620.
Two observed Pattern-P1 events on 2026-05-06 chained:
1. decrypt-failure storm
2. AP unprotected-deauth-6 ("Class 2 frame received from non-authenticated station")
3. kernel local PREV_AUTH_NOT_VALID
4. reauth-stall on same channel; recovery via different SSID/channel
A third P1 event (yesterday 2026-05-05 22:33) had ZERO decrypt-failures preceding — different trigger (post-resume). The metric will discriminate.
---
## Phase 2 — Situation Analysis
### Rig built (live as of artifact assembly)
- snap loop PID 5712, ~6h25m elapsed to snapshots/snap.log (60s cadence)
- tcpdump filtered ring PID 10174, ~4h20m elapsed to tcpdump/cap.pcap files
- iw event PID 9852, ~4h20m elapsed to iw-event.log
- dynamic_debug bes2600 13 callsites enabled
- nc listener loop PID 17037 wrapper, 17039 active listener, port 12345
### tcpdump filter applied
`arp || icmp || icmp6 || ether proto 0x888e || port 67 || port 68 || port 53 || port 5353 || port 546 || port 547 || (tcp[tcpflags] & (tcp-syn|tcp-fin|tcp-rst) != 0)`
(So we capture ARP, ICMP, EAPOL, DHCP4/6, DNS, mDNS, and TCP control flags. Bulk data dropped.)
### Anti-theatre receipts verified
- dynamic_debug honored: 13 bes2600 callsites flipped to +p
- journald persistent: /var/log/journal exists, ~376 MB
- snap loop ticks accumulating
- iw event capturing
- tcpdump rotating
- loopback self-test of nc listener: 2 MB through, OK
### Known limits of the rig
- Monitor mode NOT available concurrent with managed (per iw phy phy0 info valid interface combinations). Raw 802.11 mgmt frames are invisible to tcpdump.
- iw event partially compensates: deauth/auth frame headers are visible there with reason codes.
- ftrace not enabled — would add bottom-half scheduling latency data; deferred.
### Receipt checklist (Phase 2)
- [x] Re-read CLAUDE.md
- [x] Re-read relevant memory entries
- [x] Verified ohm reachable
- [x] Verified dynamic_debug honored for bes2600
- [x] N/A: hardware UART not used
---
## Phase 3 — Baseline measurements (the wall)
### Three Pattern-P1 events captured
#### Event 1 — 2026-05-06 07:13 (varied use, no suspend, on 2.4 GHz)
```
07:13:16 bes2600_wlan: [RX] Receive failure: 4 (×6 in 1s)
... 77 events total over 24 seconds ...
07:13:41 iw-event: AP→ohm unprotected deauth reason 6
("Class 2 frame received from non-authenticated station")
07:13:41 kernel: wlan0: deauthenticating from 5b:32 reason 2 = PREV_AUTH_NOT_VALID
07:13:42 kernel: wlan0: associated → 5b:33 (newton 5 GHz)
```
Recovery: 1 second, cross-band.
#### Event 2 — 2026-05-06 11:03 (idle on newton 2.4 GHz)
```
11:03:10 bes2600_wlan: [RX] Receive failure: 4 (×6 in 1s)
11:03:11 bes2600_wlan: [RX] Receive failure: 4
11:03:19 bes2600_wlan: [RX] Receive failure: 4 (×2 in 1s)
11:03:21 iw-event: AP→ohm unprotected deauth reason 6
11:03:22 kernel: wlan0: deauthenticating from 5b:32 reason 2 = PREV_AUTH_NOT_VALID
11:03:22 → 11:05:11 9 auth attempts on 3 newton BSSIDs ALL TIMED OUT
(iw-event shows AP returned auth status 0 Successful twice
but assoc step never completed)
11:05:11 kernel: wlan0: associated → 4e:64:5c:d8:11:62 (dingdongkingkong, ch 13)
full handshake auth+assoc+connect in 110ms
```
Recovery: 109 seconds, cross-SSID + cross-channel only.
EAPOL frames in entire 11:03:00 → 11:05:30 window: 0 (4WHS never attempted).
Inbound packets to .168 in that window: 0.
#### Event 3 — 2026-05-05 22:33 (post-resume from lid-close)
```
22:31:22 NM: state activated → deactivating (reason 'sleeping')
22:31:22 kernel: wlan0: deauthenticating from 5b:32 reason 3 = DEAUTH_LEAVING
22:31:27 kernel: PM: suspend entry (deep)
22:31:31 kernel: PM: suspend exit (4s suspend)
22:31:35 kernel: wlan0: associated → 5b:32
[97 seconds of normal operation]
22:33:12 kernel: wlan0: deauthenticating from 5b:32 reason 2 = PREV_AUTH_NOT_VALID
22:33:13 kernel: wlan0: associated → 5b:33 (newton 5 GHz)
```
ZERO [RX] Receive failure events in the 22:31:35 → 22:33:12 window. Different trigger path. Recovery: 1s, cross-band.
### Comparison table (idle vs load)
| Period | Duration | Decrypt-fails | Rate/hr | Bursts (≥4 in 60s) | Burst rate/hr | Escalations to AP-deauth-6 |
|---|---|---|---|---|---|---|
| 06:39-11:03 idle/varied | 4h24m | 8 | 1.8 | 1 | 0.23 | 1/1 = 100% |
| 12:00-13:28 sustained 1MB/s | 1h28m | 104 | 70.9 | 12 | 8.18 | 0/12 = 0% |
Burst-rate elevation under load: ~35x.
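
The derived columns can be recomputed from the raw counts and durations in the table. The sketch below does that. The dictionary layout and names are illustrative only. The unrounded burst-rate ratio comes out at 36, while dividing the rounded table values (8.18 / 0.23) gives about 35.6, which is where the ~35x figure sits.

```python
PERIODS = {
    "idle/varied":     {"duration": (4, 24), "decrypt_fails": 8,
                        "bursts": 1,  "escalations": 1},
    "sustained 1MB/s": {"duration": (1, 28), "decrypt_fails": 104,
                        "bursts": 12, "escalations": 0},
}

def summarize(period):
    """Derive per-hour rates and the escalation fraction for one period."""
    hours, minutes = period["duration"]
    span = hours + minutes / 60.0
    return {
        "fail_rate": period["decrypt_fails"] / span,
        "burst_rate": period["bursts"] / span,
        "escalation": period["escalations"] / period["bursts"],
    }

idle = summarize(PERIODS["idle/varied"])
load = summarize(PERIODS["sustained 1MB/s"])
elevation = load["burst_rate"] / idle["burst_rate"]

for name, row in (("idle", idle), ("load", load)):
    print(f"{name}: {row['fail_rate']:.1f} fails/h, "
          f"{row['burst_rate']:.2f} bursts/h, "
          f"{row['escalation']:.0%} escalated")
print(f"burst-rate elevation: {elevation:.1f}x")
```
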
### List of 12 load-period bursts (start time, count within 60s)
```
12:06:01 9 events
12:17:57 6 events
12:19:37 4 events
12:20:55 5 events
12:22:17 5 events
12:25:58 4 events
12:30:02 7 events
12:32:09 9 events
13:14:41 8 events
13:17:22 4 events
13:22:53 5 events
13:28:16 8 events
```
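
As a quick cross-check of the list above, the sketch below sums the per-burst counts and computes the spacing between burst starts. It assumes each count covers only the 60s after the listed start time. Variable names are illustrative.

```python
BURSTS = [
    ("12:06:01", 9), ("12:17:57", 6), ("12:19:37", 4), ("12:20:55", 5),
    ("12:22:17", 5), ("12:25:58", 4), ("12:30:02", 7), ("12:32:09", 9),
    ("13:14:41", 8), ("13:17:22", 4), ("13:22:53", 5), ("13:28:16", 8),
]

def to_seconds(text):
    """Convert 'HH:MM:SS' to seconds since midnight."""
    h, m, s = (int(part) for part in text.split(":"))
    return h * 3600 + m * 60 + s

starts = [to_seconds(start) for start, _count in BURSTS]
gaps = [later - earlier for earlier, later in zip(starts, starts[1:])]

events_in_bursts = sum(count for _start, count in BURSTS)
shortest_gap = min(gaps)   # seconds between consecutive burst starts
longest_gap = max(gaps)

print(f"{len(BURSTS)} bursts, {events_in_bursts} events inside bursts")
print(f"shortest gap {shortest_gap}s, longest gap {longest_gap}s")
```

Under that assumption, 74 of the 104 load-period decrypt failures fall inside bursts. The shortest start-to-start gap is 78s, so no two 60s windows overlap. The longest gap is 2552s (12:32:09 to 13:14:41), a quiet stretch of about 42 minutes inside the load period.
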
### Open contradictions / things the loop has NOT yet resolved
1. Event-3 (post-resume) had no decrypt-failures yet still ended in PREV_AUTH_NOT_VALID. There is a second P1 trigger path we have not pinned a mechanism for.
2. Conditional probability flips between idle (100% of 1 burst escalates) and load (0% of 12 bursts escalate). Hypothesis: AP "Class 2 from unauth STA" heuristic is silenced by sustained host TX. NOT yet verified by AP-side capture (no AP logs available).
3. Cause of decrypt failures themselves remains hypothetical: PTK or GTK drift, or replay-counter mismatch. EAPOL group-rekey frames were NOT captured before either P1 event; needed wider tcpdump window.
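4. Newton AP rejected an association temporarily ("comeback duration 1000 TU" at 12:03:23). That message normally belongs to the 802.11w association-comeback (SA Query) mechanism rather than 802.11v BSS-load handling, so the 802.11v attribution is unverified. Reason 9 (STA_REQ_ASSOC_WITHOUT_AUTH) at 12:03:24 shows AP-side state churn even within 1s of auth-success. May or may not be relevant to the P1 blackhole.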
### Source citations (Phase 6 contract pins, recorded for review reference)
- bes2600/txrx.c:1696 — bes_warn for [RX] Receive failure
- bes2600/wsm.h:620 — WSM_STATUS_DECRYPTFAILURE = 4
- bes2600/wsm.c:1484 — wsm_receive_indication (parses status from firmware)
### Receipt checklist (Phase 3)
- [x] Trace files / dmesg / regdump pasted verbatim above
- [x] Raw before derived (event listings precede rates)
- [x] Rig failure findings honestly recorded (monitor mode unavailable, ftrace deferred, no AP logs)
---
## What we are explicitly NOT yet at
- Phase 4 plan: not started. Pending review.
- Phase 6 implementation: not started. Pending plan + review.
## Asks of the reviewer
1. Is the Phase 1 metric the right discriminator? In particular, does the conditional-probability column (escalation given burst) capture what we want, or should the metric also count "AP-deauth-6 with no preceding burst" (the Event-3 path)?
2. The escalation-rate flip (100% idle, 0% load) is at N=1 idle vs N=12 load. Is N=1 idle adequate to report this finding, or do we need ≥3 idle bursts before locking?
3. Anything in Phase 3 above flagged as "not yet measured" that would be cheap to add before Phase 4? Specifically:
- Should ftrace mac80211/cfg80211 events be enabled before next rep? Cost: ~10x journal volume. Benefit: bottom-half timing for the 100ms-scale stalls in the 109s blackhole.
- Should tcpdump filter widen to include EAPOL frames captured in a moving 5-min window before each P1 event so we see the group-rekey directly?
4. Is the lack of AP-side capture (no Fritz!Box logs) a blocking gap, or can the campaign proceed without it for now?
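
For ask 2, one way to gauge how much weight N=1 idle vs N=12 load can carry is a two-sided Fisher exact test on the 2x2 table of escalated vs non-escalated bursts. This is a sketch only: the choice of test is an assumption for illustration and not part of the locked Phase 1 metric, and the function name is hypothetical.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same
    margins that are no more likely than the observed one.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    total = row1 + row2

    def prob(x):
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    observed = prob(a)
    low = max(0, col1 - row2)
    high = min(row1, col1)
    return sum(prob(x) for x in range(low, high + 1)
               if prob(x) <= observed * (1 + 1e-9))

# Rows: idle, load. Columns: escalated to AP-deauth-6, did not escalate.
p_value = fisher_exact_two_sided(1, 0, 0, 12)
print(f"two-sided p = {p_value:.3f}")
```

With a single escalation among 13 bursts, the chance that it lands in the one idle burst is 1/13, so p is about 0.077. On these counts alone the flip does not reach the conventional 0.05 level, which supports collecting more idle bursts before locking the finding.
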