From 1a21212744c89d4568e761bafd5157f29e980c38 Mon Sep 17 00:00:00 2001
From: Markus Fritsche
Date: Wed, 6 May 2026 15:23:24 +0200
Subject: [PATCH] notes: phase 5 review artifact for BES2600 wifi-stability campaign

Captures Phase 0-3 receipts as of 2026-05-06: three Pattern-P1 events
reproduced (07:13, 11:03, yesterday 22:33), decrypt-failure metric
locked as Phase 1 with source pins (txrx.c:1696, wsm.h:620,
wsm.c:1484), rig built (snap loop + tcpdump filtered ring + iw event +
dynamic_debug + netcat 1MB/s), idle-vs-load comparison shows 35x
burst-rate elevation under load with conditional-escalation flip
(100% idle / 0% load). Pending Phase 5 second-model review before
Phase 4 plan.

Co-Authored-By: Claude Opus 4.7 (1M context)
---
 notes/phase5-2026-05-06.md | 217 +++++++++++++++++++++++++++++++++++++
 1 file changed, 217 insertions(+)
 create mode 100644 notes/phase5-2026-05-06.md

diff --git a/notes/phase5-2026-05-06.md b/notes/phase5-2026-05-06.md
new file mode 100644
index 000000000..0419d57d3
--- /dev/null
+++ b/notes/phase5-2026-05-06.md
@@ -0,0 +1,217 @@
+# BES2600 WiFi-stability campaign — Phase 5 review artifact
+
+Date assembled: 2026-05-06 (rig started 06:59 CEST)
+Run dir: /root/bes2600-samples/run-20260506-0659-fresh/ on ohm
+Module under test: bes2600.ko srcversion 461AFB369355AE598D79BDF (c-final + c5.2.1)
+
+This is the Phase 5 hand-off artifact. Per project CLAUDE.md: paste verbatim,
+do not curate. Anomalies, contradictions and small-N caveats are stated as-is.
+
+---
+
+## Phase 0 — Substrate / Motivation / Inventory
+
+### Triggering observation (user-reported, 2026-05-06)
+1. After hours of operation, near-100% WiFi quality eventually drops association.
+2. YouTube with hardware decoder shows ~10 dropped frames per DASH chunk fetch.
+
+### Hardware-in-the-loop
+- ohm = PineTab2 (RK3566 SoC, BES2600 over SDIO)
+- Reachable as ohm.fritz.box (192.168.88.168 on home AP, 10.141.179.63 on fallback)
+- Kernel: 6.19.10-danctnix1-1-pinetab2
+- bes2600 module: srcversion 461AFB369355AE598D79BDF (c5.1+c5.1.1+c5.2+c6.1+c6.2+c7+c5.2.1)
+- bes2600_btuart: srcversion 8FF920B9C068EA2E7DB9BA8 (unchanged)
+
+### What is measurable
+- journald (kernel + NetworkManager + wpa_supplicant)
+- bes2600 dynamic_debug: 13 callsites enabled with +pmf flag
+- iw event -t -f (cfg80211 events with reason codes)
+- tcpdump on wlan0 (managed mode — sees data + EAPOL + ARP/ICMP, NOT raw 802.11 mgmt)
+- per-60s snapshot loop: iw link, station dump, /proc/net/wireless, /sys/class/net/wlan0/statistics
+
+### Predecessor anchor
+None — campaign re-anchored from in-session reps. User pre-rig claims partially replicated:
+- A5 (c7 latch trip count rises over session) — does NOT replicate. 0 hits in 12h boot for "PSM not honored" / "pm_unsupported" / "switching to skip".
+- A2 (~4h to drop) — does NOT cleanly replicate as a periodicity; intervals vary minutes to hours.
+
+### Receipt checklist (Phase 0)
+- [x] Predecessor data treated as reference, not anchor — A5 explicitly falsified above
+- [ ] In-session baseline rep N=3: NOT yet — N=3 events observed across multiple boots, only N=1 idle bin and N=12 load bursts in current boot
+
+---
+
+## Phase 1 — Goal formulation (locked 2026-05-06 11:56 CEST)
+
+### Measurable target
+
+> Quantify the rate of WSM_STATUS_DECRYPTFAILURE bursts (≥4 events within 60s) per hour of operation, AND the conditional probability of an AP-side unprotected-deauth-reason-6 within 30s of such a burst.
+
+Locked artifact: /root/bes2600-samples/run-20260506-0659-fresh/PHASE1.md
+
+Journal marker: 2026-05-06T11:56:14+02:00 — bes2600-test PHASE1_LOCKED
+
+### Why this metric (not the original "assoc up/down per hour")
+
+Source pin: bes2600/txrx.c:1696 — bes_warn for [RX] Receive failure.
+The status field comes from the firmware's WSM RX-indication, parsed in
+wsm.c:1484 wsm_receive_indication via WSM_GET32(buf).
+Status code 4 = WSM_STATUS_DECRYPTFAILURE per wsm.h:620.
+
+Two observed Pattern-P1 events on 2026-05-06 chained:
+1. decrypt-failure storm
+2. AP unprotected-deauth-6 ("Class 2 frame received from non-authenticated station")
+3. kernel local PREV_AUTH_NOT_VALID
+4. reauth-stall on same channel; recovery via different SSID/channel
+
+A third P1 event (yesterday 2026-05-05 22:33) had ZERO decrypt-failures preceding — different trigger (post-resume). The metric will discriminate.
+
+---
+
+## Phase 2 — Situation Analysis
+
+### Rig built (live as of artifact assembly)
+- snap loop PID 5712, ~6h25m elapsed to snapshots/snap.log (60s cadence)
+- tcpdump filtered ring PID 10174, ~4h20m elapsed to tcpdump/cap.pcap files
+- iw event PID 9852, ~4h20m elapsed to iw-event.log
+- dynamic_debug bes2600 13 callsites enabled
+- nc listener loop PID 17037 wrapper, 17039 active listener, port 12345
+
+### tcpdump filter applied
+`arp || icmp || icmp6 || ether proto 0x888e || port 67 || port 68 || port 53 || port 5353 || port 546 || port 547 || (tcp[tcpflags] and (tcp-syn or tcp-fin or tcp-rst) != 0)`
+
+(So we capture ARP, ICMP, EAPOL, DHCP4/6, DNS, mDNS, and TCP control flags. Bulk data dropped.)
+
+### Anti-theatre receipts verified
+- dynamic_debug honored: 13 bes2600 callsites flipped to +p
+- journald persistent: /var/log/journal exists, ~376 MB
+- snap loop ticks accumulating
+- iw event capturing
+- tcpdump rotating
+- loopback self-test of nc listener: 2 MB through, OK
+
+### Known limits of the rig
+- Monitor mode NOT available concurrent with managed (per iw phy phy0 info valid interface combinations). Raw 802.11 mgmt frames are invisible to tcpdump.
+- iw event partially compensates: deauth/auth frame headers are visible there with reason codes.
+- ftrace not enabled — would add bottom-half scheduling latency data; deferred.
+
+### Receipt checklist (Phase 2)
+- [x] Re-read CLAUDE.md
+- [x] Re-read relevant memory entries
+- [x] Verified ohm reachable
+- [x] Verified dynamic_debug honored for bes2600
+- [x] N/A: hardware UART not used
+
+---
+
+## Phase 3 — Baseline measurements (the wall)
+
+### Three Pattern-P1 events captured
+
+#### Event 1 — 2026-05-06 07:13 (varied use, no suspend, on 2.4 GHz)
+
+```
+07:13:16 bes2600_wlan: [RX] Receive failure: 4 (×6 in 1s)
+... 77 events total over 24 seconds ...
+07:13:41 iw-event: AP→ohm unprotected deauth reason 6
+         ("Class 2 frame received from non-authenticated station")
+07:13:41 kernel: wlan0: deauthenticating from 5b:32 reason 2 = PREV_AUTH_NOT_VALID
+07:13:42 kernel: wlan0: associated → 5b:33 (newton 5 GHz)
+```
+
+Recovery: 1 second, cross-band.
+
+#### Event 2 — 2026-05-06 11:03 (idle on newton 2.4 GHz)
+
+```
+11:03:10 bes2600_wlan: [RX] Receive failure: 4 (×6 in 1s)
+11:03:11 bes2600_wlan: [RX] Receive failure: 4
+11:03:19 bes2600_wlan: [RX] Receive failure: 4 (×2 in 1s)
+11:03:21 iw-event: AP→ohm unprotected deauth reason 6
+11:03:22 kernel: wlan0: deauthenticating from 5b:32 reason 2 = PREV_AUTH_NOT_VALID
+
+11:03:22 → 11:05:11  9 auth attempts on 3 newton BSSIDs ALL TIMED OUT
+         (iw-event shows AP returned auth status 0 Successful twice
+          but assoc step never completed)
+
+11:05:11 kernel: wlan0: associated → 4e:64:5c:d8:11:62 (dingdongkingkong, ch 13)
+         full handshake auth+assoc+connect in 110ms
+```
+
+Recovery: 109 seconds, cross-SSID + cross-channel only.
+EAPOL frames in entire 11:03:00 → 11:05:30 window: 0 (4WHS never attempted).
+Inbound packets to .168 in that window: 0.
+
+#### Event 3 — 2026-05-05 22:33 (post-resume from lid-close)
+
+```
+22:31:22 NM: state activated → deactivating (reason 'sleeping')
+22:31:22 kernel: wlan0: deauthenticating from 5b:32 reason 3 = DEAUTH_LEAVING
+22:31:27 kernel: PM: suspend entry (deep)
+22:31:31 kernel: PM: suspend exit (4s suspend)
+22:31:35 kernel: wlan0: associated → 5b:32
+
+[97 seconds of normal operation]
+
+22:33:12 kernel: wlan0: deauthenticating from 5b:32 reason 2 = PREV_AUTH_NOT_VALID
+22:33:13 kernel: wlan0: associated → 5b:33 (newton 5 GHz)
+```
+
+ZERO [RX] Receive failure events in the 22:31:35 → 22:33:12 window. Different trigger path. Recovery: 1s, cross-band.
+
+### Comparison table (idle vs load)
+
+| Period | Duration | Decrypt-fails | Rate/hr | Bursts (≥4 in 60s) | Burst rate/hr | Escalations to AP-deauth-6 |
+|---|---|---|---|---|---|---|
+| 06:39–11:03 idle/varied | 4h24m | 8 | 1.8 | 1 | 0.23 | 1/1 = 100% |
+| 12:00–13:28 sustained 1MB/s | 1h28m | 104 | 70.9 | 12 | 8.18 | 0/12 = 0% |
+
+Burst-rate elevation under load: ~35x.
+
+### List of 12 load-period bursts (start time, count within 60s)
+```
+12:06:01 9 events
+12:17:57 6 events
+12:19:37 4 events
+12:20:55 5 events
+12:22:17 5 events
+12:25:58 4 events
+12:30:02 7 events
+12:32:09 9 events
+13:14:41 8 events
+13:17:22 4 events
+13:22:53 5 events
+13:28:16 8 events
+```
+
+### Open contradictions / things the loop has NOT yet resolved
+
+1. Event-3 (post-resume) had no decrypt-failures yet still ended in PREV_AUTH_NOT_VALID. There is a second P1 trigger path we have not pinned a mechanism for.
+2. Conditional probability flips between idle (100% of 1 burst escalates) and load (0% of 12 bursts escalate). Hypothesis: AP "Class 2 from unauth STA" heuristic is silenced by sustained host TX. NOT yet verified by AP-side capture (no AP logs available).
+3. Cause of decrypt failures themselves remains hypothetical: PTK or GTK drift, or replay-counter mismatch. EAPOL group-rekey frames were NOT captured before either P1 event; needed wider tcpdump window.
+4. Newton AP is 802.11v BSS-load capable (saw "comeback duration 1000 TU" at 12:03:23). Reason 9 (STA_REQ_ASSOC_WITHOUT_AUTH) at 12:03:24 shows AP-side state churn even within 1s of auth-success. May or may not be relevant to the P1 blackhole.
+
+### Source citations (Phase 6 contract pins, recorded for review reference)
+- bes2600/txrx.c:1696 — bes_warn for [RX] Receive failure
+- bes2600/wsm.h:620 — WSM_STATUS_DECRYPTFAILURE = 4
+- bes2600/wsm.c:1484 — wsm_receive_indication (parses status from firmware)
+
+### Receipt checklist (Phase 3)
+- [x] Trace files / dmesg / regdump pasted verbatim above
+- [x] Raw before derived (event listings precede rates)
+- [x] Rig failure findings honestly recorded (monitor mode unavailable, ftrace deferred, no AP logs)
+
+---
+
+## What we are explicitly NOT yet at
+
+- Phase 4 plan: not started. Pending review.
+- Phase 6 implementation: not started. Pending plan + review.
+
+## Asks of the reviewer
+
+1. Is the Phase 1 metric the right discriminator? In particular, does the conditional-probability column (escalation given burst) capture what we want, or should the metric also count "AP-deauth-6 with no preceding burst" (the Event-3 path)?
+2. The escalation-rate flip (100% idle, 0% load) is at N=1 idle vs N=12 load. Is N=1 idle adequate to report this finding, or do we need ≥3 idle bursts before locking?
+3. Anything in Phase 3 above flagged as "not yet measured" that would be cheap to add before Phase 4? Specifically:
+   - Should ftrace mac80211/cfg80211 events be enabled before next rep? Cost: ~10x journal volume. Benefit: bottom-half timing for the 100ms-scale stalls in the 109s blackhole.
+   - Should tcpdump filter widen to include EAPOL frames captured in a moving 5-min window before each P1 event so we see the group-rekey directly?
+4. Is the lack of AP-side capture (no Fritz!Box logs) a blocking gap, or can the campaign proceed without it for now?
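
The Phase 1 metric (bursts of ≥4 decrypt failures within 60s, and the share of bursts followed by an AP deauth within 30s) could be computed offline roughly as follows. This is a minimal sketch that sits outside the patch. The names `find_bursts`, `escalation_rate` and `per_hour` are hypothetical, and timestamps are assumed to be already extracted from the journal and iw-event logs as seconds. Two interpretation choices are assumptions, not taken from PHASE1.md: bursts are non-overlapping 60s windows anchored at their first event, and "within 30s of a burst" means a deauth between the burst's first event and 30s after its last.

```python
from bisect import bisect_left


def find_bursts(events, min_events=4, window_s=60):
    """Return (start, last, count) per burst; events are times in seconds."""
    events = sorted(events)
    bursts = []
    i = 0
    while i < len(events):
        # Index of the first event at or beyond the end of this window.
        j = bisect_left(events, events[i] + window_s, lo=i)
        if j - i >= min_events:
            bursts.append((events[i], events[j - 1], j - i))
            i = j  # assumption: events in the window belong to this burst only
        else:
            i += 1
    return bursts


def escalation_rate(bursts, deauths, within_s=30):
    """Return (escalated, total): bursts with a deauth by last + within_s."""
    deauths = sorted(deauths)
    escalated = 0
    for start, last, _count in bursts:
        k = bisect_left(deauths, start)  # first deauth at or after burst start
        if k < len(deauths) and deauths[k] <= last + within_s:
            escalated += 1
    return escalated, len(bursts)


def per_hour(n, duration_s):
    """Convert a count over duration_s seconds into a per-hour rate."""
    return n / (duration_s / 3600)


# Event 2 from the notes, as seconds past 11:03:00:
# six failures at :10, one at :11, two at :19, AP deauth at :21.
event2 = [10] * 6 + [11] + [19] * 2
bursts = find_bursts(event2)
print(bursts, escalation_rate(bursts, [21]))
```

With the load-period figures from the comparison table, `per_hour(12, 88 * 60)` gives about 8.18 bursts per hour. Anchoring the 30s window at the last event rather than the first would matter for long storms such as Event 1, where the deauth arrives about 25s after the first failure.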