besser/notes/phase5-2026-05-06.md
test0r 1a21212744 notes: phase 5 review artifact for BES2600 wifi-stability campaign
Captures Phase 0-3 receipts as of 2026-05-06: three Pattern-P1 events
reproduced (07:13, 11:03, yesterday 22:33), decrypt-failure metric locked
as Phase 1 with source pins (txrx.c:1696, wsm.h:620, wsm.c:1484), rig built
(snap loop + tcpdump filtered ring + iw event + dynamic_debug + netcat 1MB/s),
idle-vs-load comparison shows 35x burst-rate elevation under load with
conditional-escalation flip (100% idle / 0% load).

Pending Phase 5 second-model review before Phase 4 plan.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-06 15:23:24 +02:00


BES2600 WiFi-stability campaign — Phase 5 review artifact

Date assembled: 2026-05-06 (rig started 06:59 CEST)
Run dir: /root/bes2600-samples/run-20260506-0659-fresh/ on ohm
Module under test: bes2600.ko srcversion 461AFB369355AE598D79BDF (c-final + c5.2.1)

This is the Phase 5 hand-off artifact. Per project CLAUDE.md: paste verbatim, do not curate. Anomalies, contradictions and small-N caveats are stated as-is.


Phase 0 — Substrate / Motivation / Inventory

Triggering observation (user-reported, 2026-05-06)

  1. After hours of operation at near-100% reported WiFi quality, the link eventually drops association.
  2. YouTube with hardware decoder shows ~10 dropped frames per DASH chunk fetch.

Hardware-in-the-loop

  • ohm = PineTab2 (RK3566 SoC, BES2600 over SDIO)
  • Reachable as ohm.fritz.box (192.168.88.168 on home AP, 10.141.179.63 on fallback)
  • Kernel: 6.19.10-danctnix1-1-pinetab2
  • bes2600 module: srcversion 461AFB369355AE598D79BDF (c5.1+c5.1.1+c5.2+c6.1+c6.2+c7+c5.2.1)
  • bes2600_btuart: srcversion 8FF920B9C068EA2E7DB9BA8 (unchanged)

What is measurable

  • journald (kernel + NetworkManager + wpa_supplicant)
  • bes2600 dynamic_debug: 13 callsites enabled with +pmf flag
  • iw event -t -f (cfg80211 events with reason codes)
  • tcpdump on wlan0 (managed mode — sees data + EAPOL + ARP/ICMP, NOT raw 802.11 mgmt)
  • per-60s snapshot loop: iw link, station dump, /proc/net/wireless, /sys/class/net/wlan0/statistics

Predecessor anchor

None. The campaign was re-anchored from in-session reps. The user's pre-rig claims replicate only partially; the following do not:

  • A5 (c7 latch trip count rises over session) — does NOT replicate. 0 hits in 12h boot for "PSM not honored" / "pm_unsupported" / "switching to skip".
  • A2 (~4h to drop) — does NOT cleanly replicate as a periodicity; intervals vary minutes to hours.

Receipt checklist (Phase 0)

  • Predecessor data treated as reference, not anchor — A5 explicitly falsified above
  • In-session baseline rep N=3: NOT yet — N=3 events observed across multiple boots, only N=1 idle bin and N=12 load bursts in current boot

Phase 1 — Goal formulation (locked 2026-05-06 11:56 CEST)

Measurable target

Quantify the rate of WSM_STATUS_DECRYPTFAILURE bursts (≥4 events within 60s) per hour of operation, AND the conditional probability of an AP-side unprotected-deauth-reason-6 within 30s of such a burst.
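The metric above leaves the windowing rule open, so the following is only a minimal sketch of one possible reading, not the rig's actual tooling. It assumes greedy, non-overlapping 60s windows anchored at the first unconsumed event, and it counts a burst as escalated when a deauth-reason-6 timestamp falls between the burst start and 30s after the burst's last event. The names `find_bursts` and `escalation_fraction` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence

BURST_MIN_EVENTS = 4         # ">=4 events" from the Phase 1 metric
BURST_WINDOW_S = 60.0        # "within 60s"
ESCALATION_WINDOW_S = 30.0   # "within 30s of such a burst"


@dataclass
class Burst:
    start: float   # time of first event in the window (seconds)
    end: float     # time of last event in the window (seconds)
    count: int     # number of decrypt-failure events in the window


def find_bursts(event_times: Sequence[float]) -> List[Burst]:
    """Greedy scan: assumed rule, windows do not overlap."""
    times = sorted(event_times)
    bursts: List[Burst] = []
    i = 0
    while i < len(times):
        j = i
        # Collect every event that lies within 60s of the window anchor.
        while j < len(times) and times[j] - times[i] < BURST_WINDOW_S:
            j += 1
        count = j - i
        if count >= BURST_MIN_EVENTS:
            bursts.append(Burst(times[i], times[j - 1], count))
            i = j          # consume the whole window
        else:
            i += 1         # slide the anchor by one event
    return bursts


def escalation_fraction(bursts: Sequence[Burst],
                        deauth_times: Sequence[float]) -> Optional[float]:
    """Fraction of bursts followed by a deauth-6; None if there are no bursts."""
    if not bursts:
        return None
    escalated = sum(
        1 for b in bursts
        if any(b.start <= t <= b.end + ESCALATION_WINDOW_S for t in deauth_times)
    )
    return escalated / len(bursts)


# Event 2 timings from Phase 3, as seconds after 11:03:00:
# six failures at :10, one at :11, two at :19, deauth-6 at :21.
event2 = [10.0] * 6 + [11.0] + [19.0] * 2
bursts = find_bursts(event2)
print(len(bursts), bursts[0].count, escalation_fraction(bursts, [21.0]))  # 1 9 1.0
```

A sliding-window definition would give different burst counts for long storms, so whichever rule is used should be written into the locked artifact.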

Locked artifact: /root/bes2600-samples/run-20260506-0659-fresh/PHASE1.md

Journal marker: 2026-05-06T11:56:14+02:00 — bes2600-test PHASE1_LOCKED

Why this metric (not the original "assoc up/down per hour")

Source pin: bes2600/txrx.c:1696 — bes_warn for [RX] Receive failure. The status field comes from the firmware's WSM RX-indication, parsed in wsm.c:1484 wsm_receive_indication via WSM_GET32(buf). Status code 4 = WSM_STATUS_DECRYPTFAILURE per wsm.h:620.
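As an illustration of how the status code can be pulled back out of the log, here is a small sketch that extracts decrypt-failure timestamps from lines shaped like the excerpts in Phase 3. The line layout (an HH:MM:SS timestamp followed by the `[RX] Receive failure: N` text) is an assumption taken from those excerpts; real journald output may carry a different prefix. The function name is hypothetical.

```python
import re
from typing import Iterable, List

WSM_STATUS_DECRYPTFAILURE = 4  # value cited above from wsm.h:620

# Assumed layout: "HH:MM:SS ... [RX] Receive failure: <status>"
_LINE_RE = re.compile(r"(\d{2}):(\d{2}):(\d{2}).*?\[RX\] Receive failure: (\d+)")


def decrypt_failure_times(lines: Iterable[str]) -> List[int]:
    """Return seconds-of-day for each line reporting status 4."""
    out: List[int] = []
    for line in lines:
        m = _LINE_RE.search(line)
        if m is None:
            continue  # not a receive-failure line
        hh, mm, ss, status = (int(g) for g in m.groups())
        if status == WSM_STATUS_DECRYPTFAILURE:
            out.append(hh * 3600 + mm * 60 + ss)
    return out


sample = [
    "11:03:11  bes2600_wlan: [RX] Receive failure: 4",
    "11:03:21  iw-event: AP→ohm unprotected deauth reason 6",
    "11:03:12  bes2600_wlan: [RX] Receive failure: 2",  # hypothetical non-4 status
]
print(decrypt_failure_times(sample))  # [39791]
```

The sketch counts one event per line, so collapsed annotations such as "(×6 in 1s)" would need the raw journal lines rather than the summarised excerpts.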

Two Pattern-P1 events observed on 2026-05-06 followed the same chain:

  1. decrypt-failure storm
  2. AP unprotected-deauth-6 ("Class 2 frame received from non-authenticated station")
  3. kernel local PREV_AUTH_NOT_VALID
  4. reauth-stall on same channel; recovery via different SSID/channel

A third P1 event (yesterday, 2026-05-05 22:33) was preceded by ZERO decrypt-failures, so it has a different trigger (post-resume). The burst metric will discriminate between the two paths.


Phase 2 — Situation Analysis

Rig built (live as of artifact assembly)

  • snap loop PID 5712, ~6h25m elapsed, writing to snapshots/snap.log (60s cadence)
  • tcpdump filtered ring PID 10174, ~4h20m elapsed, writing to tcpdump/cap.pcap files
  • iw event PID 9852, ~4h20m elapsed, writing to iw-event.log
  • dynamic_debug bes2600 13 callsites enabled
  • nc listener loop PID 17037 wrapper, 17039 active listener, port 12345

tcpdump filter applied

arp || icmp || icmp6 || ether proto 0x888e || port 67 || port 68 || port 53 || port 5353 || port 546 || port 547 || (tcp[tcpflags] and (tcp-syn or tcp-fin or tcp-rst) != 0)

(So we capture ARP, ICMP, ICMPv6, EAPOL, DHCPv4/v6, DNS, mDNS, and TCP segments with SYN, FIN or RST set. Bulk data is dropped.)

Anti-theatre receipts verified

  • dynamic_debug honored: 13 bes2600 callsites flipped to +p
  • journald persistent: /var/log/journal exists, ~376 MB
  • snap loop ticks accumulating
  • iw event capturing
  • tcpdump rotating
  • loopback self-test of nc listener: 2 MB through, OK

Known limits of the rig

  • Monitor mode is NOT available concurrently with managed mode (per iw phy phy0 info valid interface combinations). Raw 802.11 mgmt frames are invisible to tcpdump.
  • iw event partially compensates: deauth/auth frame headers are visible there with reason codes.
  • ftrace not enabled — would add bottom-half scheduling latency data; deferred.

Receipt checklist (Phase 2)

  • Re-read CLAUDE.md
  • Re-read relevant memory entries
  • Verified ohm reachable
  • Verified dynamic_debug honored for bes2600
  • N/A: hardware UART not used

Phase 3 — Baseline measurements (the wall)

Three Pattern-P1 events captured

Event 1 — 2026-05-06 07:13 (varied use, no suspend, on 2.4 GHz)

07:13:16  bes2600_wlan: [RX] Receive failure: 4    (×6 in 1s)
... 77 events total over 24 seconds ...
07:13:41  iw-event: AP→ohm unprotected deauth reason 6
          ("Class 2 frame received from non-authenticated station")
07:13:41  kernel: wlan0: deauthenticating from 5b:32 reason 2 = PREV_AUTH_NOT_VALID
07:13:42  kernel: wlan0: associated → 5b:33 (newton 5 GHz)

Recovery: 1 second, cross-band.

Event 2 — 2026-05-06 11:03 (idle on newton 2.4 GHz)

11:03:10  bes2600_wlan: [RX] Receive failure: 4    (×6 in 1s)
11:03:11  bes2600_wlan: [RX] Receive failure: 4
11:03:19  bes2600_wlan: [RX] Receive failure: 4    (×2 in 1s)
11:03:21  iw-event: AP→ohm unprotected deauth reason 6
11:03:22  kernel: wlan0: deauthenticating from 5b:32 reason 2 = PREV_AUTH_NOT_VALID

11:03:22 → 11:05:11    9 auth attempts on 3 newton BSSIDs ALL TIMED OUT
                       (iw-event shows AP returned auth status 0 Successful twice
                        but assoc step never completed)

11:05:11  kernel: wlan0: associated → 4e:64:5c:d8:11:62 (dingdongkingkong, ch 13)
                  full handshake auth+assoc+connect in 110ms

Recovery: 109 seconds, cross-SSID + cross-channel only. EAPOL frames in entire 11:03:00 → 11:05:30 window: 0 (4WHS never attempted). Inbound packets to .168 in that window: 0.

Event 3 — 2026-05-05 22:33 (post-resume from lid-close)

22:31:22  NM: state activated → deactivating (reason 'sleeping')
22:31:22  kernel: wlan0: deauthenticating from 5b:32 reason 3 = DEAUTH_LEAVING
22:31:27  kernel: PM: suspend entry (deep)
22:31:31  kernel: PM: suspend exit                                       (4s suspend)
22:31:35  kernel: wlan0: associated → 5b:32

[97 seconds of normal operation]

22:33:12  kernel: wlan0: deauthenticating from 5b:32 reason 2 = PREV_AUTH_NOT_VALID
22:33:13  kernel: wlan0: associated → 5b:33 (newton 5 GHz)

ZERO [RX] Receive failure events in the 22:31:35 → 22:33:12 window. Different trigger path. Recovery: 1s, cross-band.

Comparison table (idle vs load)

Period                        Duration  Decrypt-fails  Rate/hr  Bursts (≥4 in 60s)  Burst rate/hr  Escalations to AP-deauth-6
06:39-11:03 idle/varied       4h24m     8              1.8      1                   0.23           1/1 = 100%
12:00-13:28 sustained 1MB/s   1h28m     104            70.9     12                  8.18           0/12 = 0%

Burst-rate elevation under load: ~35x.
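The derived columns can be re-checked from the raw counts and durations in the comparison table. The sketch below does only that arithmetic; the helper names are hypothetical.

```python
def hours(h: int, m: int) -> float:
    """Convert an XhYm duration to fractional hours."""
    return h + m / 60.0


def rate_per_hour(count: int, duration_h: float) -> float:
    return count / duration_h


IDLE_H = hours(4, 24)   # idle/varied period, 4h24m
LOAD_H = hours(1, 28)   # sustained 1MB/s period, 1h28m

idle_fail_rate = rate_per_hour(8, IDLE_H)
load_fail_rate = rate_per_hour(104, LOAD_H)
idle_burst_rate = rate_per_hour(1, IDLE_H)
load_burst_rate = rate_per_hour(12, LOAD_H)
elevation = load_burst_rate / idle_burst_rate

print(round(idle_fail_rate, 1), round(load_fail_rate, 1))    # 1.8 70.9
print(round(idle_burst_rate, 2), round(load_burst_rate, 2))  # 0.23 8.18
print(round(elevation, 1))                                   # 36.0
```

With unrounded durations the ratio comes out at 36, and dividing the rounded rates (8.18 / 0.23) gives about 35.6, so the ~35x figure is within rounding. With a single idle burst in the denominator, the ratio itself carries very wide uncertainty.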

List of 12 load-period bursts (start time, count within 60s)

12:06:01  9 events
12:17:57  6 events
12:19:37  4 events
12:20:55  5 events
12:22:17  5 events
12:25:58  4 events
12:30:02  7 events
12:32:09  9 events
13:14:41  8 events
13:17:22  4 events
13:22:53  5 events
13:28:16  8 events
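A quick tally of the list above, as a sketch: it sums the per-burst counts, compares them with the 104 load-period decrypt failures from the comparison table, and computes the gaps between burst starts. The counts are "within 60s" figures, so events from a storm that ran past its 60s window would be missed here; treat the remainder as an upper bound on isolated events.

```python
from typing import List, Tuple

# (start time, events within 60s), copied from the list above.
BURSTS: List[Tuple[str, int]] = [
    ("12:06:01", 9), ("12:17:57", 6), ("12:19:37", 4), ("12:20:55", 5),
    ("12:22:17", 5), ("12:25:58", 4), ("12:30:02", 7), ("12:32:09", 9),
    ("13:14:41", 8), ("13:17:22", 4), ("13:22:53", 5), ("13:28:16", 8),
]
LOAD_PERIOD_FAILURES = 104  # from the comparison table


def to_seconds(hms: str) -> int:
    h, m, s = (int(part) for part in hms.split(":"))
    return h * 3600 + m * 60 + s


in_bursts = sum(count for _, count in BURSTS)
outside_bursts = LOAD_PERIOD_FAILURES - in_bursts
starts = [to_seconds(t) for t, _ in BURSTS]
gaps = [b - a for a, b in zip(starts, starts[1:])]

print(len(BURSTS), in_bursts, outside_bursts)  # 12 74 30
print(min(gaps), max(gaps))                    # 78 2552
```

The shortest gap between burst starts is 78s, so none of the listed 60s windows overlap. The longest gap (12:32:09 to 13:14:41, about 42.5 minutes) is a quiet stretch inside the load period that the rate-per-hour figure averages over.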

Open contradictions / things the loop has NOT yet resolved

  1. Event-3 (post-resume) had no decrypt-failures yet still ended in PREV_AUTH_NOT_VALID. There is a second P1 trigger path we have not pinned a mechanism for.
  2. Conditional probability flips between idle (100% of 1 burst escalates) and load (0% of 12 bursts escalate). Hypothesis: AP "Class 2 from unauth STA" heuristic is silenced by sustained host TX. NOT yet verified by AP-side capture (no AP logs available).
  3. Cause of decrypt failures themselves remains hypothetical: PTK or GTK drift, or replay-counter mismatch. EAPOL group-rekey frames were NOT captured before either P1 event; a wider tcpdump capture window is needed.
  4. Newton AP is 802.11v BSS-load capable (saw "comeback duration 1000 TU" at 12:03:23). Reason 9 (STA_REQ_ASSOC_WITHOUT_AUTH) at 12:03:24 shows AP-side state churn even within 1s of auth-success. May or may not be relevant to the P1 blackhole.

Source citations (Phase 6 contract pins, recorded for review reference)

  • bes2600/txrx.c:1696 — bes_warn for [RX] Receive failure
  • bes2600/wsm.h:620 — WSM_STATUS_DECRYPTFAILURE = 4
  • bes2600/wsm.c:1484 — wsm_receive_indication (parses status from firmware)

Receipt checklist (Phase 3)

  • Trace files / dmesg / regdump pasted verbatim above
  • Raw before derived (event listings precede rates)
  • Rig failure findings honestly recorded (monitor mode unavailable, ftrace deferred, no AP logs)

What we are explicitly NOT yet at

  • Phase 4 plan: not started. Pending review.
  • Phase 6 implementation: not started. Pending plan + review.

Asks of the reviewer

  1. Is the Phase 1 metric the right discriminator? In particular, does the conditional-probability column (escalation given burst) capture what we want, or should the metric also count "AP-deauth-6 with no preceding burst" (the Event-3 path)?
  2. The escalation-rate flip (100% idle, 0% load) is at N=1 idle vs N=12 load. Is N=1 idle adequate to report this finding, or do we need ≥3 idle bursts before locking?
  3. Anything in Phase 3 above flagged as "not yet measured" that would be cheap to add before Phase 4? Specifically:
    • Should ftrace mac80211/cfg80211 events be enabled before next rep? Cost: ~10x journal volume. Benefit: bottom-half timing for the 100ms-scale stalls in the 109s blackhole.
    • Should the tcpdump ring retain a moving 5-min window before each P1 event, so that the EAPOL frames the filter already matches (ether proto 0x888e) show the group-rekey directly?
  4. Is the lack of AP-side capture (no Fritz!Box logs) a blocking gap, or can the campaign proceed without it for now?
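
For ask 2, one way a reviewer might size the small-N problem is sketched below. This is not part of the campaign tooling: it assumes the bursts can be treated as independent trials, which the clustered timing of the load bursts makes doubtful. It computes a two-sided Fisher exact p-value for the 1/1 vs 0/12 escalation table and the closed-form Clopper-Pearson bounds for the two edge cases (all successes, no successes). Function names are hypothetical.

```python
from math import comb
from typing import Tuple


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    r1, r2, c1 = a + b, c + d, a + c
    n = r1 + r2

    def prob(x: int) -> float:
        # Hypergeometric probability of x in the top-left cell.
        return comb(r1, x) * comb(r2, c1 - x) / comb(n, c1)

    lo, hi = max(0, c1 - r2), min(r1, c1)
    p_obs = prob(a)
    # Sum all tables at least as unlikely as the observed one.
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))


def clopper_pearson_edge(k: int, n: int, alpha: float = 0.05) -> Tuple[float, float]:
    """Exact binomial interval, closed form for k == 0 or k == n only."""
    if k == n:
        return ((alpha / 2) ** (1 / n), 1.0)
    if k == 0:
        return (0.0, 1 - (alpha / 2) ** (1 / n))
    raise ValueError("closed form only covers k == 0 or k == n")


# Rows: idle, load. Columns: escalated, not escalated.
p = fisher_exact_two_sided(1, 0, 0, 12)
print(round(p, 4))                                        # 0.0769
print(clopper_pearson_edge(1, 1))                         # (0.025, 1.0)
print(tuple(round(v, 3) for v in clopper_pearson_edge(0, 12)))  # (0.0, 0.265)
```

Under those assumptions, the idle escalation probability is only bounded to somewhere between 2.5% and 100%, and the idle-vs-load difference gives p ≈ 0.077. That reads as compatible with the flip being real, but not as established by N=1.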