From 96e2d439c9e84824382be98d8d73a3fdc371d2f0 Mon Sep 17 00:00:00 2001
From: Markus Fritsche
Date: Sat, 16 May 2026 19:28:19 +0000
Subject: [PATCH] iter6 v3 plan: apply Sonnet round-3 amendments
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

6 targeted fixes, no rewrite:
1. Add DEBUG_ATOMIC_SLEEP + DEBUG_LIST explicitly to config delta
2. Drop PAGE_POISONING_NO_SANITY (keep the sanity check)
3. Fix mpv hwdec flag (vaapi-copy is SW! use v4l2m2m or rkmpp).
   Extend monitoring window 2min → 20min (KASAN slows runtime 3-8x).
   Add glmark2 + ffmpeg-null concurrent option.
4. Add R11: /boot/firmware headroom pre-flight (KASAN_OUTLINE +15-25%
   Image size).
5. Replace serial fallback with netconsole (no clip work needed).
6. Add H4 isolation exit path: headless-only DRM ffmpeg loop if
   compositor test silent — isolates panthor from vb2.

All round-3 ACCEPT/AMENDs applied. v3 ready for execution.

Co-Authored-By: Claude Opus 4.7
---
 phase4_plan_iter6_v3.md | 25 ++++++++++++++++++-------
 1 file changed, 18 insertions(+), 7 deletions(-)

diff --git a/phase4_plan_iter6_v3.md b/phase4_plan_iter6_v3.md
index 70028b8..356be26 100644
--- a/phase4_plan_iter6_v3.md
+++ b/phase4_plan_iter6_v3.md
@@ -26,9 +26,11 @@ All v2 pre-flight + lockdep config + ramoops setup is preserved (already execute
   --enable KASAN_OUTLINE \
   --enable KCSAN --enable KCSAN_REPORT_ONCE_IN_MS=2000 \
   --enable DEBUG_PREEMPT \
+  --enable DEBUG_ATOMIC_SLEEP \
   --enable DEBUG_OBJECTS --enable DEBUG_OBJECTS_RCU_HEAD \
+  --enable DEBUG_LIST \
   --enable SLUB_DEBUG --enable SLUB_DEBUG_ON \
-  --enable PAGE_POISONING --enable PAGE_POISONING_NO_SANITY
+  --enable PAGE_POISONING

 CONFIG_LOCALVERSION="-lockdep-kasan"
 ```
@@ -38,6 +40,9 @@ Notes:
 - KCSAN_REPORT_ONCE_IN_MS=2000 — rate-limit reports so console doesn't drown
 - SLUB_DEBUG_ON — enable all SLUB debug at boot (poisoning, redzone, owner-tracking)
 - DEBUG_OBJECTS — catches double-init / use-after-free of registered kernel objects (dma_fences are registered)
+- DEBUG_ATOMIC_SLEEP — re-enable explicitly per round-3 review (don't assume v2 setting carries)
+- DEBUG_LIST — catches list_head corruption at the point of corruption (dma_fence + vb2 use list_heads heavily)
+- PAGE_POISONING WITHOUT `_NO_SANITY` — keep the sanity check (round-3 review: dropping _NO_SANITY gives more info)

 ### Build (step 1.3-DELTA)

@@ -58,10 +63,15 @@ Same as v2: 0004 → 0005 → 0006 → 0007-v2. Between each: reboot + smoke + h
 **Critical addition for step 2.1**: after the headless smoke test passes, intentionally trigger GPU compositor activity to reproduce the v2 wedge condition:

 1. Log into plasma wayland at SDDM prompt (manual, physical)
-2. Start a video that drives both GPU compositor + V4L2 decode concurrently. E.g.: `mpv --hwdec=vaapi-copy ~/measurements/encoded/bbb_60s_720p.hevc.mp4` in the plasma session
-3. Watch for ~2 minutes (v2 wedge timing was ~10 min, but KASAN/KCSAN should catch earlier)
-4. From ssh: `tail -f` `/sys/fs/pstore/` AND `journalctl -k -f` for KASAN / KCSAN / BUG: KASAN / data-race / use-after-free reports
-5. If a report fires, capture it then reboot back to lockdep-non-kasan kernel for analysis
+2. Start a video that ACTUALLY uses the V4L2 hardware decode path (round-3 review correction: vaapi-copy is SW decode, doesn't drive vb2). Use:
+   - `mpv --hwdec=v4l2m2m ~/measurements/encoded/bbb_60s_720p.hevc.mp4`, OR
+   - `mpv --hwdec=rkmpp ~/measurements/encoded/bbb_60s_720p.hevc.mp4`, OR
+   - Concurrent: `glmark2 &` (GPU compositor load) + `ffmpeg -hwaccel drm -i bbb.hevc.mp4 -f null -` (V4L2 decode load) in a tight loop
+3. Watch for ~20 minutes (v2 wedge timing was ~10 min; KASAN+KCSAN slow runtime 3-8x → effective wedge time may be 30-80 min wallclock. 20 min is the lower bound)
+4. From ssh: `journalctl -k -f | grep -E "KASAN|KCSAN|BUG:|race|use-after|list_add|list_del|object .* not init"` piped live for early-warning report capture
+5. If a report fires, capture it AND `sudo cat /sys/fs/pstore/*` (in case panic-trigger followed) then reboot back to lockdep-non-kasan kernel for analysis
+
+**Alternative H4 isolation path** (if step-2.1 KASAN/KCSAN silent under plasma session): re-run headless-only (no compositor) with `ffmpeg -hwaccel drm -i ~/measurements/encoded/bbb_60s_720p.hevc.mp4 -f null -` looped for 20 min. If WEDGE under headless-DRM = vb2 fence bug. If CLEAN under headless = H4 panthor-compositor-specific (vb2 not at fault).

 ## Risk register additions vs v2

@@ -70,7 +80,8 @@ Same as v2: 0004 → 0005 → 0006 → 0007-v2. Between each: reboot + smoke + h
 | R7 | KASAN+KCSAN combined boot too slow → systemd timeout → boot failure | Multiple fallback labels in extlinux: `arch_devices_lockdep_kasan` (new), `arch_devices_lockdep` (no kasan), `arch_devices` (default vanilla). User picks at U-Boot. systemd unit timeouts may need extension via `DefaultTimeoutStartSec=` if observed |
 | R8 | KASAN region overlaps with our ramoops carve-out (0x10000000-0x100fffff) | KASAN shadow region uses high VA, not low PA. Carve-out is from real physical RAM, KASAN shadow is in vmalloc space. No conflict expected — verify with `/proc/iomem` post-boot |
 | R9 | KASAN tags trigger false positives on pre-existing kernel code (e.g. panel-edp's WARN) → noise drowns real signal | KASAN reports are distinctive ("BUG: KASAN:" prefix). Grep filtering should isolate. If false-positive volume is overwhelming, switch to KASAN_SW_TAGS or selectively disable instrumentation per-file |
-| R10 | GPU-compositor wedge reproduction requires physical login (no remote-wayland start) | User has physical access to ampere for this test step (per v2 plan A4 conditional). Time-boxed: if wedge not reproduced in ~10 min of compositor activity, declare iter6 v3 inconclusive |
+| R10 | GPU-compositor wedge reproduction requires physical login (no remote-wayland start) | User has physical access to ampere for this test step (per v2 plan A4 conditional). Time-boxed: if wedge not reproduced in ~20 min of compositor activity (per round-3 amendment), declare iter6 v3 inconclusive |
+| R11 | /boot/firmware partition headroom for a third Image (KASAN_OUTLINE adds 15-25 % size) | Pre-flight `df -h /boot/firmware`. Need ≥ 30 MB free. If tight: overwrite the lockdep Image instead of keeping both — lockdep label becomes the lockdep-kasan label after rebuild |

 ## Estimated wall-time

@@ -88,6 +99,6 @@ Same as v2: 0004 → 0005 → 0006 → 0007-v2. Between each: reboot + smoke + h

 **Success**: KASAN or KCSAN report fires during step 2.1's GPU-compositor test, pinpointing the bug class + responsible code. → file kernel-agent issue, fix in 0004 v2 OR upstream, retry.

-**Partial**: 0004 v3 (with KASAN) still wedges silently with NO KASAN/KCSAN report. Means bug is OUTSIDE the sanitizer-instrumented memory (e.g. firmware blob, GPU hardware state machine). Switch to serial console for live trace (final fallback).
+**Partial**: 0004 v3 (with KASAN) still wedges silently with NO KASAN/KCSAN report. Means bug is OUTSIDE the sanitizer-instrumented memory (e.g. firmware blob, GPU hardware state machine). **Final-fallback netconsole** (round-3 amendment replaces serial): before triggering the wedge, `sudo modprobe netconsole netconsole=6666@/eth0,6667@<boltzmann-ip>/` (the target IP is mandatory in the netconsole syntax); on boltzmann run `nc -u -l 6667 | tee netconsole-iter6v3-$(date +%s).log`. Netconsole survives kernel panic that consumes the display, requires no hardware-clip work, uses the existing 7.0-lockdep kernel as-is. Serial console remains as documented last-resort only.

 **Failure**: KASAN+KCSAN kernel doesn't boot at all → fall back to lockdep kernel, declare iter6 line closed, move iter7 to non-fence cache-coherency angle.
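
The R11 pre-flight (`df -h /boot/firmware`, need ≥ 30 MB free) could be scripted roughly as below. This is a minimal sketch, not part of the plan itself: the function name `check_boot_headroom` and its exit-code convention are hypothetical, and only the mount point and the 30 MB threshold come from the risk register.

```shell
#!/bin/sh
# Hypothetical helper for the R11 pre-flight; name and exit codes are
# illustrative assumptions, not something the plan defines.
# Exit 0: enough headroom. Exit 1: too tight. Exit 2: could not measure.
check_boot_headroom() {
    mount_point=${1:-/boot/firmware}   # mount point from R11
    min_mb=${2:-30}                    # 30 MB threshold from R11

    # POSIX output format (-P) in 1 KiB units (-k); field 4 of the
    # second line is the available space.
    avail_kb=$(df -Pk "$mount_point" 2>/dev/null | awk 'NR==2 {print $4}')

    # Treat an empty or non-numeric value as "could not measure".
    case $avail_kb in
        ''|*[!0-9]*)
            echo "cannot read free space for $mount_point" >&2
            return 2
            ;;
    esac

    min_kb=$((min_mb * 1024))
    if [ "$avail_kb" -ge "$min_kb" ]; then
        echo "OK: $((avail_kb / 1024)) MB free on $mount_point (need >= $min_mb MB)"
    else
        echo "TIGHT: $((avail_kb / 1024)) MB free on $mount_point (need >= $min_mb MB)"
        return 1
    fi
}
```

Called with no arguments it checks `/boot/firmware` against 30 MB; a non-zero exit is the cue to apply the R11 mitigation and overwrite the lockdep Image instead of keeping both.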