iter6 v3 plan: apply Sonnet round-3 amendments

6 targeted fixes, no rewrite:

1. Add DEBUG_ATOMIC_SLEEP + DEBUG_LIST explicitly to config delta
2. Drop PAGE_POISONING_NO_SANITY (keep the sanity check)
3. Fix mpv hwdec flag (vaapi-copy is SW! use v4l2m2m or rkmpp).
   Extend monitoring window 2min → 20min (KASAN slows runtime 3-8x).
   Add glmark2 + ffmpeg-null concurrent option.
4. Add R11: /boot/firmware headroom pre-flight (KASAN_OUTLINE +15-25%
   Image size).
5. Replace serial fallback with netconsole (no clip work needed).
6. Add H4 isolation exit path: headless-only DRM ffmpeg loop if
   compositor test silent — isolates panthor from vb2.

All round-3 ACCEPT/AMENDs applied. v3 ready for execution.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
+18
-7
@@ -26,9 +26,11 @@ All v2 pre-flight + lockdep config + ramoops setup is preserved (already execute
  --enable KASAN_OUTLINE \
  --enable KCSAN --enable KCSAN_REPORT_ONCE_IN_MS=2000 \
  --enable DEBUG_PREEMPT \
+ --enable DEBUG_ATOMIC_SLEEP \
  --enable DEBUG_OBJECTS --enable DEBUG_OBJECTS_RCU_HEAD \
+ --enable DEBUG_LIST \
  --enable SLUB_DEBUG --enable SLUB_DEBUG_ON \
- --enable PAGE_POISONING --enable PAGE_POISONING_NO_SANITY
+ --enable PAGE_POISONING
  CONFIG_LOCALVERSION="-lockdep-kasan"
  ```

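Review note on the config delta above: `scripts/config --enable` only edits the file, and a later `make olddefconfig` can silently drop any option whose Kconfig dependencies are unmet. As far as I know, mainline declares KCSAN as depending on `!KASAN`, so that pair is worth checking first. `--enable` also does not assign numeric values such as `KCSAN_REPORT_ONCE_IN_MS=2000`; that normally takes `--set-val`. A minimal sketch of a post-`olddefconfig` check follows. `check_config` and `EXPECTED_OPTS` are hypothetical names, not part of the plan's tooling.

```shell
#!/usr/bin/env bash
# Sketch: confirm each option from the config delta survived into the final .config.
# check_config and EXPECTED_OPTS are illustrative names (assumptions).

# Boolean options the delta is expected to turn on (taken from the hunk above).
EXPECTED_OPTS="KASAN_OUTLINE KCSAN DEBUG_PREEMPT DEBUG_ATOMIC_SLEEP DEBUG_OBJECTS DEBUG_OBJECTS_RCU_HEAD DEBUG_LIST SLUB_DEBUG SLUB_DEBUG_ON PAGE_POISONING"

# check_config FILE OPT...
# Prints one "MISSING: CONFIG_<OPT>" line per option that is not "=y" in FILE.
# Returns 1 if anything is missing, 0 otherwise.
check_config() {
    local file=$1; shift
    local missing=0 opt
    for opt in "$@"; do
        if ! grep -q "^CONFIG_${opt}=y$" "$file"; then
            echo "MISSING: CONFIG_${opt}"
            missing=1
        fi
    done
    return "$missing"
}

# Usage (assumes the build tree's .config is in the current directory):
#   check_config .config $EXPECTED_OPTS || echo "config delta did not fully apply"
```

Running this after `olddefconfig` rather than straight after `scripts/config` is the point: it reports what the build will actually use.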
@@ -38,6 +40,9 @@ Notes:
  - KCSAN_REPORT_ONCE_IN_MS=2000 — rate-limit reports so console doesn't drown
  - SLUB_DEBUG_ON — enable all SLUB debug at boot (poisoning, redzone, owner-tracking)
  - DEBUG_OBJECTS — catches double-init / use-after-free of registered kernel objects (dma_fences are registered)
+ - DEBUG_ATOMIC_SLEEP — re-enable explicitly per round-3 review (don't assume v2 setting carries)
+ - DEBUG_LIST — catches list_head corruption at the point of corruption (dma_fence + vb2 use list_heads heavily)
+ - PAGE_POISONING WITHOUT `_NO_SANITY` — keep the sanity check (round-3 review: dropping _NO_SANITY gives more info)

  ### Build (step 1.3-DELTA)

@@ -58,10 +63,15 @@ Same as v2: 0004 → 0005 → 0006 → 0007-v2. Between each: reboot + smoke + h

  **Critical addition for step 2.1**: after the headless smoke test passes, intentionally trigger GPU compositor activity to reproduce the v2 wedge condition:
  1. Log into plasma wayland at SDDM prompt (manual, physical)
- 2. Start a video that drives both GPU compositor + V4L2 decode concurrently. E.g.: `mpv --hwdec=vaapi-copy ~/measurements/encoded/bbb_60s_720p.hevc.mp4` in the plasma session
- 3. Watch for ~2 minutes (v2 wedge timing was ~10 min, but KASAN/KCSAN should catch earlier)
- 4. From ssh: `tail -f` `/sys/fs/pstore/` AND `journalctl -k -f` for KASAN / KCSAN / BUG: KASAN / data-race / use-after-free reports
- 5. If a report fires, capture it then reboot back to lockdep-non-kasan kernel for analysis
+ 2. Start a video that ACTUALLY uses the V4L2 hardware decode path (round-3 review correction: vaapi-copy is SW decode, doesn't drive vb2). Use:
+    - `mpv --hwdec=v4l2m2m ~/measurements/encoded/bbb_60s_720p.hevc.mp4`, OR
+    - `mpv --hwdec=rkmpp ~/measurements/encoded/bbb_60s_720p.hevc.mp4`, OR
+    - Concurrent: `glmark2 &` (GPU compositor load) + `ffmpeg -hwaccel vaapi -i bbb.hevc.mp4 -f null -` (V4L2 decode load) in a tight loop
+ 3. Watch for ~20 minutes (v2 wedge timing was ~10 min; KASAN+KCSAN slow runtime 3-8x → effective wedge time may be 30-80 min wallclock. 20 min is the lower bound)
+ 4. From ssh: `journalctl -k -f | grep -E "KASAN|KCSAN|BUG:|race|use-after|list_add|list_del|object .* not init"` piped live for early-warning report capture
+ 5. If a report fires, capture it AND `sudo cat /sys/fs/pstore/*` (in case panic-trigger followed) then reboot back to lockdep-non-kasan kernel for analysis

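Review note on the new step 4: the live filter can be wrapped so the same pattern is reused for `journalctl -k -f` and for saved pstore dumps. The sketch below makes two assumptions that differ from the plan's text. It uses `data-race` instead of the bare `race` alternative, since `race` would also match every `Call trace:` line. It also adds `--line-buffered` so matches are not held back when the output is piped onward. `filter_reports` and `SANITIZER_PATTERN` are hypothetical names.

```shell
#!/usr/bin/env bash
# Sketch: reusable filter for sanitizer and list-debug reports in kernel log text.
# Names are illustrative; the pattern is the plan's, with "race" narrowed to
# "data-race" (an assumption made here to avoid matching "Call trace:").

SANITIZER_PATTERN='KASAN|KCSAN|BUG:|data-race|use-after|list_add|list_del|object .* not init'

# filter_reports: read log lines on stdin, print only the lines that match.
# --line-buffered flushes each match immediately, which matters in a live pipe.
filter_reports() {
    grep -E --line-buffered "$SANITIZER_PATTERN"
}

# Usage on the test box (not executed here):
#   journalctl -k -f | filter_reports | tee kasan-live.log
#   sudo cat /sys/fs/pstore/* | filter_reports
```

Teeing the filtered stream to a file on the ssh side keeps a copy of any report even if the board wedges a moment later.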
+ **Alternative H4 isolation path** (if step-2.1 KASAN/KCSAN silent under plasma session): re-run headless-only (no compositor) with `ffmpeg -hwaccel drm -i ~/measurements/encoded/bbb_60s_720p.hevc.mp4 -f null -` looped for 20 min. If WEDGE under headless-DRM = vb2 fence bug. If CLEAN under headless = H4 panthor-compositor-specific (vb2 not at fault).
+
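Review note on the H4 isolation path: "looped for 20 min" can be made explicit with a small time-boxed wrapper, so the run stops at the deadline or at the first decode failure instead of relying on someone watching a clock. This is a sketch only. `run_for` is a hypothetical helper, and it cannot detect a wedge by itself: if the kernel hangs, the loop simply never returns.

```shell
#!/usr/bin/env bash
# Sketch: re-run a command back to back until a time budget is used up.
# run_for is an illustrative helper, not part of the plan's tooling.

# run_for DURATION_S CMD [ARG...]
# Starts CMD repeatedly while fewer than DURATION_S seconds have elapsed.
# Prints the number of completed runs; if a run fails, reports it on stderr
# and returns that run's exit status.
run_for() {
    local duration=$1; shift
    local deadline=$(( SECONDS + duration ))   # SECONDS is bash's elapsed-time counter
    local runs=0
    while (( SECONDS < deadline )); do
        "$@" || { local rc=$?; echo "run $((runs + 1)) failed with status $rc" >&2; return "$rc"; }
        runs=$(( runs + 1 ))
    done
    echo "$runs"
}

# Usage for the 20-minute headless loop (not executed here):
#   run_for 1200 ffmpeg -hwaccel drm -i ~/measurements/encoded/bbb_60s_720p.hevc.mp4 -f null -
```

The deadline is only checked between runs, so the last decode may finish slightly past the budget.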
  ## Risk register additions vs v2

@@ -70,7 +80,8 @@ Same as v2: 0004 → 0005 → 0006 → 0007-v2. Between each: reboot + smoke + h
  | R7 | KASAN+KCSAN combined boot too slow → systemd timeout → boot failure | Multiple fallback labels in extlinux: `arch_devices_lockdep_kasan` (new), `arch_devices_lockdep` (no kasan), `arch_devices` (default vanilla). User picks at U-Boot. systemd unit timeouts may need extension via `DefaultTimeoutStartSec=` if observed |
  | R8 | KASAN region overlaps with our ramoops carve-out (0x10000000-0x100fffff) | KASAN shadow region uses high VA, not low PA. Carve-out is from real physical RAM, KASAN shadow is in vmalloc space. No conflict expected — verify with `/proc/iomem` post-boot |
  | R9 | KASAN tags trigger false positives on pre-existing kernel code (e.g. panel-edp's WARN) → noise drowns real signal | KASAN reports are distinctive ("BUG: KASAN:" prefix). Grep filtering should isolate. If false-positive volume is overwhelming, switch to KASAN_SW_TAGS or selectively disable instrumentation per-file |
- | R10 | GPU-compositor wedge reproduction requires physical login (no remote-wayland start) | User has physical access to ampere for this test step (per v2 plan A4 conditional). Time-boxed: if wedge not reproduced in ~10 min of compositor activity, declare iter6 v3 inconclusive |
+ | R10 | GPU-compositor wedge reproduction requires physical login (no remote-wayland start) | User has physical access to ampere for this test step (per v2 plan A4 conditional). Time-boxed: if wedge not reproduced in ~20 min of compositor activity (per round-3 amendment), declare iter6 v3 inconclusive |
+ | R11 | /boot/firmware partition headroom for a third Image (KASAN_OUTLINE adds 15-25 % size) | Pre-flight `df -h /boot/firmware`. Need ≥ 30 MB free. If tight: overwrite the lockdep Image instead of keeping both — lockdep label becomes the lockdep-kasan label after rebuild |

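Review note on R11: `df -h` output has to be read by eye, so the pre-flight could instead compare the available space against the 30 MB threshold and fail loudly. A minimal sketch, assuming POSIX `df -Pk` output (available space in 1 KiB blocks in the fourth column) and a mount source name without spaces. `free_mb`, `has_headroom` and `REQUIRED_MB` are hypothetical names.

```shell
#!/usr/bin/env bash
# Sketch: scripted version of the R11 headroom pre-flight.
# Helper names are illustrative; the 30 MB threshold comes from the R11 row.

REQUIRED_MB=30

# free_mb PATH: whole megabytes available on the filesystem that holds PATH.
# Uses the POSIX output format (-P) in 1 KiB units (-k); column 4 is "Available".
free_mb() {
    df -Pk "$1" | awk 'NR==2 { print int($4 / 1024) }'
}

# has_headroom FREE_MB REQUIRED_MB: succeed only if FREE_MB >= REQUIRED_MB.
has_headroom() {
    [ "$1" -ge "$2" ]
}

# Usage on the target (not executed here):
#   if has_headroom "$(free_mb /boot/firmware)" "$REQUIRED_MB"; then
#       echo "OK: room for a third Image"
#   else
#       echo "TIGHT: overwrite the lockdep Image instead of keeping both" >&2
#   fi
```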
  ## Estimated wall-time

@@ -88,6 +99,6 @@ Same as v2: 0004 → 0005 → 0006 → 0007-v2. Between each: reboot + smoke + h

  **Success**: KASAN or KCSAN report fires during step 2.1's GPU-compositor test, pinpointing the bug class + responsible code. → file kernel-agent issue, fix in 0004 v2 OR upstream, retry.

- **Partial**: 0004 v3 (with KASAN) still wedges silently with NO KASAN/KCSAN report. Means bug is OUTSIDE the sanitizer-instrumented memory (e.g. firmware blob, GPU hardware state machine). Switch to serial console for live trace (final fallback).
+ **Partial**: 0004 v3 (with KASAN) still wedges silently with NO KASAN/KCSAN report. Means bug is OUTSIDE the sanitizer-instrumented memory (e.g. firmware blob, GPU hardware state machine). **Final-fallback netconsole** (round-3 amendment replaces serial): before triggering the wedge, `sudo modprobe netconsole netconsole=6666@<ampere-ip>/eth0,6667@<boltzmann-ip>/<boltzmann-mac>` (the field after the target IP is the receiver's MAC address, not an interface name; leave it empty to broadcast); on boltzmann run `nc -u -l 6667 | tee netconsole-iter6v3-$(date +%s).log`. Netconsole survives kernel panic that consumes the display, requires no hardware-clip work, uses the existing 7.0-lockdep kernel as-is. Serial console remains as documented last-resort only.

  **Failure**: KASAN+KCSAN kernel doesn't boot at all → fall back to lockdep kernel, declare iter6 line closed, move iter7 to non-fence cache-coherency angle.
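Review note on the netconsole fallback: the module parameter is easy to get wrong by hand. Per the kernel's netconsole documentation the layout is `src-port@src-ip/dev,tgt-port@tgt-ip/tgt-mac`, where the last field is the receiver's MAC address and may be left empty for broadcast. A small sketch that assembles the string from its parts follows. `netconsole_param` is a hypothetical helper, and the addresses in the usage line are documentation-range placeholders, not the real ampere or boltzmann addresses.

```shell
#!/usr/bin/env bash
# Sketch: build the netconsole= module parameter from its fields.
# netconsole_param is an illustrative helper; all addresses are placeholders.

# netconsole_param SRC_PORT SRC_IP DEV TGT_PORT TGT_IP [TGT_MAC]
# TGT_MAC is optional; an empty value leaves the field blank, which the
# kernel treats as the broadcast address.
netconsole_param() {
    local src_port=$1 src_ip=$2 dev=$3 tgt_port=$4 tgt_ip=$5 tgt_mac=${6:-}
    printf 'netconsole=%s@%s/%s,%s@%s/%s\n' \
        "$src_port" "$src_ip" "$dev" "$tgt_port" "$tgt_ip" "$tgt_mac"
}

# Usage on the sender (not executed here; 192.0.2.x are placeholder addresses):
#   sudo modprobe netconsole "$(netconsole_param 6666 192.0.2.10 eth0 6667 192.0.2.20)"
```

Loading the module and confirming that a test `printk` line reaches the receiver before provoking the wedge avoids discovering a typo only after the board has hung.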