ampere-kernel-decoders/phase4_plan_iter6_v3.md
Markus Fritsche 96e2d439c9 iter6 v3 plan: apply Sonnet round-3 amendments
6 targeted fixes, no rewrite:
1. Add DEBUG_ATOMIC_SLEEP + DEBUG_LIST explicitly to config delta
2. Drop PAGE_POISONING_NO_SANITY (keep the sanity check)
3. Fix mpv hwdec flag (vaapi-copy is SW! use v4l2m2m or rkmpp).
   Extend monitoring window 2min → 20min (KASAN slows runtime 3-8x).
   Add glmark2 + ffmpeg-null concurrent option.
4. Add R11: /boot/firmware headroom pre-flight (KASAN_OUTLINE +15-25% Image size).
5. Replace serial fallback with netconsole (no clip work needed).
6. Add H4 isolation exit path: headless-only DRM ffmpeg loop if
   compositor test silent — isolates panthor from vb2.

All round-3 ACCEPT/AMENDs applied. v3 ready for execution.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-16 19:28:19 +00:00

Phase 4 — iter6 v3 plan (KASAN+KCSAN delta on v2 substrate)

Plan ID: iter6-v3-kasan-kcsan
Author: Claude Opus 4.7
Status: Pending Phase 5 architect delta-review
Cross-ref: phase4_plan_iter6_postmortem_v2.md (commit 1656f84, executed), iter6_v2_attempt2_close.md (commit 53e2477)

Why v3

v2 attempt-2 found that 0004 alone passes headless HEVC cleanly but wedges silently about 10 min later under GPU compositor load with concurrent vb2 traffic. PROVE_LOCKING plus 9 other lock-debug flags did NOT fire before the wedge, so this is not a classical lock-order inversion of the kind lockdep models.

Refined hypothesis (4 candidates) needs different sanitizers:

  • H1' cross-context wait not lockdep-visible → KCSAN
  • H2 use-after-free / double-signal of dma_fence → KASAN
  • H3 race in fence alloc/init → KCSAN
  • H4 panthor consumer-side bug → headless-only repro path

Delta from v2 plan

All v2 pre-flight + lockdep config + ramoops setup is preserved (already executed and validated). v3 adds:

Kernel config additions (step 1.2-DELTA)

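# KCSAN_REPORT_ONCE_IN_MS is an integer option and LOCALVERSION a string,
# so they need --set-val / --set-str rather than --enable.
./scripts/config --enable KASAN --enable KASAN_GENERIC \
                 --enable KASAN_OUTLINE \
                 --enable KCSAN --set-val KCSAN_REPORT_ONCE_IN_MS 2000 \
                 --enable DEBUG_PREEMPT \
                 --enable DEBUG_ATOMIC_SLEEP \
                 --enable DEBUG_OBJECTS --enable DEBUG_OBJECTS_RCU_HEAD \
                 --enable DEBUG_LIST \
                 --enable SLUB_DEBUG --enable SLUB_DEBUG_ON \
                 --enable PAGE_POISONING \
                 --set-str LOCALVERSION "-lockdep-kasan"
# Resolve dependencies, then confirm both sanitizers survived. Mainline
# lib/Kconfig.kcsan has KCSAN "depends on ... !KASAN", so one of the two may
# be dropped silently; if so, KASAN and KCSAN need two separate kernel builds.
make olddefconfig
./scripts/config --state KASAN --state KCSAN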

Notes:

  • KASAN_GENERIC (not SW_TAGS): generic mode works on all aarch64
  • KASAN_OUTLINE (not INLINE): checks are emitted as function calls rather than inlined, so the kernel runs slower but the Image is smaller than an INLINE build
  • KCSAN_REPORT_ONCE_IN_MS=2000: rate-limit reports so the console doesn't drown
  • SLUB_DEBUG_ON: enable all SLUB debug at boot (poisoning, redzone, owner-tracking)
  • DEBUG_OBJECTS: catches double-init / use-after-free of kernel objects registered with debugobjects (the plan assumes dma_fences are among them; confirm this, since only subsystems that register, such as timers, work items and RCU heads, are covered)
  • DEBUG_ATOMIC_SLEEP: re-enable explicitly per round-3 review (don't assume the v2 setting carries over)
  • DEBUG_LIST: catches list_head corruption at the point of corruption (dma_fence + vb2 use list_heads heavily)
  • PAGE_POISONING without PAGE_POISONING_NO_SANITY: keep the sanity check that verifies the poison pattern (round-3 review: leaving _NO_SANITY unset gives more information)
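
Kconfig can drop a requested option without any error when its dependencies are not met, so it may be worth checking the resolved .config before starting a 75-90 min build. The following is a minimal sketch, not part of the original plan; the helper name missing_opts is hypothetical.

```shell
# Sketch: print every requested option that is NOT set to "y" in a kernel
# .config. missing_opts is a hypothetical helper name.
# Usage: missing_opts <config-file> <OPT>...   (OPT without the CONFIG_ prefix)
missing_opts() {
    cfg=$1
    shift
    for opt in "$@"; do
        # An enabled bool option appears as the exact line CONFIG_<OPT>=y;
        # -x matches whole lines, so KASAN does not match KASAN_GENERIC.
        grep -qx "CONFIG_${opt}=y" "$cfg" || printf '%s\n' "$opt"
    done
}
```

After the config step, something like `missing_opts .config KASAN KASAN_GENERIC KASAN_OUTLINE KCSAN DEBUG_ATOMIC_SLEEP DEBUG_LIST` should print nothing; any name it prints was dropped or never enabled.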

Build (step 1.3-DELTA)

time make -j8 Image modules dtbs — estimated 75-90 min (KASAN+KCSAN compile slow)

Install (step 1.4-1.7-DELTA)

  • Modules install to /lib/modules/7.0.0-rc3-lockdep-kasan+/ (separate via LOCALVERSION)
  • Image installs to /boot/firmware/Image-7.0.0-rc3-lockdep-kasan+
  • Initramfs for new release
  • Copy lockdep DTB (with ramoops node) to /boot/firmware/rk3588-coolpi-cm5-genbook.dtb-7.0.0-rc3-lockdep-kasan+
  • ADD third extlinux label arch_devices_lockdep_kasan. Keep arch_devices (vanilla) as default. Keep arch_devices_lockdep (without KASAN) as a fallback if KASAN kernel doesn't boot
  • User picks arch_devices_lockdep_kasan at U-Boot menu (requires HDMI for first boot — OR temporarily flip default for the test)

Bisect-apply (step 2.x-DELTA)

Same as v2: 0004 → 0005 → 0006 → 0007-v2. Between each: reboot + smoke + harvest journal/pstore for KASAN/KCSAN reports.

Critical addition for step 2.1: after the headless smoke test passes, intentionally trigger GPU compositor activity to reproduce the v2 wedge condition:

  1. Log into plasma wayland at SDDM prompt (manual, physical)
  2. Start a video that ACTUALLY uses the V4L2 hardware decode path (round-3 review correction: vaapi-copy is SW decode, doesn't drive vb2). Use:
    • mpv --hwdec=v4l2m2m ~/measurements/encoded/bbb_60s_720p.hevc.mp4, OR
    • mpv --hwdec=rkmpp ~/measurements/encoded/bbb_60s_720p.hevc.mp4, OR
    • Concurrent: glmark2 & (GPU compositor load) plus ffmpeg -hwaccel drm -i ~/measurements/encoded/bbb_60s_720p.hevc.mp4 -f null - (V4L2 decode load) in a tight loop
  3. Watch for at least 20 minutes. The v2 wedge appeared after ~10 min, and KASAN+KCSAN slow runtime 3-8x, so the effective wedge time may be 30-80 min wallclock. Treat 20 min as the minimum watch window, not as the expected time to wedge
  4. From ssh: journalctl -k -f | grep -E "KASAN|KCSAN|BUG:|race|use-after|list_add|list_del|object .* not init" piped live for early-warning report capture
  5. If a report fires, capture it AND sudo cat /sys/fs/pstore/* (in case panic-trigger followed) then reboot back to lockdep-non-kasan kernel for analysis
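
The live filter in step 4 matches broad words such as "race", which can also hit unrelated log lines. A narrower variant is sketched below; it is an illustration rather than the plan's command, the function name filter_reports is hypothetical, and the patterns are the report prefixes these debug options are known to print (verify them against the running kernel).

```shell
# Sketch: keep only lines that look like sanitizer / debug-option reports.
# filter_reports is a hypothetical helper name. --line-buffered (GNU grep)
# keeps output live when the result is piped onward, e.g. into tee.
filter_reports() {
    grep -E --line-buffered \
        'BUG: (KASAN|KCSAN)|ODEBUG:|list_(add|del) corruption|sleeping function called from invalid context'
}
```

From ssh this could be used as `journalctl -k -f | filter_reports | tee sanitizer-live.log`, where the log file name is only an example.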

Alternative H4 isolation path (if step 2.1 produces no KASAN/KCSAN report under the plasma session): re-run headless-only (no compositor) with ffmpeg -hwaccel drm -i ~/measurements/encoded/bbb_60s_720p.hevc.mp4 -f null - looped for 20 min. If it wedges headless, the bug is on the vb2 fence side. If it stays clean headless, the bug is H4: specific to the panthor compositor path, and vb2 is not at fault.
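
The 20 min headless loop can be driven by a small time-boxed wrapper. This is a sketch under the assumption that a plain wall-clock budget is sufficient; loop_for is a hypothetical helper name, not something the plan defines.

```shell
# Sketch: re-run a command until the time budget expires; stop early if a
# run exits non-zero. loop_for is a hypothetical helper name.
# Usage: loop_for <seconds> <command> [args...]
loop_for() {
    budget=$1
    shift
    start=$(date +%s)
    runs=0
    while [ $(( $(date +%s) - start )) -lt "$budget" ]; do
        "$@" || { echo "run $((runs + 1)) failed" >&2; return 1; }
        runs=$((runs + 1))
    done
    echo "$runs clean runs"
}
```

For the H4 path this would be `loop_for 1200 ffmpeg -hwaccel drm -i ~/measurements/encoded/bbb_60s_720p.hevc.mp4 -f null -`. The budget is only checked between runs, and a wedge shows up as a hang rather than a failed run, so the session still has to be watched from a second terminal.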

Risk register additions vs v2

| # | Risk | Mitigation |
|---|---|---|
| R7 | KASAN+KCSAN combined boot too slow → systemd timeout → boot failure | Multiple fallback labels in extlinux: arch_devices_lockdep_kasan (new), arch_devices_lockdep (no kasan), arch_devices (default vanilla). User picks at U-Boot. systemd unit timeouts may need extension via DefaultTimeoutStartSec= if observed |
| R8 | KASAN region overlaps with our ramoops carve-out (0x10000000-0x100fffff) | KASAN shadow occupies its own kernel virtual address range, and its backing pages come from the normal boot-time allocator, which should skip the reserved ramoops carve-out. No conflict expected; verify with /proc/iomem post-boot |
| R9 | KASAN trips on pre-existing kernel code (e.g. panel-edp's WARN) → noise drowns the real signal | KASAN reports are distinctive ("BUG: KASAN:" prefix). Grep filtering should isolate them. If false-positive volume is overwhelming, switch to KASAN_SW_TAGS or selectively disable instrumentation per-file |
| R10 | GPU-compositor wedge reproduction requires physical login (no remote-wayland start) | User has physical access to ampere for this test step (per v2 plan A4 conditional). Time-boxed: if wedge not reproduced in ~20 min of compositor activity (per round-3 amendment), declare iter6 v3 inconclusive |
| R11 | /boot/firmware partition headroom for a third Image (KASAN_OUTLINE adds 15-25 % size) | Pre-flight df -h /boot/firmware. Need ≥ 30 MB free. If tight: overwrite the lockdep Image instead of keeping both, so the lockdep label becomes the lockdep-kasan label after rebuild |
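
The R11 pre-flight can be turned into a pass/fail check instead of reading df -h by eye. A minimal sketch follows; has_headroom is a hypothetical helper name, and the 30 MB threshold from R11 is interpreted here as MiB.

```shell
# Sketch: succeed if the filesystem holding <path> has at least <min_mb> MiB
# available. has_headroom is a hypothetical helper name.
# Usage: has_headroom <path> <min_mb>
has_headroom() {
    # df -Pk: POSIX single-line output in 1024-byte blocks;
    # column 4 of the second line is the "Available" figure.
    avail_kb=$(df -Pk "$1" | awk 'NR == 2 { print $4 }')
    [ "$avail_kb" -ge $(( $2 * 1024 )) ]
}
```

Before the install step: `has_headroom /boot/firmware 30 || echo "tight: overwrite the lockdep Image instead of adding a third"`.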

Estimated wall-time

| Phase | Time |
|---|---|
| Pre-flight + config delta | 5 min (config additions only, no new pre-flight) |
| KASAN+KCSAN kernel build | 75-90 min |
| Install + reboot + lockdep-kasan baseline | 10 min |
| Apply 0004 + module rebuild + install + reboot + headless smoke | 15 min |
| GPU-compositor concurrent test + monitor for KASAN/KCSAN reports | 20 min (minimum watch window from step 2.1) |
| Subtotal for 0004 verdict | ~125-140 min wall-time |
| 0005/0006/0007 follow-ups if 0004 verdict is "clean under KASAN" | +30 min each |

Exit conditions

Success: KASAN or KCSAN report fires during step 2.1's GPU-compositor test, pinpointing the bug class + responsible code. → file kernel-agent issue, fix in 0004 v2 OR upstream, retry.

Partial: 0004 v3 (with KASAN) still wedges silently with NO KASAN/KCSAN report. This means the bug is OUTSIDE the sanitizer-instrumented memory (e.g. firmware blob, GPU hardware state machine). Final-fallback netconsole (round-3 amendment replaces serial): before triggering the wedge, sudo modprobe netconsole netconsole=6666@<ampere-ip>/eth0,6667@<boltzmann-ip>/<boltzmann-mac> (the field after the target IP is the receiver's MAC address, not an interface name; leave it empty to broadcast); on boltzmann run nc -u -l 6667 | tee netconsole-iter6v3-$(date +%s).log. Netconsole keeps logging through a kernel panic that takes out the display, requires no hardware-clip work, and uses the existing 7.0-lockdep kernel as-is. Serial console remains as documented last-resort only.
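
The netconsole parameter is easy to get wrong by hand because the two halves take different trailing fields: the sender half ends in an interface name, the receiver half in a MAC address. A small sketch that assembles it with the plan's ports (6666 and 6667) is shown below; netconsole_param is a hypothetical helper name.

```shell
# Sketch: build the netconsole= module parameter in the documented form
#   [src-port]@[src-ip]/[dev],[tgt-port]@<tgt-ip>/[tgt-macaddr]
# netconsole_param is a hypothetical helper name; ports are the plan's.
# Usage: netconsole_param <src-ip> <src-dev> <tgt-ip> [tgt-mac]
netconsole_param() {
    # An omitted MAC leaves the last field empty, which netconsole treats
    # as the broadcast address.
    printf 'netconsole=6666@%s/%s,6667@%s/%s\n' "$1" "$2" "$3" "${4:-}"
}
```

Usage would then be `sudo modprobe netconsole "$(netconsole_param <ampere-ip> eth0 <boltzmann-ip> <boltzmann-mac>)"` with the real addresses filled in.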

Failure: KASAN+KCSAN kernel doesn't boot at all → fall back to lockdep kernel, declare iter6 line closed, move iter7 to non-fence cache-coherency angle.