ampere-kernel-decoders/phase4_plan_iter6_postmortem_v2.md
Markus Fritsche 1656f84a40 iter6 plan v2: log A5 execution deviation — PROVE_RCU forced =y by kconfig
CONFIG_PROVE_RCU is selected by CONFIG_PROVE_LOCKING via Kconfig
dependency. Cannot disable without disabling PROVE_LOCKING itself
(which would defeat the purpose). Documenting the execution deviation
from A5: PROVE_RCU=y remains. 10-min HW watchdog has plenty of margin.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-16 16:54:38 +00:00


Phase 4 — iter6 post-mortem retry plan (v2, amended per Phase 5 architect review)

Plan ID: iter6-postmortem-attempt2
Author: Claude Opus 4.7
Reviewer: Sonnet (architect, Phase 5 round 1, returned 5 amendments + 0007 REJECT)
Status: v2 pending Phase 5 round 2 (delta-review) before any execution
Cross-ref: phase0_findings_iter6_postmortem.md (commit 11d2dde), phase4_plan_iter6_postmortem.md (commit 0ef6440, superseded)

Goal

Same as v1. Apply 0004 / 0005 / 0006 / 0007-v2 to ampere kernel WITHOUT silent watchdog reset. Either succeed, or fail with a self-diagnostic lockdep splat that survives to logs.

Amendments vs v1 (architect findings)

| # | Finding | Resolution |
|---|---|---|
| A1 | 0007 v1 placement (in rkvdec_device_run) inconsistent with hantro's pattern + metadata-copy ordering wrong | 0007 v2 written: 7 per-codec `*_run()` insertion points (h264, hevc, vdpu381-h264, vdpu381-hevc, vdpu383-h264, vdpu383-hevc, vp9), each AFTER preamble + BEFORE HW kick. supports_release_fences = true stays in queue_init. See 0007-v2-rkvdec-opt-in.patch (8 hunks, 25 +lines) |
| A2 | Missing risk: panthor ww_acquire_ctx contention on same resv as vb2 dma_resv_lock(NULL) under dma_fence_begin_signalling (primary RK3588 wedge hypothesis) | Added to risk register as Primary Hypothesis H1. Step 2.1's smoke test extended to include GPU compositor exercise (open kwin, mpv playback with hwdec=vaapi, drm-vblank-trigger pattern), not just headless ffmpeg |
| A3 | Step 1.5 modules_install overwrites working modules on `uname -r` collision | Added CONFIG_LOCALVERSION="-lockdep" to step 1.2 config block → distinct release suffix 7.0.0-rc3-devices-lockdep+ → separate `/lib/modules/.../`. Original modules untouched |
| A4 | Step 0.4 "consider adding ramoops" too soft: silent watchdog reset can't be diagnosed without pre-reset capture path | Step 0.4 reworked as HARD GATE. Pre-flight aborts if neither serial console (TTL-USB cable confirmed connected and capturing) NOR ramoops region (verified writable + producing test entry) is functional. User decision input required before pre-flight starts |
| A5 | CONFIG_PROVE_RCU slows boot + may push past watchdog before lockdep prints | Attempted removal, but PROVE_RCU is selected by PROVE_LOCKING via Kconfig and can't be disabled without disabling PROVE_LOCKING itself. Execution deviation logged: PROVE_RCU=y remains. Mitigation: 10-min HW watchdog gives plenty of margin even with PROVE_RCU enabled. Kept: PROVE_LOCKING, DEBUG_ATOMIC_SLEEP, LOCKDEP, DEBUG_RT_MUTEXES, DEBUG_SPINLOCK, DEBUG_MUTEXES, DEBUG_LOCK_ALLOC, PROVE_RAW_LOCK_NESTING, DEBUG_WW_MUTEX_SLOWPATH, PSTORE_CONSOLE, PSTORE_PMSG, PSTORE_RAM=m |
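
The A3 guard can be checked mechanically before anything is installed. The following is a minimal sketch, not part of the reviewed plan: it assumes the candidate release string comes from `make -s kernelrelease` in the build tree and the working one from `uname -r`, and the helper name `release_is_distinct` is hypothetical.

```shell
# Hypothetical helper: succeed only if the candidate kernel release carries the
# "-lockdep" suffix and differs from the release whose modules must survive.
release_is_distinct() (
    candidate=$1   # e.g. output of: make -s kernelrelease
    working=$2     # e.g. output of: uname -r on the working kernel
    [ -n "$candidate" ] || return 1
    [ "$candidate" != "$working" ] || return 1
    case $candidate in
        *-lockdep*) return 0 ;;
        *) return 1 ;;
    esac
)
```

A possible use is `release_is_distinct "$(make -s kernelrelease)" "$(uname -r)" && sudo make modules_install`, so that a missing or mistyped LOCALVERSION stops the install instead of overwriting the working modules.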

Pre-flight (REWRITTEN, hard gates marked)

| Step | Action | Verify | Gate |
|---|---|---|---|
| 0.1 | SDDM auto-login disabled | Done: `/etc/sddm.conf.d/autologin.conf.disabled-iter6postmortem` | |
| 0.2 | Backup `/lib/modules/7.0.0-rc3-devices+/kernel/drivers/media/{common/videobuf2,platform/verisilicon,platform/rockchip/{rga,rkvdec}}/*.ko` as `attempt2-pre-base-<ts>.bkp` AND scp tarball to `boltzmann:/home/mfritsche/iter6-postmortem-backups/` | `ls` shows .bkp + scp returned 0 | HARD GATE: abort if backup write fails |
| 0.3 | Backup `/boot/firmware/Image-7.0.0-rc3-devices+` and `initramfs-7.0.0-rc3-devices+` as `*.pre-attempt2.bkp` | `ls` | HARD GATE |
| 0.4a | Probe current kernel: CONFIG_PSTORE_RAM=m confirmed on ampere; CONFIG_PSTORE_CONSOLE currently n (will be enabled in step 1.2). `/sys/fs/pstore/` exists + empty | survey only | informational |
| 0.4b | Memory layout survey: `cat /proc/iomem`. Plan ramoops carve-out at 0x10000000 (1 MB inside the normal-RAM gap between 0x4500000 and 0xd7a00000) | iomem confirms region is "System RAM", not pre-reserved | informational |
| 0.4c | NOTE: User chose ramoops over serial and accepted the documented limitation that hard spinlock-with-IRQ-off deadlocks may not flush to pstore before watchdog reset. Serial would capture those; ramoops may not | acknowledged | informational |
| 0.4-G | HARD GATE moved to step 1.8: ramoops verification is only possible after the lockdep kernel boots with PSTORE_CONSOLE=y. Pre-flight does NOT abort here. Step 1.8 runs `echo c > /proc/sysrq-trigger` to force a crash, reboots, and verifies `/sys/fs/pstore/` is non-empty; if empty, ABORT before bisect-apply 2.x and reconsider serial | step 1.8 enforces | mandatory at 1.8 |
| 0.5 | Verify ampere `~/src/linux-rockchip` working tree state: iter3+iter4+diag patches present, iter6 v1 reverted (per recovery), `git status` shows expected files modified | `git status` output matches | informational |
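
The 0.4b survey can be turned into a repeatable check. This is a sketch under stated assumptions: the helper name `region_is_free_ram` is hypothetical, it only understands the `start-end : name` layout of `/proc/iomem`, and it treats any nested entry that overlaps the candidate range as "in use". Real addresses are only visible when `/proc/iomem` is read as root.

```shell
# Hypothetical helper: succeed if [start, start+size) sits inside a top-level
# "System RAM" range of an iomem-style listing and no nested entry overlaps it.
# Usage: region_is_free_ram START SIZE [FILE]   (FILE defaults to /proc/iomem)
region_is_free_ram() (
    start=$(($1))
    end=$((start + $2 - 1))
    file=${3:-/proc/iomem}
    covered=1   # 1 = no covering System RAM range seen yet (shell "false")
    while IFS= read -r line; do
        case $line in *" : "*) ;; *) continue ;; esac
        case $line in " "*) nested=yes ;; *) nested=no ;; esac
        range=${line%% : *}
        range=${range##* }          # drop leading indentation
        name=${line#* : }
        lo=$((0x${range%-*}))
        hi=$((0x${range#*-}))
        if [ "$nested" = no ]; then
            # Top-level entry: does a System RAM range contain the whole region?
            if [ "$name" = "System RAM" ] && [ "$lo" -le "$start" ] && [ "$hi" -ge "$end" ]; then
                covered=0
            fi
        elif [ "$lo" -le "$end" ] && [ "$hi" -ge "$start" ]; then
            # Nested entry (kernel code, reserved, ...) overlaps the region.
            return 1
        fi
    done < "$file"
    return "$covered"
)
```

For the carve-out in this plan the call would be `sudo sh -c '. ./check.sh; region_is_free_ram 0x10000000 0x100000'`, where `check.sh` is a placeholder name for wherever the helper is saved.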

Build the lockdep debug base kernel (~45 min one-time)

| Step | Action |
|---|---|
| 1.1 | `cp .config .config.pre-iter6postmortem` |
| 1.2 | `./scripts/config --enable PROVE_LOCKING --enable DEBUG_ATOMIC_SLEEP --enable LOCKDEP --enable DEBUG_RT_MUTEXES --enable DEBUG_SPINLOCK --enable DEBUG_MUTEXES --enable DEBUG_LOCK_ALLOC --enable PROVE_RAW_LOCK_NESTING --enable DEBUG_WW_MUTEX_SLOWPATH` (PROVE_RCU is not disabled: it is forced =y by PROVE_LOCKING, see the A5 deviation). Ramoops/pstore: `./scripts/config --enable PSTORE_CONSOLE --enable PSTORE_PMSG --module PSTORE_RAM` (PSTORE_RAM stays =m; PSTORE_CONSOLE=y captures kernel printk to the ramoops region). Set CONFIG_LOCALVERSION="-lockdep" (A3). `make olddefconfig` |
| 1.3 | `time make -j8 Image modules dtbs` (~45 min) |
| 1.4 | `sudo make modules_install` using the kernel's default target, without INSTALL_MOD_PATH. It respects LOCALVERSION and installs to `/lib/modules/7.0.0-rc3-devices-lockdep+/`, separate from the working kernel's modules. Verify the destination path before running sudo |
| 1.5 | `sudo cp arch/arm64/boot/Image /boot/firmware/Image-7.0.0-rc3-devices-lockdep+`. Generate initramfs for the new release: `sudo mkinitcpio -k 7.0.0-rc3-devices-lockdep+ -g /boot/firmware/initramfs-7.0.0-rc3-devices-lockdep+` |
| 1.6 | Backup extlinux.conf, then sudo edit: ADD new label arch_devices_lockdep pointing at the new Image + initrd. Since we're ramoops-only (no serial), per round 2 Q3 the DEFAULT arch_devices_lockdep override is MANDATORY for this test boot. Restore DEFAULT arch_devices before any subsequent reboot where lockdep boot is not desired. Ramoops cmdline: extend the lockdep label's append line with `memmap=0x100000$0x10000000 ramoops.mem_address=0x10000000 ramoops.mem_size=0x100000 ramoops.record_size=0x10000 ramoops.console_size=0x40000 ramoops.dump_oops=1`. Carves 1 MB at 256 MB physical → 256 KB console buffer + 11 × 64 KB oops records |
| 1.7 | Reboot. Since DEFAULT was set to arch_devices_lockdep in step 1.6, U-Boot loads the lockdep kernel automatically. Verify `uname -r` = 7.0.0-rc3-devices-lockdep+. Verify the journal has "Lockdep is enabled". `sudo modprobe pstore_ram` (or verify auto-loaded). Verify `ls /sys/fs/pstore/` after triggering a benign console write to confirm the ramoops region is accessible |
| 1.8a | Ramoops verify (HARD GATE per 0.4-G): `sync; sudo bash -c "echo c > /proc/sysrq-trigger"`. Ampere panics → reboots (still on lockdep default). After ssh comes back, check `ls /sys/fs/pstore/`: it must contain at least one `dmesg-ramoops-*` and one `console-ramoops-*` file. If empty, ABORT: ramoops doesn't capture; reconsider serial cable. Backup pstore contents to `/home/mfritsche/iter6-postmortem-ramoops-verify/`, then `sudo rm /sys/fs/pstore/*` to clear for the actual tests |
| 1.8b | Smoke test: ffmpeg HEVC decode (iter5 baseline), check `journalctl -k -p warning -b 0` for any new lockdep splats produced by iter3+4 alone. Expectation: clean (or only pre-existing edp/vblank WARNs) |
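
Two parts of the steps above lend themselves to a small script: the record-count arithmetic behind the step 1.6 cmdline and the file check that forms the step 1.8a gate. The sketch below is illustrative only. Both function names are hypothetical, and the record count assumes that the ramoops ftrace and pmsg zones each take 4096 bytes when no size is given on the command line; pass explicit values if the target kernel behaves differently.

```shell
# Number of oops records that fit in the carve-out:
# (mem_size - console_size - ftrace_size - pmsg_size) / record_size, rounded down.
# Assumption: ftrace_size and pmsg_size default to 4096 bytes each.
# Usage: ramoops_record_count MEM_SIZE CONSOLE_SIZE RECORD_SIZE [FTRACE] [PMSG]
ramoops_record_count() (
    mem=$(($1)); console=$(($2)); record=$(($3))
    ftrace=$((${4:-4096})); pmsg=$((${5:-4096}))
    echo $(( (mem - console - ftrace - pmsg) / record ))
)

# Hypothetical gate for step 1.8a: succeed only if the pstore directory holds
# at least one dmesg-ramoops-* file and at least one console-ramoops-* file.
# Usage: pstore_gate [DIR]   (DIR defaults to /sys/fs/pstore)
pstore_gate() (
    dir=${1:-/sys/fs/pstore}
    have_dmesg=no; have_console=no
    for f in "$dir"/dmesg-ramoops-*; do
        if [ -e "$f" ]; then have_dmesg=yes; fi
    done
    for f in "$dir"/console-ramoops-*; do
        if [ -e "$f" ]; then have_console=yes; fi
    done
    [ "$have_dmesg" = yes ] && [ "$have_console" = yes ]
)
```

With the sizes from step 1.6, `ramoops_record_count 0x100000 0x40000 0x10000` prints 11, which matches the "11 × 64 KB" figure: the two 4 KB zones are why a 768 KB remainder does not yield 12 records. After the forced crash, `pstore_gate || echo "ABORT: ramoops did not capture"` expresses the gate.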

1.9 — A2 GPU smoke test: open SDDM login → log in to plasma wayland → open mpv with --hwdec=vaapi-copy ~/measurements/encoded/bbb_60s_720p.hevc.mp4. Watch ~30 seconds. Check journal again. Reason: A2 hypothesis is panthor + kwin + V4L2 dmabuf contention, which only surfaces under active GPU composition. If iter3+4 alone (no fence helper yet) emits a lockdep splat with GPU active, the bug is even more upstream than iter6 patches — STOP and investigate.
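
The "check journal" part of the smoke tests can be made less dependent on eyeballing. A minimal sketch, assuming the kernel log is piped in on stdin: the function name is hypothetical, and the pattern list covers common lockdep and atomic-sleep report headers but is not exhaustive.

```shell
# Hypothetical helper: count log lines on stdin that look like the header of a
# lockdep / debug-locking report. The pattern list is illustrative, not complete.
count_lockdep_splats() {
    grep -c -E \
        -e 'possible circular locking dependency detected' \
        -e 'possible recursive locking detected' \
        -e 'possible irq lock inversion dependency detected' \
        -e 'inconsistent lock state' \
        -e 'suspicious RCU usage' \
        -e 'BUG: Invalid wait context' \
        -e 'BUG: sleeping function called from invalid context' \
        || true   # grep exits 1 when the count is 0; that is not an error here
}
```

A typical call would be `journalctl -k -b 0 | count_lockdep_splats`, run once after step 1.8b and again after step 1.9, so that a non-zero count with the GPU active but zero without it points at the compositor path.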

Bisect-apply iter6 patches on lockdep base

Order: 0004 → 0005 → 0006 → 0007-v2. Between each: reboot into lockdep kernel → smoke test → if regression, REVERT only that step, capture splat, document, STOP.

| Step | Patch | Build effort | Smoke test |
|---|---|---|---|
| 2.1 | 0004 (vb2 core helper + Kconfig) | full kernel rebuild (~45 min: Kconfig + videobuf2-core.c) | Step 1.8b + Step 1.9 |
| 2.2 | 0005 (hantro opt-in) | hantro_vpu module (~3 min) | Step 1.8b + Step 1.9 |
| 2.3 | 0006 (rockchip-rga opt-in) | rockchip-rga module (~3 min) | Step 1.8b + Step 1.9 |
| 2.4 | 0007-v2 (rkvdec per-codec opt-in) | rockchip-vdec module (~3 min) | Step 1.8b + Step 1.9 |
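
The control flow of the bisect-apply sequence (apply in order, test after each, revert only the failing step, stop) can be sketched as a small driver. This is an illustration, not the procedure's tooling: `apply_step`, `smoke_test` and `revert_step` are hypothetical hooks the operator would have to supply, since each real step involves a rebuild and a reboot that a single shell session cannot span.

```shell
# Hypothetical driver: walk the patches in order and stop at the first
# regression, reverting only the step that failed.
# Caller must define apply_step, smoke_test and revert_step (each takes a patch id).
bisect_apply() {
    for patch in "$@"; do
        if ! apply_step "$patch"; then
            echo "apply-failed:$patch"
            return 1
        fi
        if ! smoke_test "$patch"; then
            revert_step "$patch"
            echo "regressed:$patch"
            return 1
        fi
    done
    echo "all-passed"
}
```

Invoked as `bisect_apply 0004 0005 0006 0007-v2`, it prints either `all-passed` or the first failing step, which is the one to document before stopping.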

Post-bisect

  • All 4 pass on the lockdep kernel: try the same set on the vanilla kernel (rebuild without lockdep). If vanilla also passes, iter6 succeeds; move to iter7 cache-coherency hypothesis test.
  • Specific step regresses: capture lockdep splat via serial or pstore, attach to a new kernel-agent issue (#16), revert that step from lockdep kernel, document. If the failing step is 2.1 (0004 core helper), iter6 line is closed — switch iter7 to a non-fence cache-coherency approach.

Risk register (v2 — added H1)

| # | Risk | Mitigation |
|---|---|---|
| H1 | PRIMARY: panthor ww_acquire_ctx on same resv as vb2's dma_resv_lock(NULL) published inside dma_fence_begin_signalling(). This is a cross-class lockdep violation that is invisible without PROVE_LOCKING on RK3588 + panthor. RK3566 + panfrost validation didn't surface this | LOCKDEP base kernel (step 1.x) is specifically designed to catch this. Step 1.9 + 2.x smoke tests exercise GPU compositor path |
| R2 | Lockdep-debug kernel doesn't boot at all | Step 1.6 adds the lockdep label without removing the vanilla arch_devices label. DEFAULT points at arch_devices_lockdep for the test boot, so falling back to vanilla needs manual menu selection; recovery stick handles total loss |
| R3 | Lockdep-debug kernel boots but `uname -r` collides with vanilla → modules_install overwrites | CONFIG_LOCALVERSION="-lockdep" (A3) makes the suffix distinct; verify the destination path before any sudo install step |
| R4 | Watchdog fires before lockdep splat reaches journal | HARD GATE 0.4-G, enforced at step 1.8a: the ramoops capture path is verified by a forced crash before any 2.x patch is applied; reconsider serial if it fails. Known limitation (0.4c): hard IRQ-off deadlocks may not flush to pstore |
| R5 | 0007-v2 has a different bug than v1 (e.g. run.base.bufs.dst access pattern wrong for some codec) | Phase 5 round 2 reviews v2 source. If reviewer rejects, write v3 informed by feedback |
| R6 | User wants to use ampere during 45-min kernel build → reboot conflict | Per feedback_no_bulldoze_reboots.md: ASK before kicking off |

Estimated wall-time

| Phase | Time |
|---|---|
| Pre-flight (0.1-0.5) + serial confirmation | 15-30 min (depends on serial cable hookup) |
| Lockdep base kernel build (1.1-1.9) | 60 min |
| 2.1 (0004 + Kconfig + full kernel rebuild) | 50 min |
| 2.2 (0005 hantro module) | 10 min |
| 2.3 (0006 rga module) | 10 min |
| 2.4 (0007-v2 rkvdec module) | 10 min |
| Total | ~3-3.5h wall-time clean-path |

Exit conditions

Same as v1. Success / Partial / Failure trees unchanged.