Phase 4 — iter6 post-mortem retry plan (v2, amended per Phase 5 architect review)
Plan ID: iter6-postmortem-attempt2
Author: Claude Opus 4.7
Reviewer: Sonnet (architect, Phase 5 round 1, returned 5 amendments + 0007 REJECT)
Status: v2 pending Phase 5 round 2 (delta-review) before any execution
Cross-ref: phase0_findings_iter6_postmortem.md (commit 11d2dde), phase4_plan_iter6_postmortem.md (commit 0ef6440 — superseded)
Goal
Same as v1. Apply 0004 / 0005 / 0006 / 0007-v2 to ampere kernel WITHOUT silent watchdog reset. Either succeed, or fail with a self-diagnostic lockdep splat that survives to logs.
Amendments vs v1 (architect findings)
| # | Finding | Resolution |
|---|---|---|
| A1 | 0007 v1 placement (in rkvdec_device_run) inconsistent with hantro's pattern + metadata-copy ordering wrong | 0007 v2 written: 7 per-codec *_run() insertion points (h264, hevc, vdpu381-h264, vdpu381-hevc, vdpu383-h264, vdpu383-hevc, vp9), each AFTER preamble + BEFORE HW kick. supports_release_fences = true stays in queue_init. See 0007-v2-rkvdec-opt-in.patch (8 hunks, 25 added lines) |
| A2 | Missing risk: panthor ww_acquire_ctx contention on same resv as vb2 dma_resv_lock(NULL) under dma_fence_begin_signalling; this is the primary RK3588 wedge hypothesis | Added to risk register as Primary Hypothesis H1. Step 2.1's smoke test extended to include GPU compositor exercise (open kwin, mpv playback with hwdec=vaapi, drm-vblank-trigger pattern), not just headless ffmpeg |
| A3 | Step 1.5 modules_install overwrites working modules on uname -r collision | Added CONFIG_LOCALVERSION="-lockdep" to step 1.2 config block → distinct release suffix 7.0.0-rc3-devices-lockdep+ → separate /lib/modules/.../. Original modules untouched |
| A4 | Step 0.4 "consider adding ramoops" too soft: a silent watchdog reset can't be diagnosed without a pre-reset capture path | Step 0.4 reworked as HARD GATE. Pre-flight aborts if neither serial console (TTL-USB cable confirmed connected and capturing) NOR ramoops region (verified writable + producing test entry) is functional. User decision input required before pre-flight starts |
| A5 | CONFIG_PROVE_RCU slows boot + may push past watchdog before lockdep prints | Removed from step 1.2 config set. Kept: PROVE_LOCKING, DEBUG_ATOMIC_SLEEP, LOCKDEP, DEBUG_RT_MUTEXES, DEBUG_SPINLOCK, DEBUG_MUTEXES, DEBUG_LOCK_ALLOC, PROVE_RAW_LOCK_NESTING, DEBUG_WW_MUTEX_SLOWPATH. PROVE_RCU added only if first lockdep run passes clean |
Pre-flight (REWRITTEN, hard gates marked)
| Step | Action | Verify | Gate |
|---|---|---|---|
| 0.1 | SDDM auto-login disabled | Done: /etc/sddm.conf.d/autologin.conf.disabled-iter6postmortem | ✓ |
| 0.2 | Backup /lib/modules/7.0.0-rc3-devices+/kernel/drivers/media/{common/videobuf2,platform/verisilicon,platform/rockchip/{rga,rkvdec}}/*.ko as attempt2-pre-base-<ts>.bkp AND scp tarball to boltzmann:/home/mfritsche/iter6-postmortem-backups/ | ls shows .bkp + scp returned 0 | HARD GATE: abort if backup write fails |
| 0.3 | Backup /boot/firmware/Image-7.0.0-rc3-devices+ and initramfs-7.0.0-rc3-devices+ as *.pre-attempt2.bkp | ls shows both *.pre-attempt2.bkp files | HARD GATE |
| 0.4a | Check pstore: ls -la /sys/fs/pstore/ as root, dmesg \| grep -i pstore, look for ramoops reserved region in /proc/iomem or DT | pstore writable + ramoops region present | one-of (with 0.4b) |
| 0.4b | Check serial console: confirm user has TTL-USB cable connected to ampere's ttyS2 UART; run screen /dev/ttyUSB0 1500000 (or equivalent) on the host the cable is attached to; ampere's extlinux already has console=ttyS2,1500000 | user types confirmation after seeing serial output | one-of (with 0.4a) |
| 0.4-G | HARD GATE: if BOTH 0.4a and 0.4b fail (no pstore AND no serial), ABORT and ask user to obtain serial cable OR investigate ramoops DT addition before retry | gate enforced | mandatory |
| 0.5 | Verify ampere ~/src/linux-rockchip working tree state: iter3+iter4+diag patches present, iter6 v1 reverted (per recovery), git status shows expected files modified | git status output matches | informational |
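The 0.4-G rule (proceed when at least one capture path works, abort only when both fail) can be sketched as a small shell check. This is a minimal sketch, not the plan's actual tooling: the function names pstore_ok and capture_gate are hypothetical, and the assumption that a functional ramoops region appears as a line containing "ramoops" in /proc/iomem should be confirmed on ampere. The serial check in 0.4b stays manual, so its result is passed in as an argument.

```shell
# pstore_ok PSTORE_DIR IOMEM_FILE
# Hypothetical helper for 0.4a: succeeds when the pstore mount point exists
# and the iomem listing mentions a ramoops region (assumed naming).
pstore_ok() {
    pstore_dir=$1
    iomem_file=$2
    [ -d "$pstore_dir" ] || return 1
    grep -qi 'ramoops' "$iomem_file" 2>/dev/null
}

# capture_gate PSTORE_STATUS SERIAL_STATUS
# Each argument is "ok" or "fail". Implements 0.4-G: abort only if both fail.
capture_gate() {
    if [ "$1" = ok ] || [ "$2" = ok ]; then
        echo "PROCEED"
        return 0
    fi
    echo "ABORT: no pstore/ramoops and no serial console"
    return 1
}
```

On ampere the first helper would be called as pstore_ok /sys/fs/pstore /proc/iomem (as root), with the serial status set from the user's typed confirmation.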
Build the lockdep debug base kernel (~45 min one-time)
| Step | Action |
|---|---|
| 1.1 | cp .config .config.pre-iter6postmortem |
| 1.2 | ./scripts/config --enable PROVE_LOCKING --enable DEBUG_ATOMIC_SLEEP --enable LOCKDEP --enable DEBUG_RT_MUTEXES --enable DEBUG_SPINLOCK --enable DEBUG_MUTEXES --enable DEBUG_LOCK_ALLOC --enable PROVE_RAW_LOCK_NESTING --enable DEBUG_WW_MUTEX_SLOWPATH (NO PROVE_RCU per A5). Set CONFIG_LOCALVERSION="-lockdep" (A3), e.g. ./scripts/config --set-str LOCALVERSION "-lockdep". Then make olddefconfig |
| 1.3 | time make -j8 Image modules dtbs (~45 min) |
| 1.4 | sudo make modules_install with no INSTALL_MOD_PATH override (that variable is a root prefix, not the module directory). The release string includes LOCALVERSION, so modules install to /lib/modules/7.0.0-rc3-devices-lockdep+/, separate from the working kernel's modules. Verify the destination (make -s kernelrelease) before sudo |
| 1.5 | sudo cp arch/arm64/boot/Image /boot/firmware/Image-7.0.0-rc3-devices-lockdep+. Generate initramfs for new release: sudo mkinitcpio -k 7.0.0-rc3-devices-lockdep+ -g /boot/firmware/initramfs-7.0.0-rc3-devices-lockdep+ |
| 1.6 | Backup extlinux.conf, then sudo edit: ADD new label arch_devices_lockdep pointing at the new Image + initrd, leave arch_devices as the default. The system then boots vanilla by default and the user picks lockdep at the U-Boot menu. Remote-operator note (round 2 review): if ampere is accessed SSH-only and the serial console from 0.4b is the only OOB path (no physical keyboard / HDMI), temporarily set DEFAULT arch_devices_lockdep in extlinux.conf for this test boot. Restore DEFAULT arch_devices before any subsequent reboot where lockdep boot is not desired. If only 0.4a passed (ramoops only, no serial), U-Boot menu selection is not possible and the DEFAULT override is MANDATORY |
| 1.7 | Reboot. At U-Boot menu, manually select arch_devices_lockdep (or rely on the DEFAULT override from 1.6). Verify uname -r = 7.0.0-rc3-devices-lockdep+. Verify journal has Lockdep is enabled |
| 1.8 | Smoke test: ffmpeg HEVC decode (iter5 baseline), check journalctl -k -p warning -b 0 for any new lockdep splats produced by iter3+4 alone. Expectation: clean (or only pre-existing edp/vblank WARNs) |
| 1.9 | A2 GPU smoke test: open SDDM login → log in to plasma wayland → open mpv with --hwdec=vaapi-copy ~/measurements/encoded/bbb_60s_720p.hevc.mp4. Watch ~30 seconds. Check journal again. Reason: the A2 hypothesis is panthor + kwin + V4L2 dmabuf contention, which only surfaces under active GPU composition. If iter3+4 alone (no fence helper yet) emits a lockdep splat with GPU active, the bug predates the iter6 patches: STOP and investigate |
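After make olddefconfig in step 1.2, it may be worth confirming that the resulting .config really carries the intended option set, since olddefconfig silently drops options whose dependencies are unmet. The following is a minimal sketch under stated assumptions: check_lockdep_config and REQUIRED_OPTS are hypothetical names, not part of the kernel tree. PROVE_RCU is reported rather than treated as fatal, because in some kernel versions it appears to be derived from PROVE_LOCKING instead of being independently selectable; that should be checked against this tree's Kconfig.

```shell
# Options from step 1.2 that should end up as =y after make olddefconfig.
REQUIRED_OPTS="PROVE_LOCKING DEBUG_ATOMIC_SLEEP LOCKDEP DEBUG_RT_MUTEXES DEBUG_SPINLOCK DEBUG_MUTEXES DEBUG_LOCK_ALLOC PROVE_RAW_LOCK_NESTING DEBUG_WW_MUTEX_SLOWPATH"

# check_lockdep_config CONFIG_FILE
# Prints one line per finding. Returns non-zero if a required option is
# missing or CONFIG_LOCALVERSION is not "-lockdep" (A3).
check_lockdep_config() {
    cfg=$1
    rc=0
    for opt in $REQUIRED_OPTS; do
        if ! grep -q "^CONFIG_${opt}=y\$" "$cfg"; then
            echo "missing: CONFIG_${opt}"
            rc=1
        fi
    done
    if ! grep -q '^CONFIG_LOCALVERSION="-lockdep"$' "$cfg"; then
        echo "wrong: CONFIG_LOCALVERSION"
        rc=1
    fi
    # Informational only (A5): see the caveat about PROVE_RCU above.
    if grep -q '^CONFIG_PROVE_RCU=y$' "$cfg"; then
        echo "note: CONFIG_PROVE_RCU=y"
    fi
    return $rc
}
```

Run from the kernel tree as check_lockdep_config .config; an empty output with exit status 0 means the config matches step 1.2.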
Bisect-apply iter6 patches on lockdep base
Order: 0004 → 0005 → 0006 → 0007-v2. Between each: reboot into lockdep kernel → smoke test → if regression, REVERT only that step, capture splat, document, STOP.
| Step | Patch | Build effort | Smoke test |
|---|---|---|---|
| 2.1 | 0004 (vb2 core helper + Kconfig) | full kernel rebuild (~45 min; touches Kconfig + videobuf2-core.c) | Step 1.8 + Step 1.9 |
| 2.2 | 0005 (hantro opt-in) | hantro_vpu module (~3 min) | Step 1.8 + Step 1.9 |
| 2.3 | 0006 (rockchip-rga opt-in) | rockchip-rga module (~3 min) | Step 1.8 + Step 1.9 |
| 2.4 | 0007-v2 (rkvdec per-codec opt-in) | rockchip-vdec module (~3 min) | Step 1.8 + Step 1.9 |
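The stop-at-first-regression ordering of steps 2.1 to 2.4 can be sketched as a loop. This is only an illustration of the control flow: the real procedure includes a reboot between build and smoke test, so it cannot run as a single script, and bisect_apply plus the two callback arguments are hypothetical names. The apply callback stands for "apply patch and rebuild"; the smoke callback stands for steps 1.8 + 1.9.

```shell
# bisect_apply APPLY_CMD SMOKE_CMD PATCH...
# For each patch in order: run APPLY_CMD <patch>, then SMOKE_CMD <patch>.
# Stops at the first failure, names the failing step, and returns 1.
bisect_apply() {
    apply_cmd=$1
    smoke_cmd=$2
    shift 2
    for patch in "$@"; do
        if ! "$apply_cmd" "$patch"; then
            echo "APPLY-FAILED $patch"
            return 1
        fi
        if ! "$smoke_cmd" "$patch"; then
            # Per the plan: revert only this step, capture the splat, STOP.
            echo "REGRESSION $patch"
            return 1
        fi
        echo "PASS $patch"
    done
}
```

A call such as bisect_apply my_apply my_smoke 0004 0005 0006 0007-v2 mirrors the documented order; later patches are never attempted once one regresses.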
Post-bisect
- All 4 pass on the lockdep kernel: try the same set on the vanilla kernel (rebuild without lockdep). If vanilla also passes, iter6 succeeds; move to iter7 cache-coherency hypothesis test.
- Specific step regresses: capture lockdep splat via serial or pstore, attach to a new kernel-agent issue (#16), revert that step from lockdep kernel, document. If the failing step is 2.1 (0004 core helper), iter6 line is closed — switch iter7 to a non-fence cache-coherency approach.
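Once a log has been captured over serial or recovered from pstore, scanning it for well-known lockdep and atomic-sleep report headers is a quick way to confirm that a splat actually survived. The sketch below assumes the log is available as a plain text file; has_lockdep_splat is a hypothetical helper name, and the signature list covers only a few common report headers, not every message lockdep can emit.

```shell
# has_lockdep_splat LOGFILE
# Prints matching lines and returns 0 if the log contains a known
# lockdep / DEBUG_ATOMIC_SLEEP report header; returns non-zero otherwise.
has_lockdep_splat() {
    grep -E 'possible circular locking dependency detected|possible recursive locking detected|inconsistent lock state|sleeping function called from invalid context' "$1"
}
```

Typical inputs would be a saved serial session, the output of journalctl -k -b 0 redirected to a file, or a dmesg record found under /sys/fs/pstore/ after the reset.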
Risk register (v2 — added H1)
| # | Risk | Mitigation |
|---|---|---|
| H1 | PRIMARY: panthor ww_acquire_ctx on same resv as vb2's dma_resv_lock(NULL) published inside dma_fence_begin_signalling(): a cross-class lockdep violation invisible without PROVE_LOCKING on RK3588 + panthor. RK3566 + panfrost validation didn't surface this. | LOCKDEP base kernel (step 1.x) is specifically designed to catch this. Step 1.9 + 2.x smoke tests exercise GPU compositor path |
| R2 | Lockdep-debug kernel doesn't boot at all | Step 1.6 keeps vanilla as default; manual menu selection for lockdep; recovery stick handles total-loss |
| R3 | Lockdep-debug kernel boots but uname -r collides with vanilla → modules_install overwrites | CONFIG_LOCALVERSION="-lockdep" (A3) makes the suffix distinct; verify the destination before running any sudo install step |
| R4 | Watchdog fires before lockdep splat reaches journal | Step 0.4 HARD GATE: serial OR ramoops for pre-reset capture path |
| R5 | 0007-v2 has a different bug than v1 (e.g. run.base.bufs.dst access pattern wrong for some codec) | Phase 5 round 2 reviews v2 source. If reviewer rejects, write v3 informed by feedback |
| R6 | User wants to use ampere during 45-min kernel build → reboot conflict | Per feedback_no_bulldoze_reboots.md: ASK before kicking off |
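The R3 mitigation (verify the release string before installing) can be expressed as a pre-install guard for the first lockdep build in step 1.4. This is a minimal sketch with a hypothetical function name, release_is_safe; it applies only while the vanilla kernel is still the running one, since after booting the lockdep kernel in steps 2.x an identical release string is expected.

```shell
# release_is_safe NEW_RELEASE RUNNING_RELEASE MODULES_ROOT
# Refuses when the new build would install over the running kernel's
# modules, or when the release string lacks the -lockdep suffix (A3).
release_is_safe() {
    new_rel=$1
    running_rel=$2
    modules_root=$3
    if [ "$new_rel" = "$running_rel" ]; then
        echo "COLLISION: $new_rel matches the running kernel"
        return 1
    fi
    case $new_rel in
        *-lockdep*) ;;
        *)
            echo "UNEXPECTED: $new_rel lacks the -lockdep suffix"
            return 1
            ;;
    esac
    echo "OK: modules go to $modules_root/$new_rel"
}
```

From the kernel tree this would be called as release_is_safe "$(make -s kernelrelease)" "$(uname -r)" /lib/modules, and sudo make modules_install would run only on success.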
Estimated wall-time
| Phase | Time |
|---|---|
| Pre-flight (0.1-0.5) + serial confirmation | 15-30 min (depends on serial cable hookup) |
| Lockdep base kernel build (1.1-1.9) | 60 min |
| 2.1 (0004 + Kconfig + full kernel rebuild) | 50 min |
| 2.2 (0005 hantro module) | 10 min |
| 2.3 (0006 rga module) | 10 min |
| 2.4 (0007-v2 rkvdec module) | 10 min |
| Total | ~3-3.5h wall-time clean-path (the rows above sum to 2h35-2h50) |
Exit conditions
Same as v1. Success / Partial / Failure trees unchanged.