ampere-kernel-decoders/phase4_plan_iter6_postmortem_v2.md
Markus Fritsche b02baffca7 iter6 post-mortem Phase 4 v2: per-codec 0007 + lockdep base kernel
Amendments per Sonnet architect review round 1:
- A1: 0007 v2 rewritten — 7 per-codec run() insertion points,
  matches hantro pattern (after preamble metadata copy, before HW kick).
  Old v1 (in rkvdec_device_run) REJECTED — wrong structural placement.
- A2: panthor ww_mutex/dma_resv contention added as primary hypothesis H1.
  Smoke test 1.9/2.x extended to exercise GPU compositor path.
- A3: CONFIG_LOCALVERSION=-lockdep so lockdep kernel uname differs from
  vanilla — prevents modules_install overwriting working tree.
- A4: pstore/serial gate is now HARD (one-of required); pre-flight aborts
  if neither serial nor ramoops is functional.
- A5: PROVE_RCU removed from initial config — boot latency risk pushes
  past watchdog before lockdep prints. Add back only if first run clean.

0007-v2 patch attached: 8 hunks across rkvdec-{h264,hevc,vdpu381-h264,
vdpu381-hevc,vdpu383-h264,vdpu383-hevc,vp9}.c + rkvdec.c queue_init flag.
25 inserted lines.

Pending Phase 5 round 2 delta-review of v2 source + amended plan before
any execution.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-16 16:33:08 +00:00


Phase 4 — iter6 post-mortem retry plan (v2, amended per Phase 5 architect review)

- Plan ID: iter6-postmortem-attempt2
- Author: Claude Opus 4.7
- Reviewer: Sonnet (architect, Phase 5 round 1, returned 5 amendments + 0007 REJECT)
- Status: v2 pending Phase 5 round 2 (delta-review) before any execution
- Cross-ref: phase0_findings_iter6_postmortem.md (commit 11d2dde), phase4_plan_iter6_postmortem.md (commit 0ef6440, superseded)

Goal

Same as v1. Apply 0004 / 0005 / 0006 / 0007-v2 to ampere kernel WITHOUT silent watchdog reset. Either succeed, or fail with a self-diagnostic lockdep splat that survives to logs.

Amendments vs v1 (architect findings)

| # | Finding | Resolution |
|---|---------|------------|
| A1 | 0007 v1 placement (in `rkvdec_device_run`) is inconsistent with hantro's pattern, and the metadata-copy ordering is wrong | 0007 v2 written: 7 per-codec `*_run()` insertion points (h264, hevc, vdpu381-h264, vdpu381-hevc, vdpu383-h264, vdpu383-hevc, vp9), each AFTER the preamble and BEFORE the HW kick. `supports_release_fences = true` stays in queue_init. See 0007-v2-rkvdec-opt-in.patch (8 hunks, 25 added lines) |
| A2 | Missing risk: panthor `ww_acquire_ctx` contention on the same resv as vb2 `dma_resv_lock(NULL)` under `dma_fence_begin_signalling`; primary RK3588 wedge hypothesis | Added to the risk register as Primary Hypothesis H1. Smoke tests (step 1.9 and every 2.x step) extended to include a GPU compositor exercise (open kwin, mpv playback with hwdec=vaapi, drm-vblank-trigger pattern), not just headless ffmpeg |
| A3 | Step 1.5 modules_install overwrites working modules on `uname -r` collision | Added `CONFIG_LOCALVERSION="-lockdep"` to the step 1.2 config block, giving the distinct release suffix 7.0.0-rc3-devices-lockdep+ and a separate /lib/modules/.../. Original modules untouched |
| A4 | Step 0.4 "consider adding ramoops" is too soft: a silent watchdog reset can't be diagnosed without a pre-reset capture path | Step 0.4 reworked as a HARD GATE. Pre-flight aborts if neither the serial console (TTL-USB cable confirmed connected and capturing) NOR the ramoops region (verified writable + producing a test entry) is functional. User decision input required before pre-flight starts |
| A5 | `CONFIG_PROVE_RCU` slows boot and may push past the watchdog before lockdep prints | Removed from the step 1.2 config set. Kept: PROVE_LOCKING, DEBUG_ATOMIC_SLEEP, LOCKDEP, DEBUG_RT_MUTEXES, DEBUG_SPINLOCK, DEBUG_MUTEXES, DEBUG_LOCK_ALLOC, PROVE_RAW_LOCK_NESTING, DEBUG_WW_MUTEX_SLOWPATH. PROVE_RCU is added back only if the first lockdep run passes clean |

Pre-flight (REWRITTEN, hard gates marked)

| Step | Action | Verify | Gate |
|------|--------|--------|------|
| 0.1 | SDDM auto-login disabled | Done: /etc/sddm.conf.d/autologin.conf.disabled-iter6postmortem | |
| 0.2 | Backup `/lib/modules/7.0.0-rc3-devices+/kernel/drivers/media/{common/videobuf2,platform/verisilicon,platform/rockchip/{rga,rkvdec}}/*.ko` as `attempt2-pre-base-<ts>.bkp` AND scp tarball to `boltzmann:/home/mfritsche/iter6-postmortem-backups/` | `ls` shows .bkp + scp returned 0 | HARD GATE: abort if backup write fails |
| 0.3 | Backup `/boot/firmware/Image-7.0.0-rc3-devices+` and `initramfs-7.0.0-rc3-devices+` as `*.pre-attempt2.bkp` | `ls` | HARD GATE |
| 0.4a | Check pstore: `ls -la /sys/fs/pstore/` as root, `dmesg \| grep -i pstore`, look for a ramoops reserved region in /proc/iomem or DT | pstore writable + ramoops region present | one-of (with 0.4b) |
| 0.4b | Check serial console: confirm the user has a TTL-USB cable connected to ampere's ttyS2 UART; run `screen /dev/ttyUSB0 1500000` (or equivalent) on a host that has it; ampere's extlinux already has `console=ttyS2,1500000` | user types confirmation after seeing serial output | one-of (with 0.4a) |
| 0.4-G | HARD GATE: if BOTH 0.4a and 0.4b fail (no pstore AND no serial), ABORT and ask the user to obtain a serial cable OR investigate a ramoops DT addition before retry | gate enforced | mandatory |
| 0.5 | Verify ampere `~/src/linux-rockchip` working tree state: iter3+iter4+diag patches present, iter6 v1 reverted (per recovery), `git status` shows expected files modified | `git status` output matches | informational |
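
The one-of logic of steps 0.4a, 0.4b and 0.4-G can be sketched as a small shell helper. This is a minimal sketch, not part of the plan's tooling: the function name `gate_0_4` and its yes/no arguments are hypothetical, and the two inputs are assumed to come from the manual checks in 0.4a and 0.4b.

```shell
#!/bin/sh
# Hypothetical helper (not from the plan): decide the 0.4-G outcome.
#   $1 = result of 0.4a, pstore/ramoops functional (yes|no)
#   $2 = result of 0.4b, serial capture confirmed by the user (yes|no)
gate_0_4() {
    pstore_ok=$1
    serial_ok=$2
    # One working pre-reset capture path is enough to proceed.
    if [ "$pstore_ok" = "yes" ] || [ "$serial_ok" = "yes" ]; then
        echo "PROCEED"
        return 0
    fi
    # Neither path works: a silent watchdog reset would leave no evidence.
    echo "ABORT: need serial console or ramoops before retry"
    return 1
}
```

Called as `gate_0_4 no yes`, it prints `PROCEED`; `gate_0_4 no no` prints the abort message and returns non-zero, so a wrapping pre-flight script run with `set -e` would stop there.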

Build the lockdep debug base kernel (~45 min one-time)

| Step | Action |
|------|--------|
| 1.1 | `cp .config .config.pre-iter6postmortem` |
| 1.2 | `./scripts/config --enable PROVE_LOCKING --enable DEBUG_ATOMIC_SLEEP --enable LOCKDEP --enable DEBUG_RT_MUTEXES --enable DEBUG_SPINLOCK --enable DEBUG_MUTEXES --enable DEBUG_LOCK_ALLOC --enable PROVE_RAW_LOCK_NESTING --enable DEBUG_WW_MUTEX_SLOWPATH` (NO PROVE_RCU, per A5). Set `CONFIG_LOCALVERSION="-lockdep"` (A3). Then `make olddefconfig` |
| 1.3 | `time make -j8 Image modules dtbs` (~45 min) |
| 1.4 | Run the kernel's default `make modules_install`, without an `INSTALL_MOD_PATH` override. It respects LOCALVERSION and installs to `/lib/modules/7.0.0-rc3-devices-lockdep+/`, separate from the working modules. Verify the destination path before running it under sudo |
| 1.5 | `sudo cp arch/arm64/boot/Image /boot/firmware/Image-7.0.0-rc3-devices-lockdep+`. Generate the initramfs for the new release: `sudo mkinitcpio -k 7.0.0-rc3-devices-lockdep+ -g /boot/firmware/initramfs-7.0.0-rc3-devices-lockdep+` |
| 1.6 | Backup extlinux.conf, then edit it under sudo: ADD a new label `arch_devices_lockdep` pointing at the new Image + initrd, and leave `arch_devices` as the default. The system then boots vanilla by default; the user picks lockdep at the U-Boot menu |
| 1.7 | Reboot. At the U-Boot menu, manually select `arch_devices_lockdep`. Verify `uname -r` = 7.0.0-rc3-devices-lockdep+. Verify the journal has `Lockdep is enabled` |
| 1.8 | Smoke test: ffmpeg HEVC decode (iter5 baseline), then check `journalctl -k -p warning -b 0` for any new lockdep splats produced by iter3+4 alone. Expectation: clean (or only pre-existing edp/vblank WARNs) |

1.9 — A2 GPU smoke test: open SDDM login → log in to plasma wayland → open mpv with --hwdec=vaapi-copy ~/measurements/encoded/bbb_60s_720p.hevc.mp4. Watch ~30 seconds. Check journal again. Reason: A2 hypothesis is panthor + kwin + V4L2 dmabuf contention, which only surfaces under active GPU composition. If iter3+4 alone (no fence helper yet) emits a lockdep splat with GPU active, the bug is even more upstream than iter6 patches — STOP and investigate.
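
Steps 1.2 and 1.4 can be sketched as two shell helpers, one that assembles the `scripts/config` invocation and one that guards `modules_install` against a release-string collision. The function names are hypothetical; `--set-str` is the `scripts/config` option for string values, and `make -s kernelrelease` prints the release string the build will use. Treat this as a sketch under those assumptions, not the plan's actual tooling.

```shell
#!/bin/sh
# Debug options kept by amendment A5; PROVE_RCU is deliberately absent.
LOCKDEP_OPTS="PROVE_LOCKING DEBUG_ATOMIC_SLEEP LOCKDEP DEBUG_RT_MUTEXES \
DEBUG_SPINLOCK DEBUG_MUTEXES DEBUG_LOCK_ALLOC PROVE_RAW_LOCK_NESTING \
DEBUG_WW_MUTEX_SLOWPATH"

# Hypothetical helper: print the step 1.2 command line instead of running it,
# so it can be reviewed before it touches .config.
lockdep_config_cmd() {
    cmd="./scripts/config"
    for opt in $LOCKDEP_OPTS; do
        cmd="$cmd --enable $opt"
    done
    # A3: distinct release suffix so modules land in their own directory.
    printf '%s --set-str LOCALVERSION -lockdep\n' "$cmd"
}

# Hypothetical helper: succeed only if the release string carries the suffix.
release_is_distinct() {
    case "$1" in
        *-lockdep*) return 0 ;;
        *) return 1 ;;
    esac
}

# Usage inside the kernel tree (not executed here):
#   eval "$(lockdep_config_cmd)" && make olddefconfig
#   release_is_distinct "$(make -s kernelrelease)" && sudo make modules_install
```

Printing the command rather than executing it keeps the helper side-effect free, which fits the plan's "verify before sudo" stance.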

Bisect-apply iter6 patches on lockdep base

Order: 0004 → 0005 → 0006 → 0007-v2. Between each: reboot into lockdep kernel → smoke test → if regression, REVERT only that step, capture splat, document, STOP.

| Step | Patch | Build effort | Smoke test |
|------|-------|--------------|------------|
| 2.1 | 0004 (vb2 core helper + Kconfig) | full kernel rebuild (~45 min; Kconfig + videobuf2-core.c) | Step 1.8 + Step 1.9 |
| 2.2 | 0005 (hantro opt-in) | hantro_vpu module (~3 min) | Step 1.8 + Step 1.9 |
| 2.3 | 0006 (rockchip-rga opt-in) | rockchip-rga module (~3 min) | Step 1.8 + Step 1.9 |
| 2.4 | 0007-v2 (rkvdec per-codec opt-in) | rockchip-vdec module (~3 min) | Step 1.8 + Step 1.9 |
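
The per-step pass/stop decision can be sketched as a log scan. This is a minimal sketch with hypothetical function names; the patterns are common lockdep and atomic-sleep report headers, and the list is an assumption that may not cover every splat the debug kernel can emit, so a manual read of the journal is still needed.

```shell
#!/bin/sh
# Reads kernel log text on stdin; exits 0 if a lockdep-style report is found.
# The pattern list is illustrative, not exhaustive.
has_lockdep_splat() {
    grep -E -q \
        -e 'possible circular locking dependency detected' \
        -e 'possible recursive locking detected' \
        -e 'inconsistent lock state' \
        -e 'sleeping function called from invalid context'
}

# Hypothetical helper: $1 = bisect step label (e.g. 2.3), kernel log on stdin.
bisect_verdict() {
    if has_lockdep_splat; then
        echo "STOP: revert step $1, capture splat, document"
        return 1
    fi
    echo "PASS: step $1 clean, continue"
}

# Usage on the lockdep kernel after a smoke test (not executed here):
#   journalctl -k -p warning -b 0 | bisect_verdict 2.1
```

A non-zero return maps onto the plan's rule of reverting only the failing step and stopping, rather than continuing down the patch order.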

Post-bisect

  • All 4 pass on the lockdep kernel: try the same set on the vanilla kernel (rebuild without lockdep). If vanilla also passes, iter6 succeeds; move to iter7 cache-coherency hypothesis test.
  • Specific step regresses: capture lockdep splat via serial or pstore, attach to a new kernel-agent issue (#16), revert that step from lockdep kernel, document. If the failing step is 2.1 (0004 core helper), iter6 line is closed — switch iter7 to a non-fence cache-coherency approach.

Risk register (v2 — added H1)

| # | Risk | Mitigation |
|---|------|------------|
| H1 | PRIMARY: panthor `ww_acquire_ctx` on the same resv as vb2's `dma_resv_lock(NULL)` published inside `dma_fence_begin_signalling()`; a cross-class lockdep violation that is invisible without PROVE_LOCKING on RK3588 + panthor. RK3566 + panfrost validation didn't surface this | The LOCKDEP base kernel (step 1.x) is specifically designed to catch this. Step 1.9 + 2.x smoke tests exercise the GPU compositor path |
| R2 | Lockdep-debug kernel doesn't boot at all | Step 1.6 keeps vanilla as the default; manual menu selection for lockdep; recovery stick handles total loss |
| R3 | Lockdep-debug kernel's `uname -r` collides with vanilla, so modules_install overwrites the working modules | `CONFIG_LOCALVERSION="-lockdep"` (A3) makes the suffix distinct; verify the destination path before running modules_install under sudo |
| R4 | Watchdog fires before the lockdep splat reaches the journal | Step 0.4 HARD GATE: serial OR ramoops for a pre-reset capture path |
| R5 | 0007-v2 has a different bug than v1 (e.g. `run.base.bufs.dst` access pattern wrong for some codec) | Phase 5 round 2 reviews the v2 source. If the reviewer rejects, write v3 informed by the feedback |
| R6 | User wants to use ampere during the 45-min kernel build, causing a reboot conflict | Per feedback_no_bulldoze_reboots.md: ASK before kicking off |

Estimated wall-time

| Phase | Time |
|-------|------|
| Pre-flight (0.1-0.5) + serial confirmation | 15-30 min (depends on serial cable hookup) |
| Lockdep base kernel build (1.1-1.9) | 60 min |
| 2.1 (0004 + Kconfig + full kernel rebuild) | 50 min |
| 2.2 (0005 hantro module) | 10 min |
| 2.3 (0006 rga module) | 10 min |
| 2.4 (0007-v2 rkvdec module) | 10 min |
| Total | ~3-3.5h wall-time clean-path |

Exit conditions

Same as v1. Success / Partial / Failure trees unchanged.