ampere-kernel-decoders/phase4_plan_iter6_postmortem.md
Markus Fritsche 0ef64406b6 iter6 post-mortem Phase 4: bisect-apply plan with lockdep base kernel
Pre-flight (5 steps): backup, pstore, serial console verify
Debug base kernel (8 steps): PROVE_LOCKING + LOCKDEP + DEBUG_*
  full kernel rebuild, separate extlinux label, keep vanilla default
Bisect-apply (4 steps): 0004 → 0005 → 0006 → 0007, reboot+test between each
Risk register: 5 risks with mitigations
Total wall-time: ~150 min if clean

Pending Phase 5 architect review (Sonnet) before any execution.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-16 16:13:48 +00:00

Phase 4 — iter6 post-mortem retry plan

Plan ID: iter6-postmortem-attempt2
Author: Claude Opus 4.7 (this session)
Status: Pending Phase 5 architect review (Sonnet) before any execution
Cross-ref: phase0_findings_iter6_postmortem.md commit 11d2dde

Goal

Apply the vb2_dma_resv RFC v2 series (0004/0005/0006) + a reviewed 0007 rkvdec opt-in to the ampere kernel WITHOUT silently watchdog-resetting the box during compositor handover. There are two acceptable outcomes. Either the series applies cleanly (iter6 fixed, so the cache-coherency hypothesis can then be tested against iter5's solid-black output bug), OR it fails with a self-diagnostic kernel log entry that pinpoints the deadlock/regression. Another reboot loop is NOT an acceptable outcome.

Strategy: bisect-apply with kernel-level safeguards

Don't repeat iter6 attempt-1's mistake of applying 4 inter-dependent patches in one go and rebooting blind. Apply one patch at a time on top of a debug-enabled kernel; reboot + smoke-test between each; revert immediately on any regression and capture the lockdep/oops trace.

Pre-flight (must complete before any patch applies)

| Step | Action | Verify |
|---|---|---|
| 0.1 | SDDM auto-login disabled | done (commit 11d2dde): `/etc/sddm.conf.d/autologin.conf.disabled-iter6postmortem` |
| 0.2 | Back up current `/lib/modules/7.0.0-rc3-devices+/kernel/drivers/media/{common/videobuf2,platform/verisilicon,platform/rockchip/{rga,rkvdec}}/*.ko` as `attempt2-pre-base-<ts>.bkp` AND tarball-archive to `boltzmann:/home/mfritsche/iter6-postmortem-backups/` | `ls` shows .bkp + scp succeeded |
| 0.3 | Back up current kernel Image (`/boot/firmware/Image-7.0.0-rc3-devices+`) + initramfs as `*.pre-attempt2.bkp` | `ls` shows .bkp |
| 0.4 | Check pstore availability: `ls -la /sys/fs/pstore/` as root. If empty/inaccessible, check DT for a ramoops reserved region and consider adding via cmdline `pstore.backend=ramoops ramoops.mem_address=... ramoops.mem_size=0x200000 ramoops.console_size=0x40000 ramoops.pmsg_size=0x40000` (2 MiB region, 256 KiB console and pmsg; these are plain integer module parameters, so K/M suffixes are not parsed) | pstore directory writable + filesystem mounted |
| 0.5 | Verify serial console: `console=ttyS2,1500000` is already in the extlinux append line. If the user has a TTL-USB cable hooked up, capture it. Otherwise note "no serial cable connected" and rely on pstore + journal | document |
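
Steps 0.4 and 0.5 can be checked mechanically. The following is a minimal sketch, not part of the original plan: the function names `cmdline_has` and `preflight_report` are hypothetical, and it only reports readability of the pstore directory, not whether a ramoops backend is actually registered.

```sh
# Return 0 if the kernel command line ($1) contains the exact token ($2).
cmdline_has() {
    case " $1 " in
        *" $2 "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Print one line per pre-flight check for a command line and a pstore directory.
preflight_report() {
    cmdline=$1
    pstore_dir=$2
    if cmdline_has "$cmdline" "console=ttyS2,1500000"; then
        echo "serial: console=ttyS2,1500000 present"
    else
        echo "serial: console=ttyS2,1500000 MISSING"
    fi
    if [ -d "$pstore_dir" ] && [ -r "$pstore_dir" ]; then
        echo "pstore: $pstore_dir readable"
    else
        echo "pstore: $pstore_dir not readable"
    fi
}

# On the target, as root:
#   preflight_report "$(cat /proc/cmdline)" /sys/fs/pstore
```

The command line is passed in as a string so the same function can be pointed at the extlinux append line before a reboot as well as at `/proc/cmdline` afterwards.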

Build the debug base kernel (one-time, ~45 min)

| Step | Action | Verify |
|---|---|---|
| 1.1 | On ampere, in `~/src/linux-rockchip`: back up current `.config` as `.config.pre-iter6postmortem` | `ls .config.*` |
| 1.2 | Enable: `./scripts/config --enable PROVE_LOCKING --enable DEBUG_ATOMIC_SLEEP --enable LOCKDEP --enable DEBUG_RT_MUTEXES --enable DEBUG_SPINLOCK --enable DEBUG_MUTEXES --enable DEBUG_LOCK_ALLOC --enable PROVE_RAW_LOCK_NESTING --enable DEBUG_WW_MUTEX_SLOWPATH --enable PROVE_RCU` then `make olddefconfig` | grep `.config` for each `=y` |
| 1.3 | `time make -j8 Image modules dtbs` | exit 0, ~45 min |
| 1.4 | Back up current `/boot/firmware/Image-7.0.0-rc3-devices+` again (defensive, second copy) → install new Image as `Image-7.0.0-rc3-devices-lockdep+` → add new extlinux label `arch_devices_lockdep` pointing at the new kernel (so we can choose between the vanilla and lockdep kernel at the boot prompt; don't replace the working `arch_devices` default) | extlinux.conf shows new label, vanilla unchanged |
| 1.5 | `sudo make modules_install INSTALL_MOD_PATH=/tmp/lockdep-modules` (stage modules to /tmp first; don't write to /lib/modules yet) | /tmp/lockdep-modules populated |
| 1.6 | Manually `cp /tmp/lockdep-modules/lib/modules/<new-uname>/* /lib/modules/<new-uname>/` after backing up the old tree (the staged directory is named after the release of the kernel just built) | careful ordering |
| 1.7 | Reboot ampere into `arch_devices_lockdep` (manual selection at boot prompt) | `uname -r` shows new release suffix |
| 1.8 | Confirm the system still HEVC-decodes with the iter3+4 fixes on the debug kernel (smoke test from iter5 baseline: ffmpeg exit 0, no oops, DEC_RDY=1) | lockdep kernel baseline established |
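
The verify column of step 1.2 matters because `make olddefconfig` resolves dependencies and can silently drop an option that `scripts/config` wrote into the file. A minimal sketch of that check follows; the helper name `missing_opts` and the variable `LOCKDEP_OPTS` are hypothetical, and the option list is copied from step 1.2.

```sh
# Print every option from the argument list that is not set to =y in the
# given .config file. Prints nothing when all options are enabled.
missing_opts() {
    config=$1
    shift
    for opt in "$@"; do
        grep -q "^CONFIG_${opt}=y\$" "$config" || echo "$opt"
    done
}

# Options named in step 1.2.
LOCKDEP_OPTS="PROVE_LOCKING DEBUG_ATOMIC_SLEEP LOCKDEP DEBUG_RT_MUTEXES \
DEBUG_SPINLOCK DEBUG_MUTEXES DEBUG_LOCK_ALLOC PROVE_RAW_LOCK_NESTING \
DEBUG_WW_MUTEX_SLOWPATH PROVE_RCU"

# From the kernel tree, after `make olddefconfig`:
#   missing_opts .config $LOCKDEP_OPTS
```

An empty result means every flag survived; any printed name is a flag to investigate before spending 45 minutes on the build.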

If step 1.8 fails (debug kernel itself regresses): STOP. The debug config is the culprit. Investigate which CONFIG flag broke things. Roll back to vanilla kernel before any iter6 work.

Bisect-apply iter6 patches

Order matches the patch series numbering. After each step, reboot + smoke-test + harvest journal/pstore for any lockdep splat. If any step regresses, revert ONLY that step, document the splat, escalate to user.

| Step | Patch | Pre-check | Action | Smoke test | Pass criteria | Fail action |
|---|---|---|---|---|---|---|
| 2.1 | 0004 vb2 helper (kernel core) | working kernel + module backups present | Apply 0004, enable `CONFIG_VIDEOBUF2_RELEASE_FENCES=y`, rebuild the full kernel plus videobuf2_common | reboot → ssh in → ffmpeg HEVC test → check `journalctl -k -p warning` | exit 0, no new warnings, no "released after" / "is locked" messages | revert 0004, capture journal, STOP |
| 2.2 | 0005 hantro opt-in | step 2.1 passed | Apply 0005, rebuild hantro_vpu module only (~3 min) | reboot → wait for sddm login (manual) → ssh → ffmpeg test → journal check | same as 2.1 + journal also clean | revert 0005, capture, STOP |
| 2.3 | 0006 rockchip-rga opt-in | step 2.2 passed | Apply 0006, rebuild rockchip_rga module only | reboot → same | same | revert 0006, capture, STOP |
| 2.4 | 0007 rkvdec opt-in (my unreviewed patch; needs Phase 5 review before this step!) | step 2.3 passed + Phase 5 approval | Apply 0007 source diff (queue_init flag + device_run hook), rebuild rockchip_vdec only | reboot → same | same | revert 0007, capture, STOP |
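
The journal check after each step can be reduced to a pattern scan over a saved log. This is a sketch under assumptions: the first three patterns are report headers that mainline lockdep and DEBUG_ATOMIC_SLEEP print, the last two are the strings named in the 2.1 pass criteria, and the function name `log_is_clean` is hypothetical. The list is not exhaustive, and "is locked" is generic enough to produce false positives.

```sh
# One fixed-string pattern per line.
SPLAT_PATTERNS='possible circular locking dependency detected
possible recursive locking detected
sleeping function called from invalid context
released after
is locked'

# Print matching lines (with line numbers) from a saved kernel log.
# Returns 0 if nothing matched, 1 if something matched, 2 if unreadable.
log_is_clean() {
    [ -r "$1" ] || { echo "cannot read log: $1" >&2; return 2; }
    if grep -F -n "$SPLAT_PATTERNS" "$1"; then
        return 1
    fi
    return 0
}

# After each reboot:
#   journalctl -k -b -p warning > /tmp/step.log
#   log_is_clean /tmp/step.log || echo "regression: revert this step and STOP"
```

The same function can be run over files harvested from `/sys/fs/pstore/` when the watchdog fired before the journal was flushed.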

Post-bisect

  • If all 4 steps passed: iter6 attempt-2 succeeds. Commit findings. Move to iter7 (test cache-coherency hypothesis for the iter5 solid-black-output bug — the original reason we wanted these patches).
  • If a specific step fails: file kernel-agent#16 or upstream-aligned issue with the specific patch + lockdep splat. Document in iter6_attempt2_close.md.

Risk register

| Risk | Mitigation |
|---|---|
| Lockdep-debug kernel itself doesn't boot | Step 1.4 keeps vanilla label as default; manual boot menu choice required for lockdep kernel; recovery stick handles boot-loop |
| Step 1.6 module install into /lib/modules/ overwrites working modules | Step 1.6 backs up pre-state; surgical install only after backup |
| Watchdog still fires before any lockdep splat reaches journal | pstore fallback (step 0.4); serial console if cable hooked up; ramoops capture across resets |
| 0007 has a different bug than what we just fixed | Phase 5 reviews the 0007 source; if reviewer rejects, write a v2 informed by upstream patterns or skip 0007 (test 0004+0005+0006 only as a control; they shouldn't affect rkvdec) |
| User wants to use ampere for other work during 45-min kernel build | Per feedback_no_bulldoze_reboots.md, ASK before kicking off |
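
The "back up first, then install" mitigation for the module tree can be enforced by making the copy conditional on the backup. A minimal sketch, assuming GNU `cp` and `tar` on the target; the function name `install_staged_modules` and its three-argument interface are hypothetical.

```sh
# Copy a staged module tree into dest, archiving the existing dest first.
# Refuses to copy when the stage directory is missing or the backup fails.
# Existing files in dest that are not in stage are left in place.
install_staged_modules() {
    stage=$1    # e.g. /tmp/lockdep-modules/lib/modules/<new-uname>
    dest=$2     # e.g. /lib/modules/<new-uname>
    backup=$3   # tarball path for the pre-install state, outside dest
    [ -d "$stage" ] || { echo "stage dir missing: $stage" >&2; return 1; }
    if [ -d "$dest" ]; then
        tar -C "$dest" -cf "$backup" . || return 1
    fi
    mkdir -p "$dest" && cp -a "$stage"/. "$dest"/
}
```

Rolling back is then a matter of clearing the destination and unpacking the tarball with `tar -C <dest> -xf <backup>`.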

Exit conditions

Success: ffmpeg exit 0, no regressions, journal/pstore clean across all 4 step boots. Iter6 attempt-2 closes successfully; can proceed to iter7 cache-coherency hypothesis test.

Partial success: 0004+0005+0006 land cleanly; 0007 regresses; revert 0007, iter7 can test cache hypothesis without rkvdec opt-in (helper still in core, just rkvdec doesn't publish a fence).

Failure: 0004 itself regresses on RK3588 (per the silent watchdog reset). File kernel-agent issue; vb2 fence series needs RK3588-specific debugging; iter6 line of investigation is closed; iter7 picks a different cache-coherency test path (DMA_BUF_IOCTL_SYNC backend or DT dma-coherent property).
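
The three exit conditions map per-step results to an outcome, which can be written down as a small lookup. This is an illustrative sketch: the function name `classify_outcome` is hypothetical, and the "escalate" branch for a 0005 or 0006 regression is an assumption taken from the bisect rule (revert only that step, document, escalate to user), since the exit conditions do not name that case.

```sh
# Map per-step results (pass|fail, in the order 0004 0005 0006 0007) to an
# outcome. Steps that never ran because an earlier one failed may be omitted.
classify_outcome() {
    case "$1 $2 $3 $4" in
        "pass pass pass pass") echo "success" ;;
        "pass pass pass fail") echo "partial" ;;
        fail*)                 echo "failure" ;;
        *)                     echo "escalate" ;;
    esac
}

classify_outcome pass pass pass fail   # prints "partial"
```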

Estimated wall-time

| Phase | Time |
|---|---|
| Pre-flight (0.1-0.5) | 10 min |
| Debug kernel build (1.1-1.8) | 60 min (build 45 + install 5 + reboot 2 + verify 8) |
| Bisect-apply 0004 (2.1) | 50 min (full kernel rebuild for CONFIG_VIDEOBUF2_RELEASE_FENCES + 0004 changes) |
| Bisect-apply 0005 (2.2) | 10 min |
| Bisect-apply 0006 (2.3) | 10 min |
| Bisect-apply 0007 (2.4) | 10 min |
| Total | ~150 min wall-time if everything goes clean |
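
As a quick consistency check on the estimate, the per-phase figures can be summed directly. The snippet only restates the numbers from the table.

```sh
# Minutes per phase, in table order: pre-flight, debug build, 2.1, 2.2, 2.3, 2.4.
total=0
for m in 10 60 50 10 10 10; do
    total=$((total + m))
done
echo "total: ${total} min"                     # prints "total: 150 min"

# Debug-build breakdown: build + install + reboot + verify.
echo "debug build: $((45 + 5 + 2 + 8)) min"    # prints "debug build: 60 min"
```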

If a regression hits early: 30 min to capture lockdep + revert + document.