Phase 4 — iter6 post-mortem retry plan
Plan ID: iter6-postmortem-attempt2
Author: Claude Opus 4.7 (this session)
Status: Pending Phase 5 architect review (Sonnet) before any execution
Cross-ref: phase0_findings_iter6_postmortem.md commit 11d2dde
Goal
Apply the vb2_dma_resv RFC v2 series (0004/0005/0006) + a reviewed 0007 rkvdec opt-in to the ampere kernel WITHOUT silently watchdog-resetting the box during compositor handover. Either succeed (iter6 fixed, can then test the cache-coherency hypothesis against iter5's solid-black output bug), OR fail with a self-diagnostic kernel log entry that pinpoints the deadlock/regression, NOT another reboot loop.
Strategy: bisect-apply with kernel-level safeguards
Don't repeat iter6 attempt-1's mistake of applying 4 inter-dependent patches in one go and rebooting blind. Apply one patch at a time on top of a debug-enabled kernel; reboot + smoke-test between each; revert immediately on any regression and capture the lockdep/oops trace.
Pre-flight (must complete before any patch applies)
| Step | Action | Verify |
|---|---|---|
| 0.1 | SDDM auto-login disabled | done (commit 11d2dde) — /etc/sddm.conf.d/autologin.conf.disabled-iter6postmortem |
| 0.2 | Back up current /lib/modules/7.0.0-rc3-devices+/kernel/drivers/media/{common/videobuf2,platform/verisilicon,platform/rockchip/{rga,rkvdec}}/*.ko as attempt2-pre-base-<ts>.bkp AND tarball-archive to boltzmann:/home/mfritsche/iter6-postmortem-backups/ | ls shows .bkp + scp succeeded |
| 0.3 | Back up current kernel Image (/boot/firmware/Image-7.0.0-rc3-devices+) + initramfs as *.pre-attempt2.bkp | ls shows .bkp |
| 0.4 | Check pstore availability: ls -la /sys/fs/pstore/ as root. If empty/inaccessible, check DT for a ramoops reserved region and consider adding one via cmdline: pstore.backend=ramoops ramoops.mem_address=... ramoops.mem_size=0x200000 ramoops.console_size=0x40000 ramoops.pmsg_size=0x40000 (2 MiB / 256 KiB / 256 KiB written numerically; these are integer module parameters, which as far as I know do not accept K/M suffixes) | pstore directory writable + filesystem mounted |
| 0.5 | Verify serial console: console=ttyS2,1500000 is already in extlinux append. If user has a TTL-USB cable hooked up, capture it. Otherwise note "no serial cable connected" and rely on pstore + journal | document |
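Step 0.4 can be checked mechanically. The following is a minimal sketch, not a definitive implementation: the function names `pstore_mounted` and `pstore_records` are illustrative, and both take their input path as an argument so they can be dry-run against copies before touching the real `/proc/mounts` and `/sys/fs/pstore`.

```shell
#!/bin/sh
# Print "mounted" if a pstore filesystem appears in a mounts table
# (third field of each line is the filesystem type).
# Usage: pstore_mounted [mounts-file]   (defaults to /proc/mounts)
pstore_mounted() {
    mounts="${1:-/proc/mounts}"
    if awk '$3 == "pstore" { found = 1 } END { exit !found }' "$mounts"; then
        echo "mounted"
    else
        echo "not-mounted"
        return 1
    fi
}

# Count the entries in a pstore directory. Entries are records left by
# an earlier crash, so 0 on a healthy box is not by itself an error.
# Usage: pstore_records [dir]   (defaults to /sys/fs/pstore)
pstore_records() {
    dir="${1:-/sys/fs/pstore}"
    count=0
    for entry in "$dir"/*; do
        [ -e "$entry" ] && count=$((count + 1))
    done
    echo "$count"
}
```

Run as root on ampere with no arguments. A result of `mounted` with `0` records only shows that the filesystem is present; it does not prove that a ramoops backend is registered, so the DT check in step 0.4 still applies.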
Build the debug base kernel (one-time, ~45 min)
| Step | Action | Verify |
|---|---|---|
| 1.1 | On ampere, in ~/src/linux-rockchip: back up current .config as .config.pre-iter6postmortem | ls .config.* |
| 1.2 | Enable: ./scripts/config --enable PROVE_LOCKING --enable DEBUG_ATOMIC_SLEEP --enable LOCKDEP --enable DEBUG_RT_MUTEXES --enable DEBUG_SPINLOCK --enable DEBUG_MUTEXES --enable DEBUG_LOCK_ALLOC --enable PROVE_RAW_LOCK_NESTING --enable DEBUG_WW_MUTEX_SLOWPATH --enable PROVE_RCU then make olddefconfig | grep .config shows =y for each option |
| 1.3 | time make -j8 Image modules dtbs | exit 0, ~45 min |
| 1.4 | Back up current /boot/firmware/Image-7.0.0-rc3-devices+ again (defensive, second copy) → install new Image as Image-7.0.0-rc3-devices-lockdep+ → add new extlinux label arch_devices_lockdep pointing at the new kernel (so we can choose between vanilla and lockdep kernel at the boot prompt; don't replace the working arch_devices default) | extlinux.conf shows new label, vanilla unchanged |
| 1.5 | sudo make modules_install INSTALL_MOD_PATH=/tmp/lockdep-modules (stage modules to /tmp first; don't write to /lib/modules yet) | /tmp/lockdep-modules populated |
| 1.6 | After backing up the old module directory, manually cp -a /tmp/lockdep-modules/lib/modules/7.0.0-rc3-devices+/* /lib/modules/<new-uname>/ (plain cp without -a or -r skips the subdirectories) | backup exists before the copy starts |
| 1.7 | Reboot ampere into arch_devices_lockdep (manual selection at boot prompt) | uname -r shows new release suffix |
| 1.8 | Confirm system still HEVC-decodes with iter3+4 fixes (smoke test from iter5 baseline: ffmpeg exit 0, no oops, DEC_RDY=1) on the debug kernel | lockdep kernel baseline established |
If step 1.8 fails (debug kernel itself regresses): STOP. The debug config is the culprit. Investigate which CONFIG flag broke things. Roll back to vanilla kernel before any iter6 work.
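The "=y for each option" verification in step 1.2 can be scripted. A minimal sketch follows; the function name `missing_config_flags` is illustrative. Checking the resulting .config matters because, to my understanding, some of these symbols (LOCKDEP and PROVE_RCU, for example) are derived from PROVE_LOCKING rather than set directly, so `make olddefconfig` has the final say on what ends up enabled.

```shell
#!/bin/sh
# Print every requested option that is NOT set to =y in a kernel .config.
# Prints nothing when all options are enabled.
# Usage: missing_config_flags <config-file> <OPTION>...
missing_config_flags() {
    config="$1"
    shift
    for opt in "$@"; do
        # Match the exact line "CONFIG_<OPTION>=y"; "=m" or
        # "# CONFIG_<OPTION> is not set" both count as missing.
        grep -q "^CONFIG_${opt}=y\$" "$config" || echo "$opt"
    done
}
```

From ~/src/linux-rockchip this would be called as `missing_config_flags .config PROVE_LOCKING DEBUG_ATOMIC_SLEEP LOCKDEP DEBUG_RT_MUTEXES DEBUG_SPINLOCK DEBUG_MUTEXES DEBUG_LOCK_ALLOC PROVE_RAW_LOCK_NESTING DEBUG_WW_MUTEX_SLOWPATH PROVE_RCU`; any output names an option that did not survive `make olddefconfig`.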
Bisect-apply iter6 patches
Order matches the patch series numbering. After each step, reboot + smoke-test + harvest journal/pstore for any lockdep splat. If any step regresses, revert ONLY that step, document the splat, escalate to user.
| Step | Patch | Pre-check | Action | Smoke test | Pass criteria | Fail action |
|---|---|---|---|---|---|---|
| 2.1 | 0004 vb2 helper (kernel core) | working kernel + module backups present | Apply 0004, enable CONFIG_VIDEOBUF2_RELEASE_FENCES=y, rebuild the kernel Image and the videobuf2_common module | reboot → ssh in → ffmpeg HEVC test → check journalctl -k -p warning | exit 0, no new warnings, no "released after" / "is locked" messages | revert 0004, capture journal, STOP |
| 2.2 | 0005 hantro opt-in | step 2.1 passed | Apply 0005, rebuild hantro_vpu module only (~3 min) | reboot → wait for sddm login (manual) → ssh → ffmpeg test → journal check | same as 2.1 + journal also clean | revert 0005, capture, STOP |
| 2.3 | 0006 rockchip-rga opt-in | step 2.2 passed | Apply 0006, rebuild rockchip_rga module only | reboot → same | same | revert 0006, capture, STOP |
| 2.4 | 0007 rkvdec opt-in (my unreviewed patch — needs Phase 5 review before this step!) | step 2.3 passed + Phase 5 approval | Apply 0007 source diff (queue_init flag + device_run hook), rebuild rockchip_vdec only | reboot → same | same | revert 0007, capture, STOP |
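The journal check after each step can be reduced to a filter over the kernel log. This is a sketch under assumptions: the function name `scan_kernel_log` is illustrative, the first four patterns are the report headers I would expect lockdep and DEBUG_ATOMIC_SLEEP to print, and the last two are the message fragments named in the pass criteria. The "is locked" pattern is loose and may match unrelated lines.

```shell
#!/bin/sh
# Print kernel-log lines that look like lockdep or atomic-sleep reports,
# or that contain the fragments named in the pass criteria.
# Reads the log on stdin; exits 0 if anything matched, 1 if the log is clean.
scan_kernel_log() {
    grep -F \
        -e 'possible circular locking dependency detected' \
        -e 'possible recursive locking detected' \
        -e 'inconsistent lock state' \
        -e 'BUG: sleeping function called from invalid context' \
        -e 'released after' \
        -e 'is locked'
}
```

Typical use would be `journalctl -k -b -o cat | scan_kernel_log`, or with `-b -1` to inspect the previous boot after an unexpected reset. Because the exit status is inverted relative to "pass", a step passes only when the function prints nothing and returns 1.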
Post-bisect
- If all 4 steps passed: iter6 attempt-2 succeeds. Commit findings. Move to iter7 (test cache-coherency hypothesis for the iter5 solid-black-output bug — the original reason we wanted these patches).
- If a specific step fails: file kernel-agent#16 or upstream-aligned issue with the specific patch + lockdep splat. Document in iter6_attempt2_close.md.
Risk register
| Risk | Mitigation |
|---|---|
| Lockdep-debug kernel itself doesn't boot | Step 1.4 keeps vanilla label as default; manual boot menu choice required for lockdep kernel; recovery stick handles boot-loop |
| Step 1.6 module install into /lib/modules/ overwrites working modules | Step 1.6 backs up the pre-state first; surgical install only after the backup exists |
| Watchdog still fires before any lockdep splat reaches journal | pstore fallback (step 0.4); serial console if cable hooked up; ramoops capture across resets |
| 0007 has a different bug than what we just fixed | Phase 5 reviews the 0007 source; if reviewer rejects, write a v2 informed by upstream patterns or skip 0007 (test 0004+0005+0006 only as a control — they shouldn't affect rkvdec) |
| User wants to use ampere for other work during 45-min kernel build | Per feedback_no_bulldoze_reboots.md, ASK before kicking off |
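The first mitigation in the table depends on the vanilla label staying the default after step 1.4. A minimal sketch of that check follows, assuming an extlinux.conf that uses the usual `default` and `label` keywords. The function names are illustrative, and the file path is passed in because the plan does not state it.

```shell
#!/bin/sh
# Print the label named on the first "default" line of an extlinux.conf.
# Usage: extlinux_default <extlinux.conf>
extlinux_default() {
    awk 'tolower($1) == "default" { print $2; exit }' "$1"
}

# Succeed (exit 0) if a label with the given name is defined in the file.
# Usage: extlinux_has_label <extlinux.conf> <label>
extlinux_has_label() {
    awk -v want="$2" \
        'tolower($1) == "label" && $2 == want { found = 1 } END { exit !found }' "$1"
}
```

After step 1.4, `extlinux_default` should still print `arch_devices` and `extlinux_has_label` should succeed for `arch_devices_lockdep`; anything else means the edit touched more than intended.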
Exit conditions
Success: ffmpeg exit 0, no regressions, journal/pstore clean across all 4 step boots. Iter6 attempt-2 closes successfully; can proceed to iter7 cache-coherency hypothesis test.
Partial success: 0004+0005+0006 land cleanly; 0007 regresses; revert 0007, iter7 can test cache hypothesis without rkvdec opt-in (helper still in core, just rkvdec doesn't publish a fence).
Failure: 0004 itself regresses on RK3588 (per the silent watchdog reset). File kernel-agent issue; vb2 fence series needs RK3588-specific debugging; iter6 line of investigation is closed; iter7 picks a different cache-coherency test path (DMA_BUF_IOCTL_SYNC backend or DT dma-coherent property).
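The three exit conditions can be expressed as a mapping from per-step results. The sketch below is illustrative only: the function name `classify_outcome` and the `escalate` result are assumptions. The plan does not name an exit condition for a regression at 0005 or 0006, so that case is mapped to the "escalate to user" rule from the bisect section.

```shell
#!/bin/sh
# Map per-step results to an exit condition.
# Arguments are "pass" or "fail" for 0004, 0005, 0006, 0007 in that order;
# steps that never ran (because an earlier one failed) may be omitted.
classify_outcome() {
    case "$1 $2 $3 $4" in
        "pass pass pass pass") echo "success" ;;
        "pass pass pass fail") echo "partial-success" ;;
        fail*)                 echo "failure" ;;          # 0004 itself regressed
        *)                     echo "escalate" ;;         # assumption: 0005/0006 regressed
    esac
}
```

For example, `classify_outcome pass pass pass fail` prints `partial-success`, matching the case where 0007 is reverted and iter7 proceeds without the rkvdec opt-in.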
Estimated wall-time
| Phase | Time |
|---|---|
| Pre-flight (0.1-0.5) | 10 min |
| Debug kernel build (1.1-1.8) | 60 min (build 45 + install 5 + reboot 2 + verify 8) |
| Bisect-apply 0004 (2.1) | 50 min (full kernel rebuild for CONFIG_VIDEOBUF2_RELEASE_FENCES + 0004 changes) |
| Bisect-apply 0005 (2.2) | 10 min |
| Bisect-apply 0006 (2.3) | 10 min |
| Bisect-apply 0007 (2.4) | 10 min |
| Total | ~150 min wall-time if everything goes clean |
If a regression hits early: 30 min to capture lockdep + revert + document.
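The ~150 min total can be re-derived from the per-phase figures in the table. A quick check in shell arithmetic, using only numbers from the table (the variable names are illustrative):

```shell
#!/bin/sh
# Per-phase estimates in minutes, copied from the wall-time table.
preflight=10
debug_kernel=60      # build 45 + install 5 + reboot 2 + verify 8
step_0004=50
step_0005=10
step_0006=10
step_0007=10

total=$((preflight + debug_kernel + step_0004 + step_0005 + step_0006 + step_0007))
echo "clean run: ${total} min"
```

This prints `clean run: 150 min`. The 30 min regression-handling allowance is not part of that sum.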