From 0ef64406b60b508091923f85fd5f45a905c442f0 Mon Sep 17 00:00:00 2001
From: Markus Fritsche
Date: Sat, 16 May 2026 16:13:48 +0000
Subject: [PATCH] iter6 post-mortem Phase 4: bisect-apply plan with lockdep base kernel
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Pre-flight (5 steps): backup, pstore, serial console verify
Debug base kernel (8 steps): PROVE_LOCKING + LOCKDEP + DEBUG_* full
kernel rebuild, separate extlinux label, keep vanilla default
Bisect-apply (4 steps): 0004 → 0005 → 0006 → 0007, reboot+test between each
Risk register: 5 risks with mitigations
Total wall-time: ~150 min if clean

Pending Phase 5 architect review (Sonnet) before any execution.

Co-Authored-By: Claude Opus 4.7
---
 phase4_plan_iter6_postmortem.md | 87 +++++++++++++++++++++++++++++++++
 1 file changed, 87 insertions(+)
 create mode 100644 phase4_plan_iter6_postmortem.md

diff --git a/phase4_plan_iter6_postmortem.md b/phase4_plan_iter6_postmortem.md
new file mode 100644
index 0000000..2e08d14
--- /dev/null
+++ b/phase4_plan_iter6_postmortem.md
@@ -0,0 +1,87 @@
+# Phase 4 — iter6 post-mortem retry plan
+
+Plan ID: iter6-postmortem-attempt2
+Author: Claude Opus 4.7 (this session)
+Status: Pending Phase 5 architect review (Sonnet) before any execution
+Cross-ref: phase0_findings_iter6_postmortem.md commit 11d2dde
+
+## Goal
+
+Apply the vb2_dma_resv RFC v2 series (0004/0005/0006) plus a reviewed 0007 rkvdec opt-in to the ampere kernel WITHOUT silently watchdog-resetting the box during compositor handover. Either succeed (iter6 fixed, so the cache-coherency hypothesis can then be tested against iter5's solid-black output bug), OR fail with a self-diagnostic kernel log entry that pinpoints the deadlock/regression, NOT another reboot loop.
+
+## Strategy: bisect-apply with kernel-level safeguards
+
+Don't repeat iter6 attempt-1's mistake of applying 4 inter-dependent patches in one go and rebooting blind. Apply one patch at a time on top of a debug-enabled kernel; reboot and smoke-test between each; on any regression, revert immediately and capture the lockdep/oops trace.
+
+## Pre-flight (must complete before any patch applies)
+
+| Step | Action | Verify |
+|------|--------|--------|
+| 0.1 | SDDM auto-login disabled | done (commit 11d2dde) — `/etc/sddm.conf.d/autologin.conf.disabled-iter6postmortem` |
+| 0.2 | Back up current `/lib/modules/7.0.0-rc3-devices+/kernel/drivers/media/{common/videobuf2,platform/verisilicon,platform/rockchip/{rga,rkvdec}}/*.ko` as `attempt2-pre-base-.bkp` AND tarball-archive to `boltzmann:/home/mfritsche/iter6-postmortem-backups/` | `ls` shows .bkp + scp succeeded |
+| 0.3 | Back up current kernel Image (`/boot/firmware/Image-7.0.0-rc3-devices+`) + initramfs as `*.pre-attempt2.bkp` | `ls` shows .bkp |
+| 0.4 | Check pstore availability as root: `findmnt /sys/fs/pstore` and `ls -la /sys/fs/pstore/`. An empty directory is normal when no crash has been recorded, so emptiness alone proves nothing. If pstore is not mounted or no backend is registered (`cat /sys/module/pstore/parameters/backend`), check DT for a `ramoops` reserved region and consider adding one via cmdline: `pstore.backend=ramoops ramoops.mem_address=... ramoops.mem_size=0x200000 ramoops.console_size=0x40000 ramoops.pmsg_size=0x40000` (the same 2M/256K/256K sizes; ramoops module parameters take plain byte counts, not K/M suffixes) | pstore filesystem mounted + backend registered |
+| 0.5 | Verify serial console: `console=ttyS2,1500000` is already in extlinux append. If user has a TTL-USB cable hooked up, capture its output. Otherwise note "no serial cable connected" and rely on pstore + journal | document |
+
+## Build the debug base kernel (one-time, ~45 min)
+
+| Step | Action | Verify |
+|------|--------|--------|
+| 1.1 | On ampere, in `~/src/linux-rockchip`: back up current `.config` as `.config.pre-iter6postmortem` | `ls .config.*` |
+| 1.2 | Enable: `./scripts/config --enable PROVE_LOCKING --enable DEBUG_ATOMIC_SLEEP --enable LOCKDEP --enable DEBUG_RT_MUTEXES --enable DEBUG_SPINLOCK --enable DEBUG_MUTEXES --enable DEBUG_LOCK_ALLOC --enable PROVE_RAW_LOCK_NESTING --enable DEBUG_WW_MUTEX_SLOWPATH --enable PROVE_RCU` then `make olddefconfig` | `grep` `.config` for each option: all `=y` after `olddefconfig` |
+| 1.3 | `time make -j8 Image modules dtbs` | exit 0, ~45 min |
+| 1.4 | Back up current `/boot/firmware/Image-7.0.0-rc3-devices+` again (defensive, second copy) → install new Image as `Image-7.0.0-rc3-devices-lockdep+` → add new extlinux label `arch_devices_lockdep` pointing at the new kernel (so we can choose between vanilla and lockdep kernel at boot prompt — don't replace the working `arch_devices` default) | extlinux.conf shows new label, vanilla unchanged |
+| 1.5 | `sudo make modules_install INSTALL_MOD_PATH=/tmp/lockdep-modules` (stage modules to /tmp first — don't write to /lib/modules yet) | /tmp/lockdep-modules populated |
+| 1.6 | Back up the old module tree first, then manually `cp -a /tmp/lockdep-modules/lib/modules/7.0.0-rc3-devices+/* /lib/modules//` (`-a` because the staged tree contains subdirectories) | backup exists before the copy runs |
+| 1.7 | Reboot ampere into `arch_devices_lockdep` (manual selection at boot prompt) | `/proc/lockdep` exists; `uname -r` only shows a new release suffix if `LOCALVERSION` was changed, and the staged module path in 1.6 still reads `7.0.0-rc3-devices+` |
+| 1.8 | Confirm system still HEVC-decodes with iter3+4 fixes (smoke test from iter5 baseline: ffmpeg exit 0, no oops, DEC_RDY=1) on the debug kernel | lockdep kernel baseline established |
+
+**If step 1.8 fails (debug kernel itself regresses)**: STOP. The debug config is the likely culprit. Investigate which CONFIG flag broke things. Roll back to vanilla kernel before any iter6 work.
+
+## Bisect-apply iter6 patches
+
+Order matches the patch series numbering. After each step, reboot + smoke-test + harvest journal/pstore for any lockdep splat. If any step regresses, revert ONLY that step, document the splat, and escalate to the user.
+
+| Step | Patch | Pre-check | Action | Smoke test | Pass criteria | Fail action |
+|------|-------|-----------|--------|------------|---------------|-------------|
+| 2.1 | 0004 vb2 helper (kernel core) | working kernel + module backups present | Apply 0004, enable `CONFIG_VIDEOBUF2_RELEASE_FENCES=y`, rebuild the full kernel plus the videobuf2_common module | reboot → ssh in → ffmpeg HEVC test → check `journalctl -k -p warning` | exit 0, no new warnings, no `released after`/`is locked` | revert 0004, capture journal, STOP |
+| 2.2 | 0005 hantro opt-in | step 2.1 passed | Apply 0005, rebuild hantro_vpu module only (~3 min) | reboot → wait for sddm login (manual) → ssh → ffmpeg test → journal check | same as 2.1 (journal clean after the manual sddm login as well) | revert 0005, capture, STOP |
+| 2.3 | 0006 rockchip-rga opt-in | step 2.2 passed | Apply 0006, rebuild rockchip_rga module only | reboot → same as 2.2 | same as 2.2 | revert 0006, capture, STOP |
+| 2.4 | 0007 rkvdec opt-in (my unreviewed patch — needs Phase 5 review before this step!) | step 2.3 passed + Phase 5 approval | Apply 0007 source diff (queue_init flag + device_run hook), rebuild rockchip_vdec only | reboot → same as 2.2 | same as 2.2 | revert 0007, capture, STOP |
+
+## Post-bisect
+
+- If all 4 steps passed: iter6 attempt-2 succeeds. Commit findings. Move to iter7 (test cache-coherency hypothesis for the iter5 solid-black-output bug — the original reason we wanted these patches).
+- If a specific step fails: file kernel-agent#16 or upstream-aligned issue with the specific patch + lockdep splat. Document in iter6_attempt2_close.md.
+
+## Risk register
+
+| Risk | Mitigation |
+|------|-----------|
+| Lockdep-debug kernel itself doesn't boot | Step 1.4 keeps vanilla label as default; manual boot menu choice required for lockdep kernel; recovery stick handles boot-loop |
+| Step 1.6 module install into /lib/modules/ overwrites working modules | Step 1.5 stages to /tmp first; step 1.6 copies only after backing up the pre-state |
+| Watchdog still fires before any lockdep splat reaches journal | pstore fallback (step 0.4); serial console if cable hooked up; ramoops capture across resets |
+| 0007 has a different bug than what we just fixed | Phase 5 reviews the 0007 source; if reviewer rejects, write a v2 informed by upstream patterns or skip 0007 (test 0004+0005+0006 only as a control — they shouldn't affect rkvdec) |
+| User wants to use ampere for other work during 45-min kernel build | Per `feedback_no_bulldoze_reboots.md`, ASK before kicking off |
+
+## Exit conditions
+
+**Success**: ffmpeg exit 0, no regressions, journal/pstore clean across all 4 step boots. Iter6 attempt-2 closes successfully; can proceed to iter7 cache-coherency hypothesis test.
+
+**Partial success**: 0004+0005+0006 land cleanly; 0007 regresses; revert 0007, iter7 can test cache hypothesis without rkvdec opt-in (helper still in core, just rkvdec doesn't publish a fence).
+
+**Failure**: 0004 itself regresses on RK3588 (reproducing attempt-1's silent watchdog reset). File kernel-agent issue; vb2 fence series needs RK3588-specific debugging; iter6 line of investigation is closed; iter7 picks a different cache-coherency test path (DMA_BUF_IOCTL_SYNC backend or DT dma-coherent property).
+
+## Estimated wall-time
+
+| Phase | Time |
+|-------|------|
+| Pre-flight (0.1-0.5) | 10 min |
+| Debug kernel build (1.1-1.8) | 60 min (build 45 + install 5 + reboot 2 + verify 8) |
+| Bisect-apply 0004 (2.1) | 50 min (full kernel rebuild for CONFIG_VIDEOBUF2_RELEASE_FENCES + 0004 changes) |
+| Bisect-apply 0005 (2.2) | 10 min |
+| Bisect-apply 0006 (2.3) | 10 min |
+| Bisect-apply 0007 (2.4) | 10 min |
+| **Total** | **~150 min wall-time** if everything goes clean |
+
+If a regression hits early: 30 min to capture lockdep + revert + document.
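
The plan's two mechanical checks, the step 1.2 "each option = y" verification and the per-step journal gate in the bisect-apply table, could be scripted roughly as below. This is a minimal POSIX shell sketch, not part of the patch. The function names `check_config` and `scan_klog` are hypothetical. The first two patterns come from the plan's pass criteria; the remaining ones are assumed to be the usual lockdep and DEBUG_ATOMIC_SLEEP report headings and may need extending once a real splat is captured.

```shell
#!/bin/sh
# Hypothetical helpers for the plan's verify steps (names are illustrative).

# Patterns, one per line. "released after" and "is locked" are the plan's
# own pass criteria; the rest are assumed lockdep/atomic-sleep headings.
KLOG_PATTERNS='released after
is locked
possible circular locking dependency detected
possible recursive locking detected
inconsistent lock state
BUG: sleeping function called from invalid context'

# check_config FILE OPTION...
# Returns 0 only if every OPTION is present as CONFIG_<OPTION>=y in FILE.
# Each missing option is reported on stderr.
check_config() {
    cfg=$1
    shift
    missing=0
    for opt in "$@"; do
        # -x matches the whole line, so "# CONFIG_X is not set" never passes.
        if ! grep -q -x -F "CONFIG_${opt}=y" -- "$cfg"; then
            echo "missing: CONFIG_${opt}" >&2
            missing=1
        fi
    done
    return "$missing"
}

# scan_klog FILE
# Prints every line of FILE that contains one of KLOG_PATTERNS.
# Returns 0 if the log is clean, 1 if anything matched, 2 if FILE is unreadable.
scan_klog() {
    [ -r "$1" ] || return 2
    # grep -F treats each newline-separated entry as a fixed string.
    if grep -F -e "$KLOG_PATTERNS" -- "$1"; then
        return 1
    fi
    return 0
}
```

One possible use after each reboot: save the kernel log with `journalctl -k -b -p warning -o cat > /tmp/step.log`, then run `scan_klog /tmp/step.log` and treat a non-zero exit as the table's fail action. For step 1.2, `check_config .config PROVE_LOCKING LOCKDEP DEBUG_ATOMIC_SLEEP` (extended with the remaining options) reports any option that `make olddefconfig` dropped. The `is locked` pattern is broad, so matches should be read rather than trusted blindly.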