# Phase 4: iter6 post-mortem retry plan

Plan ID: iter6-postmortem-attempt2
Author: Claude Opus 4.7 (this session)
Status: Pending Phase 5 architect review (Sonnet) before any execution
Cross-ref: phase0_findings_iter6_postmortem.md commit 11d2dde

## Goal

Apply the vb2_dma_resv RFC v2 series (0004/0005/0006) plus a reviewed 0007 rkvdec opt-in to the ampere kernel WITHOUT a silent watchdog reset of the box during compositor handover. Either succeed (iter6 fixed, so the cache-coherency hypothesis can then be tested against iter5's solid-black output bug), OR fail with a self-diagnostic kernel log entry that pinpoints the deadlock/regression, NOT with another reboot loop.

## Strategy: bisect-apply with kernel-level safeguards

Don't repeat iter6 attempt-1's mistake of applying 4 inter-dependent patches in one go and rebooting blind. Apply one patch at a time on top of a debug-enabled kernel; reboot + smoke-test between each; revert immediately on any regression and capture the lockdep/oops trace.

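The loop described above can be sketched as a small driver. This is a minimal sketch, assuming it runs on a second host that reaches ampere over ssh (the box reboots between steps); `apply_patch`, `smoke_test` and `revert_patch` are hypothetical hooks that the operator would have to write, not existing tools.

```shell
# bisect_apply PATCH...: apply patches one at a time and stop at the
# first failure. apply_patch, smoke_test and revert_patch are
# hypothetical hooks that must be defined before calling this.
bisect_apply() {
    for patch in "$@"; do
        echo "applying $patch"
        if ! apply_patch "$patch"; then
            echo "apply failed: $patch"
            return 1
        fi
        # On the real box, the reboot and the HEVC test happen in here.
        if ! smoke_test "$patch"; then
            echo "regression at $patch, reverting only this patch"
            revert_patch "$patch"
            return 1
        fi
        echo "passed $patch"
    done
    echo "all patches passed"
}

# Usage: bisect_apply 0004 0005 0006 0007
```

Stopping at the first failure keeps the regression attributable to exactly one patch, which is the point of not applying the series in one go.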
## Pre-flight (must complete before any patch applies)

| Step | Action | Verify |
|------|--------|--------|
| 0.1 | SDDM auto-login disabled | done (commit 11d2dde): `/etc/sddm.conf.d/autologin.conf.disabled-iter6postmortem` |
| 0.2 | Back up current `/lib/modules/7.0.0-rc3-devices+/kernel/drivers/media/{common/videobuf2,platform/verisilicon,platform/rockchip/{rga,rkvdec}}/*.ko` as `attempt2-pre-base-<ts>.bkp` AND archive a tarball to `boltzmann:/home/mfritsche/iter6-postmortem-backups/` | `ls` shows .bkp + scp succeeded |
| 0.3 | Back up current kernel Image (`/boot/firmware/Image-7.0.0-rc3-devices+`) + initramfs as `*.pre-attempt2.bkp` | `ls` shows .bkp |
| 0.4 | Check pstore availability: `ls -la /sys/fs/pstore/` as root. If empty/inaccessible, check DT for a `ramoops` reserved region and consider adding one via cmdline: `pstore.backend=ramoops ramoops.mem_address=... ramoops.mem_size=0x200000 ramoops.console_size=0x40000 ramoops.pmsg_size=0x40000` (sizes as plain byte counts; these are integer module parameters and do not accept `M`/`K` suffixes) | pstore directory writable + filesystem mounted |
| 0.5 | Verify serial console: `console=ttyS2,1500000` is already in extlinux append. If user has a TTL-USB cable hooked up, capture its output. Otherwise note "no serial cable connected" and rely on pstore + journal | document |

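Steps 0.3 and 0.4 could be scripted roughly as follows. This is a sketch only: `backup_copy` and `pstore_state` are hypothetical helper names, and the refusal to overwrite an existing backup is an added safety assumption, not something the plan requires.

```shell
# backup_copy FILE SUFFIX: copy FILE to FILE.SUFFIX and print the new
# path. Refuses to overwrite an existing backup (assumed policy).
backup_copy() {
    src=$1
    dst="$1.$2"
    [ -e "$src" ] || { echo "missing: $src" >&2; return 1; }
    [ ! -e "$dst" ] || { echo "backup already exists: $dst" >&2; return 1; }
    cp -p "$src" "$dst" && echo "$dst"
}

# pstore_state [DIR]: print "missing", "empty" or "has-records" for the
# pstore mount point (default /sys/fs/pstore).
pstore_state() {
    dir=${1:-/sys/fs/pstore}
    [ -d "$dir" ] || { echo missing; return 0; }
    if [ -n "$(ls -A "$dir" 2>/dev/null)" ]; then
        echo has-records
    else
        echo empty
    fi
}

# Usage (as root on ampere):
#   backup_copy /boot/firmware/Image-7.0.0-rc3-devices+ pre-attempt2.bkp
#   pstore_state
```

An `empty` result is ambiguous on its own: it can mean no backend is registered, or simply that nothing has crashed since the records were last cleared. Reading `/sys/module/pstore/parameters/backend` should show which backend, if any, is active.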
## Build the debug base kernel (one-time, ~45 min)

| Step | Action | Verify |
|------|--------|--------|
| 1.1 | On ampere, in `~/src/linux-rockchip`: back up current `.config` as `.config.pre-iter6postmortem` | `ls .config.*` |
| 1.2 | Enable: `./scripts/config --enable PROVE_LOCKING --enable DEBUG_ATOMIC_SLEEP --enable LOCKDEP --enable DEBUG_RT_MUTEXES --enable DEBUG_SPINLOCK --enable DEBUG_MUTEXES --enable DEBUG_LOCK_ALLOC --enable PROVE_RAW_LOCK_NESTING --enable DEBUG_WW_MUTEX_SLOWPATH --enable PROVE_RCU` then `make olddefconfig` | grep .config for each = y |
| 1.3 | `time make -j8 Image modules dtbs` | exit 0, ~45 min |
| 1.4 | Back up current `/boot/firmware/Image-7.0.0-rc3-devices+` again (defensive, second copy) → install new Image as `Image-7.0.0-rc3-devices-lockdep+` → add new extlinux label `arch_devices_lockdep` pointing at the new kernel, so we can choose between vanilla and lockdep kernel at the boot prompt. Don't replace the working `arch_devices` default | extlinux.conf shows new label, vanilla unchanged |
| 1.5 | `sudo make modules_install INSTALL_MOD_PATH=/tmp/lockdep-modules` (stage modules to /tmp first; don't write to /lib/modules yet) | /tmp/lockdep-modules populated |
| 1.6 | After backing up the old modules, copy manually: `cp -a /tmp/lockdep-modules/lib/modules/<new-uname>/. /lib/modules/<new-uname>/`, where `<new-uname>` is the release string printed by `make -s kernelrelease` (`-a` because the staged tree contains directories, which plain `cp` skips). If `<new-uname>` is still `7.0.0-rc3-devices+`, the copy overwrites the vanilla kernel's modules | careful ordering |
| 1.7 | Reboot ampere into `arch_devices_lockdep` (manual selection at boot prompt) | `uname -r` shows new release suffix |
| 1.8 | Confirm system still HEVC-decodes with iter3+4 fixes (smoke test from iter5 baseline: ffmpeg exit 0, no oops, DEC_RDY=1) on the debug kernel | lockdep kernel baseline established |

**If step 1.8 fails (debug kernel itself regresses)**: STOP. The debug config is the culprit. Investigate which CONFIG flag broke things. Roll back to vanilla kernel before any iter6 work.

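The "grep .config for each = y" check in step 1.2 matters because `make olddefconfig` silently drops any symbol whose dependencies are not met. A minimal sketch of that check follows; `check_config_flags` is a hypothetical helper name.

```shell
# check_config_flags CONFIG_FILE FLAG...: print every flag that is not
# set to =y in CONFIG_FILE; return non-zero if at least one is missing.
check_config_flags() {
    cfg=$1
    shift
    missing=0
    for flag in "$@"; do
        # -x matches the whole line, so CONFIG_FOO=y cannot be
        # satisfied by CONFIG_FOO_BAR=y or by a "# ... is not set" line.
        if ! grep -qx "CONFIG_${flag}=y" "$cfg"; then
            echo "not enabled: CONFIG_${flag}"
            missing=1
        fi
    done
    return "$missing"
}

# Usage, in ~/src/linux-rockchip after `make olddefconfig`:
#   check_config_flags .config PROVE_LOCKING DEBUG_ATOMIC_SLEEP LOCKDEP \
#       DEBUG_RT_MUTEXES DEBUG_SPINLOCK DEBUG_MUTEXES DEBUG_LOCK_ALLOC \
#       PROVE_RAW_LOCK_NESTING DEBUG_WW_MUTEX_SLOWPATH PROVE_RCU
```

Running it before the 45-minute build in step 1.3 avoids discovering a dropped option only after the reboot.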
## Bisect-apply iter6 patches

Order matches the patch series numbering. After each step, reboot + smoke-test + harvest journal/pstore for any lockdep splat. If any step regresses, revert ONLY that step, document the splat, escalate to user.

| Step | Patch | Pre-check | Action | Smoke test | Pass criteria | Fail action |
|------|-------|-----------|--------|------------|---------------|-------------|
| 2.1 | 0004 vb2 helper (kernel core) | working kernel + module backups present | Apply 0004, enable `CONFIG_VIDEOBUF2_RELEASE_FENCES=y`, full rebuild (kernel + videobuf2_common) | reboot → ssh in → ffmpeg HEVC test → check `journalctl -k -p warning` | exit 0, no new warnings, no `released after`/`is locked` | revert 0004, capture journal, STOP |
| 2.2 | 0005 hantro opt-in | step 2.1 passed | Apply 0005, rebuild hantro_vpu module only (~3 min) | reboot → wait for sddm login (manual) → ssh → ffmpeg test → journal check | same as 2.1 + journal also clean | revert 0005, capture, STOP |
| 2.3 | 0006 rockchip-rga opt-in | step 2.2 passed | Apply 0006, rebuild rockchip_rga module only | same as 2.2 | same as 2.2 | revert 0006, capture, STOP |
| 2.4 | 0007 rkvdec opt-in (my unreviewed patch; needs Phase 5 review before this step!) | step 2.3 passed + Phase 5 approval | Apply 0007 source diff (queue_init flag + device_run hook), rebuild rockchip_vdec only | same as 2.2 | same as 2.2 | revert 0007, capture, STOP |

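The journal check in the smoke-test column could be automated along these lines. This is a sketch: `scan_kernel_log` is a hypothetical helper. The first two patterns are the strings named in the pass criteria; the other two are common lockdep and DEBUG_ATOMIC_SLEEP report headers added here as an assumption about what a splat would look like.

```shell
# scan_kernel_log FILE: print suspicious lines (with line numbers) from
# a saved kernel log. Returns 0 when the log is clean, 1 on any match.
scan_kernel_log() {
    # 'released after' and 'is locked' come from the pass criteria; the
    # other two patterns are assumed splat headers.
    if grep -nE 'released after|is locked|possible circular locking dependency|BUG: sleeping function called from invalid context' "$1"; then
        return 1
    fi
    return 0
}

# Usage after each reboot:
#   journalctl -k -b -p warning > step-2.1.log
#   scan_kernel_log step-2.1.log && echo "step 2.1 journal clean"
```

A clean scan is necessary but not sufficient: "no new warnings" still needs a diff against the log captured on the lockdep baseline in step 1.8.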
## Post-bisect

- If all 4 steps passed: iter6 attempt-2 succeeds. Commit findings. Move to iter7 (test the cache-coherency hypothesis for the iter5 solid-black-output bug, the original reason we wanted these patches).
- If a specific step fails: file kernel-agent#16 or an upstream-aligned issue with the specific patch + lockdep splat. Document in iter6_attempt2_close.md.

## Risk register

| Risk | Mitigation |
|------|-----------|
| Lockdep-debug kernel itself doesn't boot | Step 1.4 keeps vanilla label as default; manual boot menu choice required for lockdep kernel; recovery stick handles boot-loop |
| Step 1.6 install into /lib/modules/ overwrites working modules | Step 1.6 backs up pre-state; surgical install only after backup |
| Watchdog still fires before any lockdep splat reaches journal | pstore fallback (step 0.4); serial console if cable hooked up; ramoops capture across resets |
| 0007 has a different bug than what we just fixed | Phase 5 reviews the 0007 source; if reviewer rejects, write a v2 informed by upstream patterns or skip 0007 (test 0004+0005+0006 only as a control; they shouldn't affect rkvdec) |
| User wants to use ampere for other work during 45-min kernel build | Per `feedback_no_bulldoze_reboots.md`, ASK before kicking off |

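For the pstore fallback, the records that survive a watchdog reset have to be copied off before they are cleared. A minimal sketch, assuming a per-step archive directory; `harvest_pstore` and the directory layout are hypothetical.

```shell
# harvest_pstore SRC DEST_ROOT TAG: copy every record file in SRC into
# DEST_ROOT/TAG and print how many were saved.
harvest_pstore() {
    src=$1
    dest="$2/$3"
    mkdir -p "$dest" || return 1
    count=0
    for rec in "$src"/*; do
        # An empty directory leaves the glob unexpanded; skip it.
        [ -f "$rec" ] || continue
        cp -p "$rec" "$dest/" && count=$((count + 1))
    done
    echo "$count"
}

# Usage (as root, first thing after an unexpected reset):
#   harvest_pstore /sys/fs/pstore /root/iter6-pstore step-2.1
```

A non-zero count right after a boot that was supposed to be clean is itself a signal worth recording, even before the contents are read.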
## Exit conditions

**Success**: ffmpeg exit 0, no regressions, journal/pstore clean across all 4 step boots. Iter6 attempt-2 closes successfully; can proceed to iter7 cache-coherency hypothesis test.

**Partial success**: 0004+0005+0006 land cleanly; 0007 regresses; revert 0007, iter7 can test cache hypothesis without rkvdec opt-in (helper still in core, just rkvdec doesn't publish a fence).

**Failure**: 0004 itself regresses on RK3588 (per the silent watchdog reset). File kernel-agent issue; vb2 fence series needs RK3588-specific debugging; iter6 line of investigation is closed; iter7 picks a different cache-coherency test path (DMA_BUF_IOCTL_SYNC backend or DT dma-coherent property).

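The three exit conditions map onto the per-step results as sketched below. `classify_outcome` is a hypothetical helper. The `escalate` branch for a 0005 or 0006 failure is an assumption based on the Post-bisect section, since the exit conditions above do not name that case.

```shell
# classify_outcome R4 R5 R6 R7: each argument is "pass" or "fail" for
# patches 0004..0007. Steps after the first failure are never run, so
# only the first "fail" matters.
classify_outcome() {
    case "$1:$2:$3:$4" in
        pass:pass:pass:pass) echo success ;;
        pass:pass:pass:*)    echo partial-success ;;
        fail:*)              echo failure ;;
        *)                   echo escalate ;;  # 0005 or 0006 failed (assumed)
    esac
}

# Usage: classify_outcome pass pass pass fail
```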
## Estimated wall-time

| Phase | Time |
|-------|------|
| Pre-flight (0.1-0.5) | 10 min |
| Debug kernel build (1.1-1.8) | 60 min (build 45 + install 5 + reboot 2 + verify 8) |
| Bisect-apply 0004 (2.1) | 50 min (full kernel rebuild for CONFIG_VIDEOBUF2_RELEASE_FENCES + 0004 changes) |
| Bisect-apply 0005 (2.2) | 10 min |
| Bisect-apply 0006 (2.3) | 10 min |
| Bisect-apply 0007 (2.4) | 10 min |
| **Total** | **~150 min wall-time** if everything goes clean |

If a regression hits early: 30 min to capture lockdep + revert + document.
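As a cross-check of the table, the per-phase estimates add up as follows (variable names are illustrative only):

```shell
preflight=10
debug_kernel=$((45 + 5 + 2 + 8))   # build + install + reboot + verify
step_0004=50                        # full kernel rebuild
module_steps=$((3 * 10))            # 0005, 0006, 0007 at 10 min each
total=$((preflight + debug_kernel + step_0004 + module_steps))
echo "debug kernel: ${debug_kernel} min, total: ${total} min"
```

This prints `debug kernel: 60 min, total: 150 min`, matching the Total row.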