ampere-kernel-decoders/phase4_plan_iter6_postmortem_v2.md
Markus Fritsche 1656f84a40 iter6 plan v2: log A5 execution deviation — PROVE_RCU forced =y by kconfig
CONFIG_PROVE_RCU is selected by CONFIG_PROVE_LOCKING via Kconfig
dependency. Cannot disable without disabling PROVE_LOCKING itself
(which would defeat the purpose). Documenting the execution deviation
from A5: PROVE_RCU=y remains. 10-min HW watchdog has plenty of margin.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-16 16:54:38 +00:00

# Phase 4 — iter6 post-mortem retry plan (v2, amended per Phase 5 architect review)
Plan ID: iter6-postmortem-attempt2
Author: Claude Opus 4.7
Reviewer: Sonnet (architect, Phase 5 round 1, returned 5 amendments + 0007 REJECT)
Status: v2 pending Phase 5 round 2 (delta-review) before any execution
Cross-ref: phase0_findings_iter6_postmortem.md (commit 11d2dde), phase4_plan_iter6_postmortem.md (commit 0ef6440 — superseded)
## Goal
Same as v1. Apply 0004 / 0005 / 0006 / 0007-v2 to ampere kernel WITHOUT silent watchdog reset. Either succeed, or fail with a self-diagnostic lockdep splat that survives to logs.
## Amendments vs v1 (architect findings)
| # | Finding | Resolution |
|---|---------|-----------|
| A1 | 0007 v1 placement (in `rkvdec_device_run`) inconsistent with hantro's pattern + metadata-copy ordering wrong | **0007 v2 written**: 7 per-codec `*_run()` insertion points (h264, hevc, vdpu381-h264, vdpu381-hevc, vdpu383-h264, vdpu383-hevc, vp9), each AFTER preamble + BEFORE HW kick. `supports_release_fences = true` stays in queue_init. See `0007-v2-rkvdec-opt-in.patch` (8 hunks, 25 +lines) |
| A2 | Missing risk: panthor `ww_acquire_ctx` contention on same resv as vb2 `dma_resv_lock(NULL)` under `dma_fence_begin_signalling` — primary RK3588 wedge hypothesis | Added to risk register as **Primary Hypothesis H1**. Step 2.1's smoke test extended to include GPU compositor exercise (open kwin, mpv playback with hwdec=vaapi, drm-vblank-trigger pattern), not just headless ffmpeg |
| A3 | Step 1.5 `modules_install` overwrites working modules on `uname -r` collision | Added `CONFIG_LOCALVERSION="-lockdep"` to step 1.2 config block → distinct release suffix `7.0.0-rc3-devices-lockdep+` → separate `/lib/modules/.../`. Original modules untouched |
| A4 | Step 0.4 "consider adding ramoops" too soft: a silent watchdog reset can't be diagnosed without a pre-reset capture path | Step 0.4 reworked as **HARD GATE**. Pre-flight aborts if neither serial console (TTL-USB cable confirmed connected and capturing) NOR ramoops region (verified writable + producing test entry) is functional. User decision input required before pre-flight starts. With the user's ramoops-only choice (0.4c), enforcement of this gate moves to step 1.8a (see 0.4-G) |
| A5 | `CONFIG_PROVE_RCU` slows boot + may push past watchdog before lockdep prints | Attempted removal but **PROVE_RCU is selected by PROVE_LOCKING via Kconfig — can't be disabled without disabling PROVE_LOCKING itself**. Execution deviation logged: PROVE_RCU=y remains. Mitigation: 10-min HW watchdog gives plenty of margin even with PROVE_RCU enabled. Kept: PROVE_LOCKING, DEBUG_ATOMIC_SLEEP, LOCKDEP, DEBUG_RT_MUTEXES, DEBUG_SPINLOCK, DEBUG_MUTEXES, DEBUG_LOCK_ALLOC, PROVE_RAW_LOCK_NESTING, DEBUG_WW_MUTEX_SLOWPATH, PSTORE_CONSOLE, PSTORE_PMSG, PSTORE_RAM=m |
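
Since A5 ends with PROVE_RCU staying enabled, a quick check of the generated `.config` could confirm that the kept options (and the A3 suffix) actually survived `make olddefconfig`. The following is a minimal sketch; the function name `check_lockdep_config` is hypothetical and not part of the plan.

```shell
#!/bin/sh
# Hypothetical helper (not part of the plan): list every option from the
# A5 "kept" set that is not present with the expected value in a .config.
# Returns 0 when everything matches, 1 otherwise.
check_lockdep_config() {
    cfg="$1"
    missing=0
    for opt in PROVE_LOCKING PROVE_RCU DEBUG_ATOMIC_SLEEP LOCKDEP \
               DEBUG_RT_MUTEXES DEBUG_SPINLOCK DEBUG_MUTEXES DEBUG_LOCK_ALLOC \
               PROVE_RAW_LOCK_NESTING DEBUG_WW_MUTEX_SLOWPATH \
               PSTORE_CONSOLE PSTORE_PMSG; do
        if ! grep -q "^CONFIG_${opt}=y\$" "$cfg"; then
            echo "missing: CONFIG_${opt}=y"
            missing=1
        fi
    done
    # PSTORE_RAM is expected to stay modular, per A5.
    if ! grep -q '^CONFIG_PSTORE_RAM=m$' "$cfg"; then
        echo "missing: CONFIG_PSTORE_RAM=m"
        missing=1
    fi
    # A3: distinct release suffix.
    if ! grep -q '^CONFIG_LOCALVERSION="-lockdep"$' "$cfg"; then
        echo 'missing: CONFIG_LOCALVERSION="-lockdep"'
        missing=1
    fi
    return "$missing"
}
```

Run as `check_lockdep_config .config` from the kernel tree after step 1.2; any `missing:` line means the config block did not take effect as intended.
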
## Pre-flight (REWRITTEN, hard gates marked)
| Step | Action | Verify | Gate |
|------|--------|--------|------|
| 0.1 | SDDM auto-login disabled | Done — `/etc/sddm.conf.d/autologin.conf.disabled-iter6postmortem` | ✓ |
| 0.2 | Backup `/lib/modules/7.0.0-rc3-devices+/kernel/drivers/media/{common/videobuf2,platform/verisilicon,platform/rockchip/{rga,rkvdec}}/*.ko` as `attempt2-pre-base-<ts>.bkp` AND scp tarball to `boltzmann:/home/mfritsche/iter6-postmortem-backups/` | `ls` shows .bkp + scp returned 0 | **HARD GATE** — abort if backup write fails |
| 0.3 | Backup `/boot/firmware/Image-7.0.0-rc3-devices+` and `initramfs-7.0.0-rc3-devices+` as `*.pre-attempt2.bkp` | `ls` | **HARD GATE** |
| 0.4a | Probe current kernel: `CONFIG_PSTORE_RAM=m` confirmed on ampere; `CONFIG_PSTORE_CONSOLE` currently `n` (will be enabled in step 1.2). `/sys/fs/pstore/` exists + empty | survey only | informational |
| 0.4b | Memory layout survey: `cat /proc/iomem`. Plan ramoops carve-out at `0x10000000` (1 MB inside the normal-RAM gap between `0x4500000` and `0xd7a00000`) | iomem confirms region is "System RAM" not pre-reserved | informational |
| 0.4c | **NOTE**: User chose ramoops over serial, accepted documented limitation that hard spinlock-with-IRQ-off deadlocks may not flush to pstore before watchdog reset. Serial would capture those; ramoops may not | acknowledged | informational |
| 0.4-G | **HARD GATE moved to step 1.8a**: ramoops verification is only possible after the lockdep kernel boots with PSTORE_CONSOLE=y. Pre-flight does NOT abort here. Step 1.8a runs `echo c > /proc/sysrq-trigger` to force a crash, reboots, verifies `/sys/fs/pstore/` is non-empty; if empty, ABORT before bisect-apply 2.x and reconsider serial | step 1.8a enforces | mandatory at 1.8a |
| 0.5 | Verify ampere `~/src/linux-rockchip` working tree state: iter3+iter4+diag patches present, iter6 v1 reverted (per recovery), `git status` shows expected files modified | git status output matches | informational |
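
The 0.4b survey ("region is System RAM, not pre-reserved") could be checked mechanically against a saved copy of `/proc/iomem`. This is a sketch only: `region_in_system_ram` is a hypothetical helper, and it assumes the usual listing format of `start-end : name` with nested entries indented by spaces. Note that `/proc/iomem` shows real addresses only when read as root.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if [base, base+size) lies inside a
# top-level "System RAM" range AND no nested entry of that range overlaps it.
# Usage: region_in_system_ram <iomem-file> <base> <size>
region_in_system_ram() {
    iomem="$1"
    base=$(($2))
    end=$(($2 + $3 - 1))
    found=1        # shell convention: 1 = failure until a parent matches
    in_parent=0
    while IFS= read -r line; do
        range=${line%% :*}      # "  02000000-03ffffff"
        range=${range##* }      # strip indentation
        name=${line#*: }
        lo=$((0x${range%-*}))
        hi=$((0x${range#*-}))
        case "$line" in
            " "*)  # nested entry: reject if it overlaps inside our parent
                if [ "$in_parent" -eq 1 ] && [ "$lo" -le "$end" ] && [ "$hi" -ge "$base" ]; then
                    return 1
                fi
                ;;
            *)     # top-level entry
                in_parent=0
                if [ "$name" = "System RAM" ] && [ "$base" -ge "$lo" ] && [ "$end" -le "$hi" ]; then
                    in_parent=1
                    found=0
                fi
                ;;
        esac
    done < "$iomem"
    return "$found"
}
```

For the planned carve-out this would be invoked as `sudo cat /proc/iomem > iomem.txt; region_in_system_ram iomem.txt 0x10000000 0x100000`.
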
## Build the lockdep debug base kernel (~45 min one-time)
| Step | Action |
|------|--------|
| 1.1 | `cp .config .config.pre-iter6postmortem` |
| 1.2 | `./scripts/config --enable PROVE_LOCKING --enable DEBUG_ATOMIC_SLEEP --enable LOCKDEP --enable DEBUG_RT_MUTEXES --enable DEBUG_SPINLOCK --enable DEBUG_MUTEXES --enable DEBUG_LOCK_ALLOC --enable PROVE_RAW_LOCK_NESTING --enable DEBUG_WW_MUTEX_SLOWPATH` (PROVE_RCU is not set explicitly; it stays =y because PROVE_LOCKING selects it, see A5). **Ramoops/pstore**: `./scripts/config --enable PSTORE_CONSOLE --enable PSTORE_PMSG --module PSTORE_RAM` (PSTORE_RAM stays =m, console=y captures kernel printk to the ramoops region). Set `CONFIG_LOCALVERSION="-lockdep"` (A3). `make olddefconfig` |
| 1.3 | `time make -j8 Image modules dtbs` (~45 min) |
| 1.4 | `sudo make modules_install` with no `INSTALL_MOD_PATH` override. The default target respects LOCALVERSION and installs to `/lib/modules/7.0.0-rc3-devices-lockdep+/`, separate from the working kernel's modules. Verify the destination path (e.g. `make -s kernelrelease`) before running with sudo |
| 1.5 | `sudo cp arch/arm64/boot/Image /boot/firmware/Image-7.0.0-rc3-devices-lockdep+`. Generate initramfs for new release: `sudo mkinitcpio -k 7.0.0-rc3-devices-lockdep+ -g /boot/firmware/initramfs-7.0.0-rc3-devices-lockdep+` |
| 1.6 | Backup extlinux.conf, then `sudo` edit: ADD new label `arch_devices_lockdep` pointing at the new Image + initrd. Since we're ramoops-only (no serial), per round 2 Q3 the `DEFAULT arch_devices_lockdep` override is MANDATORY for this test boot. Restore `DEFAULT arch_devices` before any subsequent reboot where lockdep boot is not desired. **Ramoops cmdline (append)**: extend the lockdep label's `append` line with `memmap=0x100000$0x10000000 ramoops.mem_address=0x10000000 ramoops.mem_size=0x100000 ramoops.record_size=0x10000 ramoops.console_size=0x40000 ramoops.dump_oops=1`. Carves 1 MB at 256 MB physical → 256 KB console buffer + 11 × 64 KB oops records |
| 1.7 | Reboot. Since DEFAULT was set to `arch_devices_lockdep` in step 1.6, U-Boot loads lockdep kernel automatically. Verify `uname -r` = `7.0.0-rc3-devices-lockdep+`. Verify journal has `Lockdep is enabled`. `sudo modprobe pstore_ram` (or verify auto-loaded). Verify `ls /sys/fs/pstore/` after triggering a benign console write to confirm the ramoops region is accessible |
| 1.8a | Ramoops verify (HARD GATE per 0.4-G): `sync; sudo bash -c "echo c > /proc/sysrq-trigger"`. Ampere panics → reboots (still on lockdep default). After ssh comes back, check `ls /sys/fs/pstore/` — must contain at least one `dmesg-ramoops-*` and one `console-ramoops-*` file. If empty, ABORT — ramoops doesn't capture; reconsider serial cable. Backup pstore contents to `/home/mfritsche/iter6-postmortem-ramoops-verify/` then `sudo rm /sys/fs/pstore/*` to clear for actual tests |
| 1.8b | Smoke test: ffmpeg HEVC decode (iter5 baseline), check `journalctl -k -p warning -b 0` for any new lockdep splats produced by iter3+4 alone. Expectation: clean (or only pre-existing edp/vblank WARNs) |
| 1.9 | **A2 GPU smoke test**: open SDDM login → log in to plasma wayland → open mpv with `--hwdec=vaapi-copy ~/measurements/encoded/bbb_60s_720p.hevc.mp4`. Watch ~30 seconds. Check journal again. Reason: the A2 hypothesis is panthor + kwin + V4L2 dmabuf contention, which only surfaces under active GPU composition. If iter3+4 alone (no fence helper yet) emits a lockdep splat with GPU active, the bug lies upstream of the iter6 patches: STOP and investigate |
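
The "11 × 64 KB oops records" figure in step 1.6 can be reproduced with a little arithmetic. The sketch below assumes that the ramoops driver subtracts the console, ftrace and pmsg areas from `mem_size` before dividing by `record_size`, and that the ftrace and pmsg areas are left at a 4 KiB default each. Neither assumption is stated in the plan; `ramoops_dump_records` is a hypothetical helper.

```shell
#!/bin/sh
# Record-count arithmetic for the step 1.6 ramoops parameters.
# ASSUMPTION: ftrace_size and pmsg_size default to 4 KiB (0x1000) each;
# pass them explicitly as the 4th and 5th arguments if that differs.
# Usage: ramoops_dump_records <mem_size> <record_size> <console_size> [ftrace] [pmsg]
ramoops_dump_records() {
    mem_size=$(($1))
    record_size=$(($2))
    console_size=$(($3))
    ftrace_size=$((${4:-0x1000}))
    pmsg_size=$((${5:-0x1000}))
    dump_area=$((mem_size - console_size - ftrace_size - pmsg_size))
    echo $((dump_area / record_size))
}

# Values from the step 1.6 cmdline; prints 11 under the assumed defaults.
ramoops_dump_records 0x100000 0x10000 0x40000
```

With both optional areas set to 0 the same call would yield 12, so the record count actually observed after step 1.8a is a cheap cross-check of which defaults the kernel really uses.
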
## Bisect-apply iter6 patches on lockdep base
Order: 0004 → 0005 → 0006 → 0007-v2. Between each: reboot into lockdep kernel → smoke test → if regression, REVERT only that step, capture splat, document, STOP.
| Step | Patch | Build effort | Smoke test |
|------|-------|--------------|------------|
| 2.1 | 0004 (vb2 core helper + Kconfig) | full kernel rebuild (~45 min — Kconfig + videobuf2-core.c) | Step 1.8b + Step 1.9 |
| 2.2 | 0005 (hantro opt-in) | hantro_vpu module (~3 min) | Step 1.8b + Step 1.9 |
| 2.3 | 0006 (rockchip-rga opt-in) | rockchip-rga module (~3 min) | Step 1.8b + Step 1.9 |
| 2.4 | 0007-v2 (rkvdec per-codec opt-in) | rockchip-vdec module (~3 min) | Step 1.8b + Step 1.9 |
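
Because every bisect step involves a reboot, progress cannot live in a running script. One option is a small state file that records each patch once its smoke test passes. The sketch below is illustrative only: `next_patch` and the state-file path are hypothetical, and the helper merely picks the next patch in the fixed order; it does not apply, build or reboot anything.

```shell
#!/bin/sh
# Hypothetical helper: print the patch that follows the last entry recorded
# in a state file; return 1 when the whole sequence has passed.
# Usage: next_patch <state-file> <patch>...
next_patch() {
    state="$1"
    shift
    last=""
    if [ -s "$state" ]; then
        last=$(tail -n 1 "$state")
    fi
    take=0
    if [ -z "$last" ]; then
        take=1              # nothing recorded yet: start at the first patch
    fi
    for p in "$@"; do
        if [ "$take" -eq 1 ]; then
            echo "$p"
            return 0
        fi
        if [ "$p" = "$last" ]; then
            take=1          # the following patch is the one to apply
        fi
    done
    return 1
}
```

After a passing smoke test the operator would append the patch name, for example `echo 0004 >> ~/iter6-bisect.state`, and after the next boot `next_patch ~/iter6-bisect.state 0004 0005 0006 0007-v2` names the step to continue with. On a regression nothing is appended, so the failing step stays identifiable.
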
## Post-bisect
- All 4 pass on the lockdep kernel: try the same set on the vanilla kernel (rebuild without lockdep). If vanilla also passes, iter6 succeeds; move to iter7 cache-coherency hypothesis test.
- Specific step regresses: capture lockdep splat via serial or pstore, attach to a new kernel-agent issue (#16), revert that step from lockdep kernel, document. If the failing step is 2.1 (0004 core helper), iter6 line is closed — switch iter7 to a non-fence cache-coherency approach.
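
To decide whether a step "regressed", the captured log (a journal export or a file copied out of `/sys/fs/pstore/`) can be scanned for common lockdep and debug headlines. This is a sketch: `has_lockdep_splat` is a hypothetical helper, and the marker list is a selection of well-known report headers, not an exhaustive one.

```shell
#!/bin/sh
# Hypothetical helper: return 0 if any given log file contains one of a few
# well-known lockdep / atomic-sleep report headers, 1 otherwise.
# Usage: has_lockdep_splat <logfile>...
has_lockdep_splat() {
    grep -E -q \
        -e 'WARNING: possible circular locking dependency detected' \
        -e 'WARNING: possible recursive locking detected' \
        -e 'WARNING: inconsistent lock state' \
        -e 'WARNING: suspicious RCU usage' \
        -e 'BUG: sleeping function called from invalid context' \
        "$@"
}
```

Typical use after a smoke test would be `journalctl -k -b 0 > boot.log; has_lockdep_splat boot.log && echo "splat found"`. Pre-existing WARNs that do not match these headers (such as the edp/vblank ones mentioned in step 1.8b) are deliberately not counted.
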
## Risk register (v2 — added H1)
| # | Risk | Mitigation |
|---|------|-----------|
| **H1** | **PRIMARY**: panthor `ww_acquire_ctx` on same resv as vb2's `dma_resv_lock(NULL)` published inside `dma_fence_begin_signalling()` — cross-class lockdep violation invisible without PROVE_LOCKING on RK3588 + panthor. RK3566 + panfrost validation didn't surface this. | LOCKDEP base kernel (step 1.x) is specifically designed to catch this. Step 1.9 + 2.x smoke tests exercise GPU compositor path |
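| R2 | Lockdep-debug kernel doesn't boot at all | Step 1.6 adds the lockdep label alongside the untouched vanilla `arch_devices` label, but makes lockdep the DEFAULT for test boots (mandatory without serial); if that boot fails, the recovery stick handles total-loss |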
| R3 | Lockdep-debug kernel boots but `uname -r` collides with vanilla → modules_install overwrites | `CONFIG_LOCALVERSION="-lockdep"` (A3) makes the suffix distinct; verify before sudo cp |
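| R4 | Watchdog fires before lockdep splat reaches journal | HARD GATE defined in step 0.4-G and enforced at step 1.8a: ramoops capture must be verified before any bisect-apply step; reconsider serial if it fails. Known limitation (0.4c): hard IRQ-off deadlocks may still not reach pstore |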
| R5 | 0007-v2 has a different bug than v1 (e.g. `run.base.bufs.dst` access pattern wrong for some codec) | Phase 5 round 2 reviews v2 source. If reviewer rejects, write v3 informed by feedback |
| R6 | User wants to use ampere during 45-min kernel build → reboot conflict | Per `feedback_no_bulldoze_reboots.md` — ASK before kicking off |
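
R4's mitigation depends on the step 1.8a check actually being run after the forced crash. A minimal sketch of that check follows; `pstore_gate` is a hypothetical helper, and the file-name prefixes are the ones step 1.8a expects.

```shell
#!/bin/sh
# Hypothetical helper for the step 1.8a gate: require at least one
# dmesg-ramoops-* and one console-ramoops-* entry in the pstore directory.
# Usage: pstore_gate [dir]   (defaults to /sys/fs/pstore)
pstore_gate() {
    dir="${1:-/sys/fs/pstore}"
    have_dmesg=0
    have_console=0
    for f in "$dir"/dmesg-ramoops-*; do
        if [ -e "$f" ]; then have_dmesg=1; fi
    done
    for f in "$dir"/console-ramoops-*; do
        if [ -e "$f" ]; then have_console=1; fi
    done
    if [ "$have_dmesg" -eq 1 ] && [ "$have_console" -eq 1 ]; then
        echo "ramoops OK"
        return 0
    fi
    echo "ABORT: ramoops capture incomplete (dmesg=$have_dmesg console=$have_console)"
    return 1
}
```

Called without arguments after the post-crash reboot, a non-zero return means the bisect-apply steps should not start.
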
## Estimated wall-time
| Phase | Time |
|-------|------|
| Pre-flight (0.1-0.5) + serial confirmation | 15-30 min (depends on serial cable hookup) |
| Lockdep base kernel build (1.1-1.9) | 60 min |
| 2.1 (0004 + Kconfig + full kernel rebuild) | 50 min |
| 2.2 (0005 hantro module) | 10 min |
| 2.3 (0006 rga module) | 10 min |
| 2.4 (0007-v2 rkvdec module) | 10 min |
| Total | ~3-3.5h wall-time clean-path |
## Exit conditions
Same as v1. Success / Partial / Failure trees unchanged.