ampere-kernel-decoders/phase4_plan_iter6_postmortem_v2.md
Markus Fritsche a8d563a917 iter6 plan v2: ramoops amendments (user chose ramoops over serial)
User picked ramoops path for the 0.4 hard gate. Current ampere kernel
has CONFIG_PSTORE_RAM=m but lacks PSTORE_CONSOLE, so ramoops can only
be made operational AFTER lockdep kernel rebuild.

4 amendments:
- 0.4: restructured. 0.4a/b survey current state (informational only),
  0.4c notes accepted limitation (hard spinlock+IRQ-off won't flush),
  0.4-G hard gate moves to step 1.8a (after lockdep kernel boots)
- 1.2: add --enable PSTORE_CONSOLE --enable PSTORE_PMSG
- 1.6: extend lockdep extlinux append with ramoops carve-out cmdline
  (memmap=0x100000$0x10000000 ramoops.mem_address=0x10000000
  ramoops.mem_size=0x100000 console_size=0x40000 dump_oops=1).
  DEFAULT override is mandatory per Q3 (ramoops-only operator).
- 1.7/1.8: split into 1.7 (boot+module load), 1.8a (sysrq-trigger
  ramoops verify HARD GATE), 1.8b (regular smoke test)

Documented limitation accepted by user: hard spinlock-with-IRQ-off
deadlocks (the worst-case iter6 v1 wedge shape) may not flush to
pstore before watchdog reset. Serial would catch those; ramoops may
miss. Bisect-apply 0004→0005→0006→0007v2 should surface lockdep
splats BEFORE the deadlock becomes a hard hang anyway.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-16 16:53:14 +00:00

# Phase 4 — iter6 post-mortem retry plan (v2, amended per Phase 5 architect review)
Plan ID: iter6-postmortem-attempt2
Author: Claude Opus 4.7
Reviewer: Sonnet (architect, Phase 5 round 1, returned 5 amendments + 0007 REJECT)
Status: v2 pending Phase 5 round 2 (delta-review) before any execution
Cross-ref: phase0_findings_iter6_postmortem.md (commit 11d2dde), phase4_plan_iter6_postmortem.md (commit 0ef6440 — superseded)
## Goal
Same as v1. Apply 0004 / 0005 / 0006 / 0007-v2 to ampere kernel WITHOUT silent watchdog reset. Either succeed, or fail with a self-diagnostic lockdep splat that survives to logs.
## Amendments vs v1 (architect findings)
| # | Finding | Resolution |
|---|---------|-----------|
| A1 | 0007 v1 placement (in `rkvdec_device_run`) inconsistent with hantro's pattern + metadata-copy ordering wrong | **0007 v2 written**: 7 per-codec `*_run()` insertion points (h264, hevc, vdpu381-h264, vdpu381-hevc, vdpu383-h264, vdpu383-hevc, vp9), each AFTER preamble + BEFORE HW kick. `supports_release_fences = true` stays in queue_init. See `0007-v2-rkvdec-opt-in.patch` (8 hunks, 25 +lines) |
| A2 | Missing risk: panthor `ww_acquire_ctx` contention on same resv as vb2 `dma_resv_lock(NULL)` under `dma_fence_begin_signalling` — primary RK3588 wedge hypothesis | Added to risk register as **Primary Hypothesis H1**. Step 2.1's smoke test extended to include GPU compositor exercise (open kwin, mpv playback with hwdec=vaapi, drm-vblank-trigger pattern), not just headless ffmpeg |
| A3 | Step 1.5 `modules_install` overwrites working modules on `uname -r` collision | Added `CONFIG_LOCALVERSION="-lockdep"` to step 1.2 config block → distinct release suffix `7.0.0-rc3-devices-lockdep+` → separate `/lib/modules/.../`. Original modules untouched |
| A4 | Step 0.4 "consider adding ramoops" too soft: a silent watchdog reset can't be diagnosed without a pre-reset capture path | Step 0.4 reworked as **HARD GATE**. Pre-flight aborts if neither serial console (TTL-USB cable confirmed connected and capturing) NOR ramoops region (verified writable + producing test entry) is functional. User decision input required before pre-flight starts. Outcome: user chose ramoops, so the gate is enforced at step 1.8a after the lockdep kernel boots (see 0.4-G) |
| A5 | `CONFIG_PROVE_RCU` slows boot + may push past watchdog before lockdep prints | Removed from step 1.2 config set. Kept: PROVE_LOCKING, DEBUG_ATOMIC_SLEEP, LOCKDEP, DEBUG_RT_MUTEXES, DEBUG_SPINLOCK, DEBUG_MUTEXES, DEBUG_LOCK_ALLOC, PROVE_RAW_LOCK_NESTING, DEBUG_WW_MUTEX_SLOWPATH. PROVE_RCU added only if first lockdep run passes clean |
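
The option set kept by A5, plus the pstore and LOCALVERSION settings from step 1.2, can be checked against the resulting `.config` after `make olddefconfig`. A minimal sketch, assuming a plain-text `.config`; the function name `check_lockdep_config` is hypothetical and not part of the plan's tooling:

```shell
#!/bin/sh
# check_lockdep_config <path-to-.config>
# Hypothetical helper: prints every expected setting that is absent and
# returns non-zero if at least one is missing.
check_lockdep_config() {
    config_file=$1
    status=0
    # Expected values per A3, A5 and step 1.2 of this plan.
    for want in \
        PROVE_LOCKING=y DEBUG_ATOMIC_SLEEP=y LOCKDEP=y DEBUG_RT_MUTEXES=y \
        DEBUG_SPINLOCK=y DEBUG_MUTEXES=y DEBUG_LOCK_ALLOC=y \
        PROVE_RAW_LOCK_NESTING=y DEBUG_WW_MUTEX_SLOWPATH=y \
        PSTORE_CONSOLE=y PSTORE_PMSG=y PSTORE_RAM=m \
        'LOCALVERSION="-lockdep"'
    do
        # -x: whole-line match, -F: treat the setting as a fixed string.
        if ! grep -qxF "CONFIG_${want}" "$config_file"; then
            echo "missing: CONFIG_${want}"
            status=1
        fi
    done
    return "$status"
}
```

Usage: `check_lockdep_config .config || echo "fix config before step 1.3"`. `make olddefconfig` can silently drop an option whose dependencies are unmet, which is the case this check is meant to surface.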
## Pre-flight (REWRITTEN, hard gates marked)
| Step | Action | Verify | Gate |
|------|--------|--------|------|
| 0.1 | SDDM auto-login disabled | Done — `/etc/sddm.conf.d/autologin.conf.disabled-iter6postmortem` | ✓ |
| 0.2 | Backup `/lib/modules/7.0.0-rc3-devices+/kernel/drivers/media/{common/videobuf2,platform/verisilicon,platform/rockchip/{rga,rkvdec}}/*.ko` as `attempt2-pre-base-<ts>.bkp` AND scp tarball to `boltzmann:/home/mfritsche/iter6-postmortem-backups/` | `ls` shows .bkp + scp returned 0 | **HARD GATE** — abort if backup write fails |
| 0.3 | Backup `/boot/firmware/Image-7.0.0-rc3-devices+` and `initramfs-7.0.0-rc3-devices+` as `*.pre-attempt2.bkp` | `ls` | **HARD GATE** |
| 0.4a | Probe current kernel: `CONFIG_PSTORE_RAM=m` confirmed on ampere; `CONFIG_PSTORE_CONSOLE` currently `n` (will be enabled in step 1.2). `/sys/fs/pstore/` exists + empty | survey only | informational |
| 0.4b | Memory layout survey: `cat /proc/iomem`. Plan ramoops carve-out at `0x10000000` (1 MB inside the normal-RAM gap between `0x4500000` and `0xd7a00000`) | iomem confirms region is "System RAM" not pre-reserved | informational |
| 0.4c | **NOTE**: User chose ramoops over serial, accepted documented limitation that hard spinlock-with-IRQ-off deadlocks may not flush to pstore before watchdog reset. Serial would capture those; ramoops may not | acknowledged | informational |
| 0.4-G | **HARD GATE moved to step 1.8a**: ramoops verification is only possible after the lockdep kernel boots with PSTORE_CONSOLE=y. Pre-flight does NOT abort here. Step 1.8a runs `echo c > /proc/sysrq-trigger` to force a crash, reboots, and verifies `/sys/fs/pstore/` is non-empty; if empty, ABORT before bisect-apply 2.x and reconsider serial | step 1.8a enforces | mandatory at 1.8a |
| 0.5 | Verify ampere `~/src/linux-rockchip` working tree state: iter3+iter4+diag patches present, iter6 v1 reverted (per recovery), `git status` shows expected files modified | git status output matches | informational |
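
The 0.4b survey (is the planned carve-out fully inside a "System RAM" range?) can be scripted instead of eyeballed. A minimal sketch, assuming the usual `start-end : name` layout of `/proc/iomem`; the function name `region_in_system_ram` is hypothetical. It only looks at top-level "System RAM" lines and does not inspect indented child entries (kernel image, reserved ranges), so a pass here still needs a manual look at the children:

```shell
#!/bin/sh
# region_in_system_ram <iomem-file> <start-hex> <size-hex>
# Hypothetical helper: succeeds if [start, start+size) lies inside one
# top-level "System RAM" range of the given iomem listing.
region_in_system_ram() {
    iomem_file=$1
    want_start=$(( $2 ))
    want_end=$(( $2 + $3 - 1 ))
    # Extract "start end" pairs (hex, no 0x prefix) of top-level RAM ranges.
    sed -n 's/^\([0-9a-f]\{1,\}\)-\([0-9a-f]\{1,\}\) : System RAM$/\1 \2/p' \
        "$iomem_file" | (
        while read -r start end; do
            if [ "$want_start" -ge $(( 0x$start )) ] &&
               [ "$want_end" -le $(( 0x$end )) ]; then
                exit 0   # leaves only this subshell
            fi
        done
        exit 1
    )
}
```

Usage: `sudo cat /proc/iomem > /tmp/iomem.txt; region_in_system_ram /tmp/iomem.txt 0x10000000 0x100000 && echo "carve-out is in System RAM"`. Reading `/proc/iomem` as root matters because unprivileged reads show zeroed addresses.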
## Build the lockdep debug base kernel (~45 min one-time)
| Step | Action |
|------|--------|
| 1.1 | `cp .config .config.pre-iter6postmortem` |
| 1.2 | `./scripts/config --enable PROVE_LOCKING --enable DEBUG_ATOMIC_SLEEP --enable LOCKDEP --enable DEBUG_RT_MUTEXES --enable DEBUG_SPINLOCK --enable DEBUG_MUTEXES --enable DEBUG_LOCK_ALLOC --enable PROVE_RAW_LOCK_NESTING --enable DEBUG_WW_MUTEX_SLOWPATH` (NO PROVE_RCU per A5). **Ramoops/pstore**: `./scripts/config --enable PSTORE_CONSOLE --enable PSTORE_PMSG --module PSTORE_RAM` (PSTORE_RAM stays =m; PSTORE_CONSOLE=y captures kernel printk to the ramoops region). Set `CONFIG_LOCALVERSION="-lockdep"` (A3), e.g. `./scripts/config --set-str LOCALVERSION "-lockdep"`. Then `make olddefconfig` |
| 1.3 | `time make -j8 Image modules dtbs` (~45 min) |
| 1.4 | `sudo make modules_install` with no `INSTALL_MOD_PATH` override. The default target respects LOCALVERSION and installs to `/lib/modules/7.0.0-rc3-devices-lockdep+/`, separate from the working modules. Verify the destination release (`make -s kernelrelease`) before running under sudo |
| 1.5 | `sudo cp arch/arm64/boot/Image /boot/firmware/Image-7.0.0-rc3-devices-lockdep+`. Generate initramfs for new release: `sudo mkinitcpio -k 7.0.0-rc3-devices-lockdep+ -g /boot/firmware/initramfs-7.0.0-rc3-devices-lockdep+` |
| 1.6 | Backup extlinux.conf, then `sudo` edit: ADD new label `arch_devices_lockdep` pointing at the new Image + initrd. Since we're ramoops-only (no serial), per round 2 Q3 the `DEFAULT arch_devices_lockdep` override is MANDATORY for this test boot. Restore `DEFAULT arch_devices` before any subsequent reboot where lockdep boot is not desired. **Ramoops cmdline (append)**: extend the lockdep label's `append` line with `memmap=0x100000$0x10000000 ramoops.mem_address=0x10000000 ramoops.mem_size=0x100000 ramoops.record_size=0x10000 ramoops.console_size=0x40000 ramoops.dump_oops=1`. Carves 1 MB at 256 MB physical → 256 KB console buffer + 11 × 64 KB oops records |
| 1.7 | Reboot. Since DEFAULT was set to `arch_devices_lockdep` in step 1.6, U-Boot loads the lockdep kernel automatically. Verify `uname -r` = `7.0.0-rc3-devices-lockdep+`. Verify the journal has `Lockdep is enabled`. `sudo modprobe ramoops` (CONFIG_PSTORE_RAM builds `ramoops.ko`, matching the `ramoops.` cmdline prefix in step 1.6), or verify it auto-loaded. Check `ls /sys/fs/pstore/` after triggering a benign console write to confirm the ramoops region is accessible; the directory may stay empty until 1.8a, which is the actual gate |
| 1.8a | Ramoops verify (HARD GATE per 0.4-G): `sync; sudo bash -c "echo c > /proc/sysrq-trigger"`. Ampere panics → reboots (still on lockdep default). After ssh comes back, check `ls /sys/fs/pstore/` — must contain at least one `dmesg-ramoops-*` and one `console-ramoops-*` file. If empty, ABORT — ramoops doesn't capture; reconsider serial cable. Backup pstore contents to `/home/mfritsche/iter6-postmortem-ramoops-verify/` then `sudo rm /sys/fs/pstore/*` to clear for actual tests |
| 1.8b | Smoke test: ffmpeg HEVC decode (iter5 baseline), check `journalctl -k -p warning -b 0` for any new lockdep splats produced by iter3+4 alone. Expectation: clean (or only pre-existing edp/vblank WARNs) |
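
The "11 × 64 KB" record count in step 1.6 follows from the carve-out arithmetic. A minimal sketch of that calculation, assuming the driver reserves the console, ftrace and pmsg zones first and splits the remainder into `record_size` chunks, and assuming the ftrace and pmsg zones fall back to 4096 bytes each when not given on the cmdline (both are my reading of the ramoops driver, not something this plan states). The function name is hypothetical:

```shell
#!/bin/sh
# ramoops_dump_records <mem_size> <console_size> <record_size> [ftrace_size] [pmsg_size]
# Hypothetical helper: prints how many oops records fit in the carve-out.
ramoops_dump_records() {
    mem_size=$(( $1 ))
    console_size=$(( $2 ))
    record_size=$(( $3 ))
    # Assumption: unset ftrace/pmsg zones default to 4096 bytes each.
    ftrace_size=$(( ${4:-4096} ))
    pmsg_size=$(( ${5:-4096} ))
    dump_area=$(( mem_size - console_size - ftrace_size - pmsg_size ))
    # Integer division: a partial trailing record is not usable.
    echo $(( dump_area / record_size ))
}
```

Usage: `ramoops_dump_records 0x100000 0x40000 0x10000` prints `11` under these assumptions. The two small zones are what push the count from 12 down to 11.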
**1.9 — A2 GPU smoke test**: open SDDM login → log in to plasma wayland → open mpv with `--hwdec=vaapi-copy ~/measurements/encoded/bbb_60s_720p.hevc.mp4`. Watch ~30 seconds. Check journal again. Reason: A2 hypothesis is panthor + kwin + V4L2 dmabuf contention, which only surfaces under active GPU composition. If iter3+4 alone (no fence helper yet) emits a lockdep splat with GPU active, the bug is even more upstream than iter6 patches — STOP and investigate.
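
The "check journal" part of 1.8b and 1.9 can be reduced to a grep over saved kernel log output. A minimal sketch; the function name `scan_lockdep` is hypothetical, and the pattern list covers the lockdep and atomic-sleep report headers I know of, so it is a starting point rather than an exhaustive filter:

```shell
#!/bin/sh
# scan_lockdep <log-file>
# Hypothetical helper: prints matching report headers and returns 0 if any
# is present, 1 if the log is clean with respect to these patterns.
scan_lockdep() {
    grep -E \
        -e 'possible circular locking dependency detected' \
        -e 'possible recursive locking detected' \
        -e 'possible irq lock inversion dependency detected' \
        -e 'inconsistent lock state' \
        -e 'Invalid wait context' \
        -e 'suspicious RCU usage' \
        -e 'sleeping function called from invalid context' \
        "$1"
}
```

Usage: `journalctl -k -b 0 > /tmp/kern.log; scan_lockdep /tmp/kern.log && echo "STOP: splat found"`. Pre-existing edp/vblank WARNs do not match these patterns, so they will not trip the check.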
## Bisect-apply iter6 patches on lockdep base
Order: 0004 → 0005 → 0006 → 0007-v2. Between each: reboot into lockdep kernel → smoke test → if regression, REVERT only that step, capture splat, document, STOP.
| Step | Patch | Build effort | Smoke test |
|------|-------|--------------|------------|
| 2.1 | 0004 (vb2 core helper + Kconfig) | full kernel rebuild (~45 min, Kconfig + videobuf2-core.c) | Step 1.8b + Step 1.9 |
| 2.2 | 0005 (hantro opt-in) | hantro_vpu module (~3 min) | Step 1.8b + Step 1.9 |
| 2.3 | 0006 (rockchip-rga opt-in) | rockchip-rga module (~3 min) | Step 1.8b + Step 1.9 |
| 2.4 | 0007-v2 (rkvdec per-codec opt-in) | rockchip-vdec module (~3 min) | Step 1.8b + Step 1.9 |
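
Because every bisect step crosses a reboot, the position in the 0004 → 0005 → 0006 → 0007-v2 order has to live on disk. A minimal sketch of such bookkeeping; the state file and both function names are hypothetical, not existing tooling:

```shell
#!/bin/sh
# Apply order from this plan.
BISECT_ORDER="0004 0005 0006 0007-v2"

# next_patch <state-file>
# Prints the first patch in BISECT_ORDER not yet recorded as passed;
# returns 1 once every patch has passed.
next_patch() {
    state_file=$1
    for p in $BISECT_ORDER; do
        if ! grep -qxF -e "$p" "$state_file" 2>/dev/null; then
            echo "$p"
            return 0
        fi
    done
    return 1
}

# mark_passed <state-file> <patch>
# Call only after the smoke tests for that patch came back clean.
mark_passed() {
    echo "$2" >> "$1"
}
```

Usage: after a clean smoke test for 0004, `mark_passed ~/bisect.state 0004`; after the next reboot, `next_patch ~/bisect.state` prints `0005`. A regression means the patch is never marked, so the same step is reported again until it is reverted and documented.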
## Post-bisect
- All 4 pass on the lockdep kernel: try the same set on the vanilla kernel (rebuild without lockdep). If vanilla also passes, iter6 succeeds; move to iter7 cache-coherency hypothesis test.
- Specific step regresses: capture lockdep splat via serial or pstore, attach to a new kernel-agent issue (#16), revert that step from lockdep kernel, document. If the failing step is 2.1 (0004 core helper), iter6 line is closed — switch iter7 to a non-fence cache-coherency approach.
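
Capturing via pstore means copying the records out of `/sys/fs/pstore/` before they are cleared. A minimal sketch that applies the same presence check as step 1.8a (at least one `dmesg-ramoops-*` and one `console-ramoops-*`) and then archives; the function name `pstore_archive` is hypothetical, and it deliberately does not delete anything:

```shell
#!/bin/sh
# pstore_archive <pstore-dir> <dest-dir>
# Hypothetical helper: returns 1 (the ABORT case) if either record kind is
# missing, otherwise copies all ramoops records into <dest-dir>.
pstore_archive() {
    src=$1
    dest=$2
    for kind in dmesg console; do
        # With no match the glob stays literal and ls fails.
        if ! ls "$src"/"$kind"-ramoops-* >/dev/null 2>&1; then
            echo "ABORT: no ${kind}-ramoops-* record in $src" >&2
            return 1
        fi
    done
    mkdir -p "$dest" && cp "$src"/*-ramoops-* "$dest"/
}
```

Usage: `sudo sh -c '. ./pstore_archive.sh; pstore_archive /sys/fs/pstore /home/mfritsche/iter6-postmortem-ramoops-verify'` (the script file name is an assumption). Clear `/sys/fs/pstore/` only after confirming the copies are readable.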
## Risk register (v2 — added H1)
| # | Risk | Mitigation |
|---|------|-----------|
| **H1** | **PRIMARY**: panthor `ww_acquire_ctx` on same resv as vb2's `dma_resv_lock(NULL)` published inside `dma_fence_begin_signalling()` — cross-class lockdep violation invisible without PROVE_LOCKING on RK3588 + panthor. RK3566 + panfrost validation didn't surface this. | LOCKDEP base kernel (step 1.x) is specifically designed to catch this. Step 1.9 + 2.x smoke tests exercise GPU compositor path |
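| R2 | Lockdep-debug kernel doesn't boot at all | Step 1.6 now sets `DEFAULT arch_devices_lockdep` (mandatory per Q3), so vanilla is no longer the automatic fallback. The `arch_devices` label and the step 0.3 backups stay in place; recovery stick handles total-loss |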
| R3 | Lockdep-debug kernel boots but `uname -r` collides with vanilla → modules_install overwrites | `CONFIG_LOCALVERSION="-lockdep"` (A3) makes the suffix distinct; verify before sudo cp |
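| R4 | Watchdog fires before lockdep splat reaches journal | Step 1.8a HARD GATE (moved from step 0.4): ramoops must capture a forced crash before any 2.x step. Residual risk per 0.4c: hard spinlock-with-IRQ-off deadlocks may still not flush to pstore |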
| R5 | 0007-v2 has a different bug than v1 (e.g. `run.base.bufs.dst` access pattern wrong for some codec) | Phase 5 round 2 reviews v2 source. If reviewer rejects, write v3 informed by feedback |
| R6 | User wants to use ampere during 45-min kernel build → reboot conflict | Per `feedback_no_bulldoze_reboots.md` — ASK before kicking off |
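
Step 1.6 requires restoring `DEFAULT arch_devices` before any reboot where the lockdep kernel is not wanted, and with no serial console a wrong default is only noticed after the fact. A minimal sketch of a pre-reboot check; the function name is hypothetical and the extlinux.conf location is left to the caller because this plan does not state it:

```shell
#!/bin/sh
# extlinux_default <extlinux.conf>
# Hypothetical helper: prints the label named by the first DEFAULT line
# (keyword matched case-insensitively); returns 1 if there is none.
extlinux_default() {
    awk 'toupper($1) == "DEFAULT" { print $2; found = 1; exit }
         END { exit !found }' "$1"
}
```

Usage: `[ "$(extlinux_default "$EXTLINUX_CONF")" = arch_devices ] || echo "DEFAULT still points at lockdep"`, where `$EXTLINUX_CONF` is a placeholder for the real path on ampere.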
## Estimated wall-time
| Phase | Time |
|-------|------|
| Pre-flight (0.1-0.5) | 15-30 min (upper bound assumed serial cable hookup, which the ramoops path skips) |
| Lockdep base kernel build (1.1-1.9) | 60 min |
| 2.1 (0004 + Kconfig + full kernel rebuild) | 50 min |
| 2.2 (0005 hantro module) | 10 min |
| 2.3 (0006 rga module) | 10 min |
| 2.4 (0007-v2 rkvdec module) | 10 min |
| Total | ~3-3.5h wall-time clean-path |
## Exit conditions
Same as v1. Success / Partial / Failure trees unchanged.