# Phase 4 — iter6 post-mortem retry plan (v2, amended per Phase 5 architect review)

Plan ID: iter6-postmortem-attempt2
Author: Claude Opus 4.7
Reviewer: Sonnet (architect, Phase 5 round 1, returned 5 amendments + 0007 REJECT)
Status: v2 pending Phase 5 round 2 (delta-review) before any execution
Cross-ref: phase0_findings_iter6_postmortem.md (commit 11d2dde), phase4_plan_iter6_postmortem.md (commit 0ef6440, superseded)

## Goal

Same as v1: apply 0004 / 0005 / 0006 / 0007-v2 to the ampere kernel WITHOUT a silent watchdog reset. Either succeed, or fail with a self-diagnostic lockdep splat that survives to the logs.

## Amendments vs v1 (architect findings)

| # | Finding | Resolution |
|---|---------|-----------|
| A1 | 0007 v1 placement (in `rkvdec_device_run`) is inconsistent with hantro's pattern, and the metadata-copy ordering is wrong | **0007 v2 written**: 7 per-codec `*_run()` insertion points (h264, hevc, vdpu381-h264, vdpu381-hevc, vdpu383-h264, vdpu383-hevc, vp9), each AFTER the preamble and BEFORE the HW kick. `supports_release_fences = true` stays in queue_init. See `0007-v2-rkvdec-opt-in.patch` (8 hunks, 25 added lines) |
| A2 | Missing risk: panthor `ww_acquire_ctx` contention on the same resv as vb2 `dma_resv_lock(NULL)` under `dma_fence_begin_signalling`; this is the primary RK3588 wedge hypothesis | Added to risk register as **Primary Hypothesis H1**. Step 2.1's smoke test extended to include a GPU compositor exercise (open kwin, mpv playback with hwdec=vaapi, drm-vblank-trigger pattern), not just headless ffmpeg |
| A3 | Step 1.5 `modules_install` overwrites working modules on `uname -r` collision | Added `CONFIG_LOCALVERSION="-lockdep"` to the step 1.2 config block → distinct release suffix `7.0.0-rc3-devices-lockdep+` → separate `/lib/modules/.../`. Original modules untouched |
| A4 | Step 0.4 "consider adding ramoops" is too soft: a silent watchdog reset can't be diagnosed without a pre-reset capture path | Step 0.4 reworked as **HARD GATE**. Pre-flight aborts if neither the serial console (TTL-USB cable confirmed connected and capturing) NOR the ramoops region (verified writable + producing a test entry) is functional. User decision input required before pre-flight starts |
| A5 | `CONFIG_PROVE_RCU` slows boot and may push it past the watchdog before lockdep prints | Removal attempted, but **PROVE_RCU is selected by PROVE_LOCKING via Kconfig and can't be disabled without disabling PROVE_LOCKING itself** (which would defeat the purpose). Execution deviation logged: PROVE_RCU=y remains. Mitigation: the 10-min HW watchdog gives plenty of margin even with PROVE_RCU enabled. Kept: PROVE_LOCKING, DEBUG_ATOMIC_SLEEP, LOCKDEP, DEBUG_RT_MUTEXES, DEBUG_SPINLOCK, DEBUG_MUTEXES, DEBUG_LOCK_ALLOC, PROVE_RAW_LOCK_NESTING, DEBUG_WW_MUTEX_SLOWPATH, PSTORE_CONSOLE, PSTORE_PMSG, PSTORE_RAM=m |

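
The A3 and A5 resolutions can be checked mechanically once `make olddefconfig` has run. The following is a minimal sketch, assuming the usual `.config` line format (`CONFIG_FOO=y`, or `# CONFIG_FOO is not set`). The function names `config_state` and `check_lockdep_config` are hypothetical helpers, not part of the kernel tree.

```shell
#!/bin/sh
# Hypothetical helper: print the state of one Kconfig symbol in a .config file.
# Prints y or m; prints n for "is not set", absent symbols, and other values.
config_state() (
    line=$(grep -E "^CONFIG_$2=" "$1" | tail -n 1)
    case "${line#*=}" in
        y) echo y ;;
        m) echo m ;;
        *) echo n ;;
    esac
)

# Hypothetical helper: verify the symbol set listed under A5 plus the A3 suffix.
# Exits non-zero and names every mismatch on stderr.
check_lockdep_config() (
    file=$1
    rc=0
    for sym in PROVE_LOCKING PROVE_RCU DEBUG_ATOMIC_SLEEP LOCKDEP \
               DEBUG_RT_MUTEXES DEBUG_SPINLOCK DEBUG_MUTEXES DEBUG_LOCK_ALLOC \
               PROVE_RAW_LOCK_NESTING DEBUG_WW_MUTEX_SLOWPATH \
               PSTORE_CONSOLE PSTORE_PMSG; do
        if [ "$(config_state "$file" "$sym")" != y ]; then
            echo "MISMATCH: CONFIG_${sym} is not =y" >&2
            rc=1
        fi
    done
    if [ "$(config_state "$file" PSTORE_RAM)" != m ]; then
        echo "MISMATCH: CONFIG_PSTORE_RAM is not =m" >&2
        rc=1
    fi
    if ! grep -q '^CONFIG_LOCALVERSION="-lockdep"$' "$file"; then
        echo "MISMATCH: CONFIG_LOCALVERSION is not \"-lockdep\"" >&2
        rc=1
    fi
    exit "$rc"
)
```

A possible use in the kernel tree is `check_lockdep_config .config && echo "config OK"`, run before the long build in step 1.3 so a dropped symbol costs seconds rather than 45 minutes.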
## Pre-flight (REWRITTEN, hard gates marked)

| Step | Action | Verify | Gate |
|------|--------|--------|------|
| 0.1 | SDDM auto-login disabled | Done: `/etc/sddm.conf.d/autologin.conf.disabled-iter6postmortem` | ✓ |
| 0.2 | Backup `/lib/modules/7.0.0-rc3-devices+/kernel/drivers/media/{common/videobuf2,platform/verisilicon,platform/rockchip/{rga,rkvdec}}/*.ko` as `attempt2-pre-base-<ts>.bkp` AND scp a tarball to `boltzmann:/home/mfritsche/iter6-postmortem-backups/` | `ls` shows .bkp + scp returned 0 | **HARD GATE**: abort if backup write fails |
| 0.3 | Backup `/boot/firmware/Image-7.0.0-rc3-devices+` and `initramfs-7.0.0-rc3-devices+` as `*.pre-attempt2.bkp` | `ls` | **HARD GATE** |
| 0.4a | Probe current kernel: `CONFIG_PSTORE_RAM=m` confirmed on ampere; `CONFIG_PSTORE_CONSOLE` currently `n` (will be enabled in step 1.2). `/sys/fs/pstore/` exists + empty | survey only | informational |
| 0.4b | Memory layout survey: `cat /proc/iomem`. Plan ramoops carve-out at `0x10000000` (1 MB inside the normal-RAM gap between `0x4500000` and `0xd7a00000`) | iomem confirms region is "System RAM", not pre-reserved | informational |
| 0.4c | **NOTE**: User chose ramoops over serial and accepted the documented limitation that hard spinlock-with-IRQ-off deadlocks may not flush to pstore before watchdog reset. Serial would capture those; ramoops may not | acknowledged | informational |
| 0.4-G | **HARD GATE moved to step 1.8a**: ramoops verification is only possible after the lockdep kernel boots with PSTORE_CONSOLE=y. Pre-flight does NOT abort here. Step 1.8a runs `echo c > /proc/sysrq-trigger` to force a crash, reboots, and verifies `/sys/fs/pstore/` is non-empty; if empty, ABORT before bisect-apply 2.x and reconsider serial | step 1.8a enforces | mandatory at 1.8a |
| 0.5 | Verify ampere `~/src/linux-rockchip` working tree state: iter3+iter4+diag patches present, iter6 v1 reverted (per recovery), `git status` shows expected files modified | git status output matches | informational |

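
The step 0.4b survey ("System RAM, not pre-reserved") can be turned into a repeatable check. The sketch below assumes the `/proc/iomem` layout of unindented top-level ranges (`start-end : name`) with indented child ranges beneath them, and a dump taken as root (an unprivileged read shows zeroed addresses). `region_in_system_ram` is a hypothetical name.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if [start, start+size) lies inside one
# top-level "System RAM" range and no indented child range overlaps it.
# Usage: region_in_system_ram <iomem-dump> <start> <size>
region_in_system_ram() (
    file=$1
    start=$(( $2 ))
    end=$(( start + $3 - 1 ))
    found=1
    while IFS= read -r line; do
        trimmed=${line#"${line%%[! ]*}"}      # strip leading spaces
        case "$trimmed" in
            *-*" : "*) ;;
            *) continue ;;                    # skip anything not "a-b : name"
        esac
        range=${trimmed%% :*}
        name=${trimmed#*" : "}
        lo=$(( 0x${range%-*} ))
        hi=$(( 0x${range#*-} ))
        case "$line" in
            " "*)
                # Child entry (kernel image, reserved, ...): must not overlap.
                if [ "$lo" -le "$end" ] && [ "$hi" -ge "$start" ]; then
                    exit 1
                fi
                ;;
            *)
                # Top-level entry: needs to be System RAM and fully contain us.
                if [ "$name" = "System RAM" ] &&
                   [ "$start" -ge "$lo" ] && [ "$end" -le "$hi" ]; then
                    found=0
                fi
                ;;
        esac
    done < "$file"
    exit "$found"
)
```

For the planned carve-out this might be used as `sudo cat /proc/iomem > iomem.txt && region_in_system_ram iomem.txt 0x10000000 0x100000`. The check is deliberately conservative: any overlapping child entry rejects the region, whatever its name.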
## Build the lockdep debug base kernel (~45 min one-time)

| Step | Action |
|------|--------|
| 1.1 | `cp .config .config.pre-iter6postmortem` |
| 1.2 | `./scripts/config --enable PROVE_LOCKING --enable DEBUG_ATOMIC_SLEEP --enable LOCKDEP --enable DEBUG_RT_MUTEXES --enable DEBUG_SPINLOCK --enable DEBUG_MUTEXES --enable DEBUG_LOCK_ALLOC --enable PROVE_RAW_LOCK_NESTING --enable DEBUG_WW_MUTEX_SLOWPATH` (PROVE_RCU is not passed explicitly; per A5 it stays `=y` because PROVE_LOCKING selects it). **Ramoops/pstore**: `./scripts/config --enable PSTORE_CONSOLE --enable PSTORE_PMSG --module PSTORE_RAM` (PSTORE_RAM stays =m; PSTORE_CONSOLE=y captures kernel printk to the ramoops region). Set `CONFIG_LOCALVERSION="-lockdep"` (A3). `make olddefconfig` |
| 1.3 | `time make -j8 Image modules dtbs` (~45 min) |
| 1.4 | `make modules_install` with the kernel's default destination (no `INSTALL_MOD_PATH`). This respects LOCALVERSION and installs to `/lib/modules/7.0.0-rc3-devices-lockdep+/`, separate from the working kernel's modules. Verify the destination path before running under sudo |
| 1.5 | `sudo cp arch/arm64/boot/Image /boot/firmware/Image-7.0.0-rc3-devices-lockdep+`. Generate initramfs for the new release: `sudo mkinitcpio -k 7.0.0-rc3-devices-lockdep+ -g /boot/firmware/initramfs-7.0.0-rc3-devices-lockdep+` |
| 1.6 | Backup extlinux.conf, then `sudo` edit: ADD a new label `arch_devices_lockdep` pointing at the new Image + initrd. Since this run is ramoops-only (no serial), per round 2 Q3 the `DEFAULT arch_devices_lockdep` override is MANDATORY for this test boot. Restore `DEFAULT arch_devices` before any subsequent reboot where a lockdep boot is not desired. **Ramoops cmdline (append)**: extend the lockdep label's `append` line with `memmap=0x100000$0x10000000 ramoops.mem_address=0x10000000 ramoops.mem_size=0x100000 ramoops.record_size=0x10000 ramoops.console_size=0x40000 ramoops.dump_oops=1`. Carves 1 MB at 256 MB physical → 256 KB console buffer + 11 × 64 KB oops records |
| 1.7 | Reboot. Since DEFAULT was set to `arch_devices_lockdep` in step 1.6, U-Boot loads the lockdep kernel automatically. Verify `uname -r` = `7.0.0-rc3-devices-lockdep+`. Verify the journal has `Lockdep is enabled`. `sudo modprobe pstore_ram` (or verify it auto-loaded). Verify `ls /sys/fs/pstore/` after triggering a benign console write to confirm the ramoops region is accessible |
| 1.8a | Ramoops verify (HARD GATE per 0.4-G): `sync; sudo bash -c "echo c > /proc/sysrq-trigger"`. Ampere panics → reboots (still on lockdep default). After ssh comes back, check `ls /sys/fs/pstore/`: it must contain at least one `dmesg-ramoops-*` and one `console-ramoops-*` file. If empty, ABORT: ramoops doesn't capture; reconsider serial cable. Backup pstore contents to `/home/mfritsche/iter6-postmortem-ramoops-verify/`, then `sudo rm /sys/fs/pstore/*` to clear for the actual tests |
| 1.8b | Smoke test: ffmpeg HEVC decode (iter5 baseline), check `journalctl -k -p warning -b 0` for any new lockdep splats produced by iter3+4 alone. Expectation: clean (or only pre-existing edp/vblank WARNs) |

**1.9: A2 GPU smoke test**: open SDDM login → log in to plasma wayland → open mpv with `--hwdec=vaapi-copy ~/measurements/encoded/bbb_60s_720p.hevc.mp4`. Watch ~30 seconds. Check the journal again. Reason: the A2 hypothesis is panthor + kwin + V4L2 dmabuf contention, which only surfaces under active GPU composition. If iter3+4 alone (no fence helper yet) emits a lockdep splat with the GPU active, the bug is even further upstream than the iter6 patches: STOP and investigate.

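
The "256 KB console buffer + 11 × 64 KB oops records" split in step 1.6 can be reproduced with a few lines of arithmetic. This is a sketch under stated assumptions, not a statement of what the driver does on this board. It assumes that `ramoops.ftrace_size` and `ramoops.pmsg_size`, which the plan does not set, default to 4 KiB each, and that whatever remains after the console, ftrace and pmsg zones is divided into whole `record_size` slots. `ramoops_record_count` is a hypothetical name.

```shell
#!/bin/sh
# Hypothetical helper: estimate how many oops records fit in a ramoops region.
# Usage: ramoops_record_count <mem_size> <record_size> <console_size> \
#                             [ftrace_size] [pmsg_size]
# ASSUMPTION: ftrace_size and pmsg_size default to 0x1000 (4 KiB) when omitted.
ramoops_record_count() (
    mem_size=$(( $1 ))
    record_size=$(( $2 ))
    console_size=$(( $3 ))
    ftrace_size=$(( ${4:-0x1000} ))
    pmsg_size=$(( ${5:-0x1000} ))
    # Space left for oops/panic dumps after the fixed-purpose zones.
    dump_area=$(( mem_size - console_size - ftrace_size - pmsg_size ))
    # Only whole records count; the remainder is unused.
    echo $(( dump_area / record_size ))
)

# Values from the step 1.6 cmdline.
ramoops_record_count 0x100000 0x10000 0x40000   # prints 11
```

Under these assumptions the two 4 KiB zones are what reduce the count from 12 to 11: 1 MiB minus 256 KiB leaves exactly twelve 64 KiB slots, and the extra 8 KiB spills into the twelfth.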
## Bisect-apply iter6 patches on lockdep base

Order: 0004 → 0005 → 0006 → 0007-v2. Between each: reboot into the lockdep kernel → smoke test → if regression, REVERT only that step, capture the splat, document, STOP.

| Step | Patch | Build effort | Smoke test |
|------|-------|--------------|------------|
| 2.1 | 0004 (vb2 core helper + Kconfig) | full kernel rebuild (~45 min; touches Kconfig + videobuf2-core.c) | Step 1.8b + Step 1.9 |
| 2.2 | 0005 (hantro opt-in) | hantro_vpu module (~3 min) | Step 1.8b + Step 1.9 |
| 2.3 | 0006 (rockchip-rga opt-in) | rockchip-rga module (~3 min) | Step 1.8b + Step 1.9 |
| 2.4 | 0007-v2 (rkvdec per-codec opt-in) | rockchip-vdec module (~3 min) | Step 1.8b + Step 1.9 |

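
The "smoke test → if regression" decision between bisect steps is easier to apply consistently with a fixed definition of regression. One possible sketch: it assumes the kernel log of the current boot has been saved to a file (for example `journalctl -k -b 0 > kernel.log`) and treats any of the common lockdep or atomic-sleep report headers as a regression. The function name `scan_for_splats` is hypothetical and the pattern list is illustrative, not exhaustive.

```shell
#!/bin/sh
# Hypothetical helper: print matching lines (with line numbers) and succeed
# if the saved kernel log contains a lockdep-style report header.
# Exit status follows grep: 0 = splat found, 1 = clean log.
scan_for_splats() (
    grep -n -E \
        -e 'possible circular locking dependency detected' \
        -e 'possible recursive locking detected' \
        -e 'inconsistent lock state' \
        -e 'suspicious RCU usage' \
        -e 'Invalid wait context' \
        -e 'BUG: sleeping function called from invalid context' \
        "$1"
)
```

A bisect step could then end with `if scan_for_splats kernel.log; then echo "REGRESSION: revert this step and STOP"; fi`. Pre-existing edp/vblank WARNs do not match these patterns, which fits the expectation stated for the baseline smoke test.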
## Post-bisect

- All 4 pass on the lockdep kernel: try the same set on the vanilla kernel (rebuild without lockdep). If vanilla also passes, iter6 succeeds; move to the iter7 cache-coherency hypothesis test.
- A specific step regresses: capture the lockdep splat via serial or pstore, attach it to a new kernel-agent issue (#16), revert that step from the lockdep kernel, document. If the failing step is 2.1 (0004 core helper), the iter6 line is closed; switch iter7 to a non-fence cache-coherency approach.

## Risk register (v2 — added H1)

| # | Risk | Mitigation |
|---|------|-----------|
| **H1** | **PRIMARY**: panthor `ww_acquire_ctx` on the same resv as vb2's `dma_resv_lock(NULL)` published inside `dma_fence_begin_signalling()`: a cross-class lockdep violation invisible without PROVE_LOCKING on RK3588 + panthor. RK3566 + panfrost validation didn't surface this. | LOCKDEP base kernel (step 1.x) is specifically designed to catch this. Step 1.9 + 2.x smoke tests exercise the GPU compositor path |
| R2 | Lockdep-debug kernel doesn't boot at all | Step 1.6 adds the lockdep label alongside the vanilla `arch_devices` label and backs up extlinux.conf; DEFAULT points at lockdep only for the test boots, so the vanilla label remains available via manual menu selection; recovery stick handles total-loss |
| R3 | Lockdep-debug kernel boots but `uname -r` collides with vanilla → modules_install overwrites | `CONFIG_LOCALVERSION="-lockdep"` (A3) makes the suffix distinct; verify before sudo cp |
| R4 | Watchdog fires before lockdep splat reaches journal | HARD GATE (moved from step 0.4 to step 1.8a per 0.4-G): serial OR ramoops for the pre-reset capture path |
| R5 | 0007-v2 has a different bug than v1 (e.g. `run.base.bufs.dst` access pattern wrong for some codec) | Phase 5 round 2 reviews the v2 source. If the reviewer rejects, write v3 informed by feedback |
| R6 | User wants to use ampere during the 45-min kernel build → reboot conflict | Per `feedback_no_bulldoze_reboots.md`: ASK before kicking off |

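
R4's mitigation only holds if the step 1.8a gate is actually enforced after the forced crash. A minimal sketch of that check follows. It assumes pstore is mounted at `/sys/fs/pstore` and that the ramoops backend produces the `dmesg-ramoops-*` and `console-ramoops-*` files named in step 1.8a. `pstore_gate` is a hypothetical name, and the directory is a parameter so the check can also run against a backed-up copy.

```shell
#!/bin/sh
# Hypothetical helper: pass only if both a dmesg and a console ramoops record
# are present in the given pstore directory (default: /sys/fs/pstore).
pstore_gate() (
    dir=${1:-/sys/fs/pstore}
    have_dmesg=no
    have_console=no
    for f in "$dir"/dmesg-ramoops-* "$dir"/console-ramoops-*; do
        [ -e "$f" ] || continue       # an unmatched glob stays literal; skip it
        case "${f##*/}" in
            dmesg-ramoops-*)   have_dmesg=yes ;;
            console-ramoops-*) have_console=yes ;;
        esac
    done
    if [ "$have_dmesg" = yes ] && [ "$have_console" = yes ]; then
        echo "pstore gate: PASS"
        exit 0
    fi
    echo "pstore gate: ABORT (dmesg=$have_dmesg console=$have_console)" >&2
    exit 1
)
```

Run as `pstore_gate || exit 1` at the top of each 2.x step, this keeps the bisect from starting on a machine whose capture path has silently stopped working.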
## Estimated wall-time

| Phase | Time |
|-------|------|
| Pre-flight (0.1-0.5) + serial confirmation | 15-30 min (depends on serial cable hookup) |
| Lockdep base kernel build (1.1-1.9) | 60 min |
| 2.1 (0004 + Kconfig + full kernel rebuild) | 50 min |
| 2.2 (0005 hantro module) | 10 min |
| 2.3 (0006 rga module) | 10 min |
| 2.4 (0007-v2 rkvdec module) | 10 min |
| Total | ~3-3.5h wall-time clean-path |

## Exit conditions

Same as v1. Success / Partial / Failure trees unchanged.