Locks in forensics from boot -9..-1 journal:
- Silent watchdog reset, no oops/panic in logs
- vblank/edp WARNs are pre-existing (also fire on recovered boot 0)
- vb2_buffer_attach_release_fence / dma_fence / dma_resv NEVER appear in iter6-boot kernel logs: deadlock at a level where the kernel can't reach printk
- Hardware Synopsys DesignWare watchdog is the reset mechanism

Six non-negotiable safeguards for any retry:
1. backup .ko AND off-device archive before sudo install
2. CONFIG_PROVE_LOCKING + DEBUG_ATOMIC_SLEEP + LOCKDEP etc
3. bisect-apply one patch at a time, reboot+test between
4. SDDM auto-login OFF (done: file renamed .disabled-iter6postmortem)
5. pstore.backend=ramoops to capture kernel oops across reset
6. Phase 5 architect review of plan + 0007 source before apply

Four gating questions for Phase 1, starting with bisect:
- which of 4 patches is the actual vector
- lockdep splat hidden by CONFIG_PROVE_LOCKING=n
- why no oops in journal
- producer-side fence-alloc hang vs consumer-side wait hang

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Phase 0 — iter6 post-mortem substrate
Opened 2026-05-16 evening, immediately following iter6 surgical revert + ampere recovery via USB stick.
Research question
What caused the iter6 module set (vb2_dma_resv fence series 0004/0005/0006 + my unreviewed 0007 rkvdec opt-in) to silently wedge ampere during SDDM/wayland session handover, with no oops/panic logged, ending in hardware-watchdog reset within 30-90 seconds of each boot attempt — and what's the minimum kernel-debug instrumentation required to make the next attempt's failure mode self-diagnostic?
Locked-in evidence
| Observation | Source | Status |
|---|---|---|
| iter6 modules installed at 2026-05-16 15:00 (videobuf2-common, hantro-vpu, rockchip-rga, rockchip-vdec, all dated 15:00 in the backed-up dir) | ampere:~/iter6-broken-modules-bak-20260516-1720/ | preserved for analysis |
| Reboot loop boots -8..-1 in journalctl, each 1-42 sec lifetime before watchdog reset | journalctl --list-boots on ampere | confirmed |
| Last log entries before each truncation: SDDM started, plasma wayland session jumped to VT 2, kwin_wayland chose drm backend, then journal cuts off mid-line | journalctl -b -9 --no-pager | confirmed |
| Zero Internal error / Oops / panic / dma_fence / vb2_buffer_attach_release_fence / dma_resv lines in iter6-boot kernel logs | journalctl -b -N -k filter | confirmed |
| panel-edp.c:814 generic_edp_panel_probe WARN: fires on iter6 boots AND on current recovered boot 0, so pre-existing, not iter6 | filter against boot 0 | ruled out |
| drm_atomic_helper.c:1921 drm_atomic_helper_wait_for_vblanks+0x1d0 WARN: fires on iter6 boots from kworker / plymouthd / Xorg / kwin_wayland. Modules linked in: list does NOT include videobuf2_common, only panthor + drm_*, so this is a pre-existing hardware-display path WARN, not iter6 | module list inspection | ruled out as cause |
| No pstore dumps (perm denied to /sys/fs/pstore, /var/lib/systemd/pstore empty), no coredumps from kernel | survey | no kernel-side artifacts |
| Hardware watchdog (Synopsys DesignWare Watchdog, 10 min hw timeout) is armed by systemd-shutdown but also by systemd-pid1 during normal runtime | boot -10 shutdown log | confirms watchdog is the reset mechanism: software hang exceeded systemd ping interval |
| Backend .so md5 404041ea2dcc03c769e0ab8c43ddadd6 unchanged across iter4/5/6, so userspace not implicated | strings check | userspace not the cause |
| Recovery (revert iter6, rebuild, install pre-iter6 with iter3+4 kept) yielded clean ampere boot, ffmpeg exit 0, HEVC decode runs (still uniform black per iter5) | boot 0 verification | revert confirmed working |
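
The "zero hits" row in the table could be reproduced with a filter along these lines. This is a minimal sketch, not the exact command used in the session: the function name `count_suspect_lines` is hypothetical, and the pattern list is an assumption assembled from the strings named in the table (with `Kernel panic` instead of bare `panic`, so that a `panic=` entry on the kernel command line is not counted).

```shell
# Hypothetical helper: count kernel-log lines on stdin that mention any of the
# failure signatures named in the evidence table. Prints the count (0 = no hits).
count_suspect_lines() {
    # grep -c exits non-zero when the count is 0; "|| true" keeps the status clean.
    grep -c -E 'Internal error|Oops|Kernel panic|dma_fence|vb2_buffer_attach_release_fence|dma_resv' || true
}

# Usage on the device (boot offsets as shown by `journalctl --list-boots`):
#   for n in 1 2 3 4 5 6 7 8; do
#       printf 'boot -%s: ' "$n"
#       journalctl -b "-$n" -k --no-pager | count_suspect_lines
#   done
```

A count of 0 only shows that nothing reached the journal; it does not show that the kernel never printed anything, which is why the gating question about missing oops output stays open.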
What we DON'T know (gating questions for Phase 1)
- Which of the 4 iter6 patches is the actual deadlock vector? Patches are interdependent (the 0004 helper is required for 0005/0006/0007), but 0007 (my rkvdec opt-in) is the only NEW driver opt-in beyond what fresnel-fourier validated. Bisect: apply 0004 alone (helper, no opt-ins) → boot. Then add 0005 → boot. Then 0006 → boot. Then 0007 → boot. The first step that wedges identifies the vector.
- Is it a lockdep splat we're missing because CONFIG_PROVE_LOCKING=n? The 0004 commit message explicitly claims `dma_fence_begin_signalling()` was validated under PROVE_LOCKING on RK3566. On RK3588 without PROVE_LOCKING, a locking violation that lockdep would have reported can deadlock silently instead. Phase 6 must rebuild with `CONFIG_PROVE_LOCKING=y CONFIG_DEBUG_ATOMIC_SLEEP=y CONFIG_LOCKDEP=y` before any iter6 patches go back.
- Why no oops/panic in journal? Possibilities:
  - True silent CPU hang (not even WARN_ON reachable), which would imply a hard spinlock deadlock between vb2_buffer_done's signal path and panthor's wait
  - Oops happened but never reached the on-disk journal before reset (likely: kernel messages travel /dev/kmsg → systemd-journald → sync to disk)
  - Panic with `panic=N` set such that the kernel auto-reboots before logging
  - Watchdog fired before kernel panicked
- Does the hang happen at producer-side fence allocation or consumer-side wait? Phase 6 instrumentation: pr_warn at vb2_buffer_attach_release_fence entry/exit, and at vb2_buffer_done signal site. If entry fires but exit doesn't → producer hang. If both fire but compositor still wedges → consumer wait hang.
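
The producer/consumer decision rule in the last question can be written down as a small log classifier. This is a sketch under stated assumptions: the marker strings `iter6dbg: attach_release_fence enter` and `iter6dbg: attach_release_fence exit` are hypothetical and would have to match whatever the pr_warn calls actually print, and the function name `classify_hang` is invented for illustration. It also presumes the log survives the reset at all (for example via ramoops or a serial capture).

```shell
# Hypothetical classifier for the entry/exit instrumentation described above.
# Reads a kernel log on stdin and prints one verdict line.
classify_hang() (
    log=$(cat)
    # Marker strings are ASSUMPTIONS; adjust to the real pr_warn text.
    enter=$(printf '%s\n' "$log" | grep -c 'iter6dbg: attach_release_fence enter' || true)
    leave=$(printf '%s\n' "$log" | grep -c 'iter6dbg: attach_release_fence exit' || true)
    if [ "$enter" -eq 0 ]; then
        echo "inconclusive: fence path never entered"
    elif [ "$enter" -gt "$leave" ]; then
        echo "producer-side: entry without matching exit"
    else
        echo "consumer-side candidate: all entries exited"
    fi
)

# Usage, assuming the previous boot's kernel log was captured:
#   journalctl -b -1 -k --no-pager | classify_hang
```

The third verdict is deliberately worded as a candidate: matched entry/exit pairs only exclude the producer path, they do not prove that the consumer wait is where the system hangs.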
Substrate for post-mortem retry
- Kernel source on ampere: ~/src/linux-rockchip branch ampere-minimal-devices, iter3+4+diag patches in working tree (NOT committed), iter6 source changes already reverted via git checkout + sed surgery (per iter6 close commit)
- Build host: ampere itself (boltzmann tree on incompatible branch per kernel-agent#14 prior discovery)
- Iter6 .ko backup: ampere:~/iter6-broken-modules-bak-20260516-1720/ — 4 modules, ~1MB total — for binary diff analysis (e.g. compare reverted-but-rebuilt rockchip-vdec.ko vs iter6 version to see exact differences in compiled code)
- Backup rule now in project memory: `feedback_backup_before_module_replace.md` cites the iter6 incident; next session loads it cold
- Auto-login disabled: sddm drops to login prompt. If an iter6 retry wedges at compositor handover, the system stops at the login screen instead of cascading to watchdog reset
- Recovery stick: still at higgs (or wherever user moved it after recovery), default label coolpi_rk3588_gbook, fstab includes `LABEL=writable / ext4 defaults,noatime 0 1`, backup config files preserved on stick; ready for next recovery if needed
- Serial console access: ampere extlinux command line includes `console=ttyS2,1500000`, which would capture pre-watchdog output IF a serial cable is hooked up. Currently no record of serial cable being connected.
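
A first pass at the binary comparison mentioned for the iter6 .ko backup could look like the following. It is a minimal sketch: `compare_modules` is a hypothetical helper, and the second directory in the usage line is a placeholder for wherever the reverted-and-rebuilt modules live, which these notes do not specify.

```shell
# Hypothetical helper: compare every .ko in OLD_DIR against the file of the
# same name in NEW_DIR. Prints "same NAME", "differs NAME", or "missing NAME".
compare_modules() (
    old_dir=$1
    new_dir=$2
    for old in "$old_dir"/*.ko; do
        [ -e "$old" ] || continue          # glob matched nothing
        name=$(basename "$old")
        if [ ! -e "$new_dir/$name" ]; then
            echo "missing $name"
        elif cmp -s "$old" "$new_dir/$name"; then
            echo "same $name"
        else
            echo "differs $name"
        fi
    done
)

# Usage (second path is a placeholder, not a location from these notes):
#   compare_modules ~/iter6-broken-modules-bak-20260516-1720 /path/to/rebuilt/modules
```

A byte-level difference is only a triage signal, since two builds of identical source can still differ in embedded metadata. Modules reported as `differs` would need a disassembly-level comparison to show which code paths actually changed.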
Hard safeguards for iter6 retry (Phase 4 baseline)
These are NON-NEGOTIABLE before any iter6 patch goes back:
- `cp` backup of every .ko AND tarball off-device archive to a different host (boltzmann:/home/mfritsche/iter6-postmortem-backups/-pre-attemptN/) BEFORE `sudo install`. Per `feedback_backup_before_module_replace.md`.
- `CONFIG_PROVE_LOCKING=y CONFIG_DEBUG_ATOMIC_SLEEP=y CONFIG_LOCKDEP=y CONFIG_DEBUG_RT_MUTEXES=y CONFIG_DEBUG_SPINLOCK=y` in the kernel BEFORE applying any iter6 patches. Full kernel rebuild (not just modules) required for these flags.
- Bisect-apply: one patch at a time + reboot + smoke-test + revert if regression. Don't apply 0004+0005+0006+0007 all at once again. Test boot after each.
- Auto-login OFF (done). If sddm wedges at handover, the system halts at the login prompt: manual rescue without compositor crash → no watchdog → no reset.
- `pstore.backend=ramoops` kernel cmdline + reserved ramoops region in DT (if available on RK3588) → kernel oops survives reset. Pre-condition check Phase 6 step 0.
- Phase 5 architect review of the plan + the 0007 rkvdec opt-in code before applying. Spawn separate Claude session OR get user sign-off explicitly.
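
The debug-option safeguard lends itself to a mechanical pre-flight check before any patch is applied. The sketch below assumes the options are checked against the `.config` in the kernel build tree; the function name `check_debug_config` is hypothetical, and the option list is copied from the safeguard above.

```shell
# Hypothetical pre-flight check: verify that every required debug option is
# set to =y in the given kernel .config. Prints each missing option and
# returns 1 if any is absent, 0 otherwise.
check_debug_config() {
    _cfg=$1
    _missing=0
    for _opt in PROVE_LOCKING DEBUG_ATOMIC_SLEEP LOCKDEP DEBUG_RT_MUTEXES DEBUG_SPINLOCK; do
        # Anchored match, so "# CONFIG_X is not set" and "=m" do not count.
        if ! grep -q "^CONFIG_${_opt}=y\$" "$_cfg"; then
            echo "missing CONFIG_${_opt}=y"
            _missing=1
        fi
    done
    return "$_missing"
}

# Usage from the kernel tree:
#   check_debug_config .config || echo "do NOT apply iter6 patches yet"
```

This only confirms what the build was configured with. Whether the running kernel is that build still has to be verified separately after the full rebuild and reboot.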
Phase 0 close
Substrate locked. Five locked-in observations + four gating questions + six hard safeguards. Phase 1 starts with Q1 (which patch is the vector) via cheap iter6-rebuilt-binary diff against current state, then escalates to actual bisect-boot if binary diff is inconclusive.
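
If the binary diff is inconclusive, the bisect-boot order can be spelled out ahead of time so that no step applies more than one new patch. A minimal sketch, assuming the four patch numbers named in these notes; `bisect_steps` is a hypothetical helper that only prints the plan and applies nothing.

```shell
# Hypothetical helper: print the cumulative patch set for each bisect step.
# Each step adds exactly one patch on top of the previous step's set.
bisect_steps() (
    applied=""
    step=1
    for patch in "$@"; do
        applied="${applied:+$applied }$patch"
        echo "step $step: $applied"
        step=$((step + 1))
    done
)

# Usage:
#   bisect_steps 0004 0005 0006 0007
# Each printed step is followed by: backup, install, reboot, smoke-test,
# and a revert if the boot wedges.
```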