ampere-kernel-decoders/phase0_findings_iter6_postmortem.md
Markus Fritsche 11d2dde8ab iter6 post-mortem Phase 0: substrate + safeguards
Locks in forensics from boot -9..-1 journal:
- Silent watchdog reset, no oops/panic in logs
- vblank/edp WARNs are pre-existing (also fire on recovered boot 0)
- vb2_buffer_attach_release_fence / dma_fence / dma_resv NEVER
  appear in iter6-boot kernel logs — deadlock at a level kernel
  can't reach printk
- Hardware Synopsys DesignWare watchdog is the reset mechanism

Six non-negotiable safeguards for any retry:
1. backup .ko AND off-device archive before sudo install
2. CONFIG_PROVE_LOCKING + DEBUG_ATOMIC_SLEEP + LOCKDEP etc
3. bisect-apply one patch at a time, reboot+test between
4. SDDM auto-login OFF (done — file renamed .disabled-iter6postmortem)
5. pstore.backend=ramoops to capture kernel oops across reset
6. Phase 5 architect review of plan + 0007 source before apply

Four gating questions for Phase 1, starting with bisect:
- which of 4 patches is the actual vector
- lockdep splat hidden by CONFIG_PROVE_LOCKING=n
- why no oops in journal
- producer-side fence-alloc hang vs consumer-side wait hang

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-16 16:02:09 +00:00


Phase 0 — iter6 post-mortem substrate

Opened 2026-05-16 evening, immediately following iter6 surgical revert + ampere recovery via USB stick.

Research question

What caused the iter6 module set (vb2_dma_resv fence series 0004/0005/0006 + my unreviewed 0007 rkvdec opt-in) to silently wedge ampere during SDDM/wayland session handover, with no oops/panic logged, ending in hardware-watchdog reset within 30-90 seconds of each boot attempt — and what's the minimum kernel-debug instrumentation required to make the next attempt's failure mode self-diagnostic?

Locked-in evidence

| Observation | Source | Status |
|---|---|---|
| iter6 modules installed at 2026-05-16 15:00 (videobuf2-common, hantro-vpu, rockchip-rga, rockchip-vdec, all dated 15:00 in the backed-up dir) | ampere:~/iter6-broken-modules-bak-20260516-1720/ | preserved for analysis |
| Reboot loop boots -8..-1 in journalctl, each 1-42 sec lifetime before watchdog reset | journalctl --list-boots on ampere | confirmed |
| Last log entries before each truncation: SDDM started, plasma wayland session jumped to VT 2, kwin_wayland chose drm backend, then journal cuts off mid-line | journalctl -b -9 --no-pager | confirmed |
| Zero Internal error / Oops / panic / dma_fence / vb2_buffer_attach_release_fence / dma_resv lines in iter6-boot kernel logs | journalctl -b -N -k filter | confirmed |
| panel-edp.c:814 generic_edp_panel_probe WARN: fires on iter6 boots AND on current recovered boot 0; pre-existing, not iter6 | filter against boot 0 | ruled out |
| drm_atomic_helper.c:1921 drm_atomic_helper_wait_for_vblanks+0x1d0 WARN: fires on iter6 boots from kworker / plymouthd / Xorg / kwin_wayland; the "Modules linked in:" list does NOT include videobuf2_common, only panthor + drm_*; pre-existing hardware-display path WARN, not iter6 | module list inspection | ruled out as cause |
| No pstore dumps (perm denied to /sys/fs/pstore, /var/lib/systemd/pstore empty), no coredumps from kernel | survey | no kernel-side artifacts |
| Hardware watchdog (Synopsys DesignWare Watchdog, 10 min hw timeout) is armed by systemd-shutdown but also by systemd-pid1 during normal runtime | boot -10 shutdown log | confirms watchdog is the reset mechanism; software hang exceeded systemd ping interval |
| Backend .so md5 404041ea2dcc03c769e0ab8c43ddadd6 unchanged across iter4/5/6; userspace not implicated | strings check | userspace not the cause |
| Recovery (revert iter6, rebuild, install pre-iter6 with iter3+4 kept) yielded clean ampere boot, ffmpeg exit 0, HEVC decode runs (still uniform black per iter5) | boot 0 verification | revert confirmed working |
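
The "zero matching lines" observation can be re-checked mechanically per boot. The following is a minimal sketch, not the filter that was actually run: the helper name count_suspect_lines and the exact pattern are assumptions, built only from the keywords listed in the evidence above.

```sh
#!/bin/sh
# Keywords taken from the evidence list; the pattern itself is an assumption.
SUSPECT_PATTERN='Internal error|Oops|panic|dma_fence|vb2_buffer_attach_release_fence|dma_resv'

# Read kernel-log text on stdin and print how many lines mention a keyword.
count_suspect_lines() {
    # grep -c prints 0 but exits 1 when nothing matches; keep only the count.
    grep -Ec "$SUSPECT_PATTERN" || true
}

# Hypothetical usage on ampere, for a given boot offset N:
#   journalctl -b -N -k --no-pager | count_suspect_lines
```

A count of 0 on every iter6 boot reproduces the observation. A non-zero count on boot 0 would point at a pre-existing message rather than at iter6, so the pattern should be compared against the recovered boot as well.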

What we DON'T know (gating questions for Phase 1)

  1. Which of the 4 iter6 patches is the actual deadlock vector? The patches are interdependent (the 0004 helper is required by 0005/0006/0007), but 0007 (my rkvdec opt-in) is the only NEW driver opt-in beyond what fresnel-fourier validated. Bisect cumulatively: apply 0004 alone (helper, no opt-ins) → boot. Then add 0005 → boot. Then 0006 → boot. Then 0007 → boot. The first step that wedges identifies the vector.

  2. Is there a lockdep splat we're missing because CONFIG_PROVE_LOCKING=n? The 0004 commit message explicitly claims dma_fence_begin_signalling() was validated under PROVE_LOCKING on RK3566. On RK3588 without PROVE_LOCKING, a locking violation that lockdep would have reported can instead deadlock silently. Phase 6 must rebuild with CONFIG_PROVE_LOCKING=y CONFIG_DEBUG_ATOMIC_SLEEP=y CONFIG_LOCKDEP=y before any iter6 patches go back.

  3. Why no oops/panic in journal? Possibilities:

    • True silent CPU hang (not even WARN_ON reachable) — would imply a hard spinlock deadlock between vb2_buffer_done's signal path and panthor's wait
    • Oops happened but was never fsynced to the journal before the reset (likely: kernel messages reach disk only via /dev/kmsg → systemd-journald → fsync)
    • Panic with panic=N set such that kernel auto-reboots before logging
    • Watchdog fired before kernel panicked
  4. Does the hang happen at producer-side fence allocation or consumer-side wait? Phase 6 instrumentation: pr_warn at vb2_buffer_attach_release_fence entry/exit, and at vb2_buffer_done signal site. If entry fires but exit doesn't → producer hang. If both fire but compositor still wedges → consumer wait hang.
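
The producer-versus-consumer decision in question 4 can be written down as a small log classifier. This is a sketch under loud assumptions: the marker strings "vb2_fence: attach enter" and "vb2_fence: attach exit" are hypothetical pr_warn texts that do not exist yet, and the log is assumed to have survived the reset (for example via pstore or a serial capture).

```sh
#!/bin/sh
# Classify a captured kernel log from a wedged boot (read on stdin).
# The marker strings are hypothetical; they must match whatever the
# Phase 6 pr_warn instrumentation actually prints.
classify_hang() (
    log=$(cat)
    enters=$(printf '%s\n' "$log" | grep -c 'vb2_fence: attach enter' || true)
    exits=$(printf '%s\n' "$log" | grep -c 'vb2_fence: attach exit' || true)
    if [ "$enters" -eq 0 ]; then
        echo "inconclusive"   # fence path never reached, or log lost
    elif [ "$enters" -gt "$exits" ]; then
        echo "producer"       # an attach call entered but never returned
    else
        echo "consumer"       # every attach returned, yet the boot wedged
    fi
)
```

The "inconclusive" branch matters here: the iter6 boots logged none of the fence symbols at all, so an empty result cannot distinguish "path not reached" from "message lost before fsync".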

Substrate for post-mortem retry

  • Kernel source on ampere: ~/src/linux-rockchip branch ampere-minimal-devices, iter3+4+diag patches in working tree (NOT committed), iter6 source changes already reverted via git checkout + sed surgery (per iter6 close commit)
  • Build host: ampere itself (boltzmann tree on incompatible branch per kernel-agent#14 prior discovery)
  • Iter6 .ko backup: ampere:~/iter6-broken-modules-bak-20260516-1720/ — 4 modules, ~1MB total — for binary diff analysis (e.g. compare reverted-but-rebuilt rockchip-vdec.ko vs iter6 version to see exact differences in compiled code)
  • Backup rule now in project memory: feedback_backup_before_module_replace.md cites iter6 incident — next session loads it cold
  • Auto-login disabled: sddm drops to login prompt — if iter6 retry wedges at compositor handover, system stops at login screen instead of cascading to watchdog reset
  • Recovery stick: still at higgs (or wherever user moved it after recovery), default label coolpi_rk3588_gbook, fstab includes LABEL=writable / ext4 defaults,noatime 0 1, backup config files preserved on stick — ready for next recovery if needed
  • Serial console access: ampere extlinux command line includes console=ttyS2,1500000 — would capture pre-watchdog output IF a serial cable is hooked up. Currently no record of serial cable being connected.
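
As a first pass over the preserved .ko backup, the two module sets can be compared by checksum before any disassembly. A minimal sketch follows; the function name changed_modules and the directory arguments are illustrative, not an existing tool.

```sh
#!/bin/sh
# Print the name of every .ko in directory A whose SHA-256 differs from
# the same-named file in directory B, or which is missing from B.
changed_modules() (
    dir_a=$1
    dir_b=$2
    for path_a in "$dir_a"/*.ko; do
        [ -e "$path_a" ] || continue          # glob matched nothing
        name=$(basename "$path_a")
        path_b="$dir_b/$name"
        if [ ! -e "$path_b" ]; then
            echo "$name (missing)"
            continue
        fi
        sum_a=$(sha256sum < "$path_a")
        sum_b=$(sha256sum < "$path_b")
        if [ "$sum_a" != "$sum_b" ]; then
            echo "$name"
        fi
    done
)

# Hypothetical usage, with the second path standing in for the rebuilt modules:
#   changed_modules ~/iter6-broken-modules-bak-20260516-1720 /path/to/rebuilt/modules
```

A checksum mismatch only says that a module changed, not why: rebuilt modules can differ for reasons unrelated to the patches, so any hit still needs a disassembly-level comparison.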

Hard safeguards for iter6 retry (Phase 4 baseline)

These are NON-NEGOTIABLE before any iter6 patch goes back:

  1. Back up every .ko with cp AND push a tarball archive off-device to a different host (boltzmann:/home/mfritsche/iter6-postmortem-backups/-pre-attemptN/) BEFORE sudo install. Per feedback_backup_before_module_replace.md.
  2. CONFIG_PROVE_LOCKING=y CONFIG_DEBUG_ATOMIC_SLEEP=y CONFIG_LOCKDEP=y CONFIG_DEBUG_RT_MUTEXES=y CONFIG_DEBUG_SPINLOCK=y in the kernel BEFORE applying any iter6 patches. Full kernel rebuild (not just modules) required for these flags.
  3. Bisect-apply: one patch at a time + reboot + smoke-test + revert if regression. Don't apply 0004+0005+0006+0007 all at once again. Test boot after each.
  4. Auto-login OFF (done). If the retry would wedge at compositor handover, the system now stops at the SDDM login prompt instead: no compositor start, so no hang, no watchdog expiry and no reset, which leaves room for manual rescue.
  5. pstore.backend=ramoops kernel cmdline + reserved ramoops region in DT (if available on RK3588) → kernel oops survives reset. Pre-condition check Phase 6 step 0.
  6. Phase 5 architect review of the plan + the 0007 rkvdec opt-in code before applying. Spawn separate Claude session OR get user sign-off explicitly.
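
Safeguards 2 and 5 can be verified before each attempt rather than trusted. The sketch below is one possible preflight, assuming the running kernel's configuration is available as a plain-text file; the function names and the location of that file are assumptions.

```sh
#!/bin/sh
# Options required by safeguard 2.
REQUIRED_OPTIONS='CONFIG_PROVE_LOCKING CONFIG_DEBUG_ATOMIC_SLEEP CONFIG_LOCKDEP CONFIG_DEBUG_RT_MUTEXES CONFIG_DEBUG_SPINLOCK'

# Print each required option that is not set to =y in the given .config file.
missing_debug_options() (
    config_file=$1
    for opt in $REQUIRED_OPTIONS; do
        grep -qx "$opt=y" "$config_file" || echo "$opt"
    done
)

# Succeed if the given kernel command line selects the ramoops pstore backend.
cmdline_has_ramoops() {
    case " $1 " in
        *" pstore.backend=ramoops "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Hypothetical usage (the config path is an assumption about the install):
#   missing_debug_options "/boot/config-$(uname -r)"
#   cmdline_has_ramoops "$(cat /proc/cmdline)" || echo "ramoops not on cmdline"
```

An empty result from missing_debug_options plus a passing cmdline check covers only the configuration side. Whether a ramoops region is actually reserved in the device tree still has to be confirmed separately, as safeguard 5 notes.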

Phase 0 close

Substrate locked. Ten locked-in observations + four gating questions + six hard safeguards. Phase 1 starts with Q1 (which patch is the vector) via a cheap binary diff of the iter6 modules against the current rebuilt state, then escalates to an actual bisect-boot if the binary diff is inconclusive.
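
If the binary diff is inconclusive, the bisect-boot order is fixed by the patch dependencies: each step adds one patch on top of the previous set. A small sketch of that ordering, with the helper name patches_for_step being illustrative only:

```sh
#!/bin/sh
# Patch series in dependency order (0004 is the helper the others need).
PATCH_ORDER='0004 0005 0006 0007'

# Print the cumulative patch set to apply for bisect step N (1-based).
patches_for_step() (
    step=$1
    i=0
    out=''
    for p in $PATCH_ORDER; do
        i=$((i + 1))
        if [ "$i" -le "$step" ]; then
            out="$out $p"
        fi
    done
    echo "${out# }"   # strip the leading space
)

# Example: patches_for_step 2 prints "0004 0005"
```

The first step whose boot wedges names the vector as the last patch in its list, provided each step is rebooted and smoke-tested on its own before the next patch is added.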