iter6 post-mortem Phase 0: substrate + safeguards

Locks in forensics from boot -9..-1 journal:
- Silent watchdog reset, no oops/panic in logs
- vblank/edp WARNs are pre-existing (also fire on recovered boot 0)
- vb2_buffer_attach_release_fence / dma_fence / dma_resv NEVER appear in iter6-boot kernel logs: deadlock at a level where the kernel can't reach printk
- Hardware Synopsys DesignWare watchdog is the reset mechanism

Six non-negotiable safeguards for any retry:
1. backup .ko AND off-device archive before sudo install
2. CONFIG_PROVE_LOCKING + DEBUG_ATOMIC_SLEEP + LOCKDEP etc
3. bisect-apply one patch at a time, reboot+test between
4. SDDM auto-login OFF (done; file renamed .disabled-iter6postmortem)
5. pstore.backend=ramoops to capture kernel oops across reset
6. Phase 5 architect review of plan + 0007 source before apply

Four gating questions for Phase 1, starting with bisect:
- which of 4 patches is the actual vector
- lockdep splat hidden by CONFIG_PROVE_LOCKING=n
- why no oops in journal
- producer-side fence-alloc hang vs consumer-side wait hang

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@@ -0,0 +1,61 @@
# Phase 0 — iter6 post-mortem substrate
Opened 2026-05-16 evening, immediately following iter6 surgical revert + ampere recovery via USB stick.
## Research question
**What caused the iter6 module set (vb2_dma_resv fence series 0004/0005/0006 + my unreviewed 0007 rkvdec opt-in) to silently wedge ampere during SDDM/wayland session handover, with no oops/panic logged, ending in hardware-watchdog reset within 30-90 seconds of each boot attempt — and what's the minimum kernel-debug instrumentation required to make the next attempt's failure mode self-diagnostic?**
## Locked-in evidence
| Observation | Source | Status |
|-------------|--------|--------|
| iter6 modules installed at 2026-05-16 15:00 (videobuf2-common, hantro-vpu, rockchip-rga, rockchip-vdec, all dated 15:00 in the backed-up dir) | ampere:~/iter6-broken-modules-bak-20260516-1720/ | preserved for analysis |
| Reboot loop boots -8..-1 in journalctl, each 1-42 sec lifetime before watchdog reset | `journalctl --list-boots` on ampere | confirmed |
| Last log entries before each truncation: SDDM started, plasma wayland session jumped to VT 2, kwin_wayland chose drm backend, then journal cuts off mid-line | `journalctl -b -9 --no-pager` | confirmed |
| **Zero `Internal error / Oops / panic / dma_fence / vb2_buffer_attach_release_fence / dma_resv` lines in iter6-boot kernel logs** | `journalctl -b -N -k` filter | confirmed |
| `panel-edp.c:814 generic_edp_panel_probe` WARN: fires on iter6 boots AND on current recovered boot 0. **Pre-existing, not iter6** | filter against boot 0 | ruled out |
| `drm_atomic_helper.c:1921 drm_atomic_helper_wait_for_vblanks+0x1d0` WARN: fires on iter6 boots from kworker / plymouthd / Xorg / kwin_wayland; `Modules linked in:` list does NOT include videobuf2_common, only panthor + `drm_*`. **Pre-existing hardware-display path WARN**, not iter6 | module list inspection | ruled out as cause |
| No pstore dumps (perm denied to /sys/fs/pstore, /var/lib/systemd/pstore empty), no coredumps from kernel | survey | no kernel-side artifacts |
| Hardware watchdog (`Synopsys DesignWare Watchdog`, 10 min hw timeout) is armed by systemd-shutdown but also by systemd-pid1 during normal runtime | boot -10 shutdown log | confirms watchdog is the reset mechanism; software hang exceeded systemd ping interval |
| Backend `.so` md5 `404041ea2dcc03c769e0ab8c43ddadd6` unchanged across iter4/5/6; userspace not implicated | strings check | userspace not the cause |
| Recovery (revert iter6, rebuild, install pre-iter6 with iter3+4 kept) yielded clean ampere boot, ffmpeg exit 0, HEVC decode runs (still uniform black per iter5) | boot 0 verification | revert confirmed working |
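
A minimal sketch of the signature filter behind the "Zero ... lines" row, assuming the kernel log of one boot has been exported to text first (for example with `journalctl -b -9 -k --no-pager`). The name `find_crash_markers` and the pattern tuple are illustrative, not an existing tool. Matching is a case-sensitive substring search, so `panic` also matches a `panic=` boot parameter line and every hit still needs a manual look.

```python
import re

# Signatures copied from the evidence table; extend as needed.
CRASH_PATTERNS = (
    "Internal error",
    "Oops",
    "panic",
    "dma_fence",
    "vb2_buffer_attach_release_fence",
    "dma_resv",
)
_CRASH_RE = re.compile("|".join(re.escape(p) for p in CRASH_PATTERNS))


def find_crash_markers(lines):
    """Return (line_number, text) for each line containing a crash signature.

    `lines` is any iterable of strings, e.g. an open file holding the
    exported kernel log of a single boot. Line numbers start at 1.
    """
    hits = []
    for number, text in enumerate(lines, start=1):
        if _CRASH_RE.search(text):
            hits.append((number, text.rstrip("\n")))
    return hits
```

An empty result for a boot matches the "confirmed" status in the table; any hit would reopen that row.
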
## What we DON'T know (gating questions for Phase 1)
1. **Which of the 4 iter6 patches is the actual deadlock vector?** The patches are interdependent (the 0004 helper is required by 0005/0006/0007), but 0007 (my rkvdec opt-in) is the only NEW driver opt-in beyond what fresnel-fourier validated. Bisect cumulatively: apply 0004 alone (helper, no opt-ins) → boot. Then add 0005 → boot. Then 0006 → boot. Then 0007 → boot. The first step that wedges identifies the vector.
2. **Is it a lockdep splat we're missing because CONFIG_PROVE_LOCKING=n?** The 0004 commit message explicitly claims `dma_fence_begin_signalling()` was validated under PROVE_LOCKING on RK3566. On RK3588 without PROVE_LOCKING, a locking violation that lockdep would have reported could instead deadlock silently. Phase 6 must rebuild with `CONFIG_PROVE_LOCKING=y CONFIG_DEBUG_ATOMIC_SLEEP=y CONFIG_LOCKDEP=y` before any iter6 patches go back.
3. **Why no oops/panic in journal?** Possibilities:
   - True silent CPU hang (not even WARN_ON reachable), which would imply a hard spinlock deadlock between vb2_buffer_done's signal path and panthor's wait
   - Oops happened but never reached the on-disk journal before reset (likely: kernel messages only hit disk once systemd-journald has read them from /dev/kmsg and synced the journal, which requires userspace to keep running)
   - Panic with `panic=N` set such that the kernel auto-reboots before the message is logged
   - Watchdog fired before the kernel panicked
4. **Does the hang happen at producer-side fence allocation or consumer-side wait?** Phase 6 instrumentation: pr_warn at vb2_buffer_attach_release_fence entry/exit, and at vb2_buffer_done signal site. If entry fires but exit doesn't → producer hang. If both fire but compositor still wedges → consumer wait hang.
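
The cumulative bisect from Q1 and the entry/exit decision rule from Q4 can be sketched as below. This is a planning aid only, under stated assumptions: `boots_ok` is a hypothetical callback standing in for the manual apply, reboot, and smoke-test cycle, and the two booleans passed to `classify_hang` stand for whether the `pr_warn` markers at `vb2_buffer_attach_release_fence` entry and exit showed up in whatever log was captured for a wedged boot.

```python
PATCH_ORDER = ["0004", "0005", "0006", "0007"]


def bisect_steps(patches):
    """Return the cumulative patch sets: [0004], [0004, 0005], ..."""
    return [patches[: i + 1] for i in range(len(patches))]


def first_bad_patch(patches, boots_ok):
    """Walk the cumulative steps and return the patch whose addition wedges.

    boots_ok(applied) is a hypothetical callback that returns True when the
    system booted and passed the smoke test with exactly `applied` installed.
    Returns None if every step boots cleanly.
    """
    for applied in bisect_steps(patches):
        if not boots_ok(applied):
            return applied[-1]
    return None


def classify_hang(entry_seen, exit_seen):
    """Apply the Q4 rule to a boot that is already known to have wedged."""
    if entry_seen and not exit_seen:
        return "producer"      # entered the fence attach, never left it
    if entry_seen and exit_seen:
        return "consumer"      # attach completed, the wait side is stuck
    return "inconclusive"      # marker never fired, or logs were lost
```

Stopping at the first failing step keeps the number of risky boots to the minimum the interdependent series allows, since 0005 through 0007 cannot be tested without 0004.
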
## Substrate for post-mortem retry
- **Kernel source on ampere**: ~/src/linux-rockchip, branch ampere-minimal-devices; iter3+4+diag patches in working tree (NOT committed); iter6 source changes already reverted via git checkout + sed surgery (per iter6 close commit)
- **Build host**: ampere itself (boltzmann tree on incompatible branch per kernel-agent#14 prior discovery)
- **Iter6 .ko backup**: ampere:~/iter6-broken-modules-bak-20260516-1720/ (4 modules, ~1MB total), kept for binary diff analysis (e.g. compare the reverted-but-rebuilt rockchip-vdec.ko against the iter6 version to see exact differences in compiled code)
- **Backup rule now in project memory**: `feedback_backup_before_module_replace.md` cites the iter6 incident; next session loads it cold
- **Auto-login disabled**: sddm drops to login prompt. If an iter6 retry wedges at compositor handover, the system stops at the login screen instead of cascading to watchdog reset
- **Recovery stick**: still at higgs (or wherever user moved it after recovery), default label coolpi_rk3588_gbook, fstab includes `LABEL=writable / ext4 defaults,noatime 0 1`, backup config files preserved on stick; ready for next recovery if needed
- **Serial console access**: ampere extlinux command line includes `console=ttyS2,1500000`, which would capture pre-watchdog output IF a serial cable is hooked up. Currently no record of a serial cable being connected.
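
One cheap first pass at the binary diff mentioned for the .ko backup is a per-module hash comparison between the backup directory and a directory of rebuilt modules. A rough sketch, assuming uncompressed `.ko` files; `compare_module_dirs` and its status strings are made up for illustration. A differing hash only says the files are not byte-identical (build metadata alone can cause that), so it narrows where to point a disassembler rather than proving a code change.

```python
import hashlib
from pathlib import Path


def sha256_of(path):
    """Hash a file in 1 MiB chunks so large modules are not read at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compare_module_dirs(old_dir, new_dir):
    """Map each *.ko file name found in either directory to a status string."""
    old = {p.name: p for p in Path(old_dir).glob("*.ko")}
    new = {p.name: p for p in Path(new_dir).glob("*.ko")}
    report = {}
    for name in sorted(old.keys() | new.keys()):
        if name not in new:
            report[name] = "only in old"
        elif name not in old:
            report[name] = "only in new"
        elif sha256_of(old[name]) == sha256_of(new[name]):
            report[name] = "identical"
        else:
            report[name] = "differs"
    return report
```

Here `old_dir` would be the iter6 backup and `new_dir` the directory holding the rebuilt modules; only the names reported as `differs` need a closer look.
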
## Hard safeguards for iter6 retry (Phase 4 baseline)
These are NON-NEGOTIABLE before any iter6 patch goes back:
1. **`cp` backup of every .ko AND tarball off-device archive** to a different host (`boltzmann:/home/mfritsche/iter6-postmortem-backups/<release>-pre-attemptN/`) BEFORE `sudo install`. Per `feedback_backup_before_module_replace.md`.
2. **`CONFIG_PROVE_LOCKING=y CONFIG_DEBUG_ATOMIC_SLEEP=y CONFIG_LOCKDEP=y CONFIG_DEBUG_RT_MUTEXES=y CONFIG_DEBUG_SPINLOCK=y`** in the kernel BEFORE applying any iter6 patches. Full kernel rebuild (not just modules) required for these flags. `CONFIG_LOCKDEP` should be selected automatically by `CONFIG_PROVE_LOCKING`; verify it in the final `.config` rather than setting it by hand.
3. **Bisect-apply: one patch at a time + reboot + smoke-test + revert if regression**. Don't apply 0004+0005+0006+0007 all at once again. Test boot after each.
4. **Auto-login OFF** (done). The system now waits at the sddm login prompt instead of going straight into compositor handover, which leaves room for manual rescue: no compositor wedge → no watchdog → no reset.
5. **`pstore.backend=ramoops` kernel cmdline** + reserved ramoops region in DT (if available on RK3588) → kernel oops survives reset. Pre-condition check Phase 6 step 0.
6. **Phase 5 architect review** of the plan + the 0007 rkvdec opt-in code before applying. Spawn a separate Claude session OR get explicit user sign-off.
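
Safeguard 2 can be checked mechanically before a kernel is installed. A small sketch, assuming the text of the build tree's `.config` is available as a string; `missing_options` is a hypothetical helper and the option list is copied from safeguard 2.

```python
REQUIRED_OPTIONS = (
    "CONFIG_PROVE_LOCKING",
    "CONFIG_DEBUG_ATOMIC_SLEEP",
    "CONFIG_LOCKDEP",
    "CONFIG_DEBUG_RT_MUTEXES",
    "CONFIG_DEBUG_SPINLOCK",
)


def missing_options(config_text, required=REQUIRED_OPTIONS):
    """Return the required options that are not set to 'y' in a .config.

    Lines such as '# CONFIG_FOO is not set' are comments and count as
    disabled; options built as modules ('=m') also count as not enabled.
    """
    enabled = set()
    for line in config_text.splitlines():
        line = line.strip()
        if line.startswith("#") or "=" not in line:
            continue
        name, value = line.split("=", 1)
        if value == "y":
            enabled.add(name)
    return [option for option in required if option not in enabled]
```

An empty list means every listed option is `=y`; anything returned should block the install.
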
## Phase 0 close
Substrate locked. Ten locked-in observations + four gating questions + six hard safeguards. Phase 1 starts with Q1 (which patch is the vector) via a cheap binary diff of the iter6 modules against the current rebuilt state, then escalates to an actual bisect-boot if the binary diff is inconclusive.