Step 2.1 (0004 vb2 helper alone) on lockdep+ramoops kernel:
- Headless smoke test PASS: ffmpeg rc=0, 4147200B output, zero lockdep splat, iter4-DIAG validate_sps per-frame, iter5-IRQ DEC_RDY=1
- ~10 min later when GPU compositor + concurrent vb2 traffic active: silent wedge → panic_timeout=10 → boot loop
- PROVE_LOCKING + 9 other lock-debug flags DID NOT catch it before wedge

Architect H1 hypothesis (classical lock-order inversion lockdep models) is incomplete. Refined hypothesis ladder for v3:
- H1' cross-context wait not modeled by lockdep (KCSAN)
- H2 use-after-free / double-signal of dma_fence (KASAN)
- H3 race in fence alloc/init under concurrency (KCSAN)
- H4 panthor consumer-side bug (separate non-vb2 test)

Sub-bug fixed during execution: forcibly-rebuild-all-vb2-consumers required after 0004 (touch core.h + make -j8 modules), otherwise ABI mismatch produces 0-byte ffmpeg output.

DT-based ramoops reservation works (memmap= cmdline is x86-only). Verified 2465B+70818B ramoops capture across panic+reset.

Recovery via WeChat stick chroot: 4 modules restored to pre-base md5s, extlinux default flipped to arch_devices vanilla. Ampere confirmed clean.

v3 plan pending: add KASAN+KCSAN+DEBUG_PREEMPT, full rebuild, intentionally reproduce wedge under compositor. ~75 min build.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Iter6 v2 attempt-2 close — partial-validation outcome, lockdep insufficient
Date: 2026-05-16 night
Branch: master
Plan executed: phase4_plan_iter6_postmortem_v2.md (commit 1656f84)
Patches applied: 0004 (vb2 helper, no driver opt-in)
Bottom line
Step 2.1 (0004 alone on the lockdep+ramoops base kernel) passes the headless smoke test but wedges ampere ~10 min later, once the GPU compositor and concurrent vb2 traffic are active. The wedge silently triggers a panic-reboot loop. PROVE_LOCKING did not produce a lockdep splat, so the architect's primary hypothesis (a classical lock-order inversion of the kind lockdep can model) is incomplete.
Phase 6 execution outcome per step
| Step | Outcome |
|---|---|
| Pre-flight 0.1-0.5 | PASS — modules backed up on-device + tarball on boltzmann, kernel/initramfs backed up, sysctls survey |
| 1.1-1.3 lockdep build | PASS — 64 min build (longer than estimated 45). CONFIG_PROVE_RCU forced =y by Kconfig dep (documented A5 deviation) |
| 1.4-1.7 install + boot | PASS — 7.0.0-rc3-lockdep+ boots clean, Lockdep is enabled confirmed |
| 1.8a ramoops verify gate (HARD) | INITIAL FAIL → RESOLVED via DT route. memmap=$0x10000000 cmdline parameter is unrecognized on arm64 (silent failure). Fixed by adding reserved-memory { ramoops@10000000 } node to genbook DTS + rebuild dtb. Second panic-trigger captured 2465B console-ramoops-0 + 70818B dmesg-ramoops-0. HARD GATE PASS |
| 1.8b iter3+4 baseline on lockdep | PASS — ffmpeg rc=0, 4147200B, zero lockdep splat |
| 2.1 apply 0004 + rebuild modules + install + boot | PARTIAL: first ssh post-reboot showed 0-byte ffmpeg output (ABI mismatch — only videobuf2-common rebuilt, rkvdec/hantro/rga still old). RESOLVED by touch include/media/videobuf2-core.h && make -j8 modules to force dep rebuild + reinstall all 4. After reinstall+reboot: headless smoke test clean at 21:20:29 (rc=0, 4147200B, zero lockdep splat) |
| 2.1 GPU-concurrent test (implicit, post-smoke) | WEDGE — observed by user as boot loop ~10 min later. Recovery via WeChat stick: chroot, restored 4 modules from pre-base backup, flipped extlinux default to arch_devices (vanilla). Ampere booted clean on vanilla kernel |
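The ABI mismatch in step 2.1 came from installing modules that had not been rebuilt after the header change. A minimal sketch of a pre-install check, assuming that "stale" can be approximated by comparing file mtimes against include/media/videobuf2-core.h; the function name and this mtime heuristic are illustrative and not part of the executed plan:

```python
from pathlib import Path


def find_stale_modules(header: Path, modules: list[Path]) -> list[Path]:
    """Return the modules that are missing or older than the header.

    Assumption: a .ko built before the header was last modified has not
    been rebuilt against it. This is a heuristic, not an ABI check.
    """
    header_mtime = header.stat().st_mtime
    stale = []
    for ko in modules:
        if not ko.exists() or ko.stat().st_mtime < header_mtime:
            stale.append(ko)
    return stale
```

Called with the header path and the four .ko paths from the build tree, an empty result is the go signal for install; any returned path means the module needs a rebuild first.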
Forensics: what we know about the wedge
- Headless ffmpeg HEVC decode WORKS with 0004 (no driver opt-in yet). vb2_buffer_done's new fence-signal path is a no-op when no driver has called vb2_buffer_attach_release_fence; that's the only code 0004 adds beyond the helper and a vb2_queue flag.
- pstore was empty post-recovery — the boot-loop didn't leave ramoops artifacts that survived the recovery process. Either the panics were too fast for ramoops to flush, or the ramoops region got reused by boot-loop attempts.
- PROVE_LOCKING + DEBUG_ATOMIC_SLEEP + LOCKDEP + DEBUG_RT_MUTEXES + DEBUG_SPINLOCK + DEBUG_MUTEXES + DEBUG_LOCK_ALLOC + PROVE_RAW_LOCK_NESTING + DEBUG_WW_MUTEX_SLOWPATH + PROVE_RCU all enabled. None caught the wedge BEFORE the deadlock became hard.
- The "ABI mismatch with only-one-module-rebuilt" sub-bug at the start of step 2.1 is a separate, recoverable issue, fixed by touch include/media/videobuf2-core.h && make modules. It is not the cause of the eventual wedge.
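Because pstore was empty after recovery, one option for v3 is to copy ramoops records to persistent storage early on every boot, before a later panic or the recovery process can displace them. A minimal sketch, assuming pstore is mounted at /sys/fs/pstore and that the destination directory survives reboots; the function name and the timestamped directory layout are hypothetical:

```python
import shutil
import time
from pathlib import Path


def archive_pstore(pstore_dir: Path, dest_root: Path) -> list[Path]:
    """Copy every regular file in pstore_dir into a new timestamped
    directory under dest_root and return the copied paths.

    Returns an empty list (and creates nothing) if there are no records.
    """
    records = sorted(p for p in pstore_dir.iterdir() if p.is_file())
    if not records:
        return []
    dest = dest_root / time.strftime("pstore-%Y%m%d-%H%M%S")
    dest.mkdir(parents=True, exist_ok=True)
    copied = []
    for rec in records:
        target = dest / rec.name
        shutil.copy2(rec, target)  # keeps mtime, useful for ordering panics
        copied.append(target)
    return copied
```

This only copies; it does not delete the records from pstore, so it cannot by itself prevent the region from filling up across a boot loop.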
Refined hypothesis ladder for v3 (per architect H1 + this attempt's findings)
| # | Hypothesis | Test |
|---|---|---|
| H1' | Cross-context wait NOT modeled by lockdep — e.g. atomic-context wait_for_completion-style fence wait that lockdep can't analyze | KCSAN (concurrency sanitizer) + DEBUG_PREEMPT + can-sleep-but-in-atomic warnings |
| H2 | Use-after-free or double-signal of dma_fence — e.g. vb2 frees release_fence while panthor still has a reference | KASAN (kernel address sanitizer) |
| H3 | Race in fence allocation/init path (0004's helper) under concurrency | KCSAN data-race detection |
| H4 | Panthor / drm_sched specific bug triggered by the new vb2-published fences — not vb2's fault but a panthor consumer-side bug | Repro with panthor module instrumentation, OR test with a non-panthor GPU consumer (vainfo/ffmpeg headless DRM_PRIME export) |
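To make H1' concrete, the sketch below is a toy model of lock-order checking. It is not lockdep's actual algorithm, and the lock name L and fence name F are hypothetical. It shows how a checker that only records lock-acquisition order sees nothing wrong when one context waits on a fence while holding a lock that the signalling context also needs, and how the cycle appears once waits and signals are added to the dependency graph:

```python
from collections import defaultdict


def dependency_edges(traces, model_waits=False):
    """Build ordering edges from per-context event lists.

    traces: {context: [(op, name), ...]}, op in acquire/release/wait/signal.
    With model_waits=False only lock-after-lock edges are recorded.
    """
    edges = set()
    for events in traces.values():
        held, acquired_so_far = [], []
        for op, name in events:
            if op == "acquire":
                edges.update((h, name) for h in held)
                held.append(name)
                acquired_so_far.append(name)
            elif op == "release":
                held.remove(name)
            elif model_waits and op == "wait":
                # Waiting while holding a lock: the lock now depends on the fence.
                edges.update((h, name) for h in held)
            elif model_waits and op == "signal":
                # The signal could only happen after these locks were taken.
                edges.update((name, l) for l in acquired_so_far)
    return edges


def has_cycle(edges):
    """Depth-first search for a cycle in the directed edge set."""
    graph = defaultdict(set)
    for src, dst in edges:
        graph[src].add(dst)
    state = {}

    def visit(node):
        state[node] = "visiting"
        for nxt in graph.get(node, ()):
            if state.get(nxt) == "visiting":
                return True
            if nxt not in state and visit(nxt):
                return True
        state[node] = "done"
        return False

    return any(node not in state and visit(node) for node in list(graph))


# Context "waiter" holds L while waiting on F; "signaller" needs L to signal F.
traces = {
    "waiter": [("acquire", "L"), ("wait", "F"), ("release", "L")],
    "signaller": [("acquire", "L"), ("signal", "F"), ("release", "L")],
}
```

With these traces, has_cycle(dependency_edges(traces)) is False, while passing model_waits=True yields the cycle L → F → L. Whether the real wedge has this shape is exactly what v3 is meant to find out.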
Recovery state at close
- Ampere: vanilla 7.0.0-rc3-devices+ kernel, all 4 media .ko's restored to pre-base md5s. Confirmed working (uname -r + md5 verify post-recovery).
- Lockdep build artifacts (kernel Image, initramfs, dtb, modules) remain installed in /lib/modules/7.0.0-rc3-lockdep+/ and /boot/firmware/. extlinux default arch_devices selects vanilla.
- iter3+4+diag patches still in ~/src/linux-rockchip working tree (uncommitted).
- 0004 patch source applied to working tree: needs git checkout to revert for v3 KASAN attempt, OR can be kept since v3 will rebuild against newly-enabled KASAN anyway.
- Iter6 v1 backup at ~/iter6-broken-modules-bak-20260516-1720 preserved.
- Iter6 v2 backups at ~/iter6-postmortem-backups-attempt2-pre-base-20260516-1853 (devices+ kernel) + ~/iter6-postmortem-backups-attempt2-pre-2.1-2053 (lockdep base partial, just videobuf2-common) preserved.
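The post-recovery md5 verification can be scripted so it is repeatable after any future restore. A minimal sketch, assuming the backup directory is flat (one file per module, named like the installed file) and that each module name occurs exactly once under the installed tree; both function names are hypothetical:

```python
import hashlib
from pathlib import Path


def md5sum(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex md5 of a file, read in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_against_backup(backup_dir: Path, installed_root: Path) -> dict[str, bool]:
    """Map each backed-up module name to True if exactly one same-named
    file exists under installed_root and its md5 matches the backup."""
    results = {}
    for backup in sorted(backup_dir.glob("*.ko*")):  # also matches .ko.zst etc.
        candidates = list(installed_root.rglob(backup.name))
        results[backup.name] = (
            len(candidates) == 1 and md5sum(candidates[0]) == md5sum(backup)
        )
    return results
```

Run against the pre-base backup directory and the vanilla kernel's module directory, all four entries should come back True.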
v3 next steps (PENDING ARCHITECT REVIEW + USER GO)
- Add CONFIG_KASAN=y CONFIG_KASAN_GENERIC=y CONFIG_KCSAN=y CONFIG_DEBUG_PREEMPT=y CONFIG_PROVE_RAW_LOCK_NESTING=y to lockdep base config
- Full kernel rebuild (~75 min; KASAN+KCSAN slow compilation and also slow runtime ~3-5x)
- Install as separate label arch_devices_lockdep_kasan (keep lockdep label as fallback)
- Boot, re-run smoke test
- Re-apply 0004, rebuild modules (consistent: touch core.h to force), install, reboot
- Reproduce wedge intentionally: open plasma + play HEVC video; expect KASAN or KCSAN report to fire BEFORE wedge
Risk: KASAN+KCSAN combined may slow ampere enough that boot itself fails. Mitigation: keep arch_devices vanilla as extlinux default; user can also fall back to arch_devices_lockdep (the pre-KASAN lockdep build) if KASAN kernel doesn't boot.
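Kconfig can silently change requested options when dependencies are not met (the forced CONFIG_PROVE_RCU=y in step 1.1-1.3 is an example of Kconfig overriding the request). Before spending ~75 min on the v3 build, it may be worth confirming that every requested option survived into the final .config. To the best of my knowledge, mainline lib/Kconfig.kcsan lists !KASAN among KCSAN's dependencies; if that holds for this tree, KASAN and KCSAN cannot share one .config and v3 would need two separate builds. A minimal sketch of such a check, with hypothetical function names:

```python
def parse_config(text: str) -> dict[str, str]:
    """Parse kernel .config text into {option: value}.

    Lines of the form '# CONFIG_X is not set' are recorded as 'n'.
    """
    opts = {}
    suffix = " is not set"
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("CONFIG_") and "=" in line:
            name, value = line.split("=", 1)
            opts[name] = value
        elif line.startswith("# CONFIG_") and line.endswith(suffix):
            opts[line[2:-len(suffix)]] = "n"
    return opts


def dropped_options(config_text: str, wanted: list[str]) -> list[str]:
    """Return the wanted options that are not '=y' in the final config."""
    opts = parse_config(config_text)
    return [name for name in wanted if opts.get(name) != "y"]
```

Feeding it the generated .config and the five options from the first v3 step gives the list of options Kconfig dropped; an empty list means the config matches the plan.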