iter6 v2 attempt-2 close — 0004 headless-clean but compositor-wedges

Step 2.1 (0004 vb2 helper alone) on lockdep+ramoops kernel:
- Headless smoke test PASS: ffmpeg rc=0, 4147200B output, zero
  lockdep splat, iter4-DIAG validate_sps per-frame, iter5-IRQ DEC_RDY=1
- ~10 min later when GPU compositor + concurrent vb2 traffic active:
  silent wedge → panic_timeout=10 → boot loop
- PROVE_LOCKING + 9 other lock-debug flags DID NOT catch it before wedge

Architect's H1 hypothesis (a classical lock-order inversion of the kind
lockdep models) is incomplete. Refined hypothesis ladder for v3:
- H1' cross-context wait not modeled by lockdep (KCSAN)
- H2 use-after-free / double-signal of dma_fence (KASAN)
- H3 race in fence alloc/init under concurrency (KCSAN)
- H4 panthor consumer-side bug (separate non-vb2 test)

Sub-bug fixed during execution: all vb2 consumers must be force-rebuilt
after 0004 (touch videobuf2-core.h + make -j8 modules); otherwise an
ABI mismatch produces 0-byte ffmpeg output.

DT-based ramoops reservation works (memmap= cmdline is not honored on arm64).
Verified 2465B+70818B ramoops capture across panic+reset.

Recovery via WeChat stick chroot — 4 modules restored to pre-base
md5s, extlinux default flipped to arch_devices vanilla. Ampere
confirmed clean.

v3 plan pending: add KASAN+KCSAN+DEBUG_PREEMPT, full rebuild,
intentionally reproduce wedge under compositor. ~75 min build.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Markus Fritsche
2026-05-16 19:26:30 +00:00
parent 1656f84a40
commit 53e247724f
# Iter6 v2 attempt-2 close — partial-validation outcome, lockdep insufficient
Date: 2026-05-16 night
Branch: master
Plan executed: phase4_plan_iter6_postmortem_v2.md (commit 1656f84)
Patches applied: 0004 (vb2 helper, no driver opt-in)
## Bottom line
Step 2.1 (0004 alone on the lockdep+ramoops base kernel) **passes the headless smoke test** but **wedges ampere ~10 min later**, once the GPU compositor and concurrent vb2 traffic are active. The wedge silently triggers a panic-reboot loop. **PROVE_LOCKING did not produce a lockdep splat**, which means the architect's primary hypothesis (a classical lock-order inversion that lockdep can model) is incomplete.
## Phase 6 execution outcome per step
| Step | Outcome |
|------|---------|
| Pre-flight 0.1-0.5 | PASS — modules backed up on-device + tarball on boltzmann, kernel/initramfs backed up, sysctls surveyed |
| 1.1-1.3 lockdep build | PASS — 64 min build (longer than estimated 45). CONFIG_PROVE_RCU forced =y by Kconfig dep (documented A5 deviation) |
| 1.4-1.7 install + boot | PASS — `7.0.0-rc3-lockdep+` boots clean, `Lockdep is enabled` confirmed |
| 1.8a ramoops verify gate (HARD) | INITIAL FAIL → RESOLVED via DT route. memmap=$0x10000000 cmdline parameter is **unrecognized on arm64** (silent failure). Fixed by adding `reserved-memory { ramoops@10000000 }` node to genbook DTS + rebuild dtb. Second panic-trigger captured 2465B console-ramoops-0 + 70818B dmesg-ramoops-0. HARD GATE PASS |
| 1.8b iter3+4 baseline on lockdep | PASS — ffmpeg rc=0, 4147200B, zero lockdep splat |
| 2.1 apply 0004 + rebuild modules + install + boot | PARTIAL: first ssh post-reboot showed 0-byte ffmpeg output (ABI mismatch — only videobuf2-common rebuilt, rkvdec/hantro/rga still old). RESOLVED by `touch include/media/videobuf2-core.h && make -j8 modules` to force dep rebuild + reinstall all 4. After reinstall+reboot: headless smoke test clean at 21:20:29 (rc=0, 4147200B, zero lockdep splat) |
| 2.1 GPU-concurrent test (implicit, post-smoke) | **WEDGE** — observed by user as boot loop ~10 min later. Recovery via WeChat stick: chroot, restored 4 modules from pre-base backup, flipped extlinux default to arch_devices (vanilla). Ampere booted clean on vanilla kernel |
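
The step 2.1 ABI mismatch came from installing a partially rebuilt module set. A minimal shell sketch for catching that before install is shown here. It assumes the touched header can double as a timestamp reference, so any `.ko` that is not newer than it was not rebuilt. The function name `stale_modules` and the paths in the usage comment are hypothetical placeholders, not part of the executed plan.

```shell
# Print every module file that is missing or NOT newer than the stamp file.
# Returns 0 if all listed modules are fresh, 1 otherwise.
stale_modules() {
    local stamp=$1 rc=0 ko
    shift
    for ko in "$@"; do
        if [ ! -e "$ko" ] || [ ! "$ko" -nt "$stamp" ]; then
            printf 'STALE: %s\n' "$ko"
            rc=1
        fi
    done
    return "$rc"
}

# Usage sketch (placeholder paths; adjust to the real module locations):
#   touch include/media/videobuf2-core.h
#   make -j8 modules
#   stale_modules include/media/videobuf2-core.h path/to/a.ko path/to/b.ko \
#       || echo "do not install: partial rebuild"
```

The check only compares modification times, so it catches a skipped rebuild but says nothing about whether the rebuilt modules are otherwise correct.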
## Forensics: what we know about the wedge
- Headless ffmpeg HEVC decode WORKS with 0004 (no driver opt-in yet). vb2_buffer_done's new fence-signal path is a no-op when no driver has called vb2_buffer_attach_release_fence; that's the only code 0004 adds beyond the helper and a vb2_queue flag.
- pstore was empty post-recovery — the boot-loop didn't leave ramoops artifacts that survived the recovery process. Either the panics were too fast for ramoops to flush, or the ramoops region got reused by boot-loop attempts.
- PROVE_LOCKING + DEBUG_ATOMIC_SLEEP + LOCKDEP + DEBUG_RT_MUTEXES + DEBUG_SPINLOCK + DEBUG_MUTEXES + DEBUG_LOCK_ALLOC + PROVE_RAW_LOCK_NESTING + DEBUG_WW_MUTEX_SLOWPATH + PROVE_RCU all enabled. None caught the wedge BEFORE the deadlock became hard.
- The "ABI mismatch with only-one-module-rebuilt" sub-bug at the start of step 2.1 is a separate, recoverable issue — fixed by `touch include/media/videobuf2-core.h && make modules`. Not the cause of the eventual wedge.
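
Since the boot loop left no usable pstore records, one possible mitigation for the next attempt is to copy records out of pstore early on every boot that reaches userspace. The sketch below assumes the standard `/sys/fs/pstore` mount point. The function name `save_pstore` and the archive directory are hypothetical, and this does not help if the kernel panics before userspace runs.

```shell
# Copy every pstore record into a new timestamped directory under $2.
# Prints the directory created; returns 1 if there was nothing to save.
save_pstore() {
    local src=${1:-/sys/fs/pstore}
    local dst_root=${2:-/var/log/pstore-archive}   # hypothetical location
    local dst
    set -- "$src"/*
    # If the glob matched nothing, $1 is the literal pattern and does not exist.
    [ -e "$1" ] || return 1
    dst="$dst_root/$(date +%Y%m%d-%H%M%S)"
    mkdir -p "$dst" && cp -- "$@" "$dst"/ && printf '%s\n' "$dst"
}

# Usage sketch: run early in boot on the debug kernel, e.g.
#   save_pstore /sys/fs/pstore /var/log/pstore-archive
```

The copy leaves the original records in place, so it does not interfere with any later manual inspection of `/sys/fs/pstore`.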
## Refined hypothesis ladder for v3 (per architect H1 + this attempt's findings)
| # | Hypothesis | Test |
|---|------------|------|
| H1' | Cross-context wait NOT modeled by lockdep — e.g. atomic-context wait_for_completion-style fence wait that lockdep can't analyze | KCSAN (concurrency sanitizer) + DEBUG_PREEMPT + can-sleep-but-in-atomic warnings |
| H2 | Use-after-free or double-signal of dma_fence — e.g. vb2 frees release_fence while panthor still has a reference | KASAN (kernel address sanitizer) |
| H3 | Race in fence allocation/init path (0004's helper) under concurrency | KCSAN data-race detection |
| H4 | Panthor / drm_sched specific bug triggered by the new vb2-published fences — not vb2's fault but a panthor consumer-side bug | Repro with panthor module instrumentation, OR test with a non-panthor GPU consumer (vainfo/ffmpeg headless DRM_PRIME export) |
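
To connect the ladder to evidence, a captured log (for example a `dmesg-ramoops-0` record) can be scanned for the report headers each debug facility prints. The sketch below uses the header strings these facilities emit in mainline kernels. The function name `classify_reports` and the mapping of headers to hypotheses are illustrative assumptions, and H4 has no dedicated signature here.

```shell
# Print one line per known debug-report header found in the log file $1.
# Returns 0 if at least one header was found, 1 otherwise.
classify_reports() {
    local log=$1 found=1 pattern label
    while IFS='|' read -r pattern label; do
        if grep -qF -- "$pattern" "$log"; then
            printf '%s\n' "$label"
            found=0
        fi
    done <<'EOF'
BUG: KASAN:|KASAN report (memory error; relevant to H2)
BUG: KCSAN:|KCSAN report (data race; relevant to H1'/H3)
BUG: sleeping function called from invalid context|atomic-sleep report (relevant to H1')
possible circular locking dependency detected|lockdep report (original H1)
EOF
    return "$found"
}

# Usage sketch:
#   classify_reports /sys/fs/pstore/dmesg-ramoops-0 || echo "no known report header"
```

A log with no matches does not rule out any hypothesis; it only means none of these facilities printed a report before the capture ended.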
## Recovery state at close
- Ampere: vanilla `7.0.0-rc3-devices+` kernel, all 4 media .ko's restored to pre-base md5s. Confirmed working (`uname -r` + md5 verify post-recovery).
- Lockdep build artifacts (kernel Image, initramfs, dtb, modules) remain installed in /lib/modules/7.0.0-rc3-lockdep+/ and /boot/firmware/. extlinux `default arch_devices` selects vanilla.
- iter3+4+diag patches still in `~/src/linux-rockchip` working tree (uncommitted).
- 0004 patch source is still applied to the working tree. Either revert it with git checkout before the v3 KASAN attempt, or keep it, since v3 rebuilds everything against the newly enabled KASAN config anyway.
- Iter6 v1 backup at `~/iter6-broken-modules-bak-20260516-1720` preserved.
- Iter6 v2 backups at `~/iter6-postmortem-backups-attempt2-pre-base-20260516-1853` (devices+ kernel) + `~/iter6-postmortem-backups-attempt2-pre-2.1-2053` (lockdep base partial — just videobuf2-common) preserved.
## v3 next steps (PENDING ARCHITECT REVIEW + USER GO)
1. Add `CONFIG_KASAN=y CONFIG_KASAN_GENERIC=y CONFIG_KCSAN=y CONFIG_DEBUG_PREEMPT=y CONFIG_PROVE_RAW_LOCK_NESTING=y` to lockdep base config
2. Full kernel rebuild (~75 min; KASAN+KCSAN slow compilation and also slow runtime ~3-5x)
3. Install as separate label `arch_devices_lockdep_kasan` (keep lockdep label as fallback)
4. Boot, re-run smoke test
5. Re-apply 0004, rebuild modules (consistent — touch core.h to force), install, reboot
6. Reproduce wedge intentionally: open plasma + play HEVC video — expect KASAN or KCSAN report to fire BEFORE wedge
Risk: KASAN+KCSAN combined may slow ampere enough that boot itself fails. Mitigation: keep arch_devices vanilla as extlinux default; user can also fall back to arch_devices_lockdep (the pre-KASAN lockdep build) if KASAN kernel doesn't boot.
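
Kconfig dependencies can silently change what step 1 asks for, as already seen with CONFIG_PROVE_RCU being forced on in the lockdep build. In the mainline trees I am aware of, KCSAN additionally depends on KASAN being disabled, so a single config requesting both may lose one of them. Whether that holds for this tree is an assumption to verify. The sketch below checks the resolved `.config` after `make olddefconfig`; the function name `missing_config_opts` is hypothetical, while `scripts/config` is the helper shipped in the kernel source tree.

```shell
# Print each requested option that is not set to "=y" in the .config file $1.
# Returns 0 if every option is enabled, 1 otherwise.
missing_config_opts() {
    local cfg=$1 rc=0 opt
    shift
    for opt in "$@"; do
        if ! grep -qx "CONFIG_${opt}=y" "$cfg"; then
            printf 'MISSING: CONFIG_%s\n' "$opt"
            rc=1
        fi
    done
    return "$rc"
}

# Usage sketch, run from the kernel source tree:
#   scripts/config --file .config -e KASAN -e KASAN_GENERIC -e KCSAN -e DEBUG_PREEMPT
#   make olddefconfig
#   missing_config_opts .config KASAN KASAN_GENERIC KCSAN DEBUG_PREEMPT
```

If an option is reported missing, running the check before the ~75 min rebuild makes it cheap to decide between separate KASAN and KCSAN builds or a different option set.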