ampere-kernel-decoders/iter6_v2_attempt2_close.md
Markus Fritsche 53e247724f iter6 v2 attempt-2 close — 0004 headless-clean but compositor-wedges
Step 2.1 (0004 vb2 helper alone) on lockdep+ramoops kernel:
- Headless smoke test PASS: ffmpeg rc=0, 4147200B output, zero
  lockdep splat, iter4-DIAG validate_sps per-frame, iter5-IRQ DEC_RDY=1
- ~10 min later when GPU compositor + concurrent vb2 traffic active:
  silent wedge → panic_timeout=10 → boot loop
- PROVE_LOCKING + 9 other lock-debug flags DID NOT catch it before wedge

Architect H1 hypothesis (classical lock-order inversion lockdep models)
is incomplete. Refined hypothesis ladder for v3:
- H1' cross-context wait not modeled by lockdep (KCSAN)
- H2 use-after-free / double-signal of dma_fence (KASAN)
- H3 race in fence alloc/init under concurrency (KCSAN)
- H4 panthor consumer-side bug (separate non-vb2 test)

Sub-bug fixed during execution: forcibly-rebuild-all-vb2-consumers
required after 0004 (touch core.h + make -j8 modules), otherwise
ABI mismatch produces 0-byte ffmpeg output.

DT-based ramoops reservation works (memmap= cmdline is x86-only).
Verified 2465B+70818B ramoops capture across panic+reset.

Recovery via WeChat stick chroot — 4 modules restored to pre-base
md5s, extlinux default flipped to arch_devices vanilla. Ampere
confirmed clean.

v3 plan pending: add KASAN+KCSAN+DEBUG_PREEMPT, full rebuild,
intentionally reproduce wedge under compositor. ~75 min build.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-16 19:26:30 +00:00


Iter6 v2 attempt-2 close — partial-validation outcome, lockdep insufficient

Date: 2026-05-16 night
Branch: master
Plan executed: phase4_plan_iter6_postmortem_v2.md (commit 1656f84)
Patches applied: 0004 (vb2 helper, no driver opt-in)

Bottom line

Step 2.1 (0004 alone on the lockdep+ramoops base kernel) passes the headless smoke test but wedges ampere about 10 minutes later, once the GPU compositor and concurrent vb2 traffic are active. The wedge silently triggers a panic-reboot loop. PROVE_LOCKING produced no lockdep splat, which means the architect's primary hypothesis (a classical lock-order inversion that lockdep can model) is incomplete.

Phase 6 execution outcome per step

| Step | Outcome |
|---|---|
| Pre-flight 0.1-0.5 | PASS: modules backed up on-device + tarball on boltzmann, kernel/initramfs backed up, sysctls surveyed |
| 1.1-1.3 lockdep build | PASS: 64 min build (longer than the estimated 45). CONFIG_PROVE_RCU forced =y by Kconfig dep (documented A5 deviation) |
| 1.4-1.7 install + boot | PASS: 7.0.0-rc3-lockdep+ boots clean, Lockdep is enabled confirmed |
| 1.8a ramoops verify gate (HARD) | INITIAL FAIL, then RESOLVED via DT route. The memmap=$0x10000000 cmdline parameter is unrecognized on arm64 (silent failure). Fixed by adding a reserved-memory { ramoops@10000000 } node to the genbook DTS and rebuilding the dtb. A second panic trigger captured 2465B console-ramoops-0 + 70818B dmesg-ramoops-0. HARD GATE PASS |
| 1.8b iter3+4 baseline on lockdep | PASS: ffmpeg rc=0, 4147200B, zero lockdep splat |
| 2.1 apply 0004 + rebuild modules + install + boot | PARTIAL: first ssh post-reboot showed 0-byte ffmpeg output (ABI mismatch: only videobuf2-common was rebuilt; rkvdec/hantro/rga were still old). RESOLVED by touch include/media/videobuf2-core.h && make -j8 modules to force the dependency rebuild, then reinstalling all 4. After reinstall+reboot: headless smoke test clean at 21:20:29 (rc=0, 4147200B, zero lockdep splat) |
| 2.1 GPU-concurrent test (implicit, post-smoke) | WEDGE: observed by user as a boot loop ~10 min later. Recovery via WeChat stick: chroot, restored 4 modules from pre-base backup, flipped extlinux default to arch_devices (vanilla). Ampere booted clean on the vanilla kernel |
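
The 0-byte ffmpeg output in step 2.1 came from installing a freshly built videobuf2-common next to stale rkvdec/hantro/rga objects. A minimal shell sketch of a pre-install check follows, assuming build-tree mtimes are trustworthy: after touching the header and rebuilding, every dependent .ko should be newer than the header. The function name stale_modules and the placeholder paths are hypothetical and not part of the executed plan.

```shell
# stale_modules HEADER KO...
# Prints every KO path that is missing or not newer than HEADER.
# Returns 0 when all listed modules are fresh, 1 otherwise.
stale_modules() {
    sm_header=$1
    shift
    sm_rc=0
    for sm_ko in "$@"; do
        # -nt is true only if the module exists and has a newer mtime
        if ! [ "$sm_ko" -nt "$sm_header" ]; then
            printf '%s\n' "$sm_ko"
            sm_rc=1
        fi
    done
    return $sm_rc
}

# Example (placeholder paths, adjust to the actual build tree):
#   touch include/media/videobuf2-core.h && make -j8 modules
#   stale_modules include/media/videobuf2-core.h path/to/consumer1.ko path/to/consumer2.ko \
#       || echo "stale modules listed above, do not install"
```

An mtime comparison is a coarse proxy for ABI consistency. It would have flagged the three consumers that were not rebuilt here, but it cannot prove that two modules agree on a struct layout.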

Forensics: what we know about the wedge

  • Headless ffmpeg HEVC decode WORKS with 0004 (no driver opt-in yet). vb2_buffer_done's new fence-signal path is a no-op when no driver has called vb2_buffer_attach_release_fence; that's the only code 0004 adds beyond the helper and a vb2_queue flag.
  • pstore was empty post-recovery — the boot-loop didn't leave ramoops artifacts that survived the recovery process. Either the panics were too fast for ramoops to flush, or the ramoops region got reused by boot-loop attempts.
  • PROVE_LOCKING + DEBUG_ATOMIC_SLEEP + LOCKDEP + DEBUG_RT_MUTEXES + DEBUG_SPINLOCK + DEBUG_MUTEXES + DEBUG_LOCK_ALLOC + PROVE_RAW_LOCK_NESTING + DEBUG_WW_MUTEX_SLOWPATH + PROVE_RCU all enabled. None caught the wedge BEFORE the deadlock became hard.
  • The "ABI mismatch with only-one-module-rebuilt" sub-bug at the start of step 2.1 is a separate, recoverable issue — fixed by touch include/media/videobuf2-core.h && make modules. Not the cause of the eventual wedge.
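
Because pstore came up empty after recovery, one option for a future attempt is to copy the ramoops records to persistent storage as early as possible on every boot, before a later panic or boot-loop iteration can overwrite them. The sketch below assumes pstore is mounted at its usual location, /sys/fs/pstore; the function name save_pstore and the destination directory are hypothetical.

```shell
# save_pstore SRC_DIR DEST_ROOT
# Copies every regular file in SRC_DIR into a timestamped
# subdirectory of DEST_ROOT and prints that subdirectory's path.
# Prints nothing and returns 1 if SRC_DIR holds no regular files.
save_pstore() {
    sp_src=$1
    sp_root=$2
    sp_found=0
    for sp_f in "$sp_src"/*; do
        if [ -f "$sp_f" ]; then
            sp_found=1
            break
        fi
    done
    [ "$sp_found" -eq 1 ] || return 1
    sp_dest="$sp_root/pstore-$(date +%Y%m%d-%H%M%S)"
    mkdir -p "$sp_dest" || return 1
    for sp_f in "$sp_src"/*; do
        if [ -f "$sp_f" ]; then
            cp -p "$sp_f" "$sp_dest/" || return 1
        fi
    done
    # Flush to disk in case the next wedge arrives within minutes.
    sync
    printf '%s\n' "$sp_dest"
}

# Example, run early in boot (destination path is a placeholder):
#   save_pstore /sys/fs/pstore /var/tmp/pstore-archive
```

This only helps if userspace gets far enough to run it. If the boot loop panics before that point, the records would still have to be read from a recovery environment that maps the same ramoops region.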

Refined hypothesis ladder for v3 (per architect H1 + this attempt's findings)

| # | Hypothesis | Test |
|---|---|---|
| H1' | Cross-context wait NOT modeled by lockdep, e.g. an atomic-context wait_for_completion-style fence wait that lockdep can't analyze | KCSAN (concurrency sanitizer) + DEBUG_PREEMPT + can-sleep-but-in-atomic warnings |
| H2 | Use-after-free or double-signal of dma_fence, e.g. vb2 frees release_fence while panthor still has a reference | KASAN (kernel address sanitizer) |
| H3 | Race in fence allocation/init path (0004's helper) under concurrency | KCSAN data-race detection |
| H4 | Panthor / drm_sched specific bug triggered by the new vb2-published fences: not vb2's fault but a panthor consumer-side bug | Repro with panthor module instrumentation, OR test with a non-panthor GPU consumer (vainfo/ffmpeg headless DRM_PRIME export) |

Recovery state at close

  • Ampere: vanilla 7.0.0-rc3-devices+ kernel, all 4 media .ko's restored to pre-base md5s. Confirmed working (uname -r + md5 verify post-recovery).
  • Lockdep build artifacts (kernel Image, initramfs, dtb, modules) remain installed in /lib/modules/7.0.0-rc3-lockdep+/ and /boot/firmware/. extlinux default arch_devices selects vanilla.
  • iter3+4+diag patches still in ~/src/linux-rockchip working tree (uncommitted).
  • The 0004 patch is still applied to the working tree. It either needs a git checkout to revert it before the v3 KASAN attempt, or it can stay, since v3 rebuilds against the newly enabled KASAN config anyway.
  • Iter6 v1 backup at ~/iter6-broken-modules-bak-20260516-1720 preserved.
  • Iter6 v2 backups at ~/iter6-postmortem-backups-attempt2-pre-base-20260516-1853 (devices+ kernel) + ~/iter6-postmortem-backups-attempt2-pre-2.1-2053 (lockdep base partial — just videobuf2-common) preserved.
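
The "restored to pre-base md5s" check can be made repeatable with a manifest written at backup time and verified after each restore. This is a sketch of that idea, not the exact commands used during recovery; write_manifest and verify_manifest are hypothetical names, and the manifest path must be absolute because both functions change directory.

```shell
# write_manifest DIR MANIFEST
# Records md5sums of all module files (*.ko, including compressed
# variants such as *.ko.zst) under DIR, with paths relative to DIR.
write_manifest() {
    ( cd "$1" && find . -type f -name '*.ko*' -exec md5sum {} + ) > "$2"
}

# verify_manifest DIR MANIFEST
# Returns 0 only if every file listed in MANIFEST exists under DIR
# and still has the recorded checksum.
verify_manifest() {
    ( cd "$1" && md5sum -c "$2" >/dev/null 2>&1 )
}

# Example (paths are placeholders):
#   write_manifest /path/to/backup/modules /path/to/backup/modules.md5
#   verify_manifest /path/to/installed/modules /path/to/backup/modules.md5 \
#       && echo "modules match pre-base backup"
```

Relative paths in the manifest let the same file be checked against both the backup directory and the installed module tree, provided the two share a layout.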

v3 next steps (PENDING ARCHITECT REVIEW + USER GO)

  1. Add CONFIG_KASAN=y, CONFIG_KASAN_GENERIC=y, CONFIG_KCSAN=y, CONFIG_DEBUG_PREEMPT=y and CONFIG_PROVE_RAW_LOCK_NESTING=y (the last was already on in attempt 2) to the lockdep base config
  2. Full kernel rebuild (~75 min; KASAN+KCSAN slow compilation and also slow runtime by roughly 3-5x)
  3. Install as separate label arch_devices_lockdep_kasan (keep lockdep label as fallback)
  4. Boot, re-run smoke test
  5. Re-apply 0004, rebuild modules (consistent — touch core.h to force), install, reboot
  6. Reproduce wedge intentionally: open plasma + play HEVC video — expect KASAN or KCSAN report to fire BEFORE wedge
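
Attempt 2 already showed that Kconfig can silently change what was requested (CONFIG_PROVE_RCU was forced to =y). A small check run against the final .config before the 75-minute build would catch the opposite case, an option that was requested but dropped. The sketch below is hypothetical (missing_options is not an existing tool) and only greps the resolved .config. One case worth testing with it: to our knowledge, mainline's KCSAN Kconfig entry depends on !KASAN, so a single build with both enabled may come out without KCSAN and v3 may need two separate kernels.

```shell
# missing_options CONFIG_FILE OPTION...
# Prints each OPTION (given without the CONFIG_ prefix) that is not
# set to =y in CONFIG_FILE. Returns 0 if all are enabled, 1 otherwise.
missing_options() {
    mo_cfg=$1
    shift
    mo_rc=0
    for mo_opt in "$@"; do
        # -x matches the whole line, so KASAN does not match KASAN_GENERIC
        if ! grep -qx "CONFIG_${mo_opt}=y" "$mo_cfg"; then
            printf '%s\n' "$mo_opt"
            mo_rc=1
        fi
    done
    return $mo_rc
}

# Example, after the config has been resolved (e.g. by make olddefconfig):
#   missing_options .config KASAN KASAN_GENERIC KCSAN DEBUG_PREEMPT \
#       PROVE_RAW_LOCK_NESTING || echo "options above were dropped by Kconfig"
```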

Risk: KASAN+KCSAN combined may slow ampere enough that boot itself fails. Mitigation: keep arch_devices vanilla as extlinux default; user can also fall back to arch_devices_lockdep (the pre-KASAN lockdep build) if KASAN kernel doesn't boot.