From 53e247724ff60247092eff563dd88f17fec750f5 Mon Sep 17 00:00:00 2001
From: Markus Fritsche
Date: Sat, 16 May 2026 19:26:30 +0000
Subject: [PATCH] =?UTF-8?q?iter6=20v2=20attempt-2=20close=20=E2=80=94=2000?=
 =?UTF-8?q?04=20headless-clean=20but=20compositor-wedges?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Step 2.1 (0004 vb2 helper alone) on lockdep+ramoops kernel:
- Headless smoke test PASS: ffmpeg rc=0, 4147200B output, zero lockdep
  splat, iter4-DIAG validate_sps per-frame, iter5-IRQ DEC_RDY=1
- ~10 min later when GPU compositor + concurrent vb2 traffic active:
  silent wedge → panic_timeout=10 → boot loop
- PROVE_LOCKING + 9 other lock-debug flags DID NOT catch it before wedge

Architect H1 hypothesis (classical lock-order inversion lockdep models)
is incomplete. Refined hypothesis ladder for v3:
- H1' cross-context wait not modeled by lockdep (KCSAN)
- H2 use-after-free / double-signal of dma_fence (KASAN)
- H3 race in fence alloc/init under concurrency (KCSAN)
- H4 panthor consumer-side bug (separate non-vb2 test)

Sub-bug fixed during execution: forcibly-rebuild-all-vb2-consumers
required after 0004 (touch core.h + make -j8 modules), otherwise ABI
mismatch produces 0-byte ffmpeg output.

DT-based ramoops reservation works (memmap= cmdline is x86-only).
Verified 2465B+70818B ramoops capture across panic+reset.

Recovery via WeChat stick chroot: 4 modules restored to pre-base md5s,
extlinux default flipped to arch_devices vanilla. Ampere confirmed clean.

v3 plan pending: add KASAN+KCSAN+DEBUG_PREEMPT, full rebuild,
intentionally reproduce wedge under compositor. ~75 min build.

Co-Authored-By: Claude Opus 4.7
---
 iter6_v2_attempt2_close.md | 58 ++++++++++++++++++++++++++++++++++++++
 1 file changed, 58 insertions(+)
 create mode 100644 iter6_v2_attempt2_close.md

diff --git a/iter6_v2_attempt2_close.md b/iter6_v2_attempt2_close.md
new file mode 100644
index 0000000..80751ba
--- /dev/null
+++ b/iter6_v2_attempt2_close.md
@@ -0,0 +1,58 @@
+# Iter6 v2 attempt-2 close: partial-validation outcome, lockdep insufficient
+
+Date: 2026-05-16 night
+Branch: master
+Plan executed: phase4_plan_iter6_postmortem_v2.md (commit 1656f84)
+Patches applied: 0004 (vb2 helper, no driver opt-in)
+
+## Bottom line
+
+Step 2.1 (0004 alone on the lockdep+ramoops base kernel) **passes the headless smoke test** but **wedges ampere ~10 min later**, once the GPU compositor and concurrent vb2 traffic are active. The wedge silently triggers a panic-reboot loop. **PROVE_LOCKING did not produce a lockdep splat**, which means the architect's primary hypothesis (a classical lock-order inversion that lockdep can model) is incomplete.
+
+## Phase 6 execution outcome per step
+
+| Step | Outcome |
+|------|---------|
+| Pre-flight 0.1-0.5 | PASS: modules backed up on-device + tarball on boltzmann, kernel/initramfs backed up, sysctls surveyed |
+| 1.1-1.3 lockdep build | PASS: 64 min build (longer than the estimated 45). CONFIG_PROVE_RCU forced =y by Kconfig dep (documented A5 deviation) |
+| 1.4-1.7 install + boot | PASS: `7.0.0-rc3-lockdep+` boots clean, `Lockdep is enabled` confirmed |
+| 1.8a ramoops verify gate (HARD) | INITIAL FAIL → RESOLVED via DT route. The memmap=$0x10000000 cmdline parameter is **unrecognized on arm64** (silent failure). Fixed by adding a `reserved-memory { ramoops@10000000 }` node to the genbook DTS and rebuilding the dtb. Second panic-trigger captured 2465B console-ramoops-0 + 70818B dmesg-ramoops-0. HARD GATE PASS |
+| 1.8b iter3+4 baseline on lockdep | PASS: ffmpeg rc=0, 4147200B, zero lockdep splat |
+| 2.1 apply 0004 + rebuild modules + install + boot | PARTIAL: first ssh post-reboot showed 0-byte ffmpeg output (ABI mismatch: only videobuf2-common rebuilt, rkvdec/hantro/rga still old). RESOLVED by `touch include/media/videobuf2-core.h && make -j8 modules` to force a dep rebuild, then reinstalling all 4. After reinstall+reboot: headless smoke test clean at 21:20:29 (rc=0, 4147200B, zero lockdep splat) |
+| 2.1 GPU-concurrent test (implicit, post-smoke) | **WEDGE**: observed by user as a boot loop ~10 min later. Recovery via WeChat stick: chroot, restored 4 modules from pre-base backup, flipped extlinux default to arch_devices (vanilla). Ampere booted clean on vanilla kernel |
+
+## Forensics: what we know about the wedge
+
+- Headless ffmpeg HEVC decode WORKS with 0004 (no driver opt-in yet). vb2_buffer_done's new fence-signal path is a no-op when no driver has called vb2_buffer_attach_release_fence; that's the only code 0004 adds beyond the helper and a vb2_queue flag.
+- pstore was empty post-recovery: the boot loop didn't leave ramoops artifacts that survived the recovery process. Either the panics were too fast for ramoops to flush, or the ramoops region got reused by boot-loop attempts.
+- PROVE_LOCKING + DEBUG_ATOMIC_SLEEP + LOCKDEP + DEBUG_RT_MUTEXES + DEBUG_SPINLOCK + DEBUG_MUTEXES + DEBUG_LOCK_ALLOC + PROVE_RAW_LOCK_NESTING + DEBUG_WW_MUTEX_SLOWPATH + PROVE_RCU were all enabled. None caught the wedge BEFORE the deadlock became hard.
+- The "ABI mismatch with only-one-module-rebuilt" sub-bug at the start of step 2.1 is a separate, recoverable issue, fixed by `touch include/media/videobuf2-core.h && make modules`. It is not the cause of the eventual wedge.
+
+## Refined hypothesis ladder for v3 (per architect H1 + this attempt's findings)
+
+| # | Hypothesis | Test |
+|---|------------|------|
+| H1' | Cross-context wait NOT modeled by lockdep, e.g. an atomic-context wait_for_completion-style fence wait that lockdep can't analyze | KCSAN (concurrency sanitizer) + DEBUG_PREEMPT + can-sleep-but-in-atomic warnings |
+| H2 | Use-after-free or double-signal of dma_fence, e.g. vb2 frees release_fence while panthor still has a reference | KASAN (kernel address sanitizer) |
+| H3 | Race in fence allocation/init path (0004's helper) under concurrency | KCSAN data-race detection |
+| H4 | Panthor / drm_sched specific bug triggered by the new vb2-published fences: not vb2's fault but a panthor consumer-side bug | Repro with panthor module instrumentation, OR test with a non-panthor GPU consumer (vainfo/ffmpeg headless DRM_PRIME export) |
+
+## Recovery state at close
+
+- Ampere: vanilla `7.0.0-rc3-devices+` kernel, all 4 media .ko's restored to pre-base md5s. Confirmed working (`uname -r` + md5 verify post-recovery).
+- Lockdep build artifacts (kernel Image, initramfs, dtb, modules) remain installed in /lib/modules/7.0.0-rc3-lockdep+/ and /boot/firmware/. extlinux `default arch_devices` selects vanilla.
+- iter3+4+diag patches still in `~/src/linux-rockchip` working tree (uncommitted).
+- 0004 patch source is still applied to the working tree; it needs a git checkout to revert for the v3 KASAN attempt, OR it can be kept, since v3 will rebuild against newly-enabled KASAN anyway.
+- Iter6 v1 backup at `~/iter6-broken-modules-bak-20260516-1720` preserved.
+- Iter6 v2 backups at `~/iter6-postmortem-backups-attempt2-pre-base-20260516-1853` (devices+ kernel) + `~/iter6-postmortem-backups-attempt2-pre-2.1-2053` (lockdep base partial, just videobuf2-common) preserved.
+
+## v3 next steps (PENDING ARCHITECT REVIEW + USER GO)
+
+1. Add `CONFIG_KASAN=y CONFIG_KASAN_GENERIC=y CONFIG_KCSAN=y CONFIG_DEBUG_PREEMPT=y CONFIG_PROVE_RAW_LOCK_NESTING=y` to the lockdep base config
+2. Full kernel rebuild (~75 min; KASAN+KCSAN slow compilation and also slow runtime ~3-5x)
+3. Install as separate label `arch_devices_lockdep_kasan` (keep lockdep label as fallback)
+4. Boot, re-run smoke test
+5. Re-apply 0004, rebuild modules (consistently: touch core.h to force), install, reboot
+6. Reproduce wedge intentionally: open plasma + play HEVC video; expect a KASAN or KCSAN report to fire BEFORE the wedge
+
+Risk: KASAN+KCSAN combined may slow ampere enough that boot itself fails. Mitigation: keep arch_devices vanilla as extlinux default; user can also fall back to arch_devices_lockdep (the pre-KASAN lockdep build) if the KASAN kernel doesn't boot.
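
Two findings in this attempt were silent configuration surprises: CONFIG_PROVE_RCU was forced on by a Kconfig dependency, and the memmap= parameter was ignored on arm64. The v3 config step can go wrong the same way. As far as I know, mainline's KCSAN Kconfig entry depends on `!KASAN`, so a config that requests both may come back from `make olddefconfig` with KCSAN quietly dropped. The sketch below checks for that before the ~75 min build starts. The function name `missing_config_symbols` is hypothetical and not part of the kernel tree. It assumes a standard `.config` where enabled bool symbols appear as `CONFIG_<NAME>=y`.

```sh
#!/bin/sh
# Hypothetical helper: print every requested Kconfig symbol that is NOT
# set to =y in the given .config, and return non-zero if any is missing.
# Usage: missing_config_symbols <config-file> SYMBOL...
missing_config_symbols() {
    config_file=$1
    shift
    any_missing=0
    for sym in "$@"; do
        # -x requires a whole-line match, so "# CONFIG_X is not set"
        # and "CONFIG_X_FOO=y" do not count as CONFIG_X=y.
        if ! grep -qx "CONFIG_${sym}=y" "$config_file"; then
            echo "$sym"
            any_missing=1
        fi
    done
    return "$any_missing"
}
```

A possible use, run from the kernel source tree after `make olddefconfig`: `missing_config_symbols .config KASAN KASAN_GENERIC KCSAN DEBUG_PREEMPT PROVE_RAW_LOCK_NESTING`. Any symbol it prints was dropped or never enabled. If KCSAN shows up there, H1'/H3 and H2 would likely need two separate sanitizer builds rather than one combined kernel.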