iter6 v3 plan: KASAN+KCSAN delta on lockdep substrate
Pending architect delta-review before any execution.

Adds: KASAN_GENERIC + KCSAN + DEBUG_PREEMPT + DEBUG_OBJECTS + SLUB_DEBUG_ON + PAGE_POISONING. CONFIG_LOCALVERSION=-lockdep-kasan keeps the build separate from the existing lockdep modules.

Smoke test extended: after headless ffmpeg passes, manually log in to plasma + run mpv hwdec=vaapi-copy to reproduce v2's compositor wedge condition. Monitor pstore + journal for KASAN/KCSAN reports.

Risk additions: KASAN/KCSAN boot slowness, KASAN shadow region, KASAN false positives, GPU repro requires physical login.

Exit conditions: success (sanitizer report fires) / partial (still silent, switch to serial) / failure (KASAN kernel won't boot).

~75-90 min build + ~45 min test cycle for 0004 verdict.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@@ -0,0 +1,93 @@
# Phase 4: iter6 v3 plan (KASAN+KCSAN delta on v2 substrate)

Plan ID: iter6-v3-kasan-kcsan
Author: Claude Opus 4.7
Status: Pending Phase 5 architect delta-review
Cross-ref: phase4_plan_iter6_postmortem_v2.md (commit 1656f84, executed), iter6_v2_attempt2_close.md (commit 53e2477)

## Why v3

v2 attempt-2 found: 0004 alone passes headless HEVC clean, but wedges silently ~10 min later under GPU compositor + concurrent vb2 traffic. PROVE_LOCKING plus 9 other lock-debug flags did NOT fire before the wedge, so this is not a classical lock-order inversion of the kind lockdep models.

Refined hypotheses (4 candidates) need different tooling:

- H1' cross-context wait not lockdep-visible → KCSAN
- H2 use-after-free / double-signal of dma_fence → KASAN
- H3 race in fence alloc/init → KCSAN
- H4 panthor consumer-side bug → headless-only repro path

## Delta from v2 plan

All v2 pre-flight + lockdep config + ramoops setup is preserved (already executed and validated). v3 adds:

### Kernel config additions (step 1.2-DELTA)

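```
# Sanitizer and debug additions on top of the v2 lockdep config
./scripts/config --enable KASAN --enable KASAN_GENERIC \
                 --enable KASAN_OUTLINE \
                 --enable KCSAN --set-val KCSAN_REPORT_ONCE_IN_MS 2000 \
                 --enable DEBUG_PREEMPT \
                 --enable DEBUG_OBJECTS --enable DEBUG_OBJECTS_RCU_HEAD \
                 --enable SLUB_DEBUG --enable SLUB_DEBUG_ON \
                 --enable PAGE_POISONING
# Keep this build's modules separate from the existing lockdep modules
./scripts/config --set-str LOCALVERSION "-lockdep-kasan"
```

Notes:

- KASAN_GENERIC (not SW_TAGS): generic mode works on all aarch64
- KASAN_OUTLINE (not INLINE): slower at runtime but produces a smaller binary
- KCSAN_REPORT_ONCE_IN_MS=2000: rate-limits reports so the console doesn't drown. It is an integer option, so it is set with `--set-val`, not `--enable`
- KCSAN + KASAN together: upstream `lib/Kconfig.kcsan` has carried `depends on !KASAN`. If that still holds in this tree, `make olddefconfig` will silently drop KCSAN, so confirm that both symbols survive in `.config` before starting the build. If they cannot coexist, H1'/H3 need a separate KCSAN-only build
- SLUB_DEBUG_ON: enables all SLUB debug at boot (poisoning, redzone, owner-tracking)
- DEBUG_OBJECTS: catches double-init / use-after-free of registered kernel objects (dma_fences are registered)
- PAGE_POISONING: PAGE_POISONING_NO_SANITY is not set. The symbol was removed upstream (around v5.11), and it disabled the poison check this run relies on. Page poisoning is off by default and needs `page_poison=on` on the kernel command line
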
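Because Kconfig silently drops options whose dependencies are unmet, it may be worth checking the resolved `.config` before committing to the long build. The following is a minimal sketch, not part of the executed plan: `missing_config_symbols` is a hypothetical helper name, and it assumes every listed symbol is expected as built-in (`=y`).

```shell
# Print each CONFIG_ symbol that is NOT set to y in a kernel .config.
# Usage: missing_config_symbols <config-file> <SYMBOL>...
# (hypothetical helper; symbol names are passed without the CONFIG_ prefix)
missing_config_symbols() {
    cfg=$1
    shift
    for sym in "$@"; do
        # A built-in option appears as exactly "CONFIG_<SYM>=y".
        grep -qx "CONFIG_${sym}=y" "$cfg" || printf '%s\n' "$sym"
    done
}
```

A possible use after `make olddefconfig`: `missing_config_symbols .config KASAN KASAN_GENERIC KCSAN DEBUG_OBJECTS SLUB_DEBUG_ON PAGE_POISONING`. Empty output means every listed symbol survived dependency resolution.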
### Build (step 1.3-DELTA)

`time make -j8 Image modules dtbs`: estimated 75-90 min (KASAN+KCSAN instrumentation slows compilation)

### Install (step 1.4-1.7-DELTA)

- Modules install to `/lib/modules/7.0.0-rc3-lockdep-kasan+/` (separate via LOCALVERSION)
- Image installs to `/boot/firmware/Image-7.0.0-rc3-lockdep-kasan+`
- Generate an initramfs for the new release
- Copy lockdep DTB (with ramoops node) to `/boot/firmware/rk3588-coolpi-cm5-genbook.dtb-7.0.0-rc3-lockdep-kasan+`
- ADD a third extlinux label `arch_devices_lockdep_kasan`. Keep `arch_devices` (vanilla) as default. Keep `arch_devices_lockdep` (without KASAN) as a fallback if the KASAN kernel doesn't boot
- User picks `arch_devices_lockdep_kasan` at the U-Boot menu (needs HDMI for the first boot; alternatively, temporarily make it the default for the test)

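Before rebooting into the new label, the install steps above can be sanity-checked. This is a rough sketch under stated assumptions: `missing_artifacts` is a hypothetical helper, it only covers the three paths named in the list (the initramfs path is not specified in this plan, so it is left out), and it checks existence only, not content.

```shell
# Print expected install artefacts that are missing under a root directory.
# Usage: missing_artifacts <root> <kernel-release>
# (hypothetical helper; paths are the ones listed in the install steps)
missing_artifacts() {
    root=$1
    rel=$2
    for path in \
        "lib/modules/$rel" \
        "boot/firmware/Image-$rel" \
        "boot/firmware/rk3588-coolpi-cm5-genbook.dtb-$rel"
    do
        # Report the path relative to the root if it does not exist.
        [ -e "$root/$path" ] || printf '%s\n' "/$path"
    done
}
```

On the target this could be run as `missing_artifacts / 7.0.0-rc3-lockdep-kasan+`; any line of output names a step that still needs doing.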
### Bisect-apply (step 2.x-DELTA)

Same as v2: 0004 → 0005 → 0006 → 0007-v2. Between each: reboot + smoke + harvest journal/pstore for KASAN/KCSAN reports.

**Critical addition for step 2.1**: after the headless smoke test passes, intentionally trigger GPU compositor activity to reproduce the v2 wedge condition:

1. Log into plasma wayland at the SDDM prompt (manual, physical)
2. Start a video that drives both GPU compositor + V4L2 decode concurrently. E.g.: `mpv --hwdec=vaapi-copy ~/measurements/encoded/bbb_60s_720p.hevc.mp4` in the plasma session
3. Watch for at least ~2 minutes, up to the ~10 min time-box in R10 (v2 wedge timing was ~10 min, but KASAN/KCSAN may flag the fault earlier)
4. From ssh: follow `journalctl -k -f` for KASAN / KCSAN / BUG: KASAN / data-race / use-after-free reports. `/sys/fs/pstore/` is a directory whose records only appear after the next boot, so it cannot be tailed live; inspect it after any reboot or wedge recovery
5. If a report fires, capture it, then reboot back to the lockdep-non-kasan kernel for analysis

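The monitoring in step 4 can be narrowed to report headers only. The sketch below assumes the usual header lines that open sanitizer reports (`BUG: KASAN: ...` and `BUG: KCSAN: ...`); `filter_sanitizer_reports` is a hypothetical helper name, and it deliberately drops the rest of each report, so the full log still has to be captured separately.

```shell
# Keep only kernel-log lines that open a KASAN or KCSAN report.
# Reads log text on stdin, writes matching lines to stdout.
# (hypothetical helper; matches report header lines only)
filter_sanitizer_reports() {
    grep -E 'BUG: (KASAN|KCSAN):'
}
```

Possible uses: `journalctl -k -f | filter_sanitizer_reports` during the compositor test, and `cat /sys/fs/pstore/* | filter_sanitizer_reports` after a reboot, assuming the ramoops records are readable as plain text there.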
## Risk register additions vs v2

| # | Risk | Mitigation |
|---|------|-----------|
| R7 | KASAN+KCSAN combined boot too slow → systemd timeout → boot failure | Multiple fallback labels in extlinux: `arch_devices_lockdep_kasan` (new), `arch_devices_lockdep` (no kasan), `arch_devices` (default vanilla). User picks at U-Boot. systemd unit timeouts may need extension via `DefaultTimeoutStartSec=` if observed |
| R8 | KASAN region overlaps with our ramoops carve-out (0x10000000-0x100fffff) | KASAN shadow is a kernel virtual-address region whose backing pages are allocated from free RAM at boot. The ramoops carve-out is a reserved-memory region and should be excluded from that allocation. No conflict expected; verify with `/proc/iomem` post-boot |
| R9 | KASAN fires on pre-existing issues in unrelated kernel code (e.g. panel-edp's WARN) → noise drowns real signal | KASAN reports are distinctive ("BUG: KASAN:" prefix). Grep filtering should isolate them. If false-positive volume is overwhelming, switch to KASAN_SW_TAGS or selectively disable instrumentation per-file |
| R10 | GPU-compositor wedge reproduction requires physical login (no remote-wayland start) | User has physical access to ampere for this test step (per v2 plan A4 conditional). Time-boxed: if wedge not reproduced in ~10 min of compositor activity, declare iter6 v3 inconclusive |

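For R7, the timeout extension can be staged as a systemd manager drop-in. This is a sketch only: `write_timeout_dropin` and the file name `10-kasan-timeout.conf` are hypothetical, the timeout value is whatever the observed boot slowness calls for, and the setting takes effect on the next boot.

```shell
# Write a systemd manager drop-in that raises the default unit start timeout.
# Usage: write_timeout_dropin <dropin-dir> <seconds>
# (hypothetical helper and file name; DefaultTimeoutStartSec= lives in [Manager])
write_timeout_dropin() {
    dir=$1
    secs=$2
    mkdir -p "$dir"
    printf '[Manager]\nDefaultTimeoutStartSec=%ss\n' "$secs" \
        > "$dir/10-kasan-timeout.conf"
}
```

For example, `write_timeout_dropin /etc/systemd/system.conf.d 300` (300 s is an illustrative value, not a measured one). Removing the file after the test restores the default.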
## Estimated wall-time

| Phase | Time |
|-------|------|
| Pre-flight + config delta | 5 min (config additions only, no new pre-flight) |
| KASAN+KCSAN kernel build | 75-90 min |
| Install + reboot + lockdep-kasan baseline | 10 min |
| Apply 0004 + module rebuild + install + reboot + headless smoke | 15 min |
| GPU-compositor concurrent test + monitor for KASAN/KCSAN reports | 15 min |
| **Subtotal for 0004 verdict** | ~120-135 min wall-time |
| 0005/0006/0007 follow-ups if 0004 verdict is "clean under KASAN" | +30 min each |

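The subtotal is just the sum of the five phases above and depends only on the build time. A small sketch of that arithmetic, with `verdict_minutes` as a hypothetical helper name:

```shell
# Sum the per-phase minutes for the 0004 verdict, for a given build time.
# Usage: verdict_minutes <build-minutes>
# (hypothetical helper; fixed terms are taken from the table above)
verdict_minutes() {
    build=$1
    # config delta + build + install/baseline + 0004 smoke + compositor test
    echo $((5 + build + 10 + 15 + 15))
}
```

`verdict_minutes 75` prints 120 and `verdict_minutes 90` prints 135, which brackets the 0004 verdict window.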
## Exit conditions

**Success**: KASAN or KCSAN report fires during step 2.1's GPU-compositor test, pinpointing the bug class + responsible code. → file kernel-agent issue, fix in 0004 v2 OR upstream, retry.

**Partial**: 0004 v3 (with KASAN) still wedges silently with NO KASAN/KCSAN report. This would suggest the bug is OUTSIDE what the sanitizers instrument (e.g. firmware blob, GPU hardware state machine). Switch to serial console for live trace (final fallback).

**Failure**: KASAN+KCSAN kernel doesn't boot at all → fall back to lockdep kernel, declare iter6 line closed, move iter7 to non-fence cache-coherency angle.

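The exit conditions, together with the R10 time-box, amount to a small decision table. The sketch below is one way to record a test cycle's verdict consistently; `classify_outcome` and its yes/no arguments are hypothetical, and the `inconclusive` branch comes from R10 rather than from the three conditions above.

```shell
# Map one test cycle's observations onto the plan's exit conditions.
# Usage: classify_outcome <booted:yes|no> <report:yes|no> <wedged:yes|no>
# (hypothetical helper; argument names and values are illustrative)
classify_outcome() {
    booted=$1 report=$2 wedged=$3
    if [ "$booted" != yes ]; then
        echo failure        # KASAN+KCSAN kernel did not boot
    elif [ "$report" = yes ]; then
        echo success        # a sanitizer report fired
    elif [ "$wedged" = yes ]; then
        echo partial        # silent wedge, no report: go to serial console
    else
        echo inconclusive   # R10 time-box expired with no wedge and no report
    fi
}
```

For instance, `classify_outcome yes no yes` prints `partial`, the case that sends the investigation to the serial console.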