iter7 Phase 0: pivot to cache-coherency hypothesis (H1) for iter5 black-output

iter6 was an off-path investigation. Sonnet round-1 review + my
own corrected memory feedback_rfc_v2_vb2_dma_resv_scope.md make
clear: vb2 fence series targets Wayland compositor green-frames,
not libva cached-mmap readback. iter6 found a real upstream NULL
deref bug (filed for kernel-agent#16 when UART captures the trace)
but it's not on the critical path for iter5.

iter7 returns to iter5's actual hypothesis ladder:
- H1(a) DT dma-coherent on rkvdec node — cheapest, first
- H1(b) backend DMA_BUF_IOCTL_SYNC userspace fix — if H1(a) fails
- H2 wrong-DMA-address — if H1(a)+(b) fail
- H3 false-positive DEC_RDY — last resort

Test on vanilla kernel + iter3+iter4-fixed modules (already
verified working pipeline). Pass criterion: ffmpeg-vaapi output
contains byte values other than 16 and 128.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Markus Fritsche
2026-05-16 22:32:20 +00:00
parent b8930df801
commit 24272596cd
+66
@@ -0,0 +1,66 @@
# Phase 0 — iter7 substrate (cache-coherency angle for iter5 black-output)
Opened 2026-05-17 ~00:45, following the iter6 v1-v6 chain, which exhausted the vb2-fence-series hypothesis (iter6 was an off-path investigation; see "Why pivot from iter6" below).
## Research question
**Does forcing the RK3588 rkvdec DMA path into the cache-coherent domain (via DT `dma-coherent` property on the rkvdec node) make iter5's CAPTURE buffer readback show real decoded NV12 content instead of uniform Y=0x10/CbCr=0x80?**
## Locked-in evidence carried from iter5 (still binding)
| Observation | Source | Status |
|-------------|--------|--------|
| Iter3+iter4 kernel patches verified working: HEVC OOPS gone, EINVAL gone, decoder runs end-to-end | iter3_close.md (kernel-agent#14) + iter4_close.md (kernel-agent#15) | verified, both upstream-aligned |
| ffmpeg HEVC test: rc=0, /tmp/o.nv12 = 4147200 bytes (exact 3×NV12-frame size for 1280×720) | iter5_close.md | confirmed |
| iter5-IRQ pr_warn diagnostic: every frame ends with `STA_INT=0x107 DEC_RDY=1 TIMEOUT=0 ERROR=0` | iter5_close.md | confirmed — hardware decode succeeds |
| Output bytes are uniform Y=0x10 (luma "video black") and Cb/Cr=0x80 (chroma neutral) — solid black | iter5_close.md | confirmed — symptom not garbage |
| OUTPUT bitstream is byte-identical to raw HEVC NAL with Annex-B prefix prepended | iter5_close.md | confirmed |
| populate_ext_sps_rps_cache returns -ENODATA because ffmpeg-vaapi strips SPS/VPS/PPS | iter5_close.md | confirmed Phase 5 reviewer's prediction |
| Backend's 5-control batch commits (post iter4 SLICE_PARAMS registration) | iter5_close.md | confirmed via iter4-DIAG validate_sps firing per-frame |
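
The 4147200-byte figure in the table can be re-derived from the NV12 layout. A minimal sketch, assuming 8-bit NV12 with tightly packed rows (no stride padding); the helper name `nv12_frame_bytes` is illustrative and not part of any tool used here:

```python
def nv12_frame_bytes(width: int, height: int) -> int:
    """Bytes in one 8-bit NV12 frame, assuming no row padding."""
    luma = width * height                      # full-resolution Y plane
    chroma = (width // 2) * (height // 2) * 2  # interleaved Cb/Cr, 4:2:0 subsampled
    return luma + chroma


frames = 3
total = frames * nv12_frame_bytes(1280, 720)
print(total)  # 4147200
```

If the backend ever exports a padded stride, the file size would no longer match this product, which would itself be a useful signal.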
## Why pivot from iter6 (vb2 fence series)
Iter6's premise was: "vb2 fence series unlocks libva cached-mmap readback". **Premise is wrong.** Sonnet round-1 architect review pointed this out; my own memory `feedback_rfc_v2_vb2_dma_resv_scope.md` (which I corrected today) confirms: the fence series targets **Wayland compositor implicit-sync green-frames** on GPU consumers, NOT the libva readback path. iter6's investigation found a real upstream bug (NULL deref at dma_fence->context inside dma_resv_add_fence, see `iter6_v6_substrate_null_deref_at_0x20.md`), but fixing that bug would not have made iter5 work either. iter6 = off-path; iter7 = back on the right hypothesis ladder.
## Iter5's actual hypothesis ladder (carried to iter7)
| H | Hypothesis | Test |
|---|-----------|------|
| H1 | **DMA cache coherency** between hardware-written CAPTURE buffer and userspace cached-mmap. HW writes to physical RAM. CPU reads via cached mmap. If rkvdec isn't in the ACE-Lite coherent domain (or the kernel doesn't know it is), CPU cache holds stale pre-decode bytes. | (a) DT `dma-coherent` property on rkvdec node; (b) backend `DMA_BUF_IOCTL_SYNC` round-trip |
| H2 | HW writes to a different physical address than vb2-allocated CAPTURE buffer | log dst_addr in rkvdec.config_registers vs vb2_dma_contig_plane_dma_addr; compare |
| H3 | DEC_RDY=1 is a false positive — pipeline registers look valid but actual decode is no-op | examine register-config code |
H1 is the leading hypothesis (matches RK3399 pattern from `reference_dmabuf_resv_blocker.md`, matches the "solid value" symptom). Iter7 starts with H1 (a) — DT `dma-coherent`.
## Substrate for iter7 H1(a) test
- ampere kernel source: `~/src/linux-rockchip`, branch `ampere-minimal-devices`. The working tree has the iter3+iter4 patches uncommitted and the iter6 patches partially applied. That needs cleanup eventually, but not for this test: only the dtb is rebuilt, and the state of the .c files doesn't matter for a dtb-only build
- DTS file to edit: `arch/arm64/boot/dts/rockchip/rk3588-coolpi-cm5-genbook.dts`
- rkvdec node: `&rkvdec_ccu`, `&rkvdec0`, `&rkvdec1` (rk3588 has dual-core rkvdec). For initial test, add `dma-coherent` to ALL rkvdec nodes
- DTB output: `arch/arm64/boot/dts/rockchip/rk3588-coolpi-cm5-genbook.dtb`
- Install path: `/boot/firmware/rk3588-coolpi-cm5-genbook.dtb-7.0.0-rc3-devices+` (vanilla kernel's dtb, since we're testing on vanilla which has iter3+iter4 fixes baked into the loaded modules)
- Test reboot needed (kernel re-reads DTB during early init)
- Test command: `LIBVA_DRIVER_NAME=v4l2_request ffmpeg -hwaccel vaapi -hwaccel_output_format vaapi -i ~/measurements/encoded/bbb_60s_720p.hevc.mp4 -vf hwdownload,format=nv12 -frames:v 3 -f rawvideo -pix_fmt nv12 /tmp/o.nv12`
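- Pass criterion: `head -c 4147200 /tmp/o.nv12 | od -An -v -tu1 -w1 | sort -u | head -10` shows MORE THAN 2 unique byte values (i.e. not just `16` and `128`). The `-v` is required: without it `od` collapses runs of identical lines into a `*` line, which `sort -u` would count as a third "value" and turn solid black into a false PASS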
- Better pass criterion: byte-compare against ffmpeg SW-decoded NV12 of the same source — if reasonably close (HW vs SW differ slightly but not solid-color), confirmed PASS
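
Both pass criteria can be scripted so the verdict does not depend on eyeballing output. A minimal sketch, assuming tightly packed 8-bit NV12 frames; the function names are illustrative, and what counts as "reasonably close" to the SW decode is left to the caller since no threshold has been fixed:

```python
def plane_value_sets(data: bytes, width: int, height: int):
    """Return (luma values, chroma values) seen across all whole frames."""
    luma_size = width * height
    frame_size = luma_size * 3 // 2
    if len(data) < frame_size:
        raise ValueError("need at least one full NV12 frame")
    luma_vals, chroma_vals = set(), set()
    for off in range(0, len(data) - frame_size + 1, frame_size):
        luma_vals.update(data[off:off + luma_size])
        chroma_vals.update(data[off + luma_size:off + frame_size])
    return luma_vals, chroma_vals


def looks_solid_black(data: bytes, width: int, height: int) -> bool:
    """True if every luma byte is 0x10 and every chroma byte is 0x80."""
    luma_vals, chroma_vals = plane_value_sets(data, width, height)
    return luma_vals <= {0x10} and chroma_vals <= {0x80}


def mean_abs_diff(a: bytes, b: bytes) -> float:
    """Mean absolute per-byte difference between two equal-length dumps."""
    if len(a) != len(b):
        raise ValueError("dumps differ in length")
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)
```

Typical use would be `looks_solid_black(open("/tmp/o.nv12", "rb").read(), 1280, 720)` for the basic criterion, and `mean_abs_diff(hw, sw)` against a software-decoded dump of the same three frames for the stronger one.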
## Risk register
| # | Risk | Mitigation |
|---|------|-----------|
| R1 | `dma-coherent` on a non-coherent SoC IP causes DATA CORRUPTION: the kernel skips cache maintenance, so the CPU can read stale cachelines from device-written buffers and the device can miss CPU writes that were never written back. Could corrupt decoded data OR oops on a missed cacheline | RK3588 ACE-Lite supports rkvdec coherent per upstream comments; if wrong, symptom is more corruption not less. Backup vanilla DTB before swap. Recovery via WeChat stick |
| R2 | DTB-only change doesn't kick in because boot loader caches the old DTB | extlinux on this system reads `fdt` path on each boot — uncached. Safe |
| R3 | iter6 source mods in working tree contaminate dtb rebuild | DTB rebuild only touches DT files, not .c files. Safe |
| R4 | H1(a) fails (still black) and H1(b) backend SYNC has to be tried | Backend rebuild + replace `.so` on ampere is already routine work; documented path |
## Open questions tabled for Phase 1
(Phase 1 starts after Phase 0 close + this test runs.)
1. If H1(a) succeeds: is the kernel rkvdec driver also affected (needs explicit dma_sync calls when DT lies about coherency)? Need to verify by stress-testing under varied load (concurrent vb2 + GPU) for no data corruption.
2. If H1(a) fails: is H1(b) the right next move, or did we mis-diagnose? Need to check `/sys/class/dma_heap/` or `/sys/firmware/devicetree/.../dma-coherent` state to confirm DT property took effect.
3. If both H1 variants fail: H2 (wrong DMA address) — instrument rkvdec to log dst_addr vs vb2 mapping.
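
For question 2, whether the property took effect can be checked from the live device tree, where each node is a directory and each property is a file. A minimal sketch: the root path and the node-name fragment are parameters because the exact rkvdec node directory names on this board are not recorded here, and any value passed in is an assumption to verify on the device:

```python
import os


def nodes_with_property(dt_root: str, name_fragment: str,
                        prop: str = "dma-coherent") -> dict:
    """Map each matching node directory to whether it carries `prop`.

    dt_root is expected to be something like /sys/firmware/devicetree/base.
    name_fragment is a caller-supplied substring of the node directory name
    (hypothetical here; check the real node names on the target).
    """
    results = {}
    for dirpath, _dirnames, filenames in os.walk(dt_root):
        if name_fragment in os.path.basename(dirpath):
            results[dirpath] = prop in filenames
    return results
```

A result where any rkvdec node maps to `False` after the reboot would mean the new DTB was not the one the kernel loaded, which should be ruled out before concluding that H1(a) failed.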
## Phase 0 close
Substrate locked. Pivot from iter6 documented (real bug, off-path). H1(a) DT dma-coherent is cheapest test, minimal risk, ~10 min wallclock from ampere-reachable to pass/fail verdict.