diff --git a/phase0_findings_iter7.md b/phase0_findings_iter7.md
new file mode 100644
index 0000000..c2af9ff
--- /dev/null
+++ b/phase0_findings_iter7.md
@@ -0,0 +1,66 @@
+# Phase 0: iter7 substrate (cache-coherency angle for iter5 black-output)
+
+Opened 2026-05-17 ~00:45, following the iter6 v1-v6 chain that exhausted the vb2-fence-series hypothesis (iter6 was an off-path investigation; see "Why pivot" below).
+
+## Research question
+
+**Does forcing the RK3588 rkvdec DMA path into the cache-coherent domain (via the DT `dma-coherent` property on the rkvdec node) make iter5's CAPTURE buffer readback show real decoded NV12 content instead of uniform Y=0x10/CbCr=0x80?**
+
+## Locked-in evidence carried from iter5 (still binding)
+
+| Observation | Source | Status |
+|-------------|--------|--------|
+| Iter3+iter4 kernel patches verified working: HEVC OOPS gone, EINVAL gone, decoder runs end-to-end | iter3_close.md (kernel-agent#14) + iter4_close.md (kernel-agent#15) | verified, both upstream-aligned |
+| ffmpeg HEVC test: rc=0, /tmp/o.nv12 = 4147200 bytes (exact 3×NV12-frame size for 1280×720) | iter5_close.md | confirmed |
+| iter5-IRQ pr_warn diagnostic: every frame ends with `STA_INT=0x107 DEC_RDY=1 TIMEOUT=0 ERROR=0` | iter5_close.md | confirmed; hardware decode succeeds |
+| Output bytes are uniform Y=0x10 (luma "video black") and Cb/Cr=0x80 (chroma neutral), i.e. solid black | iter5_close.md | confirmed; the symptom is a solid color, not garbage |
+| OUTPUT bitstream is byte-identical to raw HEVC NAL with Annex-B prefix prepended | iter5_close.md | confirmed |
+| populate_ext_sps_rps_cache returns -ENODATA because ffmpeg-vaapi strips SPS/VPS/PPS | iter5_close.md | confirmed (matches the Phase 5 reviewer's prediction) |
+| Backend's 5-control batch commits (post iter4 SLICE_PARAMS registration) | iter5_close.md | confirmed via iter4-DIAG validate_sps firing per-frame |
+
+## Why pivot from iter6 (vb2 fence series)
+
+Iter6's premise was: "vb2 fence series unlocks libva cached-mmap readback".
+**The premise is wrong.** The Sonnet round-1 architect review pointed this out, and my own memory `feedback_rfc_v2_vb2_dma_resv_scope.md` (which I corrected today) confirms it: the fence series targets **Wayland compositor implicit-sync green-frames** on GPU consumers, NOT the libva readback path. iter6's investigation found a real upstream bug (NULL deref at dma_fence->context inside dma_resv_add_fence, see `iter6_v6_substrate_null_deref_at_0x20.md`), but fixing that bug would not have made iter5 work either. iter6 = off-path; iter7 = back on the right hypothesis ladder.
+
+## Iter5's actual hypothesis ladder (carried to iter7)
+
+| H | Hypothesis | Test |
+|---|-----------|------|
+| H1 | **DMA cache coherency** between the hardware-written CAPTURE buffer and the userspace cached mmap. HW writes to physical RAM; the CPU reads via a cached mmap. If rkvdec isn't in the ACE-Lite coherent domain (or the kernel doesn't know it is), the CPU cache holds stale pre-decode bytes. | (a) DT `dma-coherent` property on rkvdec node; (b) backend `DMA_BUF_IOCTL_SYNC` round-trip |
+| H2 | HW writes to a different physical address than the vb2-allocated CAPTURE buffer | log dst_addr in rkvdec.config_registers vs vb2_dma_contig_plane_dma_addr; compare |
+| H3 | DEC_RDY=1 is a false positive: the pipeline registers look valid but the actual decode is a no-op | examine register-config code |
+
+H1 is the leading hypothesis (it matches the RK3399 pattern from `reference_dmabuf_resv_blocker.md` and the "solid value" symptom). Iter7 starts with H1(a): DT `dma-coherent`.
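
As a rough illustration of the H1(b) fallback, the `DMA_BUF_IOCTL_SYNC` round-trip brackets each CPU read of the CAPTURE buffer with a START/END pair. This is a sketch in Python, not the backend's code: `read_with_cpu_sync` and `read_fn` are hypothetical names, and the ioctl number is derived from `_IOW('b', 0, struct dma_buf_sync)` assuming the asm-generic ioctl encoding (the one used on arm64).

```python
import fcntl
import struct

# Flag values from the Linux UAPI header <linux/dma-buf.h>.
DMA_BUF_SYNC_READ = 1 << 0
DMA_BUF_SYNC_START = 0 << 2
DMA_BUF_SYNC_END = 1 << 2


def _iow(type_char, nr, size):
    """Encode a write-direction ioctl number (asm-generic layout; assumption)."""
    return (1 << 30) | (size << 16) | (ord(type_char) << 8) | nr


# struct dma_buf_sync holds a single __u64 flags field, hence size 8.
DMA_BUF_IOCTL_SYNC = _iow("b", 0, struct.calcsize("Q"))


def read_with_cpu_sync(dmabuf_fd, read_fn):
    """Run read_fn() between SYNC_START and SYNC_END on a dma-buf fd.

    read_fn is a hypothetical callback that copies bytes out of the mmap.
    """
    start = struct.pack("Q", DMA_BUF_SYNC_START | DMA_BUF_SYNC_READ)
    end = struct.pack("Q", DMA_BUF_SYNC_END | DMA_BUF_SYNC_READ)
    fcntl.ioctl(dmabuf_fd, DMA_BUF_IOCTL_SYNC, start)
    try:
        return read_fn()
    finally:
        # Always close the CPU-access window, even if the read fails.
        fcntl.ioctl(dmabuf_fd, DMA_BUF_IOCTL_SYNC, end)
```

If H1 holds and the device is not coherent, the START call is where the kernel would invalidate stale cache lines before the CPU read; on a genuinely coherent device the pair should be close to a no-op.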
+
+## Substrate for iter7 H1(a) test
+
+- ampere kernel source: `~/src/linux-rockchip`, branch `ampere-minimal-devices`. The working tree has the iter3+iter4 patches uncommitted and the iter6 patches partially applied. Cleanup is pending, but it does not block this test: only the DTB is rebuilt, and the state of the .c files does not affect a DTB-only build
+- DTS file to edit: `arch/arm64/boot/dts/rockchip/rk3588-coolpi-cm5-genbook.dts`
+- rkvdec nodes: `&rkvdec_ccu`, `&rkvdec0`, `&rkvdec1` (rk3588 has dual-core rkvdec). For the initial test, add `dma-coherent` to ALL rkvdec nodes
+- DTB output: `arch/arm64/boot/dts/rockchip/rk3588-coolpi-cm5-genbook.dtb`
+- Install path: `/boot/firmware/rk3588-coolpi-cm5-genbook.dtb-7.0.0-rc3-devices+` (the vanilla kernel's dtb, since we're testing on vanilla, which has the iter3+iter4 fixes baked into the loaded modules)
+- Reboot required (the kernel only parses the DTB during early init)
+- Test command: `LIBVA_DRIVER_NAME=v4l2_request ffmpeg -hwaccel vaapi -hwaccel_output_format vaapi -i ~/measurements/encoded/bbb_60s_720p.hevc.mp4 -vf hwdownload,format=nv12 -frames:v 3 -f rawvideo -pix_fmt nv12 /tmp/o.nv12`
+- Pass criterion: `head -c 4147200 /tmp/o.nv12 | od -An -tu1 -w1 | sort -u | head -10` shows MORE THAN 2 unique bytes (i.e. not just `16` and `128`)
+- Better pass criterion: byte-compare against ffmpeg SW-decoded NV12 of the same source. If the two are reasonably close (HW and SW output differ slightly, but neither is solid-color), that is a confirmed PASS
+
+## Risk register
+
+| # | Risk | Mitigation |
+|---|------|-----------|
+| R1 | `dma-coherent` on a non-coherent SoC IP causes DATA CORRUPTION, because the kernel then skips cache maintenance (no invalidate before the CPU reads device-written buffers, no writeback before the device reads CPU-written buffers). Could corrupt decoded data OR oops on a missed cacheline | Per upstream comments, RK3588's ACE-Lite supports coherent rkvdec; if that is wrong, the symptom is more corruption, not less. Back up the vanilla DTB before the swap. Recovery via WeChat stick |
+| R2 | DTB-only change doesn't kick in because the boot loader caches the old DTB | extlinux on this system reads the `fdt` path on each boot, uncached. Safe |
+| R3 | iter6 source mods in the working tree contaminate the dtb rebuild | DTB rebuild only touches DT files, not .c files. Safe |
+| R4 | If H1(a) fails (still black), need to try H1(b) backend SYNC. This means a backend rebuild plus replacing the `.so` on ampere, which is already routine work | documented path |
+
+## Open questions tabled for Phase 1
+
+(Phase 1 starts after Phase 0 closes and this test has run.)
+
+1. If H1(a) succeeds: is the kernel rkvdec driver also affected (i.e. does it need explicit dma_sync calls when the DT lies about coherency)? Verify by stress-testing under varied load (concurrent vb2 + GPU) and checking for data corruption.
+2. If H1(a) fails: is H1(b) the right next move, or did we mis-diagnose? First check `/sys/class/dma_heap/` or the `/sys/firmware/devicetree/.../dma-coherent` state to confirm the DT property took effect.
+3. If both H1 variants fail: move to H2 (wrong DMA address) and instrument rkvdec to log dst_addr vs the vb2 mapping.
+
+## Phase 0 close
+
+Substrate locked. Pivot from iter6 documented (real bug, off-path). H1(a), DT `dma-coherent`, is the cheapest test with minimal risk: ~10 min wallclock from ampere-reachable to a pass/fail verdict.
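
The first pass criterion can also be evaluated per frame and per plane, which the `od | sort -u` pipeline does not distinguish. The following is a minimal sketch, assuming tightly packed NV12 frames (a full-resolution Y plane followed by an interleaved CbCr plane at half height, no row padding); `nv12_frame_size` and `classify_nv12` are illustrative names, not part of any existing tool.

```python
def nv12_frame_size(width, height):
    """Bytes in one tightly packed NV12 frame: Y plane plus half-size CbCr plane."""
    return width * height * 3 // 2


def classify_nv12(data, width, height):
    """Report unique byte counts per plane for each complete frame in data.

    A frame is flagged solid_black when every luma byte is 0x10 and every
    chroma byte is 0x80, which is the iter5 symptom.
    """
    frame = nv12_frame_size(width, height)
    luma = width * height
    verdicts = []
    # Step through complete frames only; a trailing partial frame is ignored.
    for off in range(0, len(data) - frame + 1, frame):
        y_vals = set(data[off:off + luma])
        uv_vals = set(data[off + luma:off + frame])
        verdicts.append({
            "y_unique": len(y_vals),
            "uv_unique": len(uv_vals),
            "solid_black": y_vals == {0x10} and uv_vals == {0x80},
        })
    return verdicts
```

For the iter5 output this could be called as `classify_nv12(open("/tmp/o.nv12", "rb").read(), 1280, 720)`; any frame reported with `solid_black` false meets the first pass criterion, and the frame-size helper reproduces the 4147200-byte figure for three 1280×720 frames.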