ampere-kernel-decoders/phase0_findings_iter7.md
Markus Fritsche 24272596cd iter7 Phase 0: pivot to cache-coherency hypothesis (H1) for iter5 black-output
iter6 was an off-path investigation. Sonnet round-1 review + my
own corrected memory feedback_rfc_v2_vb2_dma_resv_scope.md make
clear: vb2 fence series targets Wayland compositor green-frames,
not libva cached-mmap readback. iter6 found a real upstream NULL
deref bug (filed for kernel-agent#16 when UART captures the trace)
but it's not on the critical path for iter5.

iter7 returns to iter5's actual hypothesis ladder:
- H1(a) DT dma-coherent on rkvdec node — cheapest, first
- H1(b) backend DMA_BUF_IOCTL_SYNC userspace fix — if H1(a) fails
- H2 wrong-DMA-address — if H1(a)+(b) fail
- H3 false-positive DEC_RDY — last resort

Test on vanilla kernel + iter3+iter4-fixed modules (already
verified working pipeline). Pass criterion: ffmpeg-vaapi output
shows more than {16, 128} unique bytes.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-16 22:32:20 +00:00


Phase 0 — iter7 substrate (cache-coherency angle for iter5 black-output)

Opened 2026-05-17 ~00:45, following the iter6 v1-v6 chain, which exhausted the vb2-fence-series hypothesis (iter6 was an off-path investigation; see "Why pivot from iter6" below).

Research question

Does forcing the RK3588 rkvdec DMA path into the cache-coherent domain (via the DT dma-coherent property on the rkvdec nodes) make iter5's CAPTURE buffer readback show real decoded NV12 content instead of uniform Y=0x10/CbCr=0x80?

Locked-in evidence carried from iter5 (still binding)

| Observation | Source | Status |
|---|---|---|
| Iter3+iter4 kernel patches verified working: HEVC OOPS gone, EINVAL gone, decoder runs end-to-end | iter3_close.md (kernel-agent#14) + iter4_close.md (kernel-agent#15) | verified, both upstream-aligned |
| ffmpeg HEVC test: rc=0, /tmp/o.nv12 = 4147200 bytes (exact 3×NV12-frame size for 1280×720) | iter5_close.md | confirmed |
| iter5-IRQ pr_warn diagnostic: every frame ends with STA_INT=0x107 DEC_RDY=1 TIMEOUT=0 ERROR=0 | iter5_close.md | confirmed; hardware decode succeeds |
| Output bytes are uniform Y=0x10 (luma "video black") and Cb/Cr=0x80 (chroma neutral), i.e. solid black | iter5_close.md | confirmed; symptom is not garbage |
| OUTPUT bitstream is byte-identical to raw HEVC NAL with Annex-B prefix prepended | iter5_close.md | confirmed |
| populate_ext_sps_rps_cache returns -ENODATA because ffmpeg-vaapi strips SPS/VPS/PPS | iter5_close.md | confirmed Phase 5 reviewer's prediction |
| Backend's 5-control batch commits (post iter4 SLICE_PARAMS registration) | iter5_close.md | confirmed via iter4-DIAG validate_sps firing per-frame |
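
The 4147200-byte figure in the evidence above can be re-derived from NV12 layout arithmetic. A minimal sketch, assuming tightly packed frames with no stride padding (the helper name is illustrative and not part of any project code):

```python
def nv12_frame_bytes(width: int, height: int) -> int:
    """Bytes in one tightly packed NV12 frame (assumes no stride padding)."""
    luma = width * height                      # one Y byte per pixel
    chroma = 2 * (width // 2) * (height // 2)  # interleaved Cb/Cr, 4:2:0 subsampled
    return luma + chroma


frame = nv12_frame_bytes(1280, 720)
print(frame, 3 * frame)  # 1382400 4147200
```

Three 1280×720 frames therefore account for the whole file, which is consistent with the output containing exactly the 3 requested frames and nothing else.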

Why pivot from iter6 (vb2 fence series)

Iter6's premise was: "vb2 fence series unlocks libva cached-mmap readback". That premise is wrong. The Sonnet round-1 architect review pointed this out, and my own memory feedback_rfc_v2_vb2_dma_resv_scope.md (which I corrected today) confirms it: the fence series targets Wayland compositor implicit-sync green-frames on GPU consumers, NOT the libva readback path. iter6's investigation found a real upstream bug (NULL deref at dma_fence->context inside dma_resv_add_fence, see iter6_v6_substrate_null_deref_at_0x20.md), but fixing that bug would not have made iter5 work either. iter6 = off-path; iter7 = back on the right hypothesis ladder.

Iter5's actual hypothesis ladder (carried to iter7)

| H | Hypothesis | Test |
|---|---|---|
| H1 | DMA cache coherency between hardware-written CAPTURE buffer and userspace cached-mmap. HW writes to physical RAM. CPU reads via cached mmap. If rkvdec isn't in the ACE-Lite coherent domain (or the kernel doesn't know it is), CPU cache holds stale pre-decode bytes. | (a) DT dma-coherent property on rkvdec node; (b) backend DMA_BUF_IOCTL_SYNC round-trip |
| H2 | HW writes to a different physical address than vb2-allocated CAPTURE buffer | log dst_addr in rkvdec.config_registers vs vb2_dma_contig_plane_dma_addr; compare |
| H3 | DEC_RDY=1 is a false positive: pipeline registers look valid but actual decode is no-op | examine register-config code |

H1 is the leading hypothesis: it matches the RK3399 pattern from reference_dmabuf_resv_blocker.md and it matches the "solid value" symptom. Iter7 starts with H1(a), the DT dma-coherent property.
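
For reference, the H1(b) round-trip brackets the CPU read of the CAPTURE dmabuf with a sync-start and a sync-end ioctl. The sketch below is illustrative only: the real backend is not written in Python, and the helper names are hypothetical. The flag values are those of include/uapi/linux/dma-buf.h, and the ioctl number assumes the generic ioctl encoding used on arm64.

```python
import fcntl
import struct
from contextlib import contextmanager

# Flag values from include/uapi/linux/dma-buf.h.
DMA_BUF_SYNC_READ = 1 << 0
DMA_BUF_SYNC_WRITE = 2 << 0
DMA_BUF_SYNC_START = 0 << 2
DMA_BUF_SYNC_END = 1 << 2

# _IOW('b', 0, struct dma_buf_sync): direction "write" (1) in bits 30-31,
# argument size (one u64 = 8 bytes) in bits 16-29, type 'b' in bits 8-15, nr 0.
# Assumes the asm-generic ioctl encoding (true for arm64 and x86).
_IOC_WRITE = 1
DMA_BUF_IOCTL_SYNC = (
    (_IOC_WRITE << 30) | (struct.calcsize("Q") << 16) | (ord("b") << 8) | 0
)


def dma_buf_sync(dmabuf_fd: int, flags: int) -> None:
    """Issue DMA_BUF_IOCTL_SYNC with a struct dma_buf_sync { __u64 flags; }."""
    fcntl.ioctl(dmabuf_fd, DMA_BUF_IOCTL_SYNC, struct.pack("Q", flags))


@contextmanager
def cpu_read_access(dmabuf_fd: int):
    """Bracket a CPU read of a mmap'ed dmabuf (hypothetical helper)."""
    dma_buf_sync(dmabuf_fd, DMA_BUF_SYNC_START | DMA_BUF_SYNC_READ)
    try:
        yield
    finally:
        dma_buf_sync(dmabuf_fd, DMA_BUF_SYNC_END | DMA_BUF_SYNC_READ)
```

A caller would wrap the readback in `with cpu_read_access(fd): ...`, where `fd` is the exported CAPTURE dmabuf. Whether the exporter's begin_cpu_access hook actually invalidates the cache for this buffer is exactly what H1(b) would test.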

Substrate for iter7 H1(a) test

  • ampere kernel source: ~/src/linux-rockchip branch ampere-minimal-devices; the working tree has iter3+iter4 patches uncommitted and iter6 patches partially in. The tree needs cleanup eventually, but not for this test: a dtb-only build does not depend on the state of the .c files
  • DTS file to edit: arch/arm64/boot/dts/rockchip/rk3588-coolpi-cm5-genbook.dts
  • rkvdec nodes: &rkvdec_ccu, &rkvdec0, &rkvdec1 (rk3588 has dual-core rkvdec). For the initial test, add dma-coherent to ALL rkvdec nodes
  • DTB output: arch/arm64/boot/dts/rockchip/rk3588-coolpi-cm5-genbook.dtb
  • Install path: /boot/firmware/rk3588-coolpi-cm5-genbook.dtb-7.0.0-rc3-devices+ (vanilla kernel's dtb, since we're testing on vanilla which has iter3+iter4 fixes baked into the loaded modules)
  • Reboot needed for the test (the kernel only reads the DTB during early init)
  • Test command: LIBVA_DRIVER_NAME=v4l2_request ffmpeg -hwaccel vaapi -hwaccel_output_format vaapi -i ~/measurements/encoded/bbb_60s_720p.hevc.mp4 -vf hwdownload,format=nv12 -frames:v 3 -f rawvideo -pix_fmt nv12 /tmp/o.nv12
  • Pass criterion: head -c 4147200 /tmp/o.nv12 | od -An -tu1 -w1 | sort -u | head -10 shows MORE THAN 2 unique byte values (i.e. not just 16 and 128)
  • Better pass criterion: byte-compare against ffmpeg SW-decoded NV12 of the same source. If the two are reasonably close (HW and SW output may differ slightly, but the HW output must not be solid-color), that is a confirmed PASS
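
Both pass criteria can also be scripted. A minimal sketch, assuming the HW output and a SW-decoded reference are raw NV12 files of equal length; the function names are hypothetical, and no numeric threshold for "reasonably close" is implied because none has been fixed yet:

```python
def passes_unique_bytes(data: bytes) -> bool:
    """First criterion: more than 2 distinct byte values in the buffer."""
    return len(set(data)) > 2


def mean_abs_diff(hw: bytes, sw: bytes) -> float:
    """Second criterion input: mean absolute per-byte difference, HW vs SW."""
    if not hw or len(hw) != len(sw):
        raise ValueError("buffers must be non-empty and of equal length")
    return sum(abs(a - b) for a, b in zip(hw, sw)) / len(hw)


def load(path: str, limit: int = 4147200) -> bytes:
    """Read at most `limit` bytes (3 NV12 frames at 1280x720 by default)."""
    with open(path, "rb") as f:
        return f.read(limit)
```

For example, `passes_unique_bytes(load("/tmp/o.nv12"))` reproduces the od/sort check, and `mean_abs_diff` gives one number to judge against a SW decode once a reference file exists.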

Risk register

| # | Risk | Mitigation |
|---|---|---|
| R1 | dma-coherent on a non-coherent SoC IP causes DATA CORRUPTION (CPU caches not synced for DMA reads, missing writebacks for DMA writes); could corrupt decoded data OR oops on a missed cacheline | RK3588 ACE-Lite supports rkvdec coherent per upstream comments; if wrong, symptom is more corruption not less. Back up vanilla DTB before swap. Recovery via WeChat stick |
| R2 | DTB-only change doesn't kick in because boot loader caches the old DTB | extlinux on this system reads the fdt path on each boot (uncached). Safe |
| R3 | iter6 source mods in working tree contaminate dtb rebuild | DTB rebuild only touches DT files, not .c files. Safe |
| R4 | If H1(a) fails (still black), need to try H1(b) backend SYNC | Backend rebuild + replace .so on ampere: already routine work with a documented path |

Open questions tabled for Phase 1

(Phase 1 starts after Phase 0 closes and this test has run.)

  1. If H1(a) succeeds: is the kernel rkvdec driver also affected (does it need explicit dma_sync calls when DT lies about coherency)? Verify by stress-testing under varied load (concurrent vb2 + GPU) and checking for data corruption.
  2. If H1(a) fails: is H1(b) the right next move, or did we mis-diagnose? Need to check /sys/class/dma_heap/ or /sys/firmware/devicetree/.../dma-coherent state to confirm DT property took effect.
  3. If both H1 variants fail: H2 (wrong DMA address) — instrument rkvdec to log dst_addr vs vb2 mapping.
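
For open question 2, one way to confirm that the property reached the running kernel is to walk the live device tree. This is a sketch under two assumptions: the tree is exposed at /sys/firmware/devicetree/base, and a boolean property such as dma-coherent appears as an empty file inside its node directory. The function name and the name_filter parameter are hypothetical, and the rkvdec node directory names on this board are not recorded in these notes.

```python
import os


def nodes_with_property(dt_root: str, prop: str = "dma-coherent",
                        name_filter: str = "") -> list:
    """Return node directories under dt_root that carry the property `prop`.

    name_filter is an optional substring matched against the node directory
    name; the default empty string matches every node.
    """
    hits = []
    for dirpath, _dirs, files in os.walk(dt_root):
        if prop in files and name_filter in os.path.basename(dirpath):
            hits.append(dirpath)
    return sorted(hits)
```

Calling `nodes_with_property("/sys/firmware/devicetree/base")` after the reboot should list the rkvdec nodes if the new DTB was loaded. Note that this only shows the property is present in the booted tree, not that the DMA layer acted on it.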

Phase 0 close

Substrate locked. Pivot from iter6 documented (real bug, off-path). H1(a) DT dma-coherent is the cheapest test, with minimal risk: ~10 min wall-clock from ampere being reachable to a pass/fail verdict.