# H.264 baseline contract trace + pixel verify (fresnel, 2026-05-07)

Phase 0 deliverable #4 evidence. Three artefacts:

1. **V4L2 + media-request contract trace** captured under strace + ftrace.
2. **Cache-safe pixel verification PASSES** via mpv `--hwdec=vaapi --vo=image` (DMA-BUF GL import path).
3. **Cache-stale path bug identified** in the libva backend's vaDeriveImage / cached-mmap readback (the iter1 patch-0011 bug class on RK3399).

Phase 0 boolean-correctness criterion for H.264 on rkvdec: **PASS**.

## TL;DR

```
Fixture: ~/fourier-test/bbb_1080p30_h264.mp4 (1920×1080@24fps)
Bind:    rkvdec (/dev/video3 + /dev/media1)
Backend: libva-v4l2-request-fourier @ 65969da (iter8 Phase 4)
Kernel:  6.19.9-99-eos-arm

GL-DMA-BUF readback (mpv --hwdec=vaapi --vo=image, +30s seek):
  HW frame 1 == SW frame 1 (sha256 f623d5f7..., 651726 bytes)
  HW frame 2 == SW frame 2 (sha256 7d7bc6f2..., 630433 bytes)
  Pixel-perfect match against software decode.

Cached-mmap readback (ffmpeg -hwaccel vaapi -hwaccel_output_format nv12):
  544 / 6,220,800 bytes non-zero (0.009%)
  Pattern: 16-byte non-zero chunks at every 1920-byte row stride
  Stale-cache-coherency bug present in the readback path.
```

## Contract trace: V4L2 + media-request ioctl sequence

Captured via `strace -ff -tt -y -e ioctl,openat,close` plus ftrace `events/v4l2/*` tracepoints. Raw artefacts (gitignored):

- `mpv.strace.{12410,12413,...12430}`: per-thread strace (19 threads, ffmpeg's frame-threaded h264 decoder spreads work across av:h264:dfN workers).
- `ftrace_v4l2.txt`: kernel-side qbuf/dqbuf events (52 entries for 5 frames + init).
- `merged_ioctls.tsv`: time-sorted V4L2/MEDIA/DRM-only ioctls across all threads (215 entries).
- `mpv.stdout`: mpv log including `[vaapi] libva: User environment variable requested driver 'v4l2_request'`, `[vaapi] libva: Trying to open /usr/lib/dri/v4l2_request_drv_video.so`, `Using hardware decoding (vaapi-copy).`

Re-run incantation:

```bash
ssh fresnel '
  sudo sh -c "echo 0 > /sys/kernel/tracing/tracing_on; echo 0 > /sys/kernel/tracing/trace; \
              echo 1 > /sys/kernel/tracing/events/v4l2/enable; echo 1 > /sys/kernel/tracing/tracing_on"
  strace -ff -tt -y -e trace=ioctl,openat,close \
    -o /tmp/h264_baseline/mpv.strace \
    env LIBVA_DRIVER_NAME=v4l2_request \
        LIBVA_V4L2_REQUEST_VIDEO_PATH=/dev/video3 \
        LIBVA_V4L2_REQUEST_MEDIA_PATH=/dev/media1 \
    mpv --hwdec=vaapi-copy --frames=5 --vo=null --no-audio \
        --no-input-default-bindings ~/fourier-test/bbb_1080p30_h264.mp4
  sudo cp /sys/kernel/tracing/trace /tmp/h264_baseline/ftrace_v4l2.txt
  sudo sh -c "echo 0 > /sys/kernel/tracing/tracing_on; echo 0 > /sys/kernel/tracing/events/v4l2/enable"
'
```

### Init phase (one-shot at decoder open)

Approximate ordering (some interleaved across threads):

1. `DRM_IOCTL_VERSION × 2` on `/dev/dri/renderD128`: libva render-node probe.
2. `VIDIOC_QUERYCAP` on `/dev/video3`: confirms driver=`rkvdec`, card=`rkvdec`, bus=`platform:rkvdec`, version=KERNEL_VERSION(6,19,9).
3. `VIDIOC_ENUM_FMT × 22` (V4L2_BUF_TYPE_VIDEO_OUTPUT_MPLANE): enumerate compressed-input fourcc list. Returns `S265 (HEVC)`, `S264 (H.264)`, `VP9F (VP9)`, then -1 at index 3 = end.
4. `VIDIOC_S_FMT` (OUTPUT_MPLANE): set to `S264` (H.264 Annex-B), 1920×1088.
5. `VIDIOC_ENUM_FMT × 1` (CAPTURE_MPLANE): returns `Y/UV 4:2:0 (NV12)`.
6. `VIDIOC_G_FMT × 17` (CAPTURE_MPLANE): returns NV12, 1920×1088.
7. `VIDIOC_CREATE_BUFS count=24` (CAPTURE_MPLANE): instantiates the iter7 cap_pool (24 slots).
8. `VIDIOC_QUERYBUF × 40`: collect mmap offsets for cap_pool slots + extras.
9. `VIDIOC_REQBUFS × 2`: finalize OUTPUT and CAPTURE buffer pools.
10. `MEDIA_IOC_REQUEST_ALLOC × 16` on `/dev/media1`: preallocate request_fd pool (one per active OUTPUT slot, iter6 binding pattern).
11. `VIDIOC_STREAMON × 2`: start OUTPUT and CAPTURE streams.

mpv backend logs: `v4l2-request: cap_pool_init: 24 slots ready (v4l2_index=0..23, 1 plane(s) per slot)`, so the iter7 cap_pool harness instantiates correctly.

### Per-frame decode pattern (the contract)

Each H.264 frame goes through this 7-ioctl sequence on a single ffmpeg av:h264:dfN worker thread:

```
S_EXT_CTRLS (CODEC_STATELESS controls, which=0xf010000 = V4L2_CTRL_WHICH_REQUEST_VAL, request_fd=N)
                                                     ← bind H.264 SPS/PPS/decode_params to the request
QBUF CAPTURE_MPLANE index=K                          ← provide an empty CAPTURE buffer for decoded NV12
QBUF OUTPUT_MPLANE index=K (compressed slice bytes)  ← submit compressed input slice
MEDIA_REQUEST_IOC_QUEUE (request_fd=N)               ← submit the request bundle, kernel begins decode
MEDIA_REQUEST_IOC_REINIT (request_fd=N)              ← reset the request_fd to IDLE for reuse
DQBUF OUTPUT_MPLANE index=K                          ← collect input slot back from kernel
DQBUF CAPTURE_MPLANE index=K                         ← collect decoded NV12
```

Notable observations:

- **REINIT before DQBUF.** REINIT succeeds because by the time userspace gets to it (~0.6 ms after QUEUE), the kernel has already moved the request from QUEUED to COMPLETE state. The mainline `media_request_ioctl_reinit()` accepts both IDLE and COMPLETE and returns -EBUSY for any other state, such as QUEUED. This is the iter6/iter7 per-OUTPUT-slot REINIT pattern observed in action: request_fd ownership is per-slot, and REINIT is called eagerly to recycle.
- **Cycle time per frame: 4–10 ms wall-clock** (timestamps from the merged trace), dominated by the `S_EXT_CTRLS` payload serialization and the kernel's actual decode. Not a meaningful performance number; Phase 1+ binding cells will measure performance per `feedback_no_fixture_hardcoding.md`.
- **request_fd values 17–24** observed in the 5-frame window. With the cap_pool at 24 slots and 16 preallocated request_fds, fds map roughly per-slot per the iter6 binding pattern.
- **No errors, no EINVAL, no EBUSY.** The contract is clean end-to-end; iter4's frame-11 EINVAL bug from libva-multiplanar does not reproduce on RK3399 in this short window. (Longer-run bug repro will require a longer trace; that's a Phase 1+ task.)

### ftrace v4l2 events (kernel-side perspective)

Per frame, the kernel sees:

```
v4l2_qbuf  CAPTURE_MPLANE index=K bytesused=0 flags=MAPPED|QUEUED|...
v4l2_qbuf  OUTPUT_MPLANE  index=K bytesused=0 flags=MAPPED|TIMESTAMP_COPY|0x800080 timestamp=
v4l2_dqbuf OUTPUT_MPLANE  index=K flags=MAPPED|TIMESTAMP_COPY|0x800000   ← buffer marked DONE
v4l2_dqbuf CAPTURE_MPLANE index=K flags=MAPPED|TIMESTAMP_COPY            ← buffer marked DONE
```

`minor=3` confirms `/dev/video3` (rkvdec). Buffer indices cycle 0, 1, 2, 3, ..., using slots from the cap_pool.

The `0x800080` and `0x800000` bits in flags are `V4L2_BUF_FLAG_REQUEST_FD` (0x00800000) plus `V4L2_BUF_FLAG_IN_REQUEST` (0x00000080), confirming request-API binding is engaged.

The user-space-visible u64 timestamp (e.g., `1778187928425269000`) is an opaque tag as far as the kernel is concerned: it just needs to match between OUTPUT and CAPTURE for the kernel to pair them. It was taken here to be mpv's PTS in arbitrary units rather than wall-clock, but this example value decodes as nanoseconds since the Unix epoch on 2026-05-07 (the capture date), so the backend may be deriving it from wall-clock time; worth confirming. Either way, this is the standard request-API contract.

## Cache-safe pixel verification (PASS)

Goal per `phase0_findings.md` and the libva-multiplanar iter1 patch-0011 lesson: prove decoded pixels are **non-zero, non-sentinel, semantically-correct** via a cache-coherency-safe readback path.

### Method

mpv `--hwdec=vaapi --vo=image` does HW decode → vaExportSurfaceHandle DMA-BUF FD → EGL `EGL_EXT_image_dma_buf_import` → GL texture → glReadPixels → JPEG encode. The DMA-BUF import path on Mesa/panfrost includes correct cache management, so the readback sees what the kernel actually wrote.
Test 1: first 2 frames (BBB intro fade-in, solid dark content):

```
HW frame 1 sha256 = 05b74172e03dc3f10f26fd89f167aa0755bc448007943cc4a64f5b36556dfd68
SW frame 1 sha256 = 05b74172e03dc3f10f26fd89f167aa0755bc448007943cc4a64f5b36556dfd68 (BYTE-IDENTICAL)
HW frame 2 sha256 = 05b74172e03dc3f10f26fd89f167aa0755bc448007943cc4a64f5b36556dfd68
SW frame 2 sha256 = 05b74172e03dc3f10f26fd89f167aa0755bc448007943cc4a64f5b36556dfd68
```

All four hashes match: the BBB intro is solid-color enough that frames 1, 2 produce identical JPEGs across HW/SW. Useful but doesn't rule out "everything's solid color" coincidences.

Test 2: 2 frames at +30s seek (mid-content, real bunny motion):

```
seek30s HW frame 1 sha256 = f623d5f7a41697f67dd227275c6f1b21ffc257f65626d32fde8229357f8764c9 (651,726 bytes)
seek30s SW frame 1 sha256 = f623d5f7a41697f67dd227275c6f1b21ffc257f65626d32fde8229357f8764c9 (BYTE-IDENTICAL)
seek30s HW frame 2 sha256 = 7d7bc6f2146dda8b2d223bba622c4b9fbe9674181ff1e02afe286b620342e0a8 (630,433 bytes)
seek30s SW frame 2 sha256 = 7d7bc6f2146dda8b2d223bba622c4b9fbe9674181ff1e02afe286b620342e0a8 (BYTE-IDENTICAL)
```

Frames 1 vs 2 differ in size (real content changes), which confirms genuine motion content, not solid color. HW vs SW match byte-for-byte.

**Hardware H.264 decode on RK3399 / rkvdec / libva-v4l2-request-fourier @ iter8 produces bit-exact correct pixels against software reference**, when read via the cache-safe DMA-BUF GL import path.

JPEGs in `phase0_evidence/2026-05-07/h264_baseline/seek30s_frame{1,2}_{hw,sw}.jpg` (gitignored as binary; regenerable from the mpv invocation in the re-run incantation below).
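The HW-versus-SW hash comparison can also be automated instead of being checked by eye. The following is a minimal sketch, assuming the two output directories hold identically named `.jpg` frames (as `--vo=image` wrote them in this run) and that `sha256sum` from coreutils is available. `compare_frames` is a hypothetical helper name, not part of the captured evidence set.

```bash
#!/usr/bin/env bash
# compare_frames HW_DIR SW_DIR
# Prints one line per frame found in HW_DIR and returns:
#   0 if every frame has a byte-identical counterpart in SW_DIR,
#   1 if any frame is missing or differs,
#   2 if HW_DIR contains no .jpg frames at all.
compare_frames() {
    local hw_dir=$1 sw_dir=$2 status=0 hw sw name hw_sum sw_sum
    for hw in "$hw_dir"/*.jpg; do
        # An unmatched glob stays literal, so guard against an empty directory.
        [ -e "$hw" ] || { echo "no frames in $hw_dir" >&2; return 2; }
        name=$(basename "$hw")
        sw="$sw_dir/$name"
        if [ ! -f "$sw" ]; then
            echo "MISSING  $name"
            status=1
            continue
        fi
        hw_sum=$(sha256sum < "$hw" | cut -d' ' -f1)
        sw_sum=$(sha256sum < "$sw" | cut -d' ' -f1)
        if [ "$hw_sum" = "$sw_sum" ]; then
            echo "MATCH    $name $hw_sum"
        else
            echo "MISMATCH $name hw=$hw_sum sw=$sw_sum"
            status=1
        fi
    done
    return "$status"
}
```

With the directories used in this run, `compare_frames /tmp/h264_baseline/png_seek_hw /tmp/h264_baseline/png_seek_sw` would be expected to report a match for both frames; a non-zero exit status makes a future divergence easy to catch in a script.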
### Re-run incantation (cache-safe verify)

```bash
ssh fresnel '
  mkdir -p /tmp/h264_baseline/png_seek_hw /tmp/h264_baseline/png_seek_sw
  WAYLAND_DISPLAY=wayland-0 XDG_RUNTIME_DIR=/run/user/1000 \
  LIBVA_DRIVER_NAME=v4l2_request \
  LIBVA_V4L2_REQUEST_VIDEO_PATH=/dev/video3 \
  LIBVA_V4L2_REQUEST_MEDIA_PATH=/dev/media1 \
  mpv --hwdec=vaapi --frames=2 --vo=image --no-audio --no-input-default-bindings \
      --start=00:00:30 \
      --vo-image-outdir=/tmp/h264_baseline/png_seek_hw \
      ~/fourier-test/bbb_1080p30_h264.mp4
  mpv --hwdec=no --frames=2 --vo=image --no-audio --no-input-default-bindings \
      --start=00:00:30 \
      --vo-image-outdir=/tmp/h264_baseline/png_seek_sw \
      ~/fourier-test/bbb_1080p30_h264.mp4
  sha256sum /tmp/h264_baseline/png_seek_*/00000001.jpg /tmp/h264_baseline/png_seek_*/00000002.jpg
'
```

Both pairs should have matching hashes. If they diverge on a future run, that's a regression worth investigating.

## Cache-stale path bug: present on RK3399 (Phase 4 work item)

When pixels are read via the **cached mmap path** (libva's vaDeriveImage / vaMapBuffer, used by ffmpeg `-hwaccel vaapi -hwaccel_output_format nv12`), the readback is corrupted in exactly the iter1 patch-0011 pattern.

### Evidence

```
ffmpeg -hwaccel vaapi -hwaccel_output_format nv12 -i bbb_1080p30_h264.mp4 \
       -frames:v 2 -f rawvideo -y frames_hw.nv12

# Result on fresnel:
size           = 6,220,800 bytes (matches 2 × 1920×1080×1.5 NV12)
non-zero       = 544 (0.009%)
first 16 bytes = 81 81 80 80 80 7f 7f 7f 7f 7f 7f 80 80 80 81 81 (Y ≈ 128, gray)

Non-zero pattern:
  offsets 0..15      = non-zero (16 consecutive)
  offset  16..1919   = ZERO (1904 bytes)
  offsets 1920..1935 = non-zero (16 consecutive, next row start)
  offset  1936..3839 = ZERO
  ... pattern repeats for ~32 rows
  rest of buffer     = mostly ZERO with stride-8 specks at the end
```

Compared against software-decoded reference (same ffmpeg invocation without `-hwaccel vaapi`):

```
SW frames_sw.nv12: size=6,220,800, non-zero=100% (every byte non-zero, plausible black-frame Y=16 fill)
Bytewise diff HW vs SW: 100% of bytes differ
Mean absolute error per byte: 53.3 (vs ~0–2 expected for matched-codec rounding)
```

### Diagnosis

The 16-byte non-zero stripes at every 1920-byte boundary, with the rest reading as zero, are the canonical **stale cached-mmap** pattern from libva-multiplanar iter1 patch-0011. The kernel is writing real pixels to a DMA-coherent buffer, but the libva backend's image-export path returns a cached pointer without the proper cache-invalidation incantation. Userspace then reads stale memory: mostly the all-zero state from before the kernel wrote, punctuated by whatever happened to land in cache lines that got fetched after the write.

### Implication

- **Boolean correctness for Phase 0**: ✅ PASS. The kernel produces correct pixels (proven via DMA-BUF GL import). Bug is in the libva backend's *export path*, not in the decode itself.
- **Phase 4 work item**: port or audit the iter1 patch-0011 cache-flush fix on the RK3399 path. The fix landed in the libva-multiplanar fork on ohm; the iter8 master tip should already carry it. Either:
  - (a) the fix is present and effective on RK3568 (ohm) but not effective on RK3399 (different cache topology / different DMA mapping mode), OR
  - (b) the fix is present but fresnel's kernel routes the buffer through a path that bypasses the flush (e.g., V4L2_MEMORY_MMAP page protection differs), OR
  - (c) the fix is conditional on something that doesn't hold on RK3399.
- **Phase 1+ binding cells must use the DMA-BUF GL import path for pixel verification**, not vaDeriveImage / cached-mmap. This is the iter1 lesson restated: the cached-mmap readback is unreliable on this hardware family.
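The non-zero counts quoted in the Evidence block can be re-measured directly from the raw NV12 dump. The following is a minimal sketch, assuming coreutils (`tr`, `wc`, `dd`) and the tightly packed 1920-byte luma stride implied by the 6,220,800-byte file size. `count_nonzero` and `row_nonzero` are hypothetical helper names, not part of the captured artefacts.

```bash
#!/usr/bin/env bash
# count_nonzero FILE
# Prints the number of non-zero bytes in FILE (NUL bytes are dropped, the rest counted).
count_nonzero() {
    tr -d '\000' < "$1" | wc -c | tr -d ' '
}

# row_nonzero FILE ROW [STRIDE]
# Prints the number of non-zero bytes in the STRIDE-byte block number ROW
# (0-based). STRIDE defaults to 1920, an assumption matching this fixture;
# the block is a luma row only while ROW stays inside the first frame's Y plane.
row_nonzero() {
    local file=$1 row=$2 stride=${3:-1920}
    dd if="$file" bs="$stride" skip="$row" count=1 2>/dev/null \
        | tr -d '\000' | wc -c | tr -d ' '
}
```

On a run that shows the same bug, `count_nonzero frames_hw.nv12` should reproduce the 544 figure and `row_nonzero frames_hw.nv12 0` the 16-byte stripe at the start of the first row; once the export path is fixed, both numbers should rise to roughly the full buffer and row sizes.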
## Comparison against ohm iter5/iter8 trace (deferred)

The libva-multiplanar campaign has phase8_iteration[1-8]_close.md docs but I haven't located a directly comparable strace/ftrace dump for an apples-to-apples diff. A deeper Phase 0/Phase 1 compare would:

1. Run the same `mpv --hwdec=vaapi-copy --frames=5` under strace+ftrace on ohm with the same iter8 backend.
2. Diff the merged_ioctls.tsv files.
3. Identify any RK3568-vs-RK3399-specific divergences, e.g., does the rkvdec-bound contract differ from hantro-vpu-bound contract structurally? (Likely no: both are stateless V4L2 decoders following the same request-API.)

Defer to Phase 1 lock since it doesn't gate boolean correctness.

## What this leaves Phase 0 with

| Deliverable | Status |
|---|---|
| #1 SDDM recovery | done as watchpoint |
| #2 V4L2 inventory | done |
| #3 fork build + vainfo smoke | done |
| **#4 H.264 baseline trace + cache-safe pixel verify** | **done: PASS for boolean correctness; cache-stale bug in vaDeriveImage flagged for Phase 4** |
| #5 per-codec test fixtures | next |
| #6 chromium-fourier cross-validator trace | needs #5 |
| Phase 0 close commit | last |

## Stretch finding worth noting (not gating)

mpv stdout reports `Using hardware decoding (vaapi-copy)`. Even though the iter8 backend has h265.c excluded from the build, mpv defaults to H.264 path (since the test fixture is H.264) and our backend handles it cleanly. No reason to think H.264 has any RK3399-specific weirdness in iter8 master beyond the cache-stale readback noted above.

ffmpeg version: `n8.1-13-gb57fbbe50c`. This is the active downstream `code.ffmpeg.org/Kwiboo/FFmpeg.git` branch `v4l2-request-n8.1` referenced in libva-multiplanar phase0_findings.md. The h264_v4l2request decoder is engaged via the `vaapi` hwaccel through libva → our backend → kernel. Same dispatch path as ohm.