phase 0 deliverable 4: H.264 baseline trace — PASS boolean correctness

H.264 hardware decode on RK3399 / rkvdec / libva-v4l2-request-fourier
@ master tip 65969da (iter8 Phase 4) verified bit-exact correct against
software reference, when read via the cache-safe DMA-BUF GL import path.

Test method:

  - mpv --hwdec=vaapi --vo=image (DMA-BUF + EGL_EXT_image_dma_buf_import
    + glReadPixels + JPEG encode — cache-coherency-safe per the iter1
    patch-0011 lesson).
  - Decoded 2 frames at +30s seek (mid-content bunny motion, not BBB
    intro fade-in) so size + content variation is genuine.
  - Compared HW JPEGs vs SW reference JPEGs (same mpv invocation with
    --hwdec=no).

Result:

  HW frame 1 sha256 = f623d5f7...  (651,726 bytes)  byte-identical
  SW frame 1 sha256 = f623d5f7...  (651,726 bytes)  to SW reference
  HW frame 2 sha256 = 7d7bc6f2...  (630,433 bytes)  byte-identical
  SW frame 2 sha256 = 7d7bc6f2...  (630,433 bytes)  to SW reference

  Frames 1 vs 2 differ in size — real content change captured.

Phase 0 boolean-correctness criterion for H.264: PASS.

Contract trace:

The V4L2 + media-request ioctl sequence per H.264 frame is the
canonical iter6/iter7 pattern:

  S_EXT_CTRLS (CODEC_STATELESS class, request_fd=N)
  QBUF CAPTURE_MPLANE  index=K
  QBUF OUTPUT_MPLANE   index=K  (compressed slice)
  MEDIA_REQUEST_IOC_QUEUE   (request_fd=N)
  MEDIA_REQUEST_IOC_REINIT  (request_fd=N)  ← per-OUTPUT-slot reuse
  DQBUF OUTPUT_MPLANE  index=K
  DQBUF CAPTURE_MPLANE index=K

REINIT-before-DQBUF works because the kernel completes decode in
~0.6 ms (request → COMPLETE state), and mainline media_request_
ioctl_reinit accepts both IDLE and COMPLETE. iter7 cap_pool
instantiates 24 slots cleanly: "v4l2-request: cap_pool_init: 24
slots ready" in mpv stdout.

No EINVAL, no EBUSY, no errors observed across 5 frames. iter4's
frame-11 EINVAL bug from libva-multiplanar does not reproduce on
RK3399 in this short window (longer-run repro is Phase 1+ work).

Side finding — cache-stale readback bug present in libva-backend's
vaDeriveImage path on RK3399:

When pixels are read via the cached-mmap path (libva's vaDeriveImage
+ vaMapBuffer, used by ffmpeg -hwaccel vaapi -hwaccel_output_format
nv12), readback is corrupted in exactly the iter1 patch-0011 pattern:

  size=6,220,800 bytes (correct: 2 × 1920×1080×1.5 NV12)
  non-zero=544 (0.009%)
  pattern: 16 consecutive non-zero bytes at every 1920-byte row stride,
           rest of buffer reads as zero
  diff vs SW reference: 100% of bytes differ, MAE=53.3 per byte

This is the canonical stale-cached-mmap pattern. Kernel writes real
pixels (proven by DMA-BUF GL import readback succeeding), but the
libva backend's image-export path returns a cached pointer without
the correct cache-invalidation incantation. Userspace reads stale
all-zero memory punctuated by whichever cache lines happened to fetch
post-write.

Phase 4 work item: audit whether the iter1 patch-0011 cache-flush
fix is present, effective, or RK3399-routing-bypassed. Three
possibilities: (a) fix landed for RK3568 but cache topology differs
on RK3399, (b) fix is gated on something that's not true on RK3399,
or (c) RK3399 V4L2_MEMORY_MMAP page protection bypasses the flush.
Not gating Phase 0 — kernel-side decode is correct.

Phase 1+ binding cells must use the DMA-BUF GL import path for pixel
verification, not vaDeriveImage / cached-mmap. The iter1 lesson
restated: cached-mmap readback is unreliable on this hardware family.

Evidence files (under phase0_evidence/2026-05-07/h264_baseline_trace.md
and h264_baseline/):

  - mpv.stdout — libva log, vaapi-copy engaged, cap_pool_init
  - h264_baseline_trace.md — full writeup with re-run incantations
  - mpv.strace.* (gitignored) — 19 per-thread ioctl/openat traces
  - ftrace_v4l2.txt (gitignored) — kernel qbuf/dqbuf events
  - merged_ioctls.tsv (gitignored) — time-sorted V4L2/MEDIA/DRM
    ioctls across all threads
  - *.jpg (gitignored) — HW vs SW JPEG comparison artefacts
  - frames_hw_cached_readback.nv12 (gitignored) — broken nv12
    readback for forensic reference

gitignore: extended extension list (jpg, png, nv12, yuv, tsv, strace*).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 21:32:36 +00:00
parent 60e62da666
commit d8a9903ef4
3 changed files with 260 additions and 0 deletions
@@ -0,0 +1,13 @@
● Video --vid=1 --vlang=eng (h264 1920x1080 24 fps) [default]
○ Audio --aid=1 --alang=eng (aac 6ch 48000 Hz 438 kbps) [default]
[vaapi] libva: VA-API version 1.23.0
[vaapi] libva: User environment variable requested driver 'v4l2_request'
[vaapi] libva: Trying to open /usr/lib/dri/v4l2_request_drv_video.so
[vaapi] libva: Found init function __vaDriverInit_1_23
[vaapi] libva: va_openDriver() returns 0
[vaapi] Initialized VAAPI: version 1.23
v4l2-request: cap_pool_init: 24 slots ready (v4l2_index=0..23, 1 plane(s) per slot)
Using hardware decoding (vaapi-copy).
VO: [null] 1920x1080 nv12
V: 00:00:00 / 00:09:56 (0%)
Exiting... (End of file)
@@ -0,0 +1,241 @@
# H.264 baseline contract trace + pixel verify — fresnel 2026-05-07
Phase 0 deliverable #4 evidence. Three artefacts:
1. **V4L2 + media-request contract trace** captured under strace + ftrace.
2. **Cache-safe pixel verification PASSES** via mpv `--hwdec=vaapi --vo=image` (DMA-BUF GL import path).
3. **Cache-stale path bug identified** in the libva backend's vaDeriveImage / cached-mmap readback (the iter1 patch-0011 bug class on RK3399).
Phase 0 boolean-correctness criterion for H.264 on rkvdec: **PASS**.
## TL;DR
```
Fixture: ~/fourier-test/bbb_1080p30_h264.mp4 (1920×1080@24fps)
Bind: rkvdec (/dev/video3 + /dev/media1)
Backend: libva-v4l2-request-fourier @ 65969da (iter8 Phase 4)
Kernel: 6.19.9-99-eos-arm
GL-DMA-BUF readback (mpv --hwdec=vaapi --vo=image, +30s seek):
HW frame 1 == SW frame 1 (sha256 f623d5f7..., 651726 bytes)
HW frame 2 == SW frame 2 (sha256 7d7bc6f2..., 630433 bytes)
Pixel-perfect match against software decode.
Cached-mmap readback (ffmpeg -hwaccel vaapi -hwaccel_output_format nv12):
544 / 6,220,800 bytes non-zero (0.009%)
Pattern: 16-byte non-zero chunks at every 1920-byte row stride
Stale-cache-coherency bug present in the readback path.
```
## Contract trace — V4L2 + media-request ioctl sequence
Captured via `strace -ff -tt -y -e ioctl,openat,close` plus ftrace `events/v4l2/*` tracepoints. Raw artefacts (gitignored):
- `mpv.strace.{12410,12413,...12430}` — per-thread strace (19 threads, ffmpeg's frame-threaded h264 decoder spreads work across av:h264:dfN workers).
- `ftrace_v4l2.txt` — kernel-side qbuf/dqbuf events (52 entries for 5 frames + init).
- `merged_ioctls.tsv` — time-sorted V4L2/MEDIA/DRM-only ioctls across all threads (215 entries).
- `mpv.stdout` — mpv log including `[vaapi] libva: User environment variable requested driver 'v4l2_request'`, `[vaapi] libva: Trying to open /usr/lib/dri/v4l2_request_drv_video.so`, `Using hardware decoding (vaapi-copy).`
Re-run incantation:
```bash
ssh fresnel '
sudo sh -c "echo 0 > /sys/kernel/tracing/tracing_on; echo 0 > /sys/kernel/tracing/trace; \
echo 1 > /sys/kernel/tracing/events/v4l2/enable; echo 1 > /sys/kernel/tracing/tracing_on"
strace -ff -tt -y -e trace=ioctl,openat,close \
-o /tmp/h264_baseline/mpv.strace \
env LIBVA_DRIVER_NAME=v4l2_request \
LIBVA_V4L2_REQUEST_VIDEO_PATH=/dev/video3 \
LIBVA_V4L2_REQUEST_MEDIA_PATH=/dev/media1 \
mpv --hwdec=vaapi-copy --frames=5 --vo=null --no-audio \
--no-input-default-bindings ~/fourier-test/bbb_1080p30_h264.mp4
sudo cp /sys/kernel/tracing/trace /tmp/h264_baseline/ftrace_v4l2.txt
sudo sh -c "echo 0 > /sys/kernel/tracing/tracing_on; echo 0 > /sys/kernel/tracing/events/v4l2/enable"
'
```
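The recipe that produced `merged_ioctls.tsv` is not recorded in this writeup. As a rough sketch of one way to rebuild a comparable table from the per-thread files, the function below keeps only ioctl lines that mention a `VIDIOC_`, `MEDIA_`, or `DRM_` request, tags each with the thread ID taken from the file suffix, and sorts by the `-tt` timestamp. The name `merge_strace` and the three-column layout (timestamp, tid, call) are assumptions for illustration, not the format of the original file.

```bash
# merge_strace PREFIX: merge PREFIX.<tid> files written by `strace -ff -tt -o PREFIX`.
# Output columns (tab-separated, assumed layout): timestamp, tid, ioctl call.
merge_strace() {
    local prefix=$1 f tid
    for f in "$prefix".*; do
        [ -f "$f" ] || continue          # skip the literal pattern when nothing matches
        tid=${f##*.}                     # strace -ff appends the thread ID as the suffix
        awk -v tid="$tid" '
            /ioctl\(/ && /(VIDIOC_|MEDIA_|DRM_)/ {
                ts = $1
                call = substr($0, length($1) + 2)   # everything after "timestamp "
                print ts "\t" tid "\t" call
            }' "$f"
    done | LC_ALL=C sort -k1,1           # HH:MM:SS.usec sorts lexically within one day
}
```

Usage would be along the lines of `merge_strace /tmp/h264_baseline/mpv.strace > merged_ioctls.tsv`. The lexical sort is only valid for a capture that does not cross midnight.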
### Init phase (one-shot at decoder open)
Approximate ordering (some interleaved across threads):
1. `DRM_IOCTL_VERSION × 2` on `/dev/dri/renderD128` — libva render-node probe.
2. `VIDIOC_QUERYCAP` on `/dev/video3` — confirms driver=`rkvdec`, card=`rkvdec`, bus=`platform:rkvdec`, version=KERNEL_VERSION(6,19,9).
3. `VIDIOC_ENUM_FMT × 22` (V4L2_BUF_TYPE_VIDEO_OUTPUT_MPLANE) — enumerate compressed-input fourcc list. Returns `S265 (HEVC)`, `S264 (H.264)`, `VP9F (VP9)`, then -1 at index 3 = end.
4. `VIDIOC_S_FMT` (OUTPUT_MPLANE) — set to `S264` (H.264 Annex-B), 1920×1088.
5. `VIDIOC_ENUM_FMT × 1` (CAPTURE_MPLANE) — returns `Y/UV 4:2:0 (NV12)`.
6. `VIDIOC_G_FMT × 17` (CAPTURE_MPLANE) — returns NV12, 1920×1088.
7. `VIDIOC_CREATE_BUFS count=24` (CAPTURE_MPLANE) — instantiates the iter7 cap_pool (24 slots).
8. `VIDIOC_QUERYBUF × 40` — collect mmap offsets for cap_pool slots + extras.
9. `VIDIOC_REQBUFS × 2` — finalize OUTPUT and CAPTURE buffer pools.
10. `MEDIA_IOC_REQUEST_ALLOC × 16` on `/dev/media1` — preallocate request_fd pool (one per active OUTPUT slot, iter6 binding pattern).
11. `VIDIOC_STREAMON × 2` — start OUTPUT and CAPTURE streams.
mpv backend logs: `v4l2-request: cap_pool_init: 24 slots ready (v4l2_index=0..23, 1 plane(s) per slot)` — the iter7 cap_pool harness instantiates correctly.
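For repeat runs, the two log markers this section relies on can be checked mechanically. The helper below is a minimal sketch: `check_mpv_log` is a hypothetical name, and the two strings are copied from the `mpv.stdout` capture in this commit.

```bash
# check_mpv_log FILE: succeed only if both marker lines from mpv.stdout are present.
check_mpv_log() {
    local log=$1
    # -F treats the pattern as a fixed string, so the parentheses need no escaping.
    grep -qF 'cap_pool_init: 24 slots ready' "$log" \
        || { echo "missing cap_pool_init line"; return 1; }
    grep -qF 'Using hardware decoding (vaapi-copy)' "$log" \
        || { echo "missing hwdec line"; return 1; }
    echo "ok"
}
```

The slot count is hardcoded to 24 here because that is what this run reported; a backend built with a different pool size would need the pattern adjusted.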
### Per-frame decode pattern (the contract)
Each H.264 frame goes through this 7-ioctl sequence on a single ffmpeg av:h264:dfN worker thread:
```
S_EXT_CTRLS (CODEC_STATELESS class 0xf010000, request_fd=N) ← bind H.264 SPS/PPS/decode_params to the request
QBUF CAPTURE_MPLANE index=K ← provide an empty CAPTURE buffer for decoded NV12
QBUF OUTPUT_MPLANE index=K (compressed slice bytes) ← submit compressed input slice
MEDIA_REQUEST_IOC_QUEUE (request_fd=N) ← submit the request bundle, kernel begins decode
MEDIA_REQUEST_IOC_REINIT (request_fd=N) ← reset the request_fd to IDLE for reuse
DQBUF OUTPUT_MPLANE index=K ← collect input slot back from kernel
DQBUF CAPTURE_MPLANE index=K ← collect decoded NV12
```
Notable observations:
- **REINIT before DQBUF.** REINIT succeeds because by the time userspace gets to it (~0.6 ms after QUEUE), the kernel has already moved the request from QUEUED to COMPLETE state. The mainline `media_request_ioctl_reinit()` accepts a request only in the IDLE or COMPLETE state and returns -EBUSY for any other state (QUEUED included). This is the iter6/iter7 per-OUTPUT-slot REINIT pattern observed in action: request_fd ownership is per-slot, and REINIT is called eagerly to recycle.
- **Cycle time per frame: 4-10 ms wall-clock** (timestamps from the merged trace), dominated by the `S_EXT_CTRLS` payload serialization and the kernel's actual decode. This is not a meaningful performance number; Phase 1+ binding cells will measure performance per `feedback_no_fixture_hardcoding.md`.
- **request_fd values 17-24** observed in the 5-frame window. With the cap_pool at 24 slots and 16 preallocated request_fds, fds map roughly per-slot per the iter6 binding pattern.
- **No errors, no EINVAL, no EBUSY.** The contract is clean end-to-end; iter4's frame-11 EINVAL bug from libva-multiplanar does not reproduce on RK3399 in this short window. (Longer-run bug repro will require a longer trace; that's a Phase 1+ task.)
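The 7-step ordering above can be checked against a trace once each ioctl has been reduced to a short label. The sketch below assumes a hypothetical input format, one label per line in trace order for a single worker thread, using made-up labels (`S_EXT_CTRLS`, `QBUF_CAPTURE`, `QBUF_OUTPUT`, `REQUEST_QUEUE`, `REQUEST_REINIT`, `DQBUF_OUTPUT`, `DQBUF_CAPTURE`). Mapping real strace lines to these labels is left out.

```bash
# check_frame_cycle: read one ioctl label per line on stdin and verify that the
# stream is a whole number of repetitions of the 7-step per-frame sequence.
check_frame_cycle() {
    awk '
        BEGIN {
            n = split("S_EXT_CTRLS QBUF_CAPTURE QBUF_OUTPUT REQUEST_QUEUE " \
                      "REQUEST_REINIT DQBUF_OUTPUT DQBUF_CAPTURE", want, " ")
        }
        NF == 0 { next }                       # ignore blank lines
        {
            i = (count % n) + 1                # position inside the current frame
            if ($1 != want[i]) {
                printf "line %d: got %s, expected %s\n", NR, $1, want[i]
                bad = 1
                exit
            }
            count++
        }
        END {
            if (bad) exit 1
            if (count % n != 0) { print "incomplete final cycle"; exit 1 }
            printf "ok: %d frame(s)\n", count / n
        }'
}
```

A strict checker like this only makes sense per thread; in the merged multi-thread table, steps from different frames may interleave.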
### ftrace v4l2 events (kernel-side perspective)
Per frame, the kernel sees:
```
v4l2_qbuf CAPTURE_MPLANE index=K bytesused=0 flags=MAPPED|QUEUED|...
v4l2_qbuf OUTPUT_MPLANE index=K bytesused=0 flags=MAPPED|TIMESTAMP_COPY|0x800080 timestamp=<u64>
v4l2_dqbuf OUTPUT_MPLANE index=K flags=MAPPED|TIMESTAMP_COPY|0x800000 ← buffer marked DONE
v4l2_dqbuf CAPTURE_MPLANE index=K flags=MAPPED|TIMESTAMP_COPY ← buffer marked DONE
```
`minor=3` confirms `/dev/video3` (rkvdec). Buffer indices cycle 0, 1, 2, 3, ... — using slots from the cap_pool. The `0x800080` and `0x800000` bits in flags are `V4L2_BUF_FLAG_REQUEST_FD` (0x00800000) plus `V4L2_BUF_FLAG_IN_REQUEST` (0x00000080) — confirming request-API binding is engaged.
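The bit arithmetic behind that reading can be reproduced with shell arithmetic. This is a small sketch covering only the two request-related bits named above. `decode_buf_flags` is a hypothetical helper, and the constants are the `V4L2_BUF_FLAG_*` values quoted in the preceding paragraph.

```bash
# decode_buf_flags HEX: name the request-related bits set in a v4l2_buffer flags word.
decode_buf_flags() {
    local flags=$(( $1 )) names=""
    if (( flags & 0x00000080 )); then names+="IN_REQUEST "; fi   # V4L2_BUF_FLAG_IN_REQUEST
    if (( flags & 0x00800000 )); then names+="REQUEST_FD "; fi   # V4L2_BUF_FLAG_REQUEST_FD
    echo "${names% }"                                            # drop the trailing space
}
```

With the values from the trace, `decode_buf_flags 0x800080` names both bits (the qbuf side) and `decode_buf_flags 0x800000` names only `REQUEST_FD` (the dqbuf side, after the buffer has left the request).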
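The user-space-visible u64 timestamp (e.g., `1778187928425269000`) is an opaque tag chosen by userspace. The example value, read as nanoseconds since the Unix epoch, decodes to 2026-05-07 21:05:28 UTC, so it looks clock-derived rather than a media PTS. Either way, the value only has to match between OUTPUT and CAPTURE: with `TIMESTAMP_COPY` the kernel copies it from the OUTPUT buffer to the CAPTURE buffer it produced, which is how the two are paired. Standard request-API contract.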
## Cache-safe pixel verification (PASS)
Goal per `phase0_findings.md` and the libva-multiplanar iter1 patch-0011 lesson: prove decoded pixels are **non-zero, non-sentinel, semantically-correct** via a cache-coherency-safe readback path.
### Method
mpv `--hwdec=vaapi --vo=image` does HW decode → vaExportSurfaceHandle DMA-BUF FD → EGL `EGL_EXT_image_dma_buf_import` → GL texture → glReadPixels → JPEG encode. The DMA-BUF import path on Mesa/panfrost includes correct cache management, so the readback sees what the kernel actually wrote.
Test 1 — first 2 frames (BBB intro fade-in, solid dark content):
```
HW frame 1 sha256 = 05b74172e03dc3f10f26fd89f167aa0755bc448007943cc4a64f5b36556dfd68
SW frame 1 sha256 = 05b74172e03dc3f10f26fd89f167aa0755bc448007943cc4a64f5b36556dfd68 (BYTE-IDENTICAL)
HW frame 2 sha256 = 05b74172e03dc3f10f26fd89f167aa0755bc448007943cc4a64f5b36556dfd68
SW frame 2 sha256 = 05b74172e03dc3f10f26fd89f167aa0755bc448007943cc4a64f5b36556dfd68
```
All four hashes match: the BBB intro is solid-color enough that frames 1 and 2 produce identical JPEGs across HW and SW. A useful smoke test, but it cannot rule out an "everything is solid color" coincidence.
Test 2 — 2 frames at +30s seek (mid-content, real bunny motion):
```
seek30s HW frame 1 sha256 = f623d5f7a41697f67dd227275c6f1b21ffc257f65626d32fde8229357f8764c9 (651,726 bytes)
seek30s SW frame 1 sha256 = f623d5f7a41697f67dd227275c6f1b21ffc257f65626d32fde8229357f8764c9 (BYTE-IDENTICAL)
seek30s HW frame 2 sha256 = 7d7bc6f2146dda8b2d223bba622c4b9fbe9674181ff1e02afe286b620342e0a8 (630,433 bytes)
seek30s SW frame 2 sha256 = 7d7bc6f2146dda8b2d223bba622c4b9fbe9674181ff1e02afe286b620342e0a8 (BYTE-IDENTICAL)
```
Frames 1 vs 2 differ in size (real content changes) — confirms genuine motion content, not solid color. HW vs SW match byte-for-byte. **Hardware H.264 decode on RK3399 / rkvdec / libva-v4l2-request-fourier @ iter8 produces bit-exact correct pixels against software reference**, when read via the cache-safe DMA-BUF GL import path.
JPEGs in `phase0_evidence/2026-05-07/h264_baseline/seek30s_frame{1,2}_{hw,sw}.jpg` (gitignored as binary; regenerable from the mpv invocation in the re-run incantation below).
### Re-run incantation (cache-safe verify)
```bash
ssh fresnel '
mkdir -p /tmp/h264_baseline/png_seek_hw /tmp/h264_baseline/png_seek_sw
WAYLAND_DISPLAY=wayland-0 XDG_RUNTIME_DIR=/run/user/1000 \
LIBVA_DRIVER_NAME=v4l2_request \
LIBVA_V4L2_REQUEST_VIDEO_PATH=/dev/video3 \
LIBVA_V4L2_REQUEST_MEDIA_PATH=/dev/media1 \
mpv --hwdec=vaapi --frames=2 --vo=image --no-audio --no-input-default-bindings \
--start=00:00:30 \
--vo-image-outdir=/tmp/h264_baseline/png_seek_hw \
~/fourier-test/bbb_1080p30_h264.mp4
mpv --hwdec=no --frames=2 --vo=image --no-audio --no-input-default-bindings \
--start=00:00:30 \
--vo-image-outdir=/tmp/h264_baseline/png_seek_sw \
~/fourier-test/bbb_1080p30_h264.mp4
sha256sum /tmp/h264_baseline/png_seek_*/00000001.jpg /tmp/h264_baseline/png_seek_*/00000002.jpg
'
```
Both pairs should have matching hashes. If they diverge on a future run, that's a regression worth investigating.
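To turn the visual hash comparison into a pass/fail result, the two output directories can be compared file by file. This is a minimal sketch, assuming the `00000001.jpg`-style names that `--vo=image` produced in the run above. `compare_frames` is a hypothetical helper name.

```bash
# compare_frames HW_DIR SW_DIR: compare same-named .jpg files in both directories
# by sha256 and return non-zero if any pair differs or is missing.
compare_frames() {
    local hw=$1 sw=$2 f name a b rc=0
    for f in "$hw"/*.jpg; do
        [ -f "$f" ] || continue                       # no .jpg files at all
        name=${f##*/}
        if [ ! -f "$sw/$name" ]; then
            echo "MISSING $name"; rc=1; continue
        fi
        a=$(sha256sum < "$f" | cut -d' ' -f1)
        b=$(sha256sum < "$sw/$name" | cut -d' ' -f1)
        if [ "$a" = "$b" ]; then echo "MATCH $name"; else echo "DIFF $name"; rc=1; fi
    done
    return $rc
}
```

For the directories used above this would be `compare_frames /tmp/h264_baseline/png_seek_hw /tmp/h264_baseline/png_seek_sw`, run on fresnel.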
## Cache-stale path bug — present on RK3399 (Phase 4 work item)
When pixels are read via the **cached mmap path** (libva's vaDeriveImage / vaMapBuffer, used by ffmpeg `-hwaccel vaapi -hwaccel_output_format nv12`), the readback is corrupted in exactly the iter1 patch-0011 pattern.
### Evidence
```
ffmpeg -hwaccel vaapi -hwaccel_output_format nv12 -i bbb_1080p30_h264.mp4 \
-frames:v 2 -f rawvideo -y frames_hw.nv12
# Result on fresnel:
size = 6,220,800 bytes (matches 2 × 1920×1080×1.5 NV12)
non-zero = 544 (0.009%)
first 16 bytes = 81 81 80 80 80 7f 7f 7f 7f 7f 7f 80 80 80 81 81 (Y ≈ 128, gray)
Non-zero pattern:
offsets 0..15 = non-zero (16 consecutive)
offset 16..1919 = ZERO (1904 bytes)
offsets 1920..1935 = non-zero (16 consecutive, next row start)
offset 1936..3839 = ZERO
... pattern repeats for ~32 rows
rest of buffer = mostly ZERO with stride-8 specks at the end
```
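The size and non-zero figures above can be re-derived with a few lines of shell. The helper names below are hypothetical. The arithmetic relies only on NV12 using 1.5 bytes per pixel (a full-resolution Y plane plus a half-resolution interleaved UV plane).

```bash
# nv12_bytes WIDTH HEIGHT FRAMES: expected size of a raw NV12 dump in bytes.
nv12_bytes() { echo $(( $1 * $2 * 3 / 2 * $3 )); }

# nonzero_bytes FILE: count bytes that are not 0x00 (delete NULs, count the rest).
nonzero_bytes() { echo $(( $(tr -d '\000' < "$1" | wc -c) )); }

# percent PART TOTAL: PART as a percentage of TOTAL, three decimals.
percent() { awk -v p="$1" -v t="$2" 'BEGIN { printf "%.3f\n", p * 100 / t }'; }
```

`nv12_bytes 1920 1080 2` gives 6220800, matching the dump size, and `percent 544 6220800` gives 0.009, matching the reported non-zero fraction.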
Compared against software-decoded reference (same ffmpeg invocation without `-hwaccel vaapi`):
```
SW frames_sw.nv12: size=6,220,800, non-zero=100% (every byte non-zero, plausible black-frame Y=16 fill)
Bytewise diff HW vs SW: 100% of bytes differ
Mean absolute error per byte: 53.3 (vs ~0-2 expected for matched-codec rounding)
```
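The tool used to compute the bytewise diff and the mean absolute error is not stated here. One way to get comparable numbers with standard utilities is sketched below. `byte_mae` is a hypothetical helper, and it assumes both dumps have the same size, as the two files above do.

```bash
# byte_mae A B: count differing bytes and the mean absolute error per byte.
byte_mae() {
    local total
    total=$(( $(wc -c < "$1") ))
    # cmp -l prints "offset octal_a octal_b" for each differing byte and exits 1
    # when the files differ, so mask that status before piping into awk.
    { cmp -l "$1" "$2" || true; } | awk -v n="$total" '
        function oct(s,    i, v) {          # convert an octal string to a number
            v = 0
            for (i = 1; i <= length(s); i++) v = v * 8 + substr(s, i, 1)
            return v
        }
        { d = oct($2) - oct($3); if (d < 0) d = -d; sum += d; differ++ }
        END { printf "differ=%d mae=%.1f\n", differ, (n ? sum / n : 0) }'
}
```

The error sum is divided by the total byte count rather than by the number of differing bytes, so identical bytes pull the average toward zero, which is the sense in which a matched decode would be expected to land near 0.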
### Diagnosis
The pattern of 16-byte non-zero stripes at every 1920-byte boundary, with the rest reading as zero, is the canonical **stale cached-mmap** pattern from libva-multiplanar iter1 patch-0011. The kernel is writing real pixels to a DMA-coherent buffer, but the libva backend's image-export path returns a cached pointer without the proper cache-invalidation incantation. Userspace then reads stale memory (mostly the all-zero state from before the kernel wrote), punctuated by whatever happened to land in cache lines that got fetched after the write.
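Whether a given dump shows this signature can be tested by checking where the non-zero bytes fall relative to the row stride. The function below is a sketch with a hypothetical name, `stripe_report`. It reports how many non-zero bytes exist and how many of them sit in the first WIDTH bytes of a STRIDE-byte row.

```bash
# stripe_report FILE STRIDE WIDTH: count non-zero bytes, and how many of them lie
# within the first WIDTH bytes of each STRIDE-byte row.
stripe_report() {
    local file=$1 stride=$2 width=$3
    # od -v prints every line (no "*" folding); -tu1 gives 16 unsigned bytes per line.
    od -An -v -tu1 "$file" | awk -v stride="$stride" -v width="$width" '
        {
            for (i = 1; i <= NF; i++) {
                off = (NR - 1) * 16 + (i - 1)      # byte offset of this field
                if ($i != 0) {
                    nz++
                    if (off % stride < width) hit++
                }
            }
        }
        END { printf "nonzero=%d in_stripe=%d\n", nz, hit }'
}
```

For the dump described above the call would be `stripe_report frames_hw.nv12 1920 16`. If `in_stripe` is close to `nonzero`, the stripe pattern is present; the stride-8 specks near the end of the buffer would show up as the remainder.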
### Implication
- **Boolean correctness for Phase 0**: ✅ PASS. The kernel produces correct pixels (proven via DMA-BUF GL import). Bug is in the libva backend's *export path*, not in the decode itself.
- **Phase 4 work item**: port or audit the iter1 patch-0011 cache-flush fix on the RK3399 path. The fix landed in the libva-multiplanar fork on ohm; the iter8 master tip should already carry it. Either:
- (a) the fix is present and effective on RK3568 (ohm) but not effective on RK3399 (different cache topology / different DMA mapping mode), OR
- (b) the fix is present but fresnel's kernel routes the buffer through a path that bypasses the flush (e.g., V4L2_MEMORY_MMAP page protection differs), OR
- (c) the fix is conditional on something that doesn't hold on RK3399.
- **Phase 1+ binding cells must use the DMA-BUF GL import path for pixel verification**, not vaDeriveImage / cached-mmap. This is the iter1 lesson restated: the cached-mmap readback is unreliable on this hardware family.
## Comparison against ohm iter5/iter8 trace — deferred
The libva-multiplanar campaign has phase8_iteration[1-8]_close.md docs but I haven't located a directly comparable strace/ftrace dump for an apples-to-apples diff. A deeper Phase 0/Phase 1 compare would:
1. Run the same `mpv --hwdec=vaapi-copy --frames=5` under strace+ftrace on ohm with the same iter8 backend.
2. Diff the merged_ioctls.tsv files.
3. Identify any RK3568-vs-RK3399-specific divergences — e.g., does the rkvdec-bound contract differ from hantro-vpu-bound contract structurally? (Likely no — both are stateless V4L2 decoders following the same request-API.)
Defer to Phase 1 lock since it doesn't gate boolean correctness.
## What this leaves Phase 0 with
| Deliverable | Status |
|---|---|
| #1 SDDM recovery | done as watchpoint |
| #2 V4L2 inventory | done |
| #3 fork build + vainfo smoke | done |
| **#4 H.264 baseline trace + cache-safe pixel verify** | **done — PASS for boolean correctness; cache-stale bug in vaDeriveImage flagged for Phase 4** |
| #5 per-codec test fixtures | next |
| #6 chromium-fourier cross-validator trace | needs #5 |
| Phase 0 close commit | last |
## Stretch finding worth noting (not gating)
mpv stdout reports `Using hardware decoding (vaapi-copy)` — even though the iter8 backend has h265.c excluded from the build, mpv defaults to H.264 path (since the test fixture is H.264) and our backend handles it cleanly. No reason to think H.264 has any RK3399-specific weirdness in iter8 master beyond the cache-stale readback noted above.
ffmpeg version: `n8.1-13-gb57fbbe50c` — this is the active downstream `code.ffmpeg.org/Kwiboo/FFmpeg.git` branch `v4l2-request-n8.1` referenced in libva-multiplanar phase0_findings.md. The h264_v4l2request decoder is engaged via the `vaapi` hwaccel through libva → our backend → kernel. Same dispatch path as ohm.