# Phase 8.9 closure: long-form stress, multi-codec HDR, libva-v4l2-request scoping

**Status:** closed 2026-05-18.

The roadmap's Phase 8.9 promised full libva-v4l2-request consumer integration ("close the loop from YouTube to /dev/video0"). Investigation showed the bootlin upstream supports only MPEG-2 / H.264 / HEVC (no VP9 or AV1) and expects the older `V4L2_PIX_FMT_H264_SLICE_RAW` fourcc. A real integration means **adding VP9 + AV1 support to libva-v4l2-request itself**, which is multi-session work that deserves its own dedicated phase.

So 8.9 ships what's bounded and useful:

1. **libva-v4l2-request scoping**: characterised the gap; documented what a future Phase 8.10 would need to build.
2. **Long-form (1800-frame / 60s) playback stress test**: exercises the daemon over a sustained workload to verify no buffer leaks, no fps degradation, daemon stable.
3. **Multi-codec HDR**: extends 8.8's VP9-10bit-only P010 tests with AV1-10bit and H.264-10bit at 1080p, both byte-exact against `ffmpeg -pix_fmt p010le`.

## What lands

No code changes. Phase 8.9 is verification + scoping work. The test harness from 8.8 (`tools/test_m2m_stream`, already capable of VP9/AV1/H.264 + NV12M/P010) covers everything here.

## Verification

### libva-v4l2-request scoping (task 95)

Source check on `bootlin/libva-v4l2-request@master`:

| Where | What | Our status |
|-------|------|------------|
| `src/config.c` | Profile list: MPEG2 / H264 / HEVC | We support VP9 + AV1 + H264; VP9/AV1 not listed |
| `src/config.c` | H264 expects `V4L2_PIX_FMT_H264_SLICE_RAW` | We advertise newer `V4L2_PIX_FMT_H264_SLICE` |
| `src/video.c` | CAPTURE expects `V4L2_PIX_FMT_NV12` | We advertise `NV12M` + `P010`; `NV12` (single-plane 8-bit) easy to add |

**Phase 8.10 integration plan** (deferred):

1. Patch libva-v4l2-request:
   - Add `VAProfileVP9Profile0/2` → `V4L2_PIX_FMT_VP9_FRAME`
   - Add `VAProfileAV1Profile0/1` → `V4L2_PIX_FMT_AV1_FRAME`
   - Either teach config.c about `V4L2_PIX_FMT_H264_SLICE` or have our driver also advertise the older `H264_SLICE_RAW` fourcc.
2. Add `V4L2_PIX_FMT_NV12` (single-plane) to our CAPTURE enum so libva-v4l2-request's video.c picks us.
3. End-to-end: `vainfo -d /dev/dri/renderD128 --display drm` should list our device + the new profiles; then `mpv --hwdec=vaapi` against a test file.
4. Fall-back consumer if libva-v4l2-request integration stalls: FFmpeg's `v4l2_request` hwaccel (different code path, currently disabled by default in Debian builds).

### Long-form stress test (task 96)

The test:

- 1800 frames (60s at 30fps) of VP9 1080p, built by concatenating `vp9_5s.ivf` (150-frame source) 12× with PTS adjustment per loop and re-muxed as one IVF with correct frame count in the header.
- Decoded as-fast-as-possible through `tools/test_m2m_stream` with 4-deep OUTPUT + 4-deep CAPTURE buffer rings.

Result:

```
parsed 1800 frames, 1920x1080
CAPTURE fmt=NM12 planes=2 sizeimage=[2073600,1036800]
OUTPUT reqbufs -> 4
CAPTURE reqbufs -> 4
STREAMON both
decoded 1800 / 1800 frames to /dev/null
perf: mean=8267us p50=7718us p99=17259us min=6273us max=28452us | wall=14887ms fps=120.9
```

- **All 1800 frames decoded cleanly**.
- **fps 120.9** averaged over the full 14.9 s wallclock: 4× over the 30fps target, sustained across 60s of content.
- **p99 = 17.3 ms / frame**, well inside the 33 ms 30fps budget; no per-frame outliers that would cause stutter.
- **No errors** in daemon log (cookies ascending 1..1820 on first run, 1821..3620 on second run; no gaps, no "unknown cookie" warnings, no decode failures).
- **Daemon alive** after the run; RSS = 23 MiB across two back-to-back stress runs (3620 cookies total), no observable leak.
- **No kernel oops / WARN** in dmesg.

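
The concatenation step described for this test (repeat the 150-frame source 12×, shift PTS per loop, rewrite the header frame count) could look roughly like the sketch below. This is not the script used for the run: `read_ivf` and `concat_ivf` are hypothetical names, and the code assumes the standard IVF layout (32-byte file header, 12-byte per-frame header) with constant frame spacing.

```python
import struct

# IVF container layout (little-endian): a 32-byte file header, then for each
# frame a 12-byte header (payload size, PTS) followed by the payload.
IVF_HEADER = struct.Struct("<4sHH4sHHIIII")
FRAME_HEADER = struct.Struct("<IQ")


def read_ivf(data):
    """Return (header_fields, [(pts, payload), ...]) for an IVF byte string."""
    fields = IVF_HEADER.unpack_from(data, 0)
    signature, header_len = fields[0], fields[2]
    if signature != b"DKIF":
        raise ValueError("not an IVF stream")
    frames = []
    pos = header_len
    while pos < len(data):
        size, pts = FRAME_HEADER.unpack_from(data, pos)
        pos += FRAME_HEADER.size
        payload = data[pos:pos + size]
        if len(payload) != size:
            raise ValueError("truncated frame payload")
        frames.append((pts, payload))
        pos += size
    return fields, frames


def concat_ivf(data, loops):
    """Repeat every frame `loops` times, shifting PTS per loop (hypothetical helper)."""
    fields, frames = read_ivf(data)
    if not frames:
        raise ValueError("no frames to repeat")
    first_pts = frames[0][0]
    # Assumption: constant frame spacing, so one loop spans last - first + step.
    step = frames[1][0] - first_pts if len(frames) > 1 else 1
    loop_span = frames[-1][0] - first_pts + step

    (signature, version, _header_len, fourcc, width, height,
     timebase_den, timebase_num, _frame_count, unused) = fields
    # Rewrite the header with the new total frame count.
    out = bytearray(IVF_HEADER.pack(
        signature, version, IVF_HEADER.size, fourcc, width, height,
        timebase_den, timebase_num, len(frames) * loops, unused))
    for loop in range(loops):
        for pts, payload in frames:
            out += FRAME_HEADER.pack(len(payload), pts + loop * loop_span)
            out += payload
    return bytes(out)
```

Under those assumptions, `concat_ivf(open("vp9_5s.ivf", "rb").read(), 12)` would produce a stream whose header reports 150 × 12 = 1800 frames with monotonically increasing PTS.
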
### Multi-codec HDR (task 97)

10-frame 1080p P010 streams for AV1 and H.264 10-bit profiles, byte-exact against `ffmpeg -pix_fmt p010le`:

| Codec | Wall (10 frames) | fps | Byte-exact |
|-------|------------------|-----|------------|
| VP9 10-bit (from 8.8) | 204 ms | 48.8 | ✓ |
| AV1 10-bit | 584 ms | 17.1 | ✓ |
| H.264 10-bit (high10) | 372 ms | 26.9 | ✓ |

AV1 10-bit is below the 30fps@1080p target (17fps). H.264 10-bit is close to target (27fps). Both are intrinsically expensive on CPU: the daemon is doing a full software decode plus the 10→16-bit MSB-align pack.

For the project's user-facing `30fps-floor-is-fine` criterion (daily YouTube), this is acceptable: most YouTube content is 8-bit VP9 / AV1 where we're 2-3× over target. 10-bit HDR delivery on the web is rare and tends to come through hardware-accelerated paths elsewhere in the desktop.

Per-codec p99 from short tests has high variance (10 frames, short warmup); longer streams (Phase 8.10+) would give better statistics.

## Design decisions

### Why not patch libva-v4l2-request now?

Multi-session effort. Adding VP9 + AV1 support to libva-v4l2-request means:

- Writing new VAAPI ↔ V4L2 stateless control mappings for VP9_FRAME and AV1_FRAME control structs (the union of the existing H264 mapping work).
- A real integration test (a VAAPI consumer like mpv or gstreamer driving the patched library).
- Potentially upstreaming changes back to bootlin (review cycles).

Phase 8.9 was scoped as one phase among many, comparable in size to 8.5/8.6/8.7/8.8, and the right move is to characterise the work and defer the long tail.

### Why concat the 5s file instead of encoding 60s fresh?

The 60s libvpx-vp9 encode at `-cpu-used 8` was taking 3-5 minutes on hertz. Concatenating 12× a known-good 5s file via Python IVF surgery (rewrite header frame count, adjust per-frame PTS) takes ~50 ms and produces the same content the daemon sees per frame.
The stress test cares about quantity-of-frames and stability, not encoder diversity.

### Why HDR results aren't a regression

10-bit decode is 1.5-2× more expensive than 8-bit:

- More memory bandwidth (16 bits/sample vs 8).
- More CPU per sample (10-bit codec internals are wider).
- Plus our pack does an extra shift-left-by-6 per sample.

AV1 10-bit specifically takes ~58 ms/frame mean; that's dav1d on a single Cortex-A76 thread doing real 10-bit AV1 work. 17fps@1080p for 10-bit AV1 isn't bad for software CPU decode; it's just below the 30fps SDR target. Real-world 10-bit content is rare enough that this doesn't move the user-facing meter.

## What's NOT here (deferred)

- **libva-v4l2-request integration**: moved to Phase 8.10.
- **QPU dispatch substitution**: still deferred; 8.8 showed it's not needed for the 30fps@1080p SDR target, but it'd help the 10-bit + 4K cases.
- **Mixed real-world content tests**: concat-of-testsrc has the right frame count but not the right entropy characteristics (real video has motion, scene changes, variable bitrate). Phase 8.10+, once we have a meaningful consumer (libva-v4l2-request, FFmpeg v4l2_request, …), can drive real content end-to-end.

## Phase 8.10 plan

1. Build libva-v4l2-request from source on hertz.
2. Patch it to accept our `V4L2_PIX_FMT_VP9_FRAME` + `AV1_FRAME` + (new) `H264_SLICE` + `NV12M`.
3. End-to-end: `mpv --hwdec=vaapi` → libva-v4l2-request → /dev/video0 → daemon → decoded frame.
4. (Optional) Upstream the VP9 + AV1 + NV12M support back to bootlin if the patch is clean.
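
For reference, the 10→16-bit MSB-align pack discussed under "Why HDR results aren't a regression" amounts to one shift-left-by-6 per sample. The sketch below illustrates only that operation and is not the daemon's implementation: `pack_p010_samples` is a hypothetical name, and it works on a flat list of sample values rather than real plane buffers with strides.

```python
import struct


def pack_p010_samples(samples):
    """Pack 10-bit samples into 16-bit little-endian words, MSB-aligned."""
    for s in samples:
        if not 0 <= s <= 0x3FF:
            raise ValueError(f"sample {s} does not fit in 10 bits")
    # Shift left by 6 so the 10 significant bits occupy bits 15..6;
    # the low 6 bits of every word stay zero.
    return struct.pack(f"<{len(samples)}H", *(s << 6 for s in samples))


packed = pack_p010_samples([0, 1, 0x3FF])
print(packed.hex())  # 00004000c0ff
```

Each sample doubles in size relative to 8-bit output, which is where the extra memory bandwidth in the list above comes from.
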