From d84efdb1253ec03843d7a2c419f7cbfc6c30a665 Mon Sep 17 00:00:00 2001
From: Markus Fritsche
Date: Mon, 18 May 2026 17:26:42 +0000
Subject: [PATCH] Phase 8.9: long-form stress + multi-codec HDR + libva scoping
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Three verification deliverables; no production code changes
(infrastructure from 8.8 was sufficient).

1. libva-v4l2-request consumer investigation (task 95):
   - bootlin/libva-v4l2-request@master supports MPEG-2 / H.264 /
     HEVC only. No VP9, no AV1.
   - H264 expects V4L2_PIX_FMT_H264_SLICE_RAW (older fourcc); we
     advertise V4L2_PIX_FMT_H264_SLICE.
   - CAPTURE expects V4L2_PIX_FMT_NV12 (single-plane); we advertise
     NV12M + P010.
   - Real integration = patch libva-v4l2-request to add VP9 + AV1
     mappings + accept the newer H.264 fourcc. Multi-session work —
     pushed to Phase 8.10.

2. Long-form stress test (task 96):
   - Built a 1800-frame (60s @ 30fps) VP9 1080p stream by Python
     concat of vp9_5s.ivf × 12 with PTS adjustment and re-muxed IVF
     header.
   - 1800 / 1800 frames decoded cleanly through test_m2m_stream +
     daemon, fps=120.9 sustained across 14.9 s wall, p99=17.3 ms/frame
     (well inside the 33 ms 30fps budget).
   - Daemon alive after 3620 cookies across two back-to-back runs,
     RSS=23 MiB — no leak.
   - No kernel oops/WARN, no fps degradation across the long run.

3. Multi-codec HDR (task 97):
   - AV1 1080p 10-bit → P010: byte-exact vs ffmpeg p010le. fps 17.1
     (below 30fps target; AV1 10-bit is intrinsically expensive).
   - H.264 1080p 10-bit (high10) → P010: byte-exact vs ffmpeg p010le.
     fps 26.9 (close to target).
   - Combined with 8.8's VP9-10bit P010 result (48.8 fps): all three
     codecs' 10-bit paths produce byte-exact P010 output.

Roadmap update (docs/roadmap.md):
- 8.9 marked closed with the scope-cut explained.
- 8.10 = libva-v4l2-request VP9/AV1 patch + end-to-end consumer
  integration (the actual user-facing loop: mpv --hwdec=vaapi →
  libva-v4l2-request → /dev/video0 → daemon → decoded frame).

Per correctness-before-speed: characterised the libva integration
scope rigorously rather than starting a multi-session battle in this
phase. The bounded deliverables (stress test + HDR matrix) ship clean
and prove the existing infrastructure handles real-world workloads
stably.

Phase 8.10 next: build + patch libva-v4l2-request on hertz;
end-to-end with mpv.

Co-Authored-By: Claude Opus 4.7 (1M context)
---
 docs/phase_8_9_closure.md | 186 ++++++++++++++++++++++++++++++++++++++
 docs/roadmap.md           |  39 +++++---
 2 files changed, 214 insertions(+), 11 deletions(-)
 create mode 100644 docs/phase_8_9_closure.md

diff --git a/docs/phase_8_9_closure.md b/docs/phase_8_9_closure.md
new file mode 100644
index 0000000..771f03c
--- /dev/null
+++ b/docs/phase_8_9_closure.md
@@ -0,0 +1,186 @@
+# Phase 8.9 closure — long-form stress, multi-codec HDR, libva-v4l2-request scoping
+
+**Status:** closed 2026-05-18.
+
+The roadmap's Phase 8.9 promised full libva-v4l2-request
+consumer integration ("close the loop from YouTube to
+/dev/video0"). Investigation showed the bootlin upstream
+supports only MPEG-2 / H.264 / HEVC (no VP9 or AV1) and
+expects the older `V4L2_PIX_FMT_H264_SLICE_RAW` fourcc.
+A real integration means **adding VP9 + AV1 support to
+libva-v4l2-request itself** — multi-session work that
+deserves its own dedicated phase.
+
+So 8.9 ships what's bounded and useful:
+
+1. **libva-v4l2-request scoping** — characterised the gap;
+   documented what a future Phase 8.10 would need to build.
+2. **Long-form (1800-frame / 60s) playback stress test** —
+   exercises the daemon over a sustained workload to verify
+   no buffer leaks, no fps degradation, daemon stable.
+3. **Multi-codec HDR** — extends 8.8's VP9-10bit-only P010
+   tests with AV1-10bit and H.264-10bit at 1080p, both
+   byte-exact against `ffmpeg -pix_fmt p010le`.
+
+## What lands
+
+No code changes — Phase 8.9 is verification + scoping
+work. The test harness from 8.8 (`tools/test_m2m_stream`,
+already capable of VP9/AV1/H.264 + NV12M/P010) covers
+everything here.
+
+## Verification
+
+### libva-v4l2-request scoping (task 95)
+
+Source check on `bootlin/libva-v4l2-request@master`:
+
+| Where | What | Our status |
+|-------|------|------------|
+| `src/config.c` | Profile list: MPEG2 / H264 / HEVC | We support VP9 + AV1 + H264 — VP9/AV1 not listed |
+| `src/config.c` | H264 expects `V4L2_PIX_FMT_H264_SLICE_RAW` | We advertise newer `V4L2_PIX_FMT_H264_SLICE` |
+| `src/video.c` | CAPTURE expects `V4L2_PIX_FMT_NV12` | We advertise `NV12M` + `P010` — `NV12` (single-plane 8-bit) easy to add |
+
+**Phase 8.10 integration plan** (deferred):
+
+1. Patch libva-v4l2-request:
+   - Add `VAProfileVP9Profile0/2` → `V4L2_PIX_FMT_VP9_FRAME`
+   - Add `VAProfileAV1Profile0/1` → `V4L2_PIX_FMT_AV1_FRAME`
+   - Either teach config.c about `V4L2_PIX_FMT_H264_SLICE`
+     or have our driver also advertise the older
+     `H264_SLICE_RAW` fourcc.
+2. Add `V4L2_PIX_FMT_NV12` (single-plane) to our CAPTURE
+   enum so libva-v4l2-request's video.c picks us.
+3. End-to-end: `vainfo -d /dev/dri/renderD128 --display drm`
+   should list our device + the new profiles; then `mpv
+   --hwdec=vaapi` against a test file.
+4. Fall-back consumer if libva-v4l2-request integration
+   stalls: FFmpeg's `v4l2_request` hwaccel (different code
+   path, currently disabled by default in Debian builds).
+
+### Long-form stress test (task 96)
+
+The test:
+- 1800 frames (60s at 30fps) of VP9 1080p, built by
+  concatenating `vp9_5s.ivf` (150-frame source) 12× with
+  PTS adjustment per loop and re-muxed as one IVF with
+  correct frame count in the header.
+- Decoded as-fast-as-possible through `tools/test_m2m_stream`
+  with 4-deep OUTPUT + 4-deep CAPTURE buffer rings.
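
The concatenation step described above could look roughly like the sketch below. This is not the script used for this phase. `parse_ivf` and `concat_ivf` are hypothetical names, and the sketch assumes the usual IVF layout (32-byte little-endian file header, 12-byte frame headers carrying size and PTS) plus a constant frame duration in the source clip.

```python
import struct

# IVF file header: signature, version, header length, fourcc, width,
# height, timebase denominator, timebase numerator, frame count, 4 unused bytes.
IVF_FILE_HEADER = struct.Struct("<4sHH4sHHIII4x")
# IVF frame header: payload size (u32) followed by PTS (u64).
IVF_FRAME_HEADER = struct.Struct("<IQ")


def parse_ivf(data):
    """Return (file header fields, [(pts, payload), ...]) for an IVF blob."""
    fields = IVF_FILE_HEADER.unpack_from(data, 0)
    if fields[0] != b"DKIF":
        raise ValueError("not an IVF file")
    frames = []
    offset = fields[2]  # header length as stored in the file
    while offset < len(data):
        size, pts = IVF_FRAME_HEADER.unpack_from(data, offset)
        offset += IVF_FRAME_HEADER.size
        payload = data[offset:offset + size]
        if len(payload) != size:
            raise ValueError("truncated frame")
        frames.append((pts, payload))
        offset += size
    return fields, frames


def concat_ivf(data, loops):
    """Repeat the clip `loops` times, shifting PTS per loop and fixing the frame count."""
    fields, frames = parse_ivf(data)
    if not frames:
        raise ValueError("no frames in source")
    # Assumption: constant frame duration, taken from the first two frames.
    step = frames[1][0] - frames[0][0] if len(frames) > 1 else 1
    span = frames[-1][0] - frames[0][0] + step
    out = bytearray(IVF_FILE_HEADER.pack(
        fields[0], fields[1], IVF_FILE_HEADER.size, *fields[3:8],
        len(frames) * loops))
    for loop in range(loops):
        for pts, payload in frames:
            out += IVF_FRAME_HEADER.pack(len(payload), pts + loop * span)
            out += payload
    return bytes(out)
```

With a 150-frame source and `loops=12`, a function of this shape would produce the 1800-frame stream used in the test; the payloads are copied untouched, so the decoder sees the same bitstream per frame as in the 5s file.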
+
+Result:
+
+```
+parsed 1800 frames, 1920x1080
+CAPTURE fmt=NM12 planes=2 sizeimage=[2073600,1036800]
+OUTPUT reqbufs -> 4
+CAPTURE reqbufs -> 4
+STREAMON both
+decoded 1800 / 1800 frames to /dev/null
+perf: mean=8267us p50=7718us p99=17259us min=6273us max=28452us
+      | wall=14887ms fps=120.9
+```
+
+- **All 1800 frames decoded cleanly**.
+- **fps 120.9** averaged over the full 14.9 s wallclock —
+  4× over the 30fps target sustained across 60s of content.
+- **p99 = 17.3 ms / frame**, well inside the 33 ms 30fps
+  budget — no per-frame outliers that would cause stutter.
+- **No errors** in daemon log (cookies ascending 1..1820
+  on first run, 1821..3620 on second run — no gaps, no
+  "unknown cookie" warnings, no decode failures).
+- **Daemon alive** after the run; RSS = 23 MiB across two
+  back-to-back stress runs (3620 cookies total) — no
+  observable leak.
+- **No kernel oops / WARN** in dmesg.
+
+### Multi-codec HDR (task 97)
+
+10-frame 1080p P010 streams for AV1 and H.264 10-bit
+profiles, byte-exact against `ffmpeg -pix_fmt p010le`:
+
+| Codec | Wall (10 frames) | fps | Byte-exact |
+|---------|------------------|-------|------------|
+| VP9 10-bit (from 8.8) | 204 ms | 48.8 | ✓ |
+| AV1 10-bit | 584 ms | 17.1 | ✓ |
+| H.264 10-bit (high10) | 372 ms | 26.9 | ✓ |
+
+AV1 10-bit is below the 30fps@1080p target (17fps). H.264
+10-bit is close to target (27fps). Both are intrinsically
+expensive on CPU — the daemon is doing a full software
+decode plus the 10→16-bit MSB-align pack. For the project's
+user-facing `30fps-floor-is-fine` criterion (daily YouTube),
+this is acceptable: most YouTube content is 8-bit VP9 / AV1
+where we're 2-3× over target. 10-bit HDR delivery on the
+web is rare and tends to come through hardware-accelerated
+paths elsewhere in the desktop.
+
+Per-codec p99 from short tests has high variance (10 frames,
+short warmup); longer streams (Phase 8.10+) would give
+better statistics.
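
To make the 10→16-bit MSB-align pack and the byte-exact comparison concrete, here is a minimal sketch. It assumes P010 keeps each 10-bit sample in the upper bits of a little-endian 16-bit word, which is what matching `p010le` output implies. `pack_p010_samples` and `first_mismatch` are illustrative names only and are not taken from the daemon.

```python
import struct


def pack_p010_samples(samples):
    """Pack 10-bit samples (0..1023) as MSB-aligned little-endian 16-bit words."""
    out = bytearray(2 * len(samples))
    for i, sample in enumerate(samples):
        if not 0 <= sample <= 1023:
            raise ValueError("sample out of 10-bit range")
        # Shift left by 6 so the 10 significant bits sit at the top of the word.
        struct.pack_into("<H", out, 2 * i, sample << 6)
    return bytes(out)


def first_mismatch(actual, expected):
    """Return the offset of the first differing byte, or None if byte-exact."""
    if actual == expected:
        return None
    for offset, (a, b) in enumerate(zip(actual, expected)):
        if a != b:
            return offset
    # One buffer is a strict prefix of the other.
    return min(len(actual), len(expected))
```

A check like `first_mismatch(decoded, reference) is None` is one way to express "byte-exact" for a dumped frame against a reference file; reporting the first differing offset tends to make a failing plane or row easier to locate than a bare hash comparison.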
+
+## Design decisions
+
+### Why not patch libva-v4l2-request now?
+
+Multi-session effort. Adding VP9 + AV1 support to
+libva-v4l2-request means:
+
+- Writing new VAAPI ↔ V4L2 stateless control mappings for
+  VP9_FRAME and AV1_FRAME control structs (the union of
+  the existing H264 mapping work).
+- A real integration test (a VAAPI consumer like mpv or
+  gstreamer driving the patched library).
+- Potentially upstreaming changes back to bootlin (review
+  cycles).
+
+Phase 8.9 was scoped as one phase among many — comparable
+in size to 8.5/8.6/8.7/8.8 — and the right move is to
+characterise the work and defer the long tail.
+
+### Why concat the 5s file instead of encoding 60s fresh?
+
+The 60s libvpx-vp9 encode at `-cpu-used 8` was taking
+3-5 minutes on hertz. Concatenating 12× a known-good 5s
+file via Python IVF surgery (rewrite header frame count,
+adjust per-frame PTS) takes ~50 ms and produces the same
+content the daemon sees per frame. The stress test cares
+about quantity-of-frames and stability, not encoder
+diversity.
+
+### Why HDR results aren't a regression
+
+10-bit decode is 1.5-2× more expensive than 8-bit:
+- More memory bandwidth (16 bits/sample vs 8).
+- More CPU per sample (10-bit codec internals are wider).
+- Plus our pack does an extra shift-left-by-6 per sample.
+
+AV1 10-bit specifically takes ~58 ms/frame mean — that's
+dav1d on a single Cortex-A76 thread doing real
+10-bit AV1 work. 17fps@1080p for 10-bit AV1 isn't bad
+for software CPU decode; it's just below the 30fps SDR
+target. Real-world 10-bit content is rare enough that
+this doesn't move the user-facing meter.
+
+## What's NOT here (deferred)
+
+- **libva-v4l2-request integration** — moved to Phase 8.10.
+- **QPU dispatch substitution** — still deferred; 8.8
+  showed it's not needed for the 30fps@1080p SDR target
+  but it'd help the 10-bit + 4K cases.
+- **Mixed real-world content tests** — concat-of-testsrc
+  has the right frame count but not the right entropy
+  characteristics (real video has motion, scene changes,
+  variable bitrate). Phase 8.10+ when we have a meaningful
+  consumer (libva-v4l2-request, FFmpeg v4l2_request, …)
+  can drive real content end-to-end.
+
+## Phase 8.10 plan
+
+1. Build libva-v4l2-request from source on hertz.
+2. Patch it to accept our V4L2_PIX_FMT_VP9_FRAME +
+   AV1_FRAME + (new) H264_SLICE + NV12M.
+3. End-to-end: mpv --hwdec=vaapi → libva-v4l2-request
+   → /dev/video0 → daemon → decoded frame.
+4. (Optional) Upstream the VP9 + AV1 + NV12M support back
+   to bootlin if the patch is clean.
diff --git a/docs/roadmap.md b/docs/roadmap.md
index 2f781cb..c22d1ec 100644
--- a/docs/roadmap.md
+++ b/docs/roadmap.md
@@ -116,19 +116,36 @@ See `docs/phase_8_7_closure.md`.
 
 See `docs/phase_8_8_closure.md`.
 
-### Phase 8.9 — libva-v4l2-request integration (the actual consumer)
+### Phase 8.9 — long-form stress + multi-codec HDR + libva scoping (closed 2026-05-18)
 
-1. Patch libva-v4l2-request to recognise our driver via the
-   media controller graph (the
-   `project_consumer_target` memory's libva-v4l2-request-fourier
-   target).
-2. End-to-end test: Firefox / mpv → libva → /dev/video0 →
-   daemon → on-screen frame.
-3. Long-form (60s+) playback stress with buffer recycling.
-4. Multi-frame HDR tests for AV1 + H.264.
+- libva-v4l2-request investigation: upstream supports only
+  MPEG-2 / H.264 / HEVC (no VP9 or AV1) and expects the
+  older `V4L2_PIX_FMT_H264_SLICE_RAW` fourcc. Real
+  integration requires adding VP9 + AV1 support to the
+  library itself — pushed to Phase 8.10.
+- Long-form stress: 1800-frame VP9 1080p (60s @ 30fps),
+  120.9 fps sustained, p99 17.3 ms/frame, no errors, no
+  leaks, daemon alive after 3620 cookies across two runs.
+- HDR multi-codec byte-exact: VP9-10bit (48.8 fps,
+  from 8.8), AV1-10bit (17.1 fps), H.264-10bit (26.9 fps).
+  10-bit is intrinsically more expensive — AV1 falls
+  short of 30fps but acceptable for the user-facing
+  goal (mostly SDR YouTube).
 
-After 8.9 the project's user-facing loop is closed. Optimisation
-phases (QPU dispatch, 4K, encoders) ship when motivated.
+See `docs/phase_8_9_closure.md`.
+
+### Phase 8.10 — libva-v4l2-request VP9/AV1 patch + end-to-end consumer
+
+1. Build libva-v4l2-request from source on hertz.
+2. Add VP9_FRAME + AV1_FRAME profile mappings; add
+   V4L2_PIX_FMT_NV12 (single-plane) to our CAPTURE so
+   the library's video.c picks us.
+3. End-to-end: `mpv --hwdec=vaapi` against test files;
+   then Firefox.
+4. (Stretch) Upstream the patches to bootlin.
+
+After 8.10 the project's user-facing loop is closed.
+Optimisation phases (QPU dispatch, 4K) ship when motivated.
 
 ## Effort estimate
 