daedalus-v4l2/docs/phase_8_9_closure.md

# Phase 8.9 closure — long-form stress, multi-codec HDR, libva-v4l2-request scoping

**Status:** closed 2026-05-18.

The roadmap's Phase 8.9 promised full libva-v4l2-request
consumer integration ("close the loop from YouTube to
/dev/video0").  Investigation showed the bootlin upstream
supports only MPEG-2 / H.264 / HEVC (no VP9 or AV1) and
expects the older `V4L2_PIX_FMT_H264_SLICE_RAW` fourcc.
A real integration means **adding VP9 + AV1 support to
libva-v4l2-request itself** — multi-session work that
deserves its own dedicated phase.

So 8.9 ships what's bounded and useful:

1. **libva-v4l2-request scoping** — characterised the gap;
   documented what a future Phase 8.10 would need to build.
2. **Long-form (1800-frame / 60s) playback stress test** —
   exercises the daemon over a sustained workload to verify
   that buffers do not leak, fps does not degrade, and the
   daemon stays up.
3. **Multi-codec HDR** — extends 8.8's VP9-10bit-only P010
   tests with AV1-10bit and H.264-10bit at 1080p, both
   byte-exact against `ffmpeg -pix_fmt p010le`.

## What lands

No code changes — Phase 8.9 is verification + scoping
work.  The test harness from 8.8 (`tools/test_m2m_stream`,
already capable of VP9/AV1/H.264 + NV12M/P010) covers
everything here.

## Verification

### libva-v4l2-request scoping (task 95)

Source check on `bootlin/libva-v4l2-request@master`:

| Where | What | Our status |
|-------|------|------------|
| `src/config.c` | Profile list: MPEG2 / H264 / HEVC | We support VP9 + AV1 + H264 — VP9/AV1 not listed |
| `src/config.c` | H264 expects `V4L2_PIX_FMT_H264_SLICE_RAW` | We advertise newer `V4L2_PIX_FMT_H264_SLICE` |
| `src/video.c` | CAPTURE expects `V4L2_PIX_FMT_NV12` | We advertise `NV12M` + `P010` — `NV12` (single-plane 8-bit) easy to add |

**Phase 8.10 integration plan** (deferred):

1. Patch libva-v4l2-request:
   - Add `VAProfileVP9Profile0/2` → `V4L2_PIX_FMT_VP9_FRAME`
   - Add `VAProfileAV1Profile0/1` → `V4L2_PIX_FMT_AV1_FRAME`
   - Either teach config.c about `V4L2_PIX_FMT_H264_SLICE`
     or have our driver also advertise the older
     `H264_SLICE_RAW` fourcc.
2. Add `V4L2_PIX_FMT_NV12` (single-plane) to our CAPTURE
   enum so libva-v4l2-request's video.c picks us.
3. End-to-end: `vainfo -d /dev/dri/renderD128 --display drm`
   should list our device + the new profiles; then `mpv
   --hwdec=vaapi` against a test file.
4. Fall-back consumer if libva-v4l2-request integration
   stalls: FFmpeg's `v4l2_request` hwaccel (different code
   path, currently disabled by default in Debian builds).
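
The pixel-format names above are four-character codes packed into a 32-bit integer, which is why `V4L2_PIX_FMT_NV12M` shows up as `NM12` in the harness log. A minimal sketch of that packing, mirroring the kernel's `v4l2_fourcc()` macro; the Python helper names are illustrative, and the character codes are listed as I understand them from `linux/videodev2.h`, so check them against your kernel headers:

```python
def v4l2_fourcc(code: str) -> int:
    """Pack a 4-character code the way the kernel's v4l2_fourcc() macro does."""
    if len(code) != 4:
        raise ValueError("fourcc must be exactly 4 characters")
    a, b, c, d = (ord(ch) for ch in code)
    # First character lands in the least significant byte.
    return a | (b << 8) | (c << 16) | (d << 24)


def fourcc_to_str(value: int) -> str:
    """Inverse of v4l2_fourcc(): recover the 4-character code."""
    return bytes((value >> shift) & 0xFF for shift in (0, 8, 16, 24)).decode("ascii")


# Assumed character codes for the formats discussed in this document.
FORMATS = {
    "V4L2_PIX_FMT_NV12": "NV12",
    "V4L2_PIX_FMT_NV12M": "NM12",
    "V4L2_PIX_FMT_P010": "P010",
    "V4L2_PIX_FMT_VP9_FRAME": "VP9F",
    "V4L2_PIX_FMT_AV1_FRAME": "AV1F",
}

if __name__ == "__main__":
    for name, code in FORMATS.items():
        print(f"{name:26s} {code} 0x{v4l2_fourcc(code):08x}")
```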

### Long-form stress test (task 96)

The test:
- 1800 frames (60s at 30fps) of VP9 1080p, built by
  concatenating `vp9_5s.ivf` (150-frame source) 12× with
  PTS adjustment per loop and re-muxed as one IVF with
  correct frame count in the header.
- Decoded as-fast-as-possible through `tools/test_m2m_stream`
  with 4-deep OUTPUT + 4-deep CAPTURE buffer rings.

Result:

```
parsed 1800 frames, 1920x1080
CAPTURE fmt=NM12 planes=2 sizeimage=[2073600,1036800]
OUTPUT reqbufs -> 4
CAPTURE reqbufs -> 4
STREAMON both
decoded 1800 / 1800 frames to /dev/null
perf: mean=8267us p50=7718us p99=17259us min=6273us max=28452us
      | wall=14887ms fps=120.9
```

- **All 1800 frames decoded cleanly**.
- **fps 120.9** averaged over the full 14.9 s wallclock,
  about 4× the 30fps target, sustained across 60s of content.
- **p99 = 17.3 ms / frame** and max = 28.5 ms, both inside
  the 33.3 ms per-frame budget at 30fps, so no frame in the
  run was slow enough to cause stutter.
- **No errors** in daemon log (cookies ascending 1..1820
  on first run, 1821..3620 on second run — no gaps, no
  "unknown cookie" warnings, no decode failures).
- **Daemon alive** after the run; RSS = 23 MiB across two
  back-to-back stress runs (3620 cookies total) — no
  observable leak.
- **No kernel oops / WARN** in dmesg.
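
The headline numbers in the list above follow directly from the `perf:` line. A small check of the arithmetic, with the values copied from the log; the variable names are illustrative only:

```python
# Values copied from the stress-test output above.
frames = 1800
wall_ms = 14887
p99_us = 17259
max_us = 28452
target_fps = 30.0

fps = frames / (wall_ms / 1000.0)        # sustained decode rate
speedup = fps / target_fps               # margin over the 30fps target
budget_us = 1_000_000 / target_fps       # per-frame deadline at 30fps

print(f"fps={fps:.1f} speedup={speedup:.2f}x budget={budget_us:.0f}us")
print(f"p99 within budget: {p99_us < budget_us}")
print(f"max within budget: {max_us < budget_us}")
```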

### Multi-codec HDR (task 97)

10-frame 1080p P010 streams for AV1 and H.264 10-bit
profiles, byte-exact against `ffmpeg -pix_fmt p010le`:

| Codec   | Wall (10 frames) | fps   | Byte-exact |
|---------|------------------|-------|------------|
| VP9 10-bit (from 8.8) | 204 ms | 48.8 | ✓ |
| AV1 10-bit            | 584 ms | 17.1 | ✓ |
| H.264 10-bit (high10) | 372 ms | 26.9 | ✓ |
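
One way to run a byte-exact check like the one in the table is to compare the harness dump against the ffmpeg reference chunk by chunk and report the first differing offset. A minimal sketch; the function name and the example paths are hypothetical and not part of `tools/test_m2m_stream`:

```python
from typing import Optional

# 1080p P010: 2 bytes per sample, luma plane plus half-size interleaved chroma.
P010_1080P_FRAME_BYTES = 1920 * 1080 * 2 * 3 // 2


def first_mismatch(path_a: str, path_b: str, chunk_size: int = 1 << 20) -> Optional[int]:
    """Return the byte offset of the first difference, or None if identical."""
    offset = 0
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            a = fa.read(chunk_size)
            b = fb.read(chunk_size)
            if a != b:
                for i, (x, y) in enumerate(zip(a, b)):
                    if x != y:
                        return offset + i
                # No differing byte in the overlap: one file is a prefix of the other.
                return offset + min(len(a), len(b))
            if not a:
                return None  # both files ended with no difference
            offset += len(a)


# Example use (paths are placeholders):
#   off = first_mismatch("daemon_out.p010", "ffmpeg_ref.p010")
#   if off is not None:
#       print("first bad frame:", off // P010_1080P_FRAME_BYTES)
```

Dividing the offset by the frame size points at the first frame that differs, which is usually more useful than a bare pass/fail.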

AV1 10-bit is well below the 30fps@1080p target (17fps).  H.264
10-bit also misses it, though narrowly (27fps).  Both are intrinsically
expensive on CPU — the daemon is doing a full software
decode plus the 10→16-bit MSB-align pack.  For the project's
user-facing `30fps-floor-is-fine` criterion (daily YouTube),
this is acceptable: most YouTube content is 8-bit VP9 / AV1
where we're 2-3× over target.  10-bit HDR delivery on the
web is rare and tends to come through hardware-accelerated
paths elsewhere in the desktop.

Per-codec p99 from short tests has high variance (10 frames,
short warmup); longer streams (Phase 8.10+) would give
better statistics.

## Design decisions

### Why not patch libva-v4l2-request now?

Multi-session effort.  Adding VP9 + AV1 support to
libva-v4l2-request means:

- Writing new VAAPI ↔ V4L2 stateless control mappings for
  the VP9_FRAME and AV1_FRAME control structs, each on the
  scale of the existing H264 mapping work.
- A real integration test (a VAAPI consumer like mpv or
  gstreamer driving the patched library).
- Potentially upstreaming changes back to bootlin (review
  cycles).

Phase 8.9 was scoped as one phase among many — comparable
in size to 8.5/8.6/8.7/8.8 — and the right move is to
characterise the work and defer the long tail.

### Why concat the 5s file instead of encoding 60s fresh?

The 60s libvpx-vp9 encode at `-cpu-used 8` was taking
3-5 minutes on hertz.  Concatenating 12× a known-good 5s
file via Python IVF surgery (rewrite header frame count,
adjust per-frame PTS) takes ~50 ms and leaves each frame's
payload untouched, so the daemon decodes the same known-good
frames on every loop.  The stress test cares
about quantity-of-frames and stability, not encoder
diversity.
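
The IVF surgery described above can be sketched as follows. This is not the script used for the run, only an illustration under stated assumptions: the standard IVF layout (32-byte `DKIF` file header, 12-byte per-frame header holding size and PTS), constant PTS spacing in the source, and hypothetical function names.

```python
import struct

# IVF file header: signature, version, header length, codec fourcc, width,
# height, timebase denominator, timebase numerator, frame count, 4 unused bytes.
IVF_HEADER = struct.Struct("<4sHH4sHHIII4x")
# IVF frame header: payload size in bytes, presentation timestamp.
IVF_FRAME_HEADER = struct.Struct("<IQ")


def parse_ivf(data: bytes):
    """Split an IVF blob into (header fields, [(pts, payload), ...])."""
    fields = IVF_HEADER.unpack_from(data, 0)
    if fields[0] != b"DKIF":
        raise ValueError("not an IVF file")
    frames = []
    pos = fields[2]  # header length as declared by the file
    while pos + IVF_FRAME_HEADER.size <= len(data):
        size, pts = IVF_FRAME_HEADER.unpack_from(data, pos)
        pos += IVF_FRAME_HEADER.size
        frames.append((pts, data[pos:pos + size]))
        pos += size
    return fields, frames


def loop_ivf(data: bytes, loops: int) -> bytes:
    """Repeat every frame `loops` times, shifting PTS so it keeps increasing."""
    fields, frames = parse_ivf(data)
    if not frames:
        raise ValueError("no frames in source")
    sig, version, _hlen, fourcc, width, height, den, num, _count = fields
    # Assumption: constant PTS spacing, taken from the first two frames.
    step = frames[1][0] - frames[0][0] if len(frames) > 1 else 1
    span = frames[-1][0] - frames[0][0] + step  # PTS length of one loop
    out = bytearray(IVF_HEADER.pack(sig, version, IVF_HEADER.size, fourcc,
                                    width, height, den, num, len(frames) * loops))
    for loop in range(loops):
        for pts, payload in frames:
            out += IVF_FRAME_HEADER.pack(len(payload), pts + loop * span)
            out += payload  # payload bytes are copied unchanged
    return bytes(out)
```

With a 150-frame source and `loops=12`, the output header reports 1800 frames and the PTS values run without a gap across loop boundaries.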

### Why HDR results aren't a regression

10-bit decode is 1.5-2× more expensive than 8-bit:
- More memory bandwidth (16 bits/sample vs 8).
- More CPU per sample (10-bit codec internals are wider).
- Plus our pack does an extra shift-left-by-6 per sample.
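
The shift in the last bullet is what makes the output P010: each 10-bit sample sits in the most significant bits of a little-endian 16-bit word, with the low 6 bits zero. A minimal sketch of that packing step; this is an illustration, not the daemon's code, and `pack_p010` is a hypothetical name:

```python
import struct


def pack_p010(samples):
    """Pack 10-bit samples as MSB-aligned little-endian 16-bit words."""
    for s in samples:
        if not 0 <= s <= 0x3FF:
            raise ValueError(f"sample {s} does not fit in 10 bits")
    # Shift left by 6 so the 10 significant bits occupy bits 15..6.
    return struct.pack(f"<{len(samples)}H", *(s << 6 for s in samples))


if __name__ == "__main__":
    # Full-scale 10-bit white (0x3FF) becomes 0xFFC0, stored as bytes c0 ff.
    print(pack_p010([0x3FF, 0x200, 0]).hex())
```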

AV1 10-bit specifically takes ~58 ms/frame mean — that's
dav1d on a single Cortex-A76 thread doing real
10-bit AV1 work.  17fps@1080p for 10-bit AV1 isn't bad
for software CPU decode; it simply falls short of the 30fps
SDR target.  Real-world 10-bit content is rare enough that
this doesn't move the user-facing meter.

## What's NOT here (deferred)

- **libva-v4l2-request integration** — moved to Phase 8.10.
- **QPU dispatch substitution** — still deferred; 8.8
  showed it's not needed for the 30fps@1080p SDR target
  but it'd help the 10-bit + 4K cases.
- **Mixed real-world content tests** — concat-of-testsrc
  has the right frame count but not the right entropy
  characteristics (real video has motion, scene changes,
  variable bitrate).  Once Phase 8.10+ gives us a meaningful
  consumer (libva-v4l2-request, FFmpeg v4l2_request, …),
  it can drive real content end-to-end.

## Phase 8.10 plan

1. Build libva-v4l2-request from source on hertz.
2. Patch it to accept our `V4L2_PIX_FMT_VP9_FRAME` +
   `AV1_FRAME` + (new) `H264_SLICE` + `NV12M`.
3. End-to-end: `mpv --hwdec=vaapi` → libva-v4l2-request
   → `/dev/video0` → daemon → decoded frame.
4. (Optional) Upstream the VP9 + AV1 + NV12M support back
   to bootlin if the patch is clean.