# Phase 8.8 closure: throughput baseline + multi-codec streams + HDR

**Status:** closed 2026-05-18.

The roadmap going into 8.8 prescribed a substantial QPU dispatch substitution effort to hit the `30fps-floor-is-fine` user-facing criterion. The proper correctness-before-speed move was to **measure first**. It turns out the daemon's FFmpeg software path on Pi 5's Cortex-A76 already hits **65-88 fps@1080p** across all three codecs, 2-3× over the 30fps target. QPU substitution would have been premature optimization.

So 8.8 ships what's actually useful:

1. **Per-frame timing instrumentation** in `test_m2m_stream` with mean / p50 / p99 / fps reporting.
2. **Multi-frame AV1 + H.264 streams verified** byte-exact at 1080p (closing the "VP9-only stream tests" gap from 8.7).
3. **HDR / 10-bit support**: `V4L2_PIX_FMT_P010` added as a CAPTURE format with depth-aware packing in the daemon.

## What lands

### Test harness (`tools/test_m2m_stream.c`)

- Per-frame microsecond timing via `clock_gettime(CLOCK_MONOTONIC)`. Final report: mean / p50 / p99 / min / max per-frame microseconds + wall ms + fps.
- Annex-B H.264 parser: split bitstream on 3- or 4-byte start codes, accumulate NALs into access units (push when we see a VCL NAL, type 1 or 5). Without access-unit grouping, FFmpeg's H.264 decoder rejects SPS-only or PPS-only buffers as "no frame!".
- Format auto-detection: IVF (DKIF magic) → `parse_ivf`; anything else → `parse_annexb`. Non-IVF input requires explicit `[w] [h]` since framing carries no dimensions.
- New optional 6th argument `[capture]`: `nv12m` (default, 8-bit, 2 planes) or `p010` (10-bit, 1 plane).
- CAPTURE mmap path generalised to handle `num_planes == 1` (P010); previously hardcoded to 2.

### Kernel (`kernel/daedalus_v4l2_main.c`)

- CAPTURE formats array: `{ NV12M, P010 }`, with `daedalus_is_supported_capture` matching the OUTPUT-side helper.
- `enum_fmt` on CAPTURE walks the array (2 entries).
- `daedalus_fill_capture_fmt` takes a fourcc:
  - NV12M: 2 planes, plane[0]=W*H, plane[1]=W*H/2, bytesperline=W.
  - P010: 1 plane, sizeimage = W*H*3 (Y = W*2 bytes per row × H rows = W*H*2; interleaved CbCr = W*2 bytes per chroma row × H/2 rows = W*H; total W*H*2 + W*H = W*H*3), bytesperline = W*2.
- `try_fmt` for CAPTURE preserves caller fourcc when supported, falls back to NV12M default otherwise.
- `daedalus_complete_resp_frame` refactored: the dmabuf path (pixels_len == 0) now sets each plane's payload to `vb2_plane_size(vb, p)`. The daemon fully populated the plane, so payload = sizeimage. Generalises cleanly to 1-plane (P010) and 2-plane (NV12M) formats.

### Daemon (`daemon/src/decoder.c`)

- `pack_p010_to_plane`: packs YUV420P10LE into P010 single-plane layout: Y plane (16-bit samples, MSB-aligned 10-bit data, low 6 bits zero) at base+0, interleaved CbCr at base+(Y plane size). Strips source stride padding from `fr->linesize[*]`; respects destination stride from `planes->stride[0]`.
- `daedalus_decoder_run_request` dispatches on `req->capture_pix_fmt`:
  - `V4L2_PIX_FMT_NV12M` → `pack_nv12_to_planes`
  - `V4L2_PIX_FMT_P010` → `pack_p010_to_plane`
  - else → warn + skip pack (decoder still reports the frame metadata).
- Includes `` for the fourcc constants.

## Verification

All measurements on hertz (Pi 5, 6.12.75+rpt-rpi-2712).

### 1080p throughput baseline: 30fps target met across the board

30-frame `testsrc` at 1920×1080, decoded via the V4L2 m2m + dmabuf path; per-frame time is measured in µs from QBUF OUTPUT to write(of, NV12) returning, and reported below in ms.

| Codec | Mean | p50 | p99 | fps | byte-exact vs ffmpeg |
|-------|------|-----|-----|-----|----------------------|
| VP9 | 12.0 ms | 11.8 ms | 15.9 ms | **83.1** | ✓ |
| AV1 | 15.4 ms | 14.3 ms | 41.0 ms | **65.0** | ✓ |
| H.264 | 11.3 ms | 10.5 ms | 21.5 ms | **88.3** | ✓ |

The `30fps-floor-is-fine` memory's user-facing criterion is "daily YouTube playback with CPU free for vscode."
At 65-88 fps single-threaded the daemon is so far above the floor that real-world content has comfortable headroom for the rest of the desktop.

### HDR / 10-bit P010: byte-exact + still real-time

```
$ ffmpeg -f lavfi -i 'testsrc=duration=0.4:size=1920x1080:rate=25' \
    -pix_fmt yuv420p10le -c:v libvpx-vp9 -cpu-used 8 \
    -y vp9_10bit_1080.ivf
$ ffmpeg -i vp9_10bit_1080.ivf -pix_fmt p010le -f rawvideo \
    -y vp9_10bit_1080_ref.p010
$ sudo ./tools/test_m2m_stream \
    vp9_10bit_1080.ivf \
    vp9_10bit_1080_out.p010 \
    1920 1080 vp9 p010
parsed 10 frames, 1920x1080
CAPTURE fmt=P010 planes=1 sizeimage=[6220800,0]
decoded 10 / 10 frames
perf: mean=20.5ms p50=19.0ms p99=28.0ms | fps=48.8
$ cmp vp9_10bit_1080_out.p010 vp9_10bit_1080_ref.p010
0    # 62 MB across 10 frames, byte-for-byte match
```

The 10-bit path is ~50fps@1080p, still above the 30fps target. The overhead vs 8-bit comes from the shift-left-by-6 plus the wider memory writes (16-bit per sample); both are inherent to the format.

The smaller 320×240 P010 test ran at **966 fps**; the fixed daemon-side overhead dominates at small resolutions.

### v4l2-compliance: unchanged from 8.7

```
Total for daedalus_v4l2 device /dev/video0: 49, Succeeded: 49, Failed: 0, Warnings: 0
```

Compliance was already complete after 8.7; the added P010 format passes through the same MMAP / DMABUF / REQBUFS / EXPBUF tests cleanly.

### Format enumeration

```
$ v4l2-ctl -d /dev/video0 --list-formats
    [0]: 'NM12' (Y/UV 4:2:0 (N-C))
    [1]: 'P010' (10-bit Y/UV 4:2:0)
```

### Clean teardown

```
$ pkill -TERM daedalus_v4l2_daemon
$ sudo rmmod daedalus_v4l2
$ sudo dmesg | grep -E 'BUG|oops'
(empty)
```

## Design decisions

### Why measure before substituting QPU kernels?

The Phase 8.8 roadmap entry was "profile + dispatch QPU kernels for hot paths." The unstated assumption was "FFmpeg software decode is too slow at 1080p."
Measurement falsified the assumption: Cortex-A76 ARM has enough single-thread throughput that libvpx-vp9 / libdav1d / libavcodec H.264 all clear 30fps@1080p without help.

QPU substitution still has value for:

- Higher resolutions (4K),
- Higher frame rates (60fps+),
- Lower-power CPUs (Pi 5 is competitive; older Pis aren't),
- Power efficiency at any throughput.

Per `feedback_correctness_before_speed`: measure, then optimize what's actually slow. The QPU work is still in the roadmap but it's no longer urgent; it's an optimization phase, not a feature phase.

### Why P010 (single plane) and not P010M (multi plane)?

The kernel uABI only defines `V4L2_PIX_FMT_P010` (single plane, fourcc 'P010'). There is no `P010M` constant in v6.12 headers. Single plane works fine for our purposes: the daemon's dmabuf path gets one fd, one mmap, and the Y/CbCr layout is fixed by the format spec.

If a future userspace ever needs separate Y and CbCr buffers we could define our own `V4L2_PIX_FMT_P010M`-shaped layout, but that would diverge from the standard fourcc and is hard to motivate without an actual consumer.

### Why the Annex-B parser accumulates access units

The V4L2 stateless H.264 spec says each OUTPUT buffer contains ONE PARSED SLICE. Our daemon doesn't use the SLICE_PARAMS controls; it just passes bytes to FFmpeg, which re-parses. FFmpeg's H.264 decoder rejects "SPS-only" or "PPS-only" buffers as "no frame!", so splitting on every start code fails.

Solution: accumulate NALs into access units. An AU contains zero or more non-VCL NALs (SPS/PPS/SEI/AUD) followed by one VCL NAL (NAL unit type 1 or 5). We push each completed AU as one OUTPUT buffer. This works for any H.264 Annex-B stream where one access unit = one frame = one slice NAL (our ultrafast baseline x264 encode), which is the common case for the test harness.

### Per-frame timing measures the full QBUF→DQBUF cycle

The reported "mean=12ms" includes:

1. memcpy bitstream into OUTPUT MMAP plane
2. VIDIOC_QBUF
3.
   poll() (blocks until CAPTURE ready)
4. VIDIOC_DQBUF OUTPUT
5. VIDIOC_DQBUF CAPTURE
6. fwrite NV12 (or P010) plane(s) to output file
7. VIDIOC_QBUF CAPTURE recycle

The actual decode wallclock is somewhere inside (3); the rest is measurement overhead that a real consumer (libva-v4l2-request) wouldn't pay (no fwrite, fewer ioctls per frame with pipelining). So the reported fps is a **conservative lower bound** on what the daemon can sustain.

## What's NOT here (deferred)

- **QPU dispatch substitution.** Not needed for 30fps@1080p (proven by measurement). Stays on the roadmap for higher-throughput / lower-power scenarios.
- **libva-v4l2-request consumer integration.** Per `project_consumer_target` memory this is the actual end point: what the daemon's V4L2 stateless API was built to feed. Phase 8.9+ work; would close the loop from YouTube → Firefox → libva → /dev/video0 → daemon.
- **Multi-frame HDR tests for AV1/H.264.** Phase 8.8's P010 test is VP9 only. Adding AV1+H.264 multi-frame HDR streams is straightforward (encoder already supports yuv420p10le) but didn't fit the current phase scope.
- **>1080p resolutions.** No 4K stream tests. The protocol/code paths are size-agnostic; only the test harness needs bigger inputs.

## Phase 8.9 plan

1. libva-v4l2-request integration: the actual consumer that closes the project's user-facing loop (per `project_consumer_target`). Patch the library to recognise our driver via media controller, wire codec parsing to feed our OUTPUT buffers.
2. End-to-end test: Firefox → libva → /dev/video0 → daemon → on-screen frame.
3. Stress: long-form (60s+) playback with proper buffer recycling timing.
4. Multi-frame HDR tests for AV1 + H.264.

After 8.9 the project's user-facing goal is hit; the remaining sub-phases (QPU substitution, 4K, encoders) are optimisation work that ships when motivated.
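
## Appendix: P010 packing sketch

The P010 layout described under "Kernel" and "Daemon" above (16-bit samples with the 10-bit value shifted left by 6, Y plane first, interleaved CbCr after it, sizeimage = W*H*3) can be illustrated with a small standalone C sketch. This is not the daemon's actual `pack_p010_to_plane`; the names `p010_sizeimage`, `p010_pack`, `load_le16`, and `store_le16` are hypothetical. It assumes the source is planar little-endian 10-bit 4:2:0 with values in the low 10 bits, and that chroma rows use the same byte stride as luma rows.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helpers: endian-independent 16-bit little-endian access. */
static uint16_t load_le16(const uint8_t *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

static void store_le16(uint8_t *p, uint16_t v)
{
    p[0] = (uint8_t)(v & 0xff);
    p[1] = (uint8_t)(v >> 8);
}

/* Tightly packed P010 frame size: W*H*2 (Y) + W*H (interleaved CbCr). */
static size_t p010_sizeimage(size_t width, size_t height)
{
    return width * height * 3;
}

/*
 * Pack planar 10-bit 4:2:0 into single-plane P010.
 * All strides are in bytes; source strides may include padding, which is
 * skipped. Assumption: chroma rows reuse dst_stride.
 */
static void p010_pack(uint8_t *dst, size_t dst_stride,
                      const uint8_t *src_y, size_t y_stride,
                      const uint8_t *src_u, size_t u_stride,
                      const uint8_t *src_v, size_t v_stride,
                      size_t width, size_t height)
{
    assert(width % 2 == 0 && height % 2 == 0);
    assert(dst_stride >= width * 2);

    /* Luma: one sample per pixel, moved into the 10 most significant bits. */
    for (size_t row = 0; row < height; row++) {
        const uint8_t *in = src_y + row * y_stride;
        uint8_t *out = dst + row * dst_stride;
        for (size_t x = 0; x < width; x++)
            store_le16(out + 2 * x, (uint16_t)(load_le16(in + 2 * x) << 6));
    }

    /* Chroma: starts right after the Y plane, Cb and Cr interleaved. */
    uint8_t *dst_c = dst + dst_stride * height;
    for (size_t row = 0; row < height / 2; row++) {
        const uint8_t *u = src_u + row * u_stride;
        const uint8_t *v = src_v + row * v_stride;
        uint8_t *out = dst_c + row * dst_stride;
        for (size_t x = 0; x < width / 2; x++) {
            store_le16(out + 4 * x,     (uint16_t)(load_le16(u + 2 * x) << 6));
            store_le16(out + 4 * x + 2, (uint16_t)(load_le16(v + 2 * x) << 6));
        }
    }
}
```

For a 1920×1080 frame, `p010_sizeimage` gives 6220800, which matches the `sizeimage` value printed by the harness in the HDR verification run.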