Files
daedalus-v4l2/docs/phase_8_8_closure.md
T
marfrit 1ae9528e76 Phase 8.8: throughput baseline + multi-codec streams + HDR
Per the correctness-before-speed principle: measure before
optimising. Roadmap going in said "QPU dispatch substitution
to hit 30fps@1080p". Measurement on hertz shows the FFmpeg
software path already hits 65-88 fps@1080p across all three
codecs — QPU substitution would be premature optimisation.

So 8.8 ships what's actually useful:
1. Per-frame timing in test_m2m_stream.
2. Multi-frame AV1 + H.264 streams verified byte-exact at
   1080p (closes the "VP9-only stream tests" gap from 8.7).
3. HDR / 10-bit via V4L2_PIX_FMT_P010 + daemon
   pack_p010_to_plane.

Test harness (tools/test_m2m_stream.c):
- Per-frame µs timing via CLOCK_MONOTONIC; reports mean/p50/
  p99/min/max + wall ms + fps.
- Annex-B H.264 parser: split on 3-/4-byte start codes,
  accumulate NALs into access units (push on VCL NAL types
  1 or 5). Without AU grouping FFmpeg rejects SPS/PPS-only
  buffers as "no frame!".
- Format auto-detect (DKIF magic → IVF; else Annex-B).
- Optional 6th arg `[capture]`: nv12m | p010.
- CAPTURE mmap path generalised for num_planes==1 (P010).

Kernel (kernel/daedalus_v4l2_main.c):
- CAPTURE formats array {NV12M, P010}; enum_fmt walks it.
- daedalus_fill_capture_fmt takes a fourcc:
    NV12M: 2 planes, W*H + W*H/2 bytes, bpl=W
    P010:  1 plane,  W*H*2 + W*H bytes, bpl=W*2
- try_fmt preserves caller fourcc when supported.
- daedalus_complete_resp_frame's dmabuf path now sets each
  plane's payload to vb2_plane_size(vb,p) — generalises
  cleanly across 1-plane (P010) and 2-plane (NV12M) layouts;
  the daemon fully populates the plane so payload =
  sizeimage.

Daemon (daemon/src/decoder.c):
- pack_p010_to_plane: YUV420P10LE → P010 single-plane.
  10-bit samples shifted left by 6 to MSB-align in 16-bit
  words per V4L2 ABI. Y at base+0, interleaved CbCr right
  after Y plane (per format spec for single-plane P010).
  Strips source stride padding; respects destination stride.
- daedalus_decoder_run_request dispatches on
  req->capture_pix_fmt (NV12M → pack_nv12_to_planes; P010
  → pack_p010_to_plane; else warn + skip).
- Includes <linux/videodev2.h> for fourcc constants.

Verification on hertz (Pi 5, 6.12.75+rpt-rpi-2712):

1080p throughput baseline (30 frames testsrc, dmabuf path):

  VP9   1080p:  mean 12.0 ms,  p99 15.9 ms,  fps **83.1**, byte-exact ✓
  AV1   1080p:  mean 15.4 ms,  p99 41.0 ms,  fps **65.0**, byte-exact ✓
  H.264 1080p:  mean 11.3 ms,  p99 21.5 ms,  fps **88.3**, byte-exact ✓

All 2-3× over the 30fps-floor-is-fine criterion.

HDR / 10-bit 1080p P010:
  10 frames, 62 MB output, fps **48.8**, byte-exact vs
  `ffmpeg -pix_fmt p010le -f rawvideo`.

Small-frame P010 (320×240): fps 966 — fixed daemon overhead
dominates at low resolutions.

v4l2-compliance unchanged from 8.7: 49/49 passing.
Format enumeration confirms NM12 + P010 on CAPTURE.

Clean SIGTERM + rmmod; no kernel oops/WARN.

Roadmap update (docs/roadmap.md):
- 8.8 marked closed with closure-doc reference, including
  the explicit "QPU substitution not needed" rationale.
- 8.9 reshaped: libva-v4l2-request consumer integration
  (per project_consumer_target memory) — the actual
  user-facing endpoint.

Per correctness-before-speed:
- Measured first; QPU work explicitly justified-out via data.
- Byte-exact pixel comparison for every codec/format combo
  (NV12: VP9, AV1, H.264; P010: VP9 10-bit at 320×240 and
  1080p).
- AU grouping in the Annex-B parser is the correct
  semantic boundary, not just a workaround.
- vb2_plane_size for payload generalises to any plane
  count, not hardcoded to 2.

Phase 8.9 next: libva-v4l2-request integration — close
the loop from YouTube/Firefox to /dev/video0 + daemon
playback.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 16:34:05 +00:00

9.7 KiB
Raw Blame History

Phase 8.8 closure — throughput baseline + multi-codec streams + HDR

Status: closed 2026-05-18.

The roadmap going into 8.8 prescribed a substantial QPU dispatch substitution effort to hit the 30fps-floor-is-fine user-facing criterion. The proper correctness-before-speed move was to measure first — turns out the daemon's FFmpeg software path on Pi 5's Cortex-A76 already hits 65-88 fps@1080p across all three codecs, 2-3× over the 30fps target. QPU substitution would have been premature optimization.

So 8.8 ships what's actually useful:

  1. Per-frame timing instrumentation in test_m2m_stream with mean / p50 / p99 / fps reporting.
  2. Multi-frame AV1 + H.264 streams verified byte-exact at 1080p (closing the "VP9-only stream tests" gap from 8.7).
  3. HDR / 10-bit supportV4L2_PIX_FMT_P010 added as a CAPTURE format with depth-aware packing in the daemon.

What lands

Test harness (tools/test_m2m_stream.c)

  • Per-frame microsecond timing via clock_gettime(CLOCK_ MONOTONIC). Final report: mean / p50 / p99 / min / max per-frame microseconds + wall ms + fps.
  • Annex-B H.264 parser: split bitstream on 3- or 4-byte start codes, accumulate NALs into access units (push when we see a VCL NAL — type 1 or 5). Without access-unit grouping, FFmpeg's H.264 decoder rejects SPS-only or PPS-only buffers as "no frame!".
  • Format auto-detection: IVF (DKIF magic) → parse_ivf; anything else → parse_annexb. Non-IVF input requires explicit [w] [h] since framing carries no dimensions.
  • New optional 6th argument [capture]: nv12m (default, 8-bit, 2 planes) or p010 (10-bit, 1 plane).
  • CAPTURE mmap path generalised to handle num_planes == 1 (P010) — previously hardcoded to 2.

Kernel (kernel/daedalus_v4l2_main.c)

  • CAPTURE formats array: { NV12M, P010 }, with daedalus_is_supported_capture matching the OUTPUT-side helper.
  • enum_fmt on CAPTURE walks the array (2 entries).
  • daedalus_fill_capture_fmt takes a fourcc:
    • NV12M: 2 planes, plane[0]=WH, plane[1]=WH/2, bytesperline=W.
    • P010: 1 plane, sizeimage = WH3 (Y=2 bytes per pixel × H rows + interleaved CbCr=W bytes per chroma row × H/2 rows = WH2 + WH = WH3), bytesperline = W2.
  • try_fmt for CAPTURE preserves caller fourcc when supported, falls back to NV12M default otherwise.
  • daedalus_complete_resp_frame refactored: the dmabuf path (pixels_len == 0) now sets each plane's payload to vb2_plane_size(vb, p) — the daemon fully populated the plane, so payload = sizeimage. Generalises cleanly to 1-plane (P010) and 2-plane (NV12M) formats.

Daemon (daemon/src/decoder.c)

  • pack_p010_to_plane — packs YUV420P10LE into P010 single-plane layout: Y plane (16-bit samples, MSB-aligned 10-bit data, low 6 bits zero) at base+0, interleaved CbCr at base+(Y plane size). Strips source stride padding from fr->linesize[*]; respects destination stride from planes->stride[0].
  • daedalus_decoder_run_request dispatches on req->capture_pix_fmt:
    • V4L2_PIX_FMT_NV12Mpack_nv12_to_planes
    • V4L2_PIX_FMT_P010pack_p010_to_plane
    • else → warn + skip pack (decoder still reports the frame metadata).
  • Includes <linux/videodev2.h> for the fourcc constants.

Verification

All measurements on hertz (Pi 5, 6.12.75+rpt-rpi-2712).

1080p throughput baseline — 30fps target met across the board

30-frame testsrc at 1920×1080, decoded via the V4L2 m2m

  • dmabuf path; per-frame µs measured from QBUF OUTPUT to write(of, NV12) returning.
Codec Mean p50 p99 fps byte-exact vs ffmpeg
VP9 12.0 ms 11.8 ms 15.9 ms 83.1
AV1 15.4 ms 14.3 ms 41.0 ms 65.0
H.264 11.3 ms 10.5 ms 21.5 ms 88.3

The 30fps-floor-is-fine memory's user-facing criterion is "daily YouTube playback with CPU free for vscode." At 65-88 fps single-threaded the daemon is so far above the floor that real-world content has comfortable headroom for the rest of the desktop.

HDR / 10-bit P010 — byte-exact + still real-time

$ ffmpeg -f lavfi -i 'testsrc=duration=0.4:size=1920x1080:rate=25' \
       -pix_fmt yuv420p10le -c:v libvpx-vp9 -cpu-used 8 \
       -y vp9_10bit_1080.ivf
$ ffmpeg -i vp9_10bit_1080.ivf -pix_fmt p010le -f rawvideo \
       -y vp9_10bit_1080_ref.p010

$ sudo ./tools/test_m2m_stream \
       vp9_10bit_1080.ivf \
       vp9_10bit_1080_out.p010 \
       1920 1080 vp9 p010
  parsed 10 frames, 1920x1080
  CAPTURE fmt=P010 planes=1 sizeimage=[6220800,0]
  decoded 10 / 10 frames
  perf: mean=20.5ms p50=19.0ms p99=28.0ms  | fps=48.8

$ cmp vp9_10bit_1080_out.p010 vp9_10bit_1080_ref.p010
0   # 62 MB across 10 frames, byte-for-byte match

The 10-bit path is ~50fps@1080p — still above the 30fps target. The overhead vs 8-bit comes from the shift-left-by-6 plus the wider memory writes (16-bit per sample); both are inherent to the format.

The smaller 320×240 P010 test ran at 966 fps — the fixed daemon-side overhead dominates at small resolutions.

v4l2-compliance — unchanged from 8.7

Total for daedalus_v4l2 device /dev/video0: 49, Succeeded: 49,
Failed: 0, Warnings: 0

Compliance was already complete after 8.7; the added P010 format passes through the same MMAP / DMABUF / REQBUFS / EXPBUF tests cleanly.

Format enumeration

$ v4l2-ctl -d /dev/video0 --list-formats
  [0]: 'NM12' (Y/UV 4:2:0 (N-C))
  [1]: 'P010' (10-bit Y/UV 4:2:0)

Clean teardown

$ pkill -TERM daedalus_v4l2_daemon
$ sudo rmmod daedalus_v4l2
$ sudo dmesg | grep -E 'BUG|oops'
(empty)

Design decisions

Why measure before substituting QPU kernels?

The Phase 8.8 roadmap entry was "profile + dispatch QPU kernels for hot paths." The unstated assumption was "FFmpeg software decode is too slow at 1080p." Measurement falsified the assumption — Cortex-A76 ARM has enough single-thread throughput that libvpx-vp9 / libdav1d / libavcodec H.264 all clear 30fps@1080p without help.

QPU substitution still has value for:

  • Higher resolutions (4K),
  • Higher frame rates (60fps+),
  • Lower-power CPUs (Pi 5 is competitive; older Pis aren't),
  • Power efficiency at any throughput.

Per feedback_correctness_before_speed: measure, then optimize what's actually slow. The QPU work is still in the roadmap but it's no longer urgent — it's an optimization phase, not a feature phase.

Why P010 (single plane) and not P010M (multi plane)?

The kernel uABI only defines V4L2_PIX_FMT_P010 (single plane, fourcc 'P010'). There is no P010M constant in v6.12 headers. Single plane works fine for our purposes — the daemon's dmabuf path gets one fd, one mmap, and the Y/CbCr layout is fixed by the format spec.

If a future userspace ever needs separate Y and CbCr buffers we could define our own V4L2_PIX_FMT_P010M- shaped layout, but that would diverge from the standard fourcc and is hard to motivate without an actual consumer.

Why the Annex-B parser accumulates access units

The V4L2 stateless H.264 spec says each OUTPUT buffer contains ONE PARSED SLICE. Our daemon doesn't use the SLICE_PARAMS controls — it just passes bytes to FFmpeg which re-parses. FFmpeg's H.264 decoder rejects "SPS-only" or "PPS-only" buffers as "no frame!", so splitting on every start code fails.

Solution: accumulate NALs into access units. An AU contains zero or more non-VCL NALs (SPS/PPS/SEI/AUD) followed by one VCL NAL (slice type 1 or 5). We push each completed AU as one OUTPUT buffer. Works for any H.264 Annex-B stream where one access unit = one frame (our ultrafast baseline x264 encode), which is the common case for the test harness.

Per-frame timing measures the full QBUF→DQBUF cycle

The reported "mean=12ms" includes:

  1. memcpy bitstream into OUTPUT MMAP plane
  2. VIDIOC_QBUF
  3. poll() — blocks until CAPTURE ready
  4. VIDIOC_DQBUF OUTPUT
  5. VIDIOC_DQBUF CAPTURE
  6. fwrite NV12 (or P010) plane(s) to output file
  7. VIDIOC_QBUF CAPTURE recycle

The actual decode wallclock is somewhere inside (3); the rest is measurement overhead that a real consumer (libva-v4l2-request) wouldn't pay (no fwrite, fewer ioctls per frame with pipelining). So the reported fps is a conservative lower bound on what the daemon can sustain.

What's NOT here (deferred)

  • QPU dispatch substitution. Not needed for 30fps@1080p (proven by measurement). Stays on the roadmap for higher-throughput / lower-power scenarios.
  • libva-v4l2-request consumer integration. Per project_consumer_target memory this is the actual end point — what the daemon's V4L2 stateless API was built to feed. Phase 8.9+ work; would close the loop from YouTube → Firefox → libva → /dev/video0 → daemon.
  • Multi-frame HDR tests for AV1/H.264. Phase 8.8's P010 test is VP9 only. Adding AV1+H.264 multi-frame HDR streams is straightforward (encoder already supports yuv420p10le) but didn't fit the current phase scope.
  • >1080p resolutions. No 4K stream tests. The protocol/code paths are size-agnostic; only the test harness needs bigger inputs.

Phase 8.9 plan

  1. libva-v4l2-request integration — the actual consumer that closes the project's user-facing loop (per project_consumer_target). Patch the library to recognise our driver via media controller, wire codec parsing to feed our OUTPUT buffers.
  2. End-to-end test: Firefox → libva → /dev/video0 → daemon → on-screen frame.
  3. Stress: long-form (60s+) playback with proper buffer recycling timing.
  4. Multi-frame HDR tests for AV1 + H.264.

After 8.9 the project's user-facing goal is hit; the remaining sub-phases (QPU substitution, 4K, encoders) are optimisation work that ships when motivated.