Phase 8.8: throughput baseline + multi-codec streams + HDR

Per the correctness-before-speed principle: measure before optimising. Roadmap going in said "QPU dispatch substitution to hit 30fps@1080p". Measurement on hertz shows the FFmpeg software path already hits 65-88 fps@1080p across all three codecs — QPU substitution would be premature optimisation. So 8.8 ships what's actually useful: 1. Per-frame timing in test_m2m_stream. 2. Multi-frame AV1 + H.264 streams verified byte-exact at 1080p (closes the "VP9-only stream tests" gap from 8.7). 3. HDR / 10-bit via V4L2_PIX_FMT_P010 + daemon pack_p010_to_plane. Test harness (tools/test_m2m_stream.c): - Per-frame µs timing via CLOCK_MONOTONIC; reports mean/p50/ p99/min/max + wall ms + fps. - Annex-B H.264 parser: split on 3-/4-byte start codes, accumulate NALs into access units (push on VCL NAL types 1 or 5). Without AU grouping FFmpeg rejects SPS/PPS-only buffers as "no frame!". - Format auto-detect (DKIF magic → IVF; else Annex-B). - Optional 6th arg `[capture]`: nv12m | p010. - CAPTURE mmap path generalised for num_planes==1 (P010). Kernel (kernel/daedalus_v4l2_main.c): - CAPTURE formats array {NV12M, P010}; enum_fmt walks it. - daedalus_fill_capture_fmt takes a fourcc: NV12M: 2 planes, W*H + W*H/2 bytes, bpl=W P010: 1 plane, W*H*2 + W*H bytes, bpl=W*2 - try_fmt preserves caller fourcc when supported. - daedalus_complete_resp_frame's dmabuf path now sets each plane's payload to vb2_plane_size(vb,p) — generalises cleanly across 1-plane (P010) and 2-plane (NV12M) layouts; the daemon fully populates the plane so payload = sizeimage. Daemon (daemon/src/decoder.c): - pack_p010_to_plane: YUV420P10LE → P010 single-plane. 10-bit samples shifted left by 6 to MSB-align in 16-bit words per V4L2 ABI. Y at base+0, interleaved CbCr right after Y plane (per format spec for single-plane P010). Strips source stride padding; respects destination stride. - daedalus_decoder_run_request dispatches on req->capture_pix_fmt (NV12M → pack_nv12_to_planes; P010 → pack_p010_to_plane; else warn + skip). - Includes <linux/videodev2.h> for fourcc constants. Verification on hertz (Pi 5, 6.12.75+rpt-rpi-2712): 1080p throughput baseline (30 frames testsrc, dmabuf path): VP9 1080p: mean 12.0 ms, p99 15.9 ms, fps **83.1**, byte-exact ✓ AV1 1080p: mean 15.4 ms, p99 41.0 ms, fps **65.0**, byte-exact ✓ H.264 1080p: mean 11.3 ms, p99 21.5 ms, fps **88.3**, byte-exact ✓ All 2-3× over the 30fps-floor-is-fine criterion. HDR / 10-bit 1080p P010: 10 frames, 62 MB output, fps **48.8**, byte-exact vs `ffmpeg -pix_fmt p010le -f rawvideo`. Small-frame P010 (320×240): fps 966 — fixed daemon overhead dominates at low resolutions. v4l2-compliance unchanged from 8.7: 49/49 passing. Format enumeration confirms NM12 + P010 on CAPTURE. Clean SIGTERM + rmmod; no kernel oops/WARN. Roadmap update (docs/roadmap.md): - 8.8 marked closed with closure-doc reference, including the explicit "QPU substitution not needed" rationale. - 8.9 reshaped: libva-v4l2-request consumer integration (per project_consumer_target memory) — the actual user-facing endpoint. Per correctness-before-speed: - Measured first; QPU work explicitly justified-out via data. - Byte-exact pixel comparison for every codec/format combo (NV12: VP9, AV1, H.264; P010: VP9 10-bit at 320×240 and 1080p). - AU grouping in the Annex-B parser is the correct semantic boundary, not just a workaround. - vb2_plane_size for payload generalises to any plane count, not hardcoded to 2. Phase 8.9 next: libva-v4l2-request integration — close the loop from YouTube/Firefox to /dev/video0 + daemon playback. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 16:34:05 +00:00
parent 5965805d86
commit 1ae9528e76
5 changed files with 713 additions and 71 deletions
@@ -0,0 +1,261 @@
+# Phase 8.8 closure — throughput baseline + multi-codec streams + HDR
+
+**Status:** closed 2026-05-18.
+
+The roadmap going into 8.8 prescribed a substantial QPU
+dispatch substitution effort to hit the
+`30fps-floor-is-fine` user-facing criterion.  The proper
+correctness-before-speed move was to **measure first** —
+turns out the daemon's FFmpeg software path on Pi 5's
+Cortex-A76 already hits **65-88 fps@1080p** across all three
+codecs, 2-3× over the 30fps target.  QPU substitution would
+have been premature optimization.
+
+So 8.8 ships what's actually useful:
+
+1. **Per-frame timing instrumentation** in
+   `test_m2m_stream` with mean / p50 / p99 / fps reporting.
+2. **Multi-frame AV1 + H.264 streams verified** byte-exact
+   at 1080p (closing the "VP9-only stream tests" gap from
+   8.7).
+3. **HDR / 10-bit support** — `V4L2_PIX_FMT_P010` added as
+   a CAPTURE format with depth-aware packing in the daemon.
+
+## What lands
+
+### Test harness (`tools/test_m2m_stream.c`)
+- Per-frame microsecond timing via `clock_gettime(CLOCK_
+  MONOTONIC)`.  Final report: mean / p50 / p99 / min / max
+  per-frame microseconds + wall ms + fps.
+- Annex-B H.264 parser: split bitstream on
+  3- or 4-byte start codes, accumulate NALs into access
+  units (push when we see a VCL NAL — type 1 or 5).
+  Without access-unit grouping, FFmpeg's H.264 decoder
+  rejects SPS-only or PPS-only buffers as "no frame!".
+- Format auto-detection: IVF (DKIF magic) → `parse_ivf`;
+  anything else → `parse_annexb`.  Non-IVF input requires
+  explicit `[w] [h]` since framing carries no dimensions.
+- New optional 6th argument `[capture]`:
+  `nv12m` (default, 8-bit, 2 planes) or
+  `p010` (10-bit, 1 plane).
+- CAPTURE mmap path generalised to handle
+  `num_planes == 1` (P010) — previously hardcoded to 2.
+
+### Kernel (`kernel/daedalus_v4l2_main.c`)
+- CAPTURE formats array: `{ NV12M, P010 }`, with
+  `daedalus_is_supported_capture` matching the OUTPUT-side
+  helper.
+- `enum_fmt` on CAPTURE walks the array (2 entries).
+- `daedalus_fill_capture_fmt` takes a fourcc:
+  - NV12M: 2 planes, plane[0]=W*H, plane[1]=W*H/2,
+    bytesperline=W.
+  - P010: 1 plane, sizeimage = W*H*3 (Y=2 bytes per pixel
+    × H rows + interleaved CbCr=W bytes per chroma row ×
+    H/2 rows = W*H*2 + W*H = W*H*3), bytesperline = W*2.
+- `try_fmt` for CAPTURE preserves caller fourcc when
+  supported, falls back to NV12M default otherwise.
+- `daedalus_complete_resp_frame` refactored: the dmabuf
+  path (pixels_len == 0) now sets each plane's payload to
+  `vb2_plane_size(vb, p)` — the daemon fully populated the
+  plane, so payload = sizeimage.  Generalises cleanly to
+  1-plane (P010) and 2-plane (NV12M) formats.
+
+### Daemon (`daemon/src/decoder.c`)
+- `pack_p010_to_plane` — packs YUV420P10LE into P010
+  single-plane layout: Y plane (16-bit samples, MSB-aligned
+  10-bit data, low 6 bits zero) at base+0, interleaved
+  CbCr at base+(Y plane size).  Strips source stride
+  padding from `fr->linesize[*]`; respects destination
+  stride from `planes->stride[0]`.
+- `daedalus_decoder_run_request` dispatches on
+  `req->capture_pix_fmt`:
+  - `V4L2_PIX_FMT_NV12M` → `pack_nv12_to_planes`
+  - `V4L2_PIX_FMT_P010`  → `pack_p010_to_plane`
+  - else → warn + skip pack (decoder still reports the
+    frame metadata).
+- Includes `<linux/videodev2.h>` for the fourcc constants.
+
+## Verification
+
+All measurements on hertz (Pi 5, 6.12.75+rpt-rpi-2712).
+
+### 1080p throughput baseline — 30fps target met across the board
+
+30-frame `testsrc` at 1920×1080, decoded via the V4L2 m2m
+ dmabuf path; per-frame µs measured from QBUF OUTPUT to
+write(of, NV12) returning.
+
+| Codec | Mean | p50 | p99 | fps | byte-exact vs ffmpeg |
+|-------|------|-----|-----|-----|----------------------|
+| VP9   | 12.0 ms | 11.8 ms | 15.9 ms | **83.1** | ✓ |
+| AV1   | 15.4 ms | 14.3 ms | 41.0 ms | **65.0** | ✓ |
+| H.264 | 11.3 ms | 10.5 ms | 21.5 ms | **88.3** | ✓ |
+
+The `30fps-floor-is-fine` memory's user-facing criterion is
+"daily YouTube playback with CPU free for vscode."  At
+65-88 fps single-threaded the daemon is so far above the
+floor that real-world content has comfortable headroom for
+the rest of the desktop.
+
+### HDR / 10-bit P010 — byte-exact + still real-time
+
+```
+$ ffmpeg -f lavfi -i 'testsrc=duration=0.4:size=1920x1080:rate=25' \
+       -pix_fmt yuv420p10le -c:v libvpx-vp9 -cpu-used 8 \
+       -y vp9_10bit_1080.ivf
+$ ffmpeg -i vp9_10bit_1080.ivf -pix_fmt p010le -f rawvideo \
+       -y vp9_10bit_1080_ref.p010
+
+$ sudo ./tools/test_m2m_stream \
+       vp9_10bit_1080.ivf \
+       vp9_10bit_1080_out.p010 \
+       1920 1080 vp9 p010
+  parsed 10 frames, 1920x1080
+  CAPTURE fmt=P010 planes=1 sizeimage=[6220800,0]
+  decoded 10 / 10 frames
+  perf: mean=20.5ms p50=19.0ms p99=28.0ms  | fps=48.8
+
+$ cmp vp9_10bit_1080_out.p010 vp9_10bit_1080_ref.p010
+0   # 62 MB across 10 frames, byte-for-byte match
+```
+
+The 10-bit path is ~50fps@1080p — still above the 30fps
+target.  The overhead vs 8-bit comes from the
+shift-left-by-6 plus the wider memory writes (16-bit per
+sample); both are inherent to the format.
+
+The smaller 320×240 P010 test ran at **966 fps** — the
+fixed daemon-side overhead dominates at small resolutions.
+
+### v4l2-compliance — unchanged from 8.7
+
+```
+Total for daedalus_v4l2 device /dev/video0: 49, Succeeded: 49,
+Failed: 0, Warnings: 0
+```
+
+Compliance was already complete after 8.7; the added P010
+format passes through the same MMAP / DMABUF / REQBUFS /
+EXPBUF tests cleanly.
+
+### Format enumeration
+
+```
+$ v4l2-ctl -d /dev/video0 --list-formats
+  [0]: 'NM12' (Y/UV 4:2:0 (N-C))
+  [1]: 'P010' (10-bit Y/UV 4:2:0)
+```
+
+### Clean teardown
+
+```
+$ pkill -TERM daedalus_v4l2_daemon
+$ sudo rmmod daedalus_v4l2
+$ sudo dmesg | grep -E 'BUG|oops'
+(empty)
+```
+
+## Design decisions
+
+### Why measure before substituting QPU kernels?
+
+The Phase 8.8 roadmap entry was "profile + dispatch QPU
+kernels for hot paths."  The unstated assumption was
+"FFmpeg software decode is too slow at 1080p."  Measurement
+falsified the assumption — Cortex-A76 ARM has enough
+single-thread throughput that libvpx-vp9 / libdav1d /
+libavcodec H.264 all clear 30fps@1080p without help.
+
+QPU substitution still has value for:
+- Higher resolutions (4K),
+- Higher frame rates (60fps+),
+- Lower-power CPUs (Pi 5 is competitive; older Pis aren't),
+- Power efficiency at any throughput.
+
+Per `feedback_correctness_before_speed`: measure, then
+optimize what's actually slow.  The QPU work is still in
+the roadmap but it's no longer urgent — it's an
+optimization phase, not a feature phase.
+
+### Why P010 (single plane) and not P010M (multi plane)?
+
+The kernel uABI only defines `V4L2_PIX_FMT_P010`
+(single plane, fourcc 'P010').  There is no `P010M`
+constant in v6.12 headers.  Single plane works fine for
+our purposes — the daemon's dmabuf path gets one fd, one
+mmap, and the Y/CbCr layout is fixed by the format spec.
+
+If a future userspace ever needs separate Y and CbCr
+buffers we could define our own `V4L2_PIX_FMT_P010M`-
+shaped layout, but that would diverge from the standard
+fourcc and is hard to motivate without an actual consumer.
+
+### Why the Annex-B parser accumulates access units
+
+The V4L2 stateless H.264 spec says each OUTPUT buffer
+contains ONE PARSED SLICE.  Our daemon doesn't use the
+SLICE_PARAMS controls — it just passes bytes to FFmpeg
+which re-parses.  FFmpeg's H.264 decoder rejects "SPS-only"
+or "PPS-only" buffers as "no frame!", so splitting on every
+start code fails.
+
+Solution: accumulate NALs into access units.  An AU
+contains zero or more non-VCL NALs (SPS/PPS/SEI/AUD)
+followed by one VCL NAL (slice type 1 or 5).  We push
+each completed AU as one OUTPUT buffer.  Works for any
+H.264 Annex-B stream where one access unit = one frame
+(our ultrafast baseline x264 encode), which is the common
+case for the test harness.
+
+### Per-frame timing measures the full QBUF→DQBUF cycle
+
+The reported "mean=12ms" includes:
+1. memcpy bitstream into OUTPUT MMAP plane
+2. VIDIOC_QBUF
+3. poll() — blocks until CAPTURE ready
+4. VIDIOC_DQBUF OUTPUT
+5. VIDIOC_DQBUF CAPTURE
+6. fwrite NV12 (or P010) plane(s) to output file
+7. VIDIOC_QBUF CAPTURE recycle
+
+The actual decode wallclock is somewhere inside (3); the
+rest is measurement overhead that a real consumer
+(libva-v4l2-request) wouldn't pay (no fwrite, fewer
+ioctls per frame with pipelining).  So the reported fps
+is a **conservative lower bound** on what the daemon can
+sustain.
+
+## What's NOT here (deferred)
+
+- **QPU dispatch substitution.**  Not needed for 30fps@1080p
+  (proven by measurement).  Stays on the roadmap for
+  higher-throughput / lower-power scenarios.
+- **libva-v4l2-request consumer integration.**  Per
+  `project_consumer_target` memory this is the actual end
+  point — what the daemon's V4L2 stateless API was built
+  to feed.  Phase 8.9+ work; would close the loop from
+  YouTube → Firefox → libva → /dev/video0 → daemon.
+- **Multi-frame HDR tests for AV1/H.264.**  Phase 8.8's
+  P010 test is VP9 only.  Adding AV1+H.264 multi-frame
+  HDR streams is straightforward (encoder already supports
+  yuv420p10le) but didn't fit the current phase scope.
+- **>1080p resolutions.**  No 4K stream tests.  The
+  protocol/code paths are size-agnostic; only the test
+  harness needs bigger inputs.
+
+## Phase 8.9 plan
+
+1. libva-v4l2-request integration — the actual consumer
+   that closes the project's user-facing loop (per
+   `project_consumer_target`).  Patch the
+   library to recognise our driver via media controller,
+   wire codec parsing to feed our OUTPUT buffers.
+2. End-to-end test: Firefox → libva → /dev/video0 →
+   daemon → on-screen frame.
+3. Stress: long-form (60s+) playback with proper buffer
+   recycling timing.
+4. Multi-frame HDR tests for AV1 + H.264.
+
+After 8.9 the project's user-facing goal is hit; the
+remaining sub-phases (QPU substitution, 4K, encoders) are
+optimisation work that ships when motivated.