daedalus-v4l2/docs/phase_8_8_closure.md
marfrit 1ae9528e76 Phase 8.8: throughput baseline + multi-codec streams + HDR
Per the correctness-before-speed principle: measure before
optimising. Roadmap going in said "QPU dispatch substitution
to hit 30fps@1080p". Measurement on hertz shows the FFmpeg
software path already hits 65-88 fps@1080p across all three
codecs — QPU substitution would be premature optimisation.

So 8.8 ships what's actually useful:
1. Per-frame timing in test_m2m_stream.
2. Multi-frame AV1 + H.264 streams verified byte-exact at
   1080p (closes the "VP9-only stream tests" gap from 8.7).
3. HDR / 10-bit via V4L2_PIX_FMT_P010 + daemon
   pack_p010_to_plane.

Test harness (tools/test_m2m_stream.c):
- Per-frame µs timing via CLOCK_MONOTONIC; reports mean/p50/
  p99/min/max + wall ms + fps.
- Annex-B H.264 parser: split on 3-/4-byte start codes,
  accumulate NALs into access units (push on VCL NAL types
  1 or 5). Without AU grouping FFmpeg rejects SPS/PPS-only
  buffers as "no frame!".
- Format auto-detect (DKIF magic → IVF; else Annex-B).
- Optional 6th arg `[capture]`: nv12m | p010.
- CAPTURE mmap path generalised for num_planes==1 (P010).

Kernel (kernel/daedalus_v4l2_main.c):
- CAPTURE formats array {NV12M, P010}; enum_fmt walks it.
- daedalus_fill_capture_fmt takes a fourcc:
    NV12M: 2 planes, W*H + W*H/2 bytes, bpl=W
    P010:  1 plane,  W*H*2 + W*H bytes, bpl=W*2
- try_fmt preserves caller fourcc when supported.
- daedalus_complete_resp_frame's dmabuf path now sets each
  plane's payload to vb2_plane_size(vb,p) — generalises
  cleanly across 1-plane (P010) and 2-plane (NV12M) layouts;
  the daemon fully populates the plane so payload =
  sizeimage.

Daemon (daemon/src/decoder.c):
- pack_p010_to_plane: YUV420P10LE → P010 single-plane.
  10-bit samples shifted left by 6 to MSB-align in 16-bit
  words per V4L2 ABI. Y at base+0, interleaved CbCr right
  after Y plane (per format spec for single-plane P010).
  Strips source stride padding; respects destination stride.
- daedalus_decoder_run_request dispatches on
  req->capture_pix_fmt (NV12M → pack_nv12_to_planes; P010
  → pack_p010_to_plane; else warn + skip).
- Includes <linux/videodev2.h> for fourcc constants.

Verification on hertz (Pi 5, 6.12.75+rpt-rpi-2712):

1080p throughput baseline (30 frames testsrc, dmabuf path):

  VP9   1080p:  mean 12.0 ms,  p99 15.9 ms,  fps **83.1**, byte-exact ✓
  AV1   1080p:  mean 15.4 ms,  p99 41.0 ms,  fps **65.0**, byte-exact ✓
  H.264 1080p:  mean 11.3 ms,  p99 21.5 ms,  fps **88.3**, byte-exact ✓

All 2-3× over the 30fps-floor-is-fine criterion.

HDR / 10-bit 1080p P010:
  10 frames, 62 MB output, fps **48.8**, byte-exact vs
  `ffmpeg -pix_fmt p010le -f rawvideo`.

Small-frame P010 (320×240): fps 966 — fixed daemon overhead
dominates at low resolutions.

v4l2-compliance unchanged from 8.7: 49/49 passing.
Format enumeration confirms NM12 + P010 on CAPTURE.

Clean SIGTERM + rmmod; no kernel oops/WARN.

Roadmap update (docs/roadmap.md):
- 8.8 marked closed with closure-doc reference, including
  the explicit "QPU substitution not needed" rationale.
- 8.9 reshaped: libva-v4l2-request consumer integration
  (per project_consumer_target memory) — the actual
  user-facing endpoint.

Per correctness-before-speed:
- Measured first; QPU work explicitly justified-out via data.
- Byte-exact pixel comparison for every codec/format combo
  (NV12: VP9, AV1, H.264; P010: VP9 10-bit at 320×240 and
  1080p).
- AU grouping in the Annex-B parser is the correct
  semantic boundary, not just a workaround.
- vb2_plane_size for payload generalises to any plane
  count, not hardcoded to 2.

Phase 8.9 next: libva-v4l2-request integration — close
the loop from YouTube/Firefox to /dev/video0 + daemon
playback.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 16:34:05 +00:00

# Phase 8.8 closure — throughput baseline + multi-codec streams + HDR
**Status:** closed 2026-05-18.
The roadmap going into 8.8 prescribed a substantial QPU
dispatch substitution effort to hit the
`30fps-floor-is-fine` user-facing criterion. The proper
correctness-before-speed move was to **measure first**.
It turns out the daemon's FFmpeg software path on the Pi 5's
Cortex-A76 already hits **65-88 fps@1080p** across all three
codecs, 2-3× over the 30fps target. QPU substitution would
have been premature optimization.
So 8.8 ships what's actually useful:
1. **Per-frame timing instrumentation** in
`test_m2m_stream` with mean / p50 / p99 / fps reporting.
2. **Multi-frame AV1 + H.264 streams verified** byte-exact
at 1080p (closing the "VP9-only stream tests" gap from
8.7).
3. **HDR / 10-bit support**: `V4L2_PIX_FMT_P010` added as
a CAPTURE format with depth-aware packing in the daemon.
## What lands
### Test harness (`tools/test_m2m_stream.c`)
- Per-frame microsecond timing via
`clock_gettime(CLOCK_MONOTONIC)`. Final report: mean / p50 /
p99 / min / max per-frame microseconds + wall ms + fps.
- Annex-B H.264 parser: split bitstream on
3- or 4-byte start codes, accumulate NALs into access
units (push when we see a VCL NAL — type 1 or 5).
Without access-unit grouping, FFmpeg's H.264 decoder
rejects SPS-only or PPS-only buffers as "no frame!".
- Format auto-detection: IVF (DKIF magic) → `parse_ivf`;
anything else → `parse_annexb`. Non-IVF input requires
explicit `[w] [h]` since framing carries no dimensions.
- New optional 6th argument `[capture]`:
`nv12m` (default, 8-bit, 2 planes) or
`p010` (10-bit, 1 plane).
- CAPTURE mmap path generalised to handle
`num_planes == 1` (P010) — previously hardcoded to 2.
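
The mean / p50 / p99 / min / max report described above can be sketched as follows. This is a minimal illustration, not the harness's actual code: `struct frame_stats`, `summarize_frames`, and the nearest-rank percentile convention are all assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical summary struct; field names are illustrative only. */
struct frame_stats {
    uint64_t min_us, max_us, p50_us, p99_us;
    double mean_us;
};

static int cmp_u64(const void *a, const void *b)
{
    uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;
    return (x > y) - (x < y);
}

/* Nearest-rank percentile over an ascending-sorted array
 * (an assumed convention; the harness may use another). */
static uint64_t percentile_sorted(const uint64_t *s, size_t n, unsigned pct)
{
    size_t rank = (n * pct + 99) / 100;   /* ceil(n * pct / 100) */
    if (rank == 0)
        rank = 1;
    return s[rank - 1];
}

/* Sorts the per-frame samples in place, then fills *out. */
static void summarize_frames(uint64_t *samples, size_t n,
                             struct frame_stats *out)
{
    uint64_t sum = 0;

    assert(n > 0);
    qsort(samples, n, sizeof(*samples), cmp_u64);
    for (size_t i = 0; i < n; i++)
        sum += samples[i];

    out->min_us = samples[0];
    out->max_us = samples[n - 1];
    out->mean_us = (double)sum / (double)n;
    out->p50_us = percentile_sorted(samples, n, 50);
    out->p99_us = percentile_sorted(samples, n, 99);
}
```

With only 30 samples per run, nearest-rank p99 is simply the slowest frame, which is worth keeping in mind when reading the p99 column in the tables below.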
### Kernel (`kernel/daedalus_v4l2_main.c`)
- CAPTURE formats array: `{ NV12M, P010 }`, with
`daedalus_is_supported_capture` matching the OUTPUT-side
helper.
- `enum_fmt` on CAPTURE walks the array (2 entries).
- `daedalus_fill_capture_fmt` takes a fourcc:
- NV12M: 2 planes, plane[0]=W*H, plane[1]=W*H/2,
bytesperline=W.
- P010: 1 plane, sizeimage = W*H*3 (Y = 2*W bytes per row
× H rows, plus interleaved CbCr = 2*W bytes per chroma row
× H/2 rows: W*H*2 + W*H = W*H*3), bytesperline = W*2.
- `try_fmt` for CAPTURE preserves caller fourcc when
supported, falls back to NV12M default otherwise.
- `daedalus_complete_resp_frame` refactored: the dmabuf
path (pixels_len == 0) now sets each plane's payload to
`vb2_plane_size(vb, p)` — the daemon fully populated the
plane, so payload = sizeimage. Generalises cleanly to
1-plane (P010) and 2-plane (NV12M) formats.
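
The plane-size arithmetic above can be checked with a small standalone sketch. `struct capture_layout`, `enum capture_kind`, and `capture_layout_for` are hypothetical names for illustration; the real `daedalus_fill_capture_fmt` fills a V4L2 format struct and may handle alignment differently.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout descriptor; not the driver's actual struct. */
struct capture_layout {
    unsigned num_planes;
    uint32_t bytesperline;
    uint32_t sizeimage[2];
};

enum capture_kind { CAPTURE_NV12M, CAPTURE_P010 };

/* Computes plane count and sizes for an even-sized W x H frame,
 * assuming no stride padding (bytesperline is exactly one row). */
static struct capture_layout capture_layout_for(enum capture_kind kind,
                                                uint32_t w, uint32_t h)
{
    struct capture_layout l = {0};

    assert(w % 2 == 0 && h % 2 == 0);
    switch (kind) {
    case CAPTURE_NV12M:
        l.num_planes = 2;
        l.bytesperline = w;             /* 1 byte per luma sample */
        l.sizeimage[0] = w * h;         /* Y plane */
        l.sizeimage[1] = w * h / 2;     /* interleaved CbCr plane */
        break;
    case CAPTURE_P010:
        l.num_planes = 1;
        l.bytesperline = w * 2;         /* 2 bytes per luma sample */
        /* Y: 2*W bytes x H rows; CbCr: 2*W bytes x H/2 rows */
        l.sizeimage[0] = w * h * 2 + w * h;
        break;
    }
    return l;
}
```

For 1920×1080 P010 this gives 6220800 bytes, matching the `sizeimage=[6220800,0]` line in the verification log.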
### Daemon (`daemon/src/decoder.c`)
- `pack_p010_to_plane` — packs YUV420P10LE into P010
single-plane layout: Y plane (16-bit samples, MSB-aligned
10-bit data, low 6 bits zero) at base+0, interleaved
CbCr at base+(Y plane size). Strips source stride
padding from `fr->linesize[*]`; respects destination
stride from `planes->stride[0]`.
- `daedalus_decoder_run_request` dispatches on
`req->capture_pix_fmt`:
- `V4L2_PIX_FMT_NV12M` → `pack_nv12_to_planes`
- `V4L2_PIX_FMT_P010` → `pack_p010_to_plane`
- else → warn + skip pack (decoder still reports the
frame metadata).
- Includes `<linux/videodev2.h>` for the fourcc constants.
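
A minimal sketch of the packing step follows, assuming the layout described above. `pack_p010_sketch` and its parameter list are hypothetical; the daemon's `pack_p010_to_plane` takes an `AVFrame` and a planes descriptor whose exact signatures are not shown in this document. Bytes are read and written explicitly as little-endian so the sketch does not depend on host endianness.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static uint16_t rd16le(const uint8_t *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

static void wr16le(uint8_t *p, uint16_t v)
{
    p[0] = (uint8_t)(v & 0xff);
    p[1] = (uint8_t)(v >> 8);
}

/* Packs planar 10-bit 4:2:0 (sample in the low 10 bits of each
 * little-endian 16-bit word) into single-plane P010. All strides
 * are in bytes; source padding is skipped, dst_stride is honoured. */
static void pack_p010_sketch(uint8_t *dst, size_t dst_stride,
                             const uint8_t *src_y, size_t y_stride,
                             const uint8_t *src_u, size_t u_stride,
                             const uint8_t *src_v, size_t v_stride,
                             unsigned w, unsigned h)
{
    assert(w % 2 == 0 && h % 2 == 0);
    assert(dst_stride >= (size_t)w * 2);

    /* Luma: shift left by 6 so the 10 bits sit in the MSBs. */
    for (unsigned y = 0; y < h; y++) {
        const uint8_t *s = src_y + y * y_stride;
        uint8_t *d = dst + y * dst_stride;
        for (unsigned x = 0; x < w; x++)
            wr16le(d + 2 * x, (uint16_t)(rd16le(s + 2 * x) << 6));
    }

    /* Chroma: starts right after the H luma rows, Cb then Cr. */
    uint8_t *dst_c = dst + (size_t)h * dst_stride;
    for (unsigned y = 0; y < h / 2; y++) {
        const uint8_t *su = src_u + y * u_stride;
        const uint8_t *sv = src_v + y * v_stride;
        uint8_t *d = dst_c + y * dst_stride;
        for (unsigned x = 0; x < w / 2; x++) {
            wr16le(d + 4 * x,     (uint16_t)(rd16le(su + 2 * x) << 6));
            wr16le(d + 4 * x + 2, (uint16_t)(rd16le(sv + 2 * x) << 6));
        }
    }
}
```

The per-sample loop is deliberately naive; a production version would likely vectorise the shift, but the byte layout it produces is the part that matters for the byte-exact comparison.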
## Verification
All measurements on hertz (Pi 5, 6.12.75+rpt-rpi-2712).
### 1080p throughput baseline — 30fps target met across the board
30-frame `testsrc` at 1920×1080, decoded via the V4L2 m2m and
dmabuf path; per-frame µs measured from QBUF on OUTPUT until
the decoded NV12 frame has been written to the output file.

| Codec | Mean | p50 | p99 | fps | byte-exact vs ffmpeg |
|-------|------|-----|-----|-----|----------------------|
| VP9 | 12.0 ms | 11.8 ms | 15.9 ms | **83.1** | ✓ |
| AV1 | 15.4 ms | 14.3 ms | 41.0 ms | **65.0** | ✓ |
| H.264 | 11.3 ms | 10.5 ms | 21.5 ms | **88.3** | ✓ |

The `30fps-floor-is-fine` memory's user-facing criterion is
"daily YouTube playback with CPU free for vscode." At
65-88 fps single-threaded the daemon is so far above the
floor that real-world content has comfortable headroom for
the rest of the desktop.
### HDR / 10-bit P010 — byte-exact + still real-time
```
$ ffmpeg -f lavfi -i 'testsrc=duration=0.4:size=1920x1080:rate=25' \
-pix_fmt yuv420p10le -c:v libvpx-vp9 -cpu-used 8 \
-y vp9_10bit_1080.ivf
$ ffmpeg -i vp9_10bit_1080.ivf -pix_fmt p010le -f rawvideo \
-y vp9_10bit_1080_ref.p010
$ sudo ./tools/test_m2m_stream \
vp9_10bit_1080.ivf \
vp9_10bit_1080_out.p010 \
1920 1080 vp9 p010
parsed 10 frames, 1920x1080
CAPTURE fmt=P010 planes=1 sizeimage=[6220800,0]
decoded 10 / 10 frames
perf: mean=20.5ms p50=19.0ms p99=28.0ms | fps=48.8
$ cmp vp9_10bit_1080_out.p010 vp9_10bit_1080_ref.p010
0 # 62 MB across 10 frames, byte-for-byte match
```
The 10-bit path is ~50fps@1080p — still above the 30fps
target. The overhead vs 8-bit comes from the
shift-left-by-6 plus the wider memory writes (16-bit per
sample); both are inherent to the format.
The smaller 320×240 P010 test ran at **966 fps** — the
fixed daemon-side overhead dominates at small resolutions.
### v4l2-compliance — unchanged from 8.7
```
Total for daedalus_v4l2 device /dev/video0: 49, Succeeded: 49,
Failed: 0, Warnings: 0
```
Compliance was already complete after 8.7; the added P010
format passes through the same MMAP / DMABUF / REQBUFS /
EXPBUF tests cleanly.
### Format enumeration
```
$ v4l2-ctl -d /dev/video0 --list-formats
[0]: 'NM12' (Y/UV 4:2:0 (N-C))
[1]: 'P010' (10-bit Y/UV 4:2:0)
```
### Clean teardown
```
$ pkill -TERM daedalus_v4l2_daemon
$ sudo rmmod daedalus_v4l2
$ sudo dmesg | grep -E 'BUG|oops'
(empty)
```
## Design decisions
### Why measure before substituting QPU kernels?
The Phase 8.8 roadmap entry was "profile + dispatch QPU
kernels for hot paths." The unstated assumption was
"FFmpeg software decode is too slow at 1080p." Measurement
falsified the assumption — Cortex-A76 ARM has enough
single-thread throughput that libvpx-vp9 / libdav1d /
libavcodec H.264 all clear 30fps@1080p without help.
QPU substitution still has value for:
- Higher resolutions (4K),
- Higher frame rates (60fps+),
- Lower-power CPUs (Pi 5 is competitive; older Pis aren't),
- Power efficiency at any throughput.
Per `feedback_correctness_before_speed`: measure, then
optimize what's actually slow. The QPU work is still in
the roadmap but it's no longer urgent — it's an
optimization phase, not a feature phase.
### Why P010 (single plane) and not P010M (multi plane)?
The kernel uABI only defines `V4L2_PIX_FMT_P010`
(single plane, fourcc 'P010'). There is no `P010M`
constant in v6.12 headers. Single plane works fine for
our purposes — the daemon's dmabuf path gets one fd, one
mmap, and the Y/CbCr layout is fixed by the format spec.
If a future userspace ever needs separate Y and CbCr
buffers we could define our own `V4L2_PIX_FMT_P010M`-
shaped layout, but that would diverge from the standard
fourcc and is hard to motivate without an actual consumer.
### Why the Annex-B parser accumulates access units
The V4L2 stateless H.264 spec says each OUTPUT buffer
contains ONE PARSED SLICE. Our daemon doesn't use the
SLICE_PARAMS controls — it just passes bytes to FFmpeg
which re-parses. FFmpeg's H.264 decoder rejects "SPS-only"
or "PPS-only" buffers as "no frame!", so splitting on every
start code fails.
Solution: accumulate NALs into access units. An AU
contains zero or more non-VCL NALs (SPS/PPS/SEI/AUD)
followed by one VCL NAL (`nal_unit_type` 1 or 5, i.e. a
non-IDR or IDR slice). We push each completed AU as one
OUTPUT buffer. This works for any H.264 Annex-B stream
where each frame is coded as a single slice (our ultrafast
baseline x264 encode), which is the common case for the
test harness.
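
The grouping rule can be sketched as below. This is an illustration under stated assumptions, not the harness's `parse_annexb`: `struct au_span` and `split_access_units` are hypothetical names, the output is a list of byte ranges into the input buffer, and non-VCL NALs trailing the last slice are simply dropped.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct au_span { size_t off, len; };   /* hypothetical output type */

/* Offset of the next 00 00 01 at or after pos, or n if none. */
static size_t find_start_code(const uint8_t *b, size_t n, size_t pos)
{
    for (size_t i = pos; i + 3 <= n; i++)
        if (b[i] == 0 && b[i + 1] == 0 && b[i + 2] == 1)
            return i;
    return n;
}

/* Splits an Annex-B buffer into access units: each AU ends with
 * the first NAL whose nal_unit_type is 1 or 5. Returns AU count. */
static size_t split_access_units(const uint8_t *b, size_t n,
                                 struct au_span *out, size_t max_out)
{
    size_t count = 0;
    size_t sc = find_start_code(b, n, 0);
    /* A 4-byte start code is 00 + 00 00 01; keep the leading zero. */
    size_t au_start = (sc > 0 && sc < n && b[sc - 1] == 0) ? sc - 1 : sc;

    assert(out != NULL || max_out == 0);
    while (sc < n && count < max_out) {
        size_t hdr = sc + 3;           /* NAL header byte */
        if (hdr >= n)
            break;                     /* start code with no NAL after it */
        size_t next = find_start_code(b, n, hdr);
        /* A zero just before the next start code belongs to it. */
        size_t end = (next < n && next > hdr && b[next - 1] == 0)
                         ? next - 1 : next;
        unsigned type = b[hdr] & 0x1f; /* nal_unit_type: low 5 bits */

        if (type == 1 || type == 5) {  /* VCL NAL closes the AU */
            out[count].off = au_start;
            out[count].len = end - au_start;
            count++;
            au_start = end;
        }
        sc = next;
    }
    return count;
}
```

A stream with several slices per frame would be split into one buffer per slice by this rule, which is why the single-slice condition above matters.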
### Per-frame timing measures the full QBUF→DQBUF cycle
The reported "mean=12ms" includes:
1. memcpy bitstream into OUTPUT MMAP plane
2. VIDIOC_QBUF
3. poll() — blocks until CAPTURE ready
4. VIDIOC_DQBUF OUTPUT
5. VIDIOC_DQBUF CAPTURE
6. fwrite NV12 (or P010) plane(s) to output file
7. VIDIOC_QBUF CAPTURE recycle
The actual decode wallclock is somewhere inside (3); the
rest is measurement overhead that a real consumer
(libva-v4l2-request) wouldn't pay (no fwrite, fewer
ioctls per frame with pipelining). So the reported fps
is a **conservative lower bound** on what the daemon can
sustain.
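
For reference, a minimal sketch of how such a per-frame measurement could be taken with `CLOCK_MONOTONIC`. The helper names (`timespec_to_us`, `monotonic_now_us`, `fps_from_wall`) are assumptions for illustration; the harness's own timing code is not reproduced here.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Converts a timespec to whole microseconds. */
static int64_t timespec_to_us(const struct timespec *ts)
{
    return (int64_t)ts->tv_sec * 1000000 + (int64_t)ts->tv_nsec / 1000;
}

/* Current CLOCK_MONOTONIC time in microseconds. */
static int64_t monotonic_now_us(void)
{
    struct timespec ts;
    int rc = clock_gettime(CLOCK_MONOTONIC, &ts);

    assert(rc == 0);
    (void)rc;                      /* keep NDEBUG builds warning-free */
    return timespec_to_us(&ts);
}

/* Frames per second from a frame count and total wall time. */
static double fps_from_wall(unsigned frames, int64_t wall_us)
{
    assert(wall_us > 0);
    return (double)frames * 1e6 / (double)wall_us;
}
```

A caller would take `t0 = monotonic_now_us()` just before step 1, subtract it from a second reading taken after step 7, and store the difference as that frame's sample. Because steps 1, 6 and 7 sit inside the window, the sample overstates pure decode time, which is the sense in which the reported fps is conservative.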
## What's NOT here (deferred)
- **QPU dispatch substitution.** Not needed for 30fps@1080p
(proven by measurement). Stays on the roadmap for
higher-throughput / lower-power scenarios.
- **libva-v4l2-request consumer integration.** Per
`project_consumer_target` memory this is the actual end
point — what the daemon's V4L2 stateless API was built
to feed. Phase 8.9+ work; would close the loop from
YouTube → Firefox → libva → /dev/video0 → daemon.
- **Multi-frame HDR tests for AV1/H.264.** Phase 8.8's
P010 test is VP9 only. Adding AV1+H.264 multi-frame
HDR streams is straightforward (encoder already supports
yuv420p10le) but didn't fit the current phase scope.
- **>1080p resolutions.** No 4K stream tests. The
protocol/code paths are size-agnostic; only the test
harness needs bigger inputs.
## Phase 8.9 plan
1. libva-v4l2-request integration — the actual consumer
that closes the project's user-facing loop (per
`project_consumer_target`). Patch the
library to recognise our driver via media controller,
wire codec parsing to feed our OUTPUT buffers.
2. End-to-end test: Firefox → libva → /dev/video0 →
daemon → on-screen frame.
3. Stress: long-form (60s+) playback with proper buffer
recycling timing.
4. Multi-frame HDR tests for AV1 + H.264.
After 8.9 the project's user-facing goal is hit; the
remaining sub-phases (QPU substitution, 4K, encoders) are
optimisation work that ships when motivated.