
# Phase 8.8 closure — throughput baseline + multi-codec streams + HDR

**Status:** closed 2026-05-18.

The roadmap going into 8.8 prescribed a substantial QPU
dispatch substitution effort to hit the
`30fps-floor-is-fine` user-facing criterion.  The proper
correctness-before-speed move was to **measure first** —
turns out the daemon's FFmpeg software path on Pi 5's
Cortex-A76 already hits **65-88 fps@1080p** across all three
codecs, 2-3× over the 30fps target.  QPU substitution would
have been premature optimization.

So 8.8 ships what's actually useful:

1. **Per-frame timing instrumentation** in
   `test_m2m_stream` with mean / p50 / p99 / fps reporting.
2. **Multi-frame AV1 + H.264 streams verified** byte-exact
   at 1080p (closing the "VP9-only stream tests" gap from
   8.7).
3. **HDR / 10-bit support** — `V4L2_PIX_FMT_P010` added as
   a CAPTURE format with depth-aware packing in the daemon.

## What lands

### Test harness (`tools/test_m2m_stream.c`)
- Per-frame microsecond timing via
  `clock_gettime(CLOCK_MONOTONIC)`.  Final report: mean /
  p50 / p99 / min / max per-frame microseconds + wall ms +
  fps.
- Annex-B H.264 parser: split bitstream on
  3- or 4-byte start codes, accumulate NALs into access
  units (push when we see a VCL NAL — type 1 or 5).
  Without access-unit grouping, FFmpeg's H.264 decoder
  rejects SPS-only or PPS-only buffers as "no frame!".
- Format auto-detection: IVF (DKIF magic) → `parse_ivf`;
  anything else → `parse_annexb`.  Non-IVF input requires
  explicit `[w] [h]` since framing carries no dimensions.
- New optional 6th argument `[capture]`:
  `nv12m` (default, 8-bit, 2 planes) or
  `p010` (10-bit, 1 plane).
- CAPTURE mmap path generalised to handle
  `num_planes == 1` (P010) — previously hardcoded to 2.
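
The format auto-detection can be sketched as below. This is a minimal illustration, not the harness's actual code: `detect_container` and `ivf_dimensions` are hypothetical names, and the only facts relied on are the `DKIF` signature mentioned above and the standard 32-byte IVF file header, which stores width and height as little-endian 16-bit fields at byte offsets 12 and 14. That header is why IVF input needs no explicit `[w] [h]`, while Annex-B input does.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum container { CONTAINER_IVF, CONTAINER_ANNEXB };

/* IVF files begin with the 4-byte signature "DKIF"; everything else
 * is treated as Annex-B, mirroring the fallback described above. */
static enum container detect_container(const uint8_t *buf, size_t len)
{
    if (len >= 4 && memcmp(buf, "DKIF", 4) == 0)
        return CONTAINER_IVF;
    return CONTAINER_ANNEXB;
}

/* Read width/height from the 32-byte IVF file header (little-endian
 * 16-bit fields at byte offsets 12 and 14).  Returns 0 on success,
 * -1 if the buffer is too short or is not IVF. */
static int ivf_dimensions(const uint8_t *buf, size_t len,
                          unsigned *w, unsigned *h)
{
    if (len < 32 || detect_container(buf, len) != CONTAINER_IVF)
        return -1;
    *w = (unsigned)buf[12] | ((unsigned)buf[13] << 8);
    *h = (unsigned)buf[14] | ((unsigned)buf[15] << 8);
    return 0;
}
```

A caller would read the first 32 bytes of the input, call `detect_container`, and fall back to the command-line dimensions whenever the result is `CONTAINER_ANNEXB`.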

### Kernel (`kernel/daedalus_v4l2_main.c`)
- CAPTURE formats array: `{ NV12M, P010 }`, with
  `daedalus_is_supported_capture` matching the OUTPUT-side
  helper.
- `enum_fmt` on CAPTURE walks the array (2 entries).
- `daedalus_fill_capture_fmt` takes a fourcc:
  - NV12M: 2 planes, plane[0]=W*H, plane[1]=W*H/2,
    bytesperline=W.
  - P010: 1 plane, bytesperline = W*2, sizeimage = W*H*3
    (Y: W*2 bytes per row × H rows = W*H*2; interleaved
    CbCr: W*2 bytes per chroma row × H/2 rows = W*H).
- `try_fmt` for CAPTURE preserves caller fourcc when
  supported, falls back to NV12M default otherwise.
- `daedalus_complete_resp_frame` refactored: the dmabuf
  path (pixels_len == 0) now sets each plane's payload to
  `vb2_plane_size(vb, p)` — the daemon fully populated the
  plane, so payload = sizeimage.  Generalises cleanly to
  1-plane (P010) and 2-plane (NV12M) formats.
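
The plane-size arithmetic for the two CAPTURE formats can be checked with a small standalone sketch. The struct and function names here (`plane_layout`, `fill_capture_layout`) are hypothetical and do not mirror the kernel module's types; the sketch also assumes even dimensions and no alignment padding beyond the visible width.

```c
#include <stdint.h>

#define MAX_PLANES 2

enum capture_kind { CAP_NV12M, CAP_P010 };

struct plane_layout {
    unsigned num_planes;
    uint32_t bytesperline[MAX_PLANES];
    uint32_t sizeimage[MAX_PLANES];
};

/* Fill plane count, stride and size for a W x H 4:2:0 frame. */
static void fill_capture_layout(enum capture_kind kind,
                                uint32_t w, uint32_t h,
                                struct plane_layout *out)
{
    *out = (struct plane_layout){0};
    if (kind == CAP_P010) {
        /* One plane: 16-bit luma (2*W*H bytes) followed by
         * interleaved 16-bit CbCr (2*W bytes x H/2 rows = W*H). */
        out->num_planes = 1;
        out->bytesperline[0] = w * 2;
        out->sizeimage[0] = w * h * 2 + w * h;
    } else {
        /* NV12M: 8-bit luma plane plus half-height CbCr plane. */
        out->num_planes = 2;
        out->bytesperline[0] = w;
        out->sizeimage[0] = w * h;
        out->bytesperline[1] = w;
        out->sizeimage[1] = w * h / 2;
    }
}
```

For 1920×1080 this yields a single 6220800-byte plane for P010, which matches the `sizeimage=[6220800,0]` line the test harness prints.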

### Daemon (`daemon/src/decoder.c`)
- `pack_p010_to_plane` — packs YUV420P10LE into P010
  single-plane layout: Y plane (16-bit samples, MSB-aligned
  10-bit data, low 6 bits zero) at base+0, interleaved
  CbCr at base+(Y plane size).  Strips source stride
  padding from `fr->linesize[*]`; respects destination
  stride from `planes->stride[0]`.
- `daedalus_decoder_run_request` dispatches on
  `req->capture_pix_fmt`:
  - `V4L2_PIX_FMT_NV12M` → `pack_nv12_to_planes`
  - `V4L2_PIX_FMT_P010`  → `pack_p010_to_plane`
  - else → warn + skip pack (decoder still reports the
    frame metadata).
- Includes `<linux/videodev2.h>` for the fourcc constants.
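
As an illustration of the packing step, the sketch below converts planar 10-bit 4:2:0 samples (little-endian, value in the low 10 bits) into the P010 layout described above. It is a simplified stand-in, not the daemon's `pack_p010_to_plane`: the name `pack_p010_sketch` and its signature are hypothetical, strides are in bytes, and the chroma rows are assumed to use the same destination stride as the luma rows.

```c
#include <stddef.h>
#include <stdint.h>

static inline uint16_t rd16le(const uint8_t *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

static inline void wr16le(uint8_t *p, uint16_t v)
{
    p[0] = (uint8_t)(v & 0xff);
    p[1] = (uint8_t)(v >> 8);
}

/* src[0..2] = Y, Cb, Cr planes; src_stride[] and dst_stride in bytes. */
static void pack_p010_sketch(uint8_t *dst, size_t dst_stride,
                             const uint8_t *const src[3],
                             const size_t src_stride[3],
                             unsigned w, unsigned h)
{
    /* Luma: shift each 10-bit sample into the top bits of 16. */
    for (unsigned y = 0; y < h; y++) {
        const uint8_t *s = src[0] + y * src_stride[0];
        uint8_t *d = dst + y * dst_stride;
        for (unsigned x = 0; x < w; x++)
            wr16le(d + 2 * x, (uint16_t)(rd16le(s + 2 * x) << 6));
    }

    /* Chroma: starts right after the luma rows, Cb/Cr interleaved. */
    uint8_t *dst_c = dst + (size_t)h * dst_stride;
    for (unsigned y = 0; y < h / 2; y++) {
        const uint8_t *su = src[1] + y * src_stride[1];
        const uint8_t *sv = src[2] + y * src_stride[2];
        uint8_t *d = dst_c + y * dst_stride;
        for (unsigned x = 0; x < w / 2; x++) {
            wr16le(d + 4 * x,     (uint16_t)(rd16le(su + 2 * x) << 6));
            wr16le(d + 4 * x + 2, (uint16_t)(rd16le(sv + 2 * x) << 6));
        }
    }
}
```

Reading and writing through byte-wise helpers keeps the sketch independent of host endianness and pointer alignment; a production path would more likely operate on `uint16_t` rows directly or use SIMD.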

## Verification

All measurements on hertz (Pi 5, 6.12.75+rpt-rpi-2712).

### 1080p throughput baseline — 30fps target met across the board

30-frame `testsrc` at 1920×1080, decoded via the V4L2 m2m
+ dmabuf path; per-frame µs measured from QBUF OUTPUT to
the `fwrite` of the decoded frame returning.

| Codec | Mean | p50 | p99 | fps | byte-exact vs ffmpeg |
|-------|------|-----|-----|-----|----------------------|
| VP9   | 12.0 ms | 11.8 ms | 15.9 ms | **83.1** | ✓ |
| AV1   | 15.4 ms | 14.3 ms | 41.0 ms | **65.0** | ✓ |
| H.264 | 11.3 ms | 10.5 ms | 21.5 ms | **88.3** | ✓ |

The `30fps-floor-is-fine` memory's user-facing criterion is
"daily YouTube playback with CPU free for vscode."  At
65-88 fps single-threaded the daemon is so far above the
floor that real-world content has comfortable headroom for
the rest of the desktop.

### HDR / 10-bit P010 — byte-exact + still real-time

```
$ ffmpeg -f lavfi -i 'testsrc=duration=0.4:size=1920x1080:rate=25' \
       -pix_fmt yuv420p10le -c:v libvpx-vp9 -cpu-used 8 \
       -y vp9_10bit_1080.ivf
$ ffmpeg -i vp9_10bit_1080.ivf -pix_fmt p010le -f rawvideo \
       -y vp9_10bit_1080_ref.p010

$ sudo ./tools/test_m2m_stream \
       vp9_10bit_1080.ivf \
       vp9_10bit_1080_out.p010 \
       1920 1080 vp9 p010
  parsed 10 frames, 1920x1080
  CAPTURE fmt=P010 planes=1 sizeimage=[6220800,0]
  decoded 10 / 10 frames
  perf: mean=20.5ms p50=19.0ms p99=28.0ms  | fps=48.8

$ cmp vp9_10bit_1080_out.p010 vp9_10bit_1080_ref.p010
0   # 62 MB across 10 frames, byte-for-byte match
```

The 10-bit path is ~50fps@1080p — still above the 30fps
target.  The overhead vs 8-bit comes from the
shift-left-by-6 plus the wider memory writes (16-bit per
sample); both are inherent to the format.

The smaller 320×240 P010 test ran at **966 fps** — the
fixed daemon-side overhead dominates at small resolutions.

### v4l2-compliance — unchanged from 8.7

```
Total for daedalus_v4l2 device /dev/video0: 49, Succeeded: 49,
Failed: 0, Warnings: 0
```

Compliance was already complete after 8.7; the added P010
format passes through the same MMAP / DMABUF / REQBUFS /
EXPBUF tests cleanly.

### Format enumeration

```
$ v4l2-ctl -d /dev/video0 --list-formats
  [0]: 'NM12' (Y/UV 4:2:0 (N-C))
  [1]: 'P010' (10-bit Y/UV 4:2:0)
```

### Clean teardown

```
$ pkill -TERM daedalus_v4l2_daemon
$ sudo rmmod daedalus_v4l2
$ sudo dmesg | grep -E 'BUG|oops'
(empty)
```

## Design decisions

### Why measure before substituting QPU kernels?

The Phase 8.8 roadmap entry was "profile + dispatch QPU
kernels for hot paths."  The unstated assumption was
"FFmpeg software decode is too slow at 1080p."  Measurement
falsified the assumption — Cortex-A76 ARM has enough
single-thread throughput that libvpx-vp9 / libdav1d /
libavcodec H.264 all clear 30fps@1080p without help.

QPU substitution still has value for:
- Higher resolutions (4K),
- Higher frame rates (60fps+),
- Lower-power CPUs (Pi 5 is competitive; older Pis aren't),
- Power efficiency at any throughput.

Per `feedback_correctness_before_speed`: measure, then
optimize what's actually slow.  The QPU work is still in
the roadmap but it's no longer urgent — it's an
optimization phase, not a feature phase.

### Why P010 (single plane) and not P010M (multi plane)?

The kernel uABI only defines `V4L2_PIX_FMT_P010`
(single plane, fourcc 'P010').  There is no `P010M`
constant in v6.12 headers.  Single plane works fine for
our purposes — the daemon's dmabuf path gets one fd, one
mmap, and the Y/CbCr layout is fixed by the format spec.

If a future userspace ever needs separate Y and CbCr
buffers we could define our own `V4L2_PIX_FMT_P010M`-
shaped layout, but that would diverge from the standard
fourcc and is hard to motivate without an actual consumer.

### Why the Annex-B parser accumulates access units

The V4L2 stateless H.264 API expects each OUTPUT buffer to
carry already-parsed slice data (a single slice per buffer
in slice-based decode mode), described by the accompanying
SLICE_PARAMS controls.  Our daemon doesn't use those
controls; it just passes bytes to FFmpeg, which re-parses
them.  FFmpeg's H.264 decoder rejects "SPS-only" or
"PPS-only" buffers as "no frame!", so splitting on every
start code fails.

Solution: accumulate NALs into access units.  An AU
contains zero or more non-VCL NALs (SPS/PPS/SEI/AUD)
followed by one VCL NAL (`nal_unit_type` 1 or 5, i.e. a
non-IDR or IDR slice).  We push each completed AU as one
OUTPUT buffer.  This works for any H.264 Annex-B stream in
which every frame is coded as a single slice (our ultrafast
baseline x264 encode), which is the common case for the
test harness; a multi-slice frame would be split across
buffers.
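
The grouping rule can be sketched as follows. This is an illustrative reimplementation under the assumptions stated above, not the harness's `parse_annexb`: the helper names are hypothetical, the input is assumed to be a complete in-memory Annex-B buffer, and an access unit is closed at the first NAL whose type is 1 or 5.

```c
#include <stddef.h>
#include <stdint.h>

/* Find the next 00 00 01 pattern at or after `from`; returns its
 * offset, or `len` if none.  A 4-byte start code is a zero byte
 * followed by this pattern. */
static size_t find_start_code(const uint8_t *b, size_t len, size_t from)
{
    for (size_t i = from; i + 3 <= len; i++)
        if (b[i] == 0 && b[i + 1] == 0 && b[i + 2] == 1)
            return i;
    return len;
}

/* Locate the next access unit starting at *pos.  On success returns 1,
 * sets *au_off / *au_len, and advances *pos past the unit.  Returns 0
 * when no further VCL NAL exists. */
static int next_access_unit(const uint8_t *b, size_t len, size_t *pos,
                            size_t *au_off, size_t *au_len)
{
    size_t start = *pos;
    size_t sc = find_start_code(b, len, start);

    while (sc < len) {
        size_t hdr = sc + 3;                 /* NAL header byte */
        if (hdr >= len)
            break;
        unsigned type = b[hdr] & 0x1f;       /* nal_unit_type */
        size_t next = find_start_code(b, len, hdr + 1);
        /* Leave the leading zero of a 4-byte start code with the
         * NAL that follows it. */
        size_t end = (next < len && next > hdr + 1 && b[next - 1] == 0)
                         ? next - 1 : next;
        if (type == 1 || type == 5) {        /* VCL NAL closes the AU */
            *au_off = start;
            *au_len = end - start;
            *pos = end;
            return 1;
        }
        sc = next;
    }
    return 0;
}
```

Calling `next_access_unit` in a loop with `pos` initialised to 0 yields one `(offset, length)` pair per OUTPUT buffer; SPS and PPS NALs ride along with the slice that follows them instead of being submitted on their own.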

### Per-frame timing measures the full QBUF→DQBUF cycle

The reported "mean=12ms" includes:
1. memcpy bitstream into OUTPUT MMAP plane
2. VIDIOC_QBUF
3. poll() — blocks until CAPTURE ready
4. VIDIOC_DQBUF OUTPUT
5. VIDIOC_DQBUF CAPTURE
6. fwrite NV12 (or P010) plane(s) to output file
7. VIDIOC_QBUF CAPTURE recycle

The actual decode wallclock is somewhere inside (3); the
rest is measurement overhead that a real consumer
(libva-v4l2-request) wouldn't pay (no fwrite, fewer
ioctls per frame with pipelining).  So the reported fps
is a **conservative lower bound** on what the daemon can
sustain.
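
The mean / p50 / p99 / fps figures can be derived from the per-frame samples roughly as shown here. This is a sketch under assumptions: `frame_stats` and `summarize` are hypothetical names, the percentile uses the nearest-rank method (the harness may use a different definition), and fps is taken as frame count divided by total wall time, with each sample being the difference of two `clock_gettime(CLOCK_MONOTONIC)` readings converted to microseconds.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

struct frame_stats {
    double   mean_us;
    uint64_t p50_us, p99_us, min_us, max_us;
    double   fps;
};

static int cmp_u64(const void *a, const void *b)
{
    uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;
    return (x > y) - (x < y);
}

/* Nearest-rank percentile on a sorted array: element ceil(n*p/100). */
static uint64_t percentile(const uint64_t *sorted, size_t n, unsigned p)
{
    size_t rank = (n * p + 99) / 100;
    if (rank == 0)
        rank = 1;
    return sorted[rank - 1];
}

/* Sorts `us` in place and summarises it; wall_us is the elapsed wall
 * time for the whole run in microseconds. */
static struct frame_stats summarize(uint64_t *us, size_t n, uint64_t wall_us)
{
    struct frame_stats s = {0};
    if (n == 0)
        return s;

    qsort(us, n, sizeof *us, cmp_u64);

    uint64_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += us[i];

    s.mean_us = (double)sum / (double)n;
    s.p50_us = percentile(us, n, 50);
    s.p99_us = percentile(us, n, 99);
    s.min_us = us[0];
    s.max_us = us[n - 1];
    s.fps = wall_us ? (double)n * 1e6 / (double)wall_us : 0.0;
    return s;
}
```

With only 10 to 30 samples per run, the nearest-rank p99 is simply the slowest frame, so the p99 column is best read as a worst-case figure for these short streams.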

## What's NOT here (deferred)

- **QPU dispatch substitution.**  Not needed for 30fps@1080p
  (proven by measurement).  Stays on the roadmap for
  higher-throughput / lower-power scenarios.
- **libva-v4l2-request consumer integration.**  Per
  `project_consumer_target` memory this is the actual end
  point — what the daemon's V4L2 stateless API was built
  to feed.  Phase 8.9+ work; would close the loop from
  YouTube → Firefox → libva → /dev/video0 → daemon.
- **Multi-frame HDR tests for AV1/H.264.**  Phase 8.8's
  P010 test is VP9 only.  Adding AV1+H.264 multi-frame
  HDR streams is straightforward (encoder already supports
  yuv420p10le) but didn't fit the current phase scope.
- **>1080p resolutions.**  No 4K stream tests.  The
  protocol/code paths are size-agnostic; only the test
  harness needs bigger inputs.

## Phase 8.9 plan

1. libva-v4l2-request integration — the actual consumer
   that closes the project's user-facing loop (per
   `project_consumer_target`).  Patch the
   library to recognise our driver via media controller,
   wire codec parsing to feed our OUTPUT buffers.
2. End-to-end test: Firefox → libva → /dev/video0 →
   daemon → on-screen frame.
3. Stress: long-form (60s+) playback with proper buffer
   recycling timing.
4. Multi-frame HDR tests for AV1 + H.264.

After 8.9 the project's user-facing goal is hit; the
remaining sub-phases (QPU substitution, 4K, encoders) are
optimisation work that ships when motivated.