daedalus-v4l2/docs/roadmap.md
marfrit 1ae9528e76 Phase 8.8: throughput baseline + multi-codec streams + HDR
Per the correctness-before-speed principle: measure before
optimising. Roadmap going in said "QPU dispatch substitution
to hit 30fps@1080p". Measurement on hertz shows the FFmpeg
software path already hits 65-88 fps@1080p across all three
codecs — QPU substitution would be premature optimisation.

So 8.8 ships what's actually useful:
1. Per-frame timing in test_m2m_stream.
2. Multi-frame AV1 + H.264 streams verified byte-exact at
   1080p (closes the "VP9-only stream tests" gap from 8.7).
3. HDR / 10-bit via V4L2_PIX_FMT_P010 + daemon
   pack_p010_to_plane.

Test harness (tools/test_m2m_stream.c):
- Per-frame µs timing via CLOCK_MONOTONIC; reports mean/p50/
  p99/min/max + wall ms + fps.
- Annex-B H.264 parser: split on 3-/4-byte start codes,
  accumulate NALs into access units (push on VCL NAL types
  1 or 5). Without AU grouping FFmpeg rejects SPS/PPS-only
  buffers as "no frame!".
- Format auto-detect (DKIF magic → IVF; else Annex-B).
- Optional 6th arg `[capture]`: nv12m | p010.
- CAPTURE mmap path generalised for num_planes==1 (P010).
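
A minimal sketch of the AU grouping rule described above: split on 3-/4-byte start codes and close an access unit on each VCL NAL (type 1 or 5). The names `au_span`, `find_start_code` and `split_access_units` are hypothetical, not taken from tools/test_m2m_stream.c, and the sketch assumes one slice per picture.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct au_span { size_t off, len; };   /* hypothetical: byte range of one AU */

/* Find the next start code at or after `from`. Returns its offset (or `len`
 * if none) and stores 3 or 4 in *sc_len; a zero byte directly before
 * 00 00 01 is treated as part of a 4-byte code. */
static size_t find_start_code(const uint8_t *b, size_t len, size_t from,
                              size_t *sc_len)
{
    for (size_t i = from; i + 3 <= len; i++) {
        if (b[i] == 0 && b[i + 1] == 0 && b[i + 2] == 1) {
            if (i > from && b[i - 1] == 0) { *sc_len = 4; return i - 1; }
            *sc_len = 3;
            return i;
        }
    }
    *sc_len = 0;
    return len;
}

/* Group NALs into access units: non-VCL NALs (SPS, PPS, SEI, ...) are
 * accumulated and the AU is pushed when a VCL NAL (type 1 or 5) ends.
 * Trailing non-VCL NALs with no VCL NAL after them are not emitted. */
size_t split_access_units(const uint8_t *b, size_t len,
                          struct au_span *out, size_t max_out)
{
    assert(b != NULL && out != NULL);
    size_t count = 0, sc_len = 0;
    size_t sc = find_start_code(b, len, 0, &sc_len);
    size_t au_start = sc;

    while (sc < len && count < max_out) {
        size_t payload = sc + sc_len;          /* first byte after start code */
        size_t next_len = 0;
        size_t next = find_start_code(b, len, payload, &next_len);

        if (payload < len) {
            unsigned nal_type = b[payload] & 0x1f;
            if (nal_type == 1 || nal_type == 5) {
                out[count].off = au_start;
                out[count].len = next - au_start;
                count++;
                au_start = next;
            }
        }
        sc = next;
        sc_len = next_len;
    }
    return count;
}
```

Each returned span starts at the start code of the first NAL in the AU, so SPS/PPS travel in the same buffer as the IDR slice that follows them.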

Kernel (kernel/daedalus_v4l2_main.c):
- CAPTURE formats array {NV12M, P010}; enum_fmt walks it.
- daedalus_fill_capture_fmt takes a fourcc:
    NV12M: 2 planes, W*H + W*H/2 bytes, bpl=W
    P010:  1 plane,  W*H*2 + W*H bytes, bpl=W*2
- try_fmt preserves caller fourcc when supported.
- daedalus_complete_resp_frame's dmabuf path now sets each
  plane's payload to vb2_plane_size(vb,p) — generalises
  cleanly across 1-plane (P010) and 2-plane (NV12M) layouts;
  the daemon fully populates the plane so payload =
  sizeimage.

Daemon (daemon/src/decoder.c):
- pack_p010_to_plane: YUV420P10LE → P010 single-plane.
  10-bit samples shifted left by 6 to MSB-align in 16-bit
  words per V4L2 ABI. Y at base+0, interleaved CbCr right
  after Y plane (per format spec for single-plane P010).
  Strips source stride padding; respects destination stride.
- daedalus_decoder_run_request dispatches on
  req->capture_pix_fmt (NV12M → pack_nv12_to_planes; P010
  → pack_p010_to_plane; else warn + skip).
- Includes <linux/videodev2.h> for fourcc constants.
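
A sketch of the YUV420P10LE → P010 repack described above, under stated assumptions: strides are counted in 16-bit samples rather than bytes, the host is little-endian (as on the Pi 5), and width/height are even. `pack_p010_sketch` is a hypothetical name; the daemon's pack_p010_to_plane may differ in signature and details.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Repack planar 10-bit 4:2:0 (separate Y, U, V planes, value in the low
 * 10 bits) into single-plane P010 (value in the high 10 bits, CbCr
 * interleaved directly after the Y rows). Source padding beyond `w`
 * samples per row is skipped; rows are written at `dst_stride`. */
void pack_p010_sketch(uint16_t *dst, size_t dst_stride,
                      const uint16_t *y, size_t y_stride,
                      const uint16_t *u, size_t u_stride,
                      const uint16_t *v, size_t v_stride,
                      size_t w, size_t h)
{
    assert(w % 2 == 0 && h % 2 == 0);   /* 4:2:0 needs even dimensions */

    /* Luma: shift left by 6 to MSB-align the 10-bit sample. */
    for (size_t r = 0; r < h; r++)
        for (size_t c = 0; c < w; c++)
            dst[r * dst_stride + c] = (uint16_t)(y[r * y_stride + c] << 6);

    /* Chroma: starts right after the h luma rows, Cb first then Cr. */
    uint16_t *uv = dst + h * dst_stride;
    for (size_t r = 0; r < h / 2; r++) {
        for (size_t c = 0; c < w / 2; c++) {
            uv[r * dst_stride + 2 * c]     = (uint16_t)(u[r * u_stride + c] << 6);
            uv[r * dst_stride + 2 * c + 1] = (uint16_t)(v[r * v_stride + c] << 6);
        }
    }
}
```

With the shift, the 10-bit maximum 0x3FF becomes 0xFFC0 in the output word.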

Verification on hertz (Pi 5, 6.12.75+rpt-rpi-2712):

1080p throughput baseline (30 frames testsrc, dmabuf path):

  VP9   1080p:  mean 12.0 ms,  p99 15.9 ms,  fps **83.1**, byte-exact ✓
  AV1   1080p:  mean 15.4 ms,  p99 41.0 ms,  fps **65.0**, byte-exact ✓
  H.264 1080p:  mean 11.3 ms,  p99 21.5 ms,  fps **88.3**, byte-exact ✓

All three are 2.2-2.9× over the 30fps-floor-is-fine criterion.
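
Figures of this kind can be derived from per-frame µs samples roughly as follows. This is a sketch, not the harness code: `frame_stats`, `now_us` and `summarize` are hypothetical names, percentiles use the nearest-rank method, and fps is taken from wall time; all three are assumptions.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <time.h>

struct frame_stats {
    uint64_t min_us, max_us, p50_us, p99_us;
    double mean_us, fps;
};

/* Monotonic timestamp in microseconds. */
uint64_t now_us(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000u + (uint64_t)ts.tv_nsec / 1000u;
}

static int cmp_u64(const void *a, const void *b)
{
    uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;
    return (x > y) - (x < y);
}

/* Nearest-rank percentile on a sorted array: rank = ceil(n * p / 100). */
static uint64_t percentile(const uint64_t *sorted, size_t n, unsigned p)
{
    size_t rank = (n * p + 99) / 100;
    if (rank == 0)
        rank = 1;
    return sorted[rank - 1];
}

/* Sorts `us` in place and summarises it; wall_us is the total elapsed time. */
struct frame_stats summarize(uint64_t *us, size_t n, uint64_t wall_us)
{
    assert(us != NULL && n > 0 && wall_us > 0);
    struct frame_stats s;
    uint64_t sum = 0;

    qsort(us, n, sizeof *us, cmp_u64);
    for (size_t i = 0; i < n; i++)
        sum += us[i];

    s.min_us = us[0];
    s.max_us = us[n - 1];
    s.p50_us = percentile(us, n, 50);
    s.p99_us = percentile(us, n, 99);
    s.mean_us = (double)sum / (double)n;
    s.fps = (double)n * 1e6 / (double)wall_us;
    return s;
}
```

With only 30 samples, nearest-rank p99 is the same value as max, so the two are worth reading together at this sample count.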

HDR / 10-bit 1080p P010:
  10 frames, 62 MB output, fps **48.8**, byte-exact vs
  `ffmpeg -pix_fmt p010le -f rawvideo`.

Small-frame P010 (320×240): fps 966 — fixed daemon overhead
dominates at low resolutions.

v4l2-compliance unchanged from 8.7: 49/49 passing.
Format enumeration confirms NM12 (the NV12M fourcc) + P010 on CAPTURE.

Clean SIGTERM + rmmod; no kernel oops/WARN.

Roadmap update (docs/roadmap.md):
- 8.8 marked closed with closure-doc reference, including
  the explicit "QPU substitution not needed" rationale.
- 8.9 reshaped: libva-v4l2-request consumer integration
  (per project_consumer_target memory) — the actual
  user-facing endpoint.

Per correctness-before-speed:
- Measured first; QPU work explicitly justified-out via data.
- Byte-exact pixel comparison for every codec/format combo
  (NV12: VP9, AV1, H.264; P010: VP9 10-bit at 320×240 and
  1080p).
- AU grouping in the Annex-B parser is the correct
  semantic boundary, not just a workaround.
- vb2_plane_size for payload generalises to any plane
  count, not hardcoded to 2.

Phase 8.9 next: libva-v4l2-request integration — close
the loop from YouTube/Firefox to /dev/video0 + daemon
playback.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 16:34:05 +00:00

# daedalus-v4l2 — roadmap
## Sub-phases
### Phase 8.1 — kernel module skeleton
Out-of-tree kernel module that:
- Registers `/dev/videoNN` with `VFL_TYPE_VIDEO` + a no-op
V4L2 stateless dispatch table.
- Accepts open/close, S_FMT, REQBUFS ioctls without doing
anything (yet).
- Builds against `/lib/modules/$(uname -r)/build`.
Deliverable: `modprobe daedalus_v4l2` works, `v4l2-ctl --list-devices`
shows the new device.
### Phase 8.2 — kernel ↔ daemon chardev bridge
- Kernel module creates `/dev/daedalus-v4l2` chardev.
- Defines a simple req/resp protocol in `include/daedalus_v4l2_proto.h`.
- Daemon connects, exchanges echo requests.
Deliverable: ping-pong test passes.
### Phase 8.3 — daemon FFmpeg dlopen + parse
- Daemon links `libdaedalus_core.a` from sibling.
- Daemon dlopens FFmpeg.
- Test program: feed a VP9 IVF file to FFmpeg parsers,
extract block-level metadata, validate against expected.
Deliverable: daemon can parse a VP9 frame and walk the
block-level info.
### Phase 8.4 — daemon ↔ kernel decode round-trip (closed 2026-05-18)
Shipped as a debugfs-triggered chardev round-trip rather than
the original V4L2-ioctl plan (which moved to Phase 8.5).
- REQ_DECODE / RESP_FRAME wire protocol
- Daemon decodes VP9 via FFmpeg dlopen, returns FNV-1a digest
- Verified content-dependent + deterministic; structured
error handling for bad bitstreams
See `docs/phase_8_4_closure.md`.
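
For reference, a minimal FNV-1a sketch of the kind of digest mentioned above. The 64-bit variant is an assumption (the roadmap does not state the width), and `fnv1a64` is a hypothetical name rather than the daemon's function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* FNV-1a, 64-bit: XOR each byte into the hash, then multiply by the prime. */
uint64_t fnv1a64(const void *data, size_t len)
{
    assert(data != NULL || len == 0);
    const uint8_t *p = data;
    uint64_t h = 0xcbf29ce484222325ULL;      /* FNV offset basis */

    for (size_t i = 0; i < len; i++) {
        h ^= p[i];
        h *= 0x100000001b3ULL;               /* FNV prime */
    }
    return h;
}
```

Hashing the decoded frame bytes this way gives a value that changes with the content and is stable across runs, which is what the "content-dependent + deterministic" check relies on.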
### Phase 8.5 — full V4L2 m2m driver (closed 2026-05-18)
Real V4L2 m2m driver — userspace clients drive
`S_FMT`/`REQBUFS`/`QBUF`/`DQBUF` the standard way. Bitstream
flows kernel→daemon as inline REQ_DECODE payload; decoded NV12
pixels flow daemon→kernel as inline RESP_FRAME payload. Works
end-to-end for small frames (≤ ~64 KiB NV12).
Deliverable hit: kernel m2m driver passes most v4l2-compliance
checks; `tools/test_m2m_decode` produces an NV12 frame that's
byte-for-byte identical to `ffmpeg -pix_fmt nv12` reference.
See `docs/phase_8_5_closure.md`.
### Phase 8.6 — dmabuf + AV1 + H.264 + stateless controls (closed 2026-05-18)
- CAPTURE (and OUTPUT) on `vb2_dma_contig_memops`.
- New `DAEDALUS_IOC_GET_DMABUF` chardev ioctl — daemon
mmaps the in-flight CAPTURE buffer, decodes pixels in
place, sends RESP_FRAME metadata-only.
- 64 KiB frame-size cap removed. 1080p VP9 + 128×96 AV1
+ 128×96 H.264 all byte-exact against reference FFmpeg
decode.
- V4L2 stateless controls registered for VP9 / AV1 /
H.264 (11 controls visible to userspace).
- Colorspace round-trip fix (TRY_FMT preserve, S_FMT
OUTPUT→CAPTURE propagation).
- Cookie unified across V4L2 + debugfs paths.
- v4l2-compliance: 47/48 (only DECODER_CMD remains,
needs media controller — moved to 8.7).
See `docs/phase_8_6_closure.md`.
### Phase 8.7 — media controller + multi-frame streaming (closed 2026-05-18)
- Media controller bound via
`v4l2_m2m_register_media_controller` +
`media_device_register`; `/dev/mediaN` published.
- `tools/test_m2m_stream` parses IVF and pushes frames
sequentially through a 4-deep buffer ring; daemon
AVCodecContext preserves reference frames across calls.
- 30-frame VP9 320×240 stream byte-exact (3.46 MB across
1 keyframe + 29 P-frames).
- 10-frame VP9 1080p stream byte-exact (31 MB across
10 frames at full HD).
- v4l2-compliance: **49/49 passing** (was 47/48 in 8.6;
media controller added a 49th test and closed DECODER_CMD).
See `docs/phase_8_7_closure.md`.
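
A sketch of the IVF walk that a stream test of this kind performs, assuming the usual IVF layout (32-byte file header starting with `DKIF`, then a 12-byte header per frame holding a little-endian size and timestamp). `ivf_header`, `ivf_parse_header` and `ivf_next_frame` are hypothetical names, not the ones in `tools/test_m2m_stream`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

static uint16_t rd16(const uint8_t *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

static uint32_t rd32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

struct ivf_header {
    char fourcc[5];                 /* codec fourcc, NUL-terminated */
    uint16_t width, height, header_len;
};

/* Returns 0 if the buffer starts with a valid IVF header, -1 otherwise. */
int ivf_parse_header(const uint8_t *b, size_t len, struct ivf_header *h)
{
    assert(b != NULL && h != NULL);
    if (len < 32 || memcmp(b, "DKIF", 4) != 0)
        return -1;
    h->header_len = rd16(b + 6);
    memcpy(h->fourcc, b + 8, 4);
    h->fourcc[4] = '\0';
    h->width = rd16(b + 12);
    h->height = rd16(b + 14);
    return 0;
}

/* Reads the frame record at *pos and advances *pos past it.
 * Returns 0 on success, -1 at end of data or on a truncated record. */
int ivf_next_frame(const uint8_t *b, size_t len, size_t *pos,
                   const uint8_t **payload, uint32_t *size)
{
    if (*pos > len || len - *pos < 12)
        return -1;
    uint32_t sz = rd32(b + *pos);
    if (len - *pos - 12 < sz)
        return -1;
    *payload = b + *pos + 12;
    *size = sz;
    *pos += 12 + (size_t)sz;
    return 0;
}
```

Starting with `pos = header_len` and calling `ivf_next_frame` until it returns -1 yields the frames in file order, one per OUTPUT buffer.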
### Phase 8.8 — throughput baseline + multi-codec streams + HDR (closed 2026-05-18)
- Per-frame µs timing in test_m2m_stream; multi-codec
  baseline:
  - VP9 1080p: 83.1 fps
  - AV1 1080p: 65.0 fps
  - H.264 1080p: 88.3 fps

  All byte-exact vs ffmpeg reference; all 2-3× over the
  30fps-floor-is-fine criterion.
- QPU dispatch substitution explicitly **not needed** — measurement
shows the FFmpeg software path already clears the target on
Pi 5's Cortex-A76. Substitution moves to the
optimisation roadmap.
- Annex-B H.264 access-unit splitter in the test harness
(NALs grouped by VCL boundary).
- HDR / 10-bit: V4L2_PIX_FMT_P010 added as CAPTURE format;
daemon pack_p010_to_plane handles YUV420P10LE → P010
with MSB-aligned 10-bit data. 10-bit 1080p byte-exact
at 48.8 fps.
See `docs/phase_8_8_closure.md`.
### Phase 8.9 — libva-v4l2-request integration (the actual consumer)
1. Patch libva-v4l2-request to recognise our driver via the
media controller graph (the libva-v4l2-request-fourier
target named in the `project_consumer_target` memory).
2. End-to-end test: Firefox / mpv → libva → /dev/video0 →
daemon → on-screen frame.
3. Long-form (60s+) playback stress with buffer recycling.
4. Multi-frame HDR tests for AV1 + H.264.
After 8.9 the project's user-facing loop is closed. Optimisation
phases (QPU dispatch, 4K, encoders) ship when motivated.
## Effort estimate
Each phase: ~1 week of focused work (~40 hours).
Total: 7 weeks for v1.
Could be split across multiple sessions / contributors.