1ae9528e76
Per the correctness-before-speed principle: measure before
optimising. Roadmap going in said "QPU dispatch substitution
to hit 30fps@1080p". Measurement on hertz shows the FFmpeg
software path already hits 65-88 fps@1080p across all three
codecs — QPU substitution would be premature optimisation.
So 8.8 ships what's actually useful:
1. Per-frame timing in test_m2m_stream.
2. Multi-frame AV1 + H.264 streams verified byte-exact at
1080p (closes the "VP9-only stream tests" gap from 8.7).
3. HDR / 10-bit via V4L2_PIX_FMT_P010 + daemon
pack_p010_to_plane.
Test harness (tools/test_m2m_stream.c):
- Per-frame µs timing via CLOCK_MONOTONIC; reports mean/p50/
p99/min/max + wall ms + fps.
- Annex-B H.264 parser: split on 3-/4-byte start codes,
accumulate NALs into access units (push on VCL NAL types
1 or 5). Without AU grouping FFmpeg rejects SPS/PPS-only
buffers as "no frame!".
- Format auto-detect (DKIF magic → IVF; else Annex-B).
- Optional 6th arg `[capture]`: nv12m | p010.
- CAPTURE mmap path generalised for num_planes==1 (P010).
Kernel (kernel/daedalus_v4l2_main.c):
- CAPTURE formats array {NV12M, P010}; enum_fmt walks it.
- daedalus_fill_capture_fmt takes a fourcc:
NV12M: 2 planes, W*H + W*H/2 bytes, bpl=W
P010: 1 plane, W*H*2 + W*H bytes, bpl=W*2
- try_fmt preserves caller fourcc when supported.
- daedalus_complete_resp_frame's dmabuf path now sets each
plane's payload to vb2_plane_size(vb,p) — generalises
cleanly across 1-plane (P010) and 2-plane (NV12M) layouts;
the daemon fully populates the plane so payload =
sizeimage.
Daemon (daemon/src/decoder.c):
- pack_p010_to_plane: YUV420P10LE → P010 single-plane.
10-bit samples shifted left by 6 to MSB-align in 16-bit
words per V4L2 ABI. Y at base+0, interleaved CbCr right
after Y plane (per format spec for single-plane P010).
Strips source stride padding; respects destination stride.
- daedalus_decoder_run_request dispatches on
req->capture_pix_fmt (NV12M → pack_nv12_to_planes; P010
→ pack_p010_to_plane; else warn + skip).
- Includes <linux/videodev2.h> for fourcc constants.
Verification on hertz (Pi 5, 6.12.75+rpt-rpi-2712):
1080p throughput baseline (30 frames testsrc, dmabuf path):
VP9 1080p: mean 12.0 ms, p99 15.9 ms, fps **83.1**, byte-exact ✓
AV1 1080p: mean 15.4 ms, p99 41.0 ms, fps **65.0**, byte-exact ✓
H.264 1080p: mean 11.3 ms, p99 21.5 ms, fps **88.3**, byte-exact ✓
All 2-3× over the 30fps-floor-is-fine criterion.
HDR / 10-bit 1080p P010:
10 frames, 62 MB output, fps **48.8**, byte-exact vs
`ffmpeg -pix_fmt p010le -f rawvideo`.
Small-frame P010 (320×240): fps 966 — fixed daemon overhead
dominates at low resolutions.
v4l2-compliance unchanged from 8.7: 49/49 passing.
Format enumeration confirms NM12 + P010 on CAPTURE.
Clean SIGTERM + rmmod; no kernel oops/WARN.
Roadmap update (docs/roadmap.md):
- 8.8 marked closed with closure-doc reference, including
the explicit "QPU substitution not needed" rationale.
- 8.9 reshaped: libva-v4l2-request consumer integration
(per project_consumer_target memory) — the actual
user-facing endpoint.
Per correctness-before-speed:
- Measured first; QPU work explicitly justified-out via data.
- Byte-exact pixel comparison for every codec/format combo
(NV12: VP9, AV1, H.264; P010: VP9 10-bit at 320×240 and
1080p).
- AU grouping in the Annex-B parser is the correct
semantic boundary, not just a workaround.
- vb2_plane_size for payload generalises to any plane
count, not hardcoded to 2.
Phase 8.9 next: libva-v4l2-request integration — close
the loop from YouTube/Firefox to /dev/video0 + daemon
playback.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
139 lines
4.9 KiB
Markdown
139 lines
4.9 KiB
Markdown
# daedalus-v4l2 — roadmap
|
||
|
||
## Sub-phases
|
||
|
||
### Phase 8.1 — kernel module skeleton
|
||
|
||
Out-of-tree kernel module that:
|
||
- Registers `/dev/videoNN` with `VFL_TYPE_VIDEO` + a no-op
|
||
V4L2 stateless dispatch table.
|
||
- Accepts open/close, S_FMT, REQBUFS ioctls without doing
|
||
anything (yet).
|
||
- Builds against `/lib/modules/$(uname -r)/build`.
|
||
|
||
Deliverable: `modprobe daedalus_v4l2` works, `v4l2-ctl --list-devices`
|
||
shows the new device.
|
||
|
||
### Phase 8.2 — kernel ↔ daemon chardev bridge
|
||
|
||
- Kernel module creates `/dev/daedalus-v4l2` chardev.
|
||
- Defines a simple req/resp protocol in `include/daedalus_v4l2_proto.h`.
|
||
- Daemon connects, exchanges echo requests.
|
||
|
||
Deliverable: ping-pong test passes.
|
||
|
||
### Phase 8.3 — daemon FFmpeg dlopen + parse
|
||
|
||
- Daemon links `libdaedalus_core.a` from sibling.
|
||
- Daemon dlopens FFmpeg.
|
||
- Test program: feed a VP9 IVF file to FFmpeg parsers,
|
||
extract block-level metadata, validate against expected.
|
||
|
||
Deliverable: daemon can parse a VP9 frame and walk the
|
||
block-level info.
|
||
|
||
### Phase 8.4 — daemon ↔ kernel decode round-trip (closed 2026-05-18)
|
||
|
||
Shipped as a debugfs-triggered chardev round-trip rather than
|
||
the original V4L2-ioctl plan (which moved to Phase 8.5).
|
||
|
||
- REQ_DECODE / RESP_FRAME wire protocol
|
||
- Daemon decodes VP9 via FFmpeg dlopen, returns FNV-1a digest
|
||
- Verified content-dependent + deterministic; structured
|
||
error handling for bad bitstreams
|
||
|
||
See `docs/phase_8_4_closure.md`.
|
||
|
||
### Phase 8.5 — full V4L2 m2m driver (closed 2026-05-18)
|
||
|
||
Real V4L2 m2m driver — userspace clients drive
|
||
`S_FMT`/`REQBUFS`/`QBUF`/`DQBUF` the standard way. Bitstream
|
||
flows kernel→daemon as inline REQ_DECODE payload; decoded NV12
|
||
pixels flow daemon→kernel as inline RESP_FRAME payload. Works
|
||
end-to-end for small frames (≤ ~64 KiB NV12).
|
||
|
||
Deliverable hit: kernel m2m driver passes most v4l2-compliance
|
||
checks; `tools/test_m2m_decode` produces a NV12 frame that's
|
||
byte-for-byte identical to `ffmpeg -pix_fmt nv12` reference.
|
||
|
||
See `docs/phase_8_5_closure.md`.
|
||
|
||
### Phase 8.6 — dmabuf + AV1 + H.264 + stateless controls (closed 2026-05-18)
|
||
|
||
- CAPTURE (and OUTPUT) on `vb2_dma_contig_memops`.
|
||
- New `DAEDALUS_IOC_GET_DMABUF` chardev ioctl — daemon
|
||
mmaps the in-flight CAPTURE buffer, decodes pixels in
|
||
place, sends RESP_FRAME metadata-only.
|
||
- 64 KiB frame-size cap removed. 1080p VP9 + 128×96 AV1
|
||
+ 128×96 H.264 all byte-exact against reference FFmpeg
|
||
decode.
|
||
- V4L2 stateless controls registered for VP9 / AV1 /
|
||
H.264 (11 controls visible to userspace).
|
||
- Colorspace round-trip fix (TRY_FMT preserve, S_FMT
|
||
OUTPUT→CAPTURE propagation).
|
||
- Cookie unified across V4L2 + debugfs paths.
|
||
- v4l2-compliance: 47/48 (only DECODER_CMD remains,
|
||
needs media controller — moved to 8.7).
|
||
|
||
See `docs/phase_8_6_closure.md`.
|
||
|
||
### Phase 8.7 — media controller + multi-frame streaming (closed 2026-05-18)
|
||
|
||
- Media controller bound via
|
||
`v4l2_m2m_register_media_controller` +
|
||
`media_device_register`; `/dev/mediaN` published.
|
||
- `tools/test_m2m_stream` parses IVF and pushes frames
|
||
sequentially through a 4-deep buffer ring; daemon
|
||
AVCodecContext preserves reference frames across calls.
|
||
- 30-frame VP9 320×240 stream byte-exact (3.46 MB across
|
||
1 keyframe + 29 P-frames).
|
||
- 10-frame VP9 1080p stream byte-exact (31 MB across
|
||
10 frames at full HD).
|
||
- v4l2-compliance: **49/49 passing** (was 47/48 in 8.6;
|
||
media controller added a 49th test and closed DECODER_CMD).
|
||
|
||
See `docs/phase_8_7_closure.md`.
|
||
|
||
### Phase 8.8 — throughput baseline + multi-codec streams + HDR (closed 2026-05-18)
|
||
|
||
- Per-frame µs timing in test_m2m_stream; multi-codec
|
||
baseline:
|
||
- VP9 1080p: 83.1 fps
|
||
- AV1 1080p: 65.0 fps
|
||
- H.264 1080p: 88.3 fps
|
||
All byte-exact vs ffmpeg reference; all 2-3× over the
|
||
30fps-floor-is-fine criterion.
|
||
- QPU dispatch substitution explicitly **not needed** — measurement
|
||
shows the FFmpeg software path already clears the target on
|
||
Pi 5's Cortex-A76. Substitution moves to the
|
||
optimisation roadmap.
|
||
- Annex-B H.264 access-unit splitter in the test harness
|
||
(NALs grouped by VCL boundary).
|
||
- HDR / 10-bit: V4L2_PIX_FMT_P010 added as CAPTURE format;
|
||
daemon pack_p010_to_plane handles YUV420P10LE → P010
|
||
with MSB-aligned 10-bit data. 10-bit 1080p byte-exact
|
||
at 48.8 fps.
|
||
|
||
See `docs/phase_8_8_closure.md`.
|
||
|
||
### Phase 8.9 — libva-v4l2-request integration (the actual consumer)
|
||
|
||
1. Patch libva-v4l2-request to recognise our driver via the
|
||
media controller graph (the
|
||
`project_consumer_target` memory's libva-v4l2-request-fourier
|
||
target).
|
||
2. End-to-end test: Firefox / mpv → libva → /dev/video0 →
|
||
daemon → on-screen frame.
|
||
3. Long-form (60s+) playback stress with buffer recycling.
|
||
4. Multi-frame HDR tests for AV1 + H.264.
|
||
|
||
After 8.9 the project's user-facing loop is closed. Optimisation
|
||
phases (QPU dispatch, 4K, encoders) ship when motivated.
|
||
|
||
## Effort estimate
|
||
|
||
Each phase: ~1 week of focused work (~40 hours).
|
||
Total: 7 weeks for v1.
|
||
|
||
Could be split across multiple sessions / contributors.
|