daedalus-v4l2

reauktion/daedalus-v4l2

Commit Graph

Author

SHA1

Message

Date

marfrit

0de0288dce

Phase 8.10+8.11: libva consumer integration scaffold

Brings daedalus_v4l2 from "standalone test client" to "VAAPI-
discoverable decoder" by adding the surface formats and
media-controller plumbing that libva-v4l2-request-fourier
(sibling repo) requires.

libva-v4l2-request-fourier patches (pushed separately):
- b5b3acf: daedalus_v4l2 added to known_decoder_drivers
- 2146341: meson option gate

This commit (daedalus-v4l2 side, 3 production changes):

1. V4L2_PIX_FMT_NV12 (single-plane) on CAPTURE
   - Added to daedalus_capture_formats[] alongside NV12M + P010
   - daedalus_fill_capture_fmt handles num_planes=1 case
     (sizeimage = W*H*3/2, bytesperline = W)
   - daemon pack_nv12_single_to_plane: Y at base+0,
     interleaved CbCr at base+(stride*H); same byte content
     as NV12M two-plane, different layout
   - Required because libva-v4l2-request-fourier's video.c
     only knows non-multi-plane NV12 (it advertises
     v4l2_mplane=true but uses the single-plane fourcc).
   - Verified byte-exact via test_m2m_stream against
     ffmpeg -pix_fmt nv12 reference (VP9 1080p 10 frames,
     31 MB).

2. V4L2 Request API media ops
   - daedalus_media_ops = { vb2_request_validate,
     v4l2_m2m_request_queue } assigned to mdev.ops before
     media_device_init.
   - Without this, MEDIA_IOC_REQUEST_ALLOC returned
     -ENOTTY and no VAAPI consumer could allocate a
     media_request.

3. Stateless control registration via v4l2_ctrl_new_custom
   - Switched from v4l2_ctrl_new_std_compound(NULL p_def)
     to v4l2_ctrl_new_custom — pattern rkvdec/cedrus/
     hantro use. Adds a no-op s_ctrl callback.
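
The single-plane NV12 layout from item 1 can be sketched in plain C as below. This is a minimal illustration, not the daemon's actual pack_nv12_single_to_plane: the function names and signatures are hypothetical, and it assumes the source is already a two-plane NV12 frame with even dimensions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Bytes for one single-plane NV12 frame: Y (W*H) + interleaved CbCr (W*H/2). */
static size_t nv12_single_sizeimage(uint32_t width, uint32_t height)
{
    return (size_t)width * height * 3 / 2;
}

/*
 * Hypothetical helper: copy a two-plane NV12 frame into one contiguous
 * plane. Y lands at dst+0, interleaved CbCr at dst + dst_stride*height.
 * Source stride padding is dropped; the destination stride is respected.
 */
static void nv12_pack_single(uint8_t *dst, size_t dst_stride,
                             const uint8_t *src_y, size_t src_y_stride,
                             const uint8_t *src_uv, size_t src_uv_stride,
                             uint32_t width, uint32_t height)
{
    assert(width % 2 == 0 && height % 2 == 0);

    uint8_t *dst_uv = dst + dst_stride * height;

    /* Luma: one byte per pixel, `height` rows. */
    for (uint32_t row = 0; row < height; row++)
        memcpy(dst + row * dst_stride, src_y + row * src_y_stride, width);

    /* Chroma: W/2 CbCr pairs per row (W bytes), `height/2` rows. */
    for (uint32_t row = 0; row < height / 2; row++)
        memcpy(dst_uv + row * dst_stride, src_uv + row * src_uv_stride, width);
}
```

For 1920x1080 the size helper gives 3,110,400 bytes per frame, so 10 frames come to roughly 31 MB, in line with the verification figure quoted in item 1.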

Verification (hertz, Pi 5, 6.12.75+rpt-rpi-2712):

LibVA trace through `ffmpeg -hwaccel vaapi`:
  vaInitialize / Profiles / Entrypoints / CreateConfig /
  QuerySurfaceAttributes / CreateSurfaces / CreateContext
  (cap_pool: 24 slots, 1 plane each) / CreateBuffer
  (slice + picture params) / MEDIA_IOC_REQUEST_ALLOC
  — all succeed.

Standalone NV12 decode path:
  test_m2m_stream vp9_1080_stream.ivf out.nv12 1920 1080 vp9 nv12
  → 10/10 frames, byte-exact vs ffmpeg -pix_fmt nv12

vainfo (via libva-v4l2-request-fourier with our driver):
  7 VAProfile entries with VAEntrypointVLD
  (H264 Main/High/CBaseline/MultiviewHigh/StereoHigh,
   VP9Profile0, AV1Profile0)

What's NOT here (Phase 8.12):

The libva trace stops at VIDIOC_S_EXT_CTRLS, which returns
EINVAL when populating V4L2_CID_STATELESS_VP9_FRAME on
the request: the kernel's compound-control validation
rejects the payload because it does not match the expected
struct shape.
This isn't a "missing line" fix — it needs proper
stateless control plumbing (the SPS/PPS/SliceParams
get_dims, validate, default-value paths that in-tree
rkvdec/cedrus/hantro implement to satisfy v4l2-core's
std_validate). Documented as Phase 8.12 scope.

The shipped integration is itself a meaningful deliverable:
all the framework scaffolding is in place; the remaining
gap is well-characterised and bounded.

See docs/phase_8_10_11_closure.md for the full trace
analysis + next-phase plan.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-18 17:51:16 +00:00

marfrit

1ae9528e76

Phase 8.8: throughput baseline + multi-codec streams + HDR

Per the correctness-before-speed principle: measure before
optimising. Roadmap going in said "QPU dispatch substitution
to hit 30fps@1080p". Measurement on hertz shows the FFmpeg
software path already hits 65-88 fps@1080p across all three
codecs — QPU substitution would be premature optimisation.

So 8.8 ships what's actually useful:
1. Per-frame timing in test_m2m_stream.
2. Multi-frame AV1 + H.264 streams verified byte-exact at
   1080p (closes the "VP9-only stream tests" gap from 8.7).
3. HDR / 10-bit via V4L2_PIX_FMT_P010 + daemon
   pack_p010_to_plane.

Test harness (tools/test_m2m_stream.c):
- Per-frame µs timing via CLOCK_MONOTONIC; reports mean/p50/
  p99/min/max + wall ms + fps.
- Annex-B H.264 parser: split on 3-/4-byte start codes,
  accumulate NALs into access units (push on VCL NAL types
  1 or 5). Without AU grouping, FFmpeg rejects SPS/PPS-only
  buffers as "no frame!".
- Format auto-detect (DKIF magic → IVF; else Annex-B).
- Optional 6th arg `[capture]`: nv12m | p010.
- CAPTURE mmap path generalised for num_planes==1 (P010).
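
The Annex-B splitting rule described above (3-/4-byte start codes, access unit closed on VCL NAL types 1 or 5) might look roughly like this. It is a sketch, not the code in tools/test_m2m_stream.c: all function names are hypothetical, and it assumes one slice per picture, so each type-1 or type-5 NAL ends an access unit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Find the next Annex-B start code at or after `pos`.
 * Returns the offset of its first byte and stores its length (3 or 4)
 * in *sc_len. Returns `len` with *sc_len = 0 if none is found.
 */
static size_t annexb_next_start_code(const uint8_t *buf, size_t len,
                                     size_t pos, size_t *sc_len)
{
    assert(buf != NULL && sc_len != NULL);

    for (size_t i = pos; i + 3 <= len; i++) {
        if (buf[i] != 0 || buf[i + 1] != 0)
            continue;
        if (buf[i + 2] == 1) {
            *sc_len = 3;
            return i;
        }
        if (i + 4 <= len && buf[i + 2] == 0 && buf[i + 3] == 1) {
            *sc_len = 4;
            return i;
        }
    }
    *sc_len = 0;
    return len;
}

/* H.264 nal_unit_type is the low 5 bits of the first NAL byte. */
static int h264_nal_closes_au(uint8_t nal_header)
{
    int type = nal_header & 0x1f;
    return type == 1 || type == 5; /* non-IDR slice or IDR slice */
}

/* Count access units: SPS/PPS NALs ride along until a slice NAL arrives. */
static size_t annexb_count_access_units(const uint8_t *buf, size_t len)
{
    size_t count = 0;
    size_t sc_len = 0;
    size_t pos = annexb_next_start_code(buf, len, 0, &sc_len);

    while (pos < len) {
        size_t nal = pos + sc_len;
        if (nal < len && h264_nal_closes_au(buf[nal]))
            count++;
        pos = annexb_next_start_code(buf, len, nal, &sc_len);
    }
    return count;
}
```

With this grouping, a stream of SPS, PPS, IDR slice, then P slice yields two access units, and the SPS/PPS bytes travel in the same buffer as the IDR slice instead of being submitted on their own.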

Kernel (kernel/daedalus_v4l2_main.c):
- CAPTURE formats array {NV12M, P010}; enum_fmt walks it.
- daedalus_fill_capture_fmt takes a fourcc:
    NV12M: 2 planes, W*H + W*H/2 bytes, bpl=W
    P010:  1 plane,  W*H*2 + W*H bytes, bpl=W*2
- try_fmt preserves caller fourcc when supported.
- daedalus_complete_resp_frame's dmabuf path now sets each
  plane's payload to vb2_plane_size(vb,p) — generalises
  cleanly across 1-plane (P010) and 2-plane (NV12M) layouts;
  the daemon fully populates the plane so payload =
  sizeimage.
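
The per-format geometry listed for daedalus_fill_capture_fmt can be checked with a small userspace sketch. The enum and struct here are hypothetical stand-ins (the driver keys on V4L2 fourccs and fills the V4L2 format structures directly); only the arithmetic is taken from the list above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for V4L2_PIX_FMT_NV12M / V4L2_PIX_FMT_P010. */
enum cap_layout { CAP_NV12M, CAP_P010 };

struct cap_geometry {
    unsigned num_planes;
    size_t sizeimage[2];  /* bytes per plane */
    size_t bytesperline;  /* luma stride in bytes */
};

static struct cap_geometry cap_geometry_for(enum cap_layout layout,
                                            uint32_t w, uint32_t h)
{
    assert(w % 2 == 0 && h % 2 == 0);

    struct cap_geometry g = {0};
    size_t luma = (size_t)w * h;

    switch (layout) {
    case CAP_NV12M:
        /* Two planes: 8-bit Y, then interleaved 8-bit CbCr at half size. */
        g.num_planes = 2;
        g.sizeimage[0] = luma;
        g.sizeimage[1] = luma / 2;
        g.bytesperline = w;
        break;
    case CAP_P010:
        /* One plane: 16-bit Y (W*H*2) followed by 16-bit CbCr (W*H). */
        g.num_planes = 1;
        g.sizeimage[0] = luma * 2 + luma;
        g.bytesperline = (size_t)w * 2;
        break;
    }
    return g;
}
```

At 1920x1080 this gives 6,220,800 bytes per P010 frame, which matches the 62 MB reported for the 10-frame HDR run.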

Daemon (daemon/src/decoder.c):
- pack_p010_to_plane: YUV420P10LE → P010 single-plane.
  10-bit samples shifted left by 6 to MSB-align in 16-bit
  words per V4L2 ABI. Y at base+0, interleaved CbCr right
  after Y plane (per format spec for single-plane P010).
  Strips source stride padding; respects destination stride.
- daedalus_decoder_run_request dispatches on
  req->capture_pix_fmt (NV12M → pack_nv12_to_planes; P010
  → pack_p010_to_plane; else warn + skip).
- Includes <linux/videodev2.h> for fourcc constants.
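
A minimal sketch of the P010 packing step described above, assuming a little-endian host and planar 10-bit input with separate Y, Cb and Cr planes. The function name and signature are hypothetical and strides are counted in 16-bit samples rather than bytes; the real pack_p010_to_plane may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: planar 10-bit 4:2:0 to single-plane P010.
 * Each 10-bit sample is shifted left by 6 so it sits in the top bits
 * of a 16-bit word. Y goes at dst+0, interleaved CbCr directly after
 * the Y rows. All strides are in uint16_t units.
 */
static void p010_pack(uint16_t *dst, size_t dst_stride,
                      const uint16_t *y, size_t y_stride,
                      const uint16_t *u, const uint16_t *v, size_t c_stride,
                      uint32_t width, uint32_t height)
{
    assert(width % 2 == 0 && height % 2 == 0);

    uint16_t *dst_c = dst + dst_stride * height;

    for (uint32_t r = 0; r < height; r++)
        for (uint32_t c = 0; c < width; c++)
            dst[r * dst_stride + c] = (uint16_t)(y[r * y_stride + c] << 6);

    for (uint32_t r = 0; r < height / 2; r++) {
        for (uint32_t c = 0; c < width / 2; c++) {
            dst_c[r * dst_stride + 2 * c] =
                (uint16_t)(u[r * c_stride + c] << 6);
            dst_c[r * dst_stride + 2 * c + 1] =
                (uint16_t)(v[r * c_stride + c] << 6);
        }
    }
}
```

The shift maps the 10-bit maximum 1023 to 0xFFC0, leaving the low six bits zero.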

Verification on hertz (Pi 5, 6.12.75+rpt-rpi-2712):

1080p throughput baseline (30 frames testsrc, dmabuf path):

  VP9   1080p:  mean 12.0 ms,  p99 15.9 ms,  fps **83.1**, byte-exact ✓
  AV1   1080p:  mean 15.4 ms,  p99 41.0 ms,  fps **65.0**, byte-exact ✓
  H.264 1080p:  mean 11.3 ms,  p99 21.5 ms,  fps **88.3**, byte-exact ✓

All 2-3× over the 30fps-floor-is-fine criterion.

HDR / 10-bit 1080p P010:
  10 frames, 62 MB output, fps **48.8**, byte-exact vs
  `ffmpeg -pix_fmt p010le -f rawvideo`.

Small-frame P010 (320×240): fps 966 — fixed daemon overhead
dominates at low resolutions.

v4l2-compliance unchanged from 8.7: 49/49 passing.
Format enumeration confirms NM12 + P010 on CAPTURE.

Clean SIGTERM + rmmod; no kernel oops/WARN.

Roadmap update (docs/roadmap.md):
- 8.8 marked closed with closure-doc reference, including
  the explicit "QPU substitution not needed" rationale.
- 8.9 reshaped: libva-v4l2-request consumer integration
  (per project_consumer_target memory) — the actual
  user-facing endpoint.

Per correctness-before-speed:
- Measured first; QPU work explicitly justified-out via data.
- Byte-exact pixel comparison for every codec/format combo
  (NV12: VP9, AV1, H.264; P010: VP9 10-bit at 320×240 and
  1080p).
- AU grouping in the Annex-B parser is the correct
  semantic boundary, not just a workaround.
- vb2_plane_size for payload generalises to any plane
  count, not hardcoded to 2.

Phase 8.9 next: libva-v4l2-request integration — close
the loop from YouTube/Firefox to /dev/video0 + daemon
playback.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-18 16:34:05 +00:00

marfrit

5965805d86

Phase 8.7: media controller + multi-frame streaming verification

Two pieces — both shipped:

1. Media controller binding closes the last v4l2-compliance
   failure from 8.6 (DECODER_CMD, which requires has_media on
   stateless decoders) and unlocks the V4L2 request API for
   libva-v4l2-request.

2. Multi-frame streaming test exercises the daemon's
   AVCodecContext state preservation across many REQ_DECODE
   calls — Phase 8.6's tests pushed exactly one keyframe per
   invocation; real content has P-frame references.

Compliance now reaches **49/49 passing.**

Kernel (kernel/daedalus_v4l2_main.{c,h}):
- Added `struct media_device mdev` to daedalus_dev.
- media_device_init(&mdev) BEFORE v4l2_device_register so
  v4l2-core sees v4l2_dev.mdev = &mdev and binds the m2m
  entities into the graph during register.
- After video_register_device:
  v4l2_m2m_register_media_controller(..., MEDIA_ENT_F_PROC_VIDEO_DECODER)
  then media_device_register so userspace sees the complete
  graph in /dev/mediaN with the decoder entity tagged.
- daedalus_remove unwinds in reverse: unregister media,
  unregister mc, unregister video, release m2m, unregister
  v4l2, cleanup mdev.
- Error paths added for both new failure points.

Test harness (tools/test_m2m_stream.c, new):
- Multi-frame V4L2 m2m client: parses IVF → 4-deep buffer
  rings on both queues → per-frame QBUF/DQBUF loop →
  concatenates decoded NV12 to output file. Returns 0 only
  if every input frame decoded without error.
- Same codec vocabulary as test_m2m_decode (vp9 | av1 |
  h264 via 5th arg).
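
The IVF parsing step of the harness can be sketched as follows. The container layout (32-byte file header starting with "DKIF", header length at byte offset 6, 12-byte frame headers holding a little-endian 32-bit payload size and a 64-bit timestamp) is the standard IVF format; the function names are hypothetical and not taken from tools/test_m2m_stream.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

static uint32_t rd_le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Returns the offset of the first frame header, or 0 if not an IVF file. */
static size_t ivf_first_frame_offset(const uint8_t *buf, size_t len)
{
    assert(buf != NULL);
    if (len < 32 || memcmp(buf, "DKIF", 4) != 0)
        return 0;
    /* Bytes 6..7: header length, little-endian (32 in practice). */
    size_t hdr = (size_t)buf[6] | ((size_t)buf[7] << 8);
    return (hdr >= 32 && hdr <= len) ? hdr : 0;
}

/*
 * Step to the next frame. On success stores the payload pointer and
 * size, advances *pos past the frame, and returns 1. Returns 0 at end
 * of data or on a truncated frame.
 */
static int ivf_next_frame(const uint8_t *buf, size_t len, size_t *pos,
                          const uint8_t **payload, uint32_t *size)
{
    assert(pos != NULL && payload != NULL && size != NULL);

    if (*pos > len || len - *pos < 12)
        return 0;
    uint32_t sz = rd_le32(buf + *pos);
    if (sz > len - *pos - 12)
        return 0;
    *payload = buf + *pos + 12;
    *size = sz;
    *pos += 12 + (size_t)sz;
    return 1;
}
```

A client would call ivf_first_frame_offset once, then loop on ivf_next_frame and queue each payload as one OUTPUT buffer.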

Verification on hertz (Pi 5, 6.12.75+rpt-rpi-2712):

v4l2-compliance: 49 tests, 49 passed, 0 failed, 0 warnings.

  $ v4l2-ctl --list-devices
  daedalus-fourier V3D7+NEON (platform:daedalus_v4l2):
        /dev/video0
        /dev/media3

VP9 320×240 30 frames (1 keyframe + 29 P-frames, 3.46 MB
NV12): byte-for-byte match vs `ffmpeg -i in.ivf -pix_fmt
nv12 -f rawvideo`.

VP9 1920×1080 10 frames (31 MB NV12 through the dmabuf
path): byte-for-byte match vs same reference command.

Daemon log shows cookies 1..30 all completing cleanly in
order; lazily-opened AVCodecContext maintains reference
frames across the chardev round-trips.

Clean SIGTERM + rmmod, no oops/WARN.

Roadmap update (docs/roadmap.md):
- 8.7 marked closed with closure-doc reference.
- 8.8 reshaped: perf profiling, QPU dispatch substitution
  via daedalus-fourier, multi-frame AV1/H.264, HDR (P010M).

Per correctness-before-speed:
- Order-correct media controller lifecycle (init → bind
  v4l2_dev → register video → register mc → register
  media; reverse for teardown).
- 4-deep buffer rings on both queues — the scheduler
  actually pipelines multiple in-flight cookies through
  the chardev (not just one-at-a-time as in 8.5/8.6 tests).
- Byte-exact comparison against ffmpeg, not "looks right."
- All resource paths cleaned on every error branch.

Phase 8.8 next: profile daemon hot loops, dlopen
daedalus-fourier from the daemon, swap FFmpeg per-block
calls for daedalus_dispatch_* where the kernel matches, and
target 30fps@1080p (per the 30fps-floor-is-fine memory).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-18 16:21:58 +00:00

3 Commits