5d1ff511787b00c5730e7700a3096c7630a9282e
53 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
5d1ff51178 |
Merge pull request 'daemon: AV1 Frame Header OBU synthesiser + Temporal Delimiter' (#24) from noether/daemon-av1-frame-header-obu into main
Reviewed-on: #24 |
||
|
|
9797a0daa6 |
daemon: AV1 Frame Header OBU synthesiser + Temporal Delimiter
Extends the AV1 OBU encoder pack (PR #22 landed the Sequence Header half) with the two remaining pieces of the per-frame OBU assembly: - av1_synth_temporal_delimiter_obu() — trivial 2-byte OBU (0x12, 0x00) that AV1 temporal units must start with so libavcodec's parser can detect access-unit boundaries. - av1_synth_frame_header_obu() — encodes a Frame Header OBU (AV1 §5.9) from V4L2_CID_STATELESS_AV1_SEQUENCE + V4L2_CID_STATELESS_ AV1_FRAME controls. ## Frame Header scope The encoder covers the libva-v4l2-request common-case path: - frame_type: KEY / INTER / INTRA_ONLY supported. SWITCH returns 0. - tile_info: single-tile uniform-spacing only (forced tile_cols_log2 = tile_rows_log2 = 0). - quantization_params: full coverage (base_q_idx, delta_q_*, qmatrix). - loop_filter_params: full coverage (levels, sharpness, ref/mode deltas). - cdef_params: full coverage. - segmentation: only enabled=0 path supported (returns 0 if enabled). - loop_restoration: only RESTORE_NONE supported (returns 0 if any plane uses Wiener / SGRPROJ / SWITCHABLE). - global_motion: only IDENTITY warp model emitted (returns 0 if any ref uses ROTZOOM / AFFINE / TRANSLATION). - film_grain_params: only "not present" path — returns 0 if the sequence header has FILM_GRAIN_PARAMS_PRESENT set. Out-of-scope branches return 0 so a future decoder.c integration can surface a coverage warning and fall back to direct libavcodec parsing of the original bitstream where the consumer happens to ship a fully-OBU'd access unit. ## Integration status The new primitives are NOT yet wired into decoder.c. The AV1 decode hot path still passes the OUTPUT buffer straight to libavcodec, which works only when the V4L2 consumer is sending a fully-OBU'd access unit (not strictly the V4L2 stateless contract). A real wiring needs a separate kernel-side change: - daedalus_v4l2_proto.h: add struct daedalus_av1_meta mirroring v4l2_ctrl_av1_sequence + v4l2_ctrl_av1_frame - kernel/daedalus_v4l2_main.c: capture V4L2_CID_STATELESS_AV1_{SEQUENCE, FRAME} at device_run, ship over the chardev - daemon/src/chardev_client.c: receive meta - daemon/src/decoder.c: assemble TD + SH + FH + OBU_TILE_GROUP-wrapped OUTPUT bytes, send to libavcodec Tracked as a follow-on. ## Tests test_av1_obu_synth.c grows 5 new cases (9 total, all green on hertz): === av1_synth_temporal_delimiter_obu === temporal delimiter: OK === av1_synth_frame_header_obu === KEY frame 1080p: OK (13 bytes) INTER frame: OK (18 bytes) SWITCH frame rejected: OK segmentation enabled rejected: OK AV1 OBU synth tests PASSED Bit-walk of the KEY-frame happy path confirms the OBU envelope (obu_type=3 = FRAME_HEADER, has_size_field=1, leb128 size byte), then steps through show_existing_frame, frame_type, show_frame, disable_cdf_update, allow_screen_content_tools. Fuller bit-walks would tie the test to encoder details that are spec-driven and already linear in the source; structural smoke + spec-driven linearity is the right gate. Build clean on hertz (Pi 5, Debian trixie, 6.18.29+rpt-rpi-2712, gcc -Wall -Wextra -Wpedantic). No new warnings. Closes daedalus backlog task #159 (AV1 Frame Header OBU synthesiser; decoder.c integration deferred per task notes above). |
||
|
|
3a8f5405d4 |
Merge pull request 'daemon: AV1 Sequence Header OBU synthesiser + unit test' (#22) from noether/daemon-av1-obu-synth into main
Reviewed-on: #22 |
||
|
|
4cfe0b470f |
Merge pull request 'daemon: bounds-check pack_* functions against CAPTURE plane size' (#21) from noether/daemon-pack-bounds-check into main
Reviewed-on: #21 |
||
|
|
b958ef8166 |
Merge pull request 'kernel: drain in-flight m2m jobs on daemon disconnect (fixes #146 D-state)' (#23) from noether/kernel-drain-inflight-on-chardev-release into main
Reviewed-on: #23 |
||
|
|
94be8c3d03 |
kernel: drain in-flight m2m jobs on daemon disconnect
Fixes issue #146 — daemon-crash (SIGKILL, SEGV, anything that triggers chardev release) leaves V4L2 consumers in unkillable TASK_UNINTERRUPTIBLE on /dev/video0 close. ## Root cause device_run() adds an entry to dev->inflight when it sends a REQ_DECODE to the daemon, marking the m2m job as "running". The job is only cleared via v4l2_m2m_buf_done_and_job_finish() in daedalus_complete_resp_frame(), which only fires on RESP_FRAME. If the daemon dies (SIGKILL, SEGV, exit) BEFORE writing the matching RESP_FRAME: - the inflight entry is never popped - v4l2_m2m_buf_done_and_job_finish is never called - the m2m scheduler still thinks a job is running Later, when the V4L2 consumer's close() runs (or gets signalled to exit), v4l2_m2m_ctx_release() → v4l2_m2m_cancel_job() waits for !job_running indefinitely. The consumer enters D-state and survives SIGKILL until reboot. Reproduced on hertz 2026-05-23, kernel 6.12.75+rpt-rpi-2712: $ sudo kill -STOP $DAEMON_PID # block daemon I/O $ ./test_m2m_decode keyframe.bin out.nv12 1920 1080 vp9 & $ sudo kill -9 $DAEMON_PID # chardev_release fires $ kill -9 $CLIENT_PID # ignored — D-state # client stack: v4l2_m2m_cancel_job+0x14c [v4l2_mem2mem] v4l2_m2m_ctx_release+0x20 [v4l2_mem2mem] daedalus_release+0x2c [daedalus_v4l2] v4l2_release+0x7c [videodev] __fput → do_exit → SIGKILL never delivered ## Fix New API daedalus_drain_inflight_on_disconnect() in main.{c,h}: walks the in-flight list, marks both src+dst buffers VB2_BUF_STATE_ERROR via v4l2_m2m_buf_done_and_job_finish(), and releases the bound media_request if any. Same completion shape as daedalus_complete_resp_frame() takes on the success path, just with state = ERROR for every in-flight entry. chardev_release calls the drain after flushing dev->req_queue (messages still in req_queue weren't released to the daemon yet, so they don't need the m2m-job-finish dance — freeing them is sufficient). The order matters: queue first (cheap), then m2m drain (heavier, takes the inflight list). Locking: list_splice_init under inflight_lock to take the entire list atomically; lock dropped before iterating because v4l2_m2m_buf_done_and_job_finish can sleep via vb2's buffer-done dispatch and can re-enter device_run via the scheduler (which would need inflight_lock again on the next REQ_DECODE). ## Verification path Cannot rmmod the running module on hertz right now — the D-state corpse from the repro session pins the refcount. Verification of the fixed module needs a reboot or fresh test host: $ sudo reboot # clears hung client $ sudo make modules_install # install new .ko $ sudo modprobe daedalus_v4l2 $ # rerun the repro script — client should die cleanly with $ # an -EIO / similar return from poll/DQBUF instead of hanging. Build: clean on Linux 6.12.75 + rpt-rpi-2712, no new warnings. The pre-existing "frame size 2128 > 2048" warning on daedalus_device_run is unchanged by this commit. ## Followup not in scope If a new V4L2 consumer races a REQ_DECODE through device_run AFTER the drain has spliced the list (but before the daemon chardev is reopened), the new entry sits in a freshly-empty inflight list and the same hang can recur for that consumer when the systemd auto-restart of the daemon either fails or takes longer than the consumer's patience. A secondary safeguard would be to fail-fast in device_run when dev->chardev is unopened — proposing as a separate ticket if this race materialises in practice. Closes #146. |
||
|
|
1e9619afe8 |
daemon: AV1 Sequence Header OBU synthesiser + unit test
V4L2 stateless AV1 passes the sequence header information as a structured control (V4L2_CID_STATELESS_AV1_SEQUENCE) and ships only tile-group bytes in the OUTPUT buffer. libavcodec's AV1 decoder is full-bitstream, so the daemon needs to reconstruct the OBU bytes the consumer parsed out before feeding the assembled stream to libavcodec. This commit lands the Sequence Header OBU half of that reconstruction — av1_synth_sequence_header_obu(). Frame Header / Frame OBU synthesisers + the integration that wires the assembled OBUs into the decode hot path are separate follow-on modules. Module shape mirrors the H.264 NAL synthesiser (PR #1): - Public API: single function returning byte count or 0 on overflow/invalid input. - Wire encoder uses the existing bitstream_writer (bsw_put_u is AV1's f(n); bsw_put_ue is bit-identical to AV1's uvlc; bsw_align_rbsp matches AV1's trailing_bits()). - AV1-specific helpers (leb128 size, min_bits_for, subsampling resolution per §5.5.2) are file-local statics. - No emulation prevention — AV1 uses leb128-sized OBUs for bitstream boundaries, not byte-pattern escapes. Synthesis decisions for fields V4L2 doesn't carry are documented verbatim in the file header (reduced_still_picture_header = 0; single operating point at seq_level_idx = 13 / level 5.1; color_description_present_flag = 0; chroma_sample_position = 0; seq_choose_screen_detection_tools = 1; seq_choose_integer_mv = 1). Rejection cases: - seq_profile > 2 - bit_depth not in {8, 10, 12} - seq_profile = 1 + monochrome (4:4:4 forced colour) - seq_profile = 1 + bit_depth = 12 (only profile 2 allows it) - max_frame_{width,height}_minus_1 requiring > 16 length bits - out_cap too small to hold header + leb128 + payload Each returns 0 to surface the mismatch loudly rather than emit nonsense the libavcodec parser would reject downstream. Unit test (test_av1_obu_synth.c, opt-in via DAEDALUS_BUILD_TESTS=ON) exercises four cases bit-by-bit against a hand-computed reference: 1. profile 0, 1080p, 8-bit, 4:2:0, order_hint on (7 bits), CDEF+restoration on — the common Pi 5 path. 2. profile 0, 720p, 10-bit, monochrome — exercises high_bitdepth and the monochrome short-form color_config. 3. profile 1 + bit_depth 12 → expects 0 (rejected). 4. tiny out_cap → expects 0 (overflow). All four green on hertz (aarch64 Arch, gcc Wall+Wextra+Wpedantic clean). This commit does not change daemon behaviour — av1_obu_synth.c is built into the daemon binary so the symbols are reachable, but no call site is wired yet. Integration goes in the follow-on DAEMON-AV1 patches that also synthesise the Frame Header OBU and bracket the assembled OBUs with a Temporal Delimiter. Refs reauktion/daedalus-v4l2#11 daemon-half; closes daedalus backlog task #144. |
||
|
|
a43296c1ed |
daemon: bounds-check pack_* functions against CAPTURE plane size
The three NV12/P010 pack functions (pack_nv12_single_to_plane, pack_nv12_to_planes, pack_p010_to_plane) wrote into the V4L2 client's CAPTURE dmabuf without checking that the mapped size covers the frame libavcodec just decoded. Crash scenario: YouTube DASH stepping resolution mid-stream (e.g. 480p -> 720p when bandwidth improves) — libva is supposed to handle the V4L2_EVENT_SOURCE_CHANGE with STREAMOFF / S_FMT / REQBUFS, but in practice a stale CAPTURE request with the old buffer size sometimes slips through carrying the new (larger) frame. The chroma-interleave inner loop walks past the mapping boundary and the daemon takes SIGSEGV mid-frame, which in turn leaves V4L2 clients hanging in vb2_core_dqbuf — see the followup ticket on the D-state symptom. Fix: compute required = y_size + uv_size against planes->size[N] BEFORE any write. On mismatch, log_warn with both sizes and the frame dimensions, and return -EOVERFLOW. The caller (process_decode_request loop) already handles a negative pack return with a log_warn and proceeds without aborting the decode — the kernel still gets the response with metadata-only and the V4L2 client sees a frame whose pixels are stale but whose buffer-done event fires normally. The next SOURCE_CHANGE the client processes resyncs the buffer size. All three pack paths get the same bounds-check; the comment on pack_nv12_single is the canonical explanation, the other two reference it. Verified: builds clean against trixie aarch64; no behavioural change on the happy path (the bounds check is a single size compare; on a correctly-sized CAPTURE buffer it's a 1-cycle pass). Closes daedalus-v4l2 task #145 (daemon SEGV in pack_nv12_single on resolution change). |
||
|
|
872eec505e |
Merge pull request 'proto: bump PROTO_MAX_PAYLOAD 64 KiB → 1 MiB (closes #19)' (#20) from noether/issue-19-bump-proto-payload-1mib into main
Reviewed-on: #20 |
||
|
|
ee42419479 |
proto: bump PROTO_MAX_PAYLOAD 64 KiB -> 1 MiB (closes #19)
Real H.264 access units routinely exceed the previous 64 KiB cap on the chardev wire protocol: 720p worst-case I-frame ~200 KiB 1080p worst-case I-frame ~500 KiB libva-v4l2-request-fourier detects the under-sized OUTPUT-MPLANE buffer and tries to grow it via VIDIOC_S_FMT to 147456 B, but daedalus_fill_output_fmt unconditionally pins sizeimage to DAEDALUS_MAX_BITSTREAM (= 65484) regardless of userspace's request. Firefox loses the slice, falls back to libmozavcodec SW for the rest of the session. Bumping the wire-protocol cap to 1 MiB lifts the kernel OUTPUT_MPLANE sizeimage with it (DAEDALUS_MAX_BITSTREAM is derived from the same #define). All allocations (kernel kmalloc / kmemdup, daemon read buffer, vb2 plane backing) are dynamic and sized per-payload at runtime, so the only growth is the daemon's startup read buffer (one ~1 MiB allocation per daemon process) and the V4L2 OUTPUT_MPLANE per-buffer size. KMALLOC_MAX_SIZE on aarch64 SLUB is several MiB; 1 MiB is well within bounds. Other V4L2 stateless decoders (cedrus, rkvdec, hantro) report 1-4 MiB OUTPUT_MPLANE sizeimage — this puts daedalus at the conservative end of normal. ## Compatibility #define-only change; struct layout unchanged. But the effective cap is the smaller of (kernel cap, daemon cap), so: - new daemon + stale kernel: still capped at 64 KiB until the kernel module rebuilds. - new kernel + stale daemon: same. Lock-step install of daedalus-v4l2 + daedalus-v4l2-dkms is therefore required for the fix to take effect; mirrors the PR-#7/#8 cadence. ## NOT changed in this commit - daedalus_fill_output_fmt still hardcodes sizeimage = DAEDALUS_MAX_BITSTREAM regardless of userspace request. Acceptable: vb2 will allocate up to that, and libva's resize- test now sees the kernel report a sizeimage at-least-as-large as what it asked for (147456 < 1048524). A future cleanup could respect userspace's S_FMT.sizeimage clamped to the cap, to save memory on tiny streams. - chardev kmalloc → kvmalloc swap (only matters above KMALLOC_MAX_SIZE, not here). Refs #19. |
||
|
|
1d8f5af164 |
Merge pull request 'daemon: filter tiny pause-time bitstreams (closes #17)' (#18) from noether/issue-17-tiny-bitstream-filter into main
Reviewed-on: #18 |
||
|
|
3e4e6e8eae |
daemon: filter tiny pause-time bitstreams (closes #17)
libva-v4l2-request-fourier flushes a stub packet into the V4L2
OUTPUT_MPLANE queue at playback-pause boundaries. The payload is
shorter than any parseable H.264 NAL (3-byte start code + 1-byte
NAL header = 4 bytes minimum); avcodec_send_packet returns
AVERROR_INVALIDDATA (-1094995529), which propagated to the kernel
as a decode failure. Firefox then marked H.264-via-VAAPI as
broken for the session and routed every subsequent frame to
libmozavcodec SW — pause never recovered to HW.
At the REQ_DECODE entry in chardev_client.c::handle_req_decode,
short-circuit any bitstream below the minimum-parseable threshold:
log INFO, skip daedalus_decoder_run_request, and reply RESP_FRAME
with status=DAEDALUS_DECODE_NO_FRAME so libva's V4L2 surface pool
stays healthy and Firefox doesn't see a failure.
Repro: Pi CM5 trixie, daedalus-v4l2 0.1.0+r41 + ffmpeg-v4l2-
request-fourier 2:8.1+rfourier+gb57fbbe-9, Firefox YouTube avc1.
Play → daemon decodes at ~46 fps. Pause ≥ 1s. Resume → daemon
silent; sudo journalctl -u daedalus-v4l2 --since '10s' | grep -c
'decoder: OK' = 0. Last entry before silence:
REQ_DECODE cookie=N codec=3 bitstream=3 bytes ...
[h264 @ ...] no frame!
[ERR] decoder: avcodec_send_packet failed: -1094995529
After this fix the 3-byte sentinel logs as 'tiny bitstream 3
bytes — dropping as no-op' and the libavcodec context is
untouched; the next real REQ_DECODE proceeds normally.
Scope NOT covered (intentionally deferred):
- A more general "tolerate AVERROR_INVALIDDATA mid-stream" path.
Worth doing later but masks unrelated bugs.
- Investigating WHY libva sends the 3-byte sentinel on pause.
Likely an upstream libva-v4l2-request-fourier issue; tracked
separately if this filter is not enough.
Wire protocol unchanged. No DAEDALUS_PROTO_VERSION bump.
|
||
|
|
6e6dfa144d |
Merge pull request 'daemon: dlopen Kwiboo fork's soname 62 (FFmpeg 8.1 at /opt/fourier)' (#16) from noether/daemon-dlopen-kwiboo-soname62 into main
Reviewed-on: #16 |
||
|
|
514da29a73 |
daemon: dlopen Kwiboo fork's libavcodec.so.62 / libavformat.so.62 / libavutil.so.60
Switch the daemon's runtime dlopen targets from Debian-stock soname 61/61/59 (FFmpeg 7.1.3) to the Kwiboo fourier fork's soname 62/62/60 (FFmpeg 8.1) installed at the /opt/fourier prefix. Why --- The substitution arc tracked at daedalus-v4l2#11 needs daedalus- fourier kernel calls woven into libavcodec's H264DSPContext NEON init (replacing ff_h264_idct_add_neon etc. with thunks calling daedalus_recipe_dispatch_h264_*). We do that via patches in the ffmpeg-v4l2-request-fourier package source — which we own, in marfrit-packages, alongside the existing libudev-bypass and nv15-to-p010 patches. But that package builds the Kwiboo fork at soname 62 / /opt/fourier. The daemon currently dlopens soname 61 (Debian-stock + a separately-built +fourier2 patch that isn't in marfrit-packages' source tree), so substitution patches there wouldn't reach the daemon. Switching to soname 62 routes the daemon through the package we control — first step toward landing daedalus-fourier kernel substitution into the production decode path. Compat ------ - /opt/fourier libs are already on every host running the daemon (hard build-dep of ffmpeg-v4l2-request-fourier). Firefox-fourier and mpv-fourier already dlopen them via the same path. - /etc/ld.so.conf.d/fourier.conf entry resolves the new sonames from /opt/fourier/lib via the ld cache; dlopen-by-soname works without LD_LIBRARY_PATH wrappers. - Build-side: daemon's pkg_check_modules picks up libav*.pc from /opt/fourier/lib/pkgconfig when PKG_CONFIG_PATH includes that directory (build-deb.sh follow-up will set it). - API surface unchanged: avcodec_send_packet / receive_frame / AVCodecContext flags / AVFrame fields are all stable between FFmpeg 7.1 and 8.1. Verified clean cross-compile on hertz. Wire protocol unchanged. No kmod bump. Next step (follow-up PRs) ------------------------- 1. ffmpeg-v4l2-request-fourier patch: add 0003-daedalus-fourier- substitute-h264-idct4.patch that replaces ff_h264_idct_add_neon in libavcodec/aarch64/h264dsp_init_aarch64.c with a thunk calling daedalus_recipe_dispatch_h264_idct4. 2. Repeat for IDCT 8×8, deblock luma-v, qpel mc20 (one kernel per PR for reviewability; bench delta + decode_us delta documented per substitution). 3. marfrit-packages bump to pick up the new daemon + the substituted fourier package. |
||
|
|
3bc0da168c |
Merge pull request 'daemon: per-frame decode_us + periodic stats (#11 step 1)' (#15) from noether/daemon-decode-stats into main
Reviewed-on: #15 |
||
|
|
814b74d0bb |
daemon: per-frame decode_us + periodic stats summary (#11 step 1)
Establishes observable baseline metrics before any daedalus-fourier
kernel substitution lands. Step 1 of the daemon-rewrite arc tracked
at daedalus-v4l2#11.
Changes
-------
- Per-frame `decoder: OK ...` log line now carries decode_us=N (the
send_packet + receive_frame wall-clock cost in microseconds —
exclusively the libavcodec round-trip, not the bitstream pack /
SPS-PPS synth / pack-to-planes work).
- New "decoder stats" summary line every DAEDALUS_STATS_EVERY (60)
decoded frames, reporting: codec, frame count, window seconds,
fps, avg decode_us, MBs/s throughput, bytes/MB bitrate.
Sample
------
decoder stats: codec=h264 frames=300 window=12.32s fps=24.35
avg_decode_us=4216.4 mbs_per_s=87643 bs_b_per_mb=1.56
What this tells us
------------------
Steady-state on higgs (Pi CM5) decoding bbb_720p_h264.mp4:
~4 ms decode_us, ~90 K MBs/s, well under the daedalus-fourier
NEON kernel ceilings (IDCT 4×4 @ 175 Mblocks/s, deblock @ 92 Medges/s,
qpel mc20 @ 131 Mblocks/s — all 100-1000× over our actual workload).
Means the 4 ms/frame is mostly libavcodec's CABAC + MV prediction +
intra prediction overhead, NOT the pixel-math primitives.
Substituting a single primitive would shave only a small slice of
the 4 ms. Useful as guidance for the upcoming substitution work —
we'll pick the primitive with the largest cycle cost relative to
the alternative, and measure CPU saved per substitution.
No behaviour change: counters are static + unsynchronised (the
chardev event loop is single-threaded); reset when codec_id changes.
clock_gettime(CLOCK_MONOTONIC) for timing.
|
||
|
|
77e14e5a19 |
Merge pull request 'daemon: link daedalus-fourier + log substrate availability at startup' (#13) from noether/daemon-link-daedalus-fourier into main
Reviewed-on: #13 |
||
|
|
88b2ebfaa9 |
daemon: link daedalus-fourier + log substrate availability at startup
First incremental step toward H.264 daemon-rewrite (daedalus-v4l2#11):
make the daedalus-fourier kernel library available to the daemon
process so subsequent patches can substitute its primitives
(IDCT 4×4, IDCT 8×8, luma vertical deblock, etc.) for libavcodec's
per-MB pixel math.
This patch does NOT yet dispatch any kernels. It only:
- Adds `pkg_check_modules(DAEDALUS_FOURIER REQUIRED daedalus-fourier)`
to the daemon's CMakeLists, with explicit link ordering
(libdaedalus_core.a must precede -lvulkan because the static
archive references vulkan symbols and the linker resolves
left-to-right). We bypass IMPORTED_TARGET because pkg-config's
Requires.private chain leaves CMake's dependency graph reordering
the archive after -lvulkan, breaking the static link.
- Calls daedalus_ctx_create_no_qpu() at daemon startup, logs the
substrate-availability line, destroys the context at exit.
no_qpu mode skips V3D Vulkan probe — proves linkage works
without depending on shader-path resolution (which is a
separate piece of work, since v3d_runner currently loads
.spv files from cwd-relative paths and consumer would need
a search path override).
Sample journal line:
[2026-05-21 17:59:35.271 INFO] daedalus-fourier: linked, ctx alive
(no_qpu mode; has_qpu=0)
Build-test verified on hertz (Pi 5 dev host) against an installed
copy of daedalus-fourier r35+gd87239d (from marfrit/daedalus-fourier
PR #1). Binary links cleanly, --help prints, daemon mode opens
chardev (fails predictably on hertz which has no daedalus_v4l2
kmod; on higgs this is the existing working path).
Follow-up patches per daedalus-v4l2#11:
1. Instrument the existing libavcodec decode path to count
per-frame IDCT blocks / deblock edges / MC tiles so we have
a baseline of what work the daemon dispatches for a typical
YouTube H.264 stream.
2. Substitute daedalus-fourier kernels one at a time, measuring
CPU saved per substitution.
3. Wire shader path resolution into daedalus_ctx_create() for
the QPU substrate (V3D opportunistic helper paths).
Wire protocol unchanged. DAEDALUS_PROTO_VERSION stays at 0.
|
||
|
|
64b9599e47 |
Merge pull request 'daemon: AV_CODEC_FLAG_LOW_DELAY for H.264 — implements #11 part (2)' (#12) from noether/daemon-low-delay-h264 into main
Reviewed-on: #12 |
||
|
|
234a103084 |
daemon: AV_CODEC_FLAG_LOW_DELAY for H.264 — fix display-reorder breaking V4L2 1:1
Force libavcodec's H.264 decoder to emit frames in DECODE order (one frame per send_packet, no internal display-order reorder queue). Single-line addition: ctx->flags |= AV_CODEC_FLAG_LOW_DELAY before avcodec_open2, gated on codec_id == DAEDALUS_CODEC_H264. Closes daedalus-v4l2#11 part (2). Background ---------- PR #7's "parking design" approach to the H.264 display-reorder problem broke libva-v4l2-request-fourier's 1:1 CAPTURE-completion contract (see #9 + #10). After the revert, the visible "2 1 4 3" pair-swap regressed and the only path forward was to align the daemon's output ordering with what V4L2 stateless clients expect: **decode order, one CAPTURE buffer per OUTPUT slice, with display reorder pushed upstream to ffmpeg-vaapi's per-VAAPI-surface POC logic** (which it already does correctly for every real H.264 hardware decoder via VAPictureParameterBufferH264). How LOW_DELAY does this ----------------------- Inside libavcodec/h264dec.c, the flag sets h->low_delay = 1. h264_select_output_frame (h264_picture.c) emits the just-decoded picture immediately instead of routing through the display-order DPB output queue. DPB management for reference frames (short_ref / long_ref) is unaffected — B-frame decoding correctness is preserved; only the output buffering is bypassed. Skipped for VP9 / AV1 — those codecs don't reorder internally, so the flag would be a no-op but adds no value. Verified -------- On higgs (Pi CM5, 6.18.29+rpt-rpi-2712), test daemon hot-swapped into /usr/bin/daedalus_v4l2_daemon, mpv --hwdec=vaapi-copy --frames=300 against bbb_720p_h264.mp4: 311 REQ_DECODEs received, 308 successful "decoder: OK" responses (99.04% steady-state delivery — 3 lost at GOP boundaries, no compounding drift). mpv plays to its --frames cap and exits cleanly with "End of file". No "Unable to dequeue buffer", no "Failed to end picture decode", no "AVHWFramesContext: Failed to sync surface" — all the failures from #9 are gone. Builds clean against ffmpeg-v4l2-request-fourier libavcodec. |
||
|
|
5d8b4369e5 |
Merge pull request 'kernel + daemon: revert PRs #7 + #8 (parking design incompatible with V4L2 stateless 1:1 expectation)' (#10) from noether/revert-parking-pr7-pr8 into main
Reviewed-on: #10 |
||
|
|
714d781d22 |
Revert "Merge pull request 'kernel + daemon: H.264 B-frame display reorder fix (closes #6)' (#7) from noether/kernel-daemon-h264-reorder-fix into main"
This reverts commit |
||
|
|
49e60c9bba |
Revert "Merge pull request 'kernel: claim src/dst at device_run, not at buf_done (fixes panic from #7)' (#8) from noether/kernel-claim-bufs-at-device-run into main"
This reverts commit |
||
|
|
6ffe92bcac |
Merge pull request 'kernel: claim src/dst at device_run, not at buf_done (fixes panic from #7)' (#8) from noether/kernel-claim-bufs-at-device-run into main
Reviewed-on: #8 |
||
|
|
f10a26d883 |
kernel: claim src/dst at device_run, not at buf_done
Hard reboot observed on higgs (Pi CM5) during the first mpv vaapi-copy playback against the freshly-deployed r28+g79256dc stack — kernel panic, no persistent journal, no recoverable trace. Bug introduced by the daedalus-v4l2#6 reorder fix (#7). Cause ----- The new completion path runs `v4l2_m2m_job_finish` on SRC_CONSUMED even when the dst_buf is still parked (waiting for a future HAS_PIXELS). job_finish moves the m2m_ctx back to IDLE, the scheduler dispatches the next device_run — which calls `v4l2_m2m_next_dst_buf`, which returns the head of the CAPTURE ready-queue, which is STILL the parked dst_buf because we never removed it. Two inflight entries now reference the same vb2_buffer; the later HAS_PIXELS triggers `v4l2_m2m_dst_buf_remove_by_buf` on a vb2_buffer whose list_head is no longer linked to that queue, and `list_del()` smashes the next/prev pointers of whatever ELSE was at those addresses. Fix --- Take both src and dst off `m2m_ctx`'s rdy_queue at device_run — as soon as `v4l2_m2m_next_*_buf` has peeked them and all early-exit validation has passed. After that, the daemon owns both halves exclusively via the inflight item; the m2m scheduler can't re-issue them on the next device_run. Completion path drops the redundant `_remove_by_buf` calls — list is already detached, so `buf_done` alone is correct. Matches the amphion `vdec.c`/`venc.c` pattern (which also claims at device_run for the same reason: amphion's encode pipeline parks output buffers across multiple frames waiting for the codec to finish, structurally the same as our H.264 B-frame DPB parking). `fail_buf_error` learns about the new `claimed` flag and skips the `v4l2_m2m_*_buf_remove` calls when the buffers have already been removed by-buf at device_run. Verified -------- Builds clean against 6.18.29+rpt-rpi-2712. Field test pending — deploy via marfrit-packages bump in lock-step with the daemon (which doesn't need to change for this fix; PROTO_VERSION stays at 1). |
||
|
|
79256dc7ef |
Merge pull request 'kernel + daemon: H.264 B-frame display reorder fix (closes #6)' (#7) from noether/kernel-daemon-h264-reorder-fix into main
Reviewed-on: #7 |
||
|
|
15fc2aba14 |
kernel + daemon: H.264 B-frame display reorder fix (issue #6)
H.264 streams with B-frames showed visibly pair-swapped output in
mpv / Firefox playback through the libva → daedalus_v4l2 path —
"frames went 2 1 4 3 6 5 instead of 1 2 3 4 5 6". Reproduced in mpv
with --hwdec=vaapi-copy at 720p (bypassing Firefox's compositor),
confirming the bug was in this daemon pipeline, not downstream.
Root cause
----------
libavcodec's H.264 decoder internally reorders output to DISPLAY
order before returning from avcodec_receive_frame. The daemon
previously called send_packet → receive_frame ONCE per REQ_DECODE
and shipped the resulting pixels in a RESP_FRAME tagged with the
SAME cookie. For B-frames this is wrong: the frame returned from
receive_frame may belong to an EARLIER bitstream (libavcodec held
it for display-order release). Cookie N's CAPTURE buffer therefore
got cookie N-2's pixels, while cookie N-2's CAPTURE buffer got
silently marked VB2_BUF_STATE_ERROR (the daemon returned
DAEDALUS_DECODE_NO_FRAME for the cookie whose pixels were held).
Fix shape
---------
Decouple kernel cookie identity (decode-order routing) from
libavcodec's display-ordered output. Wire-protocol changes:
REQ_DECODE + __u64 src_pts (= src_buf->vb2_buf.timestamp)
RESP_FRAME + __u32 flags (HAS_PIXELS | SRC_CONSUMED)
+ __u64 output_src_pts (= frame->pts on drain)
PROTO_VERSION bumped 0 → 1. Lock-step rebuild required.
Kernel
------
device_run now mirrors src_buf->vb2_buf.timestamp into req->src_pts
before sending REQ_DECODE, and stores it on the inflight item so
the completion path can stamp dst_buf.timestamp explicitly when
src/dst lifecycles decouple (V4L2_BUF_FLAG_TIMESTAMP_COPY's auto-
pairing no longer applies).
daedalus_complete_resp_frame splits into:
HAS_PIXELS: pack pixels into THIS cookie's CAPTURE buffer,
stamp dst timestamp from inflight->src_pts,
v4l2_m2m_buf_done(dst, DONE/ERROR).
No job_finish here.
SRC_CONSUMED: release the bound media_request, run
v4l2_m2m_buf_done(src) + v4l2_m2m_job_finish so
the scheduler can dispatch the next REQ. dst_buf
may still be parked at this point.
Inflight entry is removed and freed only when BOTH src_buf and
dst_buf have been cleared. Combined HAS_PIXELS|SRC_CONSUMED RESPs
(steady-state VP9/AV1 with no reorder lag) collapse to the prior
1:1 behaviour for free.
Daemon
------
daedalus_decoder_run_request split into three primitives:
daedalus_decoder_submit — set pkt->pts = req->src_pts,
avcodec_send_packet.
daedalus_decoder_drain_one — avcodec_receive_frame, populate
resp meta + output_src_pts (= the
frame's pts, carried back from
the bitstream that produced it).
daedalus_decoder_pack_current — pack current AVFrame into the
caller-mapped CAPTURE planes.
chardev_client maintains a small (src_pts → cookie, cached_req)
table indexed linearly (≤64 entries; bounded by V4L2 client buffer
pool depth). On each REQ_DECODE:
1. Register (src_pts → cookie) in the table.
2. submit().
3. Drain loop: for each frame returned, look up its owner cookie
via pending_lookup(frame->pts), GET_DMABUF for THAT cookie,
pack pixels, emit RESP_FRAME(owner_cookie, HAS_PIXELS,
output_src_pts=frame->pts). Combine with SRC_CONSUMED when
owner_cookie equals the current REQ's cookie.
4. If the current REQ's cookie wasn't drained inside the loop
(libavcodec held the frame), emit a standalone SRC_CONSUMED
RESP so the kernel runs job_finish + dispatches the next REQ;
dst_buf for this cookie stays parked until a future drain
produces its pixels.
VP9 / AV1 paths are unchanged in behaviour: one frame per REQ,
HAS_PIXELS|SRC_CONSUMED in one combined RESP.
Verified
--------
Builds clean cross-compiled on higgs against 6.18.29+rpt-rpi-2712
(Pi CM5). Frame-size warning in device_run is pre-existing
(unchanged by this commit).
|
||
|
|
7ff2d897ea |
Merge pull request 'kernel: register H.264 DECODE_MODE + START_CODE menu controls' (#4) from noether/kernel-h264-menu-ctrls into main
Reviewed-on: #4 |
||
|
|
69a62a922f |
kernel: register H.264 DECODE_MODE + START_CODE menu controls
libva-v4l2-request sets V4L2_CID_STATELESS_H264_DECODE_MODE and
V4L2_CID_STATELESS_H264_START_CODE on the device fd at context init
(see libva-v4l2-request-fourier src/context.c:577 — best-effort call,
result is (void)cast). Our ctrl_handler did not advertise either
control, so v4l2-core returned EINVAL on validate; userspace logged
the noisy
v4l2-request: Unable to set control(s): Invalid argument
(error_idx=2/2 ioctl-level)
at every Firefox/ffmpeg context creation, despite decode itself
succeeding (the daemon already operates as FRAME_BASED + ANNEX_B and
the per-request SPS/PPS/SCALING_MATRIX/DECODE_PARAMS batch lands
fine).
Register the two as v4l2_ctrl_new_std_menu with the only value each
the daemon actually supports — FRAME_BASED for DECODE_MODE,
ANNEX_B for START_CODE — and mask out the unsupported alternates
(SLICE_BASED, NONE). Pattern matches rkvdec / hantro. Update the
handler-init capacity hint to ARRAY_SIZE(daedalus_stateless_ctrls)
+ 2 to cover the additions.
Verified: builds clean on 6.18.29+rpt-rpi-2712 (Pi CM5) DKMS source
tree.
|
||
|
|
f0d41867f6 |
Merge pull request 'kernel: per-ctx vb2 lock — Firefox multi-process VAAPI unblock' (#3) from noether/kernel-per-ctx-vb-mutex into main
Reviewed-on: #3 |
||
|
|
a3ada8ba38 |
kernel: per-ctx vb2 lock so concurrent clients don't serialise on dev mutex
daedalus_queue_init was wiring both src_vq->lock and dst_vq->lock to
ctx->dev->m2m_lock — a device-wide mutex. That serialises every
vb2 ioctl (S_FMT, REQBUFS, QBUF, DQBUF, STREAMON, ...) across ALL
concurrent clients of /dev/video0. For a single-client consumer
like the test_m2m_* tools it doesn't matter; for Firefox, which
spawns separate content + RDD + GPU processes that each open
/dev/video0 and run libva probe simultaneously, the contention
showed up as EBUSY from one libva session's S_FMT(OUTPUT_MPLANE)
when another session was mid-streamon on the same device.
Observable on higgs (Pi CM5):
$ MOZ_VA_API_ENABLED=1 LIBVA_DRIVER_NAME=v4l2_request firefox
...
v4l2-request: phase 8.10: opened daedalus_v4l2 at video_fd=32 ...
v4l2-request: cap_pool_init: 24 slots ready
v4l2-request: Unable to set format for type 10: Device or
resource busy
After this fix, each open() gets its own ctx->vb_mutex and the
per-context vb2_queue locks are independent — Firefox's multi-
process VAAPI clients no longer fight each other. YouTube
playback on higgs runs through daedalus at ~230 fps sustained
(640x368, libavcodec dlopen path), 7× headroom over the 30fps
target.
cedrus / rkvdec / hantro all use the per-ctx vb mutex pattern
for the same reason. This mirrors them.
Lifecycle:
- mutex_init in daedalus_open (right after the kzalloc that
creates ctx, before v4l2_fh_init).
- mutex_destroy in daedalus_release (after v4l2_fh_exit, before
kfree), and in the err_ctrl unwind path in daedalus_open.
Verified end-to-end on higgs:
- rmmod + modprobe the rebuilt .ko.
- Restart daedalus-v4l2.service.
- Firefox YouTube playback engages VAAPI, daemon journal shows
cookie=1..N codec=3 (H.264) REQ_DECODE / decoder:OK pairs
with unique per-frame fnv1a hashes.
- No EBUSY in either firefox stderr or daemon journal during
the entire session.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
|
||
|
|
462aa4b480 |
Merge pull request 'kernel: bind request controls to p_cur via v4l2_ctrl_request_setup' (#2) from noether/kernel-ctrl-request-setup into main
Reviewed-on: #2 |
||
|
|
29f16ece13 |
kernel: bind request controls to p_cur before reading them
device_run was reading ctrl->p_cur.p_h264_* directly, but v4l2-m2m's request scheduler does NOT auto-bind the in-flight media_request's control values to the ctrl handler's p_cur slots — drivers have to call v4l2_ctrl_request_setup() explicitly. cedrus / rkvdec / hantro all do this in their device_run; daedalus didn't. Result: daedalus_collect_h264_meta() read stale or default values (whatever the prior request had left in p_cur, or v4l2_ctrl_new_custom initial state if no prior request had completed) instead of the S_EXT_CTRLS V4L2_CTRL_WHICH_REQUEST_VAL values libva-v4l2-request- fourier had just sent for THIS frame. The mismatch was a smoking gun on higgs after libva PR #9 / packages PR #52 landed an instrumentation log at h264_set_controls entry: libva boundary (sent to kernel): VAProfile=13 seq_fields=0x00032051 pic_fields=0x00000500 num_ref_frames=1 daedalus daemon (read from kernel p_cur): prof=100 level=41 ref_frames=0 flags=0x10 pps_flags=0x0 After calling v4l2_ctrl_request_setup() at the top of device_run: daedalus daemon (read from kernel p_cur): prof=66 level=11 ref_frames=1 poc_type=2 flags=0x50 pps_flags=0x88 — matches what libva sent, matches the bitstream's actual SPS. End-to-end test on higgs with libva-v4l2-request-fourier 1.0.0+r382 +gc1bb444 (after-fix-3-and-fix-4-instrumentation) + this kernel patch: $ LIBVA_DRIVER_NAME=v4l2_request ffmpeg -hwaccel vaapi \ -hwaccel_device /dev/dri/renderD128 -i h264_test.mp4 \ -frames:v 1 -f null - ... rc=0 daemon journal: zero "error while decoding MB" lines, zero "reference frames exceeds max" lines. Per-frame fnv1a hashes differ (0xf1c515aa, 0x16e915e8, 0x16bd16cc, ...) instead of the constant 0x6a6a05c5 "give-up-and-zero" hash from before — libavcodec is actually decoding real pixel content from each P-frame. Pair note: the daemon side already calls v4l2_ctrl_request_complete in daedalus_complete_resp_frame (line 834) — this commit pairs the setup half with that completion half. The daemon side change (decoder.c) is a small log-level promotion: the per-frame "h264 SPS/PPS prepended ..." trace went from log_debug to log_info so the journal shows what's being shipped into libavcodec without needing a daemon rebuild with --debug. Matches the libva- side h264_set_controls instrumentation that landed in libva PR #9. Closes part of issue libva-v4l2-request-fourier#8 — the SPS/PPS field-value gap. Profile/level still come from libva's session- derived hardcoded values (h264_profile_to_idc + h264_derive_level_ idc) which is sufficient for libavcodec to accept the synthesised NAL unit; a true stream-parsed profile/level would need SPS-NAL parsing in libva — separate operator-design call. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> |
||
|
|
3dd0eb070a |
Merge pull request 'DAEMON-PPS: synthesise H.264 SPS/PPS NAL units from V4L2 controls' (#1) from noether/daemon-pps-h264-nal-synth into main
Reviewed-on: #1 |
||
|
|
8c1d9960c4 |
DAEMON-PPS: synthesise H.264 SPS/PPS NAL units from V4L2 controls
libva-v4l2-request-fourier (and any V4L2-stateless-API consumer)
passes H.264 SPS/PPS as separate V4L2_CID_STATELESS_H264_{SPS,PPS}
controls; only the slice NAL goes into the OUTPUT buffer. This is
correct per the V4L2 stateless contract. But libavcodec — which
the daedalus daemon uses for actual decode (Option γ) — wants a
self-contained AnnexB stream including SPS+PPS before any slice.
Result on higgs: "non-existing PPS 0 referenced" + decode_slice_
header errors on every H.264 frame, even after LIBVA-1 and -2
routing correctly delivered the request to the daemon.
Fix splits across kernel + daemon, keeping the kernel module as a
thin transport and putting the actual NAL encoding in userspace:
include/daedalus_v4l2_proto.h:
Add struct daedalus_h264_meta (the four v4l2_ctrl_h264_*
structs the kernel collects) and DAEDALUS_REQ_FLAG_H264_META
(set in req.flags when the meta block is present between the
daedalus_req_decode prefix and the slice bitstream).
kernel/daedalus_v4l2_main.c:
Add daedalus_collect_h264_meta() — reads the H.264 ctrl values
from the bound media_request via v4l2_ctrl_find +
ctrl->p_cur.p_h264_*. device_run() calls it on H.264 codec_id,
copies the structs into the REQ_DECODE payload between the
prefix and bitstream, and sets the flag. Payload size is
bounds-checked against DAEDALUS_PROTO_MAX_PAYLOAD so an over-
sized slice + meta fails loud instead of truncating.
daemon/src/bitstream_writer.{c,h}:
New module — MSB-first bit packer with H.264 Exp-Golomb ue(v)
and se(v) coding + rbsp_trailing_bits alignment. Sticky
overflow flag so callers can verify the output buffer wasn't
truncated.
daemon/src/h264_nal_synth.{c,h}:
New module — turns v4l2_ctrl_h264_sps / v4l2_ctrl_h264_pps
into AnnexB-framed NAL units per ITU-T H.264 7.3.2.1 / 7.3.2.2.
Emits emulation prevention bytes (0x03 after every 00 00 in the
EBSP) and the 4-byte start code (0x00000001). Coverage matches
what V4L2 stateless surface gives us: VUI parameters and full
scaling matrices are NOT emitted (V4L2 doesn't carry them — the
seq_scaling_matrix_present_flag is set to 0 and libavcodec uses
flat defaults, which matches the de-facto behaviour of most
H.264 streams libva-v4l2-request drives).
daemon/src/decoder.c:
daedalus_decoder_run_request() now takes an optional
h264_meta parameter. For codec_id == H264 with meta != NULL,
synthesises SPS+PPS NAL units, allocates a combined
[SPS][PPS][slice] buffer (+ AV_INPUT_BUFFER_PADDING_SIZE), and
feeds that to avcodec_send_packet instead of the raw slice.
VP9/AV1 path unchanged (frames are self-contained). Cleanup
now goes through a unified `out:` label so the assembled
buffer is always freed on every exit (including the existing
decoder_open_codec / no-frame / receive_frame failure paths).
daemon/src/chardev_client.c:
handle_req_decode() peels off the optional meta block when the
flag is set, passes it through to the decoder, and updates
the payload-length consistency check (now allows for an extra
sizeof(daedalus_h264_meta) when the flag is on).
Build (boltzmann aarch64): clean compile of all daemon sources,
including bitstream_writer + h264_nal_synth + the refactored
decoder.c. Kernel module compile to be verified via DKMS rebuild
on higgs in the marfrit-packages bump that follows.
Test plan: with this commit + a marfrit-packages daedalus pin
bump, higgs's ffmpeg -hwaccel vaapi -i h264_test.mp4 should
produce a successful decode (vs. the previous "non-existing PPS 0
referenced" failure). The daemon log should show:
decoder: opened h264 context
decoder: h264 prepended SPS=NB PPS=MB slice=KB
decoder: OK 320x240 fmt=0 (yuv420p) fnv1a=0x...
VP9 / AV1 behaviour unchanged — they don't carry meta and the
existing per-frame self-describing path still applies.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
|
||
|
|
481279c9bf |
packaging/systemd: ship daedalus-v4l2.service + modules-load drop-in
Canonical location for the systemd unit + module-autoload conf,
referenced by both arch/daedalus-v4l2 and debian/daedalus-v4l2
in marfrit-packages. Was a real gap in the original packaging:
postinst installed the daemon binary but nothing started it, so
the libva path got REQ_DECODE messages with nobody listening on
/dev/daedalus-v4l2 and timed out.
packaging/systemd/daedalus-v4l2.service:
- Type=simple, ExecStart=/usr/bin/daedalus_v4l2_daemon daemon
- After=systemd-modules-load.service + ConditionPathExists=
/dev/daedalus-v4l2 (so it only starts when the kernel module
is loaded; doesn't false-fire on non-daedalus hosts that
happen to have the package installed)
- Restart=on-failure, RestartSec=2
- MemoryHigh=128M / MemoryMax=256M (Phase 8.9 stress run
showed RSS settling around 25 MiB; leaves headroom)
- Hardening: NoNewPrivileges, ProtectSystem=strict, ProtectHome,
PrivateTmp, ProtectKernel*, SystemCallFilter=@system-service.
PrivateDevices=false because we DO need /dev/daedalus-v4l2
packaging/systemd/daedalus-v4l2.modules-load:
- Drops to /etc/modules-load.d/daedalus-v4l2.conf so the kernel
module loads before the .service unit fires.
Both files are picked up by the package recipes (next bump in
marfrit-packages) — neither lives in /usr/lib/systemd/system or
/etc/modules-load.d until the .deb / .pkg installs them.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
f0cd29a340 |
kernel: v4l2_fh_add/del gained file* arg in 6.18 — version-conditional
DKMS build failure on higgs (Pi CM5, kernel 6.18.29+rpt-rpi-2712): daedalus_v4l2_main.c:1049: error: too few arguments to function 'v4l2_fh_add' v4l2-fh.h:97: void v4l2_fh_add(struct v4l2_fh *fh, struct file *filp); daedalus_v4l2_main.c:1063: error: too few arguments to function 'v4l2_fh_del' Signature changed exactly at v6.18 (verified v6.13–v6.17 still use the one-arg form via raw.githubusercontent.com tag walk). Wrap the calls with LINUX_VERSION_CODE >= KERNEL_VERSION(6, 18, 0) so the module keeps building against: * 6.12 LTS / RPi 6.12.75 (one-arg) — hertz * 6.12.88+deb13-arm64 (one-arg) * 6.18.29+rpt-rpi-2712 (file* arg) — higgs running kernel Build verified on both: hertz 6.12.75 clean, higgs 6.18.29 clean + modprobe daedalus_v4l2 succeeds, /dev/daedalus-v4l2 + /dev/video0 appear. Add #include <linux/version.h> for KERNEL_VERSION + LINUX_VERSION_CODE (also pulled transitively via module.h but explicit is better than implicit). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
f55b2cd002 |
kernel: media_request_get/put around inf->req (UAF safety)
Sonnet pre-deployment review flagged a SHIP-WITH-EYES-OPEN risk:
Phase 8.13's inf->req captured src_buf->vb2_buf.req_obj.req as a
raw pointer with no media_request_get(). On the normal decode
path that's fine because vb2-core holds its own reference until
v4l2_m2m_buf_done_and_job_finish releases it.
But on a concurrent cancel (MEDIA_IOC_REQUEST_REINIT or a process
kill triggering buf_request_complete from the cancel path before
RESP_FRAME comes back), vb2 could drop its reference first. Our
inf->req would then dangle through v4l2_ctrl_request_complete +
buf_done_and_job_finish — UAF.
Fix matches the cedrus / rkvdec pattern: take our own reference
when we capture the pointer, release it after we're done with it
(after buf_done_and_job_finish to keep the ordering crystal-clear).
/* in daedalus_device_run, after inf->req = src_buf->...->req */
if (inf->req)
media_request_get(inf->req);
/* in daedalus_complete_resp_frame, after buf_done_and_job_finish */
if (inf->req)
media_request_put(inf->req);
Verified on hertz:
- libva path (request-bound, inf->req != NULL): byte-exact NV12,
same FNV-1a as standalone.
- test_m2m_stream (direct QBUF, inf->req == NULL): 30/30 frames
decoded, conditional skip works.
- No kernel oops / WARN, no leak in dmesg.
Add #include <media/media-request.h> for the helpers.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
f04d7000f8 |
Phase 8.13: byte-exact end-to-end via libva (consumer target hit)
The project's consumer-side goal landed: a real VAAPI consumer
(ffmpeg with -hwaccel vaapi) drives our libva backend → V4L2
driver → daemon → byte-exact NV12 output back to ffmpeg.
ffmpeg -hwaccel vaapi -hwaccel_device /dev/dri/renderD128 \
-hwaccel_output_format nv12 -i vp9_small.ivf \
-f rawvideo -y /tmp/vp9_via_libva.nv12
cmp /tmp/vp9_via_libva.nv12 /tmp/vp9_ref_for_libva.nv12 → match
18432-byte NV12 byte-for-byte identical to plain ffmpeg
-pix_fmt nv12 software decode. The project_consumer_target
memory's deliverable shape — "V4L2 stateless node consumed by
a real VAAPI client" — is achieved.
Two related kernel changes:
1. v4l2_ctrl_handler_setup(&ctx->hdl) after registration —
matches rkvdec/cedrus/hantro. Brings each registered
compound control out of "uninitialised" state via
std_init_compound defaults.
2. Per-request control completion in the decode path —
the real fix for "Timeout when waiting for media request".
vb2-core's vb2_buffer_done unbinds the BUFFER's req_obj
on normal decode completion, but the per-request CONTROL
object stays bound. buf_request_complete fires only from
queue-cancel paths (vb2-core line 2284), NOT from normal
buf_done. The driver must call
v4l2_ctrl_request_complete(req, hdl) explicitly from the
completion path.
struct daedalus_inflight gained a `struct media_request
*req` field, captured from src_buf->vb2_buf.req_obj.req
in device_run. daedalus_complete_resp_frame then calls
v4l2_ctrl_request_complete before
v4l2_m2m_buf_done_and_job_finish — triggers
MEDIA_REQUEST_STATE_COMPLETE and wakes the request fd
poll.
For non-request flows (test_m2m_stream direct QBUF)
inf->req is NULL; the conditional skips the call.
Both consumer styles work concurrently.
Diagnostic clarification (was Phase 8.13a):
strace traced three S_EXT_CTRLS calls per frame:
1. H264_PROFILE + H264_LEVEL → EINVAL (we don't register)
2. HEVC_PROFILE + HEVC_LEVEL → EINVAL (we don't register)
3. VP9_FRAME + VP9_COMPRESSED_HDR → SUCCESS
The first two are harmless: libva probes whether we support
H264/HEVC integer profile/level controls during config
negotiation; we don't (we expose them as stateless), so EINVAL
just falls through. The actual VP9 stateless controls (#3)
succeeded all along — the libva-side "Unable to set control(s)"
log was misleading us into thinking the control path was the
bug.
Verification on hertz (Pi 5, 6.12.75+rpt-rpi-2712):
daemon log:
REQ_DECODE cookie=1 codec=1 bitstream=1566 bytes capture=128x96 1 planes
decoder: opened vp9 context
decoder: OK 128x96 fmt=0 (yuv420p) fnv1a=0x1eb34bfe ...
ffmpeg side:
no Timeout, no Decoding error
/tmp/vp9_via_libva.nv12: 18432 bytes
cmp vs reference: byte-for-byte identical.
Roadmap update:
- 8.10/8.11, 8.12, 8.13 marked closed with closure docs.
- 8.14 = multi-frame VP9 via libva, AV1 + H.264, mpv/Firefox
higher-level consumers.
Per correctness-before-speed:
- strace + kernel-source-reading found the actual root cause
rather than guessing.
- Conditional v4l2_ctrl_request_complete preserves the existing
test_m2m_stream non-request path — both consumer styles work
concurrently without per-flow branching elsewhere.
- Byte-exact pixel comparison, not "frame size matches."
Phase 8.14 next: multi-frame stream + multi-codec via libva +
mpv/Firefox.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
a7d585eee8 |
Phase 8.12: first VP9 frame decoded via libva
ffmpeg -hwaccel vaapi → libva-v4l2-request-fourier → /dev/video0 → daedalus_v4l2 kernel → REQ_DECODE on the chardev → daemon FFmpeg decode → byte-exact NV12 (FNV-1a 0x1eb34bfe, same hash the standalone test_m2m_stream produces for the same 128x96 VP9 keyframe). The pixel-correct decode through the libva path is the milestone. What's NOT yet working: libva times out on the media_request fd because buf_request_complete never fires (vb->req_obj.req is NULL when buf_done runs — the S_EXT_CTRLS EINVAL leaves the buffer un-bound to the request even though the buffer queues anyway). Phase 8.13 fixes the EINVAL so the request bind takes and the completion signal propagates. Kernel V4L2 request API integration: - media_device_ops.req_validate / req_queue = vb2_request_ validate / v4l2_m2m_request_queue (Phase 8.11) — MEDIA_IOC_REQUEST_ALLOC succeeds. - vb2_queue.supports_requests = true on OUTPUT queue — without this v4l2-core rejects S_EXT_CTRLS(REQUEST_VAL). - vb2_ops.buf_request_complete = daedalus_buf_request_complete → v4l2_ctrl_request_complete(req, &ctx->hdl). Without this v4l2-core WARNs at videobuf2-v4l2.c:440. - vb2_ops.buf_out_validate: sets field=V4L2_FIELD_NONE on OUTPUT buf. Required for the same WARN check. - requires_requests intentionally NOT set: lets the existing test_m2m_stream (direct QBUF, no request) keep working alongside the libva path. Stateless control re-registration: - Switched from v4l2_ctrl_new_std_compound(NULL p_def) to v4l2_ctrl_new_custom(&cfg, NULL) — pattern rkvdec / cedrus / hantro use. v4l2-core auto-fills elem_size + type from std table (verified: VP9_FRAME elem_size=168, matches sizeof(struct v4l2_ctrl_vp9_frame)). - No-op s_ctrl callback so SET requests don't crash — daemon ignores values, FFmpeg re-parses the bitstream. Verification on hertz (Pi 5, 6.12.75+rpt-rpi-2712): ffmpeg -hwaccel vaapi -i vp9_small.ivf … daemon: REQ_DECODE cookie=1 codec=1 bitstream=1566 bytes capture=128x96 1 planes daemon: decoder: opened vp9 context daemon: decoder: OK 128x96 fmt=0 (yuv420p) fnv1a=0x1eb34bfe … Same FNV-1a hash as the standalone test_m2m_stream produces for the same VP9 keyframe. End-to-end through libva. Remaining (Phase 8.13): - S_EXT_CTRLS EINVAL on V4L2_CID_STATELESS_VP9_FRAME despite matching elem_size — needs deeper validate-path debugging. - Once the request bind takes, buf_request_complete fires on buf_done, request fd signals completion, libva DQBUFs the decoded NV12. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
0de0288dce |
Phase 8.10+8.11: libva consumer integration scaffold
Brings daedalus_v4l2 from "standalone test client" to "VAAPI-
discoverable decoder" by adding the surface formats and
media-controller plumbing that libva-v4l2-request-fourier
(sibling repo) requires.
libva-v4l2-request-fourier patches (pushed separately):
- b5b3acf: daedalus_v4l2 added to known_decoder_drivers
- 2146341: meson option gate
This commit (daedalus-v4l2 side, 3 production changes):
1. V4L2_PIX_FMT_NV12 (single-plane) on CAPTURE
- Added to daedalus_capture_formats[] alongside NV12M + P010
- daedalus_fill_capture_fmt handles num_planes=1 case
(sizeimage = W*H*3/2, bytesperline = W)
- daemon pack_nv12_single_to_plane: Y at base+0,
interleaved CbCr at base+(stride*H); same byte content
as NV12M two-plane, different layout
- Required because libva-v4l2-request-fourier's video.c
only knows non-multi-plane NV12 (it advertises
v4l2_mplane=true but uses the single-plane fourcc).
- Verified byte-exact via test_m2m_stream against
ffmpeg -pix_fmt nv12 reference (VP9 1080p 10 frames,
31 MB).
2. V4L2 Request API media ops
- daedalus_media_ops = { vb2_request_validate,
v4l2_m2m_request_queue } assigned to mdev.ops before
media_device_init.
- Without this, MEDIA_IOC_REQUEST_ALLOC returned
-ENOTTY and no VAAPI consumer could allocate a
media_request.
3. Stateless control registration via v4l2_ctrl_new_custom
- Switched from v4l2_ctrl_new_std_compound(NULL p_def)
to v4l2_ctrl_new_custom — pattern rkvdec/cedrus/
hantro use. Adds a no-op s_ctrl callback.
Verification (hertz, Pi 5, 6.12.75+rpt-rpi-2712):
LibVA trace through `ffmpeg -hwaccel vaapi`:
vaInitialize / Profiles / Entrypoints / CreateConfig /
QuerySurfaceAttributes / CreateSurfaces / CreateContext
(cap_pool: 24 slots, 1 plane each) / CreateBuffer
(slice + picture params) / MEDIA_IOC_REQUEST_ALLOC
— all succeed.
Standalone NV12 decode path:
test_m2m_stream vp9_1080_stream.ivf out.nv12 1920 1080 vp9 nv12
→ 10/10 frames, byte-exact vs ffmpeg -pix_fmt nv12
vainfo (via libva-v4l2-request-fourier with our driver):
7 VAProfile entries with VAEntrypointVLD
(H264 Main/High/CBaseline/MultiviewHigh/StereoHigh,
VP9Profile0, AV1Profile0)
What's NOT here (Phase 8.12):
The libva trace stops at VIDIOC_S_EXT_CTRLS returning
EINVAL when populating V4L2_CID_STATELESS_VP9_FRAME on
the request. The compound-control payload validation
against the kernel's expected struct shape rejects.
This isn't a "missing line" fix — it needs proper
stateless control plumbing (the SPS/PPS/SliceParams
get_dims, validate, default-value paths that in-tree
rkvdec/cedrus/hantro implement to satisfy v4l2-core's
std_validate). Documented as Phase 8.12 scope.
The shipped integration is itself a meaningful deliverable:
all the framework scaffolding is in place; the remaining
gap is well-characterised and bounded.
See docs/phase_8_10_11_closure.md for the full trace
analysis + next-phase plan.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
d84efdb125 |
Phase 8.9: long-form stress + multi-codec HDR + libva scoping
Three verification deliverables; no production code changes
(infrastructure from 8.8 was sufficient).
1. libva-v4l2-request consumer investigation (task 95):
- bootlin/libva-v4l2-request@master supports MPEG-2 /
H.264 / HEVC only. No VP9, no AV1.
- H264 expects V4L2_PIX_FMT_H264_SLICE_RAW (older
fourcc); we advertise V4L2_PIX_FMT_H264_SLICE.
- CAPTURE expects V4L2_PIX_FMT_NV12 (single-plane);
we advertise NV12M + P010.
- Real integration = patch libva-v4l2-request to add
VP9 + AV1 mappings + accept the newer H.264 fourcc.
Multi-session work — pushed to Phase 8.10.
2. Long-form stress test (task 96):
- Built a 1800-frame (60s @ 30fps) VP9 1080p stream
by Python concat of vp9_5s.ivf × 12 with PTS
adjustment and re-muxed IVF header.
- 1800 / 1800 frames decoded cleanly through
test_m2m_stream + daemon, fps=120.9 sustained
across 14.9 s wall, p99=17.3 ms/frame (well inside
the 33 ms 30fps budget).
- Daemon alive after 3620 cookies across two
back-to-back runs, RSS=23 MiB — no leak.
- No kernel oops/WARN, no fps degradation across
the long run.
3. Multi-codec HDR (task 97):
- AV1 1080p 10-bit → P010: byte-exact vs ffmpeg
p010le. fps 17.1 (below 30fps target; AV1 10-bit
is intrinsically expensive).
- H.264 1080p 10-bit (high10) → P010: byte-exact
vs ffmpeg p010le. fps 26.9 (close to target).
- Combined with 8.8's VP9-10bit P010 result
(48.8 fps): all three codecs' 10-bit paths
produce byte-exact P010 output.
Roadmap update (docs/roadmap.md):
- 8.9 marked closed with the scope-cut explained.
- 8.10 = libva-v4l2-request VP9/AV1 patch + end-to-end
consumer integration (the actual user-facing loop:
mpv --hwdec=vaapi → libva-v4l2-request → /dev/video0
→ daemon → decoded frame).
Per correctness-before-speed: characterised the libva
integration scope rigorously rather than starting a
multi-session battle in this phase. The bounded
deliverables (stress test + HDR matrix) ship clean and
prove the existing infrastructure handles real-world
workloads stably.
Phase 8.10 next: build + patch libva-v4l2-request on
hertz; end-to-end with mpv.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
1d0db3b5a9 |
docs: pure ffmpeg vs daedalus pipeline CPU comparison
Measured on hertz (Pi 5, 6.12.75+rpt-rpi-2712, FFmpeg 7.1.3)
to quantify the architectural cost/benefit of routing decode
through the V4L2 m2m + chardev + dmabuf path vs running
ffmpeg standalone.
1080p × 150 frames, decode-as-fast-as-possible:
VP9 8-bit: ffmpeg 214.9% CPU / 1083ms wall
daedalus 96.3% CPU / 1229ms wall
AV1 8-bit: ffmpeg 201.5% CPU / 1162ms wall
daedalus 96.6% CPU / 1478ms wall
H.264 8-bit: ffmpeg 205.8% CPU / 1063ms wall
daedalus 100.1% CPU / 1020ms wall
VP9 10-bit: ffmpeg 155.8% CPU / 269ms wall
daedalus 91.6% CPU / 131ms wall
Key takeaway: the daedalus pipeline uses ~half the CPU for
roughly the same wall throughput. FFmpeg standalone defaults
to 2 threads; for single-stream decode that doesn't
parallelise well, so the 2× CPU usage is overhead, not
parallelism benefit. The daemon's single-threaded serialised
event loop avoids that tax.
For the project's 30fps-floor-is-fine target ("daily YouTube
with CPU free for vscode"), daedalus leaves ~2× the CPU
headroom for the rest of the desktop at the same playback
rate.
VP9-10bit is striking — daedalus is faster wallclock too
(131ms vs 269ms) because at small per-frame work FFmpeg's
thread pool spin-up dominates.
Note: "daedalus" still uses FFmpeg internally (Phase 8.8
explicitly deferred QPU substitution after measurement showed
30fps@1080p was already met). The benefit here is
architectural — single-threaded decode, out-of-process
daemon, dmabuf zero-copy — not QPU offload.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
1ae9528e76 |
Phase 8.8: throughput baseline + multi-codec streams + HDR
Per the correctness-before-speed principle: measure before
optimising. Roadmap going in said "QPU dispatch substitution
to hit 30fps@1080p". Measurement on hertz shows the FFmpeg
software path already hits 65-88 fps@1080p across all three
codecs — QPU substitution would be premature optimisation.
So 8.8 ships what's actually useful:
1. Per-frame timing in test_m2m_stream.
2. Multi-frame AV1 + H.264 streams verified byte-exact at
1080p (closes the "VP9-only stream tests" gap from 8.7).
3. HDR / 10-bit via V4L2_PIX_FMT_P010 + daemon
pack_p010_to_plane.
Test harness (tools/test_m2m_stream.c):
- Per-frame µs timing via CLOCK_MONOTONIC; reports mean/p50/
p99/min/max + wall ms + fps.
- Annex-B H.264 parser: split on 3-/4-byte start codes,
accumulate NALs into access units (push on VCL NAL types
1 or 5). Without AU grouping FFmpeg rejects SPS/PPS-only
buffers as "no frame!".
- Format auto-detect (DKIF magic → IVF; else Annex-B).
- Optional 6th arg `[capture]`: nv12m | p010.
- CAPTURE mmap path generalised for num_planes==1 (P010).
Kernel (kernel/daedalus_v4l2_main.c):
- CAPTURE formats array {NV12M, P010}; enum_fmt walks it.
- daedalus_fill_capture_fmt takes a fourcc:
NV12M: 2 planes, W*H + W*H/2 bytes, bpl=W
P010: 1 plane, W*H*2 + W*H bytes, bpl=W*2
- try_fmt preserves caller fourcc when supported.
- daedalus_complete_resp_frame's dmabuf path now sets each
plane's payload to vb2_plane_size(vb,p) — generalises
cleanly across 1-plane (P010) and 2-plane (NV12M) layouts;
the daemon fully populates the plane so payload =
sizeimage.
Daemon (daemon/src/decoder.c):
- pack_p010_to_plane: YUV420P10LE → P010 single-plane.
10-bit samples shifted left by 6 to MSB-align in 16-bit
words per V4L2 ABI. Y at base+0, interleaved CbCr right
after Y plane (per format spec for single-plane P010).
Strips source stride padding; respects destination stride.
- daedalus_decoder_run_request dispatches on
req->capture_pix_fmt (NV12M → pack_nv12_to_planes; P010
→ pack_p010_to_plane; else warn + skip).
- Includes <linux/videodev2.h> for fourcc constants.
Verification on hertz (Pi 5, 6.12.75+rpt-rpi-2712):
1080p throughput baseline (30 frames testsrc, dmabuf path):
VP9 1080p: mean 12.0 ms, p99 15.9 ms, fps **83.1**, byte-exact ✓
AV1 1080p: mean 15.4 ms, p99 41.0 ms, fps **65.0**, byte-exact ✓
H.264 1080p: mean 11.3 ms, p99 21.5 ms, fps **88.3**, byte-exact ✓
All 2-3× over the 30fps-floor-is-fine criterion.
HDR / 10-bit 1080p P010:
10 frames, 62 MB output, fps **48.8**, byte-exact vs
`ffmpeg -pix_fmt p010le -f rawvideo`.
Small-frame P010 (320×240): fps 966 — fixed daemon overhead
dominates at low resolutions.
v4l2-compliance unchanged from 8.7: 49/49 passing.
Format enumeration confirms NM12 + P010 on CAPTURE.
Clean SIGTERM + rmmod; no kernel oops/WARN.
Roadmap update (docs/roadmap.md):
- 8.8 marked closed with closure-doc reference, including
the explicit "QPU substitution not needed" rationale.
- 8.9 reshaped: libva-v4l2-request consumer integration
(per project_consumer_target memory) — the actual
user-facing endpoint.
Per correctness-before-speed:
- Measured first; QPU work explicitly justified-out via data.
- Byte-exact pixel comparison for every codec/format combo
(NV12: VP9, AV1, H.264; P010: VP9 10-bit at 320×240 and
1080p).
- AU grouping in the Annex-B parser is the correct
semantic boundary, not just a workaround.
- vb2_plane_size for payload generalises to any plane
count, not hardcoded to 2.
Phase 8.9 next: libva-v4l2-request integration — close
the loop from YouTube/Firefox to /dev/video0 + daemon
playback.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
5965805d86 |
Phase 8.7: media controller + multi-frame streaming verification
Two pieces — both shipped:
1. Media controller binding closes the last v4l2-compliance
failure from 8.6 (DECODER_CMD, which requires has_media on
stateless decoders) and unlocks the V4L2 request API for
libva-v4l2-request.
2. Multi-frame streaming test exercises the daemon's
AVCodecContext state preservation across many REQ_DECODE
calls — Phase 8.6's tests pushed exactly one keyframe per
invocation; real content has P-frame references.
Compliance now reaches **49/49 passing.**
Kernel (kernel/daedalus_v4l2_main.{c,h}):
- Added `struct media_device mdev` to daedalus_dev.
- media_device_init(&mdev) BEFORE v4l2_device_register so
v4l2-core sees v4l2_dev.mdev = &mdev and binds the m2m
entities into the graph during register.
- After video_register_device:
v4l2_m2m_register_media_controller(..., MEDIA_ENT_F_PROC_VIDEO_DECODER)
then media_device_register so userspace sees the complete
graph in /dev/mediaN with the decoder entity tagged.
- daedalus_remove unwinds in reverse: unregister media,
unregister mc, unregister video, release m2m, unregister
v4l2, cleanup mdev.
- Error paths added for both new failure points.
Test harness (tools/test_m2m_stream.c, new):
- Multi-frame V4L2 m2m client: parses IVF → 4-deep buffer
rings on both queues → per-frame QBUF/DQBUF loop →
concatenates decoded NV12 to output file. Returns 0 only
if every input frame decoded without error.
- Same codec vocabulary as test_m2m_decode (vp9 | av1 |
h264 via 5th arg).
Verification on hertz (Pi 5, 6.12.75+rpt-rpi-2712):
v4l2-compliance: 49 tests, 49 passed, 0 failed, 0 warnings.
$ v4l2-ctl --list-devices
daedalus-fourier V3D7+NEON (platform:daedalus_v4l2):
/dev/video0
/dev/media3
VP9 320×240 30 frames (1 keyframe + 29 P-frames, 3.46 MB
NV12): byte-for-byte match vs `ffmpeg -i in.ivf -pix_fmt
nv12 -f rawvideo`.
VP9 1920×1080 10 frames (31 MB NV12 through the dmabuf
path): byte-for-byte match vs same reference command.
Daemon log shows cookies 1..30 all completing cleanly in
order; lazily-opened AVCodecContext maintains reference
frames across the chardev round-trips.
Clean SIGTERM + rmmod, no oops/WARN.
Roadmap update (docs/roadmap.md):
- 8.7 marked closed with closure-doc reference.
- 8.8 reshaped: perf profiling, QPU dispatch substitution
via daedalus-fourier, multi-frame AV1/H.264, HDR (P010M).
Per correctness-before-speed:
- Order-correct media controller lifecycle (init → bind
v4l2_dev → register video → register mc → register
media; reverse for teardown).
- 4-deep buffer rings on both queues — the scheduler
actually pipelines multiple in-flight cookies through
the chardev (not just one-at-a-time as in 8.5/8.6 tests).
- Bit-exact comparison against ffmpeg, not "looks right."
- All resource paths cleaned on every error branch.
Phase 8.8 next: profile daemon hot loops, dlopen
daedalus-fourier from the daemon, swap FFmpeg per-block
calls for daedalus_dispatch_* where the kernel matches,
target 30fps@1080p from 30fps-floor-is-fine memory.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
c7f6fb90cb |
Phase 8.6: dmabuf + AV1 + H.264 + stateless controls
Removes the Phase 8.5 64 KiB frame-size cap by exporting CAPTURE
buffers as dmabuf-fds the daemon mmaps and writes pixels into
directly. Adds AV1 + H.264 codec support, V4L2 stateless control
registration, and the compliance polish that brings the driver
to 47/48 v4l2-compliance pass.
Protocol (include/daedalus_v4l2_proto.h):
- struct daedalus_req_decode grew capture-buffer metadata
(width/height/pix_fmt/num_planes + per-plane size+stride).
- New DAEDALUS_IOC_GET_DMABUF ioctl on the chardev: daemon
asks for a per-plane dmabuf fd, kernel calls vb2_core_expbuf
in daemon task context so the fd lands in the daemon's table.
Kernel m2m driver (kernel/daedalus_v4l2_main.c):
- Both queues switched to vb2_dma_contig_memops. OUTPUT was
vmalloc in 8.5; the switch is needed because vmalloc doesn't
honour V4L2_MEMORY_FLAG_NON_COHERENT and v4l2-compliance's
REQBUFS test rejected the driver because of it. We still
read bitstream via vb2_plane_vaddr (dma_contig gives a
kernel virtual address just like vmalloc did).
- dma_coerce_mask_and_coherent(DMA_BIT_MASK(32)) in probe.
- queue_setup populates alloc_devs[plane] = &pdev->dev for
both queues; allow_cache_hints=1 on both.
- daedalus_export_capture_dmabuf(cookie, plane, flags, *fd):
walks inflight list, calls vb2_core_expbuf on the CAPTURE
buffer in the caller's (daemon's) task context.
- device_run fills the new REQ_DECODE capture fields from
ctx->dst_fmt and maps ctx->src_fmt.pixelformat to
DAEDALUS_CODEC_VP9 / _AV1 / _H264 (was hard-wired to VP9).
- daedalus_complete_resp_frame handles both the 8.5 inline
path (kept for debugging) and the 8.6 dmabuf path (pixels
already in CAPTURE buffer, just set payload from metadata).
- enum_fmt advertises all 3 OUTPUT formats (VP9F, AV1F, S264).
- try_fmt preserves userspace colorspace fields instead of
overwriting with REC709 defaults (fixes 8.5 compliance fail).
- s_fmt propagates OUTPUT colorspace → CAPTURE (stateless
decoder round-trip test at v4l2-test-formats.cpp:958).
- 12 V4L2 stateless controls registered per open (VP9_FRAME,
VP9_COMPRESSED_HDR, H264_SPS/PPS/SCALING/PRED_WEIGHTS/
SLICE_PARAMS/DECODE_PARAMS, AV1_FRAME/SEQUENCE/
TILE_GROUP_ENTRY/FILM_GRAIN). Daemon ignores values (FFmpeg
re-parses); registration is what makes libva-v4l2-request
see us.
Kernel chardev (kernel/daedalus_v4l2_chardev.c):
- New unlocked_ioctl dispatching DAEDALUS_IOC_GET_DMABUF to
daedalus_export_capture_dmabuf.
- debugfs test_decode cookies unified with the m2m cookie
allocator via shared daedalus_next_cookie() — kills the
Phase 8.5 namespace collision.
Daemon (daemon/src/...):
- New dmabuf_capture.{c,h}: GET_DMABUF + mmap each plane on
REQ_DECODE; munmap + close on completion. O_RDWR | O_CLOEXEC
is essential — vb2_core_expbuf extracts O_ACCMODE from flags
and exports read-only by default (caught on first run; mmap
-EACCES on PROT_WRITE).
- decoder.{c,h}: lazily opens AV1 + H.264 AVCodecContexts in
addition to VP9 (dropped the -ENOSYS stubs). pack_nv12_to_planes
writes Y line-by-line into planes[0] with planes[0].stride;
interleaves Cb/Cr into planes[1] with planes[1].stride.
- chardev_client.c handle_req_decode: opens dmabuf planes,
runs decode (pixels land in CAPTURE buffer directly), closes
planes, sends metadata-only RESP_FRAME. No wire-pixel
allocation.
Test harness (tools/test_m2m_decode.c):
- Optional 5th arg `codec` (vp9 | av1 | h264). Same client
drives all three codecs.
Verification on hertz (Pi 5, 6.12.75+rpt-rpi-2712):
Bit-exact end-to-end vs `ffmpeg -pix_fmt nv12`:
VP9 1920x1080 3,110,400 bytes MATCH
AV1 128x96 18,432 bytes MATCH
H.264 128x96 18,432 bytes MATCH
VP9 1080p went through the full dmabuf path with no chardev
payload bloat — the same chardev that capped at 64 KiB in 8.5
now ferries metadata only and lets the daemon mmap+write a
3.1 MB frame directly into the V4L2 client's buffer.
v4l2-compliance:
Phase 8.1: 44/48
Phase 8.5: 44/48 (different fails after m2m landed)
Phase 8.6: 47/48
Only remaining: VIDIOC_(TRY_)DECODER_CMD (needs media
controller — explicitly Phase 8.7 work).
11 standard compound controls visible:
vp9_frame_decode_parameters, vp9_probabilities_updates,
h264_sequence_parameter_set, h264_picture_parameter_set,
h264_scaling_matrix, h264_prediction_weight_table,
h264_slice_parameters, h264_decode_parameters,
av1_sequence_parameters, av1_frame_parameters,
av1_film_grain (av1_tile_group_entry refused by hdl->error
on this kernel — skipped silently).
Clean SIGTERM + rmmod, no oops/WARN.
Roadmap update (docs/roadmap.md):
- Phase 8.6 marked closed with the closure-doc reference.
- Phase 8.7 reshaped to (1) media controller, (2) perf +
daedalus_dispatch_* substitution, (3) HDR/10-bit, (4)
long-form multi-frame streaming.
Per correctness-before-speed:
- Real V4L2 dmabuf via vb2_core_expbuf (not a sideband
fd-passing hack).
- O_RDWR access mode threaded through correctly.
- Strict pixel-byte comparison against ffmpeg, not "looks
right" eyeballing.
- Each compliance edge documented with the underlying test
source-line + the fix.
- All resource paths cleaned (munmap + close per plane on
every exit, including error paths).
Phase 8.7 next: media controller binding (closes last
compliance fail), per-frame profiling, QPU dispatch
substitution targeting 30fps@1080p from
30fps-floor-is-fine memory.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
6f4b580f7c |
Phase 8.5: full V4L2 m2m driver, VP9 decode via QBUF/DQBUF
Replaces the Phase 8.4 debugfs-triggered chardev path with a
real V4L2 m2m driver. Userspace clients now drive decoding the
standard way — S_FMT / REQBUFS / QBUF on the OUTPUT (bitstream)
queue, DQBUF on the CAPTURE (NV12M) queue. Kernel device_run
packs the bitstream into REQ_DECODE; daemon decodes via FFmpeg;
RESP_FRAME's inline NV12 pixel payload lands in the CAPTURE
buffer. Phase 8.6 swaps the inline payload for dmabuf so big
frames stop being capped at 64 KiB.
Kernel (daedalus_v4l2_main.c, rewritten + main.h added):
- Per-open struct daedalus_ctx: v4l2_fh, m2m_ctx, ctrl_handler,
per-queue v4l2_pix_format_mplane.
- Two vb2_queues (vb2_vmalloc_memops for both — no DMA needed
yet; 8.6 switches CAPTURE to dma_contig for dmabuf-export):
OUTPUT = V4L2_BUF_TYPE_VIDEO_OUTPUT_MPLANE, VP9_FRAME
CAPTURE = V4L2_BUF_TYPE_VIDEO_CAPTURE_MPLANE, NV12M
- Full v4l2_ioctl_ops table: querycap, enum_fmt, g/s/try_fmt
for both queues, reqbufs/querybuf/qbuf/dqbuf/create_bufs/
prepare_buf/expbuf/streamon/streamoff via v4l2_m2m_ioctl_*
helpers.
- v4l2_m2m_ops.device_run: peeks next OUTPUT buf, builds
REQ_DECODE inline with the bitstream bytes, enqueues with an
auto-incrementing cookie, stores {ctx, src_buf, dst_buf} in
a per-device inflight list. Job stays open until RESP_FRAME.
- daedalus_complete_resp_frame(): pops the inflight entry,
memcpys inline NV12 pixels into the CAPTURE buffer (Y plane
+ interleaved CbCr), finishes via
v4l2_m2m_buf_done_and_job_finish — NOT plain buf_done +
job_finish, which leaves the src buf on the m2m queue and
causes device_run to immediately re-run on the same input
(caught on first run; second REQ_DECODE for same bitstream +
eventual oops in stop_streaming on teardown).
Kernel (daedalus_v4l2_chardev.c):
- RESP_FRAME handler now hands inline pixel payload to
daedalus_complete_resp_frame so it lands in the CAPTURE
vb2 buffer. Existing PONG and debugfs test_decode paths still
work; the latter produces a harmless ratelimited "unknown
cookie" since it bypasses V4L2 m2m.
Daemon (decoder.c, decoder.h):
- daedalus_decoder_run_request signature extended with
(nv12_out, nv12_cap, nv12_used). After the FNV-1a digest the
decoder packs YUV420P into NV12 in the caller's buffer: Y
plane line-by-line stripped of stride padding; Cb/Cr
interleaved into a single chroma plane. Truncation silent —
kernel only memcpys what fits in the CAPTURE plane.
Daemon (chardev_client.c):
- handle_req_decode allocates a response buffer sized for the
full chardev payload, lets decoder fill the pixel area
after the resp_frame struct, sends the full payload via the
existing send_response.
Test client (tools/test_m2m_decode.c, new):
- Minimal V4L2 m2m client: S_FMT both queues, REQBUFS 1 each,
mmap+fill OUTPUT, QBUF both, STREAMON, poll, DQBUF, dump
CAPTURE planes to a raw NV12 file. ~250 LOC; verifies the
whole flow without needing v4l2-ctl framing.
Roadmap update (docs/roadmap.md):
- Phase 8.4 retitled "daemon ↔ kernel decode round-trip"
to reflect what actually shipped (vs. the original V4L2-
ioctl-driven plan which moved here).
- Phase 8.5 retitled "full V4L2 m2m driver" with closure
status.
- Phase 8.6 reshaped to two tracks: dmabuf + AV1/H.264/
stateless controls + media controller. Adds the punch list
of v4l2-compliance failures (DECODER_CMD, S_FMT colorspace)
that 8.6 will fix.
Verification on hertz (Pi 5, 6.12.75+rpt-rpi-2712):
Kernel + daemon build clean (-Wall -Wextra clean both sides).
Test harness drives one VP9 keyframe end-to-end:
OUTPUT REQBUFS -> 2
CAPTURE REQBUFS -> 2
QBUF OUTPUT[0] bytesused=1566
QBUF CAPTURE[0]; STREAMON both
poll revents=0x5
DQBUF OUTPUT[0] flags=0x4001 (DONE)
DQBUF CAPTURE[0] flags=0x4000 payloads=[12288, 6144]
wrote 12288 Y + 6144 UV bytes to /tmp/out_m2m.nv12
Pixel correctness vs reference:
ffmpeg -i vp9_small.ivf -pix_fmt nv12 -f rawvideo -y ref.nv12
cmp /tmp/out_m2m.nv12 /tmp/ref.nv12 → match ✓
Byte-for-byte identical to FFmpeg's stock CPU decode.
v4l2-compliance: detected as Stateless Decoder; most ioctls
pass; two expected fails documented in closure doc
(DECODER_CMD/media controller, S_FMT colorspace).
Clean teardown: SIGTERM the daemon, rmmod the module, no
oops/WARN in dmesg.
Per correctness-before-speed:
- Real V4L2 ioctl table (not stubs); uses v4l2-core helpers
where they exist instead of reinventing.
- v4l2_m2m_buf_done_and_job_finish (not the manual sequence)
to keep scheduler state consistent.
- Bit-exact reference comparison, not just "looks right."
- Documented every compliance failure with the planned fix.
- All resource paths (kmalloc/kfree, inflight list cleanup,
src/dst buf removal in stop_streaming) handled on every
error branch.
Phase 8.6 next: dmabuf-export for CAPTURE (removes 64 KiB
frame-size cap), add AV1+H.264 codecs, add V4L2 stateless
controls + media controller binding, fix the colorspace +
cookie-namespace compliance issues.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
2a449632b9 |
Phase 8.4: daemon ↔ kernel decode round-trip (VP9 end-to-end)
Wires the Phase 8.3 FFmpeg loader through the Phase 8.2 chardev
bridge: kernel injects REQ_DECODE carrying a raw VP9 access unit,
daemon hands the bitstream to libavcodec via dlopen, sends
RESP_FRAME back with a content-dependent FNV-1a digest of the
decoded YUV planes. Pure CPU decode for now — Phase 8.5 swaps in
dmabuf + QPU dispatch.
Protocol (include/daedalus_v4l2_proto.h):
- New REQ_DECODE (kernel→daemon) and RESP_FRAME (daemon→kernel)
message types, with fixed-size payload structs.
- New DAEDALUS_CODEC_VP9/AV1/H264 enum (wire-stable so 8.6's
AV1+H.264 work doesn't move existing values).
- New DAEDALUS_DECODE_* status enum (OK / NO_FRAME / ERR_OPEN /
ERR_SEND / ERR_RECV / ERR_CODEC).
- Converted the prior `enum daedalus_msg_type` to #defines —
high-bit values exceed INT_MAX and tripped -Wpedantic on
userspace; kernel uABI headers use the same idiom.
Kernel (kernel/daedalus_v4l2_chardev.c):
- New debugfs entry /sys/kernel/debug/daedalus_v4l2/test_decode:
writing raw bitstream bytes wraps them in a REQ_DECODE
(codec=VP9 for Phase 8.4) and enqueues with an
auto-incrementing cookie.
- daedalus_chardev_write learned RESP_FRAME: parses the payload
and emits a single pr_info line with decode metadata. Keeps
existing PONG handling on the default arm.
Daemon (daemon/src/...):
- chardev_client.{c,h} — opens /dev/daedalus-v4l2, blocking read
loop, single-buffer write() responses (kernel chardev has only
.write, not .write_iter, so writev lands as -EINVAL —
discovered the hard way during first run).
- decoder.{c,h} — lazily-opened AVCodecContext per codec, shared
AVPacket/AVFrame pair, descriptor-driven plane walker
(av_pix_fmt_desc_get) so the same hash path covers YUV420P,
YUV422P, YUV444P, GBRP and other 8-bit planar layouts.
Generalised after first run decoded testsrc as GBRP (71)
rather than the assumed YUV420P.
- `daemon` command in main.c opens the chardev and runs the loop
until SIGINT/SIGTERM. Cookie correlation handled end-to-end.
- ffmpeg_loader gained av_pix_fmt_desc_get (23 symbols total).
Build:
- CMakeLists adds chardev_client.c + decoder.c; explicit
-I../include for the shared protocol header.
- Still -Wall -Wextra -Wpedantic clean.
Verification on hertz (Pi 5, 6.12.75+rpt-rpi-2712):
$ ffmpeg ... -pix_fmt yuv420p -c:v libvpx-vp9 -frames:v 1 \
-y /tmp/vp9_test.ivf
$ python3 ... strip IVF framing → vp9_keyframe.bin (3268 B)
$ sudo insmod kernel/daedalus_v4l2.ko
$ daedalus_v4l2_daemon -v daemon &
$ sudo dd if=vp9_keyframe.bin \
of=/sys/kernel/debug/daedalus_v4l2/test_decode
daemon: REQ_DECODE cookie=2 → decoded yuv420p 320x240
fnv1a=0x6ef10d71 luma=76800 chroma=38400
kernel: RESP_FRAME cookie=2 status=0 320x240 pixfmt=0
fnv1a=0x6ef10d71 ← matches daemon ✓
Hash properties verified:
cookie=2 testsrc 3268 B → 0x6ef10d71 (first decode)
cookie=3 red 44 B → 0x7f6e5dc5 (content-dependent ✓)
cookie=4 testsrc 3268 B → 0x6ef10d71 (deterministic ✓)
cookie=5 64 B random → status=101 (ERR_SEND, daemon alive)
Daemon survives bad input (FFmpeg "Invalid sync code" wrapped
into structured ERR_SEND response). Clean SIGTERM shutdown,
clean rmmod.
Phase 8.4 acceptance criteria met:
- ✓ end-to-end kernel→daemon→FFmpeg→kernel round-trip
- ✓ cookie correlation per request/response pair
- ✓ content-dependent + deterministic digest
- ✓ structured error responses (no daemon crash on bad input)
- ✓ clean teardown (SIGTERM + rmmod)
- ✓ builds clean on both kernel kbuild and daemon CMake
Per correctness-before-speed:
- Real chardev I/O (no shortcuts, no select-loop hacks)
- Real FFmpeg AVCodecContext lifecycle (lazily opened, properly
freed on cleanup)
- Descriptor-driven plane walk (generalises across pix_fmts)
- Structured error path (not just log-and-continue)
- All resource paths cleaned up on every error branch
- Documented why FNV-1a digest, why write() not writev(), why
pix_desc walk in docs/phase_8_4_closure.md
Phase 8.5 next: V4L2 m2m queue submits REQ_DECODE from
vidioc_qbuf; dmabuf carries actual pixel data so the chardev's
64 KiB cap doesn't gate frame size; begin substituting
daedalus_dispatch_* into the daemon's decode path.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
873a04c622 |
Phase 8.3: userspace daemon scaffold + FFmpeg dlopen + parse path
Builds the daemon executable per the locked Phase 8 architecture
(Option γ: dlopen FFmpeg at runtime). Phase 8.3 scope: parse
path validation only — no V4L2 wiring, no decode, no chardev
connection.
Components:
- daemon/CMakeLists.txt — CMake with -Wall -Wextra -Wpedantic
clean. pkg-config for FFmpeg headers; only -ldl + -lpthread
at link time.
- daemon/src/main.c — entry point, signal handlers
(SIGINT/SIGTERM), command dispatcher. Currently `parse <file>`.
- daemon/src/ffmpeg_loader.{c,h} — runtime FFmpeg loader.
dlopens libavformat.so.61, libavcodec.so.61, libavutil.so.59.
Resolves 22 function pointers using POSIX-recommended
*(void**)& dlsym idiom (per POSIX.1-2017 dlsym(3p) Rationale).
- daemon/src/parser.{c,h} — demux loop via avformat_open_input +
av_read_frame. Per-frame logging on -v.
- daemon/src/log.{c,h} — logging facade (stderr Phase 8.3;
syslog/journal planned for 8.5+).
Verification on hertz:
$ ffmpeg -f lavfi -i testsrc=duration=2:size=320x240:rate=30 \
-c:v libvpx-vp9 -y /tmp/testsrc.ivf
$ daedalus_v4l2_daemon parse /tmp/testsrc.ivf
[INFO] FFmpeg loaded: 7.1.3-0+deb13u1+rpt1 (libavformat 61.7.100)
[INFO] video stream #0: codec=vp9 (Google VP9) 320x240, 0/0 fps
[INFO] parse complete: 60 frames (1 key) total 17859 bytes
Error paths verified:
- Missing file → "avformat_open_input(...): code -2", exit 1
- No command → usage message, exit 2
- Bad command → usage message, exit 2
Per correctness-before-speed:
- Real CMake (no Makefile hacks)
- pkg-config for headers
- POSIX-conformant dlsym pattern (no -Wpedantic suppression)
- Real signal handling + proper exit codes
- Real logging with timestamp + level
- Headers included at compile-time for type safety; dlopen
decouples runtime
- All FFmpeg resources freed on every exit path
- Builds clean on -Wall -Wextra -Wpedantic
Phase 8.3 acceptance criteria met:
- ✓ daemon binary builds
- ✓ dlopen FFmpeg at runtime
- ✓ demux a VP9 IVF file end-to-end
- ✓ per-frame metadata logged correctly
- ✓ frame count + keyframe count + byte total accurate
Phase 8.4 next: wire daemon to /dev/daedalus-v4l2 chardev,
add REQ_DECODE / RESP_FRAME handling, drive VP9 decode
end-to-end via daedalus_dispatch_* from daedalus-fourier.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
895f57c63a |
Phase 8.2: kernel ↔ daemon chardev bridge with round-trip test
Adds /dev/daedalus-v4l2 misc chardev to the kernel module. The
chardev is the IPC channel for the future userspace decoder
daemon: kernel enqueues REQ_* messages, daemon read()s them,
processes, write()s RESP_* back.
Wire protocol (pre-1.0, header in include/daedalus_v4l2_proto.h):
- struct daedalus_msg_hdr: magic (D04V) + version + type +
cookie + payload_len + reserved
- Request/response separated by high bit of type field
- Max 64 KiB payload per message
- Cookie correlates request with matching response
Kernel implementation (kernel/daedalus_v4l2_chardev.{c,h}):
- Single-instance chardev (-EBUSY on second open)
- In-kernel FIFO bounded at 64 messages
- Blocking + non-blocking read; poll() with EPOLLIN on queued
- write() parses + validates header, logs response at pr_debug
- Bad magic → -EBADMSG, bad version → -EPROTO, oversize → -EMSGSIZE
- All error paths free resources
Phase 8.2 test trigger via debugfs:
- /sys/kernel/debug/daedalus_v4l2/test_ping — any byte
enqueues a PING with a fixed 24-byte payload. Removed in
Phase 8.4 when real REQ_DECODE from V4L2 path takes over.
Userspace verification tool (tools/test_chardev_pingpong.c):
- Real C program, proper error reporting via strerror
- Validates the 6-step round-trip: open → empty-queue EAGAIN →
trigger ping → read PING → verify all fields → write PONG → close
- Builds with -Wall -Wextra clean
Verification on hertz (Pi 5, 6.12.75+rpt-rpi-2712):
$ sudo insmod daedalus_v4l2.ko
$ sudo tools/test_chardev_pingpong
opening /dev/daedalus-v4l2...
non-blocking read on empty queue: EAGAIN ✓
injected PING via debugfs ✓
read PING: magic ✓ version ✓ type=PING ✓ cookie=0x1234 ✓ payload=24 bytes
payload: "DAEDALUS-V4L2-PING-PL"
wrote PONG (cookie=0x1234) ✓
ALL TESTS PASSED.
$ sudo rmmod daedalus_v4l2 # clean
Per correctness-before-speed: full kerneldoc on structs, 8-tab
kernel style, SPDX headers, proper error paths, real test
program (not "I ran it once"), failure-mode coverage documented.
Phase 8.3 next: userspace daemon with dlopen'd FFmpeg parse path.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|