daedalus-v4l2

Author	SHA1	Message	Date
marfrit	5d1ff51178	Merge pull request 'daemon: AV1 Frame Header OBU synthesiser + Temporal Delimiter' (#24) from noether/daemon-av1-frame-header-obu into main Reviewed-on: #24	2026-05-23 17:16:27 +00:00
claude-noether	9797a0daa6	daemon: AV1 Frame Header OBU synthesiser + Temporal Delimiter

Extends the AV1 OBU encoder pack (PR #22 landed the Sequence Header half) with the two remaining pieces of the per-frame OBU assembly:

- av1_synth_temporal_delimiter_obu() — trivial 2-byte OBU (0x12, 0x00) that AV1 temporal units must start with so libavcodec's parser can detect access-unit boundaries.
- av1_synth_frame_header_obu() — encodes a Frame Header OBU (AV1 §5.9) from V4L2_CID_STATELESS_AV1_SEQUENCE + V4L2_CID_STATELESS_AV1_FRAME controls.

## Frame Header scope

The encoder covers the libva-v4l2-request common-case path:

- frame_type: KEY / INTER / INTRA_ONLY supported. SWITCH returns 0.
- tile_info: single-tile uniform-spacing only (forced tile_cols_log2 = tile_rows_log2 = 0).
- quantization_params: full coverage (base_q_idx, delta_q_*, qmatrix).
- loop_filter_params: full coverage (levels, sharpness, ref/mode deltas).
- cdef_params: full coverage.
- segmentation: only enabled=0 path supported (returns 0 if enabled).
- loop_restoration: only RESTORE_NONE supported (returns 0 if any plane uses Wiener / SGRPROJ / SWITCHABLE).
- global_motion: only IDENTITY warp model emitted (returns 0 if any ref uses ROTZOOM / AFFINE / TRANSLATION).
- film_grain_params: only "not present" path — returns 0 if the sequence header has FILM_GRAIN_PARAMS_PRESENT set.

Out-of-scope branches return 0 so a future decoder.c integration can surface a coverage warning and fall back to direct libavcodec parsing of the original bitstream where the consumer happens to ship a fully-OBU'd access unit.

## Integration status

The new primitives are NOT yet wired into decoder.c. The AV1 decode hot path still passes the OUTPUT buffer straight to libavcodec, which works only when the V4L2 consumer is sending a fully-OBU'd access unit (not strictly the V4L2 stateless contract).
A real wiring needs a separate kernel-side change:

- daedalus_v4l2_proto.h: add struct daedalus_av1_meta mirroring v4l2_ctrl_av1_sequence + v4l2_ctrl_av1_frame
- kernel/daedalus_v4l2_main.c: capture V4L2_CID_STATELESS_AV1_{SEQUENCE, FRAME} at device_run, ship over the chardev
- daemon/src/chardev_client.c: receive meta
- daemon/src/decoder.c: assemble TD + SH + FH + OBU_TILE_GROUP-wrapped OUTPUT bytes, send to libavcodec

Tracked as a follow-on.

## Tests

test_av1_obu_synth.c grows 5 new cases (9 total, all green on hertz):

    === av1_synth_temporal_delimiter_obu ===
    temporal delimiter: OK
    === av1_synth_frame_header_obu ===
    KEY frame 1080p: OK (13 bytes)
    INTER frame: OK (18 bytes)
    SWITCH frame rejected: OK
    segmentation enabled rejected: OK
    AV1 OBU synth tests PASSED

Bit-walk of the KEY-frame happy path confirms the OBU envelope (obu_type=3 = FRAME_HEADER, has_size_field=1, leb128 size byte), then steps through show_existing_frame, frame_type, show_frame, disable_cdf_update, allow_screen_content_tools. Fuller bit-walks would tie the test to encoder details that are spec-driven and already linear in the source; structural smoke + spec-driven linearity is the right gate.

Build clean on hertz (Pi 5, Debian trixie, 6.18.29+rpt-rpi-2712, gcc -Wall -Wextra -Wpedantic). No new warnings.

Closes daedalus backlog task #159 (AV1 Frame Header OBU synthesiser; decoder.c integration deferred per task notes above).	2026-05-23 18:31:41 +02:00
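The byte values quoted in the message above can be checked against the one-byte AV1 OBU header layout. The following is a minimal sketch, not the daemon's code: every `sketch_*` name is hypothetical, and only the (0x12, 0x00) bytes, `obu_type=3` for FRAME_HEADER, and the "return 0 on failure" convention are taken from the commit message.

```c
#include <stddef.h>
#include <stdint.h>

/* AV1 obu_type values used in this sketch. */
enum {
    SKETCH_OBU_TEMPORAL_DELIMITER = 2,
    SKETCH_OBU_FRAME_HEADER       = 3
};

/*
 * One-byte OBU header without an extension byte, MSB first:
 *   obu_forbidden_bit(1)=0 | obu_type(4) | obu_extension_flag(1)=0 |
 *   obu_has_size_field(1)  | obu_reserved_1bit(1)=0
 */
static uint8_t sketch_obu_header_byte(unsigned obu_type, int has_size_field)
{
    return (uint8_t)(((obu_type & 0x0fu) << 3) |
                     ((has_size_field ? 1u : 0u) << 1));
}

/*
 * Hypothetical stand-in for av1_synth_temporal_delimiter_obu().
 * Returns the number of bytes written, or 0 if the buffer is too small.
 */
static size_t sketch_temporal_delimiter_obu(uint8_t *out, size_t out_cap)
{
    if (out == NULL || out_cap < 2)
        return 0;
    out[0] = sketch_obu_header_byte(SKETCH_OBU_TEMPORAL_DELIMITER, 1);
    out[1] = 0x00; /* leb128(0): the temporal delimiter has no payload */
    return 2;
}
```

With `has_size_field=1`, type 2 encodes as `(2 << 3) | (1 << 1) = 0x12`, which matches the first byte quoted above; a Frame Header OBU (type 3) would start with 0x1a under the same layout.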
marfrit	3a8f5405d4	Merge pull request 'daemon: AV1 Sequence Header OBU synthesiser + unit test' (#22) from noether/daemon-av1-obu-synth into main Reviewed-on: #22	2026-05-23 15:12:16 +00:00
marfrit	4cfe0b470f	Merge pull request 'daemon: bounds-check pack_* functions against CAPTURE plane size' (#21) from noether/daemon-pack-bounds-check into main Reviewed-on: #21	2026-05-23 15:11:57 +00:00
marfrit	b958ef8166	Merge pull request 'kernel: drain in-flight m2m jobs on daemon disconnect (fixes #146 D-state)' (#23) from noether/kernel-drain-inflight-on-chardev-release into main Reviewed-on: #23	2026-05-23 15:11:40 +00:00
claude-noether	94be8c3d03	kernel: drain in-flight m2m jobs on daemon disconnect Fixes issue #146 — daemon-crash (SIGKILL, SEGV, anything that triggers chardev release) leaves V4L2 consumers in unkillable TASK_UNINTERRUPTIBLE on /dev/video0 close. ## Root cause device_run() adds an entry to dev->inflight when it sends a REQ_DECODE to the daemon, marking the m2m job as "running". The job is only cleared via v4l2_m2m_buf_done_and_job_finish() in daedalus_complete_resp_frame(), which only fires on RESP_FRAME. If the daemon dies (SIGKILL, SEGV, exit) BEFORE writing the matching RESP_FRAME: - the inflight entry is never popped - v4l2_m2m_buf_done_and_job_finish is never called - the m2m scheduler still thinks a job is running Later, when the V4L2 consumer's close() runs (or gets signalled to exit), v4l2_m2m_ctx_release() → v4l2_m2m_cancel_job() waits for !job_running indefinitely. The consumer enters D-state and survives SIGKILL until reboot. Reproduced on hertz 2026-05-23, kernel 6.12.75+rpt-rpi-2712: $ sudo kill -STOP $DAEMON_PID # block daemon I/O $ ./test_m2m_decode keyframe.bin out.nv12 1920 1080 vp9 & $ sudo kill -9 $DAEMON_PID # chardev_release fires $ kill -9 $CLIENT_PID # ignored — D-state # client stack: v4l2_m2m_cancel_job+0x14c [v4l2_mem2mem] v4l2_m2m_ctx_release+0x20 [v4l2_mem2mem] daedalus_release+0x2c [daedalus_v4l2] v4l2_release+0x7c [videodev] __fput → do_exit → SIGKILL never delivered ## Fix New API daedalus_drain_inflight_on_disconnect() in main.{c,h}: walks the in-flight list, marks both src+dst buffers VB2_BUF_STATE_ERROR via v4l2_m2m_buf_done_and_job_finish(), and releases the bound media_request if any. Same completion shape as daedalus_complete_resp_frame() takes on the success path, just with state = ERROR for every in-flight entry. chardev_release calls the drain after flushing dev->req_queue (messages still in req_queue weren't released to the daemon yet, so they don't need the m2m-job-finish dance — freeing them is sufficient). 
The order matters: queue first (cheap), then m2m drain (heavier, takes the inflight list).

Locking: list_splice_init under inflight_lock to take the entire list atomically; lock dropped before iterating because v4l2_m2m_buf_done_and_job_finish can sleep via vb2's buffer-done dispatch and can re-enter device_run via the scheduler (which would need inflight_lock again on the next REQ_DECODE).

## Verification path

Cannot rmmod the running module on hertz right now — the D-state corpse from the repro session pins the refcount. Verification of the fixed module needs a reboot or fresh test host:

    $ sudo reboot                  # clears hung client
    $ sudo make modules_install    # install new .ko
    $ sudo modprobe daedalus_v4l2
    $ # rerun the repro script — client should die cleanly with
    $ # an -EIO / similar return from poll/DQBUF instead of hanging.

Build: clean on Linux 6.12.75 + rpt-rpi-2712, no new warnings. The pre-existing "frame size 2128 > 2048" warning on daedalus_device_run is unchanged by this commit.

## Followup not in scope

If a new V4L2 consumer races a REQ_DECODE through device_run AFTER the drain has spliced the list (but before the daemon chardev is reopened), the new entry sits in a freshly-empty inflight list and the same hang can recur for that consumer when the systemd auto-restart of the daemon either fails or takes longer than the consumer's patience. A secondary safeguard would be to fail-fast in device_run when dev->chardev is unopened — proposing as a separate ticket if this race materialises in practice.

Closes #146.	2026-05-23 17:06:06 +02:00
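The locking rule described above (detach the whole list under the lock, then complete the entries with the lock dropped) can be illustrated with a userspace analogue. This is only a sketch under stated assumptions: it uses a pthread mutex and a hand-rolled singly linked list in place of the kernel's spinlock/mutex and `list_splice_init`, and every `sketch_*` name is hypothetical.

```c
#include <pthread.h>
#include <stddef.h>

struct sketch_job {
    struct sketch_job *next;
    int state;                      /* 0 = running, -1 = finished with error */
};

struct sketch_dev {
    pthread_mutex_t inflight_lock;
    struct sketch_job *inflight;    /* head of the in-flight list */
    int completed;                  /* protected by inflight_lock */
};

/*
 * Stand-in for the per-job completion. It takes inflight_lock itself,
 * the way a re-entered device_run would, so it must be called with the
 * lock NOT held.
 */
static void sketch_finish_job_error(struct sketch_dev *dev,
                                    struct sketch_job *job)
{
    job->state = -1;
    pthread_mutex_lock(&dev->inflight_lock);
    dev->completed++;
    pthread_mutex_unlock(&dev->inflight_lock);
}

/* Detach every in-flight job atomically, then fail each one unlocked. */
static int sketch_drain_inflight(struct sketch_dev *dev)
{
    struct sketch_job *list, *next;
    int drained = 0;

    pthread_mutex_lock(&dev->inflight_lock);
    list = dev->inflight;           /* analogue of list_splice_init() */
    dev->inflight = NULL;
    pthread_mutex_unlock(&dev->inflight_lock);

    for (; list != NULL; list = next) {
        next = list->next;
        list->next = NULL;
        sketch_finish_job_error(dev, list);
        drained++;
    }
    return drained;
}
```

Because the shared head is emptied before the lock is released, a concurrent producer only ever sees a consistent (empty) list, and the completion callback is free to take the lock again without deadlocking.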
claude-noether	1e9619afe8	daemon: AV1 Sequence Header OBU synthesiser + unit test V4L2 stateless AV1 passes the sequence header information as a structured control (V4L2_CID_STATELESS_AV1_SEQUENCE) and ships only tile-group bytes in the OUTPUT buffer. libavcodec's AV1 decoder is full-bitstream, so the daemon needs to reconstruct the OBU bytes the consumer parsed out before feeding the assembled stream to libavcodec. This commit lands the Sequence Header OBU half of that reconstruction — av1_synth_sequence_header_obu(). Frame Header / Frame OBU synthesisers + the integration that wires the assembled OBUs into the decode hot path are separate follow-on modules. Module shape mirrors the H.264 NAL synthesiser (PR #1): - Public API: single function returning byte count or 0 on overflow/invalid input. - Wire encoder uses the existing bitstream_writer (bsw_put_u is AV1's f(n); bsw_put_ue is bit-identical to AV1's uvlc; bsw_align_rbsp matches AV1's trailing_bits()). - AV1-specific helpers (leb128 size, min_bits_for, subsampling resolution per §5.5.2) are file-local statics. - No emulation prevention — AV1 uses leb128-sized OBUs for bitstream boundaries, not byte-pattern escapes. Synthesis decisions for fields V4L2 doesn't carry are documented verbatim in the file header (reduced_still_picture_header = 0; single operating point at seq_level_idx = 13 / level 5.1; color_description_present_flag = 0; chroma_sample_position = 0; seq_choose_screen_detection_tools = 1; seq_choose_integer_mv = 1). Rejection cases: - seq_profile > 2 - bit_depth not in {8, 10, 12} - seq_profile = 1 + monochrome (4:4:4 forced colour) - seq_profile = 1 + bit_depth = 12 (only profile 2 allows it) - max_frame_{width,height}_minus_1 requiring > 16 length bits - out_cap too small to hold header + leb128 + payload Each returns 0 to surface the mismatch loudly rather than emit nonsense the libavcodec parser would reject downstream. 
Unit test (test_av1_obu_synth.c, opt-in via DAEDALUS_BUILD_TESTS=ON) exercises four cases bit-by-bit against a hand-computed reference:

1. profile 0, 1080p, 8-bit, 4:2:0, order_hint on (7 bits), CDEF+restoration on — the common Pi 5 path.
2. profile 0, 720p, 10-bit, monochrome — exercises high_bitdepth and the monochrome short-form color_config.
3. profile 1 + bit_depth 12 → expects 0 (rejected).
4. tiny out_cap → expects 0 (overflow).

All four green on hertz (aarch64 Arch, gcc Wall+Wextra+Wpedantic clean).

This commit does not change daemon behaviour — av1_obu_synth.c is built into the daemon binary so the symbols are reachable, but no call site is wired yet. Integration goes in the follow-on DAEMON-AV1 patches that also synthesise the Frame Header OBU and bracket the assembled OBUs with a Temporal Delimiter.

Refs reauktion/daedalus-v4l2#11 daemon-half; closes daedalus backlog task #144.	2026-05-23 15:41:07 +02:00
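Two of the file-local helpers the message mentions are small enough to sketch. The leb128 encoder below follows the AV1 definition (7 payload bits per byte, least-significant group first, top bit set on every byte except the last). The parameter check mirrors only the four parameter-level rejection cases listed in the message, not the full AV1 profile rules. All `sketch_*` names are hypothetical and the real helpers may be shaped differently.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Encode value as leb128 into out. Returns the number of bytes written,
 * or 0 if out_cap is too small (same "0 means failure" convention as the
 * synthesiser described above).
 */
static size_t sketch_leb128_encode(uint32_t value, uint8_t *out, size_t out_cap)
{
    size_t n = 0;

    do {
        uint8_t byte = (uint8_t)(value & 0x7fu);
        value >>= 7;
        if (value != 0)
            byte |= 0x80u;          /* more bytes follow */
        if (n >= out_cap)
            return 0;
        out[n++] = byte;
    } while (value != 0);

    return n;
}

/* Returns 1 if the parameters pass the listed checks, 0 if rejected. */
static int sketch_seq_params_ok(unsigned seq_profile, unsigned bit_depth,
                                int monochrome)
{
    if (seq_profile > 2)
        return 0;
    if (bit_depth != 8 && bit_depth != 10 && bit_depth != 12)
        return 0;
    if (seq_profile == 1 && monochrome)
        return 0;
    if (seq_profile == 1 && bit_depth == 12)
        return 0;
    return 1;
}
```

For example, a payload size of 300 encodes as the two bytes 0xac 0x02, while any size below 128 fits in a single byte, which is why short OBUs carry a one-byte size field.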
claude-noether	a43296c1ed	daemon: bounds-check pack_* functions against CAPTURE plane size The three NV12/P010 pack functions (pack_nv12_single_to_plane, pack_nv12_to_planes, pack_p010_to_plane) wrote into the V4L2 client's CAPTURE dmabuf without checking that the mapped size covers the frame libavcodec just decoded. Crash scenario: YouTube DASH stepping resolution mid-stream (e.g. 480p -> 720p when bandwidth improves) — libva is supposed to handle the V4L2_EVENT_SOURCE_CHANGE with STREAMOFF / S_FMT / REQBUFS, but in practice a stale CAPTURE request with the old buffer size sometimes slips through carrying the new (larger) frame. The chroma-interleave inner loop walks past the mapping boundary and the daemon takes SIGSEGV mid-frame, which in turn leaves V4L2 clients hanging in vb2_core_dqbuf — see the followup ticket on the D-state symptom. Fix: compute required = y_size + uv_size against planes->size[N] BEFORE any write. On mismatch, log_warn with both sizes and the frame dimensions, and return -EOVERFLOW. The caller (process_decode_request loop) already handles a negative pack return with a log_warn and proceeds without aborting the decode — the kernel still gets the response with metadata-only and the V4L2 client sees a frame whose pixels are stale but whose buffer-done event fires normally. The next SOURCE_CHANGE the client processes resyncs the buffer size. All three pack paths get the same bounds-check; the comment on pack_nv12_single is the canonical explanation, the other two reference it. Verified: builds clean against trixie aarch64; no behavioural change on the happy path (the bounds check is a single size compare; on a correctly-sized CAPTURE buffer it's a 1-cycle pass). Closes daedalus-v4l2 task #145 (daemon SEGV in pack_nv12_single on resolution change).	2026-05-23 15:31:50 +02:00
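The size check described above ("required = y_size + uv_size against planes->size[N]") can be sketched as follows. This is an illustration rather than the daemon's function: it assumes a single-plane, tightly packed NV12 layout (stride equal to width, even dimensions), and `sketch_nv12_fits` is a hypothetical name.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Returns 0 if a width x height NV12 frame fits into plane_size bytes,
 * -EOVERFLOW otherwise. Must be called before any pixel is written.
 * Assumption: tightly packed planes, i.e. stride == width.
 */
static int sketch_nv12_fits(uint32_t width, uint32_t height, size_t plane_size)
{
    size_t y_size   = (size_t)width * height;        /* one byte per luma sample */
    size_t uv_size  = (size_t)width * (height / 2);  /* interleaved Cb/Cr, 2x2 subsampled */
    size_t required = y_size + uv_size;

    if (required > plane_size)
        return -EOVERFLOW;
    return 0;
}
```

In the resolution-step scenario from the message, a CAPTURE buffer sized for 640x480 holds 460800 bytes, while a 1280x720 frame needs 1382400, so the check rejects the write instead of letting the chroma loop run past the mapping.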
marfrit	872eec505e	Merge pull request 'proto: bump PROTO_MAX_PAYLOAD 64 KiB → 1 MiB (closes #19)' (#20) from noether/issue-19-bump-proto-payload-1mib into main Reviewed-on: #20	2026-05-22 18:47:46 +00:00
marfrit	ee42419479	proto: bump PROTO_MAX_PAYLOAD 64 KiB -> 1 MiB (closes #19)

Real H.264 access units routinely exceed the previous 64 KiB cap on the chardev wire protocol:

    720p worst-case I-frame   ~200 KiB
    1080p worst-case I-frame  ~500 KiB

libva-v4l2-request-fourier detects the under-sized OUTPUT-MPLANE buffer and tries to grow it via VIDIOC_S_FMT to 147456 B, but daedalus_fill_output_fmt unconditionally pins sizeimage to DAEDALUS_MAX_BITSTREAM (= 65484) regardless of userspace's request. Firefox loses the slice, falls back to libmozavcodec SW for the rest of the session.

Bumping the wire-protocol cap to 1 MiB lifts the kernel OUTPUT_MPLANE sizeimage with it (DAEDALUS_MAX_BITSTREAM is derived from the same #define). All allocations (kernel kmalloc / kmemdup, daemon read buffer, vb2 plane backing) are dynamic and sized per-payload at runtime, so the only growth is the daemon's startup read buffer (one ~1 MiB allocation per daemon process) and the V4L2 OUTPUT_MPLANE per-buffer size.

KMALLOC_MAX_SIZE on aarch64 SLUB is several MiB; 1 MiB is well within bounds. Other V4L2 stateless decoders (cedrus, rkvdec, hantro) report 1-4 MiB OUTPUT_MPLANE sizeimage — this puts daedalus at the conservative end of normal.

## Compatibility

#define-only change; struct layout unchanged. But the effective cap is the smaller of (kernel cap, daemon cap), so:

- new daemon + stale kernel: still capped at 64 KiB until the kernel module rebuilds.
- new kernel + stale daemon: same.

Lock-step install of daedalus-v4l2 + daedalus-v4l2-dkms is therefore required for the fix to take effect; mirrors the PR-#7/#8 cadence.

## NOT changed in this commit

- daedalus_fill_output_fmt still hardcodes sizeimage = DAEDALUS_MAX_BITSTREAM regardless of userspace request. Acceptable: vb2 will allocate up to that, and libva's resize-test now sees the kernel report a sizeimage at-least-as-large as what it asked for (147456 < 1048524).
  A future cleanup could respect userspace's S_FMT.sizeimage clamped to the cap, to save memory on tiny streams.
- chardev kmalloc → kvmalloc swap (only matters above KMALLOC_MAX_SIZE, not here).

Refs #19.	2026-05-22 20:46:27 +02:00
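The numbers in this message fit a simple pattern, sketched below. The 52-byte overhead is an inference, not something the message states: it is the difference that makes 64 KiB map to 65484 and 1 MiB map to 1048524. The clamp helper illustrates the "future cleanup" idea only. All `SKETCH_*` / `sketch_*` names are hypothetical.

```c
#include <stdint.h>

#define SKETCH_PROTO_MAX_PAYLOAD_OLD (64u * 1024u)     /* 65536   */
#define SKETCH_PROTO_MAX_PAYLOAD_NEW (1024u * 1024u)   /* 1048576 */

/* Assumption: inferred from 65536 - 65484 == 1048576 - 1048524 == 52. */
#define SKETCH_PROTO_OVERHEAD 52u

/* Largest bitstream that fits in one wire-protocol message. */
static uint32_t sketch_max_bitstream(uint32_t proto_max_payload)
{
    return proto_max_payload - SKETCH_PROTO_OVERHEAD;
}

/* Both ends enforce their own cap, so the smaller one wins. */
static uint32_t sketch_effective_cap(uint32_t kernel_cap, uint32_t daemon_cap)
{
    return kernel_cap < daemon_cap ? kernel_cap : daemon_cap;
}

/* Possible S_FMT policy: honour the request, but never exceed the cap. */
static uint32_t sketch_clamp_sizeimage(uint32_t requested, uint32_t cap)
{
    if (requested == 0 || requested > cap)
        return cap;
    return requested;
}
```

Under these assumptions a stale kernel paired with a new daemon still yields an effective cap of 65484 bytes, so libva's 147456-byte request cannot be satisfied until both halves are rebuilt.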
marfrit	1d8f5af164	Merge pull request 'daemon: filter tiny pause-time bitstreams (closes #17)' (#18) from noether/issue-17-tiny-bitstream-filter into main Reviewed-on: #18	2026-05-22 16:14:56 +00:00
marfrit	3e4e6e8eae	daemon: filter tiny pause-time bitstreams (closes #17)

libva-v4l2-request-fourier flushes a stub packet into the V4L2 OUTPUT_MPLANE queue at playback-pause boundaries. The payload is shorter than any parseable H.264 NAL (3-byte start code + 1-byte NAL header = 4 bytes minimum); avcodec_send_packet returns AVERROR_INVALIDDATA (-1094995529), which propagated to the kernel as a decode failure. Firefox then marked H.264-via-VAAPI as broken for the session and routed every subsequent frame to libmozavcodec SW — pause never recovered to HW.

At the REQ_DECODE entry in chardev_client.c::handle_req_decode, short-circuit any bitstream below the minimum-parseable threshold: log INFO, skip daedalus_decoder_run_request, and reply RESP_FRAME with status=DAEDALUS_DECODE_NO_FRAME so libva's V4L2 surface pool stays healthy and Firefox doesn't see a failure.

Repro: Pi CM5 trixie, daedalus-v4l2 0.1.0+r41 + ffmpeg-v4l2-request-fourier 2:8.1+rfourier+gb57fbbe-9, Firefox YouTube avc1. Play → daemon decodes at ~46 fps. Pause ≥ 1s. Resume → daemon silent; sudo journalctl -u daedalus-v4l2 --since '10s' | grep -c 'decoder: OK' = 0. Last entry before silence:

    REQ_DECODE cookie=N codec=3 bitstream=3 bytes ...
    [h264 @ ...] no frame!
    [ERR] decoder: avcodec_send_packet failed: -1094995529

After this fix the 3-byte sentinel logs as 'tiny bitstream 3 bytes — dropping as no-op' and the libavcodec context is untouched; the next real REQ_DECODE proceeds normally.

Scope NOT covered (intentionally deferred):

- A more general "tolerate AVERROR_INVALIDDATA mid-stream" path. Worth doing later but masks unrelated bugs.
- Investigating WHY libva sends the 3-byte sentinel on pause. Likely an upstream libva-v4l2-request-fourier issue; tracked separately if this filter is not enough.

Wire protocol unchanged. No DAEDALUS_PROTO_VERSION bump.	2026-05-22 17:26:25 +02:00
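The short-circuit described above reduces to a length comparison before the decoder is touched. The sketch below shows that decision together with a way to read the error number from the log: libavutil builds AVERROR_INVALIDDATA from the four characters 'I','N','D','A' via its FFERRTAG macro, and the helper here reproduces that arithmetic without linking FFmpeg. The `sketch_*` names and the enum are hypothetical, not the daemon's identifiers.

```c
#include <stddef.h>
#include <stdint.h>

/* 3-byte start code + 1-byte NAL header, per the message above. */
#define SKETCH_H264_MIN_BYTES ((size_t)4)

enum sketch_action {
    SKETCH_DECODE,          /* hand the bitstream to the decoder        */
    SKETCH_REPLY_NO_FRAME   /* skip the decoder, answer "no frame"      */
};

/* Decide what to do with an incoming bitstream of len bytes. */
static enum sketch_action sketch_classify_bitstream(size_t len)
{
    return len < SKETCH_H264_MIN_BYTES ? SKETCH_REPLY_NO_FRAME
                                       : SKETCH_DECODE;
}

/* Same arithmetic as libavutil's FFERRTAG(a, b, c, d). */
static int sketch_fferrtag(char a, char b, char c, char d)
{
    uint32_t tag = (uint32_t)(unsigned char)a |
                   ((uint32_t)(unsigned char)b << 8) |
                   ((uint32_t)(unsigned char)c << 16) |
                   ((uint32_t)(unsigned char)d << 24);
    return -(int)tag;
}
```

The 3-byte sentinel falls below the threshold and is answered without reaching libavcodec, and `sketch_fferrtag('I', 'N', 'D', 'A')` evaluates to -1094995529, the value seen in the journal.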
marfrit	6e6dfa144d	Merge pull request 'daemon: dlopen Kwiboo fork's soname 62 (FFmpeg 8.1 at /opt/fourier)' (#16) from noether/daemon-dlopen-kwiboo-soname62 into main Reviewed-on: #16	2026-05-21 19:20:22 +00:00
claude-noether	514da29a73	daemon: dlopen Kwiboo fork's libavcodec.so.62 / libavformat.so.62 / libavutil.so.60

Switch the daemon's runtime dlopen targets from Debian-stock soname 61/61/59 (FFmpeg 7.1.3) to the Kwiboo fourier fork's soname 62/62/60 (FFmpeg 8.1) installed at the /opt/fourier prefix.

Why
---

The substitution arc tracked at daedalus-v4l2#11 needs daedalus-fourier kernel calls woven into libavcodec's H264DSPContext NEON init (replacing ff_h264_idct_add_neon etc. with thunks calling daedalus_recipe_dispatch_h264_). We do that via patches in the ffmpeg-v4l2-request-fourier package source — which we own, in marfrit-packages, alongside the existing libudev-bypass and nv15-to-p010 patches.

But that package builds the Kwiboo fork at soname 62 / /opt/fourier. The daemon currently dlopens soname 61 (Debian-stock + a separately-built +fourier2 patch that isn't in marfrit-packages' source tree), so substitution patches there wouldn't reach the daemon. Switching to soname 62 routes the daemon through the package we control — first step toward landing daedalus-fourier kernel substitution into the production decode path.

Compat
------

- /opt/fourier libs are already on every host running the daemon (hard build-dep of ffmpeg-v4l2-request-fourier). Firefox-fourier and mpv-fourier already dlopen them via the same path.
- /etc/ld.so.conf.d/fourier.conf entry resolves the new sonames from /opt/fourier/lib via the ld cache; dlopen-by-soname works without LD_LIBRARY_PATH wrappers.
- Build-side: daemon's pkg_check_modules picks up libav.pc from /opt/fourier/lib/pkgconfig when PKG_CONFIG_PATH includes that directory (build-deb.sh follow-up will set it).
- API surface unchanged: avcodec_send_packet / receive_frame / AVCodecContext flags / AVFrame fields are all stable between FFmpeg 7.1 and 8.1.

Verified clean cross-compile on hertz. Wire protocol unchanged. No kmod bump.

Next step (follow-up PRs)
-------------------------

1. 
ffmpeg-v4l2-request-fourier patch: add 0003-daedalus-fourier-substitute-h264-idct4.patch that replaces ff_h264_idct_add_neon in libavcodec/aarch64/h264dsp_init_aarch64.c with a thunk calling daedalus_recipe_dispatch_h264_idct4.
2. Repeat for IDCT 8×8, deblock luma-v, qpel mc20 (one kernel per PR for reviewability; bench delta + decode_us delta documented per substitution).
3. marfrit-packages bump to pick up the new daemon + the substituted fourier package.	2026-05-21 21:19:24 +02:00
marfrit	3bc0da168c	Merge pull request 'daemon: per-frame decode_us + periodic stats (#11 step 1)' (#15) from noether/daemon-decode-stats into main Reviewed-on: #15	2026-05-21 18:26:50 +00:00
claude-noether	814b74d0bb	daemon: per-frame decode_us + periodic stats summary (#11 step 1) Establishes observable baseline metrics before any daedalus-fourier kernel substitution lands. Step 1 of the daemon-rewrite arc tracked at daedalus-v4l2#11. Changes ------- - Per-frame `decoder: OK ...` log line now carries decode_us=N (the send_packet + receive_frame wall-clock cost in microseconds — exclusively the libavcodec round-trip, not the bitstream pack / SPS-PPS synth / pack-to-planes work). - New "decoder stats" summary line every DAEDALUS_STATS_EVERY (60) decoded frames, reporting: codec, frame count, window seconds, fps, avg decode_us, MBs/s throughput, bytes/MB bitrate. Sample ------ decoder stats: codec=h264 frames=300 window=12.32s fps=24.35 avg_decode_us=4216.4 mbs_per_s=87643 bs_b_per_mb=1.56 What this tells us ------------------ Steady-state on higgs (Pi CM5) decoding bbb_720p_h264.mp4: ~4 ms decode_us, ~90 K MBs/s, well under the daedalus-fourier NEON kernel ceilings (IDCT 4×4 @ 175 Mblocks/s, deblock @ 92 Medges/s, qpel mc20 @ 131 Mblocks/s — all 100-1000× over our actual workload). Means the 4 ms/frame is mostly libavcodec's CABAC + MV prediction + intra prediction overhead, NOT the pixel-math primitives. Substituting a single primitive would shave only a small slice of the 4 ms. Useful as guidance for the upcoming substitution work — we'll pick the primitive with the largest cycle cost relative to the alternative, and measure CPU saved per substitution. No behaviour change: counters are static + unsynchronised (the chardev event loop is single-threaded); reset when codec_id changes. clock_gettime(CLOCK_MONOTONIC) for timing.	2026-05-21 20:17:09 +02:00
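The derived figures in the sample line above follow from a few accumulators. The sketch below is an assumption-laden illustration, not the daemon's code: the struct layout and all `sketch_*` names are hypothetical, and the macroblock count assumes 16x16 macroblocks over a 1280x720 frame for the 720p clip. A caller would fill the two `struct timespec` values with `clock_gettime(CLOCK_MONOTONIC, ...)` around the send_packet / receive_frame pair.

```c
#include <stdint.h>
#include <time.h>

struct sketch_stats {
    uint64_t frames;          /* frames decoded in the current window   */
    uint64_t decode_us_sum;   /* total decode time in the window, in us */
};

/* Microseconds between two monotonic samples (end must not precede start). */
static int64_t sketch_elapsed_us(const struct timespec *start,
                                 const struct timespec *end)
{
    int64_t sec  = (int64_t)end->tv_sec - (int64_t)start->tv_sec;
    int64_t nsec = (int64_t)end->tv_nsec - (int64_t)start->tv_nsec;
    return sec * 1000000 + nsec / 1000;
}

/* Record one decoded frame. */
static void sketch_stats_add(struct sketch_stats *s, uint64_t decode_us)
{
    s->frames++;
    s->decode_us_sum += decode_us;
}

/* Mean decode time per frame, 0 if nothing was decoded. */
static double sketch_avg_decode_us(const struct sketch_stats *s)
{
    return s->frames ? (double)s->decode_us_sum / (double)s->frames : 0.0;
}

/* Frames per second over a window measured in seconds. */
static double sketch_fps(uint64_t frames, double window_s)
{
    return window_s > 0.0 ? (double)frames / window_s : 0.0;
}

/* 16x16 macroblocks per frame, rounding partial blocks up. */
static uint32_t sketch_mbs_per_frame(uint32_t width, uint32_t height)
{
    return ((width + 15u) / 16u) * ((height + 15u) / 16u);
}
```

With the sample's frames=300 and window=12.32s, `sketch_fps` gives about 24.35, matching the logged value; a 1280x720 frame has 3600 macroblocks, which at that rate is in the neighbourhood of the logged mbs_per_s figure.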
marfrit	77e14e5a19	Merge pull request 'daemon: link daedalus-fourier + log substrate availability at startup' (#13) from noether/daemon-link-daedalus-fourier into main Reviewed-on: #13	2026-05-21 16:35:38 +00:00
claude-noether	88b2ebfaa9	daemon: link daedalus-fourier + log substrate availability at startup First incremental step toward H.264 daemon-rewrite (daedalus-v4l2#11): make the daedalus-fourier kernel library available to the daemon process so subsequent patches can substitute its primitives (IDCT 4×4, IDCT 8×8, luma vertical deblock, etc.) for libavcodec's per-MB pixel math. This patch does NOT yet dispatch any kernels. It only: - Adds `pkg_check_modules(DAEDALUS_FOURIER REQUIRED daedalus-fourier)` to the daemon's CMakeLists, with explicit link ordering (libdaedalus_core.a must precede -lvulkan because the static archive references vulkan symbols and the linker resolves left-to-right). We bypass IMPORTED_TARGET because pkg-config's Requires.private chain leaves CMake's dependency graph reordering the archive after -lvulkan, breaking the static link. - Calls daedalus_ctx_create_no_qpu() at daemon startup, logs the substrate-availability line, destroys the context at exit. no_qpu mode skips V3D Vulkan probe — proves linkage works without depending on shader-path resolution (which is a separate piece of work, since v3d_runner currently loads .spv files from cwd-relative paths and consumer would need a search path override). Sample journal line: [2026-05-21 17:59:35.271 INFO] daedalus-fourier: linked, ctx alive (no_qpu mode; has_qpu=0) Build-test verified on hertz (Pi 5 dev host) against an installed copy of daedalus-fourier r35+gd87239d (from marfrit/daedalus-fourier PR #1). Binary links cleanly, --help prints, daemon mode opens chardev (fails predictably on hertz which has no daedalus_v4l2 kmod; on higgs this is the existing working path). Follow-up patches per daedalus-v4l2#11: 1. Instrument the existing libavcodec decode path to count per-frame IDCT blocks / deblock edges / MC tiles so we have a baseline of what work the daemon dispatches for a typical YouTube H.264 stream. 2. Substitute daedalus-fourier kernels one at a time, measuring CPU saved per substitution. 
3. Wire shader path resolution into daedalus_ctx_create() for the QPU substrate (V3D opportunistic helper paths).

Wire protocol unchanged. DAEDALUS_PROTO_VERSION stays at 0.	2026-05-21 18:00:46 +02:00
marfrit	64b9599e47	Merge pull request 'daemon: AV_CODEC_FLAG_LOW_DELAY for H.264 — implements #11 part (2)' (#12) from noether/daemon-low-delay-h264 into main Reviewed-on: #12	2026-05-21 15:17:57 +00:00
claude-noether	234a103084	daemon: AV_CODEC_FLAG_LOW_DELAY for H.264 — fix display-reorder breaking V4L2 1:1

Force libavcodec's H.264 decoder to emit frames in DECODE order (one frame per send_packet, no internal display-order reorder queue). Single-line addition: ctx->flags |= AV_CODEC_FLAG_LOW_DELAY before avcodec_open2, gated on codec_id == DAEDALUS_CODEC_H264. Closes daedalus-v4l2#11 part (2).

Background
----------

PR #7's "parking design" approach to the H.264 display-reorder problem broke libva-v4l2-request-fourier's 1:1 CAPTURE-completion contract (see #9 + #10). After the revert, the visible "2 1 4 3" pair-swap regressed and the only path forward was to align the daemon's output ordering with what V4L2 stateless clients expect: decode order, one CAPTURE buffer per OUTPUT slice, with display reorder pushed upstream to ffmpeg-vaapi's per-VAAPI-surface POC logic (which it already does correctly for every real H.264 hardware decoder via VAPictureParameterBufferH264).

How LOW_DELAY does this
-----------------------

Inside libavcodec/h264dec.c, the flag sets h->low_delay = 1. h264_select_output_frame (h264_picture.c) emits the just-decoded picture immediately instead of routing through the display-order DPB output queue. DPB management for reference frames (short_ref / long_ref) is unaffected — B-frame decoding correctness is preserved; only the output buffering is bypassed.

Skipped for VP9 / AV1 — those codecs don't reorder internally, so the flag would be a no-op but adds no value.

Verified
--------

On higgs (Pi CM5, 6.18.29+rpt-rpi-2712), test daemon hot-swapped into /usr/bin/daedalus_v4l2_daemon, mpv --hwdec=vaapi-copy --frames=300 against bbb_720p_h264.mp4: 311 REQ_DECODEs received, 308 successful "decoder: OK" responses (99.04% steady-state delivery — 3 lost at GOP boundaries, no compounding drift). mpv plays to its --frames cap and exits cleanly with "End of file".
No "Unable to dequeue buffer", no "Failed to end picture decode", no "AVHWFramesContext: Failed to sync surface" — all the failures from #9 are gone. Builds clean against ffmpeg-v4l2-request-fourier libavcodec.	2026-05-21 17:14:33 +02:00
marfrit	5d8b4369e5	Merge pull request 'kernel + daemon: revert PRs #7 + #8 (parking design incompatible with V4L2 stateless 1:1 expectation)' (#10) from noether/revert-parking-pr7-pr8 into main Reviewed-on: #10	2026-05-21 13:39:09 +00:00
marfrit	714d781d22	Revert "Merge pull request 'kernel + daemon: H.264 B-frame display reorder fix (closes #6)' (#7) from noether/kernel-daemon-h264-reorder-fix into main" This reverts commit `79256dc7ef`, reversing changes made to `7ff2d897ea`.	2026-05-21 14:40:59 +02:00
marfrit	49e60c9bba	Revert "Merge pull request 'kernel: claim src/dst at device_run, not at buf_done (fixes panic from #7)' (#8) from noether/kernel-claim-bufs-at-device-run into main" This reverts commit `6ffe92bcac`, reversing changes made to `79256dc7ef`.	2026-05-21 14:40:52 +02:00
marfrit	6ffe92bcac	Merge pull request 'kernel: claim src/dst at device_run, not at buf_done (fixes panic from #7)' (#8) from noether/kernel-claim-bufs-at-device-run into main Reviewed-on: #8	2026-05-21 11:54:52 +00:00
claude-noether	f10a26d883	kernel: claim src/dst at device_run, not at buf_done Hard reboot observed on higgs (Pi CM5) during the first mpv vaapi-copy playback against the freshly-deployed r28+g79256dc stack — kernel panic, no persistent journal, no recoverable trace. Bug introduced by the daedalus-v4l2#6 reorder fix (#7). Cause ----- The new completion path runs `v4l2_m2m_job_finish` on SRC_CONSUMED even when the dst_buf is still parked (waiting for a future HAS_PIXELS). job_finish moves the m2m_ctx back to IDLE, the scheduler dispatches the next device_run — which calls `v4l2_m2m_next_dst_buf`, which returns the head of the CAPTURE ready-queue, which is STILL the parked dst_buf because we never removed it. Two inflight entries now reference the same vb2_buffer; the later HAS_PIXELS triggers `v4l2_m2m_dst_buf_remove_by_buf` on a vb2_buffer whose list_head is no longer linked to that queue, and `list_del()` smashes the next/prev pointers of whatever ELSE was at those addresses. Fix --- Take both src and dst off `m2m_ctx`'s rdy_queue at device_run — as soon as `v4l2_m2m_next__buf` has peeked them and all early-exit validation has passed. After that, the daemon owns both halves exclusively via the inflight item; the m2m scheduler can't re-issue them on the next device_run. Completion path drops the redundant `_remove_by_buf` calls — list is already detached, so `buf_done` alone is correct. Matches the amphion `vdec.c`/`venc.c` pattern (which also claims at device_run for the same reason: amphion's encode pipeline parks output buffers across multiple frames waiting for the codec to finish, structurally the same as our H.264 B-frame DPB parking). `fail_buf_error` learns about the new `claimed` flag and skips the `v4l2_m2m__buf_remove` calls when the buffers have already been removed by-buf at device_run. Verified -------- Builds clean against 6.18.29+rpt-rpi-2712. 
Field test pending — deploy via marfrit-packages bump in lock-step with the daemon (which doesn't need to change for this fix; PROTO_VERSION stays at 1).	2026-05-21 13:49:44 +02:00
marfrit	79256dc7ef	Merge pull request 'kernel + daemon: H.264 B-frame display reorder fix (closes #6)' (#7) from noether/kernel-daemon-h264-reorder-fix into main Reviewed-on: #7	2026-05-21 10:36:53 +00:00
claude-noether	15fc2aba14	kernel + daemon: H.264 B-frame display reorder fix (issue #6)

H.264 streams with B-frames showed visibly pair-swapped output in mpv / Firefox playback through the libva → daedalus_v4l2 path — "frames went 2 1 4 3 6 5 instead of 1 2 3 4 5 6". Reproduced in mpv with --hwdec=vaapi-copy at 720p (bypassing Firefox's compositor), confirming the bug was in this daemon pipeline, not downstream.

Root cause
----------

libavcodec's H.264 decoder internally reorders output to DISPLAY order before returning from avcodec_receive_frame. The daemon previously called send_packet → receive_frame ONCE per REQ_DECODE and shipped the resulting pixels in a RESP_FRAME tagged with the SAME cookie. For B-frames this is wrong: the frame returned from receive_frame may belong to an EARLIER bitstream (libavcodec held it for display-order release). Cookie N's CAPTURE buffer therefore got cookie N-2's pixels, while cookie N-2's CAPTURE buffer got silently marked VB2_BUF_STATE_ERROR (the daemon returned DAEDALUS_DECODE_NO_FRAME for the cookie whose pixels were held).

Fix shape
---------

Decouple kernel cookie identity (decode-order routing) from libavcodec's display-ordered output. Wire-protocol changes:

    REQ_DECODE  + __u64 src_pts         (= src_buf->vb2_buf.timestamp)
    RESP_FRAME  + __u32 flags           (HAS_PIXELS | SRC_CONSUMED)
                + __u64 output_src_pts  (= frame->pts on drain)

PROTO_VERSION bumped 0 → 1. Lock-step rebuild required.

Kernel
------

device_run now mirrors src_buf->vb2_buf.timestamp into req->src_pts before sending REQ_DECODE, and stores it on the inflight item so the completion path can stamp dst_buf.timestamp explicitly when src/dst lifecycles decouple (V4L2_BUF_FLAG_TIMESTAMP_COPY's auto-pairing no longer applies).

daedalus_complete_resp_frame splits into:

HAS_PIXELS: pack pixels into THIS cookie's CAPTURE buffer, stamp dst timestamp from inflight->src_pts, v4l2_m2m_buf_done(dst, DONE/ERROR). No job_finish here.
SRC_CONSUMED: release the bound media_request, run v4l2_m2m_buf_done(src) + v4l2_m2m_job_finish so the scheduler can dispatch the next REQ. dst_buf may still be parked at this point.

Inflight entry is removed and freed only when BOTH src_buf and dst_buf have been cleared. Combined HAS_PIXELS|SRC_CONSUMED RESPs (steady-state VP9/AV1 with no reorder lag) collapse to the prior 1:1 behaviour for free.

Daemon
------

daedalus_decoder_run_request split into three primitives:

- daedalus_decoder_submit — set pkt->pts = req->src_pts, avcodec_send_packet.
- daedalus_decoder_drain_one — avcodec_receive_frame, populate resp meta + output_src_pts (= the frame's pts, carried back from the bitstream that produced it).
- daedalus_decoder_pack_current — pack current AVFrame into the caller-mapped CAPTURE planes.

chardev_client maintains a small (src_pts → cookie, cached_req) table indexed linearly (≤64 entries; bounded by V4L2 client buffer pool depth). On each REQ_DECODE:

1. Register (src_pts → cookie) in the table.
2. submit().
3. Drain loop: for each frame returned, look up its owner cookie via pending_lookup(frame->pts), GET_DMABUF for THAT cookie, pack pixels, emit RESP_FRAME(owner_cookie, HAS_PIXELS, output_src_pts=frame->pts). Combine with SRC_CONSUMED when owner_cookie equals the current REQ's cookie.
4. If the current REQ's cookie wasn't drained inside the loop (libavcodec held the frame), emit a standalone SRC_CONSUMED RESP so the kernel runs job_finish + dispatches the next REQ; dst_buf for this cookie stays parked until a future drain produces its pixels.

VP9 / AV1 paths are unchanged in behaviour: one frame per REQ, HAS_PIXELS|SRC_CONSUMED in one combined RESP.

Verified
--------

Builds clean cross-compiled on higgs against 6.18.29+rpt-rpi-2712 (Pi CM5). Frame-size warning in device_run is pre-existing (unchanged by this commit).	2026-05-21 12:32:47 +02:00
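The (src_pts → cookie) table described in the daemon section above can be sketched as a fixed array scanned linearly. This is an illustration under assumptions, not the reverted code: the 64-entry bound comes from the message, while the 64-bit cookie type, the struct layout, and all `sketch_*` names are hypothetical. Duplicate timestamps are not handled.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PENDING_MAX 64   /* "≤64 entries" per the message */

struct sketch_pending {
    uint64_t src_pts;
    uint64_t cookie;   /* assumption: cookies fit in 64 bits */
    int used;
};

struct sketch_pending_table {
    struct sketch_pending slot[SKETCH_PENDING_MAX];
};

/* Remember which cookie submitted the bitstream stamped src_pts.
 * Returns 0 on success, -1 if the table is full. */
static int sketch_pending_register(struct sketch_pending_table *t,
                                   uint64_t src_pts, uint64_t cookie)
{
    for (size_t i = 0; i < SKETCH_PENDING_MAX; i++) {
        if (!t->slot[i].used) {
            t->slot[i].src_pts = src_pts;
            t->slot[i].cookie = cookie;
            t->slot[i].used = 1;
            return 0;
        }
    }
    return -1;
}

/* Find the owner of a drained frame by its pts and release the slot.
 * Returns 1 and stores the cookie on a hit, 0 if the pts is unknown. */
static int sketch_pending_take(struct sketch_pending_table *t,
                               uint64_t frame_pts, uint64_t *cookie_out)
{
    for (size_t i = 0; i < SKETCH_PENDING_MAX; i++) {
        if (t->slot[i].used && t->slot[i].src_pts == frame_pts) {
            *cookie_out = t->slot[i].cookie;
            t->slot[i].used = 0;
            return 1;
        }
    }
    return 0;
}
```

A linear scan is adequate at this size: the table never holds more entries than the client's buffer pool, so a hash map would add complexity without a measurable gain.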
marfrit	7ff2d897ea	Merge pull request 'kernel: register H.264 DECODE_MODE + START_CODE menu controls' (#4) from noether/kernel-h264-menu-ctrls into main Reviewed-on: #4	2026-05-21 09:02:43 +00:00
claude-noether	69a62a922f	kernel: register H.264 DECODE_MODE + START_CODE menu controls libva-v4l2-request sets V4L2_CID_STATELESS_H264_DECODE_MODE and V4L2_CID_STATELESS_H264_START_CODE on the device fd at context init (see libva-v4l2-request-fourier src/context.c:577 — best-effort call, result is (void)cast). Our ctrl_handler did not advertise either control, so v4l2-core returned EINVAL on validate; userspace logged the noisy v4l2-request: Unable to set control(s): Invalid argument (error_idx=2/2 ioctl-level) at every Firefox/ffmpeg context creation, despite decode itself succeeding (the daemon already operates as FRAME_BASED + ANNEX_B and the per-request SPS/PPS/SCALING_MATRIX/DECODE_PARAMS batch lands fine). Register the two as v4l2_ctrl_new_std_menu with the only value each the daemon actually supports — FRAME_BASED for DECODE_MODE, ANNEX_B for START_CODE — and mask out the unsupported alternates (SLICE_BASED, NONE). Pattern matches rkvdec / hantro. Update the handler-init capacity hint to ARRAY_SIZE(daedalus_stateless_ctrls) + 2 to cover the additions. Verified: builds clean on 6.18.29+rpt-rpi-2712 (Pi CM5) DKMS source tree.	2026-05-21 11:01:41 +02:00
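Commit 69a62a922f registers each menu control with a single supported value and masks out the alternates. In the kernel the skip mask is the fifth argument of v4l2_ctrl_new_std_menu(hdl, ops, id, max, mask, def), where a set bit N means menu item N is skipped. The userspace sketch below only shows how such a mask could be computed; the `SKETCH_` enums are local stand-ins for the UAPI values and the helper name is hypothetical.

```c
#include <stdint.h>

/* Local stand-ins mirroring the UAPI enum values (assumed: 0 and 1). */
enum { SKETCH_DECODE_MODE_SLICE_BASED = 0, SKETCH_DECODE_MODE_FRAME_BASED = 1 };
enum { SKETCH_START_CODE_NONE = 0, SKETCH_START_CODE_ANNEX_B = 1 };

/* Build a menu skip mask in which every item in [0, max] is skipped
 * except `only`. Requires only <= max and only < 64. */
uint64_t sketch_menu_skip_mask(unsigned max, unsigned only)
{
    uint64_t all = (max >= 63) ? UINT64_MAX
                               : ((UINT64_C(1) << (max + 1)) - 1);
    return all & ~(UINT64_C(1) << only);
}
```

For DECODE_MODE with max = FRAME_BASED and only FRAME_BASED allowed, the mask is 0x1 (bit 0, SLICE_BASED, skipped); the same value results for START_CODE when only ANNEX_B is allowed.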
marfrit	f0d41867f6	Merge pull request 'kernel: per-ctx vb2 lock — Firefox multi-process VAAPI unblock' (#3) from noether/kernel-per-ctx-vb-mutex into main Reviewed-on: #3	2026-05-20 19:25:02 +00:00
marfrit	a3ada8ba38	kernel: per-ctx vb2 lock so concurrent clients don't serialise on dev mutex daedalus_queue_init was wiring both src_vq->lock and dst_vq->lock to ctx->dev->m2m_lock — a device-wide mutex. That serialises every vb2 ioctl (S_FMT, REQBUFS, QBUF, DQBUF, STREAMON, ...) across ALL concurrent clients of /dev/video0. For a single-client consumer like the test_m2m_* tools it doesn't matter; for Firefox, which spawns separate content + RDD + GPU processes that each open /dev/video0 and run libva probe simultaneously, the contention showed up as EBUSY from one libva session's S_FMT(OUTPUT_MPLANE) when another session was mid-streamon on the same device. Observable on higgs (Pi CM5): $ MOZ_VA_API_ENABLED=1 LIBVA_DRIVER_NAME=v4l2_request firefox ... v4l2-request: phase 8.10: opened daedalus_v4l2 at video_fd=32 ... v4l2-request: cap_pool_init: 24 slots ready v4l2-request: Unable to set format for type 10: Device or resource busy After this fix, each open() gets its own ctx->vb_mutex and the per-context vb2_queue locks are independent — Firefox's multi- process VAAPI clients no longer fight each other. YouTube playback on higgs runs through daedalus at ~230 fps sustained (640x368, libavcodec dlopen path), 7× headroom over the 30fps target. cedrus / rkvdec / hantro all use the per-ctx vb mutex pattern for the same reason. This mirrors them. Lifecycle: - mutex_init in daedalus_open (right after the kzalloc that creates ctx, before v4l2_fh_init). - mutex_destroy in daedalus_release (after v4l2_fh_exit, before kfree), and in the err_ctrl unwind path in daedalus_open. Verified end-to-end on higgs: - rmmod + modprobe the rebuilt .ko. - Restart daedalus-v4l2.service. - Firefox YouTube playback engages VAAPI, daemon journal shows cookie=1..N codec=3 (H.264) REQ_DECODE / decoder:OK pairs with unique per-frame fnv1a hashes. - No EBUSY in either firefox stderr or daemon journal during the entire session. 
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-20 21:23:44 +02:00
marfrit	462aa4b480	Merge pull request 'kernel: bind request controls to p_cur via v4l2_ctrl_request_setup' (#2) from noether/kernel-ctrl-request-setup into main Reviewed-on: #2	2026-05-20 18:37:12 +00:00
marfrit	29f16ece13	kernel: bind request controls to p_cur before reading them device_run was reading ctrl->p_cur.p_h264_* directly, but v4l2-m2m's request scheduler does NOT auto-bind the in-flight media_request's control values to the ctrl handler's p_cur slots — drivers have to call v4l2_ctrl_request_setup() explicitly. cedrus / rkvdec / hantro all do this in their device_run; daedalus didn't. Result: daedalus_collect_h264_meta() read stale or default values (whatever the prior request had left in p_cur, or v4l2_ctrl_new_custom initial state if no prior request had completed) instead of the S_EXT_CTRLS V4L2_CTRL_WHICH_REQUEST_VAL values libva-v4l2-request- fourier had just sent for THIS frame. The mismatch was a smoking gun on higgs after libva PR #9 / packages PR #52 landed an instrumentation log at h264_set_controls entry: libva boundary (sent to kernel): VAProfile=13 seq_fields=0x00032051 pic_fields=0x00000500 num_ref_frames=1 daedalus daemon (read from kernel p_cur): prof=100 level=41 ref_frames=0 flags=0x10 pps_flags=0x0 After calling v4l2_ctrl_request_setup() at the top of device_run: daedalus daemon (read from kernel p_cur): prof=66 level=11 ref_frames=1 poc_type=2 flags=0x50 pps_flags=0x88 — matches what libva sent, matches the bitstream's actual SPS. End-to-end test on higgs with libva-v4l2-request-fourier 1.0.0+r382 +gc1bb444 (after-fix-3-and-fix-4-instrumentation) + this kernel patch: $ LIBVA_DRIVER_NAME=v4l2_request ffmpeg -hwaccel vaapi \ -hwaccel_device /dev/dri/renderD128 -i h264_test.mp4 \ -frames:v 1 -f null - ... rc=0 daemon journal: zero "error while decoding MB" lines, zero "reference frames exceeds max" lines. Per-frame fnv1a hashes differ (0xf1c515aa, 0x16e915e8, 0x16bd16cc, ...) instead of the constant 0x6a6a05c5 "give-up-and-zero" hash from before — libavcodec is actually decoding real pixel content from each P-frame. 
Pair note: the daemon side already calls v4l2_ctrl_request_complete in daedalus_complete_resp_frame (line 834) — this commit pairs the setup half with that completion half. The daemon side change (decoder.c) is a small log-level promotion: the per-frame "h264 SPS/PPS prepended ..." trace went from log_debug to log_info so the journal shows what's being shipped into libavcodec without needing a daemon rebuild with --debug. Matches the libva-side h264_set_controls instrumentation that landed in libva PR #9. Closes part of issue libva-v4l2-request-fourier#8 — the SPS/PPS field-value gap. Profile/level still come from libva's session-derived hardcoded values (h264_profile_to_idc + h264_derive_level_idc) which is sufficient for libavcodec to accept the synthesised NAL unit; a true stream-parsed profile/level would need SPS-NAL parsing in libva — separate operator-design call. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-20 20:35:06 +02:00
marfrit	3dd0eb070a	Merge pull request 'DAEMON-PPS: synthesise H.264 SPS/PPS NAL units from V4L2 controls' (#1) from noether/daemon-pps-h264-nal-synth into main Reviewed-on: #1	2026-05-20 16:51:26 +00:00
marfrit	8c1d9960c4	DAEMON-PPS: synthesise H.264 SPS/PPS NAL units from V4L2 controls libva-v4l2-request-fourier (and any V4L2-stateless-API consumer) passes H.264 SPS/PPS as separate V4L2_CID_STATELESS_H264_{SPS,PPS} controls; only the slice NAL goes into the OUTPUT buffer. This is correct per the V4L2 stateless contract. But libavcodec — which the daedalus daemon uses for actual decode (Option γ) — wants a self-contained AnnexB stream including SPS+PPS before any slice. Result on higgs: "non-existing PPS 0 referenced" + decode_slice_ header errors on every H.264 frame, even after LIBVA-1 and -2 routing correctly delivered the request to the daemon. Fix splits across kernel + daemon, keeping the kernel module as a thin transport and putting the actual NAL encoding in userspace: include/daedalus_v4l2_proto.h: Add struct daedalus_h264_meta (the four v4l2_ctrl_h264_* structs the kernel collects) and DAEDALUS_REQ_FLAG_H264_META (set in req.flags when the meta block is present between the daedalus_req_decode prefix and the slice bitstream). kernel/daedalus_v4l2_main.c: Add daedalus_collect_h264_meta() — reads the H.264 ctrl values from the bound media_request via v4l2_ctrl_find + ctrl->p_cur.p_h264_*. device_run() calls it on H.264 codec_id, copies the structs into the REQ_DECODE payload between the prefix and bitstream, and sets the flag. Payload size is bounds-checked against DAEDALUS_PROTO_MAX_PAYLOAD so an over- sized slice + meta fails loud instead of truncating. daemon/src/bitstream_writer.{c,h}: New module — MSB-first bit packer with H.264 Exp-Golomb ue(v) and se(v) coding + rbsp_trailing_bits alignment. Sticky overflow flag so callers can verify the output buffer wasn't truncated. daemon/src/h264_nal_synth.{c,h}: New module — turns v4l2_ctrl_h264_sps / v4l2_ctrl_h264_pps into AnnexB-framed NAL units per ITU-T H.264 7.3.2.1 / 7.3.2.2. Emits emulation prevention bytes (0x03 after every 00 00 in the EBSP) and the 4-byte start code (0x00000001). 
Coverage matches what V4L2 stateless surface gives us: VUI parameters and full scaling matrices are NOT emitted (V4L2 doesn't carry them — the seq_scaling_matrix_present_flag is set to 0 and libavcodec uses flat defaults, which matches the de-facto behaviour of most H.264 streams libva-v4l2-request drives). daemon/src/decoder.c: daedalus_decoder_run_request() now takes an optional h264_meta parameter. For codec_id == H264 with meta != NULL, synthesises SPS+PPS NAL units, allocates a combined [SPS][PPS][slice] buffer (+ AV_INPUT_BUFFER_PADDING_SIZE), and feeds that to avcodec_send_packet instead of the raw slice. VP9/AV1 path unchanged (frames are self-contained). Cleanup now goes through a unified `out:` label so the assembled buffer is always freed on every exit (including the existing decoder_open_codec / no-frame / receive_frame failure paths). daemon/src/chardev_client.c: handle_req_decode() peels off the optional meta block when the flag is set, passes it through to the decoder, and updates the payload-length consistency check (now allows for an extra sizeof(daedalus_h264_meta) when the flag is on). Build (boltzmann aarch64): clean compile of all daemon sources, including bitstream_writer + h264_nal_synth + the refactored decoder.c. Kernel module compile to be verified via DKMS rebuild on higgs in the marfrit-packages bump that follows. Test plan: with this commit + a marfrit-packages daedalus pin bump, higgs's ffmpeg -hwaccel vaapi -i h264_test.mp4 should produce a successful decode (vs. the previous "non-existing PPS 0 referenced" failure). The daemon log should show: decoder: opened h264 context decoder: h264 prepended SPS=NB PPS=MB slice=KB decoder: OK 320x240 fmt=0 (yuv420p) fnv1a=0x... VP9 / AV1 behaviour unchanged — they don't carry meta and the existing per-frame self-describing path still applies. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-20 17:35:24 +02:00
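The bitstream_writer and emulation-prevention steps named in commit 8c1d9960c4 can be sketched as follows. This is an illustrative reimplementation under assumptions, not the contents of daemon/src/bitstream_writer.c: all `sketch_` names are hypothetical, and the escape routine follows the H.264 rule of inserting 0x03 only when two zero bytes are followed by a byte in 0x00..0x03.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct sketch_bw {
    uint8_t *buf;
    size_t   cap;      /* capacity in bytes */
    size_t   bitpos;   /* next bit to write, counted from the MSB of buf[0] */
    int      overflow; /* sticky: set once a write did not fit */
};

void sketch_bw_init(struct sketch_bw *bw, uint8_t *buf, size_t cap)
{
    memset(buf, 0, cap);
    bw->buf = buf; bw->cap = cap; bw->bitpos = 0; bw->overflow = 0;
}

/* Write the low n bits of v, most significant bit first (n <= 32). */
void sketch_bw_put(struct sketch_bw *bw, uint32_t v, unsigned n)
{
    for (unsigned i = n; i-- > 0; ) {
        if (bw->bitpos >= bw->cap * 8) { bw->overflow = 1; return; }
        if ((v >> i) & 1u)
            bw->buf[bw->bitpos / 8] |= (uint8_t)(0x80u >> (bw->bitpos % 8));
        bw->bitpos++;
    }
}

/* ue(v): (len - 1) zero bits, then k + 1 written in len bits. */
void sketch_bw_ue(struct sketch_bw *bw, uint32_t k)
{
    if (k == UINT32_MAX) { bw->overflow = 1; return; } /* k + 1 would wrap */
    uint32_t x = k + 1;
    unsigned len = 0;
    for (uint32_t t = x; t != 0; t >>= 1) len++;
    sketch_bw_put(bw, 0, len - 1);
    sketch_bw_put(bw, x, len);
}

/* se(v): positive v maps to 2v - 1, non-positive v to -2v, then ue(v).
 * Valid for v > INT32_MIN. */
void sketch_bw_se(struct sketch_bw *bw, int32_t v)
{
    uint32_t k = (v > 0) ? (uint32_t)v * 2u - 1u
                         : (uint32_t)(-(int64_t)v) * 2u;
    sketch_bw_ue(bw, k);
}

/* rbsp_trailing_bits: a single 1 bit, then zeros up to the byte boundary. */
void sketch_bw_trailing(struct sketch_bw *bw)
{
    sketch_bw_put(bw, 1, 1);
    while (bw->bitpos % 8) sketch_bw_put(bw, 0, 1);
}

/* Copy an RBSP into `out`, inserting 0x03 after 00 00 when the next byte
 * is <= 0x03. Returns bytes written, or 0 if `out` is too small. */
size_t sketch_ebsp_escape(const uint8_t *in, size_t n, uint8_t *out, size_t cap)
{
    size_t o = 0;
    unsigned zeros = 0;
    for (size_t i = 0; i < n; i++) {
        if (zeros >= 2 && in[i] <= 3) {
            if (o >= cap) return 0;
            out[o++] = 0x03;
            zeros = 0;
        }
        if (o >= cap) return 0;
        out[o++] = in[i];
        zeros = (in[i] == 0) ? zeros + 1 : 0;
    }
    return o;
}
```

A caller would write the SPS or PPS fields with these helpers, check the sticky overflow flag once at the end, escape the payload, and prepend the NAL header and the 4-byte start code.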
marfrit	481279c9bf	packaging/systemd: ship daedalus-v4l2.service + modules-load drop-in Canonical location for the systemd unit + module-autoload conf, referenced by both arch/daedalus-v4l2 and debian/daedalus-v4l2 in marfrit-packages. Was a real gap in the original packaging: postinst installed the daemon binary but nothing started it, so the libva path got REQ_DECODE messages with nobody listening on /dev/daedalus-v4l2 and timed out. packaging/systemd/daedalus-v4l2.service: - Type=simple, ExecStart=/usr/bin/daedalus_v4l2_daemon daemon - After=systemd-modules-load.service + ConditionPathExists= /dev/daedalus-v4l2 (so it only starts when the kernel module is loaded; doesn't false-fire on non-daedalus hosts that happen to have the package installed) - Restart=on-failure, RestartSec=2 - MemoryHigh=128M / MemoryMax=256M (Phase 8.9 stress run showed RSS settling around 25 MiB; leaves headroom) - Hardening: NoNewPrivileges, ProtectSystem=strict, ProtectHome, PrivateTmp, ProtectKernel*, SystemCallFilter=@system-service. PrivateDevices=false because we DO need /dev/daedalus-v4l2 packaging/systemd/daedalus-v4l2.modules-load: - Drops to /etc/modules-load.d/daedalus-v4l2.conf so the kernel module loads before the .service unit fires. Both files are picked up by the package recipes (next bump in marfrit-packages) — neither lives in /usr/lib/systemd/system or /etc/modules-load.d until the .deb / .pkg installs them. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 10:26:58 +02:00
marfrit	f0cd29a340	kernel: v4l2_fh_add/del gained file* arg in 6.18 — version-conditional DKMS build failure on higgs (Pi CM5, kernel 6.18.29+rpt-rpi-2712): daedalus_v4l2_main.c:1049: error: too few arguments to function 'v4l2_fh_add' v4l2-fh.h:97: void v4l2_fh_add(struct v4l2_fh *fh, struct file *filp); daedalus_v4l2_main.c:1063: error: too few arguments to function 'v4l2_fh_del' Signature changed exactly at v6.18 (verified v6.13–v6.17 still use the one-arg form via raw.githubusercontent.com tag walk). Wrap the calls with LINUX_VERSION_CODE >= KERNEL_VERSION(6, 18, 0) so the module keeps building against: * 6.12 LTS / RPi 6.12.75 (one-arg) — hertz * 6.12.88+deb13-arm64 (one-arg) * 6.18.29+rpt-rpi-2712 (file* arg) — higgs running kernel Build verified on both: hertz 6.12.75 clean, higgs 6.18.29 clean + modprobe daedalus_v4l2 succeeds, /dev/daedalus-v4l2 + /dev/video0 appear. Add #include <linux/version.h> for KERNEL_VERSION + LINUX_VERSION_CODE (also pulled transitively via module.h but explicit is better than implicit). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 10:15:24 +02:00
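The version gate from commit f0cd29a340 compares packed version codes. The sketch below reproduces that comparison in userspace with a local copy of the macro, so the `SKETCH_` macro and the helper are stand-ins rather than the kernel header; the commented call sites use assumed variable names (`ctx->fh`, `file`).

```c
/* Local stand-in for KERNEL_VERSION() from <linux/version.h>:
 * major and minor are packed into the upper bytes, and the
 * sublevel is clamped to 255. */
#define SKETCH_KERNEL_VERSION(a, b, c) \
    (((a) << 16) + ((b) << 8) + ((c) > 255 ? 255 : (c)))

/* Returns 1 when the running headers are new enough that
 * v4l2_fh_add()/v4l2_fh_del() take the extra struct file * argument. */
int sketch_fh_takes_file(unsigned version_code)
{
    return version_code >= SKETCH_KERNEL_VERSION(6, 18, 0);
}

/*
 * Shape of the in-module conditional (variable names assumed):
 *
 * #if LINUX_VERSION_CODE >= KERNEL_VERSION(6, 18, 0)
 *     v4l2_fh_add(&ctx->fh, file);
 * #else
 *     v4l2_fh_add(&ctx->fh);
 * #endif
 */
```

Under this packing, 6.18.29 selects the two-argument form while 6.12.75 and 6.17.x stay on the one-argument form, matching the kernels listed in the commit.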
marfrit	f55b2cd002	kernel: media_request_get/put around inf->req (UAF safety) Sonnet pre-deployment review flagged a SHIP-WITH-EYES-OPEN risk: Phase 8.13's inf->req captured src_buf->vb2_buf.req_obj.req as a raw pointer with no media_request_get(). On the normal decode path that's fine because vb2-core holds its own reference until v4l2_m2m_buf_done_and_job_finish releases it. But on a concurrent cancel (MEDIA_IOC_REQUEST_REINIT or a process kill triggering buf_request_complete from the cancel path before RESP_FRAME comes back), vb2 could drop its reference first. Our inf->req would then dangle through v4l2_ctrl_request_complete + buf_done_and_job_finish — UAF. Fix matches the cedrus / rkvdec pattern: take our own reference when we capture the pointer, release it after we're done with it (after buf_done_and_job_finish to keep the ordering crystal-clear). /* in daedalus_device_run, after inf->req = src_buf->...->req */ if (inf->req) media_request_get(inf->req); /* in daedalus_complete_resp_frame, after buf_done_and_job_finish */ if (inf->req) media_request_put(inf->req); Verified on hertz: - libva path (request-bound, inf->req != NULL): byte-exact NV12, same FNV-1a as standalone. - test_m2m_stream (direct QBUF, inf->req == NULL): 30/30 frames decoded, conditional skip works. - No kernel oops / WARN, no leak in dmesg. Add #include <media/media-request.h> for the helpers. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 18:39:10 +00:00
marfrit	f04d7000f8	Phase 8.13: byte-exact end-to-end via libva (consumer target hit) The project's consumer-side goal landed: a real VAAPI consumer (ffmpeg with -hwaccel vaapi) drives our libva backend → V4L2 driver → daemon → byte-exact NV12 output back to ffmpeg. ffmpeg -hwaccel vaapi -hwaccel_device /dev/dri/renderD128 \ -hwaccel_output_format nv12 -i vp9_small.ivf \ -f rawvideo -y /tmp/vp9_via_libva.nv12 cmp /tmp/vp9_via_libva.nv12 /tmp/vp9_ref_for_libva.nv12 → match 18432-byte NV12 byte-for-byte identical to plain ffmpeg -pix_fmt nv12 software decode. The project_consumer_target memory's deliverable shape — "V4L2 stateless node consumed by a real VAAPI client" — is achieved. Two related kernel changes: 1. v4l2_ctrl_handler_setup(&ctx->hdl) after registration — matches rkvdec/cedrus/hantro. Brings each registered compound control out of "uninitialised" state via std_init_compound defaults. 2. Per-request control completion in the decode path — the real fix for "Timeout when waiting for media request". vb2-core's vb2_buffer_done unbinds the BUFFER's req_obj on normal decode completion, but the per-request CONTROL object stays bound. buf_request_complete fires only from queue-cancel paths (vb2-core line 2284), NOT from normal buf_done. The driver must call v4l2_ctrl_request_complete(req, hdl) explicitly from the completion path. struct daedalus_inflight gained a `struct media_request *req` field, captured from src_buf->vb2_buf.req_obj.req in device_run. daedalus_complete_resp_frame then calls v4l2_ctrl_request_complete before v4l2_m2m_buf_done_and_job_finish — triggers MEDIA_REQUEST_STATE_COMPLETE and wakes the request fd poll. For non-request flows (test_m2m_stream direct QBUF) inf->req is NULL; the conditional skips the call. Both consumer styles work concurrently. Diagnostic clarification (was Phase 8.13a): strace traced three S_EXT_CTRLS calls per frame: 1. H264_PROFILE + H264_LEVEL → EINVAL (we don't register) 2. 
HEVC_PROFILE + HEVC_LEVEL → EINVAL (we don't register) 3. VP9_FRAME + VP9_COMPRESSED_HDR → SUCCESS The first two are harmless: libva probes whether we support H264/HEVC integer profile/level controls during config negotiation; we don't (we expose them as stateless), so EINVAL just falls through. The actual VP9 stateless controls (#3) succeeded all along — the libva-side "Unable to set control(s)" log was misleading us into thinking the control path was the bug. Verification on hertz (Pi 5, 6.12.75+rpt-rpi-2712): daemon log: REQ_DECODE cookie=1 codec=1 bitstream=1566 bytes capture=128x96 1 planes decoder: opened vp9 context decoder: OK 128x96 fmt=0 (yuv420p) fnv1a=0x1eb34bfe ... ffmpeg side: no Timeout, no Decoding error /tmp/vp9_via_libva.nv12: 18432 bytes cmp vs reference: byte-for-byte identical. Roadmap update: - 8.10/8.11, 8.12, 8.13 marked closed with closure docs. - 8.14 = multi-frame VP9 via libva, AV1 + H.264, mpv/Firefox higher-level consumers. Per correctness-before-speed: - strace + kernel-source-reading found the actual root cause rather than guessing. - Conditional v4l2_ctrl_request_complete preserves the existing test_m2m_stream non-request path — both consumer styles work concurrently without per-flow branching elsewhere. - Byte-exact pixel comparison, not "frame size matches." Phase 8.14 next: multi-frame stream + multi-codec via libva + mpv/Firefox. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 18:14:34 +00:00
marfrit	a7d585eee8	Phase 8.12: first VP9 frame decoded via libva ffmpeg -hwaccel vaapi → libva-v4l2-request-fourier → /dev/video0 → daedalus_v4l2 kernel → REQ_DECODE on the chardev → daemon FFmpeg decode → byte-exact NV12 (FNV-1a 0x1eb34bfe, same hash the standalone test_m2m_stream produces for the same 128x96 VP9 keyframe). The pixel-correct decode through the libva path is the milestone. What's NOT yet working: libva times out on the media_request fd because buf_request_complete never fires (vb->req_obj.req is NULL when buf_done runs — the S_EXT_CTRLS EINVAL leaves the buffer un-bound to the request even though the buffer queues anyway). Phase 8.13 fixes the EINVAL so the request bind takes and the completion signal propagates. Kernel V4L2 request API integration: - media_device_ops.req_validate / req_queue = vb2_request_validate / v4l2_m2m_request_queue (Phase 8.11) — MEDIA_IOC_REQUEST_ALLOC succeeds. - vb2_queue.supports_requests = true on OUTPUT queue — without this v4l2-core rejects S_EXT_CTRLS(REQUEST_VAL). - vb2_ops.buf_request_complete = daedalus_buf_request_complete → v4l2_ctrl_request_complete(req, &ctx->hdl). Without this v4l2-core WARNs at videobuf2-v4l2.c:440. - vb2_ops.buf_out_validate: sets field=V4L2_FIELD_NONE on OUTPUT buf. Required for the same WARN check. - requires_requests intentionally NOT set: lets the existing test_m2m_stream (direct QBUF, no request) keep working alongside the libva path. Stateless control re-registration: - Switched from v4l2_ctrl_new_std_compound(NULL p_def) to v4l2_ctrl_new_custom(&cfg, NULL) — pattern rkvdec / cedrus / hantro use. v4l2-core auto-fills elem_size + type from std table (verified: VP9_FRAME elem_size=168, matches sizeof(struct v4l2_ctrl_vp9_frame)). - No-op s_ctrl callback so SET requests don't crash — daemon ignores values, FFmpeg re-parses the bitstream. 
Verification on hertz (Pi 5, 6.12.75+rpt-rpi-2712): ffmpeg -hwaccel vaapi -i vp9_small.ivf … daemon: REQ_DECODE cookie=1 codec=1 bitstream=1566 bytes capture=128x96 1 planes daemon: decoder: opened vp9 context daemon: decoder: OK 128x96 fmt=0 (yuv420p) fnv1a=0x1eb34bfe … Same FNV-1a hash as the standalone test_m2m_stream produces for the same VP9 keyframe. End-to-end through libva. Remaining (Phase 8.13): - S_EXT_CTRLS EINVAL on V4L2_CID_STATELESS_VP9_FRAME despite matching elem_size — needs deeper validate-path debugging. - Once the request bind takes, buf_request_complete fires on buf_done, request fd signals completion, libva DQBUFs the decoded NV12. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 18:01:26 +00:00
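The commits above compare decoded frames by an fnv1a value printed in the daemon log. A standard 32-bit FNV-1a is sketched below for reference; the function name is hypothetical, and the log does not say which bytes the daemon feeds into its hash (plane order, stride padding), so this shows the hash itself rather than a way to reproduce 0x1eb34bfe.

```c
#include <stddef.h>
#include <stdint.h>

/* 32-bit FNV-1a: XOR each byte into the state, then multiply by the prime. */
uint32_t sketch_fnv1a32(const uint8_t *data, size_t len)
{
    uint32_t h = 0x811c9dc5u;          /* FNV offset basis */
    for (size_t i = 0; i < len; i++) {
        h ^= data[i];
        h *= 0x01000193u;              /* FNV prime, arithmetic wraps mod 2^32 */
    }
    return h;
}
```

Because every input byte perturbs the state, a constant hash across frames (such as the 0x6a6a05c5 value noted in commit 29f16ece13) is a cheap signal that identical buffers are being produced.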
marfrit	0de0288dce	Phase 8.10+8.11: libva consumer integration scaffold Brings daedalus_v4l2 from "standalone test client" to "VAAPI-discoverable decoder" by adding the surface formats and media-controller plumbing that libva-v4l2-request-fourier (sibling repo) requires. libva-v4l2-request-fourier patches (pushed separately): - b5b3acf: daedalus_v4l2 added to known_decoder_drivers - 2146341: meson option gate This commit (daedalus-v4l2 side, 3 production changes): 1. V4L2_PIX_FMT_NV12 (single-plane) on CAPTURE - Added to daedalus_capture_formats[] alongside NV12M + P010 - daedalus_fill_capture_fmt handles num_planes=1 case (sizeimage = W*H*3/2, bytesperline = W) - daemon pack_nv12_single_to_plane: Y at base+0, interleaved CbCr at base+(stride*H); same byte content as NV12M two-plane, different layout - Required because libva-v4l2-request-fourier's video.c only knows non-multi-plane NV12 (it advertises v4l2_mplane=true but uses the single-plane fourcc). - Verified byte-exact via test_m2m_stream against ffmpeg -pix_fmt nv12 reference (VP9 1080p 10 frames, 31 MB). 2. V4L2 Request API media ops - daedalus_media_ops = { vb2_request_validate, v4l2_m2m_request_queue } assigned to mdev.ops before media_device_init. - Without this, MEDIA_IOC_REQUEST_ALLOC returned -ENOTTY and no VAAPI consumer could allocate a media_request. 3. Stateless control registration via v4l2_ctrl_new_custom - Switched from v4l2_ctrl_new_std_compound(NULL p_def) to v4l2_ctrl_new_custom — pattern rkvdec/cedrus/hantro use. Adds a no-op s_ctrl callback. Verification (hertz, Pi 5, 6.12.75+rpt-rpi-2712): LibVA trace through `ffmpeg -hwaccel vaapi`: vaInitialize / Profiles / Entrypoints / CreateConfig / QuerySurfaceAttributes / CreateSurfaces / CreateContext (cap_pool: 24 slots, 1 plane each) / CreateBuffer (slice + picture params) / MEDIA_IOC_REQUEST_ALLOC — all succeed. 
Standalone NV12 decode path: test_m2m_stream vp9_1080_stream.ivf out.nv12 1920 1080 vp9 nv12 → 10/10 frames, byte-exact vs ffmpeg -pix_fmt nv12 vainfo (via libva-v4l2-request-fourier with our driver): 7 VAProfile entries with VAEntrypointVLD (H264 Main/High/CBaseline/MultiviewHigh/StereoHigh, VP9Profile0, AV1Profile0) What's NOT here (Phase 8.12): The libva trace stops at VIDIOC_S_EXT_CTRLS returning EINVAL when populating V4L2_CID_STATELESS_VP9_FRAME on the request. The compound-control payload validation against the kernel's expected struct shape rejects. This isn't a "missing line" fix — it needs proper stateless control plumbing (the SPS/PPS/SliceParams get_dims, validate, default-value paths that in-tree rkvdec/cedrus/hantro implement to satisfy v4l2-core's std_validate). Documented as Phase 8.12 scope. The shipped integration is itself a meaningful deliverable: all the framework scaffolding is in place; the remaining gap is well-characterised and bounded. See docs/phase_8_10_11_closure.md for the full trace analysis + next-phase plan. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 17:51:16 +00:00
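The single-plane NV12 layout from commit 0de0288dce (Y at base+0, interleaved CbCr at base + stride*H, sizeimage = W*H*3/2 when bytesperline = W) can be illustrated with the sketch below. It is not the daemon's pack_nv12_single_to_plane: the function names and parameter list are hypothetical, and even width and height are assumed.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Buffer size for single-plane NV12 when bytesperline == width. */
size_t sketch_nv12_sizeimage(unsigned w, unsigned h)
{
    return (size_t)w * h * 3 / 2;
}

/* Pack planar YUV 4:2:0 (separate Y, U, V planes, each with its own
 * line size) into one NV12 buffer with the given destination stride. */
void sketch_pack_nv12_single(uint8_t *base, size_t stride,
                             unsigned w, unsigned h,
                             const uint8_t *y, size_t y_ls,
                             const uint8_t *u, size_t u_ls,
                             const uint8_t *v, size_t v_ls)
{
    /* Luma rows: copy only the visible width, dropping source padding. */
    for (unsigned r = 0; r < h; r++)
        memcpy(base + r * stride, y + r * y_ls, w);

    /* Chroma starts right after the luma rows and alternates Cb, Cr. */
    uint8_t *uv = base + stride * h;
    for (unsigned r = 0; r < h / 2; r++) {
        for (unsigned c = 0; c < w / 2; c++) {
            uv[r * stride + 2 * c]     = u[r * u_ls + c];
            uv[r * stride + 2 * c + 1] = v[r * v_ls + c];
        }
    }
}
```

For the 128x96 test clip this size formula gives 18432 bytes, the NV12 file size reported in the Phase 8.13 verification.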
marfrit	d84efdb125	Phase 8.9: long-form stress + multi-codec HDR + libva scoping Three verification deliverables; no production code changes (infrastructure from 8.8 was sufficient). 1. libva-v4l2-request consumer investigation (task 95): - bootlin/libva-v4l2-request@master supports MPEG-2 / H.264 / HEVC only. No VP9, no AV1. - H264 expects V4L2_PIX_FMT_H264_SLICE_RAW (older fourcc); we advertise V4L2_PIX_FMT_H264_SLICE. - CAPTURE expects V4L2_PIX_FMT_NV12 (single-plane); we advertise NV12M + P010. - Real integration = patch libva-v4l2-request to add VP9 + AV1 mappings + accept the newer H.264 fourcc. Multi-session work — pushed to Phase 8.10. 2. Long-form stress test (task 96): - Built a 1800-frame (60s @ 30fps) VP9 1080p stream by Python concat of vp9_5s.ivf × 12 with PTS adjustment and re-muxed IVF header. - 1800 / 1800 frames decoded cleanly through test_m2m_stream + daemon, fps=120.9 sustained across 14.9 s wall, p99=17.3 ms/frame (well inside the 33 ms 30fps budget). - Daemon alive after 3620 cookies across two back-to-back runs, RSS=23 MiB — no leak. - No kernel oops/WARN, no fps degradation across the long run. 3. Multi-codec HDR (task 97): - AV1 1080p 10-bit → P010: byte-exact vs ffmpeg p010le. fps 17.1 (below 30fps target; AV1 10-bit is intrinsically expensive). - H.264 1080p 10-bit (high10) → P010: byte-exact vs ffmpeg p010le. fps 26.9 (close to target). - Combined with 8.8's VP9-10bit P010 result (48.8 fps): all three codecs' 10-bit paths produce byte-exact P010 output. Roadmap update (docs/roadmap.md): - 8.9 marked closed with the scope-cut explained. - 8.10 = libva-v4l2-request VP9/AV1 patch + end-to-end consumer integration (the actual user-facing loop: mpv --hwdec=vaapi → libva-v4l2-request → /dev/video0 → daemon → decoded frame). Per correctness-before-speed: characterised the libva integration scope rigorously rather than starting a multi-session battle in this phase. 
The bounded deliverables (stress test + HDR matrix) ship clean and prove the existing infrastructure handles real-world workloads stably. Phase 8.10 next: build + patch libva-v4l2-request on hertz; end-to-end with mpv. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 17:26:42 +00:00
marfrit	1d0db3b5a9	docs: pure ffmpeg vs daedalus pipeline CPU comparison Measured on hertz (Pi 5, 6.12.75+rpt-rpi-2712, FFmpeg 7.1.3) to quantify the architectural cost/benefit of routing decode through the V4L2 m2m + chardev + dmabuf path vs running ffmpeg standalone. 1080p × 150 frames, decode-as-fast-as-possible: VP9 8-bit: ffmpeg 214.9% CPU / 1083ms wall daedalus 96.3% CPU / 1229ms wall AV1 8-bit: ffmpeg 201.5% CPU / 1162ms wall daedalus 96.6% CPU / 1478ms wall H.264 8-bit: ffmpeg 205.8% CPU / 1063ms wall daedalus 100.1% CPU / 1020ms wall VP9 10-bit: ffmpeg 155.8% CPU / 269ms wall daedalus 91.6% CPU / 131ms wall Key takeaway: the daedalus pipeline uses ~half the CPU for roughly the same wall throughput. FFmpeg standalone defaults to 2 threads; for single-stream decode that doesn't parallelise well, so the 2× CPU usage is overhead, not parallelism benefit. The daemon's single-threaded serialised event loop avoids that tax. For the project's 30fps-floor-is-fine target ("daily YouTube with CPU free for vscode"), daedalus leaves ~2× the CPU headroom for the rest of the desktop at the same playback rate. VP9-10bit is striking — daedalus is faster wallclock too (131ms vs 269ms) because at small per-frame work FFmpeg's thread pool spin-up dominates. Note: "daedalus" still uses FFmpeg internally (Phase 8.8 explicitly deferred QPU substitution after measurement showed 30fps@1080p was already met). The benefit here is architectural — single-threaded decode, out-of-process daemon, dmabuf zero-copy — not QPU offload. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 17:20:22 +00:00
marfrit	1ae9528e76	Phase 8.8: throughput baseline + multi-codec streams + HDR Per the correctness-before-speed principle: measure before optimising. Roadmap going in said "QPU dispatch substitution to hit 30fps@1080p". Measurement on hertz shows the FFmpeg software path already hits 65-88 fps@1080p across all three codecs — QPU substitution would be premature optimisation. So 8.8 ships what's actually useful: 1. Per-frame timing in test_m2m_stream. 2. Multi-frame AV1 + H.264 streams verified byte-exact at 1080p (closes the "VP9-only stream tests" gap from 8.7). 3. HDR / 10-bit via V4L2_PIX_FMT_P010 + daemon pack_p010_to_plane. Test harness (tools/test_m2m_stream.c): - Per-frame µs timing via CLOCK_MONOTONIC; reports mean/p50/p99/min/max + wall ms + fps. - Annex-B H.264 parser: split on 3-/4-byte start codes, accumulate NALs into access units (push on VCL NAL types 1 or 5). Without AU grouping FFmpeg rejects SPS/PPS-only buffers as "no frame!". - Format auto-detect (DKIF magic → IVF; else Annex-B). - Optional 6th arg `[capture]`: nv12m | p010. - CAPTURE mmap path generalised for num_planes==1 (P010). Kernel (kernel/daedalus_v4l2_main.c): - CAPTURE formats array {NV12M, P010}; enum_fmt walks it. - daedalus_fill_capture_fmt takes a fourcc: NV12M: 2 planes, W*H + W*H/2 bytes, bpl=W P010: 1 plane, W*H*2 + W*H bytes, bpl=W*2 - try_fmt preserves caller fourcc when supported. - daedalus_complete_resp_frame's dmabuf path now sets each plane's payload to vb2_plane_size(vb,p) — generalises cleanly across 1-plane (P010) and 2-plane (NV12M) layouts; the daemon fully populates the plane so payload = sizeimage. Daemon (daemon/src/decoder.c): - pack_p010_to_plane: YUV420P10LE → P010 single-plane. 10-bit samples shifted left by 6 to MSB-align in 16-bit words per V4L2 ABI. Y at base+0, interleaved CbCr right after Y plane (per format spec for single-plane P010). Strips source stride padding; respects destination stride. 
- daedalus_decoder_run_request dispatches on req->capture_pix_fmt (NV12M → pack_nv12_to_planes; P010 → pack_p010_to_plane; else warn + skip). - Includes <linux/videodev2.h> for fourcc constants. Verification on hertz (Pi 5, 6.12.75+rpt-rpi-2712): 1080p throughput baseline (30 frames testsrc, dmabuf path): VP9 1080p: mean 12.0 ms, p99 15.9 ms, fps 83.1, byte-exact ✓ AV1 1080p: mean 15.4 ms, p99 41.0 ms, fps 65.0, byte-exact ✓ H.264 1080p: mean 11.3 ms, p99 21.5 ms, fps 88.3, byte-exact ✓ All 2-3× over the 30fps-floor-is-fine criterion. HDR / 10-bit 1080p P010: 10 frames, 62 MB output, fps 48.8, byte-exact vs `ffmpeg -pix_fmt p010le -f rawvideo`. Small-frame P010 (320×240): fps 966 — fixed daemon overhead dominates at low resolutions. v4l2-compliance unchanged from 8.7: 49/49 passing. Format enumeration confirms NM12 + P010 on CAPTURE. Clean SIGTERM + rmmod; no kernel oops/WARN. Roadmap update (docs/roadmap.md): - 8.8 marked closed with closure-doc reference, including the explicit "QPU substitution not needed" rationale. - 8.9 reshaped: libva-v4l2-request consumer integration (per project_consumer_target memory) — the actual user-facing endpoint. Per correctness-before-speed: - Measured first; QPU work explicitly justified-out via data. - Byte-exact pixel comparison for every codec/format combo (NV12: VP9, AV1, H.264; P010: VP9 10-bit at 320×240 and 1080p). - AU grouping in the Annex-B parser is the correct semantic boundary, not just a workaround. - vb2_plane_size for payload generalises to any plane count, not hardcoded to 2. Phase 8.9 next: libva-v4l2-request integration — close the loop from YouTube/Firefox to /dev/video0 + daemon playback. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 16:34:05 +00:00
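The P010 packing described in commit 1ae9528e76 (10-bit samples shifted left by 6 into 16-bit words, Y at base+0, interleaved CbCr directly after, W*H*2 + W*H bytes at bpl = W*2) is sketched below. The `sketch_` functions are hypothetical and are not the daemon's pack_p010_to_plane; source samples are taken as host-order 16-bit values with line sizes counted in samples, and even dimensions are assumed.

```c
#include <stddef.h>
#include <stdint.h>

/* Single-plane P010 size when bytesperline == 2 * width:
 * W*H*2 bytes of luma plus W*H bytes of interleaved chroma. */
size_t sketch_p010_sizeimage(unsigned w, unsigned h)
{
    return (size_t)w * h * 2 + (size_t)w * h;
}

/* Store a 16-bit word little-endian, independent of host byte order. */
static void sketch_put_le16(uint8_t *p, uint16_t v)
{
    p[0] = (uint8_t)(v & 0xff);
    p[1] = (uint8_t)(v >> 8);
}

/* Pack planar 10-bit 4:2:0 into P010. `stride` is in bytes; the source
 * line sizes are in samples. Each sample is MSB-aligned by shifting 6. */
void sketch_pack_p010(uint8_t *base, size_t stride, unsigned w, unsigned h,
                      const uint16_t *y, size_t y_ls,
                      const uint16_t *u, size_t u_ls,
                      const uint16_t *v, size_t v_ls)
{
    for (unsigned r = 0; r < h; r++)
        for (unsigned c = 0; c < w; c++)
            sketch_put_le16(base + r * stride + 2 * c,
                            (uint16_t)(y[r * y_ls + c] << 6));

    uint8_t *uv = base + stride * h; /* chroma follows the luma rows */
    for (unsigned r = 0; r < h / 2; r++) {
        for (unsigned c = 0; c < w / 2; c++) {
            sketch_put_le16(uv + r * stride + 4 * c,
                            (uint16_t)(u[r * u_ls + c] << 6));
            sketch_put_le16(uv + r * stride + 4 * c + 2,
                            (uint16_t)(v[r * v_ls + c] << 6));
        }
    }
}
```

At 1920x1080 the size formula gives 6,220,800 bytes per frame, consistent with the roughly 62 MB reported for the 10-frame P010 run.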
marfrit	5965805d86	Phase 8.7: media controller + multi-frame streaming verification Two pieces — both shipped: 1. Media controller binding closes the last v4l2-compliance failure from 8.6 (DECODER_CMD, which requires has_media on stateless decoders) and unlocks the V4L2 request API for libva-v4l2-request. 2. Multi-frame streaming test exercises the daemon's AVCodecContext state preservation across many REQ_DECODE calls — Phase 8.6's tests pushed exactly one keyframe per invocation; real content has P-frame references. Compliance now reaches 49/49 passing. Kernel (kernel/daedalus_v4l2_main.{c,h}): - Added `struct media_device mdev` to daedalus_dev. - media_device_init(&mdev) BEFORE v4l2_device_register so v4l2-core sees v4l2_dev.mdev = &mdev and binds the m2m entities into the graph during register. - After video_register_device: v4l2_m2m_register_media_controller(..., MEDIA_ENT_F_PROC_VIDEO_DECODER) then media_device_register so userspace sees the complete graph in /dev/mediaN with the decoder entity tagged. - daedalus_remove unwinds in reverse: unregister media, unregister mc, unregister video, release m2m, unregister v4l2, cleanup mdev. - Error paths added for both new failure points. Test harness (tools/test_m2m_stream.c, new): - Multi-frame V4L2 m2m client: parses IVF → 4-deep buffer rings on both queues → per-frame QBUF/DQBUF loop → concatenates decoded NV12 to output file. Returns 0 only if every input frame decoded without error. - Same codec vocabulary as test_m2m_decode (vp9 | av1 | h264 via 5th arg). Verification on hertz (Pi 5, 6.12.75+rpt-rpi-2712): v4l2-compliance: 49 tests, 49 passed, 0 failed, 0 warnings. $ v4l2-ctl --list-devices daedalus-fourier V3D7+NEON (platform:daedalus_v4l2): /dev/video0 /dev/media3 VP9 320×240 30 frames (1 keyframe + 29 P-frames, 3.46 MB NV12): byte-for-byte match vs `ffmpeg -i in.ivf -pix_fmt nv12 -f rawvideo`. VP9 1920×1080 10 frames (31 MB NV12 through the dmabuf path): byte-for-byte match vs same reference command. 
Daemon log shows cookies 1..30 all completing cleanly in order; lazily-opened AVCodecContext maintains reference frames across the chardev round-trips. Clean SIGTERM + rmmod, no oops/WARN. Roadmap update (docs/roadmap.md): - 8.7 marked closed with closure-doc reference. - 8.8 reshaped: perf profiling, QPU dispatch substitution via daedalus-fourier, multi-frame AV1/H.264, HDR (P010M). Per correctness-before-speed: - Order-correct media controller lifecycle (init → bind v4l2_dev → register video → register mc → register media; reverse for teardown). - 4-deep buffer rings on both queues — the scheduler actually pipelines multiple in-flight cookies through the chardev (not just one-at-a-time as in 8.5/8.6 tests). - Bit-exact comparison against ffmpeg, not "looks right." - All resource paths cleaned on every error branch. Phase 8.8 next: profile daemon hot loops, dlopen daedalus-fourier from the daemon, swap FFmpeg per-block calls for daedalus_dispatch_* where the kernel matches, target 30fps@1080p from 30fps-floor-is-fine memory. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 16:21:58 +00:00
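The NV12 output sizes quoted in the verification notes follow directly from the frame geometry. A minimal sketch of that arithmetic, assuming tightly packed 8-bit 4:2:0 data with even width and height (the function name `nv12_frame_bytes` is hypothetical):

```c
#include <stddef.h>
#include <stdint.h>

/* Bytes in one tightly packed NV12 frame: a full-resolution Y plane
 * plus a half-height plane of interleaved Cb/Cr bytes.
 * Assumes even width and height. */
static size_t nv12_frame_bytes(uint32_t width, uint32_t height)
{
	size_t luma = (size_t)width * height;
	size_t chroma = (size_t)width * (height / 2);

	return luma + chroma;
}
```

With these assumptions, 30 frames at 320×240 come to 3,456,000 bytes (the 3.46 MB figure) and 10 frames at 1920×1080 come to 31,104,000 bytes (the 31 MB figure).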
marfrit	c7f6fb90cb	Phase 8.6: dmabuf + AV1 + H.264 + stateless controls Removes the Phase 8.5 64 KiB frame-size cap by exporting CAPTURE buffers as dmabuf-fds the daemon mmaps and writes pixels into directly. Adds AV1 + H.264 codec support, V4L2 stateless control registration, and the compliance polish that brings the driver to 47/48 v4l2-compliance pass. Protocol (include/daedalus_v4l2_proto.h): - struct daedalus_req_decode grew capture-buffer metadata (width/height/pix_fmt/num_planes + per-plane size+stride). - New DAEDALUS_IOC_GET_DMABUF ioctl on the chardev: daemon asks for a per-plane dmabuf fd, kernel calls vb2_core_expbuf in daemon task context so the fd lands in the daemon's table. Kernel m2m driver (kernel/daedalus_v4l2_main.c): - Both queues switched to vb2_dma_contig_memops. OUTPUT was vmalloc in 8.5; the switch is needed because vmalloc doesn't honour V4L2_MEMORY_FLAG_NON_COHERENT and v4l2-compliance's REQBUFS test rejected the driver because of it. We still read bitstream via vb2_plane_vaddr (dma_contig gives a kernel virtual address just like vmalloc did). - dma_coerce_mask_and_coherent(DMA_BIT_MASK(32)) in probe. - queue_setup populates alloc_devs[plane] = &pdev->dev for both queues; allow_cache_hints=1 on both. - daedalus_export_capture_dmabuf(cookie, plane, flags, fd): walks inflight list, calls vb2_core_expbuf on the CAPTURE buffer in the caller's (daemon's) task context. - device_run fills the new REQ_DECODE capture fields from ctx->dst_fmt and maps ctx->src_fmt.pixelformat to DAEDALUS_CODEC_VP9 / _AV1 / _H264 (was hard-wired to VP9). - daedalus_complete_resp_frame handles both the 8.5 inline path (kept for debugging) and the 8.6 dmabuf path (pixels already in CAPTURE buffer, just set payload from metadata). - enum_fmt advertises all 3 OUTPUT formats (VP9F, AV1F, S264). - try_fmt preserves userspace colorspace fields instead of overwriting with REC709 defaults (fixes 8.5 compliance fail). 
- s_fmt propagates OUTPUT colorspace → CAPTURE (stateless decoder round-trip test at v4l2-test-formats.cpp:958). - 12 V4L2 stateless controls registered per open (VP9_FRAME, VP9_COMPRESSED_HDR, H264_SPS/PPS/SCALING/PRED_WEIGHTS/SLICE_PARAMS/DECODE_PARAMS, AV1_FRAME/SEQUENCE/TILE_GROUP_ENTRY/FILM_GRAIN). Daemon ignores values (FFmpeg re-parses); registration is what makes libva-v4l2-request see us. Kernel chardev (kernel/daedalus_v4l2_chardev.c): - New unlocked_ioctl dispatching DAEDALUS_IOC_GET_DMABUF to daedalus_export_capture_dmabuf. - debugfs test_decode cookies unified with the m2m cookie allocator via shared daedalus_next_cookie() — kills the Phase 8.5 namespace collision. Daemon (daemon/src/...): - New dmabuf_capture.{c,h}: GET_DMABUF + mmap each plane on REQ_DECODE; munmap + close on completion. O_RDWR | O_CLOEXEC is essential — vb2_core_expbuf extracts O_ACCMODE from flags and exports read-only by default (caught on first run; mmap -EACCES on PROT_WRITE). - decoder.{c,h}: lazily opens AV1 + H.264 AVCodecContexts in addition to VP9 (dropped the -ENOSYS stubs). pack_nv12_to_planes writes Y line-by-line into planes[0] with planes[0].stride; interleaves Cb/Cr into planes[1] with planes[1].stride. - chardev_client.c handle_req_decode: opens dmabuf planes, runs decode (pixels land in CAPTURE buffer directly), closes planes, sends metadata-only RESP_FRAME. No wire-pixel allocation. Test harness (tools/test_m2m_decode.c): - Optional 5th arg `codec` (vp9 | av1 | h264). Same client drives all three codecs. Verification on hertz (Pi 5, 6.12.75+rpt-rpi-2712): Bit-exact end-to-end vs `ffmpeg -pix_fmt nv12`: VP9 1920x1080 3,110,400 bytes MATCH AV1 128x96 18,432 bytes MATCH H.264 128x96 18,432 bytes MATCH VP9 1080p went through the full dmabuf path with no chardev payload bloat — the same chardev that capped at 64 KiB in 8.5 now ferries metadata only and lets the daemon mmap+write a 3.1 MB frame directly into the V4L2 client's buffer. 
v4l2-compliance: Phase 8.1: 44/48 Phase 8.5: 44/48 (different fails after m2m landed) Phase 8.6: 47/48 Only remaining: VIDIOC_(TRY_)DECODER_CMD (needs media controller — explicitly Phase 8.7 work). 11 standard compound controls visible: vp9_frame_decode_parameters, vp9_probabilities_updates, h264_sequence_parameter_set, h264_picture_parameter_set, h264_scaling_matrix, h264_prediction_weight_table, h264_slice_parameters, h264_decode_parameters, av1_sequence_parameters, av1_frame_parameters, av1_film_grain (av1_tile_group_entry refused by hdl->error on this kernel — skipped silently). Clean SIGTERM + rmmod, no oops/WARN. Roadmap update (docs/roadmap.md): - Phase 8.6 marked closed with the closure-doc reference. - Phase 8.7 reshaped to (1) media controller, (2) perf + daedalus_dispatch_ substitution, (3) HDR/10-bit, (4) long-form multi-frame streaming. Per correctness-before-speed: - Real V4L2 dmabuf via vb2_core_expbuf (not a sideband fd-passing hack). - O_RDWR access mode threaded through correctly. - Strict pixel-byte comparison against ffmpeg, not "looks right" eyeballing. - Each compliance edge documented with the underlying test source-line + the fix. - All resource paths cleaned (munmap + close per plane on every exit, including error paths). Phase 8.7 next: media controller binding (closes last compliance fail), per-frame profiling, QPU dispatch substitution targeting 30fps@1080p from 30fps-floor-is-fine memory. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 16:16:06 +00:00
marfrit	6f4b580f7c	Phase 8.5: full V4L2 m2m driver, VP9 decode via QBUF/DQBUF Replaces the Phase 8.4 debugfs-triggered chardev path with a real V4L2 m2m driver. Userspace clients now drive decoding the standard way — S_FMT / REQBUFS / QBUF on the OUTPUT (bitstream) queue, DQBUF on the CAPTURE (NV12M) queue. Kernel device_run packs the bitstream into REQ_DECODE; daemon decodes via FFmpeg; RESP_FRAME's inline NV12 pixel payload lands in the CAPTURE buffer. Phase 8.6 swaps the inline payload for dmabuf so big frames stop being capped at 64 KiB. Kernel (daedalus_v4l2_main.c, rewritten + main.h added): - Per-open struct daedalus_ctx: v4l2_fh, m2m_ctx, ctrl_handler, per-queue v4l2_pix_format_mplane. - Two vb2_queues (vb2_vmalloc_memops for both — no DMA needed yet; 8.6 switches CAPTURE to dma_contig for dmabuf-export): OUTPUT = V4L2_BUF_TYPE_VIDEO_OUTPUT_MPLANE, VP9_FRAME CAPTURE = V4L2_BUF_TYPE_VIDEO_CAPTURE_MPLANE, NV12M - Full v4l2_ioctl_ops table: querycap, enum_fmt, g/s/try_fmt for both queues, reqbufs/querybuf/qbuf/dqbuf/create_bufs/ prepare_buf/expbuf/streamon/streamoff via v4l2_m2m_ioctl_* helpers. - v4l2_m2m_ops.device_run: peeks next OUTPUT buf, builds REQ_DECODE inline with the bitstream bytes, enqueues with an auto-incrementing cookie, stores {ctx, src_buf, dst_buf} in a per-device inflight list. Job stays open until RESP_FRAME. - daedalus_complete_resp_frame(): pops the inflight entry, memcpys inline NV12 pixels into the CAPTURE buffer (Y plane + interleaved CbCr), finishes via v4l2_m2m_buf_done_and_job_finish — NOT plain buf_done + job_finish, which leaves the src buf on the m2m queue and causes device_run to immediately re-run on the same input (caught on first run; second REQ_DECODE for same bitstream + eventual oops in stop_streaming on teardown). Kernel (daedalus_v4l2_chardev.c): - RESP_FRAME handler now hands inline pixel payload to daedalus_complete_resp_frame so it lands in the CAPTURE vb2 buffer. 
Existing PONG and debugfs test_decode paths still work; the latter produces a harmless ratelimited "unknown cookie" since it bypasses V4L2 m2m. Daemon (decoder.c, decoder.h): - daedalus_decoder_run_request signature extended with (nv12_out, nv12_cap, nv12_used). After the FNV-1a digest the decoder packs YUV420P into NV12 in the caller's buffer: Y plane line-by-line stripped of stride padding; Cb/Cr interleaved into a single chroma plane. Truncation silent — kernel only memcpys what fits in the CAPTURE plane. Daemon (chardev_client.c): - handle_req_decode allocates a response buffer sized for the full chardev payload, lets decoder fill the pixel area after the resp_frame struct, sends the full payload via the existing send_response. Test client (tools/test_m2m_decode.c, new): - Minimal V4L2 m2m client: S_FMT both queues, REQBUFS 1 each, mmap+fill OUTPUT, QBUF both, STREAMON, poll, DQBUF, dump CAPTURE planes to a raw NV12 file. ~250 LOC; verifies the whole flow without needing v4l2-ctl framing. Roadmap update (docs/roadmap.md): - Phase 8.4 retitled "daemon ↔ kernel decode round-trip" to reflect what actually shipped (vs. the original V4L2- ioctl-driven plan which moved here). - Phase 8.5 retitled "full V4L2 m2m driver" with closure status. - Phase 8.6 reshaped to two tracks: dmabuf + AV1/H.264/ stateless controls + media controller. Adds the punch list of v4l2-compliance failures (DECODER_CMD, S_FMT colorspace) that 8.6 will fix. Verification on hertz (Pi 5, 6.12.75+rpt-rpi-2712): Kernel + daemon build clean (-Wall -Wextra clean both sides). 
Test harness drives one VP9 keyframe end-to-end: OUTPUT REQBUFS -> 2 CAPTURE REQBUFS -> 2 QBUF OUTPUT[0] bytesused=1566 QBUF CAPTURE[0]; STREAMON both poll revents=0x5 DQBUF OUTPUT[0] flags=0x4001 (DONE) DQBUF CAPTURE[0] flags=0x4000 payloads=[12288, 6144] wrote 12288 Y + 6144 UV bytes to /tmp/out_m2m.nv12 Pixel correctness vs reference: ffmpeg -i vp9_small.ivf -pix_fmt nv12 -f rawvideo -y ref.nv12 cmp /tmp/out_m2m.nv12 /tmp/ref.nv12 → match ✓ Byte-for-byte identical to FFmpeg's stock CPU decode. v4l2-compliance: detected as Stateless Decoder; most ioctls pass; two expected fails documented in closure doc (DECODER_CMD/media controller, S_FMT colorspace). Clean teardown: SIGTERM the daemon, rmmod the module, no oops/WARN in dmesg. Per correctness-before-speed: - Real V4L2 ioctl table (not stubs); uses v4l2-core helpers where they exist instead of reinventing. - v4l2_m2m_buf_done_and_job_finish (not the manual sequence) to keep scheduler state consistent. - Bit-exact reference comparison, not just "looks right." - Documented every compliance failure with the planned fix. - All resource paths (kmalloc/kfree, inflight list cleanup, src/dst buf removal in stop_streaming) handled on every error branch. Phase 8.6 next: dmabuf-export for CAPTURE (removes 64 KiB frame-size cap), add AV1+H.264 codecs, add V4L2 stateless controls + media controller binding, fix the colorspace + cookie-namespace compliance issues. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 15:55:10 +00:00
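The YUV420P to NV12 packing described in this commit (Y rows copied without stride padding, Cb and Cr interleaved into one chroma plane) can be sketched as follows. This is an illustration under stated assumptions, not the daemon's code: the function name and signature are hypothetical, width and height are assumed even, and the sketch returns 0 when the destination is too small, whereas the commit says the real path truncates silently.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: pack planar 4:2:0 into tightly packed NV12.
 * Returns the number of bytes written, or 0 if dst_cap is too small. */
static size_t pack_yuv420p_to_nv12(uint8_t *dst, size_t dst_cap,
				   const uint8_t *y, size_t y_stride,
				   const uint8_t *cb, size_t cb_stride,
				   const uint8_t *cr, size_t cr_stride,
				   size_t width, size_t height)
{
	size_t cw = width / 2;   /* chroma samples per row */
	size_t ch = height / 2;  /* chroma rows */
	size_t need = width * height + 2 * cw * ch;
	uint8_t *uv;

	if (dst_cap < need)
		return 0;

	/* Y plane: copy each row, dropping the source stride padding. */
	for (size_t r = 0; r < height; r++)
		memcpy(dst + r * width, y + r * y_stride, width);

	/* Chroma plane: interleave Cb and Cr as CbCr byte pairs. */
	uv = dst + width * height;
	for (size_t r = 0; r < ch; r++) {
		for (size_t x = 0; x < cw; x++) {
			uv[r * 2 * cw + 2 * x] = cb[r * cb_stride + x];
			uv[r * 2 * cw + 2 * x + 1] = cr[r * cr_stride + x];
		}
	}
	return need;
}
```

For a 128×96 frame this writes 12288 Y bytes followed by 6144 chroma bytes, matching the payload sizes shown in the test log.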
marfrit	2a449632b9	Phase 8.4: daemon ↔ kernel decode round-trip (VP9 end-to-end) Wires the Phase 8.3 FFmpeg loader through the Phase 8.2 chardev bridge: kernel injects REQ_DECODE carrying a raw VP9 access unit, daemon hands the bitstream to libavcodec via dlopen, sends RESP_FRAME back with a content-dependent FNV-1a digest of the decoded YUV planes. Pure CPU decode for now — Phase 8.5 swaps in dmabuf + QPU dispatch. Protocol (include/daedalus_v4l2_proto.h): - New REQ_DECODE (kernel→daemon) and RESP_FRAME (daemon→kernel) message types, with fixed-size payload structs. - New DAEDALUS_CODEC_VP9/AV1/H264 enum (wire-stable so 8.6's AV1+H.264 work doesn't move existing values). - New DAEDALUS_DECODE_* status enum (OK / NO_FRAME / ERR_OPEN / ERR_SEND / ERR_RECV / ERR_CODEC). - Converted the prior `enum daedalus_msg_type` to #defines — high-bit values exceed INT_MAX and tripped -Wpedantic on userspace; kernel uABI headers use the same idiom. Kernel (kernel/daedalus_v4l2_chardev.c): - New debugfs entry /sys/kernel/debug/daedalus_v4l2/test_decode: writing raw bitstream bytes wraps them in a REQ_DECODE (codec=VP9 for Phase 8.4) and enqueues with an auto-incrementing cookie. - daedalus_chardev_write learned RESP_FRAME: parses the payload and emits a single pr_info line with decode metadata. Keeps existing PONG handling on the default arm. Daemon (daemon/src/...): - chardev_client.{c,h} — opens /dev/daedalus-v4l2, blocking read loop, single-buffer write() responses (kernel chardev has only .write, not .write_iter, so writev lands as -EINVAL — discovered the hard way during first run). - decoder.{c,h} — lazily-opened AVCodecContext per codec, shared AVPacket/AVFrame pair, descriptor-driven plane walker (av_pix_fmt_desc_get) so the same hash path covers YUV420P, YUV422P, YUV444P, GBRP and other 8-bit planar layouts. Generalised after first run decoded testsrc as GBRP (71) rather than the assumed YUV420P. 
- `daemon` command in main.c opens the chardev and runs the loop until SIGINT/SIGTERM. Cookie correlation handled end-to-end. - ffmpeg_loader gained av_pix_fmt_desc_get (23 symbols total). Build: - CMakeLists adds chardev_client.c + decoder.c; explicit -I../include for the shared protocol header. - Still -Wall -Wextra -Wpedantic clean. Verification on hertz (Pi 5, 6.12.75+rpt-rpi-2712): $ ffmpeg ... -pix_fmt yuv420p -c:v libvpx-vp9 -frames:v 1 \ -y /tmp/vp9_test.ivf $ python3 ... strip IVF framing → vp9_keyframe.bin (3268 B) $ sudo insmod kernel/daedalus_v4l2.ko $ daedalus_v4l2_daemon -v daemon & $ sudo dd if=vp9_keyframe.bin \ of=/sys/kernel/debug/daedalus_v4l2/test_decode daemon: REQ_DECODE cookie=2 → decoded yuv420p 320x240 fnv1a=0x6ef10d71 luma=76800 chroma=38400 kernel: RESP_FRAME cookie=2 status=0 320x240 pixfmt=0 fnv1a=0x6ef10d71 ← matches daemon ✓ Hash properties verified: cookie=2 testsrc 3268 B → 0x6ef10d71 (first decode) cookie=3 red 44 B → 0x7f6e5dc5 (content-dependent ✓) cookie=4 testsrc 3268 B → 0x6ef10d71 (deterministic ✓) cookie=5 64 B random → status=101 (ERR_SEND, daemon alive) Daemon survives bad input (FFmpeg "Invalid sync code" wrapped into structured ERR_SEND response). Clean SIGTERM shutdown, clean rmmod. 
Phase 8.4 acceptance criteria met: - ✓ end-to-end kernel→daemon→FFmpeg→kernel round-trip - ✓ cookie correlation per request/response pair - ✓ content-dependent + deterministic digest - ✓ structured error responses (no daemon crash on bad input) - ✓ clean teardown (SIGTERM + rmmod) - ✓ builds clean on both kernel kbuild and daemon CMake Per correctness-before-speed: - Real chardev I/O (no shortcuts, no select-loop hacks) - Real FFmpeg AVCodecContext lifecycle (lazily opened, properly freed on cleanup) - Descriptor-driven plane walk (generalises across pix_fmts) - Structured error path (not just log-and-continue) - All resource paths cleaned up on every error branch - Documented why FNV-1a digest, why write() not writev(), why pix_desc walk in docs/phase_8_4_closure.md Phase 8.5 next: V4L2 m2m queue submits REQ_DECODE from vidioc_qbuf; dmabuf carries actual pixel data so the chardev's 64 KiB cap doesn't gate frame size; begin substituting daedalus_dispatch_* into the daemon's decode path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 15:22:16 +00:00
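The content-dependent digest used for verification in this phase can be sketched with the standard 32-bit FNV-1a parameters. The logged digests are 32 bits wide, but the commit does not state the exact variant or plane traversal order, so the constants, the per-row walk, and the function names below are assumptions.

```c
#include <stddef.h>
#include <stdint.h>

#define FNV1A32_OFFSET 0x811c9dc5u  /* standard 32-bit offset basis */
#define FNV1A32_PRIME  0x01000193u  /* standard 32-bit FNV prime */

/* Fold n bytes into a running FNV-1a hash: XOR first, then multiply. */
static uint32_t fnv1a32_update(uint32_t h, const uint8_t *p, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		h ^= p[i];
		h *= FNV1A32_PRIME;
	}
	return h;
}

/* Hypothetical plane walker: hash only the visible bytes of each row,
 * skipping stride padding so the digest depends on pixels alone. */
static uint32_t fnv1a32_plane(uint32_t h, const uint8_t *data,
			      size_t stride, size_t row_bytes, size_t rows)
{
	for (size_t r = 0; r < rows; r++)
		h = fnv1a32_update(h, data + r * stride, row_bytes);
	return h;
}
```

A caller would start from FNV1A32_OFFSET and chain the planes in a fixed order. Skipping the padding is what makes the digest comparable across buffers with different strides.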
marfrit	873a04c622	Phase 8.3: userspace daemon scaffold + FFmpeg dlopen + parse path Builds the daemon executable per the locked Phase 8 architecture (Option γ: dlopen FFmpeg at runtime). Phase 8.3 scope: parse path validation only — no V4L2 wiring, no decode, no chardev connection. Components: - daemon/CMakeLists.txt — CMake with -Wall -Wextra -Wpedantic clean. pkg-config for FFmpeg headers; only -ldl + -lpthread at link time. - daemon/src/main.c — entry point, signal handlers (SIGINT/SIGTERM), command dispatcher. Currently `parse <file>`. - daemon/src/ffmpeg_loader.{c,h} — runtime FFmpeg loader. dlopens libavformat.so.61, libavcodec.so.61, libavutil.so.59. Resolves 22 function pointers using POSIX-recommended (void)& dlsym idiom (per POSIX.1-2017 dlsym(3p) Rationale). - daemon/src/parser.{c,h} — demux loop via avformat_open_input + av_read_frame. Per-frame logging on -v. - daemon/src/log.{c,h} — logging facade (stderr Phase 8.3; syslog/journal planned for 8.5+). Verification on hertz: $ ffmpeg -f lavfi -i testsrc=duration=2:size=320x240:rate=30 \ -c:v libvpx-vp9 -y /tmp/testsrc.ivf $ daedalus_v4l2_daemon parse /tmp/testsrc.ivf [INFO] FFmpeg loaded: 7.1.3-0+deb13u1+rpt1 (libavformat 61.7.100) [INFO] video stream #0: codec=vp9 (Google VP9) 320x240, 0/0 fps [INFO] parse complete: 60 frames (1 key) total 17859 bytes Error paths verified: - Missing file → "avformat_open_input(...): code -2", exit 1 - No command → usage message, exit 2 - Bad command → usage message, exit 2 Per correctness-before-speed: - Real CMake (no Makefile hacks) - pkg-config for headers - POSIX-conformant dlsym pattern (no -Wpedantic suppression) - Real signal handling + proper exit codes - Real logging with timestamp + level - Headers included at compile-time for type safety; dlopen decouples runtime - All FFmpeg resources freed on every exit path - Builds clean on -Wall -Wextra -Wpedantic Phase 8.3 acceptance criteria met: - ✓ daemon binary builds - ✓ dlopen FFmpeg at runtime - ✓ demux a VP9 
IVF file end-to-end - ✓ per-frame metadata logged correctly - ✓ frame count + keyframe count + byte total accurate Phase 8.4 next: wire daemon to /dev/daedalus-v4l2 chardev, add REQ_DECODE / RESP_FRAME handling, drive VP9 decode end-to-end via daedalus_dispatch_ from daedalus-fourier. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 15:10:22 +00:00
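The dlsym idiom referenced in this commit avoids a direct cast between `void *` and a function pointer, which ISO C does not define and which -Wpedantic flags. A minimal sketch of the pattern from the POSIX rationale, resolving `strlen` from the running process rather than an FFmpeg symbol so that it stays self-contained (the helper `load_strlen` is hypothetical; older glibc versions need -ldl at link time):

```c
#include <dlfcn.h>
#include <stddef.h>

typedef size_t (*strlen_fn)(const char *);

/* Hypothetical helper: look up strlen in the global symbol scope.
 * The assignment goes through a void ** view of the function pointer,
 * so no function-pointer cast appears in the source. */
static strlen_fn load_strlen(void)
{
	void *self = dlopen(NULL, RTLD_NOW);
	strlen_fn fn = NULL;

	if (!self)
		return NULL;
	*(void **)(&fn) = dlsym(self, "strlen");
	return fn;
}
```

A loader like the one described would presumably repeat this assignment once per resolved symbol and fail the load if any lookup returns NULL; that detail is an assumption, not something the commit message spells out.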
marfrit	895f57c63a	Phase 8.2: kernel ↔ daemon chardev bridge with round-trip test Adds /dev/daedalus-v4l2 misc chardev to the kernel module. The chardev is the IPC channel for the future userspace decoder daemon: kernel enqueues REQ_* messages, daemon read()s them, processes, write()s RESP_* back. Wire protocol (pre-1.0, header in include/daedalus_v4l2_proto.h): - struct daedalus_msg_hdr: magic (D04V) + version + type + cookie + payload_len + reserved - Request/response separated by high bit of type field - Max 64 KiB payload per message - Cookie correlates request with matching response Kernel implementation (kernel/daedalus_v4l2_chardev.{c,h}): - Single-instance chardev (-EBUSY on second open) - In-kernel FIFO bounded at 64 messages - Blocking + non-blocking read; poll() with EPOLLIN on queued - write() parses + validates header, logs response at pr_debug - Bad magic → -EBADMSG, bad version → -EPROTO, oversize → -EMSGSIZE - All error paths free resources Phase 8.2 test trigger via debugfs: - /sys/kernel/debug/daedalus_v4l2/test_ping — any byte enqueues a PING with a fixed 24-byte payload. Removed in Phase 8.4 when real REQ_DECODE from V4L2 path takes over. Userspace verification tool (tools/test_chardev_pingpong.c): - Real C program, proper error reporting via strerror - Validates the 6-step round-trip: open → empty-queue EAGAIN → trigger ping → read PING → verify all fields → write PONG → close - Builds with -Wall -Wextra clean Verification on hertz (Pi 5, 6.12.75+rpt-rpi-2712): $ sudo insmod daedalus_v4l2.ko $ sudo tools/test_chardev_pingpong opening /dev/daedalus-v4l2... non-blocking read on empty queue: EAGAIN ✓ injected PING via debugfs ✓ read PING: magic ✓ version ✓ type=PING ✓ cookie=0x1234 ✓ payload=24 bytes payload: "DAEDALUS-V4L2-PING-PL" wrote PONG (cookie=0x1234) ✓ ALL TESTS PASSED. 
$ sudo rmmod daedalus_v4l2 # clean Per correctness-before-speed: full kerneldoc on structs, 8-tab kernel style, SPDX headers, proper error paths, real test program (not "I ran it once"), failure-mode coverage documented. Phase 8.3 next: userspace daemon with dlopen'd FFmpeg parse path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 15:05:54 +00:00
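The header checks listed for the chardev write path can be sketched in userspace C. The field names follow the commit message, but the field widths, the numeric magic value, the version number, and every `ex_`/`EX_` identifier are assumptions made for illustration; the real definitions live in include/daedalus_v4l2_proto.h, which is not shown here.

```c
#include <errno.h>
#include <stdint.h>

#define EX_MAGIC       0x56343044u      /* assumed encoding of "D04V" */
#define EX_VERSION     1u               /* assumed protocol version */
#define EX_MAX_PAYLOAD (64u * 1024u)    /* 64 KiB cap from the commit */

/* Assumed layout: all fields 32 bits wide, in the order listed above. */
struct ex_msg_hdr {
	uint32_t magic;
	uint32_t version;
	uint32_t type;
	uint32_t cookie;
	uint32_t payload_len;
	uint32_t reserved;
};

/* Return 0 for an acceptable header, otherwise the negative errno
 * the commit message names for each failure. */
static int ex_validate_hdr(const struct ex_msg_hdr *h)
{
	if (h->magic != EX_MAGIC)
		return -EBADMSG;
	if (h->version != EX_VERSION)
		return -EPROTO;
	if (h->payload_len > EX_MAX_PAYLOAD)
		return -EMSGSIZE;
	return 0;
}
```

The order of the checks (magic, then version, then size) is a design choice in this sketch: a stream that is not this protocol at all is rejected before any other field is interpreted.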


53 Commits