daedalus-v4l2

Author	SHA1	Message	Date
marfrit	dbf01eddb8	daemon: shadow_decoder wiring (PR-Q3a.1) Toolchain plumbing for the upcoming daedalus-decoder shadow-mode path. Production behaviour is unchanged. What lands here: 1. CMake links libdaedalus_decoder via pkg-config. Static archive, so no .so dependency change in the daemon's link map. 2. ffmpeg_loader resolves ff_h264_set_mb_inspect_cb NULL-tolerantly. Stock libavcodec lacks the symbol (logged as INFO at startup); the marfrit-packages ffmpeg-v4l2-request-fourier fork's 0016 patch exports it. The shadow path activates only when both env DAEDALUS_SHADOW_MODE=1 AND the symbol resolves. 3. New shadow_decoder.[ch] module: - shadow_decoder_create() gates on env + symbol presence, returns NULL in production state (the common case). - shadow_decoder_install_cb() registers a per-MB callback on the H.264 AVCodecContext; lazily-created daedalus_decoder context will pick up dimensions from the first AVFrame. - shadow_decoder_on_frame() logs per-frame MB-observed count. Every entry point is NULL-safe so decoder.c stays clean of conditionals. 4. decoder.{c,h} grow a `struct shadow_decoder *shadow` field on daedalus_decoder. Install hook fires once per H.264 codec open; frame hook fires after each successful avcodec_receive_frame. PR-Q3a.1 scope ENDS here. The callback just counts MBs; no daedalus_decoder_append_mb or flush_frame yet. Real-coeffs / edges extraction needs the patched FFmpeg source-tree headers (DAEDALUS_FFMPEG_SRC) to introspect H264Context internals — that lands in PR-Q3a.2. dejavu-check: this path is daedalus-decoder's frame-major UMA dispatch architecture (one cmdbuf per frame, one submit) running alongside libavcodec's reference decode for validation. It is NOT per-kernel libavcodec function-pointer substitution. No new libavcodec patches; the existing 0016 callback is the only intercept point. Verified on hertz: - Build: clean, libdaedalus_decoder.a linked. - Disabled state (env unset OR symbol absent): no shadow log lines, daemon init continues normally, INFO logs "libavcodec lacks ff_h264_set_mb_inspect_cb (stock build, no daedalus-fourier 0016 patch) — shadow-mode unavailable". - Enabled state would require ffmpeg-v4l2-request-fourier .deb rebuilt with patches 0016/0017 deployed to hertz (current .deb release 10 predates them). That's a deployment task, separate from this PR. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-26 14:15:13 +02:00
claude-noether	9797a0daa6	daemon: AV1 Frame Header OBU synthesiser + Temporal Delimiter Extends the AV1 OBU encoder pack (PR #22 landed the Sequence Header half) with the two remaining pieces of the per-frame OBU assembly: - av1_synth_temporal_delimiter_obu() — trivial 2-byte OBU (0x12, 0x00) that AV1 temporal units must start with so libavcodec's parser can detect access-unit boundaries. - av1_synth_frame_header_obu() — encodes a Frame Header OBU (AV1 §5.9) from V4L2_CID_STATELESS_AV1_SEQUENCE + V4L2_CID_STATELESS_ AV1_FRAME controls. ## Frame Header scope The encoder covers the libva-v4l2-request common-case path: - frame_type: KEY / INTER / INTRA_ONLY supported. SWITCH returns 0. - tile_info: single-tile uniform-spacing only (forced tile_cols_log2 = tile_rows_log2 = 0). - quantization_params: full coverage (base_q_idx, delta_q_*, qmatrix). - loop_filter_params: full coverage (levels, sharpness, ref/mode deltas). - cdef_params: full coverage. - segmentation: only enabled=0 path supported (returns 0 if enabled). - loop_restoration: only RESTORE_NONE supported (returns 0 if any plane uses Wiener / SGRPROJ / SWITCHABLE). - global_motion: only IDENTITY warp model emitted (returns 0 if any ref uses ROTZOOM / AFFINE / TRANSLATION). - film_grain_params: only "not present" path — returns 0 if the sequence header has FILM_GRAIN_PARAMS_PRESENT set. Out-of-scope branches return 0 so a future decoder.c integration can surface a coverage warning and fall back to direct libavcodec parsing of the original bitstream where the consumer happens to ship a fully-OBU'd access unit. ## Integration status The new primitives are NOT yet wired into decoder.c. The AV1 decode hot path still passes the OUTPUT buffer straight to libavcodec, which works only when the V4L2 consumer is sending a fully-OBU'd access unit (not strictly the V4L2 stateless contract). A real wiring needs a separate kernel-side change: - daedalus_v4l2_proto.h: add struct daedalus_av1_meta mirroring v4l2_ctrl_av1_sequence + v4l2_ctrl_av1_frame - kernel/daedalus_v4l2_main.c: capture V4L2_CID_STATELESS_AV1_{SEQUENCE, FRAME} at device_run, ship over the chardev - daemon/src/chardev_client.c: receive meta - daemon/src/decoder.c: assemble TD + SH + FH + OBU_TILE_GROUP-wrapped OUTPUT bytes, send to libavcodec Tracked as a follow-on. ## Tests test_av1_obu_synth.c grows 5 new cases (9 total, all green on hertz): === av1_synth_temporal_delimiter_obu === temporal delimiter: OK === av1_synth_frame_header_obu === KEY frame 1080p: OK (13 bytes) INTER frame: OK (18 bytes) SWITCH frame rejected: OK segmentation enabled rejected: OK AV1 OBU synth tests PASSED Bit-walk of the KEY-frame happy path confirms the OBU envelope (obu_type=3 = FRAME_HEADER, has_size_field=1, leb128 size byte), then steps through show_existing_frame, frame_type, show_frame, disable_cdf_update, allow_screen_content_tools. Fuller bit-walks would tie the test to encoder details that are spec-driven and already linear in the source; structural smoke + spec-driven linearity is the right gate. Build clean on hertz (Pi 5, Debian trixie, 6.18.29+rpt-rpi-2712, gcc -Wall -Wextra -Wpedantic). No new warnings. Closes daedalus backlog task #159 (AV1 Frame Header OBU synthesiser; decoder.c integration deferred per task notes above).	2026-05-23 18:31:41 +02:00
marfrit	3a8f5405d4	Merge pull request 'daemon: AV1 Sequence Header OBU synthesiser + unit test' (#22 ) from noether/daemon-av1-obu-synth into main Reviewed-on: #22	2026-05-23 15:12:16 +00:00
claude-noether	1e9619afe8	daemon: AV1 Sequence Header OBU synthesiser + unit test V4L2 stateless AV1 passes the sequence header information as a structured control (V4L2_CID_STATELESS_AV1_SEQUENCE) and ships only tile-group bytes in the OUTPUT buffer. libavcodec's AV1 decoder is full-bitstream, so the daemon needs to reconstruct the OBU bytes the consumer parsed out before feeding the assembled stream to libavcodec. This commit lands the Sequence Header OBU half of that reconstruction — av1_synth_sequence_header_obu(). Frame Header / Frame OBU synthesisers + the integration that wires the assembled OBUs into the decode hot path are separate follow-on modules. Module shape mirrors the H.264 NAL synthesiser (PR #1): - Public API: single function returning byte count or 0 on overflow/invalid input. - Wire encoder uses the existing bitstream_writer (bsw_put_u is AV1's f(n); bsw_put_ue is bit-identical to AV1's uvlc; bsw_align_rbsp matches AV1's trailing_bits()). - AV1-specific helpers (leb128 size, min_bits_for, subsampling resolution per §5.5.2) are file-local statics. - No emulation prevention — AV1 uses leb128-sized OBUs for bitstream boundaries, not byte-pattern escapes. Synthesis decisions for fields V4L2 doesn't carry are documented verbatim in the file header (reduced_still_picture_header = 0; single operating point at seq_level_idx = 13 / level 5.1; color_description_present_flag = 0; chroma_sample_position = 0; seq_choose_screen_detection_tools = 1; seq_choose_integer_mv = 1). Rejection cases: - seq_profile > 2 - bit_depth not in {8, 10, 12} - seq_profile = 1 + monochrome (4:4:4 forced colour) - seq_profile = 1 + bit_depth = 12 (only profile 2 allows it) - max_frame_{width,height}_minus_1 requiring > 16 length bits - out_cap too small to hold header + leb128 + payload Each returns 0 to surface the mismatch loudly rather than emit nonsense the libavcodec parser would reject downstream. Unit test (test_av1_obu_synth.c, opt-in via DAEDALUS_BUILD_TESTS=ON) exercises four cases bit-by-bit against a hand-computed reference: 1. profile 0, 1080p, 8-bit, 4:2:0, order_hint on (7 bits), CDEF+restoration on — the common Pi 5 path. 2. profile 0, 720p, 10-bit, monochrome — exercises high_bitdepth and the monochrome short-form color_config. 3. profile 1 + bit_depth 12 → expects 0 (rejected). 4. tiny out_cap → expects 0 (overflow). All four green on hertz (aarch64 Arch, gcc Wall+Wextra+Wpedantic clean). This commit does not change daemon behaviour — av1_obu_synth.c is built into the daemon binary so the symbols are reachable, but no call site is wired yet. Integration goes in the follow-on DAEMON-AV1 patches that also synthesise the Frame Header OBU and bracket the assembled OBUs with a Temporal Delimiter. Refs reauktion/daedalus-v4l2#11 daemon-half; closes daedalus backlog task #144.	2026-05-23 15:41:07 +02:00
claude-noether	a43296c1ed	daemon: bounds-check pack_* functions against CAPTURE plane size The three NV12/P010 pack functions (pack_nv12_single_to_plane, pack_nv12_to_planes, pack_p010_to_plane) wrote into the V4L2 client's CAPTURE dmabuf without checking that the mapped size covers the frame libavcodec just decoded. Crash scenario: YouTube DASH stepping resolution mid-stream (e.g. 480p -> 720p when bandwidth improves) — libva is supposed to handle the V4L2_EVENT_SOURCE_CHANGE with STREAMOFF / S_FMT / REQBUFS, but in practice a stale CAPTURE request with the old buffer size sometimes slips through carrying the new (larger) frame. The chroma-interleave inner loop walks past the mapping boundary and the daemon takes SIGSEGV mid-frame, which in turn leaves V4L2 clients hanging in vb2_core_dqbuf — see the followup ticket on the D-state symptom. Fix: compute required = y_size + uv_size against planes->size[N] BEFORE any write. On mismatch, log_warn with both sizes and the frame dimensions, and return -EOVERFLOW. The caller (process_decode_request loop) already handles a negative pack return with a log_warn and proceeds without aborting the decode — the kernel still gets the response with metadata-only and the V4L2 client sees a frame whose pixels are stale but whose buffer-done event fires normally. The next SOURCE_CHANGE the client processes resyncs the buffer size. All three pack paths get the same bounds-check; the comment on pack_nv12_single is the canonical explanation, the other two reference it. Verified: builds clean against trixie aarch64; no behavioural change on the happy path (the bounds check is a single size compare; on a correctly-sized CAPTURE buffer it's a 1-cycle pass). Closes daedalus-v4l2 task #145 (daemon SEGV in pack_nv12_single on resolution change).	2026-05-23 15:31:50 +02:00
marfrit	3e4e6e8eae	daemon: filter tiny pause-time bitstreams (closes #17 ) libva-v4l2-request-fourier flushes a stub packet into the V4L2 OUTPUT_MPLANE queue at playback-pause boundaries. The payload is shorter than any parseable H.264 NAL (3-byte start code + 1-byte NAL header = 4 bytes minimum); avcodec_send_packet returns AVERROR_INVALIDDATA (-1094995529), which propagated to the kernel as a decode failure. Firefox then marked H.264-via-VAAPI as broken for the session and routed every subsequent frame to libmozavcodec SW — pause never recovered to HW. At the REQ_DECODE entry in chardev_client.c::handle_req_decode, short-circuit any bitstream below the minimum-parseable threshold: log INFO, skip daedalus_decoder_run_request, and reply RESP_FRAME with status=DAEDALUS_DECODE_NO_FRAME so libva's V4L2 surface pool stays healthy and Firefox doesn't see a failure. Repro: Pi CM5 trixie, daedalus-v4l2 0.1.0+r41 + ffmpeg-v4l2- request-fourier 2:8.1+rfourier+gb57fbbe-9, Firefox YouTube avc1. Play → daemon decodes at ~46 fps. Pause ≥ 1s. Resume → daemon silent; sudo journalctl -u daedalus-v4l2 --since '10s' \| grep -c 'decoder: OK' = 0. Last entry before silence: REQ_DECODE cookie=N codec=3 bitstream=3 bytes ... [h264 @ ...] no frame! [ERR] decoder: avcodec_send_packet failed: -1094995529 After this fix the 3-byte sentinel logs as 'tiny bitstream 3 bytes — dropping as no-op' and the libavcodec context is untouched; the next real REQ_DECODE proceeds normally. Scope NOT covered (intentionally deferred): - A more general "tolerate AVERROR_INVALIDDATA mid-stream" path. Worth doing later but masks unrelated bugs. - Investigating WHY libva sends the 3-byte sentinel on pause. Likely an upstream libva-v4l2-request-fourier issue; tracked separately if this filter is not enough. Wire protocol unchanged. No DAEDALUS_PROTO_VERSION bump.	2026-05-22 17:26:25 +02:00
claude-noether	514da29a73	daemon: dlopen Kwiboo fork's libavcodec.so.62 / libavformat.so.62 / libavutil.so.60 Switch the daemon's runtime dlopen targets from Debian-stock soname 61/61/59 (FFmpeg 7.1.3) to the Kwiboo fourier fork's soname 62/62/60 (FFmpeg 8.1) installed at the /opt/fourier prefix. Why --- The substitution arc tracked at daedalus-v4l2#11 needs daedalus- fourier kernel calls woven into libavcodec's H264DSPContext NEON init (replacing ff_h264_idct_add_neon etc. with thunks calling daedalus_recipe_dispatch_h264_). We do that via patches in the ffmpeg-v4l2-request-fourier package source — which we own, in marfrit-packages, alongside the existing libudev-bypass and nv15-to-p010 patches. But that package builds the Kwiboo fork at soname 62 / /opt/fourier. The daemon currently dlopens soname 61 (Debian-stock + a separately-built +fourier2 patch that isn't in marfrit-packages' source tree), so substitution patches there wouldn't reach the daemon. Switching to soname 62 routes the daemon through the package we control — first step toward landing daedalus-fourier kernel substitution into the production decode path. Compat ------ - /opt/fourier libs are already on every host running the daemon (hard build-dep of ffmpeg-v4l2-request-fourier). Firefox-fourier and mpv-fourier already dlopen them via the same path. - /etc/ld.so.conf.d/fourier.conf entry resolves the new sonames from /opt/fourier/lib via the ld cache; dlopen-by-soname works without LD_LIBRARY_PATH wrappers. - Build-side: daemon's pkg_check_modules picks up libav.pc from /opt/fourier/lib/pkgconfig when PKG_CONFIG_PATH includes that directory (build-deb.sh follow-up will set it). - API surface unchanged: avcodec_send_packet / receive_frame / AVCodecContext flags / AVFrame fields are all stable between FFmpeg 7.1 and 8.1. Verified clean cross-compile on hertz. Wire protocol unchanged. No kmod bump. Next step (follow-up PRs) ------------------------- 1. ffmpeg-v4l2-request-fourier patch: add 0003-daedalus-fourier- substitute-h264-idct4.patch that replaces ff_h264_idct_add_neon in libavcodec/aarch64/h264dsp_init_aarch64.c with a thunk calling daedalus_recipe_dispatch_h264_idct4. 2. Repeat for IDCT 8×8, deblock luma-v, qpel mc20 (one kernel per PR for reviewability; bench delta + decode_us delta documented per substitution). 3. marfrit-packages bump to pick up the new daemon + the substituted fourier package.	2026-05-21 21:19:24 +02:00
claude-noether	814b74d0bb	daemon: per-frame decode_us + periodic stats summary (#11 step 1) Establishes observable baseline metrics before any daedalus-fourier kernel substitution lands. Step 1 of the daemon-rewrite arc tracked at daedalus-v4l2#11. Changes ------- - Per-frame `decoder: OK ...` log line now carries decode_us=N (the send_packet + receive_frame wall-clock cost in microseconds — exclusively the libavcodec round-trip, not the bitstream pack / SPS-PPS synth / pack-to-planes work). - New "decoder stats" summary line every DAEDALUS_STATS_EVERY (60) decoded frames, reporting: codec, frame count, window seconds, fps, avg decode_us, MBs/s throughput, bytes/MB bitrate. Sample ------ decoder stats: codec=h264 frames=300 window=12.32s fps=24.35 avg_decode_us=4216.4 mbs_per_s=87643 bs_b_per_mb=1.56 What this tells us ------------------ Steady-state on higgs (Pi CM5) decoding bbb_720p_h264.mp4: ~4 ms decode_us, ~90 K MBs/s, well under the daedalus-fourier NEON kernel ceilings (IDCT 4×4 @ 175 Mblocks/s, deblock @ 92 Medges/s, qpel mc20 @ 131 Mblocks/s — all 100-1000× over our actual workload). Means the 4 ms/frame is mostly libavcodec's CABAC + MV prediction + intra prediction overhead, NOT the pixel-math primitives. Substituting a single primitive would shave only a small slice of the 4 ms. Useful as guidance for the upcoming substitution work — we'll pick the primitive with the largest cycle cost relative to the alternative, and measure CPU saved per substitution. No behaviour change: counters are static + unsynchronised (the chardev event loop is single-threaded); reset when codec_id changes. clock_gettime(CLOCK_MONOTONIC) for timing.	2026-05-21 20:17:09 +02:00
marfrit	77e14e5a19	Merge pull request 'daemon: link daedalus-fourier + log substrate availability at startup' (#13 ) from noether/daemon-link-daedalus-fourier into main Reviewed-on: #13	2026-05-21 16:35:38 +00:00
claude-noether	88b2ebfaa9	daemon: link daedalus-fourier + log substrate availability at startup First incremental step toward H.264 daemon-rewrite (daedalus-v4l2#11): make the daedalus-fourier kernel library available to the daemon process so subsequent patches can substitute its primitives (IDCT 4×4, IDCT 8×8, luma vertical deblock, etc.) for libavcodec's per-MB pixel math. This patch does NOT yet dispatch any kernels. It only: - Adds `pkg_check_modules(DAEDALUS_FOURIER REQUIRED daedalus-fourier)` to the daemon's CMakeLists, with explicit link ordering (libdaedalus_core.a must precede -lvulkan because the static archive references vulkan symbols and the linker resolves left-to-right). We bypass IMPORTED_TARGET because pkg-config's Requires.private chain leaves CMake's dependency graph reordering the archive after -lvulkan, breaking the static link. - Calls daedalus_ctx_create_no_qpu() at daemon startup, logs the substrate-availability line, destroys the context at exit. no_qpu mode skips V3D Vulkan probe — proves linkage works without depending on shader-path resolution (which is a separate piece of work, since v3d_runner currently loads .spv files from cwd-relative paths and consumer would need a search path override). Sample journal line: [2026-05-21 17:59:35.271 INFO] daedalus-fourier: linked, ctx alive (no_qpu mode; has_qpu=0) Build-test verified on hertz (Pi 5 dev host) against an installed copy of daedalus-fourier r35+gd87239d (from marfrit/daedalus-fourier PR #1). Binary links cleanly, --help prints, daemon mode opens chardev (fails predictably on hertz which has no daedalus_v4l2 kmod; on higgs this is the existing working path). Follow-up patches per daedalus-v4l2#11: 1. Instrument the existing libavcodec decode path to count per-frame IDCT blocks / deblock edges / MC tiles so we have a baseline of what work the daemon dispatches for a typical YouTube H.264 stream. 2. Substitute daedalus-fourier kernels one at a time, measuring CPU saved per substitution. 3. Wire shader path resolution into daedalus_ctx_create() for the QPU substrate (V3D opportunistic helper paths). Wire protocol unchanged. DAEDALUS_PROTO_VERSION stays at 0.	2026-05-21 18:00:46 +02:00
claude-noether	234a103084	daemon: AV_CODEC_FLAG_LOW_DELAY for H.264 — fix display-reorder breaking V4L2 1:1 Force libavcodec's H.264 decoder to emit frames in DECODE order (one frame per send_packet, no internal display-order reorder queue). Single-line addition: ctx->flags \|= AV_CODEC_FLAG_LOW_DELAY before avcodec_open2, gated on codec_id == DAEDALUS_CODEC_H264. Closes daedalus-v4l2#11 part (2). Background ---------- PR #7's "parking design" approach to the H.264 display-reorder problem broke libva-v4l2-request-fourier's 1:1 CAPTURE-completion contract (see #9 + #10). After the revert, the visible "2 1 4 3" pair-swap regressed and the only path forward was to align the daemon's output ordering with what V4L2 stateless clients expect: decode order, one CAPTURE buffer per OUTPUT slice, with display reorder pushed upstream to ffmpeg-vaapi's per-VAAPI-surface POC logic (which it already does correctly for every real H.264 hardware decoder via VAPictureParameterBufferH264). How LOW_DELAY does this ----------------------- Inside libavcodec/h264dec.c, the flag sets h->low_delay = 1. h264_select_output_frame (h264_picture.c) emits the just-decoded picture immediately instead of routing through the display-order DPB output queue. DPB management for reference frames (short_ref / long_ref) is unaffected — B-frame decoding correctness is preserved; only the output buffering is bypassed. Skipped for VP9 / AV1 — those codecs don't reorder internally, so the flag would be a no-op but adds no value. Verified -------- On higgs (Pi CM5, 6.18.29+rpt-rpi-2712), test daemon hot-swapped into /usr/bin/daedalus_v4l2_daemon, mpv --hwdec=vaapi-copy --frames=300 against bbb_720p_h264.mp4: 311 REQ_DECODEs received, 308 successful "decoder: OK" responses (99.04% steady-state delivery — 3 lost at GOP boundaries, no compounding drift). mpv plays to its --frames cap and exits cleanly with "End of file". No "Unable to dequeue buffer", no "Failed to end picture decode", no "AVHWFramesContext: Failed to sync surface" — all the failures from #9 are gone. Builds clean against ffmpeg-v4l2-request-fourier libavcodec.	2026-05-21 17:14:33 +02:00
marfrit	714d781d22	Revert "Merge pull request 'kernel + daemon: H.264 B-frame display reorder fix (closes #6 )' (#7 ) from noether/kernel-daemon-h264-reorder-fix into main" This reverts commit `79256dc7ef`, reversing changes made to `7ff2d897ea`.	2026-05-21 14:40:59 +02:00
claude-noether	15fc2aba14	kernel + daemon: H.264 B-frame display reorder fix (issue #6 ) H.264 streams with B-frames showed visibly pair-swapped output in mpv / Firefox playback through the libva → daedalus_v4l2 path — "frames went 2 1 4 3 6 5 instead of 1 2 3 4 5 6". Reproduced in mpv with --hwdec=vaapi-copy at 720p (bypassing Firefox's compositor), confirming the bug was in this daemon pipeline, not downstream. Root cause ---------- libavcodec's H.264 decoder internally reorders output to DISPLAY order before returning from avcodec_receive_frame. The daemon previously called send_packet → receive_frame ONCE per REQ_DECODE and shipped the resulting pixels in a RESP_FRAME tagged with the SAME cookie. For B-frames this is wrong: the frame returned from receive_frame may belong to an EARLIER bitstream (libavcodec held it for display-order release). Cookie N's CAPTURE buffer therefore got cookie N-2's pixels, while cookie N-2's CAPTURE buffer got silently marked VB2_BUF_STATE_ERROR (the daemon returned DAEDALUS_DECODE_NO_FRAME for the cookie whose pixels were held). Fix shape --------- Decouple kernel cookie identity (decode-order routing) from libavcodec's display-ordered output. Wire-protocol changes: REQ_DECODE + __u64 src_pts (= src_buf->vb2_buf.timestamp) RESP_FRAME + __u32 flags (HAS_PIXELS \| SRC_CONSUMED) + __u64 output_src_pts (= frame->pts on drain) PROTO_VERSION bumped 0 → 1. Lock-step rebuild required. Kernel ------ device_run now mirrors src_buf->vb2_buf.timestamp into req->src_pts before sending REQ_DECODE, and stores it on the inflight item so the completion path can stamp dst_buf.timestamp explicitly when src/dst lifecycles decouple (V4L2_BUF_FLAG_TIMESTAMP_COPY's auto- pairing no longer applies). daedalus_complete_resp_frame splits into: HAS_PIXELS: pack pixels into THIS cookie's CAPTURE buffer, stamp dst timestamp from inflight->src_pts, v4l2_m2m_buf_done(dst, DONE/ERROR). No job_finish here. SRC_CONSUMED: release the bound media_request, run v4l2_m2m_buf_done(src) + v4l2_m2m_job_finish so the scheduler can dispatch the next REQ. dst_buf may still be parked at this point. Inflight entry is removed and freed only when BOTH src_buf and dst_buf have been cleared. Combined HAS_PIXELS\|SRC_CONSUMED RESPs (steady-state VP9/AV1 with no reorder lag) collapse to the prior 1:1 behaviour for free. Daemon ------ daedalus_decoder_run_request split into three primitives: daedalus_decoder_submit — set pkt->pts = req->src_pts, avcodec_send_packet. daedalus_decoder_drain_one — avcodec_receive_frame, populate resp meta + output_src_pts (= the frame's pts, carried back from the bitstream that produced it). daedalus_decoder_pack_current — pack current AVFrame into the caller-mapped CAPTURE planes. chardev_client maintains a small (src_pts → cookie, cached_req) table indexed linearly (≤64 entries; bounded by V4L2 client buffer pool depth). On each REQ_DECODE: 1. Register (src_pts → cookie) in the table. 2. submit(). 3. Drain loop: for each frame returned, look up its owner cookie via pending_lookup(frame->pts), GET_DMABUF for THAT cookie, pack pixels, emit RESP_FRAME(owner_cookie, HAS_PIXELS, output_src_pts=frame->pts). Combine with SRC_CONSUMED when owner_cookie equals the current REQ's cookie. 4. If the current REQ's cookie wasn't drained inside the loop (libavcodec held the frame), emit a standalone SRC_CONSUMED RESP so the kernel runs job_finish + dispatches the next REQ; dst_buf for this cookie stays parked until a future drain produces its pixels. VP9 / AV1 paths are unchanged in behaviour: one frame per REQ, HAS_PIXELS\|SRC_CONSUMED in one combined RESP. Verified -------- Builds clean cross-compiled on higgs against 6.18.29+rpt-rpi-2712 (Pi CM5). Frame-size warning in device_run is pre-existing (unchanged by this commit).	2026-05-21 12:32:47 +02:00
marfrit	29f16ece13	kernel: bind request controls to p_cur before reading them device_run was reading ctrl->p_cur.p_h264_* directly, but v4l2-m2m's request scheduler does NOT auto-bind the in-flight media_request's control values to the ctrl handler's p_cur slots — drivers have to call v4l2_ctrl_request_setup() explicitly. cedrus / rkvdec / hantro all do this in their device_run; daedalus didn't. Result: daedalus_collect_h264_meta() read stale or default values (whatever the prior request had left in p_cur, or v4l2_ctrl_new_custom initial state if no prior request had completed) instead of the S_EXT_CTRLS V4L2_CTRL_WHICH_REQUEST_VAL values libva-v4l2-request- fourier had just sent for THIS frame. The mismatch was a smoking gun on higgs after libva PR #9 / packages PR #52 landed an instrumentation log at h264_set_controls entry: libva boundary (sent to kernel): VAProfile=13 seq_fields=0x00032051 pic_fields=0x00000500 num_ref_frames=1 daedalus daemon (read from kernel p_cur): prof=100 level=41 ref_frames=0 flags=0x10 pps_flags=0x0 After calling v4l2_ctrl_request_setup() at the top of device_run: daedalus daemon (read from kernel p_cur): prof=66 level=11 ref_frames=1 poc_type=2 flags=0x50 pps_flags=0x88 — matches what libva sent, matches the bitstream's actual SPS. End-to-end test on higgs with libva-v4l2-request-fourier 1.0.0+r382 +gc1bb444 (after-fix-3-and-fix-4-instrumentation) + this kernel patch: $ LIBVA_DRIVER_NAME=v4l2_request ffmpeg -hwaccel vaapi \ -hwaccel_device /dev/dri/renderD128 -i h264_test.mp4 \ -frames:v 1 -f null - ... rc=0 daemon journal: zero "error while decoding MB" lines, zero "reference frames exceeds max" lines. Per-frame fnv1a hashes differ (0xf1c515aa, 0x16e915e8, 0x16bd16cc, ...) instead of the constant 0x6a6a05c5 "give-up-and-zero" hash from before — libavcodec is actually decoding real pixel content from each P-frame. Pair note: the daemon side already calls v4l2_ctrl_request_complete in daedalus_complete_resp_frame (line 834) — this commit pairs the setup half with that completion half. The daemon side change (decoder.c) is a small log-level promotion: the per-frame "h264 SPS/PPS prepended ..." trace went from log_debug to log_info so the journal shows what's being shipped into libavcodec without needing a daemon rebuild with --debug. Matches the libva- side h264_set_controls instrumentation that landed in libva PR #9. Closes part of issue libva-v4l2-request-fourier#8 — the SPS/PPS field-value gap. Profile/level still come from libva's session- derived hardcoded values (h264_profile_to_idc + h264_derive_level_ idc) which is sufficient for libavcodec to accept the synthesised NAL unit; a true stream-parsed profile/level would need SPS-NAL parsing in libva — separate operator-design call. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-20 20:35:06 +02:00
marfrit	8c1d9960c4	DAEMON-PPS: synthesise H.264 SPS/PPS NAL units from V4L2 controls libva-v4l2-request-fourier (and any V4L2-stateless-API consumer) passes H.264 SPS/PPS as separate V4L2_CID_STATELESS_H264_{SPS,PPS} controls; only the slice NAL goes into the OUTPUT buffer. This is correct per the V4L2 stateless contract. But libavcodec — which the daedalus daemon uses for actual decode (Option γ) — wants a self-contained AnnexB stream including SPS+PPS before any slice. Result on higgs: "non-existing PPS 0 referenced" + decode_slice_ header errors on every H.264 frame, even after LIBVA-1 and -2 routing correctly delivered the request to the daemon. Fix splits across kernel + daemon, keeping the kernel module as a thin transport and putting the actual NAL encoding in userspace: include/daedalus_v4l2_proto.h: Add struct daedalus_h264_meta (the four v4l2_ctrl_h264_* structs the kernel collects) and DAEDALUS_REQ_FLAG_H264_META (set in req.flags when the meta block is present between the daedalus_req_decode prefix and the slice bitstream). kernel/daedalus_v4l2_main.c: Add daedalus_collect_h264_meta() — reads the H.264 ctrl values from the bound media_request via v4l2_ctrl_find + ctrl->p_cur.p_h264_*. device_run() calls it on H.264 codec_id, copies the structs into the REQ_DECODE payload between the prefix and bitstream, and sets the flag. Payload size is bounds-checked against DAEDALUS_PROTO_MAX_PAYLOAD so an over- sized slice + meta fails loud instead of truncating. daemon/src/bitstream_writer.{c,h}: New module — MSB-first bit packer with H.264 Exp-Golomb ue(v) and se(v) coding + rbsp_trailing_bits alignment. Sticky overflow flag so callers can verify the output buffer wasn't truncated. daemon/src/h264_nal_synth.{c,h}: New module — turns v4l2_ctrl_h264_sps / v4l2_ctrl_h264_pps into AnnexB-framed NAL units per ITU-T H.264 7.3.2.1 / 7.3.2.2. Emits emulation prevention bytes (0x03 after every 00 00 in the EBSP) and the 4-byte start code (0x00000001). Coverage matches what V4L2 stateless surface gives us: VUI parameters and full scaling matrices are NOT emitted (V4L2 doesn't carry them — the seq_scaling_matrix_present_flag is set to 0 and libavcodec uses flat defaults, which matches the de-facto behaviour of most H.264 streams libva-v4l2-request drives). daemon/src/decoder.c: daedalus_decoder_run_request() now takes an optional h264_meta parameter. For codec_id == H264 with meta != NULL, synthesises SPS+PPS NAL units, allocates a combined [SPS][PPS][slice] buffer (+ AV_INPUT_BUFFER_PADDING_SIZE), and feeds that to avcodec_send_packet instead of the raw slice. VP9/AV1 path unchanged (frames are self-contained). Cleanup now goes through a unified `out:` label so the assembled buffer is always freed on every exit (including the existing decoder_open_codec / no-frame / receive_frame failure paths). daemon/src/chardev_client.c: handle_req_decode() peels off the optional meta block when the flag is set, passes it through to the decoder, and updates the payload-length consistency check (now allows for an extra sizeof(daedalus_h264_meta) when the flag is on). Build (boltzmann aarch64): clean compile of all daemon sources, including bitstream_writer + h264_nal_synth + the refactored decoder.c. Kernel module compile to be verified via DKMS rebuild on higgs in the marfrit-packages bump that follows. Test plan: with this commit + a marfrit-packages daedalus pin bump, higgs's ffmpeg -hwaccel vaapi -i h264_test.mp4 should produce a successful decode (vs. the previous "non-existing PPS 0 referenced" failure). The daemon log should show: decoder: opened h264 context decoder: h264 prepended SPS=NB PPS=MB slice=KB decoder: OK 320x240 fmt=0 (yuv420p) fnv1a=0x... VP9 / AV1 behaviour unchanged — they don't carry meta and the existing per-frame self-describing path still applies. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-20 17:35:24 +02:00
marfrit	0de0288dce	Phase 8.10+8.11: libva consumer integration scaffold Brings daedalus_v4l2 from "standalone test client" to "VAAPI- discoverable decoder" by adding the surface formats and media-controller plumbing that libva-v4l2-request-fourier (sibling repo) requires. libva-v4l2-request-fourier patches (pushed separately): - b5b3acf: daedalus_v4l2 added to known_decoder_drivers - 2146341: meson option gate This commit (daedalus-v4l2 side, 3 production changes): 1. V4L2_PIX_FMT_NV12 (single-plane) on CAPTURE - Added to daedalus_capture_formats[] alongside NV12M + P010 - daedalus_fill_capture_fmt handles num_planes=1 case (sizeimage = WH3/2, bytesperline = W) - daemon pack_nv12_single_to_plane: Y at base+0, interleaved CbCr at base+(stride*H); same byte content as NV12M two-plane, different layout - Required because libva-v4l2-request-fourier's video.c only knows non-multi-plane NV12 (it advertises v4l2_mplane=true but uses the single-plane fourcc). - Verified byte-exact via test_m2m_stream against ffmpeg -pix_fmt nv12 reference (VP9 1080p 10 frames, 31 MB). 2. V4L2 Request API media ops - daedalus_media_ops = { vb2_request_validate, v4l2_m2m_request_queue } assigned to mdev.ops before media_device_init. - Without this, MEDIA_IOC_REQUEST_ALLOC returned -ENOTTY and no VAAPI consumer could allocate a media_request. 3. Stateless control registration via v4l2_ctrl_new_custom - Switched from v4l2_ctrl_new_std_compound(NULL p_def) to v4l2_ctrl_new_custom — pattern rkvdec/cedrus/ hantro use. Adds a no-op s_ctrl callback. Verification (hertz, Pi 5, 6.12.75+rpt-rpi-2712): LibVA trace through `ffmpeg -hwaccel vaapi`: vaInitialize / Profiles / Entrypoints / CreateConfig / QuerySurfaceAttributes / CreateSurfaces / CreateContext (cap_pool: 24 slots, 1 plane each) / CreateBuffer (slice + picture params) / MEDIA_IOC_REQUEST_ALLOC — all succeed. Standalone NV12 decode path: test_m2m_stream vp9_1080_stream.ivf out.nv12 1920 1080 vp9 nv12 → 10/10 frames, byte-exact vs ffmpeg -pix_fmt nv12 vainfo (via libva-v4l2-request-fourier with our driver): 7 VAProfile entries with VAEntrypointVLD (H264 Main/High/CBaseline/MultiviewHigh/StereoHigh, VP9Profile0, AV1Profile0) What's NOT here (Phase 8.12): The libva trace stops at VIDIOC_S_EXT_CTRLS returning EINVAL when populating V4L2_CID_STATELESS_VP9_FRAME on the request. The compound-control payload validation against the kernel's expected struct shape rejects. This isn't a "missing line" fix — it needs proper stateless control plumbing (the SPS/PPS/SliceParams get_dims, validate, default-value paths that in-tree rkvdec/cedrus/hantro implement to satisfy v4l2-core's std_validate). Documented as Phase 8.12 scope. The shipped integration is itself a meaningful deliverable: all the framework scaffolding is in place; the remaining gap is well-characterised and bounded. See docs/phase_8_10_11_closure.md for the full trace analysis + next-phase plan. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 17:51:16 +00:00
marfrit	1ae9528e76	Phase 8.8: throughput baseline + multi-codec streams + HDR Per the correctness-before-speed principle: measure before optimising. Roadmap going in said "QPU dispatch substitution to hit 30fps@1080p". Measurement on hertz shows the FFmpeg software path already hits 65-88 fps@1080p across all three codecs — QPU substitution would be premature optimisation. So 8.8 ships what's actually useful: 1. Per-frame timing in test_m2m_stream. 2. Multi-frame AV1 + H.264 streams verified byte-exact at 1080p (closes the "VP9-only stream tests" gap from 8.7). 3. HDR / 10-bit via V4L2_PIX_FMT_P010 + daemon pack_p010_to_plane. Test harness (tools/test_m2m_stream.c): - Per-frame µs timing via CLOCK_MONOTONIC; reports mean/p50/ p99/min/max + wall ms + fps. - Annex-B H.264 parser: split on 3-/4-byte start codes, accumulate NALs into access units (push on VCL NAL types 1 or 5). Without AU grouping FFmpeg rejects SPS/PPS-only buffers as "no frame!". - Format auto-detect (DKIF magic → IVF; else Annex-B). - Optional 6th arg `[capture]`: nv12m \| p010. - CAPTURE mmap path generalised for num_planes==1 (P010). Kernel (kernel/daedalus_v4l2_main.c): - CAPTURE formats array {NV12M, P010}; enum_fmt walks it. - daedalus_fill_capture_fmt takes a fourcc: NV12M: 2 planes, WH + WH/2 bytes, bpl=W P010: 1 plane, WH2 + WH bytes, bpl=W2 - try_fmt preserves caller fourcc when supported. - daedalus_complete_resp_frame's dmabuf path now sets each plane's payload to vb2_plane_size(vb,p) — generalises cleanly across 1-plane (P010) and 2-plane (NV12M) layouts; the daemon fully populates the plane so payload = sizeimage. Daemon (daemon/src/decoder.c): - pack_p010_to_plane: YUV420P10LE → P010 single-plane. 10-bit samples shifted left by 6 to MSB-align in 16-bit words per V4L2 ABI. Y at base+0, interleaved CbCr right after Y plane (per format spec for single-plane P010). Strips source stride padding; respects destination stride. - daedalus_decoder_run_request dispatches on req->capture_pix_fmt (NV12M → pack_nv12_to_planes; P010 → pack_p010_to_plane; else warn + skip). - Includes <linux/videodev2.h> for fourcc constants. Verification on hertz (Pi 5, 6.12.75+rpt-rpi-2712): 1080p throughput baseline (30 frames testsrc, dmabuf path): VP9 1080p: mean 12.0 ms, p99 15.9 ms, fps 83.1, byte-exact ✓ AV1 1080p: mean 15.4 ms, p99 41.0 ms, fps 65.0, byte-exact ✓ H.264 1080p: mean 11.3 ms, p99 21.5 ms, fps 88.3, byte-exact ✓ All 2-3× over the 30fps-floor-is-fine criterion. HDR / 10-bit 1080p P010: 10 frames, 62 MB output, fps 48.8, byte-exact vs `ffmpeg -pix_fmt p010le -f rawvideo`. Small-frame P010 (320×240): fps 966 — fixed daemon overhead dominates at low resolutions. v4l2-compliance unchanged from 8.7: 49/49 passing. Format enumeration confirms NM12 + P010 on CAPTURE. Clean SIGTERM + rmmod; no kernel oops/WARN. Roadmap update (docs/roadmap.md): - 8.8 marked closed with closure-doc reference, including the explicit "QPU substitution not needed" rationale. - 8.9 reshaped: libva-v4l2-request consumer integration (per project_consumer_target memory) — the actual user-facing endpoint. Per correctness-before-speed: - Measured first; QPU work explicitly justified-out via data. - Byte-exact pixel comparison for every codec/format combo (NV12: VP9, AV1, H.264; P010: VP9 10-bit at 320×240 and 1080p). - AU grouping in the Annex-B parser is the correct semantic boundary, not just a workaround. - vb2_plane_size for payload generalises to any plane count, not hardcoded to 2. Phase 8.9 next: libva-v4l2-request integration — close the loop from YouTube/Firefox to /dev/video0 + daemon playback. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 16:34:05 +00:00
marfrit	c7f6fb90cb	Phase 8.6: dmabuf + AV1 + H.264 + stateless controls Removes the Phase 8.5 64 KiB frame-size cap by exporting CAPTURE buffers as dmabuf-fds the daemon mmaps and writes pixels into directly. Adds AV1 + H.264 codec support, V4L2 stateless control registration, and the compliance polish that brings the driver to 47/48 v4l2-compliance pass. Protocol (include/daedalus_v4l2_proto.h): - struct daedalus_req_decode grew capture-buffer metadata (width/height/pix_fmt/num_planes + per-plane size+stride). - New DAEDALUS_IOC_GET_DMABUF ioctl on the chardev: daemon asks for a per-plane dmabuf fd, kernel calls vb2_core_expbuf in daemon task context so the fd lands in the daemon's table. Kernel m2m driver (kernel/daedalus_v4l2_main.c): - Both queues switched to vb2_dma_contig_memops. OUTPUT was vmalloc in 8.5; the switch is needed because vmalloc doesn't honour V4L2_MEMORY_FLAG_NON_COHERENT and v4l2-compliance's REQBUFS test rejected the driver because of it. We still read bitstream via vb2_plane_vaddr (dma_contig gives a kernel virtual address just like vmalloc did). - dma_coerce_mask_and_coherent(DMA_BIT_MASK(32)) in probe. - queue_setup populates alloc_devs[plane] = &pdev->dev for both queues; allow_cache_hints=1 on both. - daedalus_export_capture_dmabuf(cookie, plane, flags, fd): walks inflight list, calls vb2_core_expbuf on the CAPTURE buffer in the caller's (daemon's) task context. - device_run fills the new REQ_DECODE capture fields from ctx->dst_fmt and maps ctx->src_fmt.pixelformat to DAEDALUS_CODEC_VP9 / _AV1 / _H264 (was hard-wired to VP9). - daedalus_complete_resp_frame handles both the 8.5 inline path (kept for debugging) and the 8.6 dmabuf path (pixels already in CAPTURE buffer, just set payload from metadata). - enum_fmt advertises all 3 OUTPUT formats (VP9F, AV1F, S264). - try_fmt preserves userspace colorspace fields instead of overwriting with REC709 defaults (fixes 8.5 compliance fail). - s_fmt propagates OUTPUT colorspace → CAPTURE (stateless decoder round-trip test at v4l2-test-formats.cpp:958). - 12 V4L2 stateless controls registered per open (VP9_FRAME, VP9_COMPRESSED_HDR, H264_SPS/PPS/SCALING/PRED_WEIGHTS/ SLICE_PARAMS/DECODE_PARAMS, AV1_FRAME/SEQUENCE/ TILE_GROUP_ENTRY/FILM_GRAIN). Daemon ignores values (FFmpeg re-parses); registration is what makes libva-v4l2-request see us. Kernel chardev (kernel/daedalus_v4l2_chardev.c): - New unlocked_ioctl dispatching DAEDALUS_IOC_GET_DMABUF to daedalus_export_capture_dmabuf. - debugfs test_decode cookies unified with the m2m cookie allocator via shared daedalus_next_cookie() — kills the Phase 8.5 namespace collision. Daemon (daemon/src/...): - New dmabuf_capture.{c,h}: GET_DMABUF + mmap each plane on REQ_DECODE; munmap + close on completion. O_RDWR \| O_CLOEXEC is essential — vb2_core_expbuf extracts O_ACCMODE from flags and exports read-only by default (caught on first run; mmap -EACCES on PROT_WRITE). - decoder.{c,h}: lazily opens AV1 + H.264 AVCodecContexts in addition to VP9 (dropped the -ENOSYS stubs). pack_nv12_to_planes writes Y line-by-line into planes[0] with planes[0].stride; interleaves Cb/Cr into planes[1] with planes[1].stride. - chardev_client.c handle_req_decode: opens dmabuf planes, runs decode (pixels land in CAPTURE buffer directly), closes planes, sends metadata-only RESP_FRAME. No wire-pixel allocation. Test harness (tools/test_m2m_decode.c): - Optional 5th arg `codec` (vp9 \| av1 \| h264). Same client drives all three codecs. Verification on hertz (Pi 5, 6.12.75+rpt-rpi-2712): Bit-exact end-to-end vs `ffmpeg -pix_fmt nv12`: VP9 1920x1080 3,110,400 bytes MATCH AV1 128x96 18,432 bytes MATCH H.264 128x96 18,432 bytes MATCH VP9 1080p went through the full dmabuf path with no chardev payload bloat — the same chardev that capped at 64 KiB in 8.5 now ferries metadata only and lets the daemon mmap+write a 3.1 MB frame directly into the V4L2 client's buffer. v4l2-compliance: Phase 8.1: 44/48 Phase 8.5: 44/48 (different fails after m2m landed) Phase 8.6: 47/48 Only remaining: VIDIOC_(TRY_)DECODER_CMD (needs media controller — explicitly Phase 8.7 work). 11 standard compound controls visible: vp9_frame_decode_parameters, vp9_probabilities_updates, h264_sequence_parameter_set, h264_picture_parameter_set, h264_scaling_matrix, h264_prediction_weight_table, h264_slice_parameters, h264_decode_parameters, av1_sequence_parameters, av1_frame_parameters, av1_film_grain (av1_tile_group_entry refused by hdl->error on this kernel — skipped silently). Clean SIGTERM + rmmod, no oops/WARN. Roadmap update (docs/roadmap.md): - Phase 8.6 marked closed with the closure-doc reference. - Phase 8.7 reshaped to (1) media controller, (2) perf + daedalus_dispatch_ substitution, (3) HDR/10-bit, (4) long-form multi-frame streaming. Per correctness-before-speed: - Real V4L2 dmabuf via vb2_core_expbuf (not a sideband fd-passing hack). - O_RDWR access mode threaded through correctly. - Strict pixel-byte comparison against ffmpeg, not "looks right" eyeballing. - Each compliance edge documented with the underlying test source-line + the fix. - All resource paths cleaned (munmap + close per plane on every exit, including error paths). Phase 8.7 next: media controller binding (closes last compliance fail), per-frame profiling, QPU dispatch substitution targeting 30fps@1080p from 30fps-floor-is-fine memory. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 16:16:06 +00:00
marfrit	6f4b580f7c	Phase 8.5: full V4L2 m2m driver, VP9 decode via QBUF/DQBUF Replaces the Phase 8.4 debugfs-triggered chardev path with a real V4L2 m2m driver. Userspace clients now drive decoding the standard way — S_FMT / REQBUFS / QBUF on the OUTPUT (bitstream) queue, DQBUF on the CAPTURE (NV12M) queue. Kernel device_run packs the bitstream into REQ_DECODE; daemon decodes via FFmpeg; RESP_FRAME's inline NV12 pixel payload lands in the CAPTURE buffer. Phase 8.6 swaps the inline payload for dmabuf so big frames stop being capped at 64 KiB. Kernel (daedalus_v4l2_main.c, rewritten + main.h added): - Per-open struct daedalus_ctx: v4l2_fh, m2m_ctx, ctrl_handler, per-queue v4l2_pix_format_mplane. - Two vb2_queues (vb2_vmalloc_memops for both — no DMA needed yet; 8.6 switches CAPTURE to dma_contig for dmabuf-export): OUTPUT = V4L2_BUF_TYPE_VIDEO_OUTPUT_MPLANE, VP9_FRAME CAPTURE = V4L2_BUF_TYPE_VIDEO_CAPTURE_MPLANE, NV12M - Full v4l2_ioctl_ops table: querycap, enum_fmt, g/s/try_fmt for both queues, reqbufs/querybuf/qbuf/dqbuf/create_bufs/ prepare_buf/expbuf/streamon/streamoff via v4l2_m2m_ioctl_* helpers. - v4l2_m2m_ops.device_run: peeks next OUTPUT buf, builds REQ_DECODE inline with the bitstream bytes, enqueues with an auto-incrementing cookie, stores {ctx, src_buf, dst_buf} in a per-device inflight list. Job stays open until RESP_FRAME. - daedalus_complete_resp_frame(): pops the inflight entry, memcpys inline NV12 pixels into the CAPTURE buffer (Y plane + interleaved CbCr), finishes via v4l2_m2m_buf_done_and_job_finish — NOT plain buf_done + job_finish, which leaves the src buf on the m2m queue and causes device_run to immediately re-run on the same input (caught on first run; second REQ_DECODE for same bitstream + eventual oops in stop_streaming on teardown). Kernel (daedalus_v4l2_chardev.c): - RESP_FRAME handler now hands inline pixel payload to daedalus_complete_resp_frame so it lands in the CAPTURE vb2 buffer. Existing PONG and debugfs test_decode paths still work; the latter produces a harmless ratelimited "unknown cookie" since it bypasses V4L2 m2m. Daemon (decoder.c, decoder.h): - daedalus_decoder_run_request signature extended with (nv12_out, nv12_cap, nv12_used). After the FNV-1a digest the decoder packs YUV420P into NV12 in the caller's buffer: Y plane line-by-line stripped of stride padding; Cb/Cr interleaved into a single chroma plane. Truncation silent — kernel only memcpys what fits in the CAPTURE plane. Daemon (chardev_client.c): - handle_req_decode allocates a response buffer sized for the full chardev payload, lets decoder fill the pixel area after the resp_frame struct, sends the full payload via the existing send_response. Test client (tools/test_m2m_decode.c, new): - Minimal V4L2 m2m client: S_FMT both queues, REQBUFS 1 each, mmap+fill OUTPUT, QBUF both, STREAMON, poll, DQBUF, dump CAPTURE planes to a raw NV12 file. ~250 LOC; verifies the whole flow without needing v4l2-ctl framing. Roadmap update (docs/roadmap.md): - Phase 8.4 retitled "daemon ↔ kernel decode round-trip" to reflect what actually shipped (vs. the original V4L2- ioctl-driven plan which moved here). - Phase 8.5 retitled "full V4L2 m2m driver" with closure status. - Phase 8.6 reshaped to two tracks: dmabuf + AV1/H.264/ stateless controls + media controller. Adds the punch list of v4l2-compliance failures (DECODER_CMD, S_FMT colorspace) that 8.6 will fix. Verification on hertz (Pi 5, 6.12.75+rpt-rpi-2712): Kernel + daemon build clean (-Wall -Wextra clean both sides). Test harness drives one VP9 keyframe end-to-end: OUTPUT REQBUFS -> 2 CAPTURE REQBUFS -> 2 QBUF OUTPUT[0] bytesused=1566 QBUF CAPTURE[0]; STREAMON both poll revents=0x5 DQBUF OUTPUT[0] flags=0x4001 (DONE) DQBUF CAPTURE[0] flags=0x4000 payloads=[12288, 6144] wrote 12288 Y + 6144 UV bytes to /tmp/out_m2m.nv12 Pixel correctness vs reference: ffmpeg -i vp9_small.ivf -pix_fmt nv12 -f rawvideo -y ref.nv12 cmp /tmp/out_m2m.nv12 /tmp/ref.nv12 → match ✓ Byte-for-byte identical to FFmpeg's stock CPU decode. v4l2-compliance: detected as Stateless Decoder; most ioctls pass; two expected fails documented in closure doc (DECODER_CMD/media controller, S_FMT colorspace). Clean teardown: SIGTERM the daemon, rmmod the module, no oops/WARN in dmesg. Per correctness-before-speed: - Real V4L2 ioctl table (not stubs); uses v4l2-core helpers where they exist instead of reinventing. - v4l2_m2m_buf_done_and_job_finish (not the manual sequence) to keep scheduler state consistent. - Bit-exact reference comparison, not just "looks right." - Documented every compliance failure with the planned fix. - All resource paths (kmalloc/kfree, inflight list cleanup, src/dst buf removal in stop_streaming) handled on every error branch. Phase 8.6 next: dmabuf-export for CAPTURE (removes 64 KiB frame-size cap), add AV1+H.264 codecs, add V4L2 stateless controls + media controller binding, fix the colorspace + cookie-namespace compliance issues. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 15:55:10 +00:00
marfrit	2a449632b9	Phase 8.4: daemon ↔ kernel decode round-trip (VP9 end-to-end) Wires the Phase 8.3 FFmpeg loader through the Phase 8.2 chardev bridge: kernel injects REQ_DECODE carrying a raw VP9 access unit, daemon hands the bitstream to libavcodec via dlopen, sends RESP_FRAME back with a content-dependent FNV-1a digest of the decoded YUV planes. Pure CPU decode for now — Phase 8.5 swaps in dmabuf + QPU dispatch. Protocol (include/daedalus_v4l2_proto.h): - New REQ_DECODE (kernel→daemon) and RESP_FRAME (daemon→kernel) message types, with fixed-size payload structs. - New DAEDALUS_CODEC_VP9/AV1/H264 enum (wire-stable so 8.6's AV1+H.264 work doesn't move existing values). - New DAEDALUS_DECODE_* status enum (OK / NO_FRAME / ERR_OPEN / ERR_SEND / ERR_RECV / ERR_CODEC). - Converted the prior `enum daedalus_msg_type` to #defines — high-bit values exceed INT_MAX and tripped -Wpedantic on userspace; kernel uABI headers use the same idiom. Kernel (kernel/daedalus_v4l2_chardev.c): - New debugfs entry /sys/kernel/debug/daedalus_v4l2/test_decode: writing raw bitstream bytes wraps them in a REQ_DECODE (codec=VP9 for Phase 8.4) and enqueues with an auto-incrementing cookie. - daedalus_chardev_write learned RESP_FRAME: parses the payload and emits a single pr_info line with decode metadata. Keeps existing PONG handling on the default arm. Daemon (daemon/src/...): - chardev_client.{c,h} — opens /dev/daedalus-v4l2, blocking read loop, single-buffer write() responses (kernel chardev has only .write, not .write_iter, so writev lands as -EINVAL — discovered the hard way during first run). - decoder.{c,h} — lazily-opened AVCodecContext per codec, shared AVPacket/AVFrame pair, descriptor-driven plane walker (av_pix_fmt_desc_get) so the same hash path covers YUV420P, YUV422P, YUV444P, GBRP and other 8-bit planar layouts. Generalised after first run decoded testsrc as GBRP (71) rather than the assumed YUV420P. - `daemon` command in main.c opens the chardev and runs the loop until SIGINT/SIGTERM. Cookie correlation handled end-to-end. - ffmpeg_loader gained av_pix_fmt_desc_get (23 symbols total). Build: - CMakeLists adds chardev_client.c + decoder.c; explicit -I../include for the shared protocol header. - Still -Wall -Wextra -Wpedantic clean. Verification on hertz (Pi 5, 6.12.75+rpt-rpi-2712): $ ffmpeg ... -pix_fmt yuv420p -c:v libvpx-vp9 -frames:v 1 \ -y /tmp/vp9_test.ivf $ python3 ... strip IVF framing → vp9_keyframe.bin (3268 B) $ sudo insmod kernel/daedalus_v4l2.ko $ daedalus_v4l2_daemon -v daemon & $ sudo dd if=vp9_keyframe.bin \ of=/sys/kernel/debug/daedalus_v4l2/test_decode daemon: REQ_DECODE cookie=2 → decoded yuv420p 320x240 fnv1a=0x6ef10d71 luma=76800 chroma=38400 kernel: RESP_FRAME cookie=2 status=0 320x240 pixfmt=0 fnv1a=0x6ef10d71 ← matches daemon ✓ Hash properties verified: cookie=2 testsrc 3268 B → 0x6ef10d71 (first decode) cookie=3 red 44 B → 0x7f6e5dc5 (content-dependent ✓) cookie=4 testsrc 3268 B → 0x6ef10d71 (deterministic ✓) cookie=5 64 B random → status=101 (ERR_SEND, daemon alive) Daemon survives bad input (FFmpeg "Invalid sync code" wrapped into structured ERR_SEND response). Clean SIGTERM shutdown, clean rmmod. Phase 8.4 acceptance criteria met: - ✓ end-to-end kernel→daemon→FFmpeg→kernel round-trip - ✓ cookie correlation per request/response pair - ✓ content-dependent + deterministic digest - ✓ structured error responses (no daemon crash on bad input) - ✓ clean teardown (SIGTERM + rmmod) - ✓ builds clean on both kernel kbuild and daemon CMake Per correctness-before-speed: - Real chardev I/O (no shortcuts, no select-loop hacks) - Real FFmpeg AVCodecContext lifecycle (lazily opened, properly freed on cleanup) - Descriptor-driven plane walk (generalises across pix_fmts) - Structured error path (not just log-and-continue) - All resource paths cleaned up on every error branch - Documented why FNV-1a digest, why write() not writev(), why pix_desc walk in docs/phase_8_4_closure.md Phase 8.5 next: V4L2 m2m queue submits REQ_DECODE from vidioc_qbuf; dmabuf carries actual pixel data so the chardev's 64 KiB cap doesn't gate frame size; begin substituting daedalus_dispatch_* into the daemon's decode path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 15:22:16 +00:00
marfrit	873a04c622	Phase 8.3: userspace daemon scaffold + FFmpeg dlopen + parse path Builds the daemon executable per the locked Phase 8 architecture (Option γ: dlopen FFmpeg at runtime). Phase 8.3 scope: parse path validation only — no V4L2 wiring, no decode, no chardev connection. Components: - daemon/CMakeLists.txt — CMake with -Wall -Wextra -Wpedantic clean. pkg-config for FFmpeg headers; only -ldl + -lpthread at link time. - daemon/src/main.c — entry point, signal handlers (SIGINT/SIGTERM), command dispatcher. Currently `parse <file>`. - daemon/src/ffmpeg_loader.{c,h} — runtime FFmpeg loader. dlopens libavformat.so.61, libavcodec.so.61, libavutil.so.59. Resolves 22 function pointers using POSIX-recommended (void)& dlsym idiom (per POSIX.1-2017 dlsym(3p) Rationale). - daemon/src/parser.{c,h} — demux loop via avformat_open_input + av_read_frame. Per-frame logging on -v. - daemon/src/log.{c,h} — logging facade (stderr Phase 8.3; syslog/journal planned for 8.5+). Verification on hertz: $ ffmpeg -f lavfi -i testsrc=duration=2:size=320x240:rate=30 \ -c:v libvpx-vp9 -y /tmp/testsrc.ivf $ daedalus_v4l2_daemon parse /tmp/testsrc.ivf [INFO] FFmpeg loaded: 7.1.3-0+deb13u1+rpt1 (libavformat 61.7.100) [INFO] video stream #0: codec=vp9 (Google VP9) 320x240, 0/0 fps [INFO] parse complete: 60 frames (1 key) total 17859 bytes Error paths verified: - Missing file → "avformat_open_input(...): code -2", exit 1 - No command → usage message, exit 2 - Bad command → usage message, exit 2 Per correctness-before-speed: - Real CMake (no Makefile hacks) - pkg-config for headers - POSIX-conformant dlsym pattern (no -Wpedantic suppression) - Real signal handling + proper exit codes - Real logging with timestamp + level - Headers included at compile-time for type safety; dlopen decouples runtime - All FFmpeg resources freed on every exit path - Builds clean on -Wall -Wextra -Wpedantic Phase 8.3 acceptance criteria met: - ✓ daemon binary builds - ✓ dlopen FFmpeg at runtime - ✓ demux a VP9 IVF file end-to-end - ✓ per-frame metadata logged correctly - ✓ frame count + keyframe count + byte total accurate Phase 8.4 next: wire daemon to /dev/daedalus-v4l2 chardev, add REQ_DECODE / RESP_FRAME handling, drive VP9 decode end-to-end via daedalus_dispatch_ from daedalus-fourier. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 15:10:22 +00:00
marfrit	c7d8050cc9	Initial scaffold: daedalus-v4l2 sibling repo V4L2 stateless decoder for Pi 5, backed by sibling daedalus-fourier kernel library (VP9 + AV1 CDEF + H.264 video decode kernels on VideoCore VII compute + ARM NEON). Architecture locked 2026-05-18 by mfritsche per daedalus-fourier/docs/phase8_scoping.md: - Option B: Linux kernel V4L2 shim + userspace daemon (not v4l2loopback). Real /dev/videoNN; proper DRM PRIME for browser zero-copy. - Option γ: dlopen FFmpeg at runtime as parser. No vendoring; fastest to v1. - Sibling repo (this repo): V4L2-side work outside of daedalus-fourier so kernel-library API stays clean. Components: kernel/ - Linux out-of-tree kernel module (GPLv2; V4L2 device + chardev bridge to userspace daemon) daemon/ - userspace decoder daemon (BSD-2-Clause; links libdaedalus_core.a from sibling; dlopens FFmpeg) docs/ - architecture + 7-phase roadmap (8.1..8.7) include/ - shared headers between kernel and daemon Roadmap (7 sub-phases, ~1 week each): 8.1 kernel skeleton (/dev/videoNN with no-op ioctls) 8.2 chardev bridge (kernel ↔ daemon ping-pong) 8.3 daemon FFmpeg dlopen + parse path 8.4 VP9 end-to-end via daedalus_dispatch_* 8.5 dmabuf / DRM PRIME for zero-copy 8.6 AV1 + H.264 codec support 8.7 performance: hit 30fps@1080p (project floor) No code yet — only README + design docs + directory structure. First implementation work starts in Phase 8.1 next session. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 14:54:56 +00:00

22 Commits