Commit Graph

53 Commits

Author SHA1 Message Date
marfrit 5d1ff51178 Merge pull request 'daemon: AV1 Frame Header OBU synthesiser + Temporal Delimiter' (#24) from noether/daemon-av1-frame-header-obu into main
Reviewed-on: #24
2026-05-23 17:16:27 +00:00
claude-noether 9797a0daa6 daemon: AV1 Frame Header OBU synthesiser + Temporal Delimiter
Extends the AV1 OBU encoder pack (PR #22 landed the Sequence Header
half) with the two remaining pieces of the per-frame OBU assembly:

  - av1_synth_temporal_delimiter_obu() — trivial 2-byte OBU (0x12,
    0x00) that AV1 temporal units must start with so libavcodec's
    parser can detect access-unit boundaries.

  - av1_synth_frame_header_obu() — encodes a Frame Header OBU (AV1
    §5.9) from V4L2_CID_STATELESS_AV1_SEQUENCE + V4L2_CID_STATELESS_
    AV1_FRAME controls.
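
A minimal sketch of where the (0x12, 0x00) pair comes from, assuming
the standard AV1 OBU header layout (forbidden bit, 4-bit obu_type,
extension flag, has_size_field, reserved bit).  The function names
below are illustrative stand-ins, not the daemon's actual code:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* OBU type codes from the AV1 specification. */
enum { OBU_TEMPORAL_DELIMITER = 2, OBU_FRAME_HEADER = 3 };

/* One-byte OBU header: forbidden bit 0, 4-bit obu_type, extension
 * flag 0, has_size_field 1, reserved bit 0. */
uint8_t obu_header_byte(unsigned obu_type)
{
    assert(obu_type < 16);
    return (uint8_t)((obu_type << 3) | (1u << 1));
}

/* Hypothetical stand-in for av1_synth_temporal_delimiter_obu():
 * returns the byte count written, or 0 if the buffer is too small. */
size_t sketch_temporal_delimiter(uint8_t *out, size_t out_cap)
{
    if (out == NULL || out_cap < 2)
        return 0;
    out[0] = obu_header_byte(OBU_TEMPORAL_DELIMITER);
    out[1] = 0x00;   /* leb128 obu_size = 0: the payload is empty */
    return 2;
}
```

The same header helper yields the first byte of a Frame Header OBU
when called with OBU_FRAME_HEADER.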

## Frame Header scope

The encoder covers the libva-v4l2-request common-case path:

  - frame_type: KEY / INTER / INTRA_ONLY supported.  SWITCH returns 0.
  - tile_info: single-tile uniform-spacing only (forced
    tile_cols_log2 = tile_rows_log2 = 0).
  - quantization_params: full coverage (base_q_idx, delta_q_*, qmatrix).
  - loop_filter_params: full coverage (levels, sharpness, ref/mode deltas).
  - cdef_params: full coverage.
  - segmentation: only enabled=0 path supported (returns 0 if enabled).
  - loop_restoration: only RESTORE_NONE supported (returns 0 if
    any plane uses Wiener / SGRPROJ / SWITCHABLE).
  - global_motion: only IDENTITY warp model emitted (returns 0 if
    any ref uses ROTZOOM / AFFINE / TRANSLATION).
  - film_grain_params: only "not present" path — returns 0 if the
    sequence header has FILM_GRAIN_PARAMS_PRESENT set.

Out-of-scope branches return 0 so a future decoder.c integration can
surface a coverage warning and fall back to direct libavcodec
parsing of the original bitstream where the consumer happens to
ship a fully-OBU'd access unit.

## Integration status

The new primitives are NOT yet wired into decoder.c.  The AV1 decode
hot path still passes the OUTPUT buffer straight to libavcodec,
which works only when the V4L2 consumer is sending a fully-OBU'd
access unit (not strictly the V4L2 stateless contract).

Real wiring needs a separate change that spans the kernel module and the daemon:
  - daedalus_v4l2_proto.h: add struct daedalus_av1_meta mirroring
    v4l2_ctrl_av1_sequence + v4l2_ctrl_av1_frame
  - kernel/daedalus_v4l2_main.c: capture V4L2_CID_STATELESS_AV1_{SEQUENCE,
    FRAME} at device_run, ship over the chardev
  - daemon/src/chardev_client.c: receive meta
  - daemon/src/decoder.c: assemble TD + SH + FH + OBU_TILE_GROUP-wrapped
    OUTPUT bytes, send to libavcodec

Tracked as a follow-on.

## Tests

test_av1_obu_synth.c grows 5 new cases (9 total, all green on hertz):

  === av1_synth_temporal_delimiter_obu ===
    temporal delimiter: OK
  === av1_synth_frame_header_obu ===
    KEY frame 1080p: OK (13 bytes)
    INTER frame: OK (18 bytes)
    SWITCH frame rejected: OK
    segmentation enabled rejected: OK
  AV1 OBU synth tests PASSED

Bit-walk of the KEY-frame happy path confirms the OBU envelope
(obu_type=3 = FRAME_HEADER, has_size_field=1, leb128 size byte),
then steps through show_existing_frame, frame_type, show_frame,
disable_cdf_update, allow_screen_content_tools.  A fuller bit-walk
would only tie the test to encoder details that are spec-driven and
already laid out linearly in the source, so a structural smoke test
is the right gate.

Build clean on hertz (Pi 5, Debian trixie, 6.18.29+rpt-rpi-2712,
gcc -Wall -Wextra -Wpedantic).  No new warnings.

Closes daedalus backlog task #159 (AV1 Frame Header OBU synthesiser;
decoder.c integration deferred per task notes above).
2026-05-23 18:31:41 +02:00
marfrit 3a8f5405d4 Merge pull request 'daemon: AV1 Sequence Header OBU synthesiser + unit test' (#22) from noether/daemon-av1-obu-synth into main
Reviewed-on: #22
2026-05-23 15:12:16 +00:00
marfrit 4cfe0b470f Merge pull request 'daemon: bounds-check pack_* functions against CAPTURE plane size' (#21) from noether/daemon-pack-bounds-check into main
Reviewed-on: #21
2026-05-23 15:11:57 +00:00
marfrit b958ef8166 Merge pull request 'kernel: drain in-flight m2m jobs on daemon disconnect (fixes #146 D-state)' (#23) from noether/kernel-drain-inflight-on-chardev-release into main
Reviewed-on: #23
2026-05-23 15:11:40 +00:00
claude-noether 94be8c3d03 kernel: drain in-flight m2m jobs on daemon disconnect
Fixes issue #146 — daemon-crash (SIGKILL, SEGV, anything that
triggers chardev release) leaves V4L2 consumers in unkillable
TASK_UNINTERRUPTIBLE on /dev/video0 close.

## Root cause

device_run() adds an entry to dev->inflight when it sends a
REQ_DECODE to the daemon, marking the m2m job as "running".
The job is only cleared via v4l2_m2m_buf_done_and_job_finish()
in daedalus_complete_resp_frame(), which only fires on RESP_FRAME.

If the daemon dies (SIGKILL, SEGV, exit) BEFORE writing the
matching RESP_FRAME:
  - the inflight entry is never popped
  - v4l2_m2m_buf_done_and_job_finish is never called
  - the m2m scheduler still thinks a job is running

Later, when the V4L2 consumer calls close() (or is signalled to
exit), v4l2_m2m_ctx_release() → v4l2_m2m_cancel_job() waits
for !job_running indefinitely.  The consumer enters D-state and
survives SIGKILL until reboot.

Reproduced on hertz 2026-05-23, kernel 6.12.75+rpt-rpi-2712:

  $ sudo kill -STOP $DAEMON_PID            # block daemon I/O
  $ ./test_m2m_decode keyframe.bin out.nv12 1920 1080 vp9 &
  $ sudo kill -9 $DAEMON_PID               # chardev_release fires
  $ kill -9 $CLIENT_PID                    # ignored — D-state
  # client stack:
  v4l2_m2m_cancel_job+0x14c [v4l2_mem2mem]
  v4l2_m2m_ctx_release+0x20 [v4l2_mem2mem]
  daedalus_release+0x2c [daedalus_v4l2]
  v4l2_release+0x7c [videodev]
  __fput → do_exit → SIGKILL never delivered

## Fix

New API daedalus_drain_inflight_on_disconnect() in main.{c,h}:
walks the in-flight list, marks both src+dst buffers
VB2_BUF_STATE_ERROR via v4l2_m2m_buf_done_and_job_finish(), and
releases the bound media_request if any.  This is the same completion
sequence that daedalus_complete_resp_frame() runs on the success path,
just with state = ERROR for every in-flight entry.

chardev_release calls the drain after flushing dev->req_queue
(messages still in req_queue weren't released to the daemon yet,
so they don't need the m2m-job-finish dance — freeing them is
sufficient).  The order matters: queue first (cheap), then m2m
drain (heavier, takes the inflight list).

Locking: list_splice_init under inflight_lock to take the entire
list atomically; lock dropped before iterating because
v4l2_m2m_buf_done_and_job_finish can sleep via vb2's buffer-done
dispatch and can re-enter device_run via the scheduler (which
would need inflight_lock again on the next REQ_DECODE).
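
The splice-then-iterate pattern can be sketched in userspace as
follows.  This is an analogue under stated assumptions, not the
kernel code: a C11 atomic_flag stands in for inflight_lock, a singly
linked list stands in for the kernel list_head, and all names are
hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct sketch_item {
    struct sketch_item *next;
    int state;                  /* 0 = in flight, -1 = completed with error */
};

struct sketch_dev {
    atomic_flag inflight_lock;  /* spin flag standing in for the kernel lock */
    struct sketch_item *inflight;
};

static void sketch_lock(atomic_flag *l)
{
    while (atomic_flag_test_and_set(l))
        ;                       /* spin until the flag is released */
}

static void sketch_unlock(atomic_flag *l)
{
    atomic_flag_clear(l);
}

/* Detach the whole list while holding the lock, then complete each
 * entry with the lock dropped so a completion may take it again.
 * Returns the number of entries drained. */
int sketch_drain_inflight(struct sketch_dev *dev)
{
    struct sketch_item *it;
    int drained = 0;

    assert(dev != NULL);
    sketch_lock(&dev->inflight_lock);
    it = dev->inflight;         /* analogue of list_splice_init() */
    dev->inflight = NULL;
    sketch_unlock(&dev->inflight_lock);

    while (it != NULL) {
        struct sketch_item *next = it->next;
        it->next = NULL;
        it->state = -1;         /* stands in for VB2_BUF_STATE_ERROR */
        drained++;
        it = next;
    }
    return drained;
}
```

Because the shared head is emptied before any completion runs, a new
entry queued during the walk lands on the fresh list rather than on
the one being iterated.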

## Verification path

Cannot rmmod the running module on hertz right now — the D-state
corpse from the repro session pins the refcount.  Verification
of the fixed module needs a reboot or fresh test host:

  $ sudo reboot                            # clears hung client
  $ sudo make modules_install              # install new .ko
  $ sudo modprobe daedalus_v4l2
  $ # rerun the repro script — client should die cleanly with
  $ # an -EIO / similar return from poll/DQBUF instead of hanging.

Build: clean on Linux 6.12.75+rpt-rpi-2712, no new warnings.
The pre-existing "frame size 2128 > 2048" warning on
daedalus_device_run is unchanged by this commit.

## Followup not in scope

If a new V4L2 consumer races a REQ_DECODE through device_run
AFTER the drain has spliced the list (but before the daemon
chardev is reopened), the new entry sits in a freshly-empty
inflight list and the same hang can recur for that consumer
when the systemd auto-restart of the daemon either fails or
takes longer than the consumer's patience.  A secondary
safeguard would be to fail-fast in device_run when dev->chardev
is unopened — proposing as a separate ticket if this race
materialises in practice.

Closes #146.
2026-05-23 17:06:06 +02:00
claude-noether 1e9619afe8 daemon: AV1 Sequence Header OBU synthesiser + unit test
V4L2 stateless AV1 passes the sequence header information as a
structured control (V4L2_CID_STATELESS_AV1_SEQUENCE) and ships
only tile-group bytes in the OUTPUT buffer.  libavcodec's AV1
decoder is full-bitstream, so the daemon needs to reconstruct
the OBU bytes the consumer parsed out before feeding the
assembled stream to libavcodec.

This commit lands the Sequence Header OBU half of that
reconstruction — av1_synth_sequence_header_obu().  Frame
Header / Frame OBU synthesisers + the integration that wires
the assembled OBUs into the decode hot path are separate
follow-on modules.

Module shape mirrors the H.264 NAL synthesiser (PR #1):

  - Public API: single function returning byte count or 0
    on overflow/invalid input.
  - Wire encoder uses the existing bitstream_writer (bsw_put_u
    is AV1's f(n); bsw_put_ue is bit-identical to AV1's uvlc;
    bsw_align_rbsp matches AV1's trailing_bits()).
  - AV1-specific helpers (leb128 size, min_bits_for, subsampling
    resolution per §5.5.2) are file-local statics.
  - No emulation prevention — AV1 uses leb128-sized OBUs for
    bitstream boundaries, not byte-pattern escapes.
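
For illustration, a leb128 size encoder along the lines described
above might look like this.  It is a sketch assuming the standard
AV1 leb128 layout (7 payload bits per byte, least significant group
first); the function name is hypothetical and not taken from
av1_obu_synth.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode value as leb128.  Every byte except the last has its top
 * bit set.  Returns bytes written, or 0 if out_cap is too small. */
size_t sketch_leb128_put(uint8_t *out, size_t out_cap, uint32_t value)
{
    size_t n = 0;

    assert(out != NULL);
    do {
        uint8_t byte = (uint8_t)(value & 0x7F);

        value >>= 7;
        if (value != 0)
            byte |= 0x80;       /* more bytes follow */
        if (n == out_cap)
            return 0;           /* would overflow the caller's buffer */
        out[n++] = byte;
    } while (value != 0);
    return n;
}
```

Payloads under 128 bytes therefore cost a single size byte, which
covers the short header OBUs discussed here.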

Synthesis decisions for fields V4L2 doesn't carry are documented
verbatim in the file header (reduced_still_picture_header = 0;
single operating point at seq_level_idx = 13 / level 5.1;
color_description_present_flag = 0; chroma_sample_position = 0;
seq_choose_screen_content_tools = 1; seq_choose_integer_mv = 1).

Rejection cases:
  - seq_profile > 2
  - bit_depth not in {8, 10, 12}
  - seq_profile = 1 + monochrome (4:4:4 forced colour)
  - seq_profile = 1 + bit_depth = 12 (only profile 2 allows it)
  - max_frame_{width,height}_minus_1 requiring > 16 length bits
  - out_cap too small to hold header + leb128 + payload

Each returns 0 to surface the mismatch loudly rather than emit
nonsense the libavcodec parser would reject downstream.
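
The input checks listed above can be sketched as a single predicate.
The struct is a hypothetical subset of the sequence control (widths
are held in 32 bits here so the length-bits check is meaningful),
and only the rejections named in this message are modelled; the
out_cap check is left out because it depends on the encoded size.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical subset of the sequence parameters. */
struct sketch_av1_seq {
    uint8_t  seq_profile;
    uint8_t  bit_depth;
    bool     mono_chrome;
    uint32_t max_frame_width_minus_1;
    uint32_t max_frame_height_minus_1;
};

/* Smallest n >= 1 such that v fits in n bits. */
static unsigned min_bits_for(uint32_t v)
{
    unsigned n = 1;

    while (n < 32 && (v >> n) != 0)
        n++;
    return n;
}

/* True when none of the listed rejection cases applies. */
bool sketch_seq_is_encodable(const struct sketch_av1_seq *s)
{
    assert(s != NULL);
    if (s->seq_profile > 2)
        return false;
    if (s->bit_depth != 8 && s->bit_depth != 10 && s->bit_depth != 12)
        return false;
    if (s->seq_profile == 1 && s->mono_chrome)
        return false;
    if (s->seq_profile == 1 && s->bit_depth == 12)
        return false;
    if (min_bits_for(s->max_frame_width_minus_1) > 16 ||
        min_bits_for(s->max_frame_height_minus_1) > 16)
        return false;
    return true;
}
```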

Unit test (test_av1_obu_synth.c, opt-in via DAEDALUS_BUILD_TESTS=ON)
exercises four cases bit-by-bit against a hand-computed reference:

  1. profile 0, 1080p, 8-bit, 4:2:0, order_hint on (7 bits),
     CDEF+restoration on — the common Pi 5 path.
  2. profile 0, 720p, 10-bit, monochrome — exercises high_bitdepth
     and the monochrome short-form color_config.
  3. profile 1 + bit_depth 12 → expects 0 (rejected).
  4. tiny out_cap → expects 0 (overflow).

All four green on hertz (aarch64 Arch, gcc -Wall -Wextra -Wpedantic
clean).

This commit does not change daemon behaviour — av1_obu_synth.c is
built into the daemon binary so the symbols are reachable, but
no call site is wired yet.  Integration goes in the follow-on
DAEMON-AV1 patches that also synthesise the Frame Header OBU
and bracket the assembled OBUs with a Temporal Delimiter.

Refs reauktion/daedalus-v4l2#11 daemon-half; closes daedalus
backlog task #144.
2026-05-23 15:41:07 +02:00
claude-noether a43296c1ed daemon: bounds-check pack_* functions against CAPTURE plane size
The three NV12/P010 pack functions (pack_nv12_single_to_plane,
pack_nv12_to_planes, pack_p010_to_plane) wrote into the V4L2
client's CAPTURE dmabuf without checking that the mapped size
covers the frame libavcodec just decoded.

Crash scenario: YouTube DASH stepping resolution mid-stream
(e.g. 480p -> 720p when bandwidth improves) — libva is supposed
to handle the V4L2_EVENT_SOURCE_CHANGE with STREAMOFF / S_FMT /
REQBUFS, but in practice a stale CAPTURE request with the old
buffer size sometimes slips through carrying the new (larger)
frame.  The chroma-interleave inner loop walks past the mapping
boundary and the daemon takes SIGSEGV mid-frame, which in turn
leaves V4L2 clients hanging in vb2_core_dqbuf — see the followup
ticket on the D-state symptom.

Fix: compute required = y_size + uv_size and compare it with planes->size[N]
BEFORE any write.  On mismatch, log_warn with both sizes and the
frame dimensions, and return -EOVERFLOW.
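
A minimal sketch of such a check for the single-plane NV12 case.  It
assumes the interleaved UV rows share the luma stride and cover
ceil(height / 2) rows; the function name and parameters are
hypothetical, and the real pack functions also log before returning.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Returns 0 when an NV12 frame of the given stride and height fits
 * in plane_size bytes, -EOVERFLOW otherwise.  Sizes are computed in
 * 64 bits so the multiplication cannot wrap. */
int sketch_nv12_fits(size_t plane_size, uint32_t stride, uint32_t height)
{
    uint64_t y_size   = (uint64_t)stride * height;
    uint64_t uv_rows  = ((uint64_t)height + 1) / 2;
    uint64_t uv_size  = (uint64_t)stride * uv_rows;
    uint64_t required = y_size + uv_size;

    assert(stride != 0 && height != 0);
    if (required > plane_size)
        return -EOVERFLOW;      /* caller must not write any pixels */
    return 0;
}
```

With a buffer sized for 854x480 and a 1280x720 frame arriving, the
check fails before the first byte is written.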

The caller (process_decode_request loop) already handles a
negative pack return with a log_warn and proceeds without
aborting the decode: the kernel still gets a metadata-only
response, and the V4L2 client sees a frame whose pixels are
stale but whose buffer-done event fires normally.  The next
SOURCE_CHANGE the client processes resyncs the buffer size.

All three pack paths get the same bounds-check; the comment on
pack_nv12_single is the canonical explanation, the other two
reference it.

Verified: builds clean against trixie aarch64; no behavioural
change on the happy path (the bounds check is a single size
compare; on a correctly-sized CAPTURE buffer it's a 1-cycle pass).

Closes daedalus-v4l2 task #145 (daemon SEGV in pack_nv12_single
on resolution change).
2026-05-23 15:31:50 +02:00
marfrit 872eec505e Merge pull request 'proto: bump PROTO_MAX_PAYLOAD 64 KiB → 1 MiB (closes #19)' (#20) from noether/issue-19-bump-proto-payload-1mib into main
Reviewed-on: #20
2026-05-22 18:47:46 +00:00
marfrit ee42419479 proto: bump PROTO_MAX_PAYLOAD 64 KiB -> 1 MiB (closes #19)
Real H.264 access units routinely exceed the previous 64 KiB cap
on the chardev wire protocol:

  720p worst-case I-frame  ~200 KiB
  1080p worst-case I-frame ~500 KiB

libva-v4l2-request-fourier detects the under-sized OUTPUT-MPLANE
buffer and tries to grow it via VIDIOC_S_FMT to 147456 B, but
daedalus_fill_output_fmt unconditionally pins sizeimage to
DAEDALUS_MAX_BITSTREAM (= 65484) regardless of userspace's
request.  Firefox loses the slice and falls back to libmozavcodec
SW for the rest of the session.

Bumping the wire-protocol cap to 1 MiB lifts the kernel
OUTPUT_MPLANE sizeimage with it (DAEDALUS_MAX_BITSTREAM is derived
from the same #define).  All allocations (kernel kmalloc /
kmemdup, daemon read buffer, vb2 plane backing) are dynamic and
sized per-payload at runtime, so the only growth is the daemon's
startup read buffer (one ~1 MiB allocation per daemon process)
and the V4L2 OUTPUT_MPLANE per-buffer size.  KMALLOC_MAX_SIZE on
aarch64 SLUB is several MiB; 1 MiB is well within bounds.  Other
V4L2 stateless decoders (cedrus, rkvdec, hantro) report 1-4 MiB
OUTPUT_MPLANE sizeimage — this puts daedalus at the conservative
end of normal.

## Compatibility

#define-only change; struct layout unchanged.  But the
effective cap is the smaller of (kernel cap, daemon cap), so:
- new daemon + stale kernel: still capped at 64 KiB until the
  kernel module rebuilds.
- new kernel + stale daemon: same.
Lock-step install of daedalus-v4l2 + daedalus-v4l2-dkms is
therefore required for the fix to take effect; mirrors the
PR-#7/#8 cadence.
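
The arithmetic behind the cap can be sketched as below.  The 52-byte
header overhead is inferred from the figures quoted in this message
(65536 - 65484) and is an assumption; the real derivation lives in
the protocol header, and the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Inferred from 64 KiB payload cap vs. the 65484-byte bitstream cap. */
#define SKETCH_PROTO_OVERHEAD 52u

/* Largest bitstream that fits in one message of the given payload cap. */
size_t sketch_max_bitstream(size_t proto_max_payload)
{
    assert(proto_max_payload > SKETCH_PROTO_OVERHEAD);
    return proto_max_payload - SKETCH_PROTO_OVERHEAD;
}

/* The effective cap is whichever side was built with the smaller value. */
size_t sketch_effective_cap(size_t kernel_cap, size_t daemon_cap)
{
    return kernel_cap < daemon_cap ? kernel_cap : daemon_cap;
}
```

A new daemon paired with a stale kernel module therefore still
resolves to the old 65484-byte limit, which is below the 147456 B
that libva asks for.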

## NOT changed in this commit

- daedalus_fill_output_fmt still hardcodes sizeimage =
  DAEDALUS_MAX_BITSTREAM regardless of userspace request.
  Acceptable: vb2 will allocate up to that, and libva's resize-
  test now sees the kernel report a sizeimage at-least-as-large
  as what it asked for (147456 < 1048524).  A future cleanup
  could respect userspace's S_FMT.sizeimage clamped to the cap,
  to save memory on tiny streams.
- chardev kmalloc → kvmalloc swap (only matters above
  KMALLOC_MAX_SIZE, not here).

Refs #19.
2026-05-22 20:46:27 +02:00
marfrit 1d8f5af164 Merge pull request 'daemon: filter tiny pause-time bitstreams (closes #17)' (#18) from noether/issue-17-tiny-bitstream-filter into main
Reviewed-on: #18
2026-05-22 16:14:56 +00:00
marfrit 3e4e6e8eae daemon: filter tiny pause-time bitstreams (closes #17)
libva-v4l2-request-fourier flushes a stub packet into the V4L2
OUTPUT_MPLANE queue at playback-pause boundaries.  The payload is
shorter than any parseable H.264 NAL (3-byte start code + 1-byte
NAL header = 4 bytes minimum); avcodec_send_packet returns
AVERROR_INVALIDDATA (-1094995529), which propagated to the kernel
as a decode failure.  Firefox then marked H.264-via-VAAPI as
broken for the session and routed every subsequent frame to
libmozavcodec SW — pause never recovered to HW.

At the REQ_DECODE entry in chardev_client.c::handle_req_decode,
short-circuit any bitstream below the minimum-parseable threshold:
log INFO, skip daedalus_decoder_run_request, and reply RESP_FRAME
with status=DAEDALUS_DECODE_NO_FRAME so libva's V4L2 surface pool
stays healthy and Firefox doesn't see a failure.
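
The short-circuit decision can be sketched as follows, assuming the
threshold equals the 4-byte minimum described above.  The enum and
function names are hypothetical; the real handler also logs and
builds the RESP_FRAME.

```c
#include <assert.h>
#include <stddef.h>

/* 3-byte start code + 1-byte NAL header, per the description above. */
#define SKETCH_H264_MIN_PARSEABLE 4u

enum sketch_action {
    SKETCH_RUN_DECODER,      /* hand the payload to libavcodec */
    SKETCH_REPLY_NO_FRAME    /* skip decode, answer with a no-frame status */
};

/* Decide what to do with a REQ_DECODE payload of the given length. */
enum sketch_action sketch_classify_req(size_t bitstream_len)
{
    if (bitstream_len < SKETCH_H264_MIN_PARSEABLE)
        return SKETCH_REPLY_NO_FRAME;
    return SKETCH_RUN_DECODER;
}
```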

Repro: Pi CM5 trixie, daedalus-v4l2 0.1.0+r41 + ffmpeg-v4l2-
request-fourier 2:8.1+rfourier+gb57fbbe-9, Firefox YouTube avc1.
Play → daemon decodes at ~46 fps.  Pause ≥ 1s.  Resume → daemon
silent; sudo journalctl -u daedalus-v4l2 --since '10s' | grep -c
'decoder: OK' = 0.  Last entry before silence:

    REQ_DECODE cookie=N codec=3 bitstream=3 bytes ...
    [h264 @ ...] no frame!
    [ERR] decoder: avcodec_send_packet failed: -1094995529

After this fix the 3-byte sentinel logs as 'tiny bitstream 3
bytes — dropping as no-op' and the libavcodec context is
untouched; the next real REQ_DECODE proceeds normally.

Scope NOT covered (intentionally deferred):
- A more general "tolerate AVERROR_INVALIDDATA mid-stream" path.
  Worth doing later, but it risks masking unrelated bugs.
- Investigating WHY libva sends the 3-byte sentinel on pause.
  Likely an upstream libva-v4l2-request-fourier issue; tracked
  separately if this filter is not enough.

Wire protocol unchanged.  No DAEDALUS_PROTO_VERSION bump.
2026-05-22 17:26:25 +02:00
marfrit 6e6dfa144d Merge pull request 'daemon: dlopen Kwiboo fork's soname 62 (FFmpeg 8.1 at /opt/fourier)' (#16) from noether/daemon-dlopen-kwiboo-soname62 into main
Reviewed-on: #16
2026-05-21 19:20:22 +00:00
claude-noether 514da29a73 daemon: dlopen Kwiboo fork's libavcodec.so.62 / libavformat.so.62 / libavutil.so.60
Switch the daemon's runtime dlopen targets from Debian-stock soname
61/61/59 (FFmpeg 7.1.3) to the Kwiboo fourier fork's soname
62/62/60 (FFmpeg 8.1) installed at the /opt/fourier prefix.

Why
---
The substitution arc tracked at daedalus-v4l2#11 needs daedalus-
fourier kernel calls woven into libavcodec's H264DSPContext NEON
init (replacing ff_h264_idct_add_neon etc. with thunks calling
daedalus_recipe_dispatch_h264_*).  We do that via patches in the
ffmpeg-v4l2-request-fourier package source — which we own, in
marfrit-packages, alongside the existing libudev-bypass and
nv15-to-p010 patches.  But that package builds the Kwiboo fork at
soname 62 / /opt/fourier.  The daemon currently dlopens soname 61
(Debian-stock + a separately-built +fourier2 patch that isn't in
marfrit-packages' source tree), so substitution patches there
wouldn't reach the daemon.

Switching to soname 62 routes the daemon through the package we
control — first step toward landing daedalus-fourier kernel
substitution into the production decode path.

Compat
------
- /opt/fourier libs are already on every host running the daemon
  (hard build-dep of ffmpeg-v4l2-request-fourier).  Firefox-fourier
  and mpv-fourier already dlopen them via the same path.
- /etc/ld.so.conf.d/fourier.conf entry resolves the new sonames
  from /opt/fourier/lib via the ld cache; dlopen-by-soname works
  without LD_LIBRARY_PATH wrappers.
- Build-side: daemon's pkg_check_modules picks up libav*.pc from
  /opt/fourier/lib/pkgconfig when PKG_CONFIG_PATH includes that
  directory (build-deb.sh follow-up will set it).
- API surface unchanged: avcodec_send_packet / receive_frame /
  AVCodecContext flags / AVFrame fields are all stable between
  FFmpeg 7.1 and 8.1.  Verified clean cross-compile on hertz.

Wire protocol unchanged.  No kmod bump.

Next step (follow-up PRs)
-------------------------
1. ffmpeg-v4l2-request-fourier patch: add 0003-daedalus-fourier-
   substitute-h264-idct4.patch that replaces ff_h264_idct_add_neon
   in libavcodec/aarch64/h264dsp_init_aarch64.c with a thunk
   calling daedalus_recipe_dispatch_h264_idct4.
2. Repeat for IDCT 8×8, deblock luma-v, qpel mc20 (one kernel per
   PR for reviewability; bench delta + decode_us delta documented
   per substitution).
3. marfrit-packages bump to pick up the new daemon + the substituted
   fourier package.
2026-05-21 21:19:24 +02:00
marfrit 3bc0da168c Merge pull request 'daemon: per-frame decode_us + periodic stats (#11 step 1)' (#15) from noether/daemon-decode-stats into main
Reviewed-on: #15
2026-05-21 18:26:50 +00:00
claude-noether 814b74d0bb daemon: per-frame decode_us + periodic stats summary (#11 step 1)
Establishes observable baseline metrics before any daedalus-fourier
kernel substitution lands.  Step 1 of the daemon-rewrite arc tracked
at daedalus-v4l2#11.

Changes
-------
- Per-frame `decoder: OK ...` log line now carries decode_us=N (the
  send_packet + receive_frame wall-clock cost in microseconds —
  exclusively the libavcodec round-trip, not the bitstream pack /
  SPS-PPS synth / pack-to-planes work).
- New "decoder stats" summary line every DAEDALUS_STATS_EVERY (60)
  decoded frames, reporting: codec, frame count, window seconds,
  fps, avg decode_us, MBs/s throughput, bytes/MB bitrate.
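
One way the window totals could be accumulated, as a sketch.  It
assumes macroblocks are counted as 16x16 tiles rounded up per frame
and that averages are taken over the whole window; the struct and
function names are hypothetical, not the daemon's.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical running totals for one stats window. */
struct sketch_stats {
    uint64_t frames;
    uint64_t decode_us_total;
    uint64_t mbs_total;
    uint64_t bytes_total;
};

/* 16x16 macroblocks covering a width x height frame, rounding up. */
uint32_t sketch_mbs_per_frame(uint32_t width, uint32_t height)
{
    assert(width != 0 && height != 0);
    return ((width + 15u) / 16u) * ((height + 15u) / 16u);
}

/* Fold one decoded frame into the window totals. */
void sketch_stats_add(struct sketch_stats *s, uint32_t width,
                      uint32_t height, uint64_t decode_us,
                      uint64_t bitstream_bytes)
{
    assert(s != NULL);
    s->frames++;
    s->decode_us_total += decode_us;
    s->mbs_total += sketch_mbs_per_frame(width, height);
    s->bytes_total += bitstream_bytes;
}

double sketch_avg_decode_us(const struct sketch_stats *s)
{
    return s->frames ? (double)s->decode_us_total / (double)s->frames : 0.0;
}

double sketch_mbs_per_s(const struct sketch_stats *s, double window_s)
{
    return window_s > 0.0 ? (double)s->mbs_total / window_s : 0.0;
}

double sketch_bytes_per_mb(const struct sketch_stats *s)
{
    return s->mbs_total ? (double)s->bytes_total / (double)s->mbs_total : 0.0;
}
```

A 720p frame counts as 3600 macroblocks under this assumption.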

Sample
------
  decoder stats: codec=h264 frames=300 window=12.32s fps=24.35
                 avg_decode_us=4216.4 mbs_per_s=87643 bs_b_per_mb=1.56

What this tells us
------------------
Steady-state on higgs (Pi CM5) decoding bbb_720p_h264.mp4:
~4 ms decode_us, ~90 K MBs/s, well under the daedalus-fourier
NEON kernel ceilings (IDCT 4×4 @ 175 Mblocks/s, deblock @ 92 Medges/s,
qpel mc20 @ 131 Mblocks/s — all 100-1000× over our actual workload).

This means the 4 ms/frame is mostly libavcodec's CABAC + MV prediction +
intra prediction overhead, NOT the pixel-math primitives.
Substituting a single primitive would shave only a small slice of
the 4 ms.  Useful as guidance for the upcoming substitution work —
we'll pick the primitive with the largest cycle cost relative to
the alternative, and measure CPU saved per substitution.

No behaviour change: the counters are static and unsynchronised (the
chardev event loop is single-threaded) and reset when codec_id changes.
Timing uses clock_gettime(CLOCK_MONOTONIC).
2026-05-21 20:17:09 +02:00
marfrit 77e14e5a19 Merge pull request 'daemon: link daedalus-fourier + log substrate availability at startup' (#13) from noether/daemon-link-daedalus-fourier into main
Reviewed-on: #13
2026-05-21 16:35:38 +00:00
claude-noether 88b2ebfaa9 daemon: link daedalus-fourier + log substrate availability at startup
First incremental step toward H.264 daemon-rewrite (daedalus-v4l2#11):
make the daedalus-fourier kernel library available to the daemon
process so subsequent patches can substitute its primitives
(IDCT 4×4, IDCT 8×8, luma vertical deblock, etc.) for libavcodec's
per-MB pixel math.

This patch does NOT yet dispatch any kernels.  It only:

  - Adds `pkg_check_modules(DAEDALUS_FOURIER REQUIRED daedalus-fourier)`
    to the daemon's CMakeLists, with explicit link ordering
    (libdaedalus_core.a must precede -lvulkan because the static
    archive references vulkan symbols and the linker resolves
    left-to-right).  We bypass IMPORTED_TARGET because pkg-config's
    Requires.private chain makes CMake's dependency graph reorder
    the archive after -lvulkan, which breaks the static link.

  - Calls daedalus_ctx_create_no_qpu() at daemon startup, logs the
    substrate-availability line, destroys the context at exit.
    no_qpu mode skips V3D Vulkan probe — proves linkage works
    without depending on shader-path resolution (which is a
    separate piece of work, since v3d_runner currently loads
    .spv files from cwd-relative paths and a consumer would need
    a search path override).

Sample journal line:

  [2026-05-21 17:59:35.271 INFO] daedalus-fourier: linked, ctx alive
  (no_qpu mode; has_qpu=0)

Build-test verified on hertz (Pi 5 dev host) against an installed
copy of daedalus-fourier r35+gd87239d (from marfrit/daedalus-fourier
PR #1).  The binary links cleanly, --help prints, and daemon mode
attempts to open the chardev (this fails predictably on hertz, which
has no daedalus_v4l2 kmod; on higgs this is the existing working path).

Follow-up patches per daedalus-v4l2#11:

  1. Instrument the existing libavcodec decode path to count
     per-frame IDCT blocks / deblock edges / MC tiles so we have
     a baseline of what work the daemon dispatches for a typical
     YouTube H.264 stream.
  2. Substitute daedalus-fourier kernels one at a time, measuring
     CPU saved per substitution.
  3. Wire shader path resolution into daedalus_ctx_create() for
     the QPU substrate (V3D opportunistic helper paths).

Wire protocol unchanged.  DAEDALUS_PROTO_VERSION stays at 0.
2026-05-21 18:00:46 +02:00
marfrit 64b9599e47 Merge pull request 'daemon: AV_CODEC_FLAG_LOW_DELAY for H.264 — implements #11 part (2)' (#12) from noether/daemon-low-delay-h264 into main
Reviewed-on: #12
2026-05-21 15:17:57 +00:00
claude-noether 234a103084 daemon: AV_CODEC_FLAG_LOW_DELAY for H.264 — fix display-reorder breaking V4L2 1:1
Force libavcodec's H.264 decoder to emit frames in DECODE order
(one frame per send_packet, no internal display-order reorder
queue).  Single-line addition: ctx->flags |= AV_CODEC_FLAG_LOW_DELAY
before avcodec_open2, gated on codec_id == DAEDALUS_CODEC_H264.

Closes daedalus-v4l2#11 part (2).

Background
----------
PR #7's "parking design" approach to the H.264 display-reorder
problem broke libva-v4l2-request-fourier's 1:1 CAPTURE-completion
contract (see #9 + #10).  After the revert, the visible "2 1 4 3"
pair-swap regressed and the only path forward was to align the
daemon's output ordering with what V4L2 stateless clients expect:
**decode order, one CAPTURE buffer per OUTPUT slice, with display
reorder pushed upstream to ffmpeg-vaapi's per-VAAPI-surface POC
logic** (which it already does correctly for every real H.264
hardware decoder via VAPictureParameterBufferH264).

How LOW_DELAY does this
-----------------------
Inside libavcodec/h264dec.c, the flag sets h->low_delay = 1.
h264_select_output_frame (h264_slice.c) emits the just-decoded
picture immediately instead of routing through the display-order
DPB output queue.  DPB management for reference frames
(short_ref / long_ref) is unaffected — B-frame decoding
correctness is preserved; only the output buffering is bypassed.

Skipped for VP9 / AV1: those codecs don't reorder internally,
so the flag would be a no-op that adds no value.

Verified
--------
On higgs (Pi CM5, 6.18.29+rpt-rpi-2712), test daemon hot-swapped
into /usr/bin/daedalus_v4l2_daemon, mpv --hwdec=vaapi-copy
--frames=300 against bbb_720p_h264.mp4: 311 REQ_DECODEs received,
308 successful "decoder: OK" responses (99.04% steady-state
delivery — 3 lost at GOP boundaries, no compounding drift).
mpv plays to its --frames cap and exits cleanly with "End of
file".  No "Unable to dequeue buffer", no "Failed to end picture
decode", no "AVHWFramesContext: Failed to sync surface" — all
the failures from #9 are gone.

Builds clean against ffmpeg-v4l2-request-fourier libavcodec.
2026-05-21 17:14:33 +02:00
marfrit 5d8b4369e5 Merge pull request 'kernel + daemon: revert PRs #7 + #8 (parking design incompatible with V4L2 stateless 1:1 expectation)' (#10) from noether/revert-parking-pr7-pr8 into main
Reviewed-on: #10
2026-05-21 13:39:09 +00:00
marfrit 714d781d22 Revert "Merge pull request 'kernel + daemon: H.264 B-frame display reorder fix (closes #6)' (#7) from noether/kernel-daemon-h264-reorder-fix into main"
This reverts commit 79256dc7ef, reversing
changes made to 7ff2d897ea.
2026-05-21 14:40:59 +02:00
marfrit 49e60c9bba Revert "Merge pull request 'kernel: claim src/dst at device_run, not at buf_done (fixes panic from #7)' (#8) from noether/kernel-claim-bufs-at-device-run into main"
This reverts commit 6ffe92bcac, reversing
changes made to 79256dc7ef.
2026-05-21 14:40:52 +02:00
marfrit 6ffe92bcac Merge pull request 'kernel: claim src/dst at device_run, not at buf_done (fixes panic from #7)' (#8) from noether/kernel-claim-bufs-at-device-run into main
Reviewed-on: #8
2026-05-21 11:54:52 +00:00
claude-noether f10a26d883 kernel: claim src/dst at device_run, not at buf_done
Hard reboot observed on higgs (Pi CM5) during the first mpv vaapi-copy
playback against the freshly-deployed r28+g79256dc stack — kernel
panic, no persistent journal, no recoverable trace.  Bug introduced
by the daedalus-v4l2#6 reorder fix (#7).

Cause
-----
The new completion path runs `v4l2_m2m_job_finish` on SRC_CONSUMED
even when the dst_buf is still parked (waiting for a future
HAS_PIXELS).  job_finish moves the m2m_ctx back to IDLE, the
scheduler dispatches the next device_run — which calls
`v4l2_m2m_next_dst_buf`, which returns the head of the CAPTURE
ready-queue, which is STILL the parked dst_buf because we never
removed it.  Two inflight entries now reference the same vb2_buffer;
the later HAS_PIXELS triggers `v4l2_m2m_dst_buf_remove_by_buf` on a
vb2_buffer whose list_head is no longer linked to that queue, and
`list_del()` smashes the next/prev pointers of whatever ELSE was at
those addresses.

Fix
---
Take both src and dst off `m2m_ctx`'s rdy_queue at device_run — as
soon as `v4l2_m2m_next_*_buf` has peeked them and all early-exit
validation has passed.  After that, the daemon owns both halves
exclusively via the inflight item; the m2m scheduler can't re-issue
them on the next device_run.  Completion path drops the redundant
`_remove_by_buf` calls — list is already detached, so `buf_done`
alone is correct.

Matches the amphion `vdec.c`/`venc.c` pattern (which also claims at
device_run for the same reason: amphion's encode pipeline parks
output buffers across multiple frames waiting for the codec to
finish, structurally the same as our H.264 B-frame DPB parking).

`fail_buf_error` learns about the new `claimed` flag and skips the
`v4l2_m2m_*_buf_remove` calls when the buffers have already been
removed by-buf at device_run.

Verified
--------
Builds clean against 6.18.29+rpt-rpi-2712.  Field test pending —
deploy via marfrit-packages bump in lock-step with the daemon
(which doesn't need to change for this fix; PROTO_VERSION stays at 1).
2026-05-21 13:49:44 +02:00
marfrit 79256dc7ef Merge pull request 'kernel + daemon: H.264 B-frame display reorder fix (closes #6)' (#7) from noether/kernel-daemon-h264-reorder-fix into main
Reviewed-on: #7
2026-05-21 10:36:53 +00:00
claude-noether 15fc2aba14 kernel + daemon: H.264 B-frame display reorder fix (issue #6)
H.264 streams with B-frames showed visibly pair-swapped output in
mpv / Firefox playback through the libva → daedalus_v4l2 path —
"frames went 2 1 4 3 6 5 instead of 1 2 3 4 5 6".  Reproduced in mpv
with --hwdec=vaapi-copy at 720p (bypassing Firefox's compositor),
confirming the bug was in this daemon pipeline, not downstream.

Root cause
----------
libavcodec's H.264 decoder internally reorders output to DISPLAY
order before returning from avcodec_receive_frame.  The daemon
previously called send_packet → receive_frame ONCE per REQ_DECODE
and shipped the resulting pixels in a RESP_FRAME tagged with the
SAME cookie.  For B-frames this is wrong: the frame returned from
receive_frame may belong to an EARLIER bitstream (libavcodec held
it for display-order release).  Cookie N's CAPTURE buffer therefore
got cookie N-2's pixels, while cookie N-2's CAPTURE buffer got
silently marked VB2_BUF_STATE_ERROR (the daemon returned
DAEDALUS_DECODE_NO_FRAME for the cookie whose pixels were held).

Fix shape
---------
Decouple kernel cookie identity (decode-order routing) from
libavcodec's display-ordered output.  Wire-protocol changes:

  REQ_DECODE  + __u64 src_pts        (= src_buf->vb2_buf.timestamp)
  RESP_FRAME  + __u32 flags          (HAS_PIXELS | SRC_CONSUMED)
              + __u64 output_src_pts (= frame->pts on drain)

PROTO_VERSION bumped 0 → 1.  Lock-step rebuild required.

Kernel
------
device_run now mirrors src_buf->vb2_buf.timestamp into req->src_pts
before sending REQ_DECODE, and stores it on the inflight item so
the completion path can stamp dst_buf.timestamp explicitly when
src/dst lifecycles decouple (V4L2_BUF_FLAG_TIMESTAMP_COPY's auto-
pairing no longer applies).

daedalus_complete_resp_frame splits into:

  HAS_PIXELS:    pack pixels into THIS cookie's CAPTURE buffer,
                 stamp dst timestamp from inflight->src_pts,
                 v4l2_m2m_buf_done(dst, DONE/ERROR).
                 No job_finish here.

  SRC_CONSUMED:  release the bound media_request, run
                 v4l2_m2m_buf_done(src) + v4l2_m2m_job_finish so
                 the scheduler can dispatch the next REQ.  dst_buf
                 may still be parked at this point.

Inflight entry is removed and freed only when BOTH src_buf and
dst_buf have been cleared.  Combined HAS_PIXELS|SRC_CONSUMED RESPs
(steady-state VP9/AV1 with no reorder lag) collapse to the prior
1:1 behaviour for free.
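
The split completion described above can be modelled in userspace as a small state machine. This is only a sketch: the flag bit values, the struct, and the function name are hypothetical stand-ins, and the real kernel code calls v4l2_m2m_buf_done / v4l2_m2m_job_finish where the comments say so.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_HAS_PIXELS   (1u << 0)   /* hypothetical bit values */
#define SKETCH_SRC_CONSUMED (1u << 1)

/* Userspace stand-in for the kernel's inflight item (hypothetical layout). */
struct sketch_inflight {
    bool src_pending;   /* OUTPUT buffer not yet returned */
    bool dst_pending;   /* CAPTURE buffer still parked */
    int  job_finishes;  /* counts simulated job_finish calls */
};

/* Apply one RESP_FRAME; returns true when the entry may be freed. */
static bool sketch_complete(struct sketch_inflight *inf, uint32_t flags)
{
    assert(inf != NULL);
    if (flags & SKETCH_HAS_PIXELS)
        inf->dst_pending = false;       /* buf_done(dst); no job_finish here */
    if ((flags & SKETCH_SRC_CONSUMED) && inf->src_pending) {
        inf->src_pending = false;       /* buf_done(src) */
        inf->job_finishes++;            /* job_finish: next REQ may dispatch */
    }
    /* Free only when BOTH buffers have been cleared. */
    return !inf->src_pending && !inf->dst_pending;
}
```

A held B-frame takes two calls (SRC_CONSUMED first, HAS_PIXELS later); a combined response frees the entry in one call, which is the 1:1 case.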

Daemon
------
daedalus_decoder_run_request split into three primitives:

  daedalus_decoder_submit       — set pkt->pts = req->src_pts,
                                  avcodec_send_packet.
  daedalus_decoder_drain_one    — avcodec_receive_frame, populate
                                  resp meta + output_src_pts (= the
                                  frame's pts, carried back from
                                  the bitstream that produced it).
  daedalus_decoder_pack_current — pack current AVFrame into the
                                  caller-mapped CAPTURE planes.

chardev_client maintains a small (src_pts → cookie, cached_req)
table searched linearly (≤64 entries; bounded by V4L2 client buffer
pool depth).  On each REQ_DECODE:

  1. Register (src_pts → cookie) in the table.
  2. submit().
  3. Drain loop: for each frame returned, look up its owner cookie
     via pending_lookup(frame->pts), GET_DMABUF for THAT cookie,
     pack pixels, emit RESP_FRAME(owner_cookie, HAS_PIXELS,
     output_src_pts=frame->pts).  Combine with SRC_CONSUMED when
     owner_cookie equals the current REQ's cookie.
  4. If the current REQ's cookie wasn't drained inside the loop
     (libavcodec held the frame), emit a standalone SRC_CONSUMED
     RESP so the kernel runs job_finish + dispatches the next REQ;
     dst_buf for this cookie stays parked until a future drain
     produces its pixels.
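
The pts-to-cookie routing in steps 1 and 3 can be sketched as a fixed-size table with linear search. All names here are hypothetical (the commit only names pending_lookup), the cached_req payload is left out, and releasing the slot on lookup is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PENDING_MAX 64   /* bound taken from the commit text */

struct sketch_pending {
    bool     used;
    uint64_t src_pts;
    uint64_t cookie;
};

struct sketch_table {
    struct sketch_pending slot[SKETCH_PENDING_MAX];
};

/* Step 1: remember which cookie submitted this pts. False if the table is full. */
static bool sketch_pending_register(struct sketch_table *t,
                                    uint64_t src_pts, uint64_t cookie)
{
    assert(t != NULL);
    for (size_t i = 0; i < SKETCH_PENDING_MAX; i++) {
        if (!t->slot[i].used) {
            t->slot[i] = (struct sketch_pending){ true, src_pts, cookie };
            return true;
        }
    }
    return false;
}

/* Step 3: find the owner of a drained frame's pts and release its slot. */
static bool sketch_pending_take(struct sketch_table *t,
                                uint64_t pts, uint64_t *cookie_out)
{
    assert(t != NULL && cookie_out != NULL);
    for (size_t i = 0; i < SKETCH_PENDING_MAX; i++) {
        if (t->slot[i].used && t->slot[i].src_pts == pts) {
            *cookie_out = t->slot[i].cookie;
            t->slot[i].used = false;
            return true;
        }
    }
    return false;
}
```

With at most 64 entries a linear scan is cheap enough that a hash map would add complexity for no measurable gain.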

VP9 / AV1 paths are unchanged in behaviour: one frame per REQ,
HAS_PIXELS|SRC_CONSUMED in one combined RESP.

Verified
--------
Builds clean cross-compiled on higgs against 6.18.29+rpt-rpi-2712
(Pi CM5).  Frame-size warning in device_run is pre-existing
(unchanged by this commit).
2026-05-21 12:32:47 +02:00
marfrit 7ff2d897ea Merge pull request 'kernel: register H.264 DECODE_MODE + START_CODE menu controls' (#4) from noether/kernel-h264-menu-ctrls into main
Reviewed-on: #4
2026-05-21 09:02:43 +00:00
claude-noether 69a62a922f kernel: register H.264 DECODE_MODE + START_CODE menu controls
libva-v4l2-request sets V4L2_CID_STATELESS_H264_DECODE_MODE and
V4L2_CID_STATELESS_H264_START_CODE on the device fd at context init
(see libva-v4l2-request-fourier src/context.c:577; it is a best-effort
call whose result is cast to (void)).  Our ctrl_handler did not
advertise either control, so v4l2-core returned EINVAL on validate;
userspace logged the noisy

    v4l2-request: Unable to set control(s): Invalid argument
                  (error_idx=2/2 ioctl-level)

at every Firefox/ffmpeg context creation, despite decode itself
succeeding (the daemon already operates as FRAME_BASED + ANNEX_B and
the per-request SPS/PPS/SCALING_MATRIX/DECODE_PARAMS batch lands
fine).

Register the two as v4l2_ctrl_new_std_menu with the only value each
the daemon actually supports — FRAME_BASED for DECODE_MODE,
ANNEX_B for START_CODE — and mask out the unsupported alternates
(SLICE_BASED, NONE).  Pattern matches rkvdec / hantro.  Update the
handler-init capacity hint to ARRAY_SIZE(daedalus_stateless_ctrls)
+ 2 to cover the additions.
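
A rough illustration of the "mask out the unsupported alternates" step. v4l2_ctrl_new_std_menu() takes a max index, a skip mask (bit N set means menu item N is not offered), and a default. The enums below are local mirrors of the uapi menu indices and the helper name is hypothetical; this runs in userspace and only computes the arguments.

```c
#include <assert.h>
#include <stdint.h>

/* Local mirrors of the menu indices in linux/v4l2-controls.h. */
enum sketch_h264_decode_mode {
    SKETCH_DECODE_MODE_SLICE_BASED = 0,
    SKETCH_DECODE_MODE_FRAME_BASED = 1,
};
enum sketch_h264_start_code {
    SKETCH_START_CODE_NONE    = 0,
    SKETCH_START_CODE_ANNEX_B = 1,
};

/* Skip mask that hides every menu item in [0, max_index] except `supported`. */
static uint64_t sketch_skip_mask_except(unsigned int max_index,
                                        unsigned int supported)
{
    assert(supported <= max_index && max_index < 64);
    uint64_t all = (max_index == 63)
        ? UINT64_MAX
        : ((UINT64_C(1) << (max_index + 1)) - 1);
    return all & ~(UINT64_C(1) << supported);
}
```

For DECODE_MODE the kernel call would then receive max = FRAME_BASED, mask = sketch_skip_mask_except(FRAME_BASED, FRAME_BASED), def = FRAME_BASED, and likewise for START_CODE with ANNEX_B.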

Verified: builds clean on 6.18.29+rpt-rpi-2712 (Pi CM5) DKMS source
tree.
2026-05-21 11:01:41 +02:00
marfrit f0d41867f6 Merge pull request 'kernel: per-ctx vb2 lock — Firefox multi-process VAAPI unblock' (#3) from noether/kernel-per-ctx-vb-mutex into main
Reviewed-on: #3
2026-05-20 19:25:02 +00:00
marfrit a3ada8ba38 kernel: per-ctx vb2 lock so concurrent clients don't serialise on dev mutex
daedalus_queue_init was wiring both src_vq->lock and dst_vq->lock to
ctx->dev->m2m_lock — a device-wide mutex.  That serialises every
vb2 ioctl (S_FMT, REQBUFS, QBUF, DQBUF, STREAMON, ...) across ALL
concurrent clients of /dev/video0.  For a single-client consumer
like the test_m2m_* tools it doesn't matter; for Firefox, which
spawns separate content + RDD + GPU processes that each open
/dev/video0 and run libva probe simultaneously, the contention
showed up as EBUSY from one libva session's S_FMT(OUTPUT_MPLANE)
when another session was mid-streamon on the same device.

Observable on higgs (Pi CM5):

    $ MOZ_VA_API_ENABLED=1 LIBVA_DRIVER_NAME=v4l2_request firefox
    ...
    v4l2-request: phase 8.10: opened daedalus_v4l2 at video_fd=32 ...
    v4l2-request: cap_pool_init: 24 slots ready
    v4l2-request: Unable to set format for type 10: Device or
                  resource busy

After this fix, each open() gets its own ctx->vb_mutex and the
per-context vb2_queue locks are independent — Firefox's multi-
process VAAPI clients no longer fight each other.  YouTube
playback on higgs runs through daedalus at ~230 fps sustained
(640x368, libavcodec dlopen path), 7× headroom over the 30fps
target.

cedrus / rkvdec / hantro all use the per-ctx vb mutex pattern
for the same reason.  This mirrors them.

Lifecycle:
  - mutex_init in daedalus_open (right after the kzalloc that
    creates ctx, before v4l2_fh_init).
  - mutex_destroy in daedalus_release (after v4l2_fh_exit, before
    kfree), and in the err_ctrl unwind path in daedalus_open.
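
The lifecycle above, modelled in userspace with pthread mutexes as a stand-in for the kernel's struct mutex. Everything here is a hypothetical analogue (sketch_open / sketch_release are not driver functions); it only shows that a lock owned by one context does not block another context.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdlib.h>

struct sketch_ctx {
    pthread_mutex_t vb_mutex;   /* stands in for ctx->vb_mutex */
};

/* Analogue of open(): allocate the context, then initialise its own lock. */
static struct sketch_ctx *sketch_open(void)
{
    struct sketch_ctx *ctx = calloc(1, sizeof(*ctx));
    if (!ctx)
        return NULL;
    if (pthread_mutex_init(&ctx->vb_mutex, NULL) != 0) {
        free(ctx);              /* unwind path: nothing to destroy yet */
        return NULL;
    }
    return ctx;
}

/* Analogue of release(): destroy the lock before freeing the context. */
static void sketch_release(struct sketch_ctx *ctx)
{
    assert(ctx != NULL);
    pthread_mutex_destroy(&ctx->vb_mutex);
    free(ctx);
}

/* Returns true if an "ioctl" on this context could take its lock right now. */
static bool sketch_try_ioctl(struct sketch_ctx *ctx)
{
    assert(ctx != NULL);
    if (pthread_mutex_trylock(&ctx->vb_mutex) != 0)
        return false;
    pthread_mutex_unlock(&ctx->vb_mutex);
    return true;
}
```

With a single device-wide lock, the second context in this model would fail the trylock whenever the first one held it.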

Verified end-to-end on higgs:
  - rmmod + modprobe the rebuilt .ko.
  - Restart daedalus-v4l2.service.
  - Firefox YouTube playback engages VAAPI, daemon journal shows
    cookie=1..N codec=3 (H.264) REQ_DECODE / decoder:OK pairs
    with unique per-frame fnv1a hashes.
  - No EBUSY in either firefox stderr or daemon journal during
    the entire session.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-20 21:23:44 +02:00
marfrit 462aa4b480 Merge pull request 'kernel: bind request controls to p_cur via v4l2_ctrl_request_setup' (#2) from noether/kernel-ctrl-request-setup into main
Reviewed-on: #2
2026-05-20 18:37:12 +00:00
marfrit 29f16ece13 kernel: bind request controls to p_cur before reading them
device_run was reading ctrl->p_cur.p_h264_* directly, but v4l2-m2m's
request scheduler does NOT auto-bind the in-flight media_request's
control values to the ctrl handler's p_cur slots — drivers have to
call v4l2_ctrl_request_setup() explicitly.  cedrus / rkvdec / hantro
all do this in their device_run; daedalus didn't.

Result: daedalus_collect_h264_meta() read stale or default values
(whatever the prior request had left in p_cur, or v4l2_ctrl_new_custom
initial state if no prior request had completed) instead of the
S_EXT_CTRLS V4L2_CTRL_WHICH_REQUEST_VAL values libva-v4l2-request-
fourier had just sent for THIS frame.

The mismatch was a smoking gun on higgs after libva PR #9 / packages
PR #52 landed an instrumentation log at h264_set_controls entry:

  libva boundary (sent to kernel):
    VAProfile=13 seq_fields=0x00032051 pic_fields=0x00000500 num_ref_frames=1
  daedalus daemon (read from kernel p_cur):
    prof=100 level=41 ref_frames=0 flags=0x10 pps_flags=0x0

After calling v4l2_ctrl_request_setup() at the top of device_run:

  daedalus daemon (read from kernel p_cur):
    prof=66 level=11 ref_frames=1 poc_type=2 flags=0x50 pps_flags=0x88

— matches what libva sent, matches the bitstream's actual SPS.

End-to-end test on higgs with libva-v4l2-request-fourier 1.0.0+r382
+gc1bb444 (after-fix-3-and-fix-4-instrumentation) + this kernel
patch:

  $ LIBVA_DRIVER_NAME=v4l2_request ffmpeg -hwaccel vaapi \
      -hwaccel_device /dev/dri/renderD128 -i h264_test.mp4 \
      -frames:v 1 -f null - ...
  rc=0
  daemon journal: zero "error while decoding MB" lines, zero
  "reference frames exceeds max" lines.  Per-frame fnv1a hashes
  differ (0xf1c515aa, 0x16e915e8, 0x16bd16cc, ...) instead of
  the constant 0x6a6a05c5 "give-up-and-zero" hash from before —
  libavcodec is actually decoding real pixel content from each
  P-frame.

Pair note: the kernel completion path already calls
v4l2_ctrl_request_complete in daedalus_complete_resp_frame (line 834);
this commit pairs the setup half with that completion half.

The daemon side change (decoder.c) is a small log-level promotion:
the per-frame "h264 SPS/PPS prepended ..." trace went from log_debug
to log_info so the journal shows what's being shipped into libavcodec
without needing a daemon rebuild with --debug.  Matches the libva-
side h264_set_controls instrumentation that landed in libva PR #9.

Closes part of issue libva-v4l2-request-fourier#8 — the SPS/PPS
field-value gap.  Profile/level still come from libva's session-
derived hardcoded values (h264_profile_to_idc + h264_derive_level_
idc) which is sufficient for libavcodec to accept the synthesised
NAL unit; a true stream-parsed profile/level would need SPS-NAL
parsing in libva — separate operator-design call.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-20 20:35:06 +02:00
marfrit 3dd0eb070a Merge pull request 'DAEMON-PPS: synthesise H.264 SPS/PPS NAL units from V4L2 controls' (#1) from noether/daemon-pps-h264-nal-synth into main
Reviewed-on: #1
2026-05-20 16:51:26 +00:00
marfrit 8c1d9960c4 DAEMON-PPS: synthesise H.264 SPS/PPS NAL units from V4L2 controls
libva-v4l2-request-fourier (and any V4L2-stateless-API consumer)
passes H.264 SPS/PPS as separate V4L2_CID_STATELESS_H264_{SPS,PPS}
controls; only the slice NAL goes into the OUTPUT buffer.  This is
correct per the V4L2 stateless contract.  But libavcodec — which
the daedalus daemon uses for actual decode (Option γ) — wants a
self-contained AnnexB stream including SPS+PPS before any slice.
Result on higgs: "non-existing PPS 0 referenced" + decode_slice_
header errors on every H.264 frame, even after LIBVA-1 and -2
routing correctly delivered the request to the daemon.

Fix splits across kernel + daemon, keeping the kernel module as a
thin transport and putting the actual NAL encoding in userspace:

  include/daedalus_v4l2_proto.h:
    Add struct daedalus_h264_meta (the four v4l2_ctrl_h264_*
    structs the kernel collects) and DAEDALUS_REQ_FLAG_H264_META
    (set in req.flags when the meta block is present between the
    daedalus_req_decode prefix and the slice bitstream).

  kernel/daedalus_v4l2_main.c:
    Add daedalus_collect_h264_meta() — reads the H.264 ctrl values
    from the bound media_request via v4l2_ctrl_find +
    ctrl->p_cur.p_h264_*.  device_run() calls it on H.264 codec_id,
    copies the structs into the REQ_DECODE payload between the
    prefix and bitstream, and sets the flag.  Payload size is
    bounds-checked against DAEDALUS_PROTO_MAX_PAYLOAD so an over-
    sized slice + meta fails loud instead of truncating.

  daemon/src/bitstream_writer.{c,h}:
    New module — MSB-first bit packer with H.264 Exp-Golomb ue(v)
    and se(v) coding + rbsp_trailing_bits alignment.  Sticky
    overflow flag so callers can verify the output buffer wasn't
    truncated.
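
  For readers unfamiliar with Exp-Golomb coding, a minimal sketch of such a bit packer follows. It is not the daemon's bitstream_writer; every name is hypothetical. ue(v) writes floor(log2(v+1)) zero bits and then v+1 in binary; se(v) maps k > 0 to 2k-1 and k <= 0 to -2k before reusing ue(v).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct sketch_bw {
    uint8_t *buf;
    size_t   cap;       /* capacity in bytes */
    size_t   bitpos;    /* next bit, counted from the MSB of buf[0] */
    bool     overflow;  /* sticky: once set it stays set */
};

static void sketch_bw_init(struct sketch_bw *w, uint8_t *buf, size_t cap)
{
    assert(w != NULL && buf != NULL);
    memset(buf, 0, cap);
    w->buf = buf; w->cap = cap; w->bitpos = 0; w->overflow = false;
}

/* Append the low n bits of v, most significant bit first (n <= 32). */
static void sketch_bw_put(struct sketch_bw *w, uint32_t v, unsigned int n)
{
    assert(w != NULL && n <= 32);
    for (unsigned int i = n; i-- > 0; ) {
        if (w->bitpos >= w->cap * 8) { w->overflow = true; return; }
        if ((v >> i) & 1u)
            w->buf[w->bitpos / 8] |= (uint8_t)(0x80u >> (w->bitpos % 8));
        w->bitpos++;
    }
}

/* ue(v): len zero bits, then (v + 1) written in len + 1 bits. */
static void sketch_bw_ue(struct sketch_bw *w, uint32_t v)
{
    assert(v < UINT32_MAX);
    uint32_t code = v + 1;
    unsigned int len = 0;
    for (uint32_t t = code; t > 1; t >>= 1)
        len++;
    sketch_bw_put(w, 0, len);
    sketch_bw_put(w, code, len + 1);
}

/* se(v): signed-to-unsigned mapping, then ue(v). */
static void sketch_bw_se(struct sketch_bw *w, int32_t v)
{
    assert(v > INT32_MIN);
    uint32_t mapped = (v > 0) ? (uint32_t)(2 * (int64_t)v - 1)
                              : (uint32_t)(-2 * (int64_t)v);
    sketch_bw_ue(w, mapped);
}

/* rbsp_trailing_bits: one stop bit, then zero bits up to byte alignment. */
static void sketch_bw_trailing(struct sketch_bw *w)
{
    sketch_bw_put(w, 1, 1);
    while (w->bitpos % 8 != 0 && !w->overflow)
        sketch_bw_put(w, 0, 1);
}
```

  Callers check the overflow flag once at the end instead of after every write, which keeps the syntax-element code linear.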

  daemon/src/h264_nal_synth.{c,h}:
    New module — turns v4l2_ctrl_h264_sps / v4l2_ctrl_h264_pps
    into AnnexB-framed NAL units per ITU-T H.264 7.3.2.1 / 7.3.2.2.
    Emits emulation prevention bytes (0x03 after every 00 00 in the
    EBSP) and the 4-byte start code (0x00000001).  Coverage matches
    what V4L2 stateless surface gives us: VUI parameters and full
    scaling matrices are NOT emitted (V4L2 doesn't carry them — the
    seq_scaling_matrix_present_flag is set to 0 and libavcodec uses
    flat defaults, which matches the de-facto behaviour of most
    H.264 streams libva-v4l2-request drives).
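
  A sketch of Annex B framing with emulation prevention, under stated assumptions. The function name is hypothetical. The sketch follows the H.264 rule of inserting 0x03 only when two zero bytes are followed by a byte <= 0x03; whether h264_nal_synth applies that condition or escapes every 00 00 pair is not stated in this commit. It also assumes the payload ends in a non-zero byte, which rbsp_trailing_bits guarantees.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Write a 4-byte start code followed by the NAL bytes (header + RBSP),
 * inserting emulation prevention bytes. Returns the number of bytes
 * written, or 0 if `out` is too small.
 */
static size_t sketch_annexb_wrap(const uint8_t *nal, size_t n,
                                 uint8_t *out, size_t cap)
{
    assert(nal != NULL && out != NULL);
    size_t o = 0;
    unsigned int zeros = 0;     /* consecutive 0x00 bytes just emitted */

    if (cap < 4)
        return 0;
    out[o++] = 0x00; out[o++] = 0x00; out[o++] = 0x00; out[o++] = 0x01;

    for (size_t i = 0; i < n; i++) {
        if (zeros >= 2 && nal[i] <= 0x03) {
            if (o >= cap)
                return 0;
            out[o++] = 0x03;    /* break up 00 00 0x so no start code appears */
            zeros = 0;
        }
        if (o >= cap)
            return 0;
        out[o++] = nal[i];
        zeros = (nal[i] == 0x00) ? zeros + 1 : 0;
    }
    return o;
}
```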

  daemon/src/decoder.c:
    daedalus_decoder_run_request() now takes an optional
    h264_meta parameter.  For codec_id == H264 with meta != NULL,
    synthesises SPS+PPS NAL units, allocates a combined
    [SPS][PPS][slice] buffer (+ AV_INPUT_BUFFER_PADDING_SIZE), and
    feeds that to avcodec_send_packet instead of the raw slice.
    VP9/AV1 path unchanged (frames are self-contained).  Cleanup
    now goes through a unified `out:` label so the assembled
    buffer is always freed on every exit (including the existing
    decoder_open_codec / no-frame / receive_frame failure paths).

  daemon/src/chardev_client.c:
    handle_req_decode() peels off the optional meta block when the
    flag is set, passes it through to the decoder, and updates
    the payload-length consistency check (now allows for an extra
    sizeof(daedalus_h264_meta) when the flag is on).

Build (boltzmann aarch64): clean compile of all daemon sources,
including bitstream_writer + h264_nal_synth + the refactored
decoder.c.  Kernel module compile to be verified via DKMS rebuild
on higgs in the marfrit-packages bump that follows.

Test plan: with this commit + a marfrit-packages daedalus pin
bump, higgs's ffmpeg -hwaccel vaapi -i h264_test.mp4 should
produce a successful decode (vs. the previous "non-existing PPS 0
referenced" failure).  The daemon log should show:
  decoder: opened h264 context
  decoder: h264 prepended SPS=NB PPS=MB slice=KB
  decoder: OK 320x240 fmt=0 (yuv420p) fnv1a=0x...

VP9 / AV1 behaviour unchanged — they don't carry meta and the
existing per-frame self-describing path still applies.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-20 17:35:24 +02:00
marfrit 481279c9bf packaging/systemd: ship daedalus-v4l2.service + modules-load drop-in
Canonical location for the systemd unit + module-autoload conf,
referenced by both arch/daedalus-v4l2 and debian/daedalus-v4l2
in marfrit-packages.  Was a real gap in the original packaging:
postinst installed the daemon binary but nothing started it, so
the libva path got REQ_DECODE messages with nobody listening on
/dev/daedalus-v4l2 and timed out.

packaging/systemd/daedalus-v4l2.service:
  - Type=simple, ExecStart=/usr/bin/daedalus_v4l2_daemon daemon
  - After=systemd-modules-load.service + ConditionPathExists=
    /dev/daedalus-v4l2 (so it only starts when the kernel module
    is loaded; doesn't false-fire on non-daedalus hosts that
    happen to have the package installed)
  - Restart=on-failure, RestartSec=2
  - MemoryHigh=128M / MemoryMax=256M (Phase 8.9 stress run
    showed RSS settling around 25 MiB; leaves headroom)
  - Hardening: NoNewPrivileges, ProtectSystem=strict, ProtectHome,
    PrivateTmp, ProtectKernel*, SystemCallFilter=@system-service.
    PrivateDevices=false because we DO need /dev/daedalus-v4l2

packaging/systemd/daedalus-v4l2.modules-load:
  - Drops to /etc/modules-load.d/daedalus-v4l2.conf so the kernel
    module loads before the .service unit fires.

Both files are picked up by the package recipes (next bump in
marfrit-packages) — neither lives in /usr/lib/systemd/system or
/etc/modules-load.d until the .deb / .pkg installs them.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 10:26:58 +02:00
marfrit f0cd29a340 kernel: v4l2_fh_add/del gained file* arg in 6.18 — version-conditional
DKMS build failure on higgs (Pi CM5, kernel 6.18.29+rpt-rpi-2712):

  daedalus_v4l2_main.c:1049: error: too few arguments to function 'v4l2_fh_add'
  v4l2-fh.h:97: void v4l2_fh_add(struct v4l2_fh *fh, struct file *filp);
  daedalus_v4l2_main.c:1063: error: too few arguments to function 'v4l2_fh_del'

Signature changed exactly at v6.18 (verified v6.13–v6.17 still use the
one-arg form via raw.githubusercontent.com tag walk). Wrap the calls
with LINUX_VERSION_CODE >= KERNEL_VERSION(6, 18, 0) so the module
keeps building against:

  * 6.12 LTS / RPi 6.12.75 (one-arg)        — hertz
  * 6.12.88+deb13-arm64 (one-arg)
  * 6.18.29+rpt-rpi-2712 (file* arg)        — higgs running kernel

Build verified on both: hertz 6.12.75 clean, higgs 6.18.29 clean +
modprobe daedalus_v4l2 succeeds, /dev/daedalus-v4l2 + /dev/video0
appear.

Add #include <linux/version.h> for KERNEL_VERSION + LINUX_VERSION_CODE
(also pulled transitively via module.h but explicit is better than
implicit).
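
The version gate can be checked in userspace with a mirror of the kernel's KERNEL_VERSION macro. The helper name is hypothetical; in the module itself the comparison is a preprocessor #if around the v4l2_fh_add / v4l2_fh_del calls, not a runtime function.

```c
#include <assert.h>
#include <stdbool.h>

/* Mirror of KERNEL_VERSION from <linux/version.h>; sublevel clamps at 255. */
#define SKETCH_KERNEL_VERSION(a, b, c) \
    (((a) << 16) + ((b) << 8) + ((c) > 255 ? 255 : (c)))

/* True when v4l2_fh_add/v4l2_fh_del take the extra struct file * argument. */
static bool sketch_fh_takes_file(int major, int minor, int sublevel)
{
    assert(major >= 0 && minor >= 0 && sublevel >= 0);
    return SKETCH_KERNEL_VERSION(major, minor, sublevel) >=
           SKETCH_KERNEL_VERSION(6, 18, 0);
}
```

The three kernels listed above land on the expected side of the gate: 6.12.75 and 6.12.88 use the one-arg form, 6.18.29 the two-arg form.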

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 10:15:24 +02:00
marfrit f55b2cd002 kernel: media_request_get/put around inf->req (UAF safety)
Sonnet pre-deployment review flagged a SHIP-WITH-EYES-OPEN risk:
Phase 8.13's inf->req captured src_buf->vb2_buf.req_obj.req as a
raw pointer with no media_request_get(). On the normal decode
path that's fine because vb2-core holds its own reference until
v4l2_m2m_buf_done_and_job_finish releases it.

But on a concurrent cancel (MEDIA_IOC_REQUEST_REINIT or a process
kill triggering buf_request_complete from the cancel path before
RESP_FRAME comes back), vb2 could drop its reference first. Our
inf->req would then dangle through v4l2_ctrl_request_complete +
buf_done_and_job_finish — UAF.

Fix matches the cedrus / rkvdec pattern: take our own reference
when we capture the pointer, release it after we're done with it
(after buf_done_and_job_finish to keep the ordering crystal-clear).

  /* in daedalus_device_run, after inf->req = src_buf->...->req */
  if (inf->req)
      media_request_get(inf->req);

  /* in daedalus_complete_resp_frame, after buf_done_and_job_finish */
  if (inf->req)
      media_request_put(inf->req);

Verified on hertz:
- libva path (request-bound, inf->req != NULL): byte-exact NV12,
  same FNV-1a as standalone.
- test_m2m_stream (direct QBUF, inf->req == NULL): 30/30 frames
  decoded, conditional skip works.
- No kernel oops / WARN, no leak in dmesg.

Add #include <media/media-request.h> for the helpers.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 18:39:10 +00:00
marfrit f04d7000f8 Phase 8.13: byte-exact end-to-end via libva (consumer target hit)
The project's consumer-side goal landed: a real VAAPI consumer
(ffmpeg with -hwaccel vaapi) drives our libva backend → V4L2
driver → daemon → byte-exact NV12 output back to ffmpeg.

  ffmpeg -hwaccel vaapi -hwaccel_device /dev/dri/renderD128 \
         -hwaccel_output_format nv12 -i vp9_small.ivf \
         -f rawvideo -y /tmp/vp9_via_libva.nv12
  cmp /tmp/vp9_via_libva.nv12 /tmp/vp9_ref_for_libva.nv12  → match

18432-byte NV12 byte-for-byte identical to plain ffmpeg
-pix_fmt nv12 software decode. The project_consumer_target
memory's deliverable shape — "V4L2 stateless node consumed by
a real VAAPI client" — is achieved.

Two related kernel changes:

1. v4l2_ctrl_handler_setup(&ctx->hdl) after registration —
   matches rkvdec/cedrus/hantro. Brings each registered
   compound control out of "uninitialised" state via
   std_init_compound defaults.

2. Per-request control completion in the decode path —
   the real fix for "Timeout when waiting for media request".
   vb2-core's vb2_buffer_done unbinds the BUFFER's req_obj
   on normal decode completion, but the per-request CONTROL
   object stays bound. buf_request_complete fires only from
   queue-cancel paths (vb2-core line 2284), NOT from normal
   buf_done. The driver must call
   v4l2_ctrl_request_complete(req, hdl) explicitly from the
   completion path.

   struct daedalus_inflight gained a `struct media_request
   *req` field, captured from src_buf->vb2_buf.req_obj.req
   in device_run. daedalus_complete_resp_frame then calls
   v4l2_ctrl_request_complete before
   v4l2_m2m_buf_done_and_job_finish — triggers
   MEDIA_REQUEST_STATE_COMPLETE and wakes the request fd
   poll.

   For non-request flows (test_m2m_stream direct QBUF)
   inf->req is NULL; the conditional skips the call.
   Both consumer styles work concurrently.

Diagnostic clarification (was Phase 8.13a):

strace traced three S_EXT_CTRLS calls per frame:
  1. H264_PROFILE + H264_LEVEL → EINVAL  (we don't register)
  2. HEVC_PROFILE + HEVC_LEVEL → EINVAL  (we don't register)
  3. VP9_FRAME + VP9_COMPRESSED_HDR → SUCCESS

The first two are harmless: libva probes whether we support
H264/HEVC integer profile/level controls during config
negotiation; we don't register them (we expose only the stateless
controls), so EINVAL just falls through. The actual VP9 stateless
controls (#3) succeeded all along; the libva-side "Unable to set
control(s)" log was misleading us into thinking the control path
was the bug.

Verification on hertz (Pi 5, 6.12.75+rpt-rpi-2712):

  daemon log:
    REQ_DECODE cookie=1 codec=1 bitstream=1566 bytes capture=128x96 1 planes
    decoder: opened vp9 context
    decoder: OK 128x96 fmt=0 (yuv420p) fnv1a=0x1eb34bfe ...

  ffmpeg side:
    no Timeout, no Decoding error
    /tmp/vp9_via_libva.nv12: 18432 bytes

  cmp vs reference: byte-for-byte identical.

Roadmap update:
- 8.10/8.11, 8.12, 8.13 marked closed with closure docs.
- 8.14 = multi-frame VP9 via libva, AV1 + H.264, mpv/Firefox
  higher-level consumers.

Per correctness-before-speed:
- strace + kernel-source-reading found the actual root cause
  rather than guessing.
- Conditional v4l2_ctrl_request_complete preserves the existing
  test_m2m_stream non-request path — both consumer styles work
  concurrently without per-flow branching elsewhere.
- Byte-exact pixel comparison, not "frame size matches."

Phase 8.14 next: multi-frame stream + multi-codec via libva +
mpv/Firefox.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 18:14:34 +00:00
marfrit a7d585eee8 Phase 8.12: first VP9 frame decoded via libva
ffmpeg -hwaccel vaapi → libva-v4l2-request-fourier →
/dev/video0 → daedalus_v4l2 kernel → REQ_DECODE on the
chardev → daemon FFmpeg decode → byte-exact NV12 (FNV-1a
0x1eb34bfe, same hash the standalone test_m2m_stream
produces for the same 128x96 VP9 keyframe).

The pixel-correct decode through the libva path is the
milestone. What's NOT yet working: libva times out on the
media_request fd because buf_request_complete never fires
(vb->req_obj.req is NULL when buf_done runs — the
S_EXT_CTRLS EINVAL leaves the buffer un-bound to the
request even though the buffer queues anyway). Phase 8.13
fixes the EINVAL so the request bind takes and the
completion signal propagates.

Kernel V4L2 request API integration:
- media_device_ops.req_validate / req_queue = vb2_request_
  validate / v4l2_m2m_request_queue (Phase 8.11) —
  MEDIA_IOC_REQUEST_ALLOC succeeds.
- vb2_queue.supports_requests = true on OUTPUT queue —
  without this v4l2-core rejects S_EXT_CTRLS(REQUEST_VAL).
- vb2_ops.buf_request_complete = daedalus_buf_request_complete
  → v4l2_ctrl_request_complete(req, &ctx->hdl). Without
  this v4l2-core WARNs at videobuf2-v4l2.c:440.
- vb2_ops.buf_out_validate: sets field=V4L2_FIELD_NONE on
  OUTPUT buf. Required for the same WARN check.
- requires_requests intentionally NOT set: lets the
  existing test_m2m_stream (direct QBUF, no request) keep
  working alongside the libva path.

Stateless control re-registration:
- Switched from v4l2_ctrl_new_std_compound(NULL p_def) to
  v4l2_ctrl_new_custom(&cfg, NULL) — pattern rkvdec /
  cedrus / hantro use. v4l2-core auto-fills elem_size +
  type from std table (verified: VP9_FRAME elem_size=168,
  matches sizeof(struct v4l2_ctrl_vp9_frame)).
- No-op s_ctrl callback so SET requests don't crash —
  daemon ignores values, FFmpeg re-parses the bitstream.

Verification on hertz (Pi 5, 6.12.75+rpt-rpi-2712):

  ffmpeg -hwaccel vaapi -i vp9_small.ivf …
  daemon: REQ_DECODE cookie=1 codec=1 bitstream=1566 bytes capture=128x96 1 planes
  daemon: decoder: opened vp9 context
  daemon: decoder: OK 128x96 fmt=0 (yuv420p) fnv1a=0x1eb34bfe …

Same FNV-1a hash as the standalone test_m2m_stream produces
for the same VP9 keyframe. End-to-end through libva.

Remaining (Phase 8.13):
- S_EXT_CTRLS EINVAL on V4L2_CID_STATELESS_VP9_FRAME despite
  matching elem_size — needs deeper validate-path debugging.
- Once the request bind takes, buf_request_complete fires
  on buf_done, request fd signals completion, libva DQBUFs
  the decoded NV12.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 18:01:26 +00:00
marfrit 0de0288dce Phase 8.10+8.11: libva consumer integration scaffold
Brings daedalus_v4l2 from "standalone test client" to "VAAPI-
discoverable decoder" by adding the surface formats and
media-controller plumbing that libva-v4l2-request-fourier
(sibling repo) requires.

libva-v4l2-request-fourier patches (pushed separately):
- b5b3acf: daedalus_v4l2 added to known_decoder_drivers
- 2146341: meson option gate

This commit (daedalus-v4l2 side, 3 production changes):

1. V4L2_PIX_FMT_NV12 (single-plane) on CAPTURE
   - Added to daedalus_capture_formats[] alongside NV12M + P010
   - daedalus_fill_capture_fmt handles num_planes=1 case
     (sizeimage = W*H*3/2, bytesperline = W)
   - daemon pack_nv12_single_to_plane: Y at base+0,
     interleaved CbCr at base+(stride*H); same byte content
     as NV12M two-plane, different layout
   - Required because libva-v4l2-request-fourier's video.c
     only knows non-multi-plane NV12 (it advertises
     v4l2_mplane=true but uses the single-plane fourcc).
   - Verified byte-exact via test_m2m_stream against
     ffmpeg -pix_fmt nv12 reference (VP9 1080p 10 frames,
     31 MB).

2. V4L2 Request API media ops
   - daedalus_media_ops = { vb2_request_validate,
     v4l2_m2m_request_queue } assigned to mdev.ops before
     media_device_init.
   - Without this, MEDIA_IOC_REQUEST_ALLOC returned
     -ENOTTY and no VAAPI consumer could allocate a
     media_request.

3. Stateless control registration via v4l2_ctrl_new_custom
   - Switched from v4l2_ctrl_new_std_compound(NULL p_def)
     to v4l2_ctrl_new_custom — pattern rkvdec/cedrus/
     hantro use. Adds a no-op s_ctrl callback.
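
The single-plane NV12 layout from item 1 (Y at base+0, interleaved CbCr at base + stride*H, sizeimage = W*H*3/2) can be sketched as below. This is not the daemon's pack_nv12_single_to_plane; the function names and argument order are hypothetical, and the source is assumed to be planar 8-bit YUV 4:2:0 with even dimensions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Bytes needed for one single-plane NV12 frame at a given stride. */
static size_t sketch_nv12_sizeimage(size_t stride, size_t height)
{
    return stride * height * 3 / 2;
}

/* Pack planar Y, U, V into single-plane NV12. Strides are in bytes. */
static void sketch_pack_nv12_single(uint8_t *dst, size_t dst_stride,
                                    const uint8_t *y, size_t y_stride,
                                    const uint8_t *u, const uint8_t *v,
                                    size_t c_stride, size_t w, size_t h)
{
    assert(dst && y && u && v);
    assert(w % 2 == 0 && h % 2 == 0 && dst_stride >= w);

    /* Luma: copy row by row, dropping any source stride padding. */
    for (size_t r = 0; r < h; r++)
        memcpy(dst + r * dst_stride, y + r * y_stride, w);

    /* Chroma: starts right after the Y rows, Cb and Cr interleaved. */
    uint8_t *c = dst + dst_stride * h;
    for (size_t r = 0; r < h / 2; r++) {
        for (size_t x = 0; x < w / 2; x++) {
            c[r * dst_stride + 2 * x]     = u[r * c_stride + x];
            c[r * dst_stride + 2 * x + 1] = v[r * c_stride + x];
        }
    }
}
```

For the 128x96 test clip this gives 128 * 96 * 3 / 2 = 18432 bytes per frame.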

Verification (hertz, Pi 5, 6.12.75+rpt-rpi-2712):

LibVA trace through `ffmpeg -hwaccel vaapi`:
  vaInitialize / Profiles / Entrypoints / CreateConfig /
  QuerySurfaceAttributes / CreateSurfaces / CreateContext
  (cap_pool: 24 slots, 1 plane each) / CreateBuffer
  (slice + picture params) / MEDIA_IOC_REQUEST_ALLOC
  — all succeed.

Standalone NV12 decode path:
  test_m2m_stream vp9_1080_stream.ivf out.nv12 1920 1080 vp9 nv12
  → 10/10 frames, byte-exact vs ffmpeg -pix_fmt nv12

vainfo (via libva-v4l2-request-fourier with our driver):
  7 VAProfile entries with VAEntrypointVLD
  (H264 Main/High/CBaseline/MultiviewHigh/StereoHigh,
   VP9Profile0, AV1Profile0)

What's NOT here (Phase 8.12):

The libva trace stops at VIDIOC_S_EXT_CTRLS returning
EINVAL when populating V4L2_CID_STATELESS_VP9_FRAME on
the request. The kernel's compound-control validation rejects
the payload as not matching the expected struct shape.
This isn't a "missing line" fix — it needs proper
stateless control plumbing (the SPS/PPS/SliceParams
get_dims, validate, default-value paths that in-tree
rkvdec/cedrus/hantro implement to satisfy v4l2-core's
std_validate). Documented as Phase 8.12 scope.

The shipped integration is itself a meaningful deliverable:
all the framework scaffolding is in place; the remaining
gap is well-characterised and bounded.

See docs/phase_8_10_11_closure.md for the full trace
analysis + next-phase plan.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 17:51:16 +00:00
marfrit d84efdb125 Phase 8.9: long-form stress + multi-codec HDR + libva scoping
Three verification deliverables; no production code changes
(infrastructure from 8.8 was sufficient).

1. libva-v4l2-request consumer investigation (task 95):
   - bootlin/libva-v4l2-request@master supports MPEG-2 /
     H.264 / HEVC only. No VP9, no AV1.
   - H264 expects V4L2_PIX_FMT_H264_SLICE_RAW (older
     fourcc); we advertise V4L2_PIX_FMT_H264_SLICE.
   - CAPTURE expects V4L2_PIX_FMT_NV12 (single-plane);
     we advertise NV12M + P010.
   - Real integration = patch libva-v4l2-request to add
     VP9 + AV1 mappings + accept the newer H.264 fourcc.
     Multi-session work — pushed to Phase 8.10.

2. Long-form stress test (task 96):
   - Built a 1800-frame (60s @ 30fps) VP9 1080p stream
     by Python concat of vp9_5s.ivf × 12 with PTS
     adjustment and re-muxed IVF header.
   - 1800 / 1800 frames decoded cleanly through
     test_m2m_stream + daemon, fps=120.9 sustained
     across 14.9 s wall, p99=17.3 ms/frame (well inside
     the 33 ms 30fps budget).
   - Daemon alive after 3620 cookies across two
     back-to-back runs, RSS=23 MiB — no leak.
   - No kernel oops/WARN, no fps degradation across
     the long run.

3. Multi-codec HDR (task 97):
   - AV1 1080p 10-bit → P010: byte-exact vs ffmpeg
     p010le. fps 17.1 (below 30fps target; AV1 10-bit
     is intrinsically expensive).
   - H.264 1080p 10-bit (high10) → P010: byte-exact
     vs ffmpeg p010le. fps 26.9 (close to target).
   - Combined with 8.8's VP9-10bit P010 result
     (48.8 fps): all three codecs' 10-bit paths
     produce byte-exact P010 output.

Roadmap update (docs/roadmap.md):
- 8.9 marked closed with the scope-cut explained.
- 8.10 = libva-v4l2-request VP9/AV1 patch + end-to-end
  consumer integration (the actual user-facing loop:
  mpv --hwdec=vaapi → libva-v4l2-request → /dev/video0
  → daemon → decoded frame).

Per correctness-before-speed: characterised the libva
integration scope rigorously rather than starting a
multi-session battle in this phase. The bounded
deliverables (stress test + HDR matrix) ship clean and
prove the existing infrastructure handles real-world
workloads stably.

Phase 8.10 next: build + patch libva-v4l2-request on
hertz; end-to-end with mpv.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 17:26:42 +00:00
marfrit 1d0db3b5a9 docs: pure ffmpeg vs daedalus pipeline CPU comparison
Measured on hertz (Pi 5, 6.12.75+rpt-rpi-2712, FFmpeg 7.1.3)
to quantify the architectural cost/benefit of routing decode
through the V4L2 m2m + chardev + dmabuf path vs running
ffmpeg standalone.

1080p × 150 frames, decode-as-fast-as-possible:

  VP9 8-bit:     ffmpeg 214.9% CPU / 1083ms wall
                 daedalus 96.3% CPU / 1229ms wall
  AV1 8-bit:     ffmpeg 201.5% CPU / 1162ms wall
                 daedalus 96.6% CPU / 1478ms wall
  H.264 8-bit:   ffmpeg 205.8% CPU / 1063ms wall
                 daedalus 100.1% CPU / 1020ms wall
  VP9 10-bit:    ffmpeg 155.8% CPU /  269ms wall
                 daedalus 91.6% CPU /  131ms wall

Key takeaway: the daedalus pipeline uses ~half the CPU for
roughly the same wall throughput. FFmpeg standalone defaults
to 2 threads; for single-stream decode that doesn't
parallelise well, so the 2× CPU usage is overhead, not
parallelism benefit. The daemon's single-threaded serialised
event loop avoids that tax.

For the project's 30fps-floor-is-fine target ("daily YouTube
with CPU free for vscode"), daedalus leaves ~2× the CPU
headroom for the rest of the desktop at the same playback
rate.

VP9-10bit is striking — daedalus is faster wallclock too
(131ms vs 269ms) because at small per-frame work FFmpeg's
thread pool spin-up dominates.

Note: "daedalus" still uses FFmpeg internally (Phase 8.8
explicitly deferred QPU substitution after measurement showed
30fps@1080p was already met). The benefit here is
architectural — single-threaded decode, out-of-process
daemon, dmabuf zero-copy — not QPU offload.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 17:20:22 +00:00
marfrit 1ae9528e76 Phase 8.8: throughput baseline + multi-codec streams + HDR
Per the correctness-before-speed principle: measure before
optimising. Roadmap going in said "QPU dispatch substitution
to hit 30fps@1080p". Measurement on hertz shows the FFmpeg
software path already hits 65-88 fps@1080p across all three
codecs — QPU substitution would be premature optimisation.

So 8.8 ships what's actually useful:
1. Per-frame timing in test_m2m_stream.
2. Multi-frame AV1 + H.264 streams verified byte-exact at
   1080p (closes the "VP9-only stream tests" gap from 8.7).
3. HDR / 10-bit via V4L2_PIX_FMT_P010 + daemon
   pack_p010_to_plane.

Test harness (tools/test_m2m_stream.c):
- Per-frame µs timing via CLOCK_MONOTONIC; reports mean/p50/
  p99/min/max + wall ms + fps.
- Annex-B H.264 parser: split on 3-/4-byte start codes,
  accumulate NALs into access units (push on VCL NAL types
  1 or 5). Without AU grouping FFmpeg rejects SPS/PPS-only
  buffers as "no frame!".
- Format auto-detect (DKIF magic → IVF; else Annex-B).
- Optional 6th arg `[capture]`: nv12m | p010.
- CAPTURE mmap path generalised for num_planes==1 (P010).
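
A minimal sketch of the Annex-B access-unit grouping described above, assuming only the rule stated in this message (split on 3-/4-byte start codes, close an AU on NAL type 1 or 5). The names `au_span`, `find_start_code` and `split_access_units` are hypothetical and are not taken from tools/test_m2m_stream.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical span type: one access unit as a byte range of the input. */
struct au_span {
    size_t off;
    size_t len;
};

/*
 * Find the next 00 00 01 start code at or after `pos`.
 * Returns the offset where the start code begins (a directly preceding
 * zero byte is folded in, so the 4-byte form 00 00 00 01 stays whole)
 * and stores the offset of the NAL header byte in *payload.
 * Returns `len` (and sets *payload = len) when no start code is left.
 */
static size_t find_start_code(const uint8_t *buf, size_t len, size_t pos,
                              size_t *payload)
{
    for (size_t i = pos; i + 3 <= len; i++) {
        if (buf[i] == 0 && buf[i + 1] == 0 && buf[i + 2] == 1) {
            *payload = i + 3;
            return (i > pos && buf[i - 1] == 0) ? i - 1 : i;
        }
    }
    *payload = len;
    return len;
}

/*
 * Group NALs into access units: non-VCL NALs (SPS, PPS, SEI, ...) are
 * accumulated, and the AU is pushed once a VCL NAL (type 1 or 5) has
 * been seen. Writes at most `max_out` spans; returns the number of AUs
 * found. Trailing non-VCL NALs after the last VCL NAL are not emitted.
 */
static size_t split_access_units(const uint8_t *buf, size_t len,
                                 struct au_span *out, size_t max_out)
{
    assert(buf != NULL && out != NULL);

    size_t n = 0;
    size_t payload;
    size_t au_start = find_start_code(buf, len, 0, &payload);

    while (payload < len) {
        unsigned nal_type = buf[payload] & 0x1f;
        size_t next_payload;
        size_t next_sc = find_start_code(buf, len, payload, &next_payload);

        if (nal_type == 1 || nal_type == 5) {
            if (n < max_out) {
                out[n].off = au_start;
                out[n].len = next_sc - au_start;
            }
            n++;
            au_start = next_sc;
        }
        payload = next_payload;
    }
    return n;
}
```

With this rule an SPS + PPS + IDR sequence becomes one buffer, so the decoder never sees a parameter-set-only packet. The sketch treats every slice NAL as a new AU, so a multi-slice picture would be split; whether the real harness handles that case is not stated here.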

Kernel (kernel/daedalus_v4l2_main.c):
- CAPTURE formats array {NV12M, P010}; enum_fmt walks it.
- daedalus_fill_capture_fmt takes a fourcc:
    NV12M: 2 planes, W*H + W*H/2 bytes, bpl=W
    P010:  1 plane,  W*H*2 + W*H bytes, bpl=W*2
- try_fmt preserves caller fourcc when supported.
- daedalus_complete_resp_frame's dmabuf path now sets each
  plane's payload to vb2_plane_size(vb,p) — generalises
  cleanly across 1-plane (P010) and 2-plane (NV12M) layouts;
  the daemon fully populates the plane so payload =
  sizeimage.
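
The plane arithmetic above can be checked with a small userspace sketch. `capture_layout`, `capture_fmt` and `fill_capture_fmt` are hypothetical stand-ins for illustration, not the kernel's daedalus_fill_capture_fmt; only the size formulas come from this message.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the two CAPTURE layouts. */
enum capture_layout { LAYOUT_NV12M, LAYOUT_P010 };

struct capture_fmt {
    unsigned num_planes;
    uint32_t bytesperline[2];
    uint32_t sizeimage[2];
};

/* Fill plane count, stride and size for a W x H frame. Returns 0 on
 * success, -1 for an unknown layout. */
static int fill_capture_fmt(enum capture_layout layout,
                            uint32_t w, uint32_t h,
                            struct capture_fmt *f)
{
    assert(f != NULL);

    switch (layout) {
    case LAYOUT_NV12M:
        /* Plane 0: 8-bit Y. Plane 1: interleaved CbCr at half height. */
        f->num_planes = 2;
        f->bytesperline[0] = w;
        f->sizeimage[0] = w * h;
        f->bytesperline[1] = w;
        f->sizeimage[1] = w * h / 2;
        return 0;
    case LAYOUT_P010:
        /* One plane: 16-bit Y (W*H*2) followed by 16-bit CbCr (W*H). */
        f->num_planes = 1;
        f->bytesperline[0] = w * 2;
        f->sizeimage[0] = w * h * 2 + w * h;
        f->bytesperline[1] = 0;
        f->sizeimage[1] = 0;
        return 0;
    }
    return -1;
}
```

For 1920x1080 this gives 2,073,600 + 1,036,800 bytes for NV12M and 6,220,800 bytes for P010, which lines up with the 62 MB reported for 10 P010 frames.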

Daemon (daemon/src/decoder.c):
- pack_p010_to_plane: YUV420P10LE → P010 single-plane.
  10-bit samples shifted left by 6 to MSB-align in 16-bit
  words per V4L2 ABI. Y at base+0, interleaved CbCr right
  after Y plane (per format spec for single-plane P010).
  Strips source stride padding; respects destination stride.
- daedalus_decoder_run_request dispatches on
  req->capture_pix_fmt (NV12M → pack_nv12_to_planes; P010
  → pack_p010_to_plane; else warn + skip).
- Includes <linux/videodev2.h> for fourcc constants.
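
A hedged sketch of the P010 packing step described above: shift each 10-bit sample left by 6, write Y first, then interleaved CbCr directly after the Y rows. `pack_p010_sketch` and its parameter layout are assumptions for illustration (source strides are counted in samples here, and the chroma rows reuse the luma destination stride); the real pack_p010_to_plane signature is not shown in this log.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Store one 16-bit word little-endian, independent of host byte order. */
static void put_le16(uint8_t *dst, uint16_t v)
{
    dst[0] = (uint8_t)(v & 0xff);
    dst[1] = (uint8_t)(v >> 8);
}

/* MSB-align a 10-bit sample in a 16-bit word (value << 6). */
static uint16_t msb_align10(uint16_t v)
{
    return (uint16_t)((v & 0x3ffu) << 6);
}

/*
 * Pack planar 4:2:0 10-bit input into a single P010 plane.
 *   dst_stride: destination bytes per row (at least w * 2)
 *   y_stride, c_stride: source row pitch in SAMPLES (may exceed w, w/2)
 * Source padding is skipped; destination padding is left untouched.
 */
static void pack_p010_sketch(uint8_t *dst, size_t dst_stride,
                             const uint16_t *y, size_t y_stride,
                             const uint16_t *cb, const uint16_t *cr,
                             size_t c_stride, unsigned w, unsigned h)
{
    assert(dst && y && cb && cr);
    assert(dst_stride >= (size_t)w * 2);

    for (unsigned r = 0; r < h; r++) {
        uint8_t *row = dst + (size_t)r * dst_stride;
        for (unsigned x = 0; x < w; x++)
            put_le16(row + 2 * x, msb_align10(y[r * y_stride + x]));
    }

    /* Chroma starts right after the last Y row. */
    uint8_t *uv = dst + (size_t)h * dst_stride;
    for (unsigned r = 0; r < h / 2; r++) {
        uint8_t *row = uv + (size_t)r * dst_stride;
        for (unsigned x = 0; x < w / 2; x++) {
            put_le16(row + 4 * x, msb_align10(cb[r * c_stride + x]));
            put_le16(row + 4 * x + 2, msb_align10(cr[r * c_stride + x]));
        }
    }
}
```

A full-scale 10-bit sample (1023) becomes 0xFFC0, so the low six bits of every output word stay zero.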

Verification on hertz (Pi 5, 6.12.75+rpt-rpi-2712):

1080p throughput baseline (30 frames testsrc, dmabuf path):

  VP9   1080p:  mean 12.0 ms,  p99 15.9 ms,  fps **83.1**, byte-exact ✓
  AV1   1080p:  mean 15.4 ms,  p99 41.0 ms,  fps **65.0**, byte-exact ✓
  H.264 1080p:  mean 11.3 ms,  p99 21.5 ms,  fps **88.3**, byte-exact ✓

All 2-3× over the 30fps-floor-is-fine criterion.
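
For reference, mean/p50/p99/min/max and fps figures like the ones above could be derived from per-frame samples roughly as follows. This is a sketch under assumptions: `frame_stats` and `summarise` are hypothetical names, percentiles use the nearest-rank method, and fps is taken as frames divided by wall time; the harness's exact definitions are not spelled out in this message. The samples themselves would come from CLOCK_MONOTONIC deltas.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical summary record for one run. */
struct frame_stats {
    uint64_t min_us, max_us, p50_us, p99_us;
    double mean_us;
    double fps;
};

static int cmp_u64(const void *a, const void *b)
{
    uint64_t x = *(const uint64_t *)a;
    uint64_t y = *(const uint64_t *)b;
    return (x > y) - (x < y);
}

/* Nearest-rank percentile on an ascending array: rank = ceil(pct/100 * n). */
static uint64_t percentile(const uint64_t *sorted, size_t n, unsigned pct)
{
    size_t rank = (n * pct + 99) / 100;
    if (rank == 0)
        rank = 1;
    return sorted[rank - 1];
}

/* Sorts samples_us in place and summarises them. wall_us is the total
 * wall-clock time of the run, measured separately. */
static struct frame_stats summarise(uint64_t *samples_us, size_t n,
                                    uint64_t wall_us)
{
    assert(samples_us != NULL && n > 0 && wall_us > 0);

    struct frame_stats s;
    uint64_t sum = 0;

    qsort(samples_us, n, sizeof samples_us[0], cmp_u64);
    for (size_t i = 0; i < n; i++)
        sum += samples_us[i];

    s.min_us = samples_us[0];
    s.max_us = samples_us[n - 1];
    s.p50_us = percentile(samples_us, n, 50);
    s.p99_us = percentile(samples_us, n, 99);
    s.mean_us = (double)sum / (double)n;
    s.fps = (double)n * 1e6 / (double)wall_us;
    return s;
}
```

With only 30 samples, nearest-rank p99 is simply the slowest frame, which is worth keeping in mind when reading the p99 column.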

HDR / 10-bit 1080p P010:
  10 frames, 62 MB output, fps **48.8**, byte-exact vs
  `ffmpeg -pix_fmt p010le -f rawvideo`.

Small-frame P010 (320×240): fps 966 — fixed daemon overhead
dominates at low resolutions.

v4l2-compliance unchanged from 8.7: 49/49 passing.
Format enumeration confirms NM12 + P010 on CAPTURE.

Clean SIGTERM + rmmod; no kernel oops/WARN.

Roadmap update (docs/roadmap.md):
- 8.8 marked closed with closure-doc reference, including
  the explicit "QPU substitution not needed" rationale.
- 8.9 reshaped: libva-v4l2-request consumer integration
  (per project_consumer_target memory) — the actual
  user-facing endpoint.

Per correctness-before-speed:
- Measured first; QPU work explicitly justified-out via data.
- Byte-exact pixel comparison for every codec/format combo
  (NV12: VP9, AV1, H.264; P010: VP9 10-bit at 320×240 and
  1080p).
- AU grouping in the Annex-B parser is the correct
  semantic boundary, not just a workaround.
- vb2_plane_size for payload generalises to any plane
  count, not hardcoded to 2.

Phase 8.9 next: libva-v4l2-request integration — close
the loop from YouTube/Firefox to /dev/video0 + daemon
playback.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 16:34:05 +00:00
marfrit 5965805d86 Phase 8.7: media controller + multi-frame streaming verification
Two pieces — both shipped:

1. Media controller binding closes the last v4l2-compliance
   failure from 8.6 (DECODER_CMD, which requires has_media on
   stateless decoders) and unlocks the V4L2 request API for
   libva-v4l2-request.

2. Multi-frame streaming test exercises the daemon's
   AVCodecContext state preservation across many REQ_DECODE
   calls — Phase 8.6's tests pushed exactly one keyframe per
   invocation; real content has P-frame references.

Compliance now reaches **49/49 passing.**

Kernel (kernel/daedalus_v4l2_main.{c,h}):
- Added `struct media_device mdev` to daedalus_dev.
- media_device_init(&mdev) BEFORE v4l2_device_register so
  v4l2-core sees v4l2_dev.mdev = &mdev and binds the m2m
  entities into the graph during register.
- After video_register_device:
  v4l2_m2m_register_media_controller(..., MEDIA_ENT_F_PROC_VIDEO_DECODER)
  then media_device_register so userspace sees the complete
  graph in /dev/mediaN with the decoder entity tagged.
- daedalus_remove unwinds in reverse: unregister media,
  unregister mc, unregister video, release m2m, unregister
  v4l2, cleanup mdev.
- Error paths added for both new failure points.

Test harness (tools/test_m2m_stream.c, new):
- Multi-frame V4L2 m2m client: parses IVF → 4-deep buffer
  rings on both queues → per-frame QBUF/DQBUF loop →
  concatenates decoded NV12 to output file. Returns 0 only
  if every input frame decoded without error.
- Same codec vocabulary as test_m2m_decode (vp9 | av1 |
  h264 via 5th arg).
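
The IVF parsing step can be sketched as below, based on the standard IVF container layout (32-byte file header starting with "DKIF", then a 12-byte header per frame). `ivf_info`, `ivf_parse_header` and `ivf_next_frame` are hypothetical names; this is not the code in tools/test_m2m_stream.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

static uint16_t rd_le16(const uint8_t *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

static uint32_t rd_le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Hypothetical summary of the IVF file header. */
struct ivf_info {
    char fourcc[5];        /* e.g. "VP90", NUL-terminated */
    uint16_t width, height;
    uint32_t frame_count;  /* as declared by the header */
};

/* Returns 0 and sets *first_frame_off on success, -1 if not IVF. */
static int ivf_parse_header(const uint8_t *buf, size_t len,
                            struct ivf_info *out, size_t *first_frame_off)
{
    assert(buf && out && first_frame_off);

    if (len < 32 || memcmp(buf, "DKIF", 4) != 0)
        return -1;
    uint16_t hdr_len = rd_le16(buf + 6);
    if (hdr_len < 32 || hdr_len > len)
        return -1;

    memcpy(out->fourcc, buf + 8, 4);
    out->fourcc[4] = '\0';
    out->width = rd_le16(buf + 12);
    out->height = rd_le16(buf + 14);
    out->frame_count = rd_le32(buf + 24);
    *first_frame_off = hdr_len;
    return 0;
}

/*
 * Step to the next frame. Each frame is a 12-byte header (32-bit LE
 * payload size, 64-bit LE timestamp) followed by the payload.
 * Returns 1 with *data/*size set, 0 at end of input, -1 if truncated.
 */
static int ivf_next_frame(const uint8_t *buf, size_t len, size_t *off,
                          const uint8_t **data, uint32_t *size)
{
    if (*off > len || len - *off < 12)
        return 0;
    uint32_t sz = rd_le32(buf + *off);
    if (sz > len - *off - 12)
        return -1;
    *data = buf + *off + 12;
    *size = sz;
    *off += 12 + (size_t)sz;
    return 1;
}
```

Each payload returned by `ivf_next_frame` is what a client would copy into one OUTPUT buffer before QBUF.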

Verification on hertz (Pi 5, 6.12.75+rpt-rpi-2712):

v4l2-compliance: 49 tests, 49 passed, 0 failed, 0 warnings.

  $ v4l2-ctl --list-devices
  daedalus-fourier V3D7+NEON (platform:daedalus_v4l2):
        /dev/video0
        /dev/media3

VP9 320×240 30 frames (1 keyframe + 29 P-frames, 3.46 MB
NV12): byte-for-byte match vs `ffmpeg -i in.ivf -pix_fmt
nv12 -f rawvideo`.

VP9 1920×1080 10 frames (31 MB NV12 through the dmabuf
path): byte-for-byte match vs same reference command.

Daemon log shows cookies 1..30 all completing cleanly in
order; lazily-opened AVCodecContext maintains reference
frames across the chardev round-trips.

Clean SIGTERM + rmmod, no oops/WARN.

Roadmap update (docs/roadmap.md):
- 8.7 marked closed with closure-doc reference.
- 8.8 reshaped: perf profiling, QPU dispatch substitution
  via daedalus-fourier, multi-frame AV1/H.264, HDR (P010M).

Per correctness-before-speed:
- Order-correct media controller lifecycle (init → bind
  v4l2_dev → register video → register mc → register
  media; reverse for teardown).
- 4-deep buffer rings on both queues — the scheduler
  actually pipelines multiple in-flight cookies through
  the chardev (not just one-at-a-time as in 8.5/8.6 tests).
- Bit-exact comparison against ffmpeg, not "looks right."
- All resource paths cleaned on every error branch.

Phase 8.8 next: profile daemon hot loops, dlopen
daedalus-fourier from the daemon, swap FFmpeg per-block
calls for daedalus_dispatch_* where the kernel matches,
target 30fps@1080p from 30fps-floor-is-fine memory.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 16:21:58 +00:00
marfrit c7f6fb90cb Phase 8.6: dmabuf + AV1 + H.264 + stateless controls
Removes the Phase 8.5 64 KiB frame-size cap by exporting CAPTURE
buffers as dmabuf-fds the daemon mmaps and writes pixels into
directly. Adds AV1 + H.264 codec support, V4L2 stateless control
registration, and the compliance polish that brings the driver
to 47/48 passing on v4l2-compliance.

Protocol (include/daedalus_v4l2_proto.h):
- struct daedalus_req_decode grew capture-buffer metadata
  (width/height/pix_fmt/num_planes + per-plane size+stride).
- New DAEDALUS_IOC_GET_DMABUF ioctl on the chardev: daemon
  asks for a per-plane dmabuf fd, kernel calls vb2_core_expbuf
  in daemon task context so the fd lands in the daemon's table.

Kernel m2m driver (kernel/daedalus_v4l2_main.c):
- Both queues switched to vb2_dma_contig_memops. OUTPUT was
  vmalloc in 8.5; the switch is needed because vmalloc doesn't
  honour V4L2_MEMORY_FLAG_NON_COHERENT and v4l2-compliance's
  REQBUFS test rejected the driver because of it. We still
  read bitstream via vb2_plane_vaddr (dma_contig gives a
  kernel virtual address just like vmalloc did).
- dma_coerce_mask_and_coherent(DMA_BIT_MASK(32)) in probe.
- queue_setup populates alloc_devs[plane] = &pdev->dev for
  both queues; allow_cache_hints=1 on both.
- daedalus_export_capture_dmabuf(cookie, plane, flags, *fd):
  walks inflight list, calls vb2_core_expbuf on the CAPTURE
  buffer in the caller's (daemon's) task context.
- device_run fills the new REQ_DECODE capture fields from
  ctx->dst_fmt and maps ctx->src_fmt.pixelformat to
  DAEDALUS_CODEC_VP9 / _AV1 / _H264 (was hard-wired to VP9).
- daedalus_complete_resp_frame handles both the 8.5 inline
  path (kept for debugging) and the 8.6 dmabuf path (pixels
  already in CAPTURE buffer, just set payload from metadata).
- enum_fmt advertises all 3 OUTPUT formats (VP9F, AV1F, S264).
- try_fmt preserves userspace colorspace fields instead of
  overwriting with REC709 defaults (fixes 8.5 compliance fail).
- s_fmt propagates OUTPUT colorspace → CAPTURE (stateless
  decoder round-trip test at v4l2-test-formats.cpp:958).
- 12 V4L2 stateless controls registered per open (VP9_FRAME,
  VP9_COMPRESSED_HDR, H264_SPS/PPS/SCALING/PRED_WEIGHTS/
  SLICE_PARAMS/DECODE_PARAMS, AV1_FRAME/SEQUENCE/
  TILE_GROUP_ENTRY/FILM_GRAIN). Daemon ignores values (FFmpeg
  re-parses); registration is what makes libva-v4l2-request
  see us.

Kernel chardev (kernel/daedalus_v4l2_chardev.c):
- New unlocked_ioctl dispatching DAEDALUS_IOC_GET_DMABUF to
  daedalus_export_capture_dmabuf.
- debugfs test_decode cookies unified with the m2m cookie
  allocator via shared daedalus_next_cookie() — kills the
  Phase 8.5 namespace collision.

Daemon (daemon/src/...):
- New dmabuf_capture.{c,h}: GET_DMABUF + mmap each plane on
  REQ_DECODE; munmap + close on completion. O_RDWR | O_CLOEXEC
  is essential — vb2_core_expbuf extracts O_ACCMODE from flags
  and exports read-only by default (caught on first run; mmap
  -EACCES on PROT_WRITE).
- decoder.{c,h}: lazily opens AV1 + H.264 AVCodecContexts in
  addition to VP9 (dropped the -ENOSYS stubs). pack_nv12_to_planes
  writes Y line-by-line into planes[0] with planes[0].stride;
  interleaves Cb/Cr into planes[1] with planes[1].stride.
- chardev_client.c handle_req_decode: opens dmabuf planes,
  runs decode (pixels land in CAPTURE buffer directly), closes
  planes, sends metadata-only RESP_FRAME. No wire-pixel
  allocation.
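
The NV12 packing described for pack_nv12_to_planes (Y copied line by line with the destination stride, Cb/Cr interleaved into the second plane) could look roughly like this. `pack_nv12_sketch` and its parameter list are hypothetical; the daemon's real function takes a planes[] array whose layout is not shown here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Pack planar YUV420P into two NV12 planes.
 *   y_dst / y_dst_stride:   destination Y plane and its bytes per row
 *   uv_dst / uv_dst_stride: destination CbCr plane and its bytes per row
 *   y_stride, c_stride:     source bytes per row (may include padding)
 * Only the visible w (or w/2) samples of each row are copied.
 */
static void pack_nv12_sketch(uint8_t *y_dst, size_t y_dst_stride,
                             uint8_t *uv_dst, size_t uv_dst_stride,
                             const uint8_t *y, size_t y_stride,
                             const uint8_t *cb, const uint8_t *cr,
                             size_t c_stride, unsigned w, unsigned h)
{
    assert(y_dst && uv_dst && y && cb && cr);
    assert(y_dst_stride >= w && uv_dst_stride >= w);

    /* Luma: one memcpy per row drops the source stride padding. */
    for (unsigned r = 0; r < h; r++)
        memcpy(y_dst + (size_t)r * y_dst_stride,
               y + (size_t)r * y_stride, w);

    /* Chroma: Cb and Cr alternate byte by byte, half resolution. */
    for (unsigned r = 0; r < h / 2; r++) {
        uint8_t *row = uv_dst + (size_t)r * uv_dst_stride;
        for (unsigned x = 0; x < w / 2; x++) {
            row[2 * x] = cb[(size_t)r * c_stride + x];
            row[2 * x + 1] = cr[(size_t)r * c_stride + x];
        }
    }
}
```

Because both destinations carry their own stride, the same routine works whether the CAPTURE planes are tightly packed or padded by the allocator.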

Test harness (tools/test_m2m_decode.c):
- Optional 5th arg `codec` (vp9 | av1 | h264). Same client
  drives all three codecs.

Verification on hertz (Pi 5, 6.12.75+rpt-rpi-2712):

Bit-exact end-to-end vs `ffmpeg -pix_fmt nv12`:
  VP9   1920x1080  3,110,400 bytes  MATCH
  AV1     128x96      18,432 bytes  MATCH
  H.264   128x96      18,432 bytes  MATCH

VP9 1080p went through the full dmabuf path with no chardev
payload bloat — the same chardev that capped at 64 KiB in 8.5
now ferries metadata only and lets the daemon mmap+write a
3.1 MB frame directly into the V4L2 client's buffer.

v4l2-compliance:
  Phase 8.1: 44/48
  Phase 8.5: 44/48 (different fails after m2m landed)
  Phase 8.6: 47/48
  Only remaining: VIDIOC_(TRY_)DECODER_CMD (needs media
  controller — explicitly Phase 8.7 work).

11 standard compound controls visible:
  vp9_frame_decode_parameters, vp9_probabilities_updates,
  h264_sequence_parameter_set, h264_picture_parameter_set,
  h264_scaling_matrix, h264_prediction_weight_table,
  h264_slice_parameters, h264_decode_parameters,
  av1_sequence_parameters, av1_frame_parameters,
  av1_film_grain (av1_tile_group_entry refused by hdl->error
  on this kernel — skipped silently).

Clean SIGTERM + rmmod, no oops/WARN.

Roadmap update (docs/roadmap.md):
- Phase 8.6 marked closed with the closure-doc reference.
- Phase 8.7 reshaped to (1) media controller, (2) perf +
  daedalus_dispatch_* substitution, (3) HDR/10-bit, (4)
  long-form multi-frame streaming.

Per correctness-before-speed:
- Real V4L2 dmabuf via vb2_core_expbuf (not a sideband
  fd-passing hack).
- O_RDWR access mode threaded through correctly.
- Strict pixel-byte comparison against ffmpeg, not "looks
  right" eyeballing.
- Each compliance edge documented with the underlying test
  source-line + the fix.
- All resource paths cleaned (munmap + close per plane on
  every exit, including error paths).

Phase 8.7 next: media controller binding (closes last
compliance fail), per-frame profiling, QPU dispatch
substitution targeting 30fps@1080p from
30fps-floor-is-fine memory.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 16:16:06 +00:00
marfrit 6f4b580f7c Phase 8.5: full V4L2 m2m driver, VP9 decode via QBUF/DQBUF
Replaces the Phase 8.4 debugfs-triggered chardev path with a
real V4L2 m2m driver. Userspace clients now drive decoding the
standard way — S_FMT / REQBUFS / QBUF on the OUTPUT (bitstream)
queue, DQBUF on the CAPTURE (NV12M) queue. Kernel device_run
packs the bitstream into REQ_DECODE; daemon decodes via FFmpeg;
RESP_FRAME's inline NV12 pixel payload lands in the CAPTURE
buffer. Phase 8.6 swaps the inline payload for dmabuf so big
frames stop being capped at 64 KiB.

Kernel (daedalus_v4l2_main.c, rewritten + main.h added):
- Per-open struct daedalus_ctx: v4l2_fh, m2m_ctx, ctrl_handler,
  per-queue v4l2_pix_format_mplane.
- Two vb2_queues (vb2_vmalloc_memops for both — no DMA needed
  yet; 8.6 switches CAPTURE to dma_contig for dmabuf-export):
    OUTPUT  = V4L2_BUF_TYPE_VIDEO_OUTPUT_MPLANE,  VP9_FRAME
    CAPTURE = V4L2_BUF_TYPE_VIDEO_CAPTURE_MPLANE, NV12M
- Full v4l2_ioctl_ops table: querycap, enum_fmt, g/s/try_fmt
  for both queues, reqbufs/querybuf/qbuf/dqbuf/create_bufs/
  prepare_buf/expbuf/streamon/streamoff via v4l2_m2m_ioctl_*
  helpers.
- v4l2_m2m_ops.device_run: peeks next OUTPUT buf, builds
  REQ_DECODE inline with the bitstream bytes, enqueues with an
  auto-incrementing cookie, stores {ctx, src_buf, dst_buf} in
  a per-device inflight list. Job stays open until RESP_FRAME.
- daedalus_complete_resp_frame(): pops the inflight entry,
  memcpys inline NV12 pixels into the CAPTURE buffer (Y plane
  + interleaved CbCr), finishes via
  v4l2_m2m_buf_done_and_job_finish — NOT plain buf_done +
  job_finish, which leaves the src buf on the m2m queue and
  causes device_run to immediately re-run on the same input
  (caught on first run; second REQ_DECODE for same bitstream +
  eventual oops in stop_streaming on teardown).

Kernel (daedalus_v4l2_chardev.c):
- RESP_FRAME handler now hands inline pixel payload to
  daedalus_complete_resp_frame so it lands in the CAPTURE
  vb2 buffer. Existing PONG and debugfs test_decode paths still
  work; the latter produces a harmless ratelimited "unknown
  cookie" since it bypasses V4L2 m2m.

Daemon (decoder.c, decoder.h):
- daedalus_decoder_run_request signature extended with
  (nv12_out, nv12_cap, nv12_used). After the FNV-1a digest the
  decoder packs YUV420P into NV12 in the caller's buffer: Y
  plane line-by-line stripped of stride padding; Cb/Cr
  interleaved into a single chroma plane. Truncation is silent:
  the kernel only memcpys what fits in the CAPTURE plane.

Daemon (chardev_client.c):
- handle_req_decode allocates a response buffer sized for the
  full chardev payload, lets decoder fill the pixel area
  after the resp_frame struct, sends the full payload via the
  existing send_response.

Test client (tools/test_m2m_decode.c, new):
- Minimal V4L2 m2m client: S_FMT both queues, REQBUFS 1 each,
  mmap+fill OUTPUT, QBUF both, STREAMON, poll, DQBUF, dump
  CAPTURE planes to a raw NV12 file. ~250 LOC; verifies the
  whole flow without needing v4l2-ctl framing.

Roadmap update (docs/roadmap.md):
- Phase 8.4 retitled "daemon ↔ kernel decode round-trip"
  to reflect what actually shipped (vs. the original V4L2-
  ioctl-driven plan which moved here).
- Phase 8.5 retitled "full V4L2 m2m driver" with closure
  status.
- Phase 8.6 reshaped to two tracks: dmabuf + AV1/H.264/
  stateless controls + media controller. Adds the punch list
  of v4l2-compliance failures (DECODER_CMD, S_FMT colorspace)
  that 8.6 will fix.

Verification on hertz (Pi 5, 6.12.75+rpt-rpi-2712):

  Kernel + daemon build clean (-Wall -Wextra clean both sides).
  Test harness drives one VP9 keyframe end-to-end:
    OUTPUT REQBUFS -> 2
    CAPTURE REQBUFS -> 2
    QBUF OUTPUT[0] bytesused=1566
    QBUF CAPTURE[0]; STREAMON both
    poll revents=0x5
    DQBUF OUTPUT[0] flags=0x4001 (DONE)
    DQBUF CAPTURE[0] flags=0x4000 payloads=[12288, 6144]
    wrote 12288 Y + 6144 UV bytes to /tmp/out_m2m.nv12

  Pixel correctness vs reference:
    ffmpeg -i vp9_small.ivf -pix_fmt nv12 -f rawvideo -y ref.nv12
    cmp /tmp/out_m2m.nv12 /tmp/ref.nv12 → match ✓
  Byte-for-byte identical to FFmpeg's stock CPU decode.

  v4l2-compliance: detected as Stateless Decoder; most ioctls
  pass; two expected fails documented in closure doc
  (DECODER_CMD/media controller, S_FMT colorspace).

  Clean teardown: SIGTERM the daemon, rmmod the module, no
  oops/WARN in dmesg.

Per correctness-before-speed:
- Real V4L2 ioctl table (not stubs); uses v4l2-core helpers
  where they exist instead of reinventing.
- v4l2_m2m_buf_done_and_job_finish (not the manual sequence)
  to keep scheduler state consistent.
- Bit-exact reference comparison, not just "looks right."
- Documented every compliance failure with the planned fix.
- All resource paths (kmalloc/kfree, inflight list cleanup,
  src/dst buf removal in stop_streaming) handled on every
  error branch.

Phase 8.6 next: dmabuf-export for CAPTURE (removes 64 KiB
frame-size cap), add AV1+H.264 codecs, add V4L2 stateless
controls + media controller binding, fix the colorspace +
cookie-namespace compliance issues.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 15:55:10 +00:00
marfrit 2a449632b9 Phase 8.4: daemon ↔ kernel decode round-trip (VP9 end-to-end)
Wires the Phase 8.3 FFmpeg loader through the Phase 8.2 chardev
bridge: kernel injects REQ_DECODE carrying a raw VP9 access unit,
daemon hands the bitstream to libavcodec via dlopen, sends
RESP_FRAME back with a content-dependent FNV-1a digest of the
decoded YUV planes. Pure CPU decode for now — Phase 8.5 swaps in
dmabuf + QPU dispatch.

Protocol (include/daedalus_v4l2_proto.h):
- New REQ_DECODE (kernel→daemon) and RESP_FRAME (daemon→kernel)
  message types, with fixed-size payload structs.
- New DAEDALUS_CODEC_VP9/AV1/H264 enum (wire-stable so 8.6's
  AV1+H.264 work doesn't move existing values).
- New DAEDALUS_DECODE_* status enum (OK / NO_FRAME / ERR_OPEN /
  ERR_SEND / ERR_RECV / ERR_CODEC).
- Converted the prior `enum daedalus_msg_type` to #defines —
  high-bit values exceed INT_MAX and tripped -Wpedantic on
  userspace; kernel uABI headers use the same idiom.

Kernel (kernel/daedalus_v4l2_chardev.c):
- New debugfs entry /sys/kernel/debug/daedalus_v4l2/test_decode:
  writing raw bitstream bytes wraps them in a REQ_DECODE
  (codec=VP9 for Phase 8.4) and enqueues with an
  auto-incrementing cookie.
- daedalus_chardev_write learned RESP_FRAME: parses the payload
  and emits a single pr_info line with decode metadata. Keeps
  existing PONG handling on the default arm.

Daemon (daemon/src/...):
- chardev_client.{c,h} — opens /dev/daedalus-v4l2, blocking read
  loop, single-buffer write() responses (kernel chardev has only
  .write, not .write_iter, so writev lands as -EINVAL —
  discovered the hard way during first run).
- decoder.{c,h} — lazily-opened AVCodecContext per codec, shared
  AVPacket/AVFrame pair, descriptor-driven plane walker
  (av_pix_fmt_desc_get) so the same hash path covers YUV420P,
  YUV422P, YUV444P, GBRP and other 8-bit planar layouts.
  Generalised after first run decoded testsrc as GBRP (71)
  rather than the assumed YUV420P.
- `daemon` command in main.c opens the chardev and runs the loop
  until SIGINT/SIGTERM. Cookie correlation handled end-to-end.
- ffmpeg_loader gained av_pix_fmt_desc_get (23 symbols total).

Build:
- CMakeLists adds chardev_client.c + decoder.c; explicit
  -I../include for the shared protocol header.
- Still -Wall -Wextra -Wpedantic clean.

Verification on hertz (Pi 5, 6.12.75+rpt-rpi-2712):

  $ ffmpeg ... -pix_fmt yuv420p -c:v libvpx-vp9 -frames:v 1 \
           -y /tmp/vp9_test.ivf
  $ python3 ... strip IVF framing → vp9_keyframe.bin (3268 B)

  $ sudo insmod kernel/daedalus_v4l2.ko
  $ daedalus_v4l2_daemon -v daemon &
  $ sudo dd if=vp9_keyframe.bin \
         of=/sys/kernel/debug/daedalus_v4l2/test_decode

  daemon: REQ_DECODE cookie=2 → decoded yuv420p 320x240
          fnv1a=0x6ef10d71 luma=76800 chroma=38400
  kernel: RESP_FRAME cookie=2 status=0 320x240 pixfmt=0
          fnv1a=0x6ef10d71  ← matches daemon ✓

Hash properties verified:
  cookie=2  testsrc 3268 B → 0x6ef10d71  (first decode)
  cookie=3  red     44 B   → 0x7f6e5dc5  (content-dependent ✓)
  cookie=4  testsrc 3268 B → 0x6ef10d71  (deterministic ✓)
  cookie=5  64 B random    → status=101  (ERR_SEND, daemon alive)
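
For readers unfamiliar with the digest, here is a minimal 32-bit FNV-1a sketch with a stride-aware plane walker. The constants are the standard FNV-1a 32-bit offset basis and prime; the function names, and the assumption that the daemon hashes only the visible bytes of each row, are illustrative. The order in which the daemon feeds planes is not documented here, so this sketch is not expected to reproduce the digests listed above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FNV1A32_OFFSET 0x811c9dc5u
#define FNV1A32_PRIME  0x01000193u

/* Fold n bytes into a running FNV-1a hash: XOR the byte, then multiply. */
static uint32_t fnv1a32_update(uint32_t h, const uint8_t *p, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        h ^= p[i];
        h *= FNV1A32_PRIME;
    }
    return h;
}

/*
 * Hash the visible part of one plane: `row_bytes` bytes from each of
 * `rows` rows, skipping the stride padding so the digest depends only
 * on pixel content, not on the allocator's row pitch.
 */
static uint32_t fnv1a32_plane(uint32_t h, const uint8_t *plane,
                              size_t stride, size_t row_bytes, size_t rows)
{
    assert(plane != NULL && stride >= row_bytes);

    for (size_t r = 0; r < rows; r++)
        h = fnv1a32_update(h, plane + r * stride, row_bytes);
    return h;
}
```

Chaining planes by passing the previous result back in as `h` gives one digest per frame. Identical input always yields the same value and any changed byte perturbs it, which are the two properties the log above checks.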

Daemon survives bad input (FFmpeg "Invalid sync code" wrapped
into structured ERR_SEND response). Clean SIGTERM shutdown,
clean rmmod.

Phase 8.4 acceptance criteria met:
- ✓ end-to-end kernel→daemon→FFmpeg→kernel round-trip
- ✓ cookie correlation per request/response pair
- ✓ content-dependent + deterministic digest
- ✓ structured error responses (no daemon crash on bad input)
- ✓ clean teardown (SIGTERM + rmmod)
- ✓ builds clean on both kernel kbuild and daemon CMake

Per correctness-before-speed:
- Real chardev I/O (no shortcuts, no select-loop hacks)
- Real FFmpeg AVCodecContext lifecycle (lazily opened, properly
  freed on cleanup)
- Descriptor-driven plane walk (generalises across pix_fmts)
- Structured error path (not just log-and-continue)
- All resource paths cleaned up on every error branch
- Documented why FNV-1a digest, why write() not writev(), why
  pix_desc walk in docs/phase_8_4_closure.md

Phase 8.5 next: V4L2 m2m queue submits REQ_DECODE from
vidioc_qbuf; dmabuf carries actual pixel data so the chardev's
64 KiB cap doesn't gate frame size; begin substituting
daedalus_dispatch_* into the daemon's decode path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 15:22:16 +00:00
marfrit 873a04c622 Phase 8.3: userspace daemon scaffold + FFmpeg dlopen + parse path
Builds the daemon executable per the locked Phase 8 architecture
(Option γ: dlopen FFmpeg at runtime). Phase 8.3 scope: parse
path validation only — no V4L2 wiring, no decode, no chardev
connection.

Components:
- daemon/CMakeLists.txt — CMake with -Wall -Wextra -Wpedantic
  clean. pkg-config for FFmpeg headers; only -ldl + -lpthread
  at link time.
- daemon/src/main.c — entry point, signal handlers
  (SIGINT/SIGTERM), command dispatcher. Currently `parse <file>`.
- daemon/src/ffmpeg_loader.{c,h} — runtime FFmpeg loader.
  dlopens libavformat.so.61, libavcodec.so.61, libavutil.so.59.
  Resolves 22 function pointers using POSIX-recommended
  *(void**)& dlsym idiom (per POSIX.1-2017 dlsym(3p) Rationale).
- daemon/src/parser.{c,h} — demux loop via avformat_open_input +
  av_read_frame. Per-frame logging on -v.
- daemon/src/log.{c,h} — logging facade (stderr Phase 8.3;
  syslog/journal planned for 8.5+).

Verification on hertz:
  $ ffmpeg -f lavfi -i testsrc=duration=2:size=320x240:rate=30 \
           -c:v libvpx-vp9 -y /tmp/testsrc.ivf
  $ daedalus_v4l2_daemon parse /tmp/testsrc.ivf
  [INFO] FFmpeg loaded: 7.1.3-0+deb13u1+rpt1 (libavformat 61.7.100)
  [INFO] video stream #0: codec=vp9 (Google VP9) 320x240, 0/0 fps
  [INFO] parse complete: 60 frames (1 key) total 17859 bytes

Error paths verified:
- Missing file → "avformat_open_input(...): code -2", exit 1
- No command → usage message, exit 2
- Bad command → usage message, exit 2

Per correctness-before-speed:
- Real CMake (no Makefile hacks)
- pkg-config for headers
- POSIX-conformant dlsym pattern (no -Wpedantic suppression)
- Real signal handling + proper exit codes
- Real logging with timestamp + level
- Headers included at compile-time for type safety; dlopen
  decouples runtime
- All FFmpeg resources freed on every exit path
- Builds clean on -Wall -Wextra -Wpedantic

Phase 8.3 acceptance criteria met:
- ✓ daemon binary builds
- ✓ dlopen FFmpeg at runtime
- ✓ demux a VP9 IVF file end-to-end
- ✓ per-frame metadata logged correctly
- ✓ frame count + keyframe count + byte total accurate

Phase 8.4 next: wire daemon to /dev/daedalus-v4l2 chardev,
add REQ_DECODE / RESP_FRAME handling, drive VP9 decode
end-to-end via daedalus_dispatch_* from daedalus-fourier.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 15:10:22 +00:00
marfrit 895f57c63a Phase 8.2: kernel ↔ daemon chardev bridge with round-trip test
Adds /dev/daedalus-v4l2 misc chardev to the kernel module. The
chardev is the IPC channel for the future userspace decoder
daemon: kernel enqueues REQ_* messages, daemon read()s them,
processes, write()s RESP_* back.

Wire protocol (pre-1.0, header in include/daedalus_v4l2_proto.h):
- struct daedalus_msg_hdr: magic (D04V) + version + type +
  cookie + payload_len + reserved
- Request/response separated by high bit of type field
- Max 64 KiB payload per message
- Cookie correlates request with matching response

Kernel implementation (kernel/daedalus_v4l2_chardev.{c,h}):
- Single-instance chardev (-EBUSY on second open)
- In-kernel FIFO bounded at 64 messages
- Blocking + non-blocking read; poll() reports EPOLLIN when a
  message is queued
- write() parses + validates header, logs response at pr_debug
- Bad magic → -EBADMSG, bad version → -EPROTO, oversize → -EMSGSIZE
- All error paths free resources
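
The header checks listed above can be sketched in userspace as follows. Everything about the layout is an assumption made for illustration: the field widths, the numeric encoding of the "D04V" magic, the version value, and the choice that responses (rather than requests) carry the high type bit. The authoritative definitions live in include/daedalus_v4l2_proto.h. The `-EINVAL` for a short buffer is also an assumption; only the three error codes for magic, version and size come from this message.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical values; the real ones are defined by the proto header. */
#define SKETCH_MAGIC       0x56343044u   /* bytes 'D','0','4','V' as LE u32 */
#define SKETCH_VERSION     1u
#define SKETCH_MAX_PAYLOAD (64u * 1024u)
#define SKETCH_RESP_BIT    0x80000000u

/* Hypothetical layout with the fields named in the commit message. */
struct sketch_msg_hdr {
    uint32_t magic;
    uint32_t version;
    uint32_t type;
    uint32_t cookie;
    uint32_t payload_len;
    uint32_t reserved;
};

/* Assumed convention: the high bit of `type` marks a response. */
static int sketch_is_response(uint32_t type)
{
    return (type & SKETCH_RESP_BIT) != 0;
}

/*
 * Validate a header at the start of `buf` (host byte order assumed).
 * Returns 0 on success or a negative errno matching the list above.
 */
static int sketch_validate(const void *buf, size_t len)
{
    struct sketch_msg_hdr h;

    assert(buf != NULL);
    if (len < sizeof h)
        return -EINVAL;          /* assumption: short write rejected */
    memcpy(&h, buf, sizeof h);   /* avoids unaligned access */

    if (h.magic != SKETCH_MAGIC)
        return -EBADMSG;
    if (h.version != SKETCH_VERSION)
        return -EPROTO;
    if (h.payload_len > SKETCH_MAX_PAYLOAD)
        return -EMSGSIZE;
    return 0;
}
```

Checking magic before version before size means a peer speaking a different protocol entirely gets the most specific error first.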

Phase 8.2 test trigger via debugfs:
- /sys/kernel/debug/daedalus_v4l2/test_ping: writing any byte
  enqueues a PING with a fixed 24-byte payload. To be removed
  in Phase 8.4 once real REQ_DECODE from the V4L2 path takes
  over.

Userspace verification tool (tools/test_chardev_pingpong.c):
- Real C program, proper error reporting via strerror
- Validates the 6-step round-trip: open → empty-queue EAGAIN →
  trigger ping → read PING → verify all fields → write PONG → close
- Builds with -Wall -Wextra clean

Verification on hertz (Pi 5, 6.12.75+rpt-rpi-2712):
  $ sudo insmod daedalus_v4l2.ko
  $ sudo tools/test_chardev_pingpong
  opening /dev/daedalus-v4l2...
    non-blocking read on empty queue: EAGAIN ✓
    injected PING via debugfs ✓
    read PING: magic ✓ version ✓ type=PING ✓ cookie=0x1234 ✓ payload=24 bytes
      payload: "DAEDALUS-V4L2-PING-PL"
    wrote PONG (cookie=0x1234) ✓
  ALL TESTS PASSED.
  $ sudo rmmod daedalus_v4l2      # clean

Per correctness-before-speed: full kerneldoc on structs,
8-column-tab kernel style, SPDX headers, proper error paths,
real test program (not "I ran it once"), failure-mode coverage
documented.

Phase 8.3 next: userspace daemon with dlopen'd FFmpeg parse path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 15:05:54 +00:00