Commit Graph

23 Commits

Author SHA1 Message Date
claude-noether 94be8c3d03 kernel: drain in-flight m2m jobs on daemon disconnect
Fixes issue #146 — daemon-crash (SIGKILL, SEGV, anything that
triggers chardev release) leaves V4L2 consumers in unkillable
TASK_UNINTERRUPTIBLE on /dev/video0 close.

## Root cause

device_run() adds an entry to dev->inflight when it sends a
REQ_DECODE to the daemon, marking the m2m job as "running".
The job is only cleared via v4l2_m2m_buf_done_and_job_finish()
in daedalus_complete_resp_frame(), which only fires on RESP_FRAME.

If the daemon dies (SIGKILL, SEGV, exit) BEFORE writing the
matching RESP_FRAME:
  - the inflight entry is never popped
  - v4l2_m2m_buf_done_and_job_finish is never called
  - the m2m scheduler still thinks a job is running

Later, when the V4L2 consumer's close() runs (or the consumer is
signalled to exit), v4l2_m2m_ctx_release() → v4l2_m2m_cancel_job() waits
for !job_running indefinitely.  The consumer enters D-state and
survives SIGKILL until reboot.

Reproduced on hertz 2026-05-23, kernel 6.12.75+rpt-rpi-2712:

  $ sudo kill -STOP $DAEMON_PID            # block daemon I/O
  $ ./test_m2m_decode keyframe.bin out.nv12 1920 1080 vp9 &
  $ sudo kill -9 $DAEMON_PID               # chardev_release fires
  $ kill -9 $CLIENT_PID                    # ignored — D-state
  # client stack:
  v4l2_m2m_cancel_job+0x14c [v4l2_mem2mem]
  v4l2_m2m_ctx_release+0x20 [v4l2_mem2mem]
  daedalus_release+0x2c [daedalus_v4l2]
  v4l2_release+0x7c [videodev]
  __fput → do_exit → SIGKILL never delivered

## Fix

New API daedalus_drain_inflight_on_disconnect() in main.{c,h}:
walks the in-flight list, marks both src+dst buffers
VB2_BUF_STATE_ERROR via v4l2_m2m_buf_done_and_job_finish(), and
releases the bound media_request if any.  Same completion shape
that daedalus_complete_resp_frame() uses on the success path,
just with state = ERROR for every in-flight entry.

chardev_release calls the drain after flushing dev->req_queue
(messages still in req_queue weren't released to the daemon yet,
so they don't need the m2m-job-finish dance — freeing them is
sufficient).  The order matters: queue first (cheap), then m2m
drain (heavier, takes the inflight list).

Locking: list_splice_init under inflight_lock to take the entire
list atomically; lock dropped before iterating because
v4l2_m2m_buf_done_and_job_finish can sleep via vb2's buffer-done
dispatch and can re-enter device_run via the scheduler (which
would need inflight_lock again on the next REQ_DECODE).
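
The locking pattern described above can be sketched in user space. This is a minimal model under stated assumptions, not the driver code: `struct device`, `struct inflight`, `dev_lock`/`dev_unlock`, `finish_with_error` and `drain_inflight` are hypothetical stand-ins, the lock is a single-threaded flag that asserts on re-entry, and the pointer swap plays the role of list_splice_init.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for one in-flight decode job. */
struct inflight {
    struct inflight *next;
    int cookie;
    int error;              /* set to 1 once completed with an error state */
};

/* Hypothetical device: a non-recursive lock modelled as a flag. */
struct device {
    int lock_held;
    struct inflight *inflight_head;
};

static void dev_lock(struct device *dev)
{
    assert(!dev->lock_held);    /* a second acquire would be a deadlock */
    dev->lock_held = 1;
}

static void dev_unlock(struct device *dev)
{
    assert(dev->lock_held);
    dev->lock_held = 0;
}

/* Stand-in for the completion call, which may need the lock again. */
static void finish_with_error(struct device *dev, struct inflight *inf)
{
    dev_lock(dev);              /* trips the assert if the caller still holds it */
    dev_unlock(dev);
    inf->error = 1;
}

/* Detach the whole list under the lock, then complete entries unlocked. */
static int drain_inflight(struct device *dev)
{
    struct inflight *local, *next;
    int drained = 0;

    dev_lock(dev);
    local = dev->inflight_head;     /* plays the role of list_splice_init() */
    dev->inflight_head = NULL;
    dev_unlock(dev);

    for (; local; local = next) {
        next = local->next;
        local->next = NULL;
        finish_with_error(dev, local);
        drained++;
    }
    return drained;
}
```

Calling `finish_with_error` while the lock was still held would trip the assertion in `dev_lock`, which is the deadlock the commit avoids by dropping the lock before iterating.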

## Verification path

Cannot rmmod the running module on hertz right now — the D-state
corpse from the repro session pins the refcount.  Verification
of the fixed module needs a reboot or fresh test host:

  $ sudo reboot                            # clears hung client
  $ sudo make modules_install              # install new .ko
  $ sudo modprobe daedalus_v4l2
  $ # rerun the repro script — client should die cleanly with
  $ # an -EIO / similar return from poll/DQBUF instead of hanging.

Build: clean on Linux 6.12.75+rpt-rpi-2712, no new warnings.
The pre-existing "frame size 2128 > 2048" warning on
daedalus_device_run is unchanged by this commit.

## Followup not in scope

If a new V4L2 consumer races a REQ_DECODE through device_run
AFTER the drain has spliced the list (but before the daemon
chardev is reopened), the new entry sits in a freshly-empty
inflight list and the same hang can recur for that consumer
when the systemd auto-restart of the daemon either fails or
takes longer than the consumer's patience.  A secondary
safeguard would be to fail-fast in device_run when dev->chardev
is unopened — proposing as a separate ticket if this race
materialises in practice.

Closes #146.
2026-05-23 17:06:06 +02:00
marfrit 714d781d22 Revert "Merge pull request 'kernel + daemon: H.264 B-frame display reorder fix (closes #6)' (#7) from noether/kernel-daemon-h264-reorder-fix into main"
This reverts commit 79256dc7ef, reversing
changes made to 7ff2d897ea.
2026-05-21 14:40:59 +02:00
marfrit 49e60c9bba Revert "Merge pull request 'kernel: claim src/dst at device_run, not at buf_done (fixes panic from #7)' (#8) from noether/kernel-claim-bufs-at-device-run into main"
This reverts commit 6ffe92bcac, reversing
changes made to 79256dc7ef.
2026-05-21 14:40:52 +02:00
claude-noether f10a26d883 kernel: claim src/dst at device_run, not at buf_done
Hard reboot observed on higgs (Pi CM5) during the first mpv vaapi-copy
playback against the freshly-deployed r28+g79256dc stack — kernel
panic, no persistent journal, no recoverable trace.  Bug introduced
by the daedalus-v4l2#6 reorder fix (#7).

Cause
-----
The new completion path runs `v4l2_m2m_job_finish` on SRC_CONSUMED
even when the dst_buf is still parked (waiting for a future
HAS_PIXELS).  job_finish moves the m2m_ctx back to IDLE, the
scheduler dispatches the next device_run — which calls
`v4l2_m2m_next_dst_buf`, which returns the head of the CAPTURE
ready-queue, which is STILL the parked dst_buf because we never
removed it.  Two inflight entries now reference the same vb2_buffer;
the later HAS_PIXELS triggers `v4l2_m2m_dst_buf_remove_by_buf` on a
vb2_buffer whose list_head is no longer linked to that queue, and
`list_del()` smashes the next/prev pointers of whatever ELSE was at
those addresses.

Fix
---
Take both src and dst off `m2m_ctx`'s rdy_queue at device_run — as
soon as `v4l2_m2m_next_*_buf` has peeked them and all early-exit
validation has passed.  After that, the daemon owns both halves
exclusively via the inflight item; the m2m scheduler can't re-issue
them on the next device_run.  Completion path drops the redundant
`_remove_by_buf` calls — list is already detached, so `buf_done`
alone is correct.

Matches the amphion `vdec.c`/`venc.c` pattern (which also claims at
device_run for the same reason: amphion's encode pipeline parks
output buffers across multiple frames waiting for the codec to
finish, structurally the same as our H.264 B-frame DPB parking).

`fail_buf_error` learns about the new `claimed` flag and skips the
`v4l2_m2m_*_buf_remove` calls when the buffers have already been
removed by-buf at device_run.
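
The difference between peeking and claiming can be illustrated with a small user-space queue model. It is only a sketch: `struct rdy_queue`, `rdy_push`, `rdy_peek` and `rdy_claim` are hypothetical names, and buffers are represented by integer ids rather than vb2 buffers.

```c
#include <assert.h>
#include <stddef.h>

#define RDY_MAX 8

/* Hypothetical stand-in for an m2m ready-queue of buffer ids. */
struct rdy_queue {
    int bufs[RDY_MAX];
    size_t head;
    size_t count;
};

static void rdy_push(struct rdy_queue *q, int buf)
{
    assert(q->count < RDY_MAX);
    q->bufs[(q->head + q->count) % RDY_MAX] = buf;
    q->count++;
}

/* Peek: the head stays on the queue, so a second call sees it again. */
static int rdy_peek(const struct rdy_queue *q)
{
    return q->count ? q->bufs[q->head] : -1;
}

/* Claim: detach the head so a later run cannot be handed the same buffer. */
static int rdy_claim(struct rdy_queue *q)
{
    int buf = rdy_peek(q);

    if (buf >= 0) {
        q->head = (q->head + 1) % RDY_MAX;
        q->count--;
    }
    return buf;
}
```

In this model, two consecutive `rdy_peek` calls return the same id, which corresponds to two inflight entries referencing one buffer; `rdy_claim` hands each id out exactly once.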

Verified
--------
Builds clean against 6.18.29+rpt-rpi-2712.  Field test pending —
deploy via marfrit-packages bump in lock-step with the daemon
(which doesn't need to change for this fix; PROTO_VERSION stays at 1).
2026-05-21 13:49:44 +02:00
marfrit 79256dc7ef Merge pull request 'kernel + daemon: H.264 B-frame display reorder fix (closes #6)' (#7) from noether/kernel-daemon-h264-reorder-fix into main
Reviewed-on: #7
2026-05-21 10:36:53 +00:00
claude-noether 15fc2aba14 kernel + daemon: H.264 B-frame display reorder fix (issue #6)
H.264 streams with B-frames showed visibly pair-swapped output in
mpv / Firefox playback through the libva → daedalus_v4l2 path —
"frames went 2 1 4 3 6 5 instead of 1 2 3 4 5 6".  Reproduced in mpv
with --hwdec=vaapi-copy at 720p (bypassing Firefox's compositor),
confirming the bug was in this daemon pipeline, not downstream.

Root cause
----------
libavcodec's H.264 decoder internally reorders output to DISPLAY
order before returning from avcodec_receive_frame.  The daemon
previously called send_packet → receive_frame ONCE per REQ_DECODE
and shipped the resulting pixels in a RESP_FRAME tagged with the
SAME cookie.  For B-frames this is wrong: the frame returned from
receive_frame may belong to an EARLIER bitstream (libavcodec held
it for display-order release).  Cookie N's CAPTURE buffer therefore
got cookie N-2's pixels, while cookie N-2's CAPTURE buffer got
silently marked VB2_BUF_STATE_ERROR (the daemon returned
DAEDALUS_DECODE_NO_FRAME for the cookie whose pixels were held).

Fix shape
---------
Decouple kernel cookie identity (decode-order routing) from
libavcodec's display-ordered output.  Wire-protocol changes:

  REQ_DECODE  + __u64 src_pts        (= src_buf->vb2_buf.timestamp)
  RESP_FRAME  + __u32 flags          (HAS_PIXELS | SRC_CONSUMED)
              + __u64 output_src_pts (= frame->pts on drain)

PROTO_VERSION bumped 0 → 1.  Lock-step rebuild required.

Kernel
------
device_run now mirrors src_buf->vb2_buf.timestamp into req->src_pts
before sending REQ_DECODE, and stores it on the inflight item so
the completion path can stamp dst_buf.timestamp explicitly when
src/dst lifecycles decouple (V4L2_BUF_FLAG_TIMESTAMP_COPY's auto-
pairing no longer applies).

daedalus_complete_resp_frame splits into:

  HAS_PIXELS:    pack pixels into THIS cookie's CAPTURE buffer,
                 stamp dst timestamp from inflight->src_pts,
                 v4l2_m2m_buf_done(dst, DONE/ERROR).
                 No job_finish here.

  SRC_CONSUMED:  release the bound media_request, run
                 v4l2_m2m_buf_done(src) + v4l2_m2m_job_finish so
                 the scheduler can dispatch the next REQ.  dst_buf
                 may still be parked at this point.

Inflight entry is removed and freed only when BOTH src_buf and
dst_buf have been cleared.  Combined HAS_PIXELS|SRC_CONSUMED RESPs
(steady-state VP9/AV1 with no reorder lag) collapse to the prior
1:1 behaviour for free.
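
The split completion rules can be modelled as a small state machine. This is a sketch under assumptions: the flag bit values, `struct inflight_item` and `apply_resp` are illustrative and not taken from the protocol header, and the buffer/scheduler calls are reduced to boolean fields.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative bit assignments; the real values are not given above. */
#define RESP_HAS_PIXELS   (1u << 0)
#define RESP_SRC_CONSUMED (1u << 1)

/* Hypothetical per-cookie state tracked by the kernel side. */
struct inflight_item {
    bool src_pending;   /* OUTPUT buffer not yet returned */
    bool dst_pending;   /* CAPTURE buffer not yet returned */
    bool job_finished;  /* scheduler released for the next request */
};

/* Apply one RESP_FRAME; returns true once the entry may be freed. */
static bool apply_resp(struct inflight_item *it, uint32_t flags)
{
    if ((flags & RESP_HAS_PIXELS) && it->dst_pending)
        it->dst_pending = false;    /* models buf_done(dst), no job_finish */

    if ((flags & RESP_SRC_CONSUMED) && it->src_pending) {
        it->src_pending = false;    /* models buf_done(src) */
        it->job_finished = true;    /* models job_finish */
    }

    /* Free only when BOTH halves have been cleared. */
    return !it->src_pending && !it->dst_pending;
}
```

A standalone SRC_CONSUMED leaves the entry alive with its dst half parked; a combined HAS_PIXELS|SRC_CONSUMED response clears both halves in one call, matching the 1:1 steady-state case.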

Daemon
------
daedalus_decoder_run_request split into three primitives:

  daedalus_decoder_submit       — set pkt->pts = req->src_pts,
                                  avcodec_send_packet.
  daedalus_decoder_drain_one    — avcodec_receive_frame, populate
                                  resp meta + output_src_pts (= the
                                  frame's pts, carried back from
                                  the bitstream that produced it).
  daedalus_decoder_pack_current — pack current AVFrame into the
                                  caller-mapped CAPTURE planes.

chardev_client maintains a small (src_pts → cookie, cached_req)
table indexed linearly (≤64 entries; bounded by V4L2 client buffer
pool depth).  On each REQ_DECODE:

  1. Register (src_pts → cookie) in the table.
  2. submit().
  3. Drain loop: for each frame returned, look up its owner cookie
     via pending_lookup(frame->pts), GET_DMABUF for THAT cookie,
     pack pixels, emit RESP_FRAME(owner_cookie, HAS_PIXELS,
     output_src_pts=frame->pts).  Combine with SRC_CONSUMED when
     owner_cookie equals the current REQ's cookie.
  4. If the current REQ's cookie wasn't drained inside the loop
     (libavcodec held the frame), emit a standalone SRC_CONSUMED
     RESP so the kernel runs job_finish + dispatches the next REQ;
     dst_buf for this cookie stays parked until a future drain
     produces its pixels.
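
The (src_pts → cookie) table in the steps above could look roughly like the following. It is a minimal sketch, assuming 64-bit cookies and a fixed 64-slot array; `struct pending_table`, `pending_register` and `pending_take` are hypothetical names (the commit's own lookup helper is pending_lookup, whose signature is not shown here).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PENDING_MAX 64  /* bound quoted above; cookie width is an assumption */

struct pending_entry {
    bool used;
    uint64_t src_pts;
    uint64_t cookie;
};

struct pending_table {
    struct pending_entry e[PENDING_MAX];
};

/* Step 1: remember which cookie submitted the bitstream with this pts. */
static bool pending_register(struct pending_table *t, uint64_t src_pts,
                             uint64_t cookie)
{
    for (size_t i = 0; i < PENDING_MAX; i++) {
        if (!t->e[i].used) {
            t->e[i].used = true;
            t->e[i].src_pts = src_pts;
            t->e[i].cookie = cookie;
            return true;
        }
    }
    return false;   /* table full */
}

/* Step 3: find the owner of a decoded frame by its pts and free the slot. */
static bool pending_take(struct pending_table *t, uint64_t pts,
                         uint64_t *cookie)
{
    for (size_t i = 0; i < PENDING_MAX; i++) {
        if (t->e[i].used && t->e[i].src_pts == pts) {
            *cookie = t->e[i].cookie;
            t->e[i].used = false;
            return true;
        }
    }
    return false;   /* no pending request with that pts */
}
```

With decode order I, P, B (pts 0, 2, 1) registered as cookies 1, 2, 3, frames leaving the decoder in display order (pts 0, 1, 2) resolve to cookies 1, 3, 2, so each CAPTURE buffer receives the pixels of its own bitstream.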

VP9 / AV1 paths are unchanged in behaviour: one frame per REQ,
HAS_PIXELS|SRC_CONSUMED in one combined RESP.

Verified
--------
Builds clean cross-compiled on higgs against 6.18.29+rpt-rpi-2712
(Pi CM5).  Frame-size warning in device_run is pre-existing
(unchanged by this commit).
2026-05-21 12:32:47 +02:00
claude-noether 69a62a922f kernel: register H.264 DECODE_MODE + START_CODE menu controls
libva-v4l2-request sets V4L2_CID_STATELESS_H264_DECODE_MODE and
V4L2_CID_STATELESS_H264_START_CODE on the device fd at context init
(see libva-v4l2-request-fourier src/context.c:577 — best-effort call,
result is (void)cast).  Our ctrl_handler did not advertise either
control, so v4l2-core returned EINVAL on validate; userspace logged
the noisy

    v4l2-request: Unable to set control(s): Invalid argument
                  (error_idx=2/2 ioctl-level)

at every Firefox/ffmpeg context creation, despite decode itself
succeeding (the daemon already operates as FRAME_BASED + ANNEX_B and
the per-request SPS/PPS/SCALING_MATRIX/DECODE_PARAMS batch lands
fine).

Register both with v4l2_ctrl_new_std_menu, each restricted to the one
value the daemon actually supports (FRAME_BASED for DECODE_MODE,
ANNEX_B for START_CODE), and mask out the unsupported alternates
(SLICE_BASED, NONE).  Pattern matches rkvdec / hantro.  Update the
handler-init capacity hint to ARRAY_SIZE(daedalus_stateless_ctrls)
+ 2 to cover the additions.

Verified: builds clean on 6.18.29+rpt-rpi-2712 (Pi CM5) DKMS source
tree.
2026-05-21 11:01:41 +02:00
marfrit a3ada8ba38 kernel: per-ctx vb2 lock so concurrent clients don't serialise on dev mutex
daedalus_queue_init was wiring both src_vq->lock and dst_vq->lock to
ctx->dev->m2m_lock — a device-wide mutex.  That serialises every
vb2 ioctl (S_FMT, REQBUFS, QBUF, DQBUF, STREAMON, ...) across ALL
concurrent clients of /dev/video0.  For a single-client consumer
like the test_m2m_* tools it doesn't matter; for Firefox, which
spawns separate content + RDD + GPU processes that each open
/dev/video0 and run libva probe simultaneously, the contention
showed up as EBUSY from one libva session's S_FMT(OUTPUT_MPLANE)
when another session was mid-streamon on the same device.

Observable on higgs (Pi CM5):

    $ MOZ_VA_API_ENABLED=1 LIBVA_DRIVER_NAME=v4l2_request firefox
    ...
    v4l2-request: phase 8.10: opened daedalus_v4l2 at video_fd=32 ...
    v4l2-request: cap_pool_init: 24 slots ready
    v4l2-request: Unable to set format for type 10: Device or
                  resource busy

After this fix, each open() gets its own ctx->vb_mutex and the
per-context vb2_queue locks are independent — Firefox's multi-
process VAAPI clients no longer fight each other.  YouTube
playback on higgs runs through daedalus at ~230 fps sustained
(640x368, libavcodec dlopen path), 7× headroom over the 30fps
target.

cedrus / rkvdec / hantro all use the per-ctx vb mutex pattern
for the same reason.  This mirrors them.

Lifecycle:
  - mutex_init in daedalus_open (right after the kzalloc that
    creates ctx, before v4l2_fh_init).
  - mutex_destroy in daedalus_release (after v4l2_fh_exit, before
    kfree), and in the err_ctrl unwind path in daedalus_open.

Verified end-to-end on higgs:
  - rmmod + modprobe the rebuilt .ko.
  - Restart daedalus-v4l2.service.
  - Firefox YouTube playback engages VAAPI, daemon journal shows
    cookie=1..N codec=3 (H.264) REQ_DECODE / decoder:OK pairs
    with unique per-frame fnv1a hashes.
  - No EBUSY in either firefox stderr or daemon journal during
    the entire session.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-20 21:23:44 +02:00
marfrit 29f16ece13 kernel: bind request controls to p_cur before reading them
device_run was reading ctrl->p_cur.p_h264_* directly, but v4l2-m2m's
request scheduler does NOT auto-bind the in-flight media_request's
control values to the ctrl handler's p_cur slots — drivers have to
call v4l2_ctrl_request_setup() explicitly.  cedrus / rkvdec / hantro
all do this in their device_run; daedalus didn't.

Result: daedalus_collect_h264_meta() read stale or default values
(whatever the prior request had left in p_cur, or v4l2_ctrl_new_custom
initial state if no prior request had completed) instead of the
S_EXT_CTRLS V4L2_CTRL_WHICH_REQUEST_VAL values libva-v4l2-request-
fourier had just sent for THIS frame.

The mismatch was a smoking gun on higgs after libva PR #9 / packages
PR #52 landed an instrumentation log at h264_set_controls entry:

  libva boundary (sent to kernel):
    VAProfile=13 seq_fields=0x00032051 pic_fields=0x00000500 num_ref_frames=1
  daedalus daemon (read from kernel p_cur):
    prof=100 level=41 ref_frames=0 flags=0x10 pps_flags=0x0

After calling v4l2_ctrl_request_setup() at the top of device_run:

  daedalus daemon (read from kernel p_cur):
    prof=66 level=11 ref_frames=1 poc_type=2 flags=0x50 pps_flags=0x88

— matches what libva sent, matches the bitstream's actual SPS.

End-to-end test on higgs with libva-v4l2-request-fourier 1.0.0+r382
+gc1bb444 (after-fix-3-and-fix-4-instrumentation) + this kernel
patch:

  $ LIBVA_DRIVER_NAME=v4l2_request ffmpeg -hwaccel vaapi \
      -hwaccel_device /dev/dri/renderD128 -i h264_test.mp4 \
      -frames:v 1 -f null - ...
  rc=0
  daemon journal: zero "error while decoding MB" lines, zero
  "reference frames exceeds max" lines.  Per-frame fnv1a hashes
  differ (0xf1c515aa, 0x16e915e8, 0x16bd16cc, ...) instead of
  the constant 0x6a6a05c5 "give-up-and-zero" hash from before —
  libavcodec is actually decoding real pixel content from each
  P-frame.

Pair note: the kernel completion path already calls
v4l2_ctrl_request_complete in daedalus_complete_resp_frame (line 834);
this commit pairs the setup half with that completion half.

The daemon side change (decoder.c) is a small log-level promotion:
the per-frame "h264 SPS/PPS prepended ..." trace went from log_debug
to log_info so the journal shows what's being shipped into libavcodec
without needing a daemon rebuild with --debug.  Matches the libva-
side h264_set_controls instrumentation that landed in libva PR #9.

Closes part of issue libva-v4l2-request-fourier#8 — the SPS/PPS
field-value gap.  Profile/level still come from libva's session-
derived hardcoded values (h264_profile_to_idc + h264_derive_level_
idc) which is sufficient for libavcodec to accept the synthesised
NAL unit; a true stream-parsed profile/level would need SPS-NAL
parsing in libva — separate operator-design call.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-20 20:35:06 +02:00
marfrit 8c1d9960c4 DAEMON-PPS: synthesise H.264 SPS/PPS NAL units from V4L2 controls
libva-v4l2-request-fourier (and any V4L2-stateless-API consumer)
passes H.264 SPS/PPS as separate V4L2_CID_STATELESS_H264_{SPS,PPS}
controls; only the slice NAL goes into the OUTPUT buffer.  This is
correct per the V4L2 stateless contract.  But libavcodec — which
the daedalus daemon uses for actual decode (Option γ) — wants a
self-contained AnnexB stream including SPS+PPS before any slice.
Result on higgs: "non-existing PPS 0 referenced" + decode_slice_
header errors on every H.264 frame, even after LIBVA-1 and -2
routing correctly delivered the request to the daemon.

Fix splits across kernel + daemon, keeping the kernel module as a
thin transport and putting the actual NAL encoding in userspace:

  include/daedalus_v4l2_proto.h:
    Add struct daedalus_h264_meta (the four v4l2_ctrl_h264_*
    structs the kernel collects) and DAEDALUS_REQ_FLAG_H264_META
    (set in req.flags when the meta block is present between the
    daedalus_req_decode prefix and the slice bitstream).

  kernel/daedalus_v4l2_main.c:
    Add daedalus_collect_h264_meta() — reads the H.264 ctrl values
    from the bound media_request via v4l2_ctrl_find +
    ctrl->p_cur.p_h264_*.  device_run() calls it on H.264 codec_id,
    copies the structs into the REQ_DECODE payload between the
    prefix and bitstream, and sets the flag.  Payload size is
    bounds-checked against DAEDALUS_PROTO_MAX_PAYLOAD so an over-
    sized slice + meta fails loud instead of truncating.

  daemon/src/bitstream_writer.{c,h}:
    New module — MSB-first bit packer with H.264 Exp-Golomb ue(v)
    and se(v) coding + rbsp_trailing_bits alignment.  Sticky
    overflow flag so callers can verify the output buffer wasn't
    truncated.

  daemon/src/h264_nal_synth.{c,h}:
    New module — turns v4l2_ctrl_h264_sps / v4l2_ctrl_h264_pps
    into AnnexB-framed NAL units per ITU-T H.264 7.3.2.1 / 7.3.2.2.
    Emits emulation prevention bytes (0x03 after every 00 00 in the
    EBSP) and the 4-byte start code (0x00000001).  Coverage matches
    what the V4L2 stateless surface gives us: VUI parameters and full
    scaling matrices are NOT emitted (V4L2 doesn't carry them — the
    seq_scaling_matrix_present_flag is set to 0 and libavcodec uses
    flat defaults, which matches the de-facto behaviour of most
    H.264 streams libva-v4l2-request drives).

  daemon/src/decoder.c:
    daedalus_decoder_run_request() now takes an optional
    h264_meta parameter.  For codec_id == H264 with meta != NULL,
    synthesises SPS+PPS NAL units, allocates a combined
    [SPS][PPS][slice] buffer (+ AV_INPUT_BUFFER_PADDING_SIZE), and
    feeds that to avcodec_send_packet instead of the raw slice.
    VP9/AV1 path unchanged (frames are self-contained).  Cleanup
    now goes through a unified `out:` label so the assembled
    buffer is always freed on every exit (including the existing
    decoder_open_codec / no-frame / receive_frame failure paths).

  daemon/src/chardev_client.c:
    handle_req_decode() peels off the optional meta block when the
    flag is set, passes it through to the decoder, and updates
    the payload-length consistency check (now allows for an extra
    sizeof(daedalus_h264_meta) when the flag is on).
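
For the bitstream_writer entry above, an MSB-first bit packer with Exp-Golomb coding can be sketched as follows. This is an illustrative version, not the module's actual code: `struct bitwriter` and the `bw_*` names are hypothetical, and the value ranges are restricted by assertions to keep the arithmetic simple.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct bitwriter {
    uint8_t *buf;
    size_t cap;         /* capacity in bytes */
    size_t bitpos;      /* index of the next bit to write */
    bool overflow;      /* sticky: stays set once a write did not fit */
};

static void bw_init(struct bitwriter *bw, uint8_t *buf, size_t cap)
{
    memset(buf, 0, cap);
    bw->buf = buf;
    bw->cap = cap;
    bw->bitpos = 0;
    bw->overflow = false;
}

/* Write the low n bits of value, most significant bit first. */
static void bw_put_bits(struct bitwriter *bw, uint32_t value, unsigned n)
{
    assert(n <= 32);
    for (unsigned i = n; i-- > 0; ) {
        if (bw->bitpos >= bw->cap * 8) {
            bw->overflow = true;
            return;
        }
        if ((value >> i) & 1u)
            bw->buf[bw->bitpos / 8] |= (uint8_t)(0x80u >> (bw->bitpos % 8));
        bw->bitpos++;
    }
}

/* ue(v): len zero bits, then (value + 1) written in len + 1 bits. */
static void bw_put_ue(struct bitwriter *bw, uint32_t value)
{
    assert(value < UINT32_MAX);
    uint32_t code = value + 1;
    unsigned len = 0;

    while ((code >> len) > 1)
        len++;                      /* len = floor(log2(code)) */
    bw_put_bits(bw, 0, len);
    bw_put_bits(bw, code, len + 1);
}

/* se(v): k > 0 maps to 2k - 1, k <= 0 maps to -2k, then ue(v). */
static void bw_put_se(struct bitwriter *bw, int32_t value)
{
    assert(value > -(1 << 30) && value < (1 << 30));
    uint32_t mapped = value > 0 ? (uint32_t)value * 2u - 1u
                                : (uint32_t)(-value) * 2u;
    bw_put_ue(bw, mapped);
}

/* rbsp_trailing_bits: one stop bit, then zero bits up to a byte boundary. */
static void bw_put_trailing_bits(struct bitwriter *bw)
{
    bw_put_bits(bw, 1, 1);
    while (bw->bitpos % 8 != 0 && !bw->overflow)
        bw_put_bits(bw, 0, 1);
}
```

Callers would check `overflow` once after building a NAL payload instead of after every write, which is the convenience a sticky flag provides.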

Build (boltzmann aarch64): clean compile of all daemon sources,
including bitstream_writer + h264_nal_synth + the refactored
decoder.c.  Kernel module compile to be verified via DKMS rebuild
on higgs in the marfrit-packages bump that follows.

Test plan: with this commit + a marfrit-packages daedalus pin
bump, higgs's ffmpeg -hwaccel vaapi -i h264_test.mp4 should
produce a successful decode (vs. the previous "non-existing PPS 0
referenced" failure).  The daemon log should show:
  decoder: opened h264 context
  decoder: h264 prepended SPS=NB PPS=MB slice=KB
  decoder: OK 320x240 fmt=0 (yuv420p) fnv1a=0x...

VP9 / AV1 behaviour unchanged — they don't carry meta and the
existing per-frame self-describing path still applies.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-20 17:35:24 +02:00
marfrit f0cd29a340 kernel: v4l2_fh_add/del gained file* arg in 6.18 — version-conditional
DKMS build failure on higgs (Pi CM5, kernel 6.18.29+rpt-rpi-2712):

  daedalus_v4l2_main.c:1049: error: too few arguments to function 'v4l2_fh_add'
  v4l2-fh.h:97: void v4l2_fh_add(struct v4l2_fh *fh, struct file *filp);
  daedalus_v4l2_main.c:1063: error: too few arguments to function 'v4l2_fh_del'

Signature changed exactly at v6.18 (verified v6.13–v6.17 still use the
one-arg form via raw.githubusercontent.com tag walk). Wrap the calls
with LINUX_VERSION_CODE >= KERNEL_VERSION(6, 18, 0) so the module
keeps building against:

  * 6.12 LTS / RPi 6.12.75 (one-arg)        — hertz
  * 6.12.88+deb13-arm64 (one-arg)
  * 6.18.29+rpt-rpi-2712 (file* arg)        — higgs running kernel

Build verified on both: hertz 6.12.75 clean, higgs 6.18.29 clean +
modprobe daedalus_v4l2 succeeds, /dev/daedalus-v4l2 + /dev/video0
appear.

Add #include <linux/version.h> for KERNEL_VERSION + LINUX_VERSION_CODE
(also pulled transitively via module.h but explicit is better than
implicit).
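
The version guard can be tried out in user space with stubs. This sketch makes several assumptions: `KERNEL_VERSION` is redefined locally to mirror the kernel's macro, `LINUX_VERSION_CODE` is pinned to a pretend 6.12.75 build, the `v4l2_fh_add` stubs and both structs are placeholders, and `daedalus_fh_add` is a hypothetical wrapper name not taken from the commit.

```c
#include <assert.h>

/* Mirrors the kernel's macro so the sketch builds outside a kernel tree. */
#ifndef KERNEL_VERSION
#define KERNEL_VERSION(a, b, c) \
    (((a) << 16) + ((b) << 8) + ((c) > 255 ? 255 : (c)))
#endif

/* Assumption: pretend we are building against 6.12.75. */
#ifndef LINUX_VERSION_CODE
#define LINUX_VERSION_CODE KERNEL_VERSION(6, 12, 75)
#endif

/* The sublevel is capped at 255, so 6.18.0 still sorts above any 6.17.x. */
static_assert(KERNEL_VERSION(6, 18, 0) > KERNEL_VERSION(6, 17, 255),
              "minor version must dominate the sublevel");

/* Placeholder types standing in for the kernel structures. */
struct v4l2_fh { int added; };
struct file { int unused; };

#if LINUX_VERSION_CODE >= KERNEL_VERSION(6, 18, 0)
/* Stub with the two-argument form. */
static void v4l2_fh_add(struct v4l2_fh *fh, struct file *filp)
{
    (void)filp;
    fh->added = 1;
}
#define daedalus_fh_add(fh, filp) v4l2_fh_add((fh), (filp))
#else
/* Stub with the one-argument form. */
static void v4l2_fh_add(struct v4l2_fh *fh)
{
    fh->added = 1;
}
#define daedalus_fh_add(fh, filp) ((void)(filp), v4l2_fh_add((fh)))
#endif
```

Call sites then use one spelling, `daedalus_fh_add(&ctx->fh, file)`, and the preprocessor selects the matching signature at build time.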

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 10:15:24 +02:00
marfrit f55b2cd002 kernel: media_request_get/put around inf->req (UAF safety)
Sonnet pre-deployment review flagged a SHIP-WITH-EYES-OPEN risk:
Phase 8.13's inf->req captured src_buf->vb2_buf.req_obj.req as a
raw pointer with no media_request_get(). On the normal decode
path that's fine because vb2-core holds its own reference until
v4l2_m2m_buf_done_and_job_finish releases it.

But on a concurrent cancel (MEDIA_IOC_REQUEST_REINIT or a process
kill triggering buf_request_complete from the cancel path before
RESP_FRAME comes back), vb2 could drop its reference first. Our
inf->req would then dangle through v4l2_ctrl_request_complete +
buf_done_and_job_finish — UAF.

Fix matches the cedrus / rkvdec pattern: take our own reference
when we capture the pointer, release it after we're done with it
(after buf_done_and_job_finish to keep the ordering crystal-clear).

  /* in daedalus_device_run, after inf->req = src_buf->...->req */
  if (inf->req)
      media_request_get(inf->req);

  /* in daedalus_complete_resp_frame, after buf_done_and_job_finish */
  if (inf->req)
      media_request_put(inf->req);

Verified on hertz:
- libva path (request-bound, inf->req != NULL): byte-exact NV12,
  same FNV-1a as standalone.
- test_m2m_stream (direct QBUF, inf->req == NULL): 30/30 frames
  decoded, conditional skip works.
- No kernel oops / WARN, no leak in dmesg.

Add #include <media/media-request.h> for the helpers.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 18:39:10 +00:00
marfrit f04d7000f8 Phase 8.13: byte-exact end-to-end via libva (consumer target hit)
The project's consumer-side goal landed: a real VAAPI consumer
(ffmpeg with -hwaccel vaapi) drives our libva backend → V4L2
driver → daemon → byte-exact NV12 output back to ffmpeg.

  ffmpeg -hwaccel vaapi -hwaccel_device /dev/dri/renderD128 \
         -hwaccel_output_format nv12 -i vp9_small.ivf \
         -f rawvideo -y /tmp/vp9_via_libva.nv12
  cmp /tmp/vp9_via_libva.nv12 /tmp/vp9_ref_for_libva.nv12  → match

18432-byte NV12 byte-for-byte identical to plain ffmpeg
-pix_fmt nv12 software decode. The project_consumer_target
memory's deliverable shape — "V4L2 stateless node consumed by
a real VAAPI client" — is achieved.

Two related kernel changes:

1. v4l2_ctrl_handler_setup(&ctx->hdl) after registration —
   matches rkvdec/cedrus/hantro. Brings each registered
   compound control out of "uninitialised" state via
   std_init_compound defaults.

2. Per-request control completion in the decode path —
   the real fix for "Timeout when waiting for media request".
   vb2-core's vb2_buffer_done unbinds the BUFFER's req_obj
   on normal decode completion, but the per-request CONTROL
   object stays bound. buf_request_complete fires only from
   queue-cancel paths (vb2-core line 2284), NOT from normal
   buf_done. The driver must call
   v4l2_ctrl_request_complete(req, hdl) explicitly from the
   completion path.

   struct daedalus_inflight gained a `struct media_request
   *req` field, captured from src_buf->vb2_buf.req_obj.req
   in device_run. daedalus_complete_resp_frame then calls
   v4l2_ctrl_request_complete before
   v4l2_m2m_buf_done_and_job_finish — triggers
   MEDIA_REQUEST_STATE_COMPLETE and wakes the request fd
   poll.

   For non-request flows (test_m2m_stream direct QBUF)
   inf->req is NULL; the conditional skips the call.
   Both consumer styles work concurrently.

Diagnostic clarification (was Phase 8.13a):

strace traced three S_EXT_CTRLS calls per frame:
  1. H264_PROFILE + H264_LEVEL → EINVAL  (we don't register)
  2. HEVC_PROFILE + HEVC_LEVEL → EINVAL  (we don't register)
  3. VP9_FRAME + VP9_COMPRESSED_HDR → SUCCESS

The first two are harmless: libva probes whether we support
H264/HEVC integer profile/level controls during config
negotiation; we don't (we expose them as stateless), so EINVAL
just falls through. The actual VP9 stateless controls (#3)
succeeded all along — the libva-side "Unable to set control(s)"
log was misleading us into thinking the control path was the
bug.

Verification on hertz (Pi 5, 6.12.75+rpt-rpi-2712):

  daemon log:
    REQ_DECODE cookie=1 codec=1 bitstream=1566 bytes capture=128x96 1 planes
    decoder: opened vp9 context
    decoder: OK 128x96 fmt=0 (yuv420p) fnv1a=0x1eb34bfe ...

  ffmpeg side:
    no Timeout, no Decoding error
    /tmp/vp9_via_libva.nv12: 18432 bytes

  cmp vs reference: byte-for-byte identical.

Roadmap update:
- 8.10/8.11, 8.12, 8.13 marked closed with closure docs.
- 8.14 = multi-frame VP9 via libva, AV1 + H.264, mpv/Firefox
  higher-level consumers.

Per correctness-before-speed:
- strace + kernel-source-reading found the actual root cause
  rather than guessing.
- Conditional v4l2_ctrl_request_complete preserves the existing
  test_m2m_stream non-request path — both consumer styles work
  concurrently without per-flow branching elsewhere.
- Byte-exact pixel comparison, not "frame size matches."

Phase 8.14 next: multi-frame stream + multi-codec via libva +
mpv/Firefox.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 18:14:34 +00:00
marfrit a7d585eee8 Phase 8.12: first VP9 frame decoded via libva
ffmpeg -hwaccel vaapi → libva-v4l2-request-fourier →
/dev/video0 → daedalus_v4l2 kernel → REQ_DECODE on the
chardev → daemon FFmpeg decode → byte-exact NV12 (FNV-1a
0x1eb34bfe, same hash the standalone test_m2m_stream
produces for the same 128x96 VP9 keyframe).

The pixel-correct decode through the libva path is the
milestone. What's NOT yet working: libva times out on the
media_request fd because buf_request_complete never fires
(vb->req_obj.req is NULL when buf_done runs — the
S_EXT_CTRLS EINVAL leaves the buffer un-bound to the
request even though the buffer queues anyway). Phase 8.13
fixes the EINVAL so the request bind takes and the
completion signal propagates.

Kernel V4L2 request API integration:
- media_device_ops.req_validate / req_queue = vb2_request_
  validate / v4l2_m2m_request_queue (Phase 8.11) —
  MEDIA_IOC_REQUEST_ALLOC succeeds.
- vb2_queue.supports_requests = true on OUTPUT queue —
  without this v4l2-core rejects S_EXT_CTRLS(REQUEST_VAL).
- vb2_ops.buf_request_complete = daedalus_buf_request_complete
  → v4l2_ctrl_request_complete(req, &ctx->hdl). Without
  this v4l2-core WARNs at videobuf2-v4l2.c:440.
- vb2_ops.buf_out_validate: sets field=V4L2_FIELD_NONE on
  OUTPUT buf. Required for the same WARN check.
- requires_requests intentionally NOT set: lets the
  existing test_m2m_stream (direct QBUF, no request) keep
  working alongside the libva path.

Stateless control re-registration:
- Switched from v4l2_ctrl_new_std_compound(NULL p_def) to
  v4l2_ctrl_new_custom(&cfg, NULL) — pattern rkvdec /
  cedrus / hantro use. v4l2-core auto-fills elem_size +
  type from std table (verified: VP9_FRAME elem_size=168,
  matches sizeof(struct v4l2_ctrl_vp9_frame)).
- No-op s_ctrl callback so SET requests don't crash —
  daemon ignores values, FFmpeg re-parses the bitstream.

Verification on hertz (Pi 5, 6.12.75+rpt-rpi-2712):

  ffmpeg -hwaccel vaapi -i vp9_small.ivf …
  daemon: REQ_DECODE cookie=1 codec=1 bitstream=1566 bytes capture=128x96 1 planes
  daemon: decoder: opened vp9 context
  daemon: decoder: OK 128x96 fmt=0 (yuv420p) fnv1a=0x1eb34bfe …

Same FNV-1a hash as the standalone test_m2m_stream produces
for the same VP9 keyframe. End-to-end through libva.

Remaining (Phase 8.13):
- S_EXT_CTRLS EINVAL on V4L2_CID_STATELESS_VP9_FRAME despite
  matching elem_size — needs deeper validate-path debugging.
- Once the request bind takes, buf_request_complete fires
  on buf_done, request fd signals completion, libva DQBUFs
  the decoded NV12.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 18:01:26 +00:00
marfrit 0de0288dce Phase 8.10+8.11: libva consumer integration scaffold
Brings daedalus_v4l2 from "standalone test client" to "VAAPI-
discoverable decoder" by adding the surface formats and
media-controller plumbing that libva-v4l2-request-fourier
(sibling repo) requires.

libva-v4l2-request-fourier patches (pushed separately):
- b5b3acf: daedalus_v4l2 added to known_decoder_drivers
- 2146341: meson option gate

This commit (daedalus-v4l2 side, 3 production changes):

1. V4L2_PIX_FMT_NV12 (single-plane) on CAPTURE
   - Added to daedalus_capture_formats[] alongside NV12M + P010
   - daedalus_fill_capture_fmt handles num_planes=1 case
     (sizeimage = W*H*3/2, bytesperline = W)
   - daemon pack_nv12_single_to_plane: Y at base+0,
     interleaved CbCr at base+(stride*H); same byte content
     as NV12M two-plane, different layout
   - Required because libva-v4l2-request-fourier's video.c
     only knows non-multi-plane NV12 (it advertises
     v4l2_mplane=true but uses the single-plane fourcc).
   - Verified byte-exact via test_m2m_stream against
     ffmpeg -pix_fmt nv12 reference (VP9 1080p 10 frames,
     31 MB).

2. V4L2 Request API media ops
   - daedalus_media_ops = { vb2_request_validate,
     v4l2_m2m_request_queue } assigned to mdev.ops before
     media_device_init.
   - Without this, MEDIA_IOC_REQUEST_ALLOC returned
     -ENOTTY and no VAAPI consumer could allocate a
     media_request.

3. Stateless control registration via v4l2_ctrl_new_custom
   - Switched from v4l2_ctrl_new_std_compound(NULL p_def)
     to v4l2_ctrl_new_custom — pattern rkvdec/cedrus/
     hantro use. Adds a no-op s_ctrl callback.
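
The single-plane NV12 layout from item 1 (Y at base+0, interleaved CbCr at base+(stride*H), sizeimage = W*H*3/2) can be sketched as below. This is an illustration rather than the daemon's pack_nv12_single_to_plane: `nv12_sizeimage` and `pack_yuv420p_to_nv12` are hypothetical names, and the source planes are assumed to be tightly packed yuv420p with even dimensions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* sizeimage for single-plane NV12 when bytesperline == width. */
static size_t nv12_sizeimage(unsigned w, unsigned h)
{
    return (size_t)w * h * 3 / 2;
}

/*
 * Pack planar yuv420p into one NV12 plane.
 * Assumption: y, u and v are tightly packed (no row padding).
 */
static void pack_yuv420p_to_nv12(uint8_t *base, size_t stride,
                                 unsigned w, unsigned h,
                                 const uint8_t *y, const uint8_t *u,
                                 const uint8_t *v)
{
    assert(w % 2 == 0 && h % 2 == 0 && stride >= w);

    /* Luma rows start at base + 0. */
    for (unsigned row = 0; row < h; row++)
        memcpy(base + (size_t)row * stride, y + (size_t)row * w, w);

    /* Interleaved CbCr rows start right after the luma rows. */
    uint8_t *uv = base + stride * h;
    for (unsigned row = 0; row < h / 2; row++) {
        for (unsigned col = 0; col < w / 2; col++) {
            size_t src = (size_t)row * (w / 2) + col;
            uv[(size_t)row * stride + 2 * col] = u[src];
            uv[(size_t)row * stride + 2 * col + 1] = v[src];
        }
    }
}
```

For a 128x96 frame, `nv12_sizeimage` gives 18432 bytes.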

Verification (hertz, Pi 5, 6.12.75+rpt-rpi-2712):

LibVA trace through `ffmpeg -hwaccel vaapi`:
  vaInitialize / Profiles / Entrypoints / CreateConfig /
  QuerySurfaceAttributes / CreateSurfaces / CreateContext
  (cap_pool: 24 slots, 1 plane each) / CreateBuffer
  (slice + picture params) / MEDIA_IOC_REQUEST_ALLOC
  — all succeed.

Standalone NV12 decode path:
  test_m2m_stream vp9_1080_stream.ivf out.nv12 1920 1080 vp9 nv12
  → 10/10 frames, byte-exact vs ffmpeg -pix_fmt nv12

vainfo (via libva-v4l2-request-fourier with our driver):
  7 VAProfile entries with VAEntrypointVLD
  (H264 Main/High/CBaseline/MultiviewHigh/StereoHigh,
   VP9Profile0, AV1Profile0)

What's NOT here (Phase 8.12):

The libva trace stops at VIDIOC_S_EXT_CTRLS returning
EINVAL when populating V4L2_CID_STATELESS_VP9_FRAME on
the request. Validation of the compound-control payload
against the kernel's expected struct shape rejects it.
This isn't a "missing line" fix — it needs proper
stateless control plumbing (the SPS/PPS/SliceParams
get_dims, validate, default-value paths that in-tree
rkvdec/cedrus/hantro implement to satisfy v4l2-core's
std_validate). Documented as Phase 8.12 scope.

The shipped integration is itself a meaningful deliverable:
all the framework scaffolding is in place; the remaining
gap is well-characterised and bounded.

See docs/phase_8_10_11_closure.md for the full trace
analysis + next-phase plan.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 17:51:16 +00:00
marfrit 1ae9528e76 Phase 8.8: throughput baseline + multi-codec streams + HDR
Per the correctness-before-speed principle: measure before
optimising. Roadmap going in said "QPU dispatch substitution
to hit 30fps@1080p". Measurement on hertz shows the FFmpeg
software path already hits 65-88 fps@1080p across all three
codecs — QPU substitution would be premature optimisation.

So 8.8 ships what's actually useful:
1. Per-frame timing in test_m2m_stream.
2. Multi-frame AV1 + H.264 streams verified byte-exact at
   1080p (closes the "VP9-only stream tests" gap from 8.7).
3. HDR / 10-bit via V4L2_PIX_FMT_P010 + daemon
   pack_p010_to_plane.

Test harness (tools/test_m2m_stream.c):
- Per-frame µs timing via CLOCK_MONOTONIC; reports mean/p50/
  p99/min/max + wall ms + fps.
- Annex-B H.264 parser: split on 3-/4-byte start codes,
  accumulate NALs into access units (push on VCL NAL types
  1 or 5). Without AU grouping FFmpeg rejects SPS/PPS-only
  buffers as "no frame!".
- Format auto-detect (DKIF magic → IVF; else Annex-B).
- Optional 6th arg `[capture]`: nv12m | p010.
- CAPTURE mmap path generalised for num_planes==1 (P010).
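
The Annex-B splitting and AU grouping described above can be
sketched as follows. This is a minimal illustration, not the code in
tools/test_m2m_stream.c: the function names and the au_end output
array are hypothetical, and it assumes one slice per picture, so
every VCL NAL (type 1 or 5) closes an access unit.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Return the offset of the next 00 00 01 start code at or after pos,
 * or len if there is none. A 4-byte start code (00 00 00 01) is found
 * via its trailing three bytes.
 */
static size_t annexb_next_start(const uint8_t *buf, size_t len, size_t pos)
{
	for (; pos + 3 <= len; pos++)
		if (buf[pos] == 0 && buf[pos + 1] == 0 && buf[pos + 2] == 1)
			return pos;
	return len;
}

/*
 * Hypothetical sketch: walk the NAL units and record the exclusive end
 * offset of each access unit in au_end[]. Non-VCL NALs (SPS, PPS, ...)
 * accumulate into the AU that the next VCL NAL closes.
 * Returns the number of access units found (at most max_aus).
 */
static size_t annexb_split_aus(const uint8_t *buf, size_t len,
			       size_t *au_end, size_t max_aus)
{
	size_t n = 0;
	size_t sc = annexb_next_start(buf, len, 0);

	while (sc < len && n < max_aus) {
		size_t nal = sc + 3;	/* first byte after the start code */
		size_t next, end;
		unsigned int type;

		if (nal >= len)
			break;
		next = annexb_next_start(buf, len, nal);
		end = next;
		/* The leading zero of a 4-byte start code is not NAL data. */
		if (next < len && end > nal && buf[end - 1] == 0)
			end--;

		type = buf[nal] & 0x1f;	/* nal_unit_type */
		if (type == 1 || type == 5)
			au_end[n++] = end;
		sc = next;
	}
	return n;
}
```

Each AU then spans from the previous au_end (or 0) to its own
au_end, which is the buffer a client would queue on OUTPUT.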

Kernel (kernel/daedalus_v4l2_main.c):
- CAPTURE formats array {NV12M, P010}; enum_fmt walks it.
- daedalus_fill_capture_fmt takes a fourcc:
    NV12M: 2 planes, W*H + W*H/2 bytes, bpl=W
    P010:  1 plane,  W*H*2 + W*H bytes, bpl=W*2
- try_fmt preserves caller fourcc when supported.
- daedalus_complete_resp_frame's dmabuf path now sets each
  plane's payload to vb2_plane_size(vb,p) — generalises
  cleanly across 1-plane (P010) and 2-plane (NV12M) layouts;
  the daemon fully populates the plane so payload =
  sizeimage.
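
The per-format sizes listed for daedalus_fill_capture_fmt reduce to
a small table. The sketch below restates that arithmetic in
userspace C; enum cap_fmt, struct cap_layout and cap_fill_layout
are hypothetical stand-ins, not the driver's types, and no
alignment or stride padding is modelled.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the two CAPTURE fourccs. */
enum cap_fmt { CAP_FMT_NV12M, CAP_FMT_P010 };

struct cap_layout {
	unsigned int num_planes;
	size_t sizeimage[2];	/* bytes per plane */
	size_t bytesperline;	/* luma row pitch in bytes */
};

/* Fill the layout for a W x H frame; returns -1 for an unknown format. */
static int cap_fill_layout(enum cap_fmt fmt, size_t w, size_t h,
			   struct cap_layout *out)
{
	switch (fmt) {
	case CAP_FMT_NV12M:
		out->num_planes = 2;
		out->sizeimage[0] = w * h;	/* Y */
		out->sizeimage[1] = w * h / 2;	/* interleaved CbCr */
		out->bytesperline = w;
		return 0;
	case CAP_FMT_P010:
		out->num_planes = 1;
		/* 16-bit samples: Y is W*H*2 bytes, CbCr is W*H bytes. */
		out->sizeimage[0] = w * h * 2 + w * h;
		out->sizeimage[1] = 0;
		out->bytesperline = w * 2;
		return 0;
	}
	return -1;
}
```

At 1920x1080 the P010 case gives 6,220,800 bytes per frame, which
is consistent with the 62 MB reported for 10 frames.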

Daemon (daemon/src/decoder.c):
- pack_p010_to_plane: YUV420P10LE → P010 single-plane.
  10-bit samples shifted left by 6 to MSB-align in 16-bit
  words per V4L2 ABI. Y at base+0, interleaved CbCr right
  after Y plane (per format spec for single-plane P010).
  Strips source stride padding; respects destination stride.
- daedalus_decoder_run_request dispatches on
  req->capture_pix_fmt (NV12M → pack_nv12_to_planes; P010
  → pack_p010_to_plane; else warn + skip).
- Includes <linux/videodev2.h> for fourcc constants.
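
The P010 packing rule (shift each 10-bit sample left by 6, Y first,
interleaved CbCr right after) can be sketched as below. This is not
the daemon's pack_p010_to_plane: p010_pack, put_le16 and the
parameter list are hypothetical, source strides are counted in
samples, and even dimensions are assumed.

```c
#include <stddef.h>
#include <stdint.h>

/* Store a 16-bit value little-endian, independent of host byte order. */
static void put_le16(uint8_t *p, uint16_t v)
{
	p[0] = (uint8_t)(v & 0xff);
	p[1] = (uint8_t)(v >> 8);
}

/*
 * Hypothetical sketch: pack planar 10-bit 4:2:0 samples (value in the
 * low 10 bits of each uint16_t) into single-plane P010.
 * dst_stride is in bytes; y_stride and c_stride are in samples.
 */
static void p010_pack(uint8_t *base, size_t dst_stride,
		      size_t width, size_t height,
		      const uint16_t *y, size_t y_stride,
		      const uint16_t *cb, const uint16_t *cr,
		      size_t c_stride)
{
	uint8_t *uv = base + dst_stride * height;
	size_t row, col;

	/* Luma: MSB-align each sample in its 16-bit word. */
	for (row = 0; row < height; row++)
		for (col = 0; col < width; col++)
			put_le16(base + row * dst_stride + 2 * col,
				 (uint16_t)(y[row * y_stride + col] << 6));

	/* Chroma: Cb,Cr pairs, one row per two luma rows. */
	for (row = 0; row < height / 2; row++) {
		for (col = 0; col < width / 2; col++) {
			uint8_t *d = uv + row * dst_stride + 4 * col;

			put_le16(d, (uint16_t)(cb[row * c_stride + col] << 6));
			put_le16(d + 2, (uint16_t)(cr[row * c_stride + col] << 6));
		}
	}
}
```

Reading the source through its own stride and writing through
dst_stride is what lets the copy strip source padding while
respecting the destination pitch.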

Verification on hertz (Pi 5, 6.12.75+rpt-rpi-2712):

1080p throughput baseline (30 frames testsrc, dmabuf path):

  VP9   1080p:  mean 12.0 ms,  p99 15.9 ms,  fps **83.1**, byte-exact ✓
  AV1   1080p:  mean 15.4 ms,  p99 41.0 ms,  fps **65.0**, byte-exact ✓
  H.264 1080p:  mean 11.3 ms,  p99 21.5 ms,  fps **88.3**, byte-exact ✓

All 2-3× over the 30fps-floor-is-fine criterion.
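
For reference, summary figures of this kind can be derived from the
per-frame durations roughly as follows. This is a sketch under
assumptions, not the harness code: struct frame_stats and both
function names are hypothetical, percentiles use the nearest-rank
method, and fps is computed from wall time rather than from the
mean.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

struct frame_stats {
	double mean_us;
	uint64_t p50_us, p99_us, min_us, max_us;
	double fps;
};

static int cmp_u64(const void *a, const void *b)
{
	uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;

	return (x > y) - (x < y);
}

/* Nearest-rank percentile on an ascending array; p in 1..100, n > 0. */
static uint64_t percentile(const uint64_t *sorted, size_t n, unsigned int p)
{
	size_t rank = (p * n + 99) / 100;	/* ceil(p * n / 100) */

	return sorted[rank ? rank - 1 : 0];
}

/*
 * Summarise n per-frame durations (microseconds). Sorts us[] in place.
 * Returns -1 on empty input or zero wall time.
 */
static int frame_stats_compute(uint64_t *us, size_t n, uint64_t wall_us,
			       struct frame_stats *out)
{
	uint64_t sum = 0;
	size_t i;

	if (n == 0 || wall_us == 0)
		return -1;

	qsort(us, n, sizeof(*us), cmp_u64);
	for (i = 0; i < n; i++)
		sum += us[i];

	out->mean_us = (double)sum / (double)n;
	out->p50_us = percentile(us, n, 50);
	out->p99_us = percentile(us, n, 99);
	out->min_us = us[0];
	out->max_us = us[n - 1];
	out->fps = (double)n * 1e6 / (double)wall_us;
	return 0;
}
```

With only 30 samples the nearest-rank p99 is simply the slowest
frame, so p99 and max coincide for runs of this length.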

HDR / 10-bit 1080p P010:
  10 frames, 62 MB output, fps **48.8**, byte-exact vs
  `ffmpeg -pix_fmt p010le -f rawvideo`.

Small-frame P010 (320×240): fps 966 — fixed daemon overhead
dominates at low resolutions.

v4l2-compliance unchanged from 8.7: 49/49 passing.
Format enumeration confirms NM12 + P010 on CAPTURE.

Clean SIGTERM + rmmod; no kernel oops/WARN.

Roadmap update (docs/roadmap.md):
- 8.8 marked closed with closure-doc reference, including
  the explicit "QPU substitution not needed" rationale.
- 8.9 reshaped: libva-v4l2-request consumer integration
  (per project_consumer_target memory) — the actual
  user-facing endpoint.

Per correctness-before-speed:
- Measured first; QPU work explicitly justified-out via data.
- Byte-exact pixel comparison for every codec/format combo
  (NV12: VP9, AV1, H.264; P010: VP9 10-bit at 320×240 and
  1080p).
- AU grouping in the Annex-B parser is the correct
  semantic boundary, not just a workaround.
- vb2_plane_size for payload generalises to any plane
  count, not hardcoded to 2.

Phase 8.9 next: libva-v4l2-request integration — close
the loop from YouTube/Firefox to /dev/video0 + daemon
playback.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 16:34:05 +00:00
marfrit 5965805d86 Phase 8.7: media controller + multi-frame streaming verification
Two pieces — both shipped:

1. Media controller binding closes the last v4l2-compliance
   failure from 8.6 (DECODER_CMD, which requires has_media on
   stateless decoders) and unlocks the V4L2 request API for
   libva-v4l2-request.

2. Multi-frame streaming test exercises the daemon's
   AVCodecContext state preservation across many REQ_DECODE
   calls — Phase 8.6's tests pushed exactly one keyframe per
   invocation; real content has P-frame references.

Compliance now reaches **49/49 passing.**

Kernel (kernel/daedalus_v4l2_main.{c,h}):
- Added `struct media_device mdev` to daedalus_dev.
- media_device_init(&mdev) BEFORE v4l2_device_register so
  v4l2-core sees v4l2_dev.mdev = &mdev and binds the m2m
  entities into the graph during register.
- After video_register_device:
  v4l2_m2m_register_media_controller(..., MEDIA_ENT_F_PROC_VIDEO_DECODER)
  then media_device_register so userspace sees the complete
  graph in /dev/mediaN with the decoder entity tagged.
- daedalus_remove unwinds in reverse: unregister media,
  unregister mc, unregister video, release m2m, unregister
  v4l2, cleanup mdev.
- Error paths added for both new failure points.

Test harness (tools/test_m2m_stream.c, new):
- Multi-frame V4L2 m2m client: parses IVF → 4-deep buffer
  rings on both queues → per-frame QBUF/DQBUF loop →
  concatenates decoded NV12 to output file. Returns 0 only
  if every input frame decoded without error.
- Same codec vocabulary as test_m2m_decode (vp9 | av1 |
  h264 via 5th arg).

Verification on hertz (Pi 5, 6.12.75+rpt-rpi-2712):

v4l2-compliance: 49 tests, 49 passed, 0 failed, 0 warnings.

  $ v4l2-ctl --list-devices
  daedalus-fourier V3D7+NEON (platform:daedalus_v4l2):
        /dev/video0
        /dev/media3

VP9 320×240 30 frames (1 keyframe + 29 P-frames, 3.46 MB
NV12): byte-for-byte match vs `ffmpeg -i in.ivf -pix_fmt
nv12 -f rawvideo`.

VP9 1920×1080 10 frames (31 MB NV12 through the dmabuf
path): byte-for-byte match vs same reference command.

Daemon log shows cookies 1..30 all completing cleanly in
order; lazily-opened AVCodecContext maintains reference
frames across the chardev round-trips.

Clean SIGTERM + rmmod, no oops/WARN.

Roadmap update (docs/roadmap.md):
- 8.7 marked closed with closure-doc reference.
- 8.8 reshaped: perf profiling, QPU dispatch substitution
  via daedalus-fourier, multi-frame AV1/H.264, HDR (P010M).

Per correctness-before-speed:
- Order-correct media controller lifecycle (init → bind
  v4l2_dev → register video → register mc → register
  media; reverse for teardown).
- 4-deep buffer rings on both queues — the scheduler
  actually pipelines multiple in-flight cookies through
  the chardev (not just one-at-a-time as in 8.5/8.6 tests).
- Bit-exact comparison against ffmpeg, not "looks right."
- All resource paths cleaned on every error branch.

Phase 8.8 next: profile daemon hot loops, dlopen
daedalus-fourier from the daemon, swap FFmpeg per-block
calls for daedalus_dispatch_* where the kernel matches,
targeting 30fps@1080p per the 30fps-floor-is-fine memory.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 16:21:58 +00:00
marfrit c7f6fb90cb Phase 8.6: dmabuf + AV1 + H.264 + stateless controls
Removes the Phase 8.5 64 KiB frame-size cap by exporting CAPTURE
buffers as dmabuf-fds the daemon mmaps and writes pixels into
directly. Adds AV1 + H.264 codec support, V4L2 stateless control
registration, and the compliance polish that brings the driver
to 47/48 v4l2-compliance pass.

Protocol (include/daedalus_v4l2_proto.h):
- struct daedalus_req_decode grew capture-buffer metadata
  (width/height/pix_fmt/num_planes + per-plane size+stride).
- New DAEDALUS_IOC_GET_DMABUF ioctl on the chardev: daemon
  asks for a per-plane dmabuf fd, kernel calls vb2_core_expbuf
  in daemon task context so the fd lands in the daemon's table.

Kernel m2m driver (kernel/daedalus_v4l2_main.c):
- Both queues switched to vb2_dma_contig_memops. OUTPUT was
  vmalloc in 8.5; the switch is needed because vmalloc doesn't
  honour V4L2_MEMORY_FLAG_NON_COHERENT and v4l2-compliance's
  REQBUFS test rejected the driver because of it. We still
  read bitstream via vb2_plane_vaddr (dma_contig gives a
  kernel virtual address just like vmalloc did).
- dma_coerce_mask_and_coherent(DMA_BIT_MASK(32)) in probe.
- queue_setup populates alloc_devs[plane] = &pdev->dev for
  both queues; allow_cache_hints=1 on both.
- daedalus_export_capture_dmabuf(cookie, plane, flags, *fd):
  walks inflight list, calls vb2_core_expbuf on the CAPTURE
  buffer in the caller's (daemon's) task context.
- device_run fills the new REQ_DECODE capture fields from
  ctx->dst_fmt and maps ctx->src_fmt.pixelformat to
  DAEDALUS_CODEC_VP9 / _AV1 / _H264 (was hard-wired to VP9).
- daedalus_complete_resp_frame handles both the 8.5 inline
  path (kept for debugging) and the 8.6 dmabuf path (pixels
  already in CAPTURE buffer, just set payload from metadata).
- enum_fmt advertises all 3 OUTPUT formats (VP9F, AV1F, S264).
- try_fmt preserves userspace colorspace fields instead of
  overwriting with REC709 defaults (fixes 8.5 compliance fail).
- s_fmt propagates OUTPUT colorspace → CAPTURE (stateless
  decoder round-trip test at v4l2-test-formats.cpp:958).
- 12 V4L2 stateless controls registered per open (VP9_FRAME,
  VP9_COMPRESSED_HDR, H264_SPS/PPS/SCALING/PRED_WEIGHTS/
  SLICE_PARAMS/DECODE_PARAMS, AV1_FRAME/SEQUENCE/
  TILE_GROUP_ENTRY/FILM_GRAIN). Daemon ignores values (FFmpeg
  re-parses); registration is what makes libva-v4l2-request
  see us.

Kernel chardev (kernel/daedalus_v4l2_chardev.c):
- New unlocked_ioctl dispatching DAEDALUS_IOC_GET_DMABUF to
  daedalus_export_capture_dmabuf.
- debugfs test_decode cookies unified with the m2m cookie
  allocator via shared daedalus_next_cookie() — kills the
  Phase 8.5 namespace collision.

Daemon (daemon/src/...):
- New dmabuf_capture.{c,h}: GET_DMABUF + mmap each plane on
  REQ_DECODE; munmap + close on completion. O_RDWR | O_CLOEXEC
  is essential — vb2_core_expbuf extracts O_ACCMODE from flags
  and exports read-only by default (caught on first run; mmap
  -EACCES on PROT_WRITE).
- decoder.{c,h}: lazily opens AV1 + H.264 AVCodecContexts in
  addition to VP9 (dropped the -ENOSYS stubs). pack_nv12_to_planes
  writes Y line-by-line into planes[0] with planes[0].stride;
  interleaves Cb/Cr into planes[1] with planes[1].stride.
- chardev_client.c handle_req_decode: opens dmabuf planes,
  runs decode (pixels land in CAPTURE buffer directly), closes
  planes, sends metadata-only RESP_FRAME. No wire-pixel
  allocation.

Test harness (tools/test_m2m_decode.c):
- Optional 5th arg `codec` (vp9 | av1 | h264). Same client
  drives all three codecs.

Verification on hertz (Pi 5, 6.12.75+rpt-rpi-2712):

Bit-exact end-to-end vs `ffmpeg -pix_fmt nv12`:
  VP9   1920x1080  3,110,400 bytes  MATCH
  AV1     128x96      18,432 bytes  MATCH
  H.264   128x96      18,432 bytes  MATCH

VP9 1080p went through the full dmabuf path with no chardev
payload bloat — the same chardev that capped at 64 KiB in 8.5
now ferries metadata only and lets the daemon mmap+write a
3.1 MB frame directly into the V4L2 client's buffer.

v4l2-compliance:
  Phase 8.1: 44/48
  Phase 8.5: 44/48 (different fails after m2m landed)
  Phase 8.6: 47/48
  Only remaining: VIDIOC_(TRY_)DECODER_CMD (needs media
  controller — explicitly Phase 8.7 work).

11 standard compound controls visible:
  vp9_frame_decode_parameters, vp9_probabilities_updates,
  h264_sequence_parameter_set, h264_picture_parameter_set,
  h264_scaling_matrix, h264_prediction_weight_table,
  h264_slice_parameters, h264_decode_parameters,
  av1_sequence_parameters, av1_frame_parameters,
  av1_film_grain (av1_tile_group_entry refused by hdl->error
  on this kernel — skipped silently).

Clean SIGTERM + rmmod, no oops/WARN.

Roadmap update (docs/roadmap.md):
- Phase 8.6 marked closed with the closure-doc reference.
- Phase 8.7 reshaped to (1) media controller, (2) perf +
  daedalus_dispatch_* substitution, (3) HDR/10-bit, (4)
  long-form multi-frame streaming.

Per correctness-before-speed:
- Real V4L2 dmabuf via vb2_core_expbuf (not a sideband
  fd-passing hack).
- O_RDWR access mode threaded through correctly.
- Strict pixel-byte comparison against ffmpeg, not "looks
  right" eyeballing.
- Each compliance edge documented with the underlying test
  source-line + the fix.
- All resource paths cleaned (munmap + close per plane on
  every exit, including error paths).

Phase 8.7 next: media controller binding (closes last
compliance fail), per-frame profiling, QPU dispatch
substitution targeting 30fps@1080p per the
30fps-floor-is-fine memory.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 16:16:06 +00:00
marfrit 6f4b580f7c Phase 8.5: full V4L2 m2m driver, VP9 decode via QBUF/DQBUF
Replaces the Phase 8.4 debugfs-triggered chardev path with a
real V4L2 m2m driver. Userspace clients now drive decoding the
standard way — S_FMT / REQBUFS / QBUF on the OUTPUT (bitstream)
queue, DQBUF on the CAPTURE (NV12M) queue. Kernel device_run
packs the bitstream into REQ_DECODE; daemon decodes via FFmpeg;
RESP_FRAME's inline NV12 pixel payload lands in the CAPTURE
buffer. Phase 8.6 swaps the inline payload for dmabuf so big
frames stop being capped at 64 KiB.

Kernel (daedalus_v4l2_main.c, rewritten + main.h added):
- Per-open struct daedalus_ctx: v4l2_fh, m2m_ctx, ctrl_handler,
  per-queue v4l2_pix_format_mplane.
- Two vb2_queues (vb2_vmalloc_memops for both — no DMA needed
  yet; 8.6 switches CAPTURE to dma_contig for dmabuf-export):
    OUTPUT  = V4L2_BUF_TYPE_VIDEO_OUTPUT_MPLANE,  VP9_FRAME
    CAPTURE = V4L2_BUF_TYPE_VIDEO_CAPTURE_MPLANE, NV12M
- Full v4l2_ioctl_ops table: querycap, enum_fmt, g/s/try_fmt
  for both queues, reqbufs/querybuf/qbuf/dqbuf/create_bufs/
  prepare_buf/expbuf/streamon/streamoff via v4l2_m2m_ioctl_*
  helpers.
- v4l2_m2m_ops.device_run: peeks next OUTPUT buf, builds
  REQ_DECODE inline with the bitstream bytes, enqueues with an
  auto-incrementing cookie, stores {ctx, src_buf, dst_buf} in
  a per-device inflight list. Job stays open until RESP_FRAME.
- daedalus_complete_resp_frame(): pops the inflight entry,
  memcpys inline NV12 pixels into the CAPTURE buffer (Y plane
  + interleaved CbCr), finishes via
  v4l2_m2m_buf_done_and_job_finish — NOT plain buf_done +
  job_finish, which leaves the src buf on the m2m queue and
  causes device_run to immediately re-run on the same input
  (caught on first run; second REQ_DECODE for same bitstream +
  eventual oops in stop_streaming on teardown).
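
The cookie bookkeeping between device_run and
daedalus_complete_resp_frame can be modelled in userspace as a
small table. The sketch below is an illustration only: the driver
keeps a per-device list of {ctx, src_buf, dst_buf}, whereas the
fixed-size table, the 32-bit cookie width, the integer buffer
handles and all names here are assumptions.

```c
#include <stddef.h>
#include <stdint.h>

#define INFLIGHT_MAX 8	/* arbitrary capacity for the sketch */

struct inflight_entry {
	uint32_t cookie;	/* 0 marks a free slot */
	int src_buf;		/* stand-in for the OUTPUT buffer */
	int dst_buf;		/* stand-in for the CAPTURE buffer */
};

struct inflight_table {
	struct inflight_entry slot[INFLIGHT_MAX];
	uint32_t last_cookie;
};

/* Submit side: record a job, return its cookie (0 if the table is full). */
static uint32_t inflight_add(struct inflight_table *t, int src_buf,
			     int dst_buf)
{
	size_t i;

	for (i = 0; i < INFLIGHT_MAX; i++) {
		if (t->slot[i].cookie == 0) {
			uint32_t c = ++t->last_cookie;

			if (c == 0)	/* skip the free-slot marker on wrap */
				c = ++t->last_cookie;
			t->slot[i].cookie = c;
			t->slot[i].src_buf = src_buf;
			t->slot[i].dst_buf = dst_buf;
			return c;
		}
	}
	return 0;
}

/* Completion side: pop the entry for cookie; -1 if the cookie is unknown. */
static int inflight_pop(struct inflight_table *t, uint32_t cookie,
			struct inflight_entry *out)
{
	size_t i;

	if (cookie == 0)
		return -1;
	for (i = 0; i < INFLIGHT_MAX; i++) {
		if (t->slot[i].cookie == cookie) {
			*out = t->slot[i];
			t->slot[i].cookie = 0;
			return 0;
		}
	}
	return -1;
}
```

A response whose cookie was never added (or was already popped)
takes the -1 path, which corresponds to the "unknown cookie" case
the chardev reports for debugfs test_decode traffic.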

Kernel (daedalus_v4l2_chardev.c):
- RESP_FRAME handler now hands inline pixel payload to
  daedalus_complete_resp_frame so it lands in the CAPTURE
  vb2 buffer. Existing PONG and debugfs test_decode paths still
  work; the latter produces a harmless ratelimited "unknown
  cookie" since it bypasses V4L2 m2m.

Daemon (decoder.c, decoder.h):
- daedalus_decoder_run_request signature extended with
  (nv12_out, nv12_cap, nv12_used). After the FNV-1a digest the
  decoder packs YUV420P into NV12 in the caller's buffer: Y
  plane line-by-line stripped of stride padding; Cb/Cr
  interleaved into a single chroma plane. Truncation silent —
  kernel only memcpys what fits in the CAPTURE plane.

Daemon (chardev_client.c):
- handle_req_decode allocates a response buffer sized for the
  full chardev payload, lets decoder fill the pixel area
  after the resp_frame struct, sends the full payload via the
  existing send_response.

Test client (tools/test_m2m_decode.c, new):
- Minimal V4L2 m2m client: S_FMT both queues, REQBUFS 1 each,
  mmap+fill OUTPUT, QBUF both, STREAMON, poll, DQBUF, dump
  CAPTURE planes to a raw NV12 file. ~250 LOC; verifies the
  whole flow without needing v4l2-ctl framing.

Roadmap update (docs/roadmap.md):
- Phase 8.4 retitled "daemon ↔ kernel decode round-trip"
  to reflect what actually shipped (vs. the original V4L2-
  ioctl-driven plan which moved here).
- Phase 8.5 retitled "full V4L2 m2m driver" with closure
  status.
- Phase 8.6 reshaped to two tracks: dmabuf + AV1/H.264/
  stateless controls + media controller. Adds the punch list
  of v4l2-compliance failures (DECODER_CMD, S_FMT colorspace)
  that 8.6 will fix.

Verification on hertz (Pi 5, 6.12.75+rpt-rpi-2712):

  Kernel + daemon build clean (-Wall -Wextra clean both sides).
  Test harness drives one VP9 keyframe end-to-end:
    OUTPUT REQBUFS -> 2
    CAPTURE REQBUFS -> 2
    QBUF OUTPUT[0] bytesused=1566
    QBUF CAPTURE[0]; STREAMON both
    poll revents=0x5
    DQBUF OUTPUT[0] flags=0x4001 (DONE)
    DQBUF CAPTURE[0] flags=0x4000 payloads=[12288, 6144]
    wrote 12288 Y + 6144 UV bytes to /tmp/out_m2m.nv12

  Pixel correctness vs reference:
    ffmpeg -i vp9_small.ivf -pix_fmt nv12 -f rawvideo -y ref.nv12
    cmp /tmp/out_m2m.nv12 /tmp/ref.nv12 → match ✓
  Byte-for-byte identical to FFmpeg's stock CPU decode.

  v4l2-compliance: detected as Stateless Decoder; most ioctls
  pass; two expected fails documented in closure doc
  (DECODER_CMD/media controller, S_FMT colorspace).

  Clean teardown: SIGTERM the daemon, rmmod the module, no
  oops/WARN in dmesg.

Per correctness-before-speed:
- Real V4L2 ioctl table (not stubs); uses v4l2-core helpers
  where they exist instead of reinventing.
- v4l2_m2m_buf_done_and_job_finish (not the manual sequence)
  to keep scheduler state consistent.
- Bit-exact reference comparison, not just "looks right."
- Documented every compliance failure with the planned fix.
- All resource paths (kmalloc/kfree, inflight list cleanup,
  src/dst buf removal in stop_streaming) handled on every
  error branch.

Phase 8.6 next: dmabuf-export for CAPTURE (removes 64 KiB
frame-size cap), add AV1+H.264 codecs, add V4L2 stateless
controls + media controller binding, fix the colorspace +
cookie-namespace compliance issues.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 15:55:10 +00:00
marfrit 2a449632b9 Phase 8.4: daemon ↔ kernel decode round-trip (VP9 end-to-end)
Wires the Phase 8.3 FFmpeg loader through the Phase 8.2 chardev
bridge: kernel injects REQ_DECODE carrying a raw VP9 access unit,
daemon hands the bitstream to libavcodec via dlopen, sends
RESP_FRAME back with a content-dependent FNV-1a digest of the
decoded YUV planes. Pure CPU decode for now — Phase 8.5 swaps in
dmabuf + QPU dispatch.
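
For readers unfamiliar with FNV-1a, a plane digest of this kind can
be sketched as below. The 32-bit variant is assumed from the 32-bit
digests printed in the verification logs; the function names and
the row-by-row walk that skips stride padding are illustrative, not
the daemon's decoder.c.

```c
#include <stddef.h>
#include <stdint.h>

#define FNV1A32_INIT  2166136261u	/* standard 32-bit offset basis */
#define FNV1A32_PRIME 16777619u		/* standard 32-bit FNV prime */

/* Fold n bytes into a running 32-bit FNV-1a hash: xor, then multiply. */
static uint32_t fnv1a32_update(uint32_t h, const uint8_t *p, size_t n)
{
	while (n--) {
		h ^= *p++;
		h *= FNV1A32_PRIME;
	}
	return h;
}

/*
 * Hash one plane row by row so that stride padding never enters the
 * digest; only the first `width` bytes of each row are folded in.
 */
static uint32_t fnv1a32_plane(uint32_t h, const uint8_t *plane,
			      size_t stride, size_t width, size_t height)
{
	size_t row;

	for (row = 0; row < height; row++)
		h = fnv1a32_update(h, plane + row * stride, width);
	return h;
}
```

Chaining the running hash through the Y, Cb and Cr planes in a fixed
order gives a single digest that is both content-dependent and
deterministic, which is what the hash checks in this commit verify.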

Protocol (include/daedalus_v4l2_proto.h):
- New REQ_DECODE (kernel→daemon) and RESP_FRAME (daemon→kernel)
  message types, with fixed-size payload structs.
- New DAEDALUS_CODEC_VP9/AV1/H264 enum (wire-stable so 8.6's
  AV1+H.264 work doesn't move existing values).
- New DAEDALUS_DECODE_* status enum (OK / NO_FRAME / ERR_OPEN /
  ERR_SEND / ERR_RECV / ERR_CODEC).
- Converted the prior `enum daedalus_msg_type` to #defines —
  high-bit values exceed INT_MAX and tripped -Wpedantic on
  userspace; kernel uABI headers use the same idiom.

Kernel (kernel/daedalus_v4l2_chardev.c):
- New debugfs entry /sys/kernel/debug/daedalus_v4l2/test_decode:
  writing raw bitstream bytes wraps them in a REQ_DECODE
  (codec=VP9 for Phase 8.4) and enqueues with an
  auto-incrementing cookie.
- daedalus_chardev_write learned RESP_FRAME: parses the payload
  and emits a single pr_info line with decode metadata. Keeps
  existing PONG handling on the default arm.

Daemon (daemon/src/...):
- chardev_client.{c,h} — opens /dev/daedalus-v4l2, blocking read
  loop, single-buffer write() responses (kernel chardev has only
  .write, not .write_iter, so writev lands as -EINVAL —
  discovered the hard way during first run).
- decoder.{c,h} — lazily-opened AVCodecContext per codec, shared
  AVPacket/AVFrame pair, descriptor-driven plane walker
  (av_pix_fmt_desc_get) so the same hash path covers YUV420P,
  YUV422P, YUV444P, GBRP and other 8-bit planar layouts.
  Generalised after first run decoded testsrc as GBRP (71)
  rather than the assumed YUV420P.
- `daemon` command in main.c opens the chardev and runs the loop
  until SIGINT/SIGTERM. Cookie correlation handled end-to-end.
- ffmpeg_loader gained av_pix_fmt_desc_get (23 symbols total).

Build:
- CMakeLists adds chardev_client.c + decoder.c; explicit
  -I../include for the shared protocol header.
- Still -Wall -Wextra -Wpedantic clean.

Verification on hertz (Pi 5, 6.12.75+rpt-rpi-2712):

  $ ffmpeg ... -pix_fmt yuv420p -c:v libvpx-vp9 -frames:v 1 \
           -y /tmp/vp9_test.ivf
  $ python3 ... strip IVF framing → vp9_keyframe.bin (3268 B)

  $ sudo insmod kernel/daedalus_v4l2.ko
  $ daedalus_v4l2_daemon -v daemon &
  $ sudo dd if=vp9_keyframe.bin \
         of=/sys/kernel/debug/daedalus_v4l2/test_decode

  daemon: REQ_DECODE cookie=2 → decoded yuv420p 320x240
          fnv1a=0x6ef10d71 luma=76800 chroma=38400
  kernel: RESP_FRAME cookie=2 status=0 320x240 pixfmt=0
          fnv1a=0x6ef10d71  ← matches daemon ✓

Hash properties verified:
  cookie=2  testsrc 3268 B → 0x6ef10d71  (first decode)
  cookie=3  red     44 B   → 0x7f6e5dc5  (content-dependent ✓)
  cookie=4  testsrc 3268 B → 0x6ef10d71  (deterministic ✓)
  cookie=5  64 B random    → status=101  (ERR_SEND, daemon alive)

Daemon survives bad input (FFmpeg "Invalid sync code" wrapped
into structured ERR_SEND response). Clean SIGTERM shutdown,
clean rmmod.

Phase 8.4 acceptance criteria met:
- ✓ end-to-end kernel→daemon→FFmpeg→kernel round-trip
- ✓ cookie correlation per request/response pair
- ✓ content-dependent + deterministic digest
- ✓ structured error responses (no daemon crash on bad input)
- ✓ clean teardown (SIGTERM + rmmod)
- ✓ builds clean on both kernel kbuild and daemon CMake

Per correctness-before-speed:
- Real chardev I/O (no shortcuts, no select-loop hacks)
- Real FFmpeg AVCodecContext lifecycle (lazily opened, properly
  freed on cleanup)
- Descriptor-driven plane walk (generalises across pix_fmts)
- Structured error path (not just log-and-continue)
- All resource paths cleaned up on every error branch
- Documented why FNV-1a digest, why write() not writev(), why
  pix_desc walk in docs/phase_8_4_closure.md

Phase 8.5 next: V4L2 m2m queue submits REQ_DECODE from
vidioc_qbuf; dmabuf carries actual pixel data so the chardev's
64 KiB cap doesn't gate frame size; begin substituting
daedalus_dispatch_* into the daemon's decode path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 15:22:16 +00:00
marfrit 895f57c63a Phase 8.2: kernel ↔ daemon chardev bridge with round-trip test
Adds /dev/daedalus-v4l2 misc chardev to the kernel module. The
chardev is the IPC channel for the future userspace decoder
daemon: kernel enqueues REQ_* messages, daemon read()s them,
processes, write()s RESP_* back.

Wire protocol (pre-1.0, header in include/daedalus_v4l2_proto.h):
- struct daedalus_msg_hdr: magic (D04V) + version + type +
  cookie + payload_len + reserved
- Request/response separated by high bit of type field
- Max 64 KiB payload per message
- Cookie correlates request with matching response

Kernel implementation (kernel/daedalus_v4l2_chardev.{c,h}):
- Single-instance chardev (-EBUSY on second open)
- In-kernel FIFO bounded at 64 messages
- Blocking + non-blocking read; poll() with EPOLLIN on queued
- write() parses + validates header, logs response at pr_debug
- Bad magic → -EBADMSG, bad version → -EPROTO, oversize → -EMSGSIZE
- All error paths free resources
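
The header checks above can be sketched in userspace C. The field
list follows the wire-protocol summary, but everything else is an
assumption made for illustration: the ex_ names, the 32-bit field
widths, the byte order of the magic, version 1, and the polarity of
the high type bit. include/daedalus_v4l2_proto.h is the
authoritative definition.

```c
#include <errno.h>
#include <stdint.h>
#include <string.h>

#define EX_VERSION     1u		/* assumed protocol version */
#define EX_MAX_PAYLOAD (64u * 1024u)	/* 64 KiB cap from the protocol notes */
#define EX_TYPE_RESP   0x80000000u	/* assumed: high bit set = response */

/* Hypothetical layout; widths and ordering are not taken from the header. */
struct ex_msg_hdr {
	uint8_t  magic[4];	/* "D04V" */
	uint32_t version;
	uint32_t type;
	uint32_t cookie;	/* echoes the request's cookie in a response */
	uint32_t payload_len;
	uint32_t reserved;
};

/* Returns 1 if the type field marks a response, 0 for a request. */
static int ex_hdr_is_response(const struct ex_msg_hdr *h)
{
	return (h->type & EX_TYPE_RESP) != 0;
}

/* Mirror of the documented checks: magic, then version, then size. */
static int ex_hdr_validate(const struct ex_msg_hdr *h)
{
	if (memcmp(h->magic, "D04V", 4) != 0)
		return -EBADMSG;
	if (h->version != EX_VERSION)
		return -EPROTO;
	if (h->payload_len > EX_MAX_PAYLOAD)
		return -EMSGSIZE;
	return 0;
}
```

Checking the magic before anything else means a stream that has lost
framing is rejected without trusting any other field of the header.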

Phase 8.2 test trigger via debugfs:
- /sys/kernel/debug/daedalus_v4l2/test_ping — any byte
  enqueues a PING with a fixed 24-byte payload. Removed in
  Phase 8.4 when real REQ_DECODE from V4L2 path takes over.

Userspace verification tool (tools/test_chardev_pingpong.c):
- Real C program, proper error reporting via strerror
- Validates the 7-step round-trip: open → empty-queue EAGAIN →
  trigger ping → read PING → verify all fields → write PONG → close
- Builds with -Wall -Wextra clean

Verification on hertz (Pi 5, 6.12.75+rpt-rpi-2712):
  $ sudo insmod daedalus_v4l2.ko
  $ sudo tools/test_chardev_pingpong
  opening /dev/daedalus-v4l2...
    non-blocking read on empty queue: EAGAIN ✓
    injected PING via debugfs ✓
    read PING: magic ✓ version ✓ type=PING ✓ cookie=0x1234 ✓ payload=24 bytes
      payload: "DAEDALUS-V4L2-PING-PL"
    wrote PONG (cookie=0x1234) ✓
  ALL TESTS PASSED.
  $ sudo rmmod daedalus_v4l2      # clean

Per correctness-before-speed: full kerneldoc on structs, kernel
style with 8-column tabs, SPDX headers, proper error paths, real test
program (not "I ran it once"), failure-mode coverage documented.

Phase 8.3 next: userspace daemon with dlopen'd FFmpeg parse path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 15:05:54 +00:00
marfrit 9415b7e0f7 Phase 8.1: kernel V4L2 device skeleton (out-of-tree module)
Out-of-tree Linux kernel module registering /dev/videoNN. Phase
8.1 scope: skeleton only — VIDIOC_QUERYCAP works, no codec
ioctls / no vb2_queue / no controls yet.

Real V4L2 plumbing throughout per "correctness before speed":
platform_device + v4l2_device + video_device, properly nested
with error paths and devm_kzalloc-managed lifetime. Per-cycle 9
discipline ports to kernel code: SPDX header, kernel coding
style (8-column tabs, static-by-default), kerneldoc on structs, no
shortcuts.

Files (~250 LOC total):
- kernel/Makefile — out-of-tree kbuild with checkpatch target
- kernel/daedalus_v4l2_main.c — module init/exit + probe/remove

Verification on hertz (Pi 5, 6.12.75+rpt-rpi-2712):
- Builds clean with -Wall -Wextra. No warnings.
- modprobe / rmmod round-trip clean. No dmesg taints beyond
  the expected "out-of-tree taint" line.
- v4l2-ctl --list-devices shows: "daedalus-fourier V3D7+NEON
  (platform:daedalus_v4l2): /dev/video0"
- VIDIOC_QUERYCAP returns driver/card/bus/caps as specified.
- v4l2-compliance: 44/48 passing. The 4 failures are exactly
  the format/buffer ioctls Phase 8.2 will implement
  (ENUM_FMT, G_FMT, Scaling, REQBUFS) — not skeleton bugs,
  legitimately-absent features.

Documentation: docs/phase_8_1_closure.md captures full
verification output + Phase 8.2 plan.

Phase 8.1 acceptance criteria met:
- ✓ /dev/videoNN appears via v4l2-ctl --list-devices
- ✓ VIDIOC_QUERYCAP responds with sensible values
- ✓ rmmod is clean (no kref leaks)
- ✓ v4l2-compliance passes except for explicit Phase 8.2 work

Next: Phase 8.2 chardev bridge for kernel ↔ daemon IPC.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 15:03:22 +00:00
marfrit c7d8050cc9 Initial scaffold: daedalus-v4l2 sibling repo
V4L2 stateless decoder for Pi 5, backed by sibling
daedalus-fourier kernel library (VP9 + AV1 CDEF + H.264 video
decode kernels on VideoCore VII compute + ARM NEON).

Architecture locked 2026-05-18 by mfritsche per
daedalus-fourier/docs/phase8_scoping.md:
- Option B: Linux kernel V4L2 shim + userspace daemon (not
  v4l2loopback). Real /dev/videoNN; proper DRM PRIME for
  browser zero-copy.
- Option γ: dlopen FFmpeg at runtime as parser. No vendoring;
  fastest to v1.
- Sibling repo (this repo): V4L2-side work outside of
  daedalus-fourier so kernel-library API stays clean.

Components:
  kernel/ - Linux out-of-tree kernel module (GPLv2; V4L2
    device + chardev bridge to userspace daemon)
  daemon/ - userspace decoder daemon (BSD-2-Clause; links
    libdaedalus_core.a from sibling; dlopens FFmpeg)
  docs/   - architecture + 7-phase roadmap (8.1..8.7)
  include/ - shared headers between kernel and daemon

Roadmap (7 sub-phases, ~1 week each):
  8.1 kernel skeleton (/dev/videoNN with no-op ioctls)
  8.2 chardev bridge (kernel ↔ daemon ping-pong)
  8.3 daemon FFmpeg dlopen + parse path
  8.4 VP9 end-to-end via daedalus_dispatch_*
  8.5 dmabuf / DRM PRIME for zero-copy
  8.6 AV1 + H.264 codec support
  8.7 performance: hit 30fps@1080p (project floor)

No code yet — only README + design docs + directory structure.
First implementation work starts in Phase 8.1 next session.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 14:54:56 +00:00