Fixes issue #146: a daemon crash (SIGKILL, SEGV, anything that
triggers chardev release) leaves V4L2 consumers stuck in unkillable
TASK_UNINTERRUPTIBLE sleep on /dev/video0 close.
## Root cause
device_run() adds an entry to dev->inflight when it sends a
REQ_DECODE to the daemon, marking the m2m job as "running".
The job is only cleared via v4l2_m2m_buf_done_and_job_finish()
in daedalus_complete_resp_frame(), which only fires on RESP_FRAME.
If the daemon dies (SIGKILL, SEGV, exit) BEFORE writing the
matching RESP_FRAME:
- the inflight entry is never popped
- v4l2_m2m_buf_done_and_job_finish is never called
- the m2m scheduler still thinks a job is running
Later, when the V4L2 consumer calls close() (or is signalled to
exit), v4l2_m2m_ctx_release() → v4l2_m2m_cancel_job() waits
indefinitely for !job_running. The consumer enters D-state and
survives SIGKILL until reboot.
Reproduced on hertz 2026-05-23, kernel 6.12.75+rpt-rpi-2712:
$ sudo kill -STOP $DAEMON_PID # block daemon I/O
$ ./test_m2m_decode keyframe.bin out.nv12 1920 1080 vp9 &
$ sudo kill -9 $DAEMON_PID # chardev_release fires
$ kill -9 $CLIENT_PID # ignored — D-state
# client stack:
v4l2_m2m_cancel_job+0x14c [v4l2_mem2mem]
v4l2_m2m_ctx_release+0x20 [v4l2_mem2mem]
daedalus_release+0x2c [daedalus_v4l2]
v4l2_release+0x7c [videodev]
__fput → do_exit → SIGKILL never delivered
## Fix
New API daedalus_drain_inflight_on_disconnect() in main.{c,h}:
walks the in-flight list, marks both src+dst buffers
VB2_BUF_STATE_ERROR via v4l2_m2m_buf_done_and_job_finish(), and
releases the bound media_request if any. This is the same
completion that daedalus_complete_resp_frame() performs on the
success path, just with state = ERROR for every in-flight entry.
chardev_release calls the drain after flushing dev->req_queue
(messages still in req_queue weren't released to the daemon yet,
so they don't need the m2m-job-finish dance — freeing them is
sufficient). The order matters: queue first (cheap), then m2m
drain (heavier, takes the inflight list).
Locking: list_splice_init under inflight_lock to take the entire
list atomically; lock dropped before iterating because
v4l2_m2m_buf_done_and_job_finish can sleep via vb2's buffer-done
dispatch and can re-enter device_run via the scheduler (which
would need inflight_lock again on the next REQ_DECODE).
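
The detach-then-complete ordering described above can be modelled in plain userspace C. This is only a sketch, not the driver code: `struct model_dev`, `struct inflight_entry`, `model_lock()`, `finish_job_with_error()` and `drain_inflight()` are hypothetical stand-ins, a boolean stands in for inflight_lock, and a hand-rolled singly linked list stands in for the kernel list that list_splice_init operates on. The point it illustrates is that the whole list is taken while the lock is held, and every error completion runs only after the lock has been dropped.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the driver's state; all names are illustrative. */
enum buf_state { BUF_ACTIVE, BUF_DONE, BUF_ERROR };

struct inflight_entry {
    struct inflight_entry *next;
    enum buf_state src_state;   /* models the source buffer state */
    enum buf_state dst_state;   /* models the destination buffer state */
    bool job_running;           /* models the m2m "job running" flag */
};

struct model_dev {
    bool lock_held;                  /* models inflight_lock */
    struct inflight_entry *inflight; /* models dev->inflight */
    int completed_under_lock;        /* counts violations of the locking rule */
};

static void model_lock(struct model_dev *dev)   { dev->lock_held = true; }
static void model_unlock(struct model_dev *dev) { dev->lock_held = false; }

/* Models the error completion: both buffers fail and the job is finished. */
static void finish_job_with_error(struct model_dev *dev,
                                  struct inflight_entry *e)
{
    if (dev->lock_held)
        dev->completed_under_lock++; /* would risk re-entry in the driver */
    e->src_state = BUF_ERROR;
    e->dst_state = BUF_ERROR;
    e->job_running = false;
}

/* Drains every in-flight entry; returns how many entries were completed. */
static int drain_inflight(struct model_dev *dev)
{
    struct inflight_entry *detached, *next;
    int drained = 0;

    /* Take the entire list in one step while holding the lock. */
    model_lock(dev);
    detached = dev->inflight;
    dev->inflight = NULL;
    model_unlock(dev);

    /* Complete entries with the lock dropped, one at a time. */
    for (; detached != NULL; detached = next) {
        next = detached->next;
        detached->next = NULL;
        finish_job_with_error(dev, detached);
        drained++;
    }
    return drained;
}
```

Because the shared list is already empty when the completions run, a second call to `drain_inflight()` in this model finds nothing to do and returns 0.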
## Verification path
Cannot rmmod the running module on hertz right now — the D-state
corpse from the repro session pins the refcount. Verification
of the fixed module needs a reboot or fresh test host:
$ sudo reboot # clears hung client
$ sudo make modules_install # install new .ko
$ sudo modprobe daedalus_v4l2
# rerun the repro script; the client should exit cleanly with
# -EIO or a similar error from poll/DQBUF instead of hanging.
Build: clean on Linux 6.12.75+rpt-rpi-2712, no new warnings.
The pre-existing "frame size 2128 > 2048" warning on
daedalus_device_run is unchanged by this commit.
## Followup not in scope
If a new V4L2 consumer races a REQ_DECODE through device_run
AFTER the drain has spliced the list (but before the daemon
chardev is reopened), the new entry sits in a freshly-empty
inflight list and the same hang can recur for that consumer
when the systemd auto-restart of the daemon either fails or
takes longer than the consumer's patience. A secondary
safeguard would be to fail-fast in device_run when dev->chardev
is unopened — proposing as a separate ticket if this race
materialises in practice.
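
For reference, the fail-fast idea could look roughly like the userspace model below. Everything in it is an assumption made for illustration: `struct model_dev`, its `chardev_open` flag, `model_device_run()` and the choice of -ENODEV are hypothetical. The model returns a status only so the outcome is observable, and in the driver the failure would presumably be reported by finishing the job with an error state instead.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical model of the device state; names are illustrative only. */
struct model_dev {
    bool chardev_open;  /* models "the daemon has the chardev open" */
    int inflight_count; /* models the length of dev->inflight */
    int failed_jobs;    /* jobs completed immediately with an error */
};

/*
 * Models a fail-fast device_run: refuse to park a job in the in-flight
 * list when no daemon is connected to answer it.
 * Returns 0 if the job was handed to the daemon, -ENODEV otherwise.
 */
static int model_device_run(struct model_dev *dev)
{
    if (!dev->chardev_open) {
        /* No daemon: fail the job now instead of waiting forever. */
        dev->failed_jobs++;
        return -ENODEV;
    }

    /* Daemon connected: the job waits for its response as before. */
    dev->inflight_count++;
    return 0;
}
```

To actually close the race, the open check and the in-flight insertion would likely need to happen under the same lock the drain takes, otherwise a job could still slip in between the check and the splice.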
Closes #146.