kernel: drain in-flight m2m jobs on daemon disconnect (fixes #146 D-state) #23
Fixes issue #146: a daemon crash (SIGKILL, SEGV, anything that triggers chardev release) leaves V4L2 consumers stuck in unkillable TASK_UNINTERRUPTIBLE on /dev/video0 close.

Reproduced 2026-05-23
On hertz, kernel 6.12.75+rpt-rpi-2712:
The client process is in D (disk sleep): no pending signals get delivered, and it survives kill -9 indefinitely (only a reboot clears it).

Root cause
device_run() adds an entry to dev->inflight when it sends a REQ_DECODE to the daemon. The m2m scheduler marks the job "running". The job is only cleared via v4l2_m2m_buf_done_and_job_finish() in daedalus_complete_resp_frame(), which only fires on RESP_FRAME from the daemon.

Daemon dies before sending RESP_FRAME → inflight entry stays → job stays "running" → next v4l2_m2m_cancel_job() waits forever.

Fix
New daedalus_drain_inflight_on_disconnect() in main.{c,h}: walks the in-flight list, marks both src+dst buffers VB2_BUF_STATE_ERROR via v4l2_m2m_buf_done_and_job_finish(), and releases the bound media_request if any. Same completion shape as daedalus_complete_resp_frame() takes on the success path, just with state = ERROR for every entry.

chardev_release calls the drain after flushing dev->req_queue. Order matters: queue first (cheap, messages weren't released to the daemon yet so no m2m dance needed), then m2m drain (heavier, takes the inflight list).

Locking: list_splice_init under inflight_lock to take the entire list atomically; lock dropped before iterating because v4l2_m2m_buf_done_and_job_finish can sleep via vb2 buffer-done dispatch and can re-enter device_run via the scheduler (which needs inflight_lock again on the next REQ_DECODE).

Verification path
The repro session left a D-state corpse on hertz that pins the kernel module refcount (rmmod blocked). Verifying the fixed module needs:
Build: clean on 6.12.75+rpt-rpi-2712, no new warnings.
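For reviewers who want to reason about the locking without loading the module, the splice-then-iterate pattern described under Fix can be modelled in user space. This is only a sketch, not the driver code: every model_* name is hypothetical, a pthread mutex stands in for inflight_lock, a singly linked list stands in for the kernel list, and model_finish_job() stands in for v4l2_m2m_buf_done_and_job_finish(). The completion stand-in asserts that the lock is free when it runs, which is the property the real drain relies on so that a scheduler re-entry into device_run cannot deadlock.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical user-space stand-ins; none of these names exist in the driver. */
enum model_buf_state { MODEL_BUF_QUEUED, MODEL_BUF_ERROR };

struct model_entry {
    struct model_entry *next;
    enum model_buf_state src_state;
    enum model_buf_state dst_state;
    int job_running;
};

struct model_dev {
    pthread_mutex_t inflight_lock;
    struct model_entry *inflight;   /* stand-in for the in-flight list */
};

static void model_dev_init(struct model_dev *dev)
{
    pthread_mutex_init(&dev->inflight_lock, NULL);
    dev->inflight = NULL;
}

/* Models device_run(): record a job as in flight while holding the lock. */
static void model_add_inflight(struct model_dev *dev, struct model_entry *e)
{
    e->src_state = MODEL_BUF_QUEUED;
    e->dst_state = MODEL_BUF_QUEUED;
    e->job_running = 1;

    pthread_mutex_lock(&dev->inflight_lock);
    e->next = dev->inflight;
    dev->inflight = e;
    pthread_mutex_unlock(&dev->inflight_lock);
}

/* Models the completion call: it needs the lock to be free, because the
 * real call may re-enter device_run, which takes inflight_lock itself. */
static void model_finish_job(struct model_dev *dev, struct model_entry *e,
                             enum model_buf_state state)
{
    int rc = pthread_mutex_trylock(&dev->inflight_lock);

    assert(rc == 0);   /* fails if the caller still holds the lock */
    if (rc == 0)
        pthread_mutex_unlock(&dev->inflight_lock);

    e->src_state = state;
    e->dst_state = state;
    e->job_running = 0;
}

/* Models the drain: detach the whole list under the lock, then complete
 * every entry with an error state after the lock has been dropped. */
static int model_drain_inflight(struct model_dev *dev)
{
    struct model_entry *stolen, *e;
    int drained = 0;

    pthread_mutex_lock(&dev->inflight_lock);
    stolen = dev->inflight;   /* analogous to list_splice_init() */
    dev->inflight = NULL;
    pthread_mutex_unlock(&dev->inflight_lock);

    while ((e = stolen) != NULL) {
        stolen = e->next;
        e->next = NULL;
        model_finish_job(dev, e, MODEL_BUF_ERROR);
        drained++;
    }
    return drained;
}
```

Calling model_add_inflight() for a couple of entries and then model_drain_inflight() leaves every entry in the error state with job_running cleared and the device list empty; a second drain returns 0. Holding the lock across the loop instead would trip the assertion in model_finish_job().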
Followup not in scope
If a new V4L2 consumer races a REQ_DECODE through device_run AFTER the drain has spliced the list (but before the daemon chardev is reopened), the new entry sits in a freshly emptied inflight list and the same hang can recur. A secondary safeguard, fail-fast in device_run when dev->chardev is unopened, is left as a separate ticket if this race materialises in practice.

Closes #146.