kernel: drain in-flight m2m jobs on daemon disconnect

Fixes issue #146: a daemon crash (SIGKILL, SEGV, anything that triggers
chardev release) leaves V4L2 consumers in unkillable TASK_UNINTERRUPTIBLE
on /dev/video0 close.

## Root cause

device_run() adds an entry to dev->inflight when it sends a REQ_DECODE
to the daemon, marking the m2m job as "running". The job is only
cleared via v4l2_m2m_buf_done_and_job_finish() in
daedalus_complete_resp_frame(), which only fires on RESP_FRAME.

If the daemon dies (SIGKILL, SEGV, exit) BEFORE writing the matching
RESP_FRAME:

- the inflight entry is never popped
- v4l2_m2m_buf_done_and_job_finish() is never called
- the m2m scheduler still thinks a job is running

Later, when the V4L2 consumer's close() runs (or the consumer is
signalled to exit), v4l2_m2m_ctx_release() → v4l2_m2m_cancel_job()
waits for !job_running indefinitely. The consumer enters D-state and
survives SIGKILL until reboot.

Reproduced on hertz 2026-05-23, kernel 6.12.75+rpt-rpi-2712:

    $ sudo kill -STOP $DAEMON_PID    # block daemon I/O
    $ ./test_m2m_decode keyframe.bin out.nv12 1920 1080 vp9 &
    $ sudo kill -9 $DAEMON_PID       # chardev_release fires
    $ kill -9 $CLIENT_PID            # ignored: client is in D-state
    # client stack:
      v4l2_m2m_cancel_job+0x14c [v4l2_mem2mem]
      v4l2_m2m_ctx_release+0x20 [v4l2_mem2mem]
      daedalus_release+0x2c [daedalus_v4l2]
      v4l2_release+0x7c [videodev]
      __fput → do_exit → SIGKILL never delivered

## Fix

New API daedalus_drain_inflight_on_disconnect() in main.{c,h}: walks
the in-flight list, marks both src+dst buffers VB2_BUF_STATE_ERROR via
v4l2_m2m_buf_done_and_job_finish(), and releases the bound
media_request if any. This is the same completion shape that
daedalus_complete_resp_frame() uses on the success path, just with
state = ERROR for every in-flight entry.

chardev_release calls the drain after flushing dev->req_queue.
Messages still in req_queue had not been handed to the daemon yet, so
they don't need the m2m-job-finish dance; freeing them is sufficient.
The order matters: queue first (cheap), then m2m drain (heavier, takes
the inflight list).

Locking: list_splice_init under inflight_lock to take the entire list
atomically; the lock is dropped before iterating because
v4l2_m2m_buf_done_and_job_finish() can sleep via vb2's buffer-done
dispatch and can re-enter device_run via the scheduler (which would
need inflight_lock again on the next REQ_DECODE).

## Verification path

Cannot rmmod the running module on hertz right now: the D-state corpse
from the repro session pins the refcount. Verification of the fixed
module needs a reboot or a fresh test host:

    $ sudo reboot                    # clears hung client
    $ sudo make modules_install      # install new .ko
    $ sudo modprobe daedalus_v4l2
    $ # rerun the repro script; the client should die cleanly with
    $ # an -EIO / similar return from poll/DQBUF instead of hanging.

Build: clean on Linux 6.12.75+rpt-rpi-2712, no new warnings. The
pre-existing "frame size 2128 > 2048" warning on daedalus_device_run
is unchanged by this commit.

## Followup not in scope

If a new V4L2 consumer races a REQ_DECODE through device_run AFTER the
drain has spliced the list (but before the daemon chardev is
reopened), the new entry sits in a freshly-empty inflight list and the
same hang can recur for that consumer when the systemd auto-restart of
the daemon either fails or takes longer than the consumer's patience.
A secondary safeguard would be to fail fast in device_run when
dev->chardev is unopened. To be proposed as a separate ticket if this
race materialises in practice.

Closes #146.
@@ -103,4 +103,27 @@ void daedalus_complete_resp_frame(u32 cookie,
 int daedalus_export_capture_dmabuf(u32 cookie, u32 plane, u32 flags,
                                    int *out_fd);
 
+/**
+ * daedalus_drain_inflight_on_disconnect() - fail all in-flight m2m jobs
+ *
+ * Called from daedalus_chardev_release() when the daemon disconnects
+ * (graceful close, SIGKILL, daemon crash; anything that triggers
+ * chardev release). Walks the in-flight list and, for every entry,
+ * marks both src+dst buffers VB2_BUF_STATE_ERROR and calls
+ * v4l2_m2m_buf_done_and_job_finish() to clear the m2m scheduler's
+ * "job_running" flag.
+ *
+ * Without this, v4l2_m2m_cancel_job() (called from
+ * v4l2_m2m_ctx_release() during the consumer's close() / task exit)
+ * blocks forever waiting for a job_finish that the dead daemon will
+ * never send: the consumer enters TASK_UNINTERRUPTIBLE and survives
+ * SIGKILL until reboot. See issue #146 for the full trace.
+ *
+ * Safe to call with an empty in-flight list; no-op in that case.
+ * Must NOT be called from atomic context: uses inflight_lock
+ * (sleeping mutex) and v4l2_m2m_buf_done_and_job_finish (which can
+ * sleep via vb2 buffer-done dispatch).
+ */
+void daedalus_drain_inflight_on_disconnect(void);
+
 #endif /* DAEDALUS_V4L2_MAIN_H */