kernel: claim src/dst at device_run, not at buf_done (fixes panic from #7) #8

Merged
marfrit merged 1 commit from noether/kernel-claim-bufs-at-device-run into main 2026-05-21 11:54:53 +00:00

Symptom

Hard reboot (kernel panic; no persistent journal, no trace recoverable) observed on higgs (Pi CM5, 6.18.29+rpt-rpi-2712) during the first mpv --hwdec=vaapi-copy playback against the freshly-deployed r28+g79256dc stack (daedalus-v4l2#7).

No decoded trace is available, but the failure mode can be reconstructed by reading the code.

Cause

The split src / dst lifecycle introduced by #7 calls v4l2_m2m_job_finish on SRC_CONSUMED even when the corresponding dst_buf is still parked, waiting for a future HAS_PIXELS. job_finish moves the m2m_ctx back to IDLE, the scheduler dispatches the next device_run, which calls v4l2_m2m_next_dst_buf — which returns the head of the CAPTURE ready-queue.

That head is still our parked dst_buf, because we never removed it.

Two inflight entries now reference the same vb2_buffer. When the later HAS_PIXELS arrives for the original cookie, v4l2_m2m_dst_buf_remove_by_buf calls list_del on a list_head no longer linked to the rdy_queue, smashing the next/prev pointers of whatever ELSE was at those addresses → kernel panic.
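
The aliasing behind this can be illustrated outside the kernel. What follows is a minimal user-space sketch, not the driver's code: `struct node`, `model_buf`, `model_peek` and `model_double_peek_aliases` are hypothetical names, and the list only mimics the shape of the kernel's `list_head`. It shows that peeking at the head of a ready queue twice without unlinking hands out the same buffer both times.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal circular doubly-linked list, shaped like the kernel's list_head. */
struct node {
	struct node *next, *prev;
};

static void list_init(struct node *head)
{
	head->next = head->prev = head;
}

static void list_add_tail(struct node *n, struct node *head)
{
	n->prev = head->prev;
	n->next = head;
	head->prev->next = n;
	head->prev = n;
}

/* Hypothetical stand-in for a CAPTURE buffer on the ready queue. */
struct model_buf {
	struct node link;	/* first member, so a node pointer casts back */
	int index;
};

/* Peek at the head of the ready queue WITHOUT unlinking it. */
static struct model_buf *model_peek(struct node *rdy_queue)
{
	if (rdy_queue->next == rdy_queue)
		return NULL;	/* queue is empty */
	return (struct model_buf *)rdy_queue->next;
}

/*
 * Simulate two consecutive device_run passes that only peek.
 * Returns 1 if both passes were handed the same buffer, 0 otherwise.
 */
static int model_double_peek_aliases(void)
{
	struct node rdy_queue;
	struct model_buf a = { .index = 0 }, b = { .index = 1 };
	struct model_buf *first, *second;

	list_init(&rdy_queue);
	list_add_tail(&a.link, &rdy_queue);
	list_add_tail(&b.link, &rdy_queue);

	first = model_peek(&rdy_queue);		/* pass 1 parks this buffer */
	second = model_peek(&rdy_queue);	/* pass 2 sees the same head */

	assert(first != NULL && second != NULL);
	return first == second;
}
```

Calling `model_double_peek_aliases()` from a small `main` returns 1 in this model: the second pass is handed the buffer the first pass still considers parked, even though a second buffer is queued behind it.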

Fix

Take both src and dst off m2m_ctx's rdy_queue at device_run — as soon as v4l2_m2m_next_*_buf has peeked them and all early-exit validation has passed. After that the daemon owns both halves exclusively via the inflight item; the m2m scheduler can't re-issue them on the next device_run.

Completion path drops the redundant _remove_by_buf calls — list is already detached, so buf_done alone is correct.
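
As a rough illustration of the claim-at-device_run idea, here is a user-space sketch under the same caveat: `model_claim` and `model_claimed_index` are hypothetical names, and the list helpers only imitate the kernel's `list_head` operations. Peeking and unlinking happen in one step, so successive passes can never be handed the same buffer.

```c
#include <assert.h>
#include <stddef.h>

struct node {
	struct node *next, *prev;
};

static void list_init(struct node *n)
{
	n->next = n->prev = n;
}

static void list_add_tail(struct node *n, struct node *head)
{
	n->prev = head->prev;
	n->next = head;
	head->prev->next = n;
	head->prev = n;
}

/* Unlink the node and make it point at itself again. */
static void list_del_init(struct node *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
	list_init(n);
}

/* Hypothetical stand-in for a buffer on a ready queue. */
struct model_buf {
	struct node link;	/* first member, so a node pointer casts back */
	int index;
};

/* Claim: peek the head and unlink it in the same step. */
static struct model_buf *model_claim(struct node *rdy_queue)
{
	struct model_buf *buf;

	if (rdy_queue->next == rdy_queue)
		return NULL;	/* nothing left to claim */
	buf = (struct model_buf *)rdy_queue->next;
	list_del_init(&buf->link);
	return buf;
}

/*
 * Queue two buffers, run pass + 1 claims, and return the index of the
 * buffer handed to the last pass, or -1 if the queue was already empty.
 */
static int model_claimed_index(int pass)
{
	struct node rdy_queue;
	struct model_buf a = { .index = 0 }, b = { .index = 1 };
	struct model_buf *got = NULL;
	int i;

	assert(pass >= 0);
	list_init(&rdy_queue);
	list_add_tail(&a.link, &rdy_queue);
	list_add_tail(&b.link, &rdy_queue);

	for (i = 0; i <= pass; i++)
		got = model_claim(&rdy_queue);
	return got ? got->index : -1;
}
```

In this model the first pass gets buffer 0, the second gets buffer 1, and a third finds the queue empty. Because the claimed buffer is already off the list, completion has nothing left to unlink.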

Matches the drivers/media/platform/amphion/vdec.c / venc.c pattern (NXP), which also claims at device_run for the same reason: amphion's encode path parks output buffers across multiple frames waiting for the codec to finish, structurally the same as our H.264 B-frame DPB parking.

fail_buf_error learns about a new claimed flag and skips the v4l2_m2m_*_buf_remove calls when the buffers have already been removed by-buf at device_run.
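
The role of the flag can be sketched in user space as well. This is an assumption-laden model rather than the actual `fail_buf_error`: `model_queue`, `model_fail` and `model_left_after_fail` are hypothetical names, and each ready queue is reduced to a counter. The point is that the error path must unlink exactly once, whether the failure happens before or after the claim.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical ready queue, reduced to a count of queued buffers. */
struct model_queue {
	int num_rdy;
};

enum model_state { MODEL_ACTIVE, MODEL_ERROR };

struct model_buf {
	enum model_state state;
};

/* Stand-in for popping the head buffer off a ready queue. */
static void model_remove_head(struct model_queue *q)
{
	assert(q->num_rdy > 0);
	q->num_rdy--;
}

/* Error path: unlink only if device_run has not already done so. */
static void model_fail(struct model_queue *src_q, struct model_queue *dst_q,
		       struct model_buf *src, struct model_buf *dst,
		       bool claimed)
{
	if (!claimed) {
		model_remove_head(src_q);
		model_remove_head(dst_q);
	}
	src->state = MODEL_ERROR;
	dst->state = MODEL_ERROR;
}

/*
 * Each queue starts with two buffers. If the failure happens after the
 * claim, device_run has already taken one buffer off each queue.
 * Returns the number of buffers left on the capture queue afterwards,
 * or -1 if the buffers were not both marked as errored.
 */
static int model_left_after_fail(bool claimed)
{
	struct model_queue src_q = { 2 }, dst_q = { 2 };
	struct model_buf src = { MODEL_ACTIVE }, dst = { MODEL_ACTIVE };

	if (claimed) {
		model_remove_head(&src_q);	/* the claim at device_run */
		model_remove_head(&dst_q);
	}
	model_fail(&src_q, &dst_q, &src, &dst, claimed);

	if (src.state != MODEL_ERROR || dst.state != MODEL_ERROR)
		return -1;
	return dst_q.num_rdy;
}
```

Both `model_left_after_fail(true)` and `model_left_after_fail(false)` leave one buffer on the capture queue in this model. Without the flag, the claimed case would pop a second, unrelated buffer.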

Wire protocol

Unchanged: DAEDALUS_PROTO_VERSION stays at 1, so the daemon binary doesn't need to be rebuilt for this fix. Only the kernel module changes. On the apt side this is still a daedalus-v4l2-dkms-only bump.

Verified

  • Builds clean against 6.18.29+rpt-rpi-2712 on higgs.
  • Field test pending — needs marfrit-packages daedalus-v4l2-dkms bump + apt install. Daemon stays at r28+g79256dc.

Closes the panic regression from #7.

marfrit added 1 commit 2026-05-21 11:50:20 +00:00
Hard reboot observed on higgs (Pi CM5) during the first mpv vaapi-copy
playback against the freshly-deployed r28+g79256dc stack — kernel
panic, no persistent journal, no recoverable trace.  Bug introduced
by the daedalus-v4l2#6 reorder fix (#7).

Cause
-----
The new completion path runs `v4l2_m2m_job_finish` on SRC_CONSUMED
even when the dst_buf is still parked (waiting for a future
HAS_PIXELS).  job_finish moves the m2m_ctx back to IDLE, the
scheduler dispatches the next device_run — which calls
`v4l2_m2m_next_dst_buf`, which returns the head of the CAPTURE
ready-queue, which is STILL the parked dst_buf because we never
removed it.  Two inflight entries now reference the same vb2_buffer;
the later HAS_PIXELS triggers `v4l2_m2m_dst_buf_remove_by_buf` on a
vb2_buffer whose list_head is no longer linked to that queue, and
`list_del()` smashes the next/prev pointers of whatever ELSE was at
those addresses.

Fix
---
Take both src and dst off `m2m_ctx`'s rdy_queue at device_run — as
soon as `v4l2_m2m_next_*_buf` has peeked them and all early-exit
validation has passed.  After that, the daemon owns both halves
exclusively via the inflight item; the m2m scheduler can't re-issue
them on the next device_run.  Completion path drops the redundant
`_remove_by_buf` calls — list is already detached, so `buf_done`
alone is correct.

Matches the amphion `vdec.c`/`venc.c` pattern (which also claims at
device_run for the same reason: amphion's encode pipeline parks
output buffers across multiple frames waiting for the codec to
finish, structurally the same as our H.264 B-frame DPB parking).

`fail_buf_error` learns about the new `claimed` flag and skips the
`v4l2_m2m_*_buf_remove` calls when the buffers have already been
removed by-buf at device_run.

Verified
--------
Builds clean against 6.18.29+rpt-rpi-2712.  Field test pending —
deploy via marfrit-packages bump in lock-step with the daemon
(which doesn't need to change for this fix; PROTO_VERSION stays at 1).
marfrit merged commit 6ffe92bcac into main 2026-05-21 11:54:53 +00:00
Reference: reauktion/daedalus-v4l2#8