kernel + daemon: H.264 B-frame display reorder fix (issue #6)

H.264 streams with B-frames showed visibly pair-swapped output in
mpv / Firefox playback through the libva → daedalus_v4l2 path —
"frames went 2 1 4 3 6 5 instead of 1 2 3 4 5 6".  Reproduced in mpv
with --hwdec=vaapi-copy at 720p (bypassing Firefox's compositor),
confirming the bug was in this daemon pipeline, not downstream.

Root cause
----------
libavcodec's H.264 decoder internally reorders output to DISPLAY
order before returning from avcodec_receive_frame.  The daemon
previously called send_packet → receive_frame ONCE per REQ_DECODE
and shipped the resulting pixels in a RESP_FRAME tagged with the
SAME cookie.  For B-frames this is wrong: the frame returned from
receive_frame may belong to an EARLIER bitstream (libavcodec held
it for display-order release).  Cookie N's CAPTURE buffer therefore
got cookie N-2's pixels, while cookie N-2's CAPTURE buffer got
silently marked VB2_BUF_STATE_ERROR (the daemon returned
DAEDALUS_DECODE_NO_FRAME for the cookie whose pixels were held).

Fix shape
---------
Decouple kernel cookie identity (decode-order routing) from
libavcodec's display-ordered output.  Wire-protocol changes:

  REQ_DECODE  + __u64 src_pts        (= src_buf->vb2_buf.timestamp)
  RESP_FRAME  + __u32 flags          (HAS_PIXELS | SRC_CONSUMED)
              + __u64 output_src_pts (= frame->pts on drain)

PROTO_VERSION bumped 0 → 1.  Lock-step rebuild required.

Kernel
------
device_run now mirrors src_buf->vb2_buf.timestamp into req->src_pts
before sending REQ_DECODE, and stores it on the inflight item so
the completion path can stamp dst_buf.timestamp explicitly when
src/dst lifecycles decouple (V4L2_BUF_FLAG_TIMESTAMP_COPY's auto-
pairing no longer applies).

daedalus_complete_resp_frame splits into:

  HAS_PIXELS:    pack pixels into THIS cookie's CAPTURE buffer,
                 stamp dst timestamp from inflight->src_pts,
                 v4l2_m2m_buf_done(dst, DONE/ERROR).
                 No job_finish here.

  SRC_CONSUMED:  release the bound media_request, run
                 v4l2_m2m_buf_done(src) + v4l2_m2m_job_finish so
                 the scheduler can dispatch the next REQ.  dst_buf
                 may still be parked at this point.

Inflight entry is removed and freed only when BOTH src_buf and
dst_buf have been cleared.  Combined HAS_PIXELS|SRC_CONSUMED RESPs
(steady-state VP9/AV1 with no reorder lag) collapse to the prior
1:1 behaviour for free.

Daemon
------
daedalus_decoder_run_request split into three primitives:

  daedalus_decoder_submit       — set pkt->pts = req->src_pts,
                                  avcodec_send_packet.
  daedalus_decoder_drain_one    — avcodec_receive_frame, populate
                                  resp meta + output_src_pts (= the
                                  frame's pts, carried back from
                                  the bitstream that produced it).
  daedalus_decoder_pack_current — pack current AVFrame into the
                                  caller-mapped CAPTURE planes.

chardev_client maintains a small (src_pts → cookie, cached_req)
table indexed linearly (≤64 entries; bounded by V4L2 client buffer
pool depth).  On each REQ_DECODE:

  1. Register (src_pts → cookie) in the table.
  2. submit().
  3. Drain loop: for each frame returned, look up its owner cookie
     via pending_lookup(frame->pts), GET_DMABUF for THAT cookie,
     pack pixels, emit RESP_FRAME(owner_cookie, HAS_PIXELS,
     output_src_pts=frame->pts).  Combine with SRC_CONSUMED when
     owner_cookie equals the current REQ's cookie.
  4. If the current REQ's cookie wasn't drained inside the loop
     (libavcodec held the frame), emit a standalone SRC_CONSUMED
     RESP so the kernel runs job_finish + dispatches the next REQ;
     dst_buf for this cookie stays parked until a future drain
     produces its pixels.

VP9 / AV1 paths are unchanged in behaviour: one frame per REQ,
HAS_PIXELS|SRC_CONSUMED in one combined RESP.

Verified
--------
Builds clean cross-compiled on higgs against 6.18.29+rpt-rpi-2712
(Pi CM5).  Frame-size warning in device_run is pre-existing
(unchanged by this commit).
This commit is contained in:
2026-05-21 12:32:47 +02:00
parent f0d41867f6
commit 15fc2aba14
6 changed files with 725 additions and 288 deletions
+186 -89
View File
@@ -573,8 +573,28 @@ struct daedalus_inflight {
struct list_head list;
u32 cookie;
struct daedalus_ctx *ctx;
/*
* src_buf / dst_buf decouple in the daedalus-v4l2#6 reorder fix.
* src_buf is cleared (NULL'd) when DAEDALUS_RESP_FLAG_SRC_CONSUMED
* arrives — that signals libavcodec has accepted the bitstream
* even if no display-order frame is ready yet. dst_buf is cleared
* when DAEDALUS_RESP_FLAG_HAS_PIXELS arrives — the daemon has
* written pixels into this CAPTURE buffer. When both are NULL
* the inflight entry is removed and freed.
*/
struct vb2_v4l2_buffer *src_buf;
struct vb2_v4l2_buffer *dst_buf;
/*
* src_buf->vb2_buf.timestamp captured at device_run time.
* Mirrored into REQ_DECODE.src_pts so the daemon can set
* pkt->pts = src_pts on avcodec_send_packet, and read back
* frame->pts to identify which OUTPUT bitstream produced the
* current display-order frame. Kept here so the kernel can
* stamp dst_buf.timestamp explicitly at HAS_PIXELS time even
* though V4L2_BUF_FLAG_TIMESTAMP_COPY's automatic src->dst
* pairing no longer applies (src/dst lifecycles decoupled).
*/
u64 src_pts;
/*
* Captured media_request the src_buf was bound to (if any).
* Set by device_run from src_buf->vb2_buf.req_obj.req;
@@ -585,16 +605,22 @@ struct daedalus_inflight {
struct media_request *req;
};
/*
* Peek (don't remove). The split-completion path may receive
* multiple RESP_FRAME messages on a single inflight item (one for
* SRC_CONSUMED, one for HAS_PIXELS — possibly separated in time if
* libavcodec held the picture for display reorder). Caller removes
* the entry only when both src_buf and dst_buf have been cleared
* from inside the inflight lock.
*/
static struct daedalus_inflight *
daedalus_inflight_pop_locked(struct daedalus_dev *dev, u32 cookie)
daedalus_inflight_peek_locked(struct daedalus_dev *dev, u32 cookie)
{
struct daedalus_inflight *e;
list_for_each_entry(e, &dev->inflight, list) {
if (e->cookie == cookie) {
list_del(&e->list);
if (e->cookie == cookie)
return e;
}
}
return NULL;
}
@@ -757,6 +783,17 @@ static void daedalus_device_run(void *priv)
req->codec_id = cid;
req->bitstream_len = (u32) blen;
/*
* Ferry the OUTPUT buffer's vb2 timestamp through to the
* daemon for the H.264 B-frame display-reorder fix
* (daedalus-v4l2#6). Daemon sets pkt->pts = src_pts before
* avcodec_send_packet; libavcodec stamps frame->pts with
* the same value when it eventually outputs the frame in
* display order, letting the daemon route HAS_PIXELS RESPs
* to the correct cookie even when libavcodec's display
* order disagrees with V4L2's decode submission order.
*/
req->src_pts = (u64) src_buf->vb2_buf.timestamp;
req->capture_width = ctx->dst_fmt.width;
req->capture_height = ctx->dst_fmt.height;
req->capture_pix_fmt = ctx->dst_fmt.pixelformat;
@@ -786,6 +823,7 @@ static void daedalus_device_run(void *priv)
inf->ctx = ctx;
inf->src_buf = src_buf;
inf->dst_buf = dst_buf;
inf->src_pts = req->src_pts;
/*
* Capture the bound media_request (if any) so the
* completion path can call v4l2_ctrl_request_complete +
@@ -851,120 +889,179 @@ static const struct v4l2_m2m_ops daedalus_m2m_ops = {
/* -- chardev RESP_FRAME → buf_done bridge ---------------------------- */
/*
* Pack the daemon's pixel delivery into the inflight item's CAPTURE
* buffer. Called from daedalus_complete_resp_frame on the
* HAS_PIXELS branch, after the lock has been dropped (vb2 ops may
* sleep / take their own locks). The dst_buf reference was
* snapshotted under the inflight lock and cleared from the entry,
* so no other RESP can race for this buffer.
*
* pixels_len == 0 → dmabuf path (Phase 8.6+); the daemon mmap'd the
* CAPTURE plane via GET_DMABUF and wrote pixels in place; we just
* set the plane payloads. pixels_len > 0 → legacy Phase 8.5 inline
* NV12 path; we memcpy from the chardev payload.
*/
static void daedalus_pack_pixels_into_dst(struct vb2_v4l2_buffer *dst_buf,
const struct daedalus_resp_frame *fr,
const u8 *pixels, size_t pixels_len)
{
struct vb2_buffer *vb = &dst_buf->vb2_buf;
void *dst_y, *dst_uv;
u32 y_size, uv_size;
unsigned int p;
if (pixels_len) {
y_size = min_t(u32, fr->luma_len,
(u32) vb2_plane_size(vb, 0));
uv_size = vb->num_planes > 1 ?
min_t(u32, fr->chroma_len,
(u32) vb2_plane_size(vb, 1)) : 0;
dst_y = vb2_plane_vaddr(vb, 0);
dst_uv = vb->num_planes > 1 ?
vb2_plane_vaddr(vb, 1) : NULL;
if (dst_y && y_size && pixels_len >= y_size)
memcpy(dst_y, pixels, y_size);
else
y_size = 0;
if (dst_uv && uv_size &&
pixels_len >= y_size + uv_size)
memcpy(dst_uv, pixels + y_size, uv_size);
else
uv_size = 0;
vb2_set_plane_payload(vb, 0, y_size);
if (vb->num_planes > 1)
vb2_set_plane_payload(vb, 1, uv_size);
} else {
for (p = 0; p < vb->num_planes; p++)
vb2_set_plane_payload(vb, p,
vb2_plane_size(vb, p));
}
}
void daedalus_complete_resp_frame(u32 cookie,
const struct daedalus_resp_frame *fr,
const u8 *pixels, size_t pixels_len)
{
struct daedalus_dev *dev = g_daedalus_dev;
struct daedalus_inflight *inf;
struct daedalus_ctx *ctx = NULL;
struct vb2_v4l2_buffer *src_to_complete = NULL;
struct vb2_v4l2_buffer *dst_to_complete = NULL;
struct media_request *req_to_complete = NULL;
enum vb2_buffer_state state;
void *dst_y, *dst_uv;
u32 y_size, uv_size;
u64 dst_timestamp = 0;
bool entry_freed = false;
bool has_pixels, src_consumed;
if (!dev)
return;
state = (fr->status == DAEDALUS_DECODE_OK)
? VB2_BUF_STATE_DONE : VB2_BUF_STATE_ERROR;
has_pixels = !!(fr->flags & DAEDALUS_RESP_FLAG_HAS_PIXELS);
src_consumed = !!(fr->flags & DAEDALUS_RESP_FLAG_SRC_CONSUMED);
if (!has_pixels && !src_consumed) {
pr_warn_ratelimited(
"daedalus_v4l2: RESP_FRAME cookie=%u with neither HAS_PIXELS nor SRC_CONSUMED — ignoring\n",
cookie);
return;
}
mutex_lock(&dev->inflight_lock);
inf = daedalus_inflight_pop_locked(dev, cookie);
mutex_unlock(&dev->inflight_lock);
inf = daedalus_inflight_peek_locked(dev, cookie);
if (!inf) {
mutex_unlock(&dev->inflight_lock);
pr_warn_ratelimited(
"daedalus_v4l2: RESP_FRAME for unknown cookie=%u\n",
cookie);
return;
}
state = (fr->status == DAEDALUS_DECODE_OK)
? VB2_BUF_STATE_DONE : VB2_BUF_STATE_ERROR;
ctx = inf->ctx;
/*
* Two routes the daemon can take, both supported:
* Snapshot what this RESP completes and clear the matching
* fields on the inflight item, so concurrent RESPs (e.g. a
* later HAS_PIXELS arriving on the same cookie after this
* SRC_CONSUMED clears src_buf) see the correct residual
* state. Actual vb2 buf_done calls happen below the lock.
*
* (a) dmabuf path (Phase 8.6+) — daemon called
* DAEDALUS_IOC_GET_DMABUF, mmap'd the CAPTURE buffer,
* wrote pixels in place. RESP_FRAME carries metadata
* only (pixels_len == 0). Each plane's payload is
* the full plane size (the daemon wrote everything
* the format requires).
*
* (b) Phase 8.5 inline path — daemon shipped raw NV12 in
* the chardev payload (≤ 64 KiB cap). We memcpy
* into the vb2 buffer. Plane payloads come from
* the daemon's NV12 luma/chroma counts.
* Sanity check on output_src_pts only when HAS_PIXELS is
* set — the daemon's output_src_pts should equal this
* inflight's stored src_pts, since the daemon routes pixels
* to the cookie of the OUTPUT bitstream that contained the
* frame's slices (which is what we stored at device_run time).
* Surface a mismatch loudly — indicates daemon-side pts→cookie
* mapping bug, not silent data corruption.
*/
if (state == VB2_BUF_STATE_DONE) {
struct vb2_buffer *vb = &inf->dst_buf->vb2_buf;
unsigned int p;
if (has_pixels) {
if (fr->output_src_pts != inf->src_pts)
pr_warn_ratelimited(
"daedalus_v4l2: RESP HAS_PIXELS cookie=%u output_src_pts=%llu but inflight.src_pts=%llu — daemon dispatch bug?\n",
cookie,
(unsigned long long) fr->output_src_pts,
(unsigned long long) inf->src_pts);
if (pixels_len) {
/* (b) inline NV12 copy — legacy 2-plane only */
y_size = min_t(u32, fr->luma_len,
(u32) vb2_plane_size(vb, 0));
uv_size = vb->num_planes > 1 ?
min_t(u32, fr->chroma_len,
(u32) vb2_plane_size(vb, 1)) : 0;
dst_y = vb2_plane_vaddr(vb, 0);
dst_uv = vb->num_planes > 1 ?
vb2_plane_vaddr(vb, 1) : NULL;
if (dst_y && y_size && pixels_len >= y_size)
memcpy(dst_y, pixels, y_size);
else
y_size = 0;
if (dst_uv && uv_size &&
pixels_len >= y_size + uv_size)
memcpy(dst_uv, pixels + y_size, uv_size);
else
uv_size = 0;
vb2_set_plane_payload(vb, 0, y_size);
if (vb->num_planes > 1)
vb2_set_plane_payload(vb, 1, uv_size);
} else {
/* (a) dmabuf path: plane is fully populated by
* the daemon, so payload == sizeimage. */
for (p = 0; p < vb->num_planes; p++)
vb2_set_plane_payload(vb, p,
vb2_plane_size(vb, p));
}
dst_to_complete = inf->dst_buf;
dst_timestamp = inf->src_pts;
inf->dst_buf = NULL;
}
if (src_consumed) {
src_to_complete = inf->src_buf;
req_to_complete = inf->req;
inf->src_buf = NULL;
inf->req = NULL;
}
if (!inf->src_buf && !inf->dst_buf) {
list_del(&inf->list);
entry_freed = true;
}
mutex_unlock(&dev->inflight_lock);
/*
* Complete the CAPTURE side first (when applicable). vb2-core's
* V4L2_BUF_FLAG_TIMESTAMP_COPY semantics no longer auto-copy
* src→dst timestamps because src and dst are no longer paired
* 1:1 in m2m's view — stamp dst explicitly from the inflight's
* stored src_pts (= the OUTPUT vb2_buf.timestamp captured at
* device_run). The V4L2 client gets the same display-PTS it
* originally set on the OUTPUT side.
*/
if (dst_to_complete) {
if (state == VB2_BUF_STATE_DONE)
daedalus_pack_pixels_into_dst(dst_to_complete, fr,
pixels, pixels_len);
dst_to_complete->vb2_buf.timestamp = dst_timestamp;
v4l2_m2m_dst_buf_remove_by_buf(ctx->m2m_ctx, dst_to_complete);
v4l2_m2m_buf_done(dst_to_complete, state);
}
/*
* Phase 8.14: if the src_buf was bound to a media_request
* (libva-driven decode path), complete the per-request
* control state BEFORE buf_done_and_job_finish. vb2-core's
* buf_done unbinds the buffer's req_obj on its own, but the
* control object stays bound until v4l2_ctrl_request_complete
* runs — only after BOTH objects unbind does the request
* transition to MEDIA_REQUEST_STATE_COMPLETE and wake any
* userspace poll on the request fd.
*
* For non-request flows (test_m2m_stream direct QBUF) inf->req
* is NULL and v4l2_ctrl_request_complete just no-ops.
* Complete the OUTPUT side: release the bound media_request's
* controls (libva-driven path), drop our request reference taken
* in device_run, mark src done, then job_finish so the m2m
* scheduler can dispatch the next pending REQ on this ctx. The
* dst_buf for this cookie may still be parked (HAS_PIXELS hasn't
* arrived yet — libavcodec is holding the frame for display-
* order release). That's fine: the next device_run picks a
* different next_dst_buf out of the CAPTURE queue and proceeds.
*/
if (inf->req)
v4l2_ctrl_request_complete(inf->req, &inf->ctx->hdl);
if (src_to_complete) {
if (req_to_complete)
v4l2_ctrl_request_complete(req_to_complete, &ctx->hdl);
v4l2_m2m_src_buf_remove_by_buf(ctx->m2m_ctx, src_to_complete);
v4l2_m2m_buf_done(src_to_complete, state);
if (req_to_complete)
media_request_put(req_to_complete);
v4l2_m2m_job_finish(dev->m2m_dev, ctx->m2m_ctx);
}
/*
* Use the buf_done_and_job_finish helper rather than plain
* buf_done + job_finish: the helper pops the buffers off
* the m2m queue before marking them done, otherwise the
* scheduler immediately re-runs device_run on the same
* still-queued src buffer. Caught during Phase 8.5 first
* run — second REQ_DECODE with identical bitstream + oops
* in stop_streaming when the test client tore down.
*/
v4l2_m2m_buf_done_and_job_finish(dev->m2m_dev, inf->ctx->m2m_ctx,
state);
/*
* Release our reference taken in device_run; safe to do
* AFTER buf_done_and_job_finish (which dropped the vb2
* reference) because we still hold this one. If the
* refcount hits zero here, media-core releases the request.
*/
if (inf->req)
media_request_put(inf->req);
kfree(inf);
if (entry_freed)
kfree(inf);
}
/* -- v4l2_ioctl_ops -------------------------------------------------- */