H.264 B-frame display reorder: daemon binds libavcodec display-ordered output to decode-ordered V4L2 cookies → pair-swapped frames (visible 2-1-4-3-6-5) #6

Closed
opened 2026-05-21 10:09:50 +00:00 by marfrit · 0 comments

Symptom

Visible "awful jumping" / pair-swapped frames on every H.264 stream with B-frames, regardless of resolution (verified at 720p / 1080p) and frame rate (30 fps / 60 fps). Reported by user as "frames are 2 1 4 3 6 5 instead of 1 2 3 4 5 6".

Reproduction

  • Pi CM5, Phase 8.x stack alive, daemon at r24+gf0d4186.
  • LIBVA_DRIVER_NAME=v4l2_request mpv --hwdec=vaapi-copy bbb_720p_h264.mp4 → visible pair-swap. Bypasses Firefox entirely, so the bug is upstream of any browser compositor.
  • Same pattern in firefox-fourier YouTube playback (independently observed before mpv test, but Firefox-induced noise made it harder to localise).

Root cause — cookie / display-order mismatch

Pipeline as it stands today

  1. libva-v4l2-request-fourier submits H.264 bitstream chunks via VIDIOC_QBUF(OUTPUT) in decode order (the order libavcodec gave them to libva). Each OUTPUT buffer is bound to a media_request carrying that frame's per-control state (SPS/PPS/DECODE_PARAMS/SLICE_PARAMS).
  2. daedalus_v4l2 kernel device_run (kernel/daedalus_v4l2_main.c:660-790) pops the next src_buf + dst_buf from the m2m queues, allocates a fresh cookie = daedalus_next_cookie(), packs (cookie, bitstream, h264_meta) into REQ_DECODE, sends to the daemon, and stores (cookie → {src_buf, dst_buf, req}) in the inflight table.
  3. Daemon (daemon/src/decoder.c:495-513) does avcodec_send_packet(pkt) then avcodec_receive_frame(frame) once per REQ_DECODE. Ships the resulting NV12 (or DAEDALUS_DECODE_NO_FRAME if EAGAIN) back as RESP_FRAME with the same cookie.
  4. Kernel pops inflight by cookie and stamps the pixels into dst_buf[cookie]. V4L2_BUF_FLAG_TIMESTAMP_COPY copies src_buf[cookie].timestamp → dst_buf[cookie].timestamp.

Why this is wrong for B-frames

libavcodec's H.264 decoder internally reorders output to display order before returning from avcodec_receive_frame. Each call returns the oldest display-ready frame in its DPB, not necessarily the frame whose slice data the most recent send_packet contained.

Concrete IBP example, decode order I₀ P₃ B₁ B₂ P₆ B₄ B₅ (subscript = display position):

| REQ_DECODE cookie | bitstream sent | libavcodec output | Where pixels land |
|---|---|---|---|
| 1 | I₀ slices | I₀ | dst_buf[1] (= src_buf[1] timestamp ↔ I₀'s display PTS; correct) |
| 2 | P₃ slices | EAGAIN | RESP NO_FRAME → dst_buf[2] = VB2_BUF_STATE_ERROR (P₃'s pixels held inside libavcodec; lost from V4L2's perspective) |
| 3 | B₁ slices | I₀ may release, or B₁ | dst_buf[3] gets whichever frame libavcodec released, stamped with src_buf[3].timestamp (which is the decode-order PTS of B₁, regardless) |
| 4 | B₂ slices | next display-order frame | dst_buf[4] gets pixels of a different display frame than src_buf[4]'s timestamp implies |
| 5 | P₆ slices | EAGAIN or earlier-display frame | ... |

Result: pixel content and dst_buf timestamp drift apart. Many src_bufs get marked ERROR (lost frames). When the V4L2 client's compositor presents dst_buf[N] it sees pixels of an earlier or later display frame than its timestamp claims. At a high level: pairs of P/B frames present in inverted order — the user-visible 2-1-4-3-6-5.
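
The drift described above can be reproduced with a small userspace model. This is a sketch under stated assumptions, not the daemon's code: `toy_decoder`, `TOY_DELAY`, and `simulate_cookie_binding` are hypothetical names, and the release policy (hold one frame, then release the lowest display position) is a deliberate simplification, not libavcodec's actual DPB logic. Frames are identified by display position, which doubles as the timestamp.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for a reordering decoder. NOT libavcodec's real policy: it
 * holds frames and releases the lowest display position once it holds more
 * than TOY_DELAY of them. */
#define TOY_DELAY 1
#define TOY_CAP   8
#define NO_FRAME  (-1)

struct toy_decoder {
    int held[TOY_CAP];   /* display positions currently buffered */
    size_t count;
};

/* Accept one frame, identified by its display position. */
static void toy_send(struct toy_decoder *d, int display_pos)
{
    assert(d->count < TOY_CAP);
    d->held[d->count++] = display_pos;
}

/* Return the lowest held display position, or NO_FRAME if the decoder
 * still wants more input (the EAGAIN case). */
static int toy_receive(struct toy_decoder *d)
{
    if (d->count <= TOY_DELAY)
        return NO_FRAME;
    size_t best = 0;
    for (size_t i = 1; i < d->count; i++)
        if (d->held[i] < d->held[best])
            best = i;
    int out = d->held[best];
    d->held[best] = d->held[--d->count];   /* swap-remove */
    return out;
}

struct binding_stats {
    int no_frame;    /* requests answered with NO_FRAME */
    int mismatched;  /* pixels delivered under another frame's timestamp */
    int matched;     /* pixels delivered under their own timestamp */
};

/* Model of today's pipeline: one send, one receive, and the result is bound
 * to the request's own cookie, so it inherits that request's timestamp. */
static struct binding_stats simulate_cookie_binding(const int *decode_order,
                                                    size_t n)
{
    struct toy_decoder dec = { .count = 0 };
    struct binding_stats st = { 0, 0, 0 };

    for (size_t i = 0; i < n; i++) {
        int stamped_as = decode_order[i];   /* timestamp copied from src_buf */
        toy_send(&dec, decode_order[i]);
        int pixels_of = toy_receive(&dec);  /* what the decoder released */

        if (pixels_of == NO_FRAME)
            st.no_frame++;
        else if (pixels_of != stamped_as)
            st.mismatched++;
        else
            st.matched++;
    }
    return st;
}
```

Fed the example decode order {0, 3, 1, 2, 6, 4, 5}, this toy model answers one request with NO_FRAME and delivers two frames under another frame's timestamp. The exact counts depend on the toy release policy; the point is only that one-receive-per-request binding cannot keep pixels and timestamps together once the decoder reorders.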

The daemon today even ships some frames whose NV12 resolution differs from the requested CAPTURE buffer size: on the first frame of a resolution change, decoder: OK 1280x720 was shipped into a capture=1920x1088 buffer (see journal cookie=15027 on 2026-05-21). This is a related but separate symptom of the same cookie-binding flaw.

Proposed fix — ferry the source PTS through the protocol

Wire protocol

Extend struct daedalus_req_decode and struct daedalus_resp_frame in include/daedalus_v4l2_proto.h:

struct daedalus_req_decode {
    ...
    __u64 src_pts;  // V4L2 OUTPUT buffer timestamp at submission time
};

struct daedalus_resp_frame {
    ...
    __u64 output_src_pts;  // src_pts of the bitstream that PRODUCED this frame
    __u32 flags;           // bit 0 = src_consumed (always 1 today; 0 means src not yet released)
};

Bump a DAEDALUS_PROTO_VERSION; both sides must upgrade together.
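
One detail worth checking when extending the response struct: on most 64-bit ABIs a `__u64` followed by a trailing `__u32` leaves four bytes of tail padding, so both sides should agree on the size explicitly. The following is a userspace sketch, not the real header. `sketch_resp_frame`, `SKETCH_RESP_F_SRC_CONSUMED`, the `cookie` field, and the `reserved` field are assumptions made for illustration; only `output_src_pts` and `flags` come from the proposal above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag name; the proposal only says "bit 0 = src_consumed". */
#define SKETCH_RESP_F_SRC_CONSUMED (1u << 0)

/* Illustrative stand-in for struct daedalus_resp_frame. The fields elided
 * in the proposal are not reproduced; cookie and reserved are assumptions. */
struct sketch_resp_frame {
    uint64_t cookie;          /* assumed existing field */
    uint64_t output_src_pts;  /* src_pts of the bitstream that produced this frame */
    uint32_t flags;           /* bit 0 = src_consumed */
    uint32_t reserved;        /* explicit padding so both sides agree on size */
};

/* Fail the build if the compiler's layout differs from the intended one. */
static_assert(sizeof(struct sketch_resp_frame) == 24, "unexpected wire size");
static_assert(offsetof(struct sketch_resp_frame, output_src_pts) == 8,
              "unexpected offset of output_src_pts");
static_assert(offsetof(struct sketch_resp_frame, flags) == 16,
              "unexpected offset of flags");

/* Returns 1 if the response releases the source buffer, 0 otherwise. */
static int resp_src_consumed(const struct sketch_resp_frame *resp)
{
    return (resp->flags & SKETCH_RESP_F_SRC_CONSUMED) != 0;
}
```

Build-time layout checks of this kind catch a kernel module and daemon that were compiled against different header revisions earlier than a version mismatch at runtime would.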

Kernel side (device_run)

Fill req->src_pts = src_buf->vb2_buf.timestamp before sending REQ_DECODE. (Already accessible.)

Inflight lookup needs a secondary key on src_pts → dst_buf, since the dst_buf the daemon ultimately wants to fill is whichever one was paired with a different src_buf whose timestamp matches the daemon's output_src_pts. Simplest: keep cookie-keyed inflight + a parallel hash src_pts → inflight. On RESP_FRAME:

  • Look up (cookie) → free src_buf (it's done — the slice data was consumed even if no pixels are ready yet).
  • Look up (output_src_pts) → that inflight's dst_buf is where the pixels go.
  • Stamp dst_buf.timestamp = output_src_pts explicitly (we can no longer rely on V4L2_BUF_FLAG_TIMESTAMP_COPY because src and dst are no longer paired 1:1).

For cookies whose RESP_FRAME says src_consumed=1 but no output_src_pts matches an outstanding dst (i.e., the daemon hasn't released a frame for this cookie's bitstream yet), the dst_buf stays parked in the inflight table until a later RESP brings it home.
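
The two-key lookup described above can be modelled in userspace as follows. This is a minimal sketch, not kernel code: every name in it (`struct inflight`, `inflight_add`, `on_resp_frame`, `RESP_PARKED`, `RESP_BAD_KEY`) is hypothetical, a linear scan stands in for the parallel hash, buffers are reduced to boolean state, and locking is not modelled. It also assumes `src_pts` values are unique among outstanding entries.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define INFLIGHT_MAX 8
#define RESP_PARKED  (-1)   /* source released, no pixels delivered */
#define RESP_BAD_KEY (-2)   /* cookie or output_src_pts did not resolve */

/* One inflight entry. In the driver, src and dst would be real buffers. */
struct inflight {
    bool     used;
    uint64_t cookie;         /* routing identity, one per REQ_DECODE */
    uint64_t src_pts;        /* OUTPUT timestamp, the second lookup key */
    bool     src_done;       /* bitstream consumed, src_buf returned */
    bool     dst_done;       /* pixels written, dst_buf returned */
    uint64_t dst_timestamp;  /* stamped explicitly on completion */
};

static struct inflight *find_by_cookie(struct inflight *t, uint64_t cookie)
{
    for (size_t i = 0; i < INFLIGHT_MAX; i++)
        if (t[i].used && t[i].cookie == cookie)
            return &t[i];
    return NULL;
}

static struct inflight *find_by_src_pts(struct inflight *t, uint64_t src_pts)
{
    for (size_t i = 0; i < INFLIGHT_MAX; i++)
        if (t[i].used && !t[i].dst_done && t[i].src_pts == src_pts)
            return &t[i];
    return NULL;
}

/* Record a submitted request. Returns false if the table is full. */
static bool inflight_add(struct inflight *t, uint64_t cookie, uint64_t src_pts)
{
    for (size_t i = 0; i < INFLIGHT_MAX; i++) {
        if (!t[i].used) {
            t[i] = (struct inflight){ .used = true, .cookie = cookie,
                                      .src_pts = src_pts };
            return true;
        }
    }
    return false;
}

/* Handle one response. Returns the slot index whose dst was completed,
 * RESP_PARKED if only the source was released, or RESP_BAD_KEY. */
static int on_resp_frame(struct inflight *t, uint64_t cookie,
                         bool has_frame, uint64_t output_src_pts)
{
    struct inflight *src = find_by_cookie(t, cookie);
    if (!src)
        return RESP_BAD_KEY;
    src->src_done = true;            /* slice data was consumed */
    if (src->dst_done)
        src->used = false;

    if (!has_frame)
        return RESP_PARKED;          /* dst stays parked in the table */

    struct inflight *dst = find_by_src_pts(t, output_src_pts);
    if (!dst)
        return RESP_BAD_KEY;
    dst->dst_done = true;
    dst->dst_timestamp = output_src_pts;   /* explicit stamp, no copy from src */
    if (dst->src_done)
        dst->used = false;           /* both halves returned: retire entry */
    return (int)(dst - t);
}
```

A short usage trace: after adding cookies 1, 2, 3 with `src_pts` 0, 3, 1, a frameless response for cookie 1 parks its dst; a response for cookie 2 carrying `output_src_pts = 0` then completes slot 0 rather than slot 1, leaving cookie 2's own dst parked until the decoder releases that picture.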

Daemon side (decoder.c)

Replace the synchronous send_packet → receive_frame_once pattern with a drain loop:

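dec->pkt->pts = (int64_t)req->src_pts;  // relies on libavcodec carrying pkt->pts through to frame->pts
rc = avcodec_send_packet(ctx, dec->pkt);
// always send; never bail just because receive isn't ready yet
// (a negative rc here should still be checked and logged before draining)

while ((rc = avcodec_receive_frame(ctx, dec->frame)) == 0) {
    // frame->pts identifies the OUTPUT buffer whose bitstream produced this picture
    uint64_t output_src_pts = (uint64_t)dec->frame->pts;
    pack_frame_to_planes(...);
    send_resp_frame(cookie_for_this_drain_iteration,
                    output_src_pts, ...);
}
// rc == AVERROR(EAGAIN) is the normal exit (decoder wants more input);
// AVERROR_EOF ends a flush; any other negative rc is a real decode error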

The daemon may send 0, 1, or N RESP_FRAME messages per REQ_DECODE (typically 1 in steady state). Each carries an output_src_pts identifying which OUTPUT bitstream's pixels these are.

libva-v4l2-request-fourier — no changes

The libva driver already submits OUTPUT buffers with the correct per-frame timestamp (display PTS the application passed in). The dst_buf-side timestamp the kernel now stamps explicitly will match what libva expects, so VAAPI surface ordering on the application side stays correct.

Why this is the right shape

  • Decouples cookie (in-kernel routing identity) from frame->pts (display-order identity).
  • No need to disable libavcodec's internal reorder (AV_CODEC_FLAG_LOW_DELAY is a hack that gives up display order entirely; we want the opposite — preserve it but make it visible to V4L2).
  • Matches how cedrus / rkvdec / hantro work natively — the kernel decoder writes the decoded picture into whichever CAPTURE buffer the SLICE_PARAMS DPB references, indexed by POC, not by submit order.
  • No CPU work moved — the daemon still does the libavcodec decode itself. (Per feedback_daemon_no_cpu_decode this is acceptable as long as we're not pretending CPU is hardware; the work here is correctness, not perf.)

Effort estimate

Maybe a day of focused work + a day of soak testing on Pi CM5 with mpv + Firefox on a few stream profiles. Risk: getting the inflight ordering right when src and dst decouple — needs care under concurrent client load (the PR #3 vb_mutex fix covered the prior cross-client hazard, but this fix introduces in-context ordering complexity).

Out of scope / related

  • libva-v4l2-request-fourier codec_store_buffer slice-buffer overflow on resolution change (picture.c:112): separate bug, filed elsewhere. mpv crashes on the 1080p source today because of that, so this issue was investigated using a 720p source where the same overflow doesn't trigger.
  • The lingering kernel V4L2_BUF_FLAG_TIMESTAMP_COPY flag (kernel/daedalus_v4l2_main.c:536, 553) becomes a misnomer once src/dst are no longer paired. It's still useful for VP9/AV1 (no reorder, src/dst 1:1) so it stays — but the H.264 path effectively overrides with an explicit stamp.
Reference: reauktion/daedalus-v4l2#6