H.264 B-frame correctness: re-architect daemon for concurrent in-flight requests + libva-side display-order reorder #11



marfrit commented

2026-05-21 12:42:56 +00:00

Tracking issue for the proper fix to the H.264 display-order problem originally diagnosed in #6 and partly mitigated by PR #7 (now reverted in #10 because parking broke libva's 1:1 expectation, see #9).

Context

The naive daemon design — one libavcodec context, single-threaded send_packet → receive_frame → ship pixels per REQ_DECODE — has two competing constraints:

1. libavcodec internally reorders H.264 output to display order. After avcodec_send_packet of a B-frame's slice, avcodec_receive_frame may return EAGAIN (the picture is in the DPB but not yet output), and a later call returns it. We can't easily disable this without going below libavcodec's frame-decode API.
2. V4L2 stateless decoder protocol is strict 1:1 between OUTPUT and CAPTURE buffers per request: every queued bitstream packet must produce exactly one CAPTURE completion, populated with the pixels of the picture whose slices were in that OUTPUT packet, regardless of display order. The V4L2 client (libva-v4l2-request-fourier) handles display reorder afterwards using H.264 POC.

Real hardware stateless decoders satisfy (2) natively because they decode a slice into a specific DPB-indexed CAPTURE buffer immediately, with no internal display-order queue.

What broke

PR #7 tried to satisfy (1) by parking CAPTURE buffers that libavcodec hadn't yet emitted, then routing pixels to the correct cookie via frame->pts when libavcodec finally output them. This violated (2): strict V4L2 stateless clients (mpv via vaapi-copy, ffmpeg-vaapi directly) saw CAPTURE DQBUF return EAGAIN and bailed. Firefox tolerated the resulting stale-pixel mess (because it's lenient and re-queues).

Reverted in #10. With the revert, the visible "2 1 4 3" pair-swap in Firefox YouTube playback is back until this proper fix lands.

Proper fix — design

1. Concurrent in-flight requests in the daemon

The single-threaded chardev loop handles one REQ_DECODE at a time. It needs to support N pending requests simultaneously (matching the libva CAPTURE pool depth, ~24) so libavcodec's internal DPB lag doesn't block forward progress.

Minimal version: poll-driven loop, multiple REQ_DECODE messages queued from the kernel, daemon processes them as fast as libavcodec produces output. No threads needed — libavcodec's send_packet and receive_frame calls are non-blocking-ish (send_packet may EAGAIN if the internal queue is full; receive_frame may EAGAIN waiting for input).
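A minimal sketch of the "drain whatever is queued without blocking" step, assuming each REQ_DECODE can be modelled as a fixed-size message carrying only a 64-bit cookie. That wire format and the name drain_requests are illustrative assumptions, not the daemon's actual protocol.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <stdint.h>
#include <unistd.h>

/*
 * Read every request that is readable on fd right now, without blocking.
 * Hypothetical wire format: one uint64_t cookie per message.
 * Returns the number of cookies stored in `cookies`, or -1 on error.
 */
static int drain_requests(int fd, uint64_t *cookies, int max)
{
    int n = 0;

    assert(max >= 0);
    while (n < max) {
        struct pollfd p = { .fd = fd, .events = POLLIN };
        int r = poll(&p, 1, 0);          /* timeout 0: never block */

        if (r < 0) {
            if (errno == EINTR)
                continue;                /* interrupted, retry */
            return -1;
        }
        if (r == 0 || !(p.revents & POLLIN))
            break;                       /* nothing more queued */

        uint64_t cookie;
        ssize_t got = read(fd, &cookie, sizeof cookie);
        if (got != (ssize_t)sizeof cookie)
            return -1;                   /* short read or error */
        cookies[n++] = cookie;
    }
    return n;
}
```

In a real loop the same poll call would also watch for the write side becoming ready, so that finished frames can be shipped between reads.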

Kernel side: the chardev write path needs to accept RESP_FRAME messages that aren't in 1:1 order with READ-side REQ_DECODE delivery. Each REQ_DECODE gets a cookie; RESP_FRAME (in any order) references that cookie.
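The cookie bookkeeping this implies could look roughly like the sketch below: a fixed-size table of in-flight requests that can be completed in any order. All names (inflight_table, inflight_add, inflight_complete) and the capture_index field are hypothetical, and MAX_INFLIGHT simply mirrors the ~24 pool depth mentioned above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_INFLIGHT 24   /* assumed: mirrors the ~24-deep CAPTURE pool */

struct inflight_req {
    bool     used;
    uint64_t cookie;         /* identifies the REQ_DECODE */
    int      capture_index;  /* CAPTURE buffer the request owns */
};

struct inflight_table {
    struct inflight_req slots[MAX_INFLIGHT];
    size_t count;
};

static struct inflight_req *inflight_find(struct inflight_table *t, uint64_t cookie)
{
    for (size_t i = 0; i < MAX_INFLIGHT; i++)
        if (t->slots[i].used && t->slots[i].cookie == cookie)
            return &t->slots[i];
    return NULL;
}

/* Record a new request. Fails on a duplicate cookie or a full table. */
static bool inflight_add(struct inflight_table *t, uint64_t cookie, int capture_index)
{
    if (inflight_find(t, cookie))
        return false;
    for (size_t i = 0; i < MAX_INFLIGHT; i++) {
        if (!t->slots[i].used) {
            t->slots[i] = (struct inflight_req){ true, cookie, capture_index };
            t->count++;
            return true;
        }
    }
    return false;
}

/* Complete a request by cookie, in any order.
 * Returns its capture_index, or -1 if the cookie is unknown. */
static int inflight_complete(struct inflight_table *t, uint64_t cookie)
{
    struct inflight_req *r = inflight_find(t, cookie);

    if (!r)
        return -1;
    assert(t->count > 0);
    r->used = false;
    t->count--;
    return r->capture_index;
}
```

A linear scan is enough at this depth; the point is only that completion is keyed by cookie rather than by arrival order, and that an unknown or already-completed cookie is rejected instead of being trusted.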

2. Drop libavcodec's display-order reorder

Two options:

(a) Use the underlying H264Context directly — drop down a level from avcodec_send_packet/receive_frame to the per-slice decode primitives. Each slice decoded immediately writes its pixels into the picture buffer we control. No internal output queue. This is the "act like real HW" path. Significant FFmpeg-internals work; probably touches APIs that aren't in the public ABI.

(b) Use a different H.264 decoder — e.g. OpenH264, or a custom slice-by-slice path. Avoids the FFmpeg ABI risk but adds a new dependency.

Option (a) is more aligned with the project goal (one decoder library, FFmpeg already loaded) but harder. (b) is a more isolated risk.

3. Move display-order reorder back to libva-v4l2-request-fourier

The V4L2 stateless API contract puts display reorder on the V4L2 CLIENT, not the kernel/HW. libva-v4l2-request-fourier should:

- Decode each CAPTURE buffer as it comes back (out of order, per request_fd completion).
- Maintain a small POC-keyed reorder buffer.
- Hand frames to the VAAPI client (vaSyncSurface, vaDeriveImage, etc.) in POC display order.

This change is in libva-v4l2-request-fourier, not daedalus. May already partly exist for HEVC; needs auditing for H.264.

Acceptance criteria

- `mpv --hwdec=vaapi-copy bbb_720p_h264.mp4` plays smoothly with no frame drops in steady state, no visible pair-swap.
- Firefox YouTube H.264 playback shows monotonic motion (no 2-1-4-3).
- `ffmpeg -hwaccel vaapi -i bbb_1080p30_h264.mp4 -f null -` exits cleanly with no "Failed to end picture decode" errors.
- VP9 / AV1 paths unchanged in behaviour (they don't reorder, so concurrent-request support is transparent).

Related

- #6 (original diagnosis; design wrong)
- #7 (parking attempt; reverted)
- #8 (kernel panic from #7; reverted)
- #9 (mpv stuck-pre-playing under #7+#8; root cause for the revert)
- #10 (the revert PR)
- marfrit/libva-v4l2-request-fourier#13 (slice-buffer overflow on resolution change; adjacent reliability issue)


claude-noether referenced this issue from a commit

2026-05-21 13:42:11 +00:00

daedalus-v4l2{,-dkms}: 79256dc/6ffe92b -> 5d8b436 — revert parking design

marfrit referenced this issue from marfrit/marfrit-packages

2026-05-21 13:42:42 +00:00

daedalus-v4l2{,-dkms}: 79256dc/6ffe92b -> 5d8b436 — revert parking design #71

marfrit commented

2026-05-21 15:11:07 +00:00

Correction to section (3)

On re-reading the V4L2 stateless API + how libva-vaapi actually consumes CAPTURE buffers: section (3) above is wrong — display-order reorder is NOT the V4L2 client's job.

Where display reorder actually happens (real-HW reference)

1. V4L2 stateless decoder driver (kernel, e.g. cedrus / hantro / rkvdec): delivers each CAPTURE buffer in decode order, marked DONE the moment the silicon finishes decoding that slice's picture. 1:1 with the OUTPUT slice it was paired with.
2. libva-v4l2-request-fourier: pure V4L2 ↔ VAAPI surface mapper. Per vaRenderPicture → V4L2 QBUF on a per-request fd. Per vaSyncSurface → V4L2 DQBUF on that request's CAPTURE. Surfaces come back in decode order. No POC reorder here.
3. ffmpeg-vaapi (VAAPI consumer inside Firefox / mpv / ffmpeg): libavcodec/vaapi_h264.c already does POC-based display reorder. It tracks each VA surface's POC from the picture-parameters it submitted via VAPictureParameterBufferH264, holds the per-frame VA surfaces in its own DPB, and emits display-ordered AVFrames to the upstream caller. This is what every existing VAAPI consumer expects and does today.

So libva-v4l2-request-fourier is supposed to pass CAPTURE buffers through 1:1 transparently — and the section-(3) work I described (adding POC reorder inside libva) would actually break the existing ffmpeg-vaapi consumer, which would double-reorder.
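To make "POC-based display reorder" concrete, here is a toy sketch of a reorder buffer that holds back a fixed number of pictures and always releases the one with the smallest POC. It is not how libavcodec implements this: it ignores IDR boundaries and any reorder depth signalled in the stream, and the names (poc_reorder, poc_reorder_push, poc_reorder_flush) and the fixed depth are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define REORDER_CAP 16

struct poc_reorder {
    int    poc[REORDER_CAP];  /* POCs of pictures currently held back */
    size_t n;                 /* number of pictures held */
    size_t depth;             /* how many pictures to hold back */
};

/* Remove and return the smallest POC currently held. Requires n > 0. */
static int pop_min(struct poc_reorder *r)
{
    size_t min = 0;

    for (size_t i = 1; i < r->n; i++)
        if (r->poc[i] < r->poc[min])
            min = i;
    int poc = r->poc[min];
    r->poc[min] = r->poc[--r->n];   /* fill the hole with the last entry */
    return poc;
}

/* Feed one picture in decode order. Returns 1 and sets *out when a
 * picture is ready for display, 0 while the buffer is still filling. */
static int poc_reorder_push(struct poc_reorder *r, int poc, int *out)
{
    assert(r->depth < REORDER_CAP);
    r->poc[r->n++] = poc;
    if (r->n <= r->depth)
        return 0;
    *out = pop_min(r);
    return 1;
}

/* At end of stream, release the remaining pictures one per call.
 * Returns 0 when nothing is left. */
static int poc_reorder_flush(struct poc_reorder *r, int *out)
{
    if (r->n == 0)
        return 0;
    *out = pop_min(r);
    return 1;
}
```

With depth 1 and a decode order of POC 0, 4, 2, 8, 6, the pictures come out as 0, 2, 4, 6, 8. Showing them in decode order instead is exactly the pair-swap pattern described in this issue, and running this logic twice in the chain is the double-reorder risk mentioned above.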

Revised plan

- (1) + (2) together, in daedalus. Concurrent in-flight requests + drop libavcodec's display-order reorder from the daemon (so each REQ_DECODE → 1 RESP_FRAME with the decode-order pixels of that slice's picture).
- (3) deleted from the proper-fix plan. Once (2) is in, the existing libva-v4l2-request-fourier ↔ ffmpeg-vaapi chain handles display reorder upstream of us, the way every other V4L2 stateless decoder works.
- The only libva-v4l2-request-fourier audit worth doing as a sanity check after (1)+(2) deploys: confirm the H.264 path passes CAPTURE buffers 1:1 without any internal queuing. If it does (likely, since that's how it's written today), no libva-side work is needed at all. If we discover otherwise during soak, file as a separate libva-side bug, not part of this issue.

Implementation: option (2) concretely

The "FFmpeg internals" risk in section 2(a) is smaller than I made it sound. The clean public-API path:

- Override AVCodecContext->get_buffer2 so the picture libavcodec is about to decode into gets allocated as our V4L2 CAPTURE buffer (the cookie's mmap'd dmabuf). libavcodec writes pixels there directly during avcodec_send_packet.
- Track the (current REQ's cookie ↔ picture identity) mapping inside the get_buffer2 callback's opaque field.
- Do not call avcodec_receive_frame at all for H.264. Pixels are already in the right CAPTURE buffer after send_packet returns. Emit RESP_FRAME(cookie=N, HAS_PIXELS+SRC_CONSUMED) for the just-submitted packet.
- For VP9 / AV1, keep the existing send_packet → receive_frame 1:1 path. Those codecs don't reorder, so the public API is already correct for them.

This is the pattern used by Firefox's FFmpegVideoDecoder for VA-API (the get_buffer2 ↔ VA surface binding). Tested and stable. No private-API risk.
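The cookie ↔ picture binding from the list above can be modelled without any FFmpeg headers. The sketch below is a plain-C stand-in, not the libavcodec API: bind_capture plays the role of an allocation hook that receives an opaque pointer, and every name in it (decode_state, capture_buf, POOL_SIZE) is hypothetical. A real get_buffer2 override would additionally have to fill in the AVFrame's data pointers, line sizes and buffer references.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define POOL_SIZE 4   /* assumed pool size, for illustration only */

struct capture_buf {
    uint8_t *pixels;        /* would point at the mmap'd CAPTURE buffer */
    uint64_t bound_cookie;  /* request whose picture lives in this buffer */
    int      in_use;
};

struct decode_state {
    struct capture_buf pool[POOL_SIZE];
    uint64_t current_cookie;  /* set just before the packet is submitted */
    size_t   current_index;   /* CAPTURE slot owned by that request */
};

/*
 * Stand-in for the decoder's buffer-allocation hook. The decoder calls
 * it with the opaque pointer when it needs storage for the picture it is
 * about to decode; we hand out the current request's CAPTURE buffer and
 * remember which cookie it now belongs to.
 * Returns NULL if that buffer is still held by an earlier picture.
 */
static struct capture_buf *bind_capture(void *opaque)
{
    struct decode_state *st = opaque;

    assert(st->current_index < POOL_SIZE);
    struct capture_buf *b = &st->pool[st->current_index];
    if (b->in_use)
        return NULL;
    b->in_use = 1;
    b->bound_cookie = st->current_cookie;
    return b;
}
```

The NULL case matters in practice: a buffer that is still referenced as a DPB entry cannot be handed out again, so whatever plays this role in the daemon needs a release path before a CAPTURE slot is reused.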

The acceptance criteria in the issue stay unchanged: they're symptom-level, not design-level.


marfrit referenced this issue

2026-05-21 15:15:11 +00:00

daemon: AV_CODEC_FLAG_LOW_DELAY for H.264 — implements #11 part (2) #12

marfrit referenced this issue from a commit

2026-05-21 15:17:59 +00:00

Merge pull request 'daemon: AV_CODEC_FLAG_LOW_DELAY for H.264 — implements #11 part (2)' (#12) from noether/daemon-low-delay-h264 into main

marfrit referenced this issue from marfrit/daedalus-fourier

2026-05-21 15:50:19 +00:00

CMakeLists: install rules + pkg-config for daedalus_core #1

marfrit referenced this issue

2026-05-21 16:01:19 +00:00

daemon: link daedalus-fourier + log substrate availability at startup #13