H.264 B-frame correctness: re-architect daemon for concurrent in-flight requests + libva-side display-order reorder #11
Tracking issue for the proper fix to the H.264 display-order problem originally diagnosed in #6 and partly mitigated by PR #7 (now reverted in #10 because parking broke libva's 1:1 expectation, see #9).
Context
The naive daemon design (one libavcodec context, single-threaded send_packet → receive_frame → ship pixels per REQ_DECODE) has two competing constraints:
1. libavcodec internally reorders H.264 output to display order. After avcodec_send_packet of a B-frame's slice, avcodec_receive_frame may return EAGAIN (the picture is in the DPB but not yet output), and a later call returns it. We can't easily disable this without going below libavcodec's frame-decode API.
2. V4L2 stateless decoder protocol is strict 1:1 between OUTPUT and CAPTURE buffers per request: every queued bitstream packet must produce exactly one CAPTURE completion, populated with the pixels of the picture whose slices were in that OUTPUT packet, regardless of display order. The V4L2 client (libva-v4l2-request-fourier) handles display reorder afterwards using H.264 POC.
Real hardware stateless decoders satisfy (2) natively because they decode a slice into a specific DPB-indexed CAPTURE buffer immediately, with no internal display-order queue.
What broke
PR #7 tried to satisfy (1) by parking CAPTURE buffers that libavcodec hadn't yet emitted, then routing pixels to the correct cookie via frame->pts when libavcodec finally output them. This violated (2): strict V4L2 stateless clients (mpv via vaapi-copy, ffmpeg-vaapi directly) saw CAPTURE DQBUF return EAGAIN and bailed. Firefox tolerated the resulting stale-pixel mess (because it's lenient and re-queues).
Reverted in #10. Visible "2 1 4 3" pair-swap in Firefox YouTube playback regresses pending this proper fix.
Proper fix — design
1. Concurrent in-flight requests in the daemon
The single-threaded chardev loop synchronises one REQ_DECODE at a time. Need to support N pending REQs simultaneously (matching the libva CAPTURE pool depth, ~24) so libavcodec's internal DPB lag doesn't block forward progress.
Minimal version: poll-driven loop, multiple REQ_DECODE messages queued from the kernel, daemon processes them as fast as libavcodec produces output. No threads needed: libavcodec's send_packet and receive_frame calls are non-blocking-ish (send_packet may EAGAIN if the internal queue is full; receive_frame may EAGAIN waiting for input).
Kernel side: the chardev write path needs to accept RESP_FRAME messages that aren't in 1:1 order with READ-side REQ_DECODE delivery. Each REQ_DECODE gets a cookie; RESP_FRAME (in any order) references that cookie.
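A minimal sketch of the cookie bookkeeping this implies, assuming a fixed-size table in the daemon. The names pending_table, pending_add, pending_complete and pending_count, the 64-bit cookie type, and the MAX_INFLIGHT value are all hypothetical and not taken from the daedalus source; the actual chardev message layout is not modelled here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: sized to match the ~24-deep libva CAPTURE pool. */
#define MAX_INFLIGHT 24

/* One REQ_DECODE that has been read but not yet answered. */
struct pending_req {
    bool     in_use;
    uint64_t cookie;
};

struct pending_table {
    struct pending_req slots[MAX_INFLIGHT];
};

/* Record a newly read REQ_DECODE. Returns false if the table is full
 * or the cookie is already pending. */
static bool pending_add(struct pending_table *t, uint64_t cookie)
{
    assert(t != NULL);
    int free_slot = -1;
    for (int i = 0; i < MAX_INFLIGHT; i++) {
        if (t->slots[i].in_use && t->slots[i].cookie == cookie)
            return false;
        if (!t->slots[i].in_use && free_slot < 0)
            free_slot = i;
    }
    if (free_slot < 0)
        return false;
    t->slots[free_slot].in_use = true;
    t->slots[free_slot].cookie = cookie;
    return true;
}

/* Complete a request by cookie, in any order relative to pending_add.
 * Returns false for a cookie that is not pending, so a stray or
 * duplicate RESP_FRAME can be rejected. */
static bool pending_complete(struct pending_table *t, uint64_t cookie)
{
    assert(t != NULL);
    for (int i = 0; i < MAX_INFLIGHT; i++) {
        if (t->slots[i].in_use && t->slots[i].cookie == cookie) {
            t->slots[i].in_use = false;
            return true;
        }
    }
    return false;
}

/* Number of requests still awaiting a RESP_FRAME. */
static int pending_count(const struct pending_table *t)
{
    assert(t != NULL);
    int n = 0;
    for (int i = 0; i < MAX_INFLIGHT; i++)
        n += t->slots[i].in_use ? 1 : 0;
    return n;
}
```

A linear scan is enough at this depth; the point is only that completion is keyed on the cookie rather than on arrival order, so a RESP_FRAME for cookie 2 can be accepted while cookie 1 is still pending.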
2. Drop libavcodec's display-order reorder
Two options:
(a) Use the underlying H264Context directly: drop down a level from avcodec_send_packet/receive_frame to the per-slice decode primitives. Each slice decoded immediately writes its pixels into the picture buffer we control. No internal output queue. This is the "act like real HW" path. Significant FFmpeg-internals work; probably touches APIs that aren't in the public ABI.
(b) Use a different H.264 decoder, e.g. OpenH264, or a custom slice-by-slice path. Avoids the FFmpeg ABI risk but adds a new dependency.
Option (a) is more aligned with the project goal (one decoder library, FFmpeg already loaded) but harder. (b) is a more isolated risk.
3. Move display-order reorder back to libva-v4l2-request-fourier
The V4L2 stateless API contract puts display reorder on the V4L2 CLIENT, not the kernel/HW. libva-v4l2-request-fourier should:
vaSyncSurface, vaDeriveImage, etc.) in POC display order.
This change is in libva-v4l2-request-fourier, not daedalus. May already partly exist for HEVC; needs auditing for H.264.
Acceptance criteria
--hwdec=vaapi-copy bbb_720p_h264.mp4 plays smoothly with no frame drops in steady state, no visible pair-swap.
ffmpeg -hwaccel vaapi -i bbb_1080p30_h264.mp4 -f null - exits cleanly with no "Failed to end picture decode" errors.
Related
Correction to section (3)
On re-reading the V4L2 stateless API + how libva-vaapi actually consumes CAPTURE buffers: section (3) above is wrong — display-order reorder is NOT the V4L2 client's job.
Where display reorder actually happens (real-HW reference)
vaRenderPicture → V4L2 QBUF on a per-request fd. Per vaSyncSurface → V4L2 DQBUF on that request's CAPTURE. Surfaces come back in decode order. No POC reorder here.
libavcodec/vaapi_h264.c already does POC-based display reorder. It tracks each VA surface's POC from the picture-parameters it submitted via VAPictureParameterBufferH264, holds the per-frame VA surfaces in its own DPB, and emits display-ordered AVFrames to the upstream caller. This is what every existing VAAPI consumer expects + does today.
So libva-v4l2-request-fourier is supposed to pass CAPTURE buffers through 1:1 transparently, and the section-(3) work I described (adding POC reorder inside libva) would actually break the existing ffmpeg-vaapi consumer, which would double-reorder.
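For illustration, here is a toy model of the consumer-side reorder described above: pictures arrive in decode order tagged with their POC, and the consumer holds a few back and releases the lowest POC first. This is a sketch only, not libavcodec's actual logic. The struct and function names are hypothetical, the fixed hold-back depth is an assumption (a real decoder derives it from the stream), and POC resets at IDR boundaries are ignored.

```c
#include <assert.h>
#include <stddef.h>

#define REORDER_CAP 16  /* assumption: upper bound on held pictures */

struct picture {
    int surface_id;  /* stand-in for a VA surface handle */
    int poc;         /* picture order count */
};

struct reorder_buf {
    struct picture held[REORDER_CAP];
    size_t count;
    size_t depth;    /* pictures held back before output starts */
};

/* Remove and return the held picture with the lowest POC.
 * The caller guarantees count > 0. */
static struct picture pop_lowest_poc(struct reorder_buf *rb)
{
    size_t best = 0;
    for (size_t i = 1; i < rb->count; i++)
        if (rb->held[i].poc < rb->held[best].poc)
            best = i;
    struct picture out = rb->held[best];
    rb->held[best] = rb->held[rb->count - 1];  /* swap-remove */
    rb->count--;
    return out;
}

/* Feed one picture in decode order. Returns 1 and fills *out when a
 * picture is ready for display, 0 when it is only being held. */
static int reorder_push(struct reorder_buf *rb, struct picture in,
                        struct picture *out)
{
    assert(rb->count < REORDER_CAP);
    rb->held[rb->count++] = in;
    if (rb->count <= rb->depth)
        return 0;
    *out = pop_lowest_poc(rb);
    return 1;
}

/* Drain at end of stream. Returns 1 while pictures remain. */
static int reorder_flush(struct reorder_buf *rb, struct picture *out)
{
    if (rb->count == 0)
        return 0;
    *out = pop_lowest_poc(rb);
    return 1;
}
```

With a depth of 1 and pictures pushed with POCs 0, 4, 2, 8, 6 (an I P B P B decode order), the model releases them as 0, 2, 4, 6, 8. Running the same reorder a second time inside libva would be redundant at best, which is the double-reorder concern above.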
Revised plan
Implementation: option (2) concretely
The "FFmpeg internals" risk in section 2(a) is smaller than I made it sound. The clean public-API path:
AVCodecContext->get_buffer2 so the picture libavcodec is about to decode into gets allocated as our V4L2 CAPTURE buffer (the cookie's mmap'd dmabuf). libavcodec writes pixels there directly during avcodec_send_packet.
get_buffer2 callback's opaque field.
avcodec_receive_frame at all for H.264. Pixels are already in the right CAPTURE buffer after send_packet returns. Emit RESP_FRAME(cookie=N, HAS_PIXELS+SRC_CONSUMED) for the just-submitted packet.
send_packet → receive_frame 1:1 path; those codecs don't reorder, so the public API is already correct for them.
This is the pattern used by Firefox's FFmpegVideoDecoder for VA-API (the get_buffer2 ↔ VA surface binding). Tested and stable. No private-API risk.
The issue's acceptance criteria stay unchanged: they're symptom-level, not design-level.