Maps the per-frame decode pipeline (BeginPicture → RenderPicture → EndPicture → SyncSurface) and walks frame-1 IDR + frame-2 P state transitions through h264_set_controls and the DPB. Eliminates 6 of 13 hypotheses from Phase 0 by source-read alone (H-A DPB stale, H-B POC sentinel for small POCs, H-C SLICE flags in FRAME_BASED, H-D request_fd lifecycle, H-F pred-weight, H-G scaling matrix re-upload). Adds 4 new hypotheses (H-J reference_ts derivation, H-K CAPTURE buf count, H-L slice_data alignment vs h264_start_code, H-M frame_num cross-check). Live hypotheses for Phase 3: H-E (CAPTURE rotation/reference-resolution), H-H (start_code prefix), H-L (slice_data alignment), H-K (cap_pool size). Phase 3 plan: strace-diff libva-vaapi-H.264 vs kdirect-H.264 on the same fixture; byte-level frame-1/2/3 examination; dmesg check. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Iteration 8 — Phase 2 (situation analysis)
H.264 backend source-read on iter7 fork tip 6df2159. Maps the per-frame decode pipeline, enumerates code paths the kernel reads, and refines the hypothesis surface ahead of Phase 3 empirical strace-diff.
The per-frame pipeline (libva → V4L2)
For each frame in ffmpeg-vaapi-copy's 3-frame H.264 sweep:
1. RequestBeginPicture (picture.c:286)
For surface S:
- If S had a current_slot from a prior decode → surface_unbind_slot releases it.
- cap_pool_acquire(&capture_pool, surface_id) → fresh CAPTURE slot (LRU). Bind via surface_bind_slot → S's destination_index, destination_data, etc.
- request_pool_acquire(&output_pool) → fresh OUTPUT slot. Bind S's source_index, source_data (mmap pointer into the slot), source_size, and request_fd = slot->request_fd (the slot's permanent fd, REINIT'd between cycles).
- Reset slices_size = 0, slices_count = 0, params.h264.matrix_set = false.
- Set status VASurfaceRendering, record render_surface_id.
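The slot rotation above can be modeled with a small pool. This is a minimal sketch, assuming a fixed-size pool whose slots are FREE, IN_DECODE or DECODED and carry an LRU stamp. The sketch_* names, the struct layout and the FREE state are illustrative assumptions, not the backend's actual cap_pool.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_POOL_MAX 16

enum sketch_slot_state { SLOT_FREE, SLOT_IN_DECODE, SLOT_DECODED };

struct sketch_slot {
    enum sketch_slot_state state;
    uint32_t surface_id;  /* surface currently bound to this slot */
    uint64_t last_used;   /* LRU stamp: lower means older */
};

struct sketch_pool {
    struct sketch_slot slots[SKETCH_POOL_MAX];
    size_t count;         /* number of CAPTURE buffers in the pool */
    uint64_t clock;       /* increments on every acquire */
};

static void sketch_pool_init(struct sketch_pool *p, size_t count)
{
    assert(p && count > 0 && count <= SKETCH_POOL_MAX);
    p->count = count;
    p->clock = 0;
    for (size_t i = 0; i < count; i++)
        p->slots[i] = (struct sketch_slot){ SLOT_FREE, 0, 0 };
}

/* Prefer a FREE slot; otherwise recycle the least recently used DECODED
 * slot. Returns the slot index, or -1 if every slot is still IN_DECODE. */
static int sketch_pool_acquire(struct sketch_pool *p, uint32_t surface_id)
{
    int pick = -1;

    for (size_t i = 0; i < p->count; i++) {
        const struct sketch_slot *s = &p->slots[i];

        if (s->state == SLOT_IN_DECODE)
            continue;
        if (s->state == SLOT_FREE) {
            pick = (int)i;
            break;
        }
        if (pick < 0 || s->last_used < p->slots[pick].last_used)
            pick = (int)i;
    }
    if (pick < 0)
        return -1;

    p->slots[pick].state = SLOT_IN_DECODE;
    p->slots[pick].surface_id = surface_id;
    p->slots[pick].last_used = ++p->clock;
    return pick;
}

static void sketch_pool_mark_decoded(struct sketch_pool *p, int index)
{
    assert(index >= 0 && (size_t)index < p->count);
    p->slots[index].state = SLOT_DECODED;
}
```

With a pool of two slots, the third acquire in this model recycles the slot that held the first frame, which is the collision a too-small CAPTURE pool would produce when that frame is still needed as a reference.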
2. RequestRenderPicture (picture.c:369)
For each VAAPI buffer ID:
- VASliceDataBufferType → if context->h264_start_code, prepend 00 00 01 start code, then memcpy slice bytes into source_data + slices_size. Increment slices_size and slices_count.
- VAPictureParameterBufferH264 → memcpy into surface->params.h264.picture (the VAAPI struct, not a V4L2 control yet).
- VASliceParameterBufferH264 → memcpy into surface->params.h264.slice (LAST slice's params wins; multi-slice frames lose all but the last).
- VAIQMatrixBufferH264 → memcpy into surface->params.h264.matrix; set matrix_set = true.
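The slice-data path can be sketched as follows. This is a minimal illustration, assuming the OUTPUT payload is a flat byte buffer; append_slice and its explicit bounds check are hypothetical additions for the example, and the real picture.c code may differ.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Appends one slice to the OUTPUT payload, optionally prefixed with the
 * 3-byte start code 00 00 01. Returns 0 on success, -1 if the slice (plus
 * prefix) does not fit; on failure nothing is modified. */
static int append_slice(uint8_t *source_data, size_t source_size,
                        size_t *slices_size, unsigned *slices_count,
                        const uint8_t *slice, size_t slice_len,
                        bool start_code)
{
    static const uint8_t prefix[3] = { 0x00, 0x00, 0x01 };
    size_t prefix_len = start_code ? sizeof(prefix) : 0;

    assert(source_data && slices_size && slices_count && slice);

    if (*slices_size > source_size ||
        slice_len > source_size - *slices_size ||
        prefix_len > source_size - *slices_size - slice_len)
        return -1;

    if (start_code) {
        memcpy(source_data + *slices_size, prefix, sizeof(prefix));
        *slices_size += sizeof(prefix);
    }
    memcpy(source_data + *slices_size, slice, slice_len);
    *slices_size += slice_len;
    (*slices_count)++;
    return 0;
}
```

The start_code argument plays the role of context->h264_start_code: whether the prefix is written decides where the hardware finds the first NAL header byte.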
3. RequestEndPicture (picture.c:408)
Synchronous decode cycle:
- gettimeofday(&surface->timestamp, NULL): stamps the OUTPUT QBUF's timestamp = the tag the kernel will associate with the decoded CAPTURE buffer.
- request_fd = surface->request_fd (from BeginPicture's pool-slot binding).
- codec_set_controls → h264_set_controls (see §h264_set_controls below). This builds all V4L2 controls and does S_EXT_CTRLS(video_fd, request_fd=request_fd, controls, num).
- v4l2_queue_buffer CAPTURE: QBUF CAPTURE buffer (destination_index), no request.
- v4l2_queue_buffer OUTPUT with request_fd (QBUF flags |= REQUEST_FD), slices_size bytes payload.
- RequestSyncSurface (surface.c:278):
  - media_request_queue(request_fd) → MEDIA_REQUEST_IOC_QUEUE: atomically commits S_EXT_CTRLS + OUTPUT QBUF as one decode request.
  - media_request_wait_completion: POLL until decode finishes.
  - media_request_reinit: reset request fd for next cycle.
  - v4l2_dequeue_buffer OUTPUT.
  - v4l2_dequeue_buffer CAPTURE.
  - cap_pool_mark_decoded → slot state IN_DECODE → DECODED.
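The request half of this cycle (queue, wait, reinit) maps onto the Media Request API roughly as below. This is a sketch under assumptions: request_cycle_sketch is a hypothetical helper, the timeout handling is invented for the example, and the backend's media_request_* wrappers may behave differently. The ioctls and the POLLPRI completion event are the standard Linux UAPI.

```c
#include <errno.h>
#include <poll.h>
#include <stddef.h>
#include <sys/ioctl.h>
#include <linux/media.h>

/* Queues a prepared request, waits for it to complete, then re-initialises
 * the request fd so it can be reused for the next frame.
 * Returns 0 on success, -1 on failure with errno set. */
static int request_cycle_sketch(int request_fd, int timeout_ms)
{
    struct pollfd pfd = { .fd = request_fd, .events = POLLPRI };
    int rc;

    /* Commit the controls and the OUTPUT buffer attached to this request. */
    if (ioctl(request_fd, MEDIA_REQUEST_IOC_QUEUE, NULL) < 0)
        return -1;

    /* A completed request is signalled as an exceptional condition. */
    do {
        rc = poll(&pfd, 1, timeout_ms);
    } while (rc < 0 && errno == EINTR);

    if (rc < 0)
        return -1;
    if (rc == 0) {
        errno = ETIMEDOUT;
        return -1;
    }
    if (!(pfd.revents & POLLPRI)) {
        errno = EIO;
        return -1;
    }

    /* Recycle the request fd for the next decode cycle. */
    if (ioctl(request_fd, MEDIA_REQUEST_IOC_REINIT, NULL) < 0)
        return -1;
    return 0;
}
```

The DQBUF calls for OUTPUT and CAPTURE are left out on purpose; they go through the video fd rather than the request fd.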
4. h264_set_controls (h264.c:797)
DPB mutation order:
- output = dpb_lookup(context, CurrPic, NULL, NULL): find existing entry for current frame's picture_id (returns NULL on first decode of a fresh surface).
- If NULL → output = dpb_find_entry(context) = first invalid-and-unreserved entry, or oldest unused entry.
- dpb_clear_entry(output, reserved=true): memset 0, mark reserved=true (excludes from dpb_find_invalid_entry).
- dpb_update(context, &surface->params.h264.picture):
  - Increment dpb.age.
  - Clear used on every entry.
  - For each ReferenceFrame in VAPicture->ReferenceFrames[i] (i < num_ref_frames):
    - If VA_PICTURE_H264_INVALID, skip.
    - dpb_lookup for that picture_id; if found, age=current, used=true. Else dpb_insert (inserts the missing reference into a free entry).
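The mutation order can be exercised in isolation with a toy model. This is a sketch assuming a 16-entry DPB keyed by picture_id; the sk_* functions and struct fields are simplified stand-ins for the backend's dpb_* helpers, not copies of them.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SK_DPB_SIZE 16
#define SK_PIC_INVALID UINT32_MAX  /* stands in for VA_PICTURE_H264_INVALID */

struct sk_entry {
    uint32_t picture_id;
    unsigned age;
    bool valid, used, reserved;
};

struct sk_dpb {
    struct sk_entry entries[SK_DPB_SIZE];
    unsigned age;
};

static struct sk_entry *sk_lookup(struct sk_dpb *d, uint32_t picture_id)
{
    for (size_t i = 0; i < SK_DPB_SIZE; i++)
        if (d->entries[i].valid && d->entries[i].picture_id == picture_id)
            return &d->entries[i];
    return NULL;
}

/* First entry that is neither valid nor reserved, else the oldest entry
 * that is unused and unreserved. */
static struct sk_entry *sk_find_entry(struct sk_dpb *d)
{
    struct sk_entry *oldest = NULL;

    for (size_t i = 0; i < SK_DPB_SIZE; i++) {
        struct sk_entry *e = &d->entries[i];

        if (!e->valid && !e->reserved)
            return e;
        if (!e->used && !e->reserved && (!oldest || e->age < oldest->age))
            oldest = e;
    }
    return oldest;
}

static void sk_clear_entry(struct sk_entry *e, bool reserved)
{
    memset(e, 0, sizeof(*e));
    e->reserved = reserved;
}

static void sk_insert(struct sk_dpb *d, uint32_t picture_id, struct sk_entry *e)
{
    if (!e)
        e = sk_find_entry(d);
    if (!e)
        return;  /* DPB full: the toy model drops the picture */
    memset(e, 0, sizeof(*e));
    e->picture_id = picture_id;
    e->age = d->age;
    e->valid = true;
    e->used = true;
}

/* Bumps the age, clears every used flag, then re-marks (or inserts) each
 * reference listed for the current picture. */
static void sk_update(struct sk_dpb *d, const uint32_t *refs, size_t n)
{
    d->age++;
    for (size_t i = 0; i < SK_DPB_SIZE; i++)
        d->entries[i].used = false;

    for (size_t i = 0; i < n; i++) {
        struct sk_entry *e;

        if (refs[i] == SK_PIC_INVALID)
            continue;
        e = sk_lookup(d, refs[i]);
        if (e) {
            e->age = d->age;
            e->used = true;
        } else {
            sk_insert(d, refs[i], NULL);
        }
    }
}
```

Running lookup, find, clear, update and insert for two pictures in sequence lands them in entries 0 and 1, with the first entry re-marked as used while the second picture is prepared.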
Then construct V4L2 controls:
5. h264_va_picture_to_v4l2 (h264.c:346): fills v4l2_ctrl_h264_decode_params, pps, sps. Includes:
- Bit-parses the slice_header from the OUTPUT buffer bytes (skipping start_code if h264_start_code) for idr_pic_id, pic_order_cnt_lsb, delta_pic_order_cnt*, pic_order_cnt_bit_size, dec_ref_pic_marking_bit_size.
- Calls h264_fill_dpb which iterates DPB entries with valid && used and fills v4l2 decode->dpb[i] with reference_ts = v4l2_timeval_to_ns(ref_surface->timestamp), frame_num, pic_num, top/bottom_field_order_cnt, flags = VALID|ACTIVE (|LONG_TERM), fields = V4L2_H264_FRAME_REF.
- Sets decode->flags |= IDR_PIC if nal_unit_type == 5. Sets FIELD_PIC, BOTTOM_FIELD flags.
- Sets pps, sps flags (entropy_coding_mode, weighted_pred, transform_8x8, etc.).
- Scaling matrix (always submitted): from VAAPI's IQMatrix if matrix_set, else flat 16 default.
- h264_va_slice_to_v4l2: fills slice_params + pred_weights (used only in SLICE_BASED mode; here unused).
- pps.flags |= SCALING_MATRIX_PRESENT.
- pps.num_ref_idx_l0/l1_default_active_minus1 = slice.num_ref_idx_l0/l1_active_minus1 (slice-derived since VAAPI doesn't forward PPS values).
- Switch on slice_type % 5 → decode.flags |= PFRAME or |= BFRAME.
- sps.profile_idc = h264_profile_to_idc(profile), sps.level_idc = derive_level_idc(width_mbs, height_mbs).
- const bool slice_based = false; hardcoded. Only SPS + PPS + DECODE_PARAMS + SCALING_MATRIX controls submitted. SLICE_PARAMS and PRED_WEIGHTS skipped (per UAPI doc, must not submit in FRAME_BASED).
- v4l2_set_controls(video_fd, surface->request_fd, controls, 4): S_EXT_CTRLS with WHICH_REQUEST_VAL = request_fd.
- dpb_insert(context, &CurrPic, output=entry): current frame's CurrPic stored into the DPB entry reserved by dpb_clear_entry (step 3 of the DPB mutation order). entry.pic = CurrPic, valid=true, reserved=false, used=true.
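The slice-header bit-parse and the slice_type % 5 switch both rest on Exp-Golomb ue(v) decoding. The sketch below reads only the first two slice-header fields (first_mb_in_slice, slice_type) and reduces slice_type modulo 5. All names are hypothetical, and it assumes no emulation-prevention bytes (00 00 03) occur in the bytes it reads, which a complete parser would have to handle.

```c
#include <stddef.h>
#include <stdint.h>

struct bit_reader {
    const uint8_t *data;
    size_t size;  /* bytes available */
    size_t pos;   /* bits consumed so far */
};

/* Returns the next bit (MSB first), or -1 at end of data. */
static int read_bit(struct bit_reader *br)
{
    int bit;

    if (br->pos >= br->size * 8)
        return -1;
    bit = (br->data[br->pos >> 3] >> (7 - (br->pos & 7))) & 1;
    br->pos++;
    return bit;
}

/* Unsigned Exp-Golomb: N leading zero bits, a 1 bit, then N suffix bits.
 * The decoded value is 2^N - 1 + suffix. */
static int read_ue(struct bit_reader *br, uint32_t *value)
{
    unsigned zeros = 0;
    uint32_t suffix = 0;
    int bit;

    while ((bit = read_bit(br)) == 0)
        if (++zeros > 31)
            return -1;
    if (bit < 0)
        return -1;

    for (unsigned i = 0; i < zeros; i++) {
        bit = read_bit(br);
        if (bit < 0)
            return -1;
        suffix = (suffix << 1) | (uint32_t)bit;
    }
    *value = ((UINT32_C(1) << zeros) - 1) + suffix;
    return 0;
}

/* slice_type % 5 as defined by the H.264 slice_type table. */
enum slice_kind { SLICE_P = 0, SLICE_B = 1, SLICE_I = 2, SLICE_SP = 3, SLICE_SI = 4 };

/* Returns slice_type % 5 for one slice NAL unit, or -1 on a short or
 * malformed buffer. has_start_code says whether buf begins with 00 00 01. */
static int slice_kind_from_nal(const uint8_t *buf, size_t size, int has_start_code)
{
    size_t skip = (has_start_code ? 3 : 0) + 1;  /* start code + NAL header */
    uint32_t first_mb_in_slice, slice_type;
    struct bit_reader br;

    if (size <= skip)
        return -1;
    br.data = buf + skip;
    br.size = size - skip;
    br.pos = 0;

    if (read_ue(&br, &first_mb_in_slice) < 0 || read_ue(&br, &slice_type) < 0)
        return -1;
    if (slice_type > 9)
        return -1;
    return (int)(slice_type % 5);
}
```

If has_start_code disagrees with the bytes actually present, the reader starts at the wrong offset and every field after it is misread.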
Per-frame state walkthrough on bbb_1080p30_h264.mp4 3-frame
Frame 1 (IDR keyframe)
- DPB empty before this frame.
- dpb_lookup(CurrPic_1) → NULL.
- dpb_find_entry → dpb_find_invalid_entry → entry[0] (all entries are !valid && !reserved). Returns entry[0].
- dpb_clear_entry(entry[0], reserved=true) → memset 0, reserved=1.
- dpb_update: clear all used. ReferenceFrames[] for IDR is empty. No further mutation.
- h264_fill_dpb: iterate; no entry with valid && used. v4l2 decode->dpb[] all-zero.
- decode.flags |= IDR_PIC (nal_unit_type==5).
- Submit S_EXT_CTRLS, QBUF CAPTURE (slot S_C0), QBUF OUTPUT (slot S_O0, timestamp tt_1).
- MEDIA_REQUEST_IOC_QUEUE, wait, REINIT, DQBUF OUTPUT, DQBUF CAPTURE.
- dpb_insert(CurrPic_1, entry[0]) → entry[0].pic = CurrPic_1, valid=1, used=1, age=1.
Frame 2 (P-frame, refs frame 1)
- DPB has entry[0] = frame_1 (valid, used, age=1).
- dpb_lookup(CurrPic_2) → NULL (CurrPic_2 has new picture_id).
- dpb_find_entry → dpb_find_invalid_entry skips entry[0] (valid=true). Returns entry[1].
- dpb_clear_entry(entry[1], reserved=true) → memset 0, reserved=1.
- dpb_update:
  - age=2.
  - Clear used on all entries (entry[0].used=0 temporarily).
  - For ReferenceFrames[0] = frame_1: dpb_lookup finds entry[0]. entry[0].age=2, entry[0].used=1.
- h264_fill_dpb:
  - entry[0] valid && used: fill v4l2 dpb[0]:
    - reference_ts = v4l2_timeval_to_ns(SURFACE(driver_data, entry[0].pic.picture_id)->timestamp) = tt_1.
    - frame_num = entry[0].pic.frame_idx (frame 1's frame_num, typically 0).
    - pic_num = FrameNumWrap(0, 1) = 0.
    - top/bottom_field_order_cnt = strip_sentinel(entry[0].pic.TopFieldOrderCnt, .flags).
    - flags = VALID|ACTIVE.
    - fields = V4L2_H264_FRAME_REF.
  - entry[1]: clear (reserved, !valid). Skip.
  - entries[2..15]: not valid. Skip.
- decode.flags |= PFRAME, frame_num=1.
- Submit S_EXT_CTRLS, QBUF CAPTURE (slot S_C1 ≠ S_C0), QBUF OUTPUT (slot S_O1 ≠ S_O0, timestamp tt_2).
- MEDIA_REQUEST_IOC_QUEUE → kernel: reads dpb[0].reference_ts=tt_1, looks up CAPTURE buffer with that ts in vb2 queue, uses as reference; decodes P MB partition; CAPTURE buffer S_C1 receives decoded pixels.
- DQBUF OUTPUT, DQBUF CAPTURE.
- dpb_insert(CurrPic_2, entry[1]) → entry[1] gets frame_2.
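The h264_fill_dpb step of this walkthrough can be sketched as a filter over the backend DPB. The structs below are simplified stand-ins, the SK_FLAG_* values are local constants for the example, and sk_strip_sentinel implements only the "clear bit 16" behaviour described in this note for non-negative POCs; the real helper also takes the picture flags.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SK_DPB_SIZE     16
#define SK_FLAG_VALID   0x01u
#define SK_FLAG_ACTIVE  0x02u
#define SK_POC_SENTINEL (INT32_C(1) << 16)

struct sk_ref {            /* backend-side DPB entry (simplified) */
    bool valid, used;
    uint16_t frame_num;
    int32_t top_poc, bottom_poc;
    uint64_t timestamp_ns; /* OUTPUT timestamp of the decode that produced it */
};

struct sk_v4l2_dpb {       /* kernel-facing DPB entry (simplified) */
    uint64_t reference_ts;
    uint16_t frame_num;
    int32_t top_field_order_cnt, bottom_field_order_cnt;
    uint32_t flags;
};

/* Clears bit 16. Intended for non-negative POC values only. */
static int32_t sk_strip_sentinel(int32_t poc)
{
    return poc & ~SK_POC_SENTINEL;
}

/* Copies every valid && used entry into out[]; returns how many were
 * written. Reserved-but-invalid and stale entries are skipped. */
static size_t sk_fill_dpb(const struct sk_ref *refs, size_t n,
                          struct sk_v4l2_dpb *out)
{
    size_t count = 0;

    for (size_t i = 0; i < n && count < SK_DPB_SIZE; i++) {
        if (!refs[i].valid || !refs[i].used)
            continue;
        out[count].reference_ts = refs[i].timestamp_ns;
        out[count].frame_num = refs[i].frame_num;
        out[count].top_field_order_cnt = sk_strip_sentinel(refs[i].top_poc);
        out[count].bottom_field_order_cnt = sk_strip_sentinel(refs[i].bottom_poc);
        out[count].flags = SK_FLAG_VALID | SK_FLAG_ACTIVE;
        count++;
    }
    return count;
}
```

For frame 2 this yields exactly one kernel-facing entry whose reference_ts is frame 1's timestamp. A genuine POC of 65540 would come out as 4, which is the misfire discussed under H-B.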
Frame 3 (next inter)
Same shape. Both frame_1 and frame_2 are in DPB; ReferenceFrames[] for B/P may include both. The bug signature says frame 3 is also all-zero.
Empirical bug signature recap
- Frame 1: 99.99% zero, but first 16 bytes = 81 81 80 80 80 7f 7f 7f 7f 7f 7f 80 80 80 81 81 (real chroma DC near gray). Suggests decode partially executed for a small region of frame 1 (or those 16 bytes leaked from kernel into the cap_pool buffer position; the cap_pool init pattern is constant 0x4c green; these bytes are not the init pattern). Hash 71ac099b….
- Frame 2: 100% zero (or matches the cap_pool init pattern). No decoded content.
- Frame 3: same as frame 2.
The "real first 16 bytes of frame 1" is the strongest cue: decode started writing but stopped after a few bytes. Three possibilities:
- A — Decode aborted after a partial write due to an error mid-decode.
- B — Decode completed correctly but only wrote a few bytes due to a stride / pitch / partial-DMA truncation.
- C — Memory was initialized with garbage (not 0x4c green) just before the read.
(A) and (B) are kernel-side. (C) is a libva-side allocation bug. iter5b-β changed cap_pool initialization patterns; need to verify what frame 1 actually shows vs what frame 2/3 show byte-by-byte to choose between these.
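Telling these cases apart byte-by-byte needs only a three-way histogram per frame. This is a small sketch, assuming each frame is available as a flat byte buffer; classify_frame and struct byte_stats are hypothetical names, and the init byte is passed in because the pattern has changed between iterations.

```c
#include <stddef.h>
#include <stdint.h>

struct byte_stats {
    size_t zero;         /* bytes equal to 0x00 */
    size_t init_pattern; /* bytes equal to the pool's init byte */
    size_t other;        /* everything else, i.e. candidate decoded content */
    size_t first_other;  /* offset of the first "other" byte, or len if none */
};

/* Classifies every byte of one frame. A zero byte is counted as zero even
 * if init_byte happens to be 0x00. */
static struct byte_stats classify_frame(const uint8_t *buf, size_t len,
                                        uint8_t init_byte)
{
    struct byte_stats st = { 0, 0, 0, len };

    for (size_t i = 0; i < len; i++) {
        if (buf[i] == 0x00) {
            st.zero++;
        } else if (buf[i] == init_byte) {
            st.init_pattern++;
        } else {
            if (st.other == 0)
                st.first_other = i;
            st.other++;
        }
    }
    return st;
}
```

Applied to frame 1 with init_byte 0x4c, first_other says whether the 16 non-zero bytes sit at the start of the Y plane, and init_pattern says whether any of the pool's fill survived.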
Refined hypothesis surface (Phase 0 H-A through H-I narrowed)
From source-read, I can now narrow:
- H-A (DPB stale): Backend logic is sound per the per-frame walkthrough. Eliminated unless kernel reads a control we don't fill.
- H-B (POC sentinel mis-strip): h264_strip_ffmpeg_poc_sentinel strips bit 16 if set, else returns as-is. Frame 2 POC values are computed by ffmpeg-vaapi from H.264 spec. If sentinel detection misfires on a legit POC ≥ 65536, it would subtract 65536 silently. For 30-frame BBB content, POC values are tiny: eliminate.
- H-C (slice flags missing): SLICE_PARAMS not submitted in FRAME_BASED. Eliminate.
- H-D (per-frame request_fd lifecycle): iter6 model is "one fd per slot, REINIT between cycles." Each frame gets a possibly-different request_fd (depending on which OUTPUT pool slot it borrows). The fd is REINIT'd cleanly after DQBUF. Eliminate.
- H-E (CAPTURE slot rotation): Each frame gets a different cap_pool slot. Frame 1's CAPTURE buffer (S_C0) stays in DECODED state after DQBUF. The kernel must find it by reference_ts = tt_1 to use as reference for frame 2. Live hypothesis: kernel's reference-resolution may require frame 1's CAPTURE buffer to be still QBUF'd, not DQBUF'd. The kernel-direct path may keep CAPTURE buffers QBUF'd for longer (asynchronous decode pipeline) while libva synchronously DQBUFs each frame's CAPTURE before starting the next.
- H-F (pred-weight): Not submitted in FRAME_BASED. Eliminate.
- H-G (scaling matrix not re-uploaded): Submitted every frame. Eliminate.
- H-H (start_code prefix): h264_start_code is set from VIDIOC_G_EXT_CTRLS of the START_CODE control in h264_get_controls. iter6 added profile gating via feedback_unconditional_codec_state.md. Need to verify on fresnel: does rkvdec advertise V4L2_STATELESS_H264_START_CODE_ANNEX_B or _NONE? If misaligned, slice data is misinterpreted. Live hypothesis: Phase 3 strace check.
- H-I (kernel UAPI contract drift): linux-fresnel-fourier 7.0-1 is v6.16-rc4-base; check for new mandatory controls added since v5.13. Specifically: V4L2_CID_STATELESS_H264_DECODE_MODE was probed at context init; the answer is FRAME_BASED on rkvdec. Other controls: none known to be required.
- NEW H-J (reference_ts derivation): v4l2_timeval_to_ns converts surface->timestamp to nanoseconds as tv_sec * 1e9 + tv_usec * 1000. The kernel-side vb2_v4l2_buffer.vb2_buf.timestamp is also in nanoseconds. Check: does surface->timestamp get set to the same value on the OUTPUT QBUF and on reference_ts? Yes: line 440 sets it, line 461 reads it. Eliminate.
- NEW H-K (CAPTURE buffer count): How many CAPTURE buffers does the backend REQBUFS? If only 1 or 2, frame 2's CAPTURE may collide with frame 1's reference. iter5b-β surface_fill_format_uniform and the cap_pool size both need checking.
- NEW H-L (slice_data_bit_offset misalignment in FRAME_BASED): In FRAME_BASED mode rkvdec parses the slice header on its own from OUTPUT bytes. If the OUTPUT bytes have the wrong start code presence vs what rkvdec expects, slice header parse misaligns. The start_code prefix is gated by context->h264_start_code which mirrors the START_CODE control's value. Live hypothesis.
- NEW H-M (frame_num field on rkvdec frame_based): rkvdec FRAME_BASED parses its own slice header but may still cross-check against decode->frame_num. If frame 2's decode->frame_num=1 and the bitstream says frame_num=1, ok. If they mismatch, kernel rejects. Probably OK; verify in strace.
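The H-J argument comes down to one conversion applied to the same struct timeval on both paths. The sketch below uses the arithmetic of the UAPI helper v4l2_timeval_to_ns (seconds times 1e9 plus microseconds times 1000); sk_timeval_to_ns is a local name chosen so the example does not depend on kernel header versions.

```c
#include <stdint.h>
#include <sys/time.h>

/* Nanosecond tag derived from a struct timeval. The same value must reach
 * the kernel twice: once as the OUTPUT buffer's timestamp and once as
 * reference_ts in a later frame's DPB entry. */
static uint64_t sk_timeval_to_ns(const struct timeval *tv)
{
    return (uint64_t)tv->tv_sec * 1000000000ULL +
           (uint64_t)tv->tv_usec * 1000ULL;
}
```

Because the source has microsecond resolution, every tag is a multiple of 1000 ns. Two decodes stamped within the same microsecond would share a tag; the synchronous cycle described earlier makes that unlikely, but it is an assumption rather than a guarantee.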
Phase 3 must answer
Empirical questions for Phase 3 strace-diff:
- Does rkvdec accept the decode? Look for DQBUF FLAG_ERROR per frame in libva-h264 vs kdirect-h264. If libva's frame 2 DQBUF carries FLAG_ERROR, the kernel rejected; this is similar to Bug 5 (HEVC).
- What does the libva ioctl stream look like per frame vs kdirect's? Diff VIDIOC_S_EXT_CTRLS payload by frame. Look for:
  - control set differences (kdirect includes a control libva omits)
  - DECODE_PARAMS field differences (frame_num, reference_ts, flags)
  - DPB array differences (kdirect filled vs libva filled)
- What's cap_pool's buffer count on fresnel? REQBUFS count for CAPTURE on the rkvdec OUTPUT/CAPTURE queues.
- What's h264_start_code set to on rkvdec? Inspect VIDIOC_G_EXT_CTRLS at init. Verify START_CODE = ANNEX_B (i.e., h264_start_code = true, slices prefixed with 00 00 01).
- Byte-level frame examination: is frame 1's 16-byte real content at frame start (Y plane top-left) or somewhere in the middle? Is frame 2 truly all-zero, or is it the cap_pool init pattern 0x4c?
- strace REQUEST_IOC_QUEUE ordering: does the request_fd carry the controls correctly? Look at kernel logs (dmesg) for any rkvdec rejection messages.
Scope confirmation
In scope (this iteration): backend code only.
src/h264.c, src/h264_slice_header.c, src/picture.c H.264 paths, src/surface.c::RequestSyncSurface, src/cap_pool.c, src/request_pool.c request_fd lifecycle.
Out of scope: kernel patches, VP9/VP8/HEVC/MPEG-2 (read-only for regression-verify), front-end libva, performance.
Phase 3 plan
Run on fresnel:
- Capture strace of libva-vaapi H.264 decode: strace -f -e trace=ioctl -o libva_h264_strace.log ffmpeg -hwaccel vaapi -i bbb_1080p30_h264.mp4 -frames:v 3 -vf hwdownload,format=nv12,format=yuv420p -f rawvideo libva_h264.yuv.
- Capture strace of kdirect: same fixture via Kwiboo's ffmpeg-v4l2request: strace -f -e trace=ioctl -o kdirect_h264_strace.log ffmpeg-v4l2request -i bbb_1080p30_h264.mp4 ... > kdirect_h264.yuv.
- Compare strace logs per-frame. Extract S_EXT_CTRLS payloads; decode struct v4l2_ext_controls + per-control payloads (sps, pps, decode_params, scaling_matrix, dpb[]) using a small python parser if scale demands.
- Compare raw byte content of libva_h264.yuv (frame 1, 2, 3) and kdirect_h264.yuv to characterize what's actually written and where the divergence starts.
- Check dmesg on fresnel during the run for rkvdec messages.
- Capture findings in phase3_iter8_findings.md.
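The raw-byte comparison step can be reduced to "first differing frame and offset". This is a minimal sketch, assuming both dumps are already loaded into memory and use the same 8-bit 4:2:0 layout (1920x1080 gives 3110400 bytes per frame); the function names are hypothetical, and the kdirect output format is an assumption since its command line is elided above.

```c
#include <stddef.h>
#include <stdint.h>

/* Size in bytes of one 8-bit 4:2:0 frame (yuv420p or nv12).
 * Assumes even width and height. */
static size_t yuv420_frame_size(unsigned width, unsigned height)
{
    return (size_t)width * height * 3 / 2;
}

/* Returns the offset of the first differing byte, or len if equal. */
static size_t first_divergence(const uint8_t *a, const uint8_t *b, size_t len)
{
    for (size_t i = 0; i < len; i++)
        if (a[i] != b[i])
            return i;
    return len;
}

/* Walks two dumps frame by frame. Returns the index of the first frame
 * that differs and stores the offset inside that frame in *offset, or
 * returns -1 if all complete frames match. */
static long first_divergent_frame(const uint8_t *a, const uint8_t *b,
                                  size_t total_len, size_t frame_size,
                                  size_t *offset)
{
    size_t frames;

    if (frame_size == 0)
        return -1;
    frames = total_len / frame_size;

    for (size_t f = 0; f < frames; f++) {
        size_t off = first_divergence(a + f * frame_size,
                                      b + f * frame_size, frame_size);
        if (off != frame_size) {
            *offset = off;
            return (long)f;
        }
    }
    return -1;
}
```

An offset below width * height falls in the Y plane; anything above it is in chroma, which helps separate a stride or partial-write problem from a wholly missing decode.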
Predicted Phase 3 runtime: 1 hour wallclock contingent on fresnel uptime.
Phase 4 plan-shape predicate
Phase 4 plan is contingent on Phase 3 narrowing. Three likely outcomes:
- Outcome A: Phase 3 finds a specific control or DECODE_PARAMS field divergence (e.g., libva omits a field kdirect always fills). Fix is mechanical, 5-30 LOC in h264.c::h264_va_picture_to_v4l2 or h264_set_controls.
- Outcome B: Phase 3 finds DQBUF FLAG_ERROR on frame 2 inter. Then this is similar to Bug 5: kernel-side rejection. Iter8 either fixes whatever the wire-protocol gap is (backend), or downgrades to "narrowed but unfixed" close (deferred to a kernel-touching iteration).
- Outcome C: Phase 3 finds the kernel accepts the decode but writes to a wrong CAPTURE buffer / partial fill. Cap-pool / surface_bind_slot bug; 10-30 LOC fix in cap_pool.c or surface.c.
LOC estimate at this stage: 5-50 LOC in 1-3 files. One commit, possibly two.