marfrit abd97e3eb6 iter8 Phase 2: H.264 backend source-read + refined hypothesis surface

Maps the per-frame decode pipeline (BeginPicture → RenderPicture →
EndPicture → SyncSurface) and walks frame-1 IDR + frame-2 P state
transitions through h264_set_controls and the DPB.

Eliminates 6 of 13 hypotheses from Phase 0 by source-read alone (H-A
DPB stale, H-B POC sentinel for small POCs, H-C SLICE flags in FRAME_BASED,
H-D request_fd lifecycle, H-F pred-weight, H-G scaling matrix re-upload).
Adds 4 new hypotheses (H-J reference_ts derivation, H-K CAPTURE buf count,
H-L slice_data alignment vs h264_start_code, H-M frame_num cross-check).
Live hypotheses for Phase 3: H-E (CAPTURE rotation/reference-resolution),
H-H (start_code prefix), H-L (slice_data alignment), H-K (cap_pool size).

Phase 3 plan: strace-diff libva-vaapi-H.264 vs kdirect-H.264 on the same
fixture; byte-level frame-1/2/3 examination; dmesg check.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-13 11:19:49 +00:00


Iteration 8 — Phase 2 (situation analysis)

H.264 backend source-read on iter7 fork tip 6df2159. Maps the per-frame decode pipeline, enumerates code paths the kernel reads, and refines the hypothesis surface ahead of Phase 3 empirical strace-diff.

The per-frame pipeline (libva → V4L2)

For each frame in ffmpeg-vaapi-copy's 3-frame H.264 sweep:

1. `RequestBeginPicture` (picture.c:286)

For surface S:

If S had a current_slot from a prior decode → surface_unbind_slot releases it.
cap_pool_acquire(&capture_pool, surface_id) → fresh CAPTURE slot (LRU). Bind via surface_bind_slot → S's destination_index, destination_data, etc.
request_pool_acquire(&output_pool) → fresh OUTPUT slot. Bind S's source_index, source_data (mmap pointer into the slot), source_size, and request_fd = slot->request_fd (the slot's permanent fd, REINIT'd between cycles).
Reset slices_size = 0, slices_count = 0, params.h264.matrix_set = false.
Set status VASurfaceRendering, record render_surface_id.
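The CAPTURE-slot acquisition above can be modelled in a few lines. This is a minimal sketch, assuming the LRU policy described for cap_pool_acquire; `CapPool`, `slots_for_frames`, and the slot counts are hypothetical stand-ins, not the cap_pool.c API. It shows why a pool that is too small would hand a later frame the slot that still holds an earlier frame's reference.

```python
from collections import OrderedDict


class CapPool:
    """Toy LRU pool of CAPTURE slots (hypothetical model, not cap_pool.c)."""

    def __init__(self, num_slots):
        # Maps slot index -> surface id that last used it (None = never used).
        # Dict order is recency: the first key is the least recently used slot.
        self.slots = OrderedDict((i, None) for i in range(num_slots))

    def acquire(self, surface_id):
        # Take the least recently used slot and re-add it as most recent.
        slot, _ = self.slots.popitem(last=False)
        self.slots[slot] = surface_id
        return slot


def slots_for_frames(num_slots, num_frames):
    """Return the slot index handed to each of num_frames consecutive frames."""
    pool = CapPool(num_slots)
    return [pool.acquire(surface_id=frame) for frame in range(num_frames)]


big_pool = slots_for_frames(num_slots=4, num_frames=3)
tiny_pool = slots_for_frames(num_slots=1, num_frames=3)
print(big_pool)   # each frame gets its own slot
print(tiny_pool)  # every frame lands on the same slot
```

With a single slot, frame 2 would decode into the buffer frame 1's reference lives in, which is the collision H-K asks about.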

2. `RequestRenderPicture` (picture.c:369)

For each VAAPI buffer ID:

VASliceDataBufferType → if context->h264_start_code, prepend 00 00 01 start code, then memcpy slice bytes into source_data + slices_size. Increment slices_size and slices_count.
VAPictureParameterBufferH264 → memcpy into surface->params.h264.picture (the VAAPI struct, not a V4L2 control yet).
VASliceParameterBufferH264 → memcpy into surface->params.h264.slice (LAST slice's params wins; multi-slice frames lose all but the last).
VAIQMatrixBufferH264 → memcpy into surface->params.h264.matrix; set matrix_set = true.
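The slice-data path can be sketched as follows. This is an illustration under assumptions, not the picture.c code: `append_slice` is a hypothetical helper, the NAL payload bytes are made up, and a `bytearray` stands in for the mmap'd OUTPUT slot.

```python
START_CODE = b"\x00\x00\x01"


def append_slice(source_data, slices_size, slice_bytes, h264_start_code):
    """Copy one slice into the OUTPUT buffer and return the new slices_size."""
    payload = (START_CODE if h264_start_code else b"") + slice_bytes
    end = slices_size + len(payload)
    if end > len(source_data):
        raise ValueError("slice does not fit in the OUTPUT buffer")
    source_data[slices_size:end] = payload
    return end


source_data = bytearray(32)  # stand-in for the mmap'd OUTPUT slot
slices_size = 0
slices_count = 0
for nal in (b"\x65\xaa\xbb", b"\x41\xcc"):  # made-up NAL payloads
    slices_size = append_slice(source_data, slices_size, nal, h264_start_code=True)
    slices_count += 1
```

If h264_start_code disagrees with what the decoder expects, every slice in the buffer is shifted by three bytes relative to where the kernel looks for the slice header.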

3. `RequestEndPicture` (picture.c:408)

Synchronous decode cycle:

gettimeofday(&surface->timestamp, NULL): stamps the timestamp that goes out with the OUTPUT QBUF. This is the tag the kernel will associate with the decoded CAPTURE buffer.
request_fd = surface->request_fd (from BeginPicture's pool-slot binding).
codec_set_controls → h264_set_controls (see §h264_set_controls below). This builds all V4L2 controls and does S_EXT_CTRLS(video_fd, request_fd=request_fd, controls, num).
v4l2_queue_buffer CAPTURE — QBUF CAPTURE buffer (destination_index), no request.
v4l2_queue_buffer OUTPUT with request_fd (QBUF flags |= REQUEST_FD), slices_size bytes payload.
RequestSyncSurface (surface.c:278):
- media_request_queue(request_fd) → MEDIA_REQUEST_IOC_QUEUE — atomically commits S_EXT_CTRLS + OUTPUT QBUF as one decode request.
- media_request_wait_completion — POLL until decode finishes.
- media_request_reinit — reset request fd for next cycle.
- v4l2_dequeue_buffer OUTPUT.
- v4l2_dequeue_buffer CAPTURE.
- cap_pool_mark_decoded → slot state IN_DECODE → DECODED.

4. `h264_set_controls` (h264.c:797)

DPB mutation order:

output = dpb_lookup(context, CurrPic, NULL, NULL) — find existing entry for current frame's picture_id (returns NULL on first decode of a fresh surface).
If NULL → output = dpb_find_entry(context) = first invalid-and-unreserved entry, or oldest unused entry.
dpb_clear_entry(output, reserved=true) — memset 0, mark reserved=true (excludes from dpb_find_invalid_entry).
dpb_update(context, &surface->params.h264.picture):
- Increment dpb.age.
- Clear used on every entry.
- For each ReferenceFrame in VAPicture->ReferenceFrames[i] (i < num_ref_frames):
  - If VA_PICTURE_H264_INVALID, skip.
  - dpb_lookup for that picture_id; if found, age=current, used=true. Else dpb_insert (inserts the missing reference into a free entry).

Then construct V4L2 controls:

h264_va_picture_to_v4l2 (h264.c:346) fills v4l2_ctrl_h264_decode_params, pps, sps. It:

Bit-parses the slice_header from the OUTPUT buffer bytes (skipping start_code if h264_start_code) for idr_pic_id, pic_order_cnt_lsb, delta_pic_order_cnt*, pic_order_cnt_bit_size, dec_ref_pic_marking_bit_size.
Calls h264_fill_dpb which iterates DPB entries with valid && used and fills v4l2 decode->dpb[i] with reference_ts = v4l2_timeval_to_ns(ref_surface->timestamp), frame_num, pic_num, top/bottom_field_order_cnt, flags = VALID|ACTIVE(|LONG_TERM), fields = V4L2_H264_FRAME_REF.
Sets decode->flags |= IDR_PIC if nal_unit_type == 5. Sets FIELD_PIC, BOTTOM_FIELD flags.
Sets pps, sps flags (entropy_coding_mode, weighted_pred, transform_8x8, etc.).
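The reference_ts derivation used by h264_fill_dpb can be sketched like this. The conversion mirrors what the kernel's v4l2_timeval_to_ns does (seconds and microseconds to nanoseconds); `fill_dpb`, the dict layout, and the sample timestamp are hypothetical and only illustrate the valid && used filter.

```python
def timeval_to_ns(tv_sec, tv_usec):
    """Seconds + microseconds -> nanoseconds, the unit reference_ts uses."""
    return tv_sec * 1_000_000_000 + tv_usec * 1_000


def fill_dpb(entries, surface_timestamps):
    """Emit one V4L2-style dpb record per entry that is valid and used."""
    out = []
    for entry in entries:
        if not (entry["valid"] and entry["used"]):
            continue
        tv_sec, tv_usec = surface_timestamps[entry["picture_id"]]
        out.append({
            "reference_ts": timeval_to_ns(tv_sec, tv_usec),
            "frame_num": entry["frame_idx"],
        })
    return out


# Made-up state: one decoded reference and one reserved-but-empty entry.
entries = [
    {"picture_id": 7, "frame_idx": 0, "valid": True, "used": True},
    {"picture_id": None, "frame_idx": 0, "valid": False, "used": False},
]
surface_timestamps = {7: (1715599189, 250000)}
v4l2_dpb = fill_dpb(entries, surface_timestamps)
```

The kernel matches reference_ts against the timestamp it copied from the OUTPUT buffer, so both sides must come from the same surface->timestamp value.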

Scaling matrix (always submitted): from VAAPI's IQMatrix if matrix_set, else flat 16 default.
h264_va_slice_to_v4l2 — fills slice_params + pred_weights (used only in SLICE_BASED mode; here unused).
pps.flags |= SCALING_MATRIX_PRESENT.
pps.num_ref_idx_l0/l1_default_active_minus1 = slice.num_ref_idx_l0/l1_active_minus1 (slice-derived since VAAPI doesn't forward PPS values).
Switch on slice_type % 5 → decode.flags |= PFRAME or |= BFRAME.
sps.profile_idc = h264_profile_to_idc(profile), sps.level_idc = derive_level_idc(width_mbs, height_mbs).
const bool slice_based = false; hardcoded. Only SPS + PPS + DECODE_PARAMS + SCALING_MATRIX controls submitted. SLICE_PARAMS and PRED_WEIGHTS skipped (per UAPI doc, must not submit in FRAME_BASED).
v4l2_set_controls(video_fd, surface->request_fd, controls, 4) — S_EXT_CTRLS with WHICH_REQUEST_VAL = request_fd.
dpb_insert(context, &CurrPic, output=entry): the current frame's CurrPic is stored into the DPB entry reserved earlier by dpb_clear_entry (the third item of the DPB mutation order). entry.pic = CurrPic, valid=true, reserved=false, used=true.
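The decode.flags derivation can be sketched as below. The numeric flag values are my recollection of the V4L2 UAPI header and should be treated as assumptions to check against linux/v4l2-controls.h; `decode_flags` is a hypothetical helper, not the h264.c function.

```python
# Assumed values for the V4L2_H264_DECODE_PARAM_FLAG_* bits; verify against
# the UAPI header before relying on them.
FLAG_IDR_PIC = 0x01
FLAG_PFRAME = 0x08
FLAG_BFRAME = 0x10


def decode_flags(slice_type, nal_unit_type):
    """Derive the IDR / P / B bits from the last slice's header fields."""
    flags = 0
    if nal_unit_type == 5:
        flags |= FLAG_IDR_PIC
    kind = slice_type % 5  # H.264 slice_type 5..9 alias 0..4
    if kind == 0:          # P slice
        flags |= FLAG_PFRAME
    elif kind == 1:        # B slice
        flags |= FLAG_BFRAME
    # I, SP and SI slices add neither bit in this sketch.
    return flags
```

For example, `decode_flags(7, 5)` models an IDR frame of I slices and sets only the IDR bit, while `decode_flags(5, 1)` models a P frame.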

Per-frame state walkthrough on `bbb_1080p30_h264.mp4` 3-frame sweep

Frame 1 (IDR keyframe)

DPB empty before this frame.
dpb_lookup(CurrPic_1) → NULL.
dpb_find_entry → dpb_find_invalid_entry → entry[0] (all entries are !valid && !reserved). Returns entry[0].
dpb_clear_entry(entry[0], reserved=true) → memset 0, reserved=1.
dpb_update: clear all used. ReferenceFrames[] for IDR is empty. No further mutation.
h264_fill_dpb: iterate; no entry with valid && used. v4l2 decode->dpb[] all-zero.
decode.flags |= IDR_PIC (nal_unit_type==5).
Submit S_EXT_CTRLS, QBUF CAPTURE (slot S_C0), QBUF OUTPUT (slot S_O0, timestamp tt_1).
MEDIA_REQUEST_IOC_QUEUE, wait, REINIT, DQBUF OUTPUT, DQBUF CAPTURE.
dpb_insert(CurrPic_1, entry[0]) → entry[0].pic = CurrPic_1, valid=1, used=1, age=1.

Frame 2 (P-frame, refs frame 1)

DPB has entry[0] = frame_1 (valid, used, age=1).
dpb_lookup(CurrPic_2) → NULL (CurrPic_2 has new picture_id).
dpb_find_entry → dpb_find_invalid_entry skips entry[0] (valid=true). Returns entry[1].
dpb_clear_entry(entry[1], reserved=true) → memset 0, reserved=1.
dpb_update:
- age=2.
- Clear used on all entries (entry[0].used=0 temporarily).
- For ReferenceFrames[0] = frame_1: dpb_lookup finds entry[0]. entry[0].age=2, entry[0].used=1.
h264_fill_dpb:
- entry[0] valid && used: fill v4l2 dpb[0]:
  - reference_ts = v4l2_timeval_to_ns(SURFACE(driver_data, entry[0].pic.picture_id)->timestamp) = tt_1.
  - frame_num = entry[0].pic.frame_idx (frame 1's frame_num, typically 0).
  - pic_num = FrameNumWrap(0, 1) = 0.
  - top/bottom_field_order_cnt = strip_sentinel(entry[0].pic.TopFieldOrderCnt, .flags).
  - flags = VALID|ACTIVE.
  - fields = V4L2_H264_FRAME_REF.
- entry[1]: clear (reserved, !valid). Skip.
- entries[2..15]: not valid. Skip.
decode.flags |= PFRAME, frame_num=1.
Submit S_EXT_CTRLS, QBUF CAPTURE (slot S_C1 ≠ S_C0), QBUF OUTPUT (slot S_O1 ≠ S_O0, timestamp tt_2).
MEDIA_REQUEST_IOC_QUEUE → kernel reads dpb[0].reference_ts=tt_1, looks up the CAPTURE buffer with that timestamp in the vb2 queue, and uses it as the reference. It decodes the P-frame macroblocks, and CAPTURE buffer S_C1 receives the decoded pixels.
DQBUF OUTPUT, DQBUF CAPTURE.
dpb_insert(CurrPic_2, entry[1]) → entry[1] gets frame_2.

Frame 3 (next inter)

Same shape. Both frame_1 and frame_2 are in DPB; ReferenceFrames[] for B/P may include both. The bug signature says frame 3 is also all-zero.
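The walkthrough above can be replayed against a small model of the DPB. This is a sketch under assumptions: `Entry`, `Dpb`, and `decode_frame` are hypothetical Python stand-ins for the helpers in h264.c, built only from the behaviour described in this note, and "oldest unused" is taken to mean lowest age.

```python
class Entry:
    def __init__(self):
        self.clear(reserved=False)

    def clear(self, reserved):
        self.picture_id = None
        self.valid = False
        self.used = False
        self.reserved = reserved
        self.age = 0


class Dpb:
    def __init__(self, size=16):
        self.entries = [Entry() for _ in range(size)]
        self.age = 0

    def lookup(self, picture_id):
        for entry in self.entries:
            if entry.valid and entry.picture_id == picture_id:
                return entry
        return None

    def find_entry(self):
        # First invalid-and-unreserved entry, else the oldest unused one.
        for entry in self.entries:
            if not entry.valid and not entry.reserved:
                return entry
        # Raises ValueError if every entry is in use (not handled here).
        return min((e for e in self.entries if not e.used), key=lambda e: e.age)

    def insert(self, picture_id, entry=None):
        if entry is None:
            entry = self.find_entry()
        entry.picture_id = picture_id
        entry.valid, entry.used, entry.reserved = True, True, False
        entry.age = self.age
        return entry

    def update(self, reference_ids):
        self.age += 1
        for entry in self.entries:
            entry.used = False
        for picture_id in reference_ids:
            entry = self.lookup(picture_id)
            if entry is None:
                self.insert(picture_id)
            else:
                entry.age, entry.used = self.age, True


def decode_frame(dpb, picture_id, reference_ids):
    """Return (index of the output entry, indices submitted as references)."""
    output = dpb.lookup(picture_id)
    if output is None:
        output = dpb.find_entry()
    output.clear(reserved=True)
    dpb.update(reference_ids)
    refs = [i for i, e in enumerate(dpb.entries) if e.valid and e.used]
    dpb.insert(picture_id, output)
    return dpb.entries.index(output), refs


dpb = Dpb()
frame1 = decode_frame(dpb, picture_id=1, reference_ids=[])
frame2 = decode_frame(dpb, picture_id=2, reference_ids=[1])
frame3 = decode_frame(dpb, picture_id=3, reference_ids=[1, 2])
```

In this model frame 2 lands in entry[1] with entry[0] as its only reference, matching the walkthrough, which supports treating the DPB bookkeeping as sound.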

Empirical bug signature recap

Frame 1: 99.99% zero, but first 16 bytes = 81 81 80 80 80 7f 7f 7f 7f 7f 7f 80 80 80 81 81 (real chroma DC near gray). This suggests decode partially executed for a small region of frame 1, or that those 16 bytes leaked from the kernel into the cap_pool buffer. They are not the cap_pool init pattern, which is a constant 0x4c green. Hash 71ac099b….
Frame 2: 100% zero (or matches the cap_pool init pattern). No decoded content.
Frame 3: same as frame 2.

The "real first 16 bytes of frame 1" is the strongest cue: decode started writing but stopped after a few bytes. Three possibilities:

A — Decode aborted after a partial write due to an error mid-decode.
B — Decode completed correctly but only wrote a few bytes due to a stride / pitch / partial-DMA truncation.
C — Memory was initialized with garbage (not 0x4c green) just before the read.

(A) and (B) are kernel-side. (C) is a libva-side allocation bug. iter5b-β changed cap_pool initialization patterns; need to verify what frame 1 actually shows vs what frame 2/3 show byte-by-byte to choose between these.
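To choose between these, each frame needs the same three numbers: how much of it is zero, how much is the init pattern, and what the first bytes are. A minimal sketch follows, assuming the 0x4c init byte stated above; `classify_frame` is a hypothetical helper and the sample buffer is synthetic, not the real capture.

```python
from collections import Counter


def classify_frame(frame, init_byte=0x4C, head_len=16):
    """Summarise one raw frame: zero share, init-pattern share, leading bytes."""
    counts = Counter(frame)
    total = len(frame)
    return {
        "zero_fraction": counts[0] / total,
        "init_fraction": counts[init_byte] / total,
        "head": bytes(frame[:head_len]).hex(" "),
    }


# Synthetic stand-in: the 16 bytes from the signature followed by zeros.
head = bytes.fromhex("81 81 80 80 80 7f 7f 7f 7f 7f 7f 80 80 80 81 81")
report = classify_frame(head + bytes(9984))
```

Running this per frame over the raw output separates "all zero" from "all 0x4c", which the signature recap currently leaves open for frames 2 and 3.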

Refined hypothesis surface (Phase 0 H-A through H-I narrowed)

From source-read, I can now narrow:

H-A (DPB stale): Backend logic is sound per the per-frame walkthrough. Eliminated unless kernel reads a control we don't fill.
H-B (POC sentinel mis-strip): h264_strip_ffmpeg_poc_sentinel strips bit 16 if set, else returns the value as-is. Frame 2 POC values are computed by ffmpeg-vaapi per the H.264 spec. The only failure mode is a misfire on a legitimate POC ≥ 65536, which would silently lose 65536. For 30-frame BBB content, POC values are tiny. Eliminate.
H-C (slice flags missing): SLICE_PARAMS not submitted in FRAME_BASED. Eliminate.
H-D (per-frame request_fd lifecycle): iter6 model is "one fd per slot, REINIT between cycles." Each frame gets a possibly-different request_fd (depending on which OUTPUT pool slot it borrows). The fd is REINIT'd cleanly after DQBUF. Eliminate.
H-E (CAPTURE slot rotation): Each frame gets a different cap_pool slot. Frame 1's CAPTURE buffer (S_C0) stays in DECODED state after DQBUF. The kernel must find it by reference_ts = tt_1 to use as reference for frame 2. Live hypothesis: kernel's reference-resolution may require frame 1's CAPTURE buffer to be still QBUF'd, not DQBUF'd. The kernel-direct path may keep CAPTURE buffers QBUF'd for longer (asynchronous decode pipeline) while libva synchronously DQBUFs each frame's CAPTURE before starting the next.
H-F (pred-weight): Not submitted in FRAME_BASED. Eliminate.
H-G (scaling matrix not re-uploaded): Submitted every frame. Eliminate.
H-H (start_code prefix): h264_start_code is set from a VIDIOC_G_EXT_CTRLS read of the START_CODE control in h264_get_controls. iter6 added profile gating via feedback_unconditional_codec_state.md. Need to verify on fresnel: does rkvdec advertise V4L2_STATELESS_H264_START_CODE_ANNEX_B or _NONE? If the backend and rkvdec disagree, slice data is misinterpreted. Live hypothesis, to be settled by the Phase 3 strace check.
H-I (kernel UAPI contract drift): linux-fresnel-fourier 7.0-1 is v6.16-rc4-base; check for new mandatory controls added since v5.13. Specifically: V4L2_CID_STATELESS_H264_DECODE_MODE was probed at context init; the answer is FRAME_BASED on rkvdec. Other controls — none known to be required.
NEW H-J (reference_ts derivation): v4l2_timeval_to_ns converts surface->timestamp to nanoseconds as tv_sec * 1e9 + tv_usec * 1000. The kernel-side buffer timestamp (vb2_v4l2_buffer.vb2_buf.timestamp) is also in nanoseconds. Check: does the same surface->timestamp value feed both the OUTPUT QBUF and reference_ts? Yes: line 440 sets it, line 461 reads it. Eliminate.
NEW H-K (CAPTURE buffer count): How many CAPTURE buffers does the backend REQBUFS? If only 1 or 2, frame 2's CAPTURE may collide with frame 1's reference. iter5b-β surface_fill_format_uniform and the cap_pool size — need to check.
NEW H-L (slice_data_bit_offset misalignment in FRAME_BASED): In FRAME_BASED mode rkvdec parses the slice header on its own from OUTPUT bytes. If the OUTPUT bytes have the wrong start code presence vs what rkvdec expects, slice header parse misaligns. The start_code prefix is gated by context->h264_start_code which mirrors the START_CODE control's value. Live hypothesis.
NEW H-M (frame_num field on rkvdec frame_based): rkvdec FRAME_BASED parses its own slice header but may still cross-check against decode->frame_num. If frame 2's decode->frame_num=1 and the bitstream says frame_num=1, ok. If they mismatch, kernel rejects. Probably OK; verify in strace.
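The H-B reasoning rests on simple arithmetic, sketched here. `strip_poc_sentinel` is a hypothetical model of h264_strip_ffmpeg_poc_sentinel based only on the description above (bit 16 stripped if set), and it covers non-negative POC values only; how the real helper treats negative POCs is not established by this note.

```python
POC_SENTINEL = 1 << 16  # 65536, the bit the helper is described as stripping


def strip_poc_sentinel(poc):
    """Model for non-negative POCs: drop bit 16 if set, else pass through."""
    if poc & POC_SENTINEL:
        return poc - POC_SENTINEL
    return poc


small = strip_poc_sentinel(4)                    # untouched
tagged = strip_poc_sentinel(POC_SENTINEL + 4)    # sentinel removed
misfire = strip_poc_sentinel(70000)              # legitimate large POC damaged
```

Only the third case is harmful, and it needs a real POC of at least 65536, which a short clip does not reach.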

Phase 3 must answer

Empirical questions for Phase 3 strace-diff:

Does rkvdec accept the decode? Look for V4L2_BUF_FLAG_ERROR on each frame's DQBUF in libva-h264 vs kdirect-h264. If libva's frame 2 DQBUF carries the error flag, the kernel rejected the decode; this would be similar to Bug 5 (HEVC).
What does the libva ioctl stream look like per frame vs kdirect's? Diff VIDIOC_S_EXT_CTRLS payload by frame. Look for:
- control set differences (kdirect includes a control libva omits)
- DECODE_PARAMS field differences (frame_num, reference_ts, flags)
- DPB array differences (kdirect filled vs libva filled)
What's cap_pool's buffer count on fresnel? Check the VIDIOC_REQBUFS count for the rkvdec CAPTURE queue.
What's h264_start_code set to on rkvdec? Inspect the VIDIOC_G_EXT_CTRLS call at init. Verify START_CODE = ANNEX_B (i.e., h264_start_code = true, slices prefixed with 00 00 01).
Byte-level frame examination: is frame 1's 16-byte real content at frame start (Y plane top-left) or somewhere in the middle? Is frame 2 truly all-zero, or is it the cap_pool init pattern 0x4c?
MEDIA_REQUEST_IOC_QUEUE ordering in strace: does the request_fd carry the controls correctly? Also look at kernel logs (dmesg) for any rkvdec rejection messages.
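For the byte-level question, a file offset has to be mapped back to a frame, plane, and pixel position. This is a sketch assuming the output is tightly packed yuv420p at 1920x1080, as the rawvideo command in the Phase 3 plan suggests; `locate` is a hypothetical helper, and any padding in the real file would change the arithmetic.

```python
def locate(offset, width=1920, height=1080):
    """Map a byte offset in a packed yuv420p stream to (frame, plane, row, col)."""
    y_size = width * height
    c_width, c_height = (width + 1) // 2, (height + 1) // 2
    c_size = c_width * c_height
    frame_size = y_size + 2 * c_size

    frame, pos = divmod(offset, frame_size)
    if pos < y_size:
        plane, stride = "Y", width
    elif pos < y_size + c_size:
        plane, stride, pos = "U", c_width, pos - y_size
    else:
        plane, stride, pos = "V", c_width, pos - y_size - c_size
    row, col = divmod(pos, stride)
    return frame, plane, row, col


first_byte = locate(0)
second_frame = locate(3110400)
first_chroma = locate(1920 * 1080)
```

Offsets 0 to 15 of frame 1 map to the top-left of the Y plane under these assumptions, so the 16 non-zero bytes would be luma samples rather than chroma if the layout holds.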

Scope confirmation

In scope (this iteration): backend code only.

src/h264.c, src/h264_slice_header.c, src/picture.c H.264 paths, src/surface.c::RequestSyncSurface, src/cap_pool.c, src/request_pool.c request_fd lifecycle.

Out of scope: kernel patches, VP9/VP8/HEVC/MPEG-2 (read-only for regression-verify), front-end libva, performance.

Phase 3 plan

Run on fresnel:

Capture strace of libva-vaapi H.264 decode: strace -f -e trace=ioctl -o libva_h264_strace.log ffmpeg -hwaccel vaapi -i bbb_1080p30_h264.mp4 -frames:v 3 -vf hwdownload,format=nv12,format=yuv420p -f rawvideo libva_h264.yuv.
Capture strace of kdirect: same fixture via Kwiboo's ffmpeg-v4l2request: strace -f -e trace=ioctl -o kdirect_h264_strace.log ffmpeg-v4l2request -i bbb_1080p30_h264.mp4 ... > kdirect_h264.yuv.
Compare strace logs per-frame. Extract S_EXT_CTRLS payloads; decode struct v4l2_ext_controls + per-control payloads (sps, pps, decode_params, scaling_matrix, dpb[]) using a small Python parser if scale demands.
Compare raw byte content of libva_h264.yuv (frame 1, 2, 3) and kdirect_h264.yuv to characterize what's actually written and where the divergence starts.
Check dmesg on fresnel during the run for rkvdec messages.
Capture findings in phase3_iter8_findings.md.
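A first pass over the two strace logs can work on ioctl names alone, before any payload decoding. The sketch below is hedged: the sample lines are invented, and how strace renders the media request ioctls depends on its version, so the regex and the `MEDIA_REQUEST_IOC_QUEUE` boundary name are assumptions to adjust against the real logs. `per_request_ioctls` and `first_divergence` are hypothetical helpers.

```python
import re

IOCTL_RE = re.compile(r"ioctl\(\d+, ([A-Z_0-9]+)")


def per_request_ioctls(lines, boundary="MEDIA_REQUEST_IOC_QUEUE"):
    """Group ioctl names into lists, closing a group at each boundary ioctl.

    DQBUFs issued after the queue call land at the head of the next group.
    Ioctls after the last boundary are dropped.
    """
    groups, current = [], []
    for line in lines:
        match = IOCTL_RE.search(line)
        if not match:
            continue
        current.append(match.group(1))
        if match.group(1) == boundary:
            groups.append(current)
            current = []
    return groups


def first_divergence(groups_a, groups_b):
    """Index of the first group whose ioctl sequence differs, else None."""
    for index, (a, b) in enumerate(zip(groups_a, groups_b)):
        if a != b:
            return index
    return None


# Invented sample lines in the "pid ioctl(fd, NAME, ...)" shape of strace -f -o.
frame = [
    "100 ioctl(5, VIDIOC_S_EXT_CTRLS, {...}) = 0",
    "100 ioctl(5, VIDIOC_QBUF, {...}) = 0",
    "100 ioctl(5, VIDIOC_QBUF, {...}) = 0",
    "100 ioctl(9, MEDIA_REQUEST_IOC_QUEUE) = 0",
]
sync = ["100 ioctl(5, VIDIOC_DQBUF, {...}) = 0"] * 2
log_a = per_request_ioctls(frame + sync + frame)
log_b = per_request_ioctls(frame + frame)
diverges_at = first_divergence(log_a, log_b)
```

Once the first diverging request is known, the S_EXT_CTRLS payloads for just that request can be decoded by hand or with a struct-level parser.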

Predicted Phase 3 runtime: 1 hour wallclock contingent on fresnel uptime.

Phase 4 plan-shape predicate

Phase 4 plan is contingent on Phase 3 narrowing. Three likely outcomes:

Outcome A — Phase 3 finds a specific control or DECODE_PARAMS field divergence (e.g., libva omits a field kdirect always fills). Fix is mechanical, 5-30 LOC in h264.c::h264_va_picture_to_v4l2 or h264_set_controls.
Outcome B — Phase 3 finds DQBUF FLAG_ERROR on frame 2 inter. Then this is similar to Bug 5: kernel-side rejection. Iter8 either fixes whatever the wire-protocol gap is (backend), or downgrades to "narrowed but unfixed" close (deferred to a kernel-touching iteration).
Outcome C — Phase 3 finds the kernel accepts the decode but writes to a wrong CAPTURE buffer / partial fill. Cap-pool / surface_bind_slot bug; 10-30 LOC fix in cap_pool.c or surface.c.

LOC estimate at this stage: 5-50 LOC in 1-3 files. One commit, possibly two.
