fresnel-fourier/phase2_iter8_situation.md
marfrit abd97e3eb6 iter8 Phase 2: H.264 backend source-read + refined hypothesis surface
Maps the per-frame decode pipeline (BeginPicture → RenderPicture →
EndPicture → SyncSurface) and walks frame-1 IDR + frame-2 P state
transitions through h264_set_controls and the DPB.

Eliminates 6 of 13 hypotheses from Phase 0 by source-read alone (H-A
DPB stale, H-B POC sentinel for small POCs, H-C SLICE flags in FRAME_BASED,
H-D request_fd lifecycle, H-F pred-weight, H-G scaling matrix re-upload).
Adds 4 new hypotheses (H-J reference_ts derivation, H-K CAPTURE buf count,
H-L slice_data alignment vs h264_start_code, H-M frame_num cross-check).
Live hypotheses for Phase 3: H-E (CAPTURE rotation/reference-resolution),
H-H (start_code prefix), H-L (slice_data alignment), H-K (cap_pool size).

Phase 3 plan: strace-diff libva-vaapi-H.264 vs kdirect-H.264 on the same
fixture; byte-level frame-1/2/3 examination; dmesg check.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 11:19:49 +00:00


# Iteration 8 — Phase 2 (situation analysis)
H.264 backend source-read on iter7 fork tip `6df2159`. Maps the per-frame decode pipeline, enumerates code paths the kernel reads, and refines the hypothesis surface ahead of Phase 3 empirical strace-diff.
## The per-frame pipeline (libva → V4L2)
For each frame in ffmpeg-vaapi-copy's 3-frame H.264 sweep:
### 1. `RequestBeginPicture` (picture.c:286)
For surface S:
- If S had a `current_slot` from a prior decode → `surface_unbind_slot` releases it.
- `cap_pool_acquire(&capture_pool, surface_id)` → fresh CAPTURE slot (LRU). Bind via `surface_bind_slot` → S's `destination_index`, `destination_data`, etc.
- `request_pool_acquire(&output_pool)` → fresh OUTPUT slot. Bind S's `source_index`, `source_data` (mmap pointer into the slot), `source_size`, and **`request_fd = slot->request_fd`** (the slot's permanent fd, REINIT'd between cycles).
- Reset `slices_size = 0`, `slices_count = 0`, `params.h264.matrix_set = false`.
- Set status `VASurfaceRendering`, record `render_surface_id`.
### 2. `RequestRenderPicture` (picture.c:369)
For each VAAPI buffer ID:
- `VASliceDataBufferType` → if `context->h264_start_code`, prepend `00 00 01` start code, then memcpy slice bytes into `source_data + slices_size`. Increment `slices_size` and `slices_count`.
- `VAPictureParameterBufferH264` → memcpy into `surface->params.h264.picture` (the VAAPI struct, not a V4L2 control yet).
- `VASliceParameterBufferH264` → memcpy into `surface->params.h264.slice` (LAST slice's params wins; multi-slice frames lose all but the last).
- `VAIQMatrixBufferH264` → memcpy into `surface->params.h264.matrix`; set `matrix_set = true`.
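
The `VASliceDataBufferType` branch above can be modelled in a few lines. This is a minimal sketch, not the backend's C code: the `SourceBuffer` class, its capacity check, and the method name `append_slice` are hypothetical stand-ins for the OUTPUT slot's `source_data`, `slices_size`, and `slices_count` fields.

```python
START_CODE = b"\x00\x00\x01"  # 3-byte Annex B start code, as described above


class SourceBuffer:
    """Hypothetical stand-in for one OUTPUT slot's mmap'd payload area."""

    def __init__(self, capacity):
        self.data = bytearray(capacity)
        self.slices_size = 0   # bytes written so far
        self.slices_count = 0  # slices appended so far

    def append_slice(self, slice_bytes, start_code):
        """Copy one slice at offset slices_size, optionally prefixed.

        The bounds check is an assumption of this sketch; the note does not
        say whether the real code performs one.
        """
        payload = (START_CODE if start_code else b"") + slice_bytes
        offset = self.slices_size
        if offset + len(payload) > len(self.data):
            raise ValueError("slice does not fit in the OUTPUT buffer")
        self.data[offset:offset + len(payload)] = payload
        self.slices_size += len(payload)
        self.slices_count += 1
        return offset
```

With `start_code=True`, every slice costs three extra bytes in the payload, which is why a mismatch between `h264_start_code` and what the decoder expects shifts the slice header by exactly 24 bits.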
### 3. `RequestEndPicture` (picture.c:408)
Synchronous decode cycle:
- `gettimeofday(&surface->timestamp, NULL)` — stamps the OUTPUT QBUF's timestamp = the tag the kernel will associate with the decoded CAPTURE buffer.
- `request_fd = surface->request_fd` (from BeginPicture's pool-slot binding).
- `codec_set_controls` → `h264_set_controls` (see §h264_set_controls below). This builds all V4L2 controls and does `S_EXT_CTRLS(video_fd, request_fd=request_fd, controls, num)`.
- `v4l2_queue_buffer` CAPTURE — QBUF CAPTURE buffer (`destination_index`), no request.
- `v4l2_queue_buffer` OUTPUT with `request_fd` (QBUF flags |= REQUEST_FD), `slices_size` bytes payload.
- `RequestSyncSurface` (surface.c:278):
  - `media_request_queue(request_fd)` → MEDIA_REQUEST_IOC_QUEUE, which atomically commits S_EXT_CTRLS + OUTPUT QBUF as one decode request.
  - `media_request_wait_completion`: POLL until decode finishes.
  - `media_request_reinit`: reset request fd for next cycle.
  - `v4l2_dequeue_buffer` OUTPUT.
  - `v4l2_dequeue_buffer` CAPTURE.
  - `cap_pool_mark_decoded` → slot state IN_DECODE → DECODED.
### 4. `h264_set_controls` (h264.c:797)
DPB mutation order:
1. `output = dpb_lookup(context, CurrPic, NULL, NULL)` — find existing entry for current frame's `picture_id` (returns NULL on first decode of a fresh surface).
2. If NULL → `output = dpb_find_entry(context)` = first invalid-and-unreserved entry, or oldest unused entry.
3. **`dpb_clear_entry(output, reserved=true)`** — memset 0, mark reserved=true (excludes from dpb_find_invalid_entry).
4. `dpb_update(context, &surface->params.h264.picture)`:
   - Increment dpb.age.
   - Clear `used` on every entry.
   - For each ReferenceFrame in `VAPicture->ReferenceFrames[i]` (i < num_ref_frames):
     - If `VA_PICTURE_H264_INVALID`, skip.
     - `dpb_lookup` for that picture_id; if found, age=current, used=true. Else `dpb_insert` (inserts the missing reference into a free entry).
Then construct V4L2 controls:
5. `h264_va_picture_to_v4l2` (h264.c:346): fills `v4l2_ctrl_h264_decode_params`, `pps`, `sps`. Includes:
   - Bit-parses the slice_header from the OUTPUT buffer bytes (skipping start_code if `h264_start_code`) for `idr_pic_id`, `pic_order_cnt_lsb`, `delta_pic_order_cnt*`, `pic_order_cnt_bit_size`, `dec_ref_pic_marking_bit_size`.
   - Calls `h264_fill_dpb` which iterates DPB entries with `valid && used` and fills v4l2 `decode->dpb[i]` with `reference_ts = v4l2_timeval_to_ns(ref_surface->timestamp)`, `frame_num`, `pic_num`, `top/bottom_field_order_cnt`, `flags = VALID|ACTIVE(|LONG_TERM)`, `fields = V4L2_H264_FRAME_REF`.
   - Sets `decode->flags |= IDR_PIC` if `nal_unit_type == 5`. Sets FIELD_PIC, BOTTOM_FIELD flags.
   - Sets pps, sps flags (entropy_coding_mode, weighted_pred, transform_8x8, etc.).
6. Scaling matrix (always submitted): from VAAPI's IQMatrix if `matrix_set`, else flat 16 default.
7. `h264_va_slice_to_v4l2` — fills `slice_params` + `pred_weights` (used only in SLICE_BASED mode; here unused).
8. `pps.flags |= SCALING_MATRIX_PRESENT`.
9. `pps.num_ref_idx_l0/l1_default_active_minus1 = slice.num_ref_idx_l0/l1_active_minus1` (slice-derived since VAAPI doesn't forward PPS values).
10. Switch on `slice_type % 5` → `decode.flags |= PFRAME` or `|= BFRAME`.
11. `sps.profile_idc = h264_profile_to_idc(profile)`, `sps.level_idc = derive_level_idc(width_mbs, height_mbs)`.
12. **`const bool slice_based = false;` hardcoded.** Only SPS + PPS + DECODE_PARAMS + SCALING_MATRIX controls submitted. SLICE_PARAMS and PRED_WEIGHTS skipped (per UAPI doc, must not submit in FRAME_BASED).
13. `v4l2_set_controls(video_fd, surface->request_fd, controls, 4)` — S_EXT_CTRLS with `WHICH_REQUEST_VAL = request_fd`.
14. **`dpb_insert(context, &CurrPic, output=entry)`** — current frame's CurrPic stored into the DPB entry we reserved at step 3. `entry.pic = CurrPic`, `valid=true`, `reserved=false`, `used=true`.
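
The DPB mutation order in steps 1-4 and 14 is easier to check against a toy model. The sketch below is an illustration under stated assumptions, not a port of `h264.c`: the `Entry` and `Dpb` classes are hypothetical, a reference is identified by a bare integer `picture_id`, `None` stands in for `VA_PICTURE_H264_INVALID`, and the fallback in `find_entry` skips reserved entries (an assumption; the note only says "oldest unused entry").

```python
from dataclasses import dataclass
from typing import List, Optional

DPB_SIZE = 16  # the walkthrough refers to entries[0..15]


@dataclass
class Entry:
    picture_id: Optional[int] = None
    valid: bool = False
    reserved: bool = False
    used: bool = False
    age: int = 0


class Dpb:
    def __init__(self):
        self.entries = [Entry() for _ in range(DPB_SIZE)]
        self.age = 0

    def lookup(self, picture_id):
        for e in self.entries:
            if e.valid and e.picture_id == picture_id:
                return e
        return None

    def find_entry(self):
        # First invalid-and-unreserved entry, else the oldest unused one.
        for e in self.entries:
            if not e.valid and not e.reserved:
                return e
        unused = [e for e in self.entries if not e.used and not e.reserved]
        return min(unused, key=lambda e: e.age) if unused else None

    def clear_entry(self, e, reserved):
        e.picture_id, e.valid, e.used, e.age = None, False, False, 0
        e.reserved = reserved

    def insert(self, picture_id, entry=None):
        e = entry if entry is not None else self.find_entry()
        if e is None:
            raise RuntimeError("no free DPB entry")
        e.picture_id, e.valid, e.reserved, e.used = picture_id, True, False, True
        e.age = self.age
        return e

    def update(self, reference_ids: List[Optional[int]]):
        self.age += 1
        for e in self.entries:
            e.used = False
        for pid in reference_ids:
            if pid is None:  # models VA_PICTURE_H264_INVALID
                continue
            e = self.lookup(pid)
            if e is not None:
                e.age, e.used = self.age, True
            else:
                self.insert(pid)

    def decode_frame(self, curr_id, reference_ids):
        """Run steps 1-4, report what h264_fill_dpb would see, then step 14."""
        out = self.lookup(curr_id) or self.find_entry()
        if out is None:
            raise RuntimeError("no DPB entry available for the current picture")
        self.clear_entry(out, reserved=True)
        self.update(reference_ids)
        visible = [e.picture_id for e in self.entries if e.valid and e.used]
        self.insert(curr_id, out)
        return visible
```

Calling `decode_frame(1, [])` and then `decode_frame(2, [1])` on a fresh `Dpb` returns `[]` and `[1]`: the current picture is never visible to its own `h264_fill_dpb` pass because it is still reserved, not valid, at that point.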
## Per-frame state walkthrough on `bbb_1080p30_h264.mp4` 3-frame
### Frame 1 (IDR keyframe)
- DPB empty before this frame.
- `dpb_lookup(CurrPic_1)` → NULL.
- `dpb_find_entry` → `dpb_find_invalid_entry` → entry[0] (all entries are !valid && !reserved). Returns entry[0].
- `dpb_clear_entry(entry[0], reserved=true)` → memset 0, reserved=1.
- `dpb_update`: clear all used. ReferenceFrames[] for IDR is empty. No further mutation.
- `h264_fill_dpb`: iterate; no entry with valid && used. v4l2 `decode->dpb[]` all-zero.
- `decode.flags |= IDR_PIC` (nal_unit_type==5).
- Submit S_EXT_CTRLS, QBUF CAPTURE (slot S_C0), QBUF OUTPUT (slot S_O0, timestamp tt_1).
- MEDIA_REQUEST_IOC_QUEUE, wait, REINIT, DQBUF OUTPUT, DQBUF CAPTURE.
- `dpb_insert(CurrPic_1, entry[0])` → entry[0].pic = CurrPic_1, valid=1, used=1, age=1.
### Frame 2 (P-frame, refs frame 1)
- DPB has entry[0] = frame_1 (valid, used, age=1).
- `dpb_lookup(CurrPic_2)` → NULL (CurrPic_2 has new picture_id).
- `dpb_find_entry` → `dpb_find_invalid_entry` skips entry[0] (valid=true). Returns entry[1].
- `dpb_clear_entry(entry[1], reserved=true)` → memset 0, reserved=1.
- `dpb_update`:
  - age=2.
  - Clear `used` on all entries (entry[0].used=0 temporarily).
  - For ReferenceFrames[0] = frame_1: `dpb_lookup` finds entry[0]. entry[0].age=2, entry[0].used=1.
- `h264_fill_dpb`:
  - entry[0] valid && used: fill v4l2 `dpb[0]`:
    - `reference_ts = v4l2_timeval_to_ns(SURFACE(driver_data, entry[0].pic.picture_id)->timestamp)` = tt_1.
    - `frame_num = entry[0].pic.frame_idx` (frame 1's frame_num, typically 0).
    - `pic_num = FrameNumWrap(0, 1) = 0`.
    - `top/bottom_field_order_cnt` = strip_sentinel(entry[0].pic.TopFieldOrderCnt, .flags).
    - `flags = VALID|ACTIVE`.
    - `fields = V4L2_H264_FRAME_REF`.
  - entry[1]: clear (reserved, !valid). Skip.
  - entries[2..15]: not valid. Skip.
- `decode.flags |= PFRAME`, frame_num=1.
- Submit S_EXT_CTRLS, QBUF CAPTURE (slot S_C1 ≠ S_C0), QBUF OUTPUT (slot S_O1 ≠ S_O0, timestamp tt_2).
- MEDIA_REQUEST_IOC_QUEUE → kernel: reads dpb[0].reference_ts=tt_1, **looks up CAPTURE buffer with that ts in vb2 queue**, uses as reference; decodes P MB partition; CAPTURE buffer S_C1 receives decoded pixels.
- DQBUF OUTPUT, DQBUF CAPTURE.
- `dpb_insert(CurrPic_2, entry[1])` → entry[1] gets frame_2.
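
The timestamp-based reference lookup in the MEDIA_REQUEST_IOC_QUEUE step above can be pictured with a toy queue. This is a deliberately simplified model, not the kernel's videobuf2 code: the `CaptureQueue` class and its method names are hypothetical, and it assumes a CAPTURE buffer inherits the OUTPUT buffer's timestamp when its decode completes and loses the old tag when it is reused.

```python
class CaptureQueue:
    """Toy model: each CAPTURE buffer carries the ns tag of its last decode."""

    def __init__(self, count):
        self.timestamps = [None] * count

    def complete_decode(self, index, output_ts_ns):
        # Assumption: on completion the CAPTURE buffer takes the OUTPUT
        # buffer's timestamp, overwriting whatever tag it held before.
        self.timestamps[index] = output_ts_ns

    def find_reference(self, reference_ts_ns):
        """Return the index of the buffer tagged reference_ts_ns, or None."""
        for index, ts in enumerate(self.timestamps):
            if ts is not None and ts == reference_ts_ns:
                return index
        return None


# Frame 1 decodes into slot 0 with tag tt_1; frame 2 then resolves dpb[0].
tt_1, tt_2 = 1_000, 2_000
queue = CaptureQueue(count=2)
queue.complete_decode(0, tt_1)
frame2_reference = queue.find_reference(tt_1)
```

In this model the lookup fails in two ways that matter for H-E and H-K: the tag was never copied (`find_reference` returns `None`), or the slot was reused for a later frame before the reference was needed.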
### Frame 3 (next inter)
Same shape. Both frame_1 and frame_2 are in DPB; ReferenceFrames[] for B/P may include both. The bug signature says frame 3 is also all-zero.
## Empirical bug signature recap
- Frame 1: 99.99% zero, but first 16 bytes = `81 81 80 80 80 7f 7f 7f 7f 7f 7f 80 80 80 81 81` (real chroma DC near gray). Suggests **decode partially executed** for a small region of frame 1 (or those 16 bytes leaked from kernel into the cap_pool buffer position; the cap_pool init pattern is constant `0x4c` green; these bytes are not the init pattern). Hash `71ac099b…`.
- Frame 2: 100% zero (or matches the cap_pool init pattern). No decoded content.
- Frame 3: same as frame 2.
The "real first 16 bytes of frame 1" is the strongest cue: **decode started writing** but stopped after a few bytes. Three possibilities:
- A — Decode aborted after a partial write due to an error mid-decode.
- B — Decode completed correctly but only wrote a few bytes due to a stride / pitch / partial-DMA truncation.
- C — Memory was initialized with garbage (not 0x4c green) just before the read.
(A) and (B) are kernel-side. (C) is a libva-side allocation bug. iter5b-β changed cap_pool initialization patterns; need to verify what frame 1 actually shows vs what frame 2/3 show byte-by-byte to choose between these.
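
Choosing between A, B, and C starts with a byte census of each dumped frame. The helper below is a minimal sketch: the function name `classify_bytes` and the returned keys are hypothetical, and it assumes the init pattern is a single non-zero byte (`0x4c`, as stated above).

```python
def classify_bytes(data, init_byte=0x4C):
    """Count zero, init-pattern, and other bytes in one dumped frame.

    Assumes init_byte != 0 so the first two counts do not overlap.
    """
    zeros = data.count(0)
    init = data.count(init_byte)
    other = len(data) - zeros - init
    # Offset of the first byte that is neither zero nor the init pattern.
    first_other = next(
        (i for i, b in enumerate(data) if b not in (0, init_byte)), None
    )
    return {"zeros": zeros, "init": init, "other": other, "first_other": first_other}


# Synthetic example shaped like the frame-1 signature: 16 near-gray bytes,
# then zeros. Real input would be one frame's worth of the .yuv dump.
head = bytes.fromhex("81818080807f7f7f7f7f7f8080808181")
report = classify_bytes(head + bytes(84))
```

A frame that is all `init` was never written; a frame that is all `zeros` was written or cleared by something other than the cap_pool initializer; a non-`None` `first_other` gives the offset to inspect first.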
## Refined hypothesis surface (Phase 0 H-A through H-I narrowed)
From source-read, I can now narrow:
- **H-A (DPB stale)**: Backend logic is sound per the per-frame walkthrough. Eliminated unless kernel reads a control we don't fill.
- **H-B (POC sentinel mis-strip)**: `h264_strip_ffmpeg_poc_sentinel` strips bit 16 if set, else returns as-is. Frame 2 POC values are computed by ffmpeg-vaapi from H.264 spec. If sentinel detection misfires on a legit POC ≥ 65536, it would subtract 65536 silently. For 30-frame BBB content, POC values are tiny — eliminate.
- **H-C (slice flags missing)**: SLICE_PARAMS not submitted in FRAME_BASED. Eliminate.
- **H-D (per-frame request_fd lifecycle)**: iter6 model is "one fd per slot, REINIT between cycles." Each frame gets a possibly-different request_fd (depending on which OUTPUT pool slot it borrows). The fd is REINIT'd cleanly after DQBUF. Eliminate.
- **H-E (CAPTURE slot rotation)**: Each frame gets a different cap_pool slot. Frame 1's CAPTURE buffer (S_C0) stays in `DECODED` state after DQBUF. The kernel must find it by `reference_ts = tt_1` to use as reference for frame 2. **Live hypothesis**: kernel's reference-resolution may require frame 1's CAPTURE buffer to be still QBUF'd, not DQBUF'd. The kernel-direct path may keep CAPTURE buffers QBUF'd for longer (asynchronous decode pipeline) while libva synchronously DQBUFs each frame's CAPTURE before starting the next.
- **H-F (pred-weight)**: Not submitted in FRAME_BASED. Eliminate.
- **H-G (scaling matrix not re-uploaded)**: Submitted every frame. Eliminate.
- **H-H (start_code prefix)**: `h264_start_code` is set from `VIDIOC_G_EXT_CTRLS` on the START_CODE control in `h264_get_controls`. iter6 added profile gating via `feedback_unconditional_codec_state.md`. Need to verify on fresnel: does rkvdec advertise `V4L2_STATELESS_H264_START_CODE_ANNEX_B` or `_NONE`? If misaligned, slice data is misinterpreted. **Live hypothesis**: Phase 3 strace check.
- **H-I (kernel UAPI contract drift)**: `linux-fresnel-fourier 7.0-1` is v6.16-rc4-base; check for new mandatory controls added since v5.13. Specifically: `V4L2_CID_STATELESS_H264_DECODE_MODE` was probed at context init; the answer is FRAME_BASED on rkvdec. Other controls — none known to be required.
- **NEW H-J (reference_ts derivation)**: `v4l2_timeval_to_ns` computes `tv_sec * 1e9 + tv_usec * 1000` as a 64-bit value from `surface->timestamp.tv_sec/tv_usec`. The kernel-side `vb2_buffer.timestamp` is also in nanoseconds. **Check**: is the same `surface->timestamp` value used for the OUTPUT QBUF and for `reference_ts`? Yes: line 440 sets it, line 461 reads it. Eliminate.
- **NEW H-K (CAPTURE buffer count)**: How many CAPTURE buffers does the backend REQBUFS? If only 1 or 2, frame 2's CAPTURE may collide with frame 1's reference. iter5b-β `surface_fill_format_uniform` and the cap_pool size — need to check.
- **NEW H-L (slice_data_bit_offset misalignment in FRAME_BASED)**: In FRAME_BASED mode rkvdec parses the slice header on its own from OUTPUT bytes. If the OUTPUT bytes have the wrong start code presence vs what rkvdec expects, slice header parse misaligns. The start_code prefix is gated by `context->h264_start_code` which mirrors the START_CODE control's value. **Live hypothesis**.
- **NEW H-M (frame_num field on rkvdec frame_based)**: rkvdec FRAME_BASED parses its own slice header but may still cross-check against `decode->frame_num`. If frame 2's `decode->frame_num=1` and the bitstream says frame_num=1, ok. If they mismatch, kernel rejects. **Probably OK; verify in strace.**
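
Two of the arithmetic claims above (H-B and H-J) are small enough to write out. This is a sketch under assumptions: `timeval_to_ns` mirrors the seconds-plus-microseconds conversion that H-J relies on, and `strip_poc_sentinel` models only the "bit 16" rule described in H-B for non-negative values; the real `h264_strip_ffmpeg_poc_sentinel` also receives picture flags, which this model ignores.

```python
NSEC_PER_SEC = 1_000_000_000
NSEC_PER_USEC = 1_000
POC_SENTINEL = 1 << 16  # 65536


def timeval_to_ns(tv_sec, tv_usec):
    """Nanosecond tag derived from a struct timeval (microsecond resolution)."""
    return tv_sec * NSEC_PER_SEC + tv_usec * NSEC_PER_USEC


def strip_poc_sentinel(poc):
    """Drop bit 16 if set, else return the value unchanged (non-negative POCs)."""
    return poc - POC_SENTINEL if poc & POC_SENTINEL else poc


# The OUTPUT QBUF tag and reference_ts agree when both come from the same
# timeval, which is the H-J elimination argument.
stamp = (1_778_670_000, 123_456)
output_tag = timeval_to_ns(*stamp)
reference_ts = timeval_to_ns(*stamp)
```

The H-B caveat is visible directly: `strip_poc_sentinel(65536)` returns `0`, so a legitimate POC at or above 65536 with bit 16 set would be altered, while the small POCs of a short clip pass through untouched.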
## Phase 3 must answer
Empirical questions for Phase 3 strace-diff:
1. **Does rkvdec accept the decode?** Look for DQBUF FLAG_ERROR per frame in libva-h264 vs kdirect-h264. If libva's frame 2 DQBUF carries FLAG_ERROR, the kernel rejected; this is similar to Bug 5 (HEVC).
2. **What does the libva ioctl stream look like per frame vs kdirect's?** Diff `VIDIOC_S_EXT_CTRLS` payload by frame. Look for:
- control set differences (kdirect includes a control libva omits)
- DECODE_PARAMS field differences (frame_num, reference_ts, flags)
- DPB array differences (kdirect filled vs libva filled)
3. **What's `cap_pool`'s buffer count on fresnel?** Check the REQBUFS count requested for the rkvdec CAPTURE queue.
4. **What's `h264_start_code` set to on rkvdec?** Inspect `VIDIOC_G_EXT_CTRLS` at init. Verify START_CODE = ANNEX_B (i.e., `h264_start_code = true`, slices prefixed with `00 00 01`).
5. **Byte-level frame examination**: is frame 1's 16-byte real content at frame start (Y plane top-left) or somewhere in the middle? Is frame 2 truly all-zero, or is it the cap_pool init pattern `0x4c`?
6. **strace REQUEST_IOC_QUEUE ordering**: does the request_fd carry the controls correctly? Look at kernel logs (dmesg) for any rkvdec rejection messages.
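
Question 1 can be answered mechanically once the logs exist. The scanner below is a rough sketch: it assumes strace prints `VIDIOC_DQBUF` calls with symbolic buffer flags such as `V4L2_BUF_FLAG_ERROR` and symbolic buffer types, which depends on the strace version; if the installed strace prints raw numbers, the patterns need adjusting. The function name `dqbuf_errors` and the sample log lines are hypothetical.

```python
import re

# Matches e.g. "1234 ioctl(5, VIDIOC_DQBUF, {...}) = 0" in `strace -f -o` output.
DQBUF_RE = re.compile(
    r"ioctl\(\d+, VIDIOC_DQBUF, (?P<args>.*)\)\s*=\s*(?P<ret>-?\d+)"
)


def dqbuf_errors(lines):
    """Return (line_number, queue) for each successful DQBUF flagged ERROR."""
    hits = []
    for number, line in enumerate(lines, start=1):
        match = DQBUF_RE.search(line)
        if match is None or match.group("ret") != "0":
            continue  # not a DQBUF, or the ioctl itself failed (e.g. EAGAIN)
        args = match.group("args")
        if "V4L2_BUF_FLAG_ERROR" not in args:
            continue
        if "VIDEO_CAPTURE" in args:
            queue = "CAPTURE"
        elif "VIDEO_OUTPUT" in args:
            queue = "OUTPUT"
        else:
            queue = "?"
        hits.append((number, queue))
    return hits


# Hypothetical log excerpt, shaped like the assumed strace output.
sample = [
    "1234 ioctl(5, VIDIOC_DQBUF, {type=V4L2_BUF_TYPE_VIDEO_OUTPUT_MPLANE, index=0, flags=V4L2_BUF_FLAG_MAPPED}) = 0",
    "1234 ioctl(5, VIDIOC_DQBUF, {type=V4L2_BUF_TYPE_VIDEO_CAPTURE_MPLANE, index=1, flags=V4L2_BUF_FLAG_MAPPED|V4L2_BUF_FLAG_ERROR}) = 0",
    "1234 ioctl(5, VIDIOC_DQBUF, {type=V4L2_BUF_TYPE_VIDEO_CAPTURE_MPLANE}) = -1 EAGAIN (Resource temporarily unavailable)",
    "1234 ioctl(5, VIDIOC_QBUF, {type=V4L2_BUF_TYPE_VIDEO_CAPTURE_MPLANE, index=1}) = 0",
]
flagged = dqbuf_errors(sample)
```

Running the same function over `libva_h264_strace.log` and `kdirect_h264_strace.log` gives a per-path list to compare before diffing control payloads.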
## Scope confirmation
In scope (this iteration): backend code only.
- `src/h264.c`, `src/h264_slice_header.c`, `src/picture.c` H.264 paths, `src/surface.c::RequestSyncSurface`, `src/cap_pool.c`, `src/request_pool.c` request_fd lifecycle.
Out of scope: kernel patches, VP9/VP8/HEVC/MPEG-2 (read-only for regression-verify), front-end libva, performance.
## Phase 3 plan
Run on fresnel:
1. Capture strace of libva-vaapi H.264 decode: `strace -f -e trace=ioctl -o libva_h264_strace.log ffmpeg -hwaccel vaapi -i bbb_1080p30_h264.mp4 -frames:v 3 -vf hwdownload,format=nv12,format=yuv420p -f rawvideo libva_h264.yuv`.
2. Capture strace of kdirect: same fixture via Kwiboo's ffmpeg-v4l2request: `strace -f -e trace=ioctl -o kdirect_h264_strace.log ffmpeg-v4l2request -i bbb_1080p30_h264.mp4 ... > kdirect_h264.yuv`.
3. Compare strace logs per-frame. Extract S_EXT_CTRLS payloads; decode `struct v4l2_ext_controls` + per-control payloads (sps, pps, decode_params, scaling_matrix, dpb[]) using a small python parser if scale demands.
4. Compare raw byte content of `libva_h264.yuv` (frame 1, 2, 3) and `kdirect_h264.yuv` to characterize what's actually written and where the divergence starts.
5. Check `dmesg` on fresnel during the run for rkvdec messages.
6. Capture findings in `phase3_iter8_findings.md`.
Predicted Phase 3 runtime: 1 hour wallclock contingent on fresnel uptime.
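
Step 4 of the plan amounts to splitting both raw dumps into frames and locating the first differing byte in each. A minimal sketch, assuming planar `yuv420p` output with even dimensions (so each frame is `width * height * 3 / 2` bytes); the helper names are hypothetical.

```python
def frame_size_yuv420p(width, height):
    """Bytes per frame: full-resolution Y plus quarter-resolution U and V."""
    return width * height * 3 // 2


def split_frames(raw, width, height):
    """Cut a rawvideo dump into whole frames, dropping any trailing partial one."""
    size = frame_size_yuv420p(width, height)
    return [raw[i:i + size] for i in range(0, len(raw) - size + 1, size)]


def first_divergence(a, b):
    """Offset of the first differing byte, or None if the buffers are equal."""
    common = min(len(a), len(b))
    for offset in range(common):
        if a[offset] != b[offset]:
            return offset
    return None if len(a) == len(b) else common


def compare_dumps(libva_raw, kdirect_raw, width, height):
    """Per-frame first-divergence offsets between the two decode paths."""
    pairs = zip(split_frames(libva_raw, width, height),
                split_frames(kdirect_raw, width, height))
    return [first_divergence(x, y) for x, y in pairs]
```

For the fixture this would be called as `compare_dumps(open("libva_h264.yuv", "rb").read(), open("kdirect_h264.yuv", "rb").read(), 1920, 1080)`, assuming the clip decodes at 1920x1080. An offset below `width * height` falls in the Y plane; anything above it falls in chroma.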
## Phase 4 plan-shape predicate
Phase 4 plan is contingent on Phase 3 narrowing. Three likely outcomes:
- **Outcome A** — Phase 3 finds a specific control or DECODE_PARAMS field divergence (e.g., libva omits a field kdirect always fills). Fix is mechanical, 5-30 LOC in `h264.c::h264_va_picture_to_v4l2` or `h264_set_controls`.
- **Outcome B** — Phase 3 finds DQBUF FLAG_ERROR on frame 2 inter. Then this is similar to Bug 5: kernel-side rejection. Iter8 either fixes whatever the wire-protocol gap is (backend), or downgrades to "narrowed but unfixed" close (deferred to a kernel-touching iteration).
- **Outcome C** — Phase 3 finds the kernel accepts the decode but writes to a wrong CAPTURE buffer / partial fill. Cap-pool / surface_bind_slot bug; 10-30 LOC fix in `cap_pool.c` or `surface.c`.
LOC estimate at this stage: 5-50 LOC in 1-3 files. One commit, possibly two.