fresnel-fourier/phase8_iteration23_close.md
marfrit bf67900cd8 iter20-26: kernel-side root-cause localization, α-25/α-26 fix Bug 4, partial Bug 5
iter20-23: kernel printk in rkvdec_hevc_run + v4l2_ctrl_request_setup
iter24:    pinpointed rkvdec_s_ctrl returning -EBUSY for HEVC_SPS due
           to vb2_is_busy(CAPTURE) — libva pre-allocates 24 CAPTURE bufs
           before first per-frame S_EXT_CTRLS, blocking image_fmt reset
iter25 α-25: synthetic SPS injection before cap_pool_init seeds
           ctx->image_fmt to RKVDEC_IMG_FMT_420_8BIT while CAPTURE is
           still empty. H264 Bug 4 fully fixed (byte-equal kdirect).
           HEVC Bug 5 frame 1 fixed (byte-equal kdirect).
iter26 α-26: populate decode_params.short_term_ref_pic_set_size from
           picture->st_rps_bits (VAAPI does expose it). Bytes 4-5 of
           dp now match kdirect. HEVC frame 2+ still diverges
           (separate bug, likely DPB entry mapping).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 10:10:56 +00:00


Iteration 23 — Phase 8 (close)

Closed 2026-05-14. iter23 = one kernel printk inside the outer loop of v4l2_ctrl_request_setup, placed BEFORE the continue check so that every iteration is logged. FULL close.

Method

linux-fresnel-fourier 7.0-7 added one pr_info at the TOP of the outer loop in v4l2_ctrl_request_setup, BEFORE the early-continue check if (ref->req_done || (ctrl->flags & V4L2_CTRL_FLAG_READ_ONLY)) continue; The added line:

pr_info("iter23_loop: id=0x%x req_done=%d flags=0x%x ncontrols=%d cluster0_id=0x%x\n",
    ctrl->id, ref->req_done, ctrl->flags,
    master->ncontrols,
    master->cluster[0] ? master->cluster[0]->id : 0);
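To compare the libva and kdirect traces mechanically, the iter23_loop lines can be parsed back out of dmesg. A minimal userspace sketch, assuming the line layout is exactly the pr_info format above; the struct and function names (iter23_rec, parse_iter23_line) are hypothetical helpers, not part of the kernel patch:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* One parsed iter23_loop record; fields mirror the pr_info format string. */
struct iter23_rec {
    unsigned int id;
    int req_done;
    unsigned int flags;
    int ncontrols;
    unsigned int cluster0_id;
};

/*
 * Parse one log line. Returns 1 if all five fields were read, 0 otherwise.
 * Anything before the "iter23_loop:" tag (e.g. a dmesg timestamp) is skipped.
 */
int parse_iter23_line(const char *line, struct iter23_rec *out)
{
    const char *p;

    assert(line != NULL && out != NULL);
    p = strstr(line, "iter23_loop:");
    if (p == NULL)
        return 0;
    /* The literal "0x" in the format consumes the prefix printed by the kernel. */
    return sscanf(p,
                  "iter23_loop: id=0x%x req_done=%d flags=0x%x "
                  "ncontrols=%d cluster0_id=0x%x",
                  &out->id, &out->req_done, &out->flags,
                  &out->ncontrols, &out->cluster0_id) == 5;
}
```

Feeding each dmesg line through parse_iter23_line and collecting the id field gives one ordered ID list per run, which is enough to see where one trace stops short of the other.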

Result — definitive

libva HEVC (first setup): iter23_loop fires for 16 IDs ending at 0xa40a90 (HEVC_SPS). The outer loop EXITS before reaching 0xa40a91.

kdirect HEVC (first setup): iter23_loop fires for 22 IDs ending at 0xa40a96 (HEVC_START_CODE). The outer loop completes normally.

The loop body has only two exit-loop paths after the iter23_loop printk fires:

  1. goto error if req_to_new(r) returns non-zero.
  2. break if try_or_set_cluster(NULL, master, true, 0) returns non-zero.

For libva, ONE of these fires AT HEVC_SPS, exiting the loop. For kdirect, NEITHER fires.
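The two exit paths can be modelled in userspace to check the reasoning. This is a toy sketch, not the kernel code: the per-cluster inner loop is collapsed into one step per control, return codes are injected by the caller, and all names (toy_ctrl, run_toy_setup, exit_path) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Which of the two exit paths fired, if any. */
enum exit_path { EXIT_NONE, EXIT_GOTO_ERROR, EXIT_BREAK };

/* Toy stand-in for one control; results are injected, not computed. */
struct toy_ctrl {
    unsigned int id;
    int skip;            /* models req_done || READ_ONLY (early continue) */
    int req_to_new_ret;  /* injected result of the req_to_new step */
    int try_or_set_ret;  /* injected result of the try_or_set_cluster step */
};

struct loop_result {
    size_t logged;         /* iterations that reached the top-of-loop log */
    unsigned int last_id;  /* id seen by the last logged iteration */
    enum exit_path path;
};

struct loop_result run_toy_setup(const struct toy_ctrl *c, size_t n)
{
    struct loop_result res = { 0, 0, EXIT_NONE };

    assert(c != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        res.logged++;              /* top-of-loop log: fires every iteration */
        res.last_id = c[i].id;
        if (c[i].skip)
            continue;              /* skipped, but already logged */
        if (c[i].req_to_new_ret) { /* exit path 1: goto error */
            res.path = EXIT_GOTO_ERROR;
            break;
        }
        if (c[i].try_or_set_ret) { /* exit path 2: break */
            res.path = EXIT_BREAK;
            break;
        }
    }
    return res;
}
```

With 22 entries and a failure injected at the 16th, logged is 16 and last_id is that entry's id, which is the libva pattern. With no failure injected, logged is 22 and path stays EXIT_NONE, which is the kdirect pattern. The model cannot say which of the two paths fires in the real kernel; that is what iter24 is for.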

This fully overturns iter21/22:

  • The clone-hdl IS complete for libva (iter22 confirmed all 22 controls cloned).
  • The setup loop reaches HEVC_SPS for libva (iter23 confirmed).
  • The processing of HEVC_SPS in the setup loop FAILS for libva.

The failure of HEVC_SPS processing means:

  • p_cur for HEVC_SPS is never committed → rkvdec reads zero (iter20 finding).
  • All subsequent HEVC controls (PPS, SLICE_PARAMS, SCALING_MATRIX, DECODE_PARAMS, DECODE_MODE, START_CODE) are NEVER processed → their req_done stays false and their values are never committed → all zero in ctx->ctrl_hdl.

Why does HEVC_SPS processing fail for libva but not kdirect?

The most likely candidates:

| Function | Failure modes |
|---|---|
| req_to_new(ref_SPS) | -ENOENT if !p_req_valid. -EINVAL if elem count mismatch (p_req_elems != p_array_alloc_elems for non-dyn-array). -ENOMEM if alloc fails for dyn-array resize. |
| try_or_set_cluster(NULL, master_SPS, true, 0) | Validator failures (out-of-range field values). Cluster ops failures. Often returns -EINVAL or -ERANGE. |

iter24 will pinpoint which function fails and what return value.

iter21/22's interpretation errors

  • iter21: I concluded the clone-hdl was missing controls. Wrong: the iter21_setup_ref printk was inside the loop body but AFTER the early-continue check. The "missing" controls were never reached at all. The loop exited once SPS's processing failed, so they never hit the iter21 printk.
  • iter22: The clone trace confirmed clone-hdl is complete. Good. But my mid-conclusion ("clone-hdl is complete; staging fails in setup loop SKIP path") was partially wrong: the loop doesn't SKIP, it EXITS.

Mechanism status (post-iter23)

| # | Mechanism | Status |
|---|---|---|
| 1 | request_fd mismatch | DISPROVED iter17/18 |
| 2 | REINIT clears | DISPROVED iter19 |
| 3 | Stack-locals stale | DISPROVED iter18 |
| 4 | ctrl_hdl mismatch | DISPROVED iter20-22 |
| 5 | error_idx silent failure | DISPROVED iter18 |
| 6 | req->p_new staging incomplete | DISPROVED iter22 |
| 7 | Clone-hdl missing controls | DISPROVED iter22 |
| 8 | Skip-loop bypass | DISPROVED iter23 (loop EXITS, not skips) |
| 9 | NEW iter23: HEVC_SPS processing in v4l2_ctrl_request_setup fails for libva | LEADING (iter24 candidate) |

iter24 candidate

linux-fresnel-fourier 7.0-8:

ret = req_to_new(r);
pr_info("iter24_req_to_new: id=0x%x ret=%d p_req_valid=%d p_req_elems=%u\n",
    master->cluster[i]->id, ret, r->p_req_valid, r->p_req_elems);
...
ret = try_or_set_cluster(NULL, master, true, 0);
pr_info("iter24_try_or_set: master_id=0x%x ret=%d\n", master->id, ret);

After 7.0-8 deploys, libva HEVC will show:

  • iter24_req_to_new: id=0xa40a90 ret=X p_req_valid=Y p_req_elems=Z, where X is the actual return value.
  • If req_to_new ret != 0 → the bug is in req_to_new for HEVC_SPS on libva's staged data. Compare p_req_elems to kdirect's value.
  • If req_to_new ret == 0 → check iter24_try_or_set's ret. If it is non-zero → the validator rejects libva's SPS but accepts kdirect's. Investigate which field the validator rejects.
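The decision tree above is small enough to write down as code, which also makes the one unlisted case explicit: both calls returning 0 for HEVC_SPS. A minimal sketch, assuming the two ret values have already been extracted from the iter24 lines; the enum and function names are hypothetical:

```c
#include <assert.h>
#include <stddef.h>

enum iter24_verdict {
    VERDICT_REQ_TO_NEW_FAILED,  /* req_to_new returned non-zero */
    VERDICT_TRY_OR_SET_FAILED,  /* req_to_new ok, try_or_set_cluster non-zero */
    VERDICT_INCONCLUSIVE        /* both returned 0: not covered by the plan */
};

/* Classify the HEVC_SPS iteration from the two logged return values. */
enum iter24_verdict triage_iter24(int req_to_new_ret, int try_or_set_ret)
{
    if (req_to_new_ret != 0)
        return VERDICT_REQ_TO_NEW_FAILED;
    if (try_or_set_ret != 0)
        return VERDICT_TRY_OR_SET_FAILED;
    return VERDICT_INCONCLUSIVE;
}

/* Next investigative step for each verdict, following the bullets above. */
const char *next_step(enum iter24_verdict v)
{
    switch (v) {
    case VERDICT_REQ_TO_NEW_FAILED:
        return "compare p_req_valid and p_req_elems against kdirect";
    case VERDICT_TRY_OR_SET_FAILED:
        return "find which field the validator rejects";
    case VERDICT_INCONCLUSIVE:
        return "re-check the iter23 exit analysis";
    }
    assert(0 && "unknown verdict");
    return NULL;
}
```

VERDICT_INCONCLUSIVE would contradict the iter23 result (the loop has only these two exits), so seeing it would point at a logging or placement mistake rather than at a third failure mode.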

Substrate state at iter23 close

  • Backend SHA on fresnel: c1d4bb53… (iter15 stable, unchanged).
  • Fork tip e109306 — unchanged.
  • Kernel linux-fresnel-fourier 7.0-7 with iter17 + iter20 + iter21 + iter22 + iter23 printks.
  • 5-codec anchors: unchanged.

iter24 build kicked off

linux-fresnel-fourier 7.0-8 building on boltzmann (PID 1672261, log /tmp/iter24-kbuild.log).

Lesson

Three iterations of mid-loop printk (iter21, iter22, iter23) were needed to localize the exit. Each iteration overturned the previous one's partial conclusion. Key methodology: place the diagnostic printk at the very top of each loop body, BEFORE any continue/break, to distinguish "skipped" from "exited". Without that, "missing from printk output" is ambiguous.
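The ambiguity is easy to reproduce in a toy loop. The sketch below is a userspace model, not the kernel function: it counts how many iterations emit a log line for each printk placement, with the skip and fail flags supplied by the caller. All names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

enum log_pos { LOG_AT_TOP, LOG_AFTER_CONTINUE };

struct toy_ref {
    int skip;  /* models the early-continue condition */
    int fail;  /* non-zero: processing this entry exits the loop */
};

/* Count the iterations that would emit a log line for a given placement. */
size_t count_logged(const struct toy_ref *refs, size_t n, enum log_pos pos)
{
    size_t logged = 0;

    assert(refs != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (pos == LOG_AT_TOP)
            logged++;              /* fires for skipped entries too */
        if (refs[i].skip)
            continue;
        if (pos == LOG_AFTER_CONTINUE)
            logged++;              /* silent for skipped entries */
        if (refs[i].fail)
            break;                 /* loop exits; the tail is never visited */
    }
    return logged;
}
```

Take six entries in two scenarios: in one the last three are skipped, in the other the third entry fails. With LOG_AFTER_CONTINUE both scenarios log three lines and cannot be told apart. With LOG_AT_TOP the first logs six and the second logs three.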

The bug is now localized to:

  • A specific function: req_to_new OR try_or_set_cluster.
  • A specific control: HEVC_SPS.
  • A specific request lifecycle pattern: libva's, not kdirect's.

One more printk iteration (iter24) should give the failing function + return code.