fresnel-fourier/phase8_iteration22_close.md
marfrit bf67900cd8 iter20-26: kernel-side root-cause localization, α-25/α-26 fix Bug 4, partial Bug 5
iter20-23: kernel printk in rkvdec_hevc_run + v4l2_ctrl_request_setup
iter24:    pinpointed rkvdec_s_ctrl returning -EBUSY for HEVC_SPS due
           to vb2_is_busy(CAPTURE) — libva pre-allocates 24 CAPTURE bufs
           before first per-frame S_EXT_CTRLS, blocking image_fmt reset
iter25 α-25: synthetic SPS injection before cap_pool_init seeds
           ctx->image_fmt to RKVDEC_IMG_FMT_420_8BIT while CAPTURE is
           still empty. H264 Bug 4 fully fixed (byte-equal kdirect).
           HEVC Bug 5 frame 1 fixed (byte-equal kdirect).
iter26 α-26: populate decode_params.short_term_ref_pic_set_size from
           picture->st_rps_bits (VAAPI does expose it). Bytes 4-5 of
           dp now match kdirect. HEVC frame 2+ still diverges
           (separate bug, likely DPB entry mapping).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 10:10:56 +00:00


Iteration 22 — Phase 8 (close)

Closes 2026-05-14. iter22 = kernel printk in v4l2_ctrl_request_clone tracing each handler_new_ref step. FULL close. iter21's mid-conclusion is overturned: the request-clone-hdl is COMPLETE for libva — all 22 controls cloned with err=0.

Method

linux-fresnel-fourier 7.0-6 added per-step pr_info to v4l2_ctrl_request_clone:

pr_info("iter22_clone_start: new_hdl=%p from=%p\n", hdl, from);
list_for_each_entry(ref, &from->ctrl_refs, node) {
    ...
    err = handler_new_ref(hdl, ctrl, &new_ref, false, true);
    pr_info("iter22_clone_step: id=0x%x err=%d hdl_error=%d new_ref=%p\n",
            ctrl->id, err, hdl->error, new_ref);
    ...
}
pr_info("iter22_clone_end: hdl=%p err=%d\n", hdl, err);

Built ~2 min via ccache. Deployed via scp + pacman -U + reboot.

Results

libva HEVC (11 clones — one per request_fd binding):

Every clone-step logs err=0 hdl_error=0 new_ref=valid_ptr. Each clone has 22 controls successfully added, ending with iter22_clone_end err=0. Full ID list per clone:

0x990001, 0x990a67, 0x990a6b, 0x990b00, 0x990b67, 0x990b68,
0xa40001, 0xa40900, 0xa40901, 0xa40902, 0xa40903, 0xa40904, 0xa40907,
0xa40a2c, 0xa40a2d,
0xa40a90, 0xa40a91, 0xa40a92, 0xa40a93, 0xa40a94, 0xa40a95, 0xa40a96

Note: H264_PRED_WEIGHTS (0xa40905) and H264_SLICE_PARAMS (0xa40906) are NOT in the source main_hdl (these are not registered in rkvdec_h264_ctrls on this kernel — rkvdec doesn't expose them). All 7 HEVC stateless decode controls ARE present.
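The ID arithmetic behind this note can be checked with a small userspace sketch. It assumes the upstream linux/v4l2-controls.h values (class mask 0x0fff0000, codec class 0x990000, stateless codec class 0xa40000, stateless base at class | 0x900); verify them against the running kernel tree. The MODEL_ names and the helper functions are hypothetical and exist only for this illustration.

```c
#include <stdint.h>

/* Assumed to match linux/v4l2-controls.h; re-declared with MODEL_ names
 * so the sketch builds without kernel headers. */
#define MODEL_CLASS_MASK            0x0fff0000u /* mask used by V4L2_CTRL_ID2CLASS */
#define MODEL_CLASS_CODEC           0x00990000u
#define MODEL_CLASS_CODEC_STATELESS 0x00a40000u
#define MODEL_STATELESS_BASE        (MODEL_CLASS_CODEC_STATELESS | 0x900u)

/* Class part of a control ID. */
static uint32_t ctrl_class(uint32_t id)
{
    return id & MODEL_CLASS_MASK;
}

/* A class root is "class | 1", e.g. 0x990001 and 0xa40001. */
static int is_class_root(uint32_t id)
{
    return id == (ctrl_class(id) | 1u);
}

/* Offset from the stateless codec base, or -1 if the ID is not a
 * stateless codec control at or above that base. */
static int stateless_offset(uint32_t id)
{
    if (ctrl_class(id) != MODEL_CLASS_CODEC_STATELESS ||
        id < MODEL_STATELESS_BASE)
        return -1;
    return (int)(id - MODEL_STATELESS_BASE);
}
```

With these helpers, 0xa40a90 through 0xa40a96 come out as offsets 400 through 406 from the stateless base, and the two absent H264 controls 0xa40905 and 0xa40906 as offsets 5 and 6.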

kdirect HEVC (13 clones): identical pattern, identical ID set, all err=0.
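The "all err=0" claim for both paths can be re-checked mechanically from a saved dmesg capture. The following is a minimal sketch, assuming each record keeps the iter22_clone_step format shown in Method and may carry an arbitrary prefix (timestamp) before the tag. The struct and function names are hypothetical.

```c
#include <stdio.h>
#include <string.h>

struct clone_step {
    unsigned int id;
    int err;
    int hdl_error;
};

/* Returns 1 and fills *out if the line carries an iter22_clone_step
 * record, else 0. Anything before the tag (e.g. a timestamp) is ignored. */
static int parse_clone_step(const char *line, struct clone_step *out)
{
    const char *p = strstr(line, "iter22_clone_step:");

    if (!p)
        return 0;
    /* %x accepts the leading "0x" printed by the kernel format string. */
    return sscanf(p, "iter22_clone_step: id=%x err=%d hdl_error=%d",
                  &out->id, &out->err, &out->hdl_error) == 3;
}

/* Counts steps with err == 0 and hdl_error == 0 among n log lines. */
static int count_clean_steps(const char *const *lines, int n)
{
    struct clone_step s;
    int i, clean = 0;

    for (i = 0; i < n; i++)
        if (parse_clone_step(lines[i], &s) && s.err == 0 && s.hdl_error == 0)
            clean++;
    return clean;
}
```

For a complete clone, the clean count between one iter22_clone_start and the matching iter22_clone_end would be 22.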

What this overturns from iter21

iter21 concluded the clone-hdl was missing 6 HEVC controls for libva. That was wrong. The clone-hdl actually has all 22 controls. The loop containing the iter21_setup_ref printk was skipping 8 of them before the printk ran, via the early-continue check:

if (ref->req_done || (ctrl->flags & V4L2_CTRL_FLAG_READ_ONLY))
    continue;

For libva, only 14 of the 22 refs reach the printk (8 skipped). For kdirect, 21 of 22 reach it (1 skipped).
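Why the printk position matters can be shown with a small userspace model of the check. This is a sketch only: the struct and function names are hypothetical, the flag value 0x0004 is assumed to match V4L2_CTRL_FLAG_READ_ONLY in the uapi header, and nothing here is kernel code.

```c
#include <stdbool.h>
#include <stddef.h>

#define MODEL_FLAG_READ_ONLY 0x0004u /* assumed value of V4L2_CTRL_FLAG_READ_ONLY */

struct model_ref {
    unsigned int id;
    unsigned int flags;
    bool req_done;
};

/* Mirrors the early-continue test quoted above. */
static bool model_skipped(const struct model_ref *ref)
{
    return ref->req_done || (ref->flags & MODEL_FLAG_READ_ONLY);
}

/* Number of refs a printk placed AFTER the check would log. */
static size_t logged_after_check(const struct model_ref *refs, size_t n)
{
    size_t i, logged = 0;

    for (i = 0; i < n; i++)
        if (!model_skipped(&refs[i]))
            logged++;
    return logged;
}

/* Number of refs a printk placed BEFORE the check would log: every ref. */
static size_t logged_before_check(const struct model_ref *refs, size_t n)
{
    (void)refs;
    return n;
}
```

With 22 refs of which 8 are skipped for any reason, the after-check count is 14 while the before-check count stays at 22, which is the gap that made the clone-hdl look incomplete.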

What this implies

Two of the skipped controls are clearly READ_ONLY: the class roots 0x990001 and 0xa40001 (V4L2_CTRL_TYPE_CTRL_CLASS). On libva that accounts for 2 of the 8 skips.

The other 6 libva skips are HEVC controls: PPS, SLICE_PARAMS, SCALING_MATRIX, DECODE_PARAMS, DECODE_MODE, START_CODE.

kdirect does not skip DECODE_MODE or START_CODE. Both show up in its setup_ref printk with p_req_valid=0 (not staged this frame), so they pass the continue check.

The kdirect dump had 21 ctrl_refs lines (excluding the setup: line), so 22 cloned - 21 logged = 1 skipped. That is one fewer than the 2 class roots would predict. Possibly only one class root is present in kdirect's clone-hdl somehow, although the iter22 clone trace shows the same ID set for both paths. Unresolved.

So the actual difference is: libva skips 8, kdirect skips 1. The 7 EXTRA skips for libva are: 6 HEVC controls (PPS, SLICE, SCALING, DECODE_PARAMS, DECODE_MODE, START_CODE) + 1 mystery.

Mechanism status (post-iter22)

| # | Mechanism | Status |
|---|-----------|--------|
| 1 | request_fd mismatch | DISPROVED |
| 2 | REINIT clears | DISPROVED iter19 |
| 3 | Stack-locals stale | DISPROVED iter18 |
| 4 | ctrl_hdl mismatch | REFRAMED iter20 |
| 5 | error_idx silent failure | DISPROVED iter18 |
| 6 | req->p_new staging incomplete | iter21 → 22 OVERTURNED (clone IS complete) |
| 7 | Clone-hdl missing controls | DISPROVED iter22 (clone has all 22) |
| 8 | NEW iter22: 6 of 7 HEVC controls get skipped in v4l2_ctrl_request_setup loop for libva but not kdirect | leading hypothesis |

iter23 candidate

Add printk inside the v4l2_ctrl_request_setup loop before the continue check, logging req_done, flags, ncontrols, cluster[0]->id:

/* ctrl->flags is unsigned long and ncontrols is unsigned int in struct v4l2_ctrl */
pr_info("iter23_loop: id=0x%x req_done=%d flags=0x%lx ncontrols=%u cluster0_id=0x%x\n",
        ctrl->id, ref->req_done, ctrl->flags,
        master->ncontrols,
        master->cluster[0] ? master->cluster[0]->id : 0);
if (ref->req_done || (ctrl->flags & V4L2_CTRL_FLAG_READ_ONLY))
    continue;

Two possible findings:

  • req_done already true for the 6 HEVC controls → an earlier loop iteration (HEVC_SPS) processed them as members of its cluster and set req_done. That would mean main_hdl's HEVC_SPS has a master->cluster containing PPS+SLICE+SCALING+DECODE_PARAMS+DECODE_MODE+START_CODE on libva's path.
  • flags has READ_ONLY → the controls have READ_ONLY set, which is wrong for stateless decode controls.

ncontrols and cluster0_id reveal cluster membership directly: if ncontrols > 1 for HEVC_SPS, it's been clustered with siblings.
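The first hypothesis (cluster membership setting req_done) can be sketched as a userspace model. It is deliberately simplified: it keeps only the req_done check and the "handling one member marks the whole cluster done" step, and leaves out the READ_ONLY test, locking, and the p_req_valid test. All names are hypothetical, and clusters are stored as index lists rather than pointers.

```c
#include <stdbool.h>
#include <stddef.h>

#define MODEL_MAX_CLUSTER 8

struct cluster_ctrl {
    unsigned int id;
    size_t ncontrols;                  /* cluster size; 1 means unclustered */
    size_t cluster[MODEL_MAX_CLUSTER]; /* indices of members in the ref array */
};

struct cluster_ref {
    struct cluster_ctrl ctrl;
    bool req_done;
};

/* Walks the refs in order and returns how many pass the req_done check. */
static size_t model_setup_pass(struct cluster_ref *refs, size_t n)
{
    size_t i, j, reached = 0;

    for (i = 0; i < n; i++) {
        if (refs[i].req_done)
            continue; /* already handled through an earlier cluster */
        reached++;
        /* Handling one member marks every member of its cluster done. */
        for (j = 0; j < refs[i].ctrl.ncontrols; j++)
            refs[refs[i].ctrl.cluster[j]].req_done = true;
    }
    return reached;
}
```

In this model, 7 controls in one cluster give a single loop body execution (the first member) and 6 skips, while 7 unclustered controls give 7 executions. That matches the libva symptom only if the clustering hypothesis holds, which is what the ncontrols field in the iter23 printk is meant to confirm or rule out.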

Substrate state at iter22 close

  • Backend SHA on fresnel: c1d4bb53… (iter15 stable, unchanged).
  • Fork tip e109306 (α-24 reverted) — unchanged.
  • Kernel linux-fresnel-fourier 7.0-6 with iter17 + iter20 + iter21 + iter22 printks.
  • 5-codec anchors: unchanged.

Iter23 build kicked off

linux-fresnel-fourier 7.0-7 building on boltzmann (PID 1643224, log /tmp/iter23-kbuild.log). Expected ~2 min via ccache.

Lesson

iter21's mid-conclusion was based on the wrong printk position — the iter21_setup_ref printk was inside v4l2_ctrl_request_setup's loop but AFTER the early-continue checks, missing controls that get skipped. iter22's clone-trace showed the clone IS complete; the staging FAILS via the setup-loop SKIP path, not the clone path.

The empirical pattern is now: libva's per-frame request gets through clone correctly and gets through S_EXT_CTRLS correctly (all 5 controls staged with p_req_valid=1; confirmed at least for SPS); but at setup-loop time, 6 of the 7 HEVC controls hit a continue that bypasses req_to_new. SPS alone reaches req_to_new → try_or_set_cluster → commits to p_cur, so rkvdec_run should then read a non-zero ctx->ctrl_hdl[SPS]->p_cur.

But iter20 said sps[0..16] was zero for libva. If SPS is the only control that reaches req_to_new + try_or_set_cluster, its content SHOULD be correct, yet iter20 saw zeros. So SPS itself ALSO has a problem in the commit path.

So we have:

  • 6 of 7 HEVC controls: never reach req_to_new (skipped in setup loop).
  • 1 of 7 (SPS): reaches req_to_new but resulting p_cur content is zero anyway.

These are TWO separate bugs (or one bug with two symptoms). iter23 will reveal the skip mechanism; another iter (or test) may need to address the SPS-commit-content issue.