fresnel-fourier/phase8_iteration21_close.md
marfrit bf67900cd8 iter20-26: kernel-side root-cause localization, α-25/α-26 fix Bug 4, partial Bug 5
iter20-23: kernel printk in rkvdec_hevc_run + v4l2_ctrl_request_setup
iter24:    pinpointed rkvdec_s_ctrl returning -EBUSY for HEVC_SPS due
           to vb2_is_busy(CAPTURE) — libva pre-allocates 24 CAPTURE bufs
           before first per-frame S_EXT_CTRLS, blocking image_fmt reset
iter25 α-25: synthetic SPS injection before cap_pool_init seeds
           ctx->image_fmt to RKVDEC_IMG_FMT_420_8BIT while CAPTURE is
           still empty. H264 Bug 4 fully fixed (byte-equal kdirect).
           HEVC Bug 5 frame 1 fixed (byte-equal kdirect).
iter26 α-26: populate decode_params.short_term_ref_pic_set_size from
           picture->st_rps_bits (VAAPI does expose it). Bytes 4-5 of
           dp now match kdirect. HEVC frame 2+ still diverges
           (separate bug, likely DPB entry mapping).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 10:10:56 +00:00


Iteration 21 — Phase 8 (close)

Closes 2026-05-14. iter21 = kernel printk at top of v4l2_ctrl_request_setup + per-ref dump. FULL close. Smoking-gun finding: libva's request-clone-handler is missing 6 HEVC stateless controls registered in main_hdl.

Method

linux-fresnel-fourier 7.0-5 (pkgrel 4→5) adds two pr_info calls to v4l2_ctrl_request_setup in drivers/media/v4l2-core/v4l2-ctrls-request.c:

obj = media_request_object_find(req, &req_ops, main_hdl);
pr_info("iter21_setup: req=%p main_hdl=%p obj=%p\n", req, main_hdl, obj);
...
list_for_each_entry(ref, &hdl->ctrl_refs, node) {
    ...
    pr_info("iter21_setup_ref: ctrl_id=0x%x p_req_valid=%d have_new=%d\n",
            ctrl->id, ref->p_req_valid, have_new_data);
    ...
}

Built ~1 min via ccache reuse. Deployed via scp + pacman -U + reboot.
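The per-ref printk produces one log line per control, which makes the dump easy to post-process. A minimal sketch of pulling the fields back out of a log line, assuming the format string shown above; `struct setup_ref` and `parse_setup_ref` are hypothetical names for illustration and are not part of the kernel patch or of libva.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* One parsed "iter21_setup_ref" record (hypothetical helper type). */
struct setup_ref {
    unsigned int ctrl_id;
    int p_req_valid;
    int have_new;
};

/*
 * Parse one kernel log line. Anything before the "iter21_setup_ref:" tag
 * (for example a dmesg timestamp) is skipped.
 * Returns 0 on success, -1 if the line is not a per-ref record.
 */
static int parse_setup_ref(const char *line, struct setup_ref *out)
{
    const char *tag;

    assert(line != NULL && out != NULL);

    tag = strstr(line, "iter21_setup_ref:");
    if (tag == NULL)
        return -1;

    /* Same field order as the pr_info format string in the patch. */
    if (sscanf(tag, "iter21_setup_ref: ctrl_id=0x%x p_req_valid=%d have_new=%d",
               &out->ctrl_id, &out->p_req_valid, &out->have_new) != 3)
        return -1;

    return 0;
}
```

Feeding every dmesg line through `parse_setup_ref` and keeping the records that parse gives the ID-ordered tables used in the comparison, without hand-copying values.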

α-24 result (predicate: kernel-only path required)

α-24 (libva G_EXT_CTRLS readback after S_EXT_CTRLS) was implemented as 1547a5d, amended in a9c897f, and reverted in e109306. The kernel returned EACCES for all 13 libva HEVC frames: V4L2 does not let userspace read a control back from a request that has not yet completed, so req->p_new cannot be probed this way. The probe path must run inside the kernel.

Result — definitive (libva vs kdirect)

libva HEVC frame 1 setup (clone-hdl ctrl_refs in ID order, 14 entries):

0x990a67  p_req_valid=0
0x990a6b  p_req_valid=0
0x990b00  p_req_valid=0
0x990b67  p_req_valid=0
0x990b68  p_req_valid=0        (5 codec-class menu controls)
0xa40900  p_req_valid=0        H264_DECODE_MODE
0xa40901  p_req_valid=0        H264_START_CODE
0xa40902  p_req_valid=0        H264_SPS
0xa40903  p_req_valid=0        H264_PPS
0xa40904  p_req_valid=0        H264_SCALING_MATRIX
0xa40907  p_req_valid=0        H264_DECODE_PARAMS
0xa40a2c  p_req_valid=0        (misc stateless)
0xa40a2d  p_req_valid=0        (misc stateless)
0xa40a90  p_req_valid=1 have_new=1     HEVC_SPS — CLONE STOPS HERE

Missing from libva clone (vs kdirect):

  • 0xa40905 H264_PRED_WEIGHTS (compound)
  • 0xa40906 H264_SLICE_PARAMS (compound, dyn_array)
  • 0xa40a91 HEVC_PPS (compound)
  • 0xa40a92 HEVC_SLICE_PARAMS (compound, dyn_array)
  • 0xa40a93 HEVC_SCALING_MATRIX (compound)
  • 0xa40a94 HEVC_DECODE_PARAMS (compound)
  • 0xa40a95 HEVC_DECODE_MODE (menu)
  • 0xa40a96 HEVC_START_CODE (menu)

kdirect HEVC frame 1 setup (same hdl, 22 entries: the 14 above plus the 8 missing):

... 14 entries as above ...
0xa40a91  p_req_valid=1 have_new=1     HEVC_PPS
0xa40a92  p_req_valid=1 have_new=1     HEVC_SLICE_PARAMS
0xa40a93  p_req_valid=1 have_new=1     HEVC_SCALING_MATRIX
0xa40a94  p_req_valid=1 have_new=1     HEVC_DECODE_PARAMS
0xa40a95  p_req_valid=0                HEVC_DECODE_MODE (device-init only)
0xa40a96  p_req_valid=0                HEVC_START_CODE  (device-init only)
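The "missing" list is a set difference between the two ID-ordered dumps. A small sketch of that comparison, using the control IDs exactly as listed above; the array and function names (`libva_clone`, `kdirect_hdl`, `ids_missing`) are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ARRAY_LEN(a) (sizeof(a) / sizeof((a)[0]))

/* Control IDs seen in the libva clone-hdl dump (ascending). */
static const uint32_t libva_clone[] = {
    0x990a67, 0x990a6b, 0x990b00, 0x990b67, 0x990b68,
    0xa40900, 0xa40901, 0xa40902, 0xa40903, 0xa40904, 0xa40907,
    0xa40a2c, 0xa40a2d, 0xa40a90,
};

/* kdirect dump: the same IDs plus the ones reported missing (ascending). */
static const uint32_t kdirect_hdl[] = {
    0x990a67, 0x990a6b, 0x990b00, 0x990b67, 0x990b68,
    0xa40900, 0xa40901, 0xa40902, 0xa40903, 0xa40904,
    0xa40905, 0xa40906, 0xa40907,
    0xa40a2c, 0xa40a2d,
    0xa40a90, 0xa40a91, 0xa40a92, 0xa40a93, 0xa40a94, 0xa40a95, 0xa40a96,
};

/*
 * Collect the IDs present in `full` but absent from `subset`.
 * Both inputs must be sorted ascending. At most `nout` IDs are stored in
 * `out`; the return value is the total number of missing IDs.
 */
static size_t ids_missing(const uint32_t *full, size_t nfull,
                          const uint32_t *subset, size_t nsub,
                          uint32_t *out, size_t nout)
{
    size_t i = 0, j = 0, n = 0;

    assert(full != NULL && subset != NULL && out != NULL);

    while (i < nfull) {
        if (j < nsub && subset[j] < full[i]) {
            j++;                    /* ID only in subset: ignore */
        } else if (j < nsub && subset[j] == full[i]) {
            i++;                    /* ID in both: not missing */
            j++;
        } else {
            if (n < nout)
                out[n] = full[i];   /* ID only in full: missing */
            n++;
            i++;
        }
    }
    return n;
}
```

Running `ids_missing(kdirect_hdl, ARRAY_LEN(kdirect_hdl), libva_clone, ARRAY_LEN(libva_clone), out, ARRAY_LEN(out))` reproduces the list of absent controls, starting at 0xa40905 and ending at 0xa40a96.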

What this means

v4l2_ctrl_request_setup(req, main_hdl):

  • finds obj for both libva and kdirect (non-NULL) — request properly bound.
  • iterates hdl->ctrl_refs, but libva's hdl is the request-clone-hdl, and it contains only 14 of the 22 source controls.
  • libva's HEVC_SPS has p_req_valid=1 — staging worked for that one control.
  • The other 6 HEVC controls (PPS, SLICE_PARAMS, SCALING_MATRIX, DECODE_PARAMS, DECODE_MODE, START_CODE) don't exist in the clone-hdl at all — they cannot be staged.

When libva submits its 5-control S_EXT_CTRLS batch (SPS, PPS, SLICE_PARAMS, SCALING_MATRIX, DECODE_PARAMS), only SPS is registered in the clone-hdl. PPS, SLICE_PARAMS, SCALING_MATRIX, DECODE_PARAMS find no ref → prepare_ext_ctrls returns -EINVAL. (This contradicts iter18 α-22's rc=0 — needs re-investigation of error_idx semantics for the request path; the userspace observation of rc=0 may not reflect the actual kernel error for compound-control lookups in request clones.)
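The error_idx question can be made concrete with a toy userspace model. This is not kernel code: it only mimics an all-or-nothing lookup of a control batch against a handler's ID list, and it assumes the documented VIDIOC_S_EXT_CTRLS convention that a failure in the validation step reports error_idx equal to count. All names here (`model_set_batch`, `clone_ids`, `hevc_batch`) are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define ARRAY_LEN(a) (sizeof(a) / sizeof((a)[0]))

/* IDs registered in the libva clone-hdl, per the iter21 dump. */
static const uint32_t clone_ids[] = {
    0x990a67, 0x990a6b, 0x990b00, 0x990b67, 0x990b68,
    0xa40900, 0xa40901, 0xa40902, 0xa40903, 0xa40904, 0xa40907,
    0xa40a2c, 0xa40a2d, 0xa40a90,
};

/* libva's per-frame batch: SPS, PPS, SLICE_PARAMS, SCALING_MATRIX, DECODE_PARAMS. */
static const uint32_t hevc_batch[] = {
    0xa40a90, 0xa40a91, 0xa40a92, 0xa40a93, 0xa40a94,
};

/* Return 1 if `id` is in the handler's ID list, else 0. */
static int hdl_has_ref(const uint32_t *hdl_ids, size_t nhdl, uint32_t id)
{
    for (size_t i = 0; i < nhdl; i++)
        if (hdl_ids[i] == id)
            return 1;
    return 0;
}

/*
 * Model of the lookup step of a set-controls call.
 * On failure: returns -EINVAL, sets *error_idx to nbatch (nothing was
 * applied) and *first_missing to the index of the first ID with no ref.
 * On success: returns 0 and sets *first_missing to nbatch.
 */
static int model_set_batch(const uint32_t *hdl_ids, size_t nhdl,
                           const uint32_t *batch, size_t nbatch,
                           size_t *error_idx, size_t *first_missing)
{
    assert(hdl_ids && batch && error_idx && first_missing);

    *error_idx = 0;             /* not meaningful on success */
    *first_missing = nbatch;

    for (size_t i = 0; i < nbatch; i++) {
        if (!hdl_has_ref(hdl_ids, nhdl, batch[i])) {
            *first_missing = i;
            *error_idx = nbatch;
            return -EINVAL;
        }
    }
    return 0;
}
```

Under this model the 5-control batch fails at index 1 (HEVC_PPS) while userspace only sees error_idx == 5, so the failing control cannot be identified from error_idx alone. Whether the request path on this kernel behaves the same way is exactly what the re-investigation has to establish.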

iter20's "zero SPS bytes" explained

iter20 showed rkvdec sees sps[0..16]=00..00 for libva. That's because:

  • HEVC_SPS is in the clone-hdl with p_req_valid=1 — so it got STAGED.
  • But the content in req->p_new[SPS] is all-zero.

Two possible reasons for zero content despite p_req_valid=1:

  1. user_to_new ran on a zero-payload from libva. iter15 strace ruled this out — libva's SPS payload is non-zero at ioctl entry.
  2. new_to_req ran, but the copy picked up the wrong data. This is possible if the master/cluster lookup resolves incorrectly on the clone-hdl.

iter22 candidate: add a printk in new_to_req and req_to_new to log the copy: source pointer, dest pointer, first 4 bytes, payload size.

Mechanism status (post-iter21)

| # | Mechanism | Status |
|---|-----------|--------|
| 1 | request_fd mismatch | DISPROVED iter17/18 |
| 2 | REINIT clears | DISPROVED iter19 |
| 3 | Stack-locals stale | DISPROVED iter18 |
| 4 | ctrl_hdl mismatch | REFRAMED iter20 |
| 5 | error_idx silent failure | DISPROVED iter18 (but warrants re-check given iter21 finding) |
| 6 | req->p_new staging incomplete | CONFIRMED iter21: clone-hdl missing controls = staging cannot occur for 6 of 7 HEVC controls |
| 7 | NEW iter21: clone-hdl is missing controls that main_hdl has registered | Root question for iter22 |

Why is the clone incomplete?

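v4l2_ctrl_request_clone(new_hdl, from=main_hdl) iterates main_hdl->ctrl_refs in ID-sorted order. In the HEVC block, HEVC_SPS (0xa40a90) is cloned and nothing from HEVC_PPS (0xa40a91) onward is. The H264 block shows a similar gap at H264_PRED_WEIGHTS (0xa40905) and H264_SLICE_PARAMS (0xa40906), but H264_DECODE_PARAMS (0xa40907) and every later ID up to HEVC_SPS are present, so that gap is a skip rather than a hard stop. In both blocks the first missing control is a compound control that follows already-cloned compound controls.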

Hypothesis: handler_new_ref returns a non-zero error at the first compound control AFTER an SPS-like single-struct compound, but only when called from the request-clone path. Or: kzalloc(sizeof(*new_ref) + size_extra_req) fails for controls with a larger elem_size (HEVC_PPS = 64 bytes, H264_PRED_WEIGHTS = 772 bytes in the mainline uAPI struct definitions; both small, so OOM is unlikely but worth verifying).

Alt hypothesis: handler_new_ref's auto-class-control insertion (v4l2_ctrl_new_std) fails for non-compound HEVC menu controls in request-clone path, which propagates hdl->error and breaks subsequent iterations.

Same kernel succeeds for kdirect on the same from hdl, so something is per-request-bind specific — maybe related to request lifecycle timing in libva (iter6 permanent request_fd at CreateContext) vs kdirect (per-frame request_fd).
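One quick consistency check on the "loop stops" reading: if the clone loop walks the source in ID order and only ever exits via break on the first error, the clone must be a strict prefix of the source list. The sketch below tests that property against the iter21 dumps. It assumes the dumps reflect the full handler contents; `first_divergence` and the array names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ARRAY_LEN(a) (sizeof(a) / sizeof((a)[0]))

/* libva clone-hdl IDs from the iter21 dump (ascending). */
static const uint32_t libva_clone[] = {
    0x990a67, 0x990a6b, 0x990b00, 0x990b67, 0x990b68,
    0xa40900, 0xa40901, 0xa40902, 0xa40903, 0xa40904, 0xa40907,
    0xa40a2c, 0xa40a2d, 0xa40a90,
};

/* Source handler IDs as seen via kdirect (ascending). */
static const uint32_t kdirect_hdl[] = {
    0x990a67, 0x990a6b, 0x990b00, 0x990b67, 0x990b68,
    0xa40900, 0xa40901, 0xa40902, 0xa40903, 0xa40904,
    0xa40905, 0xa40906, 0xa40907,
    0xa40a2c, 0xa40a2d,
    0xa40a90, 0xa40a91, 0xa40a92, 0xa40a93, 0xa40a94, 0xa40a95, 0xa40a96,
};

/*
 * Return the first index at which `clone` and `source` differ.
 * The result equals `nclone` exactly when `clone` is a prefix of `source`,
 * which is what a break-on-first-error loop would produce.
 */
static size_t first_divergence(const uint32_t *clone, size_t nclone,
                               const uint32_t *source, size_t nsource)
{
    size_t i = 0;

    assert(clone != NULL && source != NULL);

    while (i < nclone && i < nsource && clone[i] == source[i])
        i++;
    return i;
}
```

With the dumps above the divergence index is 10, not 14: the source has 0xa40905 where the clone already has 0xa40907. A single break therefore does not explain the H264 gap by itself, which is worth keeping in mind when reading the iter22 per-step output.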

Substrate state at iter21 close

  • Backend SHA on fresnel: c1d4bb53… (iter15 stable, unchanged).
  • Fork tip e109306 (α-24 reverted).
  • Kernel linux-fresnel-fourier 7.0-5 with iter17 + iter20 + iter21 printks. NOT a shipping kernel.
  • 5-codec anchors: unchanged. Zero regression.

iter22 candidate

Add printks to v4l2_ctrl_request_clone and handler_new_ref:

// in v4l2_ctrl_request_clone
pr_info("iter22_clone_start: new_hdl=%p from=%p\n", hdl, from);

// per iteration
err = handler_new_ref(hdl, ctrl, &new_ref, false, true);
pr_info("iter22_clone_step: id=0x%x err=%d from_other=%d\n",
        ctrl->id, err, ref->from_other_dev);
if (err) {
    pr_info("iter22_clone_break: at id=0x%x err=%d hdl_error=%d\n",
            ctrl->id, err, hdl->error);
    break;
}

After 7.0-6 deploys, the libva HEVC run should show exactly which ctrl_id breaks the loop, and with which error code. That localizes the failure to a kzalloc failure, a v4l2_ctrl_new_std failure (auto-class), or some other condition.

Lesson

iter21 overturns the iter11-iter18 hypothesis space entirely. The S_EXT_CTRLS ioctl wire-byte payload analysis was correct: libva's bytes match kdirect's. But at the v4l2_ctrl framework level, libva's request-clone is missing the registered controls libva tries to stage. The bug is in how the V4L2 control framework handles libva's specific request-binding pattern, NOT in libva's ioctl content.

This is the strongest narrowing since iter17. We've gone from "anywhere in kernel" → "kernel control framework" → "request-clone path specifically" → "clone ends right after HEVC_SPS, before the remaining HEVC controls".