Markus Fritsche 46c956bd51 iter4 close — second kernel bug: missing HEVC_SLICE_PARAMS registration
Casanova/Collabora v7.0 HEVC series forgot to register
V4L2_CID_STATELESS_HEVC_SLICE_PARAMS in vdpu38x_hevc_ctrl_descs[].
The legacy rkvdec_hevc_ctrl_descs[] (RK3399 path) has it; the new
vdpu381/vdpu383 path doesn't. Every per-frame S_EXT_CTRLS fails
with EINVAL ("cannot find control id 0xa40a92").

Surfaced by writing 0x3f to /sys/class/video4linux/videoN/dev_debug:
the "cannot find" dprintk in prepare_ext_ctrls is gated behind
V4L2_DEV_DEBUG_CTRL (bit 0x20) and is invisible by default.

A one-entry patch (6 added lines with formatting) mirrors the legacy
entry: SLICE_PARAMS as DYNAMIC_ARRAY, dims={600} (HEVC level 6+ max).

Verified on ampere: no EINVAL, no dmesg errors, ffmpeg exit 0,
3-frame NV12 output structurally valid. However, the output bytes are
all Y=16/Cb=Cr=128 (solid black): a separate downstream
bitstream-feeding bug, deferred to iter5.

Iter5 starts with LIBVA_V4L2_DUMP_OUTPUT to confirm whether the
OUTPUT bitstream is reaching the kernel correctly.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-16 11:18:04 +00:00


Iter4 close — second kernel bug: missing HEVC_SLICE_PARAMS registration

Date: 2026-05-16 (afternoon, immediately following iter3 close)
Branch: master
Substrate: ampere 7.0.0-rc3-devices+ with iter3 fix (ext_sps NULL init) carried in
Backend: iter3 instrumented build, md5 404041ea2dcc03c769e0ab8c43ddadd6, deployed at /usr/lib/dri/

Bottom line

The Casanova/Collabora v7.0 HEVC series forgot to register V4L2_CID_STATELESS_HEVC_SLICE_PARAMS in the new vdpu38x_hevc_ctrl_descs[] table. The legacy rkvdec_hevc_ctrl_descs[] (RK3399 path) has it; the new vdpu381/vdpu383 path doesn't. Result: every per-frame VIDIOC_S_EXT_CTRLS returns -EINVAL ("cannot find control id 0xa40a92"), and userspace falls through to queueing requests with no controls committed → the decoder runs on zero-initialized control state → all-zero output (or, before the iter3 fix, an OOPS on uninitialized memory).

Falsifier outcome

F1 (kernel rejects the 5-control batch with EINVAL): TRUE pre-patch. Confirmed by enabling V4L2_DEV_DEBUG_CTRL (bit 0x20) in /sys/class/video4linux/videoN/dev_debug, which surfaced the previously silent "prepare_ext_ctrls: cannot find control id 0xa40a92" dprintk.
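The ID in that dprintk can be decoded by hand. Stateless-codec controls live in class 0x00a40000 with base 0xa40900, and the HEVC controls start at base + 400. That makes 0xa40a92 equal to base + 402, which is V4L2_CID_STATELESS_HEVC_SLICE_PARAMS. Below is a small sketch of the decoding. The constants mirror include/uapi/linux/v4l2-controls.h. The helper names ctrl_class and hevc_ctrl_name are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Values mirrored from include/uapi/linux/v4l2-controls.h. */
#define V4L2_CTRL_CLASS_CODEC_STATELESS 0x00a40000u
#define V4L2_CID_CODEC_STATELESS_BASE (V4L2_CTRL_CLASS_CODEC_STATELESS | 0x900u)
#define V4L2_CID_STATELESS_HEVC_SPS (V4L2_CID_CODEC_STATELESS_BASE + 400u)
#define V4L2_CID_STATELESS_HEVC_SLICE_PARAMS (V4L2_CID_CODEC_STATELESS_BASE + 402u)

/* Bits 16..27 of a control ID hold its class (same mask as V4L2_CTRL_ID2CLASS). */
static uint32_t ctrl_class(uint32_t id)
{
    return id & 0x0fff0000u;
}

/* Hypothetical helper: name the HEVC stateless controls SPS (+400) through
 * START_CODE (+406); returns NULL for any other ID. */
static const char *hevc_ctrl_name(uint32_t id)
{
    static const char *const names[] = {
        "HEVC_SPS", "HEVC_PPS", "HEVC_SLICE_PARAMS", "HEVC_SCALING_MATRIX",
        "HEVC_DECODE_PARAMS", "HEVC_DECODE_MODE", "HEVC_START_CODE",
    };
    if (id < V4L2_CID_STATELESS_HEVC_SPS)
        return NULL;
    uint32_t off = id - V4L2_CID_STATELESS_HEVC_SPS;
    return off < sizeof(names) / sizeof(names[0]) ? names[off] : NULL;
}
```

Calling hevc_ctrl_name(0xa40a92) yields "HEVC_SLICE_PARAMS", and ctrl_class(0xa40a92) yields the stateless-codec class.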

F2 (registering HEVC_SLICE_PARAMS in vdpu38x_hevc_ctrl_descs makes the kernel accept the batch): FALSE → TRUE. A one-entry patch (6 added source lines with formatting) eliminated the EINVAL. ffmpeg exits 0, and dmesg is fully clean of "S_EXT_CTRLS: error" and "cannot find control id". The decoder runs.

F3 (decoder produces non-blank output post-patch): FALSE. Output /tmp/o.nv12 is 4147200 bytes (the correct size for 3 NV12 frames) but contains only Y=16 (luma "video black") and Cb/Cr=128 (neutral chroma), i.e. solid black. The decoder runs, but the bitstream isn't being interpreted. This is the iter5 hand-off bug.

Root cause (iter4 Phase 6)

drivers/media/platform/rockchip/rkvdec/rkvdec.c has two HEVC ctrl_descs arrays:

array                                                    | line | registers SLICE_PARAMS?
rkvdec_hevc_ctrl_descs[] (legacy RK3399 path)            | 189  | YES (dynamic array, dims={600})
vdpu38x_hevc_ctrl_descs[] (Casanova RK3588/RK3576 path)  | ~240 | NO

Both paths reach the same rkvdec_hevc_run_preamble (rkvdec-hevc-common.c:478), which calls v4l2_ctrl_find(&ctx->ctrl_hdl, V4L2_CID_STATELESS_HEVC_SLICE_PARAMS). With the new table, the control is never added to the handler. Userspace's VIDIOC_S_EXT_CTRLS for this CID therefore fails in prepare_ext_ctrls → returns -EINVAL → the kernel sets error_idx = cs->count (because set=true) → the backend sees error_idx=5, count=5, err=-22.
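The error_idx=5, count=5 pattern matches what the V4L2 documentation for VIDIOC_S_EXT_CTRLS describes for a failure in the up-front validation step. In that case error_idx is set to count, so it does not identify the failing control. Below is a minimal sketch of that interpretation. The enum and the function name are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical classification of a failed VIDIOC_S_EXT_CTRLS, following the
 * error_idx rules in the V4L2 extended-controls documentation. */
enum ext_ctrls_failure {
    EXT_CTRLS_FAIL_VALIDATION, /* error_idx == count: rejected up front, nothing applied */
    EXT_CTRLS_FAIL_AT_INDEX,   /* error_idx < count: controls [0, error_idx) were applied */
    EXT_CTRLS_FAIL_BAD_INDEX,  /* error_idx > count: not expected from the kernel */
};

static enum ext_ctrls_failure classify_s_ext_ctrls(uint32_t error_idx, uint32_t count)
{
    if (error_idx == count)
        return EXT_CTRLS_FAIL_VALIDATION;
    if (error_idx < count)
        return EXT_CTRLS_FAIL_AT_INDEX;
    return EXT_CTRLS_FAIL_BAD_INDEX;
}
```

Per the same documentation, VIDIOC_TRY_EXT_CTRLS leaves error_idx at the control that failed validation. Repeating the same 5-control list with TRY should therefore point at the SLICE_PARAMS entry directly, without needing dev_debug.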

Why this stayed silent in earlier debugging: the "cannot find control id 0x%x" dprintk is gated behind V4L2_DEV_DEBUG_CTRL = 0x20. The default /sys/.../dev_debug value is 0, so no control-class dprintks are emitted. Setting 0x3f (all 6 bits) on every video device surfaced the lookup failure immediately.
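For reference, here is a sketch of the dev_debug bits and of setting the mask from a program. The bit values mirror the kernel-internal include/media/v4l2-ioctl.h. write_dev_debug is a hypothetical helper. Only V4L2_DEV_DEBUG_CTRL (0x20) is needed for the control-lookup dprintks, so 0x3f enables more than strictly required.

```c
#include <assert.h>
#include <stdio.h>

/* Bit values mirrored from the kernel-internal include/media/v4l2-ioctl.h. */
#define V4L2_DEV_DEBUG_IOCTL     0x01 /* ioctl name + error code */
#define V4L2_DEV_DEBUG_IOCTL_ARG 0x02 /* ioctl arguments as well */
#define V4L2_DEV_DEBUG_FOP       0x04 /* open/release/mmap file operations */
#define V4L2_DEV_DEBUG_STREAMING 0x08 /* read/write and VIDIOC_(D)QBUF */
#define V4L2_DEV_DEBUG_POLL      0x10 /* poll() */
#define V4L2_DEV_DEBUG_CTRL      0x20 /* control framework dprintks */

/* Hypothetical helper: write `mask` in decimal to a dev_debug attribute such
 * as /sys/class/video4linux/video0/dev_debug (root needed for real sysfs).
 * Returns 0 on success, -1 on any I/O error. */
static int write_dev_debug(const char *path, unsigned int mask)
{
    FILE *f = fopen(path, "w");
    if (f == NULL)
        return -1;
    int ok = fprintf(f, "%u\n", mask) > 0;
    if (fclose(f) != 0)
        ok = 0;
    return ok ? 0 : -1;
}
```

Writing 0 back with the same helper restores the production default.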

Minimal kernel patch (verified working)

--- a/drivers/media/platform/rockchip/rkvdec/rkvdec.c
+++ b/drivers/media/platform/rockchip/rkvdec/rkvdec.c
@@ -242,6 +242,12 @@ static const struct rkvdec_ctrl_desc vdpu38x_hevc_ctrl_descs[] = {
        {
                .cfg.id = V4L2_CID_STATELESS_HEVC_DECODE_PARAMS,
        },
+       {
+               .cfg.id = V4L2_CID_STATELESS_HEVC_SLICE_PARAMS,
+               .cfg.flags = V4L2_CTRL_FLAG_DYNAMIC_ARRAY,
+               .cfg.type = V4L2_CTRL_TYPE_HEVC_SLICE_PARAMS,
+               .cfg.dims = { 600 },
+       },
        {
                .cfg.id = V4L2_CID_STATELESS_HEVC_SPS,
                .cfg.ops = &rkvdec_ctrl_ops,

Mirror of the legacy rkvdec_hevc_ctrl_descs[] entry. 600 is the maximum number of slice segments per picture allowed at HEVC level 6 and above (the same value visl and legacy rkvdec use).
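The bug class here is a set difference between two static tables: a control present in the legacy table but absent from the new one. Below is a sketch of a hypothetical audit helper that reports such gaps. The CID lists in the test are illustrative and are not transcribed from rkvdec.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical audit helper: count the IDs in `ref` that are absent from
 * `cand`, storing up to `cap` of them in `missing`. O(n*m) is fine for
 * control tables of a dozen entries. */
static size_t missing_ctrl_ids(const uint32_t *ref, size_t nref,
                               const uint32_t *cand, size_t ncand,
                               uint32_t *missing, size_t cap)
{
    size_t n = 0;
    for (size_t i = 0; i < nref; i++) {
        int found = 0;
        for (size_t j = 0; j < ncand && !found; j++)
            found = (ref[i] == cand[j]);
        if (!found) {
            if (n < cap)
                missing[n] = ref[i];
            n++;
        }
    }
    return n;
}
```

Suppose the helper were fed the CIDs of rkvdec_hevc_ctrl_descs[] as ref and those of vdpu38x_hevc_ctrl_descs[] as cand. A pre-patch run should then have listed 0xa40a92, alongside any intentional differences between the two hardware paths.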

Verification (on-target empirical)

$ ssh ampere 'sudo dmesg --clear; LIBVA_DRIVER_NAME=v4l2_request ffmpeg -hwaccel vaapi -hwaccel_output_format vaapi -i bbb_60s_720p.hevc.mp4 -vf hwdownload,format=nv12 -frames:v 3 -f rawvideo -pix_fmt nv12 /tmp/o.nv12; echo exit=$?'
exit=0

$ ssh ampere 'sudo dmesg | grep -E "rkvdec|cannot find|S_EXT_CTRLS: error"'
[Sat May 16 13:17:08 2026] rkvdec fdc40100.video-codec: missing multi-core support, ignoring this instance

$ ssh ampere 'ls -la /tmp/o.nv12; md5sum /tmp/o.nv12; head -c 4147200 /tmp/o.nv12 | od -An -tu1 -w1 | sort -u'
-rw-r--r-- 1 mfritsche mfritsche 4147200 May 16 13:17 /tmp/o.nv12
25ae521379343783da65b1fc80b1e8e8  /tmp/o.nv12
  16
 128

No dmesg errors: the only matching line is the multi-core notice above. No EINVAL. ffmpeg exit 0. 3-frame NV12 output of the correct size. Every byte is 16 or 128, i.e. solid black (Y=16, Cb/Cr=128), but the decode is structurally valid (no OOPS, no truncation).
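The sort -u histogram shows that only the values 16 and 128 occur, but not which plane holds which. A plane-wise check is sketched below. It assumes bbb_60s_720p is 1280×720 and that ffmpeg's rawvideo output is tightly packed NV12: a w×h Y plane followed by a w×h/2 interleaved CbCr plane per frame. Under that assumption one frame is 1280 × 720 × 3/2 = 1382400 bytes, and three frames are 4147200 bytes, which matches the file size. The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bytes in one tightly packed NV12 frame (even width and height assumed). */
static size_t nv12_frame_bytes(size_t w, size_t h)
{
    return w * h * 3 / 2;
}

static int all_equal(const uint8_t *p, size_t n, uint8_t v)
{
    for (size_t i = 0; i < n; i++)
        if (p[i] != v)
            return 0;
    return 1;
}

/* Returns 1 if `len` is a non-zero whole number of frames and every frame has
 * Y == 16 and Cb == Cr == 128 (limited-range black), otherwise 0. */
static int nv12_is_solid_black(const uint8_t *buf, size_t len, size_t w, size_t h)
{
    size_t frame = nv12_frame_bytes(w, h), luma = w * h;
    if (frame == 0 || len == 0 || len % frame != 0)
        return 0;
    for (size_t off = 0; off < len; off += frame) {
        if (!all_equal(buf + off, luma, 16) ||
            !all_equal(buf + off + luma, frame - luma, 128))
            return 0;
    }
    return 1;
}
```

To check the real output, read /tmp/o.nv12 into memory and call nv12_is_solid_black(buf, len, 1280, 720).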

Why output is still black (deferred to iter5)

Possible causes for iter5 to investigate:

  1. OUTPUT bitstream not reaching hardware: the backend assembles slice NALs into source_data, but slices_size or the QBUF length may be wrong, so the hardware reads an empty buffer and produces a blank frame.
  2. Slice header field mismatch: the backend's h265_fill_slice_params may put bit_size/data_byte_offset/slice_segment_addr in fields the kernel doesn't expect. strace shows bit_size=0x1038 (4152 bits = 519 bytes) and data_byte_offset=17: plausible, but unverified against the actual NAL.
  3. Start-code prefix handling: the backend prepends the Annex-B start code 00 00 00 01 when h264_start_code=true. For HEVC under DECODE_MODE_FRAME_BASED + START_CODE_ANNEX_B (both registered in vdpu38x_hevc_ctrl_descs), this should match. However, the iter2 backend used h264_start_code as a profile-independent flag (per feedback_unconditional_codec_state), so verify that it gates correctly for HEVC.
  4. DECODE_PARAMS dpb/poc fields: for frame 1 (IDR), the DPB should be empty (num_active_dpb_entries=0), and num_poc_st_curr_before/after/lt_curr should all be 0. If the backend sets non-zero values, the kernel may interpret them as references that don't exist.
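Candidates (2) and (4) reduce to simple per-frame invariants that can be checked against captured control values. The sketch below assumes the backend intends bit_size to be the forwarded slice size in bits. Whether that size includes the start code is exactly question (3), so the caller decides. The structs are hypothetical mirrors of a few fields of v4l2_ctrl_hevc_slice_params and v4l2_ctrl_hevc_decode_params, not the uapi layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mirror of a few slice-params fields plus the size of the slice
 * as actually forwarded (not part of the uapi struct). */
struct slice_check {
    uint32_t bit_size;         /* slice data size in bits */
    uint32_t data_byte_offset; /* byte offset to the slice data, past the header */
    size_t nal_bytes;          /* bytes forwarded for this slice */
};

/* Hypothetical mirror of the decode-params reference counters. */
struct idr_dpb_check {
    uint8_t num_active_dpb_entries;
    uint8_t num_poc_st_curr_before;
    uint8_t num_poc_st_curr_after;
    uint8_t num_poc_lt_curr;
};

/* 1 if bit_size covers exactly the forwarded bytes and the data offset lies
 * inside the slice, else 0. */
static int slice_sizes_consistent(const struct slice_check *s)
{
    return (uint64_t)s->bit_size == 8u * (uint64_t)s->nal_bytes &&
           s->data_byte_offset < s->nal_bytes;
}

/* An IDR picture references nothing, so all reference counts must be zero. */
static int idr_dpb_empty(const struct idr_dpb_check *d)
{
    return d->num_active_dpb_entries == 0 && d->num_poc_st_curr_before == 0 &&
           d->num_poc_st_curr_after == 0 && d->num_poc_lt_curr == 0;
}
```

With the strace values, bit_size=0x1038 gives 4152 / 8 = 519 bytes, and data_byte_offset=17 falls inside that. The open question is whether 519 matches the NAL actually queued.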

Iter5 starts by setting LIBVA_V4L2_DUMP_OUTPUT=<dir> to capture the per-frame OUTPUT bitstream bytes, then diffing them against the input HEVC stream's raw NALs to confirm the bitstream is forwarded correctly. From there, branch into (2), (3) or (4) depending on the findings.
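For the iter5 diff, comparing NAL sequences is more robust than a raw byte diff, because start-code length (00 00 01 vs 00 00 00 01) may differ between the input and the dump. Below is a sketch of an Annex-B scanner that lists HEVC nal_unit_type values. It assumes the LIBVA_V4L2_DUMP_OUTPUT files contain the raw OUTPUT buffer bytes with Annex-B start codes; the dump format is not confirmed here. The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Find the next Annex-B start code (00 00 01; a 4-byte 00 00 00 01 is found
 * via its last three bytes) at or after `pos`. Returns the offset of the NAL
 * header byte that follows it, or `len` if there is none. */
static size_t next_nal(const uint8_t *buf, size_t len, size_t pos)
{
    for (size_t i = pos; i + 3 <= len; i++) {
        if (buf[i] == 0 && buf[i + 1] == 0 && buf[i + 2] == 1)
            return i + 3;
    }
    return len;
}

/* First HEVC NAL header byte: forbidden_zero_bit(1) | nal_unit_type(6) | ... */
static unsigned hevc_nal_type(uint8_t first_header_byte)
{
    return (first_header_byte >> 1) & 0x3f;
}

/* Store the nal_unit_type of each NAL (up to `cap`) and return the NAL count. */
static size_t list_nal_types(const uint8_t *buf, size_t len, unsigned *types, size_t cap)
{
    size_t n = 0;
    for (size_t p = next_nal(buf, len, 0); p < len; p = next_nal(buf, len, p)) {
        if (n < cap)
            types[n] = hevc_nal_type(buf[p]);
        n++;
    }
    return n;
}
```

The input's NALs can be extracted as Annex-B with `ffmpeg -i bbb_60s_720p.hevc.mp4 -c:v copy -bsf:v hevc_mp4toannexb -f hevc in.hevc`. Item (1) says the backend assembles only slice NALs into source_data. Parameter-set types (32 to 34) should therefore appear in the input but not in the dump, while the slice-type sequences should match.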

Phase 6 question completion (iter4)

Q | Answer
Q1 (empirical): does validate_sps fire per frame? | NO. It fires twice (CreateContext dummy + rkvdec_hevc_start), not per frame. This rules out validate_sps as the EINVAL source.
Q2a/b: which check fails? | Neither validate_sps nor validate_new. The failure is in prepare_ext_ctrls's find_ref_lock for 0xa40a92 (HEVC_SLICE_PARAMS), which isn't registered.
Q3: extra request-API steps? | Not the issue. The clone path replicates whichever ctrls are registered in master, so the missing SLICE_PARAMS propagates.
Q4: st_rps_bits field mapping | Not relevant to this iteration: iter4's bug is upstream of EXT_SPS_*_RPS handling. Iter5 may revisit.

Substrate state at close

  • Backend .so: unchanged (md5 404041ea2dcc03c769e0ab8c43ddadd6)
  • Kernel module: includes both iter3 fix (run->ext_sps_st_rps/lt_rps = NULL in preamble) AND iter4 fix (HEVC_SLICE_PARAMS registered in vdpu38x_hevc_ctrl_descs)
  • Diagnostic pr_warn from iter4 Phase 6 is still present in rkvdec_hevc_validate_sps: harmless, fires twice per session
  • Both kernel fixes need filing as separate kernel-agent issues against the Casanova v7.0 series: iter3 → kernel-agent#14 (filed); iter4 → kernel-agent#15 (TBD)
  • The diagnostic 0x3f in /sys/.../dev_debug should be reset to 0 for production (echo 0 | sudo tee /sys/class/video4linux/video*/dev_debug)

Iter4 takeaway

The 8-phase loop's Phase 6 question-driven instrumentation (Q1 empirical validate_sps trace) worked again. The pr_warn immediately falsified the assumed culprit, redirecting attention to the dprintk-gated "prepare_ext_ctrls: cannot find control id" log, which revealed the actual missing registration. Total iter4 wall-clock: ~30 min from Phase 0 lock-in to verified fix.

Iter5 picks up the "decoder runs but output is solid black" downstream bug.