iter4 close — second kernel bug: missing HEVC_SLICE_PARAMS registration

Casanova/Collabora v7.0 HEVC series forgot to register
V4L2_CID_STATELESS_HEVC_SLICE_PARAMS in vdpu38x_hevc_ctrl_descs[].
The legacy rkvdec_hevc_ctrl_descs[] (RK3399 path) has it; the new
vdpu381/vdpu383 path doesn't. Every per-frame S_EXT_CTRLS fails
with EINVAL ("cannot find control id 0xa40a92").

Surfaced via dev_debug=0x3f on /sys/class/video4linux/videoN —
prepare_ext_ctrls's "cannot find" dprintk is gated behind
V4L2_DEV_DEBUG_CTRL (bit 0x20), invisible by default.

One-entry patch (6 added lines) mirrors the legacy entry:
SLICE_PARAMS as DYNAMIC_ARRAY, dims={600} (HEVC level >6 max).

Verified on ampere: no EINVAL, no dmesg errors, ffmpeg exit 0,
3-frame NV12 output structurally valid. But output bytes are all
Y=16/Cb=Cr=128 (solid black) — separate downstream bitstream-
feeding bug, deferred to iter5.

Iter5 starts with LIBVA_V4L2_DUMP_OUTPUT to confirm whether the
OUTPUT bitstream is reaching the kernel correctly.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
This commit is contained in:
Markus Fritsche
2026-05-16 11:18:04 +00:00
parent 0a400b6f08
commit 46c956bd51
@@ -0,0 +1,105 @@
# Iter4 close — second kernel bug: missing HEVC_SLICE_PARAMS registration
Date: 2026-05-16 (afternoon, immediately following iter3 close)
Branch: `master`
Substrate: ampere `7.0.0-rc3-devices+` with iter3 fix (ext_sps NULL init) carried in.
Backend: iter3 instrumented build, md5 `404041ea2dcc03c769e0ab8c43ddadd6`, deployed at `/usr/lib/dri/`.
## Bottom line
**The Casanova/Collabora v7.0 HEVC series forgot to register `V4L2_CID_STATELESS_HEVC_SLICE_PARAMS` in the new `vdpu38x_hevc_ctrl_descs[]` table.** The legacy `rkvdec_hevc_ctrl_descs[]` (RK3399 path) has it; the new vdpu381/vdpu383 path doesn't. Result: every per-frame `VIDIOC_S_EXT_CTRLS` returns `-EINVAL` ("cannot find control id 0xa40a92") and userspace falls through to queue requests with no controls committed → decoder runs on zero-init control state → all-zero output (or worse, OOPSes on uninit memory before iter3 fix).
## Falsifier outcome
F1 (kernel rejects 5-ctrl batch with EINVAL): **TRUE pre-patch** — confirmed by enabling `V4L2_DEV_DEBUG_CTRL` (bit 0x20) on `/sys/class/video4linux/videoN/dev_debug`, which surfaced the previously-silent `prepare_ext_ctrls: cannot find control id 0xa40a92` dprintk.
F2 (registering HEVC_SLICE_PARAMS in `vdpu38x_hevc_ctrl_descs` makes the batch accept): **FALSE → TRUE**. A one-entry patch (6 added source lines) eliminated the EINVAL. ffmpeg exits 0, and dmesg is fully clean of `S_EXT_CTRLS: error` and `cannot find control id`. Decoder runs.
F3 (decoder produces non-blank output post-patch): **FALSE**. Output `/tmp/o.nv12` is 4147200 bytes (the correct size for 3 NV12 frames) but contains only Y=16 (luma "video black") and Cb/Cr=128 (chroma neutral), i.e. solid black. The decoder runs, but the bitstream isn't being interpreted. This is the iter5 hand-off bug.
## Root cause (iter4 Phase 6)
`drivers/media/platform/rockchip/rkvdec/rkvdec.c` has two HEVC ctrl_descs arrays:
| array | line | registers SLICE_PARAMS? |
|-------|------|-------------------------|
| `rkvdec_hevc_ctrl_descs[]` (legacy RK3399 path) | 189 | YES — dynamic array, dims={600} |
| `vdpu38x_hevc_ctrl_descs[]` (Casanova RK3588/RK3576 path) | ~240 | **NO** |
Both paths share the same `rkvdec_hevc_run_preamble` (rkvdec-hevc-common.c:478), which calls `v4l2_ctrl_find(&ctx->ctrl_hdl, V4L2_CID_STATELESS_HEVC_SLICE_PARAMS)`. With the new table, the ctrl isn't in the handler, so a userspace `VIDIOC_S_EXT_CTRLS` for this CID fails in `prepare_ext_ctrls` with -EINVAL. The kernel then sets `error_idx = cs->count` (since `set=true`), and the backend sees `error_idx=5, count=5, err=-22`.
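The control ID in the error message can be decoded by hand. The sketch below rebuilds the stateless HEVC control IDs from the base and offsets in `include/uapi/linux/v4l2-controls.h` and maps `0xa40a92` back to its name. The helper names `cid_name` and `failed_in_validation` are illustrative, not part of any kernel or libva API.

```python
# Constants copied from include/uapi/linux/v4l2-controls.h.
V4L2_CTRL_CLASS_CODEC_STATELESS = 0x00A40000
V4L2_CID_CODEC_STATELESS_BASE = V4L2_CTRL_CLASS_CODEC_STATELESS | 0x900

# Stateless HEVC controls sit at BASE + 400 onward.
HEVC_STATELESS_CIDS = {
    V4L2_CID_CODEC_STATELESS_BASE + 400: "V4L2_CID_STATELESS_HEVC_SPS",
    V4L2_CID_CODEC_STATELESS_BASE + 401: "V4L2_CID_STATELESS_HEVC_PPS",
    V4L2_CID_CODEC_STATELESS_BASE + 402: "V4L2_CID_STATELESS_HEVC_SLICE_PARAMS",
    V4L2_CID_CODEC_STATELESS_BASE + 403: "V4L2_CID_STATELESS_HEVC_SCALING_MATRIX",
    V4L2_CID_CODEC_STATELESS_BASE + 404: "V4L2_CID_STATELESS_HEVC_DECODE_PARAMS",
    V4L2_CID_CODEC_STATELESS_BASE + 405: "V4L2_CID_STATELESS_HEVC_DECODE_MODE",
    V4L2_CID_CODEC_STATELESS_BASE + 406: "V4L2_CID_STATELESS_HEVC_START_CODE",
    V4L2_CID_CODEC_STATELESS_BASE + 407: "V4L2_CID_STATELESS_HEVC_ENTRY_POINT_OFFSETS",
}


def cid_name(cid: int) -> str:
    """Map a numeric control ID from a dprintk back to its symbolic name."""
    return HEVC_STATELESS_CIDS.get(cid, f"unknown (0x{cid:08x})")


def failed_in_validation(error_idx: int, count: int) -> bool:
    """error_idx == count means S_EXT_CTRLS failed before applying any control."""
    return error_idx == count


if __name__ == "__main__":
    print(cid_name(0xA40A92))
    print(failed_in_validation(error_idx=5, count=5))
```

Running the same lookup on any other ID from the 5-ctrl batch is a quick way to check which controls a given table actually registers.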
The reason this stayed silent in earlier debugging: dprintks for `cannot find control id 0x%x` are gated behind `V4L2_DEV_DEBUG_CTRL = 0x20`. Default `/sys/.../dev_debug` is 0 — no ctrl-class dprintks. Setting `0x3f` (all 6 bits) on every video device surfaced the lookup failure immediately.
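For reference, the six `dev_debug` bits can be decoded with a few lines. This is a minimal sketch using the `V4L2_DEV_DEBUG_*` values from the kernel's `include/media/v4l2-dev.h`; the function names are illustrative only.

```python
# Bit values of the V4L2_DEV_DEBUG_* flags (include/media/v4l2-dev.h).
V4L2_DEV_DEBUG_BITS = {
    0x01: "IOCTL",
    0x02: "IOCTL_ARG",
    0x04: "FOP",
    0x08: "STREAMING",
    0x10: "POLL",
    0x20: "CTRL",
}


def decode_dev_debug(mask: int) -> list[str]:
    """Return the names of the debug classes enabled by a dev_debug mask."""
    return [name for bit, name in V4L2_DEV_DEBUG_BITS.items() if mask & bit]


def shows_ctrl_lookup_failures(mask: int) -> bool:
    """True if the mask enables the ctrl-class dprintks (bit 0x20)."""
    return bool(mask & 0x20)


if __name__ == "__main__":
    for mask in (0x00, 0x20, 0x3F):
        print(f"0x{mask:02x}: {decode_dev_debug(mask)}")
```

Setting only `0x20` would have been enough to surface the lookup failure; `0x3f` also enables the noisier ioctl, fop, streaming, and poll classes.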
## Minimal kernel patch (verified working)
```diff
--- a/drivers/media/platform/rockchip/rkvdec/rkvdec.c
+++ b/drivers/media/platform/rockchip/rkvdec/rkvdec.c
@@ -242,6 +242,12 @@ static const struct rkvdec_ctrl_desc vdpu38x_hevc_ctrl_descs[] = {
 	{
 		.cfg.id = V4L2_CID_STATELESS_HEVC_DECODE_PARAMS,
 	},
+	{
+		.cfg.id = V4L2_CID_STATELESS_HEVC_SLICE_PARAMS,
+		.cfg.flags = V4L2_CTRL_FLAG_DYNAMIC_ARRAY,
+		.cfg.type = V4L2_CTRL_TYPE_HEVC_SLICE_PARAMS,
+		.cfg.dims = { 600 },
+	},
 	{
 		.cfg.id = V4L2_CID_STATELESS_HEVC_SPS,
 		.cfg.ops = &rkvdec_ctrl_ops,
```
Mirrors the legacy `rkvdec_hevc_ctrl_descs[]` entry. `600` is the absolute maximum number of slices per frame for HEVC level > 6 (matches `visl` and legacy rkvdec).
## Verification (on-target empirical)
```
$ ssh ampere 'sudo dmesg --clear; LIBVA_DRIVER_NAME=v4l2_request ffmpeg -hwaccel vaapi -hwaccel_output_format vaapi -i bbb_60s_720p.hevc.mp4 -vf hwdownload,format=nv12 -frames:v 3 -f rawvideo -pix_fmt nv12 /tmp/o.nv12; echo exit=$?'
exit=0
$ ssh ampere 'sudo dmesg | grep -E "rkvdec|cannot find|S_EXT_CTRLS: error"'
[Sat May 16 13:17:08 2026] rkvdec fdc40100.video-codec: missing multi-core support, ignoring this instance
$ ssh ampere 'ls -la /tmp/o.nv12; md5sum /tmp/o.nv12; head -c 4147200 /tmp/o.nv12 | od -An -tu1 -w1 | sort -u'
-rw-r--r-- 1 mfritsche mfritsche 4147200 May 16 13:17 /tmp/o.nv12
25ae521379343783da65b1fc80b1e8e8 /tmp/o.nv12
16
128
```
No dmesg errors. No EINVAL. ffmpeg exit 0. 3-frame NV12 output (correct size). All bytes are 16 (Y) or 128 (Cb/Cr) — solid black, but a structurally-valid decode (no OOPS, no truncation).
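The size and blank-frame checks can also be done per plane rather than over the whole file. The sketch below assumes a 1280×720 stream (inferred from the `720p` file name and the 4147200-byte output, not stated explicitly in the log); the function names are illustrative.

```python
def nv12_frame_size(width: int, height: int) -> int:
    """NV12 is a full-resolution Y plane plus an interleaved CbCr plane at half size."""
    return width * height * 3 // 2


def plane_values(frame: bytes, width: int, height: int) -> tuple[set[int], set[int]]:
    """Return the distinct byte values found in the Y plane and in the CbCr plane."""
    y_size = width * height
    return set(frame[:y_size]), set(frame[y_size:])


def is_video_black(frame: bytes, width: int, height: int) -> bool:
    """True if the frame is limited-range black: Y=16 everywhere, Cb=Cr=128."""
    luma, chroma = plane_values(frame, width, height)
    return luma == {16} and chroma == {128}


if __name__ == "__main__":
    width, height, frames = 1280, 720, 3  # assumed geometry, see lead-in
    print(frames * nv12_frame_size(width, height))  # expected file size in bytes
    blank = bytes([16]) * (width * height) + bytes([128]) * (width * height // 2)
    print(is_video_black(blank, width, height))
```

Checking the planes separately distinguishes "decoder wrote nothing" from cases where only luma or only chroma is constant, which the `sort -u` over the whole file cannot do.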
## Why output is still black (deferred to iter5)
Possible causes for iter5 to investigate:
1. **OUTPUT bitstream not reaching hardware** — backend assembles slice NALs into `source_data`, but maybe slices_size or QBUF length is wrong → hardware reads empty buffer → produces blank frame.
2. **Slice header field mismatch** — backend's `h265_fill_slice_params` may put bit_size/data_byte_offset/slice_segment_addr in fields the kernel doesn't expect. Strace shows `bit_size=0x1038` (519 bytes), `data_byte_offset=17` — plausible but unverified against the actual NAL.
3. **start_code prefix handling** — backend prepends Annex-B `00 00 00 01` when `h264_start_code=true`. For HEVC under DECODE_MODE_FRAME_BASED + START_CODE_ANNEX_B (both registered in vdpu38x_hevc_ctrl_descs), this should match — but the iter2 backend used `h264_start_code` as a profile-independent flag (per `feedback_unconditional_codec_state`); verify it gates correctly for HEVC.
4. **DECODE_PARAMS dpb/poc fields** — for IDR frame 1, dpb should be empty (num_active_dpb_entries=0), num_poc_st_curr_before/after/lt_curr=0. If backend sets non-zero, kernel may interpret as needing references that don't exist.
iter5 starts by enabling `LIBVA_V4L2_DUMP_OUTPUT=<dir>` to capture the per-frame OUTPUT bitstream bytes, then diffing them against the input HEVC stream's raw NALs to confirm the bitstream is being forwarded correctly. From there, branch into (2)/(3)/(4) depending on findings.
## Phase 6 question completion (iter4)
| Q | Answer |
|---|--------|
| Q1 — empirical: validate_sps fires per-frame? | NO — fires twice (CreateContext dummy + rkvdec_hevc_start), NOT per-frame. Rules out validate_sps as the EINVAL source. |
| Q2a/b — which check fails | Neither validate_sps nor validate_new. Failure is in `prepare_ext_ctrls`'s `find_ref_lock` for `0xa40a92` (HEVC_SLICE_PARAMS) which isn't registered. |
| Q3 — request-API extra steps | Not the issue. The clone path replicates whichever ctrls are registered in master, so missing SLICE_PARAMS propagates. |
| Q4 — st_rps_bits field mapping | Not relevant to this iteration — iter4's bug is upstream of EXT_SPS_*_RPS handling. iter5 may revisit. |
## Substrate state at close
- Backend `.so`: unchanged (md5 `404041ea2dcc03c769e0ab8c43ddadd6`)
- Kernel module: includes both iter3 fix (`run->ext_sps_st_rps/lt_rps = NULL` in preamble) AND iter4 fix (HEVC_SLICE_PARAMS registered in vdpu38x_hevc_ctrl_descs)
- diagnostic `pr_warn` from iter4 Phase 6 still present in `rkvdec_hevc_validate_sps` — harmless, fires twice per session
- Both kernel fixes need filing as separate kernel-agent issues against Casanova v7.0 series (iter3 → kernel-agent#14 (filed); iter4 → kernel-agent#15 (TBD))
- diagnostic `0x3f` on `/sys/.../dev_debug` should be reset to 0 for production (`echo 0 | sudo tee /sys/class/video4linux/video*/dev_debug`)
## Iter4 takeaway
The 8-phase loop's Phase 6 question-driven instrumentation (Q1 empirical validate_sps trace) worked again: pr_warn falsified the assumed culprit immediately, redirecting attention to the dprintk-gated `prepare_ext_ctrls: cannot find control id` log that revealed the actual missing registration. Total iter4 wall-clock: ~30 min from Phase 0 lock-in to verified fix.
Iter5 picks up the "decoder runs but output is solid black" downstream bug.