0a400b6f08
iter3 1-line kernel fix eliminated the OOPS. Now diagnosing why the 5-control batch (SPS PPS SLICE_PARAMS SCALING_MATRIX DECODE_PARAMS) returns EINVAL with error_idx=count=5 → all-zero output. Locked-in evidence: control sizes match kernel-expected elem_size for every CID. SPS values strace-decoded to (chroma=1, bit_depth=0, 1280x720) all pass validate_sps numerically. coded_fmt from S_FMT trace is 1280x720 S265. validate_new dprintk doesn't fire despite DEV_DEBUG_CTRL=0x20 set → rejection is silent inside try_or_set_cluster's try_ctrl path (rkvdec_hevc_validate_sps). Phase 1 starts with empirical: pr_warn at validate_sps entry to confirm/refute. Instrumented module already built + ready to load post-reboot. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Phase 0 — iter4 substrate (HEVC per-frame S_EXT_CTRLS EINVAL)
Opened 2026-05-16 afternoon, immediately following iter3 close. Entry conditions are concrete; this Phase 0 is brief.
Research question
Which of the five per-frame HEVC controls (SPS, PPS, SLICE_PARAMS, SCALING_MATRIX, DECODE_PARAMS) causes the kernel to reject the entire VIDIOC_S_EXT_CTRLS batch with -EINVAL, and what's the minimal backend or kernel fix to make the batch commit?
Locked-in evidence carried from iter2 + iter3
| Observation | Source | Status |
|---|---|---|
| iter3 1-line kernel patch (`run->ext_sps_st_rps = NULL` in preamble) eliminates the prior OOPS | iter3 verification, kernel-agent#14 comment 623 | confirmed |
| ffmpeg exit 0, no `Internal error` in dmesg, `/tmp/o.nv12` = 4147200 bytes (exact NV12 3-frame size) | iter3 close + iter4 reproducer | confirmed |
| Output bytes are all zero: the decoder fills CAPTURE buffers from zero-initialized control state because no controls commit | iter3 close + iter4 strace | confirmed |
| Per-frame batch: 5 controls in order SPS(40) PPS(64) SLICE_PARAMS(280) SCALING_MATRIX(1000) DECODE_PARAMS(328) | iter4 strace + dmesg | confirmed |
| Backend size fields match kernel-expected `elem_size` for every CID | gcc-compiled `sizeof()` against vendored UAPI headers | confirmed |
| Kernel returns `error_idx=count=5` with -22 (EINVAL) | dmesg `VIDIOC_S_EXT_CTRLS: error -22` | confirmed |
| `which=0xf010000` = `V4L2_CTRL_WHICH_REQUEST_VAL` | strace + dmesg | confirmed |
| No `failed to validate control NAME` dprintk fires despite `dev_debug=0x3f` (bit 5 = `V4L2_DEV_DEBUG_CTRL` is set) | dmesg | confirmed |
| → The EINVAL is therefore NOT from `validate_new` (`type_ops->validate`); it's silent inside `try_or_set_cluster`'s `try_ctrl` path | source-read `v4l2-ctrls-api.c:680-694` | derived |
| Only `try_ctrl` op for HEVC controls is `rkvdec_hevc_try_ctrl` → `rkvdec_hevc_validate_sps` (PPS/SLICE/SM/DP `try_ctrl` returns 0) | `rkvdec-vdpu381-hevc.c:625` | confirmed |
| `rkvdec_hevc_validate_sps` returns -EINVAL for: chroma!=1, bit-depth mismatch, bit-depth ∉{0,2}, sps_width > coded_fmt.width OR sps_height > coded_fmt.height | `rkvdec-vdpu381-hevc.c:515` | confirmed |
| Strace-decoded SPS: chroma_format_idc=1, bit_depth_luma_minus8=0, bit_depth_chroma_minus8=0, pic_width=1280, pic_height=720, flags=0x188 | iter4 strace S_EXT_CTRLS body | confirmed |
| Strace-decoded coded_fmt (from S_FMT logs): width=1280, height=720, format=S265 | dmesg `VIDIOC_S_FMT` | confirmed |
| → All four `validate_sps` checks numerically PASS for the observed values | source + data | derived (suspicious; bug must be elsewhere) |
| iter2 dummy-SPS pre-seed (`context.c:235`) submits a synthetic HEVC_SPS during `RequestCreateContext` before `cap_pool_init`, to avoid the EBUSY-on-fmt-change cluster bug | iter2 source | reusable; still firing pre-frame |
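The two numeric claims in the table (the SPS passing all four checks, and the 4147200-byte output size) can be re-derived with a few lines. The sketch below is a userspace model only: the check list is transcribed from the table row describing `rkvdec_hevc_validate_sps`, not from the kernel source, and the `Sps` class, `failed_sps_checks`, and `nv12_bytes` are hypothetical helper names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class Sps:
    # Only the fields the table says validate_sps looks at.
    chroma_format_idc: int
    bit_depth_luma_minus8: int
    bit_depth_chroma_minus8: int
    pic_width: int
    pic_height: int


def failed_sps_checks(sps, coded_width, coded_height):
    """Return the names of the checks that would reject this SPS (empty = pass)."""
    failed = []
    if sps.chroma_format_idc != 1:
        failed.append("chroma")
    if sps.bit_depth_luma_minus8 != sps.bit_depth_chroma_minus8:
        failed.append("bit_depth_mismatch")
    if sps.bit_depth_luma_minus8 not in (0, 2):
        failed.append("bit_depth_unsupported")
    if sps.pic_width > coded_width or sps.pic_height > coded_height:
        failed.append("dimensions")
    return failed


def nv12_bytes(width, height, frames=1):
    # NV12 is 8-bit 4:2:0: a full-resolution luma plane plus a half-height
    # interleaved CbCr plane, i.e. 1.5 bytes per pixel.
    return width * height * 3 // 2 * frames


# Values as decoded from strace (SPS) and the S_FMT logs (coded_fmt).
observed = Sps(1, 0, 0, 1280, 720)
print(failed_sps_checks(observed, 1280, 720))  # []
print(nv12_bytes(1280, 720, frames=3))         # 4147200
```

An empty list for the observed values is what makes the "derived" row suspicious: if this model matches the driver, the rejection has to come from somewhere other than these four comparisons, or the kernel is seeing different values than strace shows.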
Substrate
- Kernel: ampere `7.0.0-rc3-devices+` with iter3 1-line patch (`rkvdec_hevc_run_preamble` initializes ext_sps fields to NULL).
- Backend: `libva-v4l2-request-fourier` iter3 instrumented build (md5 `404041ea2dcc03c769e0ab8c43ddadd6`) deployed at `/usr/lib/dri/v4l2_request_drv_video.so` on ampere.
- Build host: ampere itself (boltzmann tree is on incompatible branch per iter3 close).
- Diagnostic toggles available: `echo 0x3f > /sys/class/video4linux/videoN/dev_debug` enables V4L2 ioctl + ctrl dprintks.
- Module-deploy: `make M=drivers/.../rkvdec modules` + scp + `sudo install` to `/lib/modules/<rel>/kernel/...` + depmod + reboot (rmmod blocked when prior decode wedged a thread in D-state holding the refcount).
- Reboot caveat: kernel-agent#13 (black-screen on reboot) hit once during iter3; recovered after waiting.
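To make explicit what the `0x3f` written to `dev_debug` turns on, the mask can be decoded bit by bit. This is a small sketch: the bit-to-name mapping is written from memory of the V4L2 core's `V4L2_DEV_DEBUG_*` definitions and should be checked against the kernel tree in use; `decode_dev_debug` is a hypothetical helper name.

```python
# Assumed V4L2_DEV_DEBUG_* bit assignments; verify against the running kernel's headers.
V4L2_DEV_DEBUG_BITS = {
    0x01: "IOCTL",
    0x02: "IOCTL_ARG",
    0x04: "FOP",
    0x08: "STREAMING",
    0x10: "POLL",
    0x20: "CTRL",
}


def decode_dev_debug(mask):
    """List the debug categories enabled by a dev_debug mask, lowest bit first."""
    return [name for bit, name in V4L2_DEV_DEBUG_BITS.items() if mask & bit]


print(decode_dev_debug(0x3f))
# ['IOCTL', 'IOCTL_ARG', 'FOP', 'STREAMING', 'POLL', 'CTRL']
```

Under this mapping, `0x3f` includes the `CTRL` bit (`0x20`), so a missing control-validation dprintk cannot be explained by the toggle being off.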
Open questions tabled for Phase 1
- Empirical Q1: Add `pr_warn` at entry of `rkvdec_hevc_validate_sps` (already done in iter4 instrumented build) and confirm whether it fires per-frame and whether any of the four checks reports unexpected values different from what strace shows the backend submitting. If pr_warn fires and all values pass, validate_sps returns 0 and the EINVAL is elsewhere. If pr_warn does not fire, the rejection is upstream (e.g. `prepare_ext_ctrls`).
- Q2 (depends on Q1 outcome; branching):
  - 2a if validate_sps DOES fail one check: the SPS the backend submits is wrong; fix backend's `h265_fill_sps` to match what kernel expects (probably some flag bit or bit-depth interpretation).
  - 2b if validate_sps doesn't fail: the EINVAL is upstream, likely in `prepare_ext_ctrls` (size mismatch, already ruled out, or unknown CID) OR per-control loop checks before `user_to_new` (read-only, grabbed). Need to instrument `try_set_ext_ctrls_common` to dump where ret first becomes non-zero.
- Q3: For the request-API path specifically (`try_set_ext_ctrls_request`), there's an extra `media_request_object_find` + `v4l2_ctrl_request_clone` step. Could one of those fail per-frame? Check via `pr_warn` in those helpers.
- Q4 (orthogonal): per `feedback_va_st_rps_bits_is_slice_field`, the iter2 backend mapped `short_term_ref_pic_set_size` into `decode_params.short_term_ref_pic_set_size` (later reverted in iter37 to slice_params). Verify the iter2/iter3 build has this correctly placed in slice_params, not decode_params. Decoded slice_params from strace shows the fields explicitly.
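The Q1 to Q2 branching above reduces to a three-way decision on two observations: whether the `pr_warn` fired, and which checks (if any) it reported as failing. A minimal sketch of that decision, with `triage` as a hypothetical helper name and the return strings paraphrasing the list items:

```python
def triage(pr_warn_fired, failed_checks):
    """Map the Q1 observation onto the next step named in the open questions."""
    if not pr_warn_fired:
        # validate_sps was never reached for this batch.
        return "upstream of validate_sps (e.g. prepare_ext_ctrls)"
    if failed_checks:
        return "2a: fix the backend's SPS fill for " + ", ".join(failed_checks)
    return "2b: instrument try_set_ext_ctrls_common to find where ret first becomes non-zero"


print(triage(True, []))
# 2b: instrument try_set_ext_ctrls_common to find where ret first becomes non-zero
```

Q3 and Q4 sit outside this decision: Q3 only becomes relevant once the rejection is placed outside validate_sps, and Q4 is independent of the EINVAL altogether.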
Phase 0 close
Substrate locked. Iter3 fix carried in. iter4 instrumented kernel built + scp'd, ready to load post-reboot. Q1 (empirical validate_sps trace) is the cheap gating question; Phase 1 starts there.