ampere-kernel-decoders/phase0_findings_iter4.md
Markus Fritsche 0a400b6f08 iter4 Phase 0: HEVC per-frame S_EXT_CTRLS EINVAL substrate
iter3 1-line kernel fix eliminated the OOPS. Now diagnosing why
the 5-control batch (SPS PPS SLICE_PARAMS SCALING_MATRIX
DECODE_PARAMS) returns EINVAL with error_idx=count=5 → all-zero
output.

Locked-in evidence: control sizes match kernel-expected elem_size
for every CID. SPS values strace-decoded to (chroma=1, bit_depth=0,
1280x720) all pass validate_sps numerically. coded_fmt from S_FMT
trace is 1280x720 S265. validate_new dprintk doesn't fire despite
DEV_DEBUG_CTRL=0x20 set → rejection is silent inside
try_or_set_cluster's try_ctrl path (rkvdec_hevc_validate_sps).

Phase 1 starts with empirical: pr_warn at validate_sps entry to
confirm/refute. Instrumented module already built + ready to
load post-reboot.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-16 11:14:41 +00:00


Phase 0 — iter4 substrate (HEVC per-frame S_EXT_CTRLS EINVAL)

Opened 2026-05-16 afternoon, immediately following iter3 close. Entry conditions are concrete; this Phase 0 is brief.

Research question

Which of the five per-frame HEVC controls (SPS, PPS, SLICE_PARAMS, SCALING_MATRIX, DECODE_PARAMS) causes the kernel to reject the entire VIDIOC_S_EXT_CTRLS batch with -EINVAL, and what's the minimal backend or kernel fix to make the batch commit?

Locked-in evidence carried from iter2 + iter3

| Observation | Source | Status |
|---|---|---|
| iter3 1-line kernel patch (run->ext_sps_st_rps = NULL in preamble) eliminates the prior OOPS | iter3 verification, kernel-agent#14 comment 623 | confirmed |
| ffmpeg exit 0, no "Internal error" in dmesg, /tmp/o.nv12 = 4147200 bytes (exact NV12 3-frame size) | iter3 close + iter4 reproducer | confirmed |
| Output bytes are all zero: the decoder fills CAPTURE buffers from zero-initialized control state because no controls commit | iter3 close + iter4 strace | confirmed |
| Per-frame batch: 5 controls in order SPS(40) PPS(64) SLICE_PARAMS(280) SCALING_MATRIX(1000) DECODE_PARAMS(328) | iter4 strace + dmesg | confirmed |
| Backend size fields match kernel-expected elem_size for every CID | gcc-compiled sizeof() against vendored UAPI headers | confirmed |
| Kernel returns error_idx=count=5 with -22 (EINVAL) | dmesg VIDIOC_S_EXT_CTRLS: error -22 | confirmed |
| which=0xf010000 = V4L2_CTRL_WHICH_REQUEST_VAL | strace + dmesg | confirmed |
| No "failed to validate control NAME" dprintk fires despite dev_debug=0x3f (bit 5 = V4L2_DEV_DEBUG_CTRL is set) | dmesg | confirmed |
| → The EINVAL is therefore NOT from validate_new (type_ops->validate); it is raised silently inside try_or_set_cluster's try_ctrl path | source-read v4l2-ctrls-api.c:680-694 | derived |
| Only try_ctrl op for HEVC controls is rkvdec_hevc_try_ctrl → rkvdec_hevc_validate_sps (PPS/SLICE/SM/DP try_ctrl returns 0) | rkvdec-vdpu381-hevc.c:625 | confirmed |
| rkvdec_hevc_validate_sps returns -EINVAL for: chroma!=1, bit-depth mismatch, bit-depth ∉{0,2}, sps_width > coded_fmt.width OR sps_height > coded_fmt.height | rkvdec-vdpu381-hevc.c:515 | confirmed |
| Strace-decoded SPS: chroma_format_idc=1, bit_depth_luma_minus8=0, bit_depth_chroma_minus8=0, pic_width=1280, pic_height=720, flags=0x188 | iter4 strace S_EXT_CTRLS body | confirmed |
| coded_fmt decoded from S_FMT logs: width=1280, height=720, format=S265 | dmesg VIDIOC_S_FMT | confirmed |
| → All four validate_sps checks numerically PASS for the observed values | source + data | derived (suspicious; bug must be elsewhere) |
| iter2 dummy-SPS pre-seed (context.c:235) submits a synthetic HEVC_SPS during RequestCreateContext before cap_pool_init, to avoid the EBUSY-on-fmt-change cluster bug | iter2 source | reusable; still firing pre-frame |
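
The output-size and all-zero observations above can be re-checked offline with a few lines of Python. This is a minimal sketch, not part of the reproducer: the helper names nv12_frame_bytes and check_output are hypothetical, and it assumes 8-bit NV12 with even dimensions (one full-resolution luma plane plus one interleaved CbCr plane at half the size).

```python
def nv12_frame_bytes(width: int, height: int) -> int:
    """Bytes in one 8-bit NV12 frame (assumes even width and height)."""
    # Luma plane is width*height; interleaved CbCr plane adds width*height/2.
    return width * height * 3 // 2


def check_output(data: bytes, width: int, height: int, frames: int) -> dict:
    """Compare a raw dump against the expected size and test for all-zero content."""
    expected = nv12_frame_bytes(width, height) * frames
    return {
        "expected_bytes": expected,
        "size_ok": len(data) == expected,
        "all_zero": not any(data),  # True only if every byte is 0x00
    }
```

For the 1280x720, 3-frame case the expected size works out to 1382400 * 3 = 4147200 bytes, matching the table. Feeding the contents of /tmp/o.nv12 to check_output(data, 1280, 720, 3) should report size_ok and all_zero both true while the bug is present.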

Substrate

  • Kernel: ampere 7.0.0-rc3-devices+ with iter3 1-line patch (rkvdec_hevc_run_preamble initializes ext_sps fields to NULL).
  • Backend: libva-v4l2-request-fourier iter3 instrumented build (md5 404041ea2dcc03c769e0ab8c43ddadd6) deployed at /usr/lib/dri/v4l2_request_drv_video.so on ampere.
  • Build host: ampere itself (boltzmann tree is on incompatible branch per iter3 close).
  • Diagnostic toggles available: echo 0x3f > /sys/class/video4linux/videoN/dev_debug enables V4L2 ioctl + ctrl dprintks.
  • Module-deploy: make M=drivers/.../rkvdec modules + scp + sudo install to /lib/modules/<rel>/kernel/... + depmod + reboot (rmmod blocked when prior decode wedged a thread in D-state holding the refcount).
  • Reboot caveat: kernel-agent#13 (black-screen on reboot) hit once during iter3 — recovered after waiting.
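
To make the dev_debug toggle easier to read, the mask can be decoded into the V4L2 debug categories it enables. The sketch below is illustrative only: the bit values are the V4L2_DEV_DEBUG_* definitions as I recall them from the kernel's media headers and should be checked against the tree actually in use, and decode_dev_debug is a hypothetical helper name.

```python
# Assumed V4L2_DEV_DEBUG_* bit values; verify against the kernel tree in use.
V4L2_DEV_DEBUG_BITS = {
    0x01: "IOCTL",
    0x02: "IOCTL_ARG",
    0x04: "FOP",
    0x08: "STREAMING",
    0x10: "POLL",
    0x20: "CTRL",
}


def decode_dev_debug(mask: int) -> list:
    """Return the names of the debug categories enabled by a dev_debug mask."""
    return [name for bit, name in sorted(V4L2_DEV_DEBUG_BITS.items()) if mask & bit]


print(decode_dev_debug(0x3F))
```

Under these assumptions 0x3f enables all six categories, including CTRL (bit 5, 0x20), which is the one the control-validation dprintks depend on.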

Open questions tabled for Phase 1

  1. Empirical Q1: Add pr_warn at entry of rkvdec_hevc_validate_sps (already done in iter4 instrumented build) — confirm whether it fires per-frame and whether any of the four checks reports unexpected values different from what strace shows the backend submitting. If pr_warn fires and all values pass, validate_sps returns 0 and the EINVAL is elsewhere. If pr_warn does not fire, the rejection is upstream (e.g. prepare_ext_ctrls).
  2. Q2 (depends on Q1 outcome — branching):
    • 2a if validate_sps DOES fail one check: the SPS the backend submits is wrong; fix backend's h265_fill_sps to match what kernel expects (probably some flag bit or bit-depth interpretation).
    • 2b if validate_sps doesn't fail: the EINVAL is upstream — likely in prepare_ext_ctrls (size mismatch — already ruled out — or unknown CID) OR per-control loop checks before user_to_new (read-only, grabbed). Need to instrument try_set_ext_ctrls_common to dump where ret first becomes non-zero.
  3. Q3: For the request-API path specifically (try_set_ext_ctrls_request), there's an extra media_request_object_find + v4l2_ctrl_request_clone step. Could one of those fail per-frame? Check via pr_warn in those helpers.
  4. Q4 (orthogonal): per feedback_va_st_rps_bits_is_slice_field, the iter2 backend mapped short_term_ref_pic_set_size into decode_params.short_term_ref_pic_set_size (later reverted in iter3 to slice_params). Verify the iter2/iter3 build has this correctly placed in slice_params, not decode_params. Decoded slice_params from strace shows the fields explicitly.
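
The four validate_sps checks that Q1 refers to can be modelled in userspace to compare against whatever the pr_warn trace reports. This is a sketch under stated assumptions, not the kernel code: the Sps class and failed_sps_checks are hypothetical names, the field names follow the strace labels in the evidence table, and "bit-depth mismatch" is assumed to mean luma depth differing from chroma depth.

```python
from dataclasses import dataclass


@dataclass
class Sps:
    chroma_format_idc: int
    bit_depth_luma_minus8: int
    bit_depth_chroma_minus8: int
    pic_width: int
    pic_height: int


def failed_sps_checks(sps: Sps, coded_width: int, coded_height: int) -> list:
    """Return a description of each modelled check that would yield -EINVAL."""
    failures = []
    if sps.chroma_format_idc != 1:
        failures.append("chroma_format_idc != 1")
    # Assumption: "mismatch" compares luma depth against chroma depth.
    if sps.bit_depth_luma_minus8 != sps.bit_depth_chroma_minus8:
        failures.append("luma/chroma bit-depth mismatch")
    if sps.bit_depth_luma_minus8 not in (0, 2):
        failures.append("bit depth not in {0, 2}")
    if sps.pic_width > coded_width or sps.pic_height > coded_height:
        failures.append("SPS size exceeds coded_fmt")
    return failures


# Values decoded from the iter4 strace, checked against the 1280x720 coded_fmt.
observed = Sps(1, 0, 0, 1280, 720)
print(failed_sps_checks(observed, 1280, 720))  # prints []
```

An empty list for the strace-decoded values is the userspace counterpart of the "all four checks pass" row. If the pr_warn trace shows values that make this model return a non-empty list, the kernel is seeing a different SPS than the one strace captured.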

Phase 0 close

Substrate locked. iter3 fix carried in. iter4 instrumented kernel built + scp'd, ready to load post-reboot. Q1 (empirical validate_sps trace) is the cheap gating question; Phase 1 starts there.