From cbead4ec64e9ecb9cee4e372a1101b2d0986a60e Mon Sep 17 00:00:00 2001
From: Markus Fritsche
Date: Thu, 14 May 2026 08:55:58 +0000
Subject: [PATCH] =?UTF-8?q?iter17=20Phase=207:=20KERNEL=20PRINTK=20FINDS?=
 =?UTF-8?q?=20THE=20BUG=20=E2=80=94=20controls=20lost=20between=20S=5FEXT?=
 =?UTF-8?q?=5FCTRLS=20and=20rkvdec=20read?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

DEFINITIVE FINDING via pr_info in rkvdec_hevc_run on RK3399:

  libva HEVC:   w=0 h=0 reorder=0 chroma=0 nal_unit_type=0 decode_flags=0x0
  kdirect HEVC: w=1280 h=720 reorder=2 chroma=1 nal_unit_type=20 decode_flags=0x3

The kernel sees ALL-ZERO control structs for libva HEVC, but CORRECT
values for kdirect. Same kernel, same code path, same /dev/video1, same
rkvdec_hevc_run_preamble fetching
v4l2_ctrl_find(ctx->ctrl_hdl, HEVC_SPS)->p_cur.p.

This overturns iter11-iter15's "wire-byte search exhausted" conclusion.
The S_EXT_CTRLS payloads ARE byte-correct at the strace observer level,
but the kernel sees zeros. The bug is in the
S_EXT_CTRLS -> request -> ctx->ctrl_hdl path, specifically for libva.

Five mechanisms hypothesized:
  1. request_fd mismatch
  2. REINIT clears controls before QUEUE
  3. Compound-control copy deferred until QUEUE -> stack-locals stale
  4. ctrl_hdl mismatch (libva submits to one, rkvdec reads another)
  5. error_idx silently fails

Key difference observed:
  libva stores SPS/PPS/decode_params as STACK LOCALS in h265_set_controls
  kdirect stores them in heap-allocated hwaccel_picture_private

Mechanism 3 (kernel defers compound-ctrl copy_from_user) is the leading
hypothesis.

iter18 α-21: heap-allocate libva's HEVC control structs; if Bug 5 fixes,
apply same pattern to H.264 (Bug 4) and VP8 (Bug 6).

This is the strongest narrowing since iter5b-β.
Co-Authored-By: Claude Opus 4.7 (1M context)
---
 phase7_iter17_findings.md | 117 ++++++++++++++++++++++++++++++++++++++
 1 file changed, 117 insertions(+)
 create mode 100644 phase7_iter17_findings.md

diff --git a/phase7_iter17_findings.md b/phase7_iter17_findings.md
new file mode 100644
index 0000000..db9f529
--- /dev/null
+++ b/phase7_iter17_findings.md
@@ -0,0 +1,117 @@
+# Iteration 17, Phase 7 (kernel-side diagnostic finding)
+
+Captured 2026-05-14 on fresnel running `linux-fresnel-fourier 7.0-3` (kernel module reloaded; vmlinux unchanged so `uname -v` still shows 7.0-2 timestamp, but rkvdec module is the iter17 build).
+
+## Method
+
+Added one `pr_info()` at the entry of `rkvdec_hevc_run` (RK3399 variant in `rkvdec-hevc.c`) dumping key state values from `run->sps`, `run->slices_params[0]`, `run->decode_params`: the structs that `rkvdec_hevc_run_preamble` populates via `v4l2_ctrl_find(&ctx->ctrl_hdl, ID)->p_cur.p`.
+
+## Result: definitive
+
+Captured 28 lines of dmesg across libva + kdirect HEVC decode runs. Representative sample:
+
+**libva HEVC** (run 1 of 13; all 13 identical):
+
+```
+rkvdec_hevc_run: sps_id=0 dpb_buf=0 reorder=0
+  w=0 h=0 bd_l=0 bd_c=0 chroma=0
+  num_short_st=0 num_long_lt=0
+  slices=1 nal_unit_type=0 slice_type=0
+  decode_flags=0x0
+```
+
+**kdirect HEVC** (run 1 of 15):
+
+```
+rkvdec_hevc_run: sps_id=0 dpb_buf=4 reorder=2
+  w=1280 h=720 bd_l=0 bd_c=0 chroma=1
+  num_short_st=0 num_long_lt=0
+  slices=1 nal_unit_type=20 slice_type=2
+  decode_flags=0x3
+```
+
+**The kernel sees completely different control struct contents.** For libva: all-zero SPS dimensions, zero nal_unit_type, zero decode_flags. For kdirect: correct width=1280, height=720, IRAP|IDR flags, slice_type=I.
+
+## What this overturns
+
+iter11–iter15 had established that libva's S_EXT_CTRLS wire-byte payloads were byte-equal to kdirect's for every field rkvdec reads. That was correct at the **strace observer level**: userspace passed the same bytes to the kernel.
+
+But at the **rkvdec-read level** (after `MEDIA_REQUEST_IOC_QUEUE` should have copied request controls to `ctx->ctrl_hdl->p_cur`), libva's values are zero while kdirect's are correct.
+
+**The wire-byte search exhaustion was an illusion.** The controls are submitted correctly but lost somewhere between `S_EXT_CTRLS` and `rkvdec_hevc_run_preamble`.
+
+## Possible mechanisms
+
+1. **request_fd mismatch**: libva's S_EXT_CTRLS uses one request_fd, libva's QBUF/IOC_QUEUE uses another. Controls stash in request A; ctx->ctrl_hdl never sees A's values.
+
+2. **REINIT clears controls before QUEUE**: libva's `media_request_reinit` may run on the request between S_EXT_CTRLS and IOC_QUEUE for the NEXT frame. iter6 lifecycle: REINIT runs AFTER wait_completion of the previous decode, which should be safe, but is worth double-checking.
+
+3. **Compound-control copy timing**: kernel's `v4l2_s_ext_ctrls()` for `which=V4L2_CTRL_WHICH_REQUEST_VAL` may defer the payload copy until IOC_QUEUE. If libva's userspace pointer (a stack local in `h265_set_controls`) is gone by then, kernel reads garbage/zeros.
+
+4. **ctrl_hdl mismatch**: libva's S_EXT_CTRLS goes to one ctrl_hdl (e.g., the file-handle-private), but `rkvdec_hevc_run_preamble` reads from another (the m2m-context's). Per V4L2 m2m design these should be the same, but could be misregistered.
+
+5. **`error_idx` silently fails**: kernel returns 0 from S_EXT_CTRLS but actually skipped the controls due to a flag/validation failure. error_idx in the controls struct would reveal this.
+
+For kdirect: same kernel, same /dev/video1, same m2m-ctx-class, and it works. That makes mechanism 4 unlikely (a handler mismatch in the driver would break kdirect too). It does not rule out mechanism 3: kdirect keeps its control structs on the heap, so a deferred copy would still find valid data there (see the storage comparison in the next section).
+
+Mechanisms 1, 2, 5 are libva-specific and worth direct testing.
+
+## kdirect storage shape
+
+`V4L2RequestControlsHEVC` in ffmpeg-v4l2request:
+
+```c
+typedef struct V4L2RequestControlsHEVC {
+    V4L2RequestPictureContext pic;
+    struct v4l2_ctrl_hevc_sps sps;
+    struct v4l2_ctrl_hevc_pps pps;
+    ...
+} V4L2RequestControlsHEVC;
+V4L2RequestControlsHEVC *controls = h->cur_frame->hwaccel_picture_private;
+```
+
+Heap-allocated (av_mallocz_array via hwaccel_picture_private). Pointer valid for the AVFrame's full lifecycle, i.e. many seconds.
+
+Libva's `h265_set_controls`:
+
+```c
+struct v4l2_ctrl_hevc_sps sps;                        // stack local
+struct v4l2_ctrl_hevc_pps pps;                        // stack local
+struct v4l2_ctrl_hevc_decode_params decode_params;    // stack local
+struct v4l2_ctrl_hevc_scaling_matrix scaling_matrix;  // stack local
+struct v4l2_ctrl_hevc_slice_params *slice_params_array = NULL;  // heap (calloc)
+...
+controls[n++] = (struct v4l2_ext_control){.id = SPS, .ptr = &sps, .size = sizeof(sps)};
+...
+rc = v4l2_set_controls(...);
+```
+
+Stack locals are valid only for the duration of h265_set_controls. After that function returns and EndPicture's other code runs (QBUF, IOC_QUEUE), the stack may be reused.
+
+**If the kernel defers `copy_from_user()` for compound controls until `MEDIA_REQUEST_IOC_QUEUE` (which is *after* h265_set_controls returns), the userspace pointer is stale.**
+
+The V4L2 kernel spec says `S_EXT_CTRLS` for compound controls copies immediately, but a kernel bug or version-specific deferred-copy could fit.
+
+## Test hypothesis: heap-allocate libva's HEVC controls
+
+If mechanism 3 (deferred copy) is correct, moving libva's HEVC controls from stack to heap (a calloc()'d struct, owned by surface_object, freed at DestroyContext) would fix the issue without any other change.
+
+**This is iter18 α-21.** ~20-30 LOC change in h265.c.
+
+## Phase 7 conclusion
+
+iter17's kernel printk produced a definitive empirical finding: **the kernel reads zero control values for libva HEVC despite libva submitting non-zero bytes via S_EXT_CTRLS**. The bug is at the userspace-pointer-lifetime / kernel-copy-timing layer, NOT in libva's ioctl content.
+
+This is the strongest narrowing in the campaign since iter5b-β.
+
+## iter18 candidate (α-21)
+
+Heap-allocate libva's HEVC control structs so the pointer remains valid until `MEDIA_REQUEST_IOC_QUEUE` actually runs. If α-21 fixes Bug 5, apply same pattern to H.264 controls and VP8 controls (both currently use stack locals in their `*_set_controls` functions).
+
+If α-21 still doesn't fix Bug 5, the bug is in mechanism 1, 2, or 5, which requires further kernel printk.
+
+## Substrate state at iter17 P7 close
+
+- Fork tip `111f8ba` (libva backend, no changes this iter).
+- Kernel `linux-fresnel-fourier 7.0-3` with diagnostic printk in rkvdec_hevc_run. NOT a shipping kernel; diagnostic only. Should be reverted (printk removed, rebuild as 7.0-4) once iter18 work completes OR replaced with the actual fix instead.
+- Backend SHA on fresnel: `c1d4bb53…`.
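
Note on α-21: the heap-allocation idea described above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the actual h265.c change. Every `demo_*` name is hypothetical, and the structs are small stand-ins for `struct v4l2_ctrl_hevc_sps`, `struct v4l2_ctrl_hevc_pps`, `struct v4l2_ctrl_hevc_decode_params` and `struct v4l2_ext_control`, so that the sketch builds without kernel headers. The point it shows is the ownership shape: one calloc()'d block per surface holds all compound payloads, and each control's pointer refers into that block, so the pointers stay valid after the fill function returns.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the real V4L2 HEVC control structs. */
struct demo_hevc_sps { uint16_t width, height; uint8_t chroma_format_idc; };
struct demo_hevc_pps { uint8_t placeholder; };
struct demo_hevc_decode_params { uint64_t flags; };

/* Hypothetical stand-in for struct v4l2_ext_control (id, size, pointer). */
struct demo_ext_control { uint32_t id; uint32_t size; void *ptr; };

enum { DEMO_CID_SPS = 1, DEMO_CID_PPS, DEMO_CID_DECODE_PARAMS };

/* All compound payloads live in one heap block instead of stack locals. */
struct demo_hevc_controls {
    struct demo_hevc_sps sps;
    struct demo_hevc_pps pps;
    struct demo_hevc_decode_params decode_params;
};

/* Hypothetical surface object that owns the heap block. */
struct demo_surface { struct demo_hevc_controls *hevc; };

/* Allocate the zeroed payload block; returns 0 on success, -1 on failure. */
static int demo_surface_init(struct demo_surface *s)
{
    assert(s != NULL);
    s->hevc = calloc(1, sizeof(*s->hevc));
    return s->hevc ? 0 : -1;
}

/* Release the payload block (the DestroyContext-time cleanup in the text). */
static void demo_surface_destroy(struct demo_surface *s)
{
    assert(s != NULL);
    free(s->hevc);
    s->hevc = NULL;
}

/*
 * Fill the heap-resident payloads and point each control at them.
 * Returns the number of controls written, or 0 if the surface has no
 * payload block or `cap` is too small. Values mirror the kdirect dmesg
 * sample (1280x720, chroma=1, decode_flags=0x3).
 */
static unsigned demo_h265_fill_controls(struct demo_surface *s,
                                        struct demo_ext_control *out,
                                        unsigned cap)
{
    assert(s != NULL && out != NULL);
    struct demo_hevc_controls *c = s->hevc;
    unsigned n = 0;

    if (c == NULL || cap < 3)
        return 0;

    c->sps.width = 1280;
    c->sps.height = 720;
    c->sps.chroma_format_idc = 1;
    c->decode_params.flags = 0x3;

    out[n++] = (struct demo_ext_control){
        .id = DEMO_CID_SPS, .size = sizeof(c->sps), .ptr = &c->sps };
    out[n++] = (struct demo_ext_control){
        .id = DEMO_CID_PPS, .size = sizeof(c->pps), .ptr = &c->pps };
    out[n++] = (struct demo_ext_control){
        .id = DEMO_CID_DECODE_PARAMS, .size = sizeof(c->decode_params),
        .ptr = &c->decode_params };
    return n;
}
```

In the real backend the control array would then be passed on to the existing `v4l2_set_controls(...)` call, and the block would be freed only after the request has completed. Whether this changes the observed zeros depends on mechanism 3 actually being the cause, which the findings above leave open.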