fresnel-fourier/phase7_iter17_findings.md
marfrit cbead4ec64 iter17 Phase 7: KERNEL PRINTK FINDS THE BUG — controls lost between S_EXT_CTRLS and rkvdec read
DEFINITIVE FINDING via pr_info in rkvdec_hevc_run on RK3399:

libva HEVC:    w=0 h=0 reorder=0 chroma=0 nal_unit_type=0 decode_flags=0x0
kdirect HEVC:  w=1280 h=720 reorder=2 chroma=1 nal_unit_type=20 decode_flags=0x3

The kernel sees ALL-ZERO control structs for libva HEVC, but CORRECT values
for kdirect. Same kernel, same code path, same /dev/video1, same
rkvdec_hevc_run_preamble fetching v4l2_ctrl_find(ctx->ctrl_hdl,
HEVC_SPS)->p_cur.p.

This overturns iter11-iter15's "wire-byte search exhausted" conclusion.
The S_EXT_CTRLS payloads ARE byte-correct at the strace observer level,
but the kernel sees zeros. The bug is in the
S_EXT_CTRLS -> request -> ctx->ctrl_hdl path, specifically for libva.

Five mechanisms hypothesized:
  1. request_fd mismatch
  2. REINIT clears controls before QUEUE
  3. Compound-control copy deferred until QUEUE -> stack-locals stale
  4. ctrl_hdl mismatch (libva submits to one, rkvdec reads another)
  5. error_idx silently fails

Key difference observed:
  libva stores SPS/PPS/decode_params as STACK LOCALS in h265_set_controls
  kdirect stores them in heap-allocated hwaccel_picture_private

Mechanism 3 (kernel defers compound-ctrl copy_from_user) is the leading
hypothesis. iter18 α-21: heap-allocate libva's HEVC control structs;
if Bug 5 fixes, apply same pattern to H.264 (Bug 4) and VP8 (Bug 6).

This is the strongest narrowing since iter5b-β.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 08:55:58 +00:00


Iteration 17 — Phase 7 (kernel-side diagnostic finding)

Captured 2026-05-14 on fresnel running linux-fresnel-fourier 7.0-3 (kernel module reloaded; vmlinux unchanged so uname -v still shows 7.0-2 timestamp, but rkvdec module is the iter17 build).

Method

Added one pr_info() at the entry of rkvdec_hevc_run (RK3399 variant in rkvdec-hevc.c) dumping key state values from run->sps, run->slices_params[0], run->decode_params — the structs that rkvdec_hevc_run_preamble populates via v4l2_ctrl_find(&ctx->ctrl_hdl, ID)->p_cur.p.

Result — definitive

Captured 28 lines of dmesg across libva + kdirect HEVC decode runs. Representative sample:

libva HEVC (run 1 of 13 — all 13 identical):

rkvdec_hevc_run: sps_id=0 dpb_buf=0 reorder=0
                 w=0 h=0 bd_l=0 bd_c=0 chroma=0
                 num_short_st=0 num_long_lt=0
                 slices=1 nal_unit_type=0 slice_type=0
                 decode_flags=0x0

kdirect HEVC (run 1 of 15):

rkvdec_hevc_run: sps_id=0 dpb_buf=4 reorder=2
                 w=1280 h=720 bd_l=0 bd_c=0 chroma=1
                 num_short_st=0 num_long_lt=0
                 slices=1 nal_unit_type=20 slice_type=2
                 decode_flags=0x3

The kernel sees completely different control struct contents. For libva: all-zero SPS dimensions, zero nal_unit_type, zero decode_flags. For kdirect: correct width=1280, height=720, IRAP|IDR flags, slice_type=I.
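The flag and type values in the two dumps can be decoded with a few lines of C. This is a minimal sketch, not code from the rkvdec driver: the helper names are hypothetical, and the constants are redefined locally (mirroring the HEVC stateless uAPI values in linux/v4l2-controls.h) so the snippet builds without kernel headers.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Local copies of the uAPI decode_params flag bits (assumed to match
 * V4L2_HEVC_DECODE_PARAM_FLAG_* in linux/v4l2-controls.h). */
#define HEVC_DECODE_PARAM_FLAG_IRAP_PIC                0x1u
#define HEVC_DECODE_PARAM_FLAG_IDR_PIC                 0x2u
#define HEVC_DECODE_PARAM_FLAG_NO_OUTPUT_OF_PRIOR_PICS 0x4u

/* HEVC slice_type numbering: B=0, P=1, I=2. */
static const char *hevc_slice_type_name(unsigned int slice_type)
{
    switch (slice_type) {
    case 0: return "B";
    case 1: return "P";
    case 2: return "I";
    default: return "?";
    }
}

/* HEVC NAL unit types 19 (IDR_W_RADL) and 20 (IDR_N_LP) are IDR pictures. */
static int hevc_nal_is_idr(unsigned int nal_unit_type)
{
    return nal_unit_type == 19 || nal_unit_type == 20;
}

/* Hypothetical helper: render decode_flags as "A|B", or "none" if zero.
 * The caller supplies a buffer of at least 40 bytes. */
static const char *hevc_decode_flags_str(unsigned int flags, char *buf, size_t len)
{
    static const struct { unsigned int bit; const char *name; } names[] = {
        { HEVC_DECODE_PARAM_FLAG_IRAP_PIC, "IRAP" },
        { HEVC_DECODE_PARAM_FLAG_IDR_PIC, "IDR" },
        { HEVC_DECODE_PARAM_FLAG_NO_OUTPUT_OF_PRIOR_PICS, "NO_OUTPUT_OF_PRIOR_PICS" },
    };

    assert(buf != NULL && len >= 40);
    buf[0] = '\0';
    for (size_t i = 0; i < sizeof(names) / sizeof(names[0]); i++) {
        if (!(flags & names[i].bit))
            continue;
        if (buf[0] != '\0')
            strcat(buf, "|");
        strcat(buf, names[i].name);
    }
    if (buf[0] == '\0')
        strcpy(buf, "none");
    return buf;
}
```

Applied to the kdirect line (decode_flags=0x3, slice_type=2, nal_unit_type=20) the helpers give "IRAP|IDR", "I", and an IDR NAL type. Applied to the libva line they give "none", "B", and a non-IDR NAL type, which is what an all-zero struct looks like rather than a plausible first frame.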

What this overturns

iter11-iter15 had established that libva's S_EXT_CTRLS wire-byte payloads were byte-equal to kdirect's for every field rkvdec reads. That was correct at the strace observer level: userspace passed the same bytes to the kernel.

But at the rkvdec-read level (after MEDIA_REQUEST_IOC_QUEUE should have copied request controls to ctx->ctrl_hdl->p_cur), libva's values are zero while kdirect's are correct.

The wire-byte search exhaustion was an illusion. The controls are submitted correctly but lost somewhere between S_EXT_CTRLS and rkvdec_hevc_run_preamble.

Possible mechanisms

  1. request_fd mismatch: libva's S_EXT_CTRLS uses one request_fd, libva's QBUF/IOC_QUEUE uses another. The controls are stashed in request A, but a different request is queued, so ctx->ctrl_hdl never sees A's values.

  2. REINIT clears controls before QUEUE: libva's media_request_reinit may run on the request between S_EXT_CTRLS and IOC_QUEUE for the NEXT frame. iter6 lifecycle: REINIT runs AFTER wait_completion of the previous decode — should be safe, but worth double-checking.

  3. Compound-control copy timing: kernel's v4l2_s_ext_ctrls() for which=V4L2_CTRL_WHICH_REQUEST_VAL may defer the payload copy until IOC_QUEUE. If libva's userspace pointer (a stack local in h265_set_controls) is gone by then, kernel reads garbage/zeros.

  4. ctrl_hdl mismatch: libva's S_EXT_CTRLS goes to one ctrl_hdl (e.g., the file-handle-private), but rkvdec_hevc_run_preamble reads from another (the m2m-context's). Per V4L2 m2m design these should be the same, but could be misregistered.

  5. error_idx silently fails: kernel returns 0 from S_EXT_CTRLS but actually skipped the controls due to a flag/validation failure. error_idx in the controls struct would reveal this.

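For kdirect: same kernel, same /dev/video1, same m2m-ctx-class, and it works. So mechanism 4 is unlikely (it would break kdirect too). Mechanism 3 would also break kdirect unless the two clients differ in how long their payload pointers stay valid, which they do (see the storage comparison in the next section).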

Mechanisms 1, 2, 5 are libva-specific and worth direct testing.

kdirect storage shape

V4L2RequestControlsHEVC in ffmpeg-v4l2request:

typedef struct V4L2RequestControlsHEVC {
    V4L2RequestPictureContext pic;
    struct v4l2_ctrl_hevc_sps sps;
    struct v4l2_ctrl_hevc_pps pps;
    ...
} V4L2RequestControlsHEVC;
V4L2RequestControlsHEVC *controls = h->cur_frame->hwaccel_picture_private;

Heap-allocated (av_mallocz_array via hwaccel_picture_private). Pointer valid for the AVFrame's full lifecycle — many seconds.

Libva's h265_set_controls:

struct v4l2_ctrl_hevc_sps sps;     // stack local
struct v4l2_ctrl_hevc_pps pps;     // stack local
struct v4l2_ctrl_hevc_decode_params decode_params;  // stack local
struct v4l2_ctrl_hevc_scaling_matrix scaling_matrix;  // stack local
struct v4l2_ctrl_hevc_slice_params *slice_params_array = NULL;  // heap (calloc)
...
controls[n++] = (struct v4l2_ext_control){.id = SPS, .ptr = &sps, .size = sizeof(sps)};
...
rc = v4l2_set_controls(...);

Stack locals — valid only for h265_set_controls's duration. After that function returns and EndPicture's other code runs (QBUF, IOC_QUEUE), the stack may be reused.

If the kernel defers copy_from_user() for compound controls until MEDIA_REQUEST_IOC_QUEUE (which is after h265_set_controls returns), the userspace pointer is stale.

The V4L2 kernel spec says S_EXT_CTRLS copies compound-control payloads immediately, during the ioctl itself. Mechanism 3 therefore requires a kernel bug or a version-specific deferred copy, but either would fit the observed zeros.
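The difference between an immediate and a deferred copy can be shown with a toy model in plain C. Everything here is a stand-in: struct mock_sps is not the real v4l2_ctrl_hevc_sps, and mock_set_ctrl / mock_queue only imitate the ordering of S_EXT_CTRLS and MEDIA_REQUEST_IOC_QUEUE. To stay within defined behaviour, the caller's buffer is kept alive and overwritten, which simulates stack reuse without reading a dead stack frame.

```c
#include <assert.h>
#include <string.h>

/* Stand-in for a compound control payload. */
struct mock_sps {
    unsigned int width;
    unsigned int height;
};

/* Toy request object: it snapshots the payload at set time and also keeps
 * the caller's pointer, so both copy policies can be compared. */
struct mock_request {
    struct mock_sps staged;          /* copy taken at "S_EXT_CTRLS" time */
    const struct mock_sps *user_ptr; /* pointer retained for a deferred copy */
    struct mock_sps applied;         /* what the "driver" reads after queueing */
};

/* Imitates S_EXT_CTRLS: stage the payload and remember where it came from. */
static void mock_set_ctrl(struct mock_request *req, const struct mock_sps *user)
{
    assert(req != NULL && user != NULL);
    req->staged = *user;
    req->user_ptr = user;
}

/* Imitates IOC_QUEUE. deferred == 0 applies the staged snapshot;
 * deferred != 0 re-reads the caller's memory at queue time. */
static void mock_queue(struct mock_request *req, int deferred)
{
    assert(req != NULL && req->user_ptr != NULL);
    req->applied = deferred ? *req->user_ptr : req->staged;
}
```

With a payload of 1280x720 that is zeroed between mock_set_ctrl and mock_queue, the immediate policy still applies 1280x720, while the deferred policy applies 0x0. That is the same shape as the libva dump, which is why a deferred copy would explain the zeros if the kernel actually behaved that way.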

Test hypothesis — heap-allocate libva's HEVC controls

If mechanism 3 (deferred copy) is correct, moving libva's HEVC controls from stack to heap (a calloc()'d struct, owned by surface_object, freed at DestroyContext) would fix the issue without any other change.

This is iter18 α-21. ~20-30 LOC change in h265.c.
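A possible shape for that change is sketched below, under stated assumptions. The payload structs are placeholders for the real struct v4l2_ctrl_hevc_* types, struct mock_surface stands in for the backend's surface object, and the two function names are hypothetical; none of this is taken from h265.c.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Placeholder payloads; real code would use struct v4l2_ctrl_hevc_sps etc. */
struct mock_hevc_sps { unsigned int width, height; };
struct mock_hevc_pps { unsigned int pps_id; };
struct mock_hevc_decode_params { unsigned int flags; };

/* All compound payloads for one picture, so a single allocation covers them. */
struct hevc_ctrl_storage {
    struct mock_hevc_sps sps;
    struct mock_hevc_pps pps;
    struct mock_hevc_decode_params decode_params;
};

/* Hypothetical stand-in for the surface object that owns the storage. */
struct mock_surface {
    struct hevc_ctrl_storage *hevc_ctrls;
};

/* Return zeroed storage whose address stays stable for the surface's
 * lifetime. Allocates on first use, clears on reuse. NULL on failure. */
static struct hevc_ctrl_storage *surface_get_hevc_ctrls(struct mock_surface *s)
{
    assert(s != NULL);
    if (s->hevc_ctrls == NULL)
        s->hevc_ctrls = calloc(1, sizeof(*s->hevc_ctrls));
    else
        memset(s->hevc_ctrls, 0, sizeof(*s->hevc_ctrls));
    return s->hevc_ctrls;
}

/* Release the storage; intended to be called from the teardown path. */
static void surface_release_hevc_ctrls(struct mock_surface *s)
{
    assert(s != NULL);
    free(s->hevc_ctrls);
    s->hevc_ctrls = NULL;
}
```

In this arrangement h265_set_controls would fill the returned struct and pass &ctrls->sps and friends as the .ptr values, in place of the addresses of stack locals. The pointers then remain valid across QBUF and IOC_QUEUE, which is the only property the test needs.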

Phase 7 conclusion

iter17's kernel printk produced a definitive empirical finding: the kernel reads zero control values for libva HEVC despite libva submitting non-zero bytes via S_EXT_CTRLS. The bug is therefore NOT in libva's ioctl content. The leading hypothesis places it at the userspace-pointer-lifetime / kernel-copy-timing layer, which iter18 will test.

This is the strongest narrowing in the campaign since iter5b-β.

iter18 candidate (α-21)

Heap-allocate libva's HEVC control structs so the pointer remains valid until MEDIA_REQUEST_IOC_QUEUE actually runs. If α-21 fixes Bug 5, apply same pattern to H.264 controls and VP8 controls (both currently use stack locals in their *_set_controls functions).

If α-21 still doesn't fix Bug 5, the bug is in mechanism 1, 2, or 5 — requires further kernel printk.

Substrate state at iter17 P7 close

  • Fork tip 111f8ba (libva backend, no changes this iter).
  • Kernel linux-fresnel-fourier 7.0-3 with diagnostic printk in rkvdec_hevc_run. NOT a shipping kernel, diagnostic only. Once iter18 work completes it should either be reverted (printk removed, rebuilt as 7.0-4) or replaced with a build carrying the actual fix.
  • Backend SHA on fresnel: c1d4bb53….