fresnel-fourier/phase7_iter17_findings.md
marfrit cbead4ec64 iter17 Phase 7: KERNEL PRINTK FINDS THE BUG — controls lost between S_EXT_CTRLS and rkvdec read
DEFINITIVE FINDING via pr_info in rkvdec_hevc_run on RK3399:

libva HEVC:    w=0 h=0 reorder=0 chroma=0 nal_unit_type=0 decode_flags=0x0
kdirect HEVC:  w=1280 h=720 reorder=2 chroma=1 nal_unit_type=20 decode_flags=0x3

The kernel sees ALL-ZERO control structs for libva HEVC, but CORRECT values
for kdirect. Same kernel, same code path, same /dev/video1, same
rkvdec_hevc_run_preamble fetching v4l2_ctrl_find(ctx->ctrl_hdl,
HEVC_SPS)->p_cur.p.

This overturns iter11-iter15's "wire-byte search exhausted" conclusion.
The S_EXT_CTRLS payloads ARE byte-correct at the strace observer level,
but the kernel sees zeros. The bug is in the
S_EXT_CTRLS -> request -> ctx->ctrl_hdl path, specifically for libva.

Five mechanisms hypothesized:
  1. request_fd mismatch
  2. REINIT clears controls before QUEUE
  3. Compound-control copy deferred until QUEUE -> stack-locals stale
  4. ctrl_hdl mismatch (libva submits to one, rkvdec reads another)
  5. error_idx silently fails

Key difference observed:
  libva stores SPS/PPS/decode_params as STACK LOCALS in h265_set_controls
  kdirect stores them in heap-allocated hwaccel_picture_private

Mechanism 3 (kernel defers compound-ctrl copy_from_user) is the leading
hypothesis. iter18 α-21: heap-allocate libva's HEVC control structs;
if Bug 5 fixes, apply same pattern to H.264 (Bug 4) and VP8 (Bug 6).

This is the strongest narrowing since iter5b-β.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 08:55:58 +00:00

# Iteration 17 — Phase 7 (kernel-side diagnostic finding)
Captured 2026-05-14 on fresnel running `linux-fresnel-fourier 7.0-3` (kernel module reloaded; vmlinux unchanged so `uname -v` still shows 7.0-2 timestamp, but rkvdec module is the iter17 build).
## Method
Added one `pr_info()` at the entry of `rkvdec_hevc_run` (RK3399 variant in `rkvdec-hevc.c`) dumping key state values from `run->sps`, `run->slices_params[0]`, `run->decode_params` — the structs that `rkvdec_hevc_run_preamble` populates via `v4l2_ctrl_find(&ctx->ctrl_hdl, ID)->p_cur.p`.
## Result — definitive
Captured 28 lines of dmesg across libva + kdirect HEVC decode runs. Representative sample:
**libva HEVC** (run 1 of 13 — all 13 identical):
```
rkvdec_hevc_run: sps_id=0 dpb_buf=0 reorder=0
w=0 h=0 bd_l=0 bd_c=0 chroma=0
num_short_st=0 num_long_lt=0
slices=1 nal_unit_type=0 slice_type=0
decode_flags=0x0
```
**kdirect HEVC** (run 1 of 15):
```
rkvdec_hevc_run: sps_id=0 dpb_buf=4 reorder=2
w=1280 h=720 bd_l=0 bd_c=0 chroma=1
num_short_st=0 num_long_lt=0
slices=1 nal_unit_type=20 slice_type=2
decode_flags=0x3
```
**The kernel sees completely different control struct contents.** For libva: all-zero SPS dimensions, zero nal_unit_type, zero decode_flags. For kdirect: correct width=1280, height=720, IRAP|IDR flags, slice_type=I.
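For triaging the captured dmesg lines, a small userspace helper can flag dumps whose SPS dimensions are zero. This is a minimal sketch that assumes the dimension line keeps the `w=… h=… bd_l=… bd_c=… chroma=…` layout shown in the samples above; `sps_dump`, `parse_sps_dump`, and `sps_dump_is_zero` are hypothetical names, not part of the kernel patch.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical mirror of the fields printed on the "w=... h=..." dump line. */
struct sps_dump {
    unsigned int w, h, bd_l, bd_c, chroma;
};

/* Parse one dump line; tolerates a leading dmesg timestamp.
 * Returns 0 on success, -1 if the line is not a dimension line. */
static int parse_sps_dump(const char *line, struct sps_dump *d)
{
    const char *p = strstr(line, "w=");

    if (!p)
        return -1;
    if (sscanf(p, "w=%u h=%u bd_l=%u bd_c=%u chroma=%u",
               &d->w, &d->h, &d->bd_l, &d->bd_c, &d->chroma) != 5)
        return -1;
    return 0;
}

/* A valid HEVC SPS never has a zero picture dimension, so either one
 * being zero marks the dump as "controls not seen by the driver". */
static int sps_dump_is_zero(const struct sps_dump *d)
{
    return d->w == 0 || d->h == 0;
}
```

Fed the two samples above, the libva line would be flagged and the kdirect line would not.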
## What this overturns
iter11-iter15 had established that libva's S_EXT_CTRLS wire-byte payloads were byte-equal to kdirect's for every field rkvdec reads. That was correct at the **strace observer level**: userspace passed the same bytes to the kernel.
But at the **rkvdec-read level** (after `MEDIA_REQUEST_IOC_QUEUE` should have copied request controls to `ctx->ctrl_hdl->p_cur`), libva's values are zero while kdirect's are correct.
**The wire-byte search exhaustion was an illusion.** The controls are submitted correctly but lost somewhere between `S_EXT_CTRLS` and `rkvdec_hevc_run_preamble`.
## Possible mechanisms
1. **request_fd mismatch**: libva's S_EXT_CTRLS uses one request_fd, libva's QBUF/IOC_QUEUE uses another. Controls stash in request A; ctx->ctrl_hdl never sees A's values.
2. **REINIT clears controls before QUEUE**: libva's `media_request_reinit` may run on the request between S_EXT_CTRLS and IOC_QUEUE for the NEXT frame. iter6 lifecycle: REINIT runs AFTER wait_completion of the previous decode — should be safe, but worth double-checking.
3. **Compound-control copy timing**: kernel's `v4l2_s_ext_ctrls()` for `which=V4L2_CTRL_WHICH_REQUEST_VAL` may defer the payload copy until IOC_QUEUE. If libva's userspace pointer (a stack local in `h265_set_controls`) is gone by then, kernel reads garbage/zeros.
4. **ctrl_hdl mismatch**: libva's S_EXT_CTRLS goes to one ctrl_hdl (e.g., the file-handle-private), but `rkvdec_hevc_run_preamble` reads from another (the m2m-context's). Per V4L2 m2m design these should be the same, but could be misregistered.
5. **`error_idx` silently fails**: kernel returns 0 from S_EXT_CTRLS but actually skipped the controls due to a flag/validation failure. error_idx in the controls struct would reveal this.
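kdirect uses the same kernel, the same /dev/video1, and the same m2m context class, and it works. Mechanism 4 would break kdirect too, so it is unlikely. Mechanism 3 would as well, unless the two clients differ in how long their control payload pointers stay valid, which the storage comparison in the next section examines.
Mechanisms 1, 2, 5 are libva-specific and worth direct testing.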
## kdirect storage shape
`V4L2RequestControlsHEVC` in ffmpeg-v4l2request:
```c
typedef struct V4L2RequestControlsHEVC {
    V4L2RequestPictureContext pic;
    struct v4l2_ctrl_hevc_sps sps;
    struct v4l2_ctrl_hevc_pps pps;
    ...
} V4L2RequestControlsHEVC;

/* Per-frame storage: lives as long as the frame's hwaccel private data. */
V4L2RequestControlsHEVC *controls = h->cur_frame->hwaccel_picture_private;
```
Heap-allocated (av_mallocz_array via hwaccel_picture_private). Pointer valid for the AVFrame's full lifecycle — many seconds.
Libva's `h265_set_controls`:
```c
struct v4l2_ctrl_hevc_sps sps; // stack local
struct v4l2_ctrl_hevc_pps pps; // stack local
struct v4l2_ctrl_hevc_decode_params decode_params; // stack local
struct v4l2_ctrl_hevc_scaling_matrix scaling_matrix; // stack local
struct v4l2_ctrl_hevc_slice_params *slice_params_array = NULL; // heap (calloc)
...
controls[n++] = (struct v4l2_ext_control){.id = SPS, .ptr = &sps, .size = sizeof(sps)};
...
rc = v4l2_set_controls(...);
```
Stack locals — valid only for h265_set_controls's duration. After that function returns and EndPicture's other code runs (QBUF, IOC_QUEUE), the stack may be reused.
**If the kernel defers `copy_from_user()` for compound controls until `MEDIA_REQUEST_IOC_QUEUE` (which is *after* h265_set_controls returns), the userspace pointer is stale.**
The V4L2 kernel spec says `S_EXT_CTRLS` for compound controls copies immediately — but a kernel bug or version-specific deferred-copy could fit.
## Test hypothesis — heap-allocate libva's HEVC controls
If mechanism 3 (deferred copy) is correct, moving libva's HEVC controls from stack to heap (a calloc()'d struct, owned by surface_object, freed at DestroyContext) would fix the issue without any other change.
**This is iter18 α-21.** ~20-30 LOC change in h265.c.
## Phase 7 conclusion
iter17's kernel printk produced a definitive empirical finding: **the kernel reads zero control values for libva HEVC despite libva submitting non-zero bytes via S_EXT_CTRLS**. The leading hypothesis places the bug at the userspace-pointer-lifetime / kernel-copy-timing layer (mechanism 3), not in libva's ioctl content; mechanisms 1, 2, and 5 remain open until α-21 is tested.
This is the strongest narrowing in the campaign since iter5b-β.
## iter18 candidate (α-21)
Heap-allocate libva's HEVC control structs so the pointer remains valid until `MEDIA_REQUEST_IOC_QUEUE` actually runs. If α-21 fixes Bug 5, apply same pattern to H.264 controls and VP8 controls (both currently use stack locals in their `*_set_controls` functions).
If α-21 still doesn't fix Bug 5, the bug is in mechanism 1, 2, or 5 — requires further kernel printk.
## Substrate state at iter17 P7 close
- Fork tip `111f8ba` (libva backend, no changes this iter).
- Kernel `linux-fresnel-fourier 7.0-3` with diagnostic printk in rkvdec_hevc_run. NOT a shipping kernel — diagnostic only. Should be reverted (printk removed, rebuild as 7.0-4) once iter18 work completes OR replaced with the actual fix instead.
- Backend SHA on fresnel: `c1d4bb53…`.