fresnel-fourier/phase8_iteration20_close.md
marfrit bf67900cd8 iter20-26: kernel-side root-cause localization, α-25/α-26 fix Bug 4, partial Bug 5
iter20-23: kernel printk in rkvdec_hevc_run + v4l2_ctrl_request_setup
iter24:    pinpointed rkvdec_s_ctrl returning -EBUSY for HEVC_SPS due
           to vb2_is_busy(CAPTURE) — libva pre-allocates 24 CAPTURE bufs
           before first per-frame S_EXT_CTRLS, blocking image_fmt reset
iter25 α-25: synthetic SPS injection before cap_pool_init seeds
           ctx->image_fmt to RKVDEC_IMG_FMT_420_8BIT while CAPTURE is
           still empty. H264 Bug 4 fully fixed (byte-equal kdirect).
           HEVC Bug 5 frame 1 fixed (byte-equal kdirect).
iter26 α-26: populate decode_params.short_term_ref_pic_set_size from
           picture->st_rps_bits (VAAPI does expose it). Bytes 4-5 of
           dp now match kdirect. HEVC frame 2+ still diverges
           (separate bug, likely DPB entry mapping).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 10:10:56 +00:00


Iteration 20 — Phase 8 (close)

Closed 2026-05-14. iter20 = kernel printk of the &ctx->ctrl_hdl, run.sps and run.decode_params pointers plus the first 16 bytes of each payload, executed at the top of rkvdec_hevc_run (after rkvdec_hevc_run_preamble). FULL close. Mechanism 4 reframed; root cause localized to the request-control content copy (S_EXT_CTRLS staging or v4l2_ctrl_request_setup).

Method

linux-fresnel-fourier 7.0-4 adds a printk tagged rkvdec_iter20: to the RK3399 rkvdec_hevc_run:

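{
    /* Zeroed fallback so %*ph never reads 16 bytes from a shorter buffer. */
    static const u8 zeros[16];
    const u8 *sps_bytes = run.sps ? (const u8 *)run.sps : zeros;
    const u8 *dp_bytes  = run.decode_params ? (const u8 *)run.decode_params : zeros;

    /* %p output is hashed, so pointers are only comparable within one boot. */
    pr_info("rkvdec_iter20: ctrl_hdl=%p sps=%p sps[0..16]=%*ph "
            "dp=%p dp[0..16]=%*ph\n",
            &ctx->ctrl_hdl, run.sps, 16, sps_bytes,
            run.decode_params, 16, dp_bytes);
}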

Deployed via scp + pacman -U + reboot, with sddm autologin reseating mfritsche session. Build wall-clock 50 min on boltzmann.

Results

libva HEVC (13 frames, all identical):

rkvdec_iter20: ctrl_hdl=00000000f9b036ba sps=00000000105406cf
               sps[0..16]=00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
               dp=00000000117b947e
               dp[0..16]=00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

kdirect HEVC (15 frames):

rkvdec_iter20: ctrl_hdl=00000000d3afe1db sps=0000000095c47ba1
               sps[0..16]=00 00 00 05 d0 02 00 00 04 04 02 04 01 01 00 03
               dp=00000000599ee83f
               dp[0..16]=00..04..03 (varies per frame — correct, decode_params is per-frame)

What this proves

  1. &ctx->ctrl_hdl differs between processes (libva f9b036ba, kdirect d3afe1db) — EXPECTED. Each backend opens /dev/video3 separately, each gets its own rkvdec_ctx with its own private ctrl_hdl. This is normal V4L2 m2m.

  2. The sps pointer is stable across all libva frames (105406cf) — confirms the SPS control is registered to the handler exactly once (at CreateContext / rkvdec_init_ctrls). The allocation exists, v4l2_ctrl_find() returns it correctly. The control structure is registered. Not a registration bug.

  3. libva's *sps content is all-zero; kdirect's *sps has real bytes (00 00 00 05 d0 02 00 00 04 04 02 04 01 01 00 03). Assuming the upstream struct v4l2_ctrl_hevc_sps layout (two one-byte ID fields, then little-endian u16 width and height), bytes 2-3 (00 05) give pic_width_in_luma_samples = 0x0500 = 1280 and bytes 4-5 (d0 02) give pic_height_in_luma_samples = 0x02d0 = 720. The width matches kdirect's rkvdec_hevc_run printk showing w=1280. libva's bytes are zero, which is why its printk reports w=0 h=0.

  4. libva's *decode_params is also all-zero across all 13 frames. kdirect's varies per-frame. Confirms decode_params for libva never gets non-zero values into ctx->ctrl_hdl either.
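
To make the byte reading in point 3 concrete, here is a minimal sketch that decodes the leading fields from the 16 logged bytes. It assumes the upstream struct v4l2_ctrl_hevc_sps field order (two u8 IDs, then little-endian u16 width and height). The names sps_head, le16 and decode_sps_head are hypothetical and exist only for this illustration; they are not part of the kernel or the backend.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mirror of the first six bytes of the HEVC SPS control.
 * Assumption: field order of the upstream struct v4l2_ctrl_hevc_sps. */
struct sps_head {
    uint8_t  vps_id;   /* byte 0 */
    uint8_t  sps_id;   /* byte 1 */
    uint16_t width;    /* bytes 2-3, little-endian */
    uint16_t height;   /* bytes 4-5, little-endian */
};

/* Read a little-endian u16 without depending on host endianness. */
static uint16_t le16(const uint8_t *p)
{
    return (uint16_t)(p[0] | ((uint16_t)p[1] << 8));
}

/* Decode the leading SPS fields from a byte dump such as the iter20 printk. */
static struct sps_head decode_sps_head(const uint8_t *bytes, size_t len)
{
    struct sps_head h;

    assert(bytes != NULL && len >= 6);
    h.vps_id = bytes[0];
    h.sps_id = bytes[1];
    h.width  = le16(bytes + 2);
    h.height = le16(bytes + 4);
    return h;
}

/* The two sps[0..16] dumps from the Results section. */
static const uint8_t kdirect_sps[16] = {
    0x00, 0x00, 0x00, 0x05, 0xd0, 0x02, 0x00, 0x00,
    0x04, 0x04, 0x02, 0x04, 0x01, 0x01, 0x00, 0x03
};
static const uint8_t libva_sps[16] = { 0 };
```

Calling decode_sps_head(kdirect_sps, sizeof kdirect_sps) gives width 1280 and height 720 under that layout assumption; the libva dump decodes to 0 and 0, consistent with the w=0 h=0 printk.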

Mechanism analysis

The SPS control is registered to ctx->ctrl_hdl (pointer valid, stable, same allocation across 13 frames). What's missing is the content copy from S_EXT_CTRLS userspace payload into the registered control's p_cur.p memory.

The V4L2 control-framework path for compound controls with which=V4L2_CTRL_WHICH_REQUEST_VAL (0x0f010000):

userspace VIDIOC_S_EXT_CTRLS (which=REQUEST_VAL, request_fd=R, payload=...)
  → kernel v4l2_s_ext_ctrls()
  → which==REQUEST_VAL branch: looks up R's media_request,
      stages payload into req->p_new for each control
  → returns 0

userspace MEDIA_REQUEST_IOC_QUEUE on fd R
  → kernel queues req's pending bufs and pending controls
  → m2m schedules job → device_run callback
  → rkvdec_hevc_run_preamble():
      v4l2_ctrl_request_setup(req, &ctx->ctrl_hdl):
        copies req->p_new → ctx->ctrl_hdl[ctrl]->p_cur
  → rkvdec_hevc_run() — printk fires here, reads ctx->ctrl_hdl values

For libva, the printk fires at the read site and observes all-zero. Three places this can fail:

| # | Where | Likelihood |
|---|-------|------------|
| A | v4l2_s_ext_ctrls doesn't stage libva's payload into req->p_new for SPS | unknown, needs probe |
| B | req->p_new has correct bytes but v4l2_ctrl_request_setup doesn't run for libva's request | unknown, needs probe |
| C | v4l2_ctrl_request_setup runs but doesn't copy SPS for libva's request | unknown, needs probe |

The kernel-direct path WORKS through the same control framework on the same kernel, same /dev/video3 — so the bug is in how libva invokes the request lifecycle, not in the framework code itself.

Mechanism status update (post-iter20)

| # | Mechanism | Status |
|---|-----------|--------|
| 1 | request_fd mismatch (S_EXT_CTRLS R1, QUEUE R2) | strongly disfavored (strace shows consistent fd per frame, but worth one explicit verification) |
| 2 | REINIT clears between S_EXT_CTRLS and QUEUE | DISPROVED iter19 |
| 3 | Stack-locals stale | DISPROVED iter18 |
| 4 | ctrl_hdl mismatch (different handlers) | REFRAMED iter20: handlers differ (expected per-process), but BOTH register SPS correctly, and ctx->ctrl_hdl reads stable pointers. NOT a routing bug. |
| 5 | error_idx silent partial fail | DISPROVED iter18 |
| 6 | NEW iter20: req->p_new for SPS never receives libva's payload, OR v4l2_ctrl_request_setup never copies it into ctx->ctrl_hdl | leading hypothesis |

User-level test for iter21

Libva can self-diagnose between A and B/C without kernel patches:

After S_EXT_CTRLS(which=REQUEST_VAL, request_fd=R, payload=...), immediately issue:

  • G_EXT_CTRLS(which=REQUEST_VAL, request_fd=R) for SPS.

If readback returns non-zero bytes → req->p_new HAS the payload (mechanism A disproved, B or C remains).

If readback returns zero → req->p_new doesn't have it (mechanism A confirmed).

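With which=REQUEST_VAL, the G_EXT_CTRLS path is expected to read req->p_new directly, i.e. the staging slot, so the outcome localizes the bug to one of two kernel layers. Caveat: upstream V4L2 documentation states that VIDIOC_G_EXT_CTRLS on a request that has not yet completed fails with EACCES. If this kernel enforces that, the readback returns an error rather than bytes and must be treated as inconclusive, not as a zero readback.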

Substrate state at iter20 close

  • Backend SHA on fresnel: c1d4bb53… (iter15 stable, unchanged).
  • Fork tip 415688d (iter19 state, unchanged).
  • Kernel linux-fresnel-fourier 7.0-4 with iter17 + iter20 printk in rkvdec_hevc_run. NOT a shipping kernel — diagnostic only.
  • 5-codec anchors: unchanged from iter15. Zero regression.

iter21 candidate

α-24: Add G_EXT_CTRLS readback in libva's h265_set_controls right after every v4l2_set_controls(... which=REQUEST_VAL ...) call. Log first 16 bytes of returned SPS. ~15 LOC, fully reversible. Test in this single iter, then revert (diagnostic only, not for shipping).

Outcomes:

  • Non-zero readback → req->p_new has libva's payload. Bug is in v4l2_ctrl_request_setup not running or not copying. iter22 = kernel printk in v4l2_ctrl_request_setup showing what gets copied for libva's request_fd at IOC_QUEUE time.
  • Zero readback → req->p_new doesn't have libva's payload. Bug is in v4l2_s_ext_ctrls staging for libva's invocation. iter22 = kernel printk in v4l2_s_ext_ctrls showing what libva actually passed.

Lesson

iter17 + iter20 prove &ctx->ctrl_hdl pointer routing is NOT the failure surface (registered controls allocated correctly, found correctly, pointer-stable). The failure surface is the content copy from userspace S_EXT_CTRLS into ctx->ctrl_hdl across the request lifecycle. Three iterations (17, 19, 20) of kernel printk have walked the bug-localization down from "anywhere in the kernel" → "S_EXT_CTRLS staging or v4l2_ctrl_request_setup application". Two more printk+probe iterations should reach the line of code.