fresnel-fourier/phase8_iteration24_close.md
marfrit bf67900cd8 iter20-26: kernel-side root-cause localization, α-25/α-26 fix Bug 4, partial Bug 5
iter20-23: kernel printk in rkvdec_hevc_run + v4l2_ctrl_request_setup
iter24:    pinpointed rkvdec_s_ctrl returning -EBUSY for HEVC_SPS due
           to vb2_is_busy(CAPTURE) — libva pre-allocates 24 CAPTURE bufs
           before first per-frame S_EXT_CTRLS, blocking image_fmt reset
iter25 α-25: synthetic SPS injection before cap_pool_init seeds
           ctx->image_fmt to RKVDEC_IMG_FMT_420_8BIT while CAPTURE is
           still empty. H264 Bug 4 fully fixed (byte-equal kdirect).
           HEVC Bug 5 frame 1 fixed (byte-equal kdirect).
iter26 α-26: populate decode_params.short_term_ref_pic_set_size from
           picture->st_rps_bits (VAAPI does expose it). Bytes 4-5 of
           dp now match kdirect. HEVC frame 2+ still diverges
           (separate bug, likely DPB entry mapping).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 10:10:56 +00:00

## Iteration 24 — Phase 8 (close)
Closes 2026-05-14. iter24 = kernel printk logging `req_to_new` and `try_or_set_cluster` return values. FULL close. **ROOT CAUSE IDENTIFIED.**
### Method
`linux-fresnel-fourier 7.0-8` (pkgrel 7→8). Added pr_info after each kernel framework call in `v4l2_ctrl_request_setup`'s cluster-process block:
```c
ret = req_to_new(r);
pr_info("iter24_req_to_new: id=0x%x ret=%d p_req_valid=%d p_req_elems=%u\n",
	master->cluster[i]->id, ret, r->p_req_valid, r->p_req_elems);
...
ret = try_or_set_cluster(NULL, master, true, 0);
pr_info("iter24_try_or_set: master_id=0x%x ret=%d\n", master->id, ret);
```
### Result — definitive
**libva HEVC** (all 10+ setups, identical pattern):
```
iter24_req_to_new: id=0xa40a90 ret=0 p_req_valid=1 p_req_elems=1
iter24_try_or_set: master_id=0xa40a90 ret=-16
iter24_loop_break: at master_id=0xa40a90 ret=-16
iter24_loop_done: final ret=-16
```
`-16` is `-EBUSY`. `req_to_new` succeeds. `try_or_set_cluster` returns -EBUSY for HEVC_SPS, **exiting the setup loop**.
**kdirect HEVC**: continues processing all 5 staged controls successfully (ret=0 throughout).
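The `-16` to `-EBUSY` reading above can be reproduced in user space. A minimal sketch, assuming Linux errno numbering; `kret_describe` is a hypothetical helper for annotating printk output, not part of the kernel or the backend:

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: format a kernel-style return value (0 or a negative
 * errno) as text. -EBUSY is named explicitly because it is the value seen in
 * the iter24 log; other errors fall back to strerror().
 */
static const char *kret_describe(int ret, char *buf, size_t len)
{
	assert(buf != NULL && len > 0);

	if (ret >= 0)
		snprintf(buf, len, "ok (%d)", ret);
	else if (ret == -EBUSY)
		snprintf(buf, len, "-EBUSY (%d)", ret);
	else
		snprintf(buf, len, "%s (%d)", strerror(-ret), ret);
	return buf;
}
```

Feeding it the `ret=-16` from `iter24_try_or_set` names the error as `-EBUSY`, matching the log interpretation.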
### Source localization
The only -EBUSY path in `try_or_set_cluster` is `call_op(master, s_ctrl)` for HEVC_SPS, which dispatches to `rkvdec_s_ctrl` in `drivers/media/platform/rockchip/rkvdec/rkvdec.c:149`:
```c
static int rkvdec_s_ctrl(struct v4l2_ctrl *ctrl)
{
	struct rkvdec_ctx *ctx = container_of(ctrl->handler, struct rkvdec_ctx, ctrl_hdl);
	const struct rkvdec_coded_fmt_desc *desc = ctx->coded_fmt_desc;
	enum rkvdec_image_fmt image_fmt;
	struct vb2_queue *vq;
	...
	/* Check if this change requires a capture format reset */
	if (!desc->ops->get_image_fmt)
		return 0;
	image_fmt = desc->ops->get_image_fmt(ctx, ctrl);
	if (rkvdec_image_fmt_changed(ctx, image_fmt)) {
		vq = v4l2_m2m_get_vq(ctx->fh.m2m_ctx,
				     V4L2_BUF_TYPE_VIDEO_CAPTURE_MPLANE);
		if (vb2_is_busy(vq))
			return -EBUSY; // ← THIS
		ctx->image_fmt = image_fmt;
		rkvdec_reset_decoded_fmt(ctx);
	}
	return 0;
}
```
### Root cause
When the first HEVC_SPS arrives, rkvdec derives the output image format from SPS fields (chroma_format_idc, bit_depth_luma_minus8, bit_depth_chroma_minus8). If that format differs from the current ctx->image_fmt, which it does on the first frame because ctx->image_fmt still holds the default, rkvdec tries to reset the CAPTURE format.
It can only do that while the CAPTURE queue has no buffers allocated: `vb2_is_busy(vq)` returns true as soon as the queue holds any buffers.
**libva pre-allocates 24 CAPTURE buffers at CreateContext (iter5b-β design)**. By the time the first per-frame S_EXT_CTRLS(HEVC_SPS, REQUEST_VAL) fires, the CAPTURE queue is already populated → vb2_is_busy=true → -EBUSY → setup loop exits → SPS never committed → the control stays all-zero in ctx->ctrl_hdl → rkvdec_hevc_run reads zeros.
**kdirect (ffmpeg-v4l2request)** allocates CAPTURE buffers AFTER the SPS-driven format is known. So when its first S_EXT_CTRLS fires, CAPTURE is EMPTY → vb2_is_busy=false → format reset succeeds → s_ctrl returns 0 → SPS commits correctly.
### This is THE Bug 5 root cause
After 24 iterations of investigation, including 8 wire-byte hypothesis eliminations, 4 mechanism eliminations, and 5 kernel-side printk iterations:
**Bug 5 (HEVC libva = all-zero CAPTURE) is caused by libva pre-allocating CAPTURE buffers before the first SPS-set, blocking rkvdec's format-reset.**
Bug 4 (H264 libva = keyframe partial) is likely the same root cause — H264_SPS triggers the same image_fmt check via rkvdec_h264_fmt_ops's get_image_fmt.
### Why VP9 works through libva
VP9 (rkvdec_vp9_ctrls) might NOT have a get_image_fmt op (vp9_frame is the only control, and chroma+bit_depth come from frame header, not a separate SPS). Or VP9's frame parameters always resolve to the same image_fmt as the default. Either way, no format-reset attempt → no -EBUSY.
### Mechanism status — RESOLVED
| # | Mechanism | Status |
|---|---|---|
| ALL prior | various | DISPROVED iter17-23 |
| **iter24** | **rkvdec_s_ctrl returns -EBUSY for HEVC_SPS because CAPTURE queue is busy with libva's pre-allocated pool** | **CONFIRMED — ROOT CAUSE** |
### Fix candidates
**Option A** (libva backend fix): Defer libva's CAPTURE pool allocation until AFTER the first per-frame SPS is set. Concretely:
- At CreateContext: skip cap_pool_init.
- On first BeginPicture/EndPicture: after first S_EXT_CTRLS(SPS) succeeds, then REQBUFS+QUERYBUF+MMAP the CAPTURE pool.
- Risk: changes the iter5b-β "permanent CAPTURE pool" model, may regress VP9/MPEG-2.
**Option B** (libva backend fix, narrower): Use S_FMT(CAPTURE) BEFORE allocating CAPTURE buffers, with the same image_fmt the SPS will request. This way, ctx->image_fmt is already correct when SPS arrives → rkvdec_image_fmt_changed returns false → no reset attempt → no -EBUSY.
**Option C** (kernel fix, upstream): Change rkvdec_s_ctrl to silently no-op the format-reset if the image_fmt is already correct, even if get_image_fmt returns a value that triggered the check. This is risky — it changes upstream rkvdec semantics.
**Option B is preferred** — minimal libva change, aligns with kdirect's pattern (set S_FMT(CAPTURE) before allocating).
### iter25 candidate
Implement Option B in libva backend's CreateContext: explicit `v4l2_set_format(CAPTURE, V4L2_PIX_FMT_NV12, fixture_w, fixture_h)` BEFORE `cap_pool_init`. Set the expected format from BBB's parameters (chroma 4:2:0, 8-bit → NV12).
This builds on iter15's α-19 which already adds an explicit S_FMT(CAPTURE) call — but verify it ACTUALLY runs before cap_pool_init in the libva CreateContext flow.
### Substrate state at iter24 close
- Backend SHA on fresnel: `c1d4bb53…` (iter15 stable, unchanged).
- Fork tip `e109306` — unchanged.
- Kernel `linux-fresnel-fourier 7.0-8` with iter17 + iter20-24 printks.
- 5-codec anchors: unchanged.
### Lesson
8 iterations of wire-byte and ioctl-sequence analysis (iter11-iter18) chased an empirical illusion. Once kernel-side printk landed (iter17), 4 more iterations (iter20-23) walked the symptom down to one function call returning one specific error code. **The bug was in a 5-line kernel function we'd never read.** Now we have the right diagnosis and a clear forward path.