iter20-26: kernel-side root-cause localization, α-25/α-26 fix Bug 4, partial Bug 5
iter20-23: kernel printk in rkvdec_hevc_run + v4l2_ctrl_request_setup
iter24: pinpointed rkvdec_s_ctrl returning -EBUSY for HEVC_SPS due
to vb2_is_busy(CAPTURE) — libva pre-allocates 24 CAPTURE bufs
before first per-frame S_EXT_CTRLS, blocking image_fmt reset
iter25 α-25: synthetic SPS injection before cap_pool_init seeds
ctx->image_fmt to RKVDEC_IMG_FMT_420_8BIT while CAPTURE is
still empty. H264 Bug 4 fully fixed (byte-equal kdirect).
HEVC Bug 5 frame 1 fixed (byte-equal kdirect).
iter26 α-26: populate decode_params.short_term_ref_pic_set_size from
picture->st_rps_bits (VAAPI does expose it). Bytes 4-5 of
dp now match kdirect. HEVC frame 2+ still diverges
(separate bug, likely DPB entry mapping).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -0,0 +1,123 @@
## Iteration 24: Phase 8 (close)

Closes 2026-05-14. iter24 = kernel printk logging `req_to_new` and `try_or_set_cluster` return values. FULL close. **ROOT CAUSE IDENTIFIED.**

### Method

`linux-fresnel-fourier 7.0-8` (pkgrel 7→8). Added `pr_info` after each kernel framework call in the cluster-process block of `v4l2_ctrl_request_setup`:

```c
ret = req_to_new(r);
pr_info("iter24_req_to_new: id=0x%x ret=%d p_req_valid=%d p_req_elems=%u\n",
	master->cluster[i]->id, ret, r->p_req_valid, r->p_req_elems);
...
ret = try_or_set_cluster(NULL, master, true, 0);
pr_info("iter24_try_or_set: master_id=0x%x ret=%d\n", master->id, ret);
```

### Result: definitive

**libva HEVC** (all 10+ setups, identical pattern):

```
iter24_req_to_new: id=0xa40a90 ret=0 p_req_valid=1 p_req_elems=1
iter24_try_or_set: master_id=0xa40a90 ret=-16
iter24_loop_break: at master_id=0xa40a90 ret=-16
iter24_loop_done: final ret=-16
```

`0xa40a90` is `V4L2_CID_STATELESS_HEVC_SPS` and `-16` is `-EBUSY`. `req_to_new` succeeds, but `try_or_set_cluster` returns `-EBUSY` for HEVC_SPS, **exiting the setup loop**.

**kdirect HEVC**: continues processing all 5 staged controls successfully (ret=0 throughout).

### Source localization

The only -EBUSY path in `try_or_set_cluster` is `call_op(master, s_ctrl)` for HEVC_SPS, which dispatches to `rkvdec_s_ctrl` in `drivers/media/platform/rockchip/rkvdec/rkvdec.c:149`:

```c
static int rkvdec_s_ctrl(struct v4l2_ctrl *ctrl)
{
	struct rkvdec_ctx *ctx = container_of(ctrl->handler, struct rkvdec_ctx, ctrl_hdl);
	const struct rkvdec_coded_fmt_desc *desc = ctx->coded_fmt_desc;
	enum rkvdec_image_fmt image_fmt;
	struct vb2_queue *vq;

	...

	/* Check if this change requires a capture format reset */
	if (!desc->ops->get_image_fmt)
		return 0;

	image_fmt = desc->ops->get_image_fmt(ctx, ctrl);
	if (rkvdec_image_fmt_changed(ctx, image_fmt)) {
		vq = v4l2_m2m_get_vq(ctx->fh.m2m_ctx,
				     V4L2_BUF_TYPE_VIDEO_CAPTURE_MPLANE);
		if (vb2_is_busy(vq))
			return -EBUSY; // ← THIS

		ctx->image_fmt = image_fmt;
		rkvdec_reset_decoded_fmt(ctx);
	}

	return 0;
}
```

### Root cause

When the first HEVC_SPS arrives, rkvdec derives the output image format from SPS fields (`chroma_format_idc`, `bit_depth_luma_minus8`, `bit_depth_chroma_minus8`). If the format differs from the current one, which it does on the first frame because `ctx->image_fmt` starts at the default, rkvdec wants to reset the CAPTURE format.

But it can only do that if the CAPTURE queue has NO buffers allocated. `vb2_is_busy(vq)` returns true as soon as the queue has buffers allocated.

**libva pre-allocates 24 CAPTURE buffers at CreateContext (iter5b-β design)**. By the time the first per-frame S_EXT_CTRLS(HEVC_SPS, REQUEST_VAL) fires, CAPTURE is already populated → vb2_is_busy=true → -EBUSY → setup loop exits → SPS never committed → all-zero in `ctx->ctrl_hdl` → `rkvdec_hevc_run` reads zero.

**kdirect (ffmpeg-v4l2request)** allocates CAPTURE buffers AFTER the SPS-driven format is known. So when its first S_EXT_CTRLS fires, CAPTURE is EMPTY → vb2_is_busy=false → format reset succeeds → s_ctrl returns 0 → SPS commits correctly.

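The ordering dependence can be reproduced in a few lines of userspace C. This is a simplified model under stated assumptions, not kernel code: every type and function below (`model_ctx`, `model_s_ctrl_sps`, `model_request_setup`, `model_reqbufs`) is hypothetical and only mirrors the shape of `rkvdec_s_ctrl` and of the setup loop that stops on the first error.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the rkvdec image format values. */
enum model_image_fmt { MODEL_IMG_FMT_DEFAULT, MODEL_IMG_FMT_420_8BIT };

struct model_ctx {
	enum model_image_fmt image_fmt;	/* stands in for ctx->image_fmt */
	unsigned int capture_bufs;	/* buffers allocated on CAPTURE */
	bool sps_committed;		/* did the SPS reach the handler? */
};

/* Mirrors the format-reset check: a changed format needs an empty queue. */
static int model_s_ctrl_sps(struct model_ctx *ctx, enum model_image_fmt sps_fmt)
{
	if (ctx->image_fmt != sps_fmt) {
		if (ctx->capture_bufs > 0)
			return -EBUSY;
		ctx->image_fmt = sps_fmt;
	}
	return 0;
}

/* Mirrors the setup loop: an error from s_ctrl aborts before the commit. */
static int model_request_setup(struct model_ctx *ctx, enum model_image_fmt sps_fmt)
{
	int ret = model_s_ctrl_sps(ctx, sps_fmt);

	if (ret)
		return ret;
	ctx->sps_committed = true;
	return 0;
}

/* Stands in for REQBUFS on the CAPTURE queue. */
static void model_reqbufs(struct model_ctx *ctx, unsigned int count)
{
	ctx->capture_bufs = count;
}
```

In this model, calling `model_reqbufs(&ctx, 24)` before `model_request_setup` (the libva order) returns `-EBUSY` and leaves `sps_committed` false, while the reverse order (the kdirect order) returns 0 and commits the SPS.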
### This is THE Bug 5 root cause

After 24 iterations of investigation, including 8 wire-byte hypothesis eliminations, 4 mechanism eliminations, and 5 kernel-side printk iterations:

**Bug 5 (HEVC libva = all-zero CAPTURE) is caused by libva pre-allocating CAPTURE buffers before the first SPS-set, blocking rkvdec's format-reset.**

Bug 4 (H264 libva = keyframe partial) is likely the same root cause: H264_SPS triggers the same image_fmt check via the `get_image_fmt` op in `rkvdec_h264_fmt_ops`.

### Why VP9 works through libva

VP9 (`rkvdec_vp9_ctrls`) might NOT have a `get_image_fmt` op (vp9_frame is the only control, and chroma+bit_depth come from the frame header, not a separate SPS). Or VP9's frame parameters always resolve to the same image_fmt as the default. Either way, no format-reset attempt → no -EBUSY.

### Mechanism status: RESOLVED

| # | Mechanism | Status |
|---|---|---|
| ALL prior | various | DISPROVED iter17-23 |
| **iter24** | **rkvdec_s_ctrl returns -EBUSY for HEVC_SPS because CAPTURE queue is busy with libva's pre-allocated pool** | **CONFIRMED: ROOT CAUSE** |

### Fix candidates

**Option A** (libva backend fix): Defer libva's CAPTURE pool allocation until AFTER the first per-frame SPS is set. Concretely:

- At CreateContext: skip `cap_pool_init`.
- On first BeginPicture/EndPicture: after first S_EXT_CTRLS(SPS) succeeds, then REQBUFS+QUERYBUF+MMAP the CAPTURE pool.
- Risk: changes the iter5b-β "permanent CAPTURE pool" model, may regress VP9/MPEG-2.

**Option B** (libva backend fix, narrower): Use S_FMT(CAPTURE) BEFORE allocating CAPTURE buffers, with the same image_fmt the SPS will request. This way, `ctx->image_fmt` is already correct when SPS arrives → `rkvdec_image_fmt_changed` returns false → no reset attempt → no -EBUSY.

**Option C** (kernel fix, upstream): Change `rkvdec_s_ctrl` to silently no-op the format-reset if the image_fmt is already correct, even if `get_image_fmt` returns a value that triggered the check. This is risky because it changes upstream rkvdec semantics.

**Option B is preferred**: minimal libva change, aligns with kdirect's pattern (set S_FMT(CAPTURE) before allocating).

### iter25 candidate

Implement Option B in libva backend's CreateContext: explicit `v4l2_set_format(CAPTURE, V4L2_PIX_FMT_NV12, fixture_w, fixture_h)` BEFORE `cap_pool_init`. Set the expected format from BBB's parameters (chroma 4:2:0, 8-bit → NV12).

This builds on iter15's α-19, which already adds an explicit S_FMT(CAPTURE) call, but verify it ACTUALLY runs before `cap_pool_init` in the libva CreateContext flow.

### Substrate state at iter24 close

- Backend SHA on fresnel: `c1d4bb53…` (iter15 stable, unchanged).
- Fork tip `e109306`: unchanged.
- Kernel `linux-fresnel-fourier 7.0-8` with iter17 + iter20-24 printks.
- 5-codec anchors: unchanged.

### Lesson

8 iterations of wire-byte and ioctl-sequence analysis (iter11-iter18) chased an empirical illusion. Once kernel-side printk landed (iter17), 4 more iterations (iter20-23) walked the symptom down to one function call returning one specific error code. **The bug was in a 5-line kernel function we'd never read.** Now we have the right diagnosis and a clear forward path.