fresnel-fourier/phase8_iteration20_close.md
marfrit bf67900cd8 iter20-26: kernel-side root-cause localization, α-25/α-26 fix Bug 4, partial Bug 5
iter20-23: kernel printk in rkvdec_hevc_run + v4l2_ctrl_request_setup
iter24:    pinpointed rkvdec_s_ctrl returning -EBUSY for HEVC_SPS due
           to vb2_is_busy(CAPTURE) — libva pre-allocates 24 CAPTURE bufs
           before first per-frame S_EXT_CTRLS, blocking image_fmt reset
iter25 α-25: synthetic SPS injection before cap_pool_init seeds
           ctx->image_fmt to RKVDEC_IMG_FMT_420_8BIT while CAPTURE is
           still empty. H264 Bug 4 fully fixed (byte-equal kdirect).
           HEVC Bug 5 frame 1 fixed (byte-equal kdirect).
iter26 α-26: populate decode_params.short_term_ref_pic_set_size from
           picture->st_rps_bits (VAAPI does expose it). Bytes 4-5 of
           dp now match kdirect. HEVC frame 2+ still diverges
           (separate bug, likely DPB entry mapping).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 10:10:56 +00:00

## Iteration 20 — Phase 8 (close)
Closes 2026-05-14. iter20 = kernel printk for `&ctx->ctrl_hdl`, `run.sps`, `run.decode_params` pointers + first 16 bytes of each, executed at top of `rkvdec_hevc_run` (after `rkvdec_hevc_run_preamble`). FULL close. Mechanism 4 reframed; root-cause localized to one kernel layer.
### Method
`linux-fresnel-fourier 7.0-4` adds `rkvdec_iter20:` printk to RK3399 `rkvdec_hevc_run`:
```c
{
	/* Dump each control pointer plus the first 16 bytes it points at. */
	u8 *sps_bytes = (u8 *)run.sps;
	u8 *dp_bytes = (u8 *)run.decode_params;

	pr_info("rkvdec_iter20: ctrl_hdl=%p sps=%p sps[0..16]=%*ph "
		"dp=%p dp[0..16]=%*ph\n",
		&ctx->ctrl_hdl, run.sps,
		16, sps_bytes ? sps_bytes : (u8 *)"",
		run.decode_params,
		16, dp_bytes ? dp_bytes : (u8 *)"");
}
```
Deployed via scp + `pacman -U` + reboot, with sddm autologin reseating mfritsche session. Build wall-clock 50 min on boltzmann.
### Results
**libva HEVC** (13 frames, all identical):
```
rkvdec_iter20: ctrl_hdl=00000000f9b036ba sps=00000000105406cf
sps[0..16]=00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
dp=00000000117b947e
dp[0..16]=00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
```
**kdirect HEVC** (15 frames):
```
rkvdec_iter20: ctrl_hdl=00000000d3afe1db sps=0000000095c47ba1
sps[0..16]=00 00 00 05 d0 02 00 00 04 04 02 04 01 01 00 03
dp=00000000599ee83f
dp[0..16]=00..04..03 (varies per frame — correct, decode_params is per-frame)
```
### What this proves
1. **`&ctx->ctrl_hdl` differs between processes** (libva `f9b036ba`, kdirect `d3afe1db`) — EXPECTED. Each backend opens `/dev/video3` separately, each gets its own `rkvdec_ctx` with its own private `ctrl_hdl`. This is normal V4L2 m2m.
2. **The `sps` pointer is stable across all libva frames** (`105406cf`), which confirms the SPS control is registered to the handler exactly once (at CreateContext / `rkvdec_init_ctrls`). The allocation exists and `v4l2_ctrl_find()` returns it correctly, so this is not a registration bug.
3. **libva's `*sps` content is all-zero**, while **kdirect's `*sps` has real bytes** (`00 00 00 05 d0 02 00 00 04 04 02 04 01 01 00 03`). Assuming the mainline `struct v4l2_ctrl_hevc_sps` layout (two `u8` ID fields, then `u16 pic_width_in_luma_samples`), bytes 2-3 are `00 05`, which read as little-endian give `0x0500 = 1280`. That matches kdirect's `rkvdec_hevc_run` printk showing `w=1280`. libva's bytes are zero, hence its `w=0 h=0` printk.
4. **libva's `*decode_params` is also all-zero** across all 13 frames. kdirect's varies per-frame. Confirms decode_params for libva never gets non-zero values into ctx->ctrl_hdl either.
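The byte-level reading in item 3 can be checked with a small user-space sketch. It assumes the mainline `struct v4l2_ctrl_hevc_sps` layout, where a `u16` width sits at offset 2 and a `u16` height at offset 4; the helper names (`le16_at`, `sps_width`, `sps_height`) are hypothetical and only the 16 dumped bytes come from the log above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Read an unsigned 16-bit little-endian value at byte offset off. */
static uint16_t le16_at(const uint8_t *buf, size_t off)
{
    assert(buf != NULL);
    return (uint16_t)(buf[off] | (buf[off + 1] << 8));
}

/* First 16 bytes of *sps as dumped by the kdirect printk. */
static const uint8_t kdirect_sps_head[16] = {
    0x00, 0x00, 0x00, 0x05, 0xd0, 0x02, 0x00, 0x00,
    0x04, 0x04, 0x02, 0x04, 0x01, 0x01, 0x00, 0x03,
};

/*
 * ASSUMPTION: offsets follow the mainline struct v4l2_ctrl_hevc_sps:
 * u8 video_parameter_set_id, u8 seq_parameter_set_id,
 * u16 pic_width_in_luma_samples (offset 2),
 * u16 pic_height_in_luma_samples (offset 4).
 */
static uint16_t sps_width(const uint8_t *head)
{
    return le16_at(head, 2);
}

static uint16_t sps_height(const uint8_t *head)
{
    return le16_at(head, 4);
}
```

Under that assumed layout, `sps_width(kdirect_sps_head)` yields 1280, in line with the `w=1280` printk. The same helpers applied to libva's all-zero dump yield 0 for both fields.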
### Mechanism analysis
The SPS control is **registered to `ctx->ctrl_hdl`** (pointer valid, stable, same allocation across 13 frames). What's missing is the **content copy** from `S_EXT_CTRLS` userspace payload into the registered control's `p_cur.p` memory.
The V4L2 control-framework path for compound controls with `which=V4L2_CTRL_WHICH_REQUEST_VAL=0xf010000`:
```
userspace VIDIOC_S_EXT_CTRLS (which=REQUEST_VAL, request_fd=R, payload=...)
  → kernel v4l2_s_ext_ctrls()
    → which==REQUEST_VAL branch: looks up R's media_request,
      stages payload into req->p_new for each control
    → returns 0
userspace MEDIA_REQUEST_IOC_QUEUE on fd R
  → kernel queues req's pending bufs and pending controls
  → m2m schedules job → device_run callback
    → rkvdec_hevc_run_preamble():
        v4l2_ctrl_request_setup(req, &ctx->ctrl_hdl):
          copies req->p_new → ctx->ctrl_hdl[ctrl]->p_cur
    → rkvdec_hevc_run(): printk fires here, reads ctx->ctrl_hdl values
```
For libva, the printk fires at the **read** site and observes all-zero. Three places this can fail:
| # | Where | Likelihood |
|---|---|---|
| A | `v4l2_s_ext_ctrls` doesn't stage libva's payload into `req->p_new` for SPS | unknown — needs probe |
| B | `req->p_new` has correct bytes but `v4l2_ctrl_request_setup` doesn't run for libva's request | unknown — needs probe |
| C | `v4l2_ctrl_request_setup` runs but doesn't copy SPS for libva's request | unknown — needs probe |
The kernel-direct path WORKS through the same control framework on the same kernel, same /dev/video3 — so the bug is in **how libva invokes the request lifecycle**, not in the framework code itself.
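Why a read-site printk cannot separate A, B and C can be illustrated with a toy user-space model of the staging and apply steps. This is a sketch only: `toy_ctrl`, `toy_stage`, `toy_request_setup` and `toy_run` are hypothetical names, and the model does not reproduce the kernel's real data structures or locking.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define PAYLOAD_LEN 16

/* Toy stand-in for one compound control. Not the kernel's struct v4l2_ctrl. */
struct toy_ctrl {
    uint8_t p_new[PAYLOAD_LEN]; /* staging slot filled by S_EXT_CTRLS */
    uint8_t p_cur[PAYLOAD_LEN]; /* value the driver reads in device_run */
    int staged;                 /* nonzero once the request carries a value */
};

/* Which step is deliberately broken; labels follow the table above. */
enum toy_fault { FAULT_NONE, FAULT_A, FAULT_B, FAULT_C };

/* Step 1: stage the payload. Fault A silently drops it. */
static void toy_stage(struct toy_ctrl *c, const uint8_t *payload,
                      enum toy_fault fault)
{
    assert(c != NULL && payload != NULL);
    if (fault == FAULT_A)
        return;
    memcpy(c->p_new, payload, PAYLOAD_LEN);
    c->staged = 1;
}

/* Step 2: apply staged values. Fault C runs but skips this control. */
static void toy_request_setup(struct toy_ctrl *c, enum toy_fault fault)
{
    if (fault == FAULT_C || !c->staged)
        return;
    memcpy(c->p_cur, c->p_new, PAYLOAD_LEN);
}

/* Full lifecycle for one frame. Fault B never calls the setup step. */
static void toy_run(struct toy_ctrl *c, const uint8_t *payload,
                    enum toy_fault fault)
{
    memset(c, 0, sizeof(*c));
    toy_stage(c, payload, fault);
    if (fault != FAULT_B)
        toy_request_setup(c, fault);
}

/* Returns 1 when all PAYLOAD_LEN bytes are zero. */
static int toy_is_zero(const uint8_t *buf)
{
    for (int i = 0; i < PAYLOAD_LEN; i++)
        if (buf[i])
            return 0;
    return 1;
}
```

In this model all three faults leave `p_cur` all-zero, which is what the iter20 printk observes for libva. Only fault A also leaves `p_new` all-zero, so a probe of the staging slot splits A from B/C.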
### Mechanism status update (post-iter20)
| # | Mechanism | Status |
|---|---|---|
| 1 | request_fd mismatch (S_EXT_CTRLS R1, QUEUE R2) | strongly disfavored (strace shows consistent fd per frame, but worth one explicit verification) |
| 2 | REINIT clears between S_EXT_CTRLS and QUEUE | DISPROVED iter19 |
| 3 | Stack-locals stale | DISPROVED iter18 |
| 4 | ctrl_hdl mismatch — different handlers | **REFRAMED iter20**: handlers differ (expected per-process), but BOTH register SPS correctly, and ctx->ctrl_hdl reads stable pointers. NOT a routing bug. |
| 5 | error_idx silent partial fail | DISPROVED iter18 |
| 6 | **NEW iter20**: req->p_new for SPS never receives libva's payload, OR v4l2_ctrl_request_setup never copies it into ctx->ctrl_hdl | **leading hypothesis** |
### User-level test for iter21
Libva can self-diagnose between A and B/C without kernel patches:
After `S_EXT_CTRLS(which=REQUEST_VAL, request_fd=R, payload=...)`, immediately issue:
- `G_EXT_CTRLS(which=REQUEST_VAL, request_fd=R)` for SPS.
If readback returns non-zero bytes → **req->p_new HAS the payload** (mechanism A disproved, B or C remains).
If readback returns zero → **req->p_new doesn't have it** (mechanism A confirmed).
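The plan assumes G_EXT_CTRLS with `which=REQUEST_VAL` reads `req->p_new`, the staging slot, on a request that has not been queued yet. The V4L2 uAPI documentation lists `EACCES` for getting a control from a request that has not completed, so an `EACCES` return would mean the readback is unavailable at this point, not that mechanism A is confirmed. If the readback succeeds, the outcome localizes the bug to one of two kernel layers.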
### Substrate state at iter20 close
- Backend SHA on fresnel: `c1d4bb53…` (iter15 stable, unchanged).
- Fork tip `415688d` (iter19 state, unchanged).
- Kernel `linux-fresnel-fourier 7.0-4` with iter17 + iter20 printk in rkvdec_hevc_run. NOT a shipping kernel — diagnostic only.
- 5-codec anchors: unchanged from iter15. Zero regression.
### iter21 candidate
`α-24`: Add G_EXT_CTRLS readback in libva's `h265_set_controls` right after every `v4l2_set_controls(... which=REQUEST_VAL ...)` call. Log first 16 bytes of returned SPS. ~15 LOC, fully reversible. Test in this single iter, then revert (diagnostic only, not for shipping).
Outcomes:
- **Non-zero readback** → req->p_new has libva's payload. Bug is in `v4l2_ctrl_request_setup` not running or not copying. iter22 = kernel printk in `v4l2_ctrl_request_setup` showing what gets copied for libva's request_fd at IOC_QUEUE time.
- **Zero readback** → req->p_new doesn't have libva's payload. Bug is in `v4l2_s_ext_ctrls` staging for libva's invocation. iter22 = kernel printk in `v4l2_s_ext_ctrls` showing what libva actually passed.
### Lesson
iter17 + iter20 prove `&ctx->ctrl_hdl` pointer routing is NOT the failure surface (registered controls allocated correctly, found correctly, pointer-stable). The failure surface is the **content copy** from userspace S_EXT_CTRLS into ctx->ctrl_hdl across the request lifecycle. Three iterations (17, 19, 20) of kernel printk have walked the bug-localization down from "anywhere in the kernel" → "S_EXT_CTRLS staging or v4l2_ctrl_request_setup application". Two more printk+probe iterations should reach the line of code.