fresnel-fourier/phase8_iteration21_close.md
marfrit bf67900cd8 iter20-26: kernel-side root-cause localization, α-25/α-26 fix Bug 4, partial Bug 5
iter20-23: kernel printk in rkvdec_hevc_run + v4l2_ctrl_request_setup
iter24:    pinpointed rkvdec_s_ctrl returning -EBUSY for HEVC_SPS due
           to vb2_is_busy(CAPTURE) — libva pre-allocates 24 CAPTURE bufs
           before first per-frame S_EXT_CTRLS, blocking image_fmt reset
iter25 α-25: synthetic SPS injection before cap_pool_init seeds
           ctx->image_fmt to RKVDEC_IMG_FMT_420_8BIT while CAPTURE is
           still empty. H264 Bug 4 fully fixed (byte-equal kdirect).
           HEVC Bug 5 frame 1 fixed (byte-equal kdirect).
iter26 α-26: populate decode_params.short_term_ref_pic_set_size from
           picture->st_rps_bits (VAAPI does expose it). Bytes 4-5 of
           dp now match kdirect. HEVC frame 2+ still diverges
           (separate bug, likely DPB entry mapping).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 10:10:56 +00:00

## Iteration 21 — Phase 8 (close)
Closes 2026-05-14. iter21 = kernel printk at top of `v4l2_ctrl_request_setup` + per-ref dump. FULL close. **Smoking-gun finding: libva's request-clone-handler is missing 6 HEVC stateless controls registered in main_hdl.**
### Method
`linux-fresnel-fourier 7.0-5` (pkgrel 4→5) adds two `pr_info` to `v4l2_ctrl_request_setup` in `drivers/media/v4l2-core/v4l2-ctrls-request.c`:
```c
/* iter21: log the request-object lookup result */
obj = media_request_object_find(req, &req_ops, main_hdl);
pr_info("iter21_setup: req=%p main_hdl=%p obj=%p\n", req, main_hdl, obj);
...
list_for_each_entry(ref, &hdl->ctrl_refs, node) {
    ...
    /* iter21: one line per ref in the handler being applied */
    pr_info("iter21_setup_ref: ctrl_id=0x%x p_req_valid=%d have_new=%d\n",
            ctrl->id, ref->p_req_valid, have_new_data);
    ...
}
```
Built ~1 min via ccache reuse. Deployed via scp + `pacman -U` + reboot.
### α-24 result (predicate: kernel-only path required)
α-24 (libva G_EXT_CTRLS readback after S_EXT_CTRLS) implemented as 1547a5d → amended a9c897f → reverted e109306. Kernel returned **EACCES** for all 13 libva HEVC frames: this V4L2 build disallows userspace probing of `req->p_new` for an uncompleted request. The probe path must run inside the kernel.
### Result — definitive (libva vs kdirect)
**libva HEVC frame 1 setup** (clone-hdl ctrl_refs in ID order, 14 entries):
```
0x990a67 p_req_valid=0
0x990a6b p_req_valid=0
0x990b00 p_req_valid=0
0x990b67 p_req_valid=0
0x990b68 p_req_valid=0 (5 codec-class menu controls)
0xa40900 p_req_valid=0 H264_DECODE_MODE
0xa40901 p_req_valid=0 H264_START_CODE
0xa40902 p_req_valid=0 H264_SPS
0xa40903 p_req_valid=0 H264_PPS
0xa40904 p_req_valid=0 H264_SCALING_MATRIX
0xa40907 p_req_valid=0 H264_DECODE_PARAMS
0xa40a2c p_req_valid=0 (misc stateless)
0xa40a2d p_req_valid=0 (misc stateless)
0xa40a90 p_req_valid=1 have_new=1 HEVC_SPS — CLONE STOPS HERE
```
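The IDs in these dumps can be read mechanically. A minimal userspace sketch, assuming the class mask and base values from the V4L2 uAPI header `<linux/v4l2-controls.h>` (redefined locally here so the snippet builds without kernel headers; the helper names `ctrl_class` and `stateless_offset` are illustrative only):

```c
#include <assert.h>
#include <stdint.h>

/* Local copies of uAPI values; names are shortened for this sketch. */
#define CTRL_CLASS_MASK            0x0fff0000u
#define CTRL_CLASS_CODEC           0x00990000u
#define CTRL_CLASS_CODEC_STATELESS 0x00a40000u
#define CODEC_STATELESS_BASE       (CTRL_CLASS_CODEC_STATELESS | 0x900u)

/* Control class of an ID: the same mask V4L2_CTRL_ID2WHICH() applies. */
static uint32_t ctrl_class(uint32_t id)
{
    return id & CTRL_CLASS_MASK;
}

/* Decimal offset from the stateless base, e.g. 400 for HEVC_SPS. */
static uint32_t stateless_offset(uint32_t id)
{
    assert(ctrl_class(id) == CTRL_CLASS_CODEC_STATELESS);
    assert(id >= CODEC_STATELESS_BASE);
    return id - CODEC_STATELESS_BASE;
}
```

With these, `0x990a67` falls in the codec class and `0xa40a90` is stateless base + 400, which is how the uAPI header numbers HEVC_SPS.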
**Missing from libva clone (vs kdirect):**
- 0xa40905 H264_PRED_WEIGHTS (compound)
- 0xa40906 H264_SLICE_PARAMS (compound, dyn_array)
- 0xa40a91 HEVC_PPS (compound)
- 0xa40a92 HEVC_SLICE_PARAMS (compound, dyn_array)
- 0xa40a93 HEVC_SCALING_MATRIX (compound)
- 0xa40a94 HEVC_DECODE_PARAMS (compound)
- 0xa40a95 HEVC_DECODE_MODE (menu)
- 0xa40a96 HEVC_START_CODE (menu)
**kdirect HEVC frame 1 setup** (same hdl, 21 entries — all of above PLUS the 8 missing):
```
... 14 entries as above ...
0xa40a91 p_req_valid=1 have_new=1 HEVC_PPS
0xa40a92 p_req_valid=1 have_new=1 HEVC_SLICE_PARAMS
0xa40a93 p_req_valid=1 have_new=1 HEVC_SCALING_MATRIX
0xa40a94 p_req_valid=1 have_new=1 HEVC_DECODE_PARAMS
0xa40a95 p_req_valid=0 HEVC_DECODE_MODE (device-init only)
0xa40a96 p_req_valid=0 HEVC_START_CODE (device-init only)
```
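The libva-vs-kdirect comparison is a set difference over two ID-sorted dumps. A hedged sketch of that comparison, restricted to the HEVC block quoted above (the function name `missing_ids` and both array names are made up for illustration; this is an offline log-analysis helper, not kernel code):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Write into `out` every ID in `ref` that is absent from `have`.
 * Both inputs must be sorted ascending, as the ctrl_refs dumps are.
 * Returns the number of IDs written; `out` needs room for nref entries.
 */
static size_t missing_ids(const uint32_t *ref, size_t nref,
                          const uint32_t *have, size_t nhave,
                          uint32_t *out)
{
    size_t i = 0, j = 0, n = 0;

    assert(ref != NULL && out != NULL);
    while (i < nref) {
        if (j < nhave && have[j] < ref[i]) {
            j++;                 /* ID only in `have`: ignore */
        } else if (j < nhave && have[j] == ref[i]) {
            i++;
            j++;                 /* present in both dumps */
        } else {
            out[n++] = ref[i++]; /* present only in the reference */
        }
    }
    return n;
}

/* HEVC block as quoted in the kdirect dump of this section. */
static const uint32_t kdirect_hevc[] = {
    0xa40a90, 0xa40a91, 0xa40a92, 0xa40a93, 0xa40a94, 0xa40a95, 0xa40a96,
};

/* HEVC block as quoted in the libva dump: only HEVC_SPS. */
static const uint32_t libva_hevc[] = { 0xa40a90 };
```

Calling `missing_ids(kdirect_hevc, 7, libva_hevc, 1, out)` yields the six IDs `0xa40a91` through `0xa40a96`, matching the "6 of 7 HEVC controls" count used in this note.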
### What this means
`v4l2_ctrl_request_setup(req, main_hdl)`:
- finds `obj` for both libva and kdirect (non-NULL) — request properly bound.
- iterates `hdl->ctrl_refs` — but **libva's hdl is the request-clone-hdl, and it contains 14 of the 21 source controls**.
- libva's HEVC_SPS has `p_req_valid=1` — staging worked for that one control.
- The other 6 HEVC controls (PPS, SLICE_PARAMS, SCALING_MATRIX, DECODE_PARAMS, DECODE_MODE, START_CODE) **don't exist in the clone-hdl at all** — they cannot be staged.
When libva submits its 5-control S_EXT_CTRLS batch (SPS, PPS, SLICE_PARAMS, SCALING_MATRIX, DECODE_PARAMS), only SPS is registered in the clone-hdl. The other four find no ref, so `prepare_ext_ctrls` should return `-EINVAL`. This contradicts the rc=0 observed in iter18 α-22. The error_idx semantics on the request path need re-investigation: the rc=0 seen from userspace may not reflect the actual kernel result for compound-control lookups in request clones.
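For the re-check, it helps to log `error_idx` next to the return code on every S_EXT_CTRLS call. A small sketch of how the pair could be interpreted, assuming the behaviour described in the V4L2 uAPI documentation for VIDIOC_S_EXT_CTRLS (on a validation failure `error_idx` is set to `count`, meaning nothing was applied). Whether the request path follows the same rule is exactly the open question, and the enum and function names are hypothetical:

```c
#include <assert.h>

enum ext_ctrls_outcome {
    EXT_CTRLS_OK,              /* ioctl returned 0 */
    EXT_CTRLS_REJECTED_EARLY,  /* failed validation; nothing applied */
    EXT_CTRLS_FAILED_AT_INDEX, /* failed while applying controls[error_idx] */
};

/* Classify one VIDIOC_S_EXT_CTRLS result from values userspace can see. */
static enum ext_ctrls_outcome
classify_s_ext_ctrls(int rc, unsigned int error_idx, unsigned int count)
{
    if (rc == 0)
        return EXT_CTRLS_OK;

    /* On failure the documented range is 0..count inclusive. */
    assert(error_idx <= count);
    if (error_idx == count)
        return EXT_CTRLS_REJECTED_EARLY;
    return EXT_CTRLS_FAILED_AT_INDEX;
}
```

If the missing-ref theory holds, the 5-control batch should classify as rejected early rather than OK; an OK result for that batch would point back at the rc=0 discrepancy.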
### iter20's "zero SPS bytes" explained
iter20 showed `rkvdec sees sps[0..16]=00..00` for libva. That's because:
- HEVC_SPS *is* in the clone-hdl with `p_req_valid=1` — so it got STAGED.
- But the **content** in `req->p_new[SPS]` is all-zero.
Two possible reasons for zero content despite p_req_valid=1:
1. `user_to_new` ran on a zero-payload from libva. iter15 strace ruled this out — libva's SPS payload is non-zero at ioctl entry.
2. `new_to_req` ran, but copied from or to the wrong buffer. This is possible if the master/cluster lookup resolves incorrectly on the clone-hdl.
iter22 candidate: add a printk in `new_to_req` and `req_to_new` to log the copy: source pointer, dest pointer, first 4 bytes, payload size.
### Mechanism status (post-iter21)
| # | Mechanism | Status |
|---|---|---|
| 1 | request_fd mismatch | DISPROVED iter17/18 |
| 2 | REINIT clears | DISPROVED iter19 |
| 3 | Stack-locals stale | DISPROVED iter18 |
| 4 | ctrl_hdl mismatch | REFRAMED iter20 |
| 5 | error_idx silent failure | DISPROVED iter18 (but warrants re-check given iter21 finding) |
| 6 | req->p_new staging incomplete | **CONFIRMED iter21**: clone-hdl missing controls = staging cannot occur for 6 of 7 HEVC controls |
| 7 | **NEW iter21**: clone-hdl is missing controls that main_hdl has registered | **Root question for iter22** |
### Why is the clone incomplete?
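`v4l2_ctrl_request_clone(new_hdl, from=main_hdl)` iterates `main_hdl->ctrl_refs` in ID-sorted order. In the HEVC block the clone ends after HEVC_SPS (0xa40a90): HEVC_PPS (0xa40a91) and everything after it is absent. The H.264 block shows a gap rather than a stop: H264_PRED_WEIGHTS (0xa40905) and H264_SLICE_PARAMS (0xa40906) are absent, yet H264_DECODE_PARAMS (0xa40907) and later IDs are present, so the loop cannot have terminated at 0xa40905. In both blocks the first absent control is a compound control that follows compound controls which were cloned successfully.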
Hypothesis: `handler_new_ref` returns a non-zero error for the first compound control after an SPS-like single-struct compound, but **only when called from the request-clone path**. Alternatively, `kzalloc(sizeof(*new_ref) + size_extra_req)` fails for controls with a larger `elem_size`. The sizes noted so far (HEVC_PPS = 64 bytes, H264_PRED_WEIGHTS = 32 bytes) are small enough that an allocation failure is unlikely, but both the sizes and the allocation result are worth verifying.
Alt hypothesis: `handler_new_ref`'s auto-class-control insertion (`v4l2_ctrl_new_std`) fails for non-compound HEVC menu controls in the request-clone path, which propagates `hdl->error` and breaks subsequent iterations.
Same kernel succeeds for kdirect on the same `from` hdl, so something is **per-request-bind specific** — maybe related to request lifecycle timing in libva (iter6 permanent request_fd at CreateContext) vs kdirect (per-frame request_fd).
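The libva dump constrains the shape of the failure: 0xa40907 is listed although 0xa40905 and 0xa40906 are not. A toy userspace model (not the kernel loop; `model_clone` and `reject_h264_gap` are hypothetical names) shows which loop shapes can produce that:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical predicate: true if the model fails to clone this ID. */
typedef bool (*reject_fn)(uint32_t id);

/*
 * Toy model of a clone loop over ID-sorted source refs.
 * stop_on_reject = true  -> `break` at the first rejected ID
 * stop_on_reject = false -> `continue` past rejected IDs
 * Returns how many IDs were copied into `out`.
 */
static size_t model_clone(const uint32_t *src, size_t n, reject_fn reject,
                          bool stop_on_reject, uint32_t *out)
{
    size_t copied = 0;

    assert(src != NULL && out != NULL && reject != NULL);
    for (size_t i = 0; i < n; i++) {
        if (reject(src[i])) {
            if (stop_on_reject)
                break;
            continue;
        }
        out[copied++] = src[i];
    }
    return copied;
}

/* Rejects exactly the two H.264 IDs absent from the libva dump. */
static bool reject_h264_gap(uint32_t id)
{
    return id == 0xa40905 || id == 0xa40906;
}

/* H.264 block of the source handler, IDs as quoted in this section. */
static const uint32_t h264_block[] = {
    0xa40900, 0xa40901, 0xa40902, 0xa40903, 0xa40904,
    0xa40905, 0xa40906, 0xa40907,
};
```

With `stop_on_reject = true` the model copies five IDs and never reaches 0xa40907; with `false` it copies six, ending in 0xa40907, which is the pattern the libva dump shows for the H.264 block. So any explanation that relies on a `break` at 0xa40905 does not fit the observed list, whatever the underlying cause turns out to be.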
### Substrate state at iter21 close
- Backend SHA on fresnel: `c1d4bb53…` (iter15 stable, unchanged).
- Fork tip `e109306` (α-24 reverted).
- Kernel `linux-fresnel-fourier 7.0-5` with iter17 + iter20 + iter21 printks. NOT a shipping kernel.
- 5-codec anchors: unchanged. Zero regression.
### iter22 candidate
Add printks to `v4l2_ctrl_request_clone` and `handler_new_ref`:
```c
// in v4l2_ctrl_request_clone
pr_info("iter22_clone_start: new_hdl=%p from=%p\n", hdl, from);

// per iteration
err = handler_new_ref(hdl, ctrl, &new_ref, false, true);
pr_info("iter22_clone_step: id=0x%x err=%d from_other=%d\n",
        ctrl->id, err, ref->from_other_dev);
if (err) {
    pr_info("iter22_clone_break: at id=0x%x err=%d hdl_error=%d\n",
            ctrl->id, err, hdl->error);
    break;
}
```
After 7.0-6 deploys, the libva HEVC run should show which ctrl_id breaks the loop and with what error code. That localizes the failure to a `kzalloc` failure, a `v4l2_ctrl_new_std` failure (auto-class), or some other condition.
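Once those lines exist in the kernel log, picking out the failing step can be scripted. A sketch of a userspace parser for the proposed `iter22_clone_step` format, assuming the message text stays exactly as in the printk above (the struct and function names are illustrative):

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

struct clone_step {
    unsigned int id;  /* ctrl->id */
    int err;          /* return value of handler_new_ref */
    int from_other;   /* ref->from_other_dev */
};

/*
 * Parse one log line carrying an iter22_clone_step message.
 * Any prefix before the tag (timestamp, log level) is skipped.
 * Returns 1 on success, 0 if the line is not a clone_step line.
 */
static int parse_clone_step(const char *line, struct clone_step *out)
{
    static const char tag[] = "iter22_clone_step:";
    const char *p;

    assert(line != NULL && out != NULL);
    p = strstr(line, tag);
    if (p == NULL)
        return 0;
    return sscanf(p + sizeof(tag) - 1, " id=0x%x err=%d from_other=%d",
                  &out->id, &out->err, &out->from_other) == 3;
}
```

Feeding the log through this line by line and stopping at the first entry with `err != 0` gives the ID and error code directly, without relying on the separate `iter22_clone_break` line.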
### Lesson
iter21 overturns the iter11 to iter18 hypothesis space entirely. The S_EXT_CTRLS ioctl wire-byte payload analysis was correct: libva's bytes match kdirect's. But **at the v4l2_ctrl framework level, libva's request-clone is missing the registered controls libva tries to stage**. The bug is in how the V4L2 control framework handles libva's specific request-binding pattern, NOT in libva's ioctl content.
This is the strongest narrowing since iter17. We've gone from "anywhere in kernel" → "kernel control framework" → "request-clone path specifically" → "clone ends after HEVC_SPS, before the next HEVC control".