fresnel-fourier/phase8_iteration22_close.md
marfrit bf67900cd8 iter20-26: kernel-side root-cause localization, α-25/α-26 fix Bug 4, partial Bug 5
iter20-23: kernel printk in rkvdec_hevc_run + v4l2_ctrl_request_setup
iter24:    pinpointed rkvdec_s_ctrl returning -EBUSY for HEVC_SPS due
           to vb2_is_busy(CAPTURE) — libva pre-allocates 24 CAPTURE bufs
           before first per-frame S_EXT_CTRLS, blocking image_fmt reset
iter25 α-25: synthetic SPS injection before cap_pool_init seeds
           ctx->image_fmt to RKVDEC_IMG_FMT_420_8BIT while CAPTURE is
           still empty. H264 Bug 4 fully fixed (byte-equal kdirect).
           HEVC Bug 5 frame 1 fixed (byte-equal kdirect).
iter26 α-26: populate decode_params.short_term_ref_pic_set_size from
           picture->st_rps_bits (VAAPI does expose it). Bytes 4-5 of
           dp now match kdirect. HEVC frame 2+ still diverges
           (separate bug, likely DPB entry mapping).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 10:10:56 +00:00

## Iteration 22 — Phase 8 (close)
Closes 2026-05-14. iter22 = kernel printk in `v4l2_ctrl_request_clone` tracing each `handler_new_ref` step. FULL close. **iter21's mid-conclusion is overturned: the request-clone-hdl is COMPLETE for libva — all 22 controls cloned with err=0.**
### Method
`linux-fresnel-fourier 7.0-6` added per-step pr_info to `v4l2_ctrl_request_clone`:
```c
pr_info("iter22_clone_start: new_hdl=%p from=%p\n", hdl, from);
list_for_each_entry(ref, &from->ctrl_refs, node) {
	...
	err = handler_new_ref(hdl, ctrl, &new_ref, false, true);
	pr_info("iter22_clone_step: id=0x%x err=%d hdl_error=%d new_ref=%p\n",
		ctrl->id, err, hdl->error, new_ref);
	...
}
pr_info("iter22_clone_end: hdl=%p err=%d\n", hdl, err);
```
Built ~2 min via ccache. Deployed via scp + pacman -U + reboot.
### Results
**libva HEVC** (11 clones — one per request_fd binding):
Every clone-step logs `err=0 hdl_error=0 new_ref=valid_ptr`. Each clone has **22 controls successfully added**, ending with `iter22_clone_end err=0`. Full ID list per clone:
```
0x990001, 0x990a67, 0x990a6b, 0x990b00, 0x990b67, 0x990b68,
0xa40001, 0xa40900, 0xa40901, 0xa40902, 0xa40903, 0xa40904, 0xa40907,
0xa40a2c, 0xa40a2d,
0xa40a90, 0xa40a91, 0xa40a92, 0xa40a93, 0xa40a94, 0xa40a95, 0xa40a96
```
**Note**: H264_PRED_WEIGHTS (0xa40905) and H264_SLICE_PARAMS (0xa40906) are NOT in the source main_hdl (these are not registered in rkvdec_h264_ctrls on this kernel — rkvdec doesn't expose them). All 7 HEVC stateless decode controls ARE present.
**kdirect HEVC** (13 clones): identical pattern, identical ID set, all err=0.
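The ID list above can be sanity-checked mechanically. The following is a minimal userspace sketch, not kernel code: the class mask and base values are copied from `include/uapi/linux/v4l2-controls.h` as I understand it, while `iter22_ids`, `is_class_root`, `is_hevc_stateless` and `count_ids` are hypothetical names introduced here for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Values mirrored from include/uapi/linux/v4l2-controls.h (assumed unchanged). */
#define CTRL_CLASS_CODEC           0x00990000u
#define CTRL_CLASS_CODEC_STATELESS 0x00a40000u
/* V4L2_CID_CODEC_STATELESS_BASE + 400 */
#define CID_STATELESS_HEVC_SPS     (CTRL_CLASS_CODEC_STATELESS + 0x900u + 400u)

/* The 22 IDs logged by iter22_clone_step, in trace order. */
static const uint32_t iter22_ids[] = {
	0x990001, 0x990a67, 0x990a6b, 0x990b00, 0x990b67, 0x990b68,
	0xa40001, 0xa40900, 0xa40901, 0xa40902, 0xa40903, 0xa40904, 0xa40907,
	0xa40a2c, 0xa40a2d,
	0xa40a90, 0xa40a91, 0xa40a92, 0xa40a93, 0xa40a94, 0xa40a95, 0xa40a96,
};
#define ITER22_NUM_IDS (sizeof(iter22_ids) / sizeof(iter22_ids[0]))

/* Same mask as the kernel's V4L2_CTRL_ID2WHICH(). */
static uint32_t ctrl_class(uint32_t id)
{
	return id & 0x0fff0000u;
}

/* A class root has ID (class | 1), e.g. 0x990001 or 0xa40001. */
static int is_class_root(uint32_t id)
{
	return (id & 0xffffu) == 1u;
}

/* Hypothetical helper: true for the seven HEVC IDs 0xa40a90..0xa40a96. */
static int is_hevc_stateless(uint32_t id)
{
	return id >= CID_STATELESS_HEVC_SPS && id <= CID_STATELESS_HEVC_SPS + 6u;
}

/* Hypothetical helper: count trace IDs for which pred() returns non-zero. */
static size_t count_ids(int (*pred)(uint32_t))
{
	size_t i, n = 0;

	assert(pred != NULL);
	for (i = 0; i < ITER22_NUM_IDS; i++)
		if (pred(iter22_ids[i]))
			n++;
	return n;
}
```

Under these assumptions, `count_ids(is_class_root)` and `count_ids(is_hevc_stateless)` give the class-root and HEVC counts for the traced set, and `ctrl_class()` separates the `0x99xxxx` codec controls from the `0xa4xxxx` stateless ones.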
### What this overturns from iter21
iter21 concluded the clone-hdl was missing 6 HEVC controls for libva. **That was wrong.** The clone-hdl actually has all 22 controls. The iter21_setup_ref printk's iteration was filtering 8 of them out via the early-return check:
```c
if (ref->req_done || (ctrl->flags & V4L2_CTRL_FLAG_READ_ONLY))
	continue;
```
For libva, only 14 refs reach the printk; for kdirect, 21 do. Against 22 cloned controls, that is 8 skipped for libva vs 1 for kdirect.
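To make the printk-placement effect concrete, here is a small userspace model of the skip check. It is a sketch only: `struct model_ref`, `struct trace_counts` and `walk_refs` are hypothetical stand-ins, and the only value taken from the kernel is the `V4L2_CTRL_FLAG_READ_ONLY` bit (0x0004).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define FLAG_READ_ONLY 0x0004ul /* same value as V4L2_CTRL_FLAG_READ_ONLY */

/* Hypothetical stand-in for the fields the skip check looks at. */
struct model_ref {
	bool req_done;
	unsigned long flags;
};

struct trace_counts {
	size_t before_continue; /* refs logged by a printk placed before the check */
	size_t after_continue;  /* refs logged by a printk placed after the check */
};

/* Walk the refs once and count what each printk position would report. */
static struct trace_counts walk_refs(const struct model_ref *refs, size_t n)
{
	struct trace_counts c = { 0, 0 };
	size_t i;

	assert(refs != NULL || n == 0);
	for (i = 0; i < n; i++) {
		c.before_continue++;
		if (refs[i].req_done || (refs[i].flags & FLAG_READ_ONLY))
			continue;
		c.after_continue++;
	}
	return c;
}
```

Feeding it 22 refs of which 8 satisfy the skip condition yields 22 before the check and 14 after it, matching the libva numbers. The model says nothing about why a given ref is skipped; that is the open question.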
### What this implies
Two of the 8 skipped controls are clearly READ_ONLY: codec class roots `0x990001` and `0xa40001` (V4L2_CTRL_TYPE_CTRL_CLASS). That accounts for 2 of libva's 8 skips, and would predict 2 skips for kdirect as well.
For libva, **6 additional HEVC controls** (PPS, SLICE_PARAMS, SCALING_MATRIX, DECODE_PARAMS, DECODE_MODE, START_CODE) get skipped. For kdirect, the first guess was that **DECODE_MODE + START_CODE** get skipped (the 2 with p_req_valid=0 that aren't staged this frame).
That guess does not hold: kdirect shows DECODE_MODE + START_CODE in the setup_ref printk with p_req_valid=0, so they are NOT skipped by the `continue` check.
Re-examining the iter21 setup_ref output for kdirect: the dump had 21 ctrl_refs lines (excluding the `setup:` line). 22 cloned - 21 setup_ref = 1 skipped, which is one fewer than the two class roots predict. Possibly only one class root is present in kdirect's clone-hdl somehow; this mismatch is unresolved.
So the actual difference is: libva skips 8, kdirect skips 1. The 7 EXTRA skips for libva are: 6 HEVC controls (PPS, SLICE_PARAMS, SCALING_MATRIX, DECODE_PARAMS, DECODE_MODE, START_CODE) + 1 mystery.
### Mechanism status (post-iter22)
| # | Mechanism | Status |
|---|---|---|
| 1 | request_fd mismatch | DISPROVED |
| 2 | REINIT clears | DISPROVED iter19 |
| 3 | Stack-locals stale | DISPROVED iter18 |
| 4 | ctrl_hdl mismatch | REFRAMED iter20 |
| 5 | error_idx silent failure | DISPROVED iter18 |
| 6 | req->p_new staging incomplete | iter21 → 22 OVERTURNED (clone IS complete) |
| 7 | Clone-hdl missing controls | DISPROVED iter22 (clone has all 22) |
| 8 | **NEW iter22**: 6 of 7 HEVC controls get skipped in v4l2_ctrl_request_setup loop for libva but not kdirect | **leading hypothesis** |
### iter23 candidate
Add printk **inside** the `v4l2_ctrl_request_setup` loop **before** the `continue` check, logging `req_done`, `flags`, `ncontrols`, `cluster[0]->id`:
```c
/* flags is unsigned long and ncontrols is unsigned int, hence %lx and %u */
pr_info("iter23_loop: id=0x%x req_done=%d flags=0x%lx ncontrols=%u cluster0_id=0x%x\n",
	ctrl->id, ref->req_done, ctrl->flags,
	master->ncontrols,
	master->cluster[0] ? master->cluster[0]->id : 0);
if (ref->req_done || (ctrl->flags & V4L2_CTRL_FLAG_READ_ONLY))
	continue;
```
Two possible findings:
- **req_done already true** for the 6 HEVC controls → an earlier iteration (HEVC_SPS) clustered them and set req_done. Means main_hdl's HEVC_SPS has `master->cluster` containing PPS+SLICE+SCALING+DECODE_PARAMS+DECODE_MODE+START_CODE on libva's path.
- **flags has READ_ONLY** → the controls have READ_ONLY set, which is wrong for stateless decode controls.
`ncontrols` and `cluster0_id` reveal cluster membership directly: if `ncontrols > 1` for HEVC_SPS, it's been clustered with siblings.
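The first finding (req_done set by an earlier cluster) can be illustrated with a userspace model of the setup loop. This is a simplified sketch loosely following the shape of `v4l2_ctrl_request_setup`, not the kernel code itself: `struct cluster_ref` and `run_setup_loop` are hypothetical, the READ_ONLY check is left out, and cluster membership is reduced to a shared master index.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct v4l2_ctrl_ref plus its control. */
struct cluster_ref {
	unsigned int id;
	size_t master;     /* index of this ref's cluster master (cluster[0]) */
	bool p_req_valid;  /* the request staged new data for this control */
	bool req_done;     /* handled already, possibly via another ref's cluster */
	bool reached_body; /* this ref's own iteration got past the skip check */
};

/* Returns how many refs get past the req_done check in their own iteration. */
static size_t run_setup_loop(struct cluster_ref *refs, size_t n)
{
	size_t i, j, reached = 0;

	for (i = 0; i < n; i++) {
		bool have_new_data = false;
		size_t m = refs[i].master;

		assert(m < n);
		if (refs[i].req_done)
			continue; /* already handled as part of an earlier cluster */
		refs[i].reached_body = true;
		reached++;

		/* Does any member of this cluster carry staged data? */
		for (j = 0; j < n; j++)
			if (refs[j].master == m && refs[j].p_req_valid)
				have_new_data = true;
		if (!have_new_data)
			continue;

		/* Handling the cluster marks every member as done. */
		for (j = 0; j < n; j++)
			if (refs[j].master == m)
				refs[j].req_done = true;
	}
	return reached;
}
```

With seven refs that are each their own master, all seven reach the loop body. With all seven pointing at ref 0 as master, only ref 0 does, and the other six are skipped via `req_done`. That is the 1-of-7 pattern seen for libva, if the clustering hypothesis is right.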
### Substrate state at iter22 close
- Backend SHA on fresnel: `c1d4bb53…` (iter15 stable, unchanged).
- Fork tip `e109306` (α-24 reverted) — unchanged.
- Kernel `linux-fresnel-fourier 7.0-6` with iter17 + iter20 + iter21 + iter22 printks.
- 5-codec anchors: unchanged.
### Iter23 build kicked off
`linux-fresnel-fourier 7.0-7` building on boltzmann (PID 1643224, log /tmp/iter23-kbuild.log). Expected ~2 min via ccache.
### Lesson
iter21's mid-conclusion was based on the wrong printk position — the `iter21_setup_ref` printk was inside `v4l2_ctrl_request_setup`'s loop but AFTER the early-`continue` checks, missing controls that get skipped. iter22's clone-trace showed the clone IS complete; the staging FAILS via the setup-loop SKIP path, not the clone path.
The empirical pattern is now: **libva's per-frame request gets through clone correctly; gets through S_EXT_CTRLS correctly (stages all 5 controls with p_req_valid=1; for SPS at least, it is definitely 1); but at setup-loop time, 6 of the 7 HEVC controls get a `continue` that bypasses `req_to_new`**. SPS alone reaches `req_to_new` → `try_or_set_cluster` → commits to `p_cur`, so rkvdec_run reading ctx->ctrl_hdl[SPS]->p_cur should see non-zero content.
But iter20 said sps[0..16] was zero for libva. If SPS is the only one that reaches req_to_new + try_or_set_cluster, then SPS's content SHOULD be correct, yet iter20 said it's zero. So SPS itself ALSO has a problem in the commit path.
So we have:
- 6 of 7 HEVC controls: never reach `req_to_new` (skipped in setup loop).
- 1 of 7 (SPS): reaches `req_to_new` but resulting `p_cur` content is zero anyway.
These are TWO separate bugs (or one bug with two symptoms). iter23 will reveal the skip mechanism; another iter (or test) may need to address the SPS-commit-content issue.