fresnel-fourier/phase8_iteration18_close.md
marfrit a449cec92e iter18 Phase 8 close: mechanisms 3 + 5 disproved; iter17 finding stands
α-21 (heap-persist HEVC controls past IOC_QUEUE): hash unchanged.
  -> Kernel does copy at S_EXT_CTRLS time, not deferred. Mechanism 3 dead.

α-22 (log error_idx after S_EXT_CTRLS): error_idx = count - 1 in BOTH
  the working device-init batch AND the broken per-frame batch. Not
  a failure indicator in this kernel version. Mechanism 5 dead.

Backend reverted to iter15 stable state c1d4bb53... All 5-codec
anchors preserved.

Remaining mechanisms (untested):
  1. request_fd mismatch (unlikely; strace shows consistent fd)
  2. REINIT clears controls between S_EXT_CTRLS and QUEUE (LEADING)
  4. ctrl_hdl mismatch (libva submits to one, rkvdec reads from another)

iter17's empirical finding still stands as the campaign's strongest
narrowing: rkvdec sees zero SPS for libva, correct for kdirect. The
mechanism is between S_EXT_CTRLS submission and ctx->ctrl_hdl->p_cur
read, specific to libva's invocation pattern.

iter19 candidate (α-23): test mechanism 2 by disabling
media_request_reinit() in libva's RequestSyncSurface. If hashes
change, REINIT timing is the bug. Alternative (mechanism 4): kernel
printk that dumps &ctx->ctrl_hdl + per-request handler pointer,
comparing libva vs kdirect.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 09:02:19 +00:00

# Iteration 18 — Phase 8 (close)
Closed 2026-05-14. iter18 tested mechanisms 3 (stale pointer) and 5 (silent partial failure) as explanations for iter17's finding. PARTIAL close: both mechanisms disproved, root cause still open.
## Mechanism tests
### α-21 (mechanism 3 — stale stack-local pointers)
Made libva's HEVC control structs `static` (file-scope), persisting indefinitely. Suppressed `free(slice_params_array)` so the heap-allocated SLICE_PARAMS also persists past `MEDIA_REQUEST_IOC_QUEUE`.
**Result**: Hash unchanged (`06b2c5a0…`). Kernel pr_info still shows `w=0 h=0` for libva. Mechanism 3 **DISPROVED** — kernel does copy at S_EXT_CTRLS time, not deferred.
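The inference behind α-21 can be illustrated with a toy model. This is a minimal sketch, not kernel code: `ImmediateCopy`, `DeferredRead` and `decode` are hypothetical names, and overwriting the buffer stands in for a stack frame being reused after `S_EXT_CTRLS` returns. The SPS values are the kdirect ones reported in this log.

```python
import copy

class ImmediateCopy:
    """Toy kernel that copies the payload when the controls are set."""
    def set_ctrls(self, buf):
        self.saved = copy.deepcopy(buf)
    def run(self):
        return self.saved

class DeferredRead:
    """Toy kernel that keeps the caller's pointer and reads it later."""
    def set_ctrls(self, buf):
        self.saved = buf
    def run(self):
        return self.saved

def decode(kernel, persist):
    buf = {"w": 1280, "h": 720}
    kernel.set_ctrls(buf)
    if not persist:
        # Models the caller's storage being reused before the decode runs.
        buf["w"] = 0
        buf["h"] = 0
    return dict(kernel.run())

for kernel_cls in (ImmediateCopy, DeferredRead):
    same = decode(kernel_cls(), True) == decode(kernel_cls(), False)
    print(kernel_cls.__name__, "persistence irrelevant:", same)
```

Only the deferred-read model is sensitive to whether the buffers persist. α-21 observed no sensitivity, which matches the immediate-copy model; the zeros the real kernel sees therefore have to come from somewhere other than stale caller memory.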
### α-22 (mechanism 5 — silent partial failure via error_idx)
Added libva-side logging of `controls.error_idx` after each `S_EXT_CTRLS`. Output:
```
S_EXT_CTRLS rc=0 errno=0 count=2 error_idx=1 request_fd=0 which=0x0 # device-init: WORKS
S_EXT_CTRLS rc=0 errno=0 count=5 error_idx=4 request_fd=8 which=0xf010000 # per-frame: BROKEN
```
`error_idx = count - 1` in BOTH the working case (device-init, which sets HEVC_DECODE_MODE + HEVC_START_CODE correctly) and the broken per-frame case. This is **not a failure indicator** in this kernel version; on success it appears to be simply the index of the last control processed.
A follow-up α-22 test removed DECODE_PARAMS (count=4). error_idx was still count-1 (=3) and the kernel state remained all-zero, so removing DECODE_PARAMS did not unblock decoding.
Mechanism 5 **DISPROVED** — error_idx isn't reporting partial failure.
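The check behind this conclusion can be reproduced mechanically from the log. The following is a minimal sketch that parses the two lines quoted above (copied verbatim) and tests whether `error_idx == count - 1` on a successful call; the helper names `parse_line` and `error_idx_is_last` are made up for illustration and are not part of libva.

```python
import re

# The two S_EXT_CTRLS log lines from the α-22 output, copied verbatim.
LOG = """\
S_EXT_CTRLS rc=0 errno=0 count=2 error_idx=1 request_fd=0 which=0x0 # device-init: WORKS
S_EXT_CTRLS rc=0 errno=0 count=5 error_idx=4 request_fd=8 which=0xf010000 # per-frame: BROKEN
"""

FIELD = re.compile(r"(\w+)=(\S+)")

def parse_line(line):
    """Return the key=value fields of one log line as integers."""
    body = line.split("#", 1)[0]  # drop the trailing annotation
    return {key: int(value, 0) for key, value in FIELD.findall(body)}

def error_idx_is_last(rec):
    """True when the call succeeded and error_idx equals count - 1."""
    return rec["rc"] == 0 and rec["error_idx"] == rec["count"] - 1

records = [parse_line(l) for l in LOG.splitlines() if l.startswith("S_EXT_CTRLS")]
for rec in records:
    print(rec["count"], rec["error_idx"], error_idx_is_last(rec))
```

Because the predicate holds for the working batch as well as the broken one, it cannot distinguish the two, which is the basis for ruling out mechanism 5.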
## Both reverted, backend restored to iter15 state (`c1d4bb53…`)
iter18 ships zero code changes. Both tests disproved the hypothesis they were built to check.
## Remaining mechanisms (post-iter18)
| # | Mechanism | Status |
|---|---|---|
| 1 | request_fd mismatch | unlikely (strace shows consistent fd) |
| 2 | REINIT clears controls between S_EXT_CTRLS and QUEUE | **untested — leading hypothesis** |
| 3 | Stack-locals stale | ❌ DISPROVED by α-21 |
| 4 | ctrl_hdl mismatch (libva submits to one, rkvdec reads from another) | **untested — possible** |
| 5 | Silent partial failure via error_idx | ❌ DISPROVED by α-22 |
## iter19 candidate (α-23)
Mechanism 2 test: temporarily disable `media_request_reinit()` in libva's `RequestSyncSurface` for HEVC. If the controls survive once REINIT no longer runs (the kernel sees non-zero SPS and the output hash changes), mechanism 2 is confirmed. The fix would then be a reorder: REINIT must run **before** the next S_EXT_CTRLS, NOT after the previous decode (which is libva's current iter6 model).
Alternatively (mechanism 4): add a deeper kernel printk that dumps the `ctx->ctrl_hdl` pointer and the per-request `req->req` (V4L2 request handler) pointer, comparing a libva-triggered decode with a kdirect-triggered one. If the two paths show different handlers, libva is staging its controls to the wrong one.
The kernel-side approach (deeper printk) is more invasive but more definitive. A variant: rebuild rkvdec_hevc_run_preamble to dump `&ctx->ctrl_hdl` AND the first 16 bytes of `*run->sps`. If the pointer is the same as a previous frame's, that suggests no per-request update is taking place.
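The ordering hypothesis behind mechanism 2 can be stated as a toy model. This is a sketch under stated assumptions, not a description of the kernel or of libva: `ToyRequest` is a hypothetical stand-in for a media request, its three methods loosely model `S_EXT_CTRLS` with a request_fd, `MEDIA_REQUEST_IOC_REINIT` and `MEDIA_REQUEST_IOC_QUEUE`, and the bad ordering is the hypothesis still to be tested by α-23, not an observed fact.

```python
class ToyRequest:
    """Toy media request: holds staged controls until it is queued."""
    def __init__(self):
        self.staged = {}

    def set_ctrls(self, ctrls):
        # Models S_EXT_CTRLS issued against this request.
        self.staged.update(ctrls)

    def reinit(self):
        # Models REINIT: the request is reset and staged state is dropped.
        self.staged.clear()

    def queue(self):
        # Models QUEUE: the driver sees whatever is staged, else zeros.
        return {"w": self.staged.get("w", 0), "h": self.staged.get("h", 0)}

SPS = {"w": 1280, "h": 720}  # kdirect values reported in this log

def hypothesized_bad_order():
    req = ToyRequest()
    req.set_ctrls(SPS)
    req.reinit()          # REINIT lands between S_EXT_CTRLS and QUEUE
    return req.queue()

def proposed_order():
    req = ToyRequest()
    req.reinit()          # REINIT runs before the next S_EXT_CTRLS
    req.set_ctrls(SPS)
    return req.queue()

print(hypothesized_bad_order(), proposed_order())
```

In the model, only the ordering with REINIT between the control submission and the queue yields the all-zero SPS seen for libva. Whether libva actually produces that ordering is exactly what disabling `media_request_reinit()` is meant to establish.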
## Substrate state at iter18 close
- Fork tip `fc78ed4` on noether + fresnel + gitea (clean iter15 state).
- Backend SHA `c1d4bb53…` on fresnel (iter15 stable).
- Kernel `linux-fresnel-fourier 7.0-3` (with diagnostic printk, to be kept for iter19).
- 5-codec anchors: byte-identical to iter15 anchors. Zero regression.
## iter17 finding stands
Despite iter18's negative results on mechanisms 3 and 5, the iter17 empirical finding remains the campaign's strongest narrowing:
**The kernel sees zero control values for libva HEVC (`run.sps->{w,h,reorder,chroma}=0,0,0,0`) but correct values for kdirect (`w=1280, h=720, reorder=2, chroma=1`).**
The mechanism is still unknown but localized. The remaining productive direction is targeted kernel investigation of where libva's S_EXT_CTRLS payload lands vs where rkvdec reads from.
## Lessons
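1. **error_idx semantics differ between kernel versions / paths.** Don't rely on it for partial-failure detection without first verifying the success-case value. The V4L2 documentation describes `error_idx` as undefined when the ioctl returns 0, which is consistent with the rc=0 values logged here carrying no failure information.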
2. **Stack-local control pointers are SAFE for V4L2 compound controls** — the kernel copies immediately at S_EXT_CTRLS time.
3. The S_EXT_CTRLS → request → ctrl_hdl chain has at least one bug specific to libva's invocation pattern on RK3399 rkvdec, despite identical wire bytes vs kdirect.