diff --git a/iter2_close.md b/iter2_close.md
new file mode 100644
index 0000000..9de1c1f
--- /dev/null
+++ b/iter2_close.md
@@ -0,0 +1,103 @@
+# iter2 close: F1 falsifier fired, negative-result iteration
+
+Closed 2026-05-16 evening. Phase 6 Step 5 (smoke test) hit the F1 falsifier path documented in Phase 1: HEVC OOPS still reproduces with the backend patch installed. **The mechanism reconstruction from Phase 0 + Phase 5 was incomplete.** iter2 produced 4 atomic source commits (vendored parser + GLib compat shim + UAPI header + h265_set_controls wiring), a clean build, and a comprehensive instrumented falsifier capture. The patch did not unblock HEVC, but the evidence narrows the root-cause search.
+
+## What was tried (commit chain on `marfrit/libva-v4l2-request-fourier`)
+
+| Step | Commit | Status |
+|------|--------|--------|
+| 1: vendor GStreamer 1.28.2 unchanged | `f91c3f5` | landed |
+| 2: GLib compat shim → build green | `c5fbc5b` | landed |
+| 3: UAPI header + per-fd runtime probe | `4f6ba6c` | landed; probe correctly reports `has_hevc_ext_sps_rps_rkvdec=true` on ampere |
+| 4: h265_set_controls populates EXT_SPS_*_RPS via vendored parser | `1a2c958` | landed; build clean |
+| 5: install + reboot + smoke test (5-frame HEVC decode) | n/a (no commit) | **F1 fired**: OOPS reproduces unchanged |
+
+`.so` md5 currently installed on ampere: `96337ae14bfdbb32f2dfa98b026d0c57` (iter2 step 4 build).
+
+## Smoke-test evidence (Phase 7 instrument, pre-completion)
+
+`strace -ff -e trace=ioctl` of the failing ffmpeg run captured the per-frame ioctl sequence. Each per-frame `VIDIOC_S_EXT_CTRLS` call carries 5 controls, with IDs `0xa40a90` (HEVC_SPS), `0xa40a91` (HEVC_PPS), `0xa40a92` (HEVC_SLICE_PARAMS), `0xa40a93` (HEVC_SCALING_MATRIX), `0xa40a94` (HEVC_DECODE_PARAMS).
+The two new EXT controls (`V4L2_CID_STATELESS_HEVC_EXT_SPS_ST_RPS` = base + 408 = `0xa40a98`, `_LT_RPS` = `0xa40a99`) **do not appear in any S_EXT_CTRLS call.**
+
+The standard 5-control call returns `EINVAL` with `error_idx=5`, i.e. equal to the control count (kernel rejects the whole submission for a structural reason, not a per-control validation failure). The backend's existing logic ignores the EINVAL and proceeds to `MEDIA_REQUEST_IOC_QUEUE`, which is where the kernel OOPSes.
+
+OOPS register state (identical across all three captures pre- and post-iter2 patch):
+
+```
+pc : __pi_memcmp+0x10/0x110
+lr : rkvdec_hevc_prepare_hw_st_rps+0x38/0x300 [rockchip_vdec]
+x0 : ffff00015ade6c10 (cache pointer: kernel heap, valid)
+x1 : 00000000000051a0 (run->ext_sps_st_rps: SMALL INTEGER treated as pointer, NOT a valid kernel address)
+x2 : 0000000000000048 (sizeof struct = 72 bytes)
+```
+
+`x1 = 0x51a0` is the smoking gun. `pgd=0000000000000000, p4d=0000000000000000` confirms the kernel has no page-table mapping for that address. It's not a userspace pointer (those would be in the lower half but in valid mapped ranges); it looks like an OFFSET or COUNT value treated as if it were a pointer.
+
+## Hypothesis update post-iter2
+
+Phase 0's hypothesis ("UAPI gap: backend doesn't populate the new CIDs") is **partially refuted**:
+
+1. We populated the controls in the backend (Step 4), but strace shows they don't reach the kernel.
+2. The OOPS happens with `run->ext_sps_st_rps = 0x51a0`, a value that LOOKS like an uninitialized field or partially-decoded offset, not a NULL or a backend-supplied pointer.
+3. Therefore the kernel-side state path is what's broken: whether userspace DOESN'T set the controls (pre-iter2 state) or DOES try (iter2 state, with the backend's submission silently absent from the ioctl trace), the kernel ends up with `run->ext_sps_st_rps = 0x51a0` and OOPSes.
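
The control IDs and register values quoted above can be cross-checked with a little arithmetic. The following is a minimal sketch, not part of the backend: the class and base constants are the mainline V4L2 UAPI values, the `+408`/`+409` offsets come from this write-up, and the 48-bit arm64 address split, the 4 KiB page size, and the helper name `classify_address` are assumptions made for illustration.

```python
# Cross-check of the control IDs and OOPS registers quoted in this write-up.
# Class/base constants are the mainline V4L2 UAPI values; the +408/+409
# offsets for the EXT_SPS_*_RPS controls are taken from the text above.
V4L2_CTRL_CLASS_CODEC_STATELESS = 0x00A40000
V4L2_CID_CODEC_STATELESS_BASE = V4L2_CTRL_CLASS_CODEC_STATELESS | 0x900

HEVC_CIDS = {
    "HEVC_SPS": V4L2_CID_CODEC_STATELESS_BASE + 400,
    "HEVC_PPS": V4L2_CID_CODEC_STATELESS_BASE + 401,
    "HEVC_SLICE_PARAMS": V4L2_CID_CODEC_STATELESS_BASE + 402,
    "HEVC_SCALING_MATRIX": V4L2_CID_CODEC_STATELESS_BASE + 403,
    "HEVC_DECODE_PARAMS": V4L2_CID_CODEC_STATELESS_BASE + 404,
    "HEVC_EXT_SPS_ST_RPS": V4L2_CID_CODEC_STATELESS_BASE + 408,
    "HEVC_EXT_SPS_LT_RPS": V4L2_CID_CODEC_STATELESS_BASE + 409,
}

# Assumptions: arm64 with 48-bit virtual addresses and 4 KiB pages.
KERNEL_VA_START = 0xFFFF_0000_0000_0000
PAGE_SIZE = 4096


def classify_address(addr: int) -> str:
    """Coarse classification of a 64-bit register value (hypothetical helper)."""
    if addr >= KERNEL_VA_START:
        return "kernel-half address"
    if addr < PAGE_SIZE:
        return "NULL page (NULL plus a small offset)"
    return "low address outside the kernel half"


# Register values copied from the OOPS dump.
X0 = 0xFFFF00015ADE6C10  # cache pointer
X1 = 0x51A0              # run->ext_sps_st_rps
X2 = 0x48                # memcmp length

for name, cid in HEVC_CIDS.items():
    print(f"{name:22s} {cid:#x}")

print("x0:", classify_address(X0))
print("x1:", classify_address(X1))
print("x2 in bytes:", X2)
print("x1 modulo x2:", X1 % X2)
```

Under these assumptions `x1` lands in the low half above the first page and is not a whole multiple of the 72-byte compare length, which fits the reading that it is neither a NULL-derived pointer nor a plain element offset.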
+
+The 0x51a0 value is consistent across all captured OOPSes (pre-iter2 + post-iter2), which means it's a deterministic kernel-side state, probably an uninitialized `ctrl->p_cur.p` field or a confused offset-vs-pointer somewhere in the V4L2 controls framework or in rkvdec-hevc-common.c itself.
+
+## Why our Step 4 controls don't reach the kernel: open question
+
+Three plausible reasons, ordered by how cheap they are to diagnose next:
+
+1. **`h265_populate_ext_sps_rps_cache` returns an error and silently skips the submission.** The function returns `-ENODATA` if no SPS NAL is found in `surface_object->source_data`. ffmpeg-vaapi's HEVC submission protocol may not put the SPS NAL into the slice data buffer the backend sees; this was the Phase 5 reviewer's #3 finding (consistent with the absence of the new CIDs in strace, although candidate 2 would produce the same trace). To diagnose: add a `request_log` call in the populate function that logs what it found in source_data and the error return code.
+2. **Backend's h265 code path goes through `picture.c` instead of `h265.c::h265_set_controls`.** Some backends route per-codec submission through a generic `picture.c::codec_store_buffer` + per-codec dispatchers; if HEVC's `h265_set_controls` isn't the actual call site, our Step 4 edits don't fire. To diagnose: grep `picture.c` for HEVC-related dispatch + log entry into `h265_set_controls`.
+3. **The kernel-side `0x51a0` value is independent of userspace.** Possibly `ctrl->p_cur.p` for a `.dims = { 65 }` control gets initialized to something like `65 * sizeof(struct)`, but that does not match `0x51a0`: 65 * 80 = 5200 = `0x1450`, and with the 72-byte size seen in `x2`, 65 * 72 = 4680 = `0x1248`. Alternatively, a count field may be read as a pointer somewhere in the dynamic-array control framework. To diagnose: kernel-side instrumentation (printk in `rkvdec_hevc_prepare_hw_st_rps` to dump `run->ext_sps_st_rps` as it's read from `ctrl->p_cur.p`).
+
+## Loopback decision
+
+**Loopback to Phase 0** with re-opened `kernel-agent#11` per the Phase 1 falsifier path.
+iter3 (the kernel-side investigation iteration) starts its Phase 0 with:
+- The OOPS register state (`x1 = 0x51a0` is the load-bearing observation)
+- The strace evidence that our backend controls don't reach the kernel
+- The three open-question candidates above
+- The kernel-side gate `ctx->has_sps_st_rps |= !!(ctrl->has_changed)`: the `0x51a0` value would only be read IF `has_sps_st_rps` is true. Open question: how does `has_sps_st_rps` become true if userspace never sets the controls? Possibly via an init path during ctrl_handler registration that flips `has_changed` synthetically.
+
+## Source code state (rollback / cleanup decisions)
+
+The 4 source commits (`f91c3f5..1a2c958`) **stay on the backend `master`**: they don't break anything (existing 5-codec functionality preserved per the per-driver-kind gate), they introduce no regression of their own (HEVC OOPSes with or without them), and they constitute REAL infrastructure that iter3+ may need (the vendored parser + the UAPI shim header + the runtime probe machinery are all reusable once the actual kernel-side fix is in place).
+
+If iter3 concludes the userspace integration is wrong-shaped entirely (e.g. needs to be via VAAPI extension buffer not S_EXT_CTRLS), iter3's Phase 6 reverts these commits cleanly.
+
+The `.so` installed at `/usr/lib/dri/v4l2_request_drv_video.so` (md5 `96337ae14b...`) keeps iter2's binary. ampere's other 4 codecs (H.264, VP8, MPEG-2, VP9 if enabled) should still work, since Step 4's gate is HEVC-only.
+
+## Meta-campaign ledger update
+
+iter2 closes with F1 falsifier fired. iter3 entry:
+
+| Sub-iter | Status |
+|----------|--------|
+| iter2 (HEVC backend extension) | **closed F1**: backend infrastructure landed (parser vendored, UAPI shim, runtime probe, set_controls gate); root cause is NOT the UAPI gap alone |
+| iter3 (kernel-side investigation, was: VP9 enablement) | **bumped from "VP9 next" to "HEVC kernel investigation first"**; VP9 deferred to iter4. The HEVC OOPS evidence narrows the search and is more urgent (blocks the cascade-wedge of all decoders on ampere). |
+| iter4 (VP9 kernel enablement) | bumped to after iter3 |
+
+`marfrit/kernel-agent#11` re-opens with the iter2 evidence appended. The reclassification "userspace-only fix" from Phase 0 prior-art survey is **partially overturned**: kernel-side instrumentation is now required.
+
+## Lesson distilled for memory
+
+> When the mechanism reconstruction in Phase 0 is based on source-reading of the kernel without instrumented confirmation of the symptom path (strace + register dump), the resulting Phase 4 plan is at higher-than-claimed risk. iter2's prior-art survey found upstream consumers that populate the controls successfully; we replicated their architecture; the symptom persists. The gap is in our model of how OUR backend's submission reaches the kernel, not in the upstream-pattern fidelity.
+
+Memory entry to file: see iter2 Phase 8 close (separate file).
+
+## Phase 6/7/8 close (consolidated)
+
+Phase 6 landed the 4 atomic commits (Steps 1-4); the Step 5 smoke test triggered F1. Phase 7 verification reduces to "F1 falsifier captured": no full C1-C8 sweep is meaningful when HEVC doesn't even decode 5 frames. Phase 8 below.
+
+iter2 closes 2026-05-16.
+
+---
+
+## Phase 8 memory entry (filed)
+
+[`feedback_verify_symptom_mechanism_first.md`](../../.claude/projects/-home-mfritsche-claude/memory/feedback_verify_symptom_mechanism_first.md) at the cross-campaign memory location.
+
+> Upstream-aligned fix architecture is necessary but not sufficient: first verify (via strace/gdb/printk) that the symptom's IMMEDIATE cause is what the upstream fix addresses. Phase 0 mechanism reconstruction based on source-reading alone has a non-trivial wrong-cause rate.
+
+iter2 is the worked example: clean process discipline, clean implementation, F1 fired anyway because the concrete `x1 = 0x51a0` evidence in Phase 3's OOPS capture was read as "garbage" instead of decoded as a deterministic kernel-side state value.
+
+The memory rule generalizes [`feedback_strace_gdb_when_in_doubt`](../../.claude/projects/-home-mfritsche-claude/memory/feedback_strace_gdb_when_in_doubt.md) and adds a sharper test: *if my Phase 4 plan's correctness depends on a causal claim, my acceptance criterion needs to include a causal observable change in the relevant kernel/system state, not just "symptom goes away."*
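
One way to make that sharper test concrete is to evaluate the causal observable separately from the symptom. The sketch below is illustrative only: the function name `acceptance` and its inputs are hypothetical, the control IDs are the ones quoted in this write-up, and it assumes the list of control IDs has already been extracted from the strace capture by some other means.

```python
# Hypothetical acceptance check: the causal observable (the two EXT controls
# actually appear in a VIDIOC_S_EXT_CTRLS call) is evaluated independently of
# the symptom (the OOPS no longer reproduces).
EXT_SPS_ST_RPS = 0xA40A98  # value quoted in this write-up
EXT_SPS_LT_RPS = 0xA40A99  # value quoted in this write-up


def acceptance(seen_cids, oops_reproduced):
    """Return (causal_ok, symptom_ok) for one smoke-test capture."""
    required = {EXT_SPS_ST_RPS, EXT_SPS_LT_RPS}
    causal_ok = required <= set(seen_cids)
    symptom_ok = not oops_reproduced
    return causal_ok, symptom_ok


# The iter2 capture: only the five standard HEVC controls were seen,
# and the OOPS reproduced.
iter2_seen = [0xA40A90, 0xA40A91, 0xA40A92, 0xA40A93, 0xA40A94]
causal_ok, symptom_ok = acceptance(iter2_seen, oops_reproduced=True)
print("causal observable met:", causal_ok)
print("symptom gone:", symptom_ok)
```

Keeping the two results separate means a run where the symptom disappears while the controls still never reach the kernel would not count as a pass, which is the distinction the memory rule asks for.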