From 17aa443f8f860a63ab7e27da8dfd594e8b5e7d65 Mon Sep 17 00:00:00 2001
From: Markus Fritsche
Date: Sat, 16 May 2026 09:31:24 +0000
Subject: [PATCH] iter2 close: F1 falsifier fired, mechanism reconstruction was wrong
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Phase 6 ran Steps 1-4 cleanly (vendor + adapt + UAPI header +
h265_set_controls wiring; 4 atomic commits f91c3f5..1a2c958 on backend
master, build green). Phase 6 Step 5 smoke test triggered F1 falsifier
per Phase 1 falsifier path.

Evidence:
- strace ioctl trace shows per-frame VIDIOC_S_EXT_CTRLS carries 5
  controls (IDs 0xa40a90..0xa40a94 = standard HEVC SPS/PPS/SLICE/
  SCALING/DECODE_PARAMS). The new 0xa40a98 (EXT_SPS_ST_RPS) and
  0xa40a99 (EXT_SPS_LT_RPS) DO NOT appear in any S_EXT_CTRLS call.
- Probe-line still fires (has_hevc_ext_sps_rps_rkvdec=true confirmed
  via vainfo log) — so the gate is fine, but the populate-and-add code
  path doesn't reach v4l2_set_controls.
- OOPS register state identical across pre-iter2 + post-iter2:
  x1 = 0x51a0 (small integer treated as pointer; pgd=0 confirms
  invalid). The kernel reads this same garbage value regardless of
  whether userspace tries to set the controls or not.

Hypothesis revision: Phase 0's 'UAPI-gap' reconstruction was PARTIALLY
refuted. Even when userspace doesn't populate the new CIDs (pre-iter2)
AND when it tries to (iter2 but the call doesn't actually fire), the
kernel ends up with run->ext_sps_st_rps=0x51a0. The 0x51a0 is a
deterministic kernel-side state — uninitialized ctrl->p_cur.p or a
confused offset-vs-pointer.

Three diagnostic next-steps for iter3 (kernel-side investigation):
  1. Backend instrumentation: log h265_populate_ext_sps_rps_cache
     return code + source_data SPS NAL search outcome
  2. Backend code-path check: is h265.c::h265_set_controls really
     the call site, or does picture.c dispatch via something else?
  3. Kernel instrumentation: printk in rkvdec_hevc_prepare_hw_st_rps
     dumping run->ext_sps_st_rps as read from ctrl->p_cur.p

Meta-campaign re-shuffle:
- iter2 closes F1 (this commit)
- iter3 was 'VP9 enablement' -> now bumped to 'HEVC kernel
  investigation' (more urgent, has concrete evidence to pursue)
- iter4 = VP9 kernel enablement (was iter3)

Source code stays on backend master — iter2 infrastructure (vendored
parser, UAPI shim, runtime probe) is reusable for iter3+ regardless of
whether the eventual kernel-side fix changes how userspace integrates.

Co-Authored-By: Claude Opus 4.7
---
 iter2_close.md | 103 +++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 103 insertions(+)
 create mode 100644 iter2_close.md

diff --git a/iter2_close.md b/iter2_close.md
new file mode 100644
index 0000000..9de1c1f
--- /dev/null
+++ b/iter2_close.md
@@ -0,0 +1,103 @@
+# iter2 close — F1 falsifier fired, negative-result iteration
+
+Closed 2026-05-16 evening. Phase 6 Step 5 (smoke test) hit the F1 falsifier path documented in Phase 1: HEVC OOPS still reproduces with the backend patch installed. **The mechanism reconstruction from Phase 0 + Phase 5 was incomplete.** iter2 produced 4 atomic source commits (vendored parser + GLib compat shim + UAPI header + h265_set_controls wiring), a clean build, and a comprehensive instrumented falsifier capture. The patch did not unblock HEVC, but the evidence narrows the root-cause search.
+
+## What was tried (commit chain on `marfrit/libva-v4l2-request-fourier`)
+
+| Step | Commit | Status |
+|------|--------|--------|
+| 1: vendor GStreamer 1.28.2 unchanged | `f91c3f5` | landed |
+| 2: GLib compat shim → build green | `c5fbc5b` | landed |
+| 3: UAPI header + per-fd runtime probe | `4f6ba6c` | landed; probe correctly reports `has_hevc_ext_sps_rps_rkvdec=true` on ampere |
+| 4: h265_set_controls populates EXT_SPS_*_RPS via vendored parser | `1a2c958` | landed; build clean |
+| 5: install + reboot + smoke test (5-frame HEVC decode) | **F1 fired** | OOPS reproduces unchanged |
+
+`.so` md5 currently installed on ampere: `96337ae14bfdbb32f2dfa98b026d0c57` (iter2 step 4 build).
+
+## Smoke-test evidence (Phase 7 instrument, pre-completion)
+
+`strace -ff -e trace=ioctl` of the failing ffmpeg run captured the per-frame ioctl sequence. Per-frame `VIDIOC_S_EXT_CTRLS` carries 5 controls: IDs `0xa40a90` (HEVC_SPS), `0xa40a91` (HEVC_PPS), `0xa40a92` (HEVC_SLICE_PARAMS), `0xa40a93` (HEVC_SCALING_MATRIX), `0xa40a94` (HEVC_DECODE_PARAMS). The two new EXT controls (`V4L2_CID_STATELESS_HEVC_EXT_SPS_ST_RPS` = base + 408 = `0xa40a98`, `_LT_RPS` = `0xa40a99`) **do not appear in any S_EXT_CTRLS call.**
+
+The standard 5-control call returns `EINVAL` with `error_idx=5`. Because `error_idx` equals the control count, the kernel rejected the submission without attributing the failure to a specific control, so this value alone does not say which control, if any, failed validation. The backend's existing logic ignores the EINVAL and proceeds to `MEDIA_REQUEST_IOC_QUEUE`, which is where the kernel OOPSes.
+
+OOPS register state (identical across all three captures pre- and post-iter2 patch):
+
+```
+pc : __pi_memcmp+0x10/0x110
+lr : rkvdec_hevc_prepare_hw_st_rps+0x38/0x300 [rockchip_vdec]
+x0 : ffff00015ade6c10 (cache pointer — kernel heap, valid)
+x1 : 00000000000051a0 (run->ext_sps_st_rps — SMALL INTEGER treated as pointer, NOT a valid kernel address)
+x2 : 0000000000000048 (sizeof struct = 72 bytes)
+```
+
+`x1 = 0x51a0` is the smoking gun. `pgd=0000000000000000, p4d=0000000000000000` confirms the kernel has no page-table mapping for that address. It's not a userspace pointer (those would be in the lower half but in valid mapped ranges); it looks like an OFFSET or COUNT value treated as if it were a pointer.
+
+## Hypothesis update post-iter2
+
+Phase 0's hypothesis ("UAPI gap — backend doesn't populate the new CIDs") is **partially refuted**:
+
+1. We populated the controls in the backend (Step 4), but strace shows they don't reach the kernel.
+2. The OOPS happens with `run->ext_sps_st_rps = 0x51a0`, a value that LOOKS like an uninitialized field or partially-decoded offset, not a NULL or a backend-supplied pointer.
+3. Therefore the kernel-side state path is what's broken: whether userspace does not set the controls (pre-iter2 state) or tries to (iter2 state, with the backend's submission silently absent from the ioctl trace), the kernel ends up with `run->ext_sps_st_rps = 0x51a0` and OOPSes.
+
+The 0x51a0 value is consistent across all captured OOPSes (pre-iter2 + post-iter2), which means it's a deterministic kernel-side state — probably an uninitialized `ctrl->p_cur.p` field or a confused offset-vs-pointer somewhere in the V4L2 controls framework or in rkvdec-hevc-common.c itself.
+
+## Why our Step 4 controls don't reach the kernel — open question
+
+Three plausible reasons, ordered by how cheap they are to diagnose next:
+
+1. **`h265_populate_ext_sps_rps_cache` returns error and silently skips the submission.** The function returns `-ENODATA` if no SPS NAL is found in `surface_object->source_data`. ffmpeg-vaapi's HEVC submission protocol may not put the SPS NAL into the slice data buffer the backend sees. This was the Phase 5 reviewer's #3 finding; the absence of the new CIDs in strace is consistent with it but does not yet distinguish it from candidate 2. To diagnose: add `request_log` in the populate function logging what it found in source_data + the err return code.
+2. **Backend's h265 code path goes through `picture.c` instead of `h265.c::h265_set_controls`.** Some backends route per-codec submission through a generic `picture.c::codec_store_buffer` + per-codec dispatchers; if HEVC's `h265_set_controls` isn't the actual call site, our Step 4 edits don't fire. To diagnose: grep `picture.c` for HEVC-related dispatch + log entry into `h265_set_controls`.
+3. **The kernel-side `0x51a0` value is independent of userspace.** One guess was that `ctrl->p_cur.p` for a `.dims = { 65 }` control gets initialized to something like `65 * sizeof(struct)`, but `65 * 80 = 5200 = 0x1450` does not match `0x51a0` (20896), and neither does `65 * 72 = 4680 = 0x1248` using the 72-byte size seen in `x2`. Alternatively, a count field may be accidentally read as a pointer somewhere in the dynamic-array control framework. To diagnose: kernel-side instrumentation (printk in `rkvdec_hevc_prepare_hw_st_rps` to dump `run->ext_sps_st_rps` as it's read from `ctrl->p_cur.p`).
+
+## Loopback decision
+
+**Loopback to Phase 0** with re-opened `kernel-agent#11` per the Phase 1 falsifier path. iter3 (the kernel-side investigation iteration) starts its Phase 0 with:
+- The OOPS register state (`x1 = 0x51a0` is the load-bearing observation)
+- The strace evidence that our backend controls don't reach the kernel
+- The three open-question candidates above
+- The kernel-side gate `ctx->has_sps_st_rps |= !!(ctrl->has_changed)` — the `0x51a0` value would only be read IF `has_sps_st_rps` is true. Open question: how does `has_sps_st_rps` become true if userspace never sets the controls? Possibly via an init path during ctrl_handler registration that flips `has_changed` synthetically.
+
+## Source code state (rollback / cleanup decisions)
+
+The 4 source commits (`f91c3f5..1a2c958`) **stay on the backend `master`**. They don't break anything (existing 5-codec functionality is preserved per the per-driver-kind gate), they introduce no new regression (HEVC OOPSes with or without them), and they constitute REAL infrastructure that iter3+ may need (the vendored parser + the UAPI shim header + the runtime probe machinery are all reusable once the actual kernel-side fix is in place).
+
+If iter3 concludes the userspace integration is wrong-shaped entirely (e.g. needs to be via VAAPI extension buffer not S_EXT_CTRLS), iter3's Phase 6 reverts these commits cleanly.
+
+The `.so` installed at `/usr/lib/dri/v4l2_request_drv_video.so` (md5 `96337ae14b...`) remains iter2's build. ampere's other 4 codecs (H.264, VP8, MPEG-2, VP9 if enabled) should still work — Step 4's gate is HEVC-only.
+
+## Meta-campaign ledger update
+
+iter2 closes with F1 falsifier fired. iter3 entry:
+
+| Sub-iter | Status |
+|----------|--------|
+| iter2 (HEVC backend extension) | **closed F1** — backend infrastructure landed (parser vendored, UAPI shim, runtime probe, set_controls gate); root cause is NOT the UAPI gap alone |
+| iter3 (kernel-side investigation, was: VP9 enablement) | **bumped from "VP9 next" to "HEVC kernel investigation first"** — VP9 deferred to iter4. The HEVC OOPS evidence narrows the search and is more urgent (blocks the cascade-wedge of all decoders on ampere). |
+| iter4 (VP9 kernel enablement) | bumped to after iter3 |
+
+`marfrit/kernel-agent#11` re-opens with the iter2 evidence appended. The reclassification "userspace-only fix" from Phase 0 prior-art survey is **partially overturned** — kernel-side instrumentation is now required.
+
+## Lesson distilled for memory
+
+> When the mechanism reconstruction in Phase 0 is based on source-reading of the kernel without instrumented confirmation of the symptom path (strace + register dump), the resulting Phase 4 plan is at higher-than-claimed risk. iter2's prior-art survey found upstream consumers that populate the controls successfully; we replicated their architecture; the symptom persists. The gap is in our model of how OUR backend's submission reaches the kernel — not in the upstream-pattern fidelity.
+
+Memory entry to file: see iter2 Phase 8 close (separate file).
+
+## Phase 6/7/8 close (consolidated)
+
+Phase 6 landed all 4 atomic commits (Steps 1-4); the Step 5 smoke test triggered F1. Phase 7 verification reduces to "F1 falsifier captured": no full C1-C8 sweep is meaningful when HEVC doesn't even decode 5 frames. Phase 8 below.
+
+iter2 closes 2026-05-16.
+
+---
+
+## Phase 8 memory entry (filed)
+
+[`feedback_verify_symptom_mechanism_first.md`](../../.claude/projects/-home-mfritsche-claude/memory/feedback_verify_symptom_mechanism_first.md) at the cross-campaign memory location.
+
+> Upstream-aligned fix architecture is necessary but not sufficient — first verify (via strace/gdb/printk) that the symptom's IMMEDIATE cause is what the upstream fix addresses. Phase 0 mechanism reconstruction based on source-reading alone has a non-trivial wrong-cause rate.
+
+iter2 is the worked example: clean process discipline, clean implementation, F1 fired anyway because the concrete `x1 = 0x51a0` evidence in Phase 3's OOPS capture was read as "garbage" instead of decoded as a deterministic kernel-side state value.
+
+The memory rule generalizes [`feedback_strace_gdb_when_in_doubt`](../../.claude/projects/-home-mfritsche-claude/memory/feedback_strace_gdb_when_in_doubt.md) and adds a sharper test: *if my Phase 4 plan's correctness depends on a causal claim, my acceptance criterion needs to include a causal observable change in the relevant kernel/system state — not just "symptom goes away."*
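
The control-ID arithmetic from the smoke-test evidence and the size-times-count guess from candidate 3 under "Why our Step 4 controls don't reach the kernel" can be checked mechanically. The following is a minimal C sketch, not code from the backend or the kernel. The class and base constants mirror `linux/v4l2-controls.h`; the offsets 408 and 409 and the element sizes 72 and 80 are taken from this write-up; the helper names `stateless_cid` and `count_matching_size` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Stateless codec control class and base, mirroring linux/v4l2-controls.h. */
#define CODEC_STATELESS_CLASS 0x00a40000u
#define CODEC_STATELESS_BASE  (CODEC_STATELESS_CLASS | 0x900u)

/* Values observed in the OOPS register dump quoted in this write-up. */
#define OOPS_X1 0x51a0u /* run->ext_sps_st_rps as read by the kernel */
#define OOPS_X2 0x48u   /* memcmp length, 72 bytes */

/* Control ID for a given offset from the stateless codec base.
 * Hypothetical helper; offsets 408/409 for the EXT_SPS_*_RPS controls
 * are assumptions taken from the text above, not from a mainline header. */
static uint32_t stateless_cid(uint32_t offset)
{
    return CODEC_STATELESS_BASE + offset;
}

/* Returns the element count n in 1..max_count for which
 * n * elem_size == value, or 0 if no such count exists.
 * Used to test whether a bogus pointer value could be a byte size
 * of an array of up to max_count elements. */
static uint32_t count_matching_size(uint32_t value, uint32_t elem_size,
                                    uint32_t max_count)
{
    if (elem_size == 0 || value % elem_size != 0)
        return 0;

    uint32_t n = value / elem_size;
    return (n >= 1 && n <= max_count) ? n : 0;
}
```

Calling `count_matching_size(OOPS_X1, 72, 65)` and `count_matching_size(OOPS_X1, 80, 65)` both return 0, which agrees with the text: `0x51a0` is not the byte size of any array of up to 65 elements of either candidate size, so a different decoding of the value is needed.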