Phase 6 ran Steps 1-4 cleanly (vendor + adapt + UAPI header +
h265_set_controls wiring; 4 atomic commits f91c3f5..1a2c958 on
backend master, build green). Phase 6 Step 5 smoke test triggered
F1 falsifier per Phase 1 falsifier path.
Evidence:
- strace ioctl trace shows per-frame VIDIOC_S_EXT_CTRLS carries 5
controls (IDs 0xa40a90..0xa40a94 = standard HEVC SPS/PPS/SLICE/
SCALING/DECODE_PARAMS). The new 0xa40a98 (EXT_SPS_ST_RPS) and
0xa40a99 (EXT_SPS_LT_RPS) DO NOT appear in any S_EXT_CTRLS call.
- Probe-line still fires (has_hevc_ext_sps_rps_rkvdec=true confirmed
via vainfo log) — so the gate is fine, but the populate-and-add
code path doesn't reach v4l2_set_controls.
- OOPS register state identical across pre-iter2 + post-iter2:
x1 = 0x51a0 (small integer treated as pointer; pgd=0 confirms
invalid). The kernel reads this same garbage value regardless of
whether userspace tries to set the controls or not.
Hypothesis revision: Phase 0's 'UAPI-gap' reconstruction was
PARTIALLY refuted. Even when userspace doesn't populate the new
CIDs (pre-iter2) AND when it tries to (iter2 but the call doesn't
actually fire), the kernel ends up with run->ext_sps_st_rps=0x51a0.
The 0x51a0 is a deterministic kernel-side state — uninitialized
ctrl->p_cur.p or a confused offset-vs-pointer.
Three diagnostic next-steps for iter3 (kernel-side investigation):
1. Backend instrumentation: log h265_populate_ext_sps_rps_cache
return code + source_data SPS NAL search outcome
2. Backend code-path check: is h265.c::h265_set_controls really
the call site, or does picture.c dispatch via something else?
3. Kernel instrumentation: printk in rkvdec_hevc_prepare_hw_st_rps
dumping run->ext_sps_st_rps as read from ctrl->p_cur.p
Meta-campaign re-shuffle:
- iter2 closes F1 (this commit)
- iter3 was 'VP9 enablement' -> now bumped to 'HEVC kernel
investigation' (more urgent, has concrete evidence to pursue)
- iter4 = VP9 kernel enablement (was iter3)
Source code stays on backend master — iter2 infrastructure
(vendored parser, UAPI shim, runtime probe) is reusable for iter3+
regardless of whether the eventual kernel-side fix changes how
userspace integrates.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
iter2 close — F1 falsifier fired, negative-result iteration
Closed 2026-05-16 evening. Phase 6 Step 5 (smoke test) hit the F1 falsifier path documented in Phase 1: HEVC OOPS still reproduces with the backend patch installed. The mechanism reconstruction from Phase 0 + Phase 5 was incomplete. iter2 produced 4 atomic source commits (vendored parser + GLib compat shim + UAPI header + h265_set_controls wiring), a clean build, and a comprehensive instrumented falsifier capture. The patch did not unblock HEVC, but the evidence narrows the root-cause search.
What was tried (commit chain on marfrit/libva-v4l2-request-fourier)
| Step | Commit | Status |
|---|---|---|
| 1: vendor GStreamer 1.28.2 unchanged | f91c3f5 | landed |
| 2: GLib compat shim → build green | c5fbc5b | landed |
| 3: UAPI header + per-fd runtime probe | 4f6ba6c | landed; probe correctly reports has_hevc_ext_sps_rps_rkvdec=true on ampere |
| 4: h265_set_controls populates EXT_SPS_*_RPS via vendored parser | 1a2c958 | landed; build clean |
| 5: install + reboot + smoke test (5-frame HEVC decode) | F1 fired | OOPS reproduces unchanged |
.so md5 currently installed on ampere: 96337ae14bfdbb32f2dfa98b026d0c57 (iter2 step 4 build).
Smoke-test evidence (Phase 7 instrument, pre-completion)
strace -ff -e trace=ioctl of failing ffmpeg captured the per-frame ioctl sequence. Per-frame VIDIOC_S_EXT_CTRLS carries 5 controls — IDs 0xa40a90 (HEVC_SPS), 0xa40a91 (HEVC_PPS), 0xa40a92 (HEVC_SLICE_PARAMS), 0xa40a93 (HEVC_SCALING_MATRIX), 0xa40a94 (HEVC_DECODE_PARAMS). The two new EXT controls (V4L2_CID_STATELESS_HEVC_EXT_SPS_ST_RPS = base + 408 = 0xa40a98, _LT_RPS = 0xa40a99) do not appear in any S_EXT_CTRLS call.
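The control IDs quoted above follow from the V4L2 stateless-codec base. The sketch below reproduces that arithmetic. It assumes the mainline values V4L2_CTRL_CLASS_CODEC_STATELESS = 0x00a40000 and a base offset of 0x900. The +408/+409 offsets for the two EXT controls are taken from this document and are not checked against a kernel header here. The helper names hevc_cid and missing_controls are hypothetical.

```python
# Control-ID arithmetic for the stateless HEVC controls seen in the strace capture.
V4L2_CTRL_CLASS_CODEC_STATELESS = 0x00A40000
V4L2_CID_CODEC_STATELESS_BASE = V4L2_CTRL_CLASS_CODEC_STATELESS | 0x900

# Offsets from the base. 400..404 are the five standard controls named above;
# 408 and 409 are the EXT offsets as stated in this document (assumption).
HEVC_CID_OFFSETS = {
    "HEVC_SPS": 400,
    "HEVC_PPS": 401,
    "HEVC_SLICE_PARAMS": 402,
    "HEVC_SCALING_MATRIX": 403,
    "HEVC_DECODE_PARAMS": 404,
    "EXT_SPS_ST_RPS": 408,
    "EXT_SPS_LT_RPS": 409,
}


def hevc_cid(name):
    """Return the numeric control ID for one of the names in HEVC_CID_OFFSETS."""
    return V4L2_CID_CODEC_STATELESS_BASE + HEVC_CID_OFFSETS[name]


def missing_controls(observed_ids, required_names):
    """Return the required control names whose ID is absent from observed_ids."""
    observed = set(observed_ids)
    return [name for name in required_names if hevc_cid(name) not in observed]


if __name__ == "__main__":
    # The five IDs seen in every per-frame S_EXT_CTRLS call.
    observed = [0xA40A90, 0xA40A91, 0xA40A92, 0xA40A93, 0xA40A94]
    for name in HEVC_CID_OFFSETS:
        print(f"{name:22s} {hevc_cid(name):#x}")
    print("missing:", missing_controls(observed, ["EXT_SPS_ST_RPS", "EXT_SPS_LT_RPS"]))
```

Feeding the IDs extracted from a trace into missing_controls gives a quick yes/no on whether the EXT controls ever reached the ioctl boundary.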
The standard 5-control call returns EINVAL with error_idx=5, i.e. error_idx equal to the control count. In the V4L2 extended-controls API this means the failure is not attributed to any single control index: the kernel rejected the submission as a whole and applied none of the controls. Backend's existing logic ignores the EINVAL and proceeds to MEDIA_REQUEST_IOC_QUEUE, which is where the kernel OOPSes.
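The error_idx reading can be written down as a small rule. The sketch below assumes the documented VIDIOC_S_EXT_CTRLS convention that error_idx equal to count marks a failure not tied to one control, while a smaller value names the failing control. The function name is hypothetical, and it only interprets the two numbers; it does not talk to a device.

```python
def classify_ext_ctrls_error(error_idx, count):
    """Interpret error_idx from a failed VIDIOC_S_EXT_CTRLS call.

    Returns "whole-request" when the failure is not attributed to a single
    control (error_idx == count), otherwise "control[N]" for the failing index.
    """
    if count < 0 or not 0 <= error_idx <= count:
        raise ValueError(f"error_idx {error_idx} out of range for count {count}")
    if error_idx == count:
        return "whole-request"
    return f"control[{error_idx}]"


if __name__ == "__main__":
    # The smoke-test capture: 5 controls submitted, error_idx=5.
    print(classify_ext_ctrls_error(5, 5))
```

A backend that logged this classification next to the errno would have made the ignored EINVAL visible before MEDIA_REQUEST_IOC_QUEUE was reached.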
OOPS register state (identical across all three captures pre- and post-iter2 patch):
pc : __pi_memcmp+0x10/0x110
lr : rkvdec_hevc_prepare_hw_st_rps+0x38/0x300 [rockchip_vdec]
x0 : ffff00015ade6c10 (cache pointer — kernel heap, valid)
x1 : 00000000000051a0 (run->ext_sps_st_rps — SMALL INTEGER treated as pointer, NOT a valid kernel address)
x2 : 0000000000000048 (sizeof struct = 72 bytes)
x1 = 0x51a0 is the smoking gun. pgd=0000000000000000, p4d=0000000000000000 confirms the kernel has no page-table mapping for that address. It's not a userspace pointer (those would be in the lower half but in valid mapped ranges); it looks like an OFFSET or COUNT value treated as if it were a pointer.
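The x0/x1 comparison can be made mechanical. The sketch below assumes a 48-bit virtual address layout on arm64 (kernel addresses have the top 16 bits set, user addresses have them clear). The 64 KiB "small integer" threshold is an arbitrary heuristic chosen for illustration, not a kernel constant, and the function names are hypothetical.

```python
VA_BITS = 48                  # assumption: 48-bit VA configuration
SMALL_INT_LIMIT = 1 << 16     # heuristic threshold, not a kernel constant


def classify_address(addr):
    """Classify a 64-bit register value by which address-space half it falls in."""
    addr &= (1 << 64) - 1
    top = addr >> VA_BITS
    if top == 0xFFFF:
        return "kernel-half"
    if top == 0:
        return "user-half"
    return "non-canonical"


def looks_like_small_integer(addr):
    """True when the value is small enough to be an offset or count, not a pointer."""
    return 0 <= addr < SMALL_INT_LIMIT


if __name__ == "__main__":
    for reg, value in (("x0", 0xFFFF00015ADE6C10), ("x1", 0x51A0)):
        print(reg, hex(value), classify_address(value), looks_like_small_integer(value))
```

Under these assumptions x0 lands in the kernel half and x1 is flagged as a small integer, which matches the reading of the register dump given above.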
Hypothesis update post-iter2
Phase 0's hypothesis ("UAPI gap — backend doesn't populate the new CIDs") is partially refuted:
- We populated the controls in the backend (Step 4) — but strace shows they don't reach the kernel.
- The OOPS happens with run->ext_sps_st_rps = 0x51a0, a value that LOOKS like an uninitialized field or partially-decoded offset, not a NULL or a backend-supplied pointer.
- Therefore the kernel-side state path is what's broken: even when userspace DOESN'T set the controls (pre-iter2 state) or DOES try (iter2 state, with the backend's submission silently absent from the ioctl trace), the kernel ends up with run->ext_sps_st_rps = 0x51a0 and OOPSes.
The 0x51a0 value is consistent across all captured OOPSes (pre-iter2 + post-iter2), which means it's a deterministic kernel-side state — probably an uninitialized ctrl->p_cur.p field or a confused offset-vs-pointer somewhere in the V4L2 controls framework or in rkvdec-hevc-common.c itself.
Why our Step 4 controls don't reach the kernel — open question
Three plausible reasons, ordered by how cheap they are to diagnose next:
1. h265_populate_ext_sps_rps_cache returns error and silently skips the submission. The function returns -ENODATA if no SPS NAL is found in surface_object->source_data. ffmpeg-vaapi's HEVC submission protocol may not put the SPS NAL into the slice data buffer the backend sees. This was Phase 5 reviewer's #3 finding, and the absence of the new CIDs in strace is consistent with it. To diagnose: add request_log in the populate function logging what it found in source_data + the err return code.
2. Backend's h265 code path goes through picture.c instead of h265.c::h265_set_controls. Some backends route per-codec submission through a generic picture.c::codec_store_buffer + per-codec dispatchers; if HEVC's h265_set_controls isn't the actual call site, our Step 4 edits don't fire. To diagnose: grep picture.c for HEVC-related dispatch + log entry into h265_set_controls.
3. The kernel-side 0x51a0 value is independent of userspace. Possibly ctrl->p_cur.p for a .dims = { 65 } control gets initialized to something like 65 * sizeof(struct) = 65 * 80 = 5200 = 0x1450... no, doesn't match (and with the 72-byte size seen in x2, 65 * 72 = 4680 = 0x1248 doesn't match either). Or maybe a count field gets accidentally read as a pointer somewhere in the dynamic-array control framework. To diagnose: kernel-side instrumentation (printk in rkvdec_hevc_prepare_hw_st_rps to dump run->ext_sps_st_rps as it's read from ctrl->p_cur.p).
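The "is 0x51a0 an array size?" check in the third candidate can be run over several element sizes at once instead of by hand. The sketch below assumes the two candidate element sizes mentioned in this document (72 and 80 bytes) and an upper bound of 65 elements. The function name is hypothetical.

```python
def matching_array_sizes(value, elem_sizes, max_count):
    """Return (elem_size, count) pairs with elem_size * count == value."""
    return [
        (size, count)
        for size in elem_sizes
        for count in range(1, max_count + 1)
        if size * count == value
    ]


if __name__ == "__main__":
    observed = 0x51A0
    candidates = [72, 80]
    print("decimal:", observed)
    print("exact matches:", matching_array_sizes(observed, candidates, 65))
    for size in candidates:
        # A non-zero remainder rules out "whole number of elements" for that size.
        print(f"{observed} % {size} = {observed % size}")
```

An empty match list for both sizes rules out the simple "count times struct size" reading and leaves the count-read-as-pointer and uninitialized-field readings open.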
Loopback decision
Loopback to Phase 0 with re-opened kernel-agent#11 per the Phase 1 falsifier path. iter3 (the kernel-side investigation iteration) starts its Phase 0 with:
- The OOPS register state (x1 = 0x51a0 is the load-bearing observation)
- The strace evidence that our backend controls don't reach the kernel
- The three open-question candidates above
- The kernel-side gate ctx->has_sps_st_rps |= !!(ctrl->has_changed): the 0x51a0 value would only be read IF has_sps_st_rps is true. Open question: how does has_sps_st_rps become true if userspace never sets the controls? Possibly via an init path during ctrl_handler registration that flips has_changed synthetically.
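The gate in the last bullet is a sticky OR: once any event arrives with has_changed set, the flag never clears. The toy model below illustrates only that latching behaviour. It is not the rkvdec code, and the class and method names are hypothetical.

```python
class StickyGateModel:
    """Toy model of `flag |= !!(has_changed)`; not the actual driver logic."""

    def __init__(self):
        self.has_sps_st_rps = False

    def on_ctrl_event(self, has_changed):
        # OR-assignment: a single True latches the flag for the object's lifetime.
        self.has_sps_st_rps |= bool(has_changed)
        return self.has_sps_st_rps


if __name__ == "__main__":
    gate = StickyGateModel()
    # Hypothetical sequence: one synthetic "changed" at init, then nothing from userspace.
    for event in (True, False, False):
        print(event, "->", gate.on_ctrl_event(event))
```

If an init path does raise has_changed once, this is why the flag would stay true for every later frame even though userspace never sets the controls.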
Source code state (rollback / cleanup decisions)
The 4 source commits (f91c3f5..1a2c958) stay on the backend master. They don't break anything (existing 5-codec functionality preserved per the per-driver-kind gate), they don't introduce a new regression (HEVC OOPSes with or without them), and they constitute REAL infrastructure that iter3+ may need (the vendored parser + the UAPI shim header + the runtime probe machinery are all reusable once the actual kernel-side fix is in place).
If iter3 concludes the userspace integration is wrong-shaped entirely (e.g. needs to be via VAAPI extension buffer not S_EXT_CTRLS), iter3's Phase 6 reverts these commits cleanly.
The .so installed at /usr/lib/dri/v4l2_request_drv_video.so (md5 96337ae14b...) keeps iter2's binary. ampere's other 4 codecs (H.264, VP8, MPEG-2, VP9 if enabled) should still work — Step 4's gate is HEVC-only.
Meta-campaign ledger update
iter2 closes with F1 falsifier fired. iter3 entry:
| Sub-iter | Status |
|---|---|
| iter2 (HEVC backend extension) | closed F1 — backend infrastructure landed (parser vendored, UAPI shim, runtime probe, set_controls gate); root cause is NOT the UAPI gap alone |
| iter3 (kernel-side investigation, was: VP9 enablement) | bumped from "VP9 next" to "HEVC kernel investigation first" — VP9 deferred to iter4. The HEVC OOPS evidence narrows the search and is more urgent (blocks the cascade-wedge of all decoders on ampere). |
| iter4 (VP9 kernel enablement) | bumped to after iter3 |
marfrit/kernel-agent#11 re-opens with the iter2 evidence appended. The reclassification "userspace-only fix" from Phase 0 prior-art survey is partially overturned — kernel-side instrumentation is now required.
Lesson distilled for memory
When the mechanism reconstruction in Phase 0 is based on source-reading of the kernel without instrumented confirmation of the symptom path (strace + register dump), the resulting Phase 4 plan is at higher-than-claimed risk. iter2's prior-art survey found upstream consumers that populate the controls successfully; we replicated their architecture; the symptom persists. The gap is in our model of how OUR backend's submission reaches the kernel — not in the upstream-pattern fidelity.
Memory entry to file: see iter2 Phase 8 close (separate file).
Phase 6/7/8 close (consolidated)
Phase 6 completed Steps 1-4 (4 atomic commits); the Step 5 smoke test triggered F1. Phase 7 verification reduces to "F1 falsifier captured": no full C1-C8 sweep is meaningful when HEVC doesn't even decode 5 frames. Phase 8 below.
iter2 closes 2026-05-16.
Phase 8 memory entry (filed)
feedback_verify_symptom_mechanism_first.md at the cross-campaign memory location.
Upstream-aligned fix architecture is necessary but not sufficient — first verify (via strace/gdb/printk) that the symptom's IMMEDIATE cause is what the upstream fix addresses. Phase 0 mechanism reconstruction based on source-reading alone has a non-trivial wrong-cause rate.
iter2 is the worked example: clean process discipline, clean implementation, F1 fired anyway because the concrete x1 = 0x51a0 evidence in Phase 3's OOPS capture was read as "garbage" instead of decoded as a deterministic kernel-side state value.
The memory rule generalizes feedback_strace_gdb_when_in_doubt and adds a sharper test: if my Phase 4 plan's correctness depends on a causal claim, my acceptance criterion needs to include a causal observable change in the relevant kernel/system state — not just "symptom goes away."