ampere-kernel-decoders/iter2_close.md
marfrit 17aa443f8f iter2 close: F1 falsifier fired, mechanism reconstruction was wrong
Phase 6 ran Steps 1-4 cleanly (vendor + adapt + UAPI header +
h265_set_controls wiring; 4 atomic commits f91c3f5..1a2c958 on
backend master, build green). Phase 6 Step 5 smoke test triggered
F1 falsifier per Phase 1 falsifier path.

Evidence:
- strace ioctl trace shows per-frame VIDIOC_S_EXT_CTRLS carries 5
  controls (IDs 0xa40a90..0xa40a94 = standard HEVC SPS/PPS/SLICE/
  SCALING/DECODE_PARAMS). The new 0xa40a98 (EXT_SPS_ST_RPS) and
  0xa40a99 (EXT_SPS_LT_RPS) DO NOT appear in any S_EXT_CTRLS call.
- Probe-line still fires (has_hevc_ext_sps_rps_rkvdec=true confirmed
  via vainfo log) — so the gate is fine, but the populate-and-add
  code path doesn't reach v4l2_set_controls.
- OOPS register state identical across pre-iter2 + post-iter2:
  x1 = 0x51a0 (small integer treated as pointer; pgd=0 confirms
  invalid). The kernel reads this same garbage value regardless of
  whether userspace tries to set the controls or not.

Hypothesis revision: Phase 0's 'UAPI-gap' reconstruction was
PARTIALLY refuted. Even when userspace doesn't populate the new
CIDs (pre-iter2) AND when it tries to (iter2 but the call doesn't
actually fire), the kernel ends up with run->ext_sps_st_rps=0x51a0.
The 0x51a0 is a deterministic kernel-side state — uninitialized
ctrl->p_cur.p or a confused offset-vs-pointer.

Three diagnostic next-steps for iter3 (kernel-side investigation):
  1. Backend instrumentation: log h265_populate_ext_sps_rps_cache
     return code + source_data SPS NAL search outcome
  2. Backend code-path check: is h265.c::h265_set_controls really
     the call site, or does picture.c dispatch via something else?
  3. Kernel instrumentation: printk in rkvdec_hevc_prepare_hw_st_rps
     dumping run->ext_sps_st_rps as read from ctrl->p_cur.p

Meta-campaign re-shuffle:
  - iter2 closes F1 (this commit)
  - iter3 was 'VP9 enablement' -> now bumped to 'HEVC kernel
    investigation' (more urgent, has concrete evidence to pursue)
  - iter4 = VP9 kernel enablement (was iter3)

Source code stays on backend master — iter2 infrastructure
(vendored parser, UAPI shim, runtime probe) is reusable for iter3+
regardless of whether the eventual kernel-side fix changes how
userspace integrates.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-16 09:32:05 +00:00


iter2 close — F1 falsifier fired, negative-result iteration

Closed 2026-05-16 evening. Phase 6 Step 5 (smoke test) hit the F1 falsifier path documented in Phase 1: HEVC OOPS still reproduces with the backend patch installed. The mechanism reconstruction from Phase 0 + Phase 5 was incomplete. iter2 produced 4 atomic source commits (vendored parser + GLib compat shim + UAPI header + h265_set_controls wiring), a clean build, and a comprehensive instrumented falsifier capture. The patch did not unblock HEVC, but the evidence narrows the root-cause search.

What was tried (commit chain on marfrit/libva-v4l2-request-fourier)

| Step | Commit | Status |
|---|---|---|
| 1: vendor GStreamer 1.28.2 unchanged | f91c3f5 | landed |
| 2: GLib compat shim → build green | c5fbc5b | landed |
| 3: UAPI header + per-fd runtime probe | 4f6ba6c | landed; probe correctly reports has_hevc_ext_sps_rps_rkvdec=true on ampere |
| 4: h265_set_controls populates EXT_SPS_*_RPS via vendored parser | 1a2c958 | landed; build clean |
| 5: install + reboot + smoke test (5-frame HEVC decode) | F1 fired | OOPS reproduces unchanged |

.so md5 currently installed on ampere: 96337ae14bfdbb32f2dfa98b026d0c57 (iter2 step 4 build).

Smoke-test evidence (Phase 7 instrument, pre-completion)

strace -ff -e trace=ioctl of failing ffmpeg captured the per-frame ioctl sequence. Per-frame VIDIOC_S_EXT_CTRLS carries 5 controls — IDs 0xa40a90 (HEVC_SPS), 0xa40a91 (HEVC_PPS), 0xa40a92 (HEVC_SLICE_PARAMS), 0xa40a93 (HEVC_SCALING_MATRIX), 0xa40a94 (HEVC_DECODE_PARAMS). The two new EXT controls (V4L2_CID_STATELESS_HEVC_EXT_SPS_ST_RPS = base + 408 = 0xa40a98, _LT_RPS = 0xa40a99) do not appear in any S_EXT_CTRLS call.
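The ID arithmetic above can be checked mechanically. The following is a minimal sketch, not backend code: the class and base constants are the ones in the V4L2 UAPI headers, offsets 400 to 404 are the standard HEVC controls, offset 408 is the one stated above, and offset 409 is inferred from 0xa40a99. The helper names `cid` and `missing_controls` are hypothetical.

```python
# Control class and base as defined in the V4L2 UAPI headers.
V4L2_CTRL_CLASS_CODEC_STATELESS = 0x00A40000
V4L2_CID_CODEC_STATELESS_BASE = V4L2_CTRL_CLASS_CODEC_STATELESS | 0x900

# Offsets from the stateless base. 408 is stated in this write-up;
# 409 is inferred from the 0xa40a99 value and is an assumption here.
HEVC_CID_OFFSETS = {
    "HEVC_SPS": 400,
    "HEVC_PPS": 401,
    "HEVC_SLICE_PARAMS": 402,
    "HEVC_SCALING_MATRIX": 403,
    "HEVC_DECODE_PARAMS": 404,
    "HEVC_EXT_SPS_ST_RPS": 408,
    "HEVC_EXT_SPS_LT_RPS": 409,
}


def cid(name):
    """Return the numeric control ID for a named HEVC control."""
    return V4L2_CID_CODEC_STATELESS_BASE + HEVC_CID_OFFSETS[name]


def missing_controls(observed_ids, required_names):
    """Return the required control names whose IDs are not in observed_ids."""
    return [name for name in required_names if cid(name) not in observed_ids]


# IDs seen in the per-frame VIDIOC_S_EXT_CTRLS call, as reported above.
observed = {0xA40A90, 0xA40A91, 0xA40A92, 0xA40A93, 0xA40A94}
required = ["HEVC_EXT_SPS_ST_RPS", "HEVC_EXT_SPS_LT_RPS"]

for name in HEVC_CID_OFFSETS:
    print(f"{name:22s} {cid(name):#x}")
print("missing from trace:", missing_controls(observed, required))
```

Feeding the set of IDs extracted from a later trace into `missing_controls` gives a direct yes/no answer to whether the new controls reached the ioctl boundary.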

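The standard 5-control call returns EINVAL with error_idx=5, which equals the control count. For VIDIOC_S_EXT_CTRLS that means the call failed in the up-front validation pass, or for a reason not tied to one control, before any control was applied. The trace alone therefore does not identify which control, if any, the kernel objected to; VIDIOC_TRY_EXT_CTRLS can report the failing index. The backend's existing logic ignores the EINVAL and proceeds to MEDIA_REQUEST_IOC_QUEUE, which is where the kernel OOPSes.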
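One caveat when reading error_idx: as far as the V4L2 documentation describes it, a value equal to count on VIDIOC_S_EXT_CTRLS covers both "the up-front validation pass failed" and "the error is not tied to one control", so it cannot rule out a single offending control. A small sketch of that reading, with a hypothetical helper name and return labels:

```python
def interpret_error_idx(count, error_idx):
    """Classify error_idx from a failed VIDIOC_S_EXT_CTRLS call.

    The labels are informal; they summarize the V4L2 documentation's
    description and are not kernel identifiers.
    """
    if count < 0 or error_idx < 0:
        raise ValueError("count and error_idx must be non-negative")
    if error_idx < count:
        # The failure is attributed to the control at this index.
        return "control-specific"
    if error_idx == count:
        # Validation failed before anything was applied, or the error
        # is not associated with a particular control.
        return "validation-or-general"
    return "out-of-range"


# The case captured in the smoke test: 5 controls, error_idx=5.
print(interpret_error_idx(5, 5))
```

To learn which control fails validation, the same control list can be passed to VIDIOC_TRY_EXT_CTRLS, which reports the failing index instead of count.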

OOPS register state (identical across all three captures pre- and post-iter2 patch):

pc : __pi_memcmp+0x10/0x110
lr : rkvdec_hevc_prepare_hw_st_rps+0x38/0x300 [rockchip_vdec]
x0 : ffff00015ade6c10    (cache pointer — kernel heap, valid)
x1 : 00000000000051a0    (run->ext_sps_st_rps — SMALL INTEGER treated as pointer, NOT a valid kernel address)
x2 : 0000000000000048    (sizeof struct = 72 bytes)

x1 = 0x51a0 is the smoking gun. pgd=0000000000000000, p4d=0000000000000000 confirms the kernel has no page-table mapping for that address. It's not a userspace pointer (those would be in the lower half but in valid mapped ranges); it looks like an OFFSET or COUNT value treated as if it were a pointer.
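The address-range reasoning can be written down as a short check. This sketch assumes a 48-bit virtual address configuration on ampere (suggested by the 0xffff0001... heap pointer in x0, but not confirmed here); the function name is hypothetical, and the check only classifies by range, it says nothing about whether a page is actually mapped.

```python
def classify_arm64_address(addr, va_bits=48):
    """Classify a 64-bit address by arm64 VA half (va_bits is an assumption)."""
    user_top = (1 << va_bits) - 1
    kernel_base = (1 << 64) - (1 << va_bits)
    if addr >= kernel_base:
        return "kernel-half"
    if addr <= user_top:
        return "user-half"
    return "non-canonical"


def page_and_offset(addr, page_shift=12):
    """Split an address into (page number, offset) for 4 KiB pages."""
    return addr >> page_shift, addr & ((1 << page_shift) - 1)


x0 = 0xFFFF00015ADE6C10  # cache pointer from the OOPS dump
x1 = 0x00000000000051A0  # run->ext_sps_st_rps from the OOPS dump

for name, value in (("x0", x0), ("x1", x1)):
    page, offset = page_and_offset(value)
    print(f"{name} = {value:#018x}: {classify_arm64_address(value)}, "
          f"page {page:#x}, offset {offset:#x}")
```

x1 falls in the lower half a few pages above zero, which fits the reading that it is an offset or count rather than a pointer the kernel allocated.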

Hypothesis update post-iter2

Phase 0's hypothesis ("UAPI gap — backend doesn't populate the new CIDs") is partially refuted:

  1. We populated the controls in the backend (Step 4) — but strace shows they don't reach the kernel.
  2. The OOPS happens with run->ext_sps_st_rps = 0x51a0 — a value that LOOKS like an uninitialized field or partially-decoded offset, not a NULL or a backend-supplied pointer.
  3. Therefore the kernel-side state path is what's broken — even when userspace DOESN'T set the controls (pre-iter2 state) or DOES try (iter2 state, with the backend's submission silently absent from the ioctl trace), the kernel ends up with run->ext_sps_st_rps = 0x51a0 and OOPSes.

The 0x51a0 value is consistent across all captured OOPSes (pre-iter2 + post-iter2), which means it's a deterministic kernel-side state — probably an uninitialized ctrl->p_cur.p field or a confused offset-vs-pointer somewhere in the V4L2 controls framework or in rkvdec-hevc-common.c itself.
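One cheap way to probe the offset-vs-pointer reading is to test whether 0x51a0 is a whole multiple of plausible element sizes. The sketch below uses the 72-byte length from x2 in the register dump, plus the 80-byte element size and the array length 65 that this write-up considers elsewhere; none of these are confirmed struct sizes, and the helper name is hypothetical.

```python
VALUE = 0x51A0  # run->ext_sps_st_rps as seen in x1


def as_multiple(value, unit):
    """Return (quotient, remainder) of value divided by unit."""
    return divmod(value, unit)


# Candidate units: memcmp length from x2, an alternative element size,
# and the dynamic-array length. All are guesses to be ruled in or out.
for unit in (72, 80, 65):
    quotient, remainder = as_multiple(VALUE, unit)
    verdict = "exact multiple" if remainder == 0 else "not a multiple"
    print(f"{VALUE:#x} / {unit}: quotient {quotient}, "
          f"remainder {remainder} ({verdict})")

# Array length times element size, for both candidate element sizes.
for size in (72, 80):
    print(f"65 * {size} = {65 * size} = {65 * size:#x}")
```

None of the candidates divides 0x51a0 exactly, so a simple "count times element size" explanation is not supported by these numbers.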

Why our Step 4 controls don't reach the kernel — open question

Three plausible reasons, ordered by how cheap they are to diagnose next:

  1. h265_populate_ext_sps_rps_cache returns an error and the submission is silently skipped. The function returns -ENODATA if no SPS NAL is found in surface_object->source_data. ffmpeg-vaapi's HEVC submission protocol may not put the SPS NAL into the slice data buffer the backend sees. This was the Phase 5 reviewer's #3 finding; the absence of the new CIDs in strace is consistent with it but does not by itself distinguish it from reason 2. To diagnose: add a request_log call in the populate function that logs what it found in source_data and the error return code.
  2. Backend's h265 code path goes through picture.c instead of h265.c::h265_set_controls. Some backends route per-codec submission through a generic picture.c::codec_store_buffer + per-codec dispatchers; if HEVC's h265_set_controls isn't the actual call site, our Step 4 edits don't fire. To diagnose: grep picture.c for HEVC-related dispatch + log entry into h265_set_controls.
  3. The kernel-side 0x51a0 value is independent of userspace. One guess is that ctrl->p_cur.p for a .dims = { 65 } control gets initialized to the array length times the element size, but the numbers do not match: with an 80-byte element, 65 * 80 = 5200 = 0x1450, and with the 72-byte length passed to memcmp (x2), 65 * 72 = 4680 = 0x1248. Neither equals 0x51a0. Another possibility is that a count field gets accidentally read as a pointer somewhere in the dynamic-array control framework. To diagnose: kernel-side instrumentation (printk in rkvdec_hevc_prepare_hw_st_rps to dump run->ext_sps_st_rps as it's read from ctrl->p_cur.p).
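Reason 1 in the list above hinges on whether an SPS NAL is present in the buffer the backend sees. A minimal sketch of that check follows, assuming the buffer is an Annex-B byte stream with 00 00 01 start codes (the actual layout of source_data is not confirmed here). The function names and sample buffers are hypothetical; the NAL header layout and the SPS type value 33 come from the HEVC specification.

```python
HEVC_NAL_SPS = 33  # nal_unit_type for a sequence parameter set


def iter_nal_types(buf):
    """Yield (offset, nal_unit_type) for each start code found in buf."""
    pos = 0
    while True:
        pos = buf.find(b"\x00\x00\x01", pos)
        if pos < 0 or pos + 3 >= len(buf):
            return
        header = buf[pos + 3]
        # Header byte: 1 forbidden bit, then 6 bits of nal_unit_type.
        yield pos, (header >> 1) & 0x3F
        pos += 3


def has_sps(buf):
    """Return True if any NAL unit in buf is an SPS."""
    return any(nal_type == HEVC_NAL_SPS for _, nal_type in iter_nal_types(buf))


# Synthetic buffers for illustration only, not captured data.
slice_only = b"\x00\x00\x01\x26\x01" + b"\xaa" * 4
with_sps = b"\x00\x00\x00\x01\x42\x01" + b"\xaa" * 4 + slice_only

print("slice only:", has_sps(slice_only))
print("with SPS:  ", has_sps(with_sps))
```

If the real source_data looks like the slice-only case, a parser that needs the SPS would have nothing to read, which would match the -ENODATA path described in reason 1.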

Loopback decision

Loopback to Phase 0 with re-opened kernel-agent#11 per the Phase 1 falsifier path. iter3 (the kernel-side investigation iteration) starts its Phase 0 with:

  • The OOPS register state (x1 = 0x51a0 is the load-bearing observation)
  • The strace evidence that our backend controls don't reach the kernel
  • The three open-question candidates above
  • The kernel-side gate ctx->has_sps_st_rps |= !!(ctrl->has_changed) — the 0x51a0 value would only be read IF has_sps_st_rps is true. Open question: how does has_sps_st_rps become true if userspace never sets the controls? Possibly via an init path during ctrl_handler registration that flips has_changed synthetically.
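The gate in the last bullet uses an OR-assign, which behaves as a latch. The toy model below illustrates why a single synthetic has_changed event would be enough to keep the flag set for the rest of the context's life; it assumes nothing else clears the flag, which has not been verified against the driver, and the class and function names are hypothetical.

```python
class DecodeContext:
    """Toy stand-in for the driver context; only the gate flag is modeled."""

    def __init__(self):
        self.has_sps_st_rps = False


def update_gate(ctx, has_changed):
    """Mirror `ctx->has_sps_st_rps |= !!(ctrl->has_changed)`."""
    ctx.has_sps_st_rps |= bool(has_changed)


def run(events):
    """Apply a sequence of has_changed values and return the final flag."""
    ctx = DecodeContext()
    for has_changed in events:
        update_gate(ctx, has_changed)
    return ctx.has_sps_st_rps


# Userspace never sets the control: the flag stays clear.
print(run([False, False, False]))
# One has_changed event (for example during handler setup) latches it.
print(run([True, False, False]))
```

Under this model, the kernel-side question reduces to finding the first point at which has_changed is true for the control, since every later frame inherits the result.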

Source code state (rollback / cleanup decisions)

The 4 source commits (f91c3f5..1a2c958) stay on the backend master. They don't break anything (existing 5-codec functionality is preserved by the per-driver-kind gate), they don't make the HEVC failure worse (HEVC OOPSes with or without them), and they constitute real infrastructure that iter3+ may need: the vendored parser, the UAPI shim header, and the runtime probe machinery are all reusable once the actual kernel-side fix is in place.

If iter3 concludes the userspace integration is wrong-shaped entirely (e.g. needs to be via VAAPI extension buffer not S_EXT_CTRLS), iter3's Phase 6 reverts these commits cleanly.

The .so installed at /usr/lib/dri/v4l2_request_drv_video.so (md5 96337ae14b...) keeps iter2's binary. ampere's other 4 codecs (H.264, VP8, MPEG-2, VP9 if enabled) should still work — Step 4's gate is HEVC-only.

Meta-campaign ledger update

iter2 closes with F1 falsifier fired. iter3 entry:

| Sub-iter | Status |
|---|---|
| iter2 (HEVC backend extension) | closed F1; backend infrastructure landed (parser vendored, UAPI shim, runtime probe, set_controls gate); root cause is NOT the UAPI gap alone |
| iter3 (kernel-side investigation, was: VP9 enablement) | bumped from "VP9 next" to "HEVC kernel investigation first"; VP9 deferred to iter4. The HEVC OOPS evidence narrows the search and is more urgent (blocks the cascade-wedge of all decoders on ampere). |
| iter4 (VP9 kernel enablement) | bumped to after iter3 |

marfrit/kernel-agent#11 re-opens with the iter2 evidence appended. The reclassification "userspace-only fix" from Phase 0 prior-art survey is partially overturned — kernel-side instrumentation is now required.

Lesson distilled for memory

When the mechanism reconstruction in Phase 0 is based on source-reading of the kernel without instrumented confirmation of the symptom path (strace + register dump), the resulting Phase 4 plan is at higher-than-claimed risk. iter2's prior-art survey found upstream consumers that populate the controls successfully; we replicated their architecture; the symptom persists. The gap is in our model of how OUR backend's submission reaches the kernel — not in the upstream-pattern fidelity.

Memory entry to file: see iter2 Phase 8 close (separate file).

Phase 6/7/8 close (consolidated)

Phase 6 completed Steps 1-4 as four atomic commits; the Step 5 smoke test triggered F1. Phase 7 verification reduces to "F1 falsifier captured": no full C1-C8 sweep is meaningful when HEVC doesn't even decode 5 frames. Phase 8 below.

iter2 closes 2026-05-16.


Phase 8 memory entry (filed)

feedback_verify_symptom_mechanism_first.md at the cross-campaign memory location.

Upstream-aligned fix architecture is necessary but not sufficient — first verify (via strace/gdb/printk) that the symptom's IMMEDIATE cause is what the upstream fix addresses. Phase 0 mechanism reconstruction based on source-reading alone has a non-trivial wrong-cause rate.

iter2 is the worked example: clean process discipline, clean implementation, F1 fired anyway because the concrete x1 = 0x51a0 evidence in Phase 3's OOPS capture was read as "garbage" instead of decoded as a deterministic kernel-side state value.

The memory rule generalizes feedback_strace_gdb_when_in_doubt and adds a sharper test: if my Phase 4 plan's correctness depends on a causal claim, my acceptance criterion needs to include a causal observable change in the relevant kernel/system state — not just "symptom goes away."