17aa443f8f
Phase 6 ran Steps 1-4 cleanly (vendor + adapt + UAPI header +
h265_set_controls wiring; 4 atomic commits f91c3f5..1a2c958 on
backend master, build green). Phase 6 Step 5 smoke test triggered
F1 falsifier per Phase 1 falsifier path.
Evidence:
- strace ioctl trace shows per-frame VIDIOC_S_EXT_CTRLS carries 5
controls (IDs 0xa40a90..0xa40a94 = standard HEVC SPS/PPS/SLICE/
SCALING/DECODE_PARAMS). The new 0xa40a98 (EXT_SPS_ST_RPS) and
0xa40a99 (EXT_SPS_LT_RPS) DO NOT appear in any S_EXT_CTRLS call.
- Probe-line still fires (has_hevc_ext_sps_rps_rkvdec=true confirmed
via vainfo log) — so the gate is fine, but the populate-and-add
code path doesn't reach v4l2_set_controls.
- OOPS register state identical across pre-iter2 + post-iter2:
x1 = 0x51a0 (small integer treated as pointer; pgd=0 confirms
invalid). The kernel reads this same garbage value regardless of
whether userspace tries to set the controls or not.
Hypothesis revision: Phase 0's 'UAPI-gap' reconstruction was
PARTIALLY refuted. Even when userspace doesn't populate the new
CIDs (pre-iter2) AND when it tries to (iter2 but the call doesn't
actually fire), the kernel ends up with run->ext_sps_st_rps=0x51a0.
The 0x51a0 is a deterministic kernel-side state — uninitialized
ctrl->p_cur.p or a confused offset-vs-pointer.
Three diagnostic next-steps for iter3 (kernel-side investigation):
1. Backend instrumentation: log h265_populate_ext_sps_rps_cache
return code + source_data SPS NAL search outcome
2. Backend code-path check: is h265.c::h265_set_controls really
the call site, or does picture.c dispatch via something else?
3. Kernel instrumentation: printk in rkvdec_hevc_prepare_hw_st_rps
dumping run->ext_sps_st_rps as read from ctrl->p_cur.p
Meta-campaign re-shuffle:
- iter2 closes F1 (this commit)
- iter3 was 'VP9 enablement' -> now bumped to 'HEVC kernel
investigation' (more urgent, has concrete evidence to pursue)
- iter4 = VP9 kernel enablement (was iter3)
Source code stays on backend master — iter2 infrastructure
(vendored parser, UAPI shim, runtime probe) is reusable for iter3+
regardless of whether the eventual kernel-side fix changes how
userspace integrates.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
104 lines
9.6 KiB
Markdown
# iter2 close: F1 falsifier fired, negative-result iteration

Closed 2026-05-16 evening. Phase 6 Step 5 (smoke test) hit the F1 falsifier path documented in Phase 1: the HEVC OOPS still reproduces with the backend patch installed. **The mechanism reconstruction from Phase 0 + Phase 5 was incomplete.** iter2 produced 4 atomic source commits (vendored parser + GLib compat shim + UAPI header + h265_set_controls wiring), a clean build, and an instrumented falsifier capture (strace ioctl trace plus OOPS register state). The patch did not unblock HEVC, but the evidence narrows the root-cause search.

## What was tried (commit chain on `marfrit/libva-v4l2-request-fourier`)

| Step | Commit | Status |
|------|--------|--------|
| 1: vendor GStreamer 1.28.2 unchanged | `f91c3f5` | landed |
| 2: GLib compat shim → build green | `c5fbc5b` | landed |
| 3: UAPI header + per-fd runtime probe | `4f6ba6c` | landed; probe correctly reports `has_hevc_ext_sps_rps_rkvdec=true` on ampere |
| 4: h265_set_controls populates EXT_SPS_*_RPS via vendored parser | `1a2c958` | landed; build clean |
| 5: install + reboot + smoke test (5-frame HEVC decode) | (none) | **F1 fired**; OOPS reproduces unchanged |

`.so` md5 currently installed on ampere: `96337ae14bfdbb32f2dfa98b026d0c57` (iter2 step 4 build).

## Smoke-test evidence (Phase 7 instrument, pre-completion)

`strace -ff -e trace=ioctl` of the failing ffmpeg run captured the per-frame ioctl sequence. Each per-frame `VIDIOC_S_EXT_CTRLS` carries 5 controls: IDs `0xa40a90` (HEVC_SPS), `0xa40a91` (HEVC_PPS), `0xa40a92` (HEVC_SLICE_PARAMS), `0xa40a93` (HEVC_SCALING_MATRIX), `0xa40a94` (HEVC_DECODE_PARAMS). The two new EXT controls (`V4L2_CID_STATELESS_HEVC_EXT_SPS_ST_RPS` = base + 408 = `0xa40a98`, `_LT_RPS` = base + 409 = `0xa40a99`) **do not appear in any S_EXT_CTRLS call.**

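The ID arithmetic above can be checked with a short sketch. It assumes the stateless-codec base value matches the upstream UAPI header (`V4L2_CTRL_CLASS_CODEC_STATELESS | 0x900`); the offsets 408 and 409 are taken from this note, and the helper names `control_id` and `missing_controls` are hypothetical.

```python
# Assumption: these two constants mirror include/uapi/linux/v4l2-controls.h.
V4L2_CTRL_CLASS_CODEC_STATELESS = 0x00A40000
V4L2_CID_CODEC_STATELESS_BASE = V4L2_CTRL_CLASS_CODEC_STATELESS | 0x900

# Offsets from the base. 400..404 are the five controls seen in strace;
# 408 and 409 are the new EXT controls as stated in this note.
HEVC_OFFSETS = {
    "HEVC_SPS": 400,
    "HEVC_PPS": 401,
    "HEVC_SLICE_PARAMS": 402,
    "HEVC_SCALING_MATRIX": 403,
    "HEVC_DECODE_PARAMS": 404,
    "HEVC_EXT_SPS_ST_RPS": 408,
    "HEVC_EXT_SPS_LT_RPS": 409,
}


def control_id(name):
    """Return the numeric control ID for a symbolic name (hypothetical helper)."""
    return V4L2_CID_CODEC_STATELESS_BASE + HEVC_OFFSETS[name]


def missing_controls(seen_ids, required_names):
    """List the required controls whose IDs are absent from a captured call."""
    return [name for name in required_names if control_id(name) not in seen_ids]


# IDs observed in the per-frame VIDIOC_S_EXT_CTRLS call.
seen = {0xA40A90, 0xA40A91, 0xA40A92, 0xA40A93, 0xA40A94}
print(missing_controls(seen, ["HEVC_EXT_SPS_ST_RPS", "HEVC_EXT_SPS_LT_RPS"]))
```

Feeding the IDs extracted from each strace line into `missing_controls` gives a per-frame answer to "did the new controls reach the kernel?" without reading the trace by eye.
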
The standard 5-control call returns `EINVAL` with `error_idx=5`. Because `error_idx` equals `count`, the kernel rejected the submission in its up-front validation pass, before applying any control; the value does not identify which control, if any, is at fault. The backend's existing logic ignores the EINVAL and proceeds to `MEDIA_REQUEST_IOC_QUEUE`, which is where the kernel OOPSes.

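A minimal sketch of that `error_idx` reading, assuming the documented `VIDIOC_S_EXT_CTRLS` convention that `error_idx == count` signals a failure in the validation step. The function name and the returned labels are hypothetical, chosen only for illustration.

```python
def interpret_s_ext_ctrls_error(count, error_idx):
    """Classify a failed VIDIOC_S_EXT_CTRLS call from its count and error_idx.

    Assumption: error_idx == count means the validation step failed and no
    control was applied; a smaller index points at the control that failed
    while values were being applied.
    """
    if error_idx == count:
        return "validation"
    if 0 <= error_idx < count:
        return "control[%d]" % error_idx
    return "unexpected"


# The iter2 capture: 5 controls submitted, error_idx=5.
print(interpret_s_ext_ctrls_error(count=5, error_idx=5))
```

If the validation pass is the failing stage, repeating the same submission through `VIDIOC_TRY_EXT_CTRLS` is one way to learn which control the kernel objects to, since that ioctl reports the failing index.
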
OOPS register state (identical across all three captures, pre- and post-iter2 patch):

```
pc : __pi_memcmp+0x10/0x110
lr : rkvdec_hevc_prepare_hw_st_rps+0x38/0x300 [rockchip_vdec]
x0 : ffff00015ade6c10 (cache pointer, kernel heap, valid)
x1 : 00000000000051a0 (run->ext_sps_st_rps: SMALL INTEGER treated as pointer, NOT a valid kernel address)
x2 : 0000000000000048 (sizeof struct = 72 bytes)
```

`x1 = 0x51a0` is the smoking gun. `pgd=0000000000000000, p4d=0000000000000000` confirms the kernel has no page-table mapping for that address. It does not look like a userspace pointer either (those also sit in the lower half of the address space, but in validly mapped ranges); it looks like an offset or count value used as if it were a pointer.

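The reasoning about `x0` and `x1` can be expressed as a rough triage sketch. It assumes an arm64 kernel with 48-bit virtual addresses (kernel addresses start at `0xffff000000000000`), and the 1 MiB "small integer" cutoff is an arbitrary heuristic, not something taken from the kernel. `classify_pointer` is a hypothetical helper.

```python
# Assumption: arm64 with 48-bit VA, so kernel addresses have the top 16 bits set.
KERNEL_VA_START = 0xFFFF_0000_0000_0000
# Heuristic assumption: anything below 1 MiB is more likely a count or offset
# than a real pointer.
SMALL_INT_LIMIT = 1 << 20


def classify_pointer(value):
    """Rough triage of a register value from an OOPS dump."""
    if value >= KERNEL_VA_START:
        return "kernel"
    if value < SMALL_INT_LIMIT:
        return "small-integer"
    return "lower-half"


for name, value in (("x0", 0xFFFF00015ADE6C10), ("x1", 0x51A0)):
    print(name, hex(value), classify_pointer(value))
```

The same check applied to `x2` (`0x48`) also yields "small-integer", which is expected there because `x2` is the `memcmp` length, not a pointer.
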
## Hypothesis update post-iter2

Phase 0's hypothesis ("UAPI gap: backend doesn't populate the new CIDs") is **partially refuted**:

1. We populated the controls in the backend (Step 4), but strace shows they don't reach the kernel.
2. The OOPS happens with `run->ext_sps_st_rps = 0x51a0`, a value that looks like an uninitialized field or a partially decoded offset, not a NULL or a backend-supplied pointer.
3. Therefore the kernel-side state path is what's broken. Whether userspace doesn't set the controls (pre-iter2 state) or tries to (iter2 state, with the backend's submission silently absent from the ioctl trace), the kernel sees the same input, ends up with `run->ext_sps_st_rps = 0x51a0`, and OOPSes.

The 0x51a0 value is consistent across all captured OOPSes (pre-iter2 + post-iter2), which means it's a deterministic kernel-side state: probably an uninitialized `ctrl->p_cur.p` field or a confused offset-vs-pointer somewhere in the V4L2 controls framework or in `rkvdec-hevc-common.c` itself.

## Why our Step 4 controls don't reach the kernel: open question

Three candidate explanations, ordered by how cheap they are to diagnose next:

1. **`h265_populate_ext_sps_rps_cache` returns an error and silently skips the submission.** The function returns `-ENODATA` if no SPS NAL is found in `surface_object->source_data`. ffmpeg-vaapi's HEVC submission protocol may not put the SPS NAL into the slice data buffer the backend sees. This was the Phase 5 reviewer's #3 finding, and it is consistent with the absence of the new CIDs in strace, though candidate 2 would produce the same trace. To diagnose: add `request_log` in the populate function, logging what it found in source_data plus the err return code.
2. **Backend's h265 code path goes through `picture.c` instead of `h265.c::h265_set_controls`.** Some backends route per-codec submission through a generic `picture.c::codec_store_buffer` + per-codec dispatchers; if HEVC's `h265_set_controls` isn't the actual call site, our Step 4 edits don't fire. To diagnose: grep `picture.c` for HEVC-related dispatch and log entry into `h265_set_controls`.
3. **The kernel-side `0x51a0` value is independent of userspace.** One guess was that `ctrl->p_cur.p` for a `.dims = { 65 }` control gets initialized to an array byte size, but the arithmetic doesn't match: `65 * 80 = 5200 = 0x1450`, and with the 72-byte struct size seen in `x2`, `65 * 72 = 4680 = 0x1248`. Neither equals `0x51a0` (20896). Alternatively, a count field may get read as a pointer somewhere in the dynamic-array control framework. To diagnose: kernel-side instrumentation (printk in `rkvdec_hevc_prepare_hw_st_rps` to dump `run->ext_sps_st_rps` as it's read from `ctrl->p_cur.p`).

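The size arithmetic in candidate 3 is easy to re-run when new struct sizes or array lengths come up. The sketch below only tests the two element sizes mentioned in this note (72 from `x2`, 80 from the original guess); the helper names are hypothetical.

```python
OBSERVED = 0x51A0          # value found in x1
DIMS = 65                  # the `.dims = { 65 }` array length
CANDIDATE_SIZES = (72, 80) # 72 from x2 in the OOPS, 80 from the original guess


def array_bytes(count, element_size):
    """Total byte size of `count` elements of `element_size` bytes."""
    return count * element_size


def is_whole_multiple(value, unit):
    """True if `value` is an exact multiple of `unit`."""
    return value % unit == 0


for size in CANDIDATE_SIZES:
    total = array_bytes(DIMS, size)
    print("65 * %d = %d = %s, matches observed: %s"
          % (size, total, hex(total), total == OBSERVED))
    print("0x51a0 is a whole multiple of %d: %s"
          % (size, is_whole_multiple(OBSERVED, size)))
```

Since `0x51a0` is not a whole multiple of either candidate size, it is also unlikely to be a plain byte offset into an array of those structs, which leaves the "count or unrelated field read as a pointer" reading open.
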
## Loopback decision

**Loopback to Phase 0** with re-opened `kernel-agent#11` per the Phase 1 falsifier path. iter3 (the kernel-side investigation iteration) starts its Phase 0 with:

- The OOPS register state (`x1 = 0x51a0` is the load-bearing observation)
- The strace evidence that our backend controls don't reach the kernel
- The three open-question candidates above
- The kernel-side gate `ctx->has_sps_st_rps |= !!(ctrl->has_changed)`: the `0x51a0` value would only be read if `has_sps_st_rps` is true. Open question: how does `has_sps_st_rps` become true if userspace never sets the controls? Possibly via an init path during ctrl_handler registration that flips `has_changed` synthetically.

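The gate in the last bullet is a one-way latch, which is why a single synthetic `has_changed` would be enough to arm it. The toy model below mirrors only the `|= !!(...)` expression; it assumes the flag starts false and that nothing else clears it, neither of which has been confirmed against the driver source. `latch` and `run_sequence` are hypothetical names.

```python
def latch(flag, has_changed):
    """Model of the C expression `flag |= !!(has_changed)`."""
    return flag or bool(has_changed)


def run_sequence(events):
    """Apply the latch over a sequence of has_changed values, starting from False."""
    flag = False
    for has_changed in events:
        flag = latch(flag, has_changed)
    return flag


# Userspace never sets the control: the flag stays clear.
print(run_sequence([0, 0, 0]))
# One has_changed event (for example during handler setup) arms it for good.
print(run_sequence([1, 0, 0]))
```

Under these assumptions, the iter3 printk only needs to catch the first transition of `has_sps_st_rps` to true to identify which path sets `has_changed`.
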
## Source code state (rollback / cleanup decisions)

The 4 source commits (`f91c3f5..1a2c958`) **stay on the backend `master`**. They don't break anything (existing 5-codec functionality is preserved by the per-driver-kind gate), they don't introduce a new regression (HEVC OOPSes with or without them), and they are real infrastructure that iter3+ may need (the vendored parser, the UAPI shim header, and the runtime probe machinery are all reusable once the actual kernel-side fix is in place).

If iter3 concludes the userspace integration is wrong-shaped entirely (e.g. it needs to go through a VAAPI extension buffer rather than S_EXT_CTRLS), iter3's Phase 6 reverts these commits cleanly.

The `.so` installed at `/usr/lib/dri/v4l2_request_drv_video.so` (md5 `96337ae14b...`) remains the iter2 build. ampere's other 4 codecs (H.264, VP8, MPEG-2, VP9 if enabled) should still work, since Step 4's gate is HEVC-only.

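Before iter3 starts, it may be worth confirming that the installed driver really is the iter2 build. A minimal sketch, using the path and full md5 recorded in this note; `md5_of` and `is_iter2_build` are hypothetical helpers, and the check is meant to be run on ampere itself.

```python
import hashlib

# Both values are taken from this note.
EXPECTED_MD5 = "96337ae14bfdbb32f2dfa98b026d0c57"
DRIVER_PATH = "/usr/lib/dri/v4l2_request_drv_video.so"


def md5_of(path, chunk_size=1 << 16):
    """Return the hex md5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_iter2_build(path=DRIVER_PATH):
    """True if the file at `path` matches the recorded iter2 step 4 build."""
    return md5_of(path) == EXPECTED_MD5
```

Calling `is_iter2_build()` on ampere before each new OOPS capture guards against comparing register dumps taken with different backend binaries.
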
## Meta-campaign ledger update

iter2 closes with F1 falsifier fired. iter3 entry:

| Sub-iter | Status |
|----------|--------|
| iter2 (HEVC backend extension) | **closed F1**: backend infrastructure landed (parser vendored, UAPI shim, runtime probe, set_controls gate); root cause is NOT the UAPI gap alone |
| iter3 (kernel-side investigation, was: VP9 enablement) | **bumped from "VP9 next" to "HEVC kernel investigation first"**; VP9 deferred to iter4. The HEVC OOPS evidence narrows the search and is more urgent (blocks the cascade-wedge of all decoders on ampere). |
| iter4 (VP9 kernel enablement) | bumped to after iter3 |

`marfrit/kernel-agent#11` re-opens with the iter2 evidence appended. The reclassification "userspace-only fix" from the Phase 0 prior-art survey is **partially overturned**: kernel-side instrumentation is now required.

## Lesson distilled for memory

> When the mechanism reconstruction in Phase 0 is based on source-reading of the kernel without instrumented confirmation of the symptom path (strace + register dump), the resulting Phase 4 plan is at higher-than-claimed risk. iter2's prior-art survey found upstream consumers that populate the controls successfully; we replicated their architecture; the symptom persists. The gap is in our model of how OUR backend's submission reaches the kernel, not in the upstream-pattern fidelity.

Memory entry to file: see iter2 Phase 8 close (separate file).

## Phase 6/7/8 close (consolidated)

Phase 6 landed all 4 atomic commits (Steps 1-4); the Step 5 smoke test triggered F1. Phase 7 verification reduces to "F1 falsifier captured": no full C1-C8 sweep is meaningful when HEVC doesn't even decode 5 frames. Phase 8 below.

iter2 closes 2026-05-16.

---

## Phase 8 memory entry (filed)

Filed as [`feedback_verify_symptom_mechanism_first.md`](../../.claude/projects/-home-mfritsche-claude/memory/feedback_verify_symptom_mechanism_first.md) at the cross-campaign memory location.

> Upstream-aligned fix architecture is necessary but not sufficient: first verify (via strace/gdb/printk) that the symptom's IMMEDIATE cause is what the upstream fix addresses. Phase 0 mechanism reconstruction based on source-reading alone has a non-trivial wrong-cause rate.

iter2 is the worked example: clean process discipline, clean implementation, F1 fired anyway because the concrete `x1 = 0x51a0` evidence in Phase 3's OOPS capture was read as "garbage" instead of decoded as a deterministic kernel-side state value.

The memory rule generalizes [`feedback_strace_gdb_when_in_doubt`](../../.claude/projects/-home-mfritsche-claude/memory/feedback_strace_gdb_when_in_doubt.md) and adds a sharper test: *if my Phase 4 plan's correctness depends on a causal claim, my acceptance criterion needs to include a causal observable change in the relevant kernel/system state, not just "symptom goes away."*