Files
ampere-kernel-decoders/iter3_close.md
T
Markus Fritsche 659c08c81c iter3 close — kernel root cause: uninitialized run.ext_sps_st_rps
rkvdec_hevc_run_preamble conditionally assigns run->ext_sps_st_rps
and run->ext_sps_lt_rps only when ctx->has_sps_*_rps is true. The
caller (vdpu381-hevc.c:591) allocates the run struct without zero-init,
so when has_sps_st_rps is false, run.ext_sps_st_rps retains stack
garbage (0x51a0 deterministically on RK3588 CoolPi CM5 GenBook).

prepare_hw_st_rps's `if (!run->ext_sps_st_rps) return;` check passes
(garbage is non-NULL), then memcmp dereferences 0x51a0 → OOPS.

Backend diagnostic instrumentation (iter3-Q4) revealed the full
dispatch path BeginPicture → RenderPicture → EndPicture →
h265_set_controls → populate_ext_sps_rps_cache(ENODATA, because
ffmpeg-vaapi strips SPS/VPS/PPS, leaving only slice NALs) →
fallback to 5-ctrl batch (EINVAL) → MEDIA_REQUEST_IOC_QUEUE → kernel
OOPS. The OOPS occurs even when our EXT_SPS_*_RPS controls are
NEVER submitted — purely a kernel-side init bug.

1-line patches proposed (Option A — fix in preamble, Option B —
zero-init run struct at caller).

Q1-Q5 all answered. Userspace iter2 work stays — upstream-aligned
path for streams that need EXT_SPS_*_RPS. Iter4 deferred until
kernel patch lands via kernel-agent.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-16 10:34:13 +00:00

133 lines
8.5 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# Iter3 close — kernel-side root cause located
Date: 2026-05-16 (afternoon)
Branch: `master`
Substrate: ampere `7.0.0-rc3-devices+` (mmind v7.0 + Casanova HEVC 7.0 series + ampere DTS)
Backend: `libva-v4l2-request-fourier` iter2 + iter3 diagnostic instrumentation (md5 `404041ea`)
## Bottom line
**The HEVC OOPS on RK3588 is a one-line kernel bug in the Casanova/Collabora 7.0 HEVC series.** Userspace work to vendor the GStreamer parser and submit `V4L2_CID_STATELESS_HEVC_EXT_SPS_{ST,LT}_RPS` controls is the *correct upstream-aligned approach* for HEVC streams that need it, BUT the OOPS occurs even without those controls because of a missing struct-field initialization in the kernel.
## Falsifier outcome
F1 (no kernel OOPS): **TRUE — F1 fires again**, same as iter2. OOPS reproduced deterministically on fresh-boot run at 12:30:15 with iter3 instrumented build.
F2 (vendored GStreamer parser extracts SPS from libavcodec-stripped source): **FALSE — verified empirically.** Diagnostic log shows `populate_ext_sps_rps_cache entry: src=0xffff7de54000 src_size=280 cache_valid=0``nal_iter=0 pr=5 offset=0 nalu.type=20 nalu.size=277``exit: err=-61 cache_valid=0`. First NAL in source_data is type 20 (`IDR_N_LP slice`), not type 33 (`SPS`). ffmpeg-vaapi strips VPS/SPS/PPS — confirms Phase 5 reviewer's prediction. This is consistent with the iter2 strace finding and not the bug.
## Root cause (Q1 + Q3 combined)
In `drivers/media/platform/rockchip/rkvdec/rkvdec-vdpu381-hevc.c:591`:
```c
static int rkvdec_hevc_run(struct rkvdec_ctx *ctx)
{
struct rkvdec_dev *rkvdec = ctx->dev;
struct rkvdec_hevc_run run; // ← UNINITIALIZED
struct rkvdec_hevc_ctx *hevc_ctx = ctx->priv;
struct rkvdec_hevc_priv_tbl *tbl = hevc_ctx->priv_tbl.cpu;
rkvdec_hevc_run_preamble(ctx, &run);
...
}
```
`rkvdec_hevc_run_preamble` in `rkvdec-hevc-common.c:478` assigns most `run->*` fields unconditionally, but the new EXT_SPS fields are conditional:
```c
if (ctx->has_sps_st_rps) {
ctrl = v4l2_ctrl_find(&ctx->ctrl_hdl,
V4L2_CID_STATELESS_HEVC_EXT_SPS_ST_RPS);
run->ext_sps_st_rps = ctrl ? ctrl->p_cur.p : NULL;
}
if (ctx->has_sps_lt_rps) {
ctrl = v4l2_ctrl_find(&ctx->ctrl_hdl,
V4L2_CID_STATELESS_HEVC_EXT_SPS_LT_RPS);
run->ext_sps_lt_rps = ctrl ? ctrl->p_cur.p : NULL;
}
```
When `ctx->has_sps_st_rps == false` (the common case for any HEVC stream where userspace doesn't supply EXT_SPS_*_RPS), `run.ext_sps_st_rps` retains **stack garbage** — the leftover bytes from a previous stack frame at the same offset. Empirically on the CoolPi CM5 GenBook this garbage is deterministically `0x51a0` per call, because the calling pattern is fixed.
Downstream in `rkvdec_hevc_prepare_hw_st_rps` (rkvdec-hevc-common.c:380):
```c
if (!run->ext_sps_st_rps)
return; /* ← garbage 0x51a0 is non-NULL */
if (!memcmp(cache, run->ext_sps_st_rps, sizeof(struct v4l2_ctrl_hevc_ext_sps_st_rps)))
return; /* ← memcmp derefs 0x51a0 → OOPS */
```
The NULL-check is bypassed by stack garbage; `memcmp` faults at the bogus address.
This is reachable on every HEVC decode that hits the assemble path on `vdpu381_variant` (RK3588) and `vdpu383_variant` (RK3576) — i.e., every stream where the dispatcher gate in `rkvdec-vdpu{381,383}-hevc.c:606` enters the else branch.
## Minimal kernel patch
Either of these one-line patches fixes the OOPS without changing any behavior for valid use cases:
**Option A — fix at the producer (preamble):**
```diff
--- a/drivers/media/platform/rockchip/rkvdec/rkvdec-hevc-common.c
+++ b/drivers/media/platform/rockchip/rkvdec/rkvdec-hevc-common.c
@@ -495,6 +495,9 @@ void rkvdec_hevc_run_preamble(struct rkvdec_ctx *ctx,
ctrl = v4l2_ctrl_find(&ctx->ctrl_hdl,
V4L2_CID_STATELESS_HEVC_SCALING_MATRIX);
run->scaling_matrix = ctrl ? ctrl->p_cur.p : NULL;
+
+ run->ext_sps_st_rps = NULL;
+ run->ext_sps_lt_rps = NULL;
if (ctx->has_sps_st_rps) {
ctrl = v4l2_ctrl_find(&ctx->ctrl_hdl,
```
**Option B — fix at the callers (zero-init the run struct):**
```diff
--- a/drivers/media/platform/rockchip/rkvdec/rkvdec-vdpu381-hevc.c
+++ b/drivers/media/platform/rockchip/rkvdec/rkvdec-vdpu381-hevc.c
@@ -588,7 +588,7 @@ static int rkvdec_hevc_run(struct rkvdec_ctx *ctx)
{
struct rkvdec_dev *rkvdec = ctx->dev;
- struct rkvdec_hevc_run run;
+ struct rkvdec_hevc_run run = {};
struct rkvdec_hevc_ctx *hevc_ctx = ctx->priv;
struct rkvdec_hevc_priv_tbl *tbl = hevc_ctx->priv_tbl.cpu;
```
(same for `rkvdec-vdpu383-hevc.c`)
Option A is more localized to the *cause* (the preamble's incomplete initialization). Option B is more defensive (catches any future fields added without explicit init). Either works.
## Why iter2's userspace fix matters separately
Independently of the kernel patch, the iter2 work (vendored GStreamer 1.28.2 parser + UAPI shim + h265 cache population) IS the upstream-aligned path for HEVC streams where the SPS parameter set semantics require the new ST/LT RPS controls. When ffmpeg-vaapi forwards a buffer that DOES contain the SPS (e.g. some bitstream wrappers, or future ffmpeg versions that don't strip VPS/SPS/PPS), our parser activates and submits the EXT_SPS_*_RPS controls — exactly the upstream contract from the Casanova series.
For the current ffmpeg-vaapi (which strips SPS), our backend correctly returns ENODATA from the parser and falls back to submitting only the standard 5 controls. With the kernel patch above, this fallback succeeds without OOPS.
(There remains a separate question: for bbb_60s_720p.hevc, the SPS encoded by libx265 may or may not actually require EXT_SPS_*_RPS — typical streams use simple ST_RPS that the kernel can derive from slice headers alone, which is what the current rkvdec dispatcher path expects. The user-pref direction is "do it correctly aligned with upstream architecture", so the parser+cache+inject path stays.)
## Phase 6 question completion
| Q | Answer |
|---|--------|
| Q1 — what is `0x51a0` | Stack garbage at `run.ext_sps_st_rps` offset from prior frame. Deterministic per calling pattern. |
| Q2 — where is `ctrl->p_cur.p = 0x51a0` set | N/A — not a ctrl pointer corruption. The `0x51a0` is local stack memory, not from the V4L2 ctrl handler. |
| Q3 — when `ctx->has_sps_st_rps` true | FALSE in our case. The OOPS triggers via the missing-init-of-run, not via the has_sps_st_rps gate. |
| Q4 — backend log fires? | YES, after fflush patch + picture.c entry-point instrumentation. Confirmed full path BeginPicture → RenderPicture×2 → EndPicture → h265_set_controls → populate_ext_sps_rps_cache → ENODATA → fallback to 5-ctrl batch → EINVAL → MEDIA_REQUEST_IOC_QUEUE → kernel OOPS. |
| Q5 — vdpu381 vs vdpu383 on RK3588 | RK3588 → `vdpu381_variant`. Confirmed in `drivers/media/platform/rockchip/rkvdec/rkvdec.c:1757` (`"rockchip,rk3588-vdec"``&vdpu381_variant`). vdpu383 binds only to RK3576. |
## Open follow-ups
- **kernel-agent issue (new)**: file the 1-line patch above as an upstream-aligned fix to the Casanova v7.0 series. Tag for: applies to vdpu381 + vdpu383 both; reproducer is "any HEVC stream where ffmpeg-vaapi strips SPS" (i.e., everything).
- **iter4 plan**: with kernel patched, re-run HEVC and confirm decode succeeds. Then iterate on the standard-5-controls EINVAL (separate issue — likely a slice_params field shape or DECODE_PARAMS flag the iter2 backend isn't setting correctly).
- **Optional userspace robustness**: after kernel patch, the EXT_SPS_*_RPS path stays dormant on standard streams; consider deferring h265_populate to *only when SPS is detected in source_data* (cheap optimization).
- **Diagnostic instrumentation removal**: iter3-Q4 logs in `picture.c` + `h265.c` + `utils.c::fflush` should be stripped once iter4 close confirms the kernel patch works. fflush in utils.c can stay (cheap, useful).
## Substrate state at close
- Backend `.so`: md5 `404041ea2dcc03c769e0ab8c43ddadd6` on ampere `/usr/lib/dri/`
- Backend source: ampere `~/src/libva-v4l2-request-fourier` working tree has uncommitted iter3 diagnostics (picture.c + utils.c + h265.c)
- Kernel: unchanged from iter2 substrate (waiting on kernel-agent issue + patch land)
Iter3 closes with **root cause identified and a 1-line fix proposed**, going from negative-result (iter2 close) to definitive-finding in one diagnostic iteration. The 8-phase loop's Phase 5 (review) recommendation in iter2 to add backend logging was the unblocking move.