Files
Markus Fritsche 659c08c81c iter3 close — kernel root cause: uninitialized run.ext_sps_st_rps
rkvdec_hevc_run_preamble conditionally assigns run->ext_sps_st_rps
and run->ext_sps_lt_rps only when ctx->has_sps_*_rps is true. The
caller (vdpu381-hevc.c:591) allocates the run struct without zero-init,
so when has_sps_st_rps is false, run.ext_sps_st_rps retains stack
garbage (0x51a0 deterministically on RK3588 CoolPi CM5 GenBook).

prepare_hw_st_rps's `if (!run->ext_sps_st_rps) return;` check passes
(garbage is non-NULL), then memcmp dereferences 0x51a0 → OOPS.

Backend diagnostic instrumentation (iter3-Q4) revealed the full
dispatch path BeginPicture → RenderPicture → EndPicture →
h265_set_controls → populate_ext_sps_rps_cache(ENODATA, because
ffmpeg-vaapi strips SPS/VPS/PPS, leaving only slice NALs) →
fallback to 5-ctrl batch (EINVAL) → MEDIA_REQUEST_IOC_QUEUE → kernel
OOPS. The OOPS occurs even when our EXT_SPS_*_RPS controls are
NEVER submitted — purely a kernel-side init bug.

1-line patches proposed (Option A — fix in preamble, Option B —
zero-init run struct at caller).

Q1-Q5 all answered. Userspace iter2 work stays — upstream-aligned
path for streams that need EXT_SPS_*_RPS. Iter4 deferred until
kernel patch lands via kernel-agent.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-16 10:34:13 +00:00

8.5 KiB
Raw Permalink Blame History

Iter3 close — kernel-side root cause located

Date: 2026-05-16 (afternoon) Branch: master Substrate: ampere 7.0.0-rc3-devices+ (mmind v7.0 + Casanova HEVC 7.0 series + ampere DTS) Backend: libva-v4l2-request-fourier iter2 + iter3 diagnostic instrumentation (md5 404041ea)

Bottom line

The HEVC OOPS on RK3588 is a one-line kernel bug in the Casanova/Collabora 7.0 HEVC series. Userspace work to vendor the GStreamer parser and submit V4L2_CID_STATELESS_HEVC_EXT_SPS_{ST,LT}_RPS controls is the correct upstream-aligned approach for HEVC streams that need it, BUT the OOPS occurs even without those controls because of a missing struct-field initialization in the kernel.

Falsifier outcome

F1 (no kernel OOPS): TRUE — F1 fires again, same as iter2. OOPS reproduced deterministically on fresh-boot run at 12:30:15 with iter3 instrumented build.

F2 (vendored GStreamer parser extracts SPS from libavcodec-stripped source): FALSE — verified empirically. Diagnostic log shows populate_ext_sps_rps_cache entry: src=0xffff7de54000 src_size=280 cache_valid=0nal_iter=0 pr=5 offset=0 nalu.type=20 nalu.size=277exit: err=-61 cache_valid=0. First NAL in source_data is type 20 (IDR_N_LP slice), not type 33 (SPS). ffmpeg-vaapi strips VPS/SPS/PPS — confirms Phase 5 reviewer's prediction. This is consistent with the iter2 strace finding and not the bug.

Root cause (Q1 + Q3 combined)

In drivers/media/platform/rockchip/rkvdec/rkvdec-vdpu381-hevc.c:591:

static int rkvdec_hevc_run(struct rkvdec_ctx *ctx)
{
        struct rkvdec_dev *rkvdec = ctx->dev;
        struct rkvdec_hevc_run run;            // ← UNINITIALIZED
        struct rkvdec_hevc_ctx *hevc_ctx = ctx->priv;
        struct rkvdec_hevc_priv_tbl *tbl = hevc_ctx->priv_tbl.cpu;

        rkvdec_hevc_run_preamble(ctx, &run);
        ...
}

rkvdec_hevc_run_preamble in rkvdec-hevc-common.c:478 assigns most run->* fields unconditionally, but the new EXT_SPS fields are conditional:

if (ctx->has_sps_st_rps) {
        ctrl = v4l2_ctrl_find(&ctx->ctrl_hdl,
                              V4L2_CID_STATELESS_HEVC_EXT_SPS_ST_RPS);
        run->ext_sps_st_rps = ctrl ? ctrl->p_cur.p : NULL;
}
if (ctx->has_sps_lt_rps) {
        ctrl = v4l2_ctrl_find(&ctx->ctrl_hdl,
                              V4L2_CID_STATELESS_HEVC_EXT_SPS_LT_RPS);
        run->ext_sps_lt_rps = ctrl ? ctrl->p_cur.p : NULL;
}

When ctx->has_sps_st_rps == false (the common case for any HEVC stream where userspace doesn't supply EXT_SPS_*_RPS), run.ext_sps_st_rps retains stack garbage — the leftover bytes from a previous stack frame at the same offset. Empirically on the CoolPi CM5 GenBook this garbage is deterministically 0x51a0 per call, because the calling pattern is fixed.

Downstream in rkvdec_hevc_prepare_hw_st_rps (rkvdec-hevc-common.c:380):

if (!run->ext_sps_st_rps)
        return;                                /* ← garbage 0x51a0 is non-NULL */
if (!memcmp(cache, run->ext_sps_st_rps, sizeof(struct v4l2_ctrl_hevc_ext_sps_st_rps)))
        return;                                /* ← memcmp derefs 0x51a0 → OOPS */

The NULL-check is bypassed by stack garbage; memcmp faults at the bogus address.

This is reachable on every HEVC decode that hits the assemble path on vdpu381_variant (RK3588) and vdpu383_variant (RK3576) — i.e., every stream where the dispatcher gate in rkvdec-vdpu{381,383}-hevc.c:606 enters the else branch.

Minimal kernel patch

Either of these one-line patches fixes the OOPS without changing any behavior for valid use cases:

Option A — fix at the producer (preamble):

--- a/drivers/media/platform/rockchip/rkvdec/rkvdec-hevc-common.c
+++ b/drivers/media/platform/rockchip/rkvdec/rkvdec-hevc-common.c
@@ -495,6 +495,9 @@ void rkvdec_hevc_run_preamble(struct rkvdec_ctx *ctx,
        ctrl = v4l2_ctrl_find(&ctx->ctrl_hdl,
                              V4L2_CID_STATELESS_HEVC_SCALING_MATRIX);
        run->scaling_matrix = ctrl ? ctrl->p_cur.p : NULL;
+
+       run->ext_sps_st_rps = NULL;
+       run->ext_sps_lt_rps = NULL;

        if (ctx->has_sps_st_rps) {
                ctrl = v4l2_ctrl_find(&ctx->ctrl_hdl,

Option B — fix at the callers (zero-init the run struct):

--- a/drivers/media/platform/rockchip/rkvdec/rkvdec-vdpu381-hevc.c
+++ b/drivers/media/platform/rockchip/rkvdec/rkvdec-vdpu381-hevc.c
@@ -588,7 +588,7 @@ static int rkvdec_hevc_run(struct rkvdec_ctx *ctx)
 {
        struct rkvdec_dev *rkvdec = ctx->dev;
-       struct rkvdec_hevc_run run;
+       struct rkvdec_hevc_run run = {};
        struct rkvdec_hevc_ctx *hevc_ctx = ctx->priv;
        struct rkvdec_hevc_priv_tbl *tbl = hevc_ctx->priv_tbl.cpu;

(same for rkvdec-vdpu383-hevc.c)

Option A is more localized to the cause (the preamble's incomplete initialization). Option B is more defensive (catches any future fields added without explicit init). Either works.

Why iter2's userspace fix matters separately

Independently of the kernel patch, the iter2 work (vendored GStreamer 1.28.2 parser + UAPI shim + h265 cache population) IS the upstream-aligned path for HEVC streams where the SPS parameter set semantics require the new ST/LT RPS controls. When ffmpeg-vaapi forwards a buffer that DOES contain the SPS (e.g. some bitstream wrappers, or future ffmpeg versions that don't strip VPS/SPS/PPS), our parser activates and submits the EXT_SPS_*_RPS controls — exactly the upstream contract from the Casanova series.

For the current ffmpeg-vaapi (which strips SPS), our backend correctly returns ENODATA from the parser and falls back to submitting only the standard 5 controls. With the kernel patch above, this fallback succeeds without OOPS.

(There remains a separate question: for bbb_60s_720p.hevc, the SPS encoded by libx265 may or may not actually require EXT_SPS_*_RPS — typical streams use simple ST_RPS that the kernel can derive from slice headers alone, which is what the current rkvdec dispatcher path expects. The user-pref direction is "do it correctly aligned with upstream architecture", so the parser+cache+inject path stays.)

Phase 6 question completion

Q Answer
Q1 — what is 0x51a0 Stack garbage at run.ext_sps_st_rps offset from prior frame. Deterministic per calling pattern.
Q2 — where is ctrl->p_cur.p = 0x51a0 set N/A — not a ctrl pointer corruption. The 0x51a0 is local stack memory, not from the V4L2 ctrl handler.
Q3 — when ctx->has_sps_st_rps true FALSE in our case. The OOPS triggers via the missing-init-of-run, not via the has_sps_st_rps gate.
Q4 — backend log fires? YES, after fflush patch + picture.c entry-point instrumentation. Confirmed full path BeginPicture → RenderPicture×2 → EndPicture → h265_set_controls → populate_ext_sps_rps_cache → ENODATA → fallback to 5-ctrl batch → EINVAL → MEDIA_REQUEST_IOC_QUEUE → kernel OOPS.
Q5 — vdpu381 vs vdpu383 on RK3588 RK3588 → vdpu381_variant. Confirmed in drivers/media/platform/rockchip/rkvdec/rkvdec.c:1757 ("rockchip,rk3588-vdec"&vdpu381_variant). vdpu383 binds only to RK3576.

Open follow-ups

  • kernel-agent issue (new): file the 1-line patch above as an upstream-aligned fix to the Casanova v7.0 series. Tag for: applies to vdpu381 + vdpu383 both; reproducer is "any HEVC stream where ffmpeg-vaapi strips SPS" (i.e., everything).
  • iter4 plan: with kernel patched, re-run HEVC and confirm decode succeeds. Then iterate on the standard-5-controls EINVAL (separate issue — likely a slice_params field shape or DECODE_PARAMS flag the iter2 backend isn't setting correctly).
  • Optional userspace robustness: after kernel patch, the EXT_SPS_*_RPS path stays dormant on standard streams; consider deferring h265_populate to only when SPS is detected in source_data (cheap optimization).
  • Diagnostic instrumentation removal: iter3-Q4 logs in picture.c + h265.c + utils.c::fflush should be stripped once iter4 close confirms the kernel patch works. fflush in utils.c can stay (cheap, useful).

Substrate state at close

  • Backend .so: md5 404041ea2dcc03c769e0ab8c43ddadd6 on ampere /usr/lib/dri/
  • Backend source: ampere ~/src/libva-v4l2-request-fourier working tree has uncommitted iter3 diagnostics (picture.c + utils.c + h265.c)
  • Kernel: unchanged from iter2 substrate (waiting on kernel-agent issue + patch land)

Iter3 closes with root cause identified and a 1-line fix proposed, going from negative-result (iter2 close) to definitive-finding in one diagnostic iteration. The 8-phase loop's Phase 5 (review) recommendation in iter2 to add backend logging was the unblocking move.