iter3 close — kernel root cause: uninitialized run.ext_sps_st_rps
rkvdec_hevc_run_preamble conditionally assigns run->ext_sps_st_rps and run->ext_sps_lt_rps only when ctx->has_sps_*_rps is true. The caller (vdpu381-hevc.c:591) allocates the run struct without zero-init, so when has_sps_st_rps is false, run.ext_sps_st_rps retains stack garbage (0x51a0 deterministically on RK3588 CoolPi CM5 GenBook). prepare_hw_st_rps's `if (!run->ext_sps_st_rps) return;` check passes (garbage is non-NULL), then memcmp dereferences 0x51a0 → OOPS. Backend diagnostic instrumentation (iter3-Q4) revealed the full dispatch path BeginPicture → RenderPicture → EndPicture → h265_set_controls → populate_ext_sps_rps_cache(ENODATA, because ffmpeg-vaapi strips SPS/VPS/PPS, leaving only slice NALs) → fallback to 5-ctrl batch (EINVAL) → MEDIA_REQUEST_IOC_QUEUE → kernel OOPS. The OOPS occurs even when our EXT_SPS_*_RPS controls are NEVER submitted — purely a kernel-side init bug. 1-line patches proposed (Option A — fix in preamble, Option B — zero-init run struct at caller). Q1-Q5 all answered. Userspace iter2 work stays — upstream-aligned path for streams that need EXT_SPS_*_RPS. Iter4 deferred until kernel patch lands via kernel-agent. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
This commit is contained in:
+132
@@ -0,0 +1,132 @@
|
||||
# Iter3 close — kernel-side root cause located
|
||||
|
||||
Date: 2026-05-16 (afternoon)
|
||||
Branch: `master`
|
||||
Substrate: ampere `7.0.0-rc3-devices+` (mmind v7.0 + Casanova HEVC 7.0 series + ampere DTS)
|
||||
Backend: `libva-v4l2-request-fourier` iter2 + iter3 diagnostic instrumentation (md5 `404041ea`)
|
||||
|
||||
## Bottom line
|
||||
|
||||
**The HEVC OOPS on RK3588 is a one-line kernel bug in the Casanova/Collabora 7.0 HEVC series.** Userspace work to vendor the GStreamer parser and submit `V4L2_CID_STATELESS_HEVC_EXT_SPS_{ST,LT}_RPS` controls is the *correct upstream-aligned approach* for HEVC streams that need it, BUT the OOPS occurs even without those controls because of a missing struct-field initialization in the kernel.
|
||||
|
||||
## Falsifier outcome
|
||||
|
||||
F1 (no kernel OOPS): **TRUE — F1 fires again**, same as iter2. OOPS reproduced deterministically on fresh-boot run at 12:30:15 with iter3 instrumented build.
|
||||
|
||||
F2 (vendored GStreamer parser extracts SPS from libavcodec-stripped source): **FALSE — verified empirically.** Diagnostic log shows `populate_ext_sps_rps_cache entry: src=0xffff7de54000 src_size=280 cache_valid=0` → `nal_iter=0 pr=5 offset=0 nalu.type=20 nalu.size=277` → `exit: err=-61 cache_valid=0`. First NAL in source_data is type 20 (`IDR_N_LP slice`), not type 33 (`SPS`). ffmpeg-vaapi strips VPS/SPS/PPS — confirms Phase 5 reviewer's prediction. This is consistent with the iter2 strace finding and not the bug.
|
||||
|
||||
## Root cause (Q1 + Q3 combined)
|
||||
|
||||
In `drivers/media/platform/rockchip/rkvdec/rkvdec-vdpu381-hevc.c:591`:
|
||||
|
||||
```c
|
||||
static int rkvdec_hevc_run(struct rkvdec_ctx *ctx)
|
||||
{
|
||||
struct rkvdec_dev *rkvdec = ctx->dev;
|
||||
struct rkvdec_hevc_run run; // ← UNINITIALIZED
|
||||
struct rkvdec_hevc_ctx *hevc_ctx = ctx->priv;
|
||||
struct rkvdec_hevc_priv_tbl *tbl = hevc_ctx->priv_tbl.cpu;
|
||||
|
||||
rkvdec_hevc_run_preamble(ctx, &run);
|
||||
...
|
||||
}
|
||||
```
|
||||
|
||||
`rkvdec_hevc_run_preamble` in `rkvdec-hevc-common.c:478` assigns most `run->*` fields unconditionally, but the new EXT_SPS fields are conditional:
|
||||
|
||||
```c
|
||||
if (ctx->has_sps_st_rps) {
|
||||
ctrl = v4l2_ctrl_find(&ctx->ctrl_hdl,
|
||||
V4L2_CID_STATELESS_HEVC_EXT_SPS_ST_RPS);
|
||||
run->ext_sps_st_rps = ctrl ? ctrl->p_cur.p : NULL;
|
||||
}
|
||||
if (ctx->has_sps_lt_rps) {
|
||||
ctrl = v4l2_ctrl_find(&ctx->ctrl_hdl,
|
||||
V4L2_CID_STATELESS_HEVC_EXT_SPS_LT_RPS);
|
||||
run->ext_sps_lt_rps = ctrl ? ctrl->p_cur.p : NULL;
|
||||
}
|
||||
```
|
||||
|
||||
When `ctx->has_sps_st_rps == false` (the common case for any HEVC stream where userspace doesn't supply EXT_SPS_*_RPS), `run.ext_sps_st_rps` retains **stack garbage** — the leftover bytes from a previous stack frame at the same offset. Empirically on the CoolPi CM5 GenBook this garbage is deterministically `0x51a0` per call, because the calling pattern is fixed.
|
||||
|
||||
Downstream in `rkvdec_hevc_prepare_hw_st_rps` (rkvdec-hevc-common.c:380):
|
||||
|
||||
```c
|
||||
if (!run->ext_sps_st_rps)
|
||||
return; /* ← garbage 0x51a0 is non-NULL */
|
||||
if (!memcmp(cache, run->ext_sps_st_rps, sizeof(struct v4l2_ctrl_hevc_ext_sps_st_rps)))
|
||||
return; /* ← memcmp derefs 0x51a0 → OOPS */
|
||||
```
|
||||
|
||||
The NULL-check is bypassed by stack garbage; `memcmp` faults at the bogus address.
|
||||
|
||||
This is reachable on every HEVC decode that hits the assemble path on `vdpu381_variant` (RK3588) and `vdpu383_variant` (RK3576) — i.e., every stream where the dispatcher gate in `rkvdec-vdpu{381,383}-hevc.c:606` enters the else branch.
|
||||
|
||||
## Minimal kernel patch
|
||||
|
||||
Either of these one-line patches fixes the OOPS without changing any behavior for valid use cases:
|
||||
|
||||
**Option A — fix at the producer (preamble):**
|
||||
```diff
|
||||
--- a/drivers/media/platform/rockchip/rkvdec/rkvdec-hevc-common.c
|
||||
+++ b/drivers/media/platform/rockchip/rkvdec/rkvdec-hevc-common.c
|
||||
@@ -495,6 +495,9 @@ void rkvdec_hevc_run_preamble(struct rkvdec_ctx *ctx,
|
||||
ctrl = v4l2_ctrl_find(&ctx->ctrl_hdl,
|
||||
V4L2_CID_STATELESS_HEVC_SCALING_MATRIX);
|
||||
run->scaling_matrix = ctrl ? ctrl->p_cur.p : NULL;
|
||||
+
|
||||
+ run->ext_sps_st_rps = NULL;
|
||||
+ run->ext_sps_lt_rps = NULL;
|
||||
|
||||
if (ctx->has_sps_st_rps) {
|
||||
ctrl = v4l2_ctrl_find(&ctx->ctrl_hdl,
|
||||
```
|
||||
|
||||
**Option B — fix at the callers (zero-init the run struct):**
|
||||
```diff
|
||||
--- a/drivers/media/platform/rockchip/rkvdec/rkvdec-vdpu381-hevc.c
|
||||
+++ b/drivers/media/platform/rockchip/rkvdec/rkvdec-vdpu381-hevc.c
|
||||
@@ -588,7 +588,7 @@ static int rkvdec_hevc_run(struct rkvdec_ctx *ctx)
|
||||
{
|
||||
struct rkvdec_dev *rkvdec = ctx->dev;
|
||||
- struct rkvdec_hevc_run run;
|
||||
+ struct rkvdec_hevc_run run = {};
|
||||
struct rkvdec_hevc_ctx *hevc_ctx = ctx->priv;
|
||||
struct rkvdec_hevc_priv_tbl *tbl = hevc_ctx->priv_tbl.cpu;
|
||||
```
|
||||
(same for `rkvdec-vdpu383-hevc.c`)
|
||||
|
||||
Option A is more localized to the *cause* (the preamble's incomplete initialization). Option B is more defensive (catches any future fields added without explicit init). Either works.
|
||||
|
||||
## Why iter2's userspace fix matters separately
|
||||
|
||||
Independently of the kernel patch, the iter2 work (vendored GStreamer 1.28.2 parser + UAPI shim + h265 cache population) IS the upstream-aligned path for HEVC streams where the SPS parameter set semantics require the new ST/LT RPS controls. When ffmpeg-vaapi forwards a buffer that DOES contain the SPS (e.g. some bitstream wrappers, or future ffmpeg versions that don't strip VPS/SPS/PPS), our parser activates and submits the EXT_SPS_*_RPS controls — exactly the upstream contract from the Casanova series.
|
||||
|
||||
For the current ffmpeg-vaapi (which strips SPS), our backend correctly returns ENODATA from the parser and falls back to submitting only the standard 5 controls. With the kernel patch above, this fallback succeeds without OOPS.
|
||||
|
||||
(There remains a separate question: for bbb_60s_720p.hevc, the SPS encoded by libx265 may or may not actually require EXT_SPS_*_RPS — typical streams use simple ST_RPS that the kernel can derive from slice headers alone, which is what the current rkvdec dispatcher path expects. The user-pref direction is "do it correctly aligned with upstream architecture", so the parser+cache+inject path stays.)
|
||||
|
||||
## Phase 6 question completion
|
||||
|
||||
| Q | Answer |
|
||||
|---|--------|
|
||||
| Q1 — what is `0x51a0` | Stack garbage at `run.ext_sps_st_rps` offset from prior frame. Deterministic per calling pattern. |
|
||||
| Q2 — where is `ctrl->p_cur.p = 0x51a0` set | N/A — not a ctrl pointer corruption. The `0x51a0` is local stack memory, not from the V4L2 ctrl handler. |
|
||||
| Q3 — when `ctx->has_sps_st_rps` true | FALSE in our case. The OOPS triggers via the missing-init-of-run, not via the has_sps_st_rps gate. |
|
||||
| Q4 — backend log fires? | YES, after fflush patch + picture.c entry-point instrumentation. Confirmed full path BeginPicture → RenderPicture×2 → EndPicture → h265_set_controls → populate_ext_sps_rps_cache → ENODATA → fallback to 5-ctrl batch → EINVAL → MEDIA_REQUEST_IOC_QUEUE → kernel OOPS. |
|
||||
| Q5 — vdpu381 vs vdpu383 on RK3588 | RK3588 → `vdpu381_variant`. Confirmed in `drivers/media/platform/rockchip/rkvdec/rkvdec.c:1757` (`"rockchip,rk3588-vdec"` → `&vdpu381_variant`). vdpu383 binds only to RK3576. |
|
||||
|
||||
## Open follow-ups
|
||||
|
||||
- **kernel-agent issue (new)**: file the 1-line patch above as an upstream-aligned fix to the Casanova v7.0 series. Tag for: applies to vdpu381 + vdpu383 both; reproducer is "any HEVC stream where ffmpeg-vaapi strips SPS" (i.e., everything).
|
||||
- **iter4 plan**: with kernel patched, re-run HEVC and confirm decode succeeds. Then iterate on the standard-5-controls EINVAL (separate issue — likely a slice_params field shape or DECODE_PARAMS flag the iter2 backend isn't setting correctly).
|
||||
- **Optional userspace robustness**: after kernel patch, the EXT_SPS_*_RPS path stays dormant on standard streams; consider deferring h265_populate to *only when SPS is detected in source_data* (cheap optimization).
|
||||
- **Diagnostic instrumentation removal**: iter3-Q4 logs in `picture.c` + `h265.c` + `utils.c::fflush` should be stripped once iter4 close confirms the kernel patch works. fflush in utils.c can stay (cheap, useful).
|
||||
|
||||
## Substrate state at close
|
||||
|
||||
- Backend `.so`: md5 `404041ea2dcc03c769e0ab8c43ddadd6` on ampere `/usr/lib/dri/`
|
||||
- Backend source: ampere `~/src/libva-v4l2-request-fourier` working tree has uncommitted iter3 diagnostics (picture.c + utils.c + h265.c)
|
||||
- Kernel: unchanged from iter2 substrate (waiting on kernel-agent issue + patch land)
|
||||
|
||||
Iter3 closes with **root cause identified and a 1-line fix proposed**, going from negative-result (iter2 close) to definitive-finding in one diagnostic iteration. The 8-phase loop's Phase 5 (review) recommendation in iter2 to add backend logging was the unblocking move.
|
||||
Reference in New Issue
Block a user