[ka:experiment] RK3588 rkvdec HEVC: __pi_memcmp faults in rkvdec_hevc_prepare_hw_st_rps, wedges v4l2_mem2mem #11

Closed
opened 2026-05-16 07:27:04 +00:00 by claude-noether · 3 comments
Collaborator

[ka:experiment] RK3588 rkvdec HEVC: __pi_memcmp faults in rkvdec_hevc_prepare_hw_st_rps, wedges v4l2_mem2mem until reboot

Symptom

On ampere (CoolPi CM5 GenBook, RK3588, linux-ampere-fourier 7.0rc3.kafr1-1 = vanilla torvalds v7.0-rc3 + ampere DTS/board patches), the first attempt to decode HEVC via the libva-v4l2-request-fourier backend produces this kernel oops:

Modules linked in: ...rockchip_vdec...
lr : rkvdec_hevc_prepare_hw_st_rps+0x38/0x300 [rockchip_vdec]
Call trace:
 __pi_memcmp+0x10/0x110 (P)
 rkvdec_hevc_assemble_hw_rps+0x1c/0xac [rockchip_vdec]
 rkvdec_hevc_run+0x12c/0x148 [rockchip_vdec]
 rkvdec_device_run+0x48/0x104 [rockchip_vdec]
 v4l2_m2m_try_run+0x80/0x13c [v4l2_mem2mem]
 v4l2_m2m_request_queue+0xe8/0x140 [v4l2_mem2mem]
 media_request_ioctl_queue+0x1bc/0x2f4 [mc]
 media_request_ioctl+0xc8/0x160 [mc]
 __arm64_sys_ioctl+0xa4/0x100

Faulting instruction is __pi_memcmp called from rkvdec_hevc_prepare_hw_st_rps — most likely a NULL or unmapped pointer dereference during HEVC short-term RPS comparison.

Cascade: after the oops, the entire v4l2_mem2mem path wedges. Subsequent ffmpeg -hwaccel vaapi for H.264 (rkvdec) AND VP8 / MPEG-2 (hantro, different driver entirely) block in futex wait inside libva. Only reboot recovers.

Reproducer

On any ampere-class RK3588 host with the libva-v4l2-request-fourier backend installed (hand-built, replacing the broken packaged binary per marfrit-packages#17):

# Encode an HEVC test clip first, then:
LIBVA_DRIVER_NAME=v4l2_request \
ffmpeg -hide_banner -loglevel error \
    -hwaccel vaapi -hwaccel_output_format vaapi \
    -i bbb_60s_720p.hevc.mp4 \
    -vf "hwdownload,format=nv12" -frames:v 30 -f null -
# → kernel oops in dmesg, m2m wedge for all decoders until reboot

ampere-fourier campaign iter1 confirmed this is reproducible 1:1 across reboots.

Boundary / non-boundary

  • fresnel (RK3399 rkvdec via linux-fresnel-fourier 7.0-14): HEVC decode works end-to-end via the same libva backend, no oops. So the bug is RK3588-rkvdec-specific.
  • libva backend side: backend 7ac934e (iter38b) ships the iter31 fix (feedback_va_st_rps_bits_is_slice_field — route st_rps_bits to slice_params not decode_params). That fix is on RK3399 too, where it's necessary and sufficient. The RK3588 oops is downstream of that — same conceptual region (RPS preparation) but kernel-side.
  • No code change in the libva backend would fix this: the backend is sending the same correct controls that work on RK3399. The bug is in the RK3588 rkvdec HEVC driver's RPS assembly.

Suggested investigation paths (operator can pick)

  1. Search upstream for later -rc fixes: check the linux-media list and mainline -rc4..-rc7 (we're on -rc3) for rkvdec HEVC commits. RK3588 HEVC support is relatively new and a fix may already be in flight.
  2. Bisect within v7.0-rc1..-rc3: figure out when the RK3588 binding chain for rkvdec_hevc started faulting.
  3. Defensive guard: add NULL / range check around the memcmp call in rkvdec_hevc_prepare_hw_st_rps — won't fix the underlying issue but unwedges the m2m subsystem so the campaign can keep iterating on other codecs.
  4. Strace ABI compare: collect strace of libva→V4L2 ioctl traffic on RK3399 (HEVC works) vs RK3588 (oops) for the same clip and compare control bytes. The fault is in RPS handling specifically, so the diff should be small. (This would also rule out a libva-side bug that just happens not to trip on RK3399.)

Out-of-scope per operator policy

Per the ampere-fourier campaign convention (2026-05-16): ampere stays on a clean mainline + board-DTS kernel. Any fix patch belongs in a kernel-agent experiment branch / target, NOT in the baseline linux-ampere-fourier package.

Test rig

Reproducible on ampere (CoolPi CM5 GenBook). Test clip bbb_60s_720p.hevc.mp4 at ~/measurements/encoded/. dmesg + reproducer scripts at ~/measurements/p3/ and ~/src/ampere-fourier/phase3_scripts/.

Refs

  • ampere-fourier campaign iter1 Phase 0: ~/src/ampere-fourier/phase0_findings.md
  • fresnel-fourier libva iter31 fix (the upstream-of-this region): feedback_va_st_rps_bits_is_slice_field
  • Backend source pin: marfrit/libva-v4l2-request-fourier @ 7ac934e
Author
Collaborator

Reclassification from ampere-kernel-decoders iter1 Phase 0 prior-art survey

The Phase 0 upstream survey (linux-rockchip / linux-media / Collabora / Kwiboo) + a kernel-source read on ampere strongly indicate this OOPS is not a kernel bug — it's a userspace UAPI gap against Linux 7.0's new HEVC controls.

Mechanism

The Casanova/Collabora v8 series (the patches that landed RK3588 HEVC in mainline 7.0, lkml.org/lkml/2026/1/9/1334) introduces two new V4L2 controls that VDPU381 HEVC requires:

  • V4L2_CID_STATELESS_HEVC_EXT_SPS_ST_RPS (short-term reference picture set)
  • V4L2_CID_STATELESS_HEVC_EXT_SPS_LT_RPS (long-term reference picture set)

On ampere with the current substrate:

  • grep V4L2_CID_STATELESS_HEVC_EXT ~/src/libva-v4l2-request-fourier/src/ → zero hits. Backend 7ac934e (which is what fresnel-iter38 certified pre-7.0-UAPI) never populates either control.
  • grep V4L2_CID_STATELESS_HEVC_EXT /usr/include/linux/v4l2-controls.h → zero hits. linux-api-headers 6.19-1 predates the constant definitions entirely.
  • Kernel code at drivers/media/platform/rockchip/rkvdec/rkvdec-hevc-common.c:500-509 looks up the new CIDs via v4l2_ctrl_find and stores ctrl ? ctrl->p_cur.p : NULL in run->ext_sps_st_rps / _lt_rps.
  • rkvdec_hevc_prepare_hw_st_rps (lines 380-410, the OOPS site) has an early-return if (!run->ext_sps_st_rps) return; — but if the kernel auto-registered the control with a non-NULL but uninitialized p_cur.p, the early-return doesn't fire and memcmp reads invalid offsets → __pi_memcmp fault.

Implication

Fix path is userspace: extend the libva backend to populate the new CIDs with valid SPS-parsed data. There's a defense-in-depth case for the kernel too (prepare_hw_st_rps should validate the pointer beyond non-NULL), but the fix-path-for-ampere-HEVC is in the backend.

Action

Closing this issue as reclassified. Refiling against marfrit/libva-v4l2-request-fourier as a new issue: "extend backend to populate V4L2_CID_STATELESS_HEVC_EXT_SPS_ST_RPS / _LT_RPS for VDPU381 HEVC."

If the userspace fix doesn't fully resolve the OOPS (i.e. mechanism reconstruction is wrong), I'll re-open this with the new evidence.

Full survey + verification in ~/src/ampere-kernel-decoders/phase0_findings.md.

Refs:

  • Casanova v8 cover: lkml.org/lkml/2026/1/9/1334
  • Collabora retrospective: collabora.com 7.0 news post
  • feedback_review_empirical_over_theoretical — verified the survey hypothesis against actual kernel source before reclassifying
Author
Collaborator

Re-opening: iter2 backend fix did NOT resolve the OOPS

From ampere-kernel-decoders iter2 Phase 6 Step 5 smoke test, 2026-05-16 evening.

We implemented the patch the Phase 0 prior-art survey predicted would fix this (vendored GStreamer 1.28.2 H.265 parser, populate V4L2_CID_STATELESS_HEVC_EXT_SPS_ST_RPS + _LT_RPS controls per the GStreamer reference pattern, runtime-optional probe). Backend commits f91c3f5..1a2c958 on marfrit/libva-v4l2-request-fourier. The patch installed cleanly; backend's iter2: log line fires confirming the probe sees the new CIDs registered. HEVC still OOPSes with the same stack and the same register state as pre-iter2.

New evidence

1. Strace shows the new controls don't reach the kernel

Per-frame VIDIOC_S_EXT_CTRLS ioctl trace from a failing HEVC ffmpeg invocation post-iter2-install:

ioctl(5</dev/video3>, VIDIOC_S_EXT_CTRLS, {ctrl_class=…, count=5, controls=[
    {id=0xa40a90  /* HEVC_SPS */,            size=40,  …},
    {id=0xa40a91  /* HEVC_PPS */,            size=64,  …},
    {id=0xa40a92  /* HEVC_SLICE_PARAMS */,   size=280, …},
    {id=0xa40a93  /* HEVC_SCALING_MATRIX */, size=1000,…},
    {id=0xa40a94  /* HEVC_DECODE_PARAMS */,  size=328, …},
]}) = -1 EINVAL (error_idx=5)

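0xa40a98 (HEVC_EXT_SPS_ST_RPS = V4L2_CID_CODEC_STATELESS_BASE + 408) and 0xa40a99 (HEVC_EXT_SPS_LT_RPS = base + 409) do not appear in any S_EXT_CTRLS call in the strace, so the code path wired up in iter2 does NOT actually submit the new controls. The standard 5-control submission returns EINVAL with error_idx=5. Because error_idx equals count, the error is not attributed to a specific control; per the V4L2 extended-controls documentation this typically means the batch failed the validation step before anything was applied. Re-issuing the same batch through VIDIOC_TRY_EXT_CTRLS should report the index of the offending control. The backend ignores the error and proceeds to queue the request; the OOPS follows.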
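The control-ID arithmetic above can be checked with a small standalone sketch. It mirrors the class and base values from mainline `linux/v4l2-controls.h` under local names, so it builds against the older 6.19 headers. The +408 and +409 offsets are taken from this comment rather than from a header, and `hevc_cid` is a hypothetical helper.

```c
#include <stdint.h>

/* Mirrors V4L2_CTRL_CLASS_CODEC_STATELESS and V4L2_CID_CODEC_STATELESS_BASE
 * from linux/v4l2-controls.h, under local names so no 7.0 headers are needed. */
#define CODEC_STATELESS_CLASS 0x00a40000u
#define CODEC_STATELESS_BASE  (CODEC_STATELESS_CLASS | 0x900u)

/* Offsets 400..404 are the five controls seen in the strace;
 * 408/409 are the EXT_SPS_*_RPS offsets as stated in this comment. */
enum {
    HEVC_SPS_OFFSET            = 400,
    HEVC_DECODE_PARAMS_OFFSET  = 404,
    HEVC_EXT_SPS_ST_RPS_OFFSET = 408,
    HEVC_EXT_SPS_LT_RPS_OFFSET = 409,
};

/* Hypothetical helper: control ID for a given offset from the stateless base. */
static inline uint32_t hevc_cid(uint32_t offset)
{
    return CODEC_STATELESS_BASE + offset;
}
```

`hevc_cid(HEVC_EXT_SPS_ST_RPS_OFFSET)` evaluates to 0xa40a98, so grepping a strace for `id=0xa40a98` and `id=0xa40a99` is enough to tell whether the new controls were ever submitted.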

Possible iter2-side root cause (TBD by iter3):

  • Most likely: h265_populate_ext_sps_rps_cache returns -ENODATA because the SPS NAL is not present in surface_object->source_data — ffmpeg-vaapi's HEVC submission protocol doesn't necessarily concatenate the SPS into the slice data buffer. Phase 5 review item #3 anticipated this needing parse-and-cache; iter2's implementation tries to cache but never gets a first-frame SPS to populate from.
  • Plausible alternative: the backend's HEVC submission path bypasses h265.c::h265_set_controls and goes through picture.c directly, in which case iter2's edits never fire regardless of cache.

2. The OOPS register state is the SAME pre- and post-iter2

Three captures (2 pre-iter2 from ampere-fourier iter1 + ampere-kernel-decoders iter2 Phase 3, 1 post-iter2-install today) all show identical:

pc : __pi_memcmp+0x10/0x110
lr : rkvdec_hevc_prepare_hw_st_rps+0x38/0x300 [rockchip_vdec]
x0 : <kernel-heap address>  (cache pointer, valid)
x1 : 00000000000051a0       (run->ext_sps_st_rps — INVALID address)
x2 : 0000000000000048       (sizeof = 72 bytes)
pgd=0000000000000000        (no page table for x1)

The 0x51a0 value is deterministic across reboots and regardless of whether the installed backend attempts to populate the controls. This is not a "garbage stale heap value": it is specifically 0x51a0 every time. That smells like:

  • An OFFSET being treated as a pointer somewhere
  • Or a count field at a specific struct location being read as a pointer
  • Or a ctrl->p_cur.p field that has a default-init value of 0x51a0 for some reason

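0x51a0 = 20896 decimal. The candidate decompositions do not work out cleanly: 80 (sizeof struct) × 261 = 20880, which is 16 short; 20896 is not a multiple of the 72-byte compare length in x2 either; 0x51a0 ÷ 8 = 0xA34 (2612) is exact but not obviously meaningful. The value doesn't immediately match a known sentinel. Worth checking what ctrl->p_cur.p is initialized to for dimensional-array controls (kernel's v4l2-ctrls-*.c registration path).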
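Decompositions like these are easy to get wrong by hand, so here is a minimal sketch for testing candidate element sizes against the faulting value. The helper names are hypothetical, and the candidate sizes (80, 72, 8) are simply the ones that come up in this comment.

```c
#include <stdbool.h>
#include <stdint.h>

#define FAULT_ADDR 0x51a0u /* x1 at the time of the fault */

/* True if addr is an exact multiple of a candidate element size. */
static inline bool is_exact_multiple(uint64_t addr, uint64_t elem_size)
{
    return elem_size != 0 && addr % elem_size == 0;
}

/* Largest power of two dividing addr, i.e. its natural alignment. */
static inline uint64_t alignment_of(uint64_t addr)
{
    return addr & (~addr + 1);
}
```

For `FAULT_ADDR`, `is_exact_multiple` is false for 80 and 72 and true for 8, and `alignment_of` yields 32. That is consistent with a 32-byte-aligned offset rather than an index times either struct size, though it does not identify the source.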

3. Kernel-side gate IS firing

Reading rkvdec-hevc-common.c:500-509:

if (ctx->has_sps_st_rps) {
    ctrl = v4l2_ctrl_find(&ctx->ctrl_hdl, V4L2_CID_STATELESS_HEVC_EXT_SPS_ST_RPS);
    run->ext_sps_st_rps = ctrl ? ctrl->p_cur.p : NULL;
}

If ctx->has_sps_st_rps is FALSE (userspace never set the controls), the assignment to run->ext_sps_st_rps doesn't happen — so run->ext_sps_st_rps should be whatever the previous frame left it as, OR zero if run is memset on each frame. The fact that we see 0x51a0 (not 0 or a stale-prev-frame value) suggests:

  • Either ctx->has_sps_st_rps is TRUE even without our userspace setting the controls (open question: how does it become true?), AND ctrl->p_cur.p is 0x51a0 (open question: where does that value come from?).
  • Or run->ext_sps_st_rps is initialized to 0x51a0 somewhere else (open question: where? Maybe at struct definition with a sentinel?).
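The reasoning above can be illustrated with a userspace mock. It is a sketch only: `mock_ctx`, `mock_run`, `mock_fill_run`, and the other helpers are hypothetical stand-ins that copy the shape of the quoted gate, not the driver's actual types. It shows that a conditional assignment leaves whatever value the field already held when the flag is false, and that a plain NULL check does not reject a small non-NULL value such as 0x51a0.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct mock_ctx { bool has_sps_st_rps; const void *ctrl_payload; };
struct mock_run { const void *ext_sps_st_rps; };

/* Same shape as the quoted gate: the field is written only when the flag is set. */
static inline void mock_fill_run(const struct mock_ctx *ctx, struct mock_run *run)
{
    if (ctx->has_sps_st_rps)
        run->ext_sps_st_rps = ctx->ctrl_payload;
}

/* Early-return style guard: only NULL is rejected. */
static inline bool passes_null_guard(const struct mock_run *run)
{
    return run->ext_sps_st_rps != NULL;
}

/* Stand-in for leftover memory: pre-load the field with a chosen value. */
static inline struct mock_run run_with_leftover(uintptr_t leftover)
{
    struct mock_run run;
    run.ext_sps_st_rps = (const void *)leftover;
    return run;
}
```

With the flag false, a run pre-loaded with 0x51a0 keeps that value and passes the guard, while a zero-initialized run (`struct mock_run run = {0};`) stays NULL and is rejected. Which of these the driver actually hits depends on how `run` is initialized by its caller.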

The kernel rkvdec source we inspected:

static int rkvdec_s_ctrl(struct v4l2_ctrl *ctrl)
{
    if (ctrl->id == V4L2_CID_STATELESS_HEVC_EXT_SPS_ST_RPS) {
        ctx->has_sps_st_rps |= !!(ctrl->has_changed);
        return 0;
    }
    /* … */
}

ctx->has_sps_st_rps only goes true when ctrl->has_changed is set, which the V4L2 framework sets when userspace S_EXT_CTRLS's the control. If userspace never does, has_sps_st_rps stays false (assuming it starts false). So either:

  • has_sps_st_rps is somehow initialized to true (init bug)
  • ctrl->has_changed is being set without userspace involvement (registration bug)
  • Something else marks has_sps_st_rps true

Asks for iter3

Re-opening this issue (originally closed under reclassification-to-userspace). Three diagnostic paths kernel-agent could pursue:

  1. kprobe / printk-instrument rkvdec_hevc_prepare_hw_st_rps to dump the run->ext_sps_st_rps value just before the memcmp. This confirms whether it is always 0x51a0 (a deterministic source) or varies (the kernel reading stale memory).
  2. Trace ctx->has_sps_st_rps init + transitions across one decode session. Find why it's true without S_EXT_CTRLS.
  3. Compare against fresnel RK3399 — does ctx->has_sps_st_rps exist on the RK3399 rkvdec path at all? Or is this an RK3588-VDPU381-only struct field? If the latter, the field's initialization is the smoking gun.

iter2 backend infrastructure (vendored GStreamer parser, runtime probe, h265_set_controls extension) stays on master — reusable once the kernel-side root cause is understood.

Refs

  • ~/src/ampere-kernel-decoders/iter2_close.md — full iter2 negative-result writeup
  • marfrit/libva-v4l2-request-fourier @ bea8a79 — backend tip with iter2 commits
  • Phase 0 prior-art survey (lkml/Casanova v8): ~/src/ampere-kernel-decoders/phase0_findings_iter2.md
  • Phase 3 strace baseline (pre-iter2): ~/src/ampere-kernel-decoders/phase3_baseline_iter2.md
Author
Collaborator

Duplicate of #14 (https://git.reauktion.de/marfrit/kernel-agent/issues/14). Both issues describe the identical __pi_memcmp+0x10/0x110 fault from rkvdec_hevc_prepare_hw_st_rps+0x38/0x300, same call chain through rkvdec_hevc_assemble_hw_rps → rkvdec_hevc_run → rkvdec_device_run → v4l2_m2m_*, same hardware (ampere/RK3588), same reproducer (bbb_60s_720p.hevc.mp4 via libva-v4l2-request-fourier + ffmpeg vaapi). Filed 3 hours apart.

#14 found the exact root cause (struct rkvdec_hevc_run run; left uninitialized on the dispatcher stack — rkvdec-vdpu381-hevc.c:591 + rkvdec-vdpu383-hevc.c) and ships a one-line fix patch (both producer-fix Option A and caller-zero-init Option B variants).

Triage 2026-05-18: closing this one, keeping #14 as the canonical home with the fix details. Investigation suggestions from this issue body (search upstream, bisect within rc1..rc3, defensive guard, strace ABI compare) are subsumed by #14's closed-form root cause.
