[ka:experiment] RK3588 rkvdec HEVC: __pi_memcmp faults in rkvdec_hevc_prepare_hw_st_rps, wedges v4l2_mem2mem #11
Symptom
On ampere (CoolPi CM5 GenBook, RK3588, linux-ampere-fourier 7.0rc3.kafr1-1 = vanilla torvalds v7.0-rc3 + ampere DTS/board patches), the first attempt to decode HEVC via the libva-v4l2-request-fourier backend produces a kernel oops.

The faulting instruction is in __pi_memcmp, called from rkvdec_hevc_prepare_hw_st_rps: most likely a NULL or unmapped pointer dereference during HEVC short-term RPS comparison.

Cascade: after the oops, the entire v4l2_mem2mem path wedges. Subsequent ffmpeg -hwaccel vaapi runs for H.264 (rkvdec) AND VP8 / MPEG-2 (hantro, a different driver entirely) block in a futex wait inside libva. Only a reboot recovers.

Reproducer
On any ampere-class RK3588 host with the libva-v4l2-request-fourier backend installed (hand-built over the broken packaged binary per marfrit-packages#17), decode an HEVC clip with ffmpeg -hwaccel vaapi. ampere-fourier campaign iter1 confirmed this is reproducible 1:1 across reboots.
Boundary / non-boundary
- linux-fresnel-fourier 7.0-14: HEVC decode works end-to-end via the same libva backend, no oops. So the bug is RK3588-rkvdec-specific.
- Backend 7ac934e (iter38b) ships the iter31 fix (feedback_va_st_rps_bits_is_slice_field: route st_rps_bits to slice_params, not decode_params). That fix is on RK3399 too, where it's necessary and sufficient. The RK3588 oops is downstream of that: same conceptual region (RPS preparation) but kernel-side.

Suggested investigation paths (operator can pick)
- Search upstream for rkvdec HEVC commits: RK3588 HEVC support is relatively new and a fix may already be in flight.
- Defensive guard around the memcmp call in rkvdec_hevc_prepare_hw_st_rps: won't fix the underlying issue but unwedges the m2m subsystem so the campaign can keep iterating on other codecs.

Out-of-scope per operator policy
Per the ampere-fourier campaign convention (2026-05-16): ampere stays on a clean mainline + board-DTS kernel. Any fix patch belongs in a kernel-agent experiment branch / target, NOT in the baseline linux-ampere-fourier package.

Test rig
Reproducible on ampere (CoolPi CM5 GenBook). Test clip bbb_60s_720p.hevc.mp4 at ~/measurements/encoded/. dmesg + reproducer scripts at ~/measurements/p3/ and ~/src/ampere-fourier/phase3_scripts/.

Refs
- ~/src/ampere-fourier/phase0_findings.md
- feedback_va_st_rps_bits_is_slice_field
- marfrit/libva-v4l2-request-fourier @ 7ac934e

Reclassification from ampere-kernel-decoders iter1 Phase 0 prior-art survey
The Phase 0 upstream survey (linux-rockchip / linux-media / Collabora / Kwiboo) + a kernel-source read on ampere strongly indicate this OOPS is not a kernel bug — it's a userspace UAPI gap against Linux 7.0's new HEVC controls.
Mechanism
The Casanova/Collabora v8 series (the patches that landed RK3588 HEVC in mainline 7.0, lkml.org/lkml/2026/1/9/1334) introduces two new V4L2 controls that VDPU381 HEVC requires:

- V4L2_CID_STATELESS_HEVC_EXT_SPS_ST_RPS (short-term reference picture set)
- V4L2_CID_STATELESS_HEVC_EXT_SPS_LT_RPS (long-term reference picture set)

On ampere with the current substrate:

- grep V4L2_CID_STATELESS_HEVC_EXT ~/src/libva-v4l2-request-fourier/src/ → zero hits. Backend 7ac934e (which is what fresnel-iter38 certified pre-7.0-UAPI) never populates either control.
- grep V4L2_CID_STATELESS_HEVC_EXT /usr/include/linux/v4l2-controls.h → zero hits. linux-api-headers 6.19-1 predates the constant definitions entirely.
- drivers/media/platform/rockchip/rkvdec/rkvdec-hevc-common.c:500-509 looks up the new CIDs via v4l2_ctrl_find and stores ctrl ? ctrl->p_cur.p : NULL in run->ext_sps_st_rps / _lt_rps.
- rkvdec_hevc_prepare_hw_st_rps (lines 380-410, the OOPS site) has an early return, if (!run->ext_sps_st_rps) return;, but if the kernel auto-registered the control with a non-NULL but uninitialized p_cur.p, the early return doesn't fire and memcmp reads invalid offsets → __pi_memcmp fault.

Implication
Fix path is userspace: extend the libva backend to populate the new CIDs with valid SPS-parsed data. There's a defense-in-depth case for the kernel too (prepare_hw_st_rps should validate the pointer beyond non-NULL), but the fix path for ampere HEVC is in the backend.

Action
Closing this issue as reclassified. Refiling against marfrit/libva-v4l2-request-fourier as a new issue: "extend backend to populate V4L2_CID_STATELESS_HEVC_EXT_SPS_ST_RPS / _LT_RPS for VDPU381 HEVC."

If the userspace fix doesn't fully resolve the OOPS (i.e. the mechanism reconstruction is wrong), I'll re-open this with the new evidence.
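
Since linux-api-headers 6.19-1 does not define the two new control IDs, a backend that wants to name them has to spell them out itself. The following is a minimal sketch of guarded fallback definitions, not the backend's actual code. The offsets 408 and 409 and the resulting values 0xa40a98 / 0xa40a99 are the ones quoted in this thread's strace analysis, not read from a header, and is_hevc_ext_sps_rps_cid is a hypothetical helper added only for illustration.

```c
#include <assert.h>   /* static_assert */
#include <stdint.h>

/* Class and base values as defined in linux/v4l2-controls.h. */
#ifndef V4L2_CTRL_CLASS_CODEC_STATELESS
#define V4L2_CTRL_CLASS_CODEC_STATELESS 0x00a40000
#endif
#ifndef V4L2_CID_CODEC_STATELESS_BASE
#define V4L2_CID_CODEC_STATELESS_BASE (V4L2_CTRL_CLASS_CODEC_STATELESS | 0x900)
#endif

/* Assumption: offsets 408 / 409 are taken from this issue thread. */
#ifndef V4L2_CID_STATELESS_HEVC_EXT_SPS_ST_RPS
#define V4L2_CID_STATELESS_HEVC_EXT_SPS_ST_RPS (V4L2_CID_CODEC_STATELESS_BASE + 408)
#endif
#ifndef V4L2_CID_STATELESS_HEVC_EXT_SPS_LT_RPS
#define V4L2_CID_STATELESS_HEVC_EXT_SPS_LT_RPS (V4L2_CID_CODEC_STATELESS_BASE + 409)
#endif

/*
 * Compile-time cross-check against the numeric IDs searched for in the
 * strace. If a newer header defines different values, the build fails here
 * instead of silently using the wrong ID.
 */
static_assert(V4L2_CID_STATELESS_HEVC_EXT_SPS_ST_RPS == 0xa40a98,
              "ST_RPS control ID does not match the value quoted in the issue");
static_assert(V4L2_CID_STATELESS_HEVC_EXT_SPS_LT_RPS == 0xa40a99,
              "LT_RPS control ID does not match the value quoted in the issue");

/* Hypothetical helper: returns 1 if id is one of the two new controls. */
int is_hevc_ext_sps_rps_cid(uint32_t id)
{
    return id == V4L2_CID_STATELESS_HEVC_EXT_SPS_ST_RPS ||
           id == V4L2_CID_STATELESS_HEVC_EXT_SPS_LT_RPS;
}
```

A helper like this could also be used to scan a control batch before submission and confirm that the new IDs are actually present in it.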
Full survey + verification in ~/src/ampere-kernel-decoders/phase0_findings.md.

Refs:
- feedback_review_empirical_over_theoretical: verified the survey hypothesis against actual kernel source before reclassifying

Re-opening: iter2 backend fix did NOT resolve the OOPS
From ampere-kernel-decoders iter2 Phase 6 Step 5 smoke test, 2026-05-16 evening.

We implemented the patch the Phase 0 prior-art survey predicted would fix this (vendored GStreamer 1.28.2 H.265 parser, populate V4L2_CID_STATELESS_HEVC_EXT_SPS_ST_RPS + _LT_RPS controls per the GStreamer reference pattern, runtime-optional probe). Backend commits f91c3f5..1a2c958 on marfrit/libva-v4l2-request-fourier. The patch installed cleanly; the backend's iter2: log line fires, confirming the probe sees the new CIDs registered. HEVC still OOPSes with the same stack and the same register state as pre-iter2.

New evidence
1. Strace shows the new controls don't reach the kernel
In the per-frame VIDIOC_S_EXT_CTRLS ioctl trace from a failing HEVC ffmpeg invocation post-iter2-install, 0xa40a98 (HEVC_EXT_SPS_ST_RPS = V4L2_CID_CODEC_STATELESS_BASE + 408) and 0xa40a99 (HEVC_EXT_SPS_LT_RPS = base + 409) do not appear in any S_EXT_CTRLS call. So the iter2 backend's wired code path does NOT actually submit the new controls. The standard 5-control submission returns EINVAL with error_idx=5 (error_idx equal to the control count is how V4L2 reports a failure that is not tied to one specific control, so the kernel rejects the whole batch for a structural reason; the backend ignores the error and proceeds to queue, and the OOPS follows).

Possible iter2-side root cause (TBD by iter3):

- h265_populate_ext_sps_rps_cache returns -ENODATA because the SPS NAL is not present in surface_object->source_data: ffmpeg-vaapi's HEVC submission protocol doesn't necessarily concatenate the SPS into the slice data buffer. Phase 5 review item #3 anticipated this needing parse-and-cache; iter2's implementation tries to cache but never gets a first-frame SPS to populate from.
- The control submission path bypasses h265.c::h265_set_controls and goes through picture.c directly, in which case iter2's edits never fire regardless of cache.

2. The OOPS register state is the SAME pre- and post-iter2
Three captures (2 pre-iter2 from ampere-fourier iter1 + ampere-kernel-decoders iter2 Phase 3, 1 post-iter2-install today) all show identical register state.

The 0x51a0 value is deterministic across reboots and across with/without userspace populating the controls. This is not a "garbage stale heap value": it's specifically 0x51a0 every time. That smells like a ctrl->p_cur.p field that has a default-init value of 0x51a0 for some reason.

0x51a0 = 20896 decimal. Possible decompositions: 80 (sizeof struct) × 261 comes to 20880, so it does not fit; 0x51a0 ÷ 8 = 0xA34 (2612) is exact but doesn't immediately match a known sentinel. Worth checking what ctrl->p_cur.p is initialized to for dimensional-array controls (kernel's v4l2-ctrls-*.c registration path).

3. Kernel-side gate IS firing
Reading rkvdec-hevc-common.c:500-509: if ctx->has_sps_st_rps is FALSE (userspace never set the controls), the assignment to run->ext_sps_st_rps doesn't happen, so run->ext_sps_st_rps should be whatever the previous frame left it as, OR zero if run is memset on each frame. The fact that we see 0x51a0 (not 0 or a stale-prev-frame value) suggests:

- ctx->has_sps_st_rps is TRUE even without our userspace setting the controls (open question: how does it become true?), AND ctrl->p_cur.p is 0x51a0 (open question: where does that value come from?).
- run->ext_sps_st_rps is initialized to 0x51a0 somewhere else (open question: where? Maybe at struct definition with a sentinel?).

In the kernel rkvdec source we inspected, ctx->has_sps_st_rps only goes true when ctrl->has_changed is set, which the V4L2 framework sets when userspace submits the control via S_EXT_CTRLS. If userspace never does, has_sps_st_rps stays false (assuming it starts false). So either:

- has_sps_st_rps is somehow initialized to true (init bug)
- ctrl->has_changed is being set without userspace involvement (registration bug)
- [...] has_sps_st_rps true

Asks for iter3
Re-opening this issue (originally closed under reclassification-to-userspace). Three diagnostic paths kernel-agent could pursue:

- Instrument rkvdec_hevc_prepare_hw_st_rps to dump the run->ext_sps_st_rps value just before the memcmp. Confirms whether it's always 0x51a0 (deterministic) or varies (kernel reading stale heap).
- Trace ctx->has_sps_st_rps init + transitions across one decode session. Find why it's true without S_EXT_CTRLS.
- Does ctx->has_sps_st_rps exist on the RK3399 rkvdec path at all? Or is this an RK3588-VDPU381-only struct field? If the latter, the field's initialization is the smoking gun.

iter2 backend infrastructure (vendored GStreamer parser, runtime probe, h265_set_controls extension) stays on master: reusable once the kernel-side root cause is understood.

Refs
- ~/src/ampere-kernel-decoders/iter2_close.md: full iter2 negative-result writeup
- marfrit/libva-v4l2-request-fourier @ bea8a79: backend tip with iter2 commits
- ~/src/ampere-kernel-decoders/phase0_findings_iter2.md
- ~/src/ampere-kernel-decoders/phase3_baseline_iter2.md

Duplicate of #14. Both issues describe the identical __pi_memcmp+0x10/0x110 fault from rkvdec_hevc_prepare_hw_st_rps+0x38/0x300, same call chain through rkvdec_hevc_assemble_hw_rps → rkvdec_hevc_run → rkvdec_device_run → v4l2_m2m_*, same hardware (ampere/RK3588), same reproducer (bbb_60s_720p.hevc.mp4 via libva-v4l2-request-fourier + ffmpeg vaapi). Filed 3 hours apart.

#14 found the exact root cause (struct rkvdec_hevc_run run; left uninitialized on the dispatcher stack, rkvdec-vdpu381-hevc.c:591 + rkvdec-vdpu383-hevc.c) and ships a one-line fix patch (both producer-fix Option A and caller-zero-init Option B variants).

Triage 2026-05-18: closing this one, keeping #14 as the canonical home with the fix details. Investigation suggestions from this issue body (search upstream, bisect within rc1..rc3, defensive guard, strace ABI compare) are subsumed by #14's closed-form root cause.
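
The failure mode attributed to #14 can be sketched in plain C. This is a userspace model under stated assumptions, not the rkvdec source: the struct layout and the names hevc_run_model, fill_run_model, would_compare_st_rps and dispatch_model are hypothetical. Only the shape comes from this thread: a run struct on the caller's stack, a producer that assigns its pointer fields conditionally, and a consumer whose only guard is a NULL check.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct rkvdec_hevc_run; the real layout differs. */
struct hevc_run_model {
    const void *ext_sps_st_rps;
    const void *ext_sps_lt_rps;
};

/*
 * Hypothetical producer: writes a field only when the matching control was
 * set by userspace, and otherwise leaves the field untouched.
 */
void fill_run_model(struct hevc_run_model *run,
                    int has_st, const void *st,
                    int has_lt, const void *lt)
{
    assert(run != NULL);
    if (has_st)
        run->ext_sps_st_rps = st;
    if (has_lt)
        run->ext_sps_lt_rps = lt;
}

/*
 * Consumer with the NULL-check early return described earlier in this issue.
 * Returns 1 if it would go on to compare RPS data, 0 if it bails out.
 */
int would_compare_st_rps(const struct hevc_run_model *run)
{
    if (!run->ext_sps_st_rps)
        return 0;
    return 1;
}

/*
 * Caller-side zero-init: with "= {0}" every pointer starts out NULL, so the
 * consumer's early return fires whenever the producer skipped a field.
 * Declaring "struct hevc_run_model run;" with no initializer would instead
 * leave whatever bytes were already on the stack.
 */
int dispatch_model(int has_st, const void *st)
{
    struct hevc_run_model run = {0};

    fill_run_model(&run, has_st, st, 0, NULL);
    return would_compare_st_rps(&run);
}
```

In this model, dispatch_model(0, NULL) returns 0 because the zeroed pointer trips the early return. A producer-side variant would presumably have fill_run_model assign NULL in an else branch instead, so correctness no longer depends on every caller remembering to zero the struct.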