diff --git a/phase0_findings.md b/phase0_findings.md
new file mode 100644
index 0000000..1c569ae
--- /dev/null
+++ b/phase0_findings.md
@@ -0,0 +1,116 @@
+# Phase 0 — Substrate / Motivation / Inventory (iter1 of ampere-kernel-decoders)
+
+Closed 2026-05-16 (afternoon). Locks the research question, captures the substrate, and — critically for this campaign — runs the **upstream prior-art survey** before writing or considering any patch.
+
+## Research question
+
+**Which of the two RK3588-decoder gaps surfaced by `ampere-fourier` iter1 (HEVC kernel OOPS, VP9 not exposed) actually need a *kernel* patch as their fix path, and for those that do, what's the minimal candidate patch — starting from existing upstream / out-of-tree work, not from a clean re-derivation?**
+
+Operator-supplied mechanism (verbatim, in-session):
+> Consult linux-rockchip and linux-mm mailing lists for prior art regarding enabling the video decoders.
+
+Phase 0 follows that direction: a prior-art survey is the first item, *before* any source-read or hypothesis.
+
+## Substrate
+
+| Property | Value |
+|---|---|
+| Substrate kernel branch | `ampere:~/src/linux-rockchip` branch `ampere-minimal-devices`, tip `7c241f2e2835 arm64: dts: rockchip: rk3588-coolpi-cm5-genbook: add lid switch and USB3 PHY lane config` |
+| Sister tree | `boltzmann:~/src/linux-rockchip` branch `linux-rk3588-marfrit`, tip `fccdf164bfec phy: rockchip-snps-pcie3: Only check PHY1 status when using it` — has Collabora remotes tracked (`add-rkvdec2-driver`, `add-rkvdec2-driver-iommu`, `add-rkvdec2-driver-sre`, `add-rkvdec2-driver-vdpu383-hevc`) |
+| Baseline kernel package | `linux-ampere-fourier 7.0rc3.kafr1-1` — vanilla v7.0-rc3 + ampere DTS/board patches; built from the ampere tree above |
+| ampere `linux-api-headers` | `6.19-1` — **PREDATES** the 7.0 UAPI additions for `V4L2_CID_STATELESS_HEVC_EXT_SPS_ST_RPS` / `_LT_RPS` |
+| libva backend installed | `libva-v4l2-request-fourier 1.0.0.r348.7ac934e-1` (hand-built `0c9a7efaab…` over the broken CI binary per marfrit-packages#17) |
+| Available rkvdec sources | `drivers/media/platform/rockchip/rkvdec/{rkvdec.c, rkvdec-hevc.c, rkvdec-hevc-common.c, rkvdec-vdpu381-hevc.c, rkvdec-vdpu383-hevc.c, rkvdec-vdpu381-h264.c, rkvdec-vdpu383-h264.c, rkvdec-vp9.c, …}` on ampere |
+
+The rkvdec source has **separate VDPU381 and VDPU383 HEVC files** (`rkvdec-vdpu381-hevc.c` + `rkvdec-vdpu383-hevc.c`) plus a shared `rkvdec-hevc-common.c` that contains the OOPSing function. The VP9 source (`rkvdec-vp9.c`) exists too but doesn't appear in the VDPU381/383 variant_ops registration — i.e. the file exists for the RK3399 legacy rkvdec path only.
+
+## Prior-art survey (operator-mandated Phase 0 step)
+
+Surveyed 2026-05-16 by general-purpose subagent against linux-rockchip / linux-media / linux-mm / lore.kernel.org / Kwiboo / Bootlin / Collabora / D.V.A.B. Sarma / dongioia.
+Key findings (full subagent transcript: `~/.../tasks/a0d583fc904274132.output`):
+
+### Maturity baseline of RK3588 mainline decoder support
+
+RK3588 (VDPU381) and RK3576 (VDPU383) decoder support was merged in **Linux 7.0** as a 17-patch series from Detlev Casanova / Collabora ("media: rkvdec: Add support for VDPU381 and VDPU383", v8 at lkml.org/lkml/2026/1/9/1334). The series adds **H.264 and HEVC only — VP9 is NOT in 7.0 mainline**; multi-core glue is NOT in 7.0 mainline either, and AV1 (RK3576 only) is preliminary. The Collabora blog (collabora.com news 2026-02-27) explicitly frames VP9 on RK3588 as **future work attributed to D.V.A.B. Sarma's existing out-of-tree driver**.
+
+So `7.0-rc3` is the **very first kernel where RK3588 HEVC even exists upstream**. Regression-fix candidates in -rc4..-rc7 / -stable are plausible but not yet surveyed (lore.kernel.org was Anubis-gated during the survey — manual recheck at `https://lore.kernel.org/linux-media/?q=rkvdec+VDPU381+RPS` deferred).
+
+### HEVC OOPS root cause — **reclassified from "kernel bug" to "userspace UAPI gap"**
+
+The Casanova v8 series **introduces two new V4L2 controls**:
+- `V4L2_CID_STATELESS_HEVC_EXT_SPS_ST_RPS` (short-term RPS)
+- `V4L2_CID_STATELESS_HEVC_EXT_SPS_LT_RPS` (long-term RPS)
+
+Per the survey, the VDPU381 HEVC path requires userspace to populate these. Verified by reading the actual code on ampere:
+
+`drivers/media/platform/rockchip/rkvdec/rkvdec-hevc-common.c:500-509`:
+```c
+if (ctx->has_sps_st_rps) {
+	ctrl = v4l2_ctrl_find(&ctx->ctrl_hdl, V4L2_CID_STATELESS_HEVC_EXT_SPS_ST_RPS);
+	run->ext_sps_st_rps = ctrl ? ctrl->p_cur.p : NULL;
+}
+if (ctx->has_sps_lt_rps) {
+	ctrl = v4l2_ctrl_find(&ctx->ctrl_hdl, V4L2_CID_STATELESS_HEVC_EXT_SPS_LT_RPS);
+	run->ext_sps_lt_rps = ctrl ? ctrl->p_cur.p : NULL;
+}
+```
+
+`drivers/media/platform/rockchip/rkvdec/rkvdec-hevc-common.c:380-410` (`rkvdec_hevc_prepare_hw_st_rps`, the OOPS site):
+```c
+if (!run->ext_sps_st_rps)
+	return;
+
+if (!memcmp(cache, run->ext_sps_st_rps, sizeof(struct v4l2_ctrl_hevc_ext_sps_st_rps)))
+	return;
+```
+
+Empirical state on ampere:
+- `grep -r V4L2_CID_STATELESS_HEVC_EXT ~/src/libva-v4l2-request-fourier/src/` returns **zero hits**. Backend `7ac934e` (June pre-iter38) predates the 7.0 UAPI and never populates either control.
+- `grep V4L2_CID_STATELESS_HEVC_EXT /usr/include/linux/v4l2-controls.h` returns **zero hits**. `linux-api-headers 6.19-1` doesn't even define the constants.
+
+**Mechanism reconstruction (highly plausible, not yet test-verified):** ampere's `ctx->has_sps_st_rps` is true (VDPU381 variant_ops sets it), so the kernel calls `v4l2_ctrl_find` for the new CID. The control may be auto-registered with a non-NULL `p_cur.p` pointing to a kernel-allocated but never-written buffer (uninitialized data). The early-return `if (!run->ext_sps_st_rps) return;` doesn't fire (pointer is non-NULL), so the function proceeds to `memcmp(cache, run->ext_sps_st_rps, sizeof(struct))` which reads from invalid / unmapped offsets and faults in `__pi_memcmp`.
+
+**Alternative mechanism**: `ctx->has_sps_st_rps` is true but the kernel never auto-allocates the control storage, so `ctrl->p_cur.p` is a stale pointer the kernel doesn't validate (a plain NULL would be caught by the early return above). Either way, the **fix path is most likely userspace**: make the libva backend set the new CIDs with valid data parsed from the HEVC SPS.
+
+**Reclassification**: kernel-agent#11 should be **closed and re-filed against `marfrit/libva-v4l2-request-fourier`** as a new issue: "extend backend to populate V4L2_CID_STATELESS_HEVC_EXT_SPS_ST_RPS / _LT_RPS for VDPU381 HEVC."
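
The guard-then-compare flow quoted from `rkvdec_hevc_prepare_hw_st_rps` can be modelled in plain userspace C to make the reasoning concrete. This is a minimal sketch, not the driver code. `struct st_rps_model`, `enum prep_result`, and `prepare_st_rps_model` are hypothetical names. The 64-byte payload is a stand-in for the real `struct v4l2_ctrl_hevc_ext_sps_st_rps` layout, which is not reproduced here. The cache refresh after a mismatch is an assumption about what follows the quoted lines.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for struct v4l2_ctrl_hevc_ext_sps_st_rps. */
struct st_rps_model {
	unsigned char payload[64];
};

enum prep_result {
	PREP_SKIPPED_NULL,	/* early return: no control payload at all */
	PREP_SKIPPED_CACHED,	/* payload identical to the cached copy */
	PREP_REPROGRAMMED	/* payload differs: cache refreshed (assumed step) */
};

/*
 * Userspace model of the two checks quoted from the OOPS site.
 * 'ext' plays the role of run->ext_sps_st_rps.
 */
enum prep_result prepare_st_rps_model(struct st_rps_model *cache,
				      const struct st_rps_model *ext)
{
	/* Same shape as: if (!run->ext_sps_st_rps) return; */
	if (!ext)
		return PREP_SKIPPED_NULL;

	/*
	 * Same shape as the memcmp() in the excerpt. It reads sizeof(*ext)
	 * bytes through 'ext', so it is only safe when 'ext' points at
	 * readable storage of at least that size.
	 */
	if (!memcmp(cache, ext, sizeof(*ext)))
		return PREP_SKIPPED_CACHED;

	/* Assumption: the real function goes on to consume the new payload. */
	memcpy(cache, ext, sizeof(*ext));
	return PREP_REPROGRAMMED;
}
```

In this model a NULL payload is already filtered by the early return, so only a non-NULL pointer to storage that cannot be safely read for the full struct size could reach `memcmp` and fault. That case is deliberately not exercised here, because dereferencing such a pointer is undefined behaviour in userspace.
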
+There's still a kernel-side hardening case (extend the existing NULL check in `rkvdec_hevc_prepare_hw_st_rps` with a guard for never-set / uninitialised control data, so a forgetful userspace doesn't OOPS the kernel), but it's an upstream defense-in-depth item, not the fix path for ampere HEVC HW decode.
+
+### VP9 — kernel-side, multiple competing upstream starting points
+
+**Confirmed**: VP9 is not enabled in any of v3..v8 of the Casanova series. The v4 cover (patchew.org/linux/20251022174508.284929-1-detlev.casanova@collabora.com/) explicitly says *"This patch only adds support for H264 and H265 in both variants."* So `S264`/`S265`-only on `/dev/video1` is documented 7.0 mainline behaviour, **not a build / config miss**.
+
+Out-of-tree options to evaluate as iter2+ starting points:
+
+| Source | Tree | Status | Notes |
+|--------|------|--------|-------|
+| **D.V.A.B. Sarma** | `dvab-sarma/android_kernel_rk_opi` (github.com/dvab-sarma) | Working VP9 on RK3588 ≤ 4K@30, profile 0 | Android-flavoured tree (not a clean mainline diff). Tracker: `github.com/dvab-sarma/android_local_manifest/issues/3`. Collabora has offered to coach Sarma on kernel-submission etiquette for a v1; nothing on list yet. |
+| **dongioia/rock5bplus-rkvdec2** | github.com/dongioia/rock5bplus-rkvdec2 | Claims RKVDEC2/VDPU381 H.264 + HEVC + VP9 @ 4K via mainline-style patches | Worth reading as a rebase candidate if Sarma's Android tree proves too far from mainline |
+| **Kwiboo** | github.com/Kwiboo/linux-rockchip | Active HEVC work on `linuxtv-rkvdec-hevc-v3` (Sep 2025) | **No VP9 / RK3588 branch as of the survey**. Kwiboo's rkvdec work is HEVC for RK3399-class hardware, not VP9 on RK3588 |
+| **rcawston** | github.com/rcawston/rockchip-rk3588-mainline-patches | Encoder + HDMI only | No decoder content |
+
+The **`rkvdec2` separate-driver approach** (Casanova June 2024 RFC, lwn.net/Articles/1015469) was abandoned in favour of extending the existing `rkvdec` driver — which is what landed in 7.0.
+So future VP9-on-RK3588 is expected to extend `rkvdec`'s VDPU381 variant_ops, **not** introduce a separate `rkvdec2` module.
+
+**DTS does NOT need to change to enable VP9** — once the variant_ops gains a VP9 backend, the same `compatible = "rockchip,rk3588-vdpu381"` node will advertise `V4L2_PIX_FMT_VP9_FRAME` automatically.
+
+### Adjacent finding worth tracking
+
+**`media: rkvdec: Restore iommu addresses on errors`** — the only known post-merge stability fix in the 7.0 VDPU381 path per Collabora's retrospective. The decoder's embedded IOMMU is reset alongside the decoder on error recovery, dropping mappings the kernel still considers live. **Verify this is present in our `7.0-rc3` checkout** — if absent, any decoder error recovery (e.g. one bad frame) wedges subsequent decode until reset. Low-confidence whether `ampere-minimal-devices @ 7c241f2e2835` includes it — Phase 2 source-read item.
+
+## Predecessor data — what carries vs what doesn't
+
+Per `feedback_dev_process.md` Phase 0 rules:
+- **Carries (state)**: `ampere-fourier` iter1 baseline numbers as *reference history* for Phase 7 regression checks; the operator-policy rule that codec patches stay OUT of `linux-ampere-fourier` baseline; the backend source pin `7ac934e`; memory entries about V4L2-control semantics (`feedback_unconditional_codec_state`, `feedback_per_driver_kludge_gating`, `feedback_va_st_rps_bits_is_slice_field`).
+- **Does NOT carry**: the iter1 N=3 FPS numbers — those were for the 3-codec subset on this exact substrate; iter2's success metric (HEVC works) is independent.
+
+## Open questions tabled into Phase 1
+
+1. **Scope of this kernel campaign vs. spawning a sibling backend campaign**: HEVC now looks like primarily userspace work (extend backend to populate new CIDs), pending the verification in question 3. VP9 is kernel work.
+   Phase 1 needs to decide whether (a) this campaign drops HEVC and focuses on VP9 only, (b) becomes a meta-campaign coordinating an HEVC backend-iter40 + a VP9 kernel-iter1, or (c) splits into two distinct campaigns (`ampere-kernel-decoders` for VP9, sibling backend campaign for HEVC).
+2. **VP9 starting tree**: Sarma's Android branch, dongioia's mainline-style overlay, or wait for a Collabora-coached v1 mainline submission? Trade-off between time-to-validate-on-ampere and time-to-upstreamable-patch-quality.
+3. **Test-verification of HEVC mechanism reconstruction**: the kernel-source read above strongly suggests the new-CID gap is the OOPS root cause, but it's not yet proved. Phase 1 might lock a sub-goal "write a minimal libva backend patch that sets the new CIDs (even with dummy data) — if HEVC OOPS vanishes / changes shape, hypothesis confirmed; if not, loop back to Phase 2 with the new evidence."
+4. **IOMMU restore patch present?**: confirm whether `~/src/linux-rockchip @ ampere-minimal-devices` has the "Restore iommu addresses on errors" fix. Phase 2 source-read.
+5. **lore.kernel.org Anubis re-check**: the survey couldn't enumerate lore directly. Phase 2 should manually re-check `https://lore.kernel.org/linux-media/?q=rkvdec+VDPU381+RPS` for any HEVC stability patches that landed after v7.0-rc3 (-rc4..-rc7 / -stable).
+
+## Phase 0 close
+
+Research question locked. Substrate captured. Prior-art survey delivered the headline finding: **the HEVC OOPS is most likely a userspace UAPI gap, not a kernel bug**, which fundamentally re-scopes this campaign. VP9 remains kernel-side and has at least two candidate out-of-tree starting trees. Five open questions tabled for Phase 1.
+
+Iteration log:
+- iter1: 2026-05-16 — this document.