diff --git a/README.md b/README.md
index d65bfa7..5966fb6 100644
--- a/README.md
+++ b/README.md
@@ -1,75 +1,150 @@
-# v4l2-request libVA Backend
+# libva-v4l2-request-fourier

-## About
+VA-API ICD backend for V4L2 stateless video decoders. Fourier-campaign
+fork of the dormant `bootlin/libva-v4l2-request` upstream.

-This libVA backend is designed to work with the Linux Video4Linux2
-Request API that is used by a number of video codecs drivers,
-including the Video Engine found in most Allwinner SoCs.
+## What works

-## Status
+| SoC / host | Codecs verified bit-exact vs `kdirect` |
+|---|---|
+| RK3399 (fresnel — Pinebook Pro) | H.264, HEVC Main, VP9 Profile 0, VP8, MPEG-2 — 5/5 at iter38 |
+| RK3588 (ampere) | H.264 (iter1 ampere-fourier); HEVC EXT_SPS structure clean (iter2); other codecs in progress |
+| RK3568 / RK3566 (ohm — PineTab2) | iter1-5 baseline (libva-multiplanar campaign) |

-The v4l2-request libVA backend currently supports the following formats:
-* MPEG2 (Simple and Main profiles)
-* H264 (Baseline, Main and High profiles)
-* H265 (Main profile)
+`kdirect` = `ffmpeg -hwaccel v4l2request -hwaccel_output_format drm_prime ...`
+through Kwiboo's downstream ffmpeg patches. The Rockchip family has the
+benefit of years of `rkvdec` + `hantro-vpu` iteration in mainline + the
+RK3588/RK3576 video decoder series **landing in mainline February 2026**.
+
+## What does NOT work, and why it's stalled
+
+| Target | Status | Blocker |
+|---|---|---|
+| H264 Hi10P on RK3399 | enumerated, decode returns all-zero | RK3399 silicon doesn't implement 10-bit despite kernel advertising the profile (iter39 close, Option B applied) |
+| HEVC Main10 on RK3399 | not enumerated | same as Hi10P |
+| **Pi 5 / CM5 (BCM2712 / `rpi-hevc-dec`)** | infrastructure landed (iter40 / iter40b), bit-exact NOT achieved | see "The Pi 5 standoff" below |
+
+### The Pi 5 standoff
+
+iter40 + iter40b add a third multi-device-probe slot for
+`rpi-hevc-dec`, an NC12 SAND128 detile primitive, per-driver gates
+around the SPS pre-seed + start-code-prepend + scaling_matrix submission,
+and a (fragile, fixture-specific) SPS field override using the
+GStreamer 1.28.2 H.265 parser. ICD discovery works, `vainfo` lists
+`VAProfileHEVCMain`, S\_FMT / REQBUFS / STREAMON all succeed.
+
+**Decode itself never succeeds** — every CAPTURE DQBUF returns
+`V4L2_BUF_FLAG_ERROR`. Driver author John Cox confirmed strict SPS
+validation is intentional ("`try_ext_ctrls returned an error (22)` is
+expected as it is validating the SPS"), and VAAPI's
+`VAPictureParameterBufferHEVC` simply doesn't carry the bitstream-true
+scalars (`sps_max_num_reorder_pics`, `sps_max_latency_increase_plus1`,
+slice-level `num_entry_point_offsets`) that the driver wants. We can't
+fish the SPS out of `source_data` either, because ffmpeg-vaapi parses
+the SPS itself and passes only slice NAL bytes to libva backends.
+
+This is not a bug in our backend, in libva, in ffmpeg, or in the kernel
+driver. It's an ecosystem coordination failure of long standing:
+
+- **Kwiboo's `ffmpeg-v4l2request` hwaccel** has been in production via
+  LibreELEC since December 2018. Re-submitted to ffmpeg-devel as a v2
+  series in August 2024. Still un-merged in May 2026 — **eight years
+  in the upstream review queue**.
+- **`libva-v4l2-request`** (this project's upstream) hasn't taken
+  meaningful commits since ~2021. Nobody wants to own the impedance
+  mismatch between VAAPI's Intel-shaped "give me raw bitstream, I'll
+  parse" and V4L2 stateless's kernel-shaped "give me parsed structs,
+  I'll just drive the HW."
+- **`rpi-hevc-dec` mainline submission** is at v4 (July 2025), 17
+  months in review. The Pi 6.18.x downstream kernel meanwhile has
+  active HEVC regressions ([raspberrypi/linux#7228](https://github.com/raspberrypi/linux/issues/7228),
+  [#7306](https://github.com/raspberrypi/linux/issues/7306)) that
+  aren't being fast-tracked because "the new uAPI is coming."
+- **Mozilla is implementing Pi 5 HEVC via ffmpeg's hwaccel-context
+  path** (bug [1969297](https://bugzilla.mozilla.org/show_bug.cgi?id=1969297)),
+  not via libva — explicit acknowledgement from David Turner that
+  libavcodec needs to retain the SPS context for the strict driver to
+  accept the control batch.
+
+What end-users actually do today: run Pi OS (downstream-patched ffmpeg
++ downstream kernel) or LibreELEC (Kwiboo's patches + downstream
+kernel). Anyone on a stock distro outside those two: no HW HEVC on
+Pi 5.
+
+Nobody who has authority to merge has skin in the game. Everyone with
+skin in the game lacks authority. Result: 8-year stalemate, three
+forks of working code, no merged upstream.
+
+### What this means for this backend
+
+We chose to extend `libva-v4l2-request` into Pi 5 territory because
+the architecture maps cleanly onto the existing iter38 multi-device
+probe. That work landed (iter40 commit `3ffa9d0`, iter40b commit
+`071b08d`). It's reusable infrastructure for any future strict V4L2
+stateless decoder that ffmpeg ships before libva does.
+
+But the *user-facing* Pi 5 HEVC story will not come from this
+backend. The backend was a clean architectural target inside a
+coordination dead-end. The actual Pi 5 HEVC path through libva
+requires either:
+
+- a VAAPI extension exposing the SPS scalars rpi-hevc-dec validates
+  against (Intel-driven; no Pi-aligned principal),
+- a libva-internal `VABufferType` for raw SPS/PPS NAL bytes (no
+  maintainer),
+- ffmpeg-vaapi forwarding `num_entry_point_offsets` to backends
+  (small upstream patch; no champion), OR
+- the political situation around Kwiboo's series unblocks (no
+  visible movement).
+
+iter40 + iter40b are **landed but parked**. The fresnel + ampere
+sibling paths are unaffected (5/5 fresnel + 9 profiles ampere
+verified post-iter40b, no regression). Phase 8 packaging is
+deliberately skipped — shipping a `.deb` whose primary advertised
+target (Pi 5) doesn't actually decode would mislead users.
+
+See `phase0_pi5_hevc.md`, `phase1_pi5_hevc.md`,
+`phase5_pi5_hevc_review.md`, `phase7_pi5_hevc_close.md` for the
+chapter's full empirical record.

 ## Instructions

-In order to use this libVA backend, the `v4l2_request` driver has to
-be specified through the `LIBVA_DRIVER_NAME` environment variable, as
-such:
+In order to use this backend, set the `LIBVA_DRIVER_NAME` environment
+variable:

     export LIBVA_DRIVER_NAME=v4l2_request

-A media player that supports VAAPI (such as VLC) can then be used to decode a
-video in a supported format:
+Then a VA-API-capable player can decode supported codecs on a probed
+device:

-    vlc path/to/video.mpg
+    vlc path/to/video.mp4
+    mpv --hwdec=vaapi path/to/video.mp4
+    ffmpeg -hwaccel vaapi -hwaccel_output_format vaapi -i in.mp4 -f null -

-Sample media files can be obtained from:
-
-    http://samplemedia.linaro.org/MPEG2/
-    http://samplemedia.linaro.org/MPEG4/SVT/
+The backend auto-detects available decoders via the V4L2 media
+topology walk and honors `LIBVA_V4L2_REQUEST_VIDEO_PATH` and
+`LIBVA_V4L2_REQUEST_MEDIA_PATH` for explicit device selection.
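As a rough illustration of the override behaviour described above, the sketch below shows one way an environment variable could take precedence over an auto-detected device path. It is not taken from the backend's source: the helper name `path_from_env` and the fallback paths in the usage note are hypothetical, and only the two environment variable names come from the README text.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper (an assumption, not the backend's actual code):
 * return the value of the environment variable `env_name` when it is set
 * and non-empty, otherwise return the path found by auto-detection.
 */
static const char *path_from_env(const char *env_name, const char *detected)
{
    const char *value;

    assert(env_name != NULL);

    value = getenv(env_name);
    if (value != NULL && strlen(value) > 0)
        return value; /* explicit selection wins */

    return detected; /* fall back to the topology-walk result */
}
```

A caller might use it as `path_from_env("LIBVA_V4L2_REQUEST_VIDEO_PATH", "/dev/video0")`, where `/dev/video0` is only a placeholder for whatever the topology walk found. Treating an empty variable as unset is a design choice of this sketch, not something the README states.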
 ## Technical Notes

-### Surface
+### Multi-device probe (iter38)

-A Surface is an internal data structure never handled by the VA's user
-containing the output of a rendering. Usualy, a bunch of surfaces are created
-at the begining of decoding and they are then used alternatively. When
-created, a surface is assigned a corresponding v4l capture buffer and it is
-kept until the end of decoding. Syncing a surface waits for the v4l buffer to
-be available and then dequeue it.
+A single libva session opens both `rkvdec` and `hantro-vpu` (and, on
+hosts where it's present, `rpi-hevc-dec`) at init. `RequestCreateConfig`
+re-targets the active fd per profile via
+`request_switch_device_for_profile()`. Pool teardown happens at
+switch time; the next `CreateContext` rebuilds against the right
+device.

-Note: since a Surface is kept private from the VA's user, it can ask to
-directly render a Surface on screen in an X Drawable. Some kind of
-implementation is available in PutSurface but this is only for development
-purpose.
+### Surface / Context / Picture / Image

-### Context
+A Surface is an internal data structure containing rendering output.
+A Context owns the V4L2 lifecycle (S\_FMT, CAPTURE pool, ctrl-batch
+defaults) for one decode session. A Picture is one encoded input
+frame's set of buffers. An Image is a standard VA pixel-format view
+on a decoded Surface — the backend detiles SAND/COL128 or unpacks
+NV15 to NV12/P010 here so consumers see linear pitches.

-A Context is a global data structure used for rendering a video of a certain
-format. When a context is created, input buffers are created and v4l's output
-(which is the compressed data input queue, since capture is the real output)
-format is set.
-
-### Picture
-
-A Picture is an encoded input frame made of several buffers. A single input
-can contain slice data, headers and IQ matrix. Each Picture is assigned a
-request ID when created and each corresponding buffer might be turned into a
-v4l buffers or extended control when rendered. Finally they are submitted to
-kernel space when reaching EndPicture.
-
-The real rendering is done in EndPicture instead of RenderPicture
-because the v4l2 driver expects to have the full corresponding
-extended control when a buffer is queued and we don't know in which
-order the different RenderPicture will be called.
-
-### Image
-
-An Image is a standard data structure containing rendered frames in a usable
-pixel format. Here we only use NV12 buffers which are converted from sunxi's
-proprietary tiled pixel format with tiled_yuv when deriving an Image from a
-Surface.
+The real rendering is in `EndPicture`, not `RenderPicture`, because
+the kernel needs the full extended-control batch when the OUTPUT
+buffer is queued, and `RenderPicture` order is consumer-defined.
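
The deferral described in the last paragraph can be sketched as follows. This is a simplified model under stated assumptions, not the backend's implementation: `struct picture`, `enum buf_kind`, and the three functions are hypothetical stand-ins for the VA-API entry points, and the actual V4L2 calls are reduced to a comment.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical buffer kinds; real VA-API defines many more buffer types. */
enum buf_kind {
    BUF_PIC_PARAMS,
    BUF_IQ_MATRIX,
    BUF_SLICE_PARAMS,
    BUF_SLICE_DATA,
    BUF_KIND_COUNT
};

/* Hypothetical per-frame state: one slot per buffer kind. */
struct picture {
    const void *bufs[BUF_KIND_COUNT];
    bool submitted;
};

/* BeginPicture stand-in: start a frame with nothing stashed. */
static void begin_picture(struct picture *pic)
{
    size_t i;

    for (i = 0; i < BUF_KIND_COUNT; i++)
        pic->bufs[i] = NULL;
    pic->submitted = false;
}

/* RenderPicture stand-in: only record the buffer, in whatever order it arrives. */
static void render_picture(struct picture *pic, enum buf_kind kind, const void *data)
{
    assert((int)kind >= 0 && kind < BUF_KIND_COUNT);
    pic->bufs[kind] = data;
}

/*
 * EndPicture stand-in: submit only once the mandatory buffers are present.
 * Returns 0 on success, -1 if the batch is incomplete.
 */
static int end_picture(struct picture *pic)
{
    if (pic->bufs[BUF_PIC_PARAMS] == NULL ||
        pic->bufs[BUF_SLICE_PARAMS] == NULL ||
        pic->bufs[BUF_SLICE_DATA] == NULL)
        return -1;

    /* A real backend would build the extended-control batch and queue
     * the OUTPUT buffer together with its request at this point. */
    pic->submitted = true;
    return 0;
}
```

Because nothing reaches the kernel until `end_picture()`, the consumer is free to call `render_picture()` for slice data before picture parameters, or the other way round. Treating the IQ matrix as optional is an assumption of this sketch.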