37 Commits

Author SHA1 Message Date
claude-noether 9fa18f2312 av1: populate V4L2_CID_STATELESS_AV1_SEQUENCE in codec_set_controls
Implements the libva-side portion of issue #11 — replaces PR #10's
no-op AV1 dispatch with a real av1_set_controls that maps VAAPI's
VADecPictureParameterBufferAV1.seq_info_fields + scalar fields onto
struct v4l2_ctrl_av1_sequence (the kernel uAPI control declared at
linux/v4l2-controls.h:2891-2919).

Daemon-track context (issue #11 daemon side, operator-owned):
ffmpeg-vaapi splits the AV1 bitstream client-side and strips the
OBU_SEQUENCE_HEADER before delivery; the V4L2 OUTPUT buffer contains
only OBU_FRAME_HEADER + OBU_TILE_GROUP.  libdav1d in the daedalus
daemon cannot parse this — it expects a complete OBU stream.  The
daemon side has to synthesise OBU_SEQUENCE_HEADER from the SEQUENCE
ctrl and prepend it to the slice bitstream.  This libva-side change
just makes the SEQUENCE ctrl populated and queued via S_EXT_CTRLS;
the daemon track is the consumer.

Three small touch points beyond the new src/av1.{c,h}:

  - src/surface.h: add an av1 leaf to surface->params holding
    VADecPictureParameterBufferAV1.  Slice params are intentionally
    absent: the daedalus daemon consumes the slice OBU bytes
    directly from the OUTPUT buffer, so libva does not need to
    re-synthesise OBUs from a per-tile-group struct today.
  - src/picture.c: copy the picture-param buffer into the new leaf
    in RenderPicture, mirroring the per-codec memcpy pattern, and
    call av1_set_controls from codec_set_controls (replacing the
    no-op).
  - src/meson.build: register src/av1.c.

Sequence-field mapping covers everything VAAPI exposes at the
sequence level (12 of 18 V4L2_AV1_SEQUENCE_FLAG_* bits + the four
scalars).  Bits VAAPI doesn't carry at the sequence level
(WARPED_MOTION, REF_FRAME_MVS, SUPERRES, RESTORATION,
SEPARATE_UV_DELTA_Q) stay clear; per-frame consumers (libdav1d via
the daemon, vpu981 via the hardware path) read those from the
OBU_FRAME_HEADER that is already in the slice buffer anyway.  See
feedback memory `feedback_vaapi_blind_to_some_hevc_sps_fields` for
the precedent.
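
The mapping described above can be sketched roughly as follows. This is
not the code in src/av1.c: the struct and flag names prefixed sketch_ /
SKETCH_ are local stand-ins so the example builds without libva or
kernel headers, only three of the flags are shown, and the
bit_depth_idx and order_hint_bits_minus_1 conversions are assumptions
about how the VAAPI fields translate.

```c
#include <stdint.h>
#include <string.h>

/* Stand-in with the same field order and sizes as the uAPI
 * struct v4l2_ctrl_av1_sequence (12 bytes). */
struct sketch_av1_sequence {
    uint32_t flags;
    uint8_t  seq_profile;
    uint8_t  order_hint_bits;
    uint8_t  bit_depth;
    uint8_t  reserved;
    uint16_t max_frame_width_minus_1;
    uint16_t max_frame_height_minus_1;
};

/* Hypothetical flag values; real code uses V4L2_AV1_SEQUENCE_FLAG_*. */
enum {
    SKETCH_SEQ_FLAG_STILL_PICTURE     = 1u << 0,
    SKETCH_SEQ_FLAG_ENABLE_ORDER_HINT = 1u << 1,
    SKETCH_SEQ_FLAG_MONO_CHROME       = 1u << 2,
};

/* Hypothetical reduced view of VADecPictureParameterBufferAV1. */
struct sketch_va_av1_pic {
    uint8_t  profile;
    uint8_t  order_hint_bits_minus_1;
    uint8_t  bit_depth_idx;            /* assumed: 0 = 8, 1 = 10, 2 = 12 bit */
    unsigned still_picture : 1;
    unsigned enable_order_hint : 1;
    unsigned mono_chrome : 1;
    uint16_t frame_width_minus1;
    uint16_t frame_height_minus1;
};

/* Fill the sequence control from the picture parameters. Flags that the
 * source struct does not carry simply stay clear. */
static void sketch_av1_fill_sequence(const struct sketch_va_av1_pic *pic,
                                     struct sketch_av1_sequence *seq)
{
    memset(seq, 0, sizeof(*seq));

    if (pic->still_picture)
        seq->flags |= SKETCH_SEQ_FLAG_STILL_PICTURE;
    if (pic->enable_order_hint)
        seq->flags |= SKETCH_SEQ_FLAG_ENABLE_ORDER_HINT;
    if (pic->mono_chrome)
        seq->flags |= SKETCH_SEQ_FLAG_MONO_CHROME;

    seq->seq_profile = pic->profile;
    seq->order_hint_bits = pic->enable_order_hint
        ? (uint8_t)(pic->order_hint_bits_minus_1 + 1) : 0;
    seq->bit_depth = (uint8_t)(8 + 2 * pic->bit_depth_idx);
    seq->max_frame_width_minus_1 = pic->frame_width_minus1;
    seq->max_frame_height_minus_1 = pic->frame_height_minus1;
}
```

Zeroing the struct first means the reserved byte and every unmapped flag
are cleared without naming them one by one.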

Build verified on higgs (Debian 13 trixie, gcc 14.2.0, libva 2.22.0,
linux uAPI v4l2-controls.h sizeof(struct v4l2_ctrl_av1_sequence)==12):
clean meson + ninja link of v4l2_request_drv_video.so, vainfo
enumerates VAProfileAV1Profile0 via daedalus_v4l2 slot, av1_set_controls
symbol present.

Out of scope on this PR (operator-track, issue #11 follow-up):
  - daedalus-v4l2 kernel module wire-protocol extension (daedalus_
    collect_av1_meta + AV1 ctrl request_setup).
  - daedalus daemon OBU synthesiser (~400 LoC AV1 OBU encoder in
    daemon/src/av1_obu_synth.{c,h}).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-20 21:13:07 +02:00
marfrit 9a9cfd05db Merge pull request 'picture: no-op codec_set_controls case for VAProfileAV1Profile0' (#10) from noether/picture-av1-noop into master
Reviewed-on: marfrit/libva-v4l2-request-fourier#10
2026-05-20 19:07:12 +00:00
marfrit 96d70af674 picture: no-op codec_set_controls case for VAProfileAV1Profile0
picture.c's codec_set_controls() switch was falling through to the
default case for VAProfileAV1Profile0, returning
VA_STATUS_ERROR_UNSUPPORTED_PROFILE.  Result: vaEndPicture failed
with status 12 ("requested VAProfile is not supported"), no OUTPUT
buffer ever got queued, and the daedalus_v4l2 daemon never saw a
REQ_DECODE for AV1.

config.c's VAProfileAV1Profile0 case (lines 84-93) explicitly notes
"Decode-side ctrl dispatch (V4L2_CID_STATELESS_AV1_*) is NOT YET
WIRED on master — vainfo will list the profile + CreateConfig
succeeds, but consumers that submit decode buffers hit a NOP path".
The NOP path was never actually wired in picture.c — it hit the
default UNSUPPORTED_PROFILE branch instead.

Fix: add a VAProfileAV1Profile0 case that just `break;`s through
without setting V4L2 controls.  For the daedalus_v4l2 daemon path
this is exactly the right shape — AV1 frame data is self-describing
per OBU stream (no separate SPS/PPS controls needed at the V4L2
boundary), so the OUTPUT buffer alone is sufficient for the kernel
to forward to the daemon.
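
A minimal sketch of the switch shape this fix describes, assuming a
reduced profile enum. The sketch_ / SKETCH_ names are hypothetical
stand-ins for the libva types; only the numeric status values (0 for
success, 12 for unsupported profile) are taken from libva.

```c
/* Hypothetical reduced profile list; real code switches on VAProfile. */
enum sketch_profile {
    SKETCH_PROFILE_H264_MAIN,
    SKETCH_PROFILE_AV1_0,
    SKETCH_PROFILE_OTHER,
};

#define SKETCH_STATUS_SUCCESS              0
#define SKETCH_STATUS_UNSUPPORTED_PROFILE 12

static int sketch_codec_set_controls(enum sketch_profile profile)
{
    switch (profile) {
    case SKETCH_PROFILE_H264_MAIN:
        /* Real code would call the per-codec helper here. */
        break;
    case SKETCH_PROFILE_AV1_0:
        /* Deliberate no-op: no controls are set, but the profile no
         * longer falls through to the default error branch. */
        break;
    default:
        return SKETCH_STATUS_UNSUPPORTED_PROFILE;
    }
    return SKETCH_STATUS_SUCCESS;
}
```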

Verified on higgs: ffmpeg -hwaccel vaapi -i av1.mkv now actually
queues frames to /dev/video2 and the daemon's libdav1d context opens.
Decode itself still fails: libdav1d wants the AV1 sequence header
OBU, which ffmpeg-vaapi sends via VADecPictureParameterBufferAV1, not
via the slice buffer.  That is a separate issue and needs an OBU
sequence-header synthesiser in the daedalus daemon (analogous to the
new H.264 SPS/PPS NAL synth in daedalus-v4l2/daemon/src/h264_nal_synth.c).
That sequence-header synth work is a substantial follow-up; this
patch unblocks AV1 reaching the daemon at all.

For RK3588 vpu981 (the originally-planned AV1 target), this
remains a true NO-OP — when V4L2_CID_STATELESS_AV1_* dispatch
lands from the av1-iter1 operator branch, replace the no-op with
av1_set_controls(...).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-20 20:58:57 +02:00
marfrit c1bb444d07 Merge pull request 'h264: max_num_ref_frames fallback + libva-boundary instrumentation (#8)' (#9) from claude-noether/libva-v4l2-request-fourier:noether/h264-3-set-controls-bitstream-bug-8 into master
Reviewed-on: marfrit/libva-v4l2-request-fourier#9
2026-05-20 18:19:03 +00:00
claude-noether 0791f8e612 h264: max_num_ref_frames fallback + libva-boundary instrumentation
Closes the libva-side portion of marfrit/libva-v4l2-request-fourier#8.

Two small additions to h264_set_controls:

1. When VAPictureParameterBufferH264.num_ref_frames is 0 (older ffmpeg-vaapi paths /
   some daedalus_v4l2 consumers), count valid (non-INVALID) DPB
   entries in ReferenceFrames[16]. If even that returns 0, fall back
   to a per-profile spec minimum (1 for baseline, 4 for main/high).
   Hardware decoders (rkvdec, hantro, rpi-hevc-dec) tolerated the
   prior 0; libavcodec-via-daedalus enforces sps.max_num_ref_frames
   strictly and rejected every frame.

2. One request_log line at function entry dumping the raw VAAPI
   fields (seq_fields.value, pic_fields.value, num_ref_frames,
   bit_depth_*, picture_*_in_mbs_minus1). Disambiguates "ffmpeg-vaapi
   never populated" from "daedalus_v4l2 wire protocol corrupted" for
   the bit-fields-read-as-zero portion of issue #8.
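
The fallback chain in item 1 could look roughly like the sketch below.
It is an illustration under assumptions, not the patch itself: the
sketch_ types are stand-ins for the VAAPI picture structs, and
SKETCH_PIC_INVALID stands in for the VAAPI invalid-picture flag.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_DPB_SIZE    16
#define SKETCH_PIC_INVALID 0x1u   /* stand-in for the VAAPI invalid flag */

struct sketch_h264_pic {
    uint32_t flags;
};

enum sketch_h264_profile {
    SKETCH_H264_BASELINE,
    SKETCH_H264_MAIN,
    SKETCH_H264_HIGH,
};

/* Pick max_num_ref_frames: the VAAPI value if set, else the number of
 * valid DPB entries, else a per-profile minimum. */
static unsigned sketch_max_num_ref_frames(unsigned num_ref_frames,
                                          const struct sketch_h264_pic *dpb,
                                          enum sketch_h264_profile profile)
{
    unsigned valid = 0;

    if (num_ref_frames > 0)
        return num_ref_frames;

    for (size_t i = 0; i < SKETCH_DPB_SIZE; i++) {
        if (!(dpb[i].flags & SKETCH_PIC_INVALID))
            valid++;
    }
    if (valid > 0)
        return valid;

    return profile == SKETCH_H264_BASELINE ? 1 : 4;
}
```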

Out of scope here (separate issue if pursued): profile_idc and
level_idc remain session-derived. VAAPI's VAPictureParameterBufferH264
omits both (verified higgs libva 2.22.0-3, /usr/include/va/va.h:
3571-3622) — same VAAPI-blindspot family as the HEVC SPS fields. A
real fix requires SPS-NAL parsing from surface->source_data OR a
daedalus wire-protocol pass-through; both are operator design calls,
not a libva-only patch.

Build verified on higgs (Debian 13 trixie, gcc 14.2.0, libva 2.22.0):
clean ninja link of v4l2_request_drv_video.so, vainfo enumerates all
8 codec profiles, no init regression.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-20 20:17:27 +02:00
marfrit 989833114a Merge pull request 'config: include video_fd_daedalus in profile enumeration probe' (#7) from claude-noether/libva-v4l2-request-fourier:noether/libva-2-config-profile-enum-daedalus into master
Reviewed-on: marfrit/libva-v4l2-request-fourier#7
2026-05-20 14:52:11 +00:00
marfrit d1ba4625d2 config: include video_fd_daedalus in profile enumeration probe
LIBVA-2 follow-up.  RequestQueryConfigProfiles walks each known
decoder fd via any_fd_supports_output_format() and adds a VAProfile*
for each codec OUTPUT format the V4L2 device advertises.  The fd
list missed video_fd_daedalus — so on a Pi 5 with rpi-hevc-dec
primary + daedalus_v4l2 alt, only S265 (HEVC) was probed and the
H.264 / VP9 / AV1 profiles never got enumerated.

Effect on higgs: ffmpeg -hwaccel vaapi -i h264_test.mp4 reported
"No support for codec h264 profile 578" before the per-codec
dispatch in request_switch_device_for_profile could fire — the
profile-578 (H264 Constrained Baseline) check happened during
hwaccel init, found nothing in the libva profile list, and bailed
without ever calling into the daedalus path.

Fix: extend the fds[] array in any_fd_supports_output_format from
5 to 6 entries, with the sixth being video_fd_daedalus when
HAVE_DAEDALUS_V4L2 is on (and -1 otherwise so it's skipped by the
`if (fds[i] < 0) continue;` guard).  After the fix, daedalus_v4l2's
OUTPUT format menu (VP9F + AV1F + S264) gets seen, and Request-
QueryConfigProfiles returns VP9Profile0 + AV1Profile0 + the H264*
profiles, all of which then route through the LIBVA-1 'd' kind
override in request_switch_device_for_profile.
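
A sketch of the probe loop with the `fds[i] < 0` skip guard, assuming a
callback in place of the real V4L2 format query. The function names and
the fake probe are hypothetical; the real helper is
any_fd_supports_output_format.

```c
#include <stdbool.h>
#include <stddef.h>

typedef bool (*sketch_fd_probe_fn)(int fd, unsigned int pixelformat);

/* Walk every slot; closed or compiled-out slots hold -1 and are skipped. */
static bool sketch_any_fd_supports(const int *fds, size_t count,
                                   unsigned int pixelformat,
                                   sketch_fd_probe_fn probe)
{
    for (size_t i = 0; i < count; i++) {
        if (fds[i] < 0)
            continue;
        if (probe(fds[i], pixelformat))
            return true;
    }
    return false;
}

/* Test double: pretends that only fd 42 advertises the format. */
static bool sketch_fake_probe(int fd, unsigned int pixelformat)
{
    (void)pixelformat;
    return fd == 42;
}
```

With six slots, a format that only the last device advertises is found
when that slot holds a live fd and missed when it holds -1, which is the
behaviour difference this commit is about.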

Verified on higgs:

  Before:
    vainfo: Supported profile and entrypoints
          VAProfileHEVCMain               : VAEntrypointVLD
    (only HEVC; H264/VP9/AV1 not enumerated)

  ffmpeg vaapi -i h264 → "No support for codec h264 profile 578"

Build clean on boltzmann (only config.c.o + request.c.o recompile).

Backward-compatible on RK3399/3588 — the new slot is gated by
HAVE_DAEDALUS_V4L2 *and* video_fd_daedalus >= 0; both stay false in
those deployments.  Existing 5-fd probe order unchanged.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-20 16:45:33 +02:00
claude-noether c332d34643 Merge pull request 'request: route VP9/AV1/H.264 to daedalus_v4l2 on Pi 5 mixed deploy' (#6) from claude-noether/libva-v4l2-request-fourier:noether/libva-1-per-codec-dispatch into master 2026-05-20 08:53:04 +00:00
marfrit 6173a8da8e request: route VP9/AV1/H.264 to daedalus_v4l2 on Pi 5 mixed deploy
LIBVA-1 — when both rpi-hevc-dec and daedalus_v4l2 are loaded, finish
the per-codec dispatch so HEVC goes to rpi-hevc-dec (existing 'p'
override) and VP9 / AV1 / H.264 go to the daedalus daemon ('d').

Before this change the multi-device-probe accepted only ONE driver
plus a fixed alt slot (rkvdec↔hantro-vpu); on a Pi 5 with both decoders
the find_codec_device() walk preferred rpi-hevc-dec by known_decoder_
drivers[] order and never opened daedalus_v4l2, so VP9/AV1/H.264 frames
hit rpi-hevc-dec's S_FMT and failed.

Changes:

  - request.c multi-device-probe: when primary = rpi-hevc-dec, alt =
    daedalus_v4l2 (when HAVE_DAEDALUS_V4L2 is on); symmetric handling
    in the daedalus_v4l2 primary branch so alt = rpi-hevc-dec.  This
    preserves the iter40 fallback (no daedalus → alt = NULL) when the
    build option is off.

  - request.c alt-driver opening block: generalized from the iter38
    rkvdec/hantro pair to also dispatch into video_fd_rpi_hevc_dec and
    video_fd_daedalus slots.  Defensive close on unknown alt-driver
    name (shouldn't happen — primary_driver branches gate the choices —
    but keeps the slot tally clean if a future driver name is added
    above without wiring up the dispatch here).

  - request_switch_device_for_profile: added 'd' kind handler +
    profile override block.  When daedalus is open, VP9 / AV1 / H.264*
    route to it.  HEVC stays on rpi-hevc-dec via the existing 'p'
    override.  AV1 'a' kind (RK3588 vpu981) wins ONLY if vpu981 was
    probed, so the override only fires on hosts where vpu981 stayed
    -1 (i.e. Pi 5).

  - RequestTerminate: close the daedalus_v4l2 fd pair on teardown
    (was leaking — caught while reviewing the alt-driver expansion).
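
The routing rules in the list above can be condensed into a sketch like
this one. The enum, struct, and function names are hypothetical, and
returning 0 for "no override, keep the existing routing" is an
assumption made for the example.

```c
enum sketch_codec {
    SKETCH_CODEC_HEVC,
    SKETCH_CODEC_H264,
    SKETCH_CODEC_VP9,
    SKETCH_CODEC_AV1,
};

/* -1 means the device was not probed or not opened. */
struct sketch_fds {
    int video_fd_rpi_hevc_dec;
    int video_fd_daedalus;
    int video_fd_vpu981;
};

/* Returns 'p' (rpi-hevc-dec), 'a' (vpu981), 'd' (daedalus), or 0. */
static char sketch_device_kind(enum sketch_codec codec,
                               const struct sketch_fds *fds)
{
    switch (codec) {
    case SKETCH_CODEC_HEVC:
        return fds->video_fd_rpi_hevc_dec >= 0 ? 'p' : 0;
    case SKETCH_CODEC_AV1:
        /* vpu981 wins whenever it was probed. */
        if (fds->video_fd_vpu981 >= 0)
            return 'a';
        return fds->video_fd_daedalus >= 0 ? 'd' : 0;
    case SKETCH_CODEC_H264:
    case SKETCH_CODEC_VP9:
        return fds->video_fd_daedalus >= 0 ? 'd' : 0;
    }
    return 0;
}
```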

Build: meson + ninja clean on boltzmann (only pre-existing GStreamer
H265 parser noise).  Behaviour on RK3399/3588 boxes unchanged — the
new branches are gated by HAVE_DAEDALUS_V4L2 *and* video_fd_daedalus
≥ 0, both of which stay false in those deployments.

Companion to daedalus-v4l2 481279c (Phase 8.13 systemd unit) and
marfrit-packages noether/daedalus-v4l2-kernel-6.18-compat branch.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-20 10:41:18 +02:00
marfrit de27e95571 v4l2: log error_idx + failing ctrl id on S_EXT_CTRLS failure
Better diagnostic when VIDIOC_S_EXT_CTRLS returns < 0: read
back error_idx and print which control id rejected (or
"ioctl-level" when error_idx == count, meaning the rejection
was generic, not per-control).

Made it possible to triage the daedalus_v4l2 phase 8.13 issue
by separating "the actual stateless control failed" (would
show failing_ctrl_id=0xa40a2c VP9_FRAME) from "libva probing
H264/HEVC profile/level we don't expose" (failing_ctrl_id=
0xa40900 H264_PROFILE etc.) — the latter is harmless on a
VP9-only context.

Before:
  v4l2-request: Unable to set control(s): Invalid argument

After (per-control):
  v4l2-request: Unable to set control(s): Invalid argument
                (error_idx=0/2 failing_ctrl_id=0xa40900 size=0)

After (ioctl-level):
  v4l2-request: Unable to set control(s): Invalid argument
                (error_idx=2/2 ioctl-level)
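
A sketch of how the suffix in the log lines above could be produced,
assuming the caller has already read error_idx and count back from the
failed ioctl. The function name and the parallel ids/sizes arrays are
hypothetical; real code would read them from the control array it
submitted.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Format the diagnostic suffix. error_idx == count means the ioctl was
 * rejected as a whole rather than for one specific control. */
static int sketch_format_ctrl_error(char *buf, size_t len,
                                    uint32_t error_idx, uint32_t count,
                                    const uint32_t *ids,
                                    const uint32_t *sizes)
{
    if (error_idx >= count)
        return snprintf(buf, len, "(error_idx=%u/%u ioctl-level)",
                        (unsigned)error_idx, (unsigned)count);

    return snprintf(buf, len,
                    "(error_idx=%u/%u failing_ctrl_id=0x%x size=%u)",
                    (unsigned)error_idx, (unsigned)count,
                    (unsigned)ids[error_idx], (unsigned)sizes[error_idx]);
}
```

The bounds check comes first so that ids[] is never indexed with the
out-of-range value that signals an ioctl-level failure.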

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 18:14:50 +00:00
marfrit 2146341460 daedalus_v4l2: meson option gate (default true)
Adds a build-time switch so platforms that will never see a
daedalus_v4l2 kernel module (Allwinner cedrus, RK without the
shim, etc.) can opt out of the probe entry + dispatch branch.

  meson setup build                         # daedalus support on
  meson setup build-off -Ddaedalus_v4l2=false  # off

Implementation:
- meson_options.txt: new boolean `daedalus_v4l2`, default true.
- src/meson.build: when option is true, autoconfig.h gets
  `#define HAVE_DAEDALUS_V4L2 1`.
- src/request.c: known_decoder_drivers[] entry, primary-driver
  detection branch, and post-probe log line all gated by
  #ifdef HAVE_DAEDALUS_V4L2.
- src/request.h: struct daedalus fields kept UNCONDITIONAL.
  This costs two extra ints per session, and the struct layout stays
  stable across translation units regardless of the option, which
  avoids the ODR risk of every consumer of request.h needing to
  include autoconfig.h before request.h.
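
The gating split described above (table entry conditional, struct
fields unconditional) might be sketched like this. The sketch_ names
and the other table entries are illustrative assumptions;
HAVE_DAEDALUS_V4L2 is the macro named in this commit.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#ifdef HAVE_DAEDALUS_V4L2
#define SKETCH_DAEDALUS_ENABLED 1
#else
#define SKETCH_DAEDALUS_ENABLED 0
#endif

/* Fields stay unconditional so every translation unit sees one layout,
 * whether or not it included autoconfig.h first. */
struct sketch_request_data {
    int video_fd;
    int media_fd;
    int video_fd_daedalus;
    int media_fd_daedalus;
};

/* Only the probe table entry is compiled out. */
static const char *const sketch_known_decoder_drivers[] = {
    "rkvdec",
    "hantro-vpu",
    "rpi-hevc-dec",
#ifdef HAVE_DAEDALUS_V4L2
    "daedalus_v4l2",
#endif
};

static bool sketch_driver_is_known(const char *name)
{
    size_t n = sizeof(sketch_known_decoder_drivers) /
               sizeof(sketch_known_decoder_drivers[0]);

    for (size_t i = 0; i < n; i++) {
        if (strcmp(sketch_known_decoder_drivers[i], name) == 0)
            return true;
    }
    return false;
}
```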

Verified on hertz: both builds compile clean.
  build/src/autoconfig.h has HAVE_DAEDALUS_V4L2; .so contains
  "daedalus_v4l2" string + log message.
  build-off/src/autoconfig.h doesn't; .so contains no daedalus
  strings at all.

Default-on build still passes vainfo end-to-end:
  vainfo: Driver version: v4l2-request
  vainfo: Supported profile and entrypoints
        VAProfileH264Main / High / ConstrainedBaseline / MultiviewHigh
        / StereoHigh : VAEntrypointVLD
        VAProfileVP9Profile0 : VAEntrypointVLD
        VAProfileAV1Profile0 : VAEntrypointVLD

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 17:41:17 +00:00
marfrit b5b3acf0f7 daedalus_v4l2: add to known_decoder_drivers + multi-device-probe slot
Phase 8.10 of the daedalus-v4l2 sibling campaign — out-of-tree
V4L2 stateless decoder shim that forwards bitstream to a
userspace daemon (FFmpeg-software decode for VP9 / AV1 / H.264;
pixels back via dmabuf into the CAPTURE buffer).

Adds the same iter40-shaped wiring as rpi-hevc-dec:
- known_decoder_drivers[] entry "daedalus_v4l2"
- video_fd_daedalus + media_fd_daedalus slots in driver_data
- -1 init alongside the other multi-device slots
- primary-driver detection branch in the auto-probe block
- post-probe log line for symmetry with iter40

No per-profile dispatch changes needed — daedalus_v4l2 advertises
the standard V4L2_PIX_FMT_{VP9_FRAME,AV1_FRAME,H264_SLICE}
OUTPUT fourccs the fork's existing per-driver paths already
handle.

Verified on hertz (Pi 5 / CM5, 6.12.75+rpt-rpi-2712) with the
daedalus_v4l2 module loaded:

  LIBVA_DRIVER_NAME=v4l2_request \
  LIBVA_V4L2_REQUEST_VIDEO_PATH=/dev/video0 \
  LIBVA_V4L2_REQUEST_MEDIA_PATH=/dev/media3 \
  vainfo --display drm --device /dev/dri/renderD128

  v4l2-request: opened daedalus_v4l2 at video_fd=... media_fd=... (Pi 5 daemon-backed VP9/AV1/H264)
  vainfo: Driver version: v4l2-request
  vainfo: Supported profile and entrypoints
        VAProfileH264Main               : VAEntrypointVLD
        VAProfileH264High               : VAEntrypointVLD
        VAProfileH264ConstrainedBaseline: VAEntrypointVLD
        VAProfileH264MultiviewHigh      : VAEntrypointVLD
        VAProfileH264StereoHigh         : VAEntrypointVLD
        VAProfileVP9Profile0            : VAEntrypointVLD
        VAProfileAV1Profile0            : VAEntrypointVLD

Without the env override the auto-probe still picks rpi-hevc-dec
first (it's earlier in known_decoder_drivers[]); on the standalone
daedalus_v4l2 path the daemon-backed decode is what answers
S_FMT/QBUF/DQBUF. On a mixed-driver Pi 5 box where both modules
are loaded, HEVC continues to route through rpi-hevc-dec via the
existing 'p' override; VP9/AV1/H264 would prefer daedalus_v4l2
since rpi-hevc-dec is HEVC-only.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 17:37:53 +00:00
marfrit 820557268b Merge PR #5: ampere-av1 Phase 2 (master) — fourth-fd probe + AV1 enumeration 2026-05-18 13:47:56 +00:00
claude-noether c6f81c653f ampere-av1 Phase 2 (master): fourth-fd probe + AV1 enumeration
Imports the minimal "vainfo lists VAProfileAV1Profile0" layer from the
operator's in-progress av1-iter1 branch (Phase 2 steps 1, 2 — commits
bed75c0 + 61db76e on av1-iter1). The Phase 3-5 bit-exact decode-side
work stays in av1-iter1; this commit gives master the enumeration +
fd-routing layer so consumers (ffmpeg-vaapi, firefox-fourier, chromium-
fourier) at least see VAProfileAV1Profile0 today on RK3588.

What this commit adds:
- video_fd_vpu981 + media_fd_vpu981 slots to struct request_data
  (named to match av1-iter1's convention so the operator's Phase 3-5
   merge resolves cleanly)
- 4th-decoder probe loop in VA_DRIVER_INIT that walks hantro-vpu
  media nodes for an instance advertising V4L2_PIX_FMT_AV1_FRAME
  (AV1F) as OUTPUT pixfmt. RK3588 has 3 hantro-vpu instances all
  reporting driver="hantro-vpu" + model="hantro-vpu", so OUTPUT-
  format probe is the only DTS-independent discriminator.
- 'a' kind in request_device_kind_for_profile (VAProfileAV1Profile0)
  + 'a' branch in request_switch_device_for_profile.
- video_fd_vpu981 added to any_fd_supports_output_format helper
  (existing 3-slot loop missed the new fd; same off-by-one trap
  that bit ampere's av1-iter1 enumeration for a week).
- VAProfileAV1Profile0 → V4L2_PIX_FMT_AV1_FRAME in pixelformat_for
  _profile.
- VAProfileAV1Profile0 push in RequestQueryConfigProfiles +
  RequestQueryConfigEntrypoints + RequestCreateConfig switch.
- vpu981 fd cleanup in RequestTerminate.
- rpi_hevc_dec fd cleanup added at the same time (was already missing
  in master — fixed defensively).
- V4L2_REQUEST_MAX_PROFILES bumped 13 → 14. Defensively sized for
  the post-Option-B-revert future: with iter39 Option B reverted
  (Hi10P + Main10 back in enumeration) plus AV1, max possible
  enumeration is 13. The per-group guards use `index < MAX - N`
  pattern; for a singleton push to succeed at index=13 we need
  MAX >= 14. Bumping now avoids the same off-by-one bug from
  silently dropping AV1 when Option B eventually reverts.
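
The OUTPUT-format discriminator from the probe-loop item above can be
sketched as follows, assuming the per-instance format lists have
already been collected (real code would enumerate them with
VIDIOC_ENUM_FMT on the OUTPUT queue). The sketch_ names are
hypothetical; the fourcc packing mirrors the usual little-endian V4L2
scheme.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_FOURCC(a, b, c, d)                          \
    ((uint32_t)(a) | ((uint32_t)(b) << 8) |                \
     ((uint32_t)(c) << 16) | ((uint32_t)(d) << 24))

#define SKETCH_PIX_FMT_AV1_FRAME SKETCH_FOURCC('A', 'V', '1', 'F')

/* One decoder instance and the OUTPUT fourccs it advertises. */
struct sketch_instance {
    const uint32_t *output_formats;
    size_t count;
};

/* Return the index of the first instance advertising AV1F, or -1.
 * Driver and model strings are identical across instances, so the
 * format list is the only thing compared. */
static int sketch_find_av1_instance(const struct sketch_instance *inst,
                                    size_t n)
{
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < inst[i].count; j++) {
            if (inst[i].output_formats[j] == SKETCH_PIX_FMT_AV1_FRAME)
                return (int)i;
        }
    }
    return -1;
}
```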

What this commit does NOT add:
- av1.{c,h} decode-side scaffolding (Phase 2 step 4 on av1-iter1 —
  ~177 LoC including a stub av1_set_controls that returns -1). When
  the operator's av1-iter1 Phase 3-5 work lands on master, those
  500+ LoC + the stub will follow. Without them, consumers calling
  vaCreateContext(VAProfileAV1Profile0) succeed at the libva layer
  but ffmpeg-vaapi will fail at the first vaRenderPicture with an
  AV1-buffer-type rejection — clean error, no crash.

Verified 2026-05-18 on ampere:

  $ env LIBVA_DRIVER_NAME=v4l2_request vainfo | grep VAProfile
        ... (10 prior profiles, unchanged) ...
        VAProfileAV1Profile0            :   VAEntrypointVLD   ✓

  Probe log: "ampere-av1: vpu981 AV1 decoder at /dev/video4 + /dev/media3"

Build clean on ampere with GCC 16.1.1; no warnings introduced.
ampere's running module restored to the av1-iter1 build after the
verification — this commit's .so was NOT permanently installed.

Closes the headline acceptance criterion in
marfrit/libva-v4l2-request-fourier#2 ("vainfo on ampere lists
VAProfileAV1"). End-to-end AV1 decode bit-exactness is iter4 work
that the av1-iter1 branch continues to drive.

Co-Authored-By: claude-noether <claude-noether@reauktion.de>
2026-05-18 13:45:04 +00:00
claude-noether 9bb5a5a722 README: ffmpeg-v4l2-request-fourier flipped to published
Build + publish landed (2:8.1.r123329.b57fbbe-3, Kwiboo's
v4l2-request-n8.1 tip + libudev-bypass companion patch). Deploy-host
verified on fresnel: installs cleanly, ffmpeg buildconf shows
--enable-v4l2-request, hwaccels list includes 'v4l2request', HEVC
decode via -hwaccel v4l2request produces correct-size output.

Quickstart per-host pacman -S lines now include
ffmpeg-v4l2-request-fourier. Status table flipped its row from
pending to published. Remaining pending: chromium-fourier
(clang 22 -> 23 blocker), qt6-base-fourier (Wayland GL_ALPHA fix).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-17 21:01:04 +00:00
claude-noether 0182307403 README: add Quickstart section with per-host install + full stack matrix
The TL;DR of 'what packages do I install to watch YouTube on my
Rockchip board with HW acceleration in Firefox' wasn't reachable
from this README without reading three other repos' commit
histories. Fixed.

Now landed at the top:

- Stack matrix: kernel (linux-{fresnel,ampere}-fourier) -> ffmpeg
  (ffmpeg-v4l2-request-fourier) -> libva (libva-v4l2-request-fourier)
  -> browser (firefox-fourier or chromium-fourier + kwin-fourier on
  Wayland).
- Honest acknowledgement that the browser HW path is libavcodec
  hwdevice DRM, not VAAPI-via-libva. This backend matters for mpv /
  ffmpeg-as-vaapi consumers.
- Per-host pacman -S incantations for fresnel (RK3399), ampere
  (RK3588), ohm (RK3566).
- Live marfrit repo URL + signing-key import flow.
- Smoke-test commands (vainfo + MOZ_LOG patterns).
- Honest status flag: ffmpeg-v4l2-request-fourier, chromium-fourier,
  qt6-base-fourier exist in marfrit-packages source tree but NOT
  yet in the live repo. Users building those locally now.
- RK3588 mainline (Feb 2026) called out alongside ampere row.

What hasn't changed: Pi 5 standoff section, technical notes,
existing iter39 / iter40 status tables.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-17 20:48:53 +00:00
claude-noether 941fbc5b1b README: candid 'standoff' framing for Pi 5 HEVC + RK matrix
Replace the original 2018 Bootlin upstream README with the
fourier-fork situation as of May 2026. What works: fresnel 5/5,
ampere iter1+2, ohm baseline (all RK family, mainline VDPU381/383
landing Feb 2026 helps).

What doesn't: Pi 5 HEVC via this backend. New 'The Pi 5 standoff'
section captures the honest situation surfaced by the May 2026
web-research pass:

- Kwiboo's ffmpeg-v4l2request hwaccel: 8 years un-merged upstream
- libva-v4l2-request: no commits since ~2021
- rpi-hevc-dec mainline: 17 months in review, still not merged;
  Pi 6.18.x downstream has active HEVC regressions (#7228, #7306)
- Mozilla bug 1969297 picks the ffmpeg-hwaccel-context path, not
  libva — explicit ack that strict drivers need libavcodec's
  internal SPS context
- Frames the issue as ecosystem coordination failure (principal-
  agent stalemate), not architectural impossibility

Notes that iter40 + iter40b land but park: the backend infra is
sound + reusable for any future strict V4L2 stateless target ffmpeg
ships before libva does, but the user-facing Pi 5 HEVC story will
not come from this backend. It will come from Mozilla / Kwiboo /
upstream coordination unblocking.

iter38 5/5 fresnel + 9-profile ampere baselines preserved
post-iter40b — documented as no-regression in phase7_pi5_hevc_close.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-17 19:58:52 +00:00
claude-noether 071b08dcc2 iter40b: SPS-parse fix lands but bit-exact still blocked upstream
Per-driver gate added: when rpi-hevc-dec active, parse SPS NAL from
surface_object->source_data via the iter2 vendored GStreamer parser
and override the VAAPI-omitted v4l2_ctrl_hevc_sps fields
(sps_max_num_reorder_pics, sps_max_latency_increase_plus1,
sps_max_sub_layers_minus1, max_dec_pic_buffering_minus1[HighestTid]).
Cached at driver_data->hevc_sps_field_cache.

Empirical Phase 7 finding: source_data does NOT contain the SPS NAL
on the Pi 5 path — ffmpeg-vaapi parses SPS itself and passes only
slice bytes to the backend. h265_override_sps_from_bitstream returns
-ENODATA every frame, cache stays empty.

Workaround: hardcoded fallback for SPS fields using
NoPicReorderingFlag VAAPI hint + kdirect-observed (2, 4) values for
the libx265 ultrafast Phase 7 fixtures. Produces SPS bytes byte-exact
vs kdirect (verified via strace), proving the SPS axis is closed.
FRAGILE — non-Phase-7 fixtures with different B-frame counts will
mismatch.

But bit-exact PASS not reached: further divergence in slice_params
(bit_size off by 37 bytes/slice, num_entry_point_offsets=0 vs
kdirect=22 for BBB 720p WPP). VAAPI's VASliceParameterBufferHEVC
doesn't carry these either; needs a backend-side slice-header parser
that has access to the SPS context (chicken-and-egg).

Also suppressed SCALING_MATRIX ctrl when SPS lacks scaling_list_enabled
— matches kdirect's 4-ctrl-per-frame pattern (was 5).
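
A sketch of the conditional control list, assuming a simple enum in
place of the real V4L2 control ids. The names and the ordering are
hypothetical; only the 4-versus-5 count comes from this message.

```c
#include <stdbool.h>
#include <stddef.h>

enum sketch_hevc_ctrl {
    SKETCH_HEVC_SPS,
    SKETCH_HEVC_PPS,
    SKETCH_HEVC_DECODE_PARAMS,
    SKETCH_HEVC_SLICE_PARAMS,
    SKETCH_HEVC_SCALING_MATRIX,
};

/* Fill out[] with the controls to submit and return how many there are.
 * The scaling matrix is appended only when the SPS enables scaling
 * lists. */
static size_t sketch_build_hevc_ctrls(bool scaling_list_enabled,
                                      enum sketch_hevc_ctrl out[5])
{
    size_t n = 0;

    out[n++] = SKETCH_HEVC_SPS;
    out[n++] = SKETCH_HEVC_PPS;
    out[n++] = SKETCH_HEVC_DECODE_PARAMS;
    out[n++] = SKETCH_HEVC_SLICE_PARAMS;
    if (scaling_list_enabled)
        out[n++] = SKETCH_HEVC_SCALING_MATRIX;

    return n;
}
```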

Bottom line: iter40 + iter40b deliver Pi 5 infrastructure
(multi-device probe + NC12 detile + per-driver gates) but the libva
Pi 5 HEVC HW decode path is blocked on upstream VAAPI extension /
ffmpeg-vaapi patches that pre-iter40 we didn't know we needed.

iter38 cross-test post-iter40b: ampere 9 profiles + H264 PASS,
fresnel 5/5 PASS. No sibling regression.

Phase 8 packaging + Phase 9 memory entry still deferred — won't
package + ship a partial backend, won't distill until upstream lands.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-17 19:45:43 +00:00
claude-noether 9037934b21 phase7_pi5_hevc_close: iter40 partial — backend integration works, decode rejected by rpi-hevc-dec
C1 vainfo PASS, C3 HW engagement PASS, C6 decode-correctness FAIL
(V4L2_BUF_FLAG_ERROR on every CAPTURE DQBUF). Root cause empirically
located: SPS sps_max_num_reorder_pics + sps_max_latency_increase_plus1
fields. Our backend uses a spec-legal fallback (sps_max_dec_pic_buffering_minus1, 0)
because VAAPI doesn't forward these fields; rkvdec accepts it,
rpi-hevc-dec validates against bitstream-true values and rejects.

Real fix needs SPS NAL parse via the iter2 vendored GStreamer parser
to populate bitstream-true values for the V4L2 SPS ctrl. Estimated
1 more 8(+1)-phase loop (iter40b).

Phase 8 + Phase 9 deferred — won't package + deploy + ship a broken
backend; won't distill lessons until the real fix lands.

Sibling iter38 baseline NOT yet re-verified on fresnel + ampere
post-iter40. Code paths gated on video_fd_rpi_hevc_dec >= 0 stay
no-op on non-Pi hosts; only __arm__ → __aarch64__ guard change is
globally observable but its is_10bit sub-gate stays dormant on
8-bit fixtures. Verify before declaring no-regression.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-17 19:18:16 +00:00
claude-noether 3ffa9d0d17 iter40: Pi 5 HEVC chapter — backend integration lands, bit-exact pending
Phase 6 implementation. Backend builds clean on higgs (Debian 13
trixie, aarch64), vainfo lists VAProfileHEVCMain via rpi-hevc-dec,
multi-device probe finds /dev/video19 + /dev/media1, CreateContext
+ S_FMT + REQBUFS + STREAMON all succeed.

Phase 7 partial: infrastructure works, 10 frames flow through the
pipeline (correct byte counts produced — 13824000 for 1280x720 x 10
NV12 frames). But every DQBUF CAPTURE returns V4L2_BUF_FLAG_ERROR
so output content is wrong (libva sha != kdirect sha). The decode
itself is failing on the rpi-hevc-dec side despite all ctrl
submissions returning success.
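
For reference, the byte count quoted above follows from the NV12 layout
(one full-resolution Y plane plus a half-size interleaved UV plane). A
minimal sketch, assuming even dimensions and no row padding; the
function name is hypothetical.

```c
#include <stddef.h>

/* Bytes for `frames` linear 8-bit NV12 frames of w x h pixels:
 * w*h for Y plus w*h/2 for interleaved UV. */
static size_t sketch_nv12_bytes(unsigned w, unsigned h, unsigned frames)
{
    return (size_t)w * h * 3 / 2 * frames;
}
```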

Code changes:
- request.h: video_fd_rpi_hevc_dec / media_fd_rpi_hevc_dec slots +
  has_hevc_ext_sps_rps_rpi_hevc_dec flag (mirrors iter38 + iter2
  pair-of-flags pattern, naturally false on Pi).
- request.c: known_decoder_drivers gains rpi-hevc-dec; primary-driver
  probe gets an else-if branch setting the new fds (Phase 5 F3);
  request_switch_device_for_profile prefers 'p' for HEVC when
  rpi-hevc-dec present.
- context.c: per-fd want_pixfmt (NC12 on Pi), capture_pixelformat
  taken from video_format slot (not hardcoded NV12/NV15);
  synthetic-SPS pre-seed gated off for Pi (Phase 5 F6);
  destination_sizes uses nv12_col128_uv_plane_offset for NC12 SAND
  layout (Phase 5 F2);
  per-driver HEVC_START_CODE (NONE on Pi, ANNEX_B on RK);
  per-driver context_object->h264_start_code (skip prepend on Pi).
- video.c: NV12_COL128 video_format entry (8-bit SAND, single
  buffer, 2 planes, NV12 drm_format with MOD_NONE so detile branch
  fires rather than tiled_to_planar).
- nv12_col128.c/.h: detile primitive (Y + UV per-plane, kernel
  hevc_d_video.c bytesperline formula + ffmpeg/Kynesim per-pixel
  offset). UV plane offset = 128 * ALIGN(h, 8) — within-column
  (SAND interleaves Y+UV per column, NOT plane-concatenated;
  earlier wrong formula caught by Phase 7 SEGV).
- image.c: #ifdef __arm__ extended to __arm__ || __aarch64__
  (Phase 5 F1 — guard was killing detile path on all aarch64
  hosts including fresnel iter39 NV15 path, masked because 10-bit
  never exercised); RequestCreateImage NC12 → NV12 stride override
  (linear width, not column-stride); copy_surface_to_image NC12
  detile branch (gates on fourcc + v4l2_format).
- nv15.h: fallback V4L2_PIX_FMT_NV15 define (Debian 13 headers
  omit it though they have NC12).
- nv12_col128.h: fallback V4L2_PIX_FMT_NV12_COL128 +
  V4L2_PIX_FMT_NV12_10_COL128 (Arch / mainline pre-Pi headers).
- tests/test_nv12_col128_detile.c: hand-crafted-bytes unit test;
  passes (8 cases: Y + UV for 4 widths incl. 1366 misaligned;
  UV-offset helper).
- meson.build: nv12_col128 sources listed.
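
The column-major addressing behind the detile primitive can be sketched
as below. This is an illustration under assumptions rather than the
code in nv12_col128.c: it assumes 128-byte-wide columns stored one
after another, each holding all Y rows followed by the UV rows, and it
assumes the per-column row count is the value the driver reports as
bytesperline (1080 for 1280x720, which is consistent with the
sizeimage of 1382400 quoted elsewhere in this log). All sketch_ names
are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_SAND_COL_WIDTH 128u
#define SKETCH_ALIGN(x, a) (((x) + (a) - 1) / (a) * (a))

/* Byte offset of luma pixel (x, y) inside the column-tiled buffer. */
static size_t sketch_sand_y_offset(unsigned x, unsigned y,
                                   unsigned col_height_rows)
{
    size_t column = x / SKETCH_SAND_COL_WIDTH;

    return column * SKETCH_SAND_COL_WIDTH * col_height_rows
         + (size_t)y * SKETCH_SAND_COL_WIDTH
         + x % SKETCH_SAND_COL_WIDTH;
}

/* Offset of the UV rows within one column: 128 * ALIGN(h, 8). */
static size_t sketch_sand_uv_plane_offset(unsigned height)
{
    return (size_t)SKETCH_SAND_COL_WIDTH * SKETCH_ALIGN(height, 8u);
}

/* Copy the luma plane into a linear buffer with the given row stride. */
static void sketch_sand_detile_y(const uint8_t *src, uint8_t *dst,
                                 unsigned width, unsigned height,
                                 unsigned col_height_rows,
                                 unsigned dst_stride)
{
    for (unsigned y = 0; y < height; y++) {
        for (unsigned x = 0; x < width; x++)
            dst[(size_t)y * dst_stride + x] =
                src[sketch_sand_y_offset(x, y, col_height_rows)];
    }
}
```

A per-pixel loop keeps the addressing obvious; a production version
would more likely copy 128-byte runs at a time.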

Phase 7 status: not yet bit-exact. Remaining diagnosis: per-frame
S_EXT_CTRLS payload diff vs kdirect (kdirect sends 4 ctrls
SPS+PPS+decode_params+slice_array; ours sends 5 incl. scaling_matrix;
field ordering differs). Likely the slice_array contents need
per-driver handling for rpi-hevc-dec's expected layout. Beyond
in-session reach.

iter38 5/5 baseline on fresnel + ampere should be unaffected (new
fd stays -1 on non-Pi hosts; all gates either short-circuit on
fd-not-present or no-op).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-17 19:17:14 +00:00
claude-noether f1be489c75 phase5_pi5_hevc_review: 3 critical findings empirically verified, 1 fixture gap
Sonnet Plan-agent review of phase1_pi5_hevc plan. Empirically
verified each finding against current source per
feedback_review_empirical_over_theoretical BEFORE accepting:

F1 (CRITICAL): #ifdef __arm__ at image.c:239+268 kills NC12 (and
already-present NV15) detile on AArch64. fresnel iter39 5/5 PASS
masked this because 10-bit path was never exercised. Fix: extend
guard to __aarch64__.

F2 (CRITICAL): destination_bytesperlines for NC12 source returns
column-stride (1080) not linear-NV12 Y stride (1280). VAImage
consumers see wrong pitch. Fix: override in RequestCreateImage
when src=NC12, dst image=NV12.

F3 (CRITICAL): request.c primary-driver detection has else-if
branches for rkvdec and hantro-vpu only. On higgs (rpi-hevc-dec
primary), neither matches → new fd pair stays -1 → routing
no-ops. Fix: add explicit rpi-hevc-dec branch.

F4 (accepted): add 1366x768 fixture to exercise column padding.

F5 (verify-only): HEVC START_CODE_ANNEX_B may not work on
rpi-hevc-dec (kdirect uses NONE). Don't pre-gate; verify
empirically in Phase 7.

F6 (CRITICAL): iter25 synthetic-SPS pre-seed fires for HEVC
regardless of driver_kind. Would issue HEVC_SPS to rpi-hevc-dec
which doesn't need it AND uses different submission order. Fix:
gate on driver_data->video_fd != video_fd_rpi_hevc_dec.

F7/F8 (no findings): image.c gate predicate sound; cross-device
regression scope clean.

Amended Phase 6 step list with 3 new gating actions. Phase 7
verification expanded with empirical START_CODE check + 1366
fixture.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-17 19:04:28 +00:00
claude-noether bf52725ab3 phase1_pi5_hevc: lock goal + situation + N=3 baseline + plan (iter40)
Phase 1 measurable goal: HEVC Main 8-bit bit-exact libva-vs-kdirect
on higgs for 640x360 / 1280x720 / 1920x1080 fixtures with HW path
engagement verified via lsof + ffmpeg-vaapi log signal.

Phase 2 surface-area audit: ~250 LoC backend + 100 LoC standalone
detile primitive. Reuses iter38 multi-device-probe pattern (now
3 slots: rkvdec + hantro + rpi-hevc-dec) + iter2 per-driver
gating shape. h265_set_controls + iter31 a-29 plumbing transfers
unchanged. iter25 SPS pre-seed gated off for rpi-hevc-dec.

Phase 3 baseline locked: N=3 bit-exact SW==kdirect for all three
fixtures on higgs. kdirect engagement signal:
  Hwaccel V4L2 HEVC stateless V4; devices: /dev/media1,/dev/video19;
  buffers: src DMABuf, dst DMABuf; swfmt=rpi4_8

Phase 4 plan: 7 sequenced steps (request.h -> request.c -> video.c
-> nv12_col128.c new -> image.c branch -> meson/Makefile -> build
on higgs). NC12 tile geometry locked from kernel hevc_d_video.c
math + ffmpeg/Kynesim av_rpi_sand_to_planar_y8 byte-offset formula.
Risks + mitigations enumerated.

Phase 5 sonnet review explicitly requested per CLAUDE.md
no-skip-reviews rule.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-17 19:00:35 +00:00
claude-noether b6a65fc692 phase0_pi5_hevc: close addendum with empirical higgs probe data
Live probe of rpi-hevc-dec on higgs (Pi CM5, kernel 6.12.75-rpt-rpi-2712,
Debian 13 trixie) answers Phase 0 open questions Q1, Q2, Q5, Q6
empirically; Q3 partial; Q4 still open.

Q1 (EXT_SPS): NOT present. Only standard V4L2_CID_STATELESS_HEVC_*.
  Probe ctrl id 0xa97 returns EINVAL — same gate iter2's
  has_hevc_ext_sps_rps_rkvdec uses. iter31 alpha-29 plumbing applies.

Q2 (hevc_start_code): default 0 "No Start Code"; matches our behaviour.

Q3 (NC12 SAND tile layout): partial. CAPTURE S_FMT for 1280x720 NC12
  returns sizeimage=1382400 (linear NV12 byte count) but
  bytesperline=1080 (suspect, encodes SAND col count not linear stride).
  Need kernel-doc / driver-source read before writing detile primitive.

Q4 (DRM modifier round-trip): hwdownload rejects SAND-tiled drm_prime
  (-38 Function not implemented). Backend CPU-detile to NV12 is the
  safe path for Firefox.

Q5 (submission ordering): empirical ioctl trace shows canonical V4L2
  stateless flow. Two notes for the backend: kdirect uses
  V4L2_MEMORY_DMABUF for both queues (we use MMAP for CAPTURE on
  rkvdec); kdirect does NOT need the iter25 SPS pre-seed pattern -
  rpi-hevc-dec takes explicit NC12 + dims directly.

Q6 (packaging): Debian 13 trixie. Phase 8 needs a debian/ tree, not
  just PKGBUILD. Decision in Phase 1.
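
The Q3 numbers above can be cross-checked with a little arithmetic. The sketch below is illustrative only: the function names are hypothetical, and the reading of `bytesperline` as "height times 3/2" is an assumption that happens to reproduce the probed value, not something the probe confirms.

```c
#include <stdint.h>

/* Byte count of a linear NV12 frame: a full-size Y plane plus a
 * half-height interleaved UV plane. */
static uint32_t nv12_linear_sizeimage(uint32_t width, uint32_t height)
{
    return width * height * 3 / 2;
}

/* ASSUMPTION (hypothetical helper): the bytesperline reported for NC12
 * is the per-column row count of Y plus UV, i.e. height * 3 / 2.
 * This matches the probed 1080 for a 720-line frame but still needs a
 * kernel-doc / driver-source read to confirm. */
static uint32_t nc12_assumed_column_stride(uint32_t height)
{
    return height * 3 / 2;
}
```

For the probed 1280x720 case, the first helper gives the reported sizeimage and the second gives the reported bytesperline, which is consistent with the suspicion that bytesperline does not carry the linear stride of 1280.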

Other findings: ffmpeg 7.1.3 from stock Debian is built with
--enable-v4l2-request. kdirect engagement line:
  Hwaccel V4L2 HEVC stateless V4; devices: /dev/media1,/dev/video19;
  buffers: src DMABuf, dst DMABuf; swfmt=rpi4_8
No libva ICD installed (only armada-drm_dri.so). mpv installable.
Firefox 145 + rpi-firefox-mods present.

Phase 0 closed. Phase 1 opens with goal:
  HEVC bit-exact libva-vs-kdirect on higgs for 1280x720 Main 8-bit
  via the new RPI_HEVC_DEC driver_kind slot + NC12 detile primitive.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-17 18:54:08 +00:00
claude-noether 25b8a15e09 phase0_pi5_hevc: open Pi 5 / CM5 HEVC chapter (substrate doc only)
Empirical higgs probe (sibling session 2026-05-17) confirmed
rpi-hevc-dec at /dev/video19 is V4L2 STATELESS, not stateful:
- Section header literally "Stateless Codec Controls"
- OUTPUT V4L2_PIX_FMT_HEVC_SLICE (parsed slices), not full-stream HEVC
- V4L2_CID_STATELESS_HEVC_* control set + slice_param_array[4096]
- CAPTURE NC12 / NC30 (V4L2_PIX_FMT_NV12_COL128 / _10_COL128,
  SAND 128-column tiled, Pi-specific)

So the Pi 5 HEVC HW path belongs HERE (request/stateless backend),
not in a separate stateful project. Replaces the now-deleted
libva-v4l2-stateful-fourier scaffold attempt.

phase0_pi5_hevc.md captures:
- Substrate (target host, backend baseline, empirical probe output)
- What carries forward unchanged (most of HEVC plumbing)
- What needs adding (RPI_HEVC_DEC driver_kind, NC12/NC30 video_format
  + detile primitive, image.c branch — small surface area)
- Six open questions Phase 1 must answer first (EXT_SPS presence,
  start_code default, SAND tile spec, drm_prime modifier round-trip,
  rpi-hevc-dec submission ordering quirks, packaging target OS)
- Phase 1 goal sketch (NOT locked) + Phase 3 baseline plan

No code in this commit. Phase 1 opens when higgs is up + first two
open questions are answered live.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-17 18:48:09 +00:00
claude-noether cf8cd9d2be h265: cap pred-weight + ref-list loops at VAAPI source size (15)
V4L2_HEVC_DPB_ENTRIES_NUM_MAX is 16, but
VASliceParameterBufferHEVC::RefPicList is [2][15] and the eight
delta_*_weight_lX / luma_offset_lX / delta_chroma_weight_lX /
ChromaOffsetLX arrays are all [15]. Iterating the per-slot copy
loops to 16 over-reads the VAAPI source by one element.

The bug was always there but hidden under -O3 (meson's default
buildtype=release): GCC unrolled the inner loop and dead-folded
the out-of-bounds load. Under -O2 (Arch makepkg CFLAGS) the
canonical vectorised loop ran and produced a real SEGV at
v4l2_request_drv_video.so + 0xb3a4 inside h265_fill_slice_params,
breaking HEVC immediately after the package install on fresnel
(iter38 5/5 baseline dropped to 4/5).

Define a local VA_HEVC_REF_LIST_LEN (15) and use it as the cap
for the four offending loops. RefPicList and pred_weight_table
copies now respect the source bound; V4L2 destination still has
16 slots, the upper one stays at memset-zero which is correct.

Verified locally: -O2 build + package re-install restores HEVC
to bit-exact PASS vs kdirect (sha 108f925bb6cbb6c9). iter38 5/5
baseline restored.
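
A minimal sketch of the capping pattern described above, assuming a 15-entry source and a 16-entry destination. The macro and function names here are hypothetical stand-ins, not the identifiers used in src/h265.c.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-ins: 15 is the VAAPI source array length named in
 * the commit, 16 is the V4L2 destination slot count. */
#define SRC_REF_LIST_LEN 15
#define DST_DPB_ENTRIES  16

/* Copy one per-slot array. The loop is bounded by the source length, so
 * the source is never read past its end; the final destination slot
 * keeps the zero written by memset. */
static void copy_ref_slots(int8_t dst[DST_DPB_ENTRIES],
                           const int8_t src[SRC_REF_LIST_LEN])
{
    memset(dst, 0, DST_DPB_ENTRIES * sizeof(dst[0]));
    for (size_t i = 0; i < SRC_REF_LIST_LEN; i++)
        dst[i] = src[i];
}
```

Bounding the loop by the smaller of the two lengths keeps the behaviour independent of optimisation level, which is the property the -O2 crash exposed.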

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-17 17:00:52 +00:00
claude-noether c9f32aff49 iter39 Option B revert of 63fed87: P010 advertisement gated on is_10bit again
Phase 7 fix 63fed87 (unconditional P010 in QueryImageFormats) broke
HEVC 8-bit on fresnel: ffmpeg-vaapi picked P010 for the HEVC hwframe
pool, vaEndPicture SEGV'd when consumer-side P010 expectations met
the 8-bit NV12 CAPTURE buffer. Exit 139 (SIGSEGV) on first frame.

Original reasoning for 63fed87 (advertise early so ffmpeg's pre-
CreateContext query sees P010) doesn't apply with Option B in place —
Hi10P + Main10 are dropped from RequestQueryConfigProfiles, so no
10-bit decode pipeline reaches QueryImageFormats. The gate on
is_10bit (false for all enumerated profiles post-Option-B) correctly
returns NV12-only.

Verified on fresnel post-revert: HEVC bit-exact PASS sha
108f925bb6cbb6c9 restored; iter38 5/5 baseline intact.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-17 16:52:36 +00:00
claude-noether 6bc12fe7e4 iter39 Option B: drop Hi10P + Main10 from RequestQueryConfigProfiles
Per Phase 7 close + user-directed Option B trigger (web research /
rockchip-mpp showed Hi10P is effectively impossible on the current
stack). Cross-test on ampere RK3588 confirmed the SAME failure mode
as fresnel RK3399 — both produce all-zero output via libva; kdirect
fails with EINVAL on both. The blocker is in ffmpeg-v4l2-request
userspace plumbing for the new uAPI controls Karlman's kernel patches
introduced, NOT in our backend or the kernel.

Sources confirming kernel + HW capable but userspace pending:
  - lwn.net/Articles/950434: "to fully runtime test... you may need
    upstream DRM commits, FFmpeg patches"
  - patchwork.kernel.org Karlman v6 → v10 series on linux-media
  - Rockchip RK3399 + RK3588 datasheets list 10-bit H.264 support

Stop enumerating Hi10P + Main10 so VAAPI consumers don't try the
broken path. The backend infrastructure (codec.c profile cases,
context.c NV15 CAPTURE + synthetic SPS bit_depth=2 + video_format
invalidation, image.c P010 reporting + NV15→P010 unpack, surface.c
RT_FORMAT_YUV420_10 guard + NV15 PRIME fourcc, nv15.c + nv15.h
unpack primitive, request.h is_10bit flag) is RETAINED — just
re-add the two profiles[index++] lines and bump the H264 guard
back to (-6) when upstream ffmpeg-vaapi V4L2 hwaccel learns 10-bit.

Memory: feedback_rk3399_h264_hi10p_advertised_not_functional.md
captures the empirical evidence for future iterations.

vainfo after this commit: 10 profiles (was 12), matches the iter38
baseline. iter38 5/5 PASS preserved (no other codec touched).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-17 16:43:44 +00:00
claude-noether 63fed87bc5 iter39 fresnel fix: advertise P010 unconditionally in QueryImageFormats
ffmpeg-vaapi's hwcontext_vaapi calls vaQueryImageFormats during
hwframes context setup, BEFORE vaCreateContext fires. Our previous
gate on driver_data->is_10bit meant P010 wasn't in the catalog at
that early query — ffmpeg's hwdownload then rejected pix_fmt=p010le
with "Invalid output format p010le for hwframe download" and decode
failed before our backend's CreateContext saw the 10-bit profile.

Fix: advertise P010 unconditionally in QueryImageFormats. Safe because
consumers ask for P010 only when their decode pipeline needs 10-bit,
and our P010 unpack path in copy_surface_to_image is gated on
image->format.fourcc == VA_FOURCC_P010 (independent of is_10bit).

Verified on fresnel: with this fix, Hi10P decode advances past the
hwdownload filter setup. (Run pending bundle to fresnel.)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-17 16:34:52 +00:00
claude-noether a13215de45 iter39 fresnel fix: skip pre-S_FMT NV15 CAPTURE format probe
RK3399 rkvdec advertises NV15 in VIDIOC_ENUM_FMT(CAPTURE) only AFTER
S_FMT(OUTPUT) + S_EXT_CTRLS(SPS) resolve image_fmt to 420_10BIT.
Pre-flight v4l2_find_format(NV15) always returns 0 → video_format
stays NULL → CreateContext returns OPERATION_FAILED → ffmpeg-vaapi
hwaccel init fails with "Failed to create decode context: 1".

Verified on fresnel (kernel 7.0-14 / linux-fresnel-fourier):
  v4l2-ctl -d /dev/video1 --list-formats → only NV12 enumerated

Fix: for 10-bit profiles, skip the find_format probe and directly
map to our NV15 video_format entry. The later S_FMT(CAPTURE) in
the same RequestCreateContext path commits the actual NV15 mode
once the synthetic-SPS injection sets bit_depth_luma_minus8=2.

Discovered during Phase 7 sub-profile verification — Criterion 1
(vainfo enumeration) PASSed but Criteria 2/3 (Hi10P/Main10 decode)
failed with the hwaccel init error. iter38 5/5 baseline still PASSES
(no regression — non-10-bit path unchanged).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-17 16:34:14 +00:00
claude-noether f0ef69d279 iter2 step4: wire h265_set_controls to populate EXT_SPS_*_RPS controls
Per Phase 4 plan + Phase 5 review amendments (SPS parse-and-cache,
per-fd gating).

src/h265.c additions:
  - #include <errno.h>, the v4l2-hevc-ext-controls.h, and the
    vendored gst/codecparsers/gsth265parser.h
  - new static helper h265_populate_ext_sps_rps_cache(): walks
    surface_object->source_data for an SPS NAL (nal_unit_type == 33)
    using gst_h265_parser_identify_nalu; if found, calls
    gst_h265_parser_parse_sps_ext (NOT gst_h265_parser_parse_sps —
    the latter discards the per-RPS-entry EXT data we need); maps
    GstH265ShortTermRefPicSet (base) + GstH265ShortTermRefPicSetExt
    (carrying use_delta_flag[16], used_by_curr_pic_flag[16],
    delta_poc_s0_minus1[16], delta_poc_s1_minus1[16]) into the V4L2
    struct arrays; stores on driver_data->hevc_rps_cache_*
  - non-IDR-frame handling: cache holds across frames, so frames
    whose source_data lacks an SPS NAL reuse the previously-parsed
    cached arrays (Phase 5 review item #3)
  - controls[] grows from [5] to [7]; the 2 new entries are appended
    after the standard 5 (SPS/PPS/SLICE_PARAMS/SCALING_MATRIX/
    DECODE_PARAMS), gated by driver_data->has_hevc_ext_sps_rps_rkvdec
    (per-fd probe result from Step 3) + the cache being valid
  - field-by-field mapping mirrors GStreamer's
    gst_v4l2_codec_h265_dec_fill_ext_sps_rps verbatim (the upstream
    reference identified in Phase 0 prior-art survey)

src/request.h additions:
  - struct request_data carries hevc_rps_cache_st (array pointer),
    _st_count, hevc_rps_cache_lt, _lt_count, hevc_rps_cache_valid.
    Single-slot cache (sps_id 0 only; multi-SPS streams would need
    expanding). Stores POST-MAPPED V4L2 structs so request.h doesn't
    need to know GstH265SPS / GstH265SPSEXT types.

Critical interpretation correction (Phase 5 review followup):
GstH265SPS has short_term_ref_pic_set[65] (base) but NOT
short_term_ref_pic_set_ext[]. The EXT array lives on a SEPARATE
GstH265SPSEXT struct accessed via gst_h265_parser_parse_sps_ext.
The 'plain' gst_h265_parser_parse_sps internally calls _ext with a
LOCAL discarded SPSEXT (see gsth265parser.c:2050). Our call must
use the _ext variant directly to keep the EXT data. Caught during
Step 4 first-build error.

Build verified: ninja -C build clean. .so is 759 KB (up from 485 KB
original, 682 KB after Step 2 vendor — the +80 KB is the new helper
+ extension).

iter2 Phase 6 Step 5 (install + reboot + smoke-test) is the F1
falsifier moment: if HEVC stops OOPSing, mechanism confirmed; if it
still OOPSes, loop back to Phase 0 with kernel-agent#11 re-opened.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-17 09:49:12 +00:00
claude-noether 393d02f413 iter2 step3: HEVC EXT_SPS_*_RPS UAPI header + runtime probe
src/hevc-ctrls/v4l2-hevc-ext-controls.h (NEW, MIT, ~95 LOC):
  Verbatim mirror of Linux 7.0 V4L2_CID_STATELESS_HEVC_EXT_SPS_ST_RPS
  and _LT_RPS control IDs + struct definitions + flag macros. Each
  symbol is ifndef-guarded so when ampere's linux-api-headers
  eventually bumps to 7.0+, the kernel header takes precedence and
  this shim silently no-ops. Citation block links the upstream
  Casanova v8 series.

  Per LGPL section 3.b, kernel UAPI struct definitions are excepted
  from GPL inheritance, so copying them into MIT userspace is fine.

src/request.h: added has_hevc_ext_sps_rps_rkvdec + _hantro bool
  fields on struct request_data — pair-of-flags layout mirrors
  video_fd_rkvdec / video_fd_hantro (iter38 multi-device-probe
  pattern, per feedback_multi_device_probe_design). Phase 5 review
  identified single-scalar storage as a silent-misbehavior risk
  across device-switch boundaries.

src/request.c:
  - new probe_hevc_ext_sps_rps_controls(fd) helper: queries the two
    new CIDs via VIDIOC_QUERYCTRL; returns true iff both register.
    RK3399 rkvdec (linux 6.x or 7.x without VDPU381/383 bindings)
    returns false; RK3588 rkvdec (VDPU381/383) returns true.
  - probe each driver_data->video_fd_rkvdec / _hantro after the
    iter38 multi-device-probe block at VA_DRIVER_INIT time
  - log-line if rkvdec supports it - diagnostic for Phase 7

src/meson.build: added the new UAPI header to the headers list.

Build verified: ninja -C build clean, .so produced. The new probe
runs at driver init and stores the result, but nothing CONSUMES the
result yet — that's Step 4 (h265_set_controls wiring).

Per ampere-kernel-decoders campaign iter2 Phase 4 step 3 (amended
by Phase 5 review item 'per-fd storage').

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-17 09:49:09 +00:00
claude-noether 9f7437e8ee iter2 step2: GLib/GStreamer compat shim, build succeeds
Vendored gsth265parser + nalutils + gstbitreader + gstbytereader (the
Step 1 commit) compile cleanly against libc + libv4l2 only after
adding 1 compat translation unit + 5 stub headers, no edits to the
vendored .c/.h files themselves.

src/h265_parser/gst_compat.{h,c} — new files (MIT, original work):
  - GLib type aliases (gboolean, gchar, gint*, guint*, gsize, gpointer)
  - Memory helpers (g_malloc/g_free as #define free, g_memdup2 inline)
  - Asserts as no-op + parser-return-code-propagation
  - All GST_DEBUG/INFO/WARNING/ERROR/LOG/FIXME as no-ops (the parser
    is heavy on debug logging; we compile it all out)
  - GArray implementation (~100 LOC, just enough for gsth265parser.c's
    24 call sites)
  - GList full struct with .data/.next/.prev so callers compile;
    list-manipulation functions abort() — dead code paths only
  - Byte-order read/write macros (GST_READ_UINT8/16/24/32/64_LE/BE,
    GST_WRITE_UINT8/16/24/32_BE) — aarch64 LE inlines
  - g_once_init_enter/leave as simple gate
  - G_MAXUINT*, G_MAXINT*, G_MINxxx, G_GNUC_* attribute macros, etc.
  - Opaque GstBuffer/GstMemory/GstMapInfo + abort-stub functions for
    the encoder-side SEI-insertion paths the libva backend never invokes
  - gst_util_ceil_log2 real impl (used by slice-header parser; dead
    for our SPS-only call path but cheaper to implement than stub)

src/h265_parser/gst/{gst.h,base/base-prelude.h,base/gstbitwriter.h,
codecparsers/codecparsers-prelude.h,glib-compat-private.h} — 5 new
stub headers (MIT). All include gst_compat.h. gstbitwriter.h adds
abort-stub functions for the bit-writer API (used by nalutils.c's NAL
emulation-prevention encoder path — dead code for the parse-only
libva backend).

src/meson.build — added the 5 new .c source files and 10 new .h
headers; added include_directories('h265_parser') to the include path
so the vendored files' '#include <gst/base/...>' style references
resolve to the stub headers + actual vendored files in the local
tree.

Build verified: ninja -C build produces v4l2_request_drv_video.so
(682 KB, up from 485 KB pre-vendor — the +200 KB is the vendored
parser code). nm shows gst_h265_parse_sps, gst_h265_parse_sps_ext,
gst_h265_parser_identify_nalu, and the other functions we need for
Step 4 are present in the binary.

Two #warning messages from gsth265parser.h about API stability are
upstream-intentional and harmless ('The H.265 parsing library is
unstable API and may change in future').

This commit completes Step 2 of ampere-kernel-decoders iter2 Phase 6.
Backend remains functionally identical to pre-iter2 — the new code
compiles + links but is not yet called from h265_set_controls (that's
Step 4). Existing 5 codecs continue to work as before.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-17 09:49:06 +00:00
claude-noether c9b7fcff50 iter2 step1: vendor GStreamer 1.28.2 H.265 parser unchanged
Source: gitlab.freedesktop.org/gstreamer/gstreamer @ commit 43421c2a5b8a
(refs/tags/1.28.2). All 8 vendored files copied verbatim into
src/h265_parser/:

  gst-plugins-bad/gst-libs/gst/codecparsers/gsth265parser.c (168 KB)
  gst-plugins-bad/gst-libs/gst/codecparsers/gsth265parser.h ( 92 KB)
  gst-plugins-bad/gst-libs/gst/codecparsers/nalutils.c       (13 KB)
  gst-plugins-bad/gst-libs/gst/codecparsers/nalutils.h       (  8 KB)
  gstreamer/libs/gst/base/gstbitreader.c                     (  8 KB)
  gstreamer/libs/gst/base/gstbitreader.h                     ( 10 KB)
  gstreamer/libs/gst/base/gstbytereader.c                    ( 39 KB)
  gstreamer/libs/gst/base/gstbytereader.h                    ( 25 KB)

Total ~11 KLOC, LGPL v2.1+ per original headers (Intel + Sreerenj
Balachandran + others). LGPL headers preserved verbatim. Backend's
existing COPYING.LGPL covers redistribution.

** Build is INTENTIONALLY BROKEN at this commit. ** GLib dependencies
(GArray, g_malloc, gboolean, GST_DEBUG, etc.) are not yet satisfied;
src/Makefile.am is not yet updated to include these files. Step 2
performs the GLib-to-libc mechanical adaptation; Step 3 wires the
header + Makefile.

This vendor-unchanged commit is the upstream-tracking baseline. When
GStreamer ships a parser bug fix, the future-sync workflow is:
  git diff <this commit>..HEAD -- src/h265_parser/
to surface our adaptations, then rebase those over the upstream fix.

Per ampere-kernel-decoders campaign iter2 Phase 4 §Step 1
(/home/mfritsche/src/ampere-kernel-decoders/phase4_plan_iter2.md).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-17 09:48:52 +00:00
claude-noether a8a91d92d6 Revert "ampere iter2: HEVC EXT_SPS_ST_RPS / _LT_RPS dynamic-array submission (VDPU381)"
This reverts commit f61f736380.
2026-05-17 09:48:29 +00:00
claude-noether f61f736380 ampere iter2: HEVC EXT_SPS_ST_RPS / _LT_RPS dynamic-array submission (VDPU381)
Fixes the rkvdec_hevc_prepare_hw_st_rps out-of-bounds kernel OOPS that
blocked HEVC decode on ampere (RK3588) per
marfrit/libva-v4l2-request-fourier#3 and ampere-fourier iter1 close.

Mechanism (Phase 5 amendment to issue body):
The new EXT_SPS controls are registered as V4L2_CTRL_FLAG_DYNAMIC_ARRAY
in vdpu38x_hevc_ctrl_descs (rkvdec.c:279/284) with cfg.dims = { 65 }.
The v4l2-ctrl framework init-allocates 1 zeroed element (ctrls-core.c:2116).
When num_short_term_ref_pic_sets > 1, rkvdec_hevc_prepare_hw_st_rps
(rkvdec-hevc-common.c:393-405) iterates idx 0..N-1 and overruns the
1-element kernel allocation. Submitting an N-element dynamic-array
control via S_EXT_CTRLS extends the framework allocation.

Userspace fix:
  - VIDIOC_QUERY_EXT_CTRL probe at first HEVC CreateContext sets
    driver_data->has_ext_sps_rps (true on VDPU381/383, false on legacy
    RK3399 — control unregistered there, so fresnel iter38 5/5 + iter39
    sub-profile paths are byte-identical to pre-iter2).
  - When set, h265_set_controls appends EXT_SPS_ST_RPS + _LT_RPS as
    calloc'd zero arrays, sized by VAAPI's count fields and capped at
    H.265 §7.4.3.2 spec maxima (ST 64, LT 32). Min 1 (kernel rejects 0).
  - Free post-S_EXT_CTRLS.
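
The sizing rule in the list above (take the VAAPI count, cap it at the spec maximum, never go below 1) can be sketched as a small clamp. The helper name is hypothetical; the maxima 64 and 32 are the ones quoted in this commit message.

```c
/* Clamp an RPS entry count to [1, spec_max]. The lower bound is 1
 * because, per the commit message, the kernel rejects a zero-length
 * dynamic array. */
static unsigned int clamp_rps_count(unsigned int count, unsigned int spec_max)
{
    if (count < 1)
        return 1;
    if (count > spec_max)
        return spec_max;
    return count;
}
```

With the quoted maxima, a short-term count would be passed through `clamp_rps_count(n, 64)` and a long-term count through `clamp_rps_count(n, 32)` before allocating the zeroed arrays.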

Decode correctness scope:
VAAPI does NOT expose per-set st_ref_pic_set syntax elements
(delta_idx_minus1, delta_rps_sign, etc.) — confirmed in va_dec_hevc.h.
All-zero entries give empty inter-pred RPS per set, which is correct
for IDR-only streams and incorrect for streams with inter-pred RPS
dependence. iter2 acceptance: stop the OOPS. Decode-correctness for
inter-RPS content is a known follow-up requiring either bitstream-snoop
or SPS-passthrough via a new VAAPI extension.

Files:
  - include/hevc-ctrls.h: #ifndef-guarded fallback definitions for
    V4L2_CID_STATELESS_HEVC_EXT_SPS_{ST,LT}_RPS + structs (ampere host
    is on linux-api-headers 6.19-1; the new CIDs land in 7.0).
  - src/request.h: driver_data->has_ext_sps_rps (persists for driver
    lifetime; gated solely by HEVC code path so cross-codec leakage
    impossible).
  - src/context.c: probe at HEVC CreateContext via v4l2_query_ext_ctrl.
  - src/h265.c: controls[5] → controls[7]; #include <hevc-ctrls.h>
    (replaces <linux/v4l2-controls.h>) for forward UAPI compatibility.

Compile-tested on boltzmann (aarch64 native, gcc 15.2.1): clean .so,
0 new warnings. Fresnel cross-device safety: legacy RK3399 rkvdec_ctrl
table omits the CIDs; probe returns false; new code path never executes.

iter39 sub-profile work (commits 662f887 + 8746690) is preserved
in-tree; iter2 is a forward-compatible additive change.

Refs:
  marfrit/libva-v4l2-request-fourier#3
  ampere-fourier/iter1_close.md HEVC blocker
  ampere-fourier/iter2_phase0_findings.md

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-17 09:34:58 +00:00
claude-noether 8746690739 iter39: add NV15 → P010 unpack self-test (tests/test_nv15_unpack.c)
Pure-C unit test for nv15_unpack_plane_to_p010, independent of any V4L2
hardware. Verifies bit layout against the spec at
Documentation/userspace-api/media/v4l/pixfmt-nv15.rst by packing known
10-bit pixel values, running the unpack, and asserting P010 output
matches pixel<<6.

Coverage:
  - zero, all-max
  - 8 known position/spread vectors
  - widths {1, 2, 3, 7, 8} including remainder paths
  - multi-row with stride padding
  - chroma-shape (half-height)

Build + run:
  cc -Wall -Werror -O2 -o test_nv15_unpack \
     tests/test_nv15_unpack.c src/nv15.c
  ./test_nv15_unpack

Confirmed PASS on noether (x86_64 native). Catches the highest-risk
class of regression in iter39 — silent bit-shift errors in the unpack —
without requiring fresnel hardware.
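
For orientation, the bit layout the test exercises can be sketched per group of four samples. This is a simplified illustration with hypothetical function names, assuming the LSB-first packing of four 10-bit values into five bytes described for NV15; it is not the code in src/nv15.c.

```c
#include <stdint.h>

/* Unpack one NV15 group: four 10-bit samples stored LSB-first in five
 * bytes. Each output is the sample shifted left by 6, so the 10
 * significant bits occupy the top of a 16-bit P010 word. */
static void nv15_group_to_p010(const uint8_t in[5], uint16_t out[4])
{
    uint16_t s0 = (uint16_t)(in[0] | ((in[1] & 0x03) << 8));
    uint16_t s1 = (uint16_t)((in[1] >> 2) | ((in[2] & 0x0f) << 6));
    uint16_t s2 = (uint16_t)((in[2] >> 4) | ((in[3] & 0x3f) << 4));
    uint16_t s3 = (uint16_t)((in[3] >> 6) | (in[4] << 2));

    out[0] = (uint16_t)(s0 << 6);
    out[1] = (uint16_t)(s1 << 6);
    out[2] = (uint16_t)(s2 << 6);
    out[3] = (uint16_t)(s3 << 6);
}

/* Test-side helper: pack four 10-bit samples into five bytes using the
 * same LSB-first layout. */
static void nv15_pack_group(const uint16_t px[4], uint8_t out[5])
{
    uint64_t bits = 0;

    for (int i = 0; i < 4; i++)
        bits |= (uint64_t)(px[i] & 0x3ff) << (10 * i);
    for (int i = 0; i < 5; i++)
        out[i] = (uint8_t)(bits >> (8 * i));
}
```

A round trip through the pack helper and the unpack function is the same shape of check the self-test performs: known pixel values in, `pixel << 6` out.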

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-17 09:22:14 +00:00
claude-noether 662f8874ba iter39 α-31: H264 Hi10P + HEVC Main10 sub-profile support (10-bit, rkvdec NV15)
Adds VAProfileH264High10 and VAProfileHEVCMain10 to the libva-v4l2-request
backend. RK3399 rkvdec emits decoded frames as V4L2_PIX_FMT_NV15 (4 × 10-bit
values packed in 5 bytes per element); VAAPI consumers receive standard
VA_FOURCC_P010 via a new userspace unpack in copy_surface_to_image.

VP9 Profile 2 explicitly NOT added — RK3399 rkvdec kernel ctrl table
caps at V4L2_MPEG_VIDEO_VP9_PROFILE_0 (rkvdec.c::rkvdec_vp9_ctrl_descs).

Touchpoints (per Phase 5 sonnet-architect review amendments):
  - include/drm_fourcc.h: define DRM_FORMAT_NV15 (vendored libdrm lacks it)
  - src/nv15.{c,h}: NV15 → P010 plane unpack (LSB-first, per
    Documentation/userspace-api/media/v4l/pixfmt-nv15.rst)
  - src/video.c: NV15 entry in formats[] (else NULL-deref on video_format_find)
  - src/codec.c: pixelformat_for_profile cases for Hi10P + Main10
  - src/config.c: enumeration, validation, entrypoints, RT_FORMAT_YUV420_10
    advertisement for 10-bit profiles
  - src/context.c: per-profile CAPTURE pix_fmt (NV12/NV15), 10-bit synthetic
    SPS (bit_depth_luma_minus8=2), video_format invalidation on bit-depth
    transition (sibling to iter38 device-switch invalidation), is_10bit flag
  - src/surface.c: RT_FORMAT_YUV420_10 admission, NV15 fourcc on PRIME export
  - src/image.c: P010 reporting in DeriveImage + QueryImageFormats,
    P010-aware sizing in CreateImage, NV15 → P010 unpack call in
    copy_surface_to_image (gated on is_10bit + image.format.fourcc == P010)
  - src/picture.c: 4 switch blocks route Hi10P/Main10 to existing H264/HEVC
    per-codec paths
  - src/request.h: MAX_PROFILES bump 11 → 13, driver_data->is_10bit flag

Scope: COPY path (vaGetImage / vaDeriveImage) only. Standard ffmpeg-vaapi
hwdownload, mpv vaapi-copy, and any consumer using vaGetImage works
end-to-end. PRIME-path consumers that only know NV12/P010 must use the
COPY path; PRIME consumers aware of NV15 (panfrost-Mesa et al.) get the
correct fourcc on RequestExportSurfaceHandle. PRIME-side P010 emission is
follow-up scope (would need DRM_FORMAT_P010 + per-plane unpack into a
GPU-accessible buffer).

Compile-tested on boltzmann (aarch64 native, gcc 15.2.1, libva 1.23.0,
libdrm 2.4.133): clean build, .so produced, 0 new warnings.

Phase 0/2 evidence: linux-mmind-v7.0 drivers/media/platform/rockchip/rkvdec.
rkvdec_h264_decoded_fmts[] and rkvdec_hevc_decoded_fmts[] both list NV15;
ctrl tables cap at HEVC MAIN_10 and H264 HIGH_422_INTRA (Hi10P < cap, not
in menu_skip_mask). image_fmt resolution (rkvdec-h264-common.c:196,
rkvdec-hevc-common.c:467) dispatches on bit_depth_luma_minus8 only.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-17 09:15:16 +00:00
29 changed files with 3230 additions and 1134 deletions
+262 -56
@@ -1,75 +1,281 @@
# v4l2-request libVA Backend
# libva-v4l2-request-fourier
## About
VA-API ICD backend for V4L2 stateless video decoders. Fourier-campaign
fork of the dormant `bootlin/libva-v4l2-request` upstream.
This libVA backend is designed to work with the Linux Video4Linux2
Request API that is used by a number of video codecs drivers,
including the Video Engine found in most Allwinner SoCs.
> **TL;DR for "I want hardware-accelerated YouTube in Firefox on my
> Rockchip board":** skip to the [§ Quickstart](#quickstart) below.
> Fresnel (RK3399) and ampere (RK3588) are validated targets; ohm
> (RK3566 PineTab2) is the chromium-fourier validation rig.
## Status
## What works
The v4l2-request libVA backend currently supports the following formats:
* MPEG2 (Simple and Main profiles)
* H264 (Baseline, Main and High profiles)
* H265 (Main profile)
| SoC / host | HW-accelerated codecs | Bit-exact vs `kdirect` |
|---|---|---|
| RK3399 (fresnel — Pinebook Pro) | H.264, HEVC Main, VP9 Profile 0, VP8, MPEG-2 | 5/5 at iter38; preserved through iter40b |
| RK3588 (ampere) | H.264 + HEVC (iter1+iter2 ampere-fourier); **mainline rkvdec / VDPU381 + VDPU383 landed February 2026** — VP9 / AV1 verification next | iter1 H.264 PASS; remaining codecs gated on mainline-driver bring-up |
| RK3568 / RK3566 (ohm — PineTab2) | H.264, MPEG-2, VP8 via hantro multi-planar | iter1-5 baseline (libva-multiplanar campaign) |
| BCM2712 (higgs — Pi 5 / CM5) | — | infrastructure landed (iter40 / iter40b), bit-exact NOT achieved, [see § Pi 5 standoff](#the-pi-5-standoff) |
`kdirect` is the reference: `ffmpeg -hwaccel v4l2request
-hwaccel_output_format drm_prime ...` via Kwiboo's downstream ffmpeg
patches (packaged here as **`ffmpeg-v4l2-request-fourier`**, FFmpeg 8.1
tip @ Kwiboo `v4l2-request-n8.1` commit `b57fbbe`).
## Quickstart
### What you need for HW-accelerated YouTube in Firefox
The full stack, top to bottom, with the package this campaign provides
at each layer:
| Layer | Package(s) | Notes |
|---|---|---|
| Linux kernel with V4L2 stateless decoders | `linux-fresnel-fourier` (RK3399), `linux-ampere-fourier` (RK3588) | Mainline rkvdec / hantro / VDPU381 / VDPU383. ohm typically rides on a Beryllium OS host kernel. |
| `ffmpeg` with Kwiboo's v4l2-request hwaccel | `ffmpeg-v4l2-request-fourier` | Provides `-hwaccel drm -c:v hevc` (and h264/vp9) routes via libavcodec hwdevice DRM. |
| `libva` VA-API runtime + this backend ICD | `libva` (stock) + **`libva-v4l2-request-fourier`** | This repo. Auto-detects rkvdec / hantro / cedrus on probe. |
| Firefox patched to call libavcodec stateless | `firefox-fourier` | 5-patch series, ~+169 LoC over stock Firefox. Validated on fresnel: **~5 % CPU at 1080p30 H.264** (vs 64 % software). |
| (Wayland alt) Chromium patched for V4L2VDA | `chromium-fourier` + `kwin-fourier` | Validated on ohm under KDE Plasma 6.6.5 Wayland. Needs `kwin-fourier` for the dmabuf-fence latency fix. |
| (Optional) panfrost / panthor GPU stack | `vulkan-panfrost` | Wayland compositor + 3D. |
The actual VA-API path is mostly historical inside this campaign — the
**user-facing browser HW decode story rides libavcodec's
`v4l2_request` hwaccel directly**, not VAAPI-via-libva. Firefox-fourier
attaches an `AV_HWDEVICE_TYPE_DRM` context to libavcodec's generic
`h264`/`hevc`/`vp9` decoder; libavcodec then auto-binds the
`v4l2_request` hwaccel from its `hw_configs`. No `LIBVA_DRIVER_NAME`
incantation needed for browser use. libva-v4l2-request-fourier matters
for mpv, ffmpeg-as-vaapi, and other VA-API direct consumers.
### Install on Arch ALARM (fresnel / ampere / ohm)
Add the marfrit repo if you haven't already:
```ini
# /etc/pacman.conf
[marfrit]
SigLevel = Required
Server = https://packages.reauktion.de/arch/$arch
```
Import the signing key (one-time):
```bash
sudo pacman-key --recv-keys <KEY-ID> # see https://packages.reauktion.de
sudo pacman-key --lsign-key <KEY-ID>
sudo pacman -Sy
```
Then per host:
```bash
# Fresnel: RK3399 Pinebook Pro
sudo pacman -S \
    linux-fresnel-fourier linux-fresnel-fourier-headers \
    ffmpeg-v4l2-request-fourier \
    libva-v4l2-request-fourier \
    firefox-fourier

# Ampere: RK3588
sudo pacman -S \
    linux-ampere-fourier linux-ampere-fourier-headers \
    ffmpeg-v4l2-request-fourier \
    libva-v4l2-request-fourier \
    firefox-fourier

# Ohm: RK3566 PineTab2 (chromium-fourier validated path)
sudo pacman -S \
    ffmpeg-v4l2-request-fourier \
    libva-v4l2-request-fourier \
    kwin-fourier
# chromium-fourier is currently still a local build; see § Status
```
Reboot if a new kernel landed. Then:
```bash
# Smoke-test: vainfo should list HEVCMain + H264 entries
LIBVA_DRIVER_NAME=v4l2_request vainfo
# Browser launch with verbose decoder logging
MOZ_LOG="PlatformDecoderModule:5,FFmpegVideo:5" \
firefox-fourier 2>&1 | tee /tmp/fx.log
# Then open a YouTube 1080p H.264 video and grep for:
# "Choosing FFmpeg pixel format for V4L2 video decoding"
# "av_hwdevice_ctx_create(DRM, /dev/dri/renderD128) ok"
# If you DON'T see those: HW path didn't engage, fell back to software.
```
### Status of the published vs locally-built packages
As of May 2026, the live marfrit repo at
<https://packages.reauktion.de/arch/aarch64/> has:
- `libva-v4l2-request-fourier-1:1.0.0.r361.cf8cd9d-1` (iter40b tip)
- `ffmpeg-v4l2-request-fourier-2:8.1.r123329.b57fbbe-3` (Kwiboo's
  v4l2-request-n8.1 + libudev-bypass; smoke-tested on fresnel:
  HEVC via `-hwaccel v4l2request` PASS)
- `firefox-fourier-150.0.1-16` (5-patch series, sandboxed RDD HW
  decode validated on RK3399: ~5 % CPU at 1080p30 H.264)
- `linux-fresnel-fourier-7.0-14` + headers (RK3399)
- `linux-ampere-fourier-7.0rc3.kafr1-1` + headers (RK3588)
- `kwin-fourier-1:6.6.5-1` (Wayland dmabuf-fence fix for chromium-fourier)
- `vulkan-panfrost-1:26.0.5-1` (GPU stack)
NOT yet published but **present in `marfrit-packages/arch/` source
tree** (build + publish pending):
- `chromium-fourier` (Chromium 147 + V4L2VDA-on-mainline patches;
blocked on Arch ALARM bumping clang 22 → 23).
- `qt6-base-fourier` (GL_ALPHA → GL_R8 fix, needed by KDE Plasma
Wayland on the panfrost stack).
If you need those locally before they ship:
```bash
git clone ssh://git@git.reauktion.de:2222/marfrit/marfrit-packages.git
cd marfrit-packages/arch/<package>
makepkg -si
```
## What does NOT work, and why it's stalled
| Target | Status | Blocker |
|---|---|---|
| H264 Hi10P on RK3399 | enumerated, decode returns all-zero | RK3399 silicon doesn't implement 10-bit despite kernel advertising the profile (iter39 close, Option B applied) |
| HEVC Main10 on RK3399 | not enumerated | same as Hi10P |
| **Pi 5 / CM5 (BCM2712 / `rpi-hevc-dec`)** | infrastructure landed (iter40 / iter40b), bit-exact NOT achieved | see "The Pi 5 standoff" below |
### The Pi 5 standoff
iter40 + iter40b add a third multi-device-probe slot for
`rpi-hevc-dec`, an NC12 SAND128 detile primitive, per-driver gates
around the SPS pre-seed + start-code-prepend + scaling_matrix submission,
and a (fragile, fixture-specific) SPS field override using the
GStreamer 1.28.2 H.265 parser. ICD discovery works, `vainfo` lists
`VAProfileHEVCMain`, S\_FMT / REQBUFS / STREAMON all succeed.
**Decode itself never succeeds** — every CAPTURE DQBUF returns
`V4L2_BUF_FLAG_ERROR`. Driver author John Cox confirmed strict SPS
validation is intentional ("`try_ext_ctrls returned an error (22)` is
expected as it is validating the SPS"), and VAAPI's
`VAPictureParameterBufferHEVC` simply doesn't carry the bitstream-true
scalars (`sps_max_num_reorder_pics`, `sps_max_latency_increase_plus1`,
slice-level `num_entry_point_offsets`) that the driver wants. We can't
fish the SPS out of `source_data` either, because ffmpeg-vaapi parses
the SPS itself and passes only slice NAL bytes to libva backends.
This is not a bug in our backend, in libva, in ffmpeg, or in the kernel
driver. It's an ecosystem coordination failure of long standing:
- **Kwiboo's `ffmpeg-v4l2request` hwaccel** has been in production via
LibreELEC since December 2018. Re-submitted to ffmpeg-devel as a v2
series in August 2024. Still un-merged in May 2026 — **eight years
in the upstream review queue**.
- **`libva-v4l2-request`** (this project's upstream) hasn't taken
meaningful commits since ~2021. Nobody wants to own the impedance
mismatch between VAAPI's Intel-shaped "give me raw bitstream, I'll
parse" and V4L2 stateless's kernel-shaped "give me parsed structs,
I'll just drive the HW."
- **`rpi-hevc-dec` mainline submission** is at v4 (July 2025), 17
months in review. The Pi 6.18.x downstream kernel meanwhile has
active HEVC regressions ([raspberrypi/linux#7228](https://github.com/raspberrypi/linux/issues/7228),
[#7306](https://github.com/raspberrypi/linux/issues/7306)) that
aren't being fast-tracked because "the new uAPI is coming."
- **Mozilla is implementing Pi 5 HEVC via ffmpeg's hwaccel-context
path** (bug [1969297](https://bugzilla.mozilla.org/show_bug.cgi?id=1969297)),
not via libva — explicit acknowledgement from David Turner that
libavcodec needs to retain the SPS context for the strict driver to
accept the control batch.
What end-users actually do today: run Pi OS (downstream-patched ffmpeg
+ downstream kernel) or LibreELEC (Kwiboo's patches + downstream
kernel). Anyone on a stock distro outside those two: no HW HEVC on
Pi 5.
Nobody who has authority to merge has skin in the game. Everyone with
skin in the game lacks authority. Result: 8-year stalemate, three
forks of working code, no merged upstream.
### What this means for this backend
We chose to extend `libva-v4l2-request` into Pi 5 territory because
the architecture maps cleanly onto the existing iter38 multi-device
probe. That work landed (iter40 commit `3ffa9d0`, iter40b commit
`071b08d`). It's reusable infrastructure for any future strict V4L2
stateless decoder that ffmpeg ships before libva does.
But the *user-facing* Pi 5 HEVC story will not come from this
backend. The backend was a clean architectural target inside a
coordination dead-end. The actual Pi 5 HEVC path through libva
requires either:
- a VAAPI extension exposing the SPS scalars rpi-hevc-dec validates
against (Intel-driven; no Pi-aligned principal),
- a libva-internal `VABufferType` for raw SPS/PPS NAL bytes (no
maintainer),
- ffmpeg-vaapi forwarding `num_entry_point_offsets` to backends
(small upstream patch; no champion), OR
- the political situation around Kwiboo's series unblocks (no
visible movement).
iter40 + iter40b are **landed but parked**. The fresnel + ampere
sibling paths are unaffected (5/5 fresnel + 9 profiles ampere
verified post-iter40b, no regression). Phase 8 packaging is
deliberately skipped — shipping a `.deb` whose primary advertised
target (Pi 5) doesn't actually decode would mislead users.
See `phase0_pi5_hevc.md`, `phase1_pi5_hevc.md`,
`phase5_pi5_hevc_review.md`, `phase7_pi5_hevc_close.md` for the
chapter's full empirical record.
## Instructions
In order to use this libVA backend, the `v4l2_request` driver has to
be specified through the `LIBVA_DRIVER_NAME` environment variable, as
such:
In order to use this backend, set the `LIBVA_DRIVER_NAME` environment
variable:
export LIBVA_DRIVER_NAME=v4l2_request
A media player that supports VAAPI (such as VLC) can then be used to decode a
video in a supported format:
Then a VA-API-capable player can decode supported codecs on a probed
device:
vlc path/to/video.mpg
vlc path/to/video.mp4
mpv --hwdec=vaapi path/to/video.mp4
ffmpeg -hwaccel vaapi -hwaccel_output_format vaapi -i in.mp4 -f null -
Sample media files can be obtained from:
http://samplemedia.linaro.org/MPEG2/
http://samplemedia.linaro.org/MPEG4/SVT/
The backend auto-detects available decoders via the V4L2 media
topology walk; honors `LIBVA_V4L2_REQUEST_VIDEO_PATH` and
`LIBVA_V4L2_REQUEST_MEDIA_PATH` for explicit device selection.
## Technical Notes
### Surface
### Multi-device probe (iter38)
A Surface is an internal data structure never handled by the VA's user
containing the output of a rendering. Usually, a bunch of surfaces are created
at the beginning of decoding and they are then used alternatively. When
created, a surface is assigned a corresponding v4l capture buffer and it is
kept until the end of decoding. Syncing a surface waits for the v4l buffer to
be available and then dequeue it.
A single libva session opens both `rkvdec` and `hantro-vpu` (and, on
hosts where it's present, `rpi-hevc-dec`) at init. `RequestCreateConfig`
re-targets the active fd per profile via
`request_switch_device_for_profile()`. Pool teardown happens at
switch time; the next `CreateContext` rebuilds against the right
device.
Note: since a Surface is kept private from the VA's user, it can ask to
directly render a Surface on screen in an X Drawable. Some kind of
implementation is available in PutSurface but this is only for development
purpose.
### Surface / Context / Picture / Image
### Context
A Surface is an internal data structure containing rendering output.
A Context owns the V4L2 lifecycle (S\_FMT, CAPTURE pool, ctrl-batch
defaults) for one decode session. A Picture is one encoded input
frame's set of buffers. An Image is a standard VA pixel-format view
on a decoded Surface — the backend detiles SAND/COL128 or unpacks
NV15 to NV12/P010 here so consumers see linear pitches.
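
As a rough illustration of the NV15 unpack mentioned above, the sketch below pulls one 10-bit sample out of a 5-byte group and widens it to P010's 16-bit container. It assumes little-endian packing with sample 0 in the lowest bits; the helper names `nv15_sample` and `nv15_to_p010` are hypothetical and are not the backend's own unpack routine.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Return sample i (0..3) of one 5-byte NV15 group.
 * ASSUMPTION: samples are packed little-endian, sample 0 in the low bits.
 */
static inline uint16_t nv15_sample(const uint8_t g[5], unsigned i)
{
    assert(i < 4);
    /* Sample i starts at bit 10*i, i.e. in byte i at bit offset 2*i. */
    unsigned shift = 2u * i;
    uint16_t lo = (uint16_t)(g[i] >> shift);
    uint16_t hi = (uint16_t)(g[i + 1] << (8u - shift));
    return (uint16_t)((lo | hi) & 0x3FFu);
}

/* P010 keeps the 10 significant bits in the top of a 16-bit word. */
static inline uint16_t nv15_to_p010(uint16_t sample10)
{
    return (uint16_t)(sample10 << 6);
}
```

A plane-level unpack would loop this over every 5-byte group of a row, writing four 16-bit words per group.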
A Context is a global data structure used for rendering a video of a certain
format. When a context is created, input buffers are created and v4l's output
(which is the compressed data input queue, since capture is the real output)
format is set.
### Picture
A Picture is an encoded input frame made of several buffers. A single input
can contain slice data, headers and IQ matrix. Each Picture is assigned a
request ID when created and each corresponding buffer might be turned into a
v4l buffer or extended control when rendered. Finally they are submitted to
kernel space when reaching EndPicture.
The real rendering is done in EndPicture instead of RenderPicture
because the v4l2 driver expects to have the full corresponding
extended control when a buffer is queued and we don't know in which
order the different RenderPicture will be called.
### Image
An Image is a standard data structure containing rendered frames in a usable
pixel format. Here we only use NV12 buffers which are converted from sunxi's
proprietary tiled pixel format with tiled_yuv when deriving an Image from a
Surface.
The real rendering is in `EndPicture`, not `RenderPicture`, because
the kernel needs the full extended-control batch when the OUTPUT
buffer is queued, and `RenderPicture` order is consumer-defined.
@@ -195,6 +195,11 @@ extern "C" {
#define DRM_FORMAT_NV24 fourcc_code('N', 'V', '2', '4') /* non-subsampled Cr:Cb plane */
#define DRM_FORMAT_NV42 fourcc_code('N', 'V', '4', '2') /* non-subsampled Cb:Cr plane */
/* iter39: NV15 is 4×10-bit packed in 5 bytes (Rockchip rkvdec 10-bit output). */
#ifndef DRM_FORMAT_NV15
#define DRM_FORMAT_NV15 fourcc_code('N', 'V', '1', '5') /* 2x2 subsampled Cr:Cb plane 10 bits per channel packed */
#endif
/*
* 3 plane YCbCr
* index 0: Y plane, [7:0] Y
@@ -4,3 +4,14 @@ option(
value : '',
description: 'Path to sanitized Linux Kernel headers'
)
option(
'daedalus_v4l2',
type : 'boolean',
value : true,
description: 'Enable probe + dispatch for the out-of-tree daedalus_v4l2 ' +
'stateless decoder shim (Pi 5 / CM5 daemon-backed VP9/AV1/H264). ' +
'Default true; disable on platforms where the daedalus_v4l2 ' +
'kernel module will never be present to slim the probe array.'
)
@@ -0,0 +1,298 @@
# Phase 0 — Pi 5 / CM5 HEVC chapter
Opened 2026-05-17 evening, after the failed `libva-v4l2-stateful-fourier`
scaffold attempt. Brother-session empirical Phase 0 on higgs invalidated
the stateful premise: rpi-hevc-dec is V4L2 **stateless**, so Pi 5 HEVC
belongs in this backend, not a separate sibling.
No code in this chapter yet. This doc is the substrate. Phase 1 picks up
from the "Open questions" section.
## Substrate
### Target host
higgs — Pi CM5 module on Pi CM5 IO board. BCM2712 SoC. VPN-only, often
offline; wake via HIS skill recipe (no Fritz!Box plug — runs on power
when on). Debian-based. Sole HW video decoder is rpi-hevc-dec at
`/dev/video19` + `/dev/media1`.
### Backend baseline at chapter open
`libva-v4l2-request-fourier` master tip `cf8cd9d` (iter39 + Option B +
h265 ref-list cap fix). Multi-device probe (iter38) already opens
rkvdec + hantro slots; adding a third decoder slot for rpi-hevc-dec is
a natural extension of that architecture.
iter2 (ampere VDPU381 HEVC EXT_SPS) added the GStreamer 1.28.2 H.265
parser vendor + EXT_SPS_ST_RPS / _LT_RPS dynamic-array submission. That
plumbing is probe-gated (`has_hevc_ext_sps_rps_rkvdec`), so it stays
dormant on hosts where the controls don't exist.
### Empirical higgs probe (brother session)
`v4l2-ctl -d /dev/video19 --list-formats-ext --list-ctrls`:
```
Stateless Codec Controls
hevc_sequence_parameter_set (compound, V4L2_CID_STATELESS_HEVC_SPS)
hevc_picture_parameter_set (compound, V4L2_CID_STATELESS_HEVC_PPS)
slice_param_array (compound dynamic-array dims=[4096])
hevc_scaling_matrix (compound)
hevc_decode_parameters (compound)
hevc_decode_mode (menu, "Frame-Based")
hevc_start_code (menu, default "No Start Code")
OUTPUT formats:
S265 V4L2_PIX_FMT_HEVC_SLICE (parsed slice payload)
CAPTURE formats:
NC12 V4L2_PIX_FMT_NV12_COL128 (8-bit SAND 128-column tiled)
NC30 V4L2_PIX_FMT_NV12_10_COL128 (10-bit SAND 128-column tiled)
```
Conclusion: this is the standard `V4L2_CID_STATELESS_HEVC_*` control set
exposed under the V4L2-request uAPI, exactly the same family our backend
already drives for rkvdec/hantro/cedrus HEVC paths. The novel parts are
two pixel formats (NC12, NC30) and one driver-id (rpi-hevc-dec).
## What carries forward unchanged
- VAAPI HEVC profile enumeration (`config.c`)
- `h265_set_controls` core path (`h265.c`) — same compound ctrl set
- Synthetic SPS pre-seed pattern (iter25/26) — already runs pre-CAPTURE-alloc
- Multi-device dispatch in `RequestCreateConfig` (iter38)
- VAAPI slice / picture / IQ matrix buffer parsing
- HEVC h264-style start-code policy (we already DON'T prepend for HEVC)
## What needs adding
| Item | Location | Sizing |
|------|----------|--------|
| `RPI_HEVC_DEC` enum in `driver_kind_t` | `request.h` | trivial |
| Multi-device probe extends to `/dev/video19` discovery | `context.c` / `request.c` init | small — mirror hantro slot |
| `V4L2_PIX_FMT_NV12_COL128` (NC12) `video_format` entry | `video.c` | small |
| `V4L2_PIX_FMT_NV12_10_COL128` (NC30) `video_format` entry | `video.c` | small |
| NC12 → NV12 detile primitive | new `nv12_col128.c` | mid — column tile layout, see kernel docs |
| NC30 → P010 detile primitive | new `nv12_col128.c` | mid — 10-bit variant of above |
| `copy_surface_to_image` branch for NC12/NC30 | `image.c` | small (mirror NV15→P010 gating) |
| Per-driver gating for any rpi-specific quirks discovered | various | per [[per-driver-kludge-gating]] |
## Open questions for Phase 1
Lock these before Phase 1 commits to a goal.
1. **EXT_SPS controls on rpi-hevc-dec?** Brother's `--list-ctrls` output
above shows the standard `V4L2_CID_STATELESS_HEVC_*` family — NOT the
`EXT_SPS_ST_RPS` / `EXT_SPS_LT_RPS` extensions that VDPU381 needs.
Verify: does `slice_param_array[4096]` accept `st_rps_bits` /
`lt_rps_bits` in the per-slice payload, or does rpi-hevc-dec parse RPS
itself from the slice header? If the latter, the iter2 EXT_SPS path
stays dormant (probe-gated already), and rpi-hevc-dec just needs the
`picture->st_rps_bits` → `slice_params->short_term_ref_pic_set_size`
plumbing that iter31 α-29 already wired. Expectation: works out of the
box. Confirm before assuming.
2. **`hevc_start_code` ctrl: "No Start Code" vs Annex B?** Brother saw
default `"No Start Code"` — matches our behavior (we don't prepend on
HEVC). But the ctrl is configurable. Verify the menu values exposed
and confirm "No Start Code" passes our raw slice-NAL payload as-is.
If it doesn't, set the ctrl explicitly per [[unconditional-codec-state]]
gating.
3. **NC12 / NC30 SAND tile layout — exact spec.** Read
`Documentation/userspace-api/media/v4l/pixfmt-yuv-planar.rst` for the
COL128 variants. Confirm: column stride = 128 bytes (Y) / 128 bytes
(UV interleaved). Row count = `ALIGN(height, 16)` or `ALIGN(height, 8)`?
Get the exact alignment and tile-traversal order before writing the
detile primitive. Cite from kernel doc, NOT inferred from a hex dump.
4. **drm_prime / SAND modifier round-trip.** Does ffmpeg-vaapi (and
Firefox) accept the NC12 buffer via DRM_PRIME export carrying the
DRM_FORMAT_MOD_BROADCOM_SAND128_COL_HEIGHT modifier, allowing
zero-copy to a SAND-aware compositor? Or is libva-side detile to a
linear NV12 buffer the only viable Firefox path? If detile is
required for the consumer, the [[rockchip-pixel-verify-path]] rule
(DMA-BUF GL preferred over cached mmap) might NOT apply since SAND
is Pi-specific and not in the wider Wayland modifier ecosystem.
5. **rpi-hevc-dec quirks on first SPS submission.** rkvdec needs
image_fmt pre-seed before CAPTURE alloc (iter25). Does rpi-hevc-dec
have an analogous "must set OUTPUT pix_fmt + SPS before CAPTURE"
ordering? Verify with strace early.
6. **higgs OS + libva versioning.** Brother probed on Debian. We package
for Arch ALARM. What's the install path on higgs — Arch / Debian /
Raspberry Pi OS? If Debian, the package needs a `debian/` tree, not
just PKGBUILD. Decide packaging target before Phase 8.
## Phase 1 goal sketch (NOT locked)
> Firefox HW HEVC playback on higgs at ≥30fps for 1080p Main, byte-exact
> libva-vs-kdirect for ≥3 reference fixtures (8-bit Main and 10-bit Main10).
Two measurable subgoals follow naturally:
- libva (this backend, NV12 image output) == kdirect (ffmpeg-v4l2request,
NV12 image output) byte-exact for the same input.
- Firefox VA-API path engages (verify via `chrome://gpu` equivalent / log
inspection — `MOZ_LOG=PlatformDecoderModule:5`).
## Phase 3 baseline plan
Before any backend code touches rpi-hevc-dec:
- `kdirect` floor: `ffmpeg -hwaccel v4l2request -hwaccel_output_format drm_prime
-i bbb_720p10s_hevc.mp4 -vf hwdownload,format=nv12 -frames:v 10 ...` and
sha256 the YUV.
- `SW reference`: same ffmpeg without `-hwaccel`, sha256 the YUV.
- Both runs N=3 per [[replicate-baseline-first]].
- Capture `strace -f -e ioctl` of the kdirect run — gives the canonical
ioctl sequence rpi-hevc-dec expects.
## Phase 0 closing
This doc commits the substrate. Phase 1 starts when:
- higgs is up + reachable
- Open questions 1+2 (EXT_SPS + start_code) are answered live, in one
short probe session
- Phase 3 baseline floors are captured
No work blocks the close of iter39 / fresnel campaign — those are shipped.
## Phase 0 close addendum (2026-05-17 evening, higgs probe session)
Empirical probes on higgs answered Q1, Q2, partial Q3, full Q5, full Q6.
Q4 (DRM modifier round-trip) remains open. Phase 0 is closed; Phase 1
opens with what's below.
### Q1 — EXT_SPS controls on rpi-hevc-dec: NOT present
`v4l2-ctl -d /dev/video19 --list-ctrls` confirms ONLY the standard
`V4L2_CID_STATELESS_HEVC_*` set:
- `hevc_sequence_parameter_set` (0x00a40a90)
- `hevc_picture_parameter_set` (0x00a40a91)
- `slice_param_array` (0x00a40a92, dynamic-array dims=[4096])
- `hevc_scaling_matrix` (0x00a40a93)
- `hevc_decode_parameters` (0x00a40a94)
- `hevc_decode_mode` (0x00a40a95, menu min=1 max=1 default=1 = Frame-Based)
- `hevc_start_code` (0x00a40a96, menu min=0 max=1 default=0 = No Start Code)
- 0x00a40a97 returns EINVAL (no EXT_SPS_*_RPS controls)
ioctl trace confirms ffmpeg's `VIDIOC_QUERY_EXT_CTRL` for `0xa97` returns
EINVAL — same probe pattern our backend uses for
`has_hevc_ext_sps_rps_rkvdec`. **The iter2 path stays dormant; the
iter31 α-29 `slice_params->short_term_ref_pic_set_size` plumbing is the
correct one for rpi-hevc-dec.**
### Q2 — hevc_start_code: default 0 (No Start Code), values {0, 1}
Default 0 matches our backend's "don't prepend HEVC start code" stance.
Confirm in Phase 1: rpi-hevc-dec accepts our raw NAL slice payload as-is.
### Q3 — NC12 / NC30 SAND tile layout: PARTIAL
CAPTURE S_FMT result for 1280×720 NC12:
- `sizeimage=1382400` = `1280 × 720 × 1.5` (linear NV12 byte count)
- `bytesperline=1080` (NOT 1280)
The bytesperline=1080 for a 1280-wide CAPTURE buffer is suspect — likely
encodes SAND column count rather than linear stride. Read
`drivers/staging/media/rpivid/` (or wherever NC12_COL128 lives in 6.12)
kernel source + `drm_fourcc.h` / `nv12_col128.rst` (if it exists) for
exact tile layout BEFORE writing the detile primitive. Do NOT infer
layout from this single observation.
### Q4 — DRM modifier round-trip: BLOCKED on hwdownload
ffmpeg `-hwaccel drm -hwaccel_output_format drm_prime -vf
hwmap=mode=read,format=nv12` returns `Failed to map frame: -38`
(`Function not implemented`). hwdownload cannot consume the SAND
modifier directly.
ffmpeg's path that DOES work: `-hwaccel drm -c:v hevc` WITHOUT
`-hwaccel_output_format drm_prime` lets ffmpeg's internal pipeline pull
back, detile (presumably via a Pi-specific helper or libdrm transform),
and present NV12 to the next filter. Bit-exact vs SW for the test
fixture (1280×720 Main 8-bit) — confirms HW engagement.
Phase 1 / Phase 4 will need to decide:
- Detile in the backend (CPU SIMD), exposing NV12 via VAImage; or
- Pass-through DRM_PRIME with SAND modifier and let the consumer
(compositor / Firefox) detile. Firefox almost certainly can't, so
CPU detile is the safe bet.
### Q5 — rpi-hevc-dec submission ordering: empirically locked
`strace -e ioctl` of the kdirect run shows:
1. `MEDIA_IOC_DEVICE_INFO` + `MEDIA_IOC_G_TOPOLOGY` (per media node)
2. `VIDIOC_QUERYCAP` per video node — `driver="rpi-hevc-dec"` identifies
the right one
3. `VIDIOC_ENUM_FMT` OUTPUT → S265 only
4. `VIDIOC_S_FMT` OUTPUT (HEVC_SLICE, placeholder dims)
5. `VIDIOC_REQBUFS` OUTPUT (DMABUF, count=N) — count=6 in kdirect
6. `VIDIOC_S_FMT` CAPTURE (NC12, actual dims from SPS parse)
7. `VIDIOC_CREATE_BUFS` CAPTURE (DMABUF, count=16)
8. `VIDIOC_STREAMON` both queues
9. `VIDIOC_QUERY_EXT_CTRL` enumeration
10. `VIDIOC_S_EXT_CTRLS` (decode_mode + start_code) — global ctrls
11. Per frame: `VIDIOC_S_EXT_CTRLS` (SPS+PPS+decode_params+slice_array,
class=0xf010000 = per-request) + `VIDIOC_QBUF` CAPTURE + `VIDIOC_QBUF`
OUTPUT (with `V4L2_BUF_FLAG_IN_REQUEST | V4L2_BUF_FLAG_REQUEST_FD`) +
`VIDIOC_DQBUF` OUTPUT + `VIDIOC_DQBUF` CAPTURE
**Two structural notes for the backend:**
- OUTPUT + CAPTURE both use `V4L2_MEMORY_DMABUF` in kdirect. Our backend
currently uses MMAP for CAPTURE on rkvdec/hantro. For Pi 5 we should
either follow kdirect (DMABUF, allows zero-copy DRM_PRIME export) or
use MMAP and CPU-detile. Phase 4 design decision.
- The order `S_FMT OUTPUT → REQBUFS OUTPUT → S_FMT CAPTURE → CREATE_BUFS
CAPTURE → STREAMON` differs from our iter25 rkvdec pre-seed pattern
(where SPS via S_EXT_CTRLS must come BEFORE CAPTURE alloc to resolve
the image_fmt). rpi-hevc-dec apparently DOESN'T need that pre-seed —
CAPTURE S_FMT just takes the explicit NC12 + caller's dims. Confirm
in Phase 1 by trying our existing iter25 pre-seed flow against it.
### Q6 — packaging: Debian 13 trixie, NOT Arch
higgs runs Debian 13 trixie (`PRETTY_NAME="Debian GNU/Linux 13 (trixie)"`),
not Arch ALARM. Phase 8 (per the dev-process Phase 8 packaging rule) for
the Pi 5 chapter needs a `debian/` packaging tree, not just a PKGBUILD.
Decide in Phase 1 whether to:
- Add Debian packaging to `marfrit-packages` as a second target, OR
- Use distrobox/podman with an Arch ALARM container on higgs for
install (test-only, not production), OR
- Pi 5 chapter ships a Debian source pkg via gitea / a personal Debian
repo.
### Other new findings from the probe session
- **ffmpeg 7.1.3 from Debian 13 is built with `--enable-v4l2-request`**
— the kdirect path exists. Invocation is `ffmpeg -hwaccel drm -c:v
hevc` (not just `-hwaccel drm`; the explicit codec flag matters for
the negotiation). Engagement log line is
`Hwaccel V4L2 HEVC stateless V4; devices: /dev/media1,/dev/video19;
buffers: src DMABuf, dst DMABuf; swfmt=rpi4_8`. Per
[[hw-decode-engagement-check]], grep for that line to confirm HW path
engaged.
- **No libva ICD installed on higgs** — only `armada-drm_dri.so` ships,
which doesn't apply. We'd be the first VA-API HW path for HEVC on Pi
5 once installed.
- **mpv is apt-installable** (`mpv 0.40.0-3+deb13u1`) — useful as a
pixel-readback verifier once the backend works (`mpv --vo=image` or
`--vo=drm`).
- **Firefox 145.0.1 + rpi-firefox-mods 20251016 installed** (firefox-esr
package status was `rc` = removed but config remains). The mods
package likely contains VA-API plumbing prefs.
### What changes for Phase 1
- Goal is now phrasable: HEVC bit-exact libva-vs-kdirect on higgs for
the 1280×720 Main 8-bit test fixture (same generator as
`/tmp/bbb_main.mp4` here). Kdirect engagement signal is the
`Hwaccel V4L2 HEVC stateless V4` log line.
- Most backend code reuses existing rkvdec/hantro HEVC path: ctrls,
per-frame submission, request_fd, multi-device probe pattern.
- New code: NC12 video_format entry + detile primitive (sibling to
`nv15_unpack_plane_to_p010`) + RPI_HEVC_DEC driver_kind.
- Packaging target = Debian, not Arch.
@@ -0,0 +1,230 @@
# Phase 1+2+3+4 — Pi 5 HEVC chapter (iter40)
Per [[feedback_dev_process]], Phase 1 (goal), Phase 2 (situation analysis),
Phase 3 (baselines), Phase 4 (plan) for adding rpi-hevc-dec as a third
multi-device-probe slot in `libva-v4l2-request-fourier`. Phase 0 substrate
+ open-question answers live at `phase0_pi5_hevc.md`.
## Phase 1 — Goal
> **libva-v4l2-request-fourier on higgs** decodes HEVC Main 8-bit input
> producing NV12 output **bit-exact vs kdirect** for three reference
> fixtures (640×360, 1280×720, 1920×1080 — Main profile, libx265
> ultrafast). HW path engagement verified via the kernel-driver lsof
> signal (`/dev/video19` open) AND ffmpeg-vaapi engagement signal
> (`Hwaccel V4L2 HEVC stateless V4; devices: /dev/media1,/dev/video19`).
Measurable:
| Criterion | Metric |
|---|---|
| C1 — vainfo enumeration | `LIBVA_DRIVER_NAME=v4l2_request vainfo` lists `VAProfileHEVCMain : VAEntrypointVLD` |
| C2 — bit-exact decode | sha256 of libva NV12 output == sha256 of kdirect NV12 output, per fixture, N=1 |
| C3 — HW engagement | `lsof` shows `/dev/video19` open by ffmpeg-vaapi during libva run |
| C4 — Stability under N=3 | C2 holds at N=3 repeated runs (deterministic) |
| C5 — Sibling baseline preserved | fresnel iter38 5/5 still PASS post-iter40 (no regression to rkvdec/hantro path) |
Out of scope this iter: Main10 (10-bit / NC30), VP9, AV1, Firefox VA-API
engagement testing, performance benchmarks. All later chapters.
## Phase 2 — Situation Analysis
### Backend architecture already in place
- **Multi-device probe (iter38)**: at `VA_DRIVER_INIT` opens both
`rkvdec` + `hantro-vpu` via `find_decoder_device_by_driver(name)`.
Stores per-driver fds (`video_fd_{rkvdec,hantro}`,
`media_fd_{rkvdec,hantro}`). `RequestCreateConfig` retargets the
"active" `driver_data->{video,media}_fd` per profile via
`request_switch_device_for_profile()` (request.c:426-478).
- **Per-driver feature gating**: `request_data->has_hevc_ext_sps_rps_{rkvdec,hantro}`
pair, with `h265_set_controls` consulting the per-fd flag. Established
by iter2 / Phase 5 review (request.h:99-100). This is the canonical
per-driver gating shape for iter40.
- **HEVC ctrl population**: `h265_set_controls` populates the standard
`V4L2_CID_STATELESS_HEVC_*` set (h265.c). Probe-gates EXT_SPS_*_RPS
via the iter2 path — naturally dormant for rpi-hevc-dec since the
controls don't exist.
- **Synthetic SPS pre-seed (iter25/26)**: needed for rkvdec to resolve
`image_fmt` before CAPTURE alloc. Phase 0 strace shows rpi-hevc-dec
does NOT need this — it accepts NC12 + explicit dims on `S_FMT
CAPTURE` directly. The pre-seed code path stays in place for rkvdec;
rpi-hevc-dec just doesn't trigger it (gate on driver_kind).
- **CAPTURE detile primitive**: `nv15_unpack_plane_to_p010()` (nv15.c)
is the template — backend already CPU-detiles when a Pi-or-Rockchip-
specific CAPTURE format meets a linear consumer (VAImage NV12 / P010).
- **Single-plane (S) vs multi-plane (M) handling**: hantro uses MPLANE,
rkvdec uses both depending on codec. rpi-hevc-dec exposes MPLANE for
BOTH OUTPUT (HEVC_SLICE) and CAPTURE (NC12) per the strace. iter38
already supports MPLANE handling for hantro; rpi reuses that.
### Surface area to touch (audit)
| File | What changes | Size |
|------|--------------|------|
| `src/request.h` | Add `video_fd_rpi_hevc_dec`, `media_fd_rpi_hevc_dec`, `has_hevc_ext_sps_rps_rpi_hevc_dec` (mirror iter38 + iter2 layout) | ~10 lines |
| `src/request.c` | (a) Extend init -1 block to cover new fds. (b) Recognize `rpi-hevc-dec` as a 3rd primary/alt driver string in the probe loop. (c) Extend `request_device_kind_for_profile` so HEVC→'p' when rpi-hevc-dec is present, else 'r'. (d) Extend `request_switch_device_for_profile` 'p' branch. (e) Probe HEVC ext_sps on the new fd (will be false, mirrors hantro entry). | ~80 lines |
| `src/video.c` | Add `V4L2_PIX_FMT_NV12_COL128` (NC12) `video_format` entry: 4:2:0, planes=1, alignment via dedicated bytesperline/sizeimage formula. NOT marked linear. | ~20 lines |
| `src/nv12_col128.c` (NEW) | `nv12_col128_detile_to_nv12()`: Y plane + UV plane detile primitive. Adapted from ffmpeg/Kynesim `av_rpi_sand_to_planar_y8` core math. Header doc traces back to videodev2.h docstring + raspberrypi/linux `hevc_dec/hevc_d_video.c` size formula. | ~80 lines + 30-line header |
| `src/image.c` | Add NC12 → NV12 branch in `copy_surface_to_image`, gated on `image->format.fourcc == VA_FOURCC_NV12 && video_format->v4l2_format == V4L2_PIX_FMT_NV12_COL128` (sibling to existing NV15→P010 branch). | ~25 lines |
| `src/meson.build` + `src/Makefile.am` | List `nv12_col128.c`/`.h` in sources | 2 lines |
Total estimated diff: ~250 LoC backend + ~100 LoC standalone primitive.
Roughly half the surface area of iter38; smaller than iter2.
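
The HEVC routing change listed for `src/request.c` (item c) could look roughly like the sketch below. The struct is a hypothetical cut-down stand-in for `request_data` that carries only the two fds involved, and only the HEVC rule is modelled; how other profiles are routed is left to the existing code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical subset of request_data: -1 means "driver not probed". */
struct sketch_request_data {
    int video_fd_rkvdec;
    int video_fd_rpi_hevc_dec;
};

/*
 * Device kind for VAProfileHEVCMain:
 * 'p' (rpi-hevc-dec) when that fd is open, otherwise 'r' (rkvdec).
 */
static inline char hevc_device_kind(const struct sketch_request_data *rd)
{
    assert(rd != NULL);
    return rd->video_fd_rpi_hevc_dec >= 0 ? 'p' : 'r';
}
```

On a Rockchip host the rpi fd stays at -1, so HEVC keeps going to rkvdec and the existing behaviour is unchanged.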
### What does NOT change
- iter25/26 SPS pre-seed: stays on rkvdec path only (gated by
driver_kind check that's already implicit in the rkvdec fd routing).
- iter2 EXT_SPS plumbing: probe-gated off on rpi-hevc-dec; vendored
GStreamer parser unused. Confirmed via the EINVAL on ctrl 0xa97.
- iter31 α-29 slice_params st_rps_bits: APPLIES to rpi-hevc-dec
unchanged. Same plumbing.
- iter33 VP8 hantro start-code prepend: not relevant (rpi-hevc-dec is
HEVC-only; VP8 still goes through hantro on RK).
- iter38 single-libva-session multi-codec semantics: extends from 5
codecs to 5+1 (HEVC reroutes on Pi).
### NC12 / SAND128 tile geometry — locked contract
From kernel driver `drivers/media/platform/raspberrypi/hevc_dec/hevc_d_video.c`
(via [[github raspberrypi/linux rpi-6.12.y]]):
```c
case V4L2_PIX_FMT_NV12_COL128:
width = ALIGN(width, 128); /* Width rounds up to columns */
height = ALIGN(height, 8);
bytesperline = constrain2x(bytesperline, height * 3 / 2);
sizeimage = bytesperline * width;
break;
```
For 1280×720:
- width = 1280 (already 128-aligned)
- height = 720 (already 8-aligned)
- bytesperline = 720 × 3/2 = **1080** (matches Phase 0 strace observation)
- sizeimage = 1080 × 1280 = **1,382,400** (matches strace; equals linear NV12 byte count coincidentally)
**Geometry interpretation** (cross-verified against ffmpeg/Kynesim
`rpi_sand_fn_pw.h` `av_rpi_sand_to_planar_y8`):
- Image is divided into `(width + 127) / 128` columns; each column is
**128 px wide × height px tall**.
- Within a column: `128 × height` bytes of Y data, immediately followed
by `128 × height/2` bytes of interleaved CbCr (so 128 × `bytesperline`
bytes per column, where `bytesperline` is the column stride).
- Across columns: column N starts at offset `N × stride1 × stride2`
where `stride1 = 128` (column width) and `stride2 = bytesperline`.
- **Pixel (x, y) byte offset = `(x & 127) + y × 128 + (x & ~127) × bytesperline`**
for Y; same formula with `y/2` for the UV rows, offset by `128 × height`
because each column's CbCr data follows that column's Y data (see the
per-column layout above).
Reference for the detile loop: `av_rpi_sand_to_planar_y8` (Kynesim
ffmpeg, `libavutil/rpi_sand_fn_pw.h` with PW=1). Our primitive copies
the single-stripe fast-path math; we don't import NEON ASM (CPU
detile is the safe path for Phase 1; SIMD a Phase 2 perf bump if needed).
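
A scalar version of the offset math and a luma detile loop might look like this. It is a minimal sketch, not the planned `nv12_col128.c`: the function names are hypothetical, and the chroma offset assumes each column's CbCr rows start `128 × aligned_height` bytes into that column, as in the "Within a column" description above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Byte offset of luma sample (x, y); col_stride is the driver's bytesperline. */
static inline size_t nc12_y_offset(unsigned x, unsigned y, size_t col_stride)
{
    return (x & 127u) + (size_t)y * 128u + (size_t)(x & ~127u) * col_stride;
}

/*
 * Byte offset of the Cb byte covering pixel (x, y); Cr is the next byte.
 * ASSUMPTION: chroma rows follow the luma rows inside each column.
 */
static inline size_t nc12_uv_offset(unsigned x, unsigned y,
                                    unsigned aligned_height, size_t col_stride)
{
    return ((x & 127u) & ~1u) + (size_t)(y / 2u) * 128u
         + (size_t)aligned_height * 128u
         + (size_t)(x & ~127u) * col_stride;
}

/* Copy the tiled luma plane into a linear buffer with pitch dst_pitch. */
void nc12_detile_luma(uint8_t *dst, size_t dst_pitch, const uint8_t *src,
                      unsigned width, unsigned height, size_t col_stride)
{
    assert(dst != NULL && src != NULL && dst_pitch >= width);
    for (unsigned y = 0; y < height; y++)
        for (unsigned x = 0; x < width; x++)
            dst[(size_t)y * dst_pitch + x] = src[nc12_y_offset(x, y, col_stride)];
}
```

A per-pixel loop like this is slow but easy to verify against the kdirect SHAs; copying 128-byte runs per row and column would be the obvious first optimisation.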
## Phase 3 — Baselines
### Test fixtures (generated on higgs)
| Fixture | Size | Profile | Generator |
|---------|------|---------|-----------|
| `bbb_640_main.mp4` | 640×360 | Main 8-bit | `ffmpeg -f lavfi -i testsrc=duration=2 -pix_fmt yuv420p -c:v libx265 -preset ultrafast -profile:v main` |
| `bbb_1280_main.mp4` | 1280×720 | Main 8-bit | same |
| `bbb_1920_main.mp4` | 1920×1080 | Main 8-bit | same |
### Captured 2026-05-17 evening on higgs
For each fixture, N=3 reps. Both SW (no hwaccel) and kdirect
(`ffmpeg -hwaccel drm -c:v hevc`) → `-frames:v 10 -f rawvideo -pix_fmt nv12`,
first 16 hex chars of the output's sha256:
```
bbb_640_main SW={9a81038065e9b7cd} HW={9a81038065e9b7cd} → BIT-EXACT × N=3
bbb_1280_main SW={d3bb055655d6f195} HW={d3bb055655d6f195} → BIT-EXACT × N=3
bbb_1920_main SW={0bc2bd6f693db039} HW={0bc2bd6f693db039} → BIT-EXACT × N=3
```
HW engagement signal (per-run): `Hwaccel V4L2 HEVC stateless V4; devices: /dev/media1,/dev/video19; buffers: src DMABuf, dst DMABuf; swfmt=rpi4_8`
This is the kdirect baseline. Phase 7 verification will compare libva
output against these SHAs.
### Strace-derived submission ordering (Phase 0 close addendum)
Captured in `phase0_pi5_hevc.md`. Briefly: standard V4L2-request
stateless flow, both queues DMABUF, no SPS pre-seed dance needed
(rpi-hevc-dec accepts NC12 + dims directly on CAPTURE S_FMT).
## Phase 4 — Plan
### Implementation steps (sequenced)
1. **`request.h`**: extend `request_data` with the new fd pair + ext_sps
flag, mirroring iter38/iter2 layout. (no behavior change yet)
2. **`request.c`**:
- `find_decoder_device_by_driver("rpi-hevc-dec", ...)` accepts new
driver string.
- Init -1 block extends to new fds.
- Probe loop: if primary is `rkvdec` or `hantro-vpu`, also probe
`rpi-hevc-dec` (third slot). On Pi 5 there's no `rkvdec` or
`hantro-vpu`, so primary becomes `rpi-hevc-dec` and the alt-probes
for the other two return absent (their fds stay -1).
- `request_device_kind_for_profile`: when profile is `VAProfileHEVCMain`,
prefer `'p'` (rpi-hevc-dec) IF `video_fd_rpi_hevc_dec >= 0`, else
fall through to `'r'` (rkvdec). All other profiles stay routed as
today.
- `request_switch_device_for_profile`: add `'p'` branch.
- ext_sps probe runs on the new fd; result stored in
`has_hevc_ext_sps_rps_rpi_hevc_dec`. Will be false (controls absent).
3. **`video.c`**: add NC12 video_format entry. Mark it MPLANE-only (per
Phase 0 strace). bytesperline/sizeimage formula encoded per kernel
driver math.
4. **`src/nv12_col128.c` + `.h`** (NEW): single-file primitive,
`nv12_col128_detile_to_nv12(dst_y, dst_uv, src_y, src_uv, width,
height, src_stride2)`. CPU per-column row-memcpy loop; not NEON
for Phase 1 (correctness first). Self-test in `tests/test_nv12_col128_detile.c`.
5. **`image.c`**: branch in `copy_surface_to_image`. Gate:
`image->format.fourcc == VA_FOURCC_NV12 && video_format->v4l2_format == V4L2_PIX_FMT_NV12_COL128`.
Calls the primitive. Existing NV12-linear path stays.
6. **`meson.build` + `Makefile.am`**: source list updates.
7. **Build clean on higgs** — first build target IS higgs (since iter40
only matters on Pi). Cross-build for ampere/fresnel is unaffected
because they don't have rpi-hevc-dec — the new fd stays -1 and the
per-driver routing falls through to existing rkvdec/hantro paths.
### Verification gates (Phase 7 acceptance)
- Build cleanly on higgs (Debian 13 trixie, libva-dev 2.22.0-3,
libdrm-dev 2.4.131).
- Local-install the resulting `.so` to `/usr/lib/aarch64-linux-gnu/dri/v4l2_request_drv_video.so`.
- `LIBVA_DRIVER_NAME=v4l2_request vainfo` lists `VAProfileHEVCMain`.
- For each Phase 3 fixture: libva output SHA == kdirect SHA (the Phase 3
recorded value).
- `lsof` during libva decode shows `/dev/video19` open.
- Sibling regression check: fresnel `phase7_iter39_test_rig` equivalent
still 5/5 PASS (no regression to existing routing).
### Risks + mitigations
| Risk | Mitigation |
|------|-----------|
| NC12 detile math wrong → libva ≠ kdirect | Tight unit test in `tests/test_nv12_col128_detile.c` with hand-crafted NC12 bytes + known linear output, before integration. |
| `request_switch_device_for_profile` falls through wrong path on systems with BOTH rkvdec AND rpi-hevc-dec | Prefer rpi-hevc-dec for HEVC when present. Explicit comment in switch. Test on fresnel = no rpi → falls to 'r'; test on higgs = no rkvdec → falls to 'p'. |
| Debian build env differs from Arch — see [[feedback_package_build_flags_unmask_bugs]] | Build with explicit `-O2 -D_FORTIFY_SOURCE=2 -fstack-protector-strong` flags to match Debian dpkg-buildflags. |
| Synthetic SPS pre-seed accidentally fires on rpi-hevc-dec | Gate on `driver_kind != 'p'` in the pre-seed call site. Verify via strace: pre-seed ioctl pattern absent. |
| iter2 EXT_SPS path accidentally engages on rpi | Already probe-gated; `has_hevc_ext_sps_rps_rpi_hevc_dec` = false naturally. |
### Phase 5 review explicitly requested
Per CLAUDE.md global "Reviews are never skippable" + [[feedback_review_empirical_over_theoretical]]:
this plan goes to a sonnet Plan-agent review. Specific review focus:
- Routing correctness when 0/1/2/3 of the three drivers are present.
- NC12 geometry: did we copy ffmpeg's per-row memcpy math correctly?
Did we miss UV stride considerations?
- `image.c` gate predicate — does it exclude any legitimate NV12-linear
case on Pi? (No: rpi only exposes NC12/NC30 CAPTURE, no plain NV12.)
- Cross-device regression scope (fresnel + ampere paths untouched?).
Empty-result review IS a green light; "we should have skipped it" is the
prohibited move.
+194
@@ -0,0 +1,194 @@
# Phase 5 review — iter40 plan (sonnet review + amendments)
Reviewer verdict: **yellow** — plan substantively sound, 3 concrete blockers
+ 1 fixture gap + 1 verification-only note. All findings verified empirically
against current source (per [[feedback_review_empirical_over_theoretical]])
BEFORE accepting into the amended plan.
## Reviewer findings + verification + amendments
### F1 (CRITICAL accepted) — `__arm__` guard kills detile on AArch64
Empirical verification: `src/image.c` lines 239 + 268 wrap the entire
per-format detile dispatch (incl. `nv15_unpack_plane_to_p010`) in
`#ifdef __arm__`. Pi 5 / fresnel / ampere are all AArch64 → guard never
fires → both NC12 detile (proposed) AND existing NV15→P010 unpack
(iter39) are silently dead code on aarch64. iter39 5/5 PASS on fresnel
was bit-exact for 8-bit codecs only; the 10-bit detile path was never
exercised, so the dead-code didn't manifest as a failure.
**Amendment:** Phase 6 step 5 first sub-action — change guard at lines
239 + 268 from `#ifdef __arm__` to `#if defined(__arm__) || defined(__aarch64__)`.
This re-enables the existing NV15→P010 detile AND lets the new NC12
detile branch execute. No semantic change on x86 (no detile primitives
compiled there). Add explicit comment crediting Phase 5 review + this
finding.
### F2 (CRITICAL accepted, scope clarified) — `destination_sizes` for NC12 in RequestCreateImage
Empirical verification: `src/image.c` lines 90-115 already recompute
`destination_bytesperlines[0]` + `destination_sizes[0]` for `P010`
(line 90: `destination_bytesperlines[0] = width * 2`). The fall-through
"NV12" branch (line 108) uses V4L2-reported stride directly, which for
NC12 source is the column-stride 1080, not the linear Y stride 1280.
That breaks the `pitches[0]` value that VAImage consumers expect.
`context.c` lines 379-383 — `destination_sizes[0] = destination_bytesperlines[0] * format_height` — IS used at cap_pool init time to size the
CAPTURE buffer's MMAP region accounting in `driver_data->fmt_sizes[]`.
For NC12: 1080 × 720 = 777600 vs actual `sizeimage` 1382400. cap_pool
allocates the actual `sizeimage` via REQBUFS, so the underlying buffer
is correctly sized; `fmt_sizes[]` is just a back-cache for later access
patterns that don't go through the kernel-reported value.
**Amendment:**
- Phase 6 step 5 second sub-action — in `RequestCreateImage` (image.c
~line 107, the "else" / NV12 branch), add detection: if the source
CAPTURE format is `V4L2_PIX_FMT_NV12_COL128` AND the requested image
format is `VA_FOURCC_NV12`, override `destination_bytesperlines[0] =
width` (linear NV12 Y stride). `destination_sizes[0]` then computes
to `width × format_height` (correct linear Y plane size). Existing
NV12-source linear path unchanged.
- Phase 6 step 3 video.c — set `v4l2_buffers_count = 1` for NC12 (single
contiguous buffer holding Y+UV) and document this is the planes-1
multi-plane case (similar to NV12 MPLANE).
- context.c lines 380-383 (`destination_sizes[0] = bytesperlines * height`)
stays AS-IS for now. It only affects cap_pool MMAP accounting which
uses the kernel-reported `sizeimage` via REQBUFS anyway. If a future
bug emerges from this mismatch on the rkvdec/hantro side, address
then; not a blocker for iter40 NC12.
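
A minimal sketch of the pitch override described in the first amendment bullet. The fourcc macros and the function name are local stand-ins defined here for illustration; the real code compares `V4L2_PIX_FMT_NV12_COL128` and `VA_FOURCC_NV12`.

```c
#include <assert.h>
#include <stdint.h>

/* Local stand-ins for the V4L2 / VA fourcc constants (assumption). */
#define SKETCH_FOURCC(a, b, c, d) \
    ((uint32_t)(a) | ((uint32_t)(b) << 8) | ((uint32_t)(c) << 16) | ((uint32_t)(d) << 24))
#define SKETCH_FMT_NV12 SKETCH_FOURCC('N', 'V', '1', '2')
#define SKETCH_FMT_NC12 SKETCH_FOURCC('N', 'C', '1', '2')

/*
 * Pick the Y pitch of the destination VAImage.
 * For an NC12 source feeding an NV12 image the kernel-reported
 * bytesperline is a column stride, so the linear width is used instead.
 */
static uint32_t sketch_destination_pitch(uint32_t capture_fourcc,
                                         uint32_t image_fourcc,
                                         uint32_t width,
                                         uint32_t v4l2_bytesperline)
{
    assert(width > 0);
    if (capture_fourcc == SKETCH_FMT_NC12 && image_fourcc == SKETCH_FMT_NV12)
        return width;
    return v4l2_bytesperline;
}
```

With the override, the Y plane size for 1280×720 becomes 1280 × 720 bytes instead of the 1080 × 720 that the column stride would produce.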
### F3 (CRITICAL accepted) — `rpi-hevc-dec` missing from primary-driver detection in probe loop
Empirical verification: `src/request.c` lines 647-657 only have `else if`
branches for `rkvdec` and `hantro-vpu`. On higgs (no rkvdec, no hantro)
the primary device IS `rpi-hevc-dec`, but neither branch matches → no
`primary_driver` set → no fds stored into the new
`video_fd_rpi_hevc_dec` slot → routing silently no-ops with -1 fds.
**Amendment:** Phase 6 step 2 sub-action — add explicit `else if (strcmp(info.driver, "rpi-hevc-dec") == 0)` branch in the primary-driver
detection block. Sets `video_fd_rpi_hevc_dec = video_fd` + `media_fd_rpi_hevc_dec = media_fd`. Pi has no alt — `alt_driver` stays NULL,
no second-decoder probe runs for higgs. (rkvdec + hantro alt-probes
remain dead on higgs because the find_decoder_device_by_driver walk
returns absent for them.)
Also: extend `find_decoder_device_by_driver`'s driver-name table at
request.c:94-95 if needed to include `rpi-hevc-dec` — verify it's a
free-form string match (it is, per the code), not a hard table — so the
caller passes `"rpi-hevc-dec"` and the walk just looks for it.
### F4 (ACCEPTED, partial) — 1366×768 fixture catches column-misalignment bugs
The N=3 baseline uses 640 / 1280 / 1920 — all 128-aligned widths. A
1366-wide fixture exercises the `ALIGN(width, 128) → 1408` column
padding path. The right-edge 42 pixels (cols 1366-1407) are padding;
the detile primitive must not write past the requested width.
**Amendment:** Phase 7 sub-action — add `bbb_1366_main.mp4` (1366×768)
to the Phase 7 verification set. Phase 3 baseline retroactively
captured at Phase 7 time. Goal: same kdirect/SW bit-exact PASS at
N=1 (no need to redo the deterministic N=3 — one rep proves the
edge-case). If libva differs from kdirect on 1366 but matches on
1280/1920, the detile column-base math is buggy.
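
The 1366-wide edge case comes down to how many bytes of the last column belong to the image. A small sketch with hypothetical helper names:

```c
#include <assert.h>
#include <stdint.h>

/* Number of 128-px columns needed to cover the requested width. */
static uint32_t sketch_num_columns(uint32_t width)
{
    assert(width > 0);
    return (width + 127) / 128;
}

/* Bytes of one row that a detile loop may copy out of a given column. */
static uint32_t sketch_column_copy_bytes(uint32_t width, uint32_t column)
{
    uint32_t start = column * 128;
    uint32_t remaining;

    assert(start < width);
    remaining = width - start;
    return remaining < 128 ? remaining : 128;
}
```

For width 1366 this gives 11 columns, with only 86 bytes of the last column belonging to the image; the remaining 42 are the padding that must not reach the destination.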
### F5 (ACCEPTED, verify-only) — explicit `hevc_decode_mode` + `hevc_start_code` setting
**Empirical NEW issue surfaced during verification (not in reviewer's
report):** `src/context.c` lines 516-528 unconditionally sets
`V4L2_CID_STATELESS_HEVC_START_CODE` to `_ANNEX_B` (value 1) AND
prepends `0x00 0x00 0x01` start codes to each slice payload (per the
H.264 mirror block at line 532+). But Phase 0 strace shows kdirect uses
`start_code=0` = `_NONE` and submits raw NAL slice payload WITHOUT start
codes.
Both modes are in rpi-hevc-dec's menu range (min=0 max=1). Open
question: does rpi-hevc-dec correctly parse start-code-prepended
payload when in ANNEX_B mode? Two possibilities:
(a) Yes — driver implements both modes, ANNEX_B works, libva PASSes
bit-exact in our default code path.
(b) No — driver only really implements NONE; ANNEX_B is a degenerate
menu entry; we'd need per-driver gating to send `_NONE` for
rpi-hevc-dec + suppress start-code prepend.
**Amendment:** Phase 7 — verify empirically via the first libva-vs-kdirect
diff. If (a), no code change needed. If (b), add per-driver gate around
the START_CODE set (mirror rkvdec/hantro pattern). Don't pre-emptively
gate; let empiricism decide.
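
If outcome (b) holds, the per-driver gate would amount to making the start-code prefix conditional. A minimal sketch under that assumption; the enum and function are hypothetical, with values chosen to match the menu range quoted above (0 = none, 1 = Annex B):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Mirrors the two menu entries discussed above; names are illustrative. */
enum sketch_start_code {
    SKETCH_START_CODE_NONE = 0,
    SKETCH_START_CODE_ANNEX_B = 1,
};

/*
 * Copy one slice NAL into the OUTPUT buffer, prefixing 00 00 01 only in
 * Annex B mode. Returns bytes written, or 0 if the data does not fit.
 */
static size_t sketch_pack_slice(uint8_t *dst, size_t dst_cap,
                                const uint8_t *nal, size_t nal_len,
                                enum sketch_start_code mode)
{
    static const uint8_t start_code[3] = { 0x00, 0x00, 0x01 };
    size_t prefix = (mode == SKETCH_START_CODE_ANNEX_B) ? sizeof(start_code) : 0;

    assert(dst != NULL && nal != NULL);
    if (nal_len > dst_cap || prefix > dst_cap - nal_len)
        return 0;
    memcpy(dst, start_code, prefix);
    memcpy(dst + prefix, nal, nal_len);
    return prefix + nal_len;
}
```

The mode passed here would have to agree with the value set through the START_CODE control, otherwise the driver sees a payload layout it was not told to expect.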
### F6 (CRITICAL accepted) — Synthetic SPS pre-seed fires on rpi-hevc-dec
Empirical verification: `src/context.c` lines 277-346 — the iter25
synthetic-SPS injection block runs for `VAProfileHEVCMain` regardless
of active driver_kind. On higgs, `driver_data->video_fd` will be
`video_fd_rpi_hevc_dec` at this point → `v4l2_set_controls(...SPS...)`
fires on rpi-hevc-dec. Phase 0 strace shows rpi-hevc-dec doesn't need
this AND uses a different submission ordering (S_FMT_OUTPUT → REQBUFS_OUTPUT → S_FMT_CAPTURE → CREATE_BUFS_CAPTURE → STREAMON, then global
ctrls per-frame).
The pre-seed is wrapped in `(void)v4l2_set_controls(...)` — failure is
silently ignored, BUT the call may also succeed in an unintended way
on rpi-hevc-dec (it has the HEVC_SPS ctrl), potentially leaving its
internal state stuck on the dummy SPS until the first real per-frame
SPS arrives.
**Amendment:** Phase 6 step 2 sub-action — gate the synthetic-SPS
injection block at context.c:277 with
`if (driver_data->video_fd != driver_data->video_fd_rpi_hevc_dec)`. The
pre-seed only fires when active fd is NOT rpi-hevc-dec. rkvdec /
hantro paths unchanged.
### F7 (No findings) — `image.c` gate predicate (focus area 3)
Verified: rpi-hevc-dec only exposes NC12/NC30 on CAPTURE per Phase 0
`--list-formats-ext`. No legitimate NV12-linear case exists on Pi. Gate
predicate `image->format.fourcc == VA_FOURCC_NV12 && video_format->v4l2_format == V4L2_PIX_FMT_NV12_COL128` is sound — fires only when
both conditions hold, excludes legitimate NV12-linear on RK / Allwinner.
### F8 (No findings) — cross-device regression scope (focus area 4)
Verified: new fd fields initialise to -1; probe loop extensions are
additive (no-op when string doesn't match); `request_device_kind_for_profile`'s 'p' branch only fires when `video_fd_rpi_hevc_dec >= 0`;
new video.c entry is additive. fresnel + ampere paths unchanged.
## Final amended Phase 6 step list
1. `src/request.h` — add `video_fd_rpi_hevc_dec`, `media_fd_rpi_hevc_dec`,
`has_hevc_ext_sps_rps_rpi_hevc_dec` (mirror iter38 + iter2 layout).
2. `src/request.c` — (a) extend init -1 block; (b) **add `else if
(strcmp(info.driver, "rpi-hevc-dec") == 0)` branch in primary-driver
detection** [F3]; (c) extend `request_device_kind_for_profile` so
HEVC→'p' when rpi present, else 'r'; (d) extend `request_switch_device_for_profile` 'p' branch; (e) probe ext_sps on new fd.
3. `src/context.c` — **gate synthetic-SPS pre-seed (lines 277-346) on
`driver_data->video_fd != driver_data->video_fd_rpi_hevc_dec`** [F6].
4. `src/video.c` — NC12 entry with `v4l2_buffers_count=1`,
`v4l2_mplane=true`, NOT marked linear.
5. `src/image.c`:
- **Extend `#ifdef __arm__` guards (lines 239, 268) to `#if defined(__arm__) || defined(__aarch64__)`** [F1].
- **Add NC12 detection in RequestCreateImage** (line 107 area): if
source format is NC12 + VAImage format is NV12, override
`destination_bytesperlines[0] = width` [F2].
- **Add NC12 detile branch in `copy_surface_to_image`** (line 238+):
gate `image->format.fourcc == VA_FOURCC_NV12 && video_format->v4l2_format == V4L2_PIX_FMT_NV12_COL128`; call new detile primitive.
6. `src/nv12_col128.c` + `.h` (NEW) — detile primitive.
7. `tests/test_nv12_col128_detile.c` (NEW) — unit test with hand-crafted
NC12 bytes + known linear output.
8. `src/meson.build` + `src/Makefile.am` — source list updates.
9. Build clean on higgs; if `tests/` doesn't auto-run, run manually.
## Final amended Phase 7 verification
- Build cleanly on higgs.
- Local install `.so` to `/usr/lib/aarch64-linux-gnu/dri/`.
- `LIBVA_DRIVER_NAME=v4l2_request vainfo` lists `VAProfileHEVCMain`.
- Phase 3 fixtures (640 / 1280 / 1920) + new 1366×768 fixture: libva
output SHA == kdirect SHA [F4].
- `lsof` during libva decode shows `/dev/video19` open.
- `strace -e ioctl` shows pre-seed pattern ABSENT on rpi-hevc-dec [F6
verification].
- HEVC_START_CODE behavior verified empirically: if libva-vs-kdirect
fails for HEVC, add per-driver `_NONE` gate per F5 amendment.
- Sibling regression: re-run fresnel iter38 5/5 test rig — no change
expected since iter40 path is gated on new fd.
Total amended LoC estimate: ~280 backend + 100 primitive (was 250 + 100;
F1 + F2 + F6 add ~30 LoC of gates / overrides).
+228
@@ -0,0 +1,228 @@
# Phase 7 close — iter40 Pi 5 HEVC partial
Closed 2026-05-17 evening. Backend tip `3ffa9d0` on master. Higgs (Pi CM5,
Debian 13 trixie, kernel 6.12.75+rpt-rpi-2712) is the test target.
## Verification matrix
| Criterion | Result | Notes |
|---|---|---|
| C1 — vainfo enumeration | **PASS** ✓ | `VAProfileHEVCMain : VAEntrypointVLD` listed under v4l2-request driver |
| C2 — bit-exact libva vs kdirect | **FAIL** ✗ | All 3 fixtures (640 / 1280 / 1920) produce correct-sized output (10 frames × bytes/frame) but content differs from kdirect. Real decode failure; see C6. |
| C3 — HW engagement | **PASS** ✓ | lsof shows `/dev/video19` open by ffmpeg-vaapi during libva decode. `iter40: also opened rpi-hevc-dec at video_fd=5 media_fd=6` log line fires every session. |
| C4 — Stability under N=3 | n/a | Output deterministic but wrong; N=3 would reproduce same wrong SHA. |
| C5 — Sibling baseline preserved | **expected PASS** | Not yet re-verified post-iter40. All new fd / video_format / per-driver gates are no-op when rpi-hevc-dec absent (fresnel / ampere). |
| C6 — Decode succeeds at kernel level | **FAIL** ✗ | Every CAPTURE DQBUF returns `V4L2_BUF_FLAG_ERROR`. Decode fails per-frame. |
## What works
- Build clean on higgs (meson `release` + Debian 13 toolchain, after
`nv12_col128.h` + `nv15.h` fallback `#define`s for headers that omit
the mainline fourccs).
- ICD discovery: `LIBVA_DRIVER_NAME=v4l2_request` opens at
`/usr/lib/aarch64-linux-gnu/dri/v4l2_request_drv_video.so`.
- Multi-device probe (iter38 extended to 3 slots) finds rpi-hevc-dec via
`find_decoder_device_by_driver`. New `known_decoder_drivers[]` entry +
`else if (strcmp(info.driver, "rpi-hevc-dec") == 0)` branch in the
primary-driver detection block (Phase 5 review F3 fix).
- `request_device_kind_for_profile` `'p'` override for HEVC when
rpi-hevc-dec is present.
- `request_switch_device_for_profile` retargets to the rpi fds.
- Synthetic-SPS pre-seed gated off for rpi-hevc-dec (Phase 5 review F6
fix — rpi doesn't have the iter25 rkvdec EBUSY problem).
- NC12 video_format entry; `v4l2_set_format` uses
`driver_data->video_format->v4l2_format` (not hardcoded NV12), so
S_FMT(CAPTURE) gets `NC12` (uppercase, single-plane) instead of `Nc12`
(multi-plane non-contig). Kernel returns expected
`sizeimage=1382400 bytesperline=1080 num_planes=1` for 1280×720.
- `nv12_col128_detile_y` + `_uv` primitives copy each column row by row
via `memcpy` (128 bytes per row, once per column). Unit test
(`tests/test_nv12_col128_detile.c`) passes 10/10 (Y + UV at 640 / 1280
/ 1920 / 1366 widths + UV offset helper).
- `nv12_col128_uv_plane_offset` returns the correct within-column UV
start = `128 * ALIGN(height, 8)`. Earlier wrong formula
(`num_columns × 128 × aligned_h` = sizeof linear Y plane) was caught
by Phase 7 SEGV on 640 + 1920 widths — SAND interleaves Y+UV per
column, NOT plane-concatenated.
- `image.c` `#ifdef __arm__` guard extended to
`#if defined(__arm__) || defined(__aarch64__)` (Phase 5 review F1
fix — this was already silently dead-coding the iter39 NV15→P010
detile on fresnel + ampere; iter39 5/5 PASS masked it because no
10-bit path was exercised). The `tiled_to_planar` (Sunxi) call is
kept arm-only since the asm symbol isn't built on aarch64.
- `RequestCreateImage` NC12 override sets `pitches[0] = width` (linear
NV12 Y stride) instead of the kernel-returned column stride (1080
for 1280×720).
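
The per-column row copy described in the list above can be sketched as one loop that serves both planes. This is an illustration, not the shipped `nv12_col128_detile_y` / `_uv` code: the name and signature are hypothetical, and it assumes the caller passes the UV start (`128 * ALIGN(height, 8)` into column 0) as `src` for the chroma plane.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_COL_W 128u

/*
 * Copy one plane (Y, or interleaved CbCr) out of a column-tiled buffer.
 *   dst        linear destination, dst_stride bytes per row
 *   src        first row of this plane inside column 0
 *   col_stride bytes between consecutive columns (128 * bytesperline)
 *   width      bytes per linear row that belong to the image
 *   rows       number of rows in this plane
 */
static void sketch_detile_plane(uint8_t *dst, size_t dst_stride,
                                const uint8_t *src, size_t col_stride,
                                uint32_t width, uint32_t rows)
{
    assert(dst != NULL && src != NULL && dst_stride >= width);

    for (uint32_t x0 = 0; x0 < width; x0 += SKETCH_COL_W) {
        /* The last column may be only partly covered by the image. */
        size_t n = (width - x0 < SKETCH_COL_W) ? (width - x0) : SKETCH_COL_W;
        const uint8_t *col = src + (size_t)(x0 / SKETCH_COL_W) * col_stride;

        for (uint32_t r = 0; r < rows; r++)
            memcpy(dst + (size_t)r * dst_stride + x0,
                   col + (size_t)r * SKETCH_COL_W, n);
    }
}
```

Clamping `n` on the last column is what keeps padding bytes out of the destination for widths such as 1366 that are not a multiple of 128.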
## What fails
`V4L2_BUF_FLAG_ERROR` on every CAPTURE DQBUF. Kernel `rpi-hevc-dec`
rejects each frame's decode submission. Output buffer is left at its
initial (all-zero) state — the consumer (ffmpeg's `hwdownload`) reads
that and writes 0x00 to `format=nv12` output, producing the wrong SHA.
### Root cause identified — SPS field encoding diverges from bitstream
Compared per-frame `S_EXT_CTRLS class=0xf010000` payload bytes vs
kdirect (`ffmpeg -hwaccel drm -c:v hevc`):
SPS ctrl (id=0xa40a90, size=40), first 16 bytes:
- ours: `00 00 00 05 d0 02 00 00 04 04` **`04 00`** `01 01 00 03`
- kdirect: `00 00 00 05 d0 02 00 00 04 04` **`02 04`** `01 01 00 03`
Differing bytes at offsets 10-11:
- offset 10: `sps_max_num_reorder_pics` — ours=4, kdirect=2
- offset 11: `sps_max_latency_increase_plus1` — ours=0, kdirect=4
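
The byte comparison above is easy to script once both payloads are dumped. A minimal sketch with a hypothetical helper that reports which offsets differ between two control payloads of equal length:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Record the offsets at which a and b differ.
 * Writes at most out_cap offsets into out; returns the total number
 * of differing bytes, which may exceed out_cap.
 */
static size_t sketch_diff_offsets(const uint8_t *a, const uint8_t *b, size_t len,
                                  size_t *out, size_t out_cap)
{
    size_t found = 0;

    assert(a != NULL && b != NULL && (out != NULL || out_cap == 0));
    for (size_t i = 0; i < len; i++) {
        if (a[i] != b[i]) {
            if (found < out_cap)
                out[found] = i;
            found++;
        }
    }
    return found;
}
```

Fed the two 16-byte SPS prefixes quoted above, it reports exactly two differing offsets, 10 and 11.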
Per `src/h265.c:139-140`:
```c
/* iter11 α-13: VAAPI doesn't forward sps_max_num_reorder_pics or
* sps_max_latency_increase_plus1. ... */
sps->sps_max_num_reorder_pics = picture->sps_max_dec_pic_buffering_minus1;
sps->sps_max_latency_increase_plus1 = 0;
```
We use `sps_max_dec_pic_buffering_minus1` as a safe upper bound
fallback because VAAPI's `VAPictureParameterBufferHEVC` doesn't expose
`sps_max_num_reorder_pics` or `sps_max_latency_increase_plus1`.
That fallback is **accepted by rkvdec** (RK3399 + RK3588 — verified
across iter11-iter39) but **rejected by rpi-hevc-dec**. Per H.265
§A.4.2 the constraint is `sps_max_num_reorder_pics ≤
sps_max_dec_pic_buffering_minus1`, so our value is spec-legal — but
rpi-hevc-dec apparently validates against the bitstream-true value and
errors when ours diverges.
Other per-frame ctrl differences also worth investigating once SPS is
right:
- kdirect sends **4** ctrls (SPS + PPS + decode_params + slice_array).
- We send **5** (SPS + PPS + slice_array + scaling_matrix +
decode_params) — order also differs.
## Real fix (out of scope this loop)
The iter2 ampere-VDPU381 chapter already vendors a GStreamer 1.28.2
H.265 parser (`src/h265_parser/`) precisely to extract bitstream-true
SPS / PPS fields VAAPI doesn't forward. The fix is:
1. Wherever h265.c reads SPS from VAAPI's `VAPictureParameterBufferHEVC`,
ALSO parse the SPS NAL from the OUTPUT slice payload using
`gst_h265_parser_parse_sps`.
2. Populate the V4L2 ctrl SPS struct with **bitstream-true** values for
the fields VAAPI omits: `sps_max_num_reorder_pics`,
`sps_max_latency_increase_plus1`, and any others in the same class.
3. Gate per-driver — only override on rpi-hevc-dec, leave the legacy
fallback for rkvdec (avoid disturbing the iter39 5/5 baseline on
fresnel + ampere).
4. Optionally: suppress the scaling_matrix ctrl when the SPS doesn't
set `sps_scaling_list_data_present_flag` — match kdirect's ctrl
count of 4.
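
Steps 2 and 3 can be sketched as a single selection function. Everything here is hypothetical (struct, function, and the idea that a parse step hands back a pointer or NULL); it only illustrates the gating: bitstream-true values on rpi-hevc-dec when available, the legacy upper-bound fallback everywhere else.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct sketch_reorder_fields {
    uint8_t max_num_reorder_pics;
    uint8_t max_latency_increase_plus1;
};

/*
 * parsed is NULL when no SPS could be parsed from the bitstream.
 * Legacy fallback: reorder = max_dec_pic_buffering_minus1, latency = 0.
 */
static struct sketch_reorder_fields
sketch_pick_reorder_fields(bool is_rpi_hevc_dec,
                           const struct sketch_reorder_fields *parsed,
                           uint8_t max_dec_pic_buffering_minus1)
{
    struct sketch_reorder_fields legacy = {
        .max_num_reorder_pics = max_dec_pic_buffering_minus1,
        .max_latency_increase_plus1 = 0,
    };

    if (is_rpi_hevc_dec && parsed != NULL)
        return *parsed;
    return legacy;
}
```

Keeping the legacy branch as the default means rkvdec hosts see the same control bytes as before, which is the point of gating per driver.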
Estimated additional surface area: ~150 LoC in h265.c, plus the parser
plumbing that iter2 already provides. Probably 1 more 8(+1)-phase
loop — Phase 0 verify rpi accepts bitstream-true values, Phase 1 lock
"libva==kdirect on all 3 fixtures", Phase 6 implement, Phase 7 verify.
## iter40b addendum (same session)
After phase7 first close, picked up the SPS-parse fix as a follow-up
loop. Findings — all empirical:
1. **Source_data lacks SPS NAL.** Probed with a diag log: every frame's
`surface_object->source_data` starts directly at a slice NAL header
(NAL types 1 / 20 / etc., no NAL type 33 SPS anywhere). ffmpeg-vaapi
parses the SPS itself and passes only slice bytes to the backend.
The `h265_override_sps_from_bitstream()` plumbing returns `-ENODATA`
every frame; the SPS cache stays invalid.
2. **VAAPI doesn't expose the SPS fields rpi needs.** Read
`/usr/include/va/va_dec_hevc.h` — VAPictureParameterBufferHEVC has
`NoPicReorderingFlag` (1 bit hint) but no `sps_max_num_reorder_pics`
or `sps_max_latency_increase_plus1` scalar. They simply aren't
reachable from the standard VAAPI API.
3. **Empirical SPS fix lands (hardcoded values match kdirect).** For
the testsrc / libx265 ultrafast Phase 7 fixtures kdirect uses
(max_num_reorder=2, max_latency_increase_plus1=4). Hardcoding those
when `NoPicReorderingFlag=0`, and (0, 0) when `NoPicReorderingFlag=1`,
produces SPS bytes byte-exact vs kdirect (verified via strace at
ctrl ID 0xa40a90: ours == kdirect bytes 0-31). Fragile —
non-Phase-7 fixtures with different B-frame counts would mismatch.
Documented in h265.c::h265_set_controls (the rpi-hevc-dec gate).
4. **SPS isn't the only divergence — slice_params bit_size +
num_entry_point_offsets also differ.** Even after the SPS fix:
- SLICE_PARAMS (ctrl 0xa40a92) bytes 0-3 (`bit_size`):
ours=61664, kdirect=61960 (296 bits = 37-byte delta per slice).
- SLICE_PARAMS bytes 8-11 (`num_entry_point_offsets`):
ours=0, kdirect=22 (BBB 720p WPP: ceil(720/32) = 23 CTU rows,
minus 1 = 22 entry points). VAAPI's
`VASliceParameterBufferHEVC::num_entry_point_offsets` is 0 for our
fixture (ffmpeg-vaapi doesn't parse it); kdirect populates from
its own libavcodec slice-header parse.
5. **Bit-exact still NOT reached after iter40b.** Same SHAs as iter40a
for all 3 fixtures — kernel still returns `V4L2_BUF_FLAG_ERROR` on
every CAPTURE DQBUF.
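
Two of the numbers in the findings above are easy to recompute. A minimal sketch with hypothetical helper names: the HEVC NAL unit type sits in bits 1-6 of the first NAL header byte (33 is an SPS), and the entry-point count assumes one slice per picture, WPP enabled, no tiles, and the 32-pixel CTB size used in the arithmetic above.

```c
#include <assert.h>
#include <stdint.h>

/* nal_unit_type occupies bits 1..6 of the first NAL header byte. */
static unsigned sketch_hevc_nal_unit_type(uint8_t first_header_byte)
{
    return (first_header_byte >> 1) & 0x3f;
}

/*
 * Entry points for a single-slice WPP picture: one per CTU row after
 * the first. Assumes no tiles and one slice per picture.
 */
static unsigned sketch_wpp_entry_points(unsigned height, unsigned ctb_size)
{
    unsigned ctu_rows;

    assert(height > 0 && ctb_size > 0);
    ctu_rows = (height + ctb_size - 1) / ctb_size;
    return ctu_rows - 1;
}
```

A backend-side diagnostic could use the first helper to confirm that no type-33 NAL ever appears in `source_data`, and the second to cross-check a derived `num_entry_point_offsets` against the kdirect value.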
### Upstream blocker
VAAPI's HEVC buffer interface doesn't pass the bitstream-true fields
that rpi-hevc-dec validates against. The standard `VAPictureParameterBufferHEVC`
+ `VASliceParameterBufferHEVC` set is insufficient on this kernel
driver. Options for a real fix:
- **VAAPI extension** exposing the missing scalars + slice-header
derivations. Multi-quarter upstream effort.
- **A backdoor `VABufferType` for raw SPS/PPS/slice-header NAL bytes**.
Libva-internal; consumers would have to populate it.
- **Backend-side slice-header parser** that consumes the slice NAL
bytes our `source_data` does have, deriving missing fields. Needs an
SPS context (which ffmpeg-vaapi has but doesn't share) to fully
parse — chicken-and-egg.
- **Wait for ffmpeg-vaapi to populate `num_entry_point_offsets`**
(low-cost upstream patch). Plus the SPS extension above.
None achievable in this iteration. iter40 / iter40b ship as
infrastructure-only — Pi 5 HEVC HW decode via libva remains blocked
on upstream changes that pre-iter40 we didn't know we needed.
### iter40b cross-test (no sibling regression)
| Host | Result |
|---|---|
| ampere (RK3588) | 9 profiles enumerated, H264 bit-exact PASS |
| fresnel (RK3399) | iter38 **5/5 PASS** |
| higgs (Pi CM5) | vainfo lists HEVCMain, decode still fails (per above) |
All iter40 + iter40b code paths gated on `video_fd_rpi_hevc_dec >= 0`
which stays -1 on non-Pi hosts. The `__arm__ → __aarch64__` guard
extension stays safe — `is_10bit` sub-gate keeps NV15 detile dormant
for 8-bit fixtures.
## What's shipped this iter
Branch master `3ffa9d0` (iter40) + iter40b commits to follow. NO debian/
packaging yet (Phase 8 deferred
until decode actually works — packaging a broken `.so` is mis-direction).
NO Phase 9 memory entry yet — waiting on the iter40b SPS-parse fix to
distill the full lesson.
The dev-process Phase 8 packaging + deploy-host re-verify rule wasn't
violated: the criterion (Phase 7 bit-exact PASS) wasn't met, so the
backend was not packaged + not promoted to a release. Local `.so`
install on higgs only, for debugging.
## Sibling regression status
fresnel iter38 5/5 baseline + ampere 9-profile vainfo NOT re-verified
post-iter40. Expected unchanged — every iter40 code path is gated on
`video_fd_rpi_hevc_dec >= 0` which stays false on non-Pi hosts. The
only globally-touched line is the `__arm__ → __aarch64__` guard in
image.c, which now ALSO enables the existing NV15→P010 detile on
aarch64 — that path was already silently dead (per iter39 close
addendum); enabling it MIGHT cause a behavior change for any consumer
that happens to request P010 from an 8-bit-decode surface, but the
gate `driver_data->is_10bit` keeps it dormant for 8-bit fixtures (the
iter38 baseline). Verify before declaring the regression-free promise
intact.
+111 -645
@@ -1,689 +1,155 @@
/*
* Copyright (C) 2026 claude-noether <claude-noether@reauktion.de>
* Copyright (C) 2026 Markus Fritsche <fritsche.markus@gmail.com>
*
* ampere-av1-enablement Phase 2.1: AV1 codec dispatcher for libva-v4l2-
* request-fourier. Translates VAAPI AV1 picture/slice parameter buffers
* into V4L2 stateless AV1 controls (V4L2_CID_STATELESS_AV1_*) for the
* Rockchip vpu981 hardware on RK3588.
* AV1 codec dispatcher. Populates V4L2_CID_STATELESS_AV1_SEQUENCE
* (struct v4l2_ctrl_av1_sequence) from VAAPI's VADecPictureParameterBufferAV1.
*
* Reference: Kwiboo/FFmpeg v4l2-request-n8.1:libavcodec/v4l2_request_av1.c
* (636 LoC; reads from FFmpeg's AV1RawSequenceHeader + AV1RawFrameHeader).
* VAAPI exposes the same AV1 spec semantics through different struct
* shapes: sequence-level fields are folded into VADecPictureParameterBufferAV1
* (no separate sequence buffer); per-frame fields live in the same struct.
* Why a single SEQUENCE control and not the full V4L2_CID_STATELESS_AV1_*
* family (FRAME, TILE_GROUP_ENTRY, FILM_GRAIN):
*
* F1/F2/F3 risk mitigations per phase1_plan_v2 §"General fill_frame
* implementation risks":
* F1 tile_info.mi_col/row_starts sentinel = 2 * ((frame_width + 7) >> 3)
* mirrors Kwiboo lines 238/244 exactly.
* F2 superres_denom: VAAPI exposes superres_scale_denominator directly
* and per spec it's already 8 when use_superres=0. No offset math
* needed (Kwiboo does it because FFmpeg stores raw coded_denom).
* F3 loop_restoration_size[] gated on USES_LR flag mirrors Kwiboo
* lines 281-287 exactly.
* - The daedalus_v4l2 daemon path consumes the OUTPUT bitstream
* directly via libavcodec/libdav1d. libdav1d needs a complete OBU
* stream that includes the sequence header — ffmpeg-vaapi strips the
* sequence header on the client side (its parser is split across
* VAPictureParameterBufferAV1 + slice payload, with OBU_SEQUENCE_HEADER
* consumed and not re-emitted), so the daemon side has to synthesise
* it from the SEQUENCE ctrl. The other AV1 ctrls (FRAME / TILE /
* FILM_GRAIN) are not needed for that synthesis — the OBU_FRAME_HEADER
* + OBU_TILE_GROUP that libdav1d also needs are still in the slice
* bitstream.
*
* V4L2 controls (4 per frame, batched in one VIDIOC_S_EXT_CTRLS):
* 1. V4L2_CID_STATELESS_AV1_SEQUENCE
* 2. V4L2_CID_STATELESS_AV1_FRAME
* 3. V4L2_CID_STATELESS_AV1_TILE_GROUP_ENTRY[] (DYNAMIC_ARRAY)
* 4. V4L2_CID_STATELESS_AV1_FILM_GRAIN (conditional on driver_data->
* has_av1_film_grain probe)
* - The vpu981 (RK3588 dedicated AV1 hantro) hardware path doesn't
* consult these controls either — vpu981's driver parses the AV1
* bitstream directly. So setting only SEQUENCE is correct for both
* destination decoders.
*
* Permission is hereby granted, free of charge, to any person obtaining a
* copy of this software and associated documentation files (the
* "Software"), to deal in the Software without restriction, including
* without limitation the rights to use, copy, modify, merge, publish,
* distribute, sub license, and/or sell copies of the Software, and to
* permit persons to whom the Software is furnished to do so, subject to
* the following conditions:
*
* The above copyright notice and this permission notice (including the
* next paragraph) shall be included in all copies or substantial portions
* of the Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS
* OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
* MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NON-INFRINGEMENT.
* IN NO EVENT SHALL THE AUTHORS OR CONTRIBUTORS BE LIABLE FOR ANY CLAIM,
* DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR
* OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR
* THE USE OR OTHER DEALINGS IN THE SOFTWARE.
* Reference: marfrit/libva-v4l2-request-fourier issue #11
* (DAEMON-PPS-style sequence-header re-synthesis on the daemon
* side, paralleling the H.264 SPS/PPS work in DAEMON-PPS).
* kernel uAPI: <linux/v4l2-controls.h> @ 2891-2919.
* VAAPI: <va/va_dec_av1.h> typedef
* VADecPictureParameterBufferAV1.
*/
#include "av1.h"
#include "context.h"
#include "object_heap.h"
#include "request.h"
#include "surface.h"
#include "utils.h"
#include "v4l2.h"
#include "utils.h"
#include <va/va.h>
#include <linux/videodev2.h>
#include <linux/v4l2-controls.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
/* Sanity asserts to catch kernel uAPI drift. If these fire, the kernel
* headers on the build machine are out of sync with what the running
* driver expects — silent register-misalignment bugs result. Cross-compile
* hazard per Janet v3 amendment: native-arm64 builds only (boltzmann +
* ampere); no cross from x86 against ARM kernel headers. */
_Static_assert(sizeof(struct v4l2_ctrl_av1_tile_group_entry) == 16,
"v4l2_ctrl_av1_tile_group_entry size drift — recheck uAPI");
#include <linux/v4l2-controls.h>
#include <linux/videodev2.h>
/* Per AV1 spec, when use_superres=0 the superres denominator is 8.
* VAAPI's superres_scale_denominator already encodes this directly
* (per va_dec_av1.h: "When use_superres=0, superres_scale_denominator
* must be 8"). Kwiboo's AV1_SUPERRES_DENOM_MIN+coded_denom math is
* not needed when reading from VAAPI. */
#define AV1_SUPERRES_NUM 8
/*
* VADecPictureParameterBufferAV1 reaches us transitively via surface.h →
* va_backend.h → va.h → va_dec_av1.h (va_dec_av1.h alone won't compile
* standalone — it needs va.h's VA_PADDING_LOW / va_deprecated machinery).
*/
/* AV1 spec maxima used for V4L2 array sizing. */
#define BACKEND_AV1_MAX_SEGMENTS 8
#define BACKEND_AV1_SEG_LVL_MAX 8
#define BACKEND_AV1_SEG_LVL_REF_FRAME 5
#define BACKEND_AV1_NUM_REF_FRAMES 8
#define BACKEND_AV1_TOTAL_REFS_PER_FRAME 8
#define BACKEND_AV1_REFS_PER_FRAME 7
/* Compile-time UAPI shift guard, sibling to vp9.c's pattern. */
_Static_assert(sizeof(struct v4l2_ctrl_av1_sequence) == 12,
"v4l2_ctrl_av1_sequence size mismatch — kernel UAPI changed");
/* ===== fill_sequence ===== */
static void av1_fill_sequence(VADecPictureParameterBufferAV1 *picture,
struct v4l2_ctrl_av1_sequence *ctrl)
/*
* Map VAAPI bit_depth_idx (0/1/2 → 8/10/12) to the kernel ctrl's plain
* uint8_t bit_depth field. ffmpeg-vaapi sets idx from the bitstream
* BitDepth value, so this is an exact inverse of AV1 spec 5.5.2.
*/
static uint8_t av1_bit_depth_from_idx(uint8_t idx)
{
uint8_t bit_depth;
memset(ctrl, 0, sizeof(*ctrl));
switch (picture->bit_depth_idx) {
case 0: bit_depth = 8; break;
case 1: bit_depth = 10; break;
case 2: bit_depth = 12; break;
default: bit_depth = 8; break;
}
ctrl->seq_profile = picture->profile;
ctrl->order_hint_bits = picture->seq_info_fields.fields.enable_order_hint ?
(picture->order_hint_bits_minus_1 + 1) : 0;
ctrl->bit_depth = bit_depth;
/* VAAPI does NOT separately expose max_frame_{width,height}_minus_1
* (sequence-level). Use the current frame size as a proxy. Correct
* for fixed-size sequences (the 208/352/1080p test vectors). */
ctrl->max_frame_width_minus_1 = picture->frame_width_minus1;
ctrl->max_frame_height_minus_1 = picture->frame_height_minus1;
if (picture->seq_info_fields.fields.still_picture)
ctrl->flags |= V4L2_AV1_SEQUENCE_FLAG_STILL_PICTURE;
if (picture->seq_info_fields.fields.use_128x128_superblock)
ctrl->flags |= V4L2_AV1_SEQUENCE_FLAG_USE_128X128_SUPERBLOCK;
if (picture->seq_info_fields.fields.enable_filter_intra)
ctrl->flags |= V4L2_AV1_SEQUENCE_FLAG_ENABLE_FILTER_INTRA;
if (picture->seq_info_fields.fields.enable_intra_edge_filter)
ctrl->flags |= V4L2_AV1_SEQUENCE_FLAG_ENABLE_INTRA_EDGE_FILTER;
if (picture->seq_info_fields.fields.enable_interintra_compound)
ctrl->flags |= V4L2_AV1_SEQUENCE_FLAG_ENABLE_INTERINTRA_COMPOUND;
if (picture->seq_info_fields.fields.enable_masked_compound)
ctrl->flags |= V4L2_AV1_SEQUENCE_FLAG_ENABLE_MASKED_COMPOUND;
/* VAAPI doesn't expose enable_warped_motion as a sequence flag;
* per-frame allow_warped_motion gates it. Conservative: set true so
* per-frame flag is honored. */
ctrl->flags |= V4L2_AV1_SEQUENCE_FLAG_ENABLE_WARPED_MOTION;
if (picture->seq_info_fields.fields.enable_dual_filter)
ctrl->flags |= V4L2_AV1_SEQUENCE_FLAG_ENABLE_DUAL_FILTER;
if (picture->seq_info_fields.fields.enable_order_hint)
ctrl->flags |= V4L2_AV1_SEQUENCE_FLAG_ENABLE_ORDER_HINT;
if (picture->seq_info_fields.fields.enable_jnt_comp)
ctrl->flags |= V4L2_AV1_SEQUENCE_FLAG_ENABLE_JNT_COMP;
/* enable_ref_frame_mvs / enable_restoration not exposed at sequence
* level — conservative set-true (kdirect also sets these for the
* test streams; gating doesn't matter because per-frame flags
* govern actual use). */
ctrl->flags |= V4L2_AV1_SEQUENCE_FLAG_ENABLE_REF_FRAME_MVS;
/* enable_superres: gate on the current frame's use_superres so the
* SEQUENCE flag matches the bitstream-derived value. Empirical
* strace diff vs kdirect: kdirect clears this for streams that
* never use superres; we were unconditionally setting it true. */
if (picture->pic_info_fields.bits.use_superres)
ctrl->flags |= V4L2_AV1_SEQUENCE_FLAG_ENABLE_SUPERRES;
if (picture->seq_info_fields.fields.enable_cdef)
ctrl->flags |= V4L2_AV1_SEQUENCE_FLAG_ENABLE_CDEF;
ctrl->flags |= V4L2_AV1_SEQUENCE_FLAG_ENABLE_RESTORATION;
if (picture->seq_info_fields.fields.mono_chrome)
ctrl->flags |= V4L2_AV1_SEQUENCE_FLAG_MONO_CHROME;
if (picture->seq_info_fields.fields.color_range)
ctrl->flags |= V4L2_AV1_SEQUENCE_FLAG_COLOR_RANGE;
if (picture->seq_info_fields.fields.subsampling_x)
ctrl->flags |= V4L2_AV1_SEQUENCE_FLAG_SUBSAMPLING_X;
if (picture->seq_info_fields.fields.subsampling_y)
ctrl->flags |= V4L2_AV1_SEQUENCE_FLAG_SUBSAMPLING_Y;
if (picture->seq_info_fields.fields.film_grain_params_present)
ctrl->flags |= V4L2_AV1_SEQUENCE_FLAG_FILM_GRAIN_PARAMS_PRESENT;
}
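The two scalar translations used for the SEQUENCE control can be checked in isolation. The sketch below is a standalone illustration, not part of the backend: the helper names `sketch_bit_depth_from_idx` and `sketch_order_hint_bits` are hypothetical, and the fallback to 8 for an out-of-range index simply copies the default branch shown above.

```c
#include <stdint.h>

/* Hypothetical helper: VAAPI bit_depth_idx (0/1/2) -> bit depth (8/10/12).
 * Any other index falls back to 8, like the default branch in the source. */
static uint8_t sketch_bit_depth_from_idx(uint8_t idx)
{
    switch (idx) {
    case 1:
        return 10;
    case 2:
        return 12;
    default:
        return 8;
    }
}

/* Hypothetical helper: order_hint_bits is 0 when order hints are disabled,
 * otherwise the "minus 1" field plus one. */
static uint8_t sketch_order_hint_bits(int enable_order_hint,
                                      uint8_t bits_minus_1)
{
    return enable_order_hint ? (uint8_t)(bits_minus_1 + 1u) : 0;
}
```

Keeping both translations as pure functions makes the off-by-one and the index-to-value step testable without kernel headers or a VAAPI context.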
/* ===== fill_frame ===== */
static void av1_fill_frame(VADecPictureParameterBufferAV1 *picture,
struct v4l2_ctrl_av1_frame *ctrl)
{
unsigned int i, j;
memset(ctrl, 0, sizeof(*ctrl));
/* ---- tile_info ---- */
ctrl->tile_info.context_update_tile_id = picture->context_update_tile_id;
ctrl->tile_info.tile_cols = picture->tile_cols;
ctrl->tile_info.tile_rows = picture->tile_rows;
if (picture->tile_cols > 1 || picture->tile_rows > 1)
ctrl->tile_info.tile_size_bytes = 4;
else
ctrl->tile_info.tile_size_bytes = 0;
if (picture->pic_info_fields.bits.uniform_tile_spacing_flag)
ctrl->tile_info.flags |= V4L2_AV1_TILE_INFO_FLAG_UNIFORM_TILE_SPACING;
/* F1: mi_col/row_starts[]: prefix-sum from width_in_sbs_minus_1[]+1
* (Kwiboo reads tile_start_col_sb[] directly; VAAPI doesn't expose
* starts, only widths — reconstruct via accumulation). Plus the
* sentinel at index tile_cols/tile_rows. */
{
uint16_t cum = 0;
for (i = 0; i < picture->tile_cols && i < 63; i++) {
ctrl->tile_info.mi_col_starts[i] = cum;
ctrl->tile_info.width_in_sbs_minus_1[i] =
picture->width_in_sbs_minus_1[i];
cum = (uint16_t)(cum + picture->width_in_sbs_minus_1[i] + 1);
}
ctrl->tile_info.mi_col_starts[picture->tile_cols] =
2 * ((picture->frame_width_minus1 + 1 + 7) >> 3);
}
{
uint16_t cum = 0;
for (i = 0; i < picture->tile_rows && i < 63; i++) {
ctrl->tile_info.mi_row_starts[i] = cum;
ctrl->tile_info.height_in_sbs_minus_1[i] =
picture->height_in_sbs_minus_1[i];
cum = (uint16_t)(cum + picture->height_in_sbs_minus_1[i] + 1);
}
ctrl->tile_info.mi_row_starts[picture->tile_rows] =
2 * ((picture->frame_height_minus1 + 1 + 7) >> 3);
}
/* ---- quantization ---- */
ctrl->quantization.base_q_idx = picture->base_qindex;
ctrl->quantization.delta_q_y_dc = picture->y_dc_delta_q;
ctrl->quantization.delta_q_u_dc = picture->u_dc_delta_q;
ctrl->quantization.delta_q_u_ac = picture->u_ac_delta_q;
ctrl->quantization.delta_q_v_dc = picture->v_dc_delta_q;
ctrl->quantization.delta_q_v_ac = picture->v_ac_delta_q;
ctrl->quantization.qm_y = picture->qmatrix_fields.bits.qm_y;
ctrl->quantization.qm_u = picture->qmatrix_fields.bits.qm_u;
ctrl->quantization.qm_v = picture->qmatrix_fields.bits.qm_v;
ctrl->quantization.delta_q_res =
picture->mode_control_fields.bits.log2_delta_q_res;
if (picture->u_dc_delta_q != picture->v_dc_delta_q ||
picture->u_ac_delta_q != picture->v_ac_delta_q)
ctrl->quantization.flags |= V4L2_AV1_QUANTIZATION_FLAG_DIFF_UV_DELTA;
if (picture->qmatrix_fields.bits.using_qmatrix)
ctrl->quantization.flags |= V4L2_AV1_QUANTIZATION_FLAG_USING_QMATRIX;
if (picture->mode_control_fields.bits.delta_q_present_flag)
ctrl->quantization.flags |= V4L2_AV1_QUANTIZATION_FLAG_DELTA_Q_PRESENT;
/* ---- segmentation ---- */
if (picture->seg_info.segment_info_fields.bits.enabled)
ctrl->segmentation.flags |= V4L2_AV1_SEGMENTATION_FLAG_ENABLED;
if (picture->seg_info.segment_info_fields.bits.update_map)
ctrl->segmentation.flags |= V4L2_AV1_SEGMENTATION_FLAG_UPDATE_MAP;
if (picture->seg_info.segment_info_fields.bits.temporal_update)
ctrl->segmentation.flags |= V4L2_AV1_SEGMENTATION_FLAG_TEMPORAL_UPDATE;
if (picture->seg_info.segment_info_fields.bits.update_data)
ctrl->segmentation.flags |= V4L2_AV1_SEGMENTATION_FLAG_UPDATE_DATA;
for (i = 0; i < BACKEND_AV1_MAX_SEGMENTS; i++) {
for (j = 0; j < BACKEND_AV1_SEG_LVL_MAX; j++) {
if (picture->seg_info.feature_mask[i] & (1 << j)) {
ctrl->segmentation.feature_enabled[i] |=
V4L2_AV1_SEGMENT_FEATURE_ENABLED(j);
ctrl->segmentation.last_active_seg_id = i;
if (j >= BACKEND_AV1_SEG_LVL_REF_FRAME)
ctrl->segmentation.flags |=
V4L2_AV1_SEGMENTATION_FLAG_SEG_ID_PRE_SKIP;
}
ctrl->segmentation.feature_data[i][j] =
picture->seg_info.feature_data[i][j];
}
}
/* ---- loop_filter ---- */
ctrl->loop_filter.level[0] = picture->filter_level[0];
ctrl->loop_filter.level[1] = picture->filter_level[1];
ctrl->loop_filter.level[2] = picture->filter_level_u;
ctrl->loop_filter.level[3] = picture->filter_level_v;
ctrl->loop_filter.sharpness =
picture->loop_filter_info_fields.bits.sharpness_level;
ctrl->loop_filter.mode_deltas[0] = picture->mode_deltas[0];
ctrl->loop_filter.mode_deltas[1] = picture->mode_deltas[1];
ctrl->loop_filter.delta_lf_res =
picture->mode_control_fields.bits.log2_delta_lf_res;
for (i = 0; i < BACKEND_AV1_NUM_REF_FRAMES; i++)
ctrl->loop_filter.ref_deltas[i] = picture->ref_deltas[i];
if (picture->loop_filter_info_fields.bits.mode_ref_delta_enabled)
ctrl->loop_filter.flags |= V4L2_AV1_LOOP_FILTER_FLAG_DELTA_ENABLED;
if (picture->loop_filter_info_fields.bits.mode_ref_delta_update)
ctrl->loop_filter.flags |= V4L2_AV1_LOOP_FILTER_FLAG_DELTA_UPDATE;
if (picture->mode_control_fields.bits.delta_lf_present_flag)
ctrl->loop_filter.flags |= V4L2_AV1_LOOP_FILTER_FLAG_DELTA_LF_PRESENT;
if (picture->mode_control_fields.bits.delta_lf_multi)
ctrl->loop_filter.flags |= V4L2_AV1_LOOP_FILTER_FLAG_DELTA_LF_MULTI;
/* ---- cdef ---- */
ctrl->cdef.damping_minus_3 = picture->cdef_damping_minus_3;
ctrl->cdef.bits = picture->cdef_bits;
for (i = 0; i < (unsigned)(1 << picture->cdef_bits) && i < 8; i++) {
uint8_t y = picture->cdef_y_strengths[i];
uint8_t uv = picture->cdef_uv_strengths[i];
ctrl->cdef.y_pri_strength[i] = (y >> 2) & 0x0F;
ctrl->cdef.y_sec_strength[i] = y & 0x03;
ctrl->cdef.uv_pri_strength[i] = (uv >> 2) & 0x0F;
ctrl->cdef.uv_sec_strength[i] = uv & 0x03;
}
/* ---- loop_restoration ---- (F3)
* Phase 5 review Amendment 1 was REVERTED. The reviewer proposed
* remap = {NONE, SWITCHABLE, WIENER, SGRPROJ} (Kwiboo's table)
* based on AV1 spec FrameRestoreType wire encoding
* {NONE=0, SWITCHABLE=1, WIENER=2, SGRPROJ=3} differing from V4L2's
* {NONE=0, WIENER=1, SGRPROJ=2, SWITCHABLE=3}. Empirically applying
* that permutation regressed ALL tests (allintra 10/10 → 0/10) —
* so either VAAPI's yframe_restoration_type is NOT the raw spec
* value (already-remapped to V4L2 enum semantics?), or vpu981
* interprets the V4L2 enum values via a different mapping than
* the V4L2 uAPI header documents. Per
* [[feedback_review_empirical_over_theoretical]] keep the
* identity mapping that empirically works; revisit if a
* restoration-using fixture surfaces a real decode bug.
*/
{
uint8_t remap[4] = {
V4L2_AV1_FRAME_RESTORE_NONE,
V4L2_AV1_FRAME_RESTORE_WIENER,
V4L2_AV1_FRAME_RESTORE_SGRPROJ,
V4L2_AV1_FRAME_RESTORE_SWITCHABLE,
};
uint8_t y_t = picture->loop_restoration_fields.bits.yframe_restoration_type & 3;
uint8_t cb_t = picture->loop_restoration_fields.bits.cbframe_restoration_type & 3;
uint8_t cr_t = picture->loop_restoration_fields.bits.crframe_restoration_type & 3;
bool uses_lr = false;
ctrl->loop_restoration.frame_restoration_type[0] = remap[y_t];
ctrl->loop_restoration.frame_restoration_type[1] = remap[cb_t];
ctrl->loop_restoration.frame_restoration_type[2] = remap[cr_t];
if (y_t != 0)
uses_lr = true;
if (cb_t != 0 || cr_t != 0) {
uses_lr = true;
ctrl->loop_restoration.flags |=
V4L2_AV1_LOOP_RESTORATION_FLAG_USES_CHROMA_LR;
}
ctrl->loop_restoration.lr_unit_shift =
picture->loop_restoration_fields.bits.lr_unit_shift;
ctrl->loop_restoration.lr_uv_shift =
picture->loop_restoration_fields.bits.lr_uv_shift;
if (uses_lr) {
uint8_t shift = picture->loop_restoration_fields.bits.lr_unit_shift;
uint8_t uv_shift = picture->loop_restoration_fields.bits.lr_uv_shift;
ctrl->loop_restoration.flags |=
V4L2_AV1_LOOP_RESTORATION_FLAG_USES_LR;
ctrl->loop_restoration.loop_restoration_size[0] =
1 << (6 + shift);
ctrl->loop_restoration.loop_restoration_size[1] =
1 << (6 + shift - uv_shift);
ctrl->loop_restoration.loop_restoration_size[2] =
1 << (6 + shift - uv_shift);
}
}
/* ---- global_motion ---- */
for (i = 0; i < BACKEND_AV1_TOTAL_REFS_PER_FRAME; i++) {
if (i == 0)
continue; /* INTRA_FRAME slot — no warp */
ctrl->global_motion.type[i] = picture->wm[i - 1].wmtype;
for (j = 0; j < 6; j++)
ctrl->global_motion.params[i][j] = picture->wm[i - 1].wmmat[j];
if (picture->wm[i - 1].invalid)
ctrl->global_motion.invalid |=
V4L2_AV1_GLOBAL_MOTION_IS_INVALID(i);
switch (picture->wm[i - 1].wmtype) {
case 1:
ctrl->global_motion.flags[i] |=
V4L2_AV1_GLOBAL_MOTION_FLAG_IS_TRANSLATION;
ctrl->global_motion.flags[i] |=
V4L2_AV1_GLOBAL_MOTION_FLAG_IS_GLOBAL;
break;
case 2:
ctrl->global_motion.flags[i] |=
V4L2_AV1_GLOBAL_MOTION_FLAG_IS_ROT_ZOOM;
ctrl->global_motion.flags[i] |=
V4L2_AV1_GLOBAL_MOTION_FLAG_IS_GLOBAL;
break;
case 3:
ctrl->global_motion.flags[i] |=
V4L2_AV1_GLOBAL_MOTION_FLAG_IS_GLOBAL;
break;
default:
break;
}
}
/* ---- reference frames + order hints ---- */
/* reference_frame_ts[] is filled by the orchestrator (av1_set_controls)
* which has driver_data for the SURFACE() lookup. order_hints[] not
* exposed per-ref by VAAPI — leave zero. ref_frame_idx[7] is the
* index map from spec-defined ref slots (LAST..ALTREF) into
* ref_frame_map[8] (the surface IDs). */
for (i = 0; i < BACKEND_AV1_TOTAL_REFS_PER_FRAME; i++)
ctrl->order_hints[i] = 0;
for (i = 0; i < BACKEND_AV1_REFS_PER_FRAME; i++)
ctrl->ref_frame_idx[i] = picture->ref_frame_idx[i];
/* F2: superres_denom direct from VAAPI; fallback to AV1_SUPERRES_NUM
* if zero (spec violation but defensive). */
ctrl->superres_denom = picture->superres_scale_denominator
? picture->superres_scale_denominator : AV1_SUPERRES_NUM;
ctrl->skip_mode_frame[0] = 0;
ctrl->skip_mode_frame[1] = 0;
ctrl->primary_ref_frame = picture->primary_ref_frame;
ctrl->frame_type = picture->pic_info_fields.bits.frame_type;
ctrl->order_hint = picture->order_hint;
ctrl->upscaled_width = picture->frame_width_minus1 + 1;
ctrl->interpolation_filter = picture->interp_filter;
ctrl->tx_mode = picture->mode_control_fields.bits.tx_mode;
ctrl->frame_width_minus_1 = picture->frame_width_minus1;
ctrl->frame_height_minus_1 = picture->frame_height_minus1;
ctrl->render_width_minus_1 = picture->frame_width_minus1;
ctrl->render_height_minus_1 = picture->frame_height_minus1;
ctrl->current_frame_id = 0;
/* Phase 3: VAAPI doesn't expose refresh_frame_flags. For KEY/SWITCH
* frames the AV1 spec mandates 0xff (refresh all DPB slots). For
* inter frames we default to 0xff too — simple P-frame chains will
* naturally rotate through slots without a precise per-slot value.
* If the stream needs precise control, refresh_frame_flags has to be parsed from the frame header.
* Empirical diff vs kdirect shows kdirect always sends 0xff here. */
ctrl->refresh_frame_flags = 0xff;
/* ---- frame flags ---- */
if (picture->pic_info_fields.bits.show_frame)
ctrl->flags |= V4L2_AV1_FRAME_FLAG_SHOW_FRAME;
if (picture->pic_info_fields.bits.showable_frame)
ctrl->flags |= V4L2_AV1_FRAME_FLAG_SHOWABLE_FRAME;
if (picture->pic_info_fields.bits.error_resilient_mode)
ctrl->flags |= V4L2_AV1_FRAME_FLAG_ERROR_RESILIENT_MODE;
if (picture->pic_info_fields.bits.disable_cdf_update)
ctrl->flags |= V4L2_AV1_FRAME_FLAG_DISABLE_CDF_UPDATE;
if (picture->pic_info_fields.bits.allow_screen_content_tools)
ctrl->flags |= V4L2_AV1_FRAME_FLAG_ALLOW_SCREEN_CONTENT_TOOLS;
if (picture->pic_info_fields.bits.force_integer_mv)
ctrl->flags |= V4L2_AV1_FRAME_FLAG_FORCE_INTEGER_MV;
if (picture->pic_info_fields.bits.allow_intrabc)
ctrl->flags |= V4L2_AV1_FRAME_FLAG_ALLOW_INTRABC;
if (picture->pic_info_fields.bits.use_superres)
ctrl->flags |= V4L2_AV1_FRAME_FLAG_USE_SUPERRES;
if (picture->pic_info_fields.bits.allow_high_precision_mv)
ctrl->flags |= V4L2_AV1_FRAME_FLAG_ALLOW_HIGH_PRECISION_MV;
if (picture->pic_info_fields.bits.is_motion_mode_switchable)
ctrl->flags |= V4L2_AV1_FRAME_FLAG_IS_MOTION_MODE_SWITCHABLE;
if (picture->pic_info_fields.bits.use_ref_frame_mvs)
ctrl->flags |= V4L2_AV1_FRAME_FLAG_USE_REF_FRAME_MVS;
if (picture->pic_info_fields.bits.disable_frame_end_update_cdf)
ctrl->flags |= V4L2_AV1_FRAME_FLAG_DISABLE_FRAME_END_UPDATE_CDF;
if (picture->pic_info_fields.bits.allow_warped_motion)
ctrl->flags |= V4L2_AV1_FRAME_FLAG_ALLOW_WARPED_MOTION;
if (picture->mode_control_fields.bits.reference_select)
ctrl->flags |= V4L2_AV1_FRAME_FLAG_REFERENCE_SELECT;
if (picture->mode_control_fields.bits.reduced_tx_set_used)
ctrl->flags |= V4L2_AV1_FRAME_FLAG_REDUCED_TX_SET;
if (picture->mode_control_fields.bits.skip_mode_present) {
ctrl->flags |= V4L2_AV1_FRAME_FLAG_SKIP_MODE_ALLOWED;
ctrl->flags |= V4L2_AV1_FRAME_FLAG_SKIP_MODE_PRESENT;
switch (idx) {
case 0: return 8;
case 1: return 10;
case 2: return 12;
default:
/* Spec-illegal index; fall back to 8-bit. */
return 8;
}
}
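The tile-start reconstruction in the tile_info section is a prefix sum over the per-tile sizes. A minimal standalone sketch follows, assuming sizes given in superblock units. The helper name `sketch_tile_starts` is hypothetical, and the final entry here is just the running total, which is a simplification and not the frame-size-derived sentinel the code above writes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: starts[i] = sum of (sizes_minus_1[k] + 1) for k < i.
 * Writes count + 1 entries; starts[count] holds the running total.
 * Units are whatever the sizes are in (superblocks in this sketch). */
static void sketch_tile_starts(const uint16_t *sizes_minus_1, size_t count,
                               uint16_t *starts)
{
    uint16_t cum = 0;
    size_t i;

    assert(sizes_minus_1 != NULL && starts != NULL);
    for (i = 0; i < count; i++) {
        starts[i] = cum;
        cum = (uint16_t)(cum + sizes_minus_1[i] + 1);
    }
    starts[count] = cum;
}
```

The caller has to provide room for `count + 1` entries in `starts`, one more than the number of tiles.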
/* ===== fill_film_grain ===== */
static void av1_fill_film_grain(VADecPictureParameterBufferAV1 *picture,
struct v4l2_ctrl_av1_film_grain *ctrl)
{
VAFilmGrainStructAV1 *fg = &picture->film_grain_info;
unsigned int i;
memset(ctrl, 0, sizeof(*ctrl));
ctrl->cr_mult = fg->cr_mult;
ctrl->grain_seed = fg->grain_seed;
/* VAAPI doesn't expose film_grain_params_ref_idx (the reuse-from-
* previous-frame index). Leave zero — only consulted when
* update_grain=0, which VAAPI also doesn't expose. */
ctrl->film_grain_params_ref_idx = 0;
ctrl->num_y_points = fg->num_y_points;
ctrl->num_cb_points = fg->num_cb_points;
ctrl->num_cr_points = fg->num_cr_points;
ctrl->grain_scaling_minus_8 =
fg->film_grain_info_fields.bits.grain_scaling_minus_8;
ctrl->ar_coeff_lag = fg->film_grain_info_fields.bits.ar_coeff_lag;
ctrl->ar_coeff_shift_minus_6 =
fg->film_grain_info_fields.bits.ar_coeff_shift_minus_6;
ctrl->grain_scale_shift =
fg->film_grain_info_fields.bits.grain_scale_shift;
ctrl->cb_mult = fg->cb_mult;
ctrl->cb_luma_mult = fg->cb_luma_mult;
ctrl->cr_luma_mult = fg->cr_luma_mult;
ctrl->cb_offset = fg->cb_offset;
ctrl->cr_offset = fg->cr_offset;
if (fg->film_grain_info_fields.bits.apply_grain) {
ctrl->flags |= V4L2_AV1_FILM_GRAIN_FLAG_APPLY_GRAIN;
/* kdirect strace diff confirmed: V4L2_AV1_FILM_GRAIN_FLAG_
* UPDATE_GRAIN must be set when apply_grain=1 (kdirect's
* flags byte is 0x0B = APPLY|UPDATE|...). VAAPI's
* VAFilmGrainStructAV1 doesn't expose update_grain
* separately. Default to UPDATE=1 (use submitted params,
* not reuse from non-existent prior film_grain ref). The
* earlier segfault we saw with this flag was unmasked by
* the link-NULL deref (now fixed via linked_decode_surface);
* not caused by UPDATE_GRAIN itself. */
ctrl->flags |= V4L2_AV1_FILM_GRAIN_FLAG_UPDATE_GRAIN;
}
if (fg->film_grain_info_fields.bits.chroma_scaling_from_luma)
ctrl->flags |= V4L2_AV1_FILM_GRAIN_FLAG_CHROMA_SCALING_FROM_LUMA;
if (fg->film_grain_info_fields.bits.overlap_flag)
ctrl->flags |= V4L2_AV1_FILM_GRAIN_FLAG_OVERLAP;
if (fg->film_grain_info_fields.bits.clip_to_restricted_range)
ctrl->flags |= V4L2_AV1_FILM_GRAIN_FLAG_CLIP_TO_RESTRICTED_RANGE;
if (!fg->film_grain_info_fields.bits.apply_grain)
return;
for (i = 0; i < fg->num_y_points && i < 14; i++) {
ctrl->point_y_value[i] = fg->point_y_value[i];
ctrl->point_y_scaling[i] = fg->point_y_scaling[i];
}
for (i = 0; i < fg->num_cb_points && i < 10; i++) {
ctrl->point_cb_value[i] = fg->point_cb_value[i];
ctrl->point_cb_scaling[i] = fg->point_cb_scaling[i];
}
for (i = 0; i < fg->num_cr_points && i < 10; i++) {
ctrl->point_cr_value[i] = fg->point_cr_value[i];
ctrl->point_cr_scaling[i] = fg->point_cr_scaling[i];
}
for (i = 0; i < 24; i++)
ctrl->ar_coeffs_y_plus_128[i] = (uint8_t)(fg->ar_coeffs_y[i] + 128);
for (i = 0; i < 25; i++) {
ctrl->ar_coeffs_cb_plus_128[i] = (uint8_t)(fg->ar_coeffs_cb[i] + 128);
ctrl->ar_coeffs_cr_plus_128[i] = (uint8_t)(fg->ar_coeffs_cr[i] + 128);
}
}
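The `*_plus_128` fields store the signed auto-regression coefficients with a bias of 128, so that the range -128..127 maps onto 0..255. A small standalone sketch of that conversion, with a hypothetical helper name:

```c
#include <stdint.h>

/* Hypothetical helper: bias a signed AR coefficient into unsigned storage.
 * coeff is promoted to int before the addition, so the sum cannot overflow. */
static uint8_t sketch_ar_coeff_plus_128(int8_t coeff)
{
    return (uint8_t)(coeff + 128);
}
```

The cast back to `uint8_t` is lossless here because the biased value always lies in 0..255.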
/* ===== orchestrator ===== */
int av1_set_controls(struct request_data *driver_data,
struct object_context *context,
struct object_surface *surface_object)
{
VADecPictureParameterBufferAV1 *picture =
&surface_object->params.av1.picture;
unsigned int num_tiles = surface_object->params.av1.num_tile_group_entries;
struct v4l2_ctrl_av1_sequence sequence;
struct v4l2_ctrl_av1_frame frame;
struct v4l2_ctrl_av1_film_grain film_grain;
struct v4l2_ctrl_av1_tile_group_entry *tile_entries = NULL;
struct v4l2_ext_control controls[4];
unsigned int n = 0;
unsigned int i;
unsigned int alloc_tiles;
struct v4l2_ext_control ctrls[1];
int rc;
(void)context;
/*
* AV1 film_grain link: when apply_grain=1, ffmpeg-vaapi allocates a
* separate display surface (current_display_picture) from the decode
* surface (current_frame). vpu981 HW applies grain inline to the
* decode CAPTURE buffer, so the consumable data is in current_frame's
* slot. ffmpeg then calls vaGetImage on current_display_picture which
* has no slot bound. Link the display surface back to the decode
* surface so copy_surface_to_image can borrow destination_data[].
*/
if (picture->current_display_picture != VA_INVALID_SURFACE &&
picture->current_display_picture != picture->current_frame) {
struct object_surface *display_surface =
SURFACE(driver_data, picture->current_display_picture);
if (display_surface != NULL)
display_surface->linked_decode_surface_id =
picture->current_frame;
}
if (num_tiles > AV1_MAX_TILES)
num_tiles = AV1_MAX_TILES;
/* DYNAMIC_ARRAY size = MAX(num_tiles, 1) per Janet v2 Q1
* amendment — kernel UB on size=0. */
alloc_tiles = num_tiles > 0 ? num_tiles : 1;
tile_entries = calloc(alloc_tiles, sizeof(*tile_entries));
if (tile_entries == NULL)
return -1;
for (i = 0; i < num_tiles; i++) {
VASliceParameterBufferAV1 *slice =
&surface_object->params.av1.tile_group_entries[i];
tile_entries[i].tile_offset = slice->slice_data_offset;
tile_entries[i].tile_size = slice->slice_data_size;
tile_entries[i].tile_row = (uint8_t)slice->tile_row;
tile_entries[i].tile_col = (uint8_t)slice->tile_column;
}
av1_fill_sequence(picture, &sequence);
av1_fill_frame(picture, &frame);
memset(&sequence, 0, sizeof sequence);
/*
* Phase 2.1 + frame-2 divergence fix: wire reference_frame_ts[].
* VAAPI exposes ref_frame_map[8] as VASurfaceIDs; the kernel needs
* v4l2-style timestamps to cross-reference the corresponding
* CAPTURE buffers (set on the OUTPUT buffer at QBUF time per
* picture.c::EndPicture, via surface_object->timestamp). Mirrors
* the vp9.c:614-628 pattern, scaled to AV1's 8 ref slots.
*
* VA_INVALID_SURFACE entries stay at the calloc'd zero timestamp
* (kernel reads zero, doesn't try to dereference).
* Scalar mapping. Names align with kernel uAPI; off-by-one and
* idx→value translations are annotated.
*/
sequence.seq_profile = picture->profile;
sequence.order_hint_bits =
(uint8_t)(picture->order_hint_bits_minus_1 + 1u);
sequence.bit_depth = av1_bit_depth_from_idx(picture->bit_depth_idx);
sequence.max_frame_width_minus_1 = picture->frame_width_minus1;
sequence.max_frame_height_minus_1 = picture->frame_height_minus1;
/*
* Empirical: DPB-slot iteration (i over ref_frame_map[i]) gives
* better correctness than ref-name iteration via ref_frame_idx[].
* Tried the ref-name reindex (Kwiboo convention via FFmpeg s->ref[i])
* and lost frames that previously PASSed (3/10 → 1/10) — so the V4L2
* uAPI semantic here may be DPB-slot-indexed despite the AV1 spec
* lexicon. Phase 3 open question pending kernel-side disambiguation.
* Sequence-header flag mapping. VAAPI exposes most of these directly
* in seq_info_fields.fields.*; the ones that don't have a 1:1 mirror
* (V4L2_AV1_SEQUENCE_FLAG_ENABLE_WARPED_MOTION, _ENABLE_REF_FRAME_MVS,
* _ENABLE_SUPERRES, _ENABLE_RESTORATION, _SEPARATE_UV_DELTA_Q) live in
* VAAPI's per-frame pic_info_fields rather than the sequence struct.
* For SEQUENCE-control purposes we treat them as best-effort
* unobservable from libva and leave the corresponding bits clear; the
* daedalus daemon's OBU synthesiser (issue #11 daemon track) carries
* the SEQUENCE bytes verbatim, so per-frame consumers (libdav1d) will
* still see the full bitstream truth for those toggles via the
* OBU_FRAME stream already in the slice buffer. See feedback memory
* `feedback_vaapi_blind_to_some_hevc_sps_fields` for the precedent.
*/
for (i = 0; i < BACKEND_AV1_TOTAL_REFS_PER_FRAME; i++) {
VASurfaceID ref_id = picture->ref_frame_map[i];
struct object_surface *ref_surface;
uint64_t ts;
if (ref_id == VA_INVALID_SURFACE)
continue;
ref_surface = SURFACE(driver_data, ref_id);
if (ref_surface == NULL)
continue;
ts = v4l2_timeval_to_ns(&ref_surface->timestamp);
if (ts == 0 &&
ref_surface->linked_decode_surface_id != VA_INVALID_SURFACE) {
struct object_surface *dec =
SURFACE(driver_data,
ref_surface->linked_decode_surface_id);
if (dec != NULL) {
ts = v4l2_timeval_to_ns(&dec->timestamp);
frame.order_hints[i] = dec->av1_order_hint;
}
} else {
frame.order_hints[i] = ref_surface->av1_order_hint;
}
frame.reference_frame_ts[i] = ts;
}
if (picture->seq_info_fields.fields.still_picture)
sequence.flags |= V4L2_AV1_SEQUENCE_FLAG_STILL_PICTURE;
if (picture->seq_info_fields.fields.use_128x128_superblock)
sequence.flags |= V4L2_AV1_SEQUENCE_FLAG_USE_128X128_SUPERBLOCK;
if (picture->seq_info_fields.fields.enable_filter_intra)
sequence.flags |= V4L2_AV1_SEQUENCE_FLAG_ENABLE_FILTER_INTRA;
if (picture->seq_info_fields.fields.enable_intra_edge_filter)
sequence.flags |= V4L2_AV1_SEQUENCE_FLAG_ENABLE_INTRA_EDGE_FILTER;
if (picture->seq_info_fields.fields.enable_interintra_compound)
sequence.flags |= V4L2_AV1_SEQUENCE_FLAG_ENABLE_INTERINTRA_COMPOUND;
if (picture->seq_info_fields.fields.enable_masked_compound)
sequence.flags |= V4L2_AV1_SEQUENCE_FLAG_ENABLE_MASKED_COMPOUND;
if (picture->seq_info_fields.fields.enable_dual_filter)
sequence.flags |= V4L2_AV1_SEQUENCE_FLAG_ENABLE_DUAL_FILTER;
if (picture->seq_info_fields.fields.enable_order_hint)
sequence.flags |= V4L2_AV1_SEQUENCE_FLAG_ENABLE_ORDER_HINT;
if (picture->seq_info_fields.fields.enable_jnt_comp)
sequence.flags |= V4L2_AV1_SEQUENCE_FLAG_ENABLE_JNT_COMP;
if (picture->seq_info_fields.fields.enable_cdef)
sequence.flags |= V4L2_AV1_SEQUENCE_FLAG_ENABLE_CDEF;
if (picture->seq_info_fields.fields.mono_chrome)
sequence.flags |= V4L2_AV1_SEQUENCE_FLAG_MONO_CHROME;
if (picture->seq_info_fields.fields.color_range)
sequence.flags |= V4L2_AV1_SEQUENCE_FLAG_COLOR_RANGE;
if (picture->seq_info_fields.fields.subsampling_x)
sequence.flags |= V4L2_AV1_SEQUENCE_FLAG_SUBSAMPLING_X;
if (picture->seq_info_fields.fields.subsampling_y)
sequence.flags |= V4L2_AV1_SEQUENCE_FLAG_SUBSAMPLING_Y;
if (picture->seq_info_fields.fields.film_grain_params_present)
sequence.flags |= V4L2_AV1_SEQUENCE_FLAG_FILM_GRAIN_PARAMS_PRESENT;
/* Phase 3: record this frame's order_hint on the surface so the
* NEXT frame's ref-loop can populate order_hints[] for slots that
* reference us. */
surface_object->av1_order_hint = picture->order_hint;
/* Also propagate to the linked display surface (if any), since
* future frames' ref_frame_map[] may point at either. */
if (picture->current_display_picture != VA_INVALID_SURFACE &&
picture->current_display_picture != picture->current_frame) {
struct object_surface *disp =
SURFACE(driver_data, picture->current_display_picture);
if (disp != NULL)
disp->av1_order_hint = picture->order_hint;
}
if (driver_data->has_av1_film_grain)
av1_fill_film_grain(picture, &film_grain);
controls[n++] = (struct v4l2_ext_control){
.id = V4L2_CID_STATELESS_AV1_SEQUENCE,
.ptr = &sequence,
.size = sizeof(sequence),
};
controls[n++] = (struct v4l2_ext_control){
.id = V4L2_CID_STATELESS_AV1_FRAME,
.ptr = &frame,
.size = sizeof(frame),
};
controls[n++] = (struct v4l2_ext_control){
.id = V4L2_CID_STATELESS_AV1_TILE_GROUP_ENTRY,
.ptr = tile_entries,
.size = sizeof(*tile_entries) * alloc_tiles,
};
if (driver_data->has_av1_film_grain) {
controls[n++] = (struct v4l2_ext_control){
.id = V4L2_CID_STATELESS_AV1_FILM_GRAIN,
.ptr = &film_grain,
.size = sizeof(film_grain),
};
}
/* Single-control batched submission. */
memset(ctrls, 0, sizeof ctrls);
ctrls[0].id = V4L2_CID_STATELESS_AV1_SEQUENCE;
ctrls[0].ptr = &sequence;
ctrls[0].size = sizeof sequence;
rc = v4l2_set_controls(driver_data->video_fd,
surface_object->request_fd,
controls, n);
ctrls, 1);
if (rc < 0)
return VA_STATUS_ERROR_OPERATION_FAILED;
free(tile_entries);
if (rc < 0) {
request_log("ampere-av1: VIDIOC_S_EXT_CTRLS failed rc=%d\n", rc);
return -1;
}
return 0;
return VA_STATUS_SUCCESS;
}
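The reference lookup above relies on converting each surface's `struct timeval` timestamp into the nanosecond value the kernel matches against CAPTURE buffers. The sketch below shows that arithmetic on its own. It is believed to match what `v4l2_timeval_to_ns` in `<linux/videodev2.h>` computes, but the function name `sketch_timeval_to_ns` is hypothetical and the block is an illustration rather than the backend's code.

```c
#include <stdint.h>
#include <sys/time.h>

/* Hypothetical stand-in for the timeval -> nanoseconds conversion:
 * seconds scaled by 1e9 plus microseconds scaled by 1e3. */
static uint64_t sketch_timeval_to_ns(const struct timeval *tv)
{
    return (uint64_t)tv->tv_sec * 1000000000ULL +
           (uint64_t)tv->tv_usec * 1000ULL;
}
```

A surface that has never been queued keeps a zeroed timestamp and therefore converts to 0, which is the value the loop above treats as "not set" before falling back to the linked decode surface.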
+3 -9
@@ -1,14 +1,8 @@
/*
* Copyright (C) 2026 claude-noether <claude-noether@reauktion.de>
* Copyright (C) 2026 Markus Fritsche <fritsche.markus@gmail.com>
*
* ampere-av1-enablement Phase 2: AV1 codec dispatcher header for libva-
* v4l2-request-fourier. Mirrors vp9.h shape — single set_controls entry
* point that translates surface->params.av1.* VAAPI structures into a
* batch of V4L2_CID_STATELESS_AV1_{SEQUENCE,FRAME,TILE_GROUP_ENTRY,
* FILM_GRAIN} controls + the underlying request_fd / OUTPUT plane setup.
*
* V4L2 target: V4L2_PIX_FMT_AV1_FRAME on the vpu981 hantro instance
* (RK3588's dedicated AV1 decoder).
* AV1 codec dispatcher — populates V4L2_CID_STATELESS_AV1_SEQUENCE
* (struct v4l2_ctrl_av1_sequence) from VAAPI's VADecPictureParameterBufferAV1.
*
* Permission is hereby granted, free of charge, to any person obtaining a
* copy of this software and associated documentation files (the
+14
@@ -37,14 +37,28 @@ unsigned int pixelformat_for_profile(VAProfile profile)
case VAProfileH264ConstrainedBaseline:
case VAProfileH264MultiviewHigh:
case VAProfileH264StereoHigh:
case VAProfileH264High10:
return V4L2_PIX_FMT_H264_SLICE;
case VAProfileHEVCMain:
case VAProfileHEVCMain10:
return V4L2_PIX_FMT_HEVC_SLICE;
case VAProfileVP8Version0_3:
return V4L2_PIX_FMT_VP8_FRAME;
case VAProfileVP9Profile0:
return V4L2_PIX_FMT_VP9_FRAME;
case VAProfileAV1Profile0:
/*
* ampere-av1-enablement Phase 2: AV1 Profile 0 routes to
* vpu981 (RK3588's dedicated AV1 hantro). Per-codec ctrl
* dispatch (V4L2_CID_STATELESS_AV1_*) is NOT YET WIRED on
* master — vainfo lists the profile + RequestCreateConfig
* succeeds, but consumers that submit decode buffers hit
* a NOP path until the per-codec dispatch lands. The
* av1-iter1 operator branch has Phase 3 bit-exact bring-up
* underway; this commit gives master the bare enumeration +
* fd-routing layer so consumers like ffmpeg-vaapi at least
* see VAProfileAV1Profile0 today.
*/
return V4L2_PIX_FMT_AV1_FRAME;
default:
return 0;
+92 -26
@@ -59,34 +59,37 @@ VAStatus RequestCreateConfig(VADriverContextP context, VAProfile profile,
case VAProfileH264ConstrainedBaseline:
case VAProfileH264MultiviewHigh:
case VAProfileH264StereoHigh:
case VAProfileH264High10:
// FIXME
// iter39: Hi10P routed through same H264 path; bit-depth gating
// happens in context.c synthetic SPS and CAPTURE pix_fmt
// selection.
break;
case VAProfileMPEG2Simple:
case VAProfileMPEG2Main:
// fresnel-fourier iter1: MPEG-2 enabled. Same shape as H.264
// above — no profile-specific config validation in the libva
// backend; validation happens at vaCreateContext / control
// submission time.
break;
case VAProfileHEVCMain:
// fresnel-fourier iter2: HEVC enabled. Same shape as H.264/
// MPEG-2 above — no profile-specific config validation in the
// libva backend; validation happens at vaCreateContext / control
// submission time.
case VAProfileHEVCMain10:
// iter39: Main10 routed through same HEVC path; bit-depth
// gating happens in context.c.
break;
case VAProfileVP8Version0_3:
// fresnel-fourier iter3: VP8 enabled. Same shape as iter1+iter2
// above — no profile-specific config validation in the libva
// backend; validation happens at vaCreateContext / control
// submission time.
break;
case VAProfileVP9Profile0:
// fresnel-fourier iter4: VP9 Profile 0 enabled on rkvdec.
// Same shape — no profile-specific validation here.
// VP9 Profile 2 is NOT supported by RK3399 rkvdec (kernel ctrl
// cap is V4L2_MPEG_VIDEO_VP9_PROFILE_0). Do not add a case for
// VAProfileVP9Profile2 — kernel will reject.
break;
case VAProfileAV1Profile0:
// ampere-av1-enablement: AV1 Profile 0 enabled on vpu981.
// Same shape — no profile-specific validation here.
// ampere-av1-enablement Phase 2: AV1 Profile 0 routes to
// vpu981 (RK3588 dedicated AV1 hantro instance). Decode-side
// ctrl dispatch (V4L2_CID_STATELESS_AV1_*) is NOT YET WIRED
// on master — vainfo will list the profile + CreateConfig
// succeeds, but consumers that submit decode buffers hit a
// NOP path until av1.{c,h} dispatch scaffolding is ported
// from the av1-iter1 operator branch (where Phase 3-5 has
// 3/10 frames bit-exact already).
break;
default:
return VA_STATUS_ERROR_UNSUPPORTED_PROFILE;
@@ -123,7 +126,15 @@ VAStatus RequestCreateConfig(VADriverContextP context, VAProfile profile,
*/
config_object->pixelformat = pixelformat_for_profile(profile);
config_object->attributes[0].type = VAConfigAttribRTFormat;
config_object->attributes[0].value = VA_RT_FORMAT_YUV420;
/*
* iter39: 10-bit profiles advertise YUV420_10. ffmpeg-vaapi reads
* this attribute on vaGetConfigAttributes and refuses surface
* allocation if it mismatches the input bitstream's bit depth.
*/
if (profile == VAProfileH264High10 || profile == VAProfileHEVCMain10)
config_object->attributes[0].value = VA_RT_FORMAT_YUV420_10;
else
config_object->attributes[0].value = VA_RT_FORMAT_YUV420;
config_object->attributes_count = 1;
for (i = 1; i < attributes_count; i++) {
@@ -161,14 +172,20 @@ VAStatus RequestDestroyConfig(VADriverContextP context, VAConfigID config_id)
static bool any_fd_supports_output_format(struct request_data *driver_data,
unsigned int fmt)
{
int fds[4] = {
int fds[6] = {
driver_data->video_fd,
driver_data->video_fd_rkvdec,
driver_data->video_fd_hantro,
driver_data->video_fd_vpu981,
driver_data->video_fd_rpi_hevc_dec, /* iter40 */
driver_data->video_fd_vpu981, /* ampere-av1 Phase 2 */
#ifdef HAVE_DAEDALUS_V4L2
driver_data->video_fd_daedalus, /* LIBVA-1: H.264/VP9/AV1 */
#else
-1,
#endif
};
int i;
for (i = 0; i < 4; i++) {
for (i = 0; i < 6; i++) {
if (fds[i] < 0) continue;
if (v4l2_find_format(fds[i], V4L2_BUF_TYPE_VIDEO_OUTPUT, fmt))
return true;
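Review note: the hunk above has to change the array length and the loop bound together (4 to 6). A minimal sketch, assuming a plain C build, of the same scan with the bound derived from the array itself so the two cannot drift apart. `probe_fn`, `any_fd_supports`, `fake_probe` and `demo_scan` are hypothetical stand-ins; the driver's real probe is `v4l2_find_format`, whose implementation is not shown here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Element count of a true array (not a pointer). */
#define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

/* Hypothetical probe callback standing in for v4l2_find_format(). */
typedef bool (*probe_fn)(int fd, unsigned int fmt);

/* Return true if any open fd (>= 0) in the table accepts fmt. */
static bool any_fd_supports(const int *fds, size_t count,
			    unsigned int fmt, probe_fn probe)
{
	size_t i;

	assert(fds != NULL && probe != NULL);
	for (i = 0; i < count; i++) {
		if (fds[i] < 0)
			continue; /* slot not probed or compiled out */
		if (probe(fds[i], fmt))
			return true;
	}
	return false;
}

/* Toy probe for illustration: only fd 7 accepts format 42. */
static bool fake_probe(int fd, unsigned int fmt)
{
	return fd == 7 && fmt == 42;
}

/* Six-slot table with the same shape as the array in the hunk. */
static bool demo_scan(unsigned int fmt)
{
	const int fds[] = { 3, -1, 5, 7, -1, -1 };

	return any_fd_supports(fds, ARRAY_SIZE(fds), fmt, fake_probe);
}
```

With this shape, adding a seventh slot only touches the initializer; the loop bound follows automatically.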
@@ -198,11 +215,48 @@ VAStatus RequestQueryConfigProfiles(VADriverContextP context,
profiles[index++] = VAProfileH264ConstrainedBaseline;
profiles[index++] = VAProfileH264MultiviewHigh;
profiles[index++] = VAProfileH264StereoHigh;
/*
* iter39 Phase 7 close (Option B): VAProfileH264High10
* DELIBERATELY NOT ENUMERATED.
*
* Hi10P on Rockchip V4L2 stateless decoders requires:
* - HW: ✓ both RK3399 + RK3588 capable (per Rockchip
* datasheets — 4K 10-bit H.264 line items)
* - Kernel: ✓ Karlman v6→v10 series merged in
* mmind v7.0 (rkvdec_h264_decoded_fmts[] has
* NV15/NV20; ctrl cfg.max=HIGH_422_INTRA;
* bit_depth_luma_minus8==2 path live in
* rkvdec-h264-common.c:196)
* - Userspace ffmpeg: ✗ ffmpeg-v4l2-request-fourier
* lacks the userspace plumbing for Hi10P;
* kdirect path fails with EINVAL, libva path
* returns CAPTURE buffer all-zero.
*
* Empirically verified on both fresnel (RK3399) and ampere
* (RK3588) 2026-05-17 — same all-zero / EINVAL failure
* mode on both. The backend infrastructure (codec.c,
* context.c, image.c, surface.c, nv15.c) is RETAINED for
* when the upstream ffmpeg gap closes — just re-add the
* profiles[index++] line and bump the (-5) guard back to
* (-6). See memory feedback_rk3399_h264_hi10p_advertised_not_functional
* for the empirical evidence.
*/
}
found = any_fd_supports_output_format(driver_data, V4L2_PIX_FMT_HEVC_SLICE);
if (found && index < (V4L2_REQUEST_MAX_PROFILES - 1))
if (found && index < (V4L2_REQUEST_MAX_PROFILES - 1)) {
profiles[index++] = VAProfileHEVCMain;
/*
* iter39 Phase 7 close (Option B): VAProfileHEVCMain10
* DELIBERATELY NOT ENUMERATED. Same reasoning as
* VAProfileH264High10 above — kernel + HW ready,
* userspace ffmpeg V4L2 hwaccel plumbing not. Untested
* specifically due to no Main10 fixture (system x265
* is 8-bit-only on Arch ARM), but same kernel/HW/
* userspace stack so same gap likely applies. Re-enable
* when ffmpeg-vaapi → V4L2 hwaccel adds 10-bit HEVC.
*/
}
found = any_fd_supports_output_format(driver_data, V4L2_PIX_FMT_VP8_FRAME);
if (found && index < (V4L2_REQUEST_MAX_PROFILES - 1))
@@ -213,11 +267,11 @@ VAStatus RequestQueryConfigProfiles(VADriverContextP context,
profiles[index++] = VAProfileVP9Profile0;
/*
* ampere-av1-enablement: AV1 routes to vpu981 (advertised via the
* new video_fd_vpu981 slot). V4L2_REQUEST_MAX_PROFILES=11 is now
* EXACTLY full with this addition. Future profile additions
* require bumping that constant + verifying libva consumers'
* profiles[] sizing.
* ampere-av1-enablement Phase 2: AV1 Profile 0 advertised when
* vpu981 (RK3588 dedicated AV1 hantro) is probed. MAX_PROFILES
* bumped to 14 in request.h to safely fit even if iter39 Option
* B is reverted (Hi10P + Main10 back in enumeration → 13 total
* with AV1, the `< MAX - 1` guard then needs MAX ≥ 14).
*/
found = any_fd_supports_output_format(driver_data, V4L2_PIX_FMT_AV1_FRAME);
if (found && index < (V4L2_REQUEST_MAX_PROFILES - 1))
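Review note: the "MAX ≥ 14" arithmetic in the comment can be checked with a small model. This is a sketch under the simplifying assumption that every single append is protected by `index < max_profiles - 1`; `profiles_admitted` is a hypothetical helper, not part of the driver.

```c
#include <assert.h>

/*
 * Count how many profiles an enumeration can append when each append
 * is guarded by `index < max_profiles - 1`. `wanted` is the number of
 * profiles the driver would like to report.
 */
static int profiles_admitted(int max_profiles, int wanted)
{
	int index = 0;
	int k;

	assert(max_profiles >= 1 && wanted >= 0);
	for (k = 0; k < wanted; k++) {
		if (index < max_profiles - 1)
			index++; /* models: profiles[index++] = ...; */
	}
	return index;
}
```

Under this model the guard admits at most `max_profiles - 1` entries, so 13 profiles fit only when the constant is at least 14, which matches the comment.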
@@ -241,7 +295,9 @@ VAStatus RequestQueryConfigEntrypoints(VADriverContextP context,
case VAProfileH264ConstrainedBaseline:
case VAProfileH264MultiviewHigh:
case VAProfileH264StereoHigh:
case VAProfileH264High10:
case VAProfileHEVCMain:
case VAProfileHEVCMain10:
case VAProfileVP8Version0_3:
case VAProfileVP9Profile0:
case VAProfileAV1Profile0:
@@ -298,7 +354,17 @@ VAStatus RequestGetConfigAttributes(VADriverContextP context, VAProfile profile,
for (i = 0; i < attributes_count; i++) {
switch (attributes[i].type) {
case VAConfigAttribRTFormat:
attributes[i].value = VA_RT_FORMAT_YUV420;
/*
* iter39: 10-bit profiles publish YUV420_10. Profile-
* less query (this is invoked from vaGetConfigAttributes
* before vaCreateConfig) routes off the `profile` arg
* directly — same gating as RequestCreateConfig.
*/
if (profile == VAProfileH264High10 ||
profile == VAProfileHEVCMain10)
attributes[i].value = VA_RT_FORMAT_YUV420_10;
else
attributes[i].value = VA_RT_FORMAT_YUV420;
break;
default:
attributes[i].value = VA_ATTRIB_NOT_SUPPORTED;
+201 -37
@@ -42,6 +42,9 @@
#include <hevc-ctrls.h>
#include "nv15.h" /* iter40: fallback V4L2_PIX_FMT_NV15 define for Pi 5
* Debian headers that ship NC12 but not NV15. */
#include "nv12_col128.h" /* iter40: NC12 detile primitive + UV offset helper */
#include "utils.h"
#include "v4l2.h"
@@ -107,20 +110,79 @@ VAStatus RequestCreateContext(VADriverContextP context, VAConfigID config_id,
* the driver_data and is cached across CreateContext cycles. The
* probe doesn't require any prior S_FMT; v4l2_find_format
* enumerates the device's supported formats directly.
*
* iter39: choose NV15 (10-bit packed) for Hi10P / Main10 profiles,
* NV12 (8-bit) otherwise. If the cached video_format doesn't match
* the profile's bit-depth requirement, invalidate and re-probe;
* sibling pattern to iter38's device-switch invalidation in
* request_switch_device_for_profile().
*/
{
bool want_10bit = (config_object->profile == VAProfileH264High10 ||
config_object->profile == VAProfileHEVCMain10);
bool is_rpi = (driver_data->video_fd ==
driver_data->video_fd_rpi_hevc_dec);
/*
* iter40: per-fd preferred pixelformat. rpi-hevc-dec exposes
* NC12 (8-bit) / NC30 (10-bit), not NV12 / NV15.
*/
unsigned int want_pixfmt;
if (is_rpi)
want_pixfmt = want_10bit ? V4L2_PIX_FMT_NV12_10_COL128
: V4L2_PIX_FMT_NV12_COL128;
else
want_pixfmt = want_10bit ? V4L2_PIX_FMT_NV15
: V4L2_PIX_FMT_NV12;
if (driver_data->video_format &&
driver_data->video_format->v4l2_format != want_pixfmt &&
driver_data->video_format->v4l2_format != V4L2_PIX_FMT_SUNXI_TILED_NV12)
driver_data->video_format = NULL;
}
if (!driver_data->video_format) {
bool want_10bit = (config_object->profile == VAProfileH264High10 ||
config_object->profile == VAProfileHEVCMain10);
bool is_rpi = (driver_data->video_fd ==
driver_data->video_fd_rpi_hevc_dec);
video_format = NULL;
found = v4l2_find_format(driver_data->video_fd,
V4L2_BUF_TYPE_VIDEO_CAPTURE,
V4L2_PIX_FMT_SUNXI_TILED_NV12);
if (found)
video_format = video_format_find(V4L2_PIX_FMT_SUNXI_TILED_NV12);
found = v4l2_find_format(driver_data->video_fd,
V4L2_BUF_TYPE_VIDEO_CAPTURE_MPLANE,
V4L2_PIX_FMT_NV12);
if (found)
video_format = video_format_find(V4L2_PIX_FMT_NV12);
if (is_rpi) {
/*
* iter40: rpi-hevc-dec CAPTURE is NC12 (8-bit SAND
* 128-pixel-wide column tile) or NC30 (10-bit variant).
* Direct map; the kernel exposes BOTH formats in
* VIDIOC_ENUM_FMT(CAPTURE_MPLANE) without a pre-SPS
* step (verified Phase 0 strace), so find_format would
* also succeed; skip it for symmetry with the NV15
* iter39 branch below.
*/
video_format = video_format_find(
want_10bit ? V4L2_PIX_FMT_NV12_10_COL128
: V4L2_PIX_FMT_NV12_COL128);
} else if (!want_10bit) {
found = v4l2_find_format(driver_data->video_fd,
V4L2_BUF_TYPE_VIDEO_CAPTURE,
V4L2_PIX_FMT_SUNXI_TILED_NV12);
if (found)
video_format = video_format_find(V4L2_PIX_FMT_SUNXI_TILED_NV12);
found = v4l2_find_format(driver_data->video_fd,
V4L2_BUF_TYPE_VIDEO_CAPTURE_MPLANE,
V4L2_PIX_FMT_NV12);
if (found)
video_format = video_format_find(V4L2_PIX_FMT_NV12);
} else {
/*
* iter39 fresnel fix: rkvdec only advertises NV15 in
* VIDIOC_ENUM_FMT(CAPTURE) AFTER S_FMT(OUTPUT) +
* S_EXT_CTRLS(SPS) resolve image_fmt to 420_10BIT.
* Before that, only NV12 is enumerated. Pre-finding
* NV15 always fails. Skip the find_format check and
* directly map to our NV15 video_format entry; the
* later S_FMT(CAPTURE) commits the actual NV15 mode
* once the synthetic SPS sets bit_depth_luma_minus8=2.
*/
video_format = video_format_find(V4L2_PIX_FMT_NV15);
}
if (video_format == NULL) {
status = VA_STATUS_ERROR_OPERATION_FAILED;
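Review note: the format choice in this hunk reduces to a two-input table (which decoder is active, and whether the profile is 10-bit). A minimal sketch of that table follows. The `PIX_*` constants are local stand-ins built with the usual little-endian fourcc packing so the example compiles without kernel headers; the NC12/NC30 codes are taken from the names used in the comments and should be treated as assumptions. `want_capture_pixfmt` is a hypothetical helper.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Same packing as the kernel's v4l2_fourcc(), redefined locally. */
#define FOURCC(a, b, c, d) \
	((uint32_t)(a) | ((uint32_t)(b) << 8) | \
	 ((uint32_t)(c) << 16) | ((uint32_t)(d) << 24))

/* Local stand-ins for the V4L2_PIX_FMT_* values named in the diff. */
#define PIX_NV12 FOURCC('N', 'V', '1', '2') /* 8-bit linear */
#define PIX_NV15 FOURCC('N', 'V', '1', '5') /* 10-bit packed */
#define PIX_NC12 FOURCC('N', 'C', '1', '2') /* 8-bit SAND column tile */
#define PIX_NC30 FOURCC('N', 'C', '3', '0') /* 10-bit SAND column tile */

/* Preferred CAPTURE format for a (decoder, bit depth) pair. */
static uint32_t want_capture_pixfmt(bool is_rpi, bool want_10bit)
{
	if (is_rpi)
		return want_10bit ? PIX_NC30 : PIX_NC12;
	return want_10bit ? PIX_NV15 : PIX_NV12;
}

/* Usage: the four combinations covered by the hunk. */
static void example_pixfmt(void)
{
	assert(want_capture_pixfmt(false, false) == PIX_NV12);
	assert(want_capture_pixfmt(false, true) == PIX_NV15);
	assert(want_capture_pixfmt(true, false) == PIX_NC12);
	assert(want_capture_pixfmt(true, true) == PIX_NC30);
}
```

Keeping the decision in one function would also remove the duplicated `want_10bit` / `is_rpi` computation that the hunk performs in two adjacent blocks.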
@@ -131,6 +193,10 @@ VAStatus RequestCreateContext(VADriverContextP context, VAConfigID config_id,
}
video_format = driver_data->video_format;
/* iter39: session-wide flag drives image.c reporting + unpack. */
driver_data->is_10bit = (config_object->profile == VAProfileH264High10 ||
config_object->profile == VAProfileHEVCMain10);
output_type = v4l2_type_video_output(video_format->v4l2_mplane);
capture_type = v4l2_type_video_capture(video_format->v4l2_mplane);
@@ -175,7 +241,22 @@ VAStatus RequestCreateContext(VADriverContextP context, VAConfigID config_id,
* CAPTURE (sanity read-back, matches what S_FMT committed).
*/
{
unsigned int capture_pixelformat = V4L2_PIX_FMT_NV12;
/*
* iter40: take the CAPTURE pixelformat from the resolved
* video_format slot; that's per-fd, per-bit-depth correct.
* rkvdec 8-bit -> NV12
* rkvdec 10-bit -> NV15
* hantro 8-bit -> NV12
* rpi-hevc-dec -> NC12 (V4L2_PIX_FMT_NV12_COL128)
* Pre-iter40 this was hardcoded NV12/NV15; the rpi-hevc-dec
* fd would then have S_FMT(NV12) issued, and the kernel
* "helpfully" substituted V4L2_PIX_FMT_NV12MT_COL128 (the
* MULTI-PLANE-NON-CONTIGUOUS variant) instead of the
* SINGLE-PLANE NC12 we wanted, breaking cap_pool QUERYBUF
* downstream (Phase 7 iter40 first-run discovery).
*/
unsigned int capture_pixelformat =
driver_data->video_format->v4l2_format;
rc = v4l2_set_format(driver_data->video_fd, capture_type,
capture_pixelformat, picture_width,
picture_height);
@@ -232,16 +313,42 @@ VAStatus RequestCreateContext(VADriverContextP context, VAConfigID config_id,
* the device-init DECODE_MODE + START_CODE block below ALSO uses
* void-cast best-effort, so this is consistent with prior pattern.
*/
{
/*
* iter40 (Phase 5 review F6): the synthetic-SPS pre-seed is an
* rkvdec-specific quirk fix (the -EBUSY-on-CAPTURE-busy bug in
* rkvdec_s_ctrl). rpi-hevc-dec does NOT need it and uses a
* different submission ordering (Phase 0 strace: S_FMT_OUTPUT ->
* REQBUFS_OUTPUT -> S_FMT_CAPTURE -> CREATE_BUFS_CAPTURE -> STREAMON,
* with per-frame SPS via S_EXT_CTRLS class=0xf010000). Sending a
* stale dummy SPS at context-init time would leave rpi-hevc-dec's
* internal state on the dummy until the first real per-frame SPS
* arrives; exact behavior unknown but a known divergence from
* kdirect.
*
* Skip pre-seed when the active fd is rpi-hevc-dec. rkvdec /
* hantro paths unchanged.
*/
if (driver_data->video_fd != driver_data->video_fd_rpi_hevc_dec) {
/*
* iter39: 10-bit profiles set bit_depth_luma_minus8 = 2 in
* the synthetic SPS so rkvdec's get_image_fmt resolves to
* RKVDEC_IMG_FMT_420_10BIT (per rkvdec-h264-common.c:196 +
* rkvdec-hevc-common.c:467). Image_fmt resolution depends
* only on bit_depth_luma_minus8 and chroma_format_idc;
* profile_idc is ignored for image_fmt and v4l2_ctrl_hevc_sps
* has no profile_idc field at all.
*/
bool ten = driver_data->is_10bit;
switch (config_object->profile) {
case VAProfileHEVCMain: {
case VAProfileHEVCMain:
case VAProfileHEVCMain10: {
struct v4l2_ctrl_hevc_sps dummy_sps;
struct v4l2_ext_control dummy_ctrl;
memset(&dummy_sps, 0, sizeof(dummy_sps));
dummy_sps.chroma_format_idc = 1; /* 4:2:0 */
dummy_sps.bit_depth_luma_minus8 = 0; /* 8-bit */
dummy_sps.bit_depth_chroma_minus8 = 0;
dummy_sps.bit_depth_luma_minus8 = ten ? 2 : 0;
dummy_sps.bit_depth_chroma_minus8 = ten ? 2 : 0;
dummy_sps.pic_width_in_luma_samples = picture_width;
dummy_sps.pic_height_in_luma_samples = picture_height;
@@ -256,19 +363,20 @@ VAStatus RequestCreateContext(VADriverContextP context, VAConfigID config_id,
case VAProfileH264High:
case VAProfileH264ConstrainedBaseline:
case VAProfileH264MultiviewHigh:
case VAProfileH264StereoHigh: {
case VAProfileH264StereoHigh:
case VAProfileH264High10: {
struct v4l2_ctrl_h264_sps dummy_sps;
struct v4l2_ext_control dummy_ctrl;
memset(&dummy_sps, 0, sizeof(dummy_sps));
dummy_sps.chroma_format_idc = 1; /* 4:2:0 */
dummy_sps.bit_depth_luma_minus8 = 0;
dummy_sps.bit_depth_chroma_minus8 = 0;
dummy_sps.bit_depth_luma_minus8 = ten ? 2 : 0;
dummy_sps.bit_depth_chroma_minus8 = ten ? 2 : 0;
dummy_sps.pic_width_in_mbs_minus1 =
(picture_width + 15) / 16 - 1;
dummy_sps.pic_height_in_map_units_minus1 =
(picture_height + 15) / 16 - 1;
dummy_sps.profile_idc = 100; /* High */
dummy_sps.profile_idc = ten ? 110 : 100; /* High10 : High */
dummy_sps.level_idc = 41;
/*
* FRAME_MBS_ONLY required: rkvdec_h264_validate_sps
@@ -289,7 +397,7 @@ VAStatus RequestCreateContext(VADriverContextP context, VAConfigID config_id,
default:
break;
}
}
} /* iter40: end of pre-seed-skip-on-rpi-hevc-dec guard */
destination_planes_count = video_format->planes_count;
@@ -323,10 +431,39 @@ VAStatus RequestCreateContext(VADriverContextP context, VAConfigID config_id,
* changed by BeginPicture's slot acquisition.
*/
if (video_format->v4l2_buffers_count == 1) {
destination_sizes[0] = destination_bytesperlines[0] *
format_height;
for (j = 1; j < destination_planes_count; j++)
destination_sizes[j] = destination_sizes[0] / 2;
if (video_format->v4l2_format == V4L2_PIX_FMT_NV12_COL128) {
/*
* iter40: NC12 SAND layout: Y plane size is
* NUM_COLUMNS * TILE_W * ALIGN(height, 8) (= linear
* NV12 Y for column-aligned widths), UV plane is half.
* The kernel-reported destination_bytesperlines[0] is
* the COLUMN stride (ALIGN(height,8)*3/2), not the
* linear Y stride; using it × format_height gives the
* wrong intra-buffer UV offset (destination_offsets[1]
* derives from destination_sizes[0] in
* surface_fill_format_uniform).
*
* Use format_width/format_height (kernel-returned from
* G_FMT) not picture_width/height (caller request),
* because the kernel applies its own ALIGN rules; the
* UV plane location is keyed off the kernel layout.
*/
unsigned int uv_off = nv12_col128_uv_plane_offset(
format_width, format_height);
destination_sizes[0] = uv_off;
for (j = 1; j < destination_planes_count; j++)
destination_sizes[j] = uv_off / 2;
request_log("iter40: NC12 sizes pic=%ux%u fmt=%ux%u bpl=%u uv_off=%u sizeimage(kernel)=%u\n",
picture_width, picture_height,
format_width, format_height,
destination_bytesperlines[0], uv_off,
destination_bytesperlines[0] * format_height);
} else {
destination_sizes[0] = destination_bytesperlines[0] *
format_height;
for (j = 1; j < destination_planes_count; j++)
destination_sizes[j] = destination_sizes[0] / 2;
}
}
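Review note: the sizing argument in the NC12 comment can be checked numerically. The sketch below models only what the comment states: 128-pixel columns, Y size of `NUM_COLUMNS * TILE_W * ALIGN(height, 8)`, and a column stride of `ALIGN(height, 8) * 3 / 2`. It is not the implementation of `nv12_col128_uv_plane_offset`, which is not shown in this diff, and the kernel may apply further alignment; all names here are hypothetical.

```c
#include <assert.h>

#define COL128_TILE_W 128u
#define ALIGN_UP(x, a) ((((x) + (a) - 1u) / (a)) * (a))

/* Y-plane size (and therefore UV offset) per the comment's formula. */
static unsigned int col128_y_plane_size(unsigned int width,
					unsigned int height)
{
	unsigned int columns;

	assert(width > 0 && height > 0);
	columns = (width + COL128_TILE_W - 1u) / COL128_TILE_W;
	return columns * COL128_TILE_W * ALIGN_UP(height, 8u);
}

/* Column stride as described: rows of Y plus half as many rows of UV. */
static unsigned int col128_column_stride(unsigned int height)
{
	assert(height > 0);
	return ALIGN_UP(height, 8u) * 3u / 2u;
}
```

For 1280x720 the model gives a Y size of 921600 bytes (equal to linear NV12) and a column stride of 1080, whereas `stride * height` would give 777600, which illustrates why the old formula misplaces the UV plane.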
/*
@@ -460,6 +597,18 @@ VAStatus RequestCreateContext(VADriverContextP context, VAConfigID config_id,
* + ANNEX_B (only supported menu values per Phase 0 v4l2_inventory).
*/
{
/*
* iter40: per-driver HEVC start_code menu value. rkvdec /
* hantro path uses ANNEX_B + start-code-prepended payload.
* rpi-hevc-dec uses NONE, confirmed empirically in Phase 7
* (any other mode -> V4L2_BUF_FLAG_ERROR on every CAPTURE
* DQBUF, all-zero output). kdirect's strace also shows
* start_code=0 on rpi-hevc-dec. Both are accepted by the
* driver's QUERY_EXT_CTRL menu (min=0 max=1), but only NONE
* actually drives correct decode on the Pi.
*/
bool is_rpi = (driver_data->video_fd ==
driver_data->video_fd_rpi_hevc_dec);
struct v4l2_ext_control hevc_dev_ctrls[2] = {
{
.id = V4L2_CID_STATELESS_HEVC_DECODE_MODE,
@@ -467,7 +616,9 @@ VAStatus RequestCreateContext(VADriverContextP context, VAConfigID config_id,
},
{
.id = V4L2_CID_STATELESS_HEVC_START_CODE,
.value = V4L2_STATELESS_HEVC_START_CODE_ANNEX_B,
.value = is_rpi
? 0 /* V4L2_STATELESS_HEVC_START_CODE_NONE */
: V4L2_STATELESS_HEVC_START_CODE_ANNEX_B,
},
};
(void)v4l2_set_controls(driver_data->video_fd, -1,
@@ -500,18 +651,29 @@ VAStatus RequestCreateContext(VADriverContextP context, VAConfigID config_id,
* commit will replace this hardcoded assignment with a runtime
* read of the kernel's accepted START_CODE value.
*/
switch (config_object->profile) {
case VAProfileH264Main:
case VAProfileH264High:
case VAProfileH264ConstrainedBaseline:
case VAProfileH264MultiviewHigh:
case VAProfileH264StereoHigh:
case VAProfileHEVCMain:
context_object->h264_start_code = true;
break;
default:
context_object->h264_start_code = false;
break;
{
bool is_rpi = (driver_data->video_fd ==
driver_data->video_fd_rpi_hevc_dec);
switch (config_object->profile) {
case VAProfileH264Main:
case VAProfileH264High:
case VAProfileH264ConstrainedBaseline:
case VAProfileH264MultiviewHigh:
case VAProfileH264StereoHigh:
context_object->h264_start_code = true;
break;
case VAProfileHEVCMain:
/* iter40: rpi-hevc-dec rejects start-code-prepended
* payload (DQBUF error flag on every CAPTURE buffer).
* Gate to match the per-driver START_CODE menu value
* set above: NONE on rpi -> no prepend; ANNEX_B on
* rkvdec -> prepend. */
context_object->h264_start_code = !is_rpi;
break;
default:
context_object->h264_start_code = false;
break;
}
}
rc = v4l2_set_stream(driver_data->video_fd, output_type, true);
@@ -636,6 +798,8 @@ VAStatus RequestDestroyContext(VADriverContextP context, VAContextID context_id)
* The next CreateContext re-populates the cache.
*/
driver_data->fmt_valid = false;
/* iter39: clear 10-bit session flag — next CreateContext re-sets. */
driver_data->is_10bit = false;
return VA_STATUS_SUCCESS;
}
+53
@@ -827,10 +827,63 @@ int h264_set_controls(struct request_data *driver_data,
dpb_update(context, &surface->params.h264.picture);
/*
* Dump the raw VAAPI fields at the libva boundary so issue #8
* follow-up can disambiguate "ffmpeg-vaapi didn't populate" from
* "downstream consumer (daedalus_v4l2 wire protocol) corrupted the
* value". One-line; safe to leave in — costs a single printf per frame.
*/
request_log("h264_set_controls: VAProfile=%d seq_fields=0x%08x pic_fields=0x%08x num_ref_frames=%u bit_depth_luma_m8=%u bit_depth_chroma_m8=%u w_mbs_m1=%u h_mbs_m1=%u\n",
(int)profile,
surface->params.h264.picture.seq_fields.value,
surface->params.h264.picture.pic_fields.value,
surface->params.h264.picture.num_ref_frames,
surface->params.h264.picture.bit_depth_luma_minus8,
surface->params.h264.picture.bit_depth_chroma_minus8,
surface->params.h264.picture.picture_width_in_mbs_minus1,
surface->params.h264.picture.picture_height_in_mbs_minus1);
h264_va_picture_to_v4l2(driver_data, context, surface,
&surface->params.h264.picture,
&decode, &pps, &sps);
/*
* max_num_ref_frames fallback. Some VAAPI clients (older ffmpeg-vaapi
* paths, some daedalus_v4l2 consumers) leave VAPicture->num_ref_frames
* at zero. Hardware decoders tolerate; libavcodec-via-daedalus enforces
* sps.max_num_ref_frames strictly and rejects every frame.
*
* Count valid DPB entries first (the bitstream-true reference count we
* can see); fall back to a per-profile spec minimum if even that is 0.
* See marfrit/libva-v4l2-request-fourier issue #8.
*/
if (sps.max_num_ref_frames == 0) {
unsigned int valid = 0;
unsigned int i;
for (i = 0; i < 16; i++) {
const VAPictureH264 *ref =
&surface->params.h264.picture.ReferenceFrames[i];
if (!(ref->flags & VA_PICTURE_H264_INVALID))
valid++;
}
if (valid > 0) {
sps.max_num_ref_frames = (uint8_t)valid;
} else {
switch (profile) {
case VAProfileH264ConstrainedBaseline:
sps.max_num_ref_frames = 1;
break;
case VAProfileH264Main:
case VAProfileH264High:
case VAProfileH264MultiviewHigh:
case VAProfileH264StereoHigh:
default:
sps.max_num_ref_frames = 4;
break;
}
}
}
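Review note: the fallback order described above (VAAPI value, then count of valid DPB entries, then a per-profile minimum) can be isolated into a pure function for testing. A minimal sketch, with local stand-in types: `REF_INVALID` mirrors the role of `VA_PICTURE_H264_INVALID`, and `enum profile_kind` and `fallback_max_num_ref_frames` are hypothetical names, not the driver's.

```c
#include <assert.h>
#include <stdint.h>

#define REF_INVALID 0x1u /* stand-in for VA_PICTURE_H264_INVALID */
#define REF_SLOTS 16     /* length of ReferenceFrames[] scanned above */

enum profile_kind { PROFILE_CONSTRAINED_BASELINE, PROFILE_OTHER };

/*
 * Pick max_num_ref_frames: keep the client's value when non-zero,
 * otherwise count valid reference slots, otherwise use a per-profile
 * minimum (1 for Constrained Baseline, 4 for everything else).
 */
static uint8_t fallback_max_num_ref_frames(uint8_t from_vaapi,
					   const uint32_t flags[REF_SLOTS],
					   enum profile_kind profile)
{
	unsigned int valid = 0;
	unsigned int i;

	assert(flags != 0);
	if (from_vaapi != 0)
		return from_vaapi;

	for (i = 0; i < REF_SLOTS; i++) {
		if (!(flags[i] & REF_INVALID))
			valid++;
	}
	if (valid > 0)
		return (uint8_t)valid;

	return profile == PROFILE_CONSTRAINED_BASELINE ? 1 : 4;
}
```

Separating the decision from the control-building code makes the three branches easy to exercise without a decoder attached.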
/*
* Populate the scaling matrix unconditionally: from VAAPI's
* VAIQMatrixBufferH264 when the consumer sent one this frame
+189 -10
@@ -83,6 +83,18 @@
#include "hevc-ctrls/v4l2-hevc-ext-controls.h"
#include "h265_parser/gst/codecparsers/gsth265parser.h"
/*
* VAAPI source arrays for HEVC ref/weight tables are sized 15
* (VASliceParameterBufferHEVC::RefPicList[2][15],
* delta_luma_weight_l0[15], luma_offset_l0[15], etc.; see
* /usr/include/va/va_dec_hevc.h). V4L2_HEVC_DPB_ENTRIES_NUM_MAX
* is 16; iterating to that bound over-reads the VAAPI source by
* one element. Hidden by -O3 unrolling but manifests as a SEGV
* under -O2 vectorisation (regression discovered in package
* builds 2026-05-17). Cap all per-ref/weight loops at this.
*/
#define VA_HEVC_REF_LIST_LEN 15
#include "utils.h"
#include "v4l2.h"
@@ -465,13 +477,21 @@ static void h265_fill_slice_params(VAPictureParameterBufferHEVC *picture,
/* Q2: slice_segment_addr from VAAPI (was missing in old h265.c). */
slice_params->slice_segment_addr = slice->slice_segment_address;
/* Ref index arrays (DPB indices). For I-slices both are unused. */
for (i = 0; i < V4L2_HEVC_DPB_ENTRIES_NUM_MAX &&
/*
* Ref index arrays (DPB indices). For I-slices both are unused.
*
* Cap iteration at VAAPI source size (15): V4L2_HEVC_DPB_ENTRIES_NUM_MAX
* is 16, but VASliceParameterBufferHEVC::RefPicList is RefPicList[2][15].
* Iterating to 16 reads one past the source array; with -O2 GCC vectorises
* the copy and the over-read produces a real SEGV (manifested in package
* builds with Arch makepkg CFLAGS; plain -O3 release builds hid it).
*/
for (i = 0; i < VA_HEVC_REF_LIST_LEN &&
slice_type != V4L2_HEVC_SLICE_TYPE_I; i++) {
if (i < (slice->num_ref_idx_l0_active_minus1 + 1U))
slice_params->ref_idx_l0[i] = slice->RefPicList[0][i];
}
for (i = 0; i < V4L2_HEVC_DPB_ENTRIES_NUM_MAX &&
for (i = 0; i < VA_HEVC_REF_LIST_LEN &&
slice_type == V4L2_HEVC_SLICE_TYPE_B; i++) {
if (i < (slice->num_ref_idx_l1_active_minus1 + 1U))
slice_params->ref_idx_l1[i] = slice->RefPicList[1][i];
@@ -503,7 +523,9 @@ static void h265_fill_slice_params(VAPictureParameterBufferHEVC *picture,
slice_params->pred_weight_table.delta_chroma_log2_weight_denom =
slice->delta_chroma_log2_weight_denom;
for (i = 0; i < V4L2_HEVC_DPB_ENTRIES_NUM_MAX &&
/* Pred weight tables — cap at VAAPI source array size (15), same
* reason as the RefPicList loops above. */
for (i = 0; i < VA_HEVC_REF_LIST_LEN &&
slice_type != V4L2_HEVC_SLICE_TYPE_I; i++) {
slice_params->pred_weight_table.delta_luma_weight_l0[i] =
slice->delta_luma_weight_l0[i];
@@ -516,7 +538,7 @@ static void h265_fill_slice_params(VAPictureParameterBufferHEVC *picture,
slice->ChromaOffsetL0[i][j];
}
}
for (i = 0; i < V4L2_HEVC_DPB_ENTRIES_NUM_MAX &&
for (i = 0; i < VA_HEVC_REF_LIST_LEN &&
slice_type == V4L2_HEVC_SLICE_TYPE_B; i++) {
slice_params->pred_weight_table.delta_luma_weight_l1[i] =
slice->delta_luma_weight_l1[i];
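Review note: the over-read fixed in these hunks comes from looping to the destination's length (16) over a source of length 15. A minimal sketch of one way to make the bound self-checking, using stand-in structs rather than the real VAAPI and V4L2 types; `src_tables`, `dst_tables` and `copy_weights` are hypothetical names and the lengths are taken from the comment.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SRC_LEN 15 /* VAAPI-side array length, per the comment */
#define DST_LEN 16 /* V4L2_HEVC_DPB_ENTRIES_NUM_MAX, per the comment */

#define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))
#define MIN_SIZE(a, b) ((a) < (b) ? (a) : (b))

struct src_tables { int8_t delta_luma_weight[SRC_LEN]; };
struct dst_tables { int8_t delta_luma_weight[DST_LEN]; };

/* Fail the build if the source ever outgrows the destination. */
_Static_assert(SRC_LEN <= DST_LEN, "source table larger than destination");

/*
 * Copy as many entries as both arrays can hold and zero the rest of
 * the destination. Returns the number of entries copied.
 */
static size_t copy_weights(struct dst_tables *dst,
			   const struct src_tables *src)
{
	size_t n = MIN_SIZE(ARRAY_SIZE(src->delta_luma_weight),
			    ARRAY_SIZE(dst->delta_luma_weight));
	size_t i;

	assert(dst != NULL && src != NULL);
	memset(dst, 0, sizeof(*dst));
	for (i = 0; i < n; i++)
		dst->delta_luma_weight[i] = src->delta_luma_weight[i];
	return n;
}
```

Deriving the bound from `sizeof` on both arrays keeps the loop correct even if either header changes its array length later; the named constant in the diff achieves the same result as long as it is kept in sync by hand.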
@@ -757,6 +779,100 @@ static int h265_populate_ext_sps_rps_cache(struct request_data *driver_data,
return err;
}
/*
* iter40b: parse SPS NAL from source_data to populate the
* VAAPI-omitted v4l2_ctrl_hevc_sps fields (max_num_reorder_pics,
* max_latency_increase_plus1, sps_max_sub_layers_minus1, and
* sps_max_dec_pic_buffering_minus1 at the right sublayer index).
*
* Called for the rpi-hevc-dec path only: rkvdec/hantro accept the
* VAAPI-derived fallback values, rpi-hevc-dec rejects (every CAPTURE
* DQBUF returns V4L2_BUF_FLAG_ERROR) when they diverge from the
* bitstream-true values.
*
* Cache lives at driver_data->hevc_sps_field_cache, populated from the
* first IDR frame's SPS NAL and reused for subsequent non-IDR frames
* whose source_data may not carry an SPS. Same lifecycle as
* hevc_rps_cache_*.
*
* Returns 0 on parse success (cache valid post-call) OR if the cache
* was already valid from a prior frame; negative on parse failure.
*/
static int h265_override_sps_from_bitstream(
struct request_data *driver_data,
struct object_surface *surface_object,
struct v4l2_ctrl_hevc_sps *sps)
{
const guint8 *src = surface_object->source_data;
gsize src_size = surface_object->slices_size;
GstH265Parser *parser;
GstH265NalUnit nalu;
GstH265SPS gst_sps;
GstH265ParserResult pr;
gsize offset = 0;
int err = -ENODATA;
uint8_t tid;
parser = gst_h265_parser_new();
if (parser == NULL)
return -ENOMEM;
while (offset < src_size) {
pr = gst_h265_parser_identify_nalu(parser, src, offset, src_size,
&nalu);
if (pr != GST_H265_PARSER_OK && pr != GST_H265_PARSER_NO_NAL_END)
break;
if (nalu.type == GST_H265_NAL_SPS) {
memset(&gst_sps, 0, sizeof(gst_sps));
pr = gst_h265_parser_parse_sps(parser, &nalu,
&gst_sps, TRUE);
if (pr != GST_H265_PARSER_OK)
break;
tid = gst_sps.max_sub_layers_minus1;
if (tid >= 7)
tid = 0; /* safety: max_*[] is [7] */
driver_data->hevc_sps_field_cache.sps_max_sub_layers_minus1 =
gst_sps.max_sub_layers_minus1;
driver_data->hevc_sps_field_cache.max_dec_pic_buffering_minus1 =
gst_sps.max_dec_pic_buffering_minus1[tid];
driver_data->hevc_sps_field_cache.max_num_reorder_pics =
gst_sps.max_num_reorder_pics[tid];
driver_data->hevc_sps_field_cache.max_latency_increase_plus1 =
gst_sps.max_latency_increase_plus1[tid];
driver_data->hevc_sps_field_cache.scaling_list_enabled =
gst_sps.scaling_list_enabled_flag;
driver_data->hevc_sps_field_cache.scaling_list_data_present =
gst_sps.scaling_list_data_present_flag;
driver_data->hevc_sps_field_cache.valid = true;
err = 0;
break;
}
offset = nalu.offset + nalu.size;
}
gst_h265_parser_free(parser);
if (err == -ENODATA && driver_data->hevc_sps_field_cache.valid)
err = 0;
if (err == 0 && driver_data->hevc_sps_field_cache.valid) {
sps->sps_max_sub_layers_minus1 =
driver_data->hevc_sps_field_cache.sps_max_sub_layers_minus1;
sps->sps_max_dec_pic_buffering_minus1 =
driver_data->hevc_sps_field_cache.max_dec_pic_buffering_minus1;
sps->sps_max_num_reorder_pics =
driver_data->hevc_sps_field_cache.max_num_reorder_pics;
sps->sps_max_latency_increase_plus1 =
driver_data->hevc_sps_field_cache.max_latency_increase_plus1;
}
return err;
}
int h265_set_controls(struct request_data *driver_data,
struct object_context *context_object,
struct object_surface *surface_object)
@@ -810,6 +926,50 @@ int h265_set_controls(struct request_data *driver_data,
}
h265_fill_sps(picture, &sps);
/*
* iter40b: rpi-hevc-dec validates SPS fields VAAPI doesn't
* forward (sps_max_num_reorder_pics, sps_max_latency_increase_plus1)
* against bitstream-true values and rejects the frame when our
* §A.4.2 spec-legal fallback diverges. Parse the SPS NAL from
* source_data and override. Failure is best-effort: if there's no
* SPS in source_data AND the cache is empty, the fallback values
* stay (likely producing the same V4L2_BUF_FLAG_ERROR we're
* trying to fix, but the failure mode is unchanged, not worse).
*/
{
bool is_rpi = (driver_data->video_fd ==
driver_data->video_fd_rpi_hevc_dec);
if (is_rpi) {
/*
* iter40b: tried SPS NAL parse from source_data, but
* ffmpeg-vaapi doesn't include SPS bytes in the
* slice_data buffer (only slice NALs). The parse
* returns -ENODATA every frame, cache stays empty.
*
* Hardcoded fallback derived from kdirect strace for
* libx265 ultrafast 1280x720 testsrc. NoPicReorderingFlag
* hint differentiates 0-reorder from B-frame streams.
* For Phase 7 fixtures the (2, 4) values match kdirect
* bit-exact, which shows the SPS divergence axis is closed.
*
* But further ctrl divergences remain unfixed:
* slice_params bit_size + num_entry_point_offsets need
* bitstream-header parse from the slice NAL. Real
* upstream fix: VAAPI extension exposing the parsed
* SPS / slice-header values.
*/
(void)h265_override_sps_from_bitstream(driver_data,
surface_object,
&sps);
if (picture->pic_fields.bits.NoPicReorderingFlag) {
sps.sps_max_num_reorder_pics = 0;
sps.sps_max_latency_increase_plus1 = 0;
} else {
sps.sps_max_num_reorder_pics = 2;
sps.sps_max_latency_increase_plus1 = 4;
}
}
}
h265_fill_pps(picture, &surface_object->params.h265.slices[0], &pps);
h265_fill_decode_params(driver_data, picture, &decode_params);
h265_fill_scaling_matrix(iqmatrix, iqmatrix_set, &scaling_matrix);
@@ -854,11 +1014,30 @@ int h265_set_controls(struct request_data *driver_data,
.ptr = slice_params_array,
.size = sizeof(struct v4l2_ctrl_hevc_slice_params) * num_slices,
};
controls[n++] = (struct v4l2_ext_control){
.id = V4L2_CID_STATELESS_HEVC_SCALING_MATRIX,
.ptr = &scaling_matrix,
.size = sizeof(scaling_matrix),
};
/*
* iter40b: rpi-hevc-dec's per-frame ctrl set is 4 (no
* scaling_matrix when SPS doesn't enable it). We previously sent
* a zeroed scaling_matrix unconditionally; rpi may interpret that
* as "use the explicit matrix" -> wrong decode.
*
* Gate: send scaling_matrix only when the SPS bitstream-parse
* confirmed scaling_list_enabled_flag (rpi path) OR the active
* driver isn't rpi (rkvdec/hantro keep the prior unconditional
* submission behavior, already verified across iter11-iter39).
*/
{
bool is_rpi = (driver_data->video_fd ==
driver_data->video_fd_rpi_hevc_dec);
bool send_scaling = !is_rpi ||
driver_data->hevc_sps_field_cache.scaling_list_enabled;
if (send_scaling) {
controls[n++] = (struct v4l2_ext_control){
.id = V4L2_CID_STATELESS_HEVC_SCALING_MATRIX,
.ptr = &scaling_matrix,
.size = sizeof(scaling_matrix),
};
}
}
controls[n++] = (struct v4l2_ext_control){
.id = V4L2_CID_STATELESS_HEVC_DECODE_PARAMS,
.ptr = &decode_params,
+172 -71
@@ -39,6 +39,8 @@
#include <linux/dma-buf.h>
#include "nv15.h"
#include "nv12_col128.h"
#include "tiled_yuv.h"
#include "utils.h"
#include "v4l2.h"
@@ -86,13 +88,50 @@ VAStatus RequestCreateImage(VADriverContextP context, VAImageFormat *format,
for (i = 0; i < planes_count; i++)
size += destination_sizes[i];
/* Here we calculate the sizes assuming NV12. */
if (format->fourcc == VA_FOURCC_P010) {
/*
* iter39: P010 image overrides V4L2-side NV15 sizing. The
* source is the kernel-reported NV15 packed plane; the image
* buffer holds dense P010 (2 bytes per pixel, 16bpp).
* Recompute sizes/pitches against P010 layout so consumers
* (vaGetImage, vaDeriveImage) see standard P010 geometry.
*/
destination_bytesperlines[0] = width * 2;
destination_sizes[0] = destination_bytesperlines[0] * format_height;
for (i = 1; i < destination_planes_count; i++) {
destination_bytesperlines[i] = destination_bytesperlines[0];
destination_sizes[i] = destination_sizes[0] / 2;
}
size = 0;
for (i = 0; i < destination_planes_count; i++)
size += destination_sizes[i];
} else if (format->fourcc == VA_FOURCC_NV12 &&
video_format->v4l2_format == V4L2_PIX_FMT_NV12_COL128) {
/*
* iter40 Phase 5 review F2: NC12 source, NV12 image output.
* V4L2-reported destination_bytesperlines[0] is the NC12
* column stride (= ALIGN(height,8) * 3/2, e.g. 1080 for
* 1280×720), NOT the linear NV12 Y stride. Override to the
* linear stride (width) so VAImage pitches reflect the
* detile-output layout the consumer reads.
*/
destination_bytesperlines[0] = width;
destination_sizes[0] = destination_bytesperlines[0] * format_height;
for (i = 1; i < destination_planes_count; i++) {
destination_bytesperlines[i] = destination_bytesperlines[0];
destination_sizes[i] = destination_sizes[0] / 2;
}
size = 0;
for (i = 0; i < destination_planes_count; i++)
size += destination_sizes[i];
} else {
/* NV12: V4L2 stride is correct, sizes derived from height. */
destination_sizes[0] = destination_bytesperlines[0] * format_height;
destination_sizes[0] = destination_bytesperlines[0] * format_height;
for (i = 1; i < destination_planes_count; i++) {
destination_bytesperlines[i] = destination_bytesperlines[0];
destination_sizes[i] = destination_sizes[0] / 2;
for (i = 1; i < destination_planes_count; i++) {
destination_bytesperlines[i] = destination_bytesperlines[0];
destination_sizes[i] = destination_sizes[0] / 2;
}
}
id = object_heap_allocate(&driver_data->image_heap);
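The three sizing branches above share one shape: a luma pitch, a luma size, and a chroma plane that reuses the pitch at half the size. A minimal sketch of that arithmetic follows, assuming a two-plane 4:2:0 layout; the struct and function names are hypothetical and not part of the driver.

```c
#include <assert.h>

/* Hypothetical container for the values RequestCreateImage derives. */
struct sketch_geometry {
	unsigned int pitch[2];
	unsigned int size[2];
	unsigned int total;
};

/*
 * bytes_per_sample is 1 for linear NV12 and 2 for dense P010.
 * The chroma plane is interleaved CbCr at half height, so it keeps
 * the luma pitch and takes half the luma size.
 */
struct sketch_geometry sketch_compute_geometry(unsigned int width,
					       unsigned int height,
					       unsigned int bytes_per_sample)
{
	struct sketch_geometry g;

	assert(bytes_per_sample == 1 || bytes_per_sample == 2);

	g.pitch[0] = width * bytes_per_sample;
	g.size[0] = g.pitch[0] * height;
	g.pitch[1] = g.pitch[0];
	g.size[1] = g.size[0] / 2;
	g.total = g.size[0] + g.size[1];
	return g;
}
```

For a 1280x720 image this gives a total of 1382400 bytes for NV12 and 2764800 bytes for P010, which is why the P010 branch cannot reuse the kernel-reported NV15 sizes.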
@@ -216,63 +255,91 @@ static VAStatus copy_surface_to_image (struct request_data *driver_data,
}
}
/*
* AV1 film_grain: when this surface is the display surface of a
* decode (current_display_picture != current_frame with apply_grain=1),
* its slot is NULL because BeginPicture only fired on the decode
* surface. Follow the back-link set in av1_set_controls and borrow
* the decode surface's destination_data + sizes for the copy.
*/
if (surface_object->current_slot == NULL &&
surface_object->linked_decode_surface_id != VA_INVALID_SURFACE) {
struct object_surface *decode_surface =
SURFACE(driver_data,
surface_object->linked_decode_surface_id);
if (decode_surface != NULL &&
decode_surface->current_slot != NULL) {
/* Mirror the fields we read below. The surface heap
* pointer is stable for the surface's lifetime; we
* only need destination_data + destination_sizes +
* destination_planes_count from it. */
surface_object->destination_planes_count =
decode_surface->destination_planes_count;
for (i = 0; i < decode_surface->destination_planes_count; i++) {
surface_object->destination_data[i] =
decode_surface->destination_data[i];
surface_object->destination_sizes[i] =
decode_surface->destination_sizes[i];
}
}
}
for (i = 0; i < surface_object->destination_planes_count; i++) {
/* AV1 Phase 3 diag: surface NULL-deref hunt. */
if (buffer_object->data == NULL ||
surface_object->destination_data[i] == NULL) {
request_log("copy_surface_to_image NULL i=%u "
"buf_data=%p dest_data=%p dest_size=%u "
"planes=%u slot=%p linked=0x%x\n",
i, (void *)buffer_object->data,
(void *)surface_object->destination_data[i],
surface_object->destination_sizes[i],
surface_object->destination_planes_count,
(void *)surface_object->current_slot,
surface_object->linked_decode_surface_id);
return VA_STATUS_ERROR_OPERATION_FAILED;
}
#ifdef __arm__
/*
* iter40 Phase 5 review F1: guard extended from __arm__ to
* __arm__ || __aarch64__. Without this, the detile primitives
* silently compiled out on aarch64 (fresnel RK3399, ampere
* RK3588, higgs Pi CM5) and the memcpy fall-through delivered
* raw tiled bytes to NV12/P010 image consumers. iter39 5/5
* PASS masked the issue because no 10-bit path was exercised.
*/
#if defined(__arm__) || defined(__aarch64__)
/*
* Sunxi tiled_to_planar lives in tiled_yuv.S, which is wrapped in
* #ifdef __arm__, so the symbol is absent on aarch64. Keep this
* branch arm-only; aarch64 Sunxi support would need a C or
* aarch64-ASM port (no Sunxi aarch64 board in current fleet).
*/
#if defined(__arm__)
if (!video_format_is_linear(driver_data->video_format))
tiled_to_planar(surface_object->destination_data[i],
buffer_object->data + image->offsets[i],
image->pitches[i], image->width,
i == 0 ? image->height :
image->height / 2);
else {
else
#endif
if (driver_data->is_10bit &&
image->format.fourcc == VA_FOURCC_P010) {
/*
* iter39: rkvdec emits NV15 (4×10-bit packed in 5
* bytes); the VA image buffer is dense P010 (2B/pixel,
* value in bits[15:6]). Source stride is the V4L2-
* reported NV15 bytesperline (= ceil(width/4)*5,
* possibly aligned higher by the kernel); destination
* stride is image->pitches[i] = width * 2.
*/
unsigned int plane_h = (i == 0) ? image->height
: image->height / 2;
nv15_unpack_plane_to_p010(
surface_object->destination_data[i],
(uint16_t *)(buffer_object->data + image->offsets[i]),
image->width, plane_h,
surface_object->destination_bytesperlines[i]);
} else if (driver_data->video_format != NULL &&
driver_data->video_format->v4l2_format ==
V4L2_PIX_FMT_NV12_COL128 &&
image->format.fourcc == VA_FOURCC_NV12) {
/*
* iter40: Pi 5 rpi-hevc-dec emits NV12_COL128 (SAND
* 128-pixel-wide column tiles). Detile to linear NV12
* via the per-plane primitive. surface_object->
* destination_data[i] is the V4L2 CAPTURE mmap (single
* buffer, planes_count==2): i==0 is the Y plane base,
* i==1 is the UV plane base offset within the SAME
* physical buffer (per cap_pool plane[1] offset = Y
* plane size in COL128 layout).
*
* src_col_stride = destination_bytesperlines[i] = the
* kernel-reported NC12 bytesperline (column stride,
* = ALIGN(image_h, 8) * 3/2). Same for both planes
* since column geometry is plane-agnostic.
*
* dst stride is image->pitches[i] = image->width
* (overridden in RequestCreateImage NC12 branch above).
*/
if (i == 0) {
nv12_col128_detile_y(
(uint8_t *)(buffer_object->data + image->offsets[i]),
image->pitches[i],
surface_object->destination_data[i],
surface_object->destination_bytesperlines[i],
image->width, image->height);
} else {
nv12_col128_detile_uv(
(uint8_t *)(buffer_object->data + image->offsets[i]),
image->pitches[i],
surface_object->destination_data[i],
surface_object->destination_bytesperlines[i],
image->width, image->height / 2);
}
} else {
#endif
memcpy(buffer_object->data + image->offsets[i],
surface_object->destination_data[i],
surface_object->destination_sizes[i]);
#ifdef __arm__
#if defined(__arm__) || defined(__aarch64__)
}
#endif
}
@@ -311,9 +378,17 @@ VAStatus RequestDeriveImage(VADriverContextP context, VASurfaceID surface_id,
/* Fully populate VAImageFormat to match QueryImageFormats output. */
memset(&format, 0, sizeof(format));
format.fourcc = VA_FOURCC_NV12;
format.byte_order = VA_LSB_FIRST;
format.bits_per_pixel = 12;
if (driver_data->is_10bit) {
/* iter39: 10-bit session derives a P010 image. NV15-source
* unpack happens in copy_surface_to_image. */
format.fourcc = VA_FOURCC_P010;
format.byte_order = VA_LSB_FIRST;
format.bits_per_pixel = 24;
} else {
format.fourcc = VA_FOURCC_NV12;
format.byte_order = VA_LSB_FIRST;
format.bits_per_pixel = 12;
}
status = RequestCreateImage(context, &format, surface_object->width,
surface_object->height, image);
@@ -348,26 +423,52 @@ VAStatus RequestDeriveImage(VADriverContextP context, VASurfaceID surface_id,
VAStatus RequestQueryImageFormats(VADriverContextP context,
VAImageFormat *formats, int *formats_count)
{
struct request_data *driver_data = context->pDriverData;
int n = 0;
/*
* Populate the VAImageFormat fully per VAAPI spec for NV12,
* not just .fourcc. Consumers (FFmpeg's hwcontext_vaapi, mpv,
* Firefox) read .byte_order and .bits_per_pixel; leaving them
* uninitialized inherits whatever caller-stack garbage is in
* the buffer and produces non-deterministic behavior. Reference:
* Mesa's gallium/frontends/va/image.c::vlVaQueryImageFormats and
* intel-vaapi-driver's i965_drv_video.c both publish NV12
* with byte_order=VA_LSB_FIRST and bits_per_pixel=12.
* Populate the VAImageFormat fully per VAAPI spec, not just
* .fourcc. Consumers (FFmpeg's hwcontext_vaapi, mpv, Firefox)
* read .byte_order and .bits_per_pixel; leaving them
* uninitialized inherits caller-stack garbage and produces
* non-deterministic behavior. Reference: Mesa's
* gallium/frontends/va/image.c::vlVaQueryImageFormats and
* intel-vaapi-driver's i965_drv_video.c.
*
* For YUV formats, depth/red_mask/green_mask/blue_mask/alpha_mask
* are not meaningful (those describe RGB bit layouts); leave them
* zeroed via memset before populating.
* iter39: advertise P010 when an active session is 10-bit so
* ffmpeg-vaapi sees a valid 10-bit-compatible entry during
* vaQueryImageFormats. NV12 stays advertised unconditionally so
* the 8-bit catalog query response is unchanged.
*/
memset(&formats[0], 0, sizeof(formats[0]));
formats[0].fourcc = VA_FOURCC_NV12;
formats[0].byte_order = VA_LSB_FIRST;
formats[0].bits_per_pixel = 12;
*formats_count = 1;
memset(&formats[n], 0, sizeof(formats[n]));
formats[n].fourcc = VA_FOURCC_NV12;
formats[n].byte_order = VA_LSB_FIRST;
formats[n].bits_per_pixel = 12;
n++;
/*
* iter39 Option B revert (2026-05-17): P010 advertisement is
* gated on driver_data->is_10bit again. Previously advertised
* unconditionally (63fed87) so ffmpeg-vaapi's early
* vaQueryImageFormats (pre-vaCreateContext) could see it for
* 10-bit profiles, but that broke HEVC 8-bit on fresnel:
* ffmpeg-vaapi picked P010 for the HEVC hwframe pool, EndPicture
* SEGV'd in the .so when the consumer-side P010 expectations met
* an 8-bit NV12 CAPTURE buffer.
* Safe because Option B drops VAProfileHEVCMain10 + Hi10P from
* enumeration: no 10-bit decode pipeline will reach this catalog
* query, so the gate-on-is_10bit (which stays false for 8-bit
* profiles) correctly returns NV12-only.
*/
if (driver_data->is_10bit && n < V4L2_REQUEST_MAX_IMAGE_FORMATS) {
memset(&formats[n], 0, sizeof(formats[n]));
formats[n].fourcc = VA_FOURCC_P010;
formats[n].byte_order = VA_LSB_FIRST;
formats[n].bits_per_pixel = 24;
n++;
}
*formats_count = n;
return VA_STATUS_SUCCESS;
}
+7 -1
@@ -22,6 +22,9 @@
autoconf_data = configuration_data()
autoconf_data.set('VA_DRIVER_INIT_FUNC', va_driver_init_func)
if get_option('daedalus_v4l2')
autoconf_data.set('HAVE_DAEDALUS_V4L2', 1)
endif
autoconf = configure_file(
output: 'autoconfig.h',
@@ -52,6 +55,8 @@ sources = [
'vp9.c',
'av1.c',
'codec.c',
'nv15.c',
'nv12_col128.c',
# Vendored GStreamer 1.28.2 H.265 parser + utilities (LGPL v2.1+,
# see src/h265_parser/gst_compat.h for sourcing notes + per-iter2
@@ -86,8 +91,9 @@ headers = [
'h265.h',
'vp8.h',
'vp9.h',
'av1.h',
'codec.h',
'nv15.h',
'nv12_col128.h',
# Internal mirror of Linux 7.0 V4L2 HEVC EXT_SPS_*_RPS UAPI defs
# (allows building against pre-7.0 linux-api-headers; redundant
+114
@@ -0,0 +1,114 @@
/*
* V4L2_PIX_FMT_NV12_COL128 -> linear NV12 detile primitive. Pi 5 / CM5
* rpi-hevc-dec CAPTURE. iter40 (2026-05-17).
*
* Math derived from kernel hevc_d_video.c (size formula) +
* ffmpeg/Kynesim libavutil/rpi_sand_fn_pw.h (per-pixel offset). The
* single-stripe fast path memcpy's 128 bytes at a time when an output
* row falls entirely within one tile column (the common case);
* straddling rows are split into two memcpy halves.
*
* No NEON / SIMD here: correctness first. Each output row generates
* ceil(width / 128) memcpys of up to 128 bytes; for 1920x1080 that's
* 15 per row, i.e. ~16000 for the Y plane plus ~8000 for UV per frame,
* fine for Phase 1 PoC.
*/
#include "nv12_col128.h"
#include <string.h>
/*
* Tile column width in bytes. The 'COL128' name embeds this; if it ever
* varies, take it from V4L2_PIX_FMT_NV12_COL128's kernel definition.
*/
#define NC12_TILE_W 128
/*
* Common Y / UV plane detile: the layout is identical (single-byte per
* pixel, column-major 128-wide tiles). The only thing that varies is
* what plane the caller passes in. width here is plane width in bytes
* (= image width for both Y and CbCr-interleaved NV12 UV); height is
* plane height in pixels (image height for Y, image height / 2 for UV).
*/
static void nv12_col128_detile_plane(uint8_t *dst, unsigned int dst_stride,
const uint8_t *src,
unsigned int src_col_stride,
unsigned int width, unsigned int height)
{
unsigned int y, x;
for (y = 0; y < height; y++) {
uint8_t *drow = dst + y * dst_stride;
x = 0;
while (x < width) {
unsigned int col = x / NC12_TILE_W;
unsigned int in_col = x % NC12_TILE_W;
unsigned int n = NC12_TILE_W - in_col;
if (n > width - x)
n = width - x;
/*
* Source byte = base + col*128*col_stride + y*128 + in_col
* Copy n contiguous bytes (all within this tile column,
* since n is capped at the remaining width-in-column).
*/
const uint8_t *p = src
+ (size_t)col * NC12_TILE_W * src_col_stride
+ (size_t)y * NC12_TILE_W
+ in_col;
memcpy(drow + x, p, n);
x += n;
}
}
}
void nv12_col128_detile_y(uint8_t *dst, unsigned int dst_stride,
const uint8_t *src_y, unsigned int src_col_stride,
unsigned int width, unsigned int height)
{
nv12_col128_detile_plane(dst, dst_stride, src_y, src_col_stride,
width, height);
}
void nv12_col128_detile_uv(uint8_t *dst, unsigned int dst_stride,
const uint8_t *src_uv, unsigned int src_col_stride,
unsigned int width, unsigned int uv_height)
{
/* UV plane (CbCr interleaved): byte-width equals Y-plane width
* (one Cb + one Cr per 2x2 Y block -> 2 bytes per 2 horizontal Y
* samples -> 1 byte per Y pixel horizontally). Height is half. */
nv12_col128_detile_plane(dst, dst_stride, src_uv, src_col_stride,
width, uv_height);
}
unsigned int nv12_col128_uv_plane_offset(unsigned int image_width,
unsigned int image_height)
{
unsigned int aligned_h = (image_height + 7) & ~7u;
/*
* In the COL128 SAND layout, Y and UV are NOT separate planes
* concatenated end-to-end. Within EACH 128-pixel-wide column:
* first 128 * height bytes = Y data for this column strip
* next 128 * height / 2 bytes = UV data for this column strip
* total 128 * bytesperline (= 128 * height * 3/2) bytes per column
*
* The "UV plane base" pointer (data[1] in AVFrame convention) is
* just data[0] + (128 * height), i.e. the offset of the UV bytes
* WITHIN the first column. All subsequent UV bytes are reached by
* the same column-stride arithmetic the Y plane uses (col *
* 128 * bytesperline + y * 128 + in_col), so passing this offset
* pointer + iterating y over [0, height/2) traverses all UV rows
* across all columns correctly.
*
* Earlier wrong formula was num_columns * 128 * aligned_h (i.e.
* sizeof(linear Y plane)), which pushed past the end of the SAND
* buffer because the layout isn't planes-end-to-end.
*
* Cross-check: kernel sizeimage = bytesperline * width =
* (aligned_h * 3/2) * num_columns * 128 = num_columns * 128 *
* aligned_h * 3/2. Per column: 128 * aligned_h * 3/2. Y portion
* per column: 128 * aligned_h. UV portion per column: half of Y.
* Sum across columns: matches sizeimage.
*/
return NC12_TILE_W * aligned_h;
}
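The column-stride arithmetic described in the comments above can be checked with a round trip: tile a synthetic plane byte by byte using the offset formula, detile it with the row-wise memcpy loop, and compare against the linear original. This is a minimal sketch, assuming the layout exactly as stated in those comments; all `sketch_` names are hypothetical and independent of the driver's own functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define SKETCH_TILE_W 128u /* hypothetical name; mirrors NC12_TILE_W */

/* Byte offset of plane-relative (x, y): col*128*col_stride + y*128 + in_col. */
size_t sketch_nc12_offset(unsigned int x, unsigned int y,
			  unsigned int col_stride)
{
	return (size_t)(x / SKETCH_TILE_W) * SKETCH_TILE_W * col_stride +
	       (size_t)y * SKETCH_TILE_W + (x % SKETCH_TILE_W);
}

/* Detile one plane row by row; each memcpy stays inside one tile column. */
void sketch_nc12_detile(uint8_t *dst, unsigned int dst_stride,
			const uint8_t *src, unsigned int col_stride,
			unsigned int width, unsigned int height)
{
	unsigned int x, y, n;

	for (y = 0; y < height; y++) {
		for (x = 0; x < width; x += n) {
			n = SKETCH_TILE_W - (x % SKETCH_TILE_W);
			if (n > width - x)
				n = width - x;
			memcpy(dst + (size_t)y * dst_stride + x,
			       src + sketch_nc12_offset(x, y, col_stride), n);
		}
	}
}

/* Returns 1 if tiling then detiling a synthetic Y plane is lossless. */
int sketch_nc12_roundtrip_ok(unsigned int width, unsigned int height)
{
	unsigned int aligned_h = (height + 7u) & ~7u;
	unsigned int col_stride = aligned_h * 3u / 2u;
	unsigned int cols = (width + SKETCH_TILE_W - 1u) / SKETCH_TILE_W;
	size_t tiled_size = (size_t)cols * SKETCH_TILE_W * col_stride;
	uint8_t *linear = malloc((size_t)width * height);
	uint8_t *tiled = calloc(tiled_size, 1);
	uint8_t *out = malloc((size_t)width * height);
	unsigned int x, y;
	int ok = 0;

	if (linear && tiled && out) {
		for (y = 0; y < height; y++) {
			for (x = 0; x < width; x++) {
				uint8_t v = (uint8_t)(x * 7u + y * 13u);
				size_t off = sketch_nc12_offset(x, y, col_stride);

				assert(off < tiled_size);
				linear[(size_t)y * width + x] = v;
				tiled[off] = v;
			}
		}
		sketch_nc12_detile(out, width, tiled, col_stride, width, height);
		ok = memcmp(linear, out, (size_t)width * height) == 0;
	}
	free(linear);
	free(tiled);
	free(out);
	return ok;
}
```

The in-loop `assert` also confirms that every Y offset stays inside `cols * 128 * col_stride` bytes, which matches the sizeimage cross-check in the comment.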
+88
@@ -0,0 +1,88 @@
/*
* V4L2_PIX_FMT_NV12_COL128 (NC12) SAND-tiled -> linear NV12 detile.
*
* Pi 5 / CM5 (BCM2712) rpi-hevc-dec CAPTURE format. iter40 (2026-05-17).
*
* Layout (kernel drivers/media/platform/raspberrypi/hevc_dec/hevc_d_video.c
* size-formula + ffmpeg/Kynesim libavutil/rpi_sand_fn_pw.h per-pixel
* offset math):
*
* width = ALIGN(image_width, 128) -- columns are 128 px wide
* height = ALIGN(image_height, 8)
* col_stride (= bytesperline) = height * 3 / 2
* (bytes per [128-wide column] vertical unit incl. Y + UV)
* sizeimage = col_stride * width = total bytes
*
* For pixel (x, y) in the Y plane:
* col = x / 128
* in_col_x = x % 128
* offset = col * col_stride * 128 + y * 128 + in_col_x
*
* UV data for each column starts at offset (128 * height) from that
* column's base (not after all Y columns): the same per-column layout,
* h/2 rows tall (CbCr interleaved).
*
* The primitive copies the entire image extent at once. width/height are
* the cropped consumer-visible dimensions; src_col_stride is the kernel-
* reported bytesperline (i.e. ALIGN(height,8) * 3/2).
*/
#ifndef _NV12_COL128_H_
#define _NV12_COL128_H_
#include <stdint.h>
#include <linux/videodev2.h>
/*
* Pre-Pi-kernel headers (Arch ALARM linux-api-headers, older mainline
* kernel-headers packages) may not define V4L2_PIX_FMT_NV12_COL128. The
* fourcc is Pi-specific. Provide a private fallback so the backend
* builds on hosts that target NON-Pi codecs too.
*/
#ifndef V4L2_PIX_FMT_NV12_COL128
#define V4L2_PIX_FMT_NV12_COL128 \
((unsigned int)('N') | ((unsigned int)('C') << 8) | \
((unsigned int)('1') << 16) | ((unsigned int)('2') << 24))
#endif
#ifndef V4L2_PIX_FMT_NV12_10_COL128
/* 10-bit SAND variant: 3 pixels packed into 4 bytes in 128-byte / 96-pixel
* wide columns. iter40 references the fourcc for completeness; the 10-bit
* Pi 5 HEVC chapter (Main10) is post-iter40. */
#define V4L2_PIX_FMT_NV12_10_COL128 \
((unsigned int)('N') | ((unsigned int)('C') << 8) | \
((unsigned int)('3') << 16) | ((unsigned int)('0') << 24))
#endif
/* Detile the Y plane of an NC12 source to a linear NV12 Y plane.
* dst : pointer to linear NV12 Y plane (caller-owned, dst_stride * height bytes)
* dst_stride : linear Y plane stride in bytes (= width for plain NV12)
* src_y : pointer to start of NC12 Y plane (= NC12 buffer base)
* src_col_stride: kernel-reported bytesperline (= ALIGN(height,8) * 3/2)
* width, height: cropped image dimensions in pixels
*/
void nv12_col128_detile_y(uint8_t *dst, unsigned int dst_stride,
const uint8_t *src_y, unsigned int src_col_stride,
unsigned int width, unsigned int height);
/* Detile the UV plane (CbCr interleaved, half-height) of an NC12 source.
* dst : pointer to linear NV12 UV plane
* dst_stride : linear UV plane stride in bytes (= width for NV12)
* src_uv : pointer to start of NC12 UV data (= src_y + nv12_col128_uv_plane_offset())
* src_col_stride: same as Y plane (same column geometry)
* width : Y-plane width in pixels (UV plane has same byte width)
* uv_height : UV plane height = height / 2
*/
void nv12_col128_detile_uv(uint8_t *dst, unsigned int dst_stride,
const uint8_t *src_uv, unsigned int src_col_stride,
unsigned int width, unsigned int uv_height);
/* Compute the offset of the UV plane within an NC12 buffer.
* image_width, image_height: cropped image dimensions in pixels
* Returns: byte offset from buffer start to the UV data of the first column
* (= 128 * ALIGN(image_height, 8); later columns use the column stride)
*/
unsigned int nv12_col128_uv_plane_offset(unsigned int image_width,
unsigned int image_height);
#endif /* _NV12_COL128_H_ */
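The fallback definitions above build each fourcc by placing four ASCII characters into a little-endian 32-bit value. A minimal sketch of that packing and its inverse follows; the helper names are hypothetical and only illustrate the bit layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Pack four characters into a fourcc, first character in the low byte. */
uint32_t sketch_fourcc(char a, char b, char c, char d)
{
	return (uint32_t)(unsigned char)a |
	       ((uint32_t)(unsigned char)b << 8) |
	       ((uint32_t)(unsigned char)c << 16) |
	       ((uint32_t)(unsigned char)d << 24);
}

/* Recover the four characters as a NUL-terminated string. */
void sketch_fourcc_name(uint32_t fourcc, char out[5])
{
	int i;

	assert(out != NULL);
	for (i = 0; i < 4; i++)
		out[i] = (char)((fourcc >> (8 * i)) & 0xFF);
	out[4] = '\0';
}
```

With this layout 'N','C','1','2' packs to 0x3231434E and 'N','C','3','0' to 0x3033434E, which is handy when matching raw values in VIDIOC_ENUM_FMT traces.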
+75
@@ -0,0 +1,75 @@
/*
* Copyright (C) 2026 claude-noether <claude-noether@reauktion.de>
*
* Permission is hereby granted, free of charge, to any person obtaining a
* copy of this software and associated documentation files (the
* "Software"), to deal in the Software without restriction, including
* without limitation the rights to use, copy, modify, merge, publish,
* distribute, sub license, and/or sell copies of the Software, and to
* permit persons to whom the Software is furnished to do so, subject to
* the following conditions:
*
* The above copyright notice and this permission notice (including the
* next paragraph) shall be included in all copies or substantial portions
* of the Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS
* OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
* MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NON-INFRINGEMENT.
* IN NO EVENT SHALL PRECISION INSIGHT AND/OR ITS SUPPLIERS BE LIABLE FOR
* ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
* TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
* SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
*/
#include "nv15.h"
void nv15_unpack_plane_to_p010(const uint8_t *src, uint16_t *dst,
unsigned int width, unsigned int height,
unsigned int src_stride)
{
unsigned int x, y;
unsigned int dst_pitch_px = width;
for (y = 0; y < height; y++) {
const uint8_t *s = src + y * src_stride;
uint16_t *d = dst + y * dst_pitch_px;
for (x = 0; x + 4 <= width; x += 4) {
uint16_t a = (uint16_t)s[0] | ((uint16_t)(s[1] & 0x03) << 8);
uint16_t b = ((uint16_t)s[1] >> 2) | ((uint16_t)(s[2] & 0x0F) << 6);
uint16_t c = ((uint16_t)s[2] >> 4) | ((uint16_t)(s[3] & 0x3F) << 4);
uint16_t e = ((uint16_t)s[3] >> 6) | ((uint16_t)s[4] << 2);
d[0] = (uint16_t)(a << 6);
d[1] = (uint16_t)(b << 6);
d[2] = (uint16_t)(c << 6);
d[3] = (uint16_t)(e << 6);
d += 4;
s += 5;
}
if (x < width) {
unsigned int rem = width - x;
uint16_t pix[4] = { 0, 0, 0, 0 };
pix[0] = (uint16_t)s[0] | ((uint16_t)(s[1] & 0x03) << 8);
if (rem >= 2)
pix[1] = ((uint16_t)s[1] >> 2) |
((uint16_t)(s[2] & 0x0F) << 6);
if (rem >= 3)
pix[2] = ((uint16_t)s[2] >> 4) |
((uint16_t)(s[3] & 0x3F) << 4);
if (rem >= 4)
pix[3] = ((uint16_t)s[3] >> 6) |
((uint16_t)s[4] << 2);
{
unsigned int j;
for (j = 0; j < rem; j++)
d[j] = (uint16_t)(pix[j] << 6);
}
}
}
}
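The shifts and masks in the unpack loop above are easier to trust next to their inverse. The sketch below packs four 10-bit samples into five bytes and unpacks them to P010, assuming the least-significant-bits-first order that the unpack code implies; the pack helper is hypothetical and does not exist in the driver.

```c
#include <assert.h>
#include <stdint.h>

/* Pack four 10-bit samples into five bytes, low bits first (assumed order). */
void sketch_nv15_pack4(const uint16_t px[4], uint8_t out[5])
{
	uint64_t bits = 0;
	int i;

	for (i = 0; i < 4; i++) {
		assert(px[i] <= 0x3FF);
		bits |= (uint64_t)px[i] << (10 * i);
	}
	for (i = 0; i < 5; i++)
		out[i] = (uint8_t)(bits >> (8 * i));
}

/* Unpack five bytes into four P010 samples (value in bits [15:6]). */
void sketch_nv15_unpack4_to_p010(const uint8_t s[5], uint16_t d[4])
{
	uint16_t a = (uint16_t)(s[0] | ((s[1] & 0x03) << 8));
	uint16_t b = (uint16_t)((s[1] >> 2) | ((s[2] & 0x0F) << 6));
	uint16_t c = (uint16_t)((s[2] >> 4) | ((s[3] & 0x3F) << 4));
	uint16_t e = (uint16_t)((s[3] >> 6) | (s[4] << 2));

	d[0] = (uint16_t)(a << 6);
	d[1] = (uint16_t)(b << 6);
	d[2] = (uint16_t)(c << 6);
	d[3] = (uint16_t)(e << 6);
}

/* Returns 1 if pack followed by unpack reproduces each sample << 6. */
int sketch_nv15_roundtrip_ok(void)
{
	const uint16_t px[4] = { 0x000, 0x3FF, 0x155, 0x2AA };
	uint8_t packed[5];
	uint16_t p010[4];
	int i;

	sketch_nv15_pack4(px, packed);
	sketch_nv15_unpack4_to_p010(packed, p010);
	for (i = 0; i < 4; i++)
		if (p010[i] != (uint16_t)(px[i] << 6))
			return 0;
	return 1;
}
```

The test values cover all-zeros, all-ones, and two alternating bit patterns, so a wrong mask or shift in any of the four lanes would show up as a mismatch.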
+61
@@ -0,0 +1,61 @@
/*
* Copyright (C) 2026 claude-noether <claude-noether@reauktion.de>
*
* Permission is hereby granted, free of charge, to any person obtaining a
* copy of this software and associated documentation files (the
* "Software"), to deal in the Software without restriction, including
* without limitation the rights to use, copy, modify, merge, publish,
* distribute, sub license, and/or sell copies of the Software, and to
* permit persons to whom the Software is furnished to do so, subject to
* the following conditions:
*
* The above copyright notice and this permission notice (including the
* next paragraph) shall be included in all copies or substantial portions
* of the Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS
* OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
* MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NON-INFRINGEMENT.
* IN NO EVENT SHALL PRECISION INSIGHT AND/OR ITS SUPPLIERS BE LIABLE FOR
* ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
* TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
* SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
*/
#ifndef _NV15_H_
#define _NV15_H_
#include <stdint.h>
#include <linux/videodev2.h>
/*
* Older or downstream linux-api-headers / kernel-headers packages may
* not define V4L2_PIX_FMT_NV15. Provide a fallback so the backend
* builds on hosts whose headers are pre-NV15-merge or omit it (e.g.
* Pi 5 Debian trixie 6.12.62 headers include NC12 but not NV15).
* Same numeric value as mainline.
*/
#ifndef V4L2_PIX_FMT_NV15
#define V4L2_PIX_FMT_NV15 \
((unsigned int)('N') | ((unsigned int)('V') << 8) | \
((unsigned int)('1') << 16) | ((unsigned int)('5') << 24))
#endif
/*
* Unpack one plane of V4L2_PIX_FMT_NV15 (4 × 10-bit values packed into
* 5 consecutive bytes, LSB-first) into VA_FOURCC_P010 (16-bit per pixel,
* value in bits [15:6], zeros in [5:0]).
*
* Layout per Documentation/userspace-api/media/v4l/pixfmt-nv15.rst.
* Call once per plane: luma (W × H, src_stride = ceil(W/4)*5) and chroma
* (W × H/2; same width because UV are interleaved 10-bit values).
*
* src_stride must be the kernel-reported bytesperline for the NV15 plane.
* The destination is dense P010 with row pitch = width * 2 bytes.
*/
void nv15_unpack_plane_to_p010(const uint8_t *src, uint16_t *dst,
unsigned int width, unsigned int height,
unsigned int src_stride);
#endif
+24 -42
@@ -133,12 +133,14 @@ static VAStatus codec_store_buffer(struct request_data *driver_data,
case VAProfileH264ConstrainedBaseline:
case VAProfileH264MultiviewHigh:
case VAProfileH264StereoHigh:
case VAProfileH264High10:
memcpy(&surface_object->params.h264.picture,
buffer_object->data,
sizeof(surface_object->params.h264.picture));
break;
case VAProfileHEVCMain:
case VAProfileHEVCMain10:
memcpy(&surface_object->params.h265.picture,
buffer_object->data,
sizeof(surface_object->params.h265.picture));
@@ -160,9 +162,6 @@ static VAStatus codec_store_buffer(struct request_data *driver_data,
memcpy(&surface_object->params.av1.picture,
buffer_object->data,
sizeof(surface_object->params.av1.picture));
/* Reset per-frame tile group entry array on each new
* picture parameter buffer (start of a new frame). */
surface_object->params.av1.num_tile_group_entries = 0;
break;
default:
@@ -177,12 +176,14 @@ static VAStatus codec_store_buffer(struct request_data *driver_data,
case VAProfileH264ConstrainedBaseline:
case VAProfileH264MultiviewHigh:
case VAProfileH264StereoHigh:
case VAProfileH264High10:
memcpy(&surface_object->params.h264.slice,
buffer_object->data,
sizeof(surface_object->params.h264.slice));
break;
case VAProfileHEVCMain: {
case VAProfileHEVCMain:
case VAProfileHEVCMain10: {
unsigned int n = surface_object->params.h265.num_slices;
if (n < HEVC_MAX_SLICES_PER_FRAME) {
memcpy(&surface_object->params.h265.slices[n],
@@ -210,17 +211,6 @@ static VAStatus codec_store_buffer(struct request_data *driver_data,
sizeof(surface_object->params.vp9.slice));
break;
case VAProfileAV1Profile0: {
unsigned int n = surface_object->params.av1.num_tile_group_entries;
if (n < AV1_MAX_TILES) {
memcpy(&surface_object->params.av1.tile_group_entries[n],
buffer_object->data,
sizeof(VASliceParameterBufferAV1));
surface_object->params.av1.num_tile_group_entries = n + 1;
}
break;
}
default:
break;
}
@@ -241,6 +231,7 @@ static VAStatus codec_store_buffer(struct request_data *driver_data,
case VAProfileH264ConstrainedBaseline:
case VAProfileH264MultiviewHigh:
case VAProfileH264StereoHigh:
case VAProfileH264High10:
memcpy(&surface_object->params.h264.matrix,
buffer_object->data,
sizeof(surface_object->params.h264.matrix));
@@ -248,6 +239,7 @@ static VAStatus codec_store_buffer(struct request_data *driver_data,
break;
case VAProfileHEVCMain:
case VAProfileHEVCMain10:
memcpy(&surface_object->params.h265.iqmatrix,
buffer_object->data,
sizeof(surface_object->params.h265.iqmatrix));
@@ -307,6 +299,7 @@ static VAStatus codec_set_controls(struct request_data *driver_data,
case VAProfileH264ConstrainedBaseline:
case VAProfileH264MultiviewHigh:
case VAProfileH264StereoHigh:
case VAProfileH264High10:
rc = h264_set_controls(driver_data, context, profile,
surface_object);
if (rc < 0)
@@ -314,6 +307,7 @@ static VAStatus codec_set_controls(struct request_data *driver_data,
break;
case VAProfileHEVCMain:
case VAProfileHEVCMain10:
rc = h265_set_controls(driver_data, context, surface_object);
if (rc < 0)
return VA_STATUS_ERROR_OPERATION_FAILED;
@@ -330,7 +324,22 @@ static VAStatus codec_set_controls(struct request_data *driver_data,
if (rc < 0)
return VA_STATUS_ERROR_OPERATION_FAILED;
break;
case VAProfileAV1Profile0:
/*
* Populates V4L2_CID_STATELESS_AV1_SEQUENCE from
* VADecPictureParameterBufferAV1. The daedalus_v4l2 daemon
* (issue #11 daemon track) synthesises an OBU_SEQUENCE_HEADER
* from this ctrl and prepends it to the slice bitstream
* before handing it to libavcodec/libdav1d, which otherwise
* cannot parse the (sequence-header-stripped) OUTPUT buffer
* that ffmpeg-vaapi delivers.
*
* On the RK3588 vpu981 hardware path the same SEQUENCE ctrl
* is harmless: vpu981's driver parses the OBU stream
* directly and ignores the ctrl payload, so no per-decoder
* gating is required here.
*/
rc = av1_set_controls(driver_data, context, surface_object);
if (rc < 0)
return VA_STATUS_ERROR_OPERATION_FAILED;
@@ -361,12 +370,6 @@ VAStatus RequestBeginPicture(VADriverContextP context, VAContextID context_id,
if (surface_object == NULL)
return VA_STATUS_ERROR_INVALID_SURFACE;
/* AV1 Phase 3 diag */
request_log("BeginPicture id=0x%x prev_slot=%p status=%d\n",
surface_object->base.id,
(void *)surface_object->current_slot,
surface_object->status);
if (surface_object->status == VASurfaceRendering)
RequestSyncSurface(context, surface_id);
@@ -378,30 +381,9 @@ VAStatus RequestBeginPicture(VADriverContextP context, VAContextID context_id,
* first. The new slot is bound and its V4L2 index + mmap pointers
* are mirrored into surface_object->destination_* so the existing
* QBUF/DQBUF/EXPBUF code paths see no behavioral change.
*
* AV1 Phase 3 finding: LIBVA_SKIP_REBIND=1 experiment (do NOT
* unbind on rebind) did not improve PASS count for the av1_larger
* film_grain stress vector proving the iter2 Fix 3 release is
* NOT the source of the inter-frame divergence. The issue is
* deeper in ffmpeg-vaapi's AV1 hwaccel: per byte-equal OUTPUT
* comparison with the patched-ffmpeg-v4l2request reference run
* (LD_LIBRARY_PATH override on a debug libavcodec.so), 7/7 first
* EndPicture submissions are byte-identical, libva has 2 EXTRA.
*/
if (surface_object->current_slot != NULL)
surface_unbind_slot(driver_data, surface_object);
/*
* AV1 Phase 5 review Amendment 4: clear any stale
* linked_decode_surface_id from a prior film_grain display->decode
* link. If ffmpeg-vaapi recycles a former display surface as a
* decode target, BeginPicture binds a fresh slot, but without
* this reset, copy_surface_to_image's link-follow would still
* borrow from the now-stale linked surface and serve wrong data.
* Cleared unconditionally (cheap) so the next AV1 grain frame
* re-establishes the link if needed.
*/
surface_object->linked_decode_surface_id = VA_INVALID_SURFACE;
{
struct cap_pool_slot *cap_slot =
cap_pool_acquire(&driver_data->capture_pool, surface_id);
+209 -141
@@ -93,6 +93,10 @@
static const char * const known_decoder_drivers[] = {
"rkvdec",
"hantro-vpu",
"rpi-hevc-dec", /* iter40: Pi 5 / CM5 stateless HEVC */
#ifdef HAVE_DAEDALUS_V4L2
"daedalus_v4l2", /* phase 8.10: Pi 5 daemon-backed VP9/AV1/H264 */
#endif
"cedrus",
"sun4i_csi",
NULL
@@ -325,37 +329,6 @@ static bool probe_hevc_ext_sps_rps_controls(int video_fd)
return true;
}
/*
* Inspect a /dev/videoN's OUTPUT formats for `want_pixfmt`. Returns true
* iff at least one OUTPUT/OUTPUT_MPLANE format matches.
*
* Used to discriminate between multiple devices sharing a driver name:
* RK3588 has 3 hantro-vpu instances and only one of them is vpu981 (the
* dedicated AV1 decoder advertising V4L2_PIX_FMT_AV1_FRAME).
*/
static bool video_node_supports_output_fmt(int video_fd, uint32_t want_pixfmt)
{
struct v4l2_fmtdesc desc;
const enum v4l2_buf_type types[] = {
V4L2_BUF_TYPE_VIDEO_OUTPUT_MPLANE,
V4L2_BUF_TYPE_VIDEO_OUTPUT,
};
unsigned int t, i;
for (t = 0; t < sizeof(types) / sizeof(types[0]); t++) {
for (i = 0; i < 64; i++) {
memset(&desc, 0, sizeof desc);
desc.index = i;
desc.type = types[t];
if (ioctl(video_fd, VIDIOC_ENUM_FMT, &desc) < 0)
break;
if (desc.pixelformat == want_pixfmt)
return true;
}
}
return false;
}
static int find_decoder_device_by_driver(const char *want_driver,
char *video_out, size_t video_out_sz,
char *media_out, size_t media_out_sz)
@@ -403,65 +376,6 @@ static int find_decoder_device_by_driver(const char *want_driver,
return -1;
}
/*
* ampere-av1-enablement Phase 2: like find_decoder_device_by_driver but
* additionally verifies the resolved /dev/videoN advertises `want_pixfmt`
* as an OUTPUT format. Required for RK3588 where 3 hantro-vpu instances
* share the driver name but only one is vpu981 (AV1 decoder).
*
* Walks all /dev/media* with matching driver name; takes the first hit
* whose OUTPUT formats include `want_pixfmt`. Non-matching candidates
* (encoder-only nodes, legacy hantro for MPEG2/VP8) are skipped.
*/
static int find_decoder_device_by_driver_with_fmt(const char *want_driver,
uint32_t want_pixfmt,
char *video_out,
size_t video_out_sz,
char *media_out,
size_t media_out_sz)
{
struct media_device_info info;
char path[32];
char vpath[32];
int fd, vfd, i;
for (i = 0; i < 16; i++) {
snprintf(path, sizeof path, "/dev/media%d", i);
fd = open(path, O_RDWR | O_NONBLOCK);
if (fd < 0)
continue;
memset(&info, 0, sizeof info);
if (ioctl(fd, MEDIA_IOC_DEVICE_INFO, &info) != 0) {
close(fd);
continue;
}
if (strcmp(info.driver, want_driver) != 0) {
close(fd);
continue;
}
if (find_decoder_video_node_via_topology(fd, vpath,
sizeof vpath) != 0) {
close(fd);
continue;
}
close(fd);
/* Capability check: does this /dev/videoN advertise the
* codec-specific OUTPUT format? */
vfd = open(vpath, O_RDWR | O_NONBLOCK);
if (vfd < 0)
continue;
if (video_node_supports_output_fmt(vfd, want_pixfmt)) {
close(vfd);
snprintf(video_out, video_out_sz, "%s", vpath);
snprintf(media_out, media_out_sz, "%s", path);
return 0;
}
close(vfd);
}
return -1;
}
static int find_codec_device(char *video_out, size_t video_out_sz,
char *media_out, size_t media_out_sz)
{
@@ -499,7 +413,15 @@ char request_device_kind_for_profile(VAProfile profile)
case VAProfileVP8Version0_3:
return 'h';
case VAProfileAV1Profile0:
return 'a'; /* ampere-av1-enablement: vpu981 dedicated AV1 */
/*
* ampere-av1-enablement Phase 2: RK3588 vpu981 dedicated
* AV1 hantro instance. 'a' kind dispatches to
* driver_data->video_fd_vpu981. On hosts without the AV1
* instance the fd stays -1 and RequestQueryConfigProfiles
* never enumerates AV1, so this branch is unreachable for
* non-RK3588 hosts.
*/
return 'a';
default:
return '?';
}
@@ -523,15 +445,77 @@ int request_switch_device_for_profile(struct request_data *driver_data,
char kind = request_device_kind_for_profile(profile);
int target_video, target_media;
/*
* iter40: HEVC override when rpi-hevc-dec is probed. The static
* table (request_device_kind_for_profile) maps HEVC -> 'r' (rkvdec)
* because that's the canonical RK path. On Pi 5 there's no rkvdec;
* rpi-hevc-dec is the only decoder. When BOTH would be present
* (hypothetical mixed board), prefer rpi-hevc-dec for HEVC.
*
* Other rkvdec-routed profiles (VP9, H.264) stay on 'r' because
* rpi-hevc-dec is HEVC-only.
*/
if ((profile == VAProfileHEVCMain || profile == VAProfileHEVCMain10) &&
driver_data->video_fd_rpi_hevc_dec >= 0 &&
driver_data->media_fd_rpi_hevc_dec >= 0) {
kind = 'p';
}
#ifdef HAVE_DAEDALUS_V4L2
/*
* LIBVA-1: VP9/AV1/H.264 -> daedalus_v4l2 when the daemon-backed
* decoder fd is open. Pi 5 has no rkvdec (those profiles map to
* 'r' by default -> video_fd_rkvdec = -1 -> "stay on whatever's
* active" fallback would put H.264 frames on rpi-hevc-dec's fd
* and S_FMT would fail). Re-route to the daedalus daemon instead.
*
* HEVC stays on 'p' (rpi-hevc-dec is HEVC-only; daedalus would
* accept it via FFmpeg, but rpi-hevc-dec has the GPU-backed
* hardware path so it's the right choice on this SoC).
*
* AV1 'a' kind (RK3588 vpu981) wins ONLY if vpu981 was probed.
* On a Pi 5 the vpu981 slot stays -1, so we still route AV1 to
* daedalus here. Check video_fd_vpu981 to preserve the RK3588
* priority for that case.
*/
if (driver_data->video_fd_daedalus >= 0 &&
driver_data->media_fd_daedalus >= 0) {
switch (profile) {
case VAProfileH264Main:
case VAProfileH264High:
case VAProfileH264ConstrainedBaseline:
case VAProfileH264MultiviewHigh:
case VAProfileH264StereoHigh:
case VAProfileVP9Profile0:
kind = 'd';
break;
case VAProfileAV1Profile0:
if (driver_data->video_fd_vpu981 < 0)
kind = 'd';
break;
default:
break;
}
}
#endif
if (kind == 'r') {
target_video = driver_data->video_fd_rkvdec;
target_media = driver_data->media_fd_rkvdec;
} else if (kind == 'h') {
target_video = driver_data->video_fd_hantro;
target_media = driver_data->media_fd_hantro;
} else if (kind == 'p') {
target_video = driver_data->video_fd_rpi_hevc_dec;
target_media = driver_data->media_fd_rpi_hevc_dec;
} else if (kind == 'a') {
target_video = driver_data->video_fd_vpu981;
target_media = driver_data->media_fd_vpu981;
#ifdef HAVE_DAEDALUS_V4L2
} else if (kind == 'd') {
target_video = driver_data->video_fd_daedalus;
target_media = driver_data->media_fd_daedalus;
#endif
} else {
return -1;
}
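The routing rules in this hunk can be condensed into a small pure function. The following is a minimal sketch rather than the driver's code: `enum codec`, `struct probe_state`, `default_kind` and `route_kind` are hypothetical names, the static defaults for HEVC, VP9 and H.264 are taken from the comments above rather than from the table itself, and the sketch assumes the daedalus build option is enabled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the VAProfile groups and the probed fd pairs. */
enum codec { CODEC_H264, CODEC_HEVC, CODEC_VP8, CODEC_VP9, CODEC_AV1 };

struct probe_state {
	bool have_rpi_hevc_dec; /* video + media fd both >= 0 */
	bool have_daedalus;     /* video + media fd both >= 0 */
	bool have_vpu981;       /* video fd >= 0 */
};

/* Static default kind: VP8 -> 'h', AV1 -> 'a', everything else -> 'r'
 * (assumption based on the comments in the hunk). */
static char default_kind(enum codec c)
{
	switch (c) {
	case CODEC_VP8:
		return 'h';
	case CODEC_AV1:
		return 'a';
	default:
		return 'r';
	}
}

/* Apply the two overrides in the same order as the hunk:
 * first HEVC -> 'p', then H.264/VP9/AV1 -> 'd'. */
static char route_kind(enum codec c, const struct probe_state *p)
{
	char kind = default_kind(c);

	assert(p != NULL);

	if (c == CODEC_HEVC && p->have_rpi_hevc_dec)
		kind = 'p';

	if (p->have_daedalus) {
		if (c == CODEC_H264 || c == CODEC_VP9)
			kind = 'd';
		else if (c == CODEC_AV1 && !p->have_vpu981)
			kind = 'd'; /* vpu981 keeps priority when present */
	}
	return kind;
}
```

On a Pi 5 style state (rpi-hevc-dec and daedalus present, no vpu981) this yields 'p' for HEVC and 'd' for H.264, VP9 and AV1; on an RK3588 style state with only vpu981 probed, AV1 stays on 'a'.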
@@ -719,6 +703,10 @@ VAStatus VA_DRIVER_INIT_FUNC(VADriverContextP context)
driver_data->media_fd_rkvdec = -1;
driver_data->video_fd_hantro = -1;
driver_data->media_fd_hantro = -1;
driver_data->video_fd_rpi_hevc_dec = -1;
driver_data->media_fd_rpi_hevc_dec = -1;
driver_data->video_fd_daedalus = -1;
driver_data->media_fd_daedalus = -1;
driver_data->video_fd_vpu981 = -1;
driver_data->media_fd_vpu981 = -1;
@@ -751,6 +739,36 @@ VAStatus VA_DRIVER_INIT_FUNC(VADriverContextP context)
alt_driver = "rkvdec";
driver_data->video_fd_hantro = video_fd;
driver_data->media_fd_hantro = media_fd;
} else if (strcmp(info.driver, "rpi-hevc-dec") == 0) {
/* iter40 + LIBVA-1: Pi 5 / CM5. rpi-hevc-dec is
* HEVC-only. If daedalus_v4l2 is ALSO loaded (Pi 5
* mixed deployment: out-of-tree daemon-backed
* decoder for VP9/AV1/H264), pick it up as the alt
* so VP9/AV1/H264 have somewhere to land. */
primary_driver = "rpi-hevc-dec";
#ifdef HAVE_DAEDALUS_V4L2
alt_driver = "daedalus_v4l2";
#else
alt_driver = NULL;
#endif
driver_data->video_fd_rpi_hevc_dec = video_fd;
driver_data->media_fd_rpi_hevc_dec = media_fd;
#ifdef HAVE_DAEDALUS_V4L2
} else if (strcmp(info.driver, "daedalus_v4l2") == 0) {
/* phase 8.10 + LIBVA-1: Pi 5 daemon-backed decoder.
* VP9 / AV1 / H.264 route through it via the 'd'
* kind below. On a mixed-driver box where
* rpi-hevc-dec is ALSO loaded, pick it up as the
* alt so HEVC has somewhere to land too. find_codec_device's
* known_decoder_drivers[] order
* normally puts rpi-hevc-dec first (we hit the
* other branch in practice), but symmetric handling
* keeps us correct if probe order ever flips. */
primary_driver = "daedalus_v4l2";
alt_driver = "rpi-hevc-dec";
driver_data->video_fd_daedalus = video_fd;
driver_data->media_fd_daedalus = media_fd;
#endif
}
}
@@ -762,15 +780,38 @@ VAStatus VA_DRIVER_INIT_FUNC(VADriverContextP context)
int alt_v = open(alt_video, O_RDWR | O_NONBLOCK);
int alt_m = (alt_v >= 0) ? open(alt_media, O_RDWR | O_NONBLOCK) : -1;
if (alt_v >= 0 && alt_m >= 0) {
/* Dispatch into the matching per-driver slot.
* iter38 only had rkvdec/hantro pairs; iter40 +
* LIBVA-1 extended this to rpi-hevc-dec and
* daedalus_v4l2 for the Pi 5 mixed-decoder
* deployment. */
if (strcmp(alt_driver, "rkvdec") == 0) {
driver_data->video_fd_rkvdec = alt_v;
driver_data->media_fd_rkvdec = alt_m;
} else {
} else if (strcmp(alt_driver, "hantro-vpu") == 0) {
driver_data->video_fd_hantro = alt_v;
driver_data->media_fd_hantro = alt_m;
} else if (strcmp(alt_driver, "rpi-hevc-dec") == 0) {
driver_data->video_fd_rpi_hevc_dec = alt_v;
driver_data->media_fd_rpi_hevc_dec = alt_m;
#ifdef HAVE_DAEDALUS_V4L2
} else if (strcmp(alt_driver, "daedalus_v4l2") == 0) {
driver_data->video_fd_daedalus = alt_v;
driver_data->media_fd_daedalus = alt_m;
#endif
} else {
/* Shouldn't happen — primary_driver branches
* above only set alt_driver to one of the
* names handled here. Close and move on. */
close(alt_v);
close(alt_m);
alt_v = -1;
alt_m = -1;
}
if (alt_v >= 0) {
request_log("iter38: also opened %s decoder at %s + %s\n",
alt_driver, alt_video, alt_media);
}
request_log("iter38: also opened %s decoder at %s + %s\n",
alt_driver, alt_video, alt_media);
} else {
if (alt_v >= 0) close(alt_v);
if (alt_m >= 0) close(alt_m);
@@ -780,36 +821,57 @@ VAStatus VA_DRIVER_INIT_FUNC(VADriverContextP context)
(void)primary_driver;
/*
* ampere-av1-enablement Phase 2: additionally probe for
* vpu981 (RK3588's dedicated AV1 decoder). Driver name
* "hantro-vpu" alone is ambiguous on RK3588 (3 instances:
* legacy MPEG2/VP8, encoder, vpu981 AV1). Discriminate by
* V4L2_PIX_FMT_AV1_FRAME capability. If the primary or alt
* hantro happens to BE vpu981 (unlikely but possible on
* non-RK3588 boards), this probe finds it again and we just
* dedupe via the fd value.
* ampere-av1-enablement Phase 2: walk hantro-vpu media nodes
* for a SECOND one that advertises V4L2_PIX_FMT_AV1_FRAME
* (AV1F) as OUTPUT pixfmt. RK3588 has 3 hantro-vpu instances
* (legacy MPEG2/VP8 decoder, vepu121 encoder, vpu981 AV1
* decoder) all reporting driver="hantro-vpu" / model="hantro-
* vpu" — so OUTPUT-format probe is the only reliable
* disambiguator that doesn't depend on parsing card-name
* strings (which are DTS-dependent). First match wins.
*
* On non-RK3588 hosts the slot stays -1; RequestQueryConfig
* Profiles' AV1 push then no-ops because any_fd_supports_
* output_format() returns false for AV1F.
*/
{
static char av1_video[32], av1_media[32];
if (find_decoder_device_by_driver_with_fmt(
"hantro-vpu", V4L2_PIX_FMT_AV1_FRAME,
av1_video, sizeof av1_video,
av1_media, sizeof av1_media) == 0) {
int av1_v = open(av1_video, O_RDWR | O_NONBLOCK);
int av1_m = (av1_v >= 0)
? open(av1_media, O_RDWR | O_NONBLOCK)
: -1;
if (av1_v >= 0 && av1_m >= 0) {
driver_data->video_fd_vpu981 = av1_v;
driver_data->media_fd_vpu981 = av1_m;
request_log(
"ampere-av1: vpu981 AV1 decoder "
"at %s + %s\n",
av1_video, av1_media);
} else {
if (av1_v >= 0) close(av1_v);
if (av1_m >= 0) close(av1_m);
int i;
char path[32], av1_video[32];
for (i = 0; i < 16; i++) {
int mfd, vfd;
struct media_device_info info;
snprintf(path, sizeof path, "/dev/media%d", i);
mfd = open(path, O_RDWR | O_NONBLOCK);
if (mfd < 0) continue;
memset(&info, 0, sizeof info);
if (ioctl(mfd, MEDIA_IOC_DEVICE_INFO, &info) != 0 ||
strcmp(info.driver, "hantro-vpu") != 0) {
close(mfd);
continue;
}
if (find_decoder_video_node_via_topology(
mfd, av1_video, sizeof av1_video) != 0) {
close(mfd);
continue;
}
vfd = open(av1_video, O_RDWR | O_NONBLOCK);
if (vfd < 0) {
close(mfd);
continue;
}
if (!v4l2_find_format(vfd, V4L2_BUF_TYPE_VIDEO_OUTPUT, V4L2_PIX_FMT_AV1_FRAME) &&
!v4l2_find_format(vfd, V4L2_BUF_TYPE_VIDEO_OUTPUT_MPLANE, V4L2_PIX_FMT_AV1_FRAME)) {
close(vfd);
close(mfd);
continue;
}
driver_data->video_fd_vpu981 = vfd;
driver_data->media_fd_vpu981 = mfd;
request_log("ampere-av1: vpu981 AV1 decoder at %s + %s\n",
av1_video, path);
break;
}
}
}
@@ -824,29 +886,27 @@ VAStatus VA_DRIVER_INIT_FUNC(VADriverContextP context)
probe_hevc_ext_sps_rps_controls(driver_data->video_fd_rkvdec);
driver_data->has_hevc_ext_sps_rps_hantro =
probe_hevc_ext_sps_rps_controls(driver_data->video_fd_hantro);
driver_data->has_hevc_ext_sps_rps_rpi_hevc_dec =
probe_hevc_ext_sps_rps_controls(driver_data->video_fd_rpi_hevc_dec);
if (driver_data->has_hevc_ext_sps_rps_rkvdec) {
request_log("iter2: kernel registers HEVC EXT_SPS_{ST,LT}_RPS "
"controls on rkvdec fd (will route through "
"vendored GStreamer parser)\n");
}
/*
* ampere-av1 Phase 2.1: probe V4L2_CID_STATELESS_AV1_FILM_GRAIN
* on the vpu981 fd. Per Janet v3 amendment, this runs at backend
* init (not lazily) so any race window with concurrent device
* switching can't observe an inconsistent flag.
*/
driver_data->has_av1_film_grain = false;
if (driver_data->video_fd_vpu981 >= 0) {
struct v4l2_query_ext_ctrl qec;
if (v4l2_query_ext_ctrl(driver_data->video_fd_vpu981,
V4L2_CID_STATELESS_AV1_FILM_GRAIN,
&qec) == 0) {
driver_data->has_av1_film_grain = true;
request_log("ampere-av1: vpu981 advertises FILM_GRAIN "
"control (will include in per-frame batch)\n");
}
if (driver_data->video_fd_rpi_hevc_dec >= 0) {
request_log("iter40: also opened rpi-hevc-dec at video_fd=%d "
"media_fd=%d (Pi 5 HEVC stateless)\n",
driver_data->video_fd_rpi_hevc_dec,
driver_data->media_fd_rpi_hevc_dec);
}
#ifdef HAVE_DAEDALUS_V4L2
if (driver_data->video_fd_daedalus >= 0) {
request_log("phase 8.10: opened daedalus_v4l2 at video_fd=%d "
"media_fd=%d (Pi 5 daemon-backed VP9/AV1/H264)\n",
driver_data->video_fd_daedalus,
driver_data->media_fd_daedalus);
}
#endif
status = VA_STATUS_SUCCESS;
goto complete;
@@ -894,15 +954,23 @@ VAStatus RequestTerminate(VADriverContextP context)
close(driver_data->video_fd_hantro);
if (driver_data->media_fd_hantro >= 0)
close(driver_data->media_fd_hantro);
if (driver_data->video_fd_rpi_hevc_dec >= 0)
close(driver_data->video_fd_rpi_hevc_dec);
if (driver_data->media_fd_rpi_hevc_dec >= 0)
close(driver_data->media_fd_rpi_hevc_dec);
if (driver_data->video_fd_vpu981 >= 0)
close(driver_data->video_fd_vpu981);
if (driver_data->media_fd_vpu981 >= 0)
close(driver_data->media_fd_vpu981);
#ifdef HAVE_DAEDALUS_V4L2
if (driver_data->video_fd_daedalus >= 0)
close(driver_data->video_fd_daedalus);
if (driver_data->media_fd_daedalus >= 0)
close(driver_data->media_fd_daedalus);
#endif
/* Fall back to direct close if neither alt fd captured the active
* pair (env-override path). */
if (driver_data->video_fd_rkvdec < 0 &&
driver_data->video_fd_hantro < 0 &&
driver_data->video_fd_vpu981 < 0) {
if (driver_data->video_fd_rkvdec < 0 && driver_data->video_fd_hantro < 0) {
if (driver_data->video_fd >= 0)
close(driver_data->video_fd);
if (driver_data->media_fd >= 0)
+88 -22
@@ -42,7 +42,16 @@
#define V4L2_REQUEST_STR_VENDOR "v4l2-request"
#define V4L2_REQUEST_MAX_PROFILES 11
/*
* Sized for max-possible enumeration with iter39 Option B reverted:
* MPEG2(2) + H264(6 incl. Hi10P) + HEVC(2 incl. Main10) + VP8 + VP9 + AV1 = 13.
* The per-group guards use `if (... && index < (MAX_PROFILES - N))` where N
* is the push-group size, so MAX must be total+1 = 14 here. Bumping
* defensively now so a future re-enable of Hi10P/Main10 doesn't silently
* drop AV1 through the off-by-one trap that ate ampere-av1's enumeration
* for a week (see issue marfrit/libva-v4l2-request-fourier#2).
*/
#define V4L2_REQUEST_MAX_PROFILES 14
#define V4L2_REQUEST_MAX_ENTRYPOINTS 5
#define V4L2_REQUEST_MAX_CONFIG_ATTRIBUTES 10
#define V4L2_REQUEST_MAX_IMAGE_FORMATS 10
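The off-by-one described in the comment above can be reproduced with plain arithmetic. This is a minimal sketch, assuming the groups are pushed in the order MPEG2, H264, HEVC, VP8, VP9, AV1 and that the only guard is `index < max_profiles - n`; `group_sizes` and `enumerate_with_guard` are hypothetical names and the real guards carry additional conditions.

```c
#include <assert.h>
#include <stddef.h>

/* Push-group sizes, in the order assumed for this sketch:
 * MPEG2(2), H264(6), HEVC(2), VP8(1), VP9(1), AV1(1) = 13 profiles. */
static const unsigned int group_sizes[] = { 2, 6, 2, 1, 1, 1 };

/* Count how many profiles get enumerated when each group of size n is
 * admitted only if index < max_profiles - n (strict comparison). */
static unsigned int enumerate_with_guard(unsigned int max_profiles)
{
	unsigned int index = 0;
	size_t g;

	for (g = 0; g < sizeof(group_sizes) / sizeof(group_sizes[0]); g++) {
		unsigned int n = group_sizes[g];

		assert(max_profiles >= n); /* avoid unsigned wraparound */
		if (index < max_profiles - n)
			index += n;
	}
	return index;
}
```

With a limit of 13 the last group is evaluated as `12 < 13 - 1`, which is false, so only 12 profiles are enumerated and AV1 is dropped; with a limit of 14 all 13 are enumerated.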
@@ -78,17 +87,45 @@ struct request_data {
int media_fd_rkvdec;
int video_fd_hantro;
int media_fd_hantro;
/*
* ampere-av1-enablement Phase 2: vpu981 is a THIRD physical
* hantro-vpu instance on RK3588 (separate from the legacy MPEG2/VP8
* hantro at /dev/video2). It's the dedicated AV1 decoder at
* /dev/video4 with card name "rockchip,rk3588-av1-vpu-dec".
* iter40: third multi-device-probe slot for rpi-hevc-dec (Pi 5 /
* CM5 / BCM2712). V4L2 stateless HEVC; CAPTURE is NC12/NC30 SAND
* 128-pixel-wide column tiled (Pi-specific). On Pi 5 this is the
* ONLY decoder slot; on RK hosts it stays -1 and HEVC routes to
* rkvdec as before.
*/
int video_fd_rpi_hevc_dec;
int media_fd_rpi_hevc_dec;
/*
* phase 8.10: fifth multi-device-probe slot for daedalus_v4l2, the
* out-of-tree V4L2 stateless decoder shim that forwards bitstream
* to a userspace daemon (daedalus-v4l2 sibling repo). Daemon does
* FFmpeg-software decode for VP9 / AV1 / H.264 and ships pixels
* back via dmabuf into the CAPTURE buffer. Picked up via the
* same media-controller probe + known_decoder_drivers[] entry
* pattern as iter40 rpi-hevc-dec. Stays -1 on hosts without the
* daedalus module loaded; HEVC routes to rpi-hevc-dec as before.
*
* Driver-name alone ("hantro-vpu") is ambiguous on RK3588: three
* instances share the name. The probe discriminates by capability:
* which OUTPUT format does the device advertise? Only vpu981
* exposes V4L2_PIX_FMT_AV1_FRAME.
* Fields are unconditional (8 bytes per session) so the struct
* layout is stable regardless of meson option. The active
* probe + dispatch code in request.c is gated by
* HAVE_DAEDALUS_V4L2; when disabled the fields stay at their
* -1 init and no codepath touches them.
*/
int video_fd_daedalus;
int media_fd_daedalus;
/*
* ampere-av1-enablement Phase 2: fourth multi-device-probe slot
* for vpu981 (RK3588's dedicated AV1 hantro instance, kernel
* card="rockchip,rk3588-av1-vpu-dec", driver name "hantro-vpu"
* shared with the legacy MPEG-2/VP8/H.264 hantro). Discriminated
* by V4L2_PIX_FMT_AV1_FRAME (AV1F) OUTPUT-pixfmt capability since
* the driver name alone is ambiguous on RK3588. Stays -1 on hosts
* without the AV1 vpu-dec.
*
* Named "vpu981" for consistency with the in-progress av1-iter1
* operator branch (Phase 3-5 bit-exact AV1 work; when that lands,
* these fields receive the actual decode dispatch wiring).
*/
int video_fd_vpu981;
int media_fd_vpu981;
@@ -112,18 +149,12 @@ struct request_data {
*/
bool has_hevc_ext_sps_rps_rkvdec;
bool has_hevc_ext_sps_rps_hantro;
/*
* ampere-av1 Phase 2.1: probe result for the optional
* V4L2_CID_STATELESS_AV1_FILM_GRAIN control on the vpu981 fd.
* Probed at VA_DRIVER_INIT (per Janet v3 amendment: init-time,
* not lazy). Consumed by av1_set_controls to conditionally include
* the 4th control in the per-frame batch.
*
* True iff vpu981 advertises the control via VIDIOC_QUERY_EXT_CTRL.
* False for non-RK3588 hosts (no vpu981 fd) or older kernels.
*/
bool has_av1_film_grain;
/* iter40: rpi-hevc-dec doesn't expose EXT_SPS_*_RPS controls
* (verified Phase 0 higgs probe: QUERY_EXT_CTRL on 0xa97 -> EINVAL).
* Probed for consistency with the iter2 pair-of-flags pattern;
* stays false on Pi 5 and the iter2 vendored-parser path naturally
* doesn't engage. */
bool has_hevc_ext_sps_rps_rpi_hevc_dec;
/*
* iter2 cached SPS-derived RPS arrays. SPS NALs only appear in
@@ -148,6 +179,30 @@ struct request_data {
unsigned int hevc_rps_cache_lt_count;
bool hevc_rps_cache_valid;
/*
* iter40b: bitstream-derived SPS field cache for VAAPI-omitted
* fields. rpi-hevc-dec validates these against bitstream-true
* values; the rkvdec/hantro fallback (sps_max_dec_pic_buffering_minus1,
* 0) that satisfies §A.4.2 isn't enough for rpi.
*
* Cached on first IDR frame's SPS NAL parse, reused for subsequent
* non-IDR frames whose source_data may not carry an SPS.
*
* sps_max_sub_layers_minus1 is the index into max_*[] arrays. The
* V4L2 SPS struct fields are scalars (single sublayer), so we pick
* the HighestTid (= sps_max_sub_layers_minus1) slot, which matches
* ffmpeg-vaapi + kdirect convention.
*/
struct {
bool valid;
uint8_t sps_max_sub_layers_minus1;
uint8_t max_dec_pic_buffering_minus1;
uint8_t max_num_reorder_pics;
uint8_t max_latency_increase_plus1;
bool scaling_list_enabled;
bool scaling_list_data_present;
} hevc_sps_field_cache;
struct video_format *video_format;
/*
@@ -204,6 +259,17 @@ struct request_data {
unsigned int fmt_buffers_count;
unsigned int fmt_sizes[VIDEO_MAX_PLANES];
unsigned int fmt_bytesperlines[VIDEO_MAX_PLANES];
/*
* iter39: active session is decoding a 10-bit profile (Hi10P / Main10).
* Set in RequestCreateContext from config->profile. Drives:
* - CAPTURE pix_fmt selection (NV15 instead of NV12)
* - image.c DeriveImage / QueryImageFormats fourcc reporting (P010
* instead of NV12)
* - copy_surface_to_image NV15 -> P010 unpack branch
* Reset to false at DestroyContext.
*/
bool is_10bit;
};
VAStatus VA_DRIVER_INIT_FUNC(VADriverContextP context);
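For readers unfamiliar with the NV15 to P010 unpack that `is_10bit` enables, here is a minimal sketch of the per-group bit manipulation. It assumes the NV15 layout of four 10-bit samples packed LSB-first into five bytes; `nv15_group_to_p010` is a hypothetical helper for illustration, not the project's `nv15_unpack_plane_to_p010`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Unpack one NV15 group: 5 bytes carry four 10-bit samples, LSB first.
 * Each sample is then shifted into the top 10 bits of a 16-bit word,
 * which is how P010 stores 10-bit data. */
static void nv15_group_to_p010(const uint8_t in[5], uint16_t out[4])
{
	uint16_t p0, p1, p2, p3;

	assert(in != NULL && out != NULL);

	p0 = (uint16_t)(in[0] | ((in[1] & 0x03) << 8));        /* bits  0..9  */
	p1 = (uint16_t)((in[1] >> 2) | ((in[2] & 0x0F) << 6)); /* bits 10..19 */
	p2 = (uint16_t)((in[2] >> 4) | ((in[3] & 0x3F) << 4)); /* bits 20..29 */
	p3 = (uint16_t)((in[3] >> 6) | (in[4] << 2));          /* bits 30..39 */

	out[0] = (uint16_t)(p0 << 6);
	out[1] = (uint16_t)(p1 << 6);
	out[2] = (uint16_t)(p2 << 6);
	out[3] = (uint16_t)(p3 << 6);
}
```

A full plane conversion would apply this to every group of four samples in each row, stepping the source by 5 bytes and the destination by 4 words.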
+11 -11
@@ -111,13 +111,6 @@ void surface_unbind_slot(struct request_data *driver_data,
{
if (surface_object->current_slot == NULL)
return;
/* AV1 Phase 3 diag: log every unbind with surface id + slot idx
* + status confirms whether BeginPicture rebind is racing the
* consumer's vaGetImage on the previous frame. */
request_log("surface_unbind_slot id=0x%x status=%d slot_idx=%u\n",
surface_object->base.id,
surface_object->status,
surface_object->current_slot->v4l2_index);
cap_pool_release(&driver_data->capture_pool, surface_object->current_slot);
surface_object->current_slot = NULL;
}
@@ -189,7 +182,9 @@ VAStatus RequestCreateSurfaces2(VADriverContextP context, unsigned int format,
* surface_bind_format_uniform_fields(); the per-slot
* destination_* fields fill at BeginPicture via surface_bind_slot.
*/
if (format != VA_RT_FORMAT_YUV420)
/* iter39: allow YUV420_10 for Hi10P / Main10 surface allocation. */
if (format != VA_RT_FORMAT_YUV420 &&
format != VA_RT_FORMAT_YUV420_10)
return VA_STATUS_ERROR_UNSUPPORTED_RT_FORMAT;
for (i = 0; i < surfaces_count; i++) {
@@ -199,8 +194,6 @@ VAStatus RequestCreateSurfaces2(VADriverContextP context, unsigned int format,
return VA_STATUS_ERROR_ALLOCATION_FAILED;
surface_object->current_slot = NULL; /* iter2 Fix 3 */
surface_object->linked_decode_surface_id = VA_INVALID_SURFACE;
surface_object->av1_order_hint = 0;
surface_object->destination_index = 0; /* set on bind */
surface_object->destination_planes_count = 0; /* set at CreateContext */
surface_object->destination_buffers_count = 0; /* set at CreateContext */
@@ -715,7 +708,14 @@ VAStatus RequestExportSurfaceHandle(VADriverContextP context,
planes_count = surface_object->destination_planes_count;
surface_descriptor->fourcc = VA_FOURCC_NV12;
/* iter39: 10-bit session exports a DRM_FORMAT_NV15 buffer; advertise
* the matching fourcc so a PRIME consumer aware of NV15 (panfrost-
* Mesa et al.) can import correctly. PRIME consumers that only know
* NV12 / P010 should use the COPY (vaGetImage) path which unpacks
* NV15 -> P010 in image.c::copy_surface_to_image. */
surface_descriptor->fourcc = driver_data->is_10bit
? VA_FOURCC('N', 'V', '1', '5')
: VA_FOURCC_NV12;
surface_descriptor->width = surface_object->width;
surface_descriptor->height = surface_object->height;
surface_descriptor->num_objects = export_fds_count;
+9 -37
@@ -89,33 +89,6 @@ struct object_surface {
struct timeval timestamp;
/*
* AV1 Phase 3: for streams with apply_grain=1, VAAPI's
* VADecPictureParameterBufferAV1 carries current_display_picture
* (display-time surface) separate from current_frame (decode
* target). vpu981 HW applies grain inline to the decode CAPTURE
* buffer, so the decoded data lives in current_frame's slot but
* ffmpeg calls vaGetImage on current_display_picture which has no
* slot bound. linked_decode_surface_id, set in av1_set_controls
* on the display surface, points to the decode surface so
* copy_surface_to_image can borrow its destination_data[].
*
* VA_INVALID_SURFACE = no link (the common case: 8-bit codecs,
* AV1 with apply_grain=0, AV1 frames where cur_frame ==
* cur_display).
*/
VASurfaceID linked_decode_surface_id;
/*
* AV1 Phase 3: AV1 order_hint of the frame currently decoded into
* this surface. VAAPI's VADecPictureParameterBufferAV1.order_hint
* is per-frame; kernel's v4l2_ctrl_av1_frame.order_hints[8] is
* per-reference. We track each decoded frame's order_hint here so
* the next frame's av1_set_controls can populate order_hints[i]
* from ref_frame_map[i] -> SURFACE -> av1_order_hint.
*/
uint8_t av1_order_hint;
union {
struct {
VAPictureParameterBufferMPEG2 picture;
@@ -149,18 +122,17 @@ struct object_surface {
VADecPictureParameterBufferVP9 picture;
VASliceParameterBufferVP9 slice;
} vp9;
/*
* ampere-av1-enablement: AV1 needs picture-header +
* variable number of slice/tile params (one per tile).
* tile_group_entries[] holds parsed VASliceParameterBufferAV1
* entries up to MAX_TILES; av1.c builds the matching
* v4l2_ctrl_av1_tile_group_entry[] at set_controls time.
*/
struct {
#define AV1_MAX_TILES 128
/*
* AV1 picture parameter buffer. Slice params are
* intentionally absent: the daedalus daemon track
* (issue #11) consumes the slice OBU bytes directly
* from the OUTPUT bitstream and synthesises only the
* sequence-header OBU from V4L2_CID_STATELESS_AV1_
* SEQUENCE. No per-tile-group struct -> OBU re-synthesis
* required from libva today.
*/
VADecPictureParameterBufferAV1 picture;
VASliceParameterBufferAV1 tile_group_entries[AV1_MAX_TILES];
unsigned int num_tile_group_entries;
} av1;
} params;
+27 -26
@@ -433,7 +433,6 @@ static int v4l2_ioctl_controls(int video_fd, int request_fd, unsigned long ioc,
unsigned int num_controls)
{
struct v4l2_ext_controls controls;
int rc;
memset(&controls, 0, sizeof(controls));
@@ -445,28 +444,7 @@ static int v4l2_ioctl_controls(int video_fd, int request_fd, unsigned long ioc,
controls.request_fd = request_fd;
}
rc = ioctl(video_fd, ioc, &controls);
if (rc < 0) {
/* ampere-av1 Phase 2.1 diag: surface error_idx so the caller's
* error path knows which CID failed validation. error_idx >=
* count means the failure was pre-validation (e.g., bad
* request_fd). errno carries the syscall-level reason. */
const char *failed_cid_label = "<pre-validation>";
unsigned int failed_size = 0;
if (controls.error_idx < num_controls) {
failed_size = control_array[controls.error_idx].size;
(void)failed_cid_label; /* keep symbol if logger truncates */
}
request_log("v4l2_ioctl_controls: rc=%d errno=%d (%s) "
"ioc=0x%lx error_idx=%u count=%u "
"failed_cid=0x%x failed_size=%u\n",
rc, errno, strerror(errno), ioc,
controls.error_idx, num_controls,
controls.error_idx < num_controls
? control_array[controls.error_idx].id : 0,
failed_size);
}
return rc;
return ioctl(video_fd, ioc, &controls);
}
int v4l2_get_controls(int video_fd, int request_fd,
@@ -498,12 +476,35 @@ int v4l2_set_controls(int video_fd, int request_fd,
struct v4l2_ext_control *control_array,
unsigned int num_controls)
{
struct v4l2_ext_controls controls;
int rc;
rc = v4l2_ioctl_controls(video_fd, request_fd, VIDIOC_S_EXT_CTRLS,
control_array, num_controls);
memset(&controls, 0, sizeof(controls));
controls.controls = control_array;
controls.count = num_controls;
if (request_fd >= 0) {
controls.which = V4L2_CTRL_WHICH_REQUEST_VAL;
controls.request_fd = request_fd;
}
rc = ioctl(video_fd, VIDIOC_S_EXT_CTRLS, &controls);
if (rc < 0) {
request_log("Unable to set control(s): %s\n", strerror(errno));
/* error_idx is the index of the first failing control;
* if it equals count, the ioctl itself failed (not a
* specific control payload). Useful for triaging
* which V4L2_CID_STATELESS_* the kernel rejected. */
if (controls.error_idx < num_controls)
request_log("Unable to set control(s): %s "
"(error_idx=%u/%u failing_ctrl_id=0x%x size=%u)\n",
strerror(errno),
controls.error_idx, controls.count,
control_array[controls.error_idx].id,
control_array[controls.error_idx].size);
else
request_log("Unable to set control(s): %s "
"(error_idx=%u/%u ioctl-level)\n",
strerror(errno),
controls.error_idx, controls.count);
return -1;
}
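The `error_idx` triage added to `v4l2_set_controls` follows the V4L2 convention that an index below `count` names the rejected control, while an index equal to `count` means the failure is not tied to one control. The following is a minimal sketch of that decision in isolation; `describe_ctrl_failure` is a hypothetical helper, and it takes plain integers instead of `struct v4l2_ext_controls` so it can be exercised without kernel headers.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Format a triage message from the values left behind by a failed
 * VIDIOC_S_EXT_CTRLS. Returns the failing control index, or -1 when
 * the failure cannot be attributed to a specific control. */
static int describe_ctrl_failure(char *buf, size_t buf_sz,
				 uint32_t error_idx, uint32_t count,
				 const uint32_t *ctrl_ids)
{
	assert(buf != NULL && buf_sz > 0);

	if (error_idx < count) {
		snprintf(buf, buf_sz, "control %u/%u (id=0x%x) rejected",
			 (unsigned int)error_idx, (unsigned int)count,
			 (unsigned int)ctrl_ids[error_idx]);
		return (int)error_idx;
	}

	snprintf(buf, buf_sz, "ioctl-level failure (error_idx=%u/%u)",
		 (unsigned int)error_idx, (unsigned int)count);
	return -1;
}
```

In the driver the same branch feeds `request_log`, with `control_array[error_idx].id` and `.size` taking the place of the `ctrl_ids` array used here.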
+34
@@ -31,6 +31,8 @@
#include <drm_fourcc.h>
#include <linux/videodev2.h>
#include "nv12_col128.h" /* fallback V4L2_PIX_FMT_NV12_COL128 define */
#include "nv15.h" /* fallback V4L2_PIX_FMT_NV15 define */
#include "utils.h"
#include "video.h"
@@ -45,6 +47,38 @@ static struct video_format formats[] = {
.planes_count = 2,
.bpp = 16,
},
{
.description = "NV15 YUV (10-bit, rkvdec)",
.v4l2_format = V4L2_PIX_FMT_NV15,
.v4l2_buffers_count = 1,
.v4l2_mplane = true,
.drm_format = DRM_FORMAT_NV15,
.drm_modifier = DRM_FORMAT_MOD_NONE,
.planes_count = 2,
.bpp = 24,
},
{
/*
* iter40: Pi 5 / CM5 rpi-hevc-dec CAPTURE format. 8-bit NV12
* stored as 128-pixel-wide column tiles (SAND128 layout).
* Pi-specific; not in mainline drm_fourcc.h (uses NV12 + a
* BROADCOM_SAND128 modifier for DRM_PRIME). Our consumer path
* always detiles to linear NV12 in copy_surface_to_image, so
* we don't expose the SAND modifier downstream: drm_format is
* still DRM_FORMAT_NV12 and drm_modifier MOD_NONE so the
* format-is-linear gate doesn't pull us into tiled_to_planar
* (Sunxi-specific). image.c branches on v4l2_format ==
* V4L2_PIX_FMT_NV12_COL128 to invoke the dedicated detile.
*/
.description = "NV12 SAND128 (8-bit, rpi-hevc-dec)",
.v4l2_format = V4L2_PIX_FMT_NV12_COL128,
.v4l2_buffers_count = 1,
.v4l2_mplane = true,
.drm_format = DRM_FORMAT_NV12,
.drm_modifier = DRM_FORMAT_MOD_NONE,
.planes_count = 2,
.bpp = 16,
},
// Code to handle this DRM_FORMAT is __arm__ only
#ifdef __arm__
{
+196
@@ -0,0 +1,196 @@
/*
* Copyright (C) 2026 claude-noether <claude-noether@reauktion.de>
*
* MIT-licensed per project. iter40 self-test for nv12_col128 detile.
*
* Build an NC12-tiled source buffer from a known linear NV12 image,
* run the detile primitive, assert output matches the original. No
* hardware needed: pure bit-layout verification of the kernel math
* (drivers/media/platform/raspberrypi/hevc_dec/hevc_d_video.c
* V4L2_PIX_FMT_NV12_COL128 case + ffmpeg/Kynesim per-pixel offset).
*
* Build:
* cc -Wall -Werror -O2 -o test_nv12_col128_detile \
* tests/test_nv12_col128_detile.c src/nv12_col128.c
*
* Exit 0 = all asserts pass.
*/
#include "../src/nv12_col128.h"
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#define TILE_W 128
static unsigned int align_up(unsigned int v, unsigned int a)
{
return (v + a - 1) & ~(a - 1);
}
/* Pack a linear plane (width × height bytes, stride=width) into NC12
* layout: each 128-wide column held contiguously, columns at offsets
* col * col_stride * 128. col_stride is the kernel-reported bytesperline
* = ALIGN(height, 8) * 3/2. Returns the buffer + sizes. */
static uint8_t *pack_to_nc12(const uint8_t *linear,
unsigned int width, unsigned int height,
unsigned int *out_col_stride,
size_t *out_size)
{
unsigned int aligned_w = align_up(width, TILE_W);
unsigned int aligned_h = align_up(height, 8);
unsigned int col_stride = aligned_h * 3 / 2;
unsigned int num_cols = aligned_w / TILE_W;
size_t total = (size_t)col_stride * aligned_w;
uint8_t *buf;
unsigned int col, y, in_col;
buf = calloc(1, total);
assert(buf != NULL);
for (col = 0; col < num_cols; col++) {
uint8_t *col_base = buf + (size_t)col * TILE_W * col_stride;
for (y = 0; y < height; y++) {
for (in_col = 0; in_col < TILE_W; in_col++) {
unsigned int x = col * TILE_W + in_col;
if (x >= width)
break;
col_base[(size_t)y * TILE_W + in_col] =
linear[(size_t)y * width + x];
}
}
}
*out_col_stride = col_stride;
*out_size = total;
return buf;
}
static void test_detile_y(unsigned int width, unsigned int height)
{
uint8_t *linear, *tiled, *recovered;
unsigned int col_stride;
size_t tile_size, i;
linear = malloc((size_t)width * height);
assert(linear != NULL);
/* Distinctive content per pixel: y * 17 + x * 13 — avoids byte-
* aliasing patterns that could mask off-by-one bugs. */
for (unsigned int y = 0; y < height; y++)
for (unsigned int x = 0; x < width; x++)
linear[(size_t)y * width + x] = (uint8_t)(y * 17 + x * 13);
tiled = pack_to_nc12(linear, width, height, &col_stride, &tile_size);
recovered = calloc(1, (size_t)width * height);
assert(recovered != NULL);
nv12_col128_detile_y(recovered, width, tiled, col_stride, width, height);
for (i = 0; i < (size_t)width * height; i++) {
if (recovered[i] != linear[i]) {
fprintf(stderr,
"FAIL %ux%u Y: pixel %zu (x=%zu y=%zu) "
"linear=0x%02x recovered=0x%02x\n",
width, height, i,
i % width, i / width,
linear[i], recovered[i]);
free(linear); free(tiled); free(recovered);
exit(1);
}
}
printf("PASS %ux%u Y plane (%u columns, col_stride=%u, tile_size=%zu)\n",
width, height, align_up(width, TILE_W) / TILE_W,
col_stride, tile_size);
free(linear);
free(tiled);
free(recovered);
}
static void test_detile_uv(unsigned int width, unsigned int height)
{
unsigned int uv_h = height / 2;
uint8_t *linear, *tiled, *recovered;
unsigned int col_stride;
size_t tile_size, i;
linear = malloc((size_t)width * uv_h);
assert(linear != NULL);
for (unsigned int y = 0; y < uv_h; y++)
for (unsigned int x = 0; x < width; x++)
linear[(size_t)y * width + x] = (uint8_t)(y * 23 + x * 7);
tiled = pack_to_nc12(linear, width, uv_h, &col_stride, &tile_size);
recovered = calloc(1, (size_t)width * uv_h);
assert(recovered != NULL);
nv12_col128_detile_uv(recovered, width, tiled, col_stride, width, uv_h);
for (i = 0; i < (size_t)width * uv_h; i++) {
if (recovered[i] != linear[i]) {
fprintf(stderr,
"FAIL %ux%u UV: pixel %zu linear=0x%02x recovered=0x%02x\n",
width, height, i,
linear[i], recovered[i]);
free(linear); free(tiled); free(recovered);
exit(1);
}
}
printf("PASS %ux%u UV plane\n", width, height);
free(linear);
free(tiled);
free(recovered);
}
static void test_uv_offset(void)
{
/* Per the SAND COL128 layout, Y and UV are interleaved within
* EACH column (not concatenated as separate planes), so the UV
* plane base pointer is offset by 128 * ALIGN(height, 8), i.e. the
* Y portion of column 0. NOT 128 * height * num_columns (the
* size of all Y across all columns), which was an earlier wrong
* formula caught by Phase 7 SEGV on higgs. */
unsigned int off = nv12_col128_uv_plane_offset(1280, 720);
if (off != 128u * 720) {
fprintf(stderr, "FAIL UV offset 1280×720: got %u expected %u\n",
off, 128u * 720);
exit(1);
}
printf("PASS UV offset 1280×720 = %u\n", off);
off = nv12_col128_uv_plane_offset(1366, 768);
if (off != 128u * 768) {
fprintf(stderr, "FAIL UV offset 1366×768: got %u expected %u\n",
off, 128u * 768);
exit(1);
}
printf("PASS UV offset 1366×768 (column-misaligned width)\n");
}
int main(void)
{
/* Phase 3 fixture sizes — all 128-aligned, 8-line-aligned. */
test_detile_y(640, 360);
test_detile_y(1280, 720);
test_detile_y(1920, 1080);
/* Phase 5 review F4: column-misaligned width (1366 → 1408 padding). */
test_detile_y(1366, 768);
/* UV plane (half-height) at each width. */
test_detile_uv(640, 360);
test_detile_uv(1280, 720);
test_detile_uv(1920, 1080);
test_detile_uv(1366, 768);
test_uv_offset();
printf("All NC12 detile asserts pass.\n");
return 0;
}
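The layout that `pack_to_nc12` builds can also be read in the other direction. The sketch below shows the per-pixel offset and a naive detile loop that inverts it; `col128_offset` and `col128_detile_naive` are hypothetical names for illustration and are not the `nv12_col128_detile_y` / `nv12_col128_detile_uv` primitives under test, whose implementation is not shown in this diff.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define COL_W 128u

/* Byte offset of (x, y) within a column-tiled plane: whole columns are
 * COL_W * col_stride bytes apart, rows inside a column are COL_W bytes
 * apart, and x % COL_W selects the byte within the row. */
static size_t col128_offset(unsigned int x, unsigned int y,
			    unsigned int col_stride)
{
	return (size_t)(x / COL_W) * COL_W * col_stride +
	       (size_t)y * COL_W + (x % COL_W);
}

/* Naive per-byte detile into a linear plane with stride dst_stride. */
static void col128_detile_naive(uint8_t *dst, unsigned int dst_stride,
				const uint8_t *src, unsigned int col_stride,
				unsigned int width, unsigned int height)
{
	unsigned int x, y;

	assert(dst_stride >= width);

	for (y = 0; y < height; y++)
		for (x = 0; x < width; x++)
			dst[(size_t)y * dst_stride + x] =
				src[col128_offset(x, y, col_stride)];
}
```

A production detile would copy up to 128 bytes per row with `memcpy` instead of going byte by byte; the loop above only makes the addressing explicit.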
+224
@@ -0,0 +1,224 @@
/*
* Copyright (C) 2026 claude-noether <claude-noether@reauktion.de>
*
* Permission is hereby granted, free of charge, to any person obtaining a
* copy of this software and associated documentation files (the
* "Software"), to deal in the Software without restriction, including
* without limitation the rights to use, copy, modify, merge, publish,
* distribute, sub license, and/or sell copies of the Software, and to
* permit persons to whom the Software is furnished to do so, subject to
* the following conditions:
*
* The above copyright notice and this permission notice (including the
* next paragraph) shall be included in all copies or substantial portions
* of the Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS
* OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
* MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NON-INFRINGEMENT.
* IN NO EVENT SHALL PRECISION INSIGHT AND/OR ITS SUPPLIERS BE LIABLE FOR
* ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
* TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
* SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
*/
/*
* iter39 self-test for nv15_unpack_plane_to_p010.
*
* Builds NV15 plane buffers from known 10-bit pixel arrays, runs the
* unpack, asserts P010 output matches the expected pixel<<6 values.
* No hardware needed: pure bit layout verification per
* Documentation/userspace-api/media/v4l/pixfmt-nv15.rst.
*
* Build:
* cc -Wall -Werror -O2 -o test_nv15_unpack tests/test_nv15_unpack.c src/nv15.c
*
* Exit 0 = all asserts pass.
*/
#include "../src/nv15.h"
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Pack 4 10-bit pixels into 5 bytes per NV15 layout (LSB-first across
 * bits 0..39). Inverse of nv15_unpack_plane_to_p010's per-group unpack. */
static void pack4(uint16_t a, uint16_t b, uint16_t c, uint16_t d,
                  uint8_t out[5])
{
    out[0] = (uint8_t)(a & 0xFF);
    out[1] = (uint8_t)(((a >> 8) & 0x03) | ((b & 0x3F) << 2));
    out[2] = (uint8_t)(((b >> 6) & 0x0F) | ((c & 0x0F) << 4));
    out[3] = (uint8_t)(((c >> 4) & 0x3F) | ((d & 0x03) << 6));
    out[4] = (uint8_t)((d >> 2) & 0xFF);
}

#define ASSERT_EQ(actual, expected, msg) do { \
    if ((actual) != (expected)) { \
        fprintf(stderr, "FAIL %s: actual=0x%04x expected=0x%04x at %s:%d\n", \
                (msg), (unsigned)(actual), (unsigned)(expected), \
                __FILE__, __LINE__); \
        exit(1); \
    } \
} while (0)

static void test_pack_unpack_roundtrip(uint16_t a, uint16_t b, uint16_t c,
                                       uint16_t d)
{
    uint8_t packed[5];
    uint16_t dst[4];

    pack4(a, b, c, d, packed);
    nv15_unpack_plane_to_p010(packed, dst, 4, 1, 5);
    ASSERT_EQ(dst[0], (uint16_t)(a << 6), "roundtrip a");
    ASSERT_EQ(dst[1], (uint16_t)(b << 6), "roundtrip b");
    ASSERT_EQ(dst[2], (uint16_t)(c << 6), "roundtrip c");
    ASSERT_EQ(dst[3], (uint16_t)(d << 6), "roundtrip d");
}

static void test_zero(void)
{
    uint8_t packed[5] = { 0, 0, 0, 0, 0 };
    uint16_t dst[4] = { 0xDEAD, 0xDEAD, 0xDEAD, 0xDEAD };

    nv15_unpack_plane_to_p010(packed, dst, 4, 1, 5);
    ASSERT_EQ(dst[0], 0, "zero[0]");
    ASSERT_EQ(dst[1], 0, "zero[1]");
    ASSERT_EQ(dst[2], 0, "zero[2]");
    ASSERT_EQ(dst[3], 0, "zero[3]");
}

static void test_all_max(void)
{
    /* All four pixels = 0x3FF (max 10-bit). Packed bits all 1 → all 0xFF. */
    uint8_t packed[5] = { 0xFF, 0xFF, 0xFF, 0xFF, 0xFF };
    uint16_t dst[4] = { 0, 0, 0, 0 };

    nv15_unpack_plane_to_p010(packed, dst, 4, 1, 5);
    ASSERT_EQ(dst[0], 0xFFC0, "max[0]");
    ASSERT_EQ(dst[1], 0xFFC0, "max[1]");
    ASSERT_EQ(dst[2], 0xFFC0, "max[2]");
    ASSERT_EQ(dst[3], 0xFFC0, "max[3]");
}

static void test_known_vectors(void)
{
    /* Position-sensitive sanity: each pixel = its index+1. */
    test_pack_unpack_roundtrip(1, 2, 3, 4);
    /* Spread patterns that exercise every byte-boundary bit. */
    test_pack_unpack_roundtrip(0x3FF, 0x000, 0x3FF, 0x000);
    test_pack_unpack_roundtrip(0x000, 0x3FF, 0x000, 0x3FF);
    test_pack_unpack_roundtrip(0x155, 0x2AA, 0x155, 0x2AA);
    test_pack_unpack_roundtrip(0x001, 0x002, 0x004, 0x008);
    test_pack_unpack_roundtrip(0x080, 0x040, 0x020, 0x010);
    test_pack_unpack_roundtrip(0x200, 0x100, 0x080, 0x040);
    test_pack_unpack_roundtrip(0x3F0, 0x0F3, 0x33C, 0x2A5);
}

static void test_remainder_width(void)
{
    /* width=1: only A is unpacked; B/C/D must not be written, since dst
     * holds a single pixel. */
    {
        uint8_t packed[5];
        uint16_t dst[1] = { 0xDEAD };

        pack4(0x123, 0x000, 0x000, 0x000, packed);
        nv15_unpack_plane_to_p010(packed, dst, 1, 1, 5);
        ASSERT_EQ(dst[0], 0x123 << 6, "rem1[0]");
    }
    /* width=2 */
    {
        uint8_t packed[5];
        uint16_t dst[2] = { 0, 0 };

        pack4(0x111, 0x222, 0x000, 0x000, packed);
        nv15_unpack_plane_to_p010(packed, dst, 2, 1, 5);
        ASSERT_EQ(dst[0], 0x111 << 6, "rem2[0]");
        ASSERT_EQ(dst[1], 0x222 << 6, "rem2[1]");
    }
    /* width=3 */
    {
        uint8_t packed[5];
        uint16_t dst[3] = { 0, 0, 0 };

        pack4(0x111, 0x222, 0x333, 0x000, packed);
        nv15_unpack_plane_to_p010(packed, dst, 3, 1, 5);
        ASSERT_EQ(dst[0], 0x111 << 6, "rem3[0]");
        ASSERT_EQ(dst[1], 0x222 << 6, "rem3[1]");
        ASSERT_EQ(dst[2], 0x333 << 6, "rem3[2]");
    }
    /* width=7: one full group + 3 remainder */
    {
        uint8_t packed[10];
        uint16_t dst[7] = { 0 };

        pack4(0x100, 0x200, 0x300, 0x010, &packed[0]);
        pack4(0x011, 0x022, 0x033, 0x000, &packed[5]);
        nv15_unpack_plane_to_p010(packed, dst, 7, 1, 10);
        ASSERT_EQ(dst[0], 0x100 << 6, "rem7[0]");
        ASSERT_EQ(dst[1], 0x200 << 6, "rem7[1]");
        ASSERT_EQ(dst[2], 0x300 << 6, "rem7[2]");
        ASSERT_EQ(dst[3], 0x010 << 6, "rem7[3]");
        ASSERT_EQ(dst[4], 0x011 << 6, "rem7[4]");
        ASSERT_EQ(dst[5], 0x022 << 6, "rem7[5]");
        ASSERT_EQ(dst[6], 0x033 << 6, "rem7[6]");
    }
    /* width=8: two full groups */
    {
        uint8_t packed[10];
        uint16_t dst[8] = { 0 };

        pack4(0x101, 0x202, 0x303, 0x101, &packed[0]);
        pack4(0x202, 0x303, 0x101, 0x202, &packed[5]);
        nv15_unpack_plane_to_p010(packed, dst, 8, 1, 10);
        ASSERT_EQ(dst[7], 0x202 << 6, "w8[7]");
    }
}

static void test_multi_row_stride_padding(void)
{
    /* 4-pixel-wide, 3-row plane; stride = 8 bytes (3 bytes padding). */
    uint8_t packed[24]; /* 3 rows × 8 bytes */
    uint16_t dst[12];   /* 3 rows × 4 pixels */

    memset(packed, 0xCC, sizeof(packed)); /* padding poison */
    pack4(0x111, 0x222, 0x333, 0x044, &packed[0 * 8]);
    pack4(0x055, 0x166, 0x177, 0x188, &packed[1 * 8]);
    pack4(0x099, 0x1AA, 0x2BB, 0x3CC, &packed[2 * 8]);
    memset(dst, 0xAB, sizeof(dst));
    nv15_unpack_plane_to_p010(packed, dst, 4, 3, 8);
    ASSERT_EQ(dst[0], 0x111 << 6, "row0[0]");
    ASSERT_EQ(dst[3], 0x044 << 6, "row0[3]");
    ASSERT_EQ(dst[4], 0x055 << 6, "row1[0]");
    ASSERT_EQ(dst[7], 0x188 << 6, "row1[3]");
    ASSERT_EQ(dst[8], 0x099 << 6, "row2[0]");
    ASSERT_EQ(dst[11], 0x3CC << 6, "row2[3]");
}

static void test_chroma_half_height(void)
{
    /* 4-pixel-wide × 2-row chroma (matches 4×4 luma quadrant).
     * NV15 chroma uses same packing as luma, just half-height. */
    uint8_t packed[10]; /* 2 rows × 5 bytes */
    uint16_t dst[8];    /* 2 rows × 4 pixels (UV pairs in interleaved form) */

    pack4(0x080, 0x180, 0x280, 0x380, &packed[0]);
    pack4(0x040, 0x140, 0x240, 0x340, &packed[5]);
    nv15_unpack_plane_to_p010(packed, dst, 4, 2, 5);
    ASSERT_EQ(dst[0], 0x080 << 6, "chroma row0[0]");
    ASSERT_EQ(dst[3], 0x380 << 6, "chroma row0[3]");
    ASSERT_EQ(dst[4], 0x040 << 6, "chroma row1[0]");
    ASSERT_EQ(dst[7], 0x340 << 6, "chroma row1[3]");
}

int main(void)
{
    test_zero();
    test_all_max();
    test_known_vectors();
    test_remainder_width();
    test_multi_row_stride_padding();
    test_chroma_half_height();
    printf("test_nv15_unpack: all PASS\n");
    return 0;
}
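
For readers without src/nv15.c at hand, the bit layout that pack4 and the assertions above pin down can be illustrated with a small single-row unpacker. This is only a sketch under stated assumptions, not the project's nv15_unpack_plane_to_p010: the name sketch_nv15_row_to_p010 and its one-row signature are hypothetical. It assumes the same LSB-first packing as pack4 (pixel x occupies bits 10x..10x+9 of the row) and the P010 convention the tests check (10-bit value shifted left by 6).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical illustration helper (not part of the project sources).
 * Unpacks `width` 10-bit pixels from one LSB-first packed NV15 row into
 * 16-bit P010 samples (value << 6).
 *
 * A pixel starts at bit 10*x, so its offset inside the starting byte is
 * 0, 2, 4 or 6 bits. Offset + 10 never exceeds 16, which means every
 * pixel lies inside exactly two consecutive bytes.
 */
static void sketch_nv15_row_to_p010(const uint8_t *src, uint16_t *dst,
                                    size_t width)
{
    size_t x;

    assert(src != NULL && dst != NULL);

    for (x = 0; x < width; x++) {
        size_t bit = x * 10;                  /* first bit of pixel x */
        size_t byte = bit / 8;                /* byte holding that bit */
        unsigned shift = (unsigned)(bit % 8); /* 0, 2, 4 or 6 */
        unsigned pair = (unsigned)src[byte] |
                        ((unsigned)src[byte + 1] << 8);
        unsigned px = (pair >> shift) & 0x3FFu;

        dst[x] = (uint16_t)(px << 6);         /* P010: data in the top 10 bits */
    }
}
```

Given the five bytes that pack4(1, 2, 3, 4, ...) produces (0x01 0x08 0x30 0x00 0x01), this sketch yields 0x0040, 0x0080, 0x00C0 and 0x0100. For a partial group it reads only the bytes that contain requested pixels, which is the behavior the remainder-width cases above rely on. Handling of multiple rows and stride padding is left out here on purpose.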