15 Commits

Author SHA1 Message Date
claude-noether 902d6c17ba ampere-av1 Phase 5 review: stale linked_decode_surface_id clear; remap fix REVERTED
Two of three Phase 5 sonnet-architect review amendments addressed.

Amendment 4 (kept): clear surface_object->linked_decode_surface_id at
BeginPicture after the iter2 Fix 3 release. Prevents stale-link
borrows in copy_surface_to_image when ffmpeg-vaapi recycles a former
display surface as a decode target. No-op for non-AV1 codecs (link
field stays VA_INVALID_SURFACE for them throughout).
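
A minimal sketch of the stale-link clear described above. The struct and function names are hypothetical stand-ins, not the backend's real `object_surface` or `BeginPicture`; only the reset of the link field is taken from the commit message.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for VA_INVALID_SURFACE from <va/va.h>, redefined here only so
 * the sketch compiles without libva headers. */
#define SKETCH_INVALID_SURFACE 0xffffffffu

/* Hypothetical, trimmed-down surface object. */
struct sketch_surface {
    uint32_t id;
    uint32_t linked_decode_surface_id; /* set only for AV1 display surfaces */
};

/* Called when a surface becomes the decode target of a new picture. Any
 * link left over from an earlier life as a display surface is dropped, so
 * a later readback cannot borrow another surface's buffers by mistake. */
void sketch_begin_picture(struct sketch_surface *surface)
{
    assert(surface != NULL);
    surface->linked_decode_surface_id = SKETCH_INVALID_SURFACE;
}
```

For codecs that never set the link, the field already holds the invalid ID, so the assignment changes nothing.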

Amendment 1 (reverted): reviewer proposed remap_lr_type table
{NONE, SWITCHABLE, WIENER, SGRPROJ} per Kwiboo's permutation,
arguing AV1 spec FrameRestoreType wire encoding differs from
V4L2_AV1_FRAME_RESTORE_* enum order. Applied the proposed table
empirically → regressed ALL tests (allintra 10/10 → 0/10, test_av1
bit-exact → DIFF). Reverted to identity mapping. Either VAAPI's
yframe_restoration_type is already in V4L2-enum order, or vpu981
interprets the V4L2 enum values via a mapping that differs from the
uAPI header documentation. Per [[feedback_review_empirical_over_theoretical]]
empirical PASS wins; updated the code comment to capture the
investigation outcome so the next session has the context.

Amendment 5 (SEPARATE_UV_DELTA_Q sequence flag missing): noted but
not actionable — VAAPI doesn't expose color_config.separate_uv_delta_q.
Will need bitstream-side info to surface. Not blocking current tests.

Verification on ampere:
  test_av1.ivf:             bit-exact PASS sha 029ee72c214b37c1
  av1-1-b8-02-allintra.ivf: 10/10 PASS (no regression)
  av1_larger.ivf:           3/10 PASS (no regression)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-17 12:19:19 +00:00
claude-noether c839b9456e ampere-av1 Phase 3 finding: iter2 Fix 3 release is NOT the divergence cause
Investigated whether picture.c::BeginPicture's iter2 Fix 3 release-on-
rebind was causing AV1 inter-frame divergence on av1_larger.ivf
(film_grain stress vector). Added env-gated LIBVA_SKIP_REBIND=1
experiment (leak old slot instead of release); A/B run showed identical
3/10 PASS count with and without the release. Hypothesis disproved.
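
An env-gated experiment of this kind can be sketched as below. The helper names are hypothetical, and treating only the exact value `1` as "on" is an assumption; the commit message only says the gate was `LIBVA_SKIP_REBIND=1`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Pure helper: treats only the exact string "1" as "on". */
bool sketch_flag_value_enabled(const char *value)
{
    return value != NULL && strcmp(value, "1") == 0;
}

/* Reads the gate from the environment. The variable name is passed in so
 * the helper is not tied to any one experiment. */
bool sketch_env_flag_enabled(const char *name)
{
    assert(name != NULL);
    return sketch_flag_value_enabled(getenv(name));
}
```

A call such as `sketch_env_flag_enabled("LIBVA_SKIP_REBIND")` would then select the leak-instead-of-release path for one A/B run without rebuilding.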

Where the divergence actually lives:
  - patched ffmpeg-v4l2-request-fourier libavcodec.so with a fwrite
    diag in ff_v4l2_request_append_output → 7 dump files for the
    -frames:v 5 kdirect run, sizes [15133, 3670, 1970, 1323, 812,
    886, 1310] BYTE-IDENTICAL to our LIBVA_V4L2_DUMP_OUTPUT first 7
    submissions for the same input
  - our backend has 2 EXTRA EndPicture calls (t8 size 824, t9 size
    487) on RE-USED surfaces (0x4000008 and 0x4000006)
  - the extras happen because ffmpeg-vaapi's AV1 hwaccel issues
    redecode requests onto surfaces that already hold frames the
    consumer hasn't downloaded yet
  - SKIP_REBIND should let those redecodes' slots stay around but
    doesn't help, because surface_object->current_slot can only
    point at ONE slot at a time and bind_slot overwrites it

True root cause: ffmpeg-vaapi AV1 hwaccel's surface accounting is
incompatible with the iter2 Fix 3 1:1 surface↔slot invariant when
the stream has show_existing_frame frames. Fix would need either
(a) cap_pool tracking N surfaces per slot, or (b) backend reading
ffmpeg-vaapi's display-order mapping and remapping slots accordingly.
Both are non-trivial Phase 4 work — outside this iteration's scope.

Reverted the LIBVA_SKIP_REBIND env-gate to clean shape. Comment
updated with the investigation outcome so the next session has the
context without rediscovering.

State: 3/10 av1_larger frames bit-exact (frames 0/2/4, the
apply_grain=1 IDR-derived ones). test_av1.ivf 208x208 still bit-exact
PASS (no regression). Diagnostic logs in BeginPicture +
surface_unbind_slot + v4l2_ioctl_controls retained for future
investigation.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-17 12:12:23 +00:00
claude-noether d7ef0f6cd9 ampere-av1 Phase 3: SEQUENCE byte-equal kdirect; 3/10 frames PASS bit-exact
Three more fixes after strace-diff localization vs kdirect.

Fix 6 — fill_sequence ENABLE_SUPERRES: gate on
picture->pic_info_fields.bits.use_superres instead of unconditional
set-true. VAAPI doesn't expose enable_superres at sequence level; per
strace diff kdirect clears the flag for streams not using superres
(byte 1 of flags was the only SEQUENCE diff). After this fix,
SEQUENCE ctrl byte-equal kdirect on every call.
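
The gate in Fix 6 can be sketched as follows. The flag constant is a local stand-in: its value is an assumption chosen to sit in byte 1 of the flags word, as the strace diff describes, and the real value should come from the kernel's V4L2 control header. The function name is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for V4L2_AV1_SEQUENCE_FLAG_ENABLE_SUPERRES (assumed value). */
#define SKETCH_SEQ_FLAG_ENABLE_SUPERRES 0x00000800u

/* Sets or clears the sequence-level superres flag from the per-picture
 * use_superres bit instead of setting it unconditionally. */
void sketch_gate_superres(uint32_t *seq_flags, bool use_superres)
{
    assert(seq_flags != NULL);
    if (use_superres)
        *seq_flags |= SKETCH_SEQ_FLAG_ENABLE_SUPERRES;
    else
        *seq_flags &= ~SKETCH_SEQ_FLAG_ENABLE_SUPERRES;
}
```

Other bits in the flags word pass through untouched, which matches the observation that only one byte differed before the fix.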

Fix 7 — refresh_frame_flags = 0xff (was 0): VAAPI doesn't expose
refresh_frame_flags. Default 0xff = "refresh all DPB slots" matches
kdirect's submission and AV1 spec default for KEY/SWITCH frames; for
inter frames simple P-frame chains naturally tolerate this.

Fix 8 — surface_object->av1_order_hint per-surface tracking. Set in
av1_set_controls from picture->order_hint of the current frame. Also
propagated to the linked display surface (when apply_grain=1 →
cur_frame != cur_display) so future frames referencing the display
surface find the order_hint via the linked_decode_surface_id.

Tried + reverted: ref-name iteration of reference_frame_ts / order_hints
via picture->ref_frame_idx[i-1] → DPB slot (Kwiboo's convention via
FFmpeg's s->ref[i]). Empirically regressed 3/10 → 1/10. V4L2 uAPI's
indexing here looks DPB-slot-direct despite the AV1 spec lexicon —
needs kernel-side disambiguation to settle.

Verification on ampere (av1_larger.ivf 352x288, 10 frames):
  Frames 0, 2, 4: PASS bit-exact (apply_grain=1, grain HW path)
  Frames 1, 3, 5-9: DIFF (apply_grain=0)
  3/10 PASS (was 1/10 after iter checkpoint).
  test_av1.ivf 208x208: unchanged bit-exact PASS sha 029ee72c214b37c1

Remaining open: frame 1 (apply_grain=0, first inter) submits IDENTICAL
FRAME ctrl bytes to kdirect (verified strace-diff post-fix), yet
decoded output diverges. That means the divergence is no longer in
control submission — points at OUTPUT-side bitstream differences
between ffmpeg-vaapi and ffmpeg-v4l2request, or at DPB CAPTURE buffer
state (grain-applied data being used as reference vs pre-grain).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-17 10:55:07 +00:00
claude-noether 5803cbcf6c ampere-av1 Phase 3 progress: film_grain link + UPDATE_GRAIN; frame 0 bit-exact
Three structural fixes for AV1 with film_grain on vpu981 (RK3588). Output
is no longer empty / crashed; frame 0 (IDR with apply_grain=1) is
bit-exact vs kdirect. Inter frames still diverge.

Fix 1 — surface.h + surface.c: linked_decode_surface_id field on
object_surface, initialized to VA_INVALID_SURFACE. When AV1 picture has
apply_grain=1, VAAPI's VADecPictureParameterBufferAV1 carries a
current_display_picture distinct from current_frame. ffmpeg-vaapi calls
vaBeginPicture on current_frame (decode surface, slot gets bound) but
vaGetImage on current_display_picture (display surface, no slot) → NULL
deref in copy_surface_to_image.

Fix 2 — av1.c: in av1_set_controls, when cur_frame != cur_display, set
display_surface->linked_decode_surface_id = current_frame. Establishes
the back-link so display surface can borrow decode surface's data.

Fix 3 — image.c copy_surface_to_image: when slot is NULL and the
surface has linked_decode_surface_id, lookup the decode surface and
mirror its destination_data[] + destination_sizes[] +
destination_planes_count. NULL guard with diagnostic log retained.
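
The borrow in Fix 3 can be sketched as a lookup that falls back to the linked decode surface. This is a simplification under stated assumptions: the real code mirrors `destination_data[]`, `destination_sizes[]` and `destination_planes_count`, while the sketch returns a single hypothetical slot pointer, and the surface table indexed by ID is invented for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_INVALID_SURFACE 0xffffffffu

/* Hypothetical slot and surface types. */
struct sketch_slot { void *data; size_t size; };

struct sketch_surface {
    struct sketch_slot *current_slot;
    uint32_t linked_decode_surface_id;
};

/* Returns the slot a readback should copy from: the surface's own slot, or
 * the linked decode surface's slot when the surface has none. Returns NULL
 * when neither exists, which the caller must treat as an error. */
struct sketch_slot *sketch_readback_slot(struct sketch_surface *table,
                                         size_t count, uint32_t surface_id)
{
    assert(table != NULL);
    if (surface_id >= count)
        return NULL;

    struct sketch_surface *surface = &table[surface_id];
    if (surface->current_slot != NULL)
        return surface->current_slot;

    uint32_t link = surface->linked_decode_surface_id;
    if (link == SKETCH_INVALID_SURFACE || link >= count)
        return NULL;
    return table[link].current_slot;
}
```

The NULL return keeps the original guard: a display surface with neither a slot nor a valid link is reported rather than dereferenced.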

Fix 4 — av1.c fill_film_grain: when apply_grain=1, also set
V4L2_AV1_FILM_GRAIN_FLAG_UPDATE_GRAIN. Confirmed by strace-diff: kdirect
sends flags=0x0B (APPLY|UPDATE|...), libva was sending 0x09 (APPLY but
no UPDATE). Without UPDATE the kernel tries to reuse from
film_grain_params_ref_idx=0, which is never populated. Earlier reverted
because UPDATE seemed to trigger a SEGV — but that SEGV was the
unmasked NULL-slot deref; with fix 1+2+3 in place UPDATE is safe.
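
The flag change in Fix 4 can be sketched as below. The two constants are local stand-ins whose values are assumptions inferred from the 0x0B versus 0x09 strace diff (the differing bit is 0x02); the real definitions live in the kernel's V4L2 control header. The function name and the `other_flags` parameter are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Stand-ins for V4L2_AV1_FILM_GRAIN_FLAG_APPLY_GRAIN / _UPDATE_GRAIN
 * (assumed values). */
#define SKETCH_FG_FLAG_APPLY_GRAIN  0x01u
#define SKETCH_FG_FLAG_UPDATE_GRAIN 0x02u

/* Builds the film grain flags byte. other_flags carries the remaining
 * bits and must not already contain APPLY or UPDATE. */
uint8_t sketch_film_grain_flags(bool apply_grain, uint8_t other_flags)
{
    assert((other_flags &
            (SKETCH_FG_FLAG_APPLY_GRAIN | SKETCH_FG_FLAG_UPDATE_GRAIN)) == 0);

    uint8_t flags = other_flags;
    if (apply_grain)
        flags = (uint8_t)(flags | SKETCH_FG_FLAG_APPLY_GRAIN |
                          SKETCH_FG_FLAG_UPDATE_GRAIN);
    return flags;
}
```

With `other_flags` of 0x08 and `apply_grain` set, the result is 0x0B, the value the commit message reports from kdirect.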

Fix 5 — av1.c reference_frame_ts plumbing: when a referenced surface
has timestamp=0 AND linked_decode_surface_id set, follow the link to
find the decode surface that carries the real timestamp. Display
surfaces don't get OUTPUT QBUF'd by us, so their own timestamp stays
zero.

Also: BeginPicture diagnostic log + surface_unbind_slot diagnostic log
+ v4l2.c error_idx diagnostic (kept from earlier — useful for ongoing
investigation).

Verification on ampere:
  test_av1.ivf (208x208, 2 frames, no grain): bit-exact PASS sha
    029ee72c214b37c1 (unchanged, no regression)
  av1_larger.ivf (352x288, 10 frames, film_grain alternates):
    frame 0 (key, apply_grain=1): PASS bit-exact vs kdirect
    frame 4: PASS bit-exact
    frames 1,2,3,5,6,7,8,9: DIFFER

Frame 0 PASS proves: SEQUENCE + FRAME + TILE_GROUP_ENTRY + FILM_GRAIN
mapping is correct for IDR. Frame 4 PASS is unexplained but encouraging.
Inter-frame divergence (frame 1+) points at: reference handling for
inter prediction is still off — either order_hints[] (still zero,
VAAPI doesn't expose per-ref), or grain-applied vs pre-grain DPB
semantics, or ref_frame_idx pointing into the wrong surface space.

Next investigation: per-frame strace diff between libva and kdirect
controls payload to spot remaining field mis-mappings on inter frames.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-17 10:45:31 +00:00
claude-noether ab79ed5e4d ampere-av1 Phase 3 in-progress notes: UPDATE_GRAIN segfault; 352x288 still 0-byte
Phase 3 iteration on the av1_larger.ivf (352x288 film_grain-50) fixture.
208x208 test_av1.ivf remains bit-exact PASS at sha 029ee72c214b37c1
(libva == kdirect, post-reference_frame_ts plumbing).

Negative result this commit: setting V4L2_AV1_FILM_GRAIN_FLAG_UPDATE_GRAIN
unconditionally when apply_grain=1 triggered a userspace SIGSEGV in
ffmpeg on the av1_larger fixture (consistent across runs). Reverted to
implicit update_grain=0 — same behavior as before the experiment
(silent 0-byte output, no segfault).

Hypothesis ruled out: the 352x288 silent-decode-failure is NOT solved
by always-update_grain. A/B test earlier also confirmed that omitting
the FILM_GRAIN control entirely (AV1_NO_FG=1) still produces 0 bytes,
so film_grain is not the trigger.

Remaining Phase 3 investigation candidates:
  - tile_info field shape — single-tile av1_larger may stress mi_col/
    row_starts sentinel differently than single-tile test_av1
  - segmentation / quantization fields — different streams use
    different combinations
  - order_hints[] still zero (VAAPI doesn't expose per-ref)
  - kernel-side dev_dbg in vpu981 driver would expose which
    control field validation rejects
  - strace -e ioctl on the failing decode reveals MEDIA_REQUEST_IOC_QUEUE
    return value

Sibling-iteration parallel: ampere-kernel-decoders iter2-5 took
multiple iterations to localize the HEVC OOPS to kernel-side
ext_sps NULL init + slice_params; AV1 likely needs the same depth
of kernel-side instrumentation for the 352x288 case.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-17 10:31:24 +00:00
claude-noether 5fb7e36955 ampere-av1 Phase 3 fix: wire reference_frame_ts[] from VAAPI ref_frame_map[]
Phase 2.1 first hardware test on ampere passed frame 1 (IDR) bit-exact
vs kdirect but frame 2 (inter) diverged starting at byte 64897. Root
cause: reference_frame_ts[] left at zero — kernel can't cross-reference
prior CAPTURE buffers without timestamps.

Fix: in av1_set_controls (which has driver_data), iterate VAAPI's
ref_frame_map[8] (VASurfaceIDs), look up each via SURFACE(driver_data,
ref_id), and pull v4l2_timeval_to_ns(&ref_surface->timestamp) into the
V4L2 ctrl. VA_INVALID_SURFACE entries stay at calloc-zero. Mirrors the
vp9.c:614-628 pattern scaled to AV1's 8 ref slots.
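
The timestamp plumbing above can be sketched as a loop over the eight reference slots. All type and function names are hypothetical; the surface table indexed by ID stands in for the `SURFACE(driver_data, ref_id)` lookup, and the conversion repeats the arithmetic of the uAPI helper `v4l2_timeval_to_ns()` on a local struct so the sketch needs no kernel headers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_INVALID_SURFACE   0xffffffffu
#define SKETCH_AV1_NUM_REF_SLOTS 8

/* Local stand-ins for struct timeval and the surface object. */
struct sketch_timeval { int64_t tv_sec; int64_t tv_usec; };
struct sketch_surface { struct sketch_timeval timestamp; };

/* Seconds and microseconds to nanoseconds. */
uint64_t sketch_timeval_to_ns(const struct sketch_timeval *tv)
{
    assert(tv != NULL);
    return (uint64_t)tv->tv_sec * 1000000000ULL +
           (uint64_t)tv->tv_usec * 1000ULL;
}

/* Fills one timestamp per reference slot. Slots whose surface ID is
 * invalid or unknown stay at zero. */
void sketch_fill_reference_frame_ts(uint64_t *ts_out,
                                    const uint32_t *ref_frame_map,
                                    const struct sketch_surface *surfaces,
                                    size_t surface_count)
{
    assert(ts_out != NULL && ref_frame_map != NULL && surfaces != NULL);
    for (size_t i = 0; i < SKETCH_AV1_NUM_REF_SLOTS; i++) {
        uint32_t id = ref_frame_map[i];
        ts_out[i] = 0;
        if (id == SKETCH_INVALID_SURFACE || id >= surface_count)
            continue;
        ts_out[i] = sketch_timeval_to_ns(&surfaces[id].timestamp);
    }
}
```

The kernel matches these values against the timestamps of earlier OUTPUT buffers, so a zero entry cannot resolve to a reference.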

surface_object->timestamp itself is populated in picture.c::EndPicture
from context_object->timestamp_counter at QBUF time on the OUTPUT
buffer — already in place from iter1 baseline.

Verification on ampere (/tmp/test_av1.ivf 208x208, 2 frames):
  Frame 1 + 2 libva sha 029ee72c214b37c1 == kdirect 029ee72c214b37c1
  → 100% byte-identical, kdirect was Phase 0-verified bit-perfect

order_hints[] still zero — VAAPI doesn't expose per-ref POC; observed
not load-bearing on the 208x208 smoke vector. Multi-tile + film_grain
stress vectors are next (av1-1-b8-23-film_grain-50.ivf).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-17 10:28:32 +00:00
claude-noether 85bcddb5ad v4l2: surface error_idx + errno on VIDIOC_S_EXT_CTRLS failure
ampere-av1 Phase 2.1 + 3 diagnostic: log which control failed validation
on S_EXT_CTRLS rejection so debug iterations can identify the offending
CID without strace. Pre-validation failures (error_idx >= count) log as
"<pre-validation>" with the syscall errno surfacing the root reason.

Already informative on ampere — surfaces the pre-existing benign H264 +
HEVC device-init failures on the vpu981 AV1 fd as count=2 / failed_cid=0
(those go through (void)cast at context.c:450/473 by design).
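
The `error_idx` handling described in this commit can be sketched as below. Function names and the log format are hypothetical; only the rule that `error_idx >= count` means a pre-validation failure is taken from the commit message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Identifies which submitted control failed. Returns false when the
 * failure happened before any single control was validated. */
bool sketch_failed_control(const uint32_t *ctrl_ids, uint32_t count,
                           uint32_t error_idx, uint32_t *failed_cid)
{
    assert(ctrl_ids != NULL && failed_cid != NULL);
    if (error_idx >= count)
        return false; /* pre-validation failure */
    *failed_cid = ctrl_ids[error_idx];
    return true;
}

/* Writes one diagnostic line; saved_errno is errno captured right after
 * the failed ioctl. */
void sketch_log_ctrl_failure(FILE *out, const uint32_t *ctrl_ids,
                             uint32_t count, uint32_t error_idx,
                             int saved_errno)
{
    uint32_t cid = 0;
    assert(out != NULL);
    if (sketch_failed_control(ctrl_ids, count, error_idx, &cid))
        fprintf(out, "S_EXT_CTRLS failed: count=%u cid=0x%08x (%s)\n",
                (unsigned)count, (unsigned)cid, strerror(saved_errno));
    else
        fprintf(out, "S_EXT_CTRLS failed: count=%u <pre-validation> (%s)\n",
                (unsigned)count, strerror(saved_errno));
}
```

Capturing `errno` before any other library call matters here, since logging itself can overwrite it.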

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-17 10:20:31 +00:00
claude-noether 9c30eccd52 ampere-av1 Phase 2.1: implement av1_set_controls body (~500 LoC)
Replaces stub av1_set_controls with full VAAPI → V4L2 stateless AV1
control translation. Four V4L2 controls batched per-frame:
  V4L2_CID_STATELESS_AV1_SEQUENCE       (sequence-level flags)
  V4L2_CID_STATELESS_AV1_FRAME          (heavy — quant, lf, cdef, lr, gm,
                                          tile_info, refs, frame flags)
  V4L2_CID_STATELESS_AV1_TILE_GROUP_ENTRY[] (DYNAMIC_ARRAY, size=MAX(N,1))
  V4L2_CID_STATELESS_AV1_FILM_GRAIN     (gated on driver_data->has_av1_film_grain)

Reference: Kwiboo/FFmpeg v4l2-request-n8.1:libavcodec/v4l2_request_av1.c
(636 LoC); same V4L2 output schema, sourced from VAAPI's
VADecPictureParameterBufferAV1 instead of FFmpeg's AV1RawSequenceHeader.

VAAPI gap notes (fields the spec needs but VAAPI doesn't expose):
  - sequence max_frame_{width,height}_minus_1 — use current frame size
  - enable_warped_motion / enable_ref_frame_mvs / enable_superres /
    enable_restoration sequence-level — conservative set-true (per-frame
    flags gate actual behavior)
  - order_hints[], reference_frame_ts[] — zero (kernel cross-refs by
    OUTPUT timestamp / surface id)
  - tile_start_col_sb[] / tile_start_row_sb[] — reconstruct via
    prefix-sum on VAAPI's width/height_in_sbs_minus_1[]
  - tile_size_bytes — set to 4 for multi-tile frames (max value), 0
    for single-tile (matches Kwiboo's conditional)
  - render_width/height — fall back to coded dimensions
  - current_frame_id / refresh_frame_flags / skip_mode_frame_idx /
    buffer_removal_time / frame_refs_short_signaling — zero
  - film_grain_params_ref_idx / update_grain — zero (only consulted in
    reuse paths; apply_grain=1 + populated arrays drive decode directly)
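
The prefix-sum reconstruction of tile start positions mentioned in the list above can be sketched as follows. The function name is hypothetical, and writing the total extent as a trailing entry is an assumption of this sketch rather than something the commit message states.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Rebuilds tile start positions (in superblocks) from per-tile sizes
 * stored "minus 1". starts_out must have room for tile_count + 1 entries;
 * the last entry receives the total extent. */
void sketch_tile_starts_from_sizes(uint32_t *starts_out,
                                   const uint16_t *size_in_sbs_minus_1,
                                   size_t tile_count)
{
    assert(starts_out != NULL && size_in_sbs_minus_1 != NULL);

    uint32_t position = 0;
    for (size_t i = 0; i < tile_count; i++) {
        starts_out[i] = position;
        position += (uint32_t)size_in_sbs_minus_1[i] + 1u;
    }
    starts_out[tile_count] = position;
}
```

The same helper serves columns and rows: pass the width array for one and the height array for the other.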

F1/F2/F3 risk mitigations per phase1_plan_v2:
  F1: mi_col/row_starts sentinel = 2 * ((frame_width + 7) >> 3) at
      index [tile_cols]/[tile_rows] — mirrors Kwiboo lines 238/244
  F2: superres_denom direct from VAAPI's superres_scale_denominator
      (VAAPI's encoding is the final value; no AV1_SUPERRES_DENOM_MIN
      math). Fallback to AV1_SUPERRES_NUM=8 if zero.
  F3: loop_restoration_size[] gated on USES_LR flag derived from
      y_t != 0 || cb_t != 0 || cr_t != 0 — mirrors Kwiboo lines 281-287
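
F1 and F2 reduce to two small computations, sketched here under stated assumptions. The function names are hypothetical, and using the frame height for the row sentinel is an assumption; the commit message gives the formula in terms of `frame_width` only.

```c
#include <assert.h>
#include <stdint.h>

/* AV1 superres numerator; the denominator falls back to this value. */
#define SKETCH_AV1_SUPERRES_NUM 8u

/* F1: sentinel stored after the last tile start, computed from the frame
 * extent in pixels. */
uint32_t sketch_mi_starts_sentinel(uint32_t frame_extent_px)
{
    assert(frame_extent_px <= UINT32_MAX - 7u);
    return 2u * ((frame_extent_px + 7u) >> 3);
}

/* F2: use the VAAPI denominator as-is, falling back to 8 when it is zero. */
uint8_t sketch_superres_denom(uint8_t va_superres_scale_denominator)
{
    if (va_superres_scale_denominator != 0)
        return va_superres_scale_denominator;
    return (uint8_t)SKETCH_AV1_SUPERRES_NUM;
}
```

For the 352-pixel-wide stress vector the column sentinel works out to 88, and for the 208-pixel smoke vector to 52.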

Plus:
  - request.h: has_av1_film_grain bool on driver_data
  - request.c: probe VIDIOC_QUERY_EXT_CTRL for FILM_GRAIN on vpu981 fd
    at VA_DRIVER_INIT (Janet v3 amendment A: init-time, not lazy)

Compile-tested on boltzmann (aarch64 native, gcc 15.2.1): clean .so,
0 errors, pre-existing GStreamer #warnings only.

Phase 3 verification on ampere is next: 208x208 smoke + film_grain
stress vector (av1-1-b8-23-film_grain-50.ivf) byte-compare libva vs
kdirect (Phase 0 proved kdirect bit-perfect).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-17 10:18:46 +00:00
claude-noether 78a9978b02 ampere-av1 Phase 2 step 4: AV1 dispatch scaffolding compiles and wires
surface.h: av1 substruct (picture + tile_group_entries[AV1_MAX_TILES=128]
  + num_tile_group_entries counter)
picture.c: dispatch VAPictureParameterBufferAV1 + VASliceParameterBufferAV1
  into surface->params.av1.*; call av1_set_controls in EndPicture path
av1.h: minimal interface (av1_set_controls signature)
av1.c: stub set_controls returning -1 with diagnostic; _Static_assert on
  v4l2_ctrl_av1_tile_group_entry size = 16 (Janet hygiene)
meson.build: av1.c + av1.h in source list

Verified on ampere with /tmp/test_av1.ivf via LIBVA_DRIVER_NAME=v4l2_request:

  v4l2-request: ampere-av1: vpu981 AV1 decoder at /dev/video4 + /dev/media3
  v4l2-request: ampere-av1: av1_set_controls stub — Phase 2.1 will implement ...
  [av1] Failed to end picture decode issue: 1 (operation failed).
  [av1] HW accel end frame fail.
  [dec:av1] Error submitting packet to decoder: Input/output error

Clean graceful failure — vpu981 probe works, dispatch reaches av1.c,
stub returns ERROR, ffmpeg falls back to SW. No crash, no IOMMU fault,
no kernel taint.

Next: Phase 2.1 implementation of fill_sequence + fill_frame +
fill_film_grain + fill_tile_group_entries (~700 LoC mirror of Kwiboo
v4l2_request_av1.c, applying F1/F2/F3 implementation-time corrections
from Janet review v2).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-17 09:55:39 +02:00
claude-noether 61db76ebcf ampere-av1 Phase 2 step 2: advertise VAProfileAV1Profile0 via libva
Extended any_fd_supports_output_format() with vpu981 fd as 4th probe
target. Added V4L2_PIX_FMT_AV1_FRAME advertisement in
RequestQueryConfigProfiles. VAProfileAV1Profile0 in entrypoints +
GetConfigAttributes switches.

V4L2_REQUEST_MAX_PROFILES=11 now exactly full; comment added warning
about future profile additions needing the constant bumped.

Verified via vainfo:
  VAProfileMPEG2Simple/Main, H264×5, HEVC, VP8, AV1   — all advertised
  (VP9 absent because rkvdec module is on sibling-campaign-close
   state, not the broken vp9-iter1; restoring VP9 needs the
   ampere-vp9-enablement campaign reopened or the fail-state module
   reloaded.)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-17 09:54:12 +02:00
claude-noether bed75c0cef ampere-av1 Phase 2 step 1: third-device fd scaffolding for vpu981
RK3588 has THREE hantro-vpu instances (legacy MPEG2/VP8 at /dev/video2,
encoder at /dev/video3, vpu981 AV1 at /dev/video4). The existing
2-device probe is "RK3399-shaped knowledge" — silently picks the first
hantro-vpu and never finds vpu981.

This commit adds:
- video_fd_vpu981 + media_fd_vpu981 slots to request_data
- video_node_supports_output_fmt(): capability probe via VIDIOC_ENUM_FMT
  on OUTPUT/OUTPUT_MPLANE queues
- find_decoder_device_by_driver_with_fmt(): walks /dev/media* matching
  driver name AND capability filter (V4L2_PIX_FMT_AV1_FRAME for vpu981)
- 'a' kind in request_device_kind_for_profile (VAProfileAV1Profile0)
- 'a' branch in request_switch_device_for_profile
- vpu981 probe at backend init, alongside existing rkvdec + hantro
- vpu981 fd cleanup in RequestTerminate
- VAProfileAV1Profile0 → V4L2_PIX_FMT_AV1_FRAME in codec.c

Verified on ampere:

  $ LIBVA_DRIVER_NAME=v4l2_request ffmpeg ... 2>&1 | grep iter38
  v4l2-request: auto-selected codec device: /dev/video1 + /dev/media0
  v4l2-request: iter38: also opened hantro-vpu decoder at /dev/video2 + /dev/media1
  v4l2-request: ampere-av1: vpu981 AV1 decoder at /dev/video4 + /dev/media3

Three devices opened. HEVC still works (iter2 EXT_SPS_RPS probe still
triggers on rkvdec, sibling-campaign bit-perfect behaviour preserved).

Next steps: config.c advertise VAProfileAV1Profile0, surface.h add
av1 substruct, picture.c dispatch, av1.{c,h} for the codec dispatch
(~700 LoC mirroring Kwiboo v4l2_request_av1.c).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-17 09:53:37 +02:00
claude-noether 1a2c958ab3 iter2 step4: wire h265_set_controls to populate EXT_SPS_*_RPS controls
Per Phase 4 plan + Phase 5 review amendments (SPS parse-and-cache,
per-fd gating).

src/h265.c additions:
  - #include <errno.h>, the v4l2-hevc-ext-controls.h, and the
    vendored gst/codecparsers/gsth265parser.h
  - new static helper h265_populate_ext_sps_rps_cache(): walks
    surface_object->source_data for an SPS NAL (nal_unit_type == 33)
    using gst_h265_parser_identify_nalu; if found, calls
    gst_h265_parser_parse_sps_ext (NOT gst_h265_parser_parse_sps —
    the latter discards the per-RPS-entry EXT data we need); maps
    GstH265ShortTermRefPicSet (base) + GstH265ShortTermRefPicSetExt
    (carrying use_delta_flag[16], used_by_curr_pic_flag[16],
    delta_poc_s0_minus1[16], delta_poc_s1_minus1[16]) into the V4L2
    struct arrays; stores on driver_data->hevc_rps_cache_*
  - non-IDR-frame handling: cache holds across frames, so frames
    whose source_data lacks an SPS NAL reuse the previously-parsed
    cached arrays (Phase 5 review item #3)
  - controls[] grows from [5] to [7]; the 2 new entries are appended
    after the standard 5 (SPS/PPS/SLICE_PARAMS/SCALING_MATRIX/
    DECODE_PARAMS), gated by driver_data->has_hevc_ext_sps_rps_rkvdec
    (per-fd probe result from Step 3) + the cache being valid
  - field-by-field mapping mirrors GStreamer's
    gst_v4l2_codec_h265_dec_fill_ext_sps_rps verbatim (the upstream
    reference identified in Phase 0 prior-art survey)
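
Locating the SPS NAL can be illustrated without the vendored parser. This sketch is a swapped-in manual scan, not the `gst_h265_parser_identify_nalu` call the commit uses, and it assumes the buffer is an Annex-B byte stream with `00 00 01` start codes, which the commit message does not confirm. The function name is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_HEVC_NAL_SPS 33u

/* Scans for a start code followed by an SPS NAL header. On success stores
 * the offset of the first NAL header byte and returns true. */
bool sketch_find_hevc_sps(const uint8_t *data, size_t size, size_t *nal_offset)
{
    assert(data != NULL && nal_offset != NULL);
    for (size_t i = 0; i + 3 < size; i++) {
        if (data[i] != 0 || data[i + 1] != 0 || data[i + 2] != 1)
            continue;
        /* nal_unit_type occupies bits 1..6 of the first header byte. */
        unsigned nal_unit_type = (data[i + 3] >> 1) & 0x3fu;
        if (nal_unit_type == SKETCH_HEVC_NAL_SPS) {
            *nal_offset = i + 3;
            return true;
        }
    }
    return false;
}
```

A false return corresponds to the cached-arrays case: the frame carries no SPS, so the previously parsed data is reused.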

src/request.h additions:
  - struct request_data carries hevc_rps_cache_st (array pointer),
    _st_count, hevc_rps_cache_lt, _lt_count, hevc_rps_cache_valid.
    Single-slot cache (sps_id 0 only; multi-SPS streams would need
    expanding). Stores POST-MAPPED V4L2 structs so request.h doesn't
    need to know GstH265SPS / GstH265SPSEXT types.

Critical interpretation correction (Phase 5 review followup):
GstH265SPS has short_term_ref_pic_set[65] (base) but NOT
short_term_ref_pic_set_ext[]. The EXT array lives on a SEPARATE
GstH265SPSEXT struct accessed via gst_h265_parser_parse_sps_ext.
The 'plain' gst_h265_parser_parse_sps internally calls _ext with a
LOCAL discarded SPSEXT (see gsth265parser.c:2050). Our call must
use the _ext variant directly to keep the EXT data. Caught during
Step 4 first-build error.

Build verified: ninja -C build clean. .so is 759 KB (up from 485 KB
original, 682 KB after Step 2 vendor — the +80 KB is the new helper
+ extension).

iter2 Phase 6 Step 5 (install + reboot + smoke-test) is the F1
falsifier moment: if HEVC stops OOPSing, mechanism confirmed; if it
still OOPSes, loop back to Phase 0 with a re-opened kernel-agent#11.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-16 11:09:58 +02:00
claude-noether 4f6ba6c0e3 iter2 step3: HEVC EXT_SPS_*_RPS UAPI header + runtime probe
src/hevc-ctrls/v4l2-hevc-ext-controls.h (NEW, MIT, ~95 LOC):
  Verbatim mirror of Linux 7.0 V4L2_CID_STATELESS_HEVC_EXT_SPS_ST_RPS
  and _LT_RPS control IDs + struct definitions + flag macros. Each
  symbol is ifndef-guarded so when ampere's linux-api-headers
  eventually bumps to 7.0+, the kernel header takes precedence and
  this shim silently no-ops. Citation block links the upstream
  Casanova v8 series.

  Per LGPL section 3.b, kernel UAPI struct definitions are excepted
  from GPL inheritance, so copying them into MIT userspace is fine.

src/request.h: added has_hevc_ext_sps_rps_rkvdec + _hantro bool
  fields on struct request_data — pair-of-flags layout mirrors
  video_fd_rkvdec / video_fd_hantro (iter38 multi-device-probe
  pattern, per feedback_multi_device_probe_design). Phase 5 review
  identified single-scalar storage as a silent-misbehavior risk
  across device-switch boundaries.

src/request.c:
  - new probe_hevc_ext_sps_rps_controls(fd) helper: queries the two
    new CIDs via VIDIOC_QUERYCTRL; returns true iff both register.
    RK3399 rkvdec (linux 6.x or 7.x without VDPU381/383 bindings)
    returns false; RK3588 rkvdec (VDPU381/383) returns true.
  - probe each driver_data->video_fd_rkvdec / _hantro after the
    iter38 multi-device-probe block at VA_DRIVER_INIT time
  - log-line if rkvdec supports it - diagnostic for Phase 7

src/meson.build: added the new UAPI header to the headers list.

Build verified: ninja -C build clean, .so produced. The new probe
runs at driver init and stores the result, but nothing CONSUMES the
result yet — that's Step 4 (h265_set_controls wiring).

Per ampere-kernel-decoders campaign iter2 Phase 4 step 3 (amended
by Phase 5 review item 'per-fd storage').

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-16 11:08:10 +02:00
claude-noether c5fbc5bf04 iter2 step2: GLib/GStreamer compat shim, build succeeds
Vendored gsth265parser + nalutils + gstbitreader + gstbytereader (the
Step 1 commit) compile cleanly against libc + libv4l2 only after
adding 1 compat translation unit + 5 stub headers, no edits to the
vendored .c/.h files themselves.

src/h265_parser/gst_compat.{h,c} — new files (MIT, original work):
  - GLib type aliases (gboolean, gchar, gint*, guint*, gsize, gpointer)
  - Memory helpers (g_malloc/g_free as #define free, g_memdup2 inline)
  - Asserts as no-op + parser-return-code-propagation
  - All GST_DEBUG/INFO/WARNING/ERROR/LOG/FIXME as no-ops (the parser
    is heavy on debug logging; we compile it all out)
  - GArray implementation (~100 LOC, just enough for gsth265parser.c's
    24 call sites)
  - GList full struct with .data/.next/.prev so callers compile;
    list-manipulation functions abort() — dead code paths only
  - Byte-order read/write macros (GST_READ_UINT8/16/24/32/64_LE/BE,
    GST_WRITE_UINT8/16/24/32_BE) — aarch64 LE inlines
  - g_once_init_enter/leave as simple gate
  - G_MAXUINT*, G_MAXINT*, G_MINxxx, G_GNUC_* attribute macros, etc.
  - Opaque GstBuffer/GstMemory/GstMapInfo + abort-stub functions for
    the encoder-side SEI-insertion paths the libva backend never invokes
  - gst_util_ceil_log2 real impl (used by slice-header parser; dead
    for our SPS-only call path but cheaper to implement than stub)
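
A few of the shim pieces listed above can be sketched as follows. This is an illustration of the approach, not the contents of `gst_compat.h`: the GLib and GStreamer names are the real ones being replaced, the bodies are minimal libc-only stand-ins, and `sketch_read_uint32_be` is a hypothetical helper.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* GLib type aliases mapped onto libc types. */
typedef int gboolean;
typedef size_t gsize;
typedef void *gpointer;
typedef const void *gconstpointer;
typedef uint8_t guint8;
typedef uint32_t guint32;

/* Memory helpers forwarded to libc. */
#define g_malloc(n) malloc(n)
#define g_free(p) free(p)

/* Copies byte_size bytes; returns NULL for a NULL source or zero size. */
static inline gpointer g_memdup2(gconstpointer mem, gsize byte_size)
{
    if (mem == NULL || byte_size == 0)
        return NULL;
    gpointer copy = g_malloc(byte_size);
    assert(copy != NULL); /* treat allocation failure as fatal */
    memcpy(copy, mem, byte_size);
    return copy;
}

/* Big-endian 32-bit read, independent of host byte order. */
static inline guint32 sketch_read_uint32_be(const guint8 *data)
{
    return ((guint32)data[0] << 24) | ((guint32)data[1] << 16) |
           ((guint32)data[2] << 8) | (guint32)data[3];
}
#define GST_READ_UINT32_BE(data) sketch_read_uint32_be((const guint8 *)(data))
```

Keeping the original names means the vendored sources compile unmodified, which is the point of the shim.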

src/h265_parser/gst/{gst.h,base/base-prelude.h,base/gstbitwriter.h,
codecparsers/codecparsers-prelude.h,glib-compat-private.h} — 5 new
stub headers (MIT). All include gst_compat.h. gstbitwriter.h adds
abort-stub functions for the bit-writer API (used by nalutils.c's NAL
emulation-prevention encoder path — dead code for the parse-only
libva backend).

src/meson.build — added the 5 new .c source files and 10 new .h
headers; added include_directories('h265_parser') to the include path
so the vendored files' '#include <gst/base/...>' style references
resolve to the stub headers + actual vendored files in the local
tree.

Build verified: ninja -C build produces v4l2_request_drv_video.so
(682 KB, up from 485 KB pre-vendor — the +200 KB is the vendored
parser code). nm shows gst_h265_parse_sps, gst_h265_parse_sps_ext,
gst_h265_parser_identify_nalu, and the other functions we need for
Step 4 are present in the binary.

Two #warning messages from gsth265parser.h about API stability are
upstream-intentional and harmless ('The H.265 parsing library is
unstable API and may change in future').

This commit completes Step 2 of ampere-kernel-decoders iter2 Phase 6.
Backend remains functionally identical to pre-iter2 — the new code
compiles + links but is not yet called from h265_set_controls (that's
Step 4). Existing 5 codecs continue to work as before.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-16 11:06:30 +02:00
claude-noether f91c3f53c5 iter2 step1: vendor GStreamer 1.28.2 H.265 parser unchanged
Source: gitlab.freedesktop.org/gstreamer/gstreamer @ commit 43421c2a5b8a
(refs/tags/1.28.2). All 8 vendored files copied verbatim into
src/h265_parser/:

  gst-plugins-bad/gst-libs/gst/codecparsers/gsth265parser.c (168 KB)
  gst-plugins-bad/gst-libs/gst/codecparsers/gsth265parser.h ( 92 KB)
  gst-plugins-bad/gst-libs/gst/codecparsers/nalutils.c       (13 KB)
  gst-plugins-bad/gst-libs/gst/codecparsers/nalutils.h       (  8 KB)
  gstreamer/libs/gst/base/gstbitreader.c                     (  8 KB)
  gstreamer/libs/gst/base/gstbitreader.h                     ( 10 KB)
  gstreamer/libs/gst/base/gstbytereader.c                    ( 39 KB)
  gstreamer/libs/gst/base/gstbytereader.h                    ( 25 KB)

Total ~11 KLOC, LGPL v2.1+ per original headers (Intel + Sreerenj
Balachandran + others). LGPL headers preserved verbatim. Backend's
existing COPYING.LGPL covers redistribution.

** Build is INTENTIONALLY BROKEN at this commit. ** GLib dependencies
(GArray, g_malloc, gboolean, GST_DEBUG, etc.) are not yet satisfied;
src/Makefile.am is not yet updated to include these files. Step 2
performs the GLib-to-libc mechanical adaptation; Step 3 wires the
header + Makefile.

This vendor-unchanged commit is the upstream-tracking baseline. When
GStreamer ships a parser bug fix, the future-sync workflow is:
  git diff (this commit)..HEAD -- src/h265_parser/
to surface our adaptations, then rebase those over the upstream fix.

Per ampere-kernel-decoders campaign iter2 Phase 4 §Step 1
(/home/mfritsche/src/ampere-kernel-decoders/phase4_plan_iter2.md).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-16 11:02:12 +02:00
32 changed files with 1146 additions and 3668 deletions
+56 -262
@@ -1,281 +1,75 @@
# libva-v4l2-request-fourier
# v4l2-request libVA Backend
VA-API ICD backend for V4L2 stateless video decoders. Fourier-campaign
fork of the dormant `bootlin/libva-v4l2-request` upstream.
## About
> **TL;DR for "I want hardware-accelerated YouTube in Firefox on my
> Rockchip board":** skip to the [§ Quickstart](#quickstart) below.
> Fresnel (RK3399) and ampere (RK3588) are validated targets; ohm
> (RK3566 PineTab2) is the chromium-fourier validation rig.
This libVA backend is designed to work with the Linux Video4Linux2
Request API that is used by a number of video codecs drivers,
including the Video Engine found in most Allwinner SoCs.
## What works
## Status
| SoC / host | HW-accelerated codecs | Bit-exact vs `kdirect` |
|---|---|---|
| RK3399 (fresnel — Pinebook Pro) | H.264, HEVC Main, VP9 Profile 0, VP8, MPEG-2 | 5/5 at iter38; preserved through iter40b |
| RK3588 (ampere) | H.264 + HEVC (iter1+iter2 ampere-fourier); **mainline rkvdec / VDPU381 + VDPU383 landed February 2026** — VP9 / AV1 verification next | iter1 H.264 PASS; remaining codecs gated on mainline-driver bring-up |
| RK3568 / RK3566 (ohm — PineTab2) | H.264, MPEG-2, VP8 via hantro multi-planar | iter1-5 baseline (libva-multiplanar campaign) |
| BCM2712 (higgs — Pi 5 / CM5) | — | infrastructure landed (iter40 / iter40b), bit-exact NOT achieved, [see § Pi 5 standoff](#the-pi-5-standoff) |
`kdirect` is the reference: `ffmpeg -hwaccel v4l2request
-hwaccel_output_format drm_prime ...` via Kwiboo's downstream ffmpeg
patches (packaged here as **`ffmpeg-v4l2-request-fourier`**, FFmpeg 8.1
tip @ Kwiboo `v4l2-request-n8.1` commit `b57fbbe`).
## Quickstart
### What you need for HW-accelerated YouTube in Firefox
The full stack, top to bottom, with the package this campaign provides
at each layer:
| Layer | Package(s) | Notes |
|---|---|---|
| Linux kernel with V4L2 stateless decoders | `linux-fresnel-fourier` (RK3399), `linux-ampere-fourier` (RK3588) | Mainline rkvdec / hantro / VDPU381 / VDPU383. ohm typically rides on a Beryllium OS host kernel. |
| `ffmpeg` with Kwiboo's v4l2-request hwaccel | `ffmpeg-v4l2-request-fourier` | Provides `-hwaccel drm -c:v hevc` (and h264/vp9) routes via libavcodec hwdevice DRM. |
| `libva` VA-API runtime + this backend ICD | `libva` (stock) + **`libva-v4l2-request-fourier`** | This repo. Auto-detects rkvdec / hantro / cedrus on probe. |
| Firefox patched to call libavcodec stateless | `firefox-fourier` | 5-patch series, ~+169 LoC over stock Firefox. Validated on fresnel: **~5 % CPU at 1080p30 H.264** (vs 64 % software). |
| (Wayland alt) Chromium patched for V4L2VDA | `chromium-fourier` + `kwin-fourier` | Validated on ohm under KDE Plasma 6.6.5 Wayland. Needs `kwin-fourier` for the dmabuf-fence latency fix. |
| (Optional) panfrost / panthor GPU stack | `vulkan-panfrost` | Wayland compositor + 3D. |
The actual VA-API path is mostly historical inside this campaign — the
**user-facing browser HW decode story rides libavcodec's
`v4l2_request` hwaccel directly**, not VAAPI-via-libva. Firefox-fourier
attaches an `AV_HWDEVICE_TYPE_DRM` context to libavcodec's generic
`h264`/`hevc`/`vp9` decoder; libavcodec then auto-binds the
`v4l2_request` hwaccel from its `hw_configs`. No `LIBVA_DRIVER_NAME`
incantation needed for browser use. libva-v4l2-request-fourier matters
for mpv, ffmpeg-as-vaapi, and other VA-API direct consumers.
### Install on Arch ALARM (fresnel / ampere / ohm)
Add the marfrit repo if you haven't already:
```ini
# /etc/pacman.conf
[marfrit]
SigLevel = Required
Server = https://packages.reauktion.de/arch/$arch
```
Import the signing key (one-time):
```bash
sudo pacman-key --recv-keys <KEY-ID> # see https://packages.reauktion.de
sudo pacman-key --lsign-key <KEY-ID>
sudo pacman -Sy
```
Then per host:
```bash
# Fresnel — RK3399 Pinebook Pro
sudo pacman -S \
linux-fresnel-fourier linux-fresnel-fourier-headers \
ffmpeg-v4l2-request-fourier \
libva-v4l2-request-fourier \
firefox-fourier
# Ampere — RK3588
sudo pacman -S \
linux-ampere-fourier linux-ampere-fourier-headers \
ffmpeg-v4l2-request-fourier \
libva-v4l2-request-fourier \
firefox-fourier
# Ohm — RK3566 PineTab2 (chromium-fourier validated path)
sudo pacman -S \
ffmpeg-v4l2-request-fourier \
libva-v4l2-request-fourier \
kwin-fourier
# chromium-fourier currently still a local build — see § Status
```
Reboot if a new kernel landed. Then:
```bash
# Smoke-test: vainfo should list HEVCMain + H264 entries
LIBVA_DRIVER_NAME=v4l2_request vainfo
# Browser launch with verbose decoder logging
MOZ_LOG="PlatformDecoderModule:5,FFmpegVideo:5" \
firefox-fourier 2>&1 | tee /tmp/fx.log
# Then open a YouTube 1080p H.264 video and grep for:
# "Choosing FFmpeg pixel format for V4L2 video decoding"
# "av_hwdevice_ctx_create(DRM, /dev/dri/renderD128) ok"
# If you DON'T see those: HW path didn't engage, fell back to software.
```
### Status of the published vs locally-built packages
As of May 2026, the live marfrit repo at
<https://packages.reauktion.de/arch/aarch64/> has:
- `libva-v4l2-request-fourier-1:1.0.0.r361.cf8cd9d-1` (iter40b tip)
- `ffmpeg-v4l2-request-fourier-2:8.1.r123329.b57fbbe-3` (Kwiboo's
  v4l2-request-n8.1 + libudev-bypass; smoke-tested on fresnel:
  HEVC via `-hwaccel v4l2request` PASS)
- `firefox-fourier-150.0.1-16` (5-patch series, sandboxed RDD HW
  decode validated on RK3399: ~5 % CPU at 1080p30 H.264)
- `linux-fresnel-fourier-7.0-14` + headers (RK3399)
- `linux-ampere-fourier-7.0rc3.kafr1-1` + headers (RK3588)
- `kwin-fourier-1:6.6.5-1` (Wayland dmabuf-fence fix for chromium-fourier)
- `vulkan-panfrost-1:26.0.5-1` (GPU stack)
NOT yet published but **present in `marfrit-packages/arch/` source
tree** (build + publish pending):
- `chromium-fourier` (Chromium 147 + V4L2VDA-on-mainline patches;
  blocked on Arch ALARM bumping clang 22 → 23).
- `qt6-base-fourier` (GL_ALPHA → GL_R8 fix, needed by KDE Plasma
  Wayland on the panfrost stack).
If you need those locally before they ship:
```bash
git clone ssh://git@git.reauktion.de:2222/marfrit/marfrit-packages.git
cd marfrit-packages/arch/<package>
makepkg -si
```
## What does NOT work, and why it's stalled
| Target | Status | Blocker |
|---|---|---|
| H264 Hi10P on RK3399 | enumerated, decode returns all-zero | RK3399 silicon doesn't implement 10-bit despite kernel advertising the profile (iter39 close, Option B applied) |
| HEVC Main10 on RK3399 | not enumerated | same as Hi10P |
| **Pi 5 / CM5 (BCM2712 / `rpi-hevc-dec`)** | infrastructure landed (iter40 / iter40b), bit-exact NOT achieved | see "The Pi 5 standoff" below |
### The Pi 5 standoff
iter40 + iter40b add a third multi-device-probe slot for
`rpi-hevc-dec`, an NC12 SAND128 detile primitive, per-driver gates
around the SPS pre-seed + start-code-prepend + scaling_matrix submission,
and a (fragile, fixture-specific) SPS field override using the
GStreamer 1.28.2 H.265 parser. ICD discovery works, `vainfo` lists
`VAProfileHEVCMain`, S\_FMT / REQBUFS / STREAMON all succeed.
**Decode itself never succeeds** — every CAPTURE DQBUF returns
`V4L2_BUF_FLAG_ERROR`. Driver author John Cox confirmed strict SPS
validation is intentional ("`try_ext_ctrls returned an error (22)` is
expected as it is validating the SPS"), and VAAPI's
`VAPictureParameterBufferHEVC` simply doesn't carry the bitstream-true
scalars (`sps_max_num_reorder_pics`, `sps_max_latency_increase_plus1`,
slice-level `num_entry_point_offsets`) that the driver wants. We can't
fish the SPS out of `source_data` either, because ffmpeg-vaapi parses
the SPS itself and passes only slice NAL bytes to libva backends.
This is not a bug in our backend, in libva, in ffmpeg, or in the kernel
driver. It's an ecosystem coordination failure of long standing:
- **Kwiboo's `ffmpeg-v4l2request` hwaccel** has been in production via
LibreELEC since December 2018. Re-submitted to ffmpeg-devel as a v2
series in August 2024. Still un-merged in May 2026 — **eight years
in the upstream review queue**.
- **`libva-v4l2-request`** (this project's upstream) hasn't taken
meaningful commits since ~2021. Nobody wants to own the impedance
mismatch between VAAPI's Intel-shaped "give me raw bitstream, I'll
parse" and V4L2 stateless's kernel-shaped "give me parsed structs,
I'll just drive the HW."
- **`rpi-hevc-dec` mainline submission** is at v4 (July 2025), 17
months in review. The Pi 6.18.x downstream kernel meanwhile has
active HEVC regressions ([raspberrypi/linux#7228](https://github.com/raspberrypi/linux/issues/7228),
[#7306](https://github.com/raspberrypi/linux/issues/7306)) that
aren't being fast-tracked because "the new uAPI is coming."
- **Mozilla is implementing Pi 5 HEVC via ffmpeg's hwaccel-context
path** (bug [1969297](https://bugzilla.mozilla.org/show_bug.cgi?id=1969297)),
not via libva — explicit acknowledgement from David Turner that
libavcodec needs to retain the SPS context for the strict driver to
accept the control batch.
What end-users actually do today: run Pi OS (downstream-patched ffmpeg
+ downstream kernel) or LibreELEC (Kwiboo's patches + downstream
kernel). Anyone on a stock distro outside those two: no HW HEVC on
Pi 5.
Nobody who has authority to merge has skin in the game. Everyone with
skin in the game lacks authority. Result: 8-year stalemate, three
forks of working code, no merged upstream.
### What this means for this backend
We chose to extend `libva-v4l2-request` into Pi 5 territory because
the architecture maps cleanly onto the existing iter38 multi-device
probe. That work landed (iter40 commit `3ffa9d0`, iter40b commit
`071b08d`). It's reusable infrastructure for any future strict V4L2
stateless decoder that ffmpeg ships before libva does.
But the *user-facing* Pi 5 HEVC story will not come from this
backend. The backend was a clean architectural target inside a
coordination dead-end. The actual Pi 5 HEVC path through libva
requires either:
- a VAAPI extension exposing the SPS scalars rpi-hevc-dec validates
against (Intel-driven; no Pi-aligned principal),
- a libva-internal `VABufferType` for raw SPS/PPS NAL bytes (no
maintainer),
- ffmpeg-vaapi forwarding `num_entry_point_offsets` to backends
(small upstream patch; no champion), OR
- the political situation around Kwiboo's series unblocks (no
visible movement).
iter40 + iter40b are **landed but parked**. The fresnel + ampere
sibling paths are unaffected (5/5 fresnel + 9 profiles ampere
verified post-iter40b, no regression). Phase 8 packaging is
deliberately skipped — shipping a `.deb` whose primary advertised
target (Pi 5) doesn't actually decode would mislead users.
See `phase0_pi5_hevc.md`, `phase1_pi5_hevc.md`,
`phase5_pi5_hevc_review.md`, `phase7_pi5_hevc_close.md` for the
chapter's full empirical record.
The v4l2-request libVA backend currently supports the following formats:
* MPEG2 (Simple and Main profiles)
* H264 (Baseline, Main and High profiles)
* H265 (Main profile)
## Instructions
In order to use this libVA backend, specify the `v4l2_request` driver
through the `LIBVA_DRIVER_NAME` environment variable:

    export LIBVA_DRIVER_NAME=v4l2_request

A VA-API-capable media player (such as VLC or mpv) can then decode a
video in a supported format on a probed device:

    vlc path/to/video.mp4
    mpv --hwdec=vaapi path/to/video.mp4
    ffmpeg -hwaccel vaapi -hwaccel_output_format vaapi -i in.mp4 -f null -
    vlc path/to/video.mpg
The backend auto-detects available decoders via the V4L2 media
topology walk; honors `LIBVA_V4L2_REQUEST_VIDEO_PATH` and
`LIBVA_V4L2_REQUEST_MEDIA_PATH` for explicit device selection.
Sample media files can be obtained from:
http://samplemedia.linaro.org/MPEG2/
http://samplemedia.linaro.org/MPEG4/SVT/
## Technical Notes
### Multi-device probe (iter38)
A single libva session opens both `rkvdec` and `hantro-vpu` (and, on
hosts where it's present, `rpi-hevc-dec`) at init. `RequestCreateConfig`
re-targets the active fd per profile via
`request_switch_device_for_profile()`. Pool teardown happens at
switch time; the next `CreateContext` rebuilds against the right
device.
### Surface / Context / Picture / Image
A Surface is an internal data structure containing rendering output.
A Context owns the V4L2 lifecycle (S\_FMT, CAPTURE pool, ctrl-batch
defaults) for one decode session. A Picture is one encoded input
frame's set of buffers. An Image is a standard VA pixel-format view
on a decoded Surface: the backend detiles SAND/COL128 or unpacks
NV15 to NV12/P010 here so consumers see linear pitches.

The real rendering is in `EndPicture`, not `RenderPicture`, because
the kernel needs the full extended-control batch when the OUTPUT
buffer is queued, and `RenderPicture` order is consumer-defined.
### Surface
A Surface is an internal data structure, never handled by the VA user,
that contains the output of a rendering. Usually, a set of surfaces is
created at the beginning of decoding and they are then used in rotation.
When created, a surface is assigned a corresponding V4L2 capture buffer,
which it keeps until the end of decoding. Syncing a surface waits for the
V4L2 buffer to become available and then dequeues it.

Note: since a Surface is kept private from the VA user, the user can ask
to render a Surface directly on screen in an X Drawable. A partial
implementation is available in PutSurface, but it is for development
purposes only.
### Context
A Context is a global data structure used for rendering a video of a certain
format. When a context is created, input buffers are created and the V4L2
output format is set (the output queue is the compressed-data input queue,
since capture is the real output).
### Picture
A Picture is an encoded input frame made of several buffers. A single input
can contain slice data, headers and an IQ matrix. Each Picture is assigned a
request ID when created, and each corresponding buffer may be turned into a
V4L2 buffer or an extended control when rendered. Finally, they are submitted
to kernel space on reaching EndPicture.

The real rendering is done in EndPicture instead of RenderPicture
because the V4L2 driver expects to have the full corresponding
extended control when a buffer is queued, and we don't know in which
order the different RenderPicture calls will be made.
### Image
An Image is a standard data structure containing rendered frames in a usable
pixel format. Here we only use NV12 buffers, which are converted from sunxi's
proprietary tiled pixel format with tiled_yuv when deriving an Image from a
Surface.
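
The NV15 to P010 unpacking mentioned in the Surface / Context / Picture / Image notes can be illustrated with a small sketch. It is not the backend's `nv15_unpack_plane_to_p010()`; the function name is hypothetical, and the bit order (samples packed LSB-first across each 5-byte group) is an assumption based on the "4×10-bit packed in 5 bytes" description. P010 is taken to hold each 10-bit sample in the upper bits of a 16-bit word.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: expand groups of four 10-bit samples (5 bytes)
 * into 16-bit words with the sample in bits 15..6, as P010 expects.
 * ASSUMPTION: samples are packed LSB-first within each 5-byte group. */
static void unpack_10bit_le40_to_p010(uint16_t *dst, const uint8_t *src,
                                      size_t n_pixels)
{
    assert(n_pixels % 4 == 0);

    for (size_t i = 0; i < n_pixels; i += 4, src += 5) {
        uint16_t p0 = (uint16_t)(src[0] | ((src[1] & 0x03) << 8));
        uint16_t p1 = (uint16_t)((src[1] >> 2) | ((src[2] & 0x0f) << 6));
        uint16_t p2 = (uint16_t)((src[2] >> 4) | ((src[3] & 0x3f) << 4));
        uint16_t p3 = (uint16_t)((src[3] >> 6) | (src[4] << 2));

        /* Shift each 10-bit value into the high bits of the word. */
        dst[i + 0] = (uint16_t)(p0 << 6);
        dst[i + 1] = (uint16_t)(p1 << 6);
        dst[i + 2] = (uint16_t)(p2 << 6);
        dst[i + 3] = (uint16_t)(p3 << 6);
    }
}
```

A real plane unpack would call this once per row, using the source pitch in bytes and the destination pitch in 16-bit words.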
@@ -195,11 +195,6 @@ extern "C" {
#define DRM_FORMAT_NV24 fourcc_code('N', 'V', '2', '4') /* non-subsampled Cr:Cb plane */
#define DRM_FORMAT_NV42 fourcc_code('N', 'V', '4', '2') /* non-subsampled Cb:Cr plane */
/* iter39: NV15 is 4×10-bit packed in 5 bytes (Rockchip rkvdec 10-bit output). */
#ifndef DRM_FORMAT_NV15
#define DRM_FORMAT_NV15 fourcc_code('N', 'V', '1', '5') /* 2x2 subsampled Cr:Cb plane 10 bits per channel packed */
#endif
/*
* 3 plane YCbCr
* index 0: Y plane, [7:0] Y
@@ -4,14 +4,3 @@ option(
value : '',
description: 'Path to sanitized Linux Kernel headers'
)
option(
'daedalus_v4l2',
type : 'boolean',
value : true,
description: 'Enable probe + dispatch for the out-of-tree daedalus_v4l2 ' +
'stateless decoder shim (Pi 5 / CM5 daemon-backed VP9/AV1/H264). ' +
'Default true; disable on platforms where the daedalus_v4l2 ' +
'kernel module will never be present to slim the probe array.'
)
@@ -1,298 +0,0 @@
# Phase 0 — Pi 5 / CM5 HEVC chapter
Opened 2026-05-17 evening, after the failed `libva-v4l2-stateful-fourier`
scaffold attempt. Brother-session empirical Phase 0 on higgs invalidated
the stateful premise: rpi-hevc-dec is V4L2 **stateless**, so Pi 5 HEVC
belongs in this backend, not a separate sibling.
No code in this chapter yet. This doc is the substrate. Phase 1 picks up
from the "Open questions" section.
## Substrate
### Target host
higgs — Pi CM5 module on Pi CM5 IO board. BCM2712 SoC. VPN-only, often
offline; wake via HIS skill recipe (no Fritz!Box plug — runs on power
when on). Debian-based. Sole HW video decoder is rpi-hevc-dec at
`/dev/video19` + `/dev/media1`.
### Backend baseline at chapter open
`libva-v4l2-request-fourier` master tip `cf8cd9d` (iter39 + Option B +
h265 ref-list cap fix). Multi-device probe (iter38) already opens
rkvdec + hantro slots; adding a third decoder slot for rpi-hevc-dec is
a natural extension of that architecture.
iter2 (ampere VDPU381 HEVC EXT_SPS) added the GStreamer 1.28.2 H.265
parser vendor + EXT_SPS_ST_RPS / _LT_RPS dynamic-array submission. That
plumbing is probe-gated (`has_hevc_ext_sps_rps_rkvdec`), so it stays
dormant on hosts where the controls don't exist.
### Empirical higgs probe (brother session)
`v4l2-ctl -d /dev/video19 --list-formats-ext --list-ctrls`:
```
Stateless Codec Controls
hevc_sequence_parameter_set (compound, V4L2_CID_STATELESS_HEVC_SPS)
hevc_picture_parameter_set (compound, V4L2_CID_STATELESS_HEVC_PPS)
slice_param_array (compound dynamic-array dims=[4096])
hevc_scaling_matrix (compound)
hevc_decode_parameters (compound)
hevc_decode_mode (menu, "Frame-Based")
hevc_start_code (menu, default "No Start Code")
OUTPUT formats:
S265 V4L2_PIX_FMT_HEVC_SLICE (parsed slice payload)
CAPTURE formats:
NC12 V4L2_PIX_FMT_NV12_COL128 (8-bit SAND 128-column tiled)
NC30 V4L2_PIX_FMT_NV12_10_COL128 (10-bit SAND 128-column tiled)
```
Conclusion: this is the standard `V4L2_CID_STATELESS_HEVC_*` control set
exposed under the V4L2-request uAPI, exactly the same family our backend
already drives for rkvdec/hantro/cedrus HEVC paths. The novel parts are
two pixel formats (NC12, NC30) and one driver-id (rpi-hevc-dec).
## What carries forward unchanged
- VAAPI HEVC profile enumeration (`config.c`)
- `h265_set_controls` core path (`h265.c`) — same compound ctrl set
- Synthetic SPS pre-seed pattern (iter25/26) — already runs pre-CAPTURE-alloc
- Multi-device dispatch in `RequestCreateConfig` (iter38)
- VAAPI slice / picture / IQ matrix buffer parsing
- HEVC h264-style start-code policy (we already DON'T prepend for HEVC)
## What needs adding
| Item | Location | Sizing |
|------|----------|--------|
| `RPI_HEVC_DEC` enum in `driver_kind_t` | `request.h` | trivial |
| Multi-device probe extends to `/dev/video19` discovery | `context.c` / `request.c` init | small — mirror hantro slot |
| `V4L2_PIX_FMT_NV12_COL128` (NC12) `video_format` entry | `video.c` | small |
| `V4L2_PIX_FMT_NV12_10_COL128` (NC30) `video_format` entry | `video.c` | small |
| NC12 → NV12 detile primitive | new `nv12_col128.c` | mid — column tile layout, see kernel docs |
| NC30 → P010 detile primitive | new `nv12_col128.c` | mid — 10-bit variant of above |
| `copy_surface_to_image` branch for NC12/NC30 | `image.c` | small (mirror NV15→P010 gating) |
| Per-driver gating for any rpi-specific quirks discovered | various | per [[per-driver-kludge-gating]] |
## Open questions for Phase 1
Lock these before Phase 1 commits to a goal.
1. **EXT_SPS controls on rpi-hevc-dec?** Brother's `--list-ctrls` output
above shows the standard `V4L2_CID_STATELESS_HEVC_*` family — NOT the
`EXT_SPS_ST_RPS` / `EXT_SPS_LT_RPS` extensions that VDPU381 needs.
Verify: does `slice_param_array[4096]` accept `st_rps_bits` /
`lt_rps_bits` in the per-slice payload, or does rpi-hevc-dec parse RPS
itself from the slice header? If the latter, the iter2 EXT_SPS path
stays dormant (probe-gated already), and rpi-hevc-dec just needs the
`picture->st_rps_bits` → `slice_params->short_term_ref_pic_set_size`
plumbing that iter31 α-29 already wired. Expectation: works out of the
box. Confirm before assuming.
2. **`hevc_start_code` ctrl: "No Start Code" vs Annex B?** Brother saw
default `"No Start Code"` — matches our behavior (we don't prepend on
HEVC). But the ctrl is configurable. Verify the menu values exposed
and confirm "No Start Code" passes our raw slice-NAL payload as-is.
If it doesn't, set the ctrl explicitly per [[unconditional-codec-state]]
gating.
3. **NC12 / NC30 SAND tile layout — exact spec.** Read
`Documentation/userspace-api/media/v4l/pixfmt-yuv-planar.rst` for the
COL128 variants. Confirm: column stride = 128 bytes (Y) / 128 bytes
(UV interleaved). Row count = `ALIGN(height, 16)` or `ALIGN(height, 8)`?
Get the exact alignment and tile-traversal order before writing the
detile primitive. Cite from kernel doc, NOT inferred from a hex dump.
4. **drm_prime / SAND modifier round-trip.** Does ffmpeg-vaapi (and
Firefox) accept the NC12 buffer via DRM_PRIME export carrying the
DRM_FORMAT_MOD_BROADCOM_SAND128_COL_HEIGHT modifier, allowing
zero-copy to a SAND-aware compositor? Or is libva-side detile to a
linear NV12 buffer the only viable Firefox path? If detile is
required for the consumer, the [[rockchip-pixel-verify-path]] rule
(DMA-BUF GL preferred over cached mmap) might NOT apply since SAND
is Pi-specific and not in the wider Wayland modifier ecosystem.
5. **rpi-hevc-dec quirks on first SPS submission.** rkvdec needs
image_fmt pre-seed before CAPTURE alloc (iter25). Does rpi-hevc-dec
have an analogous "must set OUTPUT pix_fmt + SPS before CAPTURE"
ordering? Verify with strace early.
6. **higgs OS + libva versioning.** Brother probed on Debian. We package
for Arch ALARM. What's the install path on higgs — Arch / Debian /
Raspberry Pi OS? If Debian, the package needs a `debian/` tree, not
just PKGBUILD. Decide packaging target before Phase 8.
## Phase 1 goal sketch (NOT locked)
> Firefox HW HEVC playback on higgs at ≥30fps for 1080p Main, byte-exact
> libva-vs-kdirect for ≥3 reference fixtures (8-bit Main and 10-bit Main10).
Two measurable subgoals follow naturally:
- libva (this backend, NV12 image output) == kdirect (ffmpeg-v4l2request,
NV12 image output) byte-exact for the same input.
- Firefox VA-API path engages (verify via `chrome://gpu` equivalent / log
inspection — `MOZ_LOG=PlatformDecoderModule:5`).
## Phase 3 baseline plan
Before any backend code touches rpi-hevc-dec:
- `kdirect` floor: `ffmpeg -hwaccel v4l2request -hwaccel_output_format drm_prime
-i bbb_720p10s_hevc.mp4 -vf hwdownload,format=nv12 -frames:v 10 ...` and
sha256 the YUV.
- `SW reference`: same ffmpeg without `-hwaccel`, sha256 the YUV.
- Both runs N=3 per [[replicate-baseline-first]].
- Capture `strace -f -e ioctl` of the kdirect run — gives the canonical
ioctl sequence rpi-hevc-dec expects.
## Phase 0 closing
This doc commits the substrate. Phase 1 starts when:
- higgs is up + reachable
- Open questions 1+2 (EXT_SPS + start_code) are answered live, in one
short probe session
- Phase 3 baseline floors are captured
No work blocks the close of iter39 / fresnel campaign — those are shipped.
## Phase 0 close addendum (2026-05-17 evening, higgs probe session)
Empirical probes on higgs answered Q1, Q2, partial Q3, full Q5, full Q6.
Q4 (DRM modifier round-trip) remains open. Phase 0 is closed; Phase 1
opens with what's below.
### Q1 — EXT_SPS controls on rpi-hevc-dec: NOT present
`v4l2-ctl -d /dev/video19 --list-ctrls` confirms ONLY the standard
`V4L2_CID_STATELESS_HEVC_*` set:
- `hevc_sequence_parameter_set` (0x00a40a90)
- `hevc_picture_parameter_set` (0x00a40a91)
- `slice_param_array` (0x00a40a92, dynamic-array dims=[4096])
- `hevc_scaling_matrix` (0x00a40a93)
- `hevc_decode_parameters` (0x00a40a94)
- `hevc_decode_mode` (0x00a40a95, menu min=1 max=1 default=1 = Frame-Based)
- `hevc_start_code` (0x00a40a96, menu min=0 max=1 default=0 = No Start Code)
- 0x00a40a97 returns EINVAL (no EXT_SPS_*_RPS controls)
ioctl trace confirms ffmpeg's `VIDIOC_QUERY_EXT_CTRL` for `0xa97` returns
EINVAL — same probe pattern our backend uses for
`has_hevc_ext_sps_rps_rkvdec`. **The iter2 path stays dormant; the
iter31 α-29 `slice_params->short_term_ref_pic_set_size` plumbing is the
correct one for rpi-hevc-dec.**
### Q2 — hevc_start_code: default 0 (No Start Code), values {0, 1}
Default 0 matches our backend's "don't prepend HEVC start code" stance.
Confirm in Phase 1: rpi-hevc-dec accepts our raw NAL slice payload as-is.
### Q3 — NC12 / NC30 SAND tile layout: PARTIAL
CAPTURE S_FMT result for 1280×720 NC12:
- `sizeimage=1382400` = `1280 × 720 × 1.5` (linear NV12 byte count)
- `bytesperline=1080` (NOT 1280)
The bytesperline=1080 for a 1280-wide CAPTURE buffer is suspect — likely
encodes SAND column count rather than linear stride. Read
`drivers/staging/media/rpivid/` (or wherever NC12_COL128 lives in 6.12)
kernel source + `drm_fourcc.h` / `nv12_col128.rst` (if it exists) for
exact tile layout BEFORE writing the detile primitive. Do NOT infer
layout from this single observation.
### Q4 — DRM modifier round-trip: BLOCKED on hwdownload
ffmpeg `-hwaccel drm -hwaccel_output_format drm_prime -vf
hwmap=mode=read,format=nv12` returns `Failed to map frame: -38`
(`Function not implemented`). hwdownload cannot consume the SAND
modifier directly.
ffmpeg's path that DOES work: `-hwaccel drm -c:v hevc` WITHOUT
`-hwaccel_output_format drm_prime` lets ffmpeg's internal pipeline pull
back, detile (presumably via a Pi-specific helper or libdrm transform),
and present NV12 to the next filter. Bit-exact vs SW for the test
fixture (1280×720 Main 8-bit) — confirms HW engagement.
Phase 1 / Phase 4 will need to decide:
- Detile in the backend (CPU SIMD), exposing NV12 via VAImage; or
- Pass-through DRM_PRIME with SAND modifier and let the consumer
(compositor / Firefox) detile. Firefox almost certainly can't, so
CPU detile is the safe bet.
### Q5 — rpi-hevc-dec submission ordering: empirically locked
`strace -e ioctl` of the kdirect run shows:
1. `MEDIA_IOC_DEVICE_INFO` + `MEDIA_IOC_G_TOPOLOGY` (per media node)
2. `VIDIOC_QUERYCAP` per video node — `driver="rpi-hevc-dec"` identifies
the right one
3. `VIDIOC_ENUM_FMT` OUTPUT → S265 only
4. `VIDIOC_S_FMT` OUTPUT (HEVC_SLICE, placeholder dims)
5. `VIDIOC_REQBUFS` OUTPUT (DMABUF, count=N) — count=6 in kdirect
6. `VIDIOC_S_FMT` CAPTURE (NC12, actual dims from SPS parse)
7. `VIDIOC_CREATE_BUFS` CAPTURE (DMABUF, count=16)
8. `VIDIOC_STREAMON` both queues
9. `VIDIOC_QUERY_EXT_CTRL` enumeration
10. `VIDIOC_S_EXT_CTRLS` (decode_mode + start_code) — global ctrls
11. Per frame: `VIDIOC_S_EXT_CTRLS` (SPS+PPS+decode_params+slice_array,
class=0xf010000 = per-request) + `VIDIOC_QBUF` CAPTURE + `VIDIOC_QBUF`
OUTPUT (with `V4L2_BUF_FLAG_IN_REQUEST | V4L2_BUF_FLAG_REQUEST_FD`) +
`VIDIOC_DQBUF` OUTPUT + `VIDIOC_DQBUF` CAPTURE
**Two structural notes for the backend:**
- OUTPUT + CAPTURE both use `V4L2_MEMORY_DMABUF` in kdirect. Our backend
currently uses MMAP for CAPTURE on rkvdec/hantro. For Pi 5 we should
either follow kdirect (DMABUF, allows zero-copy DRM_PRIME export) or
use MMAP and CPU-detile. Phase 4 design decision.
- The order `S_FMT OUTPUT → REQBUFS OUTPUT → S_FMT CAPTURE → CREATE_BUFS
CAPTURE → STREAMON` differs from our iter25 rkvdec pre-seed pattern
(where SPS via S_EXT_CTRLS must come BEFORE CAPTURE alloc to resolve
the image_fmt). rpi-hevc-dec apparently DOESN'T need that pre-seed —
CAPTURE S_FMT just takes the explicit NC12 + caller's dims. Confirm
in Phase 1 by trying our existing iter25 pre-seed flow against it.
### Q6 — packaging: Debian 13 trixie, NOT Arch
higgs runs Debian 13 trixie (`PRETTY_NAME="Debian GNU/Linux 13 (trixie)"`),
not Arch ALARM. Phase 8 (per the dev-process Phase 8 packaging rule) for
the Pi 5 chapter needs a `debian/` packaging tree, not just a PKGBUILD.
Decide in Phase 1 whether to:
- Add Debian packaging to `marfrit-packages` as a second target, OR
- Use distrobox/podman with an Arch ALARM container on higgs for
install (test-only, not production), OR
- Pi 5 chapter ships a Debian source pkg via gitea / a personal Debian
repo.
### Other new findings from the probe session
- **ffmpeg 7.1.3 from Debian 13 is built with `--enable-v4l2-request`**
— the kdirect path exists. Invocation is `ffmpeg -hwaccel drm -c:v
hevc` (not just `-hwaccel drm`; the explicit codec flag matters for
the negotiation). Engagement log line is
`Hwaccel V4L2 HEVC stateless V4; devices: /dev/media1,/dev/video19;
buffers: src DMABuf, dst DMABuf; swfmt=rpi4_8`. Per
[[hw-decode-engagement-check]], grep for that line to confirm HW path
engaged.
- **No libva ICD installed on higgs** — only `armada-drm_dri.so` ships,
which doesn't apply. We'd be the first VA-API HW path for HEVC on Pi
5 once installed.
- **mpv is apt-installable** (`mpv 0.40.0-3+deb13u1`) — useful as a
pixel-readback verifier once the backend works (`mpv --vo=image` or
`--vo=drm`).
- **Firefox 145.0.1 + rpi-firefox-mods 20251016 installed** (firefox-esr
package status was `rc` = removed but config remains). The mods
package likely contains VA-API plumbing prefs.
### What changes for Phase 1
- Goal is now phrasable: HEVC bit-exact libva-vs-kdirect on higgs for
the 1280×720 Main 8-bit test fixture (same generator as
`/tmp/bbb_main.mp4` here). Kdirect engagement signal is the
`Hwaccel V4L2 HEVC stateless V4` log line.
- Most backend code reuses existing rkvdec/hantro HEVC path: ctrls,
per-frame submission, request_fd, multi-device probe pattern.
- New code: NC12 video_format entry + detile primitive (sibling to
`nv15_unpack_plane_to_p010`) + RPI_HEVC_DEC driver_kind.
- Packaging target = Debian, not Arch.
@@ -1,230 +0,0 @@
# Phase 1+2+3+4 — Pi 5 HEVC chapter (iter40)
Per [[feedback_dev_process]], Phase 1 (goal), Phase 2 (situation analysis),
Phase 3 (baselines), Phase 4 (plan) for adding rpi-hevc-dec as a third
multi-device-probe slot in `libva-v4l2-request-fourier`. Phase 0 substrate
+ open-question answers live at `phase0_pi5_hevc.md`.
## Phase 1 — Goal
> **libva-v4l2-request-fourier on higgs** decodes HEVC Main 8-bit input
> producing NV12 output **bit-exact vs kdirect** for three reference
> fixtures (640×360, 1280×720, 1920×1080 — Main profile, libx265
> ultrafast). HW path engagement verified via the kernel-driver lsof
> signal (`/dev/video19` open) AND ffmpeg-vaapi engagement signal
> (`Hwaccel V4L2 HEVC stateless V4; devices: /dev/media1,/dev/video19`).
Measurable:
| Criterion | Metric |
|---|---|
| C1 — vainfo enumeration | `LIBVA_DRIVER_NAME=v4l2_request vainfo` lists `VAProfileHEVCMain : VAEntrypointVLD` |
| C2 — bit-exact decode | sha256 of libva NV12 output == sha256 of kdirect NV12 output, per fixture, N=1 |
| C3 — HW engagement | `lsof` shows `/dev/video19` open by ffmpeg-vaapi during libva run |
| C4 — Stability under N=3 | C2 holds at N=3 repeated runs (deterministic) |
| C5 — Sibling baseline preserved | fresnel iter38 5/5 still PASS post-iter40 (no regression to rkvdec/hantro path) |
Out of scope this iter: Main10 (10-bit / NC30), VP9, AV1, Firefox VA-API
engagement testing, performance benchmarks. All later chapters.
## Phase 2 — Situation Analysis
### Backend architecture already in place
- **Multi-device probe (iter38)**: at `VA_DRIVER_INIT` opens both
`rkvdec` + `hantro-vpu` via `find_decoder_device_by_driver(name)`.
Stores per-driver fds (`video_fd_{rkvdec,hantro}`,
`media_fd_{rkvdec,hantro}`). `RequestCreateConfig` retargets the
"active" `driver_data->{video,media}_fd` per profile via
`request_switch_device_for_profile()` (request.c:426-478).
- **Per-driver feature gating**: `request_data->has_hevc_ext_sps_rps_{rkvdec,hantro}`
pair, with `h265_set_controls` consulting the per-fd flag. Established
by iter2 / Phase 5 review (request.h:99-100). This is the canonical
per-driver gating shape for iter40.
- **HEVC ctrl population**: `h265_set_controls` populates the standard
`V4L2_CID_STATELESS_HEVC_*` set (h265.c). Probe-gates EXT_SPS_*_RPS
via the iter2 path — naturally dormant for rpi-hevc-dec since the
controls don't exist.
- **Synthetic SPS pre-seed (iter25/26)**: needed for rkvdec to resolve
`image_fmt` before CAPTURE alloc. Phase 0 strace shows rpi-hevc-dec
does NOT need this — it accepts NC12 + explicit dims on `S_FMT
CAPTURE` directly. The pre-seed code path stays in place for rkvdec;
rpi-hevc-dec just doesn't trigger it (gate on driver_kind).
- **CAPTURE detile primitive**: `nv15_unpack_plane_to_p010()` (nv15.c)
is the template — backend already CPU-detiles when a Pi-or-Rockchip-
specific CAPTURE format meets a linear consumer (VAImage NV12 / P010).
- **Single-plane (S) vs multi-plane (M) handling**: hantro uses MPLANE,
rkvdec uses both depending on codec. rpi-hevc-dec exposes MPLANE for
BOTH OUTPUT (HEVC_SLICE) and CAPTURE (NC12) per the strace. iter38
already supports MPLANE handling for hantro; rpi reuses that.
### Surface area to touch (audit)
| File | What changes | Size |
|------|--------------|------|
| `src/request.h` | Add `video_fd_rpi_hevc_dec`, `media_fd_rpi_hevc_dec`, `has_hevc_ext_sps_rps_rpi_hevc_dec` (mirror iter38 + iter2 layout) | ~10 lines |
| `src/request.c` | (a) Extend init -1 block to cover new fds. (b) Recognize `rpi-hevc-dec` as a 3rd primary/alt driver string in the probe loop. (c) Extend `request_device_kind_for_profile` so HEVC→'p' when rpi-hevc-dec is present, else 'r'. (d) Extend `request_switch_device_for_profile` 'p' branch. (e) Probe HEVC ext_sps on the new fd (will be false, mirrors hantro entry). | ~80 lines |
| `src/video.c` | Add `V4L2_PIX_FMT_NV12_COL128` (NC12) `video_format` entry: 4:2:0, planes=1, alignment via dedicated bytesperline/sizeimage formula. NOT marked linear. | ~20 lines |
| `src/nv12_col128.c` (NEW) | `nv12_col128_detile_to_nv12()`: Y plane + UV plane detile primitive. Adapted from ffmpeg/Kynesim `av_rpi_sand_to_planar_y8` core math. Header doc traces back to videodev2.h docstring + raspberrypi/linux `hevc_dec/hevc_d_video.c` size formula. | ~80 lines + 30-line header |
| `src/image.c` | Add NC12 → NV12 branch in `copy_surface_to_image`, gated on `image->format.fourcc == VA_FOURCC_NV12 && video_format->v4l2_format == V4L2_PIX_FMT_NV12_COL128` (sibling to existing NV15→P010 branch). | ~25 lines |
| `src/meson.build` + `src/Makefile.am` | List `nv12_col128.c`/`.h` in sources | 2 lines |
Total estimated diff: ~140 LoC backend + ~110 LoC standalone primitive (~250 LoC overall).
Roughly half the surface area of iter38; smaller than iter2.
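
The HEVC routing rule from the `src/request.c` row ('p' when rpi-hevc-dec is present, else 'r') can be sketched with a stand-in structure. Everything here is illustrative: `struct probe_slots`, both function names, and the 'h' tag for hantro are assumptions, not the backend's real definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-driver fd slots; -1 means the
 * driver was not found during the probe. */
struct probe_slots {
    int video_fd_rkvdec;
    int video_fd_hantro;
    int video_fd_rpi_hevc_dec;
    int active_video_fd;
};

/* HEVC goes to rpi-hevc-dec ('p') when that slot is populated,
 * otherwise to rkvdec ('r'). */
static char hevc_device_kind(const struct probe_slots *slots)
{
    assert(slots != NULL);
    return slots->video_fd_rpi_hevc_dec >= 0 ? 'p' : 'r';
}

/* Re-target the active fd; returns false when the slot is empty. */
static bool switch_active_device(struct probe_slots *slots, char kind)
{
    int fd = -1;

    switch (kind) {
    case 'r': fd = slots->video_fd_rkvdec; break;
    case 'h': fd = slots->video_fd_hantro; break; /* assumed tag */
    case 'p': fd = slots->video_fd_rpi_hevc_dec; break;
    default: break;
    }
    if (fd < 0)
        return false;

    slots->active_video_fd = fd;
    return true;
}
```

On a Rockchip host the third slot stays at -1, so HEVC keeps resolving to 'r' and the existing path is untouched, which is the property criterion C5 checks.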
### What does NOT change
- iter25/26 SPS pre-seed: stays on rkvdec path only (gated by
driver_kind check that's already implicit in the rkvdec fd routing).
- iter2 EXT_SPS plumbing: probe-gated off on rpi-hevc-dec; vendored
GStreamer parser unused. Confirmed via the EINVAL on ctrl 0xa97.
- iter31 α-29 slice_params st_rps_bits: APPLIES to rpi-hevc-dec
unchanged. Same plumbing.
- iter33 VP8 hantro start-code prepend: not relevant (rpi-hevc-dec is
HEVC-only; VP8 still goes through hantro on RK).
- iter38 single-libva-session multi-codec semantics: extends from 5
codecs to 5+1 (HEVC reroutes on Pi).
### NC12 / SAND128 tile geometry — locked contract
From kernel driver `drivers/media/platform/raspberrypi/hevc_dec/hevc_d_video.c`
(via [[github raspberrypi/linux rpi-6.12.y]]):
```c
case V4L2_PIX_FMT_NV12_COL128:
width = ALIGN(width, 128); /* Width rounds up to columns */
height = ALIGN(height, 8);
bytesperline = constrain2x(bytesperline, height * 3 / 2);
sizeimage = bytesperline * width;
break;
```
For 1280×720:
- width = 1280 (already 128-aligned)
- height = 720 (already 8-aligned)
- bytesperline = 720 × 3/2 = **1080** (matches Phase 0 strace observation)
- sizeimage = 1080 × 1280 = **1,382,400** (matches strace; equals the linear NV12 byte count here only because both dimensions are already aligned, so no padding is added)
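The arithmetic above can be sketched as a small helper. This is an illustration only, not the kernel code: `nc12_compute_layout`, `struct nc12_layout` and `ALIGN_UP` are hypothetical names, and the sketch assumes `constrain2x()` leaves `bytesperline` at its minimum of `height * 3 / 2` when userspace requests nothing larger.

```c
#include <stdint.h>

/* Round v up to a multiple of a (a must be a power of two). */
#define ALIGN_UP(v, a) (((v) + (a) - 1) & ~((uint32_t)(a) - 1))

/* Hypothetical container for the values S_FMT(CAPTURE) reports. */
struct nc12_layout {
	uint32_t width;        /* padded to whole 128-px columns */
	uint32_t height;       /* padded to a multiple of 8 rows */
	uint32_t bytesperline; /* column stride in rows: Y rows + UV rows */
	uint32_t sizeimage;    /* total buffer size in bytes */
};

/* Mirrors the driver's size formula under the assumption that the
 * minimum column stride is used (no caller-requested enlargement). */
static struct nc12_layout nc12_compute_layout(uint32_t width, uint32_t height)
{
	struct nc12_layout l;

	l.width = ALIGN_UP(width, 128);
	l.height = ALIGN_UP(height, 8);
	l.bytesperline = l.height * 3 / 2;
	l.sizeimage = l.bytesperline * l.width;
	return l;
}
```

For a width that is not 128-aligned, such as 1366, the same helper pads the width to 1408 before multiplying, which is where NC12 and linear NV12 sizes stop coinciding.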
**Geometry interpretation** (cross-verified against ffmpeg/Kynesim
`rpi_sand_fn_pw.h` `av_rpi_sand_to_planar_y8`):
- Image is divided into `(width + 127) / 128` columns; each column is
**128 px wide × height px tall**.
- Within a column: `128 × height` bytes of Y data, immediately followed
by `128 × height/2` bytes of interleaved CbCr (so 128 × `bytesperline`
bytes per column, where `bytesperline` is the column stride).
- Across columns: column N starts at offset `N × stride1 × stride2`
where `stride1 = 128` (column width) and `stride2 = bytesperline`.
- **Pixel (x, y) byte offset = `(x & 127) + y × 128 + (x & ~127) × bytesperline`**
for Y; the same formula with row `height + y/2` for UV, because each
column's CbCr rows begin `128 × height` bytes after that column's own Y data starts (per column, not after all columns).
Reference for the detile loop: `av_rpi_sand_to_planar_y8` (Kynesim
ffmpeg, `libavutil/rpi_sand_fn_pw.h` with PW=1). Our primitive copies
the single-stripe fast-path math; we don't import NEON ASM (CPU
detile is the safe path for Phase 1; SIMD a Phase 2 perf bump if needed).
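A minimal sketch of the per-column row-memcpy loop described above. The function name and parameters are hypothetical (this is not the project's `nv12_col128_detile_to_nv12()` nor the Kynesim routine). It assumes `stride2` is the V4L2 `bytesperline` (column stride in rows) and that, inside each column, the CbCr rows start at row `ALIGN(height, 8)` directly after the Y rows.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define COL_W 128u /* SAND128 column width in bytes */

/*
 * Copy `rows` rows of `width_bytes` bytes from a column-tiled buffer into a
 * linear plane. `first_row` selects where inside each column to start:
 * 0 for Y, the aligned picture height for interleaved CbCr (assumption).
 * Only `width_bytes` bytes per row are written, so right-edge column
 * padding never reaches dst.
 */
static void sand128_detile_rows(uint8_t *dst, size_t dst_pitch,
				const uint8_t *src, uint32_t width_bytes,
				uint32_t rows, uint32_t stride2,
				uint32_t first_row)
{
	for (uint32_t x = 0; x < width_bytes; x += COL_W) {
		/* Column N = x / 128 starts at N * 128 * stride2 = x * stride2. */
		const uint8_t *col = src + (size_t)x * stride2 +
				     (size_t)first_row * COL_W;
		uint32_t left = width_bytes - x;
		uint32_t n = left < COL_W ? left : COL_W;

		for (uint32_t y = 0; y < rows; y++)
			memcpy(dst + (size_t)y * dst_pitch + x,
			       col + (size_t)y * COL_W, n);
	}
}
```

Under these assumptions the Y plane would be detiled with `first_row = 0, rows = height` and the UV plane with `first_row = ALIGN(height, 8), rows = height / 2`, both using the picture width as `width_bytes` for 8-bit 4:2:0.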
## Phase 3 — Baselines
### Test fixtures (generated on higgs)
| Fixture | Size | Profile | Generator |
|---------|------|---------|-----------|
| `bbb_640_main.mp4` | 640×360 | Main 8-bit | `ffmpeg -f lavfi -i testsrc=duration=2 -pix_fmt yuv420p -c:v libx265 -preset ultrafast -profile:v main` |
| `bbb_1280_main.mp4` | 1280×720 | Main 8-bit | same |
| `bbb_1920_main.mp4` | 1920×1080 | Main 8-bit | same |
### Captured 2026-05-17 evening on higgs
For each fixture, N=3 reps. Both SW (no hwaccel) and kdirect
(`ffmpeg -hwaccel drm -c:v hevc`) → `-frames:v 10 -f rawvideo -pix_fmt nv12`;
first 16 hex chars of the output's sha256:
```
bbb_640_main SW={9a81038065e9b7cd} HW={9a81038065e9b7cd} → BIT-EXACT × N=3
bbb_1280_main SW={d3bb055655d6f195} HW={d3bb055655d6f195} → BIT-EXACT × N=3
bbb_1920_main SW={0bc2bd6f693db039} HW={0bc2bd6f693db039} → BIT-EXACT × N=3
```
HW engagement signal (per-run): `Hwaccel V4L2 HEVC stateless V4; devices: /dev/media1,/dev/video19; buffers: src DMABuf, dst DMABuf; swfmt=rpi4_8`
This is the kdirect baseline. Phase 7 verification will compare libva
output against these SHAs.
### Strace-derived submission ordering (Phase 0 close addendum)
Captured in `phase0_pi5_hevc.md`. Briefly: standard V4L2-request
stateless flow, both queues DMABUF, no SPS pre-seed dance needed
(rpi-hevc-dec accepts NC12 + dims directly on CAPTURE S_FMT).
## Phase 4 — Plan
### Implementation steps (sequenced)
1. **`request.h`**: extend `request_data` with the new fd pair + ext_sps
flag, mirroring iter38/iter2 layout. (no behavior change yet)
2. **`request.c`**:
- `find_decoder_device_by_driver("rpi-hevc-dec", ...)` accepts new
driver string.
- Init -1 block extends to new fds.
- Probe loop: if primary is `rkvdec` or `hantro-vpu`, also probe
`rpi-hevc-dec` (third slot). On Pi 5 there's no `rkvdec` or
`hantro-vpu`, so primary becomes `rpi-hevc-dec` and the alt-probes
for the other two return absent (their fds stay -1).
- `request_device_kind_for_profile`: when profile is `VAProfileHEVCMain`,
prefer `'p'` (rpi-hevc-dec) IF `video_fd_rpi_hevc_dec >= 0`, else
fall through to `'r'` (rkvdec). All other profiles stay routed as
today.
- `request_switch_device_for_profile`: add `'p'` branch.
- ext_sps probe runs on the new fd; result stored in
`has_hevc_ext_sps_rps_rpi_hevc_dec`. Will be false (controls absent).
3. **`video.c`**: add NC12 video_format entry. Mark it MPLANE-only (per
Phase 0 strace). bytesperline/sizeimage formula encoded per kernel
driver math.
4. **`src/nv12_col128.c` + `.h`** (NEW): single-file primitive,
`nv12_col128_detile_to_nv12(dst_y, dst_uv, src_y, src_uv, width,
height, src_stride2)`. CPU per-column row-memcpy loop; not NEON
for Phase 1 (correctness first). Self-test in `tests/test_nv12_col128_detile.c`.
5. **`image.c`**: branch in `copy_surface_to_image`. Gate:
`image->format.fourcc == VA_FOURCC_NV12 && video_format->v4l2_format == V4L2_PIX_FMT_NV12_COL128`.
Calls the primitive. Existing NV12-linear path stays.
6. **`meson.build` + `Makefile.am`**: source list updates.
7. **Build clean on higgs** — first build target IS higgs (since iter40
only matters on Pi). Cross-build for ampere/fresnel is unaffected
because they don't have rpi-hevc-dec — the new fd stays -1 and the
per-driver routing falls through to existing rkvdec/hantro paths.
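The HEVC routing rule from step 2 can be modelled in a few lines. This is a simplified sketch, not the actual `request_device_kind_for_profile`: the struct, the `video_fd_rkvdec` field name and the `0` return for "no HEVC device" are assumptions made for illustration. Only the `'p'`-before-`'r'` preference comes from the plan.

```c
/* Hypothetical, reduced view of the per-driver fds. A value of -1 means
 * the probe loop did not find that driver. */
struct decoder_fds {
	int video_fd_rkvdec;       /* assumed field name */
	int video_fd_rpi_hevc_dec; /* name taken from the plan */
};

/*
 * Returns 'p' (rpi-hevc-dec) when that device is open, 'r' (rkvdec) when
 * only rkvdec is open, and 0 when neither is available. The 0 case is an
 * addition of this sketch; the plan only describes the 'p'/'r' preference.
 */
static char hevc_device_kind(const struct decoder_fds *fds)
{
	if (fds->video_fd_rpi_hevc_dec >= 0)
		return 'p';
	if (fds->video_fd_rkvdec >= 0)
		return 'r';
	return 0;
}
```

Writing the preference as an ordered chain makes the both-drivers-present case explicit, which is the combination the plan cannot exercise on either fresnel or higgs.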
### Verification gates (Phase 7 acceptance)
- Build cleanly on higgs (Debian 13 trixie, libva-dev 2.22.0-3,
libdrm-dev 2.4.131).
- Local-install the resulting `.so` to `/usr/lib/aarch64-linux-gnu/dri/v4l2_request_drv_video.so`.
- `LIBVA_DRIVER_NAME=v4l2_request vainfo` lists `VAProfileHEVCMain`.
- For each Phase 3 fixture: libva output SHA == kdirect SHA (the Phase 3
recorded value).
- `lsof` during libva decode shows `/dev/video19` open.
- Sibling regression check: fresnel `phase7_iter39_test_rig` equivalent
still 5/5 PASS (no regression to existing routing).
### Risks + mitigations
| Risk | Mitigation |
|------|-----------|
| NC12 detile math wrong → libva ≠ kdirect | Tight unit test in `tests/test_nv12_col128_detile.c` with hand-crafted NC12 bytes + known linear output, before integration. |
| `request_switch_device_for_profile` falls through wrong path on systems with BOTH rkvdec AND rpi-hevc-dec | Prefer rpi-hevc-dec for HEVC when present. Explicit comment in switch. Test on fresnel = no rpi → falls to 'r'; test on higgs = no rkvdec → falls to 'p'. |
| Debian build env differs from Arch — see [[feedback_package_build_flags_unmask_bugs]] | Build with explicit `-O2 -D_FORTIFY_SOURCE=2 -fstack-protector-strong` flags to match Debian dpkg-buildflags. |
| Synthetic SPS pre-seed accidentally fires on rpi-hevc-dec | Gate on `driver_kind != 'p'` in the pre-seed call site. Verify via strace: pre-seed ioctl pattern absent. |
| iter2 EXT_SPS path accidentally engages on rpi | Already probe-gated; `has_hevc_ext_sps_rps_rpi_hevc_dec` = false naturally. |
### Phase 5 review explicitly requested
Per CLAUDE.md global "Reviews are never skippable" + [[feedback_review_empirical_over_theoretical]]:
this plan goes to a sonnet Plan-agent review. Specific review focus:
- Routing correctness when 0/1/2/3 of the three drivers are present.
- NC12 geometry: did we copy ffmpeg's per-row memcpy math correctly?
Did we miss UV stride considerations?
- `image.c` gate predicate — does it exclude any legitimate NV12-linear
case on Pi? (No: rpi only exposes NC12/NC30 CAPTURE, no plain NV12.)
- Cross-device regression scope (fresnel + ampere paths untouched?).
Empty-result review IS a green light; "we should have skipped it" is the
prohibited move.
-194
@@ -1,194 +0,0 @@
# Phase 5 review — iter40 plan (sonnet review + amendments)
Reviewer verdict: **yellow** — plan substantively sound, 3 concrete blockers
+ 1 fixture gap + 1 verification-only note. All findings verified empirically
against current source (per [[feedback_review_empirical_over_theoretical]])
BEFORE accepting into the amended plan.
## Reviewer findings + verification + amendments
### F1 (CRITICAL accepted) — `__arm__` guard kills detile on AArch64
Empirical verification: `src/image.c` lines 239 + 268 wrap the entire
per-format detile dispatch (incl. `nv15_unpack_plane_to_p010`) in
`#ifdef __arm__`. Pi 5 / fresnel / ampere are all AArch64 → guard never
fires → both NC12 detile (proposed) AND existing NV15→P010 unpack
(iter39) are silently dead code on aarch64. iter39 5/5 PASS on fresnel
was bit-exact for 8-bit codecs only; the 10-bit detile path was never
exercised, so the dead-code didn't manifest as a failure.
**Amendment:** Phase 6 step 5 first sub-action — change guard at lines
239 + 268 from `#ifdef __arm__` to `#if defined(__arm__) || defined(__aarch64__)`.
This re-enables the existing NV15→P010 detile AND lets the new NC12
detile branch execute. No semantic change on x86 (no detile primitives
compiled there). Add explicit comment crediting Phase 5 review + this
finding.
### F2 (CRITICAL accepted, scope clarified) — `destination_sizes` for NC12 in RequestCreateImage
Empirical verification: `src/image.c` lines 90-115 already recompute
`destination_bytesperlines[0]` + `destination_sizes[0]` for `P010`
(line 90: `destination_bytesperlines[0] = width * 2`). The fall-through
"NV12" branch (line 108) uses V4L2-reported stride directly, which for
NC12 source is the column-stride 1080, not the linear Y stride 1280.
That breaks the `pitches[0]` value that VAImage consumers expect.
`context.c` lines 379-383 — `destination_sizes[0] = destination_bytesperlines[0] * format_height` — IS used at cap_pool init time to size the
CAPTURE buffer's MMAP region accounting in `driver_data->fmt_sizes[]`.
For NC12: 1080 × 720 = 777600 vs actual `sizeimage` 1382400. cap_pool
allocates the actual `sizeimage` via REQBUFS, so the underlying buffer
is correctly sized; `fmt_sizes[]` is just a back-cache for later access
patterns that don't go through the kernel-reported value.
**Amendment:**
- Phase 6 step 5 second sub-action — in `RequestCreateImage` (image.c
~line 107, the "else" / NV12 branch), add detection: if the source
CAPTURE format is `V4L2_PIX_FMT_NV12_COL128` AND the requested image
format is `VA_FOURCC_NV12`, override `destination_bytesperlines[0] =
width` (linear NV12 Y stride). `destination_sizes[0]` then computes
to `width × format_height` (correct linear Y plane size). Existing
NV12-source linear path unchanged.
- Phase 6 step 3 video.c — set `v4l2_buffers_count = 1` for NC12 (single
contiguous buffer holding Y+UV) and document this is the planes-1
multi-plane case (similar to NV12 MPLANE).
- context.c lines 380-383 (`destination_sizes[0] = bytesperlines * height`)
stays AS-IS for now. It only affects cap_pool MMAP accounting which
uses the kernel-reported `sizeimage` via REQBUFS anyway. If a future
bug emerges from this mismatch on the rkvdec/hantro side, address
then; not a blocker for iter40 NC12.
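To make the override concrete, the linear NV12 layout that the VAImage should advertise can be computed as follows. This is a sketch with hypothetical names (`nv12_linear_layout`, `struct nv12_image_layout`), assuming even dimensions and 8-bit 4:2:0. It does not reproduce the actual `RequestCreateImage` code.

```c
#include <stdint.h>

/* Hypothetical description of a linear NV12 VAImage. */
struct nv12_image_layout {
	uint32_t pitch;      /* bytes per row, same for Y and CbCr */
	uint32_t y_size;     /* bytes in the Y plane */
	uint32_t uv_offset;  /* byte offset of the interleaved CbCr plane */
	uint32_t total_size; /* Y plane + CbCr plane */
};

/* Linear layout derived from the picture size, ignoring the tiled
 * source's column stride entirely. Assumes even width and height. */
static struct nv12_image_layout nv12_linear_layout(uint32_t width,
						   uint32_t height)
{
	struct nv12_image_layout l;

	l.pitch = width;
	l.y_size = width * height;
	l.uv_offset = l.y_size;
	l.total_size = l.y_size + width * (height / 2);
	return l;
}
```

For 1280×720 this gives a 1280-byte pitch and a 921,600-byte Y plane, whereas reusing the NC12 column stride would yield the 1080 × 720 = 777,600 figure noted above.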
### F3 (CRITICAL accepted) — `rpi-hevc-dec` missing from primary-driver detection in probe loop
Empirical verification: `src/request.c` lines 647-657 only have `else if`
branches for `rkvdec` and `hantro-vpu`. On higgs (no rkvdec, no hantro)
the primary device IS `rpi-hevc-dec`, but neither branch matches → no
`primary_driver` set → no fds stored into the new
`video_fd_rpi_hevc_dec` slot → routing silently no-ops with -1 fds.
**Amendment:** Phase 6 step 2 sub-action — add explicit `else if (strcmp(info.driver, "rpi-hevc-dec") == 0)` branch in the primary-driver
detection block. Sets `video_fd_rpi_hevc_dec = video_fd` + `media_fd_rpi_hevc_dec = media_fd`. Pi has no alt — `alt_driver` stays NULL,
no second-decoder probe runs for higgs. (rkvdec + hantro alt-probes
remain dead on higgs because the find_decoder_device_by_driver walk
returns absent for them.)
Also: extend `find_decoder_device_by_driver`'s driver-name table at
request.c:94-95 if needed to include `rpi-hevc-dec` — verify it's a
free-form string match (it is, per the code), not a hard table — so the
caller passes `"rpi-hevc-dec"` and the walk just looks for it.
### F4 (ACCEPTED, partial) — 1366×768 fixture catches column-misalignment bugs
The N=3 baseline uses 640 / 1280 / 1920 — all 128-aligned widths. A
1366-wide fixture exercises the `ALIGN(width, 128) → 1408` column
padding path. The right-edge 42 pixels (x = 1366-1407) are padding;
the detile primitive must not write past the requested width.
**Amendment:** Phase 7 sub-action — add `bbb_1366_main.mp4` (1366×768)
to the Phase 7 verification set. Phase 3 baseline retroactively
captured at Phase 7 time. Goal: same kdirect/SW bit-exact PASS at
N=1 (no need to redo the deterministic N=3 — one rep proves the
edge-case). If libva differs from kdirect on 1366 but matches on
1280/1920, the detile column-base math is buggy.
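The edge-case numbers for this fixture follow from the 128-pixel column width. A small sketch, with hypothetical helper names and assuming `width > 0`:

```c
#include <stdint.h>

#define SAND_COL_W 128u /* SAND128 column width in pixels (8-bit) */

/* Number of 128-px columns needed to cover the picture width. */
static uint32_t sand128_num_columns(uint32_t width)
{
	return (width + SAND_COL_W - 1) / SAND_COL_W;
}

/* Padding pixels at the right edge of the last column. */
static uint32_t sand128_padding_px(uint32_t width)
{
	return sand128_num_columns(width) * SAND_COL_W - width;
}

/* Valid pixels in the last column; a detile loop must copy exactly this
 * many bytes per row there, not the full 128. */
static uint32_t sand128_last_column_px(uint32_t width)
{
	return width - (sand128_num_columns(width) - 1) * SAND_COL_W;
}
```

At width 1366 the last column carries 86 valid pixels and 42 of padding. The three aligned baseline widths all report zero padding, which is why they cannot catch this class of bug.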
### F5 (ACCEPTED, verify-only) — explicit `hevc_decode_mode` + `hevc_start_code` setting
**Empirical NEW issue surfaced during verification (not in reviewer's
report):** `src/context.c` lines 516-528 unconditionally set
`V4L2_CID_STATELESS_HEVC_START_CODE` to `_ANNEX_B` (value 1) AND
prepends `0x00 0x00 0x01` start codes to each slice payload (per the
H.264 mirror block at line 532+). But Phase 0 strace shows kdirect uses
`start_code=0` = `_NONE` and submits raw NAL slice payload WITHOUT start
codes.
Both modes are in rpi-hevc-dec's menu range (min=0 max=1). Open
question: does rpi-hevc-dec correctly parse start-code-prepended
payload when in ANNEX_B mode? Two possibilities:
(a) Yes — driver implements both modes, ANNEX_B works, libva PASSes
bit-exact in our default code path.
(b) No — driver only really implements NONE; ANNEX_B is a degenerate
menu entry; we'd need per-driver gating to send `_NONE` for
rpi-hevc-dec + suppress start-code prepend.
**Amendment:** Phase 7 — verify empirically via the first libva-vs-kdirect
diff. If (a), no code change needed. If (b), add per-driver gate around
the START_CODE set (mirror rkvdec/hantro pattern). Don't pre-emptively
gate; let empiricism decide.
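If outcome (b) applies, the per-driver gate would amount to making the start-code prepend conditional. A minimal sketch of that idea follows. `pack_slice` and `enum start_code_mode` are hypothetical names, not the backend's code; the enum values mirror the 0 = none / 1 = Annex B menu values quoted above.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Values chosen to match the control's menu range described above. */
enum start_code_mode {
	START_CODE_NONE = 0,
	START_CODE_ANNEX_B = 1,
};

/*
 * Copy one slice NAL into dst, prepending 00 00 01 only in Annex B mode.
 * Returns the number of bytes written, or 0 if dst is too small.
 */
static size_t pack_slice(uint8_t *dst, size_t cap, const uint8_t *nal,
			 size_t len, enum start_code_mode mode)
{
	static const uint8_t prefix[3] = { 0x00, 0x00, 0x01 };
	size_t off = 0;

	if (mode == START_CODE_ANNEX_B) {
		if (cap < sizeof(prefix))
			return 0;
		memcpy(dst, prefix, sizeof(prefix));
		off = sizeof(prefix);
	}
	if (cap - off < len)
		return 0;
	memcpy(dst + off, nal, len);
	return off + len;
}
```

Whatever mode is chosen, the value written to the start-code control and the payload layout have to agree; the sketch keeps both decisions behind one `mode` argument for that reason.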
### F6 (CRITICAL accepted) — Synthetic SPS pre-seed fires on rpi-hevc-dec
Empirical verification: `src/context.c` lines 277-346 — the iter25
synthetic-SPS injection block runs for `VAProfileHEVCMain` regardless
of active driver_kind. On higgs, `driver_data->video_fd` will be
`video_fd_rpi_hevc_dec` at this point → `v4l2_set_controls(...SPS...)`
fires on rpi-hevc-dec. Phase 0 strace shows rpi-hevc-dec doesn't need
this AND uses a different submission ordering (S_FMT_OUTPUT → REQBUFS_OUTPUT → S_FMT_CAPTURE → CREATE_BUFS_CAPTURE → STREAMON, then global
ctrls per-frame).
The pre-seed is wrapped in `(void)v4l2_set_controls(...)` — failure is
silently ignored, BUT the call may also succeed in an unintended way
on rpi-hevc-dec (it has the HEVC_SPS ctrl), potentially leaving its
internal state stuck on the dummy SPS until the first real per-frame
SPS arrives.
**Amendment:** Phase 6 step 2 sub-action — gate the synthetic-SPS
injection block at context.c:277 with
`if (driver_data->video_fd != driver_data->video_fd_rpi_hevc_dec)`. The
pre-seed only fires when active fd is NOT rpi-hevc-dec. rkvdec /
hantro paths unchanged.
### F7 (No findings) — `image.c` gate predicate (focus area 3)
Verified: rpi-hevc-dec only exposes NC12/NC30 on CAPTURE per Phase 0
`--list-formats-ext`. No legitimate NV12-linear case exists on Pi. Gate
predicate `image->format.fourcc == VA_FOURCC_NV12 && video_format->v4l2_format == V4L2_PIX_FMT_NV12_COL128` is sound — fires only when
both conditions hold, excludes legitimate NV12-linear on RK / Allwinner.
### F8 (No findings) — cross-device regression scope (focus area 4)
Verified: new fd fields initialise to -1; probe loop extensions are
additive (no-op when string doesn't match); `request_device_kind_for_profile`'s 'p' branch only fires when `video_fd_rpi_hevc_dec >= 0`;
new video.c entry is additive. fresnel + ampere paths unchanged.
## Final amended Phase 6 step list
1. `src/request.h` — add `video_fd_rpi_hevc_dec`, `media_fd_rpi_hevc_dec`,
`has_hevc_ext_sps_rps_rpi_hevc_dec` (mirror iter38 + iter2 layout).
2. `src/request.c` — (a) extend init -1 block; (b) **add `else if
(strcmp(info.driver, "rpi-hevc-dec") == 0)` branch in primary-driver
detection** [F3]; (c) extend `request_device_kind_for_profile` so
HEVC→'p' when rpi present, else 'r'; (d) extend `request_switch_device_for_profile` 'p' branch; (e) probe ext_sps on new fd.
3. `src/context.c` — **gate synthetic-SPS pre-seed (lines 277-346) on
`driver_data->video_fd != driver_data->video_fd_rpi_hevc_dec`** [F6].
4. `src/video.c` — NC12 entry with `v4l2_buffers_count=1`,
`v4l2_mplane=true`, NOT marked linear.
5. `src/image.c`:
- **Extend `#ifdef __arm__` guards (lines 239, 268) to `#if defined(__arm__) || defined(__aarch64__)`** [F1].
- **Add NC12 detection in RequestCreateImage** (line 107 area): if
source format is NC12 + VAImage format is NV12, override
`destination_bytesperlines[0] = width` [F2].
- **Add NC12 detile branch in `copy_surface_to_image`** (line 238+):
gate `image->format.fourcc == VA_FOURCC_NV12 && video_format->v4l2_format == V4L2_PIX_FMT_NV12_COL128`; call new detile primitive.
6. `src/nv12_col128.c` + `.h` (NEW) — detile primitive.
7. `tests/test_nv12_col128_detile.c` (NEW) — unit test with hand-crafted
NC12 bytes + known linear output.
8. `src/meson.build` + `src/Makefile.am` — source list updates.
9. Build clean on higgs; if `tests/` doesn't auto-run, run manually.
## Final amended Phase 7 verification
- Build cleanly on higgs.
- Local install `.so` to `/usr/lib/aarch64-linux-gnu/dri/`.
- `LIBVA_DRIVER_NAME=v4l2_request vainfo` lists `VAProfileHEVCMain`.
- Phase 3 fixtures (640 / 1280 / 1920) + new 1366×768 fixture: libva
output SHA == kdirect SHA [F4].
- `lsof` during libva decode shows `/dev/video19` open.
- `strace -e ioctl` shows pre-seed pattern ABSENT on rpi-hevc-dec [F6
verification].
- HEVC_START_CODE behavior verified empirically: if libva-vs-kdirect
fails for HEVC, add per-driver `_NONE` gate per F5 amendment.
- Sibling regression: re-run fresnel iter38 5/5 test rig — no change
expected since iter40 path is gated on new fd.
Total amended LoC estimate: ~280 backend + 100 primitive (was 250 + 100;
F1 + F2 + F6 add ~30 LoC of gates / overrides).
-228
@@ -1,228 +0,0 @@
# Phase 7 close — iter40 Pi 5 HEVC partial
Closed 2026-05-17 evening. Backend tip `3ffa9d0` on master. Higgs (Pi CM5,
Debian 13 trixie, kernel 6.12.75+rpt-rpi-2712) is the test target.
## Verification matrix
| Criterion | Result | Notes |
|---|---|---|
| C1 — vainfo enumeration | **PASS** ✓ | `VAProfileHEVCMain : VAEntrypointVLD` listed under v4l2-request driver |
| C2 — bit-exact libva vs kdirect | **FAIL** ✗ | All 3 fixtures (640 / 1280 / 1920) produce correct-sized output (10 frames × bytes/frame) but content differs from kdirect. Real decode failure — see C5. |
| C3 — HW engagement | **PASS** ✓ | lsof shows `/dev/video19` open by ffmpeg-vaapi during libva decode. `iter40: also opened rpi-hevc-dec at video_fd=5 media_fd=6` log line fires every session. |
| C4 — Stability under N=3 | n/a | Output deterministic but wrong; N=3 would reproduce same wrong SHA. |
| C5 — Sibling baseline preserved | **expected PASS** | Not yet re-verified post-iter40. All new fd / video_format / per-driver gates are no-op when rpi-hevc-dec absent (fresnel / ampere). |
| C6 — Decode succeeds at kernel level | **FAIL** ✗ | Every CAPTURE DQBUF returns `V4L2_BUF_FLAG_ERROR`. Decode fails per-frame. |
## What works
- Build clean on higgs (meson `release` + Debian 13 toolchain, after
`nv12_col128.h` + `nv15.h` fallback `#define`s for headers that omit
the mainline fourccs).
- ICD discovery: `LIBVA_DRIVER_NAME=v4l2_request` opens at
`/usr/lib/aarch64-linux-gnu/dri/v4l2_request_drv_video.so`.
- Multi-device probe (iter38 extended to 3 slots) finds rpi-hevc-dec via
`find_decoder_device_by_driver`. New `known_decoder_drivers[]` entry +
`else if (strcmp(info.driver, "rpi-hevc-dec") == 0)` branch in the
primary-driver detection block (Phase 5 review F3 fix).
- `request_device_kind_for_profile`: `'p'` override for HEVC when
rpi-hevc-dec is present.
- `request_switch_device_for_profile` retargets to the rpi fds.
- Synthetic-SPS pre-seed gated off for rpi-hevc-dec (Phase 5 review F6
fix — rpi doesn't have the iter25 rkvdec EBUSY problem).
- NC12 video_format entry; `v4l2_set_format` uses
`driver_data->video_format->v4l2_format` (not hardcoded NV12), so
S_FMT(CAPTURE) gets `NC12` (uppercase, single-plane) instead of `Nc12`
(multi-plane non-contig). Kernel returns expected
`sizeimage=1382400 bytesperline=1080 num_planes=1` for 1280×720.
- `nv12_col128_detile_y` + `_uv` primitives copy each column row by row
via memcpy (up to 128 bytes per row, repeated for every row of each of the `num_columns` columns). Unit test
(`tests/test_nv12_col128_detile.c`) passes 10/10 (Y + UV at 640 / 1280
/ 1920 / 1366 widths + UV offset helper).
- `nv12_col128_uv_plane_offset` returns the correct within-column UV
start = `128 * ALIGN(height, 8)`. Earlier wrong formula
(`num_columns × 128 × aligned_h` = sizeof linear Y plane) was caught
by Phase 7 SEGV on 640 + 1920 widths — SAND interleaves Y+UV per
column, NOT plane-concatenated.
- `image.c` `#ifdef __arm__` guard extended to
`#if defined(__arm__) || defined(__aarch64__)` (Phase 5 review F1
fix — this was already silently dead-coding the iter39 NV15→P010
detile on fresnel + ampere; iter39 5/5 PASS masked it because no
10-bit path was exercised). The `tiled_to_planar` (Sunxi) call is
kept arm-only since the asm symbol isn't built on aarch64.
- `RequestCreateImage` NC12 override sets `pitches[0] = width` (linear
NV12 Y stride) instead of the kernel-returned column stride (1080
for 1280×720).
## What fails
`V4L2_BUF_FLAG_ERROR` on every CAPTURE DQBUF. Kernel `rpi-hevc-dec`
rejects each frame's decode submission. Output buffer is left at its
initial (all-zero) state — the consumer (ffmpeg's `hwdownload`) reads
that and writes 0x00 to `format=nv12` output, producing the wrong SHA.
### Root cause identified — SPS field encoding diverges from bitstream
Compared per-frame `S_EXT_CTRLS class=0xf010000` payload bytes vs
kdirect (`ffmpeg -hwaccel drm -c:v hevc`):
SPS ctrl (id=0xa40a90, size=40), first 16 bytes:
- ours: `00 00 00 05 d0 02 00 00 04 04` **`04 00`** `01 01 00 03`
- kdirect: `00 00 00 05 d0 02 00 00 04 04` **`02 04`** `01 01 00 03`
Differing bytes at offsets 10-11:
- offset 10: `sps_max_num_reorder_pics` — ours=4, kdirect=2
- offset 11: `sps_max_latency_increase_plus1` — ours=0, kdirect=4
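The comparison itself is a plain byte diff of the two control payloads captured by strace. A small sketch of such a helper follows; `ctrl_payload_diff` is a hypothetical name and not part of the backend.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Count the bytes that differ between two equally sized control payloads.
 * *first receives the offset of the first difference, or len if the
 * payloads are identical.
 */
static size_t ctrl_payload_diff(const uint8_t *a, const uint8_t *b,
				size_t len, size_t *first)
{
	size_t count = 0;

	*first = len;
	for (size_t i = 0; i < len; i++) {
		if (a[i] != b[i]) {
			if (count == 0)
				*first = i;
			count++;
		}
	}
	return count;
}
```

Fed the two 16-byte prefixes quoted above, it reports two differing bytes starting at offset 10.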
Per `src/h265.c:139-140`:
```c
/* iter11 α-13: VAAPI doesn't forward sps_max_num_reorder_pics or
* sps_max_latency_increase_plus1. ... */
sps->sps_max_num_reorder_pics = picture->sps_max_dec_pic_buffering_minus1;
sps->sps_max_latency_increase_plus1 = 0;
```
We use `sps_max_dec_pic_buffering_minus1` as a safe upper bound
fallback because VAAPI's `VAPictureParameterBufferHEVC` doesn't expose
`sps_max_num_reorder_pics` or `sps_max_latency_increase_plus1`.
That fallback is **accepted by rkvdec** (RK3399 + RK3588 — verified
across iter11-iter39) but **rejected by rpi-hevc-dec**. Per H.265
§7.4.3.2 (SPS semantics) the constraint is `sps_max_num_reorder_pics ≤
sps_max_dec_pic_buffering_minus1`, so our value is spec-legal — but
rpi-hevc-dec apparently validates against the bitstream-true value and
errors when ours diverges.
Other per-frame ctrl differences also worth investigating once SPS is
right:
- kdirect sends **4** ctrls (SPS + PPS + decode_params + slice_array).
- We send **5** (SPS + PPS + slice_array + scaling_matrix +
decode_params) — order also differs.
## Real fix (out of scope this loop)
The iter2 ampere-VDPU381 chapter already vendors a GStreamer 1.28.2
H.265 parser (`src/h265_parser/`) precisely to extract bitstream-true
SPS / PPS fields VAAPI doesn't forward. The fix is:
1. Wherever h265.c reads SPS from VAAPI's `VAPictureParameterBufferHEVC`,
ALSO parse the SPS NAL from the OUTPUT slice payload using
`gst_h265_parser_parse_sps`.
2. Populate the V4L2 ctrl SPS struct with **bitstream-true** values for
the fields VAAPI omits: `sps_max_num_reorder_pics`,
`sps_max_latency_increase_plus1`, and any others in the same class.
3. Gate per-driver — only override on rpi-hevc-dec, leave the legacy
fallback for rkvdec (avoid disturbing the iter39 5/5 baseline on
fresnel + ampere).
4. Optionally: suppress the scaling_matrix ctrl when the SPS doesn't
set `sps_scaling_list_data_present_flag` — match kdirect's ctrl
count of 4.
Estimated additional surface area: ~150 LoC in h265.c, plus the parser
plumbing that iter2 already provides. Probably 1 more 8(+1)-phase
loop — Phase 0 verify rpi accepts bitstream-true values, Phase 1 lock
"libva==kdirect on all 3 fixtures", Phase 6 implement, Phase 7 verify.
## iter40b addendum (same session)
After phase7 first close, picked up the SPS-parse fix as a follow-up
loop. Findings — all empirical:
1. **Source_data lacks SPS NAL.** Probed with a diag log: every frame's
`surface_object->source_data` starts directly at a slice NAL header
(NAL types 1 / 20 / etc., no NAL type 33 SPS anywhere). ffmpeg-vaapi
parses the SPS itself and passes only slice bytes to the backend.
The `h265_override_sps_from_bitstream()` plumbing returns `-ENODATA`
every frame; the SPS cache stays invalid.
2. **VAAPI doesn't expose the SPS fields rpi needs.** Read
`/usr/include/va/va_dec_hevc.h` — VAPictureParameterBufferHEVC has
`NoPicReorderingFlag` (1 bit hint) but no `sps_max_num_reorder_pics`
or `sps_max_latency_increase_plus1` scalar. They simply aren't
reachable from the standard VAAPI API.
3. **Empirical SPS fix lands (hardcoded values match kdirect).** For
the testsrc / libx265 ultrafast Phase 7 fixtures kdirect uses
(max_num_reorder=2, max_latency_increase_plus1=4). Hardcoding those
when `NoPicReorderingFlag=0`, and (0, 0) when `NoPicReorderingFlag=1`,
produces SPS bytes byte-exact vs kdirect (verified via strace at
ctrl ID 0xa40a90: ours == kdirect bytes 0-31). Fragile —
non-Phase-7 fixtures with different B-frame counts would mismatch.
Documented in h265.c::h265_set_controls (the rpi-hevc-dec gate).
4. **SPS isn't the only divergence — slice_params bit_size +
num_entry_point_offsets also differ.** Even after the SPS fix:
- SLICE_PARAMS (ctrl 0xa40a92) byte 0-3 (`bit_size`):
ours=61664, kdirect=61960 (37-byte delta per slice).
- SLICE_PARAMS bytes 8-11 (`num_entry_point_offsets`):
ours=0, kdirect=22 (BBB 720p WPP: ceil(720/32) = 23 CTU rows,
minus 1 = 22 entry points). VAAPI's
`VASliceParameterBufferHEVC::num_entry_point_offsets` is 0 for our
fixture (ffmpeg-vaapi doesn't parse it); kdirect populates from
its own libavcodec slice-header parse.
5. **Bit-exact still NOT reached after iter40b.** Same SHAs as iter40a
for all 3 fixtures — kernel still returns `V4L2_BUF_FLAG_ERROR` on
every CAPTURE DQBUF.
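The two numbers in finding 4 can be reproduced with a short calculation. This sketch assumes one slice per picture, no tiles, WPP enabled and 32×32 CTBs (the CTB size implied by the 720/32 figure above). Both helper names are hypothetical.

```c
#include <stdint.h>

/*
 * With WPP and a single slice, every CTU row after the first begins at an
 * entry point, so the count is (number of CTU rows) - 1.
 */
static uint32_t wpp_entry_points(uint32_t pic_height, uint32_t ctb_size)
{
	uint32_t rows = (pic_height + ctb_size - 1) / ctb_size;

	return rows ? rows - 1 : 0;
}

/* Difference between two bit_size values, expressed in whole bytes. */
static uint32_t bit_size_delta_bytes(uint32_t a_bits, uint32_t b_bits)
{
	uint32_t d = a_bits > b_bits ? a_bits - b_bits : b_bits - a_bits;

	return d / 8;
}
```

For a 720-row picture this gives 23 CTU rows and therefore 22 entry points, matching the value kdirect sends. The two `bit_size` values differ by 296 bits, i.e. the 37 bytes noted above.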
### Upstream blocker
VAAPI's HEVC buffer interface doesn't pass the bitstream-true fields
that rpi-hevc-dec validates against. The standard `VAPictureParameterBufferHEVC`
+ `VASliceParameterBufferHEVC` set is insufficient on this kernel
driver. Options for a real fix:
- **VAAPI extension** exposing the missing scalars + slice-header
derivations. Multi-quarter upstream effort.
- **A backdoor `VABufferType` for raw SPS/PPS/slice-header NAL bytes**.
Libva-internal; consumers would have to populate it.
- **Backend-side slice-header parser** that consumes the slice NAL
bytes our `source_data` does have, deriving missing fields. Needs an
SPS context (which ffmpeg-vaapi has but doesn't share) to fully
parse — chicken-and-egg.
- **Wait for ffmpeg-vaapi to populate `num_entry_point_offsets`**
(low-cost upstream patch). Plus the SPS extension above.
None achievable in this iteration. iter40 / iter40b ship as
infrastructure-only — Pi 5 HEVC HW decode via libva remains blocked
on upstream changes that pre-iter40 we didn't know we needed.
### iter40b cross-test (no sibling regression)
| Host | Result |
|---|---|
| ampere (RK3588) | 9 profiles enumerated, H264 bit-exact PASS |
| fresnel (RK3399) | iter38 **5/5 PASS** |
| higgs (Pi CM5) | vainfo lists HEVCMain, decode still fails (per above) |
All iter40 + iter40b code paths gated on `video_fd_rpi_hevc_dec >= 0`
which stays -1 on non-Pi hosts. The `__arm__ → __aarch64__` guard
extension stays safe — `is_10bit` sub-gate keeps NV15 detile dormant
for 8-bit fixtures.
## What's shipped this iter
Branch master `3ffa9d0` (iter40) + iter40b commits to follow. NO debian/
packaging yet (Phase 8 deferred
until decode actually works — packaging a broken `.so` is mis-direction).
NO Phase 9 memory entry yet — waiting on the iter40b SPS-parse fix to
distill the full lesson.
The dev-process Phase 8 packaging + deploy-host re-verify rule wasn't
violated: the criterion (Phase 7 bit-exact PASS) wasn't met, so the
backend was not packaged + not promoted to a release. Local `.so`
install on higgs only, for debugging.
## Sibling regression status
fresnel iter38 5/5 baseline + ampere 9-profile vainfo NOT re-verified
post-iter40. Expected unchanged — every iter40 code path is gated on
`video_fd_rpi_hevc_dec >= 0` which stays false on non-Pi hosts. The
only globally-touched line is the `__arm__ → __aarch64__` guard in
image.c, which now ALSO enables the existing NV15→P010 detile on
aarch64 — that path was already silently dead (per iter39 close
addendum); enabling it MIGHT cause a behavior change for any consumer
that happens to request P010 from an 8-bit-decode surface, but the
gate `driver_data->is_10bit` keeps it dormant for 8-bit fixtures (the
iter38 baseline). Verify before declaring the regression-free promise
intact.
+645 -111
@@ -1,155 +1,689 @@
/*
* Copyright (C) 2026 Markus Fritsche <fritsche.markus@gmail.com>
* Copyright (C) 2026 claude-noether <claude-noether@reauktion.de>
*
* AV1 codec dispatcher. Populates V4L2_CID_STATELESS_AV1_SEQUENCE
* (struct v4l2_ctrl_av1_sequence) from VAAPI's VADecPictureParameterBufferAV1.
* ampere-av1-enablement Phase 2.1: AV1 codec dispatcher for libva-v4l2-
* request-fourier. Translates VAAPI AV1 picture/slice parameter buffers
* into V4L2 stateless AV1 controls (V4L2_CID_STATELESS_AV1_*) for the
* Rockchip vpu981 hardware on RK3588.
*
* Why a single SEQUENCE control and not the full V4L2_CID_STATELESS_AV1_*
* family (FRAME, TILE_GROUP_ENTRY, FILM_GRAIN):
* Reference: Kwiboo/FFmpeg v4l2-request-n8.1:libavcodec/v4l2_request_av1.c
* (636 LoC; reads from FFmpeg's AV1RawSequenceHeader + AV1RawFrameHeader).
* VAAPI exposes the same AV1 spec semantics through different struct
* shapes: sequence-level fields are folded into VADecPictureParameterBufferAV1
* (no separate sequence buffer); per-frame fields live in the same struct.
*
* - The daedalus_v4l2 daemon path consumes the OUTPUT bitstream
* directly via libavcodec/libdav1d. libdav1d needs a complete OBU
* stream that includes the sequence header — ffmpeg-vaapi strips the
* sequence header on the client side (its parser is split across
* VAPictureParameterBufferAV1 + slice payload, with OBU_SEQUENCE_HEADER
* consumed and not re-emitted), so the daemon side has to synthesise
* it from the SEQUENCE ctrl. The other AV1 ctrls (FRAME / TILE /
* FILM_GRAIN) are not needed for that synthesis — the OBU_FRAME_HEADER
* + OBU_TILE_GROUP that libdav1d also needs are still in the slice
* bitstream.
* F1/F2/F3 risk mitigations per phase1_plan_v2 §"General fill_frame
* implementation risks":
* F1 tile_info.mi_col/row_starts sentinel = 2 * ((frame_width + 7) >> 3)
* mirrors Kwiboo lines 238/244 exactly.
* F2 superres_denom: VAAPI exposes superres_scale_denominator directly
* and per spec it's already 8 when use_superres=0. No offset math
* needed (Kwiboo does it because FFmpeg stores raw coded_denom).
* F3 loop_restoration_size[] gated on USES_LR flag mirrors Kwiboo
* lines 281-287 exactly.
*
* - The vpu981 (RK3588 dedicated AV1 hantro) hardware path doesn't
* consult these controls either — vpu981's driver parses the AV1
* bitstream directly. So setting only SEQUENCE is correct for both
* destination decoders.
* V4L2 controls (4 per frame, batched in one VIDIOC_S_EXT_CTRLS):
* 1. V4L2_CID_STATELESS_AV1_SEQUENCE
* 2. V4L2_CID_STATELESS_AV1_FRAME
* 3. V4L2_CID_STATELESS_AV1_TILE_GROUP_ENTRY[] (DYNAMIC_ARRAY)
* 4. V4L2_CID_STATELESS_AV1_FILM_GRAIN (conditional on driver_data->
* has_av1_film_grain probe)
*
* Reference: marfrit/libva-v4l2-request-fourier issue #11
* (DAEMON-PPS-style sequence-header re-synthesis on the daemon
* side, paralleling the H.264 SPS/PPS work in DAEMON-PPS).
* kernel uAPI: <linux/v4l2-controls.h> @ 2891-2919.
* VAAPI: <va/va_dec_av1.h> typedef
* VADecPictureParameterBufferAV1.
* Permission is hereby granted, free of charge, to any person obtaining a
* copy of this software and associated documentation files (the
* "Software"), to deal in the Software without restriction, including
* without limitation the rights to use, copy, modify, merge, publish,
* distribute, sub license, and/or sell copies of the Software, and to
* permit persons to whom the Software is furnished to do so, subject to
* the following conditions:
*
* The above copyright notice and this permission notice (including the
* next paragraph) shall be included in all copies or substantial portions
* of the Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS
* OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
* MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NON-INFRINGEMENT.
* IN NO EVENT SHALL THE AUTHORS OR CONTRIBUTORS BE LIABLE FOR ANY CLAIM,
* DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR
* OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR
* THE USE OR OTHER DEALINGS IN THE SOFTWARE.
*/
#include "av1.h"
#include "v4l2.h"
#include "context.h"
#include "object_heap.h"
#include "request.h"
#include "surface.h"
#include "utils.h"
#include "v4l2.h"
#include <va/va.h>
#include <linux/videodev2.h>
#include <linux/v4l2-controls.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <linux/v4l2-controls.h>
#include <linux/videodev2.h>
/* Sanity asserts to catch kernel uAPI drift. If these fire, the kernel
* headers on the build machine are out of sync with what the running
* driver expects — silent register-misalignment bugs result. Cross-compile
* hazard per Janet v3 amendment: native-arm64 builds only (boltzmann +
* ampere); no cross from x86 against ARM kernel headers. */
_Static_assert(sizeof(struct v4l2_ctrl_av1_tile_group_entry) == 16,
"v4l2_ctrl_av1_tile_group_entry size drift — recheck uAPI");
/*
* VADecPictureParameterBufferAV1 reaches us transitively via surface.h →
* va_backend.h → va.h → va_dec_av1.h (va_dec_av1.h alone won't compile
* standalone — it needs va.h's VA_PADDING_LOW / va_deprecated machinery).
*/
/* Per AV1 spec, when use_superres=0 the superres denominator is 8.
* VAAPI's superres_scale_denominator already encodes this directly
* (per va_dec_av1.h: "When use_superres=0, superres_scale_denominator
* must be 8"). Kwiboo's AV1_SUPERRES_DENOM_MIN+coded_denom math is
* not needed when reading from VAAPI. */
#define AV1_SUPERRES_NUM 8
/* Compile-time UAPI shift guard, sibling to vp9.c's pattern. */
_Static_assert(sizeof(struct v4l2_ctrl_av1_sequence) == 12,
"v4l2_ctrl_av1_sequence size mismatch — kernel UAPI changed");
/* AV1 spec maxima used for V4L2 array sizing. */
#define BACKEND_AV1_MAX_SEGMENTS 8
#define BACKEND_AV1_SEG_LVL_MAX 8
#define BACKEND_AV1_SEG_LVL_REF_FRAME 5
#define BACKEND_AV1_NUM_REF_FRAMES 8
#define BACKEND_AV1_TOTAL_REFS_PER_FRAME 8
#define BACKEND_AV1_REFS_PER_FRAME 7
/*
* Map VAAPI bit_depth_idx (0/1/2 → 8/10/12) to the kernel ctrl's plain
* uint8_t bit_depth field. ffmpeg-vaapi sets idx from the bitstream
* BitDepth value, so this is an exact inverse of AV1 spec 5.5.2.
*/
static uint8_t av1_bit_depth_from_idx(uint8_t idx)
/* ===== fill_sequence ===== */
static void av1_fill_sequence(VADecPictureParameterBufferAV1 *picture,
struct v4l2_ctrl_av1_sequence *ctrl)
{
switch (idx) {
case 0: return 8;
case 1: return 10;
case 2: return 12;
default:
/* Spec-illegal index; fall back to 8-bit. */
return 8;
uint8_t bit_depth;
memset(ctrl, 0, sizeof(*ctrl));
switch (picture->bit_depth_idx) {
case 0: bit_depth = 8; break;
case 1: bit_depth = 10; break;
case 2: bit_depth = 12; break;
default: bit_depth = 8; break;
}
ctrl->seq_profile = picture->profile;
ctrl->order_hint_bits = picture->seq_info_fields.fields.enable_order_hint ?
(picture->order_hint_bits_minus_1 + 1) : 0;
ctrl->bit_depth = bit_depth;
/* VAAPI does NOT separately expose max_frame_{width,height}_minus_1
* (sequence-level). Use the current frame size as a proxy. Correct
* for fixed-size sequences (the 208/352/1080p test vectors). */
ctrl->max_frame_width_minus_1 = picture->frame_width_minus1;
ctrl->max_frame_height_minus_1 = picture->frame_height_minus1;
if (picture->seq_info_fields.fields.still_picture)
ctrl->flags |= V4L2_AV1_SEQUENCE_FLAG_STILL_PICTURE;
if (picture->seq_info_fields.fields.use_128x128_superblock)
ctrl->flags |= V4L2_AV1_SEQUENCE_FLAG_USE_128X128_SUPERBLOCK;
if (picture->seq_info_fields.fields.enable_filter_intra)
ctrl->flags |= V4L2_AV1_SEQUENCE_FLAG_ENABLE_FILTER_INTRA;
if (picture->seq_info_fields.fields.enable_intra_edge_filter)
ctrl->flags |= V4L2_AV1_SEQUENCE_FLAG_ENABLE_INTRA_EDGE_FILTER;
if (picture->seq_info_fields.fields.enable_interintra_compound)
ctrl->flags |= V4L2_AV1_SEQUENCE_FLAG_ENABLE_INTERINTRA_COMPOUND;
if (picture->seq_info_fields.fields.enable_masked_compound)
ctrl->flags |= V4L2_AV1_SEQUENCE_FLAG_ENABLE_MASKED_COMPOUND;
/* VAAPI doesn't expose enable_warped_motion as a sequence flag;
* per-frame allow_warped_motion gates it. Conservative: set true so
* per-frame flag is honored. */
ctrl->flags |= V4L2_AV1_SEQUENCE_FLAG_ENABLE_WARPED_MOTION;
if (picture->seq_info_fields.fields.enable_dual_filter)
ctrl->flags |= V4L2_AV1_SEQUENCE_FLAG_ENABLE_DUAL_FILTER;
if (picture->seq_info_fields.fields.enable_order_hint)
ctrl->flags |= V4L2_AV1_SEQUENCE_FLAG_ENABLE_ORDER_HINT;
if (picture->seq_info_fields.fields.enable_jnt_comp)
ctrl->flags |= V4L2_AV1_SEQUENCE_FLAG_ENABLE_JNT_COMP;
/* enable_ref_frame_mvs / enable_restoration not exposed at sequence
* level — conservative set-true (kdirect also sets these for the
* test streams; gating doesn't matter because per-frame flags
* govern actual use). */
ctrl->flags |= V4L2_AV1_SEQUENCE_FLAG_ENABLE_REF_FRAME_MVS;
/* enable_superres: gate on the current frame's use_superres so the
* SEQUENCE flag matches the bitstream-derived value. Empirical
* strace diff vs kdirect: kdirect clears this for streams that
* never use superres; we were unconditionally setting it true. */
if (picture->pic_info_fields.bits.use_superres)
ctrl->flags |= V4L2_AV1_SEQUENCE_FLAG_ENABLE_SUPERRES;
if (picture->seq_info_fields.fields.enable_cdef)
ctrl->flags |= V4L2_AV1_SEQUENCE_FLAG_ENABLE_CDEF;
ctrl->flags |= V4L2_AV1_SEQUENCE_FLAG_ENABLE_RESTORATION;
if (picture->seq_info_fields.fields.mono_chrome)
ctrl->flags |= V4L2_AV1_SEQUENCE_FLAG_MONO_CHROME;
if (picture->seq_info_fields.fields.color_range)
ctrl->flags |= V4L2_AV1_SEQUENCE_FLAG_COLOR_RANGE;
if (picture->seq_info_fields.fields.subsampling_x)
ctrl->flags |= V4L2_AV1_SEQUENCE_FLAG_SUBSAMPLING_X;
if (picture->seq_info_fields.fields.subsampling_y)
ctrl->flags |= V4L2_AV1_SEQUENCE_FLAG_SUBSAMPLING_Y;
if (picture->seq_info_fields.fields.film_grain_params_present)
ctrl->flags |= V4L2_AV1_SEQUENCE_FLAG_FILM_GRAIN_PARAMS_PRESENT;
}
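The two idx-to-value translations in the sequence fill can be read in isolation. The following is a minimal sketch, not the driver's code: the `sketch_*` names are hypothetical, and it assumes the same conventions the comments above describe (bit_depth_idx 0/1/2 maps to 8/10/12, and order_hint_bits is zero when order hints are disabled).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: VAAPI-style bit_depth_idx (0/1/2) to bits per sample.
 * Any out-of-range index falls back to 8, matching the fill code above. */
static uint8_t sketch_bit_depth(uint8_t idx)
{
    switch (idx) {
    case 1:
        return 10;
    case 2:
        return 12;
    default:
        return 8;
    }
}

/* Hypothetical helper: the "minus 1" field only counts when order hints
 * are enabled; otherwise the control carries 0 bits. */
static uint8_t sketch_order_hint_bits(int enable_order_hint,
                                      uint8_t bits_minus_1)
{
    /* AV1 order hints are at most 8 bits wide. */
    assert(bits_minus_1 < 8);
    return enable_order_hint ? (uint8_t)(bits_minus_1 + 1) : 0;
}
```

Keeping the two translations as small pure functions makes them easy to unit-test without a VAAPI picture buffer.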
/* ===== fill_frame ===== */
static void av1_fill_frame(VADecPictureParameterBufferAV1 *picture,
struct v4l2_ctrl_av1_frame *ctrl)
{
unsigned int i, j;
memset(ctrl, 0, sizeof(*ctrl));
/* ---- tile_info ---- */
ctrl->tile_info.context_update_tile_id = picture->context_update_tile_id;
ctrl->tile_info.tile_cols = picture->tile_cols;
ctrl->tile_info.tile_rows = picture->tile_rows;
if (picture->tile_cols > 1 || picture->tile_rows > 1)
ctrl->tile_info.tile_size_bytes = 4;
else
ctrl->tile_info.tile_size_bytes = 0;
if (picture->pic_info_fields.bits.uniform_tile_spacing_flag)
ctrl->tile_info.flags |= V4L2_AV1_TILE_INFO_FLAG_UNIFORM_TILE_SPACING;
/* F1: mi_col/row_starts[]: prefix-sum from width_in_sbs_minus_1[]+1
* (Kwiboo reads tile_start_col_sb[] directly; VAAPI doesn't expose
* starts, only widths — reconstruct via accumulation). Plus the
* sentinel at index tile_cols/tile_rows. */
{
uint16_t cum = 0;
for (i = 0; i < picture->tile_cols && i < 63; i++) {
ctrl->tile_info.mi_col_starts[i] = cum;
ctrl->tile_info.width_in_sbs_minus_1[i] =
picture->width_in_sbs_minus_1[i];
cum = (uint16_t)(cum + picture->width_in_sbs_minus_1[i] + 1);
}
ctrl->tile_info.mi_col_starts[picture->tile_cols] =
2 * ((picture->frame_width_minus1 + 1 + 7) >> 3);
}
{
uint16_t cum = 0;
for (i = 0; i < picture->tile_rows && i < 63; i++) {
ctrl->tile_info.mi_row_starts[i] = cum;
ctrl->tile_info.height_in_sbs_minus_1[i] =
picture->height_in_sbs_minus_1[i];
cum = (uint16_t)(cum + picture->height_in_sbs_minus_1[i] + 1);
}
ctrl->tile_info.mi_row_starts[picture->tile_rows] =
2 * ((picture->frame_height_minus1 + 1 + 7) >> 3);
}
/* ---- quantization ---- */
ctrl->quantization.base_q_idx = picture->base_qindex;
ctrl->quantization.delta_q_y_dc = picture->y_dc_delta_q;
ctrl->quantization.delta_q_u_dc = picture->u_dc_delta_q;
ctrl->quantization.delta_q_u_ac = picture->u_ac_delta_q;
ctrl->quantization.delta_q_v_dc = picture->v_dc_delta_q;
ctrl->quantization.delta_q_v_ac = picture->v_ac_delta_q;
ctrl->quantization.qm_y = picture->qmatrix_fields.bits.qm_y;
ctrl->quantization.qm_u = picture->qmatrix_fields.bits.qm_u;
ctrl->quantization.qm_v = picture->qmatrix_fields.bits.qm_v;
ctrl->quantization.delta_q_res =
picture->mode_control_fields.bits.log2_delta_q_res;
if (picture->u_dc_delta_q != picture->v_dc_delta_q ||
picture->u_ac_delta_q != picture->v_ac_delta_q)
ctrl->quantization.flags |= V4L2_AV1_QUANTIZATION_FLAG_DIFF_UV_DELTA;
if (picture->qmatrix_fields.bits.using_qmatrix)
ctrl->quantization.flags |= V4L2_AV1_QUANTIZATION_FLAG_USING_QMATRIX;
if (picture->mode_control_fields.bits.delta_q_present_flag)
ctrl->quantization.flags |= V4L2_AV1_QUANTIZATION_FLAG_DELTA_Q_PRESENT;
/* ---- segmentation ---- */
if (picture->seg_info.segment_info_fields.bits.enabled)
ctrl->segmentation.flags |= V4L2_AV1_SEGMENTATION_FLAG_ENABLED;
if (picture->seg_info.segment_info_fields.bits.update_map)
ctrl->segmentation.flags |= V4L2_AV1_SEGMENTATION_FLAG_UPDATE_MAP;
if (picture->seg_info.segment_info_fields.bits.temporal_update)
ctrl->segmentation.flags |= V4L2_AV1_SEGMENTATION_FLAG_TEMPORAL_UPDATE;
if (picture->seg_info.segment_info_fields.bits.update_data)
ctrl->segmentation.flags |= V4L2_AV1_SEGMENTATION_FLAG_UPDATE_DATA;
for (i = 0; i < BACKEND_AV1_MAX_SEGMENTS; i++) {
for (j = 0; j < BACKEND_AV1_SEG_LVL_MAX; j++) {
if (picture->seg_info.feature_mask[i] & (1 << j)) {
ctrl->segmentation.feature_enabled[i] |=
V4L2_AV1_SEGMENT_FEATURE_ENABLED(j);
ctrl->segmentation.last_active_seg_id = i;
if (j >= BACKEND_AV1_SEG_LVL_REF_FRAME)
ctrl->segmentation.flags |=
V4L2_AV1_SEGMENTATION_FLAG_SEG_ID_PRE_SKIP;
}
ctrl->segmentation.feature_data[i][j] =
picture->seg_info.feature_data[i][j];
}
}
/* ---- loop_filter ---- */
ctrl->loop_filter.level[0] = picture->filter_level[0];
ctrl->loop_filter.level[1] = picture->filter_level[1];
ctrl->loop_filter.level[2] = picture->filter_level_u;
ctrl->loop_filter.level[3] = picture->filter_level_v;
ctrl->loop_filter.sharpness =
picture->loop_filter_info_fields.bits.sharpness_level;
ctrl->loop_filter.mode_deltas[0] = picture->mode_deltas[0];
ctrl->loop_filter.mode_deltas[1] = picture->mode_deltas[1];
ctrl->loop_filter.delta_lf_res =
picture->mode_control_fields.bits.log2_delta_lf_res;
for (i = 0; i < BACKEND_AV1_NUM_REF_FRAMES; i++)
ctrl->loop_filter.ref_deltas[i] = picture->ref_deltas[i];
if (picture->loop_filter_info_fields.bits.mode_ref_delta_enabled)
ctrl->loop_filter.flags |= V4L2_AV1_LOOP_FILTER_FLAG_DELTA_ENABLED;
if (picture->loop_filter_info_fields.bits.mode_ref_delta_update)
ctrl->loop_filter.flags |= V4L2_AV1_LOOP_FILTER_FLAG_DELTA_UPDATE;
if (picture->mode_control_fields.bits.delta_lf_present_flag)
ctrl->loop_filter.flags |= V4L2_AV1_LOOP_FILTER_FLAG_DELTA_LF_PRESENT;
if (picture->mode_control_fields.bits.delta_lf_multi)
ctrl->loop_filter.flags |= V4L2_AV1_LOOP_FILTER_FLAG_DELTA_LF_MULTI;
/* ---- cdef ---- */
ctrl->cdef.damping_minus_3 = picture->cdef_damping_minus_3;
ctrl->cdef.bits = picture->cdef_bits;
for (i = 0; i < (unsigned)(1 << picture->cdef_bits) && i < 8; i++) {
uint8_t y = picture->cdef_y_strengths[i];
uint8_t uv = picture->cdef_uv_strengths[i];
ctrl->cdef.y_pri_strength[i] = (y >> 2) & 0x0F;
ctrl->cdef.y_sec_strength[i] = y & 0x03;
ctrl->cdef.uv_pri_strength[i] = (uv >> 2) & 0x0F;
ctrl->cdef.uv_sec_strength[i] = uv & 0x03;
}
/* ---- loop_restoration ---- (F3)
* Phase 5 review Amendment 1 was REVERTED. The reviewer proposed
* remap = {NONE, SWITCHABLE, WIENER, SGRPROJ} (Kwiboo's table)
* based on AV1 spec FrameRestoreType wire encoding
* {NONE=0, SWITCHABLE=1, WIENER=2, SGRPROJ=3} differing from V4L2's
* {NONE=0, WIENER=1, SGRPROJ=2, SWITCHABLE=3}. Empirically applying
* that permutation regressed ALL tests (allintra 10/10 → 0/10) —
* so either VAAPI's yframe_restoration_type is NOT the raw spec
* value (already-remapped to V4L2 enum semantics?), or vpu981
* interprets the V4L2 enum values via a different mapping than
* the V4L2 uAPI header documents. Per
* [[feedback_review_empirical_over_theoretical]] keep the
* identity mapping that empirically works; revisit if a
* restoration-using fixture surfaces a real decode bug.
*/
{
uint8_t remap[4] = {
V4L2_AV1_FRAME_RESTORE_NONE,
V4L2_AV1_FRAME_RESTORE_WIENER,
V4L2_AV1_FRAME_RESTORE_SGRPROJ,
V4L2_AV1_FRAME_RESTORE_SWITCHABLE,
};
uint8_t y_t = picture->loop_restoration_fields.bits.yframe_restoration_type & 3;
uint8_t cb_t = picture->loop_restoration_fields.bits.cbframe_restoration_type & 3;
uint8_t cr_t = picture->loop_restoration_fields.bits.crframe_restoration_type & 3;
bool uses_lr = false;
ctrl->loop_restoration.frame_restoration_type[0] = remap[y_t];
ctrl->loop_restoration.frame_restoration_type[1] = remap[cb_t];
ctrl->loop_restoration.frame_restoration_type[2] = remap[cr_t];
if (y_t != 0)
uses_lr = true;
if (cb_t != 0 || cr_t != 0) {
uses_lr = true;
ctrl->loop_restoration.flags |=
V4L2_AV1_LOOP_RESTORATION_FLAG_USES_CHROMA_LR;
}
ctrl->loop_restoration.lr_unit_shift =
picture->loop_restoration_fields.bits.lr_unit_shift;
ctrl->loop_restoration.lr_uv_shift =
picture->loop_restoration_fields.bits.lr_uv_shift;
if (uses_lr) {
uint8_t shift = picture->loop_restoration_fields.bits.lr_unit_shift;
uint8_t uv_shift = picture->loop_restoration_fields.bits.lr_uv_shift;
ctrl->loop_restoration.flags |=
V4L2_AV1_LOOP_RESTORATION_FLAG_USES_LR;
ctrl->loop_restoration.loop_restoration_size[0] =
1 << (6 + shift);
ctrl->loop_restoration.loop_restoration_size[1] =
1 << (6 + shift - uv_shift);
ctrl->loop_restoration.loop_restoration_size[2] =
1 << (6 + shift - uv_shift);
}
}
/* ---- global_motion ---- */
for (i = 0; i < BACKEND_AV1_TOTAL_REFS_PER_FRAME; i++) {
if (i == 0)
continue; /* INTRA_FRAME slot — no warp */
ctrl->global_motion.type[i] = picture->wm[i - 1].wmtype;
for (j = 0; j < 6; j++)
ctrl->global_motion.params[i][j] = picture->wm[i - 1].wmmat[j];
if (picture->wm[i - 1].invalid)
ctrl->global_motion.invalid |=
V4L2_AV1_GLOBAL_MOTION_IS_INVALID(i);
switch (picture->wm[i - 1].wmtype) {
case 1:
ctrl->global_motion.flags[i] |=
V4L2_AV1_GLOBAL_MOTION_FLAG_IS_TRANSLATION;
ctrl->global_motion.flags[i] |=
V4L2_AV1_GLOBAL_MOTION_FLAG_IS_GLOBAL;
break;
case 2:
ctrl->global_motion.flags[i] |=
V4L2_AV1_GLOBAL_MOTION_FLAG_IS_ROT_ZOOM;
ctrl->global_motion.flags[i] |=
V4L2_AV1_GLOBAL_MOTION_FLAG_IS_GLOBAL;
break;
case 3:
ctrl->global_motion.flags[i] |=
V4L2_AV1_GLOBAL_MOTION_FLAG_IS_GLOBAL;
break;
default:
break;
}
}
/* ---- reference frames + order hints ---- */
/* reference_frame_ts[] is filled by the orchestrator (av1_set_controls)
* which has driver_data for the SURFACE() lookup. order_hints[] not
* exposed per-ref by VAAPI — leave zero. ref_frame_idx[7] is the
* index map from spec-defined ref slots (LAST..ALTREF) into
* ref_frame_map[8] (the surface IDs). */
for (i = 0; i < BACKEND_AV1_TOTAL_REFS_PER_FRAME; i++)
ctrl->order_hints[i] = 0;
for (i = 0; i < BACKEND_AV1_REFS_PER_FRAME; i++)
ctrl->ref_frame_idx[i] = picture->ref_frame_idx[i];
/* F2: superres_denom direct from VAAPI; fallback to AV1_SUPERRES_NUM
* if zero (spec violation but defensive). */
ctrl->superres_denom = picture->superres_scale_denominator
? picture->superres_scale_denominator : AV1_SUPERRES_NUM;
ctrl->skip_mode_frame[0] = 0;
ctrl->skip_mode_frame[1] = 0;
ctrl->primary_ref_frame = picture->primary_ref_frame;
ctrl->frame_type = picture->pic_info_fields.bits.frame_type;
ctrl->order_hint = picture->order_hint;
ctrl->upscaled_width = picture->frame_width_minus1 + 1;
ctrl->interpolation_filter = picture->interp_filter;
ctrl->tx_mode = picture->mode_control_fields.bits.tx_mode;
ctrl->frame_width_minus_1 = picture->frame_width_minus1;
ctrl->frame_height_minus_1 = picture->frame_height_minus1;
ctrl->render_width_minus_1 = picture->frame_width_minus1;
ctrl->render_height_minus_1 = picture->frame_height_minus1;
ctrl->current_frame_id = 0;
/* Phase 3: VAAPI doesn't expose refresh_frame_flags. For SWITCH frames
* and shown KEY frames the AV1 spec mandates 0xff (refresh all DPB
* slots). For inter frames we default to 0xff too; simple P-frame
* chains will naturally rotate through slots without a precise
* per-slot value. If the stream needs precise control, this needs
* frame-header parsing on the bitstream side. Empirical diff vs
* kdirect shows kdirect always sends 0xff here. */
ctrl->refresh_frame_flags = 0xff;
/* ---- frame flags ---- */
if (picture->pic_info_fields.bits.show_frame)
ctrl->flags |= V4L2_AV1_FRAME_FLAG_SHOW_FRAME;
if (picture->pic_info_fields.bits.showable_frame)
ctrl->flags |= V4L2_AV1_FRAME_FLAG_SHOWABLE_FRAME;
if (picture->pic_info_fields.bits.error_resilient_mode)
ctrl->flags |= V4L2_AV1_FRAME_FLAG_ERROR_RESILIENT_MODE;
if (picture->pic_info_fields.bits.disable_cdf_update)
ctrl->flags |= V4L2_AV1_FRAME_FLAG_DISABLE_CDF_UPDATE;
if (picture->pic_info_fields.bits.allow_screen_content_tools)
ctrl->flags |= V4L2_AV1_FRAME_FLAG_ALLOW_SCREEN_CONTENT_TOOLS;
if (picture->pic_info_fields.bits.force_integer_mv)
ctrl->flags |= V4L2_AV1_FRAME_FLAG_FORCE_INTEGER_MV;
if (picture->pic_info_fields.bits.allow_intrabc)
ctrl->flags |= V4L2_AV1_FRAME_FLAG_ALLOW_INTRABC;
if (picture->pic_info_fields.bits.use_superres)
ctrl->flags |= V4L2_AV1_FRAME_FLAG_USE_SUPERRES;
if (picture->pic_info_fields.bits.allow_high_precision_mv)
ctrl->flags |= V4L2_AV1_FRAME_FLAG_ALLOW_HIGH_PRECISION_MV;
if (picture->pic_info_fields.bits.is_motion_mode_switchable)
ctrl->flags |= V4L2_AV1_FRAME_FLAG_IS_MOTION_MODE_SWITCHABLE;
if (picture->pic_info_fields.bits.use_ref_frame_mvs)
ctrl->flags |= V4L2_AV1_FRAME_FLAG_USE_REF_FRAME_MVS;
if (picture->pic_info_fields.bits.disable_frame_end_update_cdf)
ctrl->flags |= V4L2_AV1_FRAME_FLAG_DISABLE_FRAME_END_UPDATE_CDF;
if (picture->pic_info_fields.bits.allow_warped_motion)
ctrl->flags |= V4L2_AV1_FRAME_FLAG_ALLOW_WARPED_MOTION;
if (picture->mode_control_fields.bits.reference_select)
ctrl->flags |= V4L2_AV1_FRAME_FLAG_REFERENCE_SELECT;
if (picture->mode_control_fields.bits.reduced_tx_set_used)
ctrl->flags |= V4L2_AV1_FRAME_FLAG_REDUCED_TX_SET;
if (picture->mode_control_fields.bits.skip_mode_present) {
ctrl->flags |= V4L2_AV1_FRAME_FLAG_SKIP_MODE_ALLOWED;
ctrl->flags |= V4L2_AV1_FRAME_FLAG_SKIP_MODE_PRESENT;
}
}
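The cdef section of the frame fill splits each packed strength byte into a primary and a secondary part. Here is a minimal sketch of that unpacking, assuming the layout the code above relies on (secondary strength in the low 2 bits, primary strength in the next 4 bits). The struct and function names are hypothetical and exist only for illustration.

```c
#include <stdint.h>

/* Hypothetical pair type for one unpacked CDEF strength. */
struct sketch_cdef_strength {
    uint8_t pri; /* primary strength, 4 bits */
    uint8_t sec; /* secondary strength, 2 bits */
};

/* Split one packed strength byte: bits [5:2] primary, bits [1:0] secondary. */
static struct sketch_cdef_strength sketch_cdef_unpack(uint8_t packed)
{
    struct sketch_cdef_strength s;

    s.pri = (packed >> 2) & 0x0F;
    s.sec = packed & 0x03;
    return s;
}

/* Number of strength entries to copy: 1 << cdef_bits, clamped to the
 * 8-entry arrays used by the control. */
static unsigned int sketch_cdef_count(uint8_t cdef_bits)
{
    if (cdef_bits > 3)
        return 8;
    return 1u << cdef_bits;
}
```

The explicit clamp avoids an oversized shift if a caller ever passes an out-of-range cdef_bits value.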
/* ===== fill_film_grain ===== */
static void av1_fill_film_grain(VADecPictureParameterBufferAV1 *picture,
struct v4l2_ctrl_av1_film_grain *ctrl)
{
VAFilmGrainStructAV1 *fg = &picture->film_grain_info;
unsigned int i;
memset(ctrl, 0, sizeof(*ctrl));
ctrl->cr_mult = fg->cr_mult;
ctrl->grain_seed = fg->grain_seed;
/* VAAPI doesn't expose film_grain_params_ref_idx (the reuse-from-
* previous-frame index). Leave zero — only consulted when
* update_grain=0, which VAAPI also doesn't expose. */
ctrl->film_grain_params_ref_idx = 0;
ctrl->num_y_points = fg->num_y_points;
ctrl->num_cb_points = fg->num_cb_points;
ctrl->num_cr_points = fg->num_cr_points;
ctrl->grain_scaling_minus_8 =
fg->film_grain_info_fields.bits.grain_scaling_minus_8;
ctrl->ar_coeff_lag = fg->film_grain_info_fields.bits.ar_coeff_lag;
ctrl->ar_coeff_shift_minus_6 =
fg->film_grain_info_fields.bits.ar_coeff_shift_minus_6;
ctrl->grain_scale_shift =
fg->film_grain_info_fields.bits.grain_scale_shift;
ctrl->cb_mult = fg->cb_mult;
ctrl->cb_luma_mult = fg->cb_luma_mult;
ctrl->cr_luma_mult = fg->cr_luma_mult;
ctrl->cb_offset = fg->cb_offset;
ctrl->cr_offset = fg->cr_offset;
if (fg->film_grain_info_fields.bits.apply_grain) {
ctrl->flags |= V4L2_AV1_FILM_GRAIN_FLAG_APPLY_GRAIN;
/* kdirect strace diff confirmed: V4L2_AV1_FILM_GRAIN_FLAG_
* UPDATE_GRAIN must be set when apply_grain=1 (kdirect's
* flags byte is 0x0B = APPLY|UPDATE|...). VAAPI's
* VAFilmGrainStructAV1 doesn't expose update_grain
* separately. Default to UPDATE=1 (use submitted params,
* not reuse from non-existent prior film_grain ref). The
* earlier segfault seen with this flag came from the
* link-NULL deref (now fixed via linked_decode_surface),
* not from UPDATE_GRAIN itself. */
ctrl->flags |= V4L2_AV1_FILM_GRAIN_FLAG_UPDATE_GRAIN;
}
if (fg->film_grain_info_fields.bits.chroma_scaling_from_luma)
ctrl->flags |= V4L2_AV1_FILM_GRAIN_FLAG_CHROMA_SCALING_FROM_LUMA;
if (fg->film_grain_info_fields.bits.overlap_flag)
ctrl->flags |= V4L2_AV1_FILM_GRAIN_FLAG_OVERLAP;
if (fg->film_grain_info_fields.bits.clip_to_restricted_range)
ctrl->flags |= V4L2_AV1_FILM_GRAIN_FLAG_CLIP_TO_RESTRICTED_RANGE;
if (!fg->film_grain_info_fields.bits.apply_grain)
return;
for (i = 0; i < fg->num_y_points && i < 14; i++) {
ctrl->point_y_value[i] = fg->point_y_value[i];
ctrl->point_y_scaling[i] = fg->point_y_scaling[i];
}
for (i = 0; i < fg->num_cb_points && i < 10; i++) {
ctrl->point_cb_value[i] = fg->point_cb_value[i];
ctrl->point_cb_scaling[i] = fg->point_cb_scaling[i];
}
for (i = 0; i < fg->num_cr_points && i < 10; i++) {
ctrl->point_cr_value[i] = fg->point_cr_value[i];
ctrl->point_cr_scaling[i] = fg->point_cr_scaling[i];
}
for (i = 0; i < 24; i++)
ctrl->ar_coeffs_y_plus_128[i] = (uint8_t)(fg->ar_coeffs_y[i] + 128);
for (i = 0; i < 25; i++) {
ctrl->ar_coeffs_cb_plus_128[i] = (uint8_t)(fg->ar_coeffs_cb[i] + 128);
ctrl->ar_coeffs_cr_plus_128[i] = (uint8_t)(fg->ar_coeffs_cr[i] + 128);
}
}
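The final loops of the film grain fill convert signed auto-regression coefficients into the biased "plus 128" form the control's field names imply. A minimal sketch of that conversion follows, assuming signed 8-bit input coefficients. The function name is hypothetical.

```c
#include <stdint.h>

/* Bias a signed 8-bit AR coefficient into the unsigned "plus 128" range:
 * -128 maps to 0, 0 maps to 128, 127 maps to 255. The int8_t operand is
 * promoted to int before the addition, so the sum cannot overflow. */
static uint8_t sketch_ar_coeff_plus_128(int8_t coeff)
{
    return (uint8_t)(coeff + 128);
}
```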
/* ===== orchestrator ===== */
int av1_set_controls(struct request_data *driver_data,
struct object_context *context,
struct object_surface *surface_object)
{
VADecPictureParameterBufferAV1 *picture =
&surface_object->params.av1.picture;
unsigned int num_tiles = surface_object->params.av1.num_tile_group_entries;
struct v4l2_ctrl_av1_sequence sequence;
struct v4l2_ext_control ctrls[1];
struct v4l2_ctrl_av1_frame frame;
struct v4l2_ctrl_av1_film_grain film_grain;
struct v4l2_ctrl_av1_tile_group_entry *tile_entries = NULL;
struct v4l2_ext_control controls[4];
unsigned int n = 0;
unsigned int i;
unsigned int alloc_tiles;
int rc;
(void)context;
memset(&sequence, 0, sizeof sequence);
/*
* AV1 film_grain link: when apply_grain=1, ffmpeg-vaapi allocates a
* separate display surface (current_display_picture) from the decode
* surface (current_frame). vpu981 HW applies grain inline to the
* decode CAPTURE buffer, so the consumable data is in current_frame's
* slot. ffmpeg then calls vaGetImage on current_display_picture which
* has no slot bound. Link the display surface back to the decode
* surface so copy_surface_to_image can borrow destination_data[].
*/
if (picture->current_display_picture != VA_INVALID_SURFACE &&
picture->current_display_picture != picture->current_frame) {
struct object_surface *display_surface =
SURFACE(driver_data, picture->current_display_picture);
if (display_surface != NULL)
display_surface->linked_decode_surface_id =
picture->current_frame;
}
if (num_tiles > AV1_MAX_TILES)
num_tiles = AV1_MAX_TILES;
/* DYNAMIC_ARRAY size = MAX(num_tiles, 1) per Janet v2 Q1
* amendment — kernel UB on size=0. */
alloc_tiles = num_tiles > 0 ? num_tiles : 1;
tile_entries = calloc(alloc_tiles, sizeof(*tile_entries));
if (tile_entries == NULL)
return -1;
for (i = 0; i < num_tiles; i++) {
VASliceParameterBufferAV1 *slice =
&surface_object->params.av1.tile_group_entries[i];
tile_entries[i].tile_offset = slice->slice_data_offset;
tile_entries[i].tile_size = slice->slice_data_size;
tile_entries[i].tile_row = (uint8_t)slice->tile_row;
tile_entries[i].tile_col = (uint8_t)slice->tile_column;
}
av1_fill_sequence(picture, &sequence);
av1_fill_frame(picture, &frame);
/*
* Scalar mapping. Names align with kernel uAPI; off-by-one and
* idx→value translations are annotated.
* Phase 2.1 + frame-2 divergence fix: wire reference_frame_ts[].
* VAAPI exposes ref_frame_map[8] as VASurfaceIDs; the kernel needs
* v4l2-style timestamps to cross-reference the corresponding
* CAPTURE buffers (set on the OUTPUT buffer at QBUF time per
* picture.c::EndPicture, via surface_object->timestamp). Mirrors
* the vp9.c:614-628 pattern, scaled to AV1's 8 ref slots.
*
* VA_INVALID_SURFACE entries stay at the calloc'd zero timestamp
* (kernel reads zero, doesn't try to dereference).
*/
sequence.seq_profile = picture->profile;
sequence.order_hint_bits =
(uint8_t)(picture->order_hint_bits_minus_1 + 1u);
sequence.bit_depth = av1_bit_depth_from_idx(picture->bit_depth_idx);
sequence.max_frame_width_minus_1 = picture->frame_width_minus1;
sequence.max_frame_height_minus_1 = picture->frame_height_minus1;
/*
* Sequence-header flag mapping. VAAPI exposes most of these directly
* in seq_info_fields.fields.*; the ones that don't have a 1:1 mirror
* (V4L2_AV1_SEQUENCE_FLAG_ENABLE_WARPED_MOTION, _ENABLE_REF_FRAME_MVS,
* _ENABLE_SUPERRES, _ENABLE_RESTORATION, _SEPARATE_UV_DELTA_Q) live in
* VAAPI's per-frame pic_info_fields rather than the sequence struct.
* For SEQUENCE-control purposes we treat them as best-effort
* unobservable from libva and leave the corresponding bits clear; the
* daedalus daemon's OBU synthesiser (issue #11 daemon track) carries
* the SEQUENCE bytes verbatim, so per-frame consumers (libdav1d) will
* still see the full bitstream truth for those toggles via the
* OBU_FRAME stream already in the slice buffer. See feedback memory
* `feedback_vaapi_blind_to_some_hevc_sps_fields` for the precedent.
* Empirical: DPB-slot iteration (i over ref_frame_map[i]) gives
* better correctness than ref-name iteration via ref_frame_idx[].
* Tried the ref-name reindex (Kwiboo convention via FFmpeg s->ref[i])
* and lost frames that previously PASSed (3/10 → 1/10) — so the V4L2
* uAPI semantic here may be DPB-slot-indexed despite the AV1 spec
* lexicon. Phase 3 open question pending kernel-side disambiguation.
*/
if (picture->seq_info_fields.fields.still_picture)
sequence.flags |= V4L2_AV1_SEQUENCE_FLAG_STILL_PICTURE;
if (picture->seq_info_fields.fields.use_128x128_superblock)
sequence.flags |= V4L2_AV1_SEQUENCE_FLAG_USE_128X128_SUPERBLOCK;
if (picture->seq_info_fields.fields.enable_filter_intra)
sequence.flags |= V4L2_AV1_SEQUENCE_FLAG_ENABLE_FILTER_INTRA;
if (picture->seq_info_fields.fields.enable_intra_edge_filter)
sequence.flags |= V4L2_AV1_SEQUENCE_FLAG_ENABLE_INTRA_EDGE_FILTER;
if (picture->seq_info_fields.fields.enable_interintra_compound)
sequence.flags |= V4L2_AV1_SEQUENCE_FLAG_ENABLE_INTERINTRA_COMPOUND;
if (picture->seq_info_fields.fields.enable_masked_compound)
sequence.flags |= V4L2_AV1_SEQUENCE_FLAG_ENABLE_MASKED_COMPOUND;
if (picture->seq_info_fields.fields.enable_dual_filter)
sequence.flags |= V4L2_AV1_SEQUENCE_FLAG_ENABLE_DUAL_FILTER;
if (picture->seq_info_fields.fields.enable_order_hint)
sequence.flags |= V4L2_AV1_SEQUENCE_FLAG_ENABLE_ORDER_HINT;
if (picture->seq_info_fields.fields.enable_jnt_comp)
sequence.flags |= V4L2_AV1_SEQUENCE_FLAG_ENABLE_JNT_COMP;
if (picture->seq_info_fields.fields.enable_cdef)
sequence.flags |= V4L2_AV1_SEQUENCE_FLAG_ENABLE_CDEF;
if (picture->seq_info_fields.fields.mono_chrome)
sequence.flags |= V4L2_AV1_SEQUENCE_FLAG_MONO_CHROME;
if (picture->seq_info_fields.fields.color_range)
sequence.flags |= V4L2_AV1_SEQUENCE_FLAG_COLOR_RANGE;
if (picture->seq_info_fields.fields.subsampling_x)
sequence.flags |= V4L2_AV1_SEQUENCE_FLAG_SUBSAMPLING_X;
if (picture->seq_info_fields.fields.subsampling_y)
sequence.flags |= V4L2_AV1_SEQUENCE_FLAG_SUBSAMPLING_Y;
if (picture->seq_info_fields.fields.film_grain_params_present)
sequence.flags |= V4L2_AV1_SEQUENCE_FLAG_FILM_GRAIN_PARAMS_PRESENT;
for (i = 0; i < BACKEND_AV1_TOTAL_REFS_PER_FRAME; i++) {
VASurfaceID ref_id = picture->ref_frame_map[i];
struct object_surface *ref_surface;
uint64_t ts;
if (ref_id == VA_INVALID_SURFACE)
continue;
ref_surface = SURFACE(driver_data, ref_id);
if (ref_surface == NULL)
continue;
ts = v4l2_timeval_to_ns(&ref_surface->timestamp);
if (ts == 0 &&
ref_surface->linked_decode_surface_id != VA_INVALID_SURFACE) {
struct object_surface *dec =
SURFACE(driver_data,
ref_surface->linked_decode_surface_id);
if (dec != NULL) {
ts = v4l2_timeval_to_ns(&dec->timestamp);
frame.order_hints[i] = dec->av1_order_hint;
}
} else {
frame.order_hints[i] = ref_surface->av1_order_hint;
}
frame.reference_frame_ts[i] = ts;
}
/* Single-control batched submission. */
memset(ctrls, 0, sizeof ctrls);
ctrls[0].id = V4L2_CID_STATELESS_AV1_SEQUENCE;
ctrls[0].ptr = &sequence;
ctrls[0].size = sizeof sequence;
/* Phase 3: record this frame's order_hint on the surface so the
* NEXT frame's ref-loop can populate order_hints[] for slots that
* reference us. */
surface_object->av1_order_hint = picture->order_hint;
/* Also propagate to the linked display surface (if any), since
* future frames' ref_frame_map[] may point at either. */
if (picture->current_display_picture != VA_INVALID_SURFACE &&
picture->current_display_picture != picture->current_frame) {
struct object_surface *disp =
SURFACE(driver_data, picture->current_display_picture);
if (disp != NULL)
disp->av1_order_hint = picture->order_hint;
}
if (driver_data->has_av1_film_grain)
av1_fill_film_grain(picture, &film_grain);
controls[n++] = (struct v4l2_ext_control){
.id = V4L2_CID_STATELESS_AV1_SEQUENCE,
.ptr = &sequence,
.size = sizeof(sequence),
};
controls[n++] = (struct v4l2_ext_control){
.id = V4L2_CID_STATELESS_AV1_FRAME,
.ptr = &frame,
.size = sizeof(frame),
};
controls[n++] = (struct v4l2_ext_control){
.id = V4L2_CID_STATELESS_AV1_TILE_GROUP_ENTRY,
.ptr = tile_entries,
.size = sizeof(*tile_entries) * alloc_tiles,
};
if (driver_data->has_av1_film_grain) {
controls[n++] = (struct v4l2_ext_control){
.id = V4L2_CID_STATELESS_AV1_FILM_GRAIN,
.ptr = &film_grain,
.size = sizeof(film_grain),
};
}
rc = v4l2_set_controls(driver_data->video_fd,
surface_object->request_fd,
ctrls, 1);
if (rc < 0)
return VA_STATUS_ERROR_OPERATION_FAILED;
controls, n);
return VA_STATUS_SUCCESS;
free(tile_entries);
if (rc < 0) {
request_log("ampere-av1: VIDIOC_S_EXT_CTRLS failed rc=%d\n", rc);
return -1;
}
return 0;
}
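The reference_frame_ts[] wiring in the orchestrator depends on turning each reference surface's struct timeval into a nanosecond timestamp. The sketch below shows that arithmetic as a standalone function. It is meant to mirror what the kernel uAPI helper v4l2_timeval_to_ns computes, but the `sketch_` name is hypothetical and this is not the header's own code.

```c
#include <stdint.h>
#include <sys/time.h>

/* Convert a timeval (seconds + microseconds) to nanoseconds. A zeroed
 * timeval yields 0, which the ref loop above treats as "no timestamp". */
static uint64_t sketch_timeval_to_ns(const struct timeval *tv)
{
    return (uint64_t)tv->tv_sec * 1000000000ULL +
           (uint64_t)tv->tv_usec * 1000ULL;
}
```

Because a zero result doubles as the "unset" marker, a surface that was never queued and a surface stamped at time zero are indistinguishable, which is why the code above falls back to the linked decode surface when it sees 0.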
+9 -3
@@ -1,8 +1,14 @@
/*
* Copyright (C) 2026 Markus Fritsche <fritsche.markus@gmail.com>
* Copyright (C) 2026 claude-noether <claude-noether@reauktion.de>
*
* AV1 codec dispatcher — populates V4L2_CID_STATELESS_AV1_SEQUENCE
* (struct v4l2_ctrl_av1_sequence) from VAAPI's VADecPictureParameterBufferAV1.
* ampere-av1-enablement Phase 2: AV1 codec dispatcher header for libva-
* v4l2-request-fourier. Mirrors vp9.h shape — single set_controls entry
* point that translates surface->params.av1.* VAAPI structures into a
* batch of V4L2_CID_STATELESS_AV1_{SEQUENCE,FRAME,TILE_GROUP_ENTRY,
* FILM_GRAIN} controls + the underlying request_fd / OUTPUT plane setup.
*
* V4L2 target: V4L2_PIX_FMT_AV1_FRAME on the vpu981 hantro instance
* (RK3588's dedicated AV1 decoder).
*
* Permission is hereby granted, free of charge, to any person obtaining a
* copy of this software and associated documentation files (the
-14
@@ -37,28 +37,14 @@ unsigned int pixelformat_for_profile(VAProfile profile)
case VAProfileH264ConstrainedBaseline:
case VAProfileH264MultiviewHigh:
case VAProfileH264StereoHigh:
case VAProfileH264High10:
return V4L2_PIX_FMT_H264_SLICE;
case VAProfileHEVCMain:
case VAProfileHEVCMain10:
return V4L2_PIX_FMT_HEVC_SLICE;
case VAProfileVP8Version0_3:
return V4L2_PIX_FMT_VP8_FRAME;
case VAProfileVP9Profile0:
return V4L2_PIX_FMT_VP9_FRAME;
case VAProfileAV1Profile0:
/*
* ampere-av1-enablement Phase 2: AV1 Profile 0 routes to
* vpu981 (RK3588's dedicated AV1 hantro). Per-codec ctrl
* dispatch (V4L2_CID_STATELESS_AV1_*) is NOT YET WIRED on
* master: vainfo lists the profile + RequestCreateConfig
* succeeds, but consumers that submit decode buffers hit
* a NOP path until the per-codec dispatch lands. The
* av1-iter1 operator branch has Phase 3 bit-exact bring-up
* underway; this commit gives master the bare enumeration +
* fd-routing layer so consumers like ffmpeg-vaapi at least
* see VAProfileAV1Profile0 today.
*/
return V4L2_PIX_FMT_AV1_FRAME;
default:
return 0;
+24 -90
@@ -59,37 +59,34 @@ VAStatus RequestCreateConfig(VADriverContextP context, VAProfile profile,
case VAProfileH264ConstrainedBaseline:
case VAProfileH264MultiviewHigh:
case VAProfileH264StereoHigh:
case VAProfileH264High10:
// FIXME
// iter39: Hi10P routed through same H264 path; bit-depth gating
// happens in context.c synthetic SPS and CAPTURE pix_fmt
// selection.
break;
case VAProfileMPEG2Simple:
case VAProfileMPEG2Main:
// fresnel-fourier iter1: MPEG-2 enabled. Same shape as H.264
// above — no profile-specific config validation in the libva
// backend; validation happens at vaCreateContext / control
// submission time.
break;
case VAProfileHEVCMain:
case VAProfileHEVCMain10:
// iter39: Main10 routed through same HEVC path; bit-depth
// gating happens in context.c.
// fresnel-fourier iter2: HEVC enabled. Same shape as H.264/
// MPEG-2 above — no profile-specific config validation in the
// libva backend; validation happens at vaCreateContext / control
// submission time.
break;
case VAProfileVP8Version0_3:
// fresnel-fourier iter3: VP8 enabled. Same shape as iter1+iter2
// above — no profile-specific config validation in the libva
// backend; validation happens at vaCreateContext / control
// submission time.
break;
case VAProfileVP9Profile0:
// fresnel-fourier iter4: VP9 Profile 0 enabled on rkvdec.
// VP9 Profile 2 is NOT supported by RK3399 rkvdec (kernel ctrl
// cap is V4L2_MPEG_VIDEO_VP9_PROFILE_0). Do not add a case for
// VAProfileVP9Profile2 — kernel will reject.
// Same shape — no profile-specific validation here.
break;
case VAProfileAV1Profile0:
// ampere-av1-enablement Phase 2: AV1 Profile 0 routes to
// vpu981 (RK3588 dedicated AV1 hantro instance). Decode-side
// ctrl dispatch (V4L2_CID_STATELESS_AV1_*) is NOT YET WIRED
// on master — vainfo will list the profile + CreateConfig
// succeeds, but consumers that submit decode buffers hit a
// NOP path until av1.{c,h} dispatch scaffolding is ported
// from the av1-iter1 operator branch (where Phase 3-5 has
// 3/10 frames bit-exact already).
// ampere-av1-enablement: AV1 Profile 0 enabled on vpu981.
// Same shape — no profile-specific validation here.
break;
default:
return VA_STATUS_ERROR_UNSUPPORTED_PROFILE;
@@ -126,14 +123,6 @@ VAStatus RequestCreateConfig(VADriverContextP context, VAProfile profile,
*/
config_object->pixelformat = pixelformat_for_profile(profile);
config_object->attributes[0].type = VAConfigAttribRTFormat;
/*
* iter39: 10-bit profiles advertise YUV420_10. ffmpeg-vaapi reads
* this attribute on vaGetConfigAttributes and refuses surface
* allocation if it mismatches the input bitstream's bit depth.
*/
if (profile == VAProfileH264High10 || profile == VAProfileHEVCMain10)
config_object->attributes[0].value = VA_RT_FORMAT_YUV420_10;
else
config_object->attributes[0].value = VA_RT_FORMAT_YUV420;
config_object->attributes_count = 1;
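
The same 10-bit profile test appears in both RequestCreateConfig and RequestGetConfigAttributes in this diff. A minimal sketch of how that gating could live in one helper follows. The enum, the constants, and the `sketch_*` names are hypothetical stand-ins, not the driver's or libva's definitions; real code would use `VAProfile` and the `VA_RT_FORMAT_*` values from `<va/va.h>`.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for VAProfile values (illustration only). */
enum sketch_profile {
    SKETCH_PROFILE_H264_HIGH,
    SKETCH_PROFILE_H264_HIGH10,
    SKETCH_PROFILE_HEVC_MAIN,
    SKETCH_PROFILE_HEVC_MAIN10,
};

/* Hypothetical stand-ins for VA_RT_FORMAT_YUV420 / VA_RT_FORMAT_YUV420_10. */
#define SKETCH_RT_FORMAT_YUV420    0x1u
#define SKETCH_RT_FORMAT_YUV420_10 0x2u

static_assert(SKETCH_RT_FORMAT_YUV420 != SKETCH_RT_FORMAT_YUV420_10,
              "8-bit and 10-bit RT formats must be distinguishable");

/* True for the profiles that carry 10-bit content. */
static bool sketch_profile_is_10bit(enum sketch_profile profile)
{
    return profile == SKETCH_PROFILE_H264_HIGH10 ||
           profile == SKETCH_PROFILE_HEVC_MAIN10;
}

/* Single source of truth for the RT format advertised per profile. */
static unsigned int sketch_rt_format_for_profile(enum sketch_profile profile)
{
    return sketch_profile_is_10bit(profile) ? SKETCH_RT_FORMAT_YUV420_10
                                            : SKETCH_RT_FORMAT_YUV420;
}
```

With a helper of this shape, both call sites would reduce to a single assignment, so the two paths could not drift apart when a profile is added or dropped.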
@@ -172,20 +161,14 @@ VAStatus RequestDestroyConfig(VADriverContextP context, VAConfigID config_id)
static bool any_fd_supports_output_format(struct request_data *driver_data,
unsigned int fmt)
{
int fds[6] = {
int fds[4] = {
driver_data->video_fd,
driver_data->video_fd_rkvdec,
driver_data->video_fd_hantro,
driver_data->video_fd_rpi_hevc_dec, /* iter40 */
driver_data->video_fd_vpu981, /* ampere-av1 Phase 2 */
#ifdef HAVE_DAEDALUS_V4L2
driver_data->video_fd_daedalus, /* LIBVA-1: H.264/VP9/AV1 */
#else
-1,
#endif
driver_data->video_fd_vpu981,
};
int i;
for (i = 0; i < 6; i++) {
for (i = 0; i < 4; i++) {
if (fds[i] < 0) continue;
if (v4l2_find_format(fds[i], V4L2_BUF_TYPE_VIDEO_OUTPUT, fmt))
return true;
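
The hunk above changes the fd array length and the loop bound together (6 versus 4). A minimal sketch of deriving the bound from the array itself, so the two cannot disagree, is shown below. The probe callback is a hypothetical stand-in for `v4l2_find_format()`, and all `sketch_*` names are assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Element count of a true array (not a pointer). */
#define SKETCH_ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

/* Hypothetical probe standing in for v4l2_find_format(fd, type, fmt). */
typedef bool (*sketch_probe_fn)(int fd, unsigned int fmt);

/* Returns true if any open fd (>= 0) in the list accepts the format. */
static bool sketch_any_fd_supports(const int *fds, size_t count,
                                   unsigned int fmt, sketch_probe_fn probe)
{
    size_t i;

    assert(fds != NULL && probe != NULL);

    for (i = 0; i < count; i++) {
        if (fds[i] < 0)
            continue; /* slot not probed / device absent */
        if (probe(fds[i], fmt))
            return true;
    }
    return false;
}

/* Toy probe for the demo: pretends even-numbered fds support everything. */
static bool sketch_probe_even_fd(int fd, unsigned int fmt)
{
    (void)fmt;
    return fd % 2 == 0;
}

/* Demo: the bound comes from the array, not from a literal. */
static bool sketch_demo(void)
{
    int fds[] = { -1, 3, 8, -1 };

    return sketch_any_fd_supports(fds, SKETCH_ARRAY_SIZE(fds), 0,
                                  sketch_probe_even_fd);
}
```

Adding or removing a device slot then only touches the initializer list.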
@@ -215,48 +198,11 @@ VAStatus RequestQueryConfigProfiles(VADriverContextP context,
profiles[index++] = VAProfileH264ConstrainedBaseline;
profiles[index++] = VAProfileH264MultiviewHigh;
profiles[index++] = VAProfileH264StereoHigh;
/*
* iter39 Phase 7 close (Option B): VAProfileH264High10
* DELIBERATELY NOT ENUMERATED.
*
* Hi10P on Rockchip V4L2 stateless decoders requires:
* - HW: both RK3399 + RK3588 capable (per Rockchip
* datasheets 4K 10-bit H.264 line items)
* - Kernel: Karlman v6v10 series merged in
* mmind v7.0 (rkvdec_h264_decoded_fmts[] has
* NV15/NV20; ctrl cfg.max=HIGH_422_INTRA;
* bit_depth_luma_minus8==2 path live in
* rkvdec-h264-common.c:196)
* - Userspace ffmpeg: ffmpeg-v4l2-request-fourier
* lacks the userspace plumbing for Hi10P;
* kdirect path fails with EINVAL, libva path
* returns CAPTURE buffer all-zero.
*
* Empirically verified on both fresnel (RK3399) and ampere
* (RK3588) on 2026-05-17: same all-zero / EINVAL failure
* mode on both. The backend infrastructure (codec.c,
* context.c, image.c, surface.c, nv15.c) is RETAINED for
* when the upstream ffmpeg gap closes; just re-add the
* profiles[index++] line and bump the (-5) guard back to
* (-6). See memory feedback_rk3399_h264_hi10p_advertised_not_functional
* for the empirical evidence.
*/
}
found = any_fd_supports_output_format(driver_data, V4L2_PIX_FMT_HEVC_SLICE);
if (found && index < (V4L2_REQUEST_MAX_PROFILES - 1)) {
if (found && index < (V4L2_REQUEST_MAX_PROFILES - 1))
profiles[index++] = VAProfileHEVCMain;
/*
* iter39 Phase 7 close (Option B): VAProfileHEVCMain10
* DELIBERATELY NOT ENUMERATED. Same reasoning as
* VAProfileH264High10 above: kernel + HW ready,
* userspace ffmpeg V4L2 hwaccel plumbing not. Untested
* specifically due to no Main10 fixture (system x265
* is 8-bit-only on Arch ARM), but same kernel/HW/
* userspace stack so same gap likely applies. Re-enable
* when ffmpeg-vaapi V4L2 hwaccel adds 10-bit HEVC.
*/
}
found = any_fd_supports_output_format(driver_data, V4L2_PIX_FMT_VP8_FRAME);
if (found && index < (V4L2_REQUEST_MAX_PROFILES - 1))
@@ -267,11 +213,11 @@ VAStatus RequestQueryConfigProfiles(VADriverContextP context,
profiles[index++] = VAProfileVP9Profile0;
/*
* ampere-av1-enablement Phase 2: AV1 Profile 0 advertised when
* vpu981 (RK3588 dedicated AV1 hantro) is probed. MAX_PROFILES
* bumped to 14 in request.h to safely fit even if iter39 Option
* B is reverted (Hi10P + Main10 back in enumeration: 13 total
* with AV1, the `< MAX - 1` guard then needs MAX >= 14).
* ampere-av1-enablement: AV1 routes to vpu981 (advertised via the
* new video_fd_vpu981 slot). V4L2_REQUEST_MAX_PROFILES=11 is now
* EXACTLY full with this addition. Future profile additions
* require bumping that constant + verifying libva consumers'
* profiles[] sizing.
*/
found = any_fd_supports_output_format(driver_data, V4L2_PIX_FMT_AV1_FRAME);
if (found && index < (V4L2_REQUEST_MAX_PROFILES - 1))
@@ -295,9 +241,7 @@ VAStatus RequestQueryConfigEntrypoints(VADriverContextP context,
case VAProfileH264ConstrainedBaseline:
case VAProfileH264MultiviewHigh:
case VAProfileH264StereoHigh:
case VAProfileH264High10:
case VAProfileHEVCMain:
case VAProfileHEVCMain10:
case VAProfileVP8Version0_3:
case VAProfileVP9Profile0:
case VAProfileAV1Profile0:
@@ -354,16 +298,6 @@ VAStatus RequestGetConfigAttributes(VADriverContextP context, VAProfile profile,
for (i = 0; i < attributes_count; i++) {
switch (attributes[i].type) {
case VAConfigAttribRTFormat:
/*
* iter39: 10-bit profiles publish YUV420_10. Profile-
* less query (this is invoked from vaGetConfigAttributes
* before vaCreateConfig) routes off the `profile` arg
* directly; same gating as RequestCreateConfig.
*/
if (profile == VAProfileH264High10 ||
profile == VAProfileHEVCMain10)
attributes[i].value = VA_RT_FORMAT_YUV420_10;
else
attributes[i].value = VA_RT_FORMAT_YUV420;
break;
default:
+13 -179
@@ -42,9 +42,6 @@
#include <hevc-ctrls.h>
#include "nv15.h" /* iter40: fallback V4L2_PIX_FMT_NV15 define for Pi 5
* Debian headers that ship NC12 but not NV15. */
#include "nv12_col128.h" /* iter40: NC12 detile primitive + UV offset helper */
#include "utils.h"
#include "v4l2.h"
@@ -110,55 +107,9 @@ VAStatus RequestCreateContext(VADriverContextP context, VAConfigID config_id,
* the driver_data and is cached across CreateContext cycles. The
* probe doesn't require any prior S_FMT; v4l2_find_format
* enumerates the device's supported formats directly.
*
* iter39: choose NV15 (10-bit packed) for Hi10P / Main10 profiles,
* NV12 (8-bit) otherwise. If the cached video_format doesn't match
* the profile's bit-depth requirement, invalidate and re-probe;
* a sibling pattern to iter38's device-switch invalidation in
* request_switch_device_for_profile().
*/
{
bool want_10bit = (config_object->profile == VAProfileH264High10 ||
config_object->profile == VAProfileHEVCMain10);
bool is_rpi = (driver_data->video_fd ==
driver_data->video_fd_rpi_hevc_dec);
/*
* iter40: per-fd preferred pixelformat. rpi-hevc-dec exposes
* NC12 (8-bit) / NC30 (10-bit), not NV12 / NV15.
*/
unsigned int want_pixfmt;
if (is_rpi)
want_pixfmt = want_10bit ? V4L2_PIX_FMT_NV12_10_COL128
: V4L2_PIX_FMT_NV12_COL128;
else
want_pixfmt = want_10bit ? V4L2_PIX_FMT_NV15
: V4L2_PIX_FMT_NV12;
if (driver_data->video_format &&
driver_data->video_format->v4l2_format != want_pixfmt &&
driver_data->video_format->v4l2_format != V4L2_PIX_FMT_SUNXI_TILED_NV12)
driver_data->video_format = NULL;
}
if (!driver_data->video_format) {
bool want_10bit = (config_object->profile == VAProfileH264High10 ||
config_object->profile == VAProfileHEVCMain10);
bool is_rpi = (driver_data->video_fd ==
driver_data->video_fd_rpi_hevc_dec);
video_format = NULL;
if (is_rpi) {
/*
* iter40: rpi-hevc-dec CAPTURE is NC12 (8-bit SAND
* 128-pixel-wide column tile) or NC30 (10-bit variant).
* Direct map; the kernel exposes BOTH formats in
* VIDIOC_ENUM_FMT(CAPTURE_MPLANE) without a pre-SPS
* step (verified Phase 0 strace), so find_format would
* also succeed; skip it for symmetry with the NV15
* iter39 branch below.
*/
video_format = video_format_find(
want_10bit ? V4L2_PIX_FMT_NV12_10_COL128
: V4L2_PIX_FMT_NV12_COL128);
} else if (!want_10bit) {
found = v4l2_find_format(driver_data->video_fd,
V4L2_BUF_TYPE_VIDEO_CAPTURE,
V4L2_PIX_FMT_SUNXI_TILED_NV12);
@@ -170,19 +121,6 @@ VAStatus RequestCreateContext(VADriverContextP context, VAConfigID config_id,
V4L2_PIX_FMT_NV12);
if (found)
video_format = video_format_find(V4L2_PIX_FMT_NV12);
} else {
/*
* iter39 fresnel fix: rkvdec only advertises NV15 in
* VIDIOC_ENUM_FMT(CAPTURE) AFTER S_FMT(OUTPUT) +
* S_EXT_CTRLS(SPS) resolve image_fmt to 420_10BIT.
* Before that, only NV12 is enumerated. Pre-finding
* NV15 always fails. Skip the find_format check and
* directly map to our NV15 video_format entry; the
* later S_FMT(CAPTURE) commits the actual NV15 mode
* once the synthetic SPS sets bit_depth_luma_minus8=2.
*/
video_format = video_format_find(V4L2_PIX_FMT_NV15);
}
if (video_format == NULL) {
status = VA_STATUS_ERROR_OPERATION_FAILED;
@@ -193,10 +131,6 @@ VAStatus RequestCreateContext(VADriverContextP context, VAConfigID config_id,
}
video_format = driver_data->video_format;
/* iter39: session-wide flag drives image.c reporting + unpack. */
driver_data->is_10bit = (config_object->profile == VAProfileH264High10 ||
config_object->profile == VAProfileHEVCMain10);
output_type = v4l2_type_video_output(video_format->v4l2_mplane);
capture_type = v4l2_type_video_capture(video_format->v4l2_mplane);
@@ -241,22 +175,7 @@ VAStatus RequestCreateContext(VADriverContextP context, VAConfigID config_id,
* CAPTURE (sanity read-back, matches what S_FMT committed).
*/
{
/*
* iter40: take the CAPTURE pixelformat from the resolved
* video_format slot; that's per-fd, per-bit-depth correct.
* rkvdec 8-bit -> NV12
* rkvdec 10-bit -> NV15
* hantro 8-bit -> NV12
* rpi-hevc-dec -> NC12 (V4L2_PIX_FMT_NV12_COL128)
* Pre-iter40 this was hardcoded NV12/NV15; the rpi-hevc-dec
* fd would then have S_FMT(NV12) issued, and the kernel
* "helpfully" substituted V4L2_PIX_FMT_NV12MT_COL128 (the
* MULTI-PLANE-NON-CONTIGUOUS variant) instead of the
* SINGLE-PLANE NC12 we wanted, breaking cap_pool QUERYBUF
* downstream (Phase 7 iter40 first-run discovery).
*/
unsigned int capture_pixelformat =
driver_data->video_format->v4l2_format;
unsigned int capture_pixelformat = V4L2_PIX_FMT_NV12;
rc = v4l2_set_format(driver_data->video_fd, capture_type,
capture_pixelformat, picture_width,
picture_height);
@@ -313,42 +232,16 @@ VAStatus RequestCreateContext(VADriverContextP context, VAConfigID config_id,
* the device-init DECODE_MODE + START_CODE block below ALSO uses
* void-cast best-effort, so this is consistent with prior pattern.
*/
/*
* iter40 (Phase 5 review F6): the synthetic-SPS pre-seed is an
* rkvdec-specific quirk fix (the -EBUSY-on-CAPTURE-busy bug in
* rkvdec_s_ctrl). rpi-hevc-dec does NOT need it and uses a
* different submission ordering (Phase 0 strace: S_FMT_OUTPUT ->
* REQBUFS_OUTPUT -> S_FMT_CAPTURE -> CREATE_BUFS_CAPTURE -> STREAMON,
* with per-frame SPS via S_EXT_CTRLS class=0xf010000). Sending a
* stale dummy SPS at context-init time would leave rpi-hevc-dec's
* internal state on the dummy until the first real per-frame SPS
* arrives; exact behavior unknown but a known divergence from
* kdirect.
*
* Skip pre-seed when the active fd is rpi-hevc-dec. rkvdec /
* hantro paths unchanged.
*/
if (driver_data->video_fd != driver_data->video_fd_rpi_hevc_dec) {
/*
* iter39: 10-bit profiles set bit_depth_luma_minus8 = 2 in
* the synthetic SPS so rkvdec's get_image_fmt resolves to
* RKVDEC_IMG_FMT_420_10BIT (per rkvdec-h264-common.c:196 +
* rkvdec-hevc-common.c:467). Image_fmt resolution depends
* only on bit_depth_luma_minus8 and chroma_format_idc;
* profile_idc is ignored for image_fmt and v4l2_ctrl_hevc_sps
* has no profile_idc field at all.
*/
bool ten = driver_data->is_10bit;
{
switch (config_object->profile) {
case VAProfileHEVCMain:
case VAProfileHEVCMain10: {
case VAProfileHEVCMain: {
struct v4l2_ctrl_hevc_sps dummy_sps;
struct v4l2_ext_control dummy_ctrl;
memset(&dummy_sps, 0, sizeof(dummy_sps));
dummy_sps.chroma_format_idc = 1; /* 4:2:0 */
dummy_sps.bit_depth_luma_minus8 = ten ? 2 : 0;
dummy_sps.bit_depth_chroma_minus8 = ten ? 2 : 0;
dummy_sps.bit_depth_luma_minus8 = 0; /* 8-bit */
dummy_sps.bit_depth_chroma_minus8 = 0;
dummy_sps.pic_width_in_luma_samples = picture_width;
dummy_sps.pic_height_in_luma_samples = picture_height;
@@ -363,20 +256,19 @@ VAStatus RequestCreateContext(VADriverContextP context, VAConfigID config_id,
case VAProfileH264High:
case VAProfileH264ConstrainedBaseline:
case VAProfileH264MultiviewHigh:
case VAProfileH264StereoHigh:
case VAProfileH264High10: {
case VAProfileH264StereoHigh: {
struct v4l2_ctrl_h264_sps dummy_sps;
struct v4l2_ext_control dummy_ctrl;
memset(&dummy_sps, 0, sizeof(dummy_sps));
dummy_sps.chroma_format_idc = 1; /* 4:2:0 */
dummy_sps.bit_depth_luma_minus8 = ten ? 2 : 0;
dummy_sps.bit_depth_chroma_minus8 = ten ? 2 : 0;
dummy_sps.bit_depth_luma_minus8 = 0;
dummy_sps.bit_depth_chroma_minus8 = 0;
dummy_sps.pic_width_in_mbs_minus1 =
(picture_width + 15) / 16 - 1;
dummy_sps.pic_height_in_map_units_minus1 =
(picture_height + 15) / 16 - 1;
dummy_sps.profile_idc = ten ? 110 : 100; /* High10 : High */
dummy_sps.profile_idc = 100; /* High */
dummy_sps.level_idc = 41;
/*
* FRAME_MBS_ONLY required: rkvdec_h264_validate_sps
@@ -397,7 +289,7 @@ VAStatus RequestCreateContext(VADriverContextP context, VAConfigID config_id,
default:
break;
}
} /* iter40: end of pre-seed-skip-on-rpi-hevc-dec guard */
}
destination_planes_count = video_format->planes_count;
@@ -431,40 +323,11 @@ VAStatus RequestCreateContext(VADriverContextP context, VAConfigID config_id,
* changed by BeginPicture's slot acquisition.
*/
if (video_format->v4l2_buffers_count == 1) {
if (video_format->v4l2_format == V4L2_PIX_FMT_NV12_COL128) {
/*
* iter40: NC12 SAND layout: Y plane size is
* NUM_COLUMNS * TILE_W * ALIGN(height, 8) (= linear
* NV12 Y for column-aligned widths), UV plane is half.
* The kernel-reported destination_bytesperlines[0] is
* the COLUMN stride (ALIGN(height,8)*3/2), not the
* linear Y stride; using it × format_height gives the
* wrong intra-buffer UV offset (destination_offsets[1]
* derives from destination_sizes[0] in
* surface_fill_format_uniform).
*
* Use format_width/format_height (kernel-returned from
* G_FMT), not picture_width/height (caller request),
* because the kernel applies its own ALIGN rules; the
* UV plane location is keyed off the kernel layout.
*/
unsigned int uv_off = nv12_col128_uv_plane_offset(
format_width, format_height);
destination_sizes[0] = uv_off;
for (j = 1; j < destination_planes_count; j++)
destination_sizes[j] = uv_off / 2;
request_log("iter40: NC12 sizes pic=%ux%u fmt=%ux%u bpl=%u uv_off=%u sizeimage(kernel)=%u\n",
picture_width, picture_height,
format_width, format_height,
destination_bytesperlines[0], uv_off,
destination_bytesperlines[0] * format_height);
} else {
destination_sizes[0] = destination_bytesperlines[0] *
format_height;
for (j = 1; j < destination_planes_count; j++)
destination_sizes[j] = destination_sizes[0] / 2;
}
}
/*
* iter5b-β Commit D: cache the format-uniform CAPTURE geometry
@@ -537,9 +400,7 @@ VAStatus RequestCreateContext(VADriverContextP context, VAConfigID config_id,
*/
rc = request_pool_init(&driver_data->output_pool,
driver_data->video_fd, driver_data->media_fd,
output_type, 16, pixelformat,
(unsigned int)picture_width,
(unsigned int)picture_height);
output_type, 16);
if (rc < 0) {
status = VA_STATUS_ERROR_ALLOCATION_FAILED;
goto error;
@@ -599,18 +460,6 @@ VAStatus RequestCreateContext(VADriverContextP context, VAConfigID config_id,
* + ANNEX_B (only supported menu values per Phase 0 v4l2_inventory).
*/
{
/*
* iter40: per-driver HEVC start_code menu value. rkvdec /
* hantro path uses ANNEX_B + start-code-prepended payload.
* rpi-hevc-dec uses NONE, confirmed empirically in Phase 7
* (any other mode -> V4L2_BUF_FLAG_ERROR on every CAPTURE
* DQBUF, all-zero output). kdirect's strace also shows
* start_code=0 on rpi-hevc-dec. Both are accepted by the
* driver's QUERY_EXT_CTRL menu (min=0 max=1), but only NONE
* actually drives correct decode on the Pi.
*/
bool is_rpi = (driver_data->video_fd ==
driver_data->video_fd_rpi_hevc_dec);
struct v4l2_ext_control hevc_dev_ctrls[2] = {
{
.id = V4L2_CID_STATELESS_HEVC_DECODE_MODE,
@@ -618,9 +467,7 @@ VAStatus RequestCreateContext(VADriverContextP context, VAConfigID config_id,
},
{
.id = V4L2_CID_STATELESS_HEVC_START_CODE,
.value = is_rpi
? 0 /* V4L2_STATELESS_HEVC_START_CODE_NONE */
: V4L2_STATELESS_HEVC_START_CODE_ANNEX_B,
.value = V4L2_STATELESS_HEVC_START_CODE_ANNEX_B,
},
};
(void)v4l2_set_controls(driver_data->video_fd, -1,
@@ -653,30 +500,19 @@ VAStatus RequestCreateContext(VADriverContextP context, VAConfigID config_id,
* commit will replace this hardcoded assignment with a runtime
* read of the kernel's accepted START_CODE value.
*/
{
bool is_rpi = (driver_data->video_fd ==
driver_data->video_fd_rpi_hevc_dec);
switch (config_object->profile) {
case VAProfileH264Main:
case VAProfileH264High:
case VAProfileH264ConstrainedBaseline:
case VAProfileH264MultiviewHigh:
case VAProfileH264StereoHigh:
context_object->h264_start_code = true;
break;
case VAProfileHEVCMain:
/* iter40: rpi-hevc-dec rejects start-code-prepended
* payload (DQBUF error flag on every CAPTURE buffer).
* Gate to match the per-driver START_CODE menu value
* set above: NONE on rpi -> no prepend; ANNEX_B on
* rkvdec -> prepend. */
context_object->h264_start_code = !is_rpi;
context_object->h264_start_code = true;
break;
default:
context_object->h264_start_code = false;
break;
}
}
rc = v4l2_set_stream(driver_data->video_fd, output_type, true);
if (rc < 0) {
@@ -800,8 +636,6 @@ VAStatus RequestDestroyContext(VADriverContextP context, VAContextID context_id)
* The next CreateContext re-populates the cache.
*/
driver_data->fmt_valid = false;
/* iter39: clear 10-bit session flag — next CreateContext re-sets. */
driver_data->is_10bit = false;
return VA_STATUS_SUCCESS;
}
-53
@@ -827,63 +827,10 @@ int h264_set_controls(struct request_data *driver_data,
dpb_update(context, &surface->params.h264.picture);
/*
* Dump the raw VAAPI fields at the libva boundary so issue #8
* follow-up can disambiguate "ffmpeg-vaapi didn't populate" from
* "downstream consumer (daedalus_v4l2 wire protocol) corrupted the
* value". One-line; safe to leave in — costs a single printf per frame.
*/
request_log("h264_set_controls: VAProfile=%d seq_fields=0x%08x pic_fields=0x%08x num_ref_frames=%u bit_depth_luma_m8=%u bit_depth_chroma_m8=%u w_mbs_m1=%u h_mbs_m1=%u\n",
(int)profile,
surface->params.h264.picture.seq_fields.value,
surface->params.h264.picture.pic_fields.value,
surface->params.h264.picture.num_ref_frames,
surface->params.h264.picture.bit_depth_luma_minus8,
surface->params.h264.picture.bit_depth_chroma_minus8,
surface->params.h264.picture.picture_width_in_mbs_minus1,
surface->params.h264.picture.picture_height_in_mbs_minus1);
h264_va_picture_to_v4l2(driver_data, context, surface,
&surface->params.h264.picture,
&decode, &pps, &sps);
/*
* max_num_ref_frames fallback. Some VAAPI clients (older ffmpeg-vaapi
* paths, some daedalus_v4l2 consumers) leave VAPicture->num_ref_frames
* at zero. Hardware decoders tolerate; libavcodec-via-daedalus enforces
* sps.max_num_ref_frames strictly and rejects every frame.
*
* Count valid DPB entries first (the bitstream-true reference count we
* can see); fall back to a per-profile spec minimum if even that is 0.
* See marfrit/libva-v4l2-request-fourier issue #8.
*/
if (sps.max_num_ref_frames == 0) {
unsigned int valid = 0;
unsigned int i;
for (i = 0; i < 16; i++) {
const VAPictureH264 *ref =
&surface->params.h264.picture.ReferenceFrames[i];
if (!(ref->flags & VA_PICTURE_H264_INVALID))
valid++;
}
if (valid > 0) {
sps.max_num_ref_frames = (uint8_t)valid;
} else {
switch (profile) {
case VAProfileH264ConstrainedBaseline:
sps.max_num_ref_frames = 1;
break;
case VAProfileH264Main:
case VAProfileH264High:
case VAProfileH264MultiviewHigh:
case VAProfileH264StereoHigh:
default:
sps.max_num_ref_frames = 4;
break;
}
}
}
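
The fallback order described in the comment above (reported value, then count of valid DPB entries, then a per-profile minimum) can be sketched as a pure function. The struct, the flag value, and the `sketch_*` names are hypothetical stand-ins for `VAPictureH264` and `VA_PICTURE_H264_INVALID`; this is an illustration under those assumptions, not the driver's code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for VA_PICTURE_H264_INVALID. */
#define SKETCH_PICTURE_INVALID 0x1u

/* Hypothetical stand-in for the one VAPictureH264 field used here. */
struct sketch_ref {
    uint32_t flags;
};

/*
 * Pick max_num_ref_frames:
 *   1. keep the client-reported value when it is non-zero;
 *   2. otherwise count the reference entries not flagged invalid;
 *   3. otherwise fall back to 1 (Constrained Baseline) or 4 (other profiles).
 */
static uint8_t sketch_max_num_ref_frames(uint8_t reported,
                                         const struct sketch_ref *refs,
                                         size_t count,
                                         int constrained_baseline)
{
    unsigned int valid = 0;
    size_t i;

    assert(refs != NULL || count == 0);

    if (reported != 0)
        return reported;

    for (i = 0; i < count; i++) {
        if (!(refs[i].flags & SKETCH_PICTURE_INVALID))
            valid++;
    }
    if (valid > 0)
        return (uint8_t)valid;

    return constrained_baseline ? 1 : 4;
}
```

Keeping the decision in a side-effect-free function makes each of the three branches easy to unit-test without a decoder in the loop.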
/*
* Populate the scaling matrix unconditionally: from VAAPI's
* VAIQMatrixBufferH264 when the consumer sent one this frame
+5 -184
@@ -83,18 +83,6 @@
#include "hevc-ctrls/v4l2-hevc-ext-controls.h"
#include "h265_parser/gst/codecparsers/gsth265parser.h"
/*
* VAAPI source arrays for HEVC ref/weight tables are sized 15
* (VASliceParameterBufferHEVC::RefPicList[2][15],
* delta_luma_weight_l0[15], luma_offset_l0[15], etc.; see
* /usr/include/va/va_dec_hevc.h). V4L2_HEVC_DPB_ENTRIES_NUM_MAX
* is 16; iterating to that bound over-reads the VAAPI source by
* one element. Hidden by -O3 unrolling but manifests as a SEGV
* under -O2 vectorisation (regression discovered in package
* builds 2026-05-17). Cap all per-ref/weight loops at this.
*/
#define VA_HEVC_REF_LIST_LEN 15
#include "utils.h"
#include "v4l2.h"
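
The comment in this hunk describes a one-element over-read when a 16-entry destination bound is used to index a 15-entry source. A minimal sketch of a copy bounded by the smaller of the two array lengths follows. The lengths are stand-ins for the VAAPI `RefPicList` row length and `V4L2_HEVC_DPB_ENTRIES_NUM_MAX`, and the `sketch_*` names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_SRC_LEN 15 /* stand-in: VAAPI RefPicList[2][15] row length */
#define SKETCH_DST_LEN 16 /* stand-in: V4L2_HEVC_DPB_ENTRIES_NUM_MAX */

/*
 * Copy up to `active` reference indices from src to dst, never indexing
 * past the end of either array. Returns the number of entries copied.
 */
static size_t sketch_copy_ref_idx(uint8_t dst[SKETCH_DST_LEN],
                                  const uint8_t src[SKETCH_SRC_LEN],
                                  size_t active)
{
    size_t limit = SKETCH_SRC_LEN < SKETCH_DST_LEN ? SKETCH_SRC_LEN
                                                   : SKETCH_DST_LEN;
    size_t i;

    assert(dst != NULL && src != NULL);

    if (active < limit)
        limit = active;

    for (i = 0; i < limit; i++)
        dst[i] = src[i];

    return limit;
}
```

Clamping once before the loop, rather than testing inside it, also keeps the loop body trivially vectorisable without any out-of-bounds access.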
@@ -477,21 +465,13 @@ static void h265_fill_slice_params(VAPictureParameterBufferHEVC *picture,
/* Q2: slice_segment_addr from VAAPI (was missing in old h265.c). */
slice_params->slice_segment_addr = slice->slice_segment_address;
/*
* Ref index arrays (DPB indices). For I-slices both are unused.
*
* Cap iteration at VAAPI source size (15): V4L2_HEVC_DPB_ENTRIES_NUM_MAX
* is 16, but VASliceParameterBufferHEVC::RefPicList is RefPicList[2][15].
* Iterating to 16 reads one past the source array; with -O2 GCC vectorises
* the copy and the over-read produces a real SEGV (manifested in package
* builds with Arch makepkg CFLAGS, plain -O3 release builds hid it).
*/
for (i = 0; i < VA_HEVC_REF_LIST_LEN &&
/* Ref index arrays (DPB indices). For I-slices both are unused. */
for (i = 0; i < V4L2_HEVC_DPB_ENTRIES_NUM_MAX &&
slice_type != V4L2_HEVC_SLICE_TYPE_I; i++) {
if (i < (slice->num_ref_idx_l0_active_minus1 + 1U))
slice_params->ref_idx_l0[i] = slice->RefPicList[0][i];
}
for (i = 0; i < VA_HEVC_REF_LIST_LEN &&
for (i = 0; i < V4L2_HEVC_DPB_ENTRIES_NUM_MAX &&
slice_type == V4L2_HEVC_SLICE_TYPE_B; i++) {
if (i < (slice->num_ref_idx_l1_active_minus1 + 1U))
slice_params->ref_idx_l1[i] = slice->RefPicList[1][i];
@@ -523,9 +503,7 @@ static void h265_fill_slice_params(VAPictureParameterBufferHEVC *picture,
slice_params->pred_weight_table.delta_chroma_log2_weight_denom =
slice->delta_chroma_log2_weight_denom;
/* Pred weight tables — cap at VAAPI source array size (15), same
* reason as the RefPicList loops above. */
for (i = 0; i < VA_HEVC_REF_LIST_LEN &&
for (i = 0; i < V4L2_HEVC_DPB_ENTRIES_NUM_MAX &&
slice_type != V4L2_HEVC_SLICE_TYPE_I; i++) {
slice_params->pred_weight_table.delta_luma_weight_l0[i] =
slice->delta_luma_weight_l0[i];
@@ -538,7 +516,7 @@ static void h265_fill_slice_params(VAPictureParameterBufferHEVC *picture,
slice->ChromaOffsetL0[i][j];
}
}
for (i = 0; i < VA_HEVC_REF_LIST_LEN &&
for (i = 0; i < V4L2_HEVC_DPB_ENTRIES_NUM_MAX &&
slice_type == V4L2_HEVC_SLICE_TYPE_B; i++) {
slice_params->pred_weight_table.delta_luma_weight_l1[i] =
slice->delta_luma_weight_l1[i];
@@ -779,100 +757,6 @@ static int h265_populate_ext_sps_rps_cache(struct request_data *driver_data,
return err;
}
/*
* iter40b: parse SPS NAL from source_data to populate the
* VAAPI-omitted v4l2_ctrl_hevc_sps fields (max_num_reorder_pics,
* max_latency_increase_plus1, sps_max_sub_layers_minus1, and
* sps_max_dec_pic_buffering_minus1 at the right sublayer index).
*
* Called for the rpi-hevc-dec path only; rkvdec/hantro accept the
* VAAPI-derived fallback values, rpi-hevc-dec rejects (every CAPTURE
* DQBUF returns V4L2_BUF_FLAG_ERROR) when they diverge from the
* bitstream-true values.
*
* Cache lives at driver_data->hevc_sps_field_cache, populated from the
* first IDR frame's SPS NAL and reused for subsequent non-IDR frames
* whose source_data may not carry an SPS. Same lifecycle as
* hevc_rps_cache_*.
*
* Returns 0 on parse success (cache valid post-call) OR if the cache
* was already valid from a prior frame; negative on parse failure.
*/
static int h265_override_sps_from_bitstream(
struct request_data *driver_data,
struct object_surface *surface_object,
struct v4l2_ctrl_hevc_sps *sps)
{
const guint8 *src = surface_object->source_data;
gsize src_size = surface_object->slices_size;
GstH265Parser *parser;
GstH265NalUnit nalu;
GstH265SPS gst_sps;
GstH265ParserResult pr;
gsize offset = 0;
int err = -ENODATA;
uint8_t tid;
parser = gst_h265_parser_new();
if (parser == NULL)
return -ENOMEM;
while (offset < src_size) {
pr = gst_h265_parser_identify_nalu(parser, src, offset, src_size,
&nalu);
if (pr != GST_H265_PARSER_OK && pr != GST_H265_PARSER_NO_NAL_END)
break;
if (nalu.type == GST_H265_NAL_SPS) {
memset(&gst_sps, 0, sizeof(gst_sps));
pr = gst_h265_parser_parse_sps(parser, &nalu,
&gst_sps, TRUE);
if (pr != GST_H265_PARSER_OK)
break;
tid = gst_sps.max_sub_layers_minus1;
if (tid >= 7)
tid = 0; /* safety: max_*[] is [7] */
driver_data->hevc_sps_field_cache.sps_max_sub_layers_minus1 =
gst_sps.max_sub_layers_minus1;
driver_data->hevc_sps_field_cache.max_dec_pic_buffering_minus1 =
gst_sps.max_dec_pic_buffering_minus1[tid];
driver_data->hevc_sps_field_cache.max_num_reorder_pics =
gst_sps.max_num_reorder_pics[tid];
driver_data->hevc_sps_field_cache.max_latency_increase_plus1 =
gst_sps.max_latency_increase_plus1[tid];
driver_data->hevc_sps_field_cache.scaling_list_enabled =
gst_sps.scaling_list_enabled_flag;
driver_data->hevc_sps_field_cache.scaling_list_data_present =
gst_sps.scaling_list_data_present_flag;
driver_data->hevc_sps_field_cache.valid = true;
err = 0;
break;
}
offset = nalu.offset + nalu.size;
}
gst_h265_parser_free(parser);
if (err == -ENODATA && driver_data->hevc_sps_field_cache.valid)
err = 0;
if (err == 0 && driver_data->hevc_sps_field_cache.valid) {
sps->sps_max_sub_layers_minus1 =
driver_data->hevc_sps_field_cache.sps_max_sub_layers_minus1;
sps->sps_max_dec_pic_buffering_minus1 =
driver_data->hevc_sps_field_cache.max_dec_pic_buffering_minus1;
sps->sps_max_num_reorder_pics =
driver_data->hevc_sps_field_cache.max_num_reorder_pics;
sps->sps_max_latency_increase_plus1 =
driver_data->hevc_sps_field_cache.max_latency_increase_plus1;
}
return err;
}
int h265_set_controls(struct request_data *driver_data,
struct object_context *context_object,
struct object_surface *surface_object)
@@ -926,50 +810,6 @@ int h265_set_controls(struct request_data *driver_data,
}
h265_fill_sps(picture, &sps);
/*
* iter40b: rpi-hevc-dec validates SPS fields VAAPI doesn't
* forward (sps_max_num_reorder_pics, sps_max_latency_increase_plus1)
* against bitstream-true values and rejects the frame when our
* §A.4.2 spec-legal fallback diverges. Parse the SPS NAL from
* source_data and override. Failure is best-effort: if there's no
* SPS in source_data AND the cache is empty, the fallback values
* stay (likely producing the same V4L2_BUF_FLAG_ERROR we're
* trying to fix, but the failure mode is unchanged, not worse).
*/
{
bool is_rpi = (driver_data->video_fd ==
driver_data->video_fd_rpi_hevc_dec);
if (is_rpi) {
/*
* iter40b: tried SPS NAL parse from source_data, but
* ffmpeg-vaapi doesn't include SPS bytes in the
* slice_data buffer (only slice NALs). The parse
* returns -ENODATA every frame, cache stays empty.
*
* Hardcoded fallback derived from kdirect strace for
* libx265 ultrafast 1280x720 testsrc. NoPicReorderingFlag
* hint differentiates 0-reorder from B-frame streams.
* For Phase 7 fixtures the (2, 4) values match kdirect
* bit-exact, which proves the SPS divergence axis is closed.
*
* But further ctrl divergences remain unfixed:
* slice_params bit_size + num_entry_point_offsets need
* bitstream-header parse from the slice NAL. Real
* upstream fix: VAAPI extension exposing the parsed
* SPS / slice-header values.
*/
(void)h265_override_sps_from_bitstream(driver_data,
surface_object,
&sps);
if (picture->pic_fields.bits.NoPicReorderingFlag) {
sps.sps_max_num_reorder_pics = 0;
sps.sps_max_latency_increase_plus1 = 0;
} else {
sps.sps_max_num_reorder_pics = 2;
sps.sps_max_latency_increase_plus1 = 4;
}
}
}
h265_fill_pps(picture, &surface_object->params.h265.slices[0], &pps);
h265_fill_decode_params(driver_data, picture, &decode_params);
h265_fill_scaling_matrix(iqmatrix, iqmatrix_set, &scaling_matrix);
@@ -1014,30 +854,11 @@ int h265_set_controls(struct request_data *driver_data,
.ptr = slice_params_array,
.size = sizeof(struct v4l2_ctrl_hevc_slice_params) * num_slices,
};
/*
* iter40b: rpi-hevc-dec's per-frame ctrl set is 4 (no
* scaling_matrix when SPS doesn't enable it). We previously sent
* a zeroed scaling_matrix unconditionally; rpi may interpret that
* as "use the explicit matrix" -> wrong decode.
*
* Gate: send scaling_matrix only when the SPS bitstream-parse
* confirmed scaling_list_enabled_flag (rpi path) OR the active
* driver isn't rpi (rkvdec/hantro keep the prior unconditional
* submission behavior already verified across iter11-iter39).
*/
{
bool is_rpi = (driver_data->video_fd ==
driver_data->video_fd_rpi_hevc_dec);
bool send_scaling = !is_rpi ||
driver_data->hevc_sps_field_cache.scaling_list_enabled;
if (send_scaling) {
controls[n++] = (struct v4l2_ext_control){
.id = V4L2_CID_STATELESS_HEVC_SCALING_MATRIX,
.ptr = &scaling_matrix,
.size = sizeof(scaling_matrix),
};
}
}
controls[n++] = (struct v4l2_ext_control){
.id = V4L2_CID_STATELESS_HEVC_DECODE_PARAMS,
.ptr = &decode_params,
+64 -165
@@ -39,8 +39,6 @@
#include <linux/dma-buf.h>
#include "nv15.h"
#include "nv12_col128.h"
#include "tiled_yuv.h"
#include "utils.h"
#include "v4l2.h"
@@ -88,51 +86,14 @@ VAStatus RequestCreateImage(VADriverContextP context, VAImageFormat *format,
for (i = 0; i < planes_count; i++)
size += destination_sizes[i];
if (format->fourcc == VA_FOURCC_P010) {
/*
* iter39: P010 image overrides V4L2-side NV15 sizing. The
* source is the kernel-reported NV15 packed plane; the image
* buffer holds dense P010 (2 bytes per pixel, 16bpp).
* Recompute sizes/pitches against P010 layout so consumers
* (vaGetImage, vaDeriveImage) see standard P010 geometry.
*/
destination_bytesperlines[0] = width * 2;
destination_sizes[0] = destination_bytesperlines[0] * format_height;
for (i = 1; i < destination_planes_count; i++) {
destination_bytesperlines[i] = destination_bytesperlines[0];
destination_sizes[i] = destination_sizes[0] / 2;
}
size = 0;
for (i = 0; i < destination_planes_count; i++)
size += destination_sizes[i];
} else if (format->fourcc == VA_FOURCC_NV12 &&
video_format->v4l2_format == V4L2_PIX_FMT_NV12_COL128) {
/*
* iter40 Phase 5 review F2: NC12 source, NV12 image output.
* V4L2-reported destination_bytesperlines[0] is the NC12
* column stride (= ALIGN(height,8) * 3/2, e.g. 1080 for
* 1280×720), NOT the linear NV12 Y stride. Override to the
* linear stride (width) so VAImage pitches reflect the
* detile-output layout the consumer reads.
*/
destination_bytesperlines[0] = width;
destination_sizes[0] = destination_bytesperlines[0] * format_height;
for (i = 1; i < destination_planes_count; i++) {
destination_bytesperlines[i] = destination_bytesperlines[0];
destination_sizes[i] = destination_sizes[0] / 2;
}
size = 0;
for (i = 0; i < destination_planes_count; i++)
size += destination_sizes[i];
} else {
/* NV12: V4L2 stride is correct, sizes derived from height. */
/* Here we calculate the sizes assuming NV12. */
destination_sizes[0] = destination_bytesperlines[0] * format_height;
for (i = 1; i < destination_planes_count; i++) {
destination_bytesperlines[i] = destination_bytesperlines[0];
destination_sizes[i] = destination_sizes[0] / 2;
}
}
id = object_heap_allocate(&driver_data->image_heap);
image_object = IMAGE(driver_data, id);
@@ -255,91 +216,63 @@ static VAStatus copy_surface_to_image (struct request_data *driver_data,
}
}
/*
* AV1 film_grain: when this surface is the display surface of a
* decode (current_display_picture != current_frame with apply_grain=1),
* its slot is NULL because BeginPicture only fired on the decode
* surface. Follow the back-link set in av1_set_controls and borrow
* the decode surface's destination_data + sizes for the copy.
*/
if (surface_object->current_slot == NULL &&
surface_object->linked_decode_surface_id != VA_INVALID_SURFACE) {
struct object_surface *decode_surface =
SURFACE(driver_data,
surface_object->linked_decode_surface_id);
if (decode_surface != NULL &&
decode_surface->current_slot != NULL) {
/* Mirror the fields we read below. The surface heap
* pointer is stable for the surface's lifetime; we
* only need destination_data + destination_sizes +
* destination_planes_count from it. */
surface_object->destination_planes_count =
decode_surface->destination_planes_count;
for (i = 0; i < decode_surface->destination_planes_count; i++) {
surface_object->destination_data[i] =
decode_surface->destination_data[i];
surface_object->destination_sizes[i] =
decode_surface->destination_sizes[i];
}
}
}
for (i = 0; i < surface_object->destination_planes_count; i++) {
/*
* iter40 Phase 5 review F1: guard extended from __arm__ to
* __arm__ || __aarch64__. Without this, the detile primitives
* silently compiled out on aarch64 (fresnel RK3399, ampere
* RK3588, higgs Pi CM5) and the memcpy fall-through delivered
* raw tiled bytes to NV12/P010 image consumers. iter39 5/5
* PASS masked the issue because no 10-bit path was exercised.
*/
#if defined(__arm__) || defined(__aarch64__)
/*
* Sunxi tiled_to_planar lives in tiled_yuv.S, which is
* #ifdef __arm__, so the symbol is absent on aarch64. Keep this
* branch arm-only; aarch64 Sunxi support would need a C or
* aarch64-ASM port (no Sunxi aarch64 board in current fleet).
*/
#if defined(__arm__)
/* AV1 Phase 3 diag: surface NULL-deref hunt. */
if (buffer_object->data == NULL ||
surface_object->destination_data[i] == NULL) {
request_log("copy_surface_to_image NULL i=%u "
"buf_data=%p dest_data=%p dest_size=%u "
"planes=%u slot=%p linked=0x%x\n",
i, (void *)buffer_object->data,
(void *)surface_object->destination_data[i],
surface_object->destination_sizes[i],
surface_object->destination_planes_count,
(void *)surface_object->current_slot,
surface_object->linked_decode_surface_id);
return VA_STATUS_ERROR_OPERATION_FAILED;
}
#ifdef __arm__
if (!video_format_is_linear(driver_data->video_format))
tiled_to_planar(surface_object->destination_data[i],
buffer_object->data + image->offsets[i],
image->pitches[i], image->width,
i == 0 ? image->height :
image->height / 2);
else
#endif
if (driver_data->is_10bit &&
image->format.fourcc == VA_FOURCC_P010) {
/*
* iter39: rkvdec emits NV15 (4×10-bit packed in 5
* bytes); the VA image buffer is dense P010 (2B/pixel,
* value in bits[15:6]). Source stride is the V4L2-
* reported NV15 bytesperline (= ceil(width/4)*5,
* possibly aligned higher by the kernel); destination
* stride is image->pitches[i] = width * 2.
*/
unsigned int plane_h = (i == 0) ? image->height
: image->height / 2;
nv15_unpack_plane_to_p010(
surface_object->destination_data[i],
(uint16_t *)(buffer_object->data + image->offsets[i]),
image->width, plane_h,
surface_object->destination_bytesperlines[i]);
} else if (driver_data->video_format != NULL &&
driver_data->video_format->v4l2_format ==
V4L2_PIX_FMT_NV12_COL128 &&
image->format.fourcc == VA_FOURCC_NV12) {
/*
* iter40: Pi 5 rpi-hevc-dec emits NV12_COL128 (SAND
* 128-pixel-wide column tiles). Detile to linear NV12
* via the per-plane primitive. surface_object->
* destination_data[i] is the V4L2 CAPTURE mmap (single
* buffer, planes_count==2): i==0 is the Y plane base,
* i==1 is the UV plane base offset within the SAME
* physical buffer (per cap_pool plane[1] offset = Y
* plane size in COL128 layout).
*
* src_col_stride = destination_bytesperlines[i] = the
* kernel-reported NC12 bytesperline (column stride,
* = ALIGN(image_h, 8) * 3/2). Same for both planes
* since column geometry is plane-agnostic.
*
* dst stride is image->pitches[i] = image->width
* (overridden in RequestCreateImage NC12 branch below).
*/
if (i == 0) {
nv12_col128_detile_y(
(uint8_t *)(buffer_object->data + image->offsets[i]),
image->pitches[i],
surface_object->destination_data[i],
surface_object->destination_bytesperlines[i],
image->width, image->height);
} else {
nv12_col128_detile_uv(
(uint8_t *)(buffer_object->data + image->offsets[i]),
image->pitches[i],
surface_object->destination_data[i],
surface_object->destination_bytesperlines[i],
image->width, image->height / 2);
}
} else {
else {
#endif
memcpy(buffer_object->data + image->offsets[i],
surface_object->destination_data[i],
surface_object->destination_sizes[i]);
#if defined(__arm__) || defined(__aarch64__)
#ifdef __arm__
}
#endif
}
@@ -378,17 +311,9 @@ VAStatus RequestDeriveImage(VADriverContextP context, VASurfaceID surface_id,
/* Fully populate VAImageFormat to match QueryImageFormats output. */
memset(&format, 0, sizeof(format));
if (driver_data->is_10bit) {
/* iter39: 10-bit session derives a P010 image. NV15-source
* unpack happens in copy_surface_to_image. */
format.fourcc = VA_FOURCC_P010;
format.byte_order = VA_LSB_FIRST;
format.bits_per_pixel = 24;
} else {
format.fourcc = VA_FOURCC_NV12;
format.byte_order = VA_LSB_FIRST;
format.bits_per_pixel = 12;
}
status = RequestCreateImage(context, &format, surface_object->width,
surface_object->height, image);
@@ -423,52 +348,26 @@ VAStatus RequestDeriveImage(VADriverContextP context, VASurfaceID surface_id,
VAStatus RequestQueryImageFormats(VADriverContextP context,
VAImageFormat *formats, int *formats_count)
{
struct request_data *driver_data = context->pDriverData;
int n = 0;
/*
* Populate the VAImageFormat fully per VAAPI spec, not just
* .fourcc. Consumers (FFmpeg's hwcontext_vaapi, mpv, Firefox)
* read .byte_order and .bits_per_pixel; leaving them
* uninitialized inherits caller-stack garbage and produces
* non-deterministic behavior. Reference: Mesa's
* gallium/frontends/va/image.c::vlVaQueryImageFormats and
* intel-vaapi-driver's i965_drv_video.c.
* Populate the VAImageFormat fully per VAAPI spec for NV12,
* not just .fourcc. Consumers (FFmpeg's hwcontext_vaapi, mpv,
* Firefox) read .byte_order and .bits_per_pixel; leaving them
* uninitialized inherits whatever caller-stack garbage is in
* the buffer and produces non-deterministic behavior. Reference:
* Mesa's gallium/frontends/va/image.c::vlVaQueryImageFormats and
* intel-vaapi-driver's i965_drv_video.c; both publish NV12
* with byte_order=VA_LSB_FIRST and bits_per_pixel=12.
*
* iter39: advertise P010 when an active session is 10-bit so
* ffmpeg-vaapi sees a valid 10-bit-compatible entry during
* vaQueryImageFormats. NV12 stays advertised unconditionally so
* the 8-bit catalog query response is unchanged.
* For YUV formats, depth/red_mask/green_mask/blue_mask/alpha_mask
* are not meaningful (those describe RGB bit layouts); leave them
* zeroed via memset before populating.
*/
memset(&formats[n], 0, sizeof(formats[n]));
formats[n].fourcc = VA_FOURCC_NV12;
formats[n].byte_order = VA_LSB_FIRST;
formats[n].bits_per_pixel = 12;
n++;
/*
* iter39 Option B revert (2026-05-17): P010 advertisement is
* gated on driver_data->is_10bit again. Previously advertised
* unconditionally (63fed87) so ffmpeg-vaapi's early
* vaQueryImageFormats (pre-vaCreateContext) could see it for
* 10-bit profiles, but that broke HEVC 8-bit on fresnel:
* ffmpeg-vaapi picked P010 for the HEVC hwframe pool, EndPicture
* SEGV'd in the .so when the consumer-side P010 expectations met
* an 8-bit NV12 CAPTURE buffer.
* Safe because Option B drops VAProfileHEVCMain10 + Hi10P from
* enumeration: no 10-bit decode pipeline will reach this catalog
* query, so the gate-on-is_10bit (which stays false for 8-bit
* profiles) correctly returns NV12-only.
*/
if (driver_data->is_10bit && n < V4L2_REQUEST_MAX_IMAGE_FORMATS) {
memset(&formats[n], 0, sizeof(formats[n]));
formats[n].fourcc = VA_FOURCC_P010;
formats[n].byte_order = VA_LSB_FIRST;
formats[n].bits_per_pixel = 24;
n++;
}
*formats_count = n;
memset(&formats[0], 0, sizeof(formats[0]));
formats[0].fourcc = VA_FOURCC_NV12;
formats[0].byte_order = VA_LSB_FIRST;
formats[0].bits_per_pixel = 12;
*formats_count = 1;
return VA_STATUS_SUCCESS;
}
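For reference, the P010 geometry that the removed RequestCreateImage branch computed (pitch = width * 2, CbCr plane half the size of the Y plane) can be sketched in isolation. This is only an illustration: the struct and function names (p010_geometry, p010_compute_geometry) are hypothetical and do not exist in the driver, and the sketch uses a plain height where the driver used format_height.

```c
#include <assert.h>

/* Hypothetical helper type for this sketch; not part of the driver. */
struct p010_geometry {
	unsigned int pitches[2]; /* bytes per row: Y, then interleaved CbCr */
	unsigned int sizes[2];   /* bytes per plane */
	unsigned int total;      /* bytes for the whole image */
};

/*
 * Dense P010: every sample occupies 2 bytes. The CbCr plane has the
 * same byte pitch as the Y plane and half as many rows.
 */
static struct p010_geometry p010_compute_geometry(unsigned int width,
						  unsigned int height)
{
	struct p010_geometry g;

	assert(width > 0 && height > 0);

	g.pitches[0] = width * 2;
	g.sizes[0] = g.pitches[0] * height;
	g.pitches[1] = g.pitches[0];
	g.sizes[1] = g.sizes[0] / 2;
	g.total = g.sizes[0] + g.sizes[1];

	return g;
}
```

For a 1920x1080 image this gives a 3840-byte pitch for both planes, which is the value a consumer would expect to see in VAImage pitches for P010.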
+1 -7
@@ -22,9 +22,6 @@
autoconf_data = configuration_data()
autoconf_data.set('VA_DRIVER_INIT_FUNC', va_driver_init_func)
if get_option('daedalus_v4l2')
autoconf_data.set('HAVE_DAEDALUS_V4L2', 1)
endif
autoconf = configure_file(
output: 'autoconfig.h',
@@ -55,8 +52,6 @@ sources = [
'vp9.c',
'av1.c',
'codec.c',
'nv15.c',
'nv12_col128.c',
# Vendored GStreamer 1.28.2 H.265 parser + utilities (LGPL v2.1+,
# see src/h265_parser/gst_compat.h for sourcing notes + per-iter2
@@ -91,9 +86,8 @@ headers = [
'h265.h',
'vp8.h',
'vp9.h',
'av1.h',
'codec.h',
'nv15.h',
'nv12_col128.h',
# Internal mirror of Linux 7.0 V4L2 HEVC EXT_SPS_*_RPS UAPI defs
# (allows building against pre-7.0 linux-api-headers; redundant
-114
@@ -1,114 +0,0 @@
/*
* V4L2_PIX_FMT_NV12_COL128 to linear NV12 detile primitive. Pi 5 / CM5
* rpi-hevc-dec CAPTURE. iter40 (2026-05-17).
*
* Math derived from kernel hevc_d_video.c (size formula) +
* ffmpeg/Kynesim libavutil/rpi_sand_fn_pw.h (per-pixel offset). The
* single-stripe fast path memcpy's 128 bytes at a time when an output
* row falls entirely within one tile column (the common case);
* straddling rows are split into two memcpy halves.
*
* No NEON / SIMD here; correctness first. Each output row generates
* (width / 128) + ~1 memcpys of up to 128 bytes; for 1920x1080 that's
* ~17000 small memcpys per frame, fine for Phase 1 PoC.
*/
#include "nv12_col128.h"
#include <string.h>
/*
* Tile column width in bytes. The 'COL128' name embeds this; if it ever
* varies, take it from V4L2_PIX_FMT_NV12_COL128's kernel definition.
*/
#define NC12_TILE_W 128
/*
* Common Y / UV plane detile: the layout is identical (single-byte per
* pixel, column-major 128-wide tiles). The only thing that varies is
* what plane the caller passes in. width here is plane width in bytes
* (= image width for both Y and CbCr-interleaved NV12 UV); height is
* plane height in pixels (image height for Y, image height / 2 for UV).
*/
static void nv12_col128_detile_plane(uint8_t *dst, unsigned int dst_stride,
const uint8_t *src,
unsigned int src_col_stride,
unsigned int width, unsigned int height)
{
unsigned int y, x;
for (y = 0; y < height; y++) {
uint8_t *drow = dst + y * dst_stride;
x = 0;
while (x < width) {
unsigned int col = x / NC12_TILE_W;
unsigned int in_col = x % NC12_TILE_W;
unsigned int n = NC12_TILE_W - in_col;
if (n > width - x)
n = width - x;
/*
* Source byte = base + col*128*col_stride + y*128 + in_col
* Copy n contiguous bytes (all within this tile column,
* since n is capped at the remaining width-in-column).
*/
const uint8_t *p = src
+ (size_t)col * NC12_TILE_W * src_col_stride
+ (size_t)y * NC12_TILE_W
+ in_col;
memcpy(drow + x, p, n);
x += n;
}
}
}
void nv12_col128_detile_y(uint8_t *dst, unsigned int dst_stride,
const uint8_t *src_y, unsigned int src_col_stride,
unsigned int width, unsigned int height)
{
nv12_col128_detile_plane(dst, dst_stride, src_y, src_col_stride,
width, height);
}
void nv12_col128_detile_uv(uint8_t *dst, unsigned int dst_stride,
const uint8_t *src_uv, unsigned int src_col_stride,
unsigned int width, unsigned int uv_height)
{
/* UV plane (CbCr interleaved): byte-width equals Y-plane width
* (one Cb + one Cr per 2x2 Y block, so 2 bytes per 2 horizontal Y
* samples, i.e. 1 byte per Y pixel horizontally). Height is half. */
nv12_col128_detile_plane(dst, dst_stride, src_uv, src_col_stride,
width, uv_height);
}
unsigned int nv12_col128_uv_plane_offset(unsigned int image_width,
unsigned int image_height)
{
unsigned int aligned_h = (image_height + 7) & ~7u;
/*
* In the COL128 SAND layout, Y and UV are NOT separate planes
* concatenated end-to-end. Within EACH 128-pixel-wide column:
* first 128 * height bytes = Y data for this column strip
* next 128 * height / 2 bytes = UV data for this column strip
* total 128 * bytesperline (= 128 * height * 3/2) bytes per column
*
* The "UV plane base" pointer (data[1] in AVFrame convention) is
* just data[0] + (128 * height), the offset of the UV bytes
* WITHIN the first column. All subsequent UV bytes are reached by
* the same column-stride arithmetic the Y plane uses (col *
* 128 * bytesperline + y * 128 + in_col), so passing this offset
* pointer + iterating y over [0, height/2) traverses all UV rows
* across all columns correctly.
*
* Earlier wrong formula was num_columns * 128 * aligned_h (i.e.
* sizeof(linear Y plane)), which pushed past the end of the SAND
* buffer because the layout isn't planes-end-to-end.
*
* Cross-check: kernel sizeimage = bytesperline * width =
* (aligned_h * 3/2) * num_columns * 128 = num_columns * 128 *
* aligned_h * 3/2. Per column: 128 * aligned_h * 3/2. Y portion
* per column: 128 * aligned_h. UV portion per column: half of Y.
* Sum across columns: matches sizeimage.
*/
return NC12_TILE_W * aligned_h;
}
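The cross-check in the comment above can be written out as arithmetic. The following is a minimal sketch, assuming the layout exactly as that comment describes it (128-byte-wide columns, col_stride = ALIGN(height, 8) * 3/2, UV bytes starting 128 * ALIGN(height, 8) into each column). The function names are hypothetical and are not the removed driver functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NC12_TILE_W 128u

/*
 * Byte offset of plane byte (x, y) relative to the plane base pointer.
 * col_stride is the kernel-reported bytesperline.
 */
static size_t nc12_byte_offset(unsigned int x, unsigned int y,
			       unsigned int col_stride)
{
	size_t col = x / NC12_TILE_W;
	size_t in_col = x % NC12_TILE_W;

	return col * NC12_TILE_W * col_stride +
	       (size_t)y * NC12_TILE_W + in_col;
}

/* Offset of the UV bytes inside the first column. */
static size_t nc12_uv_base(unsigned int height)
{
	size_t aligned_h = ((size_t)height + 7) & ~(size_t)7;

	return NC12_TILE_W * aligned_h;
}

/* Total buffer size: col_stride * ALIGN(width, 128). */
static size_t nc12_sizeimage(unsigned int width, unsigned int height)
{
	size_t aligned_w = ((size_t)width + 127) & ~(size_t)127;
	size_t aligned_h = ((size_t)height + 7) & ~(size_t)7;

	assert(width > 0 && height > 0);

	return aligned_w * (aligned_h * 3 / 2);
}
```

With 1280x720 (col_stride 1080), the last UV byte sits at nc12_uv_base(720) + nc12_byte_offset(1279, 359, 1080), which is exactly one byte short of nc12_sizeimage(1280, 720). That is the property the earlier planes-end-to-end formula violated.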
-88
@@ -1,88 +0,0 @@
/*
* V4L2_PIX_FMT_NV12_COL128 (NC12) SAND-tiled to linear NV12 detile.
*
* Pi 5 / CM5 (BCM2712) rpi-hevc-dec CAPTURE format. iter40 (2026-05-17).
*
* Layout (kernel drivers/media/platform/raspberrypi/hevc_dec/hevc_d_video.c
* size-formula + ffmpeg/Kynesim libavutil/rpi_sand_fn_pw.h per-pixel
* offset math):
*
* width = ALIGN(image_width, 128) -- columns are 128 px wide
* height = ALIGN(image_height, 8)
* col_stride (= bytesperline) = height * 3 / 2
* (bytes per [128-wide column] vertical unit incl. Y + UV)
* sizeimage = col_stride * width = total bytes
*
* For pixel (x, y) in the Y plane:
* col = x / 128
* in_col_x = x % 128
* offset = col * col_stride * 128 + y * 128 + in_col_x
*
* Within each column, UV data starts at offset 128 * height (see
* nv12_col128_uv_plane_offset in nv12_col128.c); same per-column
* layout, h/2 rows tall (CbCr interleaved).
*
* The primitive copies the entire image extent at once. width/height are
* the cropped consumer-visible dimensions; src_col_stride is the kernel-
* reported bytesperline (i.e. ALIGN(height,8) * 3/2).
*/
#ifndef _NV12_COL128_H_
#define _NV12_COL128_H_
#include <stdint.h>
#include <linux/videodev2.h>
/*
* Pre-Pi-kernel headers (Arch ALARM linux-api-headers, older mainline
* kernel-headers packages) may not define V4L2_PIX_FMT_NV12_COL128. The
* fourcc is Pi-specific. Provide a private fallback so the backend
* builds on hosts that target NON-Pi codecs too.
*/
#ifndef V4L2_PIX_FMT_NV12_COL128
#define V4L2_PIX_FMT_NV12_COL128 \
((unsigned int)('N') | ((unsigned int)('C') << 8) | \
((unsigned int)('1') << 16) | ((unsigned int)('2') << 24))
#endif
#ifndef V4L2_PIX_FMT_NV12_10_COL128
/* 10-bit SAND variant: 3 pixels packed into 4 bytes in 128-byte / 96-pixel
* wide columns. iter40 references the fourcc for completeness; the 10-bit
* Pi 5 HEVC chapter (Main10) is post-iter40. */
#define V4L2_PIX_FMT_NV12_10_COL128 \
((unsigned int)('N') | ((unsigned int)('C') << 8) | \
((unsigned int)('3') << 16) | ((unsigned int)('0') << 24))
#endif
/* Detile the Y plane of an NC12 source to a linear NV12 Y plane.
* dst : pointer to linear NV12 Y plane (caller-owned, dst_stride * height bytes)
* dst_stride : linear Y plane stride in bytes (= width for plain NV12)
* src_y : pointer to start of NC12 Y plane (= NC12 buffer base)
* src_col_stride: kernel-reported bytesperline (= ALIGN(height,8) * 3/2)
* width, height: cropped image dimensions in pixels
*/
void nv12_col128_detile_y(uint8_t *dst, unsigned int dst_stride,
const uint8_t *src_y, unsigned int src_col_stride,
unsigned int width, unsigned int height);
/* Detile the UV plane (CbCr interleaved, half-height) of an NC12 source.
* dst : pointer to linear NV12 UV plane
* dst_stride : linear UV plane stride in bytes (= width for NV12)
* src_uv : pointer to start of NC12 UV data (= src_y + 128 * ALIGN(height,8))
* src_col_stride: same as Y plane (same column geometry)
* width : Y-plane width in pixels (UV plane has same byte width)
* uv_height : UV plane height = height / 2
*/
void nv12_col128_detile_uv(uint8_t *dst, unsigned int dst_stride,
const uint8_t *src_uv, unsigned int src_col_stride,
unsigned int width, unsigned int uv_height);
/* Compute the offset of the UV plane within an NC12 buffer.
* image_width, image_height: cropped image dimensions in pixels
* Returns: byte offset from buffer start to the UV data of the first column
* (= 128 * ALIGN(image_height, 8))
*/
unsigned int nv12_col128_uv_plane_offset(unsigned int image_width,
unsigned int image_height);
#endif /* _NV12_COL128_H_ */
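The fallback #defines in this header spell out the fourcc packing by hand. As a small sketch of the same arithmetic (first character in the lowest byte), the helper below reproduces the two values; fourcc_pack is a hypothetical name used only for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Little-endian fourcc packing: 'a' lands in bits 7:0, 'd' in bits 31:24.
 * fourcc_pack is a hypothetical helper for this sketch.
 */
static uint32_t fourcc_pack(char a, char b, char c, char d)
{
	assert(a > 0 && b > 0 && c > 0 && d > 0);

	return (uint32_t)(unsigned char)a |
	       ((uint32_t)(unsigned char)b << 8) |
	       ((uint32_t)(unsigned char)c << 16) |
	       ((uint32_t)(unsigned char)d << 24);
}
```

fourcc_pack('N', 'C', '1', '2') evaluates to 0x3231434E and fourcc_pack('N', 'C', '3', '0') to 0x3033434E, which is what the two fallback macros expand to.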
-75
@@ -1,75 +0,0 @@
/*
* Copyright (C) 2026 claude-noether <claude-noether@reauktion.de>
*
* Permission is hereby granted, free of charge, to any person obtaining a
* copy of this software and associated documentation files (the
* "Software"), to deal in the Software without restriction, including
* without limitation the rights to use, copy, modify, merge, publish,
* distribute, sub license, and/or sell copies of the Software, and to
* permit persons to whom the Software is furnished to do so, subject to
* the following conditions:
*
* The above copyright notice and this permission notice (including the
* next paragraph) shall be included in all copies or substantial portions
* of the Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS
* OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
* MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NON-INFRINGEMENT.
* IN NO EVENT SHALL PRECISION INSIGHT AND/OR ITS SUPPLIERS BE LIABLE FOR
* ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
* TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
* SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
*/
#include "nv15.h"
void nv15_unpack_plane_to_p010(const uint8_t *src, uint16_t *dst,
unsigned int width, unsigned int height,
unsigned int src_stride)
{
unsigned int x, y;
unsigned int dst_pitch_px = width;
for (y = 0; y < height; y++) {
const uint8_t *s = src + y * src_stride;
uint16_t *d = dst + y * dst_pitch_px;
for (x = 0; x + 4 <= width; x += 4) {
uint16_t a = (uint16_t)s[0] | ((uint16_t)(s[1] & 0x03) << 8);
uint16_t b = ((uint16_t)s[1] >> 2) | ((uint16_t)(s[2] & 0x0F) << 6);
uint16_t c = ((uint16_t)s[2] >> 4) | ((uint16_t)(s[3] & 0x3F) << 4);
uint16_t e = ((uint16_t)s[3] >> 6) | ((uint16_t)s[4] << 2);
d[0] = (uint16_t)(a << 6);
d[1] = (uint16_t)(b << 6);
d[2] = (uint16_t)(c << 6);
d[3] = (uint16_t)(e << 6);
d += 4;
s += 5;
}
if (x < width) {
unsigned int rem = width - x;
uint16_t pix[4] = { 0, 0, 0, 0 };
pix[0] = (uint16_t)s[0] | ((uint16_t)(s[1] & 0x03) << 8);
if (rem >= 2)
pix[1] = ((uint16_t)s[1] >> 2) |
((uint16_t)(s[2] & 0x0F) << 6);
if (rem >= 3)
pix[2] = ((uint16_t)s[2] >> 4) |
((uint16_t)(s[3] & 0x3F) << 4);
if (rem >= 4)
pix[3] = ((uint16_t)s[3] >> 6) |
((uint16_t)s[4] << 2);
{
unsigned int j;
for (j = 0; j < rem; j++)
d[j] = (uint16_t)(pix[j] << 6);
}
}
}
}
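One way to sanity-check the bit slicing in nv15_unpack_plane_to_p010 is a round trip over a single 5-byte group. The sketch below is an assumption-labelled illustration: nv15_pack4 and nv15_unpack4_to_p010 are hypothetical helpers, not driver functions, and the packer is simply the inverse of the unpack arithmetic shown above (four 10-bit samples, LSB-first, in 5 bytes).

```c
#include <assert.h>
#include <stdint.h>

/* Pack four 10-bit samples into one 5-byte NV15 group, LSB-first. */
static void nv15_pack4(const uint16_t px[4], uint8_t out[5])
{
	assert(px[0] < 1024 && px[1] < 1024 && px[2] < 1024 && px[3] < 1024);

	out[0] = (uint8_t)(px[0] & 0xFF);
	out[1] = (uint8_t)((px[0] >> 8) | ((px[1] & 0x3F) << 2));
	out[2] = (uint8_t)((px[1] >> 6) | ((px[2] & 0x0F) << 4));
	out[3] = (uint8_t)((px[2] >> 4) | ((px[3] & 0x03) << 6));
	out[4] = (uint8_t)(px[3] >> 2);
}

/* Unpack one group into P010 samples (value in bits 15:6, zeros below). */
static void nv15_unpack4_to_p010(const uint8_t s[5], uint16_t d[4])
{
	uint16_t a = (uint16_t)(s[0] | ((s[1] & 0x03) << 8));
	uint16_t b = (uint16_t)((s[1] >> 2) | ((s[2] & 0x0F) << 6));
	uint16_t c = (uint16_t)((s[2] >> 4) | ((s[3] & 0x3F) << 4));
	uint16_t e = (uint16_t)((s[3] >> 6) | (s[4] << 2));

	d[0] = (uint16_t)(a << 6);
	d[1] = (uint16_t)(b << 6);
	d[2] = (uint16_t)(c << 6);
	d[3] = (uint16_t)(e << 6);
}
```

Packing a group and unpacking it again should return each input sample shifted left by 6, so the maximum 10-bit value 1023 becomes 0xFFC0 in P010.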
-61
@@ -1,61 +0,0 @@
/*
* Copyright (C) 2026 claude-noether <claude-noether@reauktion.de>
*
* Permission is hereby granted, free of charge, to any person obtaining a
* copy of this software and associated documentation files (the
* "Software"), to deal in the Software without restriction, including
* without limitation the rights to use, copy, modify, merge, publish,
* distribute, sub license, and/or sell copies of the Software, and to
* permit persons to whom the Software is furnished to do so, subject to
* the following conditions:
*
* The above copyright notice and this permission notice (including the
* next paragraph) shall be included in all copies or substantial portions
* of the Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS
* OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
* MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NON-INFRINGEMENT.
* IN NO EVENT SHALL PRECISION INSIGHT AND/OR ITS SUPPLIERS BE LIABLE FOR
* ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
* TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
* SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
*/
#ifndef _NV15_H_
#define _NV15_H_
#include <stdint.h>
#include <linux/videodev2.h>
/*
* Older or downstream linux-api-headers / kernel-headers packages may
* not define V4L2_PIX_FMT_NV15. Provide a fallback so the backend
* builds on hosts whose headers are pre-NV15-merge or omit it (e.g.
* Pi 5 Debian trixie 6.12.62 headers include NC12 but not NV15).
* Same numeric value as mainline.
*/
#ifndef V4L2_PIX_FMT_NV15
#define V4L2_PIX_FMT_NV15 \
((unsigned int)('N') | ((unsigned int)('V') << 8) | \
((unsigned int)('1') << 16) | ((unsigned int)('5') << 24))
#endif
/*
* Unpack one plane of V4L2_PIX_FMT_NV15 (4 × 10-bit values packed into
* 5 consecutive bytes, LSB-first) into VA_FOURCC_P010 (16-bit per pixel,
* value in bits [15:6], zeros in [5:0]).
*
* Layout per Documentation/userspace-api/media/v4l/pixfmt-nv15.rst.
* Call once per plane: luma (W × H, src_stride = ceil(W/4)*5) and chroma
* (W × H/2; same width because UV are interleaved 10-bit values).
*
* src_stride must be the kernel-reported bytesperline for the NV15 plane.
* The destination is dense P010 with row pitch = width * 2 bytes.
*/
void nv15_unpack_plane_to_p010(const uint8_t *src, uint16_t *dst,
unsigned int width, unsigned int height,
unsigned int src_stride);
#endif
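The stride relationship described in the header comment (minimum NV15 stride = ceil(W/4)*5, possibly aligned higher by the kernel) can be sketched as a pair of small helpers. Both names are hypothetical; the driver itself only consumed the kernel-reported bytesperline.

```c
#include <assert.h>

/* Minimum NV15 bytes per row: 5 bytes for every group of 4 samples. */
static unsigned int nv15_min_stride(unsigned int width)
{
	assert(width > 0);

	return (width + 3) / 4 * 5;
}

/* A kernel-reported stride is usable if it covers the minimum. */
static int nv15_stride_is_sufficient(unsigned int width,
				     unsigned int reported_stride)
{
	return reported_stride >= nv15_min_stride(width);
}
```

For a 1920-sample row the minimum is 2400 bytes; a reported stride of 2432, for example, would pass the check, while 2304 would not.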
+47 -216
@@ -37,7 +37,6 @@
#include "vp8.h"
#include "vp9.h"
#include "av1.h"
#include "request_pool.h"
#include <assert.h>
#include <stdio.h>
@@ -56,159 +55,6 @@
#include "autoconfig.h"
/*
* iter#15 issue #15: ensure the in-flight surface's OUTPUT mmap has
* room for `need` total bytes (slices_size plus the pending append); if
* not, grow the pool transparently via request_pool_resize.
*
* Sequence on overflow:
* 1. Snapshot the surface's accumulated bytes to a temp heap buffer.
* 2. Release the surface's OUTPUT pool slot back to FREE (resize
* requires no slot be borrowed).
* 3. Compute new sizeimage = roundup(needed * 2, 4 KiB), and at least
* double the current source_size so geometric growth amortises
* repeated overruns at the same resolution.
* 4. Call request_pool_resize.
* 5. Re-acquire a pool slot (the new pool has fresh indices and fds).
* 6. Re-mirror surface_object->source_{index,data,size,request_fd}
* from the new slot.
* 7. Restore the saved bytes via memcpy into the new mmap.
*
* Returns VA_STATUS_SUCCESS on clean resize (or no resize needed) and
* VA_STATUS_ERROR_ALLOCATION_FAILED on heap-alloc / V4L2 / kernel
* failure; the libva client falls back to surface re-creation as
* before the resize hook landed.
*
* NOTE on inline-Sync invariant: RequestEndPicture calls
* RequestSyncSurface inline, so when codec_store_buffer runs no other
* pool slot is borrowed across libva-driver-API entry points. The
* temporary release-then-reacquire of the in-flight slot here keeps
* that invariant intact across the resize.
*/
static VAStatus
codec_store_buffer_ensure_capacity(struct request_data *driver_data,
struct object_surface *surface_object,
size_t need)
{
struct request_pool_slot *slot;
uint8_t *save_buf;
size_t save_size;
unsigned int saved_index;
size_t want_sizeimage;
unsigned int new_sizeimage;
int new_index;
int rc;
if (need <= surface_object->source_size)
return VA_STATUS_SUCCESS;
save_size = surface_object->slices_size;
save_buf = NULL;
if (save_size > 0) {
save_buf = malloc(save_size);
if (save_buf == NULL) {
request_log("codec_store_buffer_ensure_capacity: malloc(%zu) for resize-save failed\n",
save_size);
return VA_STATUS_ERROR_ALLOCATION_FAILED;
}
memcpy(save_buf, surface_object->source_data, save_size);
}
/*
* Temporarily release the in-flight slot. The slot's V4L2 buffer
* has NOT been QBUF'd yet (QBUF lives in RequestEndPicture, after
* this codec_store_buffer call), so the release is a clean
* busy=false flip; no kernel state is in question. The slot's
* stale request_fd does not need to be saved: the resize closes
* every slot's fd and the post-resize acquire below re-mirrors a
* fresh slot's request_fd into surface_object->request_fd.
*/
saved_index = surface_object->source_index;
request_pool_release(&driver_data->output_pool, saved_index);
/*
* Geometric growth: at least 2× the current source_size, but no
* less than 2× the required total so a single resize covers the
* triggering append plus comfortable headroom for the rest of
* this frame. Round up to a 4 KiB page boundary so the kernel's
* own alignment doesn't waste pages. Compute in size_t so the
* 2× doubling can't silently wrap at 2 GiB on 32-bit unsigned int
* (sizeimage stays bounded by V4L2's u32, but the doubling target
* could otherwise overflow before the clamp).
*/
want_sizeimage = need * 2;
if (want_sizeimage < (size_t)surface_object->source_size * 2)
want_sizeimage = (size_t)surface_object->source_size * 2;
if (want_sizeimage > 0x40000000u) /* 1 GiB hard cap — V4L2 sizeimage is u32 */
want_sizeimage = 0x40000000u;
want_sizeimage = (want_sizeimage + 0xFFFu) & ~(size_t)0xFFFu;
new_sizeimage = (unsigned int)want_sizeimage;
request_log("codec_store_buffer: OUTPUT-pool resize (need %zu > cap %u → new_sizeimage %u)\n",
need, surface_object->source_size, new_sizeimage);
rc = request_pool_resize(&driver_data->output_pool, new_sizeimage);
if (rc < 0) {
/*
* Resize failed. The original slot was already released
* above, so surface_object->source_data is now pointing
* at a FREE-but-still-borrowable mmap. Restore the
* surface's slot mirror so EndPicture / DestroyContext
* unwind paths see a consistent (if partial) state.
*
* If the resize aborted early (pre-STREAMOFF), the slot
* is intact: re-acquiring the same index is the inverse
* of the temporary release above. If it aborted later
* (post-teardown), the slot's data/size were zeroed in
* place by request_pool_resize and the re-acquire flips
* busy=true on a dead slot, which is still safe because the
* caller will return ERROR_ALLOCATION_FAILED and the
* libva consumer destroys the surface/context.
*/
(void)request_pool_acquire(&driver_data->output_pool);
free(save_buf);
return VA_STATUS_ERROR_ALLOCATION_FAILED;
}
new_index = request_pool_acquire(&driver_data->output_pool);
if (new_index < 0) {
free(save_buf);
return VA_STATUS_ERROR_ALLOCATION_FAILED;
}
slot = request_pool_slot(&driver_data->output_pool,
(unsigned int)new_index);
if (slot == NULL) {
request_pool_release(&driver_data->output_pool,
(unsigned int)new_index);
free(save_buf);
return VA_STATUS_ERROR_ALLOCATION_FAILED;
}
surface_object->source_index = slot->index;
surface_object->source_data = slot->data;
surface_object->source_size = slot->size;
surface_object->request_fd = slot->request_fd;
if (need > surface_object->source_size) {
/*
* Kernel rounded the new sizeimage down below what we
* needed; drivers may clamp at their per-codec ceiling.
* Don't corrupt memory; surface the error to libva.
*/
request_log("codec_store_buffer_ensure_capacity: kernel returned sizeimage %u < required %zu\n",
surface_object->source_size, need);
free(save_buf);
return VA_STATUS_ERROR_ALLOCATION_FAILED;
}
if (save_buf != NULL) {
memcpy(surface_object->source_data, save_buf, save_size);
free(save_buf);
}
return VA_STATUS_SUCCESS;
}
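The growth policy used by the removed codec_store_buffer_ensure_capacity (at least 2x the required total, at least 2x the current capacity, capped at 1 GiB, rounded up to 4 KiB) can be isolated as a pure function. The sketch below is a variation rather than the original code: grow_sizeimage and GROW_CAP are hypothetical names, and the doubling is guarded before the multiply so it cannot wrap even where size_t is 32 bits.

```c
#include <assert.h>
#include <stddef.h>

#define GROW_CAP 0x40000000u /* 1 GiB hard cap, as in the removed code */

/*
 * Return the sizeimage to request, given the bytes needed and the
 * current capacity. Hypothetical helper for this sketch.
 */
static unsigned int grow_sizeimage(size_t need, unsigned int current)
{
	size_t want;
	size_t floor_size;

	assert(need > 0);

	/* Double the requirement, saturating at the cap. */
	want = (need > GROW_CAP / 2) ? GROW_CAP : need * 2;

	/* Never grow by less than 2x the current capacity. */
	floor_size = (current > GROW_CAP / 2) ? GROW_CAP
					      : (size_t)current * 2;
	if (want < floor_size)
		want = floor_size;

	/* Round up to a 4 KiB boundary; the cap is already aligned. */
	want = (want + 0xFFFu) & ~(size_t)0xFFFu;

	return (unsigned int)want;
}
```

For example, a 5000-byte requirement against a 4096-byte buffer yields 12288 (10000 rounded up to the next 4 KiB multiple), while a 100-byte requirement against a 1 MiB buffer still doubles the buffer to 2 MiB.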
static VAStatus codec_store_buffer(struct request_data *driver_data,
struct object_context *context,
VAProfile profile,
@@ -216,36 +62,16 @@ static VAStatus codec_store_buffer(struct request_data *driver_data,
struct object_buffer *buffer_object)
{
switch (buffer_object->type) {
case VASliceDataBufferType: {
case VASliceDataBufferType:
/*
* Since there is no guarantee that the allocation
* order is the same as the submission order (via
* RenderPicture), we can't use a V4L2 buffer directly
* and have to copy from a regular buffer.
*
* Capacity guard (issue #13 + #15): surface_object->source_data
* points at an OUTPUT-pool mmap of size source_size, negotiated
* at S_FMT time. A stream-level resolution upshift can produce
* a slice larger than this allocation. Each append site below
* computes the post-append running total and calls
* codec_store_buffer_ensure_capacity, which transparently grows
* the OUTPUT pool (request_pool_resize) so the existing memcpy
* has room. The hard error path (VA_STATUS_ERROR_ALLOCATION_FAILED)
* only fires if the heap save buffer or the kernel-side grow
* fails, at which point libavcodec recreates the surface.
*/
size_t need;
VAStatus ensure_rc;
if (context->h264_start_code) {
static const char start_code[3] = { 0x00, 0x00, 0x01 };
need = (size_t)surface_object->slices_size +
sizeof(start_code);
ensure_rc = codec_store_buffer_ensure_capacity(
driver_data, surface_object, need);
if (ensure_rc != VA_STATUS_SUCCESS)
return ensure_rc;
memcpy(surface_object->source_data +
surface_object->slices_size,
start_code, sizeof(start_code));
@@ -279,32 +105,19 @@ static VAStatus codec_store_buffer(struct request_data *driver_data,
unsigned int header_size =
surface_object->params.vp8.picture.pic_fields.bits.key_frame == 0 ?
10 : 3;
need = (size_t)surface_object->slices_size + header_size;
ensure_rc = codec_store_buffer_ensure_capacity(
driver_data, surface_object, need);
if (ensure_rc != VA_STATUS_SUCCESS)
return ensure_rc;
memset(surface_object->source_data +
surface_object->slices_size,
0, header_size);
surface_object->slices_size += header_size;
}
{
size_t payload = (size_t)buffer_object->size *
buffer_object->count;
need = (size_t)surface_object->slices_size + payload;
ensure_rc = codec_store_buffer_ensure_capacity(
driver_data, surface_object, need);
if (ensure_rc != VA_STATUS_SUCCESS)
return ensure_rc;
memcpy(surface_object->source_data +
surface_object->slices_size,
buffer_object->data, payload);
surface_object->slices_size += payload;
}
buffer_object->data,
buffer_object->size * buffer_object->count);
surface_object->slices_size +=
buffer_object->size * buffer_object->count;
surface_object->slices_count++;
break;
}
case VAPictureParameterBufferType:
switch (profile) {
@@ -320,14 +133,12 @@ static VAStatus codec_store_buffer(struct request_data *driver_data,
case VAProfileH264ConstrainedBaseline:
case VAProfileH264MultiviewHigh:
case VAProfileH264StereoHigh:
case VAProfileH264High10:
memcpy(&surface_object->params.h264.picture,
buffer_object->data,
sizeof(surface_object->params.h264.picture));
break;
case VAProfileHEVCMain:
case VAProfileHEVCMain10:
memcpy(&surface_object->params.h265.picture,
buffer_object->data,
sizeof(surface_object->params.h265.picture));
@@ -349,6 +160,9 @@ static VAStatus codec_store_buffer(struct request_data *driver_data,
memcpy(&surface_object->params.av1.picture,
buffer_object->data,
sizeof(surface_object->params.av1.picture));
/* Reset per-frame tile group entry array on each new
* picture parameter buffer (start of a new frame). */
surface_object->params.av1.num_tile_group_entries = 0;
break;
default:
@@ -363,14 +177,12 @@ static VAStatus codec_store_buffer(struct request_data *driver_data,
case VAProfileH264ConstrainedBaseline:
case VAProfileH264MultiviewHigh:
case VAProfileH264StereoHigh:
case VAProfileH264High10:
memcpy(&surface_object->params.h264.slice,
buffer_object->data,
sizeof(surface_object->params.h264.slice));
break;
case VAProfileHEVCMain:
case VAProfileHEVCMain10: {
case VAProfileHEVCMain: {
unsigned int n = surface_object->params.h265.num_slices;
if (n < HEVC_MAX_SLICES_PER_FRAME) {
memcpy(&surface_object->params.h265.slices[n],
@@ -398,6 +210,17 @@ static VAStatus codec_store_buffer(struct request_data *driver_data,
sizeof(surface_object->params.vp9.slice));
break;
case VAProfileAV1Profile0: {
unsigned int n = surface_object->params.av1.num_tile_group_entries;
if (n < AV1_MAX_TILES) {
memcpy(&surface_object->params.av1.tile_group_entries[n],
buffer_object->data,
sizeof(VASliceParameterBufferAV1));
surface_object->params.av1.num_tile_group_entries = n + 1;
}
break;
}
default:
break;
}
@@ -418,7 +241,6 @@ static VAStatus codec_store_buffer(struct request_data *driver_data,
case VAProfileH264ConstrainedBaseline:
case VAProfileH264MultiviewHigh:
case VAProfileH264StereoHigh:
case VAProfileH264High10:
memcpy(&surface_object->params.h264.matrix,
buffer_object->data,
sizeof(surface_object->params.h264.matrix));
@@ -426,7 +248,6 @@ static VAStatus codec_store_buffer(struct request_data *driver_data,
break;
case VAProfileHEVCMain:
case VAProfileHEVCMain10:
memcpy(&surface_object->params.h265.iqmatrix,
buffer_object->data,
sizeof(surface_object->params.h265.iqmatrix));
@@ -486,7 +307,6 @@ static VAStatus codec_set_controls(struct request_data *driver_data,
case VAProfileH264ConstrainedBaseline:
case VAProfileH264MultiviewHigh:
case VAProfileH264StereoHigh:
case VAProfileH264High10:
rc = h264_set_controls(driver_data, context, profile,
surface_object);
if (rc < 0)
@@ -494,7 +314,6 @@ static VAStatus codec_set_controls(struct request_data *driver_data,
break;
case VAProfileHEVCMain:
case VAProfileHEVCMain10:
rc = h265_set_controls(driver_data, context, surface_object);
if (rc < 0)
return VA_STATUS_ERROR_OPERATION_FAILED;
@@ -511,22 +330,7 @@ static VAStatus codec_set_controls(struct request_data *driver_data,
if (rc < 0)
return VA_STATUS_ERROR_OPERATION_FAILED;
break;
case VAProfileAV1Profile0:
/*
* Populates V4L2_CID_STATELESS_AV1_SEQUENCE from
* VAPictureParameterBufferAV1. The daedalus_v4l2 daemon
* (issue #11 daemon track) synthesises an OBU_SEQUENCE_HEADER
* from this ctrl and prepends it to the slice bitstream
* before handing it to libavcodec/libdav1d, which otherwise
* cannot parse the (sequence-header-stripped) OUTPUT buffer
* that ffmpeg-vaapi delivers.
*
* On the RK3588 vpu981 hardware path the same SEQUENCE ctrl
* is harmless: vpu981's driver parses the OBU stream
* directly and ignores the ctrl payload, so no per-decoder
* gating is required here.
*/
rc = av1_set_controls(driver_data, context, surface_object);
if (rc < 0)
return VA_STATUS_ERROR_OPERATION_FAILED;
@@ -557,6 +361,12 @@ VAStatus RequestBeginPicture(VADriverContextP context, VAContextID context_id,
if (surface_object == NULL)
return VA_STATUS_ERROR_INVALID_SURFACE;
/* AV1 Phase 3 diag */
request_log("BeginPicture id=0x%x prev_slot=%p status=%d\n",
surface_object->base.id,
(void *)surface_object->current_slot,
surface_object->status);
if (surface_object->status == VASurfaceRendering)
RequestSyncSurface(context, surface_id);
@@ -568,9 +378,30 @@ VAStatus RequestBeginPicture(VADriverContextP context, VAContextID context_id,
* first. The new slot is bound and its V4L2 index + mmap pointers
* are mirrored into surface_object->destination_* so the existing
* QBUF/DQBUF/EXPBUF code paths see no behavioral change.
*
* AV1 Phase 3 finding: LIBVA_SKIP_REBIND=1 experiment (do NOT
* unbind on rebind) did not improve PASS count for the av1_larger
* film_grain stress vector, which indicates the iter2 Fix 3 release
* is NOT the source of the inter-frame divergence. The issue is
* deeper in ffmpeg-vaapi's AV1 hwaccel: per byte-equal OUTPUT
* comparison with the patched-ffmpeg-v4l2request reference run
* (LD_LIBRARY_PATH override on a debug libavcodec.so), 7/7 first
* EndPicture submissions are byte-identical, libva has 2 EXTRA.
*/
if (surface_object->current_slot != NULL)
surface_unbind_slot(driver_data, surface_object);
/*
* AV1 Phase 5 review Amendment 4: clear any stale
* linked_decode_surface_id from a prior film_grain display-to-decode
* link. If ffmpeg-vaapi recycles a former display surface as a
* decode target, BeginPicture binds a fresh slot, but without
* this reset copy_surface_to_image's link-follow would still
* borrow from the now-stale linked surface and serve wrong data.
* Cleared unconditionally (cheap) so the next AV1 grain frame
* re-establishes the link if needed.
*/
surface_object->linked_decode_surface_id = VA_INVALID_SURFACE;
{
struct cap_pool_slot *cap_slot =
cap_pool_acquire(&driver_data->capture_pool, surface_id);
+138 -206
@@ -93,10 +93,6 @@
static const char * const known_decoder_drivers[] = {
"rkvdec",
"hantro-vpu",
"rpi-hevc-dec", /* iter40: Pi 5 / CM5 stateless HEVC */
#ifdef HAVE_DAEDALUS_V4L2
"daedalus_v4l2", /* phase 8.10: Pi 5 daemon-backed VP9/AV1/H264 */
#endif
"cedrus",
"sun4i_csi",
NULL
@@ -329,6 +325,37 @@ static bool probe_hevc_ext_sps_rps_controls(int video_fd)
return true;
}
/*
* Inspect a /dev/videoN's OUTPUT formats for `want_pixfmt`. Returns true
* iff at least one OUTPUT/OUTPUT_MPLANE format matches.
*
* Used to discriminate between multiple devices sharing a driver name:
* RK3588 has 3 hantro-vpu instances and only one of them is vpu981 (the
* dedicated AV1 decoder advertising V4L2_PIX_FMT_AV1_FRAME).
*/
static bool video_node_supports_output_fmt(int video_fd, uint32_t want_pixfmt)
{
struct v4l2_fmtdesc desc;
const enum v4l2_buf_type types[] = {
V4L2_BUF_TYPE_VIDEO_OUTPUT_MPLANE,
V4L2_BUF_TYPE_VIDEO_OUTPUT,
};
unsigned int t, i;
for (t = 0; t < sizeof(types) / sizeof(types[0]); t++) {
for (i = 0; i < 64; i++) {
memset(&desc, 0, sizeof desc);
desc.index = i;
desc.type = types[t];
if (ioctl(video_fd, VIDIOC_ENUM_FMT, &desc) < 0)
break;
if (desc.pixelformat == want_pixfmt)
return true;
}
}
return false;
}
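The probe above walks each OUTPUT buffer type, enumerates formats by index until the ioctl fails, and stops at the first match. A minimal sketch of the same loop follows, with the VIDIOC_ENUM_FMT call replaced by a callback so it can run without a V4L2 device. Everything named here (`enum_fmt_fn`, `supports_output_fmt`, `fake_enum`, `SKETCH_FOURCC`, the type values 0 and 1) is hypothetical and exists only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for VIDIOC_ENUM_FMT: writes the pixel format at
 * (type, index) and returns 0, or returns -1 past the end of the list. */
typedef int (*enum_fmt_fn)(unsigned int type, unsigned int index,
			   uint32_t *pixfmt);

#define SKETCH_FOURCC(a, b, c, d) \
	((uint32_t)(a) | ((uint32_t)(b) << 8) | \
	 ((uint32_t)(c) << 16) | ((uint32_t)(d) << 24))

/* True iff any of the given buffer types lists `want` as a format. */
static bool supports_output_fmt(enum_fmt_fn enum_fmt,
				const unsigned int *types, size_t n_types,
				uint32_t want)
{
	size_t t;
	unsigned int i;
	uint32_t fmt;

	for (t = 0; t < n_types; t++) {
		/* Same 64-entry cap as the probe above. */
		for (i = 0; i < 64; i++) {
			if (enum_fmt(types[t], i, &fmt) < 0)
				break; /* end of list for this type */
			if (fmt == want)
				return true;
		}
	}
	return false;
}

/* Fake device: type 0 lists nothing, type 1 lists two formats. */
static int fake_enum(unsigned int type, unsigned int index, uint32_t *pixfmt)
{
	static const uint32_t fmts[] = {
		SKETCH_FOURCC('V', 'P', '8', 'F'),
		SKETCH_FOURCC('A', 'V', '1', 'F'),
	};

	if (type != 1 || index >= sizeof(fmts) / sizeof(fmts[0]))
		return -1;
	*pixfmt = fmts[index];
	return 0;
}

static const unsigned int sketch_types[] = { 0, 1 };
```

A failed enumeration on one buffer type only ends that inner loop, so a device that exposes formats on just one of the two OUTPUT types is still detected.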
static int find_decoder_device_by_driver(const char *want_driver,
char *video_out, size_t video_out_sz,
char *media_out, size_t media_out_sz)
@@ -376,6 +403,65 @@ static int find_decoder_device_by_driver(const char *want_driver,
return -1;
}
/*
* ampere-av1-enablement Phase 2: like find_decoder_device_by_driver, but
* additionally verifies the resolved /dev/videoN advertises `want_pixfmt`
* as an OUTPUT format. Required for RK3588 where 3 hantro-vpu instances
* share the driver name but only one is vpu981 (AV1 decoder).
*
* Walks all /dev/media* with matching driver name; takes the first hit
* whose OUTPUT formats include `want_pixfmt`. Non-matching candidates
* (encoder-only nodes, legacy hantro for MPEG2/VP8) are skipped.
*/
static int find_decoder_device_by_driver_with_fmt(const char *want_driver,
uint32_t want_pixfmt,
char *video_out,
size_t video_out_sz,
char *media_out,
size_t media_out_sz)
{
struct media_device_info info;
char path[32];
char vpath[32];
int fd, vfd, i;
for (i = 0; i < 16; i++) {
snprintf(path, sizeof path, "/dev/media%d", i);
fd = open(path, O_RDWR | O_NONBLOCK);
if (fd < 0)
continue;
memset(&info, 0, sizeof info);
if (ioctl(fd, MEDIA_IOC_DEVICE_INFO, &info) != 0) {
close(fd);
continue;
}
if (strcmp(info.driver, want_driver) != 0) {
close(fd);
continue;
}
if (find_decoder_video_node_via_topology(fd, vpath,
sizeof vpath) != 0) {
close(fd);
continue;
}
close(fd);
/* Capability check: does this /dev/videoN advertise the
* codec-specific OUTPUT format? */
vfd = open(vpath, O_RDWR | O_NONBLOCK);
if (vfd < 0)
continue;
if (video_node_supports_output_fmt(vfd, want_pixfmt)) {
close(vfd);
snprintf(video_out, video_out_sz, "%s", vpath);
snprintf(media_out, media_out_sz, "%s", path);
return 0;
}
close(vfd);
}
return -1;
}
static int find_codec_device(char *video_out, size_t video_out_sz,
char *media_out, size_t media_out_sz)
{
@@ -413,15 +499,7 @@ char request_device_kind_for_profile(VAProfile profile)
case VAProfileVP8Version0_3:
return 'h';
case VAProfileAV1Profile0:
/*
* ampere-av1-enablement Phase 2: RK3588 vpu981 dedicated
* AV1 hantro instance. 'a' kind dispatches to
* driver_data->video_fd_vpu981. On hosts without the AV1
* instance the fd stays -1 and RequestQueryConfigProfiles
* never enumerates AV1, so this branch is unreachable for
* non-RK3588 hosts.
*/
return 'a';
return 'a'; /* ampere-av1-enablement: vpu981 dedicated AV1 */
default:
return '?';
}
@@ -445,77 +523,15 @@ int request_switch_device_for_profile(struct request_data *driver_data,
char kind = request_device_kind_for_profile(profile);
int target_video, target_media;
/*
* iter40: HEVC override when rpi-hevc-dec is probed. The static
* table (request_device_kind_for_profile) maps HEVC -> 'r' (rkvdec)
* because that's the canonical RK path. On Pi 5 there's no rkvdec;
* rpi-hevc-dec is the only decoder. When BOTH would be present
* (hypothetical mixed board), prefer rpi-hevc-dec for HEVC.
*
* Other rkvdec-routed profiles (VP9, H.264) stay on 'r' because
* rpi-hevc-dec is HEVC-only.
*/
if ((profile == VAProfileHEVCMain || profile == VAProfileHEVCMain10) &&
driver_data->video_fd_rpi_hevc_dec >= 0 &&
driver_data->media_fd_rpi_hevc_dec >= 0) {
kind = 'p';
}
#ifdef HAVE_DAEDALUS_V4L2
/*
* LIBVA-1: VP9/AV1/H.264 -> daedalus_v4l2 when the daemon-backed
* decoder fd is open. Pi 5 has no rkvdec (those profiles map to
* 'r' by default -> video_fd_rkvdec = -1 -> "stay on whatever's
* active" fallback would put H.264 frames on rpi-hevc-dec's fd
* and S_FMT would fail). Re-route to the daedalus daemon instead.
*
* HEVC stays on 'p' (rpi-hevc-dec is HEVC-only; daedalus would
* accept it via FFmpeg, but rpi-hevc-dec has the GPU-backed
* hardware path so it's the right choice on this SoC).
*
* AV1 'a' kind (RK3588 vpu981) wins ONLY if vpu981 was probed.
* On a Pi 5 the vpu981 slot stays -1, so we still route AV1 to
* daedalus here. Check video_fd_vpu981 to preserve the RK3588
* priority for that case.
*/
if (driver_data->video_fd_daedalus >= 0 &&
driver_data->media_fd_daedalus >= 0) {
switch (profile) {
case VAProfileH264Main:
case VAProfileH264High:
case VAProfileH264ConstrainedBaseline:
case VAProfileH264MultiviewHigh:
case VAProfileH264StereoHigh:
case VAProfileVP9Profile0:
kind = 'd';
break;
case VAProfileAV1Profile0:
if (driver_data->video_fd_vpu981 < 0)
kind = 'd';
break;
default:
break;
}
}
#endif
if (kind == 'r') {
target_video = driver_data->video_fd_rkvdec;
target_media = driver_data->media_fd_rkvdec;
} else if (kind == 'h') {
target_video = driver_data->video_fd_hantro;
target_media = driver_data->media_fd_hantro;
} else if (kind == 'p') {
target_video = driver_data->video_fd_rpi_hevc_dec;
target_media = driver_data->media_fd_rpi_hevc_dec;
} else if (kind == 'a') {
target_video = driver_data->video_fd_vpu981;
target_media = driver_data->media_fd_vpu981;
#ifdef HAVE_DAEDALUS_V4L2
} else if (kind == 'd') {
target_video = driver_data->video_fd_daedalus;
target_media = driver_data->media_fd_daedalus;
#endif
} else {
return -1;
}
@@ -703,10 +719,6 @@ VAStatus VA_DRIVER_INIT_FUNC(VADriverContextP context)
driver_data->media_fd_rkvdec = -1;
driver_data->video_fd_hantro = -1;
driver_data->media_fd_hantro = -1;
driver_data->video_fd_rpi_hevc_dec = -1;
driver_data->media_fd_rpi_hevc_dec = -1;
driver_data->video_fd_daedalus = -1;
driver_data->media_fd_daedalus = -1;
driver_data->video_fd_vpu981 = -1;
driver_data->media_fd_vpu981 = -1;
@@ -739,36 +751,6 @@ VAStatus VA_DRIVER_INIT_FUNC(VADriverContextP context)
alt_driver = "rkvdec";
driver_data->video_fd_hantro = video_fd;
driver_data->media_fd_hantro = media_fd;
} else if (strcmp(info.driver, "rpi-hevc-dec") == 0) {
/* iter40 + LIBVA-1: Pi 5 / CM5. rpi-hevc-dec is
* HEVC-only. If daedalus_v4l2 is ALSO loaded (Pi 5
* mixed deployment: out-of-tree daemon-backed
* decoder for VP9/AV1/H264), pick it up as the alt
* so VP9/AV1/H264 have somewhere to land. */
primary_driver = "rpi-hevc-dec";
#ifdef HAVE_DAEDALUS_V4L2
alt_driver = "daedalus_v4l2";
#else
alt_driver = NULL;
#endif
driver_data->video_fd_rpi_hevc_dec = video_fd;
driver_data->media_fd_rpi_hevc_dec = media_fd;
#ifdef HAVE_DAEDALUS_V4L2
} else if (strcmp(info.driver, "daedalus_v4l2") == 0) {
/* phase 8.10 + LIBVA-1: Pi 5 daemon-backed decoder.
* VP9 / AV1 / H.264 route through it via the 'd'
* kind below. On a mixed-driver box where
* rpi-hevc-dec is ALSO loaded, pick it up as the
* alt so HEVC has somewhere to land too. find_codec_device's
* known_decoder_drivers[] order
* normally puts rpi-hevc-dec first (we hit the
* other branch in practice), but symmetric handling
* keeps us correct if probe order ever flips. */
primary_driver = "daedalus_v4l2";
alt_driver = "rpi-hevc-dec";
driver_data->video_fd_daedalus = video_fd;
driver_data->media_fd_daedalus = media_fd;
#endif
}
}
@@ -780,38 +762,15 @@ VAStatus VA_DRIVER_INIT_FUNC(VADriverContextP context)
int alt_v = open(alt_video, O_RDWR | O_NONBLOCK);
int alt_m = (alt_v >= 0) ? open(alt_media, O_RDWR | O_NONBLOCK) : -1;
if (alt_v >= 0 && alt_m >= 0) {
/* Dispatch into the matching per-driver slot.
* iter38 only had rkvdec/hantro pairs; iter40 +
* LIBVA-1 extended this to rpi-hevc-dec and
* daedalus_v4l2 for the Pi 5 mixed-decoder
* deployment. */
if (strcmp(alt_driver, "rkvdec") == 0) {
driver_data->video_fd_rkvdec = alt_v;
driver_data->media_fd_rkvdec = alt_m;
} else if (strcmp(alt_driver, "hantro-vpu") == 0) {
} else {
driver_data->video_fd_hantro = alt_v;
driver_data->media_fd_hantro = alt_m;
} else if (strcmp(alt_driver, "rpi-hevc-dec") == 0) {
driver_data->video_fd_rpi_hevc_dec = alt_v;
driver_data->media_fd_rpi_hevc_dec = alt_m;
#ifdef HAVE_DAEDALUS_V4L2
} else if (strcmp(alt_driver, "daedalus_v4l2") == 0) {
driver_data->video_fd_daedalus = alt_v;
driver_data->media_fd_daedalus = alt_m;
#endif
} else {
/* Shouldn't happen — primary_driver branches
* above only set alt_driver to one of the
* names handled here. Close and move on. */
close(alt_v);
close(alt_m);
alt_v = -1;
alt_m = -1;
}
if (alt_v >= 0) {
request_log("iter38: also opened %s decoder at %s + %s\n",
alt_driver, alt_video, alt_media);
}
} else {
if (alt_v >= 0) close(alt_v);
if (alt_m >= 0) close(alt_m);
@@ -821,57 +780,36 @@ VAStatus VA_DRIVER_INIT_FUNC(VADriverContextP context)
(void)primary_driver;
/*
* ampere-av1-enablement Phase 2: walk hantro-vpu media nodes
* for a SECOND one that advertises V4L2_PIX_FMT_AV1_FRAME
* (AV1F) as OUTPUT pixfmt. RK3588 has 3 hantro-vpu instances
* (legacy MPEG2/VP8 decoder, vepu121 encoder, vpu981 AV1
* decoder), all reporting driver="hantro-vpu" /
* model="hantro-vpu", so the OUTPUT-format probe is the only
* reliable disambiguator that doesn't depend on parsing
* card-name strings (which are DTS-dependent). First match wins.
*
* On non-RK3588 hosts the slot stays -1;
* RequestQueryConfigProfiles' AV1 push then no-ops because
* any_fd_supports_output_format() returns false for AV1F.
* ampere-av1-enablement Phase 2: additionally probe for
* vpu981 (RK3588's dedicated AV1 decoder). Driver name
* "hantro-vpu" alone is ambiguous on RK3588 (3 instances:
* legacy MPEG2/VP8, encoder, vpu981 AV1). Discriminate by
* V4L2_PIX_FMT_AV1_FRAME capability. If the primary or alt
* hantro happens to BE vpu981 (unlikely but possible on
* non-RK3588 boards), this probe finds it again and we just
* dedupe via the fd value.
*/
{
int i;
char path[32], av1_video[32];
for (i = 0; i < 16; i++) {
int mfd, vfd;
struct media_device_info info;
snprintf(path, sizeof path, "/dev/media%d", i);
mfd = open(path, O_RDWR | O_NONBLOCK);
if (mfd < 0) continue;
memset(&info, 0, sizeof info);
if (ioctl(mfd, MEDIA_IOC_DEVICE_INFO, &info) != 0 ||
strcmp(info.driver, "hantro-vpu") != 0) {
close(mfd);
continue;
static char av1_video[32], av1_media[32];
if (find_decoder_device_by_driver_with_fmt(
"hantro-vpu", V4L2_PIX_FMT_AV1_FRAME,
av1_video, sizeof av1_video,
av1_media, sizeof av1_media) == 0) {
int av1_v = open(av1_video, O_RDWR | O_NONBLOCK);
int av1_m = (av1_v >= 0)
? open(av1_media, O_RDWR | O_NONBLOCK)
: -1;
if (av1_v >= 0 && av1_m >= 0) {
driver_data->video_fd_vpu981 = av1_v;
driver_data->media_fd_vpu981 = av1_m;
request_log(
"ampere-av1: vpu981 AV1 decoder "
"at %s + %s\n",
av1_video, av1_media);
} else {
if (av1_v >= 0) close(av1_v);
if (av1_m >= 0) close(av1_m);
}
if (find_decoder_video_node_via_topology(
mfd, av1_video, sizeof av1_video) != 0) {
close(mfd);
continue;
}
vfd = open(av1_video, O_RDWR | O_NONBLOCK);
if (vfd < 0) {
close(mfd);
continue;
}
if (!v4l2_find_format(vfd, V4L2_BUF_TYPE_VIDEO_OUTPUT, V4L2_PIX_FMT_AV1_FRAME) &&
!v4l2_find_format(vfd, V4L2_BUF_TYPE_VIDEO_OUTPUT_MPLANE, V4L2_PIX_FMT_AV1_FRAME)) {
close(vfd);
close(mfd);
continue;
}
driver_data->video_fd_vpu981 = vfd;
driver_data->media_fd_vpu981 = mfd;
request_log("ampere-av1: vpu981 AV1 decoder at %s + %s\n",
av1_video, path);
break;
}
}
}
@@ -886,27 +824,29 @@ VAStatus VA_DRIVER_INIT_FUNC(VADriverContextP context)
probe_hevc_ext_sps_rps_controls(driver_data->video_fd_rkvdec);
driver_data->has_hevc_ext_sps_rps_hantro =
probe_hevc_ext_sps_rps_controls(driver_data->video_fd_hantro);
driver_data->has_hevc_ext_sps_rps_rpi_hevc_dec =
probe_hevc_ext_sps_rps_controls(driver_data->video_fd_rpi_hevc_dec);
if (driver_data->has_hevc_ext_sps_rps_rkvdec) {
request_log("iter2: kernel registers HEVC EXT_SPS_{ST,LT}_RPS "
"controls on rkvdec fd (will route through "
"vendored GStreamer parser)\n");
}
if (driver_data->video_fd_rpi_hevc_dec >= 0) {
request_log("iter40: also opened rpi-hevc-dec at video_fd=%d "
"media_fd=%d (Pi 5 HEVC stateless)\n",
driver_data->video_fd_rpi_hevc_dec,
driver_data->media_fd_rpi_hevc_dec);
/*
* ampere-av1 Phase 2.1: probe V4L2_CID_STATELESS_AV1_FILM_GRAIN
* on the vpu981 fd. Per Janet v3 amendment, this runs at backend
* init (not lazily) so any race window with concurrent device
* switching can't observe an inconsistent flag.
*/
driver_data->has_av1_film_grain = false;
if (driver_data->video_fd_vpu981 >= 0) {
struct v4l2_query_ext_ctrl qec;
if (v4l2_query_ext_ctrl(driver_data->video_fd_vpu981,
V4L2_CID_STATELESS_AV1_FILM_GRAIN,
&qec) == 0) {
driver_data->has_av1_film_grain = true;
request_log("ampere-av1: vpu981 advertises FILM_GRAIN "
"control (will include in per-frame batch)\n");
}
#ifdef HAVE_DAEDALUS_V4L2
if (driver_data->video_fd_daedalus >= 0) {
request_log("phase 8.10: opened daedalus_v4l2 at video_fd=%d "
"media_fd=%d (Pi 5 daemon-backed VP9/AV1/H264)\n",
driver_data->video_fd_daedalus,
driver_data->media_fd_daedalus);
}
#endif
status = VA_STATUS_SUCCESS;
goto complete;
@@ -954,23 +894,15 @@ VAStatus RequestTerminate(VADriverContextP context)
close(driver_data->video_fd_hantro);
if (driver_data->media_fd_hantro >= 0)
close(driver_data->media_fd_hantro);
if (driver_data->video_fd_rpi_hevc_dec >= 0)
close(driver_data->video_fd_rpi_hevc_dec);
if (driver_data->media_fd_rpi_hevc_dec >= 0)
close(driver_data->media_fd_rpi_hevc_dec);
if (driver_data->video_fd_vpu981 >= 0)
close(driver_data->video_fd_vpu981);
if (driver_data->media_fd_vpu981 >= 0)
close(driver_data->media_fd_vpu981);
#ifdef HAVE_DAEDALUS_V4L2
if (driver_data->video_fd_daedalus >= 0)
close(driver_data->video_fd_daedalus);
if (driver_data->media_fd_daedalus >= 0)
close(driver_data->media_fd_daedalus);
#endif
/* Fall back to direct close if neither alt fd captured the active
* pair (env-override path). */
if (driver_data->video_fd_rkvdec < 0 && driver_data->video_fd_hantro < 0) {
if (driver_data->video_fd_rkvdec < 0 &&
driver_data->video_fd_hantro < 0 &&
driver_data->video_fd_vpu981 < 0) {
if (driver_data->video_fd >= 0)
close(driver_data->video_fd);
if (driver_data->media_fd >= 0)
+22 -88
@@ -42,16 +42,7 @@
#define V4L2_REQUEST_STR_VENDOR "v4l2-request"
/*
* Sized for max-possible enumeration with iter39 Option B reverted:
* MPEG2(2) + H264(6 incl. Hi10P) + HEVC(2 incl. Main10) + VP8 + VP9 + AV1 = 13.
* The per-group guards use `if (... && index < (MAX_PROFILES - N))` where N
* is the push-group size, so MAX must be total+1 14 here. Bumping
* defensively now so a future re-enable of Hi10P/Main10 doesn't silently
* drop AV1 through the off-by-one trap that ate ampere-av1's enumeration
* for a week (see issue marfrit/libva-v4l2-request-fourier#2).
*/
#define V4L2_REQUEST_MAX_PROFILES 14
#define V4L2_REQUEST_MAX_PROFILES 11
#define V4L2_REQUEST_MAX_ENTRYPOINTS 5
#define V4L2_REQUEST_MAX_CONFIG_ATTRIBUTES 10
#define V4L2_REQUEST_MAX_IMAGE_FORMATS 10
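The off-by-one described in the MAX_PROFILES comment can be checked with a few lines of arithmetic. The sketch below models the guard `index < (MAX_PROFILES - N)` for a sequence of push groups; `enumerate_with_guard` and `sketch_groups` are hypothetical names, and treating each codec listed in the comment as one push group (2, 6, 2, 1, 1, 1) is an assumption made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Returns how many profiles end up enumerated when a group of N entries
 * is pushed only if index < (max_profiles - N). */
static unsigned int enumerate_with_guard(const unsigned int *group_sizes,
					 size_t n_groups,
					 unsigned int max_profiles)
{
	unsigned int index = 0;
	size_t g;

	for (g = 0; g < n_groups; g++) {
		unsigned int n = group_sizes[g];

		/* n <= max_profiles avoids unsigned wrap in the subtraction. */
		if (n <= max_profiles && index < max_profiles - n)
			index += n;
	}
	return index;
}

/* MPEG2, H264, HEVC, VP8, VP9, AV1 group sizes as listed in the comment. */
static const unsigned int sketch_groups[] = { 2, 6, 2, 1, 1, 1 };
```

With a limit equal to the total (13), the last single-entry group sees index 12 and the test `12 < 12` fails, so it is dropped; with total+1 (14) all 13 entries are pushed.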
@@ -87,45 +78,17 @@ struct request_data {
int media_fd_rkvdec;
int video_fd_hantro;
int media_fd_hantro;
/*
* iter40: third multi-device-probe slot for rpi-hevc-dec (Pi 5 /
* CM5 / BCM2712). V4L2 stateless HEVC; CAPTURE is NC12/NC30 SAND
* 128-pixel-wide column tiled (Pi-specific). On Pi 5 this is the
* ONLY decoder slot; on RK hosts it stays -1 and HEVC routes to
* rkvdec as before.
*/
int video_fd_rpi_hevc_dec;
int media_fd_rpi_hevc_dec;
/*
* phase 8.10: fifth multi-device-probe slot for daedalus_v4l2, the
* out-of-tree V4L2 stateless decoder shim that forwards bitstream
* to a userspace daemon (daedalus-v4l2 sibling repo). Daemon does
* FFmpeg-software decode for VP9 / AV1 / H.264 and ships pixels
* back via dmabuf into the CAPTURE buffer. Picked up via the
* same media-controller probe + known_decoder_drivers[] entry
* pattern as iter40 rpi-hevc-dec. Stays -1 on hosts without the
* daedalus module loaded; HEVC routes to rpi-hevc-dec as before.
* ampere-av1-enablement Phase 2: vpu981 is a THIRD physical
* hantro-vpu instance on RK3588 (separate from the legacy MPEG2/VP8
* hantro at /dev/video2). It's the dedicated AV1 decoder at
* /dev/video4 with card name "rockchip,rk3588-av1-vpu-dec".
*
* Fields are unconditional (8 bytes per session) so the struct
* layout is stable regardless of meson option. The active
* probe + dispatch code in request.c is gated by
* HAVE_DAEDALUS_V4L2; when disabled the fields stay at their
* -1 init and no codepath touches them.
*/
int video_fd_daedalus;
int media_fd_daedalus;
/*
* ampere-av1-enablement Phase 2: fourth multi-device-probe slot
* for vpu981 (RK3588's dedicated AV1 hantro instance, kernel
* card="rockchip,rk3588-av1-vpu-dec", driver name "hantro-vpu"
* shared with the legacy MPEG-2/VP8/H.264 hantro). Discriminated
* by V4L2_PIX_FMT_AV1_FRAME (AV1F) OUTPUT-pixfmt capability since
* the driver name alone is ambiguous on RK3588. Stays -1 on hosts
* without the AV1 vpu-dec.
*
* Named "vpu981" for consistency with the in-progress av1-iter1
* operator branch (Phase 3-5 bit-exact AV1 work; when that lands,
* these fields receive the actual decode dispatch wiring).
* Driver-name alone ("hantro-vpu") is ambiguous on RK3588: three
* instances share the name. The probe discriminates by capability:
* which OUTPUT format does the device advertise? Only vpu981
* exposes V4L2_PIX_FMT_AV1_FRAME.
*/
int video_fd_vpu981;
int media_fd_vpu981;
@@ -149,12 +112,18 @@ struct request_data {
*/
bool has_hevc_ext_sps_rps_rkvdec;
bool has_hevc_ext_sps_rps_hantro;
/* iter40: rpi-hevc-dec doesn't expose EXT_SPS_*_RPS controls
* (verified Phase 0 higgs probe: QUERY_EXT_CTRL on 0xa97 -> EINVAL).
* Probed for consistency with the iter2 pair-of-flags pattern;
* stays false on Pi 5 and the iter2 vendored-parser path naturally
* doesn't engage. */
bool has_hevc_ext_sps_rps_rpi_hevc_dec;
/*
* ampere-av1 Phase 2.1: probe result for the optional
* V4L2_CID_STATELESS_AV1_FILM_GRAIN control on the vpu981 fd.
* Probed at VA_DRIVER_INIT (per Janet v3 amendment: init-time,
* not lazy). Consumed by av1_set_controls to conditionally include
* the 4th control in the per-frame batch.
*
* True iff vpu981 advertises the control via VIDIOC_QUERY_EXT_CTRL.
* False for non-RK3588 hosts (no vpu981 fd) or older kernels.
*/
bool has_av1_film_grain;
/*
* iter2 cached SPS-derived RPS arrays. SPS NALs only appear in
@@ -179,30 +148,6 @@ struct request_data {
unsigned int hevc_rps_cache_lt_count;
bool hevc_rps_cache_valid;
/*
* iter40b: bitstream-derived SPS field cache for VAAPI-omitted
* fields. rpi-hevc-dec validates these against bitstream-true
* values; the rkvdec/hantro fallback (sps_max_dec_pic_buffering_minus1,
* 0) that satisfies §A.4.2 isn't enough for rpi.
*
* Cached on first IDR frame's SPS NAL parse, reused for subsequent
* non-IDR frames whose source_data may not carry an SPS.
*
* sps_max_sub_layers_minus1 is the index into max_*[] arrays. The
* V4L2 SPS struct fields are scalars (single sublayer), so we pick
* the HighestTid (= sps_max_sub_layers_minus1) slot, which matches
* ffmpeg-vaapi + kdirect convention.
*/
struct {
bool valid;
uint8_t sps_max_sub_layers_minus1;
uint8_t max_dec_pic_buffering_minus1;
uint8_t max_num_reorder_pics;
uint8_t max_latency_increase_plus1;
bool scaling_list_enabled;
bool scaling_list_data_present;
} hevc_sps_field_cache;
struct video_format *video_format;
/*
@@ -259,17 +204,6 @@ struct request_data {
unsigned int fmt_buffers_count;
unsigned int fmt_sizes[VIDEO_MAX_PLANES];
unsigned int fmt_bytesperlines[VIDEO_MAX_PLANES];
/*
* iter39: active session is decoding a 10-bit profile (Hi10P / Main10).
* Set in RequestCreateContext from config->profile. Drives:
* - CAPTURE pix_fmt selection (NV15 instead of NV12)
* - image.c DeriveImage / QueryImageFormats fourcc reporting (P010
* instead of NV12)
* - copy_surface_to_image NV15 -> P010 unpack branch
* Reset to false at DestroyContext.
*/
bool is_10bit;
};
VAStatus VA_DRIVER_INIT_FUNC(VADriverContextP context);
+1 -126
@@ -21,10 +21,7 @@
#include "v4l2.h"
int request_pool_init(struct request_pool *pool, int video_fd, int media_fd,
unsigned int output_type, unsigned int count,
unsigned int pixelformat,
unsigned int picture_width,
unsigned int picture_height)
unsigned int output_type, unsigned int count)
{
unsigned int index_base;
unsigned int length;
@@ -46,16 +43,6 @@ int request_pool_init(struct request_pool *pool, int video_fd, int media_fd,
pool->next = 0;
pool->media_fd = media_fd; /* iter7: kept for force_release re-alloc */
/*
* iter#15: cache the S_FMT params so request_pool_resize can
* re-issue S_FMT with a sizeimage hint override on overrun.
*/
pool->video_fd = video_fd;
pool->output_type = output_type;
pool->pixelformat = pixelformat;
pool->picture_width = picture_width;
pool->picture_height = picture_height;
for (i = 0; i < count; i++)
pool->slots[i].request_fd = -1;
@@ -107,118 +94,6 @@ error:
return -1;
}
int request_pool_resize(struct request_pool *pool,
unsigned int new_sizeimage_min)
{
unsigned int index_base;
unsigned int length;
unsigned int offset;
unsigned int saved_count;
unsigned int i;
int rc;
if (pool == NULL || !pool->initialized || pool->count == 0)
return -1;
/*
* Pre-condition guard: no slot may be borrowed when we tear the
* pool down. The caller in codec_store_buffer temporarily releases
* the current in-flight surface's slot before invoking us; the
* inline-Sync-in-EndPicture pattern guarantees no other slot is
* borrowed elsewhere in the driver. Bail loudly if anyone breaks
* that invariant rather than corrupting in-flight V4L2 state.
*/
for (i = 0; i < pool->count; i++) {
if (pool->slots[i].busy) {
request_log("request_pool_resize: slot %u still busy — "
"caller must release before resize\n", i);
return -1;
}
}
saved_count = pool->count;
/* STREAMOFF the OUTPUT queue so REQBUFS(0) is accepted. */
rc = v4l2_set_stream(pool->video_fd, pool->output_type, false);
if (rc < 0)
return -1;
/*
* Tear down every slot: munmap, close per-slot request_fd. Slot
* fields are zeroed in place so failure halfway is recoverable.
*/
for (i = 0; i < pool->count; i++) {
if (pool->slots[i].data != NULL && pool->slots[i].size > 0) {
munmap(pool->slots[i].data, pool->slots[i].size);
pool->slots[i].data = NULL;
pool->slots[i].size = 0;
}
if (pool->slots[i].request_fd >= 0) {
close(pool->slots[i].request_fd);
pool->slots[i].request_fd = -1;
}
}
/*
* Release the V4L2 OUTPUT buffer indices. REQBUFS(0) is the only
* way to ask the kernel to free buffers so CREATE_BUFS can re-
* allocate with a new per-buffer sizeimage.
*/
rc = v4l2_request_buffers(pool->video_fd, pool->output_type, 0);
if (rc < 0)
return -1;
/*
* Re-issue S_FMT with the cached dimensions but a larger
* sizeimage. The kernel may round up further (driver-specific
* page / alignment rules); we accept whatever it returns and
* pick that up from per-slot v4l2_query_buffer below.
*/
rc = v4l2_set_format_sizeimage(pool->video_fd, pool->output_type,
pool->pixelformat,
pool->picture_width,
pool->picture_height,
new_sizeimage_min);
if (rc < 0)
return -1;
rc = v4l2_create_buffers(pool->video_fd, pool->output_type,
saved_count, &index_base);
if (rc < 0)
return -1;
for (i = 0; i < saved_count; i++) {
pool->slots[i].index = index_base + i;
pool->slots[i].busy = false;
rc = v4l2_query_buffer(pool->video_fd, pool->output_type,
pool->slots[i].index,
&length, &offset, 1);
if (rc < 0)
return -1;
pool->slots[i].data = mmap(NULL, length,
PROT_READ | PROT_WRITE,
MAP_SHARED, pool->video_fd, offset);
if (pool->slots[i].data == MAP_FAILED) {
pool->slots[i].data = NULL;
return -1;
}
pool->slots[i].size = length;
pool->slots[i].request_fd = media_request_alloc(pool->media_fd);
if (pool->slots[i].request_fd < 0)
return -1;
}
rc = v4l2_set_stream(pool->video_fd, pool->output_type, true);
if (rc < 0)
return -1;
pool->next = 0;
return 0;
}
void request_pool_destroy(struct request_pool *pool)
{
unsigned int i;
+3 -58
@@ -52,71 +52,16 @@ struct request_pool {
int media_fd; /* iter7: kept for
* force_release re-alloc */
bool initialized;
/*
* iter#15: cached S_FMT params from request_pool_init, so
* request_pool_resize can re-S_FMT the OUTPUT queue with a new
* sizeimage override on a mid-session resolution upshift overrun
* without the caller having to re-thread these through six call
* sites. video_fd is also cached so the resize is fully
* self-contained: request_pool_resize takes only the pool and
* the new sizeimage hint.
*/
int video_fd;
unsigned int output_type;
unsigned int pixelformat;
unsigned int picture_width;
unsigned int picture_height;
};
/*
* Allocate count OUTPUT buffers via VIDIOC_CREATE_BUFS, query and mmap
* each, populate pool->slots[]. Caller must have already done
* VIDIOC_S_FMT on the OUTPUT queue. The S_FMT params (pixelformat,
* picture_width, picture_height) are stashed on the pool so that
* request_pool_resize can re-issue S_FMT with the same dimensions but
* a larger sizeimage hint. Returns 0 on success, -1 on failure.
* VIDIOC_S_FMT on the OUTPUT queue. Returns 0 on success, -1 on
* failure.
*/
int request_pool_init(struct request_pool *pool, int video_fd, int media_fd,
unsigned int output_type, unsigned int count,
unsigned int pixelformat,
unsigned int picture_width,
unsigned int picture_height);
/*
* iter#15: grow the OUTPUT pool's per-slot sizeimage in place.
*
* Issued from codec_store_buffer when an Annex-B start code / VP8
* header pad / slice payload won't fit in the current
* surface->source_size, i.e. the stream's per-frame bitstream budget
* has outgrown the OUTPUT pool slot's mmap (typical cause: SPS-driven
* resolution upshift mid-session).
*
* Steps:
* 1. STREAMOFF the OUTPUT queue.
* 2. munmap every slot, close every per-slot media-request fd.
* 3. VIDIOC_REQBUFS(count=0) to release the V4L2 buffer indices.
* 4. S_FMT with the cached pixelformat / picture_width /
* picture_height but a sizeimage hint of new_sizeimage_min.
* 5. CREATE_BUFS with the original slot count.
* 6. Per-slot: query buffer length, mmap, alloc fresh request_fd.
* 7. STREAMON.
*
* Returns 0 on success, -1 on failure (caller falls back to
* VA_STATUS_ERROR_ALLOCATION_FAILED; the libva consumer then recreates
* the surface at the new resolution).
*
* Pre-condition: NO pool slot is currently borrowed (busy=false on
* every slot) AND no buffer is in-flight on the OUTPUT queue. The
* inline-Sync-in-EndPicture pattern (RequestEndPicture calls
* RequestSyncSurface before returning) makes this trivially true at
* codec_store_buffer time for the only-supported single-context
* single-render-surface flow: the in-flight surface's slot is the
* sole borrowed slot, and the resize caller temporarily releases it
* before calling here.
*/
int request_pool_resize(struct request_pool *pool,
unsigned int new_sizeimage_min);
unsigned int output_type, unsigned int count);
/*
* Munmap all slots and free the slots array. Idempotent.
+11 -11
@@ -111,6 +111,13 @@ void surface_unbind_slot(struct request_data *driver_data,
{
if (surface_object->current_slot == NULL)
return;
/* AV1 Phase 3 diag: log every unbind with surface id + slot idx
* + status, to confirm whether BeginPicture rebind is racing the
* consumer's vaGetImage on the previous frame. */
request_log("surface_unbind_slot id=0x%x status=%d slot_idx=%u\n",
surface_object->base.id,
surface_object->status,
surface_object->current_slot->v4l2_index);
cap_pool_release(&driver_data->capture_pool, surface_object->current_slot);
surface_object->current_slot = NULL;
}
@@ -182,9 +189,7 @@ VAStatus RequestCreateSurfaces2(VADriverContextP context, unsigned int format,
* surface_bind_format_uniform_fields(); the per-slot
* destination_* fields fill at BeginPicture via surface_bind_slot.
*/
/* iter39: allow YUV420_10 for Hi10P / Main10 surface allocation. */
if (format != VA_RT_FORMAT_YUV420 &&
format != VA_RT_FORMAT_YUV420_10)
if (format != VA_RT_FORMAT_YUV420)
return VA_STATUS_ERROR_UNSUPPORTED_RT_FORMAT;
for (i = 0; i < surfaces_count; i++) {
@@ -194,6 +199,8 @@ VAStatus RequestCreateSurfaces2(VADriverContextP context, unsigned int format,
return VA_STATUS_ERROR_ALLOCATION_FAILED;
surface_object->current_slot = NULL; /* iter2 Fix 3 */
surface_object->linked_decode_surface_id = VA_INVALID_SURFACE;
surface_object->av1_order_hint = 0;
surface_object->destination_index = 0; /* set on bind */
surface_object->destination_planes_count = 0; /* set at CreateContext */
surface_object->destination_buffers_count = 0; /* set at CreateContext */
@@ -708,14 +715,7 @@ VAStatus RequestExportSurfaceHandle(VADriverContextP context,
planes_count = surface_object->destination_planes_count;
/* iter39: 10-bit session exports a DRM_FORMAT_NV15 buffer; advertise
* the matching fourcc so a PRIME consumer aware of NV15 (panfrost-
* Mesa et al.) can import correctly. PRIME consumers that only know
* NV12 / P010 should use the COPY (vaGetImage) path which unpacks
* NV15 to P010 in image.c::copy_surface_to_image. */
surface_descriptor->fourcc = driver_data->is_10bit
? VA_FOURCC('N', 'V', '1', '5')
: VA_FOURCC_NV12;
surface_descriptor->fourcc = VA_FOURCC_NV12;
surface_descriptor->width = surface_object->width;
surface_descriptor->height = surface_object->height;
surface_descriptor->num_objects = export_fds_count;
+36 -8
@@ -89,6 +89,33 @@ struct object_surface {
struct timeval timestamp;
/*
* AV1 Phase 3: for streams with apply_grain=1, VAAPI's
* VADecPictureParameterBufferAV1 carries current_display_picture
* (display-time surface) separate from current_frame (decode
* target). vpu981 HW applies grain inline to the decode CAPTURE
* buffer, so the decoded data lives in current_frame's slot but
* ffmpeg calls vaGetImage on current_display_picture which has no
* slot bound. linked_decode_surface_id, set in av1_set_controls
* on the display surface, points to the decode surface so
* copy_surface_to_image can borrow its destination_data[].
*
* VA_INVALID_SURFACE = no link (the common case: 8-bit codecs,
* AV1 with apply_grain=0, AV1 frames where cur_frame ==
* cur_display).
*/
VASurfaceID linked_decode_surface_id;
/*
* AV1 Phase 3: AV1 order_hint of the frame currently decoded into
* this surface. VAAPI's VADecPictureParameterBufferAV1.order_hint
* is per-frame; kernel's v4l2_ctrl_av1_frame.order_hints[8] is
* per-reference. We track each decoded frame's order_hint here so
* the next frame's av1_set_controls can populate order_hints[i]
* from ref_frame_map[i] -> SURFACE -> av1_order_hint.
*/
uint8_t av1_order_hint;
union {
struct {
VAPictureParameterBufferMPEG2 picture;
@@ -122,17 +149,18 @@ struct object_surface {
VADecPictureParameterBufferVP9 picture;
VASliceParameterBufferVP9 slice;
} vp9;
struct {
/*
* AV1 picture parameter buffer. Slice params are
* intentionally absent: the daedalus daemon track
* (issue #11) consumes the slice OBU bytes directly
* from the OUTPUT bitstream and synthesises only the
* sequence-header OBU from
* V4L2_CID_STATELESS_AV1_SEQUENCE. No per-tile-group
* struct-to-OBU re-synthesis is required from libva today.
* ampere-av1-enablement: AV1 needs picture-header +
* variable number of slice/tile params (one per tile).
* tile_group_entries[] holds parsed VASliceParameterBufferAV1
* entries up to MAX_TILES; av1.c builds the matching
* v4l2_ctrl_av1_tile_group_entry[] at set_controls time.
*/
struct {
#define AV1_MAX_TILES 128
VADecPictureParameterBufferAV1 picture;
VASliceParameterBufferAV1 tile_group_entries[AV1_MAX_TILES];
unsigned int num_tile_group_entries;
} av1;
} params;
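The av1_order_hint comment above describes copying a per-surface value into a per-reference array. A minimal sketch of that step follows, under stated assumptions: struct sketch_surface, sketch_order_hint_for_ref and sketch_fill_order_hints are hypothetical names, not the code in av1.c, and mapping an empty reference slot to 0 is an assumption of this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_AV1_REFS 8 /* mirrors order_hints[8] named in the comment */

/* Hypothetical stand-in for struct object_surface; only the field
 * discussed above is modelled. */
struct sketch_surface {
	uint8_t av1_order_hint;
};

/* Order hint for one reference slot. An empty slot (NULL) maps to 0,
 * which is an assumption of this sketch, not documented behaviour. */
static inline uint8_t
sketch_order_hint_for_ref(const struct sketch_surface *surface)
{
	return surface ? surface->av1_order_hint : 0;
}

/* Fill the per-reference array from the surfaces that the current
 * frame's reference map points at. */
static inline void
sketch_fill_order_hints(uint8_t order_hints[SKETCH_AV1_REFS],
			const struct sketch_surface *const refs[SKETCH_AV1_REFS])
{
	for (unsigned int i = 0; i < SKETCH_AV1_REFS; i++)
		order_hints[i] = sketch_order_hint_for_ref(refs[i]);
}
```

Keeping the hint on the surface means the lookup needs no side table keyed by surface id; whether the real driver indexes the array exactly this way is not shown in this diff.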
+26 -73
@@ -113,28 +113,6 @@ static void v4l2_setup_format(struct v4l2_format *format, unsigned int type,
}
}
static void v4l2_setup_format_sizeimage(struct v4l2_format *format,
unsigned int type,
unsigned int width, unsigned int height,
unsigned int pixelformat,
unsigned int sizeimage)
{
memset(format, 0, sizeof(*format));
format->type = type;
if (v4l2_type_is_mplane(type)) {
format->fmt.pix_mp.width = width;
format->fmt.pix_mp.height = height;
format->fmt.pix_mp.plane_fmt[0].sizeimage = sizeimage;
format->fmt.pix_mp.pixelformat = pixelformat;
} else {
format->fmt.pix.width = width;
format->fmt.pix.height = height;
format->fmt.pix.sizeimage = sizeimage;
format->fmt.pix.pixelformat = pixelformat;
}
}
bool v4l2_find_format(int video_fd, unsigned int type, unsigned int pixelformat)
{
struct v4l2_fmtdesc fmtdesc;
@@ -194,30 +172,6 @@ int v4l2_set_format(int video_fd, unsigned int type, unsigned int pixelformat,
return 0;
}
int v4l2_set_format_sizeimage(int video_fd, unsigned int type,
unsigned int pixelformat,
unsigned int width, unsigned int height,
unsigned int sizeimage)
{
struct v4l2_format format;
int rc;
if (sizeimage == 0)
return v4l2_set_format(video_fd, type, pixelformat, width, height);
v4l2_setup_format_sizeimage(&format, type, width, height, pixelformat,
sizeimage);
rc = ioctl(video_fd, VIDIOC_S_FMT, &format);
if (rc < 0) {
request_log("Unable to set format (sizeimage=%u) for type %d: %s\n",
sizeimage, type, strerror(errno));
return -1;
}
return 0;
}
int v4l2_get_format(int video_fd, unsigned int type, unsigned int *width,
unsigned int *height, unsigned int *bytesperline,
unsigned int *sizes, unsigned int *planes_count)
@@ -479,6 +433,7 @@ static int v4l2_ioctl_controls(int video_fd, int request_fd, unsigned long ioc,
unsigned int num_controls)
{
struct v4l2_ext_controls controls;
int rc;
memset(&controls, 0, sizeof(controls));
@@ -490,7 +445,28 @@ static int v4l2_ioctl_controls(int video_fd, int request_fd, unsigned long ioc,
controls.request_fd = request_fd;
}
return ioctl(video_fd, ioc, &controls);
rc = ioctl(video_fd, ioc, &controls);
if (rc < 0) {
/* ampere-av1 Phase 2.1 diag: surface error_idx so the caller's
* error path knows which CID failed validation. error_idx >=
* count means the failure was pre-validation (e.g., bad
* request_fd). errno carries the syscall-level reason. */
const char *failed_cid_label = "<pre-validation>";
unsigned int failed_size = 0;
if (controls.error_idx < num_controls) {
failed_size = control_array[controls.error_idx].size;
(void)failed_cid_label; /* keep symbol if logger truncates */
}
request_log("v4l2_ioctl_controls: rc=%d errno=%d (%s) "
"ioc=0x%lx error_idx=%u count=%u "
"failed_cid=0x%x failed_size=%u\n",
rc, errno, strerror(errno), ioc,
controls.error_idx, num_controls,
controls.error_idx < num_controls
? control_array[controls.error_idx].id : 0,
failed_size);
}
return rc;
}
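The error_idx handling in v4l2_ioctl_controls can be isolated as a small pure function. This is a sketch only: sketch_failed_cid is a hypothetical helper, and it takes a plain array of control ids instead of struct v4l2_ext_control so that it builds without kernel headers.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: return the id of the control that the kernel
 * flagged, or 0 when error_idx does not point at a specific control
 * (error_idx >= count, the pre-validation case in the comment above). */
static inline uint32_t sketch_failed_cid(const uint32_t *ids,
					 unsigned int count,
					 unsigned int error_idx)
{
	if (error_idx < count)
		return ids[error_idx];

	return 0;
}
```

With ids {0x10, 0x20, 0x30} and error_idx 1 the helper returns 0x20; with error_idx 3 it returns 0, matching the failed_cid=0x0 that the log line prints for an ioctl-level failure.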
int v4l2_get_controls(int video_fd, int request_fd,
@@ -522,35 +498,12 @@ int v4l2_set_controls(int video_fd, int request_fd,
struct v4l2_ext_control *control_array,
unsigned int num_controls)
{
struct v4l2_ext_controls controls;
int rc;
memset(&controls, 0, sizeof(controls));
controls.controls = control_array;
controls.count = num_controls;
if (request_fd >= 0) {
controls.which = V4L2_CTRL_WHICH_REQUEST_VAL;
controls.request_fd = request_fd;
}
rc = ioctl(video_fd, VIDIOC_S_EXT_CTRLS, &controls);
rc = v4l2_ioctl_controls(video_fd, request_fd, VIDIOC_S_EXT_CTRLS,
control_array, num_controls);
if (rc < 0) {
/* error_idx is the index of the first failing control;
* if it equals count, the ioctl itself failed (not a
* specific control payload). Useful for triaging
* which V4L2_CID_STATELESS_* the kernel rejected. */
if (controls.error_idx < num_controls)
request_log("Unable to set control(s): %s "
"(error_idx=%u/%u failing_ctrl_id=0x%x size=%u)\n",
strerror(errno),
controls.error_idx, controls.count,
control_array[controls.error_idx].id,
control_array[controls.error_idx].size);
else
request_log("Unable to set control(s): %s "
"(error_idx=%u/%u ioctl-level)\n",
strerror(errno),
controls.error_idx, controls.count);
request_log("Unable to set control(s): %s\n", strerror(errno));
return -1;
}
-11
@@ -36,17 +36,6 @@ bool v4l2_find_format(int video_fd, unsigned int type,
unsigned int pixelformat);
int v4l2_set_format(int video_fd, unsigned int type, unsigned int pixelformat,
unsigned int width, unsigned int height);
/*
* Same as v4l2_set_format but explicitly overrides the OUTPUT
* sizeimage hint. Pass sizeimage=0 to get the v4l2_set_format default
* (SOURCE_SIZE_MAX for OUTPUT, 0 for CAPTURE). Used by
* request_pool_resize on a mid-session bitstream-budget overrun to
* grow the OUTPUT pool slots past the SOURCE_SIZE_MAX floor.
*/
int v4l2_set_format_sizeimage(int video_fd, unsigned int type,
unsigned int pixelformat,
unsigned int width, unsigned int height,
unsigned int sizeimage);
int v4l2_get_format(int video_fd, unsigned int type, unsigned int *width,
unsigned int *height, unsigned int *bytesperline,
unsigned int *sizes, unsigned int *planes_count);
-34
@@ -31,8 +31,6 @@
#include <drm_fourcc.h>
#include <linux/videodev2.h>
#include "nv12_col128.h" /* fallback V4L2_PIX_FMT_NV12_COL128 define */
#include "nv15.h" /* fallback V4L2_PIX_FMT_NV15 define */
#include "utils.h"
#include "video.h"
@@ -47,38 +45,6 @@ static struct video_format formats[] = {
.planes_count = 2,
.bpp = 16,
},
{
.description = "NV15 YUV (10-bit, rkvdec)",
.v4l2_format = V4L2_PIX_FMT_NV15,
.v4l2_buffers_count = 1,
.v4l2_mplane = true,
.drm_format = DRM_FORMAT_NV15,
.drm_modifier = DRM_FORMAT_MOD_NONE,
.planes_count = 2,
.bpp = 24,
},
{
/*
* iter40: Pi 5 / CM5 rpi-hevc-dec CAPTURE format. 8-bit NV12
* stored as 128-pixel-wide column tiles (SAND128 layout).
* Pi-specific; not in mainline drm_fourcc.h (uses NV12 + a
* BROADCOM_SAND128 modifier for DRM_PRIME). Our consumer path
* always detiles to linear NV12 in copy_surface_to_image, so
* we don't expose the SAND modifier downstream; drm_format is
* still DRM_FORMAT_NV12 and drm_modifier MOD_NONE so the
* format-is-linear gate doesn't pull us into tiled_to_planar
* (Sunxi-specific). image.c branches on v4l2_format ==
* V4L2_PIX_FMT_NV12_COL128 to invoke the dedicated detile.
*/
.description = "NV12 SAND128 (8-bit, rpi-hevc-dec)",
.v4l2_format = V4L2_PIX_FMT_NV12_COL128,
.v4l2_buffers_count = 1,
.v4l2_mplane = true,
.drm_format = DRM_FORMAT_NV12,
.drm_modifier = DRM_FORMAT_MOD_NONE,
.planes_count = 2,
.bpp = 16,
},
// Code to handle this DRM_FORMAT is __arm__ only
#ifdef __arm__
{
-196
@@ -1,196 +0,0 @@
/*
* Copyright (C) 2026 claude-noether <claude-noether@reauktion.de>
*
* MIT-licensed per project. iter40 self-test for nv12_col128 detile.
*
* Build an NC12-tiled source buffer from a known linear NV12 image,
* run the detile primitive, assert output matches the original. No
* hardware needed: pure bit-layout verification of the kernel math
* (drivers/media/platform/raspberrypi/hevc_dec/hevc_d_video.c
* V4L2_PIX_FMT_NV12_COL128 case + ffmpeg/Kynesim per-pixel offset).
*
* Build:
* cc -Wall -Werror -O2 -o test_nv12_col128_detile \
* tests/test_nv12_col128_detile.c src/nv12_col128.c
*
* Exit 0 = all asserts pass.
*/
#include "../src/nv12_col128.h"
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#define TILE_W 128
static unsigned int align_up(unsigned int v, unsigned int a)
{
return (v + a - 1) & ~(a - 1);
}
/* Pack a linear plane (width × height bytes, stride=width) into NC12
* layout: each 128-wide column held contiguously, columns at offsets
* col * col_stride * 128. col_stride is the kernel-reported bytesperline
* = ALIGN(height, 8) * 3/2. Returns the buffer + sizes. */
static uint8_t *pack_to_nc12(const uint8_t *linear,
unsigned int width, unsigned int height,
unsigned int *out_col_stride,
size_t *out_size)
{
unsigned int aligned_w = align_up(width, TILE_W);
unsigned int aligned_h = align_up(height, 8);
unsigned int col_stride = aligned_h * 3 / 2;
unsigned int num_cols = aligned_w / TILE_W;
size_t total = (size_t)col_stride * aligned_w;
uint8_t *buf;
unsigned int col, y, in_col;
buf = calloc(1, total);
assert(buf != NULL);
for (col = 0; col < num_cols; col++) {
uint8_t *col_base = buf + (size_t)col * TILE_W * col_stride;
for (y = 0; y < height; y++) {
for (in_col = 0; in_col < TILE_W; in_col++) {
unsigned int x = col * TILE_W + in_col;
if (x >= width)
break;
col_base[(size_t)y * TILE_W + in_col] =
linear[(size_t)y * width + x];
}
}
}
*out_col_stride = col_stride;
*out_size = total;
return buf;
}
static void test_detile_y(unsigned int width, unsigned int height)
{
uint8_t *linear, *tiled, *recovered;
unsigned int col_stride;
size_t tile_size, i;
linear = malloc((size_t)width * height);
assert(linear != NULL);
/* Distinctive content per pixel: y * 17 + x * 13 — avoids byte-
* aliasing patterns that could mask off-by-one bugs. */
for (unsigned int y = 0; y < height; y++)
for (unsigned int x = 0; x < width; x++)
linear[(size_t)y * width + x] = (uint8_t)(y * 17 + x * 13);
tiled = pack_to_nc12(linear, width, height, &col_stride, &tile_size);
recovered = calloc(1, (size_t)width * height);
assert(recovered != NULL);
nv12_col128_detile_y(recovered, width, tiled, col_stride, width, height);
for (i = 0; i < (size_t)width * height; i++) {
if (recovered[i] != linear[i]) {
fprintf(stderr,
"FAIL %ux%u Y: pixel %zu (x=%zu y=%zu) "
"linear=0x%02x recovered=0x%02x\n",
width, height, i,
i % width, i / width,
linear[i], recovered[i]);
free(linear); free(tiled); free(recovered);
exit(1);
}
}
printf("PASS %ux%u Y plane (%u columns, col_stride=%u, tile_size=%zu)\n",
width, height, align_up(width, TILE_W) / TILE_W,
col_stride, tile_size);
free(linear);
free(tiled);
free(recovered);
}
static void test_detile_uv(unsigned int width, unsigned int height)
{
unsigned int uv_h = height / 2;
uint8_t *linear, *tiled, *recovered;
unsigned int col_stride;
size_t tile_size, i;
linear = malloc((size_t)width * uv_h);
assert(linear != NULL);
for (unsigned int y = 0; y < uv_h; y++)
for (unsigned int x = 0; x < width; x++)
linear[(size_t)y * width + x] = (uint8_t)(y * 23 + x * 7);
tiled = pack_to_nc12(linear, width, uv_h, &col_stride, &tile_size);
recovered = calloc(1, (size_t)width * uv_h);
assert(recovered != NULL);
nv12_col128_detile_uv(recovered, width, tiled, col_stride, width, uv_h);
for (i = 0; i < (size_t)width * uv_h; i++) {
if (recovered[i] != linear[i]) {
fprintf(stderr,
"FAIL %ux%u UV: pixel %zu linear=0x%02x recovered=0x%02x\n",
width, height, i,
linear[i], recovered[i]);
free(linear); free(tiled); free(recovered);
exit(1);
}
}
printf("PASS %ux%u UV plane\n", width, height);
free(linear);
free(tiled);
free(recovered);
}
static void test_uv_offset(void)
{
/* Per the SAND COL128 layout, Y and UV are interleaved within
* EACH column (not concatenated as separate planes), so the UV
* plane base pointer is offset by 128 * ALIGN(height, 8), i.e. the
* Y portion of column 0. NOT 128 * height * num_columns (the
* size of all Y across all columns), which was an earlier wrong
* formula caught by Phase 7 SEGV on higgs. */
unsigned int off = nv12_col128_uv_plane_offset(1280, 720);
if (off != 128u * 720) {
fprintf(stderr, "FAIL UV offset 1280×720: got %u expected %u\n",
off, 128u * 720);
exit(1);
}
printf("PASS UV offset 1280×720 = %u\n", off);
off = nv12_col128_uv_plane_offset(1366, 768);
if (off != 128u * 768) {
fprintf(stderr, "FAIL UV offset 1366×768: got %u expected %u\n",
off, 128u * 768);
exit(1);
}
printf("PASS UV offset 1366×768 (column-misaligned width)\n");
}
int main(void)
{
/* Phase 3 fixture sizes — all 128-aligned, 8-line-aligned. */
test_detile_y(640, 360);
test_detile_y(1280, 720);
test_detile_y(1920, 1080);
/* Phase 5 review F4: column-misaligned width (1366 → 1408 padding). */
test_detile_y(1366, 768);
/* UV plane (half-height) at each width. */
test_detile_uv(640, 360);
test_detile_uv(1280, 720);
test_detile_uv(1920, 1080);
test_detile_uv(1366, 768);
test_uv_offset();
printf("All NC12 detile asserts pass.\n");
return 0;
}
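The detile primitive exercised by this test lives in src/nv12_col128.c, which this diff does not show. A minimal sketch of the inverse of pack_to_nc12 follows, assuming the same layout as the packer above (128-byte-wide columns, each column col_stride rows tall). Every sketch_* name is hypothetical and this is not the project's implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_TILE_W 128u

/* Byte offset of pixel (x, y) inside a column-tiled plane, using the
 * same arithmetic as pack_to_nc12 above. */
static inline size_t sketch_col128_offset(unsigned int x, unsigned int y,
					  unsigned int col_stride)
{
	unsigned int col = x / SKETCH_TILE_W;
	unsigned int in_col = x % SKETCH_TILE_W;

	return (size_t)col * SKETCH_TILE_W * col_stride +
	       (size_t)y * SKETCH_TILE_W + in_col;
}

/* Copy a tiled plane into a linear buffer, one pixel at a time. */
static inline void sketch_col128_detile(uint8_t *dst, unsigned int dst_stride,
					const uint8_t *src,
					unsigned int col_stride,
					unsigned int width, unsigned int height)
{
	for (unsigned int y = 0; y < height; y++)
		for (unsigned int x = 0; x < width; x++)
			dst[(size_t)y * dst_stride + x] =
				src[sketch_col128_offset(x, y, col_stride)];
}

/* UV base offset as stated in test_uv_offset: 128 * ALIGN(height, 8). */
static inline unsigned int sketch_col128_uv_offset(unsigned int height)
{
	return SKETCH_TILE_W * ((height + 7u) & ~7u);
}
```

A per-pixel loop is the simplest form to check against the packer; a production detile would more likely copy up to 128 bytes per row with memcpy.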
-224
@@ -1,224 +0,0 @@
/*
* Copyright (C) 2026 claude-noether <claude-noether@reauktion.de>
*
* Permission is hereby granted, free of charge, to any person obtaining a
* copy of this software and associated documentation files (the
* "Software"), to deal in the Software without restriction, including
* without limitation the rights to use, copy, modify, merge, publish,
* distribute, sub license, and/or sell copies of the Software, and to
* permit persons to whom the Software is furnished to do so, subject to
* the following conditions:
*
* The above copyright notice and this permission notice (including the
* next paragraph) shall be included in all copies or substantial portions
* of the Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS
* OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
* MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NON-INFRINGEMENT.
* IN NO EVENT SHALL PRECISION INSIGHT AND/OR ITS SUPPLIERS BE LIABLE FOR
* ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
* TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
* SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
*/
/*
* iter39 self-test for nv15_unpack_plane_to_p010.
*
* Builds NV15 plane buffers from known 10-bit pixel arrays, runs the
* unpack, asserts P010 output matches the expected pixel<<6 values.
* No hardware needed: pure bit-layout verification per
* Documentation/userspace-api/media/v4l/pixfmt-nv15.rst.
*
* Build:
* cc -Wall -Werror -O2 -o test_nv15_unpack tests/test_nv15_unpack.c src/nv15.c
*
* Exit 0 = all asserts pass.
*/
#include "../src/nv15.h"
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
/* Pack 4 10-bit pixels into 5 bytes per NV15 layout (LSB-first across
* bits 0..39). Inverse of nv15_unpack_plane_to_p010's per-group unpack. */
static void pack4(uint16_t a, uint16_t b, uint16_t c, uint16_t d,
uint8_t out[5])
{
out[0] = (uint8_t)(a & 0xFF);
out[1] = (uint8_t)(((a >> 8) & 0x03) | ((b & 0x3F) << 2));
out[2] = (uint8_t)(((b >> 6) & 0x0F) | ((c & 0x0F) << 4));
out[3] = (uint8_t)(((c >> 4) & 0x3F) | ((d & 0x03) << 6));
out[4] = (uint8_t)((d >> 2) & 0xFF);
}
#define ASSERT_EQ(actual, expected, msg) do { \
if ((actual) != (expected)) { \
fprintf(stderr, "FAIL %s: actual=0x%04x expected=0x%04x at %s:%d\n", \
(msg), (unsigned)(actual), (unsigned)(expected), \
__FILE__, __LINE__); \
exit(1); \
} \
} while (0)
static void test_pack_unpack_roundtrip(uint16_t a, uint16_t b, uint16_t c,
uint16_t d)
{
uint8_t packed[5];
uint16_t dst[4];
pack4(a, b, c, d, packed);
nv15_unpack_plane_to_p010(packed, dst, 4, 1, 5);
ASSERT_EQ(dst[0], (uint16_t)(a << 6), "roundtrip a");
ASSERT_EQ(dst[1], (uint16_t)(b << 6), "roundtrip b");
ASSERT_EQ(dst[2], (uint16_t)(c << 6), "roundtrip c");
ASSERT_EQ(dst[3], (uint16_t)(d << 6), "roundtrip d");
}
static void test_zero(void)
{
uint8_t packed[5] = { 0, 0, 0, 0, 0 };
uint16_t dst[4] = { 0xDEAD, 0xDEAD, 0xDEAD, 0xDEAD };
nv15_unpack_plane_to_p010(packed, dst, 4, 1, 5);
ASSERT_EQ(dst[0], 0, "zero[0]");
ASSERT_EQ(dst[1], 0, "zero[1]");
ASSERT_EQ(dst[2], 0, "zero[2]");
ASSERT_EQ(dst[3], 0, "zero[3]");
}
static void test_all_max(void)
{
/* All four pixels = 0x3FF (max 10-bit). Packed bits all 1 → all 0xFF. */
uint8_t packed[5] = { 0xFF, 0xFF, 0xFF, 0xFF, 0xFF };
uint16_t dst[4] = { 0, 0, 0, 0 };
nv15_unpack_plane_to_p010(packed, dst, 4, 1, 5);
ASSERT_EQ(dst[0], 0xFFC0, "max[0]");
ASSERT_EQ(dst[1], 0xFFC0, "max[1]");
ASSERT_EQ(dst[2], 0xFFC0, "max[2]");
ASSERT_EQ(dst[3], 0xFFC0, "max[3]");
}
static void test_known_vectors(void)
{
/* Position-sensitive sanity: each pixel = its index+1. */
test_pack_unpack_roundtrip(1, 2, 3, 4);
/* Spread patterns that exercise every byte-boundary bit. */
test_pack_unpack_roundtrip(0x3FF, 0x000, 0x3FF, 0x000);
test_pack_unpack_roundtrip(0x000, 0x3FF, 0x000, 0x3FF);
test_pack_unpack_roundtrip(0x155, 0x2AA, 0x155, 0x2AA);
test_pack_unpack_roundtrip(0x001, 0x002, 0x004, 0x008);
test_pack_unpack_roundtrip(0x080, 0x040, 0x020, 0x010);
test_pack_unpack_roundtrip(0x200, 0x100, 0x080, 0x040);
test_pack_unpack_roundtrip(0x3F0, 0x0F3, 0x33C, 0x2A5);
}
static void test_remainder_width(void)
{
/* width=1: only A unpacked, B/C/D undefined */
{
uint8_t packed[5];
uint16_t dst[1] = { 0xDEAD };
pack4(0x123, 0x000, 0x000, 0x000, packed);
nv15_unpack_plane_to_p010(packed, dst, 1, 1, 5);
ASSERT_EQ(dst[0], 0x123 << 6, "rem1[0]");
}
/* width=2 */
{
uint8_t packed[5];
uint16_t dst[2] = { 0, 0 };
pack4(0x111, 0x222, 0x000, 0x000, packed);
nv15_unpack_plane_to_p010(packed, dst, 2, 1, 5);
ASSERT_EQ(dst[0], 0x111 << 6, "rem2[0]");
ASSERT_EQ(dst[1], 0x222 << 6, "rem2[1]");
}
/* width=3 */
{
uint8_t packed[5];
uint16_t dst[3] = { 0, 0, 0 };
pack4(0x111, 0x222, 0x333, 0x000, packed);
nv15_unpack_plane_to_p010(packed, dst, 3, 1, 5);
ASSERT_EQ(dst[0], 0x111 << 6, "rem3[0]");
ASSERT_EQ(dst[1], 0x222 << 6, "rem3[1]");
ASSERT_EQ(dst[2], 0x333 << 6, "rem3[2]");
}
/* width=7: one full group + 3 remainder */
{
uint8_t packed[10];
uint16_t dst[7] = { 0 };
pack4(0x100, 0x200, 0x300, 0x010, &packed[0]);
pack4(0x011, 0x022, 0x033, 0x000, &packed[5]);
nv15_unpack_plane_to_p010(packed, dst, 7, 1, 10);
ASSERT_EQ(dst[0], 0x100 << 6, "rem7[0]");
ASSERT_EQ(dst[1], 0x200 << 6, "rem7[1]");
ASSERT_EQ(dst[2], 0x300 << 6, "rem7[2]");
ASSERT_EQ(dst[3], 0x010 << 6, "rem7[3]");
ASSERT_EQ(dst[4], 0x011 << 6, "rem7[4]");
ASSERT_EQ(dst[5], 0x022 << 6, "rem7[5]");
ASSERT_EQ(dst[6], 0x033 << 6, "rem7[6]");
}
/* width=8: two full groups */
{
uint8_t packed[10];
uint16_t dst[8] = { 0 };
pack4(0x101, 0x202, 0x303, 0x101, &packed[0]);
pack4(0x202, 0x303, 0x101, 0x202, &packed[5]);
nv15_unpack_plane_to_p010(packed, dst, 8, 1, 10);
ASSERT_EQ(dst[7], 0x202 << 6, "w8[7]");
}
}
static void test_multi_row_stride_padding(void)
{
/* 4-pixel-wide, 3-row plane; stride = 8 bytes (3 bytes padding). */
uint8_t packed[24]; /* 3 rows × 8 bytes */
uint16_t dst[12]; /* 3 rows × 4 pixels */
memset(packed, 0xCC, sizeof(packed)); /* padding poison */
pack4(0x111, 0x222, 0x333, 0x044, &packed[0 * 8]);
pack4(0x055, 0x166, 0x177, 0x188, &packed[1 * 8]);
pack4(0x099, 0x1AA, 0x2BB, 0x3CC, &packed[2 * 8]);
memset(dst, 0xAB, sizeof(dst));
nv15_unpack_plane_to_p010(packed, dst, 4, 3, 8);
ASSERT_EQ(dst[0], 0x111 << 6, "row0[0]");
ASSERT_EQ(dst[3], 0x044 << 6, "row0[3]");
ASSERT_EQ(dst[4], 0x055 << 6, "row1[0]");
ASSERT_EQ(dst[7], 0x188 << 6, "row1[3]");
ASSERT_EQ(dst[8], 0x099 << 6, "row2[0]");
ASSERT_EQ(dst[11], 0x3CC << 6, "row2[3]");
}
static void test_chroma_half_height(void)
{
/* 4-pixel-wide × 2-row chroma (matches 4×4 luma quadrant).
* NV15 chroma uses same packing as luma, just half-height. */
uint8_t packed[10]; /* 2 rows × 5 bytes */
uint16_t dst[8]; /* 2 rows × 4 pixels (UV pairs in interleaved form) */
pack4(0x080, 0x180, 0x280, 0x380, &packed[0]);
pack4(0x040, 0x140, 0x240, 0x340, &packed[5]);
nv15_unpack_plane_to_p010(packed, dst, 4, 2, 5);
ASSERT_EQ(dst[0], 0x080 << 6, "chroma row0[0]");
ASSERT_EQ(dst[3], 0x380 << 6, "chroma row0[3]");
ASSERT_EQ(dst[4], 0x040 << 6, "chroma row1[0]");
ASSERT_EQ(dst[7], 0x340 << 6, "chroma row1[3]");
}
int main(void)
{
test_zero();
test_all_max();
test_known_vectors();
test_remainder_width();
test_multi_row_stride_padding();
test_chroma_half_height();
printf("test_nv15_unpack: all PASS\n");
return 0;
}
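The unpack under test, nv15_unpack_plane_to_p010 in src/nv15.c, is not part of this diff. For reference, the per-group step it has to perform can be sketched as the exact inverse of pack4 above. sketch_nv15_sample_p010 is a hypothetical name, and the sketch handles a single 5-byte group only, with no stride or remainder handling.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: extract 10-bit sample idx (0..3) from one
 * 5-byte NV15 group and return it as a P010 word, i.e. shifted into
 * the top 10 bits of a 16-bit value. Bit positions mirror pack4. */
static inline uint16_t sketch_nv15_sample_p010(const uint8_t in[5],
					       unsigned int idx)
{
	uint16_t sample;

	switch (idx) {
	case 0:
		sample = (uint16_t)(in[0] | ((in[1] & 0x03) << 8));
		break;
	case 1:
		sample = (uint16_t)((in[1] >> 2) | ((in[2] & 0x0F) << 6));
		break;
	case 2:
		sample = (uint16_t)((in[2] >> 4) | ((in[3] & 0x3F) << 4));
		break;
	default:
		sample = (uint16_t)((in[3] >> 6) | (in[4] << 2));
		break;
	}

	return (uint16_t)(sample << 6);
}
```

Packing the pixels (1, 2, 3, 4) with pack4 gives the bytes 01 08 30 00 01, and the sketch maps them back to 1 << 6, 2 << 6, 3 << 6 and 4 << 6, the same expectation as test_pack_unpack_roundtrip.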