RK3399 rkvdec advertises NV15 in VIDIOC_ENUM_FMT(CAPTURE) only AFTER
S_FMT(OUTPUT) + S_EXT_CTRLS(SPS) resolve image_fmt to 420_10BIT.
Pre-flight v4l2_find_format(NV15) therefore always returns 0 (not found) → video_format
stays NULL → CreateContext returns OPERATION_FAILED → ffmpeg-vaapi
hwaccel init fails with "Failed to create decode context: 1".
Verified on fresnel (kernel 7.0-14 / linux-fresnel-fourier):
v4l2-ctl -d /dev/video1 --list-formats → only NV12 enumerated
Fix: for 10-bit profiles, skip the find_format probe and directly
map to our NV15 video_format entry. The later S_FMT(CAPTURE) in
the same RequestCreateContext path commits the actual NV15 mode
once the synthetic-SPS injection sets bit_depth_luma_minus8=2.
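A minimal sketch of the selection order, assuming a simplified stand-in for the backend's video_format table. The names video_format_sketch, select_capture_format and pre_sps_rkvdec_enum are hypothetical. SKETCH_FOURCC reproduces the v4l2_fourcc() packing, and the probe callback models what ENUM_FMT reports before any SPS has been set. The only point illustrated is that the 10-bit branch never consults the probe.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Same byte packing as v4l2_fourcc(). */
#define SKETCH_FOURCC(a, b, c, d)                        \
    ((uint32_t)(a) | ((uint32_t)(b) << 8) |              \
     ((uint32_t)(c) << 16) | ((uint32_t)(d) << 24))

/* Hypothetical stand-in for the backend's video_format entries. */
struct video_format_sketch {
    const char *description;
    uint32_t v4l2_format;
    bool is_10bit;
};

static const struct video_format_sketch formats_sketch[] = {
    { "NV12", SKETCH_FOURCC('N', 'V', '1', '2'), false },
    { "NV15", SKETCH_FOURCC('N', 'V', '1', '5'), true },
};

/* Models ENUM_FMT(CAPTURE) on RK3399 rkvdec before any SPS: NV12 only. */
static bool pre_sps_rkvdec_enum(uint32_t pixelformat)
{
    return pixelformat == SKETCH_FOURCC('N', 'V', '1', '2');
}

static const struct video_format_sketch *format_by_fourcc(uint32_t fourcc)
{
    for (size_t i = 0; i < sizeof(formats_sketch) / sizeof(formats_sketch[0]); i++)
        if (formats_sketch[i].v4l2_format == fourcc)
            return &formats_sketch[i];
    return NULL;
}

static const struct video_format_sketch *
select_capture_format(bool profile_is_10bit, bool (*driver_enumerates)(uint32_t))
{
    if (profile_is_10bit)
        /* NV15 only appears in ENUM_FMT after the SPS resolves image_fmt
         * to 10-bit, so a pre-flight probe would always miss it. Map it
         * directly; the later S_FMT(CAPTURE) commits the mode. */
        return format_by_fourcc(SKETCH_FOURCC('N', 'V', '1', '5'));

    /* 8-bit path unchanged: probe first, fail if NV12 is missing. */
    if (!driver_enumerates(SKETCH_FOURCC('N', 'V', '1', '2')))
        return NULL;
    return format_by_fourcc(SKETCH_FOURCC('N', 'V', '1', '2'));
}
```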
Discovered during Phase 7 sub-profile verification: Criterion 1
(vainfo enumeration) PASSed, but Criteria 2/3 (Hi10P/Main10 decode)
failed with the hwaccel init error. The iter38 5/5 baseline still PASSes
(no regression; the non-10-bit path is unchanged).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Fixes the rkvdec_hevc_prepare_hw_st_rps out-of-bounds kernel OOPS that
blocked HEVC decode on ampere (RK3588), as tracked in
marfrit/libva-v4l2-request-fourier#3 and the ampere-fourier iter1 close-out.
Mechanism (Phase 5 amendment to issue body):
The new EXT_SPS controls are registered as V4L2_CTRL_FLAG_DYNAMIC_ARRAY
in vdpu38x_hevc_ctrl_descs (rkvdec.c:279/284) with cfg.dims = { 65 }.
The v4l2-ctrl framework init-allocates 1 zeroed element (ctrls-core.c:2116).
When num_short_term_ref_pic_sets > 1, rkvdec_hevc_prepare_hw_st_rps
(rkvdec-hevc-common.c:393-405) iterates idx 0..N-1 and overruns the
1-element kernel allocation. Submitting an N-element dynamic-array
control via S_EXT_CTRLS grows the framework allocation to N elements,
which keeps the loop in bounds.
Userspace fix:
- VIDIOC_QUERY_EXT_CTRL probe at first HEVC CreateContext sets
driver_data->has_ext_sps_rps (true on VDPU381/383, false on legacy
RK3399 — control unregistered there, so fresnel iter38 5/5 + iter39
sub-profile paths are byte-identical to pre-iter2).
- When set, h265_set_controls appends EXT_SPS_ST_RPS + _LT_RPS as
calloc'd zero arrays, sized by VAAPI's count fields and capped at
H.265 §7.4.3.2 spec maxima (ST 64, LT 32). Min 1 (kernel rejects 0).
- Free both arrays once S_EXT_CTRLS returns.
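The count clamping can be sketched as below. HEVC_MAX_ST_RPS / HEVC_MAX_LT_RPS carry the spec maxima quoted above; ext_rps_alloc is a hypothetical helper, and elem_size is a placeholder for the per-element size of the kernel's EXT_SPS RPS structs, whose layout is not reproduced here. The returned byte count assumes the usual dynamic-array convention of size = elements × element size.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* H.265 §7.4.3.2 maxima, as quoted in the commit message. */
#define HEVC_MAX_ST_RPS 64u
#define HEVC_MAX_LT_RPS 32u

/* Clamp a VAAPI count into [1, spec_max]: the kernel rejects a
 * zero-length dynamic array, and the spec bounds the upper end. */
static uint32_t ext_rps_elems(uint32_t va_count, uint32_t spec_max)
{
    if (va_count == 0)
        return 1;
    return va_count > spec_max ? spec_max : va_count;
}

/* Hypothetical helper: allocate a zero-filled payload for one EXT_SPS
 * RPS control. On success *out_bytes is the value for the control's
 * size field. The caller frees the buffer after S_EXT_CTRLS returns. */
static void *ext_rps_alloc(uint32_t va_count, uint32_t spec_max,
                           size_t elem_size, uint32_t *out_elems,
                           size_t *out_bytes)
{
    uint32_t n = ext_rps_elems(va_count, spec_max);
    /* All-zero entries: an empty inter-pred RPS for every set. */
    void *buf = calloc(n, elem_size);

    if (buf != NULL) {
        *out_elems = n;
        *out_bytes = (size_t)n * elem_size;
    }
    return buf;
}
```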
Decode correctness scope:
VAAPI does NOT expose per-set st_ref_pic_set syntax elements
(delta_idx_minus1, delta_rps_sign, etc.) — confirmed in va_dec_hevc.h.
All-zero entries give empty inter-pred RPS per set, which is correct
for IDR-only streams and incorrect for streams with inter-pred RPS
dependence. iter2 acceptance: stop the OOPS. Decode-correctness for
inter-RPS content is a known follow-up requiring either bitstream-snoop
or SPS-passthrough via a new VAAPI extension.
Files:
- include/hevc-ctrls.h: #ifndef-guarded fallback definitions for
V4L2_CID_STATELESS_HEVC_EXT_SPS_{ST,LT}_RPS + structs (ampere host
is on linux-api-headers 6.19-1; the new CIDs land in 7.0).
- src/request.h: driver_data->has_ext_sps_rps (persists for the driver
lifetime; consulted only on the HEVC code path, so cross-codec leakage
is impossible).
- src/context.c: probe at HEVC CreateContext via v4l2_query_ext_ctrl.
- src/h265.c: controls[5] → controls[7]; #include <hevc-ctrls.h>
(replaces <linux/v4l2-controls.h>) for forward UAPI compatibility.
Compile-tested on boltzmann (aarch64 native, gcc 15.2.1): clean .so,
0 new warnings. Fresnel cross-device safety: legacy RK3399 rkvdec_ctrl
table omits the CIDs; probe returns false; new code path never executes.
iter39 sub-profile work (commits 662f887 + 8746690) is preserved
in-tree; iter2 is a forward-compatible additive change.
Refs:
marfrit/libva-v4l2-request-fourier#3
ampere-fourier/iter1_close.md HEVC blocker
ampere-fourier/iter2_phase0_findings.md
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Adds VAProfileH264High10 and VAProfileHEVCMain10 to the libva-v4l2-request
backend. RK3399 rkvdec emits decoded frames as V4L2_PIX_FMT_NV15 (four
10-bit samples packed into every 5 bytes); VAAPI consumers receive standard
VA_FOURCC_P010 via a new userspace unpack in copy_surface_to_image.
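For reference, a sketch of a per-row unpack following the LSB-first packing described in pixfmt-nv15.rst: sample k of each 4-sample group occupies bits 10k..10k+9 of the little-endian 40-bit group, and P010 keeps the 10 significant bits in the top of each 16-bit word. nv15_row_to_p010 is a hypothetical name, not the function in src/nv15.c. It assumes the per-row sample count is a multiple of 4 and a little-endian host for the 16-bit stores; strides are left to the caller, which would loop over the rows of both planes (an interleaved CbCr row also holds width samples).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Unpack one NV15 row (samples % 4 == 0) into P010 words.
 * Assumes a little-endian host, so uint16_t stores match P010's
 * little-endian layout. */
static void nv15_row_to_p010(const uint8_t *src, uint16_t *dst, size_t samples)
{
    assert(samples % 4 == 0);
    for (size_t i = 0; i < samples; i += 4) {
        const uint8_t *g = src + (i / 4) * 5;   /* 5-byte group */
        uint64_t bits = (uint64_t)g[0] |
                        ((uint64_t)g[1] << 8) |
                        ((uint64_t)g[2] << 16) |
                        ((uint64_t)g[3] << 24) |
                        ((uint64_t)g[4] << 32);
        for (unsigned k = 0; k < 4; k++)
            /* 10-bit sample moved to the 10 MSBs of the 16-bit word. */
            dst[i + k] = (uint16_t)(((bits >> (10 * k)) & 0x3ffu) << 6);
    }
}
```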
VP9 Profile 2 explicitly NOT added — RK3399 rkvdec kernel ctrl table
caps at V4L2_MPEG_VIDEO_VP9_PROFILE_0 (rkvdec.c::rkvdec_vp9_ctrl_descs).
Touchpoints (per Phase 5 sonnet-architect review amendments):
- include/drm_fourcc.h: define DRM_FORMAT_NV15 (vendored libdrm lacks it)
- src/nv15.{c,h}: NV15 → P010 plane unpack (LSB-first, per
Documentation/userspace-api/media/v4l/pixfmt-nv15.rst)
- src/video.c: NV15 entry in formats[] (otherwise video_format_find returns NULL and the caller dereferences it)
- src/codec.c: pixelformat_for_profile cases for Hi10P + Main10
- src/config.c: enumeration, validation, entrypoints, RT_FORMAT_YUV420_10
advertisement for 10-bit profiles
- src/context.c: per-profile CAPTURE pix_fmt (NV12/NV15), 10-bit synthetic
SPS (bit_depth_luma_minus8=2), video_format invalidation on bit-depth
transition (sibling to iter38 device-switch invalidation), is_10bit flag
- src/surface.c: RT_FORMAT_YUV420_10 admission, NV15 fourcc on PRIME export
- src/image.c: P010 reporting in DeriveImage + QueryImageFormats,
P010-aware sizing in CreateImage, NV15 → P010 unpack call in
copy_surface_to_image (gated on is_10bit + image.format.fourcc == P010)
- src/picture.c: 4 switch blocks route Hi10P/Main10 to existing H264/HEVC
per-codec paths
- src/request.h: MAX_PROFILES bump 11 → 13, driver_data->is_10bit flag
Scope: COPY path (vaGetImage / vaDeriveImage) only. Standard ffmpeg-vaapi
hwdownload, mpv vaapi-copy, and any consumer using vaGetImage work
end-to-end. PRIME-path consumers that only know NV12/P010 must use the
COPY path; PRIME consumers aware of NV15 (panfrost-Mesa et al.) get the
correct fourcc on RequestExportSurfaceHandle. PRIME-side P010 emission is
follow-up scope (would need DRM_FORMAT_P010 + per-plane unpack into a
GPU-accessible buffer).
Compile-tested on boltzmann (aarch64 native, gcc 15.2.1, libva 1.23.0,
libdrm 2.4.133): clean build, .so produced, 0 new warnings.
Phase 0/2 evidence (linux-mmind-v7.0, drivers/media/platform/rockchip/rkvdec):
rkvdec_h264_decoded_fmts[] and rkvdec_hevc_decoded_fmts[] both list NV15;
the ctrl tables cap at HEVC MAIN_10 and H264 HIGH_422_INTRA (Hi10P is below
the cap and not in menu_skip_mask). image_fmt resolution (rkvdec-h264-common.c:196,
rkvdec-hevc-common.c:467) dispatches on bit_depth_luma_minus8 only.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
rkvdec_h264_validate_sps doubles the height when FRAME_MBS_ONLY is unset
(field height converted to frame height). The dummy SPS for a 1080-line
stream therefore failed validation (2176 = 2 × 1088 > 1080) and returned
-EINVAL, unnoticed because the call is (void)-cast. libva ignoring the
return value did not make the failure harmless: ctx->image_fmt stayed at
ANY, so the first per-frame H264_SPS still hit -EBUSY in
try_or_set_cluster and broke the setup loop (Bug 4 unchanged).
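The arithmetic behind the 2176, as a worked sketch of the H.264 derivation FrameHeightInMbs = (2 - frame_mbs_only_flag) × (pic_height_in_map_units_minus1 + 1). h264_frame_height_px is a hypothetical helper, not the kernel function. A 1080-line stream is coded as 68 macroblock rows (1088 lines), so a dummy SPS carrying pic_height_in_map_units_minus1 = 67 without FRAME_MBS_ONLY is read as 68 field rows, i.e. 136 frame rows.

```c
#include <assert.h>
#include <stdbool.h>

/* H.264: FrameHeightInMbs = (2 - frame_mbs_only_flag) * PicHeightInMapUnits,
 * where PicHeightInMapUnits = pic_height_in_map_units_minus1 + 1. */
static unsigned h264_frame_height_px(unsigned pic_height_in_map_units_minus1,
                                     bool frame_mbs_only)
{
    unsigned map_units = pic_height_in_map_units_minus1 + 1;
    unsigned frame_height_in_mbs = (frame_mbs_only ? 1u : 2u) * map_units;

    return frame_height_in_mbs * 16; /* 16 luma lines per macroblock */
}
```

A genuinely interlaced 1080-line stream would carry pic_height_in_map_units_minus1 = 33, which also yields 1088.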
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Root cause for Bug 5 (HEVC libva = all-zero CAPTURE) and Bug 4 (H.264
libva = keyframe partial), localized via iter17→iter24 kernel-printk
chain:
rkvdec_s_ctrl() for HEVC_SPS / H264_SPS calls get_image_fmt() and,
if the resolved image_fmt differs from cached ctx->image_fmt (default
RKVDEC_IMG_FMT_ANY at open), tries to reset the CAPTURE format.
Format reset returns -EBUSY when vb2_is_busy(CAPTURE_queue) — any
CAPTURE buffer allocated blocks the change.
libva (iter5b-β) pre-allocates 24 CAPTURE buffers at CreateContext
via cap_pool_init, BEFORE any per-frame S_EXT_CTRLS. First per-frame
HEVC_SPS therefore fails with -EBUSY in try_or_set_cluster, breaks
v4l2_ctrl_request_setup's outer loop, leaves all 5 staged HEVC
compound controls at zero in ctx->ctrl_hdl. rkvdec_hevc_run reads
zero (iter20 dmesg: sps[0..16]=00..00), hardware sees w=0 h=0,
CAPTURE comes out all-zero (Bug 5).
Fix: BEFORE cap_pool_init, inject one S_EXT_CTRLS outside any request (no
request_fd, no V4L2_CTRL_WHICH_REQUEST_VAL) with a synthetic SPS containing the profile's known chroma +
bit_depth. CAPTURE queue is still empty at this point → vb2_is_busy
returns false → rkvdec_s_ctrl succeeds, ctx->image_fmt is updated to
the profile's image_fmt. From then on, per-frame SPS submissions with
matching chroma + bit_depth see image_fmt_changed=false → skip reset
→ commit succeeds.
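The ordering argument can be illustrated with a toy model. This is not kernel code: toy_ctx, toy_s_ctrl_sps and the toy_img_fmt values are hypothetical simplifications of rkvdec_s_ctrl, reduced to 4:2:0 and keyed on bit_depth_luma_minus8 alone. An SPS that changes the cached image_fmt needs a CAPTURE format reset, which is refused once CAPTURE buffers exist.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Toy model only, NOT the kernel implementation. */
enum toy_img_fmt { TOY_FMT_ANY, TOY_FMT_420_8BIT, TOY_FMT_420_10BIT };

struct toy_ctx {
    enum toy_img_fmt image_fmt; /* ANY when the device is opened */
    bool capture_busy;          /* stands in for vb2_is_busy(CAPTURE) */
};

/* Simplified SPS handling in rkvdec_s_ctrl(). */
static int toy_s_ctrl_sps(struct toy_ctx *ctx, unsigned bit_depth_luma_minus8)
{
    enum toy_img_fmt fmt;

    if (bit_depth_luma_minus8 == 0)
        fmt = TOY_FMT_420_8BIT;
    else if (bit_depth_luma_minus8 == 2)
        fmt = TOY_FMT_420_10BIT;
    else
        return -EINVAL;

    if (fmt != ctx->image_fmt) {
        if (ctx->capture_busy)
            return -EBUSY;    /* CAPTURE format reset refused */
        ctx->image_fmt = fmt; /* CAPTURE format reset succeeds */
    }
    return 0;                 /* matching format: no reset needed */
}
```

With cap_pool_init first, the first SPS already fails; with the synthetic SPS injected first, later per-frame SPS submissions at the same bit depth never need a reset.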
VP9 / MPEG-2 / VP8 paths are not affected: VP9's rkvdec coded_fmt_desc
has no get_image_fmt op; MPEG-2 + VP8 route to hantro.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 3 ioctl-sequence diff: kdirect (ffmpeg-v4l2request) S_FMTs CAPTURE
with NV12 + dimensions after S_FMT OUTPUT, BEFORE CREATE_BUFS. libva's
old code only G_FMTs CAPTURE (per iter5b-β's hantro-targeted comment
that explicit S_FMT puts hantro into an inconsistent state).
For rkvdec on RK3399, without an explicit S_FMT CAPTURE the chosen NV12
format is not committed properly, and rkvdec HEVC + H.264 silently
produce zero / garbage CAPTURE output (the Bug 4 + Bug 5 root cause).
Now: S_FMT OUTPUT → S_FMT CAPTURE → G_FMT CAPTURE. Failure of S_FMT
CAPTURE is non-fatal: fall back to G_FMT (preserves the iter5b-β
hantro path).
A future iteration should gate this on driver_kind explicitly, per
feedback_per_driver_kludge_gating.md. For now, always-on is safe
because kdirect demonstrates that S_FMT CAPTURE works on both rkvdec AND hantro.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace gettimeofday in RequestEndPicture with object_context-scoped
counter producing small us values (1, 2, 3, ...) so OUTPUT QBUF
timestamp and DPB.reference_ts match ffmpeg-v4l2request's pattern.
Phase 5 IMP-1: counter scoped to object_context (not driver_data) to
avoid multi-context collisions.
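A minimal sketch of a context-scoped counter, assuming a hypothetical context_ts_sketch struct in place of object_context. OUTPUT buffers carry a struct timeval, and DPB reference_ts values are the same timestamp in nanoseconds; timeval_to_ns mirrors the conversion that v4l2_timeval_to_ns() in <linux/videodev2.h> performs.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/types.h>
#include <sys/time.h>

/* Hypothetical slice of object_context: each context owns its own
 * counter, so two contexts never collide on timestamps. */
struct context_ts_sketch {
    uint64_t next_ts_us; /* 0 at CreateContext; first frame gets 1 */
};

/* Small, monotonically increasing OUTPUT QBUF timestamp. */
static void next_output_timestamp(struct context_ts_sketch *ctx,
                                  struct timeval *tv)
{
    ctx->next_ts_us++;
    tv->tv_sec = (time_t)(ctx->next_ts_us / 1000000u);
    tv->tv_usec = (suseconds_t)(ctx->next_ts_us % 1000000u);
}

/* Value to store in DPB reference_ts for the frame decoded with *tv. */
static uint64_t timeval_to_ns(const struct timeval *tv)
{
    return (uint64_t)tv->tv_sec * 1000000000ull +
           (uint64_t)tv->tv_usec * 1000ull;
}
```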
Empirical confirmation only: the reviewer's CRIT-1 predicts this change
is inert (VP9/MPEG-2 use the same path and PASS). If α-7 produces the
same broken hash, the libva wire-byte search space is exhausted and
iter10 must pivot to slice-data inspection or kernel investigation.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 7 empirical: all 5 libva codecs returned all-zero frames because
CreateContext's surfaces_ids[] walk was a no-op for ffmpeg-vaapi-copy,
which passes surfaces_count=0 to vaCreateContext (per the iter6
comment at context.c:262). Surfaces existed in driver_data's
surface_heap but weren't in the param array → destination_* stayed
at the zero initialization from CreateSurfaces2 β → BeginPicture's
surface_bind_slot saw destination_planes_count=0 → no data
assignment → copy_surface_to_image read all-zero.
Fix: cache the format-uniform CAPTURE geometry in driver_data
(fmt_valid, fmt_planes_count, fmt_buffers_count, fmt_format_height,
fmt_sizes[], fmt_bytesperlines[]). Populate at CreateContext after
v4l2_get_format(CAPTURE). Walk surface_heap (not just surfaces_ids[])
to fill every existing surface. Add lazy-fill in CreateSurfaces2 for
surfaces created AFTER CreateContext. Invalidate cache in
DestroyContext.
New helper: surface_fill_format_uniform(driver_data, surface_object).
Idempotent on destination_planes_count != 0.
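A sketch of the idempotent fill, using hypothetical mirror structs: the fmt_* field names come from the list above, while the per-surface destination_* names other than destination_planes_count, fmt_cache_invalidate and SKETCH_MAX_PLANES are illustrative assumptions. The early return on a non-zero destination_planes_count is what makes it safe to call from both the CreateContext heap walk and the CreateSurfaces2 lazy-fill path.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_MAX_PLANES 4 /* hypothetical bound */

/* Hypothetical mirror of the driver_data format cache. */
struct driver_fmt_cache {
    bool fmt_valid;
    unsigned fmt_planes_count;
    unsigned fmt_buffers_count;
    unsigned fmt_format_height;
    uint32_t fmt_sizes[SKETCH_MAX_PLANES];
    uint32_t fmt_bytesperlines[SKETCH_MAX_PLANES];
};

/* Hypothetical mirror of the per-surface destination_* fields. */
struct surface_sketch {
    unsigned destination_planes_count;
    unsigned destination_buffers_count;
    unsigned destination_format_height;
    uint32_t destination_sizes[SKETCH_MAX_PLANES];
    uint32_t destination_bytesperlines[SKETCH_MAX_PLANES];
};

/* Returns true if the surface was filled by this call. */
static bool surface_fill_format_uniform_sketch(const struct driver_fmt_cache *d,
                                               struct surface_sketch *s)
{
    if (!d->fmt_valid || s->destination_planes_count != 0)
        return false; /* no cached geometry yet, or already filled */

    assert(d->fmt_planes_count <= SKETCH_MAX_PLANES);
    for (unsigned i = 0; i < d->fmt_planes_count; i++) {
        s->destination_sizes[i] = d->fmt_sizes[i];
        s->destination_bytesperlines[i] = d->fmt_bytesperlines[i];
    }
    s->destination_buffers_count = d->fmt_buffers_count;
    s->destination_format_height = d->fmt_format_height;
    s->destination_planes_count = d->fmt_planes_count; /* marks "filled" */
    return true;
}

/* DestroyContext: later surfaces must wait for the next CreateContext. */
static void fmt_cache_invalidate(struct driver_fmt_cache *d)
{
    d->fmt_valid = false;
}
```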
Signed-off-by: claude-noether <claude-noether@reauktion.de>
Strip OUTPUT-side V4L2 device-format lifecycle out of
RequestCreateSurfaces2 entirely. Move S_FMT(OUTPUT), CAPTURE-format
probe, cap_pool_init, per-surface destination_* fill into
RequestCreateContext where config_id (and therefore the bound
VAProfile) is known via config_object->pixelformat (wired by
commit B). The α' multi-CreateSurfaces2-mid-stream failure mode
disappears because β has no in-CreateSurfaces2 teardown branch;
each context cycle does its own setup, DestroyContext handles
teardown.
Phase 5 v2 review amendments:
- CRIT-1: removed video_format==NULL early-return at context.c:64-66
(would have rejected every first β CreateContext).
- CRIT-2: added request_pool_destroy() to DestroyContext before
REQBUFS(0). Pre-β only surface.c's resolution-change branch
called request_pool_destroy; β strips that, so DestroyContext
becomes the sole per-session teardown site.
- IMP-1: probe CAPTURE format first to derive output_type from
video_format->v4l2_mplane (eliminates the hardcoded mplane=true
hack from the Phase 4 v2 plan).
- IMP-2: surface_reset_format_cache() deleted (function + declaration
in surface.h + call in DestroyContext + last_output_{width,height}
fields in request.h). All dead under β.
CreateSurfaces2 now ~50 LOC (was ~250). Pure surface ID allocation
+ per-surface lifecycle bookkeeping; no V4L2 device state touched.
Signed-off-by: claude-noether <claude-noether@reauktion.de>
Root cause for VP9 criterion-4 failure traced via runtime
instrumentation: context.c:194 unconditionally set
context_object->h264_start_code = true for every CreateContext,
regardless of codec profile. picture.c:70 then prepends 0x00 0x00 0x01
(ANNEX-B start code) to ALL slice data including VP9 frames.
VP9 has no start codes: its uncompressed_header begins directly with the
2-bit frame_marker (binary 10, in the top 2 bits of the first byte). The 3-byte prefix
shifted the rkvdec driver's bitstream-read by 24 bits, producing a
silent decode failure (frame_marker mismatch -> driver fails to
locate a valid frame -> CAPTURE slot stays at cap_pool init pattern,
the dim 0x4c green visible in Phase 7 hwdownload PNGs).
iter4 fix: switch on config_object->profile in RequestCreateContext.
Set h264_start_code = true only for VAProfileH264* and VAProfileHEVCMain.
False for MPEG2/VP8/VP9.
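A sketch of the per-profile gate together with the VP9 frame_marker check that the 3-byte prefix broke. profile_sketch is a local stand-in for the VAProfile values (the real code switches on VAProfile from <va/va.h>), and store_slice is a hypothetical simplification of the prepend in picture.c.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Local stand-in for the VAProfile values handled by the backend. */
enum profile_sketch {
    P_MPEG2_MAIN, P_H264_MAIN, P_H264_HIGH, P_HEVC_MAIN, P_VP8, P_VP9_0,
};

static bool profile_uses_annexb(enum profile_sketch p)
{
    switch (p) {
    case P_H264_MAIN:
    case P_H264_HIGH:
    case P_HEVC_MAIN:
        return true;
    default: /* MPEG-2, VP8, VP9: no start-code framing */
        return false;
    }
}

/* VP9 frame_marker: the top 2 bits of the first byte must be 0b10. */
static bool vp9_frame_marker_ok(const uint8_t *buf, size_t len)
{
    return len > 0 && (buf[0] >> 6) == 0x2;
}

/* Copy slice data into dst, prefixing 00 00 01 only for Annex-B
 * profiles. Returns bytes written, or 0 if dst is too small. */
static size_t store_slice(enum profile_sketch p, const uint8_t *src,
                          size_t len, uint8_t *dst, size_t cap)
{
    static const uint8_t start_code[3] = { 0x00, 0x00, 0x01 };
    size_t prefix = profile_uses_annexb(p) ? sizeof(start_code) : 0;

    if (prefix + len > cap)
        return 0;
    memcpy(dst, start_code, prefix);
    memcpy(dst + prefix, src, len);
    return prefix + len;
}
```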
iter1 (MPEG-2) and iter3 (VP8) had this same bug latent — they passed
because their criterion-4 verification used different paths (iter1
direct readback was small enough to mask, iter3 used transitive proof
not pixel comparison). The Phase 7 byte-level pixel comparison is what
exposed it.
Empirical proof of the fix on fresnel:
- pre-fix submission FRAME control bytes 0-23: lf.flags=0x01 (only
DELTA_ENABLED), base_q_idx=0x41 — bit-misaligned because parser was
reading the prefix bytes.
- post-fix submission FRAME control bytes 0-23 byte-match Phase 3
kernel-direct anchor: lf.flags=0x03 (ENABLED|UPDATE), base_q_idx=0x2e
(46). Transitive-proof leg 1 (backend-payload == kernel-direct-payload)
satisfied for the keyframe.
- the su(6) bit-width fix in vp9.c (4 magnitude + 1 sign bit -> 6
magnitude + 1 sign bit, per the VP9 spec) fixes a real bug too, latent
because Bug 1 (this commit's fix) prevented that code path from running.
Both fixes ship together.
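For the second fix, a sketch of the VP9 spec's su(n) descriptor (n magnitude bits, then one sign bit) on top of a minimal MSB-first bit reader. The bitreader type and functions are hypothetical, not the ones in vp9.c. Reading a 6-bit field with only 4 magnitude bits yields a wrong value and leaves the cursor 2 bits behind, which then misaligns every later field.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Minimal MSB-first bit reader (hypothetical). */
struct bitreader {
    const uint8_t *buf;
    size_t len_bits;
    size_t pos; /* next bit to read */
};

/* f(n): read n bits MSB-first; bits past the end read as 0. */
static uint32_t br_f(struct bitreader *br, unsigned n)
{
    uint32_t v = 0;
    for (unsigned i = 0; i < n; i++) {
        uint32_t bit = 0;
        if (br->pos < br->len_bits)
            bit = (br->buf[br->pos / 8] >> (7 - br->pos % 8)) & 1u;
        br->pos++;
        v = (v << 1) | bit;
    }
    return v;
}

/* su(n): n magnitude bits followed by one sign bit. */
static int32_t br_su(struct bitreader *br, unsigned n)
{
    int32_t value = (int32_t)br_f(br, n);
    return br_f(br, 1) ? -value : value;
}
```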
Pixels still produce 0x4c constant pattern post-fix — that is Bug 2
(substrate-wide cap_pool readback regression on
linux-fresnel-fourier 7.0-1) per phase7_iter4_verification.md.
Bug 2 is out of iter4 scope per Option-A choice; transitive proof
remains the criterion-4 verification path.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Rewrites src/h265.c (407 lines → 588 lines) and the picture.c HEVC
dispatch + per-slice accumulation against the modern split
V4L2_CID_STATELESS_HEVC_{SPS,PPS,SLICE_PARAMS,SCALING_MATRIX,DECODE_PARAMS,
DECODE_MODE,START_CODE} stateless controls. Replaces the staging-era
V4L2_CID_MPEG_VIDEO_HEVC_{SPS,PPS,SLICE_PARAMS} CIDs that were
removed from the kernel UAPI.
Per-frame submission: ONE batched VIDIOC_S_EXT_CTRLS, count=5,
ctrl_class=V4L2_CTRL_CLASS_CODEC_STATELESS:
0xa40a90 SPS (40 bytes)
0xa40a91 PPS (64 bytes)
0xa40a92 SLICE_PARAMS (variable; dynamic-array; one entry per slice)
0xa40a93 SCALING_MATRIX (1296 bytes; memset-zero when no scaling list)
0xa40a94 DECODE_PARAMS (328 bytes; per-frame DPB info)
Plus device-wide menus set once at context.c init (separate batched
S_EXT_CTRLS call so a kernel without HEVC controls — e.g. hantro on
RK3568/RK3399 — silently fails its batch without invalidating H.264):
0xa40a95 DECODE_MODE (FRAME_BASED on rkvdec)
0xa40a96 START_CODE (ANNEX_B on rkvdec)
Reference: FFmpeg libavcodec/v4l2_request_hevc.c:505-565
(v4l2_request_hevc_queue_decode batched submission shape).
Phase 5 review amendments incorporated:
C1 (data_byte_offset NOT data_bit_offset):
Old h265.c at lines 184-209 ran an 8-bit search to compute
bit-granularity offset. New API renames the field to
data_byte_offset (u32 byte offset). Bit-search dropped; replaced
with plain byte offset = source_offset + slice->slice_data_byte_offset.
C2 (dpb_entry.flags only LONG_TERM_REFERENCE; pic_order_cnt_val
singular; poc_st_curr_*[] arrays hold DPB INDICES not POC):
h265_fill_decode_params replaces old slice-params DPB iteration
with explicit DPB classification + index-array population.
For each VAAPI ReferenceFrames[i]:
- Classify into ST_CURR_BEFORE / ST_CURR_AFTER / LT_CURR via
VA_PICTURE_HEVC_RPS_* flags.
- Set dpb[j].timestamp, .pic_order_cnt_val (singular), .field_pic.
- Set dpb[j].flags = LONG_TERM_REFERENCE iff RPS_LT_CURR.
- Append j (DPB index, u8) to poc_st_curr_before[k] /
poc_st_curr_after[k] / poc_lt_curr[k] based on classification.
C3 (union-aliasing reasoning corrected):
BeginPicture's params.h265.num_slices = 0 reset is benign for
non-HEVC profiles because byte ~17764 of the params union is past
any field non-HEVC profiles read, NOT because RenderPicture's
per-buffer copies overwrite that location. Wording amended in
phase4_iter2_plan.md per phase5_iter2_review.md.
S1 (PPS flags 19 + 20 — DEBLOCKING_FILTER_CONTROL_PRESENT and
UNIFORM_SPACING):
Empirically VAAPI does NOT expose either flag in the
VAPictureParameterBufferHEVC pic_fields.bits or
slice_parsing_fields.bits. Both bits left zero. BBB-720p10s_hevc
fixture uses neither tiles nor explicit deblocking-control
parameters, so the omission is correct for the iter2 binding cell.
S2 (3 PPS scalars added):
pic_parameter_set_id (default 0; VAAPI doesn't expose),
num_ref_idx_l0_default_active_minus1,
num_ref_idx_l1_default_active_minus1 (both populated from the VAAPI picture struct).
Q2 (slice_segment_addr populated):
Was missing in old h265.c. Now sourced from
VAAPI's slice->slice_segment_address.
S3 (SCALING_MATRIX content choice):
Implementer choice taken: when iqmatrix_set==false (BBB has no
scaling list per SPS flags = SAO|STRONG_INTRA_SMOOTHING),
h265_fill_scaling_matrix sends memset-zero. Matches FFmpeg's
sl=NULL pattern at v4l2_request_hevc.c:384-403 (preserves
byte-equality vs cross-validator anchor).
S4 (FFmpeg function name fix): cosmetic; no code impact.
Plus one Phase 6 inline correction: phase 5 review S1 suggested
VAAPI exposes uniform_spacing_flag in pic_fields.bits; empirical
test-compile shows it doesn't. Comment added in h265_fill_pps
documenting the omission.
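The C2 classification can be sketched as follows. The structs are hypothetical reductions of VAPictureParameterBufferHEVC.ReferenceFrames[] and v4l2_ctrl_hevc_decode_params (field_pic and the remaining fields are omitted), and the SK_* flag values are copied from my reading of va.h and the V4L2 HEVC UAPI; verify them against the installed headers. The key detail is that the poc_* arrays receive the DPB index j, never a POC.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Assumed to match VA_PICTURE_HEVC_* in va.h; verify locally. */
#define SK_VA_HEVC_INVALID            0x01
#define SK_VA_HEVC_RPS_ST_CURR_BEFORE 0x10
#define SK_VA_HEVC_RPS_ST_CURR_AFTER  0x20
#define SK_VA_HEVC_RPS_LT_CURR        0x40
/* Assumed to match V4L2_HEVC_DPB_ENTRY_LONG_TERM_REFERENCE. */
#define SK_V4L2_DPB_LONG_TERM 0x01
#define SK_MAX_DPB 16

struct va_ref_sketch { uint32_t flags; int32_t poc; uint64_t timestamp_ns; };
struct dpb_entry_sketch { uint64_t timestamp; int32_t pic_order_cnt_val; uint8_t flags; };

struct decode_params_sketch {
    struct dpb_entry_sketch dpb[SK_MAX_DPB];
    uint8_t num_active_dpb_entries;
    uint8_t poc_st_curr_before[SK_MAX_DPB], num_poc_st_curr_before;
    uint8_t poc_st_curr_after[SK_MAX_DPB], num_poc_st_curr_after;
    uint8_t poc_lt_curr[SK_MAX_DPB], num_poc_lt_curr;
};

static void fill_dpb_sketch(const struct va_ref_sketch *refs, unsigned nrefs,
                            struct decode_params_sketch *dp)
{
    memset(dp, 0, sizeof(*dp));
    for (unsigned i = 0; i < nrefs && dp->num_active_dpb_entries < SK_MAX_DPB; i++) {
        if (refs[i].flags & SK_VA_HEVC_INVALID)
            continue;

        uint8_t j = dp->num_active_dpb_entries++;
        dp->dpb[j].timestamp = refs[i].timestamp_ns;
        dp->dpb[j].pic_order_cnt_val = refs[i].poc; /* singular POC */

        /* Index arrays hold DPB indices, not POC values. */
        if (refs[i].flags & SK_VA_HEVC_RPS_LT_CURR) {
            dp->dpb[j].flags = SK_V4L2_DPB_LONG_TERM;
            dp->poc_lt_curr[dp->num_poc_lt_curr++] = j;
        } else if (refs[i].flags & SK_VA_HEVC_RPS_ST_CURR_BEFORE) {
            dp->poc_st_curr_before[dp->num_poc_st_curr_before++] = j;
        } else if (refs[i].flags & SK_VA_HEVC_RPS_ST_CURR_AFTER) {
            dp->poc_st_curr_after[dp->num_poc_st_curr_after++] = j;
        }
    }
}
```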
Picture.c changes (3 edits):
1. codec_set_controls HEVCMain dispatch (lines 204-206 now call
h265_set_controls, replacing the explicit Fourier-local "HEVC stripped"
rejection).
2. codec_store_buffer HEVC VASliceParameterBufferType case: append
VAAPI slice param to params.h265.slices[N] array, increment
num_slices. Single-slice mirror at .slice retained for
h265_fill_pps (which reads dependent_slice_segment_flag from
LongSliceFlags).
3. RequestBeginPicture: add params.h265.num_slices = 0 reset
alongside existing h264.matrix_set = false reset.
Surface.h: extend params.h265 struct with a
slices[HEVC_MAX_SLICES_PER_FRAME=64] array + num_slices counter. ~17 KB extra per surface union;
24 surfaces in iter7 cap_pool = ~400 KB total surface_heap growth.
object_heap allocator picks up new size automatically via
sizeof(struct object_surface).
Context.c: separate 2-control batched call sets HEVC DECODE_MODE +
START_CODE device-wide. Same best-effort (void)v4l2_set_controls
pattern as the existing H.264 device-init block; if kernel doesn't
advertise HEVC controls (hantro on RK3568/RK3399), the batch silently
fails without invalidating the H.264 batch.
Meson.build: uncomment 'h265.c' (line 50) and 'h265.h' (line 73)
in sources + headers lists.
H265.h: added HEVC_MAX_SLICES_PER_FRAME=64 #define before struct
forward declarations.
Phase 6 smoke test on fresnel (post Commit A + Commit B):
Criterion 1: vainfo lists VAProfileHEVCMain on rkvdec env binding
(/dev/video1 + /dev/media0). PASS.
Criterion 3: ffmpeg -hwaccel vaapi HEVC decode of bbb_720p10s_hevc.mp4
-frames:v 5 -f null -, exit 0. cap_pool_init: 24 slots
ready. PASS.
Criterion 4: mpv --hwdec=vaapi --vo=image at +02s seek, HEVC fixture:
HW frame 1: 47a5f3850df5d8c732767a227830c2272ff78402a7b6adeea329e29838808be5
SW frame 1: 47a5f3850df5d8c732767a227830c2272ff78402a7b6adeea329e29838808be5
HW frame 2: a467b3bc9d7b6374b6786ecfac46932d6c7bb932ab11d311edaa233d7863e656
SW frame 2: a467b3bc9d7b6374b6786ecfac46932d6c7bb932ab11d311edaa233d7863e656
HW=SW byte-identical for both frames; frame1 != frame2 (real motion).
PASS.
Criterion 5: regression hashes hold for both prior cells:
H.264 +30s HW frame 1: f623d5f7a41697f67dd227275c6f1b21ffc257f65626d32fde8229357f8764c9 (T4 ref MATCH)
H.264 +30s HW frame 2: 7d7bc6f2146dda8b2d223bba622c4b9fbe9674181ff1e02afe286b620342e0a8 (T4 ref MATCH)
MPEG-2 +02s HW frame 1: 6e7873030dbf0403c67f35dd106ebef3c7909a0fd12433b82ad758e7fee9f092 (iter1 ref MATCH)
MPEG-2 +02s HW frame 2: ccc7ce08810d4a96e9ba7a19f4f95bbf6cc861bda9337604b5c668ad52bef7de (iter1 ref MATCH)
PASS.
All five criteria green on first build attempt — Phase 5 review
caught the 3 Critical UAPI errors (data_bit_offset → data_byte_offset
rename; dpb.rps field gone + pic_order_cnt_val rename + index-array
semantics) that would have been Phase 6 compile failures or silent
Phase 7 byte-compare divergences. Without that review pass, this
commit would have started a debugging cycle of two or more loopbacks.
Refs:
../fresnel-fourier/phase4_iter2_plan.md (10 contract clauses, File 4 patch shape)
../fresnel-fourier/phase5_iter2_review.md (C1, C2, C3, S1, S2, S3, S4, Q2 amendments all incorporated)
../fresnel-fourier/phase0_evidence/2026-05-08/iter2_phase3/ffmpeg_v4l2req.stdout
(cross-validator anchor; Phase 7 bonus byte-compare verification target)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Fix-forward for commit C (3aab187): Phase 2 source-read missed a
third occurrence of #include <mpeg2-ctrls.h> in src/context.c:42.
The Phase 2 grep audit reported only two callsites
(src/config.c:37, src/mpeg2.c:38), both removed in commit B.
After commit C deleted include/mpeg2-ctrls.h from disk, the build
broke on context.c with:
../src/context.c:42:10: fatal error: mpeg2-ctrls.h:
No such file or directory
42 | #include <mpeg2-ctrls.h>
| ^~~~~~~~~~~~~~~
The include in context.c was vestigial — context.c references no
V4L2_CID_MPEG_VIDEO_MPEG2_* symbols and never needed the header
even before iter1's rewrite. The Phase 2 grep was simply incomplete.
This commit drops the orphan include line. Build now passes; install
clean; Phase 1 criterion 4 (DMA-BUF GL HW=SW byte-identical pixel
hashes) still PASS:
HW frame 1: 6e7873030dbf0403c67f35dd106ebef3c7909a0fd12433b82ad758e7fee9f092
SW frame 1: 6e7873030dbf0403c67f35dd106ebef3c7909a0fd12433b82ad758e7fee9f092
HW frame 2: ccc7ce08810d4a96e9ba7a19f4f95bbf6cc861bda9337604b5c668ad52bef7de
SW frame 2: ccc7ce08810d4a96e9ba7a19f4f95bbf6cc861bda9337604b5c668ad52bef7de
Per feedback_dev_process.md Phase 6 discipline:
"If a plan revision is needed mid-implementation, surface it
explicitly and re-enter Phase 4."
This is a 1-line scope expansion of commit B's "drop mpeg2-ctrls.h
include from all callsites" intent. Surfacing explicitly here
rather than silently amending B (which is already pushed). No
re-lock of plan needed; the spirit of File 1+2 in
phase4_iter1_plan.md was "drop the include from every file that
has it." The audit method (Phase 2 grep) was the gap.
Lesson for the Phase 8 memory update: before deleting a header, use a
more authoritative completeness check than a naive grep. A full rebuild
to drive out hidden includes, or a grep with no path filter, would have
caught it.
Refs:
../fresnel-fourier/phase4_iter1_plan.md (File 3 + audit)
../fresnel-fourier/phase2_iter1_situation.md Bug 3 (incomplete audit)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Surfaced during iter7 Track F research: the campaign target hardware
is Rockchip RK3566 silicon (PineTab2). The hantro driver attaches
via the rockchip,rk3568-vpu DT compatible because the RK3566 silicon
is close enough to RK3568 to share that variant. The proper RK3566
mainline driver target (rkvdec2 / vdpu346) has no kernel support yet
— Christian Hewitt's patch series LKML 2025/12/26/206 is unmerged.
Updates the two src/ comments that called the hardware "RK3568":
- context.c: hantro-vpu device-init S_EXT_CTRLS comment now reads
"via rockchip,rk3568-vpu DT compatible (covers RK3568 and RK3566
— PineTab2 silicon — since they're close enough)"
- h264.c: DPB pic_num discussion ends "...never surfaced on PineTab2
(RK3566 via hantro/rk3568-vpu)"
Not a correctness change. Compiles + decodes identically. The
update matters for upstream submission accuracy (bootlin/Rockchip
maintainers will care which silicon the campaign tested on).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
iter4 (385dee1) replaced the original media_request_reinit pattern
with close+media_request_alloc per frame to escape an EINVAL on
S_EXT_CTRLS that turned out to be a DPB-payload bug (74d8dd1, FFmpeg
V4L2_H264_FRAME_REF semantics). The per-frame close+alloc model
worked for mpv vaapi-copy (single-surface recycle) but raced under
Firefox 150's MediaSource pipeline (multi-surface rotation): fd=30
got reused via lowest-free-fd allocation faster than the
kernel-side per-buffer state machine could tear down the prior request,
producing intermittent VIDIOC_QBUF EINVAL on OUTPUT after 1..53
successful frames.
Phase 2 telemetry confirmed:
- DQBUF returned the index we passed (no FIFO mismatch)
- SPS/PPS/DECODE_PARAMS/SCALING_MATRIX byte-identical between mpv
and Firefox first 64 bytes
- Pool size bump 4 -> 16 only delayed the failure (62 frames)
- Different OUTPUT slot indices failed across runs (race signature)
Fix: each OUTPUT pool slot owns a permanent request_fd allocated
once at request_pool_init and REINIT'd between uses in
RequestSyncSurface. 1:1 slot-to-fd binding eliminates cross-slot fd
reuse entirely. Pool stays driver-wide (multi-context safe per
iter5 Track E); slots cycle through 16 distinct fds in round-robin
acquire.
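A sketch of the 1:1 slot-to-fd binding, assuming hypothetical slot_sketch / pool_sketch types rather than the real request_pool.h. MEDIA_IOC_REQUEST_ALLOC and MEDIA_REQUEST_IOC_REINIT are the standard media-controller ioctls from <linux/media.h>. Each fd is allocated once, reinitialized between uses, and closed only when the pool is destroyed, so the lowest-free-fd reuse described above cannot occur.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/media.h>

/* Hypothetical types; the real request_pool.h differs. */
struct slot_sketch { int request_fd; };
struct pool_sketch { struct slot_sketch *slots; unsigned count; unsigned next; };

static void pool_sketch_destroy(struct pool_sketch *p)
{
    for (unsigned i = 0; i < p->count; i++)
        if (p->slots[i].request_fd >= 0)
            close(p->slots[i].request_fd);
    free(p->slots);
    p->slots = NULL;
    p->count = 0;
    p->next = 0;
}

/* Allocate one permanent request fd per slot. Returns 0 or -errno. */
static int pool_sketch_init(struct pool_sketch *p, int media_fd, unsigned count)
{
    p->count = 0;
    p->next = 0;
    p->slots = calloc(count, sizeof(*p->slots));
    if (p->slots == NULL)
        return -ENOMEM;

    for (unsigned i = 0; i < count; i++) {
        int fd = -1;
        if (ioctl(media_fd, MEDIA_IOC_REQUEST_ALLOC, &fd) < 0) {
            int err = -errno;       /* save before close() can clobber it */
            pool_sketch_destroy(p); /* closes the fds allocated so far */
            return err;
        }
        p->slots[i].request_fd = fd;
        p->count = i + 1;
    }
    return 0;
}

/* Round-robin acquire across the fixed set of fds. */
static struct slot_sketch *pool_sketch_acquire(struct pool_sketch *p)
{
    struct slot_sketch *s = &p->slots[p->next];
    p->next = (p->next + 1) % p->count;
    return s;
}

/* After the request completes (RequestSyncSurface): reuse, don't close. */
static int slot_sketch_recycle(struct slot_sketch *s)
{
    return ioctl(s->request_fd, MEDIA_REQUEST_IOC_REINIT, NULL) < 0 ? -errno : 0;
}
```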
Files:
- request_pool.h: add request_fd field to slot struct; init
signature takes media_fd
- request_pool.c: alloc per-slot fd at init, close at destroy
- context.c: pass driver_data->media_fd; pool size 4 -> 16
- picture.c: BeginPicture binds slot->request_fd to surface;
EndPicture's per-frame media_request_alloc removed
- surface.c: RequestSyncSurface uses media_request_reinit instead
of close+alloc; DestroySurfaces close removed (slot owns fd);
error path close removed; surface_object NULL-init for the
-Wmaybe-uninitialized warning fix
Empirical verification (clean build sha ebe396d5..., no diagnostic
instrumentation):
- Firefox 150 + bbb_1080p30_h264.mp4 + LIBVA_DRIVER_NAME=v4l2_request
+ sandbox enabled: 35s+ playback, zero "Unable to queue buffer"
/ "Unable to set control(s)", lsof shows RDD process holds
/dev/video1 + /dev/media0 throughout. Driver stderr: only the
single cap_pool_init: 24 slots ready line.
- mpv vaapi-copy 50 frames: zero errors, "Using hardware decoding
(vaapi-copy)" - no regression vs iter5-end driver.
Pool-size bump diagnostic (Phase 5 sonnet design review feedback):
bumping the pool from 4 to 16 alone only delayed the failure to 62
frames, far short of the 30 s success criterion (~900 frames at 30 fps).
REINIT discipline is the actual fix; a pool of 16 is comfortable
headroom over typical H.264 MaxDpbFrames.
Phase 5 sonnet code review: APPROVE-WITH-CHANGES (one comment
attribution corrected: cleanup runs at RequestTerminate, not
RequestDestroyContext, since the pool is driver-wide).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sonnet review 7.3 / 9.6 from iter1 + carried iter2/3/4 substrate.
Two libva driver_data instances in the same process (e.g. Firefox
playing two tabs at different resolutions, or Firefox + mpv via the
same dlopened backend) would race on the static LAST_OUTPUT_WIDTH/HEIGHT cache.
Move to struct request_data.last_output_width/height. The V4L2
device fd is already per-driver_data, so this is the correct binding
unit (one fd, one current OUTPUT format).
Verified: two concurrent mpv processes (2s stagger) both decode
300 frames cleanly with no cross-corruption. Same-instant init still
hits kernel-level fd contention on /dev/video1 (hantro is a
single-instance device); cross-process serialization is out of scope
for a libva backend.
Resolves the surface_reset_format_cache() callsite: it now takes a
driver_data parameter (was zero-arg).
Also drops the 'rc' unused-variable warning in v4l2_ioctl_controls
that the iter5 sweep left behind.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pre-iter2 each VA surface was permanently 1:1 bound to one V4L2 CAPTURE
buffer. mpv reusing a surface for a new decode while the compositor still
held an EXPBUF'd dma_buf fd to the prior frame caused the kernel to
write fresh decode output into the same physical memory the compositor
was reading -- visible as stutter / back-and-forth swap on
mpv --hwdec=vaapi --vo=gpu playback.
Architecture:
- New cap_pool abstraction (cap_pool.{h,c}) owns N CAPTURE buffers
(N = max(surfaces_count, MIN_CAP_POOL=24)) with per-slot state
{FREE, IN_DECODE, DECODED, EXPORTED} guarded by pthread_mutex_t.
- Surfaces no longer own buffers; each vaBeginPicture acquires the
oldest FREE slot (LRU), binds it for the decode cycle, and the slot
cycles IN_DECODE -> DECODED (post-DQBUF) -> EXPORTED (post-EXPBUF).
- Slot is released on next BeginPicture for the same surface or on
vaDestroySurfaces.
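A reduced sketch of the slot state machine and LRU acquire, assuming a fixed-size slot array and a logical clock stamped at acquire time. cap_pool_sketch and its helpers are hypothetical; the real cap_pool also tracks V4L2 indices, mmaps and the retained EXPBUF fd that the force-recycle path closes.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

enum slot_state { SLOT_FREE, SLOT_IN_DECODE, SLOT_DECODED, SLOT_EXPORTED };

#define CAP_SLOTS_MAX 32 /* hypothetical bound */

struct cap_slot { enum slot_state state; uint64_t last_used; };

struct cap_pool_sketch {
    pthread_mutex_t lock;
    struct cap_slot slots[CAP_SLOTS_MAX];
    unsigned count;
    uint64_t clock; /* logical time for LRU ordering */
};

/* Least-recently-acquired slot in the given state, or -1. */
static int oldest_in_state(const struct cap_pool_sketch *p, enum slot_state st)
{
    int best = -1;
    for (unsigned i = 0; i < p->count; i++)
        if (p->slots[i].state == st &&
            (best < 0 || p->slots[i].last_used < p->slots[best].last_used))
            best = (int)i;
    return best;
}

/* vaBeginPicture: oldest FREE slot; if the pool is exhausted, fall back
 * to force-recycling the oldest EXPORTED slot (the Option-A window). */
static int cap_pool_sketch_acquire(struct cap_pool_sketch *p)
{
    pthread_mutex_lock(&p->lock);
    int i = oldest_in_state(p, SLOT_FREE);
    if (i < 0)
        i = oldest_in_state(p, SLOT_EXPORTED);
    if (i >= 0) {
        p->slots[i].state = SLOT_IN_DECODE;
        p->slots[i].last_used = ++p->clock;
    }
    pthread_mutex_unlock(&p->lock);
    return i;
}

/* DQBUF -> DECODED, EXPBUF -> EXPORTED, release -> FREE. */
static void cap_pool_sketch_set(struct cap_pool_sketch *p, int i, enum slot_state st)
{
    pthread_mutex_lock(&p->lock);
    p->slots[i].state = st;
    pthread_mutex_unlock(&p->lock);
}
```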
Limitations (Sonnet Phase 5 review iter2 9.x, deferred to iter3+):
- Option-A statistical mitigation; race window narrows to "pool
exhausted, force-recycle of oldest EXPORTED slot." For typical mpv
16-surface playback with MIN_CAP_POOL=24 the fallback never fires.
- Multi-context concurrent use not addressed (one V4L2 device, multiple
cap_pools -- iter3 scope).
Other call sites updated:
- picture.c::BeginPicture acquires + binds, releasing prior slot if any.
- surface.c::SyncSurface marks slot DECODED after DQBUF.
- surface.c::ExportSurfaceHandle marks slot EXPORTED, retaining OUR
EXPBUF fd for force-recycle close().
- surface.c::DestroySurfaces releases via surface_unbind_slot;
cap_pool owns the mmaps now.
- surface.c::CreateSurfaces2 destroys the pool in the resolution-change
path before REQBUFS(0) (else stale v4l2_index after Fix 1's REQBUFS).
- context.c::DestroyContext invokes cap_pool_destroy.
- image.c::DeriveImage skips copy_surface_to_image when current_slot is
NULL (ffmpeg av_hwframe_ctx_init probes derive on undecoded surfaces).
Verified: mpv vaapi-copy 200 frames bbb_1080p30, 0 drops, LRU visibly
recycling slot indices, real luma gradient. mpv vaapi --vo=gpu
operator-inspection follows.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Fix 1 of iteration 2 per phase4_iter2_plan.md.
Adds surface_reset_format_cache() exposed from src/surface.h. Called
from RequestDestroyContext after the dual REQBUFS(0). Without this,
multi-video Firefox sessions on mozilla.org corrupted the next
session's CAPTURE format query: the kernel reset to defaults but
our LAST_OUTPUT_WIDTH/HEIGHT cache still said 'already 1920x1088,'
so the next G_FMT returned 48x48 and the exported descriptor
encoded wrong pitch/offset.
Also adds REQBUFS(0) on CAPTURE in the resolution-change path of
RequestCreateSurfaces2 (Sonnet Phase 5 review iter2 9.1). The
existing code only did REQBUFS(0) on OUTPUT before re-S_FMTting;
hantro derives CAPTURE format from OUTPUT format, so leftover
CAPTURE buffers from the prior resolution would also block the
implicit format change. Pre-existing bug surfaced by Sonnet's
audit; Fix 3 pool refactor would have exposed it more often.
Limitation noted in surface.h docblock: the LAST_OUTPUT_WIDTH/
HEIGHT cache is a static process-global, so concurrent multi-
context use still races (Sonnet 7.3 / 9.6). Iteration 2 only
addresses sequential sessions. Multi-context safety is iteration 3+.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Patch 0002 sets V4L2_CID_STATELESS_H264_START_CODE to ANNEX_B on the
device, telling the kernel that OUTPUT-buffer payloads will contain
0x00 0x00 0x01 NAL start codes. picture.c::codec_store_buffer has
the prepend logic guarded by `if (context->h264_start_code)`, but
that boolean is set ONLY inside h264_get_controls() — a function
that exists but is never called.
Result: device expects ANNEX_B, libva-v4l2-request feeds raw NAL
payloads with no start codes, kernel cannot find slice boundaries,
hantro emits a zeroed CAPTURE buffer. mpv reports successful decode
because the V4L2 round-trip succeeds (no EINVAL); the visual output
is a flat dark-green frame (NV12 zero through BT.709).
Identified via:
- Patch 0006 cleared the EINVAL cluster-rejection (128 → 0 on
bbb_1080p30) but visual output remained flat green.
- GStreamer reference (gstv4l2codech264dec.c:1363-1377) confirms
start codes are required when ANNEX_B is selected.
- Source-archaeology of fourier's picture.c:67-74 showed the gate
on context->h264_start_code.
Fix: in context.c::RequestCreateContext, immediately after patch
0002's device-control block, set context_object->h264_start_code =
true to match the ANNEX_B mode we just programmed. Hardcoded for
now (matches 0002's hardcoded set); replaced with a runtime probe
in the planned probe-then-set commit.
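Roughly what the gated prepend amounts to, as a hypothetical helper
(`store_slice()` is not the real codec_store_buffer() signature): with
the flag false, the kernel receives bare NAL payloads and cannot find
slice boundaries, even though the device was told to expect ANNEX_B.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Annex B three-byte start code. */
static const unsigned char annex_b_start_code[3] = { 0x00, 0x00, 0x01 };

/*
 * Hypothetical helper modelled on the gated prepend described above:
 * append one slice NAL to the OUTPUT buffer at offset *used, prefixing
 * it with a start code only when start_code is true.
 * Returns 0 on success, -1 if the slice does not fit.
 */
int store_slice(unsigned char *dst, size_t dst_size, size_t *used,
		const unsigned char *nal, size_t nal_size, bool start_code)
{
	size_t prefix = start_code ? sizeof(annex_b_start_code) : 0;

	if (*used + prefix + nal_size > dst_size)
		return -1;

	if (start_code) {
		memcpy(dst + *used, annex_b_start_code, prefix);
		*used += prefix;
	}
	memcpy(dst + *used, nal, nal_size);
	*used += nal_size;
	return 0;
}
```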
Signed-off-by: Markus Fritsche <fritsche.markus@gmail.com>
Commit 3 of the upstreamable plan (upstreamable_design.md §1, §5).
Replaces the prior per-surface OUTPUT-buffer ownership model with a
small driver-wide pool sized by codec pipeline depth (4 H.264 frames
in flight), allocated unconditionally regardless of the caller's
num_render_targets.
Prior art (kernel UAPI dev-stateless-decoder.rst, ffmpeg
v4l2_request.c, Chromium V4L2StatelessVideoDecoder, GStreamer
v4l2slh264dec) all decouple OUTPUT and CAPTURE pool sizing. fourier's
"output_count == surfaces_count" model was a category error: OUTPUT
buffers are request-time bitstream slots, CAPTURE buffers are
picture-time DPB slots; their lifecycles and sizing are independent.
Changes:
* NEW src/request_pool.{c,h} (~200 LoC):
- request_pool_init(): CREATE_BUFS + per-slot QUERYBUF + mmap.
- request_pool_destroy(): munmap all, idempotent.
- request_pool_acquire(): round-robin claim; returns V4L2 buffer
index of an unused slot or -1.
- request_pool_release(): mark slot free for reuse.
- request_pool_slot(): accessor for ptr/size given a buffer index.
* src/request.h: add struct request_pool output_pool to request_data.
* src/context.c::RequestCreateContext: replace the per-surface
OUTPUT loop with a single request_pool_init() call (count=4,
independent of surfaces_count). Drop the now-unused locals
(length, offset, source_data, output_buffers_count, index,
index_base, i, surface_object). DELETES patch 0002's
"output_buffers_count = ... ? ... : 4" hack inline — the pool's
own count parameter supersedes it.
* src/picture.c::RequestBeginPicture: borrow a pool slot at frame
start, write its mmap pointer/size/index into the surface's
transient source_* fields. The fields stay (still useful as
a borrow handle that the existing codec_store_buffer memcpys
target), but no longer represent surface-permanent ownership.
Reset slices_size/slices_count here too (was implicit on first
Render).
* src/surface.c::RequestSyncSurface: after VIDIOC_DQBUF returns
the OUTPUT buffer, release the pool slot and clear the surface's
borrow handle. Fixes the segv on second-frame submission.
* src/surface.c::RequestDestroySurfaces: remove the munmap of
source_data — pool owns the mmap.
* src/request.c::RequestTerminate: call request_pool_destroy()
before close(video_fd) so munmaps still target a valid fd.
* src/meson.build: add request_pool.c and request_pool.h to the
sources/headers lists.
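The slot bookkeeping behind request_pool_acquire()/request_pool_release()
can be sketched without any V4L2 calls. This assumes one busy flag per
slot and a rotating cursor; `struct pool_sketch`, `pool_acquire()` and
`pool_release()` are hypothetical and omit the CREATE_BUFS/QUERYBUF/mmap
work that request_pool_init() performs.

```c
#include <stdbool.h>

#define POOL_MAX_SLOTS 4	/* pipeline depth used by this commit */

/* Hypothetical bookkeeping half of a request pool; the real struct also
 * carries the mmap pointer and length for each slot. */
struct pool_sketch {
	unsigned int count;	/* slots actually allocated */
	unsigned int base;	/* first V4L2 buffer index from CREATE_BUFS */
	unsigned int next;	/* round-robin cursor */
	bool busy[POOL_MAX_SLOTS];
};

/* Claim the next free slot, starting at the cursor. Returns the V4L2
 * buffer index, or -1 when every slot is still in flight. */
int pool_acquire(struct pool_sketch *pool)
{
	for (unsigned int n = 0; n < pool->count; n++) {
		unsigned int slot = (pool->next + n) % pool->count;

		if (!pool->busy[slot]) {
			pool->busy[slot] = true;
			pool->next = (slot + 1) % pool->count;
			return (int)(pool->base + slot);
		}
	}
	return -1;
}

/* Return a slot by its V4L2 buffer index (after DQBUF on OUTPUT). */
void pool_release(struct pool_sketch *pool, int index)
{
	unsigned int slot = (unsigned int)index - pool->base;

	if (index >= (int)pool->base && slot < pool->count)
		pool->busy[slot] = false;
}
```

With four slots, a fifth acquire fails until RequestSyncSurface releases
one after DQBUF, which bounds the bitstream buffers in flight
independently of how many surfaces the caller created.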
This commit removes 0002's OUTPUT-pool hack inline (the
"floor to 4" line is gone). The DECODE_MODE/START_CODE block in 0002
remains until commit 4 lands.
Build-verified clean on aarch64.
Signed-off-by: Markus Fritsche <fritsche.markus@gmail.com>
Two related fixes that surfaced during the first hantro-vpu (RK3568)
smoke test of the multiplanar build:
1. **OUTPUT queue must be non-empty at STREAMON.** Hantro's
vb2_start_streaming rejects an empty queue with EINVAL. Some
VA-API callers (notably ffmpeg's vaapi-copy path) call
vaCreateContext with num_render_targets=0 and allocate render
targets lazily. The OUTPUT (bitstream-input) pool must NOT be
sized off surfaces_count alone — it is a request-time resource,
not per-surface. Quick fix: floor the pool to 4 buffers when
the caller passes 0. (A proper decoupling of OUTPUT pool from
surface lifecycle is documented in upstreamable_design.md.)
2. **Device-wide stateless H.264 controls before STREAMON.** The
V4L2 stateless framework requires V4L2_CID_STATELESS_H264_
DECODE_MODE and START_CODE be set on the device fd
(request_fd=-1) before stream start. Per-request controls
(SPS/PPS/SLICE_PARAMS/etc.) attached to a request_fd come
later via h264_set_controls(). hantro-vpu accepts only
DECODE_MODE_FRAME_BASED; START_CODE_ANNEX_B matches what the
existing slice-assembly path emits.
This is set unconditionally for now (errors silently ignored)
to keep cedrus and other backends compatible — they may
default to SLICE_BASED and not expose DECODE_MODE at all.
Probe-then-set via VIDIOC_QUERYCTRL is the upstream-correct
approach (see upstreamable_design.md §3).
After this patch, vainfo still enumerates as before, but the first
mpv vaapi-copy attempt advances past STREAMON and into actual
decode submission.
Signed-off-by: Markus Fritsche <fritsche.markus@gmail.com>
Compound patch carrying the fork's pre-Step-1 substrate, originally
authored by Jernej Škrabec / fourier on top of bootlin's a3c2476:
- src/h264.c + src/picture.c: V4L2_CID_MPEG_VIDEO_H264_* renamed to
V4L2_CID_STATELESS_H264_*, struct shapes tracked to mainline
(V4L2_CID_STATELESS_H264_DECODE_MODE/_START_CODE added to the
passthrough shim).
- include/hevc-ctrls.h: redirect shim to <linux/v4l2-controls.h>
(kernel-side HEVC controls now live in the canonical UAPI header).
- src/meson.build: src/h265.c / src/h265.h commented out — HEVC
build path is excluded from this fork (RK3568 hantro G1/G2 has
no HEVC, and the kernel-side HEVC controls have a separate
rework in flight upstream).
- src/tiled_yuv.S: aarch64 stub for tiled_to_planar (assembly
source was sunxi-cedrus armv7-only; aarch64 needs a stub to keep
the build linking).
- include/h264-ctrls.h: removed (dead post-fourier — no source
includes it; the passthrough shim's CID aliases live in the
kernel header now).
Functionally equivalent to the prior fork master commits:
c1f5108 V4L2_PIX_FMT_H264_SLICE rename
4ccbfe9 Strip HEVC build path
da9f2a5 include/h264-ctrls.h passthrough + CID aliases
fc4bb10 src/h264.c track upstream UAPI shape
13e9b64 src/h264.c drop num_slices field
4d14ffb src/tiled_yuv.S aarch64 stub
1b02c9b src/h264.c include utils.h
Folded into one commit during 2026-05-04 Step 1 reconciliation
(see ../phase0_evidence/2026-05-04/findings.md). Per-patch history
of the early fork commits preserved on the pre-step1 branch.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Drop the per-codec options while at it, since we'll soon include a copy
of the associated headers.
Signed-off-by: Paul Kocialkowski <paul.kocialkowski@bootlin.com>
H.264 and H.265 are still not supported upstream,
so it makes sense to autodetect each codec and only
enable those that are supported.
Signed-off-by: Ezequiel Garcia <ezequiel@collabora.com>
Because there might be more than a single call to CreateSurfaces,
we cannot assume that the index relative to the number of surfaces
requested in a single call matches the v4l2 index.
Grab the base index (as returned by the kernel) when allocating
buffers and use it for memory mapping and addressing them in v4l2.
This avoids memory-mapping the first (index 0) buffer multiple times
in that scenario, instead of the buffers actually allocated by the
n-th call in the sequence.
Signed-off-by: Paul Kocialkowski <paul.kocialkowski@bootlin.com>
The V4L2 API does not currently provide a way to free allocated
buffers one by one (which would fit well with DestroySurfaces in
VAAPI). Moreover, streaming needs to be off before freeing buffers
is allowed.
As a result, output and capture buffers can only be freed when
destroying the decoding context, all at once, as implemented
in this patch.
Signed-off-by: Paul Kocialkowski <paul.kocialkowski@bootlin.com>
Since the V4L2 ioctl is called QUERYBUF, it makes more sense to
call the associated function with the same name.
Signed-off-by: Paul Kocialkowski <paul.kocialkowski@bootlin.com>
void * converts implicitly to and from any object pointer type without
any warning. Remove the explicit casts.
Signed-off-by: Maxime Ripard <maxime.ripard@bootlin.com>
The cedrus_data structure carries the old name. In order to migrate to the
new name, let's rename it to request_data.
Signed-off-by: Maxime Ripard <maxime.ripard@bootlin.com>
The sunxi_cedrus.h header contains a bunch of defines prefixed with
SUNXI_CEDRUS.
As part of the ongoing migration to a more generic name, change that prefix
to V4L2_REQUEST, and rename the header file to request.h.
Signed-off-by: Maxime Ripard <maxime.ripard@bootlin.com>
As part of our renaming effort, rename the libva hook names to mention
request instead of SunxiCedrus.
Signed-off-by: Maxime Ripard <maxime.ripard@bootlin.com>
libva only provides the reference images needed to decode the current
picture, not the full DPB. However, some codecs need that whole DPB in
order to decode a picture.
For example, the Allwinner hardware codec has an internal SRAM, with each
picture getting a slot in that SRAM, and during each decoding process, some
metadata is generated from that SRAM content into a separate
buffer. Therefore, each frame must be located at the same SRAM position
every time so that the metadata is then reused properly.
However, since libva only passes a few reference images, we can end up
in a situation where multiple subsequent frames have the same
reference set, but might all be used as references later on and
therefore cannot be located at the same position.
From a more theoretical point of view, Linux also expects a full-blown DPB
in its H264 control.
In order to work around this, we can create a shadow of the DPB by simply
maintaining a list of 16 decoded images, each associated with its
VAPictureH264 and an age. This age is the last time we used that frame as a
reference. When a new picture is decoded, either we assign it to a free
slot, or we reuse the slot of the frame that hasn't been used as a
reference for the longest time.
This is a much simpler approach than the one documented in the H264 spec,
but this shouldn't really be a problem since we don't manage the reference
frames ourselves; we just reuse the ones libva provides, which were taken
from the bitstream. As such, frames that are no longer used for reference
stop appearing in libva's reference lists, their age is no longer
refreshed, and after a while we garbage-collect their slot to store a
much newer frame.
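A sketch of the age-based slot selection described above.
`struct dpb_shadow`, `dpb_touch()` and `dpb_insert()` are hypothetical,
the per-entry VAPictureH264 is reduced to a surface ID, and "age" is
taken to be a decode counter value recorded whenever a frame is used as
a reference.

```c
#include <stdbool.h>
#include <stdint.h>

#define DPB_SIZE 16

/* Hypothetical shadow-DPB entry; the real one also keeps the
 * VAPictureH264 that identifies the frame. */
struct dpb_entry {
	unsigned int surface_id;
	uint64_t age;	/* decode counter value when last used as reference */
	bool valid;
};

struct dpb_shadow {
	struct dpb_entry entries[DPB_SIZE];
	uint64_t counter;	/* incremented once per decoded picture */
};

/* Refresh the age of a frame that libva lists as a reference again. */
void dpb_touch(struct dpb_shadow *dpb, unsigned int surface_id)
{
	for (int i = 0; i < DPB_SIZE; i++)
		if (dpb->entries[i].valid &&
		    dpb->entries[i].surface_id == surface_id)
			dpb->entries[i].age = dpb->counter;
}

/* Pick a slot for a newly decoded picture: a free one if any, else the
 * entry whose last reference use is oldest. Returns the slot index. */
int dpb_insert(struct dpb_shadow *dpb, unsigned int surface_id)
{
	int victim = 0;

	dpb->counter++;
	for (int i = 0; i < DPB_SIZE; i++) {
		if (!dpb->entries[i].valid) {
			victim = i;
			break;
		}
		if (dpb->entries[i].age < dpb->entries[victim].age)
			victim = i;
	}

	dpb->entries[victim].surface_id = surface_id;
	dpb->entries[victim].age = dpb->counter;
	dpb->entries[victim].valid = true;
	return victim;
}
```

Per picture, a caller would dpb_touch() every reference libva passes in,
then dpb_insert() the picture being decoded; frames that drop out of
libva's reference lists keep their old age and become the first
candidates for reuse.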
Signed-off-by: Maxime Ripard <maxime.ripard@bootlin.com>
The coding style has been a bit erratic. Enforce the Linux kernel coding
style by reusing its .clang-format file, running clang-format on the
source, and ignoring the few shortcomings that clang-format has at the
moment (especially on aligning the define values).
Signed-off-by: Maxime Ripard <maxime.ripard@bootlin.com>
This long structure name makes it quite difficult to fit within the
80-character limit. Shorten it.
Signed-off-by: Maxime Ripard <maxime.ripard@bootlin.com>