Three structural fixes (1-3) plus two follow-ups (4-5) for AV1 with
film_grain on vpu981 (RK3588). Output is no longer empty and ffmpeg no
longer crashes; frame 0 (IDR with apply_grain=1) is bit-exact vs
kdirect. Inter frames still diverge.
Fix 1 — surface.h + surface.c: linked_decode_surface_id field on
object_surface, initialized to VA_INVALID_SURFACE. When AV1 picture has
apply_grain=1, VAAPI's VADecPictureParameterBufferAV1 carries a
current_display_picture distinct from current_frame. ffmpeg-vaapi calls
vaBeginPicture on current_frame (decode surface, slot gets bound) but
vaGetImage on current_display_picture (display surface, no slot) → NULL
deref in copy_surface_to_image.
Fix 2 — av1.c: in av1_set_controls, when cur_frame != cur_display, set
display_surface->linked_decode_surface_id = current_frame. Establishes
the back-link so display surface can borrow decode surface's data.
Fix 3 — image.c copy_surface_to_image: when slot is NULL and the
surface has linked_decode_surface_id set, look up the decode surface and
mirror its destination_data[] + destination_sizes[] +
destination_planes_count. NULL guard with diagnostic log retained.
Fix 4 — av1.c fill_film_grain: when apply_grain=1, also set
V4L2_AV1_FILM_GRAIN_FLAG_UPDATE_GRAIN. Confirmed by strace-diff: kdirect
sends flags=0x0B (APPLY|UPDATE|...), libva was sending 0x09 (APPLY but
no UPDATE). Without UPDATE the kernel tries to reuse grain parameters
from film_grain_params_ref_idx=0, which is never populated. UPDATE was
reverted earlier because it seemed to trigger a SEGV, but that SEGV was
the NULL-slot deref it unmasked; with fixes 1+2+3 in place UPDATE is safe.
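A minimal sketch of the Fix 4 flag logic. The two FG_FLAG_* values are
assumed to mirror V4L2_AV1_FILM_GRAIN_FLAG_APPLY_GRAIN and
_UPDATE_GRAIN (consistent with the 0x09 vs 0x0B strace diff above);
struct film_grain and film_grain_force_update() are hypothetical names,
not the driver's actual fill_film_grain code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed to mirror the kernel UAPI flag values (see lead-in). */
#define FG_FLAG_APPLY_GRAIN  0x1u
#define FG_FLAG_UPDATE_GRAIN 0x2u

/* Hypothetical stand-in for the flags field of the V4L2 film grain control. */
struct film_grain {
	uint8_t flags;
};

/* Whenever APPLY_GRAIN is set, also set UPDATE_GRAIN so the kernel uses
 * the parameters sent with this frame instead of a reference slot. */
static inline void film_grain_force_update(struct film_grain *fg)
{
	assert(fg != NULL);
	if (fg->flags & FG_FLAG_APPLY_GRAIN)
		fg->flags |= FG_FLAG_UPDATE_GRAIN;
}
```

With flags=0x09 on input this yields 0x0B; frames without APPLY_GRAIN
are left untouched.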
Fix 5 — av1.c reference_frame_ts plumbing: when a referenced surface
has timestamp=0 AND linked_decode_surface_id set, follow the link to
find the decode surface that carries the real timestamp. Display
surfaces don't get OUTPUT QBUF'd by us, so their own timestamp stays
zero.
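The link-following in Fix 5 can be sketched as below. This is an
illustration under stated assumptions: struct surface is a trimmed-down,
hypothetical stand-in for object_surface, the lookup is a linear scan
rather than the driver's object heap, and INVALID_SURFACE stands in for
VA_INVALID_SURFACE.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define INVALID_SURFACE 0xffffffffu /* stands in for VA_INVALID_SURFACE */

/* Hypothetical, trimmed-down surface record. */
struct surface {
	uint32_t id;
	uint64_t timestamp_ns;             /* 0 if never queued on OUTPUT */
	uint32_t linked_decode_surface_id; /* INVALID_SURFACE if unlinked */
};

/* Linear lookup over a small table; returns NULL for unknown ids. */
static inline const struct surface *
find_surface(const struct surface *tbl, size_t n, uint32_t id)
{
	for (size_t i = 0; i < n; i++)
		if (tbl[i].id == id)
			return &tbl[i];
	return NULL;
}

/* Timestamp to report for ref_id: the surface's own, or the linked decode
 * surface's when the surface itself carries no timestamp. */
static inline uint64_t
reference_ts(const struct surface *tbl, size_t n, uint32_t ref_id)
{
	const struct surface *s;

	assert(tbl != NULL);
	s = find_surface(tbl, n, ref_id);
	if (s == NULL)
		return 0;
	if (s->timestamp_ns == 0 &&
	    s->linked_decode_surface_id != INVALID_SURFACE) {
		const struct surface *dec =
			find_surface(tbl, n, s->linked_decode_surface_id);
		if (dec != NULL)
			return dec->timestamp_ns;
	}
	return s->timestamp_ns;
}
```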
Also: BeginPicture diagnostic log + surface_unbind_slot diagnostic log
+ v4l2.c error_idx diagnostic (kept from earlier — useful for ongoing
investigation).
Verification on ampere:
test_av1.ivf (208x208, 2 frames, no grain): bit-exact PASS sha
029ee72c214b37c1 (unchanged, no regression)
av1_larger.ivf (352x288, 10 frames, film_grain alternates):
frame 0 (key, apply_grain=1): PASS bit-exact vs kdirect
frame 4: PASS bit-exact
frames 1,2,3,5,6,7,8,9: DIFFER
Frame 0 PASS proves: SEQUENCE + FRAME + TILE_GROUP_ENTRY + FILM_GRAIN
mapping is correct for IDR. Frame 4 PASS is unexplained but encouraging.
Inter-frame divergence (frame 1+) points at: reference handling for
inter prediction is still off — either order_hints[] (still zero,
VAAPI doesn't expose per-ref), or grain-applied vs pre-grain DPB
semantics, or ref_frame_idx pointing into the wrong surface space.
Next investigation: per-frame strace diff between libva and kdirect
controls payload to spot remaining field mis-mappings on inter frames.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Phase 3 iteration on the av1_larger.ivf (352x288 film_grain-50) fixture.
208x208 test_av1.ivf remains bit-exact PASS at sha 029ee72c214b37c1
(libva == kdirect, post-reference_frame_ts plumbing).
Negative result this commit: setting V4L2_AV1_FILM_GRAIN_FLAG_UPDATE_GRAIN
unconditionally when apply_grain=1 triggered a userspace SIGSEGV in
ffmpeg on the av1_larger fixture (consistent across runs). Reverted to
implicit update_grain=0 — same behavior as before the experiment
(silent 0-byte output, no segfault).
Hypothesis ruled out: the 352x288 silent-decode-failure is NOT solved
by always-update_grain. A/B test earlier also confirmed that omitting
the FILM_GRAIN control entirely (AV1_NO_FG=1) still produces 0 bytes,
so film_grain is not the trigger.
Remaining Phase 3 investigation candidates:
- tile_info field shape — single-tile av1_larger may stress mi_col/
row_starts sentinel differently than single-tile test_av1
- segmentation / quantization fields — different streams use
different combinations
- order_hints[] still zero (VAAPI doesn't expose per-ref)
- kernel-side dev_dbg in vpu981 driver would expose which
control field validation rejects
- strace -e ioctl on the failing decode reveals MEDIA_REQUEST_IOC_QUEUE
return value
Sibling-iteration parallel: ampere-kernel-decoders iter2-5 took
multiple iterations to localize the HEVC OOPS to kernel-side
ext_sps NULL init + slice_params; AV1 likely needs the same depth
of kernel-side instrumentation for the 352x288 case.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Phase 2.1 first hardware test on ampere passed frame 1 (IDR) bit-exact
vs kdirect but frame 2 (inter) diverged starting at byte 64897. Root
cause: reference_frame_ts[] left at zero — kernel can't cross-reference
prior CAPTURE buffers without timestamps.
Fix: in av1_set_controls (which has driver_data), iterate VAAPI's
ref_frame_map[8] (VASurfaceIDs), look up each via SURFACE(driver_data,
ref_id), and pull v4l2_timeval_to_ns(&ref_surface->timestamp) into the
V4L2 ctrl. VA_INVALID_SURFACE entries stay at calloc-zero. Mirrors the
vp9.c:614-628 pattern scaled to AV1's 8 ref slots.
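A rough sketch of the reference_frame_ts[] fill described above. It
assumes surfaces can be indexed directly by id (the driver goes through
SURFACE(driver_data, ref_id) instead); timeval_to_ns() repeats the
arithmetic of the UAPI v4l2_timeval_to_ns() helper so the snippet builds
without V4L2 headers, and fill_reference_frame_ts() is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/time.h>

#define NUM_REF_SLOTS   8
#define INVALID_SURFACE 0xffffffffu /* stands in for VA_INVALID_SURFACE */

/* Same arithmetic as v4l2_timeval_to_ns(). */
static inline uint64_t timeval_to_ns(const struct timeval *tv)
{
	return (uint64_t)tv->tv_sec * 1000000000ull +
	       (uint64_t)tv->tv_usec * 1000ull;
}

/* Fill one timestamp per AV1 reference slot. Invalid or unknown
 * surfaces stay at zero, matching the calloc-zero default. */
static inline void
fill_reference_frame_ts(uint64_t out[NUM_REF_SLOTS],
			const uint32_t ref_frame_map[NUM_REF_SLOTS],
			const struct timeval *surface_ts, size_t num_surfaces)
{
	assert(out != NULL && ref_frame_map != NULL && surface_ts != NULL);
	for (size_t i = 0; i < NUM_REF_SLOTS; i++) {
		uint32_t id = ref_frame_map[i];

		out[i] = 0;
		if (id == INVALID_SURFACE || id >= num_surfaces)
			continue;
		out[i] = timeval_to_ns(&surface_ts[id]);
	}
}
```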
surface_object->timestamp itself is populated in picture.c::EndPicture
from context_object->timestamp_counter at QBUF time on the OUTPUT
buffer — already in place from iter1 baseline.
Verification on ampere (/tmp/test_av1.ivf 208x208, 2 frames):
Frame 1 + 2 libva sha 029ee72c214b37c1 == kdirect 029ee72c214b37c1
→ 100% byte-identical, kdirect was Phase 0-verified bit-perfect
order_hints[] still zero — VAAPI doesn't expose per-ref POC; observed
not load-bearing on the 208x208 smoke vector. Multi-tile + film_grain
stress vectors are next (av1-1-b8-23-film_grain-50.ivf).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
ampere-av1 Phase 2.1 + 3 diagnostic: log which control failed validation
on S_EXT_CTRLS rejection so debug iterations can identify the offending
CID without strace. Pre-validation failures (error_idx >= count) log as
"<pre-validation>" with the syscall errno surfacing the root reason.
Already informative on ampere — surfaces the pre-existing benign H264 +
HEVC device-init failures on the vpu981 AV1 fd as count=2 / failed_cid=0
(those go through (void)cast at context.c:450/473 by design).
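The error_idx classification can be illustrated as follows. This is a
sketch only: failed_control_id() is a hypothetical helper, and returning
0 for the pre-validation case is chosen to match the failed_cid=0 log
output mentioned above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Map the error_idx reported by a failed VIDIOC_S_EXT_CTRLS onto the
 * control ID that was rejected. Returns 0 when error_idx >= count, i.e.
 * the failure is not attributable to one specific control. */
static inline uint32_t
failed_control_id(const uint32_t *cids, uint32_t count, uint32_t error_idx)
{
	assert(cids != NULL || count == 0);
	if (error_idx >= count)
		return 0;
	return cids[error_idx];
}
```

A caller would log the returned CID next to errno, printing
"<pre-validation>" when the result is 0.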
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Extended any_fd_supports_output_format() with vpu981 fd as 4th probe
target. Added V4L2_PIX_FMT_AV1_FRAME advertisement in
RequestQueryConfigProfiles. VAProfileAV1Profile0 in entrypoints +
GetConfigAttributes switches.
V4L2_REQUEST_MAX_PROFILES=11 now exactly full; comment added warning
about future profile additions needing the constant bumped.
Verified via vainfo:
VAProfileMPEG2Simple/Main, H264×5, HEVC, VP8, AV1 — all advertised
(VP9 absent because rkvdec module is on sibling-campaign-close
state, not the broken vp9-iter1; restoring VP9 needs the
ampere-vp9-enablement campaign reopened or the fail-state module
reloaded.)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
RK3588 has THREE hantro-vpu instances (legacy MPEG2/VP8 at /dev/video2,
encoder at /dev/video3, vpu981 AV1 at /dev/video4). The existing
2-device probe is "RK3399-shaped knowledge" — silently picks the first
hantro-vpu and never finds vpu981.
This commit adds:
- video_fd_vpu981 + media_fd_vpu981 slots to request_data
- video_node_supports_output_fmt(): capability probe via VIDIOC_ENUM_FMT
on OUTPUT/OUTPUT_MPLANE queues
- find_decoder_device_by_driver_with_fmt(): walks /dev/media* matching
driver name AND capability filter (V4L2_PIX_FMT_AV1_FRAME for vpu981)
- 'a' kind in request_device_kind_for_profile (VAProfileAV1Profile0)
- 'a' branch in request_switch_device_for_profile
- vpu981 probe at backend init, alongside existing rkvdec + hantro
- vpu981 fd cleanup in RequestTerminate
- VAProfileAV1Profile0 → V4L2_PIX_FMT_AV1_FRAME in codec.c
Verified on ampere:
$ LIBVA_DRIVER_NAME=v4l2_request ffmpeg ... 2>&1 | grep iter38
v4l2-request: auto-selected codec device: /dev/video1 + /dev/media0
v4l2-request: iter38: also opened hantro-vpu decoder at /dev/video2 + /dev/media1
v4l2-request: ampere-av1: vpu981 AV1 decoder at /dev/video4 + /dev/media3
Three devices opened. HEVC still works (iter2 EXT_SPS_RPS probe still
triggers on rkvdec, sibling-campaign bit-perfect behaviour preserved).
Next steps: config.c advertise VAProfileAV1Profile0, surface.h add
av1 substruct, picture.c dispatch, av1.{c,h} for the codec dispatch
(~700 LoC mirroring Kwiboo v4l2_request_av1.c).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Per Phase 4 plan + Phase 5 review amendments (SPS parse-and-cache,
per-fd gating).
src/h265.c additions:
- #include <errno.h>, the v4l2-hevc-ext-controls.h, and the
vendored gst/codecparsers/gsth265parser.h
- new static helper h265_populate_ext_sps_rps_cache(): walks
surface_object->source_data for an SPS NAL (nal_unit_type == 33)
using gst_h265_parser_identify_nalu; if found, calls
gst_h265_parser_parse_sps_ext (NOT gst_h265_parser_parse_sps —
the latter discards the per-RPS-entry EXT data we need); maps
GstH265ShortTermRefPicSet (base) + GstH265ShortTermRefPicSetExt
(carrying use_delta_flag[16], used_by_curr_pic_flag[16],
delta_poc_s0_minus1[16], delta_poc_s1_minus1[16]) into the V4L2
struct arrays; stores on driver_data->hevc_rps_cache_*
- non-IDR-frame handling: cache holds across frames, so frames
whose source_data lacks an SPS NAL reuse the previously-parsed
cached arrays (Phase 5 review item #3)
- controls[] grows from [5] to [7]; the 2 new entries are appended
after the standard 5 (SPS/PPS/SLICE_PARAMS/SCALING_MATRIX/
DECODE_PARAMS), gated by driver_data->has_hevc_ext_sps_rps_rkvdec
(per-fd probe result from Step 3) + the cache being valid
- field-by-field mapping mirrors GStreamer's
gst_v4l2_codec_h265_dec_fill_ext_sps_rps verbatim (the upstream
reference identified in Phase 0 prior-art survey)
src/request.h additions:
- struct request_data carries hevc_rps_cache_st (array pointer),
_st_count, hevc_rps_cache_lt, _lt_count, hevc_rps_cache_valid.
Single-slot cache (sps_id 0 only; multi-SPS streams would need
expanding). Stores POST-MAPPED V4L2 structs so request.h doesn't
need to know GstH265SPS / GstH265SPSEXT types.
Critical interpretation correction (Phase 5 review followup):
GstH265SPS has short_term_ref_pic_set[65] (base) but NOT
short_term_ref_pic_set_ext[]. The EXT array lives on a SEPARATE
GstH265SPSEXT struct accessed via gst_h265_parser_parse_sps_ext.
The 'plain' gst_h265_parser_parse_sps internally calls _ext with a
LOCAL discarded SPSEXT (see gsth265parser.c:2050). Our call must
use the _ext variant directly to keep the EXT data. Caught during
Step 4 first-build error.
Build verified: ninja -C build clean. .so is 759 KB (up from 485 KB
original, 682 KB after Step 2 vendor — the +80 KB is the new helper
+ extension).
iter2 Phase 6 Step 5 (install + reboot + smoke-test) is the F1
falsifier moment: if HEVC stops OOPSing, the mechanism is confirmed; if
it still OOPSes, loop back to Phase 0 with kernel-agent#11 re-opened.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
src/hevc-ctrls/v4l2-hevc-ext-controls.h (NEW, MIT, ~95 LOC):
Verbatim mirror of Linux 7.0 V4L2_CID_STATELESS_HEVC_EXT_SPS_ST_RPS
and _LT_RPS control IDs + struct definitions + flag macros. Each
symbol is ifndef-guarded so when ampere's linux-api-headers
eventually bumps to 7.0+, the kernel header takes precedence and
this shim silently no-ops. Citation block links the upstream
Casanova v8 series.
Per LGPL section 3.b, kernel UAPI struct definitions are excepted
from GPL inheritance, so copying them into MIT userspace is fine.
src/request.h: added has_hevc_ext_sps_rps_rkvdec + _hantro bool
fields on struct request_data — pair-of-flags layout mirrors
video_fd_rkvdec / video_fd_hantro (iter38 multi-device-probe
pattern, per feedback_multi_device_probe_design). Phase 5 review
identified single-scalar storage as a silent-misbehavior risk
across device-switch boundaries.
src/request.c:
- new probe_hevc_ext_sps_rps_controls(fd) helper: queries the two
new CIDs via VIDIOC_QUERYCTRL; returns true iff both register.
RK3399 rkvdec (linux 6.x or 7.x without VDPU381/383 bindings)
returns false; RK3588 rkvdec (VDPU381/383) returns true.
- probe each driver_data->video_fd_rkvdec / _hantro after the
iter38 multi-device-probe block at VA_DRIVER_INIT time
- log line when rkvdec supports it (diagnostic for Phase 7)
src/meson.build: added the new UAPI header to the headers list.
Build verified: ninja -C build clean, .so produced. The new probe
runs at driver init and stores the result, but nothing CONSUMES the
result yet — that's Step 4 (h265_set_controls wiring).
Per ampere-kernel-decoders campaign iter2 Phase 4 step 3 (amended
by Phase 5 review item 'per-fd storage').
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Vendored gsth265parser + nalutils + gstbitreader + gstbytereader (the
Step 1 commit) compile cleanly against libc + libv4l2 only after
adding 1 compat translation unit + 5 stub headers, no edits to the
vendored .c/.h files themselves.
src/h265_parser/gst_compat.{h,c} — new files (MIT, original work):
- GLib type aliases (gboolean, gchar, gint*, guint*, gsize, gpointer)
- Memory helpers (g_malloc/g_free as #define free, g_memdup2 inline)
- Asserts as no-op + parser-return-code-propagation
- All GST_DEBUG/INFO/WARNING/ERROR/LOG/FIXME as no-ops (the parser
is heavy on debug logging; we compile it all out)
- GArray implementation (~100 LOC, just enough for gsth265parser.c's
24 call sites)
- GList full struct with .data/.next/.prev so callers compile;
list-manipulation functions abort() — dead code paths only
- Byte-order read/write macros (GST_READ_UINT8/16/24/32/64_LE/BE,
GST_WRITE_UINT8/16/24/32_BE) — aarch64 LE inlines
- g_once_init_enter/leave as simple gate
- G_MAXUINT*, G_MAXINT*, G_MINxxx, G_GNUC_* attribute macros, etc.
- Opaque GstBuffer/GstMemory/GstMapInfo + abort-stub functions for
the encoder-side SEI-insertion paths the libva backend never invokes
- gst_util_ceil_log2 real impl (used by slice-header parser; dead
for our SPS-only call path but cheaper to implement than stub)
src/h265_parser/gst/{gst.h,base/base-prelude.h,base/gstbitwriter.h,
codecparsers/codecparsers-prelude.h,glib-compat-private.h} — 5 new
stub headers (MIT). All include gst_compat.h. gstbitwriter.h adds
abort-stub functions for the bit-writer API (used by nalutils.c's NAL
emulation-prevention encoder path — dead code for the parse-only
libva backend).
src/meson.build — added the 5 new .c source files and 10 new .h
headers; added include_directories('h265_parser') to the include path
so the vendored files' '#include <gst/base/...>' style references
resolve to the stub headers + actual vendored files in the local
tree.
Build verified: ninja -C build produces v4l2_request_drv_video.so
(682 KB, up from 485 KB pre-vendor — the +200 KB is the vendored
parser code). nm shows gst_h265_parse_sps, gst_h265_parse_sps_ext,
gst_h265_parser_identify_nalu, and the other functions we need for
Step 4 are present in the binary.
Two #warning messages from gsth265parser.h about API stability are
upstream-intentional and harmless ('The H.265 parsing library is
unstable API and may change in future').
This commit completes Step 2 of ampere-kernel-decoders iter2 Phase 6.
Backend remains functionally identical to pre-iter2 — the new code
compiles + links but is not yet called from h265_set_controls (that's
Step 4). Existing 5 codecs continue to work as before.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Source: gitlab.freedesktop.org/gstreamer/gstreamer @ commit 43421c2a5b8a
(refs/tags/1.28.2). All 8 vendored files copied verbatim into
src/h265_parser/:
gst-plugins-bad/gst-libs/gst/codecparsers/gsth265parser.c (168 KB)
gst-plugins-bad/gst-libs/gst/codecparsers/gsth265parser.h ( 92 KB)
gst-plugins-bad/gst-libs/gst/codecparsers/nalutils.c (13 KB)
gst-plugins-bad/gst-libs/gst/codecparsers/nalutils.h ( 8 KB)
gstreamer/libs/gst/base/gstbitreader.c ( 8 KB)
gstreamer/libs/gst/base/gstbitreader.h ( 10 KB)
gstreamer/libs/gst/base/gstbytereader.c ( 39 KB)
gstreamer/libs/gst/base/gstbytereader.h ( 25 KB)
Total ~11 KLOC, LGPL v2.1+ per original headers (Intel + Sreerenj
Balachandran + others). LGPL headers preserved verbatim. Backend's
existing COPYING.LGPL covers redistribution.
** Build is INTENTIONALLY BROKEN at this commit. ** GLib dependencies
(GArray, g_malloc, gboolean, GST_DEBUG, etc.) are not yet satisfied;
src/Makefile.am is not yet updated to include these files. Step 2
performs the GLib-to-libc mechanical adaptation; Step 3 wires the
header + Makefile.
This vendor-unchanged commit is the upstream-tracking baseline. When
GStreamer ships a parser bug fix, the future-sync workflow is:
git diff (this commit)..HEAD -- src/h265_parser/
to surface our adaptations, then rebase those over the upstream fix.
Per ampere-kernel-decoders campaign iter2 Phase 4 §Step 1
(/home/mfritsche/src/ampere-kernel-decoders/phase4_plan_iter2.md).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Latent bug surfaced by iter38 multi-device probe. profiles[] array
in RequestQueryConfigProfiles is sized by V4L2_REQUEST_MAX_PROFILES
(set as context->max_profiles=11 in VA_DRIVER_INIT), but the bounds
checks used V4L2_REQUEST_MAX_CONFIG_ATTRIBUTES (10). Pre-iter38 only
a single device's profiles were enumerated, total ≤9, so the off-by-
one never bit. With iter38's rkvdec+hantro union (10 profiles total
across MPEG2/H264/HEVC/VP8/VP9), the last enumerator (VP9) hit
index=9 with the check 'index < 10-1 = 9' → skipped.
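The effect of guarding with the wrong constant can be reproduced in a
few lines. This is a toy model, not the driver code:
enumerate_profiles() is a hypothetical helper that only counts how many
entries the `index < limit - 1` check lets through.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_CONFIG_ATTRIBUTES 10 /* the constant the old check used */
#define MAX_PROFILES          11 /* the constant that sizes profiles[] */

/* Try to append `available` profiles behind an `index < limit - 1`
 * guard and return how many were actually stored. */
static inline size_t enumerate_profiles(size_t available, size_t limit)
{
	size_t index = 0;

	assert(limit > 0);
	for (size_t i = 0; i < available; i++) {
		if (index < limit - 1)
			index++; /* profiles[index++] = ... in the driver */
	}
	return index;
}
```

With 10 available profiles, a limit of MAX_CONFIG_ATTRIBUTES stores only
9 (the last one is dropped), while MAX_PROFILES stores all 10.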
Probe BOTH rkvdec and hantro-vpu at VA_DRIVER_INIT and keep their
{video,media}_fd pairs in driver_data. RequestQueryConfigProfiles
enumerates the union of supported profiles from all open fds.
RequestCreateConfig retargets driver_data->{video,media}_fd to the
device that serves the requested profile; if a switch is needed
(active fd is wrong), tears down output_pool, capture_pool, video_format
cache, and fmt_valid so the next RequestCreateContext rebuilds them
on the new device.
Profile→device map (RK3399-shaped):
H264 / HEVC / VP9 → rkvdec
MPEG-2 / VP8 → hantro-vpu
Honours LIBVA_V4L2_REQUEST_VIDEO_PATH / MEDIA_PATH explicit overrides
(skips alt-probe when those are set).
Closes the 'libva multi-device probe' open item from iter36/iter37
campaign-close.
α-26 (iter26) wrote VAAPI's picture->st_rps_bits to the V4L2 decode_params
field of the same name based on field-name match. Per V4L2 spec, this field
is the bit-count of st_ref_pic_set() *in the SPS* — VAAPI doesn't expose
that. The slice-header bit-count (which IS what VAAPI's st_rps_bits provides)
belongs in slice_params->short_term_ref_pic_set_size (handled correctly in
α-29).
rkvdec doesn't read decode_params->short_term_ref_pic_set_size, so the
misroute was harmless but stale. This revert restores spec-correct semantics
(0 when SPS bit-count is unknown).
Cosmetic cleanup; no functional change.
ROOT CAUSE FIX for VP8 libva decode garbage output.
ffmpeg-vaapi's vaapi_vp8.c:191-192 STRIPS the VP8 uncompressed
header (3 bytes for interframe, 10 bytes for keyframe) before
submitting the slice data via VAAPI. ffmpeg-v4l2request (kdirect)
KEEPS the header in its OUTPUT buffer.
Hantro's rockchip_vpu2_vp8_dec_run (rockchip_vpu2_hw_vp8_dec.c:349)
hard-codes 'first_part_offset = V4L2_VP8_FRAME_IS_KEY_FRAME(hdr) ? 10 : 3'
as the byte offset into OUTPUT where the first compressed partition
starts. It uses this offset for:
- mb_offset_bits = first_part_offset * 8 + first_part_header_bits + 8
- dct_part_offset = first_part_offset + first_part_size
Without the header, every offset is wrong, the entropy decoder
spins on the wrong bytes, and every frame decodes to garbage.
Fix: in codec_store_buffer for VAProfileVP8Version0_3, prepend
header_size bytes (10 keyframe / 3 interframe) of zeros to OUTPUT
before the slice data memcpy. Hantro skips these bytes for actual
parsing (uses ctrl-struct values instead), so zero-fill is fine.
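A minimal sketch of the zero-fill prepend, assuming a flat OUTPUT buffer
with a known capacity. vp8_store_with_header_pad() and its parameters
are hypothetical; the real change lives in codec_store_buffer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define VP8_KEYFRAME_HEADER_SIZE   10
#define VP8_INTERFRAME_HEADER_SIZE 3

/* Write header-sized zero padding followed by the slice data into `out`.
 * Returns the number of bytes written, or 0 if `out` is too small. */
static inline size_t
vp8_store_with_header_pad(uint8_t *out, size_t out_cap,
			  const uint8_t *slice, size_t slice_len,
			  bool key_frame)
{
	size_t pad = key_frame ? VP8_KEYFRAME_HEADER_SIZE
			       : VP8_INTERFRAME_HEADER_SIZE;

	assert(out != NULL && slice != NULL);
	if (slice_len > out_cap || pad > out_cap - slice_len)
		return 0;
	memset(out, 0, pad);                 /* stand-in for the stripped header */
	memcpy(out + pad, slice, slice_len); /* first partition now starts at `pad` */
	return pad + slice_len;
}
```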
Empirical: iter33 kernel printk in vpu2_vp8_dec_run dumped the
v4l2_ctrl_vp8_frame struct for libva vs kdirect and confirmed
byte-identical control fields. Only the OUTPUT buffer bytes
differed, traced to ffmpeg-vaapi's header stripping.
LIBVA_VP8_DUMP_FRAME=1 prints the v4l2_ctrl_vp8_frame struct fields
to stderr before VIDIOC_S_EXT_CTRLS. Goal: diff libva-side struct
against expected kdirect-side values for VP8 frame-2+ divergence
(libva produces non-trivial but wrong output; kdirect VP8 byte-equal
to SW). Env-gated, no behavior change otherwise.
ROOT CAUSE FIX for HEVC frame 2+ divergence (Bug 5 remainder).
rkvdec's assemble_sw_rps (rkvdec-hevc.c:386-389) uses
sl_params->short_term_ref_pic_set_size to compute the bit offset where
long-term RPS data starts in the slice header. When zero, it falls back
to fls(num_short_term_ref_pic_sets - 1) — wrong when num=1 (BBB's case).
α-26 was misdirected: it set decode_params->short_term_ref_pic_set_size =
st_rps_bits, but rkvdec doesn't use that field. The correct consumer is
slice_params, per the V4L2 spec and the rkvdec source.
VAAPI's picture->st_rps_bits is documented as: 'number of bits that structure
short_term_ref_pic_set(num_short_term_ref_pic_sets) takes in slice segment
header when short_term_ref_pic_set_sps_flag equals 0' — exactly what
sl_params->short_term_ref_pic_set_size means.
Frame 1 (IDR) is unaffected (V4L2 rkvdec gates on !IDR_PIC flag).
Frames 2+: bit offset for long-term RPS now correct, slice header parsing
no longer falls off the edge of the entropy bitstream.
Default behavior unchanged: counter*1000ns same as before.
With LIBVA_TS_SCALE=N, multiplies the ns timestamp by N. Lets us
sweep timestamp magnitude to test whether small-ts collides with
stale CAPTURE entries in vb2_find_buffer for HEVC frame 2+ bug.
Also keeps iter29 slice-tail probe from previous commit.
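The scaled timestamp can be sketched as below. Assumptions:
struct decode_context is a hypothetical stand-in for the context object
holding timestamp_counter, and the raw LIBVA_TS_SCALE string is passed
in (instead of calling getenv inside) so the helper is easy to exercise.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for the per-context state. */
struct decode_context {
	uint64_t timestamp_counter;
};

/* Advance the counter and return the OUTPUT timestamp in ns.
 * scale_env is the value of LIBVA_TS_SCALE, or NULL when unset;
 * unparsable or non-positive values fall back to a scale of 1. */
static inline uint64_t
next_timestamp_ns(struct decode_context *ctx, const char *scale_env)
{
	uint64_t scale = 1;

	assert(ctx != NULL);
	if (scale_env != NULL) {
		char *end = NULL;
		long long v = strtoll(scale_env, &end, 10);

		if (end != scale_env && v > 0)
			scale = (uint64_t)v;
	}
	ctx->timestamp_counter++;
	return ctx->timestamp_counter * 1000u * scale;
}
```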
Env-gated via LIBVA_HEVC_DUMP_SLICE_TAIL=1. Goal: characterise the
40-byte inflation in libva's slice_data buffer vs ffmpeg-v4l2request
(see iter27/28 close — HEVC frame 2+ divergence at byte 1382401).
Dumps per slice: nal_unit_type, slice_data_size, slice_data_byte_offset,
and the last 80 bytes of source_data for that slice. Lets us see if the
trailing 40 bytes are (a) real entropy, (b) trailing zeros, (c) a
next-NAL start code prefix, or (d) random memory.
iter28b tested LIBVA_HEVC_TRIM_TRAILING=40 on HEVC. Result: hash
differed at byte 899745 (inside frame 1, NOT just frame 2 boundary at
byte 1382401). Trimming 40 bytes off the IDR slice (96890→96850)
corrupted frame 1. The 40-byte inflation is not uniform per slice;
requires dynamic detection (e.g., scan for rbsp_stop_one_bit) or
per-slice-type logic.
VAAPI's slice_data_size includes NAL+slice header bytes that precede the
slice payload. rkvdec_hevc expects bit_size to cover the slice payload
(starting at data_byte_offset). Setting bit_size = slice_data_size * 8
made rkvdec read past slice payload → wrong entropy state → frame 2+
garbage despite correct ctx->image_fmt (iter25) and decode_params
(iter26).
Empirical match: with formula (slice_data_size - slice_data_byte_offset)
* 8, libva produces bit_size=44096 for BBB frame 2 matching kdirect's
44096 exactly per iter27 dmesg printk.
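The corrected formula, as a small helper. hevc_slice_bit_size() is a
hypothetical name; the parameters correspond to VAAPI's slice_data_size
and slice_data_byte_offset.

```c
#include <assert.h>
#include <stdint.h>

/* bit_size covers only the slice payload, which begins
 * slice_data_byte_offset bytes into the submitted slice data. */
static inline uint32_t
hevc_slice_bit_size(uint32_t slice_data_size, uint32_t slice_data_byte_offset)
{
	assert(slice_data_byte_offset <= slice_data_size);
	return (slice_data_size - slice_data_byte_offset) * 8u;
}
```

For example, a 100-byte slice with a 12-byte offset gives 704 bits,
whereas the old slice_data_size * 8 would have claimed 800.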
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
VAPictureParameterBufferHEVC exposes st_rps_bits — the number of bits
the inline short_term_ref_pic_set syntax element takes in the slice
header. rkvdec's DPB resolution for P/B frames uses this to skip the
RPS data correctly; with size=0 it skips wrong bytes and reads wrong
references → frame 2+ visual divergence.
iter25 evidence: libva HEVC frame 1 byte-identical to kdirect, but
frame 2 diverges at the decode_params bytes 4-5 (libva 0x00 0x00,
kdirect 0x0a 0x00 = 10).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
rkvdec_h264_validate_sps doubles height when FRAME_MBS_ONLY is unset
(field-to-frame). Dummy with 1080-height was failing validation as
2176 > 1080, returning -EINVAL silently (void-cast). Even though libva
ignores the result of v4l2_set_controls, the side effect was leaving
ctx->image_fmt at ANY → first per-frame H264_SPS still hit -EBUSY in
try_or_set_cluster → setup loop broke (Bug 4 unchanged).
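The height arithmetic behind the 2176 figure can be worked through as
follows. This is a sketch of the computation as the commit describes
it, not a copy of the kernel function; h264_sps_frame_height() is a
hypothetical name.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Frame height implied by an H.264 SPS: map units are 16 luma rows
 * tall, and the count is doubled when FRAME_MBS_ONLY is unset. */
static inline uint32_t
h264_sps_frame_height(uint32_t pic_height_in_map_units_minus1,
		      bool frame_mbs_only)
{
	uint32_t height;

	assert(pic_height_in_map_units_minus1 < 0x04000000u); /* no overflow */
	height = (pic_height_in_map_units_minus1 + 1) * 16;
	if (!frame_mbs_only)
		height *= 2;
	return height;
}
```

A 1080-line stream has 68 map units (1088 rows); with FRAME_MBS_ONLY
unset the computed height becomes 2176.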
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Root cause for Bug 5 (HEVC libva = all-zero CAPTURE) and Bug 4 (H.264
libva = keyframe partial), localized via iter17→iter24 kernel-printk
chain:
rkvdec_s_ctrl() for HEVC_SPS / H264_SPS calls get_image_fmt() and,
if the resolved image_fmt differs from cached ctx->image_fmt (default
RKVDEC_IMG_FMT_ANY at open), tries to reset the CAPTURE format.
Format reset returns -EBUSY when vb2_is_busy(CAPTURE_queue) — any
CAPTURE buffer allocated blocks the change.
libva (iter5b-β) pre-allocates 24 CAPTURE buffers at CreateContext
via cap_pool_init, BEFORE any per-frame S_EXT_CTRLS. First per-frame
HEVC_SPS therefore fails with -EBUSY in try_or_set_cluster, breaks
v4l2_ctrl_request_setup's outer loop, leaves all 5 staged HEVC
compound controls at zero in ctx->ctrl_hdl. rkvdec_hevc_run reads
zero (iter20 dmesg: sps[0..16]=00..00), hardware sees w=0 h=0,
CAPTURE comes out all-zero (Bug 5).
Fix: BEFORE cap_pool_init, inject one S_EXT_CTRLS (no request, no
which) with a synthetic SPS containing the profile's known chroma +
bit_depth. CAPTURE queue is still empty at this point → vb2_is_busy
returns false → rkvdec_s_ctrl succeeds, ctx->image_fmt is updated to
the profile's image_fmt. From then on, per-frame SPS submissions with
matching chroma + bit_depth see image_fmt_changed=false → skip reset
→ commit succeeds.
VP9 / MPEG-2 / VP8 paths are not affected: VP9's rkvdec coded_fmt_desc
has no get_image_fmt op; MPEG-2 + VP8 route to hantro.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Env-gated by LIBVA_V4L2_REQ_GETBACK. After v4l2_set_controls() against
the request_fd in h265_set_controls(), issue G_EXT_CTRLS with the same
request_fd targeting SPS and log first 16 bytes returned.
iter20 (kernel printk) found rkvdec sees all-zero ctx->ctrl_hdl SPS for
libva HEVC vs correct bytes for kdirect. The remaining branch is whether
req->p_new was ever staged with libva's payload, or whether
v4l2_ctrl_request_setup failed to apply it.
α-24 distinguishes the two:
zero readback -> staging failed in v4l2_s_ext_ctrls
non-zero -> apply failed in v4l2_ctrl_request_setup
EACCES -> kernel disallows req readback; need deeper printk
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
If removing DECODE_PARAMS from libva's S_EXT_CTRLS batch lets the other
4 controls stage, rkvdec_hevc_run printk will show w=1280 h=720 etc.
That confirms DECODE_PARAMS specifically is failing kernel validation
and rolling back the whole batch.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Tests mechanism 5 (silent partial failure). If error_idx != count after
S_EXT_CTRLS, one of the per-request controls was rejected by the kernel
even though the ioctl returned 0.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Static storage for sps/pps/decode_params/scaling_matrix + no-free for
slice_params_array. Tests the kernel-defers-compound-copy hypothesis
from iter17 P7 finding.
If hashes change -> mechanism 3 confirmed; will refactor to per-surface
heap allocation.
If hashes unchanged -> mechanism 3 disproved; iter19 explores
mechanisms 1/2/5.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Test discriminator: lowering MIN_CAP_POOL from 24 to 11 (matching
kdirect) did not change any of the 5-codec hashes. Pool depth is
not the cause of Bug 4/5/6. Revert.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Quick discriminator: if pool depth affects rkvdec's per-codec state
machine, reducing libva's pool to kdirect's ~11 might change Bug 4/5/6
hashes. Reverts to 24 if test shows no change or regression.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 3 ioctl-sequence diff: kdirect (ffmpeg-v4l2request) S_FMTs CAPTURE
with NV12 + dimensions after S_FMT OUTPUT, BEFORE CREATE_BUFS. libva's
old code only G_FMTs CAPTURE (per iter5b-β's hantro-targeted comment
that explicit S_FMT puts hantro into an inconsistent state).
For rkvdec on RK3399 the absence of explicit S_FMT CAPTURE doesn't
commit the chosen NV12 format properly. rkvdec HEVC + H.264 silently
produce zero / garbage CAPTURE output — Bug 4 + Bug 5 root cause.
Now: S_FMT OUTPUT → S_FMT CAPTURE → G_FMT CAPTURE. Failure of S_FMT
CAPTURE is non-fatal: fall back to G_FMT (preserves the iter5b-β
hantro path).
A future iteration should gate this on driver_kind explicitly, per
feedback_per_driver_kludge_gating.md. For now, always-on is safe
because kdirect proves S_FMT CAPTURE works on both rkvdec AND hantro.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
LIBVA_V4L2_DUMP_OUTPUT=<dir> writes source_data[0..slices_size] to
<dir>/output_p<profile>_s<surface>_t<ts>.bin immediately before
v4l2_queue_buffer OUTPUT. Discriminates whether libva writes the
correct H.264/HEVC bitstream bytes (same as kdirect/input file).
Off by default. Wrapped in static-cache env check.
iter11+12+13 confirmed Bug 4/5 are not in S_EXT_CTRLS payload, not
in kernel substrate (RFC v2), not in CPU cache visibility (α-17 sync
ioctl works but inert). The remaining libva-side surface is the
actual bitstream bytes the kernel reads.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
V4L2 CAPTURE buffers are V4L2_MEMORY_MMAP and mapped cached. Device DMA
writes are not automatically visible through the CPU cache; reading
destination_data[] without DMA_BUF_IOCTL_SYNC(START|READ) returns
stale data on RK3399. This is observed as Bug 4 (H.264 partial-fill) and
Bug 5 (HEVC all-zero) when libva goes through cached-mmap readback,
while kdirect (ffmpeg-v4l2request + DRM_PRIME mmap) reads cleanly via
implicit sync.
Per Tomasz Figa's 2024 linaro-mm-sig discussion +
feedback_rfc_v2_vb2_dma_resv_scope.md: cache sync on cached-mmap'd V4L2
buffers is a userspace responsibility. RFC v2 fence work doesn't engage
this path; this ioctl pair does.
Just-in-time EXPBUF + SYNC + close per copy. Per-call cost is one
ioctl pair + one fd lifecycle per plane. The EXPBUF fd could be cached
on the cap_pool slot, but keeping it transient keeps the lifecycle simple.
Closing the EXPBUF fd does not release the underlying V4L2 buffer memory.
If EXPBUF or SYNC fails, fall through to existing memcpy path —
preserves pre-iter13 behavior on the error branch.
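A sketch of the SYNC bracket with the memcpy fallback. It assumes the
caller has already obtained dmabuf_fd for the plane via VIDIOC_EXPBUF
and closes it afterwards; copy_plane_synced() is a hypothetical name.
DMA_BUF_IOCTL_SYNC and struct dma_buf_sync come from <linux/dma-buf.h>.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/dma-buf.h>

/* Copy `len` bytes out of a CPU mapping of a CAPTURE plane, bracketing
 * the read with DMA_BUF_IOCTL_SYNC when the dma-buf fd is usable.
 * Returns true if the read was synced, false if it fell back to a
 * plain memcpy (bad fd or failed START ioctl). */
static inline bool
copy_plane_synced(int dmabuf_fd, void *dst, const void *src, size_t len)
{
	struct dma_buf_sync sync = {
		.flags = DMA_BUF_SYNC_START | DMA_BUF_SYNC_READ,
	};
	bool synced;

	assert(dst != NULL && src != NULL);
	synced = dmabuf_fd >= 0 &&
		 ioctl(dmabuf_fd, DMA_BUF_IOCTL_SYNC, &sync) == 0;

	memcpy(dst, src, len);

	if (synced) {
		sync.flags = DMA_BUF_SYNC_END | DMA_BUF_SYNC_READ;
		(void)ioctl(dmabuf_fd, DMA_BUF_IOCTL_SYNC, &sync);
	}
	return synced;
}
```

The copy happens on both paths, so a failed sync never blocks the
readback.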
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two fixes in one commit:
α-13 (h265_fill_sps): sps_max_num_reorder_pics now derived from
sps_max_dec_pic_buffering_minus1 (safe upper bound per H.265 §A.4.2)
instead of hardcoded 0. Phase 5b empirically showed rkvdec ignores
this field on RK3399, so this is wire-correctness hygiene only — matches
kdirect's payload pattern without behavior change.
α-14 (h265_set_controls): derive IRAP_PIC / IDR_PIC flags from the
first slice's nal_unit_type (parsed by h265_fill_slice_params into
slice_params_array[0].nal_unit_type). Without these flags rkvdec
doesn't recognise the keyframe boundary, treats IDR as inter without
references, and produces all-zero CAPTURE output — observed as Bug 5
on libva HEVC (06b2c5a0...). kdirect sets these from the bitstream
parse and decodes correctly (9340b832...).
Mapping:
nal_unit_type 16..23 -> IRAP_PIC
nal_unit_type 19 (IDR_W_RADL) or 20 (IDR_N_LP) -> IDR_PIC
HEVC-only (no risk to other codecs). h265_set_controls already
profile-gated via picture.c::codec_set_controls VAProfileHEVCMain
dispatch. Per feedback_unconditional_codec_state.md.
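The nal_unit_type mapping above, as a small helper. The DEC_FLAG_*
values are assumed to mirror V4L2_HEVC_DECODE_PARAM_FLAG_IRAP_PIC and
_IDR_PIC and are redefined locally so the snippet builds without V4L2
headers; hevc_keyframe_flags() is a hypothetical name.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed to mirror the kernel UAPI flag values (see lead-in). */
#define DEC_FLAG_IRAP_PIC 0x1u
#define DEC_FLAG_IDR_PIC  0x2u

/* Derive the decode_params keyframe flags from the first slice's
 * nal_unit_type: 16..23 are IRAP, 19 (IDR_W_RADL) and 20 (IDR_N_LP)
 * are additionally IDR. */
static inline uint64_t hevc_keyframe_flags(uint8_t nal_unit_type)
{
	uint64_t flags = 0;

	assert(nal_unit_type < 64); /* nal_unit_type is a 6-bit field */
	if (nal_unit_type >= 16 && nal_unit_type <= 23)
		flags |= DEC_FLAG_IRAP_PIC;
	if (nal_unit_type == 19 || nal_unit_type == 20)
		flags |= DEC_FLAG_IDR_PIC;
	return flags;
}
```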
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace gettimeofday in RequestEndPicture with object_context-scoped
counter producing small us values (1, 2, 3, ...) so OUTPUT QBUF
timestamp and DPB.reference_ts match ffmpeg-v4l2request's pattern.
Phase 5 IMP-1: counter scoped to object_context (not driver_data) to
avoid multi-context collisions.
Empirical confirmation only — reviewer's CRIT-1 predicts this is
inert (VP9/MPEG-2 use same path and PASS). If α-7 produces the same
broken hash, the libva wire-byte search space is exhausted and iter10
must pivot to slice-data inspection or kernel investigation.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Bug 4 root cause per Phase 7 γ + Phase 4c strace re-decode:
libva strips FFmpeg's bit-16 POC sentinel; kdirect (ffmpeg-v4l2request)
does NOT strip. rkvdec writes top/bottom_field_order_cnt directly to
MMIO via writel_relaxed; with libva sending 0 instead of kdirect's
65536, hardware POC comparisons mismatch and motion compensation
silently corrupts (16x32 patch + nothing else).
The original h264_strip_ffmpeg_poc_sentinel was hantro-specific
(hantro_h264.c prepare_table fed unmasked tbl->poc[]). Hantro+H.264
is not exercised on RK3399; deferring per-driver gating to iter9 if
it surfaces.
Preserve VA_PICTURE_H264_INVALID → return 0 (correct zero-init for
empty DPB slots per Phase 5c amendment).
4 call sites unchanged (h264.c:309, 312, 462, 465 — for ref and current
frame TopFieldOrderCnt / BottomFieldOrderCnt). Both reference and
current-frame POCs now pass through unchanged so hardware compares
agree.
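The resulting POC handling, sketched with hypothetical names. struct
h264_pic is a trimmed-down stand-in for VAPictureH264, and
PIC_FLAG_INVALID stands in for VA_PICTURE_H264_INVALID.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PIC_FLAG_INVALID 0x00000001u /* stands in for VA_PICTURE_H264_INVALID */

/* Hypothetical, trimmed-down picture record. */
struct h264_pic {
	uint32_t flags;
	int32_t top_field_order_cnt;
	int32_t bottom_field_order_cnt;
};

/* Pass the POC through unchanged, including FFmpeg's bit-16 sentinel;
 * only invalid (empty DPB slot) pictures are forced to 0. */
static inline int32_t h264_top_poc(const struct h264_pic *pic)
{
	assert(pic != NULL);
	if (pic->flags & PIC_FLAG_INVALID)
		return 0;
	return pic->top_field_order_cnt;
}
```

A picture with top_field_order_cnt = 65536 now reaches the kernel as
65536 rather than 0.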
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>