iter4 (385dee1) replaced the original media_request_reinit pattern
with close+media_request_alloc per frame to escape an EINVAL on
S_EXT_CTRLS that turned out to be a DPB-payload bug (74d8dd1, FFmpeg
V4L2_H264_FRAME_REF semantics). The per-frame close+alloc model
worked for mpv vaapi-copy (single-surface recycle) but raced under
Firefox 150's MediaSource pipeline (multi-surface rotation): fd=30
got reused via lowest-free-fd allocation faster than the kernel-
side per-buffer state-machine could tear down the prior request,
producing intermittent VIDIOC_QBUF EINVAL on OUTPUT after 1..53
successful frames.
Phase 2 telemetry confirmed:
- DQBUF returned the index we passed (no FIFO mismatch)
- SPS/PPS/DECODE_PARAMS/SCALING_MATRIX byte-identical between mpv
and Firefox first 64 bytes
- Pool size bump 4 -> 16 only delayed the failure (62 frames)
- Different OUTPUT slot indices failed across runs (race signature)
Fix: each OUTPUT pool slot owns a permanent request_fd allocated
once at request_pool_init and REINIT'd between uses in
RequestSyncSurface. 1:1 slot-to-fd binding eliminates cross-slot fd
reuse entirely. Pool stays driver-wide (multi-context safe per
iter5 Track E); slots cycle through 16 distinct fds in round-robin
acquire.
Files:
- request_pool.h: add request_fd field to slot struct; init
signature takes media_fd
- request_pool.c: alloc per-slot fd at init, close at destroy
- context.c: pass driver_data->media_fd; pool size 4 -> 16
- picture.c: BeginPicture binds slot->request_fd to surface;
EndPicture's per-frame media_request_alloc removed
- surface.c: RequestSyncSurface uses media_request_reinit instead
of close+alloc; DestroySurfaces close removed (slot owns fd);
error path close removed; surface_object NULL-init for the
-Wmaybe-uninitialized warning fix
Empirical verification (clean build sha ebe396d5..., no diagnostic
instrumentation):
- Firefox 150 + bbb_1080p30_h264.mp4 + LIBVA_DRIVER_NAME=v4l2_request
+ sandbox enabled: 35s+ playback, zero "Unable to queue buffer"
/ "Unable to set control(s)", lsof shows RDD process holds
/dev/video1 + /dev/media0 throughout. Driver stderr: only the
single cap_pool_init: 24 slots ready line.
- mpv vaapi-copy 50 frames: zero errors, "Using hardware decoding
(vaapi-copy)" - no regression vs iter5-end driver.
Pool-size bump diagnostic (Phase 5 sonnet design review feedback):
4 -> 16 alone took 1->62 frames, far short of the 30s success
criterion (~900 frames at 30fps). REINIT discipline is the actual
fix; pool 16 is comfortable headroom over typical H.264 MaxDpbFrames.
Phase 5 sonnet code review: APPROVE-WITH-CHANGES (one comment
attribution corrected: cleanup runs at RequestTerminate, not
RequestDestroyContext, since the pool is driver-wide).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
picture.c: remove the 0xab sentinel write into CAPTURE buffer first
32 bytes pre-QBUF + the OUTPUT hex-dump pre-QBUF. Both were iter1
diagnostics for "where does the buffer write go?" investigation.
surface.c: remove the post-DQBUF CAPTURE Y-plane hex-dump + luma
variance signal. The msync(MS_SYNC|MS_INVALIDATE) was added as a
companion fix for the cached-mmap issue surfaced by the dump itself —
removing the dump removes the need for the msync.
With iter1+iter2+iter3+iter4 fixes landed, these dumps fire on every
single frame and produce hundreds of MB of log noise during sustained
decode. Now gone.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Removes the iter1 patch-0014 ENTER traces from buffer.c, image.c,
picture.c, surface.c. These were diagnostic-only entry-point logs
added during iter1's "where does Firefox RDD crash?" investigation.
With the iter1+iter2+iter3+iter4 fixes landed, the entry-point
traces are pure noise.
If a future investigation needs entry-point coverage, strace -e trace
on the libva consumer process gives equivalent visibility without
modifying the driver.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pre-iter2 each VA surface was permanently 1:1 bound to one V4L2 CAPTURE
buffer. mpv reusing a surface for a new decode while the compositor still
held an EXPBUF'd dma_buf fd to the prior frame caused the kernel to
write fresh decode output into the same physical memory the compositor
was reading -- visible as stutter / back-and-forth swap on
mpv --hwdec=vaapi --vo=gpu playback.
Architecture:
- New cap_pool abstraction (cap_pool.{h,c}) owns N CAPTURE buffers
(N = max(surfaces_count, MIN_CAP_POOL=24)) with per-slot state
{FREE, IN_DECODE, DECODED, EXPORTED} guarded by pthread_mutex_t.
- Surfaces no longer own buffers; each vaBeginPicture acquires the
oldest FREE slot (LRU), binds it for the decode cycle, and the slot
cycles IN_DECODE -> DECODED (post-DQBUF) -> EXPORTED (post-EXPBUF).
- Slot is released on next BeginPicture for the same surface or on
vaDestroySurfaces.
Limitations (Sonnet Phase 5 review iter2 9.x, deferred to iter3+):
- Option-A statistical mitigation; race window narrows to "pool
exhausted, force-recycle of oldest EXPORTED slot." For typical mpv
16-surface playback with MIN_CAP_POOL=24 the fallback never fires.
- Multi-context concurrent use not addressed (one V4L2 device, multiple
cap_pools -- iter3 scope).
Other call sites updated:
- picture.c::BeginPicture acquires + binds, releasing prior slot if any.
- surface.c::SyncSurface marks slot DECODED after DQBUF.
- surface.c::ExportSurfaceHandle marks slot EXPORTED, retaining OUR
EXPBUF fd for force-recycle close().
- surface.c::DestroySurfaces releases via surface_unbind_slot;
cap_pool owns the mmaps now.
- surface.c::CreateSurfaces2 destroys the pool in the resolution-change
path before REQBUFS(0) (else stale v4l2_index after Fix 1's REQBUFS).
- context.c::DestroyContext invokes cap_pool_destroy.
- image.c::DeriveImage skips copy_surface_to_image when current_slot is
NULL (ffmpeg av_hwframe_ctx_init probes derive on undecoded surfaces).
Verified: mpv vaapi-copy 200 frames bbb_1080p30, 0 drops, LRU visibly
recycling slot indices, real luma gradient. mpv vaapi --vo=gpu
operator-inspection follows.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
VAAPI signals "explicit scaling lists are present in the bitstream"
implicitly: the consumer (ffmpeg-vaapi, mpv, etc.) sends a
VAIQMatrixBufferH264 alongside RenderPicture iff
sps_scaling_matrix_present_flag || pps_scaling_matrix_present_flag.
When the bitstream uses default (flat) scaling, no IQMatrixBuffer
arrives and the in-tree h264.matrix struct stays zero-initialised.
fourier's existing codec_store_buffer for MPEG2 and HEVC tracks this
via a per-surface iqmatrix_set boolean (surface.h::mpeg2.iqmatrix_set,
h265.iqmatrix_set) — the H.264 path was missing the equivalent flag,
so set_controls always submitted the scaling matrix, including the
zero-initialised case.
Symptom on hantro-vpu RK3568: when TRANSFORM_8X8_MODE is enabled in
PPS, the kernel multiplies all 8x8 DCT coefficients by the zeroed
scaling_list_8x8, producing a zeroed CAPTURE buffer despite a
successful decode round-trip (no V4L2_BUF_FLAG_ERROR,
bytesused=3655712 reported).
Earlier draft of this patch unconditionally omitted SCALING_MATRIX in
FRAME_BASED. That's corpus-correct (bbb has no explicit scaling
lists) but the wrong predicate: the kernel-side gating is by
"matrix-supplied vs. not," not by decode mode. Streams that signal
explicit scaling lists must submit SCALING_MATRIX in either mode.
Contract verification (audit_0008_decode_params_2026-05-01.md +
hantro_h264.c::assemble_scaling_list): the kernel uses the supplied
matrix when SCALING_MATRIX is in the control batch and falls back
to spec-defined defaults when absent. Mode-independent.
This patch:
- surface.h: adds bool matrix_set to params.h264, mirroring
mpeg2.iqmatrix_set / h265.iqmatrix_set.
- picture.c codec_store_buffer (H.264 VAIQMatrixBufferType case):
sets matrix_set = true when the buffer arrives.
- picture.c RequestBeginPicture: resets matrix_set = false at the
start of each Begin/Render/End cycle.
- h264.c h264_set_controls: builds the controls[] array
incrementally; SPS/PPS/DECODE_PARAMS always; SCALING_MATRIX iff
matrix_set; SLICE_PARAMS only in SLICE_BASED; PRED_WEIGHTS only
when both SLICE_BASED and V4L2_H264_CTRL_PRED_WEIGHTS_REQUIRED.
The pre-existing FRAME_BASED-omits-SLICE_PARAMS rule is preserved —
kernel doc ext-ctrls-codec-stateless.rst:752: "When this mode is
selected, the V4L2_CID_STATELESS_H264_SLICE_PARAMS control shall
not be set."
Cross-reference: kernel UAPI section
ext-ctrls-codec-stateless.rst V4L2_CID_STATELESS_H264_SCALING_MATRIX
(matrix supplied iff explicit scaling lists in bitstream) and
hantro_h264.c::assemble_scaling_list (consumes supplied matrix or
falls back to defaults).
Signed-off-by: Markus Fritsche <fritsche.markus@gmail.com>
Diagnostic-only. Writes 0xab×32 into the CAPTURE buffer's first 32
bytes immediately before VIDIOC_QBUF. The 0010 hex-dump after
DQBUF reveals which case we're in:
- All 0xab → kernel never wrote to this buffer (wrong buffer
chosen, alias, or no decode actually happened despite
bytesused=3655712 reported).
- All zeros → kernel did write 0x00s (overwriting our
sentinel), and the apparent "no picture" output is the
kernel-side decode actually producing zeros (e.g. parser
rejected the bitstream).
- Mix of zeros and real luma values → kernel wrote real
decoded pixels; CPU read sees stale-cached zeros somewhere
OR the sentinel area was a header that decoder zeroed but
rest is real. Need to check more bytes.
- All 0xab still → kernel never touched this region but other
parts of buffer may be filled (incomplete decode).
Removed once Step 1 decode is verified.
Signed-off-by: Markus Fritsche <fritsche.markus@gmail.com>
Diagnostic-only patch (NOT for upstream). Hex-dumps:
- First 32 bytes of OUTPUT buffer at QBUF time in
picture.c::RequestEndPicture (i.e. what we feed the kernel)
- First 32 bytes of CAPTURE Y-plane after DQBUF in
surface.c::RequestSyncSurface (i.e. what kernel returned)
Lets us see whether:
- OUTPUT bitstream begins with valid ANNEX_B start code + NAL
header byte (e.g. `00 00 01 65` for IDR slice)
- CAPTURE Y-plane after decode contains varied luma data
(working) vs. all-zeros / repeating pattern (kernel didn't
write anything).
Removed once Step 1 decode is verified working. Output goes via
existing request_log() to stderr.
Signed-off-by: Markus Fritsche <fritsche.markus@gmail.com>
Commit 3 of the upstreamable plan (upstreamable_design.md §1, §5).
Replaces the prior per-surface OUTPUT-buffer ownership model with a
small driver-wide pool sized by codec pipeline depth (4 H.264 frames
in flight), allocated unconditionally regardless of caller's
num_render_targets.
Prior art (kernel UAPI dev-stateless-decoder.rst, ffmpeg
v4l2_request.c, Chromium V4L2StatelessVideoDecoder, GStreamer
v4l2slh264dec) all decouple OUTPUT and CAPTURE pool sizing. fourier's
"output_count == surfaces_count" model was a category error: OUTPUT
buffers are request-time bitstream slots, CAPTURE buffers are
picture-time DPB slots; their lifecycles and sizing are independent.
Changes:
* NEW src/request_pool.{c,h} (~200 LoC):
- request_pool_init(): CREATE_BUFS + per-slot QUERYBUF + mmap.
- request_pool_destroy(): munmap all, idempotent.
- request_pool_acquire(): round-robin claim; returns V4L2 buffer
index of an unused slot or -1.
- request_pool_release(): mark slot free for reuse.
- request_pool_slot(): accessor for ptr/size given a buffer index.
* src/request.h: add struct request_pool output_pool to request_data.
* src/context.c::RequestCreateContext: replace the per-surface
OUTPUT loop with a single request_pool_init() call (count=4,
independent of surfaces_count). Drop the now-unused locals
(length, offset, source_data, output_buffers_count, index,
index_base, i, surface_object). DELETES patch 0002's
"output_buffers_count = ... ? ... : 4" hack inline — the pool's
own count parameter supersedes it.
* src/picture.c::RequestBeginPicture: borrow a pool slot at frame
start, write its mmap pointer/size/index into the surface's
transient source_* fields. The fields stay (still useful as
a borrow handle that the existing codec_store_buffer memcpys
target), but no longer represent surface-permanent ownership.
Reset slices_size/slices_count here too (was implicit on first
Render).
* src/surface.c::RequestSyncSurface: after VIDIOC_DQBUF returns
the OUTPUT buffer, release the pool slot and clear the surface's
borrow handle. Fixes the segv on second-frame submission.
* src/surface.c::RequestDestroySurfaces: remove the munmap of
source_data — pool owns the mmap.
* src/request.c::RequestTerminate: call request_pool_destroy()
before close(video_fd) so munmaps still target a valid fd.
* src/meson.build: add request_pool.c and request_pool.h to the
sources/headers lists.
This commit removes 0002's OUTPUT-pool hack inline (the
"floor to 4" line is gone). The DECODE_MODE/START_CODE block in 0002
remains until commit 4 lands.
Build-verified clean on aarch64.
Signed-off-by: Markus Fritsche <fritsche.markus@gmail.com>
Compound patch carrying the fork's pre-Step-1 substrate, originally
authored by Jernej Škrabec / fourier on top of bootlin's a3c2476:
- src/h264.c + src/picture.c: V4L2_CID_MPEG_VIDEO_H264_* renamed to
V4L2_CID_STATELESS_H264_*, struct shapes tracked to mainline
(V4L2_CID_STATELESS_H264_DECODE_MODE/_START_CODE added to the
passthrough shim).
- include/hevc-ctrls.h: redirect shim to <linux/v4l2-controls.h>
(kernel-side HEVC controls now live in the canonical UAPI header).
- src/meson.build: src/h265.c / src/h265.h commented out — HEVC
build path is excluded from this fork (RK3568 hantro G1/G2 has
no HEVC, and the kernel-side HEVC controls have a separate
rework in flight upstream).
- src/tiled_yuv.S: aarch64 stub for tiled_to_planar (assembly
source was sunxi-cedrus armv7-only; aarch64 needs a stub to keep
the build linking).
- include/h264-ctrls.h: removed (dead post-fourier — no source
includes it; the passthrough shim's CID aliases live in the
kernel header now).
Functionally equivalent to the prior fork master commits:
c1f5108 V4L2_PIX_FMT_H264_SLICE rename
4ccbfe9 Strip HEVC build path
da9f2a5 include/h264-ctrls.h passthrough + CID aliases
fc4bb10 src/h264.c track upstream UAPI shape
13e9b64 src/h264.c drop num_slices field
4d14ffb src/tiled_yuv.S aarch64 stub
1b02c9b src/h264.c include utils.h
Folded into one commit during 2026-05-04 Step 1 reconciliation
(see ../phase0_evidence/2026-05-04/findings.md). Per-patch history
of the early fork commits preserved on the pre-step1 branch.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Reference frames are now identified using their timestamp:
set the timestamp when queuing the output buffer and use it to identify
the frame later on.
Signed-off-by: Paul Kocialkowski <paul.kocialkowski@bootlin.com>
Drop the per-codec options while at it, since we'll soon include a copy
of the associated headers.
Signed-off-by: Paul Kocialkowski <paul.kocialkowski@bootlin.com>
H.264 and H.265 support is still not supported upstream,
so it makes sense to autodetect each codec and only
enable those that are supported.
Signed-off-by: Ezequiel Garcia <ezequiel@collabora.com>
This adds support for MPEG2 quantization matrices, which are optional
given that fallback default matrices are used (on the kernel side) when
no such matrix is provided.
Signed-off-by: Paul Kocialkowski <paul.kocialkowski@bootlin.com>
void * can be assigned from and stored to any pointer type without any
warning. Remove the explicit casts.
Signed-off-by: Maxime Ripard <maxime.ripard@bootlin.com>
The cedrus_data structure carries the old name. In order to migrate to the
new name, let's rename it to request_data.
Signed-off-by: Maxime Ripard <maxime.ripard@bootlin.com>
The sunxi_cedrus.h header contains a bunch of defines prefixed with
SUNXI_CEDRUS.
As part as the ongoing migration to a more generic name, change that prefix
for V4L2_REQUEST, and the header file to request.h
Signed-off-by: Maxime Ripard <maxime.ripard@bootlin.com>
As part of our renaming effort, Rename the libva hooks names to mention
request instead of SunxiCedrus
Signed-off-by: Maxime Ripard <maxime.ripard@bootlin.com>
The num_slices parameter was improperly set to the number of reference
frames, which is incorrect.
Add a counter for the number of slices per surface, and set num_slices to
that value.
Signed-off-by: Maxime Ripard <maxime.ripard@bootlin.com>
Some functions setting the controls will need the context in the future.
Make sure that we provide it as an argument.
Signed-off-by: Maxime Ripard <maxime.ripard@bootlin.com>
The coding style has been a bit erratic. Enforce the linux kernel coding
style by reusing their .clang-format file, running clang-format on the
source, and ignoring the few shortcomings that clang-format has at the
moment (especially on aligning the define values).
Signed-off-by: Maxime Ripard <maxime.ripard@bootlin.com>
This long structure name makes it quite difficult to fit within the 80
characters limit. Shorten it.
Signed-off-by: Maxime Ripard <maxime.ripard@bootlin.com>
The BUFFER macro takes an implicit driver_data argument. In order to make
it obvious that we need it, let's put it as an explicit parameter.
Signed-off-by: Maxime Ripard <maxime.ripard@bootlin.com>
The libVA API expects rendering to be done after calling EndPicture.
Since SyncSurface calls are generally not issued in a way compatible
with dequeuing buffers (they might be called too early or too late),
always call SyncSurface from EndPicture.
Signed-off-by: Paul Kocialkowski <paul.kocialkowski@bootlin.com>