Pre-iter2 each VA surface was permanently 1:1 bound to one V4L2 CAPTURE
buffer. mpv reusing a surface for a new decode while the compositor still
held an EXPBUF'd dma_buf fd to the prior frame caused the kernel to
write fresh decode output into the same physical memory the compositor
was reading -- visible as stutter / back-and-forth swap on
mpv --hwdec=vaapi --vo=gpu playback.
Architecture:
- New cap_pool abstraction (cap_pool.{h,c}) owns N CAPTURE buffers
(N = max(surfaces_count, MIN_CAP_POOL=24)) with per-slot state
{FREE, IN_DECODE, DECODED, EXPORTED} guarded by pthread_mutex_t.
- Surfaces no longer own buffers; each vaBeginPicture acquires the
oldest FREE slot (LRU), binds it for the decode cycle, and the slot
cycles IN_DECODE -> DECODED (post-DQBUF) -> EXPORTED (post-EXPBUF).
- Slot is released on next BeginPicture for the same surface or on
vaDestroySurfaces.
Limitations (Sonnet Phase 5 review iter2 9.x, deferred to iter3+):
- Option-A statistical mitigation; race window narrows to "pool
exhausted, force-recycle of oldest EXPORTED slot." For typical mpv
16-surface playback with MIN_CAP_POOL=24 the fallback never fires.
- Multi-context concurrent use not addressed (one V4L2 device, multiple
cap_pools -- iter3 scope).
Other call sites updated:
- picture.c::BeginPicture acquires + binds, releasing prior slot if any.
- surface.c::SyncSurface marks slot DECODED after DQBUF.
- surface.c::ExportSurfaceHandle marks slot EXPORTED, retaining OUR
EXPBUF fd for force-recycle close().
- surface.c::DestroySurfaces releases via surface_unbind_slot;
cap_pool owns the mmaps now.
- surface.c::CreateSurfaces2 destroys the pool in the resolution-change
path before REQBUFS(0) (else stale v4l2_index after Fix 1's REQBUFS).
- context.c::DestroyContext invokes cap_pool_destroy.
- image.c::DeriveImage skips copy_surface_to_image when current_slot is
NULL (ffmpeg av_hwframe_ctx_init probes derive on undecoded surfaces).
Verified: mpv vaapi-copy 200 frames bbb_1080p30, 0 drops, LRU visibly
recycling slot indices, real luma gradient. mpv vaapi --vo=gpu
operator-inspection follows.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Fix 1 of iteration 2 per phase4_iter2_plan.md.
Adds surface_reset_format_cache() exposed from src/surface.h. Called
from RequestDestroyContext after the dual REQBUFS(0). Without this,
multi-video Firefox sessions on mozilla.org corrupted the next
session's CAPTURE format query: the kernel reset to defaults but
our LAST_OUTPUT_WIDTH/HEIGHT cache still said 'already 1920x1088,'
so the next G_FMT returned 48x48 and the exported descriptor
encoded wrong pitch/offset.
Also adds REQBUFS(0) on CAPTURE in the resolution-change path of
RequestCreateSurfaces2 (Sonnet Phase 5 review iter2 9.1). The
existing code only did REQBUFS(0) on OUTPUT before re-S_FMTting;
hantro derives CAPTURE format from OUTPUT format, so leftover
CAPTURE buffers from the prior resolution would also block the
implicit format change. Pre-existing bug surfaced by Sonnet's
audit; Fix 3 pool refactor would have exposed it more often.
Limitation noted in surface.h docblock: the LAST_OUTPUT_WIDTH/
HEIGHT cache is a static process-global, so concurrent multi-
context use still races (Sonnet 7.3 / 9.6). Iteration 2 only
addresses sequential sessions. Multi-context safety is iteration 3+.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Patch 0002 sets V4L2_CID_STATELESS_H264_START_CODE to ANNEX_B on the
device, telling the kernel that OUTPUT-buffer payloads will contain
0x00 0x00 0x01 NAL start codes. picture.c::codec_store_buffer has
the prepend logic guarded by `if (context->h264_start_code)`, but
that boolean is set ONLY inside h264_get_controls() — a function
that exists but is never called.
Result: device expects ANNEX_B, libva-v4l2-request feeds raw NAL
payloads with no start codes, kernel cannot find slice boundaries,
hantro emits a zeroed CAPTURE buffer. mpv reports successful decode
because the V4L2 round-trip succeeds (no EINVAL); the visual output
is a flat dark-green frame (NV12 zero through BT.709).
Identified via:
- Patch 0006 cleared the EINVAL cluster-rejection (128 → 0 on
bbb_1080p30) but visual output remained flat green.
- GStreamer reference (gstv4l2codech264dec.c:1363-1377) confirms
start codes are required when ANNEX_B is selected.
- Source-archaeology of fourier's picture.c:67-74 showed the gate
on context->h264_start_code.
Fix: in context.c::RequestCreateContext, immediately after patch
0002's device-control block, set context_object->h264_start_code =
true to match the ANNEX_B mode we just programmed. Hardcoded for
now (matches 0002's hardcoded set); replaced with a runtime probe
in the planned probe-then-set commit.
Signed-off-by: Markus Fritsche <fritsche.markus@gmail.com>
Commit 3 of the upstreamable plan (upstreamable_design.md §1, §5).
Replaces the prior per-surface OUTPUT-buffer ownership model with a
small driver-wide pool sized by codec pipeline depth (4 H.264 frames
in flight), allocated unconditionally regardless of caller's
num_render_targets.
Prior art (kernel UAPI dev-stateless-decoder.rst, ffmpeg
v4l2_request.c, Chromium V4L2StatelessVideoDecoder, GStreamer
v4l2slh264dec) all decouple OUTPUT and CAPTURE pool sizing. fourier's
"output_count == surfaces_count" model was a category error: OUTPUT
buffers are request-time bitstream slots, CAPTURE buffers are
picture-time DPB slots; their lifecycles and sizing are independent.
Changes:
* NEW src/request_pool.{c,h} (~200 LoC):
- request_pool_init(): CREATE_BUFS + per-slot QUERYBUF + mmap.
- request_pool_destroy(): munmap all, idempotent.
- request_pool_acquire(): round-robin claim; returns V4L2 buffer
index of an unused slot or -1.
- request_pool_release(): mark slot free for reuse.
- request_pool_slot(): accessor for ptr/size given a buffer index.
* src/request.h: add struct request_pool output_pool to request_data.
* src/context.c::RequestCreateContext: replace the per-surface
OUTPUT loop with a single request_pool_init() call (count=4,
independent of surfaces_count). Drop the now-unused locals
(length, offset, source_data, output_buffers_count, index,
index_base, i, surface_object). DELETES patch 0002's
"output_buffers_count = ... ? ... : 4" hack inline — the pool's
own count parameter supersedes it.
* src/picture.c::RequestBeginPicture: borrow a pool slot at frame
start, write its mmap pointer/size/index into the surface's
transient source_* fields. The fields stay (still useful as
a borrow handle that the existing codec_store_buffer memcpys
target), but no longer represent surface-permanent ownership.
Reset slices_size/slices_count here too (was implicit on first
Render).
* src/surface.c::RequestSyncSurface: after VIDIOC_DQBUF returns
the OUTPUT buffer, release the pool slot and clear the surface's
borrow handle. Fixes the segv on second-frame submission.
* src/surface.c::RequestDestroySurfaces: remove the munmap of
source_data — pool owns the mmap.
* src/request.c::RequestTerminate: call request_pool_destroy()
before close(video_fd) so munmaps still target a valid fd.
* src/meson.build: add request_pool.c and request_pool.h to the
sources/headers lists.
This commit removes 0002's OUTPUT-pool hack inline (the
"floor to 4" line is gone). The DECODE_MODE/START_CODE block in 0002
remains until commit 4 lands.
Build-verified clean on aarch64.
Signed-off-by: Markus Fritsche <fritsche.markus@gmail.com>
Two related fixes that surfaced during the first hantro-vpu (RK3568)
smoke test of the multiplanar build:
1. **OUTPUT queue must be non-empty at STREAMON.** Hantro's
vb2_start_streaming rejects an empty queue with EINVAL. Some
VA-API callers (notably ffmpeg's vaapi-copy path) call
vaCreateContext with num_render_targets=0 and allocate render
targets lazily. The OUTPUT (bitstream-input) pool must NOT be
sized off surfaces_count alone — it is a request-time resource,
not per-surface. Quick fix: floor the pool to 4 buffers when
the caller passes 0. (A proper decoupling of OUTPUT pool from
surface lifecycle is documented in upstreamable_design.md.)
2. **Device-wide stateless H.264 controls before STREAMON.** The
V4L2 stateless framework requires V4L2_CID_STATELESS_H264_
DECODE_MODE and START_CODE be set on the device fd
(request_fd=-1) before stream start. Per-request controls
(SPS/PPS/SLICE_PARAMS/etc.) attached to a request_fd come
later via h264_set_controls(). hantro-vpu accepts only
DECODE_MODE_FRAME_BASED; START_CODE_ANNEX_B matches what the
existing slice-assembly path emits.
This is set unconditionally for now (errors silently ignored)
to keep cedrus and other backends compatible — they may
default to SLICE_BASED and not expose DECODE_MODE at all.
Probe-then-set via VIDIOC_QUERYCTRL is the upstream-correct
approach (see upstreamable_design.md §3).
After this patch, vainfo still enumerates as before, but the first
mpv vaapi-copy attempt advances past STREAMON and into actual
decode submission.
Signed-off-by: Markus Fritsche <fritsche.markus@gmail.com>
Compound patch carrying the fork's pre-Step-1 substrate, originally
authored by Jernej Škrabec / fourier on top of bootlin's a3c2476:
- src/h264.c + src/picture.c: V4L2_CID_MPEG_VIDEO_H264_* renamed to
V4L2_CID_STATELESS_H264_*, struct shapes tracked to mainline
(V4L2_CID_STATELESS_H264_DECODE_MODE/_START_CODE added to the
passthrough shim).
- include/hevc-ctrls.h: redirect shim to <linux/v4l2-controls.h>
(kernel-side HEVC controls now live in the canonical UAPI header).
- src/meson.build: src/h265.c / src/h265.h commented out — HEVC
build path is excluded from this fork (RK3568 hantro G1/G2 has
no HEVC, and the kernel-side HEVC controls have a separate
rework in flight upstream).
- src/tiled_yuv.S: aarch64 stub for tiled_to_planar (assembly
source was sunxi-cedrus armv7-only; aarch64 needs a stub to keep
the build linking).
- include/h264-ctrls.h: removed (dead post-fourier — no source
includes it; the passthrough shim's CID aliases live in the
kernel header now).
Functionally equivalent to the prior fork master commits:
c1f5108 V4L2_PIX_FMT_H264_SLICE rename
4ccbfe9 Strip HEVC build path
da9f2a5 include/h264-ctrls.h passthrough + CID aliases
fc4bb10 src/h264.c track upstream UAPI shape
13e9b64 src/h264.c drop num_slices field
4d14ffb src/tiled_yuv.S aarch64 stub
1b02c9b src/h264.c include utils.h
Folded into one commit during 2026-05-04 Step 1 reconciliation
(see ../phase0_evidence/2026-05-04/findings.md). Per-patch history
of the early fork commits preserved on the pre-step1 branch.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Drop the per-codec options while at it, since we'll soon include a copy
of the associated headers.
Signed-off-by: Paul Kocialkowski <paul.kocialkowski@bootlin.com>
H.264 and H.265 support is still not supported upstream,
so it makes sense to autodetect each codec and only
enable those that are supported.
Signed-off-by: Ezequiel Garcia <ezequiel@collabora.com>
Because there might be more than a single call to CreateSurfaces,
we cannot assume that the index relative to the number of surfaces
requested in a single call matches the v4l2 index.
Grab the base index (as returned by the kernel) when allocating
buffers and use it for memory mapping and addressing them in v4l2.
This avoids memory-mapping the first (index 0) buffer multiple times
in that scenario instead of the n-th allocated buffer (in the n-th
call in the sequence).
Signed-off-by: Paul Kocialkowski <paul.kocialkowski@bootlin.com>
The V4L2 API does not currently provide a way to liberate allocated
buffers one by one (which would fit well with DestroySurfaces in
VAAPI). Moreover, streaming needs to be off before liberating
buffers is allowed.
As a result, output an capture buffers can only be liberated when
destroying the decoding context, all at once, such as implemented
in this patch.
Signed-off-by: Paul Kocialkowski <paul.kocialkowski@bootlin.com>
Since the V4L2 ioctl is called QUERYBUF, it makes more sense to
call the associated function with the same name.
Signed-off-by: Paul Kocialkowski <paul.kocialkowski@bootlin.com>
void * can be assigned from and stored to any pointer type without any
warning. Remove the explicit casts.
Signed-off-by: Maxime Ripard <maxime.ripard@bootlin.com>
The cedrus_data structure carries the old name. In order to migrate to the
new name, let's rename it to request_data.
Signed-off-by: Maxime Ripard <maxime.ripard@bootlin.com>
The sunxi_cedrus.h header contains a bunch of defines prefixed with
SUNXI_CEDRUS.
As part as the ongoing migration to a more generic name, change that prefix
for V4L2_REQUEST, and the header file to request.h
Signed-off-by: Maxime Ripard <maxime.ripard@bootlin.com>
As part of our renaming effort, Rename the libva hooks names to mention
request instead of SunxiCedrus
Signed-off-by: Maxime Ripard <maxime.ripard@bootlin.com>
The libva only provides the reference images needed to decode the current
picture, but not the full DPB. However, some codecs need that whole DPB in
order to decode a picture.
For example, the Allwinner hardware codec has an internal SRAM, with each
picture getting a slot in that SRAM, and during each decoding process, some
metadata will then be generated from that SRAM content to a separate
buffer. Therefore, each frames must be located at the same SRAM position
each time so that the metadata are then re-used properly.
However, since libva will only pass a few reference images, we can end up
in a situation where multiple, subsequent, frames will have the same
reference images set, but might all be used as reference later on and
cannot therefore be located at the same position.
And from a more theorical point of view, Linux expects a full blown DPB in
its H264 control.
In order to work around this, we can create a shadow of the DPB by simply
maintaining a list of 16 decoded images, each associated with their
VAPictureH264 and an age. This age is the last time we used that frame as
reference. When a new picture is decoded, either we assign it to a free
slot, or we reuse the slot from the frame that hasn't been used as a
reference for the longest time.
This is a much simpler approach than the one documented in the H264 spec,
but this shouldn't really be a problem since we don't handle the reference
frames ourselves, but just re-use the one from the libva, and taken from
the bitstream before. As such, frames that are not supposed to be used for
reference will not be anymore, their age will not increase, and therefore
after a while we will garbage-collect their slot to store a much newer
frame.
Signed-off-by: Maxime Ripard <maxime.ripard@bootlin.com>
The coding style has been a bit erratic. Enforce the linux kernel coding
style by reusing their .clang-format file, running clang-format on the
source, and ignoring the few shortcomings that clang-format has at the
moment (especially on aligning the define values).
Signed-off-by: Maxime Ripard <maxime.ripard@bootlin.com>
This long structure name makes it quite difficult to fit within the 80
characters limit. Shorten it.
Signed-off-by: Maxime Ripard <maxime.ripard@bootlin.com>