Major iteration result on av1-iter1 backend branch (tip c839b94 on both
ampere and noether). AV1 hardware decode is FUNCTIONALLY WORKING for
the common cases:
Fixture                                        Result
test_av1.ivf (2 frames, no grain)              bit-exact PASS 2/2
av1-1-b8-02-allintra.ivf (39 all-intra frames) bit-exact PASS 10/10
av1_larger.ivf (film_grain + show_existing)    3/10 PASS (apply_grain=1, IDR-derived)
av1-1-b10-23-film_grain-50.ivf (10-bit)        0 bytes from both libva and kdirect
                                               (vpu981 may not support 10-bit)
The 10/10 all-intra PASS is the load-bearing validation: it proves
our backend's V4L2 control submission, OUTPUT byte assembly, surface
management, reference timestamp plumbing, and per-codec dispatch
are all correct for the common AV1 case.
The remaining 7/10 divergence on the film_grain+show_existing
fixture was localized with a patched-libavcodec dump (a debug,
fwrite-instrumented libavcodec.so loaded via LD_LIBRARY_PATH override)
to:
- The first 7 EndPicture submissions are byte-IDENTICAL to kdirect,
both for the SEQUENCE + FRAME + TILE_GROUP_ENTRY + FILM_GRAIN ctrls
AND for the OUTPUT byte payload.
- libva issues 2 EXTRA EndPicture calls on REUSED surfaces (from the
ffmpeg-vaapi AV1 hwaccel's show_existing_frame handling).
- iter2 Fix 3 (release-on-rebind) is FALSIFIED as the cause: an A/B
run with LIBVA_SKIP_REBIND=1 is identical to the default.
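The localization above amounts to comparing two ordered capture streams submission by submission. A minimal sketch, assuming each EndPicture capture has already been flattened into one byte blob (ctrl payloads followed by OUTPUT bytes); the `submission` struct and `first_divergence` helper are hypothetical and are not the logic of diff_av1_ctrls.py:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical: one EndPicture submission flattened into a single blob
 * (SEQUENCE, FRAME, TILE_GROUP_ENTRY, FILM_GRAIN ctrl payloads followed
 * by the OUTPUT bitstream bytes). */
struct submission {
    const unsigned char *blob;
    size_t len;
};

/* Compares two captures position by position. Returns the index of the
 * first submission that differs, or the length of the shorter capture if
 * the shared prefix is identical. *extra receives the difference in count. */
static size_t first_divergence(const struct submission *a, size_t na,
                               const struct submission *b, size_t nb,
                               size_t *extra)
{
    size_t n = na < nb ? na : nb;
    size_t i;

    for (i = 0; i < n; i++) {
        if (a[i].len != b[i].len ||
            memcmp(a[i].blob, b[i].blob, a[i].len) != 0)
            break;
    }
    *extra = na > nb ? na - nb : nb - na;
    return i;
}
```

A positional compare like this reports where the streams first disagree and how many extra submissions one side has. It does not realign interleaved extra calls, so the index marks the first disagreement, not necessarily the first wrong frame.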
Fix space (Phase 4): either a cap_pool refactor to track N surfaces
per slot, OR a surface-allocation change in the ffmpeg-vaapi AV1 hwaccel.
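One reading of the cap_pool option is a slot that records every VA surface currently referring to its CAPTURE buffer, so a surface reused for show_existing_frame does not force a rebind. A hypothetical sketch: the `cap_slot` type, `SLOT_MAX_SURFACES`, and the helpers are illustrative, not the backend's cap_pool API:

```c
#include <stdbool.h>
#include <stddef.h>

#define SLOT_MAX_SURFACES 4 /* hypothetical per-slot bound */

/* Hypothetical CAPTURE-buffer slot that can be referenced by several VA
 * surfaces at once (e.g. a frame re-shown via show_existing_frame). */
struct cap_slot {
    unsigned int surface_ids[SLOT_MAX_SURFACES];
    size_t n_surfaces;
};

/* Adds a reference from surface_id; binding an already-bound surface is a no-op. */
static bool cap_slot_bind(struct cap_slot *s, unsigned int surface_id)
{
    for (size_t i = 0; i < s->n_surfaces; i++)
        if (s->surface_ids[i] == surface_id)
            return true;
    if (s->n_surfaces == SLOT_MAX_SURFACES)
        return false; /* full: caller must handle the overflow */
    s->surface_ids[s->n_surfaces++] = surface_id;
    return true;
}

/* Drops one surface's reference; returns false if it was not bound here. */
static bool cap_slot_release(struct cap_slot *s, unsigned int surface_id)
{
    for (size_t i = 0; i < s->n_surfaces; i++) {
        if (s->surface_ids[i] == surface_id) {
            s->surface_ids[i] = s->surface_ids[--s->n_surfaces]; /* swap-remove */
            return true;
        }
    }
    return false;
}

/* The slot's buffer may be requeued only once no surface refers to it. */
static bool cap_slot_idle(const struct cap_slot *s)
{
    return s->n_surfaces == 0;
}
```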
All diagnostic infrastructure is retained for the next iteration:
- /tmp/diff_av1_ctrls.py on ampere (per-CID strace byte diff)
- /tmp/ivf_split.py on ampere (per-frame IVF extraction)
- LIBVA_V4L2_DUMP_OUTPUT env var in the backend (dumps libva-side OUTPUT bytes)
- patched libavcodec build instructions in the close doc
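ivf_split.py itself is not shown here. To illustrate what per-frame IVF extraction involves, the following C sketch walks the IVF container: a 32-byte "DKIF" file header with its length at bytes 6-7, then per frame a 12-byte header holding a little-endian 32-bit payload size and a 64-bit timestamp. The `ivf_frame` helper is hypothetical:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

static uint32_t rd_le32(const unsigned char *p)
{
    return (uint32_t)p[0] | (uint32_t)p[1] << 8 |
           (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
}

/* Finds frame `index` in an in-memory IVF file. Returns 0 and sets
 * *payload / *size on success, -1 if the file is malformed or too short. */
static int ivf_frame(const unsigned char *buf, size_t len, size_t index,
                     const unsigned char **payload, size_t *size)
{
    if (len < 32 || memcmp(buf, "DKIF", 4) != 0)
        return -1;

    /* Bytes 6-7: file header length (little-endian), normally 32. */
    size_t off = (size_t)buf[6] | (size_t)buf[7] << 8;
    if (off > len)
        return -1;

    for (size_t i = 0;; i++) {
        /* 12-byte frame header: u32 payload size + u64 timestamp (both LE). */
        if (len - off < 12)
            return -1;
        uint32_t fsz = rd_le32(buf + off);
        off += 12;
        if (fsz > len - off)
            return -1;
        if (i == index) {
            *payload = buf + off;
            *size = fsz;
            return 0;
        }
        off += fsz;
    }
}
```

Writing each returned payload to its own file gives per-frame inputs that can be fed to libva and kdirect separately.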
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Janet re-review verdict on v2: PROCEED with no architecture changes.
Two minor plan-text amendments:
A. cap_pool init for vpu981 happens at backend init (NOT lazily):
three explicit init calls in VA_DRIVER_INIT_FUNC. Lazy init creates a
race window in which concurrent VP8 (legacy hantro) and AV1 (vpu981)
sessions can alias cap_pool state across the wrong device.
B. Lock the F1 test vector: av1-1-b8-23-film_grain-50.ivf (AOM data
set). It was already used in the Phase 0 kdirect run; reuse it via
libva for the byte-compare. Default 1080p encodes may be single-tile
and never exercise F1; a named conformance vector locks the test
surface.
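Amendment A can be illustrated with a hypothetical sketch: all three pools are initialized from the driver-init path before any decode context exists, so a lookup never creates a pool on demand and two sessions cannot race to build one against the wrong device. The `dev_kind` enum, `cap_pool` fields, and function names are assumptions; only the three explicit init calls come from the plan:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical device kinds matching the backend's three device slots. */
enum dev_kind { DEV_RKVDEC, DEV_HANTRO_LEGACY, DEV_VPU981, DEV_COUNT };

struct cap_pool {
    bool initialized;
    enum dev_kind owner; /* the only device whose buffers this pool may hold */
};

struct backend {
    struct cap_pool pools[DEV_COUNT];
};

static int cap_pool_init(struct cap_pool *p, enum dev_kind kind)
{
    p->owner = kind;
    p->initialized = true;
    return 0;
}

/* Run once from the driver-init entry point, before any context exists:
 * three explicit calls, no on-demand creation later. */
static int backend_init_pools(struct backend *be)
{
    if (cap_pool_init(&be->pools[DEV_RKVDEC], DEV_RKVDEC) ||
        cap_pool_init(&be->pools[DEV_HANTRO_LEGACY], DEV_HANTRO_LEGACY) ||
        cap_pool_init(&be->pools[DEV_VPU981], DEV_VPU981))
        return -1;
    return 0;
}

/* Lookup only; returns NULL instead of lazily creating a pool. */
static struct cap_pool *pool_for(struct backend *be, enum dev_kind kind)
{
    struct cap_pool *p = &be->pools[kind];
    return p->initialized ? p : NULL;
}
```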
Cross-compile hazard documented: _Static_assert validates against the
COMPILE-TIME kernel headers, not the running kernel. The sibling-campaign
pattern is a native arm64 build on ampere, so cross-compiling is out of
scope. A comment was added near the assert for future Makefile changes.
Phase 2 entry checklist locked. Plan stack: v1 → v2 (Q5 BLOCKER fix)
→ v3 amendment (Janet PROCEED + 2 fixes). Implementation starts with
request.c third-device scaffolding (smallest atomic change), then
small file edits, then NEW av1.h, then NEW av1.c (~700 LoC), then
build + Phase 3 hardware test.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Janet v1 verdict was AMEND with 1 structural BLOCKER (Q5) + 3
implementation-time risks (F1-F3).
Q5 (BLOCKER) empirically confirmed on ampere: RK3588 has THREE
hantro-vpu instances (legacy MPEG2/VP8 at /dev/video2, encoder at
/dev/video3, vpu981 AV1 at /dev/video4). The current backend's 2-device
fd model (rkvdec + hantro) is "RK3399-shaped knowledge" per its own
comment, and it silently picks the wrong hantro on RK3588.
v2 fix: add a third device slot (video_fd_vpu981 + media_fd_vpu981),
discriminated by V4L2_PIX_FMT_AV1_FRAME capability rather than driver
name (all three instances are hantro-vpu); extend
request_device_kind_for_profile with an 'a' kind for
VAProfileAV1Profile0; extend the cap_pool pair-of-flags layout per the
iter38 pattern.
Q1 amendment: size the tile_group_entry DYNAMIC_ARRAY as
sizeof(entry) * MAX(N, 1); add a _Static_assert to catch kernel uAPI drift.
Q2 amendment: VIDIOC_QUERY_EXT_CTRL probe at context init for film_grain
availability; gate per-frame send on the flag.
Q3 PROCEED: per-frame SEQUENCE send (no caching).
Q4 PROCEED: ignore VAOpaqueAV1 (codec_store_buffer has default fallback).
Q6 PROCEED: V4L2_REQUEST_MAX_PROFILES=11 is exactly full with AV1
added; add a comment for future bumps.
F1-F3 implementation risks catalogued for Phase 2 code review:
- mi_col/row_starts sentinel (silent corruption on multi-tile)
- superres_denom AV1_SUPERRES_NUM=8 default offset
- loop_restoration_size[] USES_LR flag gating
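The arithmetic behind F1-F3 can be sketched from the AV1 specification: MiCols = 2 * ((FrameWidth + 7) >> 3), a superblock shift of 5 (128x128) or 4 (64x64), SUPERRES_NUM = 8, SUPERRES_DENOM_MIN = 9, and RESTORATION_TILESIZE_MAX = 256. Which convention each V4L2 field expects (for example, whether superres_denom carries the coded value or the full denominator) is what the Phase 2 review has to confirm. The helpers are hypothetical, and zeroing the LR sizes when UsesLr is 0 is an assumption:

```c
#include <stdbool.h>
#include <stdint.h>

/* Constants from the AV1 specification. */
#define SUPERRES_NUM 8
#define SUPERRES_DENOM_MIN 9
#define RESTORATION_TILESIZE_MAX 256

/* MiCols (or MiRows) in 4x4 units: 2 * ((FrameWidth + 7) >> 3). */
static uint32_t mi_units(uint32_t frame_dim)
{
    return 2 * ((frame_dim + 7) >> 3);
}

/* F1: tile starts in MI units from per-tile sizes in superblocks.
 * starts[] needs n_tiles + 1 entries; the final sentinel is MiCols/MiRows,
 * which is smaller than the superblock-aligned sum when the last tile is
 * partial. */
static void tile_starts(const uint16_t *size_in_sbs_minus_1,
                        unsigned int n_tiles, unsigned int sb_shift,
                        uint32_t mi_total, uint32_t *starts)
{
    uint32_t pos = 0;

    for (unsigned int i = 0; i < n_tiles; i++) {
        starts[i] = pos;
        pos += ((uint32_t)size_in_sbs_minus_1[i] + 1) << sb_shift;
    }
    starts[n_tiles] = mi_total;
}

/* F2: SuperresDenom is coded_denom + 9 with superres on, else SUPERRES_NUM. */
static uint32_t superres_denom(bool use_superres, uint32_t coded_denom)
{
    return use_superres ? coded_denom + SUPERRES_DENOM_MIN : SUPERRES_NUM;
}

/* Downscaled FrameWidth from UpscaledWidth, using the spec's rounding. */
static uint32_t superres_frame_width(uint32_t upscaled_width, uint32_t denom)
{
    return (upscaled_width * SUPERRES_NUM + denom / 2) / denom;
}

/* F3: LoopRestorationSize[] is meaningful only when UsesLr; this sketch
 * zeroes it otherwise. lr_unit_shift is in 0..2. */
static void lr_sizes(bool uses_lr, unsigned int lr_unit_shift,
                     unsigned int lr_uv_shift, uint32_t size[3])
{
    size[0] = size[1] = size[2] = 0;
    if (!uses_lr)
        return;
    size[0] = RESTORATION_TILESIZE_MAX >> (2 - lr_unit_shift);
    size[1] = size[0] >> lr_uv_shift;
    size[2] = size[0] >> lr_uv_shift;
}
```

For a 1000-pixel-wide frame with two 8-superblock tile columns at 64x64, the superblock sum is 256 MI but the sentinel must be MiCols = 250. That mismatch only appears with partial last tiles, which is one reason multi-tile vectors matter for F1.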
File estimate up from 800 to 880 LoC (added vpu981 scaffolding).
Phase 0 test vectors were single-tile (208×208 + 352×288); Phase 3
verification must include multi-tile 1080p+ AV1 to exercise F1.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Pivoted from ampere-vp9-enablement (closed as a structural impossibility
on rkvdec/vdpu381). Janet's PIVOT verdict pointed at AV1, and verification
shows it works out of the box on mainline 7.0.0-rc3:
- Kernel driver: drivers/media/platform/verisilicon/rockchip_vpu981_hw_av1_dec.c
(in-tree, loaded as hantro-vpu)
- Hardware: vpu981 dedicated AV1 IP at fdc70000 (separate from rkvdec)
- V4L2 node: /dev/video4 enumerates AV1F format
- Userspace: ffmpeg -hwaccel v4l2request kdirect path works
Verification: byte-compare HW (hantro-vpu) vs SW (libdav1d) on two
AOM test vectors:
- av1-1-b8-01-size-208x208.ivf (2 frames): 100.0000% exact match
- av1-1-b8-23-film_grain-50.ivf (10 frames): 100.0000% exact match
per frame, including AV1 film_grain post-processing
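A sketch of the per-frame byte-compare, assuming both decoders' frames were dumped in the same raw pixel format and layout; `match_percent` is a hypothetical helper, and printing its result with `printf("%.4f%%", ...)` gives the "100.0000%" style used above:

```c
#include <stddef.h>

/* Percentage of byte positions that match between two same-size frames. */
static double match_percent(const unsigned char *hw, const unsigned char *sw,
                            size_t n)
{
    size_t same = 0;

    if (n == 0)
        return 100.0;
    for (size_t i = 0; i < n; i++)
        same += hw[i] == sw[i];
    return 100.0 * (double)same / (double)n;
}
```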
Pivot outcome:
- VP9 campaign: 10 iterations + 2 architect reviews → structural
impossibility (kernel-side gap that needs upstream/Collabora coordination)
- AV1 verification: 0 iterations → bit-perfect first try
The "enablement campaign" framing is mostly inappropriate for AV1;
this is a verification campaign. The real upstream work was done by
Verisilicon + Collabora; we only confirm that it works on ampere.
Optional follow-ups (out of Phase 0 scope):
A. libva backend AV1 dispatch (enables VAAPI consumers)
B. Fluster AV1-TEST-VECTORS comprehensive validation
C. 1080p/4K real-world AV1 stress test
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>