ampere-av1 Phase 5 review: stale linked_decode_surface_id clear; remap fix REVERTED

Two of three Phase 5 sonnet-architect review amendments addressed.

Amendment 4 (kept): clear surface_object->linked_decode_surface_id at
BeginPicture after the iter2 Fix 3 release. Prevents stale-link borrows
in copy_surface_to_image when ffmpeg-vaapi recycles a former display
surface as a decode target. No-op for non-AV1 codecs (link field stays
VA_INVALID_SURFACE for them throughout).

Amendment 1 (reverted): reviewer proposed remap_lr_type table {NONE,
SWITCHABLE, WIENER, SGRPROJ} per Kwiboo's permutation, arguing AV1 spec
FrameRestoreType wire encoding differs from V4L2_AV1_FRAME_RESTORE_*
enum order. Applied the proposed table empirically → regressed ALL
tests (allintra 10/10 → 0/10, test_av1 bit-exact → DIFF). Reverted to
identity mapping. Either VAAPI's yframe_restoration_type is already in
V4L2-enum order, or vpu981 interprets the V4L2 enum values via a mapping
that differs from the uAPI header documentation. Per
[[feedback_review_empirical_over_theoretical]] empirical PASS wins;
updated the code comment to capture the investigation outcome so the
next session has the context.

Amendment 5 (SEPARATE_UV_DELTA_Q sequence flag missing): noted but not
actionable, since VAAPI doesn't expose color_config.separate_uv_delta_q.
Will need bitstream-side info to surface. Not blocking current tests.

Verification on ampere:
  test_av1.ivf:             bit-exact PASS sha 029ee72c214b37c1
  av1-1-b8-02-allintra.ivf: 10/10 PASS (no regression)
  av1_larger.ivf:           3/10 PASS (no regression)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
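
The stale-link clear described under Amendment 4 can be illustrated with a minimal sketch. The struct and function names below are hypothetical stand-ins, not the backend's real types; only the field name linked_decode_surface_id and the VA_INVALID_SURFACE sentinel (0xffffffff in libva) come from the commit message. The sketch assumes the image-copy path borrows pixels from the linked surface whenever the link is valid.

```c
#include <stdint.h>

/* Same value as libva's VA_INVALID_SURFACE; redefined so the sketch
 * compiles without <va/va.h>. */
#define SKETCH_INVALID_SURFACE 0xffffffffu

/* Hypothetical, trimmed-down stand-in for the backend's surface object. */
struct sketch_surface {
    uint32_t id;
    uint32_t linked_decode_surface_id;
};

/* Called when a surface becomes the decode target of a new picture:
 * any link left over from its earlier life as a display surface is
 * dropped so later copies cannot follow it. */
static void sketch_begin_picture(struct sketch_surface *target)
{
    target->linked_decode_surface_id = SKETCH_INVALID_SURFACE;
}

/* Assumed copy-path rule: read from the linked surface if a link
 * exists, otherwise from the surface itself. */
static uint32_t sketch_copy_source(const struct sketch_surface *s)
{
    if (s->linked_decode_surface_id != SKETCH_INVALID_SURFACE)
        return s->linked_decode_surface_id;
    return s->id;
}
```

For a codec that never sets the link, the field already holds the sentinel, so the clear changes nothing, which matches the "no-op for non-AV1 codecs" remark.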
ampere-av1 Phase 3 finding: iter2 Fix 3 release is NOT the divergence cause
2026-05-17 12:19:19 +00:00 · 2026-05-17 12:12:23 +00:00 · 2026-05-17 10:55:07 +00:00 · 2026-05-17 10:45:31 +00:00 · 2026-05-17 10:31:24 +00:00 · 2026-05-17 10:28:32 +00:00
32 changed files with 1146 additions and 3668 deletions
@@ -1,281 +1,75 @@
-# libva-v4l2-request-fourier
+# v4l2-request libVA Backend

-VA-API ICD backend for V4L2 stateless video decoders. Fourier-campaign
-fork of the dormant `bootlin/libva-v4l2-request` upstream.
+## About

-> **TL;DR for "I want hardware-accelerated YouTube in Firefox on my
-> Rockchip board":** skip to the [§ Quickstart](#quickstart) below.
-> Fresnel (RK3399) and ampere (RK3588) are validated targets; ohm
-> (RK3566 PineTab2) is the chromium-fourier validation rig.
+This libVA backend is designed to work with the Linux Video4Linux2
+Request API that is used by a number of video codec drivers,
+including the Video Engine found in most Allwinner SoCs.

-## What works
+## Status

-| SoC / host | HW-accelerated codecs | Bit-exact vs `kdirect` |
-|---|---|---|
-| RK3399 (fresnel — Pinebook Pro) | H.264, HEVC Main, VP9 Profile 0, VP8, MPEG-2 | 5/5 at iter38; preserved through iter40b |
-| RK3588 (ampere) | H.264 + HEVC (iter1+iter2 ampere-fourier); **mainline rkvdec / VDPU381 + VDPU383 landed February 2026** — VP9 / AV1 verification next | iter1 H.264 PASS; remaining codecs gated on mainline-driver bring-up |
-| RK3568 / RK3566 (ohm — PineTab2) | H.264, MPEG-2, VP8 via hantro multi-planar | iter1-5 baseline (libva-multiplanar campaign) |
-| BCM2712 (higgs — Pi 5 / CM5) | — | infrastructure landed (iter40 / iter40b), bit-exact NOT achieved, [see § Pi 5 standoff](#the-pi-5-standoff) |
-
-`kdirect` is the reference: `ffmpeg -hwaccel v4l2request
--hwaccel_output_format drm_prime ...` via Kwiboo's downstream ffmpeg
-patches (packaged here as **`ffmpeg-v4l2-request-fourier`**, FFmpeg 8.1
-tip @ Kwiboo `v4l2-request-n8.1` commit `b57fbbe`).
-
-## Quickstart
-
-### What you need for HW-accelerated YouTube in Firefox
-
-The full stack, top to bottom, with the package this campaign provides
-at each layer:
-
-| Layer | Package(s) | Notes |
-|---|---|---|
-| Linux kernel with V4L2 stateless decoders | `linux-fresnel-fourier` (RK3399), `linux-ampere-fourier` (RK3588) | Mainline rkvdec / hantro / VDPU381 / VDPU383. ohm typically rides on a Beryllium OS host kernel. |
-| `ffmpeg` with Kwiboo's v4l2-request hwaccel | `ffmpeg-v4l2-request-fourier` | Provides `-hwaccel drm -c:v hevc` (and h264/vp9) routes via libavcodec hwdevice DRM. |
-| `libva` VA-API runtime + this backend ICD | `libva` (stock) + **`libva-v4l2-request-fourier`** | This repo. Auto-detects rkvdec / hantro / cedrus on probe. |
-| Firefox patched to call libavcodec stateless | `firefox-fourier` | 5-patch series, ~+169 LoC over stock Firefox. Validated on fresnel: **~5 % CPU at 1080p30 H.264** (vs 64 % software). |
-| (Wayland alt) Chromium patched for V4L2VDA | `chromium-fourier` + `kwin-fourier` | Validated on ohm under KDE Plasma 6.6.5 Wayland. Needs `kwin-fourier` for the dmabuf-fence latency fix. |
-| (Optional) panfrost / panthor GPU stack | `vulkan-panfrost` | Wayland compositor + 3D. |
-
-The actual VA-API path is mostly historical inside this campaign — the
-**user-facing browser HW decode story rides libavcodec's
-`v4l2_request` hwaccel directly**, not VAAPI-via-libva. Firefox-fourier
-attaches an `AV_HWDEVICE_TYPE_DRM` context to libavcodec's generic
-`h264`/`hevc`/`vp9` decoder; libavcodec then auto-binds the
-`v4l2_request` hwaccel from its `hw_configs`. No `LIBVA_DRIVER_NAME`
-incantation needed for browser use. libva-v4l2-request-fourier matters
-for mpv, ffmpeg-as-vaapi, and other VA-API direct consumers.
-
-### Install on Arch ALARM (fresnel / ampere / ohm)
-
-Add the marfrit repo if you haven't already:
-
-```ini
-# /etc/pacman.conf
-[marfrit]
-SigLevel = Required
-Server = https://packages.reauktion.de/arch/$arch
-```
-
-Import the signing key (one-time):
-
-```bash
-sudo pacman-key --recv-keys <KEY-ID>   # see https://packages.reauktion.de
-sudo pacman-key --lsign-key <KEY-ID>
-sudo pacman -Sy
-```
-
-Then per host:
-
-```bash
-# Fresnel — RK3399 Pinebook Pro
-sudo pacman -S \
-    linux-fresnel-fourier linux-fresnel-fourier-headers \
-    ffmpeg-v4l2-request-fourier \
-    libva-v4l2-request-fourier \
-    firefox-fourier
-
-# Ampere — RK3588
-sudo pacman -S \
-    linux-ampere-fourier linux-ampere-fourier-headers \
-    ffmpeg-v4l2-request-fourier \
-    libva-v4l2-request-fourier \
-    firefox-fourier
-
-# Ohm — RK3566 PineTab2 (chromium-fourier validated path)
-sudo pacman -S \
-    ffmpeg-v4l2-request-fourier \
-    libva-v4l2-request-fourier \
-    kwin-fourier
-# chromium-fourier currently still a local build — see § Status
-```
-
-Reboot if a new kernel landed. Then:
-
-```bash
-# Smoke-test: vainfo should list HEVCMain + H264 entries
-LIBVA_DRIVER_NAME=v4l2_request vainfo
-
-# Browser launch with verbose decoder logging
-MOZ_LOG="PlatformDecoderModule:5,FFmpegVideo:5" \
-  firefox-fourier 2>&1 | tee /tmp/fx.log
-
-# Then open a YouTube 1080p H.264 video and grep for:
-#   "Choosing FFmpeg pixel format for V4L2 video decoding"
-#   "av_hwdevice_ctx_create(DRM, /dev/dri/renderD128) ok"
-# If you DON'T see those: HW path didn't engage, fell back to software.
-```
-
-### Status of the published vs locally-built packages
-
-As of May 2026, the live marfrit repo at
-<https://packages.reauktion.de/arch/aarch64/> has:
-
-- ✓ `libva-v4l2-request-fourier-1:1.0.0.r361.cf8cd9d-1` (iter40b tip)
-- ✓ `ffmpeg-v4l2-request-fourier-2:8.1.r123329.b57fbbe-3` (Kwiboo's
-  v4l2-request-n8.1 + libudev-bypass; smoke-tested on fresnel —
-  HEVC via `-hwaccel v4l2request` PASS)
-- ✓ `firefox-fourier-150.0.1-16` (5-patch series, sandboxed RDD HW
-  decode validated on RK3399: ~5 % CPU at 1080p30 H.264)
-- ✓ `linux-fresnel-fourier-7.0-14` + headers (RK3399)
-- ✓ `linux-ampere-fourier-7.0rc3.kafr1-1` + headers (RK3588)
-- ✓ `kwin-fourier-1:6.6.5-1` (Wayland dmabuf-fence fix for chromium-fourier)
-- ✓ `vulkan-panfrost-1:26.0.5-1` (GPU stack)
-
-NOT yet published but **present in `marfrit-packages/arch/` source
-tree** (build + publish pending):
-
-- ⏳ `chromium-fourier` (Chromium 147 + V4L2VDA-on-mainline patches —
-  blocked on Arch ALARM bumping clang 22 → 23).
-- ⏳ `qt6-base-fourier` (GL_ALPHA → GL_R8 fix — needed by KDE Plasma
-  Wayland on the panfrost stack).
-
-If you need those locally before they ship:
-
-```bash
-git clone ssh://git@git.reauktion.de:2222/marfrit/marfrit-packages.git
-cd marfrit-packages/arch/<package>
-makepkg -si
-```
-
-## What does NOT work, and why it's stalled
-
-| Target | Status | Blocker |
-|---|---|---|
-| H264 Hi10P on RK3399 | enumerated, decode returns all-zero | RK3399 silicon doesn't implement 10-bit despite kernel advertising the profile (iter39 close, Option B applied) |
-| HEVC Main10 on RK3399 | not enumerated | same as Hi10P |
-| **Pi 5 / CM5 (BCM2712 / `rpi-hevc-dec`)** | infrastructure landed (iter40 / iter40b), bit-exact NOT achieved | see "The Pi 5 standoff" below |
-
-## What does NOT work, and why it's stalled
-
-| Target | Status | Blocker |
-|---|---|---|
-| H264 Hi10P on RK3399 | enumerated, decode returns all-zero | RK3399 silicon doesn't implement 10-bit despite kernel advertising the profile (iter39 close, Option B applied) |
-| HEVC Main10 on RK3399 | not enumerated | same as Hi10P |
-| **Pi 5 / CM5 (BCM2712 / `rpi-hevc-dec`)** | infrastructure landed (iter40 / iter40b), bit-exact NOT achieved | see "The Pi 5 standoff" below |
-
-### The Pi 5 standoff
-
-iter40 + iter40b add a third multi-device-probe slot for
-`rpi-hevc-dec`, an NC12 SAND128 detile primitive, per-driver gates
-around the SPS pre-seed + start-code-prepend + scaling_matrix submission,
-and a (fragile, fixture-specific) SPS field override using the
-GStreamer 1.28.2 H.265 parser. ICD discovery works, `vainfo` lists
-`VAProfileHEVCMain`, S\_FMT / REQBUFS / STREAMON all succeed.
-
-**Decode itself never succeeds** — every CAPTURE DQBUF returns
-`V4L2_BUF_FLAG_ERROR`. Driver author John Cox confirmed strict SPS
-validation is intentional ("`try_ext_ctrls returned an error (22)` is
-expected as it is validating the SPS"), and VAAPI's
-`VAPictureParameterBufferHEVC` simply doesn't carry the bitstream-true
-scalars (`sps_max_num_reorder_pics`, `sps_max_latency_increase_plus1`,
-slice-level `num_entry_point_offsets`) that the driver wants. We can't
-fish the SPS out of `source_data` either, because ffmpeg-vaapi parses
-the SPS itself and passes only slice NAL bytes to libva backends.
-
-This is not a bug in our backend, in libva, in ffmpeg, or in the kernel
-driver. It's an ecosystem coordination failure of long standing:
-
-- **Kwiboo's `ffmpeg-v4l2request` hwaccel** has been in production via
-  LibreELEC since December 2018. Re-submitted to ffmpeg-devel as a v2
-  series in August 2024. Still un-merged in May 2026 — **eight years
-  in the upstream review queue**.
-- **`libva-v4l2-request`** (this project's upstream) hasn't taken
-  meaningful commits since ~2021. Nobody wants to own the impedance
-  mismatch between VAAPI's Intel-shaped "give me raw bitstream, I'll
-  parse" and V4L2 stateless's kernel-shaped "give me parsed structs,
-  I'll just drive the HW."
-- **`rpi-hevc-dec` mainline submission** is at v4 (July 2025), 17
-  months in review. The Pi 6.18.x downstream kernel meanwhile has
-  active HEVC regressions ([raspberrypi/linux#7228](https://github.com/raspberrypi/linux/issues/7228),
-  [#7306](https://github.com/raspberrypi/linux/issues/7306)) that
-  aren't being fast-tracked because "the new uAPI is coming."
-- **Mozilla is implementing Pi 5 HEVC via ffmpeg's hwaccel-context
-  path** (bug [1969297](https://bugzilla.mozilla.org/show_bug.cgi?id=1969297)),
-  not via libva — explicit acknowledgement from David Turner that
-  libavcodec needs to retain the SPS context for the strict driver to
-  accept the control batch.
-
-What end-users actually do today: run Pi OS (downstream-patched ffmpeg
-+ downstream kernel) or LibreELEC (Kwiboo's patches + downstream
-kernel). Anyone on a stock distro outside those two: no HW HEVC on
-Pi 5.
-
-Nobody who has authority to merge has skin in the game. Everyone with
-skin in the game lacks authority. Result: 8-year stalemate, three
-forks of working code, no merged upstream.
-
-### What this means for this backend
-
-We chose to extend `libva-v4l2-request` into Pi 5 territory because
-the architecture maps cleanly onto the existing iter38 multi-device
-probe. That work landed (iter40 commit `3ffa9d0`, iter40b commit
-`071b08d`). It's reusable infrastructure for any future strict V4L2
-stateless decoder that ffmpeg ships before libva does.
-
-But the *user-facing* Pi 5 HEVC story will not come from this
-backend. The backend was a clean architectural target inside a
-coordination dead-end. The actual Pi 5 HEVC path through libva
-requires either:
-
-- a VAAPI extension exposing the SPS scalars rpi-hevc-dec validates
-  against (Intel-driven; no Pi-aligned principal),
-- a libva-internal `VABufferType` for raw SPS/PPS NAL bytes (no
-  maintainer),
-- ffmpeg-vaapi forwarding `num_entry_point_offsets` to backends
-  (small upstream patch; no champion), OR
-- the political situation around Kwiboo's series unblocks (no
-  visible movement).
-
-iter40 + iter40b are **landed but parked**. The fresnel + ampere
-sibling paths are unaffected (5/5 fresnel + 9 profiles ampere
-verified post-iter40b, no regression). Phase 8 packaging is
-deliberately skipped — shipping a `.deb` whose primary advertised
-target (Pi 5) doesn't actually decode would mislead users.
-
-See `phase0_pi5_hevc.md`, `phase1_pi5_hevc.md`,
-`phase5_pi5_hevc_review.md`, `phase7_pi5_hevc_close.md` for the
-chapter's full empirical record.
+The v4l2-request libVA backend currently supports the following formats:
+* MPEG2 (Simple and Main profiles)
+* H264 (Baseline, Main and High profiles)
+* H265 (Main profile)

 ## Instructions

-In order to use this backend, set the `LIBVA_DRIVER_NAME` environment
-variable:
+In order to use this libVA backend, the `v4l2_request` driver has to
+be specified through the `LIBVA_DRIVER_NAME` environment variable, as
+such:

 	export LIBVA_DRIVER_NAME=v4l2_request

-Then a VA-API-capable player can decode supported codecs on a probed
-device:
+A media player that supports VAAPI (such as VLC) can then be used to decode a
+video in a supported format:

-	vlc path/to/video.mp4
-	mpv --hwdec=vaapi path/to/video.mp4
-	ffmpeg -hwaccel vaapi -hwaccel_output_format vaapi -i in.mp4 -f null -
+	vlc path/to/video.mpg

-The backend auto-detects available decoders via the V4L2 media
-topology walk; honors `LIBVA_V4L2_REQUEST_VIDEO_PATH` and
-`LIBVA_V4L2_REQUEST_MEDIA_PATH` for explicit device selection.
+Sample media files can be obtained from:
+
+	http://samplemedia.linaro.org/MPEG2/
+	http://samplemedia.linaro.org/MPEG4/SVT/

 ## Technical Notes

-### Multi-device probe (iter38)
+### Surface

-A single libva session opens both `rkvdec` and `hantro-vpu` (and, on
-hosts where it's present, `rpi-hevc-dec`) at init. `RequestCreateConfig`
-re-targets the active fd per profile via
-`request_switch_device_for_profile()`. Pool teardown happens at
-switch time; the next `CreateContext` rebuilds against the right
-device.
+A Surface is an internal data structure never handled by the VA's user
+containing the output of a rendering. Usually, a bunch of surfaces are created
+at the beginning of decoding and they are then used alternately. When
+created, a surface is assigned a corresponding v4l capture buffer and it is
+kept until the end of decoding. Syncing a surface waits for the v4l buffer to
+be available and then dequeues it.

-### Surface / Context / Picture / Image
+Note: since a Surface is kept private from the VA's user, the user can ask to
+directly render a Surface on screen in an X Drawable. A rudimentary
+implementation is available in PutSurface, but this is only for development
+purposes.

-A Surface is an internal data structure containing rendering output.
-A Context owns the V4L2 lifecycle (S\_FMT, CAPTURE pool, ctrl-batch
-defaults) for one decode session. A Picture is one encoded input
-frame's set of buffers. An Image is a Standard VA pixel-format view
-on a decoded Surface — the backend detiles SAND/COL128 or unpacks
-NV15 to NV12/P010 here so consumers see linear pitches.
+### Context

-The real rendering is in `EndPicture`, not `RenderPicture`, because
-the kernel needs the full extended-control batch when the OUTPUT
-buffer is queued, and `RenderPicture` order is consumer-defined.
+A Context is a global data structure used for rendering a video of a certain
+format. When a context is created, input buffers are created and v4l's output
+(which is the compressed data input queue, since capture is the real output)
+format is set.
+
+### Picture
+
+A Picture is an encoded input frame made of several buffers. A single input
+can contain slice data, headers and an IQ matrix. Each Picture is assigned a
+request ID when created and each corresponding buffer might be turned into a
+v4l buffer or an extended control when rendered. Finally, they are submitted to
+kernel space when reaching EndPicture.
+
+The real rendering is done in EndPicture instead of RenderPicture
+because the v4l2 driver expects to have the full corresponding
+extended control when a buffer is queued and we don't know in which
+order the different RenderPicture will be called.
+
+### Image
+
+An Image is a standard data structure containing rendered frames in a usable
+pixel format. Here we only use NV12 buffers which are converted from sunxi's
+proprietary tiled pixel format with tiled_yuv when deriving an Image from a
+Surface.
@@ -195,11 +195,6 @@ extern "C" {
 #define DRM_FORMAT_NV24		fourcc_code('N', 'V', '2', '4') /* non-subsampled Cr:Cb plane */
 #define DRM_FORMAT_NV42		fourcc_code('N', 'V', '4', '2') /* non-subsampled Cb:Cr plane */

-/* iter39: NV15 is 4×10-bit packed in 5 bytes (Rockchip rkvdec 10-bit output). */
-#ifndef DRM_FORMAT_NV15
-#define DRM_FORMAT_NV15		fourcc_code('N', 'V', '1', '5') /* 2x2 subsampled Cr:Cb plane 10 bits per channel packed */
-#endif
-
 /*
 * 3 plane YCbCr
 * index 0: Y plane, [7:0] Y
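
The removed NV15 comment above describes four 10-bit samples packed into five bytes. A minimal sketch of unpacking one such group into P010-style 16-bit words follows. It assumes the little-endian packing described in the kernel's V4L2 NV15 documentation (sample 0 in the lowest 10 bits of the 40-bit group) and P010's convention of storing the 10 significant bits in the upper bits of each word. The function name is hypothetical and is not the backend's own unpack routine.

```c
#include <stdint.h>

/* Unpack one NV15 group (5 bytes, 4 samples) into 4 P010-style words.
 * Assumption: the 5 bytes form a 40-bit little-endian integer and
 * sample i occupies bits [10*i + 9 : 10*i]. */
static void sketch_nv15_group_to_p010(const uint8_t in[5], uint16_t out[4])
{
    uint64_t bits = 0;

    /* Assemble the 40-bit little-endian group. */
    for (int i = 0; i < 5; i++)
        bits |= (uint64_t)in[i] << (8 * i);

    /* Extract each 10-bit sample and shift it into the top of a
     * 16-bit word, leaving the low 6 bits zero as P010 expects. */
    for (int i = 0; i < 4; i++) {
        uint16_t sample = (uint16_t)((bits >> (10 * i)) & 0x3ffu);
        out[i] = (uint16_t)(sample << 6);
    }
}
```

A full plane conversion would call this once per group of four pixels per row; handling of row padding and widths that are not a multiple of four is left out of the sketch.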
@@ -4,14 +4,3 @@ option(
    value : '',
    description: 'Path to sanitized Linux Kernel headers'
 )
-
-option(
-    'daedalus_v4l2',
-    type : 'boolean',
-    value : true,
-    description: 'Enable probe + dispatch for the out-of-tree daedalus_v4l2 ' +
-                 'stateless decoder shim (Pi 5 / CM5 daemon-backed VP9/AV1/H264). ' +
-                 'Default true; disable on platforms where the daedalus_v4l2 ' +
-                 'kernel module will never be present to slim the probe array.'
-)
-
@@ -1,298 +0,0 @@
-# Phase 0 — Pi 5 / CM5 HEVC chapter
-
-Opened 2026-05-17 evening, after the failed `libva-v4l2-stateful-fourier`
-scaffold attempt. Brother-session empirical Phase 0 on higgs invalidated
-the stateful premise: rpi-hevc-dec is V4L2 **stateless**, so Pi 5 HEVC
-belongs in this backend, not a separate sibling.
-
-No code in this chapter yet. This doc is the substrate. Phase 1 picks up
-from the "Open questions" section.
-
-## Substrate
-
-### Target host
-
-higgs — Pi CM5 module on Pi CM5 IO board. BCM2712 SoC. VPN-only, often
-offline; wake via HIS skill recipe (no Fritz!Box plug — runs on power
-when on). Debian-based. Sole HW video decoder is rpi-hevc-dec at
-`/dev/video19` + `/dev/media1`.
-
-### Backend baseline at chapter open
-
-`libva-v4l2-request-fourier` master tip `cf8cd9d` (iter39 + Option B +
-h265 ref-list cap fix). Multi-device probe (iter38) already opens
-rkvdec + hantro slots; adding a third decoder slot for rpi-hevc-dec is
-a natural extension of that architecture.
-
-iter2 (ampere VDPU381 HEVC EXT_SPS) added the GStreamer 1.28.2 H.265
-parser vendor + EXT_SPS_ST_RPS / _LT_RPS dynamic-array submission. That
-plumbing is probe-gated (`has_hevc_ext_sps_rps_rkvdec`), so it stays
-dormant on hosts where the controls don't exist.
-
-### Empirical higgs probe (brother session)
-
-`v4l2-ctl -d /dev/video19 --list-formats-ext --list-ctrls`:
-
-```
-Stateless Codec Controls
-
-  hevc_sequence_parameter_set        (compound, V4L2_CID_STATELESS_HEVC_SPS)
-  hevc_picture_parameter_set         (compound, V4L2_CID_STATELESS_HEVC_PPS)
-  slice_param_array                  (compound dynamic-array dims=[4096])
-  hevc_scaling_matrix                (compound)
-  hevc_decode_parameters             (compound)
-  hevc_decode_mode                   (menu, "Frame-Based")
-  hevc_start_code                    (menu, default "No Start Code")
-
-OUTPUT formats:
-  S265  V4L2_PIX_FMT_HEVC_SLICE  (parsed slice payload)
-
-CAPTURE formats:
-  NC12  V4L2_PIX_FMT_NV12_COL128       (8-bit  SAND 128-column tiled)
-  NC30  V4L2_PIX_FMT_NV12_10_COL128    (10-bit SAND 128-column tiled)
-```
-
-Conclusion: this is the standard `V4L2_CID_STATELESS_HEVC_*` control set
-exposed under the V4L2-request uAPI, exactly the same family our backend
-already drives for rkvdec/hantro/cedrus HEVC paths. The novel parts are
-two pixel formats (NC12, NC30) and one driver-id (rpi-hevc-dec).
-
-## What carries forward unchanged
-
-- VAAPI HEVC profile enumeration (`config.c`)
-- `h265_set_controls` core path (`h265.c`) — same compound ctrl set
-- Synthetic SPS pre-seed pattern (iter25/26) — already runs pre-CAPTURE-alloc
-- Multi-device dispatch in `RequestCreateConfig` (iter38)
-- VAAPI slice / picture / IQ matrix buffer parsing
-- HEVC h264-style start-code policy (we already DON'T prepend for HEVC)
-
-## What needs adding
-
-| Item | Location | Sizing |
-|------|----------|--------|
-| `RPI_HEVC_DEC` enum in `driver_kind_t` | `request.h` | trivial |
-| Multi-device probe extends to `/dev/video19` discovery | `context.c` / `request.c` init | small — mirror hantro slot |
-| `V4L2_PIX_FMT_NV12_COL128` (NC12) `video_format` entry | `video.c` | small |
-| `V4L2_PIX_FMT_NV12_10_COL128` (NC30) `video_format` entry | `video.c` | small |
-| NC12 → NV12 detile primitive | new `nv12_col128.c` | mid — column tile layout, see kernel docs |
-| NC30 → P010 detile primitive | new `nv12_col128.c` | mid — 10-bit variant of above |
-| `copy_surface_to_image` branch for NC12/NC30 | `image.c` | small (mirror NV15→P010 gating) |
-| Per-driver gating for any rpi-specific quirks discovered | various | per [[per-driver-kludge-gating]] |
-
-## Open questions for Phase 1
-
-Lock these before Phase 1 commits to a goal.
-
-1. **EXT_SPS controls on rpi-hevc-dec?** Brother's `--list-ctrls` output
-   above shows the standard `V4L2_CID_STATELESS_HEVC_*` family — NOT the
-   `EXT_SPS_ST_RPS` / `EXT_SPS_LT_RPS` extensions that VDPU381 needs.
-   Verify: does `slice_param_array[4096]` accept `st_rps_bits` /
-   `lt_rps_bits` in the per-slice payload, or does rpi-hevc-dec parse RPS
-   itself from the slice header? If the latter, the iter2 EXT_SPS path
-   stays dormant (probe-gated already), and rpi-hevc-dec just needs the
-   `picture->st_rps_bits` → `slice_params->short_term_ref_pic_set_size`
-   plumbing that iter31 α-29 already wired. Expectation: works out of the
-   box. Confirm before assuming.
-
-2. **`hevc_start_code` ctrl: "No Start Code" vs Annex B?** Brother saw
-   default `"No Start Code"` — matches our behavior (we don't prepend on
-   HEVC). But the ctrl is configurable. Verify the menu values exposed
-   and confirm "No Start Code" passes our raw slice-NAL payload as-is.
-   If it doesn't, set the ctrl explicitly per [[unconditional-codec-state]]
-   gating.
-
-3. **NC12 / NC30 SAND tile layout — exact spec.** Read
-   `Documentation/userspace-api/media/v4l/pixfmt-yuv-planar.rst` for the
-   COL128 variants. Confirm: column stride = 128 bytes (Y) / 128 bytes
-   (UV interleaved). Row count = `ALIGN(height, 16)` or `ALIGN(height, 8)`?
-   Get the exact alignment and tile-traversal order before writing the
-   detile primitive. Cite from kernel doc, NOT inferred from a hex dump.
-
-4. **drm_prime / SAND modifier round-trip.** Does ffmpeg-vaapi (and
-   Firefox) accept the NC12 buffer via DRM_PRIME export carrying the
-   DRM_FORMAT_MOD_BROADCOM_SAND128_COL_HEIGHT modifier, allowing
-   zero-copy to a SAND-aware compositor? Or is libva-side detile to a
-   linear NV12 buffer the only viable Firefox path? If detile is
-   required for the consumer, the [[rockchip-pixel-verify-path]] rule
-   (DMA-BUF GL preferred over cached mmap) might NOT apply since SAND
-   is Pi-specific and not in the wider Wayland modifier ecosystem.
-
-5. **rpi-hevc-dec quirks on first SPS submission.** rkvdec needs
-   image_fmt pre-seed before CAPTURE alloc (iter25). Does rpi-hevc-dec
-   have an analogous "must set OUTPUT pix_fmt + SPS before CAPTURE"
-   ordering? Verify with strace early.
-
-6. **higgs OS + libva versioning.** Brother probed on Debian. We package
-   for Arch ALARM. What's the install path on higgs — Arch / Debian /
-   Raspberry Pi OS? If Debian, the package needs a `debian/` tree, not
-   just PKGBUILD. Decide packaging target before Phase 8.
-
-## Phase 1 goal sketch (NOT locked)
-
-> Firefox HW HEVC playback on higgs at ≥30fps for 1080p Main, byte-exact
-> libva-vs-kdirect for ≥3 reference fixtures (8-bit Main and 10-bit Main10).
-
-Two measurable subgoals follow naturally:
-- libva (this backend, NV12 image output) == kdirect (ffmpeg-v4l2request,
-  NV12 image output) byte-exact for the same input.
-- Firefox VA-API path engages (verify via `chrome://gpu` equivalent / log
-  inspection — `MOZ_LOG=PlatformDecoderModule:5`).
-
-## Phase 3 baseline plan
-
-Before any backend code touches rpi-hevc-dec:
-- `kdirect` floor: `ffmpeg -hwaccel v4l2request -hwaccel_output_format drm_prime
-  -i bbb_720p10s_hevc.mp4 -vf hwdownload,format=nv12 -frames:v 10 ...` and
-  sha256 the YUV.
-- `SW reference`: same ffmpeg without `-hwaccel`, sha256 the YUV.
-- Both runs N=3 per [[replicate-baseline-first]].
-- Capture `strace -f -e ioctl` of the kdirect run — gives the canonical
-  ioctl sequence rpi-hevc-dec expects.
-
-## Phase 0 closing
-
-This doc commits the substrate. Phase 1 starts when:
-- higgs is up + reachable
-- Open questions 1+2 (EXT_SPS + start_code) are answered live, in one
-  short probe session
-- Phase 3 baseline floors are captured
-
-No work blocks the close of iter39 / fresnel campaign — those are shipped.
-
-## Phase 0 close addendum (2026-05-17 evening, higgs probe session)
-
-Empirical probes on higgs answered Q1, Q2, partial Q3, full Q5, full Q6.
-Q4 (DRM modifier round-trip) remains open. Phase 0 is closed; Phase 1
-opens with what's below.
-
-### Q1 — EXT_SPS controls on rpi-hevc-dec: NOT present
-
-`v4l2-ctl -d /dev/video19 --list-ctrls` confirms ONLY the standard
-`V4L2_CID_STATELESS_HEVC_*` set:
-- `hevc_sequence_parameter_set` (0x00a40a90)
-- `hevc_picture_parameter_set` (0x00a40a91)
-- `slice_param_array` (0x00a40a92, dynamic-array dims=[4096])
-- `hevc_scaling_matrix` (0x00a40a93)
-- `hevc_decode_parameters` (0x00a40a94)
-- `hevc_decode_mode` (0x00a40a95, menu min=1 max=1 default=1 = Frame-Based)
-- `hevc_start_code` (0x00a40a96, menu min=0 max=1 default=0 = No Start Code)
-- 0x00a40a97 returns EINVAL (no EXT_SPS_*_RPS controls)
-
-ioctl trace confirms ffmpeg's `VIDIOC_QUERY_EXT_CTRL` for `0xa97` returns
-EINVAL — same probe pattern our backend uses for
-`has_hevc_ext_sps_rps_rkvdec`. **The iter2 path stays dormant; the
-iter31 α-29 `slice_params->short_term_ref_pic_set_size` plumbing is the
-correct one for rpi-hevc-dec.**
-
-### Q2 — hevc_start_code: default 0 (No Start Code), values {0, 1}
-
-Default 0 matches our backend's "don't prepend HEVC start code" stance.
-Confirm in Phase 1: rpi-hevc-dec accepts our raw NAL slice payload as-is.
-
-### Q3 — NC12 / NC30 SAND tile layout: PARTIAL
-
-CAPTURE S_FMT result for 1280×720 NC12:
- `sizeimage=1382400` = `1280 × 720 × 1.5` (linear NV12 byte count)
- `bytesperline=1080` (NOT 1280)
-
-The bytesperline=1080 for a 1280-wide CAPTURE buffer is suspect — likely
-encodes SAND column count rather than linear stride. Read
-`drivers/staging/media/rpivid/` (or wherever NC12_COL128 lives in 6.12)
-kernel source + `drm_fourcc.h` / `nv12_col128.rst` (if it exists) for
-exact tile layout BEFORE writing the detile primitive. Do NOT infer
-layout from this single observation.
-
-### Q4 — DRM modifier round-trip: BLOCKED on hwdownload
-
-ffmpeg `-hwaccel drm -hwaccel_output_format drm_prime -vf
-hwmap=mode=read,format=nv12` returns `Failed to map frame: -38`
-(`Function not implemented`). hwdownload cannot consume the SAND
-modifier directly.
-
-ffmpeg's path that DOES work: `-hwaccel drm -c:v hevc` WITHOUT
-`-hwaccel_output_format drm_prime` lets ffmpeg's internal pipeline pull
-back, detile (presumably via a Pi-specific helper or libdrm transform),
-and present NV12 to the next filter. Bit-exact vs SW for the test
-fixture (1280×720 Main 8-bit) — confirms HW engagement.
-
-Phase 1 / Phase 4 will need to decide:
-- Detile in the backend (CPU SIMD), exposing NV12 via VAImage; or
-- Pass-through DRM_PRIME with SAND modifier and let the consumer
-  (compositor / Firefox) detile. Firefox almost certainly can't, so
-  CPU detile is the safe bet.
-
-### Q5 — rpi-hevc-dec submission ordering: empirically locked
-
-`strace -e ioctl` of the kdirect run shows:
-1. `MEDIA_IOC_DEVICE_INFO` + `MEDIA_IOC_G_TOPOLOGY` (per media node)
-2. `VIDIOC_QUERYCAP` per video node — `driver="rpi-hevc-dec"` identifies
-   the right one
-3. `VIDIOC_ENUM_FMT` OUTPUT → S265 only
-4. `VIDIOC_S_FMT` OUTPUT (HEVC_SLICE, placeholder dims)
-5. `VIDIOC_REQBUFS` OUTPUT (DMABUF, count=N) — count=6 in kdirect
-6. `VIDIOC_S_FMT` CAPTURE (NC12, actual dims from SPS parse)
-7. `VIDIOC_CREATE_BUFS` CAPTURE (DMABUF, count=16)
-8. `VIDIOC_STREAMON` both queues
-9. `VIDIOC_QUERY_EXT_CTRL` enumeration
-10. `VIDIOC_S_EXT_CTRLS` (decode_mode + start_code) — global ctrls
-11. Per frame: `VIDIOC_S_EXT_CTRLS` (SPS+PPS+decode_params+slice_array,
-    class=0xf010000 = per-request) + `VIDIOC_QBUF` CAPTURE + `VIDIOC_QBUF`
-    OUTPUT (with `V4L2_BUF_FLAG_IN_REQUEST | V4L2_BUF_FLAG_REQUEST_FD`) +
-    `VIDIOC_DQBUF` OUTPUT + `VIDIOC_DQBUF` CAPTURE
-
-**Two structural notes for the backend:**
-- OUTPUT + CAPTURE both use `V4L2_MEMORY_DMABUF` in kdirect. Our backend
-  currently uses MMAP for CAPTURE on rkvdec/hantro. For Pi 5 we should
-  either follow kdirect (DMABUF, allows zero-copy DRM_PRIME export) or
-  use MMAP and CPU-detile. Phase 4 design decision.
-- The order `S_FMT OUTPUT → REQBUFS OUTPUT → S_FMT CAPTURE → CREATE_BUFS
-  CAPTURE → STREAMON` differs from our iter25 rkvdec pre-seed pattern
-  (where SPS via S_EXT_CTRLS must come BEFORE CAPTURE alloc to resolve
-  the image_fmt). rpi-hevc-dec apparently DOESN'T need that pre-seed —
-  CAPTURE S_FMT just takes the explicit NC12 + caller's dims. Confirm
-  in Phase 1 by trying our existing iter25 pre-seed flow against it.
-
-### Q6 — packaging: Debian 13 trixie, NOT Arch
-
-higgs runs Debian 13 trixie (`PRETTY_NAME="Debian GNU/Linux 13 (trixie)"`),
-not Arch ALARM. Phase 8 (per the dev-process Phase 8 packaging rule) for
-the Pi 5 chapter needs a `debian/` packaging tree, not just a PKGBUILD.
-
-Decide in Phase 1 whether to:
-- Add Debian packaging to `marfrit-packages` as a second target, OR
-- Use distrobox/podman with an Arch ALARM container on higgs for
-  install (test-only, not production), OR
-- Pi 5 chapter ships a Debian source pkg via gitea / a personal Debian
-  repo.
-
-### Other new findings from the probe session
-
-- **ffmpeg 7.1.3 from Debian 13 is built with `--enable-v4l2-request`**
-  — the kdirect path exists. Invocation is `ffmpeg -hwaccel drm -c:v
-  hevc` (not just `-hwaccel drm`; the explicit codec flag matters for
-  the negotiation). Engagement log line is
-  `Hwaccel V4L2 HEVC stateless V4; devices: /dev/media1,/dev/video19;
-  buffers: src DMABuf, dst DMABuf; swfmt=rpi4_8`. Per
-  [[hw-decode-engagement-check]], grep for that line to confirm HW path
-  engaged.
-- **No libva ICD installed on higgs** — only `armada-drm_dri.so` ships,
-  which doesn't apply. We'd be the first VA-API HW path for HEVC on Pi
-  5 once installed.
-- **mpv is apt-installable** (`mpv 0.40.0-3+deb13u1`) — useful as a
-  pixel-readback verifier once the backend works (`mpv --vo=image` or
-  `--vo=drm`).
-- **Firefox 145.0.1 + rpi-firefox-mods 20251016 installed** (firefox-esr
-  package status was `rc` = removed but config remains). The mods
-  package likely contains VA-API plumbing prefs.
-
-### What changes for Phase 1
-
- Goal is now phrasable: HEVC bit-exact libva-vs-kdirect on higgs for
-  the 1280×720 Main 8-bit test fixture (same generator as
-  `/tmp/bbb_main.mp4` here). Kdirect engagement signal is the
-  `Hwaccel V4L2 HEVC stateless V4` log line.
- Most backend code reuses existing rkvdec/hantro HEVC path: ctrls,
-  per-frame submission, request_fd, multi-device probe pattern.
- New code: NC12 video_format entry + detile primitive (sibling to
-  `nv15_unpack_plane_to_p010`) + RPI_HEVC_DEC driver_kind.
- Packaging target = Debian, not Arch.
@@ -1,230 +0,0 @@
-# Phase 1+2+3+4 — Pi 5 HEVC chapter (iter40)
-
-Per [[feedback_dev_process]], Phase 1 (goal), Phase 2 (situation analysis),
-Phase 3 (baselines), Phase 4 (plan) for adding rpi-hevc-dec as a third
-multi-device-probe slot in `libva-v4l2-request-fourier`. Phase 0 substrate
-+ open-question answers live at `phase0_pi5_hevc.md`.
-
-## Phase 1 — Goal
-
-> **libva-v4l2-request-fourier on higgs** decodes HEVC Main 8-bit input
-> producing NV12 output **bit-exact vs kdirect** for three reference
-> fixtures (640×360, 1280×720, 1920×1080 — Main profile, libx265
-> ultrafast). HW path engagement verified via the kernel-driver lsof
-> signal (`/dev/video19` open) AND ffmpeg-vaapi engagement signal
-> (`Hwaccel V4L2 HEVC stateless V4; devices: /dev/media1,/dev/video19`).
-
-Measurable:
-
-| Criterion | Metric |
-|---|---|
-| C1 — vainfo enumeration | `LIBVA_DRIVER_NAME=v4l2_request vainfo` lists `VAProfileHEVCMain : VAEntrypointVLD` |
-| C2 — bit-exact decode | sha256 of libva NV12 output == sha256 of kdirect NV12 output, per fixture, N=1 |
-| C3 — HW engagement | `lsof` shows `/dev/video19` open by ffmpeg-vaapi during libva run |
-| C4 — Stability under N=3 | C2 holds at N=3 repeated runs (deterministic) |
-| C5 — Sibling baseline preserved | fresnel iter38 5/5 still PASS post-iter40 (no regression to rkvdec/hantro path) |
-
-Out of scope this iter: Main10 (10-bit / NC30), VP9, AV1, Firefox VA-API
-engagement testing, performance benchmarks. All later chapters.
-
-## Phase 2 — Situation Analysis
-
-### Backend architecture already in place
-
- **Multi-device probe (iter38)**: at `VA_DRIVER_INIT` opens both
-  `rkvdec` + `hantro-vpu` via `find_decoder_device_by_driver(name)`.
-  Stores per-driver fds (`video_fd_{rkvdec,hantro}`,
-  `media_fd_{rkvdec,hantro}`). `RequestCreateConfig` retargets the
-  "active" `driver_data->{video,media}_fd` per profile via
-  `request_switch_device_for_profile()` (request.c:426-478).
- **Per-driver feature gating**: `request_data->has_hevc_ext_sps_rps_{rkvdec,hantro}`
-  pair, with `h265_set_controls` consulting the per-fd flag. Established
-  by iter2 / Phase 5 review (request.h:99-100). This is the canonical
-  per-driver gating shape for iter40.
- **HEVC ctrl population**: `h265_set_controls` populates the standard
-  `V4L2_CID_STATELESS_HEVC_*` set (h265.c). Probe-gates EXT_SPS_*_RPS
-  via the iter2 path — naturally dormant for rpi-hevc-dec since the
-  controls don't exist.
- **Synthetic SPS pre-seed (iter25/26)**: needed for rkvdec to resolve
-  `image_fmt` before CAPTURE alloc. Phase 0 strace shows rpi-hevc-dec
-  does NOT need this — it accepts NC12 + explicit dims on `S_FMT
-  CAPTURE` directly. The pre-seed code path stays in place for rkvdec;
-  rpi-hevc-dec just doesn't trigger it (gate on driver_kind).
- **CAPTURE detile primitive**: `nv15_unpack_plane_to_p010()` (nv15.c)
-  is the template — backend already CPU-detiles when a Pi-or-Rockchip-
-  specific CAPTURE format meets a linear consumer (VAImage NV12 / P010).
- **Single-plane (S) vs multi-plane (M) handling**: hantro uses MPLANE,
-  rkvdec uses both depending on codec. rpi-hevc-dec exposes MPLANE for
-  BOTH OUTPUT (HEVC_SLICE) and CAPTURE (NC12) per the strace. iter38
-  already supports MPLANE handling for hantro; rpi reuses that.
-
-### Surface area to touch (audit)
-
-| File | What changes | Size |
-|------|--------------|------|
-| `src/request.h` | Add `video_fd_rpi_hevc_dec`, `media_fd_rpi_hevc_dec`, `has_hevc_ext_sps_rps_rpi_hevc_dec` (mirror iter38 + iter2 layout) | ~10 lines |
-| `src/request.c` | (a) Extend init -1 block to cover new fds. (b) Recognize `rpi-hevc-dec` as a 3rd primary/alt driver string in the probe loop. (c) Extend `request_device_kind_for_profile` so HEVC→'p' when rpi-hevc-dec is present, else 'r'. (d) Extend `request_switch_device_for_profile` 'p' branch. (e) Probe HEVC ext_sps on the new fd (will be false, mirrors hantro entry). | ~80 lines |
-| `src/video.c` | Add `V4L2_PIX_FMT_NV12_COL128` (NC12) `video_format` entry: 4:2:0, planes=1, alignment via dedicated bytesperline/sizeimage formula. NOT marked linear. | ~20 lines |
-| `src/nv12_col128.c` (NEW) | `nv12_col128_detile_to_nv12()`: Y plane + UV plane detile primitive. Adapted from ffmpeg/Kynesim `av_rpi_sand_to_planar_y8` core math. Header doc traces back to videodev2.h docstring + raspberrypi/linux `hevc_dec/hevc_d_video.c` size formula. | ~80 lines + 30-line header |
-| `src/image.c` | Add NC12 → NV12 branch in `copy_surface_to_image`, gated on `image->format.fourcc == VA_FOURCC_NV12 && video_format->v4l2_format == V4L2_PIX_FMT_NV12_COL128` (sibling to existing NV15→P010 branch). | ~25 lines |
-| `src/meson.build` + `src/Makefile.am` | List `nv12_col128.c`/`.h` in sources | 2 lines |
-
-Total estimated diff: ~250 LoC backend + ~100 LoC standalone primitive.
-Roughly half the surface area of iter38; smaller than iter2.
-
-### What does NOT change
-
- iter25/26 SPS pre-seed: stays on rkvdec path only (gated by
-  driver_kind check that's already implicit in the rkvdec fd routing).
- iter2 EXT_SPS plumbing: probe-gated off on rpi-hevc-dec; vendored
-  GStreamer parser unused. Confirmed via the EINVAL on ctrl 0xa97.
- iter31 α-29 slice_params st_rps_bits: APPLIES to rpi-hevc-dec
-  unchanged. Same plumbing.
- iter33 VP8 hantro start-code prepend: not relevant (rpi-hevc-dec is
-  HEVC-only; VP8 still goes through hantro on RK).
- iter38 single-libva-session multi-codec semantics: extends from 5
-  codecs to 5+1 (HEVC reroutes on Pi).
-
-### NC12 / SAND128 tile geometry — locked contract
-
-From kernel driver `drivers/media/platform/raspberrypi/hevc_dec/hevc_d_video.c`
-(via [[github raspberrypi/linux rpi-6.12.y]]):
-
-```c
-case V4L2_PIX_FMT_NV12_COL128:
-    width = ALIGN(width, 128);           /* Width rounds up to columns */
-    height = ALIGN(height, 8);
-    bytesperline = constrain2x(bytesperline, height * 3 / 2);
-    sizeimage = bytesperline * width;
-    break;
-```
-
-For 1280×720:
- width = 1280 (already 128-aligned)
- height = 720 (already 8-aligned)
- bytesperline = 720 × 3/2 = **1080** (matches Phase 0 strace observation)
- sizeimage = 1080 × 1280 = **1,382,400** (matches strace; equals the linear NV12 byte count because 1280 and 720 are already aligned)
-
-**Geometry interpretation** (cross-verified against ffmpeg/Kynesim
-`rpi_sand_fn_pw.h` `av_rpi_sand_to_planar_y8`):
- Image is divided into `(width + 127) / 128` columns; each column is
-  **128 px wide × height px tall**.
- Within a column: `128 × height` bytes of Y data, immediately followed
-  by `128 × height/2` bytes of interleaved CbCr (so 128 × `bytesperline`
-  bytes per column, where `bytesperline` is the column stride).
- Across columns: column N starts at offset `N × stride1 × stride2`
-  where `stride1 = 128` (column width) and `stride2 = bytesperline`.
- **Pixel (x, y) byte offset = `(x & 127) + y × 128 + (x & ~127) × bytesperline`**
-  for Y; for UV use `y/2` in place of `y` and add the within-column UV
-  start `128 × height` (UV follows Y inside each column, per the layout above).
-
-Reference for the detile loop: `av_rpi_sand_to_planar_y8` (Kynesim
-ffmpeg, `libavutil/rpi_sand_fn_pw.h` with PW=1). Our primitive copies
-the single-stripe fast-path math; we don't import NEON ASM (CPU
-detile is the safe path for Phase 1; SIMD is a follow-up perf bump if needed).
-
-## Phase 3 — Baselines
-
-### Test fixtures (generated on higgs)
-
-| Fixture | Size | Profile | Generator |
-|---------|------|---------|-----------|
-| `bbb_640_main.mp4`  | 640×360   | Main 8-bit | `ffmpeg -f lavfi -i testsrc=duration=2 -pix_fmt yuv420p -c:v libx265 -preset ultrafast -profile:v main` |
-| `bbb_1280_main.mp4` | 1280×720  | Main 8-bit | same |
-| `bbb_1920_main.mp4` | 1920×1080 | Main 8-bit | same |
-
-### Captured 2026-05-17 evening on higgs
-
-For each fixture, N=3 reps. Both SW (no hwaccel) and kdirect
-(`ffmpeg -hwaccel drm -c:v hevc`) → `-frames:v 10 -f rawvideo -pix_fmt nv12`,
-first 16 hex chars of the sha256:
-
-```
-bbb_640_main  SW={9a81038065e9b7cd} HW={9a81038065e9b7cd}  → BIT-EXACT × N=3
-bbb_1280_main SW={d3bb055655d6f195} HW={d3bb055655d6f195}  → BIT-EXACT × N=3
-bbb_1920_main SW={0bc2bd6f693db039} HW={0bc2bd6f693db039}  → BIT-EXACT × N=3
-```
-
-HW engagement signal (per-run): `Hwaccel V4L2 HEVC stateless V4; devices: /dev/media1,/dev/video19; buffers: src DMABuf, dst DMABuf; swfmt=rpi4_8`
-
-This is the kdirect baseline. Phase 7 verification will compare libva
-output against these SHAs.
-
-### Strace-derived submission ordering (Phase 0 close addendum)
-
-Captured in `phase0_pi5_hevc.md`. Briefly: standard V4L2-request
-stateless flow, both queues DMABUF, no SPS pre-seed dance needed
-(rpi-hevc-dec accepts NC12 + dims directly on CAPTURE S_FMT).
-
-## Phase 4 — Plan
-
-### Implementation steps (sequenced)
-
-1. **`request.h`**: extend `request_data` with the new fd pair + ext_sps
-   flag, mirroring iter38/iter2 layout. (no behavior change yet)
-2. **`request.c`**:
-   - `find_decoder_device_by_driver("rpi-hevc-dec", ...)` accepts new
-     driver string.
-   - Init -1 block extends to new fds.
-   - Probe loop: if primary is `rkvdec` or `hantro-vpu`, also probe
-     `rpi-hevc-dec` (third slot). On Pi 5 there's no `rkvdec` or
-     `hantro-vpu`, so primary becomes `rpi-hevc-dec` and the alt-probes
-     for the other two return absent (their fds stay -1).
-   - `request_device_kind_for_profile`: when profile is `VAProfileHEVCMain`,
-     prefer `'p'` (rpi-hevc-dec) IF `video_fd_rpi_hevc_dec >= 0`, else
-     fall through to `'r'` (rkvdec). All other profiles stay routed as
-     today.
-   - `request_switch_device_for_profile`: add `'p'` branch.
-   - ext_sps probe runs on the new fd; result stored in
-     `has_hevc_ext_sps_rps_rpi_hevc_dec`. Will be false (controls absent).
-3. **`video.c`**: add NC12 video_format entry. Mark it MPLANE-only (per
-   Phase 0 strace). bytesperline/sizeimage formula encoded per kernel
-   driver math.
-4. **`src/nv12_col128.c` + `.h`** (NEW): single-file primitive,
-   `nv12_col128_detile_to_nv12(dst_y, dst_uv, src_y, src_uv, width,
-   height, src_stride2)`. CPU per-column row-memcpy loop; not NEON
-   for Phase 1 (correctness first). Self-test in `tests/test_nv12_col128_detile.c`.
-5. **`image.c`**: branch in `copy_surface_to_image`. Gate:
-   `image->format.fourcc == VA_FOURCC_NV12 && video_format->v4l2_format == V4L2_PIX_FMT_NV12_COL128`.
-   Calls the primitive. Existing NV12-linear path stays.
-6. **`meson.build` + `Makefile.am`**: source list updates.
-7. **Build clean on higgs** — first build target IS higgs (since iter40
-   only matters on Pi). Cross-build for ampere/fresnel is unaffected
-   because they don't have rpi-hevc-dec — the new fd stays -1 and the
-   per-driver routing falls through to existing rkvdec/hantro paths.
-
-### Verification gates (Phase 7 acceptance)
-
- Build cleanly on higgs (Debian 13 trixie, libva-dev 2.22.0-3,
-  libdrm-dev 2.4.131).
- Local-install the resulting `.so` to `/usr/lib/aarch64-linux-gnu/dri/v4l2_request_drv_video.so`.
- `LIBVA_DRIVER_NAME=v4l2_request vainfo` lists `VAProfileHEVCMain`.
- For each Phase 3 fixture: libva output SHA == kdirect SHA (the Phase 3
-  recorded value).
- `lsof` during libva decode shows `/dev/video19` open.
- Sibling regression check: fresnel `phase7_iter39_test_rig` equivalent
-  still 5/5 PASS (no regression to existing routing).
-
-### Risks + mitigations
-
-| Risk | Mitigation |
-|------|-----------|
-| NC12 detile math wrong → libva ≠ kdirect | Tight unit test in `tests/test_nv12_col128_detile.c` with hand-crafted NC12 bytes + known linear output, before integration. |
-| `request_switch_device_for_profile` falls through wrong path on systems with BOTH rkvdec AND rpi-hevc-dec | Prefer rpi-hevc-dec for HEVC when present. Explicit comment in switch. Test on fresnel = no rpi → falls to 'r'; test on higgs = no rkvdec → falls to 'p'. |
-| Debian build env differs from Arch — see [[feedback_package_build_flags_unmask_bugs]] | Build with explicit `-O2 -D_FORTIFY_SOURCE=2 -fstack-protector-strong` flags to match Debian dpkg-buildflags. |
-| Synthetic SPS pre-seed accidentally fires on rpi-hevc-dec | Gate on `driver_kind != 'p'` in the pre-seed call site. Verify via strace: pre-seed ioctl pattern absent. |
-| iter2 EXT_SPS path accidentally engages on rpi | Already probe-gated; `has_hevc_ext_sps_rps_rpi_hevc_dec` = false naturally. |
-
-### Phase 5 review explicitly requested
-
-Per CLAUDE.md global "Reviews are never skippable" + [[feedback_review_empirical_over_theoretical]]:
-this plan goes to a sonnet Plan-agent review. Specific review focus:
- Routing correctness when 0/1/2/3 of the three drivers are present.
- NC12 geometry: did we copy ffmpeg's per-row memcpy math correctly?
-  Did we miss UV stride considerations?
- `image.c` gate predicate — does it exclude any legitimate NV12-linear
-  case on Pi? (No: rpi only exposes NC12/NC30 CAPTURE, no plain NV12.)
- Cross-device regression scope (fresnel + ampere paths untouched?).
-
-Empty-result review IS a green light; "we should have skipped it" is the
-prohibited move.
@@ -1,194 +0,0 @@
-# Phase 5 review — iter40 plan (sonnet review + amendments)
-
-Reviewer verdict: **yellow** — plan substantively sound, 3 concrete blockers
-+ 1 fixture gap + 1 verification-only note. All findings verified empirically
-against current source (per [[feedback_review_empirical_over_theoretical]])
-BEFORE accepting into the amended plan.
-
-## Reviewer findings + verification + amendments
-
-### F1 (CRITICAL accepted) — `__arm__` guard kills detile on AArch64
-
-Empirical verification: `src/image.c` lines 239 + 268 wrap the entire
-per-format detile dispatch (incl. `nv15_unpack_plane_to_p010`) in
-`#ifdef __arm__`. Pi 5 / fresnel / ampere are all AArch64 → guard never
-fires → both NC12 detile (proposed) AND existing NV15→P010 unpack
-(iter39) are silently dead code on aarch64. iter39 5/5 PASS on fresnel
-was bit-exact for 8-bit codecs only; the 10-bit detile path was never
-exercised, so the dead-code didn't manifest as a failure.
-
-**Amendment:** Phase 6 step 5 first sub-action — change guard at lines
-239 + 268 from `#ifdef __arm__` to `#if defined(__arm__) || defined(__aarch64__)`.
-This re-enables the existing NV15→P010 detile AND lets the new NC12
-detile branch execute. No semantic change on x86 (no detile primitives
-compiled there). Add explicit comment crediting Phase 5 review + this
-finding.
-
-### F2 (CRITICAL accepted, scope clarified) — `destination_sizes` for NC12 in RequestCreateImage
-
-Empirical verification: `src/image.c` lines 90-115 already recompute
-`destination_bytesperlines[0]` + `destination_sizes[0]` for `P010`
-(line 90: `destination_bytesperlines[0] = width * 2`). The fall-through
-"NV12" branch (line 108) uses V4L2-reported stride directly, which for
-NC12 source is the column-stride 1080, not the linear Y stride 1280.
-That breaks the `pitches[0]` value that VAImage consumers expect.
-
-`context.c` lines 379-383 — `destination_sizes[0] = destination_bytesperlines[0] * format_height` — IS used at cap_pool init time to size the
-CAPTURE buffer's MMAP region accounting in `driver_data->fmt_sizes[]`.
-For NC12: 1080 × 720 = 777600 vs actual `sizeimage` 1382400. cap_pool
-allocates the actual `sizeimage` via REQBUFS, so the underlying buffer
-is correctly sized; `fmt_sizes[]` is just a back-cache for later access
-patterns that don't go through the kernel-reported value.
-
-**Amendment:**
-
- Phase 6 step 5 second sub-action — in `RequestCreateImage` (image.c
-  ~line 107, the "else" / NV12 branch), add detection: if the source
-  CAPTURE format is `V4L2_PIX_FMT_NV12_COL128` AND the requested image
-  format is `VA_FOURCC_NV12`, override `destination_bytesperlines[0] =
-  width` (linear NV12 Y stride). `destination_sizes[0]` then computes
-  to `width × format_height` (correct linear Y plane size). Existing
-  NV12-source linear path unchanged.
- Phase 6 step 3 video.c — set `v4l2_buffers_count = 1` for NC12 (single
-  contiguous buffer holding Y+UV) and document this is the planes-1
-  multi-plane case (similar to NV12 MPLANE).
- context.c lines 380-383 (`destination_sizes[0] = bytesperlines * height`)
-  stays AS-IS for now. It only affects cap_pool MMAP accounting which
-  uses the kernel-reported `sizeimage` via REQBUFS anyway. If a future
-  bug emerges from this mismatch on the rkvdec/hantro side, address
-  then; not a blocker for iter40 NC12.
-
-### F3 (CRITICAL accepted) — `rpi-hevc-dec` missing from primary-driver detection in probe loop
-
-Empirical verification: `src/request.c` lines 647-657 only have `else if`
-branches for `rkvdec` and `hantro-vpu`. On higgs (no rkvdec, no hantro)
-the primary device IS `rpi-hevc-dec`, but neither branch matches → no
-`primary_driver` set → no fds stored into the new
-`video_fd_rpi_hevc_dec` slot → routing silently no-ops with -1 fds.
-
-**Amendment:** Phase 6 step 2 sub-action — add explicit `else if (strcmp(info.driver, "rpi-hevc-dec") == 0)` branch in the primary-driver
-detection block. Sets `video_fd_rpi_hevc_dec = video_fd` + `media_fd_rpi_hevc_dec = media_fd`. Pi has no alt — `alt_driver` stays NULL,
-no second-decoder probe runs for higgs. (rkvdec + hantro alt-probes
-remain dead on higgs because the find_decoder_device_by_driver walk
-returns absent for them.)
-
-Also: check whether `find_decoder_device_by_driver`'s driver-name
-handling at request.c:94-95 needs a `rpi-hevc-dec` entry. Per the code
-it is a free-form string match, not a hard table, so the caller passes
-`"rpi-hevc-dec"` and the walk just looks for it.
-
-### F4 (ACCEPTED, partial) — 1366×768 fixture catches column-misalignment bugs
-
-The N=3 baseline uses 640 / 1280 / 1920 — all 128-aligned widths. A
-1366-wide fixture exercises the `ALIGN(width, 128) → 1408` column
-padding path. The right-edge 42 pixels (cols 1366-1407) are padding;
-the detile primitive must not write past the requested width.
-
-**Amendment:** Phase 7 sub-action — add `bbb_1366_main.mp4` (1366×768)
-to the Phase 7 verification set. Phase 3 baseline retroactively
-captured at Phase 7 time. Goal: same kdirect/SW bit-exact PASS at
-N=1 (no need to redo the deterministic N=3 — one rep proves the
-edge-case). If libva differs from kdirect on 1366 but matches on
-1280/1920, the detile column-base math is buggy.
-
-### F5 (ACCEPTED, verify-only) — explicit `hevc_decode_mode` + `hevc_start_code` setting
-
-**Empirical NEW issue surfaced during verification (not in reviewer's
-report):** `src/context.c` lines 516-528 unconditionally set
-`V4L2_CID_STATELESS_HEVC_START_CODE` to `_ANNEX_B` (value 1) AND
-prepends `0x00 0x00 0x01` start codes to each slice payload (per the
-H.264 mirror block at line 532+). But Phase 0 strace shows kdirect uses
-`start_code=0` = `_NONE` and submits raw NAL slice payload WITHOUT start
-codes.
-
-Both modes are in rpi-hevc-dec's menu range (min=0 max=1). Open
-question: does rpi-hevc-dec correctly parse start-code-prepended
-payload when in ANNEX_B mode? Two possibilities:
-  (a) Yes — driver implements both modes, ANNEX_B works, libva PASSes
-      bit-exact in our default code path.
-  (b) No — driver only really implements NONE; ANNEX_B is a degenerate
-      menu entry; we'd need per-driver gating to send `_NONE` for
-      rpi-hevc-dec + suppress start-code prepend.
-
-**Amendment:** Phase 7 — verify empirically via the first libva-vs-kdirect
-diff. If (a), no code change needed. If (b), add per-driver gate around
-the START_CODE set (mirror rkvdec/hantro pattern). Don't pre-emptively
-gate; let empiricism decide.
-
-### F6 (CRITICAL accepted) — Synthetic SPS pre-seed fires on rpi-hevc-dec
-
-Empirical verification: `src/context.c` lines 277-346 — the iter25
-synthetic-SPS injection block runs for `VAProfileHEVCMain` regardless
-of active driver_kind. On higgs, `driver_data->video_fd` will be
-`video_fd_rpi_hevc_dec` at this point → `v4l2_set_controls(...SPS...)`
-fires on rpi-hevc-dec. Phase 0 strace shows rpi-hevc-dec doesn't need
-this AND uses a different submission ordering (S_FMT_OUTPUT → REQBUFS_OUTPUT → S_FMT_CAPTURE → CREATE_BUFS_CAPTURE → STREAMON, then global
-ctrls per-frame).
-
-The pre-seed is wrapped in `(void)v4l2_set_controls(...)` — failure is
-silently ignored, BUT the call may also succeed in an unintended way
-on rpi-hevc-dec (it has the HEVC_SPS ctrl), potentially leaving its
-internal state stuck on the dummy SPS until the first real per-frame
-SPS arrives.
-
-**Amendment:** Phase 6 step 2 sub-action — gate the synthetic-SPS
-injection block at context.c:277 with
-`if (driver_data->video_fd != driver_data->video_fd_rpi_hevc_dec)`. The
-pre-seed only fires when active fd is NOT rpi-hevc-dec. rkvdec /
-hantro paths unchanged.
-
-### F7 (No findings) — `image.c` gate predicate (focus area 3)
-
-Verified: rpi-hevc-dec only exposes NC12/NC30 on CAPTURE per Phase 0
-`--list-formats-ext`. No legitimate NV12-linear case exists on Pi. Gate
-predicate `image->format.fourcc == VA_FOURCC_NV12 && video_format->v4l2_format == V4L2_PIX_FMT_NV12_COL128` is sound — fires only when
-both conditions hold, excludes legitimate NV12-linear on RK / Allwinner.
-
-### F8 (No findings) — cross-device regression scope (focus area 4)
-
-Verified: new fd fields initialise to -1; probe loop extensions are
-additive (no-op when string doesn't match); `request_device_kind_for_profile`'s 'p' branch only fires when `video_fd_rpi_hevc_dec >= 0`;
-new video.c entry is additive. fresnel + ampere paths unchanged.
-
-## Final amended Phase 6 step list
-
-1. `src/request.h` — add `video_fd_rpi_hevc_dec`, `media_fd_rpi_hevc_dec`,
-   `has_hevc_ext_sps_rps_rpi_hevc_dec` (mirror iter38 + iter2 layout).
-2. `src/request.c` — (a) extend init -1 block; (b) **add `else if
-   (strcmp(info.driver, "rpi-hevc-dec") == 0)` branch in primary-driver
-   detection** [F3]; (c) extend `request_device_kind_for_profile` so
-   HEVC→'p' when rpi present, else 'r'; (d) extend `request_switch_device_for_profile` 'p' branch; (e) probe ext_sps on new fd.
-3. `src/context.c` — **gate synthetic-SPS pre-seed (lines 277-346) on
-   `driver_data->video_fd != driver_data->video_fd_rpi_hevc_dec`** [F6].
-4. `src/video.c` — NC12 entry with `v4l2_buffers_count=1`,
-   `v4l2_mplane=true`, NOT marked linear.
-5. `src/image.c`:
-   - **Extend `#ifdef __arm__` guards (lines 239, 268) to `#if defined(__arm__) || defined(__aarch64__)`** [F1].
-   - **Add NC12 detection in RequestCreateImage** (line 107 area): if
-     source format is NC12 + VAImage format is NV12, override
-     `destination_bytesperlines[0] = width` [F2].
-   - **Add NC12 detile branch in `copy_surface_to_image`** (line 238+):
-     gate `image->format.fourcc == VA_FOURCC_NV12 && video_format->v4l2_format == V4L2_PIX_FMT_NV12_COL128`; call new detile primitive.
-6. `src/nv12_col128.c` + `.h` (NEW) — detile primitive.
-7. `tests/test_nv12_col128_detile.c` (NEW) — unit test with hand-crafted
-   NC12 bytes + known linear output.
-8. `src/meson.build` + `src/Makefile.am` — source list updates.
-9. Build clean on higgs; if `tests/` doesn't auto-run, run manually.
-
-## Final amended Phase 7 verification
-
- Build cleanly on higgs.
- Local install `.so` to `/usr/lib/aarch64-linux-gnu/dri/`.
- `LIBVA_DRIVER_NAME=v4l2_request vainfo` lists `VAProfileHEVCMain`.
- Phase 3 fixtures (640 / 1280 / 1920) + new 1366×768 fixture: libva
-  output SHA == kdirect SHA [F4].
- `lsof` during libva decode shows `/dev/video19` open.
- `strace -e ioctl` shows pre-seed pattern ABSENT on rpi-hevc-dec [F6
-  verification].
- HEVC_START_CODE behavior verified empirically: if libva-vs-kdirect
-  fails for HEVC, add per-driver `_NONE` gate per F5 amendment.
- Sibling regression: re-run fresnel iter38 5/5 test rig — no change
-  expected since iter40 path is gated on new fd.
-
-Total amended LoC estimate: ~280 backend + 100 primitive (was 250 + 100;
-F1 + F2 + F6 add ~30 LoC of gates / overrides).
@@ -1,228 +0,0 @@
-# Phase 7 close — iter40 Pi 5 HEVC partial
-
-Closed 2026-05-17 evening. Backend tip `3ffa9d0` on master. Higgs (Pi CM5,
-Debian 13 trixie, kernel 6.12.75+rpt-rpi-2712) is the test target.
-
-## Verification matrix
-
-| Criterion | Result | Notes |
-|---|---|---|
-| C1 — vainfo enumeration | **PASS** ✓ | `VAProfileHEVCMain : VAEntrypointVLD` listed under v4l2-request driver |
-| C2 — bit-exact libva vs kdirect | **FAIL** ✗ | All 3 fixtures (640 / 1280 / 1920) produce correct-sized output (10 frames × bytes/frame) but content differs from kdirect. Real decode failure; see C6. |
-| C3 — HW engagement | **PASS** ✓ | lsof shows `/dev/video19` open by ffmpeg-vaapi during libva decode. `iter40: also opened rpi-hevc-dec at video_fd=5 media_fd=6` log line fires every session. |
-| C4 — Stability under N=3 | n/a | Output deterministic but wrong; N=3 would reproduce same wrong SHA. |
-| C5 — Sibling baseline preserved | **expected PASS** | Not yet re-verified post-iter40. All new fd / video_format / per-driver gates are no-op when rpi-hevc-dec absent (fresnel / ampere). |
-| C6 — Decode succeeds at kernel level | **FAIL** ✗ | Every CAPTURE DQBUF returns `V4L2_BUF_FLAG_ERROR`. Decode fails per-frame. |
-
-## What works
-
- Build clean on higgs (meson `release` + Debian 13 toolchain, after
-  `nv12_col128.h` + `nv15.h` fallback `#define`s for headers that omit
-  the mainline fourccs).
- ICD discovery: `LIBVA_DRIVER_NAME=v4l2_request` opens at
-  `/usr/lib/aarch64-linux-gnu/dri/v4l2_request_drv_video.so`.
- Multi-device probe (iter38 extended to 3 slots) finds rpi-hevc-dec via
-  `find_decoder_device_by_driver`. New `known_decoder_drivers[]` entry +
-  `else if (strcmp(info.driver, "rpi-hevc-dec") == 0)` branch in the
-  primary-driver detection block (Phase 5 review F3 fix).
- `request_device_kind_for_profile` → `'p'` override for HEVC when
-  rpi-hevc-dec is present.
- `request_switch_device_for_profile` retargets to the rpi fds.
- Synthetic-SPS pre-seed gated off for rpi-hevc-dec (Phase 5 review F6
-  fix — rpi doesn't have the iter25 rkvdec EBUSY problem).
- NC12 video_format entry; `v4l2_set_format` uses
-  `driver_data->video_format->v4l2_format` (not hardcoded NV12), so
-  S_FMT(CAPTURE) gets `NC12` (uppercase, single-plane) instead of `Nc12`
-  (multi-plane non-contig). Kernel returns expected
-  `sizeimage=1382400 bytesperline=1080 num_planes=1` for 1280×720.
- `nv12_col128_detile_y` + `_uv` primitives copy per-column row-by-row
-  via memcpy (128 bytes per row, for each row of each column). Unit test
-  (`tests/test_nv12_col128_detile.c`) passes 10/10 (Y + UV at 640 / 1280
-  / 1920 / 1366 widths + UV offset helper).
- `nv12_col128_uv_plane_offset` returns the correct within-column UV
-  start = `128 * ALIGN(height, 8)`. Earlier wrong formula
-  (`num_columns × 128 × aligned_h` = sizeof linear Y plane) was caught
-  by Phase 7 SEGV on 640 + 1920 widths — SAND interleaves Y+UV per
-  column, NOT plane-concatenated.
- `image.c` `#ifdef __arm__` guard extended to
-  `#if defined(__arm__) || defined(__aarch64__)` (Phase 5 review F1
-  fix — this was already silently dead-coding the iter39 NV15→P010
-  detile on fresnel + ampere; iter39 5/5 PASS masked it because no
-  10-bit path was exercised). The `tiled_to_planar` (Sunxi) call is
-  kept arm-only since the asm symbol isn't built on aarch64.
- `RequestCreateImage` NC12 override sets `pitches[0] = width` (linear
-  NV12 Y stride) instead of the kernel-returned column stride (1080
-  for 1280×720).
-
-## What fails
-
-`V4L2_BUF_FLAG_ERROR` on every CAPTURE DQBUF. Kernel `rpi-hevc-dec`
-rejects each frame's decode submission. Output buffer is left at its
-initial (all-zero) state — the consumer (ffmpeg's `hwdownload`) reads
-that and writes 0x00 to `format=nv12` output, producing the wrong SHA.
-
-### Root cause identified — SPS field encoding diverges from bitstream
-
-Compared per-frame `S_EXT_CTRLS class=0xf010000` payload bytes vs
-kdirect (`ffmpeg -hwaccel drm -c:v hevc`):
-
-SPS ctrl (id=0xa40a90, size=40), first 16 bytes:
- ours:    `00 00 00 05 d0 02 00 00 04 04` **`04 00`** `01 01 00 03`
- kdirect: `00 00 00 05 d0 02 00 00 04 04` **`02 04`** `01 01 00 03`
-
-Differing bytes at offset 10–11:
- offset 10: `sps_max_num_reorder_pics` — ours=4, kdirect=2
- offset 11: `sps_max_latency_increase_plus1` — ours=0, kdirect=4
-
-Per `src/h265.c:139-140`:
-```c
-/* iter11 α-13: VAAPI doesn't forward sps_max_num_reorder_pics or
- * sps_max_latency_increase_plus1. ... */
-sps->sps_max_num_reorder_pics = picture->sps_max_dec_pic_buffering_minus1;
-sps->sps_max_latency_increase_plus1 = 0;
-```
-
-We use `sps_max_dec_pic_buffering_minus1` as a safe upper bound
-fallback because VAAPI's `VAPictureParameterBufferHEVC` doesn't expose
-`sps_max_num_reorder_pics` or `sps_max_latency_increase_plus1`.
-
-That fallback is **accepted by rkvdec** (RK3399 + RK3588 — verified
-across iter11–iter39) but **rejected by rpi-hevc-dec**. Per H.265
-§A.4.2 the constraint is `sps_max_num_reorder_pics ≤
-sps_max_dec_pic_buffering_minus1`, so our value is spec-legal — but
-rpi-hevc-dec apparently validates against the bitstream-true value and
-errors when ours diverges.
-
-Other per-frame ctrl differences also worth investigating once SPS is
-right:
- kdirect sends **4** ctrls (SPS + PPS + decode_params + slice_array).
- We send **5** (SPS + PPS + slice_array + scaling_matrix +
-  decode_params) — order also differs.
-
-## Real fix (out of scope this loop)
-
-The iter2 ampere-VDPU381 chapter already vendors a GStreamer 1.28.2
-H.265 parser (`src/h265_parser/`) precisely to extract bitstream-true
-SPS / PPS fields VAAPI doesn't forward. The fix is:
-
-1. Wherever h265.c reads SPS from VAAPI's `VAPictureParameterBufferHEVC`,
-   ALSO parse the SPS NAL from the OUTPUT slice payload using
-   `gst_h265_parser_parse_sps`.
-2. Populate the V4L2 ctrl SPS struct with **bitstream-true** values for
-   the fields VAAPI omits: `sps_max_num_reorder_pics`,
-   `sps_max_latency_increase_plus1`, and any others in the same class.
-3. Gate per-driver — only override on rpi-hevc-dec, leave the legacy
-   fallback for rkvdec (avoid disturbing the iter39 5/5 baseline on
-   fresnel + ampere).
-4. Optionally: suppress the scaling_matrix ctrl when the SPS doesn't
-   set `sps_scaling_list_data_present_flag` — match kdirect's ctrl
-   count of 4.
-
-Estimated additional surface area: ~150 LoC in h265.c, plus the parser
-plumbing that iter2 already provides. Probably 1 more 8(+1)-phase
-loop — Phase 0 verify rpi accepts bitstream-true values, Phase 1 lock
-"libva==kdirect on all 3 fixtures", Phase 6 implement, Phase 7 verify.
-
-## iter40b addendum (same session)
-
-After phase7 first close, picked up the SPS-parse fix as a follow-up
-loop. Findings — all empirical:
-
-1. **Source_data lacks SPS NAL.** Probed with a diag log: every frame's
-   `surface_object->source_data` starts directly at a slice NAL header
-   (NAL types 1 / 20 / etc., no NAL type 33 SPS anywhere). ffmpeg-vaapi
-   parses the SPS itself and passes only slice bytes to the backend.
-   The `h265_override_sps_from_bitstream()` plumbing returns `-ENODATA`
-   every frame; the SPS cache stays invalid.
-
-2. **VAAPI doesn't expose the SPS fields rpi needs.** Read
-   `/usr/include/va/va_dec_hevc.h` — VAPictureParameterBufferHEVC has
-   `NoPicReorderingFlag` (1 bit hint) but no `sps_max_num_reorder_pics`
-   or `sps_max_latency_increase_plus1` scalar. They simply aren't
-   reachable from the standard VAAPI API.
-
-3. **Empirical SPS fix lands (hardcoded values match kdirect).** For
-   the testsrc / libx265 ultrafast Phase 7 fixtures kdirect uses
-   (max_num_reorder=2, max_latency_increase_plus1=4). Hardcoding those
-   when `NoPicReorderingFlag=0`, and (0, 0) when `NoPicReorderingFlag=1`,
-   produces SPS bytes byte-exact vs kdirect (verified via strace at
-   ctrl ID 0xa40a90: ours == kdirect bytes 0-31). Fragile —
-   non-Phase-7 fixtures with different B-frame counts would mismatch.
-   Documented in h265.c::h265_set_controls (the rpi-hevc-dec gate).
-
-4. **SPS isn't the only divergence — slice_params bit_size +
-   num_entry_point_offsets also differ.** Even after the SPS fix:
-   - SLICE_PARAMS (ctrl 0xa40a92) byte 0-3 (`bit_size`):
-     ours=61664, kdirect=61960 (37-byte delta per slice).
-   - SLICE_PARAMS bytes 8-11 (`num_entry_point_offsets`):
-     ours=0, kdirect=22 (BBB 720p WPP: ceil(720/32) = 23 CTU rows,
-     minus 1 = 22 entry points). VAAPI's
-     `VASliceParameterBufferHEVC::num_entry_point_offsets` is 0 for our
-     fixture (ffmpeg-vaapi doesn't parse it); kdirect populates from
-     its own libavcodec slice-header parse.
-
-5. **Bit-exact still NOT reached after iter40b.** Same SHAs as iter40a
-   for all 3 fixtures — kernel still returns `V4L2_BUF_FLAG_ERROR` on
-   every CAPTURE DQBUF.
-
-### Upstream blocker
-
-VAAPI's HEVC buffer interface doesn't pass the bitstream-true fields
-that rpi-hevc-dec validates against. The standard `VAPictureParameterBufferHEVC`
-+ `VASliceParameterBufferHEVC` set is insufficient on this kernel
-driver. Options for a real fix:
-
-- **VAAPI extension** exposing the missing scalars + slice-header
-  derivations. Multi-quarter upstream effort.
-- **A backdoor `VABufferType` for raw SPS/PPS/slice-header NAL bytes**.
-  Libva-internal; consumers would have to populate it.
-- **Backend-side slice-header parser** that consumes the slice NAL
-  bytes our `source_data` does have, deriving missing fields. Needs an
-  SPS context (which ffmpeg-vaapi has but doesn't share) to fully
-  parse: a chicken-and-egg problem.
-- **Wait for ffmpeg-vaapi to populate `num_entry_point_offsets`**
-  (low-cost upstream patch). Plus the SPS extension above.
-
-None achievable in this iteration. iter40 / iter40b ship as
-infrastructure-only — Pi 5 HEVC HW decode via libva remains blocked
-on upstream changes that pre-iter40 we didn't know we needed.
-
-### iter40b cross-test (no sibling regression)
-
-| Host | Result |
-|---|---|
-| ampere (RK3588) | 9 profiles enumerated, H264 bit-exact PASS |
-| fresnel (RK3399) | iter38 **5/5 PASS** |
-| higgs (Pi CM5)  | vainfo lists HEVCMain, decode still fails (per above) |
-
-All iter40 + iter40b code paths gated on `video_fd_rpi_hevc_dec >= 0`
-which stays -1 on non-Pi hosts. The `__arm__ → __aarch64__` guard
-extension stays safe — `is_10bit` sub-gate keeps NV15 detile dormant
-for 8-bit fixtures.
-
-## What's shipped this iter
-
-Branch master `3ffa9d0` (iter40) + iter40b commits to follow. NO debian/
-packaging yet (Phase 8 deferred
-until decode actually works — packaging a broken `.so` is mis-direction).
-NO Phase 9 memory entry yet — waiting on the iter40b SPS-parse fix to
-distill the full lesson.
-
-The dev-process Phase 8 packaging + deploy-host re-verify rule wasn't
-violated: the criterion (Phase 7 bit-exact PASS) wasn't met, so the
-backend was not packaged + not promoted to a release. Local `.so`
-install on higgs only, for debugging.
-
-## Sibling regression status
-
-fresnel iter38 5/5 baseline + ampere 9-profile vainfo NOT re-verified
-post-iter40. Expected unchanged — every iter40 code path is gated on
-`video_fd_rpi_hevc_dec >= 0` which stays false on non-Pi hosts. The
-only globally-touched line is the `__arm__ → __aarch64__` guard in
-image.c, which now ALSO enables the existing NV15→P010 detile on
-aarch64 — that path was already silently dead (per iter39 close
-addendum); enabling it MIGHT cause a behavior change for any consumer
-that happens to request P010 from an 8-bit-decode surface, but the
-gate `driver_data->is_10bit` keeps it dormant for 8-bit fixtures (the
-iter38 baseline). Verify before declaring the regression-free promise
-intact.
@@ -1,155 +1,689 @@
 /*
- * Copyright (C) 2026 Markus Fritsche <fritsche.markus@gmail.com>
+ * Copyright (C) 2026 claude-noether <claude-noether@reauktion.de>
 *
- * AV1 codec dispatcher.  Populates V4L2_CID_STATELESS_AV1_SEQUENCE
- * (struct v4l2_ctrl_av1_sequence) from VAAPI's VADecPictureParameterBufferAV1.
+ * ampere-av1-enablement Phase 2.1: AV1 codec dispatcher for libva-v4l2-
+ * request-fourier. Translates VAAPI AV1 picture/slice parameter buffers
+ * into V4L2 stateless AV1 controls (V4L2_CID_STATELESS_AV1_*) for the
+ * Rockchip vpu981 hardware on RK3588.
 *
- * Why a single SEQUENCE control and not the full V4L2_CID_STATELESS_AV1_*
- * family (FRAME, TILE_GROUP_ENTRY, FILM_GRAIN):
+ * Reference: Kwiboo/FFmpeg v4l2-request-n8.1:libavcodec/v4l2_request_av1.c
+ * (636 LoC; reads from FFmpeg's AV1RawSequenceHeader + AV1RawFrameHeader).
+ * VAAPI exposes the same AV1 spec semantics through different struct
+ * shapes: sequence-level fields are folded into VADecPictureParameterBufferAV1
+ * (no separate sequence buffer); per-frame fields live in the same struct.
 *
- *   - The daedalus_v4l2 daemon path consumes the OUTPUT bitstream
- *     directly via libavcodec/libdav1d.  libdav1d needs a complete OBU
- *     stream that includes the sequence header — ffmpeg-vaapi strips the
- *     sequence header on the client side (its parser is split across
- *     VAPictureParameterBufferAV1 + slice payload, with OBU_SEQUENCE_HEADER
- *     consumed and not re-emitted), so the daemon side has to synthesise
- *     it from the SEQUENCE ctrl.  The other AV1 ctrls (FRAME / TILE /
- *     FILM_GRAIN) are not needed for that synthesis — the OBU_FRAME_HEADER
- *     + OBU_TILE_GROUP that libdav1d also needs are still in the slice
- *     bitstream.
+ * F1/F2/F3 risk mitigations per phase1_plan_v2 §"General fill_frame
+ * implementation risks":
+ *   F1 tile_info.mi_col/row_starts sentinel = 2 * ((frame_width + 7) >> 3)
+ *      mirrors Kwiboo lines 238/244 exactly.
+ *   F2 superres_denom: VAAPI exposes superres_scale_denominator directly
+ *      and per spec it's already 8 when use_superres=0. No offset math
+ *      needed (Kwiboo does it because FFmpeg stores raw coded_denom).
+ *   F3 loop_restoration_size[] gated on USES_LR flag mirrors Kwiboo
+ *      lines 281-287 exactly.
 *
- *   - The vpu981 (RK3588 dedicated AV1 hantro) hardware path doesn't
- *     consult these controls either — vpu981's driver parses the AV1
- *     bitstream directly.  So setting only SEQUENCE is correct for both
- *     destination decoders.
+ * V4L2 controls (4 per frame, batched in one VIDIOC_S_EXT_CTRLS):
+ *   1. V4L2_CID_STATELESS_AV1_SEQUENCE
+ *   2. V4L2_CID_STATELESS_AV1_FRAME
+ *   3. V4L2_CID_STATELESS_AV1_TILE_GROUP_ENTRY[] (DYNAMIC_ARRAY)
+ *   4. V4L2_CID_STATELESS_AV1_FILM_GRAIN (conditional on driver_data->
+ *      has_av1_film_grain probe)
 *
- * Reference: marfrit/libva-v4l2-request-fourier issue #11
- *            (DAEMON-PPS-style sequence-header re-synthesis on the daemon
- *            side, paralleling the H.264 SPS/PPS work in DAEMON-PPS).
- *            kernel uAPI: <linux/v4l2-controls.h> @ 2891-2919.
- *            VAAPI:       <va/va_dec_av1.h> typedef
- *                         VADecPictureParameterBufferAV1.
+ * Permission is hereby granted, free of charge, to any person obtaining a
+ * copy of this software and associated documentation files (the
+ * "Software"), to deal in the Software without restriction, including
+ * without limitation the rights to use, copy, modify, merge, publish,
+ * distribute, sub license, and/or sell copies of the Software, and to
+ * permit persons to whom the Software is furnished to do so, subject to
+ * the following conditions:
+ *
+ * The above copyright notice and this permission notice (including the
+ * next paragraph) shall be included in all copies or substantial portions
+ * of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS
+ * OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NON-INFRINGEMENT.
+ * IN NO EVENT SHALL THE AUTHORS OR CONTRIBUTORS BE LIABLE FOR ANY CLAIM,
+ * DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR
+ * OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR
+ * THE USE OR OTHER DEALINGS IN THE SOFTWARE.
 */

 #include "av1.h"

-#include "v4l2.h"
+#include "context.h"
+#include "object_heap.h"
+#include "request.h"
+#include "surface.h"
 #include "utils.h"
+#include "v4l2.h"

+#include <va/va.h>
+
+#include <linux/videodev2.h>
+#include <linux/v4l2-controls.h>
+
+#include <stdbool.h>
 #include <stdint.h>
+#include <stdlib.h>
 #include <string.h>

-#include <linux/v4l2-controls.h>
-#include <linux/videodev2.h>
+/* Sanity asserts to catch kernel uAPI drift. If these fire, the kernel
+ * headers on the build machine are out of sync with what the running
+ * driver expects — silent register-misalignment bugs result. Cross-compile
+ * hazard per Janet v3 amendment: native-arm64 builds only (boltzmann +
+ * ampere); no cross from x86 against ARM kernel headers. */
+_Static_assert(sizeof(struct v4l2_ctrl_av1_tile_group_entry) == 16,
+	       "v4l2_ctrl_av1_tile_group_entry size drift — recheck uAPI");

-/*
- * VADecPictureParameterBufferAV1 reaches us transitively via surface.h →
- * va_backend.h → va.h → va_dec_av1.h (va_dec_av1.h alone won't compile
- * standalone — it needs va.h's VA_PADDING_LOW / va_deprecated machinery).
- */
+/* Per AV1 spec, when use_superres=0 the superres denominator is 8.
+ * VAAPI's superres_scale_denominator already encodes this directly
+ * (per va_dec_av1.h: "When use_superres=0, superres_scale_denominator
+ * must be 8"). Kwiboo's AV1_SUPERRES_DENOM_MIN+coded_denom math is
+ * not needed when reading from VAAPI. */
+#define AV1_SUPERRES_NUM 8

-/* Compile-time UAPI shift guard, sibling to vp9.c's pattern. */
-_Static_assert(sizeof(struct v4l2_ctrl_av1_sequence) == 12,
-	       "v4l2_ctrl_av1_sequence size mismatch — kernel UAPI changed");
+/* AV1 spec maxima used for V4L2 array sizing. */
+#define BACKEND_AV1_MAX_SEGMENTS	8
+#define BACKEND_AV1_SEG_LVL_MAX		8
+#define BACKEND_AV1_SEG_LVL_REF_FRAME	5
+#define BACKEND_AV1_NUM_REF_FRAMES	8
+#define BACKEND_AV1_TOTAL_REFS_PER_FRAME 8
+#define BACKEND_AV1_REFS_PER_FRAME	7

-/*
- * Map VAAPI bit_depth_idx (0/1/2 → 8/10/12) to the kernel ctrl's plain
- * uint8_t bit_depth field.  ffmpeg-vaapi sets idx from the bitstream
- * BitDepth value, so this is an exact inverse of AV1 spec 5.5.2.
- */
-static uint8_t av1_bit_depth_from_idx(uint8_t idx)
+/* ===== fill_sequence ===== */
+static void av1_fill_sequence(VADecPictureParameterBufferAV1 *picture,
+			      struct v4l2_ctrl_av1_sequence *ctrl)
 {
-	switch (idx) {
-	case 0:  return 8;
-	case 1:  return 10;
-	case 2:  return 12;
-	default:
-		/* Spec-illegal; pass through so a reviewer / test catches it. */
-		return 8;
+	uint8_t bit_depth;
+
+	memset(ctrl, 0, sizeof(*ctrl));
+
+	switch (picture->bit_depth_idx) {
+	case 0: bit_depth = 8; break;
+	case 1: bit_depth = 10; break;
+	case 2: bit_depth = 12; break;
+	default: bit_depth = 8; break;
+	}
+
+	ctrl->seq_profile = picture->profile;
+	ctrl->order_hint_bits = picture->seq_info_fields.fields.enable_order_hint ?
+				(picture->order_hint_bits_minus_1 + 1) : 0;
+	ctrl->bit_depth = bit_depth;
+	/* VAAPI does NOT separately expose max_frame_{width,height}_minus_1
+	 * (sequence-level). Use the current frame size as a proxy. Correct
+	 * for fixed-size sequences (the 208/352/1080p test vectors). */
+	ctrl->max_frame_width_minus_1 = picture->frame_width_minus1;
+	ctrl->max_frame_height_minus_1 = picture->frame_height_minus1;
+
+	if (picture->seq_info_fields.fields.still_picture)
+		ctrl->flags |= V4L2_AV1_SEQUENCE_FLAG_STILL_PICTURE;
+	if (picture->seq_info_fields.fields.use_128x128_superblock)
+		ctrl->flags |= V4L2_AV1_SEQUENCE_FLAG_USE_128X128_SUPERBLOCK;
+	if (picture->seq_info_fields.fields.enable_filter_intra)
+		ctrl->flags |= V4L2_AV1_SEQUENCE_FLAG_ENABLE_FILTER_INTRA;
+	if (picture->seq_info_fields.fields.enable_intra_edge_filter)
+		ctrl->flags |= V4L2_AV1_SEQUENCE_FLAG_ENABLE_INTRA_EDGE_FILTER;
+	if (picture->seq_info_fields.fields.enable_interintra_compound)
+		ctrl->flags |= V4L2_AV1_SEQUENCE_FLAG_ENABLE_INTERINTRA_COMPOUND;
+	if (picture->seq_info_fields.fields.enable_masked_compound)
+		ctrl->flags |= V4L2_AV1_SEQUENCE_FLAG_ENABLE_MASKED_COMPOUND;
+	/* VAAPI doesn't expose enable_warped_motion as a sequence flag;
+	 * per-frame allow_warped_motion gates it. Conservative: set true so
+	 * per-frame flag is honored. */
+	ctrl->flags |= V4L2_AV1_SEQUENCE_FLAG_ENABLE_WARPED_MOTION;
+	if (picture->seq_info_fields.fields.enable_dual_filter)
+		ctrl->flags |= V4L2_AV1_SEQUENCE_FLAG_ENABLE_DUAL_FILTER;
+	if (picture->seq_info_fields.fields.enable_order_hint)
+		ctrl->flags |= V4L2_AV1_SEQUENCE_FLAG_ENABLE_ORDER_HINT;
+	if (picture->seq_info_fields.fields.enable_jnt_comp)
+		ctrl->flags |= V4L2_AV1_SEQUENCE_FLAG_ENABLE_JNT_COMP;
+	/* enable_ref_frame_mvs / enable_restoration not exposed at sequence
+	 * level — conservative set-true (kdirect also sets these for the
+	 * test streams; gating doesn't matter because per-frame flags
+	 * govern actual use). */
+	ctrl->flags |= V4L2_AV1_SEQUENCE_FLAG_ENABLE_REF_FRAME_MVS;
+	/* enable_superres: gate on the current frame's use_superres so the
+	 * SEQUENCE flag matches the bitstream-derived value. Empirical
+	 * strace diff vs kdirect: kdirect clears this for streams that
+	 * never use superres; we were unconditionally setting it true. */
+	if (picture->pic_info_fields.bits.use_superres)
+		ctrl->flags |= V4L2_AV1_SEQUENCE_FLAG_ENABLE_SUPERRES;
+	if (picture->seq_info_fields.fields.enable_cdef)
+		ctrl->flags |= V4L2_AV1_SEQUENCE_FLAG_ENABLE_CDEF;
+	ctrl->flags |= V4L2_AV1_SEQUENCE_FLAG_ENABLE_RESTORATION;
+	if (picture->seq_info_fields.fields.mono_chrome)
+		ctrl->flags |= V4L2_AV1_SEQUENCE_FLAG_MONO_CHROME;
+	if (picture->seq_info_fields.fields.color_range)
+		ctrl->flags |= V4L2_AV1_SEQUENCE_FLAG_COLOR_RANGE;
+	if (picture->seq_info_fields.fields.subsampling_x)
+		ctrl->flags |= V4L2_AV1_SEQUENCE_FLAG_SUBSAMPLING_X;
+	if (picture->seq_info_fields.fields.subsampling_y)
+		ctrl->flags |= V4L2_AV1_SEQUENCE_FLAG_SUBSAMPLING_Y;
+	if (picture->seq_info_fields.fields.film_grain_params_present)
+		ctrl->flags |= V4L2_AV1_SEQUENCE_FLAG_FILM_GRAIN_PARAMS_PRESENT;
+}
+
+/* ===== fill_frame ===== */
+static void av1_fill_frame(VADecPictureParameterBufferAV1 *picture,
+			   struct v4l2_ctrl_av1_frame *ctrl)
+{
+	unsigned int i, j;
+
+	memset(ctrl, 0, sizeof(*ctrl));
+
+	/* ---- tile_info ---- */
+	ctrl->tile_info.context_update_tile_id = picture->context_update_tile_id;
+	ctrl->tile_info.tile_cols = picture->tile_cols;
+	ctrl->tile_info.tile_rows = picture->tile_rows;
+	if (picture->tile_cols > 1 || picture->tile_rows > 1)
+		ctrl->tile_info.tile_size_bytes = 4;
+	else
+		ctrl->tile_info.tile_size_bytes = 0;
+
+	if (picture->pic_info_fields.bits.uniform_tile_spacing_flag)
+		ctrl->tile_info.flags |= V4L2_AV1_TILE_INFO_FLAG_UNIFORM_TILE_SPACING;
+
+	/* F1: mi_col/row_starts[]: prefix-sum from width_in_sbs_minus_1[]+1
+	 * (Kwiboo reads tile_start_col_sb[] directly; VAAPI doesn't expose
+	 * starts, only widths — reconstruct via accumulation). Plus the
+	 * sentinel at index tile_cols/tile_rows. */
+	{
+		uint16_t cum = 0;
+		for (i = 0; i < picture->tile_cols && i < 63; i++) {
+			ctrl->tile_info.mi_col_starts[i] = cum;
+			ctrl->tile_info.width_in_sbs_minus_1[i] =
+				picture->width_in_sbs_minus_1[i];
+			cum = (uint16_t)(cum + picture->width_in_sbs_minus_1[i] + 1);
+		}
+		ctrl->tile_info.mi_col_starts[picture->tile_cols] =
+			2 * ((picture->frame_width_minus1 + 1 + 7) >> 3);
+	}
+	{
+		uint16_t cum = 0;
+		for (i = 0; i < picture->tile_rows && i < 63; i++) {
+			ctrl->tile_info.mi_row_starts[i] = cum;
+			ctrl->tile_info.height_in_sbs_minus_1[i] =
+				picture->height_in_sbs_minus_1[i];
+			cum = (uint16_t)(cum + picture->height_in_sbs_minus_1[i] + 1);
+		}
+		ctrl->tile_info.mi_row_starts[picture->tile_rows] =
+			2 * ((picture->frame_height_minus1 + 1 + 7) >> 3);
+	}
+
+	/* ---- quantization ---- */
+	ctrl->quantization.base_q_idx = picture->base_qindex;
+	ctrl->quantization.delta_q_y_dc = picture->y_dc_delta_q;
+	ctrl->quantization.delta_q_u_dc = picture->u_dc_delta_q;
+	ctrl->quantization.delta_q_u_ac = picture->u_ac_delta_q;
+	ctrl->quantization.delta_q_v_dc = picture->v_dc_delta_q;
+	ctrl->quantization.delta_q_v_ac = picture->v_ac_delta_q;
+	ctrl->quantization.qm_y = picture->qmatrix_fields.bits.qm_y;
+	ctrl->quantization.qm_u = picture->qmatrix_fields.bits.qm_u;
+	ctrl->quantization.qm_v = picture->qmatrix_fields.bits.qm_v;
+	ctrl->quantization.delta_q_res =
+		picture->mode_control_fields.bits.log2_delta_q_res;
+
+	if (picture->u_dc_delta_q != picture->v_dc_delta_q ||
+	    picture->u_ac_delta_q != picture->v_ac_delta_q)
+		ctrl->quantization.flags |= V4L2_AV1_QUANTIZATION_FLAG_DIFF_UV_DELTA;
+	if (picture->qmatrix_fields.bits.using_qmatrix)
+		ctrl->quantization.flags |= V4L2_AV1_QUANTIZATION_FLAG_USING_QMATRIX;
+	if (picture->mode_control_fields.bits.delta_q_present_flag)
+		ctrl->quantization.flags |= V4L2_AV1_QUANTIZATION_FLAG_DELTA_Q_PRESENT;
+
+	/* ---- segmentation ---- */
+	if (picture->seg_info.segment_info_fields.bits.enabled)
+		ctrl->segmentation.flags |= V4L2_AV1_SEGMENTATION_FLAG_ENABLED;
+	if (picture->seg_info.segment_info_fields.bits.update_map)
+		ctrl->segmentation.flags |= V4L2_AV1_SEGMENTATION_FLAG_UPDATE_MAP;
+	if (picture->seg_info.segment_info_fields.bits.temporal_update)
+		ctrl->segmentation.flags |= V4L2_AV1_SEGMENTATION_FLAG_TEMPORAL_UPDATE;
+	if (picture->seg_info.segment_info_fields.bits.update_data)
+		ctrl->segmentation.flags |= V4L2_AV1_SEGMENTATION_FLAG_UPDATE_DATA;
+
+	for (i = 0; i < BACKEND_AV1_MAX_SEGMENTS; i++) {
+		for (j = 0; j < BACKEND_AV1_SEG_LVL_MAX; j++) {
+			if (picture->seg_info.feature_mask[i] & (1 << j)) {
+				ctrl->segmentation.feature_enabled[i] |=
+					V4L2_AV1_SEGMENT_FEATURE_ENABLED(j);
+				ctrl->segmentation.last_active_seg_id = i;
+				if (j >= BACKEND_AV1_SEG_LVL_REF_FRAME)
+					ctrl->segmentation.flags |=
+					    V4L2_AV1_SEGMENTATION_FLAG_SEG_ID_PRE_SKIP;
+			}
+			ctrl->segmentation.feature_data[i][j] =
+				picture->seg_info.feature_data[i][j];
 		}
 	}

+	/* ---- loop_filter ---- */
+	ctrl->loop_filter.level[0] = picture->filter_level[0];
+	ctrl->loop_filter.level[1] = picture->filter_level[1];
+	ctrl->loop_filter.level[2] = picture->filter_level_u;
+	ctrl->loop_filter.level[3] = picture->filter_level_v;
+	ctrl->loop_filter.sharpness =
+		picture->loop_filter_info_fields.bits.sharpness_level;
+	ctrl->loop_filter.mode_deltas[0] = picture->mode_deltas[0];
+	ctrl->loop_filter.mode_deltas[1] = picture->mode_deltas[1];
+	ctrl->loop_filter.delta_lf_res =
+		picture->mode_control_fields.bits.log2_delta_lf_res;
+	for (i = 0; i < BACKEND_AV1_NUM_REF_FRAMES; i++)
+		ctrl->loop_filter.ref_deltas[i] = picture->ref_deltas[i];
+
+	if (picture->loop_filter_info_fields.bits.mode_ref_delta_enabled)
+		ctrl->loop_filter.flags |= V4L2_AV1_LOOP_FILTER_FLAG_DELTA_ENABLED;
+	if (picture->loop_filter_info_fields.bits.mode_ref_delta_update)
+		ctrl->loop_filter.flags |= V4L2_AV1_LOOP_FILTER_FLAG_DELTA_UPDATE;
+	if (picture->mode_control_fields.bits.delta_lf_present_flag)
+		ctrl->loop_filter.flags |= V4L2_AV1_LOOP_FILTER_FLAG_DELTA_LF_PRESENT;
+	if (picture->mode_control_fields.bits.delta_lf_multi)
+		ctrl->loop_filter.flags |= V4L2_AV1_LOOP_FILTER_FLAG_DELTA_LF_MULTI;
+
+	/* ---- cdef ---- */
+	ctrl->cdef.damping_minus_3 = picture->cdef_damping_minus_3;
+	ctrl->cdef.bits = picture->cdef_bits;
+	for (i = 0; i < (unsigned)(1 << picture->cdef_bits) && i < 8; i++) {
+		uint8_t y = picture->cdef_y_strengths[i];
+		uint8_t uv = picture->cdef_uv_strengths[i];
+		ctrl->cdef.y_pri_strength[i] = (y >> 2) & 0x0F;
+		ctrl->cdef.y_sec_strength[i] = y & 0x03;
+		ctrl->cdef.uv_pri_strength[i] = (uv >> 2) & 0x0F;
+		ctrl->cdef.uv_sec_strength[i] = uv & 0x03;
+	}
+
+	/* ---- loop_restoration ---- (F3)
+	 * Phase 5 review Amendment 1 was REVERTED. The reviewer proposed
+	 * remap = {NONE, SWITCHABLE, WIENER, SGRPROJ} (Kwiboo's table)
+	 * based on AV1 spec FrameRestoreType wire encoding
+	 * {NONE=0, SWITCHABLE=1, WIENER=2, SGRPROJ=3} differing from V4L2's
+	 * {NONE=0, WIENER=1, SGRPROJ=2, SWITCHABLE=3}. Empirically applying
+	 * that permutation regressed ALL tests (allintra 10/10 → 0/10) —
+	 * so either VAAPI's yframe_restoration_type is NOT the raw spec
+	 * value (already-remapped to V4L2 enum semantics?), or vpu981
+	 * interprets the V4L2 enum values via a different mapping than
+	 * the V4L2 uAPI header documents. Per
+	 * [[feedback_review_empirical_over_theoretical]] keep the
+	 * identity mapping that empirically works; revisit if a
+	 * restoration-using fixture surfaces a real decode bug.
+	 */
+	{
+		uint8_t remap[4] = {
+			V4L2_AV1_FRAME_RESTORE_NONE,
+			V4L2_AV1_FRAME_RESTORE_WIENER,
+			V4L2_AV1_FRAME_RESTORE_SGRPROJ,
+			V4L2_AV1_FRAME_RESTORE_SWITCHABLE,
+		};
+		uint8_t y_t = picture->loop_restoration_fields.bits.yframe_restoration_type & 3;
+		uint8_t cb_t = picture->loop_restoration_fields.bits.cbframe_restoration_type & 3;
+		uint8_t cr_t = picture->loop_restoration_fields.bits.crframe_restoration_type & 3;
+		bool uses_lr = false;
+
+		ctrl->loop_restoration.frame_restoration_type[0] = remap[y_t];
+		ctrl->loop_restoration.frame_restoration_type[1] = remap[cb_t];
+		ctrl->loop_restoration.frame_restoration_type[2] = remap[cr_t];
+		if (y_t != 0)
+			uses_lr = true;
+		if (cb_t != 0 || cr_t != 0) {
+			uses_lr = true;
+			ctrl->loop_restoration.flags |=
+				V4L2_AV1_LOOP_RESTORATION_FLAG_USES_CHROMA_LR;
+		}
+
+		ctrl->loop_restoration.lr_unit_shift =
+			picture->loop_restoration_fields.bits.lr_unit_shift;
+		ctrl->loop_restoration.lr_uv_shift =
+			picture->loop_restoration_fields.bits.lr_uv_shift;
+
+		if (uses_lr) {
+			uint8_t shift = picture->loop_restoration_fields.bits.lr_unit_shift;
+			uint8_t uv_shift = picture->loop_restoration_fields.bits.lr_uv_shift;
+			ctrl->loop_restoration.flags |=
+				V4L2_AV1_LOOP_RESTORATION_FLAG_USES_LR;
+			ctrl->loop_restoration.loop_restoration_size[0] =
+				1 << (6 + shift);
+			ctrl->loop_restoration.loop_restoration_size[1] =
+				1 << (6 + shift - uv_shift);
+			ctrl->loop_restoration.loop_restoration_size[2] =
+				1 << (6 + shift - uv_shift);
+		}
+	}
+
+	/* ---- global_motion ---- */
+	for (i = 0; i < BACKEND_AV1_TOTAL_REFS_PER_FRAME; i++) {
+		if (i == 0)
+			continue; /* INTRA_FRAME slot — no warp */
+		ctrl->global_motion.type[i] = picture->wm[i - 1].wmtype;
+		for (j = 0; j < 6; j++)
+			ctrl->global_motion.params[i][j] = picture->wm[i - 1].wmmat[j];
+		if (picture->wm[i - 1].invalid)
+			ctrl->global_motion.invalid |=
+				V4L2_AV1_GLOBAL_MOTION_IS_INVALID(i);
+		switch (picture->wm[i - 1].wmtype) {
+		case 1:
+			ctrl->global_motion.flags[i] |=
+				V4L2_AV1_GLOBAL_MOTION_FLAG_IS_TRANSLATION;
+			ctrl->global_motion.flags[i] |=
+				V4L2_AV1_GLOBAL_MOTION_FLAG_IS_GLOBAL;
+			break;
+		case 2:
+			ctrl->global_motion.flags[i] |=
+				V4L2_AV1_GLOBAL_MOTION_FLAG_IS_ROT_ZOOM;
+			ctrl->global_motion.flags[i] |=
+				V4L2_AV1_GLOBAL_MOTION_FLAG_IS_GLOBAL;
+			break;
+		case 3:
+			ctrl->global_motion.flags[i] |=
+				V4L2_AV1_GLOBAL_MOTION_FLAG_IS_GLOBAL;
+			break;
+		default:
+			break;
+		}
+	}
+
+	/* ---- reference frames + order hints ---- */
+	/* reference_frame_ts[] is filled by the orchestrator (av1_set_controls)
+	 * which has driver_data for the SURFACE() lookup. order_hints[] not
+	 * exposed per-ref by VAAPI — leave zero. ref_frame_idx[7] is the
+	 * index map from spec-defined ref slots (LAST..ALTREF) into
+	 * ref_frame_map[8] (the surface IDs). */
+	for (i = 0; i < BACKEND_AV1_TOTAL_REFS_PER_FRAME; i++)
+		ctrl->order_hints[i] = 0;
+	for (i = 0; i < BACKEND_AV1_REFS_PER_FRAME; i++)
+		ctrl->ref_frame_idx[i] = picture->ref_frame_idx[i];
+
+	/* F2: superres_denom direct from VAAPI; fallback to AV1_SUPERRES_NUM
+	 * if zero (spec violation but defensive). */
+	ctrl->superres_denom = picture->superres_scale_denominator
+		? picture->superres_scale_denominator : AV1_SUPERRES_NUM;
+
+	ctrl->skip_mode_frame[0] = 0;
+	ctrl->skip_mode_frame[1] = 0;
+	ctrl->primary_ref_frame = picture->primary_ref_frame;
+	ctrl->frame_type = picture->pic_info_fields.bits.frame_type;
+	ctrl->order_hint = picture->order_hint;
+	ctrl->upscaled_width = picture->frame_width_minus1 + 1;
+	ctrl->interpolation_filter = picture->interp_filter;
+	ctrl->tx_mode = picture->mode_control_fields.bits.tx_mode;
+	ctrl->frame_width_minus_1 = picture->frame_width_minus1;
+	ctrl->frame_height_minus_1 = picture->frame_height_minus1;
+	ctrl->render_width_minus_1 = picture->frame_width_minus1;
+	ctrl->render_height_minus_1 = picture->frame_height_minus1;
+	ctrl->current_frame_id = 0;
+	/* Phase 3: VAAPI doesn't expose refresh_frame_flags. For shown KEY
+	 * frames and SWITCH frames the AV1 spec mandates 0xff (refresh all
+	 * DPB slots). For other frames we default to 0xff too; simple
+	 * P-frame chains rotate through slots without a precise per-slot
+	 * value. Precise control would need frame-header parsing (AV1 has no
+	 * SPS). Empirical diff vs kdirect: kdirect always sends 0xff here. */
+	ctrl->refresh_frame_flags = 0xff;
+
+	/* ---- frame flags ---- */
+	if (picture->pic_info_fields.bits.show_frame)
+		ctrl->flags |= V4L2_AV1_FRAME_FLAG_SHOW_FRAME;
+	if (picture->pic_info_fields.bits.showable_frame)
+		ctrl->flags |= V4L2_AV1_FRAME_FLAG_SHOWABLE_FRAME;
+	if (picture->pic_info_fields.bits.error_resilient_mode)
+		ctrl->flags |= V4L2_AV1_FRAME_FLAG_ERROR_RESILIENT_MODE;
+	if (picture->pic_info_fields.bits.disable_cdf_update)
+		ctrl->flags |= V4L2_AV1_FRAME_FLAG_DISABLE_CDF_UPDATE;
+	if (picture->pic_info_fields.bits.allow_screen_content_tools)
+		ctrl->flags |= V4L2_AV1_FRAME_FLAG_ALLOW_SCREEN_CONTENT_TOOLS;
+	if (picture->pic_info_fields.bits.force_integer_mv)
+		ctrl->flags |= V4L2_AV1_FRAME_FLAG_FORCE_INTEGER_MV;
+	if (picture->pic_info_fields.bits.allow_intrabc)
+		ctrl->flags |= V4L2_AV1_FRAME_FLAG_ALLOW_INTRABC;
+	if (picture->pic_info_fields.bits.use_superres)
+		ctrl->flags |= V4L2_AV1_FRAME_FLAG_USE_SUPERRES;
+	if (picture->pic_info_fields.bits.allow_high_precision_mv)
+		ctrl->flags |= V4L2_AV1_FRAME_FLAG_ALLOW_HIGH_PRECISION_MV;
+	if (picture->pic_info_fields.bits.is_motion_mode_switchable)
+		ctrl->flags |= V4L2_AV1_FRAME_FLAG_IS_MOTION_MODE_SWITCHABLE;
+	if (picture->pic_info_fields.bits.use_ref_frame_mvs)
+		ctrl->flags |= V4L2_AV1_FRAME_FLAG_USE_REF_FRAME_MVS;
+	if (picture->pic_info_fields.bits.disable_frame_end_update_cdf)
+		ctrl->flags |= V4L2_AV1_FRAME_FLAG_DISABLE_FRAME_END_UPDATE_CDF;
+	if (picture->pic_info_fields.bits.allow_warped_motion)
+		ctrl->flags |= V4L2_AV1_FRAME_FLAG_ALLOW_WARPED_MOTION;
+	if (picture->mode_control_fields.bits.reference_select)
+		ctrl->flags |= V4L2_AV1_FRAME_FLAG_REFERENCE_SELECT;
+	if (picture->mode_control_fields.bits.reduced_tx_set_used)
+		ctrl->flags |= V4L2_AV1_FRAME_FLAG_REDUCED_TX_SET;
+	if (picture->mode_control_fields.bits.skip_mode_present) {
+		ctrl->flags |= V4L2_AV1_FRAME_FLAG_SKIP_MODE_ALLOWED;
+		ctrl->flags |= V4L2_AV1_FRAME_FLAG_SKIP_MODE_PRESENT;
+	}
+}
+
+/* ===== fill_film_grain ===== */
+static void av1_fill_film_grain(VADecPictureParameterBufferAV1 *picture,
+				struct v4l2_ctrl_av1_film_grain *ctrl)
+{
+	VAFilmGrainStructAV1 *fg = &picture->film_grain_info;
+	unsigned int i;
+
+	memset(ctrl, 0, sizeof(*ctrl));
+
+	ctrl->cr_mult = fg->cr_mult;
+	ctrl->grain_seed = fg->grain_seed;
+	/* VAAPI doesn't expose film_grain_params_ref_idx (the reuse-from-
+	 * previous-frame index). Leave zero — only consulted when
+	 * update_grain=0, which VAAPI also doesn't expose. */
+	ctrl->film_grain_params_ref_idx = 0;
+	ctrl->num_y_points = fg->num_y_points;
+	ctrl->num_cb_points = fg->num_cb_points;
+	ctrl->num_cr_points = fg->num_cr_points;
+	ctrl->grain_scaling_minus_8 =
+		fg->film_grain_info_fields.bits.grain_scaling_minus_8;
+	ctrl->ar_coeff_lag = fg->film_grain_info_fields.bits.ar_coeff_lag;
+	ctrl->ar_coeff_shift_minus_6 =
+		fg->film_grain_info_fields.bits.ar_coeff_shift_minus_6;
+	ctrl->grain_scale_shift =
+		fg->film_grain_info_fields.bits.grain_scale_shift;
+	ctrl->cb_mult = fg->cb_mult;
+	ctrl->cb_luma_mult = fg->cb_luma_mult;
+	ctrl->cr_luma_mult = fg->cr_luma_mult;
+	ctrl->cb_offset = fg->cb_offset;
+	ctrl->cr_offset = fg->cr_offset;
+
+	if (fg->film_grain_info_fields.bits.apply_grain) {
+		ctrl->flags |= V4L2_AV1_FILM_GRAIN_FLAG_APPLY_GRAIN;
+		/* kdirect strace diff confirmed: V4L2_AV1_FILM_GRAIN_FLAG_
+		 * UPDATE_GRAIN must be set when apply_grain=1 (kdirect's
+		 * flags byte is 0x0B = APPLY|UPDATE|...). VAAPI's
+		 * VAFilmGrainStructAV1 doesn't expose update_grain
+		 * separately. Default to UPDATE=1 (use submitted params,
+		 * not reuse from non-existent prior film_grain ref). The
+		 * earlier segfault seen with this flag set came from the
+		 * link-NULL deref (now fixed via linked_decode_surface),
+		 * not from UPDATE_GRAIN itself. */
+		ctrl->flags |= V4L2_AV1_FILM_GRAIN_FLAG_UPDATE_GRAIN;
+	}
+	if (fg->film_grain_info_fields.bits.chroma_scaling_from_luma)
+		ctrl->flags |= V4L2_AV1_FILM_GRAIN_FLAG_CHROMA_SCALING_FROM_LUMA;
+	if (fg->film_grain_info_fields.bits.overlap_flag)
+		ctrl->flags |= V4L2_AV1_FILM_GRAIN_FLAG_OVERLAP;
+	if (fg->film_grain_info_fields.bits.clip_to_restricted_range)
+		ctrl->flags |= V4L2_AV1_FILM_GRAIN_FLAG_CLIP_TO_RESTRICTED_RANGE;
+
+	if (!fg->film_grain_info_fields.bits.apply_grain)
+		return;
+
+	for (i = 0; i < fg->num_y_points && i < 14; i++) {
+		ctrl->point_y_value[i] = fg->point_y_value[i];
+		ctrl->point_y_scaling[i] = fg->point_y_scaling[i];
+	}
+	for (i = 0; i < fg->num_cb_points && i < 10; i++) {
+		ctrl->point_cb_value[i] = fg->point_cb_value[i];
+		ctrl->point_cb_scaling[i] = fg->point_cb_scaling[i];
+	}
+	for (i = 0; i < fg->num_cr_points && i < 10; i++) {
+		ctrl->point_cr_value[i] = fg->point_cr_value[i];
+		ctrl->point_cr_scaling[i] = fg->point_cr_scaling[i];
+	}
+	for (i = 0; i < 24; i++)
+		ctrl->ar_coeffs_y_plus_128[i] = (uint8_t)(fg->ar_coeffs_y[i] + 128);
+	for (i = 0; i < 25; i++) {
+		ctrl->ar_coeffs_cb_plus_128[i] = (uint8_t)(fg->ar_coeffs_cb[i] + 128);
+		ctrl->ar_coeffs_cr_plus_128[i] = (uint8_t)(fg->ar_coeffs_cr[i] + 128);
+	}
+}
+
+/* ===== orchestrator ===== */
 int av1_set_controls(struct request_data *driver_data,
 		     struct object_context *context,
 		     struct object_surface *surface_object)
 {
 	VADecPictureParameterBufferAV1 *picture =
 		&surface_object->params.av1.picture;
+	unsigned int num_tiles = surface_object->params.av1.num_tile_group_entries;
 	struct v4l2_ctrl_av1_sequence sequence;
-	struct v4l2_ext_control ctrls[1];
+	struct v4l2_ctrl_av1_frame frame;
+	struct v4l2_ctrl_av1_film_grain film_grain;
+	struct v4l2_ctrl_av1_tile_group_entry *tile_entries = NULL;
+	struct v4l2_ext_control controls[4];
+	unsigned int n = 0;
+	unsigned int i;
+	unsigned int alloc_tiles;
 	int rc;

 	(void)context;

-	memset(&sequence, 0, sizeof sequence);
+	/*
+	 * AV1 film_grain link: when apply_grain=1, ffmpeg-vaapi allocates a
+	 * separate display surface (current_display_picture) from the decode
+	 * surface (current_frame). vpu981 HW applies grain inline to the
+	 * decode CAPTURE buffer, so the consumable data is in current_frame's
+	 * slot. ffmpeg then calls vaGetImage on current_display_picture which
+	 * has no slot bound. Link the display surface back to the decode
+	 * surface so copy_surface_to_image can borrow destination_data[].
+	 */
+	if (picture->current_display_picture != VA_INVALID_SURFACE &&
+	    picture->current_display_picture != picture->current_frame) {
+		struct object_surface *display_surface =
+			SURFACE(driver_data, picture->current_display_picture);
+		if (display_surface != NULL)
+			display_surface->linked_decode_surface_id =
+				picture->current_frame;
+	}
+
+	if (num_tiles > AV1_MAX_TILES)
+		num_tiles = AV1_MAX_TILES;
+
+	/* DYNAMIC_ARRAY size = MAX(num_tiles, 1) per Janet v2 Q1
+	 * amendment: a size of 0 is undefined behaviour in the kernel. */
+	alloc_tiles = num_tiles > 0 ? num_tiles : 1;
+	tile_entries = calloc(alloc_tiles, sizeof(*tile_entries));
+	if (tile_entries == NULL)
+		return -1;
+
+	for (i = 0; i < num_tiles; i++) {
+		VASliceParameterBufferAV1 *slice =
+			&surface_object->params.av1.tile_group_entries[i];
+		tile_entries[i].tile_offset = slice->slice_data_offset;
+		tile_entries[i].tile_size = slice->slice_data_size;
+		tile_entries[i].tile_row = (uint8_t)slice->tile_row;
+		tile_entries[i].tile_col = (uint8_t)slice->tile_column;
+	}
+
+	av1_fill_sequence(picture, &sequence);
+	av1_fill_frame(picture, &frame);

 	/*
-	 * Scalar mapping.  Names align with kernel uAPI; off-by-one and
-	 * idx→value translations are annotated.
+	 * Phase 2.1 + frame-2 divergence fix: wire reference_frame_ts[].
+	 * VAAPI exposes ref_frame_map[8] as VASurfaceIDs; the kernel needs
+	 * v4l2-style timestamps to cross-reference the corresponding
+	 * CAPTURE buffers (set on the OUTPUT buffer at QBUF time per
+	 * picture.c::EndPicture, via surface_object->timestamp). Mirrors
+	 * the vp9.c:614-628 pattern, scaled to AV1's 8 ref slots.
+	 *
+	 * VA_INVALID_SURFACE entries stay at the zero-initialised timestamp
+	 * (kernel reads zero, doesn't try to dereference).
 	 */
-	sequence.seq_profile = picture->profile;
-	sequence.order_hint_bits =
-		(uint8_t)(picture->order_hint_bits_minus_1 + 1u);
-	sequence.bit_depth = av1_bit_depth_from_idx(picture->bit_depth_idx);
-	sequence.max_frame_width_minus_1 = picture->frame_width_minus1;
-	sequence.max_frame_height_minus_1 = picture->frame_height_minus1;
-
 	/*
-	 * Sequence-header flag mapping.  VAAPI exposes most of these directly
-	 * in seq_info_fields.fields.*; the ones that don't have a 1:1 mirror
-	 * (V4L2_AV1_SEQUENCE_FLAG_ENABLE_WARPED_MOTION, _ENABLE_REF_FRAME_MVS,
-	 * _ENABLE_SUPERRES, _ENABLE_RESTORATION, _SEPARATE_UV_DELTA_Q) live in
-	 * VAAPI's per-frame pic_info_fields rather than the sequence struct.
-	 * For SEQUENCE-control purposes we treat them as best-effort
-	 * unobservable from libva and leave the corresponding bits clear; the
-	 * daedalus daemon's OBU synthesiser (issue #11 daemon track) carries
-	 * the SEQUENCE bytes verbatim, so per-frame consumers (libdav1d) will
-	 * still see the full bitstream truth for those toggles via the
-	 * OBU_FRAME stream already in the slice buffer.  See feedback memory
-	 * `feedback_vaapi_blind_to_some_hevc_sps_fields` for the precedent.
+	 * Empirical: DPB-slot iteration (i over ref_frame_map[i]) decodes
+	 * more frames correctly than ref-name iteration via ref_frame_idx[].
+	 * Tried the ref-name reindex (Kwiboo convention via FFmpeg s->ref[i])
+	 * and lost frames that previously PASSed (3/10 → 1/10), so the V4L2
+	 * uAPI semantic here may be DPB-slot-indexed despite the AV1 spec
+	 * lexicon. Phase 3 open question pending kernel-side disambiguation.
 	 */
-	if (picture->seq_info_fields.fields.still_picture)
-		sequence.flags |= V4L2_AV1_SEQUENCE_FLAG_STILL_PICTURE;
-	if (picture->seq_info_fields.fields.use_128x128_superblock)
-		sequence.flags |= V4L2_AV1_SEQUENCE_FLAG_USE_128X128_SUPERBLOCK;
-	if (picture->seq_info_fields.fields.enable_filter_intra)
-		sequence.flags |= V4L2_AV1_SEQUENCE_FLAG_ENABLE_FILTER_INTRA;
-	if (picture->seq_info_fields.fields.enable_intra_edge_filter)
-		sequence.flags |= V4L2_AV1_SEQUENCE_FLAG_ENABLE_INTRA_EDGE_FILTER;
-	if (picture->seq_info_fields.fields.enable_interintra_compound)
-		sequence.flags |= V4L2_AV1_SEQUENCE_FLAG_ENABLE_INTERINTRA_COMPOUND;
-	if (picture->seq_info_fields.fields.enable_masked_compound)
-		sequence.flags |= V4L2_AV1_SEQUENCE_FLAG_ENABLE_MASKED_COMPOUND;
-	if (picture->seq_info_fields.fields.enable_dual_filter)
-		sequence.flags |= V4L2_AV1_SEQUENCE_FLAG_ENABLE_DUAL_FILTER;
-	if (picture->seq_info_fields.fields.enable_order_hint)
-		sequence.flags |= V4L2_AV1_SEQUENCE_FLAG_ENABLE_ORDER_HINT;
-	if (picture->seq_info_fields.fields.enable_jnt_comp)
-		sequence.flags |= V4L2_AV1_SEQUENCE_FLAG_ENABLE_JNT_COMP;
-	if (picture->seq_info_fields.fields.enable_cdef)
-		sequence.flags |= V4L2_AV1_SEQUENCE_FLAG_ENABLE_CDEF;
-	if (picture->seq_info_fields.fields.mono_chrome)
-		sequence.flags |= V4L2_AV1_SEQUENCE_FLAG_MONO_CHROME;
-	if (picture->seq_info_fields.fields.color_range)
-		sequence.flags |= V4L2_AV1_SEQUENCE_FLAG_COLOR_RANGE;
-	if (picture->seq_info_fields.fields.subsampling_x)
-		sequence.flags |= V4L2_AV1_SEQUENCE_FLAG_SUBSAMPLING_X;
-	if (picture->seq_info_fields.fields.subsampling_y)
-		sequence.flags |= V4L2_AV1_SEQUENCE_FLAG_SUBSAMPLING_Y;
-	if (picture->seq_info_fields.fields.film_grain_params_present)
-		sequence.flags |= V4L2_AV1_SEQUENCE_FLAG_FILM_GRAIN_PARAMS_PRESENT;
+	for (i = 0; i < BACKEND_AV1_TOTAL_REFS_PER_FRAME; i++) {
+		VASurfaceID ref_id = picture->ref_frame_map[i];
+		struct object_surface *ref_surface;
+		uint64_t ts;
+		if (ref_id == VA_INVALID_SURFACE)
+			continue;
+		ref_surface = SURFACE(driver_data, ref_id);
+		if (ref_surface == NULL)
+			continue;
+		ts = v4l2_timeval_to_ns(&ref_surface->timestamp);
+		if (ts == 0 &&
+		    ref_surface->linked_decode_surface_id != VA_INVALID_SURFACE) {
+			struct object_surface *dec =
+				SURFACE(driver_data,
+					ref_surface->linked_decode_surface_id);
+			if (dec != NULL) {
+				ts = v4l2_timeval_to_ns(&dec->timestamp);
+				frame.order_hints[i] = dec->av1_order_hint;
+			}
+		} else {
+			frame.order_hints[i] = ref_surface->av1_order_hint;
+		}
+		frame.reference_frame_ts[i] = ts;
+	}

-	/* Single-control batched submission. */
-	memset(ctrls, 0, sizeof ctrls);
-	ctrls[0].id   = V4L2_CID_STATELESS_AV1_SEQUENCE;
-	ctrls[0].ptr  = &sequence;
-	ctrls[0].size = sizeof sequence;
+	/* Phase 3: record this frame's order_hint on the surface so the
+	 * NEXT frame's ref-loop can populate order_hints[] for slots that
+	 * reference us. */
+	surface_object->av1_order_hint = picture->order_hint;
+	/* Also propagate to the linked display surface (if any), since
+	 * future frames' ref_frame_map[] may point at either. */
+	if (picture->current_display_picture != VA_INVALID_SURFACE &&
+	    picture->current_display_picture != picture->current_frame) {
+		struct object_surface *disp =
+			SURFACE(driver_data, picture->current_display_picture);
+		if (disp != NULL)
+			disp->av1_order_hint = picture->order_hint;
+	}
+
+	if (driver_data->has_av1_film_grain)
+		av1_fill_film_grain(picture, &film_grain);
+
+	controls[n++] = (struct v4l2_ext_control){
+		.id = V4L2_CID_STATELESS_AV1_SEQUENCE,
+		.ptr = &sequence,
+		.size = sizeof(sequence),
+	};
+	controls[n++] = (struct v4l2_ext_control){
+		.id = V4L2_CID_STATELESS_AV1_FRAME,
+		.ptr = &frame,
+		.size = sizeof(frame),
+	};
+	controls[n++] = (struct v4l2_ext_control){
+		.id = V4L2_CID_STATELESS_AV1_TILE_GROUP_ENTRY,
+		.ptr = tile_entries,
+		.size = sizeof(*tile_entries) * alloc_tiles,
+	};
+	if (driver_data->has_av1_film_grain) {
+		controls[n++] = (struct v4l2_ext_control){
+			.id = V4L2_CID_STATELESS_AV1_FILM_GRAIN,
+			.ptr = &film_grain,
+			.size = sizeof(film_grain),
+		};
+	}

 	rc = v4l2_set_controls(driver_data->video_fd,
 			       surface_object->request_fd,
-			       ctrls, 1);
-	if (rc < 0)
-		return VA_STATUS_ERROR_OPERATION_FAILED;
+			       controls, n);

-	return VA_STATUS_SUCCESS;
+	free(tile_entries);
+
+	if (rc < 0) {
+		request_log("ampere-av1: VIDIOC_S_EXT_CTRLS failed rc=%d\n", rc);
+		return -1;
+	}
+
+	return 0;
 }
@@ -1,8 +1,14 @@
 /*
- * Copyright (C) 2026 Markus Fritsche <fritsche.markus@gmail.com>
+ * Copyright (C) 2026 claude-noether <claude-noether@reauktion.de>
 *
- * AV1 codec dispatcher — populates V4L2_CID_STATELESS_AV1_SEQUENCE
- * (struct v4l2_ctrl_av1_sequence) from VAAPI's VADecPictureParameterBufferAV1.
+ * ampere-av1-enablement Phase 2: AV1 codec dispatcher header for
+ * libva-v4l2-request-fourier. Mirrors vp9.h shape: single set_controls entry
+ * point that translates surface->params.av1.* VAAPI structures into a
+ * batch of V4L2_CID_STATELESS_AV1_{SEQUENCE,FRAME,TILE_GROUP_ENTRY,
+ * FILM_GRAIN} controls + the underlying request_fd / OUTPUT plane setup.
+ *
+ * V4L2 target: V4L2_PIX_FMT_AV1_FRAME on the vpu981 hantro instance
+ * (RK3588's dedicated AV1 decoder).
 *
 * Permission is hereby granted, free of charge, to any person obtaining a
 * copy of this software and associated documentation files (the
@@ -37,28 +37,14 @@ unsigned int pixelformat_for_profile(VAProfile profile)
 	case VAProfileH264ConstrainedBaseline:
 	case VAProfileH264MultiviewHigh:
 	case VAProfileH264StereoHigh:
-	case VAProfileH264High10:
 		return V4L2_PIX_FMT_H264_SLICE;
 	case VAProfileHEVCMain:
-	case VAProfileHEVCMain10:
 		return V4L2_PIX_FMT_HEVC_SLICE;
 	case VAProfileVP8Version0_3:
 		return V4L2_PIX_FMT_VP8_FRAME;
 	case VAProfileVP9Profile0:
 		return V4L2_PIX_FMT_VP9_FRAME;
 	case VAProfileAV1Profile0:
-		/*
-		 * ampere-av1-enablement Phase 2: AV1 Profile 0 routes to
-		 * vpu981 (RK3588's dedicated AV1 hantro). Per-codec ctrl
-		 * dispatch (V4L2_CID_STATELESS_AV1_*) is NOT YET WIRED on
-		 * master — vainfo lists the profile + RequestCreateConfig
-		 * succeeds, but consumers that submit decode buffers hit
-		 * a NOP path until the per-codec dispatch lands. The
-		 * av1-iter1 operator branch has Phase 3 bit-exact bring-up
-		 * underway; this commit gives master the bare enumeration +
-		 * fd-routing layer so consumers like ffmpeg-vaapi at least
-		 * see VAProfileAV1Profile0 today.
-		 */
 		return V4L2_PIX_FMT_AV1_FRAME;
 	default:
 		return 0;
@@ -59,37 +59,34 @@ VAStatus RequestCreateConfig(VADriverContextP context, VAProfile profile,
 	case VAProfileH264ConstrainedBaseline:
 	case VAProfileH264MultiviewHigh:
 	case VAProfileH264StereoHigh:
-	case VAProfileH264High10:
 		// FIXME
-		// iter39: Hi10P routed through same H264 path; bit-depth gating
-		// happens in context.c synthetic SPS and CAPTURE pix_fmt
-		// selection.
 		break;
 	case VAProfileMPEG2Simple:
 	case VAProfileMPEG2Main:
+		// fresnel-fourier iter1: MPEG-2 enabled. Same shape as H.264
+		// above — no profile-specific config validation in the libva
+		// backend; validation happens at vaCreateContext / control
+		// submission time.
 		break;
 	case VAProfileHEVCMain:
-	case VAProfileHEVCMain10:
-		// iter39: Main10 routed through same HEVC path; bit-depth
-		// gating happens in context.c.
+		// fresnel-fourier iter2: HEVC enabled. Same shape as H.264/
+		// MPEG-2 above — no profile-specific config validation in the
+		// libva backend; validation happens at vaCreateContext / control
+		// submission time.
 		break;
 	case VAProfileVP8Version0_3:
+		// fresnel-fourier iter3: VP8 enabled. Same shape as iter1+iter2
+		// above — no profile-specific config validation in the libva
+		// backend; validation happens at vaCreateContext / control
+		// submission time.
 		break;
 	case VAProfileVP9Profile0:
 		// fresnel-fourier iter4: VP9 Profile 0 enabled on rkvdec.
-		// VP9 Profile 2 is NOT supported by RK3399 rkvdec (kernel ctrl
-		// cap is V4L2_MPEG_VIDEO_VP9_PROFILE_0). Do not add a case for
-		// VAProfileVP9Profile2 — kernel will reject.
+		// Same shape — no profile-specific validation here.
 		break;
 	case VAProfileAV1Profile0:
-		// ampere-av1-enablement Phase 2: AV1 Profile 0 routes to
-		// vpu981 (RK3588 dedicated AV1 hantro instance). Decode-side
-		// ctrl dispatch (V4L2_CID_STATELESS_AV1_*) is NOT YET WIRED
-		// on master — vainfo will list the profile + CreateConfig
-		// succeeds, but consumers that submit decode buffers hit a
-		// NOP path until av1.{c,h} dispatch scaffolding is ported
-		// from the av1-iter1 operator branch (where Phase 3-5 has
-		// 3/10 frames bit-exact already).
+		// ampere-av1-enablement: AV1 Profile 0 enabled on vpu981.
+		// Same shape — no profile-specific validation here.
 		break;
 	default:
 		return VA_STATUS_ERROR_UNSUPPORTED_PROFILE;
@@ -126,14 +123,6 @@ VAStatus RequestCreateConfig(VADriverContextP context, VAProfile profile,
 	 */
 	config_object->pixelformat = pixelformat_for_profile(profile);
 	config_object->attributes[0].type = VAConfigAttribRTFormat;
-	/*
-	 * iter39: 10-bit profiles advertise YUV420_10. ffmpeg-vaapi reads
-	 * this attribute on vaGetConfigAttributes and refuses surface
-	 * allocation if it mismatches the input bitstream's bit depth.
-	 */
-	if (profile == VAProfileH264High10 || profile == VAProfileHEVCMain10)
-		config_object->attributes[0].value = VA_RT_FORMAT_YUV420_10;
-	else
 	config_object->attributes[0].value = VA_RT_FORMAT_YUV420;
 	config_object->attributes_count = 1;

@@ -172,20 +161,14 @@ VAStatus RequestDestroyConfig(VADriverContextP context, VAConfigID config_id)
 static bool any_fd_supports_output_format(struct request_data *driver_data,
 					  unsigned int fmt)
 {
-	int fds[6] = {
+	int fds[4] = {
 		driver_data->video_fd,
 		driver_data->video_fd_rkvdec,
 		driver_data->video_fd_hantro,
-		driver_data->video_fd_rpi_hevc_dec,  /* iter40 */
-		driver_data->video_fd_vpu981,        /* ampere-av1 Phase 2 */
-#ifdef HAVE_DAEDALUS_V4L2
-		driver_data->video_fd_daedalus,      /* LIBVA-1: H.264/VP9/AV1 */
-#else
-		-1,
-#endif
+		driver_data->video_fd_vpu981,
 	};
 	int i;
-	for (i = 0; i < 6; i++) {
+	for (i = 0; i < 4; i++) {
 		if (fds[i] < 0) continue;
 		if (v4l2_find_format(fds[i], V4L2_BUF_TYPE_VIDEO_OUTPUT, fmt))
 			return true;
@@ -215,48 +198,11 @@ VAStatus RequestQueryConfigProfiles(VADriverContextP context,
 		profiles[index++] = VAProfileH264ConstrainedBaseline;
 		profiles[index++] = VAProfileH264MultiviewHigh;
 		profiles[index++] = VAProfileH264StereoHigh;
-		/*
-		 * iter39 Phase 7 close (Option B): VAProfileH264High10
-		 * DELIBERATELY NOT ENUMERATED.
-		 *
-		 * Hi10P on Rockchip V4L2 stateless decoders requires:
-		 *   - HW: ✓ both RK3399 + RK3588 capable (per Rockchip
-		 *         datasheets — 4K 10-bit H.264 line items)
-		 *   - Kernel: ✓ Karlman v6→v10 series merged in
-		 *             mmind v7.0 (rkvdec_h264_decoded_fmts[] has
-		 *             NV15/NV20; ctrl cfg.max=HIGH_422_INTRA;
-		 *             bit_depth_luma_minus8==2 path live in
-		 *             rkvdec-h264-common.c:196)
-		 *   - Userspace ffmpeg: ✗ ffmpeg-v4l2-request-fourier
-		 *             lacks the userspace plumbing for Hi10P;
-		 *             kdirect path fails with EINVAL, libva path
-		 *             returns CAPTURE buffer all-zero.
-		 *
-		 * Empirically verified on both fresnel (RK3399) and ampere
-		 * (RK3588) 2026-05-17 — same all-zero / EINVAL failure
-		 * mode on both. The backend infrastructure (codec.c,
-		 * context.c, image.c, surface.c, nv15.c) is RETAINED for
-		 * when the upstream ffmpeg gap closes — just re-add the
-		 * profiles[index++] line and bump the (-5) guard back to
-		 * (-6). See memory feedback_rk3399_h264_hi10p_advertised_not_functional
-		 * for the empirical evidence.
-		 */
 	}

 	found = any_fd_supports_output_format(driver_data, V4L2_PIX_FMT_HEVC_SLICE);
-	if (found && index < (V4L2_REQUEST_MAX_PROFILES - 1)) {
+	if (found && index < (V4L2_REQUEST_MAX_PROFILES - 1))
 		profiles[index++] = VAProfileHEVCMain;
-		/*
-		 * iter39 Phase 7 close (Option B): VAProfileHEVCMain10
-		 * DELIBERATELY NOT ENUMERATED. Same reasoning as
-		 * VAProfileH264High10 above — kernel + HW ready,
-		 * userspace ffmpeg V4L2 hwaccel plumbing not. Untested
-		 * specifically due to no Main10 fixture (system x265
-		 * is 8-bit-only on Arch ARM), but same kernel/HW/
-		 * userspace stack so same gap likely applies. Re-enable
-		 * when ffmpeg-vaapi → V4L2 hwaccel adds 10-bit HEVC.
-		 */
-	}

 	found = any_fd_supports_output_format(driver_data, V4L2_PIX_FMT_VP8_FRAME);
 	if (found && index < (V4L2_REQUEST_MAX_PROFILES - 1))
@@ -267,11 +213,11 @@ VAStatus RequestQueryConfigProfiles(VADriverContextP context,
 		profiles[index++] = VAProfileVP9Profile0;

 	/*
-	 * ampere-av1-enablement Phase 2: AV1 Profile 0 advertised when
-	 * vpu981 (RK3588 dedicated AV1 hantro) is probed. MAX_PROFILES
-	 * bumped to 14 in request.h to safely fit even if iter39 Option
-	 * B is reverted (Hi10P + Main10 back in enumeration → 13 total
-	 * with AV1, the `< MAX - 1` guard then needs MAX ≥ 14).
+	 * ampere-av1-enablement: AV1 routes to vpu981 (advertised via the
+	 * new video_fd_vpu981 slot). V4L2_REQUEST_MAX_PROFILES=11 is now
+	 * EXACTLY full with this addition. Future profile additions
+	 * require bumping that constant + verifying libva consumers'
+	 * profiles[] sizing.
 	 */
 	found = any_fd_supports_output_format(driver_data, V4L2_PIX_FMT_AV1_FRAME);
 	if (found && index < (V4L2_REQUEST_MAX_PROFILES - 1))
@@ -295,9 +241,7 @@ VAStatus RequestQueryConfigEntrypoints(VADriverContextP context,
 	case VAProfileH264ConstrainedBaseline:
 	case VAProfileH264MultiviewHigh:
 	case VAProfileH264StereoHigh:
-	case VAProfileH264High10:
 	case VAProfileHEVCMain:
-	case VAProfileHEVCMain10:
 	case VAProfileVP8Version0_3:
 	case VAProfileVP9Profile0:
 	case VAProfileAV1Profile0:
@@ -354,16 +298,6 @@ VAStatus RequestGetConfigAttributes(VADriverContextP context, VAProfile profile,
 	for (i = 0; i < attributes_count; i++) {
 		switch (attributes[i].type) {
 		case VAConfigAttribRTFormat:
-			/*
-			 * iter39: 10-bit profiles publish YUV420_10. Profile-
-			 * less query (this is invoked from vaGetConfigAttributes
-			 * before vaCreateConfig) routes off the `profile` arg
-			 * directly — same gating as RequestCreateConfig.
-			 */
-			if (profile == VAProfileH264High10 ||
-			    profile == VAProfileHEVCMain10)
-				attributes[i].value = VA_RT_FORMAT_YUV420_10;
-			else
 			attributes[i].value = VA_RT_FORMAT_YUV420;
 			break;
 		default:
@@ -42,9 +42,6 @@

 #include <hevc-ctrls.h>

-#include "nv15.h"  /* iter40: fallback V4L2_PIX_FMT_NV15 define for Pi 5
-		    * Debian headers that ship NC12 but not NV15. */
-#include "nv12_col128.h"  /* iter40: NC12 detile primitive + UV offset helper */
 #include "utils.h"
 #include "v4l2.h"

@@ -110,55 +107,9 @@ VAStatus RequestCreateContext(VADriverContextP context, VAConfigID config_id,
 	 * the driver_data and is cached across CreateContext cycles. The
 	 * probe doesn't require any prior S_FMT — v4l2_find_format
 	 * enumerates the device's supported formats directly.
-	 *
-	 * iter39: choose NV15 (10-bit packed) for Hi10P / Main10 profiles,
-	 * NV12 (8-bit) otherwise. If the cached video_format doesn't match
-	 * the profile's bit-depth requirement, invalidate and re-probe —
-	 * sibling pattern to iter38's device-switch invalidation in
-	 * request_switch_device_for_profile().
 	 */
-	{
-		bool want_10bit = (config_object->profile == VAProfileH264High10 ||
-				   config_object->profile == VAProfileHEVCMain10);
-		bool is_rpi = (driver_data->video_fd ==
-			       driver_data->video_fd_rpi_hevc_dec);
-		/*
-		 * iter40: per-fd preferred pixelformat. rpi-hevc-dec exposes
-		 * NC12 (8-bit) / NC30 (10-bit), not NV12 / NV15.
-		 */
-		unsigned int want_pixfmt;
-		if (is_rpi)
-			want_pixfmt = want_10bit ? V4L2_PIX_FMT_NV12_10_COL128
-						 : V4L2_PIX_FMT_NV12_COL128;
-		else
-			want_pixfmt = want_10bit ? V4L2_PIX_FMT_NV15
-						 : V4L2_PIX_FMT_NV12;
-		if (driver_data->video_format &&
-		    driver_data->video_format->v4l2_format != want_pixfmt &&
-		    driver_data->video_format->v4l2_format != V4L2_PIX_FMT_SUNXI_TILED_NV12)
-			driver_data->video_format = NULL;
-	}
 	if (!driver_data->video_format) {
-		bool want_10bit = (config_object->profile == VAProfileH264High10 ||
-				   config_object->profile == VAProfileHEVCMain10);
-		bool is_rpi = (driver_data->video_fd ==
-			       driver_data->video_fd_rpi_hevc_dec);
 		video_format = NULL;
-
-		if (is_rpi) {
-			/*
-			 * iter40: rpi-hevc-dec CAPTURE is NC12 (8-bit SAND
-			 * 128-pixel-wide column tile) or NC30 (10-bit variant).
-			 * Direct map; the kernel exposes BOTH formats in
-			 * VIDIOC_ENUM_FMT(CAPTURE_MPLANE) without a pre-SPS
-			 * step (verified Phase 0 strace), so find_format would
-			 * also succeed — skip it for symmetry with the NV15
-			 * iter39 branch below.
-			 */
-			video_format = video_format_find(
-				want_10bit ? V4L2_PIX_FMT_NV12_10_COL128
-					   : V4L2_PIX_FMT_NV12_COL128);
-		} else if (!want_10bit) {
 		found = v4l2_find_format(driver_data->video_fd,
 					 V4L2_BUF_TYPE_VIDEO_CAPTURE,
 					 V4L2_PIX_FMT_SUNXI_TILED_NV12);
@@ -170,19 +121,6 @@ VAStatus RequestCreateContext(VADriverContextP context, VAConfigID config_id,
 					 V4L2_PIX_FMT_NV12);
 		if (found)
 			video_format = video_format_find(V4L2_PIX_FMT_NV12);
-		} else {
-			/*
-			 * iter39 fresnel fix: rkvdec only advertises NV15 in
-			 * VIDIOC_ENUM_FMT(CAPTURE) AFTER S_FMT(OUTPUT) +
-			 * S_EXT_CTRLS(SPS) resolve image_fmt to 420_10BIT.
-			 * Before that, only NV12 is enumerated. Pre-finding
-			 * NV15 always fails. Skip the find_format check and
-			 * directly map to our NV15 video_format entry; the
-			 * later S_FMT(CAPTURE) commits the actual NV15 mode
-			 * once the synthetic SPS sets bit_depth_luma_minus8=2.
-			 */
-			video_format = video_format_find(V4L2_PIX_FMT_NV15);
-		}

 		if (video_format == NULL) {
 			status = VA_STATUS_ERROR_OPERATION_FAILED;
@@ -193,10 +131,6 @@ VAStatus RequestCreateContext(VADriverContextP context, VAConfigID config_id,
 	}
 	video_format = driver_data->video_format;

-	/* iter39: session-wide flag drives image.c reporting + unpack. */
-	driver_data->is_10bit = (config_object->profile == VAProfileH264High10 ||
-				 config_object->profile == VAProfileHEVCMain10);
-
 	output_type = v4l2_type_video_output(video_format->v4l2_mplane);
 	capture_type = v4l2_type_video_capture(video_format->v4l2_mplane);

@@ -241,22 +175,7 @@ VAStatus RequestCreateContext(VADriverContextP context, VAConfigID config_id,
 	 * CAPTURE (sanity read-back, matches what S_FMT committed).
 	 */
 	{
-		/*
-		 * iter40: take the CAPTURE pixelformat from the resolved
-		 * video_format slot — that's per-fd, per-bit-depth correct.
-		 *   rkvdec  8-bit  → NV12
-		 *   rkvdec 10-bit  → NV15
-		 *   hantro  8-bit  → NV12
-		 *   rpi-hevc-dec   → NC12 (V4L2_PIX_FMT_NV12_COL128)
-		 * Pre-iter40 this was hardcoded NV12/NV15 — the rpi-hevc-dec
-		 * fd would then have S_FMT(NV12) issued, and the kernel
-		 * "helpfully" substituted V4L2_PIX_FMT_NV12MT_COL128 (the
-		 * MULTI-PLANE-NON-CONTIGUOUS variant) instead of the
-		 * SINGLE-PLANE NC12 we wanted, breaking cap_pool QUERYBUF
-		 * downstream (Phase 7 iter40 first-run discovery).
-		 */
-		unsigned int capture_pixelformat =
-			driver_data->video_format->v4l2_format;
+		unsigned int capture_pixelformat = V4L2_PIX_FMT_NV12;
 		rc = v4l2_set_format(driver_data->video_fd, capture_type,
 				     capture_pixelformat, picture_width,
 				     picture_height);
@@ -313,42 +232,16 @@ VAStatus RequestCreateContext(VADriverContextP context, VAConfigID config_id,
 	 * the device-init DECODE_MODE + START_CODE block below ALSO uses
 	 * void-cast best-effort, so this is consistent with prior pattern.
 	 */
-	/*
-	 * iter40 (Phase 5 review F6): the synthetic-SPS pre-seed is an
-	 * rkvdec-specific quirk fix (the -EBUSY-on-CAPTURE-busy bug in
-	 * rkvdec_s_ctrl). rpi-hevc-dec does NOT need it and uses a
-	 * different submission ordering (Phase 0 strace: S_FMT_OUTPUT →
-	 * REQBUFS_OUTPUT → S_FMT_CAPTURE → CREATE_BUFS_CAPTURE → STREAMON,
-	 * with per-frame SPS via S_EXT_CTRLS class=0xf010000). Sending a
-	 * stale dummy SPS at context-init time would leave rpi-hevc-dec's
-	 * internal state on the dummy until the first real per-frame SPS
-	 * arrives — exact behavior unknown but a known divergence from
-	 * kdirect.
-	 *
-	 * Skip pre-seed when the active fd is rpi-hevc-dec. rkvdec /
-	 * hantro paths unchanged.
-	 */
-	if (driver_data->video_fd != driver_data->video_fd_rpi_hevc_dec) {
-		/*
-		 * iter39: 10-bit profiles set bit_depth_luma_minus8 = 2 in
-		 * the synthetic SPS so rkvdec's get_image_fmt resolves to
-		 * RKVDEC_IMG_FMT_420_10BIT (per rkvdec-h264-common.c:196 +
-		 * rkvdec-hevc-common.c:467). Image_fmt resolution depends
-		 * only on bit_depth_luma_minus8 and chroma_format_idc;
-		 * profile_idc is ignored for image_fmt and v4l2_ctrl_hevc_sps
-		 * has no profile_idc field at all.
-		 */
-		bool ten = driver_data->is_10bit;
+	{
 		switch (config_object->profile) {
-		case VAProfileHEVCMain:
-		case VAProfileHEVCMain10: {
+		case VAProfileHEVCMain: {
 			struct v4l2_ctrl_hevc_sps dummy_sps;
 			struct v4l2_ext_control dummy_ctrl;

 			memset(&dummy_sps, 0, sizeof(dummy_sps));
 			dummy_sps.chroma_format_idc = 1; /* 4:2:0 */
-			dummy_sps.bit_depth_luma_minus8 = ten ? 2 : 0;
-			dummy_sps.bit_depth_chroma_minus8 = ten ? 2 : 0;
+			dummy_sps.bit_depth_luma_minus8 = 0; /* 8-bit */
+			dummy_sps.bit_depth_chroma_minus8 = 0;
 			dummy_sps.pic_width_in_luma_samples = picture_width;
 			dummy_sps.pic_height_in_luma_samples = picture_height;

@@ -363,20 +256,19 @@ VAStatus RequestCreateContext(VADriverContextP context, VAConfigID config_id,
 		case VAProfileH264High:
 		case VAProfileH264ConstrainedBaseline:
 		case VAProfileH264MultiviewHigh:
-		case VAProfileH264StereoHigh:
-		case VAProfileH264High10: {
+		case VAProfileH264StereoHigh: {
 			struct v4l2_ctrl_h264_sps dummy_sps;
 			struct v4l2_ext_control dummy_ctrl;

 			memset(&dummy_sps, 0, sizeof(dummy_sps));
 			dummy_sps.chroma_format_idc = 1; /* 4:2:0 */
-			dummy_sps.bit_depth_luma_minus8 = ten ? 2 : 0;
-			dummy_sps.bit_depth_chroma_minus8 = ten ? 2 : 0;
+			dummy_sps.bit_depth_luma_minus8 = 0;
+			dummy_sps.bit_depth_chroma_minus8 = 0;
 			dummy_sps.pic_width_in_mbs_minus1 =
 				(picture_width + 15) / 16 - 1;
 			dummy_sps.pic_height_in_map_units_minus1 =
 				(picture_height + 15) / 16 - 1;
-			dummy_sps.profile_idc = ten ? 110 : 100; /* High10 : High */
+			dummy_sps.profile_idc = 100; /* High */
 			dummy_sps.level_idc = 41;
 			/*
 			 * FRAME_MBS_ONLY required: rkvdec_h264_validate_sps
@@ -397,7 +289,7 @@ VAStatus RequestCreateContext(VADriverContextP context, VAConfigID config_id,
 		default:
 			break;
 		}
-	}  /* iter40: end of pre-seed-skip-on-rpi-hevc-dec guard */
+	}

 	destination_planes_count = video_format->planes_count;

@@ -431,40 +323,11 @@ VAStatus RequestCreateContext(VADriverContextP context, VAConfigID config_id,
 	 * changed by BeginPicture's slot acquisition.
 	 */
 	if (video_format->v4l2_buffers_count == 1) {
-		if (video_format->v4l2_format == V4L2_PIX_FMT_NV12_COL128) {
-			/*
-			 * iter40: NC12 SAND layout: Y plane size is
-			 * NUM_COLUMNS * TILE_W * ALIGN(height, 8) (= linear
-			 * NV12 Y for column-aligned widths), UV plane is half.
-			 * The kernel-reported destination_bytesperlines[0] is
-			 * the COLUMN stride (ALIGN(height,8)*3/2), not the
-			 * linear Y stride — using it × format_height gives the
-			 * wrong intra-buffer UV offset (destination_offsets[1]
-			 * derives from destination_sizes[0] in
-			 * surface_fill_format_uniform).
-			 *
-			 * Use format_width/format_height (kernel-returned from
-			 * G_FMT) not picture_width/height (caller request),
-			 * because the kernel applies its own ALIGN rules; the
-			 * UV plane location is keyed off the kernel layout.
-			 */
-			unsigned int uv_off = nv12_col128_uv_plane_offset(
-				format_width, format_height);
-			destination_sizes[0] = uv_off;
-			for (j = 1; j < destination_planes_count; j++)
-				destination_sizes[j] = uv_off / 2;
-			request_log("iter40: NC12 sizes pic=%ux%u fmt=%ux%u bpl=%u uv_off=%u sizeimage(kernel)=%u\n",
-				    picture_width, picture_height,
-				    format_width, format_height,
-				    destination_bytesperlines[0], uv_off,
-				    destination_bytesperlines[0] * format_height);
-		} else {
 		destination_sizes[0] = destination_bytesperlines[0] *
 				       format_height;
 		for (j = 1; j < destination_planes_count; j++)
 			destination_sizes[j] = destination_sizes[0] / 2;
 	}
-	}

 	/*
 	 * iter5b-β Commit D: cache the format-uniform CAPTURE geometry
@@ -537,9 +400,7 @@ VAStatus RequestCreateContext(VADriverContextP context, VAConfigID config_id,
 	 */
 	rc = request_pool_init(&driver_data->output_pool,
 			       driver_data->video_fd, driver_data->media_fd,
-			       output_type, 16, pixelformat,
-			       (unsigned int)picture_width,
-			       (unsigned int)picture_height);
+			       output_type, 16);
 	if (rc < 0) {
 		status = VA_STATUS_ERROR_ALLOCATION_FAILED;
 		goto error;
@@ -599,18 +460,6 @@ VAStatus RequestCreateContext(VADriverContextP context, VAConfigID config_id,
 	 * + ANNEX_B (only supported menu values per Phase 0 v4l2_inventory).
 	 */
 	{
-		/*
-		 * iter40: per-driver HEVC start_code menu value. rkvdec /
-		 * hantro path uses ANNEX_B + start-code-prepended payload.
-		 * rpi-hevc-dec uses NONE — confirmed empirically Phase 7
-		 * (any other mode → V4L2_BUF_FLAG_ERROR on every CAPTURE
-		 * DQBUF, all-zero output). kdirect's strace also shows
-		 * start_code=0 on rpi-hevc-dec. Both are accepted by the
-		 * driver's QUERY_EXT_CTRL menu (min=0 max=1), but only NONE
-		 * actually drives correct decode on the Pi.
-		 */
-		bool is_rpi = (driver_data->video_fd ==
-			       driver_data->video_fd_rpi_hevc_dec);
 		struct v4l2_ext_control hevc_dev_ctrls[2] = {
 			{
 				.id = V4L2_CID_STATELESS_HEVC_DECODE_MODE,
@@ -618,9 +467,7 @@ VAStatus RequestCreateContext(VADriverContextP context, VAConfigID config_id,
 			},
 			{
 				.id = V4L2_CID_STATELESS_HEVC_START_CODE,
-				.value = is_rpi
-					? 0 /* V4L2_STATELESS_HEVC_START_CODE_NONE */
-					: V4L2_STATELESS_HEVC_START_CODE_ANNEX_B,
+				.value = V4L2_STATELESS_HEVC_START_CODE_ANNEX_B,
 			},
 		};
 		(void)v4l2_set_controls(driver_data->video_fd, -1,
@@ -653,30 +500,19 @@ VAStatus RequestCreateContext(VADriverContextP context, VAConfigID config_id,
 	 * commit will replace this hardcoded assignment with a runtime
 	 * read of the kernel's accepted START_CODE value.
 	 */
-	{
-		bool is_rpi = (driver_data->video_fd ==
-			       driver_data->video_fd_rpi_hevc_dec);
 	switch (config_object->profile) {
 	case VAProfileH264Main:
 	case VAProfileH264High:
 	case VAProfileH264ConstrainedBaseline:
 	case VAProfileH264MultiviewHigh:
 	case VAProfileH264StereoHigh:
-			context_object->h264_start_code = true;
-			break;
 	case VAProfileHEVCMain:
-			/* iter40: rpi-hevc-dec rejects start-code-prepended
-			 * payload (DQBUF error flag on every CAPTURE buffer).
-			 * Gate to match the per-driver START_CODE menu value
-			 * set above: NONE on rpi → no prepend; ANNEX_B on
-			 * rkvdec → prepend. */
-			context_object->h264_start_code = !is_rpi;
+		context_object->h264_start_code = true;
 		break;
 	default:
 		context_object->h264_start_code = false;
 		break;
 	}
-	}

 	rc = v4l2_set_stream(driver_data->video_fd, output_type, true);
 	if (rc < 0) {
@@ -800,8 +636,6 @@ VAStatus RequestDestroyContext(VADriverContextP context, VAContextID context_id)
 	 * The next CreateContext re-populates the cache.
 	 */
 	driver_data->fmt_valid = false;
-	/* iter39: clear 10-bit session flag — next CreateContext re-sets. */
-	driver_data->is_10bit = false;

 	return VA_STATUS_SUCCESS;
 }
@@ -827,63 +827,10 @@ int h264_set_controls(struct request_data *driver_data,

 	dpb_update(context, &surface->params.h264.picture);

-	/*
-	 * Dump the raw VAAPI fields at the libva boundary so issue #8
-	 * follow-up can disambiguate "ffmpeg-vaapi didn't populate" from
-	 * "downstream consumer (daedalus_v4l2 wire protocol) corrupted the
-	 * value". One-line; safe to leave in — costs a single printf per frame.
-	 */
-	request_log("h264_set_controls: VAProfile=%d seq_fields=0x%08x pic_fields=0x%08x num_ref_frames=%u bit_depth_luma_m8=%u bit_depth_chroma_m8=%u w_mbs_m1=%u h_mbs_m1=%u\n",
-		    (int)profile,
-		    surface->params.h264.picture.seq_fields.value,
-		    surface->params.h264.picture.pic_fields.value,
-		    surface->params.h264.picture.num_ref_frames,
-		    surface->params.h264.picture.bit_depth_luma_minus8,
-		    surface->params.h264.picture.bit_depth_chroma_minus8,
-		    surface->params.h264.picture.picture_width_in_mbs_minus1,
-		    surface->params.h264.picture.picture_height_in_mbs_minus1);
-
 	h264_va_picture_to_v4l2(driver_data, context, surface,
 				&surface->params.h264.picture,
 				&decode, &pps, &sps);

-	/*
-	 * max_num_ref_frames fallback. Some VAAPI clients (older ffmpeg-vaapi
-	 * paths, some daedalus_v4l2 consumers) leave VAPicture->num_ref_frames
-	 * at zero. Hardware decoders tolerate; libavcodec-via-daedalus enforces
-	 * sps.max_num_ref_frames strictly and rejects every frame.
-	 *
-	 * Count valid DPB entries first (the bitstream-true reference count we
-	 * can see); fall back to a per-profile spec minimum if even that is 0.
-	 * See marfrit/libva-v4l2-request-fourier issue #8.
-	 */
-	if (sps.max_num_ref_frames == 0) {
-		unsigned int valid = 0;
-		unsigned int i;
-		for (i = 0; i < 16; i++) {
-			const VAPictureH264 *ref =
-				&surface->params.h264.picture.ReferenceFrames[i];
-			if (!(ref->flags & VA_PICTURE_H264_INVALID))
-				valid++;
-		}
-		if (valid > 0) {
-			sps.max_num_ref_frames = (uint8_t)valid;
-		} else {
-			switch (profile) {
-			case VAProfileH264ConstrainedBaseline:
-				sps.max_num_ref_frames = 1;
-				break;
-			case VAProfileH264Main:
-			case VAProfileH264High:
-			case VAProfileH264MultiviewHigh:
-			case VAProfileH264StereoHigh:
-			default:
-				sps.max_num_ref_frames = 4;
-				break;
-			}
-		}
-	}
-
 	/*
 	 * Populate the scaling matrix unconditionally: from VAAPI's
 	 * VAIQMatrixBufferH264 when the consumer sent one this frame
@@ -83,18 +83,6 @@
 #include "hevc-ctrls/v4l2-hevc-ext-controls.h"
 #include "h265_parser/gst/codecparsers/gsth265parser.h"

-/*
- * VAAPI source arrays for HEVC ref/weight tables are sized 15
- * (VASliceParameterBufferHEVC::RefPicList[2][15],
- *  delta_luma_weight_l0[15], luma_offset_l0[15], etc. — see
- *  /usr/include/va/va_dec_hevc.h). V4L2_HEVC_DPB_ENTRIES_NUM_MAX
- * is 16; iterating to that bound over-reads the VAAPI source by
- * one element. Hidden by -O3 unrolling but manifests as a SEGV
- * under -O2 vectorisation (regression discovered in package
- * builds 2026-05-17). Cap all per-ref/weight loops at this.
- */
-#define VA_HEVC_REF_LIST_LEN	15
-
 #include "utils.h"
 #include "v4l2.h"

@@ -477,21 +465,13 @@ static void h265_fill_slice_params(VAPictureParameterBufferHEVC *picture,
 	/* Q2: slice_segment_addr from VAAPI (was missing in old h265.c). */
 	slice_params->slice_segment_addr = slice->slice_segment_address;

-	/*
-	 * Ref index arrays (DPB indices). For I-slices both are unused.
-	 *
-	 * Cap iteration at VAAPI source size (15) — V4L2_HEVC_DPB_ENTRIES_NUM_MAX
-	 * is 16, but VASliceParameterBufferHEVC::RefPicList is RefPicList[2][15].
-	 * Iterating to 16 reads one past the source array; with -O2 GCC vectorises
-	 * the copy and the over-read produces a real SEGV (manifested in package
-	 * builds with Arch makepkg CFLAGS, plain -O3 release builds hid it).
-	 */
-	for (i = 0; i < VA_HEVC_REF_LIST_LEN &&
+	/* Ref index arrays (DPB indices). For I-slices both are unused. */
+	for (i = 0; i < V4L2_HEVC_DPB_ENTRIES_NUM_MAX &&
 		    slice_type != V4L2_HEVC_SLICE_TYPE_I; i++) {
 		if (i < (slice->num_ref_idx_l0_active_minus1 + 1U))
 			slice_params->ref_idx_l0[i] = slice->RefPicList[0][i];
 	}
-	for (i = 0; i < VA_HEVC_REF_LIST_LEN &&
+	for (i = 0; i < V4L2_HEVC_DPB_ENTRIES_NUM_MAX &&
 		    slice_type == V4L2_HEVC_SLICE_TYPE_B; i++) {
 		if (i < (slice->num_ref_idx_l1_active_minus1 + 1U))
 			slice_params->ref_idx_l1[i] = slice->RefPicList[1][i];
@@ -523,9 +503,7 @@ static void h265_fill_slice_params(VAPictureParameterBufferHEVC *picture,
 	slice_params->pred_weight_table.delta_chroma_log2_weight_denom =
 		slice->delta_chroma_log2_weight_denom;

-	/* Pred weight tables — cap at VAAPI source array size (15), same
-	 * reason as the RefPicList loops above. */
-	for (i = 0; i < VA_HEVC_REF_LIST_LEN &&
+	for (i = 0; i < V4L2_HEVC_DPB_ENTRIES_NUM_MAX &&
 		    slice_type != V4L2_HEVC_SLICE_TYPE_I; i++) {
 		slice_params->pred_weight_table.delta_luma_weight_l0[i] =
 			slice->delta_luma_weight_l0[i];
@@ -538,7 +516,7 @@ static void h265_fill_slice_params(VAPictureParameterBufferHEVC *picture,
 				slice->ChromaOffsetL0[i][j];
 		}
 	}
-	for (i = 0; i < VA_HEVC_REF_LIST_LEN &&
+	for (i = 0; i < V4L2_HEVC_DPB_ENTRIES_NUM_MAX &&
 		    slice_type == V4L2_HEVC_SLICE_TYPE_B; i++) {
 		slice_params->pred_weight_table.delta_luma_weight_l1[i] =
 			slice->delta_luma_weight_l1[i];
@@ -779,100 +757,6 @@ static int h265_populate_ext_sps_rps_cache(struct request_data *driver_data,
 	return err;
 }

-/*
- * iter40b: parse SPS NAL from source_data to populate the
- * VAAPI-omitted v4l2_ctrl_hevc_sps fields (max_num_reorder_pics,
- * max_latency_increase_plus1, sps_max_sub_layers_minus1, and
- * sps_max_dec_pic_buffering_minus1 at the right sublayer index).
- *
- * Called for the rpi-hevc-dec path only — rkvdec/hantro accept the
- * VAAPI-derived fallback values, rpi-hevc-dec rejects (every CAPTURE
- * DQBUF returns V4L2_BUF_FLAG_ERROR) when they diverge from the
- * bitstream-true values.
- *
- * Cache lives at driver_data->hevc_sps_field_cache, populated from the
- * first IDR frame's SPS NAL and reused for subsequent non-IDR frames
- * whose source_data may not carry an SPS. Same lifecycle as
- * hevc_rps_cache_*.
- *
- * Returns 0 on parse success (cache valid post-call) OR if the cache
- * was already valid from a prior frame; negative on parse failure.
- */
-static int h265_override_sps_from_bitstream(
-	struct request_data *driver_data,
-	struct object_surface *surface_object,
-	struct v4l2_ctrl_hevc_sps *sps)
-{
-	const guint8 *src = surface_object->source_data;
-	gsize         src_size = surface_object->slices_size;
-	GstH265Parser    *parser;
-	GstH265NalUnit    nalu;
-	GstH265SPS        gst_sps;
-	GstH265ParserResult pr;
-	gsize offset = 0;
-	int err = -ENODATA;
-	uint8_t tid;
-
-	parser = gst_h265_parser_new();
-	if (parser == NULL)
-		return -ENOMEM;
-
-	while (offset < src_size) {
-		pr = gst_h265_parser_identify_nalu(parser, src, offset, src_size,
-						   &nalu);
-		if (pr != GST_H265_PARSER_OK && pr != GST_H265_PARSER_NO_NAL_END)
-			break;
-
-		if (nalu.type == GST_H265_NAL_SPS) {
-			memset(&gst_sps, 0, sizeof(gst_sps));
-			pr = gst_h265_parser_parse_sps(parser, &nalu,
-						       &gst_sps, TRUE);
-			if (pr != GST_H265_PARSER_OK)
-				break;
-
-			tid = gst_sps.max_sub_layers_minus1;
-			if (tid >= 7)
-				tid = 0;  /* safety: max_*[] is [7] */
-
-			driver_data->hevc_sps_field_cache.sps_max_sub_layers_minus1 =
-				gst_sps.max_sub_layers_minus1;
-			driver_data->hevc_sps_field_cache.max_dec_pic_buffering_minus1 =
-				gst_sps.max_dec_pic_buffering_minus1[tid];
-			driver_data->hevc_sps_field_cache.max_num_reorder_pics =
-				gst_sps.max_num_reorder_pics[tid];
-			driver_data->hevc_sps_field_cache.max_latency_increase_plus1 =
-				gst_sps.max_latency_increase_plus1[tid];
-			driver_data->hevc_sps_field_cache.scaling_list_enabled =
-				gst_sps.scaling_list_enabled_flag;
-			driver_data->hevc_sps_field_cache.scaling_list_data_present =
-				gst_sps.scaling_list_data_present_flag;
-			driver_data->hevc_sps_field_cache.valid = true;
-			err = 0;
-			break;
-		}
-
-		offset = nalu.offset + nalu.size;
-	}
-
-	gst_h265_parser_free(parser);
-
-	if (err == -ENODATA && driver_data->hevc_sps_field_cache.valid)
-		err = 0;
-
-	if (err == 0 && driver_data->hevc_sps_field_cache.valid) {
-		sps->sps_max_sub_layers_minus1 =
-			driver_data->hevc_sps_field_cache.sps_max_sub_layers_minus1;
-		sps->sps_max_dec_pic_buffering_minus1 =
-			driver_data->hevc_sps_field_cache.max_dec_pic_buffering_minus1;
-		sps->sps_max_num_reorder_pics =
-			driver_data->hevc_sps_field_cache.max_num_reorder_pics;
-		sps->sps_max_latency_increase_plus1 =
-			driver_data->hevc_sps_field_cache.max_latency_increase_plus1;
-	}
-
-	return err;
-}
-
 int h265_set_controls(struct request_data *driver_data,
 		      struct object_context *context_object,
 		      struct object_surface *surface_object)
@@ -926,50 +810,6 @@ int h265_set_controls(struct request_data *driver_data,
 	}

 	h265_fill_sps(picture, &sps);
-	/*
-	 * iter40b: rpi-hevc-dec validates SPS fields VAAPI doesn't
-	 * forward (sps_max_num_reorder_pics, sps_max_latency_increase_plus1)
-	 * against bitstream-true values and rejects the frame when our
-	 * §A.4.2 spec-legal fallback diverges. Parse the SPS NAL from
-	 * source_data and override. Failure is best-effort: if there's no
-	 * SPS in source_data AND the cache is empty, the fallback values
-	 * stay (likely producing the same V4L2_BUF_FLAG_ERROR we're
-	 * trying to fix — but the failure mode is unchanged, not worse).
-	 */
-	{
-		bool is_rpi = (driver_data->video_fd ==
-			       driver_data->video_fd_rpi_hevc_dec);
-		if (is_rpi) {
-			/*
-			 * iter40b: tried SPS NAL parse from source_data —
-			 * ffmpeg-vaapi doesn't include SPS bytes in the
-			 * slice_data buffer (only slice NALs). The parse
-			 * returns -ENODATA every frame, cache stays empty.
-			 *
-			 * Hardcoded fallback derived from kdirect strace for
-			 * libx265 ultrafast 1280x720 testsrc. NoPicReorderingFlag
-			 * hint differentiates 0-reorder from B-frame streams.
-			 * For Phase 7 fixtures the (2, 4) values match kdirect
-			 * bit-exact — proves the SPS divergence axis is closed.
-			 *
-			 * But further ctrl divergences remain unfixed:
-			 * slice_params bit_size + num_entry_point_offsets need
-			 * bitstream-header parse from the slice NAL. Real
-			 * upstream fix: VAAPI extension exposing the parsed
-			 * SPS / slice-header values.
-			 */
-			(void)h265_override_sps_from_bitstream(driver_data,
-							       surface_object,
-							       &sps);
-			if (picture->pic_fields.bits.NoPicReorderingFlag) {
-				sps.sps_max_num_reorder_pics = 0;
-				sps.sps_max_latency_increase_plus1 = 0;
-			} else {
-				sps.sps_max_num_reorder_pics = 2;
-				sps.sps_max_latency_increase_plus1 = 4;
-			}
-		}
-	}
 	h265_fill_pps(picture, &surface_object->params.h265.slices[0], &pps);
 	h265_fill_decode_params(driver_data, picture, &decode_params);
 	h265_fill_scaling_matrix(iqmatrix, iqmatrix_set, &scaling_matrix);
@@ -1014,30 +854,11 @@ int h265_set_controls(struct request_data *driver_data,
 		.ptr = slice_params_array,
 		.size = sizeof(struct v4l2_ctrl_hevc_slice_params) * num_slices,
 	};
-	/*
-	 * iter40b: rpi-hevc-dec's per-frame ctrl set is 4 (no
-	 * scaling_matrix when SPS doesn't enable it). We previously sent
-	 * a zeroed scaling_matrix unconditionally; rpi may interpret that
-	 * as "use the explicit matrix" → wrong decode.
-	 *
-	 * Gate: send scaling_matrix only when the SPS bitstream-parse
-	 * confirmed scaling_list_enabled_flag (rpi path) OR the active
-	 * driver isn't rpi (rkvdec/hantro keep the prior unconditional
-	 * submission behavior — already verified across iter11→iter39).
-	 */
-	{
-		bool is_rpi = (driver_data->video_fd ==
-			       driver_data->video_fd_rpi_hevc_dec);
-		bool send_scaling = !is_rpi ||
-			driver_data->hevc_sps_field_cache.scaling_list_enabled;
-		if (send_scaling) {
 	controls[n++] = (struct v4l2_ext_control){
 		.id = V4L2_CID_STATELESS_HEVC_SCALING_MATRIX,
 		.ptr = &scaling_matrix,
 		.size = sizeof(scaling_matrix),
 	};
-		}
-	}
 	controls[n++] = (struct v4l2_ext_control){
 		.id = V4L2_CID_STATELESS_HEVC_DECODE_PARAMS,
 		.ptr = &decode_params,
@@ -39,8 +39,6 @@

 #include <linux/dma-buf.h>

-#include "nv15.h"
-#include "nv12_col128.h"
 #include "tiled_yuv.h"
 #include "utils.h"
 #include "v4l2.h"
@@ -88,51 +86,14 @@ VAStatus RequestCreateImage(VADriverContextP context, VAImageFormat *format,
 	for (i = 0; i < planes_count; i++)
 		size += destination_sizes[i];

-	if (format->fourcc == VA_FOURCC_P010) {
-		/*
-		 * iter39: P010 image overrides V4L2-side NV15 sizing. The
-		 * source is the kernel-reported NV15 packed plane; the image
-		 * buffer holds dense P010 (2 bytes per pixel, 16bpp).
-		 * Recompute sizes/pitches against P010 layout so consumers
-		 * (vaGetImage, vaDeriveImage) see standard P010 geometry.
-		 */
-		destination_bytesperlines[0] = width * 2;
-		destination_sizes[0] = destination_bytesperlines[0] * format_height;
-		for (i = 1; i < destination_planes_count; i++) {
-			destination_bytesperlines[i] = destination_bytesperlines[0];
-			destination_sizes[i] = destination_sizes[0] / 2;
-		}
-		size = 0;
-		for (i = 0; i < destination_planes_count; i++)
-			size += destination_sizes[i];
-	} else if (format->fourcc == VA_FOURCC_NV12 &&
-		   video_format->v4l2_format == V4L2_PIX_FMT_NV12_COL128) {
-		/*
-		 * iter40 Phase 5 review F2: NC12 source, NV12 image output.
-		 * V4L2-reported destination_bytesperlines[0] is the NC12
-		 * column stride (= ALIGN(height,8) * 3/2 — e.g. 1080 for
-		 * 1280×720), NOT the linear NV12 Y stride. Override to the
-		 * linear stride (width) so VAImage pitches reflect the
-		 * detile-output layout the consumer reads.
-		 */
-		destination_bytesperlines[0] = width;
-		destination_sizes[0] = destination_bytesperlines[0] * format_height;
-		for (i = 1; i < destination_planes_count; i++) {
-			destination_bytesperlines[i] = destination_bytesperlines[0];
-			destination_sizes[i] = destination_sizes[0] / 2;
-		}
-		size = 0;
-		for (i = 0; i < destination_planes_count; i++)
-			size += destination_sizes[i];
-	} else {
-		/* NV12: V4L2 stride is correct, sizes derived from height. */
+	/* Here we calculate the sizes assuming NV12. */
+
 	destination_sizes[0] = destination_bytesperlines[0] * format_height;

 	for (i = 1; i < destination_planes_count; i++) {
 		destination_bytesperlines[i] = destination_bytesperlines[0];
 		destination_sizes[i] = destination_sizes[0] / 2;
 	}
-	}

 	id = object_heap_allocate(&driver_data->image_heap);
 	image_object = IMAGE(driver_data, id);
@@ -255,91 +216,63 @@ static VAStatus copy_surface_to_image (struct request_data *driver_data,
 		}
 	}

+	/*
+	 * AV1 film_grain: when this surface is the display surface of a
+	 * decode (current_display_picture != current_frame with apply_grain=1),
+	 * its slot is NULL because BeginPicture only fired on the decode
+	 * surface. Follow the back-link set in av1_set_controls and borrow
+	 * the decode surface's destination_data + sizes for the copy.
+	 */
+	if (surface_object->current_slot == NULL &&
+	    surface_object->linked_decode_surface_id != VA_INVALID_SURFACE) {
+		struct object_surface *decode_surface =
+			SURFACE(driver_data,
+				surface_object->linked_decode_surface_id);
+		if (decode_surface != NULL &&
+		    decode_surface->current_slot != NULL) {
+			/* Mirror the fields we read below. The surface heap
+			 * pointer is stable for the surface's lifetime; we
+			 * only need destination_data + destination_sizes +
+			 * destination_planes_count from it. */
+			surface_object->destination_planes_count =
+				decode_surface->destination_planes_count;
+			for (i = 0; i < decode_surface->destination_planes_count; i++) {
+				surface_object->destination_data[i] =
+					decode_surface->destination_data[i];
+				surface_object->destination_sizes[i] =
+					decode_surface->destination_sizes[i];
+			}
+		}
+	}
+
 	for (i = 0; i < surface_object->destination_planes_count; i++) {
-		/*
-		 * iter40 Phase 5 review F1: guard extended from __arm__ to
-		 * __arm__ || __aarch64__. Without this, the detile primitives
-		 * silently compiled out on aarch64 (fresnel RK3399, ampere
-		 * RK3588, higgs Pi CM5) and the memcpy fall-through delivered
-		 * raw tiled bytes to NV12/P010 image consumers. iter39 5/5
-		 * PASS masked the issue because no 10-bit path was exercised.
-		 */
-#if defined(__arm__) || defined(__aarch64__)
-		/*
-		 * Sunxi tiled_to_planar lives in tiled_yuv.S which is
-		 * #ifdef __arm__ — symbol absent on aarch64. Keep this
-		 * branch arm-only; aarch64 Sunxi support would need a C or
-		 * aarch64-ASM port (no Sunxi aarch64 board in current fleet).
-		 */
-#if defined(__arm__)
+		/* AV1 Phase 3 diag: log and fail on NULL src/dst, don't deref. */
+		if (buffer_object->data == NULL ||
+		    surface_object->destination_data[i] == NULL) {
+			request_log("copy_surface_to_image NULL i=%u "
+				    "buf_data=%p dest_data=%p dest_size=%u "
+				    "planes=%u slot=%p linked=0x%x\n",
+				    i, (void *)buffer_object->data,
+				    (void *)surface_object->destination_data[i],
+				    surface_object->destination_sizes[i],
+				    surface_object->destination_planes_count,
+				    (void *)surface_object->current_slot,
+				    surface_object->linked_decode_surface_id);
+			return VA_STATUS_ERROR_OPERATION_FAILED;
+		}
+#ifdef __arm__
 		if (!video_format_is_linear(driver_data->video_format))
 			tiled_to_planar(surface_object->destination_data[i],
 					buffer_object->data + image->offsets[i],
 					image->pitches[i], image->width,
 					i == 0 ? image->height :
 						 image->height / 2);
-		else
-#endif
-		if (driver_data->is_10bit &&
-			 image->format.fourcc == VA_FOURCC_P010) {
-			/*
-			 * iter39: rkvdec emits NV15 (4×10-bit packed in 5
-			 * bytes); the VA image buffer is dense P010 (2B/pixel,
-			 * value in bits[15:6]). Source stride is the V4L2-
-			 * reported NV15 bytesperline (= ceil(width/4)*5,
-			 * possibly aligned higher by the kernel); destination
-			 * stride is image->pitches[i] = width * 2.
-			 */
-			unsigned int plane_h = (i == 0) ? image->height
-							: image->height / 2;
-			nv15_unpack_plane_to_p010(
-				surface_object->destination_data[i],
-				(uint16_t *)(buffer_object->data + image->offsets[i]),
-				image->width, plane_h,
-				surface_object->destination_bytesperlines[i]);
-		} else if (driver_data->video_format != NULL &&
-			   driver_data->video_format->v4l2_format ==
-			   V4L2_PIX_FMT_NV12_COL128 &&
-			   image->format.fourcc == VA_FOURCC_NV12) {
-			/*
-			 * iter40: Pi 5 rpi-hevc-dec emits NV12_COL128 (SAND
-			 * 128-pixel-wide column tiles). Detile to linear NV12
-			 * via the per-plane primitive. surface_object->
-			 * destination_data[i] is the V4L2 CAPTURE mmap (single
-			 * buffer, planes_count==2): i==0 is the Y plane base,
-			 * i==1 is the UV plane base offset within the SAME
-			 * physical buffer (per cap_pool plane[1] offset = Y
-			 * plane size in COL128 layout).
-			 *
-			 * src_col_stride = destination_bytesperlines[i] = the
-			 * kernel-reported NC12 bytesperline (column stride,
-			 * = ALIGN(image_h, 8) * 3/2). Same for both planes
-			 * since column geometry is plane-agnostic.
-			 *
-			 * dst stride is image->pitches[i] = image->width
-			 * (overridden in RequestCreateImage NC12 branch below).
-			 */
-			if (i == 0) {
-				nv12_col128_detile_y(
-					(uint8_t *)(buffer_object->data + image->offsets[i]),
-					image->pitches[i],
-					surface_object->destination_data[i],
-					surface_object->destination_bytesperlines[i],
-					image->width, image->height);
-			} else {
-				nv12_col128_detile_uv(
-					(uint8_t *)(buffer_object->data + image->offsets[i]),
-					image->pitches[i],
-					surface_object->destination_data[i],
-					surface_object->destination_bytesperlines[i],
-					image->width, image->height / 2);
-			}
-		} else {
+		else {
 #endif
 			memcpy(buffer_object->data + image->offsets[i],
 			       surface_object->destination_data[i],
 			       surface_object->destination_sizes[i]);
-#if defined(__arm__) || defined(__aarch64__)
+#ifdef __arm__
 		}
 #endif
 	}
@@ -378,17 +311,9 @@ VAStatus RequestDeriveImage(VADriverContextP context, VASurfaceID surface_id,

 	/* Fully populate VAImageFormat to match QueryImageFormats output. */
 	memset(&format, 0, sizeof(format));
-	if (driver_data->is_10bit) {
-		/* iter39: 10-bit session derives a P010 image. NV15-source
-		 * unpack happens in copy_surface_to_image. */
-		format.fourcc = VA_FOURCC_P010;
-		format.byte_order = VA_LSB_FIRST;
-		format.bits_per_pixel = 24;
-	} else {
 	format.fourcc = VA_FOURCC_NV12;
 	format.byte_order = VA_LSB_FIRST;
 	format.bits_per_pixel = 12;
-	}

 	status = RequestCreateImage(context, &format, surface_object->width,
 				    surface_object->height, image);
@@ -423,52 +348,26 @@ VAStatus RequestDeriveImage(VADriverContextP context, VASurfaceID surface_id,
 VAStatus RequestQueryImageFormats(VADriverContextP context,
 				  VAImageFormat *formats, int *formats_count)
 {
-	struct request_data *driver_data = context->pDriverData;
-	int n = 0;

 	/*
-	 * Populate the VAImageFormat fully per VAAPI spec — not just
-	 * .fourcc. Consumers (FFmpeg's hwcontext_vaapi, mpv, Firefox)
-	 * read .byte_order and .bits_per_pixel; leaving them
-	 * uninitialized inherits caller-stack garbage and produces
-	 * non-deterministic behavior. Reference: Mesa's
-	 * gallium/frontends/va/image.c::vlVaQueryImageFormats and
-	 * intel-vaapi-driver's i965_drv_video.c.
+	 * Populate the VAImageFormat fully per VAAPI spec for NV12 —
+	 * not just .fourcc. Consumers (FFmpeg's hwcontext_vaapi, mpv,
+	 * Firefox) read .byte_order and .bits_per_pixel; leaving them
+	 * uninitialized inherits whatever caller-stack garbage is in
+	 * the buffer and produces non-deterministic behavior. Reference:
+	 * Mesa's gallium/frontends/va/image.c::vlVaQueryImageFormats and
+	 * intel-vaapi-driver's i965_drv_video.c — both publish NV12
+	 * with byte_order=VA_LSB_FIRST and bits_per_pixel=12.
 	 *
-	 * iter39: advertise P010 when an active session is 10-bit so
-	 * ffmpeg-vaapi sees a valid 10-bit-compatible entry during
-	 * vaQueryImageFormats. NV12 stays advertised unconditionally so
-	 * the 8-bit catalog query response is unchanged.
+	 * For YUV formats, depth/red_mask/green_mask/blue_mask/alpha_mask
+	 * are not meaningful (those describe RGB bit layouts); leave them
+	 * zeroed via memset before populating.
 	 */
-	memset(&formats[n], 0, sizeof(formats[n]));
-	formats[n].fourcc = VA_FOURCC_NV12;
-	formats[n].byte_order = VA_LSB_FIRST;
-	formats[n].bits_per_pixel = 12;
-	n++;
-
-	/*
-	 * iter39 Option B revert (2026-05-17): P010 advertisement is
-	 * gated on driver_data->is_10bit again. Previously advertised
-	 * unconditionally (63fed87) so ffmpeg-vaapi's early
-	 * vaQueryImageFormats (pre-vaCreateContext) could see it for
-	 * 10-bit profiles — but that broke HEVC 8-bit on fresnel:
-	 * ffmpeg-vaapi picked P010 for the HEVC hwframe pool, EndPicture
-	 * SEGV'd in the .so when the consumer-side P010 expectations met
-	 * an 8-bit NV12 CAPTURE buffer.
-	 * Safe because Option B drops VAProfileHEVCMain10 + Hi10P from
-	 * enumeration — no 10-bit decode pipeline will reach this catalog
-	 * query so the gate-on-is_10bit (which stays false for 8-bit
-	 * profiles) correctly returns NV12-only.
-	 */
-	if (driver_data->is_10bit && n < V4L2_REQUEST_MAX_IMAGE_FORMATS) {
-		memset(&formats[n], 0, sizeof(formats[n]));
-		formats[n].fourcc = VA_FOURCC_P010;
-		formats[n].byte_order = VA_LSB_FIRST;
-		formats[n].bits_per_pixel = 24;
-		n++;
-	}
-
-	*formats_count = n;
+	memset(&formats[0], 0, sizeof(formats[0]));
+	formats[0].fourcc = VA_FOURCC_NV12;
+	formats[0].byte_order = VA_LSB_FIRST;
+	formats[0].bits_per_pixel = 12;
+	*formats_count = 1;

 	return VA_STATUS_SUCCESS;
 }
@@ -22,9 +22,6 @@

 autoconf_data = configuration_data()
 autoconf_data.set('VA_DRIVER_INIT_FUNC', va_driver_init_func)
-if get_option('daedalus_v4l2')
-	autoconf_data.set('HAVE_DAEDALUS_V4L2', 1)
-endif

 autoconf = configure_file(
 	output: 'autoconfig.h',
@@ -55,8 +52,6 @@ sources = [
 	'vp9.c',
 	'av1.c',
 	'codec.c',
-	'nv15.c',
-	'nv12_col128.c',

 	# Vendored GStreamer 1.28.2 H.265 parser + utilities (LGPL v2.1+,
 	# see src/h265_parser/gst_compat.h for sourcing notes + per-iter2
@@ -91,9 +86,8 @@ headers = [
 	'h265.h',
 	'vp8.h',
 	'vp9.h',
+	'av1.h',
 	'codec.h',
-	'nv15.h',
-	'nv12_col128.h',

 	# Internal mirror of Linux 7.0 V4L2 HEVC EXT_SPS_*_RPS UAPI defs
 	# (allows building against pre-7.0 linux-api-headers; redundant
@@ -1,114 +0,0 @@
-/*
- * V4L2_PIX_FMT_NV12_COL128 → linear NV12 detile primitive. Pi 5 / CM5
- * rpi-hevc-dec CAPTURE. iter40 (2026-05-17).
- *
- * Math derived from kernel hevc_d_video.c (size formula) +
- * ffmpeg/Kynesim libavutil/rpi_sand_fn_pw.h (per-pixel offset). The
- * single-stripe fast path memcpy's 128 bytes at a time when an output
- * row falls entirely within one tile column (the common case);
- * straddling rows are split into two memcpy halves.
- *
- * No NEON / SIMD here — correctness first. Each output row generates
- * (width / 128) + ~1 memcpys of up to 128 bytes; for 1920x1080 that's
- * ~17000 small memcpys per frame, fine for Phase 1 PoC.
- */
-
-#include "nv12_col128.h"
-
-#include <string.h>
-
-/*
- * Tile column width in bytes. The 'COL128' name embeds this; if it ever
- * varies, take it from V4L2_PIX_FMT_NV12_COL128's kernel definition.
- */
-#define NC12_TILE_W   128
-
-/*
- * Common Y / UV plane detile — the layout is identical (single-byte per
- * pixel, column-major 128-wide tiles). The only thing that varies is
- * what plane the caller passes in. width here is plane width in bytes
- * (= image width for both Y and CbCr-interleaved NV12 UV); height is
- * plane height in pixels (image height for Y, image height / 2 for UV).
- */
-static void nv12_col128_detile_plane(uint8_t *dst, unsigned int dst_stride,
-                                     const uint8_t *src,
-                                     unsigned int src_col_stride,
-                                     unsigned int width, unsigned int height)
-{
-	unsigned int y, x;
-
-	for (y = 0; y < height; y++) {
-		uint8_t *drow = dst + y * dst_stride;
-		x = 0;
-		while (x < width) {
-			unsigned int col = x / NC12_TILE_W;
-			unsigned int in_col = x % NC12_TILE_W;
-			unsigned int n = NC12_TILE_W - in_col;
-			if (n > width - x)
-				n = width - x;
-			/*
-			 * Source byte = base + col*128*col_stride + y*128 + in_col
-			 * Copy n contiguous bytes (all within this tile column,
-			 * since n is capped at the remaining width-in-column).
-			 */
-			const uint8_t *p = src
-				+ (size_t)col * NC12_TILE_W * src_col_stride
-				+ (size_t)y * NC12_TILE_W
-				+ in_col;
-			memcpy(drow + x, p, n);
-			x += n;
-		}
-	}
-}
-
-void nv12_col128_detile_y(uint8_t *dst, unsigned int dst_stride,
-                          const uint8_t *src_y, unsigned int src_col_stride,
-                          unsigned int width, unsigned int height)
-{
-	nv12_col128_detile_plane(dst, dst_stride, src_y, src_col_stride,
-				 width, height);
-}
-
-void nv12_col128_detile_uv(uint8_t *dst, unsigned int dst_stride,
-                           const uint8_t *src_uv, unsigned int src_col_stride,
-                           unsigned int width, unsigned int uv_height)
-{
-	/* UV plane (CbCr interleaved): byte-width equals Y-plane width
-	 * (one Cb + one Cr per 2x2 Y block → 2 bytes per 2 horizontal Y
-	 * samples → 1 byte per Y pixel horizontally). Height is half. */
-	nv12_col128_detile_plane(dst, dst_stride, src_uv, src_col_stride,
-				 width, uv_height);
-}
-
-unsigned int nv12_col128_uv_plane_offset(unsigned int image_width,
-                                         unsigned int image_height)
-{
-	unsigned int aligned_h = (image_height + 7) & ~7u;
-
-	/*
-	 * In the COL128 SAND layout, Y and UV are NOT separate planes
-	 * concatenated end-to-end. Within EACH 128-pixel-wide column:
-	 *   first 128 * height bytes  = Y data for this column strip
-	 *   next  128 * height / 2 bytes = UV data for this column strip
-	 *   total 128 * bytesperline (= 128 * height * 3/2) bytes per column
-	 *
-	 * The "UV plane base" pointer (data[1] in AVFrame convention) is
-	 * just data[0] + (128 * height) — the offset of the UV bytes
-	 * WITHIN the first column. All subsequent UV bytes are reached by
-	 * the same column-stride arithmetic the Y plane uses (col *
-	 * 128 * bytesperline + y * 128 + in_col), so passing this offset
-	 * pointer + iterating y over [0, height/2) traverses all UV rows
-	 * across all columns correctly.
-	 *
-	 * Earlier wrong formula was num_columns * 128 * aligned_h (i.e.
-	 * sizeof(linear Y plane)) — that pushed past the end of the SAND
-	 * buffer because the layout isn't planes-end-to-end.
-	 *
-	 * Cross-check: kernel sizeimage = bytesperline * width =
-	 * (aligned_h * 3/2) * num_columns * 128 = num_columns * 128 *
-	 * aligned_h * 3/2. Per column: 128 * aligned_h * 3/2. Y portion
-	 * per column: 128 * aligned_h. UV portion per column: half of Y.
-	 * Sum across columns: matches sizeimage.
-	 */
-	return NC12_TILE_W * aligned_h;
-}
@@ -1,88 +0,0 @@
-/*
- * V4L2_PIX_FMT_NV12_COL128 (NC12) SAND-tiled → linear NV12 detile.
- *
- * Pi 5 / CM5 (BCM2712) rpi-hevc-dec CAPTURE format. iter40 (2026-05-17).
- *
- * Layout (kernel drivers/media/platform/raspberrypi/hevc_dec/hevc_d_video.c
- * size-formula + ffmpeg/Kynesim libavutil/rpi_sand_fn_pw.h per-pixel
- * offset math):
- *
- *   width  ALIGN(image_width,  128)   -- columns are 128 px wide
- *   height ALIGN(image_height,   8)
- *   col_stride (= bytesperline) = height * 3 / 2
- *               (bytes per [128-wide column] vertical unit incl. Y + UV)
- *   sizeimage  = col_stride * width = total bytes
- *
- *   For pixel (x, y) in the Y plane:
- *     col      = x / 128
- *     in_col_x = x % 128
- *     offset   = col * col_stride * 128 + y * 128 + in_col_x
- *
- *   UV plane starts at offset (128 * height * num_columns_y) — the same
- *   per-column layout, h/2 rows tall (CbCr interleaved).
- *
- * The primitive copies the entire image extent at once. width/height are
- * the cropped consumer-visible dimensions; src_col_stride is the kernel-
- * reported bytesperline (i.e. ALIGN(height,8) * 3/2).
- */
-
-#ifndef _NV12_COL128_H_
-#define _NV12_COL128_H_
-
-#include <stdint.h>
-
-#include <linux/videodev2.h>
-
-/*
- * Pre-Pi-kernel headers (Arch ALARM linux-api-headers, older mainline
- * kernel-headers packages) may not define V4L2_PIX_FMT_NV12_COL128. The
- * fourcc is Pi-specific. Provide a private fallback so the backend
- * builds on hosts that target NON-Pi codecs too.
- */
-#ifndef V4L2_PIX_FMT_NV12_COL128
-#define V4L2_PIX_FMT_NV12_COL128  \
-	((unsigned int)('N') | ((unsigned int)('C') << 8) | \
-	 ((unsigned int)('1') << 16) | ((unsigned int)('2') << 24))
-#endif
-
-#ifndef V4L2_PIX_FMT_NV12_10_COL128
-/* 10-bit SAND variant: 3 pixels packed into 4 bytes in 128-byte / 96-pixel
- * wide columns. iter40 references the fourcc for completeness; the 10-bit
- * Pi 5 HEVC chapter (Main10) is post-iter40. */
-#define V4L2_PIX_FMT_NV12_10_COL128  \
-	((unsigned int)('N') | ((unsigned int)('C') << 8) | \
-	 ((unsigned int)('3') << 16) | ((unsigned int)('0') << 24))
-#endif
-
-/* Detile the Y plane of an NC12 source to a linear NV12 Y plane.
- *   dst         : pointer to linear NV12 Y plane (caller-owned, dst_stride * height bytes)
- *   dst_stride  : linear Y plane stride in bytes (= width for plain NV12)
- *   src_y       : pointer to start of NC12 Y plane (= NC12 buffer base)
- *   src_col_stride: kernel-reported bytesperline (= ALIGN(height,8) * 3/2)
- *   width, height: cropped image dimensions in pixels
- */
-void nv12_col128_detile_y(uint8_t *dst, unsigned int dst_stride,
-                          const uint8_t *src_y, unsigned int src_col_stride,
-                          unsigned int width, unsigned int height);
-
-/* Detile the UV plane (CbCr interleaved, half-height) of an NC12 source.
- *   dst         : pointer to linear NV12 UV plane
- *   dst_stride  : linear UV plane stride in bytes (= width for NV12)
- *   src_uv      : pointer to start of NC12 UV plane (= src_y + Y-plane-size)
- *   src_col_stride: same as Y plane (same column geometry)
- *   width       : Y-plane width in pixels (UV plane has same byte width)
- *   uv_height   : UV plane height = height / 2
- */
-void nv12_col128_detile_uv(uint8_t *dst, unsigned int dst_stride,
-                           const uint8_t *src_uv, unsigned int src_col_stride,
-                           unsigned int width, unsigned int uv_height);
-
-/* Compute the offset of the UV plane within an NC12 buffer.
- *   image_width, image_height: cropped image dimensions in pixels
- *   Returns: byte offset from buffer start to UV plane start
- *           (= 128 * ALIGN(image_height, 8) * num_columns_y)
- */
-unsigned int nv12_col128_uv_plane_offset(unsigned int image_width,
-                                         unsigned int image_height);
-
-#endif /* _NV12_COL128_H_ */
@@ -1,75 +0,0 @@
-/*
- * Copyright (C) 2026 claude-noether <claude-noether@reauktion.de>
- *
- * Permission is hereby granted, free of charge, to any person obtaining a
- * copy of this software and associated documentation files (the
- * "Software"), to deal in the Software without restriction, including
- * without limitation the rights to use, copy, modify, merge, publish,
- * distribute, sub license, and/or sell copies of the Software, and to
- * permit persons to whom the Software is furnished to do so, subject to
- * the following conditions:
- *
- * The above copyright notice and this permission notice (including the
- * next paragraph) shall be included in all copies or substantial portions
- * of the Software.
- *
- * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS
- * OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
- * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NON-INFRINGEMENT.
- * IN NO EVENT SHALL PRECISION INSIGHT AND/OR ITS SUPPLIERS BE LIABLE FOR
- * ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
- * TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
- * SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
- */
-
-#include "nv15.h"
-
-void nv15_unpack_plane_to_p010(const uint8_t *src, uint16_t *dst,
-			       unsigned int width, unsigned int height,
-			       unsigned int src_stride)
-{
-	unsigned int x, y;
-	unsigned int dst_pitch_px = width;
-
-	for (y = 0; y < height; y++) {
-		const uint8_t *s = src + y * src_stride;
-		uint16_t *d = dst + y * dst_pitch_px;
-
-		for (x = 0; x + 4 <= width; x += 4) {
-			uint16_t a = (uint16_t)s[0] | ((uint16_t)(s[1] & 0x03) << 8);
-			uint16_t b = ((uint16_t)s[1] >> 2) | ((uint16_t)(s[2] & 0x0F) << 6);
-			uint16_t c = ((uint16_t)s[2] >> 4) | ((uint16_t)(s[3] & 0x3F) << 4);
-			uint16_t e = ((uint16_t)s[3] >> 6) | ((uint16_t)s[4] << 2);
-
-			d[0] = (uint16_t)(a << 6);
-			d[1] = (uint16_t)(b << 6);
-			d[2] = (uint16_t)(c << 6);
-			d[3] = (uint16_t)(e << 6);
-
-			d += 4;
-			s += 5;
-		}
-
-		if (x < width) {
-			unsigned int rem = width - x;
-			uint16_t pix[4] = { 0, 0, 0, 0 };
-
-			pix[0] = (uint16_t)s[0] | ((uint16_t)(s[1] & 0x03) << 8);
-			if (rem >= 2)
-				pix[1] = ((uint16_t)s[1] >> 2) |
-					 ((uint16_t)(s[2] & 0x0F) << 6);
-			if (rem >= 3)
-				pix[2] = ((uint16_t)s[2] >> 4) |
-					 ((uint16_t)(s[3] & 0x3F) << 4);
-			if (rem >= 4)
-				pix[3] = ((uint16_t)s[3] >> 6) |
-					 ((uint16_t)s[4] << 2);
-
-			{
-				unsigned int j;
-				for (j = 0; j < rem; j++)
-					d[j] = (uint16_t)(pix[j] << 6);
-			}
-		}
-	}
-}
@@ -1,61 +0,0 @@
-/*
- * Copyright (C) 2026 claude-noether <claude-noether@reauktion.de>
- *
- * Permission is hereby granted, free of charge, to any person obtaining a
- * copy of this software and associated documentation files (the
- * "Software"), to deal in the Software without restriction, including
- * without limitation the rights to use, copy, modify, merge, publish,
- * distribute, sub license, and/or sell copies of the Software, and to
- * permit persons to whom the Software is furnished to do so, subject to
- * the following conditions:
- *
- * The above copyright notice and this permission notice (including the
- * next paragraph) shall be included in all copies or substantial portions
- * of the Software.
- *
- * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS
- * OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
- * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NON-INFRINGEMENT.
- * IN NO EVENT SHALL PRECISION INSIGHT AND/OR ITS SUPPLIERS BE LIABLE FOR
- * ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
- * TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
- * SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
- */
-
-#ifndef _NV15_H_
-#define _NV15_H_
-
-#include <stdint.h>
-
-#include <linux/videodev2.h>
-
-/*
- * Older or downstream linux-api-headers / kernel-headers packages may
- * not define V4L2_PIX_FMT_NV15. Provide a fallback so the backend
- * builds on hosts whose headers are pre-NV15-merge or omit it (e.g.
- * Pi 5 Debian trixie 6.12.62 headers include NC12 but not NV15).
- * Same numeric value as mainline.
- */
-#ifndef V4L2_PIX_FMT_NV15
-#define V4L2_PIX_FMT_NV15  \
-	((unsigned int)('N') | ((unsigned int)('V') << 8) | \
-	 ((unsigned int)('1') << 16) | ((unsigned int)('5') << 24))
-#endif
-
-/*
- * Unpack one plane of V4L2_PIX_FMT_NV15 (4 × 10-bit values packed into
- * 5 consecutive bytes, LSB-first) into VA_FOURCC_P010 (16-bit per pixel,
- * value in bits [15:6], zeros in [5:0]).
- *
- * Layout per Documentation/userspace-api/media/v4l/pixfmt-nv15.rst.
- * Call once per plane: luma (W × H, src_stride = ceil(W/4)*5) and chroma
- * (W × H/2 — same width because UV are interleaved 10-bit values).
- *
- * src_stride must be the kernel-reported bytesperline for the NV15 plane.
- * The destination is dense P010 with row pitch = width * 2 bytes.
- */
-void nv15_unpack_plane_to_p010(const uint8_t *src, uint16_t *dst,
-			       unsigned int width, unsigned int height,
-			       unsigned int src_stride);
-
-#endif
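
The NV15 layout described in the removed `nv15.h` comment (four 10-bit samples in five consecutive bytes, LSB-first, unpacked to P010 with the value in bits [15:6]) can be checked on a single group. The following is a minimal sketch, not the backend's code: `nv15_pack4` and `nv15_unpack4_to_p010` are hypothetical helpers introduced here for illustration, and they handle one 5-byte group rather than a whole plane with a stride.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper (illustration only): pack four 10-bit samples into
 * one 5-byte group, LSB-first, i.e. sample i occupies bits [10*i+9 : 10*i]
 * of the 40-bit little-endian group.
 */
static void nv15_pack4(const uint16_t px[4], uint8_t out[5])
{
	uint64_t bits = 0;
	int i;

	for (i = 0; i < 4; i++) {
		assert(px[i] < 1024); /* samples must fit in 10 bits */
		bits |= (uint64_t)px[i] << (10 * i);
	}
	for (i = 0; i < 5; i++)
		out[i] = (uint8_t)(bits >> (8 * i));
}

/*
 * Hypothetical helper (illustration only): unpack one 5-byte group into
 * four P010 samples, each with the 10-bit value in bits [15:6] and zeros
 * in bits [5:0].
 */
static void nv15_unpack4_to_p010(const uint8_t in[5], uint16_t out[4])
{
	uint64_t bits = 0;
	int i;

	for (i = 0; i < 5; i++)
		bits |= (uint64_t)in[i] << (8 * i);
	for (i = 0; i < 4; i++)
		out[i] = (uint16_t)(((bits >> (10 * i)) & 0x3FF) << 6);
}
```

Round-tripping a group through both helpers should return each sample shifted left by 6, which is the same result the per-byte shifts and masks in the removed `nv15_unpack_plane_to_p010` produce for a full group.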
@@ -37,7 +37,6 @@
 #include "vp8.h"
 #include "vp9.h"
 #include "av1.h"
-#include "request_pool.h"

 #include <assert.h>
 #include <stdio.h>
@@ -56,159 +55,6 @@

 #include "autoconfig.h"

-/*
- * iter#15 — issue #15: ensure the in-flight surface's OUTPUT mmap has
- * room for `delta` more bytes appended to slices_size; if not, grow the
- * pool transparently via request_pool_resize.
- *
- * Sequence on overflow:
- *   1. Snapshot the surface's accumulated bytes to a temp heap buffer.
- *   2. Release the surface's OUTPUT pool slot back to FREE (resize
- *      requires no slot be borrowed).
- *   3. Compute new sizeimage = roundup(needed * 2, 4 KiB), and at least
- *      double the current source_size so geometric growth amortises
- *      repeated overruns at the same resolution.
- *   4. Call request_pool_resize.
- *   5. Re-acquire a pool slot (the new pool has fresh indices and fds).
- *   6. Re-mirror surface_object->source_{index,data,size,request_fd}
- *      from the new slot.
- *   7. Restore the saved bytes via memcpy into the new mmap.
- *
- * Returns VA_STATUS_SUCCESS on clean resize (or no resize needed) and
- * VA_STATUS_ERROR_ALLOCATION_FAILED on heap-alloc / V4L2 / kernel
- * failure — the libva client falls back to surface re-creation as
- * before the resize hook landed.
- *
- * NOTE on inline-Sync invariant: RequestEndPicture calls
- * RequestSyncSurface inline, so when codec_store_buffer runs no other
- * pool slot is borrowed across libva-driver-API entry points. The
- * temporary release-then-reacquire of the in-flight slot here keeps
- * that invariant intact across the resize.
- */
-static VAStatus
-codec_store_buffer_ensure_capacity(struct request_data *driver_data,
-				   struct object_surface *surface_object,
-				   size_t need)
-{
-	struct request_pool_slot *slot;
-	uint8_t *save_buf;
-	size_t save_size;
-	unsigned int saved_index;
-	size_t want_sizeimage;
-	unsigned int new_sizeimage;
-	int new_index;
-	int rc;
-
-	if (need <= surface_object->source_size)
-		return VA_STATUS_SUCCESS;
-
-	save_size = surface_object->slices_size;
-	save_buf = NULL;
-	if (save_size > 0) {
-		save_buf = malloc(save_size);
-		if (save_buf == NULL) {
-			request_log("codec_store_buffer_ensure_capacity: malloc(%zu) for resize-save failed\n",
-				    save_size);
-			return VA_STATUS_ERROR_ALLOCATION_FAILED;
-		}
-		memcpy(save_buf, surface_object->source_data, save_size);
-	}
-
-	/*
-	 * Temporarily release the in-flight slot. The slot's V4L2 buffer
-	 * has NOT been QBUF'd yet (QBUF lives in RequestEndPicture, after
-	 * this codec_store_buffer call), so the release is a clean
-	 * busy=false flip; no kernel state is in question.  The slot's
-	 * stale request_fd does not need to be saved — the resize closes
-	 * every slot's fd and the post-resize acquire below re-mirrors a
-	 * fresh slot's request_fd into surface_object->request_fd.
-	 */
-	saved_index = surface_object->source_index;
-	request_pool_release(&driver_data->output_pool, saved_index);
-
-	/*
-	 * Geometric growth: at least 2× the current source_size, but no
-	 * less than 2× the required total — so a single resize covers the
-	 * triggering append plus comfortable headroom for the rest of
-	 * this frame. Round up to a 4 KiB page boundary so the kernel's
-	 * own alignment doesn't waste pages.  Compute in size_t so the
-	 * 2× doubling can't silently wrap at 2 GiB on 32-bit unsigned int
-	 * (sizeimage stays bounded by V4L2's u32, but the doubling target
-	 * could otherwise overflow before the clamp).
-	 */
-	want_sizeimage = need * 2;
-	if (want_sizeimage < (size_t)surface_object->source_size * 2)
-		want_sizeimage = (size_t)surface_object->source_size * 2;
-	if (want_sizeimage > 0x40000000u) /* 1 GiB hard cap — V4L2 sizeimage is u32 */
-		want_sizeimage = 0x40000000u;
-	want_sizeimage = (want_sizeimage + 0xFFFu) & ~(size_t)0xFFFu;
-	new_sizeimage = (unsigned int)want_sizeimage;
-
-	request_log("codec_store_buffer: OUTPUT-pool resize (need %zu > cap %u → new_sizeimage %u)\n",
-		    need, surface_object->source_size, new_sizeimage);
-
-	rc = request_pool_resize(&driver_data->output_pool, new_sizeimage);
-	if (rc < 0) {
-		/*
-		 * Resize failed. The original slot was already released
-		 * above, so surface_object->source_data is now pointing
-		 * at a FREE-but-still-borrowable mmap. Restore the
-		 * surface's slot mirror so EndPicture / DestroyContext
-		 * unwind paths see a consistent (if partial) state.
-		 *
-		 * If the resize aborted early (pre-STREAMOFF), the slot
-		 * is intact: re-acquiring the same index is the inverse
-		 * of the temporary release above. If it aborted later
-		 * (post-teardown), the slot's data/size were zeroed in
-		 * place by request_pool_resize and the re-acquire flips
-		 * busy=true on a dead slot — still safe, because the
-		 * caller will return ERROR_ALLOCATION_FAILED and the
-		 * libva consumer destroys the surface/context.
-		 */
-		(void)request_pool_acquire(&driver_data->output_pool);
-		free(save_buf);
-		return VA_STATUS_ERROR_ALLOCATION_FAILED;
-	}
-
-	new_index = request_pool_acquire(&driver_data->output_pool);
-	if (new_index < 0) {
-		free(save_buf);
-		return VA_STATUS_ERROR_ALLOCATION_FAILED;
-	}
-	slot = request_pool_slot(&driver_data->output_pool,
-				 (unsigned int)new_index);
-	if (slot == NULL) {
-		request_pool_release(&driver_data->output_pool,
-				     (unsigned int)new_index);
-		free(save_buf);
-		return VA_STATUS_ERROR_ALLOCATION_FAILED;
-	}
-
-	surface_object->source_index = slot->index;
-	surface_object->source_data = slot->data;
-	surface_object->source_size = slot->size;
-	surface_object->request_fd = slot->request_fd;
-
-	if (need > surface_object->source_size) {
-		/*
-		 * Kernel rounded the new sizeimage down below what we
-		 * needed — drivers may clamp at their per-codec ceiling.
-		 * Don't corrupt memory; surface the error to libva.
-		 */
-		request_log("codec_store_buffer_ensure_capacity: kernel returned sizeimage %u < required %zu\n",
-			    surface_object->source_size, need);
-		free(save_buf);
-		return VA_STATUS_ERROR_ALLOCATION_FAILED;
-	}
-
-	if (save_buf != NULL) {
-		memcpy(surface_object->source_data, save_buf, save_size);
-		free(save_buf);
-	}
-
-	return VA_STATUS_SUCCESS;
-}
-
 static VAStatus codec_store_buffer(struct request_data *driver_data,
 				   struct object_context *context,
 				   VAProfile profile,
@@ -216,36 +62,16 @@ static VAStatus codec_store_buffer(struct request_data *driver_data,
 				   struct object_buffer *buffer_object)
 {
 	switch (buffer_object->type) {
-	case VASliceDataBufferType: {
+	case VASliceDataBufferType:
 		/*
 		 * Since there is no guarantee that the allocation
 		 * order is the same as the submission order (via
 		 * RenderPicture), we can't use a V4L2 buffer directly
 		 * and have to copy from a regular buffer.
-		 *
-		 * Capacity guard (issue #13 + #15): surface_object->source_data
-		 * points at an OUTPUT-pool mmap of size source_size, negotiated
-		 * at S_FMT time. A stream-level resolution upshift can produce
-		 * a slice larger than this allocation. Each append site below
-		 * computes the post-append running total and calls
-		 * codec_store_buffer_ensure_capacity, which transparently grows
-		 * the OUTPUT pool (request_pool_resize) so the existing memcpy
-		 * has room. The hard error path (VA_STATUS_ERROR_ALLOCATION_FAILED)
-		 * only fires if both the heap save buffer AND the kernel-side
-		 * grow fail — at which point libavcodec recreates the surface.
 		 */
-		size_t need;
-		VAStatus ensure_rc;
-
 		if (context->h264_start_code) {
 			static const char start_code[3] = { 0x00, 0x00, 0x01 };

-			need = (size_t)surface_object->slices_size +
-			       sizeof(start_code);
-			ensure_rc = codec_store_buffer_ensure_capacity(
-				driver_data, surface_object, need);
-			if (ensure_rc != VA_STATUS_SUCCESS)
-				return ensure_rc;
 			memcpy(surface_object->source_data +
 			       surface_object->slices_size,
 			       start_code, sizeof(start_code));
@@ -279,32 +105,19 @@ static VAStatus codec_store_buffer(struct request_data *driver_data,
 			unsigned int header_size =
 				surface_object->params.vp8.picture.pic_fields.bits.key_frame == 0 ?
 					10 : 3;
-			need = (size_t)surface_object->slices_size + header_size;
-			ensure_rc = codec_store_buffer_ensure_capacity(
-				driver_data, surface_object, need);
-			if (ensure_rc != VA_STATUS_SUCCESS)
-				return ensure_rc;
 			memset(surface_object->source_data +
 			       surface_object->slices_size,
 			       0, header_size);
 			surface_object->slices_size += header_size;
 		}
-		{
-			size_t payload = (size_t)buffer_object->size *
-					 buffer_object->count;
-			need = (size_t)surface_object->slices_size + payload;
-			ensure_rc = codec_store_buffer_ensure_capacity(
-				driver_data, surface_object, need);
-			if (ensure_rc != VA_STATUS_SUCCESS)
-				return ensure_rc;
 		memcpy(surface_object->source_data +
 			       surface_object->slices_size,
-			       buffer_object->data, payload);
-			surface_object->slices_size += payload;
-		}
+		       buffer_object->data,
+		       buffer_object->size * buffer_object->count);
+		surface_object->slices_size +=
+			buffer_object->size * buffer_object->count;
 		surface_object->slices_count++;
 		break;
-	}

 	case VAPictureParameterBufferType:
 		switch (profile) {
@@ -320,14 +133,12 @@ static VAStatus codec_store_buffer(struct request_data *driver_data,
 		case VAProfileH264ConstrainedBaseline:
 		case VAProfileH264MultiviewHigh:
 		case VAProfileH264StereoHigh:
-		case VAProfileH264High10:
 			memcpy(&surface_object->params.h264.picture,
 			       buffer_object->data,
 			       sizeof(surface_object->params.h264.picture));
 			break;

 		case VAProfileHEVCMain:
-		case VAProfileHEVCMain10:
 			memcpy(&surface_object->params.h265.picture,
 			       buffer_object->data,
 			       sizeof(surface_object->params.h265.picture));
@@ -349,6 +160,9 @@ static VAStatus codec_store_buffer(struct request_data *driver_data,
 			memcpy(&surface_object->params.av1.picture,
 			       buffer_object->data,
 			       sizeof(surface_object->params.av1.picture));
+			/* Reset per-frame tile group entry array on each new
+			 * picture parameter buffer (start of a new frame). */
+			surface_object->params.av1.num_tile_group_entries = 0;
 			break;

 		default:
@@ -363,14 +177,12 @@ static VAStatus codec_store_buffer(struct request_data *driver_data,
 		case VAProfileH264ConstrainedBaseline:
 		case VAProfileH264MultiviewHigh:
 		case VAProfileH264StereoHigh:
-		case VAProfileH264High10:
 			memcpy(&surface_object->params.h264.slice,
 			       buffer_object->data,
 			       sizeof(surface_object->params.h264.slice));
 			break;

-		case VAProfileHEVCMain:
-		case VAProfileHEVCMain10: {
+		case VAProfileHEVCMain: {
 			unsigned int n = surface_object->params.h265.num_slices;
 			if (n < HEVC_MAX_SLICES_PER_FRAME) {
 				memcpy(&surface_object->params.h265.slices[n],
@@ -398,6 +210,17 @@ static VAStatus codec_store_buffer(struct request_data *driver_data,
 			       sizeof(surface_object->params.vp9.slice));
 			break;

+		case VAProfileAV1Profile0: {
+			unsigned int n = surface_object->params.av1.num_tile_group_entries;
+			if (n < AV1_MAX_TILES) {
+				memcpy(&surface_object->params.av1.tile_group_entries[n],
+				       buffer_object->data,
+				       sizeof(VASliceParameterBufferAV1));
+				surface_object->params.av1.num_tile_group_entries = n + 1;
+			}
+			break;
+		}
+
 		default:
 			break;
 		}
@@ -418,7 +241,6 @@ static VAStatus codec_store_buffer(struct request_data *driver_data,
 		case VAProfileH264ConstrainedBaseline:
 		case VAProfileH264MultiviewHigh:
 		case VAProfileH264StereoHigh:
-		case VAProfileH264High10:
 			memcpy(&surface_object->params.h264.matrix,
 			       buffer_object->data,
 			       sizeof(surface_object->params.h264.matrix));
@@ -426,7 +248,6 @@ static VAStatus codec_store_buffer(struct request_data *driver_data,
 			break;

 		case VAProfileHEVCMain:
-		case VAProfileHEVCMain10:
 			memcpy(&surface_object->params.h265.iqmatrix,
 			       buffer_object->data,
 			       sizeof(surface_object->params.h265.iqmatrix));
@@ -486,7 +307,6 @@ static VAStatus codec_set_controls(struct request_data *driver_data,
 	case VAProfileH264ConstrainedBaseline:
 	case VAProfileH264MultiviewHigh:
 	case VAProfileH264StereoHigh:
-	case VAProfileH264High10:
 		rc = h264_set_controls(driver_data, context, profile,
 				       surface_object);
 		if (rc < 0)
@@ -494,7 +314,6 @@ static VAStatus codec_set_controls(struct request_data *driver_data,
 		break;

 	case VAProfileHEVCMain:
-	case VAProfileHEVCMain10:
 		rc = h265_set_controls(driver_data, context, surface_object);
 		if (rc < 0)
 			return VA_STATUS_ERROR_OPERATION_FAILED;
@@ -511,22 +330,7 @@ static VAStatus codec_set_controls(struct request_data *driver_data,
 		if (rc < 0)
 			return VA_STATUS_ERROR_OPERATION_FAILED;
 		break;
-
 	case VAProfileAV1Profile0:
-		/*
-		 * Populates V4L2_CID_STATELESS_AV1_SEQUENCE from
-		 * VAPictureParameterBufferAV1.  The daedalus_v4l2 daemon
-		 * (issue #11 daemon track) synthesises an OBU_SEQUENCE_HEADER
-		 * from this ctrl and prepends it to the slice bitstream
-		 * before handing it to libavcodec/libdav1d, which otherwise
-		 * cannot parse the (sequence-header-stripped) OUTPUT buffer
-		 * that ffmpeg-vaapi delivers.
-		 *
-		 * On the RK3588 vpu981 hardware path the same SEQUENCE ctrl
-		 * is harmless: vpu981's driver parses the OBU stream
-		 * directly and ignores the ctrl payload, so no per-decoder
-		 * gating is required here.
-		 */
 		rc = av1_set_controls(driver_data, context, surface_object);
 		if (rc < 0)
 			return VA_STATUS_ERROR_OPERATION_FAILED;
@@ -557,6 +361,12 @@ VAStatus RequestBeginPicture(VADriverContextP context, VAContextID context_id,
 	if (surface_object == NULL)
 		return VA_STATUS_ERROR_INVALID_SURFACE;

+	/* AV1 Phase 3 diagnostic: log surface id, bound slot and status. */
+	request_log("BeginPicture id=0x%x prev_slot=%p status=%d\n",
+		    surface_object->base.id,
+		    (void *)surface_object->current_slot,
+		    surface_object->status);
+
 	if (surface_object->status == VASurfaceRendering)
 		RequestSyncSurface(context, surface_id);

@@ -568,9 +378,30 @@ VAStatus RequestBeginPicture(VADriverContextP context, VAContextID context_id,
 	 * first. The new slot is bound and its V4L2 index + mmap pointers
 	 * are mirrored into surface_object->destination_* so the existing
 	 * QBUF/DQBUF/EXPBUF code paths see no behavioral change.
+	 *
+	 * AV1 Phase 3 finding: the LIBVA_SKIP_REBIND=1 experiment (skip
+	 * the unbind on rebind) did not improve the PASS count for the
+	 * av1_larger film_grain stress vector, so the iter2 Fix 3 release
+	 * is NOT the source of the inter-frame divergence. The issue lies
+	 * deeper in ffmpeg-vaapi's AV1 hwaccel: comparing OUTPUT buffers
+	 * byte-for-byte against the patched-ffmpeg-v4l2request reference
+	 * run (LD_LIBRARY_PATH override on a debug libavcodec.so), 7/7
+	 * first EndPicture submissions match and libva has 2 EXTRA ones.
 	 */
 	if (surface_object->current_slot != NULL)
 		surface_unbind_slot(driver_data, surface_object);
+
+	/*
+	 * AV1 Phase 5 review Amendment 4: clear any stale
+	 * linked_decode_surface_id from a prior film_grain display→decode
+	 * link. If ffmpeg-vaapi recycles a former display surface as a
+	 * decode target, BeginPicture binds a fresh slot — but without
+	 * this reset, copy_surface_to_image's link-follow would still
+	 * borrow from the now-stale linked surface and serve wrong data.
+	 * Cleared unconditionally (cheap) so the next AV1 grain frame
+	 * re-establishes the link if needed.
+	 */
+	surface_object->linked_decode_surface_id = VA_INVALID_SURFACE;
 	{
 		struct cap_pool_slot *cap_slot =
 			cap_pool_acquire(&driver_data->capture_pool, surface_id);
@@ -93,10 +93,6 @@
 static const char * const known_decoder_drivers[] = {
 	"rkvdec",
 	"hantro-vpu",
-	"rpi-hevc-dec",  /* iter40: Pi 5 / CM5 stateless HEVC */
-#ifdef HAVE_DAEDALUS_V4L2
-	"daedalus_v4l2", /* phase 8.10: Pi 5 daemon-backed VP9/AV1/H264 */
-#endif
 	"cedrus",
 	"sun4i_csi",
 	NULL
@@ -329,6 +325,37 @@ static bool probe_hevc_ext_sps_rps_controls(int video_fd)
 	return true;
 }

+/*
+ * Inspect a /dev/videoN's OUTPUT formats for `want_pixfmt`. Returns true
+ * iff at least one OUTPUT/OUTPUT_MPLANE format matches.
+ *
+ * Used to discriminate between multiple devices sharing a driver name —
+ * RK3588 has 3 hantro-vpu instances and only one of them is vpu981 (the
+ * dedicated AV1 decoder advertising V4L2_PIX_FMT_AV1_FRAME).
+ */
+static bool video_node_supports_output_fmt(int video_fd, uint32_t want_pixfmt)
+{
+	struct v4l2_fmtdesc desc;
+	const enum v4l2_buf_type types[] = {
+		V4L2_BUF_TYPE_VIDEO_OUTPUT_MPLANE,
+		V4L2_BUF_TYPE_VIDEO_OUTPUT,
+	};
+	unsigned int t, i;
+
+	for (t = 0; t < sizeof(types) / sizeof(types[0]); t++) {
+		for (i = 0; i < 64; i++) {
+			memset(&desc, 0, sizeof desc);
+			desc.index = i;
+			desc.type = types[t];
+			if (ioctl(video_fd, VIDIOC_ENUM_FMT, &desc) < 0)
+				break;
+			if (desc.pixelformat == want_pixfmt)
+				return true;
+		}
+	}
+	return false;
+}
+
 static int find_decoder_device_by_driver(const char *want_driver,
 					 char *video_out, size_t video_out_sz,
 					 char *media_out, size_t media_out_sz)
@@ -376,6 +403,65 @@ static int find_decoder_device_by_driver(const char *want_driver,
 	return -1;
 }

+/*
+ * ampere-av1-enablement Phase 2: like find_decoder_device_by_driver, but
+ * additionally verifies that the resolved /dev/videoN advertises
+ * `want_pixfmt` as an OUTPUT format. Required on RK3588, where 3 hantro-vpu
+ * instances share the driver name but only one is vpu981 (AV1 decoder).
+ *
+ * Walks /dev/media0../dev/media15 for a matching driver name and takes the
+ * first hit whose OUTPUT formats include `want_pixfmt`. Non-matching
+ * candidates (encoder-only nodes, legacy hantro for MPEG2/VP8) are skipped.
+ */
+static int find_decoder_device_by_driver_with_fmt(const char *want_driver,
+						  uint32_t want_pixfmt,
+						  char *video_out,
+						  size_t video_out_sz,
+						  char *media_out,
+						  size_t media_out_sz)
+{
+	struct media_device_info info;
+	char path[32];
+	char vpath[32];
+	int fd, vfd, i;
+
+	for (i = 0; i < 16; i++) {
+		snprintf(path, sizeof path, "/dev/media%d", i);
+		fd = open(path, O_RDWR | O_NONBLOCK);
+		if (fd < 0)
+			continue;
+		memset(&info, 0, sizeof info);
+		if (ioctl(fd, MEDIA_IOC_DEVICE_INFO, &info) != 0) {
+			close(fd);
+			continue;
+		}
+		if (strcmp(info.driver, want_driver) != 0) {
+			close(fd);
+			continue;
+		}
+		if (find_decoder_video_node_via_topology(fd, vpath,
+							 sizeof vpath) != 0) {
+			close(fd);
+			continue;
+		}
+		close(fd);
+
+		/* Capability check: does this /dev/videoN advertise the
+		 * codec-specific OUTPUT format? */
+		vfd = open(vpath, O_RDWR | O_NONBLOCK);
+		if (vfd < 0)
+			continue;
+		if (video_node_supports_output_fmt(vfd, want_pixfmt)) {
+			close(vfd);
+			snprintf(video_out, video_out_sz, "%s", vpath);
+			snprintf(media_out, media_out_sz, "%s", path);
+			return 0;
+		}
+		close(vfd);
+	}
+	return -1;
+}
+
 static int find_codec_device(char *video_out, size_t video_out_sz,
 			     char *media_out, size_t media_out_sz)
 {
@@ -413,15 +499,7 @@ char request_device_kind_for_profile(VAProfile profile)
 	case VAProfileVP8Version0_3:
 		return 'h';
 	case VAProfileAV1Profile0:
-		/*
-		 * ampere-av1-enablement Phase 2: RK3588 vpu981 dedicated
-		 * AV1 hantro instance. 'a' kind dispatches to
-		 * driver_data->video_fd_vpu981. On hosts without the AV1
-		 * instance the fd stays -1 and RequestQueryConfigProfiles
-		 * never enumerates AV1, so this branch is unreachable for
-		 * non-RK3588 hosts.
-		 */
-		return 'a';
+		return 'a';   /* ampere-av1-enablement: vpu981 dedicated AV1 */
 	default:
 		return '?';
 	}
@@ -445,77 +523,15 @@ int request_switch_device_for_profile(struct request_data *driver_data,
 	char kind = request_device_kind_for_profile(profile);
 	int target_video, target_media;

-	/*
-	 * iter40: HEVC override when rpi-hevc-dec is probed. The static
-	 * table (request_device_kind_for_profile) maps HEVC → 'r' (rkvdec)
-	 * because that's the canonical RK path. On Pi 5 there's no rkvdec
-	 * — rpi-hevc-dec is the only decoder. When BOTH would be present
-	 * (hypothetical mixed board), prefer rpi-hevc-dec for HEVC.
-	 *
-	 * Other rkvdec-routed profiles (VP9, H.264) stay on 'r' because
-	 * rpi-hevc-dec is HEVC-only.
-	 */
-	if ((profile == VAProfileHEVCMain || profile == VAProfileHEVCMain10) &&
-	    driver_data->video_fd_rpi_hevc_dec >= 0 &&
-	    driver_data->media_fd_rpi_hevc_dec >= 0) {
-		kind = 'p';
-	}
-
-#ifdef HAVE_DAEDALUS_V4L2
-	/*
-	 * LIBVA-1: VP9/AV1/H.264 → daedalus_v4l2 when the daemon-backed
-	 * decoder fd is open.  Pi 5 has no rkvdec (those profiles map to
-	 * 'r' by default → video_fd_rkvdec = -1 → "stay on whatever's
-	 * active" fallback would put H.264 frames on rpi-hevc-dec's fd
-	 * and S_FMT would fail).  Re-route to the daedalus daemon instead.
-	 *
-	 * HEVC stays on 'p' (rpi-hevc-dec is HEVC-only — daedalus would
-	 * accept it via FFmpeg, but rpi-hevc-dec has the GPU-backed
-	 * hardware path so it's the right choice on this SoC).
-	 *
-	 * AV1 'a' kind (RK3588 vpu981) wins ONLY if vpu981 was probed.
-	 * On a Pi 5 the vpu981 slot stays -1, so we still route AV1 to
-	 * daedalus here.  Check video_fd_vpu981 to preserve the RK3588
-	 * priority for that case.
-	 */
-	if (driver_data->video_fd_daedalus >= 0 &&
-	    driver_data->media_fd_daedalus >= 0) {
-		switch (profile) {
-		case VAProfileH264Main:
-		case VAProfileH264High:
-		case VAProfileH264ConstrainedBaseline:
-		case VAProfileH264MultiviewHigh:
-		case VAProfileH264StereoHigh:
-		case VAProfileVP9Profile0:
-			kind = 'd';
-			break;
-		case VAProfileAV1Profile0:
-			if (driver_data->video_fd_vpu981 < 0)
-				kind = 'd';
-			break;
-		default:
-			break;
-		}
-	}
-#endif
-
 	if (kind == 'r') {
 		target_video = driver_data->video_fd_rkvdec;
 		target_media = driver_data->media_fd_rkvdec;
 	} else if (kind == 'h') {
 		target_video = driver_data->video_fd_hantro;
 		target_media = driver_data->media_fd_hantro;
-	} else if (kind == 'p') {
-		target_video = driver_data->video_fd_rpi_hevc_dec;
-		target_media = driver_data->media_fd_rpi_hevc_dec;
 	} else if (kind == 'a') {
 		target_video = driver_data->video_fd_vpu981;
 		target_media = driver_data->media_fd_vpu981;
-#ifdef HAVE_DAEDALUS_V4L2
-	} else if (kind == 'd') {
-		target_video = driver_data->video_fd_daedalus;
-		target_media = driver_data->media_fd_daedalus;
-#endif
 	} else {
 		return -1;
 	}
@@ -703,10 +719,6 @@ VAStatus VA_DRIVER_INIT_FUNC(VADriverContextP context)
 	driver_data->media_fd_rkvdec = -1;
 	driver_data->video_fd_hantro = -1;
 	driver_data->media_fd_hantro = -1;
-	driver_data->video_fd_rpi_hevc_dec = -1;
-	driver_data->media_fd_rpi_hevc_dec = -1;
-	driver_data->video_fd_daedalus = -1;
-	driver_data->media_fd_daedalus = -1;
 	driver_data->video_fd_vpu981 = -1;
 	driver_data->media_fd_vpu981 = -1;

@@ -739,36 +751,6 @@ VAStatus VA_DRIVER_INIT_FUNC(VADriverContextP context)
 				alt_driver = "rkvdec";
 				driver_data->video_fd_hantro = video_fd;
 				driver_data->media_fd_hantro = media_fd;
-			} else if (strcmp(info.driver, "rpi-hevc-dec") == 0) {
-				/* iter40 + LIBVA-1: Pi 5 / CM5.  rpi-hevc-dec is
-				 * HEVC-only.  If daedalus_v4l2 is ALSO loaded (Pi 5
-				 * mixed deployment — out-of-tree daemon-backed
-				 * decoder for VP9/AV1/H264), pick it up as the alt
-				 * so VP9/AV1/H264 have somewhere to land. */
-				primary_driver = "rpi-hevc-dec";
-#ifdef HAVE_DAEDALUS_V4L2
-				alt_driver = "daedalus_v4l2";
-#else
-				alt_driver = NULL;
-#endif
-				driver_data->video_fd_rpi_hevc_dec = video_fd;
-				driver_data->media_fd_rpi_hevc_dec = media_fd;
-#ifdef HAVE_DAEDALUS_V4L2
-			} else if (strcmp(info.driver, "daedalus_v4l2") == 0) {
-				/* phase 8.10 + LIBVA-1: Pi 5 daemon-backed decoder.
-				 * VP9 / AV1 / H.264 route through it via the 'd'
-				 * kind below.  On a mixed-driver box where
-				 * rpi-hevc-dec is ALSO loaded, pick it up as the
-				 * alt so HEVC has somewhere to land too — find_
-				 * codec_device's known_decoder_drivers[] order
-				 * normally puts rpi-hevc-dec first (we hit the
-				 * other branch in practice), but symmetric handling
-				 * keeps us correct if probe order ever flips. */
-				primary_driver = "daedalus_v4l2";
-				alt_driver = "rpi-hevc-dec";
-				driver_data->video_fd_daedalus = video_fd;
-				driver_data->media_fd_daedalus = media_fd;
-#endif
 			}
 		}

@@ -780,38 +762,15 @@ VAStatus VA_DRIVER_INIT_FUNC(VADriverContextP context)
 				int alt_v = open(alt_video, O_RDWR | O_NONBLOCK);
 				int alt_m = (alt_v >= 0) ? open(alt_media, O_RDWR | O_NONBLOCK) : -1;
 				if (alt_v >= 0 && alt_m >= 0) {
-					/* Dispatch into the matching per-driver slot.
-					 * iter38 only had rkvdec/hantro pairs; iter40 +
-					 * LIBVA-1 extended this to rpi-hevc-dec and
-					 * daedalus_v4l2 for the Pi 5 mixed-decoder
-					 * deployment. */
 					if (strcmp(alt_driver, "rkvdec") == 0) {
 						driver_data->video_fd_rkvdec = alt_v;
 						driver_data->media_fd_rkvdec = alt_m;
-					} else if (strcmp(alt_driver, "hantro-vpu") == 0) {
+					} else {
 						driver_data->video_fd_hantro = alt_v;
 						driver_data->media_fd_hantro = alt_m;
-					} else if (strcmp(alt_driver, "rpi-hevc-dec") == 0) {
-						driver_data->video_fd_rpi_hevc_dec = alt_v;
-						driver_data->media_fd_rpi_hevc_dec = alt_m;
-#ifdef HAVE_DAEDALUS_V4L2
-					} else if (strcmp(alt_driver, "daedalus_v4l2") == 0) {
-						driver_data->video_fd_daedalus = alt_v;
-						driver_data->media_fd_daedalus = alt_m;
-#endif
-					} else {
-						/* Shouldn't happen — primary_driver branches
-						 * above only set alt_driver to one of the
-						 * names handled here.  Close and move on. */
-						close(alt_v);
-						close(alt_m);
-						alt_v = -1;
-						alt_m = -1;
 					}
-					if (alt_v >= 0) {
 					request_log("iter38: also opened %s decoder at %s + %s\n",
 						    alt_driver, alt_video, alt_media);
-					}
 				} else {
 					if (alt_v >= 0) close(alt_v);
 					if (alt_m >= 0) close(alt_m);
@@ -821,57 +780,36 @@ VAStatus VA_DRIVER_INIT_FUNC(VADriverContextP context)
 		(void)primary_driver;

 		/*
-		 * ampere-av1-enablement Phase 2: walk hantro-vpu media nodes
-		 * for a SECOND one that advertises V4L2_PIX_FMT_AV1_FRAME
-		 * (AV1F) as OUTPUT pixfmt. RK3588 has 3 hantro-vpu instances
-		 * (legacy MPEG2/VP8 decoder, vepu121 encoder, vpu981 AV1
-		 * decoder) all reporting driver="hantro-vpu" / model="hantro-
-		 * vpu" — so OUTPUT-format probe is the only reliable
-		 * disambiguator that doesn't depend on parsing card-name
-		 * strings (which are DTS-dependent). First match wins.
-		 *
-		 * On non-RK3588 hosts the slot stays -1; RequestQueryConfig
-		 * Profiles' AV1 push then no-ops because any_fd_supports_
-		 * output_format() returns false for AV1F.
+		 * ampere-av1-enablement Phase 2 — additionally probe for
+		 * vpu981 (RK3588's dedicated AV1 decoder). Driver name
+		 * "hantro-vpu" alone is ambiguous on RK3588 (3 instances:
+		 * legacy MPEG2/VP8, encoder, vpu981 AV1). Discriminate by
+		 * V4L2_PIX_FMT_AV1_FRAME capability. If the primary or alt
+		 * hantro happens to BE vpu981 (unlikely but possible on
+		 * non-RK3588 boards), this probe finds it again and opens a
+		 * second fd pair for it; RequestTerminate closes each pair.
 		 */
 		{
-			int i;
-			char path[32], av1_video[32];
-
-			for (i = 0; i < 16; i++) {
-				int mfd, vfd;
-				struct media_device_info info;
-
-				snprintf(path, sizeof path, "/dev/media%d", i);
-				mfd = open(path, O_RDWR | O_NONBLOCK);
-				if (mfd < 0) continue;
-				memset(&info, 0, sizeof info);
-				if (ioctl(mfd, MEDIA_IOC_DEVICE_INFO, &info) != 0 ||
-				    strcmp(info.driver, "hantro-vpu") != 0) {
-					close(mfd);
-					continue;
+			static char av1_video[32], av1_media[32];
+			if (find_decoder_device_by_driver_with_fmt(
+				    "hantro-vpu", V4L2_PIX_FMT_AV1_FRAME,
+				    av1_video, sizeof av1_video,
+				    av1_media, sizeof av1_media) == 0) {
+				int av1_v = open(av1_video, O_RDWR | O_NONBLOCK);
+				int av1_m = (av1_v >= 0)
+					? open(av1_media, O_RDWR | O_NONBLOCK)
+					: -1;
+				if (av1_v >= 0 && av1_m >= 0) {
+					driver_data->video_fd_vpu981 = av1_v;
+					driver_data->media_fd_vpu981 = av1_m;
+					request_log(
+					    "ampere-av1: vpu981 AV1 decoder "
+					    "at %s + %s\n",
+					    av1_video, av1_media);
+				} else {
+					if (av1_v >= 0) close(av1_v);
+					if (av1_m >= 0) close(av1_m);
 				}
-				if (find_decoder_video_node_via_topology(
-					    mfd, av1_video, sizeof av1_video) != 0) {
-					close(mfd);
-					continue;
-				}
-				vfd = open(av1_video, O_RDWR | O_NONBLOCK);
-				if (vfd < 0) {
-					close(mfd);
-					continue;
-				}
-				if (!v4l2_find_format(vfd, V4L2_BUF_TYPE_VIDEO_OUTPUT, V4L2_PIX_FMT_AV1_FRAME) &&
-				    !v4l2_find_format(vfd, V4L2_BUF_TYPE_VIDEO_OUTPUT_MPLANE, V4L2_PIX_FMT_AV1_FRAME)) {
-					close(vfd);
-					close(mfd);
-					continue;
-				}
-				driver_data->video_fd_vpu981 = vfd;
-				driver_data->media_fd_vpu981 = mfd;
-				request_log("ampere-av1: vpu981 AV1 decoder at %s + %s\n",
-					    av1_video, path);
-				break;
 			}
 		}
 	}
@@ -886,27 +824,29 @@ VAStatus VA_DRIVER_INIT_FUNC(VADriverContextP context)
 		probe_hevc_ext_sps_rps_controls(driver_data->video_fd_rkvdec);
 	driver_data->has_hevc_ext_sps_rps_hantro =
 		probe_hevc_ext_sps_rps_controls(driver_data->video_fd_hantro);
-	driver_data->has_hevc_ext_sps_rps_rpi_hevc_dec =
-		probe_hevc_ext_sps_rps_controls(driver_data->video_fd_rpi_hevc_dec);
 	if (driver_data->has_hevc_ext_sps_rps_rkvdec) {
 		request_log("iter2: kernel registers HEVC EXT_SPS_{ST,LT}_RPS "
 			    "controls on rkvdec fd (will route through "
 			    "vendored GStreamer parser)\n");
 	}
-	if (driver_data->video_fd_rpi_hevc_dec >= 0) {
-		request_log("iter40: also opened rpi-hevc-dec at video_fd=%d "
-			    "media_fd=%d (Pi 5 HEVC stateless)\n",
-			    driver_data->video_fd_rpi_hevc_dec,
-			    driver_data->media_fd_rpi_hevc_dec);
+
+	/*
+	 * ampere-av1 Phase 2.1: probe V4L2_CID_STATELESS_AV1_FILM_GRAIN
+	 * on the vpu981 fd. Per Janet v3 amendment, this runs at backend
+	 * init (not lazily) so any race window with concurrent device
+	 * switching can't observe an inconsistent flag.
+	 */
+	driver_data->has_av1_film_grain = false;
+	if (driver_data->video_fd_vpu981 >= 0) {
+		struct v4l2_query_ext_ctrl qec;
+		if (v4l2_query_ext_ctrl(driver_data->video_fd_vpu981,
+					V4L2_CID_STATELESS_AV1_FILM_GRAIN,
+					&qec) == 0) {
+			driver_data->has_av1_film_grain = true;
+			request_log("ampere-av1: vpu981 advertises FILM_GRAIN "
+				    "control (will include in per-frame batch)\n");
 		}
-#ifdef HAVE_DAEDALUS_V4L2
-	if (driver_data->video_fd_daedalus >= 0) {
-		request_log("phase 8.10: opened daedalus_v4l2 at video_fd=%d "
-			    "media_fd=%d (Pi 5 daemon-backed VP9/AV1/H264)\n",
-			    driver_data->video_fd_daedalus,
-			    driver_data->media_fd_daedalus);
 	}
-#endif

 	status = VA_STATUS_SUCCESS;
 	goto complete;
@@ -954,23 +894,15 @@ VAStatus RequestTerminate(VADriverContextP context)
 		close(driver_data->video_fd_hantro);
 	if (driver_data->media_fd_hantro >= 0)
 		close(driver_data->media_fd_hantro);
-	if (driver_data->video_fd_rpi_hevc_dec >= 0)
-		close(driver_data->video_fd_rpi_hevc_dec);
-	if (driver_data->media_fd_rpi_hevc_dec >= 0)
-		close(driver_data->media_fd_rpi_hevc_dec);
 	if (driver_data->video_fd_vpu981 >= 0)
 		close(driver_data->video_fd_vpu981);
 	if (driver_data->media_fd_vpu981 >= 0)
 		close(driver_data->media_fd_vpu981);
-#ifdef HAVE_DAEDALUS_V4L2
-	if (driver_data->video_fd_daedalus >= 0)
-		close(driver_data->video_fd_daedalus);
-	if (driver_data->media_fd_daedalus >= 0)
-		close(driver_data->media_fd_daedalus);
-#endif
 	/* Fall back to direct close if neither alt fd captured the active
 	 * pair (env-override path). */
-	if (driver_data->video_fd_rkvdec < 0 && driver_data->video_fd_hantro < 0) {
+	if (driver_data->video_fd_rkvdec < 0 &&
+	    driver_data->video_fd_hantro < 0 &&
+	    driver_data->video_fd_vpu981 < 0) {
 		if (driver_data->video_fd >= 0)
 			close(driver_data->video_fd);
 		if (driver_data->media_fd >= 0)
@@ -42,16 +42,7 @@

 #define V4L2_REQUEST_STR_VENDOR			"v4l2-request"

-/*
- * Sized for max-possible enumeration with iter39 Option B reverted:
- * MPEG2(2) + H264(6 incl. Hi10P) + HEVC(2 incl. Main10) + VP8 + VP9 + AV1 = 13.
- * The per-group guards use `if (... && index < (MAX_PROFILES - N))` where N
- * is the push-group size, so MAX must be ≥ total+1 — 14 here. Bumping
- * defensively now so a future re-enable of Hi10P/Main10 doesn't silently
- * drop AV1 through the off-by-one trap that ate ampere-av1's enumeration
- * for a week (see issue marfrit/libva-v4l2-request-fourier#2).
- */
-#define V4L2_REQUEST_MAX_PROFILES		14
+#define V4L2_REQUEST_MAX_PROFILES		11
 #define V4L2_REQUEST_MAX_ENTRYPOINTS		5
 #define V4L2_REQUEST_MAX_CONFIG_ATTRIBUTES	10
 #define V4L2_REQUEST_MAX_IMAGE_FORMATS		10
@@ -87,45 +78,17 @@ struct request_data {
 	int media_fd_rkvdec;
 	int video_fd_hantro;
 	int media_fd_hantro;
+
 	/*
-	 * iter40: third multi-device-probe slot for rpi-hevc-dec (Pi 5 /
-	 * CM5 / BCM2712). V4L2 stateless HEVC; CAPTURE is NC12/NC30 SAND
-	 * 128-pixel-wide column tiled (Pi-specific). On Pi 5 this is the
-	 * ONLY decoder slot; on RK hosts it stays -1 and HEVC routes to
-	 * rkvdec as before.
-	 */
-	int video_fd_rpi_hevc_dec;
-	int media_fd_rpi_hevc_dec;
-	/*
-	 * phase 8.10: fifth multi-device-probe slot for daedalus_v4l2 — the
-	 * out-of-tree V4L2 stateless decoder shim that forwards bitstream
-	 * to a userspace daemon (daedalus-v4l2 sibling repo). Daemon does
-	 * FFmpeg-software decode for VP9 / AV1 / H.264 and ships pixels
-	 * back via dmabuf into the CAPTURE buffer.  Picked up via the
-	 * same media-controller probe + known_decoder_drivers[] entry
-	 * pattern as iter40 rpi-hevc-dec.  Stays -1 on hosts without the
-	 * daedalus module loaded; HEVC routes to rpi-hevc-dec as before.
+	 * ampere-av1-enablement Phase 2 — vpu981 is a THIRD physical
+	 * hantro-vpu instance on RK3588 (separate from the legacy MPEG2/VP8
+	 * hantro at /dev/video2). It's the dedicated AV1 decoder at
+	 * /dev/video4 with card name "rockchip,rk3588-av1-vpu-dec".
 	 *
-	 * Fields are unconditional (8 bytes per session) so the struct
-	 * layout is stable regardless of meson option.  The active
-	 * probe + dispatch code in request.c is gated by
-	 * HAVE_DAEDALUS_V4L2; when disabled the fields stay at their
-	 * -1 init and no codepath touches them.
-	 */
-	int video_fd_daedalus;
-	int media_fd_daedalus;
-	/*
-	 * ampere-av1-enablement Phase 2: fourth multi-device-probe slot
-	 * for vpu981 (RK3588's dedicated AV1 hantro instance, kernel
-	 * card="rockchip,rk3588-av1-vpu-dec", driver name "hantro-vpu" —
-	 * shared with the legacy MPEG-2/VP8/H.264 hantro). Discriminated
-	 * by V4L2_PIX_FMT_AV1_FRAME (AV1F) OUTPUT-pixfmt capability since
-	 * the driver name alone is ambiguous on RK3588. Stays -1 on hosts
-	 * without the AV1 vpu-dec.
-	 *
-	 * Named "vpu981" for consistency with the in-progress av1-iter1
-	 * operator branch (Phase 3-5 bit-exact AV1 work — when that lands
-	 * these fields receive the actual decode dispatch wiring).
+	 * Driver name alone ("hantro-vpu") is ambiguous on RK3588: three
+	 * instances share the name. The probe discriminates by capability:
+	 * which OUTPUT format does the device advertise? Only vpu981
+	 * exposes V4L2_PIX_FMT_AV1_FRAME.
 	 */
 	int video_fd_vpu981;
 	int media_fd_vpu981;
@@ -149,12 +112,18 @@ struct request_data {
 	 */
 	bool has_hevc_ext_sps_rps_rkvdec;
 	bool has_hevc_ext_sps_rps_hantro;
-	/* iter40: rpi-hevc-dec doesn't expose EXT_SPS_*_RPS controls
-	 * (verified Phase 0 higgs probe: QUERY_EXT_CTRL on 0xa97 → EINVAL).
-	 * Probed for consistency with the iter2 pair-of-flags pattern;
-	 * stays false on Pi 5 and the iter2 vendored-parser path naturally
-	 * doesn't engage. */
-	bool has_hevc_ext_sps_rps_rpi_hevc_dec;
+
+	/*
+	 * ampere-av1 Phase 2.1: probe result for the optional
+	 * V4L2_CID_STATELESS_AV1_FILM_GRAIN control on the vpu981 fd.
+	 * Probed at VA_DRIVER_INIT (per Janet v3 amendment — init-time
+	 * not lazy). Consumed by av1_set_controls to conditionally include
+	 * the 4th control in the per-frame batch.
+	 *
+	 * True iff vpu981 advertises the control via VIDIOC_QUERY_EXT_CTRL.
+	 * False for non-RK3588 hosts (no vpu981 fd) or older kernels.
+	 */
+	bool has_av1_film_grain;

 	/*
 	 * iter2 — cached SPS-derived RPS arrays. SPS NALs only appear in
@@ -179,30 +148,6 @@ struct request_data {
 	unsigned int                          hevc_rps_cache_lt_count;
 	bool                                  hevc_rps_cache_valid;

-	/*
-	 * iter40b: bitstream-derived SPS field cache for VAAPI-omitted
-	 * fields. rpi-hevc-dec validates these against bitstream-true
-	 * values; the rkvdec/hantro fallback (sps_max_dec_pic_buffering_minus1,
-	 * 0) that satisfies §A.4.2 isn't enough for rpi.
-	 *
-	 * Cached on first IDR frame's SPS NAL parse, reused for subsequent
-	 * non-IDR frames whose source_data may not carry an SPS.
-	 *
-	 * sps_max_sub_layers_minus1 is the index into max_*[] arrays. The
-	 * V4L2 SPS struct fields are scalars (single sublayer), so we pick
-	 * the HighestTid (= sps_max_sub_layers_minus1) slot — matches
-	 * ffmpeg-vaapi + kdirect convention.
-	 */
-	struct {
-		bool valid;
-		uint8_t sps_max_sub_layers_minus1;
-		uint8_t max_dec_pic_buffering_minus1;
-		uint8_t max_num_reorder_pics;
-		uint8_t max_latency_increase_plus1;
-		bool scaling_list_enabled;
-		bool scaling_list_data_present;
-	} hevc_sps_field_cache;
-
 	struct video_format *video_format;

 	/*
@@ -259,17 +204,6 @@ struct request_data {
 	unsigned int fmt_buffers_count;
 	unsigned int fmt_sizes[VIDEO_MAX_PLANES];
 	unsigned int fmt_bytesperlines[VIDEO_MAX_PLANES];
-
-	/*
-	 * iter39: active session is decoding a 10-bit profile (Hi10P / Main10).
-	 * Set in RequestCreateContext from config->profile. Drives:
-	 *   - CAPTURE pix_fmt selection (NV15 instead of NV12)
-	 *   - image.c DeriveImage / QueryImageFormats fourcc reporting (P010
-	 *     instead of NV12)
-	 *   - copy_surface_to_image NV15→P010 unpack branch
-	 * Reset to false at DestroyContext.
-	 */
-	bool is_10bit;
 };

 VAStatus VA_DRIVER_INIT_FUNC(VADriverContextP context);
@@ -21,10 +21,7 @@
 #include "v4l2.h"

 int request_pool_init(struct request_pool *pool, int video_fd, int media_fd,
-		      unsigned int output_type, unsigned int count,
-		      unsigned int pixelformat,
-		      unsigned int picture_width,
-		      unsigned int picture_height)
+		      unsigned int output_type, unsigned int count)
 {
 	unsigned int index_base;
 	unsigned int length;
@@ -46,16 +43,6 @@ int request_pool_init(struct request_pool *pool, int video_fd, int media_fd,
 	pool->next = 0;
 	pool->media_fd = media_fd;	/* iter7: kept for force_release re-alloc */

-	/*
-	 * iter#15: cache the S_FMT params so request_pool_resize can
-	 * re-issue S_FMT with a sizeimage hint override on overrun.
-	 */
-	pool->video_fd = video_fd;
-	pool->output_type = output_type;
-	pool->pixelformat = pixelformat;
-	pool->picture_width = picture_width;
-	pool->picture_height = picture_height;
-
 	for (i = 0; i < count; i++)
 		pool->slots[i].request_fd = -1;

@@ -107,118 +94,6 @@ error:
 	return -1;
 }

-int request_pool_resize(struct request_pool *pool,
-			unsigned int new_sizeimage_min)
-{
-	unsigned int index_base;
-	unsigned int length;
-	unsigned int offset;
-	unsigned int saved_count;
-	unsigned int i;
-	int rc;
-
-	if (pool == NULL || !pool->initialized || pool->count == 0)
-		return -1;
-
-	/*
-	 * Pre-condition guard: no slot may be borrowed when we tear the
-	 * pool down. The caller in codec_store_buffer temporarily releases
-	 * the current in-flight surface's slot before invoking us; the
-	 * inline-Sync-in-EndPicture pattern guarantees no other slot is
-	 * borrowed elsewhere in the driver. Bail loudly if anyone breaks
-	 * that invariant rather than corrupting in-flight V4L2 state.
-	 */
-	for (i = 0; i < pool->count; i++) {
-		if (pool->slots[i].busy) {
-			request_log("request_pool_resize: slot %u still busy — "
-				    "caller must release before resize\n", i);
-			return -1;
-		}
-	}
-
-	saved_count = pool->count;
-
-	/* STREAMOFF the OUTPUT queue so REQBUFS(0) is accepted. */
-	rc = v4l2_set_stream(pool->video_fd, pool->output_type, false);
-	if (rc < 0)
-		return -1;
-
-	/*
-	 * Tear down every slot: munmap, close per-slot request_fd. Slot
-	 * fields are zeroed in place so failure halfway is recoverable.
-	 */
-	for (i = 0; i < pool->count; i++) {
-		if (pool->slots[i].data != NULL && pool->slots[i].size > 0) {
-			munmap(pool->slots[i].data, pool->slots[i].size);
-			pool->slots[i].data = NULL;
-			pool->slots[i].size = 0;
-		}
-		if (pool->slots[i].request_fd >= 0) {
-			close(pool->slots[i].request_fd);
-			pool->slots[i].request_fd = -1;
-		}
-	}
-
-	/*
-	 * Release the V4L2 OUTPUT buffer indices. REQBUFS(0) is the only
-	 * way to ask the kernel to free buffers so CREATE_BUFS can re-
-	 * allocate with a new per-buffer sizeimage.
-	 */
-	rc = v4l2_request_buffers(pool->video_fd, pool->output_type, 0);
-	if (rc < 0)
-		return -1;
-
-	/*
-	 * Re-issue S_FMT with the cached dimensions but a larger
-	 * sizeimage. The kernel may round up further (driver-specific
-	 * page / alignment rules); we accept whatever it returns and
-	 * pick that up from per-slot v4l2_query_buffer below.
-	 */
-	rc = v4l2_set_format_sizeimage(pool->video_fd, pool->output_type,
-				       pool->pixelformat,
-				       pool->picture_width,
-				       pool->picture_height,
-				       new_sizeimage_min);
-	if (rc < 0)
-		return -1;
-
-	rc = v4l2_create_buffers(pool->video_fd, pool->output_type,
-				 saved_count, &index_base);
-	if (rc < 0)
-		return -1;
-
-	for (i = 0; i < saved_count; i++) {
-		pool->slots[i].index = index_base + i;
-		pool->slots[i].busy = false;
-
-		rc = v4l2_query_buffer(pool->video_fd, pool->output_type,
-				       pool->slots[i].index,
-				       &length, &offset, 1);
-		if (rc < 0)
-			return -1;
-
-		pool->slots[i].data = mmap(NULL, length,
-					   PROT_READ | PROT_WRITE,
-					   MAP_SHARED, pool->video_fd, offset);
-		if (pool->slots[i].data == MAP_FAILED) {
-			pool->slots[i].data = NULL;
-			return -1;
-		}
-		pool->slots[i].size = length;
-
-		pool->slots[i].request_fd = media_request_alloc(pool->media_fd);
-		if (pool->slots[i].request_fd < 0)
-			return -1;
-	}
-
-	rc = v4l2_set_stream(pool->video_fd, pool->output_type, true);
-	if (rc < 0)
-		return -1;
-
-	pool->next = 0;
-	return 0;
-}
-
 void request_pool_destroy(struct request_pool *pool)
 {
 	unsigned int i;
@@ -52,71 +52,16 @@ struct request_pool {
 	int				 media_fd;	/* iter7: kept for
 							 * force_release re-alloc */
 	bool				 initialized;
-
-	/*
-	 * iter#15: cached S_FMT params from request_pool_init, so
-	 * request_pool_resize can re-S_FMT the OUTPUT queue with a new
-	 * sizeimage override on a mid-session resolution upshift overrun
-	 * without the caller having to re-thread these through six call
-	 * sites. video_fd is also cached so the resize is fully
-	 * self-contained — request_pool_resize takes only the pool and
-	 * the new sizeimage hint.
-	 */
-	int				 video_fd;
-	unsigned int			 output_type;
-	unsigned int			 pixelformat;
-	unsigned int			 picture_width;
-	unsigned int			 picture_height;
 };

 /*
 * Allocate count OUTPUT buffers via VIDIOC_CREATE_BUFS, query and mmap
 * each, populate pool->slots[]. Caller must have already done
- * VIDIOC_S_FMT on the OUTPUT queue. The S_FMT params (pixelformat,
- * picture_width, picture_height) are stashed on the pool so that
- * request_pool_resize can re-issue S_FMT with the same dimensions but
- * a larger sizeimage hint. Returns 0 on success, -1 on failure.
+ * VIDIOC_S_FMT on the OUTPUT queue. Returns 0 on success, -1 on
+ * failure.
 */
 int request_pool_init(struct request_pool *pool, int video_fd, int media_fd,
-		      unsigned int output_type, unsigned int count,
-		      unsigned int pixelformat,
-		      unsigned int picture_width,
-		      unsigned int picture_height);
-
-/*
- * iter#15: grow the OUTPUT pool's per-slot sizeimage in place.
- *
- * Issued from codec_store_buffer when an Annex-B start code / VP8
- * header pad / slice payload won't fit in the current
- * surface->source_size — i.e. the stream's per-frame bitstream budget
- * has outgrown the OUTPUT pool slot's mmap (typical cause: SPS-driven
- * resolution upshift mid-session).
- *
- * Steps:
- *   1. STREAMOFF the OUTPUT queue.
- *   2. munmap every slot, close every per-slot media-request fd.
- *   3. VIDIOC_REQBUFS(count=0) to release the V4L2 buffer indices.
- *   4. S_FMT with the cached pixelformat / picture_width /
- *      picture_height but a sizeimage hint of new_sizeimage_min.
- *   5. CREATE_BUFS with the original slot count.
- *   6. Per-slot: query buffer length, mmap, alloc fresh request_fd.
- *   7. STREAMON.
- *
- * Returns 0 on success, -1 on failure (caller falls back to
- * VA_STATUS_ERROR_ALLOCATION_FAILED — the libva consumer recreates
- * the surface at the new resolution).
- *
- * Pre-condition: NO pool slot is currently borrowed (busy=false on
- * every slot) AND no buffer is in-flight on the OUTPUT queue. The
- * inline-Sync-in-EndPicture pattern (RequestEndPicture calls
- * RequestSyncSurface before returning) makes this trivially true at
- * codec_store_buffer time for the only-supported single-context
- * single-render-surface flow: the in-flight surface's slot is the
- * sole borrowed slot, and the resize caller temporarily releases it
- * before calling here.
- */
-int request_pool_resize(struct request_pool *pool,
-			unsigned int new_sizeimage_min);
+		      unsigned int output_type, unsigned int count);

 /*
 * Munmap all slots and free the slots array. Idempotent.
@@ -111,6 +111,13 @@ void surface_unbind_slot(struct request_data *driver_data,
 {
 	if (surface_object->current_slot == NULL)
 		return;
+	/* AV1 Phase 3 diag: log every unbind with surface id + slot idx
+	 * + status — confirms whether BeginPicture rebind is racing the
+	 * consumer's vaGetImage on the previous frame. */
+	request_log("surface_unbind_slot id=0x%x status=%d slot_idx=%u\n",
+		    surface_object->base.id,
+		    surface_object->status,
+		    surface_object->current_slot->v4l2_index);
 	cap_pool_release(&driver_data->capture_pool, surface_object->current_slot);
 	surface_object->current_slot = NULL;
 }
@@ -182,9 +189,7 @@ VAStatus RequestCreateSurfaces2(VADriverContextP context, unsigned int format,
 	 * surface_bind_format_uniform_fields(); the per-slot
 	 * destination_* fields fill at BeginPicture via surface_bind_slot.
 	 */
-	/* iter39: allow YUV420_10 for Hi10P / Main10 surface allocation. */
-	if (format != VA_RT_FORMAT_YUV420 &&
-	    format != VA_RT_FORMAT_YUV420_10)
+	if (format != VA_RT_FORMAT_YUV420)
 		return VA_STATUS_ERROR_UNSUPPORTED_RT_FORMAT;

 	for (i = 0; i < surfaces_count; i++) {
@@ -194,6 +199,8 @@ VAStatus RequestCreateSurfaces2(VADriverContextP context, unsigned int format,
 			return VA_STATUS_ERROR_ALLOCATION_FAILED;

 		surface_object->current_slot = NULL;	/* iter2 Fix 3 */
+		surface_object->linked_decode_surface_id = VA_INVALID_SURFACE;
+		surface_object->av1_order_hint = 0;
 		surface_object->destination_index = 0;	/* set on bind */
 		surface_object->destination_planes_count = 0;	/* set at CreateContext */
 		surface_object->destination_buffers_count = 0;	/* set at CreateContext */
@@ -708,14 +715,7 @@ VAStatus RequestExportSurfaceHandle(VADriverContextP context,

 	planes_count = surface_object->destination_planes_count;

-	/* iter39: 10-bit session exports a DRM_FORMAT_NV15 buffer; advertise
-	 * the matching fourcc so a PRIME consumer aware of NV15 (panfrost-
-	 * Mesa et al.) can import correctly. PRIME consumers that only know
-	 * NV12 / P010 should use the COPY (vaGetImage) path which unpacks
-	 * NV15→P010 in image.c::copy_surface_to_image. */
-	surface_descriptor->fourcc = driver_data->is_10bit
-		? VA_FOURCC('N', 'V', '1', '5')
-		: VA_FOURCC_NV12;
+	surface_descriptor->fourcc = VA_FOURCC_NV12;
 	surface_descriptor->width = surface_object->width;
 	surface_descriptor->height = surface_object->height;
 	surface_descriptor->num_objects = export_fds_count;
@@ -89,6 +89,33 @@ struct object_surface {

 	struct timeval timestamp;

+	/*
+	 * AV1 Phase 3: for streams with apply_grain=1, VAAPI's
+	 * VADecPictureParameterBufferAV1 carries current_display_picture
+	 * (display-time surface) separate from current_frame (decode
+	 * target). vpu981 HW applies grain inline to the decode CAPTURE
+	 * buffer, so the decoded data lives in current_frame's slot — but
+	 * ffmpeg calls vaGetImage on current_display_picture which has no
+	 * slot bound. linked_decode_surface_id, set in av1_set_controls
+	 * on the display surface, points to the decode surface so
+	 * copy_surface_to_image can borrow its destination_data[].
+	 *
+	 * VA_INVALID_SURFACE = no link (the common case: non-AV1 codecs,
+	 * AV1 with apply_grain=0, AV1 frames where current_frame ==
+	 * current_display_picture).
+	 */
+	VASurfaceID linked_decode_surface_id;
+
+	/*
+	 * AV1 Phase 3: AV1 order_hint of the frame currently decoded into
+	 * this surface. VAAPI's VADecPictureParameterBufferAV1.order_hint
+	 * is per-frame; kernel's v4l2_ctrl_av1_frame.order_hints[8] is
+	 * per-reference. We track each decoded frame's order_hint here so
+	 * the next frame's av1_set_controls can populate order_hints[i]
+	 * from ref_frame_map[i] → SURFACE → av1_order_hint.
+	 */
+	uint8_t av1_order_hint;
+
 	union {
 		struct {
 			VAPictureParameterBufferMPEG2 picture;
@@ -122,17 +149,18 @@ struct object_surface {
 			VADecPictureParameterBufferVP9 picture;
 			VASliceParameterBufferVP9 slice;
 		} vp9;
-		struct {
 		/*
-			 * AV1 picture parameter buffer.  Slice params are
-			 * intentionally absent — the daedalus daemon track
-			 * (issue #11) consumes the slice OBU bytes directly
-			 * from the OUTPUT bitstream and synthesises only the
-			 * sequence-header OBU from V4L2_CID_STATELESS_AV1_
-			 * SEQUENCE.  No per-tile-group struct→OBU re-synthesis
-			 * required from libva today.
+		 * ampere-av1-enablement: AV1 needs picture-header +
+		 * variable number of slice/tile params (one per tile).
+		 * tile_group_entries[] holds parsed VASliceParameterBufferAV1
+		 * entries up to AV1_MAX_TILES; av1.c builds the matching
+		 * v4l2_ctrl_av1_tile_group_entry[] at set_controls time.
 		 */
+		struct {
+#define AV1_MAX_TILES 128
 			VADecPictureParameterBufferAV1 picture;
+			VASliceParameterBufferAV1 tile_group_entries[AV1_MAX_TILES];
+			unsigned int num_tile_group_entries;
 		} av1;
 	} params;

@@ -113,28 +113,6 @@ static void v4l2_setup_format(struct v4l2_format *format, unsigned int type,
 	}
 }

-static void v4l2_setup_format_sizeimage(struct v4l2_format *format,
-					unsigned int type,
-					unsigned int width, unsigned int height,
-					unsigned int pixelformat,
-					unsigned int sizeimage)
-{
-	memset(format, 0, sizeof(*format));
-	format->type = type;
-
-	if (v4l2_type_is_mplane(type)) {
-		format->fmt.pix_mp.width = width;
-		format->fmt.pix_mp.height = height;
-		format->fmt.pix_mp.plane_fmt[0].sizeimage = sizeimage;
-		format->fmt.pix_mp.pixelformat = pixelformat;
-	} else {
-		format->fmt.pix.width = width;
-		format->fmt.pix.height = height;
-		format->fmt.pix.sizeimage = sizeimage;
-		format->fmt.pix.pixelformat = pixelformat;
-	}
-}
-
 bool v4l2_find_format(int video_fd, unsigned int type, unsigned int pixelformat)
 {
 	struct v4l2_fmtdesc fmtdesc;
@@ -194,30 +172,6 @@ int v4l2_set_format(int video_fd, unsigned int type, unsigned int pixelformat,
 	return 0;
 }

-int v4l2_set_format_sizeimage(int video_fd, unsigned int type,
-			      unsigned int pixelformat,
-			      unsigned int width, unsigned int height,
-			      unsigned int sizeimage)
-{
-	struct v4l2_format format;
-	int rc;
-
-	if (sizeimage == 0)
-		return v4l2_set_format(video_fd, type, pixelformat, width, height);
-
-	v4l2_setup_format_sizeimage(&format, type, width, height, pixelformat,
-				    sizeimage);
-
-	rc = ioctl(video_fd, VIDIOC_S_FMT, &format);
-	if (rc < 0) {
-		request_log("Unable to set format (sizeimage=%u) for type %d: %s\n",
-			    sizeimage, type, strerror(errno));
-		return -1;
-	}
-
-	return 0;
-}
-
 int v4l2_get_format(int video_fd, unsigned int type, unsigned int *width,
 		    unsigned int *height, unsigned int *bytesperline,
 		    unsigned int *sizes, unsigned int *planes_count)
@@ -479,6 +433,7 @@ static int v4l2_ioctl_controls(int video_fd, int request_fd, unsigned long ioc,
 			       unsigned int num_controls)
 {
 	struct v4l2_ext_controls controls;
+	int rc;

 	memset(&controls, 0, sizeof(controls));

@@ -490,7 +445,28 @@ static int v4l2_ioctl_controls(int video_fd, int request_fd, unsigned long ioc,
 		controls.request_fd = request_fd;
 	}

-	return ioctl(video_fd, ioc, &controls);
+	rc = ioctl(video_fd, ioc, &controls);
+	if (rc < 0) {
+		/* ampere-av1 Phase 2.1 diag: surface error_idx so the caller's
+		 * error path knows which CID failed validation. error_idx >=
+		 * count means the failure was pre-validation (e.g., bad
+		 * request_fd). errno carries the syscall-level reason. */
+		int saved_errno = errno;	/* request_log may clobber errno */
+		unsigned int failed_cid = 0;
+		unsigned int failed_size = 0;
+		if (controls.error_idx < num_controls) {
+			failed_cid = control_array[controls.error_idx].id;
+			failed_size = control_array[controls.error_idx].size;
+		}
+		request_log("v4l2_ioctl_controls: rc=%d errno=%d (%s) "
+			    "ioc=0x%lx error_idx=%u count=%u "
+			    "failed_cid=0x%x failed_size=%u\n",
+			    rc, saved_errno, strerror(saved_errno), ioc,
+			    controls.error_idx, num_controls,
+			    failed_cid, failed_size);
+		errno = saved_errno;	/* callers still read errno */
+	}
+	return rc;
 }

 int v4l2_get_controls(int video_fd, int request_fd,
@@ -522,35 +498,12 @@ int v4l2_set_controls(int video_fd, int request_fd,
 		      struct v4l2_ext_control *control_array,
 		      unsigned int num_controls)
 {
-	struct v4l2_ext_controls controls;
 	int rc;

-	memset(&controls, 0, sizeof(controls));
-	controls.controls = control_array;
-	controls.count = num_controls;
-	if (request_fd >= 0) {
-		controls.which = V4L2_CTRL_WHICH_REQUEST_VAL;
-		controls.request_fd = request_fd;
-	}
-
-	rc = ioctl(video_fd, VIDIOC_S_EXT_CTRLS, &controls);
+	rc = v4l2_ioctl_controls(video_fd, request_fd, VIDIOC_S_EXT_CTRLS,
+				 control_array, num_controls);
 	if (rc < 0) {
-		/* error_idx is the index of the first failing control;
-		 * if it equals count, the ioctl itself failed (not a
-		 * specific control payload).  Useful for triaging
-		 * which V4L2_CID_STATELESS_* the kernel rejected. */
-		if (controls.error_idx < num_controls)
-			request_log("Unable to set control(s): %s "
-				    "(error_idx=%u/%u failing_ctrl_id=0x%x size=%u)\n",
-				    strerror(errno),
-				    controls.error_idx, controls.count,
-				    control_array[controls.error_idx].id,
-				    control_array[controls.error_idx].size);
-		else
-			request_log("Unable to set control(s): %s "
-				    "(error_idx=%u/%u ioctl-level)\n",
-				    strerror(errno),
-				    controls.error_idx, controls.count);
+		request_log("Unable to set control(s): %s\n", strerror(errno));
 		return -1;
 	}

@@ -36,17 +36,6 @@ bool v4l2_find_format(int video_fd, unsigned int type,
 		      unsigned int pixelformat);
 int v4l2_set_format(int video_fd, unsigned int type, unsigned int pixelformat,
 		    unsigned int width, unsigned int height);
-/*
- * Same as v4l2_set_format but explicitly overrides the OUTPUT
- * sizeimage hint. Pass sizeimage=0 to get the v4l2_set_format default
- * (SOURCE_SIZE_MAX for OUTPUT, 0 for CAPTURE). Used by
- * request_pool_resize on a mid-session bitstream-budget overrun to
- * grow the OUTPUT pool slots past the SOURCE_SIZE_MAX floor.
- */
-int v4l2_set_format_sizeimage(int video_fd, unsigned int type,
-			      unsigned int pixelformat,
-			      unsigned int width, unsigned int height,
-			      unsigned int sizeimage);
 int v4l2_get_format(int video_fd, unsigned int type, unsigned int *width,
 		    unsigned int *height, unsigned int *bytesperline,
 		    unsigned int *sizes, unsigned int *planes_count);
@@ -31,8 +31,6 @@
 #include <drm_fourcc.h>
 #include <linux/videodev2.h>

-#include "nv12_col128.h"  /* fallback V4L2_PIX_FMT_NV12_COL128 define */
-#include "nv15.h"         /* fallback V4L2_PIX_FMT_NV15 define */
 #include "utils.h"
 #include "video.h"

@@ -47,38 +45,6 @@ static struct video_format formats[] = {
 		.planes_count		= 2,
 		.bpp			= 16,
 	},
-	{
-		.description		= "NV15 YUV (10-bit, rkvdec)",
-		.v4l2_format		= V4L2_PIX_FMT_NV15,
-		.v4l2_buffers_count	= 1,
-		.v4l2_mplane		= true,
-		.drm_format		= DRM_FORMAT_NV15,
-		.drm_modifier		= DRM_FORMAT_MOD_NONE,
-		.planes_count		= 2,
-		.bpp			= 24,
-	},
-	{
-		/*
-		 * iter40: Pi 5 / CM5 rpi-hevc-dec CAPTURE format. 8-bit NV12
-		 * stored as 128-pixel-wide column tiles (SAND128 layout).
-		 * Pi-specific; not in mainline drm_fourcc.h (uses NV12 + a
-		 * BROADCOM_SAND128 modifier for DRM_PRIME). Our consumer path
-		 * always detiles to linear NV12 in copy_surface_to_image, so
-		 * we don't expose the SAND modifier downstream — drm_format is
-		 * still DRM_FORMAT_NV12 and drm_modifier MOD_NONE so the
-		 * format-is-linear gate doesn't pull us into tiled_to_planar
-		 * (Sunxi-specific). image.c branches on v4l2_format ==
-		 * V4L2_PIX_FMT_NV12_COL128 to invoke the dedicated detile.
-		 */
-		.description		= "NV12 SAND128 (8-bit, rpi-hevc-dec)",
-		.v4l2_format		= V4L2_PIX_FMT_NV12_COL128,
-		.v4l2_buffers_count	= 1,
-		.v4l2_mplane		= true,
-		.drm_format		= DRM_FORMAT_NV12,
-		.drm_modifier		= DRM_FORMAT_MOD_NONE,
-		.planes_count		= 2,
-		.bpp			= 16,
-	},
 // Code to handle this DRM_FORMAT is __arm__ only
 #ifdef __arm__
 	{
@@ -1,196 +0,0 @@
-/*
- * Copyright (C) 2026 claude-noether <claude-noether@reauktion.de>
- *
- * MIT-licensed per project. iter40 self-test for nv12_col128 detile.
- *
- * Build an NC12-tiled source buffer from a known linear NV12 image,
- * run the detile primitive, assert output matches the original. No
- * hardware needed — pure bit-layout verification of the kernel math
- * (drivers/media/platform/raspberrypi/hevc_dec/hevc_d_video.c
- * V4L2_PIX_FMT_NV12_COL128 case + ffmpeg/Kynesim per-pixel offset).
- *
- * Build:
- *   cc -Wall -Werror -O2 -o test_nv12_col128_detile \
- *      tests/test_nv12_col128_detile.c src/nv12_col128.c
- *
- * Exit 0 = all asserts pass.
- */
-
-#include "../src/nv12_col128.h"
-
-#include <assert.h>
-#include <stdint.h>
-#include <stdio.h>
-#include <stdlib.h>
-#include <string.h>
-
-#define TILE_W 128
-
-static unsigned int align_up(unsigned int v, unsigned int a)
-{
-	return (v + a - 1) & ~(a - 1);
-}
-
-/* Pack a linear plane (width × height bytes, stride=width) into NC12
- * layout: each 128-wide column held contiguously, columns at offsets
- * col * col_stride * 128. col_stride is the kernel-reported bytesperline
- * = ALIGN(height, 8) * 3/2. Returns the buffer + sizes. */
-static uint8_t *pack_to_nc12(const uint8_t *linear,
-			     unsigned int width, unsigned int height,
-			     unsigned int *out_col_stride,
-			     size_t *out_size)
-{
-	unsigned int aligned_w = align_up(width, TILE_W);
-	unsigned int aligned_h = align_up(height, 8);
-	unsigned int col_stride = aligned_h * 3 / 2;
-	unsigned int num_cols = aligned_w / TILE_W;
-	size_t total = (size_t)col_stride * aligned_w;
-	uint8_t *buf;
-	unsigned int col, y, in_col;
-
-	buf = calloc(1, total);
-	assert(buf != NULL);
-
-	for (col = 0; col < num_cols; col++) {
-		uint8_t *col_base = buf + (size_t)col * TILE_W * col_stride;
-		for (y = 0; y < height; y++) {
-			for (in_col = 0; in_col < TILE_W; in_col++) {
-				unsigned int x = col * TILE_W + in_col;
-				if (x >= width)
-					break;
-				col_base[(size_t)y * TILE_W + in_col] =
-					linear[(size_t)y * width + x];
-			}
-		}
-	}
-
-	*out_col_stride = col_stride;
-	*out_size = total;
-	return buf;
-}
-
-static void test_detile_y(unsigned int width, unsigned int height)
-{
-	uint8_t *linear, *tiled, *recovered;
-	unsigned int col_stride;
-	size_t tile_size, i;
-
-	linear = malloc((size_t)width * height);
-	assert(linear != NULL);
-	/* Distinctive content per pixel: y * 17 + x * 13 — avoids byte-
-	 * aliasing patterns that could mask off-by-one bugs. */
-	for (unsigned int y = 0; y < height; y++)
-		for (unsigned int x = 0; x < width; x++)
-			linear[(size_t)y * width + x] = (uint8_t)(y * 17 + x * 13);
-
-	tiled = pack_to_nc12(linear, width, height, &col_stride, &tile_size);
-
-	recovered = calloc(1, (size_t)width * height);
-	assert(recovered != NULL);
-
-	nv12_col128_detile_y(recovered, width, tiled, col_stride, width, height);
-
-	for (i = 0; i < (size_t)width * height; i++) {
-		if (recovered[i] != linear[i]) {
-			fprintf(stderr,
-				"FAIL %ux%u Y: pixel %zu (x=%zu y=%zu) "
-				"linear=0x%02x recovered=0x%02x\n",
-				width, height, i,
-				i % width, i / width,
-				linear[i], recovered[i]);
-			free(linear); free(tiled); free(recovered);
-			exit(1);
-		}
-	}
-	printf("PASS %ux%u Y plane (%u columns, col_stride=%u, tile_size=%zu)\n",
-	       width, height, align_up(width, TILE_W) / TILE_W,
-	       col_stride, tile_size);
-
-	free(linear);
-	free(tiled);
-	free(recovered);
-}
-
-static void test_detile_uv(unsigned int width, unsigned int height)
-{
-	unsigned int uv_h = height / 2;
-	uint8_t *linear, *tiled, *recovered;
-	unsigned int col_stride;
-	size_t tile_size, i;
-
-	linear = malloc((size_t)width * uv_h);
-	assert(linear != NULL);
-	for (unsigned int y = 0; y < uv_h; y++)
-		for (unsigned int x = 0; x < width; x++)
-			linear[(size_t)y * width + x] = (uint8_t)(y * 23 + x * 7);
-
-	tiled = pack_to_nc12(linear, width, uv_h, &col_stride, &tile_size);
-
-	recovered = calloc(1, (size_t)width * uv_h);
-	assert(recovered != NULL);
-
-	nv12_col128_detile_uv(recovered, width, tiled, col_stride, width, uv_h);
-
-	for (i = 0; i < (size_t)width * uv_h; i++) {
-		if (recovered[i] != linear[i]) {
-			fprintf(stderr,
-				"FAIL %ux%u UV: pixel %zu linear=0x%02x recovered=0x%02x\n",
-				width, height, i,
-				linear[i], recovered[i]);
-			free(linear); free(tiled); free(recovered);
-			exit(1);
-		}
-	}
-	printf("PASS %ux%u UV plane\n", width, height);
-
-	free(linear);
-	free(tiled);
-	free(recovered);
-}
-
-static void test_uv_offset(void)
-{
-	/* Per the SAND COL128 layout, Y and UV are interleaved within
-	 * EACH column (not concatenated as separate planes), so the UV
-	 * plane base pointer is offset by 128 * ALIGN(height, 8) — the
-	 * Y portion of column 0. NOT 128 * height * num_columns (the
-	 * size of all Y across all columns), which was an earlier wrong
-	 * formula caught by Phase 7 SEGV on higgs. */
-	unsigned int off = nv12_col128_uv_plane_offset(1280, 720);
-	if (off != 128u * 720) {
-		fprintf(stderr, "FAIL UV offset 1280×720: got %u expected %u\n",
-			off, 128u * 720);
-		exit(1);
-	}
-	printf("PASS UV offset 1280×720 = %u\n", off);
-
-	off = nv12_col128_uv_plane_offset(1366, 768);
-	if (off != 128u * 768) {
-		fprintf(stderr, "FAIL UV offset 1366×768: got %u expected %u\n",
-			off, 128u * 768);
-		exit(1);
-	}
-	printf("PASS UV offset 1366×768 (column-misaligned width)\n");
-}
-
-int main(void)
-{
-	/* Phase 3 fixture sizes — all 128-aligned, 8-line-aligned. */
-	test_detile_y(640, 360);
-	test_detile_y(1280, 720);
-	test_detile_y(1920, 1080);
-
-	/* Phase 5 review F4: column-misaligned width (1366 → 1408 padding). */
-	test_detile_y(1366, 768);
-
-	/* UV plane (half-height) at each width. */
-	test_detile_uv(640, 360);
-	test_detile_uv(1280, 720);
-	test_detile_uv(1920, 1080);
-	test_detile_uv(1366, 768);
-
-	test_uv_offset();
-
-	printf("All NC12 detile asserts pass.\n");
-	return 0;
-}
@@ -1,224 +0,0 @@
-/*
- * Copyright (C) 2026 claude-noether <claude-noether@reauktion.de>
- *
- * Permission is hereby granted, free of charge, to any person obtaining a
- * copy of this software and associated documentation files (the
- * "Software"), to deal in the Software without restriction, including
- * without limitation the rights to use, copy, modify, merge, publish,
- * distribute, sub license, and/or sell copies of the Software, and to
- * permit persons to whom the Software is furnished to do so, subject to
- * the following conditions:
- *
- * The above copyright notice and this permission notice (including the
- * next paragraph) shall be included in all copies or substantial portions
- * of the Software.
- *
- * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS
- * OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
- * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NON-INFRINGEMENT.
- * IN NO EVENT SHALL PRECISION INSIGHT AND/OR ITS SUPPLIERS BE LIABLE FOR
- * ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
- * TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
- * SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
- */
-
-/*
- * iter39 self-test for nv15_unpack_plane_to_p010.
- *
- * Builds NV15 plane buffers from known 10-bit pixel arrays, runs the
- * unpack, asserts P010 output matches the expected pixel<<6 values.
- * No hardware needed — pure bit layout verification per
- * Documentation/userspace-api/media/v4l/pixfmt-nv15.rst.
- *
- * Build:
- *   cc -Wall -Werror -O2 -o test_nv15_unpack tests/test_nv15_unpack.c src/nv15.c
- *
- * Exit 0 = all asserts pass.
- */
-
-#include "../src/nv15.h"
-
-#include <assert.h>
-#include <stdint.h>
-#include <stdio.h>
-#include <stdlib.h>
-#include <string.h>
-
-/* Pack 4 10-bit pixels into 5 bytes per NV15 layout (LSB-first across
- * bits 0..39). Inverse of nv15_unpack_plane_to_p010's per-group unpack. */
-static void pack4(uint16_t a, uint16_t b, uint16_t c, uint16_t d,
-		  uint8_t out[5])
-{
-	out[0] = (uint8_t)(a & 0xFF);
-	out[1] = (uint8_t)(((a >> 8) & 0x03) | ((b & 0x3F) << 2));
-	out[2] = (uint8_t)(((b >> 6) & 0x0F) | ((c & 0x0F) << 4));
-	out[3] = (uint8_t)(((c >> 4) & 0x3F) | ((d & 0x03) << 6));
-	out[4] = (uint8_t)((d >> 2) & 0xFF);
-}
-
-#define ASSERT_EQ(actual, expected, msg) do {				\
-	if ((actual) != (expected)) {					\
-		fprintf(stderr, "FAIL %s: actual=0x%04x expected=0x%04x at %s:%d\n", \
-			(msg), (unsigned)(actual), (unsigned)(expected), \
-			__FILE__, __LINE__);				\
-		exit(1);						\
-	}								\
-} while (0)
-
-static void test_pack_unpack_roundtrip(uint16_t a, uint16_t b, uint16_t c,
-				       uint16_t d)
-{
-	uint8_t packed[5];
-	uint16_t dst[4];
-
-	pack4(a, b, c, d, packed);
-	nv15_unpack_plane_to_p010(packed, dst, 4, 1, 5);
-	ASSERT_EQ(dst[0], (uint16_t)(a << 6), "roundtrip a");
-	ASSERT_EQ(dst[1], (uint16_t)(b << 6), "roundtrip b");
-	ASSERT_EQ(dst[2], (uint16_t)(c << 6), "roundtrip c");
-	ASSERT_EQ(dst[3], (uint16_t)(d << 6), "roundtrip d");
-}
-
-static void test_zero(void)
-{
-	uint8_t packed[5] = { 0, 0, 0, 0, 0 };
-	uint16_t dst[4] = { 0xDEAD, 0xDEAD, 0xDEAD, 0xDEAD };
-	nv15_unpack_plane_to_p010(packed, dst, 4, 1, 5);
-	ASSERT_EQ(dst[0], 0, "zero[0]");
-	ASSERT_EQ(dst[1], 0, "zero[1]");
-	ASSERT_EQ(dst[2], 0, "zero[2]");
-	ASSERT_EQ(dst[3], 0, "zero[3]");
-}
-
-static void test_all_max(void)
-{
-	/* All four pixels = 0x3FF (max 10-bit). Packed bits all 1 → all 0xFF. */
-	uint8_t packed[5] = { 0xFF, 0xFF, 0xFF, 0xFF, 0xFF };
-	uint16_t dst[4] = { 0, 0, 0, 0 };
-	nv15_unpack_plane_to_p010(packed, dst, 4, 1, 5);
-	ASSERT_EQ(dst[0], 0xFFC0, "max[0]");
-	ASSERT_EQ(dst[1], 0xFFC0, "max[1]");
-	ASSERT_EQ(dst[2], 0xFFC0, "max[2]");
-	ASSERT_EQ(dst[3], 0xFFC0, "max[3]");
-}
-
-static void test_known_vectors(void)
-{
-	/* Position-sensitive sanity: each pixel = its index+1. */
-	test_pack_unpack_roundtrip(1, 2, 3, 4);
-	/* Spread patterns that exercise every byte-boundary bit. */
-	test_pack_unpack_roundtrip(0x3FF, 0x000, 0x3FF, 0x000);
-	test_pack_unpack_roundtrip(0x000, 0x3FF, 0x000, 0x3FF);
-	test_pack_unpack_roundtrip(0x155, 0x2AA, 0x155, 0x2AA);
-	test_pack_unpack_roundtrip(0x001, 0x002, 0x004, 0x008);
-	test_pack_unpack_roundtrip(0x080, 0x040, 0x020, 0x010);
-	test_pack_unpack_roundtrip(0x200, 0x100, 0x080, 0x040);
-	test_pack_unpack_roundtrip(0x3F0, 0x0F3, 0x33C, 0x2A5);
-}
-
-static void test_remainder_width(void)
-{
-	/* width=1: only A unpacked, B/C/D undefined */
-	{
-		uint8_t packed[5];
-		uint16_t dst[1] = { 0xDEAD };
-		pack4(0x123, 0x000, 0x000, 0x000, packed);
-		nv15_unpack_plane_to_p010(packed, dst, 1, 1, 5);
-		ASSERT_EQ(dst[0], 0x123 << 6, "rem1[0]");
-	}
-	/* width=2 */
-	{
-		uint8_t packed[5];
-		uint16_t dst[2] = { 0, 0 };
-		pack4(0x111, 0x222, 0x000, 0x000, packed);
-		nv15_unpack_plane_to_p010(packed, dst, 2, 1, 5);
-		ASSERT_EQ(dst[0], 0x111 << 6, "rem2[0]");
-		ASSERT_EQ(dst[1], 0x222 << 6, "rem2[1]");
-	}
-	/* width=3 */
-	{
-		uint8_t packed[5];
-		uint16_t dst[3] = { 0, 0, 0 };
-		pack4(0x111, 0x222, 0x333, 0x000, packed);
-		nv15_unpack_plane_to_p010(packed, dst, 3, 1, 5);
-		ASSERT_EQ(dst[0], 0x111 << 6, "rem3[0]");
-		ASSERT_EQ(dst[1], 0x222 << 6, "rem3[1]");
-		ASSERT_EQ(dst[2], 0x333 << 6, "rem3[2]");
-	}
-	/* width=7: one full group + 3 remainder */
-	{
-		uint8_t packed[10];
-		uint16_t dst[7] = { 0 };
-		pack4(0x100, 0x200, 0x300, 0x010, &packed[0]);
-		pack4(0x011, 0x022, 0x033, 0x000, &packed[5]);
-		nv15_unpack_plane_to_p010(packed, dst, 7, 1, 10);
-		ASSERT_EQ(dst[0], 0x100 << 6, "rem7[0]");
-		ASSERT_EQ(dst[1], 0x200 << 6, "rem7[1]");
-		ASSERT_EQ(dst[2], 0x300 << 6, "rem7[2]");
-		ASSERT_EQ(dst[3], 0x010 << 6, "rem7[3]");
-		ASSERT_EQ(dst[4], 0x011 << 6, "rem7[4]");
-		ASSERT_EQ(dst[5], 0x022 << 6, "rem7[5]");
-		ASSERT_EQ(dst[6], 0x033 << 6, "rem7[6]");
-	}
-	/* width=8: two full groups */
-	{
-		uint8_t packed[10];
-		uint16_t dst[8] = { 0 };
-		pack4(0x101, 0x202, 0x303, 0x101, &packed[0]);
-		pack4(0x202, 0x303, 0x101, 0x202, &packed[5]);
-		nv15_unpack_plane_to_p010(packed, dst, 8, 1, 10);
-		ASSERT_EQ(dst[7], 0x202 << 6, "w8[7]");
-	}
-}
-
-static void test_multi_row_stride_padding(void)
-{
-	/* 4-pixel-wide, 3-row plane; stride = 8 bytes (3 bytes padding). */
-	uint8_t packed[24];  /* 3 rows × 8 bytes */
-	uint16_t dst[12];    /* 3 rows × 4 pixels */
-	memset(packed, 0xCC, sizeof(packed));  /* padding poison */
-
-	pack4(0x111, 0x222, 0x333, 0x044, &packed[0 * 8]);
-	pack4(0x055, 0x166, 0x177, 0x188, &packed[1 * 8]);
-	pack4(0x099, 0x1AA, 0x2BB, 0x3CC, &packed[2 * 8]);
-
-	memset(dst, 0xAB, sizeof(dst));
-	nv15_unpack_plane_to_p010(packed, dst, 4, 3, 8);
-
-	ASSERT_EQ(dst[0], 0x111 << 6, "row0[0]");
-	ASSERT_EQ(dst[3], 0x044 << 6, "row0[3]");
-	ASSERT_EQ(dst[4], 0x055 << 6, "row1[0]");
-	ASSERT_EQ(dst[7], 0x188 << 6, "row1[3]");
-	ASSERT_EQ(dst[8], 0x099 << 6, "row2[0]");
-	ASSERT_EQ(dst[11], 0x3CC << 6, "row2[3]");
-}
-
-static void test_chroma_half_height(void)
-{
-	/* 4-pixel-wide × 2-row chroma (matches 4×4 luma quadrant).
-	 * NV15 chroma uses same packing as luma, just half-height. */
-	uint8_t packed[10];  /* 2 rows × 5 bytes */
-	uint16_t dst[8];     /* 2 rows × 4 pixels (UV pairs in interleaved form) */
-
-	pack4(0x080, 0x180, 0x280, 0x380, &packed[0]);
-	pack4(0x040, 0x140, 0x240, 0x340, &packed[5]);
-
-	nv15_unpack_plane_to_p010(packed, dst, 4, 2, 5);
-
-	ASSERT_EQ(dst[0], 0x080 << 6, "chroma row0[0]");
-	ASSERT_EQ(dst[3], 0x380 << 6, "chroma row0[3]");
-	ASSERT_EQ(dst[4], 0x040 << 6, "chroma row1[0]");
-	ASSERT_EQ(dst[7], 0x340 << 6, "chroma row1[3]");
-}
-
-int main(void)
-{
-	test_zero();
-	test_all_max();
-	test_known_vectors();
-	test_remainder_width();
-	test_multi_row_stride_padding();
-	test_chroma_half_height();
-	printf("test_nv15_unpack: all PASS\n");
-	return 0;
-}