ffmpeg-v4l2-request-fourier: restore AV_CODEC_FLAG_LOW_DELAY in H.264 decoder

FFmpeg 8.x dropped the H.264 decoder's low_delay code path: AV_CODEC_FLAG_LOW_DELAY no longer prevents h264_select_output_frame from running the display-order DPB output queue. The daedalus-v4l2 daemon's `ctx->flags |= AV_CODEC_FLAG_LOW_DELAY` at daemon/src/decoder.c:202 has been a silent no-op since the SONAME 61→62 jump landed in reauktion/daedalus-v4l2 PR #16; on Firefox YouTube this re-introduced the 2-1-4-3 B-frame pair-swap that PR #12's daemon flag was supposed to prevent.

Fix lives in libavcodec, not the daemon: restore the documented LOW_DELAY semantics so the daemon (and any other V4L2-stateless-style consumer) keeps the one-frame-per-send_packet decode-order output contract it already declares.

## Patch

0006-h264-restore-low-delay.patch touches libavcodec/h264_slice.c:

- h264_select_output_frame: early-exit when LOW_DELAY is set. Emit the just-decoded picture as next_output_pic, mirror the corruption / recovery-point tracking the main path performs, skip delayed_pic[] / POC reorder machinery entirely.
- h264_field_start: suppress the SPS-driven `has_b_frames = sps->num_reorder_frames` clobber when LOW_DELAY is set. Without this the per-slice bitstream_restriction_flag re-pickup would reintroduce a nonzero reorder buffer mid-stream even after the daemon set has_b_frames=0 at avcodec_open2.

## Why not daemon-side

A daemon SPS-rewrite (`num_reorder_frames=0`) was considered but rejected: it works only for the daemon's reconstructed SPS NAL, not for any in-band SPS the daemon dlopens libavformat to parse in other code paths. Restoring documented FFmpeg flag semantics is the smaller, more durable change and keeps the daemon interface stable.

## Packaging

- PKGREL/pkgrel bump to 9.
- No new build-deps, no Depends change.
- Substitution arc cycles 6/7/8 unchanged.

## Refs

- reauktion/daedalus-v4l2#11 / #12 (LOW_DELAY half-measure on daemon side, originally landed against FFmpeg 7.x).
- daemon/src/decoder.c:202 (`ctx->flags |= AV_CODEC_FLAG_LOW_DELAY` for H.264 only; unchanged, but now actually has effect again).
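
To make the 2-1-4-3 symptom concrete, here is a minimal toy model in C. It is a sketch under stated assumptions, not daemon or libavcodec code: it assumes the consumer pairs the k-th frame it receives with the POC of the k-th packet it sent and then reorders by that POC, and it ignores output latency. All function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Decoder model A: emit pictures in decode order, one per packet
 * (the contract the daemon declares with LOW_DELAY). */
static void emit_decode_order(const int *decode_poc, size_t n, int *out)
{
    for (size_t i = 0; i < n; i++)
        out[i] = decode_poc[i];
}

/* Decoder model B: emit pictures in display (ascending POC) order,
 * as a reorder queue would.  Latency is ignored in this toy. */
static void emit_display_order(const int *decode_poc, size_t n, int *out)
{
    emit_decode_order(decode_poc, n, out);
    for (size_t i = 1; i < n; i++) {        /* insertion sort by POC */
        int v = out[i];
        size_t j = i;
        while (j > 0 && out[j - 1] > v) {
            out[j] = out[j - 1];
            j--;
        }
        out[j] = v;
    }
}

/* Consumer model (assumption): frame k is tagged with decode_poc[k], then
 * frames are presented in ascending tag order.  shown[t] is the true POC
 * of the picture presented at position t.  POCs must be 0..n-1. */
static void consumer_present(const int *decode_poc, const int *emitted,
                             size_t n, int *shown)
{
    for (size_t k = 0; k < n; k++) {
        assert(decode_poc[k] >= 0 && (size_t)decode_poc[k] < n);
        shown[decode_poc[k]] = emitted[k];
    }
}
```

With decode-order POCs `{0, 2, 1, 4, 3}`, model A presents 0 1 2 3 4, while model B presents 0 2 1 4 3: each B-frame swaps places with the reference decoded before it.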
Merge pull request 'ffmpeg-v4l2-request-fourier: substitute H.264 luma-v deblock → daedalus-fourier' (#86) from claude-noether/marfrit-packages:noether/ffmpeg-fourier-deblock-luma-v-daedalus into main
2026-05-22 14:20:37 +02:00 · 2026-05-22 10:29:09 +00:00 · 2026-05-22 12:17:14 +02:00 · 2026-05-22 08:32:15 +00:00 · 2026-05-22 08:20:34 +00:00 · 2026-05-22 10:20:27 +02:00
15 changed files with 1893 additions and 51 deletions
@@ -1419,8 +1419,9 @@ jobs:

      - name: wipe secrets
        if: always()
-        run: rm -f /root/repo_pass /root/.ssh/id_ed25519
-        run: rm -f /root/.ssh/id_ed25519_hertz
+        run: |
+          rm -f /root/repo_pass /root/.ssh/id_ed25519
+          rm -f /root/.ssh/id_ed25519_hertz

  # -------------------------------------------------------------------------
  # mesa-panvk-bifrost-video (aarch64 only) — sibling adding VK_KHR_video_decode_h264
@@ -0,0 +1,107 @@
+From 1b286ddb4efaca26ec9b9e290e989fec77dc1c77 Mon Sep 17 00:00:00 2001
+From: Markus Fritsche <mfritsche@reauktion.de>
+Date: Fri, 22 May 2026 10:18:21 +0200
+Subject: [PATCH] avcodec/aarch64/h264dsp: route H.264 8x8 IDCT through
+ daedalus-fourier
+MIME-Version: 1.0
+Content-Type: text/plain; charset=UTF-8
+Content-Transfer-Encoding: 8bit
+
+H264DSPContext.idct8_add (called per 8x8 block from the High-profile
+intra-8x8-DCT decode path in h264_mb.c) now dispatches through
+daedalus_recipe_dispatch_h264_idct8 instead of ff_h264_idct8_add_neon.
+
+The recipe layer picks the substrate; for cycle 7 (H.264 IDCT 8x8)
+the recipe is CPU NEON, so this is effectively a NEON-to-NEON
+substitution layered on top of the cycle-6 IDCT 4x4 wiring.  Same
+pthread_once global context, same destructive-zero semantics; FFmpeg
+column-major 8x8 storage block[r + 8*c] matches daedalus's convention.
+
+Bulk path c->idct8_add4 (used for inter 8x8-DCT macroblocks) remains
+on the in-tree NEON .S code and will be batched through
+daedalus_recipe_dispatch_h264_idct8 with n_blocks>1 in a follow-up.
+
+Bit-exact against ff_h264_idct8_add_neon (daedalus-fourier cycle 7
+green).
+
+Refs reauktion/daedalus-v4l2#11 — substitution arc step 2 cycle 7.
+---
+ libavcodec/aarch64/h264_idct_daedalus.c   | 29 ++++++++++++++++-------
+ libavcodec/aarch64/h264dsp_init_aarch64.c |  3 ++-
+ 2 files changed, 23 insertions(+), 9 deletions(-)
+
+diff --git a/libavcodec/aarch64/h264_idct_daedalus.c b/libavcodec/aarch64/h264_idct_daedalus.c
+index 538d223..cbb98af 100644
+--- a/libavcodec/aarch64/h264_idct_daedalus.c
+++ b/libavcodec/aarch64/h264_idct_daedalus.c
+@@ -1,14 +1,16 @@
+ /*
+- * H.264 4x4 IDCT + add — daedalus-fourier substitution shim.
+ * H.264 4x4 / 8x8 IDCT + add — daedalus-fourier substitution shims.
+  *
+- * Routes H264DSPContext.idct_add through
+- * daedalus_recipe_dispatch_h264_idct4 instead of ff_h264_idct_add_neon.
+- * The recipe layer picks the substrate (CPU NEON by default for
+- * cycle 6; future cycles may dispatch to V3D opportunistically).
+ * Routes H264DSPContext.idct_add  → daedalus_recipe_dispatch_h264_idct4
+ *        H264DSPContext.idct8_add → daedalus_recipe_dispatch_h264_idct8
+ * instead of the in-tree ff_h264_idct{,8}_add_neon assembly.  The
+ * recipe layer picks the substrate (CPU NEON by default for cycles
+ * 6 + 7; future cycles may dispatch to V3D opportunistically).
+  *
+- * FFmpeg's 4x4 block memory layout matches daedalus's column-major
+- * convention: block[r + 4*c] = coefficient at (row r, col c).  Both
+- * sides destructively zero the block after the transform.
+ * FFmpeg's 4x4 and 8x8 block memory layouts match daedalus's
+ * column-major convention: block[r + N*c] = coefficient at
+ * (row r, col c) for N ∈ {4, 8}.  Both sides destructively zero the
+ * block after the transform.
+  *
+  * The library context is process-global and lazily initialised under
+  * pthread_once.  We pick the no-QPU constructor here because
+@@ -37,6 +39,7 @@ static void daedalus_ctx_init_once(void)
+ }
+ 
+ void ff_h264_idct_add_daedalus(uint8_t *dst, int16_t *block, int stride);
+void ff_h264_idct8_add_daedalus(uint8_t *dst, int16_t *block, int stride);
+ 
+ void ff_h264_idct_add_daedalus(uint8_t *dst, int16_t *block, int stride)
+ {
+@@ -47,3 +50,13 @@ void ff_h264_idct_add_daedalus(uint8_t *dst, int16_t *block, int stride)
+     daedalus_recipe_dispatch_h264_idct4(g_dctx, dst, (size_t)stride,
+                                         block, 1, &meta);
+ }
+
+void ff_h264_idct8_add_daedalus(uint8_t *dst, int16_t *block, int stride)
+{
+    static const daedalus_h264_block_meta meta = { .dst_off = 0 };
+
+    pthread_once(&g_dctx_once, daedalus_ctx_init_once);
+
+    daedalus_recipe_dispatch_h264_idct8(g_dctx, dst, (size_t)stride,
+                                        block, 1, &meta);
+}
+diff --git a/libavcodec/aarch64/h264dsp_init_aarch64.c b/libavcodec/aarch64/h264dsp_init_aarch64.c
+index b993df2..741e551 100644
+--- a/libavcodec/aarch64/h264dsp_init_aarch64.c
+++ b/libavcodec/aarch64/h264dsp_init_aarch64.c
+@@ -79,6 +79,7 @@ void ff_h264_idct_add8_neon(uint8_t **dest, const int *block_offset,
+                             const uint8_t nnzc[15 * 8]);
+ 
+ void ff_h264_idct8_add_neon(uint8_t *dst, int16_t *block, int stride);
+void ff_h264_idct8_add_daedalus(uint8_t *dst, int16_t *block, int stride);
+ void ff_h264_idct8_dc_add_neon(uint8_t *dst, int16_t *block, int stride);
+ void ff_h264_idct8_add4_neon(uint8_t *dst, const int *block_offset,
+                              int16_t *block, int stride,
+@@ -146,7 +147,7 @@ av_cold void ff_h264dsp_init_aarch64(H264DSPContext *c, const int bit_depth,
+         c->idct_add16intra = ff_h264_idct_add16intra_neon;
+         if (chroma_format_idc <= 1)
+             c->idct_add8   = ff_h264_idct_add8_neon;
+-        c->idct8_add       = ff_h264_idct8_add_neon;
+        c->idct8_add       = ff_h264_idct8_add_daedalus;
+         c->idct8_dc_add    = ff_h264_idct8_dc_add_neon;
+         c->idct8_add4      = ff_h264_idct8_add4_neon;
+     } else if (have_neon(cpu_flags) && bit_depth == 10) {
+-- 
+2.47.3
+
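
The layout and destructive-zero notes in the patch above can be sketched in plain C. This is an illustration only: `coeff_at` follows the `block[r + N*c]` convention exactly as the patch comment states it, and `toy_dc_add_and_zero` is a hypothetical stand-in that handles just the DC coefficient rather than the full 8x8 transform.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Column-major lookup as stated in the patch comment: block[r + n*c]
 * holds the coefficient at (row r, col c), for n = 4 or 8. */
static int16_t coeff_at(const int16_t *block, int n, int r, int c)
{
    assert(n == 4 || n == 8);
    assert(r >= 0 && r < n && c >= 0 && c < n);
    return block[r + n * c];
}

static uint8_t clip_u8(int v)
{
    return v < 0 ? 0 : v > 255 ? 255 : (uint8_t)v;
}

/* Hypothetical stand-in for a transform-and-add kernel.  It uses only the
 * DC coefficient (a real IDCT consumes all n*n values), adds the rounded
 * result to an n x n pixel patch, then zeroes the coefficient block to
 * show the destructive-zero behaviour both sides rely on. */
static void toy_dc_add_and_zero(uint8_t *dst, int16_t *block, int stride, int n)
{
    int dc = (block[0] + 32) >> 6;

    for (int y = 0; y < n; y++)
        for (int x = 0; x < n; x++)
            dst[y * stride + x] = clip_u8(dst[y * stride + x] + dc);

    memset(block, 0, sizeof(int16_t) * (size_t)(n * n));
}
```

A caller that reuses the coefficient buffer for the next block can rely on it being all-zero after the call, which is why the shim can pass FFmpeg's buffer through without copying.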
@@ -0,0 +1,121 @@
+From 68731c41d7ea68be0e912b128cb4e71fb56e8263 Mon Sep 17 00:00:00 2001
+From: Markus Fritsche <mfritsche@reauktion.de>
+Date: Fri, 22 May 2026 12:15:16 +0200
+Subject: [PATCH] avcodec/aarch64/h264dsp: route H.264 luma-v deblock through
+ daedalus-fourier
+MIME-Version: 1.0
+Content-Type: text/plain; charset=UTF-8
+Content-Transfer-Encoding: 8bit
+
+H264DSPContext.v_loop_filter_luma (non-intra bS<4 vertical luma
+deblock, called per macroblock-row edge from the slice deblock
+loop) now dispatches through
+daedalus_recipe_dispatch_h264_deblock_luma_v instead of
+ff_h264_v_loop_filter_luma_neon.
+
+The recipe layer picks the substrate; for cycle 8 the daedalus
+docstring marks the kernel "CPU primary; QPU opportunistic", but
+the libavcodec.so context here is built with
+daedalus_ctx_create_no_qpu — process-global pthread_once init,
+shared with cycles 6/7.  QPU opportunism stays gated off until a
+follow-up adds an explicit feature flag (no implicit Vulkan init
+in arbitrary host processes).  In the meantime cycle 8 is a
+plumbing-only substitution, NEON-to-NEON via the daedalus recipe.
+
+Intra (bS=4) loop filter — c->v_loop_filter_luma_intra — stays on
+the in-tree NEON .S code; daedalus's daedalus_h264_deblock_meta
+only covers the non-intra path per its docstring.
+
+FFmpeg `int alpha/beta/int8_t tc0[4]` → daedalus_h264_deblock_meta
+(int32_t alpha/beta + inline int8_t tc0[4]).  pix already points
+to row 0 of the bottom block per FFmpeg's deblock convention,
+satisfying daedalus's `dst_off >= 4 * dst_stride` constraint.
+
+Refs reauktion/daedalus-v4l2#11 — substitution arc step 2 cycle 8.
+---
+ libavcodec/aarch64/h264_idct_daedalus.c   | 36 +++++++++++++++++++----
+ libavcodec/aarch64/h264dsp_init_aarch64.c |  4 ++-
+ 2 files changed, 33 insertions(+), 7 deletions(-)
+
+diff --git a/libavcodec/aarch64/h264_idct_daedalus.c b/libavcodec/aarch64/h264_idct_daedalus.c
+index cbb98af..92365fa 100644
+--- a/libavcodec/aarch64/h264_idct_daedalus.c
+++ b/libavcodec/aarch64/h264_idct_daedalus.c
+@@ -1,11 +1,14 @@
+ /*
+- * H.264 4x4 / 8x8 IDCT + add — daedalus-fourier substitution shims.
+ * H.264 4x4 / 8x8 IDCT + luma-v deblock — daedalus-fourier substitution shims.
+  *
+- * Routes H264DSPContext.idct_add  → daedalus_recipe_dispatch_h264_idct4
+- *        H264DSPContext.idct8_add → daedalus_recipe_dispatch_h264_idct8
+- * instead of the in-tree ff_h264_idct{,8}_add_neon assembly.  The
+- * recipe layer picks the substrate (CPU NEON by default for cycles
+- * 6 + 7; future cycles may dispatch to V3D opportunistically).
+ * Routes H264DSPContext.idct_add           → daedalus_recipe_dispatch_h264_idct4
+ *        H264DSPContext.idct8_add          → daedalus_recipe_dispatch_h264_idct8
+ *        H264DSPContext.v_loop_filter_luma → daedalus_recipe_dispatch_h264_deblock_luma_v
+ * instead of the in-tree ff_h264_*_neon assembly.  The recipe layer
+ * picks the substrate (CPU NEON for cycles 6 + 7 by default; cycle 8
+ * is CPU primary with QPU opportunistic — the ctx below is no-QPU,
+ * so cycle 8 stays on the CPU NEON path until a separate change
+ * gates QPU init on a daedalus-fourier feature flag).
+  *
+  * FFmpeg's 4x4 and 8x8 block memory layouts match daedalus's
+  * column-major convention: block[r + N*c] = coefficient at
+@@ -40,6 +43,8 @@ static void daedalus_ctx_init_once(void)
+ 
+ void ff_h264_idct_add_daedalus(uint8_t *dst, int16_t *block, int stride);
+ void ff_h264_idct8_add_daedalus(uint8_t *dst, int16_t *block, int stride);
+void ff_h264_v_loop_filter_luma_daedalus(uint8_t *pix, ptrdiff_t stride,
+                                         int alpha, int beta, int8_t *tc0);
+ 
+ void ff_h264_idct_add_daedalus(uint8_t *dst, int16_t *block, int stride)
+ {
+@@ -60,3 +65,22 @@ void ff_h264_idct8_add_daedalus(uint8_t *dst, int16_t *block, int stride)
+     daedalus_recipe_dispatch_h264_idct8(g_dctx, dst, (size_t)stride,
+                                         block, 1, &meta);
+ }
+
+void ff_h264_v_loop_filter_luma_daedalus(uint8_t *pix, ptrdiff_t stride,
+                                         int alpha, int beta, int8_t *tc0)
+{
+    daedalus_h264_deblock_meta meta = {
+        .dst_off = 0,
+        .alpha   = alpha,
+        .beta    = beta,
+    };
+    meta.tc0[0] = tc0[0];
+    meta.tc0[1] = tc0[1];
+    meta.tc0[2] = tc0[2];
+    meta.tc0[3] = tc0[3];
+
+    pthread_once(&g_dctx_once, daedalus_ctx_init_once);
+
+    daedalus_recipe_dispatch_h264_deblock_luma_v(g_dctx, pix, (size_t)stride,
+                                                 1, &meta);
+}
+diff --git a/libavcodec/aarch64/h264dsp_init_aarch64.c b/libavcodec/aarch64/h264dsp_init_aarch64.c
+index 741e551..85ac381 100644
+--- a/libavcodec/aarch64/h264dsp_init_aarch64.c
+++ b/libavcodec/aarch64/h264dsp_init_aarch64.c
+@@ -27,6 +27,8 @@
+ 
+ void ff_h264_v_loop_filter_luma_neon(uint8_t *pix, ptrdiff_t stride, int alpha,
+                                      int beta, int8_t *tc0);
+void ff_h264_v_loop_filter_luma_daedalus(uint8_t *pix, ptrdiff_t stride,
+                                         int alpha, int beta, int8_t *tc0);
+ void ff_h264_h_loop_filter_luma_neon(uint8_t *pix, ptrdiff_t stride, int alpha,
+                                      int beta, int8_t *tc0);
+ void ff_h264_v_loop_filter_luma_intra_neon(uint8_t *pix, ptrdiff_t stride, int alpha,
+@@ -114,7 +116,7 @@ av_cold void ff_h264dsp_init_aarch64(H264DSPContext *c, const int bit_depth,
+     int cpu_flags = av_get_cpu_flags();
+ 
+     if (have_neon(cpu_flags) && bit_depth == 8) {
+-        c->v_loop_filter_luma   = ff_h264_v_loop_filter_luma_neon;
+        c->v_loop_filter_luma   = ff_h264_v_loop_filter_luma_daedalus;
+         c->h_loop_filter_luma   = ff_h264_h_loop_filter_luma_neon;
+         c->v_loop_filter_luma_intra= ff_h264_v_loop_filter_luma_intra_neon;
+         c->h_loop_filter_luma_intra= ff_h264_h_loop_filter_luma_intra_neon;
+-- 
+2.47.3
+
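
For readers unfamiliar with what `v_loop_filter_luma` computes, the following is a minimal scalar sketch of the normal-strength (bS<4) luma filter for 8-bit samples, written from my understanding of the H.264 deblocking rules. It is not the daedalus-fourier kernel or the in-tree NEON code, and the `toy_` names are hypothetical. As in the patch description, `pix` points at row 0 of the block below the edge.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

static int clip3(int v, int lo, int hi)
{
    return v < lo ? lo : v > hi ? hi : v;
}

/* Filter one column across a horizontal edge.  q0..q2 are at pix[0],
 * pix[stride], pix[2*stride]; p0..p2 are the three rows above the edge. */
static void toy_filter_luma_column(uint8_t *pix, ptrdiff_t stride,
                                   int alpha, int beta, int tc0)
{
    const int p0 = pix[-1 * stride], p1 = pix[-2 * stride], p2 = pix[-3 * stride];
    const int q0 = pix[0], q1 = pix[1 * stride], q2 = pix[2 * stride];
    int tc = tc0, delta;

    if (tc0 < 0)                       /* negative tc0: segment is skipped */
        return;
    if (abs(p0 - q0) >= alpha || abs(p1 - p0) >= beta || abs(q1 - q0) >= beta)
        return;                        /* edge too strong: likely real detail */

    if (abs(p2 - p0) < beta) {         /* smooth p side: also adjust p1 */
        if (tc0)
            pix[-2 * stride] = (uint8_t)(p1 + clip3(((p2 + ((p0 + q0 + 1) >> 1)) >> 1) - p1, -tc0, tc0));
        tc++;
    }
    if (abs(q2 - q0) < beta) {         /* smooth q side: also adjust q1 */
        if (tc0)
            pix[1 * stride] = (uint8_t)(q1 + clip3(((q2 + ((p0 + q0 + 1) >> 1)) >> 1) - q1, -tc0, tc0));
        tc++;
    }

    delta = clip3((((q0 - p0) * 4) + (p1 - q1) + 4) >> 3, -tc, tc);
    pix[-1 * stride] = (uint8_t)clip3(p0 + delta, 0, 255);
    pix[0]           = (uint8_t)clip3(q0 - delta, 0, 255);
}

/* Same argument shape as the shim: 16 columns per macroblock edge,
 * with tc0[i] covering columns 4*i .. 4*i+3. */
static void toy_v_loop_filter_luma(uint8_t *pix, ptrdiff_t stride,
                                   int alpha, int beta, const int8_t *tc0)
{
    for (int x = 0; x < 16; x++)
        toy_filter_luma_column(pix + x, stride, alpha, beta, tc0[x / 4]);
}
```

The four `tc0` entries are why the shim copies exactly `tc0[0]` through `tc0[3]` into the meta struct: one clipping threshold per 4-column segment of the 16-pixel edge.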
@@ -0,0 +1,82 @@
+From 0d1292ea99bc4e5fa2da438259fa01a2374e3e04 Mon Sep 17 00:00:00 2001
+From: Markus Fritsche <mfritsche@reauktion.de>
+Date: Fri, 22 May 2026 14:18:25 +0200
+Subject: [PATCH] avcodec/h264: restore AV_CODEC_FLAG_LOW_DELAY semantics
+MIME-Version: 1.0
+Content-Type: text/plain; charset=UTF-8
+Content-Transfer-Encoding: 8bit
+
+FFmpeg 8.x dropped the H.264 decoder's low_delay path —
+AV_CODEC_FLAG_LOW_DELAY no longer prevents
+h264_select_output_frame from running the display-order DPB
+output queue.  V4L2-stateless-style consumers (daedalus-v4l2
+daemon, libva-v4l2-request-fourier) that set the flag end up
+seeing the 2-1-4-3 pair-swap pattern on B-frame streams again.
+
+Restore the documented semantics:
+
+  - Early-exit at the top of h264_select_output_frame when the
+    flag is set: emit the just-decoded picture immediately as
+    next_output_pic, mirror the corruption / recovery-point
+    tracking the main path performs, and skip the entire
+    delayed_pic[] / POC reorder machinery.
+
+  - Suppress the SPS-driven has_b_frames clobber in
+    h264_field_start when the flag is set, so the per-slice
+    bitstream_restriction_flag re-pickup cannot reintroduce a
+    nonzero reorder buffer mid-stream.
+
+This is a fork-only change required by the daedalus-v4l2 daemon's
+one-frame-per-send_packet contract; upstream FFmpeg consumers that
+expect display-order output remain untouched (flag default = off).
+
+Refs reauktion/daedalus-v4l2#11 — substitution arc step 2 deblock
+ flag-restoration follow-up.
+---
+ libavcodec/h264_slice.c | 23 +++++++++++++++++++++++
+ 1 file changed, 23 insertions(+)
+
+diff --git a/libavcodec/h264_slice.c b/libavcodec/h264_slice.c
+index 97fab70..a7bfbd6 100644
+--- a/libavcodec/h264_slice.c
+++ b/libavcodec/h264_slice.c
+@@ -1308,6 +1308,28 @@ static int h264_select_output_frame(H264Context *h)
+     cur->mmco_reset = h->mmco_reset;
+     h->mmco_reset = 0;
+ 
+    /* AV_CODEC_FLAG_LOW_DELAY restore (FFmpeg 8.x dropped the H.264
+     * decoder's low_delay path).  Bypass the display-order DPB
+     * output queue: emit the just-decoded picture immediately, in
+     * decode order, one per send_packet.  V4L2-stateless-style
+     * consumers (daedalus-v4l2 daemon, libva-v4l2-request-fourier)
+     * do their own POC-based reorder downstream and require this
+     * behaviour. */
+    if (h->avctx->flags & AV_CODEC_FLAG_LOW_DELAY) {
+        h->next_output_pic    = cur;
+        h->next_outputed_poc  = cur->poc;
+        h->frame_recovered   |= cur->recovered;
+        cur->recovered       |= h->frame_recovered & FRAME_RECOVERED_SEI;
+        if (!cur->recovered) {
+            if (!(h->avctx->flags  & AV_CODEC_FLAG_OUTPUT_CORRUPT) &&
+                !(h->avctx->flags2 & AV_CODEC_FLAG2_SHOW_ALL))
+                h->next_output_pic = NULL;
+            else
+                cur->f->flags |= AV_FRAME_FLAG_CORRUPT;
+        }
+        return 0;
+    }
+
+     if (sps->bitstream_restriction_flag ||
+         h->avctx->strict_std_compliance >= FF_COMPLIANCE_STRICT) {
+         h->avctx->has_b_frames = FFMAX(h->avctx->has_b_frames, sps->num_reorder_frames);
+@@ -1415,6 +1437,7 @@ static int h264_field_start(H264Context *h, const H264SliceContext *sl,
+     sps = h->ps.sps;
+ 
+     if (sps->bitstream_restriction_flag &&
+        !(h->avctx->flags & AV_CODEC_FLAG_LOW_DELAY) &&
+         h->avctx->has_b_frames < sps->num_reorder_frames) {
+         h->avctx->has_b_frames = sps->num_reorder_frames;
+     }
+-- 
+2.47.3
+
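
The early-exit branch in the patch above reduces to a small decision table, sketched here as standalone C. The flag constants and the `toy_` names are hypothetical placeholders (the real bits live in libavcodec's headers, and the real code keeps `flags` and `flags2` separate); `cur_recovered` stands for the picture's recovered state after the `frame_recovered` bookkeeping has been folded in.

```c
#include <assert.h>
#include <stdbool.h>

/* Placeholder flag bits for this sketch only. */
enum {
    TOY_FLAG_LOW_DELAY      = 1 << 0,
    TOY_FLAG_OUTPUT_CORRUPT = 1 << 1,
    TOY_FLAG2_SHOW_ALL      = 1 << 2,
};

typedef enum {
    TOY_REORDER,       /* flag unset: continue into the DPB reorder path */
    TOY_EMIT,          /* output the just-decoded picture as-is */
    TOY_EMIT_CORRUPT,  /* output it, but mark the frame as corrupt */
    TOY_DROP           /* no output for this packet */
} toy_action;

static toy_action toy_select_output(unsigned flags, bool cur_recovered)
{
    if (!(flags & TOY_FLAG_LOW_DELAY))
        return TOY_REORDER;
    if (cur_recovered)
        return TOY_EMIT;
    /* Not yet recovered: only surface the picture if the caller asked
     * to see corrupt or pre-recovery frames. */
    if (flags & (TOY_FLAG_OUTPUT_CORRUPT | TOY_FLAG2_SHOW_ALL))
        return TOY_EMIT_CORRUPT;
    return TOY_DROP;
}
```

Note the `TOY_DROP` case: before a recovery point, a LOW_DELAY consumer that sets neither override flag gets no frame for that packet, so "one frame per send_packet" holds only once the stream has recovered.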
@@ -24,7 +24,7 @@ _srcname=FFmpeg
 _version='8.1'
 _commit='b57fbbe50c9b2656fad86a1a7eeabfd2b2a50935'  # v4l2-request-n8.1 tip 2026-04-24
 pkgver=8.1.r123329.b57fbbe
-pkgrel=6   # pkgrel=6 — H.264 IDCT 4x4 daedalus-fourier substitution (2026-05-21)
+pkgrel=9   # pkgrel=9 — restore AV_CODEC_FLAG_LOW_DELAY for H.264 (2026-05-22)
 epoch=2

 # daedalus-fourier pin — first kernel substitution in libavcodec
@@ -90,8 +90,11 @@ source=("git+https://github.com/Kwiboo/FFmpeg.git#commit=${_commit}"
        "daedalus-fourier-${_daedalus_fourier_commit}.tar.gz::https://git.reauktion.de/marfrit/daedalus-fourier/archive/${_daedalus_fourier_commit}.tar.gz"
        '0001-libudev-bypass-fallback.patch'
        '0002-nv15-to-p010-unpack.patch'
-        '0003-h264-idct4-daedalus-fourier.patch')
-sha256sums=('SKIP' 'SKIP' 'SKIP' 'SKIP' 'SKIP')
+        '0003-h264-idct4-daedalus-fourier.patch'
+        '0004-h264-idct8-daedalus-fourier.patch'
+        '0005-h264-deblock-luma-v-daedalus-fourier.patch'
+        '0006-h264-restore-low-delay.patch')
+sha256sums=('SKIP' 'SKIP' 'SKIP' 'SKIP' 'SKIP' 'SKIP' 'SKIP' 'SKIP')

 pkgver() {
  cd "${_srcname}"
@@ -105,6 +108,9 @@ prepare() {
  patch -Np1 -i "${srcdir}/0001-libudev-bypass-fallback.patch"
  patch -Np1 -i "${srcdir}/0002-nv15-to-p010-unpack.patch"
  patch -Np1 -i "${srcdir}/0003-h264-idct4-daedalus-fourier.patch"
+  patch -Np1 -i "${srcdir}/0004-h264-idct8-daedalus-fourier.patch"
+  patch -Np1 -i "${srcdir}/0005-h264-deblock-luma-v-daedalus-fourier.patch"
+  patch -Np1 -i "${srcdir}/0006-h264-restore-low-delay.patch"
 }

 build() {
@@ -1 +0,0 @@
-../mesa-panvk-bifrost/0001-panvk-expose-robustness2-nullDescriptor-bifrost.patch
@@ -0,0 +1,57 @@
+From: claude-noether (on behalf of mfritsche)
+Date: 2026-05-19
+Subject: panvk: expose VK_KHR/EXT_robustness2 + nullDescriptor on Bifrost (PAN_ARCH 6/7)
+
+Without this, Mesa's Zink driver refuses to use PanVk-Bifrost as its Vulkan
+backend, falling back silently to llvmpipe (software rasterizer) for all
+GL-via-Zink on Bifrost SBCs. That defeats the entire purpose of having a
+Vulkan driver on Bifrost — GL acceleration via Zink is the most natural
+near-term consumer.
+
+panvk_vX_nir_lower_descriptors.c:1309 and panvk_vX_shader.c:1355 already
+plumb dev->vk.enabled_features.nullDescriptor arch-agnostically — the gate
+at panvk_vX_physical_device.c was set conservatively when Bifrost was
+unmaintained, not because of hardware incapability.
+
+iter1–7 of the panvk-bifrost campaign proved fundamental driver functions
+on Mali-G52 r1 MC1 (PAN_ARCH=7). This patch is the iter8 follow-up.
+
+robustBufferAccess2 and robustImageAccess2 are NOT flipped — they're
+independent rb2 features Zink doesn't require, gated differently
+(robustBufferAccess2 = PAN_ARCH >= 11, robustImageAccess2 = false), and
+out of scope for iter8.
+
+---
+ src/panfrost/vulkan/panvk_vX_physical_device.c | 6 +++---
+ 1 file changed, 3 insertions(+), 3 deletions(-)
+
+diff --git a/src/panfrost/vulkan/panvk_vX_physical_device.c b/src/panfrost/vulkan/panvk_vX_physical_device.c
+--- a/src/panfrost/vulkan/panvk_vX_physical_device.c
+++ b/src/panfrost/vulkan/panvk_vX_physical_device.c
+@@ -91,7 +91,7 @@ get_device_extensions(const struct panvk_physical_device *device,
+       .KHR_pipeline_binary = true,
+       .KHR_pipeline_executable_properties = true,
+       .KHR_pipeline_library = true,
+-      .KHR_robustness2 = PAN_ARCH >= 10,
+      .KHR_robustness2 = true,
+       .KHR_sampler_mirror_clamp_to_edge = true,
+       .KHR_sampler_ycbcr_conversion = true,
+       .KHR_separate_depth_stencil_layouts = true,
+@@ -168,7 +168,7 @@ get_device_extensions(const struct panvk_physical_device *device,
+       .EXT_queue_family_foreign = true,
+       .EXT_robustness = pan_arch(device->kmod.dev->props.gpu_id) >= 9,
+       .EXT_image_robustness = true,
+-      .EXT_robustness2 = PAN_ARCH >= 10,
+      .EXT_robustness2 = true,
+       .EXT_sampler_filter_minmax = PAN_ARCH >= 10,
+       .EXT_scalar_block_layout = true,
+       .EXT_separate_stencil_usage = true,
+@@ -493,7 +493,7 @@ get_device_features(const struct panvk_physical_device *device,
+       /* VK_KHR_robustness2 */
+       .robustBufferAccess2 = PAN_ARCH >= 11,
+       .robustImageAccess2 = false,
+-      .nullDescriptor = PAN_ARCH >= 10,
+      .nullDescriptor = true,
+
+       /* VK_KHR_shader_clock */
+       .shaderSubgroupClock = device->kmod.dev->props.gpu_can_query_timestamp,
@@ -1 +0,0 @@
-../mesa-panvk-bifrost/0002-panvk-expose-vulkan-1.1-1.2-on-bifrost.patch
@@ -0,0 +1,47 @@
+From: claude-noether (on behalf of mfritsche)
+Date: 2026-05-20
+Subject: panvk: expose Vulkan 1.1 + 1.2 on Bifrost (PAN_ARCH 6/7)
+
+ANGLE (Chromium's GL stack) requires apiVersion >= 1.1 to initialize. Without
+this, Brave / Chromium's GPU process fails at GL info collection:
+
+  vk_renderer.cpp:2659 (initialize): ANGLE Requires a minimum Vulkan device
+                                     version of 1.1
+  Display::initialize error 0: Internal Vulkan error (-9): The requested
+                               version of Vulkan is not supported by the driver
+
+Stack-up with iter8's robustness2 patch enables ANGLE → PanVk-Bifrost →
+Skia (via --enable-features=Vulkan) on Bifrost SBCs.
+
+PanVk-Bifrost already supports the bulk of 1.1-promoted features as extensions
+(multiview, maintenance1-3, descriptor update template, 16-bit storage,
+sampler ycbcr, variable pointers, etc., all visible in iter0 vulkaninfo).
+The version bump primarily bundles them.
+
+Risk: Vulkan 1.1 has features beyond what iter1–7 exercised (protected memory,
+full subgroup ops). App-specific failures can be characterized as they appear.
+
+1.2 is also flipped — Brave's Vulkan path may want descriptor indexing,
+buffer device address, etc. (all listed in iter0 vulkaninfo as supported
+extensions, just gated as 1.0-with-extensions, not 1.2-core).
+
+---
+ src/panfrost/vulkan/panvk_vX_physical_device.c | 4 ++--
+ 1 file changed, 2 insertions(+), 2 deletions(-)
+
+diff --git a/src/panfrost/vulkan/panvk_vX_physical_device.c b/src/panfrost/vulkan/panvk_vX_physical_device.c
+--- a/src/panfrost/vulkan/panvk_vX_physical_device.c
+++ b/src/panfrost/vulkan/panvk_vX_physical_device.c
+@@ -38,8 +38,8 @@ get_device_extensions(const struct panvk_physical_device *device,
+                       struct vk_device_extension_table *ext)
+ {
+    *ext = (struct vk_device_extension_table){
+-      .KHR_8bit_storage = true,
+-      .KHR_16bit_storage = true,
+-      bool has_vk1_1 = PAN_ARCH >= 10;
+-      bool has_vk1_2 = PAN_ARCH >= 10;
+      .KHR_8bit_storage = true,
+      .KHR_16bit_storage = true,
+      bool has_vk1_1 = true;
+      bool has_vk1_2 = true;
+       *ext = (struct vk_device_extension_table){
@@ -1 +0,0 @@
-../mesa-panvk-bifrost/0003-panvk-bifrost-vk-ext-transform-feedback.patch
@@ -0,0 +1,328 @@
+--- a/src/panfrost/vulkan/panvk_shader.h	2026-04-29 22:19:00.000000000 +0200
+++ b/src/panfrost/vulkan/panvk_shader.h	2026-05-20 18:52:53.312698258 +0200
+@@ -150,6 +150,10 @@
+    struct {
+ #if PAN_ARCH < 9
+       int32_t raw_vertex_offset;
+      uint32_t num_vertices;       /* iter13: XFB needs per-draw vertex count */
+      /* aligned_u64 attribute below inserts the 4-byte alignment gap
+       * after num_vertices automatically — no explicit pad needed. */
+      aligned_u64 xfb_address[4];  /* iter13: 4 transform feedback buffer base addresses */
+ #endif
+       int32_t first_vertex;
+       int32_t base_instance;
+--- a/src/panfrost/vulkan/panvk_vX_physical_device.c	2026-05-20 19:09:29.711145446 +0200
+++ b/src/panfrost/vulkan/panvk_vX_physical_device.c	2026-05-20 18:52:54.832720445 +0200
+@@ -169,6 +169,7 @@
+       .EXT_provoking_vertex = true,
+       .EXT_queue_family_foreign = true,
+       .EXT_robustness2 = true,
+      .EXT_transform_feedback = PAN_ARCH < 9,   /* iter13: JM-class only for now */
+       .EXT_sampler_filter_minmax = PAN_ARCH >= 10,
+       .EXT_scalar_block_layout = true,
+       .EXT_separate_stencil_usage = true,
+@@ -495,6 +496,10 @@
+       .robustImageAccess2 = false,
+       .nullDescriptor = true,
+ 
+      /* VK_EXT_transform_feedback (iter13) */
+      .transformFeedback = PAN_ARCH < 9,
+      .geometryStreams = false,
+
+       /* VK_KHR_shader_clock */
+       .shaderSubgroupClock = device->kmod.dev->props.gpu_can_query_timestamp,
+       .shaderDeviceClock = device->kmod.dev->props.timestamp_device_coherent,
+@@ -1020,6 +1025,18 @@
+       .robustStorageBufferAccessSizeAlignment = 1,
+       .robustUniformBufferAccessSizeAlignment = 1,
+ 
+      /* VK_EXT_transform_feedback (iter13) */
+      .maxTransformFeedbackStreams = 1,
+      .maxTransformFeedbackBuffers = 4,
+      .maxTransformFeedbackBufferSize = UINT32_MAX,
+      .maxTransformFeedbackStreamDataSize = 512,
+      .maxTransformFeedbackBufferDataSize = 512,
+      .maxTransformFeedbackBufferDataStride = 2048,
+      .transformFeedbackQueries = false,
+      .transformFeedbackStreamsLinesTriangles = false,
+      .transformFeedbackRasterizationStreamSelect = false,
+      .transformFeedbackDraw = false,
+
+       /* VK_EXT_shader_object */
+       /* We do not currently support VK_EXT_shader_object but this is used
+        * internally by vk_shader
+--- a/src/panfrost/vulkan/panvk_vX_shader.c	2026-04-29 22:19:00.000000000 +0200
+++ b/src/panfrost/vulkan/panvk_vX_shader.c	2026-05-20 18:52:56.556745611 +0200
+@@ -21,6 +21,7 @@
+ #include "panvk_physical_device.h"
+ #include "panvk_sampler.h"
+ #include "panvk_shader.h"
+#include "pan_nir.h"   /* iter13: pan_nir_lower_xfb */
+ 
+ #include "spirv/nir_spirv.h"
+ #include "util/memstream.h"
+@@ -100,6 +101,20 @@
+    case nir_intrinsic_load_raw_vertex_offset_pan:
+       val = load_sysval(b, graphics, bit_size, vs.raw_vertex_offset);
+       break;
+   case nir_intrinsic_load_num_vertices:    /* iter13: XFB index calc */
+      val = load_sysval(b, graphics, bit_size, vs.num_vertices);
+      break;
+   case nir_intrinsic_load_xfb_address: {   /* iter13: XFB buffer N base address */
+      unsigned idx = nir_intrinsic_base(intr);
+      switch (idx) {
+      case 0: val = load_sysval(b, graphics, bit_size, vs.xfb_address[0]); break;
+      case 1: val = load_sysval(b, graphics, bit_size, vs.xfb_address[1]); break;
+      case 2: val = load_sysval(b, graphics, bit_size, vs.xfb_address[2]); break;
+      case 3: val = load_sysval(b, graphics, bit_size, vs.xfb_address[3]); break;
+      default: return false;
+      }
+      break;
+   }
+    case nir_intrinsic_load_layer_id:
+       assert(b->shader->info.stage == MESA_SHADER_FRAGMENT);
+       val = load_sysval(b, graphics, bit_size, layer_id);
+@@ -457,6 +472,7 @@
+             core_max_id);
+ 
+    pan_preprocess_nir(nir, pdev->kmod.dev->props.gpu_id);
+
+ }
+ 
+ static void
+@@ -870,6 +886,18 @@
+             nir_var_shader_in | nir_var_shader_out, UINT32_MAX);
+    NIR_PASS(_, nir, nir_lower_io, nir_var_shader_in | nir_var_shader_out,
+             glsl_type_size, nir_lower_io_use_interpolated_input_intrinsics);
+
+#if PAN_ARCH < 9
+   /* iter13: VK_EXT_transform_feedback — runs AFTER nir_lower_io so that
+    * shader outputs are now store_output intrinsics that pan_nir_lower_xfb
+    * can rewrite to nir_store_global+nir_load_xfb_address. */
+   if (nir->info.stage == MESA_SHADER_VERTEX &&
+       nir->info.has_transform_feedback_varyings) {
+      NIR_PASS(_, nir, nir_opt_constant_folding);
+      NIR_PASS(_, nir, nir_io_add_intrinsic_xfb_info);
+      NIR_PASS(_, nir, pan_nir_lower_xfb);
+   }
+#endif
+ }
+ 
+ static VkResult
+@@ -1288,6 +1316,9 @@
+       .view_mask = (state && state->rp) ? state->rp->view_mask : 0,
+       .robust2_modes = robust2_modes,
+       .robust_descriptors = dev->vk.enabled_features.nullDescriptor,
+      /* iter13: XFB shaders must disable IDVS (matches Panfrost-Gallium). */
+      .no_idvs = (info->stage == MESA_SHADER_VERTEX) &&
+                 info->nir->info.has_transform_feedback_varyings,
+    };
+ 
+    switch (info->stage) {
+--- a/src/panfrost/vulkan/panvk_cmd_draw.h	2026-04-29 22:19:00.000000000 +0200
+++ b/src/panfrost/vulkan/panvk_cmd_draw.h	2026-05-20 18:52:57.748763011 +0200
+@@ -135,6 +135,19 @@
+    struct panvk_graphics_sysvals sysvals;
+ 
+ #if PAN_ARCH < 9
+   /* iter13: VK_EXT_transform_feedback state (JM-class only for now). */
+   struct {
+      bool active;
+      uint32_t buffer_count;
+      struct {
+         uint64_t addr;
+         uint64_t offset;
+         uint64_t size;
+      } buffers[4];
+   } xfb;
+#endif
+
+#if PAN_ARCH < 9
+    struct panvk_shader_link link;
+ #endif
+ 
+--- a/src/panfrost/vulkan/panvk_vX_cmd_draw.c	2026-04-29 22:19:00.000000000 +0200
+++ b/src/panfrost/vulkan/panvk_vX_cmd_draw.c	2026-05-20 19:10:23.031919662 +0200
+@@ -10,6 +10,7 @@
+ #include "panvk_entrypoints.h"
+ 
+ #include "pan_desc.h"
+#include "pan_compiler.h"   /* PAN_SHADER_OOB_ADDRESS */
+ #include "pan_util.h"
+ 
+ static void
+@@ -722,6 +723,35 @@
+    set_gfx_sysval(cmdbuf, dirty_sysvals, vs.raw_vertex_offset,
+                   info->vertex.raw_offset);
+    set_gfx_sysval(cmdbuf, dirty_sysvals, layer_id, info->layer_id);
+
+   /* iter13: VK_EXT_transform_feedback sysvals — always set (per draw),
+    * reflect bound XFB state. set_gfx_sysval is a no-op if value unchanged. */
+   set_gfx_sysval(cmdbuf, dirty_sysvals, vs.num_vertices, info->vertex.count);
+   {
+      const struct panvk_cmd_graphics_state *_gfx = &cmdbuf->state.gfx;
+      /* iter13: default each XFB buffer address to PAN_SHADER_OOB_ADDRESS
+       * (= 1<<63). This is the Panfrost-Gallium memory-sink idiom — the
+       * Bifrost MMU silently discards stores to this address, so a pipeline
+       * with XFB outputs used in a non-XFB draw (or in an XFB draw with
+       * fewer bound buffers than the shader declares) is safe instead of
+       * faulting. See gallium/drivers/panfrost/pan_cmdstream.c PAN_SYSVAL_XFB. */
+      uint64_t _xa0 = PAN_SHADER_OOB_ADDRESS, _xa1 = PAN_SHADER_OOB_ADDRESS,
+               _xa2 = PAN_SHADER_OOB_ADDRESS, _xa3 = PAN_SHADER_OOB_ADDRESS;
+      if (_gfx->xfb.active) {
+         if (_gfx->xfb.buffer_count > 0 && _gfx->xfb.buffers[0].addr)
+            _xa0 = _gfx->xfb.buffers[0].addr + _gfx->xfb.buffers[0].offset;
+         if (_gfx->xfb.buffer_count > 1 && _gfx->xfb.buffers[1].addr)
+            _xa1 = _gfx->xfb.buffers[1].addr + _gfx->xfb.buffers[1].offset;
+         if (_gfx->xfb.buffer_count > 2 && _gfx->xfb.buffers[2].addr)
+            _xa2 = _gfx->xfb.buffers[2].addr + _gfx->xfb.buffers[2].offset;
+         if (_gfx->xfb.buffer_count > 3 && _gfx->xfb.buffers[3].addr)
+            _xa3 = _gfx->xfb.buffers[3].addr + _gfx->xfb.buffers[3].offset;
+      }
+      set_gfx_sysval(cmdbuf, dirty_sysvals, vs.xfb_address[0], _xa0);
+      set_gfx_sysval(cmdbuf, dirty_sysvals, vs.xfb_address[1], _xa1);
+      set_gfx_sysval(cmdbuf, dirty_sysvals, vs.xfb_address[2], _xa2);
+      set_gfx_sysval(cmdbuf, dirty_sysvals, vs.xfb_address[3], _xa3);
+   }
+ #endif
+ 
+    if (dyn_gfx_state_dirty(cmdbuf, CB_BLEND_CONSTANTS)) {
+--- a/src/panfrost/vulkan/meson.build	2026-04-29 22:19:00.000000000 +0200
+++ b/src/panfrost/vulkan/meson.build	2026-05-20 18:53:04.484861338 +0200
+@@ -73,6 +73,7 @@
+ jm_inc_dir = ['jm']
+ jm_files = [
+   'jm/panvk_vX_bind_queue.c',
+  'jm/panvk_vX_cmd_xfb.c',   # iter13
+   'jm/panvk_vX_cmd_buffer.c',
+   'jm/panvk_vX_cmd_dispatch.c',
+   'jm/panvk_vX_cmd_draw.c',
+--- a/src/panfrost/vulkan/jm/panvk_vX_cmd_buffer.c	2026-04-29 22:19:00.000000000 +0200
+++ b/src/panfrost/vulkan/jm/panvk_vX_cmd_buffer.c	2026-05-20 19:10:26.163965149 +0200
+@@ -473,5 +473,12 @@
+ 
+    vk_command_buffer_begin(&cmdbuf->vk, pBeginInfo);
+ 
+#if PAN_ARCH < 9
+   /* iter13: clear XFB state on Begin so a reused command buffer does not
+    * inherit stale xfb.buffer_count / xfb.active / xfb.buffers[] from a
+    * prior recording. */
+   memset(&cmdbuf->state.gfx.xfb, 0, sizeof(cmdbuf->state.gfx.xfb));
+#endif
+
+    return VK_SUCCESS;
+ }
+--- a/src/panfrost/vulkan/jm/panvk_vX_cmd_xfb.c	2026-05-18 12:50:53.067999996 +0200
+++ b/src/panfrost/vulkan/jm/panvk_vX_cmd_xfb.c	2026-05-20 19:10:27.175979847 +0200
+@@ -0,0 +1,111 @@
+/*
+ * Copyright © 2026 mfritsche / claude-noether
+ * SPDX-License-Identifier: MIT
+ *
+ * iter13: VK_EXT_transform_feedback command handlers for the JM
+ * architecture path (Bifrost v6/v7; the xfb state exists only for PAN_ARCH < 9).
+ *
+ * The runtime contract:
+ *   - vkCmdBindTransformFeedbackBuffersEXT: stash (gpu_addr, offset, size)
+ *     for each slot into cmdbuf->state.gfx.xfb.buffers[].
+ *   - vkCmdBeginTransformFeedbackEXT: set cmdbuf->state.gfx.xfb.active = true.
+ *     The next draw re-emits vs.xfb_address[] via set_gfx_sysval's change check.
+ *   - vkCmdEndTransformFeedbackEXT: set active = false.
+ *
+ * Counter buffers (firstCounterBuffer/counterBufferCount/pCounterBuffers/
+ * pCounterBufferOffsets) are accepted by API but ignored — v1 doesn't
+ * support pause/resume. transformFeedbackDraw is advertised as false.
+ *
+ * Per-draw integration: panvk_vX_cmd_draw.c reads cmdbuf->state.gfx.xfb
+ * and populates vs.xfb_address[i] for shader use. The pan_nir_lower_xfb
+ * pass in panvk_vX_shader.c emits nir_load_xfb_address(i) which lowers
+ * (via panvk_vX_shader.c sysval handler) to a load from the per-draw
+ * sysval push area.
+ */
+
+#include "vk_log.h"
+#include "util/log.h"
+
+#include "panvk_cmd_buffer.h"
+#include "panvk_cmd_draw.h"
+#include "panvk_buffer.h"
+#include "panvk_entrypoints.h"
+
+VKAPI_ATTR void VKAPI_CALL
+panvk_per_arch(CmdBindTransformFeedbackBuffersEXT)(
+   VkCommandBuffer commandBuffer,
+   uint32_t firstBinding,
+   uint32_t bindingCount,
+   const VkBuffer *pBuffers,
+   const VkDeviceSize *pOffsets,
+   const VkDeviceSize *pSizes)
+{
+   VK_FROM_HANDLE(panvk_cmd_buffer, cmdbuf, commandBuffer);
+   struct panvk_cmd_graphics_state *gfx = &cmdbuf->state.gfx;
+
+   for (uint32_t i = 0; i < bindingCount; i++) {
+      uint32_t slot = firstBinding + i;
+      if (slot >= 4)
+         continue;
+
+      VK_FROM_HANDLE(panvk_buffer, buf, pBuffers[i]);
+      gfx->xfb.buffers[slot].addr = panvk_buffer_gpu_ptr(buf, 0);
+      gfx->xfb.buffers[slot].offset = pOffsets[i];
+      gfx->xfb.buffers[slot].size =
+         (pSizes != NULL && pSizes[i] != VK_WHOLE_SIZE)
+            ? pSizes[i]
+            : (buf->vk.size - pOffsets[i]);
+   }
+
+   if (firstBinding + bindingCount > gfx->xfb.buffer_count)
+      gfx->xfb.buffer_count = firstBinding + bindingCount;
+}
+
+VKAPI_ATTR void VKAPI_CALL
+panvk_per_arch(CmdBeginTransformFeedbackEXT)(
+   VkCommandBuffer commandBuffer,
+   uint32_t firstCounterBuffer,
+   uint32_t counterBufferCount,
+   const VkBuffer *pCounterBuffers,
+   const VkDeviceSize *pCounterBufferOffsets)
+{
+   VK_FROM_HANDLE(panvk_cmd_buffer, cmdbuf, commandBuffer);
+   struct panvk_cmd_graphics_state *gfx = &cmdbuf->state.gfx;
+
+   /* Counter buffers are ignored in v1 (transformFeedbackDraw = false; see
+    * VkPhysicalDeviceTransformFeedbackPropertiesEXT in
+    * panvk_vX_physical_device.c). Capture is correct when the app passes no
+    * counter buffers; warn loudly when it does, so we do not silently
+    * produce wrong capture state. */
+   (void)firstCounterBuffer;
+   (void)pCounterBufferOffsets;
+   if (counterBufferCount > 0 && pCounterBuffers != NULL) {
+      mesa_logw("panvk: CmdBeginTransformFeedbackEXT: counter buffers not "
+                "implemented (transformFeedbackDraw=false); XFB resume will "
+                "restart at buffer offset 0");
+   }
+
+   gfx->xfb.active = true;
+   /* Per-draw set_gfx_sysval picks up the change automatically — no
+    * explicit dirty marking required (set_gfx_sysval uses memcmp +
+    * BITSET to detect state diffs and re-emit sysvals). */
+}
+
+VKAPI_ATTR void VKAPI_CALL
+panvk_per_arch(CmdEndTransformFeedbackEXT)(
+   VkCommandBuffer commandBuffer,
+   uint32_t firstCounterBuffer,
+   uint32_t counterBufferCount,
+   const VkBuffer *pCounterBuffers,
+   const VkDeviceSize *pCounterBufferOffsets)
+{
+   VK_FROM_HANDLE(panvk_cmd_buffer, cmdbuf, commandBuffer);
+   struct panvk_cmd_graphics_state *gfx = &cmdbuf->state.gfx;
+
+   (void)firstCounterBuffer;
+   (void)counterBufferCount;
+   (void)pCounterBuffers;
+   (void)pCounterBufferOffsets;
+
+   gfx->xfb.active = false;
+}
@@ -1 +0,0 @@
-../mesa-panvk-bifrost/0004-panvk-bifrost-xfb-primitive-decomposition.patch
@@ -0,0 +1,629 @@
+diff -urN a/src/panfrost/vulkan/meson.build b/src/panfrost/vulkan/meson.build
+--- a/src/panfrost/vulkan/meson.build	2026-05-21 14:04:02.529474145 +0200
+++ b/src/panfrost/vulkan/meson.build	2026-05-21 14:04:04.106755486 +0200
+@@ -123,6 +123,7 @@
+   'panvk_vX_nir_lower_input_attachment_loads.c',
+   'panvk_vX_sampler.c',
+   'panvk_vX_shader.c',
+  'panvk_vX_xfb_lower.c',
+   sha1_h,
+ ]
+ 
+diff -urN a/src/panfrost/vulkan/panvk_shader.h b/src/panfrost/vulkan/panvk_shader.h
+--- a/src/panfrost/vulkan/panvk_shader.h	2026-05-21 14:04:02.525251986 +0200
+++ b/src/panfrost/vulkan/panvk_shader.h	2026-05-21 14:04:04.084251800 +0200
+@@ -154,6 +154,8 @@
+       /* aligned_u64 attribute below inserts the 4-byte alignment gap
+        * after num_vertices automatically — no explicit pad needed. */
+       aligned_u64 xfb_address[4];  /* iter13: 4 transform feedback buffer base addresses */
+      uint32_t xfb_topology;       /* iter17: panvk_xfb_topology enum value */
+      uint32_t xfb_output_count;   /* iter17: per-instance output verts after decomp */
+ #endif
+       int32_t first_vertex;
+       int32_t base_instance;
+@@ -569,4 +571,76 @@
+    struct pan_compute_dim local_size, const void *bin_ptr, size_t bin_size,
+    struct panvk_shader **shader_out);
+ 
+
+#if PAN_ARCH < 9
+/* iter17: encoding for vs.xfb_topology sysval. Maps VkPrimitiveTopology values
+ * we need to distinguish at shader runtime for XFB capture. LIST topologies
+ * use the iter13 single-store fast path; non-LIST need per-vertex decomposition. */
+enum panvk_xfb_topology {
+   PANVK_XFB_TOPO_LIST            = 0,
+   PANVK_XFB_TOPO_LINE_STRIP      = 1,
+   PANVK_XFB_TOPO_TRI_STRIP       = 2,
+   PANVK_XFB_TOPO_TRI_FAN         = 3,
+   PANVK_XFB_TOPO_LINE_LIST_ADJ   = 4,
+   PANVK_XFB_TOPO_LINE_STRIP_ADJ  = 5,
+   PANVK_XFB_TOPO_TRI_LIST_ADJ    = 6,
+   PANVK_XFB_TOPO_TRI_STRIP_ADJ   = 7,
+};
+
+#include "panvk_macros.h"
+struct nir_shader;
+bool panvk_per_arch(nir_lower_xfb)(struct nir_shader *nir);
+
+/* Map VkPrimitiveTopology to panvk_xfb_topology enum (driver-side helper). */
+static inline uint32_t
+panvk_vk_topology_to_xfb_enum(VkPrimitiveTopology topo)
+{
+   switch (topo) {
+   case VK_PRIMITIVE_TOPOLOGY_LINE_STRIP:
+      return PANVK_XFB_TOPO_LINE_STRIP;
+   case VK_PRIMITIVE_TOPOLOGY_TRIANGLE_STRIP:
+      return PANVK_XFB_TOPO_TRI_STRIP;
+   case VK_PRIMITIVE_TOPOLOGY_TRIANGLE_FAN:
+      return PANVK_XFB_TOPO_TRI_FAN;
+   case VK_PRIMITIVE_TOPOLOGY_LINE_LIST_WITH_ADJACENCY:
+      return PANVK_XFB_TOPO_LINE_LIST_ADJ;
+   case VK_PRIMITIVE_TOPOLOGY_LINE_STRIP_WITH_ADJACENCY:
+      return PANVK_XFB_TOPO_LINE_STRIP_ADJ;
+   case VK_PRIMITIVE_TOPOLOGY_TRIANGLE_LIST_WITH_ADJACENCY:
+      return PANVK_XFB_TOPO_TRI_LIST_ADJ;
+   case VK_PRIMITIVE_TOPOLOGY_TRIANGLE_STRIP_WITH_ADJACENCY:
+      return PANVK_XFB_TOPO_TRI_STRIP_ADJ;
+   case VK_PRIMITIVE_TOPOLOGY_POINT_LIST:
+   case VK_PRIMITIVE_TOPOLOGY_LINE_LIST:
+   case VK_PRIMITIVE_TOPOLOGY_TRIANGLE_LIST:
+   default:
+      return PANVK_XFB_TOPO_LIST;
+   }
+}
+
+/* Compute the per-instance output vertex count for a given (topology, input count). */
+static inline uint32_t
+panvk_xfb_output_count(VkPrimitiveTopology topo, uint32_t input_count)
+{
+   switch (topo) {
+   case VK_PRIMITIVE_TOPOLOGY_LINE_STRIP:
+      return input_count >= 1 ? 2u * (input_count - 1u) : 0u;
+   case VK_PRIMITIVE_TOPOLOGY_TRIANGLE_STRIP:
+   case VK_PRIMITIVE_TOPOLOGY_TRIANGLE_FAN:
+      return input_count >= 2 ? 3u * (input_count - 2u) : 0u;
+   case VK_PRIMITIVE_TOPOLOGY_LINE_LIST_WITH_ADJACENCY:
+      return (input_count / 4u) * 2u;
+   case VK_PRIMITIVE_TOPOLOGY_LINE_STRIP_WITH_ADJACENCY:
+      return input_count >= 3 ? 2u * (input_count - 3u) : 0u;
+   case VK_PRIMITIVE_TOPOLOGY_TRIANGLE_LIST_WITH_ADJACENCY:
+      return (input_count / 6u) * 3u;
+   case VK_PRIMITIVE_TOPOLOGY_TRIANGLE_STRIP_WITH_ADJACENCY:
+      return input_count >= 6 ? 3u * (input_count / 2u - 2u) : 0u;
+   default:
+      return input_count;  /* LIST topologies: 1:1 mapping */
+   }
+}
+#endif
+
+
+ #endif
+diff -urN a/src/panfrost/vulkan/panvk_vX_cmd_draw.c b/src/panfrost/vulkan/panvk_vX_cmd_draw.c
+--- a/src/panfrost/vulkan/panvk_vX_cmd_draw.c	2026-05-21 14:04:02.528576354 +0200
+++ b/src/panfrost/vulkan/panvk_vX_cmd_draw.c	2026-05-21 14:04:04.091357598 +0200
+@@ -727,6 +727,20 @@
+    /* iter13: VK_EXT_transform_feedback sysvals — always set (per draw),
+     * reflect bound XFB state. set_gfx_sysval is a no-op if value unchanged. */
+    set_gfx_sysval(cmdbuf, dirty_sysvals, vs.num_vertices, info->vertex.count);
+
+   /* iter17: XFB primitive-decomposition sysvals.
+    * xfb_topology = enum value for the current bound topology.
+    * xfb_output_count = per-instance output vertex count after decomposition.
+    * For LIST topologies, output_count == input vertex count and the shader
+    * takes the iter13 single-store fast path. */
+   {
+      VkPrimitiveTopology vk_topo =
+         cmdbuf->vk.dynamic_graphics_state.ia.primitive_topology;
+      uint32_t topo_enum = panvk_vk_topology_to_xfb_enum(vk_topo);
+      uint32_t out_count = panvk_xfb_output_count(vk_topo, info->vertex.count);
+      set_gfx_sysval(cmdbuf, dirty_sysvals, vs.xfb_topology, topo_enum);
+      set_gfx_sysval(cmdbuf, dirty_sysvals, vs.xfb_output_count, out_count);
+   }
+    {
+       const struct panvk_cmd_graphics_state *_gfx = &cmdbuf->state.gfx;
+       /* iter13: default each XFB buffer address to PAN_SHADER_OOB_ADDRESS
+diff -urN a/src/panfrost/vulkan/panvk_vX_shader.c b/src/panfrost/vulkan/panvk_vX_shader.c
+--- a/src/panfrost/vulkan/panvk_vX_shader.c	2026-05-21 14:04:02.527576494 +0200
+++ b/src/panfrost/vulkan/panvk_vX_shader.c	2026-05-21 14:04:04.098356619 +0200
+@@ -895,7 +895,10 @@
+        nir->info.has_transform_feedback_varyings) {
+       NIR_PASS(_, nir, nir_opt_constant_folding);
+       NIR_PASS(_, nir, nir_io_add_intrinsic_xfb_info);
+-      NIR_PASS(_, nir, pan_nir_lower_xfb);
+      /* iter17: panvk-specific replacement for pan_nir_lower_xfb that handles
+       * primitive decomposition for non-LIST topologies. Single-store LIST
+       * fast path matches iter13 behavior. */
+      NIR_PASS(_, nir, panvk_per_arch(nir_lower_xfb));
+    }
+ #endif
+ }
+diff -urN a/src/panfrost/vulkan/panvk_vX_xfb_lower.c b/src/panfrost/vulkan/panvk_vX_xfb_lower.c
+--- a/src/panfrost/vulkan/panvk_vX_xfb_lower.c	1970-01-01 01:00:00.000000000 +0100
+++ b/src/panfrost/vulkan/panvk_vX_xfb_lower.c	2026-05-21 14:04:04.115354242 +0200
+@@ -0,0 +1,486 @@
+/*
+ * Copyright © 2026 mfritsche / claude-noether
+ * SPDX-License-Identifier: MIT
+ *
+ * iter17: panvk-specific replacement for pan_nir_lower_xfb that handles
+ * primitive decomposition for transform_feedback on non-LIST topologies
+ * (TRIANGLE_STRIP/FAN, LINE_STRIP, *_WITH_ADJACENCY).
+ *
+ * Approach: emit a topology dispatch at the start of each store_output
+ * lowering. The shader reads vs.xfb_topology sysval at runtime and branches
+ * into per-topology emission logic. For each affected topology, the lowered
+ * code emits guarded conditional stores — one per primitive this vertex
+ * contributes to, computing the output buffer position via primitive index
+ * and slot within the decomposed primitive.
+ *
+ * For LIST topologies (POINT/LINE/TRIANGLE LIST), takes a fast path that
+ * matches iter13's single-store behavior.
+ *
+ * For TRIANGLE_FAN, the central vertex (v=0) contributes to ALL primitives
+ * as slot 2 — handled via a NIR loop bounded by num_vertices.
+ *
+ * See ~/src/panvk-bifrost/iter17/phase{0,1,2}_*.md for full design context.
+ */
+
+#include "panvk_macros.h"
+
+#if PAN_ARCH < 9
+
+#include "panvk_shader.h"
+
+#include "compiler/nir/nir_builder.h"
+#include "pan_nir.h"
+
+#include <vulkan/vulkan_core.h>
+
+/* ----- Address arithmetic ----- */
+
+static nir_def *
+xfb_store_addr(nir_builder *b, nir_def *buf, nir_def *out_idx,
+               uint16_t stride, uint16_t offset_bytes)
+{
+   nir_def *byte_off = nir_iadd_imm(b,
+      nir_imul_imm(b, out_idx, stride), offset_bytes);
+   return nir_iadd(b, buf, nir_u2u64(b, byte_off));
+}
+
+static void
+emit_list_store(nir_builder *b, nir_def *buf, nir_def *output_count,
+                nir_def *instance_id, nir_def *raw_vid, nir_def *value,
+                uint16_t stride, uint16_t offset_bytes)
+{
+   nir_def *out_idx = nir_iadd(b,
+      nir_imul(b, instance_id, output_count), raw_vid);
+   nir_def *addr = xfb_store_addr(b, buf, out_idx, stride, offset_bytes);
+   nir_store_global(b, value, addr);
+}
+
+static void
+emit_prim_store(nir_builder *b, nir_def *buf, nir_def *output_count,
+                nir_def *instance_id, nir_def *eligible,
+                nir_def *prim_idx, nir_def *slot,
+                uint32_t verts_per_prim,
+                nir_def *value, uint16_t stride, uint16_t offset_bytes)
+{
+   nir_push_if(b, eligible);
+   {
+      nir_def *out_idx = nir_iadd(b,
+         nir_imul(b, instance_id, output_count),
+         nir_iadd(b, nir_imul_imm(b, prim_idx, verts_per_prim), slot));
+      nir_def *addr = xfb_store_addr(b, buf, out_idx, stride, offset_bytes);
+      nir_store_global(b, value, addr);
+   }
+   nir_pop_if(b, NULL);
+}
+
+/* ----- Per-topology emission ----- */
+
+/* TRIANGLE_STRIP: vertex v contributes to prims v, v-1, v-2 (per eligibility). */
+static void
+emit_tri_strip(nir_builder *b, nir_def *v, nir_def *N,
+               nir_def *buf, nir_def *output_count, nir_def *instance_id,
+               nir_def *value, uint16_t stride, uint16_t offset_bytes)
+{
+   nir_def *Nm2 = nir_iadd_imm(b, N, -2);
+   nir_def *Nm1 = nir_iadd_imm(b, N, -1);
+
+   /* Prim v, slot 0: v < N-2 */
+   emit_prim_store(b, buf, output_count, instance_id,
+      nir_ult(b, v, Nm2),
+      v, nir_imm_int(b, 0), 3, value, stride, offset_bytes);
+
+   /* Prim v-1, slot = 1 if prim even else 2: 1 <= v < N-1 */
+   {
+      nir_def *prim = nir_iadd_imm(b, v, -1);
+      nir_def *parity = nir_iand_imm(b, prim, 1u);
+      nir_def *slot = nir_iadd_imm(b, parity, 1);
+      nir_def *eligible = nir_iand(b,
+         nir_uge(b, v, nir_imm_int(b, 1)),
+         nir_ult(b, v, Nm1));
+      emit_prim_store(b, buf, output_count, instance_id, eligible,
+                      prim, slot, 3, value, stride, offset_bytes);
+   }
+
+   /* Prim v-2, slot = 2 if prim even else 1: 2 <= v < N */
+   {
+      nir_def *prim = nir_iadd_imm(b, v, -2);
+      nir_def *parity = nir_iand_imm(b, prim, 1u);
+      nir_def *slot = nir_isub(b, nir_imm_int(b, 2), parity);
+      nir_def *eligible = nir_iand(b,
+         nir_uge(b, v, nir_imm_int(b, 2)),
+         nir_ult(b, v, N));
+      emit_prim_store(b, buf, output_count, instance_id, eligible,
+                      prim, slot, 3, value, stride, offset_bytes);
+   }
+}
+
+/* LINE_STRIP: vertex v contributes to prim v slot 0 + prim v-1 slot 1. */
+static void
+emit_line_strip(nir_builder *b, nir_def *v, nir_def *N,
+                nir_def *buf, nir_def *output_count, nir_def *instance_id,
+                nir_def *value, uint16_t stride, uint16_t offset_bytes)
+{
+   nir_def *Nm1 = nir_iadd_imm(b, N, -1);
+
+   /* Prim v, slot 0: v < N-1 */
+   emit_prim_store(b, buf, output_count, instance_id,
+      nir_ult(b, v, Nm1),
+      v, nir_imm_int(b, 0), 2, value, stride, offset_bytes);
+
+   /* Prim v-1, slot 1: 1 <= v < N */
+   {
+      nir_def *prim = nir_iadd_imm(b, v, -1);
+      nir_def *eligible = nir_iand(b,
+         nir_uge(b, v, nir_imm_int(b, 1)),
+         nir_ult(b, v, N));
+      emit_prim_store(b, buf, output_count, instance_id, eligible,
+                      prim, nir_imm_int(b, 1), 2, value, stride, offset_bytes);
+   }
+}
+
+/* TRIANGLE_FAN: prim p emits {p+1, p+2, 0}.
+ *   vertex v=0: contributes to ALL prims as slot 2 (loop required)
+ *   vertex v>=1: contributes to prim v-1 as slot 0 (if 1 <= v <= N-2)
+ *   vertex v>=2: contributes to prim v-2 as slot 1 (if 2 <= v <= N-1)
+ */
+static void
+emit_tri_fan(nir_builder *b, nir_def *v, nir_def *N,
+             nir_def *buf, nir_def *output_count, nir_def *instance_id,
+             nir_def *value, uint16_t stride, uint16_t offset_bytes)
+{
+   nir_def *Nm1 = nir_iadd_imm(b, N, -1);
+   nir_def *Nm2 = nir_iadd_imm(b, N, -2);
+
+   /* Prim v-1, slot 0: 1 <= v < N-1 */
+   {
+      nir_def *prim = nir_iadd_imm(b, v, -1);
+      nir_def *eligible = nir_iand(b,
+         nir_uge(b, v, nir_imm_int(b, 1)),
+         nir_ult(b, v, Nm1));
+      emit_prim_store(b, buf, output_count, instance_id, eligible,
+                      prim, nir_imm_int(b, 0), 3, value, stride, offset_bytes);
+   }
+
+   /* Prim v-2, slot 1: 2 <= v < N */
+   {
+      nir_def *prim = nir_iadd_imm(b, v, -2);
+      nir_def *eligible = nir_iand(b,
+         nir_uge(b, v, nir_imm_int(b, 2)),
+         nir_ult(b, v, N));
+      emit_prim_store(b, buf, output_count, instance_id, eligible,
+                      prim, nir_imm_int(b, 1), 3, value, stride, offset_bytes);
+   }
+
+   /* Central vertex (v == 0): loop over all prims, write to slot 2. */
+   nir_push_if(b, nir_ieq_imm(b, v, 0));
+   {
+      nir_variable *p_var = nir_local_variable_create(b->impl,
+         glsl_uint_type(), "fan_p");
+      nir_store_var(b, p_var, nir_imm_int(b, 0), 0x1);
+      nir_push_loop(b);
+      {
+         nir_def *p = nir_load_var(b, p_var);
+         nir_push_if(b, nir_uge(b, p, Nm2));
+         {
+            nir_jump(b, nir_jump_break);
+         }
+         nir_pop_if(b, NULL);
+
+         nir_def *out_idx = nir_iadd(b,
+            nir_imul(b, instance_id, output_count),
+            nir_iadd_imm(b, nir_imul_imm(b, p, 3), 2));
+         nir_def *addr = xfb_store_addr(b, buf, out_idx, stride, offset_bytes);
+         nir_store_global(b, value, addr);
+
+         nir_store_var(b, p_var, nir_iadd_imm(b, p, 1), 0x1);
+      }
+      nir_pop_loop(b, NULL);
+   }
+   nir_pop_if(b, NULL);
+}
+
+/* LINE_LIST_WITH_ADJACENCY: 4-vertex groups [4i..4i+3]; output {4i+1, 4i+2}.
+ *   v contributes if v%4 == 1: prim v/4 slot 0
+ *   v contributes if v%4 == 2: prim v/4 slot 1
+ */
+static void
+emit_line_list_adj(nir_builder *b, nir_def *v, nir_def *N,
+                   nir_def *buf, nir_def *output_count, nir_def *instance_id,
+                   nir_def *value, uint16_t stride, uint16_t offset_bytes)
+{
+   nir_def *vmod4 = nir_iand_imm(b, v, 3u);
+   nir_def *prim = nir_ushr_imm(b, v, 2);  /* v / 4 */
+   /* Skip a trailing incomplete group: only prims < N/4 are emitted. */
+   nir_def *complete = nir_ult(b, prim, nir_ushr_imm(b, N, 2));
+   emit_prim_store(b, buf, output_count, instance_id,
+      nir_iand(b, complete, nir_ieq_imm(b, vmod4, 1)),
+      prim, nir_imm_int(b, 0), 2, value, stride, offset_bytes);
+
+   emit_prim_store(b, buf, output_count, instance_id,
+      nir_iand(b, complete, nir_ieq_imm(b, vmod4, 2)),
+      prim, nir_imm_int(b, 1), 2, value, stride, offset_bytes);
+}
+
+/* LINE_STRIP_WITH_ADJACENCY: N-3 prims (p = 0..N-4); prim p emits {p+1, p+2}.
+ *   v contributes to prim v-1 slot 0 (1 <= v <= N-3)
+ *   v contributes to prim v-2 slot 1 (2 <= v <= N-2)
+ */
+static void
+emit_line_strip_adj(nir_builder *b, nir_def *v, nir_def *N,
+                    nir_def *buf, nir_def *output_count, nir_def *instance_id,
+                    nir_def *value, uint16_t stride, uint16_t offset_bytes)
+{
+   nir_def *Nm1 = nir_iadd_imm(b, N, -1);
+   nir_def *Nm2 = nir_iadd_imm(b, N, -2);
+
+   /* Prim v-1, slot 0: the last prim is N-4, so
+    * 1 <= v <= N-3 ⇔ v >= 1 AND v < N-2 */
+   {
+      nir_def *prim = nir_iadd_imm(b, v, -1);
+      nir_def *eligible = nir_iand(b,
+         nir_uge(b, v, nir_imm_int(b, 1)),
+         nir_ult(b, v, Nm2));
+      emit_prim_store(b, buf, output_count, instance_id, eligible,
+                      prim, nir_imm_int(b, 0), 2, value, stride, offset_bytes);
+   }
+
+   /* Prim v-2, slot 1: 2 <= v <= N-2 ⇔ v >= 2 AND v < N-1 */
+   {
+      nir_def *prim = nir_iadd_imm(b, v, -2);
+      nir_def *eligible = nir_iand(b,
+         nir_uge(b, v, nir_imm_int(b, 2)),
+         nir_ult(b, v, Nm1));
+      emit_prim_store(b, buf, output_count, instance_id, eligible,
+                      prim, nir_imm_int(b, 1), 2, value, stride, offset_bytes);
+   }
+}
+
+/* TRIANGLE_LIST_WITH_ADJACENCY: 6-vertex groups; output {6i, 6i+2, 6i+4}.
+ *   v contributes if v%6 == 0: prim v/6 slot 0
+ *   v contributes if v%6 == 2: prim v/6 slot 1
+ *   v contributes if v%6 == 4: prim v/6 slot 2
+ */
+static void
+emit_tri_list_adj(nir_builder *b, nir_def *v, nir_def *N,
+                  nir_def *buf, nir_def *output_count, nir_def *instance_id,
+                  nir_def *value, uint16_t stride, uint16_t offset_bytes)
+{
+   nir_def *vmod6 = nir_umod_imm(b, v, 6);
+   nir_def *prim = nir_udiv_imm(b, v, 6);
+   /* Skip a trailing incomplete group: only prims < N/6 are emitted. */
+   nir_def *complete = nir_ult(b, prim, nir_udiv_imm(b, N, 6));
+   for (uint32_t slot = 0; slot < 3; slot++) {
+      emit_prim_store(b, buf, output_count, instance_id,
+         nir_iand(b, complete, nir_ieq_imm(b, vmod6, slot * 2)),
+         prim, nir_imm_int(b, slot), 3, value, stride, offset_bytes);
+   }
+}
+
+/* TRIANGLE_STRIP_WITH_ADJACENCY: prim i emits:
+ *   even i: {2i, 2i+2, 2i+4}    (slots 0, 1, 2 ← input indices 2i, 2i+2, 2i+4)
+ *   odd  i: {2i, 2i+4, 2i+2}    (slots 0, 1, 2 ← input indices 2i, 2i+4, 2i+2)
+ *
+ * Only EVEN input vertices contribute (since all output indices are 2*something).
+ * For even input v:
+ *   prim v/2 slot 0 (always, if v/2 < N/2-2)
+ *   prim (v-2)/2 slot 1 if (v-2)/2 even, slot 2 if odd   (when v >= 2)
+ *   prim (v-4)/2 slot 2 if (v-4)/2 even, slot 1 if odd   (when v >= 4)
+ */
+static void
+emit_tri_strip_adj(nir_builder *b, nir_def *v, nir_def *N,
+                   nir_def *buf, nir_def *output_count, nir_def *instance_id,
+                   nir_def *value, uint16_t stride, uint16_t offset_bytes)
+{
+   /* Bail for odd input vertices — they never contribute. */
+   nir_def *v_is_even = nir_ieq_imm(b, nir_iand_imm(b, v, 1u), 0);
+   nir_push_if(b, v_is_even);
+   {
+      nir_def *N_half = nir_ushr_imm(b, N, 1);
+      nir_def *max_prim = nir_iadd_imm(b, N_half, -2);  /* N/2 - 2 */
+      nir_def *v_half = nir_ushr_imm(b, v, 1);
+
+      /* Prim v/2 slot 0: v/2 < N/2 - 2 */
+      emit_prim_store(b, buf, output_count, instance_id,
+         nir_ult(b, v_half, max_prim),
+         v_half, nir_imm_int(b, 0), 3, value, stride, offset_bytes);
+
+      /* Prim (v-2)/2 = v/2 - 1: v >= 2 AND prim < N/2-2 */
+      {
+         nir_def *prim = nir_iadd_imm(b, v_half, -1);
+         nir_def *parity = nir_iand_imm(b, prim, 1u);
+         nir_def *slot = nir_iadd_imm(b, parity, 1);  /* even→1, odd→2 */
+         nir_def *eligible = nir_iand(b,
+            nir_uge(b, v, nir_imm_int(b, 2)),
+            nir_ult(b, prim, max_prim));
+         emit_prim_store(b, buf, output_count, instance_id, eligible,
+                         prim, slot, 3, value, stride, offset_bytes);
+      }
+
+      /* Prim (v-4)/2 = v/2 - 2: v >= 4 AND prim < N/2-2 */
+      {
+         nir_def *prim = nir_iadd_imm(b, v_half, -2);
+         nir_def *parity = nir_iand_imm(b, prim, 1u);
+         nir_def *slot = nir_isub(b, nir_imm_int(b, 2), parity);  /* even→2, odd→1 */
+         nir_def *eligible = nir_iand(b,
+            nir_uge(b, v, nir_imm_int(b, 4)),
+            nir_ult(b, prim, max_prim));
+         emit_prim_store(b, buf, output_count, instance_id, eligible,
+                         prim, slot, 3, value, stride, offset_bytes);
+      }
+   }
+   nir_pop_if(b, NULL);
+}
+
+/* ----- Main lowering: per store_output XFB channel ----- */
+
+static void
+lower_xfb_output_iter17(nir_builder *b, nir_intrinsic_instr *intr,
+                        unsigned channel_idx, unsigned num_components,
+                        unsigned buffer, unsigned offset_words)
+{
+   assert(buffer < MAX_XFB_BUFFERS);
+   assert(nir_intrinsic_component(intr) == 0);
+
+   uint16_t stride = b->shader->info.xfb_stride[buffer] * 4;
+   assert(stride != 0);
+   uint16_t offset_bytes = offset_words * 4;
+
+   BITSET_SET(b->shader->info.system_values_read, SYSTEM_VALUE_VERTEX_ID_ZERO_BASE);
+   BITSET_SET(b->shader->info.system_values_read, SYSTEM_VALUE_INSTANCE_ID);
+
+   nir_def *topology = load_sysval(b, graphics, 32, vs.xfb_topology);
+   nir_def *out_count = load_sysval(b, graphics, 32, vs.xfb_output_count);
+   nir_def *N = nir_load_num_vertices(b);
+   nir_def *v = nir_load_raw_vertex_id_pan(b);
+   nir_def *instance = nir_load_instance_id(b);
+   nir_def *buf = nir_load_xfb_address(b, 64, .base = buffer);
+
+   nir_def *src = intr->src[0].ssa;
+   nir_component_mask_t mask = nir_component_mask(num_components);
+   nir_def *value = nir_channels(b, src, mask << channel_idx);
+
+   /* Topology dispatch ladder. LIST first (fast path). */
+   nir_push_if(b, nir_ieq_imm(b, topology, PANVK_XFB_TOPO_LIST));
+   {
+      emit_list_store(b, buf, out_count, instance, v, value,
+                      stride, offset_bytes);
+   }
+   nir_push_else(b, NULL);
+   {
+      /* iter17 Janet Finding 3: gate all non-LIST emission on
+       * output_count > 0. For degenerate input counts (N < min required
+       * for the topology), output_count is 0 and we must emit NO stores
+       * — otherwise N-2 / N-3 / etc. arithmetic underflows in the
+       * eligibility predicates and we falsely fire stores. */
+      nir_push_if(b, nir_ult(b, nir_imm_int(b, 0), out_count));
+      {
+      nir_push_if(b, nir_ieq_imm(b, topology, PANVK_XFB_TOPO_TRI_STRIP));
+      {
+         emit_tri_strip(b, v, N, buf, out_count, instance, value,
+                        stride, offset_bytes);
+      }
+      nir_push_else(b, NULL);
+      {
+         nir_push_if(b, nir_ieq_imm(b, topology, PANVK_XFB_TOPO_LINE_STRIP));
+         {
+            emit_line_strip(b, v, N, buf, out_count, instance, value,
+                            stride, offset_bytes);
+         }
+         nir_push_else(b, NULL);
+         {
+            nir_push_if(b, nir_ieq_imm(b, topology, PANVK_XFB_TOPO_TRI_FAN));
+            {
+               emit_tri_fan(b, v, N, buf, out_count, instance, value,
+                            stride, offset_bytes);
+            }
+            nir_push_else(b, NULL);
+            {
+               nir_push_if(b, nir_ieq_imm(b, topology, PANVK_XFB_TOPO_LINE_LIST_ADJ));
+               {
+                  emit_line_list_adj(b, v, N, buf, out_count, instance, value,
+                                     stride, offset_bytes);
+               }
+               nir_push_else(b, NULL);
+               {
+                  nir_push_if(b, nir_ieq_imm(b, topology, PANVK_XFB_TOPO_LINE_STRIP_ADJ));
+                  {
+                     emit_line_strip_adj(b, v, N, buf, out_count, instance, value,
+                                         stride, offset_bytes);
+                  }
+                  nir_push_else(b, NULL);
+                  {
+                     nir_push_if(b, nir_ieq_imm(b, topology, PANVK_XFB_TOPO_TRI_LIST_ADJ));
+                     {
+                        emit_tri_list_adj(b, v, N, buf, out_count, instance, value,
+                                          stride, offset_bytes);
+                     }
+                     nir_push_else(b, NULL);
+                     {
+                        /* TRI_STRIP_ADJ — last case */
+                        emit_tri_strip_adj(b, v, N, buf, out_count, instance, value,
+                                           stride, offset_bytes);
+                     }
+                     nir_pop_if(b, NULL);
+                  }
+                  nir_pop_if(b, NULL);
+               }
+               nir_pop_if(b, NULL);
+            }
+            nir_pop_if(b, NULL);
+         }
+         nir_pop_if(b, NULL);
+      }
+      nir_pop_if(b, NULL);
+      }
+      nir_pop_if(b, NULL);  /* Janet Finding 3: close output_count > 0 guard */
+   }
+   nir_pop_if(b, NULL);
+}
+
+/* Mirror of pan_nir_lower_xfb's lower_xfb: load_vertex_id rewrite +
+ * dispatch store_output through our topology-aware emission. */
+static bool
+lower_xfb_iter17(nir_builder *b, nir_intrinsic_instr *intr,
+                 UNUSED void *data)
+{
+   if (intr->intrinsic == nir_intrinsic_load_vertex_id) {
+      b->cursor = nir_instr_remove(&intr->instr);
+      nir_def *repl = nir_iadd(b, nir_load_raw_vertex_id_pan(b),
+                               nir_load_raw_vertex_offset_pan(b));
+      nir_def_rewrite_uses(&intr->def, repl);
+      return true;
+   }
+
+   if (intr->intrinsic != nir_intrinsic_store_output)
+      return false;
+
+   bool progress = false;
+   b->cursor = nir_before_instr(&intr->instr);
+
+   /* io_xfb has only out[0,1]; the other 2 channels are in io_xfb2.
+    * Outer loop selects which annotation; inner picks which channel. */
+   for (unsigned i = 0; i < 2; ++i) {
+      nir_io_xfb xfb = i ? nir_intrinsic_io_xfb2(intr)
+                         : nir_intrinsic_io_xfb(intr);
+      for (unsigned j = 0; j < 2; ++j) {
+         if (!xfb.out[j].num_components)
+            continue;
+         lower_xfb_output_iter17(b, intr, i * 2 + j, xfb.out[j].num_components,
+                                 xfb.out[j].buffer, xfb.out[j].offset);
+         progress = true;
+      }
+   }
+
+   if (progress)
+      nir_instr_remove(&intr->instr);
+   return progress;
+}
+
+bool
+panvk_per_arch(nir_lower_xfb)(nir_shader *nir)
+{
+   return nir_shader_intrinsics_pass(
+      nir, lower_xfb_iter17, nir_metadata_control_flow, NULL);
+}
+
+#endif /* PAN_ARCH < 9 */
@@ -1,6 +1,6 @@
 diff -urN a/src/panfrost/vulkan/jm/panvk_cmd_buffer.h b/src/panfrost/vulkan/jm/panvk_cmd_buffer.h
 --- a/src/panfrost/vulkan/jm/panvk_cmd_buffer.h	2026-05-21 22:46:57.477785029 +0200
-+++ b/src/panfrost/vulkan/jm/panvk_cmd_buffer.h	2026-05-21 22:47:09.189957157 +0200
+++ b/src/panfrost/vulkan/jm/panvk_cmd_buffer.h	2026-05-22 10:17:41.214043265 +0200
@@ -88,8 +88,18 @@
       struct panvk_cmd_compute_state compute;
       struct panvk_push_constant_state push_constants;
@@ -22,7 +22,7 @@ diff -urN a/src/panfrost/vulkan/jm/panvk_cmd_buffer.h b/src/panfrost/vulkan/jm/p
 
 diff -urN a/src/panfrost/vulkan/meson.build b/src/panfrost/vulkan/meson.build
 --- a/src/panfrost/vulkan/meson.build	2026-05-21 22:46:59.277811484 +0200
-+++ b/src/panfrost/vulkan/meson.build	2026-05-21 22:47:09.189957157 +0200
+++ b/src/panfrost/vulkan/meson.build	2026-05-22 10:17:41.214043265 +0200
@@ -41,6 +41,10 @@
   'panvk_device_memory.c',
   'panvk_host_copy.c',
@@ -36,7 +36,7 @@ diff -urN a/src/panfrost/vulkan/meson.build b/src/panfrost/vulkan/meson.build
   'panvk_physical_device.c',
 diff -urN a/src/panfrost/vulkan/panvk_buffer.c b/src/panfrost/vulkan/panvk_buffer.c
 --- a/src/panfrost/vulkan/panvk_buffer.c	2026-05-21 22:46:57.485785147 +0200
-+++ b/src/panfrost/vulkan/panvk_buffer.c	2026-05-21 22:47:09.189957157 +0200
+++ b/src/panfrost/vulkan/panvk_buffer.c	2026-05-22 10:17:41.214043265 +0200
@@ -88,6 +88,8 @@
          *bind_status->pResult = VK_SUCCESS;
 
@@ -48,7 +48,7 @@ diff -urN a/src/panfrost/vulkan/panvk_buffer.c b/src/panfrost/vulkan/panvk_buffe
 }
 diff -urN a/src/panfrost/vulkan/panvk_buffer.h b/src/panfrost/vulkan/panvk_buffer.h
 --- a/src/panfrost/vulkan/panvk_buffer.h	2026-05-21 22:46:57.485785147 +0200
-+++ b/src/panfrost/vulkan/panvk_buffer.h	2026-05-21 22:47:09.189957157 +0200
+++ b/src/panfrost/vulkan/panvk_buffer.h	2026-05-22 10:17:41.214043265 +0200
@@ -14,8 +14,14 @@
 
 struct panvk_priv_bo;
@@ -66,7 +66,7 @@ diff -urN a/src/panfrost/vulkan/panvk_buffer.h b/src/panfrost/vulkan/panvk_buffe
 VK_DEFINE_NONDISP_HANDLE_CASTS(panvk_buffer, vk.base, VkBuffer,
 diff -urN a/src/panfrost/vulkan/panvk_device.h b/src/panfrost/vulkan/panvk_device.h
 --- a/src/panfrost/vulkan/panvk_device.h	2026-05-21 22:46:57.489785206 +0200
-+++ b/src/panfrost/vulkan/panvk_device.h	2026-05-21 22:47:09.189957157 +0200
+++ b/src/panfrost/vulkan/panvk_device.h	2026-05-22 10:17:41.214043265 +0200
@@ -45,6 +45,8 @@
 enum panvk_queue_family {
    PANVK_QUEUE_FAMILY_GPU,
@@ -102,7 +102,7 @@ diff -urN a/src/panfrost/vulkan/panvk_device.h b/src/panfrost/vulkan/panvk_devic
    struct {
 diff -urN a/src/panfrost/vulkan/panvk_physical_device.c b/src/panfrost/vulkan/panvk_physical_device.c
 --- a/src/panfrost/vulkan/panvk_physical_device.c	2026-05-21 22:46:57.497785323 +0200
-+++ b/src/panfrost/vulkan/panvk_physical_device.c	2026-05-21 22:47:09.189957157 +0200
+++ b/src/panfrost/vulkan/panvk_physical_device.c	2026-05-22 10:17:41.214043265 +0200
@@ -577,12 +577,22 @@
          .queueFlags = VK_QUEUE_SPARSE_BINDING_BIT,
          .queueCount = 1,
@@ -234,8 +234,8 @@ diff -urN a/src/panfrost/vulkan/panvk_physical_device.c b/src/panfrost/vulkan/pa
 +}
 diff -urN a/src/panfrost/vulkan/panvk_v4l2.c b/src/panfrost/vulkan/panvk_v4l2.c
 --- a/src/panfrost/vulkan/panvk_v4l2.c	1970-01-01 01:00:00.000000000 +0100
-+++ b/src/panfrost/vulkan/panvk_v4l2.c	2026-05-21 22:47:09.189957157 +0200
-@@ -0,0 +1,569 @@
+++ b/src/panfrost/vulkan/panvk_v4l2.c	2026-05-22 10:17:41.214043265 +0200
+@@ -0,0 +1,615 @@
 +/*
 + * panvk-bifrost-video Phase 4 commit 3:
 + *
@@ -250,6 +250,7 @@ diff -urN a/src/panfrost/vulkan/panvk_v4l2.c b/src/panfrost/vulkan/panvk_v4l2.c
 +#include "panvk_video_decode.h"
 +#include "panvk_device.h"
 +
+#include "util/macros.h"
 +#include "vk_alloc.h"
 +#include "vk_log.h"
 +
@@ -417,7 +418,9 @@ diff -urN a/src/panfrost/vulkan/panvk_v4l2.c b/src/panfrost/vulkan/panvk_v4l2.c
 +      mesa_loge("panvk_v4l2: REQBUFS OUTPUT failed: %s", strerror(errno));
 +      return -errno;
 +   }
-+   vs->num_output_buffers = rb.count;
+   /* REQBUFS may round up the count above the request — clamp to our
+    * fixed-size mmap arrays (Phase 5 review: prevents output_map OOB). */
+   vs->num_output_buffers = MIN2(rb.count, 18);
 +   vs->output_next = 0;
 +
 +   /* CAPTURE: MMAP — kernel-allocated, mmap to CPU for copy-out path. */
@@ -430,7 +433,7 @@ diff -urN a/src/panfrost/vulkan/panvk_v4l2.c b/src/panfrost/vulkan/panvk_v4l2.c
 +      mesa_loge("panvk_v4l2: REQBUFS CAPTURE failed: %s", strerror(errno));
 +      return -errno;
 +   }
-+   vs->num_capture_buffers = rb.count;
+   vs->num_capture_buffers = MIN2(rb.count, 18);
 +   vs->capture_next = 0;
 +
 +   return 0;
@@ -788,6 +791,49 @@ diff -urN a/src/panfrost/vulkan/panvk_v4l2.c b/src/panfrost/vulkan/panvk_v4l2.c
 +                          struct vk_device *vk_dev,
 +                          const VkAllocationCallbacks *alloc)
 +{
+   /* Unwind in reverse order of session_init. Each step is guarded by
+    * "have we got far enough to need this" so the function is safe to
+    * call on partially-initialised sessions (the session_init failure
+    * paths jump here via `goto fail`). */
+
+   /* munmap CAPTURE + OUTPUT (no-op for entries left at NULL by an
+    * earlier-failed mmap loop). */
+   for (unsigned i = 0; i < 18; i++) {
+      if (vs->capture_map[i]) {
+         munmap(vs->capture_map[i], vs->capture_map_size[i]);
+         vs->capture_map[i] = NULL;
+         vs->capture_map_size[i] = 0;
+      }
+      if (vs->output_map[i]) {
+         munmap(vs->output_map[i], vs->output_map_size[i]);
+         vs->output_map[i] = NULL;
+         vs->output_map_size[i] = 0;
+      }
+   }
+
+   if (vs->video_fd >= 0) {
+      /* STREAMOFF (safe to call even if STREAMON never ran — kernel
+       * returns EINVAL which we ignore). */
+      enum v4l2_buf_type t;
+      t = vs->mplane ? V4L2_BUF_TYPE_VIDEO_OUTPUT_MPLANE
+                     : V4L2_BUF_TYPE_VIDEO_OUTPUT;
+      (void) ioctl(vs->video_fd, VIDIOC_STREAMOFF, &t);
+      t = vs->mplane ? V4L2_BUF_TYPE_VIDEO_CAPTURE_MPLANE
+                     : V4L2_BUF_TYPE_VIDEO_CAPTURE;
+      (void) ioctl(vs->video_fd, VIDIOC_STREAMOFF, &t);
+
+      /* Release the kernel buffer queues via REQBUFS count=0. */
+      struct v4l2_requestbuffers rb;
+      memset(&rb, 0, sizeof(rb));
+      rb.memory = V4L2_MEMORY_MMAP;
+      rb.type = vs->mplane ? V4L2_BUF_TYPE_VIDEO_OUTPUT_MPLANE
+                           : V4L2_BUF_TYPE_VIDEO_OUTPUT;
+      (void) ioctl(vs->video_fd, VIDIOC_REQBUFS, &rb);
+      rb.type = vs->mplane ? V4L2_BUF_TYPE_VIDEO_CAPTURE_MPLANE
+                           : V4L2_BUF_TYPE_VIDEO_CAPTURE;
+      (void) ioctl(vs->video_fd, VIDIOC_REQBUFS, &rb);
+   }
+
 +   if (vs->request_fds) {
 +      for (unsigned i = 0; i < vs->num_request_fds; i++)
 +         if (vs->request_fds[i] >= 0)
@@ -807,7 +853,7 @@ diff -urN a/src/panfrost/vulkan/panvk_v4l2.c b/src/panfrost/vulkan/panvk_v4l2.c
 +}
 diff -urN a/src/panfrost/vulkan/panvk_v4l2_h264.c b/src/panfrost/vulkan/panvk_v4l2_h264.c
 --- a/src/panfrost/vulkan/panvk_v4l2_h264.c	1970-01-01 01:00:00.000000000 +0100
-+++ b/src/panfrost/vulkan/panvk_v4l2_h264.c	2026-05-21 22:47:09.189957157 +0200
+++ b/src/panfrost/vulkan/panvk_v4l2_h264.c	2026-05-22 10:17:41.214043265 +0200
@@ -0,0 +1,478 @@
 +/*
 + * panvk-bifrost-video Phase 4: Vulkan StdVideo H.264 → V4L2 stateless H.264
@@ -1289,7 +1335,7 @@ diff -urN a/src/panfrost/vulkan/panvk_v4l2_h264.c b/src/panfrost/vulkan/panvk_v4
 +}
 diff -urN a/src/panfrost/vulkan/panvk_v4l2_h264_slice_header.c b/src/panfrost/vulkan/panvk_v4l2_h264_slice_header.c
 --- a/src/panfrost/vulkan/panvk_v4l2_h264_slice_header.c	1970-01-01 01:00:00.000000000 +0100
-+++ b/src/panfrost/vulkan/panvk_v4l2_h264_slice_header.c	2026-05-21 22:47:09.189957157 +0200
+++ b/src/panfrost/vulkan/panvk_v4l2_h264_slice_header.c	2026-05-22 10:17:41.214043265 +0200
@@ -0,0 +1,314 @@
 +/*
 + * H.264 slice header bit-parser implementation.
@@ -1607,7 +1653,7 @@ diff -urN a/src/panfrost/vulkan/panvk_v4l2_h264_slice_header.c b/src/panfrost/vu
 +}
 diff -urN a/src/panfrost/vulkan/panvk_v4l2_h264_slice_header.h b/src/panfrost/vulkan/panvk_v4l2_h264_slice_header.h
 --- a/src/panfrost/vulkan/panvk_v4l2_h264_slice_header.h	1970-01-01 01:00:00.000000000 +0100
-+++ b/src/panfrost/vulkan/panvk_v4l2_h264_slice_header.h	2026-05-21 22:47:09.189957157 +0200
+++ b/src/panfrost/vulkan/panvk_v4l2_h264_slice_header.h	2026-05-22 10:17:41.214043265 +0200
@@ -0,0 +1,94 @@
 +/*
 + * H.264 slice header bit-parser for panvk-bifrost-video / V4L2 stateless
@@ -1705,17 +1751,35 @@ diff -urN a/src/panfrost/vulkan/panvk_v4l2_h264_slice_header.h b/src/panfrost/vu
 +#endif /* PANVK_V4L2_H264_SLICE_HEADER_H */
 diff -urN a/src/panfrost/vulkan/panvk_video_decode.c b/src/panfrost/vulkan/panvk_video_decode.c
 --- a/src/panfrost/vulkan/panvk_video_decode.c	1970-01-01 01:00:00.000000000 +0100
-+++ b/src/panfrost/vulkan/panvk_video_decode.c	2026-05-21 22:47:09.189957157 +0200
-@@ -0,0 +1,362 @@
+++ b/src/panfrost/vulkan/panvk_video_decode.c	2026-05-22 10:17:41.214043265 +0200
+@@ -0,0 +1,380 @@
 +/*
-+ * panvk-bifrost-video Phase 4 commit 7b:
-+ * Vulkan-side decode dispatch wired to V4L2 hantro via dmabuf.
+ * panvk-bifrost-video: Vulkan video decode entrypoints (H.264).
 + *
-+ * Phase 1 simplification: cmd_buffer state tracking via DEVICE-level
-+ * active_video struct (under a mutex). Per-cmdbuf state hand-off is
-+ * Phase >>1 once arch-agnostic source can access per-arch cmd_buffer
-+ * structs without the include-path gymnastics. This works for
-+ * single-session decode workloads (mpv, ffmpeg, vk-video-samples).
+ * Drives the V4L2 stateless hantro VPU backend (panvk_v4l2.c) from
+ * Vulkan vkCmdDecodeVideoKHR. Decode is synchronous at record time —
+ * the full V4L2 ioctl dance runs to completion inside the command-
+ * recording call before returning to the application. The queue-side
+ * `driver_submit` is a no-op signal-everything (see panvk_vX_device.c).
+ *
+ * Phase 1 simplifications worth knowing about:
+ *
+ *  - Cmd-buffer state lives at the DEVICE level (`active_video`) under
+ *    a single mutex, NOT per-cmd-buffer. Concurrent video sessions on
+ *    the same device clobber each other. Sufficient for current single-
+ *    session consumers (mpv-fourier, ffmpeg-vulkan-h264, vk-video-
+ *    samples). Spec-compliant multi-session is a Phase >>1 follow-up.
+ *
+ *  - Source bitstream is read via `src_buf->mem->addr.host`, i.e. the
+ *    bound VkDeviceMemory's CPU mapping. Works because panvk-bifrost
+ *    only exposes HOST_VISIBLE memory types; an app that bound the
+ *    bitstream buffer to non-HOST_VISIBLE memory would get a logged
+ *    error and a silent decode skip (CmdDecodeVideoKHR is void, so we
+ *    have no clean error-return path). VkPhysicalDeviceVideo*
+ *    constraints would be the right place to make this contractual.
+ *
+ *  - Requires `PAN_I_WANT_A_BROKEN_VULKAN_DRIVER=1` (mesa-upstream gate
+ *    on panvk-on-Bifrost which is not conformant).
 + *
 + * SPDX-License-Identifier: MIT
 + */
@@ -1929,10 +1993,10 @@ diff -urN a/src/panfrost/vulkan/panvk_video_decode.c b/src/panfrost/vulkan/panvk
 +    * `tv_sec * 1e9 + tv_usec * 1e3`). Sub-microsecond bits are dropped, so
 +    * any high-resolution stamp (e.g. a 64-bit pointer cast) makes the
 +    * lookup miss and P/B frames decode against zero references. Use a
-+    * monotonic per-session counter in microseconds (i.e. * 1000 ns).
+    * per-session monotonic counter in microseconds (i.e. * 1000 ns) so
+    * concurrent sessions sharing /dev/video1 don't collide on stamp.
 +    */
-+   static uint32_t panvk_video_ts_counter = 0;
-+   const uint64_t output_ts = ((uint64_t)++panvk_video_ts_counter) * 1000ULL;
+   const uint64_t output_ts = ((uint64_t)++vs->ts_counter) * 1000ULL;
 +   uint32_t dst_dpb_slot = pDecodeInfo->pSetupReferenceSlot
 +      ? (uint32_t) pDecodeInfo->pSetupReferenceSlot->slotIndex : 0u;
 +
@@ -2071,8 +2135,8 @@ diff -urN a/src/panfrost/vulkan/panvk_video_decode.c b/src/panfrost/vulkan/panvk
 +}
 diff -urN a/src/panfrost/vulkan/panvk_video_decode.h b/src/panfrost/vulkan/panvk_video_decode.h
 --- a/src/panfrost/vulkan/panvk_video_decode.h	1970-01-01 01:00:00.000000000 +0100
-+++ b/src/panfrost/vulkan/panvk_video_decode.h	2026-05-21 22:47:09.189957157 +0200
-@@ -0,0 +1,114 @@
+++ b/src/panfrost/vulkan/panvk_video_decode.h	2026-05-22 10:17:41.214043265 +0200
+@@ -0,0 +1,124 @@
 +/*
 + * panvk-bifrost-video Phase 4 commit 3: extended for V4L2 state.
 + *
@@ -2103,12 +2167,22 @@ diff -urN a/src/panfrost/vulkan/panvk_video_decode.h b/src/panfrost/vulkan/panvk
 +   struct v4l2_format fmt_output;
 +   struct v4l2_format fmt_capture;
 +
-+   /* Request fd pool. PANVK_V4L2_REQUEST_FD_COUNT entries. */
+   /* Request fd pool. PANVK_V4L2_REQUEST_FD_COUNT entries.
+    * Size of request_fd_used[] is bounded by the same compile-time max;
+    * keep them coupled to avoid silent overflow if the pool grows. */
+#define PANVK_VIDEO_REQUEST_FD_MAX 32
 +   int *request_fds;
-+   bool request_fd_used[32];     /* tracks per-fd "ever queued" → REINIT before reuse */
+   bool request_fd_used[PANVK_VIDEO_REQUEST_FD_MAX];
 +   unsigned num_request_fds;
 +   uint32_t request_fd_next;     /* round-robin index */
 +
+   /* Per-session V4L2 buffer-identity counter. Multiplied by 1000 ns at
+    * QBUF time so the stamp round-trips losslessly through (tv_sec,
+    * tv_usec) — hantro's reflist builder matches dpb[i].reference_ts
+    * against the kernel-side OUTPUT timestamp. Per-session (not process-
+    * global) so concurrent sessions sharing /dev/video1 don't collide. */
+   uint32_t ts_counter;
+
 +   /* DPB slotIndex → V4L2 reference_ts mapping (Phase 1 D5) */
 +   struct {
 +      bool valid;
@@ -2189,7 +2263,7 @@ diff -urN a/src/panfrost/vulkan/panvk_video_decode.h b/src/panfrost/vulkan/panvk
 +#endif /* PANVK_VIDEO_DECODE_H */
 diff -urN a/src/panfrost/vulkan/panvk_vX_device.c b/src/panfrost/vulkan/panvk_vX_device.c
 --- a/src/panfrost/vulkan/panvk_vX_device.c	2026-05-21 22:46:57.505785441 +0200
-+++ b/src/panfrost/vulkan/panvk_vX_device.c	2026-05-21 22:47:09.189957157 +0200
+++ b/src/panfrost/vulkan/panvk_vX_device.c	2026-05-22 10:17:41.214043265 +0200
@@ -203,6 +203,27 @@
    }
 }
@@ -2372,14 +2446,27 @@ diff -urN a/src/panfrost/vulkan/panvk_vX_device.c b/src/panfrost/vulkan/panvk_vX
       qf->queues =
 diff -urN a/src/panfrost/vulkan/panvk_vX_physical_device.c b/src/panfrost/vulkan/panvk_vX_physical_device.c
 --- a/src/panfrost/vulkan/panvk_vX_physical_device.c	2026-05-21 22:46:59.273811425 +0200
-+++ b/src/panfrost/vulkan/panvk_vX_physical_device.c	2026-05-21 22:47:09.189957157 +0200
-@@ -170,6 +170,9 @@
+++ b/src/panfrost/vulkan/panvk_vX_physical_device.c	2026-05-22 10:17:41.214043265 +0200
+@@ -12,6 +12,7 @@
+ #include <sys/sysmacros.h>
+ 
+ #include "git_sha1.h"
+#include "panvk_video_decode.h"
+ 
+ #include "vk_android.h"
+ #include "vk_device.h"
+@@ -170,6 +171,14 @@
       .EXT_queue_family_foreign = true,
       .EXT_robustness2 = true,
       .EXT_transform_feedback = PAN_ARCH < 9,   /* iter13: JM-class only for now */
-+      .KHR_video_queue = PAN_ARCH < 9,        /* panvk-bifrost-video Phase 4 commit 1 */
-+      .KHR_video_decode_queue = PAN_ARCH < 9, /* hantro V4L2-stateless backend */
-+      .KHR_video_decode_h264 = PAN_ARCH < 9,  /* H.264 only initially */
+      /* Video extensions are advertised only when (a) we're on a Bifrost
+       * arch (PAN_ARCH < 9) AND (b) a hantro VPU is reachable on the
+       * expected V4L2 nodes — otherwise CreateVideoSessionKHR would
+       * succeed at the panvk layer and then fail at v4l2_open_fds, giving
+       * the app a misleading capability claim. */
+      .KHR_video_queue = PAN_ARCH < 9 && panvk_v4l2_probe_hantro(),
+      .KHR_video_decode_queue = PAN_ARCH < 9 && panvk_v4l2_probe_hantro(),
+      .KHR_video_decode_h264 = PAN_ARCH < 9 && panvk_v4l2_probe_hantro(),
       .EXT_sampler_filter_minmax = PAN_ARCH >= 10,
       .EXT_scalar_block_layout = true,
       .EXT_separate_stencil_usage = true,
@@ -0,0 +1,107 @@
+From 1b286ddb4efaca26ec9b9e290e989fec77dc1c77 Mon Sep 17 00:00:00 2001
+From: Markus Fritsche <mfritsche@reauktion.de>
+Date: Fri, 22 May 2026 10:18:21 +0200
+Subject: [PATCH] avcodec/aarch64/h264dsp: route H.264 8x8 IDCT through
+ daedalus-fourier
+MIME-Version: 1.0
+Content-Type: text/plain; charset=UTF-8
+Content-Transfer-Encoding: 8bit
+
+H264DSPContext.idct8_add (called per 8x8 block from the High-profile
+intra-8x8-DCT decode path in h264_mb.c) now dispatches through
+daedalus_recipe_dispatch_h264_idct8 instead of ff_h264_idct8_add_neon.
+
+The recipe layer picks the substrate; for cycle 7 (H.264 IDCT 8x8)
+the recipe is CPU NEON, so this is effectively a NEON-to-NEON
+substitution layered on top of the cycle-6 IDCT 4x4 wiring.  Same
+pthread_once global context, same destructive-zero semantics; FFmpeg
+column-major 8x8 storage block[r + 8*c] matches daedalus's convention.
+
+Bulk path c->idct8_add4 (used for inter 8x8-DCT macroblocks) remains
+on the in-tree NEON .S code and will be batched through
+daedalus_recipe_dispatch_h264_idct8 with n_blocks>1 in a follow-up.
+
+Bit-exact against ff_h264_idct8_add_neon (daedalus-fourier cycle 7
+green).
+
+Refs reauktion/daedalus-v4l2#11 — substitution arc step 2 cycle 7.
+---
+ libavcodec/aarch64/h264_idct_daedalus.c   | 29 ++++++++++++++++-------
+ libavcodec/aarch64/h264dsp_init_aarch64.c |  3 ++-
+ 2 files changed, 23 insertions(+), 9 deletions(-)
+
+diff --git a/libavcodec/aarch64/h264_idct_daedalus.c b/libavcodec/aarch64/h264_idct_daedalus.c
+index 538d223..cbb98af 100644
+--- a/libavcodec/aarch64/h264_idct_daedalus.c
+++ b/libavcodec/aarch64/h264_idct_daedalus.c
+@@ -1,14 +1,16 @@
+ /*
+- * H.264 4x4 IDCT + add — daedalus-fourier substitution shim.
+ * H.264 4x4 / 8x8 IDCT + add — daedalus-fourier substitution shims.
+  *
+- * Routes H264DSPContext.idct_add through
+- * daedalus_recipe_dispatch_h264_idct4 instead of ff_h264_idct_add_neon.
+- * The recipe layer picks the substrate (CPU NEON by default for
+- * cycle 6; future cycles may dispatch to V3D opportunistically).
+ * Routes H264DSPContext.idct_add  → daedalus_recipe_dispatch_h264_idct4
+ *        H264DSPContext.idct8_add → daedalus_recipe_dispatch_h264_idct8
+ * instead of the in-tree ff_h264_idct{,8}_add_neon assembly.  The
+ * recipe layer picks the substrate (CPU NEON by default for cycles
+ * 6 + 7; future cycles may dispatch to V3D opportunistically).
+  *
+- * FFmpeg's 4x4 block memory layout matches daedalus's column-major
+- * convention: block[r + 4*c] = coefficient at (row r, col c).  Both
+- * sides destructively zero the block after the transform.
+ * FFmpeg's 4x4 and 8x8 block memory layouts match daedalus's
+ * column-major convention: block[r + N*c] = coefficient at
+ * (row r, col c) for N ∈ {4, 8}.  Both sides destructively zero the
+ * block after the transform.
+  *
+  * The library context is process-global and lazily initialised under
+  * pthread_once.  We pick the no-QPU constructor here because
+@@ -37,6 +39,7 @@ static void daedalus_ctx_init_once(void)
+ }
+ 
+ void ff_h264_idct_add_daedalus(uint8_t *dst, int16_t *block, int stride);
+void ff_h264_idct8_add_daedalus(uint8_t *dst, int16_t *block, int stride);
+ 
+ void ff_h264_idct_add_daedalus(uint8_t *dst, int16_t *block, int stride)
+ {
+@@ -47,3 +50,13 @@ void ff_h264_idct_add_daedalus(uint8_t *dst, int16_t *block, int stride)
+     daedalus_recipe_dispatch_h264_idct4(g_dctx, dst, (size_t)stride,
+                                         block, 1, &meta);
+ }
+
+void ff_h264_idct8_add_daedalus(uint8_t *dst, int16_t *block, int stride)
+{
+    static const daedalus_h264_block_meta meta = { .dst_off = 0 };
+
+    pthread_once(&g_dctx_once, daedalus_ctx_init_once);
+
+    daedalus_recipe_dispatch_h264_idct8(g_dctx, dst, (size_t)stride,
+                                        block, 1, &meta);
+}
+diff --git a/libavcodec/aarch64/h264dsp_init_aarch64.c b/libavcodec/aarch64/h264dsp_init_aarch64.c
+index b993df2..741e551 100644
+--- a/libavcodec/aarch64/h264dsp_init_aarch64.c
+++ b/libavcodec/aarch64/h264dsp_init_aarch64.c
+@@ -79,6 +79,7 @@ void ff_h264_idct_add8_neon(uint8_t **dest, const int *block_offset,
+                             const uint8_t nnzc[15 * 8]);
+ 
+ void ff_h264_idct8_add_neon(uint8_t *dst, int16_t *block, int stride);
+void ff_h264_idct8_add_daedalus(uint8_t *dst, int16_t *block, int stride);
+ void ff_h264_idct8_dc_add_neon(uint8_t *dst, int16_t *block, int stride);
+ void ff_h264_idct8_add4_neon(uint8_t *dst, const int *block_offset,
+                              int16_t *block, int stride,
+@@ -146,7 +147,7 @@ av_cold void ff_h264dsp_init_aarch64(H264DSPContext *c, const int bit_depth,
+         c->idct_add16intra = ff_h264_idct_add16intra_neon;
+         if (chroma_format_idc <= 1)
+             c->idct_add8   = ff_h264_idct_add8_neon;
+-        c->idct8_add       = ff_h264_idct8_add_neon;
+        c->idct8_add       = ff_h264_idct8_add_daedalus;
+         c->idct8_dc_add    = ff_h264_idct8_dc_add_neon;
+         c->idct8_add4      = ff_h264_idct8_add4_neon;
+     } else if (have_neon(cpu_flags) && bit_depth == 10) {
+-- 
+2.47.3
+
@@ -0,0 +1,121 @@
+From 68731c41d7ea68be0e912b128cb4e71fb56e8263 Mon Sep 17 00:00:00 2001
+From: Markus Fritsche <mfritsche@reauktion.de>
+Date: Fri, 22 May 2026 12:15:16 +0200
+Subject: [PATCH] avcodec/aarch64/h264dsp: route H.264 luma-v deblock through
+ daedalus-fourier
+MIME-Version: 1.0
+Content-Type: text/plain; charset=UTF-8
+Content-Transfer-Encoding: 8bit
+
+H264DSPContext.v_loop_filter_luma (non-intra bS<4 vertical luma
+deblock, called per macroblock-row edge from the slice deblock
+loop) now dispatches through
+daedalus_recipe_dispatch_h264_deblock_luma_v instead of
+ff_h264_v_loop_filter_luma_neon.
+
+The recipe layer picks the substrate; for cycle 8 the daedalus
+docstring marks the kernel "CPU primary; QPU opportunistic", but
+the libavcodec.so context here is built with
+daedalus_ctx_create_no_qpu — process-global pthread_once init,
+shared with cycles 6/7.  QPU opportunism stays gated off until a
+follow-up adds an explicit feature flag (no implicit Vulkan init
+in arbitrary host processes).  In the meantime cycle 8 is a
+plumbing-only substitution, NEON-to-NEON via the daedalus recipe.
+
+Intra (bS=4) loop filter — c->v_loop_filter_luma_intra — stays on
+the in-tree NEON .S code; daedalus's daedalus_h264_deblock_meta
+only covers the non-intra path per its docstring.
+
+FFmpeg's `int alpha`, `int beta` and `int8_t tc0[4]` arguments map onto
+daedalus_h264_deblock_meta (int32_t alpha/beta + inline int8_t tc0[4]).
+pix already points to row 0 of the bottom block per FFmpeg's deblock
+convention, satisfying daedalus's `dst_off >= 4 * dst_stride` constraint.
+
+Refs reauktion/daedalus-v4l2#11 — substitution arc step 2 cycle 8.
+---
+ libavcodec/aarch64/h264_idct_daedalus.c   | 36 +++++++++++++++++++----
+ libavcodec/aarch64/h264dsp_init_aarch64.c |  4 ++-
+ 2 files changed, 33 insertions(+), 7 deletions(-)
+
+diff --git a/libavcodec/aarch64/h264_idct_daedalus.c b/libavcodec/aarch64/h264_idct_daedalus.c
+index cbb98af..92365fa 100644
+--- a/libavcodec/aarch64/h264_idct_daedalus.c
+++ b/libavcodec/aarch64/h264_idct_daedalus.c
+@@ -1,11 +1,14 @@
+ /*
+- * H.264 4x4 / 8x8 IDCT + add — daedalus-fourier substitution shims.
+ * H.264 4x4 / 8x8 IDCT + luma-v deblock — daedalus-fourier substitution shims.
+  *
+- * Routes H264DSPContext.idct_add  → daedalus_recipe_dispatch_h264_idct4
+- *        H264DSPContext.idct8_add → daedalus_recipe_dispatch_h264_idct8
+- * instead of the in-tree ff_h264_idct{,8}_add_neon assembly.  The
+- * recipe layer picks the substrate (CPU NEON by default for cycles
+- * 6 + 7; future cycles may dispatch to V3D opportunistically).
+ * Routes H264DSPContext.idct_add           → daedalus_recipe_dispatch_h264_idct4
+ *        H264DSPContext.idct8_add          → daedalus_recipe_dispatch_h264_idct8
+ *        H264DSPContext.v_loop_filter_luma → daedalus_recipe_dispatch_h264_deblock_luma_v
+ * instead of the in-tree ff_h264_*_neon assembly.  The recipe layer
+ * picks the substrate (CPU NEON for cycles 6 + 7 by default; cycle 8
+ * is CPU primary with QPU opportunistic — the ctx below is no-QPU,
+ * so cycle 8 stays on the CPU NEON path until a separate change
+ * gates QPU init on a daedalus-fourier feature flag).
+  *
+  * FFmpeg's 4x4 and 8x8 block memory layouts match daedalus's
+  * column-major convention: block[r + N*c] = coefficient at
+@@ -40,6 +43,8 @@ static void daedalus_ctx_init_once(void)
+ 
+ void ff_h264_idct_add_daedalus(uint8_t *dst, int16_t *block, int stride);
+ void ff_h264_idct8_add_daedalus(uint8_t *dst, int16_t *block, int stride);
+void ff_h264_v_loop_filter_luma_daedalus(uint8_t *pix, ptrdiff_t stride,
+                                         int alpha, int beta, int8_t *tc0);
+ 
+ void ff_h264_idct_add_daedalus(uint8_t *dst, int16_t *block, int stride)
+ {
+@@ -60,3 +65,22 @@ void ff_h264_idct8_add_daedalus(uint8_t *dst, int16_t *block, int stride)
+     daedalus_recipe_dispatch_h264_idct8(g_dctx, dst, (size_t)stride,
+                                         block, 1, &meta);
+ }
+
+void ff_h264_v_loop_filter_luma_daedalus(uint8_t *pix, ptrdiff_t stride,
+                                         int alpha, int beta, int8_t *tc0)
+{
+    daedalus_h264_deblock_meta meta = {
+        .dst_off = 0,
+        .alpha   = alpha,
+        .beta    = beta,
+    };
+    meta.tc0[0] = tc0[0];
+    meta.tc0[1] = tc0[1];
+    meta.tc0[2] = tc0[2];
+    meta.tc0[3] = tc0[3];
+
+    pthread_once(&g_dctx_once, daedalus_ctx_init_once);
+
+    daedalus_recipe_dispatch_h264_deblock_luma_v(g_dctx, pix, (size_t)stride,
+                                                 1, &meta);
+}
+diff --git a/libavcodec/aarch64/h264dsp_init_aarch64.c b/libavcodec/aarch64/h264dsp_init_aarch64.c
+index 741e551..85ac381 100644
+--- a/libavcodec/aarch64/h264dsp_init_aarch64.c
+++ b/libavcodec/aarch64/h264dsp_init_aarch64.c
+@@ -27,6 +27,8 @@
+ 
+ void ff_h264_v_loop_filter_luma_neon(uint8_t *pix, ptrdiff_t stride, int alpha,
+                                      int beta, int8_t *tc0);
+void ff_h264_v_loop_filter_luma_daedalus(uint8_t *pix, ptrdiff_t stride,
+                                         int alpha, int beta, int8_t *tc0);
+ void ff_h264_h_loop_filter_luma_neon(uint8_t *pix, ptrdiff_t stride, int alpha,
+                                      int beta, int8_t *tc0);
+ void ff_h264_v_loop_filter_luma_intra_neon(uint8_t *pix, ptrdiff_t stride, int alpha,
+@@ -114,7 +116,7 @@ av_cold void ff_h264dsp_init_aarch64(H264DSPContext *c, const int bit_depth,
+     int cpu_flags = av_get_cpu_flags();
+ 
+     if (have_neon(cpu_flags) && bit_depth == 8) {
+-        c->v_loop_filter_luma   = ff_h264_v_loop_filter_luma_neon;
+        c->v_loop_filter_luma   = ff_h264_v_loop_filter_luma_daedalus;
+         c->h_loop_filter_luma   = ff_h264_h_loop_filter_luma_neon;
+         c->v_loop_filter_luma_intra= ff_h264_v_loop_filter_luma_intra_neon;
+         c->h_loop_filter_luma_intra= ff_h264_h_loop_filter_luma_intra_neon;
+-- 
+2.47.3
+
@@ -0,0 +1,82 @@
+From 0d1292ea99bc4e5fa2da438259fa01a2374e3e04 Mon Sep 17 00:00:00 2001
+From: Markus Fritsche <mfritsche@reauktion.de>
+Date: Fri, 22 May 2026 14:18:25 +0200
+Subject: [PATCH] avcodec/h264: restore AV_CODEC_FLAG_LOW_DELAY semantics
+MIME-Version: 1.0
+Content-Type: text/plain; charset=UTF-8
+Content-Transfer-Encoding: 8bit
+
+FFmpeg 8.x dropped the H.264 decoder's low_delay path —
+AV_CODEC_FLAG_LOW_DELAY no longer prevents
+h264_select_output_frame from running the display-order DPB
+output queue.  V4L2-stateless-style consumers (daedalus-v4l2
+daemon, libva-v4l2-request-fourier) that set the flag end up
+seeing the 2-1-4-3 pair-swap pattern on B-frame streams again.
+
+Restore the documented semantics:
+
+  - Early-exit at the top of h264_select_output_frame when the
+    flag is set: emit the just-decoded picture immediately as
+    next_output_pic, mirror the corruption / recovery-point
+    tracking the main path performs, and skip the entire
+    delayed_pic[] / POC reorder machinery.
+
+  - Suppress the SPS-driven has_b_frames clobber in
+    h264_field_start when the flag is set, so the per-slice
+    bitstream_restriction_flag re-pickup cannot reintroduce a
+    nonzero reorder buffer mid-stream.
+
+This is a fork-only change required by the daedalus-v4l2 daemon's
+one-frame-per-send_packet contract; upstream FFmpeg consumers that
+expect display-order output remain untouched (flag default = off).
+
+Refs reauktion/daedalus-v4l2#11 — substitution arc step 2 deblock
+ flag-restoration follow-up.
+---
+ libavcodec/h264_slice.c | 23 +++++++++++++++++++++++
+ 1 file changed, 23 insertions(+)
+
+diff --git a/libavcodec/h264_slice.c b/libavcodec/h264_slice.c
+index 97fab70..a7bfbd6 100644
+--- a/libavcodec/h264_slice.c
+++ b/libavcodec/h264_slice.c
+@@ -1308,6 +1308,28 @@ static int h264_select_output_frame(H264Context *h)
+     cur->mmco_reset = h->mmco_reset;
+     h->mmco_reset = 0;
+ 
+    /* AV_CODEC_FLAG_LOW_DELAY restore (FFmpeg 8.x dropped the H.264
+     * decoder's low_delay path).  Bypass the display-order DPB
+     * output queue: emit the just-decoded picture immediately, in
+     * decode order, one per send_packet.  V4L2-stateless-style
+     * consumers (daedalus-v4l2 daemon, libva-v4l2-request-fourier)
+     * do their own POC-based reorder downstream and require this
+     * behaviour. */
+    if (h->avctx->flags & AV_CODEC_FLAG_LOW_DELAY) {
+        h->next_output_pic    = cur;
+        h->next_outputed_poc  = cur->poc;
+        h->frame_recovered   |= cur->recovered;
+        cur->recovered       |= h->frame_recovered & FRAME_RECOVERED_SEI;
+        if (!cur->recovered) {
+            if (!(h->avctx->flags  & AV_CODEC_FLAG_OUTPUT_CORRUPT) &&
+                !(h->avctx->flags2 & AV_CODEC_FLAG2_SHOW_ALL))
+                h->next_output_pic = NULL;
+            else
+                cur->f->flags |= AV_FRAME_FLAG_CORRUPT;
+        }
+        return 0;
+    }
+
+     if (sps->bitstream_restriction_flag ||
+         h->avctx->strict_std_compliance >= FF_COMPLIANCE_STRICT) {
+         h->avctx->has_b_frames = FFMAX(h->avctx->has_b_frames, sps->num_reorder_frames);
+@@ -1415,6 +1437,7 @@ static int h264_field_start(H264Context *h, const H264SliceContext *sl,
+     sps = h->ps.sps;
+ 
+     if (sps->bitstream_restriction_flag &&
+        !(h->avctx->flags & AV_CODEC_FLAG_LOW_DELAY) &&
+         h->avctx->has_b_frames < sps->num_reorder_frames) {
+         h->avctx->has_b_frames = sps->num_reorder_frames;
+     }
+-- 
+2.47.3
+
@@ -33,13 +33,13 @@ FFMPEG_VERSION=8.1
 # epoch 2 matches Debian's stock ffmpeg (currently 7:7.1.x in trixie);
 # +rfourier suffix to avoid colliding with upstream/Debian rebuilds.
 PKGVER=2:${FFMPEG_VERSION}+rfourier+gb57fbbe
-PKGREL=6  # pkgrel=6 — drop --enable-libxml2 to avoid runner/target libxml2
-          # SOVERSION skew (runner has libxml2 ≥ 2.14 = SONAME 16; trixie
-          # has 2.12 = SONAME 2; -5 .deb dlopens libavformat → fails on
-          # "libxml2.so.16: cannot open shared object").  Neither the
-          # daedalus-v4l2 daemon (direct AVPacket feed) nor mpv-fourier
-          # nor firefox-fourier consumers need FFmpeg's DASH demuxer.
-          # (2026-05-21)
+PKGREL=9  # pkgrel=9 — restore AV_CODEC_FLAG_LOW_DELAY semantics in the
+          # H.264 decoder (FFmpeg 8.x dropped them).  Fixes the 2-1-4-3
+          # B-frame pair-swap that re-appeared in Firefox YouTube after
+          # the SONAME 61→62 jump (PR #16) silently neutered the
+          # daemon's ctx->flags |= AV_CODEC_FLAG_LOW_DELAY at
+          # daemon/src/decoder.c:202.  Substitution arc unchanged.
+          # (2026-05-22)

 # daedalus-fourier pin — first kernel substitution in libavcodec (cycle 6
 # H.264 IDCT 4x4).  Same SHA as the daedalus-v4l2 daemon already ships
@@ -69,6 +69,9 @@ fi
 patch -Np1 -i "$HERE/0001-libudev-bypass-fallback.patch"
 patch -Np1 -i "$HERE/0002-nv15-to-p010-unpack.patch"
 patch -Np1 -i "$HERE/0003-h264-idct4-daedalus-fourier.patch"
+patch -Np1 -i "$HERE/0004-h264-idct8-daedalus-fourier.patch"
+patch -Np1 -i "$HERE/0005-h264-deblock-luma-v-daedalus-fourier.patch"
+patch -Np1 -i "$HERE/0006-h264-restore-low-delay.patch"

 # --- daedalus-fourier: fetch + build static .a with PIC, install to a
 # per-build prefix; libavcodec.so links it into the shared object so
@@ -1,3 +1,71 @@
+ffmpeg-v4l2-request-fourier (2:8.1+rfourier+gb57fbbe-9) bookworm trixie; urgency=medium
+
+  * Add 0006-h264-restore-low-delay.patch — restore the documented
+    AV_CODEC_FLAG_LOW_DELAY semantics in the H.264 decoder.  FFmpeg
+    8.x dropped the H.264 low_delay code path entirely; setting the
+    flag at avcodec_open2 no longer prevents the display-order DPB
+    output queue from running.  Visible on Firefox YouTube as the
+    2-1-4-3 B-frame pair-swap, re-introduced silently by the
+    SONAME 61→62 jump in daedalus-v4l2 PR #16.
+  * h264_select_output_frame: early-exit when LOW_DELAY is set;
+    emit the just-decoded picture as next_output_pic, mirror the
+    corruption / recovery-point tracking, skip delayed_pic[] and
+    the POC reorder machinery entirely.
+  * h264_field_start: suppress the SPS-driven
+    has_b_frames = sps->num_reorder_frames clobber when LOW_DELAY
+    is set — without this the per-slice bitstream_restriction_flag
+    re-pickup would reintroduce a nonzero reorder buffer mid-
+    stream.
+  * Restores the same one-frame-per-send_packet contract the
+    daedalus-v4l2 daemon's decoder.c already relies on (the flag
+    is set unconditionally for H.264).  No daemon side change.
+  * No SONAME change, no Depends change.
+
+ -- Markus Fritsche <mfritsche@reauktion.de>  Fri, 22 May 2026 13:30:00 +0000
+
+ffmpeg-v4l2-request-fourier (2:8.1+rfourier+gb57fbbe-8) bookworm trixie; urgency=medium
+
+  * Add 0005-h264-deblock-luma-v-daedalus-fourier.patch —
+    H264DSPContext.v_loop_filter_luma (non-intra bS<4 vertical luma
+    deblock, called per macroblock-row edge from the slice deblock
+    loop in libavcodec/h264_loopfilter.c) now dispatches through
+    daedalus_recipe_dispatch_h264_deblock_luma_v instead of
+    ff_h264_v_loop_filter_luma_neon.  Cycle 8 of the daedalus-v4l2#11
+    step 2 substitution arc.
+  * Cycle 8 is marked "CPU primary; QPU opportunistic" in
+    daedalus-fourier, but the libavcodec.so context here uses
+    daedalus_ctx_create_no_qpu (process-global pthread_once,
+    shared with cycles 6/7).  Opportunistic QPU is deferred to a
+    separate change that gates Vulkan init on a feature flag, to
+    avoid implicit Vulkan init in arbitrary host processes.  For
+    now cycle 8 is plumbing-only — NEON-by-recipe.
+  * Intra (bS=4) loop filter c->v_loop_filter_luma_intra stays on
+    the in-tree NEON .S code; daedalus's daedalus_h264_deblock_meta
+    only covers the non-intra path per its API docstring.
+  * Bit-exact against ff_h264_v_loop_filter_luma_neon (daedalus-fourier
+    cycle 8 green).
+  * No SONAME change, no Depends change.
+
+ -- Markus Fritsche <mfritsche@reauktion.de>  Fri, 22 May 2026 12:30:00 +0000
+
+ffmpeg-v4l2-request-fourier (2:8.1+rfourier+gb57fbbe-7) bookworm trixie; urgency=medium
+
+  * Add 0004-h264-idct8-daedalus-fourier.patch — H264DSPContext.idct8_add
+    (per-block 8x8 IDCT, called from the High-profile intra-8x8-DCT
+    macroblock path in libavcodec/h264_mb.c) now dispatches through
+    daedalus_recipe_dispatch_h264_idct8 instead of
+    ff_h264_idct8_add_neon.  Cycle 7 of the daedalus-v4l2#11 step 2
+    substitution arc — NEON-by-recipe, same pthread_once context the
+    cycle-6 IDCT 4x4 shim already owns.
+  * Bit-exact against ff_h264_idct8_add_neon (daedalus-fourier cycle 7
+    green; FFmpeg 8x8 block storage block[r + 8*c] matches daedalus
+    column-major convention).
+  * Bulk c->idct8_add4 (inter 8x8-DCT macroblocks) stays on the
+    in-tree NEON .S code; batched substitution lands later.
+  * No SONAME change, no Depends change.
+
+ -- Markus Fritsche <mfritsche@reauktion.de>  Fri, 22 May 2026 10:30:00 +0000
+
 ffmpeg-v4l2-request-fourier (2:8.1+rfourier+gb57fbbe-6) bookworm trixie; urgency=medium

  * Drop --enable-libxml2 + libxml2 Depends — the Gitea
Author	SHA1	Message	Date
marfrit	5c69460722	ffmpeg-v4l2-request-fourier: restore AV_CODEC_FLAG_LOW_DELAY in H.264 decoder FFmpeg 8.x dropped the H.264 decoder's low_delay code path: AV_CODEC_FLAG_LOW_DELAY no longer prevents h264_select_output_frame from running the display-order DPB output queue. The daedalus-v4l2 daemon's `ctx->flags |= AV_CODEC_FLAG_LOW_DELAY` at daemon/src/decoder.c:202 has been a silent no-op since the SONAME 61→62 jump landed in reauktion/daedalus-v4l2 PR #16; on Firefox YouTube this re-introduced the 2-1-4-3 B-frame pair-swap that PR #12's daemon flag was supposed to prevent. Fix lives in libavcodec, not the daemon: restore the documented LOW_DELAY semantics so the daemon (and any other V4L2-stateless-style consumer) keeps the one-frame-per-send_packet decode-order output contract it already declares. ## Patch 0006-h264-restore-low-delay.patch touches libavcodec/h264_slice.c: - h264_select_output_frame: early-exit when LOW_DELAY is set. Emit the just-decoded picture as next_output_pic, mirror the corruption / recovery-point tracking the main path performs, skip delayed_pic[] / POC reorder machinery entirely. - h264_field_start: suppress the SPS-driven `has_b_frames = sps->num_reorder_frames` clobber when LOW_DELAY is set. Without this the per-slice bitstream_restriction_flag re-pickup would reintroduce a nonzero reorder buffer mid-stream even after the daemon set has_b_frames=0 at avcodec_open2. ## Why not daemon-side A daemon SPS-rewrite (`num_reorder_frames=0`) was considered but rejected: it works only for the daemon's reconstructed SPS NAL, not for any in-band SPS the daemon dlopens libavformat to parse in other code paths. Restoring documented FFmpeg flag semantics is the smaller, more durable change and keeps the daemon interface stable. ## Packaging - PKGREL/pkgrel bump to 9. - No new build-deps, no Depends change. - Substitution arc cycles 6/7/8 unchanged. 
## Refs - reauktion/daedalus-v4l2#11 / #12 (LOW_DELAY half-measure on daemon side, originally landed against FFmpeg 7.x). - daemon/src/decoder.c:202 (`ctx->flags |= AV_CODEC_FLAG_LOW_DELAY` for H.264 only; unchanged, but now actually has effect again).	2026-05-22 14:20:37 +02:00
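The two output policies described in the commit above can be illustrated with a toy model. This is a sketch only, not FFmpeg's code: `ToyPic`, `ToyDpb`, `toy_select_output` and `toy_run` are hypothetical names, the `low_delay` field stands in for AV_CODEC_FLAG_LOW_DELAY, and `reorder_frames` stands in for sps->num_reorder_frames. It deliberately ignores corruption / recovery-point tracking, field pictures, and end-of-stream flushing.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_DELAY 4

/* Toy picture: only the picture order count (display position) matters. */
typedef struct {
    int poc;
} ToyPic;

/* Toy output state (hypothetical; not an FFmpeg structure). */
typedef struct {
    bool   low_delay;       /* stand-in for AV_CODEC_FLAG_LOW_DELAY  */
    int    reorder_frames;  /* stand-in for sps->num_reorder_frames  */
    ToyPic delayed[TOY_MAX_DELAY + 1];
    size_t n_delayed;
} ToyDpb;

/*
 * Called once per decoded picture, in decode order.
 * Returns true and fills *out when a picture is emitted for this input.
 */
bool toy_select_output(ToyDpb *d, ToyPic cur, ToyPic *out)
{
    if (d->low_delay) {
        /* Early exit: emit the just-decoded picture, skip the delay buffer. */
        *out = cur;
        return true;
    }

    assert(d->reorder_frames >= 0 && d->reorder_frames <= TOY_MAX_DELAY);
    assert(d->n_delayed <= (size_t)d->reorder_frames);

    d->delayed[d->n_delayed++] = cur;
    if (d->n_delayed <= (size_t)d->reorder_frames)
        return false;  /* buffer still filling: nothing emitted yet */

    /* Emit the buffered picture with the lowest POC (display order). */
    size_t best = 0;
    for (size_t i = 1; i < d->n_delayed; i++)
        if (d->delayed[i].poc < d->delayed[best].poc)
            best = i;

    *out = d->delayed[best];
    d->delayed[best] = d->delayed[--d->n_delayed];
    return true;
}

/* Feed n POCs in decode order; store emitted POCs and return their count. */
size_t toy_run(bool low_delay, int reorder_frames,
               const int *pocs, size_t n, int *out_pocs)
{
    ToyDpb d = { .low_delay = low_delay, .reorder_frames = reorder_frames };
    size_t n_out = 0;

    for (size_t i = 0; i < n; i++) {
        ToyPic out = { 0 };
        if (toy_select_output(&d, (ToyPic){ pocs[i] }, &out))
            out_pocs[n_out++] = out.poc;
    }
    return n_out;
}
```

Assuming decode-order POCs 0, 4, 2, 8, 6, `toy_run(true, 0, ...)` yields all five pictures in decode order, one per input, while `toy_run(false, 1, ...)` yields 0, 2, 4, 6 and leaves POC 8 buffered. The first behaviour is the one-frame-per-send_packet contract the commit restores.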
marfrit	d11a52405d	Merge pull request 'ffmpeg-v4l2-request-fourier: substitute H.264 luma-v deblock → daedalus-fourier' (#86) from claude-noether/marfrit-packages:noether/ffmpeg-fourier-deblock-luma-v-daedalus into main Reviewed-on: marfrit/marfrit-packages#86	2026-05-22 10:29:09 +00:00
marfrit	29e0852d11	ffmpeg-v4l2-request-fourier: substitute H.264 luma-v deblock → daedalus-fourier Cycle 8 of the libavcodec.so substitution arc (reauktion/daedalus-v4l2#11 step 2). H264DSPContext.v_loop_filter_luma (non-intra bS<4 vertical luma deblock, called per macroblock-row edge from the slice deblock loop in libavcodec/h264_loopfilter.c) now dispatches through daedalus_recipe_dispatch_h264_deblock_luma_v instead of ff_h264_v_loop_filter_luma_neon. ## What - Add 0005-h264-deblock-luma-v-daedalus-fourier.patch (in both arch/ and debian/ ffmpeg-v4l2-request-fourier/). Extends libavcodec/aarch64/h264_idct_daedalus.c with ff_h264_v_loop_filter_luma_daedalus (constructs a daedalus_h264_deblock_meta from FFmpeg's (alpha, beta, tc0[4]) and calls daedalus_recipe_dispatch_h264_deblock_luma_v with n_edges=1). Patches libavcodec/aarch64/h264dsp_init_aarch64.c to wire c->v_loop_filter_luma to the new shim. - arch/PKGBUILD + debian/build-deb.sh: append patch + bump pkgrel/PKGREL to 8. - No new build-deps, no Depends change, no daedalus-fourier rev; the d87239d pin already exposes daedalus_recipe_dispatch_h264_deblock_luma_v. ## Why Cycle 8 is marked "CPU primary; QPU opportunistic" in the daedalus-fourier API docstring. Per the hybrid substrate philosophy ("if there's a coprocessor, use it") we eventually want the QPU opportunism active here. But the libavcodec.so context is process-global and shared with cycles 6/7 via pthread_once, and it uses daedalus_ctx_create_no_qpu deliberately to avoid implicit Vulkan init in arbitrary host processes (Firefox content, mpv-fourier, ffmpeg-fourier CLI, ...). Switching to daedalus_ctx_create here without a feature flag would be a footgun. So cycle 8 lands as plumbing-only NEON-by-recipe substitution for now; opportunistic QPU enablement is a separate follow-up that adds a DAEDALUS_FOURIER_ENABLE_QPU env var or equivalent. 
## Scope NOT covered - Intra (bS=4) loop filter c->v_loop_filter_luma_intra — daedalus's daedalus_h264_deblock_meta only covers the non-intra path. - Horizontal-edge variant c->h_loop_filter_luma — separate kernel (not yet in daedalus-fourier API). - Chroma loop filters — separate kernels. - Bulk batching — single-edge dispatch wastes the kernel's n_edges>1 amortization. Same caveat as cycles 6/7; follow-up. - QPU opportunism — see "Why" above. ## SONAME Unchanged. libavcodec.so.62 / libavformat.so.62 / libavutil.so.60. ## Refs - reauktion/daedalus-v4l2 issue #11: reauktion/daedalus-v4l2#11 - marfrit-packages PR #76 (cycle 6 IDCT 4×4) - marfrit-packages PR #85 (cycle 7 IDCT 8×8) - marfrit/daedalus-fourier cycle 8 close (deblock luma-v NEON green)	2026-05-22 12:17:14 +02:00
marfrit	510a31622c	Merge pull request 'ffmpeg-v4l2-request-fourier: substitute H.264 IDCT 8×8 → daedalus-fourier' (#85) from claude-noether/marfrit-packages:noether/ffmpeg-fourier-idct8-daedalus into main Reviewed-on: marfrit/marfrit-packages#85	2026-05-22 08:32:15 +00:00
marfrit	db9ae16da9	Merge pull request 'mesa-panvk-bifrost-video: regenerate 0005 patch from POST-review snapshot' (#84) from claude-noether/marfrit-packages:noether/mesa-panvk-bifrost-video-retrigger into main Reviewed-on: marfrit/marfrit-packages#84	2026-05-22 08:20:34 +00:00
marfrit	493c762967	ffmpeg-v4l2-request-fourier: substitute H.264 IDCT 8×8 → daedalus-fourier Cycle 7 of the libavcodec.so substitution arc (reauktion/daedalus-v4l2#11 step 2). H264DSPContext.idct8_add (called per 8×8 block from the High-profile intra-8×8-DCT decode path in libavcodec/h264_mb.c) now dispatches through daedalus_recipe_dispatch_h264_idct8 instead of ff_h264_idct8_add_neon. ## What - Add 0004-h264-idct8-daedalus-fourier.patch (in both arch/ and debian/ ffmpeg-v4l2-request-fourier/). Extends libavcodec/aarch64/h264_idct_daedalus.c (introduced by 0003) with ff_h264_idct8_add_daedalus and a daedalus_recipe_dispatch_h264_idct8 call; patches libavcodec/aarch64/h264dsp_init_aarch64.c to wire c->idct8_add to the new shim. - arch/PKGBUILD + debian/build-deb.sh: append the new patch to the apply list; bump pkgrel/PKGREL to 7. - No new build-deps, no Depends change, no daedalus-fourier rev; the d87239d pin already exposes daedalus_recipe_dispatch_h264_idct8. ## Why The recipe layer picks the substrate; for cycle 7 (H.264 IDCT 8×8) the recipe is CPU NEON, so this is effectively a NEON-to-NEON substitution layered on top of cycle 6. Production validation of cycle 6 on higgs Firefox YouTube: 3040 frames decoded cleanly, avg_decode_us=3388 (no regression vs the pre-substitution ~4 ms baseline). Cycle 7 inherits the same shim's pthread_once context. Bit-exact against ff_h264_idct8_add_neon (daedalus-fourier cycle 7 green; FFmpeg 8×8 block storage block[r + 8*c] matches daedalus column-major convention). ## Scope NOT covered (deferred) - Bulk c->idct8_add4 (inter 8×8-DCT macroblocks) stays on the in-tree NEON .S code; batched substitution with n_blocks>1 lands later alongside the cycle-6 bulk-paths work. - High-bit-depth (10-bit) path untouched. - Cycles 8/9: separate PRs. ## SONAME Unchanged. libavcodec.so.62 / libavformat.so.62 / libavutil.so.60. 
## Refs - reauktion/daedalus-v4l2 issue #11 (substitution arc): reauktion/daedalus-v4l2#11 - marfrit-packages PR #76 (cycle 6 IDCT 4×4) - marfrit-packages PR #78 (libxml2 ABI-skew workaround) - marfrit/daedalus-fourier cycle 7 close (H.264 IDCT 8×8 NEON green)	2026-05-22 10:20:27 +02:00
marfrit	7ecbcb3c1b	mesa-panvk-bifrost-video: regenerate 0005 patch from POST-review snapshot The original 0005 patch was generated from the pre-Phase-5-review source snapshot (phase5_review_input_2026-05-21.tgz), missing the four load-bearing review fixes that landed in the post-review snapshot: - probe_hantro gate on KHR_video_* extension advertisement - per-session ts_counter (was process-global static) - panvk_v4l2_session_finish full unwind (munmap + STREAMOFF + REQBUFS=0) - MIN2(rb.count, 18) clamp on num_*_buffers Run #162 (job 17032) failed in prepare() because the PKGBUILD sanity check 'grep -q "KHR_video_queue = PAN_ARCH < 9 && panvk_v4l2_probe_hantro()"' didn't match the actual patched output (which still had the pre-review 'KHR_video_queue = PAN_ARCH < 9,'). This patch (regenerated from phase5_post_review_2026-05-21.tgz) carries all four review fixes. Validated locally: vanilla mesa-26.0.6 + r1..r4 + this patch reproduces prepare()-OK byte-for-byte. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 10:18:11 +02:00
marfrit	360e8eb6bf	Merge pull request 'mesa-panvk-bifrost-video: r1-r4 patches as real files (symlinks broke CI)' (#83) from claude-noether/marfrit-packages:noether/mesa-panvk-bifrost-video-retrigger into main Reviewed-on: marfrit/marfrit-packages#83	2026-05-22 07:55:59 +00:00
marfrit	4db64917bc	mesa-panvk-bifrost-video: r1-r4 patches as real files (symlinks broke CI) The original PR #79 used symlinks for 0001..0004 patches (pointing into ../mesa-panvk-bifrost/) to avoid drift between siblings. CI's "cp -r arch/mesa-panvk-bifrost-video /tmp/build-..." preserves the symlinks, but the destination /tmp/build-... has no sibling dir to resolve them against, so makepkg errors with: ==> ERROR: 0001-panvk-expose-robustness2-nullDescriptor-bifrost.patch was not found in the build directory and is not a URL. Each Arch PKGBUILD owns its source files per convention; the duplication risk is low because r1..r4 are closed-release patches. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 09:49:59 +02:00
marfrit	6288536223	Merge pull request 'ci: fix duplicate `run:` key in build.yml wipe-secrets step (unblocks all builds since 2026-05-21)' (#82) from claude-noether/marfrit-packages:noether/fix-build-yaml-duplicate-run into main Reviewed-on: marfrit/marfrit-packages#82	2026-05-22 07:30:37 +00:00
claude-noether	09d8813507	ci: fix duplicate `run:` key in build.yml wipe-secrets step PR #79 (`6ee8f2748`, mesa-panvk-bifrost-video) added a second `run:` mapping key on the next line of the same step: - name: wipe secrets if: always() run: rm -f /root/repo_pass /root/.ssh/id_ed25519 run: rm -f /root/.ssh/id_ed25519_hertz ← duplicate `run:` key YAML doesn't allow two mappings with the same key in one node, so Gitea's workflow parser rejected the entire file: actions/workflows.go:124:DetectWorkflows() [W] ignore invalid workflow "build.yml": yaml: unmarshal errors: line 1423: mapping key "run" already defined at line 1422 Result: every push to main since `6ee8f2748` (2026-05-21 23:14 CEST) silently failed to enqueue ANY action run. PR #80's "re-trigger by README touch" had no chance — workflow file was invalid before #80 even existed. Runs #161-163 do not exist; #160 (pre-#79) is the last successful enqueue. Fix: merge the two single-line `run:` invocations into one literal block. Functionally identical, YAML-valid. Post-merge: workflow file becomes valid again, new push to main triggers a fresh build run covering the backlog (#79's mesa-panvk-bifrost-video build that #80 wanted re-triggered). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 09:15:18 +02:00
				`@@ -1 +0,0 @@`
				`../mesa-panvk-bifrost/0001-panvk-expose-robustness2-nullDescriptor-bifrost.patch`
				`@@ -1 +0,0 @@`
				`../mesa-panvk-bifrost/0002-panvk-expose-vulkan-1.1-1.2-on-bifrost.patch`
				`@@ -1 +0,0 @@`
				`../mesa-panvk-bifrost/0003-panvk-bifrost-vk-ext-transform-feedback.patch`
				`@@ -1 +0,0 @@`
				`../mesa-panvk-bifrost/0004-panvk-bifrost-xfb-primitive-decomposition.patch`