ffmpeg-v4l2-request-fourier: flip libavcodec daedalus ctx no_qpu → qpu-capable (0013)

Patches 0003 (IDCT 4x4) and 0007 (qpel mc20) created the libavcodec.so
process-global daedalus_ctx via daedalus_ctx_create_no_qpu(). Rationale at
the time: cycle 6/9 had only CPU NEON paths, so a QPU-capable ctx would have
meant pointless Vulkan init in every host process (firefox-fourier,
mpv-fourier, daedalus_v4l2_daemon, ...).

Two things changed since:

1. Every H.264 hot-path primitive now has a V3D7 compute shader: IDCT
   4x4/8x8 + 8 deblock variants (luma+chroma × V+H × inter+intra) + 30 qpel
   positions (15 put_ + 15 avg_). See daedalus-fourier PRs #28-#35.

2. Dispatch overhead has been hammered down: buffer pool in v3d_runner
   (#160) plus persistent command buffer (#161).

daedalus-fourier PR #36 bench measures the 1080p worst-case sum on hertz
(Pi 5 V3D 7.1, 30 iters x 5 warmup):

    1080p worst-case sum (IDCT4 + deblock luma + qpel mc22):
      CPU NEON only: 5.57 ms
      QPU only:      1.30 ms   (CPU/QPU sum ratio = 4.30x)

PR #10's CPU-4x-faster-than-QPU verdict (which justified the original
no_qpu ctx choice) is now reversed by ~17x.

This commit adds 0013-h264-ctx-qpu-capable.patch, which flips both H.264 TUs
(h264_idct_daedalus.c, h264_qpel_daedalus.c) from
daedalus_ctx_create_no_qpu() to daedalus_ctx_create().

daedalus_ctx_create() probes for a usable Vulkan device and falls back to
no_qpu mode if unavailable, so this is safe on hosts without V3D (x86 build
runners, Debian aarch64 builders without renderD, etc.). Hosts WITH V3D
(Pi 5 deployment targets) now route the H.264 hot-path through V3D compute
instead of CPU NEON.

Wired into both arch PKGBUILD (source[] + prepare()) and debian
build-deb.sh; both pkgrel bumped 10 → 11.

Refs reauktion/daedalus-fourier!36.
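The two ratios in the message can be rechecked from the rounded timings it quotes. A minimal sketch follows; `ratio` is a hypothetical helper name, and the only inputs are the 5.57 ms and 1.30 ms figures and the earlier 4x verdict from the message itself. The rounded timings give roughly 4.28, so the quoted 4.30x was presumably computed from unrounded measurements (an assumption, not something the message states).

```sh
# Recompute the ratios quoted in the commit message from its rounded
# timings. ratio() is a hypothetical helper for this sketch only.
ratio() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", a / b }'
}

cpu_ms=5.57   # CPU NEON only, 1080p worst-case sum
qpu_ms=1.30   # QPU only, same workload

# QPU advantage now, from the rounded timings.
ratio "$cpu_ms" "$qpu_ms"            # prints 4.28

# Swing relative to the earlier verdict: CPU used to be 4x faster, QPU is
# now about 4.3x faster, so the verdict moved by the product of the two.
awk 'BEGIN { printf "%.1f\n", 4 * 4.30 }'   # prints 17.2
```

The product form is why a 4.3x speedup is described as a ~17x reversal: the comparison is against a baseline where the ratio pointed the other way.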
mesa-panvk-bifrost r9: bump maxImageDimension3D to 2048 (iter22, unblocks Dawn/WebGPU)
34 changed files with 2863 additions and 31 deletions
@@ -1556,3 +1556,179 @@ jobs:
        if: always()
        run: rm -f /root/repo_pass /root/.ssh/id_ed25519
  # -------------------------------------------------------------------------
  # aish (arch=any) — pure LuaJIT, one .pkg.tar valid on every pacman target.
  # Same dual-arch publish pattern as lmcp / claude-his.
  # -------------------------------------------------------------------------
  aish-any:
    needs: lmcp-debian   # parallel with claude-his-any (pure-Lua sibling),
                         # serialized via the shared arch-aarch64 runner.
                         # Avoids needless wait through the fourier stack.
    runs-on: arch-aarch64
    steps:
      - uses: actions/checkout@v4
      - name: skip if already published
        id: skip-check
        run: |
          set -e
          result=$(./.gitea/scripts/check-already-published.sh arch/aish)
          echo "$result" >> "$GITHUB_OUTPUT"
          echo "decision: $result"
      - name: bootstrap runner (idempotent)
        if: steps.skip-check.outputs.skip != '1'
        run: |
          set -e
          retry() { for i in 1 2 3; do "$@" && return 0; rc=$?; echo "retry $i (exit=$rc)" >&2; sleep $((i*5)); done; return 1; }
          retry pacman -Syu --noconfirm --needed base-devel git rsync gnupg openssh sudo luajit readline curl
      - name: import signing key
        if: steps.skip-check.outputs.skip != '1'
        env:
          PRIV: ${{ secrets.MARFRIT_REPO_PRIVATE_KEY }}
          PASS: ${{ secrets.MARFRIT_REPO_PASSPHRASE }}
        run: |
          set -e
          gpgconf --homedir /root/.gnupg --kill all 2>/dev/null || true
          rm -rf /root/.gnupg /root/repo_pass
          mkdir -m700 -p /root/.gnupg
          printf '%s' "$PASS" > /root/repo_pass
          chmod 600 /root/repo_pass
          printf '%s\n' "$PRIV" | gpg --batch --import
          echo "92D5E96D8F63C75E4116AA1FF5C8C4603D0D250C:6:" | gpg --import-ownertrust
      - name: install deploy ssh key
        if: steps.skip-check.outputs.skip != '1'
        env:
          KEY: ${{ secrets.MARFRIT_REPO_DEPLOY_KEY }}
        run: |
          mkdir -m700 -p /root/.ssh
          printf '%s\n' "$KEY" > /root/.ssh/id_ed25519
          chmod 600 /root/.ssh/id_ed25519
          ssh-keyscan -t ed25519 nc.reauktion.de > /root/.ssh/known_hosts 2>/dev/null
      - name: makepkg aish
        if: steps.skip-check.outputs.skip != '1'
        run: |
          set -e
          rm -rf /tmp/build-aish
          cp -r arch/aish /tmp/build-aish
          chown -R builder:builder /tmp/build-aish
          cd /tmp/build-aish
          sudo -u builder -H makepkg --nocheck --noconfirm --syncdeps --cleanbuild
          ls -la *.pkg.tar.* | grep -v "\.sig$"
      - name: sign aish
        if: steps.skip-check.outputs.skip != '1'
        run: |
          set -e
          cd /tmp/build-aish
          for f in *.pkg.tar.xz *.pkg.tar.zst *.pkg.tar.gz; do
            [ -f "$f" ] || continue
            gpg --batch --pinentry-mode loopback --passphrase-file /root/repo_pass \
                --detach-sign --yes -u 92D5E96D8F63C75E4116AA1FF5C8C4603D0D250C "$f"
          done
      - name: publish aish to both arches
        if: steps.skip-check.outputs.skip != '1'
        run: |
          set -e
          retry() { for i in 1 2 3; do "$@" && return 0; rc=$?; echo "retry $i (exit=$rc)" >&2; sleep $((i*5)); done; return 1; }
          export GNUPGHOME=/root/.gnupg
          printf 'pinentry-mode loopback\npassphrase-file /root/repo_pass\n' > /root/.gnupg/gpg.conf
          printf 'allow-loopback-pinentry\n' > /root/.gnupg/gpg-agent.conf
          gpg-connect-agent reloadagent /bye
          for target in aarch64 x86_64; do
              stage="/tmp/arch-stage-$target"
              rm -rf "$stage"; mkdir -p "$stage"; cd "$stage"
              for f in marfrit.db.tar.gz marfrit.db.tar.gz.sig marfrit.files.tar.gz marfrit.files.tar.gz.sig; do
                curl -sSLf "https://packages.reauktion.de/arch/$target/$f" -o "$f" || rm -f "$f"
              done
              cp /tmp/build-aish/*.pkg.tar.* .
              pkgs=()
              for ext in xz zst gz; do
                for f in *.pkg.tar.$ext; do [ -f "$f" ] && pkgs+=("$f"); done
              done
              if [ -f marfrit.db.tar.gz ]; then
                for f in "${pkgs[@]}"; do
                  name=$(echo "$f" | sed -E 's/-[0-9].*//')
                  repo-remove --sign --key 92D5E96D8F63C75E4116AA1FF5C8C4603D0D250C \
                    marfrit.db.tar.gz "$name" 2>/dev/null || true
                done
              fi
              repo-add --new --sign --key 92D5E96D8F63C75E4116AA1FF5C8C4603D0D250C \
                --verify marfrit.db.tar.gz "${pkgs[@]}"
              ln -sf marfrit.db.tar.gz        marfrit.db
              ln -sf marfrit.files.tar.gz     marfrit.files
              ln -sf marfrit.db.tar.gz.sig    marfrit.db.sig
              ln -sf marfrit.files.tar.gz.sig marfrit.files.sig
              retry rsync -avL --copy-unsafe-links \
                -e 'ssh -i /root/.ssh/id_ed25519' \
                ./ "mfritsche@nc.reauktion.de:arch/$target/"
          done
      - name: wipe secrets
        if: always()
        run: rm -f /root/repo_pass /root/.ssh/id_ed25519
  aish-debian:
    needs: aish-any   # serialize after the Arch build to share the runner
    runs-on: arch-aarch64
    steps:
      - uses: actions/checkout@v4
      - name: skip if already published
        id: skip-check
        run: |
          set -e
          result=$(./.gitea/scripts/check-already-published.sh debian/aish)
          echo "$result" >> "$GITHUB_OUTPUT"
          echo "decision: $result"
      - name: install dpkg
        if: steps.skip-check.outputs.skip != '1'
        run: |
          set -e
          retry() { for i in 1 2 3; do "$@" && return 0; rc=$?; echo "retry $i (exit=$rc)" >&2; sleep $((i*5)); done; return 1; }
          retry pacman -Syu --noconfirm --needed dpkg openssh rsync curl
      - name: install hertz deploy ssh key
        if: steps.skip-check.outputs.skip != '1'
        env:
          KEY: ${{ secrets.MARFRIT_REPO_HERTZ_KEY }}
        run: |
          mkdir -m700 -p /root/.ssh
          printf '%s\n' "$KEY" > /root/.ssh/id_ed25519_hertz
          chmod 600 /root/.ssh/id_ed25519_hertz
          ssh-keyscan -t ed25519 hertz.fritz.box >> /root/.ssh/known_hosts 2>/dev/null
      - name: build aish .deb
        if: steps.skip-check.outputs.skip != '1'
        run: |
          set -e
          cd debian/aish
          ./build-deb.sh
          ls -la *.deb
      - name: upload + publish to suites
        if: steps.skip-check.outputs.skip != '1'
        run: |
          set -e
          retry() { for i in 1 2 3; do "$@" && return 0; rc=$?; echo "retry $i (exit=$rc)" >&2; sleep $((i*5)); done; return 1; }
          cd debian/aish
          DEB=$(ls aish_*.deb | head -1)
          # Push the .deb into hertz's incoming dir via rrsync.
          retry rsync -av -e 'ssh -i /root/.ssh/id_ed25519_hertz' "$DEB" \
              marfritrepo@hertz.fritz.box:
          # Trigger reprepro for each suite.
          for suite in bookworm trixie; do
              retry ssh -i /root/.ssh/id_ed25519_hertz marfritrepo@hertz.fritz.box \
                  "publish-deb $suite $DEB"
          done
      - name: wipe secrets
        if: always()
        run: rm -f /root/.ssh/id_ed25519_hertz
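Both jobs above wrap network-facing commands (pacman, rsync, ssh) in the same inline `retry` helper. The sketch below reproduces that shape so it can be tried in isolation. `RETRY_BASE_DELAY`, `flaky`, and `attempts_file` are hypothetical names added only for this illustration; the workflow hard-codes a 5-second base delay.

```sh
# Same retry shape as the workflow steps: up to three attempts with a
# growing pause. RETRY_BASE_DELAY is a hypothetical knob added here so the
# sketch can be exercised without waiting; the workflow hard-codes 5.
retry() {
    for i in 1 2 3; do
        "$@" && return 0
        rc=$?
        echo "retry $i (exit=$rc)" >&2
        sleep $((i * ${RETRY_BASE_DELAY:-5}))
    done
    return 1
}

# Hypothetical flaky command: fails until it has been called three times.
attempts_file=$(mktemp)
echo 0 > "$attempts_file"
flaky() {
    n=$(($(cat "$attempts_file") + 1))
    echo "$n" > "$attempts_file"
    [ "$n" -ge 3 ]
}

RETRY_BASE_DELAY=0
retry flaky && echo "succeeded after $(cat "$attempts_file") attempts"
```

One property worth noting: the helper also sleeps after the third failed attempt before returning 1, so a permanently failing command costs the full 5 + 10 + 15 seconds of back-off in the workflow.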
@@ -0,0 +1,53 @@
 # Maintainer: Markus Fritsche <mfritsche@reauktion.de>
 # aish — AI-augmented conversational shell in LuaJIT.
 # Source of truth: git.reauktion.de/marfrit/aish
 pkgname=aish
 pkgver=0.1.0
 pkgrel=1
 pkgdesc="AI-augmented conversational shell (LuaJIT, FFI-only)"
 arch=('any')
 url="https://git.reauktion.de/marfrit/aish"
 license=('MIT')
 depends=('luajit' 'readline' 'curl')
 # The _tag back-translation handles both clean releases (no '_') and
 # pre-release pkgvers (e.g. 0.1.0_rc1 → v0.1.0-rc1).
 _tag="v${pkgver//_/-}"
 source=("${pkgname}-${pkgver}.tar.gz::https://git.reauktion.de/marfrit/aish/archive/${_tag}.tar.gz")
 sha256sums=('9ebc3939e028832e39391ae33efacb5ec9bcd99d123cbc8ca1cd6ca9a640b5b5')
 package() {
    cd "${pkgname}"
    local libdir="${pkgdir}/usr/share/lua/5.1/aish"
    # Top-level modules
    install -Dm644 main.lua     "${libdir}/main.lua"
    install -Dm644 broker.lua   "${libdir}/broker.lua"
    install -Dm644 context.lua  "${libdir}/context.lua"
    install -Dm644 executor.lua "${libdir}/executor.lua"
    install -Dm644 history.lua  "${libdir}/history.lua"
    install -Dm644 mcp.lua      "${libdir}/mcp.lua"
    install -Dm644 renderer.lua "${libdir}/renderer.lua"
    install -Dm644 repl.lua     "${libdir}/repl.lua"
    install -Dm644 router.lua   "${libdir}/router.lua"
    install -Dm644 safety.lua   "${libdir}/safety.lua"
    install -Dm644 secrets.lua  "${libdir}/secrets.lua"
    # FFI bindings
    install -Dm644 ffi/curl.lua     "${libdir}/ffi/curl.lua"
    install -Dm644 ffi/libc.lua     "${libdir}/ffi/libc.lua"
    install -Dm644 ffi/pty.lua      "${libdir}/ffi/pty.lua"
    install -Dm644 ffi/readline.lua "${libdir}/ffi/readline.lua"
    # Vendored dependencies
    install -Dm644 vendor/dkjson.lua "${libdir}/vendor/dkjson.lua"
    # Launch wrapper
    install -Dm755 bin/aish "${pkgdir}/usr/bin/aish"
    # Documentation + example config
    install -Dm644 README.md  "${pkgdir}/usr/share/doc/${pkgname}/README.md"
    install -Dm644 LICENSE    "${pkgdir}/usr/share/doc/${pkgname}/LICENSE"
    install -Dm644 examples/config.lua \
        "${pkgdir}/usr/share/doc/${pkgname}/examples/config.lua"
 }
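The `_tag` line in this PKGBUILD maps a pacman-safe `pkgver` back to the upstream git tag, since `pkgver` may not contain hyphens. A minimal sketch of the same mapping, written with `tr` so it runs in any POSIX shell rather than relying on the bash expansion `${pkgver//_/-}`; `pkgver_to_tag` is a hypothetical helper name used only here.

```sh
# POSIX-sh equivalent of the PKGBUILD's bash expansion "v${pkgver//_/-}".
# pkgver_to_tag is a hypothetical helper name used only for this sketch.
pkgver_to_tag() {
    printf 'v%s\n' "$(printf '%s' "$1" | tr '_' '-')"
}

pkgver_to_tag 0.1.0        # clean release, prints v0.1.0
pkgver_to_tag 0.1.0_rc1    # pre-release,   prints v0.1.0-rc1
```

With this mapping a clean release needs no special casing: a `pkgver` without underscores passes through unchanged apart from the `v` prefix.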
@@ -8,13 +8,13 @@
 # NEXT.md alongside this PKGBUILD for the full rationale and the
 # validation log on PineTab2 (RK3566).
 #
-# Multi-arch: builds natively on x86_64 and aarch64. The x86_64 path
-# is primarily a development / CI host; the runtime target audience is
-# aarch64. The two patches are architecture-independent.
+# Cross-compiled from x86_64 using chromium's bundled clang (upstream
+# LLVM doesn't ship clang 23+ yet; chromium's internal fork is required).
+# Runtime target is aarch64. The three patches are architecture-independent.
 pkgname=chromium-fourier
-pkgver=147.0.7727.116
+pkgver=148.0.7778.178
-pkgrel=2
+pkgrel=1
 epoch=1
 pkgdesc='Chromium with V4L2VDA HW video decode unlocked for mainline Linux Wayland on Rockchip'
 arch=('aarch64' 'x86_64')
@@ -150,7 +150,6 @@ build() {
    'symbol_level=0'
    'is_cfi=false'
    'treat_warnings_as_errors=false'
    'enable_nacl=false'
    'enable_widevine=false'
    # System toolchain (clang/lld from pacman)
@@ -73,16 +73,15 @@ diff --git a/ui/ozone/common/native_pixmap_egl_binding.cc b/ui/ozone/common/nati
 index 31877f4459..6855c1093e 100644
 --- a/ui/ozone/common/native_pixmap_egl_binding.cc
 +++ b/ui/ozone/common/native_pixmap_egl_binding.cc
-@@ -6,10 +6,13 @@
+@@ -6,9 +6,12 @@
- 
+
 #include <array>
- 
+
 +#include "base/containers/flat_map.h"
 #include "base/logging.h"
 #include "base/memory/scoped_refptr.h"
 +#include "base/no_destructor.h"
 #include "base/notreached.h"
 #include "base/numerics/safe_conversions.h"
 +#include "base/synchronization/lock.h"
 #include "ui/gfx/linux/drm_util_linux.h"
 #include "ui/gl/gl_bindings.h"
@@ -0,0 +1,139 @@
 From 0000000000000000000000000000000000000000 Mon Sep 17 00:00:00 2001
 From: Markus Fritsche <mfritsche@reauktion.de>
 Date: Sat, 23 May 2026 12:00:00 +0200
 Subject: [PATCH] avcodec/aarch64/h264qpel: route 8x8 mc20 through
 daedalus-fourier
 MIME-Version: 1.0
 Content-Type: text/plain; charset=UTF-8
 Content-Transfer-Encoding: 8bit
 H264QpelContext.put_h264_qpel_pixels_tab[1][2] (8x8 luma horizontal
 half-pel, 6-tap "put" variant — the canonical representative of the
 H.264 luma motion-compensation family) now dispatches through
 daedalus_recipe_dispatch_h264_qpel_mc20 instead of
 ff_put_h264_qpel8_mc20_neon.
 Cycle 9 of the daedalus-v4l2#11 step 2 substitution arc; closes the
 4-cycle libavcodec.so substitution sequence (6 IDCT 4x4 / 7 IDCT 8x8 /
 8 luma-v deblock / 9 qpel mc20).
 The recipe layer picks the substrate. Per docs/k9_h264qpel_mc20.md
 the verdict is CPU NEON: per-block 7.6 ns at 131 Mblock/s gives 135x
 margin over 30 fps 1080p, and the QPU dispatch floor (~250 ns)
 makes any V3D shader strictly worse. Substitution is plumbing-only,
 NEON-by-recipe — same daedalus_ctx_create_no_qpu pthread_once
 context shape the cycles 6/7/8 shims already own (kept SEPARATE
 from the H264DSP shim's ctx because H264QPEL is its own libavcodec
 Makefile module and link order does not guarantee a single .o
 owns the ctx symbol; one extra ~µs init per process, paid lazily).
 Other H.264 luma MC variants (mc02, mc11, mc22 etc.) and the 16x16
 size tier stay on the in-tree NEON .S code. Per the cycle-9 phase-1
 rationale, mc20 8x8 is representative of the whole family's per-block
 cost — extending the substitution to other variants would multiply
 recipe-lookup overhead without changing the substrate verdict.
 Bit-exact against ff_put_h264_qpel8_mc20_neon (daedalus-fourier
 cycle 9 green; M1 = 100% bit-exact across 10000 random blocks).
 No SONAME change, no Depends change.
 Refs reauktion/daedalus-v4l2#11 — substitution arc step 2 cycle 9.
 ---
 libavcodec/aarch64/Makefile                |  3 +-
 libavcodec/aarch64/h264_qpel_daedalus.c    | 50 ++++++++++++++++++++++
 libavcodec/aarch64/h264qpel_init_aarch64.c |  4 +-
 3 files changed, 55 insertions(+), 2 deletions(-)
 create mode 100644 libavcodec/aarch64/h264_qpel_daedalus.c
 diff --git a/libavcodec/aarch64/Makefile b/libavcodec/aarch64/Makefile
 --- a/libavcodec/aarch64/Makefile
 +++ b/libavcodec/aarch64/Makefile
@@ -7,7 +7,8 @@ OBJS-$(CONFIG_H264DSP)                  += aarch64/h264dsp_init_aarch64.o \
                                            aarch64/h264_idct_daedalus.o
 OBJS-$(CONFIG_HUFFYUVDSP)               += aarch64/huffyuvdsp_init_aarch64.o
 OBJS-$(CONFIG_H264PRED)                 += aarch64/h264pred_init.o
 -OBJS-$(CONFIG_H264QPEL)                 += aarch64/h264qpel_init_aarch64.o
 +OBJS-$(CONFIG_H264QPEL)                 += aarch64/h264qpel_init_aarch64.o \
 +                                           aarch64/h264_qpel_daedalus.o
 OBJS-$(CONFIG_HPELDSP)                  += aarch64/hpeldsp_init_aarch64.o
 OBJS-$(CONFIG_IDCTDSP)                  += aarch64/idctdsp_init_aarch64.o
 OBJS-$(CONFIG_ME_CMP)                   += aarch64/me_cmp_init_aarch64.o
 diff --git a/libavcodec/aarch64/h264_qpel_daedalus.c b/libavcodec/aarch64/h264_qpel_daedalus.c
 new file mode 100644
 --- /dev/null
 +++ b/libavcodec/aarch64/h264_qpel_daedalus.c
@@ -0,0 +1,50 @@
 +/*
 + * H.264 luma qpel mc20 (8x8, horizontal half-pel, 6-tap "put")
 + * — daedalus-fourier substitution shim.
 + *
 + * Routes H264QpelContext.put_h264_qpel_pixels_tab[1][2] through
 + * daedalus_recipe_dispatch_h264_qpel_mc20 instead of
 + * ff_put_h264_qpel8_mc20_neon.  The recipe layer picks the substrate
 + * (CPU NEON for cycle 9; QPU not viable — per-block 7.6 ns vs
 + * ~250 ns QPU dispatch floor, see docs/k9_h264qpel_mc20.md).
 + *
 + * Sibling to libavcodec/aarch64/h264_idct_daedalus.c.  We keep a
 + * SEPARATE process-global pthread_once context here instead of
 + * sharing the H264DSP one because H264QPEL is its own libavcodec
 + * Makefile module and link order does not guarantee a single .o
 + * owns the ctx symbol.  The cost is one extra
 + * daedalus_ctx_create_no_qpu (~µs) per process; daemon and host
 + * processes pay this lazily on first MC call.
 + *
 + * FFmpeg H264QpelContext convention: both dst and src use a SINGLE
 + * stride and `src` already points at the leftmost OUTPUT column
 + * (col 0); the 6-tap filter reads cols -2..+3.  This matches
 + * daedalus_recipe_dispatch_h264_qpel_mc20's documented contract
 + * directly, so dst_off = src_off = 0.
 + */
 +
 +#include <pthread.h>
 +#include <stddef.h>
 +#include <stdint.h>
 +
 +#include <daedalus.h>
 +
 +#include "libavutil/attributes.h"
 +
 +static daedalus_ctx     *g_dctx;
 +static pthread_once_t    g_dctx_once = PTHREAD_ONCE_INIT;
 +
 +static void daedalus_ctx_init_once(void)
 +{
 +    g_dctx = daedalus_ctx_create_no_qpu();
 +}
 +
 +void ff_put_h264_qpel8_mc20_daedalus(uint8_t *dst, const uint8_t *src, ptrdiff_t stride);
 +
 +void ff_put_h264_qpel8_mc20_daedalus(uint8_t *dst, const uint8_t *src, ptrdiff_t stride)
 +{
 +    static const daedalus_h264_qpel_meta meta = { .dst_off = 0, .src_off = 0 };
 +    pthread_once(&g_dctx_once, daedalus_ctx_init_once);
 +    daedalus_recipe_dispatch_h264_qpel_mc20(g_dctx, dst, src, (size_t)stride,
 +                                            1, &meta);
 +}
 diff --git a/libavcodec/aarch64/h264qpel_init_aarch64.c b/libavcodec/aarch64/h264qpel_init_aarch64.c
 --- a/libavcodec/aarch64/h264qpel_init_aarch64.c
 +++ b/libavcodec/aarch64/h264qpel_init_aarch64.c
@@ -47,6 +47,8 @@ void ff_put_h264_qpel8_mc00_neon(uint8_t *dst, const uint8_t *src, ptrdiff_t str
 void ff_put_h264_qpel8_mc10_neon(uint8_t *dst, const uint8_t *src, ptrdiff_t stride);
 void ff_put_h264_qpel8_mc20_neon(uint8_t *dst, const uint8_t *src, ptrdiff_t stride);
 void ff_put_h264_qpel8_mc30_neon(uint8_t *dst, const uint8_t *src, ptrdiff_t stride);
 +void ff_put_h264_qpel8_mc20_daedalus(uint8_t *dst, const uint8_t *src,
 +                                     ptrdiff_t stride);
 void ff_put_h264_qpel8_mc01_neon(uint8_t *dst, const uint8_t *src, ptrdiff_t stride);
 void ff_put_h264_qpel8_mc11_neon(uint8_t *dst, const uint8_t *src, ptrdiff_t stride);
 void ff_put_h264_qpel8_mc21_neon(uint8_t *dst, const uint8_t *src, ptrdiff_t stride);
@@ -184,7 +186,7 @@ av_cold void ff_h264qpel_init_aarch64(H264QpelContext *c, int bit_depth)
         c->put_h264_qpel_pixels_tab[1][ 0] = ff_put_h264_qpel8_mc00_neon;
         c->put_h264_qpel_pixels_tab[1][ 1] = ff_put_h264_qpel8_mc10_neon;
 -        c->put_h264_qpel_pixels_tab[1][ 2] = ff_put_h264_qpel8_mc20_neon;
 +        c->put_h264_qpel_pixels_tab[1][ 2] = ff_put_h264_qpel8_mc20_daedalus;
         c->put_h264_qpel_pixels_tab[1][ 3] = ff_put_h264_qpel8_mc30_neon;
         c->put_h264_qpel_pixels_tab[1][ 4] = ff_put_h264_qpel8_mc01_neon;
         c->put_h264_qpel_pixels_tab[1][ 5] = ff_put_h264_qpel8_mc11_neon;
 --
 2.47.3
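The "135x margin" figure in the patch message above can be reproduced with a back-of-envelope calculation. This is a sketch under stated assumptions that are not in the patch: one mc20 call per 8x8 luma block, a 1920x1080 frame, and 30 frames per second. `qpel_margin` is a hypothetical helper name.

```sh
# Back-of-envelope check of the "135x margin" figure. Assumptions (not from
# the patch): one mc20 call per 8x8 luma block, a 1920x1080 frame, 30 fps.
# qpel_margin is a hypothetical helper name.
qpel_margin() {
    # $1 = ns per block, $2 = width, $3 = height, $4 = frames per second
    awk -v ns="$1" -v w="$2" -v h="$3" -v fps="$4" 'BEGIN {
        capacity = 1e9 / ns             # blocks per second the CPU path sustains
        demand   = (w * h / 64) * fps   # 8x8 blocks per second at this format
        printf "%.0f\n", capacity / demand
    }'
}

qpel_margin 7.6 1920 1080 30   # prints 135
```

At 7.6 ns per block the CPU path sustains about 131.6 million blocks per second, against a demand of 972,000 blocks per second under these assumptions.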
@@ -0,0 +1,92 @@
 From 0000000000000000000000000000000000000000 Mon Sep 17 00:00:00 2001
 From: claude-noether <claude-noether@noreply.localhost>
 Date: Mon, 25 May 2026 12:00:00 +0200
 Subject: [PATCH] avcodec/aarch64/h264dsp: route H.264 luma-h deblock through daedalus-fourier
 Sibling of 0005 (which substituted v_loop_filter_luma).  Same
 NEON-to-NEON substitution: H264DSPContext.h_loop_filter_luma →
 daedalus_recipe_dispatch_h264_deblock_luma_h.  The H kernel landed
 in daedalus-fourier PR #9 (CPU NEON only — no QPU shader yet).
 libavcodec.so ctx is no-QPU per the existing 0003-0005 / 0007
 pattern; we cannot assume Vulkan in arbitrary host processes
 (firefox-fourier RDD, mpv-fourier, etc.).
 Intra (bS=4) h_loop_filter_luma_intra stays on the in-tree NEON .S
 code; daedalus_h264_deblock_meta only covers the non-intra path.
 An intra-h substitution can land once daedalus-fourier exposes a
 dispatch helper (the kernel already exists internally per PR #11).
 Refs reauktion/daedalus-v4l2#11 — substitution arc step 2 cycle 8 H.
 ---
 diff --git a/libavcodec/aarch64/h264_idct_daedalus.c b/libavcodec/aarch64/h264_idct_daedalus.c
 --- a/libavcodec/aarch64/h264_idct_daedalus.c	2026-05-25 13:09:33.694760715 +0200
 +++ libavcodec/aarch64/h264_idct_daedalus.c	2026-05-25 13:09:33.715603719 +0200
@@ -1,9 +1,10 @@
 /*
 - * H.264 4x4 / 8x8 IDCT + luma-v deblock — daedalus-fourier substitution shims.
 + * H.264 4x4 / 8x8 IDCT + luma v/h deblock — daedalus-fourier substitution shims.
  *
  * Routes H264DSPContext.idct_add           → daedalus_recipe_dispatch_h264_idct4
  *        H264DSPContext.idct8_add          → daedalus_recipe_dispatch_h264_idct8
  *        H264DSPContext.v_loop_filter_luma → daedalus_recipe_dispatch_h264_deblock_luma_v
 + *        H264DSPContext.h_loop_filter_luma → daedalus_recipe_dispatch_h264_deblock_luma_h
  * instead of the in-tree ff_h264_*_neon assembly.  The recipe layer
  * picks the substrate (CPU NEON for cycles 6 + 7 by default; cycle 8
  * is CPU primary with QPU opportunistic — the ctx below is no-QPU,
@@ -45,6 +46,8 @@
 void ff_h264_idct8_add_daedalus(uint8_t *dst, int16_t *block, int stride);
 void ff_h264_v_loop_filter_luma_daedalus(uint8_t *pix, ptrdiff_t stride,
                                          int alpha, int beta, int8_t *tc0);
 +void ff_h264_h_loop_filter_luma_daedalus(uint8_t *pix, ptrdiff_t stride,
 +                                         int alpha, int beta, int8_t *tc0);
 void ff_h264_idct_add_daedalus(uint8_t *dst, int16_t *block, int stride)
 {
@@ -84,3 +87,22 @@
     daedalus_recipe_dispatch_h264_deblock_luma_v(g_dctx, pix, (size_t)stride,
                                                  1, &meta);
 }
 +
 +void ff_h264_h_loop_filter_luma_daedalus(uint8_t *pix, ptrdiff_t stride,
 +                                         int alpha, int beta, int8_t *tc0)
 +{
 +    daedalus_h264_deblock_meta meta = {
 +        .dst_off = 0,
 +        .alpha   = alpha,
 +        .beta    = beta,
 +    };
 +    meta.tc0[0] = tc0[0];
 +    meta.tc0[1] = tc0[1];
 +    meta.tc0[2] = tc0[2];
 +    meta.tc0[3] = tc0[3];
 +
 +    pthread_once(&g_dctx_once, daedalus_ctx_init_once);
 +
 +    daedalus_recipe_dispatch_h264_deblock_luma_h(g_dctx, pix, (size_t)stride,
 +                                                 1, &meta);
 +}
 diff --git a/libavcodec/aarch64/h264dsp_init_aarch64.c b/libavcodec/aarch64/h264dsp_init_aarch64.c
 --- a/libavcodec/aarch64/h264dsp_init_aarch64.c	2026-05-25 13:09:33.695937103 +0200
 +++ libavcodec/aarch64/h264dsp_init_aarch64.c	2026-05-25 13:09:33.715541700 +0200
@@ -31,6 +31,8 @@
                                          int alpha, int beta, int8_t *tc0);
 void ff_h264_h_loop_filter_luma_neon(uint8_t *pix, ptrdiff_t stride, int alpha,
                                      int beta, int8_t *tc0);
 +void ff_h264_h_loop_filter_luma_daedalus(uint8_t *pix, ptrdiff_t stride,
 +                                         int alpha, int beta, int8_t *tc0);
 void ff_h264_v_loop_filter_luma_intra_neon(uint8_t *pix, ptrdiff_t stride, int alpha,
                                            int beta);
 void ff_h264_h_loop_filter_luma_intra_neon(uint8_t *pix, ptrdiff_t stride, int alpha,
@@ -117,7 +119,7 @@
     if (have_neon(cpu_flags) && bit_depth == 8) {
         c->v_loop_filter_luma   = ff_h264_v_loop_filter_luma_daedalus;
 -        c->h_loop_filter_luma   = ff_h264_h_loop_filter_luma_neon;
 +        c->h_loop_filter_luma   = ff_h264_h_loop_filter_luma_daedalus;
         c->v_loop_filter_luma_intra= ff_h264_v_loop_filter_luma_intra_neon;
         c->h_loop_filter_luma_intra= ff_h264_h_loop_filter_luma_intra_neon;
 --
 2.47.3
@@ -0,0 +1,127 @@
 From 0000000000000000000000000000000000000000 Mon Sep 17 00:00:00 2001
 From: claude-noether <claude-noether@noreply.localhost>
 Date: Mon, 25 May 2026 12:00:00 +0200
 Subject: [PATCH] avcodec/aarch64/h264dsp: route H.264 chroma v/h deblock through daedalus-fourier
 Chroma siblings of 0005 (luma_v) and 0008 (luma_h).  Same
 NEON-to-NEON pattern via the daedalus recipe layer:
  H264DSPContext.v_loop_filter_chroma →
    daedalus_recipe_dispatch_h264_deblock_chroma_v
  H264DSPContext.h_loop_filter_chroma →
    daedalus_recipe_dispatch_h264_deblock_chroma_h
 Both kernels landed in daedalus-fourier PR #10.  Recipe table
 routes AUTO to CPU NEON (no chroma QPU shaders yet), so this
 is plumbing-only and stays bit-exact against the in-tree NEON.
 Intra chroma (bS=4) loop filters remain on in-tree NEON;
 daedalus_h264_deblock_meta covers the non-intra (bS<4) path.
 Refs reauktion/daedalus-v4l2#11 — substitution arc step 2 cycle 8 chroma.
 ---
 diff --git a/libavcodec/aarch64/h264_idct_daedalus.c b/libavcodec/aarch64/h264_idct_daedalus.c
 --- a/libavcodec/aarch64/h264_idct_daedalus.c	2026-05-25 13:15:45.995368233 +0200
 +++ libavcodec/aarch64/h264_idct_daedalus.c	2026-05-25 13:15:46.015839177 +0200
@@ -1,10 +1,12 @@
 /*
 - * H.264 4x4 / 8x8 IDCT + luma v/h deblock — daedalus-fourier substitution shims.
 + * H.264 4x4 / 8x8 IDCT + luma v/h + chroma v/h deblock — daedalus-fourier substitution shims.
  *
  * Routes H264DSPContext.idct_add           → daedalus_recipe_dispatch_h264_idct4
  *        H264DSPContext.idct8_add          → daedalus_recipe_dispatch_h264_idct8
 - *        H264DSPContext.v_loop_filter_luma → daedalus_recipe_dispatch_h264_deblock_luma_v
 - *        H264DSPContext.h_loop_filter_luma → daedalus_recipe_dispatch_h264_deblock_luma_h
 + *        H264DSPContext.v_loop_filter_luma   → daedalus_recipe_dispatch_h264_deblock_luma_v
 + *        H264DSPContext.h_loop_filter_luma   → daedalus_recipe_dispatch_h264_deblock_luma_h
 + *        H264DSPContext.v_loop_filter_chroma → daedalus_recipe_dispatch_h264_deblock_chroma_v
 + *        H264DSPContext.h_loop_filter_chroma → daedalus_recipe_dispatch_h264_deblock_chroma_h
  * instead of the in-tree ff_h264_*_neon assembly.  The recipe layer
  * picks the substrate (CPU NEON for cycles 6 + 7 by default; cycle 8
  * is CPU primary with QPU opportunistic — the ctx below is no-QPU,
@@ -48,6 +50,10 @@
                                          int alpha, int beta, int8_t *tc0);
 void ff_h264_h_loop_filter_luma_daedalus(uint8_t *pix, ptrdiff_t stride,
                                          int alpha, int beta, int8_t *tc0);
 +void ff_h264_v_loop_filter_chroma_daedalus(uint8_t *pix, ptrdiff_t stride,
 +                                           int alpha, int beta, int8_t *tc0);
 +void ff_h264_h_loop_filter_chroma_daedalus(uint8_t *pix, ptrdiff_t stride,
 +                                           int alpha, int beta, int8_t *tc0);
 void ff_h264_idct_add_daedalus(uint8_t *dst, int16_t *block, int stride)
 {
@@ -106,3 +112,41 @@
     daedalus_recipe_dispatch_h264_deblock_luma_h(g_dctx, pix, (size_t)stride,
                                                  1, &meta);
 }
 +
 +void ff_h264_v_loop_filter_chroma_daedalus(uint8_t *pix, ptrdiff_t stride,
 +                                           int alpha, int beta, int8_t *tc0)
 +{
 +    daedalus_h264_deblock_meta meta = {
 +        .dst_off = 0,
 +        .alpha   = alpha,
 +        .beta    = beta,
 +    };
 +    meta.tc0[0] = tc0[0];
 +    meta.tc0[1] = tc0[1];
 +    meta.tc0[2] = tc0[2];
 +    meta.tc0[3] = tc0[3];
 +
 +    pthread_once(&g_dctx_once, daedalus_ctx_init_once);
 +
 +    daedalus_recipe_dispatch_h264_deblock_chroma_v(g_dctx, pix, (size_t)stride,
 +                                                   1, &meta);
 +}
 +
 +void ff_h264_h_loop_filter_chroma_daedalus(uint8_t *pix, ptrdiff_t stride,
 +                                           int alpha, int beta, int8_t *tc0)
 +{
 +    daedalus_h264_deblock_meta meta = {
 +        .dst_off = 0,
 +        .alpha   = alpha,
 +        .beta    = beta,
 +    };
 +    meta.tc0[0] = tc0[0];
 +    meta.tc0[1] = tc0[1];
 +    meta.tc0[2] = tc0[2];
 +    meta.tc0[3] = tc0[3];
 +
 +    pthread_once(&g_dctx_once, daedalus_ctx_init_once);
 +
 +    daedalus_recipe_dispatch_h264_deblock_chroma_h(g_dctx, pix, (size_t)stride,
 +                                                   1, &meta);
 +}
 diff --git a/libavcodec/aarch64/h264dsp_init_aarch64.c b/libavcodec/aarch64/h264dsp_init_aarch64.c
 --- a/libavcodec/aarch64/h264dsp_init_aarch64.c	2026-05-25 13:15:45.996482360 +0200
 +++ libavcodec/aarch64/h264dsp_init_aarch64.c	2026-05-25 13:15:46.025604910 +0200
@@ -39,8 +39,12 @@
                                            int beta);
 void ff_h264_v_loop_filter_chroma_neon(uint8_t *pix, ptrdiff_t stride, int alpha,
                                        int beta, int8_t *tc0);
 +void ff_h264_v_loop_filter_chroma_daedalus(uint8_t *pix, ptrdiff_t stride,
 +                                           int alpha, int beta, int8_t *tc0);
 void ff_h264_h_loop_filter_chroma_neon(uint8_t *pix, ptrdiff_t stride, int alpha,
                                        int beta, int8_t *tc0);
 +void ff_h264_h_loop_filter_chroma_daedalus(uint8_t *pix, ptrdiff_t stride,
 +                                           int alpha, int beta, int8_t *tc0);
 void ff_h264_h_loop_filter_chroma422_neon(uint8_t *pix, ptrdiff_t stride, int alpha,
                                           int beta, int8_t *tc0);
 void ff_h264_v_loop_filter_chroma_intra_neon(uint8_t *pix, ptrdiff_t stride,
@@ -123,11 +127,11 @@
         c->v_loop_filter_luma_intra= ff_h264_v_loop_filter_luma_intra_neon;
         c->h_loop_filter_luma_intra= ff_h264_h_loop_filter_luma_intra_neon;
 -        c->v_loop_filter_chroma = ff_h264_v_loop_filter_chroma_neon;
 +        c->v_loop_filter_chroma = ff_h264_v_loop_filter_chroma_daedalus;
         c->v_loop_filter_chroma_intra = ff_h264_v_loop_filter_chroma_intra_neon;
         if (chroma_format_idc <= 1) {
 -            c->h_loop_filter_chroma = ff_h264_h_loop_filter_chroma_neon;
 +            c->h_loop_filter_chroma = ff_h264_h_loop_filter_chroma_daedalus;
             c->h_loop_filter_chroma_intra = ff_h264_h_loop_filter_chroma_intra_neon;
             c->h_loop_filter_chroma_mbaff_intra = ff_h264_h_loop_filter_chroma_mbaff_intra_neon;
         } else {
 --
 2.47.3
@@ -0,0 +1,126 @@
 From 0000000000000000000000000000000000000000 Mon Sep 17 00:00:00 2001
 From: claude-noether <claude-noether@noreply.localhost>
 Date: Mon, 25 May 2026 12:30:00 +0200
 Subject: [PATCH] avcodec/aarch64/h264dsp: route H.264 luma intra deblock through daedalus-fourier
 Adds the bS=4 intra-strength variants of the already-substituted
 luma_v / luma_h deblock (0005, 0008).  Intra MBs and certain
 inter-MB edges (4x4 transform boundaries inside an Intra_NxN
 neighbour) force boundary strength to 4 per H.264 §8.7.2.1.
  H264DSPContext.v_loop_filter_luma_intra →
    daedalus_recipe_dispatch_h264_deblock_luma_v_intra
  H264DSPContext.h_loop_filter_luma_intra →
    daedalus_recipe_dispatch_h264_deblock_luma_h_intra
 Both kernels landed in daedalus-fourier PR #11.  Recipe table
 routes AUTO to CPU NEON (no intra QPU shaders yet) — plumbing-
 only NEON-to-NEON via daedalus, bit-exact against the in-tree
 FFmpeg NEON path.
 Signature differs from bS<4: no tc0 argument.  The wrapper
 passes daedalus_h264_deblock_meta with alpha/beta set; tc0[] is
 ignored by the intra dispatch (bS=4 hardcodes the strength).
 Chroma intra variants are deferred to a follow-up PR because the
 chroma path has a 4:2:0 / 4:2:2 split (chroma_format_idc gating)
 that needs explicit conditional substitution to avoid running
 the 4:2:0-only daedalus dispatch on 4:2:2 chroma.
 Refs reauktion/daedalus-v4l2#11 — substitution arc step 2 cycle 8 intra.
 ---
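 The wrapper shape described above (lazy process-global ctx via
 pthread_once, then a dispatch call that carries alpha/beta but no tc0)
 can be sketched as follows.  This is a minimal illustration under
 stated assumptions: every sketch_* name, the struct layouts, and the
 meaning of the literal 1 (taken here as an item count) are hypothetical
 stand-ins, not the daedalus-fourier API.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the daedalus-fourier types; not the real API. */
typedef struct { int init_count; } sketch_ctx;
typedef struct {
    size_t dst_off;
    int    alpha, beta;
    int8_t tc0[4];          /* present in the meta, unused on the intra path */
} sketch_deblock_meta;

static sketch_ctx      g_ctx_storage;
static sketch_ctx     *g_ctx;
static pthread_once_t  g_ctx_once = PTHREAD_ONCE_INIT;
static int             g_last_alpha, g_last_beta;

/* Runs exactly once per process, however many wrappers race to it. */
static void sketch_ctx_init_once(void)
{
    g_ctx_storage.init_count++;
    g_ctx = &g_ctx_storage;
}

/* Stub dispatch: records the thresholds it was given and never reads tc0. */
static void sketch_dispatch_luma_v_intra(sketch_ctx *ctx, uint8_t *pix,
                                         size_t stride, int count,
                                         const sketch_deblock_meta *meta)
{
    (void)ctx; (void)pix; (void)stride; (void)count;
    g_last_alpha = meta->alpha;
    g_last_beta  = meta->beta;
}

/* Same signature as the intra filter slot: no tc0 argument. */
static void sketch_v_loop_filter_luma_intra(uint8_t *pix, ptrdiff_t stride,
                                            int alpha, int beta)
{
    sketch_deblock_meta meta = { .dst_off = 0, .alpha = alpha, .beta = beta };

    assert(pix != NULL);
    pthread_once(&g_ctx_once, sketch_ctx_init_once);
    sketch_dispatch_luma_v_intra(g_ctx, pix, (size_t)stride, 1, &meta);
}
```

 Calling the wrapper repeatedly leaves init_count at 1, which is the
 property the pthread_once guard is there to provide.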
 diff --git a/libavcodec/aarch64/h264_idct_daedalus.c b/libavcodec/aarch64/h264_idct_daedalus.c
 --- a/libavcodec/aarch64/h264_idct_daedalus.c	2026-05-25 13:18:54.992244965 +0200
 +++ libavcodec/aarch64/h264_idct_daedalus.c	2026-05-25 13:20:12.338122217 +0200
@@ -1,5 +1,5 @@
 /*
 - * H.264 4x4 / 8x8 IDCT + luma v/h + chroma v/h deblock — daedalus-fourier substitution shims.
 + * H.264 4x4 / 8x8 IDCT + luma v/h (inter + intra) + chroma v/h deblock — daedalus-fourier substitution shims.
  *
  * Routes H264DSPContext.idct_add           → daedalus_recipe_dispatch_h264_idct4
  *        H264DSPContext.idct8_add          → daedalus_recipe_dispatch_h264_idct8
@@ -7,6 +7,8 @@
  *        H264DSPContext.h_loop_filter_luma   → daedalus_recipe_dispatch_h264_deblock_luma_h
  *        H264DSPContext.v_loop_filter_chroma → daedalus_recipe_dispatch_h264_deblock_chroma_v
  *        H264DSPContext.h_loop_filter_chroma → daedalus_recipe_dispatch_h264_deblock_chroma_h
 + *        H264DSPContext.v_loop_filter_luma_intra → daedalus_recipe_dispatch_h264_deblock_luma_v_intra
 + *        H264DSPContext.h_loop_filter_luma_intra → daedalus_recipe_dispatch_h264_deblock_luma_h_intra
  * instead of the in-tree ff_h264_*_neon assembly.  The recipe layer
  * picks the substrate (CPU NEON for cycles 6 + 7 by default; cycle 8
  * is CPU primary with QPU opportunistic — the ctx below is no-QPU,
@@ -54,6 +56,10 @@
                                            int alpha, int beta, int8_t *tc0);
 void ff_h264_h_loop_filter_chroma_daedalus(uint8_t *pix, ptrdiff_t stride,
                                            int alpha, int beta, int8_t *tc0);
 +void ff_h264_v_loop_filter_luma_intra_daedalus(uint8_t *pix, ptrdiff_t stride,
 +                                                int alpha, int beta);
 +void ff_h264_h_loop_filter_luma_intra_daedalus(uint8_t *pix, ptrdiff_t stride,
 +                                                int alpha, int beta);
 void ff_h264_idct_add_daedalus(uint8_t *dst, int16_t *block, int stride)
 {
@@ -150,3 +156,34 @@
     daedalus_recipe_dispatch_h264_deblock_chroma_h(g_dctx, pix, (size_t)stride,
                                                    1, &meta);
 }
 +
 +void ff_h264_v_loop_filter_luma_intra_daedalus(uint8_t *pix, ptrdiff_t stride,
 +                                                int alpha, int beta)
 +{
 +    daedalus_h264_deblock_meta meta = {
 +        .dst_off = 0,
 +        .alpha   = alpha,
 +        .beta    = beta,
 +    };
 +    /* tc0[] is ignored by the intra-strength dispatch (bS=4 hardcodes the strength). */
 +
 +    pthread_once(&g_dctx_once, daedalus_ctx_init_once);
 +
 +    daedalus_recipe_dispatch_h264_deblock_luma_v_intra(g_dctx, pix, (size_t)stride,
 +                                                        1, &meta);
 +}
 +
 +void ff_h264_h_loop_filter_luma_intra_daedalus(uint8_t *pix, ptrdiff_t stride,
 +                                                int alpha, int beta)
 +{
 +    daedalus_h264_deblock_meta meta = {
 +        .dst_off = 0,
 +        .alpha   = alpha,
 +        .beta    = beta,
 +    };
 +
 +    pthread_once(&g_dctx_once, daedalus_ctx_init_once);
 +
 +    daedalus_recipe_dispatch_h264_deblock_luma_h_intra(g_dctx, pix, (size_t)stride,
 +                                                        1, &meta);
 +}
 diff --git a/libavcodec/aarch64/h264dsp_init_aarch64.c b/libavcodec/aarch64/h264dsp_init_aarch64.c
 --- a/libavcodec/aarch64/h264dsp_init_aarch64.c	2026-05-25 13:18:54.993349573 +0200
 +++ libavcodec/aarch64/h264dsp_init_aarch64.c	2026-05-25 13:20:12.338265830 +0200
@@ -35,8 +35,12 @@
                                          int alpha, int beta, int8_t *tc0);
 void ff_h264_v_loop_filter_luma_intra_neon(uint8_t *pix, ptrdiff_t stride, int alpha,
                                            int beta);
 +void ff_h264_v_loop_filter_luma_intra_daedalus(uint8_t *pix, ptrdiff_t stride,
 +                                                int alpha, int beta);
 void ff_h264_h_loop_filter_luma_intra_neon(uint8_t *pix, ptrdiff_t stride, int alpha,
                                            int beta);
 +void ff_h264_h_loop_filter_luma_intra_daedalus(uint8_t *pix, ptrdiff_t stride,
 +                                                int alpha, int beta);
 void ff_h264_v_loop_filter_chroma_neon(uint8_t *pix, ptrdiff_t stride, int alpha,
                                        int beta, int8_t *tc0);
 void ff_h264_v_loop_filter_chroma_daedalus(uint8_t *pix, ptrdiff_t stride,
@@ -124,8 +128,8 @@
     if (have_neon(cpu_flags) && bit_depth == 8) {
         c->v_loop_filter_luma   = ff_h264_v_loop_filter_luma_daedalus;
         c->h_loop_filter_luma   = ff_h264_h_loop_filter_luma_daedalus;
 -        c->v_loop_filter_luma_intra= ff_h264_v_loop_filter_luma_intra_neon;
 -        c->h_loop_filter_luma_intra= ff_h264_h_loop_filter_luma_intra_neon;
 +        c->v_loop_filter_luma_intra= ff_h264_v_loop_filter_luma_intra_daedalus;
 +        c->h_loop_filter_luma_intra= ff_h264_h_loop_filter_luma_intra_daedalus;
         c->v_loop_filter_chroma = ff_h264_v_loop_filter_chroma_daedalus;
         c->v_loop_filter_chroma_intra = ff_h264_v_loop_filter_chroma_intra_neon;
 --
 2.47.3
@@ -0,0 +1,101 @@
 From 0000000000000000000000000000000000000000 Mon Sep 17 00:00:00 2001
 From: claude-noether <claude-noether@noreply.localhost>
 Date: Sun, 25 May 2026 13:00:00 +0200
 Subject: [PATCH] avcodec/aarch64/h264dsp: route H.264 chroma DC Hadamard through daedalus-fourier
 Substitutes H264DSPContext.chroma_dc_dequant_idct in the
 4:2:0 / bit_depth=8 init path with a wrapper that composes
 the daedalus chroma DC Hadamard primitive (fourier PR #25)
 with the qmul scaling; FFmpeg does both in one fused function.
 Bit-exact against ff_h264_chroma_dc_dequant_idct_8_c.
 Hadamard correctness gated by fourier PR #23 test suite.
 4:2:2 chroma stays on the in-tree 422 variant (same
 gating shape as 0009 chroma deblock substitution).
 Requires daedalus-fourier commit b9f9ff2 or later (PR #25
 exposing the public Hadamard symbol).  Pin bumps in PKGBUILD
 and build-deb.sh come in the same commit.
 ---
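 The composition claimed above (extract four DC coefficients, run a 2x2
 Hadamard, then scale by qmul and shift right by 7) can be checked
 against a fused form with a small scalar sketch.  Assumptions:
 sketch_hadamard_2x2 is a hypothetical stand-in for
 daedalus_h264_chroma_dc_hadamard_2x2, whose exact contract is not shown
 in this patch, and sketch_fused follows the general shape of FFmpeg's C
 reference rather than copying it.

```c
#include <assert.h>
#include <stdint.h>

enum { STRIDE = 32, XSTRIDE = 16 };  /* offsets quoted in the patch comment */

/* Hypothetical stand-in: in-place 2x2 Hadamard on 4 contiguous values. */
static void sketch_hadamard_2x2(int16_t dc[4])
{
    int a = dc[0], b = dc[1], c = dc[2], d = dc[3];
    dc[0] = (int16_t)(a + b + c + d);
    dc[1] = (int16_t)(a - b + c - d);
    dc[2] = (int16_t)(a + b - c - d);
    dc[3] = (int16_t)(a - b - c + d);
}

/* Fused form: butterfly in int, then scale.  Right-shifting a negative
 * int is implementation-defined in C; this assumes an arithmetic shift. */
static void sketch_fused(int16_t *block, int qmul)
{
    int a = block[0],      b = block[XSTRIDE];
    int c = block[STRIDE], d = block[STRIDE + XSTRIDE];
    int e = a - b, f = c - d;
    a += b;
    c += d;
    block[0]                = (int16_t)(((a + c) * qmul) >> 7);
    block[XSTRIDE]          = (int16_t)(((e + f) * qmul) >> 7);
    block[STRIDE]           = (int16_t)(((a - c) * qmul) >> 7);
    block[STRIDE + XSTRIDE] = (int16_t)(((e - f) * qmul) >> 7);
}

/* Composed form: gather, Hadamard on int16, caller-side qmul scale. */
static void sketch_composed(int16_t *block, int qmul)
{
    int16_t dc[4] = { block[0], block[XSTRIDE],
                      block[STRIDE], block[STRIDE + XSTRIDE] };
    sketch_hadamard_2x2(dc);
    block[0]                = (int16_t)(((int)dc[0] * qmul) >> 7);
    block[XSTRIDE]          = (int16_t)(((int)dc[1] * qmul) >> 7);
    block[STRIDE]           = (int16_t)(((int)dc[2] * qmul) >> 7);
    block[STRIDE + XSTRIDE] = (int16_t)(((int)dc[3] * qmul) >> 7);
}

/* Sweeps inputs whose Hadamard sums stay inside int16 range and
 * returns how many input tuples produce differing outputs. */
static int sketch_count_mismatches(int qmul)
{
    int bad = 0;
    for (int a = -2048; a <= 2048; a += 512)
    for (int b = -2048; b <= 2048; b += 512)
    for (int c = -2048; c <= 2048; c += 512)
    for (int d = -2048; d <= 2048; d += 512) {
        int16_t x[64] = {0}, y[64] = {0};
        x[0] = y[0] = (int16_t)a;
        x[XSTRIDE] = y[XSTRIDE] = (int16_t)b;
        x[STRIDE] = y[STRIDE] = (int16_t)c;
        x[STRIDE + XSTRIDE] = y[STRIDE + XSTRIDE] = (int16_t)d;
        sketch_fused(x, qmul);
        sketch_composed(y, qmul);
        bad += x[0] != y[0] || x[XSTRIDE] != y[XSTRIDE] ||
               x[STRIDE] != y[STRIDE] ||
               x[STRIDE + XSTRIDE] != y[STRIDE + XSTRIDE];
    }
    return bad;
}
```

 One design point the sketch makes visible: the composed form keeps the
 Hadamard result in int16_t, so it matches the fused form only while the
 four-coefficient sums fit in 16 bits, which the sweep above respects.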
 diff --git a/libavcodec/aarch64/h264_idct_daedalus.c b/libavcodec/aarch64/h264_idct_daedalus.c
 --- a/libavcodec/aarch64/h264_idct_daedalus.c	2026-05-25 13:38:32.019491484 +0200
 +++ libavcodec/aarch64/h264_idct_daedalus.c	2026-05-25 13:38:32.033821507 +0200
@@ -1,5 +1,5 @@
 /*
 - * H.264 4x4 / 8x8 IDCT + luma v/h (inter + intra) + chroma v/h deblock — daedalus-fourier substitution shims.
 + * H.264 4x4 / 8x8 IDCT + luma v/h (inter+intra) + chroma v/h deblock + chroma DC Hadamard — daedalus-fourier substitution shims.
  *
  * Routes H264DSPContext.idct_add           → daedalus_recipe_dispatch_h264_idct4
  *        H264DSPContext.idct8_add          → daedalus_recipe_dispatch_h264_idct8
@@ -9,6 +9,7 @@
  *        H264DSPContext.h_loop_filter_chroma → daedalus_recipe_dispatch_h264_deblock_chroma_h
  *        H264DSPContext.v_loop_filter_luma_intra → daedalus_recipe_dispatch_h264_deblock_luma_v_intra
  *        H264DSPContext.h_loop_filter_luma_intra → daedalus_recipe_dispatch_h264_deblock_luma_h_intra
 + *        H264DSPContext.chroma_dc_dequant_idct   → daedalus_h264_chroma_dc_hadamard_2x2 + caller-side qmul
  * instead of the in-tree ff_h264_*_neon assembly.  The recipe layer
  * picks the substrate (CPU NEON for cycles 6 + 7 by default; cycle 8
  * is CPU primary with QPU opportunistic — the ctx below is no-QPU,
@@ -60,6 +61,7 @@
                                                 int alpha, int beta);
 void ff_h264_h_loop_filter_luma_intra_daedalus(uint8_t *pix, ptrdiff_t stride,
                                                 int alpha, int beta);
 +void ff_h264_chroma_dc_dequant_idct_daedalus(int16_t *block, int qmul);
 void ff_h264_idct_add_daedalus(uint8_t *dst, int16_t *block, int stride)
 {
@@ -187,3 +189,32 @@
     daedalus_recipe_dispatch_h264_deblock_luma_h_intra(g_dctx, pix, (size_t)stride,
                                                         1, &meta);
 }
 +
 +/* Composes daedalus_h264_chroma_dc_hadamard_2x2 with the qmul scaling
 + * that FFmpeg's reference does in one fused function (h264idct_template.c
 + * ff_h264_chroma_dc_dequant_idct).
 + *
 + * The 4 DC coefficients are scattered across the per-MB coefficient
 + * buffer at offsets [r*stride + c*xStride] (stride=32, xStride=16).
 + * Extract into a contiguous int16[4], run the Hadamard, then apply
 + * the qmul scale and write back to the original positions.
 + *
 + * No daedalus ctx needed; the Hadamard is a pure stateless primitive.
 + */
 +void ff_h264_chroma_dc_dequant_idct_daedalus(int16_t *block, int qmul)
 +{
 +    enum { stride = 32, xStride = 16 };
 +    int16_t dc[4];
 +
 +    dc[0] = block[stride*0 + xStride*0];
 +    dc[1] = block[stride*0 + xStride*1];
 +    dc[2] = block[stride*1 + xStride*0];
 +    dc[3] = block[stride*1 + xStride*1];
 +
 +    daedalus_h264_chroma_dc_hadamard_2x2(dc);
 +
 +    block[stride*0 + xStride*0] = (int16_t)((int)dc[0] * qmul >> 7);
 +    block[stride*0 + xStride*1] = (int16_t)((int)dc[1] * qmul >> 7);
 +    block[stride*1 + xStride*0] = (int16_t)((int)dc[2] * qmul >> 7);
 +    block[stride*1 + xStride*1] = (int16_t)((int)dc[3] * qmul >> 7);
 +}
 diff --git a/libavcodec/aarch64/h264dsp_init_aarch64.c b/libavcodec/aarch64/h264dsp_init_aarch64.c
 --- a/libavcodec/aarch64/h264dsp_init_aarch64.c	2026-05-25 13:38:32.020346459 +0200
 +++ libavcodec/aarch64/h264dsp_init_aarch64.c	2026-05-25 13:38:32.033909804 +0200
@@ -41,6 +41,7 @@
                                            int beta);
 void ff_h264_h_loop_filter_luma_intra_daedalus(uint8_t *pix, ptrdiff_t stride,
                                                 int alpha, int beta);
 +void ff_h264_chroma_dc_dequant_idct_daedalus(int16_t *block, int qmul);
 void ff_h264_v_loop_filter_chroma_neon(uint8_t *pix, ptrdiff_t stride, int alpha,
                                        int beta, int8_t *tc0);
 void ff_h264_v_loop_filter_chroma_daedalus(uint8_t *pix, ptrdiff_t stride,
@@ -135,6 +136,7 @@
         c->v_loop_filter_chroma_intra = ff_h264_v_loop_filter_chroma_intra_neon;
         if (chroma_format_idc <= 1) {
 +            c->chroma_dc_dequant_idct = ff_h264_chroma_dc_dequant_idct_daedalus;
             c->h_loop_filter_chroma = ff_h264_h_loop_filter_chroma_daedalus;
             c->h_loop_filter_chroma_intra = ff_h264_h_loop_filter_chroma_intra_neon;
             c->h_loop_filter_chroma_mbaff_intra = ff_h264_h_loop_filter_chroma_mbaff_intra_neon;
 --
 2.47.3
@@ -0,0 +1,245 @@
 From 0000000000000000000000000000000000000000 Mon Sep 17 00:00:00 2001
 From: claude-noether <claude-noether@noreply.localhost>
 Date: Sun, 25 May 2026 14:00:00 +0200
 Subject: [PATCH] avcodec/aarch64/h264qpel: route remaining qpel 8x8 positions through daedalus-fourier
 Closes the H.264 qpel substitution.  Extends 0007 (which routed only
 mc20 put_) to ALL 15 useful positions in BOTH the put_ and avg_
 tables, skipping mc00 (integer copy / pointer-only fast path).
 29 substitutions total: 14 new put_ + 15 avg_.  Each is a uniform
 wrapper around daedalus_recipe_dispatch_h264_qpel_{avg_,}mcXY exposed
 by daedalus-fourier PRs #15-#20.
 All recipe-table entries route AUTO to CPU NEON (no QPU shaders
 for any qpel position other than mc20 yet), so this is plumbing-only
 NEON-to-NEON — bit-exact against the in-tree ff_*_h264_qpel8_*_neon
 path.
 16x16 qpel tables ([0][...]) stay on the in-tree NEON.  daedalus
 only exposes 8x8 today; 16x16 substitution can land once fourier
 provides those variants (likely just dispatching the 8x8 path four
 times with shifted dst/src offsets).
 Refs reauktion/daedalus-v4l2#11 — substitution arc qpel buildout.
 ---
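 The table slots touched by this patch follow a simple index rule that
 can be read off the assignments in the diff ([1][2] = mc20,
 [1][4] = mc01, [1][10] = mc22): index = x + 4*y for quarter-sample
 offset mcXY.  A minimal sketch, with hypothetical helper names:

```c
#include <assert.h>

/* Slot in {put,avg}_h264_qpel_pixels_tab[size][idx] for position mcXY,
 * where x and y are quarter-sample offsets in 0..3. */
static int sketch_qpel_tab_index(int x, int y)
{
    assert(x >= 0 && x < 4 && y >= 0 && y < 4);
    return x + 4 * y;
}

/* Counts the slots one table routes through daedalus: every position
 * except mc00, the integer-copy fast path. */
static int sketch_count_substituted_per_table(void)
{
    int n = 0;
    for (int y = 0; y < 4; y++)
        for (int x = 0; x < 4; x++)
            if (sketch_qpel_tab_index(x, y) != 0)
                n++;
    return n;
}
```

 With 15 slots per table and put_ mc20 already routed by 0007, the
 count of new substitutions is 14 + 15 = 29, matching the message above.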
 diff --git a/libavcodec/aarch64/h264_qpel_daedalus.c b/libavcodec/aarch64/h264_qpel_daedalus.c
 --- a/libavcodec/aarch64/h264_qpel_daedalus.c	2026-05-25 14:05:05.789298250 +0200
 +++ libavcodec/aarch64/h264_qpel_daedalus.c	2026-05-25 14:05:05.818358374 +0200
@@ -1,10 +1,13 @@
 /*
 - * H.264 luma qpel mc20 (8x8, horizontal half-pel, 6-tap "put")
 - * — daedalus-fourier substitution shim.
 + * H.264 luma qpel 8x8 — daedalus-fourier substitution shims (put_ + avg_).
  *
 - * Routes H264QpelContext.put_h264_qpel_pixels_tab[1][2] through
 - * daedalus_recipe_dispatch_h264_qpel_mc20 instead of
 - * ff_put_h264_qpel8_mc20_neon.  The recipe layer picks the substrate
 + * Routes ALL 15 useful positions in H264QpelContext's 8x8 put_ and
 + * avg_ tables through daedalus_recipe_dispatch_h264_qpel_mc{XY}
 + * (skipping mc00 which is integer copy / FFmpeg's pointer-only fast
 + * path).  Plumbing-only NEON-by-recipe — daedalus-fourier PRs #15-#20
 + * exposed each variant via the same dispatch signature, so the
 + * substitution is a uniform macro across put_/avg_ and across all
 + * 15 mc positions.  The recipe layer picks the substrate
  * (CPU NEON for cycle 9; QPU not viable — per-block 7.6 ns vs
  * ~250 ns QPU dispatch floor, see docs/k9_h264qpel_mc20.md).
  *
@@ -48,3 +51,53 @@
     daedalus_recipe_dispatch_h264_qpel_mc20(g_dctx, dst, src, (size_t)stride,
                                             1, &meta);
 }
 +
 +
 +/* All other 8x8 qpel positions follow the same dispatch shape as mc20
 + * above.  The macro collapses ~600 LOC of one-wrapper-per-variant
 + * boilerplate (29 variants total: 14 put_ + 15 avg_). */
 +#define DEFINE_QPEL_WRAPPER(type, suffix, dispatch_fn)                          \
 +void ff_ ## type ## _h264_qpel8_ ## suffix ## _daedalus(uint8_t *dst,           \
 +    const uint8_t *src, ptrdiff_t stride);                                      \
 +void ff_ ## type ## _h264_qpel8_ ## suffix ## _daedalus(uint8_t *dst,           \
 +    const uint8_t *src, ptrdiff_t stride)                                       \
 +{                                                                               \
 +    static const daedalus_h264_qpel_meta meta = { .dst_off = 0, .src_off = 0 }; \
 +    pthread_once(&g_dctx_once, daedalus_ctx_init_once);                         \
 +    dispatch_fn(g_dctx, dst, src, (size_t)stride, 1, &meta);                    \
 +}
 +
 +/* put_ variants (mc20 stays on the explicit definition above). */
 +DEFINE_QPEL_WRAPPER(put, mc10, daedalus_recipe_dispatch_h264_qpel_mc10)
 +DEFINE_QPEL_WRAPPER(put, mc30, daedalus_recipe_dispatch_h264_qpel_mc30)
 +DEFINE_QPEL_WRAPPER(put, mc01, daedalus_recipe_dispatch_h264_qpel_mc01)
 +DEFINE_QPEL_WRAPPER(put, mc11, daedalus_recipe_dispatch_h264_qpel_mc11)
 +DEFINE_QPEL_WRAPPER(put, mc21, daedalus_recipe_dispatch_h264_qpel_mc21)
 +DEFINE_QPEL_WRAPPER(put, mc31, daedalus_recipe_dispatch_h264_qpel_mc31)
 +DEFINE_QPEL_WRAPPER(put, mc02, daedalus_recipe_dispatch_h264_qpel_mc02)
 +DEFINE_QPEL_WRAPPER(put, mc12, daedalus_recipe_dispatch_h264_qpel_mc12)
 +DEFINE_QPEL_WRAPPER(put, mc22, daedalus_recipe_dispatch_h264_qpel_mc22)
 +DEFINE_QPEL_WRAPPER(put, mc32, daedalus_recipe_dispatch_h264_qpel_mc32)
 +DEFINE_QPEL_WRAPPER(put, mc03, daedalus_recipe_dispatch_h264_qpel_mc03)
 +DEFINE_QPEL_WRAPPER(put, mc13, daedalus_recipe_dispatch_h264_qpel_mc13)
 +DEFINE_QPEL_WRAPPER(put, mc23, daedalus_recipe_dispatch_h264_qpel_mc23)
 +DEFINE_QPEL_WRAPPER(put, mc33, daedalus_recipe_dispatch_h264_qpel_mc33)
 +
 +/* avg_ variants — all 15 useful positions. */
 +DEFINE_QPEL_WRAPPER(avg, mc10, daedalus_recipe_dispatch_h264_qpel_avg_mc10)
 +DEFINE_QPEL_WRAPPER(avg, mc20, daedalus_recipe_dispatch_h264_qpel_avg_mc20)
 +DEFINE_QPEL_WRAPPER(avg, mc30, daedalus_recipe_dispatch_h264_qpel_avg_mc30)
 +DEFINE_QPEL_WRAPPER(avg, mc01, daedalus_recipe_dispatch_h264_qpel_avg_mc01)
 +DEFINE_QPEL_WRAPPER(avg, mc11, daedalus_recipe_dispatch_h264_qpel_avg_mc11)
 +DEFINE_QPEL_WRAPPER(avg, mc21, daedalus_recipe_dispatch_h264_qpel_avg_mc21)
 +DEFINE_QPEL_WRAPPER(avg, mc31, daedalus_recipe_dispatch_h264_qpel_avg_mc31)
 +DEFINE_QPEL_WRAPPER(avg, mc02, daedalus_recipe_dispatch_h264_qpel_avg_mc02)
 +DEFINE_QPEL_WRAPPER(avg, mc12, daedalus_recipe_dispatch_h264_qpel_avg_mc12)
 +DEFINE_QPEL_WRAPPER(avg, mc22, daedalus_recipe_dispatch_h264_qpel_avg_mc22)
 +DEFINE_QPEL_WRAPPER(avg, mc32, daedalus_recipe_dispatch_h264_qpel_avg_mc32)
 +DEFINE_QPEL_WRAPPER(avg, mc03, daedalus_recipe_dispatch_h264_qpel_avg_mc03)
 +DEFINE_QPEL_WRAPPER(avg, mc13, daedalus_recipe_dispatch_h264_qpel_avg_mc13)
 +DEFINE_QPEL_WRAPPER(avg, mc23, daedalus_recipe_dispatch_h264_qpel_avg_mc23)
 +DEFINE_QPEL_WRAPPER(avg, mc33, daedalus_recipe_dispatch_h264_qpel_avg_mc33)
 +
 +#undef DEFINE_QPEL_WRAPPER
 diff --git a/libavcodec/aarch64/h264qpel_init_aarch64.c b/libavcodec/aarch64/h264qpel_init_aarch64.c
 --- a/libavcodec/aarch64/h264qpel_init_aarch64.c	2026-05-25 14:05:05.790403989 +0200
 +++ libavcodec/aarch64/h264qpel_init_aarch64.c	2026-05-25 14:05:05.819136071 +0200
@@ -50,6 +50,64 @@
 void ff_put_h264_qpel8_mc30_neon(uint8_t *dst, const uint8_t *src, ptrdiff_t stride);
 void ff_put_h264_qpel8_mc20_daedalus(uint8_t *dst, const uint8_t *src,
                                      ptrdiff_t stride);
 +void ff_put_h264_qpel8_mc10_daedalus(uint8_t *dst, const uint8_t *src,
 +                                  ptrdiff_t stride);
 +void ff_put_h264_qpel8_mc30_daedalus(uint8_t *dst, const uint8_t *src,
 +                                  ptrdiff_t stride);
 +void ff_put_h264_qpel8_mc01_daedalus(uint8_t *dst, const uint8_t *src,
 +                                  ptrdiff_t stride);
 +void ff_put_h264_qpel8_mc11_daedalus(uint8_t *dst, const uint8_t *src,
 +                                  ptrdiff_t stride);
 +void ff_put_h264_qpel8_mc21_daedalus(uint8_t *dst, const uint8_t *src,
 +                                  ptrdiff_t stride);
 +void ff_put_h264_qpel8_mc31_daedalus(uint8_t *dst, const uint8_t *src,
 +                                  ptrdiff_t stride);
 +void ff_put_h264_qpel8_mc02_daedalus(uint8_t *dst, const uint8_t *src,
 +                                  ptrdiff_t stride);
 +void ff_put_h264_qpel8_mc12_daedalus(uint8_t *dst, const uint8_t *src,
 +                                  ptrdiff_t stride);
 +void ff_put_h264_qpel8_mc22_daedalus(uint8_t *dst, const uint8_t *src,
 +                                  ptrdiff_t stride);
 +void ff_put_h264_qpel8_mc32_daedalus(uint8_t *dst, const uint8_t *src,
 +                                  ptrdiff_t stride);
 +void ff_put_h264_qpel8_mc03_daedalus(uint8_t *dst, const uint8_t *src,
 +                                  ptrdiff_t stride);
 +void ff_put_h264_qpel8_mc13_daedalus(uint8_t *dst, const uint8_t *src,
 +                                  ptrdiff_t stride);
 +void ff_put_h264_qpel8_mc23_daedalus(uint8_t *dst, const uint8_t *src,
 +                                  ptrdiff_t stride);
 +void ff_put_h264_qpel8_mc33_daedalus(uint8_t *dst, const uint8_t *src,
 +                                  ptrdiff_t stride);
 +void ff_avg_h264_qpel8_mc10_daedalus(uint8_t *dst, const uint8_t *src,
 +                                  ptrdiff_t stride);
 +void ff_avg_h264_qpel8_mc20_daedalus(uint8_t *dst, const uint8_t *src,
 +                                  ptrdiff_t stride);
 +void ff_avg_h264_qpel8_mc30_daedalus(uint8_t *dst, const uint8_t *src,
 +                                  ptrdiff_t stride);
 +void ff_avg_h264_qpel8_mc01_daedalus(uint8_t *dst, const uint8_t *src,
 +                                  ptrdiff_t stride);
 +void ff_avg_h264_qpel8_mc11_daedalus(uint8_t *dst, const uint8_t *src,
 +                                  ptrdiff_t stride);
 +void ff_avg_h264_qpel8_mc21_daedalus(uint8_t *dst, const uint8_t *src,
 +                                  ptrdiff_t stride);
 +void ff_avg_h264_qpel8_mc31_daedalus(uint8_t *dst, const uint8_t *src,
 +                                  ptrdiff_t stride);
 +void ff_avg_h264_qpel8_mc02_daedalus(uint8_t *dst, const uint8_t *src,
 +                                  ptrdiff_t stride);
 +void ff_avg_h264_qpel8_mc12_daedalus(uint8_t *dst, const uint8_t *src,
 +                                  ptrdiff_t stride);
 +void ff_avg_h264_qpel8_mc22_daedalus(uint8_t *dst, const uint8_t *src,
 +                                  ptrdiff_t stride);
 +void ff_avg_h264_qpel8_mc32_daedalus(uint8_t *dst, const uint8_t *src,
 +                                  ptrdiff_t stride);
 +void ff_avg_h264_qpel8_mc03_daedalus(uint8_t *dst, const uint8_t *src,
 +                                  ptrdiff_t stride);
 +void ff_avg_h264_qpel8_mc13_daedalus(uint8_t *dst, const uint8_t *src,
 +                                  ptrdiff_t stride);
 +void ff_avg_h264_qpel8_mc23_daedalus(uint8_t *dst, const uint8_t *src,
 +                                  ptrdiff_t stride);
 +void ff_avg_h264_qpel8_mc33_daedalus(uint8_t *dst, const uint8_t *src,
 +                                  ptrdiff_t stride);
 void ff_put_h264_qpel8_mc01_neon(uint8_t *dst, const uint8_t *src, ptrdiff_t stride);
 void ff_put_h264_qpel8_mc11_neon(uint8_t *dst, const uint8_t *src, ptrdiff_t stride);
 void ff_put_h264_qpel8_mc21_neon(uint8_t *dst, const uint8_t *src, ptrdiff_t stride);
@@ -164,21 +222,21 @@
         c->put_h264_qpel_pixels_tab[0][15] = ff_put_h264_qpel16_mc33_neon;
         c->put_h264_qpel_pixels_tab[1][ 0] = ff_put_h264_qpel8_mc00_neon;
 -        c->put_h264_qpel_pixels_tab[1][ 1] = ff_put_h264_qpel8_mc10_neon;
 +        c->put_h264_qpel_pixels_tab[1][ 1] = ff_put_h264_qpel8_mc10_daedalus;
         c->put_h264_qpel_pixels_tab[1][ 2] = ff_put_h264_qpel8_mc20_daedalus;
 -        c->put_h264_qpel_pixels_tab[1][ 3] = ff_put_h264_qpel8_mc30_neon;
 -        c->put_h264_qpel_pixels_tab[1][ 4] = ff_put_h264_qpel8_mc01_neon;
 -        c->put_h264_qpel_pixels_tab[1][ 5] = ff_put_h264_qpel8_mc11_neon;
 -        c->put_h264_qpel_pixels_tab[1][ 6] = ff_put_h264_qpel8_mc21_neon;
 -        c->put_h264_qpel_pixels_tab[1][ 7] = ff_put_h264_qpel8_mc31_neon;
 -        c->put_h264_qpel_pixels_tab[1][ 8] = ff_put_h264_qpel8_mc02_neon;
 -        c->put_h264_qpel_pixels_tab[1][ 9] = ff_put_h264_qpel8_mc12_neon;
 -        c->put_h264_qpel_pixels_tab[1][10] = ff_put_h264_qpel8_mc22_neon;
 -        c->put_h264_qpel_pixels_tab[1][11] = ff_put_h264_qpel8_mc32_neon;
 -        c->put_h264_qpel_pixels_tab[1][12] = ff_put_h264_qpel8_mc03_neon;
 -        c->put_h264_qpel_pixels_tab[1][13] = ff_put_h264_qpel8_mc13_neon;
 -        c->put_h264_qpel_pixels_tab[1][14] = ff_put_h264_qpel8_mc23_neon;
 -        c->put_h264_qpel_pixels_tab[1][15] = ff_put_h264_qpel8_mc33_neon;
 +        c->put_h264_qpel_pixels_tab[1][ 3] = ff_put_h264_qpel8_mc30_daedalus;
 +        c->put_h264_qpel_pixels_tab[1][ 4] = ff_put_h264_qpel8_mc01_daedalus;
 +        c->put_h264_qpel_pixels_tab[1][ 5] = ff_put_h264_qpel8_mc11_daedalus;
 +        c->put_h264_qpel_pixels_tab[1][ 6] = ff_put_h264_qpel8_mc21_daedalus;
 +        c->put_h264_qpel_pixels_tab[1][ 7] = ff_put_h264_qpel8_mc31_daedalus;
 +        c->put_h264_qpel_pixels_tab[1][ 8] = ff_put_h264_qpel8_mc02_daedalus;
 +        c->put_h264_qpel_pixels_tab[1][ 9] = ff_put_h264_qpel8_mc12_daedalus;
 +        c->put_h264_qpel_pixels_tab[1][10] = ff_put_h264_qpel8_mc22_daedalus;
 +        c->put_h264_qpel_pixels_tab[1][11] = ff_put_h264_qpel8_mc32_daedalus;
 +        c->put_h264_qpel_pixels_tab[1][12] = ff_put_h264_qpel8_mc03_daedalus;
 +        c->put_h264_qpel_pixels_tab[1][13] = ff_put_h264_qpel8_mc13_daedalus;
 +        c->put_h264_qpel_pixels_tab[1][14] = ff_put_h264_qpel8_mc23_daedalus;
 +        c->put_h264_qpel_pixels_tab[1][15] = ff_put_h264_qpel8_mc33_daedalus;
         c->avg_h264_qpel_pixels_tab[0][ 0] = ff_avg_h264_qpel16_mc00_neon;
         c->avg_h264_qpel_pixels_tab[0][ 1] = ff_avg_h264_qpel16_mc10_neon;
@@ -198,21 +256,21 @@
         c->avg_h264_qpel_pixels_tab[0][15] = ff_avg_h264_qpel16_mc33_neon;
         c->avg_h264_qpel_pixels_tab[1][ 0] = ff_avg_h264_qpel8_mc00_neon;
 -        c->avg_h264_qpel_pixels_tab[1][ 1] = ff_avg_h264_qpel8_mc10_neon;
 -        c->avg_h264_qpel_pixels_tab[1][ 2] = ff_avg_h264_qpel8_mc20_neon;
 -        c->avg_h264_qpel_pixels_tab[1][ 3] = ff_avg_h264_qpel8_mc30_neon;
 -        c->avg_h264_qpel_pixels_tab[1][ 4] = ff_avg_h264_qpel8_mc01_neon;
 -        c->avg_h264_qpel_pixels_tab[1][ 5] = ff_avg_h264_qpel8_mc11_neon;
 -        c->avg_h264_qpel_pixels_tab[1][ 6] = ff_avg_h264_qpel8_mc21_neon;
 -        c->avg_h264_qpel_pixels_tab[1][ 7] = ff_avg_h264_qpel8_mc31_neon;
 -        c->avg_h264_qpel_pixels_tab[1][ 8] = ff_avg_h264_qpel8_mc02_neon;
 -        c->avg_h264_qpel_pixels_tab[1][ 9] = ff_avg_h264_qpel8_mc12_neon;
 -        c->avg_h264_qpel_pixels_tab[1][10] = ff_avg_h264_qpel8_mc22_neon;
 -        c->avg_h264_qpel_pixels_tab[1][11] = ff_avg_h264_qpel8_mc32_neon;
 -        c->avg_h264_qpel_pixels_tab[1][12] = ff_avg_h264_qpel8_mc03_neon;
 -        c->avg_h264_qpel_pixels_tab[1][13] = ff_avg_h264_qpel8_mc13_neon;
 -        c->avg_h264_qpel_pixels_tab[1][14] = ff_avg_h264_qpel8_mc23_neon;
 -        c->avg_h264_qpel_pixels_tab[1][15] = ff_avg_h264_qpel8_mc33_neon;
 +        c->avg_h264_qpel_pixels_tab[1][ 1] = ff_avg_h264_qpel8_mc10_daedalus;
 +        c->avg_h264_qpel_pixels_tab[1][ 2] = ff_avg_h264_qpel8_mc20_daedalus;
 +        c->avg_h264_qpel_pixels_tab[1][ 3] = ff_avg_h264_qpel8_mc30_daedalus;
 +        c->avg_h264_qpel_pixels_tab[1][ 4] = ff_avg_h264_qpel8_mc01_daedalus;
 +        c->avg_h264_qpel_pixels_tab[1][ 5] = ff_avg_h264_qpel8_mc11_daedalus;
 +        c->avg_h264_qpel_pixels_tab[1][ 6] = ff_avg_h264_qpel8_mc21_daedalus;
 +        c->avg_h264_qpel_pixels_tab[1][ 7] = ff_avg_h264_qpel8_mc31_daedalus;
 +        c->avg_h264_qpel_pixels_tab[1][ 8] = ff_avg_h264_qpel8_mc02_daedalus;
 +        c->avg_h264_qpel_pixels_tab[1][ 9] = ff_avg_h264_qpel8_mc12_daedalus;
 +        c->avg_h264_qpel_pixels_tab[1][10] = ff_avg_h264_qpel8_mc22_daedalus;
 +        c->avg_h264_qpel_pixels_tab[1][11] = ff_avg_h264_qpel8_mc32_daedalus;
 +        c->avg_h264_qpel_pixels_tab[1][12] = ff_avg_h264_qpel8_mc03_daedalus;
 +        c->avg_h264_qpel_pixels_tab[1][13] = ff_avg_h264_qpel8_mc13_daedalus;
 +        c->avg_h264_qpel_pixels_tab[1][14] = ff_avg_h264_qpel8_mc23_daedalus;
 +        c->avg_h264_qpel_pixels_tab[1][15] = ff_avg_h264_qpel8_mc33_daedalus;
     } else if (have_neon(cpu_flags) && bit_depth == 10) {
         c->put_h264_qpel_pixels_tab[0][ 1] = ff_put_h264_qpel16_mc10_neon_10;
         c->put_h264_qpel_pixels_tab[0][ 2] = ff_put_h264_qpel16_mc20_neon_10;
 --
 2.47.3
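 The commit message above suggests that a 16x16 variant could be built
 by dispatching the 8x8 path four times with shifted dst/src offsets.
 The sketch below illustrates that idea with a scalar mc20 (horizontal
 half-sample, 6-tap) filter.  It is an assumption-laden illustration:
 the sketch_* functions are hypothetical, run on the CPU only, and say
 nothing about how daedalus-fourier would expose such a variant.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

static uint8_t sketch_clip_u8(int v)
{
    return v < 0 ? 0 : v > 255 ? 255 : (uint8_t)v;
}

/* Scalar H.264 horizontal half-sample filter on a w x h block.
 * Reads columns -2 .. w+2 of every source row. */
static void sketch_put_mc20(uint8_t *dst, const uint8_t *src,
                            ptrdiff_t stride, int w, int h)
{
    for (int y = 0; y < h; y++) {
        for (int x = 0; x < w; x++) {
            const uint8_t *s = src + y * stride + x;
            int v = s[-2] - 5 * s[-1] + 20 * s[0] + 20 * s[1]
                  - 5 * s[2] + s[3] + 16;
            /* Negative sums clip to 0 without shifting a negative int. */
            dst[y * stride + x] = v < 0 ? 0 : sketch_clip_u8(v >> 5);
        }
    }
}

/* 16x16 built from four 8x8 calls at offsets (0,0), (8,0), (0,8), (8,8). */
static void sketch_put_mc20_16_from_8(uint8_t *dst, const uint8_t *src,
                                      ptrdiff_t stride)
{
    for (int by = 0; by < 2; by++) {
        for (int bx = 0; bx < 2; bx++) {
            ptrdiff_t off = by * 8 * stride + bx * 8;
            sketch_put_mc20(dst + off, src + off, stride, 8, 8);
        }
    }
}

/* Returns 1 when the composed result equals a direct 16x16 pass. */
static int sketch_16_matches_direct(void)
{
    enum { STRIDE = 32, ROWS = 16 };
    uint8_t frame[STRIDE * ROWS], direct[STRIDE * ROWS], tiled[STRIDE * ROWS];

    for (int i = 0; i < STRIDE * ROWS; i++)
        frame[i] = (uint8_t)((i * 37 + (i >> 3) * 11) & 0xff);
    memset(direct, 0, sizeof(direct));
    memset(tiled, 0, sizeof(tiled));

    /* Source block starts at column 8 so the filter margins stay in-frame. */
    sketch_put_mc20(direct, frame + 8, STRIDE, 16, 16);
    sketch_put_mc20_16_from_8(tiled, frame + 8, STRIDE);
    return memcmp(direct, tiled, sizeof(direct)) == 0;
}
```

 The tiling works because each 8x8 call reads its filter margin from the
 surrounding frame rather than from a private copy of the block.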
@@ -0,0 +1,85 @@
 From 0000000000000000000000000000000000000000 Mon Sep 17 00:00:00 2001
 From: Markus Fritsche <mfritsche@reauktion.de>
 Date: Mon, 25 May 2026 21:00:00 +0200
 Subject: [PATCH] avcodec/aarch64/h264: use QPU-capable daedalus ctx (bench
 shows 4.30x faster on Pi 5)
 MIME-Version: 1.0
 Content-Type: text/plain; charset=UTF-8
 Content-Transfer-Encoding: 8bit
 Patches 0003 (IDCT 4x4) and 0007 (qpel mc20) created the libavcodec.so
 process-global daedalus_ctx via daedalus_ctx_create_no_qpu().  Rationale
 at the time: cycle 6/9 had only CPU NEON paths, so a QPU-capable ctx
 would have meant pointless Vulkan init in every host process (firefox-
 fourier, mpv-fourier, daedalus_v4l2_daemon, ...).
 Two things changed since:
  1. Every H.264 hot-path primitive now has a V3D7 compute shader.
     IDCT 4x4/8x8 (cycles 6, 7), 8 deblock variants (luma+chroma x V+H
     x inter+intra), 30 qpel positions (15 put_ + 15 avg_).  See
     daedalus-fourier PRs #28-#35.
  2. Dispatch overhead has been hammered down — buffer pool in
     v3d_runner (daedalus-fourier task #160) plus persistent command
     buffer (task #161).  daedalus-fourier PR #36 bench measures the
     1080p worst-case sum on hertz (Pi 5 V3D 7.1, 30 iters x 5 warmup):
       kernel             CPU ns/op  QPU ns/op  winner
       IDCT 4x4 luma          10.79       2.47  QPU 4.36x
       IDCT 8x8 luma          29.69       9.23  QPU 3.22x
       Deblock luma_v         17.58      10.21  QPU 1.72x
       Deblock luma_h         38.41       9.98  QPU 3.85x
       qpel mc20 (8x8)        28.24       9.66  QPU 2.92x
       qpel mc02 (8x8)        16.96      20.54  CPU 1.21x
       qpel mc22 (8x8)        71.58       9.64  QPU 7.43x
       1080p worst-case sum (IDCT4 + deblock luma + qpel mc22):
         CPU NEON only:  5.57 ms
         QPU only:       1.30 ms   (CPU/QPU sum ratio = 4.30x)
 PR #10's verdict (CPU 4x faster than QPU at IDCT) is reversed.  Switch
 the substitution context to daedalus_ctx_create() in both H.264 TUs
 (h264_idct_daedalus.c, h264_qpel_daedalus.c) so the recipe layer can
 actually route through the now-faster QPU path.
 daedalus_ctx_create() probes for a usable Vulkan device and falls back
 to no_qpu mode if unavailable, so this is safe on hosts without V3D
 (x86 reauktion build runners, debian-aarch64 builders without renderD,
 etc.).  Hosts WITH V3D (Pi 5 deployment targets) get the speedup.
 The remaining qpel mc02 anomaly (single-axis vertical filter, CPU
 1.21x faster) is bench-flagged for a v2 shader follow-up; the recipe
 entry stays QPU because the 2026-05-23 substrate decree still holds
 and the gap is marginal.
 Refs reauktion/daedalus-fourier!36.
 ---
 libavcodec/aarch64/h264_idct_daedalus.c | 2 +-
 libavcodec/aarch64/h264_qpel_daedalus.c | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)
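 The probe-and-fall-back behaviour described in the message above can be
 sketched as follows.  This is a minimal illustration, not the
 daedalus-fourier implementation: the sketch_* names are hypothetical,
 and the probe is injected as a function pointer so the example stays
 self-contained, whereas the real daedalus_ctx_create() takes no
 arguments and probes Vulkan internally.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

typedef enum { SKETCH_CPU_NEON, SKETCH_QPU } sketch_substrate;
typedef struct { bool has_qpu; } sketch_ctx;
typedef bool (*sketch_probe_fn)(void);

/* CPU-only context: never touches the GPU stack. */
static sketch_ctx *sketch_ctx_create_no_qpu(void)
{
    return calloc(1, sizeof(sketch_ctx));   /* has_qpu starts false */
}

/* QPU-capable context: same object, upgraded only if the probe succeeds. */
static sketch_ctx *sketch_ctx_create(sketch_probe_fn probe)
{
    sketch_ctx *ctx = sketch_ctx_create_no_qpu();
    if (ctx && probe && probe())
        ctx->has_qpu = true;
    return ctx;
}

/* AUTO routing: honour the recipe's preference when the ctx can,
 * otherwise fall back to the CPU path. */
static sketch_substrate sketch_route_auto(const sketch_ctx *ctx,
                                          sketch_substrate preferred)
{
    assert(ctx != NULL);
    if (preferred == SKETCH_QPU && !ctx->has_qpu)
        return SKETCH_CPU_NEON;
    return preferred;
}

/* Two fixed probes standing in for "no usable device" and "device found". */
static bool sketch_probe_absent(void)  { return false; }
static bool sketch_probe_present(void) { return true; }
```

 Under this shape the two-line change in the diff below is safe on hosts
 without a usable device: the recipe preference is unchanged, and only
 the capability flag decides which substrate runs.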
 diff --git a/libavcodec/aarch64/h264_idct_daedalus.c b/libavcodec/aarch64/h264_idct_daedalus.c
 --- a/libavcodec/aarch64/h264_idct_daedalus.c
 +++ b/libavcodec/aarch64/h264_idct_daedalus.c
@@ -32,7 +32,7 @@ static pthread_once_t    g_dctx_once = PTHREAD_ONCE_INIT;
 static void daedalus_ctx_init_once(void)
 {
 -    g_dctx = daedalus_ctx_create_no_qpu();
 +    g_dctx = daedalus_ctx_create();
 }
 void ff_h264_idct_add_daedalus(uint8_t *dst, int16_t *block, int stride);
 diff --git a/libavcodec/aarch64/h264_qpel_daedalus.c b/libavcodec/aarch64/h264_qpel_daedalus.c
 --- a/libavcodec/aarch64/h264_qpel_daedalus.c
 +++ b/libavcodec/aarch64/h264_qpel_daedalus.c
@@ -38,7 +38,7 @@ static pthread_once_t    g_dctx_once = PTHREAD_ONCE_INIT;
 static void daedalus_ctx_init_once(void)
 {
 -    g_dctx = daedalus_ctx_create_no_qpu();
 +    g_dctx = daedalus_ctx_create();
 }
 void ff_put_h264_qpel8_mc20_daedalus(uint8_t *dst, const uint8_t *src, ptrdiff_t stride);
@@ -24,13 +24,13 @@ _srcname=FFmpeg
 _version='8.1'
 _commit='b57fbbe50c9b2656fad86a1a7eeabfd2b2a50935'  # v4l2-request-n8.1 tip 2026-04-24
 pkgver=8.1.r123329.b57fbbe
-pkgrel=9   # pkgrel=9 — restore AV_CODEC_FLAG_LOW_DELAY for H.264 (2026-05-22)
+pkgrel=11  # pkgrel=11 — libavcodec.so daedalus ctx flipped no_qpu → qpu-capable (PR #36 bench: QPU 4.30x, 2026-05-25)
 epoch=2
-# daedalus-fourier pin — first kernel substitution in libavcodec
+# daedalus-fourier pin.  209a421 = PR #2 merge (Phase 8c — public API
-# (cycle 6 H.264 IDCT 4x4).  Same SHA as the daedalus-v4l2 daemon's
+# gains daedalus_recipe_dispatch_h264_qpel_mc20 + DAEDALUS_KERNEL_H264_QPEL_MC20).
-# inline build; lockstep with that until the public API rolls.
+# Cycle 9 closes the libavcodec.so substitution arc started at cycle 6.
-_daedalus_fourier_commit='d87239d8172307d9a1b93c95cbed116d175b85cc'
+_daedalus_fourier_commit='b9f9ff2a89c068aea54dcb52b543afddad28311e'  # PR #25 — public chroma DC Hadamard symbol
 pkgdesc='FFmpeg with V4L2 Request API hwaccel (Rockchip / Allwinner stateless decode)'
 arch=('aarch64')
 url='https://github.com/Kwiboo/FFmpeg'
@@ -93,8 +93,15 @@ source=("git+https://github.com/Kwiboo/FFmpeg.git#commit=${_commit}"
        '0003-h264-idct4-daedalus-fourier.patch'
        '0004-h264-idct8-daedalus-fourier.patch'
        '0005-h264-deblock-luma-v-daedalus-fourier.patch'
-        '0006-h264-restore-low-delay.patch')
+        '0006-h264-restore-low-delay.patch'
-sha256sums=('SKIP' 'SKIP' 'SKIP' 'SKIP' 'SKIP' 'SKIP' 'SKIP' 'SKIP')
+        '0007-h264-qpel-mc20-daedalus-fourier.patch'
        '0008-h264-deblock-luma-h-daedalus-fourier.patch'
        '0009-h264-deblock-chroma-daedalus-fourier.patch'
        '0010-h264-deblock-luma-intra-daedalus-fourier.patch'
        '0011-h264-chroma-dc-hadamard-daedalus-fourier.patch'
        '0012-h264-qpel-rest-daedalus-fourier.patch'
        '0013-h264-ctx-qpu-capable.patch')
 sha256sums=('SKIP' 'SKIP' 'SKIP' 'SKIP' 'SKIP' 'SKIP' 'SKIP' 'SKIP' 'SKIP' 'SKIP' 'SKIP' 'SKIP' 'SKIP' 'SKIP' 'SKIP')
 pkgver() {
  cd "${_srcname}"
@@ -111,6 +118,13 @@ prepare() {
  patch -Np1 -i "${srcdir}/0004-h264-idct8-daedalus-fourier.patch"
  patch -Np1 -i "${srcdir}/0005-h264-deblock-luma-v-daedalus-fourier.patch"
  patch -Np1 -i "${srcdir}/0006-h264-restore-low-delay.patch"
  patch -Np1 -i "${srcdir}/0007-h264-qpel-mc20-daedalus-fourier.patch"
  patch -Np1 -i "${srcdir}/0008-h264-deblock-luma-h-daedalus-fourier.patch"
  patch -Np1 -i "${srcdir}/0009-h264-deblock-chroma-daedalus-fourier.patch"
  patch -Np1 -i "${srcdir}/0010-h264-deblock-luma-intra-daedalus-fourier.patch"
  patch -Np1 -i "${srcdir}/0011-h264-chroma-dc-hadamard-daedalus-fourier.patch"
  patch -Np1 -i "${srcdir}/0012-h264-qpel-rest-daedalus-fourier.patch"
  patch -Np1 -i "${srcdir}/0013-h264-ctx-qpu-capable.patch"
 }
 build() {
@@ -45,7 +45,7 @@ pkgver=26.0.6.r5.video1
 pkgrel=1
 pkgdesc="Patched Mesa libvulkan_panfrost.so adding VK_KHR_video_decode_h264 on Bifrost SBCs (sibling of mesa-panvk-bifrost-r4)"
 arch=('aarch64')
-url="https://github.com/marfrit/panvk-bifrost"
+url="https://git.reauktion.de/marfrit/panvk-bifrost"
 license=('MIT')
 depends=(
@@ -0,0 +1,50 @@
 From: marfrit-packages noether <claude-noether@reauktion.de>
 Subject: [PATCH] panvk: report fragmentStoresAndAtomics = true on Bifrost
 Backports Mesa main's unconditional advertisement of
 fragmentStoresAndAtomics for panvk (snapshot ref: src/panfrost/vulkan/
 panvk_vX_physical_device.c at commit-time 2026-05-06; the line reads
 `.fragmentStoresAndAtomics = true,` on main with no PAN_ARCH gate).
 Motivation: Chromium Dawn's WebGPU initializer in
 third_party/dawn/src/dawn/native/vulkan/PhysicalDeviceVk.cpp:250
 unconditionally rejects any Vulkan adapter that doesn't advertise this
 feature, causing Dawn to fall back to the SwiftShader CPU adapter
 on PineTab2 / RK3566 / Mali-G52 r1 MC1 (PAN_ARCH 7). With this patch the
 device advertises true, satisfying Dawn's gate. Tracked at
 https://git.reauktion.de/marfrit/panvk-bifrost/issues/2.
 The disjunction with `instance->force_enable_shader_atomics` is
 preserved as a kill-switch: to the compiler it is dead code
 (`true || X` always evaluates to `true`), but it leaves the DRI option
 `pan_force_enable_shader_atomics` semantically wired so future
 rebases or downstream debugging can see the link to the runtime knob.
 Caveat: the existing DRI option's description in src/util/driconf.h
 still labels this as "may not work reliably and is for debug purposes
 only". Mesa main's choice to ship it as default-on for all panvk
 architectures (including Bifrost, which is non-conformant per the
 PAN_I_WANT_A_BROKEN_VULKAN_DRIVER gate) reflects an upstream judgment
 that the practical risk is acceptable. Verify-before-ship for this
 package: dEQP-VK.glsl.atomic_operations.* + dEQP-VK.image.store.*
 deltas vs the r4 baseline must show no new fails. Pass counts may rise
 (tests that previously NotSupported now run); the load-bearing line is
 the Failed column staying at zero.
 ---
 src/panfrost/vulkan/panvk_vX_physical_device.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)
 diff --git a/src/panfrost/vulkan/panvk_vX_physical_device.c b/src/panfrost/vulkan/panvk_vX_physical_device.c
 --- a/src/panfrost/vulkan/panvk_vX_physical_device.c
 +++ b/src/panfrost/vulkan/panvk_vX_physical_device.c
@@ -280,8 +280,7 @@
       .vertexPipelineStoresAndAtomics =
          (PAN_ARCH >= 13 && instance->enable_vertex_pipeline_stores_atomics) ||
          instance->force_enable_shader_atomics,
 -      .fragmentStoresAndAtomics =
 -         (PAN_ARCH >= 10) || instance->force_enable_shader_atomics,
 +      .fragmentStoresAndAtomics = true || instance->force_enable_shader_atomics,
       .shaderTessellationAndGeometryPointSize = false,
       .shaderImageGatherExtended = true,
       .shaderStorageImageExtendedFormats = true,
@@ -0,0 +1,51 @@
 From: marfrit-packages noether <claude-noether@reauktion.de>
 Subject: [PATCH] panvk: advertise VK_EXT_legacy_dithering on Bifrost
 Backports Mesa main's flip — vanilla 26.0.6 doesn't have the extension
 in the panvk advertisement list; main does (line 172 / 647 on snapshot
 617da94, 2026-05-06).
 VK_EXT_legacy_dithering exposes the classic OpenGL-style dithering
 behavior to Vulkan apps. Pure-software composition; no new HW path.
 ARM's own libmali driver release r51p0 (BXODROIDN2PL, Aug 2024) lists
 this extension in its Vulkan implementation for ODROID-N2 boards
 using the same Mali-G52 architecture family — confirms ARM ships it
 for Mali-G52-class hardware.
 Consumer benefit: dithering matters for low-bit-depth framebuffers
 (RGB565 / RGB5A1 — common on portable / battery-saving renders)
 where banding is visible. DXVK / vkd3d-proton both opt in when
 available.
 Verify-before-ship: vulkaninfo lists the extension and
 VkPhysicalDeviceLegacyDitheringFeaturesEXT.legacyDithering == true.
 Cross-refs:
  - marfrit/panvk-bifrost research/r6_r7_mali_g52_feature_audit_2026-05-24.md
  - ARM blob r51p0 strings dump (in-blob extension confirmed)
 ---
 src/panfrost/vulkan/panvk_vX_physical_device.c | 5 +++++
 1 file changed, 5 insertions(+)
 diff --git a/src/panfrost/vulkan/panvk_vX_physical_device.c b/src/panfrost/vulkan/panvk_vX_physical_device.c
 --- a/src/panfrost/vulkan/panvk_vX_physical_device.c
 +++ b/src/panfrost/vulkan/panvk_vX_physical_device.c
@@ -156,6 +156,7 @@
       .EXT_image_drm_format_modifier = true,
       .EXT_image_robustness = true,
       .EXT_index_type_uint8 = true,
 +      .EXT_legacy_dithering = true,
       .EXT_line_rasterization = true,
       .EXT_load_store_op_none = true,
       .EXT_non_seamless_cube_map = true,
@@ -552,6 +553,9 @@
       /* VK_EXT_multisampled_render_to_single_sampled */
       .multisampledRenderToSingleSampled = true,
 +
 +      /* VK_EXT_legacy_dithering */
 +      .legacyDithering = true,
    };
 }
@@ -0,0 +1,103 @@
 From: marfrit-packages noether <claude-noether@reauktion.de>
 Subject: [PATCH] panvk-bifrost: fix XFB store channel-extract for packed varyings
 iter19: fixes a reliable SIGSEGV during vkCreateGraphicsPipelines on any
 shader that uses XFB-bound varyings declared with non-zero `layout
 (component=N)` qualifiers. Surfaced by
 dEQP-VK.transform_feedback.simple.holes_vert; backtrace lands 11 frames
 into libvulkan_panfrost.so called from `vkt::TransformFeedback::
 TransformFeedbackHolesInstance::iterate`.
 Root cause: `lower_xfb_output_iter17` (and upstream `lower_xfb_output`,
 which carries a `// TODO` on the same assertion) computes the source-
 channel mask as `mask << channel_idx`, where `channel_idx` is the
 varying-location component (0..3) but `src` only contains channels for
 the source-side range starting at `nir_intrinsic_component(intr)`. For
 `flat out float vegeta` declared with `component=2`, NIR emits
 `store_output src=<vec1>, component=2`, and the lowering computes
 `mask << 2` against a single-component src — out-of-range; the
 resulting malformed nir_def then segfaults inside downstream NIR
 constant-folding (`nir_constant_expressions.c::evaluate_*`).
 The assertion `assert(nir_intrinsic_component(intr) == 0)` was inherited
 from upstream `pan_nir_lower_xfb.c` as a documented `// TODO`; release
 builds (-DNDEBUG) elide it. The fix translates `channel_idx` to the
 source-channel space by subtracting `nir_intrinsic_component(intr)`
 before shifting the mask, and replaces the elided asserts with explicit
 release-mode guards (the patch closes the same release-mode-elision
 class as the original bug).
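The index translation described above can be sketched outside of NIR. This is a minimal illustration, not the Mesa code: plain integer bitmasks stand in for `nir_component_mask_t` and `nir_channels()`, and the helper names `component_mask`, `xfb_src_channel_mask` and `xfb_src_channel_mask_old` are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Mask with the low `n` bits set. Stands in for what the patch gets from
 * nir_component_mask(num_components). Hypothetical helper. */
static uint32_t component_mask(unsigned n)
{
   return (n >= 32) ? UINT32_MAX : ((1u << n) - 1u);
}

/* Pre-fix computation: shifts by the varying-location component, which is
 * only valid when the store's component base is 0. */
static uint32_t xfb_src_channel_mask_old(unsigned channel_idx,
                                         unsigned num_components)
{
   return component_mask(num_components) << channel_idx;
}

/* Post-fix computation (hypothetical helper, same logic as the hunk):
 * translate channel_idx into source-channel space by subtracting the
 * component base, and refuse out-of-range requests instead of relying on
 * asserts that release builds compile out. Returns false when the request
 * does not fit the source vector. */
static bool xfb_src_channel_mask(unsigned channel_idx, unsigned base_comp,
                                 unsigned num_components,
                                 unsigned src_num_components,
                                 uint32_t *out_mask)
{
   if (channel_idx < base_comp)
      return false;

   const unsigned src_channel = channel_idx - base_comp;
   if (src_channel + num_components > src_num_components)
      return false;

   *out_mask = component_mask(num_components) << src_channel;
   return true;
}
```

For the `component=2` case from the message (a one-component source, `channel_idx` 2, base 2), the old computation selects bit 2 of a source that only has bit 0, while the translated one selects bit 0.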
 Verified on PineTab2 (Mali-G52 r1 MC1, PAN_ARCH 7) against vulkan-cts
 1.3.10.0:
  - holes_vert / holes_extra_draw_vert no longer SIGSEGV (now Fail on
    color-check; that is a separate iter20 finding — the rasterized
    varying gets removed alongside the XFB-bound one).
  - basic_*: 36/36 Pass. depth_clip_*: 1 Pass + 4 NotSupported.
    lines_or_triangles*: 16 NotSupported. 0 Fail across the full set.
  - holes_geom / holes_extra_draw_geom remain NotSupported
    (geometryShader not on G52) — unchanged.
 Caveat: max_output_components_64/_128/_256 were never reached on the
 r5 sweep (watchdog killed transform_feedback after the holes_vert
 crash). With this fix in place, those tests now run and surface
 *their own pre-existing* coredumps — confirmed on shipped r6 baseline
 too. They are NOT regressions from this patch; they are latent crashes
 unmasked by it. iter20+ territory.
 Phase 5 (2nd-model) review: APPROVE WITH CHANGES (non-blocking).
 Changes applied: release-mode defensive guards on both preconditions
 plus a dispatcher-side comment clarifying the i*2+j semantics.
 Cross-refs:
  - iter19/phase{0,1,2,3}_holes_vert*.md in panvk-bifrost repo
 ---
 src/panfrost/vulkan/panvk_vX_xfb_lower.c | 24 +++++++++++++++++++++---
 1 file changed, 21 insertions(+), 3 deletions(-)
 diff --git a/src/panfrost/vulkan/panvk_vX_xfb_lower.c b/src/panfrost/vulkan/panvk_vX_xfb_lower.c
@@ -339,7 +339,20 @@
                         unsigned buffer, unsigned offset_words)
 {
    assert(buffer < MAX_XFB_BUFFERS);
 -   assert(nir_intrinsic_component(intr) == 0);
 +
 +   /* iter19: nir_intrinsic_component(intr) is the source-channel base —
 +    * for a packed varying like `layout (location=0, component=2) flat out
 +    * float vegeta`, NIR emits store_output with component=2 and a single-
 +    * component src. The XFB iteration index `channel_idx` (0..3) is the
 +    * varying-location component, not the source channel. Translate by
 +    * subtracting the base before shifting the mask. Fixes the long-
 +    * standing `assert(nir_intrinsic_component(intr) == 0) // TODO` in
 +    * upstream pan_nir_lower_xfb that surfaces on holes_vert. */
 +   const unsigned base_comp = nir_intrinsic_component(intr);
 +   /* Defensive against release-build elision: this is precisely the
 +    * bug class the patch is fixing, so don't re-introduce it. */
 +   if (channel_idx < base_comp)
 +      return;
    uint16_t stride = b->shader->info.xfb_stride[buffer] * 4;
    assert(stride != 0);
@@ -357,7 +370,11 @@
    nir_def *src = intr->src[0].ssa;
    nir_component_mask_t mask = nir_component_mask(num_components);
 -   nir_def *value = nir_channels(b, src, mask << channel_idx);
 +   const unsigned src_channel = channel_idx - base_comp;
 +   /* Same defensive class as the channel_idx >= base_comp guard above. */
 +   if (src_channel + num_components > src->num_components)
 +      return;
 +   nir_def *value = nir_channels(b, src, mask << src_channel);
    /* Topology dispatch ladder. LIST first (fast path). */
    nir_push_if(b, nir_ieq_imm(b, topology, PANVK_XFB_TOPO_LIST));
@@ -465,6 +482,9 @@
       for (unsigned j = 0; j < 2; ++j) {
          if (!xfb.out[j].num_components)
             continue;
 +         /* `i*2+j` is the varying-location component (0..3) — io_xfb covers
 +          * slots 0..1, io_xfb2 covers 2..3. The leaf translates this into
 +          * a source-channel index by subtracting nir_intrinsic_component(intr). */
          lower_xfb_output_iter17(b, intr, i * 2 + j, xfb.out[j].num_components,
                                  xfb.out[j].buffer, xfb.out[j].offset);
          progress = true;
@@ -0,0 +1,118 @@
 From: marfrit-packages noether <claude-noether@reauktion.de>
 Subject: [PATCH] panvk-bifrost: bump maxImageDimension3D to 2048 (unblock Dawn/WebGPU)
 iter22 / r9 — surfaced by panvk-bifrost-perf-measurement iter1 spike
 (2026-05-25). Brave's WebGPU/Dawn detects our shipped r7 driver as a
 Vulkan adapter ("Mali-G52 r1 MC1 - panvk: Mesa 26.0.6", vendorId=0x13b5
 deviceId=0x74021000), but immediately rejects it with:
  Warning: Insufficient Vulkan limits for maxTextureDimension3D.
  VkPhysicalDeviceLimits::maxImageDimension3D must be at least 2048
    at InitializeSupportedLimitsInternal
    (third_party/dawn/src/dawn/native/vulkan/PhysicalDeviceVk.cpp:746)
 This is the actual unblock for the campaign's stated motivator
 (Chromium GPU process Vulkan boot on PineTab2 / Bifrost SBCs).
 ## Hunk 1 — bump the advertised basic limit
 Was: `.maxImageDimension3D = PAN_ARCH <= 10 ? (1 << 9) : (1 << 14),`
     (PAN_ARCH 7 advertised 512 — below WebGPU's 2048 minimum.)
 Now: bumped to (1 << 11) = 2048 on PAN_ARCH 7..10.
 Per Vulkan 1.3 spec §43.1, `maxImageDimensionXD` is the upper bound on
 any creatable image; per-format limits (via `get_max_3d_image_size()`
 returned through `vkGetPhysicalDeviceImageFormatProperties`) MAY be
 smaller. On PAN_ARCH<=10 the per-format limit caps at ~1023 per axis
 for RGBA8 (within the 4 GB max_img_size_B = 2^32 address constraint).
 Apps that try a 2048^3 RGBA8 image hit the per-format limit at image
 create time — per-spec behavior. Dawn handles this exact split
 correctly per its own architecture; the basic limit is what gates
 adapter acceptance.
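The numbers in this section can be reproduced with a small sketch. It is not Mesa's `get_max_3d_image_size()`; it assumes the only constraint is that a cubic image's byte size must fit in a 32-bit count (at most 2^32 - 1 bytes), and the helper names `advertised_max_dim_3d` and `max_cube_edge` are hypothetical.

```c
#include <stdint.h>

/* Advertised basic limit after this patch, written as a function of the
 * architecture number instead of the compile-time PAN_ARCH macro.
 * Hypothetical helper mirroring the ternary in the hunk. */
static uint32_t advertised_max_dim_3d(int pan_arch)
{
   return pan_arch < 7 ? (1u << 9) : pan_arch <= 10 ? (1u << 11) : (1u << 14);
}

/* Largest edge n such that an n*n*n image with `bytes_per_texel` bytes per
 * texel fits in `budget_bytes`. A linear scan is enough at these sizes.
 * Assumes (n + 1)^3 * bytes_per_texel does not overflow 64 bits, which
 * holds for 32-bit budgets. */
static uint32_t max_cube_edge(uint64_t budget_bytes, uint32_t bytes_per_texel)
{
   uint32_t n = 0;
   while ((uint64_t)(n + 1) * (n + 1) * (n + 1) * bytes_per_texel
          <= budget_bytes)
      n++;
   return n;
}
```

Under that assumption, `max_cube_edge(UINT32_MAX, 4)` gives 1023 for RGBA8, which sits between the old advertised value of 512 and the new `advertised_max_dim_3d(7)` of 2048.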
 ## Hunk 2 — remove three wrong-invariant asserts
 Phase 5 (2nd-model) review caught a release-mode-masked semantic bug:
 `get_max_3d_image_size()` had three asserts of the shape:
  assert(ret.width >= phys_dev->vk.properties.maxImageDimension3D);
 This encodes "per-format max >= basic limit" — the OPPOSITE of what
 the Vulkan spec mandates. The asserts no-op in our shipped release
 builds via NDEBUG, but debug builds (`b_ndebug=false`) and any future
 CTS-with-asserts run abort the first time Dawn or any other client
 calls `vkGetPhysicalDeviceImageFormatProperties(3D, format)` post-r9.
 Removing the asserts fixes the latent semantic violation. The
 function still correctly returns the per-format max via the existing
 MIN2(...) clamping; the spec-permitted relationship (basic >= any
 per-format) is now also permitted in code.
 ## Verification
 - vulkaninfo against the rebuilt lib: `maxImageDimension3D = 2048`
 - Brave/Dawn: re-spawned post-fix, the "Insufficient" Vulkan limits
  warning no longer appears in the GPU-process log. Adapter is
  accepted for WebGPU.
 - CTS regression: `dEQP-VK.api.copy_and_blit.core.image_to_image.3d_images.*`
  6/6 Pass (unchanged from baseline).
 ## Phase 5 review
 APPROVE WITH CHANGES (non-blocking for release ship; blocking for the
 downstream tree because the asserts are exposed in debug builds). Both
 change classes are addressed in this patch. The review also flagged a
 math nit: the RGBA8 per-format limit is ~1023, not ~1009. The comment in
 panvk_physical_device.c was patched to say ~1023; the comment in
 panvk_vX_physical_device.c still says ~1009 to match the close doc,
 which is cosmetic.
 Cross-refs:
  - ~/src/panvk-bifrost/iter22/phase0to2_max3d_close.md (Phase 0-2 close)
 ---
 src/panfrost/vulkan/panvk_physical_device.c   | 13 +++++++++----
 src/panfrost/vulkan/panvk_vX_physical_device.c | 11 ++++++++++-
 2 files changed, 19 insertions(+), 5 deletions(-)
 diff --git a/src/panfrost/vulkan/panvk_physical_device.c b/src/panfrost/vulkan/panvk_physical_device.c
 --- a/src/panfrost/vulkan/panvk_physical_device.c
 +++ b/src/panfrost/vulkan/panvk_physical_device.c
@@ -1013,9 +1013,15 @@
                     MAX_IMAGE_SIZE_PX),
    };
 -   assert(ret.width >= phys_dev->vk.properties.maxImageDimension3D);
 -   assert(ret.height >= phys_dev->vk.properties.maxImageDimension3D);
 -   assert(ret.depth >= phys_dev->vk.properties.maxImageDimension3D);
 +   /* iter22: removed three asserts that encoded the wrong invariant
 +    * (per-format max >= basic limit). Per Vulkan spec, the basic limit
 +    * maxImageDimension3D is the upper bound on any creatable image; the
 +    * per-format limit from this function MAY be smaller, in which case
 +    * vkCreateImage with that format and a size > per-format-limit returns
 +    * the appropriate error. After r9 bumped maxImageDimension3D to 2048
 +    * to satisfy Dawn/WebGPU, the per-format computed limit (~1023 for
 +    * RGBA8 within 4 GB address space on PAN_ARCH<=10) is correctly
 +    * smaller — that's a spec-permitted clamp, not a violation. */
    return ret;
 }
 diff --git a/src/panfrost/vulkan/panvk_vX_physical_device.c b/src/panfrost/vulkan/panvk_vX_physical_device.c
 --- a/src/panfrost/vulkan/panvk_vX_physical_device.c
 +++ b/src/panfrost/vulkan/panvk_vX_physical_device.c
@@ -648,7 +648,15 @@
        */
       .maxImageDimension1D = (1 << 16),
       .maxImageDimension2D = PAN_ARCH <= 10 ? (1 << 14) - 1 : (1 << 16),
 -      .maxImageDimension3D = PAN_ARCH <= 10 ? (1 << 9) : (1 << 14),
 +      /* iter22: bump from (1 << 9) = 512 to (1 << 11) = 2048 on PAN_ARCH 7..10.
 +       * Was below WebGPU/Dawn's required minimum (PhysicalDeviceVk.cpp:746).
 +       * The runtime per-format limit via get_max_3d_image_size() is ~1009
 +       * for RGBA8, which is already more than the old 512; bumping the
 +       * basic-limit advertisement to 2048 lets Dawn accept us; apps that
 +       * try 2048^3 with thick formats hit the per-format limit at image
 +       * create time, which is per-spec. */
 +      .maxImageDimension3D = PAN_ARCH < 7 ? (1 << 9) :
 +                             PAN_ARCH <= 10 ? (1 << 11) : (1 << 14),
       .maxImageDimensionCube = PAN_ARCH <= 10 ? (1 << 14) - 1 : (1 << 16),
       .maxImageArrayLayers = (1 << 16),
       /* Pre-v11 is limited to 2^27 elements of 16 byte formats due to
@@ -30,11 +30,11 @@
 pkgname=mesa-panvk-bifrost
 _mesaver=26.0.6
-pkgver=26.0.6.r4
+pkgver=26.0.6.r9
 pkgrel=1
 pkgdesc="Patched Mesa libvulkan_panfrost.so exposing Bifrost-gen Mali to Vulkan apps (panvk-bifrost campaign)"
 arch=('aarch64')
-url="https://github.com/marfrit/panvk-bifrost"
+url="https://git.reauktion.de/marfrit/panvk-bifrost"
 license=('MIT')
 # We co-install at /usr/lib/panvk-bifrost/ so no conflicts with stock mesa.
@@ -81,6 +81,10 @@ source=(
    "0002-panvk-expose-vulkan-1.1-1.2-on-bifrost.patch"
    "0003-panvk-bifrost-vk-ext-transform-feedback.patch"
    "0004-panvk-bifrost-xfb-primitive-decomposition.patch"
    "0005-panvk-bifrost-fragment-stores-atomics.patch"
    "0006-panvk-bifrost-legacy-dithering.patch"
    "0007-panvk-bifrost-xfb-component-base-fix.patch"
    "0008-panvk-bifrost-bump-max-image-dim-3d-for-dawn.patch"
    "brave-vulkan"
    "icd.json"
 )
@@ -92,6 +96,10 @@ sha256sums=(
    'SKIP'
    'SKIP'
    'SKIP'
    'SKIP'
    'SKIP'
    'SKIP'
    'SKIP'
 )
 prepare() {
@@ -127,6 +135,45 @@ prepare() {
    # Phase-doc context: ~/src/panvk-bifrost/iter17/phase{0,1,2,4,5,6,8}_*.md.
    patch -p1 < "${srcdir}/0004-panvk-bifrost-xfb-primitive-decomposition.patch"
    # r5 (2026-05-23): advertise .fragmentStoresAndAtomics = true on Bifrost
    # to satisfy Chromium Dawn's WebGPU init gate
    # (third_party/dawn/src/dawn/native/vulkan/PhysicalDeviceVk.cpp:250).
    # Backports Mesa main's unconditional flip (same line as on main as of
    # 2026-05-06). Disjunction with instance->force_enable_shader_atomics
    # is preserved as a documented kill-switch even though the compiler
    # folds it away. Closes marfrit/panvk-bifrost#2.
    # Verify-before-ship: dEQP-VK.glsl.atomic_operations.* and
    # dEQP-VK.image.store.* show no new Failed vs r4 baseline.
    patch -p1 < "${srcdir}/0005-panvk-bifrost-fragment-stores-atomics.patch"
    # r6 (2026-05-25): advertise VK_EXT_legacy_dithering. Backports Mesa
    # main's unconditional flip. Pure-software composition; vk_render_pass
    # already gates on enabled_features.legacyDithering and panvk_vX_blend
    # + pan_format already plumb the dithered BLEND descriptor (BFMT2 table
    # has MALI_BLEND_AU encodings for RGB565/RGB5A1/RGBA4/RGB10A2 on
    # PAN_ARCH 7). Closes the EXT_legacy_dithering gap surfaced by
    # marfrit/panvk-bifrost research/r6_r7_*. ARM blob r51p0 confirms the
    # extension as Mali-G52-architecture supported.
    patch -p1 < "${srcdir}/0006-panvk-bifrost-legacy-dithering.patch"
    # r7 (2026-05-25): XFB store channel-extract fix for packed varyings.
    # Eliminates a reliable SIGSEGV in vkCreateGraphicsPipelines whenever
    # an XFB-bound vertex output is declared with non-zero
    # `layout (component=N)`. Surfaced by dEQP-VK.transform_feedback.
    # simple.holes_vert (now Fails on color-check rather than crashing;
    # the color-check residual is a separate iter20 finding).
    # Phase-doc context: ~/src/panvk-bifrost/iter19/phase{0,1,2,3}_*.md.
    # Phase 5 reviewed; release-mode-elision defensive guards applied.
    patch -p1 < "${srcdir}/0007-panvk-bifrost-xfb-component-base-fix.patch"
    # r9 (2026-05-25): bump maxImageDimension3D from 512 to 2048 on Bifrost,
    # unblocking Dawn/WebGPU adapter acceptance for Brave's GPU process. Was
    # under WebGPU's 2048 minimum (dawn PhysicalDeviceVk.cpp:746). Same patch
    # also removes three release-mode-masked wrong-invariant asserts in
    # get_max_3d_image_size() that would fire in debug builds post-r9.
    # Phase-doc context: ~/src/panvk-bifrost/iter22/phase0to2_max3d_close.md.
    patch -p1 < "${srcdir}/0008-panvk-bifrost-bump-max-image-dim-3d-for-dawn.patch"
    # Sanity-check the patches landed.
    grep -q "KHR_robustness2 = true," src/panfrost/vulkan/panvk_vX_physical_device.c
    grep -q "EXT_robustness2 = true," src/panfrost/vulkan/panvk_vX_physical_device.c
@@ -138,9 +185,20 @@ prepare() {
    test -f src/panfrost/vulkan/jm/panvk_vX_cmd_xfb.c
    # iter17 sanity: pan_nir_lower_xfb call site has been replaced; new file present.
    grep -q "panvk_per_arch(nir_lower_xfb)" src/panfrost/vulkan/panvk_vX_shader.c
    # r5 sanity: fragmentStoresAndAtomics = true patch landed
    grep -q "fragmentStoresAndAtomics = true ||" src/panfrost/vulkan/panvk_vX_physical_device.c
    # r6 sanity: VK_EXT_legacy_dithering advertised
    grep -q '\.EXT_legacy_dithering = true,' src/panfrost/vulkan/panvk_vX_physical_device.c
    grep -q '\.legacyDithering = true,' src/panfrost/vulkan/panvk_vX_physical_device.c
    grep -q "xfb_topology" src/panfrost/vulkan/panvk_shader.h
    grep -q "panvk_xfb_topology" src/panfrost/vulkan/panvk_shader.h
    test -f src/panfrost/vulkan/panvk_vX_xfb_lower.c
    # r7 sanity: XFB channel-base correction landed
    grep -q "iter19: nir_intrinsic_component(intr) is the source-channel base" src/panfrost/vulkan/panvk_vX_xfb_lower.c
    grep -q "mask << src_channel" src/panfrost/vulkan/panvk_vX_xfb_lower.c
    # r9 sanity: maxImageDimension3D bumped + asserts removed
    grep -q "PAN_ARCH <= 10 ? (1 << 11) : (1 << 14)" src/panfrost/vulkan/panvk_vX_physical_device.c
    ! grep -q "assert(ret\.width >= phys_dev->vk\.properties\.maxImageDimension3D)" src/panfrost/vulkan/panvk_physical_device.c
 }
 build() {
@@ -0,0 +1,85 @@
 #!/bin/bash
 # Build aish_<ver>_all.deb from this directory using dpkg-deb directly.
 # Run from inside the runner container, which has dpkg installed.
 #
 # Matches the lmcp build-deb.sh pattern: no dh/debhelper, no Build-Depends
 # beyond `dpkg`, structurally a normal apt package (Architecture: all).
 set -euo pipefail
 PKGVER=0.1.0
 UPSTREAM_TAG=v0.1.0
 PKGREL=1
 AISH_TARBALL_SHA256=9ebc3939e028832e39391ae33efacb5ec9bcd99d123cbc8ca1cd6ca9a640b5b5
 HERE=$(dirname "$(readlink -f "$0")")
 # Reproducible build: pin all file mtimes + ar member timestamps to a fixed
 # epoch tied to this packaging release (aish v0.1.0 — 2026-05-25 00:00 UTC).
 # Without this, repeat builds produce different byte streams and reprepro
 # refuses re-includes with "size expected: X, got: Y".
 export SOURCE_DATE_EPOCH=1779667200
 work=$(mktemp -d)
 trap 'rm -rf "$work"' EXIT
 cd "$work"
 curl --connect-timeout 10 --max-time 600 --retry 3 --retry-delay 5 -sSLfo aish.tar.gz \
    "https://git.reauktion.de/marfrit/aish/archive/${UPSTREAM_TAG}.tar.gz"
 echo "$AISH_TARBALL_SHA256  aish.tar.gz" | sha256sum -c
 tar xzf aish.tar.gz
 ROOT="$work/pkgroot"
 LIBDIR="$ROOT/usr/share/lua/5.1/aish"
 mkdir -p "$ROOT/DEBIAN" \
         "$LIBDIR/ffi" \
         "$LIBDIR/vendor" \
         "$ROOT/usr/bin" \
         "$ROOT/usr/share/doc/aish/examples"
 # Top-level modules
 for m in main broker context executor history mcp renderer repl router safety secrets; do
    cp "aish/${m}.lua" "$LIBDIR/${m}.lua"
 done
 # FFI bindings
 for m in curl libc pty readline; do
    cp "aish/ffi/${m}.lua" "$LIBDIR/ffi/${m}.lua"
 done
 # Vendored dependencies
 cp aish/vendor/dkjson.lua "$LIBDIR/vendor/dkjson.lua"
 # Launch wrapper
 install -m 755 aish/bin/aish "$ROOT/usr/bin/aish"
 # Documentation + example config
 cp aish/README.md          "$ROOT/usr/share/doc/aish/"
 cp aish/LICENSE            "$ROOT/usr/share/doc/aish/"
 cp aish/examples/config.lua "$ROOT/usr/share/doc/aish/examples/"
 cp "$HERE/debian/copyright" "$ROOT/usr/share/doc/aish/copyright"
 cp "$HERE/debian/changelog" "$ROOT/usr/share/doc/aish/changelog.Debian"
 gzip -9 -n "$ROOT/usr/share/doc/aish/changelog.Debian"
 cat > "$ROOT/DEBIAN/control" <<EOF
 Package: aish
 Version: ${PKGVER}-${PKGREL}
 Section: shells
 Priority: optional
 Architecture: all
 Depends: luajit, libreadline8t64 | libreadline8, libcurl4t64 | libcurl4
 Maintainer: Markus Fritsche <mfritsche@reauktion.de>
 Homepage: https://git.reauktion.de/marfrit/aish
 Description: AI-augmented conversational shell (LuaJIT, FFI-only)
 aish is an interactive REPL that interleaves shell execution and
 language-model conversation against llama.cpp HTTP brokers. Pure
 LuaJIT 2.x with FFI bindings to libcurl, GNU readline, and libc.
 .
 Modules install under /usr/share/lua/5.1/aish/. The launcher is
 /usr/bin/aish. Example configuration is at
 /usr/share/doc/aish/examples/config.lua (copy to
 ~/.config/aish/config.lua and adapt).
 EOF
 # Build the .deb. Output to current dir of the caller.
 DEB_OUT=aish_${PKGVER}-${PKGREL}_all.deb
 dpkg-deb --root-owner-group --build "$ROOT" "$HERE/$DEB_OUT"
 echo "built: $HERE/$DEB_OUT"
@@ -0,0 +1,14 @@
 aish (0.1.0-1) bookworm trixie; urgency=medium
  * Initial release packaged for marfrit overlay repo. Phases 0-10
    complete (102 closed issues): local llama.cpp + cloud broker
    routing via hossenfelder, MCP tool calls with confirm-gate and
    per-tool auto_approve, Chuck Norris autonomous mode with
    destructive-op heuristic, cross-session memory.jsonl, multi-model
    routing + GBNF grammar passthrough, project file-tree context,
    cost/usage observability, /tokenize endpoint integration, project
    overlay (.aish.lua + sha256-pinned trust ledger), cloud preplanner
    → local executor split.
  * Source-of-truth: git.reauktion.de/marfrit/aish, tagged v0.1.0.
 -- Markus Fritsche <mfritsche@reauktion.de>  Mon, 25 May 2026 00:00:00 +0000
@@ -0,0 +1,20 @@
 Source: aish
 Section: shells
 Priority: optional
 Maintainer: Markus Fritsche <mfritsche@reauktion.de>
 Standards-Version: 4.6.2
 Homepage: https://git.reauktion.de/marfrit/aish
 Package: aish
 Architecture: all
 Depends: ${misc:Depends}, luajit, libreadline8t64 | libreadline8, libcurl4t64 | libcurl4
 Description: AI-augmented conversational shell (LuaJIT, FFI-only)
 aish is an interactive REPL that interleaves shell execution and language-
 model conversation against llama.cpp HTTP brokers. Implementation is pure
 LuaJIT 2.x with FFI bindings to libcurl, GNU readline, and libc — no C
 extensions, no build step.
 .
 Modules install under /usr/share/lua/5.1/aish/. The launcher is
 /usr/bin/aish. Example configuration is at
 /usr/share/doc/aish/examples/config.lua (copy to ~/.config/aish/config.lua
 and adapt).
@@ -0,0 +1,30 @@
 Format: https://www.debian.org/doc/packaging-manuals/copyright-format/1.0/
 Upstream-Name: aish
 Source: https://git.reauktion.de/marfrit/aish
 Files: *
 Copyright: 2026 Markus Fritsche <mfritsche@reauktion.de>
 License: MIT
 Permission is hereby granted, free of charge, to any person obtaining a copy
 of this software and associated documentation files (the "Software"), to deal
 in the Software without restriction, including without limitation the rights
 to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
 copies of the Software, and to permit persons to whom the Software is
 furnished to do so, subject to the following conditions:
 .
 The above copyright notice and this permission notice shall be included in
 all copies or substantial portions of the Software.
 .
 THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND.
 Files: vendor/dkjson.lua
 Copyright: 2010-2014 David Heiko Kolf
 License: MIT
 Permission is hereby granted, free of charge, to any person obtaining a copy
 of this software and associated documentation files (the "Software"), to deal
 in the Software without restriction, including the rights to use, copy,
 modify, merge, publish, distribute, sublicense, and/or sell copies of the
 Software, and to permit persons to whom the Software is furnished to do so,
 subject to the following conditions: the above copyright notice and this
 permission notice shall be included in all copies or substantial portions of
 the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND.
@@ -0,0 +1,150 @@
 #!/bin/bash
 # Package pre-built chromium-fourier artifacts into a .deb.
 #
 # Chromium can't be compiled natively on any available aarch64 runner
 # (clang version wall — chromium requires its internal clang fork).
 # The build is cross-compiled on CT 220 (data, x86_64 Ryzen 7).
 # This script expects the build artifacts to exist at BUILD_DIR
 # (default: fetched from CT 220 via SSH).
 #
 # Sibling Arch package: ../../arch/chromium-fourier/PKGBUILD
 set -euo pipefail
 PKGVER=148.0.7778.178
 EPOCH=1
 PKGREL=1
 ARCH=arm64
 HERE=$(dirname "$(readlink -f "$0")")
 export SOURCE_DATE_EPOCH=1779854400  # 2026-05-27 04:00 UTC
 BUILD_DIR="${BUILD_DIR:-}"
 work=$(mktemp -d)
 trap 'rm -rf "$work"' EXIT
 if [ -z "$BUILD_DIR" ]; then
    echo "BUILD_DIR not set — fetching artifacts from CT 220 on data..."
    BUILD_DIR="$work/artifacts"
    mkdir -p "$BUILD_DIR"
    ssh root@data "pct exec 220 -- tar -cf - -C /build/chromium/src/out/Default \
        chrome chrome_crashpad_handler \
        libEGL.so libGLESv2.so libvk_swiftshader.so libvulkan.so.1 \
        vk_swiftshader_icd.json \
        chrome_100_percent.pak chrome_200_percent.pak resources.pak \
        v8_context_snapshot.bin snapshot_blob.bin icudtl.dat \
        locales/" | tar -xf - -C "$BUILD_DIR"
 fi
 ROOT="$work/pkgroot"
 install -Dm755 "$BUILD_DIR/chrome"                   "$ROOT/usr/lib/chromium/chromium"
 install -Dm755 "$BUILD_DIR/chrome_crashpad_handler"  "$ROOT/usr/lib/chromium/chrome_crashpad_handler"
 for so in libEGL.so libGLESv2.so libvk_swiftshader.so libvulkan.so.1; do
    [ -f "$BUILD_DIR/$so" ] && install -Dm755 "$BUILD_DIR/$so" "$ROOT/usr/lib/chromium/$so"
 done
 for icd in "$BUILD_DIR"/*_icd.json; do
    [ -f "$icd" ] && install -Dm644 "$icd" "$ROOT/usr/lib/chromium/$(basename "$icd")"
 done
 for f in chrome_100_percent.pak chrome_200_percent.pak resources.pak \
         v8_context_snapshot.bin snapshot_blob.bin icudtl.dat; do
    [ -f "$BUILD_DIR/$f" ] && install -Dm644 "$BUILD_DIR/$f" "$ROOT/usr/lib/chromium/$f"
 done
 if [ -d "$BUILD_DIR/locales" ]; then
    install -dm755 "$ROOT/usr/lib/chromium/locales"
    cp -r "$BUILD_DIR/locales/"* "$ROOT/usr/lib/chromium/locales/"
 fi
 install -dm755 "$ROOT/usr/bin"
 cat > "$ROOT/usr/bin/chromium-fourier" <<'LAUNCHER'
 #!/bin/bash
 USER_HANDLES_VULKAN=0
 for arg in "$@"; do
  case "$arg" in
    --use-vulkan*|--enable-features=*Vulkan*|--disable-features=*Vulkan*|--use-angle=vulkan*)
      USER_HANDLES_VULKAN=1
      break
      ;;
  esac
 done
 vulkan_default=()
 if [ "$USER_HANDLES_VULKAN" = 0 ]; then
  vulkan_default=(--disable-features=Vulkan)
 fi
 exec /usr/lib/chromium/chromium \
  --ozone-platform=wayland \
  --use-gl=angle --use-angle=gles \
  --enable-features=AcceleratedVideoDecoder \
  "${vulkan_default[@]}" \
  "$@"
 LAUNCHER
 chmod 0755 "$ROOT/usr/bin/chromium-fourier"
 mkdir -p "$ROOT/usr/share/doc/chromium-fourier" "$ROOT/DEBIAN"
 install -Dm644 "$HERE/debian/copyright" \
    "$ROOT/usr/share/doc/chromium-fourier/copyright"
 install -Dm644 "$HERE/debian/changelog" \
    "$ROOT/usr/share/doc/chromium-fourier/changelog.Debian"
 gzip -9 -n "$ROOT/usr/share/doc/chromium-fourier/changelog.Debian"
 ISIZE=$(du -sk "$ROOT" | awk '{print $1}')
 cat > "$ROOT/DEBIAN/control" <<EOF
 Package: chromium-fourier
 Version: ${EPOCH}:${PKGVER}-${PKGREL}
 Section: web
 Priority: optional
 Architecture: ${ARCH}
 Installed-Size: ${ISIZE}
 Depends: libasound2,
         libatk-bridge2.0-0,
         libatk1.0-0,
         libcairo2,
         libcups2,
         libdbus-1-3,
         libdrm2,
         libexpat1,
         libfontconfig1,
         libfreetype6,
         libgbm1,
         libglib2.0-0,
         libgtk-3-0,
         libnspr4,
         libnss3,
         libpango-1.0-0,
         libpulse0,
         libva2,
         libwayland-client0,
         libx11-6,
         libxcb1,
         libxkbcommon0,
         libpipewire-0.3-0,
         fonts-liberation,
         v4l-utils
 Provides: www-browser
 Conflicts: chromium
 Maintainer: Markus Fritsche <mfritsche@reauktion.de>
 Homepage: https://www.chromium.org/
 Description: Chromium with V4L2 HW video decode for Rockchip (Wayland + mainline)
 Chromium ${PKGVER} with three patches enabling V4L2 hardware video
 decoding on mainline Linux / Wayland for Rockchip SoCs (RK3566 hantro,
 RK3588 VDPU381).
 .
 Cross-compiled from x86_64 using chromium's bundled clang (upstream
 LLVM cannot compile chromium). Runtime target is aarch64.
 .
 Patches: enable-v4l2-decoder-default, wayland-allow-direct-egl-gles2,
 nv12-external-oes-on-modifier-external-only.
 .
 Launcher at /usr/bin/chromium-fourier defaults to Wayland + ANGLE/GLES
 with Vulkan disabled (panvk on RK3566 breaks V4L2 dispatch).
 EOF
 DEB_OUT="chromium-fourier_${EPOCH}%3a${PKGVER}-${PKGREL}_${ARCH}.deb"
 dpkg-deb --root-owner-group --build "$ROOT" "$HERE/$DEB_OUT"
 echo "built: $HERE/$DEB_OUT"
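Review note: the launcher's Vulkan opt-out decision can be exercised on its own. The following is a minimal sketch, assuming a hypothetical helper name `user_handles_vulkan` that is not part of the package; it reuses the same `case` patterns as the launcher heredoc and is written in POSIX sh so it does not depend on bash arrays.

```shell
# Hypothetical helper (not shipped in the package): prints 1 if any argument
# already decides the Vulkan question, using the same glob patterns as the
# chromium-fourier launcher, and prints 0 otherwise.
user_handles_vulkan() {
    for arg in "$@"; do
        case "$arg" in
            --use-vulkan*|--enable-features=*Vulkan*|--disable-features=*Vulkan*|--use-angle=vulkan*)
                echo 1
                return 0
                ;;
        esac
    done
    echo 0
}

# Decide whether the launcher would append its default opt-out flag.
if [ "$(user_handles_vulkan "$@")" = 0 ]; then
    vulkan_default="--disable-features=Vulkan"
else
    vulkan_default=""
fi
echo "launcher default: ${vulkan_default:-<none>}"
```

The patterns match on substrings, so a user-supplied `--enable-features=Foo,Vulkan` also suppresses the default. Flags that never mention Vulkan leave `--disable-features=Vulkan` in place.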
@@ -0,0 +1,8 @@
 chromium-fourier (1:148.0.7778.178-1) trixie; urgency=medium
  * Chromium 148.0.7778.178 with V4L2 HW decode patches for Rockchip.
  * Cross-compiled from x86_64 using chromium's bundled clang.
  * Three fourier patches: enable-v4l2-decoder-default,
    wayland-allow-direct-egl-gles2, nv12-external-oes-on-modifier-external-only.
 -- Markus Fritsche <mfritsche@reauktion.de>  Sun, 24 May 2026 09:00:00 +0200
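Review note: hand-written dates in changelog trailers, patch `Date:` headers and `SOURCE_DATE_EPOCH` comments are easy to get wrong. The following is a minimal sketch of a pre-commit check, assuming GNU `date` (for `-d`) and hypothetical helper names that do not exist in this repository.

```shell
# Hypothetical helpers (not part of the package); GNU date's -d is assumed.

# Print the abbreviated English weekday for an ISO date such as 2026-05-24.
weekday_of() {
    LC_ALL=C date -u -d "$1" +%a
}

# Print the Unix timestamp for a UTC date-time string.
epoch_of() {
    date -u -d "$1" +%s
}

# Compare a claimed weekday against the calendar.
# $1 = claimed weekday (e.g. Sun), $2 = ISO date (e.g. 2026-05-24)
check_weekday() {
    actual=$(weekday_of "$2")
    if [ "$1" = "$actual" ]; then
        echo "ok: $2 is a $actual"
    else
        echo "mismatch: $2 is a $actual, not a $1"
        return 1
    fi
}

# Example: timestamp for the date named in a SOURCE_DATE_EPOCH comment.
epoch_of "2026-05-24 09:00:00 UTC"
```

Running `check_weekday` on each trailer before building catches the weekday/date mismatches that lintian would otherwise report on the finished package.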
@@ -0,0 +1,32 @@
 Format: https://www.debian.org/doc/packaging-manuals/copyright-format/1.0/
 Upstream-Name: Chromium
 Upstream-Contact: chromium-dev@chromium.org
 Source: https://www.chromium.org/
 Files: *
 Copyright: The Chromium Authors
 License: BSD-3-Clause
 Files: debian/*
 Copyright: 2026 Markus Fritsche <mfritsche@reauktion.de>
 License: BSD-3-Clause
 License: BSD-3-Clause
 Redistribution and use in source and binary forms, with or without
 modification, are permitted provided that the following conditions are met:
 .
 1. Redistributions of source code must retain the above copyright notice,
    this list of conditions and the following disclaimer.
 .
 2. Redistributions in binary form must reproduce the above copyright
    notice, this list of conditions and the following disclaimer in the
    documentation and/or other materials provided with the distribution.
 .
 3. Neither the name of the copyright holder nor the names of its
    contributors may be used to endorse or promote products derived from
    this software without specific prior written permission.
 .
 THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
 AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
 IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
 ARE DISCLAIMED.
@@ -0,0 +1,139 @@
 From 0000000000000000000000000000000000000000 Mon Sep 17 00:00:00 2001
 From: Markus Fritsche <mfritsche@reauktion.de>
 Date: Sat, 23 May 2026 12:00:00 +0200
 Subject: [PATCH] avcodec/aarch64/h264qpel: route 8x8 mc20 through
 daedalus-fourier
 MIME-Version: 1.0
 Content-Type: text/plain; charset=UTF-8
 Content-Transfer-Encoding: 8bit
 H264QpelContext.put_h264_qpel_pixels_tab[1][2] (8x8 luma horizontal
 half-pel, 6-tap "put" variant — the canonical representative of the
 H.264 luma motion-compensation family) now dispatches through
 daedalus_recipe_dispatch_h264_qpel_mc20 instead of
 ff_put_h264_qpel8_mc20_neon.
 Cycle 9 of the daedalus-v4l2#11 step 2 substitution arc; closes the
 4-cycle libavcodec.so substitution sequence (6 IDCT 4x4 / 7 IDCT 8x8 /
 8 luma-v deblock / 9 qpel mc20).
 The recipe layer picks the substrate. Per docs/k9_h264qpel_mc20.md
 the verdict is CPU NEON: per-block 7.6 ns at 131 Mblock/s gives 135x
 margin over 30 fps 1080p, and the QPU dispatch floor (~250 ns)
 makes any V3D shader strictly worse. Substitution is plumbing-only,
 NEON-by-recipe — same daedalus_ctx_create_no_qpu pthread_once
 context shape the cycles 6/7/8 shims already own (kept SEPARATE
 from the H264DSP shim's ctx because H264QPEL is its own libavcodec
 Makefile module and link order does not guarantee a single .o
 owns the ctx symbol; one extra ~µs init per process, paid lazily).
 Other H.264 luma MC variants (mc02, mc11, mc22 etc.) and the 16x16
 size tier stay on the in-tree NEON .S code. Per the cycle-9 phase-1
 rationale, mc20 8x8 is representative of the whole family's per-block
 cost — extending the substitution to other variants would multiply
 recipe-lookup overhead without changing the substrate verdict.
 Bit-exact against ff_put_h264_qpel8_mc20_neon (daedalus-fourier
 cycle 9 green; M1 = 100% bit-exact across 10000 random blocks).
 No SONAME change, no Depends change.
 Refs reauktion/daedalus-v4l2#11 — substitution arc step 2 cycle 9.
 ---
 libavcodec/aarch64/Makefile                |  3 +-
 libavcodec/aarch64/h264_qpel_daedalus.c    | 50 ++++++++++++++++++++++
 libavcodec/aarch64/h264qpel_init_aarch64.c |  4 +-
 3 files changed, 55 insertions(+), 2 deletions(-)
 create mode 100644 libavcodec/aarch64/h264_qpel_daedalus.c
 diff --git a/libavcodec/aarch64/Makefile b/libavcodec/aarch64/Makefile
 --- a/libavcodec/aarch64/Makefile
 +++ b/libavcodec/aarch64/Makefile
@@ -7,7 +7,8 @@ OBJS-$(CONFIG_H264DSP)                  += aarch64/h264dsp_init_aarch64.o \
                                            aarch64/h264_idct_daedalus.o
 OBJS-$(CONFIG_HUFFYUVDSP)               += aarch64/huffyuvdsp_init_aarch64.o
 OBJS-$(CONFIG_H264PRED)                 += aarch64/h264pred_init.o
 -OBJS-$(CONFIG_H264QPEL)                 += aarch64/h264qpel_init_aarch64.o
 +OBJS-$(CONFIG_H264QPEL)                 += aarch64/h264qpel_init_aarch64.o \
 +                                           aarch64/h264_qpel_daedalus.o
 OBJS-$(CONFIG_HPELDSP)                  += aarch64/hpeldsp_init_aarch64.o
 OBJS-$(CONFIG_IDCTDSP)                  += aarch64/idctdsp_init_aarch64.o
 OBJS-$(CONFIG_ME_CMP)                   += aarch64/me_cmp_init_aarch64.o
 diff --git a/libavcodec/aarch64/h264_qpel_daedalus.c b/libavcodec/aarch64/h264_qpel_daedalus.c
 new file mode 100644
 --- /dev/null
 +++ b/libavcodec/aarch64/h264_qpel_daedalus.c
@@ -0,0 +1,50 @@
 +/*
 + * H.264 luma qpel mc20 (8x8, horizontal half-pel, 6-tap "put")
 + * — daedalus-fourier substitution shim.
 + *
 + * Routes H264QpelContext.put_h264_qpel_pixels_tab[1][2] through
 + * daedalus_recipe_dispatch_h264_qpel_mc20 instead of
 + * ff_put_h264_qpel8_mc20_neon.  The recipe layer picks the substrate
 + * (CPU NEON for cycle 9; QPU not viable — per-block 7.6 ns vs
 + * ~250 ns QPU dispatch floor, see docs/k9_h264qpel_mc20.md).
 + *
 + * Sibling to libavcodec/aarch64/h264_idct_daedalus.c.  We keep a
 + * SEPARATE process-global pthread_once context here instead of
 + * sharing the H264DSP one because H264QPEL is its own libavcodec
 + * Makefile module and link order does not guarantee a single .o
 + * owns the ctx symbol.  The cost is one extra
 + * daedalus_ctx_create_no_qpu (~µs) per process; daemon and host
 + * processes pay this lazily on first MC call.
 + *
 + * FFmpeg H264QpelContext convention: both dst and src use a SINGLE
 + * stride and `src` already points at the leftmost OUTPUT column
 + * (col 0); the 6-tap filter reads cols -2..+3.  This matches
 + * daedalus_recipe_dispatch_h264_qpel_mc20's documented contract
 + * directly, so dst_off = src_off = 0.
 + */
 +
 +#include <pthread.h>
 +#include <stddef.h>
 +#include <stdint.h>
 +
 +#include <daedalus.h>
 +
 +#include "libavutil/attributes.h"
 +
 +static daedalus_ctx     *g_dctx;
 +static pthread_once_t    g_dctx_once = PTHREAD_ONCE_INIT;
 +
 +static void daedalus_ctx_init_once(void)
 +{
 +    g_dctx = daedalus_ctx_create_no_qpu();
 +}
 +
 +void ff_put_h264_qpel8_mc20_daedalus(uint8_t *dst, const uint8_t *src, ptrdiff_t stride);
 +
 +void ff_put_h264_qpel8_mc20_daedalus(uint8_t *dst, const uint8_t *src, ptrdiff_t stride)
 +{
 +    static const daedalus_h264_qpel_meta meta = { .dst_off = 0, .src_off = 0 };
 +    pthread_once(&g_dctx_once, daedalus_ctx_init_once);
 +    daedalus_recipe_dispatch_h264_qpel_mc20(g_dctx, dst, src, (size_t)stride,
 +                                            1, &meta);
 +}
 diff --git a/libavcodec/aarch64/h264qpel_init_aarch64.c b/libavcodec/aarch64/h264qpel_init_aarch64.c
 --- a/libavcodec/aarch64/h264qpel_init_aarch64.c
 +++ b/libavcodec/aarch64/h264qpel_init_aarch64.c
@@ -47,6 +47,8 @@ void ff_put_h264_qpel8_mc00_neon(uint8_t *dst, const uint8_t *src, ptrdiff_t str
 void ff_put_h264_qpel8_mc10_neon(uint8_t *dst, const uint8_t *src, ptrdiff_t stride);
 void ff_put_h264_qpel8_mc20_neon(uint8_t *dst, const uint8_t *src, ptrdiff_t stride);
 void ff_put_h264_qpel8_mc30_neon(uint8_t *dst, const uint8_t *src, ptrdiff_t stride);
 +void ff_put_h264_qpel8_mc20_daedalus(uint8_t *dst, const uint8_t *src,
 +                                     ptrdiff_t stride);
 void ff_put_h264_qpel8_mc01_neon(uint8_t *dst, const uint8_t *src, ptrdiff_t stride);
 void ff_put_h264_qpel8_mc11_neon(uint8_t *dst, const uint8_t *src, ptrdiff_t stride);
 void ff_put_h264_qpel8_mc21_neon(uint8_t *dst, const uint8_t *src, ptrdiff_t stride);
@@ -184,7 +186,7 @@ av_cold void ff_h264qpel_init_aarch64(H264QpelContext *c, int bit_depth)
         c->put_h264_qpel_pixels_tab[1][ 0] = ff_put_h264_qpel8_mc00_neon;
         c->put_h264_qpel_pixels_tab[1][ 1] = ff_put_h264_qpel8_mc10_neon;
 -        c->put_h264_qpel_pixels_tab[1][ 2] = ff_put_h264_qpel8_mc20_neon;
 +        c->put_h264_qpel_pixels_tab[1][ 2] = ff_put_h264_qpel8_mc20_daedalus;
         c->put_h264_qpel_pixels_tab[1][ 3] = ff_put_h264_qpel8_mc30_neon;
         c->put_h264_qpel_pixels_tab[1][ 4] = ff_put_h264_qpel8_mc01_neon;
         c->put_h264_qpel_pixels_tab[1][ 5] = ff_put_h264_qpel8_mc11_neon;
 --
 2.47.3
@@ -0,0 +1,92 @@
 From 0000000000000000000000000000000000000000 Mon Sep 17 00:00:00 2001
 From: claude-noether <claude-noether@noreply.localhost>
 Date: Mon, 25 May 2026 12:00:00 +0200
 Subject: [PATCH] avcodec/aarch64/h264dsp: route H.264 luma-h deblock through daedalus-fourier
 Sibling of 0005 (which substituted v_loop_filter_luma).  Same
 NEON-to-NEON substitution: H264DSPContext.h_loop_filter_luma →
 daedalus_recipe_dispatch_h264_deblock_luma_h.  The H kernel landed
 in daedalus-fourier PR #9 (CPU NEON only — no QPU shader yet).
 libavcodec.so ctx is no-QPU per the existing 0003-0005 / 0007
 pattern; we cannot assume Vulkan in arbitrary host processes
 (firefox-fourier RDD, mpv-fourier, etc.).
 Intra (bS=4) h_loop_filter_luma_intra stays on the in-tree NEON .S
 code; daedalus_h264_deblock_meta only covers the non-intra path.
 An intra-h substitution can land once daedalus-fourier exposes a
 dispatch helper (the kernel already exists internally per PR #11).
 Refs reauktion/daedalus-v4l2#11 — substitution arc step 2 cycle 8 H.
 ---
 diff --git a/libavcodec/aarch64/h264_idct_daedalus.c b/libavcodec/aarch64/h264_idct_daedalus.c
 --- a/libavcodec/aarch64/h264_idct_daedalus.c	2026-05-25 13:09:33.694760715 +0200
 +++ libavcodec/aarch64/h264_idct_daedalus.c	2026-05-25 13:09:33.715603719 +0200
@@ -1,9 +1,10 @@
 /*
 - * H.264 4x4 / 8x8 IDCT + luma-v deblock — daedalus-fourier substitution shims.
 + * H.264 4x4 / 8x8 IDCT + luma v/h deblock — daedalus-fourier substitution shims.
  *
  * Routes H264DSPContext.idct_add           → daedalus_recipe_dispatch_h264_idct4
  *        H264DSPContext.idct8_add          → daedalus_recipe_dispatch_h264_idct8
  *        H264DSPContext.v_loop_filter_luma → daedalus_recipe_dispatch_h264_deblock_luma_v
 + *        H264DSPContext.h_loop_filter_luma → daedalus_recipe_dispatch_h264_deblock_luma_h
  * instead of the in-tree ff_h264_*_neon assembly.  The recipe layer
  * picks the substrate (CPU NEON for cycles 6 + 7 by default; cycle 8
  * is CPU primary with QPU opportunistic — the ctx below is no-QPU,
@@ -45,6 +46,8 @@
 void ff_h264_idct8_add_daedalus(uint8_t *dst, int16_t *block, int stride);
 void ff_h264_v_loop_filter_luma_daedalus(uint8_t *pix, ptrdiff_t stride,
                                          int alpha, int beta, int8_t *tc0);
 +void ff_h264_h_loop_filter_luma_daedalus(uint8_t *pix, ptrdiff_t stride,
 +                                         int alpha, int beta, int8_t *tc0);
 void ff_h264_idct_add_daedalus(uint8_t *dst, int16_t *block, int stride)
 {
@@ -84,3 +87,22 @@
     daedalus_recipe_dispatch_h264_deblock_luma_v(g_dctx, pix, (size_t)stride,
                                                  1, &meta);
 }
 +
 +void ff_h264_h_loop_filter_luma_daedalus(uint8_t *pix, ptrdiff_t stride,
 +                                         int alpha, int beta, int8_t *tc0)
 +{
 +    daedalus_h264_deblock_meta meta = {
 +        .dst_off = 0,
 +        .alpha   = alpha,
 +        .beta    = beta,
 +    };
 +    meta.tc0[0] = tc0[0];
 +    meta.tc0[1] = tc0[1];
 +    meta.tc0[2] = tc0[2];
 +    meta.tc0[3] = tc0[3];
 +
 +    pthread_once(&g_dctx_once, daedalus_ctx_init_once);
 +
 +    daedalus_recipe_dispatch_h264_deblock_luma_h(g_dctx, pix, (size_t)stride,
 +                                                 1, &meta);
 +}
 diff --git a/libavcodec/aarch64/h264dsp_init_aarch64.c b/libavcodec/aarch64/h264dsp_init_aarch64.c
 --- a/libavcodec/aarch64/h264dsp_init_aarch64.c	2026-05-25 13:09:33.695937103 +0200
 +++ libavcodec/aarch64/h264dsp_init_aarch64.c	2026-05-25 13:09:33.715541700 +0200
@@ -31,6 +31,8 @@
                                          int alpha, int beta, int8_t *tc0);
 void ff_h264_h_loop_filter_luma_neon(uint8_t *pix, ptrdiff_t stride, int alpha,
                                      int beta, int8_t *tc0);
 +void ff_h264_h_loop_filter_luma_daedalus(uint8_t *pix, ptrdiff_t stride,
 +                                         int alpha, int beta, int8_t *tc0);
 void ff_h264_v_loop_filter_luma_intra_neon(uint8_t *pix, ptrdiff_t stride, int alpha,
                                            int beta);
 void ff_h264_h_loop_filter_luma_intra_neon(uint8_t *pix, ptrdiff_t stride, int alpha,
@@ -117,7 +119,7 @@
     if (have_neon(cpu_flags) && bit_depth == 8) {
         c->v_loop_filter_luma   = ff_h264_v_loop_filter_luma_daedalus;
 -        c->h_loop_filter_luma   = ff_h264_h_loop_filter_luma_neon;
 +        c->h_loop_filter_luma   = ff_h264_h_loop_filter_luma_daedalus;
         c->v_loop_filter_luma_intra= ff_h264_v_loop_filter_luma_intra_neon;
         c->h_loop_filter_luma_intra= ff_h264_h_loop_filter_luma_intra_neon;
 --
 2.47.3
@@ -0,0 +1,127 @@
 From 0000000000000000000000000000000000000000 Mon Sep 17 00:00:00 2001
 From: claude-noether <claude-noether@noreply.localhost>
 Date: Mon, 25 May 2026 12:00:00 +0200
 Subject: [PATCH] avcodec/aarch64/h264dsp: route H.264 chroma v/h deblock through daedalus-fourier
 Chroma siblings of 0005 (luma_v) and 0008 (luma_h).  Same
 NEON-to-NEON pattern via the daedalus recipe layer:
  H264DSPContext.v_loop_filter_chroma →
    daedalus_recipe_dispatch_h264_deblock_chroma_v
  H264DSPContext.h_loop_filter_chroma →
    daedalus_recipe_dispatch_h264_deblock_chroma_h
 Both kernels landed in daedalus-fourier PR #10.  Recipe table
 routes AUTO to CPU NEON (no chroma QPU shaders yet), so this
 is plumbing-only and stays bit-exact against the in-tree NEON.
 Intra chroma (bS=4) loop filters remain on in-tree NEON;
 daedalus_h264_deblock_meta covers the non-intra (bS<4) path.
 Refs reauktion/daedalus-v4l2#11 — substitution arc step 2 cycle 8 chroma.
 ---
 diff --git a/libavcodec/aarch64/h264_idct_daedalus.c b/libavcodec/aarch64/h264_idct_daedalus.c
 --- a/libavcodec/aarch64/h264_idct_daedalus.c	2026-05-25 13:15:45.995368233 +0200
 +++ libavcodec/aarch64/h264_idct_daedalus.c	2026-05-25 13:15:46.015839177 +0200
@@ -1,10 +1,12 @@
 /*
 - * H.264 4x4 / 8x8 IDCT + luma v/h deblock — daedalus-fourier substitution shims.
 + * H.264 4x4 / 8x8 IDCT + luma v/h + chroma v/h deblock — daedalus-fourier substitution shims.
  *
  * Routes H264DSPContext.idct_add           → daedalus_recipe_dispatch_h264_idct4
  *        H264DSPContext.idct8_add          → daedalus_recipe_dispatch_h264_idct8
 - *        H264DSPContext.v_loop_filter_luma → daedalus_recipe_dispatch_h264_deblock_luma_v
 - *        H264DSPContext.h_loop_filter_luma → daedalus_recipe_dispatch_h264_deblock_luma_h
 + *        H264DSPContext.v_loop_filter_luma   → daedalus_recipe_dispatch_h264_deblock_luma_v
 + *        H264DSPContext.h_loop_filter_luma   → daedalus_recipe_dispatch_h264_deblock_luma_h
 + *        H264DSPContext.v_loop_filter_chroma → daedalus_recipe_dispatch_h264_deblock_chroma_v
 + *        H264DSPContext.h_loop_filter_chroma → daedalus_recipe_dispatch_h264_deblock_chroma_h
  * instead of the in-tree ff_h264_*_neon assembly.  The recipe layer
  * picks the substrate (CPU NEON for cycles 6 + 7 by default; cycle 8
  * is CPU primary with QPU opportunistic — the ctx below is no-QPU,
@@ -48,6 +50,10 @@
                                          int alpha, int beta, int8_t *tc0);
 void ff_h264_h_loop_filter_luma_daedalus(uint8_t *pix, ptrdiff_t stride,
                                          int alpha, int beta, int8_t *tc0);
 +void ff_h264_v_loop_filter_chroma_daedalus(uint8_t *pix, ptrdiff_t stride,
 +                                           int alpha, int beta, int8_t *tc0);
 +void ff_h264_h_loop_filter_chroma_daedalus(uint8_t *pix, ptrdiff_t stride,
 +                                           int alpha, int beta, int8_t *tc0);
 void ff_h264_idct_add_daedalus(uint8_t *dst, int16_t *block, int stride)
 {
@@ -106,3 +112,41 @@
     daedalus_recipe_dispatch_h264_deblock_luma_h(g_dctx, pix, (size_t)stride,
                                                  1, &meta);
 }
 +
 +void ff_h264_v_loop_filter_chroma_daedalus(uint8_t *pix, ptrdiff_t stride,
 +                                           int alpha, int beta, int8_t *tc0)
 +{
 +    daedalus_h264_deblock_meta meta = {
 +        .dst_off = 0,
 +        .alpha   = alpha,
 +        .beta    = beta,
 +    };
 +    meta.tc0[0] = tc0[0];
 +    meta.tc0[1] = tc0[1];
 +    meta.tc0[2] = tc0[2];
 +    meta.tc0[3] = tc0[3];
 +
 +    pthread_once(&g_dctx_once, daedalus_ctx_init_once);
 +
 +    daedalus_recipe_dispatch_h264_deblock_chroma_v(g_dctx, pix, (size_t)stride,
 +                                                   1, &meta);
 +}
 +
 +void ff_h264_h_loop_filter_chroma_daedalus(uint8_t *pix, ptrdiff_t stride,
 +                                           int alpha, int beta, int8_t *tc0)
 +{
 +    daedalus_h264_deblock_meta meta = {
 +        .dst_off = 0,
 +        .alpha   = alpha,
 +        .beta    = beta,
 +    };
 +    meta.tc0[0] = tc0[0];
 +    meta.tc0[1] = tc0[1];
 +    meta.tc0[2] = tc0[2];
 +    meta.tc0[3] = tc0[3];
 +
 +    pthread_once(&g_dctx_once, daedalus_ctx_init_once);
 +
 +    daedalus_recipe_dispatch_h264_deblock_chroma_h(g_dctx, pix, (size_t)stride,
 +                                                   1, &meta);
 +}
 diff --git a/libavcodec/aarch64/h264dsp_init_aarch64.c b/libavcodec/aarch64/h264dsp_init_aarch64.c
 --- a/libavcodec/aarch64/h264dsp_init_aarch64.c	2026-05-25 13:15:45.996482360 +0200
 +++ libavcodec/aarch64/h264dsp_init_aarch64.c	2026-05-25 13:15:46.025604910 +0200
@@ -39,8 +39,12 @@
                                            int beta);
 void ff_h264_v_loop_filter_chroma_neon(uint8_t *pix, ptrdiff_t stride, int alpha,
                                        int beta, int8_t *tc0);
 +void ff_h264_v_loop_filter_chroma_daedalus(uint8_t *pix, ptrdiff_t stride,
 +                                           int alpha, int beta, int8_t *tc0);
 void ff_h264_h_loop_filter_chroma_neon(uint8_t *pix, ptrdiff_t stride, int alpha,
                                        int beta, int8_t *tc0);
 +void ff_h264_h_loop_filter_chroma_daedalus(uint8_t *pix, ptrdiff_t stride,
 +                                           int alpha, int beta, int8_t *tc0);
 void ff_h264_h_loop_filter_chroma422_neon(uint8_t *pix, ptrdiff_t stride, int alpha,
                                           int beta, int8_t *tc0);
 void ff_h264_v_loop_filter_chroma_intra_neon(uint8_t *pix, ptrdiff_t stride,
@@ -123,11 +127,11 @@
         c->v_loop_filter_luma_intra= ff_h264_v_loop_filter_luma_intra_neon;
         c->h_loop_filter_luma_intra= ff_h264_h_loop_filter_luma_intra_neon;
 -        c->v_loop_filter_chroma = ff_h264_v_loop_filter_chroma_neon;
 +        c->v_loop_filter_chroma = ff_h264_v_loop_filter_chroma_daedalus;
         c->v_loop_filter_chroma_intra = ff_h264_v_loop_filter_chroma_intra_neon;
         if (chroma_format_idc <= 1) {
 -            c->h_loop_filter_chroma = ff_h264_h_loop_filter_chroma_neon;
 +            c->h_loop_filter_chroma = ff_h264_h_loop_filter_chroma_daedalus;
             c->h_loop_filter_chroma_intra = ff_h264_h_loop_filter_chroma_intra_neon;
             c->h_loop_filter_chroma_mbaff_intra = ff_h264_h_loop_filter_chroma_mbaff_intra_neon;
         } else {
 --
 2.47.3
@@ -0,0 +1,126 @@
 From 0000000000000000000000000000000000000000 Mon Sep 17 00:00:00 2001
 From: claude-noether <claude-noether@noreply.localhost>
 Date: Mon, 25 May 2026 12:30:00 +0200
 Subject: [PATCH] avcodec/aarch64/h264dsp: route H.264 luma intra deblock through daedalus-fourier
 Adds the bS=4 intra-strength variants of the already-substituted
 luma_v / luma_h deblock (0005, 0008).  Macroblock edges bordering an
 intra-coded MB generally take boundary strength 4 per H.264 §8.7.2.1;
 edges internal to an intra MB take bS=3 and stay on the bS<4 path.
  H264DSPContext.v_loop_filter_luma_intra →
    daedalus_recipe_dispatch_h264_deblock_luma_v_intra
  H264DSPContext.h_loop_filter_luma_intra →
    daedalus_recipe_dispatch_h264_deblock_luma_h_intra
 Both kernels landed in daedalus-fourier PR #11.  Recipe table
 routes AUTO to CPU NEON (no intra QPU shaders yet) — plumbing-
 only NEON-to-NEON via daedalus, bit-exact against the in-tree
 FFmpeg NEON path.
 Signature differs from bS<4: no tc0 argument.  The wrapper
 passes daedalus_h264_deblock_meta with alpha/beta set; tc0[] is
 ignored by the intra dispatch (bS=4 hardcodes the strength).
 Chroma intra variants are deferred to a follow-up PR because the
 chroma path has a 4:2:0 / 4:2:2 split (chroma_format_idc gating)
 that needs explicit conditional substitution to avoid running
 the 4:2:0-only daedalus dispatch on 4:2:2 chroma.
 Refs reauktion/daedalus-v4l2#11 — substitution arc step 2 cycle 8 intra.
 ---
 diff --git a/libavcodec/aarch64/h264_idct_daedalus.c b/libavcodec/aarch64/h264_idct_daedalus.c
 --- a/libavcodec/aarch64/h264_idct_daedalus.c	2026-05-25 13:18:54.992244965 +0200
 +++ libavcodec/aarch64/h264_idct_daedalus.c	2026-05-25 13:20:12.338122217 +0200
@@ -1,5 +1,5 @@
 /*
 - * H.264 4x4 / 8x8 IDCT + luma v/h + chroma v/h deblock — daedalus-fourier substitution shims.
 + * H.264 4x4 / 8x8 IDCT + luma v/h (inter + intra) + chroma v/h deblock — daedalus-fourier substitution shims.
  *
  * Routes H264DSPContext.idct_add           → daedalus_recipe_dispatch_h264_idct4
  *        H264DSPContext.idct8_add          → daedalus_recipe_dispatch_h264_idct8
@@ -7,6 +7,8 @@
  *        H264DSPContext.h_loop_filter_luma   → daedalus_recipe_dispatch_h264_deblock_luma_h
  *        H264DSPContext.v_loop_filter_chroma → daedalus_recipe_dispatch_h264_deblock_chroma_v
  *        H264DSPContext.h_loop_filter_chroma → daedalus_recipe_dispatch_h264_deblock_chroma_h
 + *        H264DSPContext.v_loop_filter_luma_intra → daedalus_recipe_dispatch_h264_deblock_luma_v_intra
 + *        H264DSPContext.h_loop_filter_luma_intra → daedalus_recipe_dispatch_h264_deblock_luma_h_intra
  * instead of the in-tree ff_h264_*_neon assembly.  The recipe layer
  * picks the substrate (CPU NEON for cycles 6 + 7 by default; cycle 8
  * is CPU primary with QPU opportunistic — the ctx below is no-QPU,
@@ -54,6 +56,10 @@
                                            int alpha, int beta, int8_t *tc0);
 void ff_h264_h_loop_filter_chroma_daedalus(uint8_t *pix, ptrdiff_t stride,
                                            int alpha, int beta, int8_t *tc0);
 +void ff_h264_v_loop_filter_luma_intra_daedalus(uint8_t *pix, ptrdiff_t stride,
 +                                                int alpha, int beta);
 +void ff_h264_h_loop_filter_luma_intra_daedalus(uint8_t *pix, ptrdiff_t stride,
 +                                                int alpha, int beta);
 void ff_h264_idct_add_daedalus(uint8_t *dst, int16_t *block, int stride)
 {
@@ -150,3 +156,34 @@
     daedalus_recipe_dispatch_h264_deblock_chroma_h(g_dctx, pix, (size_t)stride,
                                                    1, &meta);
 }
 +
 +void ff_h264_v_loop_filter_luma_intra_daedalus(uint8_t *pix, ptrdiff_t stride,
 +                                                int alpha, int beta)
 +{
 +    daedalus_h264_deblock_meta meta = {
 +        .dst_off = 0,
 +        .alpha   = alpha,
 +        .beta    = beta,
 +    };
 +    /* tc0[] is ignored by the intra-strength dispatch (bS=4 hardcodes the strength). */
 +
 +    pthread_once(&g_dctx_once, daedalus_ctx_init_once);
 +
 +    daedalus_recipe_dispatch_h264_deblock_luma_v_intra(g_dctx, pix, (size_t)stride,
 +                                                        1, &meta);
 +}
 +
 +void ff_h264_h_loop_filter_luma_intra_daedalus(uint8_t *pix, ptrdiff_t stride,
 +                                                int alpha, int beta)
 +{
 +    daedalus_h264_deblock_meta meta = {
 +        .dst_off = 0,
 +        .alpha   = alpha,
 +        .beta    = beta,
 +    };
 +
 +    pthread_once(&g_dctx_once, daedalus_ctx_init_once);
 +
 +    daedalus_recipe_dispatch_h264_deblock_luma_h_intra(g_dctx, pix, (size_t)stride,
 +                                                        1, &meta);
 +}
 diff --git a/libavcodec/aarch64/h264dsp_init_aarch64.c b/libavcodec/aarch64/h264dsp_init_aarch64.c
 --- a/libavcodec/aarch64/h264dsp_init_aarch64.c	2026-05-25 13:18:54.993349573 +0200
 +++ libavcodec/aarch64/h264dsp_init_aarch64.c	2026-05-25 13:20:12.338265830 +0200
@@ -35,8 +35,12 @@
                                          int alpha, int beta, int8_t *tc0);
 void ff_h264_v_loop_filter_luma_intra_neon(uint8_t *pix, ptrdiff_t stride, int alpha,
                                            int beta);
 +void ff_h264_v_loop_filter_luma_intra_daedalus(uint8_t *pix, ptrdiff_t stride,
 +                                                int alpha, int beta);
 void ff_h264_h_loop_filter_luma_intra_neon(uint8_t *pix, ptrdiff_t stride, int alpha,
                                            int beta);
 +void ff_h264_h_loop_filter_luma_intra_daedalus(uint8_t *pix, ptrdiff_t stride,
 +                                                int alpha, int beta);
 void ff_h264_v_loop_filter_chroma_neon(uint8_t *pix, ptrdiff_t stride, int alpha,
                                        int beta, int8_t *tc0);
 void ff_h264_v_loop_filter_chroma_daedalus(uint8_t *pix, ptrdiff_t stride,
@@ -124,8 +128,8 @@
     if (have_neon(cpu_flags) && bit_depth == 8) {
         c->v_loop_filter_luma   = ff_h264_v_loop_filter_luma_daedalus;
         c->h_loop_filter_luma   = ff_h264_h_loop_filter_luma_daedalus;
 -        c->v_loop_filter_luma_intra= ff_h264_v_loop_filter_luma_intra_neon;
 -        c->h_loop_filter_luma_intra= ff_h264_h_loop_filter_luma_intra_neon;
 +        c->v_loop_filter_luma_intra= ff_h264_v_loop_filter_luma_intra_daedalus;
 +        c->h_loop_filter_luma_intra= ff_h264_h_loop_filter_luma_intra_daedalus;
         c->v_loop_filter_chroma = ff_h264_v_loop_filter_chroma_daedalus;
         c->v_loop_filter_chroma_intra = ff_h264_v_loop_filter_chroma_intra_neon;
 --
 2.47.3
@@ -0,0 +1,101 @@
 From 0000000000000000000000000000000000000000 Mon Sep 17 00:00:00 2001
 From: claude-noether <claude-noether@noreply.localhost>
 Date: Mon, 25 May 2026 13:00:00 +0200
 Subject: [PATCH] avcodec/aarch64/h264dsp: route H.264 chroma DC Hadamard through daedalus-fourier
 Substitutes H264DSPContext.chroma_dc_dequant_idct in the
 4:2:0 / bit_depth=8 init path with a wrapper that composes
 the daedalus chroma DC Hadamard primitive (fourier PR #25)
 with the qmul scaling; FFmpeg does both in one fused function.
 Bit-exact against ff_h264_chroma_dc_dequant_idct_8_c.
 Hadamard correctness gated by fourier PR #23 test suite.
 4:2:2 chroma stays on the in-tree 422 variant (same
 gating shape as 0009 chroma deblock substitution).
 Requires daedalus-fourier commit b9f9ff2 or later (PR #25
 exposing the public Hadamard symbol).  Pin bumps in PKGBUILD
 and build-deb.sh come in the same commit.
 ---
 diff --git a/libavcodec/aarch64/h264_idct_daedalus.c b/libavcodec/aarch64/h264_idct_daedalus.c
 --- a/libavcodec/aarch64/h264_idct_daedalus.c	2026-05-25 13:38:32.019491484 +0200
 +++ libavcodec/aarch64/h264_idct_daedalus.c	2026-05-25 13:38:32.033821507 +0200
@@ -1,5 +1,5 @@
 /*
 - * H.264 4x4 / 8x8 IDCT + luma v/h (inter + intra) + chroma v/h deblock — daedalus-fourier substitution shims.
 + * H.264 4x4 / 8x8 IDCT + luma v/h (inter+intra) + chroma v/h deblock + chroma DC Hadamard — daedalus-fourier substitution shims.
  *
  * Routes H264DSPContext.idct_add           → daedalus_recipe_dispatch_h264_idct4
  *        H264DSPContext.idct8_add          → daedalus_recipe_dispatch_h264_idct8
@@ -9,6 +9,7 @@
  *        H264DSPContext.h_loop_filter_chroma → daedalus_recipe_dispatch_h264_deblock_chroma_h
  *        H264DSPContext.v_loop_filter_luma_intra → daedalus_recipe_dispatch_h264_deblock_luma_v_intra
  *        H264DSPContext.h_loop_filter_luma_intra → daedalus_recipe_dispatch_h264_deblock_luma_h_intra
 + *        H264DSPContext.chroma_dc_dequant_idct   → daedalus_h264_chroma_dc_hadamard_2x2 + caller-side qmul
  * instead of the in-tree ff_h264_*_neon assembly.  The recipe layer
  * picks the substrate (CPU NEON for cycles 6 + 7 by default; cycle 8
  * is CPU primary with QPU opportunistic — the ctx below is no-QPU,
@@ -60,6 +61,7 @@
                                                 int alpha, int beta);
 void ff_h264_h_loop_filter_luma_intra_daedalus(uint8_t *pix, ptrdiff_t stride,
                                                 int alpha, int beta);
 +void ff_h264_chroma_dc_dequant_idct_daedalus(int16_t *block, int qmul);
 void ff_h264_idct_add_daedalus(uint8_t *dst, int16_t *block, int stride)
 {
@@ -187,3 +189,32 @@
     daedalus_recipe_dispatch_h264_deblock_luma_h_intra(g_dctx, pix, (size_t)stride,
                                                         1, &meta);
 }
 +
 +/* Composes daedalus_h264_chroma_dc_hadamard_2x2 with the qmul scaling
 + * that FFmpeg's reference does in one fused function (h264idct_template.c
 + * ff_h264_chroma_dc_dequant_idct).
 + *
 + * The 4 DC coefficients are scattered across the per-MB coefficient
 + * buffer at offsets [r*stride + c*xStride] (stride=32, xStride=16).
 + * Extract into a contiguous int16[4], run the Hadamard, then apply
 + * the qmul scale and write back to the original positions.
 + *
 + * No daedalus ctx needed; the Hadamard is a pure stateless primitive.
 + */
 +void ff_h264_chroma_dc_dequant_idct_daedalus(int16_t *block, int qmul)
 +{
 +    enum { stride = 32, xStride = 16 };
 +    int16_t dc[4];
 +
 +    dc[0] = block[stride*0 + xStride*0];
 +    dc[1] = block[stride*0 + xStride*1];
 +    dc[2] = block[stride*1 + xStride*0];
 +    dc[3] = block[stride*1 + xStride*1];
 +
 +    daedalus_h264_chroma_dc_hadamard_2x2(dc);
 +
 +    block[stride*0 + xStride*0] = (int16_t)((int)dc[0] * qmul >> 7);
 +    block[stride*0 + xStride*1] = (int16_t)((int)dc[1] * qmul >> 7);
 +    block[stride*1 + xStride*0] = (int16_t)((int)dc[2] * qmul >> 7);
 +    block[stride*1 + xStride*1] = (int16_t)((int)dc[3] * qmul >> 7);
 +}
 diff --git a/libavcodec/aarch64/h264dsp_init_aarch64.c b/libavcodec/aarch64/h264dsp_init_aarch64.c
 --- a/libavcodec/aarch64/h264dsp_init_aarch64.c	2026-05-25 13:38:32.020346459 +0200
 +++ libavcodec/aarch64/h264dsp_init_aarch64.c	2026-05-25 13:38:32.033909804 +0200
@@ -41,6 +41,7 @@
                                            int beta);
 void ff_h264_h_loop_filter_luma_intra_daedalus(uint8_t *pix, ptrdiff_t stride,
                                                 int alpha, int beta);
 +void ff_h264_chroma_dc_dequant_idct_daedalus(int16_t *block, int qmul);
 void ff_h264_v_loop_filter_chroma_neon(uint8_t *pix, ptrdiff_t stride, int alpha,
                                        int beta, int8_t *tc0);
 void ff_h264_v_loop_filter_chroma_daedalus(uint8_t *pix, ptrdiff_t stride,
@@ -135,6 +136,7 @@
         c->v_loop_filter_chroma_intra = ff_h264_v_loop_filter_chroma_intra_neon;
         if (chroma_format_idc <= 1) {
 +            c->chroma_dc_dequant_idct = ff_h264_chroma_dc_dequant_idct_daedalus;
             c->h_loop_filter_chroma = ff_h264_h_loop_filter_chroma_daedalus;
             c->h_loop_filter_chroma_intra = ff_h264_h_loop_filter_chroma_intra_neon;
             c->h_loop_filter_chroma_mbaff_intra = ff_h264_h_loop_filter_chroma_mbaff_intra_neon;
 --
 2.47.3
@@ -0,0 +1,245 @@
 From 0000000000000000000000000000000000000000 Mon Sep 17 00:00:00 2001
 From: claude-noether <claude-noether@noreply.localhost>
 Date: Mon, 25 May 2026 14:00:00 +0200
 Subject: [PATCH] avcodec/aarch64/h264qpel: route remaining qpel 8x8 positions through daedalus-fourier
 Closes the H.264 qpel substitution.  Extends 0007 (which routed only
 mc20 put_) to ALL 15 useful positions in BOTH the put_ and avg_
 tables, skipping mc00 (integer copy / pointer-only fast path).
 29 substitutions total: 14 new put_ + 15 avg_.  Each is a uniform
 wrapper around daedalus_recipe_dispatch_h264_qpel_{avg_,}mcXY exposed
 by daedalus-fourier PRs #15-#20.
 All recipe-table entries route AUTO to CPU NEON (no QPU shaders
 for any qpel position other than mc20 yet), so this is plumbing-only
 NEON-to-NEON — bit-exact against the in-tree ff_*_h264_qpel8_*_neon
 path.
 16x16 qpel tables ([0][...]) stay on the in-tree NEON.  daedalus
 only exposes 8x8 today; 16x16 substitution can land once fourier
 provides those variants (likely just dispatching the 8x8 path four
 times with shifted dst/src offsets).
 Refs reauktion/daedalus-v4l2#11 — substitution arc qpel buildout.
 ---
 diff --git a/libavcodec/aarch64/h264_qpel_daedalus.c b/libavcodec/aarch64/h264_qpel_daedalus.c
 --- a/libavcodec/aarch64/h264_qpel_daedalus.c	2026-05-25 14:05:05.789298250 +0200
 +++ libavcodec/aarch64/h264_qpel_daedalus.c	2026-05-25 14:05:05.818358374 +0200
@@ -1,10 +1,13 @@
 /*
 - * H.264 luma qpel mc20 (8x8, horizontal half-pel, 6-tap "put")
 - * — daedalus-fourier substitution shim.
 + * H.264 luma qpel 8x8 — daedalus-fourier substitution shims (put_ + avg_).
  *
 - * Routes H264QpelContext.put_h264_qpel_pixels_tab[1][2] through
 - * daedalus_recipe_dispatch_h264_qpel_mc20 instead of
 - * ff_put_h264_qpel8_mc20_neon.  The recipe layer picks the substrate
 + * Routes ALL 15 useful positions in H264QpelContext's 8x8 put_ and
 + * avg_ tables through daedalus_recipe_dispatch_h264_qpel_mc{XY}
 + * (skipping mc00 which is integer copy / FFmpeg's pointer-only fast
 + * path).  Plumbing-only NEON-by-recipe — daedalus-fourier PRs #15-#20
 + * exposed each variant via the same dispatch signature, so the
 + * substitution is a uniform macro across put_/avg_ and across all
 + * 15 mc positions.  The recipe layer picks the substrate
  * (CPU NEON for cycle 9; QPU not viable — per-block 7.6 ns vs
  * ~250 ns QPU dispatch floor, see docs/k9_h264qpel_mc20.md).
  *
@@ -48,3 +51,53 @@
     daedalus_recipe_dispatch_h264_qpel_mc20(g_dctx, dst, src, (size_t)stride,
                                             1, &meta);
 }
 +
 +
 +/* All other 8x8 qpel positions follow the same dispatch shape as mc20
 + * above.  The macro collapses ~600 LOC of one-wrapper-per-variant
 + * boilerplate (29 variants total: 14 put_ + 15 avg_). */
 +#define DEFINE_QPEL_WRAPPER(type, suffix, dispatch_fn)                          \
 +void ff_ ## type ## _h264_qpel8_ ## suffix ## _daedalus(uint8_t *dst,           \
 +    const uint8_t *src, ptrdiff_t stride);                                      \
 +void ff_ ## type ## _h264_qpel8_ ## suffix ## _daedalus(uint8_t *dst,           \
 +    const uint8_t *src, ptrdiff_t stride)                                       \
 +{                                                                               \
 +    static const daedalus_h264_qpel_meta meta = { .dst_off = 0, .src_off = 0 }; \
 +    pthread_once(&g_dctx_once, daedalus_ctx_init_once);                         \
 +    dispatch_fn(g_dctx, dst, src, (size_t)stride, 1, &meta);                    \
 +}
 +
 +/* put_ variants (mc20 stays on the explicit definition above). */
 +DEFINE_QPEL_WRAPPER(put, mc10, daedalus_recipe_dispatch_h264_qpel_mc10)
 +DEFINE_QPEL_WRAPPER(put, mc30, daedalus_recipe_dispatch_h264_qpel_mc30)
 +DEFINE_QPEL_WRAPPER(put, mc01, daedalus_recipe_dispatch_h264_qpel_mc01)
 +DEFINE_QPEL_WRAPPER(put, mc11, daedalus_recipe_dispatch_h264_qpel_mc11)
 +DEFINE_QPEL_WRAPPER(put, mc21, daedalus_recipe_dispatch_h264_qpel_mc21)
 +DEFINE_QPEL_WRAPPER(put, mc31, daedalus_recipe_dispatch_h264_qpel_mc31)
 +DEFINE_QPEL_WRAPPER(put, mc02, daedalus_recipe_dispatch_h264_qpel_mc02)
 +DEFINE_QPEL_WRAPPER(put, mc12, daedalus_recipe_dispatch_h264_qpel_mc12)
 +DEFINE_QPEL_WRAPPER(put, mc22, daedalus_recipe_dispatch_h264_qpel_mc22)
 +DEFINE_QPEL_WRAPPER(put, mc32, daedalus_recipe_dispatch_h264_qpel_mc32)
 +DEFINE_QPEL_WRAPPER(put, mc03, daedalus_recipe_dispatch_h264_qpel_mc03)
 +DEFINE_QPEL_WRAPPER(put, mc13, daedalus_recipe_dispatch_h264_qpel_mc13)
 +DEFINE_QPEL_WRAPPER(put, mc23, daedalus_recipe_dispatch_h264_qpel_mc23)
 +DEFINE_QPEL_WRAPPER(put, mc33, daedalus_recipe_dispatch_h264_qpel_mc33)
 +
 +/* avg_ variants — all 15 useful positions. */
 +DEFINE_QPEL_WRAPPER(avg, mc10, daedalus_recipe_dispatch_h264_qpel_avg_mc10)
 +DEFINE_QPEL_WRAPPER(avg, mc20, daedalus_recipe_dispatch_h264_qpel_avg_mc20)
 +DEFINE_QPEL_WRAPPER(avg, mc30, daedalus_recipe_dispatch_h264_qpel_avg_mc30)
 +DEFINE_QPEL_WRAPPER(avg, mc01, daedalus_recipe_dispatch_h264_qpel_avg_mc01)
 +DEFINE_QPEL_WRAPPER(avg, mc11, daedalus_recipe_dispatch_h264_qpel_avg_mc11)
 +DEFINE_QPEL_WRAPPER(avg, mc21, daedalus_recipe_dispatch_h264_qpel_avg_mc21)
 +DEFINE_QPEL_WRAPPER(avg, mc31, daedalus_recipe_dispatch_h264_qpel_avg_mc31)
 +DEFINE_QPEL_WRAPPER(avg, mc02, daedalus_recipe_dispatch_h264_qpel_avg_mc02)
 +DEFINE_QPEL_WRAPPER(avg, mc12, daedalus_recipe_dispatch_h264_qpel_avg_mc12)
 +DEFINE_QPEL_WRAPPER(avg, mc22, daedalus_recipe_dispatch_h264_qpel_avg_mc22)
 +DEFINE_QPEL_WRAPPER(avg, mc32, daedalus_recipe_dispatch_h264_qpel_avg_mc32)
 +DEFINE_QPEL_WRAPPER(avg, mc03, daedalus_recipe_dispatch_h264_qpel_avg_mc03)
 +DEFINE_QPEL_WRAPPER(avg, mc13, daedalus_recipe_dispatch_h264_qpel_avg_mc13)
 +DEFINE_QPEL_WRAPPER(avg, mc23, daedalus_recipe_dispatch_h264_qpel_avg_mc23)
 +DEFINE_QPEL_WRAPPER(avg, mc33, daedalus_recipe_dispatch_h264_qpel_avg_mc33)
 +
 +#undef DEFINE_QPEL_WRAPPER
 diff --git a/libavcodec/aarch64/h264qpel_init_aarch64.c b/libavcodec/aarch64/h264qpel_init_aarch64.c
 --- a/libavcodec/aarch64/h264qpel_init_aarch64.c	2026-05-25 14:05:05.790403989 +0200
 +++ libavcodec/aarch64/h264qpel_init_aarch64.c	2026-05-25 14:05:05.819136071 +0200
@@ -50,6 +50,64 @@
 void ff_put_h264_qpel8_mc30_neon(uint8_t *dst, const uint8_t *src, ptrdiff_t stride);
 void ff_put_h264_qpel8_mc20_daedalus(uint8_t *dst, const uint8_t *src,
                                      ptrdiff_t stride);
 +void ff_put_h264_qpel8_mc10_daedalus(uint8_t *dst, const uint8_t *src,
 +                                  ptrdiff_t stride);
 +void ff_put_h264_qpel8_mc30_daedalus(uint8_t *dst, const uint8_t *src,
 +                                  ptrdiff_t stride);
 +void ff_put_h264_qpel8_mc01_daedalus(uint8_t *dst, const uint8_t *src,
 +                                  ptrdiff_t stride);
 +void ff_put_h264_qpel8_mc11_daedalus(uint8_t *dst, const uint8_t *src,
 +                                  ptrdiff_t stride);
 +void ff_put_h264_qpel8_mc21_daedalus(uint8_t *dst, const uint8_t *src,
 +                                  ptrdiff_t stride);
 +void ff_put_h264_qpel8_mc31_daedalus(uint8_t *dst, const uint8_t *src,
 +                                  ptrdiff_t stride);
 +void ff_put_h264_qpel8_mc02_daedalus(uint8_t *dst, const uint8_t *src,
 +                                  ptrdiff_t stride);
 +void ff_put_h264_qpel8_mc12_daedalus(uint8_t *dst, const uint8_t *src,
 +                                  ptrdiff_t stride);
 +void ff_put_h264_qpel8_mc22_daedalus(uint8_t *dst, const uint8_t *src,
 +                                  ptrdiff_t stride);
 +void ff_put_h264_qpel8_mc32_daedalus(uint8_t *dst, const uint8_t *src,
 +                                  ptrdiff_t stride);
 +void ff_put_h264_qpel8_mc03_daedalus(uint8_t *dst, const uint8_t *src,
 +                                  ptrdiff_t stride);
 +void ff_put_h264_qpel8_mc13_daedalus(uint8_t *dst, const uint8_t *src,
 +                                  ptrdiff_t stride);
 +void ff_put_h264_qpel8_mc23_daedalus(uint8_t *dst, const uint8_t *src,
 +                                  ptrdiff_t stride);
 +void ff_put_h264_qpel8_mc33_daedalus(uint8_t *dst, const uint8_t *src,
 +                                  ptrdiff_t stride);
 +void ff_avg_h264_qpel8_mc10_daedalus(uint8_t *dst, const uint8_t *src,
 +                                  ptrdiff_t stride);
 +void ff_avg_h264_qpel8_mc20_daedalus(uint8_t *dst, const uint8_t *src,
 +                                  ptrdiff_t stride);
 +void ff_avg_h264_qpel8_mc30_daedalus(uint8_t *dst, const uint8_t *src,
 +                                  ptrdiff_t stride);
 +void ff_avg_h264_qpel8_mc01_daedalus(uint8_t *dst, const uint8_t *src,
 +                                  ptrdiff_t stride);
 +void ff_avg_h264_qpel8_mc11_daedalus(uint8_t *dst, const uint8_t *src,
 +                                  ptrdiff_t stride);
 +void ff_avg_h264_qpel8_mc21_daedalus(uint8_t *dst, const uint8_t *src,
 +                                  ptrdiff_t stride);
 +void ff_avg_h264_qpel8_mc31_daedalus(uint8_t *dst, const uint8_t *src,
 +                                  ptrdiff_t stride);
 +void ff_avg_h264_qpel8_mc02_daedalus(uint8_t *dst, const uint8_t *src,
 +                                  ptrdiff_t stride);
 +void ff_avg_h264_qpel8_mc12_daedalus(uint8_t *dst, const uint8_t *src,
 +                                  ptrdiff_t stride);
 +void ff_avg_h264_qpel8_mc22_daedalus(uint8_t *dst, const uint8_t *src,
 +                                  ptrdiff_t stride);
 +void ff_avg_h264_qpel8_mc32_daedalus(uint8_t *dst, const uint8_t *src,
 +                                  ptrdiff_t stride);
 +void ff_avg_h264_qpel8_mc03_daedalus(uint8_t *dst, const uint8_t *src,
 +                                  ptrdiff_t stride);
 +void ff_avg_h264_qpel8_mc13_daedalus(uint8_t *dst, const uint8_t *src,
 +                                  ptrdiff_t stride);
 +void ff_avg_h264_qpel8_mc23_daedalus(uint8_t *dst, const uint8_t *src,
 +                                  ptrdiff_t stride);
 +void ff_avg_h264_qpel8_mc33_daedalus(uint8_t *dst, const uint8_t *src,
 +                                  ptrdiff_t stride);
 void ff_put_h264_qpel8_mc01_neon(uint8_t *dst, const uint8_t *src, ptrdiff_t stride);
 void ff_put_h264_qpel8_mc11_neon(uint8_t *dst, const uint8_t *src, ptrdiff_t stride);
 void ff_put_h264_qpel8_mc21_neon(uint8_t *dst, const uint8_t *src, ptrdiff_t stride);
@@ -164,21 +222,21 @@
         c->put_h264_qpel_pixels_tab[0][15] = ff_put_h264_qpel16_mc33_neon;
         c->put_h264_qpel_pixels_tab[1][ 0] = ff_put_h264_qpel8_mc00_neon;
 -        c->put_h264_qpel_pixels_tab[1][ 1] = ff_put_h264_qpel8_mc10_neon;
 +        c->put_h264_qpel_pixels_tab[1][ 1] = ff_put_h264_qpel8_mc10_daedalus;
         c->put_h264_qpel_pixels_tab[1][ 2] = ff_put_h264_qpel8_mc20_daedalus;
 -        c->put_h264_qpel_pixels_tab[1][ 3] = ff_put_h264_qpel8_mc30_neon;
 -        c->put_h264_qpel_pixels_tab[1][ 4] = ff_put_h264_qpel8_mc01_neon;
 -        c->put_h264_qpel_pixels_tab[1][ 5] = ff_put_h264_qpel8_mc11_neon;
 -        c->put_h264_qpel_pixels_tab[1][ 6] = ff_put_h264_qpel8_mc21_neon;
 -        c->put_h264_qpel_pixels_tab[1][ 7] = ff_put_h264_qpel8_mc31_neon;
 -        c->put_h264_qpel_pixels_tab[1][ 8] = ff_put_h264_qpel8_mc02_neon;
 -        c->put_h264_qpel_pixels_tab[1][ 9] = ff_put_h264_qpel8_mc12_neon;
 -        c->put_h264_qpel_pixels_tab[1][10] = ff_put_h264_qpel8_mc22_neon;
 -        c->put_h264_qpel_pixels_tab[1][11] = ff_put_h264_qpel8_mc32_neon;
 -        c->put_h264_qpel_pixels_tab[1][12] = ff_put_h264_qpel8_mc03_neon;
 -        c->put_h264_qpel_pixels_tab[1][13] = ff_put_h264_qpel8_mc13_neon;
 -        c->put_h264_qpel_pixels_tab[1][14] = ff_put_h264_qpel8_mc23_neon;
 -        c->put_h264_qpel_pixels_tab[1][15] = ff_put_h264_qpel8_mc33_neon;
 +        c->put_h264_qpel_pixels_tab[1][ 3] = ff_put_h264_qpel8_mc30_daedalus;
 +        c->put_h264_qpel_pixels_tab[1][ 4] = ff_put_h264_qpel8_mc01_daedalus;
 +        c->put_h264_qpel_pixels_tab[1][ 5] = ff_put_h264_qpel8_mc11_daedalus;
 +        c->put_h264_qpel_pixels_tab[1][ 6] = ff_put_h264_qpel8_mc21_daedalus;
 +        c->put_h264_qpel_pixels_tab[1][ 7] = ff_put_h264_qpel8_mc31_daedalus;
 +        c->put_h264_qpel_pixels_tab[1][ 8] = ff_put_h264_qpel8_mc02_daedalus;
 +        c->put_h264_qpel_pixels_tab[1][ 9] = ff_put_h264_qpel8_mc12_daedalus;
 +        c->put_h264_qpel_pixels_tab[1][10] = ff_put_h264_qpel8_mc22_daedalus;
 +        c->put_h264_qpel_pixels_tab[1][11] = ff_put_h264_qpel8_mc32_daedalus;
 +        c->put_h264_qpel_pixels_tab[1][12] = ff_put_h264_qpel8_mc03_daedalus;
 +        c->put_h264_qpel_pixels_tab[1][13] = ff_put_h264_qpel8_mc13_daedalus;
 +        c->put_h264_qpel_pixels_tab[1][14] = ff_put_h264_qpel8_mc23_daedalus;
 +        c->put_h264_qpel_pixels_tab[1][15] = ff_put_h264_qpel8_mc33_daedalus;
         c->avg_h264_qpel_pixels_tab[0][ 0] = ff_avg_h264_qpel16_mc00_neon;
         c->avg_h264_qpel_pixels_tab[0][ 1] = ff_avg_h264_qpel16_mc10_neon;
@@ -198,21 +256,21 @@
         c->avg_h264_qpel_pixels_tab[0][15] = ff_avg_h264_qpel16_mc33_neon;
         c->avg_h264_qpel_pixels_tab[1][ 0] = ff_avg_h264_qpel8_mc00_neon;
 -        c->avg_h264_qpel_pixels_tab[1][ 1] = ff_avg_h264_qpel8_mc10_neon;
 -        c->avg_h264_qpel_pixels_tab[1][ 2] = ff_avg_h264_qpel8_mc20_neon;
 -        c->avg_h264_qpel_pixels_tab[1][ 3] = ff_avg_h264_qpel8_mc30_neon;
 -        c->avg_h264_qpel_pixels_tab[1][ 4] = ff_avg_h264_qpel8_mc01_neon;
 -        c->avg_h264_qpel_pixels_tab[1][ 5] = ff_avg_h264_qpel8_mc11_neon;
 -        c->avg_h264_qpel_pixels_tab[1][ 6] = ff_avg_h264_qpel8_mc21_neon;
 -        c->avg_h264_qpel_pixels_tab[1][ 7] = ff_avg_h264_qpel8_mc31_neon;
 -        c->avg_h264_qpel_pixels_tab[1][ 8] = ff_avg_h264_qpel8_mc02_neon;
 -        c->avg_h264_qpel_pixels_tab[1][ 9] = ff_avg_h264_qpel8_mc12_neon;
 -        c->avg_h264_qpel_pixels_tab[1][10] = ff_avg_h264_qpel8_mc22_neon;
 -        c->avg_h264_qpel_pixels_tab[1][11] = ff_avg_h264_qpel8_mc32_neon;
 -        c->avg_h264_qpel_pixels_tab[1][12] = ff_avg_h264_qpel8_mc03_neon;
 -        c->avg_h264_qpel_pixels_tab[1][13] = ff_avg_h264_qpel8_mc13_neon;
 -        c->avg_h264_qpel_pixels_tab[1][14] = ff_avg_h264_qpel8_mc23_neon;
 -        c->avg_h264_qpel_pixels_tab[1][15] = ff_avg_h264_qpel8_mc33_neon;
 +        c->avg_h264_qpel_pixels_tab[1][ 1] = ff_avg_h264_qpel8_mc10_daedalus;
 +        c->avg_h264_qpel_pixels_tab[1][ 2] = ff_avg_h264_qpel8_mc20_daedalus;
 +        c->avg_h264_qpel_pixels_tab[1][ 3] = ff_avg_h264_qpel8_mc30_daedalus;
 +        c->avg_h264_qpel_pixels_tab[1][ 4] = ff_avg_h264_qpel8_mc01_daedalus;
 +        c->avg_h264_qpel_pixels_tab[1][ 5] = ff_avg_h264_qpel8_mc11_daedalus;
 +        c->avg_h264_qpel_pixels_tab[1][ 6] = ff_avg_h264_qpel8_mc21_daedalus;
 +        c->avg_h264_qpel_pixels_tab[1][ 7] = ff_avg_h264_qpel8_mc31_daedalus;
 +        c->avg_h264_qpel_pixels_tab[1][ 8] = ff_avg_h264_qpel8_mc02_daedalus;
 +        c->avg_h264_qpel_pixels_tab[1][ 9] = ff_avg_h264_qpel8_mc12_daedalus;
 +        c->avg_h264_qpel_pixels_tab[1][10] = ff_avg_h264_qpel8_mc22_daedalus;
 +        c->avg_h264_qpel_pixels_tab[1][11] = ff_avg_h264_qpel8_mc32_daedalus;
 +        c->avg_h264_qpel_pixels_tab[1][12] = ff_avg_h264_qpel8_mc03_daedalus;
 +        c->avg_h264_qpel_pixels_tab[1][13] = ff_avg_h264_qpel8_mc13_daedalus;
 +        c->avg_h264_qpel_pixels_tab[1][14] = ff_avg_h264_qpel8_mc23_daedalus;
 +        c->avg_h264_qpel_pixels_tab[1][15] = ff_avg_h264_qpel8_mc33_daedalus;
     } else if (have_neon(cpu_flags) && bit_depth == 10) {
         c->put_h264_qpel_pixels_tab[0][ 1] = ff_put_h264_qpel16_mc10_neon_10;
         c->put_h264_qpel_pixels_tab[0][ 2] = ff_put_h264_qpel16_mc20_neon_10;
 --
 2.47.3
@@ -0,0 +1,85 @@
 From 0000000000000000000000000000000000000000 Mon Sep 17 00:00:00 2001
 From: Markus Fritsche <mfritsche@reauktion.de>
 Date: Mon, 25 May 2026 21:00:00 +0200
 Subject: [PATCH] avcodec/aarch64/h264: use QPU-capable daedalus ctx (bench
 shows 4.30x faster on Pi 5)
 MIME-Version: 1.0
 Content-Type: text/plain; charset=UTF-8
 Content-Transfer-Encoding: 8bit
 Patches 0003 (IDCT 4x4) and 0007 (qpel mc20) created the libavcodec.so
 process-global daedalus_ctx via daedalus_ctx_create_no_qpu().  Rationale
 at the time: cycle 6/9 had only CPU NEON paths, so a QPU-capable ctx
 would have meant pointless Vulkan init in every host process (firefox-
 fourier, mpv-fourier, daedalus_v4l2_daemon, ...).
 Two things changed since:
  1. Every H.264 hot-path primitive now has a V3D7 compute shader.
     IDCT 4x4/8x8 (cycles 6, 7), 8 deblock variants (luma+chroma x V+H
     x inter+intra), 30 qpel positions (15 put_ + 15 avg_).  See
     daedalus-fourier PRs #28-#35.
  2. Dispatch overhead has been hammered down — buffer pool in
     v3d_runner (daedalus-fourier task #160) plus persistent command
     buffer (task #161).  daedalus-fourier PR #36 bench measures the
     1080p worst-case sum on hertz (Pi 5 V3D 7.1, 30 iters x 5 warmup):
       kernel             CPU ns/op  QPU ns/op  winner
       IDCT 4x4 luma          10.79       2.47  QPU 4.36x
       IDCT 8x8 luma          29.69       9.23  QPU 3.22x
       Deblock luma_v         17.58      10.21  QPU 1.72x
       Deblock luma_h         38.41       9.98  QPU 3.85x
       qpel mc20 (8x8)        28.24       9.66  QPU 2.92x
       qpel mc02 (8x8)        16.96      20.54  CPU 1.21x
       qpel mc22 (8x8)        71.58       9.64  QPU 7.43x
       1080p worst-case sum (IDCT4 + deblock luma + qpel mc22):
         CPU NEON only:  5.57 ms
         QPU only:       1.30 ms   (CPU/QPU sum ratio = 4.30x)
 PR #10's verdict (CPU 4x faster than QPU at IDCT) is reversed.  Switch
 the substitution context to daedalus_ctx_create() in both H.264 TUs
 (h264_idct_daedalus.c, h264_qpel_daedalus.c) so the recipe layer can
 actually route through the now-faster QPU path.
 daedalus_ctx_create() probes for a usable Vulkan device and falls back
 to no_qpu mode if unavailable, so this is safe on hosts without V3D
 (x86 reauktion build runners, debian-aarch64 builders without renderD,
 etc.).  Hosts WITH V3D (Pi 5 deployment targets) get the speedup.
 The one remaining anomaly is qpel mc02 (single-axis vertical filter),
 where CPU NEON still wins by 1.21x.  It is bench-flagged for a v2
 shader follow-up; the recipe entry stays on QPU because the 2026-05-23
 substrate decree holds and the gap is marginal.
 Refs reauktion/daedalus-fourier!36.
 ---
 libavcodec/aarch64/h264_idct_daedalus.c | 2 +-
 libavcodec/aarch64/h264_qpel_daedalus.c | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)
 diff --git a/libavcodec/aarch64/h264_idct_daedalus.c b/libavcodec/aarch64/h264_idct_daedalus.c
 --- a/libavcodec/aarch64/h264_idct_daedalus.c
 +++ b/libavcodec/aarch64/h264_idct_daedalus.c
@@ -32,7 +32,7 @@ static pthread_once_t    g_dctx_once = PTHREAD_ONCE_INIT;
 static void daedalus_ctx_init_once(void)
 {
 -    g_dctx = daedalus_ctx_create_no_qpu();
 +    g_dctx = daedalus_ctx_create();
 }
 void ff_h264_idct_add_daedalus(uint8_t *dst, int16_t *block, int stride);
 diff --git a/libavcodec/aarch64/h264_qpel_daedalus.c b/libavcodec/aarch64/h264_qpel_daedalus.c
 --- a/libavcodec/aarch64/h264_qpel_daedalus.c
 +++ b/libavcodec/aarch64/h264_qpel_daedalus.c
@@ -38,7 +38,7 @@ static pthread_once_t    g_dctx_once = PTHREAD_ONCE_INIT;
 static void daedalus_ctx_init_once(void)
 {
 -    g_dctx = daedalus_ctx_create_no_qpu();
 +    g_dctx = daedalus_ctx_create();
 }
 void ff_put_h264_qpel8_mc20_daedalus(uint8_t *dst, const uint8_t *src, ptrdiff_t stride);
@@ -33,18 +33,19 @@ FFMPEG_VERSION=8.1
 # epoch 2 matches Debian's stock ffmpeg (currently 7:7.1.x in trixie);
 # +rfourier suffix to avoid colliding with upstream/Debian rebuilds.
 PKGVER=2:${FFMPEG_VERSION}+rfourier+gb57fbbe
-PKGREL=9  # pkgrel=9 — restore AV_CODEC_FLAG_LOW_DELAY semantics in the
+PKGREL=11  # pkgrel=11 — libavcodec.so daedalus ctx flipped no_qpu → qpu-capable (PR #36 bench: QPU 4.30x)
-          # H.264 decoder (FFmpeg 8.x dropped them).  Fixes the 2-1-4-3
+           # (cycle 9 of the daedalus-v4l2#11 step 2 substitution arc; closes
-          # B-frame pair-swap that re-appeared in Firefox YouTube after
+           # the libavcodec.so substitution sequence 6 IDCT4 / 7 IDCT8 /
-          # the SONAME 61→62 jump (PR #75) silently neutered the
+           # 8 luma-v deblock / 9 qpel mc20).  Pulls daedalus-fourier PR #2
-          # daemon's ctx->flags |= AV_CODEC_FLAG_LOW_DELAY at
+           # which extends the public API with
-          # daemon/src/decoder.c:202.  Substitution arc unchanged.
+           # daedalus_recipe_dispatch_h264_qpel_mc20.  (2026-05-23)
          # (2026-05-22)
-# daedalus-fourier pin — first kernel substitution in libavcodec (cycle 6
+# daedalus-fourier pin.  209a421 = daedalus-fourier PR #2 merge — public
-# H.264 IDCT 4x4).  Same SHA as the daedalus-v4l2 daemon already ships
+# API now exposes daedalus_recipe_dispatch_h264_qpel_mc20 +
-# inline; rev in lockstep with the daemon when the public API rolls.
+# DAEDALUS_KERNEL_H264_QPEL_MC20.  Cycle 9 plumbs the last H.264 NEON
-DAEDALUS_FOURIER_COMMIT=d87239d8172307d9a1b93c95cbed116d175b85cc
+# kernel through the recipe layer.  Daemon-side build (debian/daedalus-v4l2)
 # can bump in a follow-up; this PR only changes the libavcodec.so consumer.
 DAEDALUS_FOURIER_COMMIT=b9f9ff2a89c068aea54dcb52b543afddad28311e  # PR #25 — public chroma DC Hadamard
 HERE=$(dirname "$(readlink -f "$0")")
@@ -72,6 +73,13 @@ patch -Np1 -i "$HERE/0003-h264-idct4-daedalus-fourier.patch"
 patch -Np1 -i "$HERE/0004-h264-idct8-daedalus-fourier.patch"
 patch -Np1 -i "$HERE/0005-h264-deblock-luma-v-daedalus-fourier.patch"
 patch -Np1 -i "$HERE/0006-h264-restore-low-delay.patch"
 patch -Np1 -i "$HERE/0007-h264-qpel-mc20-daedalus-fourier.patch"
 patch -Np1 -i "$HERE/0008-h264-deblock-luma-h-daedalus-fourier.patch"
 patch -Np1 -i "$HERE/0009-h264-deblock-chroma-daedalus-fourier.patch"
 patch -Np1 -i "$HERE/0010-h264-deblock-luma-intra-daedalus-fourier.patch"
 patch -Np1 -i "$HERE/0011-h264-chroma-dc-hadamard-daedalus-fourier.patch"
 patch -Np1 -i "$HERE/0012-h264-qpel-rest-daedalus-fourier.patch"
 patch -Np1 -i "$HERE/0013-h264-ctx-qpu-capable.patch"
 # --- daedalus-fourier: fetch + build static .a with PIC, install to a
 # per-build prefix; libavcodec.so links it into the shared object so
@@ -1,3 +1,37 @@
 ffmpeg-v4l2-request-fourier (2:8.1+rfourier+gb57fbbe-10) bookworm trixie; urgency=medium
  * Add 0007-h264-qpel-mc20-daedalus-fourier.patch —
    H264QpelContext.put_h264_qpel_pixels_tab[1][2] (8x8 luma
    horizontal half-pel, 6-tap "put" — the canonical representative
    of the H.264 luma motion-compensation family) now dispatches
    through daedalus_recipe_dispatch_h264_qpel_mc20 instead of
    ff_put_h264_qpel8_mc20_neon.  Cycle 9 of the daedalus-v4l2#11
    step 2 substitution arc; closes the 4-cycle libavcodec.so
    substitution sequence (6 IDCT4 / 7 IDCT8 / 8 luma-v deblock /
    9 qpel mc20).
  * Bumps daedalus-fourier pin d87239d → 209a421 (PR #2 — public
    API extended with daedalus_recipe_dispatch_h264_qpel_mc20 +
    DAEDALUS_KERNEL_H264_QPEL_MC20).
  * Cycle 9 is "CPU primary; QPU pointless" per
    docs/k9_h264qpel_mc20.md.  Per-block 7.6 ns at 131 Mblock/s
    gives 135x margin over 30 fps 1080p; QPU dispatch floor at
    ~250 ns makes any V3D shader strictly worse.  Substitution
    is plumbing-only, NEON-by-recipe — same
    daedalus_ctx_create_no_qpu pthread_once shape the cycles 6/7/8
    shims already own (kept SEPARATE from the H264DSP shim's ctx
    because H264QPEL is its own libavcodec Makefile module and
    link order does not guarantee a single .o owns the ctx symbol;
    one extra ~µs init per process, paid lazily on first MC call).
  * Other H.264 luma MC variants (mc02, mc11, mc22 etc.) and the
    16x16 size tier stay on the in-tree NEON .S code.  Per the
    cycle-9 phase-1 rationale, mc20 8x8 is representative of the
    whole family's per-block cost.
  * Bit-exact against ff_put_h264_qpel8_mc20_neon (daedalus-fourier
    cycle 9 green; 10000/10000 random blocks).
  * No SONAME change, no Depends change.
 -- Markus Fritsche <mfritsche@reauktion.de>  Sat, 23 May 2026 12:00:00 +0000
 ffmpeg-v4l2-request-fourier (2:8.1+rfourier+gb57fbbe-9) bookworm trixie; urgency=medium
  * Add 0006-h264-restore-low-delay.patch — restore the documented