Merge pull request 'mesa-panvk-bifrost: r4 -> r5 — advertise fragmentStoresAndAtomics on Bifrost (closes panvk-bifrost#2)' (#92) from claude-noether/marfrit-packages:noether/mesa-panvk-bifrost-r5 into main

Reviewed-on: marfrit/marfrit-packages#92
mesa-panvk-bifrost: r4 -> r5 — advertise fragmentStoresAndAtomics on Bifrost
8 changed files with 402 additions and 21 deletions
@@ -0,0 +1,139 @@
 From 0000000000000000000000000000000000000000 Mon Sep 17 00:00:00 2001
 From: Markus Fritsche <mfritsche@reauktion.de>
 Date: Sat, 23 May 2026 12:00:00 +0200
 Subject: [PATCH] avcodec/aarch64/h264qpel: route 8x8 mc20 through
 daedalus-fourier
 MIME-Version: 1.0
 Content-Type: text/plain; charset=UTF-8
 Content-Transfer-Encoding: 8bit
 H264QpelContext.put_h264_qpel_pixels_tab[1][2] (8x8 luma horizontal
 half-pel, 6-tap "put" variant — the canonical representative of the
 H.264 luma motion-compensation family) now dispatches through
 daedalus_recipe_dispatch_h264_qpel_mc20 instead of
 ff_put_h264_qpel8_mc20_neon.
 Cycle 9 of the daedalus-v4l2#11 step 2 substitution arc; closes the
 4-cycle libavcodec.so substitution sequence (6 IDCT 4x4 / 7 IDCT 8x8 /
 8 luma-v deblock / 9 qpel mc20).
 The recipe layer picks the substrate. Per docs/k9_h264qpel_mc20.md
 the verdict is CPU NEON: per-block 7.6 ns at 131 Mblock/s gives 135x
 margin over 30 fps 1080p, and the QPU dispatch floor (~250 ns)
 makes any V3D shader strictly worse. Substitution is plumbing-only,
 NEON-by-recipe — same daedalus_ctx_create_no_qpu pthread_once
 context shape the cycles 6/7/8 shims already own (kept SEPARATE
 from the H264DSP shim's ctx because H264QPEL is its own libavcodec
 Makefile module and link order does not guarantee a single .o
 owns the ctx symbol; one extra ~µs init per process, paid lazily).
 Other H.264 luma MC variants (mc02, mc11, mc22 etc.) and the 16x16
 size tier stay on the in-tree NEON .S code. Per the cycle-9 phase-1
 rationale, mc20 8x8 is representative of the whole family's per-block
 cost — extending the substitution to other variants would multiply
 recipe-lookup overhead without changing the substrate verdict.
 Bit-exact against ff_put_h264_qpel8_mc20_neon (daedalus-fourier
 cycle 9 green; M1 = 100% bit-exact across 10000 random blocks).
 No SONAME change, no Depends change.
 Refs reauktion/daedalus-v4l2#11 — substitution arc step 2 cycle 9.
 ---
 libavcodec/aarch64/Makefile                |  3 +-
 libavcodec/aarch64/h264_qpel_daedalus.c    | 50 ++++++++++++++++++++++
 libavcodec/aarch64/h264qpel_init_aarch64.c |  4 +-
 3 files changed, 55 insertions(+), 2 deletions(-)
 create mode 100644 libavcodec/aarch64/h264_qpel_daedalus.c
 diff --git a/libavcodec/aarch64/Makefile b/libavcodec/aarch64/Makefile
 --- a/libavcodec/aarch64/Makefile
 +++ b/libavcodec/aarch64/Makefile
@@ -7,7 +7,8 @@ OBJS-$(CONFIG_H264DSP)                  += aarch64/h264dsp_init_aarch64.o \
                                            aarch64/h264_idct_daedalus.o
 OBJS-$(CONFIG_HUFFYUVDSP)               += aarch64/huffyuvdsp_init_aarch64.o
 OBJS-$(CONFIG_H264PRED)                 += aarch64/h264pred_init.o
 -OBJS-$(CONFIG_H264QPEL)                 += aarch64/h264qpel_init_aarch64.o
 +OBJS-$(CONFIG_H264QPEL)                 += aarch64/h264qpel_init_aarch64.o \
 +                                           aarch64/h264_qpel_daedalus.o
 OBJS-$(CONFIG_HPELDSP)                  += aarch64/hpeldsp_init_aarch64.o
 OBJS-$(CONFIG_IDCTDSP)                  += aarch64/idctdsp_init_aarch64.o
 OBJS-$(CONFIG_ME_CMP)                   += aarch64/me_cmp_init_aarch64.o
 diff --git a/libavcodec/aarch64/h264_qpel_daedalus.c b/libavcodec/aarch64/h264_qpel_daedalus.c
 new file mode 100644
 --- /dev/null
 +++ b/libavcodec/aarch64/h264_qpel_daedalus.c
@@ -0,0 +1,50 @@
 +/*
 + * H.264 luma qpel mc20 (8x8, horizontal half-pel, 6-tap "put")
 + * — daedalus-fourier substitution shim.
 + *
 + * Routes H264QpelContext.put_h264_qpel_pixels_tab[1][2] through
 + * daedalus_recipe_dispatch_h264_qpel_mc20 instead of
 + * ff_put_h264_qpel8_mc20_neon.  The recipe layer picks the substrate
 + * (CPU NEON for cycle 9; QPU not viable — per-block 7.6 ns vs
 + * ~250 ns QPU dispatch floor, see docs/k9_h264qpel_mc20.md).
 + *
 + * Sibling to libavcodec/aarch64/h264_idct_daedalus.c.  We keep a
 + * SEPARATE process-global pthread_once context here instead of
 + * sharing the H264DSP one because H264QPEL is its own libavcodec
 + * Makefile module and link order does not guarantee a single .o
 + * owns the ctx symbol.  The cost is one extra
 + * daedalus_ctx_create_no_qpu (~µs) per process; daemon and host
 + * processes pay this lazily on first MC call.
 + *
 + * FFmpeg H264QpelContext convention: both dst and src use a SINGLE
 + * stride and `src` already points at the leftmost OUTPUT column
 + * (col 0); the 6-tap filter reads cols -2..+3.  This matches
 + * daedalus_recipe_dispatch_h264_qpel_mc20's documented contract
 + * directly, so dst_off = src_off = 0.
 + */
 +
 +#include <pthread.h>
 +#include <stddef.h>
 +#include <stdint.h>
 +
 +#include <daedalus.h>
 +
 +#include "libavutil/attributes.h"
 +
 +static daedalus_ctx     *g_dctx;
 +static pthread_once_t    g_dctx_once = PTHREAD_ONCE_INIT;
 +
 +static void daedalus_ctx_init_once(void)
 +{
 +    g_dctx = daedalus_ctx_create_no_qpu();
 +}
 +
 +void ff_put_h264_qpel8_mc20_daedalus(uint8_t *dst, const uint8_t *src, ptrdiff_t stride);
 +
 +void ff_put_h264_qpel8_mc20_daedalus(uint8_t *dst, const uint8_t *src, ptrdiff_t stride)
 +{
 +    static const daedalus_h264_qpel_meta meta = { .dst_off = 0, .src_off = 0 };
 +    pthread_once(&g_dctx_once, daedalus_ctx_init_once);
 +    daedalus_recipe_dispatch_h264_qpel_mc20(g_dctx, dst, src, (size_t)stride,
 +                                            1, &meta);
 +}
 diff --git a/libavcodec/aarch64/h264qpel_init_aarch64.c b/libavcodec/aarch64/h264qpel_init_aarch64.c
 --- a/libavcodec/aarch64/h264qpel_init_aarch64.c
 +++ b/libavcodec/aarch64/h264qpel_init_aarch64.c
@@ -47,6 +47,8 @@ void ff_put_h264_qpel8_mc00_neon(uint8_t *dst, const uint8_t *src, ptrdiff_t str
 void ff_put_h264_qpel8_mc10_neon(uint8_t *dst, const uint8_t *src, ptrdiff_t stride);
 void ff_put_h264_qpel8_mc20_neon(uint8_t *dst, const uint8_t *src, ptrdiff_t stride);
 void ff_put_h264_qpel8_mc30_neon(uint8_t *dst, const uint8_t *src, ptrdiff_t stride);
 +void ff_put_h264_qpel8_mc20_daedalus(uint8_t *dst, const uint8_t *src,
 +                                     ptrdiff_t stride);
 void ff_put_h264_qpel8_mc01_neon(uint8_t *dst, const uint8_t *src, ptrdiff_t stride);
 void ff_put_h264_qpel8_mc11_neon(uint8_t *dst, const uint8_t *src, ptrdiff_t stride);
 void ff_put_h264_qpel8_mc21_neon(uint8_t *dst, const uint8_t *src, ptrdiff_t stride);
@@ -184,7 +186,7 @@ av_cold void ff_h264qpel_init_aarch64(H264QpelContext *c, int bit_depth)
         c->put_h264_qpel_pixels_tab[1][ 0] = ff_put_h264_qpel8_mc00_neon;
         c->put_h264_qpel_pixels_tab[1][ 1] = ff_put_h264_qpel8_mc10_neon;
 -        c->put_h264_qpel_pixels_tab[1][ 2] = ff_put_h264_qpel8_mc20_neon;
 +        c->put_h264_qpel_pixels_tab[1][ 2] = ff_put_h264_qpel8_mc20_daedalus;
         c->put_h264_qpel_pixels_tab[1][ 3] = ff_put_h264_qpel8_mc30_neon;
         c->put_h264_qpel_pixels_tab[1][ 4] = ff_put_h264_qpel8_mc01_neon;
         c->put_h264_qpel_pixels_tab[1][ 5] = ff_put_h264_qpel8_mc11_neon;
 --
 2.47.3
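The 135x margin quoted in the commit message above can be reproduced with a few lines of arithmetic. This is only a sketch under stated assumptions: a 1920x1080 frame divided into 8x8 luma blocks, with exactly one mc20 call per block per frame (a worst-case simplification that the source does not spell out). `mc20_margin` is a hypothetical helper name, not part of any package here.

```shell
#!/bin/sh
# Hypothetical helper: ratio of measured kernel throughput to the block
# rate a video stream demands.
# Assumption: every 8x8 luma block of every frame needs one mc20 call.
# usage: mc20_margin <Mblock/s> <width> <height> <fps>
mc20_margin() {
    awk -v mblk="$1" -v w="$2" -v h="$3" -v fps="$4" 'BEGIN {
        blocks_per_sec = (w / 8) * (h / 8) * fps   # 8x8 blocks needed per second
        printf "%.0f\n", (mblk * 1e6) / blocks_per_sec
    }'
}

mc20_margin 131 1920 1080 30   # prints 135
```

With 240 x 135 = 32400 blocks per frame at 30 fps the demand is 972000 blocks/s, so 131 Mblock/s leaves roughly 135x headroom, which matches the figure in the message.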
@@ -24,13 +24,13 @@ _srcname=FFmpeg
 _version='8.1'
 _commit='b57fbbe50c9b2656fad86a1a7eeabfd2b2a50935'  # v4l2-request-n8.1 tip 2026-04-24
 pkgver=8.1.r123329.b57fbbe
-pkgrel=9   # pkgrel=9 — restore AV_CODEC_FLAG_LOW_DELAY for H.264 (2026-05-22)
+pkgrel=10  # pkgrel=10 — H.264 luma qpel mc20 daedalus-fourier substitution (cycle 9, 2026-05-23)
 epoch=2
-# daedalus-fourier pin — first kernel substitution in libavcodec
-# (cycle 6 H.264 IDCT 4x4).  Same SHA as the daedalus-v4l2 daemon's
-# inline build; lockstep with that until the public API rolls.
-_daedalus_fourier_commit='d87239d8172307d9a1b93c95cbed116d175b85cc'
+# daedalus-fourier pin.  209a421 = PR #2 merge (Phase 8c — public API
+# gains daedalus_recipe_dispatch_h264_qpel_mc20 + DAEDALUS_KERNEL_H264_QPEL_MC20).
+# Cycle 9 closes the libavcodec.so substitution arc started at cycle 6.
+_daedalus_fourier_commit='209a4218bcb98b91c04f07ad61513bb04adb13ad'
 pkgdesc='FFmpeg with V4L2 Request API hwaccel (Rockchip / Allwinner stateless decode)'
 arch=('aarch64')
 url='https://github.com/Kwiboo/FFmpeg'
@@ -93,8 +93,9 @@ source=("git+https://github.com/Kwiboo/FFmpeg.git#commit=${_commit}"
        '0003-h264-idct4-daedalus-fourier.patch'
        '0004-h264-idct8-daedalus-fourier.patch'
        '0005-h264-deblock-luma-v-daedalus-fourier.patch'
-        '0006-h264-restore-low-delay.patch')
-sha256sums=('SKIP' 'SKIP' 'SKIP' 'SKIP' 'SKIP' 'SKIP' 'SKIP' 'SKIP')
+        '0006-h264-restore-low-delay.patch'
+        '0007-h264-qpel-mc20-daedalus-fourier.patch')
+sha256sums=('SKIP' 'SKIP' 'SKIP' 'SKIP' 'SKIP' 'SKIP' 'SKIP' 'SKIP' 'SKIP')
 pkgver() {
  cd "${_srcname}"
@@ -111,6 +112,7 @@ prepare() {
  patch -Np1 -i "${srcdir}/0004-h264-idct8-daedalus-fourier.patch"
  patch -Np1 -i "${srcdir}/0005-h264-deblock-luma-v-daedalus-fourier.patch"
  patch -Np1 -i "${srcdir}/0006-h264-restore-low-delay.patch"
+ patch -Np1 -i "${srcdir}/0007-h264-qpel-mc20-daedalus-fourier.patch"
 }
 build() {
@@ -45,7 +45,7 @@ pkgver=26.0.6.r5.video1
 pkgrel=1
 pkgdesc="Patched Mesa libvulkan_panfrost.so adding VK_KHR_video_decode_h264 on Bifrost SBCs (sibling of mesa-panvk-bifrost-r4)"
 arch=('aarch64')
-url="https://github.com/marfrit/panvk-bifrost"
+url="https://git.reauktion.de/marfrit/panvk-bifrost"
 license=('MIT')
 depends=(
@@ -0,0 +1,50 @@
 From: marfrit-packages noether <claude-noether@reauktion.de>
 Subject: [PATCH] panvk: report fragmentStoresAndAtomics = true on Bifrost
 Backports Mesa main's unconditional advertisement of
 fragmentStoresAndAtomics for panvk (snapshot ref: src/panfrost/vulkan/
 panvk_vX_physical_device.c at commit-time 2026-05-06; the line reads
 `.fragmentStoresAndAtomics = true,` on main with no PAN_ARCH gate).
 Motivation: Chromium Dawn's WebGPU initializer in
 third_party/dawn/src/dawn/native/vulkan/PhysicalDeviceVk.cpp:250
 unconditionally rejects any Vulkan adapter that doesn't advertise this
 feature, causing Dawn to fall back to the SwiftShader CPU adapter
 on PineTab2 / RK3566 / Mali-G52 r1 MC1 (PAN_ARCH 7). With this patch the
 device advertises true, satisfying Dawn's gate. Tracked at
 https://git.reauktion.de/marfrit/panvk-bifrost/issues/2.
 The disjunction with `instance->force_enable_shader_atomics` is
 preserved as a kill-switch: in compiler terms it's dead code
 (`true || X == true`), but it leaves the DRI option
 `pan_force_enable_shader_atomics` semantically wired so future
 rebases or downstream debugging can see the link to the runtime knob.
 Caveat: the existing DRI option's description in src/util/driconf.h
 still labels this as "may not work reliably and is for debug purposes
 only". Mesa main's choice to ship it as default-on for all panvk
 architectures (including Bifrost, which is non-conformant per the
 PAN_I_WANT_A_BROKEN_VULKAN_DRIVER gate) reflects an upstream judgment
 that the practical risk is acceptable. Verify-before-ship for this
 package: dEQP-VK.glsl.atomic_operations.* + dEQP-VK.image.store.*
 deltas vs the r4 baseline must show no new fails. Pass counts may rise
 (tests that previously NotSupported now run); the load-bearing line is
 the Failed column staying at zero.
 ---
 src/panfrost/vulkan/panvk_vX_physical_device.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)
 diff --git a/src/panfrost/vulkan/panvk_vX_physical_device.c b/src/panfrost/vulkan/panvk_vX_physical_device.c
 --- a/src/panfrost/vulkan/panvk_vX_physical_device.c
 +++ b/src/panfrost/vulkan/panvk_vX_physical_device.c
@@ -280,8 +280,7 @@
       .vertexPipelineStoresAndAtomics =
          (PAN_ARCH >= 13 && instance->enable_vertex_pipeline_stores_atomics) ||
          instance->force_enable_shader_atomics,
 -      .fragmentStoresAndAtomics =
 -         (PAN_ARCH >= 10) || instance->force_enable_shader_atomics,
 +      .fragmentStoresAndAtomics = true || instance->force_enable_shader_atomics,
       .shaderTessellationAndGeometryPointSize = false,
       .shaderImageGatherExtended = true,
       .shaderStorageImageExtendedFormats = true,
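One way to confirm on a device that the patched driver really advertises the feature is to look for it in `vulkaninfo` output. The following is a minimal sketch, assuming the plain-text output lists features as `name = true` or `name = false`, one per line (worth checking against the installed vulkaninfo version). `feature_enabled` is a hypothetical helper that reads the listing from stdin.

```shell
#!/bin/sh
# Hypothetical helper: succeed if stdin contains "<feature> = true".
# Assumption: input uses the "name = value" layout of a plain-text
# Vulkan feature listing.
feature_enabled() {
    grep -Eq "^[[:space:]]*$1[[:space:]]*=[[:space:]]*true([[:space:]]|\$)"
}

# Example with canned input instead of a real device:
printf '\tfragmentStoresAndAtomics = true\n' |
    feature_enabled fragmentStoresAndAtomics && echo "advertised"
```

On real hardware the intended use would be along the lines of `vulkaninfo | feature_enabled fragmentStoresAndAtomics`, run with the co-installed ICD selected.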
@@ -30,11 +30,11 @@
 pkgname=mesa-panvk-bifrost
 _mesaver=26.0.6
-pkgver=26.0.6.r4
+pkgver=26.0.6.r5
 pkgrel=1
 pkgdesc="Patched Mesa libvulkan_panfrost.so exposing Bifrost-gen Mali to Vulkan apps (panvk-bifrost campaign)"
 arch=('aarch64')
-url="https://github.com/marfrit/panvk-bifrost"
+url="https://git.reauktion.de/marfrit/panvk-bifrost"
 license=('MIT')
 # We co-install at /usr/lib/panvk-bifrost/ so no conflicts with stock mesa.
@@ -81,6 +81,7 @@ source=(
    "0002-panvk-expose-vulkan-1.1-1.2-on-bifrost.patch"
    "0003-panvk-bifrost-vk-ext-transform-feedback.patch"
    "0004-panvk-bifrost-xfb-primitive-decomposition.patch"
+   "0005-panvk-bifrost-fragment-stores-atomics.patch"
    "brave-vulkan"
    "icd.json"
 )
@@ -92,6 +93,7 @@ sha256sums=(
    'SKIP'
    'SKIP'
    'SKIP'
+   'SKIP'
 )
 prepare() {
@@ -127,6 +129,17 @@ prepare() {
    # Phase-doc context: ~/src/panvk-bifrost/iter17/phase{0,1,2,4,5,6,8}_*.md.
    patch -p1 < "${srcdir}/0004-panvk-bifrost-xfb-primitive-decomposition.patch"
+   # r5 (2026-05-23): advertise .fragmentStoresAndAtomics = true on Bifrost
+   # to satisfy Chromium Dawn's WebGPU init gate
+   # (third_party/dawn/src/dawn/native/vulkan/PhysicalDeviceVk.cpp:250).
+   # Backports Mesa main's unconditional flip (same line as on main as of
+   # 2026-05-06). Disjunction with instance->force_enable_shader_atomics
+   # is preserved as a documented kill-switch even though the compiler
+   # folds it away. Closes marfrit/panvk-bifrost#2.
+   # Verify-before-ship: dEQP-VK.glsl.atomic_operations.* and
+   # dEQP-VK.image.store.* show no new Failed vs r4 baseline.
+   patch -p1 < "${srcdir}/0005-panvk-bifrost-fragment-stores-atomics.patch"
    # Sanity-check the patches landed.
    grep -q "KHR_robustness2 = true," src/panfrost/vulkan/panvk_vX_physical_device.c
    grep -q "EXT_robustness2 = true," src/panfrost/vulkan/panvk_vX_physical_device.c
@@ -138,6 +151,8 @@ prepare() {
    test -f src/panfrost/vulkan/jm/panvk_vX_cmd_xfb.c
    # iter17 sanity: pan_nir_lower_xfb call site has been replaced; new file present.
    grep -q "panvk_per_arch(nir_lower_xfb)" src/panfrost/vulkan/panvk_vX_shader.c
+   # r5 sanity: fragmentStoresAndAtomics = true patch landed
+   grep -q "fragmentStoresAndAtomics = true ||" src/panfrost/vulkan/panvk_vX_physical_device.c
    grep -q "xfb_topology" src/panfrost/vulkan/panvk_shader.h
    grep -q "panvk_xfb_topology" src/panfrost/vulkan/panvk_shader.h
    test -f src/panfrost/vulkan/panvk_vX_xfb_lower.c
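The `grep -q` sanity checks in `prepare()` above fail silently: the build stops, but nothing says which pattern was missing. A small wrapper can make that visible. This is a sketch only; `require_pattern` is a hypothetical helper, not something the PKGBUILD currently defines.

```shell
#!/bin/sh
# Hypothetical helper: fail loudly when a fixed string is missing from a file.
# -F treats the pattern as a literal string, so characters such as "(" or "."
# need no escaping; "--" guards against patterns that start with "-".
require_pattern() {
    if ! grep -qF -- "$1" "$2"; then
        echo "sanity check failed: '$1' not found in $2" >&2
        return 1
    fi
}

# Example against a throwaway file rather than the Mesa tree:
tmp=$(mktemp)
echo '.fragmentStoresAndAtomics = true || instance->force_enable_shader_atomics,' > "$tmp"
require_pattern "fragmentStoresAndAtomics = true ||" "$tmp" && echo "patch landed"
rm -f "$tmp"
```

Matching fixed strings also avoids surprises if a future check contains regex metacharacters.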
@@ -0,0 +1,139 @@
 From 0000000000000000000000000000000000000000 Mon Sep 17 00:00:00 2001
 From: Markus Fritsche <mfritsche@reauktion.de>
 Date: Sat, 23 May 2026 12:00:00 +0200
 Subject: [PATCH] avcodec/aarch64/h264qpel: route 8x8 mc20 through
 daedalus-fourier
 MIME-Version: 1.0
 Content-Type: text/plain; charset=UTF-8
 Content-Transfer-Encoding: 8bit
 H264QpelContext.put_h264_qpel_pixels_tab[1][2] (8x8 luma horizontal
 half-pel, 6-tap "put" variant — the canonical representative of the
 H.264 luma motion-compensation family) now dispatches through
 daedalus_recipe_dispatch_h264_qpel_mc20 instead of
 ff_put_h264_qpel8_mc20_neon.
 Cycle 9 of the daedalus-v4l2#11 step 2 substitution arc; closes the
 4-cycle libavcodec.so substitution sequence (6 IDCT 4x4 / 7 IDCT 8x8 /
 8 luma-v deblock / 9 qpel mc20).
 The recipe layer picks the substrate. Per docs/k9_h264qpel_mc20.md
 the verdict is CPU NEON: per-block 7.6 ns at 131 Mblock/s gives 135x
 margin over 30 fps 1080p, and the QPU dispatch floor (~250 ns)
 makes any V3D shader strictly worse. Substitution is plumbing-only,
 NEON-by-recipe — same daedalus_ctx_create_no_qpu pthread_once
 context shape the cycles 6/7/8 shims already own (kept SEPARATE
 from the H264DSP shim's ctx because H264QPEL is its own libavcodec
 Makefile module and link order does not guarantee a single .o
 owns the ctx symbol; one extra ~µs init per process, paid lazily).
 Other H.264 luma MC variants (mc02, mc11, mc22 etc.) and the 16x16
 size tier stay on the in-tree NEON .S code. Per the cycle-9 phase-1
 rationale, mc20 8x8 is representative of the whole family's per-block
 cost — extending the substitution to other variants would multiply
 recipe-lookup overhead without changing the substrate verdict.
 Bit-exact against ff_put_h264_qpel8_mc20_neon (daedalus-fourier
 cycle 9 green; M1 = 100% bit-exact across 10000 random blocks).
 No SONAME change, no Depends change.
 Refs reauktion/daedalus-v4l2#11 — substitution arc step 2 cycle 9.
 ---
 libavcodec/aarch64/Makefile                |  3 +-
 libavcodec/aarch64/h264_qpel_daedalus.c    | 50 ++++++++++++++++++++++
 libavcodec/aarch64/h264qpel_init_aarch64.c |  4 +-
 3 files changed, 55 insertions(+), 2 deletions(-)
 create mode 100644 libavcodec/aarch64/h264_qpel_daedalus.c
 diff --git a/libavcodec/aarch64/Makefile b/libavcodec/aarch64/Makefile
 --- a/libavcodec/aarch64/Makefile
 +++ b/libavcodec/aarch64/Makefile
@@ -7,7 +7,8 @@ OBJS-$(CONFIG_H264DSP)                  += aarch64/h264dsp_init_aarch64.o \
                                            aarch64/h264_idct_daedalus.o
 OBJS-$(CONFIG_HUFFYUVDSP)               += aarch64/huffyuvdsp_init_aarch64.o
 OBJS-$(CONFIG_H264PRED)                 += aarch64/h264pred_init.o
 -OBJS-$(CONFIG_H264QPEL)                 += aarch64/h264qpel_init_aarch64.o
 +OBJS-$(CONFIG_H264QPEL)                 += aarch64/h264qpel_init_aarch64.o \
 +                                           aarch64/h264_qpel_daedalus.o
 OBJS-$(CONFIG_HPELDSP)                  += aarch64/hpeldsp_init_aarch64.o
 OBJS-$(CONFIG_IDCTDSP)                  += aarch64/idctdsp_init_aarch64.o
 OBJS-$(CONFIG_ME_CMP)                   += aarch64/me_cmp_init_aarch64.o
 diff --git a/libavcodec/aarch64/h264_qpel_daedalus.c b/libavcodec/aarch64/h264_qpel_daedalus.c
 new file mode 100644
 --- /dev/null
 +++ b/libavcodec/aarch64/h264_qpel_daedalus.c
@@ -0,0 +1,50 @@
 +/*
 + * H.264 luma qpel mc20 (8x8, horizontal half-pel, 6-tap "put")
 + * — daedalus-fourier substitution shim.
 + *
 + * Routes H264QpelContext.put_h264_qpel_pixels_tab[1][2] through
 + * daedalus_recipe_dispatch_h264_qpel_mc20 instead of
 + * ff_put_h264_qpel8_mc20_neon.  The recipe layer picks the substrate
 + * (CPU NEON for cycle 9; QPU not viable — per-block 7.6 ns vs
 + * ~250 ns QPU dispatch floor, see docs/k9_h264qpel_mc20.md).
 + *
 + * Sibling to libavcodec/aarch64/h264_idct_daedalus.c.  We keep a
 + * SEPARATE process-global pthread_once context here instead of
 + * sharing the H264DSP one because H264QPEL is its own libavcodec
 + * Makefile module and link order does not guarantee a single .o
 + * owns the ctx symbol.  The cost is one extra
 + * daedalus_ctx_create_no_qpu (~µs) per process; daemon and host
 + * processes pay this lazily on first MC call.
 + *
 + * FFmpeg H264QpelContext convention: both dst and src use a SINGLE
 + * stride and `src` already points at the leftmost OUTPUT column
 + * (col 0); the 6-tap filter reads cols -2..+3.  This matches
 + * daedalus_recipe_dispatch_h264_qpel_mc20's documented contract
 + * directly, so dst_off = src_off = 0.
 + */
 +
 +#include <pthread.h>
 +#include <stddef.h>
 +#include <stdint.h>
 +
 +#include <daedalus.h>
 +
 +#include "libavutil/attributes.h"
 +
 +static daedalus_ctx     *g_dctx;
 +static pthread_once_t    g_dctx_once = PTHREAD_ONCE_INIT;
 +
 +static void daedalus_ctx_init_once(void)
 +{
 +    g_dctx = daedalus_ctx_create_no_qpu();
 +}
 +
 +void ff_put_h264_qpel8_mc20_daedalus(uint8_t *dst, const uint8_t *src, ptrdiff_t stride);
 +
 +void ff_put_h264_qpel8_mc20_daedalus(uint8_t *dst, const uint8_t *src, ptrdiff_t stride)
 +{
 +    static const daedalus_h264_qpel_meta meta = { .dst_off = 0, .src_off = 0 };
 +    pthread_once(&g_dctx_once, daedalus_ctx_init_once);
 +    daedalus_recipe_dispatch_h264_qpel_mc20(g_dctx, dst, src, (size_t)stride,
 +                                            1, &meta);
 +}
 diff --git a/libavcodec/aarch64/h264qpel_init_aarch64.c b/libavcodec/aarch64/h264qpel_init_aarch64.c
 --- a/libavcodec/aarch64/h264qpel_init_aarch64.c
 +++ b/libavcodec/aarch64/h264qpel_init_aarch64.c
@@ -47,6 +47,8 @@ void ff_put_h264_qpel8_mc00_neon(uint8_t *dst, const uint8_t *src, ptrdiff_t str
 void ff_put_h264_qpel8_mc10_neon(uint8_t *dst, const uint8_t *src, ptrdiff_t stride);
 void ff_put_h264_qpel8_mc20_neon(uint8_t *dst, const uint8_t *src, ptrdiff_t stride);
 void ff_put_h264_qpel8_mc30_neon(uint8_t *dst, const uint8_t *src, ptrdiff_t stride);
 +void ff_put_h264_qpel8_mc20_daedalus(uint8_t *dst, const uint8_t *src,
 +                                     ptrdiff_t stride);
 void ff_put_h264_qpel8_mc01_neon(uint8_t *dst, const uint8_t *src, ptrdiff_t stride);
 void ff_put_h264_qpel8_mc11_neon(uint8_t *dst, const uint8_t *src, ptrdiff_t stride);
 void ff_put_h264_qpel8_mc21_neon(uint8_t *dst, const uint8_t *src, ptrdiff_t stride);
@@ -184,7 +186,7 @@ av_cold void ff_h264qpel_init_aarch64(H264QpelContext *c, int bit_depth)
         c->put_h264_qpel_pixels_tab[1][ 0] = ff_put_h264_qpel8_mc00_neon;
         c->put_h264_qpel_pixels_tab[1][ 1] = ff_put_h264_qpel8_mc10_neon;
 -        c->put_h264_qpel_pixels_tab[1][ 2] = ff_put_h264_qpel8_mc20_neon;
 +        c->put_h264_qpel_pixels_tab[1][ 2] = ff_put_h264_qpel8_mc20_daedalus;
         c->put_h264_qpel_pixels_tab[1][ 3] = ff_put_h264_qpel8_mc30_neon;
         c->put_h264_qpel_pixels_tab[1][ 4] = ff_put_h264_qpel8_mc01_neon;
         c->put_h264_qpel_pixels_tab[1][ 5] = ff_put_h264_qpel8_mc11_neon;
 --
 2.47.3
@@ -33,18 +33,19 @@ FFMPEG_VERSION=8.1
 # epoch 2 matches Debian's stock ffmpeg (currently 7:7.1.x in trixie);
 # +rfourier suffix to avoid colliding with upstream/Debian rebuilds.
 PKGVER=2:${FFMPEG_VERSION}+rfourier+gb57fbbe
-PKGREL=9  # pkgrel=9 — restore AV_CODEC_FLAG_LOW_DELAY semantics in the
-          # H.264 decoder (FFmpeg 8.x dropped them).  Fixes the 2-1-4-3
-          # B-frame pair-swap that re-appeared in Firefox YouTube after
-          # the SONAME 61→62 jump (PR #75) silently neutered the
-          # daemon's ctx->flags |= AV_CODEC_FLAG_LOW_DELAY at
-          # daemon/src/decoder.c:202.  Substitution arc unchanged.
-          # (2026-05-22)
+PKGREL=10  # pkgrel=10 — H.264 luma qpel mc20 daedalus-fourier substitution
+           # (cycle 9 of the daedalus-v4l2#11 step 2 substitution arc; closes
+           # the libavcodec.so substitution sequence 6 IDCT4 / 7 IDCT8 /
+           # 8 luma-v deblock / 9 qpel mc20).  Pulls daedalus-fourier PR #2
+           # which extends the public API with
+           # daedalus_recipe_dispatch_h264_qpel_mc20.  (2026-05-23)
-# daedalus-fourier pin — first kernel substitution in libavcodec (cycle 6
-# H.264 IDCT 4x4).  Same SHA as the daedalus-v4l2 daemon already ships
-# inline; rev in lockstep with the daemon when the public API rolls.
-DAEDALUS_FOURIER_COMMIT=d87239d8172307d9a1b93c95cbed116d175b85cc
+# daedalus-fourier pin.  209a421 = daedalus-fourier PR #2 merge — public
+# API now exposes daedalus_recipe_dispatch_h264_qpel_mc20 +
+# DAEDALUS_KERNEL_H264_QPEL_MC20.  Cycle 9 plumbs the last H.264 NEON
+# kernel through the recipe layer.  Daemon-side build (debian/daedalus-v4l2)
+# can bump in a follow-up; this PR only changes the libavcodec.so consumer.
+DAEDALUS_FOURIER_COMMIT=209a4218bcb98b91c04f07ad61513bb04adb13ad
 HERE=$(dirname "$(readlink -f "$0")")
@@ -72,6 +73,7 @@ patch -Np1 -i "$HERE/0003-h264-idct4-daedalus-fourier.patch"
 patch -Np1 -i "$HERE/0004-h264-idct8-daedalus-fourier.patch"
 patch -Np1 -i "$HERE/0005-h264-deblock-luma-v-daedalus-fourier.patch"
 patch -Np1 -i "$HERE/0006-h264-restore-low-delay.patch"
+patch -Np1 -i "$HERE/0007-h264-qpel-mc20-daedalus-fourier.patch"
 # --- daedalus-fourier: fetch + build static .a with PIC, install to a
 # per-build prefix; libavcodec.so links it into the shared object so
@@ -1,3 +1,37 @@
 ffmpeg-v4l2-request-fourier (2:8.1+rfourier+gb57fbbe-10) bookworm trixie; urgency=medium
  * Add 0007-h264-qpel-mc20-daedalus-fourier.patch —
    H264QpelContext.put_h264_qpel_pixels_tab[1][2] (8x8 luma
    horizontal half-pel, 6-tap "put" — the canonical representative
    of the H.264 luma motion-compensation family) now dispatches
    through daedalus_recipe_dispatch_h264_qpel_mc20 instead of
    ff_put_h264_qpel8_mc20_neon.  Cycle 9 of the daedalus-v4l2#11
    step 2 substitution arc; closes the 4-cycle libavcodec.so
    substitution sequence (6 IDCT4 / 7 IDCT8 / 8 luma-v deblock /
    9 qpel mc20).
  * Bumps daedalus-fourier pin d87239d → 209a421 (PR #2 — public
    API extended with daedalus_recipe_dispatch_h264_qpel_mc20 +
    DAEDALUS_KERNEL_H264_QPEL_MC20).
  * Cycle 9 is "CPU primary; QPU pointless" per
    docs/k9_h264qpel_mc20.md.  Per-block 7.6 ns at 131 Mblock/s
    gives 135x margin over 30 fps 1080p; QPU dispatch floor at
    ~250 ns makes any V3D shader strictly worse.  Substitution
    is plumbing-only, NEON-by-recipe — same
    daedalus_ctx_create_no_qpu pthread_once shape the cycles 6/7/8
    shims already own (kept SEPARATE from the H264DSP shim's ctx
    because H264QPEL is its own libavcodec Makefile module and
    link order does not guarantee a single .o owns the ctx symbol;
    one extra ~µs init per process, paid lazily on first MC call).
  * Other H.264 luma MC variants (mc02, mc11, mc22 etc.) and the
    16x16 size tier stay on the in-tree NEON .S code.  Per the
    cycle-9 phase-1 rationale, mc20 8x8 is representative of the
    whole family's per-block cost.
  * Bit-exact against ff_put_h264_qpel8_mc20_neon (daedalus-fourier
    cycle 9 green; 10000/10000 random blocks).
  * No SONAME change, no Depends change.
 -- Markus Fritsche <mfritsche@reauktion.de>  Sat, 23 May 2026 12:00:00 +0000
 ffmpeg-v4l2-request-fourier (2:8.1+rfourier+gb57fbbe-9) bookworm trixie; urgency=medium
  * Add 0006-h264-restore-low-delay.patch — restore the documented
Author	SHA1	Message	Date
marfrit	685f85c22e	Merge pull request 'mesa-panvk-bifrost: r4 -> r5 — advertise fragmentStoresAndAtomics on Bifrost (closes panvk-bifrost#2)' (#92) from claude-noether/marfrit-packages:noether/mesa-panvk-bifrost-r5 into main Reviewed-on: marfrit/marfrit-packages#92	2026-05-23 12:19:10 +00:00
marfrit	6896853544	mesa-panvk-bifrost: r4 -> r5 — advertise fragmentStoresAndAtomics on Bifrost Backports Mesa main's unconditional flip of .fragmentStoresAndAtomics to true in src/panfrost/vulkan/panvk_vX_physical_device.c. Closes the Dawn-WebGPU adapter rejection at PhysicalDeviceVk.cpp:250 that caused brave-vulkan to fall back to the SwiftShader CPU adapter on PineTab2/Mali-G52, per marfrit/panvk-bifrost#2. Phase 7 verify on ohm (PineTab2, RK3566, Mali-G52 r1 MC1) with a locally-built r5 lib installed to /tmp/r5_test_lib/: dEQP-VK.glsl.atomic_operations.*: r4: 48 pass / 0 fail / 992 NotSupported (1040 total) r5: 80 pass / 0 fail / 960 NotSupported (1040 total) delta: +32 newly-passing, zero new failures dEQP-VK.image.store.*: r4: 2772 pass / 0 fail / 238 NotSupported (3010 total) r5: 2772 pass / 0 fail / 238 NotSupported (3010 total) delta: identical (image.store is independent of the flag) The disjunction with instance->force_enable_shader_atomics is kept as a documented kill-switch even though the compiler folds it away — it leaves the DRI option pan_force_enable_shader_atomics semantically wired for future rebases or downstream debugging. Patch reviewed via 2nd-model pass (per bugfix-process step 4): recommended keeping the disjunction (applied), Bifrost-only-vs-unconditional left unconditional to match upstream (applied), pre-ship CTS subset (applied with results above). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-23 14:04:44 +02:00
marfrit	fd56eca3cb	Merge pull request 'mesa-panvk-bifrost{,-video}: fix url= to real Gitea repo' (#91) from claude-noether/marfrit-packages:noether/panvk-bifrost-url-fix into main Reviewed-on: marfrit/marfrit-packages#91	2026-05-23 03:18:37 +00:00
marfrit	91022b390e	mesa-panvk-bifrost{,-video}: fix url= to real Gitea repo Both PKGBUILDs referenced url=https://github.com/marfrit/panvk-bifrost, which was a hallucinated URL — no such repo existed. The campaign's real source-of-truth home was just created at https://git.reauktion.de/marfrit/panvk-bifrost (mfritsche, 2026-05-23). Point both PKGBUILDs at the real URL so `pacman -Si` and any consumer reading package metadata follows a working link. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-23 05:17:41 +02:00
marfrit	b736dd0529	Merge pull request 'ffmpeg-v4l2-request-fourier: substitute H.264 qpel mc20 → daedalus-fourier' (#90) from claude-noether/marfrit-packages:noether/ffmpeg-fourier-qpel-mc20-daedalus into main Reviewed-on: marfrit/marfrit-packages#90	2026-05-23 01:34:04 +00:00
claude-noether	0bfc4ab03e	ffmpeg-v4l2-request-fourier: substitute H.264 qpel mc20 → daedalus-fourier H264QpelContext.put_h264_qpel_pixels_tab[1][2] (8x8 luma horizontal half-pel, 6-tap "put" — the canonical representative of the H.264 luma motion-compensation family) now dispatches through daedalus_recipe_dispatch_h264_qpel_mc20 instead of ff_put_h264_qpel8_mc20_neon. Cycle 9 of the daedalus-v4l2#11 step 2 substitution arc; closes the 4-cycle libavcodec.so substitution sequence: cycle 6 (PR #76) H.264 IDCT 4x4 done cycle 7 (PR #85) H.264 IDCT 8x8 done cycle 8 (PR #86) H.264 luma-v deblock done cycle 9 (this) H.264 qpel mc20 Bumps daedalus-fourier pin d87239d → 209a421 (PR #2 — public API gains daedalus_recipe_dispatch_h264_qpel_mc20 + DAEDALUS_KERNEL_H264_QPEL_MC20). Verdict per docs/k9_h264qpel_mc20.md: CPU NEON. Per-block 7.6 ns at 131 Mblock/s gives 135× margin over 30 fps 1080p; QPU dispatch floor at ~250 ns makes any V3D shader strictly worse. Substitution is plumbing-only — same daedalus_ctx_create_no_qpu pthread_once shape the cycles 6/7/8 shims already own (kept SEPARATE from the H264DSP shim's ctx because H264QPEL is its own libavcodec Makefile module and link order does not guarantee a single .o owns the ctx symbol; one extra ~µs init per process, paid lazily on first MC call). Other H.264 luma MC variants (mc02, mc11, mc22 etc.) and the 16x16 size tier stay on the in-tree NEON .S code per the cycle-9 phase-1 rationale (mc20 8x8 is representative; remaining variants would multiply recipe-lookup overhead without changing the substrate verdict). Bit-exact against ff_put_h264_qpel8_mc20_neon (daedalus-fourier cycle 9 green; 10000/10000 random blocks bit-exact, M3 = 131 Mblock/s). No SONAME change, no Depends change. PKGREL 9 → 10. Refs reauktion/daedalus-v4l2#11 — substitution arc step 2 cycle 9.	2026-05-23 03:32:29 +02:00
marfrit	8729c2db92	Merge pull request 'daedalus-v4l2 + daedalus-v4l2-dkms: bump to 872eec5 — PROTO_MAX_PAYLOAD 1 MiB (#20)' (#89) from claude-noether/marfrit-packages:noether/daedalus-bump-872eec5-1mib-payload into main Reviewed-on: marfrit/marfrit-packages#89	2026-05-22 18:52:53 +00:00
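The verify-before-ship rule in commit 6896853544 (pass counts may rise, the Failed column must not) is easy to express as a check. The sketch below assumes the three counts per run have already been extracted by hand from the dEQP summaries; `deqp_no_new_fails` is a hypothetical helper, and parsing real dEQP logs is out of scope.

```shell
#!/bin/sh
# Hypothetical helper: compare two "pass fail notsupported" triples and
# succeed only if the Failed count did not grow.
# usage: deqp_no_new_fails "<base pass fail ns>" "<new pass fail ns>"
deqp_no_new_fails() {
    set -- $1 $2                      # split both triples into $1..$6
    echo "pass delta: $(( $4 - $1 )), fail delta: $(( $5 - $2 ))"
    [ "$5" -le "$2" ]
}

# Figures quoted for dEQP-VK.glsl.atomic_operations.* (r4 vs r5):
deqp_no_new_fails "48 0 992" "80 0 960"   # prints "pass delta: 32, fail delta: 0"
```

The exit status carries the verdict, so the call can gate a release script directly.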