7 Commits

Author SHA1 Message Date
marfrit 685f85c22e Merge pull request 'mesa-panvk-bifrost: r4 -> r5 — advertise fragmentStoresAndAtomics on Bifrost (closes panvk-bifrost#2)' (#92) from claude-noether/marfrit-packages:noether/mesa-panvk-bifrost-r5 into main
Reviewed-on: marfrit/marfrit-packages#92
2026-05-23 12:19:10 +00:00
marfrit 6896853544 mesa-panvk-bifrost: r4 -> r5 — advertise fragmentStoresAndAtomics on Bifrost
Backports Mesa main's unconditional flip of .fragmentStoresAndAtomics
to true in src/panfrost/vulkan/panvk_vX_physical_device.c. Closes
the Dawn-WebGPU adapter rejection at PhysicalDeviceVk.cpp:250 that
caused brave-vulkan to fall back to the SwiftShader CPU adapter on
PineTab2/Mali-G52, per marfrit/panvk-bifrost#2.

Phase 7 verify on ohm (PineTab2, RK3566, Mali-G52 r1 MC1) with a
locally-built r5 lib installed to /tmp/r5_test_lib/:

  dEQP-VK.glsl.atomic_operations.*:
    r4:  48 pass /   0 fail / 992 NotSupported  (1040 total)
    r5:  80 pass /   0 fail / 960 NotSupported  (1040 total)
    delta: +32 newly-passing, zero new failures

  dEQP-VK.image.store.*:
    r4: 2772 pass /  0 fail / 238 NotSupported  (3010 total)
    r5: 2772 pass /  0 fail / 238 NotSupported  (3010 total)
    delta: identical (image.store is independent of the flag)

The disjunction with instance->force_enable_shader_atomics is kept as
a documented kill-switch even though the compiler folds it away —
it leaves the DRI option pan_force_enable_shader_atomics semantically
wired for future rebases or downstream debugging.

Patch reviewed via 2nd-model pass (per bugfix-process step 4):
recommended keeping the disjunction (applied), Bifrost-only-vs-unconditional
left unconditional to match upstream (applied), pre-ship CTS subset
(applied with results above).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-23 14:04:44 +02:00
marfrit fd56eca3cb Merge pull request 'mesa-panvk-bifrost{,-video}: fix url= to real Gitea repo' (#91) from claude-noether/marfrit-packages:noether/panvk-bifrost-url-fix into main
Reviewed-on: marfrit/marfrit-packages#91
2026-05-23 03:18:37 +00:00
marfrit 91022b390e mesa-panvk-bifrost{,-video}: fix url= to real Gitea repo
Both PKGBUILDs referenced url=https://github.com/marfrit/panvk-bifrost,
which was a hallucinated URL — no such repo existed. The campaign's
real source-of-truth home was just created at
https://git.reauktion.de/marfrit/panvk-bifrost (mfritsche, 2026-05-23).

Point both PKGBUILDs at the real URL so `pacman -Si` and any consumer
reading package metadata follows a working link.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-23 05:17:41 +02:00
marfrit b736dd0529 Merge pull request 'ffmpeg-v4l2-request-fourier: substitute H.264 qpel mc20 → daedalus-fourier' (#90) from claude-noether/marfrit-packages:noether/ffmpeg-fourier-qpel-mc20-daedalus into main
Reviewed-on: marfrit/marfrit-packages#90
2026-05-23 01:34:04 +00:00
claude-noether 0bfc4ab03e ffmpeg-v4l2-request-fourier: substitute H.264 qpel mc20 → daedalus-fourier
H264QpelContext.put_h264_qpel_pixels_tab[1][2] (8x8 luma horizontal
half-pel, 6-tap "put" — the canonical representative of the H.264
luma motion-compensation family) now dispatches through
daedalus_recipe_dispatch_h264_qpel_mc20 instead of
ff_put_h264_qpel8_mc20_neon.

Cycle 9 of the daedalus-v4l2#11 step 2 substitution arc; closes the
4-cycle libavcodec.so substitution sequence:

  cycle 6 (PR #76)  H.264 IDCT 4x4         done
  cycle 7 (PR #85)  H.264 IDCT 8x8         done
  cycle 8 (PR #86)  H.264 luma-v deblock   done
  cycle 9 (this)    H.264 qpel mc20

Bumps daedalus-fourier pin d87239d → 209a421 (PR #2 — public API
gains daedalus_recipe_dispatch_h264_qpel_mc20 +
DAEDALUS_KERNEL_H264_QPEL_MC20).

Verdict per docs/k9_h264qpel_mc20.md: CPU NEON.  Per-block 7.6 ns at
131 Mblock/s gives 135× margin over 30 fps 1080p; QPU dispatch floor
at ~250 ns makes any V3D shader strictly worse.  Substitution is
plumbing-only — same daedalus_ctx_create_no_qpu pthread_once shape
the cycles 6/7/8 shims already own (kept SEPARATE from the H264DSP
shim's ctx because H264QPEL is its own libavcodec Makefile module
and link order does not guarantee a single .o owns the ctx symbol;
one extra ~µs init per process, paid lazily on first MC call).

Other H.264 luma MC variants (mc02, mc11, mc22 etc.) and the 16x16
size tier stay on the in-tree NEON .S code per the cycle-9 phase-1
rationale (mc20 8x8 is representative; remaining variants would
multiply recipe-lookup overhead without changing the substrate
verdict).

Bit-exact against ff_put_h264_qpel8_mc20_neon (daedalus-fourier
cycle 9 green; 10000/10000 random blocks bit-exact, M3 = 131 Mblock/s).

No SONAME change, no Depends change.  PKGREL 9 → 10.

Refs reauktion/daedalus-v4l2#11 — substitution arc step 2 cycle 9.
2026-05-23 03:32:29 +02:00
marfrit 8729c2db92 Merge pull request 'daedalus-v4l2 + daedalus-v4l2-dkms: bump to 872eec5 — PROTO_MAX_PAYLOAD 1 MiB (#20)' (#89) from claude-noether/marfrit-packages:noether/daedalus-bump-872eec5-1mib-payload into main
Reviewed-on: marfrit/marfrit-packages#89
2026-05-22 18:52:53 +00:00
8 changed files with 402 additions and 21 deletions
@@ -0,0 +1,139 @@
From 0000000000000000000000000000000000000000 Mon Sep 17 00:00:00 2001
From: Markus Fritsche <mfritsche@reauktion.de>
Date: Sat, 23 May 2026 12:00:00 +0200
Subject: [PATCH] avcodec/aarch64/h264qpel: route 8x8 mc20 through
daedalus-fourier
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
H264QpelContext.put_h264_qpel_pixels_tab[1][2] (8x8 luma horizontal
half-pel, 6-tap "put" variant — the canonical representative of the
H.264 luma motion-compensation family) now dispatches through
daedalus_recipe_dispatch_h264_qpel_mc20 instead of
ff_put_h264_qpel8_mc20_neon.
Cycle 9 of the daedalus-v4l2#11 step 2 substitution arc; closes the
4-cycle libavcodec.so substitution sequence (6 IDCT 4x4 / 7 IDCT 8x8 /
8 luma-v deblock / 9 qpel mc20).
The recipe layer picks the substrate. Per docs/k9_h264qpel_mc20.md
the verdict is CPU NEON: per-block 7.6 ns at 131 Mblock/s gives 135x
margin over 30 fps 1080p, and the QPU dispatch floor (~250 ns)
makes any V3D shader strictly worse. Substitution is plumbing-only,
NEON-by-recipe — same daedalus_ctx_create_no_qpu pthread_once
context shape the cycles 6/7/8 shims already own (kept SEPARATE
from the H264DSP shim's ctx because H264QPEL is its own libavcodec
Makefile module and link order does not guarantee a single .o
owns the ctx symbol; one extra ~µs init per process, paid lazily).
Other H.264 luma MC variants (mc02, mc11, mc22 etc.) and the 16x16
size tier stay on the in-tree NEON .S code. Per the cycle-9 phase-1
rationale, mc20 8x8 is representative of the whole family's per-block
cost — extending the substitution to other variants would multiply
recipe-lookup overhead without changing the substrate verdict.
Bit-exact against ff_put_h264_qpel8_mc20_neon (daedalus-fourier
cycle 9 green; M1 = 100% bit-exact across 10000 random blocks).
No SONAME change, no Depends change.
Refs reauktion/daedalus-v4l2#11 — substitution arc step 2 cycle 9.
---
libavcodec/aarch64/Makefile | 3 +-
libavcodec/aarch64/h264_qpel_daedalus.c | 50 ++++++++++++++++++++++
libavcodec/aarch64/h264qpel_init_aarch64.c | 4 +-
3 files changed, 55 insertions(+), 2 deletions(-)
create mode 100644 libavcodec/aarch64/h264_qpel_daedalus.c
diff --git a/libavcodec/aarch64/Makefile b/libavcodec/aarch64/Makefile
--- a/libavcodec/aarch64/Makefile
+++ b/libavcodec/aarch64/Makefile
@@ -7,7 +7,8 @@ OBJS-$(CONFIG_H264DSP) += aarch64/h264dsp_init_aarch64.o \
aarch64/h264_idct_daedalus.o
OBJS-$(CONFIG_HUFFYUVDSP) += aarch64/huffyuvdsp_init_aarch64.o
OBJS-$(CONFIG_H264PRED) += aarch64/h264pred_init.o
-OBJS-$(CONFIG_H264QPEL) += aarch64/h264qpel_init_aarch64.o
+OBJS-$(CONFIG_H264QPEL) += aarch64/h264qpel_init_aarch64.o \
+ aarch64/h264_qpel_daedalus.o
OBJS-$(CONFIG_HPELDSP) += aarch64/hpeldsp_init_aarch64.o
OBJS-$(CONFIG_IDCTDSP) += aarch64/idctdsp_init_aarch64.o
OBJS-$(CONFIG_ME_CMP) += aarch64/me_cmp_init_aarch64.o
diff --git a/libavcodec/aarch64/h264_qpel_daedalus.c b/libavcodec/aarch64/h264_qpel_daedalus.c
new file mode 100644
--- /dev/null
+++ b/libavcodec/aarch64/h264_qpel_daedalus.c
@@ -0,0 +1,50 @@
+/*
+ * H.264 luma qpel mc20 (8x8, horizontal half-pel, 6-tap "put")
+ * — daedalus-fourier substitution shim.
+ *
+ * Routes H264QpelContext.put_h264_qpel_pixels_tab[1][2] through
+ * daedalus_recipe_dispatch_h264_qpel_mc20 instead of
+ * ff_put_h264_qpel8_mc20_neon. The recipe layer picks the substrate
+ * (CPU NEON for cycle 9; QPU not viable — per-block 7.6 ns vs
+ * ~250 ns QPU dispatch floor, see docs/k9_h264qpel_mc20.md).
+ *
+ * Sibling to libavcodec/aarch64/h264_idct_daedalus.c. We keep a
+ * SEPARATE process-global pthread_once context here instead of
+ * sharing the H264DSP one because H264QPEL is its own libavcodec
+ * Makefile module and link order does not guarantee a single .o
+ * owns the ctx symbol. The cost is one extra
+ * daedalus_ctx_create_no_qpu (~µs) per process; daemon and host
+ * processes pay this lazily on first MC call.
+ *
+ * FFmpeg H264QpelContext convention: both dst and src use a SINGLE
+ * stride and `src` already points at the leftmost OUTPUT column
+ * (col 0); the 6-tap filter reads cols -2..+3. This matches
+ * daedalus_recipe_dispatch_h264_qpel_mc20's documented contract
+ * directly, so dst_off = src_off = 0.
+ */
+
+#include <pthread.h>
+#include <stddef.h>
+#include <stdint.h>
+
+#include <daedalus.h>
+
+#include "libavutil/attributes.h"
+
+static daedalus_ctx *g_dctx;
+static pthread_once_t g_dctx_once = PTHREAD_ONCE_INIT;
+
+static void daedalus_ctx_init_once(void)
+{
+ g_dctx = daedalus_ctx_create_no_qpu();
+}
+
+void ff_put_h264_qpel8_mc20_daedalus(uint8_t *dst, const uint8_t *src, ptrdiff_t stride);
+
+void ff_put_h264_qpel8_mc20_daedalus(uint8_t *dst, const uint8_t *src, ptrdiff_t stride)
+{
+ static const daedalus_h264_qpel_meta meta = { .dst_off = 0, .src_off = 0 };
+ pthread_once(&g_dctx_once, daedalus_ctx_init_once);
+ daedalus_recipe_dispatch_h264_qpel_mc20(g_dctx, dst, src, (size_t)stride,
+ 1, &meta);
+}
diff --git a/libavcodec/aarch64/h264qpel_init_aarch64.c b/libavcodec/aarch64/h264qpel_init_aarch64.c
--- a/libavcodec/aarch64/h264qpel_init_aarch64.c
+++ b/libavcodec/aarch64/h264qpel_init_aarch64.c
@@ -47,6 +47,8 @@ void ff_put_h264_qpel8_mc00_neon(uint8_t *dst, const uint8_t *src, ptrdiff_t str
void ff_put_h264_qpel8_mc10_neon(uint8_t *dst, const uint8_t *src, ptrdiff_t stride);
void ff_put_h264_qpel8_mc20_neon(uint8_t *dst, const uint8_t *src, ptrdiff_t stride);
void ff_put_h264_qpel8_mc30_neon(uint8_t *dst, const uint8_t *src, ptrdiff_t stride);
+void ff_put_h264_qpel8_mc20_daedalus(uint8_t *dst, const uint8_t *src,
+ ptrdiff_t stride);
void ff_put_h264_qpel8_mc01_neon(uint8_t *dst, const uint8_t *src, ptrdiff_t stride);
void ff_put_h264_qpel8_mc11_neon(uint8_t *dst, const uint8_t *src, ptrdiff_t stride);
void ff_put_h264_qpel8_mc21_neon(uint8_t *dst, const uint8_t *src, ptrdiff_t stride);
@@ -184,7 +186,7 @@ av_cold void ff_h264qpel_init_aarch64(H264QpelContext *c, int bit_depth)
c->put_h264_qpel_pixels_tab[1][ 0] = ff_put_h264_qpel8_mc00_neon;
c->put_h264_qpel_pixels_tab[1][ 1] = ff_put_h264_qpel8_mc10_neon;
- c->put_h264_qpel_pixels_tab[1][ 2] = ff_put_h264_qpel8_mc20_neon;
+ c->put_h264_qpel_pixels_tab[1][ 2] = ff_put_h264_qpel8_mc20_daedalus;
c->put_h264_qpel_pixels_tab[1][ 3] = ff_put_h264_qpel8_mc30_neon;
c->put_h264_qpel_pixels_tab[1][ 4] = ff_put_h264_qpel8_mc01_neon;
c->put_h264_qpel_pixels_tab[1][ 5] = ff_put_h264_qpel8_mc11_neon;
--
2.47.3
+9 -7
View File
@@ -24,13 +24,13 @@ _srcname=FFmpeg
_version='8.1' _version='8.1'
_commit='b57fbbe50c9b2656fad86a1a7eeabfd2b2a50935' # v4l2-request-n8.1 tip 2026-04-24 _commit='b57fbbe50c9b2656fad86a1a7eeabfd2b2a50935' # v4l2-request-n8.1 tip 2026-04-24
pkgver=8.1.r123329.b57fbbe pkgver=8.1.r123329.b57fbbe
pkgrel=9 # pkgrel=9restore AV_CODEC_FLAG_LOW_DELAY for H.264 (2026-05-22) pkgrel=10 # pkgrel=10H.264 luma qpel mc20 daedalus-fourier substitution (cycle 9, 2026-05-23)
epoch=2 epoch=2
# daedalus-fourier pin — first kernel substitution in libavcodec # daedalus-fourier pin. 209a421 = PR #2 merge (Phase 8c — public API
# (cycle 6 H.264 IDCT 4x4). Same SHA as the daedalus-v4l2 daemon's # gains daedalus_recipe_dispatch_h264_qpel_mc20 + DAEDALUS_KERNEL_H264_QPEL_MC20).
# inline build; lockstep with that until the public API rolls. # Cycle 9 closes the libavcodec.so substitution arc started at cycle 6.
_daedalus_fourier_commit='d87239d8172307d9a1b93c95cbed116d175b85cc' _daedalus_fourier_commit='209a4218bcb98b91c04f07ad61513bb04adb13ad'
pkgdesc='FFmpeg with V4L2 Request API hwaccel (Rockchip / Allwinner stateless decode)' pkgdesc='FFmpeg with V4L2 Request API hwaccel (Rockchip / Allwinner stateless decode)'
arch=('aarch64') arch=('aarch64')
url='https://github.com/Kwiboo/FFmpeg' url='https://github.com/Kwiboo/FFmpeg'
@@ -93,8 +93,9 @@ source=("git+https://github.com/Kwiboo/FFmpeg.git#commit=${_commit}"
'0003-h264-idct4-daedalus-fourier.patch' '0003-h264-idct4-daedalus-fourier.patch'
'0004-h264-idct8-daedalus-fourier.patch' '0004-h264-idct8-daedalus-fourier.patch'
'0005-h264-deblock-luma-v-daedalus-fourier.patch' '0005-h264-deblock-luma-v-daedalus-fourier.patch'
'0006-h264-restore-low-delay.patch') '0006-h264-restore-low-delay.patch'
sha256sums=('SKIP' 'SKIP' 'SKIP' 'SKIP' 'SKIP' 'SKIP' 'SKIP' 'SKIP') '0007-h264-qpel-mc20-daedalus-fourier.patch')
sha256sums=('SKIP' 'SKIP' 'SKIP' 'SKIP' 'SKIP' 'SKIP' 'SKIP' 'SKIP' 'SKIP')
pkgver() { pkgver() {
cd "${_srcname}" cd "${_srcname}"
@@ -111,6 +112,7 @@ prepare() {
patch -Np1 -i "${srcdir}/0004-h264-idct8-daedalus-fourier.patch" patch -Np1 -i "${srcdir}/0004-h264-idct8-daedalus-fourier.patch"
patch -Np1 -i "${srcdir}/0005-h264-deblock-luma-v-daedalus-fourier.patch" patch -Np1 -i "${srcdir}/0005-h264-deblock-luma-v-daedalus-fourier.patch"
patch -Np1 -i "${srcdir}/0006-h264-restore-low-delay.patch" patch -Np1 -i "${srcdir}/0006-h264-restore-low-delay.patch"
patch -Np1 -i "${srcdir}/0007-h264-qpel-mc20-daedalus-fourier.patch"
} }
build() { build() {
+1 -1
View File
@@ -45,7 +45,7 @@ pkgver=26.0.6.r5.video1
pkgrel=1 pkgrel=1
pkgdesc="Patched Mesa libvulkan_panfrost.so adding VK_KHR_video_decode_h264 on Bifrost SBCs (sibling of mesa-panvk-bifrost-r4)" pkgdesc="Patched Mesa libvulkan_panfrost.so adding VK_KHR_video_decode_h264 on Bifrost SBCs (sibling of mesa-panvk-bifrost-r4)"
arch=('aarch64') arch=('aarch64')
url="https://github.com/marfrit/panvk-bifrost" url="https://git.reauktion.de/marfrit/panvk-bifrost"
license=('MIT') license=('MIT')
depends=( depends=(
@@ -0,0 +1,50 @@
From: marfrit-packages noether <claude-noether@reauktion.de>
Subject: [PATCH] panvk: report fragmentStoresAndAtomics = true on Bifrost
Backports Mesa main's unconditional advertisement of
fragmentStoresAndAtomics for panvk (snapshot ref: src/panfrost/vulkan/
panvk_vX_physical_device.c at commit-time 2026-05-06; the line reads
`.fragmentStoresAndAtomics = true,` on main with no PAN_ARCH gate).
Motivation: Chromium Dawn's WebGPU initializer in
third_party/dawn/src/dawn/native/vulkan/PhysicalDeviceVk.cpp:250
unconditionally rejects any Vulkan adapter that doesn't advertise this
feature, causing Dawn to fall back to the SwiftShader CPU adapter
on PineTab2 / RK3566 / Mali-G52 r1 MC1 (PAN_ARCH 7). With this patch the
device advertises true, satisfying Dawn's gate. Tracked at
https://git.reauktion.de/marfrit/panvk-bifrost/issues/2.
The disjunction with `instance->force_enable_shader_atomics` is
preserved as a kill-switch: in compiler terms it's dead code
(`true || X == true`), but it leaves the DRI option
`pan_force_enable_shader_atomics` semantically wired so future
rebases or downstream debugging can see the link to the runtime knob.
Caveat: the existing DRI option's description in src/util/driconf.h
still labels this as "may not work reliably and is for debug purposes
only". Mesa main's choice to ship it as default-on for all panvk
architectures (including Bifrost, which is non-conformant per the
PAN_I_WANT_A_BROKEN_VULKAN_DRIVER gate) reflects an upstream judgment
that the practical risk is acceptable. Verify-before-ship for this
package: dEQP-VK.glsl.atomic_operations.* + dEQP-VK.image.store.*
deltas vs the r4 baseline must show no new fails. Pass counts may rise
(tests that previously NotSupported now run); the load-bearing line is
the Failed column staying at zero.
---
src/panfrost/vulkan/panvk_vX_physical_device.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/src/panfrost/vulkan/panvk_vX_physical_device.c b/src/panfrost/vulkan/panvk_vX_physical_device.c
--- a/src/panfrost/vulkan/panvk_vX_physical_device.c
+++ b/src/panfrost/vulkan/panvk_vX_physical_device.c
@@ -280,8 +280,7 @@
.vertexPipelineStoresAndAtomics =
(PAN_ARCH >= 13 && instance->enable_vertex_pipeline_stores_atomics) ||
instance->force_enable_shader_atomics,
- .fragmentStoresAndAtomics =
- (PAN_ARCH >= 10) || instance->force_enable_shader_atomics,
+ .fragmentStoresAndAtomics = true || instance->force_enable_shader_atomics,
.shaderTessellationAndGeometryPointSize = false,
.shaderImageGatherExtended = true,
.shaderStorageImageExtendedFormats = true,
+17 -2
View File
@@ -30,11 +30,11 @@
pkgname=mesa-panvk-bifrost pkgname=mesa-panvk-bifrost
_mesaver=26.0.6 _mesaver=26.0.6
pkgver=26.0.6.r4 pkgver=26.0.6.r5
pkgrel=1 pkgrel=1
pkgdesc="Patched Mesa libvulkan_panfrost.so exposing Bifrost-gen Mali to Vulkan apps (panvk-bifrost campaign)" pkgdesc="Patched Mesa libvulkan_panfrost.so exposing Bifrost-gen Mali to Vulkan apps (panvk-bifrost campaign)"
arch=('aarch64') arch=('aarch64')
url="https://github.com/marfrit/panvk-bifrost" url="https://git.reauktion.de/marfrit/panvk-bifrost"
license=('MIT') license=('MIT')
# We co-install at /usr/lib/panvk-bifrost/ so no conflicts with stock mesa. # We co-install at /usr/lib/panvk-bifrost/ so no conflicts with stock mesa.
@@ -81,6 +81,7 @@ source=(
"0002-panvk-expose-vulkan-1.1-1.2-on-bifrost.patch" "0002-panvk-expose-vulkan-1.1-1.2-on-bifrost.patch"
"0003-panvk-bifrost-vk-ext-transform-feedback.patch" "0003-panvk-bifrost-vk-ext-transform-feedback.patch"
"0004-panvk-bifrost-xfb-primitive-decomposition.patch" "0004-panvk-bifrost-xfb-primitive-decomposition.patch"
"0005-panvk-bifrost-fragment-stores-atomics.patch"
"brave-vulkan" "brave-vulkan"
"icd.json" "icd.json"
) )
@@ -92,6 +93,7 @@ sha256sums=(
'SKIP' 'SKIP'
'SKIP' 'SKIP'
'SKIP' 'SKIP'
'SKIP'
) )
prepare() { prepare() {
@@ -127,6 +129,17 @@ prepare() {
# Phase-doc context: ~/src/panvk-bifrost/iter17/phase{0,1,2,4,5,6,8}_*.md. # Phase-doc context: ~/src/panvk-bifrost/iter17/phase{0,1,2,4,5,6,8}_*.md.
patch -p1 < "${srcdir}/0004-panvk-bifrost-xfb-primitive-decomposition.patch" patch -p1 < "${srcdir}/0004-panvk-bifrost-xfb-primitive-decomposition.patch"
# r5 (2026-05-23): advertise .fragmentStoresAndAtomics = true on Bifrost
# to satisfy Chromium Dawn's WebGPU init gate
# (third_party/dawn/src/dawn/native/vulkan/PhysicalDeviceVk.cpp:250).
# Backports Mesa main's unconditional flip (same line as on main as of
# 2026-05-06). Disjunction with instance->force_enable_shader_atomics
# is preserved as a documented kill-switch even though the compiler
# folds it away. Closes marfrit/panvk-bifrost#2.
# Verify-before-ship: dEQP-VK.glsl.atomic_operations.* and
# dEQP-VK.image.store.* show no new Failed vs r4 baseline.
patch -p1 < "${srcdir}/0005-panvk-bifrost-fragment-stores-atomics.patch"
# Sanity-check the patches landed. # Sanity-check the patches landed.
grep -q "KHR_robustness2 = true," src/panfrost/vulkan/panvk_vX_physical_device.c grep -q "KHR_robustness2 = true," src/panfrost/vulkan/panvk_vX_physical_device.c
grep -q "EXT_robustness2 = true," src/panfrost/vulkan/panvk_vX_physical_device.c grep -q "EXT_robustness2 = true," src/panfrost/vulkan/panvk_vX_physical_device.c
@@ -138,6 +151,8 @@ prepare() {
test -f src/panfrost/vulkan/jm/panvk_vX_cmd_xfb.c test -f src/panfrost/vulkan/jm/panvk_vX_cmd_xfb.c
# iter17 sanity: pan_nir_lower_xfb call site has been replaced; new file present. # iter17 sanity: pan_nir_lower_xfb call site has been replaced; new file present.
grep -q "panvk_per_arch(nir_lower_xfb)" src/panfrost/vulkan/panvk_vX_shader.c grep -q "panvk_per_arch(nir_lower_xfb)" src/panfrost/vulkan/panvk_vX_shader.c
# r5 sanity: fragmentStoresAndAtomics = true patch landed
grep -q "fragmentStoresAndAtomics = true ||" src/panfrost/vulkan/panvk_vX_physical_device.c
grep -q "xfb_topology" src/panfrost/vulkan/panvk_shader.h grep -q "xfb_topology" src/panfrost/vulkan/panvk_shader.h
grep -q "panvk_xfb_topology" src/panfrost/vulkan/panvk_shader.h grep -q "panvk_xfb_topology" src/panfrost/vulkan/panvk_shader.h
test -f src/panfrost/vulkan/panvk_vX_xfb_lower.c test -f src/panfrost/vulkan/panvk_vX_xfb_lower.c
@@ -0,0 +1,139 @@
From 0000000000000000000000000000000000000000 Mon Sep 17 00:00:00 2001
From: Markus Fritsche <mfritsche@reauktion.de>
Date: Sat, 23 May 2026 12:00:00 +0200
Subject: [PATCH] avcodec/aarch64/h264qpel: route 8x8 mc20 through
daedalus-fourier
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
H264QpelContext.put_h264_qpel_pixels_tab[1][2] (8x8 luma horizontal
half-pel, 6-tap "put" variant — the canonical representative of the
H.264 luma motion-compensation family) now dispatches through
daedalus_recipe_dispatch_h264_qpel_mc20 instead of
ff_put_h264_qpel8_mc20_neon.
Cycle 9 of the daedalus-v4l2#11 step 2 substitution arc; closes the
4-cycle libavcodec.so substitution sequence (6 IDCT 4x4 / 7 IDCT 8x8 /
8 luma-v deblock / 9 qpel mc20).
The recipe layer picks the substrate. Per docs/k9_h264qpel_mc20.md
the verdict is CPU NEON: per-block 7.6 ns at 131 Mblock/s gives 135x
margin over 30 fps 1080p, and the QPU dispatch floor (~250 ns)
makes any V3D shader strictly worse. Substitution is plumbing-only,
NEON-by-recipe — same daedalus_ctx_create_no_qpu pthread_once
context shape the cycles 6/7/8 shims already own (kept SEPARATE
from the H264DSP shim's ctx because H264QPEL is its own libavcodec
Makefile module and link order does not guarantee a single .o
owns the ctx symbol; one extra ~µs init per process, paid lazily).
Other H.264 luma MC variants (mc02, mc11, mc22 etc.) and the 16x16
size tier stay on the in-tree NEON .S code. Per the cycle-9 phase-1
rationale, mc20 8x8 is representative of the whole family's per-block
cost — extending the substitution to other variants would multiply
recipe-lookup overhead without changing the substrate verdict.
Bit-exact against ff_put_h264_qpel8_mc20_neon (daedalus-fourier
cycle 9 green; M1 = 100% bit-exact across 10000 random blocks).
No SONAME change, no Depends change.
Refs reauktion/daedalus-v4l2#11 — substitution arc step 2 cycle 9.
---
libavcodec/aarch64/Makefile | 3 +-
libavcodec/aarch64/h264_qpel_daedalus.c | 50 ++++++++++++++++++++++
libavcodec/aarch64/h264qpel_init_aarch64.c | 4 +-
3 files changed, 55 insertions(+), 2 deletions(-)
create mode 100644 libavcodec/aarch64/h264_qpel_daedalus.c
diff --git a/libavcodec/aarch64/Makefile b/libavcodec/aarch64/Makefile
--- a/libavcodec/aarch64/Makefile
+++ b/libavcodec/aarch64/Makefile
@@ -7,7 +7,8 @@ OBJS-$(CONFIG_H264DSP) += aarch64/h264dsp_init_aarch64.o \
aarch64/h264_idct_daedalus.o
OBJS-$(CONFIG_HUFFYUVDSP) += aarch64/huffyuvdsp_init_aarch64.o
OBJS-$(CONFIG_H264PRED) += aarch64/h264pred_init.o
-OBJS-$(CONFIG_H264QPEL) += aarch64/h264qpel_init_aarch64.o
+OBJS-$(CONFIG_H264QPEL) += aarch64/h264qpel_init_aarch64.o \
+ aarch64/h264_qpel_daedalus.o
OBJS-$(CONFIG_HPELDSP) += aarch64/hpeldsp_init_aarch64.o
OBJS-$(CONFIG_IDCTDSP) += aarch64/idctdsp_init_aarch64.o
OBJS-$(CONFIG_ME_CMP) += aarch64/me_cmp_init_aarch64.o
diff --git a/libavcodec/aarch64/h264_qpel_daedalus.c b/libavcodec/aarch64/h264_qpel_daedalus.c
new file mode 100644
--- /dev/null
+++ b/libavcodec/aarch64/h264_qpel_daedalus.c
@@ -0,0 +1,50 @@
+/*
+ * H.264 luma qpel mc20 (8x8, horizontal half-pel, 6-tap "put")
+ * — daedalus-fourier substitution shim.
+ *
+ * Routes H264QpelContext.put_h264_qpel_pixels_tab[1][2] through
+ * daedalus_recipe_dispatch_h264_qpel_mc20 instead of
+ * ff_put_h264_qpel8_mc20_neon. The recipe layer picks the substrate
+ * (CPU NEON for cycle 9; QPU not viable — per-block 7.6 ns vs
+ * ~250 ns QPU dispatch floor, see docs/k9_h264qpel_mc20.md).
+ *
+ * Sibling to libavcodec/aarch64/h264_idct_daedalus.c. We keep a
+ * SEPARATE process-global pthread_once context here instead of
+ * sharing the H264DSP one because H264QPEL is its own libavcodec
+ * Makefile module and link order does not guarantee a single .o
+ * owns the ctx symbol. The cost is one extra
+ * daedalus_ctx_create_no_qpu (~µs) per process; daemon and host
+ * processes pay this lazily on first MC call.
+ *
+ * FFmpeg H264QpelContext convention: both dst and src use a SINGLE
+ * stride and `src` already points at the leftmost OUTPUT column
+ * (col 0); the 6-tap filter reads cols -2..+3. This matches
+ * daedalus_recipe_dispatch_h264_qpel_mc20's documented contract
+ * directly, so dst_off = src_off = 0.
+ */
+
+#include <pthread.h>
+#include <stddef.h>
+#include <stdint.h>
+
+#include <daedalus.h>
+
+#include "libavutil/attributes.h"
+
+static daedalus_ctx *g_dctx;
+static pthread_once_t g_dctx_once = PTHREAD_ONCE_INIT;
+
+static void daedalus_ctx_init_once(void)
+{
+ g_dctx = daedalus_ctx_create_no_qpu();
+}
+
+void ff_put_h264_qpel8_mc20_daedalus(uint8_t *dst, const uint8_t *src, ptrdiff_t stride);
+
+void ff_put_h264_qpel8_mc20_daedalus(uint8_t *dst, const uint8_t *src, ptrdiff_t stride)
+{
+ static const daedalus_h264_qpel_meta meta = { .dst_off = 0, .src_off = 0 };
+ pthread_once(&g_dctx_once, daedalus_ctx_init_once);
+ daedalus_recipe_dispatch_h264_qpel_mc20(g_dctx, dst, src, (size_t)stride,
+ 1, &meta);
+}
diff --git a/libavcodec/aarch64/h264qpel_init_aarch64.c b/libavcodec/aarch64/h264qpel_init_aarch64.c
--- a/libavcodec/aarch64/h264qpel_init_aarch64.c
+++ b/libavcodec/aarch64/h264qpel_init_aarch64.c
@@ -47,6 +47,8 @@ void ff_put_h264_qpel8_mc00_neon(uint8_t *dst, const uint8_t *src, ptrdiff_t str
void ff_put_h264_qpel8_mc10_neon(uint8_t *dst, const uint8_t *src, ptrdiff_t stride);
void ff_put_h264_qpel8_mc20_neon(uint8_t *dst, const uint8_t *src, ptrdiff_t stride);
void ff_put_h264_qpel8_mc30_neon(uint8_t *dst, const uint8_t *src, ptrdiff_t stride);
+void ff_put_h264_qpel8_mc20_daedalus(uint8_t *dst, const uint8_t *src,
+ ptrdiff_t stride);
void ff_put_h264_qpel8_mc01_neon(uint8_t *dst, const uint8_t *src, ptrdiff_t stride);
void ff_put_h264_qpel8_mc11_neon(uint8_t *dst, const uint8_t *src, ptrdiff_t stride);
void ff_put_h264_qpel8_mc21_neon(uint8_t *dst, const uint8_t *src, ptrdiff_t stride);
@@ -184,7 +186,7 @@ av_cold void ff_h264qpel_init_aarch64(H264QpelContext *c, int bit_depth)
c->put_h264_qpel_pixels_tab[1][ 0] = ff_put_h264_qpel8_mc00_neon;
c->put_h264_qpel_pixels_tab[1][ 1] = ff_put_h264_qpel8_mc10_neon;
- c->put_h264_qpel_pixels_tab[1][ 2] = ff_put_h264_qpel8_mc20_neon;
+ c->put_h264_qpel_pixels_tab[1][ 2] = ff_put_h264_qpel8_mc20_daedalus;
c->put_h264_qpel_pixels_tab[1][ 3] = ff_put_h264_qpel8_mc30_neon;
c->put_h264_qpel_pixels_tab[1][ 4] = ff_put_h264_qpel8_mc01_neon;
c->put_h264_qpel_pixels_tab[1][ 5] = ff_put_h264_qpel8_mc11_neon;
--
2.47.3
+13 -11
View File
@@ -33,18 +33,19 @@ FFMPEG_VERSION=8.1
# epoch 2 matches Debian's stock ffmpeg (currently 7:7.1.x in trixie); # epoch 2 matches Debian's stock ffmpeg (currently 7:7.1.x in trixie);
# +rfourier suffix to avoid colliding with upstream/Debian rebuilds. # +rfourier suffix to avoid colliding with upstream/Debian rebuilds.
PKGVER=2:${FFMPEG_VERSION}+rfourier+gb57fbbe PKGVER=2:${FFMPEG_VERSION}+rfourier+gb57fbbe
PKGREL=9 # pkgrel=9restore AV_CODEC_FLAG_LOW_DELAY semantics in the PKGREL=10 # pkgrel=10H.264 luma qpel mc20 daedalus-fourier substitution
# H.264 decoder (FFmpeg 8.x dropped them). Fixes the 2-1-4-3 # (cycle 9 of the daedalus-v4l2#11 step 2 substitution arc; closes
# B-frame pair-swap that re-appeared in Firefox YouTube after # the libavcodec.so substitution sequence 6 IDCT4 / 7 IDCT8 /
# the SONAME 61→62 jump (PR #75) silently neutered the # 8 luma-v deblock / 9 qpel mc20). Pulls daedalus-fourier PR #2
# daemon's ctx->flags |= AV_CODEC_FLAG_LOW_DELAY at # which extends the public API with
# daemon/src/decoder.c:202. Substitution arc unchanged. # daedalus_recipe_dispatch_h264_qpel_mc20. (2026-05-23)
# (2026-05-22)
# daedalus-fourier pin — first kernel substitution in libavcodec (cycle 6 # daedalus-fourier pin. 209a421 = daedalus-fourier PR #2 merge — public
# H.264 IDCT 4x4). Same SHA as the daedalus-v4l2 daemon already ships # API now exposes daedalus_recipe_dispatch_h264_qpel_mc20 +
# inline; rev in lockstep with the daemon when the public API rolls. # DAEDALUS_KERNEL_H264_QPEL_MC20. Cycle 9 plumbs the last H.264 NEON
DAEDALUS_FOURIER_COMMIT=d87239d8172307d9a1b93c95cbed116d175b85cc # kernel through the recipe layer. Daemon-side build (debian/daedalus-v4l2)
# can bump in a follow-up; this PR only changes the libavcodec.so consumer.
DAEDALUS_FOURIER_COMMIT=209a4218bcb98b91c04f07ad61513bb04adb13ad
HERE=$(dirname "$(readlink -f "$0")") HERE=$(dirname "$(readlink -f "$0")")
@@ -72,6 +73,7 @@ patch -Np1 -i "$HERE/0003-h264-idct4-daedalus-fourier.patch"
patch -Np1 -i "$HERE/0004-h264-idct8-daedalus-fourier.patch" patch -Np1 -i "$HERE/0004-h264-idct8-daedalus-fourier.patch"
patch -Np1 -i "$HERE/0005-h264-deblock-luma-v-daedalus-fourier.patch" patch -Np1 -i "$HERE/0005-h264-deblock-luma-v-daedalus-fourier.patch"
patch -Np1 -i "$HERE/0006-h264-restore-low-delay.patch" patch -Np1 -i "$HERE/0006-h264-restore-low-delay.patch"
patch -Np1 -i "$HERE/0007-h264-qpel-mc20-daedalus-fourier.patch"
# --- daedalus-fourier: fetch + build static .a with PIC, install to a # --- daedalus-fourier: fetch + build static .a with PIC, install to a
# per-build prefix; libavcodec.so links it into the shared object so # per-build prefix; libavcodec.so links it into the shared object so
+34
View File
@@ -1,3 +1,37 @@
ffmpeg-v4l2-request-fourier (2:8.1+rfourier+gb57fbbe-10) bookworm trixie; urgency=medium
* Add 0007-h264-qpel-mc20-daedalus-fourier.patch —
H264QpelContext.put_h264_qpel_pixels_tab[1][2] (8x8 luma
horizontal half-pel, 6-tap "put" — the canonical representative
of the H.264 luma motion-compensation family) now dispatches
through daedalus_recipe_dispatch_h264_qpel_mc20 instead of
ff_put_h264_qpel8_mc20_neon. Cycle 9 of the daedalus-v4l2#11
step 2 substitution arc; closes the 4-cycle libavcodec.so
substitution sequence (6 IDCT4 / 7 IDCT8 / 8 luma-v deblock /
9 qpel mc20).
* Bumps daedalus-fourier pin d87239d → 209a421 (PR #2 — public
API extended with daedalus_recipe_dispatch_h264_qpel_mc20 +
DAEDALUS_KERNEL_H264_QPEL_MC20).
* Cycle 9 is "CPU primary; QPU pointless" per
docs/k9_h264qpel_mc20.md. Per-block 7.6 ns at 131 Mblock/s
gives 135x margin over 30 fps 1080p; QPU dispatch floor at
~250 ns makes any V3D shader strictly worse. Substitution
is plumbing-only, NEON-by-recipe — same
daedalus_ctx_create_no_qpu pthread_once shape the cycles 6/7/8
shims already own (kept SEPARATE from the H264DSP shim's ctx
because H264QPEL is its own libavcodec Makefile module and
link order does not guarantee a single .o owns the ctx symbol;
one extra ~µs init per process, paid lazily on first MC call).
* Other H.264 luma MC variants (mc02, mc11, mc22 etc.) and the
16x16 size tier stay on the in-tree NEON .S code. Per the
cycle-9 phase-1 rationale, mc20 8x8 is representative of the
whole family's per-block cost.
* Bit-exact against ff_put_h264_qpel8_mc20_neon (daedalus-fourier
cycle 9 green; 10000/10000 random blocks).
* No SONAME change, no Depends change.
-- Markus Fritsche <mfritsche@reauktion.de> Sat, 23 May 2026 12:00:00 +0000
ffmpeg-v4l2-request-fourier (2:8.1+rfourier+gb57fbbe-9) bookworm trixie; urgency=medium ffmpeg-v4l2-request-fourier (2:8.1+rfourier+gb57fbbe-9) bookworm trixie; urgency=medium
* Add 0006-h264-restore-low-delay.patch — restore the documented * Add 0006-h264-restore-low-delay.patch — restore the documented