forked from marfrit/marfrit-packages
493c762967
Cycle 7 of the libavcodec.so substitution arc (reauktion/daedalus-v4l2#11 step 2). H264DSPContext.idct8_add (called per 8×8 block from the High-profile intra-8×8-DCT decode path in libavcodec/h264_mb.c) now dispatches through daedalus_recipe_dispatch_h264_idct8 instead of ff_h264_idct8_add_neon.

## What

- Add 0004-h264-idct8-daedalus-fourier.patch (in both arch/ and debian/ ffmpeg-v4l2-request-fourier/). Extends libavcodec/aarch64/h264_idct_daedalus.c (introduced by 0003) with ff_h264_idct8_add_daedalus and a daedalus_recipe_dispatch_h264_idct8 call; patches libavcodec/aarch64/h264dsp_init_aarch64.c to wire c->idct8_add to the new shim.
- arch/PKGBUILD + debian/build-deb.sh: append the new patch to the apply list; bump pkgrel/PKGREL to 7.
- No new build-deps, no Depends change, no daedalus-fourier rev: the d87239d pin already exposes daedalus_recipe_dispatch_h264_idct8.

## Why

The recipe layer picks the substrate; for cycle 7 (H.264 IDCT 8×8) the recipe is CPU NEON, so this is effectively a NEON-to-NEON substitution layered on top of cycle 6.

Production validation of cycle 6 on higgs Firefox YouTube: 3040 frames decoded cleanly, avg_decode_us=3388 (no regression vs the pre-substitution ~4 ms baseline). Cycle 7 inherits the same shim's pthread_once context.

Bit-exact against ff_h264_idct8_add_neon (daedalus-fourier cycle 7 green; FFmpeg 8×8 block storage block[r + 8*c] matches daedalus column-major convention).

## Scope NOT covered (deferred)

- Bulk c->idct8_add4 (inter 8×8-DCT macroblocks) stays on the in-tree NEON .S code; batched substitution with n_blocks>1 lands later alongside the cycle-6 bulk-paths work.
- High-bit-depth (10-bit) path untouched.
- Cycles 8/9: separate PRs.

## SONAME

Unchanged. libavcodec.so.62 / libavformat.so.62 / libavutil.so.60.

## Refs

- reauktion/daedalus-v4l2 issue #11 (substitution arc): reauktion/daedalus-v4l2#11
- marfrit-packages PR #76 (cycle 6 IDCT 4×4)
- marfrit-packages PR #78 (libxml2 ABI-skew workaround)
- marfrit/daedalus-fourier cycle 7 close (H.264 IDCT 8×8 NEON green)
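
The `block[r + 8*c]` storage convention cited under Why can be illustrated with a small indexing sketch. This is only an illustration of the formula as the commit message states it; `coeff_index` and `pack_block8` are hypothetical helper names and are not part of FFmpeg or daedalus-fourier.

```c
#include <assert.h>
#include <stdint.h>

/* Index of the coefficient at (row r, col c) in an n x n block stored
 * column-major: block[r + n*c], the convention the commit message cites. */
static int coeff_index(int r, int c, int n)
{
    assert(r >= 0 && r < n && c >= 0 && c < n);
    return r + n * c;
}

/* Hypothetical helper (not part of FFmpeg or daedalus-fourier): repack a
 * row-major 8x8 coefficient array into the column-major layout. */
static void pack_block8(int16_t block[64], const int16_t *row_major)
{
    for (int r = 0; r < 8; r++)
        for (int c = 0; c < 8; c++)
            block[coeff_index(r, c, 8)] = row_major[r * 8 + c];
}
```

Under this convention, stepping down a column moves by one element and stepping across a row moves by 8 elements, so a caller holding row-major data would need a repack like the one above before handing the block to either side.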
108 lines
5.0 KiB
From 1b286ddb4efaca26ec9b9e290e989fec77dc1c77 Mon Sep 17 00:00:00 2001
From: Markus Fritsche <mfritsche@reauktion.de>
Date: Fri, 22 May 2026 10:18:21 +0200
Subject: [PATCH] avcodec/aarch64/h264dsp: route H.264 8x8 IDCT through
 daedalus-fourier
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

H264DSPContext.idct8_add (called per 8x8 block from the High-profile
intra-8x8-DCT decode path in h264_mb.c) now dispatches through
daedalus_recipe_dispatch_h264_idct8 instead of ff_h264_idct8_add_neon.

The recipe layer picks the substrate; for cycle 7 (H.264 IDCT 8x8)
the recipe is CPU NEON, so this is effectively a NEON-to-NEON
substitution layered on top of the cycle-6 IDCT 4x4 wiring. Same
pthread_once global context, same destructive-zero semantics; FFmpeg
column-major 8x8 storage block[r + 8*c] matches daedalus's convention.

Bulk path c->idct8_add4 (used for inter 8x8-DCT macroblocks) remains
on the in-tree NEON .S code and will be batched through
daedalus_recipe_dispatch_h264_idct8 with n_blocks>1 in a follow-up.

Bit-exact against ff_h264_idct8_add_neon (daedalus-fourier cycle 7
green).

Refs reauktion/daedalus-v4l2#11 — substitution arc step 2 cycle 7.
---
 libavcodec/aarch64/h264_idct_daedalus.c   | 29 ++++++++++++++++-------
 libavcodec/aarch64/h264dsp_init_aarch64.c |  3 ++-
 2 files changed, 23 insertions(+), 9 deletions(-)

diff --git a/libavcodec/aarch64/h264_idct_daedalus.c b/libavcodec/aarch64/h264_idct_daedalus.c
index 538d223..cbb98af 100644
--- a/libavcodec/aarch64/h264_idct_daedalus.c
+++ b/libavcodec/aarch64/h264_idct_daedalus.c
@@ -1,14 +1,16 @@
 /*
- * H.264 4x4 IDCT + add — daedalus-fourier substitution shim.
+ * H.264 4x4 / 8x8 IDCT + add — daedalus-fourier substitution shims.
  *
- * Routes H264DSPContext.idct_add through
- * daedalus_recipe_dispatch_h264_idct4 instead of ff_h264_idct_add_neon.
- * The recipe layer picks the substrate (CPU NEON by default for
- * cycle 6; future cycles may dispatch to V3D opportunistically).
+ * Routes H264DSPContext.idct_add  → daedalus_recipe_dispatch_h264_idct4
+ *        H264DSPContext.idct8_add → daedalus_recipe_dispatch_h264_idct8
+ * instead of the in-tree ff_h264_idct{,8}_add_neon assembly. The
+ * recipe layer picks the substrate (CPU NEON by default for cycles
+ * 6 + 7; future cycles may dispatch to V3D opportunistically).
  *
- * FFmpeg's 4x4 block memory layout matches daedalus's column-major
- * convention: block[r + 4*c] = coefficient at (row r, col c). Both
- * sides destructively zero the block after the transform.
+ * FFmpeg's 4x4 and 8x8 block memory layouts match daedalus's
+ * column-major convention: block[r + N*c] = coefficient at
+ * (row r, col c) for N ∈ {4, 8}. Both sides destructively zero the
+ * block after the transform.
  *
  * The library context is process-global and lazily initialised under
  * pthread_once. We pick the no-QPU constructor here because
@@ -37,6 +39,7 @@ static void daedalus_ctx_init_once(void)
 }
 
 void ff_h264_idct_add_daedalus(uint8_t *dst, int16_t *block, int stride);
+void ff_h264_idct8_add_daedalus(uint8_t *dst, int16_t *block, int stride);
 
 void ff_h264_idct_add_daedalus(uint8_t *dst, int16_t *block, int stride)
 {
@@ -47,3 +50,13 @@ void ff_h264_idct_add_daedalus(uint8_t *dst, int16_t *block, int stride)
     daedalus_recipe_dispatch_h264_idct4(g_dctx, dst, (size_t)stride,
                                         block, 1, &meta);
 }
+
+void ff_h264_idct8_add_daedalus(uint8_t *dst, int16_t *block, int stride)
+{
+    static const daedalus_h264_block_meta meta = { .dst_off = 0 };
+
+    pthread_once(&g_dctx_once, daedalus_ctx_init_once);
+
+    daedalus_recipe_dispatch_h264_idct8(g_dctx, dst, (size_t)stride,
+                                        block, 1, &meta);
+}
diff --git a/libavcodec/aarch64/h264dsp_init_aarch64.c b/libavcodec/aarch64/h264dsp_init_aarch64.c
index b993df2..741e551 100644
--- a/libavcodec/aarch64/h264dsp_init_aarch64.c
+++ b/libavcodec/aarch64/h264dsp_init_aarch64.c
@@ -79,6 +79,7 @@ void ff_h264_idct_add8_neon(uint8_t **dest, const int *block_offset,
                             const uint8_t nnzc[15 * 8]);
 
 void ff_h264_idct8_add_neon(uint8_t *dst, int16_t *block, int stride);
+void ff_h264_idct8_add_daedalus(uint8_t *dst, int16_t *block, int stride);
 void ff_h264_idct8_dc_add_neon(uint8_t *dst, int16_t *block, int stride);
 void ff_h264_idct8_add4_neon(uint8_t *dst, const int *block_offset,
                              int16_t *block, int stride,
@@ -146,7 +147,7 @@ av_cold void ff_h264dsp_init_aarch64(H264DSPContext *c, const int bit_depth,
         c->idct_add16intra = ff_h264_idct_add16intra_neon;
         if (chroma_format_idc <= 1)
             c->idct_add8 = ff_h264_idct_add8_neon;
-        c->idct8_add = ff_h264_idct8_add_neon;
+        c->idct8_add = ff_h264_idct8_add_daedalus;
         c->idct8_dc_add = ff_h264_idct8_dc_add_neon;
         c->idct8_add4 = ff_h264_idct8_add4_neon;
     } else if (have_neon(cpu_flags) && bit_depth == 10) {
--
2.47.3
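
The lazily initialised, process-global context that both shims in the patch rely on follows the usual pthread_once pattern. The sketch below shows that pattern in isolation, under loud assumptions: `demo_ctx`, `demo_ctx_init_once`, `demo_dispatch_idct8` and `demo_idct8_add` are hypothetical stand-ins, since the real daedalus-fourier context type and constructor are not visible in the patch. The stub dispatch performs no transform; it only counts calls and zeroes the block to mimic the destructive-zero behaviour the patch comment describes.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the library context; the real type is not
 * shown in the patch. */
typedef struct demo_ctx {
    int init_count;      /* how many times the initialiser ran */
    int dispatch_count;  /* how many blocks were dispatched    */
} demo_ctx;

static demo_ctx       g_demo_storage;
static demo_ctx      *g_demo_ctx;
static pthread_once_t g_demo_once = PTHREAD_ONCE_INIT;

/* Runs exactly once per process, no matter how many threads race here. */
static void demo_ctx_init_once(void)
{
    g_demo_storage.init_count++;
    g_demo_ctx = &g_demo_storage;
}

/* Stub dispatch (hypothetical): records the call and zeroes the
 * coefficients. It does NOT perform an IDCT. */
static void demo_dispatch_idct8(demo_ctx *ctx, uint8_t *dst, size_t stride,
                                int16_t *block, size_t n_blocks)
{
    assert(ctx != NULL);
    (void)dst;
    (void)stride;
    ctx->dispatch_count++;
    for (size_t i = 0; i < 64 * n_blocks; i++)
        block[i] = 0;
}

/* Same shape as the shim in the patch: ensure the context exists, then
 * forward a single 8x8 block. */
void demo_idct8_add(uint8_t *dst, int16_t *block, int stride)
{
    pthread_once(&g_demo_once, demo_ctx_init_once);
    demo_dispatch_idct8(g_demo_ctx, dst, (size_t)stride, block, 1);
}
```

Because pthread_once guards the initialiser, the 4x4 and 8x8 entry points can share one context without either needing to know which is called first.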