Second kernel candidate per phase7_M4.md verdict "next-kernel cycle
authorised". VP9 4-tap inner loop filter, horizontal direction,
8-pixel edge (libavcodec ff_vp9_loop_filter_h_4_8_neon as baseline).
Different workload shape from IDCT - boundary streaming, lighter
compute per unit, per-row conditionals - tests whether QPU win
generalises.
docs/k2_deblock_phase1.md - goal-setting. Same R-band decision rules
as cycle 1 (phase1.md), with the cycle-1 calibration adjustment:
the ORANGE band is no longer an auto-close, because M4 showed mixed
execution beating pure CPU even at modest R once the CPU is
bandwidth-saturated.
docs/k2_deblock_phase2.md - situation analysis. C reference already
in vendored snapshot (vp9dsp_template.c:1780-1898). Fetched
vp9lpf_neon.S fresh (1334 lines, LGPL-2.1+, FFmpeg n7.1.3 pin,
SHA-256 384e49e7...). PROVENANCE.md updated.
docs/k2_deblock_phase3.md - NEON baseline:
    M1''_c       bit-exact   100.0000 % (10000 random edges)
    M3''         throughput  48.285 Medge/s (20.7 ns/edge, single A76)
    per-frame    1080p-eq    748 FPS (worst case 64 530 edges/frame)
    cycles/edge  ~58 (= 20.7 ns x 2.8 GHz), ~7 cycles/row
LPF is 5.9x faster per-unit than IDCT M3 (20.7 vs 122 ns), so the
QPU break-even point moves down. Predicted R''_v1 band ~0.5-0.9
- frame-level batching amortises the same 33us dispatch overhead;
workload becomes bandwidth-bound rather than compute-bound
(~5.7 MB/frame traffic at 64 530 edges x ~88 B per edge).
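
The derived figures above can be re-checked from the measured inputs. This is a minimal sketch, assuming POSIX sh and awk. The function name derive_lpf_figures is hypothetical, and the divisor of 8 rows per edge is an assumption taken from the 8-pixel edge named earlier. It only repeats the arithmetic; it measures nothing.

```sh
#!/bin/sh
# Hypothetical helper: recompute the derived baseline figures.
# Arguments: Medge/s, edges per frame, clock in GHz, bytes per edge,
#            dispatch overhead in microseconds.
derive_lpf_figures() {
    awk -v medge_s="$1" -v edges="$2" -v ghz="$3" \
        -v bytes_edge="$4" -v dispatch_us="$5" 'BEGIN {
        ns_edge  = 1000 / medge_s            # ns per edge
        frame_us = edges * ns_edge / 1000    # NEON time per worst-case frame
        printf "ns_per_edge=%.1f\n",     ns_edge
        printf "fps=%.0f\n",             medge_s * 1e6 / edges
        printf "cycles_per_edge=%.0f\n", ns_edge * ghz
        # Assumption: 8 rows per edge (8-pixel edge).
        printf "cycles_per_row=%.2f\n",  ns_edge * ghz / 8
        printf "mb_per_frame=%.2f\n",    edges * bytes_edge / 1e6
        printf "neon_frame_us=%.0f\n",   frame_us
        printf "dispatch_pct=%.1f\n",    100 * dispatch_us / frame_us
    }'
}

# Inputs quoted from the Phase 3 results above.
derive_lpf_figures 48.285 64530 2.8 88 33
```

With these inputs, one 33us dispatch per frame comes to about 2.5 % of the roughly 1336 us NEON frame time, which is the sense in which frame-level batching amortises the overhead.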
New artifacts:
- tests/vp9_lpf_ref.c - standalone bit-exact C ref (8-bit, wd=4
inner only; clean transcription)
- tests/bench_neon_lpf.c - M1''_c gate + M3'' time-based bench
(5s window, edge-content-biased RNG for
realistic fm/hev hit rates)
- external/ffmpeg-snapshot/libavcodec/aarch64/vp9lpf_neon.S
- CMakeLists.txt updated with bench_neon_lpf target
Phase 4 next: plan the QPU LPF compute shader. Cycle 1's phase4.md
+ phase5.md + phase7.md learnings apply directly - bake in the v4
winning patterns from the start (WG=256, edges-per-subgroup
pattern adapted from blocks, uint8_t dst SSBO, oob flag, unrolled
writes).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
FFmpeg source snapshot
Verbatim subset of FFmpeg source pinned for use as reference
implementations of the VP9 8×8 inverse DCT (Phase 1 target of
daedalus-fourier). See ../../docs/phase2.md §2 and §5 for
the rationale.
Upstream pin
- Repository: https://github.com/FFmpeg/FFmpeg
- Tag: `n7.1.3` (matches `libavcodec61 8:7.1.3-0+deb13u1+rpt1` shipping in Debian Trixie on the dev host `hertz`)
- Annotated tag object: `0a9a757e96fdf053697084bbd1f620edeac9d084`
- Commit object (tag target): `f46e514491172d15bd74b4abb1814cd2f05a763e`
- Snapshot fetched: 2026-05-18 (UTC), via `https://raw.githubusercontent.com/FFmpeg/FFmpeg/n7.1.3/<path>`
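
The two hashes in the pin can be cross-checked against the upstream repository. The sketch below assumes POSIX sh, awk and a `git` client. The helper name `check_pin` is hypothetical. It reads `git ls-remote` output on stdin, where the peeled entry (`refs/tags/<tag>^{}`) carries the commit the annotated tag points at.

```sh
#!/bin/sh
# Hypothetical helper: compare `git ls-remote` output (stdin) with the pin.
# Arguments: expected tag-object hash, expected commit hash, tag name.
check_pin() {
    awk -v t="$1" -v c="$2" -v ref="refs/tags/$3" '
        $2 == ref          { got_t = $1 }   # annotated tag object
        $2 == (ref "^{}")  { got_c = $1 }   # peeled entry: the tagged commit
        # Append "" to force string comparison of the hex hashes.
        END { exit !((got_t "") == (t "") && (got_c "") == (c "")) }'
}

# Usage (needs network access, so it is left commented out here):
#   git ls-remote --tags https://github.com/FFmpeg/FFmpeg 'n7.1.3*' |
#       check_pin 0a9a757e96fdf053697084bbd1f620edeac9d084 \
#                 f46e514491172d15bd74b4abb1814cd2f05a763e n7.1.3 &&
#       echo "pin matches upstream"
```

The function returns non-zero if either hash differs or if the tag is missing from the listing.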
Files in this snapshot
All files are byte-for-byte copies of the upstream source at the tagged commit, with no modifications.
| Path | Lines | Bytes | SHA-256 |
|---|---|---|---|
| `libavcodec/vp9dsp_template.c` | 2578 | 89045 | `41b21f667a6c497b620aa1637d8269badc45d1ac7e621d694441c5bf39356e4f` |
| `libavcodec/aarch64/vp9itxfm_neon.S` | 1580 | 63534 | `82ee3ceed4735c63576bafdcee28e2215652743ade55a9eab46a16d9530369f6` |
| `libavcodec/aarch64/vp9lpf_neon.S` | 1334 | — | `384e49e7a6e838d9e38aedc00838ed4aebfa6c5bdb343ecaf23ef639bc10fbb7` |
| `libavcodec/aarch64/neon.S` | 173 | 7496 | `72d36ce6c3fcc5e53de869cfe10fda16225ebe580c32891bccc240a30a85a538` |
| `libavutil/aarch64/asm.S` | 260 | 8069 | `c0d03143b1bc5a9e358222d08d2d449d595271844fe7a3dc23bffb91abe8b0e3` |
| `COPYING.LGPLv2.1` | 502 | — | `b634ab5640e258563c536e658cad87080553df6f34f62269a21d554844e58bfe` |
Verify with:
( cd external/ffmpeg-snapshot && sha256sum -c <<'EOF'
41b21f667a6c497b620aa1637d8269badc45d1ac7e621d694441c5bf39356e4f  libavcodec/vp9dsp_template.c
82ee3ceed4735c63576bafdcee28e2215652743ade55a9eab46a16d9530369f6  libavcodec/aarch64/vp9itxfm_neon.S
384e49e7a6e838d9e38aedc00838ed4aebfa6c5bdb343ecaf23ef639bc10fbb7  libavcodec/aarch64/vp9lpf_neon.S
72d36ce6c3fcc5e53de869cfe10fda16225ebe580c32891bccc240a30a85a538  libavcodec/aarch64/neon.S
c0d03143b1bc5a9e358222d08d2d449d595271844fe7a3dc23bffb91abe8b0e3  libavutil/aarch64/asm.S
b634ab5640e258563c536e658cad87080553df6f34f62269a21d554844e58bfe  COPYING.LGPLv2.1
EOF
)
License
LGPL-2.1-or-later. See COPYING.LGPLv2.1. Original copyright
holders include the FFmpeg authors and Google Inc. (2016) for
the aarch64 NEON paths. The snapshot inherits FFmpeg's license
in full.
Why each file is in this snapshot
- `libavcodec/vp9dsp_template.c`: contains `idct_idct_8x8_add_c`, the bit-exact C reference for the Phase 1 kernel under test (M1).
- `libavcodec/aarch64/vp9itxfm_neon.S`: contains `ff_vp9_idct_idct_8x8_add_neon`, the NEON throughput baseline (M3). Also defines `idct8`, `dmbutterfly0`, `dmbutterfly`, `dmbutterfly_l`, `butterfly_8h`, and the `idct_coeffs` constant table.
- `libavcodec/aarch64/neon.S`: defines `transpose_8x8H`, used by `vp9itxfm_neon.S`.
- `libavutil/aarch64/asm.S`: defines `function`, `endfunc`, `movrel`, `const`, `endconst`, and other assembly preamble macros required to assemble the above NEON files.
Re-vendoring procedure
If the upstream pin needs to change (e.g., hertz updates to a newer libavcodec):
TAG=nX.Y.Z
BASE=https://raw.githubusercontent.com/FFmpeg/FFmpeg/$TAG
cd external/ffmpeg-snapshot
for f in libavcodec/vp9dsp_template.c \
         libavcodec/aarch64/vp9itxfm_neon.S \
         libavcodec/aarch64/vp9lpf_neon.S \
         libavcodec/aarch64/neon.S \
         libavutil/aarch64/asm.S \
         COPYING.LGPLv2.1; do
    curl -sSf -o "$f" "$BASE/$f"
done
sha256sum libavcodec/vp9dsp_template.c \
          libavcodec/aarch64/vp9itxfm_neon.S \
          libavcodec/aarch64/vp9lpf_neon.S \
          libavcodec/aarch64/neon.S \
          libavutil/aarch64/asm.S \
          COPYING.LGPLv2.1
# update this PROVENANCE.md with the new tag, commit hash, and hashes
After re-vendoring, re-run the bit-exact gate (M1) and throughput baseline (M3) — both can shift across FFmpeg versions even when the VP9 spec doesn't change (e.g., NEON micro-optimizations).
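
The last step of the procedure, updating the file table, can be scripted so that line counts, byte counts and hashes are not copied by hand. This is a minimal sketch, assuming POSIX sh with `wc`, `cut` and GNU `sha256sum`. The helper name `provenance_row` is hypothetical, and the row layout (path, lines, bytes, SHA-256, with backticks around path and hash) is an assumption modelled on the table above. It also assumes the Lines column is a newline count as reported by `wc -l`.

```sh
#!/bin/sh
# Hypothetical helper: print one markdown table row for a vendored file.
provenance_row() {
    path=$1
    lines=$(wc -l < "$path" | tr -d '[:space:]')   # newline count
    bytes=$(wc -c < "$path" | tr -d '[:space:]')   # size in bytes
    hash=$(sha256sum "$path" | cut -d' ' -f1)      # hex digest only
    printf '| `%s` | %s | %s | `%s` |\n' "$path" "$lines" "$bytes" "$hash"
}

# Usage, run from external/ffmpeg-snapshot after fetching (commented out
# because the files are not present outside the repository):
#   for f in libavcodec/vp9dsp_template.c libavcodec/aarch64/neon.S; do
#       provenance_row "$f"
#   done
```

Generating the rows from the files also fills in the byte counts that the current table leaves blank for some entries.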