be7ff5587c
Second kernel candidate per phase7_M4.md verdict "next-kernel cycle
authorised": VP9 4-tap inner loop filter, horizontal direction,
8-pixel edge (libavcodec ff_vp9_loop_filter_h_4_8_neon as baseline).
Different workload shape from IDCT (boundary streaming, lighter
compute per unit, per-row conditionals), which tests whether the QPU
win generalises.
docs/k2_deblock_phase1.md - goal-setting. Same R-band decision rules
as cycle 1 (phase1.md), with the cycle-1 calibration adjustment:
ORANGE band is no longer auto-close because M4 showed mixed > pure
CPU even at modest R when CPU bandwidth-saturates.
docs/k2_deblock_phase2.md - situation analysis. C reference already
in vendored snapshot (vp9dsp_template.c:1780-1898). Fetched
vp9lpf_neon.S fresh (1334 lines, LGPL-2.1+, FFmpeg n7.1.3 pin,
SHA-256 384e49e7...). PROVENANCE.md updated.
docs/k2_deblock_phase3.md - NEON baseline:
  M1''_c bit-exact     100.0000 % (10000 random edges)
  M3'' throughput      48.285 Medge/s (20.7 ns/edge, single A76)
  per-frame 1080p-eq   748 FPS (worst case 64 530 edges/frame)
  cycles/edge          ~58 (= 20.7 ns x 2.8 GHz), ~7 cycles/row

LPF is 5.9x faster per-unit than IDCT M3 (20.7 vs 122 ns), so the
QPU break-even point moves down. Predicted R''_v1 band ~0.5-0.9:
frame-level batching amortises the same 33 us dispatch overhead;
workload becomes bandwidth-bound rather than compute-bound
(~5.7 MB/frame traffic at 64 530 edges x ~88 B per edge).
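
The derived figures above all follow from the measured throughput. A minimal sketch for re-checking them, assuming 8 rows per edge and the quoted 2.8 GHz clock (`lpf_budget` is a hypothetical helper name, not something in the repo):

```sh
# Hypothetical helper (not in the repo): recompute the derived baseline
# figures from measured throughput. Assumes 8 rows per edge.
# Args: Medge/s, clock in GHz, edges per frame, bytes of traffic per edge.
lpf_budget() {
    awk -v medge="$1" -v ghz="$2" -v edges="$3" -v bytes="$4" 'BEGIN {
        ns = 1000 / medge                      # ns per edge
        printf "ns/edge     %.1f\n", ns
        printf "cycles/edge %.0f\n", ns * ghz  # ns x GHz = cycles
        printf "cycles/row  %.1f\n", ns * ghz / 8
        printf "fps         %.0f\n", medge * 1e6 / edges
        printf "MB/frame    %.1f\n", edges * bytes / 1e6
    }'
}

lpf_budget 48.285 2.8 64530 88
```

Feeding in a new M3'' measurement shows how the per-frame numbers would move.
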
New artifacts:
- tests/vp9_lpf_ref.c - standalone bit-exact C ref (8-bit, wd=4
  inner only; clean transcription)
- tests/bench_neon_lpf.c - M1''_c gate + M3'' time-based bench
  (5s window, edge-content-biased RNG for realistic fm/hev hit rates)
- external/ffmpeg-snapshot/libavcodec/aarch64/vp9lpf_neon.S
- CMakeLists.txt updated with bench_neon_lpf target
Phase 4 next: plan the QPU LPF compute shader. Cycle 1's phase4.md
+ phase5.md + phase7.md learnings apply directly - bake in the v4
winning patterns from the start (WG=256, edges-per-subgroup
pattern adapted from blocks, uint8_t dst SSBO, oob flag, unrolled
writes).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
94 lines
3.8 KiB
Markdown
# FFmpeg source snapshot

Verbatim subset of FFmpeg source pinned for use as reference
implementations of the VP9 8×8 inverse DCT (Phase 1 target of
`daedalus-fourier`) and, as of the second kernel cycle, the VP9
4-tap loop filter. See `../../docs/phase2.md §2` and `§5` for
the rationale, and `../../docs/k2_deblock_phase2.md` for the
loop-filter addition.

## Upstream pin

- **Repository**: https://github.com/FFmpeg/FFmpeg
- **Tag**: `n7.1.3` (matches `libavcodec61 8:7.1.3-0+deb13u1+rpt1`
  shipping in Debian Trixie on the dev host `hertz`)
- **Annotated tag object**: `0a9a757e96fdf053697084bbd1f620edeac9d084`
- **Commit object (tag target)**: `f46e514491172d15bd74b4abb1814cd2f05a763e`
- **Snapshot fetched**: 2026-05-18 (UTC), via
  `https://raw.githubusercontent.com/FFmpeg/FFmpeg/n7.1.3/<path>`

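The two object IDs in the pin can be re-checked against the upstream repository. The sketch below is an illustration, not part of the repo: `check_pin` is a hypothetical helper that reads `git ls-remote` output on stdin and compares the annotated tag object and the peeled commit (`^{}` ref) with the expected values.

```sh
# Hypothetical helper: reads `git ls-remote` output ("<sha>\t<ref>" lines)
# on stdin and succeeds only if both the annotated tag object and the
# peeled commit match the pinned values.
check_pin() {
    tag=$1 want_tag_obj=$2 want_commit=$3
    got_tag_obj= got_commit=
    while read -r sha ref; do
        case $ref in
            "refs/tags/$tag")    got_tag_obj=$sha ;;
            "refs/tags/$tag^{}") got_commit=$sha ;;
        esac
    done
    [ "$got_tag_obj" = "$want_tag_obj" ] && [ "$got_commit" = "$want_commit" ]
}

# Usage (needs network access):
#   git ls-remote --tags https://github.com/FFmpeg/FFmpeg 'n7.1.3*' |
#     check_pin n7.1.3 0a9a757e96fdf053697084bbd1f620edeac9d084 \
#                      f46e514491172d15bd74b4abb1814cd2f05a763e && echo "pin OK"
```

Reading from stdin keeps the comparison testable offline; only the commented `git ls-remote` call touches the network.
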
## Files in this snapshot

All files are byte-for-byte copies of the upstream source at the
tagged commit, no modifications.

| Path | Lines | Bytes | SHA-256 |
|---|---|---|---|
| `libavcodec/vp9dsp_template.c` | 2578 | 89045 | `41b21f667a6c497b620aa1637d8269badc45d1ac7e621d694441c5bf39356e4f` |
| `libavcodec/aarch64/vp9itxfm_neon.S` | 1580 | 63534 | `82ee3ceed4735c63576bafdcee28e2215652743ade55a9eab46a16d9530369f6` |
| `libavcodec/aarch64/vp9lpf_neon.S` | 1334 | - | `384e49e7a6e838d9e38aedc00838ed4aebfa6c5bdb343ecaf23ef639bc10fbb7` |
| `libavcodec/aarch64/neon.S` | 173 | 7496 | `72d36ce6c3fcc5e53de869cfe10fda16225ebe580c32891bccc240a30a85a538` |
| `libavutil/aarch64/asm.S` | 260 | 8069 | `c0d03143b1bc5a9e358222d08d2d449d595271844fe7a3dc23bffb91abe8b0e3` |
| `COPYING.LGPLv2.1` | 502 | - | `b634ab5640e258563c536e658cad87080553df6f34f62269a21d554844e58bfe` |

Verify with:

```sh
( cd external/ffmpeg-snapshot && sha256sum -c <<'EOF'
41b21f667a6c497b620aa1637d8269badc45d1ac7e621d694441c5bf39356e4f  libavcodec/vp9dsp_template.c
82ee3ceed4735c63576bafdcee28e2215652743ade55a9eab46a16d9530369f6  libavcodec/aarch64/vp9itxfm_neon.S
384e49e7a6e838d9e38aedc00838ed4aebfa6c5bdb343ecaf23ef639bc10fbb7  libavcodec/aarch64/vp9lpf_neon.S
72d36ce6c3fcc5e53de869cfe10fda16225ebe580c32891bccc240a30a85a538  libavcodec/aarch64/neon.S
c0d03143b1bc5a9e358222d08d2d449d595271844fe7a3dc23bffb91abe8b0e3  libavutil/aarch64/asm.S
b634ab5640e258563c536e658cad87080553df6f34f62269a21d554844e58bfe  COPYING.LGPLv2.1
EOF
)
```

## License

LGPL-2.1-or-later. See `COPYING.LGPLv2.1`. Original copyright
holders include the FFmpeg authors and Google Inc. (2016) for
the aarch64 NEON paths. The snapshot inherits FFmpeg's license
in full.

## Why each file is in this snapshot

- `libavcodec/vp9dsp_template.c`: contains `idct_idct_8x8_add_c`,
  the bit-exact C reference for the Phase 1 kernel under test (M1).
- `libavcodec/aarch64/vp9itxfm_neon.S`: contains
  `ff_vp9_idct_idct_8x8_add_neon`, the NEON throughput baseline
  (M3). Also defines `idct8`, `dmbutterfly0`, `dmbutterfly`,
  `dmbutterfly_l`, `butterfly_8h`, and the `idct_coeffs` constant
  table.
- `libavcodec/aarch64/vp9lpf_neon.S`: contains
  `ff_vp9_loop_filter_h_4_8_neon`, the NEON baseline for the
  second-kernel loop-filter work (see
  `../../docs/k2_deblock_phase3.md`).
- `libavcodec/aarch64/neon.S`: defines `transpose_8x8H` used by
  `vp9itxfm_neon.S`.
- `libavutil/aarch64/asm.S`: defines `function`, `endfunc`,
  `movrel`, `const`, `endconst`, and other assembly preamble
  macros required to assemble the above NEON files.

## Re-vendoring procedure

If the upstream pin needs to change (e.g., hertz updates to a
newer libavcodec):

```sh
TAG=nX.Y.Z
BASE=https://raw.githubusercontent.com/FFmpeg/FFmpeg/$TAG
cd external/ffmpeg-snapshot
for f in libavcodec/vp9dsp_template.c \
         libavcodec/aarch64/vp9itxfm_neon.S \
         libavcodec/aarch64/vp9lpf_neon.S \
         libavcodec/aarch64/neon.S \
         libavutil/aarch64/asm.S \
         COPYING.LGPLv2.1; do
    curl -sSf -o "$f" "$BASE/$f"
done
sha256sum libavcodec/vp9dsp_template.c \
          libavcodec/aarch64/vp9itxfm_neon.S \
          libavcodec/aarch64/vp9lpf_neon.S \
          libavcodec/aarch64/neon.S \
          libavutil/aarch64/asm.S \
          COPYING.LGPLv2.1
# update this PROVENANCE.md with the new tag, commit hash, and hashes
```

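Updating the "Files in this snapshot" table by hand is error-prone, so the last step can be partly scripted. The following is a sketch under assumptions: `provenance_row` is a hypothetical helper (not part of the repo) that prints one table row per file in the column order Path, Lines, Bytes, SHA-256, with lines counted by `wc -l`.

```sh
# Hypothetical helper: print one Markdown table row (path, line count,
# byte count, SHA-256) for the file given as the first argument.
provenance_row() {
    f=$1
    lines=$(wc -l < "$f" | tr -d ' ')    # strip padding some wc builds add
    bytes=$(wc -c < "$f" | tr -d ' ')
    hash=$(sha256sum "$f" | cut -d' ' -f1)
    printf '| `%s` | %s | %s | `%s` |\n' "$f" "$lines" "$bytes" "$hash"
}
```

Run from `external/ffmpeg-snapshot`, a call such as `provenance_row libavcodec/aarch64/neon.S` produces a row that can be pasted into the table, which also fills in any missing Bytes cells.
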
After re-vendoring, re-run the bit-exact gate (M1) and throughput
baseline (M3); both can shift across FFmpeg versions even when
the VP9 spec doesn't change (e.g., NEON micro-optimizations).