5c8b09349c
Last unmeasured H.264 kernel. mc20 picked as representative (horizontal half-pel, 6-tap filter; canonical for the H.264 luma qpel family). M1 PASS 10000/10000 first try, M3 = 131.477 Mblock/s on a single core (7.6 ns/block), 135x the 1080p30 floor.

Per the cycles 6+7 lightweight-kernel rationale, Phase 4 deferred: QPU dispatch floor (~250 ns/block) is 33x above the NEON per-block cost; R9 ≈ 0.03 deep RED. No realistic QPU offload value.

Generalization: all H.264 luma MC variants (mc02, mc11, mc22, etc.) will share this verdict. No need to measure each variant individually.

H.264 NEON is dramatically faster than VP9 NEON across the board:

- IDCT 4x4: 175 vs N/A (no VP9 analog)
- IDCT 8x8: 151 vs 8.2 Mblock/s (18x faster)
- MC 6/8-tap: 131 vs 7.0 (19x faster)
- Deblock: 92 vs 48 Medge/s (2x faster)

H.264 deployment recipe: all CPU NEON except deblock (opportunistic QPU). On a Pi 5 running H.264-only, the QPU is mostly idle.

Cycles 1-9 complete. Public API exposes all 9. Next: daedalus-v4l2 sibling repo per locked Phase 8 architecture (B + γ + sibling), then README polish.

- external/ffmpeg-snapshot/libavcodec/aarch64/h264qpel_neon.S vendored (1467 lines, all qpel variants)
- tests/h264_qpel8_mc20_ref.c: 40-line C ref (clip255 of 6-tap convolution)
- tests/bench_neon_h264qpel_mc20.c: M1 + M3 bench
- docs/k9_h264qpel_mc20.md: cycle 9 closure with comparison matrix

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
99 lines
4.7 KiB
Markdown
# FFmpeg source snapshot

Verbatim subset of FFmpeg source pinned for use as reference
implementations of the VP9 8×8 inverse DCT (Phase 1 target of
`daedalus-fourier`). See §2 and §5 of `../../docs/phase2.md` for
the rationale. The snapshot has since grown to cover the additional
VP9 and H.264 kernels listed under "Files in this snapshot".

## Upstream pin

- **Repository**: https://github.com/FFmpeg/FFmpeg
- **Tag**: `n7.1.3` (matches `libavcodec61 8:7.1.3-0+deb13u1+rpt1`
  shipping in Debian Trixie on the dev host `hertz`)
- **Annotated tag object**: `0a9a757e96fdf053697084bbd1f620edeac9d084`
- **Commit object (tag target)**: `f46e514491172d15bd74b4abb1814cd2f05a763e`
- **Snapshot fetched**: 2026-05-18 (UTC), via
  `https://raw.githubusercontent.com/FFmpeg/FFmpeg/n7.1.3/<path>`

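
The pin records two hashes because `n7.1.3` is an annotated tag: the tag is its own Git object, distinct from the commit it points at. A minimal sketch of that distinction, using a throwaway local repository and a hypothetical tag name `n0.0.0` (neither is part of this project):

```sh
# Build a scratch repository with one commit and one annotated tag.
# The identity values are placeholders needed only so git can write objects.
demo=$(mktemp -d)
git init -q "$demo"
git -C "$demo" -c user.name=demo -c user.email=demo@example.invalid \
    commit -q --allow-empty -m "initial"
git -C "$demo" -c user.name=demo -c user.email=demo@example.invalid \
    tag -a -m "release" n0.0.0

# The bare tag name resolves to the tag object itself ...
tag_object=$(git -C "$demo" rev-parse n0.0.0)
# ... while the ^{commit} suffix peels it to the commit it targets.
commit_object=$(git -C "$demo" rev-parse 'n0.0.0^{commit}')

echo "tag object:    $tag_object"
echo "commit object: $commit_object"
```

Against the real upstream, `git ls-remote https://github.com/FFmpeg/FFmpeg 'refs/tags/n7.1.3*'` should list both the tag object and a peeled `^{}` entry for the commit. That needs network access and has not been run as part of this document.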
## Files in this snapshot

Except where noted, all files are byte-for-byte copies of the
upstream source at the tagged commit, with no modifications.

| Path | Lines | Bytes | SHA-256 | Notes |
|---|---|---|---|---|
| `libavcodec/vp9dsp_template.c` | 2578 | 89045 | `41b21f667a6c497b620aa1637d8269badc45d1ac7e621d694441c5bf39356e4f` | |
| `libavcodec/aarch64/vp9itxfm_neon.S` | 1580 | 63534 | `82ee3ceed4735c63576bafdcee28e2215652743ade55a9eab46a16d9530369f6` | |
| `libavcodec/aarch64/vp9lpf_neon.S` | 1334 | — | `384e49e7a6e838d9e38aedc00838ed4aebfa6c5bdb343ecaf23ef639bc10fbb7` | |
| `libavcodec/aarch64/vp9mc_neon.S` | 665 | — | `6b1d50f9821742584fdd47758057f810644aff3a008faaa774ff5b9cac4d1fef` | |
| `libavcodec/aarch64/h264idct_neon.S` | 415 | 16269 | `963ffe5f31b5a6a422e13b0d394cf5630126927abfb23aa214f7cbe83d60683f` | H.264 IDCT 4×4/8×8/DC NEON kernels for cycle 6+ |
| `libavcodec/aarch64/h264dsp_neon.S` | 1076 | — | `978e076f0020e688b40c6dd827708c3d53e17c64a99fd0052e43d983536ce638` | H.264 in-loop deblock + weight/biweight kernels for cycle 8+ |
| `libavcodec/aarch64/h264qpel_neon.S` | 1467 | — | `897b79be7856341847ad7a5ce6ca0c15a7acc439a95bf33ddab616cfe982c544` | H.264 luma qpel MC (16 mc-position variants × put/avg × 8x8/16x16) for cycle 9 |
| `libavcodec/vp9_subpel_filters_table.c` | — | — | — | Hand-extracted from `libavcodec/vp9dsp.c` at the same n7.1.3 pin; provides `ff_vp9_subpel_filters` for `vp9mc_neon.S` to link against without dragging in vp9dsp.c's full init machinery |
| `libavcodec/aarch64/neon.S` | 173 | 7496 | `72d36ce6c3fcc5e53de869cfe10fda16225ebe580c32891bccc240a30a85a538` | |
| `libavutil/aarch64/asm.S` | 260 | 8069 | `c0d03143b1bc5a9e358222d08d2d449d595271844fe7a3dc23bffb91abe8b0e3` | |
| `COPYING.LGPLv2.1` | 502 | — | `b634ab5640e258563c536e658cad87080553df6f34f62269a21d554844e58bfe` | |

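Verify with:

```sh
( cd external/ffmpeg-snapshot && sha256sum -c <<'EOF'
41b21f667a6c497b620aa1637d8269badc45d1ac7e621d694441c5bf39356e4f  libavcodec/vp9dsp_template.c
82ee3ceed4735c63576bafdcee28e2215652743ade55a9eab46a16d9530369f6  libavcodec/aarch64/vp9itxfm_neon.S
384e49e7a6e838d9e38aedc00838ed4aebfa6c5bdb343ecaf23ef639bc10fbb7  libavcodec/aarch64/vp9lpf_neon.S
6b1d50f9821742584fdd47758057f810644aff3a008faaa774ff5b9cac4d1fef  libavcodec/aarch64/vp9mc_neon.S
963ffe5f31b5a6a422e13b0d394cf5630126927abfb23aa214f7cbe83d60683f  libavcodec/aarch64/h264idct_neon.S
978e076f0020e688b40c6dd827708c3d53e17c64a99fd0052e43d983536ce638  libavcodec/aarch64/h264dsp_neon.S
897b79be7856341847ad7a5ce6ca0c15a7acc439a95bf33ddab616cfe982c544  libavcodec/aarch64/h264qpel_neon.S
72d36ce6c3fcc5e53de869cfe10fda16225ebe580c32891bccc240a30a85a538  libavcodec/aarch64/neon.S
c0d03143b1bc5a9e358222d08d2d449d595271844fe7a3dc23bffb91abe8b0e3  libavutil/aarch64/asm.S
b634ab5640e258563c536e658cad87080553df6f34f62269a21d554844e58bfe  COPYING.LGPLv2.1
EOF
)
```
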
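
The Lines, Bytes, and SHA-256 columns of the file table can be regenerated mechanically, which may help fill in the rows whose byte counts are missing. A minimal sketch, assuming GNU coreutils; `provenance_row` is a hypothetical helper name, not something shipped in this repository:

```sh
# Print one Markdown table row (path, line count, byte count, SHA-256)
# for the file given as $1. Paths are printed exactly as passed in, so
# run it from the snapshot root to get table-relative paths.
provenance_row() {
    f=$1
    # tr strips the padding some wc implementations add around the number.
    lines=$(wc -l < "$f" | tr -d ' ')
    bytes=$(wc -c < "$f" | tr -d ' ')
    # sha256sum prints "<hash>  <name>"; keep only the hash field.
    hash=$(sha256sum "$f" | cut -d' ' -f1)
    printf '| `%s` | %s | %s | `%s` |\n' "$f" "$lines" "$bytes" "$hash"
}

# Usage, from the repository root:
#   ( cd external/ffmpeg-snapshot && provenance_row libavcodec/aarch64/vp9lpf_neon.S )
```

Note that `wc -l` counts newline characters, so a file without a trailing newline reports one line fewer than an editor would show.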
## License

LGPL-2.1-or-later. See `COPYING.LGPLv2.1`. Original copyright
holders include the FFmpeg authors and Google Inc. (2016) for
the aarch64 NEON paths. The snapshot inherits FFmpeg's license
in full.

## Why each file is in this snapshot

- `libavcodec/vp9dsp_template.c`: contains `idct_idct_8x8_add_c`,
  the bit-exact C reference for the Phase 1 kernel under test (M1).
- `libavcodec/aarch64/vp9itxfm_neon.S`: contains
  `ff_vp9_idct_idct_8x8_add_neon`, the NEON throughput baseline
  (M3). Also defines `idct8`, `dmbutterfly0`, `dmbutterfly`,
  `dmbutterfly_l`, `butterfly_8h`, and the `idct_coeffs` constant
  table.
- `libavcodec/aarch64/neon.S`: defines `transpose_8x8H` used by
  `vp9itxfm_neon.S`.
- `libavutil/aarch64/asm.S`: defines `function`, `endfunc`,
  `movrel`, `const`, `endconst`, and other assembly preamble
  macros required to assemble the above NEON files.

## Re-vendoring procedure

If the upstream pin needs to change (e.g., hertz updates to a
newer libavcodec):

```sh
TAG=nX.Y.Z
BASE=https://raw.githubusercontent.com/FFmpeg/FFmpeg/$TAG
cd external/ffmpeg-snapshot
for f in libavcodec/vp9dsp_template.c \
         libavcodec/aarch64/vp9itxfm_neon.S \
         libavcodec/aarch64/neon.S \
         libavutil/aarch64/asm.S \
         COPYING.LGPLv2.1; do
    curl -sSf -o "$f" "$BASE/$f"
done
sha256sum libavcodec/vp9dsp_template.c \
          libavcodec/aarch64/vp9itxfm_neon.S \
          libavcodec/aarch64/neon.S \
          libavutil/aarch64/asm.S \
          COPYING.LGPLv2.1
# update this PROVENANCE.md with the new tag, commit hash, and hashes
```

The lists above name five of the vendored files; the other
byte-for-byte copies in the file table can be fetched the same way,
while `libavcodec/vp9_subpel_filters_table.c` is hand-extracted and
has to be regenerated by hand.

After re-vendoring, re-run the bit-exact gate (M1) and the throughput
baseline (M3). Both can shift across FFmpeg versions even when
the VP9 spec doesn't change (e.g., NEON micro-optimizations).
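
To see which vendored files actually changed across a re-vendor, one option is to record a checksum manifest before fetching and check it afterwards. A minimal sketch, assuming GNU `sha256sum`; `changed_since` and the manifest name `before.sha256` are hypothetical and not part of this repository:

```sh
# Print the paths whose contents no longer match a manifest that was
# written with `sha256sum <files> > manifest` before re-vendoring.
changed_since() {
    # LC_ALL=C keeps the "FAILED" marker untranslated so sed can match it.
    # --quiet suppresses the per-file "OK" lines; summary warnings go to
    # stderr and are discarded.
    LC_ALL=C sha256sum -c --quiet "$1" 2>/dev/null | sed -n 's/: FAILED$//p'
}

# Usage (run inside external/ffmpeg-snapshot):
#   sha256sum libavcodec/aarch64/neon.S libavutil/aarch64/asm.S > before.sha256
#   ... fetch the new tag as described above ...
#   changed_since before.sha256
```

Files that are missing or unreadable are reported by `sha256sum` with a different message and are deliberately not listed by this helper. An empty result suggests the fetched sources are identical to the previous pin, although M1 and M3 are still worth re-running to confirm.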