356e446a49
Third daedalus-fourier kernel: VP9 8-tap regular subpel filter,
horizontal direction, 8-wide output. Multiply-heavy by design to
stress V3D's no-DP4A deficit. Full cycle Phase 1-7 + M4'''.

Phase 5''' second-model review delivered cleanly: it caught 1 RED
bug pre-implementation (src_off off-by-3 indexing convention) and
2 YELLOW gaps (assert MUST language, shaderdb filter-LUT gate).
Without the review, M1''' would have failed silently on first run
with cryptic "high-index source pixels wrong" symptoms.

Phase 6 v1 first-light: M1''' 100.0000% bit-exact (65536/65536
blocks across all 16 mx phases). Phase 5''' filter-LUT prediction
materialised exactly: 197 uniforms (gate was 144), 2 threads (down
from cycle-2's 4 due to register pressure).
Performance:
  M2''' = 1.413 Mblock/s (707.9 ns/block)
  M3''' = 20.997 Mblock/s (NEON baseline phase3)
  R'''  = 0.067 (RED band, structural mismatch)
  shaderdb: 488 inst, 2 threads, 197 uniforms, 25 max-temps, 0 spills

M4''' concurrent matrix (8s windows):
  NEON 1-core         14.479 Mblock/s
  NEON 4-core         15.248 Mblock/s  <- baseline (compute-bound, not
                                          bandwidth-saturated like
                                          cycles 1+2!)
  QPU only             1.380 Mblock/s
  MIXED NEON-3 + QPU  12.277 Mblock/s  <- -19.5% (FAIL gate)
  MIXED NEON-4 + QPU  12.158 Mblock/s  <- -20.3%
NEW cross-cycle finding (Phase 9 lesson 2): compute-bound CPU
workloads make the QPU-offload story collapse. Cycles 1+2 were
bandwidth-saturated (4-core scaling 0.56-0.82x of 1-core), so
freeing a CPU core via QPU offload added throughput. Cycle 3 MC
is compute-bound (4-core scaling 1.05x of 1-core, near-linear),
so there are no idle cycles to free. The QPU contribution (0.45
Mblock/s in contention) doesn't compensate for losing 1 NEON core
delivering ~3.8 Mblock/s.

The 30fps@1080p floor still PASSES in every config (1.4x to 15.7x
isolation margin). Per project_30fps_floor_is_fine.md, the
user-facing test never fails: daily YouTube playback works fine on
any CPU/QPU split.
DEPLOYMENT RECIPE for higgs (cycle 3 confirmed split):
  IDCT (k1) -> QPU (R=0.92, +7% mixed, frees CPU core)
  LPF (k2)  -> QPU (R=0.41, +7% mixed, frees CPU core)
  MC (k3)   -> CPU (R=0.067, -19.5% mixed; stays on CPU)
  Entropy   -> CPU (structurally serial)

Mixed-substrate deployment, not "QPU does everything". Realistic for
higgs: entropy + MC on 2-3 ARM cores; IDCT + LPF dispatched to QPU
concurrently; 1-2 ARM cores left for vscode etc.
New artifacts:
- src/v3d_mc_8h.comp: GLSL kernel
- tests/vp9_mc_ref.c: standalone C ref (REGULAR filter embedded;
  clean transcription)
- tests/bench_neon_mc.c: M1'''_c + M3''' bench
- tests/bench_v3d_mc.c: M1''' + M2''' bench with contract asserts
  + 30fps margin display
- tests/bench_concurrent_mc.c: M4''' pthread bench
- external/ffmpeg-snapshot/libavcodec/aarch64/vp9mc_neon.S (vendored)
- external/ffmpeg-snapshot/libavcodec/vp9_subpel_filters_table.c
  (hand-extracted; provides ff_vp9_subpel_filters symbol without
  dragging in full vp9dsp.c)
- docs/k3_mc_phase{1,2,3,4,5,7}.md: full cycle documentation

Memory updates: project_30fps_floor_is_fine.md (user's 30fps target
recalibration), MEMORY.md index updated.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
# FFmpeg source snapshot

Verbatim subset of FFmpeg source pinned for use as reference
implementations of the VP9 8×8 inverse DCT (Phase 1 target of
`daedalus-fourier`). See `../../docs/phase2.md §2` and `§5` for
the rationale.

## Upstream pin

- **Repository**: https://github.com/FFmpeg/FFmpeg
- **Tag**: `n7.1.3` (matches `libavcodec61 8:7.1.3-0+deb13u1+rpt1`
  shipping in Debian Trixie on the dev host `hertz`)
- **Annotated tag object**: `0a9a757e96fdf053697084bbd1f620edeac9d084`
- **Commit object (tag target)**: `f46e514491172d15bd74b4abb1814cd2f05a763e`
- **Snapshot fetched**: 2026-05-18 (UTC), via
  `https://raw.githubusercontent.com/FFmpeg/FFmpeg/n7.1.3/<path>`

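One way to confirm that the tag still resolves to the two object IDs listed above is to compare them against `git ls-remote` output. The sketch below is not part of the repository: `check_pin` is a hypothetical helper that reads `<sha><TAB><ref>` lines on stdin, and it assumes the commit behind an annotated tag is reported under the peeled `^{}` ref name.

```sh
# Hypothetical helper (not in the repo): succeeds only if both the annotated
# tag object and its peeled commit match the pinned values.
check_pin() {
    tag=$1 want_tag_obj=$2 want_commit=$3
    awk -v tag="refs/tags/$tag" -v t="$want_tag_obj" -v c="$want_commit" '
        $2 == tag          { tag_ok = ($1 == t) }     # annotated tag object
        $2 == (tag "^{}")  { commit_ok = ($1 == c) }  # peeled commit
        END { exit !(tag_ok && commit_ok) }
    '
}

# Example (needs network access):
#   git ls-remote --tags https://github.com/FFmpeg/FFmpeg \
#       refs/tags/n7.1.3 'refs/tags/n7.1.3^{}' |
#     check_pin n7.1.3 \
#       0a9a757e96fdf053697084bbd1f620edeac9d084 \
#       f46e514491172d15bd74b4abb1814cd2f05a763e && echo "pin OK"
```

A non-zero exit status means the tag was moved, re-created, or is missing upstream, in which case the hashes in this file should not be trusted without re-vendoring.
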
## Files in this snapshot

All files are byte-for-byte copies of the upstream source at the
tagged commit, with no modifications, except
`libavcodec/vp9_subpel_filters_table.c`, which is hand-extracted
(see its table row).

| Path | Lines | Bytes | SHA-256 |
|---|---|---|---|
| `libavcodec/vp9dsp_template.c` | 2578 | 89045 | `41b21f667a6c497b620aa1637d8269badc45d1ac7e621d694441c5bf39356e4f` |
| `libavcodec/aarch64/vp9itxfm_neon.S` | 1580 | 63534 | `82ee3ceed4735c63576bafdcee28e2215652743ade55a9eab46a16d9530369f6` |
| `libavcodec/aarch64/vp9lpf_neon.S` | 1334 | - | `384e49e7a6e838d9e38aedc00838ed4aebfa6c5bdb343ecaf23ef639bc10fbb7` |
| `libavcodec/aarch64/vp9mc_neon.S` | 665 | - | `6b1d50f9821742584fdd47758057f810644aff3a008faaa774ff5b9cac4d1fef` |
| `libavcodec/vp9_subpel_filters_table.c` | - | - | hand-extracted from `libavcodec/vp9dsp.c` at the same n7.1.3 pin; provides `ff_vp9_subpel_filters` for `vp9mc_neon.S` to link against without dragging in vp9dsp.c's full init machinery |
| `libavcodec/aarch64/neon.S` | 173 | 7496 | `72d36ce6c3fcc5e53de869cfe10fda16225ebe580c32891bccc240a30a85a538` |
| `libavutil/aarch64/asm.S` | 260 | 8069 | `c0d03143b1bc5a9e358222d08d2d449d595271844fe7a3dc23bffb91abe8b0e3` |
| `COPYING.LGPLv2.1` | 502 | - | `b634ab5640e258563c536e658cad87080553df6f34f62269a21d554844e58bfe` |

Verify with:

```sh
( cd external/ffmpeg-snapshot && sha256sum -c <<'EOF'
41b21f667a6c497b620aa1637d8269badc45d1ac7e621d694441c5bf39356e4f  libavcodec/vp9dsp_template.c
82ee3ceed4735c63576bafdcee28e2215652743ade55a9eab46a16d9530369f6  libavcodec/aarch64/vp9itxfm_neon.S
384e49e7a6e838d9e38aedc00838ed4aebfa6c5bdb343ecaf23ef639bc10fbb7  libavcodec/aarch64/vp9lpf_neon.S
6b1d50f9821742584fdd47758057f810644aff3a008faaa774ff5b9cac4d1fef  libavcodec/aarch64/vp9mc_neon.S
72d36ce6c3fcc5e53de869cfe10fda16225ebe580c32891bccc240a30a85a538  libavcodec/aarch64/neon.S
c0d03143b1bc5a9e358222d08d2d449d595271844fe7a3dc23bffb91abe8b0e3  libavutil/aarch64/asm.S
b634ab5640e258563c536e658cad87080553df6f34f62269a21d554844e58bfe  COPYING.LGPLv2.1
EOF
)
```

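The `Lines` and `Bytes` columns of the table can be spot-checked in a similar way. This is a minimal sketch, assuming the `Lines` column counts newline characters the way `wc -l` does; `check_size` is a hypothetical helper name, not something shipped in the repository.

```sh
# Hypothetical helper (not in the repo): compare a file's line and byte
# counts against the values recorded in the table.
check_size() {
    file=$1 want_lines=$2 want_bytes=$3
    # tr strips the padding some wc implementations add around the number
    got_lines=$(wc -l < "$file" | tr -d ' ')
    got_bytes=$(wc -c < "$file" | tr -d ' ')
    if [ "$got_lines" = "$want_lines" ] && [ "$got_bytes" = "$want_bytes" ]; then
        echo "OK   $file"
    else
        echo "FAIL $file: $got_lines lines, $got_bytes bytes"
        return 1
    fi
}

# Example, run from external/ffmpeg-snapshot:
#   check_size libavcodec/aarch64/neon.S 173 7496
```

A size mismatch alongside a matching SHA-256 would point at a typo in the table rather than a changed file, since the hash is the authoritative check.
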
## License

LGPL-2.1-or-later. See `COPYING.LGPLv2.1`. Original copyright
holders include the FFmpeg authors and Google Inc. (2016) for
the aarch64 NEON paths. The snapshot inherits FFmpeg's license
in full.

## Why each file is in this snapshot

- `libavcodec/vp9dsp_template.c`: contains `idct_idct_8x8_add_c`,
  the bit-exact C reference for the Phase 1 kernel under test (M1).
- `libavcodec/aarch64/vp9itxfm_neon.S`: contains
  `ff_vp9_idct_idct_8x8_add_neon`, the NEON throughput baseline
  (M3). Also defines `idct8`, `dmbutterfly0`, `dmbutterfly`,
  `dmbutterfly_l`, `butterfly_8h`, and the `idct_coeffs` constant
  table.
- `libavcodec/aarch64/vp9mc_neon.S`: the NEON throughput baseline
  for the cycle-3 MC kernel (M3''').
- `libavcodec/vp9_subpel_filters_table.c`: provides the
  `ff_vp9_subpel_filters` symbol that `vp9mc_neon.S` links against.
- `libavcodec/aarch64/neon.S`: defines `transpose_8x8H` used by
  `vp9itxfm_neon.S`.
- `libavutil/aarch64/asm.S`: defines `function`, `endfunc`,
  `movrel`, `const`, `endconst`, and other assembly preamble
  macros required to assemble the above NEON files.

## Re-vendoring procedure

If the upstream pin needs to change (e.g., hertz updates to a
newer libavcodec):

```sh
TAG=nX.Y.Z
BASE=https://raw.githubusercontent.com/FFmpeg/FFmpeg/$TAG
cd external/ffmpeg-snapshot
# Verbatim files only; libavcodec/vp9_subpel_filters_table.c is
# hand-extracted from libavcodec/vp9dsp.c and must be redone by hand.
FILES="libavcodec/vp9dsp_template.c
libavcodec/aarch64/vp9itxfm_neon.S
libavcodec/aarch64/vp9lpf_neon.S
libavcodec/aarch64/vp9mc_neon.S
libavcodec/aarch64/neon.S
libavutil/aarch64/asm.S
COPYING.LGPLv2.1"
for f in $FILES; do
    curl -sSf -o "$f" "$BASE/$f"
done
sha256sum $FILES
# update this PROVENANCE.md with the new tag, commit hash, and hashes
```

After re-vendoring, re-run the bit-exact gate (M1) and the throughput
baseline (M3). Both can shift across FFmpeg versions even when
the VP9 spec doesn't change (e.g., NEON micro-optimizations).

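Before re-running the gates it can help to know which vendored files actually changed between two pins. The sketch below assumes the `sha256sum` output was saved to a manifest before and after re-vendoring; `changed_files` and the manifest names `old.sha256` and `new.sha256` are hypothetical, and paths are assumed to contain no whitespace.

```sh
# Hypothetical helper (not in the repo): print every path whose hash differs
# between two sha256sum-format manifests, or that is new in the second one.
changed_files() {
    awk 'NR == FNR { old[$2] = $1; next }   # first file: remember old hashes
         old[$2] != $1 { print $2 }         # second file: report differences
        ' "$1" "$2"
}

# Example:
#   changed_files old.sha256 new.sha256
```

If only `COPYING.LGPLv2.1` or nothing at all is listed, the M1 and M3 results should carry over unchanged; any listed `.S` or `.c` file is a reason to re-measure.
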