marfrit 356e446a49 Cycle 3 (MC interpolation) closure: M1'''=100%, R'''=0.067 RED, M4=-19.5%
Third daedalus-fourier kernel — VP9 8-tap regular subpel filter,
horizontal direction, 8-wide output. Multiply-heavy by design to
stress V3D's no-DP4A deficit. Full cycle Phase 1-7 + M4'''.

Phase 5''' second-model review delivered cleanly — caught 1 RED
bug pre-implementation (src_off off-by-3 indexing convention) and
2 YELLOW gaps (assert MUST language, shaderdb filter-LUT gate).
Without the review, M1''' would have failed silently on first run
with cryptic "high-index source pixels wrong" symptoms.

Phase 6 v1 first-light: M1''' 100.0000% bit-exact (65536/65536
blocks across all 16 mx phases). Phase 5''' filter-LUT prediction
materialised exactly: 197 uniforms (gate was 144), 2 threads (down
from cycle-2's 4 due to register pressure).

Performance:

  M2''' = 1.413 Mblock/s     (707.9 ns/block)
  M3''' = 20.997 Mblock/s    (NEON baseline phase3)
  R'''  = 0.067              (RED band — structural mismatch)
  shaderdb: 488 inst, 2 threads, 197 uniforms, 25 max-temps, 0 spills

M4''' concurrent matrix (8s windows):

  NEON 1-core           14.479 Mblock/s
  NEON 4-core           15.248 Mblock/s   <- baseline (compute-bound,
                                              not bandwidth-saturated
                                              like cycles 1+2!)
  QPU only               1.380 Mblock/s
  MIXED NEON-3 + QPU    12.277 Mblock/s   <- -19.5% (FAIL gate)
  MIXED NEON-4 + QPU    12.158 Mblock/s   <- -20.3%

NEW cross-cycle finding (Phase 9 lesson 2): compute-bound CPU
workloads make the QPU-offload story collapse. Cycles 1+2 were
bandwidth-saturated (4-core scaling 0.56-0.82x of 1-core), so
freeing a CPU core via QPU offload added throughput. Cycle 3 MC
is compute-bound (4-core scaling 1.05x of 1-core), so there are
no idle CPU cycles to reclaim. The QPU contribution (0.45 Mblock/s
under contention) doesn't compensate for losing 1 NEON core
delivering ~3.8 Mblock/s.
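
The percentages and ratios quoted above can be re-derived from the reported throughputs. A minimal sketch using awk; the helper names `pct_delta` and `ratio` are illustrative only and do not exist in the repository:

```sh
# Percentage change of a measured rate against a baseline rate (both in Mblock/s).
pct_delta() {
  awk -v m="$1" -v b="$2" 'BEGIN { printf "%+.1f\n", (m - b) / b * 100 }'
}

# Plain ratio of two numbers, printed with two decimals.
ratio() {
  awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", a / b }'
}

pct_delta 12.277 15.248   # MIXED NEON-3 + QPU vs NEON 4-core baseline
pct_delta 12.158 15.248   # MIXED NEON-4 + QPU vs NEON 4-core baseline
ratio 15.248 14.479       # 4-core vs 1-core NEON scaling
ratio 15.248 4            # per-core share of the 4-core NEON rate
```

The first two calls reproduce the -19.5% and -20.3% figures in the matrix; the last two give the 1.05x scaling factor and the roughly 3.8 Mblock/s per-core share.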

But 30fps@1080p floor: PASS in every config (1.4x to 15.7x
isolation margin). Per project_30fps_floor_is_fine.md, user-facing
test never fails — daily YouTube playback works fine on any CPU/QPU
split.
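
The margin range can be sanity-checked with a short calculation. This sketch assumes a 1920x1080 luma frame tiled into 8x8 blocks (64 pixels per block); that block size is an assumption, not stated in this message, although it does reproduce the quoted 1.4x and 15.7x. The helper names are illustrative:

```sh
# Required block rate in Mblock/s for a given frame size and frame rate,
# assuming 64 pixels (one 8x8 block) per block.
floor_mblocks() {
  awk -v w="$1" -v h="$2" -v fps="$3" \
    'BEGIN { printf "%.3f\n", w * h / 64 * fps / 1e6 }'
}

# Margin of a measured rate (Mblock/s) over the floor (Mblock/s).
margin() {
  awk -v r="$1" -v f="$2" 'BEGIN { printf "%.1f\n", r / f }'
}

floor=$(floor_mblocks 1920 1080 30)
echo "floor: $floor Mblock/s"
margin 1.380 "$floor"     # QPU only (slowest config)
margin 15.248 "$floor"    # NEON 4-core (fastest config)
```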

DEPLOYMENT RECIPE for higgs (cycle 3 confirmed split):

  IDCT (k1)  -> QPU   (R=0.92, +7% mixed, frees CPU core)
  LPF  (k2)  -> QPU   (R=0.41, +7% mixed, frees CPU core)
  MC   (k3)  -> CPU   (R=0.067, -19.5% mixed — stays on CPU)
  Entropy    -> CPU   (structurally serial)

Mixed-substrate deployment, not "QPU does everything". Realistic for
higgs: entropy + MC on 2-3 ARM cores; IDCT + LPF dispatched to QPU
concurrently; 1-2 ARM cores left for vscode etc.

New artifacts:
- src/v3d_mc_8h.comp               — GLSL kernel
- tests/vp9_mc_ref.c               — standalone C ref (REGULAR filter
                                     embedded; clean transcription)
- tests/bench_neon_mc.c            — M1'''_c + M3''' bench
- tests/bench_v3d_mc.c             — M1''' + M2''' bench with contract
                                     asserts + 30fps margin display
- tests/bench_concurrent_mc.c      — M4''' pthread bench
- external/ffmpeg-snapshot/libavcodec/aarch64/vp9mc_neon.S    (vendored)
- external/ffmpeg-snapshot/libavcodec/vp9_subpel_filters_table.c
                                     (hand-extracted; provides
                                      ff_vp9_subpel_filters symbol
                                      without dragging in full vp9dsp.c)
- docs/k3_mc_phase{1,2,3,4,5,7}.md — full cycle documentation

Memory updates: project_30fps_floor_is_fine.md (user's 30fps target
recalibration), MEMORY.md index updated.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 12:51:43 +00:00

# FFmpeg source snapshot
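Subset of FFmpeg source pinned for use as reference
implementations for the `daedalus-fourier` kernels. It started with
the VP9 8×8 inverse DCT (the Phase 1 target) and now also carries the
NEON loop-filter and motion-compensation sources listed below.
See `../../docs/phase2.md §2` and `§5` for the rationale.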
## Upstream pin
- **Repository**: https://github.com/FFmpeg/FFmpeg
- **Tag**: `n7.1.3` (matches `libavcodec61 8:7.1.3-0+deb13u1+rpt1`
shipping in Debian Trixie on the dev host `hertz`)
- **Annotated tag object**: `0a9a757e96fdf053697084bbd1f620edeac9d084`
- **Commit object (tag target)**: `f46e514491172d15bd74b4abb1814cd2f05a763e`
- **Snapshot fetched**: 2026-05-18 (UTC), via
`https://raw.githubusercontent.com/FFmpeg/FFmpeg/n7.1.3/<path>`
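
The tag object and commit hashes above can be cross-checked against the remote. The following is a sketch under stated assumptions: `check_pin` is a hypothetical helper (not part of this repository) that reads `git ls-remote` output on stdin, where the peeled `^{}` entry is the commit an annotated tag points to.

```sh
# check_pin TAG TAG_OBJECT COMMIT
# Reads `git ls-remote` output on stdin and succeeds only if both the
# annotated tag object and the peeled commit match the recorded hashes.
check_pin() {
  tag=$1 want_tag_obj=$2 want_commit=$3
  awk -v tag="refs/tags/$tag" -v t="$want_tag_obj" -v c="$want_commit" '
    $2 == tag          { got_t = $1 }   # annotated tag object
    $2 == (tag "^{}")  { got_c = $1 }   # peeled entry: the tagged commit
    END { exit !(got_t == t && got_c == c) }
  '
}

# Usage (needs network access):
#   git ls-remote https://github.com/FFmpeg/FFmpeg \
#       'refs/tags/n7.1.3' 'refs/tags/n7.1.3^{}' |
#     check_pin n7.1.3 \
#       0a9a757e96fdf053697084bbd1f620edeac9d084 \
#       f46e514491172d15bd74b4abb1814cd2f05a763e && echo "pin OK"
```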
## Files in this snapshot
All files are byte-for-byte copies of the upstream source at the
tagged commit, with no modifications, except
`libavcodec/vp9_subpel_filters_table.c`, which is hand-extracted
(see its table row).
| Path | Lines | Bytes | SHA-256 |
|---|---|---|---|
| `libavcodec/vp9dsp_template.c` | 2578 | 89045 | `41b21f667a6c497b620aa1637d8269badc45d1ac7e621d694441c5bf39356e4f` |
| `libavcodec/aarch64/vp9itxfm_neon.S` | 1580 | 63534 | `82ee3ceed4735c63576bafdcee28e2215652743ade55a9eab46a16d9530369f6` |
| `libavcodec/aarch64/vp9lpf_neon.S` | 1334 | — | `384e49e7a6e838d9e38aedc00838ed4aebfa6c5bdb343ecaf23ef639bc10fbb7` |
| `libavcodec/aarch64/vp9mc_neon.S` | 665 | — | `6b1d50f9821742584fdd47758057f810644aff3a008faaa774ff5b9cac4d1fef` |
| `libavcodec/vp9_subpel_filters_table.c` | — | — | hand-extracted from `libavcodec/vp9dsp.c` at same n7.1.3 pin — provides `ff_vp9_subpel_filters` for `vp9mc_neon.S` to link against without dragging in vp9dsp.c's full init machinery |
| `libavcodec/aarch64/neon.S` | 173 | 7496 | `72d36ce6c3fcc5e53de869cfe10fda16225ebe580c32891bccc240a30a85a538` |
| `libavutil/aarch64/asm.S` | 260 | 8069 | `c0d03143b1bc5a9e358222d08d2d449d595271844fe7a3dc23bffb91abe8b0e3` |
| `COPYING.LGPLv2.1` | 502 | — | `b634ab5640e258563c536e658cad87080553df6f34f62269a21d554844e58bfe` |
Verify with:
```sh
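# vp9_subpel_filters_table.c is hand-extracted and has no recorded hash,
# so it is not checked here.
( cd external/ffmpeg-snapshot && sha256sum -c <<'EOF'
41b21f667a6c497b620aa1637d8269badc45d1ac7e621d694441c5bf39356e4f  libavcodec/vp9dsp_template.c
82ee3ceed4735c63576bafdcee28e2215652743ade55a9eab46a16d9530369f6  libavcodec/aarch64/vp9itxfm_neon.S
384e49e7a6e838d9e38aedc00838ed4aebfa6c5bdb343ecaf23ef639bc10fbb7  libavcodec/aarch64/vp9lpf_neon.S
6b1d50f9821742584fdd47758057f810644aff3a008faaa774ff5b9cac4d1fef  libavcodec/aarch64/vp9mc_neon.S
72d36ce6c3fcc5e53de869cfe10fda16225ebe580c32891bccc240a30a85a538  libavcodec/aarch64/neon.S
c0d03143b1bc5a9e358222d08d2d449d595271844fe7a3dc23bffb91abe8b0e3  libavutil/aarch64/asm.S
b634ab5640e258563c536e658cad87080553df6f34f62269a21d554844e58bfe  COPYING.LGPLv2.1
EOF
)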
```
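Some rows in the table above have no line or byte counts. A possible way to regenerate a full row for any vendored file is sketched below; `snapshot_row` is a hypothetical helper name, and the sketch assumes GNU `sha256sum` and POSIX `wc` are available:

```sh
# snapshot_row PATH
# Prints one markdown table row: path, line count, byte count, SHA-256.
snapshot_row() {
  f=$1
  lines=$(wc -l < "$f" | tr -d ' ')          # strip padding some wc builds add
  bytes=$(wc -c < "$f" | tr -d ' ')
  sha=$(sha256sum "$f" | cut -d' ' -f1)      # first field is the hex digest
  printf '| `%s` | %s | %s | `%s` |\n' "$f" "$lines" "$bytes" "$sha"
}

# Usage, from the snapshot root:
#   cd external/ffmpeg-snapshot
#   snapshot_row libavcodec/aarch64/vp9mc_neon.S
```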
## License
LGPL-2.1-or-later. See `COPYING.LGPLv2.1`. Original copyright
holders include the FFmpeg authors and Google Inc. (2016) for
the aarch64 NEON paths. The snapshot inherits FFmpeg's license
in full.
## Why each file is in this snapshot
- `libavcodec/vp9dsp_template.c` — contains `idct_idct_8x8_add_c`,
the bit-exact C reference for the Phase 1 kernel under test (M1).
- `libavcodec/aarch64/vp9itxfm_neon.S` — contains
`ff_vp9_idct_idct_8x8_add_neon`, the NEON throughput baseline
(M3). Also defines `idct8`, `dmbutterfly0`, `dmbutterfly`,
`dmbutterfly_l`, `butterfly_8h`, and the `idct_coeffs` constant
table.
- `libavcodec/aarch64/neon.S` — defines `transpose_8x8H` used by
`vp9itxfm_neon.S`.
- `libavutil/aarch64/asm.S` — defines `function`, `endfunc`,
`movrel`, `const`, `endconst`, and other assembly preamble
macros required to assemble the above NEON files.
- `libavcodec/aarch64/vp9lpf_neon.S`: NEON loop-filter source,
vendored as the NEON baseline for the cycle 2 LPF kernel.
- `libavcodec/aarch64/vp9mc_neon.S`: NEON motion-compensation source,
vendored as the NEON throughput baseline (M3''') for the cycle 3 MC
kernel.
- `libavcodec/vp9_subpel_filters_table.c`: hand-extracted from
`libavcodec/vp9dsp.c`; provides the `ff_vp9_subpel_filters` symbol
that `vp9mc_neon.S` links against.
- `COPYING.LGPLv2.1`: the license text covering the vendored files.
## Re-vendoring procedure
If the upstream pin needs to change (e.g., hertz updates to a
newer libavcodec):
```sh
TAG=nX.Y.Z
BASE=https://raw.githubusercontent.com/FFmpeg/FFmpeg/$TAG
cd external/ffmpeg-snapshot
# Files copied verbatim from upstream (one path per line).
FILES="libavcodec/vp9dsp_template.c
libavcodec/aarch64/vp9itxfm_neon.S
libavcodec/aarch64/vp9lpf_neon.S
libavcodec/aarch64/vp9mc_neon.S
libavcodec/aarch64/neon.S
libavutil/aarch64/asm.S
COPYING.LGPLv2.1"
for f in $FILES; do
    curl -sSf -o "$f" "$BASE/$f"
done
sha256sum $FILES
# libavcodec/vp9_subpel_filters_table.c is hand-extracted, not downloaded:
# re-extract it from libavcodec/vp9dsp.c at the new tag.
# update this PROVENANCE.md with the new tag, commit hash, and hashes
```
After re-vendoring, re-run the bit-exact gate (M1) and throughput
baseline (M3) — both can shift across FFmpeg versions even when
the VP9 spec doesn't change (e.g., NEON micro-optimizations).