Files
daedalus-fourier/docs/k3_mc_phase3.md
T
marfrit 356e446a49 Cycle 3 (MC interpolation) closure: M1'''=100%, R'''=0.067 RED, M4=-19.5%
Third daedalus-fourier kernel — VP9 8-tap regular subpel filter,
horizontal direction, 8-wide output. Multiply-heavy by design to
stress V3D's no-DP4A deficit. Full cycle Phase 1-7 + M4'''.

Phase 5''' second-model review delivered cleanly — caught 1 RED
bug pre-implementation (src_off off-by-3 indexing convention) and
2 YELLOW gaps (assert MUST language, shaderdb filter-LUT gate).
Without the review, M1''' would have failed silently on first run
with cryptic "high-index source pixels wrong" symptoms.

Phase 6 v1 first-light: M1''' 100.0000% bit-exact (65536/65536
blocks across all 16 mx phases). Phase 5''' filter-LUT prediction
materialised exactly: 197 uniforms (gate was 144), 2 threads (down
from cycle-2's 4 due to register pressure).

Performance:

  M2''' = 1.413 Mblock/s     (707.9 ns/block)
  M3''' = 20.997 Mblock/s    (NEON baseline phase3)
  R'''  = 0.067              (RED band — structural mismatch)
  shaderdb: 488 inst, 2 threads, 197 uniforms, 25 max-temps, 0 spills

M4''' concurrent matrix (8s windows):

  NEON 1-core           14.479 Mblock/s
  NEON 4-core           15.248 Mblock/s   <- baseline (compute-bound,
                                              not bandwidth-saturated
                                              like cycles 1+2!)
  QPU only               1.380 Mblock/s
  MIXED NEON-3 + QPU    12.277 Mblock/s   <- -19.5% (FAIL gate)
  MIXED NEON-4 + QPU    12.158 Mblock/s   <- -20.3%

NEW cross-cycle finding (Phase 9 lesson 2): compute-bound CPU
workloads make the QPU-offload story collapse. Cycles 1+2 were
bandwidth-saturated (4-core scaling 0.56-0.82x of 1-core), so
freeing a CPU core via QPU offload added throughput. Cycle 3 MC
is compute-bound (4-core scaling 1.05x of 1-core — near-linear),
no free cycles to free. QPU contribution (0.45 Mblock/s in
contention) doesn't compensate for losing 1 NEON core delivering
~3.8 Mblock/s.

But 30fps@1080p floor: PASS in every config (1.4x to 15.7x
isolation margin). Per project_30fps_floor_is_fine.md, user-facing
test never fails — daily YouTube playback works fine on any CPU/QPU
split.

DEPLOYMENT RECIPE for higgs (cycle 3 confirmed split):

  IDCT (k1)  -> QPU   (R=0.92, +7% mixed, frees CPU core)
  LPF  (k2)  -> QPU   (R=0.41, +7% mixed, frees CPU core)
  MC   (k3)  -> CPU   (R=0.067, -19.5% mixed — stays on CPU)
  Entropy    -> CPU   (structurally serial)

Mixed-substrate deployment, not "QPU does everything". Realistic for
higgs: entropy + MC on 2-3 ARM cores; IDCT + LPF dispatched to QPU
concurrently; 1-2 ARM cores left for vscode etc.

New artifacts:
- src/v3d_mc_8h.comp               — GLSL kernel
- tests/vp9_mc_ref.c               — standalone C ref (REGULAR filter
                                     embedded; clean transcription)
- tests/bench_neon_mc.c            — M1'''_c + M3''' bench
- tests/bench_v3d_mc.c             — M1''' + M2''' bench with contract
                                     asserts + 30fps margin display
- tests/bench_concurrent_mc.c      — M4''' pthread bench
- external/ffmpeg-snapshot/libavcodec/aarch64/vp9mc_neon.S    (vendored)
- external/ffmpeg-snapshot/libavcodec/vp9_subpel_filters_table.c
                                     (hand-extracted; provides
                                      ff_vp9_subpel_filters symbol
                                      without dragging in full vp9dsp.c)
- docs/k3_mc_phase{1,2,3,4,5,7}.md — full cycle documentation

Memory updates: project_30fps_floor_is_fine.md (user's 30fps target
recalibration), MEMORY.md index updated.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 12:51:43 +00:00

78 lines
2.5 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
---
cycle: 3
phase: 3
status: closed 2026-05-18
date_opened: 2026-05-18
parent: k3_mc_phase2.md
host: hertz
---
# Cycle 3, Phase 3 — NEON M3''' baseline
## Raw
```
=== M1'''_c bit-exact (10000 random blocks) ===
M1'''_c correctness: 10000 / 10000 blocks bit-exact (100.0000%)
mx phase coverage: min=577 max=668 (16 phases sampled)
=== M3''' NEON throughput ===
M3''' NEON throughput:
blocks/batch: 65536
batches done: 939
total blocks: 61 538 304
elapsed (kernel)=2.930751 s
elapsed (setup) =2.075477 s
throughput = 20.997 Mblock/s
per-block = 47.6 ns
equiv 1080p = 648.1 FPS (32400 blocks/frame)
```
## Numbers
| | |
|---|---|
| **M1'''_c (bit-exact)** | **100.0000 %** vs `daedalus_vp9_put_regular_8h_ref` |
| mx coverage | all 16 phases sampled, uniformly within ±10 % of expected count |
| **M3''' (throughput)** | **20.997 Mblock/s** single-core |
| per-block | 47.6 ns |
| cycles/block | 47.6 ns × 2.8 GHz ≈ 133 cycles |
| 1080p FPS-eq | 648 FPS |
## Comparison across cycles
| | IDCT (k1) | LPF (k2) | MC (k3) |
|---|---|---|---|
| Per-unit ns (NEON) | 122 | 20.7 (per edge) | 47.6 |
| 1080p FPS-eq | 252 | 748 (worst edges) | 648 |
| Compute character | Q14 butterflies + transpose | abs+compare+small mults | 8-tap convolution, mult-heavy |
| NEON win | SMLA + transpose | SMULL + saturate | SDOT-style packing |
MC NEON is fast — at ~2.6× IDCT throughput per unit. The A76's SDOT
or SMULL-pair pattern handles 8-tap convolution extremely well; this
is precisely the workload NEON SIMD was built for. **The QPU's
break-even point on cycle 3 is correspondingly tight.**
## Predictions for M2''' / R'''
V3D 7.1 has SMUL24 (8b×8b → 16b sufficient) but **no DP4A**, so the
QPU must do 8 separate SMULL + ADD per output pixel. Bandwidth-wise
MC is similar to LPF (~6 MB / 1080p frame). Compute-wise much heavier
than LPF.
- Compute-envelope (idealised): 32 400 blocks × 1 150 ops = 37 Mops
per frame. At v3d 92 GFLOPS theoretical × 23 % util ≈ 21 GOPS
effective → 1.8 ms / frame → 540 FPS → 17.5 Mblock/s
- Bandwidth-envelope: 5.9 MB/frame ÷ 4 GB/s ≈ 1.48 ms/frame → 22 Mblock/s
- Combined: min(compute, bandwidth) ≈ 17.5 Mblock/s
**Predicted R''' = 17.5 / 21.0 ≈ 0.83** isolation. Likely YELLOW
band by a small margin.
Honest lower bound: if SMUL24-vs-DP4A penalty is bigger than
estimated (CPU SDOT does 4 INT8 MACs in one instruction; the QPU
needs 4× more cycles for the same work in the worst case), R'''
could land near 0.5-0.6. Phase 7''' measures.
Phase 4 next.