---
cycle: 3
phase: 3
status: closed 2026-05-18
date_opened: 2026-05-18
parent: k3_mc_phase2.md
host: hertz
---

# Cycle 3, Phase 3: NEON M3''' baseline

## Raw

```
=== M1'''_c bit-exact (10000 random blocks) ===
M1'''_c correctness: 10000 / 10000 blocks bit-exact (100.0000%)
  mx phase coverage: min=577 max=668 (16 phases sampled)

=== M3''' NEON throughput ===
M3''' NEON throughput:
  blocks/batch:     65536
  batches done:     939
  total blocks:     61 538 304
  elapsed (kernel)=2.930751 s
  elapsed (setup) =2.075477 s
  throughput      = 20.997 Mblock/s
  per-block       = 47.6 ns
  equiv 1080p     = 648.1 FPS  (32400 blocks/frame)
```

## Numbers

| | |
|---|---|
| **M1'''_c (bit-exact)** | **100.0000 %** vs `daedalus_vp9_put_regular_8h_ref` |
| mx coverage | all 16 phases sampled, each within ±10 % of the expected count (min 577, max 668 vs 625) |
| **M3''' (throughput)** | **20.997 Mblock/s** single-core |
| per-block | 47.6 ns |
| cycles/block | 47.6 ns × 2.8 GHz ≈ 133 cycles |
| 1080p FPS-eq | 648 FPS |

## Comparison across cycles

| | IDCT (k1) | LPF (k2) | MC (k3) |
|---|---|---|---|
| Per-unit ns (NEON) | 122 | 20.7 (per edge) | 47.6 |
| 1080p FPS-eq | 252 | 748 (worst edges) | 648 |
| Compute character | Q14 butterflies + transpose | abs+compare+small mults | 8-tap convolution, mult-heavy |
| NEON win | SMLA + transpose | SMULL + saturate | SDOT-style packing |

MC NEON is fast: about 2.6× the IDCT throughput per unit (47.6 ns vs 122 ns). The A76's SDOT or SMULL-pair pattern handles 8-tap convolution extremely well; this is precisely the workload NEON SIMD was built for. **The QPU's break-even point on cycle 3 is correspondingly tight.**

## Predictions for M2''' / R'''

V3D 7.1 has SMUL24 (sufficient for 8b×8b → 16b) but **no DP4A**, so the QPU must issue 8 separate SMUL24 + ADD pairs per output pixel. Bandwidth-wise, MC is similar to LPF (~6 MB / 1080p frame). Compute-wise, it is much heavier than LPF.

- Compute-envelope (idealised): 32 400 blocks × 1 150 ops = 37 Mops per frame.
  At V3D's 92 GFLOPS theoretical × 23 % util ≈ 21 GOPS effective → 1.8 ms / frame → 540 FPS → 17.5 Mblock/s
- Bandwidth-envelope: 5.9 MB/frame ÷ 4 GB/s ≈ 1.48 ms/frame → 22 Mblock/s
- Combined: min(compute, bandwidth) ≈ 17.5 Mblock/s

**Predicted R''' = 17.5 / 21.0 ≈ 0.83** in isolation. Likely YELLOW band by a small margin.

Honest lower bound: if the SMUL24-vs-DP4A penalty is bigger than estimated (CPU SDOT does 4 INT8 MACs per 32-bit lane in one instruction; the QPU needs 4× more cycles for the same work in the worst case), R''' could land near 0.5-0.6. Phase 7''' measures.

Phase 4 next.
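
The envelope arithmetic above can be replayed with a short script. This is a minimal sketch, not part of the benchmark harness: the function names (`neon_baseline`, `envelope_mblock_s`) and parameter names are hypothetical, and every input value is copied from this note (raw log, 2.8 GHz clock, 1 150 ops/block, 92 G peak × 23 % utilisation, 5.9 MB/frame, 4 GB/s).

```python
# Envelope model for the M2'''/R''' prediction.
# All inputs are figures quoted in this note; names are illustrative only.

BLOCKS_PER_FRAME = 32_400  # 1080p, as printed in the raw log


def neon_baseline(total_blocks, kernel_seconds, clock_ghz=2.8):
    """Return (Mblock/s, ns/block, cycles/block, 1080p FPS-equivalent)."""
    blocks_per_s = total_blocks / kernel_seconds
    ns_per_block = 1e9 / blocks_per_s
    return (
        blocks_per_s / 1e6,
        ns_per_block,
        ns_per_block * clock_ghz,  # ns x GHz = cycles
        blocks_per_s / BLOCKS_PER_FRAME,
    )


def _mblock_s(frame_seconds):
    """Convert a per-frame time bound into Mblock/s."""
    return BLOCKS_PER_FRAME / frame_seconds / 1e6


def envelope_mblock_s(ops_per_block=1_150, peak_gops=92.0, util=0.23,
                      bytes_per_frame=5.9e6, bandwidth_bytes_s=4e9):
    """Return (compute, bandwidth, combined) throughput bounds in Mblock/s."""
    compute_s = BLOCKS_PER_FRAME * ops_per_block / (peak_gops * 1e9 * util)
    bandwidth_s = bytes_per_frame / bandwidth_bytes_s
    compute = _mblock_s(compute_s)
    bandwidth = _mblock_s(bandwidth_s)
    # The slower of the two limits bounds the achievable throughput.
    return compute, bandwidth, min(compute, bandwidth)


if __name__ == "__main__":
    mblock_s, ns, cycles, fps = neon_baseline(939 * 65_536, 2.930751)
    compute, bandwidth, combined = envelope_mblock_s()
    print(f"NEON: {mblock_s:.3f} Mblock/s, {ns:.1f} ns/block, "
          f"{cycles:.0f} cycles/block, {fps:.1f} FPS-eq")
    print(f"QPU bounds: compute {compute:.1f}, bandwidth {bandwidth:.1f}, "
          f"combined {combined:.1f} Mblock/s")
    print(f"R''' estimate: {combined / mblock_s:.2f}")
```

Evaluated without intermediate rounding, the compute bound comes out near 18.4 Mblock/s rather than the rounded 17.5 used above, which puts the ratio closer to 0.88 than 0.83. Compute remains the binding limit either way, and the lower-bound caveat about the SMUL24 penalty applies unchanged.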