---
cycle: 3
phase: 3
status: closed 2026-05-18
date_opened: 2026-05-18
parent: k3_mc_phase2.md
host: hertz
---

# Cycle 3, Phase 3 — NEON M3''' baseline

## Raw

```
=== M1'''_c bit-exact (10000 random blocks) ===
M1'''_c correctness: 10000 / 10000 blocks bit-exact (100.0000%)
  mx phase coverage: min=577 max=668 (16 phases sampled)

=== M3''' NEON throughput ===
M3''' NEON throughput:
  blocks/batch:    65536
  batches done:    939
  total blocks:    61 538 304
  elapsed (kernel)=2.930751 s
  elapsed (setup) =2.075477 s
  throughput      = 20.997 Mblock/s
  per-block       = 47.6 ns
  equiv 1080p     = 648.1 FPS  (32400 blocks/frame)
```

## Numbers

| | |
|---|---|
| **M1'''_c (bit-exact)** | **100.0000 %** vs `daedalus_vp9_put_regular_8h_ref` |
| mx coverage | all 16 phases sampled; per-phase counts 577-668 vs 625 expected (within ±10 %) |
| **M3''' (throughput)** | **20.997 Mblock/s** single-core |
| per-block | 47.6 ns |
| cycles/block | 47.6 ns × 2.8 GHz ≈ 133 cycles |
| 1080p FPS-eq | 648 FPS |
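
The table rows follow directly from the Raw counters. A minimal sketch of that arithmetic is given here, assuming the 2.8 GHz clock used in the cycles/block row and one mx phase drawn per block (so 10000 / 16 expected per phase). The variable names are illustrative and are not taken from the harness.

```python
# Re-derive the Numbers table from the Raw counters.
blocks_per_batch = 65536
batches = 939
elapsed_kernel_s = 2.930751
blocks_per_frame = 32400       # 1080p figure printed by the harness
clock_ghz = 2.8                # assumption: clock used in the cycles/block row

total_blocks = blocks_per_batch * batches
mblock_per_s = total_blocks / elapsed_kernel_s / 1e6
ns_per_block = 1e3 / mblock_per_s
cycles_per_block = ns_per_block * clock_ghz
fps_1080p = mblock_per_s * 1e6 / blocks_per_frame

# mx phase coverage: assumes one phase per block, uniformly drawn.
expected_per_phase = 10000 / 16
dev_min = (577 - expected_per_phase) / expected_per_phase
dev_max = (668 - expected_per_phase) / expected_per_phase

print(f"total blocks  = {total_blocks}")
print(f"throughput    = {mblock_per_s:.3f} Mblock/s")
print(f"per-block     = {ns_per_block:.1f} ns")
print(f"cycles/block  = {cycles_per_block:.0f}")
print(f"1080p FPS-eq  = {fps_1080p:.1f}")
print(f"mx deviation  = {dev_min:+.1%} .. {dev_max:+.1%}")
```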

## Comparison across cycles

| | IDCT (k1) | LPF (k2) | MC (k3) |
|---|---|---|---|
| Per-unit ns (NEON) | 122 | 20.7 (per edge) | 47.6 |
| 1080p FPS-eq | 252 | 748 (worst edges) | 648 |
| Compute character | Q14 butterflies + transpose | abs+compare+small mults | 8-tap convolution, mult-heavy |
| NEON win | SMLA + transpose | SMULL + saturate | SDOT-style packing |

MC NEON is fast: per unit it runs at ~2.6× IDCT throughput (47.6 ns
vs 122 ns). The A76's SDOT or SMULL-pair pattern handles 8-tap
convolution extremely well; this is precisely the workload NEON SIMD
was built for. **The QPU's break-even point on cycle 3 is
correspondingly tight.**

## Predictions for M2''' / R'''

V3D 7.1 has SMUL24 (sufficient here: 8b×8b products fit in 16b) but
**no DP4A**, so the QPU must issue 8 separate SMUL24 + ADD per output
pixel. Bandwidth-wise, MC is similar to LPF (~6 MB / 1080p frame);
compute-wise, it is much heavier than LPF.
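
To make the "8 separate multiply + add per output pixel" cost concrete, here is a scalar sketch of a horizontal 8-tap filter in the form a unit without a dot-product instruction has to evaluate it. It is not the `daedalus_vp9_put_regular_8h_ref` implementation. The function names, the tap values, and the 7-bit rounding (add 64, shift right by 7) are assumptions chosen for illustration.

```python
def clip_u8(v):
    """Clamp to the 8-bit pixel range."""
    return 0 if v < 0 else 255 if v > 255 else v

def put_8tap_h(src, taps, width):
    """Filter one row; src must hold width + 7 samples.

    Assumes taps are 7-bit fixed point (they sum to 128).
    """
    assert len(taps) == 8 and sum(taps) == 128
    assert len(src) >= width + 7
    out = []
    for x in range(width):
        acc = 0
        for k in range(8):               # 8 separate multiply + add steps
            acc += src[x + k] * taps[k]
        out.append(clip_u8((acc + 64) >> 7))  # round, shift, clip
    return out

# Illustrative half-pel taps (assumption, not read from the source tables).
TAPS = [-1, 6, -19, 78, 78, -19, 6, -1]

flat = put_8tap_h([100] * 15, TAPS, 8)
edge = put_8tap_h([0] * 7 + [255] * 8, TAPS, 8)
print(flat)   # a flat row stays flat: eight samples of 100
print(edge)
```

At 32 400 blocks per 1080p frame each block covers 64 pixels, so the inner loop alone amounts to 512 multiplies per block before rounding and clipping are counted.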

- Compute-envelope (idealised): 32 400 blocks × 1 150 ops ≈ 37 Mops
  per frame. At V3D 92 GFLOPS theoretical × 23 % util ≈ 21 GOPS
  effective → ~1.8 ms / frame → ~540 FPS → ~17.5 Mblock/s
- Bandwidth-envelope: 5.9 MB/frame ÷ 4 GB/s ≈ 1.48 ms/frame → 22 Mblock/s
- Combined: min(compute, bandwidth) ≈ 17.5 Mblock/s

**Predicted R''' = 17.5 / 21.0 ≈ 0.83** in isolation. Likely YELLOW
band, by a small margin.
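
The envelope can be re-run as a short sketch using only the inputs stated above (ops per block, peak rate, utilisation, bytes per frame, bandwidth). The variable names are illustrative. Evaluated without intermediate rounding, the compute bound comes out slightly above the 17.5 Mblock/s quoted, so the 0.83 figure sits at the conservative end of what these inputs give.

```python
blocks_per_frame = 32_400
ops_per_block = 1_150
peak_gops = 92.0           # theoretical peak, as stated
util = 0.23                # assumed utilisation, as stated
bytes_per_frame = 5.9e6
bandwidth_bytes_s = 4e9
neon_mblock_s = 20.997     # measured M3''' baseline

ops_per_frame = blocks_per_frame * ops_per_block
compute_s = ops_per_frame / (peak_gops * 1e9 * util)
bandwidth_s = bytes_per_frame / bandwidth_bytes_s

def mblock_per_s(frame_s):
    """Convert a per-frame time into block throughput."""
    return blocks_per_frame / frame_s / 1e6

compute_bound = mblock_per_s(compute_s)
bandwidth_bound = mblock_per_s(bandwidth_s)
combined = min(compute_bound, bandwidth_bound)
r_pred = combined / neon_mblock_s

print(f"compute   : {compute_s * 1e3:.2f} ms/frame, {compute_bound:.1f} Mblock/s")
print(f"bandwidth : {bandwidth_s * 1e3:.2f} ms/frame, {bandwidth_bound:.1f} Mblock/s")
print(f"combined  : {combined:.1f} Mblock/s, R''' = {r_pred:.2f}")
```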

Honest lower bound: if the SMUL24-vs-DP4A penalty is bigger than
estimated (CPU SDOT does 4 INT8 MACs in one instruction; the QPU
needs 4× more cycles for the same work in the worst case), R'''
could land near 0.5-0.6. Phase 7''' measures.
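
As a rough sensitivity check, the sketch below applies a hypothetical extra slowdown factor to the compute bound and solves for the factor that corresponds to a given R'''. The factor is not a quantity defined in this log, and how much of the 4× worst case is already absorbed by the 23 % utilisation figure is not stated, so the output only shows what additional penalty the 0.5-0.6 range implies.

```python
compute_bound = 17.5      # Mblock/s, compute envelope
bandwidth_bound = 22.0    # Mblock/s, bandwidth envelope
neon = 21.0               # Mblock/s, NEON baseline

def r_ratio(extra_penalty):
    """R''' if the compute bound is slowed by a hypothetical extra factor."""
    return min(compute_bound / extra_penalty, bandwidth_bound) / neon

def penalty_for(r_target):
    """Extra compute slowdown that would yield r_target (compute-bound case)."""
    return compute_bound / (r_target * neon)

for r in (0.6, 0.5):
    print(f"R''' = {r:.1f} needs an extra compute penalty of {penalty_for(r):.2f}x")
print(f"no extra penalty -> R''' = {r_ratio(1.0):.2f}")
```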

Phase 4 next.