---
cycle: 7
phase: "3 + 4 (decision: defer Phase 4)"
status: "closed 2026-05-18; M1 PASS, M3₇ = 151 Mblock/s, Phase 4 deferred"
date_opened: 2026-05-18
date_closed: 2026-05-18
parent: k7_h264idct8_phase1.md
host: hertz
---

# Cycle 7, Phases 3+4: H.264 IDCT 8×8 NEON baseline + Phase 4 deferral

## M1 + M3

```
=== M1₇ bit-exact (10000 random 8x8 blocks) ===
M1₇ correctness: 10000 / 10000 blocks bit-exact (100.0000%)
=== M3₇ NEON throughput ===
total blocks: 62 074 880
elapsed (kernel)=0.411 s
throughput = 151.2 Mblock/s
per-block = 6.6 ns
H.264 1080p30 IDCT8 floor: 155.53x margin (0.972 Mblock/s req'd)
```

M1 PASS first try: the column-major-block convention from cycle 6 Phase 9 carried over correctly and held up under a substantially more complex butterfly (3 sub-stages). No debugging needed.

## Surprise: H.264 IDCT 8×8 is dramatically lighter than VP9 IDCT 8×8

| | VP9 IDCT 8×8 (cycle 1) | H.264 IDCT 8×8 (cycle 7) |
|---|---|---|
| NEON M3 (1 core) | 8.171 Mblock/s | **151.177 Mblock/s** (18.5× faster) |
| Per-block time (ns) | 122 | **6.6** |
| Math | Q14 trig × COSPI constants | Pure integer butterfly + shifts |
| NEON instruction shape | Multiply-heavy | Add-and-shift |

The H.264 IDCT is an integer transform built only from additions, subtractions, and right-shifts, with no multiplies. NEON add/sub/shift throughput is near-peak (1 cycle per op on most ports). VP9's IDCT needs Q14 multiplies for its cosine-based transform, which are ~4× slower per op on NEON.

**My Phase 1 prediction of R₇ ≈ 0.5-0.9 was wrong.** I extrapolated from cycle 1 (VP9 IDCT 8×8), which I assumed was the closest analog. It has the same data shape (64 coefs, 8×8 output), but the compute shape is completely different: H.264's pure-integer butterfly is much cheaper than VP9's trig butterfly.

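
To make the add-and-shift shape concrete, here is a scalar sketch of one 8-point pass, written from the inverse 8×8 transform as specified in the H.264 standard (three sub-stages, labelled `a`, `b`, and the output). It is not the NEON kernel measured above. The function name `idct8_1d` is illustrative, and the full 2-D transform (both passes plus the final rounding shift) is left out.

```python
def idct8_1d(d):
    """One 8-point pass of the H.264 8x8 inverse transform (scalar sketch).

    Only +, -, and >> appear. Python's >> on ints is an arithmetic shift,
    which matches the behaviour the integer transform relies on.
    """
    # Sub-stage 1: even inputs (0, 2, 4, 6) and odd inputs (1, 3, 5, 7).
    a0 = d[0] + d[4]
    a2 = d[0] - d[4]
    a4 = (d[2] >> 1) - d[6]
    a6 = d[2] + (d[6] >> 1)
    a1 = -d[3] + d[5] - d[7] - (d[7] >> 1)
    a3 = d[1] + d[7] - d[3] - (d[3] >> 1)
    a5 = -d[1] + d[7] + d[5] + (d[5] >> 1)
    a7 = d[3] + d[5] + d[1] + (d[1] >> 1)

    # Sub-stage 2: combine within the even half and within the odd half.
    b0 = a0 + a6
    b2 = a2 + a4
    b4 = a2 - a4
    b6 = a0 - a6
    b1 = a1 + (a7 >> 2)
    b3 = a3 + (a5 >> 2)
    b5 = (a3 >> 2) - a5
    b7 = a7 - (a1 >> 2)

    # Sub-stage 3: final even/odd butterfly producing the 8 outputs.
    return [b0 + b7, b2 + b5, b4 + b3, b6 + b1,
            b6 - b1, b4 - b3, b2 - b5, b0 - b7]
```

A DC-only input such as `idct8_1d([64, 0, 0, 0, 0, 0, 0, 0])` comes back flat (all eight outputs equal 64), which is a quick sanity check that the even path is wired correctly.
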
## Phase 4 deferral (same pattern as cycle 6)

Per the cycle 6 Phase 9 lesson ("for any cycle with NEON per-block < ~30 ns, predict deep RED and defer Phase 4 unless there's a specific structural QPU advantage"):

- NEON 151 Mblock/s on a single core
- QPU per-block floor ~250 ns (cycle 1 scaling) → ~4 Mblock/s
- R₇ predicted = 4 / 151 = **0.026 → deep RED**
- 30fps@1080p floor passed by 155× on a single core
- No realistic deployment benefit from QPU offload

**Phase 4 deferred. Cycle 7 closed.**

## Recipe verdict

**H.264 IDCT 8×8 stays on CPU.** Same recipe slot as cycle 6 (H.264 IDCT 4×4): trivially fast on NEON, no need for QPU help. The public API will route through `daedalus_dispatch_*` CPU paths when these kernel slots are added.

## Phase 9 lesson (cycle 6 + 7 combined)

**H.264 transforms are NEON-trivial.** Both 4×4 (5.7 ns/block, 175 Mblock/s) and 8×8 (6.6 ns/block, 151 Mblock/s) are dominated by memory bandwidth, not compute. The transform math is too lightweight to make QPU offload worthwhile.

Implications for cycle selection going forward:

- **Skip all H.264 transform cycles** (chroma IDCT 4×4 was originally planned for cycle 8; all transform work is now CPU-only).
- **Target compute-heavy H.264 kernels** where the QPU might compete:
  - **Deblock** (cycle 8, reordered up): analogous to VP9 LPF, which was GREEN. Predicted YELLOW or GREEN.
  - **Luma qpel MC** (6-tap): analogous to VP9 8-tap MC, which was RED. Predicted RED.
  - **Chroma MC** (4-tap): even lighter than luma. Predicted RED.

So the practical H.264 QPU plan: **only build cycle 8 (deblock)**. Other H.264 kernels go CPU-only via the public API.

This is a much narrower scope than originally envisioned in `project_h264_scope_added`. The end deliverable still meets the user goal (Pi 5 + daedalus-fourier decoding H.264), with the QPU helping only the deblock stage. Most of H.264 stays on NEON because NEON is already so fast.

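
The arithmetic behind the Phase 4 deferral can be replayed in a few lines. This is a sketch using only figures from this log; the function names are illustrative and are not part of the daedalus API. The 1080p30 floor assumes every 8×8 block of a 1920×1080 frame goes through IDCT8, which is how the 0.972 Mblock/s figure works out.

```python
# All constants below come from this log; names are hypothetical helpers.
BLOCKS_PER_FRAME = (1920 * 1080) // 64  # 8x8 blocks per 1080p frame


def floor_mblock_per_s(fps=30):
    """Required IDCT8 throughput (Mblock/s) if every block is transformed."""
    return BLOCKS_PER_FRAME * fps / 1e6


def throughput_mblock_per_s(per_block_ns):
    """Convert a per-block time in ns to throughput in Mblock/s."""
    return 1e3 / per_block_ns


def predicted_ratio(neon_mblock_s, qpu_per_block_ns):
    """R = predicted QPU throughput / measured NEON throughput."""
    return throughput_mblock_per_s(qpu_per_block_ns) / neon_mblock_s


def defer_phase4(neon_per_block_ns, structural_qpu_advantage=False,
                 threshold_ns=30.0):
    """Cycle 6 Phase 9 rule: defer when NEON is under ~30 ns per block
    and there is no specific structural QPU advantage."""
    return neon_per_block_ns < threshold_ns and not structural_qpu_advantage


if __name__ == "__main__":
    floor = floor_mblock_per_s()                 # 0.972 Mblock/s
    print(f"floor  = {floor:.3f} Mblock/s")
    print(f"margin = {151.177 / floor:.2f}x")
    print(f"R7     = {predicted_ratio(151.0, 250.0):.3f}")
    print(f"defer  = {defer_phase4(6.6)}")
```

With the cycle 7 numbers (151 Mblock/s NEON, ~250 ns QPU floor) the ratio lands at about 0.026, and the 6.6 ns per-block time is well under the ~30 ns threshold, so the rule says defer.
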
## Codec coverage state after cycle 7

| Codec | Kernel | Recipe | Status |
|---|---|---|---|
| VP9 | IDCT 8×8 | QPU | cycle 1 closed |
| VP9 | LPF wd=4 | QPU | cycle 2 closed |
| VP9 | MC 8h | CPU | cycle 3 closed |
| VP9 | LPF wd=8 | QPU | cycle 4 closed |
| AV1 | CDEF 8×8 | CPU | cycle 5 closed |
| H.264 | IDCT 4×4 | CPU | cycle 6 closed (this session) |
| H.264 | IDCT 8×8 | CPU | cycle 7 closed (this session) |
| H.264 | Deblock | TBD | cycle 8 next |
| H.264 | MC | CPU | future (predicted RED) |
| H.264 | Chroma MC | CPU | future (predicted RED) |

7 cycles closed. 3 deployed on QPU (VP9 cycles 1+2+4). 4 stay on CPU. Deployment recipe matrix grows but stays narrowly focused on QPU-wins.
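
The closing tally can be re-derived from the closed rows of the table. A minimal sketch follows; the `CLOSED` list is a hand transcription of those rows for illustration, not a daedalus data structure.

```python
from collections import Counter

# (cycle, codec, kernel, recipe) for every closed cycle in the table.
# Hypothetical transcription; open and future rows are left out on purpose.
CLOSED = [
    (1, "VP9", "IDCT 8x8", "QPU"),
    (2, "VP9", "LPF wd=4", "QPU"),
    (3, "VP9", "MC 8h", "CPU"),
    (4, "VP9", "LPF wd=8", "QPU"),
    (5, "AV1", "CDEF 8x8", "CPU"),
    (6, "H.264", "IDCT 4x4", "CPU"),
    (7, "H.264", "IDCT 8x8", "CPU"),
]

# Count how many closed cycles landed on each recipe.
recipe_counts = Counter(recipe for _, _, _, recipe in CLOSED)

# Cycles whose kernels were deployed on the QPU.
qpu_cycles = [cycle for cycle, _, _, recipe in CLOSED if recipe == "QPU"]

if __name__ == "__main__":
    print(f"{len(CLOSED)} cycles closed")
    print(f"QPU: {recipe_counts['QPU']} (cycles {qpu_cycles})")
    print(f"CPU: {recipe_counts['CPU']}")
```
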