---
cycle: 6
phase: 3
status: closed 2026-05-18 — M1 PASS, M3₆ = 175 Mblock/s
date_opened: 2026-05-18
date_closed: 2026-05-18
codec: H.264
kernel: IDCT 4x4 + add
parent: k6_h264idct4_phase1.md
host: hertz
---

# Cycle 6, Phase 3 — H.264 IDCT 4×4 NEON baseline

## M3₆ throughput

```
=== M3₆ NEON throughput ===
blocks/batch: 4096
batches done: 51 206
total blocks: 209 739 776
elapsed (kernel)=1.199 s
throughput = 175.0 Mblock/s
per-block = 5.7 ns
H.264 1080p30 worst-case floor: 29.91× margin (5.85 Mblock/s req'd)
H.264 1080p30 realistic floor: 87.50× margin (2.0 Mblock/s req'd)
```

**Per-block 5.7 ns, by far the lightest cycle so far** (cycle 2 LPF wd=4 was 21 ns, cycle 1 IDCT 8x8 was 122 ns). 4×4 is a genuinely small kernel and FFmpeg's NEON is extremely tight (56 instructions per block).

NEON 4-core scaling: not measured this phase. Based on cycle 2/4 patterns, expect ~3-4× scaling (bandwidth-bound territory), i.e. ~500-700 Mblock/s aggregate. That is roughly 85-120× the 5.85 Mblock/s worst-case floor.

## M1 bit-exact gate

```
=== M1₆ bit-exact (10000 random 4x4 blocks) ===
M1₆ correctness: 10000 / 10000 blocks bit-exact (100.0000%)
```

## Key Phase 9 lesson — H.264 block layout is column-major

The bench's initial C reference assumed row-major block storage (`block[r*4 + c]`), giving M1 = 4.98 % bit-exact (essentially random agreement). Swapping the row/column pass order did not help: row-first and column-first both gave the same ~5 % rate. Trace analysis then revealed the actual mismatch:

- NEON `ld1 {v0.4h, v1.4h, v2.4h, v3.4h}, [x1]` does **interleaved** loading (load 4 structures of 4 elements, scattering across registers), NOT sequential. I initially assumed sequential.
- Combined with FFmpeg's choice of **column-major** block layout (`block[c*4 + r]` = coefficient at row r, column c), the interleaved load gives each NEON vector `v_r` = row r of the block (lane = column).
- FFmpeg's C reference (`libavcodec/h264dsp_template.c`) uses `block[i + 4*0]`, `block[i + 4*1]`, etc.
which is column-major indexing in disguise.

Fix: read the block as column-major (`block[c*4 + r]`) in the C reference's row-pass loop. M1 then passes 10000/10000.

Lesson encoded for future H.264 cycles:

- **H.264 4×4 (and 8×8) blocks are column-major** in FFmpeg.
- This convention propagates through all the libavcodec/aarch64 H.264 NEON kernels (h264idct, h264dsp, h264qpel, h264cmc). Cycles 7+ (other H.264 kernels) should assume column-major by default.

## Comparison vs cycle 1 IDCT 8×8 (the closest analog)

| | Cycle 1 IDCT 8×8 | Cycle 6 IDCT 4×4 |
|---|---|---|
| Codec | VP9 | H.264 |
| Block size | 8×8 (64 coefs) | 4×4 (16 coefs) |
| Transform math | Q14 trig DCT (heavy multiplies) | Integer butterfly (no multiplies, only shifts) |
| NEON time/block | 122 ns | **5.7 ns** (21× faster) |
| Block storage | row-major | column-major |
| 30fps@1080p floor margin | 8× | **30×** (vs worst case) |

H.264 IDCT 4×4 is dramatically lighter than VP9 IDCT 8×8, both per block (5.7 ns vs 122 ns) and per coefficient (about 0.36 ns vs 1.9 ns). This validates the "H.264 should be easier" hypothesis from [project_h264_scope_added].

## Predicted R₆ band

NEON per-block time of 5.7 ns is so low that the QPU must be very fast to compete. QPU dispatch overhead is ~30 µs per call (from M5), so reaching QPU-call breakeven requires amortizing that overhead across many blocks per dispatch.

Per-block estimate for QPU on a similar tiny kernel:

- 4 lanes per block (per pixel), 64 invocations/WG → 16 blocks/WG
- ~50-100 instructions per block (much less than cycle 1 IDCT 8x8's 250)
- At 8 ns/instruction (NEON-tuned guess), 400-800 ns per block, ~600 ns at the midpoint.
- R₆ = 5.7 / 600 ≈ 0.01 → **deep RED in isolation**

But: per-WG packing of 16 blocks means dispatch overhead amortizes better. And 4×4 is bandwidth-bound on NEON (32 bytes read + 16 bytes written = 48 bytes per 5.7 ns ≈ 8.4 GB/s, close to the LPDDR4 ceiling). So a same-kernel M4 run on the QPU may add throughput essentially for free, provided QPU traffic does not contend with NEON on the same memory channel.
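
The arithmetic behind this estimate can be checked with a short Python sketch. It only restates figures from this note (5.7 ns/block NEON, the ~600 ns/block QPU guess, ~30 µs dispatch, 48 bytes/block). The function names are illustrative, and treating one dispatch as covering one 4096-block batch is an assumption, not something measured here.

```python
NEON_NS_PER_BLOCK = 5.7        # measured M3 figure from this phase
QPU_NS_PER_BLOCK_EST = 600.0   # rough guess: ~75 instructions * 8 ns each
DISPATCH_NS = 30_000.0         # ~30 µs per QPU call (from M5)
BYTES_PER_BLOCK = 32 + 16      # 32 bytes read + 16 bytes written


def r_ratio(neon_ns, qpu_ns):
    """NEON time over QPU time per block; below 1 means the QPU is slower."""
    return neon_ns / qpu_ns


def bandwidth_gb_per_s(bytes_per_block, ns_per_block):
    """Bytes per nanosecond is numerically equal to GB/s (1e9 bytes/s)."""
    return bytes_per_block / ns_per_block


def dispatch_ns_per_block(dispatch_ns, blocks_per_dispatch):
    """Dispatch overhead spread evenly over the blocks in one call."""
    return dispatch_ns / blocks_per_dispatch


if __name__ == "__main__":
    print(f"R6 estimate      : {r_ratio(NEON_NS_PER_BLOCK, QPU_NS_PER_BLOCK_EST):.4f}")
    print(f"NEON bandwidth   : {bandwidth_gb_per_s(BYTES_PER_BLOCK, NEON_NS_PER_BLOCK):.2f} GB/s")
    # Assumption: one dispatch per 4096-block batch, as in the M3 bench.
    print(f"dispatch / block : {dispatch_ns_per_block(DISPATCH_NS, 4096):.2f} ns")
```

Under that batching assumption the call overhead alone comes to about 7.3 ns per block, already above NEON's 5.7 ns, which is consistent with the deep RED prediction in isolation even when packing is good.
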

Plan: implement the QPU path anyway, both to complete the cycle and to test the opportunistic-helper hypothesis. If R₆ is deep RED but the mixed-kernel deployment shape (per Issue 003) uses the QPU for VP9 cycles 1+2+4 and the CPU for H.264 IDCT 4×4, that is fine: the recipe carries over.

## Next: Phase 4 plan

Per the established cycle pattern:

- Phase 4: plan the QPU shader.
- Phase 5: Sonnet review.
- Phase 6: implementation.
- Phase 7: measurement.

Predicted R₆ ≈ 0.01 (deep RED in isolation). The kernel is also small enough that per-call buffer allocation is likely to dominate the latency.

Alternative path: defer cycle 6 Phases 4-7 (skip the QPU shader build) and move directly to the next H.264 cycles where the QPU might actually win: IDCT 8x8 (cycle 7), 6-tap MC (cycle 9), or deblock (cycle 10). H.264 IDCT 4×4 on the CPU is so fast that it doesn't NEED QPU help.

## Acceptance

- ✓ M1 bit-exact (100.00 % on 10 000 random blocks)
- ✓ M3 captured (175 Mblock/s)
- ✓ 30fps@1080p floor exceeded by 30× worst-case
- ✓ Block-layout convention documented for future cycles
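
For future H.264 cycles, the layout convention documented in this phase can be illustrated with a small scalar sketch in Python. It uses the standard H.264 4×4 inverse butterfly with a rows-then-columns pass order and `(x + 32) >> 6` rounding. It is not FFmpeg's code and has not been checked against the NEON kernel; the names `load_rows`, `butterfly` and `idct4x4_residual` are hypothetical. The `column_major` flag switches between the `block[c*4 + r]` and `block[r*4 + c]` readings.

```python
def load_rows(block, column_major=True):
    """Unpack a flat 16-entry coefficient block into rows[r][c]."""
    if column_major:
        # Convention recorded in this note: block[c*4 + r] is (row r, column c).
        return [[block[c * 4 + r] for c in range(4)] for r in range(4)]
    # Row-major reading, the bench's initial (wrong) assumption.
    return [[block[r * 4 + c] for c in range(4)] for r in range(4)]


def butterfly(a, b, c, d):
    """1-D H.264 4-point inverse butterfly: adds, subtracts and shifts only."""
    z0 = a + c
    z1 = a - c
    z2 = (b >> 1) - d
    z3 = b + (d >> 1)
    return [z0 + z3, z1 + z2, z1 - z2, z0 - z3]


def idct4x4_residual(block, column_major=True):
    """Return the residual res[r][c]; the add-to-prediction and clip are omitted."""
    rows = load_rows(block, column_major)
    # Pass 1: transform each row (horizontal direction).
    rows = [butterfly(*row) for row in rows]
    # Pass 2: transform each column (vertical direction).
    cols = [butterfly(*(rows[r][c] for r in range(4))) for c in range(4)]
    # cols[c][r] holds (row r, column c); round and shift down by 6 bits.
    return [[(cols[c][r] + 32) >> 6 for c in range(4)] for r in range(4)]


# One coefficient at (row 0, column 1), stored column-major at index 1*4 + 0.
coeffs = [0] * 16
coeffs[1 * 4 + 0] = 64

for row in idct4x4_residual(coeffs, column_major=True):
    print(row)  # each row prints [1, 1, 0, -1]
```

With the column-major reading the variation runs along each row. With `column_major=False` the same data yields a residual that varies down the columns instead (rows of all 1, all 1, all 0, all -1), which is the kind of mismatch that makes a random-block bit-exact test fail almost everywhere.
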