---
cycle: 6
phase: 3
status: closed 2026-05-18 — M1 PASS, M3₆ = 175 Mblock/s
date_opened: 2026-05-18
date_closed: 2026-05-18
codec: H.264
kernel: IDCT 4x4 + add
parent: k6_h264idct4_phase1.md
host: hertz
---

# Cycle 6, Phase 3 — H.264 IDCT 4×4 NEON baseline

## M3₆ throughput

```
=== M3₆ NEON throughput ===
  blocks/batch:    4096
  batches done:    51 206
  total blocks:    209 739 776
  elapsed (kernel)=1.199 s
  throughput      = 175.0 Mblock/s
  per-block       = 5.7 ns
  H.264 1080p30 worst-case floor: 29.91× margin (5.85 Mblock/s req'd)
  H.264 1080p30 realistic floor:  87.50× margin (2.0 Mblock/s req'd)
```
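The derived figures in the log can be rechecked from the raw counts. This is a minimal Python sketch, not the bench's own code. The variable names are illustrative, and the two floor values are copied from the log rather than derived. Since the log prints elapsed time to three decimals, the recomputed throughput can differ from the logged 175.0 in the last digit.

```python
# Recompute the derived M3 figures from the raw counts in the log above.
# Inputs are copied from the log; all names here are illustrative.
BLOCKS_PER_BATCH = 4096
BATCHES_DONE = 51_206
ELAPSED_S = 1.199                  # as printed, i.e. rounded to 1 ms
LOGGED_MBLOCK_S = 175.0

WORST_CASE_FLOOR_MBLOCK_S = 5.85   # stated in the log, not derived here
REALISTIC_FLOOR_MBLOCK_S = 2.0     # stated in the log, not derived here

total_blocks = BLOCKS_PER_BATCH * BATCHES_DONE
throughput_mblock_s = total_blocks / ELAPSED_S / 1e6
per_block_ns = ELAPSED_S / total_blocks * 1e9

# The margins in the log are consistent with dividing the logged throughput.
worst_case_margin = LOGGED_MBLOCK_S / WORST_CASE_FLOOR_MBLOCK_S
realistic_margin = LOGGED_MBLOCK_S / REALISTIC_FLOOR_MBLOCK_S

print(f"total blocks      : {total_blocks}")
print(f"throughput        : {throughput_mblock_s:.1f} Mblock/s")
print(f"per-block         : {per_block_ns:.1f} ns")
print(f"worst-case margin : {worst_case_margin:.2f}x")
print(f"realistic margin  : {realistic_margin:.2f}x")
```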

**Per-block 5.7 ns — by far the lightest cycle so far** (cycle 2
LPF wd=4 was 21 ns, cycle 1 IDCT 8x8 was 122 ns). 4×4 is a
genuinely small kernel and FFmpeg's NEON is extremely tight
(56 instructions per block).

NEON 4-core scaling: not measured this phase. Based on cycle 2/4
patterns, expect ~3-4× scaling (bandwidth-bound territory), i.e.
~500-700 Mblock/s aggregate. That is roughly 85-120× the
worst-case floor of 5.85 Mblock/s.

## M1 bit-exact gate

```
=== M1₆ bit-exact (10000 random 4x4 blocks) ===
M1₆ correctness: 10000 / 10000 blocks bit-exact (100.0000%)
```

## Key Phase 9 lesson — H.264 block layout is column-major

The bench's initial C reference assumed row-major block storage
(`block[r*4 + c]`), giving M1 = 4.98 % bit-exact (essentially
chance-level agreement). Swapping the row/column pass order did
not help (both row-first and column-first gave the same ~5 %
rate). Trace analysis then revealed the actual mismatch:

- NEON `ld1 {v0.4h, v1.4h, v2.4h, v3.4h}, [x1]` is a plain
  **sequential** load: `v_k` receives `block[4k .. 4k+3]`. The
  multi-register `ld1` form does not de-interleave (that is what
  `ld4` does), so the load itself does not reorder anything.
- Combined with FFmpeg's choice of **column-major** block layout
  (`block[c*4 + r]` = coefficient at row r, column c), the
  sequential load puts column c of the block in NEON vector `v_c`
  (lane = row). A row-major reference reads the same four values
  as row c, which transposes the input.
- FFmpeg's C reference (`libavcodec/h264idct_template.c`) uses
  `block[i + 4*0]`, `block[i + 4*1]`, etc., which is column-major
  indexing in disguise.
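Which load form is in play decides which coefficients share a register, so it is worth pinning down both lane mappings. The following Python sketch models the architectural behavior of the two AArch64 multi-register loads for four `.4h` registers: `ld1` fills registers from consecutive memory, while `ld4` de-interleaves. It is not FFmpeg code, and the helper names `ld1_x4` and `ld4` are illustrative.

```python
# Model of two AArch64 loads of sixteen 16-bit elements into v0..v3 (.4h each).
# Helper names are illustrative; only the lane mapping is modeled.

def ld1_x4(mem):
    """ld1 {v0.4h-v3.4h}, [x]: consecutive memory, v[k][lane] = mem[4*k + lane]."""
    return [mem[4 * k : 4 * k + 4] for k in range(4)]

def ld4(mem):
    """ld4 {v0.4h-v3.4h}, [x]: de-interleaving, v[k][lane] = mem[4*lane + k]."""
    return [[mem[4 * lane + k] for lane in range(4)] for k in range(4)]

# Column-major block as described in the text: block[c*4 + r] is (row r, col c).
# Each entry is tagged "r<row>c<col>" so the register contents are readable.
block = [f"r{i % 4}c{i // 4}" for i in range(16)]

seq = ld1_x4(block)   # seq[k] is column k of the block, lane = row
dei = ld4(block)      # dei[k] is row k of the block, lane = column

for k in range(4):
    print(f"ld1 v{k} = {seq[k]}    ld4 v{k} = {dei[k]}")
```

Under this model the two forms differ by exactly a 4×4 transpose, so mixing them up looks the same as mixing up row-major and column-major storage.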

Fix: index the block as column-major (`block[c*4 + r]`) in the C
reference's row-pass loop. M1 then passes at 10000/10000.
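To make the indexing concrete, here is a Python transcription of a 4×4 inverse transform plus add that uses the `block[i + 4*k]` pattern quoted above. It is a sketch for 8-bit pixels modeled on the structure of FFmpeg's C reference, not the bench's own reference; `idct4x4_add`, `clip_u8`, and the test vectors are illustrative, and the details should be checked against the real source before relying on them.

```python
def clip_u8(x):
    """Clamp to the 8-bit pixel range."""
    return max(0, min(255, x))

def idct4x4_add(dst, stride, block):
    """Add the inverse-transformed residual of `block` to `dst` in place.

    block: 16 coefficients, column-major (block[c*4 + r] = row r, column c).
    dst:   flat list of 8-bit pixels, row-major with the given stride.
    """
    b = list(block)
    b[0] += 32  # rounding bias for the final >> 6

    # Pass 1: for each row i, butterfly across the four columns (stride 4).
    for i in range(4):
        z0 = b[i] + b[i + 8]
        z1 = b[i] - b[i + 8]
        z2 = (b[i + 4] >> 1) - b[i + 12]
        z3 = b[i + 4] + (b[i + 12] >> 1)
        b[i], b[i + 4], b[i + 8], b[i + 12] = z0 + z3, z1 + z2, z1 - z2, z0 - z3

    # Pass 2: for each column i, butterfly down its four rows (contiguous).
    for i in range(4):
        c = b[4 * i : 4 * i + 4]
        z0 = c[0] + c[2]
        z1 = c[0] - c[2]
        z2 = (c[1] >> 1) - c[3]
        z3 = c[1] + (c[3] >> 1)
        for row, v in enumerate((z0 + z3, z1 + z2, z1 - z2, z0 - z3)):
            dst[i + row * stride] = clip_u8(dst[i + row * stride] + (v >> 6))

# One coefficient at (row 0, column 1): column-major index 1*4 + 0 = 4.
coeffs = [0] * 16
coeffs[4] = 64
good = [100] * 16
idct4x4_add(good, 4, coeffs)

# The same coefficient placed as if the buffer were row-major: index 0*4 + 1 = 1.
misplaced = [0] * 16
misplaced[1] = 64
bad = [100] * 16
idct4x4_add(bad, 4, misplaced)

print("column-major:", [good[r * 4 : r * 4 + 4] for r in range(4)])
print("row-major   :", [bad[r * 4 : r * 4 + 4] for r in range(4)])
```

In this example the correctly placed coefficient yields a horizontal ramp across each row, while the misplaced one yields a vertical ramp down each column. That is the transposition a row-major reference introduces.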

Lesson encoded for future H.264 cycles:
- **H.264 4×4 (and 8×8) blocks are column-major** in FFmpeg.
- This convention propagates through all the libavcodec/aarch64
  H.264 NEON kernels (h264idct, h264dsp, h264qpel, h264cmc).
  Cycles 7+ (other H.264 kernels) should default-assume
  column-major.

## Comparison vs cycle 1 IDCT 8×8 (the closest analog)

| | Cycle 1 IDCT 8×8 | Cycle 6 IDCT 4×4 |
|---|---|---|
| Codec | VP9 | H.264 |
| Block size | 8×8 (64 coefs) | 4×4 (16 coefs) |
| Transform math | Q14 trig DCT (heavy multiplies) | Integer butterfly (no multiplies, only shifts) |
| NEON time/block | 122 ns | **5.7 ns** (21× faster) |
| Block storage | row-major | column-major |
| 30fps@1080p floor margin | 8× | **30×** (vs worst case) |

H.264 IDCT 4×4 is dramatically lighter than VP9 IDCT 8×8 — both
per-coef and per-block. This validates the "H.264 should be
easier" hypothesis from [project_h264_scope_added].
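The "both per-coef and per-block" claim follows directly from the table. A quick Python check, with illustrative variable names and inputs taken from the table:

```python
# Per-block times from the table; coefficient counts follow from block size.
cycle1_ns_per_block, cycle1_coefs = 122.0, 8 * 8
cycle6_ns_per_block, cycle6_coefs = 5.7, 4 * 4

cycle1_ns_per_coef = cycle1_ns_per_block / cycle1_coefs
cycle6_ns_per_coef = cycle6_ns_per_block / cycle6_coefs

per_block_ratio = cycle1_ns_per_block / cycle6_ns_per_block
per_coef_ratio = cycle1_ns_per_coef / cycle6_ns_per_coef

print(f"cycle 1: {cycle1_ns_per_coef:.2f} ns/coef")
print(f"cycle 6: {cycle6_ns_per_coef:.2f} ns/coef")
print(f"per-block ratio: {per_block_ratio:.1f}x, per-coef ratio: {per_coef_ratio:.1f}x")
```

The per-coefficient gap is smaller than the per-block gap because the 8×8 block carries four times as many coefficients.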

## Predicted R₆ band

NEON per-block 5.7 ns is so fast that the QPU must be very fast
to compete. QPU dispatch overhead is ~30 µs per call (from M5),
which is the NEON time for roughly 5 300 blocks, so a QPU call
can only break even if it amortizes that overhead across many
blocks per dispatch.

Per-block estimate for QPU on a similar tiny kernel:
- 4 lanes per block (per pixel), 64 invocations/WG → 16 blocks/WG
- ~50-100 instructions per block (much less than cycle 1 IDCT 8x8's 250)
- At 8 ns/instruction (NEON-tuned guess), that is 400-800 ns, call it ~600 ns per block.
- R₆ = 5.7 / 600 ≈ 0.01 → **deep RED in isolation**

But per-WG packing of 16 blocks means dispatch overhead amortizes
better. And 4×4 looks bandwidth-bound on NEON: 32 bytes read +
16 bytes written = 48 bytes per 5.7 ns block ≈ 8.4 GB/s, close to
the LPDDR4 ceiling. So a same-kernel M4 on QPU may still add
throughput if the QPU's memory traffic doesn't contend on the
same channel.
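The estimate above is simple enough to write down as arithmetic. This Python sketch only restates the numbers in this section; the constant names are illustrative and every input is one of the guesses already stated here, not a measurement.

```python
# Back-of-envelope R6 estimate using only the figures quoted in this section.
NEON_NS_PER_BLOCK = 5.7
DISPATCH_OVERHEAD_NS = 30_000      # ~30 us per QPU call, from M5
LANES_PER_BLOCK = 4
INVOCATIONS_PER_WG = 64
INSTR_PER_BLOCK = (50, 100)        # estimated range, a guess
NS_PER_INSTR = 8                   # guess from the text

blocks_per_wg = INVOCATIONS_PER_WG // LANES_PER_BLOCK
qpu_ns_lo, qpu_ns_hi = [n * NS_PER_INSTR for n in INSTR_PER_BLOCK]
qpu_ns_mid = (qpu_ns_lo + qpu_ns_hi) / 2
r6_mid = NEON_NS_PER_BLOCK / qpu_ns_mid

# Blocks per dispatch at which the dispatch overhead alone, spread over the
# batch, equals NEON's entire per-block cost.
blocks_to_match_neon = DISPATCH_OVERHEAD_NS / NEON_NS_PER_BLOCK

# NEON memory traffic implied by the per-block time (bytes per ns == GB/s).
bytes_per_block = 32 + 16
neon_gb_s = bytes_per_block / NEON_NS_PER_BLOCK

print(f"blocks/WG            : {blocks_per_wg}")
print(f"QPU ns/block (guess) : {qpu_ns_lo}-{qpu_ns_hi}, mid {qpu_ns_mid:.0f}")
print(f"R6 (mid)             : {r6_mid:.4f}")
print(f"blocks to amortize   : {blocks_to_match_neon:.0f}")
print(f"NEON traffic         : {neon_gb_s:.1f} GB/s")
```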

Plan: implement the QPU path anyway, for cycle completion and to
test the opportunistic-helper hypothesis. If R₆ is deep RED but
the mixed-kernel deployment shape (per Issue 003) uses the QPU for
VP9 cycles 1+2+4 and the CPU for H.264 IDCT 4×4, that's fine: the
recipe carries over.

## Next: Phase 4 plan

Per the established cycle pattern: Phase 4 plans the QPU shader,
Phase 5 is the Sonnet review, Phase 6 the implementation, Phase 7
the measurement. Predicted R₆ = 0.01 (deep RED in isolation), and
the kernel is small enough that per-call buffer alloc is likely to
dominate the latency.

Alternative path: defer cycle 6 Phase 4-7 (skip the QPU shader
build) and instead move directly to next H.264 cycles where QPU
might actually win — IDCT 8x8 (cycle 7), 6-tap MC (cycle 9), or
deblock (cycle 10). H.264 IDCT 4×4 on CPU is so fast that it
doesn't NEED QPU help.

## Acceptance

- ✓ M1 bit-exact (100.00 % on 10 000 random blocks)
- ✓ M3 captured (175 Mblock/s)
- ✓ 30fps@1080p floor exceeded by 30× worst-case
- ✓ Block-layout convention documented for future cycles