daedalus-fourier/docs/k7_h264idct8_phase3_and_4.md
marfrit db2205d0e3 Cycle 7 closed: H.264 IDCT 8x8 = 151 Mblock/s NEON, Phase 4 deferred
M1: 10000/10000 bit-exact first try (column-major-block lesson
from cycle 6 carried over cleanly).

M3: 151.2 Mblock/s per core. Per-block 6.6 ns. 155x the
1080p30 floor (0.972 Mblock/s req'd).

Phase-1 prediction of R7 = 0.5-0.9 YELLOW/GREEN was WRONG. H.264
IDCT 8x8 is dramatically lighter than VP9 IDCT 8x8 (18.5x faster
NEON):

  VP9 IDCT 8x8: 122 ns/block (Q14 trig + COSPI multiplies)
  H.264 IDCT 8x8: 6.6 ns/block (pure integer butterfly + shifts)

Phase 4 deferred via the cycle 6 lightweight-kernel rationale:
NEON per-block << QPU dispatch floor; offload doesn't help.

Phase 9 lesson updated: H.264 transforms (both 4x4 and 8x8) are
NEON-trivial. Skip ALL H.264 transform cycles for QPU. Target
compute-heavy H.264 kernels only (deblock = cycle 8 next; MC
likely RED).

Cycle 7 = 2nd consecutive "predicted GREEN, measured CPU-only"
result. Forces a sharper view of which kernels QPU can actually
help with: deblock and possibly some VP9 cases.

- tests/h264_idct8_ref.c (column-major C ref)
- tests/bench_neon_h264idct8.c (M1 + M3 bench)
- CMakeLists.txt: cycle 7 bench wiring
- docs/k7_h264idct8_phase3_and_4.md (closure)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 14:16:42 +00:00



cycle: 7
phase: 3 + 4 (decision: defer Phase 4)
status: closed 2026-05-18; M1 PASS, M3₇ = 151 Mblock/s, Phase 4 deferred
date_opened: 2026-05-18
date_closed: 2026-05-18
parent: k7_h264idct8_phase1.md
host: hertz

Cycle 7, Phases 3+4 — H.264 IDCT 8×8 NEON baseline + Phase 4 deferral

M1 + M3

=== M1₇ bit-exact (10000 random 8x8 blocks) ===
M1₇ correctness: 10000 / 10000 blocks bit-exact (100.0000%)

=== M3₇ NEON throughput ===
  total blocks:    62 074 880
  elapsed (kernel)=0.411 s
  throughput      = 151.2 Mblock/s
  per-block       = 6.6 ns
  H.264 1080p30 IDCT8 floor: 155.53x margin (0.972 Mblock/s req'd)

M1 passed on the first try. The column-major-block convention from the cycle 6 Phase 9 lesson carried over correctly and held up against a considerably more complex butterfly (3 sub-stages). No debugging was needed.
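The figures in the bench output can be re-derived with a few lines of arithmetic. The sketch below is not part of tests/bench_neon_h264idct8.c; the function names are hypothetical. It assumes the 1080p30 floor counts every 8×8 tile of an unpadded 1920×1080 frame as one IDCT8 call, which reproduces the 0.972 Mblock/s figure.

```c
#include <stdint.h>

/* Required 8x8 blocks per second, in Mblock/s, if every 8x8 tile of every
 * frame goes through the transform. Assumption: the frame is not padded
 * (1920x1080, not 1920x1088). */
static double idct8_floor_mblock_per_s(uint32_t width, uint32_t height,
                                       uint32_t fps)
{
    double blocks_per_frame = ((double)width * (double)height) / 64.0;
    return blocks_per_frame * (double)fps / 1e6;
}

/* Measured throughput in Mblock/s from a block count and kernel time. */
static double throughput_mblock_per_s(uint64_t blocks, double elapsed_s)
{
    return (double)blocks / elapsed_s / 1e6;
}

/* Per-block latency in ns for a given throughput in Mblock/s. */
static double ns_per_block(double mblock_per_s)
{
    return 1e3 / mblock_per_s;
}
```

With the reported 151.177 Mblock/s, `151.177 / idct8_floor_mblock_per_s(1920, 1080, 30)` gives the 155.53× margin. Feeding the printed block count and the rounded 0.411 s into `throughput_mblock_per_s` lands slightly below 151.2, which is consistent with the elapsed time having been rounded for display.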

Surprise: H.264 IDCT 8×8 is dramatically lighter than VP9 IDCT 8×8

| | VP9 IDCT 8×8 (cycle 1) | H.264 IDCT 8×8 (cycle 7) |
|---|---|---|
| NEON M3 (1 core) | 8.171 Mblock/s | 151.177 Mblock/s (18.5× faster) |
| Per-block ns | 122 | 6.6 |
| Math | Q14 trig × COSPI constants | Pure integer butterfly + shifts |
| NEON instruction shape | Multiply-heavy | Add-and-shift |

The H.264 IDCT is an integer transform built only from additions, subtractions, and right-shifts, with no multiplies. NEON runs add/sub/shift at near-peak throughput (1 cycle per op on most ports). VP9's IDCT needs Q14 multiplies by the cosine constants, which are ~4× slower per op on NEON.
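To make the add-and-shift shape concrete, here is a scalar sketch of one 8-point 1-D pass, written in the form commonly used for the H.264 8×8 inverse transform. It is not the NEON kernel and not tests/h264_idct8_ref.c; the function name and the `stride` parameter are illustrative. The three commented groups may correspond to the "3 sub-stages" noted under M1, but that mapping is an assumption.

```c
#include <stdint.h>

/* One 8-point 1-D pass, in place. Consecutive elements are `stride` apart,
 * so the same routine can walk rows (stride 1) or columns (stride 8).
 * Assumes >> on negative values is an arithmetic shift (true on mainstream
 * compilers, implementation-defined in ISO C). */
static void h264_idct8_1d(int32_t *x, int stride)
{
    const int32_t g0 = x[0 * stride], g1 = x[1 * stride];
    const int32_t g2 = x[2 * stride], g3 = x[3 * stride];
    const int32_t g4 = x[4 * stride], g5 = x[5 * stride];
    const int32_t g6 = x[6 * stride], g7 = x[7 * stride];

    /* Sub-stage 1: even half (g0,g2,g4,g6) and odd half (g1,g3,g5,g7). */
    const int32_t a0 = g0 + g4;
    const int32_t a2 = g0 - g4;
    const int32_t a4 = (g2 >> 1) - g6;
    const int32_t a6 = g2 + (g6 >> 1);
    const int32_t a1 = -g3 + g5 - g7 - (g7 >> 1);
    const int32_t a3 =  g1 + g7 - g3 - (g3 >> 1);
    const int32_t a5 = -g1 + g7 + g5 + (g5 >> 1);
    const int32_t a7 =  g3 + g5 + g1 + (g1 >> 1);

    /* Sub-stage 2: combine within each half. */
    const int32_t b0 = a0 + a6;
    const int32_t b2 = a2 + a4;
    const int32_t b4 = a2 - a4;
    const int32_t b6 = a0 - a6;
    const int32_t b1 = a1 + (a7 >> 2);
    const int32_t b3 = a3 + (a5 >> 2);
    const int32_t b5 = (a3 >> 2) - a5;
    const int32_t b7 = a7 - (a1 >> 2);

    /* Sub-stage 3: final butterfly between even and odd halves. */
    x[0 * stride] = b0 + b7;
    x[1 * stride] = b2 + b5;
    x[2 * stride] = b4 + b3;
    x[3 * stride] = b6 + b1;
    x[4 * stride] = b6 - b1;
    x[5 * stride] = b4 - b3;
    x[6 * stride] = b2 - b5;
    x[7 * stride] = b0 - b7;
}
```

Every operation is an add, subtract, negate, or shift by 1 or 2, so a vectorized version needs no multiply instructions. A full 2-D transform applies this pass along both axes and then rounds the result; that part is omitted here.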

My Phase 1 prediction of R₇ ≈ 0.5-0.9 was wrong. I extrapolated from cycle 1 (VP9 IDCT 8×8), which I assumed was the closest analog. It has the same data shape (64 coefs, 8×8 output), but the compute shape is completely different: H.264's pure-integer butterfly is much cheaper than VP9's trig butterfly.

Phase 4 deferral (same pattern as cycle 6)

Per the cycle 6 Phase 9 lesson ("for any cycle with NEON per-block < ~30 ns, predict deep RED and defer Phase 4 unless there's a specific structural QPU advantage"):

  • NEON 151 Mblock/s on a single core
  • QPU per-block floor ~250 ns (cycle 1 scaling) → ~4 Mblock/s
  • R₇ predicted = 4 / 151 = 0.026 → deep RED
  • 30fps@1080p floor passed by 155× on a single core
  • No realistic deployment benefit from QPU offload

Phase 4 deferred. Cycle 7 closed.
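The deferral arithmetic above can be sketched as a small helper. The function names are hypothetical, and the 30 ns threshold is the approximate value quoted from the cycle 6 Phase 9 lesson, not a tuned constant. No RED/YELLOW/GREEN cutoffs are encoded, since this document does not state them.

```c
/* Convert a per-block latency in ns to throughput in Mblock/s. */
static double mblock_per_s_from_ns(double ns_per_block)
{
    return 1e3 / ns_per_block;
}

/* Predicted R = QPU throughput / NEON throughput (both in Mblock/s). */
static double predicted_ratio(double qpu_ns_per_block, double neon_mblock_per_s)
{
    return mblock_per_s_from_ns(qpu_ns_per_block) / neon_mblock_per_s;
}

/* Cycle 6 Phase 9 rule of thumb: defer Phase 4 when NEON is already under
 * ~30 ns per block, unless a specific structural QPU advantage is known. */
static int should_defer_phase4(double neon_ns_per_block,
                               int has_structural_qpu_advantage)
{
    return neon_ns_per_block < 30.0 && !has_structural_qpu_advantage;
}
```

For cycle 7, `predicted_ratio(250.0, 151.0)` is about 0.026 and `should_defer_phase4(6.6, 0)` is true. By contrast, VP9 IDCT 8×8 at 122 ns per block does not trip the rule.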

Recipe verdict

H.264 IDCT 8×8 stays on CPU. Same recipe slot as cycle 6 (H.264 IDCT 4×4): trivially fast on NEON, no need for QPU help.

The public API will route through daedalus_dispatch_* CPU paths when these kernel slots are added.

Phase 9 lesson (cycle 6 + 7 combined)

H.264 transforms are NEON-trivial. Both 4×4 (5.7 ns/block, 175 Mblock/s) and 8×8 (6.6 ns/block, 151 Mblock/s) are dominated by memory bandwidth, not compute. The transform math is too lightweight to make QPU offload worthwhile.

Implications for cycle-selection going forward:

  • Skip all H.264 transform cycles. The chroma IDCT 4×4 originally planned for cycle 8 is dropped; all transform work goes CPU-only.
  • Target compute-heavy H.264 kernels where QPU might compete:
    • Deblock (cycle 8, reordered up): analogous to VP9 LPF which was GREEN. Predicted YELLOW or GREEN.
    • Luma qpel MC (6-tap): analogous to VP9 8-tap MC which was RED. Predicted RED.
    • Chroma MC (4-tap): even lighter than luma. Predicted RED.

So the practical H.264 QPU plan: only build cycle 8 (deblock). Other H.264 kernels go CPU-only via the public API.

This is a much narrower scope than originally envisioned in project_h264_scope_added. The end deliverable still meets the user goal (Pi 5 + daedalus-fourier decoding H.264), but the QPU helps only with the deblock stage. Most of H.264 stays on NEON because NEON is already fast enough.

Codec coverage state after cycle 7

| Codec | Kernel | Recipe | Status |
|---|---|---|---|
| VP9 | IDCT 8x8 | QPU | cycle 1 closed |
| VP9 | LPF wd=4 | QPU | cycle 2 closed |
| VP9 | MC 8h | CPU | cycle 3 closed |
| VP9 | LPF wd=8 | QPU | cycle 4 closed |
| AV1 | CDEF 8x8 | CPU | cycle 5 closed |
| H.264 | IDCT 4x4 | CPU | cycle 6 closed (this session) |
| H.264 | IDCT 8x8 | CPU | cycle 7 closed (this session) |
| H.264 | Deblock | TBD | cycle 8 next |
| H.264 | MC | CPU | future (predicted RED) |
| H.264 | Chroma MC | CPU | future (predicted RED) |

7 cycles closed. 3 deployed on QPU (VP9 cycles 1+2+4). 4 stay on CPU. Deployment recipe matrix grows but stays narrowly focused on QPU-wins.
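One way to keep this matrix checkable is to hold it as data and derive the counts from it. The sketch below is purely illustrative: the enum, struct, and function names are hypothetical and are not the daedalus_dispatch_* API, whose signatures are not shown in this document. Only the table rows come from the source.

```c
#include <stddef.h>

enum recipe { RECIPE_CPU, RECIPE_QPU, RECIPE_TBD };

struct kernel_slot {
    const char *codec;
    const char *kernel;
    enum recipe recipe;
    int cycle;   /* 0 = no cycle number assigned yet */
    int closed;  /* 1 = cycle closed, 0 = next or future */
};

/* Rows copied from the coverage table above. */
static const struct kernel_slot coverage[] = {
    { "VP9",   "IDCT 8x8",  RECIPE_QPU, 1, 1 },
    { "VP9",   "LPF wd=4",  RECIPE_QPU, 2, 1 },
    { "VP9",   "MC 8h",     RECIPE_CPU, 3, 1 },
    { "VP9",   "LPF wd=8",  RECIPE_QPU, 4, 1 },
    { "AV1",   "CDEF 8x8",  RECIPE_CPU, 5, 1 },
    { "H.264", "IDCT 4x4",  RECIPE_CPU, 6, 1 },
    { "H.264", "IDCT 8x8",  RECIPE_CPU, 7, 1 },
    { "H.264", "Deblock",   RECIPE_TBD, 8, 0 },
    { "H.264", "MC",        RECIPE_CPU, 0, 0 },
    { "H.264", "Chroma MC", RECIPE_CPU, 0, 0 },
};

/* Count closed cycles that ended up with the given recipe. */
static size_t count_closed(enum recipe r)
{
    size_t n = 0;
    for (size_t i = 0; i < sizeof coverage / sizeof coverage[0]; i++) {
        if (coverage[i].closed && coverage[i].recipe == r)
            n++;
    }
    return n;
}
```

`count_closed(RECIPE_QPU)` and `count_closed(RECIPE_CPU)` return 3 and 4, matching the summary line. The predicted-RED MC rows are excluded because they are not closed.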