Files
daedalus-fourier/docs
marfrit db2205d0e3 Cycle 7 closed: H.264 IDCT 8x8 = 151 Mblock/s NEON, Phase 4 deferred
M1: 10000/10000 bit-exact first try (column-major-block lesson
from cycle 6 carried over cleanly).

M3: 151.2 Mblock/s per core. Per-block 6.6 ns. 155x the
1080p30 floor (0.972 Mblock/s req'd).

Phase-1 prediction of R7 = 0.5-0.9 YELLOW/GREEN was WRONG. H.264
IDCT 8x8 is dramatically lighter than VP9 IDCT 8x8 (18.5x faster
NEON):

  VP9 IDCT 8x8: 122 ns/block (Q14 trig + COSPI multiplies)
  H.264 IDCT 8x8: 6.6 ns/block (pure integer butterfly + shifts)

Phase 4 deferred via the cycle 6 lightweight-kernel rationale:
NEON per-block << QPU dispatch floor; offload doesn't help.

Phase 9 lesson updated: H.264 transforms (both 4x4 and 8x8) are
NEON-trivial. Skip ALL H.264 transform cycles for QPU. Target
compute-heavy H.264 kernels only (deblock = cycle 8 next; MC
likely RED).

Cycle 7 = 2nd consecutive "predicted GREEN, measured CPU-only"
result. Forces a sharper view of which kernels QPU can actually
help with: deblock and possibly some VP9 cases.

- tests/h264_idct8_ref.c (column-major C ref)
- tests/bench_neon_h264idct8.c (M1 + M3 bench)
- CMakeLists.txt: cycle 7 bench wiring
- docs/k7_h264idct8_phase3_and_4.md (closure)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 14:16:42 +00:00
..