daedalus-fourier/docs/k7_h264idct8_phase3_and_4.md
marfrit db2205d0e3 Cycle 7 closed: H.264 IDCT 8x8 = 151 Mblock/s NEON, Phase 4 deferred
M1: 10000/10000 bit-exact first try (column-major-block lesson
from cycle 6 carried over cleanly).

M3: 151.2 Mblock/s per core. Per-block 6.6 ns. 155x the
1080p30 floor (0.972 Mblock/s req'd).

Phase-1 prediction of R7 = 0.5-0.9 YELLOW/GREEN was WRONG. H.264
IDCT 8x8 is dramatically lighter than VP9 IDCT 8x8 (18.5x faster
NEON):

  VP9 IDCT 8x8: 122 ns/block (Q14 trig + COSPI multiplies)
  H.264 IDCT 8x8: 6.6 ns/block (pure integer butterfly + shifts)

Phase 4 deferred via the cycle 6 lightweight-kernel rationale:
NEON per-block << QPU dispatch floor; offload doesn't help.

Phase 9 lesson updated: H.264 transforms (both 4x4 and 8x8) are
NEON-trivial. Skip ALL H.264 transform cycles for QPU. Target
compute-heavy H.264 kernels only (deblock = cycle 8 next; MC
likely RED).

Cycle 7 = 2nd consecutive "predicted GREEN, measured CPU-only"
result. Forces a sharper view of which kernels QPU can actually
help with: deblock and possibly some VP9 cases.

- tests/h264_idct8_ref.c (column-major C ref)
- tests/bench_neon_h264idct8.c (M1 + M3 bench)
- CMakeLists.txt: cycle 7 bench wiring
- docs/k7_h264idct8_phase3_and_4.md (closure)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 14:16:42 +00:00

---
cycle: 7
phase: 3 + 4 (decision: defer Phase 4)
status: closed 2026-05-18 — M1 PASS, M3₇ = 151 Mblock/s, Phase 4 deferred
date_opened: 2026-05-18
date_closed: 2026-05-18
parent: k7_h264idct8_phase1.md
host: hertz
---
# Cycle 7, Phases 3+4 — H.264 IDCT 8×8 NEON baseline + Phase 4 deferral
## M1 + M3
```
=== M1₇ bit-exact (10000 random 8x8 blocks) ===
M1₇ correctness: 10000 / 10000 blocks bit-exact (100.0000%)
=== M3₇ NEON throughput ===
total blocks: 62 074 880
elapsed (kernel)=0.411 s
throughput = 151.2 Mblock/s
per-block = 6.6 ns
H.264 1080p30 IDCT8 floor: 155.53x margin (0.972 Mblock/s req'd)
```
M1 passed on the first try: the column-major-block convention from
cycle 6 Phase 9 carried over correctly and held up under a considerably
more complex butterfly (3 sub-stages). No debugging was needed.
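
The derived figures in the log can be re-checked from the raw counters. The following is a minimal sketch, assuming the 1080p30 floor counts one 8×8 block per 64 luma pixels of a 1920×1080 frame (that assumption reproduces the 0.972 Mblock/s figure). The function names are illustrative and are not taken from the bench source.

```c
#include <stdint.h>

/* Throughput in Mblock/s from a block count and elapsed seconds. */
static double mblocks_per_sec(uint64_t blocks, double seconds)
{
    return (double)blocks / seconds / 1e6;
}

/* Per-block latency in ns for a throughput given in Mblock/s. */
static double ns_per_block(double mblock_s)
{
    return 1e3 / mblock_s;
}

/* Assumed floor: every 8x8 luma block of a width x height frame,
 * decoded fps times per second, expressed in Mblock/s. */
static double floor_mblocks_per_sec(int width, int height, int fps)
{
    return (double)(width * height / 64) * fps / 1e6;
}
```

With the rounded elapsed time of 0.411 s, `mblocks_per_sec(62074880, 0.411)` gives roughly 151.0 Mblock/s; the logged 151.2 was presumably computed from the unrounded timer value.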
## Surprise: H.264 IDCT 8×8 is dramatically lighter than VP9 IDCT 8×8
| | VP9 IDCT 8×8 (cycle 1) | H.264 IDCT 8×8 (cycle 7) |
|---|---|---|
| NEON M3 (1 core) | 8.171 Mblock/s | **151.177 Mblock/s** (18.5× faster) |
| Per-block ns | 122 | **6.6** |
| Math | Q14 trig × COSPI constants | Pure integer butterfly + shifts |
| NEON instruction shape | Multiply-heavy | Add-and-shift |

The H.264 IDCT uses an INTEGER transform with only additions,
subtractions, and right-shifts — no multiplies. NEON's
add/sub/shift throughput is near-peak (1 cycle per op on most
ports). VP9's IDCT requires Q14 multiplies for the cosine-related
transform, which are ~4× slower per op on NEON.

**My Phase 1 prediction of R₇ ≈ 0.5-0.9 was wrong.** I extrapolated
from cycle 1 (VP9 IDCT 8×8) which I assumed was the closest analog
— it's the same data shape (64 coefs, 8×8 output) but the compute
shape is completely different. H.264's pure-integer butterfly is
much cheaper than VP9's trig butterfly.
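
To make the "add-and-shift" shape concrete, here is a scalar sketch of one 1-D pass of the H.264 8-point inverse transform, following the butterfly structure in the H.264 specification. It is not the code in `tests/h264_idct8_ref.c` or the NEON kernel; the function name and the `int32_t` working type are assumptions made for illustration. The full 8×8 transform applies this pass along both dimensions and then rounds each result with `(x + 32) >> 6`.

```c
#include <stdint.h>

/* One 1-D pass of the H.264 8-point inverse transform.
 * Only additions, subtractions and right-shifts are used.
 * Note: >> on negative signed values is implementation-defined in C;
 * this sketch relies on the arithmetic shift that mainstream compilers
 * provide. */
static void h264_idct8_1d(const int32_t in[8], int32_t out[8])
{
    /* Sub-stage 1: combine even inputs and odd inputs separately. */
    const int32_t a0 = in[0] + in[4];
    const int32_t a2 = in[0] - in[4];
    const int32_t a4 = (in[2] >> 1) - in[6];
    const int32_t a6 = (in[6] >> 1) + in[2];
    const int32_t a1 = -in[3] + in[5] - in[7] - (in[7] >> 1);
    const int32_t a3 =  in[1] + in[7] - in[3] - (in[3] >> 1);
    const int32_t a5 = -in[1] + in[7] + in[5] + (in[5] >> 1);
    const int32_t a7 =  in[3] + in[5] + in[1] + (in[1] >> 1);

    /* Sub-stage 2: second butterfly within each half. */
    const int32_t b0 = a0 + a6;
    const int32_t b2 = a2 + a4;
    const int32_t b4 = a2 - a4;
    const int32_t b6 = a0 - a6;
    const int32_t b1 = (a7 >> 2) + a1;
    const int32_t b3 = a3 + (a5 >> 2);
    const int32_t b5 = (a3 >> 2) - a5;
    const int32_t b7 = a7 - (a1 >> 2);

    /* Sub-stage 3: merge even and odd halves into the outputs. */
    out[0] = b0 + b7;
    out[7] = b0 - b7;
    out[1] = b2 + b5;
    out[6] = b2 - b5;
    out[2] = b4 + b3;
    out[5] = b4 - b3;
    out[3] = b6 + b1;
    out[4] = b6 - b1;
}
```

Every operation maps to a NEON integer add, subtract, or shift, which is consistent with the add-and-shift row in the table above. A VP9-style pass would instead multiply by Q14 cosine constants at each butterfly.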
## Phase 4 deferral (same pattern as cycle 6)
Per the cycle 6 Phase 9 lesson ("for any cycle with NEON per-block
< ~30 ns, predict deep RED and defer Phase 4 unless there's a
specific structural QPU advantage"):
- NEON 151 Mblock/s on a single core
- QPU per-block floor ~250 ns (cycle 1 scaling) → ~4 Mblock/s
- R₇ predicted = 4 / 151 = **0.026 → deep RED**
- 30fps@1080p floor passed by 155× on a single core
- No realistic deployment benefit from QPU offload

**Phase 4 deferred. Cycle 7 closed.**
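
The deferral arithmetic above can be written down as a small check. This is a sketch only: the function names are hypothetical, the ~30 ns threshold is the cycle 6 rule of thumb quoted above, and the RED/YELLOW/GREEN cut-offs are deliberately left out because this note does not define them.

```c
#include <stdbool.h>

/* Predicted R = QPU throughput / NEON throughput.
 * With per-block latencies this is neon_ns / qpu_floor_ns. */
static double predicted_ratio(double neon_ns, double qpu_floor_ns)
{
    return neon_ns / qpu_floor_ns;
}

/* Rule of thumb from the cycle 6 Phase 9 lesson: defer Phase 4 when
 * NEON is already below ~30 ns per block and no specific structural
 * QPU advantage is known. */
static bool defer_phase4(double neon_ns, bool structural_qpu_advantage)
{
    return neon_ns < 30.0 && !structural_qpu_advantage;
}
```

For cycle 7, `predicted_ratio(6.6, 250.0)` is about 0.026, matching the 4 / 151 estimate, and `defer_phase4(6.6, false)` is true. The VP9 IDCT 8×8 figure from cycle 1 (122 ns) would not trigger the rule.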
## Recipe verdict
**H.264 IDCT 8×8 stays on CPU.** Same recipe slot as cycle 6
(H.264 IDCT 4×4): trivially fast on NEON, no need for QPU help.
The public API will route through `daedalus_dispatch_*` CPU paths
when these kernel slots are added.
## Phase 9 lesson (cycle 6 + 7 combined)
**H.264 transforms are NEON-trivial.** Both 4×4 (5.7 ns/block,
175 Mblock/s) and 8×8 (6.6 ns/block, 151 Mblock/s) are dominated
by memory bandwidth, not compute. The transform math is too
lightweight to make QPU offload worthwhile.

Implications for cycle-selection going forward:
- **Skip all H.264 transform cycles** (chroma IDCT 4×4 in cycle 8
  was originally planned; defer all transform work to CPU-only).
- **Target compute-heavy H.264 kernels** where QPU might compete:
  - **Deblock** (cycle 8, reordered up): analogous to VP9 LPF
    which was GREEN. Predicted YELLOW or GREEN.
  - **Luma qpel MC** (6-tap): analogous to VP9 8-tap MC which
    was RED. Predicted RED.
  - **Chroma MC** (4-tap): even lighter than luma. Predicted RED.

So the practical H.264 QPU plan: **only build cycle 8 (deblock)**.
Other H.264 kernels go CPU-only via the public API.

This is a much narrower scope than originally envisioned in
`project_h264_scope_added`. The end deliverable still meets the
user goal (Pi 5 + daedalus-fourier decoding H.264) — just with
the QPU only helping the deblock stage. Most of H.264 stays on
NEON because NEON is already so fast.
## Codec coverage state after cycle 7
| Codec | Kernel | Recipe | Status |
|---|---|---|---|
| VP9 | IDCT 8x8 | QPU | cycle 1 closed |
| VP9 | LPF wd=4 | QPU | cycle 2 closed |
| VP9 | MC 8h | CPU | cycle 3 closed |
| VP9 | LPF wd=8 | QPU | cycle 4 closed |
| AV1 | CDEF 8x8 | CPU | cycle 5 closed |
| H.264 | IDCT 4x4 | CPU | cycle 6 closed (this session) |
| H.264 | IDCT 8x8 | CPU | cycle 7 closed (this session) |
| H.264 | Deblock | TBD | cycle 8 next |
| H.264 | Luma qpel MC | CPU | future (predicted RED) |
| H.264 | Chroma MC | CPU | future (predicted RED) |

7 cycles closed. 3 deployed on QPU (VP9 cycles 1+2+4). 4 stay on
CPU. Deployment recipe matrix grows but stays narrowly focused on
QPU-wins.
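
As an illustration of how the closed part of the recipe matrix could be held as data, here is a hypothetical sketch. The type and identifier names are invented for this example and are not the `daedalus_dispatch_*` API, whose actual shape is not shown in this note.

```c
#include <stddef.h>

/* Hypothetical recipe tags; not the project's real enum. */
typedef enum { RECIPE_CPU, RECIPE_QPU } recipe_t;

typedef struct {
    const char *codec;
    const char *kernel;
    recipe_t    recipe;
    int         cycle;   /* cycle in which the recipe was decided */
} kernel_slot_t;

/* The seven closed cycles from the coverage table. */
static const kernel_slot_t closed_slots[] = {
    { "VP9",   "IDCT 8x8", RECIPE_QPU, 1 },
    { "VP9",   "LPF wd=4", RECIPE_QPU, 2 },
    { "VP9",   "MC 8h",    RECIPE_CPU, 3 },
    { "VP9",   "LPF wd=8", RECIPE_QPU, 4 },
    { "AV1",   "CDEF 8x8", RECIPE_CPU, 5 },
    { "H.264", "IDCT 4x4", RECIPE_CPU, 6 },
    { "H.264", "IDCT 8x8", RECIPE_CPU, 7 },
};

/* Count how many closed cycles ended up on a given recipe. */
static size_t count_recipe(recipe_t r)
{
    size_t n = 0;
    for (size_t i = 0; i < sizeof closed_slots / sizeof closed_slots[0]; i++)
        if (closed_slots[i].recipe == r)
            n++;
    return n;
}
```

Counting over this table reproduces the summary: 3 kernels on QPU and 4 on CPU.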