db2205d0e3
M1: 10000/10000 bit-exact first try (column-major-block lesson from cycle 6 carried over cleanly).
M3: 151.2 Mblock/s per core. Per-block 6.6 ns. 155x the 1080p30 floor (0.972 Mblock/s req'd).

Phase-1 prediction of R7 = 0.5-0.9 YELLOW/GREEN was WRONG. H.264 IDCT 8x8 is dramatically lighter than VP9 IDCT 8x8 (18.5x faster NEON):

- VP9 IDCT 8x8: 122 ns/block (Q14 trig + COSPI multiplies)
- H.264 IDCT 8x8: 6.6 ns/block (pure integer butterfly + shifts)

Phase 4 deferred via the cycle 6 lightweight-kernel rationale: NEON per-block << QPU dispatch floor; offload doesn't help.

Phase 9 lesson updated: H.264 transforms (both 4x4 and 8x8) are NEON-trivial. Skip ALL H.264 transform cycles for QPU. Target compute-heavy H.264 kernels only (deblock = cycle 8 next; MC likely RED).

Cycle 7 = 2nd consecutive "predicted GREEN, measured CPU-only" result. Forces a sharper view of which kernels QPU can actually help with: deblock and possibly some VP9 cases.

- tests/h264_idct8_ref.c (column-major C ref)
- tests/bench_neon_h264idct8.c (M1 + M3 bench)
- CMakeLists.txt: cycle 7 bench wiring
- docs/k7_h264idct8_phase3_and_4.md (closure)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
118 lines
4.4 KiB
Markdown
---
cycle: 7
phase: 3 + 4 (decision: defer Phase 4)
status: closed 2026-05-18, M1 PASS, M3₇ = 151 Mblock/s, Phase 4 deferred
date_opened: 2026-05-18
date_closed: 2026-05-18
parent: k7_h264idct8_phase1.md
host: hertz
---

# Cycle 7, Phases 3+4: H.264 IDCT 8×8 NEON baseline + Phase 4 deferral

## M1 + M3

```
=== M1₇ bit-exact (10000 random 8x8 blocks) ===
M1₇ correctness: 10000 / 10000 blocks bit-exact (100.0000%)

=== M3₇ NEON throughput ===
total blocks: 62 074 880
elapsed (kernel)=0.411 s
throughput = 151.2 Mblock/s
per-block = 6.6 ns
H.264 1080p30 IDCT8 floor: 155.53x margin (0.972 Mblock/s req'd)
```

M1 passed on the first try. The column-major-block convention from
cycle 6 Phase 9 carried over correctly and held up against a
considerably more complex butterfly (3 sub-stages). No debugging needed.

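The floor and margin figures in the M3₇ output can be reproduced with a few lines of arithmetic. The sketch below assumes the 1080p30 floor counts one 8×8 block per 64 luma pixels of a 1920×1080 frame (no chroma, no padding to 1088 lines); that assumption is not stated in the bench output, but it yields exactly 0.972 Mblock/s. The function names are illustrative and do not come from the repository.

```c
#include <assert.h>
#include <stdint.h>

/* Real-time floor in Mblock/s.
 * Assumption: one 8x8 block per 64 luma pixels, chroma ignored. */
static double idct8_floor_mblock_per_s(int width, int height, int fps)
{
    assert(width > 0 && height > 0 && fps > 0);
    int64_t blocks_per_frame = ((int64_t)width * height) / 64;
    return (double)(blocks_per_frame * fps) / 1e6;
}

/* Convert a throughput in Mblock/s into nanoseconds per block. */
static double ns_per_block(double mblock_per_s)
{
    assert(mblock_per_s > 0.0);
    return 1e3 / mblock_per_s;
}

/* Margin of a measured throughput over the floor (dimensionless). */
static double floor_margin(double measured_mblock_per_s,
                           double floor_mblock_per_s)
{
    assert(floor_mblock_per_s > 0.0);
    return measured_mblock_per_s / floor_mblock_per_s;
}
```

Feeding in the measured 151.177 Mblock/s gives roughly 6.6 ns per block and a margin of about 155.5×, in line with the bench output above.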
## Surprise: H.264 IDCT 8×8 is dramatically lighter than VP9 IDCT 8×8

| Metric | VP9 IDCT 8×8 (cycle 1) | H.264 IDCT 8×8 (cycle 7) |
|---|---|---|
| NEON M3 (1 core) | 8.171 Mblock/s | **151.177 Mblock/s** (18.5× faster) |
| Per-block ns | 122 | **6.6** |
| Math | Q14 trig × COSPI constants | Pure integer butterfly + shifts |
| NEON instruction shape | Multiply-heavy | Add-and-shift |

The H.264 IDCT is an integer transform built only from additions,
subtractions, and right-shifts, with no multiplies. NEON's
add/sub/shift throughput is near peak (1 cycle per op on most
ports). VP9's IDCT needs Q14 multiplies by cosine constants,
which are ~4× slower per op on NEON.

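To make the "add and shift only" claim concrete, here is a minimal scalar sketch of the H.264 8×8 inverse transform as I understand it from the standard: a 1-D butterfly applied to each row, then each column, followed by the final `(x + 32) >> 6` rounding. It is not the repository's `tests/h264_idct8_ref.c`. The function names, the row-major layout, and the `int32_t` working type are assumptions made for illustration, and the code relies on `>>` being an arithmetic shift for negative values, which holds on mainstream compilers.

```c
#include <assert.h>
#include <stdint.h>

/* One 1-D pass over 8 values spaced `stride` apart, in place.
 * Only adds, subtracts, and right-shifts are used. */
static void h264_idct8_1d(int32_t *x, int stride)
{
    assert(stride > 0);
    int32_t in[8];
    for (int i = 0; i < 8; i++)
        in[i] = x[i * stride];

    /* Stage 1: even half (inputs 0, 2, 4, 6). */
    int32_t a0 = in[0] + in[4];
    int32_t a4 = in[0] - in[4];
    int32_t a2 = (in[2] >> 1) - in[6];
    int32_t a6 = (in[6] >> 1) + in[2];

    /* Stage 1: odd half (inputs 1, 3, 5, 7). */
    int32_t a1 = -in[3] + in[5] - in[7] - (in[7] >> 1);
    int32_t a3 =  in[1] + in[7] - in[3] - (in[3] >> 1);
    int32_t a5 = -in[1] + in[7] + in[5] + (in[5] >> 1);
    int32_t a7 =  in[3] + in[5] + in[1] + (in[1] >> 1);

    /* Stage 2: combine within each half. */
    int32_t b0 = a0 + a6, b2 = a4 + a2, b4 = a4 - a2, b6 = a0 - a6;
    int32_t b1 = (a7 >> 2) + a1;
    int32_t b3 = a3 + (a5 >> 2);
    int32_t b5 = (a3 >> 2) - a5;
    int32_t b7 = a7 - (a1 >> 2);

    /* Stage 3: final butterfly across the two halves. */
    x[0 * stride] = b0 + b7;  x[7 * stride] = b0 - b7;
    x[1 * stride] = b2 + b5;  x[6 * stride] = b2 - b5;
    x[2 * stride] = b4 + b3;  x[5 * stride] = b4 - b3;
    x[3 * stride] = b6 + b1;  x[4 * stride] = b6 - b1;
}

/* 2-D residual for one row-major 8x8 block: rows, columns, rounding. */
static void h264_idct8_residual(int32_t blk[64])
{
    for (int r = 0; r < 8; r++)
        h264_idct8_1d(blk + 8 * r, 1);
    for (int c = 0; c < 8; c++)
        h264_idct8_1d(blk + c, 8);
    for (int i = 0; i < 64; i++)
        blk[i] = (blk[i] + 32) >> 6;
}
```

The three stages map naturally onto vector add, subtract, and shift instructions with no multiply anywhere, which is the contrast with VP9's constant multiplies drawn above.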
**My Phase 1 prediction of R₇ ≈ 0.5-0.9 was wrong.** I extrapolated
from cycle 1 (VP9 IDCT 8×8), which I assumed was the closest analog.
It has the same data shape (64 coefs, 8×8 output), but the compute
shape is completely different: H.264's pure-integer butterfly is
much cheaper than VP9's trig butterfly.

## Phase 4 deferral (same pattern as cycle 6)

Per the cycle 6 Phase 9 lesson ("for any cycle with NEON per-block
< ~30 ns, predict deep RED and defer Phase 4 unless there's a
specific structural QPU advantage"):

- NEON 151 Mblock/s on a single core
- QPU per-block floor ~250 ns (cycle 1 scaling) → ~4 Mblock/s
- R₇ predicted = 4 / 151 = **0.026 → deep RED**
- 30fps@1080p floor passed by 155× on a single core
- No realistic deployment benefit from QPU offload

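The deferral rule and the R₇ estimate in the list above reduce to two small functions. This is a sketch under stated assumptions: the names `predicted_ratio` and `defer_phase4` are hypothetical, the 30 ns threshold is the approximate figure quoted from the cycle 6 lesson, and the RED/YELLOW/GREEN cut-offs are deliberately left out because they are not given here.

```c
#include <assert.h>
#include <stdbool.h>

/* Predicted QPU/NEON throughput ratio R from per-block latencies.
 * R < 1 means NEON is faster than the QPU floor. */
static double predicted_ratio(double neon_ns_per_block,
                              double qpu_floor_ns_per_block)
{
    assert(neon_ns_per_block > 0.0 && qpu_floor_ns_per_block > 0.0);
    return neon_ns_per_block / qpu_floor_ns_per_block;
}

/* Cycle 6 Phase 9 rule of thumb: if NEON is under ~30 ns per block,
 * defer Phase 4 unless a specific structural QPU advantage exists. */
static bool defer_phase4(double neon_ns_per_block,
                         bool structural_qpu_advantage)
{
    const double threshold_ns = 30.0; /* approximate, per the lesson */
    return neon_ns_per_block < threshold_ns && !structural_qpu_advantage;
}
```

With 6.6 ns on NEON against a ~250 ns QPU floor, the ratio comes out near 0.026 and the rule says defer, while the 122 ns VP9 IDCT from cycle 1 would not trip the rule.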
**Phase 4 deferred. Cycle 7 closed.**

## Recipe verdict

**H.264 IDCT 8×8 stays on CPU.** Same recipe slot as cycle 6
(H.264 IDCT 4×4): trivially fast on NEON, no need for QPU help.

The public API will route through `daedalus_dispatch_*` CPU paths
when these kernel slots are added.

## Phase 9 lesson (cycle 6 + 7 combined)

**H.264 transforms are NEON-trivial.** Both 4×4 (5.7 ns/block,
175 Mblock/s) and 8×8 (6.6 ns/block, 151 Mblock/s) are dominated
by memory bandwidth, not compute. The transform math is too
lightweight to make QPU offload worthwhile.

Implications for cycle-selection going forward:

- **Skip all H.264 transform cycles.** A chroma IDCT 4×4 was
  originally planned for cycle 8; all transform work now stays
  CPU-only.
- **Target compute-heavy H.264 kernels** where QPU might compete:
  - **Deblock** (cycle 8, reordered up): analogous to VP9 LPF,
    which was GREEN. Predicted YELLOW or GREEN.
  - **Luma qpel MC** (6-tap): analogous to VP9 8-tap MC, which
    was RED. Predicted RED.
  - **Chroma MC** (4-tap): even lighter than luma. Predicted RED.

So the practical H.264 QPU plan: **only build cycle 8 (deblock)**.
Other H.264 kernels go CPU-only via the public API.

This is a much narrower scope than originally envisioned in
`project_h264_scope_added`. The end deliverable still meets the
user goal (Pi 5 + daedalus-fourier decoding H.264), just with
the QPU helping only the deblock stage. Most of H.264 stays on
NEON because NEON is already fast enough.

## Codec coverage state after cycle 7

| Codec | Kernel | Recipe | Status |
|---|---|---|---|
| VP9 | IDCT 8x8 | QPU | cycle 1 closed |
| VP9 | LPF wd=4 | QPU | cycle 2 closed |
| VP9 | MC 8h | CPU | cycle 3 closed |
| VP9 | LPF wd=8 | QPU | cycle 4 closed |
| AV1 | CDEF 8x8 | CPU | cycle 5 closed |
| H.264 | IDCT 4x4 | CPU | cycle 6 closed (this session) |
| H.264 | IDCT 8x8 | CPU | cycle 7 closed (this session) |
| H.264 | Deblock | TBD | cycle 8 next |
| H.264 | MC | CPU | future (predicted RED) |
| H.264 | Chroma MC | CPU | future (predicted RED) |

7 cycles closed: 3 deployed on QPU (VP9 cycles 1+2+4), 4 stay on
CPU. The deployment recipe matrix grows but stays narrowly focused
on QPU wins.
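The coverage table can also be kept as data so the closing tally is checked rather than counted by hand. The sketch below is only an illustration: `recipe_t`, `kernel_slot_t`, `coverage`, and `count_closed` are hypothetical names, not part of the daedalus API, and the entries simply mirror the table rows.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum { RECIPE_CPU, RECIPE_QPU, RECIPE_TBD } recipe_t;

typedef struct {
    const char *codec;
    const char *kernel;
    recipe_t    recipe;
    bool        closed;   /* true once the cycle has been closed */
} kernel_slot_t;

/* Mirrors the coverage table after cycle 7. */
static const kernel_slot_t coverage[] = {
    { "VP9",   "IDCT 8x8",  RECIPE_QPU, true  },  /* cycle 1 */
    { "VP9",   "LPF wd=4",  RECIPE_QPU, true  },  /* cycle 2 */
    { "VP9",   "MC 8h",     RECIPE_CPU, true  },  /* cycle 3 */
    { "VP9",   "LPF wd=8",  RECIPE_QPU, true  },  /* cycle 4 */
    { "AV1",   "CDEF 8x8",  RECIPE_CPU, true  },  /* cycle 5 */
    { "H.264", "IDCT 4x4",  RECIPE_CPU, true  },  /* cycle 6 */
    { "H.264", "IDCT 8x8",  RECIPE_CPU, true  },  /* cycle 7 */
    { "H.264", "Deblock",   RECIPE_TBD, false },  /* cycle 8 next */
    { "H.264", "MC",        RECIPE_CPU, false },  /* future, predicted RED */
    { "H.264", "Chroma MC", RECIPE_CPU, false },  /* future, predicted RED */
};

/* Count closed cycles that ended up on the given recipe. */
static size_t count_closed(recipe_t recipe)
{
    size_t n = 0;
    for (size_t i = 0; i < sizeof coverage / sizeof coverage[0]; i++) {
        assert(coverage[i].codec && coverage[i].kernel);
        if (coverage[i].closed && coverage[i].recipe == recipe)
            n++;
    }
    return n;
}
```

Counting the closed entries gives 3 on QPU and 4 on CPU, 7 in total, which matches the summary.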