f92dc40f43
H.264 scope added 2026-05-18 per user direction. Pi 5's VideoCore VII has no hardware H.264 decoder block (only HEVC), so a QPU-accelerated H.264 path fills the most impactful codec gap. Cycle 6 = first H.264 kernel (4x4 IDCT + add, smallest H.264 transform, simplest first cycle).

Phase 1: goal doc + 1080p30 floor analysis (5.85 Mblock/s worst-case, 2.0 Mblock/s realistic since most MBs use 8x8 or P-skip).

Phase 3: NEON M3 baseline captured. ff_h264_idct_add_neon on hertz delivers 175 Mblock/s (5.7 ns per block) = 30x worst-case floor margin. H.264 IDCT 4x4 is dramatically lighter than VP9 IDCT 8x8 (21x faster per block).

Phase 3 closure also caught the key Phase 9 lesson: H.264/FFmpeg blocks are COLUMN-MAJOR (block[c*4 + r] = (row=r, col=c)). NEON ld1 with 4 registers loads the 16 coefficients sequentially (it does not de-interleave), so each register holds one column, and the FFmpeg C ref indexing makes this convention explicit. Initial C ref assumed row-major, M1 was 5% bit-exact; after fix, M1 = 100%. Convention encoded for all subsequent H.264 cycles (cycle 7+).

- external/ffmpeg-snapshot/libavcodec/aarch64/h264idct_neon.S (vendored verbatim from FFmpeg n7.1.3, 415 lines)
- external/ffmpeg-snapshot/PROVENANCE.md: updated
- tests/h264_idct4_ref.c: column-major C ref
- tests/bench_neon_h264idct4.c: M1 + M3 bench
- CMakeLists.txt: cycle 6 NEON bench wiring
- docs/k6_h264idct4_phase1.md, phase3.md

Phase 4 next: QPU shader for cycle 6. Predicted R6 = 0.01 (deep RED: kernel too small relative to QPU dispatch overhead) but worth building for cycle-completeness + the opportunistic-helper hypothesis (cycle 6 may stay CPU per recipe).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
cycle: 6
phase: 3
status: closed 2026-05-18; M1 PASS, M3₆ = 175 Mblock/s
date_opened: 2026-05-18
date_closed: 2026-05-18
codec: H.264
kernel: IDCT 4x4 + add
parent: k6_h264idct4_phase1.md
host: hertz
---

# Cycle 6, Phase 3: H.264 IDCT 4×4 NEON baseline

## M3₆ throughput

```
=== M3₆ NEON throughput ===
blocks/batch: 4096
batches done: 51 206
total blocks: 209 739 776
elapsed (kernel)=1.199 s
throughput = 175.0 Mblock/s
per-block = 5.7 ns
H.264 1080p30 worst-case floor: 29.91× margin (5.85 Mblock/s req'd)
H.264 1080p30 realistic floor: 87.50× margin (2.0 Mblock/s req'd)
```

**Per-block 5.7 ns is by far the lightest cycle so far** (cycle 2
LPF wd=4 was 21 ns, cycle 1 IDCT 8×8 was 122 ns). 4×4 is a
genuinely small kernel, and FFmpeg's NEON implementation is
extremely tight (56 instructions per block).

NEON 4-core scaling was not measured this phase. Based on cycle 2/4
patterns, expect ~3-4× scaling (bandwidth-bound territory), i.e.
~525-700 Mblock/s aggregate. That is roughly 90-120× the worst-case
floor.

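The figures above can be re-derived from the raw log values. The sketch below is only a cross-check of that arithmetic; the helper names are hypothetical and not part of the bench, and the 3-4× scaling factor is the unmeasured expectation quoted above, not a result.

```c
#include <stdio.h>

/* Throughput in Mblock/s from a block count and elapsed seconds. */
static double mblock_per_s(double blocks, double seconds)
{
    return blocks / seconds / 1e6;
}

/* Per-block time in ns for a given throughput in Mblock/s. */
static double ns_per_block(double mblock_s)
{
    return 1e3 / mblock_s;
}

/* Margin over a required floor, both in Mblock/s. */
static double floor_margin(double mblock_s, double floor_mblock_s)
{
    return mblock_s / floor_mblock_s;
}

/* Print the derived figures from the logged inputs. */
static void print_m3_summary(void)
{
    const double blocks = 51206.0 * 4096.0;  /* batches x blocks/batch */
    const double tput   = mblock_per_s(blocks, 1.199);

    printf("throughput      = %.1f Mblock/s\n", tput);
    printf("per-block       = %.1f ns\n", ns_per_block(tput));
    printf("worst-case      = %.2fx\n", floor_margin(175.0, 5.85));
    printf("realistic       = %.2fx\n", floor_margin(175.0, 2.0));
    /* Assumed (unmeasured) 3x and 4x multi-core scaling. */
    printf("4-core estimate = %.0f-%.0f Mblock/s\n", 3 * 175.0, 4 * 175.0);
}
```

Calling `print_m3_summary()` from a small `main` reproduces the log's throughput, per-block time and both margins from the batch count and elapsed time alone.
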
## M1 bit-exact gate

```
=== M1₆ bit-exact (10000 random 4x4 blocks) ===
M1₆ correctness: 10000 / 10000 blocks bit-exact (100.0000%)
```

## Key Phase 9 lesson: H.264 block layout is column-major

The bench's initial C reference assumed row-major block storage
(`block[r*4 + c]`), giving M1 = 4.98 % bit-exact (essentially chance
level). Swapping the row/column pass order did not help (both
row-first and column-first gave the same 5 % rate). Trace analysis
then revealed the actual mismatch:

- NEON `ld1 {v0.4h, v1.4h, v2.4h, v3.4h}, [x1]` is a plain
  **sequential** load: `v_k` receives `block[4k]` through
  `block[4k+3]`. It does not de-interleave (that is `ld4`), so the
  load itself does no reordering.
- FFmpeg stores the block **column-major** (`block[c*4 + r]` =
  coefficient at row r, column c), so after the load each NEON
  vector `v_c` holds column c of the block (lane = row).
- FFmpeg's C reference (`libavcodec/h264idct_template.c`) uses
  `block[i + 4*0]`, `block[i + 4*1]`, etc., which is column-major
  indexing in disguise.

Fix: read the block as column-major (`block[c*4 + r]`) in the C
reference's row-pass loop. M1 then passes 10000/10000.

Lesson encoded for future H.264 cycles:
- **H.264 4×4 (and 8×8) blocks are column-major** in FFmpeg.
- This convention propagates through all the libavcodec/aarch64
  H.264 NEON kernels (h264idct, h264dsp, h264qpel, h264cmc).
  Cycles 7+ (other H.264 kernels) should default-assume
  column-major.

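To make the indexing concrete, here is a minimal sketch of a 4×4 inverse transform + add that reads coefficients column-major. It is modeled on the butterfly structure of FFmpeg's 8-bit C reference but is not the project's `tests/h264_idct4_ref.c`; the function name `idct4_add_colmajor` and the helper `clip_u8` are hypothetical, and it assumes `>>` on a negative `int` is an arithmetic shift.

```c
#include <stdint.h>
#include <string.h>

/* Clamp a sum to the 8-bit pixel range. */
static uint8_t clip_u8(int v)
{
    return (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
}

/* Hypothetical reference: coefficient (row r, col c) is block[c*4 + r]. */
static void idct4_add_colmajor(uint8_t *dst, int16_t *block, int stride)
{
    block[0] += 1 << 5;  /* rounding term for the final >> 6 */

    /* Pass 1: for each row r, butterfly across the four columns. */
    for (int r = 0; r < 4; r++) {
        const int x0 = block[0*4 + r], x1 = block[1*4 + r];
        const int x2 = block[2*4 + r], x3 = block[3*4 + r];
        const int z0 = x0 + x2, z1 = x0 - x2;
        const int z2 = (x1 >> 1) - x3, z3 = x1 + (x3 >> 1);
        block[0*4 + r] = (int16_t)(z0 + z3);
        block[1*4 + r] = (int16_t)(z1 + z2);
        block[2*4 + r] = (int16_t)(z1 - z2);
        block[3*4 + r] = (int16_t)(z0 - z3);
    }

    /* Pass 2: for each column c, butterfly down the four rows and
     * add the result to the destination pixels of that column. */
    for (int c = 0; c < 4; c++) {
        const int x0 = block[c*4 + 0], x1 = block[c*4 + 1];
        const int x2 = block[c*4 + 2], x3 = block[c*4 + 3];
        const int z0 = x0 + x2, z1 = x0 - x2;
        const int z2 = (x1 >> 1) - x3, z3 = x1 + (x3 >> 1);
        dst[0*stride + c] = clip_u8(dst[0*stride + c] + ((z0 + z3) >> 6));
        dst[1*stride + c] = clip_u8(dst[1*stride + c] + ((z1 + z2) >> 6));
        dst[2*stride + c] = clip_u8(dst[2*stride + c] + ((z1 - z2) >> 6));
        dst[3*stride + c] = clip_u8(dst[3*stride + c] + ((z0 - z3) >> 6));
    }

    /* Leave the coefficient block cleared, as the FFmpeg kernel does. */
    memset(block, 0, 16 * sizeof(*block));
}
```

A quick way to tell the two layouts apart: with only `block[1]` set, this column-major reading produces a pattern that varies down the rows and is constant across each row. A row-major reference would instead treat `block[1]` as (row 0, column 1) and produce variation across the columns, which is the kind of mismatch that shows up as a near-zero bit-exact rate on random blocks.
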
## Comparison vs cycle 1 IDCT 8×8 (the closest analog)

| | Cycle 1 IDCT 8×8 | Cycle 6 IDCT 4×4 |
|---|---|---|
| Codec | VP9 | H.264 |
| Block size | 8×8 (64 coefs) | 4×4 (16 coefs) |
| Transform math | Q14 trig DCT (heavy multiplies) | Integer butterfly (adds and shifts, no multiplies) |
| NEON time/block | 122 ns | **5.7 ns** (21× faster) |
| Block storage | row-major | column-major |
| 30fps@1080p floor margin | 8× | **30×** (vs worst case) |

H.264 IDCT 4×4 is dramatically lighter than VP9 IDCT 8×8, both
per block (5.7 ns vs 122 ns) and per coefficient (about 0.36 ns vs
1.9 ns). This supports the "H.264 should be easier" hypothesis
from [project_h264_scope_added].

## Predicted R₆ band

NEON's 5.7 ns per block sets a very high bar for the QPU. QPU
dispatch overhead is ~30 µs per call (from M5), so a QPU call only
breaks even if it amortizes that overhead across many blocks per
dispatch.

Per-block estimate for QPU on a similar tiny kernel:
- 4 lanes per block (per pixel), 64 invocations/WG → 16 blocks/WG
- ~50-100 instructions per block (much less than cycle 1 IDCT 8×8's 250)
- At 8 ns/instruction (NEON-tuned guess), 400-800 ns per block; ~600 ns at the midpoint.
- R₆ = 5.7 / 600 ≈ 0.01 → **deep RED in isolation**

Two factors soften this. Packing 16 blocks per WG amortizes
dispatch overhead better. And 4×4 looks bandwidth-bound on NEON
(32 bytes read + 16 bytes written = 48 bytes per 5.7 ns ≈ 8 GB/s,
close to the LPDDR4 ceiling), so a same-kernel M4 run on the QPU
may add throughput almost for free, provided the QPU's memory
traffic does not contend on the same channel.

Plan: implement the QPU path anyway, for cycle-completion and the
opportunistic-helper hypothesis. If R₆ is deep RED but the
mixed-kernel deployment shape (per Issue 003) uses QPU for VP9
cycles 1+2+4 and CPU for H.264 IDCT 4×4, that's fine: the recipe
carries over.

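The estimate above is simple enough to write down as code. The sketch below only restates that arithmetic with the guessed inputs from this note (75 instructions as the midpoint of 50-100, 8 ns/instruction, 30 µs dispatch); the function names are hypothetical and nothing here is a measurement.

```c
/* Estimated QPU time per block in ns. */
static double qpu_ns_per_block(double instructions, double ns_per_instr)
{
    return instructions * ns_per_instr;
}

/* Ratio of NEON per-block time to QPU per-block time, the quantity
 * this note calls R6 (below 1 means the QPU is slower). */
static double r_ratio(double neon_ns, double qpu_ns)
{
    return neon_ns / qpu_ns;
}

/* Number of blocks NEON finishes during one QPU dispatch overhead. */
static double blocks_per_dispatch_overhead(double dispatch_us, double neon_ns)
{
    return dispatch_us * 1e3 / neon_ns;
}

/* Streaming rate in GB/s; bytes per ns equals GB/s numerically. */
static double stream_gb_per_s(double bytes_per_block, double ns_per_block)
{
    return bytes_per_block / ns_per_block;
}

/* Blocks packed into one workgroup. */
static int blocks_per_wg(int invocations_per_wg, int lanes_per_block)
{
    return invocations_per_wg / lanes_per_block;
}
```

With the note's inputs, `blocks_per_dispatch_overhead(30.0, 5.7)` comes out at a little over 5,000 blocks. That is the scale at which a single dispatch has to batch before the 30 µs overhead stops dominating, independent of the per-block R₆ estimate.
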
## Next: Phase 4 plan

Follow the established cycle pattern: Phase 4 plans the QPU shader,
Phase 5 is the Sonnet review, Phase 6 the implementation, Phase 7
the measurement. Predicted R₆ = 0.01 (deep RED, isolation), and the
kernel is small enough that per-call buffer allocation is likely to
dominate the latency.

Alternative path: defer cycle 6 Phases 4-7 (skip the QPU shader
build) and instead move directly to the next H.264 cycles where QPU
might actually win: IDCT 8×8 (cycle 7), 6-tap MC (cycle 9), or
deblock (cycle 10). H.264 IDCT 4×4 on CPU is so fast that it
doesn't NEED QPU help.

## Acceptance

- ✓ M1 bit-exact (100.00 % on 10 000 random blocks)
- ✓ M3 captured (175 Mblock/s)
- ✓ 30fps@1080p floor exceeded by 30× worst-case
- ✓ Block-layout convention documented for future cycles