daedalus-fourier/docs/k6_h264idct4_phase1.md
marfrit f92dc40f43 Cycle 6 (H.264) opened — IDCT 4x4 Phase 1+3, M3 = 175 Mblock/s
H.264 scope added 2026-05-18 per user direction. Pi 5's VideoCore
VII has no hardware H.264 decoder block (only HEVC), so a
QPU-accelerated H.264 path fills the most impactful codec gap.
Cycle 6 = first H.264 kernel (4x4 IDCT + add, smallest H.264
transform, simplest first cycle).

Phase 1: goal doc + 1080p30 floor analysis (5.85 Mblock/s
worst-case, 2.0 Mblock/s realistic since most MBs use 8x8 or
P-skip).

Phase 3: NEON M3 baseline captured. ff_h264_idct_add_neon on
hertz delivers 175 Mblock/s (5.7 ns per block) = 30x worst-case
floor margin. H.264 IDCT 4x4 is dramatically lighter than VP9
IDCT 8x8 (21x faster per block).

Phase 3 closure also caught the key Phase 9 lesson: H.264/FFmpeg
blocks are COLUMN-MAJOR (block[c*4 + r] = (row=r, col=c)). NEON
ld1 with 4 registers interleaves loading, and the FFmpeg C ref
indexing makes this convention explicit. Initial C ref assumed
row-major, M1 was 5% bit-exact; after fix, M1 = 100%.

Convention encoded for all subsequent H.264 cycles (cycle 7+).

- external/ffmpeg-snapshot/libavcodec/aarch64/h264idct_neon.S
  (vendored verbatim from FFmpeg n7.1.3, 415 lines)
- external/ffmpeg-snapshot/PROVENANCE.md: updated
- tests/h264_idct4_ref.c: column-major C ref
- tests/bench_neon_h264idct4.c: M1 + M3 bench
- CMakeLists.txt: cycle 6 NEON bench wiring
- docs/k6_h264idct4_phase1.md, phase3.md

Phase 4 next: QPU shader for cycle 6. Predicted R6 = 0.01 (deep
RED — kernel too small relative to QPU dispatch overhead) but
worth building for cycle-completeness + the opportunistic-helper
hypothesis (cycle 6 may stay CPU per recipe).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 14:14:43 +00:00

---
cycle: 6
phase: 1
status: open
date_opened: 2026-05-18
codec: H.264
kernel: IDCT 4x4 + add (intra-block residual)
parent: project_h264_scope_added.md (memory)
---
# Cycle 6, Phase 1 — H.264 IDCT 4×4 + add
First H.264 kernel. Per `project_h264_scope_added`, the user
added H.264 to daedalus-fourier scope 2026-05-18 because Pi 5
has no hardware H.264 decoder despite H.264 being the most
common web codec.
## Why IDCT 4×4 first
- **Smallest H.264 transform.** 16 coefficients per block, 4×4
output pixels. Simpler than VP9 IDCT 8×8 (cycle 1, 64 coefs).
- **Most-used.** H.264 macroblocks default to 4×4 intra
prediction + residual; the 8×8 transform is High-profile only.
The 4×4 path is exercised by most real-world H.264 streams.
- **Predicted GREEN.** Per the cycle 1-5 bandwidth-bound vs
compute-bound classification: 4×4 IDCT is bandwidth-bound
(16 reads, 16 writes, ~20 ALU ops/output). Should map well
to V3D 7.1 compute.
- **Clean reference.** FFmpeg's `ff_h264_idct_add_neon` is
standalone (no eob parameter, no complex DC dispatch). Single
call computes 1 block of IDCT + add.
## Kernel contract
Per H.264 spec §8.5.12, the inverse transform is an
integer-arithmetic transform (no rounding-by-cosine like VP9's
Q14 trig math). Each 4×4 block:
1. Inverse row transform (4 row passes, each one 1D IDCT-like
integer butterfly).
2. Inverse column transform (4 column passes, same butterfly).
3. Round and add to the destination: `dst[r,c] = clamp(dst[r,c] + ((idct[r,c] + 32) >> 6), 0, 255)`.
Spec coefficients (Hadamard-like with 1/2 scaling):
```
[ 1    1     1    1/2 ]
[ 1   1/2   -1    -1  ]
[ 1  -1/2   -1     1  ]
[ 1   -1     1   -1/2 ]
```
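In integer form, each 1/2 entry is computed as an arithmetic right
shift by 1 inside the butterfly, so no multiplications are needed.
The overall scale is removed by the `(x + 32) >> 6` in the round step.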
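
The three steps above can be sketched in plain C. This is a minimal illustration under stated assumptions, not the contents of `tests/h264_idct4_ref.c`: the names `coef4x4`, `idct4_1d`, and `idct4x4_add_sketch` are hypothetical, coefficients are held in spec orientation (`v[r][c]`) rather than in FFmpeg's storage order, and `>>` on a negative value is assumed to be an arithmetic shift.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical container: v[r][c] is the coefficient at row r, column c. */
typedef struct { int16_t v[4][4]; } coef4x4;

/* Clamp an int to the 8-bit pixel range [0, 255]. */
static uint8_t clamp_u8(int x)
{
    return (uint8_t)(x < 0 ? 0 : (x > 255 ? 255 : x));
}

/* One 1D inverse butterfly. The 1/2 matrix entries become ">> 1"
 * (assumes arithmetic right shift for negative values). */
static void idct4_1d(const int in[4], int out[4])
{
    int e = in[0] + in[2];
    int f = in[0] - in[2];
    int g = (in[1] >> 1) - in[3];
    int h = in[1] + (in[3] >> 1);
    out[0] = e + h;
    out[1] = f + g;
    out[2] = f - g;
    out[3] = e - h;
}

/* Row pass, column pass, then round, add to dst, and clamp. */
void idct4x4_add_sketch(uint8_t *dst, ptrdiff_t stride, const coef4x4 *coef)
{
    int tmp[4][4];
    assert(dst != NULL && coef != NULL);

    for (int r = 0; r < 4; r++) {               /* step 1: rows */
        int in[4] = { coef->v[r][0], coef->v[r][1],
                      coef->v[r][2], coef->v[r][3] };
        idct4_1d(in, tmp[r]);
    }
    for (int c = 0; c < 4; c++) {               /* step 2: columns */
        int in[4] = { tmp[0][c], tmp[1][c], tmp[2][c], tmp[3][c] };
        int out[4];
        idct4_1d(in, out);
        for (int r = 0; r < 4; r++) {           /* step 3: round + add */
            uint8_t *p = dst + r * stride + c;
            *p = clamp_u8(*p + ((out[r] + 32) >> 6));
        }
    }
}
```

With a DC-only input of 64, every pixel of the 4×4 block increases by 1, since (64 + 32) >> 6 = 1.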
## NEON reference (M3 target)
FFmpeg's `ff_h264_idct_add_neon`
(external/ffmpeg-snapshot/libavcodec/aarch64/h264idct_neon.S
line 25, 56 instructions of NEON asm). Signature:
```
void ff_h264_idct_add_neon(uint8_t *dst, int16_t *block, ptrdiff_t stride);
```
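- `dst`: 4×4 pixel block in 8-bit luma surface, `stride` between rows.
- `block`: 16 int16 coefficients, column-major: `block[c*4 + r]` holds
row `r`, column `c`. (A C ref that assumed row-major was only 5%
bit-exact in Phase 3.)
- Side effect: `block` is cleared to zero after the transform (per H.264 conformance).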
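
The commit notes for this cycle record that FFmpeg's `block` is column-major, `block[c*4 + r] = (row=r, col=c)`. A small sketch of that index mapping follows. The helper names `h264_blk_idx` and `pack_row_major_to_block` are hypothetical and are not part of FFmpeg or of this repository. The sketch only shows how a row-major test vector could be repacked before it is handed to the NEON routine.

```c
#include <assert.h>
#include <stdint.h>

/* Column-major index: element (row r, column c) lives at c*4 + r. */
static inline int h264_blk_idx(int r, int c)
{
    return c * 4 + r;
}

/* Hypothetical helper: copy a row-major 4x4 array (rm[r*4 + c])
 * into the column-major order described above. Not in-place. */
void pack_row_major_to_block(const int16_t rm[16], int16_t block[16])
{
    assert(rm != block);
    for (int r = 0; r < 4; r++)
        for (int c = 0; c < 4; c++)
            block[h264_blk_idx(r, c)] = rm[r * 4 + c];
}
```

Repacking is a plain 4×4 transpose, so applying it twice returns the original order.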
## 30fps@1080p H.264 floor
H.264 1080p uses 16×16 macroblocks (MBs), each with up to 16 luma 4×4 blocks.
Luma: (1920/16) × (1080/16) = 120 × 67.5 = 8100 MB/frame ×
16 blocks/MB = 129 600 4×4 blocks/frame. Chroma (4:2:0): 4 + 4 = 8
chroma 4×4 blocks per MB × 8100 = 64 800 chroma blocks. Total:
194 400, i.e. ~195k 4×4 blocks/frame max (worst case; many real MBs
use 8×8 or skip). The coded frame is actually 1088 lines (68 MB rows,
8160 MBs), which raises these counts by less than 1%.
At 30fps: ~5.85 Mblock/s required for the full-frame 4×4 worst case.
A more realistic average (many MBs use 8×8, P-skip, etc.) is
~2 Mblock/s.
**30fps@1080p H.264 4×4 floor (realistic): 2 Mblock/s.**
**30fps@1080p H.264 4×4 floor (worst case): 5.85 Mblock/s.**
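
The worst-case arithmetic can be checked with a few lines of C. This is only a sketch of the calculation in this section; the function names are hypothetical, and 4:2:0 chroma (8 chroma 4×4 blocks per MB) is assumed, as in the text.

```c
#include <assert.h>

/* 4x4 blocks per frame when every macroblock is coded with 4x4
 * transforms: 16 luma + 8 chroma (4:2:0) blocks per 16x16 MB. */
double worst_case_blocks_per_frame(double width, double height)
{
    assert(width > 0.0 && height > 0.0);
    double mbs = (width / 16.0) * (height / 16.0);
    return mbs * (16.0 + 8.0);
}

/* Required throughput in Mblock/s at the given frame rate. */
double floor_mblock_per_s(double blocks_per_frame, double fps)
{
    assert(fps > 0.0);
    return blocks_per_frame * fps / 1e6;
}
```

For 1920×1080 at 30 fps this gives 194 400 blocks/frame and about 5.83 Mblock/s, which the text rounds up to ~195k and 5.85 Mblock/s.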
## R-band decision rules (carried from phase1.md)
- R ≥ 1.0 → **GREEN** (QPU faster than NEON-1 in isolation).
- 0.5 ≤ R < 1.0 → **YELLOW** (M4 decides).
- 0.1 ≤ R < 0.5 → **ORANGE** (M4 may rescue).
- R < 0.1 → **RED** (structural mismatch).
Floor margin: ratio of M2 (or M3 if CPU-only) over the 5.85
Mblock/s worst-case 30fps floor.
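
The band thresholds and the floor margin translate directly into code. The sketch below is illustrative only: `r_band`, `classify_r`, and `floor_margin` are hypothetical names, and R is taken as an already-computed ratio.

```c
#include <assert.h>

typedef enum { BAND_RED, BAND_ORANGE, BAND_YELLOW, BAND_GREEN } r_band;

/* Map an R value onto the four bands listed above. */
r_band classify_r(double r)
{
    assert(r >= 0.0);
    if (r >= 1.0) return BAND_GREEN;
    if (r >= 0.5) return BAND_YELLOW;
    if (r >= 0.1) return BAND_ORANGE;
    return BAND_RED;
}

/* Floor margin: measured throughput over the 5.85 Mblock/s
 * worst-case 30fps floor. */
double floor_margin(double mblock_per_s)
{
    assert(mblock_per_s >= 0.0);
    return mblock_per_s / 5.85;
}
```

For example, the 175 Mblock/s NEON figure quoted in the commit notes gives a margin just under 30×, and the predicted R of 0.01 falls in the RED band.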
## Acceptance for Phase 7
- M1: 100.0000% bit-exact (QPU output vs C ref, 10000+ random
blocks). Same standard as cycles 1-5.
- M2: captured, classified per R band.
- M4: same-kernel mixed-bench measured (with Issue 003 caveats —
this is the worst-case framing).
- 30fps@1080p H.264 4×4 floor margin reported.
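
A minimal sketch of how the M1 percentage could be computed is shown below. It assumes M1 is counted per block, with each 4×4 output stored as 16 packed bytes. The function name `bit_exact_percent` and that storage layout are assumptions for illustration, not the repository's bench code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Percentage of 4x4 blocks (16 packed bytes each) whose output
 * matches the reference byte for byte. */
double bit_exact_percent(const uint8_t *out, const uint8_t *ref,
                         size_t n_blocks)
{
    size_t ok = 0;
    assert(out != NULL && ref != NULL);
    for (size_t i = 0; i < n_blocks; i++)
        if (memcmp(out + 16 * i, ref + 16 * i, 16) == 0)
            ok++;
    return n_blocks ? 100.0 * (double)ok / (double)n_blocks : 0.0;
}
```

Under this definition, the acceptance criterion corresponds to the function returning exactly 100.0 over the full set of random blocks.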
## Cycle 6 deliverables
1. `external/ffmpeg-snapshot/libavcodec/aarch64/h264idct_neon.S`
(vendored 2026-05-18, this phase).
2. `tests/h264_idct4_ref.c` — standalone C reference (LGPL-2.1+,
transcribed from the spec).
3. `tests/bench_neon_h264idct4.c` — Phase 3 M3 bench.
4. `src/v3d_h264idct4.comp` — Phase 6 QPU shader.
5. `tests/bench_v3d_h264idct4.c` — Phase 6+7 M1+M2 bench (3-way
vs NEON + C ref).
6. M4: extend `bench_concurrent_mixed.c` with K_H264_IDCT4.
7. Phase 4-7 docs.
## Next step (within this phase)
Move to Phase 3 (NEON baseline M3) after writing the C
reference. Phase 2 (libavcodec inventory) is implicit, since the
kernel is already identified by the vendored FFmpeg source.