--- cycle: 6 phase: 1 status: open date_opened: 2026-05-18 codec: H.264 kernel: IDCT 4x4 + add (intra-block residual) parent: project_h264_scope_added.md (memory) --- # Cycle 6, Phase 1 — H.264 IDCT 4×4 + add First H.264 kernel. Per `project_h264_scope_added`, the user added H.264 to daedalus-fourier scope 2026-05-18 because Pi 5 has no hardware H.264 decoder despite H.264 being the most common web codec. ## Why IDCT 4×4 first - **Smallest H.264 transform.** 16 coefficients per block, 4×4 output pixels. Simpler than VP9 IDCT 8×8 (cycle 1, 64 coefs). - **Most-used.** H.264 macroblocks default to 4×4 intra prediction + residual; 8×8 is High-profile only. 4×4 hits most real-world H.264 streams. - **Predicted GREEN.** Per the cycle 1-5 bandwidth-bound vs compute-bound classification: 4×4 IDCT is bandwidth-bound (16 reads, 16 writes, ~20 ALU ops/output). Should map well to V3D 7.1 compute. - **Clean reference.** FFmpeg's `ff_h264_idct_add_neon` is standalone (no eob parameter, no complex DC dispatch). Single call computes 1 block of IDCT + add. ## Kernel contract Per H.264 spec §8.5.12, the inverse transform is an integer-arithmetic transform (no rounding-by-cosine like VP9's Q14 trig math). Each 4×4 block: 1. Inverse row transform (4 row passes, each one 1D IDCT-like integer butterfly). 2. Inverse column transform (4 column passes, same butterfly). 3. Round and add to `dst[r,c] = clamp(dst[r,c] + ((idct[r,c] + 32) >> 6), 0, 255)`. Spec coefficients (Hadamard-like with 1/2 scaling): ``` [1 1 1 1/2] [1 1/2 -1 -1] [1 -1/2 -1 1] [1 -1 1 -1/2] ``` Integer form scales by 2: replace 1/2 with 1 and ½ with right- shift in the round step. ## NEON reference (M3 target) FFmpeg's `ff_h264_idct_add_neon` (external/ffmpeg-snapshot/libavcodec/aarch64/h264idct_neon.S line 25, 56 instructions of NEON asm). Signature: ``` void ff_h264_idct_add_neon(uint8_t *dst, int16_t *block, ptrdiff_t stride); ``` - `dst`: 4×4 pixel block in 8-bit luma surface, `stride` between rows. - `block`: 16 int16 coefficients (row-major). - destructively clears `block` to zero after the transform (per H.264 conformance). ## 30fps@1080p H.264 floor H.264 1080p uses 16×16 macroblocks with up to 16 4×4 blocks per MB. Luma: (1920/16) × (1080/16) = 120 × 67.5 = 8100 MB/frame × 16 blocks/MB = 129 600 4×4 blocks/frame. Plus chroma: 4 + 4 = 8 chroma 4×4 per MB × 8100 = 64 800 chroma blocks. Total: ~195k 4×4 blocks/frame max (worst case; many real MBs use 8×8 or skip). At 30fps: ~5.85 Mblock/s required for full-frame 4×4 worst case. A more realistic average (many MBs use 8×8, P-skip, etc.) is ~2 Mblock/s. **30fps@1080p H.264 4×4 floor (realistic): 2 Mblock/s.** **30fps@1080p H.264 4×4 floor (worst case): 5.85 Mblock/s.** ## R-band decision rules (carried from phase1.md) - R ≥ 1.0 → **GREEN** (QPU faster than NEON-1 in isolation). - 0.5 ≤ R < 1.0 → **YELLOW** (M4 decides). - 0.1 ≤ R < 0.5 → **ORANGE** (M4 may rescue). - R < 0.1 → **RED** (structural mismatch). Floor margin: ratio of M2 (or M3 if CPU-only) over the 5.85 Mblock/s worst-case 30fps floor. ## Acceptance for Phase 7 - M1: 100.0000% bit-exact (QPU output vs C ref, 10000+ random blocks). Same standard as cycles 1-5. - M2: captured, classified per R band. - M4: same-kernel mixed-bench measured (with Issue 003 caveats — this is the worst-case framing). - 30fps@1080p H.264 4×4 floor margin reported. ## Cycle 6 deliverables 1. `external/ffmpeg-snapshot/libavcodec/aarch64/h264idct_neon.S` (vendored 2026-05-18, this phase). 2. `tests/h264_idct4_ref.c` — standalone C reference (LGPL-2.1+ transcribed from spec). 3. `tests/bench_neon_h264idct4.c` — Phase 3 M3 bench. 4. `src/v3d_h264idct4.comp` — Phase 6 QPU shader. 5. `tests/bench_v3d_h264idct4.c` — Phase 6+7 M1+M2 bench (3-way vs NEON + C ref). 6. M4: extend `bench_concurrent_mixed.c` with K_H264_IDCT4. 7. Phase 4-7 docs. ## Next step (within this phase) Move to Phase 3 (NEON baseline M3) after writing the C reference. Phase 2 (libavcodec inventory) is implicit since we know the kernel from the FFmpeg vendor.