Cycle 6 (H.264) opened — IDCT 4x4 Phase 1+3, M3 = 175 Mblock/s

H.264 scope added 2026-05-18 per user direction. Pi 5's VideoCore VII has no hardware H.264 decoder block (only HEVC), so a QPU-accelerated H.264 path fills the most impactful codec gap. Cycle 6 = first H.264 kernel (4x4 IDCT + add, smallest H.264 transform, simplest first cycle).

Phase 1: goal doc + 1080p30 floor analysis (5.85 Mblock/s worst-case, 2.0 Mblock/s realistic since most MBs use 8x8 or P-skip).

Phase 3: NEON M3 baseline captured. ff_h264_idct_add_neon on hertz delivers 175 Mblock/s (5.7 ns per block) = 30x worst-case floor margin. H.264 IDCT 4x4 is dramatically lighter than VP9 IDCT 8x8 (21x faster per block).

Phase 3 closure also caught the key Phase 9 lesson: H.264/FFmpeg blocks are COLUMN-MAJOR (block[c*4 + r] = (row=r, col=c)). NEON ld1 with 4 registers loads the 16 coefficients in storage order (no de-interleaving), and the FFmpeg C ref indexing makes the column-major convention explicit. Initial C ref assumed row-major, so M1 was 5% bit-exact; after the fix, M1 = 100%. Convention encoded for all subsequent H.264 cycles (cycle 7+).

- external/ffmpeg-snapshot/libavcodec/aarch64/h264idct_neon.S (vendored verbatim from FFmpeg n7.1.3, 415 lines)
- external/ffmpeg-snapshot/PROVENANCE.md: updated
- tests/h264_idct4_ref.c: column-major C ref
- tests/bench_neon_h264idct4.c: M1 + M3 bench
- CMakeLists.txt: cycle 6 NEON bench wiring
- docs/k6_h264idct4_phase1.md, phase3.md

Phase 4 next: QPU shader for cycle 6. Predicted R6 = 0.01 (deep RED: kernel too small relative to QPU dispatch overhead) but worth building for cycle-completeness + the opportunistic-helper hypothesis (cycle 6 may stay CPU per recipe).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 14:14:43 +00:00
parent eb5cfb34c4
commit f92dc40f43
7 changed files with 974 additions and 0 deletions
@@ -0,0 +1,119 @@
+---
+cycle: 6
+phase: 1
+status: open
+date_opened: 2026-05-18
+codec: H.264
+kernel: IDCT 4x4 + add (intra-block residual)
+parent: project_h264_scope_added.md (memory)
+---
+
+# Cycle 6, Phase 1 — H.264 IDCT 4×4 + add
+
+First H.264 kernel. Per `project_h264_scope_added`, the user
+added H.264 to daedalus-fourier scope 2026-05-18 because Pi 5
+has no hardware H.264 decoder despite H.264 being the most
+common web codec.
+
+## Why IDCT 4×4 first
+
+- **Smallest H.264 transform.** 16 coefficients per block, 4×4
+  output pixels. Simpler than VP9 IDCT 8×8 (cycle 1, 64 coefs).
+- **Most-used.** H.264 macroblocks default to 4×4 intra
+  prediction + residual; 8×8 is High-profile only. 4×4 hits
+  most real-world H.264 streams.
+- **Predicted GREEN.** Per the cycle 1-5 bandwidth-bound vs
+  compute-bound classification: 4×4 IDCT is bandwidth-bound
+  (16 reads, 16 writes, ~20 ALU ops/output). Should map well
+  to V3D 7.1 compute.
+- **Clean reference.** FFmpeg's `ff_h264_idct_add_neon` is
+  standalone (no eob parameter, no complex DC dispatch). Single
+  call computes 1 block of IDCT + add.
+
+## Kernel contract
+
+Per H.264 spec §8.5.12, the inverse transform is an
+integer-arithmetic transform (no rounding-by-cosine like VP9's
+Q14 trig math). Each 4×4 block:
+
+1. Inverse row transform (4 row passes, each one 1D IDCT-like
+   integer butterfly).
+2. Inverse column transform (4 column passes, same butterfly).
+3. Round, add to the prediction already in `dst`, and clamp:
+   `dst[r,c] = clamp(dst[r,c] + ((idct[r,c] + 32) >> 6), 0, 255)`.
+
+Spec coefficients (Hadamard-like with 1/2 scaling):
+```
+  [1  1  1  1/2]
+  [1  1/2 -1 -1]
+  [1 -1/2 -1  1]
+  [1 -1   1 -1/2]
+```
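+Integer form: each 1/2 factor becomes an arithmetic right shift by
+1 inside the butterfly (`b >> 1`), and the `(x + 32) >> 6` in the
+round step applies the final rounding and normalization.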
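A minimal C sketch of the three steps above, assuming the column-major layout `block[c*4 + r]` that the Phase 3 doc in this commit records. The names `idct4_add_sketch` and `clamp_u8` are hypothetical, and this is not a copy of `tests/h264_idct4_ref.c`; it only illustrates the butterfly with the 1/2 factors written as `>> 1`.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: saturate to the 8-bit pixel range. */
static uint8_t clamp_u8(int v)
{
    return v < 0 ? 0 : v > 255 ? 255 : (uint8_t)v;
}

/*
 * Sketch of a 4x4 inverse transform + add (hypothetical name).
 * Assumes block[c*4 + r] holds the coefficient at row r, column c,
 * and that >> on negative ints is an arithmetic shift (true on the
 * usual compilers, but implementation-defined in ISO C).
 */
static void idct4_add_sketch(uint8_t *dst, int16_t *block, ptrdiff_t stride)
{
    int tmp[16];

    /* Step 1: row transform. For each row r, combine its 4 columns. */
    for (int r = 0; r < 4; r++) {
        const int b0 = block[0 * 4 + r], b1 = block[1 * 4 + r];
        const int b2 = block[2 * 4 + r], b3 = block[3 * 4 + r];
        const int z0 = b0 + b2;
        const int z1 = b0 - b2;
        const int z2 = (b1 >> 1) - b3;
        const int z3 = b1 + (b3 >> 1);
        tmp[0 * 4 + r] = z0 + z3;
        tmp[1 * 4 + r] = z1 + z2;
        tmp[2 * 4 + r] = z1 - z2;
        tmp[3 * 4 + r] = z0 - z3;
    }

    /* Step 2: column transform. For each column c, combine its 4 rows. */
    for (int c = 0; c < 4; c++) {
        const int b0 = tmp[c * 4 + 0], b1 = tmp[c * 4 + 1];
        const int b2 = tmp[c * 4 + 2], b3 = tmp[c * 4 + 3];
        const int z0 = b0 + b2;
        const int z1 = b0 - b2;
        const int z2 = (b1 >> 1) - b3;
        const int z3 = b1 + (b3 >> 1);
        const int out[4] = { z0 + z3, z1 + z2, z1 - z2, z0 - z3 };

        /* Step 3: round, add to the prediction in dst, clamp. */
        for (int r = 0; r < 4; r++) {
            uint8_t *p = dst + r * stride + c;
            *p = clamp_u8(*p + ((out[r] + 32) >> 6));
        }
    }

    /* Mirror the destructive clear of the coefficient block. */
    memset(block, 0, 16 * sizeof(*block));
}
```

A DC-only block is a quick sanity check: with `block[0] = 64` and every other coefficient zero, each of the 16 pixels in `dst` goes up by exactly 1.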
+
+## NEON reference (M3 target)
+
+FFmpeg's `ff_h264_idct_add_neon`
+(external/ffmpeg-snapshot/libavcodec/aarch64/h264idct_neon.S
+line 25, 56 instructions of NEON asm). Signature:
+
+```
+void ff_h264_idct_add_neon(uint8_t *dst, int16_t *block, ptrdiff_t stride);
+```
+
+- `dst`: 4×4 pixel block in 8-bit luma surface, `stride` between rows.
+- `block`: 16 int16 coefficients (column-major, `block[c*4 + r]`;
+  the Phase 3 doc records how this was found).
+- destructively clears `block` to zero after the transform (per H.264 conformance).
+
+## 30fps@1080p H.264 floor
+
+H.264 1080p uses 16×16 macroblocks with up to 16 4×4 blocks per MB.
+Luma: (1920/16) × (1080/16) = 120 × 67.5 = 8100 MB/frame nominal
+(the coded height is padded to 1088, i.e. 120 × 68 = 8160 MB, under
+1 % more) × 16 blocks/MB = 129 600 4×4 luma blocks/frame. Plus
+chroma (4:2:0): 4 + 4 = 8 chroma 4×4 per MB × 8100 = 64 800 chroma
+blocks. Total: ~195k 4×4 blocks/frame max (worst case; many real MBs
+use 8×8 or skip).
+
+At 30fps: ~5.85 Mblock/s required for full-frame 4×4 worst case.
+A more realistic average (many MBs use 8×8, P-skip, etc.) is
+~2 Mblock/s.
+
+**30fps@1080p H.264 4×4 floor (realistic): 2 Mblock/s.**
+**30fps@1080p H.264 4×4 floor (worst case): 5.85 Mblock/s.**
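The worst-case arithmetic above can be sketched in C as follows. The function and constant names are hypothetical; the only inputs are the macroblock count and frame rate quoted in the text.

```c
#include <stdint.h>

/* 4x4 blocks per macroblock in 4:2:0: 16 luma + 4 Cb + 4 Cr. */
enum { LUMA_4X4_PER_MB = 16, CHROMA_4X4_PER_MB = 8 };

/*
 * Worst-case 4x4 blocks per second, assuming every macroblock is
 * fully coded with 4x4 transforms (hypothetical helper name).
 */
static int64_t worst_case_blocks_per_s(int64_t mbs_per_frame, int fps)
{
    return mbs_per_frame * (LUMA_4X4_PER_MB + CHROMA_4X4_PER_MB) * fps;
}
```

With 8100 MB/frame at 30 fps this returns 5 832 000 blocks/s, which the text rounds up to ~5.85 Mblock/s by way of ~195k blocks/frame.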
+
+## R-band decision rules (carried from phase1.md)
+
+- R ≥ 1.0 → **GREEN** (QPU faster than NEON-1 in isolation).
+- 0.5 ≤ R < 1.0 → **YELLOW** (M4 decides).
+- 0.1 ≤ R < 0.5 → **ORANGE** (M4 may rescue).
+- R < 0.1 → **RED** (structural mismatch).
+
+Floor margin: ratio of M2 (or M3 if CPU-only) over the 5.85
+Mblock/s worst-case 30fps floor.
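The band rules and the floor margin can be written as a small C sketch. It assumes R is the ratio of QPU throughput to single-core NEON throughput (so R ≥ 1.0 means the QPU is faster); the type and function names are hypothetical.

```c
/* R bands from the decision rules above. */
typedef enum { BAND_RED, BAND_ORANGE, BAND_YELLOW, BAND_GREEN } r_band;

/* Map a measured or predicted R onto its band (assumed R = QPU / NEON-1). */
static r_band classify_r(double r)
{
    if (r >= 1.0) return BAND_GREEN;
    if (r >= 0.5) return BAND_YELLOW;
    if (r >= 0.1) return BAND_ORANGE;
    return BAND_RED;
}

/* Floor margin: measured throughput over the required floor, same units. */
static double floor_margin(double mblock_per_s, double floor_mblock_per_s)
{
    return mblock_per_s / floor_mblock_per_s;
}
```

For example, `floor_margin(175.0, 5.85)` is about 29.9, in line with the margin the Phase 3 bench prints for the worst-case floor.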
+
+## Acceptance for Phase 7
+
+- M1: 100.0000% bit-exact (QPU output vs C ref, 10000+ random
+  blocks). Same standard as cycles 1-5.
+- M2: captured, classified per R band.
+- M4: same-kernel mixed-bench measured (with Issue 003 caveats —
+  this is the worst-case framing).
+- 30fps@1080p H.264 4×4 floor margin reported.
+
+## Cycle 6 deliverables
+
+1. `external/ffmpeg-snapshot/libavcodec/aarch64/h264idct_neon.S`
+   (vendored 2026-05-18, this phase).
+2. `tests/h264_idct4_ref.c` — standalone C reference (LGPL-2.1+
+   transcribed from spec).
+3. `tests/bench_neon_h264idct4.c` — Phase 3 M3 bench.
+4. `src/v3d_h264idct4.comp` — Phase 6 QPU shader.
+5. `tests/bench_v3d_h264idct4.c` — Phase 6+7 M1+M2 bench (3-way
+   vs NEON + C ref).
+6. M4: extend `bench_concurrent_mixed.c` with K_H264_IDCT4.
+7. Phase 4-7 docs.
+
+## Next step (within this phase)
+
+Move to Phase 3 (NEON baseline M3) after writing the C
+reference. Phase 2 (libavcodec inventory) is implicit since the
+kernel is already known from the vendored FFmpeg source.
@@ -0,0 +1,132 @@
+---
+cycle: 6
+phase: 3
+status: closed 2026-05-18 — M1 PASS, M3₆ = 175 Mblock/s
+date_opened: 2026-05-18
+date_closed: 2026-05-18
+codec: H.264
+kernel: IDCT 4x4 + add
+parent: k6_h264idct4_phase1.md
+host: hertz
+---
+
+# Cycle 6, Phase 3 — H.264 IDCT 4×4 NEON baseline
+
+## M3₆ throughput
+
+```
+=== M3₆ NEON throughput ===
+  blocks/batch:    4096
+  batches done:    51 206
+  total blocks:    209 739 776
+  elapsed (kernel)=1.199 s
+  throughput      = 175.0 Mblock/s
+  per-block       = 5.7 ns
+  H.264 1080p30 worst-case floor: 29.91× margin (5.85 Mblock/s req'd)
+  H.264 1080p30 realistic floor:  87.50× margin (2.0 Mblock/s req'd)
+```
+
+**Per-block 5.7 ns — by far the lightest cycle so far** (cycle 2
+LPF wd=4 was 21 ns, cycle 1 IDCT 8x8 was 122 ns). 4×4 is a
+genuinely small kernel and FFmpeg's NEON is extremely tight
+(56 instructions per block).
+
+NEON 4-core scaling: not measured this phase; based on cycle 2/4
+patterns, expect ~3-4× scaling (bandwidth-bound territory) →
+~525-700 Mblock/s aggregate. That's roughly 90-120× the worst-case
+floor.
+
+## M1 bit-exact gate
+
+```
+=== M1₆ bit-exact (10000 random 4x4 blocks) ===
+M1₆ correctness: 10000 / 10000 blocks bit-exact (100.0000%)
+```
+
+## Key Phase 9 lesson — H.264 block layout is column-major
+
+The bench's initial C reference assumed row-major block storage
+(`block[r*4 + c]`), giving M1 = 4.98 % bit-exact (essentially all
+random). After failed attempts swapping the row/column pass order
+(both row-first and column-first gave the same 5 % rate), trace
+analysis revealed the actual mismatch:
+
+- NEON `ld1 {v0.4h, v1.4h, v2.4h, v3.4h}, [x1]` loads the 16
+  coefficients in storage order, four per register, with no
+  de-interleaving (that would be `ld4`), so `v_k` holds
+  `block[4k .. 4k+3]`.
+- Combined with FFmpeg's choice of **column-major** block layout
+  (`block[c*4 + r]` = coefficient at row r, column c), each NEON
+  vector `v_c` holds column c of the block (lane = row), and the
+  first butterfly pass combines the four registers lane by lane.
+- FFmpeg's C reference (`libavcodec/h264idct_template.c`) uses
+  `block[i + 4*0]`, `block[i + 4*1]`, etc. in its first pass, which
+  is the same column-major indexing: `i` is the row and the
+  multiple of 4 selects the column.
+
+Fix: read block as column-major (`block[c*4 + r]`) in the C
+reference's row-pass loop. M1 then PASS 10000/10000.
+
+Lesson encoded for future H.264 cycles:
+- **H.264 4×4 (and 8×8) blocks are column-major** in FFmpeg.
+- This convention propagates through all the libavcodec/aarch64
+  H.264 NEON kernels (h264idct, h264dsp, h264qpel, h264cmc).
+  Cycles 7+ (other H.264 kernels) should default-assume
+  column-major.
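As an illustration of the convention, here is a short C sketch that repacks a row-major 4×4 block into the column-major order described above. Both function names are hypothetical and nothing here is taken from the vendored FFmpeg sources.

```c
#include <stdint.h>

/* Column-major index for the coefficient at (row r, column c). */
static int cm_index(int r, int c)
{
    return c * 4 + r;
}

/*
 * Repack a row-major block (rm[r*4 + c]) into column-major order
 * (cm[c*4 + r]). For a 4x4 block this is a plain transpose.
 */
static void row_major_to_column_major(const int16_t rm[16], int16_t cm[16])
{
    for (int r = 0; r < 4; r++)
        for (int c = 0; c < 4; c++)
            cm[cm_index(r, c)] = rm[r * 4 + c];
}
```

A test generator that produces row-major coefficients could call this once per block before handing the data to a kernel that expects the column-major layout.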
+
+## Comparison vs cycle 1 IDCT 8×8 (the closest analog)
+
+| | Cycle 1 IDCT 8×8 | Cycle 6 IDCT 4×4 |
+|---|---|---|
+| Codec | VP9 | H.264 |
+| Block size | 8×8 (64 coefs) | 4×4 (16 coefs) |
+| Transform math | Q14 trig DCT (heavy multiplies) | Integer butterfly (no multiplies, only adds and shifts) |
+| NEON time/block | 122 ns | **5.7 ns** (21× faster) |
+| Block storage | row-major | column-major |
+| 30fps@1080p floor margin | 8× | **30×** (vs worst case) |
+
+H.264 IDCT 4×4 is dramatically lighter than VP9 IDCT 8×8 — both
+per-coef and per-block. This validates the "H.264 should be
+easier" hypothesis from [project_h264_scope_added].
+
+## Predicted R₆ band
+
+NEON per-block 5.7 ns is so fast that the QPU must be very fast
+to compete. QPU dispatch overhead is ~30 µs per call (from M5),
+so the QPU-call breakeven needs to amortize across many blocks
+per dispatch.
+
+Per-block estimate for QPU on a similar tiny kernel:
+- 4 lanes per block (per pixel), 64 invocations/WG → 16 blocks/WG
+- ~50-100 instructions per block (much less than cycle 1 IDCT 8x8's 250)
+- At 8 ns/instruction (NEON-tuned guess), that is 400-800 ns; take
+  ~600 ns per block.
+- R₆ = 5.7 / 600 ≈ 0.01 → **deep RED in isolation**
+
+But: per-WG packing of 16 blocks means dispatch overhead amortizes
+better. And 4×4 looks bandwidth-bound on NEON (32 bytes read +
+16 bytes written = 48 bytes per 5.7 ns block ≈ 8.4 GB/s, close to the
+LPDDR4 ceiling). So same-kernel M4 on QPU may come nearly free if the
+QPU's memory traffic doesn't contend on the same channel.
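The back-of-envelope numbers in this section can be reproduced with a short C sketch. All function names are hypothetical, the 75-instruction input is simply the midpoint of the 50-100 guess above, and the amortization helper is an added illustration of the "~30 µs per call" remark, not a measured result.

```c
#include <stdint.h>

/* Predicted R: NEON time per block over estimated QPU time per block. */
static double predicted_r(double neon_ns, double qpu_instr, double ns_per_instr)
{
    return neon_ns / (qpu_instr * ns_per_instr);
}

/* Implied memory traffic; bytes per nanosecond is numerically GB/s. */
static double implied_gb_per_s(double bytes_per_block, double ns_per_block)
{
    return bytes_per_block / ns_per_block;
}

/*
 * Smallest number of blocks per dispatch at which the dispatch
 * overhead alone, spread over the batch, is no more than
 * per_block_ps. Uses integer picoseconds to avoid float rounding.
 */
static int64_t min_blocks_to_amortize(int64_t overhead_ps, int64_t per_block_ps)
{
    return (overhead_ps + per_block_ps - 1) / per_block_ps;
}
```

With the figures quoted here, `predicted_r(5.7, 75, 8)` is about 0.0095, `implied_gb_per_s(48, 5.7)` is about 8.4, and `min_blocks_to_amortize(30000000, 5700)` is 5264, so a dispatch would need several thousand blocks before its 30 µs overhead per block even matches NEON's total per-block time.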
+
+Plan: implement QPU path anyway for cycle-completion and
+opportunistic-helper hypothesis. If R₆ is deep RED but mixed-kernel
+(per Issue 003) deployment shape uses QPU for VP9 cycles 1+2+4 and
+CPU for H.264 IDCT 4×4, that's fine — the recipe carries over.
+
+## Next: Phase 4 plan
+
+Per the established cycle pattern: Phase 4 plans the QPU shader,
+Phase 5 is the Sonnet review, Phase 6 the implementation, and
+Phase 7 the measurement. Predicted R₆ = 0.01 (deep RED in
+isolation), and the kernel is small enough that per-call buffer
+allocation is likely to dominate the latency.
+
+Alternative path: defer cycle 6 Phase 4-7 (skip the QPU shader
+build) and instead move directly to next H.264 cycles where QPU
+might actually win — IDCT 8x8 (cycle 7), 6-tap MC (cycle 9), or
+deblock (cycle 10). H.264 IDCT 4×4 on CPU is so fast that it
+doesn't NEED QPU help.
+
+## Acceptance
+
+- ✓ M1 bit-exact (100.00 % on 10 000 random blocks)
+- ✓ M3 captured (175 Mblock/s)
+- ✓ 30fps@1080p floor exceeded by 30× worst-case
+- ✓ Block-layout convention documented for future cycles