diff --git a/docs/k7_h264idct8_phase1.md b/docs/k7_h264idct8_phase1.md
new file mode 100644
index 0000000..c687474
--- /dev/null
+++ b/docs/k7_h264idct8_phase1.md
@@ -0,0 +1,130 @@
+---
+cycle: 7
+phase: 1
+status: open
+date_opened: 2026-05-18
+codec: H.264
+kernel: IDCT 8x8 + add (High-profile residual)
+parent: project_h264_scope_added.md (memory)
+predicted_R: 0.4-0.8 (YELLOW/ORANGE) — comparable to VP9 IDCT 8x8 (cycle 1, R=0.92)
+---
+
+# Cycle 7, Phase 1 — H.264 IDCT 8×8 + add
+
+Second H.264 kernel. 8×8 inverse integer transform used in
+High-profile H.264 (most modern H.264 encodes High; broadcast
+TV, web streams, file media). Smaller scope than IDCT 4×4 but
+much more compute-heavy per block.
+
+## Why IDCT 8x8 next
+
+- Closely analogous to **cycle 1 (VP9 IDCT 8×8) which was R=0.92
+  GREEN**. Best candidate for a near-immediate H.264 GREEN result.
+- 64 coefficients per block (8×8) = same data shape as cycle 1.
+- Integer butterfly (no trig multiplies) but more sub-stages than
+  4×4. Per-block compute weight ~3-5× the 4×4.
+- H.264 High-profile uses IDCT 8×8 for ~40-60 % of residual blocks
+  (encoder choice). Decoder must support it for spec compliance.
+
+## Kernel contract
+
+Per H.264 spec §8.5.13 (8x8 inverse integer transform). 1D
+butterfly (g[0..7] from input d[0..7]):
+
+```
+e[0] = d[0] + d[4]
+e[1] = -d[3] + d[5] - d[7] - (d[7] >> 1)
+e[2] = d[0] - d[4]
+e[3] = d[1] + d[7] - d[3] - (d[3] >> 1)
+e[4] = (d[2] >> 1) - d[6]
+e[5] = -d[1] + d[7] + d[5] + (d[5] >> 1)
+e[6] = d[2] + (d[6] >> 1)
+e[7] = d[3] + d[5] + d[1] + (d[1] >> 1)
+
+f[0] = e[0] + e[6]
+f[1] = e[1] + (e[7] >> 2)
+f[2] = e[2] + e[4]
+f[3] = e[3] + (e[5] >> 2)
+f[4] = e[2] - e[4]
+f[5] = (e[3] >> 2) - e[5]
+f[6] = e[0] - e[6]
+f[7] = e[7] - (e[1] >> 2)
+
+g[i] = f[p] + f[q], g[7-i] = f[p] - f[q]  for (i,p,q) in (0,0,7), (1,2,5), (2,4,3), (3,6,1)
+```
+
+Applied row-pass then column-pass (per H.264/FFmpeg convention,
+with column-major block).
+
+Final: dst[r,c] = clip(dst[r,c] + ((g_2d[r,c] + 32) >> 6)).
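The kernel contract above can be sketched as a plain C reference. This is a minimal sketch, not the planned `tests/h264_idct8_ref.c`. The names `idct8_1d`, `clip_u8` and `h264_idct8_add_sketch` are hypothetical. It assumes 8-bit pixels, a column-major block indexed as `block[r + 8*c]`, rounding added at the end as in the "Final" formula, and arithmetic right shift of negative `int` values (implementation-defined in C, but what GCC and Clang do).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* In-place 1D pass over v[0], v[step], ..., v[7*step] (d -> e -> f -> g). */
static void idct8_1d(int *v, ptrdiff_t step)
{
    const int d0 = v[0 * step], d1 = v[1 * step], d2 = v[2 * step], d3 = v[3 * step];
    const int d4 = v[4 * step], d5 = v[5 * step], d6 = v[6 * step], d7 = v[7 * step];

    const int e0 = d0 + d4;
    const int e1 = -d3 + d5 - d7 - (d7 >> 1);
    const int e2 = d0 - d4;
    const int e3 = d1 + d7 - d3 - (d3 >> 1);
    const int e4 = (d2 >> 1) - d6;
    const int e5 = -d1 + d7 + d5 + (d5 >> 1);
    const int e6 = d2 + (d6 >> 1);
    const int e7 = d3 + d5 + d1 + (d1 >> 1);

    const int f0 = e0 + e6;
    const int f1 = e1 + (e7 >> 2);
    const int f2 = e2 + e4;
    const int f3 = e3 + (e5 >> 2);
    const int f4 = e2 - e4;
    const int f5 = (e3 >> 2) - e5;
    const int f6 = e0 - e6;
    const int f7 = e7 - (e1 >> 2);

    /* Final stage: mirrored sum/difference pairs. */
    v[0 * step] = f0 + f7;  v[7 * step] = f0 - f7;
    v[1 * step] = f2 + f5;  v[6 * step] = f2 - f5;
    v[2 * step] = f4 + f3;  v[5 * step] = f4 - f3;
    v[3 * step] = f6 + f1;  v[4 * step] = f6 - f1;
}

/* Clamp to the 8-bit pixel range (assumption: 8-bit video only). */
static uint8_t clip_u8(int x)
{
    return (uint8_t)(x < 0 ? 0 : (x > 255 ? 255 : x));
}

/* Hypothetical reference: row pass, column pass, then round, shift, add, clip. */
void h264_idct8_add_sketch(uint8_t *dst, const int16_t *block, ptrdiff_t stride)
{
    int t[64];

    assert(stride >= 8);
    for (int i = 0; i < 64; i++)
        t[i] = block[i];                 /* widen so intermediates cannot wrap */

    for (int r = 0; r < 8; r++)
        idct8_1d(t + r, 8);              /* row r: elements r, r+8, ..., r+56 */
    for (int c = 0; c < 8; c++)
        idct8_1d(t + 8 * c, 1);          /* column c: elements 8c .. 8c+7 */

    for (int r = 0; r < 8; r++)
        for (int c = 0; c < 8; c++) {
            const int res = (t[r + 8 * c] + 32) >> 6;
            dst[r * stride + c] = clip_u8(dst[r * stride + c] + res);
        }
}
```

Unlike the NEON routine it is meant to be compared against, this sketch leaves `block` untouched and works on a widened copy, which keeps it easy to call repeatedly from a random-block bit-exactness test.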
+
+## NEON reference (M3 target)
+
+FFmpeg's `ff_h264_idct8_add_neon`
+(external/ffmpeg-snapshot/libavcodec/aarch64/h264idct_neon.S
+line 267, ~60 instructions / pass × 2 + transpose + dst-add).
+Signature mirrors cycle 6 IDCT 4×4:
+
+```
+void ff_h264_idct8_add_neon(uint8_t *dst, int16_t *block, ptrdiff_t stride);
+```
+
+Block: 64 int16, column-major (per cycle 6 Phase 9 lesson).
+
+## 30fps@1080p H.264 8×8 floor
+
+1920×1080 luma using all 8×8 transforms: 240 × 135 = 32 400
+blocks/frame × 30 fps = 0.972 Mblock/s. Same as VP9 IDCT 8×8
+(cycle 1) since the block density is the same.
+
+**30fps@1080p floor: 0.972 Mblock/s.**
+
+## Predicted R₇
+
+Per the cycle 1 / cycle 6 patterns:
+- VP9 IDCT 8×8 NEON M3 = 8.171 Mblock/s (cycle 1), per-block 122 ns
+- H.264 IDCT 8×8 likely **less compute per block** than VP9 (no
+  trig multiplies, just integer ops + shifts) → maybe 80-120 ns
+  per block → 8-12 Mblock/s NEON
+- QPU 8×8 IDCT R=0.92 GREEN in cycle 1 came from the matching
+  16-lane / 8-row layout and shared-mem transpose
+- H.264 IDCT 8×8 same shape → predicted **R₇ ≈ 0.5-0.9 YELLOW/GREEN**
+
+## Acceptance for Phase 7
+
+- M1: 100.0000% bit-exact (10000+ random blocks)
+- M3: captured
+- M2: captured
+- R₇: classified
+- M4: same-kernel mixed bench measured
+
+## Cycle 7 deliverables
+
+1. `tests/h264_idct8_ref.c` — column-major C reference
+2. `tests/bench_neon_h264idct8.c` — Phase 3 bench
+3. `src/v3d_h264idct8.comp` — Phase 6 shader (likely close to
+   v3d_idct8.comp shape, but with different butterfly + integer
+   math instead of Q14 trig)
+4. `tests/bench_v3d_h264idct8.c` — Phase 6+7 bench
+5. M4 via `bench_concurrent_mixed.c` extension
+
+## Phase 4 effort estimate
+
+Higher than cycle 1's iterations because the 8×8 IT butterfly is
+more involved (3 sub-stages vs cycle 1's IDCT8 single butterfly).
+~3-4 hours through Phase 7. Phase 5 Sonnet review is again
+non-skippable per CLAUDE.md.
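As a cross-check on the throughput figures quoted in the floor and prediction sections above, the arithmetic can be re-derived in a few lines of C. This is only a sketch. The helper names `blocks_per_second` and `ns_per_block` are hypothetical, and it assumes the frame dimensions divide evenly by the block size, as 1920×1080 does for 8×8.

```c
#include <assert.h>

/* Blocks per second when every block of a width x height plane is transformed. */
static double blocks_per_second(int width, int height, int block, double fps)
{
    assert(block > 0 && width % block == 0 && height % block == 0);
    return (double)(width / block) * (double)(height / block) * fps;
}

/* Convert a throughput in Mblock/s to nanoseconds per block. */
static double ns_per_block(double mblock_per_s)
{
    assert(mblock_per_s > 0.0);
    return 1e3 / mblock_per_s;   /* 1e9 ns/s divided by (M * 1e6 blocks/s) */
}
```

With the numbers from this note, `blocks_per_second(1920, 1080, 8, 30.0)` gives 972 000 blocks/s (the 0.972 Mblock/s floor), and `ns_per_block(8.171)` gives about 122 ns, matching the cycle 1 NEON figure.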
+
+## Next step (within this phase)
+
+Move to Phase 3 (NEON baseline M3) after writing the C reference.
+
+## Future H.264 cycles (preview, post cycle 7)
+
+- Cycle 8 — H.264 chroma MC (4-tap; very lightweight; predicted
+  RED per cycle 6 pattern but smaller still)
+- Cycle 9 — H.264 luma quarter-pel MC (6-tap; analogous to cycle 3
+  VP9 MC which was RED; predicted RED)
+- Cycle 10 — H.264 in-loop deblock (analogous to cycle 2/4 VP9
+  LPF which were GREEN; predicted GREEN)
+- After cycle 10: scope re-evaluated based on cycle 7/10 results