marfrit 480f34f6e6 Cycle 7 (H.264 IDCT 8x8) opened — Phase 1 goal doc

Predicted R7 = 0.5-0.9 YELLOW/GREEN. Closely analogous to cycle 1
(VP9 IDCT 8x8 R=0.92 GREEN): same block size, same lane geometry,
same data shape. H.264 8x8 IT uses integer butterfly with 3
sub-stages (vs cycle 1's Q14 trig single butterfly) — more
compute per pass but simpler operations.

Phase 1 documents:
- Spec butterfly (e/f/g stages per H.264 §8.5.13)
- 30fps@1080p floor = 0.972 Mblock/s (same as cycle 1 since same
  block density)
- NEON ref = ff_h264_idct8_add_neon (already vendored in
  cycle 6's h264idct_neon.S)
- Cycle 8-10 preview: chroma MC, luma qpel MC, in-loop deblock

Phase 3 next session: write column-major C ref + bench, capture
M1 + M3. Then Phase 4 plan (likely cycle-1 v3d_idct8.comp adapted
to integer butterfly), Phase 5 review, Phase 6 implement, Phase 7
measure.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-18 14:15:37 +00:00

cycle	phase	status	date_opened	codec	kernel	parent	predicted_R
7	1	open	2026-05-18	H.264	IDCT 8x8 + add (High-profile residual)	project_h264_scope_added.md (memory)	0.5-0.9 (YELLOW/GREEN), comparable to VP9 IDCT 8x8 (cycle 1, R=0.92)

Cycle 7, Phase 1 — H.264 IDCT 8×8 + add

Second H.264 kernel: the 8×8 inverse integer transform used in High-profile H.264 (most modern H.264 encodes are High profile: broadcast TV, web streams, file media). Narrower in scope than IDCT 4×4, but considerably more compute per block.

Why IDCT 8x8 next

Closely analogous to cycle 1 (VP9 IDCT 8×8) which was R=0.92 GREEN. Best candidate for a near-immediate H.264 GREEN result.
64 coefficients per block (8×8) = same data shape as cycle 1.
Integer butterfly (no trig multiplies) but more sub-stages than 4×4. Per-block compute weight ~3-5× the 4×4.
H.264 High-profile uses IDCT 8×8 for ~40-60 % of residual blocks (encoder choice). Decoder must support it for spec compliance.

Kernel contract

Per H.264 spec §8.5.13 (8x8 inverse integer transform). 1D butterfly (g[0..7] from input d[0..7]):

e[0] = d[0] + d[4]
e[1] = -d[3] + d[5] - d[7] - (d[7] >> 1)
e[2] = d[0] - d[4]
e[3] = d[1] + d[7] - d[3] - (d[3] >> 1)
e[4] = (d[2] >> 1) - d[6]
e[5] = -d[1] + d[7] + d[5] + (d[5] >> 1)
e[6] = d[2] + (d[6] >> 1)
e[7] = d[3] + d[5] + d[1] + (d[1] >> 1)

f[0] = e[0] + e[6]
f[1] = e[1] + (e[7] >> 2)
f[2] = e[2] + e[4]
f[3] = e[3] + (e[5] >> 2)
f[4] = e[2] - e[4]
f[5] = (e[3] >> 2) - e[5]
f[6] = e[0] - e[6]
f[7] = e[7] - (e[1] >> 2)

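g[0] = f[0] + f[7]
g[1] = f[2] + f[5]
g[2] = f[4] + f[3]
g[3] = f[6] + f[1]
g[4] = f[6] - f[1]
g[5] = f[4] - f[3]
g[6] = f[2] - f[5]
g[7] = f[0] - f[7]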

The 1D transform is applied to every row first, then to every column (per H.264/FFmpeg convention; the block is stored column-major).

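Final: dst[r,c] = clip(dst[r,c] + ((g_2d[r,c] + 32) >> 6)), with clip saturating to [0, 255]. The inner parentheses matter: the rounding shift applies to the residual only, not to the sum with dst.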
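The contract above can be sketched in plain C as follows. This is a minimal sketch, not the planned tests/h264_idct8_ref.c: the names it8_1d, clip_u8 and h264_idct8_add_ref are hypothetical, the working buffer is widened to int32_t (FFmpeg keeps int16 intermediates, so results can differ for out-of-range coefficients), and the input block is left unmodified.

```c
#include <stddef.h>
#include <stdint.h>

/* One 1-D pass over 8 values spaced `step` apart, transformed in place.
 * Assumes >> on negative ints is an arithmetic shift (true on GCC/Clang). */
void it8_1d(int32_t *v, int step)
{
    const int32_t d0 = v[0 * step], d1 = v[1 * step], d2 = v[2 * step],
                  d3 = v[3 * step], d4 = v[4 * step], d5 = v[5 * step],
                  d6 = v[6 * step], d7 = v[7 * step];

    /* e stage */
    const int32_t e0 = d0 + d4;
    const int32_t e1 = -d3 + d5 - d7 - (d7 >> 1);
    const int32_t e2 = d0 - d4;
    const int32_t e3 = d1 + d7 - d3 - (d3 >> 1);
    const int32_t e4 = (d2 >> 1) - d6;
    const int32_t e5 = -d1 + d7 + d5 + (d5 >> 1);
    const int32_t e6 = d2 + (d6 >> 1);
    const int32_t e7 = d3 + d5 + d1 + (d1 >> 1);

    /* f stage */
    const int32_t f0 = e0 + e6;
    const int32_t f1 = e1 + (e7 >> 2);
    const int32_t f2 = e2 + e4;
    const int32_t f3 = e3 + (e5 >> 2);
    const int32_t f4 = e2 - e4;
    const int32_t f5 = (e3 >> 2) - e5;
    const int32_t f6 = e0 - e6;
    const int32_t f7 = e7 - (e1 >> 2);

    /* g stage */
    v[0 * step] = f0 + f7;
    v[1 * step] = f2 + f5;
    v[2 * step] = f4 + f3;
    v[3 * step] = f6 + f1;
    v[4 * step] = f6 - f1;
    v[5 * step] = f4 - f3;
    v[6 * step] = f2 - f5;
    v[7 * step] = f0 - f7;
}

uint8_t clip_u8(int32_t x)
{
    return (uint8_t)(x < 0 ? 0 : x > 255 ? 255 : x);
}

/* Hypothetical reference: column-major block, element (r, c) at block[c*8 + r]. */
void h264_idct8_add_ref(uint8_t *dst, const int16_t *block, ptrdiff_t stride)
{
    int32_t t[64];
    for (int i = 0; i < 64; i++)
        t[i] = block[i];

    /* Row pass: row r is t[r], t[8 + r], ..., so the step is 8. */
    for (int r = 0; r < 8; r++)
        it8_1d(t + r, 8);

    /* Column pass: column c is contiguous, so the step is 1. */
    for (int c = 0; c < 8; c++)
        it8_1d(t + c * 8, 1);

    /* Round, shift, add to the prediction and saturate. */
    for (int r = 0; r < 8; r++)
        for (int c = 0; c < 8; c++)
            dst[r * stride + c] =
                clip_u8(dst[r * stride + c] + ((t[c * 8 + r] + 32) >> 6));
}
```

A DC-only block is a quick sanity check: with block[0] = 64 and all other coefficients zero, both passes spread the value 64 across the block, so every destination pixel should increase by (64 + 32) >> 6 = 1.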

NEON reference (M3 target)

FFmpeg's ff_h264_idct8_add_neon (external/ffmpeg-snapshot/libavcodec/aarch64/h264idct_neon.S line 267, ~60 instructions / pass × 2 + transpose + dst-add). Signature mirrors cycle 6 IDCT 4×4:

void ff_h264_idct8_add_neon(uint8_t *dst, int16_t *block, ptrdiff_t stride);

Block: 64 int16, column-major (per cycle 6 Phase 9 lesson).

30fps@1080p H.264 8×8 floor

1920×1080 luma using all 8×8 transforms: 240 × 135 = 32 400 blocks/frame × 30 fps = 0.972 Mblock/s. Same as VP9 IDCT 8×8 (cycle 1) since the block density is the same.

30fps@1080p floor: 0.972 Mblock/s.
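The floor arithmetic can be reproduced with a small helper. This is an illustrative sketch; the function names are hypothetical, and it assumes, as the figure above does, that every 8×8 luma block is transformed and that partial blocks at the frame edge are not counted.

```c
#include <stdint.h>

/* Number of whole 8x8 luma blocks in a width x height frame. */
uint64_t blocks_per_frame(uint32_t width, uint32_t height)
{
    return (uint64_t)(width / 8) * (height / 8);
}

/* Required throughput in blocks per second at the given frame rate. */
uint64_t blocks_per_second(uint32_t width, uint32_t height, uint32_t fps)
{
    return blocks_per_frame(width, height) * fps;
}
```

For 1920×1080 at 30 fps this gives 32 400 blocks per frame and 972 000 blocks per second, i.e. the 0.972 Mblock/s floor.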

Predicted R₇

Per the cycle 1 / cycle 6 patterns:

VP9 IDCT 8×8 NEON M3 = 8.171 Mblock/s (cycle 1), per-block 122 ns
H.264 IDCT 8×8 likely needs less compute per block than VP9 (no trig multiplies, just integer adds and shifts) → maybe 80-120 ns per block → roughly 8.3-12.5 Mblock/s NEON
QPU 8×8 IDCT R=0.92 GREEN in cycle 1 came from the matching 16-lane / 8-row layout and shared-mem transpose
H.264 IDCT 8×8 same shape → predicted R₇ ≈ 0.5-0.9 YELLOW/GREEN
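The per-block times and throughputs in the list above are related by a simple reciprocal. The sketch below, with hypothetical helper names, converts between them and expresses a throughput as a multiple of the 30fps@1080p floor; it does not compute R₇ itself.

```c
/* Convert a per-block latency in nanoseconds to Mblock/s. */
double ns_to_mblock_per_s(double ns_per_block)
{
    return 1000.0 / ns_per_block;
}

/* How many times a measured throughput exceeds the real-time floor. */
double margin_over_floor(double mblock_per_s, double floor_mblock_per_s)
{
    return mblock_per_s / floor_mblock_per_s;
}
```

With these, 80 ns per block corresponds to 12.5 Mblock/s and 120 ns to about 8.3 Mblock/s; the cycle 1 NEON figure of 8.171 Mblock/s is about 8.4× the 0.972 Mblock/s floor.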

Acceptance for Phase 7

M1: 100.0000% bit-exact (10000+ random blocks)
M3: captured
M2: captured
R₇: classified
M4: same-kernel mixed bench measured
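One possible shape for the M1 check is sketched below: feed identical random blocks to a reference and a candidate and count blocks whose output differs. Everything here is an assumption for illustration: the names are hypothetical, the coefficient range of [-2048, 2047] and the 8-byte stride are arbitrary choices, and toy_dc_add / toy_noop are stand-in kernels that exist only to exercise the harness, not real IDCT implementations.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef void (*idct8_add_fn)(uint8_t *dst, int16_t *block, ptrdiff_t stride);

/* Small LCG so that runs are reproducible from a seed. */
static uint32_t lcg_next(uint32_t *state)
{
    *state = *state * 1664525u + 1013904223u;
    return *state >> 8;
}

/* Returns the number of blocks whose 8x8 output differs between ref and cand. */
size_t count_mismatched_blocks(idct8_add_fn ref, idct8_add_fn cand,
                               size_t n_blocks, uint32_t seed)
{
    size_t bad = 0;
    for (size_t n = 0; n < n_blocks; n++) {
        int16_t coef[64], blk_a[64], blk_b[64];
        uint8_t dst_a[64], dst_b[64];

        for (int i = 0; i < 64; i++) {
            coef[i] = (int16_t)((int32_t)(lcg_next(&seed) % 4096u) - 2048);
            dst_a[i] = (uint8_t)(lcg_next(&seed) & 0xffu);
        }
        /* Each kernel gets its own copy, since a kernel may clear its block. */
        memcpy(dst_b, dst_a, sizeof dst_a);
        memcpy(blk_a, coef, sizeof coef);
        memcpy(blk_b, coef, sizeof coef);

        ref(dst_a, blk_a, 8);
        cand(dst_b, blk_b, 8);
        if (memcmp(dst_a, dst_b, sizeof dst_a) != 0)
            bad++;
    }
    return bad;
}

/* Stand-in kernel: adds the rounded DC term to every pixel, with clipping. */
void toy_dc_add(uint8_t *dst, int16_t *block, ptrdiff_t stride)
{
    const int delta = (block[0] + 32) >> 6;
    for (int r = 0; r < 8; r++)
        for (int c = 0; c < 8; c++) {
            const int v = dst[r * stride + c] + delta;
            dst[r * stride + c] = (uint8_t)(v < 0 ? 0 : v > 255 ? 255 : v);
        }
}

/* Stand-in kernel that leaves dst untouched, to show a detected mismatch. */
void toy_noop(uint8_t *dst, int16_t *block, ptrdiff_t stride)
{
    (void)dst; (void)block; (void)stride;
}
```

A return value of 0 over 10000 or more blocks would correspond to the "100.0000% bit-exact" criterion; in the real bench the two function pointers would be the C reference and the kernel under test.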

Cycle 7 deliverables

tests/h264_idct8_ref.c — column-major C reference
tests/bench_neon_h264idct8.c — Phase 3 bench
src/v3d_h264idct8.comp — Phase 6 shader (likely close to v3d_idct8.comp shape, but with different butterfly + integer math instead of Q14 trig)
tests/bench_v3d_h264idct8.c — Phase 6+7 bench
M4 via bench_concurrent_mixed.c extension

Phase 4 effort estimate

Higher than cycle 1, because the 8×8 IT butterfly is more involved (3 sub-stages vs cycle 1's single IDCT8 butterfly). Estimated ~3-4 hours through Phase 7. The Phase 5 Sonnet review is again non-skippable per CLAUDE.md.

Next step (within this phase)

Move to Phase 3 (NEON baseline M3) after writing the C reference.

Future H.264 cycles (preview, post cycle 7)

Cycle 8 — H.264 chroma MC (4-tap; very lightweight; predicted RED per cycle 6 pattern but smaller still)
Cycle 9 — H.264 luma quarter-pel MC (6-tap; analogous to cycle 3 VP9 MC which was RED; predicted RED)
Cycle 10 — H.264 in-loop deblock (analogous to cycle 2/4 VP9 LPF which were GREEN; predicted GREEN)
After cycle 10: scope re-evaluated based on cycle 7/10 results
