marfrit f92dc40f43 Cycle 6 (H.264) opened — IDCT 4x4 Phase 1+3, M3 = 175 Mblock/s

H.264 scope added 2026-05-18 per user direction. Pi 5's VideoCore
VII has no hardware H.264 decoder block (only HEVC), so a
QPU-accelerated H.264 path fills the most impactful codec gap.
Cycle 6 = first H.264 kernel (4x4 IDCT + add, smallest H.264
transform, simplest first cycle).

Phase 1: goal doc + 1080p30 floor analysis (5.85 Mblock/s
worst-case, 2.0 Mblock/s realistic since most MBs use 8x8 or
P-skip).

Phase 3: NEON M3 baseline captured. ff_h264_idct_add_neon on
hertz delivers 175 Mblock/s (5.7 ns per block) = 30x worst-case
floor margin. H.264 IDCT 4x4 is dramatically lighter than VP9
IDCT 8x8 (21x faster per block).

Phase 3 closure also caught the key Phase 9 lesson: H.264/FFmpeg
blocks are COLUMN-MAJOR (block[c*4 + r] = (row=r, col=c)). NEON
ld1 with 4 registers interleaves loading, and the FFmpeg C ref
indexing makes this convention explicit. Initial C ref assumed
row-major, M1 was 5% bit-exact; after fix, M1 = 100%.

Convention encoded for all subsequent H.264 cycles (cycle 7+).

- external/ffmpeg-snapshot/libavcodec/aarch64/h264idct_neon.S
  (vendored verbatim from FFmpeg n7.1.3, 415 lines)
- external/ffmpeg-snapshot/PROVENANCE.md: updated
- tests/h264_idct4_ref.c: column-major C ref
- tests/bench_neon_h264idct4.c: M1 + M3 bench
- CMakeLists.txt: cycle 6 NEON bench wiring
- docs/k6_h264idct4_phase1.md, phase3.md

Phase 4 next: QPU shader for cycle 6. Predicted R6 = 0.01 (deep
RED — kernel too small relative to QPU dispatch overhead) but
worth building for cycle-completeness + the opportunistic-helper
hypothesis (cycle 6 may stay CPU per recipe).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-18 14:14:43 +00:00

cycle	phase	status	date_opened	codec	kernel	parent
6	1	open	2026-05-18	H.264	IDCT 4x4 + add (intra-block residual)	project_h264_scope_added.md (memory)

Cycle 6, Phase 1 — H.264 IDCT 4×4 + add

First H.264 kernel. Per project_h264_scope_added, the user added H.264 to the daedalus-fourier scope on 2026-05-18 because the Pi 5 has no hardware H.264 decoder, even though H.264 is the most common web codec.

Why IDCT 4×4 first

Smallest H.264 transform. 16 coefficients per block, 4×4 output pixels. Simpler than VP9 IDCT 8×8 (cycle 1, 64 coefs).
Most-used. H.264 macroblocks default to 4×4 intra prediction + residual; 8×8 is High-profile only. 4×4 hits most real-world H.264 streams.
Predicted GREEN. Per the cycle 1-5 bandwidth-bound vs compute-bound classification: 4×4 IDCT is bandwidth-bound (16 reads, 16 writes, ~20 ALU ops/output). Should map well to V3D 7.1 compute.
Clean reference. FFmpeg's ff_h264_idct_add_neon is standalone (no eob parameter, no complex DC dispatch). Single call computes 1 block of IDCT + add.

Kernel contract

Per H.264 spec §8.5.12, the inverse transform is an integer-arithmetic transform (no rounding-by-cosine like VP9's Q14 trig math). Each 4×4 block:

Inverse row transform (4 row passes, each one 1D IDCT-like integer butterfly).
Inverse column transform (4 column passes, same butterfly).
Round and add: dst[r,c] = clamp(dst[r,c] + ((idct[r,c] + 32) >> 6), 0, 255).

Spec coefficients (Hadamard-like with 1/2 scaling):

  [ 1    1     1    1/2 ]
  [ 1   1/2   -1    -1  ]
  [ 1  -1/2   -1     1  ]
  [ 1   -1     1   -1/2 ]

Integer form: each 1/2 factor is applied as an arithmetic right shift by 1 inside the butterfly, and the overall scaling is removed by the (x + 32) >> 6 in the round step.
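The three steps and the coefficient matrix above can be sketched in C as follows. This is a minimal illustration, not the project's tests/h264_idct4_ref.c: the names clamp_u8, idct4_1d and sketch_h264_idct4_add are hypothetical, the coefficients are taken as a plain coef[row][col] array (the flat memory layout FFmpeg expects is a separate concern), and >> on a negative int is assumed to be an arithmetic shift.

```c
#include <stddef.h>
#include <stdint.h>

/* Clamp an int to the 8-bit pixel range [0, 255]. */
static uint8_t clamp_u8(int v)
{
    return (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
}

/* One 1D inverse butterfly. The 1/2 entries of the matrix show up
 * as >> 1 on the odd-indexed inputs (assumes arithmetic shift). */
static void idct4_1d(const int in[4], int out[4])
{
    int e = in[0] + in[2];
    int f = in[0] - in[2];
    int g = (in[1] >> 1) - in[3];
    int h = in[1] + (in[3] >> 1);
    out[0] = e + h;
    out[1] = f + g;
    out[2] = f - g;
    out[3] = e - h;
}

/* Hypothetical sketch: coef is indexed [row][col]; dst is a 4x4
 * pixel block with `stride` bytes between rows. */
static void sketch_h264_idct4_add(uint8_t *dst, ptrdiff_t stride,
                                  int16_t coef[4][4])
{
    int tmp[4][4], in[4], out[4];

    /* Step 1: inverse row transform. */
    for (int r = 0; r < 4; r++) {
        for (int c = 0; c < 4; c++) in[c] = coef[r][c];
        idct4_1d(in, out);
        for (int c = 0; c < 4; c++) tmp[r][c] = out[c];
    }
    /* Step 2: inverse column transform; step 3: round, add, clamp. */
    for (int c = 0; c < 4; c++) {
        for (int r = 0; r < 4; r++) in[r] = tmp[r][c];
        idct4_1d(in, out);
        for (int r = 0; r < 4; r++) {
            uint8_t *p = dst + r * stride + c;
            *p = clamp_u8(*p + ((out[r] + 32) >> 6));
        }
    }
}
```

A DC-only block with coef[0][0] = 64 adds 1 to every pixel, since (64 + 32) >> 6 = 1, which makes a convenient smoke test before comparing against the NEON routine.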

NEON reference (M3 target)

FFmpeg's ff_h264_idct_add_neon (external/ffmpeg-snapshot/libavcodec/aarch64/h264idct_neon.S line 25, 56 instructions of NEON asm). Signature:

void ff_h264_idct_add_neon(uint8_t *dst, int16_t *block, ptrdiff_t stride);

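dst: 4×4 pixel block in an 8-bit luma surface, with stride bytes between rows.
block: 16 int16 coefficients, column-major (block[c*4 + r] holds row r, column c). The Phase 3 bit-exactness check established this; an initial row-major assumption gave only 5% bit-exact output.
Side effect: block is destructively cleared to zero after the transform (per H.264 conformance).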
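The commit message for this cycle records that FFmpeg's H.264 blocks are column-major, with block[c*4 + r] holding (row r, col c). A small sketch of that indexing, and of repacking a row-major block before handing it to the NEON routine, might look like this. Both helper names are hypothetical and not part of FFmpeg or of this repository.

```c
#include <stdint.h>

/* Hypothetical helper: flat index of (row r, col c) in a
 * column-major 4x4 coefficient block. */
static inline int h264_blk_idx(int r, int c)
{
    return c * 4 + r;
}

/* Hypothetical helper: repack a row-major 4x4 block (rm[r*4 + c])
 * into the column-major order described in the commit message. */
static void row_major_to_h264_block(const int16_t rm[16], int16_t blk[16])
{
    for (int r = 0; r < 4; r++)
        for (int c = 0; c < 4; c++)
            blk[h264_blk_idx(r, c)] = rm[r * 4 + c];
}
```

Keeping the index arithmetic in one helper is a design choice for the sketch: a C reference and a bench that share it cannot silently disagree on layout.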

30fps@1080p H.264 floor

H.264 1080p uses 16×16 macroblocks, each with up to 16 luma 4×4 blocks. Luma: (1920/16) × (1080/16) = 120 × 67.5 = 8100 MB/frame, and 8100 × 16 blocks/MB = 129 600 luma 4×4 blocks/frame. Chroma (4:2:0): 4 + 4 = 8 chroma 4×4 blocks per MB, so 8 × 8100 = 64 800 chroma blocks/frame. Total: 194 400, or ~195k 4×4 blocks/frame (worst case; many real MBs use 8×8 or skip).

At 30fps: ~5.85 Mblock/s required for full-frame 4×4 worst case. A more realistic average (many MBs use 8×8, P-skip, etc.) is ~2 Mblock/s.

30fps@1080p H.264 4×4 floor (realistic): 2 Mblock/s.
30fps@1080p H.264 4×4 floor (worst case): 5.85 Mblock/s.
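The worst-case arithmetic above can be reproduced with a few lines of C. This is only a sketch: the struct and function names are hypothetical, and it assumes 16 luma plus 8 chroma 4×4 blocks per macroblock, as in the text. The exact result is 194 400 blocks/frame and 5.832 Mblock/s; the 5.85 figure in the text comes from rounding the block count up to ~195k first.

```c
/* Hypothetical container for the worst-case floor estimate. */
typedef struct {
    long   mb_per_frame;      /* 16x16 macroblocks per frame */
    long   blocks_per_frame;  /* 4x4 blocks per frame, luma + chroma */
    double mblock_per_s;      /* required throughput in Mblock/s */
} floor_estimate;

/* Worst case: every macroblock carries 16 luma + 8 chroma 4x4 blocks.
 * Uses width*height/256 so 1080 rows give 67.5 MB rows, as in the text. */
static floor_estimate h264_idct4_floor(int width, int height, int fps)
{
    floor_estimate e;
    e.mb_per_frame     = (long)width * height / 256;
    e.blocks_per_frame = e.mb_per_frame * (16 + 8);
    e.mblock_per_s     = (double)e.blocks_per_frame * fps / 1e6;
    return e;
}
```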

R-band decision rules (carried from phase1.md)

R ≥ 1.0 → GREEN (QPU faster than NEON-1 in isolation).
0.5 ≤ R < 1.0 → YELLOW (M4 decides).
0.1 ≤ R < 0.5 → ORANGE (M4 may rescue).
R < 0.1 → RED (structural mismatch).

Floor margin: ratio of M2 (or M3 if CPU-only) over the 5.85 Mblock/s worst-case 30fps floor.
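The band rules and the floor margin are simple enough to encode directly. The sketch below uses hypothetical names (r_band, classify_r, floor_margin) and takes R as an already-computed number, since this document does not spell out its formula.

```c
/* Hypothetical encoding of the R bands listed above. */
typedef enum { BAND_RED, BAND_ORANGE, BAND_YELLOW, BAND_GREEN } r_band;

static r_band classify_r(double r)
{
    if (r >= 1.0) return BAND_GREEN;   /* QPU faster than NEON-1 */
    if (r >= 0.5) return BAND_YELLOW;  /* M4 decides */
    if (r >= 0.1) return BAND_ORANGE;  /* M4 may rescue */
    return BAND_RED;                   /* structural mismatch */
}

/* Floor margin: measured throughput over the required floor,
 * both in Mblock/s. */
static double floor_margin(double measured_mblock_s, double floor_mblock_s)
{
    return measured_mblock_s / floor_mblock_s;
}
```

With the figures from the commit message, floor_margin(175.0, 5.85) is roughly 29.9, matching the reported 30x worst-case margin.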

Acceptance for Phase 7

M1: 100.0000% bit-exact (QPU output vs C ref, 10000+ random blocks). Same standard as cycles 1-5.
M2: captured, classified per R band.
M4: same-kernel mixed-bench measured (with Issue 003 caveats — this is the worst-case framing).
30fps@1080p H.264 4×4 floor margin reported.
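One way the M1 percentage could be computed is sketched below. It assumes M1 counts whole 4×4 output blocks (16 bytes each, packed back to back) that match byte for byte; the function name and the packed layout are illustrative assumptions, not the project's actual bench code.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical M1 helper: percentage of 16-byte output blocks in `a`
 * that are byte-identical to the corresponding block in `b`. */
static double bit_exact_percent(const uint8_t *a, const uint8_t *b,
                                size_t n_blocks)
{
    size_t match = 0;
    for (size_t i = 0; i < n_blocks; i++)
        if (memcmp(a + 16 * i, b + 16 * i, 16) == 0)
            match++;
    return n_blocks ? 100.0 * (double)match / (double)n_blocks : 0.0;
}
```

Under this definition a single differing byte fails its whole block, so a layout mistake such as a transposed coefficient block shows up as a very low score rather than a near miss.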

Cycle 6 deliverables

external/ffmpeg-snapshot/libavcodec/aarch64/h264idct_neon.S (vendored 2026-05-18, this phase).
tests/h264_idct4_ref.c — standalone C reference (LGPL-2.1+ transcribed from spec).
tests/bench_neon_h264idct4.c — Phase 3 M3 bench.
src/v3d_h264idct4.comp — Phase 6 QPU shader.
tests/bench_v3d_h264idct4.c — Phase 6+7 M1+M2 bench (3-way vs NEON + C ref).
M4: extend bench_concurrent_mixed.c with K_H264_IDCT4.
Phase 4-7 docs.

Next step (within this phase)

Move to Phase 3 (NEON baseline M3) after writing the C reference. Phase 2 (libavcodec inventory) is implicit, since the kernel is already known from the vendored FFmpeg snapshot.
