
marfrit f92dc40f43 Cycle 6 (H.264) opened — IDCT 4x4 Phase 1+3, M3 = 175 Mblock/s

H.264 scope added 2026-05-18 per user direction. Pi 5's VideoCore
VII has no hardware H.264 decoder block (only HEVC), so a
QPU-accelerated H.264 path fills the most impactful codec gap.
Cycle 6 = first H.264 kernel (4x4 IDCT + add, smallest H.264
transform, simplest first cycle).

Phase 1: goal doc + 1080p30 floor analysis (5.85 Mblock/s
worst-case, 2.0 Mblock/s realistic since most MBs use 8x8 or
P-skip).

Phase 3: NEON M3 baseline captured. ff_h264_idct_add_neon on
hertz delivers 175 Mblock/s (5.7 ns per block) = 30x worst-case
floor margin. H.264 IDCT 4x4 is dramatically lighter than VP9
IDCT 8x8 (21x faster per block).

Phase 3 closure also caught the key Phase 9 lesson: H.264/FFmpeg
blocks are COLUMN-MAJOR (block[c*4 + r] = (row=r, col=c)). NEON
ld1 with 4 registers loads the 16 coefficients sequentially (ld4
is the de-interleaving form), so each register holds one column,
and the FFmpeg C ref indexing makes this convention explicit.
Initial C ref assumed row-major, M1 was 5% bit-exact; after fix,
M1 = 100%.

Convention encoded for all subsequent H.264 cycles (cycle 7+).

- external/ffmpeg-snapshot/libavcodec/aarch64/h264idct_neon.S
  (vendored verbatim from FFmpeg n7.1.3, 415 lines)
- external/ffmpeg-snapshot/PROVENANCE.md: updated
- tests/h264_idct4_ref.c: column-major C ref
- tests/bench_neon_h264idct4.c: M1 + M3 bench
- CMakeLists.txt: cycle 6 NEON bench wiring
- docs/k6_h264idct4_phase1.md, phase3.md

Phase 4 next: QPU shader for cycle 6. Predicted R6 = 0.01 (deep
RED — kernel too small relative to QPU dispatch overhead) but
worth building for cycle-completeness + the opportunistic-helper
hypothesis (cycle 6 may stay CPU per recipe).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-18 14:14:43 +00:00


cycle	phase	status	date_opened	date_closed	codec	kernel	parent	host
6	3	closed 2026-05-18 — M1 PASS, M3₆ = 175 Mblock/s	2026-05-18	2026-05-18	H.264	IDCT 4x4 + add	k6_h264idct4_phase1.md	hertz

Cycle 6, Phase 3 — H.264 IDCT 4×4 NEON baseline

M3₆ throughput

=== M3₆ NEON throughput ===
  blocks/batch:    4096
  batches done:    51 206
  total blocks:    209 739 776
  elapsed (kernel)=1.199 s
  throughput      = 175.0 Mblock/s
  per-block       = 5.7 ns
  H.264 1080p30 worst-case floor: 29.91× margin (5.85 Mblock/s req'd)
  H.264 1080p30 realistic floor:  87.50× margin (2.0 Mblock/s req'd)
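
The summary lines above follow directly from the batch count and the elapsed time. A minimal sketch of that arithmetic, assuming a hypothetical `bench_summary` struct and `summarize` helper (neither name comes from the bench source):

```c
#include <stdint.h>

/* Hypothetical summary record; the field names are illustrative only. */
typedef struct {
    double mblock_per_s;  /* throughput in millions of blocks per second */
    double ns_per_block;  /* average kernel time per block, nanoseconds */
    double margin;        /* throughput divided by the required floor */
} bench_summary;

/* Derive throughput, per-block time and floor margin from raw counters. */
static bench_summary summarize(uint64_t blocks_per_batch, uint64_t batches,
                               double elapsed_s, double floor_mblock_per_s)
{
    bench_summary s;
    const double total_blocks = (double)(blocks_per_batch * batches);

    s.mblock_per_s = total_blocks / elapsed_s / 1e6;
    s.ns_per_block = elapsed_s * 1e9 / total_blocks;
    s.margin       = s.mblock_per_s / floor_mblock_per_s;
    return s;
}
```

Calling `summarize(4096, 51206, 1.199, 5.85)` lands within rounding of the logged figures; small differences are expected because the elapsed time in the log is itself rounded to three decimals.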

Per-block 5.7 ns — by far the lightest cycle so far (cycle 2 LPF wd=4 was 21 ns, cycle 1 IDCT 8x8 was 122 ns). 4×4 is a genuinely small kernel and FFmpeg's NEON is extremely tight (56 instructions per block).

NEON 4-core scaling: not measured this phase. Based on cycle 2/4 patterns, expect ~3-4× scaling (bandwidth-bound territory), i.e. ~500-700 Mblock/s aggregate. That is roughly 85-120× the 5.85 Mblock/s worst-case floor.

M1 bit-exact gate

=== M1₆ bit-exact (10000 random 4x4 blocks) ===
M1₆ correctness: 10000 / 10000 blocks bit-exact (100.0000%)
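
The gate compares the reference output and the NEON output block by block and counts exact matches. A minimal sketch of such a counter, assuming both outputs have already been written to flat byte buffers (`count_bit_exact` and `bit_exact_percent` are hypothetical names, not taken from `tests/bench_neon_h264idct4.c`):

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Count blocks whose output bytes are identical in both buffers.
 * Each block occupies bytes_per_block consecutive bytes. */
static size_t count_bit_exact(const uint8_t *ref_out, const uint8_t *test_out,
                              size_t nblocks, size_t bytes_per_block)
{
    size_t exact = 0;
    for (size_t i = 0; i < nblocks; i++) {
        const size_t off = i * bytes_per_block;
        if (memcmp(ref_out + off, test_out + off, bytes_per_block) == 0)
            exact++;
    }
    return exact;
}

/* Percentage of bit-exact blocks; returns 0 for an empty run. */
static double bit_exact_percent(size_t exact, size_t nblocks)
{
    return nblocks ? 100.0 * (double)exact / (double)nblocks : 0.0;
}
```

Counting whole blocks rather than individual pixels keeps the metric strict: a single wrong pixel fails the block.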

Key Phase 9 lesson — H.264 block layout is column-major

The bench's initial C reference assumed row-major block storage (block[r*4 + c]), giving M1 = 4.98 % bit-exact (nearly every block mismatched). Swapping the row/column pass order did not help (row-first and column-first both gave the same 5 % rate); trace analysis then revealed the actual mismatch:

NEON ld1 {v0.4h, v1.4h, v2.4h, v3.4h}, [x1] loads the 16 coefficients sequentially (v0 = block[0..3], ..., v3 = block[12..15]); it is ld4, not ld1, that de-interleaves across registers.
FFmpeg stores the block column-major (block[c*4 + r] = coefficient at row r, column c), so each NEON vector v_c holds column c of the block (lane = row), whereas the row-major reference treated it as a row.
FFmpeg's C reference (libavcodec/h264idct_template.c) uses block[i + 4*0], block[i + 4*1], etc., which is column-major indexing in disguise.

Fix: read block as column-major (block[c*4 + r]) in the C reference's row-pass loop. M1 then PASS 10000/10000.
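
A minimal sketch of a column-major 4×4 IDCT + add for 8-bit pixels, written to mirror the butterfly structure of FFmpeg's C reference. The function name `h264_idct4_add_colmajor` and the helper `clip_u8` are hypothetical and are not claimed to match `tests/h264_idct4_ref.c`. It assumes arithmetic right shift of negative values, as FFmpeg itself does.

```c
#include <stdint.h>
#include <string.h>

static uint8_t clip_u8(int v)
{
    return v < 0 ? 0 : (v > 255 ? 255 : (uint8_t)v);
}

/* block[c*4 + r] holds the coefficient at row r, column c (column-major).
 * dst is an 8-bit pixel plane with the given stride in bytes. */
static void h264_idct4_add_colmajor(uint8_t *dst, int16_t *block, int stride)
{
    block[0] += 1 << 5;  /* rounding bias for the final >> 6 */

    /* Row pass: for each row r, combine its four columns (stride-4 reads). */
    for (int r = 0; r < 4; r++) {
        const int a0 = block[0*4 + r], a1 = block[1*4 + r];
        const int a2 = block[2*4 + r], a3 = block[3*4 + r];
        const int z0 = a0 + a2;
        const int z1 = a0 - a2;
        const int z2 = (a1 >> 1) - a3;
        const int z3 = a1 + (a3 >> 1);
        block[0*4 + r] = (int16_t)(z0 + z3);
        block[1*4 + r] = (int16_t)(z1 + z2);
        block[2*4 + r] = (int16_t)(z1 - z2);
        block[3*4 + r] = (int16_t)(z0 - z3);
    }

    /* Column pass: for each column c, combine its four rows, then add. */
    for (int c = 0; c < 4; c++) {
        const int b0 = block[c*4 + 0], b1 = block[c*4 + 1];
        const int b2 = block[c*4 + 2], b3 = block[c*4 + 3];
        const int z0 = b0 + b2;
        const int z1 = b0 - b2;
        const int z2 = (b1 >> 1) - b3;
        const int z3 = b1 + (b3 >> 1);
        dst[0*stride + c] = clip_u8(dst[0*stride + c] + ((z0 + z3) >> 6));
        dst[1*stride + c] = clip_u8(dst[1*stride + c] + ((z1 + z2) >> 6));
        dst[2*stride + c] = clip_u8(dst[2*stride + c] + ((z1 - z2) >> 6));
        dst[3*stride + c] = clip_u8(dst[3*stride + c] + ((z0 - z3) >> 6));
    }

    /* FFmpeg's reference clears the coefficient block after use. */
    memset(block, 0, 16 * sizeof(block[0]));
}
```

A quick way to see the layout: setting only block[1] (row 1, column 0) produces a residual that varies from top to bottom and is constant across each row. Under a row-major reading the same coefficient would produce a left-to-right variation instead.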

Lesson encoded for future H.264 cycles:

H.264 4×4 (and 8×8) blocks are column-major in FFmpeg.
This convention propagates through all the libavcodec/aarch64 H.264 NEON kernels (h264idct, h264dsp, h264qpel, h264cmc). Cycles 7+ (other H.264 kernels) should default-assume column-major.
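
For test vectors that were generated row-major, one option is to convert once at the boundary rather than adjust each kernel's indexing. A small sketch, assuming a hypothetical helper `h264_block4_transpose` that is not part of the repo:

```c
#include <stdint.h>

/* Convert a 4x4 coefficient block between row-major (src[r*4 + c]) and
 * column-major (dst[c*4 + r]) storage. The mapping is its own inverse.
 * src and dst must not alias; this is not an in-place transpose. */
static void h264_block4_transpose(int16_t *dst, const int16_t *src)
{
    for (int r = 0; r < 4; r++)
        for (int c = 0; c < 4; c++)
            dst[c*4 + r] = src[r*4 + c];
}
```

Because the transpose is an involution, the same helper converts in either direction.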

Comparison vs cycle 1 IDCT 8×8 (the closest analog)

	Cycle 1 IDCT 8×8	Cycle 6 IDCT 4×4
Codec	VP9	H.264
Block size	8×8 (64 coefs)	4×4 (16 coefs)
Transform math	Q14 trig DCT (heavy multiplies)	Integer butterfly (no multiplies, only shifts)
NEON time/block	122 ns	5.7 ns (21× faster)
Block storage	row-major	column-major
30fps@1080p floor margin	8×	30× (vs worst case)

H.264 IDCT 4×4 is dramatically lighter than VP9 IDCT 8×8 — both per-coef and per-block. This validates the "H.264 should be easier" hypothesis from [project_h264_scope_added].

Predicted R₆ band

NEON per-block 5.7 ns is so fast that the QPU must be very fast to compete. QPU dispatch overhead is ~30 µs per call (from M5), so a QPU call only breaks even if that overhead is amortized across many blocks per dispatch.

Per-block estimate for QPU on a similar tiny kernel:

4 lanes per block (per pixel), 64 invocations/WG → 16 blocks/WG
~50-100 instructions per block (much less than cycle 1 IDCT 8x8's 250)
At 8 ns/instruction (NEON-tuned guess), ~600 ns per block.
R₆ = 5.7 / 600 = 0.01 → deep RED in isolation
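
The estimate above can be reproduced with two small helpers. This is only a sketch of the arithmetic: the function names are hypothetical, and the 8 ns/instruction and ~30 µs dispatch figures are the document's own guesses, not measurements made here.

```c
/* R = NEON time per block / estimated QPU time per block. */
static double predicted_ratio(double neon_ns_per_block,
                              double qpu_instr_per_block,
                              double qpu_ns_per_instr)
{
    return neon_ns_per_block / (qpu_instr_per_block * qpu_ns_per_instr);
}

/* Number of blocks NEON can finish in the time one QPU dispatch costs. */
static double blocks_per_dispatch_overhead(double dispatch_us,
                                           double neon_ns_per_block)
{
    return dispatch_us * 1000.0 / neon_ns_per_block;
}
```

With the midpoint of 75 instructions per block, `predicted_ratio(5.7, 75.0, 8.0)` is about 0.0095, matching the 0.01 figure. `blocks_per_dispatch_overhead(30.0, 5.7)` comes out near 5,263 blocks, which gives a sense of how many blocks a single dispatch has to cover before the fixed overhead stops dominating.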

But per-WG packing of 16 blocks means dispatch overhead amortizes better. And 4×4 is bandwidth-bound on NEON (32 bytes read + 16 bytes written = 48 bytes per 5.7 ns block ≈ 8.4 GB/s, close to the LPDDR4 ceiling). So a same-kernel M4 run on QPU may come almost for free if the QPU's bandwidth does not contend on the same channel.
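
The bandwidth figure is a unit conversion: bytes per nanosecond equals GB/s (with 1 GB = 10^9 bytes). A one-function sketch, with `effective_gb_per_s` as a hypothetical name:

```c
/* Effective memory traffic in GB/s (1 GB = 1e9 bytes).
 * Bytes per nanosecond is numerically the same as GB per second. */
static double effective_gb_per_s(double bytes_per_block, double ns_per_block)
{
    return bytes_per_block / ns_per_block;
}
```

`effective_gb_per_s(48.0, 5.7)` gives roughly 8.4 GB/s. The 48-byte count is the document's estimate and counts only coefficient reads and pixel writes.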

Plan: implement the QPU path anyway, for cycle completion and to test the opportunistic-helper hypothesis. If R₆ is deep RED, the mixed-kernel deployment shape (per Issue 003) can still use QPU for VP9 cycles 1+2+4 and CPU for H.264 IDCT 4×4. That is fine; the recipe carries over.

Next: Phase 4 plan

Follow the established cycle pattern: Phase 4 plans the QPU shader, Phase 5 is the Sonnet review, Phase 6 is implementation, and Phase 7 is measurement. Predicted R₆ = 0.01 (deep RED in isolation), and the kernel is small enough that per-call buffer allocation is likely to dominate the latency.

Alternative path: defer cycle 6 Phase 4-7 (skip the QPU shader build) and instead move directly to next H.264 cycles where QPU might actually win — IDCT 8x8 (cycle 7), 6-tap MC (cycle 9), or deblock (cycle 10). H.264 IDCT 4×4 on CPU is so fast that it doesn't NEED QPU help.

Acceptance

✓ M1 bit-exact (100.00 % on 10 000 random blocks)
✓ M3 captured (175 Mblock/s)
✓ 30fps@1080p floor exceeded by 30× worst-case
✓ Block-layout convention documented for future cycles
