Cycle 6 (H.264) opened — IDCT 4x4 Phase 1+3, M3 = 175 Mblock/s
H.264 scope added 2026-05-18 per user direction. The Pi 5's VideoCore VII has no hardware H.264 decoder block (only HEVC), so a QPU-accelerated H.264 path fills the most impactful codec gap. Cycle 6 is the first H.264 kernel: 4x4 IDCT + add, the smallest H.264 transform and therefore the simplest first cycle.

Phase 1: goal doc + 1080p30 floor analysis (5.85 Mblock/s worst-case, 2.0 Mblock/s realistic, since most MBs use 8x8 or P-skip).

Phase 3: NEON M3 baseline captured. ff_h264_idct_add_neon on hertz delivers 175 Mblock/s (5.7 ns per block), a 30x margin over the worst-case floor. H.264 IDCT 4x4 is dramatically lighter than VP9 IDCT 8x8 (21x faster per block).

Phase 3 closure also caught the key Phase 9 lesson: H.264/FFmpeg blocks are COLUMN-MAJOR (block[c*4 + r] = (row=r, col=c)). The NEON kernel loads the block with a four-register ld1, so each register holds four consecutive coefficients (one column under this convention), and the FFmpeg C ref indexing makes the convention explicit. The initial C ref assumed row-major, so M1 was only 5% bit-exact; after the fix, M1 = 100%. Convention encoded for all subsequent H.264 cycles (cycle 7+).

- external/ffmpeg-snapshot/libavcodec/aarch64/h264idct_neon.S (vendored verbatim from FFmpeg n7.1.3, 415 lines)
- external/ffmpeg-snapshot/PROVENANCE.md: updated
- tests/h264_idct4_ref.c: column-major C ref
- tests/bench_neon_h264idct4.c: M1 + M3 bench
- CMakeLists.txt: cycle 6 NEON bench wiring
- docs/k6_h264idct4_phase1.md, phase3.md

Phase 4 next: QPU shader for cycle 6. Predicted R6 = 0.01 (deep RED: the kernel is too small relative to QPU dispatch overhead), but worth building for cycle-completeness and for the opportunistic-helper hypothesis (cycle 6 may stay CPU per recipe).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
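The throughput figures in the message can be cross-checked with a few lines of arithmetic. The sketch below is illustrative only: the helper names are hypothetical, and the worst-case model (every macroblock coded as 24 4x4 blocks in 4:2:0, coded height padded to 1088) is an assumption, not necessarily the derivation used in the Phase 1 doc.

```c
#include <stdio.h>

/* Nanoseconds per block for a throughput given in Mblock/s. */
double ns_per_block(double mblock_per_s)
{
    return 1e3 / mblock_per_s;
}

/* Headroom of a measured throughput over a required floor (both Mblock/s). */
double floor_margin(double measured, double floor_rate)
{
    return measured / floor_rate;
}

/*
 * ASSUMED worst-case model (not taken from the Phase 1 doc):
 * every macroblock is coded as 4x4 blocks, 4:2:0 chroma, so
 * 16 luma + 8 chroma = 24 blocks per 16x16 macroblock, with the
 * frame dimensions rounded up to whole macroblocks.
 */
double worst_case_mblock_per_s(int width, int height, int fps)
{
    int mbs = ((width + 15) / 16) * ((height + 15) / 16);
    return mbs * 24.0 * fps / 1e6;
}

/* Print the figures quoted in the commit message next to the model. */
void print_floor_report(void)
{
    printf("ns/block at 175 Mblock/s : %.2f\n", ns_per_block(175.0));
    printf("margin over 5.85 floor   : %.1fx\n", floor_margin(175.0, 5.85));
    printf("margin over 2.0 floor    : %.1fx\n", floor_margin(175.0, 2.0));
    printf("modelled 1080p30 floor   : %.2f Mblock/s\n",
           worst_case_mblock_per_s(1920, 1080, 30));
}
```

Under these assumptions the modelled 1080p30 floor comes out at about 5.88 Mblock/s, close to the 5.85 figure quoted above, and 175 / 5.85 gives the roughly 30x margin.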
@@ -68,6 +68,14 @@ set(FFASM_SOURCES
     ${FFSNAP}/libavcodec/aarch64/vp9itxfm_neon.S
 )
 
+# Cycle 6 — H.264 IDCT 4x4 + 8x8 NEON (vendored 2026-05-18).
+set(FFASM_H264IDCT_SOURCES
+    ${FFSNAP}/libavcodec/aarch64/h264idct_neon.S
+)
+set_source_files_properties(${FFASM_H264IDCT_SOURCES} PROPERTIES
+    COMPILE_OPTIONS "${FFASM_FLAGS}"
+    LANGUAGE ASM)
+
 # Cycle 2 — VP9 loop filter NEON source (vendored 2026-05-18).
 set(FFASM_LPF_SOURCES
     ${FFSNAP}/libavcodec/aarch64/vp9lpf_neon.S
@@ -96,6 +104,14 @@ set_source_files_properties(${FFASM_SOURCES} PROPERTIES
 
 # ---- NEON baseline microbenches --------------------------------------------
 
+# Cycle 6 — H.264 IDCT 4x4 NEON M3 baseline bench.
+add_executable(bench_neon_h264idct4
+    tests/bench_neon_h264idct4.c
+    tests/h264_idct4_ref.c
+    ${FFASM_H264IDCT_SOURCES}
+)
+target_compile_options(bench_neon_h264idct4 PRIVATE -O3 -march=armv8-a+simd)
+
 add_executable(bench_neon_idct
     tests/bench_neon_idct.c
     tests/vp9_idct8_ref.c
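
The contents of tests/h264_idct4_ref.c are not part of this diff. As an illustration of the column-major convention described in the commit message (block[c*4 + r] holds row r, column c), a minimal 8-bit reference might look like the sketch below. The function names are hypothetical, and the arithmetic is meant to follow the standard H.264 4x4 inverse transform; it has not been checked here against ff_h264_idct_add_neon.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Clamp an intermediate sum to the 8-bit pixel range. */
static uint8_t clip_u8(int v)
{
    return (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
}

/* 1-D H.264 inverse core transform on four values.
 * Relies on arithmetic right shift of negative ints, as is usual
 * for codec reference code on mainstream compilers. */
static void idct4_1d(const int in[4], int out[4])
{
    int z0 = in[0] + in[2];
    int z1 = in[0] - in[2];
    int z2 = (in[1] >> 1) - in[3];
    int z3 = in[1] + (in[3] >> 1);
    out[0] = z0 + z3;
    out[1] = z1 + z2;
    out[2] = z1 - z2;
    out[3] = z0 - z3;
}

/* Hypothetical reference: inverse-transform one 4x4 block, add it to
 * dst, and clear the coefficients afterwards. */
void h264_idct4_add_sketch(uint8_t *dst, int16_t *block, ptrdiff_t stride)
{
    int t[4][4]; /* t[r][c], filled from column-major storage */
    int out[4];

    for (int c = 0; c < 4; c++)
        for (int r = 0; r < 4; r++)
            t[r][c] = block[c * 4 + r];

    t[0][0] += 32; /* rounding bias for the final >> 6 */

    /* Horizontal pass: transform each row in place. */
    for (int r = 0; r < 4; r++) {
        idct4_1d(t[r], out);
        memcpy(t[r], out, sizeof(out));
    }

    /* Vertical pass: transform each column, scale, add to dst. */
    for (int c = 0; c < 4; c++) {
        int col[4] = { t[0][c], t[1][c], t[2][c], t[3][c] };
        idct4_1d(col, out);
        for (int r = 0; r < 4; r++) {
            uint8_t *p = dst + r * stride + c;
            *p = clip_u8(*p + (out[r] >> 6));
        }
    }

    memset(block, 0, 16 * sizeof(*block));
}
```

A quick way to see the convention is to set only block[1]: under column-major storage that coefficient sits at row 1, column 0, so the reconstructed residual varies down the rows and is constant along each row. A row-major reading would produce the transposed pattern, which is the kind of mismatch the 5% bit-exact M1 result pointed to.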