Cycle 7 closed: H.264 IDCT 8x8 = 151 Mblock/s NEON, Phase 4 deferred
M1: 10000/10000 bit-exact first try (column-major-block lesson from cycle 6 carried over cleanly). M3: 151.2 Mblock/s per core. Per-block 6.6 ns. 155x the 1080p30 floor (0.972 Mblock/s req'd). Phase-1 prediction of R7 = 0.5-0.9 YELLOW/GREEN was WRONG. H.264 IDCT 8x8 is dramatically lighter than VP9 IDCT 8x8 (18.5x faster NEON): VP9 IDCT 8x8: 122 ns/block (Q14 trig + COSPI multiplies) H.264 IDCT 8x8: 6.6 ns/block (pure integer butterfly + shifts) Phase 4 deferred via the cycle 6 lightweight-kernel rationale: NEON per-block << QPU dispatch floor; offload doesn't help. Phase 9 lesson updated: H.264 transforms (both 4x4 and 8x8) are NEON-trivial. Skip ALL H.264 transform cycles for QPU. Target compute-heavy H.264 kernels only (deblock = cycle 8 next; MC likely RED). Cycle 7 = 2nd consecutive "predicted GREEN, measured CPU-only" result. Forces a sharper view of which kernels QPU can actually help with: deblock and possibly some VP9 cases. - tests/h264_idct8_ref.c (column-major C ref) - tests/bench_neon_h264idct8.c (M1 + M3 bench) - CMakeLists.txt: cycle 7 bench wiring - docs/k7_h264idct8_phase3_and_4.md (closure) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
@@ -112,6 +112,14 @@ add_executable(bench_neon_h264idct4
|
||||
)
|
||||
target_compile_options(bench_neon_h264idct4 PRIVATE -O3 -march=armv8-a+simd)
|
||||
|
||||
# Cycle 7 — H.264 IDCT 8x8 NEON M3 baseline bench.
|
||||
add_executable(bench_neon_h264idct8
|
||||
tests/bench_neon_h264idct8.c
|
||||
tests/h264_idct8_ref.c
|
||||
${FFASM_H264IDCT_SOURCES}
|
||||
)
|
||||
target_compile_options(bench_neon_h264idct8 PRIVATE -O3 -march=armv8-a+simd)
|
||||
|
||||
add_executable(bench_neon_idct
|
||||
tests/bench_neon_idct.c
|
||||
tests/vp9_idct8_ref.c
|
||||
|
||||
Reference in New Issue
Block a user