Cycle 8 Phase 3 closed: H.264 deblock NEON = 92 Medge/s

M1: 10000/10000 bit-exact (after orientation fix: ff_h264_v_loop_filter is "vertical filtering of horizontal edges", not "vertical edge"; 16 columns process the edge horizontally with 8 rows of vertical context).
M3: 91.947 Medge/s per core. Per-edge 10.9 ns. 11x worst-case 1080p30 floor, 30x realistic floor.
Filter triggers on 25 % of edges (random alpha/beta/tc0 covers both gating paths).

Cycle 8 Phase 9 lesson: H.264/FFmpeg "v_loop_filter" naming uses filter DIRECTION (vertical), not edge orientation. Edge is horizontal; filter operates vertically across it. Distinct from cycle 6's column-major-block lesson but a related discovery pattern. Encoded for future cycles.

R8 prediction revised: 0.09-0.14 ORANGE (down from Phase 1's 0.3-0.8 estimate). H.264 deblock is 2x faster on NEON than VP9 LPF wd=4 (cycle 2), but H.264 deblock has more per-edge branches, which hurt the QPU more.

Worth building anyway:
- ORANGE in cycle 1's "M4 may rescue" band
- Mixed-kernel deployment helper value (Issue 003) matters more than isolation R
- 25%-trigger rate gives 4x effective contribution multiplier on QPU side

- tests/h264_deblock_ref.c (column-walking C ref per row segment)
- tests/bench_neon_h264deblock.c (M1 + M3 bench)
- CMakeLists.txt: cycle 8 NEON bench wiring + h264dsp_neon.S
- docs/k8_h264deblock_phase3.md (closure)

Next: Phase 4 plan QPU shader, Phase 5 Sonnet review.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
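
The orientation lesson above can be illustrated with a minimal C sketch. It is not the code in tests/h264_deblock_ref.c; the helper name count_gated_columns is hypothetical. It assumes pix points at the first row below a horizontal edge (q0) and applies only the standard H.264 alpha/beta gate. The tc0 check and the filter arithmetic itself are left out. The loop walks x along the edge (horizontally), while each sample pair is fetched with +/- stride (vertically), which is what "v_loop_filter" refers to.

```c
#include <stdint.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper, not part of this commit: count how many of the
 * 16 columns along one horizontal edge pass the H.264 alpha/beta gate.
 * pix points at row q0 (first row below the edge); p0 and p1 are the
 * rows above it. The caller must guarantee two valid rows above pix
 * and one valid row below it. */
static int count_gated_columns(const uint8_t *pix, ptrdiff_t stride,
                               int alpha, int beta)
{
    int hits = 0;
    for (int x = 0; x < 16; x++) {        /* walk ALONG the edge: horizontal */
        int p1 = pix[x - 2 * stride];     /* reads go ACROSS the edge: vertical */
        int p0 = pix[x - 1 * stride];
        int q0 = pix[x];
        int q1 = pix[x + 1 * stride];
        if (abs(p0 - q0) < alpha &&
            abs(p1 - p0) < beta &&
            abs(q1 - q0) < beta)
            hits++;
    }
    return hits;
}
```

A column-walking reference would run the per-column filter wherever this gate passes. The sketch reads only the four rows the gate needs, fewer than the 8 rows of vertical context mentioned above.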
2026-05-18 14:39:36 +00:00
parent 5a085e7180
commit 436a5c4f74
4 changed files with 493 additions and 0 deletions
@@ -120,6 +120,21 @@ add_executable(bench_neon_h264idct8
 )
 target_compile_options(bench_neon_h264idct8 PRIVATE -O3 -march=armv8-a+simd)

+# Cycle 8 — H.264 luma vertical deblock NEON M3 baseline bench.
+set(FFASM_H264DSP_SOURCES
+    ${FFSNAP}/libavcodec/aarch64/h264dsp_neon.S
+)
+set_source_files_properties(${FFASM_H264DSP_SOURCES} PROPERTIES
+    COMPILE_OPTIONS "${FFASM_FLAGS}"
+    LANGUAGE ASM)
+
+add_executable(bench_neon_h264deblock
+    tests/bench_neon_h264deblock.c
+    tests/h264_deblock_ref.c
+    ${FFASM_H264DSP_SOURCES}
+)
+target_compile_options(bench_neon_h264deblock PRIVATE -O3 -march=armv8-a+simd)
+
 add_executable(bench_neon_idct
    tests/bench_neon_idct.c
    tests/vp9_idct8_ref.c