Cycle 8 closed: H.264 deblock R8=0.061 RED, opportunistic helper
Phase 6 deliverable: v3d_h264deblock.comp (132 inst, 4 threads,
no spills). Phase 5 REDs applied:
RED-1: explicit clamp p1'/q1' to [0,255] before uint8 write
RED-2: bench-enforced m.x >= 4*stride contract
M1: 3-way 4096/4096 bit-exact (QPU vs C ref AND vs NEON).
M2: 5.629 Medge/s isolation → R8 = 0.061 RED (predicted 0.09-0.14).
Lower than prediction; H.264 deblock has 4 early-return paths +
2 conditional writes that hurt V3D branchy execution more than
expected.
M4 same-kernel: NEON-3+QPU 12.81 Medge/s ≈ pure-NEON-4 ~12-15
(neutral).
M4 MIXED (real H.264 deployment shape): CPU=MC + QPU=h264deblock
gives CPU MC 25.11 Mblock/s + QPU h264deblock 6.23 Medge/s.
QPU contribution is essentially unchanged from isolation —
the cross-substrate contention is gentle (consistent with
Issue 003's V4 finding).
Verdict: H.264 deblock = opportunistic QPU helper. Same recipe
slot as cycle 5 CDEF. 6 Medge/s helper = 85% of single-NEON-core
deblock capacity, available when CPU is busy with other work.
Cycles 1-8 deployment recipe complete:
Primary QPU: cycles 1+2+4 (VP9 IDCT/LPF, all bandwidth-bound)
Primary CPU: cycles 3+6+7 (compute-heavy or trivially fast on NEON)
Opportunistic helper: cycles 5+8 (CDEF, H.264 deblock)
Phase 9 lessons added:
- Branchy kernels underperform V3D vs straight-line ones
- Mixed-kernel helper value scales with isolation M2, not
same-kernel M4
- R prediction needs branchiness weight, not just compute density
- src/v3d_h264deblock.comp (132 inst QPU shader)
- tests/bench_v3d_h264deblock.c (3-way M1 + M2 + R classification)
- tests/bench_concurrent_mixed.c extended with K_H264DEBLOCK
- CMakeLists.txt: v3d_h264deblock.spv + bench_v3d_h264deblock
+ h264dsp linked into bench_concurrent_mixed
- docs/k8_h264deblock_phase7.md (full closure with cycles 1-8 recipe)
Next: Phase 8 — V4L2 wrapper / deployment infra. Public API
already exposes recipe-default substrate per kernel.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
+24
-2
@@ -257,7 +257,18 @@ if (DAEDALUS_BUILD_VULKAN)
|
||||
VERBATIM
|
||||
)
|
||||
|
||||
add_custom_target(daedalus_shaders ALL DEPENDS ${NOOP_SPV} ${IDCT8_SPV} ${LPF_SPV} ${MC_SPV} ${LPF8_SPV} ${CDEF_SPV})
|
||||
set(H264DEBLOCK_SPV ${CMAKE_BINARY_DIR}/v3d_h264deblock.spv)
|
||||
add_custom_command(
|
||||
OUTPUT ${H264DEBLOCK_SPV}
|
||||
COMMAND ${GLSLANG_VALIDATOR} -V --target-env vulkan1.3
|
||||
-o ${H264DEBLOCK_SPV}
|
||||
${CMAKE_SOURCE_DIR}/src/v3d_h264deblock.comp
|
||||
DEPENDS ${CMAKE_SOURCE_DIR}/src/v3d_h264deblock.comp
|
||||
COMMENT "glslang: v3d_h264deblock.comp -> v3d_h264deblock.spv"
|
||||
VERBATIM
|
||||
)
|
||||
|
||||
add_custom_target(daedalus_shaders ALL DEPENDS ${NOOP_SPV} ${IDCT8_SPV} ${LPF_SPV} ${MC_SPV} ${LPF8_SPV} ${CDEF_SPV} ${H264DEBLOCK_SPV})
|
||||
|
||||
# v3d_runner — reusable Vulkan plumbing.
|
||||
add_library(v3d_runner STATIC src/v3d_runner.c)
|
||||
@@ -315,6 +326,16 @@ if (DAEDALUS_BUILD_VULKAN)
|
||||
add_dependencies(bench_v3d_cdef daedalus_shaders)
|
||||
target_link_libraries(bench_v3d_cdef PRIVATE v3d_runner Vulkan::Vulkan)
|
||||
target_compile_options(bench_v3d_cdef PRIVATE -O2)
|
||||
|
||||
# Cycle 8 — QPU H.264 deblock bench (3-way).
|
||||
add_executable(bench_v3d_h264deblock
|
||||
tests/bench_v3d_h264deblock.c
|
||||
tests/h264_deblock_ref.c
|
||||
${FFASM_H264DSP_SOURCES}
|
||||
)
|
||||
add_dependencies(bench_v3d_h264deblock daedalus_shaders)
|
||||
target_link_libraries(bench_v3d_h264deblock PRIVATE v3d_runner Vulkan::Vulkan)
|
||||
target_compile_options(bench_v3d_h264deblock PRIVATE -O2)
|
||||
endif()
|
||||
|
||||
# ---- Phase 8 — public C API library + smoke test ---------------------------
|
||||
@@ -396,13 +417,14 @@ if (DAEDALUS_BUILD_VULKAN)
|
||||
target_compile_options(bench_concurrent_lpf8 PRIVATE -O3 -march=armv8-a+simd)
|
||||
|
||||
# Issue 003 — mixed-kernel M4 bench (NEON-N kernel A + QPU kernel B).
|
||||
# Links all FFmpeg + dav1d NEON sources we have.
|
||||
# Links all FFmpeg + dav1d NEON sources we have (cycles 1-8).
|
||||
add_executable(bench_concurrent_mixed
|
||||
tests/bench_concurrent_mixed.c
|
||||
${FFASM_SOURCES}
|
||||
${FFASM_LPF_SOURCES}
|
||||
${FFASM_MC_SOURCES}
|
||||
${FFC_MC_SOURCES}
|
||||
${FFASM_H264DSP_SOURCES}
|
||||
${DAV1D_CDEF_ASM_SOURCES}
|
||||
${DAV1D_CDEF_C_SOURCES}
|
||||
)
|
||||
|
||||
Reference in New Issue
Block a user