Cycle 2 (LPF) closure: M1''=100%, R''=0.41, M4''=+6.9%, PASS

Phase 4 plan + Phase 5 second-model review (PASS-WITH-REVISIONS: 2 YELLOW
contract gaps applied) + Phase 6 v1 implementation + Phase 7 verification
including M4'' concurrent gate.

Phase 5'' review delivered cleanly: no RED bugs (cycle 1 lessons applied
successfully). 2 YELLOW findings baked into phase4 §4:
- stride >= 4 contract added alongside m.x >= 4 (finding 2)
- assert(...) in bench made a MUST not a suggestion (finding 4)
- V3D divergence-cost note: don't restructure to always-execute,
  masked lanes consume clock anyway (finding 3, informational)

Phase 6 v1 first-light hit M1'' 100.0000% bit-exact on first run
(65536/65536 edges); the cycle-1 v4 patterns (WG=256, 2-per-sg,
uint8_t SSBO, oob early-return discipline) baked in from start worked
as expected.

Performance:
  M2'' = 19.645 Medge/s  (50.9 ns/edge)
  M3'' = 48.285 Medge/s  (NEON baseline from phase3)
  R''  = 0.41            (ORANGE band - doesn't auto-close per
                          cycle-1 calibration adjustment)

shaderdb: 160 inst, **4 threads**, 0 spills, 21 max-temps. Shader is
already at the compiler ceiling. No v2/v3/v4 iteration loop like cycle 1
because there's nothing more to extract from the compiled shape. The 30x
gap between theoretical instruction throughput and measured wall-clock
is divergence-tax + memory latency, not compile quality.

M4'' concurrent matrix on hertz (8s windows):
  NEON-1 LPF            41.131 Medge/s
  NEON-4 LPF            33.726 Medge/s  <- realistic CPU ceiling
                                           (per-core 7-9; same
                                           bandwidth-saturation as
                                           cycle-1 F1)
  QPU only              14.299 Medge/s
  MIXED NEON-3 + QPU    36.049 Medge/s  <- +6.9% over NEON-4
  MIXED NEON-4 + QPU    31.892 Medge/s  <- -5.4% oversubscribed

The "freed-core" pattern generalizes from IDCT to LPF: NEON-3+QPU beats
pure NEON-4 by ~7% in both cycles. Cycle-2 NEW finding: **oversubscribed
mode hurts for lighter kernels** (LPF -5.4% vs cycle-1 IDCT +9.4%).
Recommendation for higgs deployment hardens to "always N-1 NEON cores +
QPU, never N + QPU".

Phase 9 lessons (in phase7 §"Phase 9 lessons"):
1. Cycle-1 v4-pattern is the v1 starting point (saves 3 iterations)
2. Phase 5 review pays off every cycle
3. R isolation misleading on bandwidth-saturated hardware
4. Oversubscription tax depends on kernel weight
5. shaderdb 4-threads/0-spills = compute not the bottleneck

New artifacts:
- src/v3d_lpf_h_4_8.comp: GLSL kernel
- tests/bench_v3d_lpf.c: M1'' + M2'' harness with contract asserts +
  fm/hev pass-rate instrumentation
- tests/bench_concurrent_lpf.c: M4'' pthread bench (mirrors
  bench_concurrent.c)
- docs/k2_deblock_phase{4,5,7}.md: plan + review + verification

Project verdict: continue. Cycle 3 candidates: MC interpolation
(multiply-heavy, stress V3D SMUL24), CDEF (AV1-only, different
neighborhood shape), or wd=8/wd=16 LPF variants. User to direct.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 12:39:26 +00:00
parent be7ff5587c
commit 36eca40ff2
7 changed files with 1436 additions and 1 deletions
@@ -113,7 +113,19 @@ if (DAEDALUS_BUILD_VULKAN)
        COMMENT "glslang: v3d_idct8.comp -> v3d_idct8.spv"
        VERBATIM
    )
-    add_custom_target(daedalus_shaders ALL DEPENDS ${NOOP_SPV} ${IDCT8_SPV})
+
+    set(LPF_SPV ${CMAKE_BINARY_DIR}/v3d_lpf_h_4_8.spv)
+    add_custom_command(
+        OUTPUT ${LPF_SPV}
+        COMMAND ${GLSLANG_VALIDATOR} -V --target-env vulkan1.3
+                -o ${LPF_SPV}
+                ${CMAKE_SOURCE_DIR}/src/v3d_lpf_h_4_8.comp
+        DEPENDS ${CMAKE_SOURCE_DIR}/src/v3d_lpf_h_4_8.comp
+        COMMENT "glslang: v3d_lpf_h_4_8.comp -> v3d_lpf_h_4_8.spv"
+        VERBATIM
+    )
+
+    add_custom_target(daedalus_shaders ALL DEPENDS ${NOOP_SPV} ${IDCT8_SPV} ${LPF_SPV})

    # v3d_runner — reusable Vulkan plumbing.
    add_library(v3d_runner STATIC src/v3d_runner.c)
@@ -134,6 +146,15 @@ if (DAEDALUS_BUILD_VULKAN)
    target_link_libraries(bench_v3d_idct PRIVATE v3d_runner Vulkan::Vulkan)
    target_compile_options(bench_v3d_idct PRIVATE -O2)

+    # Cycle 2 — QPU LPF bench.
+    add_executable(bench_v3d_lpf
+        tests/bench_v3d_lpf.c
+        tests/vp9_lpf_ref.c
+    )
+    add_dependencies(bench_v3d_lpf daedalus_shaders)
+    target_link_libraries(bench_v3d_lpf PRIVATE v3d_runner Vulkan::Vulkan)
+    target_compile_options(bench_v3d_lpf PRIVATE -O2)
+
    # M4 — concurrent CPU(NEON) + QPU bench. Links the FFmpeg NEON
    # snapshot so we can run real NEON kernels on pinned CPU cores
    # while the QPU runs its dispatch loop concurrently.
@@ -144,6 +165,15 @@ if (DAEDALUS_BUILD_VULKAN)
    add_dependencies(bench_concurrent daedalus_shaders)
    target_link_libraries(bench_concurrent PRIVATE v3d_runner Vulkan::Vulkan pthread)
    target_compile_options(bench_concurrent PRIVATE -O3 -march=armv8-a+simd)
+
+    # Cycle 2 M4'' — concurrent LPF.
+    add_executable(bench_concurrent_lpf
+        tests/bench_concurrent_lpf.c
+        ${FFASM_LPF_SOURCES}
+    )
+    add_dependencies(bench_concurrent_lpf daedalus_shaders)
+    target_link_libraries(bench_concurrent_lpf PRIVATE v3d_runner Vulkan::Vulkan pthread)
+    target_compile_options(bench_concurrent_lpf PRIVATE -O3 -march=armv8-a+simd)
 endif()

 # ---- Summary ----------------------------------------------------------------