Cycle 6 closed (deferred Phase 4): IDCT 4x4 too small for QPU

Phase 4 QPU shader DEFERRED (not RED-by-build, but predicted-RED and not worth building):

- NEON delivers 175 Mblock/s (5.7 ns/block) on a single core
- QPU per-block floor ~250 ns (from cycle 1 scaling) → R6 = 0.022
- Mixed-kernel helper contribution would be ~1-2 Mblock/s, about 1% of NEON capacity
- 30fps@1080p worst case = 5.85 Mblock/s; NEON delivers 30x that on ONE core. No need for QPU help.

Phase 9 lesson: for any cycle with NEON per-block < ~30ns, predict deep RED and defer Phase 4 unless there's a specific structural QPU advantage. Shapes future cycle selection: prefer compute-heavy kernels (cycle 7 H.264 IDCT 8x8 next; cycle 9 luma qpel MC; cycle 10 deblock).

Cycle 6 phase tally: Phase 1 ✓, Phase 2 implicit, Phase 3 ✓ (M1 + M3), Phase 4 DEFERRED, Phase 5-7 N/A, Phase 8 trivial CPU-only (recipe = stay CPU), Phase 9 ✓.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 14:15:25 +00:00
parent f92dc40f43
commit 7288473d79
1 changed files with 97 additions and 0 deletions
@@ -0,0 +1,97 @@
+---
+cycle: 6
+phase: 4 (decision: defer)
+status: deferred 2026-05-18 — kernel too lightweight to amortize QPU dispatch
+date_opened: 2026-05-18
+date_decision: 2026-05-18
+parent: k6_h264idct4_phase3.md
+---
+
+# Cycle 6, Phase 4 — DEFERRED
+
+## The decision
+
+After M3 was captured (175 Mblock/s on a single NEON core, 5.7 ns
+per block), Phase 4 (QPU shader plan) is **deferred** because the
+kernel is too lightweight to make QPU offload worthwhile.
+
+## Reasoning
+
+V3D Vulkan dispatch overhead per call ≈ 30 µs (from cycle 1 M5
+measurement, `tests/bench_vulkan_dispatch.c`). To break even
+against NEON at 175 Mblock/s, a single dispatch would need to
+process at least:
+
+  30 µs × 175 Mblock/s = 5 250 blocks per dispatch
+
+That batch size is feasible for batch processing, but the QPU
+must also beat NEON on the per-block work itself, and:
+
+- NEON does 5.7 ns/block. To beat NEON, QPU needs < 5.7 ns/block
+  amortized, i.e. more than ~175 Mblock/s.
+- QPU per-block estimate (from cycle 1 scaling): even small kernels
+  hit 50+ instructions per block. At V3D 7.1's compute rate
+  (~1 cycle per ALU per lane at 2 threads = ~500 MHz effective for
+  scalar work), 50 inst at 16 lanes/sg × 8 sg/WG = 128
+  inst-per-block-equivalent. At 2 ns per instruction (500 MHz) that
+  is 128 × 2 ns = 256 ns per block at peak utilization, about
+  45× slower than NEON.
+- Predicted R₆ = 5.7 / 256 = **0.022 → deep RED**.
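
The estimate in the bullets above can be reproduced as a few lines of arithmetic. This is only a sketch: the names are illustrative, and the 500 MHz effective rate and 128 instruction-equivalents per block are the assumptions stated in the bullets, not new measurements.

```python
# Back-of-envelope reproduction of the predicted R6 for the 4x4 IDCT.
# All inputs are figures quoted in this document; names are illustrative.

NEON_NS_PER_BLOCK = 5.7          # Phase 3 M3 result (175 Mblock/s, one core)
QPU_EFFECTIVE_HZ = 500e6         # assumed effective scalar rate on V3D 7.1
QPU_INST_EQUIV_PER_BLOCK = 128   # estimated instruction-equivalents per block


def qpu_ns_per_block(inst_equiv, effective_hz):
    """Time per block if each instruction-equivalent takes one effective cycle."""
    return inst_equiv / effective_hz * 1e9


def predicted_ratio(neon_ns, qpu_ns):
    """R = NEON time / QPU time; values below 1 mean the QPU is slower."""
    return neon_ns / qpu_ns


qpu_ns = qpu_ns_per_block(QPU_INST_EQUIV_PER_BLOCK, QPU_EFFECTIVE_HZ)
r6 = predicted_ratio(NEON_NS_PER_BLOCK, qpu_ns)
print(f"QPU floor: {qpu_ns:.0f} ns/block, R6 = {r6:.3f}, slowdown = {1 / r6:.0f}x")
```

Changing `QPU_INST_EQUIV_PER_BLOCK` shows how far the instruction count would have to drop before R approaches 1: at 500 MHz it would need to be under 3 instruction-equivalents per block.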
+
+Even if mixed-kernel M4 (Issue 003) is more favorable, the
+contribution would be:
+- Best-case QPU CDEF helper was 0.42 Mblock/s (cycle 5)
+- IDCT 4×4 QPU helper likely similar scale: ~1-2 Mblock/s
+- vs NEON's 175 Mblock/s headroom on a single core
+- Net: QPU helper adds only about 1 % (0.6-1.1 %) to NEON's
+  capacity for this kernel
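
As a quick check on the helper estimate, the share of NEON capacity works out to roughly 0.6 % to 1.1 % for the 1-2 Mblock/s range. The sketch below uses only the figures quoted in the list; the helper throughput is itself a guess scaled from cycle 5, and the names are illustrative.

```python
# Share of NEON capacity that a QPU helper would add for the 4x4 IDCT.
# Figures are those quoted in this document; the helper range is an estimate.

NEON_MBLOCK_PER_S = 175.0
HELPER_MBLOCK_PER_S_RANGE = (1.0, 2.0)   # assumed QPU helper throughput


def helper_share_percent(helper_mblock_per_s, neon_mblock_per_s):
    """Helper throughput as a percentage of NEON single-core throughput."""
    return helper_mblock_per_s / neon_mblock_per_s * 100.0


shares = [helper_share_percent(h, NEON_MBLOCK_PER_S)
          for h in HELPER_MBLOCK_PER_S_RANGE]
print(f"helper adds {shares[0]:.2f}% to {shares[1]:.2f}% of NEON capacity")
```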
+
+## Recipe verdict for cycle 6
+
+**CPU NEON, no QPU dispatch path needed in the V4L2 wrapper.**
+
+H.264 4×4 IDCT is so lightweight on NEON that a single CPU core
+delivers ~30× the 1080p30 worst-case requirement (5.85 Mblock/s).
+There is no realistic benefit from QPU offload.
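
The headroom figure follows from the 5.85 Mblock/s worst case quoted in the commit message. The sketch below also cross-checks that requirement under the assumption of 4:2:0 video with every 4×4 block coded. That derivation is an assumption made here, not something the source states, and it lands near 5.85 rather than exactly on it.

```python
# Headroom of single-core NEON over the 1080p30 worst-case block rate.
# REQUIRED_MBLOCK_PER_S is the figure from the commit message; the
# per-frame derivation below is an assumed cross-check (4:2:0, all blocks coded).

NEON_MBLOCK_PER_S = 175.0
REQUIRED_MBLOCK_PER_S = 5.85


def blocks_per_second(width, height, fps):
    """4x4 blocks per second for 4:2:0 video if every block carries residual."""
    luma = (width // 4) * (height // 4)
    chroma = 2 * (width // 8) * (height // 8)   # Cb + Cr at half resolution
    return (luma + chroma) * fps


derived_mblock_per_s = blocks_per_second(1920, 1080, 30) / 1e6
headroom = NEON_MBLOCK_PER_S / REQUIRED_MBLOCK_PER_S
print(f"derived requirement ~{derived_mblock_per_s:.2f} Mblock/s, "
      f"headroom ~{headroom:.0f}x")
```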
+
+## What's left open
+
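+- Issue 004 (if ever filed): wide-batch QPU IDCT 4×4. Process
+  256 or 1024 blocks per dispatch to amortize call overhead, and
+  see if amortized throughput beats NEON. Note that the ~30 µs
+  dispatch overhead alone costs ~29 ns/block at 1024 blocks, so
+  batches above the ~5 250-block break-even would be needed.
+  Likely still RED but potentially YELLOW if V3D's scalar ALU can
+  keep up with the tiny butterfly. Low priority; not blocking.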
+- Future re-evaluation: if Phase 8 V4L2 deployment finds NEON
+  fully saturated by other H.264 kernels (entropy + MC + deblock),
+  IDCT 4×4 QPU offload becomes more attractive as a CPU-relief
+  measure even at neutral throughput.
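
For the wide-batch idea, the amortized dispatch overhead per block can be tabulated against NEON's per-block cost. This is a sketch using the ~30 µs and 175 Mblock/s figures from the Reasoning section; it counts dispatch overhead only and ignores QPU compute time, so it is a lower bound on the QPU per-block cost. Names are illustrative.

```python
# Amortized Vulkan dispatch overhead per block for a few batch sizes.
# Inputs are figures quoted in this document; QPU compute time is ignored.

DISPATCH_OVERHEAD_US = 30      # per-call overhead from the cycle 1 M5 measurement
NEON_MBLOCK_PER_S = 175        # Phase 3 M3 throughput on one core
NEON_NS_PER_BLOCK = 1e3 / NEON_MBLOCK_PER_S   # ~5.7 ns


def break_even_blocks(overhead_us, neon_mblock_per_s):
    """Blocks NEON finishes during one dispatch overhead (us * Mblock/s = blocks)."""
    return overhead_us * neon_mblock_per_s


def overhead_ns_per_block(batch_blocks, overhead_us=DISPATCH_OVERHEAD_US):
    """Dispatch overhead spread evenly over one batch, in ns per block."""
    return overhead_us * 1e3 / batch_blocks


break_even = break_even_blocks(DISPATCH_OVERHEAD_US, NEON_MBLOCK_PER_S)
for batch in (256, 1024, break_even):
    ns = overhead_ns_per_block(batch)
    verdict = "above" if ns > NEON_NS_PER_BLOCK else "at or below"
    print(f"{batch:>5} blocks: {ns:6.1f} ns/block overhead "
          f"({verdict} NEON's per-block cost)")
```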
+
+## Phase 9 lesson
+
+**Predicted R for very lightweight kernels (NEON per-block < ~30 ns)
+is likely deep RED regardless of how well the kernel maps to V3D
+compute, because the per-block QPU floor (~250 ns) is dominated
+by overheads that NEON avoids by running on the same CPU that
+already holds the data.**
+
+Generalisation: for daedalus-fourier going forward, any new kernel
+with NEON per-block < 30 ns can be predicted RED and Phase 4
+deferred unless there's a specific structural reason QPU might be
+faster (e.g., parallel ops that NEON can't pack).
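
The triage rule can be written down as a small predicate. This is a hypothetical helper, not part of daedalus-fourier: the function names and the boolean flag are made up for illustration, while the 30 ns threshold and ~250 ns floor are the figures from this section.

```python
# Hypothetical triage helper encoding the Phase 9 lesson.
# Thresholds come from this document; function names are illustrative.

QPU_FLOOR_NS = 250.0               # approximate per-block QPU floor
LIGHTWEIGHT_THRESHOLD_NS = 30.0    # NEON per-block time below which to defer


def predicted_r(neon_ns_per_block, qpu_floor_ns=QPU_FLOOR_NS):
    """Upper bound on R (NEON time / QPU time) if the QPU runs at its floor."""
    return neon_ns_per_block / qpu_floor_ns


def defer_phase4(neon_ns_per_block, structural_qpu_advantage=False):
    """True when the kernel is too light for QPU offload to be worth building."""
    is_lightweight = neon_ns_per_block < LIGHTWEIGHT_THRESHOLD_NS
    return is_lightweight and not structural_qpu_advantage


# Cycle 6 (H.264 4x4 IDCT): 5.7 ns/block on NEON
print(defer_phase4(5.7), round(predicted_r(5.7), 3))
```

Under this rule a kernel sitting exactly at the 30 ns threshold would still have a predicted R of only about 0.12 against the ~250 ns floor.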
+
+This shapes future cycle selection: prefer COMPUTE-HEAVY kernels
+where QPU has a chance to add value. For H.264, that points
+toward IDCT 8×8 (cycle 7), 6-tap MC (cycle 9), or in-loop deblock
+(cycle 10).
+
+## Cycle 6 closure
+
+- Phase 1 ✓ goal doc
+- Phase 2 implicit (vendored kernel)
+- Phase 3 ✓ M3 = 175 Mblock/s, M1 PASS
+- Phase 4 DEFERRED (this doc)
+- Phases 5-7 N/A
+- Phase 8 (deployment): CPU path via existing `daedalus_dispatch_*`
+  in include/daedalus.h. (Wiring for cycle 6 = trivial CPU-only
+  shim; deferred until V4L2 wrapper actually exists.)
+- Phase 9 lesson encoded above
+
+**Cycle 6 status: closed. Move on to cycle 7.**