Cycle 6 closed (deferred Phase 4): IDCT 4x4 too small for QPU

Phase 4 QPU shader DEFERRED (not RED-by-build, but predicted-RED and not worth building):
- NEON delivers 175 Mblock/s (5.7 ns/block) on a single core
- QPU per-block floor ~250 ns (from cycle 1 scaling) → R6 = 0.022
- Mixed-kernel helper contribution would be ~1-2 Mblock/s, roughly 1% of NEON capacity
- 30fps@1080p worst case = 5.85 Mblock/s; NEON delivers 30x that on ONE core. No need for QPU help.

Phase 9 lesson: for any cycle with NEON per-block < ~30ns, predict deep RED and defer Phase 4 unless there's a specific structural QPU advantage. Shapes future cycle selection: prefer compute-heavy kernels (cycle 7 H.264 IDCT 8x8 next; cycle 9 luma qpel MC; cycle 10 deblock).

Cycle 6 phase tally: Phase 1 ✓, Phase 2 implicit, Phase 3 ✓ (M1 + M3), Phase 4 DEFERRED, Phase 5-7 N/A, Phase 8 trivial CPU-only (recipe = stay CPU), Phase 9 ✓.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -0,0 +1,97 @@
---
cycle: 6
phase: 4 (decision: defer)
status: deferred 2026-05-18 (kernel too lightweight to amortize QPU dispatch)
date_opened: 2026-05-18
date_decision: 2026-05-18
parent: k6_h264idct4_phase3.md
---

# Cycle 6, Phase 4 — DEFERRED

## The decision

After M3 was captured (175 Mblock/s on a single NEON core, 5.7 ns per
block), Phase 4 (QPU shader plan) is **deferred** because the
kernel is too lightweight to make QPU offload worthwhile.

## Reasoning

V3D Vulkan dispatch overhead per call ≈ 30 µs (from cycle 1 M5
measurement, `tests/bench_vulkan_dispatch.c`). To break even
against NEON at 175 Mblock/s, a single dispatch would need to
process at least:

    30 µs × 175 Mblock/s = 5 250 blocks per dispatch
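
The break-even figure can be reproduced with a few lines of Python. This is only a sketch of the arithmetic: the constant and function names are hypothetical, and the two inputs are the measurements quoted above, not new data.

```python
# Break-even batch size: how many blocks NEON finishes in the time
# one QPU dispatch spends on call overhead alone.
# Names are illustrative; the values come from the text above.
DISPATCH_OVERHEAD_S = 30e-6   # ~30 µs per Vulkan dispatch (cycle 1 M5)
NEON_BLOCKS_PER_S = 175e6     # M3: 175 Mblock/s on a single core


def break_even_blocks(overhead_s: float, neon_rate: float) -> int:
    """Blocks per dispatch at which overhead alone equals NEON's time."""
    return round(overhead_s * neon_rate)


min_blocks = break_even_blocks(DISPATCH_OVERHEAD_S, NEON_BLOCKS_PER_S)
print(min_blocks)  # 5250
```

Note that this bound covers dispatch overhead only; any QPU compute time per block pushes the real break-even higher.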

This is feasible for batch processing, but the QPU side itself
needs to do meaningful work per block to beat NEON, and:

- NEON does 5.7 ns/block. To beat NEON, the QPU needs < 5.7 ns/block
  amortized, i.e. more than ~175 Mblock/s.
- QPU per-block estimate (from cycle 1 scaling): even small kernels
  hit 50+ instructions per block. At V3D 7.1's compute rate
  (~1 cycle per ALU per lane at 2 threads = ~500 MHz effective for
  scalar work), 50 inst at 16 lanes/sg × 8 sg/WG = 128
  inst-per-block-equivalent, and 128 / 500 MHz = 256 ns per block
  at peak utilization. That's 45× slower than NEON.
- Predicted R₆ = 5.7 / 256 = **0.022 → deep RED**.
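
As a quick check of the bullets above, the following sketch recomputes the per-block floor and the predicted ratio. The variable names are hypothetical, and the 128 inst-per-block-equivalent and ~500 MHz figures are taken as given from the estimate rather than derived here.

```python
# Recompute the predicted QPU floor and R6 from the estimates above.
# All names are illustrative; the inputs are the document's own figures.
QPU_EFFECTIVE_HZ = 500e6      # ~500 MHz effective rate for scalar work
INST_EQUIV_PER_BLOCK = 128    # inst-per-block-equivalent (estimate above)
NEON_NS_PER_BLOCK = 5.7       # M3 measurement

qpu_ns_per_block = INST_EQUIV_PER_BLOCK / QPU_EFFECTIVE_HZ * 1e9
predicted_r6 = NEON_NS_PER_BLOCK / qpu_ns_per_block   # R = NEON time / QPU time
slowdown = qpu_ns_per_block / NEON_NS_PER_BLOCK

print(f"QPU floor: {qpu_ns_per_block:.0f} ns/block")
print(f"R6 = {predicted_r6:.3f}, QPU about {slowdown:.0f}x slower than NEON")
```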

Even if mixed-kernel M4 (Issue 003) is more favorable, the
contribution would be:

- Best-case QPU CDEF helper was 0.42 Mblock/s (cycle 5)
- IDCT 4×4 QPU helper likely similar scale: ~1-2 Mblock/s
- vs NEON's 175 Mblock/s headroom on a single core
- Net: QPU helper adds roughly 1 % (0.6-1.1 %) to NEON's capacity for this kernel

## Recipe verdict for cycle 6

**CPU NEON, no QPU dispatch path needed in the V4L2 wrapper.**

H.264 4×4 IDCT is so lightweight on NEON that a single CPU core
delivers about 30× the 1080p30 worst-case requirement
(5.85 Mblock/s). No realistic benefit from QPU offload.
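
One plausible way to arrive at a worst-case figure close to the 5.85 Mblock/s quoted in the commit message is to count every 4×4 block of a 4:2:0 frame as coded. The sketch below does that; the function name, the 4:2:0 assumption, and the "every block has residual" assumption are mine, not taken from the goal doc.

```python
# Worst-case 4x4 IDCT load for 1080p30, ASSUMING 4:2:0 chroma and that
# every 4x4 block in every frame carries residual. Hypothetical helper.
NEON_BLOCKS_PER_S = 175e6   # M3: 175 Mblock/s on a single core


def blocks_per_second(width: int, height: int, fps: int) -> int:
    luma = (width // 4) * (height // 4)
    # Two chroma planes, each half resolution in both dimensions.
    chroma = 2 * (width // 8) * (height // 8)
    return (luma + chroma) * fps


# 1080 display rows vs. 1088 coded rows (68 macroblock rows of 16).
for height in (1080, 1088):
    need = blocks_per_second(1920, height, 30)
    print(f"{height}: {need / 1e6:.2f} Mblock/s, "
          f"NEON headroom {NEON_BLOCKS_PER_S / need:.1f}x")
```

Both heights land within about 1 % of 5.85 Mblock/s, which is consistent with the roughly 30× headroom stated above.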

## What's left open

- Issue 004 (if ever filed): wide-batch QPU IDCT 4×4. Process
  256 or 1024 blocks per dispatch to amortize call overhead and see
  whether amortized throughput beats NEON. Both sizes sit below the
  5 250-block break-even from the Reasoning section, so larger
  batches would be needed before dispatch overhead alone drops under
  NEON's 5.7 ns/block. Likely still RED but
  potentially YELLOW if V3D's scalar ALU can keep up with the
  tiny butterfly. Low priority; not blocking.
- Future re-evaluation: if Phase 8 V4L2 deployment finds NEON
  fully saturated by other H.264 kernels (entropy + MC + deblock),
  IDCT 4×4 QPU offload becomes more attractive as a CPU-relief
  measure even at neutral throughput.
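
To see what a wide-batch experiment might show, here is a minimal amortization model under this document's own estimates (~30 µs per dispatch, ~256 ns predicted per-block floor). The model and its names are assumptions for illustration, not measurements; a real Issue 004 run would replace the floor with measured shader time.

```python
# Simple amortization model: per-block cost = overhead / batch + QPU floor.
# Values are the estimates from this document; names are hypothetical.
DISPATCH_OVERHEAD_NS = 30_000.0   # ~30 µs per Vulkan dispatch
QPU_NS_PER_BLOCK = 256.0          # predicted per-block floor
NEON_NS_PER_BLOCK = 5.7           # M3 measurement


def amortized_ns_per_block(batch: int) -> float:
    """Per-block cost when one dispatch processes `batch` blocks."""
    return DISPATCH_OVERHEAD_NS / batch + QPU_NS_PER_BLOCK


for batch in (256, 1024, 5250):
    ns = amortized_ns_per_block(batch)
    # 1e3 / ns converts ns/block into Mblock/s.
    print(f"{batch:>5} blocks: {ns:6.1f} ns/block, {1e3 / ns:.2f} Mblock/s")
```

Under these inputs, batching removes most of the dispatch cost but throughput stays capped near 1e3 / 256 ≈ 3.9 Mblock/s, so the outcome hinges on whether the per-block floor is as pessimistic as predicted.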

## Phase 9 lesson

**Predicted R for very lightweight kernels (NEON per-block time
< ~30 ns) is likely deep RED regardless of how well the kernel maps
to V3D compute, because the per-block QPU floor (~250 ns) is dominated
by overheads that NEON avoids by virtue of being on the same
substrate as the data.**

Generalisation: for daedalus-fourier going forward, any new kernel
with NEON per-block < 30 ns can be predicted RED and Phase 4
deferred unless there's a specific structural reason QPU might be
faster (e.g., parallel ops that NEON can't pack).
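
The rule above could be encoded as a small triage helper when picking future cycles. This is a sketch only: `triage_kernel`, `Triage`, and the constant names are hypothetical, and the 30 ns and ~250 ns values are the rough thresholds from this lesson rather than calibrated limits.

```python
from dataclasses import dataclass

# Rough thresholds from the Phase 9 lesson; names are illustrative.
NEON_LIGHTWEIGHT_NS = 30.0   # below this, predict RED
QPU_FLOOR_NS = 250.0         # approximate per-block QPU floor


@dataclass
class Triage:
    predicted_r_ceiling: float   # best R the floor allows (NEON ns / floor ns)
    defer_phase4: bool


def triage_kernel(neon_ns_per_block: float,
                  structural_qpu_advantage: bool = False) -> Triage:
    """Apply the lesson: defer Phase 4 for lightweight kernels unless
    there is a specific structural reason the QPU might win."""
    ceiling = neon_ns_per_block / QPU_FLOOR_NS
    lightweight = neon_ns_per_block < NEON_LIGHTWEIGHT_NS
    return Triage(ceiling, lightweight and not structural_qpu_advantage)


print(triage_kernel(5.7))         # cycle 6: IDCT 4x4
print(triage_kernel(5.7, True))   # same kernel, if a structural advantage existed
```

The ceiling is an upper bound on R implied by the floor, so a kernel that clears the 30 ns threshold still needs its own Phase 4 estimate.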

This shapes future cycle selection: prefer COMPUTE-HEAVY kernels
where QPU has a chance to add value. For H.264, that points
toward IDCT 8×8 (cycle 7), 6-tap MC (cycle 9), or in-loop deblock
(cycle 10).

## Cycle 6 closure

- Phase 1 ✓ goal doc
- Phase 2 implicit (vendored kernel)
- Phase 3 ✓ M3 = 175 Mblock/s, M1 PASS
- Phase 4 DEFERRED (this doc)
- Phases 5-7 N/A
- Phase 8 (deployment): CPU path via existing `daedalus_dispatch_*`
  in `include/daedalus.h`. (Wiring for cycle 6 = trivial CPU-only
  shim; deferred until V4L2 wrapper actually exists.)
- Phase 9 lesson encoded above

**Cycle 6 status: closed. Move on to cycle 7.**