Cycle 6 closed (deferred Phase 4): IDCT 4x4 too small for QPU
Phase 4 QPU shader DEFERRED (not RED-by-build, but predicted-RED and not worth building):

- NEON delivers 175 Mblock/s (5.7 ns/block) on a single core
- QPU per-block floor ~250 ns (from cycle 1 scaling) → R6 = 0.022
- Mixed-kernel helper contribution would be ~1-2 Mblock/s, <1% of NEON capacity
- 30fps@1080p worst case = 5.85 Mblock/s; NEON delivers 30x that on ONE core. No need for QPU help.

Phase 9 lesson: for any cycle with NEON per-block < ~30ns, predict deep RED and defer Phase 4 unless there's a specific structural QPU advantage.

Shapes future cycle selection: prefer compute-heavy kernels (cycle 7 H.264 IDCT 8x8 next; cycle 9 luma qpel MC; cycle 10 deblock).

Cycle 6 phase tally: Phase 1 ✓, Phase 2 implicit, Phase 3 ✓ (M1 + M3), Phase 4 DEFERRED, Phase 5-7 N/A, Phase 8 trivial CPU-only (recipe = stay CPU), Phase 9 ✓.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -0,0 +1,97 @@
---
cycle: 6
phase: 4 (decision: defer)
status: deferred 2026-05-18 — kernel too lightweight to amortize QPU dispatch
date_opened: 2026-05-18
date_decision: 2026-05-18
parent: k6_h264idct4_phase3.md
---

# Cycle 6, Phase 4 — DEFERRED

## The decision

With M3 captured (175 Mblock/s on a single NEON core, 5.7 ns per
block), Phase 4 (QPU shader plan) is **deferred** because the
kernel is too lightweight to make QPU offload worthwhile.

## Reasoning

V3D Vulkan dispatch overhead per call ≈ 30 µs (from the cycle 1 M5
measurement, `tests/bench_vulkan_dispatch.c`). To break even
against NEON at 175 Mblock/s, a single dispatch would need to
process at least:

    30 µs × 175 Mblock/s = 5 250 blocks per dispatch

A batch of that size is feasible, but the QPU itself also has to
get through each block faster than NEON does, and:

- NEON does 5.7 ns/block. To beat it, the QPU needs < 5.7 ns/block
  amortized, i.e. more than ~175 Mblock/s.
- QPU per-block estimate (from cycle 1 scaling): even small kernels
  hit 50+ instructions per block. At V3D 7.1's compute rate
  (~1 cycle per ALU per lane at 2 threads = ~500 MHz effective for
  scalar work), 50 inst at 16 lanes/sg × 8 sg/WG = 128
  inst-per-block-equivalent. At ~2 ns per instruction (1 / 500 MHz)
  that is 256 ns per block at peak utilization, 45× slower than NEON.
- Predicted R₆ = 5.7 / 256 = **0.022 → deep RED**.
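The arithmetic in this section can be re-checked with a few lines of Python. This is only a sketch: the constants are the figures quoted above, the function and variable names are made up for illustration, and the 128 instruction-equivalents per block and the ~500 MHz effective rate are the estimates from the text, not measurements.

```python
# Back-of-envelope check of the Phase 4 deferral numbers.
# All constants are figures quoted in this note; names are illustrative.

DISPATCH_OVERHEAD_S = 30e-6      # V3D Vulkan dispatch overhead (cycle 1 M5)
NEON_BLOCKS_PER_S = 175e6        # M3: single-core NEON throughput
NEON_NS_PER_BLOCK = 5.7          # M3: per-block time on NEON
QPU_INST_EQUIV_PER_BLOCK = 128   # estimate from the text, not measured
QPU_EFFECTIVE_HZ = 500e6         # ~500 MHz effective scalar rate (estimate)


def break_even_blocks(overhead_s, neon_blocks_per_s):
    """Blocks NEON finishes in the time one QPU dispatch costs."""
    return overhead_s * neon_blocks_per_s


def qpu_ns_per_block(inst_equiv, effective_hz):
    """Per-block QPU time at peak utilization, in nanoseconds."""
    return inst_equiv / effective_hz * 1e9


def predicted_r(neon_ns, qpu_ns):
    """R = NEON per-block time / QPU per-block time (below 1: QPU slower)."""
    return neon_ns / qpu_ns


if __name__ == "__main__":
    batch = break_even_blocks(DISPATCH_OVERHEAD_S, NEON_BLOCKS_PER_S)
    qpu_ns = qpu_ns_per_block(QPU_INST_EQUIV_PER_BLOCK, QPU_EFFECTIVE_HZ)
    print(f"break-even batch: {batch:.0f} blocks")
    print(f"QPU per block:    {qpu_ns:.0f} ns")
    print(f"predicted R6:     {predicted_r(NEON_NS_PER_BLOCK, qpu_ns):.3f}")
```

Keeping the inputs as named constants makes it easy to rerun the estimate if a later cycle measures a different dispatch overhead or QPU rate.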

Even if the mixed-kernel M4 setup (Issue 003) is more favorable, the
contribution would be small:

- Best-case QPU CDEF helper was 0.42 Mblock/s (cycle 5)
- IDCT 4×4 QPU helper likely similar scale: ~1-2 Mblock/s
- vs NEON's 175 Mblock/s on a single core
- Net: QPU helper adds roughly 1 % or less (0.6-1.1 %) to NEON's
  capacity for this kernel
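As a quick check of the helper estimate, the share of NEON capacity can be computed directly. The sketch below uses the throughput figures from the list above; the function name is hypothetical and the 1-2 Mblock/s range is the note's own guess, not a measurement.

```python
# Share of NEON capacity that a QPU helper would add, in percent.
# Throughputs are in Mblock/s and come from the bullets above.

NEON_MBLOCK_S = 175.0  # M3, single core


def helper_share_percent(helper_mblock_s, neon_mblock_s=NEON_MBLOCK_S):
    """Helper throughput as a percentage of NEON throughput."""
    return 100.0 * helper_mblock_s / neon_mblock_s


if __name__ == "__main__":
    # 0.42 is the cycle 5 CDEF helper; 1.0 and 2.0 bracket the IDCT guess.
    for helper in (0.42, 1.0, 2.0):
        print(f"{helper:4.2f} Mblock/s -> {helper_share_percent(helper):.2f} % of NEON")
```

Even the optimistic end of the range stays close to 1 % of what one NEON core already delivers.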

## Recipe verdict for cycle 6

**CPU NEON, no QPU dispatch path needed in the V4L2 wrapper.**

H.264 4×4 IDCT is so lightweight on NEON that a single CPU core
delivers 30× the 1080p30 worst-case requirement (5.85 Mblock/s).
QPU offload has no realistic benefit.
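The 30× headroom figure can be sanity-checked as follows. The commit message quotes 5.85 Mblock/s for the 1080p30 worst case without showing how it was obtained, so the reconstruction below is an assumption: 4:2:0 video with every 4×4 luma and chroma block coded. It lands close to the quoted figure (about 5.83 Mblock/s) but may not be the author's exact derivation. All function names are illustrative.

```python
# Rough worst-case 4x4 block rate for a video stream, and NEON headroom.
# Assumption (not from the note): 4:2:0 sampling, every 4x4 block coded.

def blocks_4x4_per_frame(width, height):
    """Count 4x4 blocks in one 4:2:0 frame (luma plus Cb and Cr)."""
    luma = (width // 4) * (height // 4)
    chroma = 2 * (width // 8) * (height // 8)  # two planes at half resolution
    return luma + chroma


def required_blocks_per_s(width, height, fps):
    """Blocks per second if every block in every frame needs an IDCT."""
    return blocks_4x4_per_frame(width, height) * fps


def headroom(neon_blocks_per_s, required):
    """How many times over NEON covers the requirement."""
    return neon_blocks_per_s / required


if __name__ == "__main__":
    need = required_blocks_per_s(1920, 1080, 30)
    print(f"reconstructed requirement: {need / 1e6:.2f} Mblock/s")
    print(f"headroom vs quoted 5.85:   {headroom(175e6, 5.85e6):.1f}x")
```

Either way the requirement is a small fraction of single-core NEON throughput, which is what the verdict relies on.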

## What's left open

- Issue 004 (if ever filed): wide-batch QPU IDCT 4×4. Process
  256 or 1024 blocks per dispatch to amortize call overhead and see
  whether amortized throughput beats NEON. Note that the break-even
  estimate in the Reasoning section (~5 250 blocks) implies batches
  of this size would not yet cover the 30 µs dispatch cost on their
  own. Likely still RED but potentially YELLOW if V3D's scalar ALU
  can keep up with the tiny butterfly. Low priority; not blocking.
- Future re-evaluation: if Phase 8 V4L2 deployment finds NEON
  fully saturated by other H.264 kernels (entropy + MC + deblock),
  IDCT 4×4 QPU offload becomes more attractive as a CPU-relief
  measure even at neutral throughput.
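To get a feel for what the wide-batch experiment would be up against, the per-block cost of a batched dispatch can be modelled as dispatch overhead spread over the batch plus per-block compute. This is a deliberately simple additive model with hypothetical names; it uses the 30 µs overhead and the 256 ns QPU estimate from the Reasoning section and ignores any overlap between dispatch and execution.

```python
# Amortized per-block cost of a batched QPU dispatch (simple additive model).
# Constants are figures quoted in this note; names are illustrative.

DISPATCH_OVERHEAD_NS = 30_000.0  # ~30 us per Vulkan dispatch (cycle 1 M5)
QPU_COMPUTE_NS = 256.0           # predicted per-block QPU time (estimate)
NEON_NS_PER_BLOCK = 5.7          # measured NEON per-block time (M3)


def amortized_ns_per_block(batch, overhead_ns=DISPATCH_OVERHEAD_NS,
                           compute_ns=QPU_COMPUTE_NS):
    """Dispatch overhead spread over the batch, plus per-block compute."""
    return overhead_ns / batch + compute_ns


if __name__ == "__main__":
    for batch in (256, 1024, 5250):
        overhead_only = amortized_ns_per_block(batch, compute_ns=0.0)
        total = amortized_ns_per_block(batch)
        print(f"{batch:5d} blocks: overhead {overhead_only:6.1f} ns/block, "
              f"total {total:6.1f} ns/block (NEON {NEON_NS_PER_BLOCK} ns)")
```

Under these assumptions the dispatch overhead alone exceeds NEON's per-block time until the batch reaches several thousand blocks, so a measured per-block compute cost far below the 256 ns estimate would be needed for the experiment to come out YELLOW.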

## Phase 9 lesson

**Predicted R for very lightweight kernels (NEON per-block time
< ~30 ns) is likely deep RED regardless of how well the kernel maps
to V3D compute, because the per-block QPU floor (~250 ns) is
dominated by overheads that NEON avoids by virtue of being on the
same substrate as the data.**

Generalisation: for daedalus-fourier going forward, any new kernel
with NEON per-block time < 30 ns can be predicted RED and Phase 4
deferred unless there's a specific structural reason the QPU might
be faster (e.g., parallel ops that NEON can't pack).
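The rule of thumb above is simple enough to write down as a predicate. The sketch below is one possible encoding, not an existing daedalus-fourier helper: the function name, the boolean flag, and the treatment of the ~30 ns figure as a hard threshold are all assumptions made for illustration.

```python
# Phase 9 rule of thumb: when to defer Phase 4 without building a shader.
# The 30 ns threshold is the approximate figure from this note.

NEON_NS_THRESHOLD = 30.0


def should_defer_phase4(neon_ns_per_block, structural_qpu_advantage=False):
    """True if the kernel is light enough on NEON to predict RED up front.

    structural_qpu_advantage: set when there is a specific reason the QPU
    might win anyway (e.g. parallel ops that NEON can't pack).
    """
    if structural_qpu_advantage:
        return False
    return neon_ns_per_block < NEON_NS_THRESHOLD


if __name__ == "__main__":
    print(should_defer_phase4(5.7))         # cycle 6: H.264 IDCT 4x4
    print(should_defer_phase4(5.7, True))   # same kernel, with a structural case
    print(should_defer_phase4(120.0))       # a hypothetical heavier kernel
```

A kernel that fails the check is not predicted GREEN; it only means Phase 4 is worth planning rather than skipping.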

This shapes future cycle selection: prefer COMPUTE-HEAVY kernels
where QPU has a chance to add value. For H.264, that points
toward IDCT 8×8 (cycle 7), 6-tap MC (cycle 9), or in-loop deblock
(cycle 10).

## Cycle 6 closure

- Phase 1 ✓ goal doc
- Phase 2 implicit (vendored kernel)
- Phase 3 ✓ M3 = 175 Mblock/s, M1 PASS
- Phase 4 DEFERRED (this doc)
- Phases 5-7 N/A
- Phase 8 (deployment): CPU path via existing `daedalus_dispatch_*`
  in `include/daedalus.h`. (Wiring for cycle 6 = trivial CPU-only
  shim; deferred until V4L2 wrapper actually exists.)
- Phase 9 lesson encoded above

**Cycle 6 status: closed. Move on to cycle 7.**