
marfrit 7288473d79 Cycle 6 closed (deferred Phase 4): IDCT 4x4 too small for QPU

Phase 4 QPU shader DEFERRED (not RED-by-build, but predicted-RED
and not worth building):
- NEON delivers 175 Mblock/s (5.7 ns/block) on a single core
- QPU per-block floor ~250 ns (from cycle 1 scaling) → R6 = 0.022
- Mixed-kernel helper contribution would be ~1-2 Mblock/s — <1%
  of NEON capacity
- 30fps@1080p worst case = 5.85 Mblock/s; NEON delivers 30x that
  on ONE core. No need for QPU help.

Phase 9 lesson: for any cycle with NEON per-block < ~30ns, predict
deep RED and defer Phase 4 unless there's a specific structural
QPU advantage. Shapes future cycle selection: prefer compute-heavy
kernels (cycle 7 H.264 IDCT 8x8 next; cycle 9 luma qpel MC; cycle
10 deblock).

Cycle 6 phase tally: Phase 1 ✓, Phase 2 implicit, Phase 3 ✓
(M1 + M3), Phase 4 DEFERRED, Phase 5-7 N/A, Phase 8 trivial
CPU-only (recipe = stay CPU), Phase 9 ✓.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-18 14:15:25 +00:00


cycle: 6
phase: 4 (decision: defer)
status: deferred 2026-05-18 (kernel too lightweight to amortize QPU dispatch)
date_opened: 2026-05-18
date_decision: 2026-05-18
parent: k6_h264idct4_phase3.md

Cycle 6, Phase 4 — DEFERRED

The decision

After M3 was captured (175 Mblock/s on a single NEON core, 5.7 ns per block), Phase 4 (QPU shader plan) is deferred because the kernel is too lightweight to make QPU offload worthwhile.

Reasoning

V3D Vulkan dispatch overhead per call ≈ 30 µs (from cycle 1 M5 measurement, tests/bench_vulkan_dispatch.c). To break even against NEON at 175 Mblock/s, a single dispatch would need to process at least:

30 µs × 175 Mblock/s = 5 250 blocks per dispatch

That batch size is feasible for batch processing, but amortizing the dispatch is only half the problem: the QPU's own per-block cost also has to beat NEON, and:

NEON does 5.7 ns/block. To beat NEON, QPU needs < 5.7 ns/block amortized, i.e. more than ~175 Mblock/s.
QPU per-block estimate (from cycle 1 scaling): even small kernels hit 50+ instructions per block. At V3D 7.1's compute rate (~1 cycle per ALU per lane at 2 threads, ~500 MHz effective for scalar work), 50 inst at 16 lanes/sg × 8 sg/WG comes to 128 inst-per-block-equivalent, and 128 inst / 500 MHz = 256 ns per block at peak utilization. That's 45× slower than NEON (256 / 5.7).
Predicted R₆ = 5.7 / 256 = 0.022 → deep RED.
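The arithmetic in this section can be sketched as three small helpers. This is a minimal sketch, not code from the repository: the function names are hypothetical, and the model (one instruction slot per cycle at the effective clock) is only the simplification this note already uses.

```c
#include <assert.h>

/* Blocks one dispatch must cover so that the fixed dispatch cost equals the
 * time NEON would spend on the same blocks.
 * us * Mblock/s = 1e-6 s * 1e6 block/s, so the unit factors cancel. */
double break_even_blocks(double dispatch_us, double neon_mblock_s)
{
    assert(dispatch_us > 0.0 && neon_mblock_s > 0.0);
    return dispatch_us * neon_mblock_s;
}

/* Per-block QPU time in ns, assuming `inst_equiv` instruction slots per
 * block and one slot per cycle at `eff_mhz` (1000 / MHz = ns per cycle). */
double qpu_ns_per_block(double inst_equiv, double eff_mhz)
{
    assert(inst_equiv > 0.0 && eff_mhz > 0.0);
    return inst_equiv * 1000.0 / eff_mhz;
}

/* Predicted ratio R = NEON time / QPU time; R < 1 means the QPU is slower. */
double predicted_r(double neon_ns, double qpu_ns)
{
    assert(neon_ns > 0.0 && qpu_ns > 0.0);
    return neon_ns / qpu_ns;
}
```

With the numbers from this note, break_even_blocks(30.0, 175.0) gives 5250 blocks, qpu_ns_per_block(128.0, 500.0) gives 256 ns, and predicted_r(5.7, 256.0) gives about 0.022.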

Even if mixed-kernel M4 (Issue 003) is more favorable, the contribution would be:

Best-case QPU CDEF helper was 0.42 Mblock/s (cycle 5)
IDCT 4×4 QPU helper likely similar scale: ~1-2 Mblock/s
vs NEON's 175 Mblock/s headroom on a single core
Net: QPU helper adds roughly 0.6-1.1 % (about 1 %) to NEON's capacity for this kernel
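As a quick check of the helper's share, the ratio can be computed directly. A minimal sketch; the function name is hypothetical and the 1-2 Mblock/s input is the estimate from the list above, not a measurement.

```c
#include <assert.h>

/* Extra capacity a QPU helper adds, as a percentage of NEON throughput.
 * Both arguments are in Mblock/s. */
double helper_gain_percent(double helper_mblock_s, double neon_mblock_s)
{
    assert(helper_mblock_s >= 0.0 && neon_mblock_s > 0.0);
    return 100.0 * helper_mblock_s / neon_mblock_s;
}
```

For 1 and 2 Mblock/s against 175 Mblock/s this gives about 0.57 % and 1.14 %. The measured cycle 5 CDEF helper rate of 0.42 Mblock/s would correspond to about 0.24 %.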

Recipe verdict for cycle 6

CPU NEON, no QPU dispatch path needed in the V4L2 wrapper.

H.264 4×4 IDCT is so lightweight on NEON that a single CPU core delivers about 30× the 1080p30 worst-case requirement (175 Mblock/s against 5.85 Mblock/s). QPU offload offers no realistic benefit.
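The note does not spell out how the 1080p30 worst case is derived. One plausible reconstruction, stated here as an assumption, is 4:2:0 video in which every 4×4 block (luma and chroma) carries residual. The function names below are hypothetical.

```c
#include <assert.h>

/* 4x4 blocks per second for a 4:2:0 stream, ASSUMING every 4x4 block is
 * coded. Chroma has half as many samples as luma, hence the 1.5 factor. */
double idct4_blocks_per_sec(int width, int height, int fps)
{
    assert(width > 0 && height > 0 && fps > 0);
    double luma_blocks = (double)width * height / 16.0;
    return luma_blocks * 1.5 * fps;
}

/* How many times over a NEON rate (in Mblock/s) covers the requirement. */
double neon_headroom(double neon_mblock_s, double required_blocks_per_sec)
{
    assert(neon_mblock_s > 0.0 && required_blocks_per_sec > 0.0);
    return neon_mblock_s * 1e6 / required_blocks_per_sec;
}
```

Under that assumption, 1920×1080 at 30 fps needs about 5.83 Mblock/s, and the macroblock-aligned 1920×1088 needs about 5.88 Mblock/s. Both bracket the 5.85 Mblock/s quoted for this cycle, and both leave a single NEON core at 175 Mblock/s with roughly 30× headroom.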

What's left open

Issue 004 (if ever filed): wide-batch QPU IDCT 4×4. Process 256 or 1024 blocks per dispatch to amortize call overhead, then check whether amortized throughput beats NEON. Likely still RED but potentially YELLOW if V3D's scalar ALU can keep up with the tiny butterfly. Low priority; not blocking.
Future re-evaluation: if Phase 8 V4L2 deployment finds NEON fully saturated by other H.264 kernels (entropy + MC + deblock), IDCT 4×4 QPU offload becomes more attractive as a CPU-relief measure even at neutral throughput.
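For sizing a wide-batch experiment, the dispatch overhead charged to each block can be estimated before any shader is written. A minimal sketch with hypothetical function names; it assumes the ≈30 µs dispatch cost is fixed and independent of batch size.

```c
#include <assert.h>

/* Dispatch overhead per block, in ns, when `batch` blocks share one call. */
double dispatch_ns_per_block(double dispatch_us, int batch)
{
    assert(dispatch_us > 0.0 && batch > 0);
    return dispatch_us * 1000.0 / batch;
}

/* Amortized cost per block: shared dispatch overhead plus QPU compute. */
double amortized_ns_per_block(double dispatch_us, int batch,
                              double qpu_compute_ns)
{
    assert(qpu_compute_ns >= 0.0);
    return dispatch_ns_per_block(dispatch_us, batch) + qpu_compute_ns;
}
```

With 30 µs per dispatch, batches of 256 and 1024 blocks carry about 117 ns and 29 ns of overhead per block, both above NEON's 5.7 ns before any QPU compute is counted. Overhead alone only approaches 5.7 ns near the 5 250-block break-even, so an Issue 004 experiment would probably need batches of that order or larger.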

Phase 9 lesson

Predicted R for very lightweight kernels (per-block ns < ~30) is likely deep RED regardless of how well the kernel maps to V3D compute, because the per-block QPU floor (~250 ns) is dominated by overheads that NEON avoids by virtue of being on the same substrate as the data.

Generalisation: for daedalus-fourier going forward, any new kernel with NEON per-block < 30 ns can be predicted RED and Phase 4 deferred unless there's a specific structural reason QPU might be faster (e.g., parallel ops that NEON can't pack).

This shapes future cycle selection: prefer COMPUTE-HEAVY kernels where QPU has a chance to add value. For H.264, that points toward IDCT 8×8 (cycle 7), 6-tap MC (cycle 9), or in-loop deblock (cycle 10).
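The generalisation above amounts to a two-input rule of thumb. A minimal sketch, with a hypothetical function name and macro; the 30 ns threshold is the approximate figure from this lesson, not a calibrated constant.

```c
#include <assert.h>
#include <stdbool.h>

/* Approximate NEON per-block threshold from the Phase 9 lesson. */
#define NEON_LIGHTWEIGHT_NS 30.0

/* True when Phase 4 (QPU shader) should be deferred without building it:
 * the kernel is lightweight on NEON and there is no specific structural
 * reason to expect the QPU to be faster. */
bool defer_phase4(double neon_ns_per_block, bool structural_qpu_advantage)
{
    assert(neon_ns_per_block > 0.0);
    return neon_ns_per_block < NEON_LIGHTWEIGHT_NS && !structural_qpu_advantage;
}
```

For cycle 6, defer_phase4(5.7, false) returns true. A kernel measured above the threshold, or one with a known structural QPU advantage, would still go through Phase 4.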

Cycle 6 closure

Phase 1 ✓ goal doc
Phase 2 implicit (vendored kernel)
Phase 3 ✓ M3 = 175 Mblock/s, M1 PASS
Phase 4 DEFERRED (this doc)
Phases 5-7 N/A
Phase 8 (deployment): CPU path via existing daedalus_dispatch_* in include/daedalus.h. (Wiring for cycle 6 = trivial CPU-only shim; deferred until V4L2 wrapper actually exists.)
Phase 9 lesson encoded above

Cycle 6 status: closed. Move on to cycle 7.
