From 7288473d7972ffbf57d3fc1e30f63b081352949e Mon Sep 17 00:00:00 2001
From: Markus Fritsche
Date: Mon, 18 May 2026 14:15:25 +0000
Subject: [PATCH] Cycle 6 closed (deferred Phase 4): IDCT 4x4 too small for QPU
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Phase 4 QPU shader DEFERRED (not RED-by-build, but predicted-RED and
not worth building):
- NEON delivers 175 Mblock/s (5.7 ns/block) on a single core
- QPU per-block floor ~250 ns (from cycle 1 scaling) → R6 = 0.022
- Mixed-kernel helper contribution would be ~1-2 Mblock/s — <1% of NEON capacity
- 30fps@1080p worst case = 5.85 Mblock/s; NEON delivers 30x that on ONE core.
  No need for QPU help.

Phase 9 lesson: for any cycle with NEON per-block < ~30ns, predict
deep RED and defer Phase 4 unless there's a specific structural QPU
advantage. Shapes future cycle selection: prefer compute-heavy kernels
(cycle 7 H.264 IDCT 8x8 next; cycle 9 luma qpel MC; cycle 10 deblock).

Cycle 6 phase tally: Phase 1 ✓, Phase 2 implicit, Phase 3 ✓ (M1 + M3),
Phase 4 DEFERRED, Phase 5-7 N/A, Phase 8 trivial CPU-only (recipe =
stay CPU), Phase 9 ✓.

Co-Authored-By: Claude Opus 4.7 (1M context)
---
 docs/k6_h264idct4_phase4.md | 97 +++++++++++++++++++++++++++++++++++++
 1 file changed, 97 insertions(+)
 create mode 100644 docs/k6_h264idct4_phase4.md

diff --git a/docs/k6_h264idct4_phase4.md b/docs/k6_h264idct4_phase4.md
new file mode 100644
index 0000000..bee82b4
--- /dev/null
+++ b/docs/k6_h264idct4_phase4.md
@@ -0,0 +1,97 @@
+---
+cycle: 6
+phase: 4 (decision: defer)
+status: deferred 2026-05-18 — kernel too lightweight to amortize QPU dispatch
+date_opened: 2026-05-18
+date_decision: 2026-05-18
+parent: k6_h264idct4_phase3.md
+---
+
+# Cycle 6, Phase 4 — DEFERRED
+
+## The decision
+
+After M3 captured (175 Mblock/s on a single NEON core, 5.7 ns per
+block), Phase 4 (QPU shader plan) is **deferred** because the
+kernel is too lightweight to make QPU offload worthwhile.
+
+## Reasoning
+
+V3D Vulkan dispatch overhead per call ≈ 30 µs (from cycle 1 M5
+measurement, `tests/bench_vulkan_dispatch.c`). To break even
+against NEON at 175 Mblock/s, a single dispatch would need to
+process at least:
+
+    30 µs × 175 Mblock/s = 5 250 blocks per dispatch
+
+Which is feasible for batch processing — but the QPU side itself
+needs to do meaningful work per block to beat NEON, and:
+
+- NEON does 5.7 ns/block. To beat NEON, QPU needs < 5.7 ns/block
+  amortized = ~175 Mblock/s.
+- QPU per-block estimate (from cycle 1 scaling): even small kernels
+  hit 50+ instructions per block. At V3D 7.1's compute rate
+  (~1 cycle per ALU per lane at 2 threads = ~500 MHz effective for
+  scalar work), 50 inst at 16 lanes/sg × 8 sg/WG = 128 inst-per-
+  block-equivalent → 256 ns per block at peak utilization. That's
+  45× slower than NEON.
+- Predicted R₆ = 5.7 / 256 = **0.022 → deep RED**.
+
+Even if mixed-kernel M4 (Issue 003) is more favorable, the
+contribution would be:
+- Best-case QPU CDEF helper was 0.42 Mblock/s (cycle 5)
+- IDCT 4×4 QPU helper likely similar scale: ~1-2 Mblock/s
+- vs NEON's 175 Mblock/s headroom on a single core
+- Net: QPU helper adds <1 % to NEON's capacity for this kernel
+
+## Recipe verdict for cycle 6
+
+**CPU NEON, no QPU dispatch path needed in the V4L2 wrapper.**
+
+H.264 4×4 IDCT is so lightweight on NEON that a single CPU core
+delivers 30× the 1080p30 worst-case requirement. No realistic
+benefit from QPU offload.
+
+## What's left open
+
+- Issue 004 (if ever filed): wide-batch QPU IDCT 4×4 — process
+  256 or 1024 blocks per dispatch to amortize call overhead, see
+  if amortized throughput beats NEON. Likely still RED but
+  potentially YELLOW if V3D's scalar ALU can keep up with the
+  tiny butterfly. Low priority; not blocking.
+- Future re-evaluation: if Phase 8 V4L2 deployment finds NEON
+  fully saturated by other H.264 kernels (entropy + MC + deblock),
+  IDCT 4×4 QPU offload becomes more attractive as a CPU-relief
+  measure even at neutral throughput.
+
+## Phase 9 lesson
+
+**Predicted R for very lightweight kernels (per-block ns < ~30) is
+likely deep RED regardless of how well the kernel maps to V3D
+compute, because the per-block QPU floor (~250 ns) is dominated
+by overheads that NEON avoids by virtue of being on the same
+substrate as the data.**
+
+Generalisation: for daedalus-fourier going forward, any new kernel
+with NEON per-block < 30 ns can be predicted RED and Phase 4
+deferred unless there's a specific structural reason QPU might be
+faster (e.g., parallel ops that NEON can't pack).
+
+This shapes future cycle selection: prefer COMPUTE-HEAVY kernels
+where QPU has a chance to add value. For H.264, that points
+toward IDCT 8×8 (cycle 7), 6-tap MC (cycle 9), or in-loop deblock
+(cycle 10).
+
+## Cycle 6 closure
+
+- Phase 1 ✓ goal doc
+- Phase 2 implicit (vendored kernel)
+- Phase 3 ✓ M3 = 175 Mblock/s, M1 PASS
+- Phase 4 DEFERRED (this doc)
+- Phases 5-7 N/A
+- Phase 8 (deployment): CPU path via existing `daedalus_dispatch_*`
+  in include/daedalus.h. (Wiring for cycle 6 = trivial CPU-only
+  shim; deferred until V4L2 wrapper actually exists.)
+- Phase 9 lesson encoded above
+
+**Cycle 6 status: closed. Move on to cycle 7.**
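
The break-even and predicted-R arithmetic in the Reasoning section can be reproduced with a short script. This is a minimal sketch that assumes only the figures quoted in the doc (175 Mblock/s on NEON, 30 µs dispatch overhead, 128 instruction-equivalents per block at roughly 500 MHz effective, 5.85 Mblock/s required for 1080p30). Every constant and function name below is hypothetical and is not part of daedalus-fourier.

```python
"""Back-of-envelope check of the cycle 6 Phase 4 numbers (illustrative names only)."""

# Figures quoted in the doc; every identifier below is hypothetical.
NEON_MBLOCK_PER_S = 175.0          # M3: single NEON core throughput
DISPATCH_OVERHEAD_US = 30.0        # cycle 1 M5: V3D Vulkan dispatch overhead
QPU_INST_EQUIV_PER_BLOCK = 16 * 8  # 16 lanes/sg x 8 sg/WG = 128
QPU_EFFECTIVE_MHZ = 500.0          # effective scalar rate assumed in the doc
REQUIRED_MBLOCK_PER_S = 5.85       # 1080p30 worst case


def neon_ns_per_block() -> float:
    # 1e9 ns/s divided by (Mblock/s * 1e6) simplifies to 1e3 / (Mblock/s).
    return 1e3 / NEON_MBLOCK_PER_S


def break_even_blocks() -> float:
    # Blocks NEON completes during one dispatch overhead: us * Mblock/s = blocks.
    return DISPATCH_OVERHEAD_US * NEON_MBLOCK_PER_S


def qpu_ns_per_block() -> float:
    # Instructions divided by MHz gives microseconds; scale to nanoseconds.
    return QPU_INST_EQUIV_PER_BLOCK / QPU_EFFECTIVE_MHZ * 1e3


def predicted_ratio() -> float:
    # R = NEON time per block / QPU time per block (below 1 means QPU is slower).
    return neon_ns_per_block() / qpu_ns_per_block()


def neon_headroom() -> float:
    # How many times over a single NEON core covers the 1080p30 requirement.
    return NEON_MBLOCK_PER_S / REQUIRED_MBLOCK_PER_S


def dispatch_overhead_ns_per_block(batch_blocks: int) -> float:
    # Dispatch overhead alone, spread over one batch, ignoring all QPU compute.
    return DISPATCH_OVERHEAD_US * 1e3 / batch_blocks


if __name__ == "__main__":
    print(f"break-even batch : {break_even_blocks():.0f} blocks")
    print(f"QPU per block    : {qpu_ns_per_block():.0f} ns")
    print(f"predicted R      : {predicted_ratio():.3f}")
    print(f"NEON headroom    : {neon_headroom():.1f}x")
    for batch in (256, 1024):
        print(f"overhead @ {batch:4d}  : {dispatch_overhead_ns_per_block(batch):.1f} ns/block")
```

Under these assumptions, a 1024-block batch still carries roughly 29 ns of dispatch overhead per block before any QPU compute is counted, which is already above NEON's 5.7 ns per block. That is consistent with the 5 250-block break-even figure and with the "likely still RED" expectation for Issue 004.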