diff --git a/docs/k6_h264idct4_phase4.md b/docs/k6_h264idct4_phase4.md
new file mode 100644
index 0000000..bee82b4
--- /dev/null
+++ b/docs/k6_h264idct4_phase4.md
@@ -0,0 +1,97 @@
+---
+cycle: 6
+phase: 4 (decision: defer)
+status: deferred 2026-05-18 (kernel too lightweight to amortize QPU dispatch)
+date_opened: 2026-05-18
+date_decision: 2026-05-18
+parent: k6_h264idct4_phase3.md
+---
+
+# Cycle 6, Phase 4: DEFERRED
+
+## The decision
+
+After M3 was captured (175 Mblock/s on a single NEON core, 5.7 ns
+per block), Phase 4 (QPU shader plan) is **deferred** because the
+kernel is too lightweight to make QPU offload worthwhile.
+
+## Reasoning
+
+V3D Vulkan dispatch overhead per call ≈ 30 µs (from cycle 1 M5
+measurement, `tests/bench_vulkan_dispatch.c`). To break even
+against NEON at 175 Mblock/s, a single dispatch would need to
+process at least:
+
+    30 µs × 175 Mblock/s = 5 250 blocks per dispatch
+
+That batch size is feasible, but it only covers dispatch overhead.
+The QPU must also process each block faster than NEON does, and:
+
+- NEON does 5.7 ns/block. To beat NEON, QPU needs < 5.7 ns/block
+  amortized, i.e. more than ~175 Mblock/s.
+- QPU per-block estimate (from cycle 1 scaling): even small kernels
+  hit 50+ instructions per block. At V3D 7.1's compute rate
+  (~1 cycle per ALU per lane at 2 threads = ~500 MHz effective for
+  scalar work), 50 inst at 16 lanes/sg × 8 sg/WG = 128
+  inst-per-block-equivalent → 128 / 500 MHz = 256 ns per block at
+  peak utilization. That's 45× slower than NEON.
+- Predicted R₆ = 5.7 / 256 = **0.022 → deep RED**.
+
+Even if mixed-kernel M4 (Issue 003) is more favorable, the
+contribution would be:
+- Best-case QPU CDEF helper was 0.42 Mblock/s (cycle 5)
+- IDCT 4×4 QPU helper likely similar scale: ~1-2 Mblock/s
+- vs NEON's 175 Mblock/s headroom on a single core
+- Net: QPU helper adds ~0.6-1.1 % to NEON's capacity for this kernel
+
+## Recipe verdict for cycle 6
+
+**CPU NEON, no QPU dispatch path needed in the V4L2 wrapper.**
+
+H.264 4×4 IDCT is so lightweight on NEON that a single CPU core
+delivers 30× the 1080p30 worst-case requirement. No realistic
+benefit from QPU offload.
+
+## What's left open
+
+- Issue 004 (if ever filed): wide-batch QPU IDCT 4×4. Process 256 or
+  1024 blocks per dispatch to amortize call overhead (both are below
+  the 5 250-block break-even above) and see if amortized throughput
+  beats NEON. Likely still RED, but potentially YELLOW if V3D's scalar
+  ALU can keep up with the tiny butterfly. Low priority; not blocking.
+- Future re-evaluation: if Phase 8 V4L2 deployment finds NEON
+  fully saturated by other H.264 kernels (entropy + MC + deblock),
+  IDCT 4×4 QPU offload becomes more attractive as a CPU-relief
+  measure even at neutral throughput.
+
+## Phase 9 lesson
+
+**Predicted R for very lightweight kernels (NEON per-block time
+under ~30 ns) is likely deep RED regardless of how well the kernel
+maps to V3D compute, because the per-block QPU floor (~250 ns) is
+dominated by overheads that NEON avoids by virtue of being on the
+same substrate as the data.**
+
+Generalisation: for daedalus-fourier going forward, any new kernel
+with NEON per-block time < 30 ns can be predicted RED and Phase 4
+deferred unless there's a specific structural reason QPU might be
+faster (e.g., parallel ops that NEON can't pack).
+
+This shapes future cycle selection: prefer COMPUTE-HEAVY kernels
+where QPU has a chance to add value. For H.264, that points
+toward IDCT 8×8 (cycle 7), 6-tap MC (cycle 9), or in-loop deblock
+(cycle 10).
+
+## Cycle 6 closure
+
+- Phase 1 ✓ goal doc
+- Phase 2 implicit (vendored kernel)
+- Phase 3 ✓ M3 = 175 Mblock/s, M1 PASS
+- Phase 4 DEFERRED (this doc)
+- Phases 5-7 N/A
+- Phase 8 (deployment): CPU path via existing `daedalus_dispatch_*`
+  in `include/daedalus.h`. (Wiring for cycle 6 = trivial CPU-only
+  shim; deferred until V4L2 wrapper actually exists.)
+- Phase 9 lesson encoded above
+
+**Cycle 6 status: closed. Move on to cycle 7.**
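
The planning arithmetic in the Reasoning section, and the rule of thumb in the Phase 9 lesson, can be re-derived with a short script. What follows is a minimal sketch in Python and is not part of the repository. It takes the figures quoted in the doc as inputs (30 µs dispatch overhead, 175 Mblock/s on NEON, 128 instruction-equivalents per block, ~500 MHz effective rate). Every function and constant name is hypothetical. The per-batch overhead helper assumes the fixed dispatch cost is spread evenly over the blocks in one dispatch, which is a simplification.

```python
# Sketch only: re-derives the planning numbers quoted in the doc.
# All names are illustrative and do not come from the daedalus code base.

DISPATCH_OVERHEAD_S = 30e-6     # doc: V3D Vulkan dispatch overhead, ~30 µs per call
NEON_BLOCKS_PER_S = 175e6       # doc: M3 = 175 Mblock/s on a single NEON core
QPU_INST_EQUIV_PER_BLOCK = 128  # doc: 16 lanes/sg × 8 sg/WG
QPU_EFFECTIVE_HZ = 500e6        # doc: ~500 MHz effective for scalar work
LIGHTWEIGHT_THRESHOLD_NS = 30.0 # doc: Phase 9 rule of thumb


def neon_ns_per_block(blocks_per_s=NEON_BLOCKS_PER_S):
    """Per-block NEON time in nanoseconds."""
    return 1e9 / blocks_per_s


def break_even_blocks(overhead_s=DISPATCH_OVERHEAD_S,
                      blocks_per_s=NEON_BLOCKS_PER_S):
    """Blocks NEON finishes during one dispatch's fixed overhead."""
    return overhead_s * blocks_per_s


def qpu_ns_per_block(inst=QPU_INST_EQUIV_PER_BLOCK, hz=QPU_EFFECTIVE_HZ):
    """Estimated QPU time per block at peak utilization, in nanoseconds."""
    return inst / hz * 1e9


def predicted_r(neon_ns, qpu_ns):
    """Ratio of NEON time to QPU time; values well below 1 mean QPU loses."""
    return neon_ns / qpu_ns


def overhead_ns_per_block(batch, overhead_s=DISPATCH_OVERHEAD_S):
    """Dispatch overhead per block, assuming an even spread over the batch."""
    return overhead_s * 1e9 / batch


def predict_red(neon_ns, threshold_ns=LIGHTWEIGHT_THRESHOLD_NS):
    """Phase 9 rule of thumb: very light NEON kernels are predicted RED."""
    return neon_ns < threshold_ns


if __name__ == "__main__":
    neon_ns = neon_ns_per_block()
    qpu_ns = qpu_ns_per_block()
    print(f"NEON per block:      {neon_ns:.1f} ns")
    print(f"break-even batch:    {break_even_blocks():.0f} blocks")
    print(f"QPU per block:       {qpu_ns:.0f} ns")
    print(f"predicted R:         {predicted_r(neon_ns, qpu_ns):.3f}")
    print(f"predicted RED:       {predict_red(neon_ns)}")
    for batch in (256, 1024, 5250):
        print(f"overhead at {batch:>5}:   {overhead_ns_per_block(batch):.1f} ns/block")
```

With the doc's inputs this reproduces the 5 250-block break-even and a predicted R of about 0.022. It also shows why the batch sizes floated for Issue 004 matter: at 256 and 1024 blocks, dispatch overhead alone comes to roughly 117 ns and 29 ns per block, both above NEON's 5.7 ns, before any QPU compute time is counted.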