Compare commits
2 Commits
| Author | SHA1 | Date | |
|---|---|---|---|
| | 480f34f6e6 | | |
| | 7288473d79 | | |
@@ -0,0 +1,97 @@
---
cycle: 6
phase: 4 (decision: defer)
status: deferred 2026-05-18 (kernel too lightweight to amortize QPU dispatch)
date_opened: 2026-05-18
date_decision: 2026-05-18
parent: k6_h264idct4_phase3.md
---

# Cycle 6, Phase 4: DEFERRED

## The decision

After M3 was captured (175 Mblock/s on a single NEON core, 5.7 ns per
block), Phase 4 (QPU shader plan) is **deferred** because the
kernel is too lightweight to make QPU offload worthwhile.

## Reasoning

V3D Vulkan dispatch overhead per call ≈ 30 µs (from the cycle 1 M5
measurement, `tests/bench_vulkan_dispatch.c`). To break even
against NEON at 175 Mblock/s on dispatch overhead alone, a single
dispatch would need to process at least:

    30 µs × 175 Mblock/s = 5 250 blocks per dispatch

That is feasible for batch processing, but the QPU side itself also
has to beat NEON's per-block cost, and:

- NEON does 5.7 ns/block. To beat NEON, the QPU needs < 5.7 ns/block
  amortized, i.e. more than ~175 Mblock/s.
- QPU per-block estimate (from cycle 1 scaling): even small kernels
  hit 50+ instructions per block. At V3D 7.1's compute rate
  (~1 cycle per ALU per lane at 2 threads = ~500 MHz effective for
  scalar work), 50 inst at 16 lanes/sg × 8 sg/WG = 128 inst-per-
  block-equivalent → 128 / 500 MHz = 256 ns per block at peak
  utilization. That's ~45× slower than NEON (256 / 5.7).
- Predicted R₆ = 5.7 / 256 = **0.022 → deep RED**.

Even if the mixed-kernel M4 (Issue 003) is more favorable, the
contribution would be small:

- Best-case QPU CDEF helper was 0.42 Mblock/s (cycle 5)
- IDCT 4×4 QPU helper likely similar scale: ~1-2 Mblock/s
- vs NEON's 175 Mblock/s headroom on a single core
- Net: the QPU helper adds only ~0.6-1.1 % (1-2 / 175) to NEON's
  capacity for this kernel

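
The arithmetic in this section can be re-derived with a few helper functions. This is a minimal sketch: the constants are the figures quoted above, while the function and macro names are illustrative and not part of the repository.

```c
/* Figures quoted in this note; all names here are illustrative. */
#define DISPATCH_OVERHEAD_US 30.0   /* V3D Vulkan dispatch overhead (cycle 1 M5) */
#define NEON_MBLOCK_PER_S    175.0  /* cycle 6 M3, single NEON core */
#define QPU_NS_PER_BLOCK     256.0  /* peak-utilization estimate from the text */

/* Blocks NEON completes in the time one dispatch spends on overhead.
 * Mblock/s equals blocks per microsecond, so the units cancel. */
static double breakeven_blocks(double dispatch_us, double neon_mblock_s)
{
    return dispatch_us * neon_mblock_s;
}

/* Convert a throughput in Mblock/s to a per-block time in ns. */
static double ns_per_block(double mblock_s)
{
    return 1000.0 / mblock_s;
}

/* R as used in this note: NEON ns per block divided by QPU ns per block. */
static double predicted_r(double neon_ns, double qpu_ns)
{
    return neon_ns / qpu_ns;
}

/* Extra capacity a helper adds on top of NEON, in percent. */
static double helper_gain_percent(double helper_mblock_s, double neon_mblock_s)
{
    return 100.0 * helper_mblock_s / neon_mblock_s;
}
```

With the quoted inputs, `breakeven_blocks` gives 5 250 blocks, `predicted_r(5.7, 256.0)` is about 0.022, and a 1-2 Mblock/s helper works out to roughly 0.6-1.1 % of NEON's 175 Mblock/s.
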
## Recipe verdict for cycle 6

**CPU NEON, no QPU dispatch path needed in the V4L2 wrapper.**

H.264 4×4 IDCT is so lightweight on NEON that a single CPU core
delivers 30× the 1080p30 worst-case requirement. No realistic
benefit from QPU offload.

## What's left open

- Issue 004 (if ever filed): wide-batch QPU IDCT 4×4. Process
  256 or 1024 blocks per dispatch to amortize call overhead and see
  whether amortized throughput beats NEON. Both batch sizes are below
  the 5 250-block break-even computed under Reasoning, so this is
  likely still RED, but potentially YELLOW if V3D's scalar ALU can
  keep up with the tiny butterfly. Low priority; not blocking.
- Future re-evaluation: if Phase 8 V4L2 deployment finds NEON
  fully saturated by other H.264 kernels (entropy + MC + deblock),
  IDCT 4×4 QPU offload becomes more attractive as a CPU-relief
  measure even at neutral throughput.

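
To make the wide-batch idea concrete, the sketch below spreads the fixed dispatch overhead over a batch and adds a per-block kernel cost. The function names are hypothetical, and setting the kernel cost to zero gives only a best-case bound, not a measurement.

```c
/* Per-block cost in ns when one dispatch of `batch` blocks pays a fixed
 * overhead of `dispatch_us` plus `kernel_ns` of QPU work per block. */
static double amortized_ns_per_block(double dispatch_us, unsigned batch,
                                     double kernel_ns)
{
    return dispatch_us * 1000.0 / (double)batch + kernel_ns;
}

/* Upper bound on R for a given batch size: assumes the QPU kernel itself
 * costs nothing, so only dispatch overhead is counted. */
static double best_case_r(double neon_ns, double dispatch_us, unsigned batch)
{
    return neon_ns / amortized_ns_per_block(dispatch_us, batch, 0.0);
}
```

With the 30 µs overhead quoted in this note, a 256-block batch costs about 117 ns per block in overhead alone and a 1024-block batch about 29 ns, against NEON's 5.7 ns. Only around 5 250 blocks does the overhead term drop to NEON's level, before any QPU compute is counted.
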
## Phase 9 lesson

**Predicted R for very lightweight kernels (NEON per-block time
< ~30 ns) is likely deep RED regardless of how well the kernel maps
to V3D compute, because the per-block QPU floor (~250 ns) is dominated
by overheads that NEON avoids by virtue of being on the same
substrate as the data.**

Generalisation: for daedalus-fourier going forward, any new kernel
with NEON per-block < 30 ns can be predicted RED and Phase 4
deferred unless there's a specific structural reason the QPU might be
faster (e.g., parallel ops that NEON can't pack).

This shapes future cycle selection: prefer COMPUTE-HEAVY kernels
where the QPU has a chance to add value. For H.264, that points
toward IDCT 8×8 (cycle 7), 6-tap MC (cycle 9), or in-loop deblock
(cycle 10).

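
The deferral rule above is simple enough to write down as a predicate. This is only a sketch of the heuristic as stated in this note: the function name and the `has_structural_reason` flag are hypothetical, and the 30 ns threshold is the approximate figure from the lesson, not a tuned constant.

```c
#include <stdbool.h>

/* Approximate threshold from the Phase 9 lesson (NEON ns per block). */
#define LIGHTWEIGHT_NS_THRESHOLD 30.0

/* Returns true when Phase 4 (QPU shader plan) can be deferred up front.
 * `has_structural_reason` is a hypothetical flag meaning "there is a
 * specific reason the QPU might win", e.g. parallel ops NEON can't pack. */
static bool should_defer_phase4(double neon_ns_per_block,
                                bool has_structural_reason)
{
    if (has_structural_reason)
        return false;
    return neon_ns_per_block < LIGHTWEIGHT_NS_THRESHOLD;
}
```

Applied to the numbers in these notes, the 5.7 ns IDCT 4×4 is deferred, while a kernel near the 122 ns measured for VP9 IDCT 8×8 is not.
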
## Cycle 6 closure

- Phase 1 ✓ goal doc
- Phase 2 implicit (vendored kernel)
- Phase 3 ✓ M3 = 175 Mblock/s, M1 PASS
- Phase 4 DEFERRED (this doc)
- Phases 5-7 N/A
- Phase 8 (deployment): CPU path via existing `daedalus_dispatch_*`
  in include/daedalus.h. (Wiring for cycle 6 = trivial CPU-only
  shim; deferred until V4L2 wrapper actually exists.)
- Phase 9 lesson encoded above

**Cycle 6 status: closed. Move on to cycle 7.**
@@ -0,0 +1,130 @@
---
cycle: 7
phase: 1
status: open
date_opened: 2026-05-18
codec: H.264
kernel: IDCT 8x8 + add (High-profile residual)
parent: project_h264_scope_added.md (memory)
predicted_R: 0.4-0.8 (YELLOW/ORANGE), comparable to VP9 IDCT 8x8 (cycle 1, R=0.92)
---

# Cycle 7, Phase 1: H.264 IDCT 8×8 + add

Second H.264 kernel: the 8×8 inverse integer transform used in
High-profile H.264 (most modern H.264 streams are encoded as High
profile: broadcast TV, web streams, file media). Smaller scope than
IDCT 4×4 but much more compute-heavy per block.

## Why IDCT 8x8 next

- Closely analogous to **cycle 1 (VP9 IDCT 8×8), which was R=0.92
  GREEN**. Best candidate for a near-immediate H.264 GREEN result.
- 64 coefficients per block (8×8) = same data shape as cycle 1.
- Integer butterfly (no trig multiplies) but more sub-stages than
  4×4. Per-block compute weight ~3-5× the 4×4.
- H.264 High-profile uses IDCT 8×8 for ~40-60 % of residual blocks
  (encoder choice). Decoder must support it for spec compliance.

## Kernel contract

Per H.264 spec §8.5.13 (8x8 inverse integer transform). 1D
butterfly (g[0..7] from input d[0..7]):

```
e[0] = d[0] + d[4]
e[1] = -d[3] + d[5] - d[7] - (d[7] >> 1)
e[2] = d[0] - d[4]
e[3] = d[1] + d[7] - d[3] - (d[3] >> 1)
e[4] = (d[2] >> 1) - d[6]
e[5] = -d[1] + d[7] + d[5] + (d[5] >> 1)
e[6] = d[2] + (d[6] >> 1)
e[7] = d[3] + d[5] + d[1] + (d[1] >> 1)

f[0] = e[0] + e[6]
f[1] = e[1] + (e[7] >> 2)
f[2] = e[2] + e[4]
f[3] = e[3] + (e[5] >> 2)
f[4] = e[2] - e[4]
f[5] = (e[3] >> 2) - e[5]
f[6] = e[0] - e[6]
f[7] = e[7] - (e[1] >> 2)

g[0] = f[0] + f[7]
g[1] = f[2] + f[5]
g[2] = f[4] + f[3]
g[3] = f[6] + f[1]
g[4] = f[6] - f[1]
g[5] = f[4] - f[3]
g[6] = f[2] - f[5]
g[7] = f[0] - f[7]
```

Applied row-pass then column-pass (per H.264/FFmpeg convention,
with column-major block).

Final: dst[r,c] = clip(dst[r,c] + ((g_2d[r,c] + 32) >> 6)), where
clip saturates to the 8-bit pixel range [0, 255]. The rounding shift
is applied to the residual before it is added to dst.

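
The contract above can be sketched as a plain C reference. This is an illustration under stated assumptions, not the vendored kernel: the names `h264_idct8_1d`, `clip_u8` and `h264_idct8_add_ref` are made up here, pixels are assumed to be 8-bit, the block is assumed column-major (`block[r + 8*c]`), and `>>` on negative values is assumed to be an arithmetic shift, as it is on the compilers this project targets.

```c
#include <stddef.h>
#include <stdint.h>

/* One 1D pass of the 8-point butterfly (d -> e -> f -> g). */
static void h264_idct8_1d(const int d[8], int g[8])
{
    int e[8], f[8];

    e[0] = d[0] + d[4];
    e[1] = -d[3] + d[5] - d[7] - (d[7] >> 1);
    e[2] = d[0] - d[4];
    e[3] = d[1] + d[7] - d[3] - (d[3] >> 1);
    e[4] = (d[2] >> 1) - d[6];
    e[5] = -d[1] + d[7] + d[5] + (d[5] >> 1);
    e[6] = d[2] + (d[6] >> 1);
    e[7] = d[3] + d[5] + d[1] + (d[1] >> 1);

    f[0] = e[0] + e[6];
    f[1] = e[1] + (e[7] >> 2);
    f[2] = e[2] + e[4];
    f[3] = e[3] + (e[5] >> 2);
    f[4] = e[2] - e[4];
    f[5] = (e[3] >> 2) - e[5];
    f[6] = e[0] - e[6];
    f[7] = e[7] - (e[1] >> 2);

    g[0] = f[0] + f[7];
    g[1] = f[2] + f[5];
    g[2] = f[4] + f[3];
    g[3] = f[6] + f[1];
    g[4] = f[6] - f[1];
    g[5] = f[4] - f[3];
    g[6] = f[2] - f[5];
    g[7] = f[0] - f[7];
}

/* Saturate to the 8-bit pixel range. */
static uint8_t clip_u8(int v)
{
    return (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
}

/* Row pass, column pass, then rounded add into dst. */
static void h264_idct8_add_ref(uint8_t *dst, const int16_t *block,
                               ptrdiff_t stride)
{
    int tmp[64], in[8], out[8];

    /* Row pass: for each row r, transform across the 8 columns. */
    for (int r = 0; r < 8; r++) {
        for (int c = 0; c < 8; c++)
            in[c] = block[r + c * 8];
        h264_idct8_1d(in, out);
        for (int c = 0; c < 8; c++)
            tmp[r + c * 8] = out[c];
    }

    /* Column pass: each column is contiguous in the column-major layout. */
    for (int c = 0; c < 8; c++) {
        h264_idct8_1d(&tmp[c * 8], out);
        for (int r = 0; r < 8; r++) {
            uint8_t *p = &dst[r * stride + c];
            *p = clip_u8(*p + ((out[r] + 32) >> 6));
        }
    }
}
```

A quick sanity check is a DC-only block: with `block[0] = 64` and all other coefficients zero, both passes propagate the value unchanged, so every pixel is raised by `(64 + 32) >> 6 = 1`. The sketch leaves `block` untouched; whether the production kernel clears it afterwards is not covered here.
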
## NEON reference (M3 target)

FFmpeg's `ff_h264_idct8_add_neon`
(external/ffmpeg-snapshot/libavcodec/aarch64/h264idct_neon.S
line 267, ~60 instructions / pass × 2 + transpose + dst-add).
Signature mirrors cycle 6 IDCT 4×4:

```
void ff_h264_idct8_add_neon(uint8_t *dst, int16_t *block, ptrdiff_t stride);
```

Block: 64 int16, column-major (per cycle 6 Phase 9 lesson).

## 30fps@1080p H.264 8×8 floor

1920×1080 luma using all 8×8 transforms: 240 × 135 = 32 400
blocks/frame × 30 fps = 0.972 Mblock/s. Same as VP9 IDCT 8×8
(cycle 1) since the block density is the same.

**30fps@1080p floor: 0.972 Mblock/s.**

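
The floor is straightforward to recompute for other resolutions or block sizes. The sketch below counts luma blocks only and uses the frame dimensions as given (no padding to macroblock multiples), matching the arithmetic above; the function names are illustrative.

```c
/* Number of block_size x block_size luma blocks in one frame.
 * Dimensions are used as-is, with no padding to macroblock size. */
static int luma_blocks_per_frame(int width, int height, int block_size)
{
    return (width / block_size) * (height / block_size);
}

/* Required throughput in Mblock/s to transform every luma block
 * of every frame at the given frame rate. */
static double floor_mblock_per_s(int width, int height, int block_size,
                                 int fps)
{
    return (double)luma_blocks_per_frame(width, height, block_size)
           * (double)fps / 1e6;
}
```

For 1920×1080 at 30 fps with 8×8 blocks this gives 32 400 blocks per frame and 0.972 Mblock/s.
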
## Predicted R₇

Per the cycle 1 / cycle 6 patterns:

- VP9 IDCT 8×8 NEON M3 = 8.171 Mblock/s (cycle 1), per-block 122 ns
- H.264 IDCT 8×8 likely **less compute per block** than VP9 (no
  trig multiplies, just integer ops + shifts) → maybe 80-120 ns
  per block → 8-12 Mblock/s NEON
- QPU 8×8 IDCT R=0.92 GREEN in cycle 1 came from the matching
  16-lane / 8-row layout and shared-mem transpose
- H.264 IDCT 8×8 same shape → predicted **R₇ ≈ 0.5-0.9 YELLOW/GREEN**

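
For reference, the unit conversions behind these bullets can be sketched as follows. The function names are hypothetical, and the second helper assumes the definition used for R₆ in the cycle 6 note (R = NEON ns per block / QPU ns per block, which is the same as QPU throughput / NEON throughput).

```c
/* Convert a per-block time in ns to a throughput in Mblock/s. */
static double mblock_per_s_from_ns(double ns_per_block)
{
    return 1000.0 / ns_per_block;
}

/* QPU throughput implied by a predicted R and a NEON throughput,
 * assuming R = QPU throughput / NEON throughput. */
static double implied_qpu_mblock_s(double r, double neon_mblock_s)
{
    return r * neon_mblock_s;
}
```

An 80-120 ns NEON kernel corresponds to roughly 8.3-12.5 Mblock/s. Combining the ends of the predicted ranges (R₇ of 0.5-0.9, NEON at 8-12 Mblock/s) implies a QPU throughput somewhere around 4-10.8 Mblock/s, which is a derived range rather than a measurement.
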
## Acceptance for Phase 7

- M1: 100.0000% bit-exact (10000+ random blocks)
- M3: captured
- M2: captured
- R₇: classified
- M4: same-kernel mixed bench measured

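
One possible shape for the M1 bit-exactness check is a harness that feeds identical random inputs to a reference and a device-under-test kernel and counts differing blocks. Everything below is an assumption for illustration: the names, the xorshift generator, the roughly 12-bit coefficient range, and the two toy DC-only kernels, which exist only to exercise the harness and are not H.264 transforms.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef void (*idct8_add_fn)(uint8_t *dst, int16_t *block, ptrdiff_t stride);

/* Small xorshift PRNG so runs are reproducible from a seed. */
static uint32_t xorshift32(uint32_t *state)
{
    uint32_t x = *state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    return *state = x;
}

/* Runs both kernels on n_blocks random inputs and returns the number
 * of blocks whose 8x8 output differs. Zero means bit-exact. */
static unsigned count_mismatches(idct8_add_fn ref, idct8_add_fn dut,
                                 unsigned n_blocks, uint32_t seed)
{
    unsigned mismatches = 0;
    uint32_t rng = seed ? seed : 1u; /* xorshift state must be non-zero */

    for (unsigned n = 0; n < n_blocks; n++) {
        int16_t coef[64], blk_ref[64], blk_dut[64];
        uint8_t pix[64], dst_ref[64], dst_dut[64];

        for (int i = 0; i < 64; i++) {
            /* Assumed coefficient range: -2048..2047. */
            coef[i] = (int16_t)((int)(xorshift32(&rng) % 4096u) - 2048);
            pix[i] = (uint8_t)(xorshift32(&rng) & 0xFFu);
        }
        memcpy(blk_ref, coef, sizeof coef);
        memcpy(blk_dut, coef, sizeof coef);
        memcpy(dst_ref, pix, sizeof pix);
        memcpy(dst_dut, pix, sizeof pix);

        ref(dst_ref, blk_ref, 8);
        dut(dst_dut, blk_dut, 8);

        if (memcmp(dst_ref, dst_dut, sizeof dst_ref) != 0)
            mismatches++;
    }
    return mismatches;
}

/* Toy stand-in kernels: add a DC term to every pixel. */
static uint8_t toy_clip_u8(int v)
{
    return (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
}

static void toy_dc_add(uint8_t *dst, int16_t *block, ptrdiff_t stride)
{
    int dc = (block[0] + 32) >> 6;
    for (int r = 0; r < 8; r++)
        for (int c = 0; c < 8; c++)
            dst[r * stride + c] = toy_clip_u8(dst[r * stride + c] + dc);
}

/* Same as toy_dc_add but with the +32 rounding deliberately omitted. */
static void toy_dc_add_buggy(uint8_t *dst, int16_t *block, ptrdiff_t stride)
{
    int dc = block[0] >> 6;
    for (int r = 0; r < 8; r++)
        for (int c = 0; c < 8; c++)
            dst[r * stride + c] = toy_clip_u8(dst[r * stride + c] + dc);
}
```

In a real bench, `ref` would be the C reference and `dut` the NEON or V3D path; the M1 criterion corresponds to `count_mismatches(...) == 0` over 10000 or more blocks.
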
## Cycle 7 deliverables

1. `tests/h264_idct8_ref.c`: column-major C reference
2. `tests/bench_neon_h264idct8.c`: Phase 3 bench
3. `src/v3d_h264idct8.comp`: Phase 6 shader (likely close to
   v3d_idct8.comp shape, but with different butterfly + integer
   math instead of Q14 trig)
4. `tests/bench_v3d_h264idct8.c`: Phase 6+7 bench
5. M4 via `bench_concurrent_mixed.c` extension

## Phase 4 effort estimate

Higher than cycle 1's iterations because the 8×8 IT butterfly is
more involved (3 sub-stages vs cycle 1's IDCT8 single butterfly).
~3-4 hours through Phase 7. The Phase 5 Sonnet review is again
non-skippable per CLAUDE.md.

## Next step (within this phase)

Move to Phase 3 (NEON baseline M3) after writing the C reference.

## Future H.264 cycles (preview, post cycle 7)

- Cycle 8: H.264 chroma MC (4-tap; very lightweight; predicted
  RED per cycle 6 pattern but smaller still)
- Cycle 9: H.264 luma quarter-pel MC (6-tap; analogous to cycle 3
  VP9 MC which was RED; predicted RED)
- Cycle 10: H.264 in-loop deblock (analogous to cycle 2/4 VP9
  LPF which were GREEN; predicted GREEN)
- After cycle 10: scope re-evaluated based on cycle 7/10 results