2 Commits

Author SHA1 Message Date
marfrit 480f34f6e6 Cycle 7 (H.264 IDCT 8x8) opened — Phase 1 goal doc
Predicted R7 = 0.5-0.9 YELLOW/GREEN. Closely analogous to cycle 1
(VP9 IDCT 8x8 R=0.92 GREEN): same block size, same lane geometry,
same data shape. H.264 8x8 IT uses integer butterfly with 3
sub-stages (vs cycle 1's Q14 trig single butterfly) — more
compute per pass but simpler operations.

Phase 1 documents:
- Spec butterfly (e/f/g stages per H.264 §8.5.13)
- 30fps@1080p floor = 0.972 Mblock/s (same as cycle 1 since same
  block density)
- NEON ref = ff_h264_idct8_add_neon (already vendored in
  cycle 6's h264idct_neon.S)
- Cycle 8-10 preview: chroma MC, luma qpel MC, in-loop deblock

Phase 3 next session: write column-major C ref + bench, capture
M1 + M3. Then Phase 4 plan (likely cycle-1 v3d_idct8.comp adapted
to integer butterfly), Phase 5 review, Phase 6 implement, Phase 7
measure.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 14:15:37 +00:00
marfrit 7288473d79 Cycle 6 closed (deferred Phase 4): IDCT 4x4 too small for QPU
Phase 4 QPU shader DEFERRED (not RED-by-build, but predicted-RED
and not worth building):
- NEON delivers 175 Mblock/s (5.7 ns/block) on a single core
- QPU per-block floor ~250 ns (from cycle 1 scaling) → R6 = 0.022
- Mixed-kernel helper contribution would be ~1-2 Mblock/s, roughly 1%
  of NEON capacity
- 30fps@1080p worst case = 5.85 Mblock/s; NEON delivers 30x that
  on ONE core. No need for QPU help.

Phase 9 lesson: for any cycle with NEON per-block < ~30ns, predict
deep RED and defer Phase 4 unless there's a specific structural
QPU advantage. Shapes future cycle selection: prefer compute-heavy
kernels (cycle 7 H.264 IDCT 8x8 next; cycle 9 luma qpel MC; cycle
10 deblock).

Cycle 6 phase tally: Phase 1 ✓, Phase 2 implicit, Phase 3 ✓
(M1 + M3), Phase 4 DEFERRED, Phase 5-7 N/A, Phase 8 trivial
CPU-only (recipe = stay CPU), Phase 9 ✓.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 14:15:25 +00:00
2 changed files with 227 additions and 0 deletions
+97
@@ -0,0 +1,97 @@
---
cycle: 6
phase: 4 (decision: defer)
status: deferred 2026-05-18 — kernel too lightweight to amortize QPU dispatch
date_opened: 2026-05-18
date_decision: 2026-05-18
parent: k6_h264idct4_phase3.md
---
# Cycle 6, Phase 4 — DEFERRED
## The decision
After M3 captured (175 Mblock/s on a single NEON core, 5.7 ns per
block), Phase 4 (QPU shader plan) is **deferred** because the
kernel is too lightweight to make QPU offload worthwhile.
## Reasoning
V3D Vulkan dispatch overhead per call ≈ 30 µs (from cycle 1 M5
measurement, `tests/bench_vulkan_dispatch.c`). To break even
against NEON at 175 Mblock/s, a single dispatch would need to
process at least:
30 µs × 175 Mblock/s = 5 250 blocks per dispatch
That batch size is feasible, but amortizing the dispatch is not
enough: the QPU must also beat NEON's per-block cost, and:
- NEON does 5.7 ns/block. To beat NEON, QPU needs < 5.7 ns/block
amortized = ~175 Mblock/s.
- QPU per-block estimate (from cycle 1 scaling): even small kernels
hit 50+ instructions per block. V3D 7.1's effective rate for scalar
work is ~500 MHz (~1 cycle per ALU per lane at 2 threads). With
16 lanes/sg × 8 sg/WG = 128 instruction-equivalents per block, that
is 128 / 500 MHz = 256 ns per block at peak utilization, about
45× slower than NEON.
- Predicted R₆ = 5.7 / 256 = **0.022 → deep RED**.
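
The arithmetic above can be reproduced with a few helpers. This is a minimal sketch, not code from the repo: the function names are hypothetical, and the inputs (30 µs dispatch, 175 Mblock/s, 128 instruction-equivalents, ~500 MHz effective) are simply the figures quoted in this doc.

```c
#include <stdio.h>

/* Dispatch cost (us) times NEON rate (Mblock/s) = number of blocks NEON
 * finishes in the time one dispatch costs. The 1e-6 and 1e6 cancel. */
static double breakeven_blocks(double dispatch_us, double neon_mblock_s)
{
    return dispatch_us * neon_mblock_s;
}

/* Per-block time in ns for a given instruction-equivalent count at an
 * effective clock given in MHz. */
static double qpu_ns_per_block(double inst_equiv, double effective_mhz)
{
    return inst_equiv * 1000.0 / effective_mhz;
}

/* R as used in this doc: NEON ns/block divided by QPU ns/block. */
static double predicted_r(double neon_ns, double qpu_ns)
{
    return neon_ns / qpu_ns;
}

/* Print the cycle 6 estimate from the figures quoted in the text. */
static void print_cycle6_estimate(void)
{
    const double qpu_ns = qpu_ns_per_block(128.0, 500.0);
    printf("break-even blocks/dispatch: %.0f\n", breakeven_blocks(30.0, 175.0));
    printf("QPU ns/block: %.0f\n", qpu_ns);
    printf("predicted R6: %.3f\n", predicted_r(5.7, qpu_ns));
}
```

Calling `print_cycle6_estimate()` from a small `main` gives the break-even batch size, the QPU per-block floor, and R₆ in one place, which makes it easy to re-run the estimate if the dispatch overhead is remeasured.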
Even if mixed-kernel M4 (Issue 003) is more favorable, the
contribution would be:
- Best-case QPU CDEF helper was 0.42 Mblock/s (cycle 5)
- IDCT 4×4 QPU helper likely similar scale: ~1-2 Mblock/s
- vs NEON's 175 Mblock/s headroom on a single core
- Net: QPU helper adds roughly 1 % (0.6-1.1 %) to NEON's capacity for this kernel
## Recipe verdict for cycle 6
**CPU NEON, no QPU dispatch path needed in the V4L2 wrapper.**
H.264 4×4 IDCT is so lightweight on NEON that a single CPU core
delivers 30× the 1080p30 worst-case requirement. No realistic
benefit from QPU offload.
## What's left open
- Issue 004 (if ever filed): wide-batch QPU IDCT 4×4 — process
256 or 1024 blocks per dispatch to amortize call overhead, see
if amortized throughput beats NEON. Likely still RED but
potentially YELLOW if V3D's scalar ALU can keep up with the
tiny butterfly. Low priority; not blocking.
- Future re-evaluation: if Phase 8 V4L2 deployment finds NEON
fully saturated by other H.264 kernels (entropy + MC + deblock),
IDCT 4×4 QPU offload becomes more attractive as a CPU-relief
measure even at neutral throughput.
## Phase 9 lesson
**Predicted R for very lightweight kernels (per-block ns < ~30) is
likely deep RED regardless of how well the kernel maps to V3D
compute, because the per-block QPU floor (~250 ns) is dominated
by overheads that NEON avoids by virtue of being on the same
substrate as the data.**
Generalisation: for daedalus-fourier going forward, any new kernel
with NEON per-block < 30 ns can be predicted RED and Phase 4
deferred unless there's a specific structural reason QPU might be
faster (e.g., parallel ops that NEON can't pack).
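
The deferral rule can be written as a small predicate. A sketch only: the function name and the `structural_qpu_advantage` flag are hypothetical, and the 30 ns threshold is the approximate figure from the lesson above, not a measured constant.

```c
#include <stdbool.h>

/* Approximate threshold from the Phase 9 lesson (assumption: treated as a
 * hard cut-off here, although the text gives it as "~30 ns"). */
#define DEFER_NEON_NS_THRESHOLD 30.0

/* Returns true when Phase 4 (QPU shader plan) should be deferred:
 * NEON is already very fast per block and there is no specific structural
 * reason to expect the QPU to do better. */
static bool should_defer_phase4(double neon_ns_per_block,
                                bool structural_qpu_advantage)
{
    if (structural_qpu_advantage)
        return false;
    return neon_ns_per_block < DEFER_NEON_NS_THRESHOLD;
}
```

With the cycle 6 figure (5.7 ns/block, no structural advantage) the predicate returns true; with the cycle 1 figure (122 ns/block) it returns false.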
This shapes future cycle selection: prefer COMPUTE-HEAVY kernels
where QPU has a chance to add value. For H.264, that points
toward IDCT 8×8 (cycle 7), 6-tap MC (cycle 9), or in-loop deblock
(cycle 10).
## Cycle 6 closure
- Phase 1 ✓ goal doc
- Phase 2 implicit (vendored kernel)
- Phase 3 ✓ M3 = 175 Mblock/s, M1 PASS
- Phase 4 DEFERRED (this doc)
- Phases 5-7 N/A
- Phase 8 (deployment): CPU path via existing `daedalus_dispatch_*`
in include/daedalus.h. (Wiring for cycle 6 = trivial CPU-only
shim; deferred until V4L2 wrapper actually exists.)
- Phase 9 lesson encoded above
**Cycle 6 status: closed. Move on to cycle 7.**
+130
@@ -0,0 +1,130 @@
---
cycle: 7
phase: 1
status: open
date_opened: 2026-05-18
codec: H.264
kernel: IDCT 8x8 + add (High-profile residual)
parent: project_h264_scope_added.md (memory)
predicted_R: 0.5-0.9 (YELLOW/GREEN), comparable to VP9 IDCT 8x8 (cycle 1, R=0.92)
---
# Cycle 7, Phase 1 — H.264 IDCT 8×8 + add
Second H.264 kernel. 8×8 inverse integer transform used in
High-profile H.264 (most modern H.264 content is encoded as High
profile: broadcast TV, web streams, file media). Smaller scope
than IDCT 4×4 but much more compute-heavy per block.
## Why IDCT 8x8 next
- Closely analogous to **cycle 1 (VP9 IDCT 8×8) which was R=0.92
GREEN**. Best candidate for a near-immediate H.264 GREEN result.
- 64 coefficients per block (8×8) = same data shape as cycle 1.
- Integer butterfly (no trig multiplies) but more sub-stages than
4×4. Per-block compute weight ~3-5× the 4×4.
- H.264 High-profile uses IDCT 8×8 for ~40-60 % of residual blocks
(encoder choice). Decoder must support it for spec compliance.
## Kernel contract
Per H.264 spec §8.5.13 (8x8 inverse integer transform). 1D
butterfly (g[0..7] from input d[0..7]):
```
e[0] = d[0] + d[4]
e[1] = -d[3] + d[5] - d[7] - (d[7] >> 1)
e[2] = d[0] - d[4]
e[3] = d[1] + d[7] - d[3] - (d[3] >> 1)
e[4] = (d[2] >> 1) - d[6]
e[5] = -d[1] + d[7] + d[5] + (d[5] >> 1)
e[6] = d[2] + (d[6] >> 1)
e[7] = d[3] + d[5] + d[1] + (d[1] >> 1)
f[0] = e[0] + e[6]
f[1] = e[1] + (e[7] >> 2)
f[2] = e[2] + e[4]
f[3] = e[3] + (e[5] >> 2)
f[4] = e[2] - e[4]
f[5] = (e[3] >> 2) - e[5]
f[6] = e[0] - e[6]
f[7] = e[7] - (e[1] >> 2)
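g[0] = f[0] + f[7]
g[1] = f[2] + f[5]
g[2] = f[4] + f[3]
g[3] = f[6] + f[1]
g[4] = f[6] - f[1]
g[5] = f[4] - f[3]
g[6] = f[2] - f[5]
g[7] = f[0] - f[7]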
```
Applied row-pass then column-pass (per H.264/FFmpeg convention,
with column-major block).
Final: dst[r,c] = clip(dst[r,c] + ((g_2d[r,c] + 32) >> 6)), with clip to [0, 255] for 8-bit dst.
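
A plain C sketch of the contract above, for orientation only. It is not the planned `tests/h264_idct8_ref.c`; the names `idct8_1d`, `clip_u8` and `h264_idct8_add_sketch` are hypothetical. The e and f stages are copied from the listing, and the g stage uses the output pairing from the H.264 8x8 transform equations. Assumptions: element (r, c) of the column-major block lives at `block[c * 8 + r]`, `>>` on negative `int` is an arithmetic shift (true for GCC and Clang, implementation-defined in ISO C), and the block is left unmodified.

```c
#include <stddef.h>
#include <stdint.h>

/* 1-D 8-point inverse integer transform, in place: e, f, then g stage. */
static void idct8_1d(int v[8])
{
    const int e0 = v[0] + v[4];
    const int e1 = -v[3] + v[5] - v[7] - (v[7] >> 1);
    const int e2 = v[0] - v[4];
    const int e3 = v[1] + v[7] - v[3] - (v[3] >> 1);
    const int e4 = (v[2] >> 1) - v[6];
    const int e5 = -v[1] + v[7] + v[5] + (v[5] >> 1);
    const int e6 = v[2] + (v[6] >> 1);
    const int e7 = v[3] + v[5] + v[1] + (v[1] >> 1);

    const int f0 = e0 + e6;
    const int f1 = e1 + (e7 >> 2);
    const int f2 = e2 + e4;
    const int f3 = e3 + (e5 >> 2);
    const int f4 = e2 - e4;
    const int f5 = (e3 >> 2) - e5;
    const int f6 = e0 - e6;
    const int f7 = e7 - (e1 >> 2);

    /* g stage: even-derived and odd-derived f terms combined pairwise. */
    v[0] = f0 + f7;  v[7] = f0 - f7;
    v[1] = f2 + f5;  v[6] = f2 - f5;
    v[2] = f4 + f3;  v[5] = f4 - f3;
    v[3] = f6 + f1;  v[4] = f6 - f1;
}

static uint8_t clip_u8(int x)
{
    return (uint8_t)(x < 0 ? 0 : (x > 255 ? 255 : x));
}

/* Row pass, then column pass, then round (+32, >>6), add to dst and clip. */
static void h264_idct8_add_sketch(uint8_t *dst, const int16_t *block,
                                  ptrdiff_t stride)
{
    int tmp[64]; /* same column-major layout as block */
    int v[8];

    for (int r = 0; r < 8; r++) {            /* row pass */
        for (int c = 0; c < 8; c++) v[c] = block[c * 8 + r];
        idct8_1d(v);
        for (int c = 0; c < 8; c++) tmp[c * 8 + r] = v[c];
    }
    for (int c = 0; c < 8; c++) {            /* column pass + add */
        for (int r = 0; r < 8; r++) v[r] = tmp[c * 8 + r];
        idct8_1d(v);
        for (int r = 0; r < 8; r++) {
            uint8_t *p = dst + r * stride + c;
            *p = clip_u8(*p + ((v[r] + 32) >> 6));
        }
    }
}
```

A quick sanity check is a DC-only block: with `block[0] = 64` and all other coefficients zero, both passes propagate 64 to every position, so each pixel moves by (64 + 32) >> 6 = 1.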
## NEON reference (M3 target)
FFmpeg's `ff_h264_idct8_add_neon`
(external/ffmpeg-snapshot/libavcodec/aarch64/h264idct_neon.S
line 267, ~60 instructions / pass × 2 + transpose + dst-add).
Signature mirrors cycle 6 IDCT 4×4:
```
void ff_h264_idct8_add_neon(uint8_t *dst, int16_t *block, ptrdiff_t stride);
```
Block: 64 int16, column-major (per cycle 6 Phase 9 lesson).
## 30fps@1080p H.264 8×8 floor
1920×1080 luma using all 8×8 transforms: 240 × 135 = 32 400
blocks/frame × 30 fps = 0.972 Mblock/s. Same as VP9 IDCT 8×8
(cycle 1) since the block density is the same.
**30fps@1080p floor: 0.972 Mblock/s.**
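
The floor is a one-line calculation. A minimal sketch with hypothetical function names, counting luma blocks only as the text does and using integer division for the block grid (1920×1080 divides evenly by 8, so nothing is truncated here).

```c
#include <stdint.h>

/* Number of block x block tiles in a width x height luma plane. */
static uint32_t blocks_per_frame(uint32_t width, uint32_t height,
                                 uint32_t block)
{
    return (width / block) * (height / block);
}

/* Required throughput in Mblock/s at the given frame rate. */
static double floor_mblock_per_s(uint32_t width, uint32_t height,
                                 uint32_t block, uint32_t fps)
{
    return (double)blocks_per_frame(width, height, block) * fps / 1e6;
}
```

`floor_mblock_per_s(1920, 1080, 8, 30)` reproduces the 0.972 Mblock/s figure from 32 400 blocks per frame.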
## Predicted R₇
Per the cycle 1 / cycle 6 patterns:
- VP9 IDCT 8×8 NEON M3 = 8.171 Mblock/s (cycle 1), per-block 122 ns
- H.264 IDCT 8×8 likely **less compute per block** than VP9 (no
trig multiplies, just integer ops + shifts) → maybe 80-120 ns
per block → 8-12 Mblock/s NEON
- QPU 8×8 IDCT R=0.92 GREEN in cycle 1 came from the matching
16-lane / 8-row layout and shared-mem transpose
- H.264 IDCT 8×8 same shape → predicted **R₇ ≈ 0.5-0.9 YELLOW/GREEN**
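
The ns-per-block and Mblock/s figures in this list are reciprocals of each other. A small sketch with hypothetical helper names makes the conversion explicit: 8.171 Mblock/s is about 122 ns per block, and the guessed 80-120 ns range maps to roughly 12.5-8.3 Mblock/s, which the text rounds to 8-12.

```c
/* Single-core throughput in Mblock/s for a given per-block time in ns. */
static double ns_to_mblock_per_s(double ns_per_block)
{
    return 1000.0 / ns_per_block;
}

/* Per-block time in ns for a given throughput in Mblock/s. */
static double mblock_per_s_to_ns(double mblock_per_s)
{
    return 1000.0 / mblock_per_s;
}
```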
## Acceptance for Phase 7
- M1: 100.0000% bit-exact (10000+ random blocks)
- M3: captured
- M2: captured
- R₇: classified
- M4: same-kernel mixed bench measured
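
One way the M1 bit-exact check could be structured is sketched below. This is an assumption about the harness shape, not the repo's bench code: `count_mismatched_blocks`, `sat_u8` and the two toy kernels are hypothetical stand-ins so the example is self-contained, and the coefficient range (-512..511) is arbitrary. In practice `ref` would be the C reference and `dut` the NEON or QPU path, both with the `ff_h264_idct8_add_neon`-style signature.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef void (*idct8_add_fn)(uint8_t *dst, int16_t *block, ptrdiff_t stride);

static uint8_t sat_u8(int x)
{
    return (uint8_t)(x < 0 ? 0 : (x > 255 ? 255 : x));
}

/* Toy stand-in kernel: adds the rounded DC term to every pixel. */
static void toy_dc_add(uint8_t *dst, int16_t *block, ptrdiff_t stride)
{
    const int dc = (block[0] + 32) >> 6;
    for (int r = 0; r < 8; r++)
        for (int c = 0; c < 8; c++)
            dst[r * stride + c] = sat_u8(dst[r * stride + c] + dc);
}

/* Same toy kernel with the +32 rounding dropped: a deliberately wrong DUT. */
static void toy_dc_add_no_round(uint8_t *dst, int16_t *block, ptrdiff_t stride)
{
    const int dc = block[0] >> 6;
    for (int r = 0; r < 8; r++)
        for (int c = 0; c < 8; c++)
            dst[r * stride + c] = sat_u8(dst[r * stride + c] + dc);
}

/* Runs both kernels on identical random inputs and returns the number of
 * blocks whose 8x8 output differs in any byte. M1 passes when this is 0. */
static size_t count_mismatched_blocks(idct8_add_fn ref, idct8_add_fn dut,
                                      size_t n_blocks, unsigned seed)
{
    size_t bad = 0;
    srand(seed);
    for (size_t i = 0; i < n_blocks; i++) {
        int16_t coef_ref[64], coef_dut[64];
        uint8_t dst_ref[64], dst_dut[64];
        for (int k = 0; k < 64; k++) {
            coef_ref[k] = (int16_t)(rand() % 1024 - 512);
            dst_ref[k] = (uint8_t)(rand() & 0xff);
        }
        /* Each kernel gets its own copy in case it modifies the block. */
        memcpy(coef_dut, coef_ref, sizeof coef_ref);
        memcpy(dst_dut, dst_ref, sizeof dst_ref);
        ref(dst_ref, coef_ref, 8);
        dut(dst_dut, coef_dut, 8);
        if (memcmp(dst_ref, dst_dut, sizeof dst_ref) != 0)
            bad++;
    }
    return bad;
}
```

Comparing a kernel against itself returns 0 mismatches, while the rounding-free variant is caught within a handful of blocks. Uniform random coefficients are a weak stress test, so a real M1 run would likely add edge cases such as saturating inputs.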
## Cycle 7 deliverables
1. `tests/h264_idct8_ref.c` — column-major C reference
2. `tests/bench_neon_h264idct8.c` — Phase 3 bench
3. `src/v3d_h264idct8.comp` — Phase 6 shader (likely close to
v3d_idct8.comp shape, but with different butterfly + integer
math instead of Q14 trig)
4. `tests/bench_v3d_h264idct8.c` — Phase 6+7 bench
5. M4 via `bench_concurrent_mixed.c` extension
## Phase 4 effort estimate
Higher than cycle 1's iterations because the 8×8 IT butterfly is
more involved (3 sub-stages vs cycle 1's IDCT8 single butterfly).
~3-4 hours through Phase 7. The Phase 5 Sonnet review is again
non-skippable per CLAUDE.md.
## Next step (within this phase)
Move to Phase 3 (NEON baseline M3) after writing the C reference.
## Future H.264 cycles (preview, post cycle 7)
- Cycle 8 — H.264 chroma MC (4-tap; very lightweight; predicted
RED per cycle 6 pattern but smaller still)
- Cycle 9 — H.264 luma quarter-pel MC (6-tap; analogous to cycle 3
VP9 MC which was RED; predicted RED)
- Cycle 10 — H.264 in-loop deblock (analogous to cycle 2/4 VP9
LPF which were GREEN; predicted GREEN)
- After cycle 10: scope re-evaluated based on cycle 7/10 results