Files
marfrit 5223d3cb3f Cycle 5 closed: CDEF QPU R5=0.116 ORANGE, opportunistic helper
Phase 4 plan with 3 Phase-5 REDs applied inline:
  - meta layout: m.z=tmp_off, m.w=dir
  - sec_shift clamped to >=0 (NEON uqsub semantics)
  - directions table as const ivec2[14], not OR-packed

Phase 6 deliverable: v3d_cdef.comp (387 inst, 2 threads, no spills).
3-way M1 (QPU vs C ref vs NEON) PASS 4096/4096.

M2: 0.443 Mblock/s -> R5 = 0.116 ORANGE (predicted 0.02-0.05 RED).
M4 same-kernel: NEON-3+QPU 8.46 < NEON-4 alone ~10 (negative).
M4 mixed (NEON-3 MC + QPU CDEF): CPU 34.17 Mblock/s MC,
  QPU 0.42 Mblock/s CDEF helper. CPU side higher than the
  Issue 003 NEON-fallback proxy suggested - cross-substrate
  contention is gentler than same-side NEON contention.

Verdict: CDEF stays on CPU; QPU dispatch path exists for
opportunistic use. Deployment recipe table updated for all 5
cycles. Phase 9 lessons: linear extrapolation across cycles is
too pessimistic; CDEF is bandwidth-bound on NEON despite high
per-block ns; real-substrate-cross contention < NEON-proxy
contention.

- src/v3d_cdef.comp: cycle 5 QPU shader
- tests/bench_v3d_cdef.c: 3-way M1, M2 bench
- tests/bench_concurrent_mixed.c: K_CDEF on both sides
- tests/cdef_ref.c + bench_neon_cdef.c: sec_shift clamp +
  expanded damping range to exercise the edge case
- CMakeLists.txt: v3d_cdef.spv + bench_v3d_cdef wiring
- docs/k5_cdef_phase4.md updated with Phase 5 review applied
- docs/k5_cdef_phase7.md: closure doc with full verdict matrix

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 13:52:46 +00:00

7.8 KiB
Raw Permalink Blame History

cycle, phase, status, date_opened, date_closed, parent, host, verdict
cycle phase status date_opened date_closed parent host verdict
5 7 closed 2026-05-18 — M1 PASS, R₅=0.116 ORANGE, M4 same-kernel NEGATIVE, M4 mixed-kernel POSITIVE 2026-05-18 2026-05-18 k5_cdef_phase6 (no doc — phase 6 is the shader + bench commit) hertz CDEF baseline = CPU; QPU dispatch path exists for opportunistic use. Better than predicted (ORANGE not RED).

Cycle 5, Phase 7 — Verification (CDEF on V3D)

Phase 6 deliverable

  • src/v3d_cdef.comp — 256 inv/WG, 4 blocks/WG, no barrier, uint16 tmp via GL_EXT_shader_16bit_storage, uint8 dst.
  • tests/bench_v3d_cdef.c — 3-way M1 (QPU vs C ref vs NEON) per Phase 5 YELLOW-1, M2 throughput, R₅ band classifier.
  • tests/bench_concurrent_mixed.c extended with K_CDEF on both CPU and QPU sides for M4.

shaderdb:

SHADER-DB-4a79c02a... 387 inst, 2 threads, 0 loops, 133 uniforms,
  21 max-temps, 0:0 spills:fills, 0 sfu-stalls, 5 nops

2 threads (not 4 as plan hoped) — register pressure same as cycle 3 MC. 133 uniforms under the 144 gate. No spills.

M1 — 3-way bit-exact

=== M1₅: QPU vs C-ref vs NEON 3-way ===
  C ref vs NEON parity check: 0/4096 mismatches
  QPU vs C ref: 4096 / 4096 blocks bit-exact (100.0000%)
  QPU vs NEON:  4096 / 4096 blocks bit-exact (100.0000%)

All three implementations agree. Phase 5 RED-1, RED-2, RED-3 fixes verified (meta layout, sec_shift clamp, ivec2 dirs table).

M2 — QPU throughput

=== M2₅: QPU throughput ===
  blocks/dispatch: 4096
  iters:           50
  total blocks:    204 800
  elapsed (kernel)=0.462 s
  M2₅ throughput  = 0.443 Mblock/s
  per-block       = 2256.1 ns
  per-dispatch    = 9241.0 us

R₅ = 0.443 / 3.809 = 0.116 → ORANGE band.

Better than predicted (Phase 4 estimated R₅ = 0.02-0.05, deep RED). The prediction was extrapolated from cycle 3 MC's R₃ = 0.067 × scaling for higher per-block compute weight. The actual QPU overhead per block (387 inst at 2 threads) doesn't scale as badly as that linear projection suggested — likely because the constrain() inner loop has less filter-coefficient overhead than MC's 8-tap subpel and the 16-bit tmp loads are well-suited to the V3D 7.1 storage path.

30fps@1080p floor: 0.443 / 0.972 = 0.46× margin (isolation). Below the user-facing floor as sole substrate. But CDEF is not commonly applied to every block in real video — it's strength-gated per superblock. Effective CDEF rate in real content is often < 0.5 Mblock/s. Within reach.

M4 — concurrent matrix

All windows 6 s, hertz, bench_concurrent_mixed.

M4 same-kernel (cycle 5 closure)

Config CPU CDEF agg QPU CDEF total per-core CPU
NEON-3 + QPU 8.080 0.381 8.461 2.69 avg
NEON-4 + QPU 7.866 0.385 8.251 1.97 avg

NEON-3 + QPU > NEON-4 + QPU (8.46 > 8.25). NEON CDEF is bandwidth-saturated at 4 cores despite per-block compute weight (262 ns) suggesting compute-bound — the per-core throughput drop from 2.69 (NEON-3) to 1.97 (NEON-4) confirms it. Same pattern as cycle 1 IDCT and cycle 2 LPF.

Without a "no QPU" baseline in this bench (rerun with cycle 5's M3 alone gives 3.8 Mblock/s per core × 4 ≈ 15 Mblock/s theoretical), the same-kernel M4 verdict:

  • NEON-4 alone CDEF estimated ~9-10 Mblock/s (saturation reduces from theoretical 15 to actual; matches per-core 2.5 trend)
  • NEON-3 + QPU CDEF (8.46) is below NEON-4 alone
  • Same-kernel M4: NEGATIVE

This matches the pessimistic same-kernel-bench framing (feedback_m4_same_kernel_worst_case.md).

M4 mixed-kernel (deployment shape)

Config CPU side CPU agg QPU CDEF
NEON-3 MC + QPU CDEF MC 34.17 Mblock/s 0.424 Mblock/s
NEON-3 LPF4 + QPU CDEF LPF4 31.48 Medge/s 0.414 Mblock/s

QPU CDEF contributes 0.41-0.42 Mblock/s while the CPU side runs near-maximum throughput. Compare against Issue 003 V1/V2 NEON-fallback proxy (1.7 Mblock/s): the real QPU CDEF is ~4× weaker than the NEON-on-core-3 proxy estimated, but still positive helper value.

CPU MC agg in this mixed config (34.17 Mblock/s) is higher than CPU MC in Issue 003 V1 (24.49) — because the V1 proxy used NEON on core 3 which contended on the CPU memory bus, whereas the real QPU contends on the QPU side. Real-substrate-cross contention is gentler than NEON-core-3 proxy contention. Issue 003 V1/V2 numbers underestimated CPU side, but correctly overestimated QPU helper magnitude.

Verdict

Rule Result Status
M1 bit-exact (3-way) 100.00% on 4096 blocks ✓ PASS
R₅ = M2₅/M3₅ 0.116 (ORANGE) better than predicted
M4 same-kernel NEGATIVE (8.46 < ~10) ✗ FAIL gate
M4 mixed-kernel (CPU=MC) +0.42 Mblock/s QPU helper ✓ POSITIVE
30fps@1080p floor (isolation) 0.46× ✗ FAIL as sole substrate
30fps@1080p floor (CPU baseline) 8.46 / 0.972 = 8.7× ✓ PASS via CPU

Engineering verdict: CDEF QPU offload viable as opportunistic helper; CPU NEON remains primary substrate. Phase 8 V4L2 wrapper should expose CDEF QPU dispatch path, but scheduler defaults to CPU CDEF.

Surprise (positive): cycle 5 came in better than predicted (ORANGE not RED). The "compute-bound → QPU bad" classification held at the broad level, but the magnitude was less severe than extrapolated.

Deployment recipe update

Cycle Kernel Primary QPU dispatch path Verdict
1 IDCT 8×8 QPU yes M4 +7.2 % validated
2 LPF wd=4 QPU yes M4 +6.9 % validated; V4 confirmed
3 MC 8h CPU exists, unused QPU MC = 0.39 Mblock/s under any contention
4 LPF wd=8 QPU yes M4 +4.1 % validated
5 CDEF CPU exists, opportunistic QPU CDEF = 0.42 Mblock/s mixed, ~half-floor on its own

Phase 9 lessons

  1. Predictions extrapolated linearly from one cycle can be too pessimistic. Cycle 3 MC R₃ = 0.067 extrapolated → R₅ = 0.02-0.05 predicted; actual R₅ = 0.116. The "compute-bound" axis isn't a single dimension — CDEF and MC are both compute-bound but have different inner-loop shapes that affect V3D compiled code differently.

  2. CDEF is bandwidth-bound on NEON despite high per-block ns. Per-block 262 ns suggested "compute-bound" but per-core saturation at 4 cores (2.5 → 2.0 Mblock/s) shows the real constraint is memory bandwidth (192 u16 × 64 lanes/core reads

    • 64 byte writes per block). This is a re-calibration of the bandwidth-bound/compute-bound classification: the binary categorization needs nuance.
  3. Real-substrate-cross contention is gentler than same-side NEON proxy. Issue 003 V1/V2 used NEON-on-core-3 as a "QPU helper" proxy; that overestimated the QPU's helper magnitude (because NEON-on-core-3 has more parallelism than QPU) but underestimated the CPU side throughput (because NEON-on-core-3 contended on the CPU memory bus). The real QPU gives lower helper throughput but does NOT hurt the CPU side at all.

  4. 3-way M1 (QPU vs C ref vs NEON) caught nothing — but it would have caught the Phase 5 REDs cleanly. The Phase 5 review's recommendation (YELLOW-1) was correct prudence; in this case the Phase 5 fixes prevented all bugs the gate would have caught, but the 3-way structure is the right discipline going forward.

What lands in this commit

  • src/v3d_cdef.comp (Phase 6 shader, 387 inst, 2 threads)
  • tests/bench_v3d_cdef.c (3-way M1, M2, R₅ classifier)
  • tests/bench_concurrent_mixed.c extended with K_CDEF on both sides; uses real QPU CDEF (Issue 003 NEON fallback removed)
  • CMakeLists.txt: build wiring for v3d_cdef.spv + bench_v3d_cdef
  • docs/k5_cdef_phase7.md (this doc) — Phase 7 closure
  • Memory: update feedback_m4_same_kernel_worst_case.md with cycle 5 real-QPU numbers (Issue 003 V1/V2 fallback proxy obsolete).