3 Commits

Author SHA1 Message Date
marfrit 373f63a910 Cycle 8 closed: H.264 deblock R8=0.061 RED, opportunistic helper
Phase 6 deliverable: v3d_h264deblock.comp (132 inst, 4 threads,
no spills). Phase 5 REDs applied:
  RED-1: explicit clamp p1'/q1' to [0,255] before uint8 write
  RED-2: bench-enforced m.x >= 4*stride contract

M1: 3-way 4096/4096 bit-exact (QPU vs C ref AND vs NEON).
M2: 5.629 Medge/s isolation → R8 = 0.061 RED (predicted 0.09-0.14).
    Lower than prediction; H.264 deblock has 4 early-return paths +
    2 conditional writes that hurt V3D branchy execution more than
    expected.

M4 same-kernel: NEON-3+QPU 12.81 Medge/s ≈ pure-NEON-4 ~12-15
  (neutral).

M4 MIXED (real H.264 deployment shape): CPU=MC + QPU=h264deblock
  gives CPU MC 25.11 Mblock/s + QPU h264deblock 6.23 Medge/s.
  QPU contribution is essentially unchanged from isolation —
  the cross-substrate contention is gentle (consistent with
  Issue 003's V4 finding).

Verdict: H.264 deblock = opportunistic QPU helper. Same recipe
slot as cycle 5 CDEF. 6 Medge/s helper = 85% of single-NEON-core
deblock capacity, available when CPU is busy with other work.

Cycles 1-8 deployment recipe complete:
  Primary QPU: cycles 1+2+4 (VP9 IDCT/LPF, all bandwidth-bound)
  Primary CPU: cycles 3+6+7 (compute-heavy or trivially fast on NEON)
  Opportunistic helper: cycles 5+8 (CDEF, H.264 deblock)

Phase 9 lessons added:
  - Branchy kernels underperform V3D vs straight-line ones
  - Mixed-kernel helper value scales with isolation M2, not
    same-kernel M4
  - R prediction needs branchiness weight, not just compute density

- src/v3d_h264deblock.comp (132 inst QPU shader)
- tests/bench_v3d_h264deblock.c (3-way M1 + M2 + R classification)
- tests/bench_concurrent_mixed.c extended with K_H264DEBLOCK
- CMakeLists.txt: v3d_h264deblock.spv + bench_v3d_h264deblock
  + h264dsp linked into bench_concurrent_mixed
- docs/k8_h264deblock_phase7.md (full closure with cycles 1-8 recipe)

Next: Phase 8 — V4L2 wrapper / deployment infra. Public API
already exposes recipe-default substrate per kernel.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 14:44:21 +00:00
marfrit 5223d3cb3f Cycle 5 closed: CDEF QPU R5=0.116 ORANGE, opportunistic helper
Phase 4 plan with 3 Phase-5 REDs applied inline:
  - meta layout: m.z=tmp_off, m.w=dir
  - sec_shift clamped to >=0 (NEON uqsub semantics)
  - directions table as const ivec2[14], not OR-packed

Phase 6 deliverable: v3d_cdef.comp (387 inst, 2 threads, no spills).
3-way M1 (QPU vs C ref vs NEON) PASS 4096/4096.

M2: 0.443 Mblock/s -> R5 = 0.116 ORANGE (predicted 0.02-0.05 RED).
M4 same-kernel: NEON-3+QPU 8.46 < NEON-4 alone ~10 (negative).
M4 mixed (NEON-3 MC + QPU CDEF): CPU 34.17 Mblock/s MC,
  QPU 0.42 Mblock/s CDEF helper. CPU side higher than the
  Issue 003 NEON-fallback proxy suggested - cross-substrate
  contention is gentler than same-side NEON contention.

Verdict: CDEF stays on CPU; QPU dispatch path exists for
opportunistic use. Deployment recipe table updated for all 5
cycles. Phase 9 lessons: linear extrapolation across cycles is
too pessimistic; CDEF is bandwidth-bound on NEON despite high
per-block ns; real-substrate-cross contention < NEON-proxy
contention.

- src/v3d_cdef.comp: cycle 5 QPU shader
- tests/bench_v3d_cdef.c: 3-way M1, M2 bench
- tests/bench_concurrent_mixed.c: K_CDEF on both sides
- tests/cdef_ref.c + bench_neon_cdef.c: sec_shift clamp +
  expanded damping range to exercise the edge case
- CMakeLists.txt: v3d_cdef.spv + bench_v3d_cdef wiring
- docs/k5_cdef_phase4.md updated with Phase 5 review applied
- docs/k5_cdef_phase7.md: closure doc with full verdict matrix

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 13:52:46 +00:00
marfrit 2dd774a9ab Issue 003 closed: mixed-kernel M4 validates V4 deployment shape
bench_concurrent_mixed runs NEON-N on kernel A + QPU on kernel B
concurrently. Matrix on hertz:

  V3 (CPU MC + QPU MC same-kernel): CPU 22.64 + QPU 0.39 Mblock/s
  V4 (CPU MC + QPU LPF4):            CPU 27.87 + QPU 12.74 Medge/s
  V1 (CPU MC + NEON-fb CDEF):        CPU 24.49 + 1.75 Mblock/s CDEF
  V2 (CPU LPF4 + NEON-fb CDEF):      CPU 27.28 Medge + 1.70 Mblock/s

V4 is the daedalus-fourier deployment shape (CPU runs MC; QPU runs
LPF4 via cycle 2 GREEN offload). Both substrates productive; CPU
MC +23% per-core vs same-kernel V3 control. Same-kernel M4 in
cycles 1-5 was a worst-case contention bound, not a deployment
number — user's "5%/50%" framing was correct.

Cycle 3 MC verdict unchanged (QPU MC contributes ~0.4 under any
contention); cycle 5 CDEF deferred verdict softened to
opportunistic helper (NEON-fallback proxy used since cycle 5
Phase 6 not yet built).

- tests/bench_concurrent_mixed.c (configurable cpu-kernel /
  qpu-kernel matrix; supports MC, LPF4, LPF8, IDCT real QPU
  dispatch; CDEF uses NEON-on-core-3 fallback)
- CMakeLists.txt: build target wired with all FFmpeg + dav1d sources
- docs/issues/003-mixed-kernel-m4-bench.md: closure + matrix
- docs/k3_mc_phase7.md: M4 methodology caveat extended with V3/V4
- docs/k5_cdef_phase3_partial.md: deployment recommendation updated

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 13:44:08 +00:00