Files
marfrit 36eca40ff2 Cycle 2 (LPF) closure: M1''=100%, R''=0.41, M4''=+6.9%, PASS
Phase 4 plan + Phase 5 second-model review (PASS-WITH-REVISIONS:
2 YELLOW contract gaps applied) + Phase 6 v1 implementation +
Phase 7 verification including M4'' concurrent gate.

Phase 5'' review delivered cleanly — no RED bugs (cycle 1 lessons
applied successfully). 2 YELLOW findings baked into phase4 §4:
  - stride >= 4 contract added alongside m.x >= 4 (finding 2)
  - assert(...) in bench made a MUST not a suggestion (finding 4)
  - V3D divergence-cost note: don't restructure to always-execute,
    masked lanes consume clock anyway (finding 3, informational)

Phase 6 v1 first-light hit M1'' 100.0000% bit-exact on first run
(65536/65536 edges) — the cycle-1 v4 patterns (WG=256, 2-per-sg,
uint8_t SSBO, oob early-return discipline) baked in from start
worked as expected.

Performance:

  M2'' = 19.645 Medge/s     (50.9 ns/edge)
  M3'' = 48.285 Medge/s     (NEON baseline from phase3)
  R''  = 0.41               (ORANGE band - doesn't auto-close per
                             cycle-1 calibration adjustment)

shaderdb: 160 inst, **4 threads**, 0 spills, 21 max-temps —
shader is already at the compiler ceiling. No v2/v3/v4 iteration
loop like cycle 1 because there's nothing more to extract from
the compiled shape. The 30x gap between theoretical instruction
throughput and measured wall-clock is divergence-tax + memory
latency, not compile quality.

M4'' concurrent matrix on hertz (8s windows):

  NEON-1 LPF          41.131 Medge/s
  NEON-4 LPF          33.726 Medge/s  <- realistic CPU ceiling
                                          (per-core 7-9; same
                                          bandwidth-saturation as
                                          cycle-1 F1)
  QPU only            14.299 Medge/s
  MIXED NEON-3 + QPU  36.049 Medge/s  <- +6.9% over NEON-4
  MIXED NEON-4 + QPU  31.892 Medge/s  <- -5.4% oversubscribed

The "freed-core" pattern generalizes from IDCT to LPF: NEON-3+QPU
beats pure NEON-4 by ~7% in both cycles. Cycle-2 NEW finding:
**oversubscribed mode hurts for lighter kernels** (LPF -5.4% vs
cycle-1 IDCT +9.4%). Recommendation for higgs deployment hardens
to "always N-1 NEON cores + QPU, never N + QPU".

Phase 9 lessons (in phase7 §"Phase 9 lessons"):
1. Cycle-1 v4-pattern is the v1 starting point (saves 3 iterations)
2. Phase 5 review pays off every cycle
3. R isolation misleading on bandwidth-saturated hardware
4. Oversubscription tax depends on kernel weight
5. shaderdb 4-threads/0-spills = compute not the bottleneck

New artifacts:
- src/v3d_lpf_h_4_8.comp                — GLSL kernel
- tests/bench_v3d_lpf.c                 — M1'' + M2'' harness with
                                          contract asserts + fm/hev
                                          pass-rate instrumentation
- tests/bench_concurrent_lpf.c          — M4'' pthread bench
                                          (mirrors bench_concurrent.c)
- docs/k2_deblock_phase{4,5,7}.md       — plan + review + verification

Project verdict: continue. Cycle 3 candidates: MC interpolation
(multiply-heavy, stress V3D SMUL24), CDEF (AV1-only, different
neighborhood shape), or wd=8/wd=16 LPF variants. User to direct.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 12:39:26 +00:00

7.6 KiB
Raw Permalink Blame History

cycle, phase, status, date_opened, date_closed, parent, host, verdict
cycle phase status date_opened date_closed parent host verdict
2 7 closed 2026-05-18 — PASS 2026-05-18 2026-05-18 k2_deblock_phase4.md (+ phase5 revisions) hertz (Pi 5, 8 GB, Debian Trixie, kernel 6.12.75+rpt-rpi-2712, Mesa 25.0.7-2+rpt4, V3D 7.1.7 @ 1 GHz, A76 @ 2.8 GHz) M4'' PASS — mixed +6.9 % over pure NEON-4; project continues

Cycle 2, Phase 7 — Verification (v1 + M4'')

Per dev_process.md: repeat measurements from Phase 3, compare explicitly to baseline. Phase 4 §6 predicted R'' ≈ 0.50.9 isolation, bandwidth ceiling at 0.87. Measured R'' = 0.41 isolation — below the predicted lower bound. Per cycle-1 calibration (M4 showed mixed > pure-CPU even at modest R), this triggers M4'' rather than honest-close.

M4'' gate result: PASS. Project continues.

v1 first-light (single dispatch, isolation R'')

=== v3d LPF h_4_8 bench ===
  device:  V3D 7.1.7.0
  n_edges: 65536  iters: 100
  fm pass rate:  8.09% (10k-edge sample)
  hev pass rate: 4.93% (of fm-passing)
  dispatch: 2048 WGs × 256 invocations = 65536 edges

=== M1'': QPU vs C-reference bit-exact ===
  edges bit-exact: 65536 / 65536 (100.0000 %)
  total byte diffs: 0 / 4194304 (0.0000 %)

=== M2'': QPU throughput ===
  M2'' throughput = 19.645 Medge/s
  per-edge        = 50.9 ns
  per-dispatch    = 3336.1 us
  R'' = M2''/M3''   = 0.407 → ORANGE band

shaderdb (v1 LPF kernel):

SHADER-DB-6c8e828054...: MESA_SHADER_COMPUTE shader:
  160 inst, 4 threads, 0 loops, 36 uniforms, 21 max-temps,
  0:0 spills:fills, 0 sfu-stalls, 160 inst-and-stalls, 15 nops

The shader is already well-optimised by v3d_compiler:

  • 4 hardware threads (vs cycle-1 IDCT's 2 — better latency hiding from the start)
  • 0 spills:fills (compiler delivered)
  • 160 instructions — about 60 % of cycle-1 IDCT's 270

Yet R'' = 0.41. The 30× gap between theoretical instruction throughput and measured wall-clock is not compile-quality limited. Plausible attribution:

  1. fm-pass rate 8 % → 92 % of edges read+compute then return. But masked lanes still pay clock (phase5'' finding 3) — no throughput benefit from early-return.
  2. Memory latency: per-edge 64 reads + 0-4 writes via TMU; less compute density per memory op than IDCT.
  3. v3dv per-dispatch overhead is 0.05 % of total at 3.3 ms per-dispatch — not the bottleneck.

The fundamental issue: LPF on QPU is memory-bound, not compute-bound. Per-edge ~88 B of traffic × 19.6 Medge/s ≈ 1.7 GB/s — well below the 4 GB/s GPU bandwidth ceiling. The divergence tax may be eating the bandwidth headroom (lanes that early-return don't write but still consume cycle).

M4'' concurrent matrix (cycle-2 gate test)

8-second time-based windows, hertz, all 65 536-edge dispatches:

Config Medge/s per-core (NEON) vs NEON-4
NEON 1-core 41.131 41.131
NEON 4-core 33.726 7.21 9.28 baseline ceiling
QPU alone (host on core 3) 14.299 n/a
MIXED NEON-3 + QPU 36.049 9.44 12.98 +6.9 %
MIXED NEON-4 + QPU (oversubscribed) 31.892 6.45 8.02 5.4 %

The gate verdict: NEON-3 + QPU (36.05) > NEON-4 alone (33.73) by 2.32 Medge/s = +6.9 %. M4'' PASSES.

QPU's contribution in mixed mode (4.0 Medge/s) is 28 % of its isolation throughput (14.3) — the same QPU-bandwidth-collapse under CPU contention seen in cycle-1 M4 (where QPU dropped from 6.9 → 1.6 Medge/s = 23 % survival).

Cycle-2 vs cycle-1 M4 deltas

Cycle 1 (IDCT) Cycle 2 (LPF)
NEON 1-core (Mblock/s vs Medge/s) 12.6 41.1
NEON 4-core 7.07 33.7
QPU isolation 6.89 14.3
R isolation (vs 1-core NEON) 0.55 0.35
R isolation (vs 4-core NEON saturated) 0.97 0.42
MIXED N3+Q vs N4 +7.2 % +6.9 %
MIXED N4+Q vs N4 +9.4 % (neutral-to-pos) 5.4 % (negative)

The "freed-core" pattern generalizes: NEON-3+QPU > NEON-4 by roughly the same percentage in both cycles. The oversubscription flip (cycle 1 positive → cycle 2 negative) is the new finding: lighter per-unit kernels are more sensitive to CPU/QPU-host contention. For deployment on higgs the recommendation hardens to "always NEON-3 + QPU, never NEON-4 + QPU".

Phase 4''/5'' prediction calibration

What Phase 4'' got right:

  • Bandwidth-bound — bench fm-pass rate confirms most edges don't even do the conditional write work, yet bandwidth is the ceiling
  • 4-thread shaderdb result — phase 4 §6 predicted "compute doesn't bottleneck"; confirmed

What Phase 4'' got wrong:

  • Isolation R'' band 0.50.9 was too optimistic by ~25 %. Actual 0.41. Divergence tax was bigger than estimated.
  • Phase 5'' finding 3 specifically warned not to restructure for divergence — that holds; the 0.41 IS the floor.

What this means: the cycle-1-style "single big v4 jump from WG sweep" probably doesn't exist for LPF — we're already at WG 256 from v1, already at 4 hardware threads, already at 0 spills. The compiler delivered. The hardware limit on LPF-shape kernels appears to be ~14 Medge/s isolation. The project can pursue further optimization only by attacking the algorithm structure (e.g., fused multi-edge-per-WG with shared prefetch — but that adds shared mem and barriers, complicating divergence further).

For now: cycle 2 closes as a YELLOW-PASS via M4''. Cycle 3 next.

Phase 7'' decision

Per k2_deblock_phase1.md §"Decision rules" and cycle-1 calibration adjustment:

Rule Result Status
M1'' bit-exact 100.0000 % ✓ PASS
R'' = M2''/M3'' 0.41 (ORANGE) does not auto-close
M4'' > pure-CPU 4-core +6.9 % ✓ PASS
Cycle verdict YELLOW-via-M4'' continue to next kernel

Phase 9 (lessons): see end of this doc.

Leaves open

  • Real-bitstream fm-pass rate. Bench's random distribution gives 8 % fm-pass. Real VP9 streams may be 30-60 %. If fm-pass rate matters for the divergence tax, real content might measurably shift M2''. Worth a sample-stream re-measurement if/when an end-to-end pipeline exists.
  • Vertical variant v_4_8. Different memory access pattern (column-strided reads). Cycle 2 v2 if there's a reason; not blocking.
  • wd=8 and wd=16 filters. Bigger conditional paths. Cycle 3+ candidates.

Phase 9 lessons (added to project memory)

  1. Cycle-1 v4-pattern is the v1 starting point. Bake in WG 256, 2-block-per-subgroup adaptation, uint8_t SSBO, oob early-return discipline, NO chained ternary from the start. Saves 3 iterations.

  2. Phase 5 review pays off every cycle. Cycle 1 caught 2 RED bugs; cycle 2 caught 2 YELLOW contract gaps (stride ≥ 4, assert discipline) and 1 V3D-specific divergence-cost warning. No wasted code from review-flagged bugs in either cycle.

  3. R isolation is a misleading metric on bandwidth-saturated hardware. Comparing QPU vs 1-core NEON is the wrong baseline when 4-core NEON only delivers 0.56-0.82× of 1-core scaled. The right comparison is QPU vs 4-core-NEON-saturated, then the mixed-vs-pure-CPU delta. Both cycles' M4 confirm this.

  4. Oversubscription tax depends on kernel weight. Heavy per-unit work (IDCT) tolerates NEON-4 + QPU (+9 %). Light per-unit work (LPF) is hurt by it (-5 %). Recommendation for deployment: always N-1 NEON cores + QPU, never N + QPU.

  5. shaderdb at 4 threads / 0 spills means compute is not the bottleneck. Subsequent optimization should target memory pattern (TMU prefetch, working-set tiling) or accept the silicon limit. Cycle 2 v1 hit this ceiling — no v2-v5 iterations needed because there's nothing to improve in the compiled shader shape.