marfrit
|
36eca40ff2
|
Cycle 2 (LPF) closure: M1''=100%, R''=0.41, M4''=+6.9%, PASS
Phase 4 plan + Phase 5 second-model review (PASS-WITH-REVISIONS:
2 YELLOW contract gaps applied) + Phase 6 v1 implementation +
Phase 7 verification including M4'' concurrent gate.
Phase 5'' review delivered cleanly — no RED bugs (cycle 1 lessons
applied successfully). 2 YELLOW findings baked into phase4 §4:
- stride >= 4 contract added alongside m.x >= 4 (finding 2)
- assert(...) in bench made a MUST not a suggestion (finding 4)
- V3D divergence-cost note: don't restructure to always-execute,
masked lanes consume clock anyway (finding 3, informational)
Phase 6 v1 first-light hit M1'' 100.0000% bit-exact on first run
(65536/65536 edges) — the cycle-1 v4 patterns (WG=256, 2-per-sg,
uint8_t SSBO, oob early-return discipline) baked in from start
worked as expected.
Performance:
M2'' = 19.645 Medge/s (50.9 ns/edge)
M3'' = 48.285 Medge/s (NEON baseline from phase3)
R'' = 0.41 (ORANGE band - doesn't auto-close per
cycle-1 calibration adjustment)
shaderdb: 160 inst, **4 threads**, 0 spills, 21 max-temps —
shader is already at the compiler ceiling. No v2/v3/v4 iteration
loop like cycle 1 because there's nothing more to extract from
the compiled shape. The 30x gap between theoretical instruction
throughput and measured wall-clock is divergence-tax + memory
latency, not compile quality.
M4'' concurrent matrix on hertz (8s windows):
NEON-1 LPF 41.131 Medge/s
NEON-4 LPF 33.726 Medge/s <- realistic CPU ceiling
(per-core 7-9; same
bandwidth-saturation as
cycle-1 F1)
QPU only 14.299 Medge/s
MIXED NEON-3 + QPU 36.049 Medge/s <- +6.9% over NEON-4
MIXED NEON-4 + QPU 31.892 Medge/s <- -5.4% oversubscribed
The "freed-core" pattern generalizes from IDCT to LPF: NEON-3+QPU
beats pure NEON-4 by ~7% in both cycles. Cycle-2 NEW finding:
**oversubscribed mode hurts for lighter kernels** (LPF -5.4% vs
cycle-1 IDCT +9.4%). Recommendation for higgs deployment hardens
to "always N-1 NEON cores + QPU, never N + QPU".
Phase 9 lessons (in phase7 §"Phase 9 lessons"):
1. Cycle-1 v4-pattern is the v1 starting point (saves 3 iterations)
2. Phase 5 review pays off every cycle
3. R isolation misleading on bandwidth-saturated hardware
4. Oversubscription tax depends on kernel weight
5. shaderdb 4-threads/0-spills = compute not the bottleneck
New artifacts:
- src/v3d_lpf_h_4_8.comp — GLSL kernel
- tests/bench_v3d_lpf.c — M1'' + M2'' harness with
contract asserts + fm/hev
pass-rate instrumentation
- tests/bench_concurrent_lpf.c — M4'' pthread bench
(mirrors bench_concurrent.c)
- docs/k2_deblock_phase{4,5,7}.md — plan + review + verification
Project verdict: continue. Cycle 3 candidates: MC interpolation
(multiply-heavy, stress V3D SMUL24), CDEF (AV1-only, different
neighborhood shape), or wd=8/wd=16 LPF variants. User to direct.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
2026-05-18 12:39:26 +00:00 |
|