--- phase: 7 status: closed 2026-05-18 date_opened: 2026-05-18 date_closed: 2026-05-18 parent: phase6 → phase4' (loopback) → phase6 (iter 2..5) host: hertz result_v1: R = 0.230 (ORANGE) result_v4: R = 0.918 ± 0.033 N=3 (YELLOW, at GREEN boundary) --- # Phase 7 — Verification, with two Phase 4' loopbacks Per `dev_process.md`: > Repeat measurements from Phase 3. Compare explicitly against baseline. > If the delta does not match Phase 4's prediction → loop back to Phase 4. Phase 6 v1 measurement (R = 0.230) did not match Phase 4's prediction (R = 2.0 predicted, R = 1.0 worst-case honest lower bound). Loop back triggered. Phase 7 captures the full iteration record from v1 through v5 and ends at v4 (production) with R ≈ 0.92 on 1080p luma. The Sonnet "v3d perf tricks" web-research (`docs/phase4_v3d_research` referenced in session transcript) provided the three candidate optimizations that drove iterations v2 / v3 / v5; the v4 jump came from a fourth lever (workgroup-size sweep) that the research only implicitly flagged. ## Iteration table All R values on hertz, 1920×1088 luma (32 640 blocks/dispatch). M3 baseline = 8.171 Mblock/s (Phase 3, NEON `ff_vp9_idct_idct_8x8_add_neon`). | ver | change | bit-exact | M2 Mblock/s | ns/block | R | shaderdb inst / threads / temps / spills | |---|---|---|---|---|---|---| | v1 | first-light (4 blocks/WG, lane 0-7 col / 8-15 row, chained ternary in row pass, uint8 dst SSBO) | 100.00% | 1.878 | 532.6 | 0.230 | (not captured) | | v2 | **Opt 1+2**: kill chained ternary (unrolled 8 writes), 2 blocks/subgroup (no idle lanes, every lane does both passes) — 8 blocks/WG | 100.00% | 3.877 | 258.0 | **0.474** | 268 / 2 / 20 / 0:0 | | v3 | Opt 4 (sibling): scope `oN` per pass | 100.00% | 3.930 | 254.5 | 0.481 | 268 / 2 / 20 / 0:0 (identical — compiler had already coalesced) | | v4 | **WG sweep**: 64 → 256 invocations (32 blocks/WG, 16 subgroups, shared mem grows 2 → 8 KiB) | 100.00% | 7.734 | 129.3 | **0.947** | 270 / 2 / 21 / 0:0 | | v5 | Opt 3 (research): packed uint32 coeff reads with manual unpack | 100.00% | 7.663 | 130.5 | 0.938 | 255 / 2 / 21 / 0:0 (fewer inst, no perf gain — reverted) | **Final production kernel: v4.** N=3 repeat on 1080p: R = 0.931, 0.944, 0.879 → mean **0.918 ± 0.033** (range; third run likely caught LXD-container interference on hertz). ## What worked (and how surprising it was) **v2 (predicted 3× win, got 2.07×):** Phase 4' attribution split was wrong. Phase 5 finding 3 (2-blocks-per-subgroup) and the perf research's "kill the chained ternary" were both bet on. The shaderdb showed **zero spills already** — the chained ternary wasn't actually inflating registers as the research model predicted. So the 2.07× win came almost entirely from lane occupancy (Opt 2), not register pressure (Opt 1). **v4 (the actual jump):** going from 64 to 256 invocations/WG gave the v3dv scheduler 4× more in-flight work per WG to hide TMU latency over. Doubled throughput. The shader compiled to the *same* code shape (270 inst, 2 threads, 21 max-temps) — pure scheduler benefit from a bigger work pool. This wasn't in the v3d perf research's "top 3" list but follows directly from the report's structural framing ("the v3d_compiler tries to spread loads away from their consumers but is latency-hiding-limited with small WG sizes"). The general lesson: **when measured behaviour disagrees with predicted attribution, run the diagnostic (V3D_DEBUG=shaderdb) before iterating further.** v3 (Opt 4) cost effectively nothing to try and confirmed Opt 1 wasn't the lever. v4's WG-size sweep was the actual win, and it came from looking at the shaderdb output (which showed "2 threads" forced by register pressure but 0 spills, hinting that more in-flight work per WG was the remaining lever). ## What didn't work **v3 (per-pass scoping of `oN`):** zero perf delta. Compiler had already coalesced `oN` lifetime across the barrier. Kept the change in v4 — it's strictly cleaner code, just not faster. **v5 (packed uint32 coeff reads):** 0.947 → 0.938, within noise. Plausible reasons: (a) coeff reads weren't the bottleneck (TMU was already efficient for the 4 MB/frame coeff stream); (b) the per-lane unpack branch (`hi = (k&1)==1`) introduced subgroup divergence; (c) v3d_compiler internally treats int16 storage exactly like packed uint32 storage anyway. Reverted in production kernel for simplicity. ## Predictions vs measurements summary | | predicted | measured | delta | |---|---|---|---| | Phase 4 R (v1) | 2.0 (envelope) / 1.0 (lower) | 0.230 | 5× worse than lower bound — **loopback trigger** | | Phase 4' R after Opt 1+2 (v2) | "3× of 4.4× gap" → R ≈ 0.7 | 0.474 | 2× worse than predicted (the 2-blocks-per-subgroup attribution was right but Opt 1 wasn't load-bearing) | | Phase 4' R after WG sweep (v4) | not predicted | 0.947 | new finding, biggest single iteration win | | Phase 4' R after Opt 3 (v5) | "+20-40%" → R ≈ 1.1-1.3 | 0.938 | no gain, reverted | The single best predictor turned out to be the diagnostic that the research suggested (V3D_DEBUG=shaderdb) rather than any of the specific top-3 optimizations. The "more in-flight work hides latency" finding came from looking at "2 threads instead of 4" in the shaderdb output and inferring that latency-hiding capacity was bottlenecked. ## Decision per Phase 1 rules `phase1.md §"Decision rules"`: | R | Interpretation | Next step | |---|---|---| | ≥ 1.0 | QPU beats NEON. | Phase 9 → Phase 1 of next kernel | | **0.5 ≤ R < 1.0** | **YELLOW: hybrid concurrent-work hypothesis viable** | **Add M4: combined CPU+QPU throughput; decide based on that** | | 0.1 ≤ R < 0.5 | ORANGE: honest close | Phase 9 documents negative result | | < 0.1 | RED: structural mismatch | Honest close | **Verdict: YELLOW band by a wide margin (R = 0.92, just 0.08 from GREEN).** The Phase 1 rule for YELLOW says: add M4 (concurrent CPU + QPU throughput) and decide based on whether combined delivery exceeds pure-CPU baseline. M4 is the next measurement, not more shader tuning. The R = 0.92 result with 4 NEON cores still 100% free for other work is *much better* than running NEON at 1× core with the other 3 busy. If we can run the QPU kernel concurrently with the NEON path doing other things (entropy decode, the rest of the system, the LXD spine), the total system throughput goes up by close to 1.0 / (1.0 - QPU_fraction_of_time), even at R < 1. ## What Phase 7 leaves open (M4 / future) - **M4: concurrent CPU + QPU.** Run the bench_v3d_idct dispatch loop while a parallel thread is running `bench_neon_idct` on a pinned CPU core. Measure: does combined Mblock/s exceed `bench_neon_idct -t 4` (4-core NEON)? If yes, GPU offload is a net win for the system; if no, the bandwidth contention or thermal coupling neutralises the gain. - **M6: WG size sweep (Phase 1 secondary).** v4 is at 256 invocations (max). Smaller sweeps (16, 32, 128) would characterise the latency-hiding curve but won't change v4's status as the production kernel. - **M7: power delta via Himbeere plug.** Most relevant for the higgs (battery) deployment, not hertz. - **Thermal headroom under sustained mixed load.** With QPU running flat-out (1.9 GB/s memory traffic) + 4-core NEON busy, hertz may throttle. Not yet measured. ## Production artifact - `src/v3d_idct8.comp` — v4 production shader, 270 inst, R = 0.92 - `src/v3d_runner.{c,h}` — Vulkan plumbing (unchanged since Phase 6) - `tests/bench_v3d_idct.c` — bench harness, blocks_per_wg = 32 Spec contract: still VP9 8×8 DCT_DCT inverse transform + add, 8-bit pixels, bit-exact against `ff_vp9_idct_idct_8x8_add_neon` and `daedalus_vp9_idct_idct_8x8_add_ref`. Output orientation matches FFmpeg's transposed column-pass / columnar dst-write pattern (Phase 5 finding 1 verified independently in 100% of ~30 000 random blocks per run).