---
phase: 7
status: closed 2026-05-18
date_opened: 2026-05-18
date_closed: 2026-05-18
parent: phase6 → phase4' (loopback) → phase6 (iter 2..5)
host: hertz
result_v1: R = 0.230 (ORANGE)
result_v4: R = 0.918 ± 0.033 N=3 (YELLOW, at GREEN boundary)
---

# Phase 7 — Verification, with two Phase 4' loopbacks

Per `dev_process.md`:

> Repeat measurements from Phase 3. Compare explicitly against baseline.
> If the delta does not match Phase 4's prediction → loop back to Phase 4.

Phase 6 v1 measurement (R = 0.230) did not match Phase 4's prediction
(R = 2.0 predicted, R = 1.0 worst-case honest lower bound). Loop
back triggered. Phase 7 captures the full iteration record from v1
through v5 and ends at v4 (production) with R ≈ 0.92 on 1080p luma.

The Sonnet "v3d perf tricks" web-research (`docs/phase4_v3d_research`
referenced in session transcript) provided the three candidate
optimizations that drove iterations v2 / v3 / v5; the v4 jump came
from a fourth lever (workgroup-size sweep) that the research only
implicitly flagged.

## Iteration table

All R values on hertz, 1920×1088 luma (32 640 blocks/dispatch).
M3 baseline = 8.171 Mblock/s (Phase 3, NEON `ff_vp9_idct_idct_8x8_add_neon`).

| ver | change | bit-exact | M2 Mblock/s | ns/block | R | shaderdb inst / threads / temps / spills |
|---|---|---|---|---|---|---|
| v1 | first-light (4 blocks/WG, lane 0-7 col / 8-15 row, chained ternary in row pass, uint8 dst SSBO) | 100.00% | 1.878 | 532.6 | 0.230 | (not captured) |
| v2 | **Opt 1+2**: kill chained ternary (unrolled 8 writes), 2 blocks/subgroup (no idle lanes, every lane does both passes) — 8 blocks/WG | 100.00% | 3.877 | 258.0 | **0.474** | 268 / 2 / 20 / 0:0 |
| v3 | Opt 4 (sibling): scope `oN` per pass | 100.00% | 3.930 | 254.5 | 0.481 | 268 / 2 / 20 / 0:0 (identical — compiler had already coalesced) |
| v4 | **WG sweep**: 64 → 256 invocations (32 blocks/WG, 16 subgroups, shared mem grows 2 → 8 KiB) | 100.00% | 7.734 | 129.3 | **0.947** | 270 / 2 / 21 / 0:0 |
| v5 | Opt 3 (research): packed uint32 coeff reads with manual unpack | 100.00% | 7.663 | 130.5 | 0.938 | 255 / 2 / 21 / 0:0 (fewer inst, no perf gain — reverted) |

**Final production kernel: v4.** N=3 repeat on 1080p:
R = 0.931, 0.944, 0.879 → mean **0.918 ± 0.033** (range; third run
likely caught LXD-container interference on hertz).

## What worked (and how surprising it was)

**v2 (predicted 3× win, got 2.07×):** Phase 4' attribution split was
wrong. Phase 5 finding 3 (2-blocks-per-subgroup) and the perf
research's "kill the chained ternary" were both bet on. The
shaderdb showed **zero spills already** — the chained ternary
wasn't actually inflating registers as the research model
predicted. So the 2.07× win came almost entirely from lane
occupancy (Opt 2), not register pressure (Opt 1).

**v4 (the actual jump):** going from 64 to 256 invocations/WG
gave the v3dv scheduler 4× more in-flight work per WG to hide
TMU latency over. Doubled throughput. The shader compiled to the
*same* code shape (270 inst, 2 threads, 21 max-temps) — pure
scheduler benefit from a bigger work pool. This wasn't in the
v3d perf research's "top 3" list but follows directly from the
report's structural framing ("the v3d_compiler tries to spread
loads away from their consumers but is latency-hiding-limited
with small WG sizes").

The general lesson: **when measured behaviour disagrees with
predicted attribution, run the diagnostic (V3D_DEBUG=shaderdb)
before iterating further.** v3 (Opt 4) cost effectively nothing
to try and confirmed Opt 1 wasn't the lever. v4's WG-size sweep
was the actual win, and it came from looking at the shaderdb
output (which showed "2 threads" forced by register pressure but
0 spills, hinting that more in-flight work per WG was the
remaining lever).

## What didn't work

**v3 (per-pass scoping of `oN`):** zero perf delta. Compiler had
already coalesced `oN` lifetime across the barrier. Kept the
change in v4 — it's strictly cleaner code, just not faster.

**v5 (packed uint32 coeff reads):** 0.947 → 0.938, within
noise. Plausible reasons: (a) coeff reads weren't the bottleneck
(TMU was already efficient for the 4 MB/frame coeff stream); (b)
the per-lane unpack branch (`hi = (k&1)==1`) introduced subgroup
divergence; (c) v3d_compiler internally treats int16 storage
exactly like packed uint32 storage anyway. Reverted in
production kernel for simplicity.

## Predictions vs measurements summary

| | predicted | measured | delta |
|---|---|---|---|
| Phase 4 R (v1) | 2.0 (envelope) / 1.0 (lower) | 0.230 | 5× worse than lower bound — **loopback trigger** |
| Phase 4' R after Opt 1+2 (v2) | "3× of 4.4× gap" → R ≈ 0.7 | 0.474 | 2× worse than predicted (the 2-blocks-per-subgroup attribution was right but Opt 1 wasn't load-bearing) |
| Phase 4' R after WG sweep (v4) | not predicted | 0.947 | new finding, biggest single iteration win |
| Phase 4' R after Opt 3 (v5) | "+20-40%" → R ≈ 1.1-1.3 | 0.938 | no gain, reverted |

The single best predictor turned out to be the diagnostic that the
research suggested (V3D_DEBUG=shaderdb) rather than any of the
specific top-3 optimizations. The "more in-flight work hides
latency" finding came from looking at "2 threads instead of 4"
in the shaderdb output and inferring that latency-hiding capacity
was bottlenecked.

## Decision per Phase 1 rules

`phase1.md §"Decision rules"`:

| R | Interpretation | Next step |
|---|---|---|
| ≥ 1.0 | QPU beats NEON. | Phase 9 → Phase 1 of next kernel |
| **0.5 ≤ R < 1.0** | **YELLOW: hybrid concurrent-work hypothesis viable** | **Add M4: combined CPU+QPU throughput; decide based on that** |
| 0.1 ≤ R < 0.5 | ORANGE: honest close | Phase 9 documents negative result |
| < 0.1 | RED: structural mismatch | Honest close |

**Verdict: YELLOW band by a wide margin (R = 0.92, just 0.08 from
GREEN).** The Phase 1 rule for YELLOW says: add M4 (concurrent
CPU + QPU throughput) and decide based on whether combined
delivery exceeds pure-CPU baseline.

M4 is the next measurement, not more shader tuning. The R = 0.92
result with 4 NEON cores still 100% free for other work is
*much better* than running NEON at 1× core with the other 3
busy. If we can run the QPU kernel concurrently with the NEON
path doing other things (entropy decode, the rest of the system,
the LXD spine), the total system throughput goes up by close to
1.0 / (1.0 - QPU_fraction_of_time), even at R < 1.

## What Phase 7 leaves open (M4 / future)

- **M4: concurrent CPU + QPU.** Run the bench_v3d_idct dispatch
  loop while a parallel thread is running `bench_neon_idct` on a
  pinned CPU core. Measure: does combined Mblock/s exceed
  `bench_neon_idct -t 4` (4-core NEON)? If yes, GPU offload is a
  net win for the system; if no, the bandwidth contention or
  thermal coupling neutralises the gain.
- **M6: WG size sweep (Phase 1 secondary).** v4 is at 256
  invocations (max). Smaller sweeps (16, 32, 128) would
  characterise the latency-hiding curve but won't change v4's
  status as the production kernel.
- **M7: power delta via Himbeere plug.** Most relevant for the
  higgs (battery) deployment, not hertz.
- **Thermal headroom under sustained mixed load.** With QPU
  running flat-out (1.9 GB/s memory traffic) + 4-core NEON busy,
  hertz may throttle. Not yet measured.

## Production artifact

- `src/v3d_idct8.comp` — v4 production shader, 270 inst, R = 0.92
- `src/v3d_runner.{c,h}` — Vulkan plumbing (unchanged since Phase 6)
- `tests/bench_v3d_idct.c` — bench harness, blocks_per_wg = 32

Spec contract: still VP9 8×8 DCT_DCT inverse transform + add,
8-bit pixels, bit-exact against `ff_vp9_idct_idct_8x8_add_neon`
and `daedalus_vp9_idct_idct_8x8_add_ref`. Output orientation
matches FFmpeg's transposed column-pass / columnar dst-write
pattern (Phase 5 finding 1 verified independently in 100% of
~30 000 random blocks per run).