Files
marfrit d66f22f333 Phase 6 (v1+v4 production) + Phase 7 closure: R = 0.92 ± 0.03 on hertz
First QPU IDCT8 kernel running and bit-exact on V3D 7.1 via Mesa
v3dv compute. Five iterations through a Phase 7→Phase 4' loopback;
production kernel is v4.

New files:
- src/v3d_runner.{c,h}  — reusable Vulkan compute plumbing (instance,
                          V3D device picker, HOST_VISIBLE|COHERENT
                          SSBOs with mmap, compute pipeline from .spv,
                          enables storageBuffer{8,16}BitAccess)
- src/v3d_idct8.comp    — VP9 8x8 DCT_DCT IDCT add, v4 production:
                          256 invocations/WG, 2 blocks/subgroup
                          (no idle lanes), uint8 dst SSBO (race-free
                          per phase5 finding 5), unrolled writes
                          (no chained ternary), oob-flag pattern
                          (barrier-safe per phase5 finding 7)
- tests/bench_v3d_idct.c — M1' bit-exact gate + M2 throughput vs C ref
- docs/phase7.md         — full iteration journey + decision verdict

CMakeLists.txt updated to build the new shader, library, and bench
when DAEDALUS_BUILD_VULKAN=ON.

Iteration record (1920x1088 luma, 32640 blocks/dispatch, N=3):

  ver  change                              R       ns/block
  v1   first-light                         0.230   533
  v2   kill ternary + 2-blocks-per-sg      0.474   258
  v3   per-pass scope oN                   0.481   254  (noise)
  v4   WG 64 -> 256 invocations            0.947   129
  v5   packed uint32 coeff reads           0.938   130  (noise, reverted)
  v4 final N=3                             0.918 +/- 0.033

Bit-exactness 100.0000% across all iterations (10000-block sample
on 128x128, 32640-block sample on 1080p) against both the C
reference (tests/vp9_idct8_ref.c) and the vendored FFmpeg NEON
ff_vp9_idct_idct_8x8_add_neon.

Key learning over the Phase 5 review's prediction model: the
chained ternary was NOT a spill killer on V3D 7.1 (shaderdb
showed 0:0 spills:fills even in v1). The actual lever was
workgroup-size-driven latency hiding — going from 64 to 256
invocations doubled throughput with the same compiled code
(270 inst, 2 threads, 21 max-temps, 0 spills) because the
v3dv scheduler had 4x more in-flight work to overlap TMU
latency.

Verdict per phase1.md decision rules: YELLOW band (0.5 <= R < 1.0)
by a wide margin, near GREEN boundary. Phase 1 YELLOW rule:
add M4 (concurrent CPU+QPU throughput) before honest-close or
continue. M4 is the next measurement, not more shader tuning —
at R = 0.92 with all 4 A76 cores still 100% free for other work,
the question is whether the system aggregate beats pure 4-core
NEON.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 12:09:00 +00:00

7.7 KiB
Raw Permalink Blame History

phase, status, date_opened, date_closed, parent, host, result_v1, result_v4
phase status date_opened date_closed parent host result_v1 result_v4
7 closed 2026-05-18 2026-05-18 2026-05-18 phase6 → phase4' (loopback) → phase6 (iter 2..5) hertz R = 0.230 (ORANGE) R = 0.918 ± 0.033 N=3 (YELLOW, at GREEN boundary)

Phase 7 — Verification, with two Phase 4' loopbacks

Per dev_process.md:

Repeat measurements from Phase 3. Compare explicitly against baseline. If the delta does not match Phase 4's prediction → loop back to Phase 4.

Phase 6 v1 measurement (R = 0.230) did not match Phase 4's prediction (R = 2.0 predicted, R = 1.0 worst-case honest lower bound). Loop back triggered. Phase 7 captures the full iteration record from v1 through v5 and ends at v4 (production) with R ≈ 0.92 on 1080p luma.

The Sonnet "v3d perf tricks" web-research (docs/phase4_v3d_research referenced in session transcript) provided the three candidate optimizations that drove iterations v2 / v3 / v5; the v4 jump came from a fourth lever (workgroup-size sweep) that the research only implicitly flagged.

Iteration table

All R values on hertz, 1920×1088 luma (32 640 blocks/dispatch). M3 baseline = 8.171 Mblock/s (Phase 3, NEON ff_vp9_idct_idct_8x8_add_neon).

ver change bit-exact M2 Mblock/s ns/block R shaderdb inst / threads / temps / spills
v1 first-light (4 blocks/WG, lane 0-7 col / 8-15 row, chained ternary in row pass, uint8 dst SSBO) 100.00% 1.878 532.6 0.230 (not captured)
v2 Opt 1+2: kill chained ternary (unrolled 8 writes), 2 blocks/subgroup (no idle lanes, every lane does both passes) — 8 blocks/WG 100.00% 3.877 258.0 0.474 268 / 2 / 20 / 0:0
v3 Opt 4 (sibling): scope oN per pass 100.00% 3.930 254.5 0.481 268 / 2 / 20 / 0:0 (identical — compiler had already coalesced)
v4 WG sweep: 64 → 256 invocations (32 blocks/WG, 16 subgroups, shared mem grows 2 → 8 KiB) 100.00% 7.734 129.3 0.947 270 / 2 / 21 / 0:0
v5 Opt 3 (research): packed uint32 coeff reads with manual unpack 100.00% 7.663 130.5 0.938 255 / 2 / 21 / 0:0 (fewer inst, no perf gain — reverted)

Final production kernel: v4. N=3 repeat on 1080p: R = 0.931, 0.944, 0.879 → mean 0.918 ± 0.033 (range; third run likely caught LXD-container interference on hertz).

What worked (and how surprising it was)

v2 (predicted 3× win, got 2.07×): Phase 4' attribution split was wrong. Phase 5 finding 3 (2-blocks-per-subgroup) and the perf research's "kill the chained ternary" were both bet on. The shaderdb showed zero spills already — the chained ternary wasn't actually inflating registers as the research model predicted. So the 2.07× win came almost entirely from lane occupancy (Opt 2), not register pressure (Opt 1).

v4 (the actual jump): going from 64 to 256 invocations/WG gave the v3dv scheduler 4× more in-flight work per WG to hide TMU latency over. Doubled throughput. The shader compiled to the same code shape (270 inst, 2 threads, 21 max-temps) — pure scheduler benefit from a bigger work pool. This wasn't in the v3d perf research's "top 3" list but follows directly from the report's structural framing ("the v3d_compiler tries to spread loads away from their consumers but is latency-hiding-limited with small WG sizes").

The general lesson: when measured behaviour disagrees with predicted attribution, run the diagnostic (V3D_DEBUG=shaderdb) before iterating further. v3 (Opt 4) cost effectively nothing to try and confirmed Opt 1 wasn't the lever. v4's WG-size sweep was the actual win, and it came from looking at the shaderdb output (which showed "2 threads" forced by register pressure but 0 spills, hinting that more in-flight work per WG was the remaining lever).

What didn't work

v3 (per-pass scoping of oN): zero perf delta. Compiler had already coalesced oN lifetime across the barrier. Kept the change in v4 — it's strictly cleaner code, just not faster.

v5 (packed uint32 coeff reads): 0.947 → 0.938, within noise. Plausible reasons: (a) coeff reads weren't the bottleneck (TMU was already efficient for the 4 MB/frame coeff stream); (b) the per-lane unpack branch (hi = (k&1)==1) introduced subgroup divergence; (c) v3d_compiler internally treats int16 storage exactly like packed uint32 storage anyway. Reverted in production kernel for simplicity.

Predictions vs measurements summary

predicted measured delta
Phase 4 R (v1) 2.0 (envelope) / 1.0 (lower) 0.230 5× worse than lower bound — loopback trigger
Phase 4' R after Opt 1+2 (v2) "3× of 4.4× gap" → R ≈ 0.7 0.474 2× worse than predicted (the 2-blocks-per-subgroup attribution was right but Opt 1 wasn't load-bearing)
Phase 4' R after WG sweep (v4) not predicted 0.947 new finding, biggest single iteration win
Phase 4' R after Opt 3 (v5) "+20-40%" → R ≈ 1.1-1.3 0.938 no gain, reverted

The single best predictor turned out to be the diagnostic that the research suggested (V3D_DEBUG=shaderdb) rather than any of the specific top-3 optimizations. The "more in-flight work hides latency" finding came from looking at "2 threads instead of 4" in the shaderdb output and inferring that latency-hiding capacity was bottlenecked.

Decision per Phase 1 rules

phase1.md §"Decision rules":

R Interpretation Next step
≥ 1.0 QPU beats NEON. Phase 9 → Phase 1 of next kernel
0.5 ≤ R < 1.0 YELLOW: hybrid concurrent-work hypothesis viable Add M4: combined CPU+QPU throughput; decide based on that
0.1 ≤ R < 0.5 ORANGE: honest close Phase 9 documents negative result
< 0.1 RED: structural mismatch Honest close

Verdict: YELLOW band by a wide margin (R = 0.92, just 0.08 from GREEN). The Phase 1 rule for YELLOW says: add M4 (concurrent CPU + QPU throughput) and decide based on whether combined delivery exceeds pure-CPU baseline.

M4 is the next measurement, not more shader tuning. The R = 0.92 result with 4 NEON cores still 100% free for other work is much better than running NEON at 1× core with the other 3 busy. If we can run the QPU kernel concurrently with the NEON path doing other things (entropy decode, the rest of the system, the LXD spine), the total system throughput goes up by close to 1.0 / (1.0 - QPU_fraction_of_time), even at R < 1.

What Phase 7 leaves open (M4 / future)

  • M4: concurrent CPU + QPU. Run the bench_v3d_idct dispatch loop while a parallel thread is running bench_neon_idct on a pinned CPU core. Measure: does combined Mblock/s exceed bench_neon_idct -t 4 (4-core NEON)? If yes, GPU offload is a net win for the system; if no, the bandwidth contention or thermal coupling neutralises the gain.
  • M6: WG size sweep (Phase 1 secondary). v4 is at 256 invocations (max). Smaller sweeps (16, 32, 128) would characterise the latency-hiding curve but won't change v4's status as the production kernel.
  • M7: power delta via Himbeere plug. Most relevant for the higgs (battery) deployment, not hertz.
  • Thermal headroom under sustained mixed load. With QPU running flat-out (1.9 GB/s memory traffic) + 4-core NEON busy, hertz may throttle. Not yet measured.

Production artifact

  • src/v3d_idct8.comp — v4 production shader, 270 inst, R = 0.92
  • src/v3d_runner.{c,h} — Vulkan plumbing (unchanged since Phase 6)
  • tests/bench_v3d_idct.c — bench harness, blocks_per_wg = 32

Spec contract: still VP9 8×8 DCT_DCT inverse transform + add, 8-bit pixels, bit-exact against ff_vp9_idct_idct_8x8_add_neon and daedalus_vp9_idct_idct_8x8_add_ref. Output orientation matches FFmpeg's transposed column-pass / columnar dst-write pattern (Phase 5 finding 1 verified independently in 100% of ~30 000 random blocks per run).