Files
daedalus-fourier/docs/phase7.md
T
marfrit d66f22f333 Phase 6 (v1+v4 production) + Phase 7 closure: R = 0.92 ± 0.03 on hertz
First QPU IDCT8 kernel running and bit-exact on V3D 7.1 via Mesa
v3dv compute. Five iterations through a Phase 7→Phase 4' loopback;
production kernel is v4.

New files:
- src/v3d_runner.{c,h}  — reusable Vulkan compute plumbing (instance,
                          V3D device picker, HOST_VISIBLE|COHERENT
                          SSBOs with mmap, compute pipeline from .spv,
                          enables storageBuffer{8,16}BitAccess)
- src/v3d_idct8.comp    — VP9 8x8 DCT_DCT IDCT add, v4 production:
                          256 invocations/WG, 2 blocks/subgroup
                          (no idle lanes), uint8 dst SSBO (race-free
                          per phase5 finding 5), unrolled writes
                          (no chained ternary), oob-flag pattern
                          (barrier-safe per phase5 finding 7)
- tests/bench_v3d_idct.c — M1' bit-exact gate + M2 throughput vs C ref
- docs/phase7.md         — full iteration journey + decision verdict

CMakeLists.txt updated to build the new shader, library, and bench
when DAEDALUS_BUILD_VULKAN=ON.

Iteration record (1920x1088 luma, 32640 blocks/dispatch, N=3):

  ver  change                              R       ns/block
  v1   first-light                         0.230   533
  v2   kill ternary + 2-blocks-per-sg      0.474   258
  v3   per-pass scope oN                   0.481   254  (noise)
  v4   WG 64 -> 256 invocations            0.947   129
  v5   packed uint32 coeff reads           0.938   130  (noise, reverted)
  v4 final N=3                             0.918 +/- 0.033

Bit-exactness 100.0000% across all iterations (10000-block sample
on 128x128, 32640-block sample on 1080p) against both the C
reference (tests/vp9_idct8_ref.c) and the vendored FFmpeg NEON
ff_vp9_idct_idct_8x8_add_neon.

Key learning over the Phase 5 review's prediction model: the
chained ternary was NOT a spill killer on V3D 7.1 (shaderdb
showed 0:0 spills:fills even in v1). The actual lever was
workgroup-size-driven latency hiding — going from 64 to 256
invocations doubled throughput with the same compiled code
(270 inst, 2 threads, 21 max-temps, 0 spills) because the
v3dv scheduler had 4x more in-flight work to overlap TMU
latency.

Verdict per phase1.md decision rules: YELLOW band (0.5 <= R < 1.0)
by a wide margin, near GREEN boundary. Phase 1 YELLOW rule:
add M4 (concurrent CPU+QPU throughput) before honest-close or
continue. M4 is the next measurement, not more shader tuning —
at R = 0.92 with all 4 A76 cores still 100% free for other work,
the question is whether the system aggregate beats pure 4-core
NEON.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 12:09:00 +00:00

160 lines
7.7 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
---
phase: 7
status: closed 2026-05-18
date_opened: 2026-05-18
date_closed: 2026-05-18
parent: phase6 → phase4' (loopback) → phase6 (iter 2..5)
host: hertz
result_v1: R = 0.230 (ORANGE)
result_v4: R = 0.918 ± 0.033 N=3 (YELLOW, at GREEN boundary)
---
# Phase 7 — Verification, with two Phase 4' loopbacks
Per `dev_process.md`:
> Repeat measurements from Phase 3. Compare explicitly against baseline.
> If the delta does not match Phase 4's prediction → loop back to Phase 4.
Phase 6 v1 measurement (R = 0.230) did not match Phase 4's prediction
(R = 2.0 predicted, R = 1.0 worst-case honest lower bound). Loop
back triggered. Phase 7 captures the full iteration record from v1
through v5 and ends at v4 (production) with R ≈ 0.92 on 1080p luma.
The Sonnet "v3d perf tricks" web-research (`docs/phase4_v3d_research`
referenced in session transcript) provided the three candidate
optimizations that drove iterations v2 / v3 / v5; the v4 jump came
from a fourth lever (workgroup-size sweep) that the research only
implicitly flagged.
## Iteration table
All R values on hertz, 1920×1088 luma (32 640 blocks/dispatch).
M3 baseline = 8.171 Mblock/s (Phase 3, NEON `ff_vp9_idct_idct_8x8_add_neon`).
| ver | change | bit-exact | M2 Mblock/s | ns/block | R | shaderdb inst / threads / temps / spills |
|---|---|---|---|---|---|---|
| v1 | first-light (4 blocks/WG, lane 0-7 col / 8-15 row, chained ternary in row pass, uint8 dst SSBO) | 100.00% | 1.878 | 532.6 | 0.230 | (not captured) |
| v2 | **Opt 1+2**: kill chained ternary (unrolled 8 writes), 2 blocks/subgroup (no idle lanes, every lane does both passes) — 8 blocks/WG | 100.00% | 3.877 | 258.0 | **0.474** | 268 / 2 / 20 / 0:0 |
| v3 | Opt 4 (sibling): scope `oN` per pass | 100.00% | 3.930 | 254.5 | 0.481 | 268 / 2 / 20 / 0:0 (identical — compiler had already coalesced) |
| v4 | **WG sweep**: 64 → 256 invocations (32 blocks/WG, 16 subgroups, shared mem grows 2 → 8 KiB) | 100.00% | 7.734 | 129.3 | **0.947** | 270 / 2 / 21 / 0:0 |
| v5 | Opt 3 (research): packed uint32 coeff reads with manual unpack | 100.00% | 7.663 | 130.5 | 0.938 | 255 / 2 / 21 / 0:0 (fewer inst, no perf gain — reverted) |
**Final production kernel: v4.** N=3 repeat on 1080p:
R = 0.931, 0.944, 0.879 → mean **0.918 ± 0.033** (range; third run
likely caught LXD-container interference on hertz).
## What worked (and how surprising it was)
**v2 (predicted 3× win, got 2.07×):** Phase 4' attribution split was
wrong. Phase 5 finding 3 (2-blocks-per-subgroup) and the perf
research's "kill the chained ternary" were both bet on. The
shaderdb showed **zero spills already** — the chained ternary
wasn't actually inflating registers as the research model
predicted. So the 2.07× win came almost entirely from lane
occupancy (Opt 2), not register pressure (Opt 1).
**v4 (the actual jump):** going from 64 to 256 invocations/WG
gave the v3dv scheduler 4× more in-flight work per WG to hide
TMU latency over. Doubled throughput. The shader compiled to the
*same* code shape (270 inst, 2 threads, 21 max-temps) — pure
scheduler benefit from a bigger work pool. This wasn't in the
v3d perf research's "top 3" list but follows directly from the
report's structural framing ("the v3d_compiler tries to spread
loads away from their consumers but is latency-hiding-limited
with small WG sizes").
The general lesson: **when measured behaviour disagrees with
predicted attribution, run the diagnostic (V3D_DEBUG=shaderdb)
before iterating further.** v3 (Opt 4) cost effectively nothing
to try and confirmed Opt 1 wasn't the lever. v4's WG-size sweep
was the actual win, and it came from looking at the shaderdb
output (which showed "2 threads" forced by register pressure but
0 spills, hinting that more in-flight work per WG was the
remaining lever).
## What didn't work
**v3 (per-pass scoping of `oN`):** zero perf delta. Compiler had
already coalesced `oN` lifetime across the barrier. Kept the
change in v4 — it's strictly cleaner code, just not faster.
**v5 (packed uint32 coeff reads):** 0.947 → 0.938, within
noise. Plausible reasons: (a) coeff reads weren't the bottleneck
(TMU was already efficient for the 4 MB/frame coeff stream); (b)
the per-lane unpack branch (`hi = (k&1)==1`) introduced subgroup
divergence; (c) v3d_compiler internally treats int16 storage
exactly like packed uint32 storage anyway. Reverted in
production kernel for simplicity.
## Predictions vs measurements summary
| | predicted | measured | delta |
|---|---|---|---|
| Phase 4 R (v1) | 2.0 (envelope) / 1.0 (lower) | 0.230 | 5× worse than lower bound — **loopback trigger** |
| Phase 4' R after Opt 1+2 (v2) | "3× of 4.4× gap" → R ≈ 0.7 | 0.474 | 2× worse than predicted (the 2-blocks-per-subgroup attribution was right but Opt 1 wasn't load-bearing) |
| Phase 4' R after WG sweep (v4) | not predicted | 0.947 | new finding, biggest single iteration win |
| Phase 4' R after Opt 3 (v5) | "+20-40%" → R ≈ 1.1-1.3 | 0.938 | no gain, reverted |
The single best predictor turned out to be the diagnostic that the
research suggested (V3D_DEBUG=shaderdb) rather than any of the
specific top-3 optimizations. The "more in-flight work hides
latency" finding came from looking at "2 threads instead of 4"
in the shaderdb output and inferring that latency-hiding capacity
was bottlenecked.
## Decision per Phase 1 rules
`phase1.md §"Decision rules"`:
| R | Interpretation | Next step |
|---|---|---|
| ≥ 1.0 | QPU beats NEON. | Phase 9 → Phase 1 of next kernel |
| **0.5 ≤ R < 1.0** | **YELLOW: hybrid concurrent-work hypothesis viable** | **Add M4: combined CPU+QPU throughput; decide based on that** |
| 0.1 ≤ R < 0.5 | ORANGE: honest close | Phase 9 documents negative result |
| < 0.1 | RED: structural mismatch | Honest close |
**Verdict: YELLOW band by a wide margin (R = 0.92, just 0.08 from
GREEN).** The Phase 1 rule for YELLOW says: add M4 (concurrent
CPU + QPU throughput) and decide based on whether combined
delivery exceeds pure-CPU baseline.
M4 is the next measurement, not more shader tuning. The R = 0.92
result with 4 NEON cores still 100% free for other work is
*much better* than running NEON at 1× core with the other 3
busy. If we can run the QPU kernel concurrently with the NEON
path doing other things (entropy decode, the rest of the system,
the LXD spine), the total system throughput goes up by close to
1.0 / (1.0 - QPU_fraction_of_time), even at R < 1.
## What Phase 7 leaves open (M4 / future)
- **M4: concurrent CPU + QPU.** Run the bench_v3d_idct dispatch
loop while a parallel thread is running `bench_neon_idct` on a
pinned CPU core. Measure: does combined Mblock/s exceed
`bench_neon_idct -t 4` (4-core NEON)? If yes, GPU offload is a
net win for the system; if no, the bandwidth contention or
thermal coupling neutralises the gain.
- **M6: WG size sweep (Phase 1 secondary).** v4 is at 256
invocations (max). Smaller sweeps (16, 32, 128) would
characterise the latency-hiding curve but won't change v4's
status as the production kernel.
- **M7: power delta via Himbeere plug.** Most relevant for the
higgs (battery) deployment, not hertz.
- **Thermal headroom under sustained mixed load.** With QPU
running flat-out (1.9 GB/s memory traffic) + 4-core NEON busy,
hertz may throttle. Not yet measured.
## Production artifact
- `src/v3d_idct8.comp` — v4 production shader, 270 inst, R = 0.92
- `src/v3d_runner.{c,h}` — Vulkan plumbing (unchanged since Phase 6)
- `tests/bench_v3d_idct.c` — bench harness, blocks_per_wg = 32
Spec contract: still VP9 8×8 DCT_DCT inverse transform + add,
8-bit pixels, bit-exact against `ff_vp9_idct_idct_8x8_add_neon`
and `daedalus_vp9_idct_idct_8x8_add_ref`. Output orientation
matches FFmpeg's transposed column-pass / columnar dst-write
pattern (Phase 5 finding 1 verified independently in 100% of
~30 000 random blocks per run).