First QPU IDCT8 kernel running and bit-exact on V3D 7.1 via Mesa
v3dv compute. Five iterations through a Phase 7→Phase 4' loopback;
production kernel is v4.
New files:
- src/v3d_runner.{c,h} — reusable Vulkan compute plumbing (instance,
V3D device picker, HOST_VISIBLE|COHERENT
SSBOs with mmap, compute pipeline from .spv,
enables storageBuffer{8,16}BitAccess)
- src/v3d_idct8.comp — VP9 8x8 DCT_DCT IDCT add, v4 production:
256 invocations/WG, 2 blocks/subgroup
(no idle lanes), uint8 dst SSBO (race-free
per phase5 finding 5), unrolled writes
(no chained ternary), oob-flag pattern
(barrier-safe per phase5 finding 7)
- tests/bench_v3d_idct.c — M1' bit-exact gate + M2 throughput vs C ref
- docs/phase7.md — full iteration journey + decision verdict
CMakeLists.txt updated to build the new shader, library, and bench
when DAEDALUS_BUILD_VULKAN=ON.
Iteration record (1920x1088 luma, 32640 blocks/dispatch, N=3):
ver change R ns/block
v1 first-light 0.230 533
v2 kill ternary + 2-blocks-per-sg 0.474 258
v3 per-pass scope oN 0.481 254 (noise)
v4 WG 64 -> 256 invocations 0.947 129
v5 packed uint32 coeff reads 0.938 130 (noise, reverted)
v4 final N=3 0.918 +/- 0.033
Bit-exactness 100.0000% across all iterations (10000-block sample
on 128x128, 32640-block sample on 1080p) against both the C
reference (tests/vp9_idct8_ref.c) and the vendored FFmpeg NEON
ff_vp9_idct_idct_8x8_add_neon.
Key learning over the Phase 5 review's prediction model: the
chained ternary was NOT a spill killer on V3D 7.1 (shaderdb
showed 0:0 spills:fills even in v1). The actual lever was
workgroup-size-driven latency hiding — going from 64 to 256
invocations doubled throughput with the same compiled code
(270 inst, 2 threads, 21 max-temps, 0 spills) because the
v3dv scheduler had 4x more in-flight work to overlap TMU
latency.
Verdict per phase1.md decision rules: YELLOW band (0.5 <= R < 1.0)
by a wide margin, near GREEN boundary. Phase 1 YELLOW rule:
add M4 (concurrent CPU+QPU throughput) before honest-close or
continue. M4 is the next measurement, not more shader tuning —
at R = 0.92 with all 4 A76 cores still 100% free for other work,
the question is whether the system aggregate beats pure 4-core
NEON.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This is a from-scratch initial commit on a fresh .git. The original
scaffold commit (7510b56) and the earlier session's working-tree
docs were lost in a 2026-05-18 10:25 working-tree wipe; the corrupted
.git is preserved at .git-broken-2026-05-18/ (gitignored) for
forensic inspection.
Scope re-anchored from Path A (custom VPU firmware on VC7 scalar
cores; blocked by BCM2712 silicon-RoT mask-ROM signature check)
to Path B (QPU compute kernels via Mesa v3d / Vulkan compute or
direct DRM, on stock signed Pi 5 / CM5). See README.md and
docs/phase0.md for the substrate audit that closed Path A.
Phases closed:
Phase 0 — substrate audit; Path A blocked, Path B open;
codec-back-end-fits-QPU finding (docs/phase0.md)
Phase 1 — first kernel locked (VP9 / AV1 8x8 inverse DCT) with
publish-before-measure R = M2/M3 decision rules
(docs/phase1.md)
Phase 2 — reference impls mapped; FFmpeg n7.1.3 source vendored
under external/ffmpeg-snapshot/ (PROVENANCE.md pins
commit f46e514 + per-file SHA-256s) (docs/phase2.md)
Phase 3 — real baseline measurements on hertz (docs/phase3.md):
M1 bit-exact 100.0000 % (10000/10000)
M3 NEON IDCT8 single 8.171 Mblock/s (122.4 ns/block)
M5a empty Vulkan submit 22.66 us
M5b 1-WG noop dispatch 55.60 us
M5 delta 32.95 us/dispatch
=> per-dispatch overhead is ~455x per-NEON-block cost;
Phase 4 must batch at frame level or close to it.
Build harness in place: CMakeLists.txt + tests/{bench_neon_idct.c,
vp9_idct8_ref.c, bench_vulkan_dispatch.c, shaders/noop.comp} +
external/ffmpeg-snapshot/config.h shim (7 defines + EXTERN_ASM).
Builds clean on Debian Trixie aarch64 with cmake 3.31, ninja 1.12,
libvulkan-dev 1.4.309, glslang-tools 15.1.0. Vendored FFmpeg .S
assembles via the config.h shim.
Next: Phase 4 (plan first QPU IDCT kernel under the M5 batching
constraint) -> Phase 5 second-model review -> Phase 6 implement.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>