This is a from-scratch initial commit on a fresh .git. The original
scaffold commit (7510b56) and the earlier session's working-tree
docs were lost in a 2026-05-18 10:25 working-tree wipe; the corrupted
.git is preserved at .git-broken-2026-05-18/ (gitignored) for
forensic inspection.
Scope re-anchored from Path A (custom VPU firmware on VC7 scalar
cores; blocked by BCM2712 silicon-RoT mask-ROM signature check)
to Path B (QPU compute kernels via Mesa v3d / Vulkan compute or
direct DRM, on stock signed Pi 5 / CM5). See README.md and
docs/phase0.md for the substrate audit that closed Path A.
Phases closed:
Phase 0 — substrate audit; Path A blocked, Path B open;
codec-back-end-fits-QPU finding (docs/phase0.md)
Phase 1 — first kernel locked (VP9 / AV1 8x8 inverse DCT) with
publish-before-measure R = M2/M3 decision rules
(docs/phase1.md)
Phase 2 — reference impls mapped; FFmpeg n7.1.3 source vendored
under external/ffmpeg-snapshot/ (PROVENANCE.md pins
commit f46e514 + per-file SHA-256s) (docs/phase2.md)
Phase 3 — real baseline measurements on hertz (docs/phase3.md):
M1 bit-exact 100.0000 % (10000/10000)
M3 NEON IDCT8 single 8.171 Mblock/s (122.4 ns/block)
M5a empty Vulkan submit 22.66 us
M5b 1-WG noop dispatch 55.60 us
M5 delta 32.95 us/dispatch
=> per-dispatch overhead is ~455x per-NEON-block cost;
Phase 4 must batch at frame level or close to it.
Build harness in place: CMakeLists.txt + tests/{bench_neon_idct.c,
vp9_idct8_ref.c, bench_vulkan_dispatch.c, shaders/noop.comp} +
external/ffmpeg-snapshot/config.h shim (7 defines + EXTERN_ASM).
Builds clean on Debian Trixie aarch64 with cmake 3.31, ninja 1.12,
libvulkan-dev 1.4.309, glslang-tools 15.1.0. Vendored FFmpeg .S
assembles via the config.h shim.
Next: Phase 4 (plan first QPU IDCT kernel under the M5 batching
constraint) -> Phase 5 second-model review -> Phase 6 implement.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
| phase | status | date_opened | parent | target_kernel |
|---|---|---|---|---|
| 2 | closed 2026-05-18 | 2026-05-18 | phase1.md | VP9 8×8 inverse DCT (DCT_DCT variant, 8-bit pixels) |
Phase 2 — Situation analysis
Per dev_process.md:
Document current state. Identify constraints, dependencies, known failure modes. Reset context here — do not carry assumptions from prior sessions; re-read CLAUDE.md, relevant memory files, run
`git status`, re-verify reachability.
1. Context reset
- Working tree state: dirty (Phase 0/1/2 docs not yet committed). `.git-broken-2026-05-18/` preserved as a forensic artifact of the 2026-05-18 10:25 working-tree wipe (cause undetermined).
- CLAUDE.md re-read: no contradictions with the Path B scope set in README §"Architecture (Path B)".
- hertz reachability: confirmed via SSH; `vcgencmd`, `vulkaninfo`, `apt`, sudo NOPASSWD all working as of 2026-05-17 inventory. Mesa 25.0.7 / Vulkan 1.3.305 / V3D 7.1.7 stable.
2. Reference implementations — VP9 8×8 IDCT (DCT_DCT)
The Phase 1 kernel has two canonical reference implementations in FFmpeg n7.1.3 (the version installed on hertz). The harness will link both: the C path as the bit-exact gate (M1), the NEON path as the throughput baseline (M3).
2.1 C reference
- Source: `libavcodec/vp9dsp_template.c`, function `idct_idct_8x8_add_c`
- Spec basis: VP9 specification §8.7 (Inverse transform process)
- Signature: `static void idct_idct_8x8_add_c(uint8_t *_dst, ptrdiff_t stride, int16_t *_block, int eob);`
- Algorithm (8-bit path):
  - If `eob == 1` (DC-only): the DC coefficient is multiplied by 11585 twice, with a Q14 round-shift after each multiply; the result is rounded with `(t + 16) >> 5`, broadcast to 8×8 with `+pred`, and clamped to `[0,255]`.
  - Otherwise: 8 column passes through `idct8_1d` → tmp[64]. Zero the input block. 8 row passes through `idct8_1d` → out[8]. Per-element `(out + 16) >> 5`, add to `dst`, `av_clip_pixel`.
- `idct8_1d`: 1-D 8-point inverse DCT, 8 trigonometric multiply-add stages with Q14 fixed-point constants then 8-butterfly add/sub stages. All arithmetic is signed int32 (`dctint`).
- Q14 constants (matched against VP9 spec §8.7.1.4):

  | symbol | value | trig identity |
  |---|---|---|
  | cospi_16_64 | 11585 | cos(π/4) × 2^14 ≈ 0.70711 |
  | cospi_24_64 | 6270 | cos(3π/8) × 2^14 ≈ 0.38268 |
  | cospi_8_64 | 15137 | sin(3π/8) × 2^14 ≈ 0.92388 |
  | cospi_28_64 | 3196 | cos(7π/16) × 2^14 ≈ 0.19509 |
  | cospi_4_64 | 16069 | sin(7π/16) × 2^14 ≈ 0.98079 |
  | cospi_20_64 | 9102 | cos(5π/16) × 2^14 ≈ 0.55557 |
  | cospi_12_64 | 13623 | sin(5π/16) × 2^14 ≈ 0.83147 |

  Rounding convention: `(product + (1 << 13)) >> 14`, i.e. round-half-up at bit 14.
- License: LGPL-2.1-or-later (FFmpeg).
- Side effect: zeroes the input `block[]` (idempotency requirement; matches spec).
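
The DC-only branch and the Q14 rounding convention described above can be sketched in a few lines of C. This is a minimal illustration, not the vendored FFmpeg code: the helper names `q14_mul_round`, `clip_u8`, and `dc_only_8x8_add` are hypothetical, and the sketch assumes 8-bit pixels, a positive stride, and an arithmetic right shift on negative values (as GCC and Clang provide).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Q14 rounded multiply: (a * c + 2^13) >> 14, round-half-up at bit 14.
 * Assumes arithmetic right shift for negative operands. */
static int32_t q14_mul_round(int32_t a, int32_t c)
{
    return (a * c + (1 << 13)) >> 14;
}

/* Clamp an intermediate value to the 8-bit pixel range [0, 255]. */
static uint8_t clip_u8(int32_t v)
{
    return (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
}

/* Hypothetical helper (not an FFmpeg symbol): DC-only 8x8 inverse DCT,
 * added onto the prediction already stored in dst. */
static void dc_only_8x8_add(uint8_t *dst, ptrdiff_t stride, int16_t *block)
{
    const int32_t cospi_16_64 = 11585;
    assert(dst != NULL && block != NULL);

    /* Two successive Q14 multiplies (one per 1-D pass), then the final
     * (t + 16) >> 5 output rounding used for the 8x8 size. */
    int32_t t = q14_mul_round(q14_mul_round(block[0], cospi_16_64), cospi_16_64);
    int32_t delta = (t + 16) >> 5;

    block[0] = 0; /* side effect: the consumed coefficient is zeroed */
    for (int y = 0; y < 8; y++)
        for (int x = 0; x < 8; x++)
            dst[y * stride + x] = clip_u8(dst[y * stride + x] + delta);
}
```

With a DC coefficient of 1024, this sketch adds 16 to every pixel of the block, saturating at 255.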
2.2 NEON reference
- Source: `libavcodec/aarch64/vp9itxfm_neon.S`, symbol `ff_vp9_idct_idct_8x8_add_neon`
- Signature (same as C): `void ff_vp9_idct_idct_8x8_add_neon(uint8_t *dst, ptrdiff_t stride, int16_t *block, int eob);`
  Registers: `x0=dst, x1=stride, x2=block, w3=eob`.
- Internal dependencies (must be copied alongside the .S):

  | macro / symbol | location | role |
  |---|---|---|
  | `idct8` | `vp9itxfm_neon.S` | 1-D 8-pt IDCT, fully unrolled with `dmbutterfly*` |
  | `dmbutterfly0` | `vp9itxfm_neon.S` | rotation by π/4 (the `cospi_16_64` case) |
  | `dmbutterfly` | `vp9itxfm_neon.S` | general 2-input rotation `[a,b] → [a·c1−b·c2, a·c2+b·c1]` (Q14) |
  | `dmbutterfly_l` | `vp9itxfm_neon.S` | wide-form (4×i32 acc) for `dmbutterfly` |
  | `butterfly_8h` | `vp9itxfm_neon.S` | trivial `[a+b, a−b]` on `int16x8_t` |
  | `transpose_8x8H` | `libavcodec/aarch64/neon.S` | in-place 8×8 i16 transpose |
  | `idct_coeffs` | `vp9itxfm_neon.S` (const) | Q14 trig constants table, aligned 4 |
  | `movrel` | `libavutil/aarch64/asm.S` | PIC-aware constant-pool relocation helper |
- License: LGPL-2.1-or-later (Google, 2016).
- Performance shape: full unrolled 8-pt butterfly with NEON `smull`/`smlsl`/`smlal` + `rshrn` for the Q14 round-shift; output uses `sqxtun` for saturated narrow to u8. Estimated ~80 NEON instructions for the steady state (non-DC) path.
2.3 AV1 equivalence note
AV1's 8×8 DCT_DCT transform (av1_iidentity8_iidentity8_c vs av1_idct8_idct8_c family in libavcodec/av1dsp/...) shares the same 1-D 8-point structure but with different scaling: AV1 uses 12-bit fixed-point (>> 12) and a slightly different rounding shift due to its different transform-stage bit growth model. Calling our VP9 IDCT shader on AV1 coefficients will produce wrong output. AV1 support is out of scope for Phase 1. A Phase-N variant can fork the shader with the AV1 constants once Phase 1 has proven the VP9 path.
3. Vulkan compute dispatch path
Hertz exposes V3D 7.1 via Mesa's v3dv driver as Vulkan
PHYSICAL_DEVICE_TYPE_INTEGRATED_GPU, API 1.3.305, conformance
1.3.8.3. The compute-only dispatch path is:
host program
├─ vkCreateInstance / vkEnumeratePhysicalDevices (picks V3D 7.1.7.0)
├─ vkCreateDevice (queue family with COMPUTE_BIT, no graphics needed)
├─ vkCreateBuffer x N (SSBOs for block coeffs in / dst pixels in+out)
│ - buffer flags: STORAGE_BUFFER_BIT | TRANSFER_SRC/DST
│ - memory type: HOST_VISIBLE | HOST_COHERENT (zero-copy on shared LPDDR4x)
├─ vkCreateDescriptorSetLayout (≤8 SSBOs per layout — Pi 5 limit)
├─ vkCreateShaderModule (SPIR-V from glslang)
├─ vkCreateComputePipeline
├─ vkBeginCommandBuffer
│ vkCmdBindPipeline / vkCmdBindDescriptorSets / vkCmdPushConstants
│ vkCmdDispatch(group_count_x, 1, 1) # one WG per ~K blocks
├─ vkQueueSubmit + vkQueueWaitIdle (or fence) — this is the measured op
└─ (read back via the HOST_VISIBLE buffer, or alias it to the same memory the CPU populated)
Per Phase 0 §2 inside-view limits, the relevant constraints for this kernel:
- ≤8 SSBOs per stage → group inputs/outputs into ≤8 bindings (we only need 2: `block[]` in, `dst[]` in/out).
- Shared mem ≤16 KiB → each 8×8 block fits trivially (128 B of i16 coefficients, or 256 B if widened to i32, plus 64 B of u8 pixels). One WG can carry dozens of blocks of shared state if useful.
- Subgroup size = 16 (fixed). One workgroup of 64 invocations = 4 subgroups; one block per subgroup is a natural shape (each 16-lane subgroup covers the 8×8 = 64 pixels in 4 passes of 16 lanes).
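
The geometry implied by these limits is simple integer arithmetic, sketched below. The constant and function names are illustrative assumptions, the one-block-per-subgroup mapping is only the "natural shape" suggested above (Phase 4 may choose differently), and the 320 B per-block figure is the upper bound quoted in constraint C6.

```c
#include <assert.h>
#include <stdint.h>

enum {
    SUBGROUP_SIZE    = 16,        /* fixed subgroup width on V3D 7.1 */
    WG_INVOCATIONS   = 64,        /* one workgroup = 4 subgroups */
    BLOCKS_PER_WG    = WG_INVOCATIONS / SUBGROUP_SIZE, /* one block per subgroup */
    SHARED_MEM_BYTES = 16 * 1024, /* shared memory limit per workgroup */
    BYTES_PER_BLOCK  = 320        /* per-block working-set upper bound (C6) */
};

/* Workgroups needed to cover n_blocks 8x8 blocks (ceiling division);
 * this is the value that would be passed as group_count_x. */
static uint32_t group_count_x(uint32_t n_blocks)
{
    assert(n_blocks <= UINT32_MAX - (BLOCKS_PER_WG - 1));
    return (n_blocks + BLOCKS_PER_WG - 1) / BLOCKS_PER_WG;
}

/* How many blocks' worth of state fit in one workgroup's shared memory. */
static uint32_t max_blocks_in_shared(void)
{
    return SHARED_MEM_BYTES / BYTES_PER_BLOCK;
}

/* Passes a 16-lane subgroup needs to touch all 64 pixels of one block. */
static uint32_t passes_per_block(void)
{
    return 64 / SUBGROUP_SIZE;
}
```

For a 1920×1080 frame (240 × 135 = 32400 blocks), this mapping gives 8100 workgroups, and the shared-memory budget holds 51 blocks, which is consistent with the "dozens of blocks" estimate.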
4. Build path on hertz
Already installed (2026-05-17): cmake 3.31, ninja 1.12, gcc (Debian
trixie default), libvulkan-dev 1.4.309, glslang-tools 15.1.0,
spirv-tools 2025.1, libdrm-dev 2.4.131, vulkan-tools 1.4.304.
Missing but cheap:
- `libavcodec-dev`: only needed if the harness wants to link against system libavcodec for cross-checks against the dynamic dispatcher. Not needed for the source-copy approach (preferred, see §5).
5. Reference-copy strategy (vs system-libavcodec link)
Decision: source-copy the FFmpeg files listed below into external/ffmpeg-snapshot/.
Rationale:
- System `libavcodec.so` on hertz is symbol-stripped (`nm` returns empty for `ff_vp9_idct_*`). Internal NEON entry points are not reachable via `dlsym`.
- The two reference implementations (C, NEON) plus their macro/data dependencies total ~3 files / ~600 lines. Source-copy is smaller than the dlopen plumbing would be.
- LGPL-2.1-or-later (FFmpeg license) is propagation-compatible with the harness binary if the harness binary itself is GPL or LGPL. The kernel shaders and dispatch library stay separately-licensed (BSD-2-Clause, default for this project).
- Pinning to `n7.1.3` matches hertz's runtime libavcodec version, so any in-session sanity cross-check against the running Mesa / video tooling stays consistent.
Files to vendor:
| Source | License | Target path under daedalus-fourier/ |
|---|---|---|
| `libavcodec/vp9dsp_template.c` | LGPL-2.1+ | `external/ffmpeg-snapshot/vp9dsp_template.c` |
| `libavcodec/aarch64/vp9itxfm_neon.S` | LGPL-2.1+ | `external/ffmpeg-snapshot/aarch64/vp9itxfm_neon.S` |
| `libavcodec/aarch64/neon.S` (for `transpose_8x8H`) | LGPL-2.1+ | `external/ffmpeg-snapshot/aarch64/neon.S` |
| `libavutil/aarch64/asm.S` (for `movrel`, `function`, `endfunc`) | LGPL-2.1+ | `external/ffmpeg-snapshot/aarch64/asm.S` |
| (whatever else `vp9dsp_template.c` transitively needs) | LGPL-2.1+ | as required |
`external/ffmpeg-snapshot/COPYING.LGPL` carries the license text, and `external/ffmpeg-snapshot/PROVENANCE.md` documents the upstream commit (n7.1.3 tag, commit hash) and the verbatim-copy guarantee.
6. Known constraints / failure modes carried from Phase 0
Repeated here so Phase 4 (plan) can bind against them without re-derivation:
- C1: shaderFloat16 = false → all shader arithmetic must be int32 (we are int anyway — no risk).
- C2: maxComputeSharedMemorySize = 16 KiB → kernel must not require more (8×8 IDCT trivially fits even with many blocks per WG).
- C3: maxPerStageDescriptorStorageBuffers = 8 → we need only 2 (coeffs + dst), no risk.
- C4: subgroupSupportedOperations = BASIC + VOTE only → no `subgroupAdd`/etc. for accumulator reductions. Workaround: the IDCT structure is fully data-parallel without reductions; this constraint doesn't bite.
- C5: VC7 has SMUL24 but no INT8 MAC. Our Q14 multiplies are i16×i16→i32; the multiplicands fit in 17 bits, so SMUL24 covers it. No INT8/INT4 issues.
- C6: shared LPDDR4x bus; GPU sees ~4–7 GB/s vs CPU ~12–15 GB/s. For 8×8 IDCT, working set is tiny (≤320 B/block), so per-block bandwidth is not the bottleneck; per-dispatch submit overhead is.
- C7: VPM read-stall serialization. If we hand-write QPU asm (we won't, in Phase 1) this would matter; the Vulkan compute path lets the v3d_compiler schedule for us.
- C8: VC7 thermal throttle at 85°C GPU / 80°C CPU. Phase 7 measurements should record temp before/during/after to flag throttling.
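
The bit-width argument in C5 can be checked mechanically. The sketch below assumes that the widest multiplicand is the sum of two i16 values (as in a butterfly input such as `a + b`) and that SMUL24 accepts sign-extended 24-bit operands; both are assumptions made for illustration, not statements about the v3d compiler or the hardware documentation. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* True if v is representable as a signed two's-complement value of `bits` bits. */
static int fits_signed(int64_t v, int bits)
{
    assert(bits >= 2 && bits <= 63);
    int64_t lim = (int64_t)1 << (bits - 1);
    return v >= -lim && v < lim;
}

/* Worst-case check for one Q14 multiply by a positive constant c:
 * operands must fit the assumed 24-bit multiplier inputs, and the rounded
 * product must fit the int32 accumulator. */
static int q14_multiply_is_safe(int32_t c)
{
    const int64_t lo    = 2 * (int64_t)INT16_MIN; /* -65536: sum of two i16 minima */
    const int64_t hi    = 2 * (int64_t)INT16_MAX; /*  65534: sum of two i16 maxima */
    const int64_t round = 1 << 13;                /* Q14 rounding term */

    return fits_signed(lo, 24) && fits_signed(hi, 24) && fits_signed(c, 24) &&
           fits_signed(lo * c + round, 32) && fits_signed(hi * c + round, 32);
}
```

Under these assumptions the largest constant in the table, 16069, passes the check, and the extreme operand −65536 needs exactly 17 signed bits, matching the figure quoted in C5.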
7. What Phase 2 does not close
- The harness architecture (single binary? Two binaries — one for bit-exact, one for throughput?). Phase 4 picks.
- Block-per-WG dispatch geometry. Phase 4 + Phase 6 sweep.
- Random-coefficient generation strategy (uniform i16 vs realistic-distribution; the latter affects DC-only path frequency). Phase 4 picks; Phase 7 may re-evaluate.
- Whether NEON measurement uses `clock_gettime(CLOCK_MONOTONIC_RAW)` per-call (high overhead) or batched (more realistic for codec use). Phase 3 picks during baseline collection.
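
The two timing styles named in the last item differ only in where the timer pair sits. A minimal sketch, assuming a Linux/POSIX host; `now_ns`, `time_batched_ns`, `time_per_call_ns`, and the stand-in workload `bump_counter` are hypothetical names, and the real harness would call the NEON routine instead of the stand-in.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <stdint.h>
#include <time.h>

#ifndef CLOCK_MONOTONIC_RAW           /* fallback for hosts without the raw clock */
#define CLOCK_MONOTONIC_RAW CLOCK_MONOTONIC
#endif

/* Current monotonic time in nanoseconds. */
static uint64_t now_ns(void)
{
    struct timespec ts;
    int rc = clock_gettime(CLOCK_MONOTONIC_RAW, &ts);
    assert(rc == 0);
    (void)rc;
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

typedef void (*kernel_fn)(void *ctx);

/* Batched: one timer pair around n calls; timer cost is paid once. */
static uint64_t time_batched_ns(kernel_fn fn, void *ctx, uint64_t n)
{
    uint64_t t0 = now_ns();
    for (uint64_t i = 0; i < n; i++)
        fn(ctx);
    return now_ns() - t0;
}

/* Per-call: a timer pair around every call; the sum includes n timer overheads. */
static uint64_t time_per_call_ns(kernel_fn fn, void *ctx, uint64_t n)
{
    uint64_t total = 0;
    for (uint64_t i = 0; i < n; i++) {
        uint64_t t0 = now_ns();
        fn(ctx);
        total += now_ns() - t0;
    }
    return total;
}

/* Stand-in workload for the sketch: increments a counter. */
static void bump_counter(void *ctx)
{
    (*(volatile uint64_t *)ctx)++;
}
```

Dividing the batched total by `n` gives a per-block figure; for a kernel that runs in roughly a hundred nanoseconds, the per-call variant is likely to be dominated by the cost of the timer calls themselves.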
8. Hand-off to Phase 3
Phase 3 measures:
- M3-prelim: NEON `ff_vp9_idct_idct_8x8_add_neon` throughput on hertz, batched over 10⁶ random blocks, single-threaded, 4-thread, sched-isolated. This is the floor.
- M5-prelim: Vulkan dispatch overhead: pipeline create cost (one-time), per-`vkCmdDispatch` cost (per-frame-equivalent), per-`vkQueueSubmit + vkQueueWaitIdle` cost (per-completion). Bound below which kernel batching is mandatory.
Both are measurements on the existing substrate. Neither requires writing any shader code. Phase 3 closes before Phase 4 (plan) begins.
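
Once M3 and M5 are measured, the batching bound reduces to a ceiling division: the number of blocks a single dispatch must carry before its fixed overhead is no larger than the NEON time for the same blocks. A minimal sketch, with the hypothetical helper `min_blocks_to_amortize` and times expressed in picoseconds to keep the arithmetic in integers:

```c
#include <assert.h>
#include <stdint.h>

/* Smallest block count whose total NEON time (n * per_block_ps) is at least
 * the fixed per-dispatch overhead. Both arguments are in picoseconds. */
static uint64_t min_blocks_to_amortize(uint64_t overhead_ps, uint64_t per_block_ps)
{
    assert(per_block_ps > 0);
    return (overhead_ps + per_block_ps - 1) / per_block_ps;
}
```

Using the Phase 3 figures recorded in the commit message above (55.60 us for a 1-WG noop dispatch, 122.4 ns per NEON block), `min_blocks_to_amortize(55600000, 122400)` returns 455, in line with the "~455x" ratio quoted there. This is only a break-even point; it says nothing about the kernel's own execution time.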