This is a from-scratch initial commit on a fresh .git. The original
scaffold commit (7510b56) and the earlier session's working-tree
docs were lost in a 2026-05-18 10:25 working-tree wipe; the corrupted
.git is preserved at .git-broken-2026-05-18/ (gitignored) for
forensic inspection.
Scope re-anchored from Path A (custom VPU firmware on VC7 scalar
cores; blocked by BCM2712 silicon-RoT mask-ROM signature check)
to Path B (QPU compute kernels via Mesa v3d / Vulkan compute or
direct DRM, on stock signed Pi 5 / CM5). See README.md and
docs/phase0.md for the substrate audit that closed Path A.
Phases closed:
Phase 0 — substrate audit; Path A blocked, Path B open;
codec-back-end-fits-QPU finding (docs/phase0.md)
Phase 1 — first kernel locked (VP9 / AV1 8x8 inverse DCT) with
publish-before-measure R = M2/M3 decision rules
(docs/phase1.md)
Phase 2 — reference impls mapped; FFmpeg n7.1.3 source vendored
under external/ffmpeg-snapshot/ (PROVENANCE.md pins
commit f46e514 + per-file SHA-256s) (docs/phase2.md)
Phase 3 — real baseline measurements on hertz (docs/phase3.md):
M1 bit-exact 100.0000 % (10000/10000)
M3 NEON IDCT8 single 8.171 Mblock/s (122.4 ns/block)
M5a empty Vulkan submit 22.66 us
M5b 1-WG noop dispatch 55.60 us
M5 delta 32.95 us/dispatch
=> per-dispatch overhead is ~455x per-NEON-block cost;
Phase 4 must batch at frame level or close to it.
Build harness in place: CMakeLists.txt + tests/{bench_neon_idct.c,
vp9_idct8_ref.c, bench_vulkan_dispatch.c, shaders/noop.comp} +
external/ffmpeg-snapshot/config.h shim (7 defines + EXTERN_ASM).
Builds clean on Debian Trixie aarch64 with cmake 3.31, ninja 1.12,
libvulkan-dev 1.4.309, glslang-tools 15.1.0. Vendored FFmpeg .S
assembles via the config.h shim.
Next: Phase 4 (plan first QPU IDCT kernel under the M5 batching
constraint) -> Phase 5 second-model review -> Phase 6 implement.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
5.7 KiB
phase, status, date_opened, parent, target_kernel, dev_host
| phase | status | date_opened | parent | target_kernel | dev_host |
|---|---|---|---|---|---|
| 1 | open | 2026-05-18 | phase0.md | VP9 / AV1 8×8 inverse DCT (integer fixed-point) | hertz |
Phase 1 — Goal formulation
Per dev_process.md:
Define the objective in measurable terms. State what success looks like before touching anything. The chosen metric is a hypothesis about what to measure, not an axiom — Phase 3 may invalidate it.
Kernel under test
VP9 / AV1 8×8 inverse DCT (DCT_DCT variant), integer 16-bit fixed-point input, 8-bit output, with reconstructed-block add.
Mirrors the ff_vp9_idct_idct_8x8_add_neon shape in libavcodec
(see libavcodec/aarch64/vp9itxfm_neon.S) and the equivalent
dav1d / rav1d / libgav1 implementations for AV1's IDTX_DCT /
DCT_DCT 8×8 path.
I/O contract (per VP9 spec § 8.7 inverse transform process):
input: int16_t coeffs[64] // dequantized transform coefficients
input: uint8_t pred[64] // predicted block (intra/inter)
input: ptrdiff_t stride // typically 8 for an isolated test
output: uint8_t dst[64] // clamp(pred + idct(coeffs)) per pixel
Bit-exact: integer arithmetic per spec, no rounding ambiguity.
Measurable success criteria
Three numbers must come out of Phase 7, all measured in-session on hertz, all N≥3:
| ID | Measurement | What it tells us |
|---|---|---|
| M1 | Bit-exactness rate vs libavcodec C reference, across ≥10 000 random coefficient blocks | Correctness gate. Must be 100.000 %. Anything less and the kernel is wrong, no other number matters. |
| M2 | QPU throughput in million-blocks-per-second (MblockS), single-threaded host driver, sustained over ≥1 s | The substrate's actual delivered capacity for this kernel shape. |
| M3 | NEON throughput in MblockS on the same hertz, single-threaded, running ff_vp9_idct_idct_8x8_add_neon via a microbench harness |
The floor any GPU offload has to beat or get close to. |
Derived figure for go/no-go: R = M2 / M3.
Decision rules (set before measuring, per feedback_no_motivated_reasoning)
| R | Interpretation | Next step |
|---|---|---|
| ≥ 1.0 | QPU beats NEON on this kernel in isolation. Strong substrate signal. | Phase 9 lessons → Phase 1 of next kernel (deblocking or CDEF). |
| 0.5 ≤ R < 1.0 | QPU loses in isolation but is in the same order of magnitude. Concurrent-work hypothesis becomes viable: at R≈0.5 the QPU can roughly handle half of decode while the CPU does the other half + everything else. | Add a Phase 1' measurement: M4 = combined CPU+QPU throughput when both run concurrently (does total system delivery exceed pure-CPU?). Then decide. |
| 0.1 ≤ R < 0.5 | QPU is materially slower. Concurrent-work win unlikely to be worth the integration cost. | Honest close. Phase 9 documents the negative result. |
| < 0.1 | QPU is structurally wrong for this kernel shape. | Honest close. Phase 9 documents the failure, project shelves. |
These thresholds are deliberately published before measurement so the result can't be retroactively reframed.
Secondary measurements (not gating, but recorded)
- M5 — per-kernel-launch overhead in µs, isolated (run with 0 blocks, measure submit+wait round-trip). Tells us the floor below which kernel batching is required.
- M6 — workgroup-size sweep across {8, 16, 32, 64, 128, 256} invocations to identify the v3dv-optimal launch shape for this kernel. Records the Pareto curve, doesn't change R unless the best-WG result invalidates M2.
- M7 — power draw delta at the wall (via the Himbeere Fritz!DECT plug telemetry, if reachable) under idle vs CPU-only vs QPU-only vs CPU+QPU concurrent. Order-of-magnitude only; informs the higgs battery argument that motivates the project.
What Phase 1 does not lock
- The dispatch path (Vulkan compute via
v3dvvs direct DRM submit viav3d_drm.hioctl). Phase 4 picks. Default for Phase 1 = Vulkan compute unless Phase 4 has reason to flip: documented, debuggable, doesn't require QPU asm against the NDA ISA. - The shader source (GLSL → glslang → SPIR-V) vs hand-written SPIR-V. Default = GLSL.
- Workgroup partitioning (one-block-per-WG vs many-blocks-per-WG). Phase 4 chooses based on subgroup width and tile cost; Phase 1 records the sweep (M6).
Non-goals for Phase 1
- No V4L2 driver work.
- No end-to-end VP9 / AV1 decode (entropy + back-end). Just one kernel, isolated, measured.
- No optimization beyond what's needed to hit the bit-exact gate and produce a single throughput number. Tuning is Phase 7's feedback if R is borderline.
- No build-system perfection. A CMakeLists that compiles the test harness on hertz is enough.
Phase 2 → Phase 3 hand-off conditions
Phase 1 closes when:
- The above metrics + decision rules are reviewed (second-model review per dev_process.md Phase 5? No — this is Phase 1 not Phase 5. The Phase 5 second-model review comes after Phase 4 plan).
- The metrics are recorded in this file or a sibling
phase1_metrics.mdartifact (TBD).
The next phase (Phase 2 — situation analysis) inventories:
- libavcodec's NEON IDCT reference (file, function, calling convention, expected I/O contract).
- VP9 spec § 8.7 transform process (which the C reference implements verbatim).
- AV1 spec § 7.7 (same transform structure, larger transform set; 8×8 DCT_DCT path is identical to VP9's at this size).
- Mesa v3dv's compute-shader compilation path and any known v3dv-specific shader idioms that perform better on V3D 7.1.
- The hertz Vulkan dispatch overhead floor (M5 candidate, but measured as part of Phase 3 baseline).
Open questions Phase 1 hands forward
None new. Phase 0 § 7's open questions are the standing list; Phase 1 picks off Q1 (single-kernel throughput) and Q2 (NEON baseline) directly via M2 and M3.