dcbbc77038
This is a from-scratch initial commit on a fresh .git. The original
scaffold commit (7510b56) and the earlier session's working-tree
docs were lost in a 2026-05-18 10:25 working-tree wipe; the corrupted
.git is preserved at .git-broken-2026-05-18/ (gitignored) for
forensic inspection.

Scope re-anchored from Path A (custom VPU firmware on VC7 scalar
cores; blocked by BCM2712 silicon-RoT mask-ROM signature check)
to Path B (QPU compute kernels via Mesa v3d / Vulkan compute or
direct DRM, on stock signed Pi 5 / CM5). See README.md and
docs/phase0.md for the substrate audit that closed Path A.

Phases closed:
  Phase 0: substrate audit; Path A blocked, Path B open;
           codec-back-end-fits-QPU finding (docs/phase0.md)
  Phase 1: first kernel locked (VP9 / AV1 8x8 inverse DCT) with
           publish-before-measure R = M2/M3 decision rules
           (docs/phase1.md)
  Phase 2: reference impls mapped; FFmpeg n7.1.3 source vendored
           under external/ffmpeg-snapshot/ (PROVENANCE.md pins
           commit f46e514 + per-file SHA-256s) (docs/phase2.md)
  Phase 3: real baseline measurements on hertz (docs/phase3.md):
             M1  bit-exact             100.0000 % (10000/10000)
             M3  NEON IDCT8 single     8.171 Mblock/s (122.4 ns/block)
             M5a empty Vulkan submit   22.66 us
             M5b 1-WG noop dispatch    55.60 us
             M5  delta                 32.95 us/dispatch
           => per-dispatch overhead is ~455x per-NEON-block cost;
              Phase 4 must batch at frame level or close to it.

Build harness in place: CMakeLists.txt + tests/{bench_neon_idct.c,
vp9_idct8_ref.c, bench_vulkan_dispatch.c, shaders/noop.comp} +
external/ffmpeg-snapshot/config.h shim (7 defines + EXTERN_ASM).
Builds clean on Debian Trixie aarch64 with cmake 3.31, ninja 1.12,
libvulkan-dev 1.4.309, glslang-tools 15.1.0. Vendored FFmpeg .S
assembles via the config.h shim.

Next: Phase 4 (plan first QPU IDCT kernel under the M5 batching
constraint) -> Phase 5 second-model review -> Phase 6 implement.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
129 lines
5.7 KiB
Markdown
---
phase: 1
status: open
date_opened: 2026-05-18
parent: phase0.md
target_kernel: VP9 / AV1 8×8 inverse DCT (integer fixed-point)
dev_host: hertz
---

# Phase 1: Goal formulation

Per `dev_process.md`:

> Define the objective in measurable terms. State what success looks
> like *before* touching anything. The chosen metric is a **hypothesis**
> about what to measure, not an axiom — Phase 3 may invalidate it.

## Kernel under test

**VP9 / AV1 8×8 inverse DCT (DCT_DCT variant), integer 16-bit
fixed-point input, 8-bit output, with reconstructed-block add.**

Mirrors the `ff_vp9_idct_idct_8x8_add_neon` shape in libavcodec
(see `libavcodec/aarch64/vp9itxfm_neon.S`) and the equivalent
dav1d / rav1d / libgav1 implementations for AV1's `IDTX_DCT` /
`DCT_DCT` 8×8 path.

I/O contract (per VP9 spec § 8.7 inverse transform process):

```
input:  int16_t   coeffs[64]   // dequantized transform coefficients
input:  uint8_t   pred[64]     // predicted block (intra/inter)
input:  ptrdiff_t stride       // typically 8 for an isolated test
output: uint8_t   dst[64]      // clamp(pred + idct(coeffs)) per pixel
```

Bit-exact: integer arithmetic per spec, no rounding ambiguity.

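The output half of the contract, `clamp(pred + idct(coeffs))`, can be sketched in C as follows. This is a minimal illustration under stated assumptions, not the libavcodec code: `clamp_u8` and `recon_add_8x8` are hypothetical names, the `residual` argument stands in for the inverse-transform output (which this sketch does not compute), and a row-major residual layout is assumed. The in-place update of `dst` follows the `_add` shape, where `dst` holds the prediction on entry.

```c
#include <stddef.h>
#include <stdint.h>

/* Clamp a signed intermediate to the 8-bit pixel range [0, 255]. */
static uint8_t clamp_u8(int v)
{
    if (v < 0)   return 0;
    if (v > 255) return 255;
    return (uint8_t)v;
}

/*
 * Hypothetical helper: add an 8x8 residual block (the IDCT output,
 * computed elsewhere) onto the predicted pixels already in dst.
 * dst holds pred on entry and the reconstructed block on exit.
 * Assumption: residual is stored row-major, 8 values per row.
 */
static void recon_add_8x8(uint8_t *dst, ptrdiff_t stride,
                          const int16_t residual[64])
{
    for (int y = 0; y < 8; y++) {
        for (int x = 0; x < 8; x++) {
            int v = dst[y * stride + x] + residual[y * 8 + x];
            dst[y * stride + x] = clamp_u8(v);
        }
    }
}
```

Because every step is integer addition followed by a saturating clamp, two correct implementations must agree on every output byte, which is what the bit-exactness requirement relies on.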
## Measurable success criteria

Three numbers must come out of Phase 7, all measured in-session on
hertz, all N≥3:

| ID | Measurement | What it tells us |
|---|---|---|
| **M1** | **Bit-exactness rate** vs libavcodec C reference, across ≥10 000 random coefficient blocks | Correctness gate. Must be 100.000 %. Anything less means the kernel is wrong, and no other number matters. |
| **M2** | **QPU throughput** in million blocks per second (Mblock/s), single-threaded host driver, sustained over ≥1 s | The substrate's actual delivered capacity for this kernel shape. |
| **M3** | **NEON throughput** in Mblock/s on the same hertz, single-threaded, running `ff_vp9_idct_idct_8x8_add_neon` via a microbench harness | The floor any GPU offload has to beat or get close to. |

Derived figure for go/no-go: **R = M2 / M3**.

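How the three numbers reduce to code can be sketched as below. The helper names (`count_exact_blocks`, `mblocks_per_sec`, `ratio_r`) are hypothetical and not taken from the test harness; the sketch assumes each block's output is 64 contiguous bytes and that elapsed time is available in nanoseconds.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* M1: count blocks whose 64 output bytes match the reference exactly.
 * The gate passes only if the return value equals nblocks; comparing
 * integer counts avoids a percentage that rounds up to 100.000 %. */
static size_t count_exact_blocks(const uint8_t *got, const uint8_t *ref,
                                 size_t nblocks)
{
    size_t ok = 0;
    for (size_t i = 0; i < nblocks; i++) {
        if (memcmp(got + 64 * i, ref + 64 * i, 64) == 0)
            ok++;
    }
    return ok;
}

/* M2 / M3: throughput in million blocks per second.
 * blocks / (ns * 1e-9) / 1e6 simplifies to blocks * 1e3 / ns. */
static double mblocks_per_sec(uint64_t blocks, uint64_t elapsed_ns)
{
    return (double)blocks * 1e3 / (double)elapsed_ns;
}

/* Go/no-go ratio R = M2 / M3, both in the same unit. */
static double ratio_r(double m2_qpu, double m3_neon)
{
    return m2_qpu / m3_neon;
}
```

Keeping M2 and M3 in the same unit means R is dimensionless and independent of the unit chosen.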
## Decision rules (set before measuring, per `feedback_no_motivated_reasoning`)

| R | Interpretation | Next step |
|---|---|---|
| R ≥ 1.0 | QPU beats NEON on this kernel in isolation. Strong substrate signal. | Phase 9 lessons → Phase 1 of next kernel (deblocking or CDEF). |
| 0.5 ≤ R < 1.0 | QPU loses in isolation but is in the same order of magnitude. *Concurrent-work* hypothesis becomes viable: at R≈0.5 the QPU can roughly handle half of decode while the CPU does the other half + everything else. | Add a Phase 1' measurement: M4 = combined CPU+QPU throughput when both run concurrently (does total system delivery exceed pure-CPU?). Then decide. |
| 0.1 ≤ R < 0.5 | QPU is materially slower. Concurrent-work win unlikely to be worth the integration cost. | Honest close. Phase 9 documents the negative result. |
| R < 0.1 | QPU is structurally wrong for this kernel shape. | Honest close. Phase 9 documents the failure, project shelves. |

These thresholds are deliberately published *before* measurement so
the result can't be retroactively reframed.

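The decision table maps directly onto a small classifier, sketched here in C. The enum and function names are hypothetical; only the thresholds and their half-open intervals come from the table.

```c
/* One value per row of the decision table. */
typedef enum {
    R_QPU_WINS,          /* R >= 1.0        */
    R_CONCURRENT_VIABLE, /* 0.5 <= R < 1.0  */
    R_CLOSE_NEGATIVE,    /* 0.1 <= R < 0.5  */
    R_SHELVE             /* R < 0.1         */
} r_band;

/* Classify R = M2 / M3. Checks run from the highest threshold down,
 * so each lower bound is inclusive and each upper bound exclusive. */
static r_band classify_r(double r)
{
    if (r >= 1.0) return R_QPU_WINS;
    if (r >= 0.5) return R_CONCURRENT_VIABLE;
    if (r >= 0.1) return R_CLOSE_NEGATIVE;
    return R_SHELVE;
}
```

Fixing the band boundaries in code before any measurement exists is one way to make the publish-before-measure rule mechanical.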
## Secondary measurements (not gating, but recorded)

- **M5**: per-kernel-launch overhead in µs, isolated (run with 0
  blocks, measure submit+wait round-trip). Tells us the batch size
  below which launch overhead dominates, i.e. how much kernel
  batching is required.
- **M6**: workgroup-size sweep across {8, 16, 32, 64, 128, 256}
  invocations to identify the v3dv-optimal launch shape for this
  kernel. Records the Pareto curve; doesn't change R unless the
  best-WG result invalidates M2.
- **M7**: power draw delta at the wall (via the Himbeere Fritz!DECT
  plug telemetry, if reachable) under idle vs CPU-only vs QPU-only
  vs CPU+QPU concurrent. Order-of-magnitude only; informs the higgs
  battery argument that motivates the project.

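To show how M5 turns into a batching floor, here is a rough C sketch. It assumes a simple linear cost model that this document does not itself state: each dispatch costs a fixed overhead plus a constant per-block time, so the overhead share of a dispatch with n blocks is overhead / (overhead + n × per-block time). The function name and parameters are hypothetical.

```c
#include <stdint.h>

/*
 * Smallest number of blocks per dispatch such that the fixed launch
 * overhead is at most max_overhead_frac of the total dispatch time.
 *
 * Assumed model: total = overhead_ns + n * ns_per_block.
 * Solving overhead / total <= f for n gives
 *     n >= overhead * (1 - f) / (f * ns_per_block).
 * Preconditions: overhead_ns > 0, ns_per_block > 0, 0 < f < 1.
 */
static uint64_t min_blocks_per_dispatch(double overhead_ns,
                                        double ns_per_block,
                                        double max_overhead_frac)
{
    double n = overhead_ns * (1.0 - max_overhead_frac)
             / (max_overhead_frac * ns_per_block);
    uint64_t whole = (uint64_t)n;
    /* Round up: a fractional block count still needs a whole block. */
    return (n > (double)whole) ? whole + 1 : whole;
}
```

The per-block time that belongs in this model is the QPU's, which is not known until M2 is measured, so any value plugged in earlier is only a planning estimate.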
## What Phase 1 does *not* lock

- The dispatch path (Vulkan compute via `v3dv` vs direct DRM
  submit via `v3d_drm.h` ioctl). Phase 4 picks. Default for
  Phase 1 = **Vulkan compute**, because it is documented,
  debuggable, and doesn't require QPU asm against the NDA ISA;
  Phase 4 may flip this if it has reason to.
- The shader source (GLSL → glslang → SPIR-V) vs hand-written
  SPIR-V. Default = GLSL.
- Workgroup partitioning (one-block-per-WG vs many-blocks-per-WG).
  Phase 4 chooses based on subgroup width and tile cost; Phase 1
  records the sweep (M6).

## Non-goals for Phase 1

- No V4L2 driver work.
- No end-to-end VP9 / AV1 decode (entropy + back-end). Just one
  kernel, isolated, measured.
- No optimization beyond what's needed to hit the bit-exact gate
  and produce a single throughput number. Tuning is Phase 7's
  feedback if R is borderline.
- No build-system perfection. A CMakeLists that compiles the test
  harness on hertz is enough.

## Phase 1 → Phase 2 hand-off conditions

Phase 1 closes when:
- The above metrics + decision rules are reviewed. This is not the
  second-model review of `dev_process.md` Phase 5; that review
  comes after the Phase 4 plan.
- The metrics are recorded in this file or a sibling
  `phase1_metrics.md` artifact (TBD).

The next phase (Phase 2, situation analysis) inventories:
- libavcodec's NEON IDCT reference (file, function, calling
  convention, expected I/O contract).
- VP9 spec § 8.7 transform process (which the C reference
  implements verbatim).
- AV1 spec § 7.7 (same transform structure, larger transform set;
  8×8 DCT_DCT path is identical to VP9's at this size).
- Mesa v3dv's compute-shader compilation path and any known
  v3dv-specific shader idioms that perform better on V3D 7.1.
- The hertz Vulkan dispatch overhead floor (M5 candidate, but
  measured as part of Phase 3 baseline).

## Open questions Phase 1 hands forward

None new. Phase 0 § 7's open questions are the standing list;
Phase 1 picks off Q1 (single-kernel throughput) and Q2 (NEON
baseline) directly via M2 and M3.