Files
daedalus-fourier/docs/phase1.md
T
marfrit dcbbc77038 Path B pivot + Phase 0-3 closed with first baseline numbers
This is a from-scratch initial commit on a fresh .git. The original
scaffold commit (7510b56) and the earlier session's working-tree
docs were lost in a 2026-05-18 10:25 working-tree wipe; the corrupted
.git is preserved at .git-broken-2026-05-18/ (gitignored) for
forensic inspection.

Scope re-anchored from Path A (custom VPU firmware on VC7 scalar
cores; blocked by BCM2712 silicon-RoT mask-ROM signature check)
to Path B (QPU compute kernels via Mesa v3d / Vulkan compute or
direct DRM, on stock signed Pi 5 / CM5). See README.md and
docs/phase0.md for the substrate audit that closed Path A.

Phases closed:
  Phase 0 — substrate audit; Path A blocked, Path B open;
            codec-back-end-fits-QPU finding (docs/phase0.md)
  Phase 1 — first kernel locked (VP9 / AV1 8x8 inverse DCT) with
            publish-before-measure R = M2/M3 decision rules
            (docs/phase1.md)
  Phase 2 — reference impls mapped; FFmpeg n7.1.3 source vendored
            under external/ffmpeg-snapshot/ (PROVENANCE.md pins
            commit f46e514 + per-file SHA-256s) (docs/phase2.md)
  Phase 3 — real baseline measurements on hertz (docs/phase3.md):
              M1 bit-exact            100.0000 % (10000/10000)
              M3 NEON IDCT8 single    8.171 Mblock/s (122.4 ns/block)
              M5a empty Vulkan submit 22.66 us
              M5b 1-WG noop dispatch  55.60 us
              M5 delta                32.95 us/dispatch
            => per-dispatch overhead is ~455x per-NEON-block cost;
               Phase 4 must batch at frame level or close to it.

Build harness in place: CMakeLists.txt + tests/{bench_neon_idct.c,
vp9_idct8_ref.c, bench_vulkan_dispatch.c, shaders/noop.comp} +
external/ffmpeg-snapshot/config.h shim (7 defines + EXTERN_ASM).
Builds clean on Debian Trixie aarch64 with cmake 3.31, ninja 1.12,
libvulkan-dev 1.4.309, glslang-tools 15.1.0. Vendored FFmpeg .S
assembles via the config.h shim.

Next: Phase 4 (plan first QPU IDCT kernel under the M5 batching
constraint) -> Phase 5 second-model review -> Phase 6 implement.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 11:30:12 +00:00

129 lines
5.7 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
---
phase: 1
status: open
date_opened: 2026-05-18
parent: phase0.md
target_kernel: VP9 / AV1 8×8 inverse DCT (integer fixed-point)
dev_host: hertz
---
# Phase 1 — Goal formulation
Per `dev_process.md`:
> Define the objective in measurable terms. State what success looks
> like *before* touching anything. The chosen metric is a **hypothesis**
> about what to measure, not an axiom — Phase 3 may invalidate it.
## Kernel under test
**VP9 / AV1 8×8 inverse DCT (DCT_DCT variant), integer 16-bit
fixed-point input, 8-bit output, with reconstructed-block add.**
Mirrors the `ff_vp9_idct_idct_8x8_add_neon` shape in libavcodec
(see `libavcodec/aarch64/vp9itxfm_neon.S`) and the equivalent
dav1d / rav1d / libgav1 implementations for AV1's `IDTX_DCT` /
`DCT_DCT` 8×8 path.
I/O contract (per VP9 spec § 8.7 inverse transform process):
```
input: int16_t coeffs[64] // dequantized transform coefficients
input: uint8_t pred[64] // predicted block (intra/inter)
input: ptrdiff_t stride // typically 8 for an isolated test
output: uint8_t dst[64] // clamp(pred + idct(coeffs)) per pixel
```
Bit-exact: integer arithmetic per spec, no rounding ambiguity.
## Measurable success criteria
Three numbers must come out of Phase 7, all measured in-session on
hertz, all N≥3:
| ID | Measurement | What it tells us |
|---|---|---|
| **M1** | **Bit-exactness rate** vs libavcodec C reference, across ≥10 000 random coefficient blocks | Correctness gate. Must be 100.000 %. Anything less and the kernel is wrong, no other number matters. |
| **M2** | **QPU throughput** in million-blocks-per-second (MblockS), single-threaded host driver, sustained over ≥1 s | The substrate's actual delivered capacity for this kernel shape. |
| **M3** | **NEON throughput** in MblockS on the same hertz, single-threaded, running `ff_vp9_idct_idct_8x8_add_neon` via a microbench harness | The floor any GPU offload has to beat or get close to. |
Derived figure for go/no-go: **R = M2 / M3**.
## Decision rules (set before measuring, per `feedback_no_motivated_reasoning`)
| R | Interpretation | Next step |
|---|---|---|
| ≥ 1.0 | QPU beats NEON on this kernel in isolation. Strong substrate signal. | Phase 9 lessons → Phase 1 of next kernel (deblocking or CDEF). |
| 0.5 ≤ R < 1.0 | QPU loses in isolation but is in the same order of magnitude. *Concurrent-work* hypothesis becomes viable: at R≈0.5 the QPU can roughly handle half of decode while the CPU does the other half + everything else. | Add a Phase 1' measurement: M4 = combined CPU+QPU throughput when both run concurrently (does total system delivery exceed pure-CPU?). Then decide. |
| 0.1 ≤ R < 0.5 | QPU is materially slower. Concurrent-work win unlikely to be worth the integration cost. | Honest close. Phase 9 documents the negative result. |
| < 0.1 | QPU is structurally wrong for this kernel shape. | Honest close. Phase 9 documents the failure, project shelves. |
These thresholds are deliberately published *before* measurement so
the result can't be retroactively reframed.
## Secondary measurements (not gating, but recorded)
- **M5** — per-kernel-launch overhead in µs, isolated (run with 0
blocks, measure submit+wait round-trip). Tells us the floor below
which kernel batching is required.
- **M6** — workgroup-size sweep across {8, 16, 32, 64, 128, 256}
invocations to identify the v3dv-optimal launch shape for this
kernel. Records the Pareto curve, doesn't change R unless the
best-WG result invalidates M2.
- **M7** — power draw delta at the wall (via the Himbeere Fritz!DECT
plug telemetry, if reachable) under idle vs CPU-only vs QPU-only
vs CPU+QPU concurrent. Order-of-magnitude only; informs the higgs
battery argument that motivates the project.
## What Phase 1 does *not* lock
- The dispatch path (Vulkan compute via `v3dv` vs direct DRM
submit via `v3d_drm.h` ioctl). Phase 4 picks. Default for
Phase 1 = **Vulkan compute** unless Phase 4 has reason to flip:
documented, debuggable, doesn't require QPU asm against the
NDA ISA.
- The shader source (GLSL → glslang → SPIR-V) vs hand-written
SPIR-V. Default = GLSL.
- Workgroup partitioning (one-block-per-WG vs many-blocks-per-WG).
Phase 4 chooses based on subgroup width and tile cost; Phase 1
records the sweep (M6).
## Non-goals for Phase 1
- No V4L2 driver work.
- No end-to-end VP9 / AV1 decode (entropy + back-end). Just one
kernel, isolated, measured.
- No optimization beyond what's needed to hit the bit-exact gate
and produce a single throughput number. Tuning is Phase 7's
feedback if R is borderline.
- No build-system perfection. A CMakeLists that compiles the test
harness on hertz is enough.
## Phase 2 → Phase 3 hand-off conditions
Phase 1 closes when:
- The above metrics + decision rules are reviewed (second-model
review per dev_process.md Phase 5? No — this is *Phase 1* not
Phase 5. The Phase 5 second-model review comes after Phase 4
plan).
- The metrics are recorded in this file or a sibling
`phase1_metrics.md` artifact (TBD).
The next phase (Phase 2 — situation analysis) inventories:
- libavcodec's NEON IDCT reference (file, function, calling
convention, expected I/O contract).
- VP9 spec § 8.7 transform process (which the C reference
implements verbatim).
- AV1 spec § 7.7 (same transform structure, larger transform set;
8×8 DCT_DCT path is identical to VP9's at this size).
- Mesa v3dv's compute-shader compilation path and any known
v3dv-specific shader idioms that perform better on V3D 7.1.
- The hertz Vulkan dispatch overhead floor (M5 candidate, but
measured as part of Phase 3 baseline).
## Open questions Phase 1 hands forward
None new. Phase 0 § 7's open questions are the standing list;
Phase 1 picks off Q1 (single-kernel throughput) and Q2 (NEON
baseline) directly via M2 and M3.