daedalus-fourier/docs/phase1.md

---
phase: 1
status: open
date_opened: 2026-05-18
parent: phase0.md
target_kernel: VP9 / AV1 8×8 inverse DCT (integer fixed-point)
dev_host: hertz
---

# Phase 1 — Goal formulation

Per `dev_process.md`:

> Define the objective in measurable terms. State what success looks
> like *before* touching anything. The chosen metric is a **hypothesis**
> about what to measure, not an axiom — Phase 3 may invalidate it.

## Kernel under test

**VP9 / AV1 8×8 inverse DCT (DCT_DCT variant), integer 16-bit
fixed-point input, 8-bit output, with reconstructed-block add.**

Mirrors the `ff_vp9_idct_idct_8x8_add_neon` shape in libavcodec
(see `libavcodec/aarch64/vp9itxfm_neon.S`) and the equivalent
dav1d / rav1d / libgav1 implementations for AV1's `DCT_DCT` 8×8
path.

I/O contract (per VP9 spec § 8.7 inverse transform process):

```
input:   int16_t coeffs[64]   // dequantized transform coefficients
input:   uint8_t pred[64]     // predicted block (intra/inter)
input:   ptrdiff_t stride     // typically 8 for an isolated test
output:  uint8_t dst[64]      // clamp(pred + idct(coeffs)) per pixel
```

Bit-exact: integer arithmetic per spec, no rounding ambiguity.
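
A minimal sketch of the last stage of this contract, `clamp(pred + idct(coeffs))`, may help pin down the clamping behaviour. It assumes `pred` and `dst` share the same `stride`, and it takes the inverse-DCT output as a ready-made `residual` array; the transform itself is deliberately left out. The names `clip_u8` and `recon_add_8x8` are hypothetical and not taken from libavcodec.

```c
#include <stddef.h>
#include <stdint.h>

/* Clamp a signed intermediate to the 8-bit pixel range [0, 255]. */
static uint8_t clip_u8(int v)
{
    return (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
}

/*
 * Final stage only: dst = clamp(pred + residual) per pixel.
 * `residual` stands in for the 8x8 inverse DCT output (row-major, 64
 * values); computing it is out of scope for this sketch.
 * ASSUMPTION: pred and dst use the same stride.
 */
static void recon_add_8x8(const int16_t residual[64],
                          const uint8_t *pred, ptrdiff_t stride,
                          uint8_t *dst)
{
    for (int y = 0; y < 8; y++) {
        for (int x = 0; x < 8; x++) {
            int sum = pred[y * stride + x] + residual[y * 8 + x];
            dst[y * stride + x] = clip_u8(sum);
        }
    }
}
```

With `stride = 8` the three arrays are plain 64-element blocks, which matches the isolated-test case noted in the contract.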

## Measurable success criteria

Three numbers must come out of Phase 7, all measured in-session on
hertz, each over N ≥ 3 runs:

| ID | Measurement | What it tells us |
|---|---|---|
| **M1** | **Bit-exactness rate** vs libavcodec C reference, across ≥10 000 random coefficient blocks | Correctness gate. Must be 100.000 %. Anything less means the kernel is wrong and no other number matters. |
| **M2** | **QPU throughput** in million blocks per second (Mblock/s), single-threaded host driver, sustained over ≥1 s | The substrate's actual delivered capacity for this kernel shape. |
| **M3** | **NEON throughput** in Mblock/s on the same hertz, single-threaded, running `ff_vp9_idct_idct_8x8_add_neon` via a microbench harness | The floor any GPU offload has to beat or get close to. |
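
One way the M1 gate could be checked is sketched below: feed the same pseudo-random blocks to a reference and to the kernel under test, and count blocks whose 64 output bytes differ. Everything here is illustrative. The `idct_add_fn` signature, the xorshift generator, the coefficient range, and the two `stub_*` functions are assumptions made for the sketch; a real harness would plug in the libavcodec C reference and the QPU path.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Small xorshift32 PRNG so the block sequence is reproducible from a seed. */
static uint32_t xorshift32(uint32_t *state)
{
    uint32_t x = *state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    return *state = x;
}

/* Hypothetical signature shared by the reference and the kernel under test. */
typedef void (*idct_add_fn)(const int16_t *coeffs, const uint8_t *pred,
                            ptrdiff_t stride, uint8_t *dst);

/* Number of blocks whose output differs; 0 means M1 = 100 %. */
static size_t count_mismatched_blocks(idct_add_fn ref, idct_add_fn dut,
                                      size_t n_blocks, uint32_t seed)
{
    size_t mismatches = 0;
    uint32_t rng = seed ? seed : 1u;  /* xorshift state must be non-zero */

    for (size_t b = 0; b < n_blocks; b++) {
        int16_t coeffs[64];
        uint8_t pred[64], out_ref[64], out_dut[64];

        for (int i = 0; i < 64; i++) {
            /* ASSUMPTION: coefficients drawn from [-1024, 1023]. */
            coeffs[i] = (int16_t)((int)(xorshift32(&rng) % 2048u) - 1024);
            pred[i]   = (uint8_t)(xorshift32(&rng) & 0xFFu);
        }
        ref(coeffs, pred, 8, out_ref);
        dut(coeffs, pred, 8, out_dut);
        if (memcmp(out_ref, out_dut, 64) != 0)
            mismatches++;
    }
    return mismatches;
}

/* Stand-in "kernels" for demonstration only: neither is an IDCT. */
static void stub_copy(const int16_t *coeffs, const uint8_t *pred,
                      ptrdiff_t stride, uint8_t *dst)
{
    (void)coeffs; (void)stride;
    memcpy(dst, pred, 64);
}

static void stub_broken(const int16_t *coeffs, const uint8_t *pred,
                        ptrdiff_t stride, uint8_t *dst)
{
    stub_copy(coeffs, pred, stride, dst);
    dst[0] ^= 1u;  /* flip one bit so every block mismatches */
}
```

Calling `count_mismatched_blocks(stub_copy, stub_broken, 100, 42)` reports every block as a mismatch, which is a cheap way to confirm the comparison itself is not silently passing.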

Derived figure for go/no-go: **R = M2 / M3**.
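
The arithmetic behind M2, M3, and R is small enough to sketch. The helpers below convert a block count and elapsed time to million blocks per second, summarise repeated runs, and form the ratio. Using the median to combine the N ≥ 3 runs is an assumption of this sketch, not something this document specifies; all function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Throughput in million blocks per second. */
static double mblocks_per_sec(uint64_t blocks, double seconds)
{
    return (double)blocks / seconds / 1e6;
}

/* qsort comparator for doubles, ascending. */
static int cmp_double(const void *a, const void *b)
{
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/*
 * Median of n runs (n >= 1). Sorts `runs` in place.
 * ASSUMPTION: the median is the chosen summary over the repeated runs.
 */
static double median_of_runs(double *runs, size_t n)
{
    qsort(runs, n, sizeof runs[0], cmp_double);
    return (n % 2) ? runs[n / 2]
                   : 0.5 * (runs[n / 2 - 1] + runs[n / 2]);
}

/* R = M2 / M3, both in million blocks per second. */
static double ratio_r(double m2_qpu, double m3_neon)
{
    return m2_qpu / m3_neon;
}
```

For example, 2 000 000 blocks in 1.0 s is 2.0 million blocks per second, and an M2 of 1.0 against an M3 of 2.0 gives R = 0.5.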

## Decision rules (set before measuring, per `feedback_no_motivated_reasoning`)

| R | Interpretation | Next step |
|---|---|---|
| ≥ 1.0 | QPU beats NEON on this kernel in isolation. Strong substrate signal. | Phase 9 lessons → Phase 1 of next kernel (deblocking or CDEF). |
| 0.5 ≤ R < 1.0 | QPU loses in isolation but is in the same order of magnitude. *Concurrent-work* hypothesis becomes viable: at R≈0.5 the QPU can roughly handle half of decode while the CPU does the other half + everything else. | Add a Phase 1' measurement: M4 = combined CPU+QPU throughput when both run concurrently (does total system delivery exceed pure-CPU?). Then decide. |
| 0.1 ≤ R < 0.5 | QPU is materially slower. Concurrent-work win unlikely to be worth the integration cost. | Honest close. Phase 9 documents the negative result. |
| < 0.1 | QPU is structurally wrong for this kernel shape. | Honest close. Phase 9 documents the failure, project shelves. |

These thresholds are deliberately published *before* measurement so
the result can't be retroactively reframed.
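
Encoding the table as code is one way to keep the thresholds from drifting once numbers arrive. The sketch below maps a measured R to one of the four bands, and treats any M1 mismatch as overriding R entirely, as the M1 row requires. The enum and function names are hypothetical.

```c
#include <stddef.h>

typedef enum {
    BAND_KERNEL_WRONG,    /* M1 failed: R is meaningless         */
    BAND_QPU_WINS,        /* R >= 1.0                            */
    BAND_CONCURRENT,      /* 0.5 <= R < 1.0: measure M4, decide  */
    BAND_CLOSE_NEGATIVE,  /* 0.1 <= R < 0.5: honest close        */
    BAND_SHELVE           /* R < 0.1: structurally wrong, shelve */
} decision_band;

/*
 * Apply the pre-registered decision rules.
 * mismatched_blocks: number of blocks that failed the bit-exact check.
 * r: M2 / M3.
 */
static decision_band classify(size_t mismatched_blocks, double r)
{
    if (mismatched_blocks != 0) return BAND_KERNEL_WRONG;
    if (r >= 1.0) return BAND_QPU_WINS;
    if (r >= 0.5) return BAND_CONCURRENT;
    if (r >= 0.1) return BAND_CLOSE_NEGATIVE;
    return BAND_SHELVE;
}
```

The comparisons are ordered from the highest band down, so each lower bound in the table is inclusive and each upper bound exclusive.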

## Secondary measurements (not gating, but recorded)

- **M5** — per-kernel-launch overhead in µs, isolated (run with 0
  blocks, measure submit+wait round-trip). Tells us the floor below
  which kernel batching is required.
- **M6** — workgroup-size sweep across {8, 16, 32, 64, 128, 256}
  invocations to identify the v3dv-optimal launch shape for this
  kernel. This records the Pareto curve; it doesn't change R unless
  the best-WG result invalidates M2.
- **M7** — power draw delta at the wall (via the Himbeere Fritz!DECT
  plug telemetry, if reachable) under idle vs CPU-only vs QPU-only
  vs CPU+QPU concurrent. Order-of-magnitude only; informs the higgs
  battery argument that motivates the project.
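
For the M6 sweep, the number of workgroups to dispatch at each size follows from a ceiling division. The sketch below assumes 8 invocations per 8×8 block (for instance one per row or column); that mapping is purely illustrative, since workgroup partitioning is not decided in this phase. All identifiers are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Workgroup sizes named in M6. */
static const uint32_t kWgSizes[] = { 8, 16, 32, 64, 128, 256 };
enum { kNumWgSizes = sizeof kWgSizes / sizeof kWgSizes[0] };

/* ASSUMPTION: 8 invocations cooperate on one 8x8 block. */
enum { kInvocationsPerBlock = 8 };

/* Workgroups needed to cover n_blocks at a given workgroup size (rounds up). */
static uint32_t groups_for(uint64_t n_blocks, uint32_t wg_size)
{
    uint64_t invocations = n_blocks * kInvocationsPerBlock;
    return (uint32_t)((invocations + wg_size - 1) / wg_size);
}

/* Fill out_groups[i] with the dispatch count for kWgSizes[i]. */
static void fill_sweep(uint64_t n_blocks, uint32_t out_groups[kNumWgSizes])
{
    for (size_t i = 0; i < kNumWgSizes; i++)
        out_groups[i] = groups_for(n_blocks, kWgSizes[i]);
}
```

When the invocation count is not a multiple of the workgroup size, the last workgroup is partially filled, so a shader written against this sketch would need a bounds check on its block index.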

## What Phase 1 does *not* lock

- The dispatch path (Vulkan compute via `v3dv` vs direct DRM
  submit via the `v3d_drm.h` ioctls). Phase 4 picks. The default
  for Phase 1 is **Vulkan compute**, because it is documented,
  debuggable, and doesn't require QPU asm against the NDA ISA;
  Phase 4 may flip it if it has reason to.
- The shader source (GLSL → glslang → SPIR-V) vs hand-written
  SPIR-V. Default = GLSL.
- Workgroup partitioning (one-block-per-WG vs many-blocks-per-WG).
  Phase 4 chooses based on subgroup width and tile cost; Phase 1
  records the sweep (M6).

## Non-goals for Phase 1

- No V4L2 driver work.
- No end-to-end VP9 / AV1 decode (entropy + back-end). Just one
  kernel, isolated, measured.
- No optimization beyond what's needed to hit the bit-exact gate
  and produce a single throughput number. Tuning is Phase 7's
  feedback if R is borderline.
- No build-system perfection. A CMakeLists that compiles the test
  harness on hertz is enough.

## Phase 1 → Phase 2 hand-off conditions

Phase 1 closes when:
- The above metrics and decision rules are reviewed. This is not
  the second-model review of `dev_process.md` Phase 5; that review
  comes after the Phase 4 plan.
- The metrics are recorded in this file or a sibling
  `phase1_metrics.md` artifact (TBD).

The next phase (Phase 2 — situation analysis) inventories:
- libavcodec's NEON IDCT reference (file, function, calling
  convention, expected I/O contract).
- VP9 spec § 8.7 transform process (which the C reference
  implements verbatim).
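- AV1 spec § 7.13 inverse transform process (same butterfly
  structure, larger transform set). Whether the 8×8 DCT_DCT path
  is bit-identical to VP9's at this size must be checked rather
  than assumed, since each spec fixes its own constant precision
  and intermediate rounding.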
- Mesa v3dv's compute-shader compilation path and any known
  v3dv-specific shader idioms that perform better on V3D 7.1.
- The hertz Vulkan dispatch overhead floor (M5 candidate, but
  measured as part of Phase 3 baseline).

## Open questions Phase 1 hands forward

None new. Phase 0 § 7's open questions are the standing list;
Phase 1 picks off Q1 (single-kernel throughput) and Q2 (NEON
baseline) directly via M2 and M3.