---
phase: 1
status: open
date_opened: 2026-05-18
parent: phase0.md
target_kernel: VP9 / AV1 8×8 inverse DCT (integer fixed-point)
dev_host: hertz
---

# Phase 1: Goal formulation

Per `dev_process.md`:

> Define the objective in measurable terms. State what success looks
> like *before* touching anything. The chosen metric is a **hypothesis**
> about what to measure, not an axiom; Phase 3 may invalidate it.

## Kernel under test

**VP9 / AV1 8×8 inverse DCT (DCT_DCT variant), integer 16-bit fixed-point input, 8-bit output, with reconstructed-block add.** Mirrors the `ff_vp9_idct_idct_8x8_add_neon` shape in libavcodec (see `libavcodec/aarch64/vp9itxfm_neon.S`) and the equivalent dav1d / rav1d / libgav1 implementations for AV1's `DCT_DCT` 8×8 path.

I/O contract (per VP9 spec § 8.7 inverse transform process):

```
input:  int16_t   coeffs[64]   // dequantized transform coefficients
input:  uint8_t   pred[64]     // predicted block (intra/inter)
input:  ptrdiff_t stride       // typically 8 for an isolated test
output: uint8_t   dst[64]      // clamp(pred + idct(coeffs)) per pixel
```

Bit-exact: integer arithmetic per spec, no rounding ambiguity.

## Measurable success criteria

Three numbers must come out of Phase 7, all measured in-session on hertz, all with N ≥ 3 runs:

| ID | Measurement | What it tells us |
|---|---|---|
| **M1** | **Bit-exactness rate** vs libavcodec C reference, across ≥10 000 random coefficient blocks | Correctness gate. Must be 100.000 %. Anything less and the kernel is wrong, no other number matters. |
| **M2** | **QPU throughput** in million blocks per second (Mblock/s), single-threaded host driver, sustained over ≥1 s | The substrate's actual delivered capacity for this kernel shape. |
| **M3** | **NEON throughput** in Mblock/s on the same hertz, single-threaded, running `ff_vp9_idct_idct_8x8_add_neon` via a microbench harness | The baseline any GPU offload has to beat or approach. |

Derived figure for go/no-go: **R = M2 / M3**.

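As a rough illustration of how the M1 gate could be checked, the sketch below compares a candidate kernel against a reference over randomly generated blocks and counts byte-for-byte matches. Everything named here is an assumption made for illustration: the `idct8x8_add_fn` signature (modelled on the I/O contract above), the `count_bit_exact` helper, the xorshift32 generator, and the coefficient range of [-1024, 1023]. The two `placeholder_*` kernels are deliberately not IDCTs; they only stand in for the libavcodec C reference and the QPU dispatch wrapper so that the harness logic can be exercised on its own.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical kernel signature following the I/O contract above. */
typedef void (*idct8x8_add_fn)(const int16_t coeffs[64], const uint8_t pred[64],
                               ptrdiff_t stride, uint8_t dst[64]);

/* Deterministic xorshift32 PRNG so a failing block can be reproduced from the seed. */
static uint32_t xorshift32(uint32_t *state) {
    uint32_t x = *state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    *state = x;
    return x;
}

/* Returns how many of n_blocks random blocks produce identical output from both kernels. */
static size_t count_bit_exact(idct8x8_add_fn ref, idct8x8_add_fn cand,
                              size_t n_blocks, uint32_t seed) {
    assert(seed != 0); /* xorshift32 never leaves the all-zero state */
    uint32_t rng = seed;
    size_t matches = 0;
    for (size_t b = 0; b < n_blocks; b++) {
        int16_t coeffs[64];
        uint8_t pred[64], dst_ref[64], dst_cand[64];
        for (int i = 0; i < 64; i++) {
            /* Assumed coefficient range: [-1024, 1023]. */
            coeffs[i] = (int16_t)((int32_t)(xorshift32(&rng) % 2048u) - 1024);
            pred[i] = (uint8_t)(xorshift32(&rng) & 0xFFu);
        }
        ref(coeffs, pred, 8, dst_ref);
        cand(coeffs, pred, 8, dst_cand);
        if (memcmp(dst_ref, dst_cand, sizeof dst_ref) == 0)
            matches++;
    }
    return matches;
}

static uint8_t clamp_u8(int v) {
    return (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
}

/* PLACEHOLDER, not an IDCT: adds a scaled DC term to every pixel, with clamping. */
static void placeholder_ref(const int16_t coeffs[64], const uint8_t pred[64],
                            ptrdiff_t stride, uint8_t dst[64]) {
    int dc = coeffs[0] / 8;
    for (int y = 0; y < 8; y++)
        for (int x = 0; x < 8; x++)
            dst[y * stride + x] = clamp_u8(pred[y * stride + x] + dc);
}

/* PLACEHOLDER with a deliberate bug: the sum wraps modulo 256 instead of clamping. */
static void placeholder_wrapping(const int16_t coeffs[64], const uint8_t pred[64],
                                 ptrdiff_t stride, uint8_t dst[64]) {
    int dc = coeffs[0] / 8;
    for (int y = 0; y < 8; y++)
        for (int x = 0; x < 8; x++)
            dst[y * stride + x] = (uint8_t)(pred[y * stride + x] + dc);
}
```

Under these assumptions the gate passes only when `count_bit_exact(ref, cand, n, seed) == n` for some `n` of at least 10 000. Calling it with `placeholder_ref` on both sides exercises the passing case, and swapping in `placeholder_wrapping` as the candidate exercises the failing one.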
## Decision rules (set before measuring, per `feedback_no_motivated_reasoning`)

| R | Interpretation | Next step |
|---|---|---|
| R ≥ 1.0 | QPU beats NEON on this kernel in isolation. Strong substrate signal. | Phase 9 lessons → Phase 1 of next kernel (deblocking or CDEF). |
| 0.5 ≤ R < 1.0 | QPU loses in isolation but is in the same order of magnitude. *Concurrent-work* hypothesis becomes viable: at R≈0.5 the QPU can roughly handle half of decode while the CPU does the other half + everything else. | Add a Phase 1' measurement: M4 = combined CPU+QPU throughput when both run concurrently (does total system delivery exceed pure-CPU?). Then decide. |
| 0.1 ≤ R < 0.5 | QPU is materially slower. Concurrent-work win unlikely to be worth the integration cost. | Honest close. Phase 9 documents the negative result. |
| R < 0.1 | QPU is structurally wrong for this kernel shape. | Honest close. Phase 9 documents the failure, project shelves. |

These thresholds are deliberately published *before* measurement so the result can't be retroactively reframed.

## Secondary measurements (not gating, but recorded)

- **M5**: per-kernel-launch overhead in µs, isolated (run with 0 blocks, measure submit+wait round-trip). Tells us the floor below which kernel batching is required.
- **M6**: workgroup-size sweep across {8, 16, 32, 64, 128, 256} invocations to identify the v3dv-optimal launch shape for this kernel. Records the Pareto curve, doesn't change R unless the best-WG result invalidates M2.
- **M7**: power draw delta at the wall (via the Himbeere Fritz!DECT plug telemetry, if reachable) under idle vs CPU-only vs QPU-only vs CPU+QPU concurrent. Order-of-magnitude only; informs the higgs battery argument that motivates the project.

## What Phase 1 does *not* lock

- The dispatch path (Vulkan compute via `v3dv` vs direct DRM submit via `v3d_drm.h` ioctl). Phase 4 picks.
  Default for Phase 1 = **Vulkan compute** unless Phase 4 has reason to flip: documented, debuggable, doesn't require QPU asm against the NDA ISA.
- The shader source (GLSL → glslang → SPIR-V) vs hand-written SPIR-V. Default = GLSL.
- Workgroup partitioning (one-block-per-WG vs many-blocks-per-WG). Phase 4 chooses based on subgroup width and tile cost; Phase 1 records the sweep (M6).

## Non-goals for Phase 1

- No V4L2 driver work.
- No end-to-end VP9 / AV1 decode (entropy + back-end). Just one kernel, isolated, measured.
- No optimization beyond what's needed to hit the bit-exact gate and produce a single throughput number. Tuning is deferred to Phase 7 feedback if R is borderline.
- No build-system perfection. A CMakeLists that compiles the test harness on hertz is enough.

## Phase 1 → Phase 2 hand-off conditions

Phase 1 closes when:

- The above metrics and decision rules are reviewed. This is not the second-model review of `dev_process.md` Phase 5; that review comes after the Phase 4 plan.
- The metrics are recorded in this file or a sibling `phase1_metrics.md` artifact (TBD).

The next phase (Phase 2: situation analysis) inventories:

- libavcodec's NEON IDCT reference (file, function, calling convention, expected I/O contract).
- VP9 spec § 8.7 transform process (which the C reference implements verbatim).
- AV1 spec § 7.7 (same transform structure, larger transform set; 8×8 DCT_DCT path is identical to VP9's at this size).
- Mesa v3dv's compute-shader compilation path and any known v3dv-specific shader idioms that perform better on V3D 7.1.
- The hertz Vulkan dispatch overhead floor (M5 candidate, but measured as part of Phase 3 baseline).

## Open questions Phase 1 hands forward

None new. Phase 0 § 7's open questions are the standing list; Phase 1 picks off Q1 (single-kernel throughput) and Q2 (NEON baseline) directly via M2 and M3.
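
For reference, the decision rules stated earlier in this file can be written down as a small pure function, which makes it harder to reinterpret the thresholds after the numbers arrive. This is only a sketch: the enum, the `classify_phase1` name, and the parameter layout are hypothetical, while the thresholds (1.0, 0.5, 0.1) and the rule that a failed M1 gate overrides every other number are taken from the criteria above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical outcome labels, one per row of the decision table plus the M1 gate. */
typedef enum {
    DECISION_GATE_FAILED,    /* M1 below 100 %: kernel is wrong, R is meaningless */
    DECISION_NEXT_KERNEL,    /* R >= 1.0 */
    DECISION_MEASURE_M4,     /* 0.5 <= R < 1.0: add the concurrent CPU+QPU measurement */
    DECISION_CLOSE_NEGATIVE, /* 0.1 <= R < 0.5 */
    DECISION_CLOSE_SHELVE    /* R < 0.1 */
} phase1_decision;

/*
 * matching_blocks / total_blocks : M1 inputs (bit-exact block count and block total)
 * m2_qpu, m3_neon                : throughputs in Mblock/s, measured on the same host
 */
static phase1_decision classify_phase1(size_t matching_blocks, size_t total_blocks,
                                       double m2_qpu, double m3_neon) {
    assert(total_blocks > 0);
    assert(m2_qpu >= 0.0 && m3_neon > 0.0);

    /* Correctness gate first: anything short of every block matching ends the analysis. */
    if (matching_blocks != total_blocks)
        return DECISION_GATE_FAILED;

    double r = m2_qpu / m3_neon; /* R = M2 / M3 */
    if (r >= 1.0)
        return DECISION_NEXT_KERNEL;
    if (r >= 0.5)
        return DECISION_MEASURE_M4;
    if (r >= 0.1)
        return DECISION_CLOSE_NEGATIVE;
    return DECISION_CLOSE_SHELVE;
}
```

As a usage example, `classify_phase1(10000, 10000, 5.0, 10.0)` gives R = 0.5 and falls in the concurrent-work band, while `classify_phase1(9999, 10000, 50.0, 10.0)` is rejected at the gate regardless of throughput.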