Files

T

marfrit dcbbc77038 Path B pivot + Phase 0-3 closed with first baseline numbers

This is a from-scratch initial commit on a fresh .git. The original
scaffold commit (7510b56) and the earlier session's working-tree
docs were lost in a 2026-05-18 10:25 working-tree wipe; the corrupted
.git is preserved at .git-broken-2026-05-18/ (gitignored) for
forensic inspection.

Scope re-anchored from Path A (custom VPU firmware on VC7 scalar
cores; blocked by BCM2712 silicon-RoT mask-ROM signature check)
to Path B (QPU compute kernels via Mesa v3d / Vulkan compute or
direct DRM, on stock signed Pi 5 / CM5). See README.md and
docs/phase0.md for the substrate audit that closed Path A.

Phases closed:
  Phase 0 — substrate audit; Path A blocked, Path B open;
            codec-back-end-fits-QPU finding (docs/phase0.md)
  Phase 1 — first kernel locked (VP9 / AV1 8x8 inverse DCT) with
            publish-before-measure R = M2/M3 decision rules
            (docs/phase1.md)
  Phase 2 — reference impls mapped; FFmpeg n7.1.3 source vendored
            under external/ffmpeg-snapshot/ (PROVENANCE.md pins
            commit f46e514 + per-file SHA-256s) (docs/phase2.md)
  Phase 3 — real baseline measurements on hertz (docs/phase3.md):
              M1 bit-exact            100.0000 % (10000/10000)
              M3 NEON IDCT8 single    8.171 Mblock/s (122.4 ns/block)
              M5a empty Vulkan submit 22.66 us
              M5b 1-WG noop dispatch  55.60 us
              M5 delta                32.95 us/dispatch
            => per-dispatch overhead is ~455x per-NEON-block cost;
               Phase 4 must batch at frame level or close to it.

Build harness in place: CMakeLists.txt + tests/{bench_neon_idct.c,
vp9_idct8_ref.c, bench_vulkan_dispatch.c, shaders/noop.comp} +
external/ffmpeg-snapshot/config.h shim (7 defines + EXTERN_ASM).
Builds clean on Debian Trixie aarch64 with cmake 3.31, ninja 1.12,
libvulkan-dev 1.4.309, glslang-tools 15.1.0. Vendored FFmpeg .S
assembles via the config.h shim.

Next: Phase 4 (plan first QPU IDCT kernel under the M5 batching
constraint) -> Phase 5 second-model review -> Phase 6 implement.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-18 11:30:12 +00:00

5.7 KiB

Raw Blame History

phase, status, date_opened, parent, target_kernel, dev_host

phase	status	date_opened	parent	target_kernel	dev_host
1	open	2026-05-18	phase0.md	VP9 / AV1 8×8 inverse DCT (integer fixed-point)	hertz

Phase 1 — Goal formulation

Per dev_process.md:

Define the objective in measurable terms. State what success looks like before touching anything. The chosen metric is a hypothesis about what to measure, not an axiom — Phase 3 may invalidate it.

Kernel under test

VP9 / AV1 8×8 inverse DCT (DCT_DCT variant), integer 16-bit fixed-point input, 8-bit output, with reconstructed-block add.

Mirrors the ff_vp9_idct_idct_8x8_add_neon shape in libavcodec (see libavcodec/aarch64/vp9itxfm_neon.S) and the equivalent dav1d / rav1d / libgav1 implementations for AV1's IDTX_DCT / DCT_DCT 8×8 path.

I/O contract (per VP9 spec § 8.7 inverse transform process):

input:   int16_t coeffs[64]   // dequantized transform coefficients
input:   uint8_t pred[64]     // predicted block (intra/inter)
input:   ptrdiff_t stride     // typically 8 for an isolated test
output:  uint8_t dst[64]      // clamp(pred + idct(coeffs)) per pixel

Bit-exact: integer arithmetic per spec, no rounding ambiguity.

Measurable success criteria

Three numbers must come out of Phase 7, all measured in-session on hertz, all N≥3:

ID	Measurement	What it tells us
M1	Bit-exactness rate vs libavcodec C reference, across ≥10 000 random coefficient blocks	Correctness gate. Must be 100.000 %. Anything less and the kernel is wrong, no other number matters.
M2	QPU throughput in million-blocks-per-second (MblockS), single-threaded host driver, sustained over ≥1 s	The substrate's actual delivered capacity for this kernel shape.
M3	NEON throughput in MblockS on the same hertz, single-threaded, running `ff_vp9_idct_idct_8x8_add_neon` via a microbench harness	The floor any GPU offload has to beat or get close to.

Derived figure for go/no-go: R = M2 / M3.

Decision rules (set before measuring, per `feedback_no_motivated_reasoning`)

R	Interpretation	Next step
≥ 1.0	QPU beats NEON on this kernel in isolation. Strong substrate signal.	Phase 9 lessons → Phase 1 of next kernel (deblocking or CDEF).
0.5 ≤ R < 1.0	QPU loses in isolation but is in the same order of magnitude. Concurrent-work hypothesis becomes viable: at R≈0.5 the QPU can roughly handle half of decode while the CPU does the other half + everything else.	Add a Phase 1' measurement: M4 = combined CPU+QPU throughput when both run concurrently (does total system delivery exceed pure-CPU?). Then decide.
0.1 ≤ R < 0.5	QPU is materially slower. Concurrent-work win unlikely to be worth the integration cost.	Honest close. Phase 9 documents the negative result.
< 0.1	QPU is structurally wrong for this kernel shape.	Honest close. Phase 9 documents the failure, project shelves.

These thresholds are deliberately published before measurement so the result can't be retroactively reframed.

Secondary measurements (not gating, but recorded)

M5 — per-kernel-launch overhead in µs, isolated (run with 0 blocks, measure submit+wait round-trip). Tells us the floor below which kernel batching is required.
M6 — workgroup-size sweep across {8, 16, 32, 64, 128, 256} invocations to identify the v3dv-optimal launch shape for this kernel. Records the Pareto curve, doesn't change R unless the best-WG result invalidates M2.
M7 — power draw delta at the wall (via the Himbeere Fritz!DECT plug telemetry, if reachable) under idle vs CPU-only vs QPU-only vs CPU+QPU concurrent. Order-of-magnitude only; informs the higgs battery argument that motivates the project.

What Phase 1 does not lock

The dispatch path (Vulkan compute via v3dv vs direct DRM submit via v3d_drm.h ioctl). Phase 4 picks. Default for Phase 1 = Vulkan compute unless Phase 4 has reason to flip: documented, debuggable, doesn't require QPU asm against the NDA ISA.
The shader source (GLSL → glslang → SPIR-V) vs hand-written SPIR-V. Default = GLSL.
Workgroup partitioning (one-block-per-WG vs many-blocks-per-WG). Phase 4 chooses based on subgroup width and tile cost; Phase 1 records the sweep (M6).

Non-goals for Phase 1

No V4L2 driver work.
No end-to-end VP9 / AV1 decode (entropy + back-end). Just one kernel, isolated, measured.
No optimization beyond what's needed to hit the bit-exact gate and produce a single throughput number. Tuning is Phase 7's feedback if R is borderline.
No build-system perfection. A CMakeLists that compiles the test harness on hertz is enough.

Phase 2 → Phase 3 hand-off conditions

Phase 1 closes when:

The above metrics + decision rules are reviewed (second-model review per dev_process.md Phase 5? No — this is Phase 1 not Phase 5. The Phase 5 second-model review comes after Phase 4 plan).
The metrics are recorded in this file or a sibling phase1_metrics.md artifact (TBD).

The next phase (Phase 2 — situation analysis) inventories:

libavcodec's NEON IDCT reference (file, function, calling convention, expected I/O contract).
VP9 spec § 8.7 transform process (which the C reference implements verbatim).
AV1 spec § 7.7 (same transform structure, larger transform set; 8×8 DCT_DCT path is identical to VP9's at this size).
Mesa v3dv's compute-shader compilation path and any known v3dv-specific shader idioms that perform better on V3D 7.1.
The hertz Vulkan dispatch overhead floor (M5 candidate, but measured as part of Phase 3 baseline).

Open questions Phase 1 hands forward

None new. Phase 0 § 7's open questions are the standing list; Phase 1 picks off Q1 (single-kernel throughput) and Q2 (NEON baseline) directly via M2 and M3.

5.7 KiB Raw Blame History Unescape Escape