marfrit dcbbc77038 Path B pivot + Phase 0-3 closed with first baseline numbers
This is a from-scratch initial commit on a fresh .git. The original
scaffold commit (7510b56) and the earlier session's working-tree
docs were lost in a 2026-05-18 10:25 working-tree wipe; the corrupted
.git is preserved at .git-broken-2026-05-18/ (gitignored) for
forensic inspection.

Scope re-anchored from Path A (custom VPU firmware on VC7 scalar
cores; blocked by BCM2712 silicon-RoT mask-ROM signature check)
to Path B (QPU compute kernels via Mesa v3d / Vulkan compute or
direct DRM, on stock signed Pi 5 / CM5). See README.md and
docs/phase0.md for the substrate audit that closed Path A.

Phases closed:
  Phase 0 — substrate audit; Path A blocked, Path B open;
            codec-back-end-fits-QPU finding (docs/phase0.md)
  Phase 1 — first kernel locked (VP9 / AV1 8x8 inverse DCT) with
            publish-before-measure R = M2/M3 decision rules
            (docs/phase1.md)
  Phase 2 — reference impls mapped; FFmpeg n7.1.3 source vendored
            under external/ffmpeg-snapshot/ (PROVENANCE.md pins
            commit f46e514 + per-file SHA-256s) (docs/phase2.md)
  Phase 3 — real baseline measurements on hertz (docs/phase3.md):
              M1 bit-exact            100.0000 % (10000/10000)
              M3 NEON IDCT8 single    8.171 Mblock/s (122.4 ns/block)
              M5a empty Vulkan submit 22.66 us
              M5b 1-WG noop dispatch  55.60 us
              M5 delta                32.95 us/dispatch
            => per-dispatch overhead is ~455x per-NEON-block cost;
               Phase 4 must batch at frame level or close to it.

Build harness in place: CMakeLists.txt + tests/{bench_neon_idct.c,
vp9_idct8_ref.c, bench_vulkan_dispatch.c, shaders/noop.comp} +
external/ffmpeg-snapshot/config.h shim (7 defines + EXTERN_ASM).
Builds clean on Debian Trixie aarch64 with cmake 3.31, ninja 1.12,
libvulkan-dev 1.4.309, glslang-tools 15.1.0. Vendored FFmpeg .S
assembles via the config.h shim.

Next: Phase 4 (plan first QPU IDCT kernel under the M5 batching
constraint) -> Phase 5 second-model review -> Phase 6 implement.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 11:30:12 +00:00

---
phase: 3
status: closed 2026-05-18
date_opened: 2026-05-18
date_closed: 2026-05-18
parent: phase2.md
host: hertz (Pi 5, 8 GB, Debian Trixie, kernel 6.12.75+rpt-rpi-2712, Mesa 25.0.7-2+rpt4, V3D 7.1.7 @ 1 GHz, A76 @ 2.8 GHz)
artifacts: build/bench_neon_idct, build/bench_vulkan_dispatch, build/noop.spv
---
# Phase 3 — Baseline measurements
Per `dev_process.md`:
> Take concrete measurements *before* any changes. Raw before
> derived. Real data, not theatre.
These numbers anchor every Phase 4+ decision. Re-run with the
same harness on the same hertz before drawing any new conclusions
in later phases.
## M1 — bit-exact correctness gate (Phase 1)
| | |
|---|---|
| Method | 10 000 random VP9-plausible coefficient blocks + random `pred[64]`, compare `daedalus_vp9_idct_idct_8x8_add_ref` C output vs vendored FFmpeg `ff_vp9_idct_idct_8x8_add_neon` |
| Run | `./bench_neon_idct --blocks 1000000 --iters 5` (built 2026-05-18) |
| **Result** | **10 000 / 10 000 = 100.0000 %** |
| DC-only path frequency | 11 / 10 000 = 0.11 % |
| Notes | Random generator: xorshift64, biased toward 1–16 non-zero coeffs per block; eob mostly ∈ [4, 63]. DC-only frequency is incidental; Phase 7 may revisit if it materially affects the throughput number. |
**Gate passes. Throughput measurement was authorized to run.**
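
The gate above boils down to a deterministic generator plus a byte-for-byte comparison. The following is a minimal sketch of that idea, not the repository's harness: `fill_sparse_block` and `count_bit_exact` are hypothetical names, the shift triple (13, 7, 17) is the common Marsaglia xorshift64 variant and may differ from the one in `bench_neon_idct.c`, the coefficient value range is an arbitrary choice, and outputs are assumed to be packed 64-byte tiles.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* One xorshift64 step (shift triple 13, 7, 17). State must be non-zero. */
static uint64_t xorshift64(uint64_t *state)
{
    uint64_t x = *state;
    assert(x != 0);
    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    *state = x;
    return x;
}

/* Fill one 8x8 coefficient block: at most `nonzero` random positions get a
 * value in [-512, 511] (hypothetical range); all other coefficients are 0.
 * Positions may collide, so the real non-zero count can be lower. */
static void fill_sparse_block(int16_t coef[64], int nonzero, uint64_t *state)
{
    memset(coef, 0, 64 * sizeof coef[0]);
    for (int i = 0; i < nonzero; i++) {
        unsigned pos = (unsigned)(xorshift64(state) % 64);
        coef[pos] = (int16_t)((int)(xorshift64(state) % 1024) - 512);
    }
}

/* Bit-exact gate: count blocks whose 64 output bytes match exactly.
 * `ref` and `dut` each hold nblocks packed 8x8 tiles of 8-bit pixels. */
static size_t count_bit_exact(const uint8_t *ref, const uint8_t *dut,
                              size_t nblocks)
{
    size_t ok = 0;
    for (size_t b = 0; b < nblocks; b++)
        ok += (memcmp(ref + 64 * b, dut + 64 * b, 64) == 0);
    return ok;
}
```

A gate in this style passes only when `count_bit_exact` returns `nblocks`; any single differing byte fails the whole block.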
## M3 — NEON throughput (single-core)
| | |
|---|---|
| Kernel | `ff_vp9_idct_idct_8x8_add_neon` from FFmpeg n7.1.3 (vendored, see `external/ffmpeg-snapshot/PROVENANCE.md`) |
| Method | Pre-generate 1 M random blocks + preds. Per iteration: memcpy refresh of all blocks/preds (NEON path zeroes blocks), then call NEON kernel 1 M times. Subtract setup memcpy time from the measured wall-clock. 5 iterations, single thread, no CPU pinning. |
| Compiler flags | `-O3 -march=armv8-a+simd` |
| Run | `./bench_neon_idct --blocks 1000000 --iters 5` |
| **Throughput** | **8.171 Mblock/s** |
| Per-block | 122.4 ns |
| Equivalent 1080p frame rate | 252.2 FPS (32 400 blocks per 1080p frame, assuming pure 8×8 work) |
| Elapsed (kernel) | 0.612 s / 5 M blocks |
| Elapsed (setup-only) | 0.250 s / 5 M iters |
| Cross-check | Cycle estimate at 2.8 GHz: 122.4 ns × 2.8 GHz ≈ 343 cycles/block. Plausible for a fully-unrolled NEON 8-point IDCT with butterflies + saturated narrow stores; the FFmpeg implementation interleaves loads/computes/stores aggressively. |
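
The derived rows in the table follow from the raw elapsed time by simple arithmetic. The sketch below reproduces that arithmetic so it can be re-checked after a re-run; the function names are hypothetical and not taken from `bench_neon_idct.c`, and the frame-rate helper assumes every 8×8 block of a single plane goes through the kernel.

```c
#include <assert.h>

/* Nanoseconds per block, given net kernel time (setup memcpy already
 * subtracted) and the number of blocks processed. */
static double ns_per_block(double kernel_s, double blocks)
{
    assert(blocks > 0.0);
    return kernel_s * 1e9 / blocks;
}

/* Throughput in millions of blocks per second. */
static double mblocks_per_s(double kernel_s, double blocks)
{
    assert(kernel_s > 0.0);
    return blocks / kernel_s / 1e6;
}

/* Frame rate if each 8x8 block of a width x height plane costs one call. */
static double equivalent_fps(double blocks_per_s, int width, int height)
{
    double blocks_per_frame = (width / 8.0) * (height / 8.0);
    assert(blocks_per_frame > 0.0);
    return blocks_per_s / blocks_per_frame;
}

/* Cycle estimate: nanoseconds times clock frequency in GHz. */
static double cycles_per_block(double ns, double ghz)
{
    return ns * ghz;
}
```

With the table's inputs, `ns_per_block(0.612, 5e6)` gives 122.4 ns and `equivalent_fps(8.171e6, 1920, 1080)` gives about 252.2 FPS. The throughput recomputed from the rounded 0.612 s differs from 8.171 Mblock/s in the last digit, which is expected when starting from rounded elapsed time.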
### M3 implications
- A single A76 core handles ~8 M blocks/s = **252 FPS at 1080p**. Real decode needs ~60 FPS, which leaves 4.2× headroom on one core and roughly 17× across all four cores if throughput scales linearly. **NEON is not the bottleneck for current YouTube workloads on Pi 5.**
- The QPU offload story is not "make decode faster" — decode is already fast enough single-threaded. The story has to be "free CPU cycles for the rest of the system" (browser, audio, the 11 LXD containers on hertz).
- For a per-kernel R = QPU / NEON measurement (per `phase1.md §"Decision rules"`), the QPU has to hit ≥4 M blocks/s to score R ≥ 0.5. That's the gate.
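
The gate in the last bullet can be written down as a small check. This is a sketch under stated assumptions: only the R = QPU / NEON ratio and the R ≥ 0.5 threshold come from the text above; the function names are hypothetical, and the full decision rules live in `phase1.md`.

```c
#include <assert.h>

/* R = QPU throughput / NEON throughput, both in blocks per second. */
static double r_ratio(double qpu_blocks_per_s, double neon_blocks_per_s)
{
    assert(neon_blocks_per_s > 0.0);
    return qpu_blocks_per_s / neon_blocks_per_s;
}

/* Minimum QPU throughput needed to reach a given R against a NEON baseline. */
static double min_qpu_rate(double neon_blocks_per_s, double r_min)
{
    return neon_blocks_per_s * r_min;
}

/* Headroom factor: achievable frame rate over required frame rate. */
static double headroom(double achievable_fps, double target_fps)
{
    assert(target_fps > 0.0);
    return achievable_fps / target_fps;
}
```

For the measured baseline, `min_qpu_rate(8.171e6, 0.5)` is about 4.09 M blocks/s, and `headroom(252.2, 60.0)` is about 4.2.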
## M5 — Vulkan compute dispatch overhead
| | |
|---|---|
| Method | Allocate empty pipeline (no descriptors, no push constants), bind+dispatch a `void main(){}` shader on `local_size_x=64`. Time `vkQueueSubmit` + `vkQueueWaitIdle` round-trip. 50 000 iterations, warm. |
| Device | V3D 7.1.7.0 via Mesa v3dv 25.0.7 (selected past llvmpipe by `strstr("V3D")`) |
| Run | `./bench_vulkan_dispatch --iters 50000` |
| **M5a — empty CB submit+wait** | **22.66 µs / op** |
| **M5b — 1-WG noop dispatch submit+wait** | **55.60 µs / op** |
| **M5 delta — per-vkCmdDispatch + pipeline-bind** | **32.95 µs** |
### M5 implications — the load-bearing finding for Phase 4
This is the single most important number from Phase 3.
- Per-dispatch cost (55.6 µs) is **~455× the NEON per-block cost** (122 ns).
- A per-block QPU dispatch is structurally impossible — overhead dominates by two-and-a-half orders of magnitude.
- Break-even batch size for a *hypothetical* zero-cost QPU kernel: **≥ 455 blocks per dispatch** (55.6 µs / 122.4 ns ≈ 454.2). Real kernel cost comes on top of that.
- Frame-level batching is mandatory: a 1080p frame has 32 400 8×8 blocks; one dispatch per frame amortizes M5b to 1.7 ns/block — well below NEON's 122 ns.
- Superblock-level batching is not enough: a VP9 64×64 superblock holds 64 8×8 blocks; 55.6 µs / 64 ≈ 870 ns/block, ~7× NEON. That batch is too small; frame-level or full-plane is the right granularity.
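
The amortisation argument in the bullets above is a pair of divisions, sketched here so other batch sizes can be tried quickly. The function names are hypothetical; the model assumes the submit+wait cost is paid once per dispatch and that the QPU kernel itself costs nothing, as in the break-even bullet.

```c
#include <assert.h>

/* Dispatch overhead spread over a batch, in nanoseconds per block. */
static double overhead_ns_per_block(double dispatch_us, long blocks)
{
    assert(blocks > 0);
    return dispatch_us * 1000.0 / (double)blocks;
}

/* Smallest batch at which per-block overhead no longer exceeds the CPU
 * per-block cost, i.e. ceil(dispatch cost / CPU cost), for a zero-cost kernel. */
static long break_even_blocks(double dispatch_us, double cpu_ns_per_block)
{
    assert(cpu_ns_per_block > 0.0);
    double exact = dispatch_us * 1000.0 / cpu_ns_per_block;
    long n = (long)exact;          /* truncate toward zero */
    if ((double)n < exact)
        n++;                       /* round up to a whole block */
    return n;
}
```

With M5b = 55.60 µs, `overhead_ns_per_block(55.60, 32400)` is about 1.7 ns and `overhead_ns_per_block(55.60, 64)` is about 869 ns, matching the frame-level and superblock-level figures.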
### M5 measurement caveats
- `vkQueueWaitIdle` after each submit forces a full GPU sync, modelling the "submit and need the result now" case. Real decode pipelines can submit multiple frames ahead and wait less often — the per-dispatch cost in a pipelined deployment will be lower (probably bounded below by M5a ≈ 22.66 µs as the pure submit cost).
- Empty CB (M5a) at 22.66 µs is the *floor*. This is Mesa command-list construction + kernel `DRM_IOCTL_V3D_SUBMIT_CL` + scheduler RTT. Cannot be optimised at the userspace level without changing Mesa or kernel.
- Both numbers include `vkQueueWaitIdle` overhead; pure submit-without-wait would be lower. For Phase 1's threshold analysis the with-wait number is the right one to use because end-to-end frame decode must wait for its output to be readable.
## Phase 3 closure
Two anchor measurements captured, both with verbatim raw output
(see `bench_neon_idct` and `bench_vulkan_dispatch` source for the
print format). No estimates, no inferences, no recall from prior
sessions or sibling-host memory.
Phase 4 (plan) opens against these numbers. Its first decision:
**given the 55.60 µs submit+wait cost per dispatch (M5b), what is the
batch granularity for the first kernel?** The answer is either
frame-level (32 400 blocks/dispatch) or row-level (240
blocks/dispatch for one 1920-wide row of 8×8 blocks, which still
carries ~232 ns/block of overhead, ~1.9× NEON). Frame-level is the
only granularity that amortises overhead enough to leave kernel
compute room to win.
Open threads for later phases (not blocking Phase 4):
- Multi-core NEON sweep (M3'): single-core NEON is the right
*competitor floor*, but the actual ARM headroom on hertz is
4× this number under load.
- Memory-bandwidth contention measurement (M6): does NEON's
rate change when concurrent QPU is reading the same LPDDR4x
bus? Needs the QPU kernel to exist first.
- Power-draw delta via Himbeere plug (M7): same — needs a real
GPU workload to differentiate from idle.