diff --git a/README.md b/README.md
index 8d9d223..bd38b50 100644
--- a/README.md
+++ b/README.md
@@ -16,11 +16,30 @@ Labyrinth; the Pi Foundation's "use the HEVC block and live with
 software decode for everything else" is the official non-exit; the
 QPU sits unused inside the labyrinth's walls.

-**Status: Phase 0 closed (substrate audit). Phase 1 in progress
-(first-kernel proof on hertz).** This is research-track work that
-may take months or may yield a single proof-of-concept kernel that
-loses to ARM NEON, in which case the negative result ships and the
-project closes.
+**Status (2026-05-18): cycles 1-9 closed across 3 codecs
+(VP9 + AV1 CDEF + H.264). Public API exposes all 9 kernels.
+3 kernels deploy on QPU, 6 on CPU, 2 with opportunistic-QPU
+helper paths. Phase 8 (V4L2 deployment) ongoing in sibling
+[daedalus-v4l2](https://git.reauktion.de/marfrit/daedalus-v4l2).
+On hertz, all kernels exceed the 30fps@1080p user-facing floor by
+8-30×.**
+
+### Cycles 1-9 deployment recipe
+
+| Cycle | Kernel | NEON M3 | Primary substrate | QPU offload verdict |
+|---|---|---|---|---|
+| 1 | VP9 IDCT 8×8 | 8.2 Mblock/s | **QPU** | M4 +7.2 %, R=0.92 GREEN |
+| 2 | VP9 LPF wd=4 | 48 Medge/s | **QPU** | M4 +6.9 %, R=0.41 |
+| 3 | VP9 MC 8h | 7.0 Mblock/s | CPU | R=0.067 RED; QPU dispatch path exists |
+| 4 | VP9 LPF wd=8 | 31 Medge/s | **QPU** | M4 +4.1 %, R=0.34 |
+| 5 | AV1 CDEF 8×8 | 3.9 Mblock/s | CPU | R=0.116 ORANGE; QPU = opportunistic helper (0.42 Mblock/s in mixed) |
+| 6 | H.264 IDCT 4×4 | 175 Mblock/s | CPU | trivially fast on NEON; QPU pointless |
+| 7 | H.264 IDCT 8×8 | 151 Mblock/s | CPU | likewise |
+| 8 | H.264 deblock luma-v | 92 Medge/s | CPU | R=0.061 RED; QPU = opportunistic helper (6.2 Medge/s in mixed) |
+| 9 | H.264 luma qpel MC (mc20) | 131 Mblock/s | CPU | NEON 19× faster than VP9 analog; QPU pointless |
+
+Per-cycle Phase 7 docs in `docs/k*_phase7.md` (or `*_phase3_and_4.md`
+for deferred-Phase-4 closures).

 ## Why this exists

@@ -85,37 +104,48 @@ The build:
 └───────────────────────────────┘
 ```

-The first deliverable is *not* the V4L2 wrapper. The first
-deliverable is one back-end kernel running on the QPU, bit-exact
-against a libavcodec reference, with measured throughput. If that
-single kernel can't beat NEON or get within 50% of it, the project
-closes here with a documented negative result.
+The first deliverable was one back-end kernel; nine cycles later
+the public API in `include/daedalus.h` exposes nine kernels each
+with bit-exact NEON and (where worthwhile) QPU paths. The V4L2
+wrapper is the next-up sibling project
+([daedalus-v4l2](https://git.reauktion.de/marfrit/daedalus-v4l2)),
+which turns the kernel-library into a `/dev/videoNN` device for
+libva-v4l2-request-fourier / browser consumption.

 ## In scope

-- A small set of codec back-end kernels (IDCT 8×8, CDEF, deblocking,
-  loop restoration filter, MC interpolation) compiled as SPIR-V
-  compute shaders for Mesa `v3dv`, dispatched via Vulkan compute
-  from userspace.
-- A test harness on hertz that runs each kernel against libavcodec
-  reference outputs and measures throughput (megapixels/sec or
-  blocks/sec) against the equivalent NEON path.
-- Phase 1 = one kernel, bit-exact, with numbers. Phase 2+ = more
-  kernels only if Phase 1 numbers justify it.
+- The set of codec back-end kernels documented in the deployment
+  recipe table above (9 kernels closed; more added per cycle as
+  the codec coverage expands).
+- A test harness on hertz that runs each kernel against a
+  bit-exact reference (FFmpeg or dav1d NEON) and measures
+  throughput vs the equivalent NEON path.
+- The public C API in `include/daedalus.h` so the sibling
+  daedalus-v4l2 (and any other consumer) can dispatch per-block
+  work with recipe-default substrate routing or explicit override.

-## Out of scope (for now)
+## Out of scope (lives in sibling repos)
+
+- The V4L2 stateless driver — that's
+  [daedalus-v4l2](https://git.reauktion.de/marfrit/daedalus-v4l2).
+- Bitstream parsing — that lives in daedalus-v4l2 too, via
+  `dlopen`'d FFmpeg at runtime (Option γ).
+- Browser-side consumption — libva-v4l2-request-fourier +
+  firefox-fourier / chromium-fourier, already mature.
+
+## Out of scope (permanent)

 - HEVC (Pi 5 has dedicated silicon; `rpi-hevc-dec` covers it).
 - Pi 4 / BCM2711 / VideoCore VI. Different ISA, smaller compute
-  budget. Path B *could* extend but isn't the priority.
-- Encode. Pi Foundation removed all HW encode in Pi 5; encode on
-  VC7 is a separate, larger project.
+  budget.
+- Encode. Pi Foundation removed all HW encode in Pi 5.
 - Custom VPU firmware (Path A — blocked by silicon RoT, see
   `docs/phase0.md`).
-- V4L2 stateless driver wrapping the userspace decoder. Eventual
-  consumption point, but Phase 1 lives entirely in userspace.
 - Beating ARM NEON unconditionally. The honest target is
   *concurrent* work: QPU runs while CPU does something else.
+  Per Issue 003 (`docs/issues/003-mixed-kernel-m4-bench.md`),
+  the mixed-kernel deployment shape is where QPU offload pays —
+  same-kernel M4 is the worst-case bound.

 ## Dev substrate

@@ -129,40 +159,113 @@ closes here with a documented negative result.

 ## Conventions

-This project follows the 9(+1)-phase dev process. See
-`docs/dev_process.md`. Phase 0 is closed (`docs/phase0.md`);
-Phase 1 is `docs/phase1.md`.
+This project follows a 9(+1)-phase dev process per cycle. See
+`docs/dev_process.md`. Phase 0 is closed once at project start
+(`docs/phase0.md`); each kernel cycle re-runs Phases 1-9.

-Gitea identity: `claude-noether` (per
-`feedback_gitea_as_claude_noether.md`). No `marfrit` pushes from
-Claude sessions.
+Phase 5 (second-model independent review) is non-skippable per
+project rule.
+See `~/.claude/CLAUDE.md` "Reviews are never
+skippable" — empty/no-finding reviews are themselves a strong
+positive signal, not wasted effort.
+
+Gitea identity: `claude-noether` for Claude-driven pushes, via
+SSH alias `git.reauktion.de.claude-noether` (see
+`memory/reference_gitea_ssh_alias_noether.md`).

 ## Layout

 ```
 daedalus-fourier/
 ├── README.md                 ← this file
+├── include/daedalus.h        ← public C API
+├── src/
+│   ├── daedalus_core.c       ← API impl: per-kernel CPU+QPU dispatch
+│   ├── v3d_runner.{c,h}      ← Vulkan compute plumbing
+│   └── v3d_*.comp            ← compute shaders (cycles 1, 2, 4, 5, 8)
+├── tests/
+│   ├── *_ref.c               ← per-kernel C references (bit-exact)
+│   ├── bench_neon_*.c        ← NEON M3 baselines
+│   ├── bench_v3d_*.c         ← QPU M2 + 3-way M1 (vs NEON + C ref)
+│   ├── bench_concurrent_*.c  ← M4 mixed-kernel concurrent bench
+│   └── test_api_*.c          ← public API smoke tests
 ├── docs/
-│   ├── dev_process.md        ← reference copy of the 9(+1)-phase loop
-│   ├── phase0.md             ← substrate audit (closes Paths A and B)
-│   ├── phase1.md             ← first-kernel goal + measurement plan
-│   └── vulkaninfo_v3d_7_1_7_hertz.txt
-│                             ← inside-view device profile from hertz
-├── src/                      ← kernels + Vulkan dispatch harness
-└── tests/                    ← bit-exact vs libavcodec, throughput
+│   ├── dev_process.md        ← reference 9(+1)-phase loop
+│   ├── phase0.md             ← substrate audit (closes Path A)
+│   ├── phase1.md             ← R-band decision rules
+│   ├── phase8_scoping.md     ← V4L2 architecture options
+│   ├── phase8_status.md      ← decisions locked + status
+│   ├── k1_*.md..k9_*.md      ← per-cycle Phase 1/3/4/5/7 docs
+│   └── issues/               ← deferred work
+├── external/
+│   ├── ffmpeg-snapshot/      ← vendored FFmpeg n7.1.3 NEON refs (LGPL-2.1+)
+│   └── dav1d-snapshot/       ← vendored dav1d 1.4.3 CDEF (BSD-2-Clause)
+└── CMakeLists.txt
 ```

-No build system yet. Adding CMake when the first kernel lands.
+## Build and run
+
+On a Pi 5 (Debian Trixie or similar) with Vulkan SDK + Mesa v3dv:
+
+```sh
+mkdir build && cd build
+cmake .. -DCMAKE_BUILD_TYPE=Release
+cmake --build .
+
+# Per-kernel M1+M3 NEON baseline:
+./bench_neon_idct
+./bench_neon_lpf
+./bench_neon_h264deblock
+# ... (one per cycle)
+
+# Per-kernel M1+M2 QPU bench (3-way bit-exact vs NEON + C ref):
+./bench_v3d_idct
+./bench_v3d_lpf
+./bench_v3d_h264deblock
+# ...
+
+# Public API smoke tests:
+./test_api_idct               # VP9 IDCT 8x8, CPU+QPU+AUTO
+./test_api_lpf                # VP9 LPF wd=4 + wd=8
+./test_api_h264               # H.264 IDCT 4x4 + 8x8 + deblock
+./test_api_opportunistic_qpu  # cycles 3+5+8 QPU-override paths
+
+# Mixed-kernel M4 bench (Issue 003 framework):
+./bench_concurrent_mixed --cpu-kernel mc --qpu-kernel lpf4 --neon-threads 3 --qpu-core 3 --duration 6
+```
+
+## Consuming the kernel library
+
+For integration code (e.g., `daedalus-v4l2` userspace daemon):
+
+```c
+#include <daedalus.h>
+
+daedalus_ctx *ctx = daedalus_ctx_create();
+// has_qpu == 1 if V3D init succeeded; else NEON-only fallback
+
+// Recipe dispatch: routes to the per-cycle verdict substrate.
+daedalus_recipe_dispatch_vp9_idct8(ctx, dst, stride, coeffs, n_blocks, meta);
+
+// Or explicit substrate selection for runtime-aware scheduling:
+daedalus_dispatch_vp9_mc_8h(ctx, DAEDALUS_SUBSTRATE_QPU, dst, dst_stride,
+                            src, src_stride, n_blocks, meta);
+
+daedalus_ctx_destroy(ctx);
+```
+
+See `include/daedalus.h` for the full API.

 ## Sibling projects in the same orbit

-- `libva-v4l2-request-fourier` — VA-API consumer-side backend.
-  Eventual consumer if daedalus produces a V4L2 stateless node.
-- `firefox-fourier` — Firefox fork that routes stateless V4L2
-  through libavcodec's `v4l2_request` hwaccel. Same pickup point.
+- **[daedalus-v4l2](https://git.reauktion.de/marfrit/daedalus-v4l2)**
+  — V4L2 stateless wrapper. Linux kernel module +
+  userspace daemon that consume `libdaedalus_core.a` from this
+  repo. Scaffold + roadmap; Phase 8 implementation work.
+- `libva-v4l2-request-fourier` — VA-API consumer; talks to
+  daedalus-v4l2's `/dev/videoNN`.
+- `firefox-fourier` — Firefox fork routing stateless V4L2 through
+  libavcodec's `v4l2_request` hwaccel.
 - `chromium-fourier` — sibling for Chromium.
-- `kernel-agent` — would house the V4L2 driver wrapping the
-  userspace decoder, once one exists.
 - `ampere-av1-enablement` — software-side AV1 bring-up on RK3588
   (rkvdec / vpu981). Provides the userspace conformance harness
   daedalus reuses for VC7-AV1 verification.