README polish: reflect cycles 1-9 state + sibling daedalus-v4l2
The Phase-0-era README is updated to reflect the kernel-library project's actual state:

- Status: 9 cycles closed (VP9 IDCT/LPF/MC, AV1 CDEF, H.264 IDCT4/IDCT8/deblock/MC) with deployment recipe table as the headline result.
- Architecture: clarifies that 3 kernels deploy on QPU primary, 6 on CPU primary, 2 with opportunistic-QPU helper paths; V4L2 wrapper is the sibling daedalus-v4l2 (Option B + γ + sibling per locked Phase 8 architecture).
- Layout: shows actual repo structure (include/, src/, tests/, docs/k*_phase*.md, external/ffmpeg-snapshot + dav1d-snapshot).
- Build + run: concrete cmake commands and example bench invocations.
- Consuming the kernel library: code snippet showing the public API (daedalus_ctx_create, daedalus_recipe_dispatch_*).
- Conventions: updated dev process reference, current claude-noether SSH alias convention.
- Sibling projects: added daedalus-v4l2 link.

Old "single-kernel proof-of-concept negative result will close the project" framing replaced: the negative result test passed; project is alive and now in deployment phase.

Project voice (Daedalus myth, higgs framing, honest-target posture) preserved.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -16,11 +16,30 @@ Labyrinth; the Pi Foundation's "use the HEVC block and live with
|
|||||||
software decode for everything else" is the official non-exit;
|
software decode for everything else" is the official non-exit;
|
||||||
the QPU sits unused inside the labyrinth's walls.
|
the QPU sits unused inside the labyrinth's walls.
|
||||||
|
|
||||||
**Status: Phase 0 closed (substrate audit). Phase 1 in progress
|
**Status (2026-05-18): cycles 1-9 closed across 3 codecs
|
||||||
(first-kernel proof on hertz).** This is research-track work that
|
(VP9 + AV1 CDEF + H.264). Public API exposes all 9 kernels.
|
||||||
may take months or may yield a single proof-of-concept kernel that
|
3 kernels deploy on QPU, 6 on CPU, 2 with opportunistic-QPU
|
||||||
loses to ARM NEON, in which case the negative result ships and the
|
helper paths. Phase 8 (V4L2 deployment) ongoing in sibling
|
||||||
project closes.
|
[daedalus-v4l2](https://git.reauktion.de/marfrit/daedalus-v4l2).
|
||||||
|
On hertz, all kernels exceed the 30fps@1080p user-facing floor by
|
||||||
|
8-30×.**
|
||||||
|
|
||||||
|
### Cycles 1-9 deployment recipe
|
||||||
|
|
||||||
|
| Cycle | Kernel | NEON M3 | Primary substrate | QPU offload verdict |
|
||||||
|
|---|---|---|---|---|
|
||||||
|
| 1 | VP9 IDCT 8×8 | 8.2 Mblock/s | **QPU** | M4 +7.2 %, R=0.92 GREEN |
|
||||||
|
| 2 | VP9 LPF wd=4 | 48 Medge/s | **QPU** | M4 +6.9 %, R=0.41 |
|
||||||
|
| 3 | VP9 MC 8h | 7.0 Mblock/s | CPU | R=0.067 RED; QPU dispatch path exists |
|
||||||
|
| 4 | VP9 LPF wd=8 | 31 Medge/s | **QPU** | M4 +4.1 %, R=0.34 |
|
||||||
|
| 5 | AV1 CDEF 8×8 | 3.9 Mblock/s | CPU | R=0.116 ORANGE; QPU = opportunistic helper (0.42 Mblock/s in mixed) |
|
||||||
|
| 6 | H.264 IDCT 4×4 | 175 Mblock/s | CPU | trivially fast on NEON; QPU pointless |
|
||||||
|
| 7 | H.264 IDCT 8×8 | 151 Mblock/s | CPU | likewise |
|
||||||
|
| 8 | H.264 deblock luma-v | 92 Medge/s | CPU | R=0.061 RED; QPU = opportunistic helper (6.2 Medge/s in mixed) |
|
||||||
|
| 9 | H.264 luma qpel MC (mc20) | 131 Mblock/s | CPU | NEON 19× faster than VP9 analog; QPU pointless |
|
||||||
|
|
||||||
|
Per-cycle Phase 7 docs in `docs/k*_phase7.md` (or `*_phase3_and_4.md`
|
||||||
|
for deferred-Phase-4 closures).
|
||||||
|
|
||||||
## Why this exists
|
## Why this exists
|
||||||
|
|
||||||
@@ -85,37 +104,48 @@ The build:
|
|||||||
└───────────────────────────────┘
|
└───────────────────────────────┘
|
||||||
```
|
```
|
||||||
|
|
||||||
The first deliverable is *not* the V4L2 wrapper. The first
|
The first deliverable was one back-end kernel; nine cycles later
|
||||||
deliverable is one back-end kernel running on the QPU, bit-exact
|
the public API in `include/daedalus.h` exposes nine kernels each
|
||||||
against a libavcodec reference, with measured throughput. If that
|
with bit-exact NEON and (where worthwhile) QPU paths. The V4L2
|
||||||
single kernel can't beat NEON or get within 50% of it, the project
|
wrapper is the next-up sibling project
|
||||||
closes here with a documented negative result.
|
([daedalus-v4l2](https://git.reauktion.de/marfrit/daedalus-v4l2)),
|
||||||
|
which turns the kernel-library into a `/dev/videoNN` device for
|
||||||
|
libva-v4l2-request-fourier / browser consumption.
|
||||||
|
|
||||||
## In scope
|
## In scope
|
||||||
|
|
||||||
- A small set of codec back-end kernels (IDCT 8×8, CDEF, deblocking,
|
- The set of codec back-end kernels documented in the deployment
|
||||||
loop restoration filter, MC interpolation) compiled as SPIR-V
|
recipe table above (9 kernels closed; more added per cycle as
|
||||||
compute shaders for Mesa `v3dv`, dispatched via Vulkan compute
|
the codec coverage expands).
|
||||||
from userspace.
|
- A test harness on hertz that runs each kernel against a
|
||||||
- A test harness on hertz that runs each kernel against libavcodec
|
bit-exact reference (FFmpeg or dav1d NEON) and measures
|
||||||
reference outputs and measures throughput (megapixels/sec or
|
throughput vs the equivalent NEON path.
|
||||||
blocks/sec) against the equivalent NEON path.
|
- The public C API in `include/daedalus.h` so the sibling
|
||||||
- Phase 1 = one kernel, bit-exact, with numbers. Phase 2+ = more
|
daedalus-v4l2 (and any other consumer) can dispatch per-block
|
||||||
kernels only if Phase 1 numbers justify it.
|
work with recipe-default substrate routing or explicit override.
|
||||||
|
|
||||||
## Out of scope (for now)
|
## Out of scope (lives in sibling repos)
|
||||||
|
|
||||||
|
- The V4L2 stateless driver — that's
|
||||||
|
[daedalus-v4l2](https://git.reauktion.de/marfrit/daedalus-v4l2).
|
||||||
|
- Bitstream parsing — that lives in daedalus-v4l2 too, via
|
||||||
|
`dlopen`'d FFmpeg at runtime (Option γ).
|
||||||
|
- Browser-side consumption — libva-v4l2-request-fourier +
|
||||||
|
firefox-fourier / chromium-fourier, already mature.
|
||||||
|
|
||||||
|
## Out of scope (permanent)
|
||||||
|
|
||||||
- HEVC (Pi 5 has dedicated silicon; `rpi-hevc-dec` covers it).
|
- HEVC (Pi 5 has dedicated silicon; `rpi-hevc-dec` covers it).
|
||||||
- Pi 4 / BCM2711 / VideoCore VI. Different ISA, smaller compute
|
- Pi 4 / BCM2711 / VideoCore VI. Different ISA, smaller compute
|
||||||
budget. Path B *could* extend but isn't the priority.
|
budget.
|
||||||
- Encode. Pi Foundation removed all HW encode in Pi 5; encode on
|
- Encode. Pi Foundation removed all HW encode in Pi 5.
|
||||||
VC7 is a separate, larger project.
|
|
||||||
- Custom VPU firmware (Path A — blocked by silicon RoT, see
|
- Custom VPU firmware (Path A — blocked by silicon RoT, see
|
||||||
`docs/phase0.md`).
|
`docs/phase0.md`).
|
||||||
- V4L2 stateless driver wrapping the userspace decoder. Eventual
|
|
||||||
consumption point, but Phase 1 lives entirely in userspace.
|
|
||||||
- Beating ARM NEON unconditionally. The honest target is
|
- Beating ARM NEON unconditionally. The honest target is
|
||||||
*concurrent* work: QPU runs while CPU does something else.
|
*concurrent* work: QPU runs while CPU does something else.
|
||||||
|
Per Issue 003 (`docs/issues/003-mixed-kernel-m4-bench.md`),
|
||||||
|
the mixed-kernel deployment shape is where QPU offload pays —
|
||||||
|
same-kernel M4 is the worst-case bound.
|
||||||
|
|
||||||
## Dev substrate
|
## Dev substrate
|
||||||
|
|
||||||
@@ -129,40 +159,113 @@ closes here with a documented negative result.
|
|||||||
|
|
||||||
## Conventions
|
## Conventions
|
||||||
|
|
||||||
This project follows the 9(+1)-phase dev process. See
|
This project follows a 9(+1)-phase dev process per cycle. See
|
||||||
`docs/dev_process.md`. Phase 0 is closed (`docs/phase0.md`);
|
`docs/dev_process.md`. Phase 0 is closed once at project start
|
||||||
Phase 1 is `docs/phase1.md`.
|
(`docs/phase0.md`); each kernel cycle re-runs Phases 1-9.
|
||||||
|
|
||||||
Gitea identity: `claude-noether` (per
|
Phase 5 (second-model independent review) is non-skippable per
|
||||||
`feedback_gitea_as_claude_noether.md`). No `marfrit` pushes from
|
project rule. See `~/.claude/CLAUDE.md` "Reviews are never
|
||||||
Claude sessions.
|
skippable" — empty/no-finding reviews are themselves a strong
|
||||||
|
positive signal, not wasted effort.
|
||||||
|
|
||||||
|
Gitea identity: `claude-noether` for Claude-driven pushes, via
|
||||||
|
SSH alias `git.reauktion.de.claude-noether` (see
|
||||||
|
`memory/reference_gitea_ssh_alias_noether.md`).
|
||||||
|
|
||||||
## Layout
|
## Layout
|
||||||
|
|
||||||
```
|
```
|
||||||
daedalus-fourier/
|
daedalus-fourier/
|
||||||
├── README.md ← this file
|
├── README.md ← this file
|
||||||
|
├── include/daedalus.h ← public C API
|
||||||
|
├── src/
|
||||||
|
│ ├── daedalus_core.c ← API impl: per-kernel CPU+QPU dispatch
|
||||||
|
│ ├── v3d_runner.{c,h} ← Vulkan compute plumbing
|
||||||
|
│ └── v3d_*.comp ← compute shaders (cycles 1, 2, 4, 5, 8)
|
||||||
|
├── tests/
|
||||||
|
│ ├── *_ref.c ← per-kernel C references (bit-exact)
|
||||||
|
│ ├── bench_neon_*.c ← NEON M3 baselines
|
||||||
|
│ ├── bench_v3d_*.c ← QPU M2 + 3-way M1 (vs NEON + C ref)
|
||||||
|
│ ├── bench_concurrent_*.c ← M4 mixed-kernel concurrent bench
|
||||||
|
│ └── test_api_*.c ← public API smoke tests
|
||||||
├── docs/
|
├── docs/
|
||||||
│ ├── dev_process.md ← reference copy of the 9(+1)-phase loop
|
│ ├── dev_process.md ← reference 9(+1)-phase loop
|
||||||
│ ├── phase0.md ← substrate audit (closes Paths A and B)
|
│ ├── phase0.md ← substrate audit (closes Path A)
|
||||||
│ ├── phase1.md ← first-kernel goal + measurement plan
|
│ ├── phase1.md ← R-band decision rules
|
||||||
│ └── vulkaninfo_v3d_7_1_7_hertz.txt
|
│ ├── phase8_scoping.md ← V4L2 architecture options
|
||||||
│ ← inside-view device profile from hertz
|
│ ├── phase8_status.md ← decisions locked + status
|
||||||
├── src/ ← kernels + Vulkan dispatch harness
|
│ ├── k1_*.md..k9_*.md ← per-cycle Phase 1/3/4/5/7 docs
|
||||||
└── tests/ ← bit-exact vs libavcodec, throughput
|
│ └── issues/ ← deferred work
|
||||||
|
├── external/
|
||||||
|
│ ├── ffmpeg-snapshot/ ← vendored FFmpeg n7.1.3 NEON refs (LGPL-2.1+)
|
||||||
|
│ └── dav1d-snapshot/ ← vendored dav1d 1.4.3 CDEF (BSD-2-Clause)
|
||||||
|
└── CMakeLists.txt
|
||||||
```
|
```
|
||||||
|
|
||||||
No build system yet. Adding CMake when the first kernel lands.
|
## Build and run
|
||||||
|
|
||||||
|
On a Pi 5 (Debian Trixie or similar) with Vulkan SDK + Mesa v3dv:
|
||||||
|
|
||||||
|
```sh
|
||||||
|
mkdir build && cd build
|
||||||
|
cmake .. -DCMAKE_BUILD_TYPE=Release
|
||||||
|
cmake --build .
|
||||||
|
|
||||||
|
# Per-kernel M1+M3 NEON baseline:
|
||||||
|
./bench_neon_idct
|
||||||
|
./bench_neon_lpf
|
||||||
|
./bench_neon_h264deblock
|
||||||
|
# ... (one per cycle)
|
||||||
|
|
||||||
|
# Per-kernel M1+M2 QPU bench (3-way bit-exact vs NEON + C ref):
|
||||||
|
./bench_v3d_idct
|
||||||
|
./bench_v3d_lpf
|
||||||
|
./bench_v3d_h264deblock
|
||||||
|
# ...
|
||||||
|
|
||||||
|
# Public API smoke tests:
|
||||||
|
./test_api_idct # VP9 IDCT 8x8, CPU+QPU+AUTO
|
||||||
|
./test_api_lpf # VP9 LPF wd=4 + wd=8
|
||||||
|
./test_api_h264 # H.264 IDCT 4x4 + 8x8 + deblock
|
||||||
|
./test_api_opportunistic_qpu # cycles 3+5+8 QPU-override paths
|
||||||
|
|
||||||
|
# Mixed-kernel M4 bench (Issue 003 framework):
|
||||||
|
./bench_concurrent_mixed --cpu-kernel mc --qpu-kernel lpf4 --neon-threads 3 --qpu-core 3 --duration 6
|
||||||
|
```
|
||||||
|
|
||||||
|
## Consuming the kernel library
|
||||||
|
|
||||||
|
For integration code (e.g., `daedalus-v4l2` userspace daemon):
|
||||||
|
|
||||||
|
```c
|
||||||
|
#include <daedalus.h>
|
||||||
|
|
||||||
|
daedalus_ctx *ctx = daedalus_ctx_create();
|
||||||
|
// has_qpu == 1 if V3D init succeeded; else NEON-only fallback
|
||||||
|
|
||||||
|
// Recipe dispatch: routes to the per-cycle verdict substrate.
|
||||||
|
daedalus_recipe_dispatch_vp9_idct8(ctx, dst, stride, coeffs, n_blocks, meta);
|
||||||
|
|
||||||
|
// Or explicit substrate selection for runtime-aware scheduling:
|
||||||
|
daedalus_dispatch_vp9_mc_8h(ctx, DAEDALUS_SUBSTRATE_QPU, dst, dst_stride,
|
||||||
|
src, src_stride, n_blocks, meta);
|
||||||
|
|
||||||
|
daedalus_ctx_destroy(ctx);
|
||||||
|
```
|
||||||
|
|
||||||
|
See `include/daedalus.h` for the full API.
|
||||||
|
|
||||||
## Sibling projects in the same orbit
|
## Sibling projects in the same orbit
|
||||||
|
|
||||||
- `libva-v4l2-request-fourier` — VA-API consumer-side backend.
|
- **[daedalus-v4l2](https://git.reauktion.de/marfrit/daedalus-v4l2)**
|
||||||
Eventual consumer if daedalus produces a V4L2 stateless node.
|
— V4L2 stateless wrapper. Linux kernel module +
|
||||||
- `firefox-fourier` — Firefox fork that routes stateless V4L2
|
userspace daemon that consume `libdaedalus_core.a` from this
|
||||||
through libavcodec's `v4l2_request` hwaccel. Same pickup point.
|
repo. Scaffold + roadmap; Phase 8 implementation work.
|
||||||
|
- `libva-v4l2-request-fourier` — VA-API consumer; talks to
|
||||||
|
daedalus-v4l2's `/dev/videoNN`.
|
||||||
|
- `firefox-fourier` — Firefox fork routing stateless V4L2 through
|
||||||
|
libavcodec's `v4l2_request` hwaccel.
|
||||||
- `chromium-fourier` — sibling for Chromium.
|
- `chromium-fourier` — sibling for Chromium.
|
||||||
- `kernel-agent` — would house the V4L2 driver wrapping the
|
|
||||||
userspace decoder, once one exists.
|
|
||||||
- `ampere-av1-enablement` — software-side AV1 bring-up on RK3588
|
- `ampere-av1-enablement` — software-side AV1 bring-up on RK3588
|
||||||
(rkvdec / vpu981). Provides the userspace conformance harness
|
(rkvdec / vpu981). Provides the userspace conformance harness
|
||||||
daedalus reuses for VC7-AV1 verification.
|
daedalus reuses for VC7-AV1 verification.
|
||||||
|
|||||||
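As a reading aid for the deployment recipe table added in the diff above, here is a minimal C sketch that encodes the table's "Primary substrate" column as a lookup and re-derives the summary counts from the status paragraph (3 kernels on QPU, 6 on CPU, 2 with opportunistic-QPU helper paths). All type and function names in it (`substrate_t`, `recipe_entry`, `count_primary`, `count_qpu_helpers`, `recipe_default`) are hypothetical illustrations and are not taken from `include/daedalus.h`. The fallback to CPU when no QPU is available is an assumption based on the "NEON-only fallback" comment in the README snippet.

```c
#include <stddef.h>

/* Hypothetical types for illustration only; NOT the daedalus.h API. */
typedef enum { SUBSTRATE_CPU = 0, SUBSTRATE_QPU = 1 } substrate_t;

typedef struct {
    int         cycle;       /* cycle number from the recipe table */
    const char *kernel;      /* kernel name from the recipe table */
    substrate_t primary;     /* "Primary substrate" column */
    int         qpu_helper;  /* 1 if the verdict column names the QPU
                                as an opportunistic helper */
} recipe_entry;

/* Rows transcribed from the "Cycles 1-9 deployment recipe" table. */
static const recipe_entry recipe[] = {
    {1, "VP9 IDCT 8x8",              SUBSTRATE_QPU, 0},
    {2, "VP9 LPF wd=4",              SUBSTRATE_QPU, 0},
    {3, "VP9 MC 8h",                 SUBSTRATE_CPU, 0},
    {4, "VP9 LPF wd=8",              SUBSTRATE_QPU, 0},
    {5, "AV1 CDEF 8x8",              SUBSTRATE_CPU, 1},
    {6, "H.264 IDCT 4x4",            SUBSTRATE_CPU, 0},
    {7, "H.264 IDCT 8x8",            SUBSTRATE_CPU, 0},
    {8, "H.264 deblock luma-v",      SUBSTRATE_CPU, 1},
    {9, "H.264 luma qpel MC (mc20)", SUBSTRATE_CPU, 0},
};
static const size_t recipe_len = sizeof recipe / sizeof recipe[0];

/* Count the kernels whose primary substrate is `s`. */
size_t count_primary(substrate_t s)
{
    size_t n = 0;
    for (size_t i = 0; i < recipe_len; i++)
        if (recipe[i].primary == s)
            n++;
    return n;
}

/* Count the kernels flagged as having an opportunistic-QPU helper path. */
size_t count_qpu_helpers(void)
{
    size_t n = 0;
    for (size_t i = 0; i < recipe_len; i++)
        if (recipe[i].qpu_helper)
            n++;
    return n;
}

/* Recipe-default routing: use the table's primary substrate, but fall
 * back to CPU when no QPU is available (assumed behaviour). Unknown
 * cycle numbers also route to CPU. */
substrate_t recipe_default(int cycle, int has_qpu)
{
    for (size_t i = 0; i < recipe_len; i++)
        if (recipe[i].cycle == cycle)
            return (recipe[i].primary == SUBSTRATE_QPU && has_qpu)
                       ? SUBSTRATE_QPU
                       : SUBSTRATE_CPU;
    return SUBSTRATE_CPU;
}
```

With this table, `recipe_default(1, 1)` selects the QPU for VP9 IDCT 8×8, while `recipe_default(1, 0)` routes the same kernel to the CPU. The library presumably performs its own routing inside the `daedalus_recipe_dispatch_*` calls; this sketch only mirrors the table and does not describe that implementation.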