claude-noether 0894a46114 h264: qpel diagonals — 8 positions (mc11/12/13/21/23/31/32/33)

Closes the qpel buildout.  All 8 remaining diagonal positions land
in one PR.  Each is the rounded average of two half-pel intermediates
per H.264 §8.4.2.2.1 / Table 8-4, with the decomposition matching
the FFmpeg .S reference structure (verified by reading
external/ffmpeg-snapshot/.../h264qpel_neon.S lines 622-758).

Decomposition table (the formula for each output cell at (r,c)):

  mc11 ¼¼ : avg(mc20[r,   c],   mc02[r, c])
  mc12 ¼½ : avg(mc22[r,   c],   mc02[r, c])
  mc13 ¼¾ : avg(mc20[r+1, c],   mc02[r, c])
  mc21 ½¼ : avg(mc22[r,   c],   mc20[r, c])
  mc23 ½¾ : avg(mc22[r,   c],   mc20[r+1, c])
  mc31 ¾¼ : avg(mc20[r,   c],   mc02[r, c+1])
  mc32 ¾½ : avg(mc22[r,   c],   mc02[r, c+1])
  mc33 ¾¾ : avg(mc20[r+1, c],   mc02[r, c+1])

The r+1 and c+1 offsets in the table capture the position-dependent
shift that the FFmpeg .S encodes by pre-incrementing x1 (src pointer)
before branching into the common mc11/mc21 code paths.
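
The decomposition table can be read as a per-cell C reference. The sketch below is an illustration, not the code in this commit: it reuses the helper names hpel_h / hpel_v / hpel_hv from the scope list, but their signatures, the diag_recipe struct, the DIAG table, and qpel8_diag_ref are assumptions made for this example. It uses the standard H.264 6-tap filter (1, -5, 20, 20, -5, 1) and the rounded average (a + b + 1) >> 1.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* H.264 6-tap half-sample filter (1, -5, 20, 20, -5, 1), unrounded. */
static int tap6(int a, int b, int c, int d, int e, int f) {
    return a - 5 * b + 20 * c + 20 * d - 5 * e + f;
}

/* Add the rounding bias, shift, and clamp to 0..255 (negatives clamp first). */
static int round_clip(int v, int bias, int shift) {
    v += bias;
    if (v < 0) return 0;
    v >>= shift;
    return v > 255 ? 255 : v;
}

/* Unrounded horizontal filter at row r, between columns c and c+1. */
static int h_raw(const uint8_t *s, ptrdiff_t st, int r, int c) {
    const uint8_t *p = s + r * st + c;
    return tap6(p[-2], p[-1], p[0], p[1], p[2], p[3]);
}

/* Horizontal half-pel cell (the mc20 value at r, c). */
static int hpel_h(const uint8_t *s, ptrdiff_t st, int r, int c) {
    return round_clip(h_raw(s, st, r, c), 16, 5);
}

/* Vertical half-pel cell (the mc02 value at r, c). */
static int hpel_v(const uint8_t *s, ptrdiff_t st, int r, int c) {
    const uint8_t *p = s + r * st + c;
    return round_clip(tap6(p[-2 * st], p[-st], p[0], p[st], p[2 * st], p[3 * st]), 16, 5);
}

/* Centre half-pel cell (the mc22 value): vertical filter over unrounded rows. */
static int hpel_hv(const uint8_t *s, ptrdiff_t st, int r, int c) {
    int t[6];
    for (int k = 0; k < 6; k++) t[k] = h_raw(s, st, r - 2 + k, c);
    return round_clip(tap6(t[0], t[1], t[2], t[3], t[4], t[5]), 512, 10);
}

typedef int (*hpel_fn)(const uint8_t *, ptrdiff_t, int, int);

/* One table row: two half-pel helpers plus their (row, col) offsets. */
typedef struct { hpel_fn a; int ar, ac; hpel_fn b; int br, bc; } diag_recipe;

static const diag_recipe DIAG[8] = {
    { hpel_h,  0, 0, hpel_v, 0, 0 },  /* mc11 */
    { hpel_hv, 0, 0, hpel_v, 0, 0 },  /* mc12 */
    { hpel_h,  1, 0, hpel_v, 0, 0 },  /* mc13 */
    { hpel_hv, 0, 0, hpel_h, 0, 0 },  /* mc21 */
    { hpel_hv, 0, 0, hpel_h, 1, 0 },  /* mc23 */
    { hpel_h,  0, 0, hpel_v, 0, 1 },  /* mc31 */
    { hpel_hv, 0, 0, hpel_v, 0, 1 },  /* mc32 */
    { hpel_h,  1, 0, hpel_v, 0, 1 },  /* mc33 */
};

/* Fill one 8x8 block; src needs 2 samples of margin before and 4 after. */
void qpel8_diag_ref(uint8_t *dst, ptrdiff_t dst_st,
                    const uint8_t *src, ptrdiff_t src_st, const diag_recipe *q) {
    assert(dst != NULL && src != NULL && q != NULL);
    for (int r = 0; r < 8; r++)
        for (int c = 0; c < 8; c++) {
            int a = q->a(src, src_st, r + q->ar, c + q->ac);
            int b = q->b(src, src_st, r + q->br, c + q->bc);
            dst[r * dst_st + c] = (uint8_t)((a + b + 1) >> 1);
        }
}

/* Sanity check: a constant source must come back unchanged (taps sum to 32). */
int qpel8_diag_flat_check(uint8_t value) {
    uint8_t buf[16 * 16], out[8 * 8];
    for (size_t i = 0; i < sizeof buf; i++) buf[i] = value;
    for (int m = 0; m < 8; m++) {
        qpel8_diag_ref(out, 8, buf + 2 * 16 + 2, 16, &DIAG[m]);
        for (int i = 0; i < 64; i++)
            if (out[i] != value) return 0;
    }
    return 1;
}
```

The flat-field check only confirms that filter gain and rounding are right; it cannot catch a wrong offset, which is why comparison against the NEON output on random inputs remains the real test.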

Scope (tightly macro-ised):
  - 8 new kernel enums (MC11..MC33 = 23..30) → CPU.
  - 8 NEON externs for the vendored ff_put_h264_qpel8_mc*_neon.
  - 8 CPU dispatches via existing DEFINE_QPEL_CPU_DISPATCH macro.
  - 8 public dispatches via DEFINE_QPEL_DISPATCH macro.
  - 8 recipe wrappers via DEFINE_QPEL_RECIPE macro.
  - Header decls condensed via a DECLARE_QPEL_DIAG macro that
    expands to both recipe + dispatch decls per name.
  - C references via DEFINE_DIAG_REF macro: each ref is a 6-line
    wrapper around the per-cell hpel_h / hpel_v / hpel_hv helpers
    (the latter being the per-cell version of mc22's 13-row int16
    tmp[] computation).
  - Test wrapper: test_qpel_diag_all() drives all 8 through the
    existing run_quarter_axis_qpel() harness.
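
For readers unfamiliar with this style of macro-isation, the following is a generic X-macro sketch of how one list of names can drive enums, wrappers, and a dispatch table. Every identifier in it (SKETCH_QPEL_DIAG_LIST, sketch_put_qpel8_*, sketch_dispatch_diag) is hypothetical; the bodies of the real DEFINE_QPEL_* macros are not shown in this commit message and may be structured differently. Only the 23..30 enum range is taken from the scope list above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical list macro: one entry per diagonal position. */
#define SKETCH_QPEL_DIAG_LIST(X) \
    X(mc11) X(mc12) X(mc13) X(mc21) X(mc23) X(mc31) X(mc32) X(mc33)

/* Enum ids: the placeholder at 22 makes mc11..mc33 land on 23..30. */
#define X(n) SKETCH_KERNEL_##n,
enum sketch_kernel {
    SKETCH_KERNEL_LAST_NON_DIAG = 22,
    SKETCH_QPEL_DIAG_LIST(X)
    SKETCH_KERNEL_COUNT
};
#undef X

typedef void (*sketch_qpel_fn)(uint8_t *dst, const uint8_t *src, ptrdiff_t stride);

/* Stand-in kernels: each writes its own enum id so dispatch is observable. */
#define X(n)                                                            \
    static void sketch_put_qpel8_##n(uint8_t *dst, const uint8_t *src, \
                                     ptrdiff_t stride) {                \
        (void)src; (void)stride;                                        \
        dst[0] = (uint8_t)SKETCH_KERNEL_##n;                            \
    }
SKETCH_QPEL_DIAG_LIST(X)
#undef X

/* Dispatch table generated from the same list, indexed from mc11. */
#define X(n) [SKETCH_KERNEL_##n - SKETCH_KERNEL_mc11] = sketch_put_qpel8_##n,
static const sketch_qpel_fn sketch_diag_table[8] = { SKETCH_QPEL_DIAG_LIST(X) };
#undef X

/* Returns 0 on success, -1 if the id is not a diagonal position. */
int sketch_dispatch_diag(int kernel, uint8_t *dst, const uint8_t *src,
                         ptrdiff_t stride) {
    assert(dst != NULL);
    if (kernel < SKETCH_KERNEL_mc11 || kernel > SKETCH_KERNEL_mc33) return -1;
    sketch_diag_table[kernel - SKETCH_KERNEL_mc11](dst, src, stride);
    return 0;
}
```

The benefit of this shape is that adding a position means editing one list, and the enum, wrapper, and table entries cannot drift apart.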

Verified on hertz (Pi 5 / V3D 7.1):

  $ ./build/test_api_h264 | tail -8
    H.264 qpel mc11: 2048/2048 bytes bit-exact (100.0000%)
    H.264 qpel mc12: 2048/2048 bytes bit-exact (100.0000%)
    H.264 qpel mc13: 2048/2048 bytes bit-exact (100.0000%)
    H.264 qpel mc21: 2048/2048 bytes bit-exact (100.0000%)
    H.264 qpel mc23: 2048/2048 bytes bit-exact (100.0000%)
    H.264 qpel mc31: 2048/2048 bytes bit-exact (100.0000%)
    H.264 qpel mc32: 2048/2048 bytes bit-exact (100.0000%)
    H.264 qpel mc33: 2048/2048 bytes bit-exact (100.0000%)

All 8 diagonal positions passed bit-exact on the first try.  This is
meaningful because the position-dependent r+1 / c+1 source offsets
are easy to get wrong in transcription, and any such error would
surface immediately on random inputs.
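
A minimal sketch of what a bit-exact comparison on random inputs involves is given below. It is not the run_quarter_axis_qpel() harness, whose internals are not shown here; the function names, the xorshift generator, and the seed handling are all assumptions for illustration. The report line mimics the log format above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* xorshift32: small deterministic generator so failures are reproducible. */
static uint32_t sketch_rng_next(uint32_t *state) {
    uint32_t x = *state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    *state = x;
    return x;
}

/* Fill a buffer with pseudo-random bytes; the seed must be non-zero. */
void sketch_fill_random(uint8_t *buf, size_t n, uint32_t seed) {
    assert(buf != NULL && seed != 0);
    for (size_t i = 0; i < n; i++)
        buf[i] = (uint8_t)(sketch_rng_next(&seed) >> 24);
}

/* Count positions where the kernel output matches the reference exactly. */
size_t sketch_count_bit_exact(const uint8_t *got, const uint8_t *want, size_t n) {
    size_t ok = 0;
    for (size_t i = 0; i < n; i++)
        ok += (got[i] == want[i]);
    return ok;
}

/* Print one log-style line; returns 1 only when every byte matches. */
int sketch_report(const char *name, const uint8_t *got, const uint8_t *want,
                  size_t n) {
    size_t ok = sketch_count_bit_exact(got, want, n);
    double pct = n ? 100.0 * (double)ok / (double)n : 0.0;
    printf("H.264 qpel %s: %zu/%zu bytes bit-exact (%.4f%%)\n", name, ok, n, pct);
    return ok == n;
}
```

In use, the random buffer would feed both the C reference and the NEON kernel, and the two outputs would go to sketch_report(); a single wrong source offset changes nearly every output byte on random data.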

After this PR the H.264 qpel 8x8 put_ matrix is complete:
  mc00 mc01 mc02 mc03
  mc10 mc11 mc12 mc13
  mc20 mc21 mc22 mc23
  mc30 mc31 mc32 mc33

15 of 16 positions exposed through the daedalus API; mc00 is just
integer copy and rarely needs a dispatch wrapper (libavcodec sets
the function pointer table directly).  mc20 retains its QPU shader
(cycle 9 / v3d_h264_qpel_mc20.spv); all other 14 are CPU NEON.

What this does NOT cover (still in backlog):
  - avg_ variants (the "add" form for biprediction, 16 more
    positions).  Currently the API only exposes put_.
  - 16x16 qpel (separate function family in FFmpeg; the 8x8 path
    can be tiled across the block to substitute when 16x16 isn't
    critical).
  - QPU shaders for any qpel position other than mc20.
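
As a sketch of the 16x16 substitution mentioned in the backlog: assuming a fixed-size 8x8 kernel with FFmpeg's (dst, src, stride) qpel signature, a 16x16 block can be covered with one call per 8x8 quadrant. The function names below are hypothetical and the copy kernel is only a stand-in for a real qpel position.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef void (*sketch_qpel8_fn)(uint8_t *dst, const uint8_t *src, ptrdiff_t stride);

/* Cover a 16x16 block with four 8x8 calls, one per quadrant. */
void sketch_put_qpel16_via_8x8(sketch_qpel8_fn fn8, uint8_t *dst,
                               const uint8_t *src, ptrdiff_t stride) {
    assert(fn8 != NULL && dst != NULL && src != NULL);
    fn8(dst,                  src,                  stride);  /* top-left     */
    fn8(dst + 8,              src + 8,              stride);  /* top-right    */
    fn8(dst + 8 * stride,     src + 8 * stride,     stride);  /* bottom-left  */
    fn8(dst + 8 * stride + 8, src + 8 * stride + 8, stride);  /* bottom-right */
}

/* Stand-in 8x8 kernel: integer-position copy (mc00-style). */
void sketch_copy8x8(uint8_t *dst, const uint8_t *src, ptrdiff_t stride) {
    for (int r = 0; r < 8; r++)
        for (int c = 0; c < 8; c++)
            dst[r * stride + c] = src[r * stride + c];
}

/* Check that the four tiles cover all 256 samples exactly once. */
int sketch_tile_check(void) {
    uint8_t src[16 * 16], dst[16 * 16];
    for (int i = 0; i < 256; i++) {
        src[i] = (uint8_t)i;
        dst[i] = (uint8_t)(255 - i);  /* differs from src at every index */
    }
    sketch_put_qpel16_via_8x8(sketch_copy8x8, dst, src, 16);
    for (int i = 0; i < 256; i++)
        if (dst[i] != src[i]) return 0;
    return 1;
}
```

Because each quadrant reads its own filter margin from the source, the tiles are independent, at the cost of recomputing shared intermediate rows that a dedicated 16x16 routine could reuse.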

2026-05-25 07:49:12 +02:00

docs

Cycle 9 closed: H.264 luma qpel mc20 = 131 Mblock/s NEON, CPU-only

2026-05-18 14:53:21 +00:00

external

Cycle 9 closed: H.264 luma qpel mc20 = 131 Mblock/s NEON, CPU-only

2026-05-18 14:53:21 +00:00

include

h264: qpel diagonals — 8 positions (mc11/12/13/21/23/31/32/33)

2026-05-25 07:49:12 +02:00

src

h264: qpel diagonals — 8 positions (mc11/12/13/21/23/31/32/33)

2026-05-25 07:49:12 +02:00

tests

h264: qpel diagonals — 8 positions (mc11/12/13/21/23/31/32/33)

2026-05-25 07:49:12 +02:00

.gitignore

gitignore: exclude .claude/ runtime files

2026-05-24 23:29:06 +02:00

CMakeLists.txt

h264: qpel diagonals — 8 positions (mc11/12/13/21/23/31/32/33)

2026-05-25 07:49:12 +02:00

daedalus_architecture_backlog.md

docs: architecture backlog — correct fleet hardware mapping

2026-05-23 15:47:55 +02:00

README.md

README: fix daedalus-v4l2 link (reauktion/, not marfrit/)

2026-05-18 14:57:38 +00:00

README.md

daedalus-fourier

Community-built VP9 / AV1 software-decode back-end running on the VideoCore VII (V3D 7.1) QPUs on Broadcom BCM2712 (Raspberry Pi 5 / Compute Module 5), via the existing Mesa v3d userspace driver. ARM keeps the serial entropy front-end; the QPU takes the parallel back-end (inverse transforms, deblocking, CDEF, loop restoration, MC residual add).

Daedalus built the Labyrinth for King Minos, then escaped from it by hand-forging flight firmware out of feathers and wax when no sanctioned exit existed.

That's the project shape. The Broadcom-locked VideoCore VII is the Labyrinth; the Pi Foundation's "use the HEVC block and live with software decode for everything else" is the official non-exit; the QPU sits unused inside the labyrinth's walls.

Status (2026-05-18): cycles 1-9 closed across 3 codecs (VP9 + AV1 CDEF + H.264). Public API exposes all 9 kernels. 3 kernels deploy on QPU, 6 on CPU, 2 with opportunistic-QPU helper paths. Phase 8 (V4L2 deployment) ongoing in sibling daedalus-v4l2. On hertz, all kernels exceed the 30fps@1080p user-facing floor by 8-30×.

Cycles 1-9 deployment recipe

Cycle	Kernel	NEON M3	Primary substrate	QPU offload verdict
1	VP9 IDCT 8×8	8.2 Mblock/s	QPU	M4 +7.2 %, R=0.92 GREEN
2	VP9 LPF wd=4	48 Medge/s	QPU	M4 +6.9 %, R=0.41
3	VP9 MC 8h	7.0 Mblock/s	CPU	R=0.067 RED; QPU dispatch path exists
4	VP9 LPF wd=8	31 Medge/s	QPU	M4 +4.1 %, R=0.34
5	AV1 CDEF 8×8	3.9 Mblock/s	CPU	R=0.116 ORANGE; QPU = opportunistic helper (0.42 Mblock/s in mixed)
6	H.264 IDCT 4×4	175 Mblock/s	CPU	trivially fast on NEON; QPU pointless
7	H.264 IDCT 8×8	151 Mblock/s	CPU	likewise
8	H.264 deblock luma-v	92 Medge/s	CPU	R=0.061 RED; QPU = opportunistic helper (6.2 Medge/s in mixed)
9	H.264 luma qpel MC (mc20)	131 Mblock/s	CPU	NEON 19× faster than VP9 analog; QPU pointless

Per-cycle Phase 7 docs in docs/k*_phase7.md (or *_phase3_and_4.md for deferred-Phase-4 closures).

Why this exists

higgs is a Raspberry Pi Compute Module 5 in a small portable chassis with a battery. Watching nerds review Star Wars on YouTube while putting Mac Studios into virtual shopping baskets is a core workload for the higgs class of device.

YouTube serves H.264 (legacy), VP9 (typical 4K), and AV1 (newer high-bitrate / high-resolution content). It does not serve HEVC. Pi 5's BCM2712 has one HW decoder block: HEVC. The intersection of {what YouTube serves} ∩ {what BCM2712 decodes in HW} = ∅.

Every YouTube frame on higgs today is software-decoded on Cortex-A76 cores at ~50–90% CPU per video stream. Offloading the parallel back-end of that decode to the otherwise-idle QPU complex might recover meaningful CPU time and battery on higgs. The honest prior — measured in Phase 0 — is that the QPU has roughly equal raw compute to the A76 cluster but a smaller slice of the shared LPDDR4x bandwidth, so the win, if any, comes from offloading concurrent work the CPU would have done anyway.

The Pi Foundation isn't going to do this work (per their own statement: chromium-patch sustainment was too much; codec sustainment would be more so). The kernel rpi-hevc-dec series has been 17 months in review for one decoder block they DID write themselves. Whatever ships here ships through the community.

Architecture (Path B)

Phase 0 closed two paths:

Path A — custom VPU firmware on the VC7 scalar cores. Blocked. BCM2712 has a silicon root of trust: the mask ROM hardcodes RPi's public key and unconditionally verifies the second-stage bootloader. EXECUTE_CODE mailbox removed on Pi 5. No software-only bypass exists. See docs/phase0.md §3.
Path B — QPU compute kernels via the existing Mesa v3d / DRM / Vulkan-compute path. This is the path. The QPU is reachable from userspace today on a stock signed Pi 5 / CM5 via /dev/dri/card0. No firmware loading. No signing fight. Idein/py-videocore7 (SGEMM 21 GFLOPS sustained) is the existence proof.

The build:

┌───────────────────────────────┐
│ userspace VP9 / AV1 decoder   │
│  (fork of dav1d / libvpx)     │
├───────────────────────────────┤
│  ARM:    entropy decode       │ ← Cortex-A76 + NEON
│          (Bool coder / ANS)   │   structurally serial
├───────────────────────────────┤
│  QPU:    parallel back-end    │ ← V3D 7.1 via Mesa v3dv
│          (IDCT, CDEF,         │   Vulkan compute shaders
│           deblock, LR, MC)    │   or direct DRM submit
├───────────────────────────────┤
│ V4L2 stateless wrapper        │ ← out-of-tree kernel module
│  (eventual, kernel-agent)     │   exposing /dev/videoN
└───────────────────────────────┘

The first deliverable was one back-end kernel; nine cycles later, the public API in include/daedalus.h exposes nine kernels, each with a bit-exact NEON path and (where worthwhile) a QPU path. The V4L2 wrapper is the next-up sibling project (daedalus-v4l2), which turns the kernel library into a /dev/videoNN device for libva-v4l2-request-fourier / browser consumption.

In scope

The set of codec back-end kernels documented in the deployment recipe table above (9 kernels closed; more added per cycle as the codec coverage expands).
A test harness on hertz that runs each kernel against a bit-exact reference (FFmpeg or dav1d NEON) and measures throughput vs the equivalent NEON path.
The public C API in include/daedalus.h so the sibling daedalus-v4l2 (and any other consumer) can dispatch per-block work with recipe-default substrate routing or explicit override.

Out of scope (lives in sibling repos)

The V4L2 stateless driver — that's daedalus-v4l2.
Bitstream parsing — that lives in daedalus-v4l2 too, via dlopen'd FFmpeg at runtime (Option γ).
Browser-side consumption — libva-v4l2-request-fourier + firefox-fourier / chromium-fourier, already mature.

Out of scope (permanent)

HEVC (Pi 5 has dedicated silicon; rpi-hevc-dec covers it).
Pi 4 / BCM2711 / VideoCore VI. Different ISA, smaller compute budget.
Encode. Pi Foundation removed all HW encode in Pi 5.
Custom VPU firmware (Path A — blocked by silicon RoT, see docs/phase0.md).
Beating ARM NEON unconditionally. The honest target is concurrent work: QPU runs while CPU does something else. Per Issue 003 (docs/issues/003-mixed-kernel-m4-bench.md), the mixed-kernel deployment shape is where QPU offload pays — same-kernel M4 is the worst-case bound.

Dev substrate

hertz (Pi 5, 8 GB, Debian Trixie, kernel 6.12.75-rpt-rpi-2712, Mesa 25.0.7 with v3dv, V3D 7.1.7) — the dev / test / measurement host. Watchdog-protected for crash recovery. See docs/vulkaninfo_v3d_7_1_7_hertz.txt for the inside-view device profile.
higgs (CM5 in portable battery chassis) — the eventual user target. Not a dev unit; sealed chassis.

Conventions

This project follows a 9(+1)-phase dev process per cycle. See docs/dev_process.md. Phase 0 is closed once at project start (docs/phase0.md); each kernel cycle re-runs Phases 1-9.

Phase 5 (second-model independent review) is non-skippable per project rule. See ~/.claude/CLAUDE.md "Reviews are never skippable" — empty/no-finding reviews are themselves a strong positive signal, not wasted effort.

Gitea identity: claude-noether for Claude-driven pushes, via SSH alias git.reauktion.de.claude-noether (see memory/reference_gitea_ssh_alias_noether.md).

Layout

daedalus-fourier/
├── README.md             ← this file
├── include/daedalus.h    ← public C API
├── src/
│   ├── daedalus_core.c   ← API impl: per-kernel CPU+QPU dispatch
│   ├── v3d_runner.{c,h}  ← Vulkan compute plumbing
│   └── v3d_*.comp        ← compute shaders (cycles 1, 2, 4, 5, 8)
├── tests/
│   ├── *_ref.c           ← per-kernel C references (bit-exact)
│   ├── bench_neon_*.c    ← NEON M3 baselines
│   ├── bench_v3d_*.c     ← QPU M2 + 3-way M1 (vs NEON + C ref)
│   ├── bench_concurrent_*.c ← M4 mixed-kernel concurrent bench
│   └── test_api_*.c      ← public API smoke tests
├── docs/
│   ├── dev_process.md    ← reference 9(+1)-phase loop
│   ├── phase0.md         ← substrate audit (closes Path A)
│   ├── phase1.md         ← R-band decision rules
│   ├── phase8_scoping.md ← V4L2 architecture options
│   ├── phase8_status.md  ← decisions locked + status
│   ├── k1_*.md..k9_*.md  ← per-cycle Phase 1/3/4/5/7 docs
│   └── issues/           ← deferred work
├── external/
│   ├── ffmpeg-snapshot/  ← vendored FFmpeg n7.1.3 NEON refs (LGPL-2.1+)
│   └── dav1d-snapshot/   ← vendored dav1d 1.4.3 CDEF (BSD-2-Clause)
└── CMakeLists.txt

Build and run

On a Pi 5 (Debian Trixie or similar) with Vulkan SDK + Mesa v3dv:

mkdir build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release
cmake --build .

# Per-kernel M1+M3 NEON baseline:
./bench_neon_idct
./bench_neon_lpf
./bench_neon_h264deblock
# ... (one per cycle)

# Per-kernel M1+M2 QPU bench (3-way bit-exact vs NEON + C ref):
./bench_v3d_idct
./bench_v3d_lpf
./bench_v3d_h264deblock
# ...

# Public API smoke tests:
./test_api_idct       # VP9 IDCT 8x8, CPU+QPU+AUTO
./test_api_lpf        # VP9 LPF wd=4 + wd=8
./test_api_h264       # H.264 IDCT 4x4 + 8x8 + deblock
./test_api_opportunistic_qpu  # cycles 3+5+8 QPU-override paths

# Mixed-kernel M4 bench (Issue 003 framework):
./bench_concurrent_mixed --cpu-kernel mc --qpu-kernel lpf4 --neon-threads 3 --qpu-core 3 --duration 6

Consuming the kernel library

For integration code (e.g., daedalus-v4l2 userspace daemon):

#include <daedalus.h>

daedalus_ctx *ctx = daedalus_ctx_create();
// has_qpu == 1 if V3D init succeeded; else NEON-only fallback

// Recipe dispatch: routes to the per-cycle verdict substrate.
daedalus_recipe_dispatch_vp9_idct8(ctx, dst, stride, coeffs, n_blocks, meta);

// Or explicit substrate selection for runtime-aware scheduling:
daedalus_dispatch_vp9_mc_8h(ctx, DAEDALUS_SUBSTRATE_QPU, dst, dst_stride,
                            src, src_stride, n_blocks, meta);

daedalus_ctx_destroy(ctx);

See include/daedalus.h for the full API.
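
To make the difference between recipe dispatch and explicit substrate selection concrete, here is a small routing sketch. It is not the code in daedalus_core.c: the sketch_* types, the AUTO value, and the fallback rule for an explicit QPU request on a context without a QPU are assumptions. The per-kernel defaults are copied from the deployment recipe table above.

```c
#include <assert.h>
#include <stddef.h>

typedef enum {
    SKETCH_SUBSTRATE_AUTO,  /* follow the per-cycle recipe verdict */
    SKETCH_SUBSTRATE_CPU,
    SKETCH_SUBSTRATE_QPU
} sketch_substrate;

typedef enum {
    SKETCH_K_VP9_IDCT8,      /* cycle 1 */
    SKETCH_K_VP9_LPF4,       /* cycle 2 */
    SKETCH_K_VP9_MC8H,       /* cycle 3 */
    SKETCH_K_VP9_LPF8,       /* cycle 4 */
    SKETCH_K_AV1_CDEF8,      /* cycle 5 */
    SKETCH_K_H264_IDCT4,     /* cycle 6 */
    SKETCH_K_H264_IDCT8,     /* cycle 7 */
    SKETCH_K_H264_DEBLOCK,   /* cycle 8 */
    SKETCH_K_H264_QPEL_MC20, /* cycle 9 */
    SKETCH_K_COUNT
} sketch_kernel_id;

/* Primary substrate per kernel, as listed in the recipe table. */
static const sketch_substrate sketch_recipe[SKETCH_K_COUNT] = {
    SKETCH_SUBSTRATE_QPU, SKETCH_SUBSTRATE_QPU, SKETCH_SUBSTRATE_CPU,
    SKETCH_SUBSTRATE_QPU, SKETCH_SUBSTRATE_CPU, SKETCH_SUBSTRATE_CPU,
    SKETCH_SUBSTRATE_CPU, SKETCH_SUBSTRATE_CPU, SKETCH_SUBSTRATE_CPU,
};

typedef struct {
    int has_qpu;  /* 1 if V3D init succeeded, else NEON-only */
} sketch_ctx;

/* Decide where a dispatch would run for a given request. */
sketch_substrate sketch_resolve(const sketch_ctx *ctx, sketch_kernel_id k,
                                sketch_substrate requested) {
    assert(ctx != NULL && k >= 0 && k < SKETCH_K_COUNT);
    sketch_substrate s = requested;
    if (s == SKETCH_SUBSTRATE_AUTO)
        s = sketch_recipe[k];
    /* Assumed rule: without a working QPU, everything falls back to CPU. */
    if (s == SKETCH_SUBSTRATE_QPU && !ctx->has_qpu)
        s = SKETCH_SUBSTRATE_CPU;
    return s;
}
```

Under these assumptions, a scheduler that sees idle QPU time could pass SKETCH_SUBSTRATE_QPU for a kernel whose recipe default is CPU, which is the shape of the opportunistic-helper paths described for cycles 5 and 8.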

Sibling projects in the same orbit

daedalus-v4l2 — V4L2 stateless wrapper. Linux kernel module + userspace daemon that consume libdaedalus_core.a from this repo. Scaffold + roadmap; Phase 8 implementation work.
libva-v4l2-request-fourier — VA-API consumer; talks to daedalus-v4l2's /dev/videoNN.
firefox-fourier — Firefox fork routing stateless V4L2 through libavcodec's v4l2_request hwaccel.
chromium-fourier — sibling for Chromium.
ampere-av1-enablement — software-side AV1 bring-up on RK3588 (rkvdec / vpu981). Provides the userspace conformance harness daedalus reuses for VC7-AV1 verification.

Source attribution

Daedalus-the-myth is public domain. The wax-and-feathers metaphor is older than software engineering.

Anyone wanting to fail at this project: please file your failures under branches/icarus/. Built-in self-deprecation slot, with honor.
