Second piece of the intra-prediction primitive set after PR #12
(Intra_4x4 luma 9 modes). Covers the Intra_16x16 luma MB type
per H.264 §8.3.2: 4 modes (Vertical, Horizontal, DC, Plane).
Scope:
- tests/h264_intra_pred_16x16_ref.c — 4 spec-derived modes.
Same FFmpeg-style interface as the 4x4 sibling:
void daedalus_h264_pred_16x16_<name>_ref(uint8_t *dst, ptrdiff_t stride);
Assumes all neighbours valid (interior-MB case).
The Plane mode is the algorithmically heaviest of the four:
spec §8.3.2.4 has two slope sums (H, V) over the asymmetric
top/left contexts, a clipped linear (planar) evaluation per cell,
and a top-left-corner participant at i=7 / j=7. Implementation
follows the spec straightforwardly with `clip_u8` on the final
saturating cast.
- tests/test_intra_pred_16x16.c — 5 test cases:
* V, H, DC: standard contexts (gradient top / gradient left /
small uniform pair).
* Plane (uniform): all neighbours = 100 → H = V = 0 →
output = (16*200 + 16) >> 5 = 100. Verifies the
orientation-free portion of the formula.
* Plane (gradient): top + left both 0..15, spec-derived
corner expectations pred[0][0] = 1 and pred[15][15] = 31.
The arithmetic chain (H = V = 400 → b = c = 31, a = 480)
is fully hand-traced in the test comment so the expected
values are auditable.
- CMakeLists.txt — new test_intra_pred_16x16 binary; pure-CPU
library, no daedalus_core dependency (same separation as the
4x4 ref).
Verified on hertz:
$ ./build/test_intra_pred_16x16
Vertical (mode 0) PASS
Horizontal (mode 1) PASS
DC (mode 2) PASS
Plane (mode 3, uniform) PASS
Plane (mode 3, gradient) PASS (corners 1, 31)
ALL Intra_16x16 mode references PASS
Plane mode being right first try is meaningful: H/V sums, b/c
slope shifts, and the a-baseline arithmetic have many sign / index
error opportunities. The gradient test would have caught most of
them and found none. One caveat: top and left use the same ramp, so
H = V and b = c, and a swapped b/c would pass undetected.
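The Plane arithmetic traced in the gradient test can be sketched as follows. This is a minimal illustration written from the formulas quoted in this message, not the contents of tests/h264_intra_pred_16x16_ref.c; the function name is hypothetical, and it assumes all neighbours are valid and that `>>` on negative ints is an arithmetic shift (true on GCC/Clang).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static uint8_t clip_u8(int v)
{
    return v < 0 ? 0 : v > 255 ? 255 : (uint8_t)v;
}

/* Hypothetical name. dst points at sample (0,0) of the 16x16 block; the row
 * above, the column to the left and the top-left corner must be valid. */
void pred16x16_plane_sketch(uint8_t *dst, ptrdiff_t stride)
{
    assert(dst != NULL);
    const uint8_t *top = dst - stride;   /* top[-1] is the corner sample */
    int H = 0, V = 0;

    for (int i = 0; i < 8; i++) {
        /* At i == 7 the subtracted sample is the top-left corner. */
        H += (i + 1) * (top[8 + i] - top[6 - i]);
        V += (i + 1) * (dst[(8 + i) * stride - 1] - dst[(6 - i) * stride - 1]);
    }

    const int a = 16 * (dst[15 * stride - 1] + top[15]);
    const int b = (5 * H + 32) >> 6;     /* horizontal slope */
    const int c = (5 * V + 32) >> 6;     /* vertical slope */

    for (int y = 0; y < 16; y++)
        for (int x = 0; x < 16; x++)
            dst[y * stride + x] =
                clip_u8((a + b * (x - 7) + c * (y - 7) + 16) >> 5);
}
```

With top and left both set to 0..15 and a zero corner, this gives H = V = 400, b = c = 31, a = 480, and the corner outputs 1 and 31 quoted above.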
What this does NOT cover (still in the intra-pred backlog):
- Intra_8x8 chroma (4 modes per H.264 §8.3.3).
- Intra_8x8 luma (High profile, 9 modes per §8.3.2.1 + the 1-2-1
smoothing pre-filter — distinct algorithm from Intra_4x4).
- Neighbour-availability fallback for boundary MBs.
- Dispatch wrappers (same architectural question as before — wait
for decoder Stage 2a strategy decision).
Lays the bit-exact gate for H.264 §8.3.1.4 Intra_4x4 luma prediction.
Spec-derived C reference covering all 9 modes; standalone test
exercises each against hand-computed expected 4x4 patterns.
Why fourier (not the decoder) gets this: it's a reusable spec-level
primitive — both daedalus-decoder (Phase 1 Stage 2a intra prediction)
and any future shader work will need the same bit-exact reference.
Putting it in fourier alongside the IDCT / deblock refs keeps the
"spec implementations" library cohesive.
Why CPU C reference, not NEON or QPU: the vendored FFmpeg snapshot
(external/ffmpeg-snapshot/libavcodec/aarch64/) has h264dsp/idct/qpel
but NOT h264pred. Vendoring h264pred_neon.S would expand the snapshot
surface; deferring that pending real perf data. Extrapolating from
the cycle 9 NEON benches (~5 ns per 8x8 qpel block), intra prediction
at ~5 ns per 4x4 block × 16 blocks/MB × 8160 MBs = ~650 us/frame at
1080p. That is well inside budget at NEON speed and leaves ample
headroom for plain C to be slower. Not the critical-path concern.
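The per-frame estimate above is plain arithmetic and can be reproduced with a small helper. This is only a sketch: the function names are hypothetical, and the 5 ns figure is the extrapolation quoted in this message, not a measurement of intra prediction.

```c
#include <assert.h>
#include <stdint.h>

/* Macroblocks per frame: both dimensions round up to a multiple of 16. */
uint32_t mbs_per_frame(uint32_t width, uint32_t height)
{
    assert(width > 0 && height > 0);
    return ((width + 15) / 16) * ((height + 15) / 16);
}

/* Estimated per-frame cost in nanoseconds for a kernel that runs
 * units_per_mb times per macroblock at ns_per_unit each. */
uint64_t frame_cost_ns(uint32_t width, uint32_t height,
                       uint32_t units_per_mb, uint32_t ns_per_unit)
{
    return (uint64_t)mbs_per_frame(width, height) * units_per_mb * ns_per_unit;
}
```

For 1920x1080 this yields 8160 macroblocks and, at 16 units of 5 ns each, 652 800 ns, which is the ~650 us quoted above.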
Scope:
- tests/h264_intra_pred_4x4_ref.c — 9 prediction modes per
H.264 spec §8.3.1.4 sub-clauses, FFmpeg-style interface:
void daedalus_h264_pred_4x4_<name>_ref(uint8_t *dst, ptrdiff_t stride);
Reads top/top-right/left/top-left neighbours from dst[-stride/-1]
offsets, writes 4×4 output at dst[0..3][0..3]. Assumes all 13
neighbour bytes are valid (interior-MB case; availability
fallbacks are caller-side per spec).
- tests/test_intra_pred_4x4.c — 10 cases:
* 9 uniform-context degenerate tests (one per mode), establishing
that nothing is structurally broken (all output cells must
equal the uniform input value).
* 1 asymmetric Vertical_Right sanity test with 16 distinct
expected cells hand-computed from spec §8.3.1.4.6 — the
"really exercise orientation + row/col arithmetic" gate.
- CMakeLists.txt — new test_intra_pred_4x4 binary (no daedalus_core
dependency; pure-CPU library doesn't need a context to construct).
Verified on hertz:
$ ./build/test_intra_pred_4x4
Vertical (mode 0) PASS
Horizontal (mode 1) PASS
DC (mode 2) PASS
DiagDownLeft (mode 3) PASS
DiagDownRight (mode 4) PASS
VerticalRight (mode 5) PASS
HorizontalDown (mode 6) PASS
VerticalLeft (mode 7) PASS
HorizontalUp (mode 8) PASS
VR asym (sanity) PASS
ALL 10 intra-4x4 mode references PASS
The VR asym test passed first try; the DC test failed on the first
attempt because my test expectation miscomputed the rounding shift
(I wrote 4, actual is 2 = (16+4)>>3). Fixed in the test. The
reference itself never had the bug.
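The DC rounding that tripped the test expectation can be sketched as below. This is an illustration under the same interior-block assumption (all four top and four left neighbours valid), with a hypothetical function name; it is not the shipped reference.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical name. dst points at sample (0,0) of the 4x4 block. */
void pred4x4_dc_sketch(uint8_t *dst, ptrdiff_t stride)
{
    assert(dst != NULL);
    int sum = 4;                               /* rounding term, half of 8 */
    for (int i = 0; i < 4; i++)
        sum += dst[i - stride]                 /* top neighbour i */
             + dst[i * stride - 1];            /* left neighbour i */
    const uint8_t dc = (uint8_t)(sum >> 3);    /* mean of 8 samples */

    for (int y = 0; y < 4; y++)
        for (int x = 0; x < 4; x++)
            dst[y * stride + x] = dc;
}
```

With eight neighbours of value 2 the sum is 16, so the output is (16 + 4) >> 3 = 2, matching the corrected expectation.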
What this does NOT cover (next-step backlog):
- Intra_16x16 luma prediction (4 modes per H.264 §8.3.2): vertical,
horizontal, DC, plane.
- Intra_8x8 chroma prediction (4 modes per H.264 §8.3.3): DC,
horizontal, vertical, plane.
- Intra_8x8 luma prediction (High profile, 9 modes per §8.3.2.1) —
these are the High-profile siblings of the modes in this PR with
the 1-2-1 smoothing pre-filter. Different but well-defined.
- Neighbour availability fallback (top-edge MB, left-edge MB,
slice-boundary, top-right unavailable in some positions).
- Dispatch wrappers — these refs aren't surfaced through
daedalus_dispatch_*(). Whether to do that depends on the
daedalus-decoder Stage 2a architecture (per-block CPU vs
per-diagonal GPU wavefront — TBD).
Closes the deblock matrix: adds the four bS=4 intra-strength loop
filters used at I-MB edges (and other boundaries where H.264
§8.7.2.1 forces boundary strength to 4). After this PR fourier
covers all 8 standard 8-bit 4:2:0 deblock combinations:
edge    bS<4             bS=4
------  ---------------  -------
luma_v  ✓ (cycle 8 QPU)  ✓ (CPU)
luma_h  ✓ (CPU, PR #9)   ✓ (CPU)
chrm_v  ✓ (CPU, PR #10)  ✓ (CPU)
chrm_h  ✓ (CPU, PR #10)  ✓ (CPU)
Scope:
- 4 new kernel enums (LV_INTRA=13, LH_INTRA=14, CV_INTRA=15,
CH_INTRA=16), all → CPU substrate in the recipe table.
- 4 new public dispatch fns + 4 recipe wrappers (defined via two
DEFINE_INTRA_DISPATCH / DEFINE_INTRA_RECIPE macros to keep the
boilerplate tight).
- 4 new extern decls for the vendored
ff_h264_{v,h}_loop_filter_{luma,chroma}_intra_neon symbols.
- C reference: tests/h264_intra_loop_filter_ref.c covers all four
orientations. Algorithm per H.264 §8.7.2.3:
Luma: per-side strong/weak filter selector
strong_p = (|p2-p0| < β) AND (|p0-q0| < (α>>2)+2)
strong_q = (|q2-q0| < β) AND (|p0-q0| < (α>>2)+2)
Strong updates p0/p1/p2 (and mirror); weak updates p0 only.
Chroma: always weak, only p0/q0 updated.
- daedalus_h264_deblock_meta is REUSED for intra dispatches; the
tc0[] field is ignored (bS=4 hardcodes the strength). Callers
can build a single edge list and route by kernel without an
extra struct.
- Test refactor: an intra_test_spec table + run_intra_test helper
drives all four orientations through one harness, keeping the
new test surface compact (~50 LOC for 4 kernels vs ~200 if each
had its own test_deblock_*_intra fn).
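The per-side strong/weak selection for luma can be sketched for a single line of samples across the edge. This is an illustration written from the selector quoted above plus the standard bS=4 filter taps; the function name and the (pix, xs) calling shape are assumptions, not the signature used in tests/h264_intra_loop_filter_ref.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical name. pix points at q0; xs is the step across the edge,
 * so p0 = pix[-xs], p3 = pix[-4*xs] and q3 = pix[3*xs]. */
void luma_intra_line_sketch(uint8_t *pix, ptrdiff_t xs, int alpha, int beta)
{
    assert(pix != NULL);
    const int p3 = pix[-4 * xs], p2 = pix[-3 * xs];
    const int p1 = pix[-2 * xs], p0 = pix[-xs];
    const int q0 = pix[0], q1 = pix[xs];
    const int q2 = pix[2 * xs], q3 = pix[3 * xs];

    /* Edge-level gate: leave real image edges untouched. */
    if (abs(p0 - q0) >= alpha || abs(p1 - p0) >= beta || abs(q1 - q0) >= beta)
        return;

    const int small_gap = abs(p0 - q0) < (alpha >> 2) + 2;

    if (small_gap && abs(p2 - p0) < beta) {          /* strong, p side */
        pix[-xs]     = (uint8_t)((p2 + 2 * p1 + 2 * p0 + 2 * q0 + q1 + 4) >> 3);
        pix[-2 * xs] = (uint8_t)((p2 + p1 + p0 + q0 + 2) >> 2);
        pix[-3 * xs] = (uint8_t)((2 * p3 + 3 * p2 + p1 + p0 + q0 + 4) >> 3);
    } else {                                         /* weak: p0 only */
        pix[-xs] = (uint8_t)((2 * p1 + p0 + q1 + 2) >> 2);
    }

    if (small_gap && abs(q2 - q0) < beta) {          /* strong, q side */
        pix[0]      = (uint8_t)((q2 + 2 * q1 + 2 * q0 + 2 * p0 + p1 + 4) >> 3);
        pix[xs]     = (uint8_t)((q2 + q1 + q0 + p0 + 2) >> 2);
        pix[2 * xs] = (uint8_t)((2 * q3 + 3 * q2 + q1 + q0 + p0 + 4) >> 3);
    } else {                                         /* weak: q0 only */
        pix[0] = (uint8_t)((2 * q1 + q0 + p1 + 2) >> 2);
    }
}
```

Each side is decided independently, which is the per-side asymmetry mentioned later in this message: one side can take the strong path while the other takes the weak one.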
Verified on hertz (Pi 5 / V3D 7.1):
$ ./build/test_api_h264
=== Phase 8a API smoke: H.264 kernels via recipe dispatch ===
...
H.264 deblock luma v intra: 1024/1024 bytes bit-exact (100.0000%)
H.264 deblock luma h intra: 1024/1024 bytes bit-exact (100.0000%)
H.264 deblock chroma v intra: 256/256 bytes bit-exact (100.0000%)
H.264 deblock chroma h intra: 256/256 bytes bit-exact (100.0000%)
...
All 11 H.264 kernels bit-exact PASS — the deblock matrix is closed.
The bit-exact match on first try is meaningful for these kernels:
the strong/weak filter selector + per-side asymmetry would have
surfaced any sign / shift / rounding mistake immediately. The
C reference is now a usable spec checkpoint for the eventual QPU
shader work.
QPU shader follow-up: not in this PR. The intra path's 3-cell
per-side update + strong/weak branch is structurally more complex
than the bS<4 path that already has a V shader (v3d_h264deblock.spv).
Per the prior R-band logic for deblock, intra edges are < 20% of
total deblock work at typical bit-rates, so NEON-only at ~ 10 ns/edge
fits comfortably in the budget.
Continues the deblock buildout after PR #9 (luma_h). Adds the two
chroma orientations via the same recipe-table-routed-to-CPU pattern;
QPU shaders for chroma deblock are still a follow-up.
Scope:
- Public API: 4 new fns (dispatch + recipe wrapper × {v, h}).
- Internal: dispatch_h264_deblock_chroma_{v,h}_cpu calling the
vendored ff_h264_{v,h}_loop_filter_chroma_neon symbols.
- Recipe table: DAEDALUS_KERNEL_H264_DEBLOCK_CV = 11,
DAEDALUS_KERNEL_H264_DEBLOCK_CH = 12, both → CPU. Explicit
SUBSTRATE_QPU returns -1 (no shader yet).
- C reference: tests/h264_chroma_loop_filter_ref.c — covers both
orientations. Algorithm per H.264 §8.7.2.4 (bS<4 chroma inter):
tC = tc0_seg + 1 (no luma-style ap/aq side bonus); only p0/q0
are updated (chroma never modifies p1/p2/q1/q2).
- Tests: test_deblock_chroma_v (8x4 tile, edge at row 2) +
test_deblock_chroma_h (4x8 tile, edge at col 2), 4 segments x
2 cells per segment per spec.
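The chroma bS<4 update described in the C reference bullet can be sketched for one line of samples. This follows the rule as stated in this message (tC = tc0 + 1, only p0/q0 written); the function name and calling shape are hypothetical, and `>>` on a negative int is assumed to be an arithmetic shift.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

static int clip3(int lo, int hi, int v)
{
    return v < lo ? lo : v > hi ? hi : v;
}

static uint8_t clip_u8(int v)
{
    return (uint8_t)clip3(0, 255, v);
}

/* Hypothetical name. pix points at q0; xs is the step across the edge. */
void chroma_line_sketch(uint8_t *pix, ptrdiff_t xs,
                        int alpha, int beta, int tc0)
{
    assert(pix != NULL && tc0 >= 0);
    const int p1 = pix[-2 * xs], p0 = pix[-xs];
    const int q0 = pix[0], q1 = pix[xs];

    if (abs(p0 - q0) >= alpha || abs(p1 - p0) >= beta || abs(q1 - q0) >= beta)
        return;

    const int tc = tc0 + 1;               /* no ap/aq side bonus for chroma */
    const int delta =
        clip3(-tc, tc, ((q0 - p0) * 4 + (p1 - q1) + 4) >> 3);

    pix[-xs] = clip_u8(p0 + delta);       /* p1 and q1 are read, never written */
    pix[0]   = clip_u8(q0 - delta);
}
```

For the line 100, 100 | 108, 108 with tc0 = 1 the raw delta is 3, which is clipped to tC = 2, so the edge samples become 102 and 106.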
Verified on hertz (Pi 5 / V3D 7.1):
$ ./build/test_api_h264
=== Phase 8a API smoke: H.264 kernels via recipe dispatch ===
H264_IDCT4 recipe substrate: 2 (1=CPU, 2=QPU)
H264_IDCT8 recipe substrate: 2
H264_DEBLOCK_LV recipe substrate: 2
H264_QPEL_MC20 recipe substrate: 2
H264_DEBLOCK_LH recipe substrate: 1 (CPU, no QPU H shader yet)
H264_DEBLOCK_CV recipe substrate: 1 (CPU)
H264_DEBLOCK_CH recipe substrate: 1 (CPU)
H.264 IDCT 4x4: 2048/2048 bytes bit-exact (100.0000%)
H.264 IDCT 8x8: 2048/2048 bytes bit-exact (100.0000%)
H.264 deblock luma v: 2048/2048 bytes bit-exact (100.0000%)
H.264 deblock luma h: 1024/1024 bytes bit-exact (100.0000%)
H.264 deblock chroma v: 256/256 bytes bit-exact (100.0000%)
H.264 deblock chroma h: 256/256 bytes bit-exact (100.0000%)
H.264 qpel mc20: 1024/1024 bytes bit-exact (100.0000%)
All 7 kernels bit-exact PASS. Chroma test sizes are smaller (256
bytes per orientation) because the per-MB chroma deblock surface is
smaller than luma — accurate to the production geometry.
Why no QPU shader yet (per the established pattern):
- Chroma deblock is ~25% of total deblock work at 4:2:0 (one quarter
the pixel count of luma per MB) — modest QPU win even after the
shader exists.
- Same R-band considerations as the luma _h follow-up: the V shader
transpose isn't mechanical, and the 8-cell tile is small enough
that NEON's per-edge cost (~3 ns) is already inside the budget.
- Total bench at 1080p: 8160 MBs × 4 chroma edges × 3 ns = ~100 us.
Negligible compared to the IDCT layer's 10 ms (CPU NEON).
Coverage in fourier for the bS<4 8-bit 4:2:0 deblock matrix is now
complete: luma_v ✓, luma_h ✓, chroma_v ✓, chroma_h ✓. Remaining
deblock work: bS=4 intra variants (luma + chroma, V + H).
What this unblocks downstream:
- daedalus-decoder Stage 4 deblock can now dispatch all four bS<4
edge categories that a typical inter MB needs.
The previous commit unintentionally added .claude/scheduled_tasks.lock
which is an agent-runtime artefact, not source. Untrack it and add
.claude/ to .gitignore so it stays out of future commits.
Adds the horizontal-edge sibling of cycle 8's deblock_luma_v. The
vendored FFmpeg snapshot already includes ff_h264_h_loop_filter_luma_neon
in libavcodec/aarch64/h264dsp_neon.S — this PR wires up the symbol,
the bit-exact reference, and the recipe-table entry so daedalus-decoder
and other consumers can call the H variant through the same dispatch
shape they use for _v.
Scope:
- Public API: daedalus_dispatch_h264_deblock_luma_h(ctx, sub, ...)
+ daedalus_recipe_dispatch_h264_deblock_luma_h(ctx, ...) wrapper.
- Internal: dispatch_h264_deblock_h_cpu() calls the NEON entry.
- Recipe table: new DAEDALUS_KERNEL_H264_DEBLOCK_LH = 10, mapped
to DAEDALUS_SUBSTRATE_CPU until a QPU shader is written. An
explicit SUBSTRATE_QPU request on the H dispatch returns -1
(fails fast, no silent CPU degradation).
- C reference: tests/h264_h_loop_filter_luma_ref.c — the
column-axis transpose of h264_deblock_ref.c. Same per-segment
kernel; pix[-4..+3] accesses cols instead of rows*stride.
- Test: test_api_h264 grows a test_deblock_h() with 8 tiles
(8 cols x 16 rows each, edge at col 4), random alpha/beta/tc0;
compares NEON dispatch against reference byte-for-byte.
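The "same per-segment kernel, transposed addressing" idea from the C reference bullet can be sketched with one routine parameterised by two strides. This is an illustration, not the contents of h264_h_loop_filter_luma_ref.c: the name and the single-tc0 signature are assumptions (the real kernels take per-segment tc0 values), and `>>` on negative ints is assumed arithmetic.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

static int clip3(int lo, int hi, int v)
{
    return v < lo ? lo : v > hi ? hi : v;
}

/* Hypothetical name. pix points at q0 of the first line; xs steps across
 * the edge, ys steps along it. _v-style: xs = stride, ys = 1.
 * _h-style: xs = 1, ys = stride. */
void luma_edge_sketch(uint8_t *pix, ptrdiff_t xs, ptrdiff_t ys, int n,
                      int alpha, int beta, int tc0)
{
    assert(pix != NULL && tc0 >= 0);
    for (int i = 0; i < n; i++, pix += ys) {
        const int p2 = pix[-3 * xs], p1 = pix[-2 * xs], p0 = pix[-xs];
        const int q0 = pix[0], q1 = pix[xs], q2 = pix[2 * xs];

        if (abs(p0 - q0) >= alpha || abs(p1 - p0) >= beta ||
            abs(q1 - q0) >= beta)
            continue;

        int tc = tc0;
        const int avg = (p0 + q0 + 1) >> 1;
        if (abs(p2 - p0) < beta) {        /* smooth p side: also adjust p1 */
            pix[-2 * xs] = (uint8_t)(p1 + clip3(-tc0, tc0, (p2 + avg - 2 * p1) >> 1));
            tc++;
        }
        if (abs(q2 - q0) < beta) {        /* smooth q side: also adjust q1 */
            pix[xs] = (uint8_t)(q1 + clip3(-tc0, tc0, (q2 + avg - 2 * q1) >> 1));
            tc++;
        }
        const int d = clip3(-tc, tc, ((q0 - p0) * 4 + (p1 - q1) + 4) >> 3);
        pix[-xs] = (uint8_t)clip3(0, 255, p0 + d);
        pix[0]   = (uint8_t)clip3(0, 255, q0 - d);
    }
}
```

Running the _v-style call on a tile and the _h-style call on its transpose should give transposed results, which is a cheap cross-check between the two references.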
Verified on hertz (Pi 5 / V3D 7.1):
$ ./build/test_api_h264
=== Phase 8a API smoke: H.264 kernels via recipe dispatch ===
H264_IDCT4 recipe substrate: 2 (1=CPU, 2=QPU)
H264_IDCT8 recipe substrate: 2
H264_DEBLOCK_LV recipe substrate: 2
H264_QPEL_MC20 recipe substrate: 2
H264_DEBLOCK_LH recipe substrate: 1 (CPU, no QPU H shader yet)
H.264 IDCT 4x4: 2048/2048 bytes bit-exact (100.0000%)
H.264 IDCT 8x8: 2048/2048 bytes bit-exact (100.0000%)
H.264 deblock luma v: 2048/2048 bytes bit-exact (100.0000%)
H.264 deblock luma h: 1024/1024 bytes bit-exact (100.0000%)
H.264 qpel mc20: 1024/1024 bytes bit-exact (100.0000%)
All 5 kernels bit-exact PASS. The new H variant joins the suite
with 1024 random-input bytes in total (8 tiles x 128 bytes per tile).
Why CPU-only for now: the daedalus-decoder downstream needs the H
edge dispatched somewhere — even at CPU NEON cost (~6 ns/edge per
the cycle 8 M3 baseline) a frame's worth at 1080p is
~ 8160 MBs * 4 edges = 32 640 edges = ~200 us — well inside the
30 fps budget. Writing the V3D H-edge shader is a follow-up
(would be cycle 8' or similar; the V-edge shader's transpose isn't
mechanical because of how the workgroup organisation maps to columns
vs rows).
Backlog addition (out of scope for this PR):
- V3D shader for the H variant (mirror of v3d_h264deblock.spv).
- bS=4 intra-strength filter (different algebra; both _v and _h).
- Chroma deblock _v/_h (8-cell variants).
Per the QPU-default substrate decree 2026-05-23, cycle 6 (H.264
IDCT 4x4 + add) was the highest-priority H.264 kernel to flip
from NEON-only to QPU-capable. It has the same shape as VP9 IDCT 8x8
(cycle 1), a two-pass butterfly with shared-memory transpose,
but at 4x4 scale: 4 lanes per block, 16 blocks per WG.
What's added:
- src/v3d_h264_idct4.comp: GLSL compute shader implementing
the H.264 §8.5.12.1 1D butterfly twice (row pass then column
pass), with (val + 32) >> 6 rounding and clip-to-u8 add to
dst. Block memory layout is column-major (matches FFmpeg
`ff_h264_idct_add_neon` convention).
- CMakeLists: glslang rule + install entry for v3d_h264_idct4.spv.
- dispatch_h264_idct4_qpu() in daedalus_core.c: lazy pipeline
init, 3 SSBOs (coeffs / dst / meta as uvec4), push-constant
(n_blocks, dst_stride), 16 blocks per WG dispatch. Matches
the existing dispatch_*_qpu patterns; uses
v3d_runner_create_buffer / destroy_buffer (will swap to
pool API once PR #6 lands).
- daedalus_dispatch_h264_idct4() replaces ROUTE_CPU_ONLY with
the same CPU/QPU substrate switch the deblock dispatch uses.
- daedalus_recipe_substrate_for(H264_IDCT4) returns QPU now
that the shader exists.
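For orientation, the two-pass butterfly with (val + 32) >> 6 rounding that the shader implements can be sketched on the CPU as below. This is an illustration only: the function name is hypothetical, the block is read as column-major (row i, column j at block[i + 4*j]) per the layout note above, and unlike FFmpeg's routine it does not zero the coefficient block afterwards. It assumes arithmetic `>>` on negative ints.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static uint8_t clip_u8(int v)
{
    return v < 0 ? 0 : v > 255 ? 255 : (uint8_t)v;
}

/* One 1D butterfly over four inputs, written to out[0..3]. */
static void butterfly4(int a0, int a1, int a2, int a3, int out[4])
{
    const int z0 = a0 + a2, z1 = a0 - a2;
    const int z2 = (a1 >> 1) - a3, z3 = a1 + (a3 >> 1);
    out[0] = z0 + z3;
    out[1] = z1 + z2;
    out[2] = z1 - z2;
    out[3] = z0 - z3;
}

/* Hypothetical name: inverse-transform one 4x4 block and add it to dst. */
void h264_idct4_add_sketch(uint8_t *dst, ptrdiff_t stride,
                           const int16_t *block)
{
    assert(dst != NULL && block != NULL);
    int tmp[4][4], col[4];

    for (int i = 0; i < 4; i++)           /* row pass */
        butterfly4(block[i], block[i + 4], block[i + 8], block[i + 12],
                   tmp[i]);

    for (int j = 0; j < 4; j++) {         /* column pass, round, add, clip */
        butterfly4(tmp[0][j], tmp[1][j], tmp[2][j], tmp[3][j], col);
        for (int i = 0; i < 4; i++)
            dst[i * stride + j] =
                clip_u8(dst[i * stride + j] + ((col[i] + 32) >> 6));
    }
}
```

A DC-only block with coefficient 64 adds exactly 1 to every sample, which is a quick sanity check on the rounding.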
Verification on hertz (Pi 5 + V3D 7.1):
$ ./test_api_h264
=== Phase 8a API smoke: H.264 kernels via recipe dispatch ===
H264_IDCT4 recipe substrate: 2 (1=CPU, 2=QPU)
H264_IDCT8 recipe substrate: 1
H264_DEBLOCK_LV recipe substrate: 2
H264_QPEL_MC20 recipe substrate: 1
H.264 IDCT 4x4: 2048/2048 bytes bit-exact (100.0000%) ← QPU
H.264 IDCT 8x8: 2048/2048 bytes bit-exact
H.264 deblock luma v: 2048/2048 bytes bit-exact
H.264 qpel mc20: 1024/1024 bytes bit-exact
The AUTO-substrate path now picks QPU for H.264 IDCT 4x4, and
the output is bit-exact against the C reference (which is
identical to the NEON .S code by construction — same FFmpeg
upstream).
Remaining cycle-7/9 work in task #165:
- cycle 7: H.264 IDCT 8x8 (template same shape; 8 lanes per
block, fewer blocks per WG)
- cycle 9: H.264 luma qpel mc20 (different shape — 6-tap MC
not a transform)
This commit lands the cycle-6 piece of task #165.
Per the user decree 2026-05-23 — "what can be done in QPU will
be done in QPU" — this lands two coupled changes that flip
production-decode kernels with existing V3D shaders from
CPU-by-recipe to QPU-by-recipe:
1) daedalus_recipe_substrate_for() returns SUBSTRATE_QPU for
every kernel that has a shipped V3D compute shader:
cycle 1 VP9 IDCT 8x8 QPU (was QPU; unchanged)
cycle 2 VP9 LPF wd=4 QPU (was QPU; unchanged)
cycle 3 VP9 MC 8h QPU (FLIPPED from CPU — v3d_mc_8h.spv)
cycle 4 VP9 LPF wd=8 QPU (was QPU; unchanged)
cycle 5 AV1 CDEF 8x8 QPU (FLIPPED from CPU — v3d_cdef.spv)
cycle 6 H.264 IDCT 4x4 CPU (no shader yet; task #165)
cycle 7 H.264 IDCT 8x8 CPU (no shader yet; task #165)
cycle 8 H.264 deblock luma-v QPU (FLIPPED from CPU — v3d_h264deblock.spv)
cycle 9 H.264 qpel mc20 CPU (no shader yet; task #165)
The R-band cost/benefit framework still applies but is now
superseded for substrate selection by the decree. Where R
stays RED, the cost is in dispatch overhead, which is a
fixable engineering issue (tasks 160 buffer-pool, 161
persistent cmdbuf, 162 dmabuf import).
2) daedalus_ctx_create_no_qpu() now honours an env-var override:
set DAEDALUS_FORCE_QPU=1 in the process and create_no_qpu
silently escalates to a full daedalus_ctx_create(). Lets
the libavcodec substitution shims in marfrit-packages (which
pthread_once a create_no_qpu ctx — see
libavcodec/aarch64/h264_idct_daedalus.c) fire QPU paths
without rebuilding those patches.
Firefox / mpv consumers stay on the Vulkan-free path by
default (env var unset). The daedalus-v4l2 daemon will set
DAEDALUS_FORCE_QPU=1 explicitly before dlopen'ing libavcodec
(separate daedalus-v4l2 follow-up).
Smoke (hertz, Pi 5, kernel 6.18.29):
=== test_api_h264 ===
H264_IDCT4 recipe substrate: 1 (1=CPU, 2=QPU)
H264_IDCT8 recipe substrate: 1
H264_DEBLOCK_LV recipe substrate: 2 ← flipped
H264_QPEL_MC20 recipe substrate: 1
H.264 IDCT 4x4: 2048/2048 bytes bit-exact
H.264 IDCT 8x8: 2048/2048 bytes bit-exact
H.264 deblock luma v: 2048/2048 bytes bit-exact ← QPU path
H.264 qpel mc20: 1024/1024 bytes bit-exact
=== test_api_idct === all substrates (CPU/QPU/AUTO) bit-exact
=== test_api_lpf === all substrates bit-exact wd=4 and wd=8
The dispatch wrapper's fall-through logic (eff == SUBSTRATE_QPU
&& !ctx_has_qpu(ctx) → eff = SUBSTRATE_CPU) handles the case
where the recipe says QPU but the consumer didn't opt in — it
falls back to CPU silently, no regression.
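The fall-through rule can be sketched as a pure function. The enum and function names here are hypothetical stand-ins, not the daedalus API; the values 1 = CPU and 2 = QPU match the smoke output above, while AUTO = 0 and applying the fallback to explicit QPU requests as well are assumptions.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical mirror of the substrate enum. */
typedef enum {
    SK_SUBSTRATE_AUTO = 0,   /* assumed value */
    SK_SUBSTRATE_CPU  = 1,
    SK_SUBSTRATE_QPU  = 2
} sk_substrate;

/* Resolve the substrate a dispatch will actually run on. */
sk_substrate sk_resolve_substrate(sk_substrate requested,
                                  sk_substrate recipe,
                                  bool ctx_has_qpu)
{
    assert(recipe != SK_SUBSTRATE_AUTO);  /* the recipe is always concrete */

    /* AUTO defers to the recipe table; an explicit request wins. */
    sk_substrate eff =
        (requested == SK_SUBSTRATE_AUTO) ? recipe : requested;

    /* Context created without a QPU: degrade to CPU. */
    if (eff == SK_SUBSTRATE_QPU && !ctx_has_qpu)
        eff = SK_SUBSTRATE_CPU;

    return eff;
}
```

This keeps the recipe table free to say QPU for every kernel with a shader, while consumers on a no-QPU context see the same behaviour as before the flip.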
Closes daedalus-fourier tasks #163, #164.
Refs the 2026-05-23 "QPU default substrate" decree.
Phase 2 of the QPU-default substrate campaign — eliminate
vkAllocateCommandBuffers from the dispatch hot path.
Attaches a VkCommandBuffer to each v3d_pipeline, allocated once in
v3d_runner_create_pipeline() and freed in destroy_pipeline(). The
five dispatch_*_qpu sites switch from v3d_runner_alloc_cmdbuf() to
v3d_runner_pipeline_cmdbuf_reset() — vkResetCommandBuffer is O(1)
versus the driver-side allocation walk. Pool was already created
with VK_COMMAND_POOL_CREATE_RESET_COMMAND_BUFFER_BIT so reset is
permitted.
Microbench (hertz, Pi 5, kernel 6.18.29, V3D 7.1):
before (task 160 pool only):
steady-state p50: 76.44 us
steady-state mean: 77.95 us
after (task 160 pool + task 161 persistent cb):
steady-state p50: 54.56 us
steady-state mean: 56.00 us
-> 28% per-dispatch reduction
The remaining ~54 us steady-state is dominated by vkQueueWaitIdle +
shader execution + the two memcpy(in/out) on the dst buffer — task
162 (dmabuf import for dst) targets the memcpy half.
test_api_idct stays bit-exact across CPU/QPU/AUTO substrates.
Refs daedalus-fourier task #161.
Per the QPU-default substrate decree 2026-05-23: the per-dispatch
vkAllocateMemory in dispatch_*_qpu was the biggest single fixable
contributor to QPU dispatch overhead. This pools v3d_buffer
allocations by power-of-2 size class so the second-and-subsequent
dispatch hits a freelist instead of paying ~10-50us of Mesa-V3D7
memory-allocation cost per call.
API additions (v3d_runner.h):
- v3d_runner_acquire_buffer(): pulls from per-bucket freelist;
falls through to v3d_runner_create_buffer() on miss.
- v3d_runner_release_buffer(): pushes back onto the freelist; the
backing VkBuffer/VkDeviceMemory only get vkFreeMemory'd in
v3d_runner_destroy().
- v3d_runner_pool_total_bytes(): diagnostic watermark.
Size classes 2^8..2^23 (256 B to 8 MiB). Oversize requests fall
through to non-pooled (vkAllocateMemory) for both ends — pool stays
correct, just degenerates to old behaviour for those calls.
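The size-class mapping can be sketched as follows. The names are hypothetical and rounding each request up to the next power of two is an assumption about how the classes are chosen; only the 2^8..2^23 range and the oversize bypass come from this message.

```c
#include <assert.h>
#include <stddef.h>

#define SK_POOL_MIN_SHIFT 8    /* 256 B */
#define SK_POOL_MAX_SHIFT 23   /* 8 MiB */

/* Returns the bucket index (0 for 256 B up to 15 for 8 MiB), or -1 when
 * the request exceeds the largest class and should bypass the pool. */
int sk_pool_bucket_for(size_t size)
{
    for (int shift = SK_POOL_MIN_SHIFT; shift <= SK_POOL_MAX_SHIFT; shift++)
        if (size <= ((size_t)1 << shift))
            return shift - SK_POOL_MIN_SHIFT;
    return -1;
}

/* Capacity in bytes of the buffers held in a given bucket. */
size_t sk_pool_bucket_bytes(int bucket)
{
    assert(bucket >= 0 &&
           bucket <= SK_POOL_MAX_SHIFT - SK_POOL_MIN_SHIFT);
    return (size_t)1 << (bucket + SK_POOL_MIN_SHIFT);
}
```

An acquire path would pop from the freelist for the returned bucket and fall back to a fresh allocation on a miss or on -1.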
Migration: daedalus_core.c dispatch_*_qpu paths globally swap
create_buffer → acquire_buffer and destroy_buffer → release_buffer.
All five QPU dispatch functions (idct8 / lpf / mc_8h / cdef /
h264_deblock) now reuse buffers across calls. test_api_idct stays
bit-exact (4096/4096 bytes on CPU/QPU/AUTO substrates on hertz).
Microbench (tests/bench_pool_overhead.c) on hertz (Pi 5,
6.18.29+rpt-rpi-2712, V3D 7.1):
call 0: 434.89 us (cold — 3x vkAllocateMemory)
call 1: 100.06 us (pool hit on all 3 buffers)
steady-state:
p50: 76.44 us
p90: 90.52 us
mean: 77.95 us
first-call / steady-state ratio: 5.7x
The remaining ~76us steady-state is dominated by vkQueueWaitIdle +
shader execution + per-call descriptor-set update + command-buffer
allocation — addressed in follow-on tasks 161 (persistent cmdbuf)
and 162 (dmabuf import for dst, eliminates memcpy in/out).
Refs daedalus-fourier task #160.
The pkg-config file was generated at *configure* time with
`prefix=${CMAKE_INSTALL_PREFIX}`, which captured whatever
CMAKE_INSTALL_PREFIX happened to be set to at `cmake -B build`
time — typically the default `/usr/local`. `cmake --install
build --prefix /foo` then put the files under /foo but the .pc
still pointed pkg-config at /usr/local/include and /usr/local/lib,
which broke downstream consumers configuring against the install
tree.
Concrete bite encountered today on hertz: the daedalus-v4l2 daemon
CMake configure on a /tmp/df-prefix install tree resolved
DAEDALUS_FOURIER_INCLUDE_DIRS to /usr/local/include (empty path on
the test host), so main.c failed to find <daedalus.h>.
Fix: write the .pc with `prefix=${pcfiledir}/<rel>` where <rel> is
the configure-time-computed relative path from
<prefix>/<libdir>/pkgconfig back to <prefix>. pkg-config
substitutes ${pcfiledir} with the actual on-disk location of the
.pc at lookup time, so the resolved prefix tracks wherever the
install tree is moved to — including DESTDIR-staged builds, apt
package installs, and ad-hoc `cmake --install --prefix /tmp/foo`
test installs.
The relative-path computation handles GNUInstallDirs layouts that
add multiarch tuples (Debian's lib/aarch64-linux-gnu) without
hard-coding `../..`. Tested on hertz (Debian trixie, libdir=lib):
prefix=${pcfiledir}/../../
...
$ pkg-config --variable=prefix daedalus-fourier
/tmp/df-prefix-test/lib/pkgconfig/../../
# mv preserves relocation:
$ mv /tmp/df-prefix-test /tmp/df-prefix-moved
$ pkg-config --variable=prefix daedalus-fourier
/tmp/df-prefix-moved/lib/pkgconfig/../../
This unblocks the daedalus-v4l2 daemon out-of-tree builds against
local daedalus-fourier installs and is a prerequisite for tidy
test-rig deployments (per the hertz reload session 2026-05-23).
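The relative-path computation amounts to one ".." per libdir component plus one for the pkgconfig directory. The real logic lives in CMake; the C sketch below only illustrates the counting, with a hypothetical function name, and it emits the path without the trailing slash seen in the output above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Writes the relative path from <prefix>/<libdir>/pkgconfig back to
 * <prefix> into out. libdir must be relative to the prefix, e.g. "lib"
 * or "lib/aarch64-linux-gnu". Returns 0 on success, -1 if out is too
 * small. */
int sk_pc_relative_prefix(const char *libdir, char *out, size_t out_size)
{
    assert(libdir != NULL && out != NULL);
    size_t ups = 1;                       /* the pkgconfig directory itself */
    int in_component = 0;

    for (const char *p = libdir; *p; p++) {
        if (*p == '/') {
            in_component = 0;
        } else if (!in_component) {       /* first char of a new component */
            in_component = 1;
            ups++;
        }
    }

    /* "../" per level, with the final '/' replaced by the terminator. */
    if (ups * 3 > out_size)
        return -1;

    out[0] = '\0';
    for (size_t i = 0; i < ups; i++)
        strcat(out, i + 1 < ups ? "../" : "..");
    return 0;
}
```

A multiarch libdir such as lib/aarch64-linux-gnu therefore needs three levels rather than the two that a hard-coded `../..` would give.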
Original draft (PR #3) speculated wrongly on host-to-SoC mapping:
- hertz and tesla were listed under RK3588. Verified via
/proc/device-tree/compatible: both are raspberrypi,5-model-b /
brcm,bcm2712 (tesla is an LXD container hosted on hertz, so
necessarily shares the host SoC).
- boltzmann (the only actual RK3588 in the fleet, 32 GB, kernel-
dev / MCP hub, 8 W always-on) was omitted entirely.
- noether (Pi 4 / BCM2711, the user's interactive workstation,
where Firefox and mpv actually run) was omitted entirely.
Corrects the per-SoC coverage table:
BCM2712 Pi 5 — higgs, hertz, broglie, tesla (LXD on hertz)
BCM2711 Pi 4 — noether (workstation), dcw3, dcw2
RK3588 — boltzmann
Allwinner H6 — (not in fleet)
Reasoning consequences:
- Pi 5 row is now four hosts but one SoC. Adding a fifth Pi 5
doesn't pressure-test the architecture; substrate decisions
are identical across the row.
- The realistic forcing function for the Pi 4 path is "HW decode
on noether matters and rpivid is still unstable upstream" —
noether is a daily-driver Pi 4 workstation, so this is closer
than the original draft implied.
- The realistic forcing function for an RK3588 caps file is
"AV1 playback on boltzmann matters" — rkvdec doesn't cover
AV1, so Mali Valhall compute substrate becomes the only HW
acceleration option there.
"Re-read this when" list at the top + "Why deferred" section
+ decision log all updated. No change to the architecture sketch
(caps directory, plugin layout, two-backend conclusion) — those
were correct in the original; only the host-to-SoC mapping
underneath them was wrong.
Refs PR #3 (the merged original).
Captures the design draft for generalizing the daedalus daemon
across the fleet (Pi 5 + Pi 4 + RK3588 + Allwinner H6) while
explicitly DEFERRING the work until a second SoC creates a
forcing function.
Key conclusions:
- The recipe layer in daedalus-fourier (daedalus_recipe_dispatch_*)
already abstracts substrate selection per kernel; scaling to
multi-SoC is a data extension (caps/<soc>.toml), not new
architecture.
- libva-v4l2-request-fourier already abstracts over any V4L2
stateless decoder node; the cross-SoC seam is at the V4L2
device level, where the upstream stateless API put it.
- The conceptual gap is that hardware decoders are NOT made of
shaders — rkvdec on RK3588, Hantro G1/G2, VPU8, rpi-hevc-dec
on Pi 5 are bitstream-in NV12-out monoliths. A generalized
daemon needs TWO backends: substrate-composed (today's path)
and codec-level pass-through to vendor V4L2 decoders.
- On RK3588 + every codec rkvdec supports, the daedalus daemon
is bypassed entirely — libva talks to rkvdec directly. The
daemon is only ever in the path on SoCs where at least one
codec needs substrate composition.
Forcing functions for revisiting:
- Pi 4 enters daily use with rpivid still unstable upstream
(would require a V3D4 substrate-composed path with its own
caps file and substrate verdicts).
- A third-party user needs to swap shaders for V3D firmware
experiments without rebuilding the daemon.
- An x86 / panvk host enters the fleet needing dynamic SoC
discovery rather than build-time pinning.
Until then: keep daedalus daemon Pi 5 specific, push cross-SoC
abstraction up to libva-v4l2-request-fourier (which already does
most of it).
Document covers:
- current stack diagram (cycles 1-9 closed)
- per-SoC codec coverage matrix
- refined sketch: /usr/lib/daedalus/{shaders,caps,plugins}
- illustrative bcm2712.toml + rk3588.toml caps files
- where it gets hard (probing, fallback, stateful vs stateless,
CI matrix, libva node selection)
- open questions
- decision log
No code changes; document only.
Refs reauktion/daedalus-v4l2#11 substitution arc closing; pivot
to bug-fix backlog (#145 daemon SEGV, #146 D-state) is the next
work block once cycle 9 deploys.
Extends daedalus-fourier with daedalus_recipe_dispatch_h264_qpel_mc20
so libavcodec.so can route H264QpelContext.put_h264_qpel_pixels_tab[1][2]
through the recipe layer instead of ff_put_h264_qpel8_mc20_neon directly.
API additions (header + library):
- daedalus_h264_qpel_meta { dst_off, src_off }
- daedalus_dispatch_h264_qpel_mc20(ctx, sub, dst, src, stride,
n_blocks, meta)
- daedalus_recipe_dispatch_h264_qpel_mc20(...) (AUTO wrapper)
- DAEDALUS_KERNEL_H264_QPEL_MC20 = 9 in the recipe-query enum
- daedalus_recipe_substrate_for() returns CPU NEON for cycle 9
The 6-tap horizontal half-pel filter signature matches FFmpeg's
H264QpelContext convention exactly: dst and src share a single stride
and src already points at output column 0 (filter reads cols -2..+3).
The single-stride API makes the marfrit-packages FFmpeg shim a
straight pointer-pass with no buffer rearrangement.
Verdict per docs/k9_h264qpel_mc20.md: CPU NEON. Per-block 7.6 ns
gives 135x margin over 30 fps 1080p; QPU dispatch floor at ~250 ns
makes any V3D shader strictly worse. Recipe table reflects that —
the recipe_dispatch entry is a one-line forward to the CPU path.
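The horizontal half-pel filter itself is small enough to sketch. This is an illustration of the standard H.264 6-tap kernel (1, -5, 20, 20, -5, 1) with (sum + 16) >> 5 rounding under the single-stride convention described above; the function name is hypothetical and it is not a copy of tests/h264_qpel8_mc20_ref.c. Arithmetic `>>` on negative ints is assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static uint8_t clip_u8(int v)
{
    return v < 0 ? 0 : v > 255 ? 255 : (uint8_t)v;
}

/* Hypothetical name. src points at output column 0 of an 8x8 block, so
 * each row reads src[-2..+10]; dst and src share one stride. */
void put_qpel8_mc20_sketch(uint8_t *dst, const uint8_t *src,
                           ptrdiff_t stride)
{
    assert(dst != NULL && src != NULL);
    for (int y = 0; y < 8; y++) {
        for (int x = 0; x < 8; x++) {
            const int sum = src[x - 2] - 5 * src[x - 1] + 20 * src[x]
                          + 20 * src[x + 1] - 5 * src[x + 2] + src[x + 3];
            dst[x] = clip_u8((sum + 16) >> 5);
        }
        dst += stride;
        src += stride;
    }
}
```

The taps sum to 32, so a flat input passes through unchanged, and the negative taps are why the final clip is needed.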
CMakeLists changes:
- h264qpel_neon.S added to the daedalus_core static lib (only the
bench targets owned it before; now the public API needs it too)
- tests/h264_qpel8_mc20_ref.c added to the test_api_h264 target
Phase 8a/8b smoke gains a 4th case (test_qpel_mc20): 1024/1024
bytes bit-exact via daedalus_recipe_dispatch_h264_qpel_mc20.
Refs reauktion/daedalus-v4l2#11 — substitution arc step 2 cycle 9.
Make daedalus_core installable so sibling consumers (Phase 8 V4L2
daemon, future libva-v4l2-request-fourier integration tests, etc.)
can `pkg_check_modules(DAEDALUS REQUIRED daedalus-fourier)` against
a system-installed copy.
Installs:
- lib/libdaedalus_core.a
- include/daedalus.h
- lib/pkgconfig/daedalus-fourier.pc
- share/daedalus-fourier/shaders/*.spv (only when
DAEDALUS_BUILD_VULKAN is ON; consumers using
daedalus_ctx_create_no_qpu() don't need them)
pkg-config surfaces the static-archive transitive deps via
Libs.private (-lpthread -ldl -lm) and Requires.private (vulkan),
so a consumer doing `pkg-config --static --libs daedalus-fourier`
gets the full link line. Non-static consumers (using the
no_qpu path) get just `-ldaedalus_core`.
No behaviour change to existing tests / benches.
Verified on hertz (Pi 5, dev host): clean build, all 7 SPIR-V
shaders + the static lib + the header + the .pc file land in
the install prefix.
User created the sibling repo under reauktion/ org. Updates
all 5 cross-links in the README.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The Phase-0-era README is updated to reflect the kernel-library
project's actual state:
- Status: 9 cycles closed (VP9 IDCT/LPF/MC, AV1 CDEF, H.264
IDCT4/IDCT8/deblock/MC) with deployment recipe table as the
headline result.
- Architecture: clarifies that 3 kernels deploy on QPU primary,
6 on CPU primary, 2 with opportunistic-QPU helper paths;
V4L2 wrapper is the sibling daedalus-v4l2 (Option B + γ +
sibling per locked Phase 8 architecture).
- Layout: shows actual repo structure (include/, src/, tests/,
docs/k*_phase*.md, external/ffmpeg-snapshot + dav1d-snapshot).
- Build + run: concrete cmake commands and example bench
invocations.
- Consuming the kernel library: code snippet showing the
public API (daedalus_ctx_create, daedalus_recipe_dispatch_*).
- Conventions: updated dev process reference, current
claude-noether SSH alias convention.
- Sibling projects: added daedalus-v4l2 link.
Old "single-kernel proof-of-concept negative result will close
the project" framing replaced — the negative result test passed;
project is alive and now in deployment phase.
Project voice (Daedalus myth, higgs framing, honest-target
posture) preserved.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Last unmeasured H.264 kernel. mc20 picked as representative
(horizontal half-pel, 6-tap filter; canonical for the H.264 luma
qpel family). M1 PASS 10000/10000 first try, M3 = 131.477
Mblock/s on a single core (7.6 ns/block), 135x the 1080p30 floor.
Per the cycles 6+7 lightweight-kernel rationale, Phase 4 deferred:
QPU dispatch floor (~250 ns/block) is 33x above the NEON per-block
cost; R9 ≈ 0.03 deep RED. No realistic QPU offload value.
Generalization: all H.264 luma MC variants (mc02, mc11, mc22,
etc.) will share this verdict. No need to measure each variant
individually.
H.264 NEON is dramatically faster than VP9 NEON across the board:
- IDCT 4x4: 175 Mblock/s vs N/A (no VP9 analog)
- IDCT 8x8: 151 vs 8.2 Mblock/s (18x faster)
- MC 6/8-tap: 131 vs 7.0 Mblock/s (19x faster)
- Deblock: 92 vs 48 Medge/s (2x faster)
H.264 deployment recipe: all CPU NEON except deblock (opportunistic
QPU). On a Pi 5 running H.264-only, the QPU is mostly idle.
Cycles 1-9 complete. Public API exposes all 9.
Next: daedalus-v4l2 sibling repo per locked Phase 8 architecture
(B + γ + sibling), then README polish.
- external/ffmpeg-snapshot/libavcodec/aarch64/h264qpel_neon.S
vendored (1467 lines, all qpel variants)
- tests/h264_qpel8_mc20_ref.c: 40-line C ref (clip255 of
6-tap convolution)
- tests/bench_neon_h264qpel_mc20.c: M1 + M3 bench
- docs/k9_h264qpel_mc20.md: cycle 9 closure with comparison
matrix
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Wires QPU dispatch for cycles 3 (VP9 MC), 5 (AV1 CDEF), 8 (H.264
deblock) through the public API. These three kernels have recipe
substrate = CPU, but per Issue 003 the mixed-kernel helper value
is real — the dispatch path must exist so override-mode callers
can request QPU on the side.
Pattern mirrors dispatch_idct8_qpu (lazy pipeline + per-call SSBO
alloc + memcpy + dispatch + readback). Each kernel has its own
push-constant struct (mc_pc 3-field, cdef_pc 3-field, deblock_pc
2-field shared with lpf).
Notable bug caught + fixed in test_api_opportunistic_qpu: the
initial dispatch_mc_8h_qpu sized src_max using CPU-side reach
(src_off + 3 + 8 + 7*stride), but the QPU shader reads
src[src_off + row*stride + 0..14] for row=0..7. The last block
therefore read 3 uninitialized bytes: 99.8% match before the fix,
100% after.
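A sketch of the sizing arithmetic behind that bug, assuming both expressions denote the last source index touched (one reading that reproduces the 3-byte shortfall). Function names are hypothetical, not the ones in daedalus_core.c.

```c
#include <assert.h>
#include <stddef.h>

/* Last source index per the CPU-side expression quoted above. */
static size_t cpu_last_index(size_t src_off, size_t stride)
{
    return src_off + 3 + 8 + 7 * stride;
}

/* Last source index the QPU shader reads: row 7, column offset 14. */
static size_t qpu_last_index(size_t src_off, size_t stride)
{
    return src_off + 7 * stride + 14;
}

/* Bytes the buffer must hold so every QPU read hits initialized data. */
static size_t src_bytes_needed(size_t src_off, size_t stride)
{
    assert(stride > 0);
    return qpu_last_index(src_off, stride) + 1;
}
```

Under this reading the gap is independent of stride: qpu_last_index minus cpu_last_index is always 3.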
After this commit, the public API surface fully covers cycles 1-8:
Cycle 1 (IDCT 8x8): CPU + QPU + AUTO bit-exact
Cycle 2 (LPF wd=4): CPU + QPU + AUTO bit-exact
Cycle 3 (MC 8h): CPU recipe; QPU override bit-exact
Cycle 4 (LPF wd=8): CPU + QPU + AUTO bit-exact
Cycle 5 (CDEF): CPU recipe; QPU override (untested in this
test — bench_v3d_cdef is the authoritative 3-way M1)
Cycle 6 (H.264 IDCT 4x4): CPU only (no QPU shader by recipe)
Cycle 7 (H.264 IDCT 8x8): CPU only
Cycle 8 (H.264 deblock luma-v): CPU recipe; QPU override bit-exact
Tests: test_api_opportunistic_qpu adds CPU-vs-QPU bit-exact
comparison for VP9 MC and H.264 deblock through the API.
test_api_idct, test_api_lpf, test_api_h264 still pass.
Per the locked Phase 8 architecture (project_phase8_architecture
memory): next session opens daedalus-v4l2 sibling repo with
Option B (kernel V4L2 shim + userspace daemon), Option γ (dlopen
FFmpeg parser).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per goal "c8p3..c8p7, then p8 — surface if user intervention is
required": this is the surface point.
Kernel-library work is complete (cycles 1-8 all dispatchable via
public API, all CPU paths bit-exact, 3 QPU paths bit-exact, 3
opportunistic-QPU paths have shaders but API wiring is still TODO).
V4L2 wrapper architecture needs 4 user decisions:
- Q1: Option A (v4l2loopback) / B (kernel V4L2 shim) / C (libva direct)
- Q2: Parser source: FFmpeg-vendored / dav1d+libvpx mix / FFmpeg-dlopen
- Q3: In-repo or sibling repo (daedalus-v4l2)?
- Q4: End-to-end test target (tiny clips / 1080p30 / both)
Recommended defaults (A / γ / sibling / both) documented;
explicit confirmation requested before committing to days of work
that locks in months of follow-on choices.
Mechanical TODOs that can proceed in parallel without blocking V4L2
decision: cycle 3+5+8 opportunistic-QPU dispatch wiring through API,
or cycle 9 (H.264 luma qpel MC, predicted CPU-only per cycle 6/7
pattern).
24 commits pushed to marfrit/daedalus-fourier this autonomous arc.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Extends include/daedalus.h with cycles 6, 7, 8 (H.264 IDCT 4x4,
IDCT 8x8, deblock luma-v). All recipe-substrate = CPU
(matches per-cycle Phase 7 verdicts).
src/daedalus_core.c: NEON-path implementations + recipe routing.
daedalus_core library now links the full FFmpeg H.264 NEON
snapshot (h264idct + h264dsp) plus existing VP9 + dav1d.
tests/test_api_h264.c: smoke test covering all 3 H.264 kernels
via daedalus_recipe_dispatch_*. All pass 2048/2048 bit-exact.
Public API coverage after this commit:
- Cycles 1 IDCT 8x8 + 2 LPF4 + 4 LPF8: CPU+QPU+AUTO dispatch
(test_api_idct, test_api_lpf, both pass)
- Cycle 3 MC 8h: CPU only (QPU dispatch stub returns -1)
- Cycle 5 CDEF: CPU only (QPU stub)
- Cycle 6 H.264 IDCT 4x4: CPU only (recipe + only NEON wired)
- Cycle 7 H.264 IDCT 8x8: CPU only
- Cycle 8 H.264 deblock: CPU only (QPU opportunistic — not wired
through API yet; bench_v3d_h264deblock exists for direct test)
Next Phase 8 sub-step: wire opportunistic QPU dispatch for cycles
3+5+8 through the API (so override-mode users can request QPU).
Then surface V4L2-wrapper architecture decisions to user.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
M1: 10000/10000 bit-exact (after orientation fix:
ff_h264_v_loop_filter is "vertical filtering of horizontal edges",
not "vertical edge"; 16 columns process the edge horizontally with
8 rows of vertical context).
M3: 91.947 Medge/s per core. Per-edge 10.9 ns. 11x worst-case
1080p30 floor, 30x realistic floor. Filter triggers on 25 % of
edges (random alpha/beta/tc0 covers both gating paths).
Cycle 8 Phase 9 lesson: H.264/FFmpeg "v_loop_filter" naming uses
filter DIRECTION (vertical) not edge orientation. Edge is
horizontal; filter operates vertically across it. Distinct from
cycle 6's column-major-block lesson but related discovery
pattern. Encoded for future cycles.
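The orientation lesson can be sketched as an indexing helper: for a horizontal edge, each of the 16 columns walks the stride upward for p samples and downward for q samples. The struct and function names are hypothetical; 4 taps per side is an assumption chosen to match the "8 rows of vertical context" above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Samples straddling a horizontal edge in one column. */
typedef struct {
    uint8_t p[4];   /* p0..p3: rows above the edge, nearest first */
    uint8_t q[4];   /* q0..q3: rows below the edge, nearest first */
} edge_taps;

/* pix points at the first row below the edge (the q0 row). */
static edge_taps gather_v_filter_taps(const uint8_t *pix, ptrdiff_t stride,
                                      int col)
{
    assert(col >= 0 && col < 16);
    edge_taps t;
    for (int i = 0; i < 4; i++) {
        t.p[i] = pix[col - (i + 1) * stride];   /* walk up */
        t.q[i] = pix[col + i * stride];         /* walk down */
    }
    return t;
}
```

An "h" filter would swap the roles: step by 1 across the edge and by stride along it.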
R8 prediction revised: 0.09-0.14 ORANGE (down from Phase 1's
0.3-0.8 estimate). H.264 deblock is 2x faster on NEON than VP9
LPF wd=4 (cycle 2) but H.264 deblock has more per-edge branches
that hurt QPU more. Worth building anyway:
- ORANGE in cycle 1's "M4 may rescue" band
- Mixed-kernel deployment helper value (Issue 003) matters more
than isolation R
- 25%-trigger rate gives 4x effective contribution multiplier
on QPU side
- tests/h264_deblock_ref.c (column-walking C ref per row segment)
- tests/bench_neon_h264deblock.c (M1 + M3 bench)
- CMakeLists.txt: cycle 8 NEON bench wiring + h264dsp_neon.S
- docs/k8_h264deblock_phase3.md (closure)
Next: Phase 4 plan QPU shader, Phase 5 Sonnet review.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Targets the one H.264 kernel most likely to be QPU-worthy:
in-loop deblock. Cycles 6 and 7 (IDCT 4x4 and 8x8) both came in
CPU-only because H.264 transforms are NEON-trivial. H.264
deblock has analogous structure to VP9 LPF (cycles 2+4, both
GREEN) so predicted R8 = ORANGE/YELLOW.
This commit:
- Vendors ff_h264_*_loop_filter_*_neon from h264dsp_neon.S
(1076 lines, includes both v/h luma + chroma + intra variants
+ weight/biweight)
- PROVENANCE.md updated with the new vendored file
- Phase 1 doc captures the full plan: start with luma vertical
non-intra (most common case), defer Phase 3+ to next session
H.264 deblock C ref scope is ~2 hours (per-row branching,
per-4-row-segment tc0, ap/aq side conditions, alpha/beta
thresholds — much more complex than VP9 LPF wd=4's
single-branch filter). Deferring to fresh attention next
session rather than rushing now.
After cycle 8 closes, the H.264 QPU surface is well-characterised
and the cycles-1-8 inventory drives the Phase 8 V4L2 wrapper's
substrate-routing recipe.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
H.264 scope added 2026-05-18 per user direction. Pi 5's VideoCore
VII has no hardware H.264 decoder block (only HEVC), so a
QPU-accelerated H.264 path fills the most impactful codec gap.
Cycle 6 = first H.264 kernel (4x4 IDCT + add, smallest H.264
transform, simplest first cycle).
Phase 1: goal doc + 1080p30 floor analysis (5.85 Mblock/s
worst-case, 2.0 Mblock/s realistic since most MBs use 8x8 or
P-skip).
Phase 3: NEON M3 baseline captured. ff_h264_idct_add_neon on
hertz delivers 175 Mblock/s (5.7 ns per block) = 30x worst-case
floor margin. H.264 IDCT 4x4 is dramatically lighter than VP9
IDCT 8x8 (21x faster per block).
Phase 3 closure also caught the key Phase 9 lesson: H.264/FFmpeg
blocks are COLUMN-MAJOR (block[c*4 + r] = (row=r, col=c)). NEON
ld1 with 4 registers interleaves loading, and the FFmpeg C ref
indexing makes this convention explicit. Initial C ref assumed
row-major, M1 was 5% bit-exact; after fix, M1 = 100%.
Convention encoded for all subsequent H.264 cycles (cycle 7+).
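A minimal sketch of the column-major convention described above, using the block[c*4 + r] indexing from this message. Helper names are hypothetical and not taken from tests/h264_idct4_ref.c.

```c
#include <assert.h>
#include <stdint.h>

/* Read coefficient (row, col) from a column-major 4x4 block. */
static int16_t coeff_at(const int16_t block[16], int row, int col)
{
    assert(row >= 0 && row < 4 && col >= 0 && col < 4);
    return block[col * 4 + row];
}

/* Convert a row-major 4x4 block into the column-major layout. */
static void row_major_to_column_major(int16_t dst[16], const int16_t src[16])
{
    for (int r = 0; r < 4; r++)
        for (int c = 0; c < 4; c++)
            dst[c * 4 + r] = src[r * 4 + c];
}
```

A reference that indexes block[r*4 + c] agrees with this layout only on the diagonal, consistent with the very low initial M1 match.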
- external/ffmpeg-snapshot/libavcodec/aarch64/h264idct_neon.S
(vendored verbatim from FFmpeg n7.1.3, 415 lines)
- external/ffmpeg-snapshot/PROVENANCE.md: updated
- tests/h264_idct4_ref.c: column-major C ref
- tests/bench_neon_h264idct4.c: M1 + M3 bench
- CMakeLists.txt: cycle 6 NEON bench wiring
- docs/k6_h264idct4_phase1.md, phase3.md
Phase 4 next: QPU shader for cycle 6. Predicted R6 = 0.01 (deep
RED — kernel too small relative to QPU dispatch overhead) but
worth building for cycle-completeness + the opportunistic-helper
hypothesis (cycle 6 may stay CPU per recipe).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Mirror the IDCT pattern (lazy pipeline + per-call SSBO alloc +
dispatch + readback) for cycles 2 (LPF wd=4) and 4 (LPF wd=8).
Important bug, caught empirically: the two LPF shaders disagree
on push-constant slot order — wd=4 puts dst_stride_u8 at slot 1,
wd=8 puts it at slot 2 (with unused blocks_per_row at slot 1).
Initial single-struct attempt silently corrupted wd=8 output
(1958/2048 = 95.6 % bit-exact on test_api_lpf). Fixed by keeping
separate lpf4_pc and lpf8_pc struct definitions.
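A sketch of why one struct cannot serve both shaders, assuming 32-bit push-constant slots. The slot-0 field name (n_edges) is a hypothetical placeholder; only the dst_stride_u8 and blocks_per_row positions come from the description above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint32_t n_edges;         /* slot 0: hypothetical placeholder */
    uint32_t dst_stride_u8;   /* slot 1 */
} lpf4_pc;

typedef struct {
    uint32_t n_edges;         /* slot 0: hypothetical placeholder */
    uint32_t blocks_per_row;  /* slot 1: present but unused */
    uint32_t dst_stride_u8;   /* slot 2 */
} lpf8_pc;

/* Compile-time guards so a layout drift fails the build, not the output. */
static_assert(offsetof(lpf4_pc, dst_stride_u8) == 1 * sizeof(uint32_t),
              "wd=4 expects dst_stride_u8 at slot 1");
static_assert(offsetof(lpf8_pc, dst_stride_u8) == 2 * sizeof(uint32_t),
              "wd=8 expects dst_stride_u8 at slot 2");
```

Pinning the offsets at compile time is one way to turn this class of silent corruption into a build error.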
dst-window calc handles both kernels (same -4..+3 byte footprint
per row).
test_api_lpf exercises both kernels in CPU / QPU / AUTO modes
against the C reference. All 6 mode/kernel combinations pass
2048/2048 bit-exact (32 edges × 8 rows × 8 bytes/edge).
Phase 8 status after this commit: 3 of 5 kernels wired through
API for QPU dispatch (IDCT, LPF wd=4, LPF wd=8 — i.e., all 3
QPU-default kernels per recipe). Cycle 3 MC and cycle 5 CDEF
still need wiring for opportunistic-override mode but aren't
needed for recipe-AUTO path.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
daedalus_ctx now owns a v3d_runner when V3D is available. The
public API's dispatch_vp9_idct8 routes QPU calls through a
new dispatch_idct8_qpu helper that: (1) lazy-creates the cycle 1
v4 pipeline on first use, (2) allocates 3 host-visible SSBOs
per call (coeffs/dst/meta), (3) memcpy host->GPU, (4) dispatch
with the v4 32-blocks-per-WG geometry, (5) memcpy GPU->host.
Per-call alloc is intentional for Phase 8 correctness-first
scope; buffer-pool perf optimization is deferred.
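The five steps can be sketched with a CPU stand-in for the GPU side. Everything here is hypothetical (fake_ctx, fake_kernel, dispatch_sketch): malloc stands in for SSBO allocation, a function pointer for the pipeline, and the meta buffer is omitted. It shows only the control flow, not the v3d_runner API.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef void (*fake_kernel_fn)(const int16_t *coeffs, uint8_t *dst, size_t n);

typedef struct {
    fake_kernel_fn pipeline;   /* NULL until first use */
    int            creations;  /* counts lazy creations */
} fake_ctx;

/* Placeholder math (add residual to dst); not an IDCT. */
static void fake_kernel(const int16_t *coeffs, uint8_t *dst, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = (uint8_t)(dst[i] + coeffs[i]);
}

static int dispatch_sketch(fake_ctx *ctx, uint8_t *dst,
                           const int16_t *coeffs, size_t n)
{
    assert(ctx && dst && coeffs && n > 0);
    if (!ctx->pipeline) {                       /* (1) lazy pipeline */
        ctx->pipeline = fake_kernel;
        ctx->creations++;
    }
    int16_t *dev_coeffs = malloc(n * sizeof *dev_coeffs);  /* (2) alloc */
    uint8_t *dev_dst    = malloc(n);
    if (!dev_coeffs || !dev_dst) {
        free(dev_coeffs);
        free(dev_dst);
        return -1;
    }
    memcpy(dev_coeffs, coeffs, n * sizeof *dev_coeffs);    /* (3) upload */
    memcpy(dev_dst, dst, n);
    ctx->pipeline(dev_coeffs, dev_dst, n);                 /* (4) dispatch */
    memcpy(dst, dev_dst, n);                               /* (5) readback */
    free(dev_coeffs);
    free(dev_dst);
    return 0;
}
```

The per-call malloc/free pair marks where a buffer pool would slot in later.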
Added daedalus_ctx_create_no_qpu() for fast-path callers that
know they want CPU only.
test_api_idct extended to a 3-mode matrix: CPU forced, QPU
forced, AUTO recipe. All three deliver 4096/4096 bit-exact
on hertz with V3D 7.1.7.0:
recipe substrate for VP9_IDCT8: 2 (QPU)
[CPU] 4096/4096 bit-exact
[QPU] 4096/4096 bit-exact (real QPU dispatch through the API)
[AUTO] 4096/4096 bit-exact (recipe routes to QPU)
Next Phase 8 sub-step: same wiring pattern for cycle 2 LPF wd=4
and cycle 4 LPF wd=8 (the other two recipe-QPU kernels).
Cycle 3 MC and cycle 5 CDEF only need the dispatch hook
(recipe routes to CPU; QPU stays opportunistic via explicit
override).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
include/daedalus.h: stable C API surface exposing the 5 cycles
(VP9 IDCT 8x8, LPF wd=4, MC 8h, LPF wd=8; AV1 CDEF). Per-kernel
recipe-dispatch helpers default to the cycle 1-5 verdict
substrate (QPU for cycles 1+2+4, CPU for cycles 3+5); explicit
override available for benchmarking and runtime-aware scheduling.
src/daedalus_core.c: NEON-path implementation of all 5 kernels
wrapped behind the public API. QPU path stubbed out (returns -1)
since wiring v3d_runner into daedalus_ctx is the next Phase 8
sub-step; with has_qpu=0 the recipe falls back to CPU cleanly.
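A sketch of the routing rule implied here, under assumptions: the enum names and values are illustrative, and returning -1 for an explicit QPU request without a QPU is one reading of "QPU path stubbed out (returns -1)". The real signatures in include/daedalus.h may differ.

```c
#include <assert.h>

typedef enum { SUB_AUTO = 0, SUB_CPU = 1, SUB_QPU = 2 } substrate;

/*
 * Returns the substrate to run on, or -1 if the caller explicitly asked
 * for QPU and none is available. AUTO follows the recipe default and
 * falls back to CPU when has_qpu is 0.
 */
static int resolve_substrate(substrate requested, substrate recipe_default,
                             int has_qpu)
{
    assert(recipe_default != SUB_AUTO);
    if (requested == SUB_AUTO) {
        if (recipe_default == SUB_QPU && !has_qpu)
            return SUB_CPU;            /* clean recipe fallback */
        return recipe_default;
    }
    if (requested == SUB_QPU && !has_qpu)
        return -1;                     /* explicit override cannot be met */
    return requested;
}
```

Separating AUTO fallback from explicit-override failure keeps benchmark runs from silently measuring the wrong substrate.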
tests/test_api_idct.c: 64-block IDCT through the public recipe
dispatch, bit-exact vs C ref. PASS 4096/4096 bytes — proves the
API surface compiles, library links, dispatch routing works, and
NEON fallback delivers correct results.
docs/phase8_scoping.md: architecture options (A=userspace V4L2,
B=kernel V4L2 shim, C=direct libva); pick A for v1; explicitly
out-of-scope work tracked.
Next Phase 8 sub-step: wire v3d_runner into daedalus_ctx so
has_qpu=1 and QPU dispatch goes through the API too. After that:
V4L2 ioctl glue, bitstream parser, superblock loop.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous "layout mismatch" deferral was a one-line bench bug:
NEON expects the caller to pass tmp pointing at the 8x8 block
origin (after the 2*16+2 padding skip), but the bench passed the
raw padded-buffer origin. C ref does the advance internally, so it
filtered the correct block; NEON filtered a (+2 rows, +2 cols)
shifted region. Diagonal-shift trace in the partial doc was
exactly that.
Fix: tmps + i*TMP_INTS + (2*TMP_W + 2) for NEON calls.
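The pointer arithmetic of the fix, as a sketch. TMP_W = 16 follows the "2*16+2" skip above; TMP_H, the resulting TMP_INTS value, and the uint16_t element type are assumptions for illustration, not values read from tests/bench_neon_cdef.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum {
    TMP_W    = 16,              /* row pitch, from the 2*16+2 skip */
    TMP_H    = 12,              /* assumed: 8 rows + 2 padding rows per side */
    TMP_INTS = TMP_W * TMP_H    /* assumed per-block buffer size in elements */
};

/* Pointer to the 8x8 block origin inside padded buffer number i. */
static uint16_t *block_origin(uint16_t *tmps, size_t i)
{
    assert(tmps);
    return tmps + i * TMP_INTS + (2 * TMP_W + 2);
}
```

The offset skips 2 padding rows and 2 padding columns, i.e. 34 elements past each buffer start.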
Results:
M1: 10000/10000 bit-exact (100.0000%), all 8 dirs balanced
M3: 3.809 Mblock/s (consistent with 3.923 from longer window)
Phase 4 unblocked; predicted R5 = 0.02-0.05 (deep RED) per earlier
analysis. Will build QPU CDEF anyway for cycle-completeness +
V4L2 dispatch-path existence.
- tests/bench_neon_cdef.c: 3-line tmp pointer fix
- docs/k5_cdef_phase3.md: supersedes k5_cdef_phase3_partial.md
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
bench_concurrent_mixed runs NEON-N on kernel A + QPU on kernel B
concurrently. Matrix on hertz:
V3 (CPU MC + QPU MC same-kernel): CPU 22.64 + QPU 0.39 Mblock/s
V4 (CPU MC + QPU LPF4): CPU 27.87 + QPU 12.74 Medge/s
V1 (CPU MC + NEON-fb CDEF): CPU 24.49 + 1.75 Mblock/s CDEF
V2 (CPU LPF4 + NEON-fb CDEF): CPU 27.28 Medge + 1.70 Mblock/s
V4 is the daedalus-fourier deployment shape (CPU runs MC; QPU runs
LPF4 via cycle 2 GREEN offload). Both substrates productive; CPU
MC +23% per-core vs same-kernel V3 control. Same-kernel M4 in
cycles 1-5 was a worst-case contention bound, not a deployment
number — user's "5%/50%" framing was correct.
Cycle 3 MC verdict unchanged (QPU MC contributes ~0.4 under any
contention); cycle 5 CDEF deferred verdict softened to
opportunistic helper (NEON-fallback proxy used since cycle 5
Phase 6 not yet built).
- tests/bench_concurrent_mixed.c (configurable cpu-kernel /
qpu-kernel matrix; supports MC, LPF4, LPF8, IDCT real QPU
dispatch; CDEF uses NEON-on-core-3 fallback)
- CMakeLists.txt: build target wired with all FFmpeg + dav1d sources
- docs/issues/003-mixed-kernel-m4-bench.md: closure + matrix
- docs/k3_mc_phase7.md: M4 methodology caveat extended with V3/V4
- docs/k5_cdef_phase3_partial.md: deployment recommendation updated
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>