d0a1db3c8fdff02294883751617e76da2419d6ce
19 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
e01f7bc7c6 |
h264: qpel single-axis quarter-pel — mc10/mc30/mc01/mc03 (CPU/NEON)
Closes the 4 single-axis quarter-pel positions in one PR. Each is
a half-pel lowpass clipped to u8 followed by L2 rounded-average
with an integer-aligned source pixel per H.264 §8.4.2.2.1:
mc10 ¼-H ("a" pos): clip255(mc20(s)) avg src[r,c]
mc30 ¾-H ("c" pos): clip255(mc20(s)) avg src[r,c+1]
mc01 ¼-V ("d" pos): clip255(mc02(s)) avg src[r,c]
mc03 ¾-V ("n" pos): clip255(mc02(s)) avg src[r+1,c]
The mc10/mc30 pair and mc01/mc03 pair only differ in WHICH integer
source pixel they average with — the half-pel computation is the
same. Putting them in one PR is justified by that uniformity.
Scope:
- 4 new kernel enums: MC10=19, MC30=20, MC01=21, MC03=22 → CPU.
- 4 NEON externs for the vendored ff_put_h264_qpel8_mc{10,30,01,03}_neon.
- 4 CPU dispatch wrappers via DEFINE_QPEL_CPU_DISPATCH macro
(collapses ~50 LOC of repetition).
- 4 public dispatch fns via DEFINE_QPEL_DISPATCH macro.
- 4 recipe wrappers via DEFINE_QPEL_RECIPE macro.
- tests/h264_qpel8_quarter_axis_ref.c covers all four via shared
hpel_h() / hpel_v() inlines + per-mode L2 average.
- Test refactor: generic run_quarter_axis_qpel() harness exercises
all 4 positions through a single helper (~50 LOC for 4 tests vs
~200 if each was hand-rolled).
Verified on hertz:
$ ./build/test_api_h264 | tail -8
H.264 deblock chroma h intra: 256/256 bytes bit-exact (100.0000%)
H.264 qpel mc20: 1024/1024 bytes bit-exact (100.0000%)
H.264 qpel mc02: 2048/2048 bytes bit-exact (100.0000%)
H.264 qpel mc22: 2048/2048 bytes bit-exact (100.0000%)
H.264 qpel mc10: 2048/2048 bytes bit-exact (100.0000%)
H.264 qpel mc30: 2048/2048 bytes bit-exact (100.0000%)
H.264 qpel mc01: 2048/2048 bytes bit-exact (100.0000%)
H.264 qpel mc03: 2048/2048 bytes bit-exact (100.0000%)
All 4 new positions bit-exact PASS first try.
Coverage matrix update:
put_ mc00 mc10 mc20 mc30
mc01 — ✓ — ✓
mc11 — — ✓ — ← this row
mc21 — — — —
mc31 — — — —
mc02 — — ✓ — ← mc02 + mc22 anchor
mc03 — — ✓ —
After this PR: 7 of 16 single-axis + diagonal positions done.
Remaining 9 are the off-axis quarter-pel combinations
(mc11/mc12/mc13/mc21/mc23/mc31/mc32/mc33) — each combines a 2D
lowpass intermediate with L2 averaging against a 1D-lowpass output.
Next PR scope.
Why no QPU shaders: same R-band logic as the prior CPU additions.
At ~10 ns per 8x8 NEON block, all 16 qpel positions together
would land in ~1.3 ms/frame at 1080p worst case — comfortably
inside the 33 ms budget. QPU shader for mc20 already exists
(cycle 9 / v3d_h264_qpel_mc20.spv); the other 15 follow once a
clear perf reason emerges.
|
||
|
|
20a4299c5c |
h264: qpel mc22 (2D half-pel, CPU/NEON)
Adds the "j position" 2D half-pel via cascaded H + V 6-tap lowpass
with intermediate 16-bit precision per H.264 §8.4.2.2.1. One of the
most common qpel positions in real H.264 streams — many encoders
emit 1/2-1/2 motion vectors as their best-RD choice.
Algorithmically distinct from the 1D mc20/mc02 siblings:
- Horizontal 6-tap produces 13 rows of int16 intermediate (no
per-stage clip/round — full precision retained).
- Vertical 6-tap on the intermediate, then +512 >> 10 (the
double-shift compensates for both 6-tap scalings) + clip255.
The intermediate-precision requirement means the C reference can't
just be "call mc20 then mc02" — that would double-clip and produce
the wrong result. The 13-row int16 tmp[] buffer is the central
invariant.
Scope (same pattern as mc02 PR #15):
- Public API: daedalus_dispatch_h264_qpel_mc22 + recipe wrapper.
- Internal: dispatch_h264_qpel_mc22_cpu calling
ff_put_h264_qpel8_mc22_neon.
- Recipe table: DAEDALUS_KERNEL_H264_QPEL_MC22 = 18 → CPU.
- C reference: tests/h264_qpel8_mc22_ref.c — explicit tmp[13][8]
int16 staging buffer; spec-derived shifts and rounding.
- Test: test_qpel_mc22 in test_api_h264, 8 tiles at 16×16 with
output positioned at (SRC_ROW=3, SRC_COL=3) so the kernel's
[-2 .. +10] read window stays in-tile.
Verified on hertz:
$ ./build/test_api_h264 | tail -5
H.264 deblock chroma v intra: 256/256 bytes bit-exact (100.0000%)
H.264 deblock chroma h intra: 256/256 bytes bit-exact (100.0000%)
H.264 qpel mc20: 1024/1024 bytes bit-exact (100.0000%)
H.264 qpel mc02: 2048/2048 bytes bit-exact (100.0000%)
H.264 qpel mc22: 2048/2048 bytes bit-exact (100.0000%)
All 13 H.264 kernels in api_smoke now bit-exact PASS.
mc22 being right first try is meaningful — the +512 >> 10 scaling
+ int16 intermediate sequence has multiple sign/shift/clip pitfalls
and any of them would surface on random inputs immediately.
Coverage matrix update:
put_ mc20 ✓ (QPU+CPU) put_ mc02 ✓ (CPU) put_ mc22 ✓ (CPU)
→ 12 single put_ positions still missing (¼/¾ + HV combos with
L2 averaging).
|
||
|
|
c3301b0c2e |
h264: qpel mc02 (vertical half-pel, CPU/NEON)
Mirror of cycle 9's mc20 transposed to vertical orientation. Wires
up the second qpel half-pel position via the vendored
ff_put_h264_qpel8_mc02_neon symbol, closes the "missing vertical
sibling" gap that mc20 left open since cycle 9.
Scope:
- Public API: daedalus_dispatch_h264_qpel_mc02 + recipe wrapper.
- Internal: dispatch_h264_qpel_mc02_cpu calling the NEON entry.
- Recipe table: DAEDALUS_KERNEL_H264_QPEL_MC02 = 17 → CPU.
Explicit SUBSTRATE_QPU returns -1 (no shader yet).
- C reference: tests/h264_qpel8_mc02_ref.c — vertical 6-tap
transpose of mc20 (reads src[(r±N)*stride + c] instead of
src[r*stride + c±N]).
- Test: test_qpel_mc02 in test_api_h264, 8 tiles × 16×16 cols
× 16 rows, random input, bit-exact compare against the C ref.
Verified on hertz:
$ ./build/test_api_h264
...
H.264 qpel mc20: 1024/1024 bytes bit-exact (100.0000%)
H.264 qpel mc02: 2048/2048 bytes bit-exact (100.0000%)
All 12 H.264 kernels in the api_smoke now bit-exact PASS.
Why CPU-only: same R-band logic as the deblock _h sibling pattern.
mc02 at ~7.6 ns per 8x8 block on NEON (per the cycle 9 baseline
measurements) gives ~700 us for 8160 MBs × 4 8x8 luma blocks at
1080p — comfortably inside the 33 ms budget. QPU shader is a
fast-follow once the V vs H shader work is consolidated (the
transpose for the V shader is not mechanical — different SIMD
access pattern than the H shader).
Coverage matrix update:
qpel position put_ status avg_ status
------------- ----------- -----------
mc00 (copy) not wired not wired
mc10 (¼-H) not wired not wired
mc20 (½-H) ✓ QPU+CPU not wired
mc30 (¾-H) not wired not wired
mc01 (¼-V) not wired not wired
mc02 (½-V) ✓ CPU not wired (this PR)
mc03 (¾-V) not wired not wired
mc11..mc33 not wired not wired
13 more qpel positions to go for the full put_ matrix. Adding them
follows the same template; each is a small contained PR.
|
||
|
|
9b1c106dc5 |
h264: deblock bS=4 intra variants (luma + chroma, V + H)
Closes the deblock matrix: adds the four bS=4 intra-strength loop
filters used at I-MB edges (and other boundaries where H.264
§8.7.2.1 forces boundary strength to 4). After this PR fourier
covers all 8 standard 8-bit 4:2:0 deblock combinations:
bS<4 bS=4
----- -----
luma_v ✓ (cycle 8 QPU) ✓ (CPU)
luma_h ✓ (CPU, PR #9) ✓ (CPU)
chrm_v ✓ (CPU, PR #10) ✓ (CPU)
chrm_h ✓ (CPU, PR #10) ✓ (CPU)
Scope:
- 4 new kernel enums (LV_INTRA=13, LH_INTRA=14, CV_INTRA=15,
CH_INTRA=16), all → CPU substrate in the recipe table.
- 4 new public dispatch fns + 4 recipe wrappers (defined via two
DEFINE_INTRA_DISPATCH / DEFINE_INTRA_RECIPE macros to keep the
boilerplate tight).
- 4 new extern decls for the vendored
ff_h264_{v,h}_loop_filter_{luma,chroma}_intra_neon symbols.
- C reference: tests/h264_intra_loop_filter_ref.c covers all four
orientations. Algorithm per H.264 §8.7.2.3:
Luma: per-side strong/weak filter selector
strong_p = (|p2-p0| < β) AND (|p0-q0| < (α>>2)+2)
strong_q = (|q2-q0| < β) AND (|p0-q0| < (α>>2)+2)
Strong updates p0/p1/p2 (and mirror); weak updates p0 only.
Chroma: always weak, only p0/q0 updated.
- daedalus_h264_deblock_meta is REUSED for intra dispatches; the
tc0[] field is ignored (bS=4 hardcodes the strength). Callers
can build a single edge list and route by kernel without an
extra struct.
- Test refactor: an intra_test_spec table + run_intra_test helper
drives all four orientations through one harness, keeping the
new test surface compact (~50 LOC for 4 kernels vs ~200 if each
had its own test_deblock_*_intra fn).
Verified on hertz (Pi 5 / V3D 7.1):
$ ./build/test_api_h264
=== Phase 8a API smoke: H.264 kernels via recipe dispatch ===
...
H.264 deblock luma v intra: 1024/1024 bytes bit-exact (100.0000%)
H.264 deblock luma h intra: 1024/1024 bytes bit-exact (100.0000%)
H.264 deblock chroma v intra: 256/256 bytes bit-exact (100.0000%)
H.264 deblock chroma h intra: 256/256 bytes bit-exact (100.0000%)
...
All 11 H.264 kernels bit-exact PASS — the deblock matrix is closed.
The bit-exact match on first try is meaningful for these kernels:
the strong/weak filter selector + per-side asymmetry would have
surfaced any sign / shift / rounding mistake immediately. The
C reference is now a usable spec checkpoint for the eventual QPU
shader work.
QPU shader follow-up: not in this PR. The intra path's 3-cell
per-side update + strong/weak branch is structurally more complex
than the bS<4 path that already has a V shader (v3d_h264deblock.spv).
Per the prior R-band logic for deblock, intra edges are < 20% of
total deblock work at typical bit-rates, so NEON-only at ~ 10 ns/edge
fits comfortably in the budget.
|
||
|
|
a5c47aa51c |
h264: deblock chroma_v + chroma_h (CPU/NEON, bS<4)
Continues the deblock buildout after PR #9 (luma_h). Adds the two chroma orientations via the same recipe-table-routed-to-CPU pattern; QPU shaders for chroma deblock are still a follow-up. Scope: - Public API: 4 new fns (dispatch + recipe wrapper × {v, h}). - Internal: dispatch_h264_deblock_chroma_{v,h}_cpu calling the vendored ff_h264_{v,h}_loop_filter_chroma_neon symbols. - Recipe table: DAEDALUS_KERNEL_H264_DEBLOCK_CV = 11, DAEDALUS_KERNEL_H264_DEBLOCK_CH = 12, both → CPU. Explicit SUBSTRATE_QPU returns -1 (no shader yet). - C reference: tests/h264_chroma_loop_filter_ref.c — covers both orientations. Algorithm per H.264 §8.7.2.4 (bS<4 chroma inter): tC = tc0_seg + 1 (no luma-style ap/aq side bonus); only p0/q0 are updated (chroma never modifies p1/p2/q1/q2). - Tests: test_deblock_chroma_v (8x4 tile, edge at row 2) + test_deblock_chroma_h (4x8 tile, edge at col 2), 4 segments x 2 cells per segment per spec. Verified on hertz (Pi 5 / V3D 7.1): $ ./build/test_api_h264 === Phase 8a API smoke: H.264 kernels via recipe dispatch === H264_IDCT4 recipe substrate: 2 (1=CPU, 2=QPU) H264_IDCT8 recipe substrate: 2 H264_DEBLOCK_LV recipe substrate: 2 H264_QPEL_MC20 recipe substrate: 2 H264_DEBLOCK_LH recipe substrate: 1 (CPU, no QPU H shader yet) H264_DEBLOCK_CV recipe substrate: 1 (CPU) H264_DEBLOCK_CH recipe substrate: 1 (CPU) H.264 IDCT 4x4: 2048/2048 bytes bit-exact (100.0000%) H.264 IDCT 8x8: 2048/2048 bytes bit-exact (100.0000%) H.264 deblock luma v: 2048/2048 bytes bit-exact (100.0000%) H.264 deblock luma h: 1024/1024 bytes bit-exact (100.0000%) H.264 deblock chroma v: 256/256 bytes bit-exact (100.0000%) H.264 deblock chroma h: 256/256 bytes bit-exact (100.0000%) H.264 qpel mc20: 1024/1024 bytes bit-exact (100.0000%) All 7 kernels bit-exact PASS. Chroma test sizes are smaller (256 bytes per orientation) because the per-MB chroma deblock surface is smaller than luma — accurate to the production geometry. Why no QPU shader yet (per the established pattern): - Chroma deblock is ~25% of total deblock work at 4:2:0 (one quarter the pixel count of luma per MB) — modest QPU win even after the shader exists. - Same R-band considerations as the luma _h follow-up: the V shader transpose isn't mechanical, and the 8-cell tile is small enough that NEON's per-edge cost (~3 ns) is already inside the budget. - Total bench at 1080p: 8160 MBs × 4 chroma edges × 3 ns = ~100 us. Negligible compared to the IDCT layer's 10 ms (CPU NEON). Now coverage in fourier for the bS<4 8-bit 4:2:0 deblock matrix is complete: luma_v ✓, luma_h ✓, chroma_v ✓, chroma_h ✓. Remaining deblock work: bS=4 intra variants (luma + chroma, V + H). What this unblocks downstream: - daedalus-decoder Stage 4 deblock can now dispatch all four bS<4 edge categories that a typical inter MB needs. |
||
|
|
9d5451e0fe |
h264: deblock_luma_h — CPU/NEON via vendored ff_h264_h_loop_filter
Adds the horizontal-edge sibling of cycle 8's deblock_luma_v. The
vendored FFmpeg snapshot already includes ff_h264_h_loop_filter_luma_neon
in libavcodec/aarch64/h264dsp_neon.S — this PR wires up the symbol,
the bit-exact reference, and the recipe-table entry so daedalus-decoder
and other consumers can call the H variant through the same dispatch
shape they use for _v.
Scope:
- Public API: daedalus_dispatch_h264_deblock_luma_h(ctx, sub, ...)
+ daedalus_recipe_dispatch_h264_deblock_luma_h(ctx, ...) wrapper.
- Internal: dispatch_h264_deblock_h_cpu() calls the NEON entry.
- Recipe table: new DAEDALUS_KERNEL_H264_DEBLOCK_LH = 10, mapped
to DAEDALUS_SUBSTRATE_CPU until a QPU shader is written. An
explicit SUBSTRATE_QPU request on the H dispatch returns -1
(fails fast, no silent CPU degradation).
- C reference: tests/h264_h_loop_filter_luma_ref.c — the
column-axis transpose of h264_deblock_ref.c. Same per-segment
kernel; pix[-4..+3] accesses cols instead of rows*stride.
- Test: test_api_h264 grows a test_deblock_h() with 8 tiles
(8 cols x 16 rows each, edge at col 4), random alpha/beta/tc0;
compares NEON dispatch against reference byte-for-byte.
Verified on hertz (Pi 5 / V3D 7.1):
$ ./build/test_api_h264
=== Phase 8a API smoke: H.264 kernels via recipe dispatch ===
H264_IDCT4 recipe substrate: 2 (1=CPU, 2=QPU)
H264_IDCT8 recipe substrate: 2
H264_DEBLOCK_LV recipe substrate: 2
H264_QPEL_MC20 recipe substrate: 2
H264_DEBLOCK_LH recipe substrate: 1 (CPU, no QPU H shader yet)
H.264 IDCT 4x4: 2048/2048 bytes bit-exact (100.0000%)
H.264 IDCT 8x8: 2048/2048 bytes bit-exact (100.0000%)
H.264 deblock luma v: 2048/2048 bytes bit-exact (100.0000%)
H.264 deblock luma h: 1024/1024 bytes bit-exact (100.0000%)
H.264 qpel mc20: 1024/1024 bytes bit-exact (100.0000%)
All 5 kernels bit-exact PASS. The new H variant joins the suite
with 1024 random-input bytes per tile x 8 tiles.
Why CPU-only for now: the daedalus-decoder downstream needs the H
edge dispatched somewhere — even at CPU NEON cost (~6 ns/edge per
the cycle 8 M3 baseline) a frame's worth at 1080p is
~ 8160 MBs * 4 edges = 32 640 edges = ~200 us — well inside the
30 fps budget. Writing the V3D H-edge shader is a follow-up
(would be cycle 8' or similar; the V-edge shader's transpose isn't
mechanical because of how the workgroup organisation maps to columns
vs rows).
Backlog addition (out of scope for this PR):
- V3D shader for the H variant (mirror of v3d_h264deblock.spv).
- bS=4 intra-strength filter (different algebra; both _v and _h).
- Chroma deblock luma_v/_h (8-cell variants).
|
||
|
|
79553c6e22 |
cycle 9: V3D shader for H.264 luma qpel mc20 — 9/9 QPU coverage
Closes the QPU-default substrate campaign per the 2026-05-23
decree: every daedalus-fourier kernel that can be done in QPU
is now done in QPU. Cycle 9 is the last piece — 6-tap horizontal
half-pel luma motion compensation, H.264 §8.4.2.2.1.
Shader (src/v3d_h264_qpel_mc20.comp):
- local_size = 64, 1 lane per output pixel of one 8x8 block,
1 block per workgroup. Simplest layout that avoids any
inter-lane communication — V3D's L2 cache handles the
redundant reads from adjacent lanes computing adjacent
output columns.
- Per-pixel: read 6 src samples (cols c-2..c+3 in row r),
apply the (1, -5, 20, 20, -5, 1) / 32 filter with +16
rounding, clip to u8, write one dst byte.
- Single-stride convention matches FFmpeg's H264QpelContext
(dst and src share `stride`; src+src_off points at output
col 0 with the caller-guaranteed -2/+3 padding).
Dispatch wiring (src/daedalus_core.c):
- h264_qpel_mc20_pipe field on daedalus_ctx, lazy init.
- dispatch_h264_qpel_mc20_qpu(): 3 SSBOs (src / dst / meta),
src_max = src_off + 7*stride + 11 (covers the +3-col read
footprint on the last row), dst_max = dst_off + 7*stride + 8.
1 block per WG.
- daedalus_dispatch_h264_qpel_mc20() replaces ROUTE_CPU_ONLY
with the substrate-switch pattern matching the other H.264
kernels.
- Recipe table: H264_QPEL_MC20 returns SUBSTRATE_QPU.
Verification (hertz, Pi 5, V3D 7.1):
$ ./test_api_h264
=== Phase 8a API smoke: H.264 kernels via recipe dispatch ===
H264_IDCT4 recipe substrate: 2 (1=CPU, 2=QPU)
H264_IDCT8 recipe substrate: 2
H264_DEBLOCK_LV recipe substrate: 2
H264_QPEL_MC20 recipe substrate: 2 ← flipped
H.264 IDCT 4x4: 2048/2048 bytes bit-exact
H.264 IDCT 8x8: 2048/2048 bytes bit-exact
H.264 deblock luma v: 2048/2048 bytes bit-exact
H.264 qpel mc20: 1024/1024 bytes bit-exact ← QPU
First-iteration result was 1017/1024 (99.32%) — off-by-7 traced
to undersizing src_max in the host wrapper. The filter reads
src_off + 7*stride + (7 + 3) = +10 at the last row last column;
add 1 for memcpy's [0..N-1] semantic → 11. Fixed in the same
patch.
All 9 daedalus-fourier cycles now QPU-by-recipe:
cycle 1 VP9 IDCT 8x8 QPU
cycle 2 VP9 LPF wd=4 QPU
cycle 3 VP9 MC 8h QPU
cycle 4 VP9 LPF wd=8 QPU
cycle 5 AV1 CDEF 8x8 QPU
cycle 6 H.264 IDCT 4x4 QPU
cycle 7 H.264 IDCT 8x8 QPU
cycle 8 H.264 deblock luma-v QPU
cycle 9 H.264 qpel mc20 QPU ← this commit
Closes daedalus-fourier task #165. Per the decree memory
[QPU is default substrate], the prototype now demonstrates GPU
acceleration on every measured kernel.
|
||
|
|
a092ee34aa |
Merge pull request 'QPU is default substrate: recipe table + ctx env-var override' (#7) from noether/qpu-default-recipe-cycles-5-8 into main
Reviewed-on: #7 |
||
|
|
74687d9def |
cycle 7: V3D shader for H.264 IDCT 8x8
Mirrors cycle 6 (PR #7 prior commit) but at 8x8 scale: 8 lanes per
block, 8 blocks per WG. H.264 §8.5.13.2 1D butterfly twice (row
pass, column pass), (val + 32) >> 6 rounded + clipped + added to
dst.
Bit-exact first try on hertz (Pi 5, V3D 7.1):
H264_IDCT4 recipe substrate: 2 (QPU)
H264_IDCT8 recipe substrate: 2 (QPU) ← flipped
H264_DEBLOCK_LV recipe substrate: 2 (QPU)
H264_QPEL_MC20 recipe substrate: 1 (CPU) ← task #165 remaining
H.264 IDCT 4x4: 2048/2048 bytes bit-exact
H.264 IDCT 8x8: 2048/2048 bytes bit-exact ← QPU
H.264 deblock luma v: 2048/2048 bytes bit-exact
H.264 qpel mc20: 1024/1024 bytes bit-exact
8 of 9 daedalus-fourier cycles now QPU-by-recipe. Only cycle 9
(H.264 luma qpel mc20) still CPU — different shape (6-tap MC
filter, not a transform) so needs its own shader template; task
#165 covers it as a follow-on.
Same pattern as cycle 6 commit (
|
||
|
|
65bd5c3fe3 |
cycle 6: V3D shader for H.264 IDCT 4x4 (first cycle-6 QPU dispatch)
Per the QPU-default substrate decree 2026-05-23, cycle 6 (H.264
IDCT 4x4 + add) was the highest-priority H.264 kernel to flip
from NEON-only to QPU-capable. The same shape as VP9 IDCT 8x8
(cycle 1) — two-pass butterfly with shared-memory transpose —
but at 4x4 scale: 4 lanes per block, 16 blocks per WG.
What's added:
- src/v3d_h264_idct4.comp: GLSL compute shader implementing
the H.264 §8.5.12.1 1D butterfly twice (row pass then column
pass), with (val + 32) >> 6 rounding and clip-to-u8 add to
dst. Block memory layout is column-major (matches FFmpeg
`ff_h264_idct_add_neon` convention).
- CMakeLists: glslang rule + install entry for v3d_h264_idct4.spv.
- dispatch_h264_idct4_qpu() in daedalus_core.c: lazy pipeline
init, 3 SSBOs (coeffs / dst / meta as uvec4), push-constant
(n_blocks, dst_stride), 16 blocks per WG dispatch. Matches
the existing dispatch_*_qpu patterns; uses
v3d_runner_create_buffer / destroy_buffer (will swap to
pool API once PR #6 lands).
- daedalus_dispatch_h264_idct4() replaces ROUTE_CPU_ONLY with
the same CPU/QPU substrate switch the deblock dispatch uses.
- daedalus_recipe_substrate_for(H264_IDCT4) returns QPU now
that the shader exists.
Verification on hertz (Pi 5 + V3D 7.1):
$ ./test_api_h264
=== Phase 8a API smoke: H.264 kernels via recipe dispatch ===
H264_IDCT4 recipe substrate: 2 (1=CPU, 2=QPU)
H264_IDCT8 recipe substrate: 1
H264_DEBLOCK_LV recipe substrate: 2
H264_QPEL_MC20 recipe substrate: 1
H.264 IDCT 4x4: 2048/2048 bytes bit-exact (100.0000%) ← QPU
H.264 IDCT 8x8: 2048/2048 bytes bit-exact
H.264 deblock luma v: 2048/2048 bytes bit-exact
H.264 qpel mc20: 1024/1024 bytes bit-exact
The AUTO-substrate path now picks QPU for H.264 IDCT 4x4, and
the output is bit-exact against the C reference (which is
identical to the NEON .S code by construction — same FFmpeg
upstream).
Remaining cycle-6/7/9 work in task #165:
- cycle 7: H.264 IDCT 8x8 (template same shape; 8 lanes per
block, fewer blocks per WG)
- cycle 9: H.264 luma qpel mc20 (different shape — 6-tap MC
not a transform)
This commit lands the cycle-6 piece of task #165.
|
||
|
|
737e87980d |
QPU is default substrate: recipe table + ctx env-var override
Per the user decree 2026-05-23 — "what can be done in QPU will
be done in QPU" — this lands two coupled changes that flip
production-decode kernels with existing V3D shaders from
CPU-by-recipe to QPU-by-recipe:
1) daedalus_recipe_substrate_for() returns SUBSTRATE_QPU for
every kernel that has a shipped V3D compute shader:
cycle 1 VP9 IDCT 8x8 QPU (was QPU; unchanged)
cycle 2 VP9 LPF wd=4 QPU (was QPU; unchanged)
cycle 3 VP9 MC 8h QPU (FLIPPED from CPU — v3d_mc_8h.spv)
cycle 4 VP9 LPF wd=8 QPU (was QPU; unchanged)
cycle 5 AV1 CDEF 8x8 QPU (FLIPPED from CPU — v3d_cdef.spv)
cycle 6 H.264 IDCT 4x4 CPU (no shader yet; task #165)
cycle 7 H.264 IDCT 8x8 CPU (no shader yet; task #165)
cycle 8 H.264 deblock luma-v QPU (FLIPPED from CPU — v3d_h264deblock.spv)
cycle 9 H.264 qpel mc20 CPU (no shader yet; task #165)
The R-band cost/benefit framework still applies but is now
superseded for substrate selection by the decree. Where R
stays RED, the cost is in dispatch overhead, which is a
fixable engineering issue (tasks 160 buffer-pool, 161
persistent cmdbuf, 162 dmabuf import).
2) daedalus_ctx_create_no_qpu() now honours an env-var override:
set DAEDALUS_FORCE_QPU=1 in the process and create_no_qpu
silently escalates to a full daedalus_ctx_create(). Lets
the libavcodec substitution shims in marfrit-packages (which
pthread_once a create_no_qpu ctx — see
libavcodec/aarch64/h264_idct_daedalus.c) fire QPU paths
without rebuilding those patches.
Firefox / mpv consumers stay on the Vulkan-free path by
default (env var unset). The daedalus-v4l2 daemon will set
DAEDALUS_FORCE_QPU=1 explicitly before dlopen'ing libavcodec
(separate daedalus-v4l2 follow-up).
Smoke (hertz, Pi 5, kernel 6.18.29):
=== test_api_h264 ===
H264_IDCT4 recipe substrate: 1 (1=CPU, 2=QPU)
H264_IDCT8 recipe substrate: 1
H264_DEBLOCK_LV recipe substrate: 2 ← flipped
H264_QPEL_MC20 recipe substrate: 1
H.264 IDCT 4x4: 2048/2048 bytes bit-exact
H.264 IDCT 8x8: 2048/2048 bytes bit-exact
H.264 deblock luma v: 2048/2048 bytes bit-exact ← QPU path
H.264 qpel mc20: 1024/1024 bytes bit-exact
=== test_api_idct === all substrates (CPU/QPU/AUTO) bit-exact
=== test_api_lpf === all substrates bit-exact wd=4 and wd=8
The dispatch wrapper's fall-through logic (eff == SUBSTRATE_QPU
&& !ctx_has_qpu(ctx) → eff = SUBSTRATE_CPU) handles the case
where the recipe says QPU but the consumer didn't opt in — it
falls back to CPU silently, no regression.
Closes daedalus-fourier tasks #163, #164.
Refs the 2026-05-23 "QPU default substrate" decree.
|
||
|
|
98553278dd |
v3d_runner: persistent per-pipeline command buffer
Phase 2 of the QPU-default substrate campaign — eliminate
vkAllocateCommandBuffers from the dispatch hot path.
Attaches a VkCommandBuffer to each v3d_pipeline, allocated once in
v3d_runner_create_pipeline() and freed in destroy_pipeline(). The
five dispatch_*_qpu sites switch from v3d_runner_alloc_cmdbuf() to
v3d_runner_pipeline_cmdbuf_reset() — vkResetCommandBuffer is O(1)
versus the driver-side allocation walk. Pool was already created
with VK_COMMAND_POOL_CREATE_RESET_COMMAND_BUFFER_BIT so reset is
permitted.
Microbench (hertz, Pi 5, kernel 6.18.29, V3D 7.1):
before (task 160 pool only):
steady-state p50: 76.44 us
steady-state mean: 77.95 us
after (task 160 pool + task 161 persistent cb):
steady-state p50: 54.56 us
steady-state mean: 56.00 us
-> 28% per-dispatch reduction
The remaining ~54 us steady-state is dominated by vkQueueWaitIdle +
shader execution + the two memcpy(in/out) on the dst buffer — task
162 (dmabuf import for dst) targets the memcpy half.
test_api_idct stays bit-exact across CPU/QPU/AUTO substrates.
Refs daedalus-fourier task #161.
|
||
|
|
0a042a8e95 |
v3d_runner: buffer pool for QPU dispatch hot path
Per the QPU-default substrate decree 2026-05-23: the per-dispatch
vkAllocateMemory in dispatch_*_qpu was the biggest single fixable
contributor to QPU dispatch overhead. This pools v3d_buffer
allocations by power-of-2 size class so the second-and-subsequent
dispatch hits a freelist instead of paying ~10-50us of Mesa-V3D7
memory-allocation cost per call.
API additions (v3d_runner.h):
- v3d_runner_acquire_buffer(): pulls from per-bucket freelist;
falls through to v3d_runner_create_buffer() on miss.
- v3d_runner_release_buffer(): pushes back onto the freelist; the
backing VkBuffer/VkDeviceMemory only get vkFreeMemory'd in
v3d_runner_destroy().
- v3d_runner_pool_total_bytes(): diagnostic watermark.
Size classes 2^8..2^23 (256 B to 8 MiB). Oversize requests fall
through to non-pooled (vkAllocateMemory) for both ends — pool stays
correct, just degenerates to old behaviour for those calls.
Migration: daedalus_core.c dispatch_*_qpu paths globally swap
create_buffer → acquire_buffer and destroy_buffer → release_buffer.
All five QPU dispatch functions (idct8 / lpf / mc_8h / cdef /
h264_deblock) now reuse buffers across calls. test_api_idct stays
bit-exact (4096/4096 bytes on CPU/QPU/AUTO substrates on hertz).
Microbench (tests/bench_pool_overhead.c) on hertz (Pi 5,
6.18.29+rpt-rpi-2712, V3D 7.1):
call 0: 434.89 us (cold — 3x vkAllocateMemory)
call 1: 100.06 us (pool hit on all 3 buffers)
steady-state:
p50: 76.44 us
p90: 90.52 us
mean: 77.95 us
first-call / steady-state ratio: 5.7x
The remaining ~76us steady-state is dominated by vkQueueWaitIdle +
shader execution + per-call descriptor-set update + command-buffer
allocation — addressed in follow-on tasks 161 (persistent cmdbuf)
and 162 (dmabuf import for dst, eliminates memcpy in/out).
Refs daedalus-fourier task #160.
|
||
|
|
8fdef27a7d |
Phase 8c: H.264 luma qpel mc20 through public API
Extends daedalus-fourier with daedalus_recipe_dispatch_h264_qpel_mc20
so libavcodec.so can route H264QpelContext.put_h264_qpel_pixels_tab[1][2]
through the recipe layer instead of ff_put_h264_qpel8_mc20_neon directly.
API additions (header + library):
- daedalus_h264_qpel_meta { dst_off, src_off }
- daedalus_dispatch_h264_qpel_mc20(ctx, sub, dst, src, stride,
n_blocks, meta)
- daedalus_recipe_dispatch_h264_qpel_mc20(...) (AUTO wrapper)
- DAEDALUS_KERNEL_H264_QPEL_MC20 = 9 in the recipe-query enum
- daedalus_recipe_substrate_for() returns CPU NEON for cycle 9
The 6-tap horizontal half-pel filter signature matches FFmpeg's
H264QpelContext convention exactly: dst and src share a single stride
and src already points at output column 0 (filter reads cols -2..+3).
Single-stride API to make the marfrit-packages FFmpeg shim a
straight pointer-pass; no buffer rearrangement.
Verdict per docs/k9_h264qpel_mc20.md: CPU NEON. Per-block 7.6 ns
gives 135x margin over 30 fps 1080p; QPU dispatch floor at ~250 ns
makes any V3D shader strictly worse. Recipe table reflects that —
the recipe_dispatch entry is a one-line forward to the CPU path.
CMakeLists changes:
- h264qpel_neon.S added to the daedalus_core static lib (only the
bench targets owned it before; now the public API needs it too)
- tests/h264_qpel8_mc20_ref.c added to the test_api_h264 target
Phase 8a/8b smoke gains a 4th case (test_qpel_mc20): 1024/1024
bytes bit-exact via daedalus_recipe_dispatch_h264_qpel_mc20.
Refs reauktion/daedalus-v4l2#11 — substitution arc step 2 cycle 9.
|
||
|
|
0a99b16489 |
Phase 8b: opportunistic QPU paths through public API
Wires QPU dispatch for cycles 3 (VP9 MC), 5 (AV1 CDEF), 8 (H.264
deblock) through the public API. These three kernels have recipe
substrate = CPU, but per Issue 003 the mixed-kernel helper value
is real — the dispatch path must exist so override-mode callers
can request QPU on the side.
Pattern mirrors dispatch_idct8_qpu (lazy pipeline + per-call SSBO
alloc + memcpy + dispatch + readback). Each kernel has its own
push-constant struct (mc_pc 3-field, cdef_pc 3-field, deblock_pc
2-field shared with lpf).
Notable bug caught + fixed in test_api_opportunistic_qpu: the
initial dispatch_mc_8h_qpu sized src_max using CPU-side reach
(src_off + 3 + 8 + 7*stride), but the QPU shader reads src[
src_off + row*stride + 0..14] for row=0..7. Last block had 3
uninitialized bytes → 99.8% match → 100% after fix.
After this commit, the public API surface fully covers cycles 1-8:
Cycle 1 (IDCT 8x8): CPU + QPU + AUTO bit-exact
Cycle 2 (LPF wd=4): CPU + QPU + AUTO bit-exact
Cycle 3 (MC 8h): CPU recipe; QPU override bit-exact
Cycle 4 (LPF wd=8): CPU + QPU + AUTO bit-exact
Cycle 5 (CDEF): CPU recipe; QPU override (untested in this
test — bench_v3d_cdef is the authoritative 3-way M1)
Cycle 6 (H.264 IDCT 4x4): CPU only (no QPU shader by recipe)
Cycle 7 (H.264 IDCT 8x8): CPU only
Cycle 8 (H.264 deblock luma-v): CPU recipe; QPU override bit-exact
Tests: test_api_opportunistic_qpu adds CPU-vs-QPU bit-exact
comparison for VP9 MC and H.264 deblock through the API.
test_api_idct, test_api_lpf, test_api_h264 still pass.
Per the locked Phase 8 architecture (project_phase8_architecture
memory): next session opens daedalus-v4l2 sibling repo with
Option B (kernel V4L2 shim + userspace daemon), Option γ (dlopen
FFmpeg parser).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
af8146a2cd |
Phase 8a: H.264 kernels through public API
Extends include/daedalus.h with cycles 6, 7, 8 (H.264 IDCT 4x4, IDCT 8x8, luma deblock luma-v). All recipe-substrate = CPU (matches per-cycle Phase 7 verdicts). src/daedalus_core.c: NEON-path implementations + recipe routing. daedalus_core library now links the full FFmpeg H.264 NEON snapshot (h264idct + h264dsp) plus existing VP9 + dav1d. tests/test_api_h264.c: smoke test covering all 3 H.264 kernels via daedalus_recipe_dispatch_*. All pass 2048/2048 bit-exact. Public API coverage after this commit: - Cycles 1 IDCT 8x8 + 2 LPF4 + 4 LPF8: CPU+QPU+AUTO dispatch (test_api_idct, test_api_lpf, both pass) - Cycle 3 MC 8h: CPU only (QPU dispatch stub returns -1) - Cycle 5 CDEF: CPU only (QPU stub) - Cycle 6 H.264 IDCT 4x4: CPU only (recipe + only NEON wired) - Cycle 7 H.264 IDCT 8x8: CPU only - Cycle 8 H.264 deblock: CPU only (QPU opportunistic — not wired through API yet; bench_v3d_h264deblock exists for direct test) Next Phase 8 sub-step: wire opportunistic QPU dispatch for cycles 3+5+8 through the API (so override-mode users can request QPU). Then surface V4L2-wrapper architecture decisions to user. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
eb5cfb34c4 |
Phase 8: wire LPF wd=4 + wd=8 QPU through public API
Mirror the IDCT pattern (lazy pipeline + per-call SSBO alloc + dispatch + readback) for cycles 2 (LPF wd=4) and 4 (LPF wd=8). Important caught-empirically bug: the two LPF shaders disagree on push-constant slot order — wd=4 puts dst_stride_u8 at slot 1, wd=8 puts it at slot 2 (with unused blocks_per_row at slot 1). Initial single-struct attempt silently corrupted wd=8 output (1958/2048 = 95.6 % bit-exact on test_api_lpf). Fixed by keeping separate lpf4_pc and lpf8_pc struct definitions. dst-window calc handles both kernels (same -4..+3 byte footprint per row). test_api_lpf exercises both kernels in CPU / QPU / AUTO modes against the C reference. All 6 mode/kernel combinations pass 2048/2048 bit-exact (32 edges × 8 rows × 8 bytes/edge). Phase 8 status after this commit: 3 of 5 kernels wired through API for QPU dispatch (IDCT, LPF wd=4, LPF wd=8 — i.e., all 3 QPU-default kernels per recipe). Cycle 3 MC and cycle 5 CDEF still need wiring for opportunistic-override mode but aren't needed for recipe-AUTO path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
1085c5699c |
Phase 8: wire IDCT QPU dispatch through public API
daedalus_ctx now owns a v3d_runner when V3D is available. The public API's dispatch_vp9_idct8 routes QPU calls through a new dispatch_idct8_qpu helper that: (1) lazy-creates the cycle 1 v4 pipeline on first use, (2) allocates 3 host-visible SSBOs per call (coeffs/dst/meta), (3) memcpy host->GPU, (4) dispatch with the v4 32-blocks-per-WG geometry, (5) memcpy GPU->host. Per-call alloc is intentional for Phase 8 correctness-first scope; buffer-pool perf optimization is deferred. Added daedalus_ctx_create_no_qpu() for fast-path callers that know they want CPU only. test_api_idct extended to a 3-mode matrix: CPU forced, QPU forced, AUTO recipe. All three deliver 4096/4096 bit-exact on hertz with V3D 7.1.7.0: recipe substrate for VP9_IDCT8: 2 (QPU) [CPU] 4096/4096 bit-exact [QPU] 4096/4096 bit-exact (real QPU dispatch through the API) [AUTO] 4096/4096 bit-exact (recipe routes to QPU) Next Phase 8 sub-step: same wiring pattern for cycle 2 LPF wd=4 and cycle 4 LPF wd=8 (the other two recipe-QPU kernels). Cycle 3 MC and cycle 5 CDEF only need the dispatch hook (recipe routes to CPU; QPU stays opportunistic via explicit override). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
760f6a4060 |
Phase 8 skeleton: public C API + first end-to-end smoke test
include/daedalus.h: stable C API surface exposing the 5 cycles (VP9 IDCT 8x8, LPF wd=4, MC 8h, LPF wd=8; AV1 CDEF). Per-kernel recipe-dispatch helpers default to the cycle 1-5 verdict substrate (QPU for cycles 1+2+4, CPU for cycles 3+5); explicit override available for benchmarking and runtime-aware scheduling. src/daedalus_core.c: NEON-path implementation of all 5 kernels wrapped behind the public API. QPU path stubbed out (returns -1) since wiring v3d_runner into daedalus_ctx is the next Phase 8 sub-step; with has_qpu=0 the recipe falls back to CPU cleanly. tests/test_api_idct.c: 64-block IDCT through the public recipe dispatch, bit-exact vs C ref. PASS 4096/4096 bytes — proves the API surface compiles, library links, dispatch routing works, and NEON fallback delivers correct results. docs/phase8_scoping.md: architecture options (A=userspace V4L2, B=kernel V4L2 shim, C=direct libva); pick A for v1; explicitly out-of-scope work tracked. Next Phase 8 sub-step: wire v3d_runner into daedalus_ctx so has_qpu=1 and QPU dispatch goes through the API too. After that: V4L2 ioctl glue, bitstream parser, superblock loop. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |