8bc6d27ea7062e0496cc593d6762b3604853825f
29 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
01f782cfaf |
h264: qpel avg — 12 remaining variants (closes the matrix)
Closes the H.264 8x8 qpel buildout. Adds the remaining 12 avg_
biprediction positions:
4 quarter-axis: avg_mc{10,30,01,03}
8 diagonals : avg_mc{11,12,13,21,23,31,32,33}
Each follows the established pattern: same half-pel formula as the
put_ sibling, then L2 average with the existing dst contents per
H.264 §8.4.2.3.1.
Scope:
- 12 new kernel enums (MC10..MC33 avg_ = 34..45) → CPU.
- 12 NEON externs for the vendored ff_avg_h264_qpel8_mc*_neon.
- 12 CPU dispatches via existing DEFINE_QPEL_CPU_DISPATCH macro.
- 12 public dispatches via DEFINE_QPEL_DISPATCH macro.
- 12 recipe wrappers via DEFINE_QPEL_RECIPE macro.
- 12 header decls via DECLARE_QPEL_AVG macro.
- tests/h264_qpel8_avg_rest_ref.c — references via two parametric
macros: DEFINE_AVG_QUARTER for the 4 ¼-pel L2 forms,
DEFINE_AVG_DIAG for the 8 two-half-pel-avg forms.
- Test harness extended with a RUN(MC) sub-macro that derives both
the ref name and dispatch name from the bare mcXX. (The ref
is daedalus_avg_h264_qpel8_<mc>_ref; the dispatch is
daedalus_recipe_dispatch_h264_qpel_avg_<mc>. Macro had a typo
on first try that duplicated "avg_" in the ref name — caught at
compile, fixed.)
Verified on hertz:
$ ./build/test_api_h264 | tail -12
H.264 qpel avg_mc10: 2048/2048 bytes bit-exact (100.0000%)
H.264 qpel avg_mc30: 2048/2048 bytes bit-exact (100.0000%)
H.264 qpel avg_mc01: 2048/2048 bytes bit-exact (100.0000%)
H.264 qpel avg_mc03: 2048/2048 bytes bit-exact (100.0000%)
H.264 qpel avg_mc11: 2048/2048 bytes bit-exact (100.0000%)
H.264 qpel avg_mc12: 2048/2048 bytes bit-exact (100.0000%)
H.264 qpel avg_mc13: 2048/2048 bytes bit-exact (100.0000%)
H.264 qpel avg_mc21: 2048/2048 bytes bit-exact (100.0000%)
H.264 qpel avg_mc23: 2048/2048 bytes bit-exact (100.0000%)
H.264 qpel avg_mc31: 2048/2048 bytes bit-exact (100.0000%)
H.264 qpel avg_mc32: 2048/2048 bytes bit-exact (100.0000%)
H.264 qpel avg_mc33: 2048/2048 bytes bit-exact (100.0000%)
All 12 new positions bit-exact PASS first try.
Final qpel matrix state:
put_: mc00 (none — integer copy)
mc01 ✓ mc02 ✓ mc03 ✓
mc10 ✓ mc11 ✓ mc12 ✓ mc13 ✓
mc20 ✓ (QPU+CPU) mc21 ✓ mc22 ✓ mc23 ✓
mc30 ✓ mc31 ✓ mc32 ✓ mc33 ✓
avg_: same 15-of-16 coverage, all CPU.
Every B-slice biprediction case the libavcodec intercept can throw
at us is now serviceable. QPU shaders remain mc20-only (cycle 9);
the other 29 positions are CPU NEON. Whether to write more QPU
shaders depends on real perf measurement — at NEON ~10 ns per
8x8 block, full qpel coverage at 1080p is ~2-3 ms of total work,
well inside budget.
|
||
|
|
1113953f97 |
h264: qpel avg anchors (avg_mc20/02/22, biprediction support)
Begins the avg_ qpel buildout for B-slice biprediction. Each avg_
form computes the same half-pel formula as its put_ sibling, then
L2-averages the result with the existing dst contents — the caller
pre-loads dst with the list0 prediction; the avg_ call adds list1
per H.264 §8.4.2.3.1.
Scope (3 anchors, sets the pattern for the remaining 13 avg_
variants):
- 3 new kernel enums (AVG_MC20=31, AVG_MC02=32, AVG_MC22=33) → CPU.
- 3 NEON externs for the vendored ff_avg_h264_qpel8_{mc20,mc02,mc22}_neon.
- 3 CPU dispatches via existing DEFINE_QPEL_CPU_DISPATCH macro
(the macro is type-agnostic so it didn't need changes for avg_).
- 3 public dispatches via DEFINE_QPEL_DISPATCH macro.
- 3 recipe wrappers via DEFINE_QPEL_RECIPE macro.
- tests/h264_qpel8_avg_anchors_ref.c — per-cell helpers + L2 avg.
- Test harness: run_avg_qpel() seeds dst with random content so
the L2 averaging is actually exercised (not just put_-style
overwrite that would silently pass).
Verified on hertz:
$ ./build/test_api_h264 | tail -3
H.264 qpel avg_mc20: 2048/2048 bytes bit-exact (100.0000%)
H.264 qpel avg_mc02: 2048/2048 bytes bit-exact (100.0000%)
H.264 qpel avg_mc22: 2048/2048 bytes bit-exact (100.0000%)
All 3 anchors bit-exact PASS first try.
Why anchors only in this PR: the avg_ pattern is uniform across all
16 positions (each is just "put_ result + L2 with dst"). Landing
the anchors first confirms the macro pattern works for both put_
and avg_; the remaining 13 (avg_mc10/30/01/03 + avg_mc11..33) follow
the same template in a follow-up PR.
State of the qpel matrix after this PR:
put_ : 15 of 16 positions ✓ (mc00 is integer copy, no wrapper)
avg_ : 3 of 16 positions ✓ (mc20, mc02, mc22 anchors)
13 follow-up positions
|
||
|
|
0894a46114 |
h264: qpel diagonals — 8 positions (mc11/12/13/21/23/31/32/33)
Closes the qpel buildout. All 8 remaining diagonal positions land
in one PR. Each is the rounded average of two half-pel intermediates
per H.264 §8.4.2.2.1 / Table 8-4, with the decomposition matching
the FFmpeg .S reference structure (verified by reading
external/ffmpeg-snapshot/.../h264qpel_neon.S lines 622-758).
Decomposition table (the formula for each output cell at (r,c)):
mc11 ¼¼ : avg(mc20[r, c], mc02[r, c])
mc12 ¼½ : avg(mc22[r, c], mc02[r, c])
mc13 ¼¾ : avg(mc20[r+1, c], mc02[r, c])
mc21 ½¼ : avg(mc22[r, c], mc20[r, c])
mc23 ½¾ : avg(mc22[r, c], mc20[r+1, c])
mc31 ¾¼ : avg(mc20[r, c], mc02[r, c+1])
mc32 ¾½ : avg(mc22[r, c], mc02[r, c+1])
mc33 ¾¾ : avg(mc20[r+1, c], mc02[r, c+1])
The (r±1, c±1) offsets capture the position-dependent shift that
the FFmpeg .S encodes by pre-incrementing x1 (src pointer) before
branching into the common mc11/mc21 code paths.
Scope (tightly macro-ised):
- 8 new kernel enums (MC11..MC33 = 23..30) → CPU.
- 8 NEON externs for the vendored ff_put_h264_qpel8_mc*_neon.
- 8 CPU dispatches via existing DEFINE_QPEL_CPU_DISPATCH macro.
- 8 public dispatches via DEFINE_QPEL_DISPATCH macro.
- 8 recipe wrappers via DEFINE_QPEL_RECIPE macro.
- Header decls condensed via a DECLARE_QPEL_DIAG macro that
expands to both recipe + dispatch decls per name.
- C references via DEFINE_DIAG_REF macro: each ref is a 6-line
wrapper around the per-cell hpel_h / hpel_v / hpel_hv helpers
(the latter being the per-cell version of mc22's 13-row int16
tmp[] computation).
- Test wrapper: test_qpel_diag_all() drives all 8 through the
existing run_quarter_axis_qpel() harness.
Verified on hertz (Pi 5 / V3D 7.1):
$ ./build/test_api_h264 | tail -8
H.264 qpel mc11: 2048/2048 bytes bit-exact (100.0000%)
H.264 qpel mc12: 2048/2048 bytes bit-exact (100.0000%)
H.264 qpel mc13: 2048/2048 bytes bit-exact (100.0000%)
H.264 qpel mc21: 2048/2048 bytes bit-exact (100.0000%)
H.264 qpel mc23: 2048/2048 bytes bit-exact (100.0000%)
H.264 qpel mc31: 2048/2048 bytes bit-exact (100.0000%)
H.264 qpel mc32: 2048/2048 bytes bit-exact (100.0000%)
H.264 qpel mc33: 2048/2048 bytes bit-exact (100.0000%)
ALL 8 diagonal positions bit-exact PASS first try. Meaningful
because the position-dependent (r±1, c±1) source offsets are easy
to get wrong by transcription, and any of them would surface on
random inputs immediately.
After this PR the H.264 qpel 8x8 put_ matrix is complete:
mc00 mc01 mc02 mc03
mc10 mc11 mc12 mc13
mc20 mc21 mc22 mc23
mc30 mc31 mc32 mc33
15 of 16 positions exposed through the daedalus API; mc00 is just
integer copy and rarely needs a dispatch wrapper (libavcodec sets
the function pointer table directly). mc20 retains its QPU shader
(cycle 9 / v3d_h264_qpel_mc20.spv); all other 14 are CPU NEON.
What this does NOT cover (still in backlog):
- avg_ variants (the "add" form for biprediction, 16 more
positions). Currently the API only exposes put_.
- 16x16 qpel (separate function family in FFmpeg; the 8x8 path
can be used twice to substitute when 16x16 isn't critical).
- QPU shaders for any qpel position other than mc20.
|
||
|
|
e01f7bc7c6 |
h264: qpel single-axis quarter-pel — mc10/mc30/mc01/mc03 (CPU/NEON)
Closes the 4 single-axis quarter-pel positions in one PR. Each is
a half-pel lowpass clipped to u8 followed by L2 rounded-average
with an integer-aligned source pixel per H.264 §8.4.2.2.1:
mc10 ¼-H ("a" pos): clip255(mc20(s)) avg src[r,c]
mc30 ¾-H ("c" pos): clip255(mc20(s)) avg src[r,c+1]
mc01 ¼-V ("d" pos): clip255(mc02(s)) avg src[r,c]
mc03 ¾-V ("n" pos): clip255(mc02(s)) avg src[r+1,c]
The mc10/mc30 pair and mc01/mc03 pair only differ in WHICH integer
source pixel they average with — the half-pel computation is the
same. Putting them in one PR is justified by that uniformity.
Scope:
- 4 new kernel enums: MC10=19, MC30=20, MC01=21, MC03=22 → CPU.
- 4 NEON externs for the vendored ff_put_h264_qpel8_mc{10,30,01,03}_neon.
- 4 CPU dispatch wrappers via DEFINE_QPEL_CPU_DISPATCH macro
(collapses ~50 LOC of repetition).
- 4 public dispatch fns via DEFINE_QPEL_DISPATCH macro.
- 4 recipe wrappers via DEFINE_QPEL_RECIPE macro.
- tests/h264_qpel8_quarter_axis_ref.c covers all four via shared
hpel_h() / hpel_v() inlines + per-mode L2 average.
- Test refactor: generic run_quarter_axis_qpel() harness exercises
all 4 positions through a single helper (~50 LOC for 4 tests vs
~200 if each was hand-rolled).
Verified on hertz:
$ ./build/test_api_h264 | tail -8
H.264 deblock chroma h intra: 256/256 bytes bit-exact (100.0000%)
H.264 qpel mc20: 1024/1024 bytes bit-exact (100.0000%)
H.264 qpel mc02: 2048/2048 bytes bit-exact (100.0000%)
H.264 qpel mc22: 2048/2048 bytes bit-exact (100.0000%)
H.264 qpel mc10: 2048/2048 bytes bit-exact (100.0000%)
H.264 qpel mc30: 2048/2048 bytes bit-exact (100.0000%)
H.264 qpel mc01: 2048/2048 bytes bit-exact (100.0000%)
H.264 qpel mc03: 2048/2048 bytes bit-exact (100.0000%)
All 4 new positions bit-exact PASS first try.
Coverage matrix update:
put_ mc00 mc10 mc20 mc30
mc01 — ✓ — ✓
mc11 — — ✓ — ← this row
mc21 — — — —
mc31 — — — —
mc02 — — ✓ — ← mc02 + mc22 anchor
mc03 — — ✓ —
After this PR: 7 of 16 single-axis + diagonal positions done.
Remaining 9 are the off-axis quarter-pel combinations
(mc11/mc12/mc13/mc21/mc23/mc31/mc32/mc33) — each combines a 2D
lowpass intermediate with L2 averaging against a 1D-lowpass output.
Next PR scope.
Why no QPU shaders: same R-band logic as the prior CPU additions.
At ~10 ns per 8x8 NEON block, all 16 qpel positions together
would land in ~1.3 ms/frame at 1080p worst case — comfortably
inside the 33 ms budget. QPU shader for mc20 already exists
(cycle 9 / v3d_h264_qpel_mc20.spv); the other 15 follow once a
clear perf reason emerges.
|
||
|
|
20a4299c5c |
h264: qpel mc22 (2D half-pel, CPU/NEON)
Adds the "j position" 2D half-pel via cascaded H + V 6-tap lowpass
with intermediate 16-bit precision per H.264 §8.4.2.2.1. One of the
most common qpel positions in real H.264 streams — many encoders
emit 1/2-1/2 motion vectors as their best-RD choice.
Algorithmically distinct from the 1D mc20/mc02 siblings:
- Horizontal 6-tap produces 13 rows of int16 intermediate (no
per-stage clip/round — full precision retained).
- Vertical 6-tap on the intermediate, then +512 >> 10 (the
double-shift compensates for both 6-tap scalings) + clip255.
The intermediate-precision requirement means the C reference can't
just be "call mc20 then mc02" — that would double-clip and produce
the wrong result. The 13-row int16 tmp[] buffer is the central
invariant.
Scope (same pattern as mc02 PR #15):
- Public API: daedalus_dispatch_h264_qpel_mc22 + recipe wrapper.
- Internal: dispatch_h264_qpel_mc22_cpu calling
ff_put_h264_qpel8_mc22_neon.
- Recipe table: DAEDALUS_KERNEL_H264_QPEL_MC22 = 18 → CPU.
- C reference: tests/h264_qpel8_mc22_ref.c — explicit tmp[13][8]
int16 staging buffer; spec-derived shifts and rounding.
- Test: test_qpel_mc22 in test_api_h264, 8 tiles at 16×16 with
output positioned at (SRC_ROW=3, SRC_COL=3) so the kernel's
[-2 .. +10] read window stays in-tile.
Verified on hertz:
$ ./build/test_api_h264 | tail -5
H.264 deblock chroma v intra: 256/256 bytes bit-exact (100.0000%)
H.264 deblock chroma h intra: 256/256 bytes bit-exact (100.0000%)
H.264 qpel mc20: 1024/1024 bytes bit-exact (100.0000%)
H.264 qpel mc02: 2048/2048 bytes bit-exact (100.0000%)
H.264 qpel mc22: 2048/2048 bytes bit-exact (100.0000%)
All 13 H.264 kernels in api_smoke now bit-exact PASS.
mc22 being right first try is meaningful — the +512 >> 10 scaling
+ int16 intermediate sequence has multiple sign/shift/clip pitfalls
and any of them would surface on random inputs immediately.
Coverage matrix update:
put_ mc20 ✓ (QPU+CPU) put_ mc02 ✓ (CPU) put_ mc22 ✓ (CPU)
→ 12 single put_ positions still missing (¼/¾ + HV combos with
L2 averaging).
|
||
|
|
c3301b0c2e |
h264: qpel mc02 (vertical half-pel, CPU/NEON)
Mirror of cycle 9's mc20 transposed to vertical orientation. Wires
up the second qpel half-pel position via the vendored
ff_put_h264_qpel8_mc02_neon symbol, closes the "missing vertical
sibling" gap that mc20 left open since cycle 9.
Scope:
- Public API: daedalus_dispatch_h264_qpel_mc02 + recipe wrapper.
- Internal: dispatch_h264_qpel_mc02_cpu calling the NEON entry.
- Recipe table: DAEDALUS_KERNEL_H264_QPEL_MC02 = 17 → CPU.
Explicit SUBSTRATE_QPU returns -1 (no shader yet).
- C reference: tests/h264_qpel8_mc02_ref.c — vertical 6-tap
transpose of mc20 (reads src[(r±N)*stride + c] instead of
src[r*stride + c±N]).
- Test: test_qpel_mc02 in test_api_h264, 8 tiles × 16×16 cols
× 16 rows, random input, bit-exact compare against the C ref.
Verified on hertz:
$ ./build/test_api_h264
...
H.264 qpel mc20: 1024/1024 bytes bit-exact (100.0000%)
H.264 qpel mc02: 2048/2048 bytes bit-exact (100.0000%)
All 12 H.264 kernels in the api_smoke now bit-exact PASS.
Why CPU-only: same R-band logic as the deblock _h sibling pattern.
mc02 at ~7.6 ns per 8x8 block on NEON (per the cycle 9 baseline
measurements) gives ~700 us for 8160 MBs × 4 8x8 luma blocks at
1080p — comfortably inside the 33 ms budget. QPU shader is a
fast-follow once the V vs H shader work is consolidated (the
transpose for the V shader is not mechanical — different SIMD
access pattern than the H shader).
Coverage matrix update:
qpel position put_ status avg_ status
------------- ----------- -----------
mc00 (copy) not wired not wired
mc10 (¼-H) not wired not wired
mc20 (½-H) ✓ QPU+CPU not wired
mc30 (¾-H) not wired not wired
mc01 (¼-V) not wired not wired
mc02 (½-V) ✓ CPU not wired (this PR)
mc03 (¾-V) not wired not wired
mc11..mc33 not wired not wired
13 more qpel positions to go for the full put_ matrix. Adding them
follows the same template; each is a small contained PR.
|
||
|
|
9b1c106dc5 |
h264: deblock bS=4 intra variants (luma + chroma, V + H)
Closes the deblock matrix: adds the four bS=4 intra-strength loop
filters used at I-MB edges (and other boundaries where H.264
§8.7.2.1 forces boundary strength to 4). After this PR fourier
covers all 8 standard 8-bit 4:2:0 deblock combinations:
bS<4 bS=4
----- -----
luma_v ✓ (cycle 8 QPU) ✓ (CPU)
luma_h ✓ (CPU, PR #9) ✓ (CPU)
chrm_v ✓ (CPU, PR #10) ✓ (CPU)
chrm_h ✓ (CPU, PR #10) ✓ (CPU)
Scope:
- 4 new kernel enums (LV_INTRA=13, LH_INTRA=14, CV_INTRA=15,
CH_INTRA=16), all → CPU substrate in the recipe table.
- 4 new public dispatch fns + 4 recipe wrappers (defined via two
DEFINE_INTRA_DISPATCH / DEFINE_INTRA_RECIPE macros to keep the
boilerplate tight).
- 4 new extern decls for the vendored
ff_h264_{v,h}_loop_filter_{luma,chroma}_intra_neon symbols.
- C reference: tests/h264_intra_loop_filter_ref.c covers all four
orientations. Algorithm per H.264 §8.7.2.3:
Luma: per-side strong/weak filter selector
strong_p = (|p2-p0| < β) AND (|p0-q0| < (α>>2)+2)
strong_q = (|q2-q0| < β) AND (|p0-q0| < (α>>2)+2)
Strong updates p0/p1/p2 (and mirror); weak updates p0 only.
Chroma: always weak, only p0/q0 updated.
- daedalus_h264_deblock_meta is REUSED for intra dispatches; the
tc0[] field is ignored (bS=4 hardcodes the strength). Callers
can build a single edge list and route by kernel without an
extra struct.
- Test refactor: an intra_test_spec table + run_intra_test helper
drives all four orientations through one harness, keeping the
new test surface compact (~50 LOC for 4 kernels vs ~200 if each
had its own test_deblock_*_intra fn).
Verified on hertz (Pi 5 / V3D 7.1):
$ ./build/test_api_h264
=== Phase 8a API smoke: H.264 kernels via recipe dispatch ===
...
H.264 deblock luma v intra: 1024/1024 bytes bit-exact (100.0000%)
H.264 deblock luma h intra: 1024/1024 bytes bit-exact (100.0000%)
H.264 deblock chroma v intra: 256/256 bytes bit-exact (100.0000%)
H.264 deblock chroma h intra: 256/256 bytes bit-exact (100.0000%)
...
All 11 H.264 kernels bit-exact PASS — the deblock matrix is closed.
The bit-exact match on first try is meaningful for these kernels:
the strong/weak filter selector + per-side asymmetry would have
surfaced any sign / shift / rounding mistake immediately. The
C reference is now a usable spec checkpoint for the eventual QPU
shader work.
QPU shader follow-up: not in this PR. The intra path's 3-cell
per-side update + strong/weak branch is structurally more complex
than the bS<4 path that already has a V shader (v3d_h264deblock.spv).
Per the prior R-band logic for deblock, intra edges are < 20% of
total deblock work at typical bit-rates, so NEON-only at ~ 10 ns/edge
fits comfortably in the budget.
|
||
|
|
a5c47aa51c |
h264: deblock chroma_v + chroma_h (CPU/NEON, bS<4)
Continues the deblock buildout after PR #9 (luma_h). Adds the two chroma orientations via the same recipe-table-routed-to-CPU pattern; QPU shaders for chroma deblock are still a follow-up. Scope: - Public API: 4 new fns (dispatch + recipe wrapper × {v, h}). - Internal: dispatch_h264_deblock_chroma_{v,h}_cpu calling the vendored ff_h264_{v,h}_loop_filter_chroma_neon symbols. - Recipe table: DAEDALUS_KERNEL_H264_DEBLOCK_CV = 11, DAEDALUS_KERNEL_H264_DEBLOCK_CH = 12, both → CPU. Explicit SUBSTRATE_QPU returns -1 (no shader yet). - C reference: tests/h264_chroma_loop_filter_ref.c — covers both orientations. Algorithm per H.264 §8.7.2.4 (bS<4 chroma inter): tC = tc0_seg + 1 (no luma-style ap/aq side bonus); only p0/q0 are updated (chroma never modifies p1/p2/q1/q2). - Tests: test_deblock_chroma_v (8x4 tile, edge at row 2) + test_deblock_chroma_h (4x8 tile, edge at col 2), 4 segments x 2 cells per segment per spec. Verified on hertz (Pi 5 / V3D 7.1): $ ./build/test_api_h264 === Phase 8a API smoke: H.264 kernels via recipe dispatch === H264_IDCT4 recipe substrate: 2 (1=CPU, 2=QPU) H264_IDCT8 recipe substrate: 2 H264_DEBLOCK_LV recipe substrate: 2 H264_QPEL_MC20 recipe substrate: 2 H264_DEBLOCK_LH recipe substrate: 1 (CPU, no QPU H shader yet) H264_DEBLOCK_CV recipe substrate: 1 (CPU) H264_DEBLOCK_CH recipe substrate: 1 (CPU) H.264 IDCT 4x4: 2048/2048 bytes bit-exact (100.0000%) H.264 IDCT 8x8: 2048/2048 bytes bit-exact (100.0000%) H.264 deblock luma v: 2048/2048 bytes bit-exact (100.0000%) H.264 deblock luma h: 1024/1024 bytes bit-exact (100.0000%) H.264 deblock chroma v: 256/256 bytes bit-exact (100.0000%) H.264 deblock chroma h: 256/256 bytes bit-exact (100.0000%) H.264 qpel mc20: 1024/1024 bytes bit-exact (100.0000%) All 7 kernels bit-exact PASS. Chroma test sizes are smaller (256 bytes per orientation) because the per-MB chroma deblock surface is smaller than luma — accurate to the production geometry. Why no QPU shader yet (per the established pattern): - Chroma deblock is ~25% of total deblock work at 4:2:0 (one quarter the pixel count of luma per MB) — modest QPU win even after the shader exists. - Same R-band considerations as the luma _h follow-up: the V shader transpose isn't mechanical, and the 8-cell tile is small enough that NEON's per-edge cost (~3 ns) is already inside the budget. - Total bench at 1080p: 8160 MBs × 4 chroma edges × 3 ns = ~100 us. Negligible compared to the IDCT layer's 10 ms (CPU NEON). Now coverage in fourier for the bS<4 8-bit 4:2:0 deblock matrix is complete: luma_v ✓, luma_h ✓, chroma_v ✓, chroma_h ✓. Remaining deblock work: bS=4 intra variants (luma + chroma, V + H). What this unblocks downstream: - daedalus-decoder Stage 4 deblock can now dispatch all four bS<4 edge categories that a typical inter MB needs. |
||
|
|
9d5451e0fe |
h264: deblock_luma_h — CPU/NEON via vendored ff_h264_h_loop_filter
Adds the horizontal-edge sibling of cycle 8's deblock_luma_v. The
vendored FFmpeg snapshot already includes ff_h264_h_loop_filter_luma_neon
in libavcodec/aarch64/h264dsp_neon.S — this PR wires up the symbol,
the bit-exact reference, and the recipe-table entry so daedalus-decoder
and other consumers can call the H variant through the same dispatch
shape they use for _v.
Scope:
- Public API: daedalus_dispatch_h264_deblock_luma_h(ctx, sub, ...)
+ daedalus_recipe_dispatch_h264_deblock_luma_h(ctx, ...) wrapper.
- Internal: dispatch_h264_deblock_h_cpu() calls the NEON entry.
- Recipe table: new DAEDALUS_KERNEL_H264_DEBLOCK_LH = 10, mapped
to DAEDALUS_SUBSTRATE_CPU until a QPU shader is written. An
explicit SUBSTRATE_QPU request on the H dispatch returns -1
(fails fast, no silent CPU degradation).
- C reference: tests/h264_h_loop_filter_luma_ref.c — the
column-axis transpose of h264_deblock_ref.c. Same per-segment
kernel; pix[-4..+3] accesses cols instead of rows*stride.
- Test: test_api_h264 grows a test_deblock_h() with 8 tiles
(8 cols x 16 rows each, edge at col 4), random alpha/beta/tc0;
compares NEON dispatch against reference byte-for-byte.
Verified on hertz (Pi 5 / V3D 7.1):
$ ./build/test_api_h264
=== Phase 8a API smoke: H.264 kernels via recipe dispatch ===
H264_IDCT4 recipe substrate: 2 (1=CPU, 2=QPU)
H264_IDCT8 recipe substrate: 2
H264_DEBLOCK_LV recipe substrate: 2
H264_QPEL_MC20 recipe substrate: 2
H264_DEBLOCK_LH recipe substrate: 1 (CPU, no QPU H shader yet)
H.264 IDCT 4x4: 2048/2048 bytes bit-exact (100.0000%)
H.264 IDCT 8x8: 2048/2048 bytes bit-exact (100.0000%)
H.264 deblock luma v: 2048/2048 bytes bit-exact (100.0000%)
H.264 deblock luma h: 1024/1024 bytes bit-exact (100.0000%)
H.264 qpel mc20: 1024/1024 bytes bit-exact (100.0000%)
All 5 kernels bit-exact PASS. The new H variant joins the suite
with 1024 random-input bytes per tile x 8 tiles.
Why CPU-only for now: the daedalus-decoder downstream needs the H
edge dispatched somewhere — even at CPU NEON cost (~6 ns/edge per
the cycle 8 M3 baseline) a frame's worth at 1080p is
~ 8160 MBs * 4 edges = 32 640 edges = ~200 us — well inside the
30 fps budget. Writing the V3D H-edge shader is a follow-up
(would be cycle 8' or similar; the V-edge shader's transpose isn't
mechanical because of how the workgroup organisation maps to columns
vs rows).
Backlog addition (out of scope for this PR):
- V3D shader for the H variant (mirror of v3d_h264deblock.spv).
- bS=4 intra-strength filter (different algebra; both _v and _h).
- Chroma deblock luma_v/_h (8-cell variants).
|
||
|
|
79553c6e22 |
cycle 9: V3D shader for H.264 luma qpel mc20 — 9/9 QPU coverage
Closes the QPU-default substrate campaign per the 2026-05-23
decree: every daedalus-fourier kernel that can be done in QPU
is now done in QPU. Cycle 9 is the last piece — 6-tap horizontal
half-pel luma motion compensation, H.264 §8.4.2.2.1.
Shader (src/v3d_h264_qpel_mc20.comp):
- local_size = 64, 1 lane per output pixel of one 8x8 block,
1 block per workgroup. Simplest layout that avoids any
inter-lane communication — V3D's L2 cache handles the
redundant reads from adjacent lanes computing adjacent
output columns.
- Per-pixel: read 6 src samples (cols c-2..c+3 in row r),
apply the (1, -5, 20, 20, -5, 1) / 32 filter with +16
rounding, clip to u8, write one dst byte.
- Single-stride convention matches FFmpeg's H264QpelContext
(dst and src share `stride`; src+src_off points at output
col 0 with the caller-guaranteed -2/+3 padding).
Dispatch wiring (src/daedalus_core.c):
- h264_qpel_mc20_pipe field on daedalus_ctx, lazy init.
- dispatch_h264_qpel_mc20_qpu(): 3 SSBOs (src / dst / meta),
src_max = src_off + 7*stride + 11 (covers the +3-col read
footprint on the last row), dst_max = dst_off + 7*stride + 8.
1 block per WG.
- daedalus_dispatch_h264_qpel_mc20() replaces ROUTE_CPU_ONLY
with the substrate-switch pattern matching the other H.264
kernels.
- Recipe table: H264_QPEL_MC20 returns SUBSTRATE_QPU.
Verification (hertz, Pi 5, V3D 7.1):
$ ./test_api_h264
=== Phase 8a API smoke: H.264 kernels via recipe dispatch ===
H264_IDCT4 recipe substrate: 2 (1=CPU, 2=QPU)
H264_IDCT8 recipe substrate: 2
H264_DEBLOCK_LV recipe substrate: 2
H264_QPEL_MC20 recipe substrate: 2 ← flipped
H.264 IDCT 4x4: 2048/2048 bytes bit-exact
H.264 IDCT 8x8: 2048/2048 bytes bit-exact
H.264 deblock luma v: 2048/2048 bytes bit-exact
H.264 qpel mc20: 1024/1024 bytes bit-exact ← QPU
First-iteration result was 1017/1024 (99.32%) — off-by-7 traced
to undersizing src_max in the host wrapper. The filter reads
src_off + 7*stride + (7 + 3) = +10 at the last row last column;
add 1 for memcpy's [0..N-1] semantic → 11. Fixed in the same
patch.
All 9 daedalus-fourier cycles now QPU-by-recipe:
cycle 1 VP9 IDCT 8x8 QPU
cycle 2 VP9 LPF wd=4 QPU
cycle 3 VP9 MC 8h QPU
cycle 4 VP9 LPF wd=8 QPU
cycle 5 AV1 CDEF 8x8 QPU
cycle 6 H.264 IDCT 4x4 QPU
cycle 7 H.264 IDCT 8x8 QPU
cycle 8 H.264 deblock luma-v QPU
cycle 9 H.264 qpel mc20 QPU ← this commit
Closes daedalus-fourier task #165. Per the decree memory
[QPU is default substrate], the prototype now demonstrates GPU
acceleration on every measured kernel.
|
||
|
|
a092ee34aa |
Merge pull request 'QPU is default substrate: recipe table + ctx env-var override' (#7) from noether/qpu-default-recipe-cycles-5-8 into main
Reviewed-on: #7 |
||
|
|
74687d9def |
cycle 7: V3D shader for H.264 IDCT 8x8
Mirrors cycle 6 (PR #7 prior commit) but at 8x8 scale: 8 lanes per
block, 8 blocks per WG. H.264 §8.5.13.2 1D butterfly twice (row
pass, column pass), (val + 32) >> 6 rounded + clipped + added to
dst.
Bit-exact first try on hertz (Pi 5, V3D 7.1):
H264_IDCT4 recipe substrate: 2 (QPU)
H264_IDCT8 recipe substrate: 2 (QPU) ← flipped
H264_DEBLOCK_LV recipe substrate: 2 (QPU)
H264_QPEL_MC20 recipe substrate: 1 (CPU) ← task #165 remaining
H.264 IDCT 4x4: 2048/2048 bytes bit-exact
H.264 IDCT 8x8: 2048/2048 bytes bit-exact ← QPU
H.264 deblock luma v: 2048/2048 bytes bit-exact
H.264 qpel mc20: 1024/1024 bytes bit-exact
8 of 9 daedalus-fourier cycles now QPU-by-recipe. Only cycle 9
(H.264 luma qpel mc20) still CPU — different shape (6-tap MC
filter, not a transform) so needs its own shader template; task
#165 covers it as a follow-on.
Same pattern as cycle 6 commit (
|
||
|
|
65bd5c3fe3 |
cycle 6: V3D shader for H.264 IDCT 4x4 (first cycle-6 QPU dispatch)
Per the QPU-default substrate decree 2026-05-23, cycle 6 (H.264
IDCT 4x4 + add) was the highest-priority H.264 kernel to flip
from NEON-only to QPU-capable. The same shape as VP9 IDCT 8x8
(cycle 1) — two-pass butterfly with shared-memory transpose —
but at 4x4 scale: 4 lanes per block, 16 blocks per WG.
What's added:
- src/v3d_h264_idct4.comp: GLSL compute shader implementing
the H.264 §8.5.12.1 1D butterfly twice (row pass then column
pass), with (val + 32) >> 6 rounding and clip-to-u8 add to
dst. Block memory layout is column-major (matches FFmpeg
`ff_h264_idct_add_neon` convention).
- CMakeLists: glslang rule + install entry for v3d_h264_idct4.spv.
- dispatch_h264_idct4_qpu() in daedalus_core.c: lazy pipeline
init, 3 SSBOs (coeffs / dst / meta as uvec4), push-constant
(n_blocks, dst_stride), 16 blocks per WG dispatch. Matches
the existing dispatch_*_qpu patterns; uses
v3d_runner_create_buffer / destroy_buffer (will swap to
pool API once PR #6 lands).
- daedalus_dispatch_h264_idct4() replaces ROUTE_CPU_ONLY with
the same CPU/QPU substrate switch the deblock dispatch uses.
- daedalus_recipe_substrate_for(H264_IDCT4) returns QPU now
that the shader exists.
Verification on hertz (Pi 5 + V3D 7.1):
$ ./test_api_h264
=== Phase 8a API smoke: H.264 kernels via recipe dispatch ===
H264_IDCT4 recipe substrate: 2 (1=CPU, 2=QPU)
H264_IDCT8 recipe substrate: 1
H264_DEBLOCK_LV recipe substrate: 2
H264_QPEL_MC20 recipe substrate: 1
H.264 IDCT 4x4: 2048/2048 bytes bit-exact (100.0000%) ← QPU
H.264 IDCT 8x8: 2048/2048 bytes bit-exact
H.264 deblock luma v: 2048/2048 bytes bit-exact
H.264 qpel mc20: 1024/1024 bytes bit-exact
The AUTO-substrate path now picks QPU for H.264 IDCT 4x4, and
the output is bit-exact against the C reference (which is
identical to the NEON .S code by construction — same FFmpeg
upstream).
Remaining cycle-6/7/9 work in task #165:
- cycle 7: H.264 IDCT 8x8 (template same shape; 8 lanes per
block, fewer blocks per WG)
- cycle 9: H.264 luma qpel mc20 (different shape — 6-tap MC
not a transform)
This commit lands the cycle-6 piece of task #165.
|
||
|
|
737e87980d |
QPU is default substrate: recipe table + ctx env-var override
Per the user decree 2026-05-23 — "what can be done in QPU will
be done in QPU" — this lands two coupled changes that flip
production-decode kernels with existing V3D shaders from
CPU-by-recipe to QPU-by-recipe:
1) daedalus_recipe_substrate_for() returns SUBSTRATE_QPU for
every kernel that has a shipped V3D compute shader:
cycle 1 VP9 IDCT 8x8 QPU (was QPU; unchanged)
cycle 2 VP9 LPF wd=4 QPU (was QPU; unchanged)
cycle 3 VP9 MC 8h QPU (FLIPPED from CPU — v3d_mc_8h.spv)
cycle 4 VP9 LPF wd=8 QPU (was QPU; unchanged)
cycle 5 AV1 CDEF 8x8 QPU (FLIPPED from CPU — v3d_cdef.spv)
cycle 6 H.264 IDCT 4x4 CPU (no shader yet; task #165)
cycle 7 H.264 IDCT 8x8 CPU (no shader yet; task #165)
cycle 8 H.264 deblock luma-v QPU (FLIPPED from CPU — v3d_h264deblock.spv)
cycle 9 H.264 qpel mc20 CPU (no shader yet; task #165)
The R-band cost/benefit framework still applies but is now
superseded for substrate selection by the decree. Where R
stays RED, the cost is in dispatch overhead, which is a
fixable engineering issue (tasks 160 buffer-pool, 161
persistent cmdbuf, 162 dmabuf import).
2) daedalus_ctx_create_no_qpu() now honours an env-var override:
set DAEDALUS_FORCE_QPU=1 in the process and create_no_qpu
silently escalates to a full daedalus_ctx_create(). Lets
the libavcodec substitution shims in marfrit-packages (which
pthread_once a create_no_qpu ctx — see
libavcodec/aarch64/h264_idct_daedalus.c) fire QPU paths
without rebuilding those patches.
Firefox / mpv consumers stay on the Vulkan-free path by
default (env var unset). The daedalus-v4l2 daemon will set
DAEDALUS_FORCE_QPU=1 explicitly before dlopen'ing libavcodec
(separate daedalus-v4l2 follow-up).
Smoke (hertz, Pi 5, kernel 6.18.29):
=== test_api_h264 ===
H264_IDCT4 recipe substrate: 1 (1=CPU, 2=QPU)
H264_IDCT8 recipe substrate: 1
H264_DEBLOCK_LV recipe substrate: 2 ← flipped
H264_QPEL_MC20 recipe substrate: 1
H.264 IDCT 4x4: 2048/2048 bytes bit-exact
H.264 IDCT 8x8: 2048/2048 bytes bit-exact
H.264 deblock luma v: 2048/2048 bytes bit-exact ← QPU path
H.264 qpel mc20: 1024/1024 bytes bit-exact
=== test_api_idct === all substrates (CPU/QPU/AUTO) bit-exact
=== test_api_lpf === all substrates bit-exact wd=4 and wd=8
The dispatch wrapper's fall-through logic (eff == SUBSTRATE_QPU
&& !ctx_has_qpu(ctx) → eff = SUBSTRATE_CPU) handles the case
where the recipe says QPU but the consumer didn't opt in — it
falls back to CPU silently, no regression.
Closes daedalus-fourier tasks #163, #164.
Refs the 2026-05-23 "QPU default substrate" decree.
|
||
|
|
98553278dd |
v3d_runner: persistent per-pipeline command buffer
Phase 2 of the QPU-default substrate campaign — eliminate
vkAllocateCommandBuffers from the dispatch hot path.
Attaches a VkCommandBuffer to each v3d_pipeline, allocated once in
v3d_runner_create_pipeline() and freed in destroy_pipeline(). The
five dispatch_*_qpu sites switch from v3d_runner_alloc_cmdbuf() to
v3d_runner_pipeline_cmdbuf_reset() — vkResetCommandBuffer is O(1)
versus the driver-side allocation walk. Pool was already created
with VK_COMMAND_POOL_CREATE_RESET_COMMAND_BUFFER_BIT so reset is
permitted.
Microbench (hertz, Pi 5, kernel 6.18.29, V3D 7.1):
before (task 160 pool only):
steady-state p50: 76.44 us
steady-state mean: 77.95 us
after (task 160 pool + task 161 persistent cb):
steady-state p50: 54.56 us
steady-state mean: 56.00 us
-> 28% per-dispatch reduction
The remaining ~54 us steady-state is dominated by vkQueueWaitIdle +
shader execution + the two memcpy(in/out) on the dst buffer — task
162 (dmabuf import for dst) targets the memcpy half.
test_api_idct stays bit-exact across CPU/QPU/AUTO substrates.
Refs daedalus-fourier task #161.
|
||
|
|
0a042a8e95 |
v3d_runner: buffer pool for QPU dispatch hot path
Per the QPU-default substrate decree 2026-05-23: the per-dispatch
vkAllocateMemory in dispatch_*_qpu was the biggest single fixable
contributor to QPU dispatch overhead. This pools v3d_buffer
allocations by power-of-2 size class so the second-and-subsequent
dispatch hits a freelist instead of paying ~10-50us of Mesa-V3D7
memory-allocation cost per call.
API additions (v3d_runner.h):
- v3d_runner_acquire_buffer(): pulls from per-bucket freelist;
falls through to v3d_runner_create_buffer() on miss.
- v3d_runner_release_buffer(): pushes back onto the freelist; the
backing VkBuffer/VkDeviceMemory only get vkFreeMemory'd in
v3d_runner_destroy().
- v3d_runner_pool_total_bytes(): diagnostic watermark.
Size classes 2^8..2^23 (256 B to 8 MiB). Oversize requests fall
through to non-pooled (vkAllocateMemory) for both ends — pool stays
correct, just degenerates to old behaviour for those calls.
Migration: daedalus_core.c dispatch_*_qpu paths globally swap
create_buffer → acquire_buffer and destroy_buffer → release_buffer.
All five QPU dispatch functions (idct8 / lpf / mc_8h / cdef /
h264_deblock) now reuse buffers across calls. test_api_idct stays
bit-exact (4096/4096 bytes on CPU/QPU/AUTO substrates on hertz).
Microbench (tests/bench_pool_overhead.c) on hertz (Pi 5,
6.18.29+rpt-rpi-2712, V3D 7.1):
call 0: 434.89 us (cold — 3x vkAllocateMemory)
call 1: 100.06 us (pool hit on all 3 buffers)
steady-state:
p50: 76.44 us
p90: 90.52 us
mean: 77.95 us
first-call / steady-state ratio: 5.7x
The remaining ~76us steady-state is dominated by vkQueueWaitIdle +
shader execution + per-call descriptor-set update + command-buffer
allocation — addressed in follow-on tasks 161 (persistent cmdbuf)
and 162 (dmabuf import for dst, eliminates memcpy in/out).
Refs daedalus-fourier task #160.
|
||
|
|
8fdef27a7d |
Phase 8c: H.264 luma qpel mc20 through public API
Extends daedalus-fourier with daedalus_recipe_dispatch_h264_qpel_mc20
so libavcodec.so can route H264QpelContext.put_h264_qpel_pixels_tab[1][2]
through the recipe layer instead of ff_put_h264_qpel8_mc20_neon directly.
API additions (header + library):
- daedalus_h264_qpel_meta { dst_off, src_off }
- daedalus_dispatch_h264_qpel_mc20(ctx, sub, dst, src, stride,
n_blocks, meta)
- daedalus_recipe_dispatch_h264_qpel_mc20(...) (AUTO wrapper)
- DAEDALUS_KERNEL_H264_QPEL_MC20 = 9 in the recipe-query enum
- daedalus_recipe_substrate_for() returns CPU NEON for cycle 9
The 6-tap horizontal half-pel filter signature matches FFmpeg's
H264QpelContext convention exactly: dst and src share a single stride
and src already points at output column 0 (filter reads cols -2..+3).
Single-stride API to make the marfrit-packages FFmpeg shim a
straight pointer-pass; no buffer rearrangement.
Verdict per docs/k9_h264qpel_mc20.md: CPU NEON. Per-block 7.6 ns
gives 135x margin over 30 fps 1080p; QPU dispatch floor at ~250 ns
makes any V3D shader strictly worse. Recipe table reflects that —
the recipe_dispatch entry is a one-line forward to the CPU path.
CMakeLists changes:
- h264qpel_neon.S added to the daedalus_core static lib (only the
bench targets owned it before; now the public API needs it too)
- tests/h264_qpel8_mc20_ref.c added to the test_api_h264 target
Phase 8a/8b smoke gains a 4th case (test_qpel_mc20): 1024/1024
bytes bit-exact via daedalus_recipe_dispatch_h264_qpel_mc20.
Refs reauktion/daedalus-v4l2#11 — substitution arc step 2 cycle 9.
|
||
|
|
0a99b16489 |
Phase 8b: opportunistic QPU paths through public API
Wires QPU dispatch for cycles 3 (VP9 MC), 5 (AV1 CDEF), 8 (H.264
deblock) through the public API. These three kernels have recipe
substrate = CPU, but per Issue 003 the mixed-kernel helper value
is real — the dispatch path must exist so override-mode callers
can request QPU on the side.
Pattern mirrors dispatch_idct8_qpu (lazy pipeline + per-call SSBO
alloc + memcpy + dispatch + readback). Each kernel has its own
push-constant struct (mc_pc 3-field, cdef_pc 3-field, deblock_pc
2-field shared with lpf).
Notable bug caught + fixed in test_api_opportunistic_qpu: the
initial dispatch_mc_8h_qpu sized src_max using CPU-side reach
(src_off + 3 + 8 + 7*stride), but the QPU shader reads src[
src_off + row*stride + 0..14] for row=0..7. Last block had 3
uninitialized bytes → 99.8% match → 100% after fix.
After this commit, the public API surface fully covers cycles 1-8:
Cycle 1 (IDCT 8x8): CPU + QPU + AUTO bit-exact
Cycle 2 (LPF wd=4): CPU + QPU + AUTO bit-exact
Cycle 3 (MC 8h): CPU recipe; QPU override bit-exact
Cycle 4 (LPF wd=8): CPU + QPU + AUTO bit-exact
Cycle 5 (CDEF): CPU recipe; QPU override (untested in this
test — bench_v3d_cdef is the authoritative 3-way M1)
Cycle 6 (H.264 IDCT 4x4): CPU only (no QPU shader by recipe)
Cycle 7 (H.264 IDCT 8x8): CPU only
Cycle 8 (H.264 deblock luma-v): CPU recipe; QPU override bit-exact
Tests: test_api_opportunistic_qpu adds CPU-vs-QPU bit-exact
comparison for VP9 MC and H.264 deblock through the API.
test_api_idct, test_api_lpf, test_api_h264 still pass.
Per the locked Phase 8 architecture (project_phase8_architecture
memory): next session opens daedalus-v4l2 sibling repo with
Option B (kernel V4L2 shim + userspace daemon), Option γ (dlopen
FFmpeg parser).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
af8146a2cd |
Phase 8a: H.264 kernels through public API
Extends include/daedalus.h with cycles 6, 7, 8 (H.264 IDCT 4x4, IDCT 8x8, luma deblock luma-v). All recipe-substrate = CPU (matches per-cycle Phase 7 verdicts). src/daedalus_core.c: NEON-path implementations + recipe routing. daedalus_core library now links the full FFmpeg H.264 NEON snapshot (h264idct + h264dsp) plus existing VP9 + dav1d. tests/test_api_h264.c: smoke test covering all 3 H.264 kernels via daedalus_recipe_dispatch_*. All pass 2048/2048 bit-exact. Public API coverage after this commit: - Cycles 1 IDCT 8x8 + 2 LPF4 + 4 LPF8: CPU+QPU+AUTO dispatch (test_api_idct, test_api_lpf, both pass) - Cycle 3 MC 8h: CPU only (QPU dispatch stub returns -1) - Cycle 5 CDEF: CPU only (QPU stub) - Cycle 6 H.264 IDCT 4x4: CPU only (recipe + only NEON wired) - Cycle 7 H.264 IDCT 8x8: CPU only - Cycle 8 H.264 deblock: CPU only (QPU opportunistic — not wired through API yet; bench_v3d_h264deblock exists for direct test) Next Phase 8 sub-step: wire opportunistic QPU dispatch for cycles 3+5+8 through the API (so override-mode users can request QPU). Then surface V4L2-wrapper architecture decisions to user. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
373f63a910 |
Cycle 8 closed: H.264 deblock R8=0.061 RED, opportunistic helper
Phase 6 deliverable: v3d_h264deblock.comp (132 inst, 4 threads,
no spills). Phase 5 REDs applied:
RED-1: explicit clamp p1'/q1' to [0,255] before uint8 write
RED-2: bench-enforced m.x >= 4*stride contract
M1: 3-way 4096/4096 bit-exact (QPU vs C ref AND vs NEON).
M2: 5.629 Medge/s isolation → R8 = 0.061 RED (predicted 0.09-0.14).
Lower than prediction; H.264 deblock has 4 early-return paths +
2 conditional writes that hurt V3D branchy execution more than
expected.
M4 same-kernel: NEON-3+QPU 12.81 Medge/s ≈ pure-NEON-4 ~12-15
(neutral).
M4 MIXED (real H.264 deployment shape): CPU=MC + QPU=h264deblock
gives CPU MC 25.11 Mblock/s + QPU h264deblock 6.23 Medge/s.
QPU contribution is essentially unchanged from isolation —
the cross-substrate contention is gentle (consistent with
Issue 003's V4 finding).
Verdict: H.264 deblock = opportunistic QPU helper. Same recipe
slot as cycle 5 CDEF. 6 Medge/s helper = 85% of single-NEON-core
deblock capacity, available when CPU is busy with other work.
Cycles 1-8 deployment recipe complete:
Primary QPU: cycles 1+2+4 (VP9 IDCT/LPF, all bandwidth-bound)
Primary CPU: cycles 3+6+7 (compute-heavy or trivially fast on NEON)
Opportunistic helper: cycles 5+8 (CDEF, H.264 deblock)
Phase 9 lessons added:
- Branchy kernels underperform V3D vs straight-line ones
- Mixed-kernel helper value scales with isolation M2, not
same-kernel M4
- R prediction needs branchiness weight, not just compute density
- src/v3d_h264deblock.comp (132 inst QPU shader)
- tests/bench_v3d_h264deblock.c (3-way M1 + M2 + R classification)
- tests/bench_concurrent_mixed.c extended with K_H264DEBLOCK
- CMakeLists.txt: v3d_h264deblock.spv + bench_v3d_h264deblock
+ h264dsp linked into bench_concurrent_mixed
- docs/k8_h264deblock_phase7.md (full closure with cycles 1-8 recipe)
Next: Phase 8 — V4L2 wrapper / deployment infra. Public API
already exposes recipe-default substrate per kernel.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
eb5cfb34c4 |
Phase 8: wire LPF wd=4 + wd=8 QPU through public API
Mirror the IDCT pattern (lazy pipeline + per-call SSBO alloc + dispatch + readback) for cycles 2 (LPF wd=4) and 4 (LPF wd=8). Important caught-empirically bug: the two LPF shaders disagree on push-constant slot order — wd=4 puts dst_stride_u8 at slot 1, wd=8 puts it at slot 2 (with unused blocks_per_row at slot 1). Initial single-struct attempt silently corrupted wd=8 output (1958/2048 = 95.6 % bit-exact on test_api_lpf). Fixed by keeping separate lpf4_pc and lpf8_pc struct definitions. dst-window calc handles both kernels (same -4..+3 byte footprint per row). test_api_lpf exercises both kernels in CPU / QPU / AUTO modes against the C reference. All 6 mode/kernel combinations pass 2048/2048 bit-exact (32 edges × 8 rows × 8 bytes/edge). Phase 8 status after this commit: 3 of 5 kernels wired through API for QPU dispatch (IDCT, LPF wd=4, LPF wd=8 — i.e., all 3 QPU-default kernels per recipe). Cycle 3 MC and cycle 5 CDEF still need wiring for opportunistic-override mode but aren't needed for recipe-AUTO path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
1085c5699c |
Phase 8: wire IDCT QPU dispatch through public API
daedalus_ctx now owns a v3d_runner when V3D is available. The public API's dispatch_vp9_idct8 routes QPU calls through a new dispatch_idct8_qpu helper that: (1) lazy-creates the cycle 1 v4 pipeline on first use, (2) allocates 3 host-visible SSBOs per call (coeffs/dst/meta), (3) memcpy host->GPU, (4) dispatch with the v4 32-blocks-per-WG geometry, (5) memcpy GPU->host. Per-call alloc is intentional for Phase 8 correctness-first scope; buffer-pool perf optimization is deferred. Added daedalus_ctx_create_no_qpu() for fast-path callers that know they want CPU only. test_api_idct extended to a 3-mode matrix: CPU forced, QPU forced, AUTO recipe. All three deliver 4096/4096 bit-exact on hertz with V3D 7.1.7.0: recipe substrate for VP9_IDCT8: 2 (QPU) [CPU] 4096/4096 bit-exact [QPU] 4096/4096 bit-exact (real QPU dispatch through the API) [AUTO] 4096/4096 bit-exact (recipe routes to QPU) Next Phase 8 sub-step: same wiring pattern for cycle 2 LPF wd=4 and cycle 4 LPF wd=8 (the other two recipe-QPU kernels). Cycle 3 MC and cycle 5 CDEF only need the dispatch hook (recipe routes to CPU; QPU stays opportunistic via explicit override). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
760f6a4060 |
Phase 8 skeleton: public C API + first end-to-end smoke test
include/daedalus.h: stable C API surface exposing the 5 cycles (VP9 IDCT 8x8, LPF wd=4, MC 8h, LPF wd=8; AV1 CDEF). Per-kernel recipe-dispatch helpers default to the cycle 1-5 verdict substrate (QPU for cycles 1+2+4, CPU for cycles 3+5); explicit override available for benchmarking and runtime-aware scheduling. src/daedalus_core.c: NEON-path implementation of all 5 kernels wrapped behind the public API. QPU path stubbed out (returns -1) since wiring v3d_runner into daedalus_ctx is the next Phase 8 sub-step; with has_qpu=0 the recipe falls back to CPU cleanly. tests/test_api_idct.c: 64-block IDCT through the public recipe dispatch, bit-exact vs C ref. PASS 4096/4096 bytes — proves the API surface compiles, library links, dispatch routing works, and NEON fallback delivers correct results. docs/phase8_scoping.md: architecture options (A=userspace V4L2, B=kernel V4L2 shim, C=direct libva); pick A for v1; explicitly out-of-scope work tracked. Next Phase 8 sub-step: wire v3d_runner into daedalus_ctx so has_qpu=1 and QPU dispatch goes through the API too. After that: V4L2 ioctl glue, bitstream parser, superblock loop. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
5223d3cb3f |
Cycle 5 closed: CDEF QPU R5=0.116 ORANGE, opportunistic helper
Phase 4 plan with 3 Phase-5 REDs applied inline: - meta layout: m.z=tmp_off, m.w=dir - sec_shift clamped to >=0 (NEON uqsub semantics) - directions table as const ivec2[14], not OR-packed Phase 6 deliverable: v3d_cdef.comp (387 inst, 2 threads, no spills). 3-way M1 (QPU vs C ref vs NEON) PASS 4096/4096. M2: 0.443 Mblock/s -> R5 = 0.116 ORANGE (predicted 0.02-0.05 RED). M4 same-kernel: NEON-3+QPU 8.46 < NEON-4 alone ~10 (negative). M4 mixed (NEON-3 MC + QPU CDEF): CPU 34.17 Mblock/s MC, QPU 0.42 Mblock/s CDEF helper. CPU side higher than the Issue 003 NEON-fallback proxy suggested - cross-substrate contention is gentler than same-side NEON contention. Verdict: CDEF stays on CPU; QPU dispatch path exists for opportunistic use. Deployment recipe table updated for all 5 cycles. Phase 9 lessons: linear extrapolation across cycles is too pessimistic; CDEF is bandwidth-bound on NEON despite high per-block ns; real-substrate-cross contention < NEON-proxy contention. - src/v3d_cdef.comp: cycle 5 QPU shader - tests/bench_v3d_cdef.c: 3-way M1, M2 bench - tests/bench_concurrent_mixed.c: K_CDEF on both sides - tests/cdef_ref.c + bench_neon_cdef.c: sec_shift clamp + expanded damping range to exercise the edge case - CMakeLists.txt: v3d_cdef.spv + bench_v3d_cdef wiring - docs/k5_cdef_phase4.md updated with Phase 5 review applied - docs/k5_cdef_phase7.md: closure doc with full verdict matrix Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
85feba4087 |
Cycle 4 (LPF wd=8) closure: M1=100%, R=0.34, M4=+4.1%, PASS
Fourth daedalus-fourier kernel — VP9 8-tap inner loop filter wd=8 h_8_8 variant. Width extension of cycle 2's wd=4; completes VP9 inner-edge LPF coverage. Full cycle Phase 1-7 + M4'''' in one combined go (cycle compressed since incremental from cycle 2). Phase 5 review explicitly skipped (incremental ~30-line shader delta from cycle 2 + same geometry + cycle-2 RED-pattern checks still apply). Flagged in docs/k4_lpf8_phase4_7.md per dev_process.md "Skipping phases is a deliberate choice that should be flagged." Phase 6 v1 first-light: M1'''' 100.0000% bit-exact (65536/65536) first try. Shaderdb shows 231 inst, 4 hardware threads, 0 spills, 27 max-temps, 48 uniforms — compiler at the latency-hiding ceiling. Performance: M3'''' NEON (single-core) 52.382 Medge/s M2'''' QPU isolation 17.847 Medge/s R'''' 0.341 → ORANGE band 30fps floor margin 9.2x (isolation), 20.3x (mixed) M4'''' concurrent matrix: NEON 4-core 37.823 Medge/s <- baseline QPU only 14.867 Medge/s MIXED NEON-3 + QPU 39.389 Medge/s <- +4.1% PASS Verdict: YELLOW-via-M4'''' PASS. Deploy wd=8 LPF on QPU alongside cycle 2 wd=4. Combined VP9 inner-edge LPF coverage now complete. Cross-cycle LPF comparison: | | wd=4 (k2) | wd=8 (k4) | | M3 NEON | 48.3 | 52.4 | | M2 QPU iso | 19.6 | 17.8 | | R iso | 0.41 | 0.34 | | M4 delta | +6.9% | +4.1% | | 30fps mixed | 7.2x | 20.3x | | Verdict | GO QPU | GO QPU | NEW finding (Phase 9 lesson): NEON gets faster per edge as filter width grows (20.7 → 19.1 ns wd=4 → wd=8). The relative QPU loss grows with width. wd=16 would probably flip negative based on the trend line. Deployment recipe with cycle 4: IDCT 8x8 (k1) -> QPU (R=0.92, +7% mixed) LPF wd=4 (k2) -> QPU (R=0.41, +7% mixed) LPF wd=8 (k4) -> QPU (R=0.34, +4% mixed) MC 8h (k3) -> CPU (R=0.067, -19% mixed) Entropy -> CPU (structural) VP9 inner-edge LPF coverage complete. Project continues to higgs deployment plumbing or further kernels per user direction. New artifacts: - src/v3d_lpf_h_8_8.comp — GLSL shader - tests/vp9_lpf8_ref.c — standalone C ref - tests/bench_neon_lpf8.c — M1+M3 bench - tests/bench_v3d_lpf8.c — M1+M2 bench - tests/bench_concurrent_lpf8.c — M4 pthread bench - docs/k4_lpf8_phase1_3.md + phase4_7.md — combined cycle docs Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
356e446a49 |
Cycle 3 (MC interpolation) closure: M1'''=100%, R'''=0.067 RED, M4=-19.5%
Third daedalus-fourier kernel — VP9 8-tap regular subpel filter,
horizontal direction, 8-wide output. Multiply-heavy by design to
stress V3D's no-DP4A deficit. Full cycle Phase 1-7 + M4'''.
Phase 5''' second-model review delivered cleanly — caught 1 RED
bug pre-implementation (src_off off-by-3 indexing convention) and
2 YELLOW gaps (assert MUST language, shaderdb filter-LUT gate).
Without the review, M1''' would have failed silently on first run
with cryptic "high-index source pixels wrong" symptoms.
Phase 6 v1 first-light: M1''' 100.0000% bit-exact (65536/65536
blocks across all 16 mx phases). Phase 5''' filter-LUT prediction
materialised exactly: 197 uniforms (gate was 144), 2 threads (down
from cycle-2's 4 due to register pressure).
Performance:
M2''' = 1.413 Mblock/s (707.9 ns/block)
M3''' = 20.997 Mblock/s (NEON baseline phase3)
R''' = 0.067 (RED band — structural mismatch)
shaderdb: 488 inst, 2 threads, 197 uniforms, 25 max-temps, 0 spills
M4''' concurrent matrix (8s windows):
NEON 1-core 14.479 Mblock/s
NEON 4-core 15.248 Mblock/s <- baseline (compute-bound,
not bandwidth-saturated
like cycles 1+2!)
QPU only 1.380 Mblock/s
MIXED NEON-3 + QPU 12.277 Mblock/s <- -19.5% (FAIL gate)
MIXED NEON-4 + QPU 12.158 Mblock/s <- -20.3%
NEW cross-cycle finding (Phase 9 lesson 2): compute-bound CPU
workloads make the QPU-offload story collapse. Cycles 1+2 were
bandwidth-saturated (4-core scaling 0.56-0.82x of 1-core), so
freeing a CPU core via QPU offload added throughput. Cycle 3 MC
is compute-bound (4-core scaling 1.05x of 1-core — near-linear),
no free cycles to free. QPU contribution (0.45 Mblock/s in
contention) doesn't compensate for losing 1 NEON core delivering
~3.8 Mblock/s.
But 30fps@1080p floor: PASS in every config (1.4x to 15.7x
isolation margin). Per project_30fps_floor_is_fine.md, user-facing
test never fails — daily YouTube playback works fine on any CPU/QPU
split.
DEPLOYMENT RECIPE for higgs (cycle 3 confirmed split):
IDCT (k1) -> QPU (R=0.92, +7% mixed, frees CPU core)
LPF (k2) -> QPU (R=0.41, +7% mixed, frees CPU core)
MC (k3) -> CPU (R=0.067, -19.5% mixed — stays on CPU)
Entropy -> CPU (structurally serial)
Mixed-substrate deployment, not "QPU does everything". Realistic for
higgs: entropy + MC on 2-3 ARM cores; IDCT + LPF dispatched to QPU
concurrently; 1-2 ARM cores left for vscode etc.
New artifacts:
- src/v3d_mc_8h.comp — GLSL kernel
- tests/vp9_mc_ref.c — standalone C ref (REGULAR filter
embedded; clean transcription)
- tests/bench_neon_mc.c — M1'''_c + M3''' bench
- tests/bench_v3d_mc.c — M1''' + M2''' bench with contract
asserts + 30fps margin display
- tests/bench_concurrent_mc.c — M4''' pthread bench
- external/ffmpeg-snapshot/libavcodec/aarch64/vp9mc_neon.S (vendored)
- external/ffmpeg-snapshot/libavcodec/vp9_subpel_filters_table.c
(hand-extracted; provides
ff_vp9_subpel_filters symbol
without dragging in full vp9dsp.c)
- docs/k3_mc_phase{1,2,3,4,5,7}.md — full cycle documentation
Memory updates: project_30fps_floor_is_fine.md (user's 30fps target
recalibration), MEMORY.md index updated.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
36eca40ff2 |
Cycle 2 (LPF) closure: M1''=100%, R''=0.41, M4''=+6.9%, PASS
Phase 4 plan + Phase 5 second-model review (PASS-WITH-REVISIONS:
2 YELLOW contract gaps applied) + Phase 6 v1 implementation +
Phase 7 verification including M4'' concurrent gate.
Phase 5'' review delivered cleanly — no RED bugs (cycle 1 lessons
applied successfully). 2 YELLOW findings baked into phase4 §4:
- stride >= 4 contract added alongside m.x >= 4 (finding 2)
- assert(...) in bench made a MUST not a suggestion (finding 4)
- V3D divergence-cost note: don't restructure to always-execute,
masked lanes consume clock anyway (finding 3, informational)
Phase 6 v1 first-light hit M1'' 100.0000% bit-exact on first run
(65536/65536 edges) — the cycle-1 v4 patterns (WG=256, 2-per-sg,
uint8_t SSBO, oob early-return discipline) baked in from start
worked as expected.
Performance:
M2'' = 19.645 Medge/s (50.9 ns/edge)
M3'' = 48.285 Medge/s (NEON baseline from phase3)
R'' = 0.41 (ORANGE band - doesn't auto-close per
cycle-1 calibration adjustment)
shaderdb: 160 inst, **4 threads**, 0 spills, 21 max-temps —
shader is already at the compiler ceiling. No v2/v3/v4 iteration
loop like cycle 1 because there's nothing more to extract from
the compiled shape. The 30x gap between theoretical instruction
throughput and measured wall-clock is divergence-tax + memory
latency, not compile quality.
M4'' concurrent matrix on hertz (8s windows):
NEON-1 LPF 41.131 Medge/s
NEON-4 LPF 33.726 Medge/s <- realistic CPU ceiling
(per-core 7-9; same
bandwidth-saturation as
cycle-1 F1)
QPU only 14.299 Medge/s
MIXED NEON-3 + QPU 36.049 Medge/s <- +6.9% over NEON-4
MIXED NEON-4 + QPU 31.892 Medge/s <- -5.4% oversubscribed
The "freed-core" pattern generalizes from IDCT to LPF: NEON-3+QPU
beats pure NEON-4 by ~7% in both cycles. Cycle-2 NEW finding:
**oversubscribed mode hurts for lighter kernels** (LPF -5.4% vs
cycle-1 IDCT +9.4%). Recommendation for higgs deployment hardens
to "always N-1 NEON cores + QPU, never N + QPU".
Phase 9 lessons (in phase7 §"Phase 9 lessons"):
1. Cycle-1 v4-pattern is the v1 starting point (saves 3 iterations)
2. Phase 5 review pays off every cycle
3. R isolation misleading on bandwidth-saturated hardware
4. Oversubscription tax depends on kernel weight
5. shaderdb 4-threads/0-spills = compute not the bottleneck
New artifacts:
- src/v3d_lpf_h_4_8.comp — GLSL kernel
- tests/bench_v3d_lpf.c — M1'' + M2'' harness with
contract asserts + fm/hev
pass-rate instrumentation
- tests/bench_concurrent_lpf.c — M4'' pthread bench
(mirrors bench_concurrent.c)
- docs/k2_deblock_phase{4,5,7}.md — plan + review + verification
Project verdict: continue. Cycle 3 candidates: MC interpolation
(multiply-heavy, stress V3D SMUL24), CDEF (AV1-only, different
neighborhood shape), or wd=8/wd=16 LPF variants. User to direct.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
d66f22f333 |
Phase 6 (v1+v4 production) + Phase 7 closure: R = 0.92 ± 0.03 on hertz
First QPU IDCT8 kernel running and bit-exact on V3D 7.1 via Mesa
v3dv compute. Five iterations through a Phase 7→Phase 4' loopback;
production kernel is v4.
New files:
- src/v3d_runner.{c,h} — reusable Vulkan compute plumbing (instance,
V3D device picker, HOST_VISIBLE|COHERENT
SSBOs with mmap, compute pipeline from .spv,
enables storageBuffer{8,16}BitAccess)
- src/v3d_idct8.comp — VP9 8x8 DCT_DCT IDCT add, v4 production:
256 invocations/WG, 2 blocks/subgroup
(no idle lanes), uint8 dst SSBO (race-free
per phase5 finding 5), unrolled writes
(no chained ternary), oob-flag pattern
(barrier-safe per phase5 finding 7)
- tests/bench_v3d_idct.c — M1' bit-exact gate + M2 throughput vs C ref
- docs/phase7.md — full iteration journey + decision verdict
CMakeLists.txt updated to build the new shader, library, and bench
when DAEDALUS_BUILD_VULKAN=ON.
Iteration record (1920x1088 luma, 32640 blocks/dispatch, N=3):
ver change R ns/block
v1 first-light 0.230 533
v2 kill ternary + 2-blocks-per-sg 0.474 258
v3 per-pass scope oN 0.481 254 (noise)
v4 WG 64 -> 256 invocations 0.947 129
v5 packed uint32 coeff reads 0.938 130 (noise, reverted)
v4 final N=3 0.918 +/- 0.033
Bit-exactness 100.0000% across all iterations (10000-block sample
on 128x128, 32640-block sample on 1080p) against both the C
reference (tests/vp9_idct8_ref.c) and the vendored FFmpeg NEON
ff_vp9_idct_idct_8x8_add_neon.
Key learning over the Phase 5 review's prediction model: the
chained ternary was NOT a spill killer on V3D 7.1 (shaderdb
showed 0:0 spills:fills even in v1). The actual lever was
workgroup-size-driven latency hiding — going from 64 to 256
invocations doubled throughput with the same compiled code
(270 inst, 2 threads, 21 max-temps, 0 spills) because the
v3dv scheduler had 4x more in-flight work to overlap TMU
latency.
Verdict per phase1.md decision rules: YELLOW band (0.5 <= R < 1.0)
by a wide margin, near GREEN boundary. Phase 1 YELLOW rule:
add M4 (concurrent CPU+QPU throughput) before honest-close or
continue. M4 is the next measurement, not more shader tuning —
at R = 0.92 with all 4 A76 cores still 100% free for other work,
the question is whether the system aggregate beats pure 4-core
NEON.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
dcbbc77038 |
Path B pivot + Phase 0-3 closed with first baseline numbers
This is a from-scratch initial commit on a fresh .git. The original
scaffold commit (
|