224f4be9e2c30bdaaded91fc7481f87cfe509125
51 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
e3c28495ae |
h264: V3D shaders for the 4 single-axis quarter-pel qpel variants
mc10 (¼-H), mc30 (¾-H), mc01 (¼-V), mc03 (¾-V). Each is the
corresponding half-pel filter (mc20 or mc02) with one extra L2
rounded-average step against an integer-source pixel at the tail:
mc10[r,c] = avg(clip255(mc20(s)), s[r, c ])
mc30[r,c] = avg(clip255(mc20(s)), s[r, c+1])
mc01[r,c] = avg(clip255(mc02(s)), s[r, c ])
mc03[r,c] = avg(clip255(mc02(s)), s[r+1, c ])
Each shader is ~45 lines (mc20-/mc02-pattern + 1 L2 line).
CMake foreach loop generates the 4 SPV compile rules. Dispatch
helper `dispatch_h264_qpel_axis_qpu` shares plumbing across all 4
(axis flag selects src_max bounds: H reads cols -2..+10, V reads
rows -2..+10). DEFINE_QPEL_AXIS_QPU + DEFINE_QPEL_DISPATCH_QPU
macros collapse ~200 LOC of boilerplate.
Recipe table flips DAEDALUS_KERNEL_H264_QPEL_MC{10,30,01,03} from
CPU to QPU.
Verified on hertz:
$ ./build/test_api_h264 | grep "qpel mc[01230]"
H.264 qpel mc10: 2048/2048 bytes bit-exact (100.0000%)
H.264 qpel mc30: 2048/2048 bytes bit-exact (100.0000%)
H.264 qpel mc01: 2048/2048 bytes bit-exact (100.0000%)
H.264 qpel mc03: 2048/2048 bytes bit-exact (100.0000%)
(+ mc20/mc02/mc22 anchors from previous PRs)
Qpel QPU coverage:
put_ mc20 ✓ mc02 ✓ mc22 ✓ (3 anchors)
mc10 ✓ mc30 ✓ mc01 ✓ mc03 ✓ (4 quarter-axis, THIS PR)
mc11/12/13/21/23/31/32/33 — CPU NEON (8 diagonals)
avg_ all 15 positions — CPU NEON
7 of 15 useful put_ positions now on QPU. The 8 diagonals each
compose two half-pel results via L2; can land via dedicated kernels
or by chaining existing anchor dispatches (the latter would need
the L2 step as a fourth dispatch — probably cheaper to write
dedicated 8x diagonal shaders).
|
||
|
|
02d564b43e |
h264: V3D shader for qpel mc22 (2D half-pel "j" position)
Cascaded H+V 6-tap filter per H.264 §8.4.2.2.1. Highest per-frame impact among missing qpel positions (PR #24 bench: 71.5 ns/block NEON, 2.33 ms/frame worst-case all-mc22 at 1080p). Per-lane structure: each lane runs the FULL cascade independently — computes 6 horizontal lowpass int16 intermediates at rows r-2..r+3 of its column, then a vertical lowpass on those with +512 >> 10 final scale. ~50 ALU ops per lane. Design choice: NO shared memory / barriers. Alternative was to cache the h-lowpass intermediates in shared memory (13 rows × 8 cols of int16 per WG), trading shared-memory bank pressure + a barrier for ~6× less h-lowpass compute. V3D L2 absorbs the redundant src reads across lanes; the per-lane compute is cheap (multiply-add ALU units idle anyway during dst write). Simpler shader, fewer SPIR-V ops, easier to extend to mc12/mc21/etc. later. CANNOT just cascade mc20 → mc02 because the intermediate must be int16 (no per-stage clip): the +512 >> 10 final scale assumes both 6-tap scalings preserved through the pipeline. Dedicated kernel. dispatch_h264_qpel_mc22_qpu mirrors the existing mc20/mc02 shape; src_max = src_off + 10*stride + 11 covers both the V (rows -2..+10) and H (cols -2..+10) read windows in one bound. Recipe table flips DAEDALUS_KERNEL_H264_QPEL_MC22 from CPU to QPU. Verified on hertz: $ ./build/test_api_h264 | grep qpel H.264 qpel mc20: 1024/1024 bytes bit-exact (100.0000%) H.264 qpel mc02: 2048/2048 bytes bit-exact (100.0000%) H.264 qpel mc22: 2048/2048 bytes bit-exact (100.0000%) Qpel QPU coverage now: 3 anchors (mc20 H, mc02 V, mc22 HV) — these are the half-pel "building blocks" the 12 other qpel positions combine via L2 averaging. Remaining variants (quarter-pel singles mc01/03/10/30 and the 8 diagonals) can dispatch through the existing shaders + a small L2-averaging compose step, or get dedicated kernels. |
||
|
|
bc5edf656d |
h264: V3D shader for qpel mc02 (vertical half-pel)
Sibling of cycle 9's v3d_h264_qpel_mc20.comp. Same 6-tap H.264 luma
half-pel filter, transposed to vertical orientation: filter reads
rows [-2..+3] of source per output pixel instead of cols.
Shader is ~58 lines (vs mc20's 86) — same WG geometry (64 lanes /
1 block per WG / 1 lane per output pixel). The address arithmetic
flips: row_base = src_off + r*stride + c (mc20) → col_base =
src_off + c, then col_base + (r±N)*stride (mc02).
dispatch_h264_qpel_mc02_qpu mirrors the mc20 QPU dispatch; src_max
calculation differs since the V kernel reads rows -2..+10 of source
(13 rows × stride wide) vs mc20's cols -2..+10 (8 rows × stride+11).
For 8x8 blocks: src_max = src_off + 10*stride + 8.
Recipe table flips DAEDALUS_KERNEL_H264_QPEL_MC02 from CPU to QPU.
Verified on hertz:
$ ./build/test_api_h264 | grep qpel
H.264 qpel mc20: 1024/1024 bytes bit-exact (100.0000%)
H.264 qpel mc02: 2048/2048 bytes bit-exact (100.0000%)
QPU coverage for the 30 qpel positions:
put_ mc20 ✓ (cycle 9) mc02 ✓ (this PR)
all 13 other put_ CPU NEON
avg_ all 15 positions CPU NEON
Next-priority candidates by per-frame impact (per PR #24 bench):
mc22 (2D half-pel) — 71.5 ns/block NEON × 32 640 blocks worst
case = 2.33 ms/frame at 1080p. Most-used
qpel position in real H.264 streams.
mc11/mc13/mc31/mc33 — corner ¼-pel positions, structurally similar
to mc20 + mc02 with L2 averaging.
The cascaded H+V structure of mc22 means it can either share the
existing mc20 + mc02 shaders' L2 (compute mc20 into tmp, then mc02
on tmp) or get a dedicated 2-stage pipeline. Follow-up.
|
||
|
|
d8de7754fa |
h264: V3D shaders for chroma deblock V + H (4:2:0)
Adds the QPU shader pair for chroma_v / chroma_h deblock (non-intra bS<4), siblings of the cycle 8 luma_v shader and PR #28's luma_h. Closes 4 of 8 deblock QPU coverage at non-intra: luma_v ✓ cycle 8 luma_h ✓ PR #28 chroma_v ✓ this PR chroma_h ✓ this PR *_intra — CPU NEON (less common; smaller volume) Per H.264 §8.7.2.4 chroma kernel is simpler than luma: only p0/q0 updated (never p1/p2/q1/q2), tC = tc0_seg + 1 (no luma-style ap/aq side bonus), 8 cells per edge (vs luma's 16). Shader: 64 lines vs luma_v's 108 — same WG geometry (16 edges × 16 lanes, lanes 8..15 of each edge early-return). 4:2:0-only: 4:2:2 chroma_h has a 16-row edge geometry that this shader doesn't address; daedalus_dispatch_h264_deblock_chroma_h is 4:2:0-only by design, caller-side gating already covers this in the libavcodec substitution arc (marfrit-packages PR #98). Recipe table flips DAEDALUS_KERNEL_H264_DEBLOCK_CV / CH from CPU to QPU. dispatch_h264_deblock_chroma_qpu factored to share QPU plumbing between V and H (orientation passed as a flag for the dst_max calculation). Verified on hertz: $ ./build/test_api_h264 | grep "deblock chroma [vh]:" H.264 deblock chroma v: 256/256 bytes bit-exact (100.0000%) H.264 deblock chroma h: 256/256 bytes bit-exact (100.0000%) Recipe substrate now reports 2 (QPU) for both CV and CH. Coverage now: bS<4 QPU bS=4 (intra) luma_v ✓ cycle 8 CPU NEON luma_h ✓ PR #28 CPU NEON chroma_v ✓ this PR CPU NEON chroma_h ✓ this PR CPU NEON Intra (bS=4) variants stay CPU NEON. Less common case, smaller per-frame contribution, and the algorithm is structurally different (no tc0; strong-vs-weak filter quad-tree). Can land as a follow-up PR if perf demands. |
||
|
|
3db059ffab |
h264: V3D shader for deblock_luma_h — first QPU port since cycle 9
Ports cycle 8's v3d_h264deblock.comp (V edge, horizontal across a row)
to the H orientation (V edge, horizontal across a column). Same
algorithm, transposed access pattern:
V variant: lane → column, reads/writes pix[±N*stride] (vertical I/O)
H variant: lane → row, reads/writes pix[±N] (horizontal I/O)
WG geometry unchanged: 256 invocations, 16 edges/WG, 16 lanes/edge.
Lane-in-edge interpretation flips: column-index for V → row-index
for H. tc0 segment math unchanged (one tc0 byte per 4 lanes).
dst_max calculation flips: V used dst_off + 3*stride + 16 (cols),
H uses dst_off + 15*stride + 4 (rows).
Recipe table: DAEDALUS_KERNEL_H264_DEBLOCK_LH = QPU (was CPU). AUTO
dispatch now picks QPU for the H edge as well as the V edge. CPU
NEON path stays as the explicit-SUBSTRATE_CPU + has_qpu=0 fallback.
Verified on hertz (Pi 5 / V3D 7.1):
$ ./build/test_api_h264 | grep luma_h
H264_DEBLOCK_LH recipe substrate: 2 (was 1 — flipped to QPU)
H.264 deblock luma h: 1024/1024 bytes bit-exact (100.0000%)
Bit-exact against the C reference (h264_h_loop_filter_luma_ref) on
8 tiles × 8 cols × 16 rows of random input. Same correctness gate
as the cycle 8 V shader.
CMake plumbing: glslang rule for v3d_h264deblock_h.comp; new SPV
added to daedalus_shaders ALL list + install rule. daedalus_ctx
gains a parallel h264deblock_h_pipe_ready / h264deblock_h_pipe pair
(can't share with V because pipelines bind a specific SPIR-V module
at create time).
What this changes for the substitution arc: PR #97's 0008-h264-
deblock-luma-h substitution patch already plumbed
daedalus_recipe_dispatch_h264_deblock_luma_h through libavcodec.
That path was NEON-by-recipe; with this PR it becomes QPU-by-recipe
(unless the libavcodec ctx is no-QPU per daedalus_ctx_create_no_qpu,
in which case it stays NEON — same shape as cycle 8's V shader).
Coverage state for H.264 8-bit 4:2:0 deblock kernels (QPU shaders):
luma_v ✓ cycle 8 ✓ now
luma_h — ✓ THIS PR
chroma_v/h — (CPU NEON; smaller tiles, lower-priority)
*_intra (4) — (CPU NEON; less common)
|
||
|
|
cb3aef3dac |
h264: promote remaining intra prediction modes (17) to public API
Follows PR #26 (Intra_4x4 luma) with the same promotion pattern for the rest of the intra prediction primitive set: Intra_16x16 luma (4 modes, PR #13) — V/H/DC/Plane Intra_8x8 chroma (4 modes, PR #14) — DC/H/V/Plane (4:2:0) Intra_8x8 luma (9 modes, PRs #21 + #22) — High profile, with 1-2-1 pre-filter 3 file moves via `git mv`, ~17 function renames stripping the `_ref` suffix. Test binaries rewired to link daedalus_core instead of compiling the (now moved) ref files directly. No code change — pure plumbing for substitution-arc consumers. 26 intra prediction modes total now in the public API after this PR. Verified on hertz: test_intra_pred_16x16: 5/5 PASS test_intra_pred_chroma8x8: 5/5 PASS test_intra_pred_8x8_luma: 11/11 PASS All via public symbols (test binaries linked against daedalus_core). Unblocks marfrit-packages substitution arc patch 0014 — wires H264PredContext.pred4x4[], pred16x16[], pred8x8[], pred8x8l[] through daedalus alongside the existing IDCT / deblock / qpel / DC Hadamard substitutions. After 0014 lands, the libavcodec.so built by marfrit-packages will have EVERY hot-path pixel-math kernel of an H.264 8-bit 4:2:0 decode routing through daedalus — the substitution arc is feature- complete for the campaign target (Pi 5 Firefox YouTube playback). |
||
|
|
df9e1c9d78 |
h264: promote Intra_4x4 luma prediction (9 modes) to public API
PR #12 added the 9 Intra_4x4 luma intra prediction modes as test-only spec references in tests/. This PR promotes them to public src/ symbols so consumers (the eventual marfrit-packages substitution-arc patch 0014) can link against them. Moved: tests/h264_intra_pred_4x4_ref.c → src/h264_intra_pred_4x4.c Renamed: daedalus_h264_pred_4x4_<mode>_ref → daedalus_h264_pred_4x4_<mode> (9 functions: vertical/horizontal/dc/ddl/ddr/vr/hd/vl/hu) The src/ implementation is byte-for-byte the same code as the test-only ref; this PR is plain plumbing. The test binary now links against daedalus_core to pull in the public symbols (instead of compiling the ref file directly), exercising the path that real consumers will use. Same promotion shape as PR #25 (chroma DC Hadamard). Verified on hertz: $ ./build/test_intra_pred_4x4 Vertical (mode 0) PASS Horizontal (mode 1) PASS DC (mode 2) PASS DiagDownLeft (mode 3) PASS DiagDownRight (mode 4) PASS VerticalRight (mode 5) PASS HorizontalDown (mode 6) PASS VerticalLeft (mode 7) PASS HorizontalUp (mode 8) PASS VR asym (sanity) PASS ALL 10 intra-4x4 mode references PASS $ nm -g build/libdaedalus_core.a | grep "T daedalus_h264_pred_4x4" (9 symbols exported) Follow-ups (same promotion pattern, can land in parallel): - Intra_16x16 luma (4 modes, PR #13) - Intra_8x8 chroma (4 modes, PR #14) - Intra_8x8 luma (9 modes, PRs #21 + #22) Once all 26 intra modes are in the public API, the marfrit-packages substitution arc can route H264PredContext's pred function pointer tables through daedalus alongside the IDCT / deblock / qpel / DC Hadamard substitutions already in place. |
||
|
|
1f07f3cd70 |
h264: expose chroma DC 2x2 Hadamard as public API
PR #23 added the Hadamard as a test-only spec reference; this PR promotes it to a public symbol in src/ so consumers (the eventual marfrit-packages substitution-arc patch 0011) can link against it. New: void daedalus_h264_chroma_dc_hadamard_2x2(int16_t c[4]); — operates in-place on 4 int16, no QP-dependent scaling (caller composes that themselves per §8.5.11.2). The src/ implementation is byte-for-byte identical to the test-only ref in tests/h264_chroma_dc_hadamard_ref.c (kept as a separate spec-validation copy). A new "public API parity" test case verifies the two produce identical output for a non-trivial input. Pure CPU primitive — no substrate-dispatch wrapper because the work is 4 adds + 4 subs; the substrate machinery would cost more than the kernel itself. Verified on hertz: $ ./build/test_chroma_dc_hadamard all-uniform 5 PASS col gradient [0,10,0,10] PASS row gradient [0,0,10,10] PASS anti-diagonal [10,0,0,10] PASS asymmetric [1,2,3,4] PASS sign-alternating [-5,5,-5,5] PASS double-Hadamard = 4*orig PASS public API parity vs _ref PASS ALL chroma DC Hadamard tests PASS $ nm -g build/libdaedalus_core.a | grep chroma_dc_hadamard 0000000000000000 T daedalus_h264_chroma_dc_hadamard_2x2 Unblocks marfrit-packages 0011 (substituting H264DSPContext.chroma_dc_dequant_idct, which composes the Hadamard + qmul scaling). |
||
|
|
ba5bbae8e2 |
bench: H.264 primitives NEON CPU baseline (1080p budget projection)
Adds bench_h264_primitives — a non-ctest binary that times the
H.264 pixel-math primitives at their representative per-frame N and
projects 1080p frame budgets. Lets us answer "how much of the
33-ms 30fps deadline does the pixel-math layer eat on NEON alone,
before the intercept patch adds entropy decode + metadata work."
Results on hertz (Pi 5 / 4×Cortex-A76, NEON path):
Per-kernel ns/op (CPU NEON):
IDCT 4x4 luma 10.78 ns/block
IDCT 8x8 luma 29.73 ns/block
Deblock luma_v 18.04 ns/edge
Deblock luma_h 41.65 ns/edge (H access pattern less SIMD-friendly)
qpel mc20 (H half-pel) 25.66 ns/block
qpel mc02 (V half-pel) 15.06 ns/block (faster than mc20!)
qpel mc22 (HV half-pel) 71.50 ns/block (cascaded H+V, expected)
Projected 1080p frame budgets (worst-case, CPU NEON only):
IDCT 4x4 (all-4x4 MBs): 1.41 ms (130,560 blocks)
IDCT 8x8 (all-8x8 MBs): 0.97 ms ( 32,640 blocks)
Deblock luma_v (all MBs): 0.59 ms ( 32,640 edges)
Deblock luma_h (all MBs): 1.36 ms ( 32,640 edges)
qpel mc22 (all 8x8 blocks): 2.33 ms ( 32,640 blocks)
Sum (IDCT 4x4 + deblock luma + MC all-mc22): 5.69 ms
30 fps deadline: 33.33 ms
Margin: +27.64 ms
What this validates:
- The "30fps@1080p is the fine floor" memory note holds with
huge headroom on the pixel-math layer alone. 17% of the
deadline goes to pixel math (worst case); 83% is available
for entropy decode + reference frame management + intra
prediction + chroma deblock + chroma IDCT + the libavcodec
intercept overhead.
- The CPU-vs-QPU substrate finding from earlier (PR #10 on
daedalus-decoder showed CPU NEON is 4x faster than QPU for
IDCT) is consistent here. All these kernels have CPU-only
recipes by default; the data suggests that's the right call
for now. The recipe substrate decision can be revisited
per-kernel once QPU shaders catch up.
- mc22 (2D HV half-pel) is the most expensive single qpel
position at ~71 ns/block — 2-7x more than the 1D variants.
Real B-slice biprediction with two mc22 calls per MB would
add ~4.7 ms/frame; still comfortable but worth knowing.
What this DOESN'T measure (intentionally — they aren't on the
critical path at NEON speeds):
- Chroma IDCT (4 cb + 4 cr 4x4 per MB). At similar ns/block to
luma, that's ~0.7 ms/frame.
- Chroma deblock (smaller tile, simpler kernel — sub-ms).
- Intra prediction (per-block, ~50 ops at NEON, but serialized
in z-scan order so cache-friendly; ~0.5 ms/frame estimate).
- bS=4 intra deblock variants — different algorithm, similar
cost to bS<4.
- chroma DC Hadamard — trivial.
Adding all of those in the worst case would maybe double the 5.69
ms number to ~12 ms. Still leaves 20+ ms for entropy decode +
metadata work in the intercept patch.
|
||
|
|
854bdeda20 |
h264: chroma DC 2x2 Hadamard pre-pass primitive
Adds the H.264 §8.5.11.1 chroma DC Hadamard transform. In 4:2:0
chroma, the four DC coefficients (one from each chroma 4x4 AC block
within an MB) go through a 2x2 Hadamard before quant-scaling and
before being added back to each block's [0,0] coefficient prior to
the 4x4 AC IDCT.
This PR ships the pure Hadamard transform:
f[0,0] = c[0,0] + c[0,1] + c[1,0] + c[1,1]
f[0,1] = c[0,0] - c[0,1] + c[1,0] - c[1,1]
f[1,0] = c[0,0] + c[0,1] - c[1,0] - c[1,1]
f[1,1] = c[0,0] - c[0,1] - c[1,0] + c[1,1]
implemented as the 2-stage row+col butterfly (1:1 with the NEON
SIMD shape upstream). Operates in-place on int16[4].
What this does NOT do (deferred to caller-side composition):
- QP-dependent scaling per §8.5.11.2. The scale depends on
QP_C (with chroma_qp_offset adjustment), so the formula has
branches (>=6 vs <6) and looks up LevelScale4x4 table values.
The libavcodec intercept patch composes Hadamard + scale +
shift itself since the scale shape varies by codec-level
context (slice header chroma_qp_offset, PPS chroma_qp_offset,
second_chroma_qp_offset for the chroma_qp_index_offset).
- Inverse transform (decode-time used for the FORWARD direction
is the same Hadamard up to scaling, but conceptually the spec
distinguishes them in §8.5.11; we expose only the matrix).
Test design (tests/test_chroma_dc_hadamard.c):
7 cases, all spec-derived hand-computations:
- all-uniform 5 → [20, 0, 0, 0]
- col gradient [0,10,0,10] → [20, -20, 0, 0]
- row gradient [0,0,10,10] → [20, 0, -20, 0]
- anti-diagonal [10,0,0,10] → [20, 0, 0, 20]
- asymmetric [1,2,3,4] → [10, -2, -4, 0]
- sign-alternating [-5,5,-5,5] → [0, -20, 0, 0]
- double-Hadamard invariant: H·H = 4·I, so applying twice
gives [4*c[0], 4*c[1], 4*c[2], 4*c[3]] for any input.
The double-Hadamard test is the strongest correctness gate: any
single sign error in the butterfly would break the H·H = 4·I
algebraic property, surfacing immediately. All 7 PASS first try.
Verified on hertz:
$ ./build/test_chroma_dc_hadamard
all-uniform 5 PASS
col gradient [0,10,0,10] PASS
row gradient [0,0,10,10] PASS
anti-diagonal [10,0,0,10] PASS
asymmetric [1,2,3,4] PASS
sign-alternating [-5,5,-5,5] PASS
double-Hadamard = 4*orig PASS
ALL chroma DC Hadamard tests PASS
With this primitive the H.264 8-bit 4:2:0 pixel-math primitive
matrix is complete in fourier:
- IDCT 4x4 (luma + chroma) ✓
- IDCT 8x8 (luma, High profile) ✓
- Chroma DC Hadamard 2x2 ✓ (this PR)
- Deblock (8 variants) ✓
- Intra prediction (26 modes) ✓
- MC qpel (30 dispatches) ✓
What remains for the libavcodec intercept patch: CABAC/CAVLC entropy
decode, SPS/PPS parsing, slice header parsing, MB type / QP / CBP /
intra mode prediction. All of that lives at the intercept layer
(it's spec-derived from the bitstream syntax, not pixel-math); the
intercept patch will call into these fourier primitives once the
metadata is decoded.
|
||
|
|
8bc6d27ea7 |
h264: Intra_8x8 luma prediction (High profile) — pre-filter + 3 modes
Adds the High-profile Intra_8x8 luma primitive set. Per H.264
§8.3.2.1, this is distinct from Intra_4x4 in two ways:
1. REFERENCE SAMPLE PRE-FILTER (§8.3.2.1.1). The 25 raw neighbour
samples are smoothed with a 1-2-1 filter BEFORE prediction.
Spec-defined boundary handling at corners and the right edge:
- top-left filt'd: (top[0] + 2*tl + left[0] + 2) >> 2
- top[0] filt'd: (tl + 2*t[0] + t[1] + 2) >> 2
- top[i] for 1..14: (t[i-1] + 2*t[i] + t[i+1] + 2) >> 2
- top[15] filt'd: (t[14] + 3*t[15] + 2) >> 2 ← 3× boundary
- left analogous, with l[7] using 3× boundary.
2. SCALE. All 9 prediction modes operate at 8x8 on the filtered
samples (Intra_4x4 is 4x4 on raw samples).
This PR ships the pre-filter + the 3 simple modes (V, H, DC):
- Mode 0 Vertical (§8.3.2.1.2): pred[r,c] = filt_top[c]
- Mode 1 Horizontal (§8.3.2.1.3): pred[r,c] = filt_left[r]
- Mode 2 DC (§8.3.2.1.4): ((sum_filt_top[0..7] + sum_filt_left[0..7]
+ 8) >> 4) broadcast
The 6 directional modes (DDL, DDR, VR, HD, VL, HU at 8x8 per
§8.3.2.1.5..§8.3.2.1.10) follow in a separate PR. They use the
same filtered samples; only the per-cell formula differs.
Test design (tests/test_intra_pred_8x8_luma.c):
- 3 uniform-context tests, one per mode (sanity).
- 2 gradient tests that exercise the pre-filter's interior +
boundary cases:
* Vertical with top = 0..15: spec arithmetic gives filtered
top[c] = c for c in 0..7 (gradient input → identity through
the 1-2-1 filter on the interior; boundaries arithmetically
verify too). Test expects pred[r,c] = c.
* Horizontal with left = 0..7: same arithmetic chain on the
left col. Test expects pred[r,c] = r.
Verified on hertz:
$ ./build/test_intra_pred_8x8_luma
Vertical (mode 0, uniform top) PASS
Horizontal (mode 1, uniform left) PASS
DC (mode 2, uniform) PASS
Vertical (mode 0, gradient) PASS (filtered gradient)
Horizontal (mode 1, gradient) PASS (filtered gradient)
ALL Intra_8x8 luma PASS (3 modes — V, H, DC)
The pre-filter being right first try is meaningful — the boundary
samples use a 3× weight rather than 2× (filt[top 15] = (t[14] +
3*t[15] + 2) >> 2), which is easy to forget when transcribing. The
gradient test would have surfaced any boundary mistake immediately.
Combined intra-prediction primitive coverage after this PR:
Intra_4x4 luma ✓ (9 modes, PR #12)
Intra_16x16 luma ✓ (4 modes, PR #13)
Intra_8x8 chroma ✓ (4 modes, PR #14)
Intra_8x8 luma △ (3 of 9 modes — V, H, DC ✓; DDL/DDR/VR/HD/VL/HU pending)
The 6 remaining Intra_8x8 luma directional modes are spec-mechanical
follow-ups; each is a ~30-line formula per §8.3.2.1.5+.
|
||
|
|
01f782cfaf |
h264: qpel avg — 12 remaining variants (closes the matrix)
Closes the H.264 8x8 qpel buildout. Adds the remaining 12 avg_
biprediction positions:
4 quarter-axis: avg_mc{10,30,01,03}
8 diagonals : avg_mc{11,12,13,21,23,31,32,33}
Each follows the established pattern: same half-pel formula as the
put_ sibling, then L2 average with the existing dst contents per
H.264 §8.4.2.3.1.
Scope:
- 12 new kernel enums (MC10..MC33 avg_ = 34..45) → CPU.
- 12 NEON externs for the vendored ff_avg_h264_qpel8_mc*_neon.
- 12 CPU dispatches via existing DEFINE_QPEL_CPU_DISPATCH macro.
- 12 public dispatches via DEFINE_QPEL_DISPATCH macro.
- 12 recipe wrappers via DEFINE_QPEL_RECIPE macro.
- 12 header decls via DECLARE_QPEL_AVG macro.
- tests/h264_qpel8_avg_rest_ref.c — references via two parametric
macros: DEFINE_AVG_QUARTER for the 4 ¼-pel L2 forms,
DEFINE_AVG_DIAG for the 8 two-half-pel-avg forms.
- Test harness extended with a RUN(MC) sub-macro that derives both
the ref name and dispatch name from the bare mcXX. (The ref
is daedalus_avg_h264_qpel8_<mc>_ref; the dispatch is
daedalus_recipe_dispatch_h264_qpel_avg_<mc>. Macro had a typo
on first try that duplicated "avg_" in the ref name — caught at
compile, fixed.)
Verified on hertz:
$ ./build/test_api_h264 | tail -12
H.264 qpel avg_mc10: 2048/2048 bytes bit-exact (100.0000%)
H.264 qpel avg_mc30: 2048/2048 bytes bit-exact (100.0000%)
H.264 qpel avg_mc01: 2048/2048 bytes bit-exact (100.0000%)
H.264 qpel avg_mc03: 2048/2048 bytes bit-exact (100.0000%)
H.264 qpel avg_mc11: 2048/2048 bytes bit-exact (100.0000%)
H.264 qpel avg_mc12: 2048/2048 bytes bit-exact (100.0000%)
H.264 qpel avg_mc13: 2048/2048 bytes bit-exact (100.0000%)
H.264 qpel avg_mc21: 2048/2048 bytes bit-exact (100.0000%)
H.264 qpel avg_mc23: 2048/2048 bytes bit-exact (100.0000%)
H.264 qpel avg_mc31: 2048/2048 bytes bit-exact (100.0000%)
H.264 qpel avg_mc32: 2048/2048 bytes bit-exact (100.0000%)
H.264 qpel avg_mc33: 2048/2048 bytes bit-exact (100.0000%)
All 12 new positions bit-exact PASS first try.
Final qpel matrix state:
put_: mc00 (none — integer copy)
mc01 ✓ mc02 ✓ mc03 ✓
mc10 ✓ mc11 ✓ mc12 ✓ mc13 ✓
mc20 ✓ (QPU+CPU) mc21 ✓ mc22 ✓ mc23 ✓
mc30 ✓ mc31 ✓ mc32 ✓ mc33 ✓
avg_: same 15-of-16 coverage, all CPU.
Every B-slice biprediction case the libavcodec intercept can throw
at us is now serviceable. QPU shaders remain mc20-only (cycle 9);
the other 29 positions are CPU NEON. Whether to write more QPU
shaders depends on real perf measurement — at NEON ~10 ns per
8x8 block, full qpel coverage at 1080p is ~2-3 ms of total work,
well inside budget.
|
||
|
|
1113953f97 |
h264: qpel avg anchors (avg_mc20/02/22, biprediction support)
Begins the avg_ qpel buildout for B-slice biprediction. Each avg_
form computes the same half-pel formula as its put_ sibling, then
L2-averages the result with the existing dst contents — the caller
pre-loads dst with the list0 prediction; the avg_ call adds list1
per H.264 §8.4.2.3.1.
Scope (3 anchors, sets the pattern for the remaining 13 avg_
variants):
- 3 new kernel enums (AVG_MC20=31, AVG_MC02=32, AVG_MC22=33) → CPU.
- 3 NEON externs for the vendored ff_avg_h264_qpel8_{mc20,mc02,mc22}_neon.
- 3 CPU dispatches via existing DEFINE_QPEL_CPU_DISPATCH macro
(the macro is type-agnostic so it didn't need changes for avg_).
- 3 public dispatches via DEFINE_QPEL_DISPATCH macro.
- 3 recipe wrappers via DEFINE_QPEL_RECIPE macro.
- tests/h264_qpel8_avg_anchors_ref.c — per-cell helpers + L2 avg.
- Test harness: run_avg_qpel() seeds dst with random content so
the L2 averaging is actually exercised (not just put_-style
overwrite that would silently pass).
Verified on hertz:
$ ./build/test_api_h264 | tail -3
H.264 qpel avg_mc20: 2048/2048 bytes bit-exact (100.0000%)
H.264 qpel avg_mc02: 2048/2048 bytes bit-exact (100.0000%)
H.264 qpel avg_mc22: 2048/2048 bytes bit-exact (100.0000%)
All 3 anchors bit-exact PASS first try.
Why anchors only in this PR: the avg_ pattern is uniform across all
16 positions (each is just "put_ result + L2 with dst"). Landing
the anchors first confirms the macro pattern works for both put_
and avg_; the remaining 13 (avg_mc10/30/01/03 + avg_mc11..33) follow
the same template in a follow-up PR.
State of the qpel matrix after this PR:
put_ : 15 of 16 positions ✓ (mc00 is integer copy, no wrapper)
avg_ : 3 of 16 positions ✓ (mc20, mc02, mc22 anchors)
13 follow-up positions
|
||
|
|
0894a46114 |
h264: qpel diagonals — 8 positions (mc11/12/13/21/23/31/32/33)
Closes the qpel buildout. All 8 remaining diagonal positions land
in one PR. Each is the rounded average of two half-pel intermediates
per H.264 §8.4.2.2.1 / Table 8-4, with the decomposition matching
the FFmpeg .S reference structure (verified by reading
external/ffmpeg-snapshot/.../h264qpel_neon.S lines 622-758).
Decomposition table (the formula for each output cell at (r,c)):
mc11 ¼¼ : avg(mc20[r, c], mc02[r, c])
mc12 ¼½ : avg(mc22[r, c], mc02[r, c])
mc13 ¼¾ : avg(mc20[r+1, c], mc02[r, c])
mc21 ½¼ : avg(mc22[r, c], mc20[r, c])
mc23 ½¾ : avg(mc22[r, c], mc20[r+1, c])
mc31 ¾¼ : avg(mc20[r, c], mc02[r, c+1])
mc32 ¾½ : avg(mc22[r, c], mc02[r, c+1])
mc33 ¾¾ : avg(mc20[r+1, c], mc02[r, c+1])
The (r±1, c±1) offsets capture the position-dependent shift that
the FFmpeg .S encodes by pre-incrementing x1 (src pointer) before
branching into the common mc11/mc21 code paths.
Scope (tightly macro-ised):
- 8 new kernel enums (MC11..MC33 = 23..30) → CPU.
- 8 NEON externs for the vendored ff_put_h264_qpel8_mc*_neon.
- 8 CPU dispatches via existing DEFINE_QPEL_CPU_DISPATCH macro.
- 8 public dispatches via DEFINE_QPEL_DISPATCH macro.
- 8 recipe wrappers via DEFINE_QPEL_RECIPE macro.
- Header decls condensed via a DECLARE_QPEL_DIAG macro that
expands to both recipe + dispatch decls per name.
- C references via DEFINE_DIAG_REF macro: each ref is a 6-line
wrapper around the per-cell hpel_h / hpel_v / hpel_hv helpers
(the latter being the per-cell version of mc22's 13-row int16
tmp[] computation).
- Test wrapper: test_qpel_diag_all() drives all 8 through the
existing run_quarter_axis_qpel() harness.
Verified on hertz (Pi 5 / V3D 7.1):
$ ./build/test_api_h264 | tail -8
H.264 qpel mc11: 2048/2048 bytes bit-exact (100.0000%)
H.264 qpel mc12: 2048/2048 bytes bit-exact (100.0000%)
H.264 qpel mc13: 2048/2048 bytes bit-exact (100.0000%)
H.264 qpel mc21: 2048/2048 bytes bit-exact (100.0000%)
H.264 qpel mc23: 2048/2048 bytes bit-exact (100.0000%)
H.264 qpel mc31: 2048/2048 bytes bit-exact (100.0000%)
H.264 qpel mc32: 2048/2048 bytes bit-exact (100.0000%)
H.264 qpel mc33: 2048/2048 bytes bit-exact (100.0000%)
ALL 8 diagonal positions bit-exact PASS first try. Meaningful
because the position-dependent (r±1, c±1) source offsets are easy
to get wrong by transcription, and any of them would surface on
random inputs immediately.
After this PR the H.264 qpel 8x8 put_ matrix is complete:
mc00 mc01 mc02 mc03
mc10 mc11 mc12 mc13
mc20 mc21 mc22 mc23
mc30 mc31 mc32 mc33
15 of 16 positions exposed through the daedalus API; mc00 is just
integer copy and rarely needs a dispatch wrapper (libavcodec sets
the function pointer table directly). mc20 retains its QPU shader
(cycle 9 / v3d_h264_qpel_mc20.spv); all other 14 are CPU NEON.
What this does NOT cover (still in backlog):
- avg_ variants (the "add" form for biprediction, 16 more
positions). Currently the API only exposes put_.
- 16x16 qpel (separate function family in FFmpeg; the 8x8 path
can be used twice to substitute when 16x16 isn't critical).
- QPU shaders for any qpel position other than mc20.
|
||
|
|
e01f7bc7c6 |
h264: qpel single-axis quarter-pel — mc10/mc30/mc01/mc03 (CPU/NEON)
Closes the 4 single-axis quarter-pel positions in one PR. Each is
a half-pel lowpass clipped to u8 followed by L2 rounded-average
with an integer-aligned source pixel per H.264 §8.4.2.2.1:
mc10 ¼-H ("a" pos): clip255(mc20(s)) avg src[r,c]
mc30 ¾-H ("c" pos): clip255(mc20(s)) avg src[r,c+1]
mc01 ¼-V ("d" pos): clip255(mc02(s)) avg src[r,c]
mc03 ¾-V ("n" pos): clip255(mc02(s)) avg src[r+1,c]
The mc10/mc30 pair and mc01/mc03 pair only differ in WHICH integer
source pixel they average with — the half-pel computation is the
same. Putting them in one PR is justified by that uniformity.
Scope:
- 4 new kernel enums: MC10=19, MC30=20, MC01=21, MC03=22 → CPU.
- 4 NEON externs for the vendored ff_put_h264_qpel8_mc{10,30,01,03}_neon.
- 4 CPU dispatch wrappers via DEFINE_QPEL_CPU_DISPATCH macro
(collapses ~50 LOC of repetition).
- 4 public dispatch fns via DEFINE_QPEL_DISPATCH macro.
- 4 recipe wrappers via DEFINE_QPEL_RECIPE macro.
- tests/h264_qpel8_quarter_axis_ref.c covers all four via shared
hpel_h() / hpel_v() inlines + per-mode L2 average.
- Test refactor: generic run_quarter_axis_qpel() harness exercises
all 4 positions through a single helper (~50 LOC for 4 tests vs
~200 if each was hand-rolled).
Verified on hertz:
$ ./build/test_api_h264 | tail -8
H.264 deblock chroma h intra: 256/256 bytes bit-exact (100.0000%)
H.264 qpel mc20: 1024/1024 bytes bit-exact (100.0000%)
H.264 qpel mc02: 2048/2048 bytes bit-exact (100.0000%)
H.264 qpel mc22: 2048/2048 bytes bit-exact (100.0000%)
H.264 qpel mc10: 2048/2048 bytes bit-exact (100.0000%)
H.264 qpel mc30: 2048/2048 bytes bit-exact (100.0000%)
H.264 qpel mc01: 2048/2048 bytes bit-exact (100.0000%)
H.264 qpel mc03: 2048/2048 bytes bit-exact (100.0000%)
All 4 new positions bit-exact PASS first try.
Coverage matrix update:
put_ mc00 mc10 mc20 mc30
mc01 — ✓ — ✓
mc11 — — ✓ — ← this row
mc21 — — — —
mc31 — — — —
mc02 — — ✓ — ← mc02 + mc22 anchor
mc03 — — ✓ —
After this PR: 7 of 16 single-axis + diagonal positions done.
Remaining 9 are the off-axis quarter-pel combinations
(mc11/mc12/mc13/mc21/mc23/mc31/mc32/mc33) — each combines a 2D
lowpass intermediate with L2 averaging against a 1D-lowpass output.
Next PR scope.
Why no QPU shaders: same R-band logic as the prior CPU additions.
At ~10 ns per 8x8 NEON block, all 16 qpel positions together
would land in ~1.3 ms/frame at 1080p worst case — comfortably
inside the 33 ms budget. QPU shader for mc20 already exists
(cycle 9 / v3d_h264_qpel_mc20.spv); the other 15 follow once a
clear perf reason emerges.
|
||
|
|
20a4299c5c |
h264: qpel mc22 (2D half-pel, CPU/NEON)
Adds the "j position" 2D half-pel via cascaded H + V 6-tap lowpass
with intermediate 16-bit precision per H.264 §8.4.2.2.1. One of the
most common qpel positions in real H.264 streams — many encoders
emit 1/2-1/2 motion vectors as their best-RD choice.
Algorithmically distinct from the 1D mc20/mc02 siblings:
- Horizontal 6-tap produces 13 rows of int16 intermediate (no
per-stage clip/round — full precision retained).
- Vertical 6-tap on the intermediate, then +512 >> 10 (the
double-shift compensates for both 6-tap scalings) + clip255.
The intermediate-precision requirement means the C reference can't
just be "call mc20 then mc02" — that would double-clip and produce
the wrong result. The 13-row int16 tmp[] buffer is the central
invariant.
Scope (same pattern as mc02 PR #15):
- Public API: daedalus_dispatch_h264_qpel_mc22 + recipe wrapper.
- Internal: dispatch_h264_qpel_mc22_cpu calling
ff_put_h264_qpel8_mc22_neon.
- Recipe table: DAEDALUS_KERNEL_H264_QPEL_MC22 = 18 → CPU.
- C reference: tests/h264_qpel8_mc22_ref.c — explicit tmp[13][8]
int16 staging buffer; spec-derived shifts and rounding.
- Test: test_qpel_mc22 in test_api_h264, 8 tiles at 16×16 with
output positioned at (SRC_ROW=3, SRC_COL=3) so the kernel's
[-2 .. +10] read window stays in-tile.
Verified on hertz:
$ ./build/test_api_h264 | tail -5
H.264 deblock chroma v intra: 256/256 bytes bit-exact (100.0000%)
H.264 deblock chroma h intra: 256/256 bytes bit-exact (100.0000%)
H.264 qpel mc20: 1024/1024 bytes bit-exact (100.0000%)
H.264 qpel mc02: 2048/2048 bytes bit-exact (100.0000%)
H.264 qpel mc22: 2048/2048 bytes bit-exact (100.0000%)
All 13 H.264 kernels in api_smoke now bit-exact PASS.
mc22 being right first try is meaningful — the +512 >> 10 scaling
+ int16 intermediate sequence has multiple sign/shift/clip pitfalls
and any of them would surface on random inputs immediately.
Coverage matrix update:
put_ mc20 ✓ (QPU+CPU) put_ mc02 ✓ (CPU) put_ mc22 ✓ (CPU)
→ 12 single put_ positions still missing (¼/¾ + HV combos with
L2 averaging).
|
||
|
|
c3301b0c2e |
h264: qpel mc02 (vertical half-pel, CPU/NEON)
Mirror of cycle 9's mc20 transposed to vertical orientation. Wires
up the second qpel half-pel position via the vendored
ff_put_h264_qpel8_mc02_neon symbol, closes the "missing vertical
sibling" gap that mc20 left open since cycle 9.
Scope:
- Public API: daedalus_dispatch_h264_qpel_mc02 + recipe wrapper.
- Internal: dispatch_h264_qpel_mc02_cpu calling the NEON entry.
- Recipe table: DAEDALUS_KERNEL_H264_QPEL_MC02 = 17 → CPU.
Explicit SUBSTRATE_QPU returns -1 (no shader yet).
- C reference: tests/h264_qpel8_mc02_ref.c — vertical 6-tap
transpose of mc20 (reads src[(r±N)*stride + c] instead of
src[r*stride + c±N]).
- Test: test_qpel_mc02 in test_api_h264, 8 tiles × 16×16 cols
× 16 rows, random input, bit-exact compare against the C ref.
Verified on hertz:
$ ./build/test_api_h264
...
H.264 qpel mc20: 1024/1024 bytes bit-exact (100.0000%)
H.264 qpel mc02: 2048/2048 bytes bit-exact (100.0000%)
All 12 H.264 kernels in the api_smoke now bit-exact PASS.
Why CPU-only: same R-band logic as the deblock _h sibling pattern.
mc02 at ~7.6 ns per 8x8 block on NEON (per the cycle 9 baseline
measurements) gives ~700 us for 8160 MBs × 4 8x8 luma blocks at
1080p — comfortably inside the 33 ms budget. QPU shader is a
fast-follow once the V vs H shader work is consolidated (the
transpose for the V shader is not mechanical — different SIMD
access pattern than the H shader).
Coverage matrix update:
qpel position put_ status avg_ status
------------- ----------- -----------
mc00 (copy) not wired not wired
mc10 (¼-H) not wired not wired
mc20 (½-H) ✓ QPU+CPU not wired
mc30 (¾-H) not wired not wired
mc01 (¼-V) not wired not wired
mc02 (½-V) ✓ CPU not wired (this PR)
mc03 (¾-V) not wired not wired
mc11..mc33 not wired not wired
13 more qpel positions to go for the full put_ matrix. Adding them
follows the same template; each is a small contained PR.
|
||
|
|
d7100459f2 |
h264: Intra_8x8 chroma prediction — 4-mode C reference + spec gates
Third intra-prediction primitive after PR #12 (Intra_4x4 luma) and PR #13 (Intra_16x16 luma). Covers Intra_8x8 chroma per H.264 §8.3.3: 4 modes used for BOTH Cb and Cr planes at 4:2:0. Mode quirks worth flagging in code review: - Mode 0 DC is asymmetric per quadrant. The 8x8 chroma block splits into four 4x4 quadrants with different DC formulas: (0,0) top-left : (sum_top[0..3] + sum_left[0..3] + 4) >> 3 (0,1) top-right : (sum_top[4..7] + 2) >> 2 (1,0) bot-left : (sum_left[4..7] + 2) >> 2 (1,1) bot-right : (sum_top[4..7] + sum_left[4..7] + 4) >> 3 The top-right quadrant deliberately IGNORES the top-left half even though it's available — that's per spec §8.3.3.2. - Mode 3 Plane uses slope coefficient 34 (not 5 like Intra_16x16 luma). Centre is (x-3, y-3) instead of (x-7, y-7). Sums span 4 differences instead of 8. Easy to copy-paste-bug from the luma Plane if you don't notice the constants change. Test highlights: - DC quadrants: distinct expected values per quadrant (16, 16, 40, 28 from asymmetric top/left halves) — any quadrant mix-up would surface immediately. Hand-derived from the formulas in the test comment. - Plane uniform: all-100 context → all-100 output (a = 3200, H = V = 0, (3200+16) >> 5 = 100 exactly). - Plane gradient: top + left = 0..7, hand-derives pred[0][0] = 1 and pred[7][7] = 15 via the full arithmetic chain (H = V = 56, b = c = 30, a = 224). Same hand-traced spec-walkthrough as the Intra_16x16 Plane gradient test. Verified on hertz: $ ./build/test_intra_pred_chroma8x8 Horizontal (mode 1) PASS Vertical (mode 2) PASS DC quadrants (mode 0) PASS Plane uniform (mode 3) PASS Plane gradient (mode 3) PASS (corners 1, 15) ALL Intra_8x8 chroma mode references PASS All 5 tests PASS first try. The DC quadrant correctness is meaningful (4 different formulas in one kernel) and the Plane gradient corners validate the slope=34 + centre=(x-3,y-3) constants vs the luma equivalents. Combined coverage after this PR: - Intra_4x4 luma: 9 modes ✓ (PR #12, all 9 PASS) - Intra_16x16 luma: 4 modes ✓ (PR #13, all 5 tests PASS) - Intra_8x8 chroma: 4 modes ✓ (this PR, all 5 tests PASS) - Intra_8x8 luma (High profile): 9 modes + smoothing — pending. Remaining backlog: Intra_8x8 luma (High profile, 9 modes + 1-2-1 smoothing pre-filter — distinct algorithm from Intra_4x4 because of the pre-filter), neighbour-availability fallback, dispatch wrappers. |
||
|
|
c43ee84d8e |
h264: Intra_16x16 luma prediction — 4-mode C reference + spec gates
Second piece of the intra-prediction primitive set after PR #12 (Intra_4x4 luma 9 modes). Covers the Intra_16x16 luma MB type per H.264 §8.3.2: 4 modes (Vertical, Horizontal, DC, Plane). Scope: - tests/h264_intra_pred_16x16_ref.c — 4 spec-derived modes. Same FFmpeg-style interface as the 4x4 sibling: void daedalus_h264_pred_16x16_<name>_ref(uint8_t *dst, ptrdiff_t stride); Assumes all neighbours valid (interior-MB case). The Plane mode is the algorithmically heaviest of the four — spec §8.3.2.4 has two slope sums (H, V) over the asymmetric top/left contexts, a clipped quadratic evaluation per cell, and a top-left-corner participant at i=7 / j=7. Implementation follows the spec straightforwardly with `clip_u8` on the final saturating cast. - tests/test_intra_pred_16x16.c — 5 test cases: * V, H, DC: standard contexts (gradient top / gradient left / small uniform pair). * Plane (uniform): all neighbours = 100 → H = V = 0 → output = (16*200 + 16) >> 5 = 100. Verifies the orientation-free portion of the formula. * Plane (gradient): top + left both 0..15, spec-derived corner expectations pred[0][0] = 1 and pred[15][15] = 31. The arithmetic chain (H = V = 400 → b = c = 31, a = 480) is fully hand-traced in the test comment so the expected values are auditable. - CMakeLists.txt — new test_intra_pred_16x16 binary; pure-CPU library, no daedalus_core dependency (same separation as the 4x4 ref). Verified on hertz: $ ./build/test_intra_pred_16x16 Vertical (mode 0) PASS Horizontal (mode 1) PASS DC (mode 2) PASS Plane (mode 3, uniform) PASS Plane (mode 3, gradient) PASS (corners 1, 31) ALL Intra_16x16 mode references PASS Plane mode being right first try is meaningful — H/V sums, b/c slope shifts, and the a-baseline arithmetic have many sign / index error opportunities. The asymmetric gradient test would have caught any of them; it didn't. What this does NOT cover (still in the intra-pred backlog): - Intra_8x8 chroma (4 modes per H.264 §8.3.3). - Intra_8x8 luma (High profile, 9 modes per §8.3.2.1 + the 1-2-1 smoothing pre-filter — distinct algorithm from Intra_4x4). - Neighbour-availability fallback for boundary MBs. - Dispatch wrappers (same architectural question as before — wait for decoder Stage 2a strategy decision). |
||
|
|
ce6703a862 |
h264: Intra_4x4 luma prediction — 9-mode C reference + spec gates
Lays the bit-exact gate for H.264 §8.3.1.4 Intra_4x4 luma prediction.
Spec-derived C reference covering all 9 modes; standalone test
exercises each against hand-computed expected 4x4 patterns.
Why fourier (not the decoder) gets this: it's a reusable spec-level
primitive — both daedalus-decoder (Phase 1 Stage 2a intra prediction)
and any future shader work will need the same bit-exact reference.
Putting it in fourier alongside the IDCT / deblock refs keeps the
"spec implementations" library cohesive.
Why CPU C reference, not NEON or QPU: the vendored FFmpeg snapshot
(external/ffmpeg-snapshot/libavcodec/aarch64/) has h264dsp/idct/qpel
but NOT h264pred. Vendoring h264pred_neon.S would expand the snapshot
surface; deferring that pending real perf data. Per the cycle 9
NEON benches that take ~5 ns per 8x8 qpel block, intra prediction
at ~5 ns per 4x4 block × 16 blocks/MB × 8160 MBs = ~650 us/frame at
1080p — well inside budget even at NEON, and much further inside at
plain C. Not the critical-path concern.
Scope:
- tests/h264_intra_pred_4x4_ref.c — 9 prediction modes per
H.264 spec §8.3.1.4 sub-clauses, FFmpeg-style interface:
void daedalus_h264_pred_4x4_<name>_ref(uint8_t *dst, ptrdiff_t stride);
Reads top/top-right/left/top-left neighbours from dst[-stride/-1]
offsets, writes 4×4 output at dst[0..3][0..3]. Assumes all 13
neighbour bytes are valid (interior-MB case; availability
fallbacks are caller-side per spec).
- tests/test_intra_pred_4x4.c — 10 cases:
* 9 uniform-context degenerate tests (one per mode), establishing
that nothing is structurally broken (all output cells must
equal the uniform input value).
* 1 asymmetric Vertical_Right sanity test with 16 distinct
expected cells hand-computed from spec §8.3.1.4.6 — the
"really exercise orientation + row/col arithmetic" gate.
- CMakeLists.txt — new test_intra_pred_4x4 binary (no daedalus_core
dependency; pure-CPU library doesn't need a context to construct).
Verified on hertz:
$ ./build/test_intra_pred_4x4
Vertical (mode 0) PASS
Horizontal (mode 1) PASS
DC (mode 2) PASS
DiagDownLeft (mode 3) PASS
DiagDownRight (mode 4) PASS
VerticalRight (mode 5) PASS
HorizontalDown (mode 6) PASS
VerticalLeft (mode 7) PASS
HorizontalUp (mode 8) PASS
VR asym (sanity) PASS
ALL 10 intra-4x4 mode references PASS
The VR asym test passed first try; the DC test fell on the first
attempt because my test expectation miscomputed the rounding shift
(I wrote 4, actual is 2 = (16+4)>>3). Fixed in the test. Reference
itself never had the bug.
What this does NOT cover (next-step backlog):
- Intra_16x16 luma prediction (4 modes per H.264 §8.3.2): vertical,
horizontal, DC, plane.
- Intra_8x8 chroma prediction (4 modes per H.264 §8.3.3): DC,
horizontal, vertical, plane.
- Intra_8x8 luma prediction (High profile, 9 modes per §8.3.2.1) —
these are the High-profile siblings of the modes in this PR with
the 1-2-1 smoothing pre-filter. Different but well-defined.
- Neighbour availability fallback (top-edge MB, left-edge MB,
slice-boundary, top-right unavailable in some positions).
- Dispatch wrappers — these refs aren't surfaced through
daedalus_dispatch_*(). Whether to do that depends on the
daedalus-decoder Stage 2a architecture (per-block CPU vs
per-diagonal GPU wavefront — TBD).
|
||
|
|
9b1c106dc5 |
h264: deblock bS=4 intra variants (luma + chroma, V + H)
Closes the deblock matrix: adds the four bS=4 intra-strength loop
filters used at I-MB edges (and other boundaries where H.264
§8.7.2.1 forces boundary strength to 4). After this PR fourier
covers all 8 standard 8-bit 4:2:0 deblock combinations:
bS<4 bS=4
----- -----
luma_v ✓ (cycle 8 QPU) ✓ (CPU)
luma_h ✓ (CPU, PR #9) ✓ (CPU)
chrm_v ✓ (CPU, PR #10) ✓ (CPU)
chrm_h ✓ (CPU, PR #10) ✓ (CPU)
Scope:
- 4 new kernel enums (LV_INTRA=13, LH_INTRA=14, CV_INTRA=15,
CH_INTRA=16), all → CPU substrate in the recipe table.
- 4 new public dispatch fns + 4 recipe wrappers (defined via two
DEFINE_INTRA_DISPATCH / DEFINE_INTRA_RECIPE macros to keep the
boilerplate tight).
- 4 new extern decls for the vendored
ff_h264_{v,h}_loop_filter_{luma,chroma}_intra_neon symbols.
- C reference: tests/h264_intra_loop_filter_ref.c covers all four
orientations. Algorithm per H.264 §8.7.2.3:
Luma: per-side strong/weak filter selector
strong_p = (|p2-p0| < β) AND (|p0-q0| < (α>>2)+2)
strong_q = (|q2-q0| < β) AND (|p0-q0| < (α>>2)+2)
Strong updates p0/p1/p2 (and mirror); weak updates p0 only.
Chroma: always weak, only p0/q0 updated.
- daedalus_h264_deblock_meta is REUSED for intra dispatches; the
tc0[] field is ignored (bS=4 hardcodes the strength). Callers
can build a single edge list and route by kernel without an
extra struct.
- Test refactor: an intra_test_spec table + run_intra_test helper
drives all four orientations through one harness, keeping the
new test surface compact (~50 LOC for 4 kernels vs ~200 if each
had its own test_deblock_*_intra fn).
Verified on hertz (Pi 5 / V3D 7.1):
$ ./build/test_api_h264
=== Phase 8a API smoke: H.264 kernels via recipe dispatch ===
...
H.264 deblock luma v intra: 1024/1024 bytes bit-exact (100.0000%)
H.264 deblock luma h intra: 1024/1024 bytes bit-exact (100.0000%)
H.264 deblock chroma v intra: 256/256 bytes bit-exact (100.0000%)
H.264 deblock chroma h intra: 256/256 bytes bit-exact (100.0000%)
...
All 11 H.264 kernels bit-exact PASS — the deblock matrix is closed.
The bit-exact match on first try is meaningful for these kernels:
the strong/weak filter selector + per-side asymmetry would have
surfaced any sign / shift / rounding mistake immediately. The
C reference is now a usable spec checkpoint for the eventual QPU
shader work.
QPU shader follow-up: not in this PR. The intra path's 3-cell
per-side update + strong/weak branch is structurally more complex
than the bS<4 path that already has a V shader (v3d_h264deblock.spv).
Per the prior R-band logic for deblock, intra edges are < 20% of
total deblock work at typical bit-rates, so NEON-only at ~ 10 ns/edge
fits comfortably in the budget.
|
||
|
|
a5c47aa51c |
h264: deblock chroma_v + chroma_h (CPU/NEON, bS<4)
Continues the deblock buildout after PR #9 (luma_h). Adds the two chroma orientations via the same recipe-table-routed-to-CPU pattern; QPU shaders for chroma deblock are still a follow-up. Scope: - Public API: 4 new fns (dispatch + recipe wrapper × {v, h}). - Internal: dispatch_h264_deblock_chroma_{v,h}_cpu calling the vendored ff_h264_{v,h}_loop_filter_chroma_neon symbols. - Recipe table: DAEDALUS_KERNEL_H264_DEBLOCK_CV = 11, DAEDALUS_KERNEL_H264_DEBLOCK_CH = 12, both → CPU. Explicit SUBSTRATE_QPU returns -1 (no shader yet). - C reference: tests/h264_chroma_loop_filter_ref.c — covers both orientations. Algorithm per H.264 §8.7.2.4 (bS<4 chroma inter): tC = tc0_seg + 1 (no luma-style ap/aq side bonus); only p0/q0 are updated (chroma never modifies p1/p2/q1/q2). - Tests: test_deblock_chroma_v (8x4 tile, edge at row 2) + test_deblock_chroma_h (4x8 tile, edge at col 2), 4 segments x 2 cells per segment per spec. Verified on hertz (Pi 5 / V3D 7.1): $ ./build/test_api_h264 === Phase 8a API smoke: H.264 kernels via recipe dispatch === H264_IDCT4 recipe substrate: 2 (1=CPU, 2=QPU) H264_IDCT8 recipe substrate: 2 H264_DEBLOCK_LV recipe substrate: 2 H264_QPEL_MC20 recipe substrate: 2 H264_DEBLOCK_LH recipe substrate: 1 (CPU, no QPU H shader yet) H264_DEBLOCK_CV recipe substrate: 1 (CPU) H264_DEBLOCK_CH recipe substrate: 1 (CPU) H.264 IDCT 4x4: 2048/2048 bytes bit-exact (100.0000%) H.264 IDCT 8x8: 2048/2048 bytes bit-exact (100.0000%) H.264 deblock luma v: 2048/2048 bytes bit-exact (100.0000%) H.264 deblock luma h: 1024/1024 bytes bit-exact (100.0000%) H.264 deblock chroma v: 256/256 bytes bit-exact (100.0000%) H.264 deblock chroma h: 256/256 bytes bit-exact (100.0000%) H.264 qpel mc20: 1024/1024 bytes bit-exact (100.0000%) All 7 kernels bit-exact PASS. Chroma test sizes are smaller (256 bytes per orientation) because the per-MB chroma deblock surface is smaller than luma — accurate to the production geometry. Why no QPU shader yet (per the established pattern): - Chroma deblock is ~25% of total deblock work at 4:2:0 (one quarter the pixel count of luma per MB) — modest QPU win even after the shader exists. - Same R-band considerations as the luma _h follow-up: the V shader transpose isn't mechanical, and the 8-cell tile is small enough that NEON's per-edge cost (~3 ns) is already inside the budget. - Total bench at 1080p: 8160 MBs × 4 chroma edges × 3 ns = ~100 us. Negligible compared to the IDCT layer's 10 ms (CPU NEON). Now coverage in fourier for the bS<4 8-bit 4:2:0 deblock matrix is complete: luma_v ✓, luma_h ✓, chroma_v ✓, chroma_h ✓. Remaining deblock work: bS=4 intra variants (luma + chroma, V + H). What this unblocks downstream: - daedalus-decoder Stage 4 deblock can now dispatch all four bS<4 edge categories that a typical inter MB needs. |
||
|
|
9d5451e0fe |
h264: deblock_luma_h — CPU/NEON via vendored ff_h264_h_loop_filter
Adds the horizontal-edge sibling of cycle 8's deblock_luma_v. The
vendored FFmpeg snapshot already includes ff_h264_h_loop_filter_luma_neon
in libavcodec/aarch64/h264dsp_neon.S — this PR wires up the symbol,
the bit-exact reference, and the recipe-table entry so daedalus-decoder
and other consumers can call the H variant through the same dispatch
shape they use for _v.
Scope:
- Public API: daedalus_dispatch_h264_deblock_luma_h(ctx, sub, ...)
+ daedalus_recipe_dispatch_h264_deblock_luma_h(ctx, ...) wrapper.
- Internal: dispatch_h264_deblock_h_cpu() calls the NEON entry.
- Recipe table: new DAEDALUS_KERNEL_H264_DEBLOCK_LH = 10, mapped
to DAEDALUS_SUBSTRATE_CPU until a QPU shader is written. An
explicit SUBSTRATE_QPU request on the H dispatch returns -1
(fails fast, no silent CPU degradation).
- C reference: tests/h264_h_loop_filter_luma_ref.c — the
column-axis transpose of h264_deblock_ref.c. Same per-segment
kernel; pix[-4..+3] accesses cols instead of rows*stride.
- Test: test_api_h264 grows a test_deblock_h() with 8 tiles
(8 cols x 16 rows each, edge at col 4), random alpha/beta/tc0;
compares NEON dispatch against reference byte-for-byte.
Verified on hertz (Pi 5 / V3D 7.1):
$ ./build/test_api_h264
=== Phase 8a API smoke: H.264 kernels via recipe dispatch ===
H264_IDCT4 recipe substrate: 2 (1=CPU, 2=QPU)
H264_IDCT8 recipe substrate: 2
H264_DEBLOCK_LV recipe substrate: 2
H264_QPEL_MC20 recipe substrate: 2
H264_DEBLOCK_LH recipe substrate: 1 (CPU, no QPU H shader yet)
H.264 IDCT 4x4: 2048/2048 bytes bit-exact (100.0000%)
H.264 IDCT 8x8: 2048/2048 bytes bit-exact (100.0000%)
H.264 deblock luma v: 2048/2048 bytes bit-exact (100.0000%)
H.264 deblock luma h: 1024/1024 bytes bit-exact (100.0000%)
H.264 qpel mc20: 1024/1024 bytes bit-exact (100.0000%)
All 5 kernels bit-exact PASS. The new H variant joins the suite
with 1024 random-input bytes per tile x 8 tiles.
Why CPU-only for now: the daedalus-decoder downstream needs the H
edge dispatched somewhere — even at CPU NEON cost (~6 ns/edge per
the cycle 8 M3 baseline) a frame's worth at 1080p is
~ 8160 MBs * 4 edges = 32 640 edges = ~200 us — well inside the
30 fps budget. Writing the V3D H-edge shader is a follow-up
(would be cycle 8' or similar; the V-edge shader's transpose isn't
mechanical because of how the workgroup organisation maps to columns
vs rows).
Backlog addition (out of scope for this PR):
- V3D shader for the H variant (mirror of v3d_h264deblock.spv).
- bS=4 intra-strength filter (different algebra; both _v and _h).
- Chroma deblock luma_v/_h (8-cell variants).
|
||
|
|
79553c6e22 |
cycle 9: V3D shader for H.264 luma qpel mc20 — 9/9 QPU coverage
Closes the QPU-default substrate campaign per the 2026-05-23
decree: every daedalus-fourier kernel that can be done in QPU
is now done in QPU. Cycle 9 is the last piece — 6-tap horizontal
half-pel luma motion compensation, H.264 §8.4.2.2.1.
Shader (src/v3d_h264_qpel_mc20.comp):
- local_size = 64, 1 lane per output pixel of one 8x8 block,
1 block per workgroup. Simplest layout that avoids any
inter-lane communication — V3D's L2 cache handles the
redundant reads from adjacent lanes computing adjacent
output columns.
- Per-pixel: read 6 src samples (cols c-2..c+3 in row r),
apply the (1, -5, 20, 20, -5, 1) / 32 filter with +16
rounding, clip to u8, write one dst byte.
- Single-stride convention matches FFmpeg's H264QpelContext
(dst and src share `stride`; src+src_off points at output
col 0 with the caller-guaranteed -2/+3 padding).
Dispatch wiring (src/daedalus_core.c):
- h264_qpel_mc20_pipe field on daedalus_ctx, lazy init.
- dispatch_h264_qpel_mc20_qpu(): 3 SSBOs (src / dst / meta),
src_max = src_off + 7*stride + 11 (covers the +3-col read
footprint on the last row), dst_max = dst_off + 7*stride + 8.
1 block per WG.
- daedalus_dispatch_h264_qpel_mc20() replaces ROUTE_CPU_ONLY
with the substrate-switch pattern matching the other H.264
kernels.
- Recipe table: H264_QPEL_MC20 returns SUBSTRATE_QPU.
Verification (hertz, Pi 5, V3D 7.1):
$ ./test_api_h264
=== Phase 8a API smoke: H.264 kernels via recipe dispatch ===
H264_IDCT4 recipe substrate: 2 (1=CPU, 2=QPU)
H264_IDCT8 recipe substrate: 2
H264_DEBLOCK_LV recipe substrate: 2
H264_QPEL_MC20 recipe substrate: 2 ← flipped
H.264 IDCT 4x4: 2048/2048 bytes bit-exact
H.264 IDCT 8x8: 2048/2048 bytes bit-exact
H.264 deblock luma v: 2048/2048 bytes bit-exact
H.264 qpel mc20: 1024/1024 bytes bit-exact ← QPU
First-iteration result was 1017/1024 (99.32%) — off-by-7 traced
to undersizing src_max in the host wrapper. The filter reads
src_off + 7*stride + (7 + 3) = +10 at the last row last column;
add 1 for memcpy's [0..N-1] semantic → 11. Fixed in the same
patch.
All 9 daedalus-fourier cycles now QPU-by-recipe:
cycle 1 VP9 IDCT 8x8 QPU
cycle 2 VP9 LPF wd=4 QPU
cycle 3 VP9 MC 8h QPU
cycle 4 VP9 LPF wd=8 QPU
cycle 5 AV1 CDEF 8x8 QPU
cycle 6 H.264 IDCT 4x4 QPU
cycle 7 H.264 IDCT 8x8 QPU
cycle 8 H.264 deblock luma-v QPU
cycle 9 H.264 qpel mc20 QPU ← this commit
Closes daedalus-fourier task #165. Per the decree memory
[QPU is default substrate], the prototype now demonstrates GPU
acceleration on every measured kernel.
|
||
|
|
a092ee34aa |
Merge pull request 'QPU is default substrate: recipe table + ctx env-var override' (#7) from noether/qpu-default-recipe-cycles-5-8 into main
Reviewed-on: #7 |
||
|
|
74687d9def |
cycle 7: V3D shader for H.264 IDCT 8x8
Mirrors cycle 6 (PR #7 prior commit) but at 8x8 scale: 8 lanes per
block, 8 blocks per WG. H.264 §8.5.13.2 1D butterfly twice (row
pass, column pass), (val + 32) >> 6 rounded + clipped + added to
dst.
Bit-exact first try on hertz (Pi 5, V3D 7.1):
H264_IDCT4 recipe substrate: 2 (QPU)
H264_IDCT8 recipe substrate: 2 (QPU) ← flipped
H264_DEBLOCK_LV recipe substrate: 2 (QPU)
H264_QPEL_MC20 recipe substrate: 1 (CPU) ← task #165 remaining
H.264 IDCT 4x4: 2048/2048 bytes bit-exact
H.264 IDCT 8x8: 2048/2048 bytes bit-exact ← QPU
H.264 deblock luma v: 2048/2048 bytes bit-exact
H.264 qpel mc20: 1024/1024 bytes bit-exact
8 of 9 daedalus-fourier cycles now QPU-by-recipe. Only cycle 9
(H.264 luma qpel mc20) still CPU — different shape (6-tap MC
filter, not a transform) so needs its own shader template; task
#165 covers it as a follow-on.
Same pattern as cycle 6 commit (
|
||
|
|
65bd5c3fe3 |
cycle 6: V3D shader for H.264 IDCT 4x4 (first cycle-6 QPU dispatch)
Per the QPU-default substrate decree 2026-05-23, cycle 6 (H.264
IDCT 4x4 + add) was the highest-priority H.264 kernel to flip
from NEON-only to QPU-capable. The same shape as VP9 IDCT 8x8
(cycle 1) — two-pass butterfly with shared-memory transpose —
but at 4x4 scale: 4 lanes per block, 16 blocks per WG.
What's added:
- src/v3d_h264_idct4.comp: GLSL compute shader implementing
the H.264 §8.5.12.1 1D butterfly twice (row pass then column
pass), with (val + 32) >> 6 rounding and clip-to-u8 add to
dst. Block memory layout is column-major (matches FFmpeg
`ff_h264_idct_add_neon` convention).
- CMakeLists: glslang rule + install entry for v3d_h264_idct4.spv.
- dispatch_h264_idct4_qpu() in daedalus_core.c: lazy pipeline
init, 3 SSBOs (coeffs / dst / meta as uvec4), push-constant
(n_blocks, dst_stride), 16 blocks per WG dispatch. Matches
the existing dispatch_*_qpu patterns; uses
v3d_runner_create_buffer / destroy_buffer (will swap to
pool API once PR #6 lands).
- daedalus_dispatch_h264_idct4() replaces ROUTE_CPU_ONLY with
the same CPU/QPU substrate switch the deblock dispatch uses.
- daedalus_recipe_substrate_for(H264_IDCT4) returns QPU now
that the shader exists.
Verification on hertz (Pi 5 + V3D 7.1):
$ ./test_api_h264
=== Phase 8a API smoke: H.264 kernels via recipe dispatch ===
H264_IDCT4 recipe substrate: 2 (1=CPU, 2=QPU)
H264_IDCT8 recipe substrate: 1
H264_DEBLOCK_LV recipe substrate: 2
H264_QPEL_MC20 recipe substrate: 1
H.264 IDCT 4x4: 2048/2048 bytes bit-exact (100.0000%) ← QPU
H.264 IDCT 8x8: 2048/2048 bytes bit-exact
H.264 deblock luma v: 2048/2048 bytes bit-exact
H.264 qpel mc20: 1024/1024 bytes bit-exact
The AUTO-substrate path now picks QPU for H.264 IDCT 4x4, and
the output is bit-exact against the C reference (which is
identical to the NEON .S code by construction — same FFmpeg
upstream).
Remaining cycle-6/7/9 work in task #165:
- cycle 7: H.264 IDCT 8x8 (template same shape; 8 lanes per
block, fewer blocks per WG)
- cycle 9: H.264 luma qpel mc20 (different shape — 6-tap MC
not a transform)
This commit lands the cycle-6 piece of task #165.
|
||
|
|
0a042a8e95 |
v3d_runner: buffer pool for QPU dispatch hot path
Per the QPU-default substrate decree 2026-05-23: the per-dispatch
vkAllocateMemory in dispatch_*_qpu was the biggest single fixable
contributor to QPU dispatch overhead. This pools v3d_buffer
allocations by power-of-2 size class so the second-and-subsequent
dispatch hits a freelist instead of paying ~10-50us of Mesa-V3D7
memory-allocation cost per call.
API additions (v3d_runner.h):
- v3d_runner_acquire_buffer(): pulls from per-bucket freelist;
falls through to v3d_runner_create_buffer() on miss.
- v3d_runner_release_buffer(): pushes back onto the freelist; the
backing VkBuffer/VkDeviceMemory only get vkFreeMemory'd in
v3d_runner_destroy().
- v3d_runner_pool_total_bytes(): diagnostic watermark.
Size classes 2^8..2^23 (256 B to 8 MiB). Oversize requests fall
through to non-pooled (vkAllocateMemory) for both ends — pool stays
correct, just degenerates to old behaviour for those calls.
Migration: daedalus_core.c dispatch_*_qpu paths globally swap
create_buffer → acquire_buffer and destroy_buffer → release_buffer.
All five QPU dispatch functions (idct8 / lpf / mc_8h / cdef /
h264_deblock) now reuse buffers across calls. test_api_idct stays
bit-exact (4096/4096 bytes on CPU/QPU/AUTO substrates on hertz).
Microbench (tests/bench_pool_overhead.c) on hertz (Pi 5,
6.18.29+rpt-rpi-2712, V3D 7.1):
call 0: 434.89 us (cold — 3x vkAllocateMemory)
call 1: 100.06 us (pool hit on all 3 buffers)
steady-state:
p50: 76.44 us
p90: 90.52 us
mean: 77.95 us
first-call / steady-state ratio: 5.7x
The remaining ~76us steady-state is dominated by vkQueueWaitIdle +
shader execution + per-call descriptor-set update + command-buffer
allocation — addressed in follow-on tasks 161 (persistent cmdbuf)
and 162 (dmabuf import for dst, eliminates memcpy in/out).
Refs daedalus-fourier task #160.
|
||
|
|
b3de96b21c |
CMakeLists: make daedalus-fourier.pc relocatable via ${pcfiledir}
The pkg-config file was generated at *configure* time with
`prefix=${CMAKE_INSTALL_PREFIX}`, which captured whatever
CMAKE_INSTALL_PREFIX happened to be set to at `cmake -B build`
time — typically the default `/usr/local`. `cmake --install
build --prefix /foo` then put the files under /foo but the .pc
still pointed pkg-config at /usr/local/include and /usr/local/lib,
which broke downstream consumers configuring against the install
tree.
Concrete bite encountered today on hertz: the daedalus-v4l2 daemon
CMake configure on a /tmp/df-prefix install tree resolved
DAEDALUS_FOURIER_INCLUDE_DIRS to /usr/local/include (empty path on
the test host), so main.c failed to find <daedalus.h>.
Fix: write the .pc with `prefix=${pcfiledir}/<rel>` where <rel> is
the configure-time-computed relative path from
<prefix>/<libdir>/pkgconfig back to <prefix>. pkg-config
substitutes ${pcfiledir} with the actual on-disk location of the
.pc at lookup time, so the resolved prefix tracks wherever the
install tree is moved to — including DESTDIR-staged builds, apt
package installs, and ad-hoc `cmake --install --prefix /tmp/foo`
test installs.
The relative-path computation handles GNUInstallDirs layouts that
add multiarch tuples (Debian's lib/aarch64-linux-gnu) without
hard-coding `../..`. Tested on hertz (Debian trixie, libdir=lib):
prefix=${pcfiledir}/../../
...
$ pkg-config --variable=prefix daedalus-fourier
/tmp/df-prefix-test/lib/pkgconfig/../../
# mv preserves relocation:
$ mv /tmp/df-prefix-test /tmp/df-prefix-moved
$ pkg-config --variable=prefix daedalus-fourier
/tmp/df-prefix-moved/lib/pkgconfig/../../
This unblocks the daedalus-v4l2 daemon out-of-tree builds against
local daedalus-fourier installs and is a prerequisite for tidy
test-rig deployments (per the hertz reload session 2026-05-23).
|
||
|
|
8fdef27a7d |
Phase 8c: H.264 luma qpel mc20 through public API
Extends daedalus-fourier with daedalus_recipe_dispatch_h264_qpel_mc20
so libavcodec.so can route H264QpelContext.put_h264_qpel_pixels_tab[1][2]
through the recipe layer instead of ff_put_h264_qpel8_mc20_neon directly.
API additions (header + library):
- daedalus_h264_qpel_meta { dst_off, src_off }
- daedalus_dispatch_h264_qpel_mc20(ctx, sub, dst, src, stride,
n_blocks, meta)
- daedalus_recipe_dispatch_h264_qpel_mc20(...) (AUTO wrapper)
- DAEDALUS_KERNEL_H264_QPEL_MC20 = 9 in the recipe-query enum
- daedalus_recipe_substrate_for() returns CPU NEON for cycle 9
The 6-tap horizontal half-pel filter signature matches FFmpeg's
H264QpelContext convention exactly: dst and src share a single stride
and src already points at output column 0 (filter reads cols -2..+3).
Single-stride API to make the marfrit-packages FFmpeg shim a
straight pointer-pass; no buffer rearrangement.
Verdict per docs/k9_h264qpel_mc20.md: CPU NEON. Per-block 7.6 ns
gives 135x margin over 30 fps 1080p; QPU dispatch floor at ~250 ns
makes any V3D shader strictly worse. Recipe table reflects that —
the recipe_dispatch entry is a one-line forward to the CPU path.
CMakeLists changes:
- h264qpel_neon.S added to the daedalus_core static lib (only the
bench targets owned it before; now the public API needs it too)
- tests/h264_qpel8_mc20_ref.c added to the test_api_h264 target
Phase 8a/8b smoke gains a 4th case (test_qpel_mc20): 1024/1024
bytes bit-exact via daedalus_recipe_dispatch_h264_qpel_mc20.
Refs reauktion/daedalus-v4l2#11 — substitution arc step 2 cycle 9.
|
||
|
|
47d0107809 |
CMakeLists: install rules + pkg-config for daedalus_core
Make daedalus_core installable so sibling consumers (Phase 8 V4L2
daemon, future libva-v4l2-request-fourier integration tests, etc.)
can `pkg_check_modules(DAEDALUS REQUIRED daedalus-fourier)` against
a system-installed copy.
Installs:
- lib/libdaedalus_core.a
- include/daedalus.h
- lib/pkgconfig/daedalus-fourier.pc
- share/daedalus-fourier/shaders/*.spv (only when
DAEDALUS_BUILD_VULKAN is ON; consumers using
daedalus_ctx_create_no_qpu() don't need them)
pkg-config surfaces the static-archive transitive deps via
Libs.private (-lpthread -ldl -lm) and Requires.private (vulkan),
so a consumer doing `pkg-config --static --libs daedalus-fourier`
gets the full link line. Non-static consumers (using the
no_qpu path) get just `-ldaedalus_core`.
No behaviour change to existing tests / benches.
Verified on hertz (Pi 5, dev host): clean build, all 7 SPIR-V
shaders + the static lib + the header + the .pc file land in
the install prefix.
|
||
|
|
5c8b09349c |
Cycle 9 closed: H.264 luma qpel mc20 = 131 Mblock/s NEON, CPU-only
Last unmeasured H.264 kernel. mc20 picked as representative (horizontal half-pel, 6-tap filter; canonical for the H.264 luma qpel family). M1 PASS 10000/10000 first try, M3 = 131.477 Mblock/s on a single core (7.6 ns/block), 135x the 1080p30 floor. Per the cycles 6+7 lightweight-kernel rationale, Phase 4 deferred: QPU dispatch floor (~250 ns/block) is 33x above the NEON per-block cost; R9 ≈ 0.03 deep RED. No realistic QPU offload value. Generalization: all H.264 luma MC variants (mc02, mc11, mc22, etc.) will share this verdict. No need to measure each variant individually. H.264 NEON is dramatically faster than VP9 NEON across the board: - IDCT 4x4: 175 vs N/A (no VP9 analog) - IDCT 8x8: 151 vs 8.2 Mblock/s (18x faster) - MC 6/8-tap: 131 vs 7.0 (19x faster) - Deblock: 92 vs 48 Medge/s (2x faster) H.264 deployment recipe: all CPU NEON except deblock (opportunistic QPU). On a Pi 5 running H.264-only, the QPU is mostly idle. Cycles 1-9 complete. Public API exposes all 9. Next: daedalus-v4l2 sibling repo per locked Phase 8 architecture (B + γ + sibling), then README polish. - external/ffmpeg-snapshot/libavcodec/aarch64/h264qpel_neon.S vendored (1467 lines, all qpel variants) - tests/h264_qpel8_mc20_ref.c: 40-line C ref (clip255 of 6-tap convolution) - tests/bench_neon_h264qpel_mc20.c: M1 + M3 bench - docs/k9_h264qpel_mc20.md: cycle 9 closure with comparison matrix Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
0a99b16489 |
Phase 8b: opportunistic QPU paths through public API
Wires QPU dispatch for cycles 3 (VP9 MC), 5 (AV1 CDEF), 8 (H.264
deblock) through the public API. These three kernels have recipe
substrate = CPU, but per Issue 003 the mixed-kernel helper value
is real — the dispatch path must exist so override-mode callers
can request QPU on the side.
Pattern mirrors dispatch_idct8_qpu (lazy pipeline + per-call SSBO
alloc + memcpy + dispatch + readback). Each kernel has its own
push-constant struct (mc_pc 3-field, cdef_pc 3-field, deblock_pc
2-field shared with lpf).
Notable bug caught + fixed in test_api_opportunistic_qpu: the
initial dispatch_mc_8h_qpu sized src_max using CPU-side reach
(src_off + 3 + 8 + 7*stride), but the QPU shader reads src[
src_off + row*stride + 0..14] for row=0..7. Last block had 3
uninitialized bytes → 99.8% match → 100% after fix.
After this commit, the public API surface fully covers cycles 1-8:
Cycle 1 (IDCT 8x8): CPU + QPU + AUTO bit-exact
Cycle 2 (LPF wd=4): CPU + QPU + AUTO bit-exact
Cycle 3 (MC 8h): CPU recipe; QPU override bit-exact
Cycle 4 (LPF wd=8): CPU + QPU + AUTO bit-exact
Cycle 5 (CDEF): CPU recipe; QPU override (untested in this
test — bench_v3d_cdef is the authoritative 3-way M1)
Cycle 6 (H.264 IDCT 4x4): CPU only (no QPU shader by recipe)
Cycle 7 (H.264 IDCT 8x8): CPU only
Cycle 8 (H.264 deblock luma-v): CPU recipe; QPU override bit-exact
Tests: test_api_opportunistic_qpu adds CPU-vs-QPU bit-exact
comparison for VP9 MC and H.264 deblock through the API.
test_api_idct, test_api_lpf, test_api_h264 still pass.
Per the locked Phase 8 architecture (project_phase8_architecture
memory): next session opens daedalus-v4l2 sibling repo with
Option B (kernel V4L2 shim + userspace daemon), Option γ (dlopen
FFmpeg parser).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
af8146a2cd |
Phase 8a: H.264 kernels through public API
Extends include/daedalus.h with cycles 6, 7, 8 (H.264 IDCT 4x4, IDCT 8x8, luma deblock luma-v). All recipe-substrate = CPU (matches per-cycle Phase 7 verdicts). src/daedalus_core.c: NEON-path implementations + recipe routing. daedalus_core library now links the full FFmpeg H.264 NEON snapshot (h264idct + h264dsp) plus existing VP9 + dav1d. tests/test_api_h264.c: smoke test covering all 3 H.264 kernels via daedalus_recipe_dispatch_*. All pass 2048/2048 bit-exact. Public API coverage after this commit: - Cycles 1 IDCT 8x8 + 2 LPF4 + 4 LPF8: CPU+QPU+AUTO dispatch (test_api_idct, test_api_lpf, both pass) - Cycle 3 MC 8h: CPU only (QPU dispatch stub returns -1) - Cycle 5 CDEF: CPU only (QPU stub) - Cycle 6 H.264 IDCT 4x4: CPU only (recipe + only NEON wired) - Cycle 7 H.264 IDCT 8x8: CPU only - Cycle 8 H.264 deblock: CPU only (QPU opportunistic — not wired through API yet; bench_v3d_h264deblock exists for direct test) Next Phase 8 sub-step: wire opportunistic QPU dispatch for cycles 3+5+8 through the API (so override-mode users can request QPU). Then surface V4L2-wrapper architecture decisions to user. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
373f63a910 |
Cycle 8 closed: H.264 deblock R8=0.061 RED, opportunistic helper
Phase 6 deliverable: v3d_h264deblock.comp (132 inst, 4 threads,
no spills). Phase 5 REDs applied:
RED-1: explicit clamp p1'/q1' to [0,255] before uint8 write
RED-2: bench-enforced m.x >= 4*stride contract
M1: 3-way 4096/4096 bit-exact (QPU vs C ref AND vs NEON).
M2: 5.629 Medge/s isolation → R8 = 0.061 RED (predicted 0.09-0.14).
Lower than prediction; H.264 deblock has 4 early-return paths +
2 conditional writes that hurt V3D branchy execution more than
expected.
M4 same-kernel: NEON-3+QPU 12.81 Medge/s ≈ pure-NEON-4 ~12-15
(neutral).
M4 MIXED (real H.264 deployment shape): CPU=MC + QPU=h264deblock
gives CPU MC 25.11 Mblock/s + QPU h264deblock 6.23 Medge/s.
QPU contribution is essentially unchanged from isolation —
the cross-substrate contention is gentle (consistent with
Issue 003's V4 finding).
Verdict: H.264 deblock = opportunistic QPU helper. Same recipe
slot as cycle 5 CDEF. 6 Medge/s helper = 85% of single-NEON-core
deblock capacity, available when CPU is busy with other work.
Cycles 1-8 deployment recipe complete:
Primary QPU: cycles 1+2+4 (VP9 IDCT/LPF, all bandwidth-bound)
Primary CPU: cycles 3+6+7 (compute-heavy or trivially fast on NEON)
Opportunistic helper: cycles 5+8 (CDEF, H.264 deblock)
Phase 9 lessons added:
- Branchy kernels underperform V3D vs straight-line ones
- Mixed-kernel helper value scales with isolation M2, not
same-kernel M4
- R prediction needs branchiness weight, not just compute density
- src/v3d_h264deblock.comp (132 inst QPU shader)
- tests/bench_v3d_h264deblock.c (3-way M1 + M2 + R classification)
- tests/bench_concurrent_mixed.c extended with K_H264DEBLOCK
- CMakeLists.txt: v3d_h264deblock.spv + bench_v3d_h264deblock
+ h264dsp linked into bench_concurrent_mixed
- docs/k8_h264deblock_phase7.md (full closure with cycles 1-8 recipe)
Next: Phase 8 — V4L2 wrapper / deployment infra. Public API
already exposes recipe-default substrate per kernel.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
436a5c4f74 |
Cycle 8 Phase 3 closed: H.264 deblock NEON = 92 Medge/s
M1: 10000/10000 bit-exact (after orientation fix: ff_h264_v_loop_ filter is "vertical filtering of horizontal edges", not "vertical edge"; 16 columns process the edge horizontally with 8 rows of vertical context). M3: 91.947 Medge/s per core. Per-edge 10.9 ns. 11x worst-case 1080p30 floor, 30x realistic floor. Filter triggers on 25 % of edges (random alpha/beta/tc0 covers both gating paths). Cycle 8 Phase 9 lesson: H.264/FFmpeg "v_loop_filter" naming uses filter DIRECTION (vertical) not edge orientation. Edge is horizontal; filter operates vertically across it. Distinct from cycle 6's column-major-block lesson but related discovery pattern. Encoded for future cycles. R8 prediction revised: 0.09-0.14 ORANGE (down from Phase 1's 0.3-0.8 estimate). H.264 deblock is 2x faster on NEON than VP9 LPF wd=4 (cycle 2) but H.264 deblock has more per-edge branches that hurt QPU more. Worth building anyway: - ORANGE in cycle 1's "M4 may rescue" band - Mixed-kernel deployment helper value (Issue 003) matters more than isolation R - 25%-trigger rate gives 4x effective contribution multiplier on QPU side - tests/h264_deblock_ref.c (column-walking C ref per row segment) - tests/bench_neon_h264deblock.c (M1 + M3 bench) - CMakeLists.txt: cycle 8 NEON bench wiring + h264dsp_neon.S - docs/k8_h264deblock_phase3.md (closure) Next: Phase 4 plan QPU shader, Phase 5 Sonnet review. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
db2205d0e3 |
Cycle 7 closed: H.264 IDCT 8x8 = 151 Mblock/s NEON, Phase 4 deferred
M1: 10000/10000 bit-exact first try (column-major-block lesson from cycle 6 carried over cleanly). M3: 151.2 Mblock/s per core. Per-block 6.6 ns. 155x the 1080p30 floor (0.972 Mblock/s req'd). Phase-1 prediction of R7 = 0.5-0.9 YELLOW/GREEN was WRONG. H.264 IDCT 8x8 is dramatically lighter than VP9 IDCT 8x8 (18.5x faster NEON): VP9 IDCT 8x8: 122 ns/block (Q14 trig + COSPI multiplies) H.264 IDCT 8x8: 6.6 ns/block (pure integer butterfly + shifts) Phase 4 deferred via the cycle 6 lightweight-kernel rationale: NEON per-block << QPU dispatch floor; offload doesn't help. Phase 9 lesson updated: H.264 transforms (both 4x4 and 8x8) are NEON-trivial. Skip ALL H.264 transform cycles for QPU. Target compute-heavy H.264 kernels only (deblock = cycle 8 next; MC likely RED). Cycle 7 = 2nd consecutive "predicted GREEN, measured CPU-only" result. Forces a sharper view of which kernels QPU can actually help with: deblock and possibly some VP9 cases. - tests/h264_idct8_ref.c (column-major C ref) - tests/bench_neon_h264idct8.c (M1 + M3 bench) - CMakeLists.txt: cycle 7 bench wiring - docs/k7_h264idct8_phase3_and_4.md (closure) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
f92dc40f43 |
Cycle 6 (H.264) opened — IDCT 4x4 Phase 1+3, M3 = 175 Mblock/s
H.264 scope added 2026-05-18 per user direction. Pi 5's VideoCore VII has no hardware H.264 decoder block (only HEVC), so a QPU-accelerated H.264 path fills the most impactful codec gap. Cycle 6 = first H.264 kernel (4x4 IDCT + add, smallest H.264 transform, simplest first cycle). Phase 1: goal doc + 1080p30 floor analysis (5.85 Mblock/s worst-case, 2.0 Mblock/s realistic since most MBs use 8x8 or P-skip). Phase 3: NEON M3 baseline captured. ff_h264_idct_add_neon on hertz delivers 175 Mblock/s (5.7 ns per block) = 30x worst-case floor margin. H.264 IDCT 4x4 is dramatically lighter than VP9 IDCT 8x8 (21x faster per block). Phase 3 closure also caught the key Phase 9 lesson: H.264/FFmpeg blocks are COLUMN-MAJOR (block[c*4 + r] = (row=r, col=c)). NEON ld1 with 4 registers interleaves loading, and the FFmpeg C ref indexing makes this convention explicit. Initial C ref assumed row-major, M1 was 5% bit-exact; after fix, M1 = 100%. Convention encoded for all subsequent H.264 cycles (cycle 7+). - external/ffmpeg-snapshot/libavcodec/aarch64/h264idct_neon.S (vendored verbatim from FFmpeg n7.1.3, 415 lines) - external/ffmpeg-snapshot/PROVENANCE.md: updated - tests/h264_idct4_ref.c: column-major C ref - tests/bench_neon_h264idct4.c: M1 + M3 bench - CMakeLists.txt: cycle 6 NEON bench wiring - docs/k6_h264idct4_phase1.md, phase3.md Phase 4 next: QPU shader for cycle 6. Predicted R6 = 0.01 (deep RED — kernel too small relative to QPU dispatch overhead) but worth building for cycle-completeness + the opportunistic-helper hypothesis (cycle 6 may stay CPU per recipe). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
eb5cfb34c4 |
Phase 8: wire LPF wd=4 + wd=8 QPU through public API
Mirror the IDCT pattern (lazy pipeline + per-call SSBO alloc + dispatch + readback) for cycles 2 (LPF wd=4) and 4 (LPF wd=8). Important caught-empirically bug: the two LPF shaders disagree on push-constant slot order — wd=4 puts dst_stride_u8 at slot 1, wd=8 puts it at slot 2 (with unused blocks_per_row at slot 1). Initial single-struct attempt silently corrupted wd=8 output (1958/2048 = 95.6 % bit-exact on test_api_lpf). Fixed by keeping separate lpf4_pc and lpf8_pc struct definitions. dst-window calc handles both kernels (same -4..+3 byte footprint per row). test_api_lpf exercises both kernels in CPU / QPU / AUTO modes against the C reference. All 6 mode/kernel combinations pass 2048/2048 bit-exact (32 edges × 8 rows × 8 bytes/edge). Phase 8 status after this commit: 3 of 5 kernels wired through API for QPU dispatch (IDCT, LPF wd=4, LPF wd=8 — i.e., all 3 QPU-default kernels per recipe). Cycle 3 MC and cycle 5 CDEF still need wiring for opportunistic-override mode but aren't needed for recipe-AUTO path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
1085c5699c |
Phase 8: wire IDCT QPU dispatch through public API
daedalus_ctx now owns a v3d_runner when V3D is available. The public API's dispatch_vp9_idct8 routes QPU calls through a new dispatch_idct8_qpu helper that: (1) lazy-creates the cycle 1 v4 pipeline on first use, (2) allocates 3 host-visible SSBOs per call (coeffs/dst/meta), (3) memcpy host->GPU, (4) dispatch with the v4 32-blocks-per-WG geometry, (5) memcpy GPU->host. Per-call alloc is intentional for Phase 8 correctness-first scope; buffer-pool perf optimization is deferred. Added daedalus_ctx_create_no_qpu() for fast-path callers that know they want CPU only. test_api_idct extended to a 3-mode matrix: CPU forced, QPU forced, AUTO recipe. All three deliver 4096/4096 bit-exact on hertz with V3D 7.1.7.0: recipe substrate for VP9_IDCT8: 2 (QPU) [CPU] 4096/4096 bit-exact [QPU] 4096/4096 bit-exact (real QPU dispatch through the API) [AUTO] 4096/4096 bit-exact (recipe routes to QPU) Next Phase 8 sub-step: same wiring pattern for cycle 2 LPF wd=4 and cycle 4 LPF wd=8 (the other two recipe-QPU kernels). Cycle 3 MC and cycle 5 CDEF only need the dispatch hook (recipe routes to CPU; QPU stays opportunistic via explicit override). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
760f6a4060 |
Phase 8 skeleton: public C API + first end-to-end smoke test
include/daedalus.h: stable C API surface exposing the 5 cycles (VP9 IDCT 8x8, LPF wd=4, MC 8h, LPF wd=8; AV1 CDEF). Per-kernel recipe-dispatch helpers default to the cycle 1-5 verdict substrate (QPU for cycles 1+2+4, CPU for cycles 3+5); explicit override available for benchmarking and runtime-aware scheduling. src/daedalus_core.c: NEON-path implementation of all 5 kernels wrapped behind the public API. QPU path stubbed out (returns -1) since wiring v3d_runner into daedalus_ctx is the next Phase 8 sub-step; with has_qpu=0 the recipe falls back to CPU cleanly. tests/test_api_idct.c: 64-block IDCT through the public recipe dispatch, bit-exact vs C ref. PASS 4096/4096 bytes — proves the API surface compiles, library links, dispatch routing works, and NEON fallback delivers correct results. docs/phase8_scoping.md: architecture options (A=userspace V4L2, B=kernel V4L2 shim, C=direct libva); pick A for v1; explicitly out-of-scope work tracked. Next Phase 8 sub-step: wire v3d_runner into daedalus_ctx so has_qpu=1 and QPU dispatch goes through the API too. After that: V4L2 ioctl glue, bitstream parser, superblock loop. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
5223d3cb3f |
Cycle 5 closed: CDEF QPU R5=0.116 ORANGE, opportunistic helper
Phase 4 plan with 3 Phase-5 REDs applied inline: - meta layout: m.z=tmp_off, m.w=dir - sec_shift clamped to >=0 (NEON uqsub semantics) - directions table as const ivec2[14], not OR-packed Phase 6 deliverable: v3d_cdef.comp (387 inst, 2 threads, no spills). 3-way M1 (QPU vs C ref vs NEON) PASS 4096/4096. M2: 0.443 Mblock/s -> R5 = 0.116 ORANGE (predicted 0.02-0.05 RED). M4 same-kernel: NEON-3+QPU 8.46 < NEON-4 alone ~10 (negative). M4 mixed (NEON-3 MC + QPU CDEF): CPU 34.17 Mblock/s MC, QPU 0.42 Mblock/s CDEF helper. CPU side higher than the Issue 003 NEON-fallback proxy suggested - cross-substrate contention is gentler than same-side NEON contention. Verdict: CDEF stays on CPU; QPU dispatch path exists for opportunistic use. Deployment recipe table updated for all 5 cycles. Phase 9 lessons: linear extrapolation across cycles is too pessimistic; CDEF is bandwidth-bound on NEON despite high per-block ns; real-substrate-cross contention < NEON-proxy contention. - src/v3d_cdef.comp: cycle 5 QPU shader - tests/bench_v3d_cdef.c: 3-way M1, M2 bench - tests/bench_concurrent_mixed.c: K_CDEF on both sides - tests/cdef_ref.c + bench_neon_cdef.c: sec_shift clamp + expanded damping range to exercise the edge case - CMakeLists.txt: v3d_cdef.spv + bench_v3d_cdef wiring - docs/k5_cdef_phase4.md updated with Phase 5 review applied - docs/k5_cdef_phase7.md: closure doc with full verdict matrix Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
2dd774a9ab |
Issue 003 closed: mixed-kernel M4 validates V4 deployment shape
bench_concurrent_mixed runs NEON-N on kernel A + QPU on kernel B concurrently. Matrix on hertz: V3 (CPU MC + QPU MC same-kernel): CPU 22.64 + QPU 0.39 Mblock/s V4 (CPU MC + QPU LPF4): CPU 27.87 + QPU 12.74 Medge/s V1 (CPU MC + NEON-fb CDEF): CPU 24.49 + 1.75 Mblock/s CDEF V2 (CPU LPF4 + NEON-fb CDEF): CPU 27.28 Medge + 1.70 Mblock/s V4 is the daedalus-fourier deployment shape (CPU runs MC; QPU runs LPF4 via cycle 2 GREEN offload). Both substrates productive; CPU MC +23% per-core vs same-kernel V3 control. Same-kernel M4 in cycles 1-5 was a worst-case contention bound, not a deployment number — user's "5%/50%" framing was correct. Cycle 3 MC verdict unchanged (QPU MC contributes ~0.4 under any contention); cycle 5 CDEF deferred verdict softened to opportunistic helper (NEON-fallback proxy used since cycle 5 Phase 6 not yet built). - tests/bench_concurrent_mixed.c (configurable cpu-kernel / qpu-kernel matrix; supports MC, LPF4, LPF8, IDCT real QPU dispatch; CDEF uses NEON-on-core-3 fallback) - CMakeLists.txt: build target wired with all FFmpeg + dav1d sources - docs/issues/003-mixed-kernel-m4-bench.md: closure + matrix - docs/k3_mc_phase7.md: M4 methodology caveat extended with V3/V4 - docs/k5_cdef_phase3_partial.md: deployment recommendation updated Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
20b59cd6a5 |
Cycle 5 phase 3 partial: M3 NEON = 3.923 Mblock/s; M1 deferred
CDEF is the most compute-intensive kernel measured so far —
254.9 ns/block (2x IDCT, 5x MC). 30fps@1080p floor margin: 4x
even on single NEON core in isolation.
M3 captured cleanly via dav1d_cdef_filter8_8bpc_neon. M1 bit-exact
gate failing due to tmp-layout mismatch between my standalone C
reference and dav1d's NEON expectation. The smoking gun: NEON output
appears at (+2 rows, -2 cols) shifted positions vs C ref output —
suggests NEON's padding-function output has a different convention
than my manual tmp construction.
Untangled in setup work:
- dav1d has TWO directions tables: stride-12 in src/tables.c
(C-side), stride-16 in src/arm/64/cdef_tmpl.S (NEON-side).
Initially vendored the C-side; should have used the NEON-side.
- dav1d's NEON expects tmp built by dav1d_cdef_padding8_8bpc_neon
(a separate function with its own conventions), not the C-side
padding() function from cdef_tmpl.c.
- Updated cdef_ref.c to use NEON-layout (stride 16) with table
transcribed from cdef_tmpl.S. Algorithm matches — but bench's
manual tmp construction doesn't match what NEON expects.
Resolution paths for next session (documented in
docs/k5_cdef_phase3_partial.md §'Resolution paths'):
1. Use dav1d_cdef_padding8_8bpc_neon to construct tmp (simplest)
2. Vendor dav1d's full C reference (most rigorous)
3. Reverse-engineer dav1d's padding output layout (hackiest)
Predicted R5 if/when QPU shader implemented: 0.02-0.05 (RED).
CDEF likely stays on CPU per cycle 3 lesson 7 (compute-bound
kernels don't benefit from QPU offload). 30fps floor still
passes regardless.
New artifacts:
- external/dav1d-snapshot/src/arm/64/cdef_tmpl.S (additional vendored)
- external/dav1d-snapshot/config.h — 14-define asm preamble shim
- tests/cdef_ref.c — standalone C ref (algorithmically correct,
layout mismatch with NEON known)
- tests/bench_neon_cdef.c — bench (M1 made warning, M3 captured)
- docs/k5_cdef_phase3_partial.md — phase 3 partial closure +
resumption checklist
dav1d snapshot in PROVENANCE.md should be updated next session
with the new cdef_tmpl.S entry.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
85feba4087 |
Cycle 4 (LPF wd=8) closure: M1=100%, R=0.34, M4=+4.1%, PASS
Fourth daedalus-fourier kernel — VP9 8-tap inner loop filter wd=8 h_8_8 variant. Width extension of cycle 2's wd=4; completes VP9 inner-edge LPF coverage. Full cycle Phase 1-7 + M4'''' in one combined go (cycle compressed since incremental from cycle 2). Phase 5 review explicitly skipped (incremental ~30-line shader delta from cycle 2 + same geometry + cycle-2 RED-pattern checks still apply). Flagged in docs/k4_lpf8_phase4_7.md per dev_process.md "Skipping phases is a deliberate choice that should be flagged." Phase 6 v1 first-light: M1'''' 100.0000% bit-exact (65536/65536) first try. Shaderdb shows 231 inst, 4 hardware threads, 0 spills, 27 max-temps, 48 uniforms — compiler at the latency-hiding ceiling. Performance: M3'''' NEON (single-core) 52.382 Medge/s M2'''' QPU isolation 17.847 Medge/s R'''' 0.341 → ORANGE band 30fps floor margin 9.2x (isolation), 20.3x (mixed) M4'''' concurrent matrix: NEON 4-core 37.823 Medge/s <- baseline QPU only 14.867 Medge/s MIXED NEON-3 + QPU 39.389 Medge/s <- +4.1% PASS Verdict: YELLOW-via-M4'''' PASS. Deploy wd=8 LPF on QPU alongside cycle 2 wd=4. Combined VP9 inner-edge LPF coverage now complete. Cross-cycle LPF comparison: | | wd=4 (k2) | wd=8 (k4) | | M3 NEON | 48.3 | 52.4 | | M2 QPU iso | 19.6 | 17.8 | | R iso | 0.41 | 0.34 | | M4 delta | +6.9% | +4.1% | | 30fps mixed | 7.2x | 20.3x | | Verdict | GO QPU | GO QPU | NEW finding (Phase 9 lesson): NEON gets faster per edge as filter width grows (20.7 → 19.1 ns wd=4 → wd=8). The relative QPU loss grows with width. wd=16 would probably flip negative based on the trend line. Deployment recipe with cycle 4: IDCT 8x8 (k1) -> QPU (R=0.92, +7% mixed) LPF wd=4 (k2) -> QPU (R=0.41, +7% mixed) LPF wd=8 (k4) -> QPU (R=0.34, +4% mixed) MC 8h (k3) -> CPU (R=0.067, -19% mixed) Entropy -> CPU (structural) VP9 inner-edge LPF coverage complete. Project continues to higgs deployment plumbing or further kernels per user direction. New artifacts: - src/v3d_lpf_h_8_8.comp — GLSL shader - tests/vp9_lpf8_ref.c — standalone C ref - tests/bench_neon_lpf8.c — M1+M3 bench - tests/bench_v3d_lpf8.c — M1+M2 bench - tests/bench_concurrent_lpf8.c — M4 pthread bench - docs/k4_lpf8_phase1_3.md + phase4_7.md — combined cycle docs Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
356e446a49 |
Cycle 3 (MC interpolation) closure: M1'''=100%, R'''=0.067 RED, M4=-19.5%
Third daedalus-fourier kernel — VP9 8-tap regular subpel filter,
horizontal direction, 8-wide output. Multiply-heavy by design to
stress V3D's no-DP4A deficit. Full cycle Phase 1-7 + M4'''.
Phase 5''' second-model review delivered cleanly — caught 1 RED
bug pre-implementation (src_off off-by-3 indexing convention) and
2 YELLOW gaps (assert MUST language, shaderdb filter-LUT gate).
Without the review, M1''' would have failed silently on first run
with cryptic "high-index source pixels wrong" symptoms.
Phase 6 v1 first-light: M1''' 100.0000% bit-exact (65536/65536
blocks across all 16 mx phases). Phase 5''' filter-LUT prediction
materialised exactly: 197 uniforms (gate was 144), 2 threads (down
from cycle-2's 4 due to register pressure).
Performance:
M2''' = 1.413 Mblock/s (707.9 ns/block)
M3''' = 20.997 Mblock/s (NEON baseline phase3)
R''' = 0.067 (RED band — structural mismatch)
shaderdb: 488 inst, 2 threads, 197 uniforms, 25 max-temps, 0 spills
M4''' concurrent matrix (8s windows):
NEON 1-core 14.479 Mblock/s
NEON 4-core 15.248 Mblock/s <- baseline (compute-bound,
not bandwidth-saturated
like cycles 1+2!)
QPU only 1.380 Mblock/s
MIXED NEON-3 + QPU 12.277 Mblock/s <- -19.5% (FAIL gate)
MIXED NEON-4 + QPU 12.158 Mblock/s <- -20.3%
NEW cross-cycle finding (Phase 9 lesson 2): compute-bound CPU
workloads make the QPU-offload story collapse. Cycles 1+2 were
bandwidth-saturated (4-core scaling 0.56-0.82x of 1-core), so
freeing a CPU core via QPU offload added throughput. Cycle 3 MC
is compute-bound (4-core scaling 1.05x of 1-core — near-linear),
no free cycles to free. QPU contribution (0.45 Mblock/s in
contention) doesn't compensate for losing 1 NEON core delivering
~3.8 Mblock/s.
But 30fps@1080p floor: PASS in every config (1.4x to 15.7x
isolation margin). Per project_30fps_floor_is_fine.md, user-facing
test never fails — daily YouTube playback works fine on any CPU/QPU
split.
DEPLOYMENT RECIPE for higgs (cycle 3 confirmed split):
IDCT (k1) -> QPU (R=0.92, +7% mixed, frees CPU core)
LPF (k2) -> QPU (R=0.41, +7% mixed, frees CPU core)
MC (k3) -> CPU (R=0.067, -19.5% mixed — stays on CPU)
Entropy -> CPU (structurally serial)
Mixed-substrate deployment, not "QPU does everything". Realistic for
higgs: entropy + MC on 2-3 ARM cores; IDCT + LPF dispatched to QPU
concurrently; 1-2 ARM cores left for vscode etc.
New artifacts:
- src/v3d_mc_8h.comp — GLSL kernel
- tests/vp9_mc_ref.c — standalone C ref (REGULAR filter
embedded; clean transcription)
- tests/bench_neon_mc.c — M1'''_c + M3''' bench
- tests/bench_v3d_mc.c — M1''' + M2''' bench with contract
asserts + 30fps margin display
- tests/bench_concurrent_mc.c — M4''' pthread bench
- external/ffmpeg-snapshot/libavcodec/aarch64/vp9mc_neon.S (vendored)
- external/ffmpeg-snapshot/libavcodec/vp9_subpel_filters_table.c
(hand-extracted; provides
ff_vp9_subpel_filters symbol
without dragging in full vp9dsp.c)
- docs/k3_mc_phase{1,2,3,4,5,7}.md — full cycle documentation
Memory updates: project_30fps_floor_is_fine.md (user's 30fps target
recalibration), MEMORY.md index updated.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
36eca40ff2 |
Cycle 2 (LPF) closure: M1''=100%, R''=0.41, M4''=+6.9%, PASS
Phase 4 plan + Phase 5 second-model review (PASS-WITH-REVISIONS:
2 YELLOW contract gaps applied) + Phase 6 v1 implementation +
Phase 7 verification including M4'' concurrent gate.
Phase 5'' review delivered cleanly — no RED bugs (cycle 1 lessons
applied successfully). 2 YELLOW findings baked into phase4 §4:
- stride >= 4 contract added alongside m.x >= 4 (finding 2)
- assert(...) in bench made a MUST not a suggestion (finding 4)
- V3D divergence-cost note: don't restructure to always-execute,
masked lanes consume clock anyway (finding 3, informational)
Phase 6 v1 first-light hit M1'' 100.0000% bit-exact on first run
(65536/65536 edges) — the cycle-1 v4 patterns (WG=256, 2-per-sg,
uint8_t SSBO, oob early-return discipline) baked in from start
worked as expected.
Performance:
M2'' = 19.645 Medge/s (50.9 ns/edge)
M3'' = 48.285 Medge/s (NEON baseline from phase3)
R'' = 0.41 (ORANGE band - doesn't auto-close per
cycle-1 calibration adjustment)
shaderdb: 160 inst, **4 threads**, 0 spills, 21 max-temps —
shader is already at the compiler ceiling. No v2/v3/v4 iteration
loop like cycle 1 because there's nothing more to extract from
the compiled shape. The 30x gap between theoretical instruction
throughput and measured wall-clock is divergence-tax + memory
latency, not compile quality.
M4'' concurrent matrix on hertz (8s windows):
NEON-1 LPF 41.131 Medge/s
NEON-4 LPF 33.726 Medge/s <- realistic CPU ceiling
(per-core 7-9; same
bandwidth-saturation as
cycle-1 F1)
QPU only 14.299 Medge/s
MIXED NEON-3 + QPU 36.049 Medge/s <- +6.9% over NEON-4
MIXED NEON-4 + QPU 31.892 Medge/s <- -5.4% oversubscribed
The "freed-core" pattern generalizes from IDCT to LPF: NEON-3+QPU
beats pure NEON-4 by ~7% in both cycles. Cycle-2 NEW finding:
**oversubscribed mode hurts for lighter kernels** (LPF -5.4% vs
cycle-1 IDCT +9.4%). Recommendation for higgs deployment hardens
to "always N-1 NEON cores + QPU, never N + QPU".
Phase 9 lessons (in phase7 §"Phase 9 lessons"):
1. Cycle-1 v4-pattern is the v1 starting point (saves 3 iterations)
2. Phase 5 review pays off every cycle
3. R isolation misleading on bandwidth-saturated hardware
4. Oversubscription tax depends on kernel weight
5. shaderdb 4-threads/0-spills = compute not the bottleneck
New artifacts:
- src/v3d_lpf_h_4_8.comp — GLSL kernel
- tests/bench_v3d_lpf.c — M1'' + M2'' harness with
contract asserts + fm/hev
pass-rate instrumentation
- tests/bench_concurrent_lpf.c — M4'' pthread bench
(mirrors bench_concurrent.c)
- docs/k2_deblock_phase{4,5,7}.md — plan + review + verification
Project verdict: continue. Cycle 3 candidates: MC interpolation
(multiply-heavy, stress V3D SMUL24), CDEF (AV1-only, different
neighborhood shape), or wd=8/wd=16 LPF variants. User to direct.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
be7ff5587c |
Cycle 2 (deblocking) Phase 1-3: M3'' = 48.285 Medge/s baseline
Second kernel candidate per phase7_M4.md verdict "next-kernel cycle
authorised". VP9 4-tap inner loop filter, horizontal direction,
8-pixel edge (libavcodec ff_vp9_loop_filter_h_4_8_neon as baseline).
Different workload shape from IDCT - boundary streaming, lighter
compute per unit, per-row conditionals - tests whether QPU win
generalises.
docs/k2_deblock_phase1.md - goal-setting. Same R-band decision rules
as cycle 1 (phase1.md), with the cycle-1 calibration adjustment:
ORANGE band is no longer auto-close because M4 showed mixed > pure
CPU even at modest R when CPU bandwidth-saturates.
docs/k2_deblock_phase2.md - situation analysis. C reference already
in vendored snapshot (vp9dsp_template.c:1780-1898). Fetched
vp9lpf_neon.S fresh (1334 lines, LGPL-2.1+, FFmpeg n7.1.3 pin,
SHA-256 384e49e7...). PROVENANCE.md updated.
docs/k2_deblock_phase3.md - NEON baseline:
M1''_c bit-exact 100.0000 % (10000 random edges)
M3'' throughput 48.285 Medge/s (20.7 ns/edge, single A76)
per-frame 1080p-eq 748 FPS (worst case 64 530 edges/frame)
cycles/edge ~58 (=20.7ns x 2.8GHz), ~7 cycles/row
LPF is 5.9x faster per-unit than IDCT M3 (20.7 vs 122 ns), so the
QPU break-even point moves down. Predicted R''_v1 band ~0.5-0.9
- frame-level batching amortises the same 33us dispatch overhead;
workload becomes bandwidth-bound rather than compute-bound
(~5.7 MB/frame traffic at 64 530 edges x ~88 B per edge).
New artifacts:
- tests/vp9_lpf_ref.c - standalone bit-exact C ref (8-bit, wd=4
inner only; clean transcription)
- tests/bench_neon_lpf.c - M1''_c gate + M3'' time-based bench
(5s window, edge-content-biased RNG for
realistic fm/hev hit rates)
- external/ffmpeg-snapshot/libavcodec/aarch64/vp9lpf_neon.S
- CMakeLists.txt updated with bench_neon_lpf target
Phase 4 next: plan the QPU LPF compute shader. Cycle 1's phase4.md
+ phase5.md + phase7.md learnings apply directly - bake in the v4
winning patterns from the start (WG=256, edges-per-subgroup
pattern adapted from blocks, uint8_t dst SSBO, oob flag, unrolled
writes).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
8182e43c15 |
Phase 7 M4: mixed CPU+QPU beats pure 4-core NEON; project continues
The YELLOW-band gate test from phase1.md (concurrent CPU+QPU vs
pure-CPU baseline). tests/bench_concurrent.c is a pthread harness
that runs N NEON workers (pinned to cores 0..N-1) and optionally a
QPU dispatch loop on its own pinned thread; 8s time-based windows;
sums per-worker block counts.
Raw results on hertz (1920x1088, 32640 blocks/dispatch):
Config Mblock/s 1080p FPS-eq
NEON 1-core 12.623 389.6
NEON 4-core 7.074 218.3 <- realistic CPU ceiling
QPU only 6.890 212.7
MIXED NEON-3 + QPU 7.583 234.0 <- +7.2 % over NEON-4
MIXED NEON-4 + QPU 7.739 238.9 <- +9.4 % oversubscribed
Headline findings beyond the gate test itself:
F1 — Pi 5 LPDDR4x saturates well before 4-core CPU scaling.
NEON-1 (12.6) > NEON-4 (7.1): 4 cores deliver 0.56x the
per-core throughput, not 4x. The realistic CPU ceiling for
memory-bound IDCT work is ~7 Mblock/s aggregate, not the
~32 Mblock/s a naive 4x scaling would predict. This recasts
phase7.md's R=0.92 framing: the right baseline is "4-core
NEON saturated", which the QPU effectively matches (6.89
vs 7.07) on its own.
F2 — QPU contributes meaningfully BECAUSE it doesn't fully share
the CPU's bandwidth bottleneck (own access channel + v3d L2
cache partially insulate it). Mixed adds the QPU's 0.51
Mblock/s on top of an already-saturated CPU.
F3 — Oversubscribed mode (NEON-4 + QPU) is not harmful — per-NEON
drops slightly but QPU adds more than the loss. Net +9 %.
F4 — Freed-core story is bigger than the throughput delta. In
mixed NEON-3+QPU, the 4th core is 100 % free for entropy
decode (Bool coder, ANS) which MUST run on CPU. Pure NEON-4
has nothing left. Realistic decode pipeline gets more like
a 30-50 % effective throughput uplift, not just 7 %.
Verdict per phase1.md YELLOW-band rule (mixed > pure-CPU): PASS.
Project continues to next-kernel cycle (recommend deblocking or
CDEF — same "small parallel block-level" workload class that
amortises the same M4 wins).
docs/phase7_M4.md captures the full M4 harness design, all 5
configs raw output, and the leaves-open items: M7 wall-power via
Himbeere plug, sustained-thermal test, realistic-bitstream
coefficient distribution, multi-frame async pipelining.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
d66f22f333 |
Phase 6 (v1+v4 production) + Phase 7 closure: R = 0.92 ± 0.03 on hertz
First QPU IDCT8 kernel running and bit-exact on V3D 7.1 via Mesa
v3dv compute. Five iterations through a Phase 7→Phase 4' loopback;
production kernel is v4.
New files:
- src/v3d_runner.{c,h} — reusable Vulkan compute plumbing (instance,
V3D device picker, HOST_VISIBLE|COHERENT
SSBOs with mmap, compute pipeline from .spv,
enables storageBuffer{8,16}BitAccess)
- src/v3d_idct8.comp — VP9 8x8 DCT_DCT IDCT add, v4 production:
256 invocations/WG, 2 blocks/subgroup
(no idle lanes), uint8 dst SSBO (race-free
per phase5 finding 5), unrolled writes
(no chained ternary), oob-flag pattern
(barrier-safe per phase5 finding 7)
- tests/bench_v3d_idct.c — M1' bit-exact gate + M2 throughput vs C ref
- docs/phase7.md — full iteration journey + decision verdict
CMakeLists.txt updated to build the new shader, library, and bench
when DAEDALUS_BUILD_VULKAN=ON.
Iteration record (1920x1088 luma, 32640 blocks/dispatch, N=3):
ver change R ns/block
v1 first-light 0.230 533
v2 kill ternary + 2-blocks-per-sg 0.474 258
v3 per-pass scope oN 0.481 254 (noise)
v4 WG 64 -> 256 invocations 0.947 129
v5 packed uint32 coeff reads 0.938 130 (noise, reverted)
v4 final N=3 0.918 +/- 0.033
Bit-exactness 100.0000% across all iterations (10000-block sample
on 128x128, 32640-block sample on 1080p) against both the C
reference (tests/vp9_idct8_ref.c) and the vendored FFmpeg NEON
ff_vp9_idct_idct_8x8_add_neon.
Key learning over the Phase 5 review's prediction model: the
chained ternary was NOT a spill killer on V3D 7.1 (shaderdb
showed 0:0 spills:fills even in v1). The actual lever was
workgroup-size-driven latency hiding — going from 64 to 256
invocations doubled throughput with the same compiled code
(270 inst, 2 threads, 21 max-temps, 0 spills) because the
v3dv scheduler had 4x more in-flight work to overlap TMU
latency.
Verdict per phase1.md decision rules: YELLOW band (0.5 <= R < 1.0)
by a wide margin, near GREEN boundary. Phase 1 YELLOW rule:
add M4 (concurrent CPU+QPU throughput) before honest-close or
continue. M4 is the next measurement, not more shader tuning —
at R = 0.92 with all 4 A76 cores still 100% free for other work,
the question is whether the system aggregate beats pure 4-core
NEON.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|