6 Commits

Author SHA1 Message Date
marfrit 432d127ea9 Merge pull request 'v3d_runner: SPV path search + bench preflight — RETRACTS PR #36's headline' (#37) from noether/spv-search-and-bench-retract into main
Reviewed-on: #37
2026-05-25 20:33:30 +00:00
claude-noether 1347fb961c v3d_runner: SPV path search + bench preflight — RETRACTS PR #36's headline
PR #36 reported a 4.30x QPU-over-CPU win for the H.264 1080p hot-path
sum.  That number was a measurement artifact.  This commit makes the
artifact impossible to reproduce by ANYONE running the bench again.

THE BUG
-------

v3d_runner read_spv() did fopen(spv_path, "rb") with no path search:
the caller passes a bare filename like "v3d_h264_idct4.spv" and fopen
resolves it relative to cwd.  The cmake build puts SPVs in $builddir
(e.g. ~/src/daedalus-fourier/build/), but the bench (and test_api_h264)
were typically invoked from ~/src/daedalus-fourier/, so fopen failed.

On failure read_spv printed perror and returned NULL; pipeline create
then returned -1; dispatch then returned -1; the bench loop ignored
the return value and timed the failure path.  Each iter cost ~1-5 µs
(open + perror + return), which divided across 256 ops gave ~10-20
ns/op — looking convincingly like real-but-fast QPU work.

PR #36's "QPU 2.47 ns/op" for IDCT 4x4 was that artifact.  PR #10's
much-slower "QPU 37.77 ms" measurement was REAL (SPV apparently found
that time, perhaps run from build/), so the artifact is what made it
look like the gap had closed.  The gap never closed.

CORRECTED NUMBERS
-----------------

Run from hertz (Pi 5 V3D 7.1, 30 iters x 5 warmup) AFTER this commit:

  kernel             CPU ns/op  QPU ns/op  winner
  IDCT 4x4 luma          10.75     217.63  CPU 20.24x
  IDCT 8x8 luma          29.69     785.94  CPU 26.47x
  Deblock luma_v         17.63     467.42  CPU 26.51x
  Deblock luma_h         38.30     498.53  CPU 13.02x
  qpel mc20 (8x8)        30.17    1300.44  CPU 43.10x
  qpel mc02 (8x8)        17.69    1363.40  CPU 77.08x
  qpel mc22 (8x8)        71.60    1948.37  CPU 27.21x

  1080p worst-case sum (IDCT4 + deblock luma + qpel mc22):
    CPU NEON only:   5.57 ms
    QPU only:      123.54 ms
    Ratio:        CPU/QPU sum = 0.05x  (QPU 22x SLOWER than CPU)

QPU is currently 12-77x slower per kernel.  The post-buffer-pool /
post-persistent-cmdbuf dispatch overhead (tasks #160, #161) did NOT
close the gap with NEON.  Whether those tasks helped at all needs
re-measurement — the previous "we saw a big win" reading was the
same artifact.

PR #36's commit-message claim "PR #10's verdict is reversed" is
withdrawn.  PR #10 was right; PR #36 was wrong.

THE FIX
-------

Two changes:

1. v3d_runner: SPV search now tries, in order:
     - cwd (legacy)
     - $DAEDALUS_SHADER_DIR (env override)
     - <readlink /proc/self/exe>/.. (binary-relative)
     - /opt/fourier/share/daedalus-fourier/ (Pi 5 install)
     - /usr/share/daedalus-fourier/ (system-wide)
   Found-anywhere succeeds silently.  Found-nowhere prints one error
   naming all searched locations.

2. bench_h264_primitives: bench_fn now returns int.  bench_ns does
   a single preflight call; if rc != 0 it prints "DISPATCH FAILED
   rc=N — kernel skipped" and bails on the kernel.  Main loop counts
   QPU failures and exits 2 BEFORE printing the comparison table if
   any kernel failed — so the next person running this can't read
   fail-fast timings as substrate numbers.

POLICY IMPLICATIONS
-------------------

The QPU substrate decree (2026-05-23) was conceived as a policy
choice that overrides per-kernel measurement.  With the corrected
data the gap is not "fixable defect we'll close with one more
optimization", it's an order of magnitude.  Whether to keep the
decree, soften it (auto = QPU only where measured advantage), or
revert is now a clear-eyed decision for the user.

This commit doesn't change the recipe table — that's a separate
question, taken on its own merits with this data in hand.

Related: marfrit-packages PR #104 (libavcodec ctx flipped no_qpu →
qpu-capable) was justified by PR #36's artifact and should be
reverted; that revert lands in a follow-up to marfrit-packages.
2026-05-25 21:45:12 +02:00
marfrit 9be02a9470 Merge pull request 'bench: H.264 primitive bench now measures both substrates + comparison table' (#36) from noether/h264-qpu-bench-and-cleanup into main
Reviewed-on: #36
2026-05-25 18:56:01 +00:00
claude-noether 989818c2e6 bench: H.264 primitive bench now measures both substrates + comparison table
Closes task #166 (re-measure R-bands on post-buffer-pool dispatch path).

Now that all H.264 hot-path primitives have QPU shaders and the
dispatch overhead has been hammered down (tasks #160 buffer pool,
#161 persistent command buffer), bench_h264_primitives no longer
measures one column.  Two passes — CPU NEON and QPU V3D7 compute —
with a side-by-side per-kernel comparison and ratio.

Headline result on hertz (Pi 5 V3D 7.1, 30 iters x 5 warmup):

  kernel             CPU ns/op  QPU ns/op  winner
  IDCT 4x4 luma          10.79       2.47  QPU 4.36x
  IDCT 8x8 luma          29.69       9.23  QPU 3.22x
  Deblock luma_v         17.58      10.21  QPU 1.72x
  Deblock luma_h         38.41       9.98  QPU 3.85x
  qpel mc20 (8x8)        28.24       9.66  QPU 2.92x
  qpel mc02 (8x8)        16.96      20.54  CPU 1.21x
  qpel mc22 (8x8)        71.58       9.64  QPU 7.43x

  1080p worst-case sum (IDCT4 + deblock luma + qpel mc22):
    CPU NEON only:  5.57 ms
    QPU only:       1.30 ms   (CPU/QPU sum ratio = 4.30x)

Reverses PR #10's verdict (which had CPU NEON 4x faster than QPU
for IDCT-only) — the buffer-pool + persistent-cmdbuf wins land
hard.  Only qpel mc02 still shows CPU ahead, marginally (single-
axis vertical filter, row-strided memory pattern unfriendly to the
WG layout — left as a follow-up for cycle-9-style targeted tuning).

Substrate decree (2026-05-23) stays in force as policy — these
numbers retroactively justify it.

Also tightens test_api_h264's startup recipe print: the stale
"(CPU)" / "(CPU, no QPU H shader yet)" / "(CPU, bS=4 set)" labels
next to deblock_lh, deblock_cv, deblock_ch and deblock_*_intra
are now wrong since PRs #28, #29, #35 (those kernels are on QPU).
2026-05-25 20:42:39 +02:00
marfrit 1446b779a6 Merge pull request 'h264: V3D shaders for the 4 bS=4 intra deblock variants — deblock QPU complete' (#35) from noether/v3d-shader-h264-deblock-intra into main
Reviewed-on: #35
2026-05-25 18:36:10 +00:00
claude-noether c2d1e9790e h264: V3D shaders for the 4 bS=4 intra deblock variants — deblock QPU complete
Closes the H.264 deblock QPU coverage matrix.  Adds the 4 intra
(bS=4) variants — luma_v/h_intra + chroma_v/h_intra.

Algorithmically distinct from the bS<4 path:
  - Per-side strong/weak filter selector
      strong_p = (|p2-p0| < β) AND (|p0-q0| < (α>>2) + 2)
      strong_q = (|q2-q0| < β) AND (|p0-q0| < (α>>2) + 2)
  - Strong-p updates p0/p1/p2 with 5-/4-/3-tap blends (reads p3)
  - Weak-p updates p0 only with (2*p1 + p0 + q1 + 2) >> 2
  - Mirror for q-side; no tc0 (bS=4 hardcodes the strength)
  - Chroma always weak, only p0/q0 updated (same as bS<4 chroma)

Per H.264 §8.3.2.3.  Transcribed from PR #11's C reference
(tests/h264_intra_loop_filter_ref.c).

Shaders:
  - v3d_h264deblock_luma_v_intra.comp  (luma 16-cell + strong/weak)
  - v3d_h264deblock_luma_h_intra.comp  (transpose of luma_v_intra)
  - v3d_h264deblock_chroma_v_intra.comp (8-cell, always weak)
  - v3d_h264deblock_chroma_h_intra.comp (transpose of chroma_v_intra)

Dispatch wiring:
  - 4 new pipeline pairs in daedalus_ctx
  - dispatch_h264_deblock_luma_intra_qpu helper (parameterised by
    orient_h for V vs H) — 2 wrappers
  - chroma intra reuses the existing dispatch_h264_deblock_chroma_qpu
    helper (same WG geometry as bS<4 chroma) — 2 wrappers
  - DEFINE_INTRA_DISPATCH macro extended with qpu_fn parameter,
    routes CPU/QPU per recipe table
  - Recipe table flips DAEDALUS_KERNEL_H264_DEBLOCK_*_INTRA from CPU
    to QPU

Verified on hertz:

  $ ./build/test_api_h264 | grep intra
    H.264 deblock luma v intra:   1024/1024 bytes bit-exact
    H.264 deblock luma h intra:   1024/1024 bytes bit-exact
    H.264 deblock chroma v intra:  256/256 bytes bit-exact
    H.264 deblock chroma h intra:  256/256 bytes bit-exact

All 4 PASS first try.  Strong/weak quad-tree selector + per-side
asymmetry would have surfaced any sign/shift/index mistake; passing
on all 4 (including the asymmetric writes-3-cells cases) means the
transcription from C is clean.

Deblock QPU coverage matrix — COMPLETE (8 of 8):

  bS<4 (non-intra):
    luma_v    ✓ cycle 8
    luma_h    ✓ PR #28
    chroma_v  ✓ PR #29
    chroma_h  ✓ PR #29

  bS=4 (intra, this PR):
    luma_v    ✓
    luma_h    ✓
    chroma_v  ✓
    chroma_h  ✓

The full H.264 8-bit 4:2:0 hot-path pixel-math layer is now on QPU
when daedalus is initialised with a QPU-capable context:
  - IDCT 4x4 / 8x8 ✓
  - All 8 deblock variants ✓
  - All 30 qpel positions (15 put_ + 15 avg_) ✓
2026-05-25 20:30:07 +02:00
9 changed files with 627 additions and 100 deletions
+21 -1
View File
@@ -317,6 +317,22 @@ if (DAEDALUS_BUILD_VULKAN)
VERBATIM VERBATIM
) )
# Intra (bS=4) deblock shaders — strong/weak filter selector per
# H.264 §8.3.2.3. 4 variants (luma_v/h + chroma_v/h).
foreach(_kind luma_v_intra luma_h_intra chroma_v_intra chroma_h_intra)
set(_spv ${CMAKE_BINARY_DIR}/v3d_h264deblock_${_kind}.spv)
add_custom_command(
OUTPUT ${_spv}
COMMAND ${GLSLANG_VALIDATOR} -V --target-env vulkan1.3
-o ${_spv}
${CMAKE_SOURCE_DIR}/src/v3d_h264deblock_${_kind}.comp
DEPENDS ${CMAKE_SOURCE_DIR}/src/v3d_h264deblock_${_kind}.comp
COMMENT "glslang: v3d_h264deblock_${_kind}.comp -> .spv"
VERBATIM
)
set(H264DEBLOCK_${_kind}_SPV ${_spv})
endforeach()
set(H264_IDCT4_SPV ${CMAKE_BINARY_DIR}/v3d_h264_idct4.spv) set(H264_IDCT4_SPV ${CMAKE_BINARY_DIR}/v3d_h264_idct4.spv)
add_custom_command( add_custom_command(
OUTPUT ${H264_IDCT4_SPV} OUTPUT ${H264_IDCT4_SPV}
@@ -406,7 +422,7 @@ if (DAEDALUS_BUILD_VULKAN)
set(H264_QPEL_avg_${_mc}_SPV ${_spv}) set(H264_QPEL_avg_${_mc}_SPV ${_spv})
endforeach() endforeach()
add_custom_target(daedalus_shaders ALL DEPENDS ${NOOP_SPV} ${IDCT8_SPV} ${LPF_SPV} ${MC_SPV} ${LPF8_SPV} ${CDEF_SPV} ${H264DEBLOCK_SPV} ${H264DEBLOCK_H_SPV} ${H264DEBLOCK_CHROMA_V_SPV} ${H264DEBLOCK_CHROMA_H_SPV} ${H264_IDCT4_SPV} ${H264_IDCT8_SPV} ${H264_QPEL_MC20_SPV} ${H264_QPEL_MC02_SPV} ${H264_QPEL_MC22_SPV} ${H264_QPEL_mc10_SPV} ${H264_QPEL_mc30_SPV} ${H264_QPEL_mc01_SPV} ${H264_QPEL_mc03_SPV} ${H264_QPEL_mc11_SPV} ${H264_QPEL_mc12_SPV} ${H264_QPEL_mc13_SPV} ${H264_QPEL_mc21_SPV} ${H264_QPEL_mc23_SPV} ${H264_QPEL_mc31_SPV} ${H264_QPEL_mc32_SPV} ${H264_QPEL_mc33_SPV} ${H264_QPEL_avg_mc20_SPV} ${H264_QPEL_avg_mc02_SPV} ${H264_QPEL_avg_mc22_SPV} ${H264_QPEL_avg_mc10_SPV} ${H264_QPEL_avg_mc30_SPV} ${H264_QPEL_avg_mc01_SPV} ${H264_QPEL_avg_mc03_SPV} ${H264_QPEL_avg_mc11_SPV} ${H264_QPEL_avg_mc12_SPV} ${H264_QPEL_avg_mc13_SPV} ${H264_QPEL_avg_mc21_SPV} ${H264_QPEL_avg_mc23_SPV} ${H264_QPEL_avg_mc31_SPV} ${H264_QPEL_avg_mc32_SPV} ${H264_QPEL_avg_mc33_SPV}) add_custom_target(daedalus_shaders ALL DEPENDS ${NOOP_SPV} ${IDCT8_SPV} ${LPF_SPV} ${MC_SPV} ${LPF8_SPV} ${CDEF_SPV} ${H264DEBLOCK_SPV} ${H264DEBLOCK_H_SPV} ${H264DEBLOCK_CHROMA_V_SPV} ${H264DEBLOCK_CHROMA_H_SPV} ${H264DEBLOCK_luma_v_intra_SPV} ${H264DEBLOCK_luma_h_intra_SPV} ${H264DEBLOCK_chroma_v_intra_SPV} ${H264DEBLOCK_chroma_h_intra_SPV} ${H264_IDCT4_SPV} ${H264_IDCT8_SPV} ${H264_QPEL_MC20_SPV} ${H264_QPEL_MC02_SPV} ${H264_QPEL_MC22_SPV} ${H264_QPEL_mc10_SPV} ${H264_QPEL_mc30_SPV} ${H264_QPEL_mc01_SPV} ${H264_QPEL_mc03_SPV} ${H264_QPEL_mc11_SPV} ${H264_QPEL_mc12_SPV} ${H264_QPEL_mc13_SPV} ${H264_QPEL_mc21_SPV} ${H264_QPEL_mc23_SPV} ${H264_QPEL_mc31_SPV} ${H264_QPEL_mc32_SPV} ${H264_QPEL_mc33_SPV} ${H264_QPEL_avg_mc20_SPV} ${H264_QPEL_avg_mc02_SPV} ${H264_QPEL_avg_mc22_SPV} ${H264_QPEL_avg_mc10_SPV} ${H264_QPEL_avg_mc30_SPV} ${H264_QPEL_avg_mc01_SPV} ${H264_QPEL_avg_mc03_SPV} ${H264_QPEL_avg_mc11_SPV} ${H264_QPEL_avg_mc12_SPV} ${H264_QPEL_avg_mc13_SPV} ${H264_QPEL_avg_mc21_SPV} ${H264_QPEL_avg_mc23_SPV} ${H264_QPEL_avg_mc31_SPV} ${H264_QPEL_avg_mc32_SPV} ${H264_QPEL_avg_mc33_SPV})
# v3d_runner — reusable Vulkan plumbing. # v3d_runner — reusable Vulkan plumbing.
add_library(v3d_runner STATIC src/v3d_runner.c) add_library(v3d_runner STATIC src/v3d_runner.c)
@@ -542,6 +558,10 @@ if (DAEDALUS_BUILD_VULKAN)
${H264DEBLOCK_H_SPV} ${H264DEBLOCK_H_SPV}
${H264DEBLOCK_CHROMA_V_SPV} ${H264DEBLOCK_CHROMA_V_SPV}
${H264DEBLOCK_CHROMA_H_SPV} ${H264DEBLOCK_CHROMA_H_SPV}
${H264DEBLOCK_luma_v_intra_SPV}
${H264DEBLOCK_luma_h_intra_SPV}
${H264DEBLOCK_chroma_v_intra_SPV}
${H264DEBLOCK_chroma_h_intra_SPV}
${H264_IDCT4_SPV} ${H264_IDCT4_SPV}
${H264_IDCT8_SPV} ${H264_IDCT8_SPV}
${H264_QPEL_MC20_SPV} ${H264_QPEL_MC20_SPV}
+129 -10
View File
@@ -46,6 +46,11 @@ struct daedalus_ctx {
v3d_pipeline h264deblock_chroma_v_pipe; v3d_pipeline h264deblock_chroma_v_pipe;
int h264deblock_chroma_h_pipe_ready; int h264deblock_chroma_h_pipe_ready;
v3d_pipeline h264deblock_chroma_h_pipe; v3d_pipeline h264deblock_chroma_h_pipe;
/* bS=4 intra deblock pipelines (strong/weak filter selector). */
int h264deblock_luma_v_intra_pipe_ready; v3d_pipeline h264deblock_luma_v_intra_pipe;
int h264deblock_luma_h_intra_pipe_ready; v3d_pipeline h264deblock_luma_h_intra_pipe;
int h264deblock_chroma_v_intra_pipe_ready; v3d_pipeline h264deblock_chroma_v_intra_pipe;
int h264deblock_chroma_h_intra_pipe_ready; v3d_pipeline h264deblock_chroma_h_intra_pipe;
int h264_idct4_pipe_ready; int h264_idct4_pipe_ready;
v3d_pipeline h264_idct4_pipe; v3d_pipeline h264_idct4_pipe;
int h264_idct8_pipe_ready; int h264_idct8_pipe_ready;
@@ -145,6 +150,10 @@ void daedalus_ctx_destroy(daedalus_ctx *ctx)
if (ctx->h264deblock_h_pipe_ready) v3d_runner_destroy_pipeline(ctx->runner, &ctx->h264deblock_h_pipe); if (ctx->h264deblock_h_pipe_ready) v3d_runner_destroy_pipeline(ctx->runner, &ctx->h264deblock_h_pipe);
if (ctx->h264deblock_chroma_v_pipe_ready) v3d_runner_destroy_pipeline(ctx->runner, &ctx->h264deblock_chroma_v_pipe); if (ctx->h264deblock_chroma_v_pipe_ready) v3d_runner_destroy_pipeline(ctx->runner, &ctx->h264deblock_chroma_v_pipe);
if (ctx->h264deblock_chroma_h_pipe_ready) v3d_runner_destroy_pipeline(ctx->runner, &ctx->h264deblock_chroma_h_pipe); if (ctx->h264deblock_chroma_h_pipe_ready) v3d_runner_destroy_pipeline(ctx->runner, &ctx->h264deblock_chroma_h_pipe);
if (ctx->h264deblock_luma_v_intra_pipe_ready) v3d_runner_destroy_pipeline(ctx->runner, &ctx->h264deblock_luma_v_intra_pipe);
if (ctx->h264deblock_luma_h_intra_pipe_ready) v3d_runner_destroy_pipeline(ctx->runner, &ctx->h264deblock_luma_h_intra_pipe);
if (ctx->h264deblock_chroma_v_intra_pipe_ready) v3d_runner_destroy_pipeline(ctx->runner, &ctx->h264deblock_chroma_v_intra_pipe);
if (ctx->h264deblock_chroma_h_intra_pipe_ready) v3d_runner_destroy_pipeline(ctx->runner, &ctx->h264deblock_chroma_h_intra_pipe);
if (ctx->h264_idct4_pipe_ready) v3d_runner_destroy_pipeline(ctx->runner, &ctx->h264_idct4_pipe); if (ctx->h264_idct4_pipe_ready) v3d_runner_destroy_pipeline(ctx->runner, &ctx->h264_idct4_pipe);
if (ctx->h264_idct8_pipe_ready) v3d_runner_destroy_pipeline(ctx->runner, &ctx->h264_idct8_pipe); if (ctx->h264_idct8_pipe_ready) v3d_runner_destroy_pipeline(ctx->runner, &ctx->h264_idct8_pipe);
if (ctx->h264_qpel_mc20_pipe_ready) v3d_runner_destroy_pipeline(ctx->runner, &ctx->h264_qpel_mc20_pipe); if (ctx->h264_qpel_mc20_pipe_ready) v3d_runner_destroy_pipeline(ctx->runner, &ctx->h264_qpel_mc20_pipe);
@@ -207,10 +216,10 @@ daedalus_substrate daedalus_recipe_substrate_for(daedalus_kernel k)
case DAEDALUS_KERNEL_H264_DEBLOCK_LH: return DAEDALUS_SUBSTRATE_QPU; /* v3d_h264deblock_h.spv */ case DAEDALUS_KERNEL_H264_DEBLOCK_LH: return DAEDALUS_SUBSTRATE_QPU; /* v3d_h264deblock_h.spv */
case DAEDALUS_KERNEL_H264_DEBLOCK_CV: return DAEDALUS_SUBSTRATE_QPU; /* v3d_h264deblock_chroma_v.spv */ case DAEDALUS_KERNEL_H264_DEBLOCK_CV: return DAEDALUS_SUBSTRATE_QPU; /* v3d_h264deblock_chroma_v.spv */
case DAEDALUS_KERNEL_H264_DEBLOCK_CH: return DAEDALUS_SUBSTRATE_QPU; /* v3d_h264deblock_chroma_h.spv */ case DAEDALUS_KERNEL_H264_DEBLOCK_CH: return DAEDALUS_SUBSTRATE_QPU; /* v3d_h264deblock_chroma_h.spv */
case DAEDALUS_KERNEL_H264_DEBLOCK_LV_INTRA: return DAEDALUS_SUBSTRATE_CPU; /* bS=4 luma QPU pending */ case DAEDALUS_KERNEL_H264_DEBLOCK_LV_INTRA: return DAEDALUS_SUBSTRATE_QPU; /* v3d_h264deblock_luma_v_intra.spv */
case DAEDALUS_KERNEL_H264_DEBLOCK_LH_INTRA: return DAEDALUS_SUBSTRATE_CPU; case DAEDALUS_KERNEL_H264_DEBLOCK_LH_INTRA: return DAEDALUS_SUBSTRATE_QPU;
case DAEDALUS_KERNEL_H264_DEBLOCK_CV_INTRA: return DAEDALUS_SUBSTRATE_CPU; /* bS=4 chroma QPU pending */ case DAEDALUS_KERNEL_H264_DEBLOCK_CV_INTRA: return DAEDALUS_SUBSTRATE_QPU; /* v3d_h264deblock_chroma_v_intra.spv */
case DAEDALUS_KERNEL_H264_DEBLOCK_CH_INTRA: return DAEDALUS_SUBSTRATE_CPU; case DAEDALUS_KERNEL_H264_DEBLOCK_CH_INTRA: return DAEDALUS_SUBSTRATE_QPU;
case DAEDALUS_KERNEL_H264_QPEL_MC20: return DAEDALUS_SUBSTRATE_QPU; /* v3d_h264_qpel_mc20.spv */ case DAEDALUS_KERNEL_H264_QPEL_MC20: return DAEDALUS_SUBSTRATE_QPU; /* v3d_h264_qpel_mc20.spv */
case DAEDALUS_KERNEL_H264_QPEL_MC02: return DAEDALUS_SUBSTRATE_QPU; /* v3d_h264_qpel_mc02.spv */ case DAEDALUS_KERNEL_H264_QPEL_MC02: return DAEDALUS_SUBSTRATE_QPU; /* v3d_h264_qpel_mc02.spv */
case DAEDALUS_KERNEL_H264_QPEL_MC22: return DAEDALUS_SUBSTRATE_QPU; /* v3d_h264_qpel_mc22.spv */ case DAEDALUS_KERNEL_H264_QPEL_MC22: return DAEDALUS_SUBSTRATE_QPU; /* v3d_h264_qpel_mc22.spv */
@@ -1240,6 +1249,107 @@ static int dispatch_h264_deblock_chroma_h_qpu(daedalus_ctx *ctx,
"v3d_h264deblock_chroma_h.spv", dst, dst_stride, n_edges, meta, 1); "v3d_h264deblock_chroma_h.spv", dst, dst_stride, n_edges, meta, 1);
} }
/* -------------------- H.264 luma/chroma intra (bS=4) QPU dispatches.
* Same WG geometry as the non-intra shaders, same meta layout (tc0[]
* unused — the strong/weak selector replaces it). Bounds match the
* non-intra variants exactly:
* luma_v: dst_off + 3*stride + 16 (reads p3 at -4*stride, writes p2 at -3*stride)
* luma_h: dst_off + 15*stride + 4 (lane → row, reads pix[-4..+3])
* chroma_v: 1*stride + 8 (only p1..q1 = ±2*stride)
* chroma_h: 7*stride + 2 (lane → row, reads pix[-2..+1])
*/
static int dispatch_h264_deblock_luma_intra_qpu(daedalus_ctx *ctx,
v3d_pipeline *pipe, int *pipe_ready, const char *spv,
uint8_t *dst, size_t dst_stride, size_t n_edges,
const daedalus_h264_deblock_meta *meta, int orient_h)
{
if (!*pipe_ready) {
if (v3d_runner_create_pipeline(ctx->runner, spv,
2, sizeof(h264deblock_pc), pipe) != 0)
return -1;
*pipe_ready = 1;
}
size_t meta_bytes = n_edges * 4 * sizeof(uint32_t);
size_t dst_max = 0;
for (size_t i = 0; i < n_edges; i++) {
size_t e = orient_h ? meta[i].dst_off + 15 * dst_stride + 4
: meta[i].dst_off + 3 * dst_stride + 16;
if (e > dst_max) dst_max = e;
}
v3d_buffer bm = {0}, bd = {0};
if (v3d_runner_acquire_buffer(ctx->runner, meta_bytes, &bm)) return -1;
if (v3d_runner_acquire_buffer(ctx->runner, dst_max, &bd)) { v3d_runner_release_buffer(ctx->runner, &bm); return -1; }
memcpy(bd.mapped, dst, dst_max);
uint32_t *m = bm.mapped;
for (size_t i = 0; i < n_edges; i++) {
m[4*i+0] = meta[i].dst_off;
m[4*i+1] = ((uint32_t) meta[i].alpha) | (((uint32_t) meta[i].beta) << 8);
m[4*i+2] = 0; /* tc0 unused for intra */
m[4*i+3] = 0;
}
v3d_buffer binds[2] = { bm, bd };
if (v3d_runner_bind_buffers(ctx->runner, pipe, binds, 2)) goto fail;
uint32_t wg_count = (uint32_t)((n_edges + 15) / 16);
h264deblock_pc pc = { .n_edges = (uint32_t) n_edges,
.dst_stride_u8 = (uint32_t) dst_stride };
if (v3d_runner_pipeline_cmdbuf_reset(ctx->runner, pipe)) goto fail;
VkCommandBuffer cb = pipe->cb;
VkCommandBufferBeginInfo cbbi = { .sType = VK_STRUCTURE_TYPE_COMMAND_BUFFER_BEGIN_INFO };
vkBeginCommandBuffer(cb, &cbbi);
vkCmdBindPipeline(cb, VK_PIPELINE_BIND_POINT_COMPUTE, pipe->pipeline);
vkCmdBindDescriptorSets(cb, VK_PIPELINE_BIND_POINT_COMPUTE,
pipe->layout, 0, 1, &pipe->desc_set, 0, NULL);
vkCmdPushConstants(cb, pipe->layout, VK_SHADER_STAGE_COMPUTE_BIT,
0, sizeof(pc), &pc);
vkCmdDispatch(cb, wg_count, 1, 1);
vkEndCommandBuffer(cb);
if (v3d_runner_submit_wait(ctx->runner, cb)) goto fail;
memcpy(dst, bd.mapped, dst_max);
v3d_runner_release_buffer(ctx->runner, &bd);
v3d_runner_release_buffer(ctx->runner, &bm);
return 0;
fail:
v3d_runner_release_buffer(ctx->runner, &bd);
v3d_runner_release_buffer(ctx->runner, &bm);
return -1;
}
static int dispatch_h264_deblock_luma_v_intra_qpu(daedalus_ctx *ctx,
uint8_t *dst, size_t dst_stride,
size_t n_edges, const daedalus_h264_deblock_meta *meta)
{
return dispatch_h264_deblock_luma_intra_qpu(ctx,
&ctx->h264deblock_luma_v_intra_pipe, &ctx->h264deblock_luma_v_intra_pipe_ready,
"v3d_h264deblock_luma_v_intra.spv", dst, dst_stride, n_edges, meta, 0);
}
static int dispatch_h264_deblock_luma_h_intra_qpu(daedalus_ctx *ctx,
uint8_t *dst, size_t dst_stride,
size_t n_edges, const daedalus_h264_deblock_meta *meta)
{
return dispatch_h264_deblock_luma_intra_qpu(ctx,
&ctx->h264deblock_luma_h_intra_pipe, &ctx->h264deblock_luma_h_intra_pipe_ready,
"v3d_h264deblock_luma_h_intra.spv", dst, dst_stride, n_edges, meta, 1);
}
static int dispatch_h264_deblock_chroma_v_intra_qpu(daedalus_ctx *ctx,
uint8_t *dst, size_t dst_stride,
size_t n_edges, const daedalus_h264_deblock_meta *meta)
{
return dispatch_h264_deblock_chroma_qpu(ctx,
&ctx->h264deblock_chroma_v_intra_pipe, &ctx->h264deblock_chroma_v_intra_pipe_ready,
"v3d_h264deblock_chroma_v_intra.spv", dst, dst_stride, n_edges, meta, 0);
}
static int dispatch_h264_deblock_chroma_h_intra_qpu(daedalus_ctx *ctx,
uint8_t *dst, size_t dst_stride,
size_t n_edges, const daedalus_h264_deblock_meta *meta)
{
return dispatch_h264_deblock_chroma_qpu(ctx,
&ctx->h264deblock_chroma_h_intra_pipe, &ctx->h264deblock_chroma_h_intra_pipe_ready,
"v3d_h264deblock_chroma_h_intra.spv", dst, dst_stride, n_edges, meta, 1);
}
/* -------------------- H.264 IDCT 4x4 QPU dispatch (cycle 6) ----- */ /* -------------------- H.264 IDCT 4x4 QPU dispatch (cycle 6) ----- */
typedef struct { typedef struct {
@@ -2063,7 +2173,7 @@ int daedalus_dispatch_h264_deblock_chroma_h(daedalus_ctx *ctx, daedalus_substrat
return dispatch_h264_deblock_chroma_h_qpu(ctx, dst, dst_stride, n_edges, meta); return dispatch_h264_deblock_chroma_h_qpu(ctx, dst, dst_stride, n_edges, meta);
} }
#define DEFINE_INTRA_DISPATCH(name, kernel, cpu_fn) \ #define DEFINE_INTRA_DISPATCH(name, kernel, cpu_fn, qpu_fn) \
int daedalus_dispatch_h264_deblock_ ## name (daedalus_ctx *ctx, \ int daedalus_dispatch_h264_deblock_ ## name (daedalus_ctx *ctx, \
daedalus_substrate sub, uint8_t *dst, size_t dst_stride, \ daedalus_substrate sub, uint8_t *dst, size_t dst_stride, \
size_t n_edges, const daedalus_h264_deblock_meta *meta) \ size_t n_edges, const daedalus_h264_deblock_meta *meta) \
@@ -2073,14 +2183,23 @@ int daedalus_dispatch_h264_deblock_ ## name (daedalus_ctx *ctx, \
eff = daedalus_recipe_substrate_for(kernel); \ eff = daedalus_recipe_substrate_for(kernel); \
if (eff == DAEDALUS_SUBSTRATE_QPU && !daedalus_ctx_has_qpu(ctx)) \ if (eff == DAEDALUS_SUBSTRATE_QPU && !daedalus_ctx_has_qpu(ctx)) \
eff = DAEDALUS_SUBSTRATE_CPU; \ eff = DAEDALUS_SUBSTRATE_CPU; \
if (eff == DAEDALUS_SUBSTRATE_QPU) return -1; \ if (eff == DAEDALUS_SUBSTRATE_CPU) \
return cpu_fn(ctx, dst, dst_stride, n_edges, meta); \ return cpu_fn(ctx, dst, dst_stride, n_edges, meta); \
return qpu_fn(ctx, dst, dst_stride, n_edges, meta); \
} }
DEFINE_INTRA_DISPATCH(luma_v_intra, DAEDALUS_KERNEL_H264_DEBLOCK_LV_INTRA, dispatch_h264_deblock_luma_v_intra_cpu) DEFINE_INTRA_DISPATCH(luma_v_intra, DAEDALUS_KERNEL_H264_DEBLOCK_LV_INTRA,
DEFINE_INTRA_DISPATCH(luma_h_intra, DAEDALUS_KERNEL_H264_DEBLOCK_LH_INTRA, dispatch_h264_deblock_luma_h_intra_cpu) dispatch_h264_deblock_luma_v_intra_cpu,
DEFINE_INTRA_DISPATCH(chroma_v_intra, DAEDALUS_KERNEL_H264_DEBLOCK_CV_INTRA, dispatch_h264_deblock_chroma_v_intra_cpu) dispatch_h264_deblock_luma_v_intra_qpu)
DEFINE_INTRA_DISPATCH(chroma_h_intra, DAEDALUS_KERNEL_H264_DEBLOCK_CH_INTRA, dispatch_h264_deblock_chroma_h_intra_cpu) DEFINE_INTRA_DISPATCH(luma_h_intra, DAEDALUS_KERNEL_H264_DEBLOCK_LH_INTRA,
dispatch_h264_deblock_luma_h_intra_cpu,
dispatch_h264_deblock_luma_h_intra_qpu)
DEFINE_INTRA_DISPATCH(chroma_v_intra, DAEDALUS_KERNEL_H264_DEBLOCK_CV_INTRA,
dispatch_h264_deblock_chroma_v_intra_cpu,
dispatch_h264_deblock_chroma_v_intra_qpu)
DEFINE_INTRA_DISPATCH(chroma_h_intra, DAEDALUS_KERNEL_H264_DEBLOCK_CH_INTRA,
dispatch_h264_deblock_chroma_h_intra_cpu,
dispatch_h264_deblock_chroma_h_intra_qpu)
#undef DEFINE_INTRA_DISPATCH #undef DEFINE_INTRA_DISPATCH
+44
View File
@@ -0,0 +1,44 @@
// daedalus-fourier — H.264 chroma 4:2:0 intra (bS=4) H deblock —
// V3D 7.1. Transpose of v3d_h264deblock_chroma_v_intra.comp.
//
// License: BSD-2-Clause.
#version 450
#extension GL_EXT_shader_8bit_storage : require
#extension GL_EXT_shader_explicit_arithmetic_types : require
layout(local_size_x = 256, local_size_y = 1, local_size_z = 1) in;
layout(binding = 0) readonly buffer Meta { uvec4 meta[]; } u_meta;
layout(binding = 1) buffer Dst { uint8_t dst[]; } u_dst;
layout(push_constant) uniform PC {
uint n_edges, dst_stride_u8, _p0, _p1;
} pc;
void main()
{
uint lane_in_wg = gl_GlobalInvocationID.x & 255u;
uint edge_in_wg = lane_in_wg >> 4;
uint row_in_edge = lane_in_wg & 15u;
uint edge_idx = gl_WorkGroupID.x * 16u + edge_in_wg;
if (edge_idx >= pc.n_edges) return;
if (row_in_edge >= 8u) return;
uvec4 m = u_meta.meta[edge_idx];
uint stride = pc.dst_stride_u8;
uint dst_off = m.x + row_in_edge * stride;
int alpha = int(m.y & 0xffu);
int beta = int((m.y >> 8) & 0xffu);
if ((alpha | beta) == 0) return;
int p1 = int(u_dst.dst[dst_off - 2u]);
int p0 = int(u_dst.dst[dst_off - 1u]);
int q0 = int(u_dst.dst[dst_off ]);
int q1 = int(u_dst.dst[dst_off + 1u]);
if (abs(p0 - q0) >= alpha) return;
if (abs(p1 - p0) >= beta) return;
if (abs(q1 - q0) >= beta) return;
u_dst.dst[dst_off - 1u] = uint8_t(clamp((2*p1 + p0 + q1 + 2) >> 2, 0, 255));
u_dst.dst[dst_off ] = uint8_t(clamp((2*q1 + q0 + p1 + 2) >> 2, 0, 255));
}
+54
View File
@@ -0,0 +1,54 @@
// daedalus-fourier — H.264 chroma 4:2:0 intra (bS=4) V deblock —
// V3D 7.1. Per H.264 §8.3.2.3 chroma intra path: simpler than luma
// — always weak filter, only p0/q0 updated, 8 cells per edge.
//
// p0' = (2*p1 + p0 + q1 + 2) >> 2
// q0' = (2*q1 + q0 + p1 + 2) >> 2
//
// Same 16-edges × 16-lanes/edge WG shape as luma; lanes 8..15 of each
// edge early-return (chroma edges are only 8 cells wide).
//
// 4:2:0-only — caller-side gating handles 4:2:2 (chroma_format_idc>1)
// at the libavcodec init layer.
//
// License: BSD-2-Clause.
#version 450
#extension GL_EXT_shader_8bit_storage : require
#extension GL_EXT_shader_explicit_arithmetic_types : require
layout(local_size_x = 256, local_size_y = 1, local_size_z = 1) in;
layout(binding = 0) readonly buffer Meta { uvec4 meta[]; } u_meta;
layout(binding = 1) buffer Dst { uint8_t dst[]; } u_dst;
layout(push_constant) uniform PC {
uint n_edges, dst_stride_u8, _p0, _p1;
} pc;
void main()
{
uint lane_in_wg = gl_GlobalInvocationID.x & 255u;
uint edge_in_wg = lane_in_wg >> 4;
uint col_in_edge = lane_in_wg & 15u;
uint edge_idx = gl_WorkGroupID.x * 16u + edge_in_wg;
if (edge_idx >= pc.n_edges) return;
if (col_in_edge >= 8u) return;
uvec4 m = u_meta.meta[edge_idx];
uint dst_off = m.x + col_in_edge;
uint stride = pc.dst_stride_u8;
int alpha = int(m.y & 0xffu);
int beta = int((m.y >> 8) & 0xffu);
if ((alpha | beta) == 0) return;
int p1 = int(u_dst.dst[dst_off - 2u * stride]);
int p0 = int(u_dst.dst[dst_off - 1u * stride]);
int q0 = int(u_dst.dst[dst_off]);
int q1 = int(u_dst.dst[dst_off + 1u * stride]);
if (abs(p0 - q0) >= alpha) return;
if (abs(p1 - p0) >= beta) return;
if (abs(q1 - q0) >= beta) return;
u_dst.dst[dst_off - 1u * stride] = uint8_t(clamp((2*p1 + p0 + q1 + 2) >> 2, 0, 255));
u_dst.dst[dst_off ] = uint8_t(clamp((2*q1 + q0 + p1 + 2) >> 2, 0, 255));
}
+70
View File
@@ -0,0 +1,70 @@
// daedalus-fourier — H.264 luma intra (bS=4) H deblock — V3D 7.1.
//
// Sibling of v3d_h264deblock_luma_v_intra.comp transposed to the
// horizontal axis: lane → row, reads pix[-4..+3] (cols) instead of
// pix[-4*stride..+3*stride] (rows). Same strong/weak filter
// selector + same write-back algebra.
//
// dst_off contract: (m.x % stride) ≥ 4 (kernel reads p3 at pix[-4]).
//
// License: BSD-2-Clause.
#version 450
#extension GL_EXT_shader_8bit_storage : require
#extension GL_EXT_shader_explicit_arithmetic_types : require
layout(local_size_x = 256, local_size_y = 1, local_size_z = 1) in;
layout(binding = 0) readonly buffer Meta { uvec4 meta[]; } u_meta;
layout(binding = 1) buffer Dst { uint8_t dst[]; } u_dst;
layout(push_constant) uniform PC {
uint n_edges, dst_stride_u8, _p0, _p1;
} pc;
void main()
{
uint lane_in_wg = gl_GlobalInvocationID.x & 255u;
uint edge_in_wg = lane_in_wg >> 4;
uint row_in_edge = lane_in_wg & 15u;
uint edge_idx = gl_WorkGroupID.x * 16u + edge_in_wg;
if (edge_idx >= pc.n_edges) return;
uvec4 m = u_meta.meta[edge_idx];
uint stride = pc.dst_stride_u8;
uint dst_off = m.x + row_in_edge * stride;
int alpha = int(m.y & 0xffu);
int beta = int((m.y >> 8) & 0xffu);
if ((alpha | beta) == 0) return;
int p3 = int(u_dst.dst[dst_off - 4u]);
int p2 = int(u_dst.dst[dst_off - 3u]);
int p1 = int(u_dst.dst[dst_off - 2u]);
int p0 = int(u_dst.dst[dst_off - 1u]);
int q0 = int(u_dst.dst[dst_off ]);
int q1 = int(u_dst.dst[dst_off + 1u]);
int q2 = int(u_dst.dst[dst_off + 2u]);
int q3 = int(u_dst.dst[dst_off + 3u]);
if (abs(p0 - q0) >= alpha) return;
if (abs(p1 - p0) >= beta) return;
if (abs(q1 - q0) >= beta) return;
bool strong_common = abs(p0 - q0) < (alpha >> 2) + 2;
bool strong_p = strong_common && abs(p2 - p0) < beta;
bool strong_q = strong_common && abs(q2 - q0) < beta;
if (strong_p) {
u_dst.dst[dst_off - 1u] = uint8_t(clamp((p2 + 2*p1 + 2*p0 + 2*q0 + q1 + 4) >> 3, 0, 255));
u_dst.dst[dst_off - 2u] = uint8_t(clamp((p2 + p1 + p0 + q0 + 2) >> 2, 0, 255));
u_dst.dst[dst_off - 3u] = uint8_t(clamp((2*p3 + 3*p2 + p1 + p0 + q0 + 4) >> 3, 0, 255));
} else {
u_dst.dst[dst_off - 1u] = uint8_t(clamp((2*p1 + p0 + q1 + 2) >> 2, 0, 255));
}
if (strong_q) {
u_dst.dst[dst_off ] = uint8_t(clamp((q2 + 2*q1 + 2*q0 + 2*p0 + p1 + 4) >> 3, 0, 255));
u_dst.dst[dst_off + 1u] = uint8_t(clamp((q2 + q1 + q0 + p0 + 2) >> 2, 0, 255));
u_dst.dst[dst_off + 2u] = uint8_t(clamp((2*q3 + 3*q2 + q1 + q0 + p0 + 4) >> 3, 0, 255));
} else {
u_dst.dst[dst_off ] = uint8_t(clamp((2*q1 + q0 + p1 + 2) >> 2, 0, 255));
}
}
+81
View File
@@ -0,0 +1,81 @@
// daedalus-fourier — H.264 luma intra (bS=4) V deblock — V3D 7.1.
//
// Per H.264 §8.3.2.3: at I-MB edges and certain inter-MB edges that
// force boundary strength to 4, the deblock kernel is structurally
// different from bS<4 — it has a per-side strong/weak filter
// selector that decides whether to update 3 cells (strong) or 1
// (weak), reads p3/q3, and ignores tc0.
//
// strong_common = |p0-q0| < (α>>2) + 2
// strong_p = strong_common AND |p2-p0| < β
// strong_q = strong_common AND |q2-q0| < β
//
// Strong-p updates p0/p1/p2 with specific 5-/4-/3-tap blends.
// Weak-p updates p0 only with (2*p1 + p0 + q1 + 2) >> 2.
// Mirror for q-side.
//
// WG geometry identical to v3d_h264deblock.comp (16 edges × 16 lanes/WG).
// dst_off contract: m.x ≥ 4*stride (kernel reads p3 at -4*stride).
//
// License: BSD-2-Clause. Algorithm transcribed from
// tests/h264_intra_loop_filter_ref.c (PR #11).
#version 450
#extension GL_EXT_shader_8bit_storage : require
#extension GL_EXT_shader_explicit_arithmetic_types : require
layout(local_size_x = 256, local_size_y = 1, local_size_z = 1) in;
layout(binding = 0) readonly buffer Meta { uvec4 meta[]; } u_meta;
layout(binding = 1) buffer Dst { uint8_t dst[]; } u_dst;
layout(push_constant) uniform PC {
uint n_edges, dst_stride_u8, _p0, _p1;
} pc;
void main()
{
uint lane_in_wg = gl_GlobalInvocationID.x & 255u;
uint edge_in_wg = lane_in_wg >> 4;
uint col_in_edge = lane_in_wg & 15u;
uint edge_idx = gl_WorkGroupID.x * 16u + edge_in_wg;
if (edge_idx >= pc.n_edges) return;
uvec4 m = u_meta.meta[edge_idx];
uint dst_off = m.x + col_in_edge;
uint stride = pc.dst_stride_u8;
int alpha = int(m.y & 0xffu);
int beta = int((m.y >> 8) & 0xffu);
if ((alpha | beta) == 0) return;
int p3 = int(u_dst.dst[dst_off - 4u * stride]);
int p2 = int(u_dst.dst[dst_off - 3u * stride]);
int p1 = int(u_dst.dst[dst_off - 2u * stride]);
int p0 = int(u_dst.dst[dst_off - 1u * stride]);
int q0 = int(u_dst.dst[dst_off]);
int q1 = int(u_dst.dst[dst_off + 1u * stride]);
int q2 = int(u_dst.dst[dst_off + 2u * stride]);
int q3 = int(u_dst.dst[dst_off + 3u * stride]);
if (abs(p0 - q0) >= alpha) return;
if (abs(p1 - p0) >= beta) return;
if (abs(q1 - q0) >= beta) return;
bool strong_common = abs(p0 - q0) < (alpha >> 2) + 2;
bool strong_p = strong_common && abs(p2 - p0) < beta;
bool strong_q = strong_common && abs(q2 - q0) < beta;
if (strong_p) {
u_dst.dst[dst_off - 1u * stride] = uint8_t(clamp((p2 + 2*p1 + 2*p0 + 2*q0 + q1 + 4) >> 3, 0, 255));
u_dst.dst[dst_off - 2u * stride] = uint8_t(clamp((p2 + p1 + p0 + q0 + 2) >> 2, 0, 255));
u_dst.dst[dst_off - 3u * stride] = uint8_t(clamp((2*p3 + 3*p2 + p1 + p0 + q0 + 4) >> 3, 0, 255));
} else {
u_dst.dst[dst_off - 1u * stride] = uint8_t(clamp((2*p1 + p0 + q1 + 2) >> 2, 0, 255));
}
if (strong_q) {
u_dst.dst[dst_off ] = uint8_t(clamp((q2 + 2*q1 + 2*q0 + 2*p0 + p1 + 4) >> 3, 0, 255));
u_dst.dst[dst_off + 1u * stride] = uint8_t(clamp((q2 + q1 + q0 + p0 + 2) >> 2, 0, 255));
u_dst.dst[dst_off + 2u * stride] = uint8_t(clamp((2*q3 + 3*q2 + q1 + q0 + p0 + 4) >> 3, 0, 255));
} else {
u_dst.dst[dst_off ] = uint8_t(clamp((2*q1 + q0 + p1 + 2) >> 2, 0, 255));
}
}
+62 -2
View File
@@ -8,6 +8,8 @@
#include <stdio.h> #include <stdio.h>
#include <stdlib.h> #include <stdlib.h>
#include <string.h> #include <string.h>
#include <unistd.h>
#include <limits.h>
#define CHK(call) do { VkResult r__ = (call); if (r__ != VK_SUCCESS) { \ #define CHK(call) do { VkResult r__ = (call); if (r__ != VK_SUCCESS) { \
fprintf(stderr, "v3d_runner: vulkan error %d at %s:%d (%s)\n", \ fprintf(stderr, "v3d_runner: vulkan error %d at %s:%d (%s)\n", \
@@ -368,10 +370,68 @@ void v3d_runner_destroy_buffer(v3d_runner *r, v3d_buffer *buf)
/* ---- Pipelines -------------------------------------------------- */ /* ---- Pipelines -------------------------------------------------- */
/* SPV lookup tries a small set of locations. The caller passes a bare
* filename (e.g. "v3d_h264_idct4.spv"); we try, in order:
*
* 1. cwd-relative (legacy contract; works when run from build/)
* 2. $DAEDALUS_SHADER_DIR (env override for tests / packaged installs)
* 3. <binary-dir>/<name> (so the bench/test binary finds the SPV next
* to itself regardless of cwd this is the
* fix for the silent-no-SPV regression that
* made PR #36's bench numbers meaningless)
* 4. /opt/fourier/share/daedalus-fourier/<name> (Pi 5 install layout)
* 5. /usr/share/daedalus-fourier/<name> (system-wide install)
*
* Returns NULL only if every location fails, with a single perror naming
* the bare filename so the user can grep for it. */
static FILE *open_spv(const char *name)
{
FILE *f = fopen(name, "rb");
if (f) return f;
const char *envdir = getenv("DAEDALUS_SHADER_DIR");
if (envdir && *envdir) {
char p[PATH_MAX];
snprintf(p, sizeof(p), "%s/%s", envdir, name);
f = fopen(p, "rb");
if (f) return f;
}
char exe[PATH_MAX];
ssize_t n = readlink("/proc/self/exe", exe, sizeof(exe) - 1);
if (n > 0) {
exe[n] = 0;
char *slash = strrchr(exe, '/');
if (slash) {
*slash = 0;
char p[PATH_MAX];
snprintf(p, sizeof(p), "%s/%s", exe, name);
f = fopen(p, "rb");
if (f) return f;
}
}
char p[PATH_MAX];
snprintf(p, sizeof(p), "/opt/fourier/share/daedalus-fourier/%s", name);
f = fopen(p, "rb");
if (f) return f;
snprintf(p, sizeof(p), "/usr/share/daedalus-fourier/%s", name);
f = fopen(p, "rb");
if (f) return f;
return NULL;
}
static uint32_t *read_spv(const char *path, size_t *out_size) static uint32_t *read_spv(const char *path, size_t *out_size)
{ {
FILE *f = fopen(path, "rb"); FILE *f = open_spv(path);
if (!f) { perror(path); return NULL; } if (!f) {
fprintf(stderr,
"daedalus: SPV not found via cwd / $DAEDALUS_SHADER_DIR / "
"binary-dir / /opt/fourier/share / /usr/share: %s\n", path);
return NULL;
}
fseek(f, 0, SEEK_END); fseek(f, 0, SEEK_END);
long sz = ftell(f); long sz = ftell(f);
fseek(f, 0, SEEK_SET); fseek(f, 0, SEEK_SET);
+160 -81
View File
@@ -2,25 +2,22 @@
/* CLOCK_MONOTONIC under -std=c11 -CMAKE_C_EXTENSIONS=OFF. */ /* CLOCK_MONOTONIC under -std=c11 -CMAKE_C_EXTENSIONS=OFF. */
#define _POSIX_C_SOURCE 200809L #define _POSIX_C_SOURCE 200809L
/* /*
* bench_h264_primitives NEON-path latency baseline for the H.264 * bench_h264_primitives latency baseline for the H.264 primitive
* primitive library landed across PRs #9#23. * library landed across PRs #9#35.
* *
* Each kernel is exercised at a representative per-frame N for 1080p * Each kernel is exercised at a representative per-frame N for 1080p
* (8160 MBs); the per-kernel total + ns/op + ms/frame are reported. * (8160 MBs); the per-kernel total + ns/op + ms/frame are reported,
* Lets us answer "what's the total NEON-only budget for the H.264 * once per substrate (CPU NEON, QPU V3D7 compute). The QPU column
* decode at 1080p" — useful for sizing intercept-patch decisions * appears only when the host has a usable Vulkan device. When both
* (which kernels NEED QPU shaders vs which are budget-fine on NEON). * columns exist a CPU/QPU ratio is printed; that's the per-kernel
* data the QPU-substrate decree (2026-05-23) deliberately overrides
* but which is still useful to track over time as dispatch overhead
* shrinks (buffer pool, persistent cmdbuf, dmabuf import tasks 160-162).
* *
* NOT a ctest produces wall-time numbers, doesn't pass/fail. * NOT a ctest produces wall-time numbers, doesn't pass/fail.
* *
* Invoke: ./build/bench_h264_primitives [iters] * Invoke: ./build/bench_h264_primitives [iters [warmup]]
* (default iters = 50, post-warmup = 5) * (default iters = 50, warmup = 5)
*
* NB: results are inherently approximate single-core, includes
* loop overhead + memory access patterns that may not match what
* a real decode would hit (we touch a small set of pages repeatedly).
* The numbers are useful for relative comparison and order-of-
* magnitude sizing, not absolute perf claims.
*/ */
#include "daedalus.h" #include "daedalus.h"
@@ -46,34 +43,47 @@ static double now_ms(void) {
/* Per-1080p-frame counts (8160 MBs at 1920x1088). */ /* Per-1080p-frame counts (8160 MBs at 1920x1088). */
#define MBS_1080P 8160 #define MBS_1080P 8160
#define LUMA_4x4_PER_MB 16 /* if transform_8x8=0 */
#define LUMA_8x8_PER_MB 4 /* if transform_8x8=1 */
#define CHROMA_4x4_PER_MB 8 /* 4 Cb + 4 Cr */
#define DEBLOCK_LUMA_EDGES_PER_MB 4 /* 4 horiz + 4 vert internal+MB-edge — ~4 each */
#define DEBLOCK_CHROMA_EDGES_PER_MB 2 /* 2 each direction */
/* Standard benchmark loop. fn() is called n times per iteration. */ /* Standard benchmark loop. fn() is called n times per iteration.
typedef void (*bench_fn)(void); *
* fn() now returns the dispatch's int rc. A single preflight call is
* made before the hot loop; if rc != 0 (which on the QPU substrate
* almost always means "SPV not found via any search path"), bench_ns
* returns -1 and the caller must NOT report the kernel as measured.
*
* Without this, a missing SPV makes every dispatch fail fast at the
* cost of one fprintf+open call (~1-5 µs), and the loop times that
* cost as if it were real QPU work producing absurdly-small ns/op
* numbers that look like a QPU speedup. This is exactly what made
* PR #36's bench numbers a measurement artifact. */
typedef int (*bench_fn)(void);
static double bench_ns(const char *name, int iters, int warmup, static double bench_ns(const char *name, int iters, int warmup,
int ops_per_iter, bench_fn fn) int ops_per_iter, bench_fn fn)
{ {
int rc = fn();
if (rc != 0) {
printf(" %-32s DISPATCH FAILED rc=%d — kernel skipped\n", name, rc);
return -1;
}
for (int i = 0; i < warmup; i++) fn(); for (int i = 0; i < warmup; i++) fn();
double t0 = now_ms(); double t0 = now_ms();
for (int i = 0; i < iters; i++) fn(); for (int i = 0; i < iters; i++) fn();
double t1 = now_ms(); double t1 = now_ms();
double total_ms = (t1 - t0); double total_ms = (t1 - t0);
double ns_per_op = (total_ms * 1e6) / ((double) iters * ops_per_iter); double ns_per_op = (total_ms * 1e6) / ((double) iters * ops_per_iter);
printf(" %-32s %8.2f ns/op (%d iters x %d ops)\n", printf(" %-32s %10.2f ns/op (%d iters x %d ops)\n",
name, ns_per_op, iters, ops_per_iter); name, ns_per_op, iters, ops_per_iter);
return ns_per_op; return ns_per_op;
} }
/* ---- Per-kernel scaffolding. Each section sets up the buffers + /* ---- Per-kernel scaffolding. Each section sets up the buffers +
* meta, then defines a static fn() that calls the corresponding * meta, then defines a static fn() that calls the corresponding
* dispatch with a representative N. */ * dispatch with a representative N. The substrate is read from the
* global g_sub so the same fn() can be re-driven with CPU then QPU. */
static daedalus_ctx *ctx; static daedalus_ctx *ctx;
static daedalus_substrate g_sub = DAEDALUS_SUBSTRATE_CPU;
/* --- IDCT 4x4 luma: N = 16 blocks per MB. Bench with 1024 blocks /* --- IDCT 4x4 luma: N = 16 blocks per MB. Bench with 1024 blocks
* per call (64 MBs worth). Per-MB the dispatch overhead is the * per call (64 MBs worth). Per-MB the dispatch overhead is the
@@ -82,8 +92,8 @@ static int16_t idct4_coeffs[1024 * 16];
static daedalus_h264_block_meta idct4_meta[1024]; static daedalus_h264_block_meta idct4_meta[1024];
static uint8_t idct_dst[64 * 4 * 16 * 16]; /* 64 MB-rows × ... */ static uint8_t idct_dst[64 * 4 * 16 * 16]; /* 64 MB-rows × ... */
static void bench_idct4(void) { static int bench_idct4(void) {
daedalus_dispatch_h264_idct4(ctx, DAEDALUS_SUBSTRATE_CPU, return daedalus_dispatch_h264_idct4(ctx, g_sub,
idct_dst, 64*16, idct4_coeffs, 1024, idct4_meta); idct_dst, 64*16, idct4_coeffs, 1024, idct4_meta);
} }
@@ -91,8 +101,8 @@ static void bench_idct4(void) {
static int16_t idct8_coeffs[256 * 64]; static int16_t idct8_coeffs[256 * 64];
static daedalus_h264_block_meta idct8_meta[256]; static daedalus_h264_block_meta idct8_meta[256];
static void bench_idct8(void) { static int bench_idct8(void) {
daedalus_dispatch_h264_idct8(ctx, DAEDALUS_SUBSTRATE_CPU, return daedalus_dispatch_h264_idct8(ctx, g_sub,
idct_dst, 64*16, idct8_coeffs, 256, idct8_meta); idct_dst, 64*16, idct8_coeffs, 256, idct8_meta);
} }
@@ -100,13 +110,13 @@ static void bench_idct8(void) {
static daedalus_h264_deblock_meta deblock_meta[256]; static daedalus_h264_deblock_meta deblock_meta[256];
static uint8_t deblock_dst[256 * 16 * 16]; static uint8_t deblock_dst[256 * 16 * 16];
static void bench_deblock_v(void) { static int bench_deblock_v(void) {
daedalus_dispatch_h264_deblock_luma_v(ctx, DAEDALUS_SUBSTRATE_CPU, return daedalus_dispatch_h264_deblock_luma_v(ctx, g_sub,
deblock_dst, 16, 256, deblock_meta); deblock_dst, 16, 256, deblock_meta);
} }
static void bench_deblock_h(void) { static int bench_deblock_h(void) {
daedalus_dispatch_h264_deblock_luma_h(ctx, DAEDALUS_SUBSTRATE_CPU, return daedalus_dispatch_h264_deblock_luma_h(ctx, g_sub,
deblock_dst, 16, 256, deblock_meta); deblock_dst, 16, 256, deblock_meta);
} }
@@ -115,19 +125,44 @@ static uint8_t qpel_src[256 * 16 * 16];
static uint8_t qpel_dst[256 * 16 * 16]; static uint8_t qpel_dst[256 * 16 * 16];
static daedalus_h264_qpel_meta qpel_meta[256]; static daedalus_h264_qpel_meta qpel_meta[256];
static void bench_qpel_mc20(void) { static int bench_qpel_mc20(void) {
daedalus_dispatch_h264_qpel_mc20(ctx, DAEDALUS_SUBSTRATE_CPU, return daedalus_dispatch_h264_qpel_mc20(ctx, g_sub,
qpel_dst, qpel_src, 16, 256, qpel_meta); qpel_dst, qpel_src, 16, 256, qpel_meta);
} }
static void bench_qpel_mc02(void) { static int bench_qpel_mc02(void) {
daedalus_dispatch_h264_qpel_mc02(ctx, DAEDALUS_SUBSTRATE_CPU, return daedalus_dispatch_h264_qpel_mc02(ctx, g_sub,
qpel_dst, qpel_src, 16, 256, qpel_meta); qpel_dst, qpel_src, 16, 256, qpel_meta);
} }
static void bench_qpel_mc22(void) { static int bench_qpel_mc22(void) {
daedalus_dispatch_h264_qpel_mc22(ctx, DAEDALUS_SUBSTRATE_CPU, return daedalus_dispatch_h264_qpel_mc22(ctx, g_sub,
qpel_dst, qpel_src, 16, 256, qpel_meta); qpel_dst, qpel_src, 16, 256, qpel_meta);
} }
/* ---- One row of bench output:
* - kernel name + N
* - CPU ns/op
* - QPU ns/op (or "n/a" if Vulkan absent)
* - CPU/QPU ratio (>1 means QPU wins; <1 means CPU wins) */
struct row {
const char *name;
int n_per_call;
bench_fn fn;
double cpu_ns;
double qpu_ns; /* -1 if not measured */
int frame_n; /* count per 1080p frame */
};
static struct row rows[] = {
{"IDCT 4x4 luma", 1024, bench_idct4, 0, -1, MBS_1080P * 16},
{"IDCT 8x8 luma", 256, bench_idct8, 0, -1, MBS_1080P * 4},
{"Deblock luma_v", 256, bench_deblock_v, 0, -1, MBS_1080P * 4},
{"Deblock luma_h", 256, bench_deblock_h, 0, -1, MBS_1080P * 4},
{"qpel mc20 (8x8)", 256, bench_qpel_mc20, 0, -1, MBS_1080P * 4},
{"qpel mc02 (8x8)", 256, bench_qpel_mc02, 0, -1, MBS_1080P * 4},
{"qpel mc22 (8x8)", 256, bench_qpel_mc22, 0, -1, MBS_1080P * 4},
};
#define N_ROWS ((int)(sizeof(rows)/sizeof(rows[0])))
int main(int argc, char **argv) int main(int argc, char **argv)
{ {
int iters = argc > 1 ? atoi(argv[1]) : 50; int iters = argc > 1 ? atoi(argv[1]) : 50;
@@ -138,6 +173,7 @@ int main(int argc, char **argv)
fprintf(stderr, "ctx create failed (Vulkan?)\n"); fprintf(stderr, "ctx create failed (Vulkan?)\n");
return 1; return 1;
} }
int has_qpu = daedalus_ctx_has_qpu(ctx);
/* Pre-fill all input buffers with random data so the NEON inner /* Pre-fill all input buffers with random data so the NEON inner
* loops see realistic memory access patterns. */ * loops see realistic memory access patterns. */
@@ -147,8 +183,7 @@ int main(int argc, char **argv)
idct8_coeffs[i] = (int16_t)((int)(xs64() % 1024) - 512); idct8_coeffs[i] = (int16_t)((int)(xs64() % 1024) - 512);
for (size_t i = 0; i < sizeof(qpel_src); i++) qpel_src[i] = (uint8_t)(xs64() & 0xff); for (size_t i = 0; i < sizeof(qpel_src); i++) qpel_src[i] = (uint8_t)(xs64() & 0xff);
/* IDCT meta: each block at offset i*16 (row layout matters less /* IDCT meta. */
* here since we're just measuring per-block latency). */
for (size_t i = 0; i < 1024; i++) for (size_t i = 0; i < 1024; i++)
idct4_meta[i].dst_off = (uint32_t)((i / 16) * 64 + (i % 16) * 4); idct4_meta[i].dst_off = (uint32_t)((i / 16) * 64 + (i % 16) * 4);
for (size_t i = 0; i < 256; i++) for (size_t i = 0; i < 256; i++)
@@ -162,58 +197,102 @@ int main(int argc, char **argv)
for (int s = 0; s < 4; s++) deblock_meta[i].tc0[s] = (int8_t)(s + 1); for (int s = 0; s < 4; s++) deblock_meta[i].tc0[s] = (int8_t)(s + 1);
} }
/* qpel meta: src and dst at row 3 col 3 of each 16x16 tile. */ /* qpel meta. */
for (size_t i = 0; i < 256; i++) { for (size_t i = 0; i < 256; i++) {
qpel_meta[i].src_off = (uint32_t)(i * 256 + 3 * 16 + 3); qpel_meta[i].src_off = (uint32_t)(i * 256 + 3 * 16 + 3);
qpel_meta[i].dst_off = (uint32_t)(i * 256 + 3 * 16 + 3); qpel_meta[i].dst_off = (uint32_t)(i * 256 + 3 * 16 + 3);
} }
printf("bench_h264_primitives: %d iters (%d warmup), substrate=CPU NEON\n", printf("bench_h264_primitives: %d iters (%d warmup)\n", iters, warmup);
iters, warmup); printf(" ctx has_qpu=%d (CPU pass always runs; QPU pass skipped without Vulkan)\n\n", has_qpu);
printf("Per-call N is set per kernel; ns/op is per BLOCK or EDGE.\n\n");
double idct4_ns = bench_ns("IDCT 4x4 luma", iters, warmup, 1024, bench_idct4); /* Pass 1: CPU NEON. */
double idct8_ns = bench_ns("IDCT 8x8 luma", iters, warmup, 256, bench_idct8); g_sub = DAEDALUS_SUBSTRATE_CPU;
double debl_v_ns = bench_ns("Deblock luma_v", iters, warmup, 256, bench_deblock_v); printf("== CPU NEON ==\n");
double debl_h_ns = bench_ns("Deblock luma_h", iters, warmup, 256, bench_deblock_h); for (int i = 0; i < N_ROWS; i++)
double qmc20_ns = bench_ns("qpel mc20 (8x8)", iters, warmup, 256, bench_qpel_mc20); rows[i].cpu_ns = bench_ns(rows[i].name, iters, warmup, rows[i].n_per_call, rows[i].fn);
double qmc02_ns = bench_ns("qpel mc02 (8x8)", iters, warmup, 256, bench_qpel_mc02);
double qmc22_ns = bench_ns("qpel mc22 (8x8)", iters, warmup, 256, bench_qpel_mc22);
/* Per-frame budget summary at 1080p (8160 MBs). Worst-case /* Pass 2: QPU compute (if available). */
* assumptions: int qpu_failures = 0;
* - All MBs are transform_4x4 (16 4x4 IDCTs each) so 130,560 if (has_qpu) {
* IDCT 4x4 blocks per frame. If High profile transform_8x8, g_sub = DAEDALUS_SUBSTRATE_QPU;
* it'd be 32,640 IDCT 8x8 blocks instead. printf("\n== QPU V3D7 compute ==\n");
* - All MBs are intra (no MC qpel zero) OR all inter (no for (int i = 0; i < N_ROWS; i++) {
* intra prediction). We report MC at "all inter, all qpel rows[i].qpu_ns = bench_ns(rows[i].name, iters, warmup, rows[i].n_per_call, rows[i].fn);
* mc22" worst case. if (rows[i].qpu_ns < 0) qpu_failures++;
* - Deblock: ~4 luma_v + 4 luma_h edges per MB; assume all 8 }
* edges trigger filtering. */ if (qpu_failures) {
printf("\nProjected 1080p frame budgets (worst-case, CPU NEON only):\n"); fprintf(stderr,
printf(" IDCT 4x4 (all-4x4 MBs): %7.2f ms (%d blocks)\n", "\nbench_h264_primitives: %d of %d QPU dispatches failed.\n"
idct4_ns * MBS_1080P * 16 / 1e6, MBS_1080P * 16); " Almost always means SPV files were not found via any of:\n"
printf(" IDCT 8x8 (all-8x8 MBs): %7.2f ms (%d blocks)\n", " cwd / $DAEDALUS_SHADER_DIR / binary-dir /\n"
idct8_ns * MBS_1080P * 4 / 1e6, MBS_1080P * 4); " /opt/fourier/share/daedalus-fourier / /usr/share/daedalus-fourier\n"
printf(" Deblock luma_v (all MBs): %7.2f ms (%d edges)\n", " Set DAEDALUS_SHADER_DIR=<path> or run from a dir where the\n"
debl_v_ns * MBS_1080P * 4 / 1e6, MBS_1080P * 4); " .spv files exist (e.g. the cmake build dir).\n",
printf(" Deblock luma_h (all MBs): %7.2f ms (%d edges)\n", qpu_failures, N_ROWS);
debl_h_ns * MBS_1080P * 4 / 1e6, MBS_1080P * 4); return 2;
printf(" qpel mc22 (all 8x8 blocks): %7.2f ms (%d blocks)\n", }
qmc22_ns * MBS_1080P * 4 / 1e6, MBS_1080P * 4); }
/* Summary table — both substrates side by side. */
printf("\n== Per-kernel comparison ==\n");
printf(" %-24s %12s %12s %8s %7s\n",
"kernel", "CPU ns/op", "QPU ns/op", "winner", "ms/frame");
for (int i = 0; i < N_ROWS; i++) {
double cpu_ms = rows[i].cpu_ns * rows[i].frame_n / 1e6;
double qpu_ms = rows[i].qpu_ns > 0 ? rows[i].qpu_ns * rows[i].frame_n / 1e6 : -1;
const char *winner;
char ratio[16];
if (rows[i].qpu_ns <= 0) {
winner = "CPU"; /* QPU n/a */
snprintf(ratio, sizeof(ratio), "n/a");
} else if (rows[i].cpu_ns < rows[i].qpu_ns) {
winner = "CPU";
snprintf(ratio, sizeof(ratio), "%.2fx", rows[i].qpu_ns / rows[i].cpu_ns);
} else {
winner = "QPU";
snprintf(ratio, sizeof(ratio), "%.2fx", rows[i].cpu_ns / rows[i].qpu_ns);
}
char qpu_field[16];
if (rows[i].qpu_ns > 0) snprintf(qpu_field, sizeof(qpu_field), "%.2f", rows[i].qpu_ns);
else snprintf(qpu_field, sizeof(qpu_field), "n/a");
char ms_field[24];
if (qpu_ms > 0)
snprintf(ms_field, sizeof(ms_field), "%.2f/%.2f", cpu_ms, qpu_ms);
else
snprintf(ms_field, sizeof(ms_field), "%.2f/n/a", cpu_ms);
printf(" %-24s %12.2f %12s %3s %s %s\n",
rows[i].name, rows[i].cpu_ns, qpu_field, winner, ratio, ms_field);
}
/* Per-frame budget summary at 1080p (8160 MBs). */
double cpu_idct4 = rows[0].cpu_ns * MBS_1080P * 16 / 1e6;
double cpu_debl = (rows[2].cpu_ns + rows[3].cpu_ns) * MBS_1080P * 4 / 1e6;
double cpu_mc = rows[6].cpu_ns * MBS_1080P * 4 / 1e6; /* mc22 worst-case */
double cpu_sum = cpu_idct4 + cpu_debl + cpu_mc;
printf("\n== Projected 1080p worst-case (CPU NEON only) ==\n");
printf(" IDCT 4x4 + deblock luma + qpel mc22: %.2f ms (30fps deadline 33.33)\n", cpu_sum);
printf(" Margin: %+.2f ms\n", 33.33 - cpu_sum);
if (has_qpu) {
double qpu_idct4 = rows[0].qpu_ns * MBS_1080P * 16 / 1e6;
double qpu_debl = (rows[2].qpu_ns + rows[3].qpu_ns) * MBS_1080P * 4 / 1e6;
double qpu_mc = rows[6].qpu_ns * MBS_1080P * 4 / 1e6;
double qpu_sum = qpu_idct4 + qpu_debl + qpu_mc;
printf("\n== Projected 1080p worst-case (QPU V3D7 compute only) ==\n");
printf(" IDCT 4x4 + deblock luma + qpel mc22: %.2f ms (30fps deadline 33.33)\n", qpu_sum);
printf(" Margin: %+.2f ms\n", 33.33 - qpu_sum);
printf("\n CPU vs QPU sum ratio: %.2fx (>1 means QPU wins)\n",
qpu_sum > 0 ? cpu_sum / qpu_sum : 0.0);
}
double sum_idct_4x4 = idct4_ns * MBS_1080P * 16 / 1e6;
double sum_deblock = (debl_v_ns + debl_h_ns) * MBS_1080P * 4 / 1e6;
double sum_mc = qmc22_ns * MBS_1080P * 4 / 1e6; /* worst-case all-mc22 */
printf("\n Sum (IDCT 4x4 + deblock luma + MC all-mc22): %7.2f ms\n",
sum_idct_4x4 + sum_deblock + sum_mc);
printf(" 30 fps deadline: 33.33 ms\n");
printf(" Margin: %+.2f ms\n",
33.33 - (sum_idct_4x4 + sum_deblock + sum_mc));
printf("\n(NOT included: chroma deblock, chroma IDCT, intra prediction,\n"); printf("\n(NOT included: chroma deblock, chroma IDCT, intra prediction,\n");
printf(" CABAC/CAVLC entropy. These bench numbers are a budget LOWER\n"); printf(" CABAC/CAVLC entropy. These bench numbers are a budget LOWER\n");
printf(" bound; the real decode stack adds 20-40%% on top.)\n"); printf(" bound; the real decode stack adds 20-40%% on top.\n");
(void) qmc20_ns; (void) qmc02_ns; printf(" Per-kernel substrate decisions belong in daedalus_core.c recipe\n");
printf(" table; the QPU substrate decree (2026-05-23) keeps everything\n");
printf(" on QPU regardless of these numbers as a policy choice.)\n");
daedalus_ctx_destroy(ctx); daedalus_ctx_destroy(ctx);
return 0; return 0;
+4 -4
View File
@@ -683,13 +683,13 @@ int main(void)
printf(" H264_QPEL_MC20 recipe substrate: %d\n", printf(" H264_QPEL_MC20 recipe substrate: %d\n",
(int) daedalus_recipe_substrate_for(DAEDALUS_KERNEL_H264_QPEL_MC20)); (int) daedalus_recipe_substrate_for(DAEDALUS_KERNEL_H264_QPEL_MC20));
printf(" H264_DEBLOCK_LH recipe substrate: %d (CPU, no QPU H shader yet)\n", printf(" H264_DEBLOCK_LH recipe substrate: %d\n",
(int) daedalus_recipe_substrate_for(DAEDALUS_KERNEL_H264_DEBLOCK_LH)); (int) daedalus_recipe_substrate_for(DAEDALUS_KERNEL_H264_DEBLOCK_LH));
printf(" H264_DEBLOCK_CV recipe substrate: %d (CPU)\n", printf(" H264_DEBLOCK_CV recipe substrate: %d\n",
(int) daedalus_recipe_substrate_for(DAEDALUS_KERNEL_H264_DEBLOCK_CV)); (int) daedalus_recipe_substrate_for(DAEDALUS_KERNEL_H264_DEBLOCK_CV));
printf(" H264_DEBLOCK_CH recipe substrate: %d (CPU)\n", printf(" H264_DEBLOCK_CH recipe substrate: %d\n",
(int) daedalus_recipe_substrate_for(DAEDALUS_KERNEL_H264_DEBLOCK_CH)); (int) daedalus_recipe_substrate_for(DAEDALUS_KERNEL_H264_DEBLOCK_CH));
printf(" H264_DEBLOCK_*_INTRA recipe substrate: %d (CPU, bS=4 set)\n", printf(" H264_DEBLOCK_*_INTRA recipe substrate: %d (bS=4 family, all on QPU)\n",
(int) daedalus_recipe_substrate_for(DAEDALUS_KERNEL_H264_DEBLOCK_LV_INTRA)); (int) daedalus_recipe_substrate_for(DAEDALUS_KERNEL_H264_DEBLOCK_LV_INTRA));
int fail = 0; int fail = 0;