14 Commits

Author SHA1 Message Date
marfrit 0d54d68f38 Merge pull request 'cycle 9: V3D shader for H.264 luma qpel mc20 — closes 9/9 QPU coverage' (#8) from noether/v3d-shader-h264-qpel-mc20 into main
Reviewed-on: #8
2026-05-23 19:14:44 +00:00
claude-noether 79553c6e22 cycle 9: V3D shader for H.264 luma qpel mc20 — 9/9 QPU coverage
Closes the QPU-default substrate campaign per the 2026-05-23
decree: every daedalus-fourier kernel that can be done in QPU
is now done in QPU.  Cycle 9 is the last piece — 6-tap horizontal
half-pel luma motion compensation, H.264 §8.4.2.2.1.

Shader (src/v3d_h264_qpel_mc20.comp):

  - local_size = 64, 1 lane per output pixel of one 8x8 block,
    1 block per workgroup.  Simplest layout that avoids any
    inter-lane communication — V3D's L2 cache handles the
    redundant reads from adjacent lanes computing adjacent
    output columns.
  - Per-pixel: read 6 src samples (cols c-2..c+3 in row r),
    apply the (1, -5, 20, 20, -5, 1) / 32 filter with +16
    rounding, clip to u8, write one dst byte.
  - Single-stride convention matches FFmpeg's H264QpelContext
    (dst and src share `stride`; src+src_off points at output
    col 0 with the caller-guaranteed -2/+3 padding).
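
For orientation, the per-pixel work listed above can be written as
a scalar C reference.  This is a minimal sketch, not the shader and
not the FFmpeg routine: the names ref_h264_qpel8_mc20 and clip_u8
are hypothetical, and it assumes the single-stride convention with
src already pointing at output col 0.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Clip a filtered value to the unsigned 8-bit range. */
static uint8_t clip_u8(int v)
{
    return (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
}

/*
 * Hypothetical scalar reference for one 8x8 mc20 block.
 * The caller must guarantee 2 readable bytes to the left and
 * 3 to the right of every source row.
 */
static void ref_h264_qpel8_mc20(uint8_t *dst, const uint8_t *src,
                                ptrdiff_t stride)
{
    assert(stride >= 8);            /* rows must not overlap */

    for (int r = 0; r < 8; r++) {
        for (int c = 0; c < 8; c++) {
            const uint8_t *s = src + r * stride + c;
            /* Taps (1, -5, 20, 20, -5, 1) over cols c-2..c+3. */
            int v = s[-2] - 5 * s[-1] + 20 * s[0]
                  + 20 * s[1] - 5 * s[2] + s[3] + 16;
            /* Negative sums clip to 0 without shifting a negative int. */
            dst[r * stride + c] = clip_u8(v < 0 ? 0 : v >> 5);
        }
    }
}
```

Each of the 64 shader lanes would correspond to one (r, c) iteration
of the inner loop; the taps sum to 32, so a flat source region passes
through unchanged.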

Dispatch wiring (src/daedalus_core.c):

  - h264_qpel_mc20_pipe field on daedalus_ctx, lazy init.
  - dispatch_h264_qpel_mc20_qpu(): 3 SSBOs (src / dst / meta),
    src_max = src_off + 7*stride + 11 (covers the +3-col read
    footprint on the last row), dst_max = dst_off + 7*stride + 8.
    1 block per WG.
  - daedalus_dispatch_h264_qpel_mc20() replaces ROUTE_CPU_ONLY
    with the substrate-switch pattern matching the other H.264
    kernels.
  - Recipe table: H264_QPEL_MC20 returns SUBSTRATE_QPU.

Verification (hertz, Pi 5, V3D 7.1):

  $ ./test_api_h264
  === Phase 8a API smoke: H.264 kernels via recipe dispatch ===
    H264_IDCT4 recipe substrate:      2 (1=CPU, 2=QPU)
    H264_IDCT8 recipe substrate:      2
    H264_DEBLOCK_LV recipe substrate: 2
    H264_QPEL_MC20 recipe substrate:  2   ← flipped
    H.264 IDCT 4x4: 2048/2048 bytes bit-exact
    H.264 IDCT 8x8: 2048/2048 bytes bit-exact
    H.264 deblock luma v: 2048/2048 bytes bit-exact
    H.264 qpel mc20: 1024/1024 bytes bit-exact   ← QPU

First-iteration result was 1017/1024 (99.32%): the 7 mismatched
bytes traced to an undersized src_max in the host wrapper.  At the
last row and last column the filter reads index
src_off + 7*stride + (7 + 3), i.e. +10 past the start of that row;
add 1 to turn that last index into a byte count for the memcpy
→ 11.  Fixed in the same patch.
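
The footprint arithmetic can be spelled out as a small helper.  A
sketch only: mc20_src_bytes is a hypothetical name, not a function
in daedalus_core.c, and it assumes an 8x8 block with src_off
already pointing at output col 0.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: number of source bytes the host must upload
 * so that the right-most tap of the last pixel stays in range.
 */
static size_t mc20_src_bytes(size_t src_off, size_t stride)
{
    const size_t last_row   = 7;   /* 8 rows, zero-based */
    const size_t last_col   = 7;   /* 8 columns, zero-based */
    const size_t right_taps = 3;   /* filter reads cols c-2..c+3 */

    assert(src_off >= 2);          /* left taps need 2 bytes of padding */

    size_t last_index = src_off + last_row * stride + last_col + right_taps;
    return last_index + 1;         /* highest index -> byte count */
}
```

With stride = 16 and src_off = 2 this gives 2 + 112 + 11 = 125,
matching the src_off + 7*stride + 11 expression above.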

All 9 daedalus-fourier cycles now QPU-by-recipe:

  cycle 1 VP9 IDCT 8x8         QPU
  cycle 2 VP9 LPF wd=4         QPU
  cycle 3 VP9 MC 8h            QPU
  cycle 4 VP9 LPF wd=8         QPU
  cycle 5 AV1 CDEF 8x8         QPU
  cycle 6 H.264 IDCT 4x4       QPU
  cycle 7 H.264 IDCT 8x8       QPU
  cycle 8 H.264 deblock luma-v QPU
  cycle 9 H.264 qpel mc20      QPU   ← this commit

Closes daedalus-fourier task #165.  Per the decree memory
[QPU is default substrate], the prototype now demonstrates GPU
acceleration on every measured kernel.
2026-05-23 21:05:36 +02:00
marfrit a092ee34aa Merge pull request 'QPU is default substrate: recipe table + ctx env-var override' (#7) from noether/qpu-default-recipe-cycles-5-8 into main
Reviewed-on: #7
2026-05-23 18:59:34 +00:00
marfrit c01754e849 Merge pull request 'v3d_runner: buffer pool for QPU dispatch hot path' (#6) from noether/v3d-buffer-pool into main
Reviewed-on: #6
2026-05-23 18:59:18 +00:00
claude-noether 74687d9def cycle 7: V3D shader for H.264 IDCT 8x8
Mirrors cycle 6 (PR #7 prior commit) but at 8x8 scale: 8 lanes per
block, 8 blocks per WG.  H.264 §8.5.13.2 1D butterfly twice (row
pass, column pass), (val + 32) >> 6 rounded + clipped + added to
dst.

Bit-exact first try on hertz (Pi 5, V3D 7.1):

  H264_IDCT4 recipe substrate:      2 (QPU)
  H264_IDCT8 recipe substrate:      2 (QPU)    ← flipped
  H264_DEBLOCK_LV recipe substrate: 2 (QPU)
  H264_QPEL_MC20 recipe substrate:  1 (CPU)    ← task #165 remaining
  H.264 IDCT 4x4: 2048/2048 bytes bit-exact
  H.264 IDCT 8x8: 2048/2048 bytes bit-exact    ← QPU
  H.264 deblock luma v: 2048/2048 bytes bit-exact
  H.264 qpel mc20: 1024/1024 bytes bit-exact

8 of 9 daedalus-fourier cycles now QPU-by-recipe.  Only cycle 9
(H.264 luma qpel mc20) still CPU — different shape (6-tap MC
filter, not a transform) so needs its own shader template; task
#165 covers it as a follow-on.

Same pattern as cycle 6 commit (65bd5c3): adds h264_idct8_pipe
field + lazy init, dispatch_h264_idct8_qpu() with 3 SSBOs,
v3d_h264_idct8.spv install rule.

Uses v3d_runner_create_buffer / destroy_buffer (will swap to
pool API once PR #6 lands).
2026-05-23 20:09:25 +02:00
claude-noether 65bd5c3fe3 cycle 6: V3D shader for H.264 IDCT 4x4 (first cycle-6 QPU dispatch)
Per the QPU-default substrate decree 2026-05-23, cycle 6 (H.264
IDCT 4x4 + add) was the highest-priority H.264 kernel to flip
from NEON-only to QPU-capable.  The same shape as VP9 IDCT 8x8
(cycle 1) — two-pass butterfly with shared-memory transpose —
but at 4x4 scale: 4 lanes per block, 16 blocks per WG.

What's added:

  - src/v3d_h264_idct4.comp: GLSL compute shader implementing
    the H.264 §8.5.12.1 1D butterfly twice (row pass then column
    pass), with (val + 32) >> 6 rounding and clip-to-u8 add to
    dst.  Block memory layout is column-major (matches FFmpeg
    `ff_h264_idct_add_neon` convention).

  - CMakeLists: glslang rule + install entry for v3d_h264_idct4.spv.

  - dispatch_h264_idct4_qpu() in daedalus_core.c: lazy pipeline
    init, 3 SSBOs (coeffs / dst / meta as uvec4), push-constant
    (n_blocks, dst_stride), 16 blocks per WG dispatch.  Matches
    the existing dispatch_*_qpu patterns; uses
    v3d_runner_create_buffer / destroy_buffer (will swap to
    pool API once PR #6 lands).

  - daedalus_dispatch_h264_idct4() replaces ROUTE_CPU_ONLY with
    the same CPU/QPU substrate switch the deblock dispatch uses.

  - daedalus_recipe_substrate_for(H264_IDCT4) returns QPU now
    that the shader exists.
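
For reference, the two-pass butterfly with rounding and clipped add
can be sketched in scalar C.  This is an illustration under stated
assumptions, not the shader or the NEON code: the name
ref_h264_idct4_add is hypothetical, the indexing follows the usual
i + 4*k reference layout, and signed right shifts are assumed to be
arithmetic (as on GCC and Clang).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static uint8_t idct_clip_u8(int v)
{
    return (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
}

/* Hypothetical scalar reference: 4x4 inverse transform added to dst. */
static void ref_h264_idct4_add(uint8_t *dst, const int16_t coeffs[16],
                               ptrdiff_t stride)
{
    int tmp[16];

    assert(stride >= 4);

    /* First 1D pass: butterfly over coeffs[i + 4*k], k = 0..3. */
    for (int i = 0; i < 4; i++) {
        int z0 = coeffs[i + 0] + coeffs[i + 8];
        int z1 = coeffs[i + 0] - coeffs[i + 8];
        int z2 = (coeffs[i + 4] >> 1) - coeffs[i + 12];
        int z3 = coeffs[i + 4] + (coeffs[i + 12] >> 1);
        tmp[i + 0]  = z0 + z3;
        tmp[i + 4]  = z1 + z2;
        tmp[i + 8]  = z1 - z2;
        tmp[i + 12] = z0 - z3;
    }

    /* Second 1D pass over the other dimension, then round, add, clip. */
    for (int i = 0; i < 4; i++) {
        int z0 = tmp[4 * i + 0] + tmp[4 * i + 2];
        int z1 = tmp[4 * i + 0] - tmp[4 * i + 2];
        int z2 = (tmp[4 * i + 1] >> 1) - tmp[4 * i + 3];
        int z3 = tmp[4 * i + 1] + (tmp[4 * i + 3] >> 1);
        int out[4] = { z0 + z3, z1 + z2, z1 - z2, z0 - z3 };

        for (int j = 0; j < 4; j++) {
            uint8_t *p = dst + i * stride + j;
            *p = idct_clip_u8(*p + ((out[j] + 32) >> 6));
        }
    }
}
```

A DC-only block with coeffs[0] = 64 adds exactly 1 to every pixel of
the 4x4 block, which makes a convenient first sanity check.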

Verification on hertz (Pi 5 + V3D 7.1):

  $ ./test_api_h264
  === Phase 8a API smoke: H.264 kernels via recipe dispatch ===
    H264_IDCT4 recipe substrate:      2 (1=CPU, 2=QPU)
    H264_IDCT8 recipe substrate:      1
    H264_DEBLOCK_LV recipe substrate: 2
    H264_QPEL_MC20 recipe substrate:  1
    H.264 IDCT 4x4: 2048/2048 bytes bit-exact (100.0000%)   ← QPU
    H.264 IDCT 8x8: 2048/2048 bytes bit-exact
    H.264 deblock luma v: 2048/2048 bytes bit-exact
    H.264 qpel mc20: 1024/1024 bytes bit-exact

The AUTO-substrate path now picks QPU for H.264 IDCT 4x4, and
the output is bit-exact against the C reference (which is
identical to the NEON .S code by construction — same FFmpeg
upstream).

Remaining work in task #165 (cycles 7 and 9):
  - cycle 7: H.264 IDCT 8x8 (template same shape; 8 lanes per
    block, fewer blocks per WG)
  - cycle 9: H.264 luma qpel mc20 (different shape — 6-tap MC
    not a transform)

This commit lands the cycle-6 piece of task #165.
2026-05-23 20:06:20 +02:00
claude-noether 737e87980d QPU is default substrate: recipe table + ctx env-var override
Per the user decree 2026-05-23 — "what can be done in QPU will
be done in QPU" — this lands two coupled changes that flip
production-decode kernels with existing V3D shaders from
CPU-by-recipe to QPU-by-recipe:

1) daedalus_recipe_substrate_for() returns SUBSTRATE_QPU for
   every kernel that has a shipped V3D compute shader:

    cycle 1 VP9 IDCT 8x8         QPU  (was QPU; unchanged)
    cycle 2 VP9 LPF wd=4         QPU  (was QPU; unchanged)
    cycle 3 VP9 MC 8h            QPU  (FLIPPED from CPU — v3d_mc_8h.spv)
    cycle 4 VP9 LPF wd=8         QPU  (was QPU; unchanged)
    cycle 5 AV1 CDEF 8x8         QPU  (FLIPPED from CPU — v3d_cdef.spv)
    cycle 6 H.264 IDCT 4x4       CPU  (no shader yet; task #165)
    cycle 7 H.264 IDCT 8x8       CPU  (no shader yet; task #165)
    cycle 8 H.264 deblock luma-v QPU  (FLIPPED from CPU — v3d_h264deblock.spv)
    cycle 9 H.264 qpel mc20      CPU  (no shader yet; task #165)

   The R-band cost/benefit framework still applies but is now
   superseded for substrate selection by the decree.  Where R
   stays RED, the cost is in dispatch overhead, which is a
   fixable engineering issue (tasks 160 buffer-pool, 161
   persistent cmdbuf, 162 dmabuf import).

2) daedalus_ctx_create_no_qpu() now honours an env-var override:
   set DAEDALUS_FORCE_QPU=1 in the process and create_no_qpu
   silently escalates to a full daedalus_ctx_create().  Lets
   the libavcodec substitution shims in marfrit-packages (which
   pthread_once a create_no_qpu ctx — see
   libavcodec/aarch64/h264_idct_daedalus.c) fire QPU paths
   without rebuilding those patches.

   Firefox / mpv consumers stay on the Vulkan-free path by
   default (env var unset).  The daedalus-v4l2 daemon will set
   DAEDALUS_FORCE_QPU=1 explicitly before dlopen'ing libavcodec
   (separate daedalus-v4l2 follow-up).

Smoke (hertz, Pi 5, kernel 6.18.29):

  === test_api_h264 ===
    H264_IDCT4 recipe substrate:      1 (1=CPU, 2=QPU)
    H264_IDCT8 recipe substrate:      1
    H264_DEBLOCK_LV recipe substrate: 2     ← flipped
    H264_QPEL_MC20 recipe substrate:  1
    H.264 IDCT 4x4: 2048/2048 bytes bit-exact
    H.264 IDCT 8x8: 2048/2048 bytes bit-exact
    H.264 deblock luma v: 2048/2048 bytes bit-exact   ← QPU path
    H.264 qpel mc20: 1024/1024 bytes bit-exact

  === test_api_idct === all substrates (CPU/QPU/AUTO) bit-exact
  === test_api_lpf  === all substrates bit-exact wd=4 and wd=8

The dispatch wrapper's fall-through logic (eff == SUBSTRATE_QPU
&& !ctx_has_qpu(ctx) → eff = SUBSTRATE_CPU) handles the case
where the recipe says QPU but the consumer didn't opt in — it
falls back to CPU silently, no regression.
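
The fall-through rule can be sketched in isolation.  This is a
hedged illustration rather than the code in daedalus_core.c: the
enum, the value 0 for AUTO, and the name resolve_substrate are
assumptions (the smoke output only confirms 1=CPU, 2=QPU).

```c
#include <assert.h>

/* Assumed numbering: 1=CPU and 2=QPU as printed by the smoke test. */
typedef enum {
    SUBSTRATE_AUTO = 0,   /* assumption: AUTO defers to the recipe */
    SUBSTRATE_CPU  = 1,
    SUBSTRATE_QPU  = 2
} substrate;

/*
 * Hypothetical helper: pick the effective substrate for one dispatch.
 * `recipe` is what the recipe table returned for the kernel.
 */
static substrate resolve_substrate(substrate requested, substrate recipe,
                                   int ctx_has_qpu)
{
    assert(recipe != SUBSTRATE_AUTO);   /* recipe must be concrete */

    substrate eff = (requested == SUBSTRATE_AUTO) ? recipe : requested;

    /* Recipe (or caller) says QPU but the ctx never opted in. */
    if (eff == SUBSTRATE_QPU && !ctx_has_qpu)
        eff = SUBSTRATE_CPU;
    return eff;
}
```

Under this reading a create_no_qpu ctx keeps running the NEON path
even though the recipe table now answers QPU.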

Closes daedalus-fourier tasks #163, #164.
Refs the 2026-05-23 "QPU default substrate" decree.
2026-05-23 19:59:53 +02:00
claude-noether 98553278dd v3d_runner: persistent per-pipeline command buffer
Phase 2 of the QPU-default substrate campaign — eliminate
vkAllocateCommandBuffers from the dispatch hot path.

Attaches a VkCommandBuffer to each v3d_pipeline, allocated once in
v3d_runner_create_pipeline() and freed in destroy_pipeline().  The
five dispatch_*_qpu sites switch from v3d_runner_alloc_cmdbuf() to
v3d_runner_pipeline_cmdbuf_reset() — vkResetCommandBuffer is O(1)
versus the driver-side allocation walk.  Pool was already created
with VK_COMMAND_POOL_CREATE_RESET_COMMAND_BUFFER_BIT so reset is
permitted.

Microbench (hertz, Pi 5, kernel 6.18.29, V3D 7.1):

  before (task 160 pool only):
    steady-state p50: 76.44 us
    steady-state mean: 77.95 us
  after (task 160 pool + task 161 persistent cb):
    steady-state p50: 54.56 us
    steady-state mean: 56.00 us
    -> 28% per-dispatch reduction

The remaining ~54 us steady-state is dominated by vkQueueWaitIdle +
shader execution + the two memcpy(in/out) on the dst buffer — task
162 (dmabuf import for dst) targets the memcpy half.

test_api_idct stays bit-exact across CPU/QPU/AUTO substrates.

Refs daedalus-fourier task #161.
2026-05-23 19:56:35 +02:00
claude-noether 0a042a8e95 v3d_runner: buffer pool for QPU dispatch hot path
Per the QPU-default substrate decree 2026-05-23: the per-dispatch
vkAllocateMemory in dispatch_*_qpu was the biggest single fixable
contributor to QPU dispatch overhead.  This pools v3d_buffer
allocations by power-of-2 size class so the second-and-subsequent
dispatch hits a freelist instead of paying ~10-50us of Mesa-V3D7
memory-allocation cost per call.

API additions (v3d_runner.h):
  - v3d_runner_acquire_buffer(): pulls from per-bucket freelist;
    falls through to v3d_runner_create_buffer() on miss.
  - v3d_runner_release_buffer(): pushes back onto the freelist; the
    backing VkBuffer/VkDeviceMemory only get vkFreeMemory'd in
    v3d_runner_destroy().
  - v3d_runner_pool_total_bytes(): diagnostic watermark.

Size classes 2^8..2^23 (256 B to 8 MiB).  Oversize requests fall
through to non-pooled (vkAllocateMemory) for both ends — pool stays
correct, just degenerates to old behaviour for those calls.
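
One way to map a request onto a power-of-2 size class is sketched
below.  The names pool_bucket_for, POOL_MIN_SHIFT and POOL_MAX_SHIFT
are hypothetical, and rounding small requests up to the 256 B class
is an assumption, not something v3d_runner.c is known to do.

```c
#include <assert.h>
#include <stddef.h>

enum { POOL_MIN_SHIFT = 8, POOL_MAX_SHIFT = 23 };   /* 256 B .. 8 MiB */

/*
 * Hypothetical helper: bucket index 0..15 for a pooled request,
 * or -1 when the request should bypass the pool.
 */
static int pool_bucket_for(size_t bytes)
{
    if (bytes == 0 || bytes > ((size_t)1 << POOL_MAX_SHIFT))
        return -1;                  /* non-pooled path */

    int shift = POOL_MIN_SHIFT;     /* assumption: round up to 256 B */
    while (((size_t)1 << shift) < bytes)
        shift++;

    assert(shift <= POOL_MAX_SHIFT);
    return shift - POOL_MIN_SHIFT;
}
```

A 257-byte request lands in bucket 1 (512 B), and anything above
8 MiB returns -1, i.e. the old create/destroy behaviour.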

Migration: daedalus_core.c dispatch_*_qpu paths globally swap
create_buffer → acquire_buffer and destroy_buffer → release_buffer.
All five QPU dispatch functions (idct8 / lpf / mc_8h / cdef /
h264_deblock) now reuse buffers across calls.  test_api_idct stays
bit-exact (4096/4096 bytes on CPU/QPU/AUTO substrates on hertz).

Microbench (tests/bench_pool_overhead.c) on hertz (Pi 5,
6.18.29+rpt-rpi-2712, V3D 7.1):

  call 0:  434.89 us  (cold — 3x vkAllocateMemory)
  call 1:  100.06 us  (pool hit on all 3 buffers)
  steady-state:
    p50:    76.44 us
    p90:    90.52 us
    mean:   77.95 us
  first-call / steady-state ratio: 5.7x

The remaining ~76us steady-state is dominated by vkQueueWaitIdle +
shader execution + per-call descriptor-set update + command-buffer
allocation — addressed in follow-on tasks 161 (persistent cmdbuf)
and 162 (dmabuf import for dst, eliminates memcpy in/out).

Refs daedalus-fourier task #160.
2026-05-23 19:52:50 +02:00
marfrit 3ecfc8b0ef Merge pull request 'docs: architecture backlog — correct fleet hardware mapping' (#4) from noether/architecture-backlog-fleet-fix into main
Reviewed-on: #4
2026-05-23 15:12:29 +00:00
marfrit c154253432 Merge pull request 'CMakeLists: make daedalus-fourier.pc relocatable via ${pcfiledir}' (#5) from noether/pkgconfig-relocatable-prefix into main
Reviewed-on: #5
2026-05-23 15:11:20 +00:00
claude-noether b3de96b21c CMakeLists: make daedalus-fourier.pc relocatable via ${pcfiledir}
The pkg-config file was generated at *configure* time with
`prefix=${CMAKE_INSTALL_PREFIX}`, which captured whatever
CMAKE_INSTALL_PREFIX happened to be set to at `cmake -B build`
time — typically the default `/usr/local`.  `cmake --install
build --prefix /foo` then put the files under /foo but the .pc
still pointed pkg-config at /usr/local/include and /usr/local/lib,
which broke downstream consumers configuring against the install
tree.

Concrete bite encountered today on hertz: the daedalus-v4l2 daemon
CMake configure on a /tmp/df-prefix install tree resolved
DAEDALUS_FOURIER_INCLUDE_DIRS to /usr/local/include (empty path on
the test host), so main.c failed to find <daedalus.h>.

Fix: write the .pc with `prefix=${pcfiledir}/<rel>` where <rel> is
the configure-time-computed relative path from
<prefix>/<libdir>/pkgconfig back to <prefix>.  pkg-config
substitutes ${pcfiledir} with the actual on-disk location of the
.pc at lookup time, so the resolved prefix tracks wherever the
install tree is moved to — including DESTDIR-staged builds, apt
package installs, and ad-hoc `cmake --install --prefix /tmp/foo`
test installs.

The relative-path computation handles GNUInstallDirs layouts that
add multiarch tuples (Debian's lib/aarch64-linux-gnu) without
hard-coding `../..`.  Tested on hertz (Debian trixie, libdir=lib):

  prefix=${pcfiledir}/../../
  ...
  $ pkg-config --variable=prefix daedalus-fourier
  /tmp/df-prefix-test/lib/pkgconfig/../../

  # mv preserves relocation:
  $ mv /tmp/df-prefix-test /tmp/df-prefix-moved
  $ pkg-config --variable=prefix daedalus-fourier
  /tmp/df-prefix-moved/lib/pkgconfig/../../

This unblocks the daedalus-v4l2 daemon out-of-tree builds against
local daedalus-fourier installs and is a prerequisite for tidy
test-rig deployments (per the hertz reload session 2026-05-23).
2026-05-23 16:55:31 +02:00
claude-noether 68dccd2911 docs: architecture backlog — correct fleet hardware mapping
Original draft (PR #3) speculated wrongly on host-to-SoC mapping:

  - hertz and tesla were listed under RK3588.  Verified via
    /proc/device-tree/compatible: both are raspberrypi,5-model-b /
    brcm,bcm2712 (tesla is an LXD container hosted on hertz, so
    necessarily shares the host SoC).
  - boltzmann (the only actual RK3588 in the fleet, 32 GB, kernel-
    dev / MCP hub, 8 W always-on) was omitted entirely.
  - noether (Pi 4 / BCM2711, the user's interactive workstation,
    where Firefox and mpv actually run) was omitted entirely.

Corrects the per-SoC coverage table:

    BCM2712 Pi 5  — higgs, hertz, broglie, tesla (LXD on hertz)
    BCM2711 Pi 4  — noether (workstation), dcw3, dcw2
    RK3588        — boltzmann
    Allwinner H6  — (not in fleet)

Reasoning consequences:

  - Pi 5 row is now four hosts but one SoC.  Adding a fifth Pi 5
    doesn't pressure-test the architecture; substrate decisions
    are identical across the row.
  - The realistic forcing function for the Pi 4 path is "HW decode
    on noether matters and rpivid is still unstable upstream" —
    noether is a daily-driver Pi 4 workstation, so this is closer
    than the original draft implied.
  - The realistic forcing function for an RK3588 caps file is
    "AV1 playback on boltzmann matters" — rkvdec doesn't cover
    AV1, so Mali Valhall compute substrate becomes the only HW
    acceleration option there.

"Re-read this when" list at the top + "Why deferred" section
+ decision log all updated.  No change to the architecture sketch
(caps directory, plugin layout, two-backend conclusion) — those
were correct in the original; only the host-to-SoC mapping
underneath them was wrong.

Refs PR #3 (the merged original).
2026-05-23 15:47:55 +02:00
marfrit 7d6f106919 Merge pull request 'docs: architecture backlog for multi-SoC daedalus generalization' (#3) from noether/architecture-backlog into main
Reviewed-on: #3
2026-05-23 03:31:56 +00:00
9 changed files with 1186 additions and 78 deletions
+66 -2
@@ -284,7 +284,40 @@ if (DAEDALUS_BUILD_VULKAN)
VERBATIM
)
add_custom_target(daedalus_shaders ALL DEPENDS ${NOOP_SPV} ${IDCT8_SPV} ${LPF_SPV} ${MC_SPV} ${LPF8_SPV} ${CDEF_SPV} ${H264DEBLOCK_SPV})
set(H264_IDCT4_SPV ${CMAKE_BINARY_DIR}/v3d_h264_idct4.spv)
add_custom_command(
OUTPUT ${H264_IDCT4_SPV}
COMMAND ${GLSLANG_VALIDATOR} -V --target-env vulkan1.3
-o ${H264_IDCT4_SPV}
${CMAKE_SOURCE_DIR}/src/v3d_h264_idct4.comp
DEPENDS ${CMAKE_SOURCE_DIR}/src/v3d_h264_idct4.comp
COMMENT "glslang: v3d_h264_idct4.comp -> v3d_h264_idct4.spv"
VERBATIM
)
set(H264_IDCT8_SPV ${CMAKE_BINARY_DIR}/v3d_h264_idct8.spv)
add_custom_command(
OUTPUT ${H264_IDCT8_SPV}
COMMAND ${GLSLANG_VALIDATOR} -V --target-env vulkan1.3
-o ${H264_IDCT8_SPV}
${CMAKE_SOURCE_DIR}/src/v3d_h264_idct8.comp
DEPENDS ${CMAKE_SOURCE_DIR}/src/v3d_h264_idct8.comp
COMMENT "glslang: v3d_h264_idct8.comp -> v3d_h264_idct8.spv"
VERBATIM
)
set(H264_QPEL_MC20_SPV ${CMAKE_BINARY_DIR}/v3d_h264_qpel_mc20.spv)
add_custom_command(
OUTPUT ${H264_QPEL_MC20_SPV}
COMMAND ${GLSLANG_VALIDATOR} -V --target-env vulkan1.3
-o ${H264_QPEL_MC20_SPV}
${CMAKE_SOURCE_DIR}/src/v3d_h264_qpel_mc20.comp
DEPENDS ${CMAKE_SOURCE_DIR}/src/v3d_h264_qpel_mc20.comp
COMMENT "glslang: v3d_h264_qpel_mc20.comp -> v3d_h264_qpel_mc20.spv"
VERBATIM
)
add_custom_target(daedalus_shaders ALL DEPENDS ${NOOP_SPV} ${IDCT8_SPV} ${LPF_SPV} ${MC_SPV} ${LPF8_SPV} ${CDEF_SPV} ${H264DEBLOCK_SPV} ${H264_IDCT4_SPV} ${H264_IDCT8_SPV} ${H264_QPEL_MC20_SPV})
# v3d_runner — reusable Vulkan plumbing.
add_library(v3d_runner STATIC src/v3d_runner.c)
@@ -412,6 +445,9 @@ if (DAEDALUS_BUILD_VULKAN)
${LPF8_SPV}
${CDEF_SPV}
${H264DEBLOCK_SPV}
${H264_IDCT4_SPV}
${H264_IDCT8_SPV}
${H264_QPEL_MC20_SPV}
DESTINATION ${CMAKE_INSTALL_DATADIR}/daedalus-fourier/shaders
)
endif()
@@ -419,9 +455,33 @@ endif()
# pkg-config file. Vulkan goes in Requires.private (consumer's
# pkg-config call gets it via --static). pthread + dl are needed
# by the static archive's runtime helpers.
#
# `prefix` is derived from ${pcfiledir} so the .pc is relocatable:
# pkg-config substitutes ${pcfiledir} with the directory holding the
# .pc at lookup time, and the relative path from
# <prefix>/<libdir>/pkgconfig back to <prefix> tells pkg-config the
# install prefix without baking it in. This is why
# `cmake --install build --prefix /foo` produces a .pc that correctly
# resolves `prefix=/foo` instead of baking whatever CMAKE_INSTALL_PREFIX
# was at *configure* time (default /usr/local). DESTDIR-staged
# installs work too: at runtime pkg-config sees the .pc at its real
# install path and computes the right prefix.
#
# Relative-path depth is computed from CMAKE_INSTALL_LIBDIR (and
# whatever multiarch tuple GNUInstallDirs adds) so Debian-style
# `lib/aarch64-linux-gnu/pkgconfig/...` resolves with the right number
# of `..` components. Layouts where libdir is *not* under prefix are
# not supported by this scheme; if a packager overrides libdir to an
# absolute path the relative-path machinery falls back to the absolute
# value (CMake's file(RELATIVE_PATH) prepends `..` until they meet),
# which is also relocatable but no longer prefix-agnostic.
file(RELATIVE_PATH PKGCONFIG_PCDIR_TO_PREFIX
"${CMAKE_INSTALL_PREFIX}/${CMAKE_INSTALL_LIBDIR}/pkgconfig"
"${CMAKE_INSTALL_PREFIX}")
set(PKGCONFIG_OUT ${CMAKE_CURRENT_BINARY_DIR}/daedalus-fourier.pc)
file(WRITE ${PKGCONFIG_OUT}
"prefix=${CMAKE_INSTALL_PREFIX}
"prefix=\${pcfiledir}/${PKGCONFIG_PCDIR_TO_PREFIX}
exec_prefix=\${prefix}
libdir=\${prefix}/${CMAKE_INSTALL_LIBDIR}
includedir=\${prefix}/${CMAKE_INSTALL_INCLUDEDIR}
@@ -468,6 +528,10 @@ add_executable(test_api_opportunistic_qpu tests/test_api_opportunistic_qpu.c)
target_link_libraries(test_api_opportunistic_qpu PRIVATE daedalus_core)
target_compile_options(test_api_opportunistic_qpu PRIVATE -O2)
add_executable(bench_pool_overhead tests/bench_pool_overhead.c)
target_link_libraries(bench_pool_overhead PRIVATE daedalus_core)
target_compile_options(bench_pool_overhead PRIVATE -O2)
if (DAEDALUS_BUILD_VULKAN)
# (re-open the conditional so the closing endif() below balances)
+17 -12
@@ -4,9 +4,9 @@
This document is forward-looking. It describes the generalized multi-SoC daedalus daemon architecture, but the immediate work block stays "finish Pi 5". Re-read this when:
- A second aarch64 host without a working kernel-side V4L2 stateless decoder shows up in the fleet (most likely candidate: Pi 4, which has V3D 4.x and no rpivid stable upstream).
- A specific working-copy slowdown that the current Pi-5-only daedalus can't address motivates the generalization.
- libva-v4l2-request-fourier evolves to need multi-node negotiation (currently it picks the first matching V4L2 node).
- HW decode on noether (Pi 4, the user's interactive workstation) becomes a real ask and rpivid upstream is still unstable. This is the most likely trigger — same SoC class as Pi 5 but weaker V3D 4.x, so it would need the caps-file mechanism plus an extra row's worth of substrate measurements.
- AV1 playback on boltzmann (RK3588) starts mattering. rkvdec doesn't cover AV1, so the daedalus path becomes the only HW-accelerated option, and Mali Valhall compute substrate decisions need their own caps row.
- libva-v4l2-request-fourier evolves to need multi-node negotiation (today it picks the first matching V4L2 node; a host with both rkvdec and daedalus-v4l2 nodes wants a preference policy).
Until then: this is decision context, not a TODO.
@@ -51,13 +51,17 @@ The mfritsche fleet has heterogeneous aarch64 hardware decoders:
| SoC | Host(s) | H.264 | HEVC | VP9 | AV1 | GPU compute |
|---|---|---|---|---|---|---|
| BCM2712 (Pi 5) | higgs, broglie | none | V3D7 (rpi-hevc-dec — SPS quirks) | none | none | V3D7 (Vulkan compute, queryable) |
| BCM2711 (Pi 4) | dcw3 | rpivid (out of tree, unstable) | rpivid (out of tree, unstable) | none | none | V3D4 (Vulkan compute, weaker) |
| RK3588 | hertz, tesla | rkvdec V4L2 stateless (upstream) | rkvdec V4L2 stateless | rkvdec V4L2 stateless | none (rkvdec lacks AV1) | Mali Valhall (panvk) + RK NPU |
| Allwinner H6 | (not in current fleet, but Cedrus exists) | Cedrus V4L2 | Cedrus V4L2 | none | none | Mali Bifrost |
| BCM2712 (Pi 5) | higgs, hertz, broglie, tesla (LXD on hertz) | none | V3D7 (rpi-hevc-dec — SPS quirks) | none | none | V3D7 (Vulkan compute, queryable) |
| BCM2711 (Pi 4) | noether (interactive workstation), dcw3, dcw2 | rpivid (out of tree, unstable) | rpivid (out of tree, unstable) | none | none | V3D4 (Vulkan compute, weaker) |
| RK3588 | boltzmann (32 GB, kernel-dev / MCP hub, 8 W always-on) | rkvdec V4L2 stateless (upstream) | rkvdec V4L2 stateless | rkvdec V4L2 stateless | none (rkvdec lacks AV1) | Mali Valhall (panvk-bifrost-video in dev) + RK NPU |
| Allwinner H6 | (not in current fleet, but Cedrus exists upstream) | Cedrus V4L2 | Cedrus V4L2 | none | none | Mali Bifrost |
No single SoC has a complete codec set. RK3588 lacks AV1; Pi 5 lacks H.264 + VP9 + AV1; Pi 4 has rpivid (out-of-tree, kernel-version-fragile); Allwinner Cedrus is H.264/HEVC only.
A note on the Pi 5 row: hertz and tesla share hardware (tesla is an LXD container hosted on hertz) but are operationally distinct — tesla is the distcc/MCP worker, hertz is the LXD host with all the cron automations and the 17-tool lmcp hub. From a daedalus deployment perspective they count as **one** Pi 5 substrate; from a workflow perspective they're separate boxes.
A note on noether: it's the user's interactive workstation (Pi 4, BCM2711). Firefox + mpv run here. Any "I want HW decode on my main box" pressure lands first on this host, which puts Pi 4 (V3D4 + maybe-rpivid) closer to the front of the queue than the original draft of this document suggested.
The current daedalus model — "kernel substitution + libavcodec front end" — is the right answer for **Pi 5 specifically**, where no usable kernel V4L2 stateless decoder exists for the codecs we care about, and a Vulkan-capable GPU (V3D7) is available to help on a few kernels.
The model is **not** the right answer for SoCs that already have working V4L2 stateless decoders for the requested codec — those should be passed through, not re-implemented through libavcodec + kernel substitution.
@@ -207,15 +211,15 @@ Pass-through plugins are *thin* — they translate the daedalus daemon's wire pr
**Today's calculus:**
- Pi 5 daedalus path is the only thing in the fleet that uses daedalus daemon. Generalizing for a single user is overdesign.
- RK3588 uses rkvdec directly through libva-v4l2-request-fourier; daedalus daemon is **not in the path** for any RK3588 codec. The "RK3588 support" the architecture above proposes is mostly a no-op routing decision plus an AV1 fallback that doesn't yet measure on Mali.
- Pi 4 with rpivid is the only realistic second motivator. rpivid upstream stability is the gate — if it lands cleanly, Pi 4 takes the pass-through path with no kernel substitution needed. If it stays out-of-tree-fragile, **then** the substrate-composed path with V3D4 + NEON becomes the right backend for Pi 4, and we need the per-SoC caps mechanism to handle V3D4's weaker compute.
- Pi 5 (higgs + hertz + broglie + tesla) is **four hosts**, but **one SoC**. Adding the fifth Pi 5 host wouldn't pressure-test the architecture; they all share BCM2712 caps so the substrate decisions are identical across the row.
- boltzmann (RK3588) is the only non-Pi-5 always-on host in the fleet, and it uses rkvdec directly through libva-v4l2-request-fourier; daedalus daemon is **not in the path** for any RK3588 codec on it. The "RK3588 support" the architecture above proposes is mostly a no-op routing decision plus an AV1 fallback that doesn't yet measure on Mali. No forcing pressure from boltzmann today.
- noether (Pi 4, this user's interactive workstation) and dcw3/dcw2 (also Pi 4) are the real second-SoC candidates. The gate is rpivid upstream stability: if it lands cleanly, Pi 4 takes the pass-through path with zero kernel substitution work. If it stays out-of-tree-fragile, **then** the substrate-composed path with V3D4 + NEON becomes the right backend for Pi 4, and we need the per-SoC caps mechanism to handle V3D4's weaker compute.
- The recipe layer in daedalus-fourier already scales cleanly. Adding more substrates is incremental, not architectural.
**The forcing function that flips this from "deferred" to "do it":**
- Pi 4 enters daily use and rpivid is still not stable upstream — implies we need a Pi 4 substrate-composed path, which means at minimum a second caps file and the loader for it. At that point, building the full pluggable scaffold becomes proportionate.
- **Or:** an x86 host enters the fleet running mesa-panvk on a Pi-CM5-like board, and we need the daedalus daemon to discover it dynamically rather than being baked at build time.
- **noether-as-Firefox-host** — the user starts wanting HW decode on their main workstation and rpivid is still not stable upstream. Implies a Pi 4 substrate-composed path, which means at minimum a second caps file and the loader for it. At that point, building the full pluggable scaffold becomes proportionate. This is the most likely trigger; noether is already a daily-driver Pi 4.
- **boltzmann-as-AV1-decoder** — RK3588 has no AV1 HW decoder, and the user wants AV1 playback there (currently CPU-only). Triggers a cycle-5-equivalent measurement campaign on Mali Valhall to see whether `daedalus_recipe_dispatch_cdef_8x8` (or follow-on AV1 kernels) is worth running on Mali compute. If yes, we need an RK3588 caps file that overrides only the AV1 row while leaving H.264/HEVC/VP9 on rkvdec pass-through.
- **Or:** a third-party Pi 5 user needs to swap shaders for V3D firmware experiments without rebuilding the daemon — at that point dynamic shader loading + caps overrides become a feature ask.
Until one of those happens: keep daedalus daemon Pi 5 specific. Push cross-SoC abstraction *up* to libva-v4l2-request-fourier (which already does most of it) rather than *down* into the daemon.
@@ -242,6 +246,7 @@ Until one of those happens: keep daedalus daemon Pi 5 specific. Push cross-SoC a
|---|---|---|
| 2026-05-23 | **Defer generalization.** Finish Pi 5 substitution arc (cycle 9 PR #90 pending), then pivot to bug-fix backlog (daemon SEGV #145, D-state #146) before architecture work. | Architecture pivot is a multi-week scope; Pi 5 path is the only user-visible motivator today; deferring loses nothing because the recipe layer already abstracts kernels and libva-v4l2-request-fourier already abstracts V4L2 nodes. |
| 2026-05-23 | **Document the design now, even though it's deferred.** | Captures the conceptual gap (shaders ≠ hardware decoders) and the two-backend conclusion while the analysis is fresh; saves re-litigating in 36 months. |
| 2026-05-23 | **Correct fleet hardware mapping.** Original draft had hertz/tesla under RK3588 and omitted boltzmann + noether entirely. Verified via `/proc/device-tree/compatible`: hertz + tesla are Pi 5 (BCM2712), noether is Pi 4 (BCM2711), boltzmann is the only RK3588 in the fleet. Adjusted "Why deferred" / forcing-function reasoning accordingly — Pi 5 row is now 4 hosts (one SoC), noether is the realistic Pi 4 trigger, boltzmann is the realistic RK3588 trigger via AV1. | Original draft was speculative on host-to-SoC mapping; verified state changes which forcing functions are credible. |
---
+407 -64
@@ -40,6 +40,12 @@ struct daedalus_ctx {
v3d_pipeline cdef_pipe;
int h264deblock_pipe_ready;
v3d_pipeline h264deblock_pipe;
int h264_idct4_pipe_ready;
v3d_pipeline h264_idct4_pipe;
int h264_idct8_pipe_ready;
v3d_pipeline h264_idct8_pipe;
int h264_qpel_mc20_pipe_ready;
v3d_pipeline h264_qpel_mc20_pipe;
};
daedalus_ctx *daedalus_ctx_create(void)
@@ -53,6 +59,25 @@ daedalus_ctx *daedalus_ctx_create(void)
daedalus_ctx *daedalus_ctx_create_no_qpu(void)
{
/*
* Per the "QPU is default substrate" decree 2026-05-23:
* setting DAEDALUS_FORCE_QPU=1 in the process env escalates this
* function to a full daedalus_ctx_create(), letting the libavcodec
* substitution shims (which call create_no_qpu via pthread_once)
* fire the V3D shaders that exist for cycles 1/2/4/5/8. Without
* this hook each consumer process (firefox, mpv, daemon) would
* need its own shim build to opt into QPU.
*
* Default behaviour (env var unset / not "1") is unchanged: pure
* NEON ctx, no implicit Vulkan init. Firefox / mpv consumers
* that dlopen libavcodec without opting in stay on the
* Vulkan-free path; the daemon explicitly sets
* DAEDALUS_FORCE_QPU=1 before loading libavcodec.
*/
const char *force = getenv("DAEDALUS_FORCE_QPU");
if (force && force[0] == '1' && force[1] == 0)
return daedalus_ctx_create();
daedalus_ctx *ctx = calloc(1, sizeof(*ctx));
if (!ctx) return NULL;
ctx->has_qpu = 0;
@@ -75,6 +100,9 @@ void daedalus_ctx_destroy(daedalus_ctx *ctx)
if (ctx->mc8h_pipe_ready) v3d_runner_destroy_pipeline(ctx->runner, &ctx->mc8h_pipe);
if (ctx->cdef_pipe_ready) v3d_runner_destroy_pipeline(ctx->runner, &ctx->cdef_pipe);
if (ctx->h264deblock_pipe_ready) v3d_runner_destroy_pipeline(ctx->runner, &ctx->h264deblock_pipe);
if (ctx->h264_idct4_pipe_ready) v3d_runner_destroy_pipeline(ctx->runner, &ctx->h264_idct4_pipe);
if (ctx->h264_idct8_pipe_ready) v3d_runner_destroy_pipeline(ctx->runner, &ctx->h264_idct8_pipe);
if (ctx->h264_qpel_mc20_pipe_ready) v3d_runner_destroy_pipeline(ctx->runner, &ctx->h264_qpel_mc20_pipe);
v3d_runner_destroy(ctx->runner);
}
free(ctx);
@@ -84,16 +112,25 @@ void daedalus_ctx_destroy(daedalus_ctx *ctx)
daedalus_substrate daedalus_recipe_substrate_for(daedalus_kernel k)
{
/*
* Recipe table per the "QPU is default substrate" decree
* 2026-05-23. Any kernel that has a V3D compute shader returns
* SUBSTRATE_QPU; CPU is the fallback for kernels without a
* shader. As of cycle 9 every kernel listed below has one
* (H.264 IDCT 4x4 / IDCT 8x8 / qpel mc20 landed in cycles 6/7/9).
* The dispatch wrappers already fall back to CPU automatically
* when the ctx doesn't have QPU available (daedalus_ctx_has_qpu == 0).
*/
switch (k) {
case DAEDALUS_KERNEL_VP9_IDCT8: return DAEDALUS_SUBSTRATE_QPU;
case DAEDALUS_KERNEL_VP9_LPF4_INNER: return DAEDALUS_SUBSTRATE_QPU;
case DAEDALUS_KERNEL_VP9_MC_8H: return DAEDALUS_SUBSTRATE_CPU;
case DAEDALUS_KERNEL_VP9_MC_8H: return DAEDALUS_SUBSTRATE_QPU; /* v3d_mc_8h.spv */
case DAEDALUS_KERNEL_VP9_LPF8_INNER: return DAEDALUS_SUBSTRATE_QPU;
case DAEDALUS_KERNEL_AV1_CDEF_8X8: return DAEDALUS_SUBSTRATE_CPU;
case DAEDALUS_KERNEL_H264_IDCT4: return DAEDALUS_SUBSTRATE_CPU;
case DAEDALUS_KERNEL_H264_IDCT8: return DAEDALUS_SUBSTRATE_CPU;
case DAEDALUS_KERNEL_H264_DEBLOCK_LV: return DAEDALUS_SUBSTRATE_CPU;
case DAEDALUS_KERNEL_H264_QPEL_MC20: return DAEDALUS_SUBSTRATE_CPU;
case DAEDALUS_KERNEL_AV1_CDEF_8X8: return DAEDALUS_SUBSTRATE_QPU; /* v3d_cdef.spv */
case DAEDALUS_KERNEL_H264_IDCT4: return DAEDALUS_SUBSTRATE_QPU; /* v3d_h264_idct4.spv */
case DAEDALUS_KERNEL_H264_IDCT8: return DAEDALUS_SUBSTRATE_QPU; /* v3d_h264_idct8.spv */
case DAEDALUS_KERNEL_H264_DEBLOCK_LV: return DAEDALUS_SUBSTRATE_QPU; /* v3d_h264deblock.spv */
case DAEDALUS_KERNEL_H264_QPEL_MC20: return DAEDALUS_SUBSTRATE_QPU; /* v3d_h264_qpel_mc20.spv */
}
return DAEDALUS_SUBSTRATE_CPU;
}
@@ -291,13 +328,13 @@ static int dispatch_idct8_qpu(daedalus_ctx *ctx,
}
v3d_buffer buf_coeffs = {0}, buf_dst = {0}, buf_meta = {0};
if (v3d_runner_create_buffer(ctx->runner, coeff_bytes, &buf_coeffs)) return -1;
if (v3d_runner_create_buffer(ctx->runner, max_byte_touched, &buf_dst)) {
v3d_runner_destroy_buffer(ctx->runner, &buf_coeffs); return -1;
if (v3d_runner_acquire_buffer(ctx->runner, coeff_bytes, &buf_coeffs)) return -1;
if (v3d_runner_acquire_buffer(ctx->runner, max_byte_touched, &buf_dst)) {
v3d_runner_release_buffer(ctx->runner, &buf_coeffs); return -1;
}
if (v3d_runner_create_buffer(ctx->runner, meta_bytes, &buf_meta)) {
v3d_runner_destroy_buffer(ctx->runner, &buf_dst);
v3d_runner_destroy_buffer(ctx->runner, &buf_coeffs); return -1;
if (v3d_runner_acquire_buffer(ctx->runner, meta_bytes, &buf_meta)) {
v3d_runner_release_buffer(ctx->runner, &buf_dst);
v3d_runner_release_buffer(ctx->runner, &buf_coeffs); return -1;
}
/* Upload. Coeffs and meta are straight copies. Dst we copy the
@@ -325,8 +362,8 @@ static int dispatch_idct8_qpu(daedalus_ctx *ctx,
._pad = 0,
};
VkCommandBuffer cb = v3d_runner_alloc_cmdbuf(ctx->runner);
if (cb == VK_NULL_HANDLE) goto fail;
if (v3d_runner_pipeline_cmdbuf_reset(ctx->runner, &ctx->idct8_pipe)) goto fail;
VkCommandBuffer cb = ctx->idct8_pipe.cb;
VkCommandBufferBeginInfo cbbi = { .sType = VK_STRUCTURE_TYPE_COMMAND_BUFFER_BEGIN_INFO };
vkBeginCommandBuffer(cb, &cbbi);
vkCmdBindPipeline(cb, VK_PIPELINE_BIND_POINT_COMPUTE,
@@ -344,15 +381,15 @@ static int dispatch_idct8_qpu(daedalus_ctx *ctx,
/* Read-back dst. */
memcpy(dst, buf_dst.mapped, max_byte_touched);
v3d_runner_destroy_buffer(ctx->runner, &buf_meta);
v3d_runner_destroy_buffer(ctx->runner, &buf_dst);
v3d_runner_destroy_buffer(ctx->runner, &buf_coeffs);
v3d_runner_release_buffer(ctx->runner, &buf_meta);
v3d_runner_release_buffer(ctx->runner, &buf_dst);
v3d_runner_release_buffer(ctx->runner, &buf_coeffs);
return 0;
fail:
v3d_runner_destroy_buffer(ctx->runner, &buf_meta);
v3d_runner_destroy_buffer(ctx->runner, &buf_dst);
v3d_runner_destroy_buffer(ctx->runner, &buf_coeffs);
v3d_runner_release_buffer(ctx->runner, &buf_meta);
v3d_runner_release_buffer(ctx->runner, &buf_dst);
v3d_runner_release_buffer(ctx->runner, &buf_coeffs);
return -1;
}
@@ -424,9 +461,9 @@ static int dispatch_lpf_qpu(daedalus_ctx *ctx, int wd_8,
size_t dst_window_size = hi - lo;
v3d_buffer buf_meta = {0}, buf_dst = {0};
if (v3d_runner_create_buffer(ctx->runner, meta_bytes, &buf_meta)) return -1;
if (v3d_runner_create_buffer(ctx->runner, dst_window_size, &buf_dst)) {
v3d_runner_destroy_buffer(ctx->runner, &buf_meta); return -1;
if (v3d_runner_acquire_buffer(ctx->runner, meta_bytes, &buf_meta)) return -1;
if (v3d_runner_acquire_buffer(ctx->runner, dst_window_size, &buf_dst)) {
v3d_runner_release_buffer(ctx->runner, &buf_meta); return -1;
}
memcpy(buf_dst.mapped, dst + lo, dst_window_size);
@@ -442,8 +479,8 @@ static int dispatch_lpf_qpu(daedalus_ctx *ctx, int wd_8,
if (v3d_runner_bind_buffers(ctx->runner, p, binds, 2)) goto fail;
uint32_t wg_count = (uint32_t)((n_edges + 31) / 32);
VkCommandBuffer cb = v3d_runner_alloc_cmdbuf(ctx->runner);
if (cb == VK_NULL_HANDLE) goto fail;
if (v3d_runner_pipeline_cmdbuf_reset(ctx->runner, p)) goto fail;
VkCommandBuffer cb = p->cb;
VkCommandBufferBeginInfo cbbi = { .sType = VK_STRUCTURE_TYPE_COMMAND_BUFFER_BEGIN_INFO };
vkBeginCommandBuffer(cb, &cbbi);
vkCmdBindPipeline(cb, VK_PIPELINE_BIND_POINT_COMPUTE, p->pipeline);
@@ -468,12 +505,12 @@ static int dispatch_lpf_qpu(daedalus_ctx *ctx, int wd_8,
memcpy(dst + lo, buf_dst.mapped, dst_window_size);
v3d_runner_destroy_buffer(ctx->runner, &buf_dst);
v3d_runner_destroy_buffer(ctx->runner, &buf_meta);
v3d_runner_release_buffer(ctx->runner, &buf_dst);
v3d_runner_release_buffer(ctx->runner, &buf_meta);
return 0;
fail:
v3d_runner_destroy_buffer(ctx->runner, &buf_dst);
v3d_runner_destroy_buffer(ctx->runner, &buf_meta);
v3d_runner_release_buffer(ctx->runner, &buf_dst);
v3d_runner_release_buffer(ctx->runner, &buf_meta);
return -1;
}
@@ -509,9 +546,9 @@ static int dispatch_mc_8h_qpu(daedalus_ctx *ctx,
}
v3d_buffer bm = {0}, bd = {0}, bs = {0};
if (v3d_runner_create_buffer(ctx->runner, meta_bytes, &bm)) return -1;
if (v3d_runner_create_buffer(ctx->runner, dst_max, &bd)) { v3d_runner_destroy_buffer(ctx->runner, &bm); return -1; }
if (v3d_runner_create_buffer(ctx->runner, src_max, &bs)) { v3d_runner_destroy_buffer(ctx->runner, &bd); v3d_runner_destroy_buffer(ctx->runner, &bm); return -1; }
if (v3d_runner_acquire_buffer(ctx->runner, meta_bytes, &bm)) return -1;
if (v3d_runner_acquire_buffer(ctx->runner, dst_max, &bd)) { v3d_runner_release_buffer(ctx->runner, &bm); return -1; }
if (v3d_runner_acquire_buffer(ctx->runner, src_max, &bs)) { v3d_runner_release_buffer(ctx->runner, &bd); v3d_runner_release_buffer(ctx->runner, &bm); return -1; }
memcpy(bs.mapped, src, src_max);
memcpy(bd.mapped, dst, dst_max);
@@ -530,8 +567,8 @@ static int dispatch_mc_8h_qpu(daedalus_ctx *ctx,
mc_pc pc = { .n_blocks = (uint32_t) n_blocks,
.dst_stride_u8 = (uint32_t) dst_stride,
.src_stride_u8 = (uint32_t) src_stride };
VkCommandBuffer cb = v3d_runner_alloc_cmdbuf(ctx->runner);
if (cb == VK_NULL_HANDLE) goto fail;
if (v3d_runner_pipeline_cmdbuf_reset(ctx->runner, &ctx->mc8h_pipe)) goto fail;
VkCommandBuffer cb = ctx->mc8h_pipe.cb;
VkCommandBufferBeginInfo cbbi = { .sType = VK_STRUCTURE_TYPE_COMMAND_BUFFER_BEGIN_INFO };
vkBeginCommandBuffer(cb, &cbbi);
vkCmdBindPipeline(cb, VK_PIPELINE_BIND_POINT_COMPUTE, ctx->mc8h_pipe.pipeline);
@@ -545,14 +582,14 @@ static int dispatch_mc_8h_qpu(daedalus_ctx *ctx,
memcpy(dst, bd.mapped, dst_max);
v3d_runner_destroy_buffer(ctx->runner, &bs);
v3d_runner_destroy_buffer(ctx->runner, &bd);
v3d_runner_destroy_buffer(ctx->runner, &bm);
v3d_runner_release_buffer(ctx->runner, &bs);
v3d_runner_release_buffer(ctx->runner, &bd);
v3d_runner_release_buffer(ctx->runner, &bm);
return 0;
fail:
v3d_runner_destroy_buffer(ctx->runner, &bs);
v3d_runner_destroy_buffer(ctx->runner, &bd);
v3d_runner_destroy_buffer(ctx->runner, &bm);
v3d_runner_release_buffer(ctx->runner, &bs);
v3d_runner_release_buffer(ctx->runner, &bd);
v3d_runner_release_buffer(ctx->runner, &bm);
return -1;
}
@@ -588,9 +625,9 @@ static int dispatch_cdef_qpu(daedalus_ctx *ctx,
size_t tmp_bytes = tmp_max_u16 * sizeof(uint16_t);
v3d_buffer bm = {0}, bd = {0}, bt = {0};
if (v3d_runner_create_buffer(ctx->runner, meta_bytes, &bm)) return -1;
if (v3d_runner_create_buffer(ctx->runner, dst_max, &bd)) { v3d_runner_destroy_buffer(ctx->runner, &bm); return -1; }
if (v3d_runner_create_buffer(ctx->runner, tmp_bytes, &bt)) { v3d_runner_destroy_buffer(ctx->runner, &bd); v3d_runner_destroy_buffer(ctx->runner, &bm); return -1; }
if (v3d_runner_acquire_buffer(ctx->runner, meta_bytes, &bm)) return -1;
if (v3d_runner_acquire_buffer(ctx->runner, dst_max, &bd)) { v3d_runner_release_buffer(ctx->runner, &bm); return -1; }
if (v3d_runner_acquire_buffer(ctx->runner, tmp_bytes, &bt)) { v3d_runner_release_buffer(ctx->runner, &bd); v3d_runner_release_buffer(ctx->runner, &bm); return -1; }
/* tmp may need padding before block-origin offset (caller-allocated). Just
* copy from caller; we assume meta[i].tmp_off_u16 is consistent with how
@@ -615,8 +652,8 @@ static int dispatch_cdef_qpu(daedalus_ctx *ctx,
cdef_pc pc = { .n_blocks = (uint32_t) n_blocks,
.tmp_stride_u16 = 16,
.dst_stride_u8 = (uint32_t) dst_stride };
VkCommandBuffer cb = v3d_runner_alloc_cmdbuf(ctx->runner);
if (cb == VK_NULL_HANDLE) goto fail;
if (v3d_runner_pipeline_cmdbuf_reset(ctx->runner, &ctx->cdef_pipe)) goto fail;
VkCommandBuffer cb = ctx->cdef_pipe.cb;
VkCommandBufferBeginInfo cbbi = { .sType = VK_STRUCTURE_TYPE_COMMAND_BUFFER_BEGIN_INFO };
vkBeginCommandBuffer(cb, &cbbi);
vkCmdBindPipeline(cb, VK_PIPELINE_BIND_POINT_COMPUTE, ctx->cdef_pipe.pipeline);
@@ -630,14 +667,14 @@ static int dispatch_cdef_qpu(daedalus_ctx *ctx,
memcpy(dst, bd.mapped, dst_max);
v3d_runner_destroy_buffer(ctx->runner, &bt);
v3d_runner_destroy_buffer(ctx->runner, &bd);
v3d_runner_destroy_buffer(ctx->runner, &bm);
v3d_runner_release_buffer(ctx->runner, &bt);
v3d_runner_release_buffer(ctx->runner, &bd);
v3d_runner_release_buffer(ctx->runner, &bm);
return 0;
fail:
v3d_runner_destroy_buffer(ctx->runner, &bt);
v3d_runner_destroy_buffer(ctx->runner, &bd);
v3d_runner_destroy_buffer(ctx->runner, &bm);
v3d_runner_release_buffer(ctx->runner, &bt);
v3d_runner_release_buffer(ctx->runner, &bd);
v3d_runner_release_buffer(ctx->runner, &bm);
return -1;
}
@@ -670,8 +707,8 @@ static int dispatch_h264_deblock_qpu(daedalus_ctx *ctx,
}
v3d_buffer bm = {0}, bd = {0};
if (v3d_runner_create_buffer(ctx->runner, meta_bytes, &bm)) return -1;
if (v3d_runner_create_buffer(ctx->runner, dst_max, &bd)) { v3d_runner_destroy_buffer(ctx->runner, &bm); return -1; }
if (v3d_runner_acquire_buffer(ctx->runner, meta_bytes, &bm)) return -1;
if (v3d_runner_acquire_buffer(ctx->runner, dst_max, &bd)) { v3d_runner_release_buffer(ctx->runner, &bm); return -1; }
memcpy(bd.mapped, dst, dst_max);
uint32_t *m = bm.mapped;
@@ -691,8 +728,8 @@ static int dispatch_h264_deblock_qpu(daedalus_ctx *ctx,
uint32_t wg_count = (uint32_t)((n_edges + 15) / 16);
h264deblock_pc pc = { .n_edges = (uint32_t) n_edges,
.dst_stride_u8 = (uint32_t) dst_stride };
VkCommandBuffer cb = v3d_runner_alloc_cmdbuf(ctx->runner);
if (cb == VK_NULL_HANDLE) goto fail;
if (v3d_runner_pipeline_cmdbuf_reset(ctx->runner, &ctx->h264deblock_pipe)) goto fail;
VkCommandBuffer cb = ctx->h264deblock_pipe.cb;
VkCommandBufferBeginInfo cbbi = { .sType = VK_STRUCTURE_TYPE_COMMAND_BUFFER_BEGIN_INFO };
vkBeginCommandBuffer(cb, &cbbi);
vkCmdBindPipeline(cb, VK_PIPELINE_BIND_POINT_COMPUTE, ctx->h264deblock_pipe.pipeline);
@@ -706,12 +743,294 @@ static int dispatch_h264_deblock_qpu(daedalus_ctx *ctx,
memcpy(dst, bd.mapped, dst_max);
v3d_runner_destroy_buffer(ctx->runner, &bd);
v3d_runner_destroy_buffer(ctx->runner, &bm);
v3d_runner_release_buffer(ctx->runner, &bd);
v3d_runner_release_buffer(ctx->runner, &bm);
return 0;
fail:
v3d_runner_destroy_buffer(ctx->runner, &bd);
v3d_runner_release_buffer(ctx->runner, &bd);
v3d_runner_release_buffer(ctx->runner, &bm);
return -1;
}
/* -------------------- H.264 IDCT 4x4 QPU dispatch (cycle 6) ----- */
typedef struct {
uint32_t n_blocks;
uint32_t dst_stride_u8;
uint32_t _pad0;
uint32_t _pad1;
} h264_idct4_pc;
static int dispatch_h264_idct4_qpu(daedalus_ctx *ctx,
uint8_t *dst, size_t dst_stride,
int16_t *coeffs, size_t n_blocks,
const daedalus_h264_block_meta *meta)
{
if (!ctx->h264_idct4_pipe_ready) {
if (v3d_runner_create_pipeline(ctx->runner, "v3d_h264_idct4.spv",
3, sizeof(h264_idct4_pc),
&ctx->h264_idct4_pipe) != 0)
return -1;
ctx->h264_idct4_pipe_ready = 1;
}
size_t coeff_bytes = n_blocks * 16 * sizeof(int16_t);
size_t meta_bytes = n_blocks * 4 * sizeof(uint32_t); /* uvec4 per block */
size_t dst_max = 0;
for (size_t i = 0; i < n_blocks; i++) {
size_t e = meta[i].dst_off + (size_t) 3 * dst_stride + 4;
if (e > dst_max) dst_max = e;
}
v3d_buffer bc = {0}, bd = {0}, bm = {0};
if (v3d_runner_create_buffer(ctx->runner, coeff_bytes, &bc)) return -1;
if (v3d_runner_create_buffer(ctx->runner, dst_max, &bd)) {
v3d_runner_destroy_buffer(ctx->runner, &bc); return -1;
}
if (v3d_runner_create_buffer(ctx->runner, meta_bytes, &bm)) {
v3d_runner_destroy_buffer(ctx->runner, &bd);
v3d_runner_destroy_buffer(ctx->runner, &bc); return -1;
}
memcpy(bc.mapped, coeffs, coeff_bytes);
memcpy(bd.mapped, dst, dst_max);
uint32_t *m = bm.mapped;
for (size_t i = 0; i < n_blocks; i++) {
m[4*i+0] = meta[i].dst_off;
m[4*i+1] = 0;
m[4*i+2] = 0;
m[4*i+3] = 0;
}
v3d_buffer binds[3] = { bc, bd, bm };
if (v3d_runner_bind_buffers(ctx->runner, &ctx->h264_idct4_pipe, binds, 3))
goto fail;
uint32_t wg_count = (uint32_t)((n_blocks + 15) / 16); /* 16 blocks/WG */
h264_idct4_pc pc = {
.n_blocks = (uint32_t) n_blocks,
.dst_stride_u8 = (uint32_t) dst_stride,
};
VkCommandBuffer cb = v3d_runner_alloc_cmdbuf(ctx->runner);
if (cb == VK_NULL_HANDLE) goto fail;
VkCommandBufferBeginInfo cbbi = { .sType = VK_STRUCTURE_TYPE_COMMAND_BUFFER_BEGIN_INFO };
vkBeginCommandBuffer(cb, &cbbi);
vkCmdBindPipeline(cb, VK_PIPELINE_BIND_POINT_COMPUTE,
ctx->h264_idct4_pipe.pipeline);
vkCmdBindDescriptorSets(cb, VK_PIPELINE_BIND_POINT_COMPUTE,
ctx->h264_idct4_pipe.layout, 0, 1,
&ctx->h264_idct4_pipe.desc_set, 0, NULL);
vkCmdPushConstants(cb, ctx->h264_idct4_pipe.layout,
VK_SHADER_STAGE_COMPUTE_BIT, 0, sizeof(pc), &pc);
vkCmdDispatch(cb, wg_count, 1, 1);
vkEndCommandBuffer(cb);
if (v3d_runner_submit_wait(ctx->runner, cb)) goto fail;
memcpy(dst, bd.mapped, dst_max);
/* H.264/FFmpeg convention: zero the coeffs block after the
* transform (matches the C ref + NEON .S behaviour). */
memset(coeffs, 0, coeff_bytes);
v3d_runner_destroy_buffer(ctx->runner, &bm);
v3d_runner_destroy_buffer(ctx->runner, &bd);
v3d_runner_destroy_buffer(ctx->runner, &bc);
return 0;
fail:
v3d_runner_destroy_buffer(ctx->runner, &bm);
v3d_runner_destroy_buffer(ctx->runner, &bd);
v3d_runner_destroy_buffer(ctx->runner, &bc);
return -1;
}
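The `(n_blocks + 15) / 16` and `(n_blocks + 7) / 8` expressions in the IDCT dispatchers are ceiling divisions: the last workgroup may be only partly filled, which is why the shaders guard on `block_idx >= pc.n_blocks`. A minimal sketch of that arithmetic follows. The helper name `sketch_wg_count` is hypothetical and not part of the diff; it uses a quotient-plus-remainder form so that very large `n_blocks` values cannot wrap during the add.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not in the diff): number of workgroups needed
 * to cover n_blocks when each workgroup handles blocks_per_wg blocks.
 * blocks_per_wg must be non-zero. */
static uint32_t sketch_wg_count(size_t n_blocks, size_t blocks_per_wg)
{
    /* Ceiling division without the n + (d - 1) overflow hazard. */
    return (uint32_t)(n_blocks / blocks_per_wg +
                      (n_blocks % blocks_per_wg != 0));
}
```

With 16 blocks per workgroup this gives one workgroup for 1..16 blocks and two for 17..32, matching the expression in the dispatcher for any realistic block count.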
/* -------------------- H.264 IDCT 8x8 QPU dispatch (cycle 7) ----- */
typedef struct {
uint32_t n_blocks;
uint32_t dst_stride_u8;
uint32_t _pad0;
uint32_t _pad1;
} h264_idct8_pc;
static int dispatch_h264_idct8_qpu(daedalus_ctx *ctx,
uint8_t *dst, size_t dst_stride,
int16_t *coeffs, size_t n_blocks,
const daedalus_h264_block_meta *meta)
{
if (!ctx->h264_idct8_pipe_ready) {
if (v3d_runner_create_pipeline(ctx->runner, "v3d_h264_idct8.spv",
3, sizeof(h264_idct8_pc),
&ctx->h264_idct8_pipe) != 0)
return -1;
ctx->h264_idct8_pipe_ready = 1;
}
size_t coeff_bytes = n_blocks * 64 * sizeof(int16_t);
size_t meta_bytes = n_blocks * 4 * sizeof(uint32_t);
size_t dst_max = 0;
for (size_t i = 0; i < n_blocks; i++) {
size_t e = meta[i].dst_off + (size_t) 7 * dst_stride + 8;
if (e > dst_max) dst_max = e;
}
v3d_buffer bc = {0}, bd = {0}, bm = {0};
if (v3d_runner_create_buffer(ctx->runner, coeff_bytes, &bc)) return -1;
if (v3d_runner_create_buffer(ctx->runner, dst_max, &bd)) {
v3d_runner_destroy_buffer(ctx->runner, &bc); return -1;
}
if (v3d_runner_create_buffer(ctx->runner, meta_bytes, &bm)) {
v3d_runner_destroy_buffer(ctx->runner, &bd);
v3d_runner_destroy_buffer(ctx->runner, &bc); return -1;
}
memcpy(bc.mapped, coeffs, coeff_bytes);
memcpy(bd.mapped, dst, dst_max);
uint32_t *m = bm.mapped;
for (size_t i = 0; i < n_blocks; i++) {
m[4*i+0] = meta[i].dst_off;
m[4*i+1] = 0;
m[4*i+2] = 0;
m[4*i+3] = 0;
}
v3d_buffer binds[3] = { bc, bd, bm };
if (v3d_runner_bind_buffers(ctx->runner, &ctx->h264_idct8_pipe, binds, 3))
goto fail;
uint32_t wg_count = (uint32_t)((n_blocks + 7) / 8); /* 8 blocks/WG */
h264_idct8_pc pc = {
.n_blocks = (uint32_t) n_blocks,
.dst_stride_u8 = (uint32_t) dst_stride,
};
VkCommandBuffer cb = v3d_runner_alloc_cmdbuf(ctx->runner);
if (cb == VK_NULL_HANDLE) goto fail;
VkCommandBufferBeginInfo cbbi = { .sType = VK_STRUCTURE_TYPE_COMMAND_BUFFER_BEGIN_INFO };
vkBeginCommandBuffer(cb, &cbbi);
vkCmdBindPipeline(cb, VK_PIPELINE_BIND_POINT_COMPUTE,
ctx->h264_idct8_pipe.pipeline);
vkCmdBindDescriptorSets(cb, VK_PIPELINE_BIND_POINT_COMPUTE,
ctx->h264_idct8_pipe.layout, 0, 1,
&ctx->h264_idct8_pipe.desc_set, 0, NULL);
vkCmdPushConstants(cb, ctx->h264_idct8_pipe.layout,
VK_SHADER_STAGE_COMPUTE_BIT, 0, sizeof(pc), &pc);
vkCmdDispatch(cb, wg_count, 1, 1);
vkEndCommandBuffer(cb);
if (v3d_runner_submit_wait(ctx->runner, cb)) goto fail;
memcpy(dst, bd.mapped, dst_max);
memset(coeffs, 0, coeff_bytes);
v3d_runner_destroy_buffer(ctx->runner, &bm);
v3d_runner_destroy_buffer(ctx->runner, &bd);
v3d_runner_destroy_buffer(ctx->runner, &bc);
return 0;
fail:
v3d_runner_destroy_buffer(ctx->runner, &bm);
v3d_runner_destroy_buffer(ctx->runner, &bd);
v3d_runner_destroy_buffer(ctx->runner, &bc);
return -1;
}
/* -------------------- H.264 qpel mc20 QPU dispatch (cycle 9) --- */
typedef struct {
uint32_t n_blocks;
uint32_t stride_u8;
uint32_t _pad0;
uint32_t _pad1;
} h264_qpel_mc20_pc;
static int dispatch_h264_qpel_mc20_qpu(daedalus_ctx *ctx,
uint8_t *dst, const uint8_t *src, size_t stride,
size_t n_blocks, const daedalus_h264_qpel_meta *meta)
{
if (!ctx->h264_qpel_mc20_pipe_ready) {
if (v3d_runner_create_pipeline(ctx->runner, "v3d_h264_qpel_mc20.spv",
3, sizeof(h264_qpel_mc20_pc),
&ctx->h264_qpel_mc20_pipe) != 0)
return -1;
ctx->h264_qpel_mc20_pipe_ready = 1;
}
/* Compute the smallest contiguous src/dst window that covers
* every block's read/write footprint.
*
* src: filter reads cols (c-2)..(c+3) for c=0..7 across rows 0..7.
* Highest read = src_off + 7*stride + (7 + 3) = src_off + 7*stride + 10.
* Plus 1 for the byte-count semantic of memcpy (length=N copies
* indices 0..N-1) → src_max = src_off + 7*stride + 11.
*
* dst: writes cols 0..7 across rows 0..7.
* Highest write = dst_off + 7*stride + 7; +1 → dst_off + 7*stride + 8. */
size_t meta_bytes = n_blocks * 4 * sizeof(uint32_t);
size_t src_max = 0, dst_max = 0;
for (size_t i = 0; i < n_blocks; i++) {
size_t s_end = meta[i].src_off + (size_t) 7 * stride + 11;
size_t d_end = meta[i].dst_off + (size_t) 7 * stride + 8;
if (s_end > src_max) src_max = s_end;
if (d_end > dst_max) dst_max = d_end;
}
v3d_buffer bs = {0}, bd = {0}, bm = {0};
if (v3d_runner_create_buffer(ctx->runner, src_max, &bs)) return -1;
if (v3d_runner_create_buffer(ctx->runner, dst_max, &bd)) {
v3d_runner_destroy_buffer(ctx->runner, &bs); return -1;
}
if (v3d_runner_create_buffer(ctx->runner, meta_bytes, &bm)) {
v3d_runner_destroy_buffer(ctx->runner, &bd);
v3d_runner_destroy_buffer(ctx->runner, &bs); return -1;
}
/* Copy src window (filter needs cols -2..+3, captured by src_max
* upper bound above; the lower bound is implicit in src_off >= 2
* which the caller guarantees per the public API contract). */
memcpy(bs.mapped, src, src_max);
memcpy(bd.mapped, dst, dst_max);
uint32_t *m = bm.mapped;
for (size_t i = 0; i < n_blocks; i++) {
m[4*i+0] = meta[i].dst_off;
m[4*i+1] = meta[i].src_off;
m[4*i+2] = 0;
m[4*i+3] = 0;
}
v3d_buffer binds[3] = { bs, bd, bm };
if (v3d_runner_bind_buffers(ctx->runner, &ctx->h264_qpel_mc20_pipe, binds, 3))
goto fail;
uint32_t wg_count = (uint32_t) n_blocks; /* 1 block per WG */
h264_qpel_mc20_pc pc = {
.n_blocks = (uint32_t) n_blocks,
.stride_u8 = (uint32_t) stride,
};
VkCommandBuffer cb = v3d_runner_alloc_cmdbuf(ctx->runner);
if (cb == VK_NULL_HANDLE) goto fail;
VkCommandBufferBeginInfo cbbi = { .sType = VK_STRUCTURE_TYPE_COMMAND_BUFFER_BEGIN_INFO };
vkBeginCommandBuffer(cb, &cbbi);
vkCmdBindPipeline(cb, VK_PIPELINE_BIND_POINT_COMPUTE,
ctx->h264_qpel_mc20_pipe.pipeline);
vkCmdBindDescriptorSets(cb, VK_PIPELINE_BIND_POINT_COMPUTE,
ctx->h264_qpel_mc20_pipe.layout, 0, 1,
&ctx->h264_qpel_mc20_pipe.desc_set, 0, NULL);
vkCmdPushConstants(cb, ctx->h264_qpel_mc20_pipe.layout,
VK_SHADER_STAGE_COMPUTE_BIT, 0, sizeof(pc), &pc);
vkCmdDispatch(cb, wg_count, 1, 1);
vkEndCommandBuffer(cb);
if (v3d_runner_submit_wait(ctx->runner, cb)) goto fail;
memcpy(dst, bd.mapped, dst_max);
v3d_runner_destroy_buffer(ctx->runner, &bm);
v3d_runner_destroy_buffer(ctx->runner, &bd);
v3d_runner_destroy_buffer(ctx->runner, &bs);
return 0;
fail:
v3d_runner_destroy_buffer(ctx->runner, &bm);
v3d_runner_destroy_buffer(ctx->runner, &bd);
v3d_runner_destroy_buffer(ctx->runner, &bs);
return -1;
}
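For cross-checking the QPU path, the filter and the window bounds described in the comment above can be written as a scalar C reference. This is a sketch only: `ref_h264_qpel8_mc20`, `mc20_src_end` and `mc20_dst_end` are hypothetical names, not functions from the diff, and the code assumes the same caller contract (two readable bytes left of output column 0, three right of column 7, in every row).

```c
#include <stddef.h>
#include <stdint.h>

/* Clamp an int to the 0..255 range of an 8-bit sample. */
static uint8_t clip_u8(int v)
{
    return (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
}

/* Hypothetical scalar reference for one 8x8 mc20 block.
 * src points at output column 0; dst and src share one stride. */
static void ref_h264_qpel8_mc20(uint8_t *dst, const uint8_t *src, size_t stride)
{
    for (int r = 0; r < 8; r++) {
        const uint8_t *s = src + (size_t)r * stride;
        for (int c = 0; c < 8; c++) {
            /* 6-tap (1, -5, 20, 20, -5, 1) filter over cols c-2..c+3. */
            int v = s[c - 2] - 5 * s[c - 1] + 20 * s[c]
                  + 20 * s[c + 1] - 5 * s[c + 2] + s[c + 3];
            dst[(size_t)r * stride + c] = clip_u8((v + 16) >> 5);
        }
    }
}

/* One-past-the-end byte offsets of the read and write windows,
 * mirroring the s_end / d_end expressions in the dispatcher. */
static size_t mc20_src_end(size_t src_off, size_t stride)
{
    return src_off + 7 * stride + 11;   /* last read is +7*stride+10 */
}

static size_t mc20_dst_end(size_t dst_off, size_t stride)
{
    return dst_off + 7 * stride + 8;    /* last write is +7*stride+7 */
}
```

Because the taps sum to 32, a flat source region reproduces its own value, which makes a convenient first sanity check before comparing against real frames.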
@@ -803,8 +1122,16 @@ int daedalus_dispatch_h264_idct4(daedalus_ctx *ctx, daedalus_substrate sub,
int16_t *coeffs, size_t n_blocks,
const daedalus_h264_block_meta *meta)
{
ROUTE_CPU_ONLY(DAEDALUS_KERNEL_H264_IDCT4, dispatch_h264_idct4_cpu,
dst, dst_stride, coeffs, n_blocks, meta);
daedalus_substrate eff = sub;
if (eff == DAEDALUS_SUBSTRATE_AUTO)
eff = daedalus_recipe_substrate_for(DAEDALUS_KERNEL_H264_IDCT4);
if (eff == DAEDALUS_SUBSTRATE_QPU && !daedalus_ctx_has_qpu(ctx))
eff = DAEDALUS_SUBSTRATE_CPU;
if (eff == DAEDALUS_SUBSTRATE_CPU)
return dispatch_h264_idct4_cpu(ctx, dst, dst_stride,
coeffs, n_blocks, meta);
return dispatch_h264_idct4_qpu(ctx, dst, dst_stride,
coeffs, n_blocks, meta);
}
int daedalus_dispatch_h264_idct8(daedalus_ctx *ctx, daedalus_substrate sub,
@@ -812,8 +1139,16 @@ int daedalus_dispatch_h264_idct8(daedalus_ctx *ctx, daedalus_substrate sub,
int16_t *coeffs, size_t n_blocks,
const daedalus_h264_block_meta *meta)
{
ROUTE_CPU_ONLY(DAEDALUS_KERNEL_H264_IDCT8, dispatch_h264_idct8_cpu,
dst, dst_stride, coeffs, n_blocks, meta);
daedalus_substrate eff = sub;
if (eff == DAEDALUS_SUBSTRATE_AUTO)
eff = daedalus_recipe_substrate_for(DAEDALUS_KERNEL_H264_IDCT8);
if (eff == DAEDALUS_SUBSTRATE_QPU && !daedalus_ctx_has_qpu(ctx))
eff = DAEDALUS_SUBSTRATE_CPU;
if (eff == DAEDALUS_SUBSTRATE_CPU)
return dispatch_h264_idct8_cpu(ctx, dst, dst_stride,
coeffs, n_blocks, meta);
return dispatch_h264_idct8_qpu(ctx, dst, dst_stride,
coeffs, n_blocks, meta);
}
int daedalus_dispatch_h264_deblock_luma_v(daedalus_ctx *ctx, daedalus_substrate sub,
@@ -834,8 +1169,16 @@ int daedalus_dispatch_h264_qpel_mc20(daedalus_ctx *ctx, daedalus_substrate sub,
uint8_t *dst, const uint8_t *src, size_t stride,
size_t n_blocks, const daedalus_h264_qpel_meta *meta)
{
ROUTE_CPU_ONLY(DAEDALUS_KERNEL_H264_QPEL_MC20, dispatch_h264_qpel_mc20_cpu,
dst, src, stride, n_blocks, meta);
daedalus_substrate eff = sub;
if (eff == DAEDALUS_SUBSTRATE_AUTO)
eff = daedalus_recipe_substrate_for(DAEDALUS_KERNEL_H264_QPEL_MC20);
if (eff == DAEDALUS_SUBSTRATE_QPU && !daedalus_ctx_has_qpu(ctx))
eff = DAEDALUS_SUBSTRATE_CPU;
if (eff == DAEDALUS_SUBSTRATE_CPU)
return dispatch_h264_qpel_mc20_cpu(ctx, dst, src, stride,
n_blocks, meta);
return dispatch_h264_qpel_mc20_qpu(ctx, dst, src, stride,
n_blocks, meta);
}
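The three H.264 wrappers above repeat the same substrate resolution: AUTO defers to the recipe table, and a QPU request degrades to CPU when the context has no QPU. A minimal sketch of that decision as a standalone function is below. The `sketch_*` names are hypothetical stand-ins for the daedalus types; the CPU and QPU values follow the "1=CPU, 2=QPU" legend in the test log, while the value used for AUTO is an assumption.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the daedalus enums. CPU/QPU values follow
 * the "1=CPU, 2=QPU" legend in the test log; AUTO = 0 is an assumption. */
typedef enum {
    SKETCH_SUBSTRATE_AUTO = 0,
    SKETCH_SUBSTRATE_CPU  = 1,
    SKETCH_SUBSTRATE_QPU  = 2
} sketch_substrate;

/* Resolve the substrate that a dispatch wrapper would actually use. */
static sketch_substrate sketch_resolve_substrate(sketch_substrate requested,
                                                 sketch_substrate recipe,
                                                 bool ctx_has_qpu)
{
    sketch_substrate eff = requested;
    if (eff == SKETCH_SUBSTRATE_AUTO)
        eff = recipe;                       /* defer to the recipe table */
    if (eff == SKETCH_SUBSTRATE_QPU && !ctx_has_qpu)
        eff = SKETCH_SUBSTRATE_CPU;         /* no QPU: silent CPU fallback */
    return eff;
}
```

Factoring the decision out this way would also make the fallback rule unit-testable without a Vulkan device, though whether that is worth the indirection is a design call for the maintainers.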
/* -------------------- Recipe convenience wrappers --------------- */
+129
@@ -0,0 +1,129 @@
// daedalus-fourier — H.264 4x4 inverse integer transform + add, V3D 7.1.
//
// H.264 spec §8.5.12.1. Pure integer arithmetic — no trig constants
// (unlike VP9 IDCT 8x8). Row pass first, column pass second; round
// (+32) >> 6, add to dst, clip to u8.
//
// Block memory layout: COLUMN-MAJOR. block[c*4 + r] = coefficient at
// (row r, column c). Matches FFmpeg `ff_h264_idct_add_neon`.
//
// Workgroup layout: 64 invocations = 4 lanes/block × 16 blocks/WG.
// - row pass: lane k (0..3) reads row k of the block (4 coefficients,
// one from each column), runs the butterfly, writes 4
// outputs to one row of tmp_shared.
// - column pass: lane k reads column k of tmp_shared (4 rows),
// runs the butterfly, writes 4 outputs to dst as
// column k at rows 0..3.
//
// shared = 16 × 16 × 4 B = 1 KiB. Well under V3D's 16 KiB limit.
//
// License: BSD-2-Clause.
#version 450
#extension GL_EXT_shader_8bit_storage : require
#extension GL_EXT_shader_16bit_storage : require
#extension GL_EXT_shader_explicit_arithmetic_types : require
layout(local_size_x = 64, local_size_y = 1, local_size_z = 1) in;
layout(binding = 0) readonly buffer Coeffs {
int16_t coeffs[]; // N × 16 column-major
} u_coeffs;
layout(binding = 1) buffer Dst {
uint8_t dst[]; // H × stride bytes (caller-provided base)
} u_dst;
layout(binding = 2) readonly buffer Meta {
uvec4 meta[]; // .x = dst_off (byte offset into u_dst.dst)
} u_meta;
layout(push_constant) uniform PC {
uint n_blocks;
uint dst_stride_u8;
uint _pad0, _pad1;
} pc;
// 16 blocks per WG × 16 ints per block = 256 ints = 1 KiB shared.
shared int tmp_shared[16 * 16];
// 1D butterfly per H.264 §8.5.12.1. d[0..3] in, o[0..3] out.
void idct4_1d(int d0, int d1, int d2, int d3,
out int o0, out int o1, out int o2, out int o3)
{
int e = d0 + d2;
int f = d0 - d2;
int g = (d1 >> 1) - d3;
int h = d1 + (d3 >> 1);
o0 = e + h;
o1 = f + g;
o2 = f - g;
o3 = e - h;
}
void main()
{
// Lane decomposition: local_size 64 = 16 blocks × 4 lanes/block.
uint gid = gl_GlobalInvocationID.x;
uint wg_id = gid / 64u;
uint lane_in_wg = gid & 63u;
uint block_local = lane_in_wg >> 2; // 0..15
uint k = lane_in_wg & 3u; // 0..3
uint block_idx = wg_id * 16u + block_local;
bool oob = (block_idx >= pc.n_blocks);
// ---- Row pass --------------------------------------------------
// lane k handles row r=k. Reads block[c*4 + k] for c=0..3 (one
// element from each column at fixed row).
if (!oob) {
uint base = block_idx * 16u;
int d0 = int(u_coeffs.coeffs[base + 0u * 4u + k]);
int d1 = int(u_coeffs.coeffs[base + 1u * 4u + k]);
int d2 = int(u_coeffs.coeffs[base + 2u * 4u + k]);
int d3 = int(u_coeffs.coeffs[base + 3u * 4u + k]);
int o0, o1, o2, o3;
idct4_1d(d0, d1, d2, d3, o0, o1, o2, o3);
// Write row k of tmp_shared[block_local].
uint tbase = block_local * 16u + k * 4u;
tmp_shared[tbase + 0u] = o0;
tmp_shared[tbase + 1u] = o1;
tmp_shared[tbase + 2u] = o2;
tmp_shared[tbase + 3u] = o3;
}
barrier();
// ---- Column pass ----------------------------------------------
// lane k handles column c=k. Reads tmp[r][k] for r=0..3.
if (!oob) {
uint tbase = block_local * 16u;
int s0 = tmp_shared[tbase + 0u * 4u + k];
int s1 = tmp_shared[tbase + 1u * 4u + k];
int s2 = tmp_shared[tbase + 2u * 4u + k];
int s3 = tmp_shared[tbase + 3u * 4u + k];
int o0, o1, o2, o3;
idct4_1d(s0, s1, s2, s3, o0, o1, o2, o3);
// Column k at rows 0..3 of dst, offset by meta.x (dst_off).
uint dst_off = u_meta.meta[block_idx].x;
uint stride = pc.dst_stride_u8;
uint a0 = dst_off + 0u * stride + k;
uint a1 = dst_off + 1u * stride + k;
uint a2 = dst_off + 2u * stride + k;
uint a3 = dst_off + 3u * stride + k;
int p0 = int(u_dst.dst[a0]);
int p1 = int(u_dst.dst[a1]);
int p2 = int(u_dst.dst[a2]);
int p3 = int(u_dst.dst[a3]);
u_dst.dst[a0] = uint8_t(clamp(p0 + ((o0 + 32) >> 6), 0, 255));
u_dst.dst[a1] = uint8_t(clamp(p1 + ((o1 + 32) >> 6), 0, 255));
u_dst.dst[a2] = uint8_t(clamp(p2 + ((o2 + 32) >> 6), 0, 255));
u_dst.dst[a3] = uint8_t(clamp(p3 + ((o3 + 32) >> 6), 0, 255));
}
}
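A scalar C transcription of the shader above can serve as a host-side oracle when validating QPU output. The sketch below mirrors the shader's own structure (column-major input, row pass into a temporary, column pass with `(+32) >> 6` rounding and clamp); it is not the project's CPU reference, and the `ref_*` names are hypothetical. Unlike the dispatcher, it does not zero the coefficient block afterwards.

```c
#include <stddef.h>
#include <stdint.h>

/* 1D 4-point butterfly, same arithmetic as idct4_1d in the shader. */
static void ref_idct4_1d(const int d[4], int o[4])
{
    int e = d[0] + d[2];
    int f = d[0] - d[2];
    int g = (d[1] >> 1) - d[3];
    int h = d[1] + (d[3] >> 1);
    o[0] = e + h;
    o[1] = f + g;
    o[2] = f - g;
    o[3] = e - h;
}

/* Hypothetical host-side reference for one 4x4 block.
 * block is column-major: block[c*4 + r] is (row r, column c). */
static void ref_h264_idct4_add(uint8_t *dst, size_t stride,
                               const int16_t block[16])
{
    int tmp[16];

    /* Row pass: row r gathers one coefficient from each column. */
    for (int r = 0; r < 4; r++) {
        int d[4] = { block[0 * 4 + r], block[1 * 4 + r],
                     block[2 * 4 + r], block[3 * 4 + r] };
        ref_idct4_1d(d, &tmp[r * 4]);
    }

    /* Column pass: column k reads tmp[r][k], then round, add, clamp. */
    for (int k = 0; k < 4; k++) {
        int s[4] = { tmp[0 * 4 + k], tmp[1 * 4 + k],
                     tmp[2 * 4 + k], tmp[3 * 4 + k] };
        int o[4];
        ref_idct4_1d(s, o);
        for (int r = 0; r < 4; r++) {
            int v = dst[(size_t)r * stride + k] + ((o[r] + 32) >> 6);
            dst[(size_t)r * stride + k] =
                (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
        }
    }
}
```

A DC-only block is a convenient smoke test: a DC coefficient of 64 propagates as 64 through both passes and adds exactly 1 to every pixel after rounding.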
+175
@@ -0,0 +1,175 @@
// daedalus-fourier — H.264 8x8 inverse integer transform + add, V3D 7.1.
//
// H.264 spec §8.5.13.2 (High profile 8x8 IT). Pure integer arithmetic
// — different butterfly from VP9 IDCT 8x8 (cycle 1, uses cospi
// multipliers). Row pass first, column pass second; round (+32) >> 6,
// add to dst, clip to u8.
//
// Block layout: COLUMN-MAJOR. block[c*8 + r] = coefficient at
// (row r, column c). Matches FFmpeg `ff_h264_idct8_add_neon`.
//
// Workgroup layout: 64 invocations = 8 lanes/block × 8 blocks/WG.
// - row pass: lane k (0..7) reads row k of the block (8 coefficients,
// one from each column), runs the butterfly, writes 8
// outputs to one row of tmp_shared.
// - column pass: lane k reads column k of tmp_shared (8 rows),
// runs the butterfly, writes 8 outputs to dst as
// column k at rows 0..7.
//
// shared = 8 × 64 × 4 B = 2 KiB. Well under V3D's 16 KiB limit.
//
// License: BSD-2-Clause.
#version 450
#extension GL_EXT_shader_8bit_storage : require
#extension GL_EXT_shader_16bit_storage : require
#extension GL_EXT_shader_explicit_arithmetic_types : require
layout(local_size_x = 64, local_size_y = 1, local_size_z = 1) in;
layout(binding = 0) readonly buffer Coeffs {
int16_t coeffs[]; // N × 64 column-major
} u_coeffs;
layout(binding = 1) buffer Dst {
uint8_t dst[]; // H × stride bytes
} u_dst;
layout(binding = 2) readonly buffer Meta {
uvec4 meta[]; // .x = dst_off
} u_meta;
layout(push_constant) uniform PC {
uint n_blocks;
uint dst_stride_u8;
uint _pad0, _pad1;
} pc;
// 8 blocks/WG × 64 ints/block × 4 B = 2 KiB shared.
shared int tmp_shared[8 * 64];
// 1D 8-element butterfly per H.264 §8.5.13.2.
void idct8_1d(int d0, int d1, int d2, int d3,
int d4, int d5, int d6, int d7,
out int g0, out int g1, out int g2, out int g3,
out int g4, out int g5, out int g6, out int g7)
{
int e0 = d0 + d4;
int e1 = -d3 + d5 - d7 - (d7 >> 1);
int e2 = d0 - d4;
int e3 = d1 + d7 - d3 - (d3 >> 1);
int e4 = (d2 >> 1) - d6;
int e5 = -d1 + d7 + d5 + (d5 >> 1);
int e6 = d2 + (d6 >> 1);
int e7 = d3 + d5 + d1 + (d1 >> 1);
int f0 = e0 + e6;
int f1 = e1 + (e7 >> 2);
int f2 = e2 + e4;
int f3 = e3 + (e5 >> 2);
int f4 = e2 - e4;
int f5 = (e3 >> 2) - e5;
int f6 = e0 - e6;
int f7 = e7 - (e1 >> 2);
g0 = f0 + f7;
g1 = f2 + f5;
g2 = f4 + f3;
g3 = f6 + f1;
g4 = f6 - f1;
g5 = f4 - f3;
g6 = f2 - f5;
g7 = f0 - f7;
}
void main()
{
// local_size 64 = 8 blocks × 8 lanes/block.
uint gid = gl_GlobalInvocationID.x;
uint wg_id = gid / 64u;
uint lane_in_wg = gid & 63u;
uint block_local = lane_in_wg >> 3; // 0..7
uint k = lane_in_wg & 7u; // 0..7
uint block_idx = wg_id * 8u + block_local;
bool oob = (block_idx >= pc.n_blocks);
// ---- Row pass --------------------------------------------------
// lane k handles row r=k. Reads block[c*8 + k] for c=0..7.
if (!oob) {
uint base = block_idx * 64u;
int d0 = int(u_coeffs.coeffs[base + 0u * 8u + k]);
int d1 = int(u_coeffs.coeffs[base + 1u * 8u + k]);
int d2 = int(u_coeffs.coeffs[base + 2u * 8u + k]);
int d3 = int(u_coeffs.coeffs[base + 3u * 8u + k]);
int d4 = int(u_coeffs.coeffs[base + 4u * 8u + k]);
int d5 = int(u_coeffs.coeffs[base + 5u * 8u + k]);
int d6 = int(u_coeffs.coeffs[base + 6u * 8u + k]);
int d7 = int(u_coeffs.coeffs[base + 7u * 8u + k]);
int g0, g1, g2, g3, g4, g5, g6, g7;
idct8_1d(d0, d1, d2, d3, d4, d5, d6, d7,
g0, g1, g2, g3, g4, g5, g6, g7);
// Write row k of tmp_shared[block_local].
uint tbase = block_local * 64u + k * 8u;
tmp_shared[tbase + 0u] = g0;
tmp_shared[tbase + 1u] = g1;
tmp_shared[tbase + 2u] = g2;
tmp_shared[tbase + 3u] = g3;
tmp_shared[tbase + 4u] = g4;
tmp_shared[tbase + 5u] = g5;
tmp_shared[tbase + 6u] = g6;
tmp_shared[tbase + 7u] = g7;
}
barrier();
// ---- Column pass ----------------------------------------------
// lane k handles column c=k. Reads tmp[r][k] for r=0..7.
if (!oob) {
uint tbase = block_local * 64u;
int s0 = tmp_shared[tbase + 0u * 8u + k];
int s1 = tmp_shared[tbase + 1u * 8u + k];
int s2 = tmp_shared[tbase + 2u * 8u + k];
int s3 = tmp_shared[tbase + 3u * 8u + k];
int s4 = tmp_shared[tbase + 4u * 8u + k];
int s5 = tmp_shared[tbase + 5u * 8u + k];
int s6 = tmp_shared[tbase + 6u * 8u + k];
int s7 = tmp_shared[tbase + 7u * 8u + k];
int g0, g1, g2, g3, g4, g5, g6, g7;
idct8_1d(s0, s1, s2, s3, s4, s5, s6, s7,
g0, g1, g2, g3, g4, g5, g6, g7);
// Column k at rows 0..7 of dst, offset by meta.x.
uint dst_off = u_meta.meta[block_idx].x;
uint stride = pc.dst_stride_u8;
uint a0 = dst_off + 0u * stride + k;
uint a1 = dst_off + 1u * stride + k;
uint a2 = dst_off + 2u * stride + k;
uint a3 = dst_off + 3u * stride + k;
uint a4 = dst_off + 4u * stride + k;
uint a5 = dst_off + 5u * stride + k;
uint a6 = dst_off + 6u * stride + k;
uint a7 = dst_off + 7u * stride + k;
int p0 = int(u_dst.dst[a0]);
int p1 = int(u_dst.dst[a1]);
int p2 = int(u_dst.dst[a2]);
int p3 = int(u_dst.dst[a3]);
int p4 = int(u_dst.dst[a4]);
int p5 = int(u_dst.dst[a5]);
int p6 = int(u_dst.dst[a6]);
int p7 = int(u_dst.dst[a7]);
u_dst.dst[a0] = uint8_t(clamp(p0 + ((g0 + 32) >> 6), 0, 255));
u_dst.dst[a1] = uint8_t(clamp(p1 + ((g1 + 32) >> 6), 0, 255));
u_dst.dst[a2] = uint8_t(clamp(p2 + ((g2 + 32) >> 6), 0, 255));
u_dst.dst[a3] = uint8_t(clamp(p3 + ((g3 + 32) >> 6), 0, 255));
u_dst.dst[a4] = uint8_t(clamp(p4 + ((g4 + 32) >> 6), 0, 255));
u_dst.dst[a5] = uint8_t(clamp(p5 + ((g5 + 32) >> 6), 0, 255));
u_dst.dst[a6] = uint8_t(clamp(p6 + ((g6 + 32) >> 6), 0, 255));
u_dst.dst[a7] = uint8_t(clamp(p7 + ((g7 + 32) >> 6), 0, 255));
}
}
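For cross-checking the shader on the CPU, the two passes above can be sketched as a scalar C reference. This is a minimal sketch, not the project's CPU path: `ref_idct8_1d` and `ref_h264_idct8_add` are hypothetical names, the `e0 = d0 + d4` line is assumed from the standard H.264 8x8 butterfly because it falls before the excerpt, and `>>` on negative values is assumed to be an arithmetic shift (GLSL guarantees this, C leaves it implementation-defined).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One 8-point pass, mirroring the e/f/g stages of the shader's idct8_1d. */
static void ref_idct8_1d(const int d[8], int g[8])
{
    int e0 = d[0] + d[4];                      /* assumed: not in the excerpt */
    int e1 = -d[3] + d[5] - d[7] - (d[7] >> 1);
    int e2 = d[0] - d[4];
    int e3 = d[1] + d[7] - d[3] - (d[3] >> 1);
    int e4 = (d[2] >> 1) - d[6];
    int e5 = -d[1] + d[7] + d[5] + (d[5] >> 1);
    int e6 = d[2] + (d[6] >> 1);
    int e7 = d[3] + d[5] + d[1] + (d[1] >> 1);
    int f0 = e0 + e6, f1 = e1 + (e7 >> 2);
    int f2 = e2 + e4, f3 = e3 + (e5 >> 2);
    int f4 = e2 - e4, f5 = (e3 >> 2) - e5;
    int f6 = e0 - e6, f7 = e7 - (e1 >> 2);
    g[0] = f0 + f7; g[1] = f2 + f5; g[2] = f4 + f3; g[3] = f6 + f1;
    g[4] = f6 - f1; g[5] = f4 - f3; g[6] = f2 - f5; g[7] = f0 - f7;
}

/* Hypothetical reference: transform one 8x8 block and add it into dst. */
static void ref_h264_idct8_add(uint8_t *dst, ptrdiff_t stride,
                               const int16_t block[64])
{
    int tmp[64];
    assert(stride >= 8);
    /* Row pass: "lane" k reads block[c*8 + k], writes tmp row k. */
    for (int k = 0; k < 8; k++) {
        int d[8];
        for (int c = 0; c < 8; c++) d[c] = block[c * 8 + k];
        ref_idct8_1d(d, &tmp[k * 8]);
    }
    /* Column pass: "lane" k reads tmp[r*8 + k], then rounds, adds, clips. */
    for (int k = 0; k < 8; k++) {
        int s[8], g[8];
        for (int r = 0; r < 8; r++) s[r] = tmp[r * 8 + k];
        ref_idct8_1d(s, g);
        for (int r = 0; r < 8; r++) {
            int v = dst[r * stride + k] + ((g[r] + 32) >> 6);
            dst[r * stride + k] = (uint8_t)(v < 0 ? 0 : v > 255 ? 255 : v);
        }
    }
}
```

A DC-only block is a convenient sanity input: with `block[0] = 64` and every other coefficient zero, each pass spreads the value evenly, so every pixel of the 8x8 destination rises by exactly 1.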
@@ -0,0 +1,83 @@
// daedalus-fourier — H.264 luma qpel mc20 (8x8, horizontal half-pel), V3D 7.1.
//
// H.264 spec §8.4.2.2.1 horizontal 6-tap luma interpolation:
//
// dst[r,c] = clip255(
// ( s[r,c-2]
// - 5 * s[r,c-1]
// + 20 * s[r,c]
// + 20 * s[r,c+1]
// - 5 * s[r,c+2]
// + s[r,c+3]
// + 16
// ) >> 5)
//
// Single-stride: dst and src share `stride` (H264QpelContext
// convention). src+src_off already points at the leftmost output
// column (col 0); the filter reads cols -2..+3. Caller guarantees
// edge-padding context per the public API docstring.
//
// Workgroup layout: 64 invocations = 1 lane per output pixel.
// 1 block per WG; n_blocks WGs total. This is the simplest layout
// that avoids any inter-lane communication — each lane independently
// reads its 6 src samples and writes its 1 dst sample. V3D's L2
// cache handles the redundant reads from adjacent lanes.
//
// License: BSD-2-Clause.
#version 450
#extension GL_EXT_shader_8bit_storage : require
#extension GL_EXT_shader_explicit_arithmetic_types : require
layout(local_size_x = 64, local_size_y = 1, local_size_z = 1) in;
layout(binding = 0) readonly buffer Src {
uint8_t src[];
} u_src;
layout(binding = 1) buffer Dst {
uint8_t dst[];
} u_dst;
layout(binding = 2) readonly buffer Meta {
uvec4 meta[]; // .x = dst_off, .y = src_off
} u_meta;
layout(push_constant) uniform PC {
uint n_blocks;
uint stride_u8;
uint _pad0, _pad1;
} pc;
void main()
{
// 1 block per WG, 64 lanes covering the 8x8 output block.
uint wg_id = gl_WorkGroupID.x;
uint block_idx = wg_id;
if (block_idx >= pc.n_blocks) return;
uint lane = gl_LocalInvocationID.x;
uint r = lane >> 3; // 0..7 (row)
uint c = lane & 7u; // 0..7 (column)
uint dst_off = u_meta.meta[block_idx].x;
uint src_off = u_meta.meta[block_idx].y;
uint stride = pc.stride_u8;
// src points at output col 0 of the block; filter reads cols -2..+3
// of the current row. row_base - 2u cannot wrap below zero because
// src_off >= 2 (caller-guaranteed left context).
uint row_base = src_off + r * stride + c;
int s_m2 = int(u_src.src[row_base - 2u]);
int s_m1 = int(u_src.src[row_base - 1u]);
int s_0 = int(u_src.src[row_base + 0u]);
int s_p1 = int(u_src.src[row_base + 1u]);
int s_p2 = int(u_src.src[row_base + 2u]);
int s_p3 = int(u_src.src[row_base + 3u]);
int v = s_m2 - 5 * s_m1 + 20 * s_0 + 20 * s_p1 - 5 * s_p2 + s_p3 + 16;
int p = clamp(v >> 5, 0, 255);
u_dst.dst[dst_off + r * stride + c] = uint8_t(p);
}
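The per-lane arithmetic above maps directly onto a scalar loop, which is handy as a bit-exact oracle when testing the shader. The following is a minimal sketch under stated assumptions: `ref_h264_qpel8_mc20` and `clip_u8` are hypothetical names, and the caller is assumed to provide the same 2-byte left and 3-byte right context on every row that the shader relies on.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Clamp to the u8 range, like the shader's clamp(v, 0, 255). */
static uint8_t clip_u8(int v)
{
    return (uint8_t)(v < 0 ? 0 : v > 255 ? 255 : v);
}

/* Hypothetical reference for one 8x8 block.  `src` points at output
 * column 0 of row 0; dst and src share `stride`. */
static void ref_h264_qpel8_mc20(uint8_t *dst, const uint8_t *src,
                                ptrdiff_t stride)
{
    assert(stride >= 8);
    for (int r = 0; r < 8; r++) {
        const uint8_t *s = src + r * stride;
        for (int c = 0; c < 8; c++) {
            /* (1, -5, 20, 20, -5, 1) taps plus the +16 rounding term. */
            int v = s[c - 2] - 5 * s[c - 1] + 20 * s[c] + 20 * s[c + 1]
                  - 5 * s[c + 2] + s[c + 3] + 16;
            /* Negative sums clip to 0; handling them before the shift
             * avoids relying on arithmetic >> of a negative int in C. */
            dst[r * stride + c] = v < 0 ? 0 : clip_u8(v >> 5);
        }
    }
}
```

Because the taps sum to 32, a flat source region must pass through unchanged, which makes a constant-valued buffer a cheap first test before comparing against real frames.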
@@ -17,6 +17,18 @@
fprintf(stderr, "v3d_runner: vulkan error %d at %s:%d (%s)\n", \
r__, __FILE__, __LINE__, #call); return NULL; } } while (0)
/* Power-of-2 size classes from 2^8 (256 B) up to 2^23 (8 MiB). Cycle
* 1's largest dispatch with n_blocks ≈ 8K is well under 8 MiB; oversize
* requests fall through to non-pooled allocation. */
#define V3D_POOL_MIN_LOG2 8
#define V3D_POOL_MAX_LOG2 23
#define V3D_POOL_BUCKETS (V3D_POOL_MAX_LOG2 - V3D_POOL_MIN_LOG2 + 1)
struct v3d_pool_entry {
v3d_buffer buf;
struct v3d_pool_entry *next;
};
struct v3d_runner {
VkInstance instance;
VkPhysicalDevice phys;
@@ -26,6 +38,15 @@ struct v3d_runner {
VkCommandPool pool;
char device_name[VK_MAX_PHYSICAL_DEVICE_NAME_SIZE];
VkPhysicalDeviceMemoryProperties mem_props;
/* Buffer pool: per-bucket freelist of previously-released
* v3d_buffer. bucket index = ceil_log2(size) - V3D_POOL_MIN_LOG2.
* pool_total_bytes accumulates every successful vkAllocateMemory
* we've done through the pool — never decreases (the freelist
* just hands buffers around, no vkFreeMemory until destroy).
*/
struct v3d_pool_entry *pool_free[V3D_POOL_BUCKETS];
size_t pool_total_bytes;
};
static int pick_v3d_physical_device(VkInstance inst, VkPhysicalDevice *out,
@@ -168,6 +189,21 @@ void v3d_runner_destroy(v3d_runner *r)
{
if (!r) return;
if (r->device != VK_NULL_HANDLE) vkDeviceWaitIdle(r->device);
/* Drain the buffer pool BEFORE destroying device — the pool
* entries own VkBuffer/VkDeviceMemory handles, which need a live
* device for vkDestroyBuffer/vkFreeMemory. */
for (int b = 0; b < V3D_POOL_BUCKETS; b++) {
struct v3d_pool_entry *e = r->pool_free[b];
while (e) {
struct v3d_pool_entry *next = e->next;
v3d_runner_destroy_buffer(r, &e->buf);
free(e);
e = next;
}
r->pool_free[b] = NULL;
}
if (r->pool != VK_NULL_HANDLE)
vkDestroyCommandPool(r->device, r->pool, NULL);
if (r->device != VK_NULL_HANDLE) vkDestroyDevice(r->device, NULL);
@@ -175,6 +211,92 @@ void v3d_runner_destroy(v3d_runner *r)
free(r);
}
/* ---- Buffer pool ----------------------------------------------- */
/* ceil_log2 for buffer pool bucket selection. */
static int v3d_pool_bucket_for(size_t size)
{
int log2;
size_t m;
if (size <= ((size_t)1 << V3D_POOL_MIN_LOG2))
return 0;
m = size - 1;
log2 = 0;
while (m) { log2++; m >>= 1; }
if (log2 < V3D_POOL_MIN_LOG2) log2 = V3D_POOL_MIN_LOG2;
if (log2 > V3D_POOL_MAX_LOG2) return -1;
return log2 - V3D_POOL_MIN_LOG2;
}
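To make the size-class mapping concrete, here is a standalone sketch of the same ceil-log2 bucketing with the Vulkan types stripped out. The names `pool_bucket_for` and `pool_bucket_size` and the shortened macro names are illustrative only; the constants are copied from the diff.

```c
#include <assert.h>
#include <stddef.h>

#define POOL_MIN_LOG2 8   /* 256 B  */
#define POOL_MAX_LOG2 23  /* 8 MiB  */

/* Bucket index for a request, or -1 when it is too large to pool. */
static int pool_bucket_for(size_t size)
{
    int log2 = 0;
    size_t m;
    if (size <= ((size_t)1 << POOL_MIN_LOG2))
        return 0;                       /* everything up to 256 B shares bucket 0 */
    for (m = size - 1; m; m >>= 1)      /* bit length of size-1 == ceil(log2(size)) */
        log2++;
    if (log2 > POOL_MAX_LOG2)
        return -1;
    return log2 - POOL_MIN_LOG2;
}

/* Capacity of the buffers held in a given bucket. */
static size_t pool_bucket_size(int bucket)
{
    assert(bucket >= 0 && bucket <= POOL_MAX_LOG2 - POOL_MIN_LOG2);
    return (size_t)1 << (bucket + POOL_MIN_LOG2);
}
```

For example, requests of 1 and 256 bytes both land in bucket 0, a 257-byte request moves to bucket 1 (512 B), exactly 8 MiB is the last pooled class (bucket 15), and one byte more returns -1.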
int v3d_runner_acquire_buffer(v3d_runner *r, size_t size, v3d_buffer *out)
{
int bucket;
size_t bucket_size;
struct v3d_pool_entry *e;
int rc;
if (!r || !out || size == 0) return -1;
bucket = v3d_pool_bucket_for(size);
if (bucket < 0) {
/* Oversize: fall through to non-pooled allocation. Caller
* still calls v3d_runner_release_buffer(), which sees a negative
* bucket from v3d_pool_bucket_for() and destroys the buffer. */
return v3d_runner_create_buffer(r, size, out);
}
bucket_size = (size_t)1 << (bucket + V3D_POOL_MIN_LOG2);
e = r->pool_free[bucket];
if (e) {
r->pool_free[bucket] = e->next;
*out = e->buf;
free(e);
return 0;
}
/* Miss — allocate fresh at the bucket size. Subsequent acquire/
* release for the same bucket reuses this buffer. */
rc = v3d_runner_create_buffer(r, bucket_size, out);
if (rc == 0)
r->pool_total_bytes += bucket_size;
return rc;
}
void v3d_runner_release_buffer(v3d_runner *r, v3d_buffer *buf)
{
int bucket;
struct v3d_pool_entry *e;
if (!r || !buf || buf->buffer == VK_NULL_HANDLE) return;
bucket = v3d_pool_bucket_for(buf->size);
if (bucket < 0) {
/* Oversize — destroy outright; never made it into the pool. */
v3d_runner_destroy_buffer(r, buf);
memset(buf, 0, sizeof(*buf));
return;
}
e = malloc(sizeof(*e));
if (!e) {
/* Allocator failure: just destroy. Pool degenerates to
* non-pooled behaviour but doesn't leak. */
v3d_runner_destroy_buffer(r, buf);
memset(buf, 0, sizeof(*buf));
return;
}
e->buf = *buf;
e->next = r->pool_free[bucket];
r->pool_free[bucket] = e;
memset(buf, 0, sizeof(*buf));
}
size_t v3d_runner_pool_total_bytes(v3d_runner *r)
{
return r ? r->pool_total_bytes : 0;
}
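The acquire/release flow can be exercised without a GPU by swapping `malloc` in for `vkAllocateMemory`. The sketch below is a mock under that assumption, not the runner's code: every `mock_*` name is hypothetical, and only the freelist logic (pop on hit, allocate at bucket size on miss, push on release, destroy when oversize or when the entry allocation fails) follows the diff.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define MIN_LOG2 8
#define MAX_LOG2 23
#define BUCKETS  (MAX_LOG2 - MIN_LOG2 + 1)

struct mock_buf   { void *mem; size_t size; };
struct mock_entry { struct mock_buf buf; struct mock_entry *next; };
struct mock_pool  { struct mock_entry *free_list[BUCKETS]; size_t total_bytes; };

static int bucket_for(size_t size)
{
    int log2 = 0;
    size_t m;
    if (size <= ((size_t)1 << MIN_LOG2)) return 0;
    for (m = size - 1; m; m >>= 1) log2++;
    return log2 > MAX_LOG2 ? -1 : log2 - MIN_LOG2;
}

static int mock_acquire(struct mock_pool *p, size_t size, struct mock_buf *out)
{
    struct mock_entry *e;
    int b;
    if (!p || !out || size == 0) return -1;
    b = bucket_for(size);
    if (b < 0) {                        /* oversize: exact size, never pooled */
        out->mem = malloc(size);
        out->size = size;
        return out->mem ? 0 : -1;
    }
    e = p->free_list[b];
    if (e) {                            /* hit: pop the freelist head */
        p->free_list[b] = e->next;
        *out = e->buf;
        free(e);
        return 0;
    }
    out->size = (size_t)1 << (b + MIN_LOG2);   /* miss: allocate a full bucket */
    out->mem = malloc(out->size);
    if (!out->mem) return -1;
    p->total_bytes += out->size;
    return 0;
}

static void mock_release(struct mock_pool *p, struct mock_buf *buf)
{
    struct mock_entry *e;
    int b;
    if (!p || !buf || !buf->mem) return;
    b = bucket_for(buf->size);
    e = b < 0 ? NULL : malloc(sizeof *e);
    if (!e) {                           /* oversize or no entry: destroy outright */
        free(buf->mem);
    } else {                            /* push onto the bucket's freelist */
        e->buf = *buf;
        e->next = p->free_list[b];
        p->free_list[b] = e;
    }
    buf->mem = NULL;
    buf->size = 0;
}

static void mock_drain(struct mock_pool *p)
{
    assert(p != NULL);
    for (int b = 0; b < BUCKETS; b++) {
        while (p->free_list[b]) {
            struct mock_entry *e = p->free_list[b];
            p->free_list[b] = e->next;
            free(e->buf.mem);
            free(e);
        }
    }
}
```

Acquiring 300 bytes, releasing, then acquiring 400 bytes returns the same 512-byte allocation the second time, and `total_bytes` stays at 512; that is the steady-state behaviour the pool is meant to provide on the dispatch hot path.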
VkDevice v3d_runner_device(v3d_runner *r) { return r->device; }
VkQueue v3d_runner_queue(v3d_runner *r) { return r->queue; }
uint32_t v3d_runner_queue_family(v3d_runner *r) { return r->queue_family; }
@@ -364,12 +486,27 @@ int v3d_runner_create_pipeline(v3d_runner *r, const char *spv_path,
.pSetLayouts = &out->ds_layout,
};
CHK(vkAllocateDescriptorSets(r->device, &dsai, &out->desc_set));
/* Persistent command buffer — pool was created with
* RESET_COMMAND_BUFFER_BIT (see v3d_runner_create) so dispatch
* sites can call vkResetCommandBuffer on this same cb instead
* of paying vkAllocateCommandBuffers per call. */
VkCommandBufferAllocateInfo cbai = {
.sType = VK_STRUCTURE_TYPE_COMMAND_BUFFER_ALLOCATE_INFO,
.commandPool = r->pool,
.level = VK_COMMAND_BUFFER_LEVEL_PRIMARY,
.commandBufferCount = 1,
};
CHK(vkAllocateCommandBuffers(r->device, &cbai, &out->cb));
return 0;
}
void v3d_runner_destroy_pipeline(v3d_runner *r, v3d_pipeline *p)
{
if (!p || p->pipeline == VK_NULL_HANDLE) return;
if (p->cb != VK_NULL_HANDLE)
vkFreeCommandBuffers(r->device, r->pool, 1, &p->cb);
vkDestroyPipeline(r->device, p->pipeline, NULL);
vkDestroyPipelineLayout(r->device, p->layout, NULL);
vkDestroyDescriptorPool(r->device, p->pool, NULL); /* frees its set */
@@ -377,6 +514,13 @@ void v3d_runner_destroy_pipeline(v3d_runner *r, v3d_pipeline *p)
memset(p, 0, sizeof(*p));
}
int v3d_runner_pipeline_cmdbuf_reset(v3d_runner *r, v3d_pipeline *p)
{
(void) r;
if (!p || p->cb == VK_NULL_HANDLE) return -1;
return vkResetCommandBuffer(p->cb, 0) == VK_SUCCESS ? 0 : -1;
}
int v3d_runner_bind_buffers(v3d_runner *r, v3d_pipeline *p,
const v3d_buffer *bufs, uint32_t n)
{
@@ -34,6 +34,12 @@ typedef struct {
VkDescriptorSet desc_set;
uint32_t n_ssbos;
uint32_t push_const_size;
/* Persistent command buffer. Allocated at create-pipeline time;
* dispatch sites use v3d_runner_pipeline_cmdbuf_reset() to
* vkResetCommandBuffer instead of paying vkAllocateCommandBuffers
* per dispatch. Pool flagged RESET_COMMAND_BUFFER_BIT so reset
* is permitted. */
VkCommandBuffer cb;
} v3d_pipeline;
/*
@@ -57,10 +63,43 @@ const char *v3d_runner_device_name(v3d_runner *r);
* host side. The mapping persists for the lifetime of the buffer.
*
* Returns 0 on success, non-zero on failure.
*
* NOTE: prefer v3d_runner_acquire_buffer() on the dispatch hot path —
* create_buffer/destroy_buffer go straight to vkAllocateMemory each
* call, which on V3D7's Mesa stack costs ~10-50us. The acquire/
* release pair pulls from a freelist and pays vkAllocateMemory only
* on a cache miss.
*/
int v3d_runner_create_buffer(v3d_runner *r, size_t size, v3d_buffer *out);
void v3d_runner_destroy_buffer(v3d_runner *r, v3d_buffer *buf);
/*
* Pooled buffer acquisition. Returns a v3d_buffer whose .size is the
* size class of the request: the smallest power of 2 >= the requested
* size, with a floor of 256 B (so callers can pool across similar-sized
* requests). Requests above 8 MiB bypass the pool and get an
* exact-size buffer from v3d_runner_create_buffer(). Backed by
* HOST_VISIBLE | HOST_COHERENT memory; mapped pointer is valid.
*
* On cache hit: reuses a previously-released buffer with no Vulkan
* allocation call.
* On miss: falls through to v3d_runner_create_buffer(). Release with
* v3d_runner_release_buffer(); pool drains in v3d_runner_destroy().
*
* Lifetime contract: the returned buffer's .mapped contents are
* UNINITIALISED — the previous user's data may still be present.
* Callers that need a clean buffer must memset themselves. This is
* deliberate; the dispatch hot paths immediately overwrite the
* buffer with new coefficients / meta anyway.
*
* Thread-safety: NOT thread-safe. A daedalus_ctx is single-threaded
* by API contract; the pool inherits that constraint.
*/
int v3d_runner_acquire_buffer(v3d_runner *r, size_t size, v3d_buffer *out);
void v3d_runner_release_buffer(v3d_runner *r, v3d_buffer *buf);
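/* Pool diagnostics: cumulative bytes allocated on pool misses, summed
* across all size classes (buffers currently in use and buffers on
* the freelist alike). The counter never decreases, and oversize
* non-pooled allocations are not counted. Useful for watermark
* logging. */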
size_t v3d_runner_pool_total_bytes(v3d_runner *r);
/* Compute pipeline from a SPIR-V file path. The descriptor-set
* layout exposes `n_ssbos` storage buffer bindings at binding
* indices 0..n_ssbos-1, all visible to the compute stage. A push
@@ -88,6 +127,12 @@ int v3d_runner_bind_buffers(v3d_runner *r,
/* Allocate a primary command buffer from the runner's pool. */
VkCommandBuffer v3d_runner_alloc_cmdbuf(v3d_runner *r);
/* Reset @p->cb so it can be re-recorded. Returns 0 on success.
* Replaces v3d_runner_alloc_cmdbuf() on the dispatch hot path —
* vkResetCommandBuffer is O(1) vs vkAllocateCommandBuffers' ~1-5us
* driver cost. */
int v3d_runner_pipeline_cmdbuf_reset(v3d_runner *r, v3d_pipeline *p);
/* Submit `cb` to the queue and wait for completion. The classic
* timed operation. Returns 0 on success.
*/
@@ -0,0 +1,120 @@
/*
* bench_pool_overhead — measure QPU dispatch overhead with and without
* the v3d_runner buffer pool warm.
*
* Times N consecutive daedalus_dispatch_vp9_idct8 calls and
* prints the per-call distribution. The first call pays
* vkAllocateMemory (typically tens of microseconds on V3D7's Mesa);
* the second and subsequent should hit the pool freelist and amortise
* to the pure dispatch-floor cost.
*
* Purpose: provide a concrete before/after number for the QPU-default
* substrate decree (2026-05-23). Bench is non-gating and runs in
* fractions of a second.
*
* License: BSD-2-Clause.
*/
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>
#include <time.h>
#include "../include/daedalus.h"
extern size_t v3d_runner_pool_total_bytes(void *); /* exposed if we wanted it */
static double now_seconds(void)
{
struct timespec ts;
clock_gettime(CLOCK_MONOTONIC_RAW, &ts);
return ts.tv_sec + ts.tv_nsec * 1e-9;
}
static int cmp_double(const void *a, const void *b)
{
double da = *(const double *)a, db = *(const double *)b;
return da < db ? -1 : da > db ? 1 : 0;
}
int main(int argc, char **argv)
{
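int n_calls = argc > 1 ? atoi(argv[1]) : 200;
/* Reject zero, negative, or unparsable counts before sizing the timing array. */
if (n_calls <= 0) { fprintf(stderr, "usage: %s [n_calls > 0]\n", argv[0]); return 1; }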
int n_blocks = 8; /* one MB column of 8x8 IDCT blocks */
int stride = 64;
daedalus_ctx *ctx = daedalus_ctx_create();
if (!ctx) { fprintf(stderr, "ctx create failed\n"); return 1; }
int has_qpu = daedalus_ctx_has_qpu(ctx);
printf("ctx: has_qpu=%d\n", has_qpu);
if (!has_qpu) {
fprintf(stderr, "QPU not available on this device; bench needs V3D\n");
daedalus_ctx_destroy(ctx);
return 2;
}
/* Build a representative IDCT 8x8 batch and warm a dst buffer. */
int16_t *coeffs = calloc((size_t) n_blocks * 64, sizeof(int16_t));
uint8_t *dst = calloc((size_t) n_blocks * 8 * stride, 1);
daedalus_idct8_meta *meta = calloc((size_t) n_blocks, sizeof(*meta));
if (!coeffs || !dst || !meta) { fprintf(stderr, "alloc fail\n"); return 1; }
uint64_t s = 0x1234567abcdefULL;
for (size_t i = 0; i < (size_t) n_blocks * 64; i++) {
s ^= s << 13; s ^= s >> 7; s ^= s << 17;
coeffs[i] = (int16_t)(s & 0x7ff) - 0x400;
}
for (int b = 0; b < n_blocks; b++) {
meta[b].dst_off = (uint32_t) b * 8;
meta[b].block_x = (uint32_t) b;
meta[b].block_y = 0;
}
double *t = malloc((size_t) n_calls * sizeof(double));
if (!t) { fprintf(stderr, "alloc fail\n"); return 1; }
int rc;
printf("=== dispatching %d times, n_blocks=%d/call ===\n",
n_calls, n_blocks);
for (int i = 0; i < n_calls; i++) {
double t0 = now_seconds();
rc = daedalus_dispatch_vp9_idct8(ctx, DAEDALUS_SUBSTRATE_QPU,
dst, (size_t) stride,
coeffs, (size_t) n_blocks, meta);
double t1 = now_seconds();
if (rc) { fprintf(stderr, "dispatch %d rc=%d\n", i, rc); return 1; }
t[i] = (t1 - t0) * 1e6; /* us */
}
/* Per-call distribution (first few + sorted summary on the steady-state) */
printf("\nfirst 5 calls (cold-warm transition):\n");
for (int i = 0; i < 5 && i < n_calls; i++)
printf(" call %d: %.2f us\n", i, t[i]);
int skip = 10; /* drop warm-up calls from the steady-state stats */
if (n_calls > skip + 10) {
int n = n_calls - skip;
double *s_arr = malloc((size_t) n * sizeof(double));
memcpy(s_arr, t + skip, (size_t) n * sizeof(double));
qsort(s_arr, (size_t) n, sizeof(double), cmp_double);
double sum = 0;
for (int i = 0; i < n; i++) sum += s_arr[i];
printf("\nsteady-state stats (calls %d..%d, n=%d):\n",
skip, n_calls - 1, n);
printf(" min: %.2f us\n", s_arr[0]);
printf(" p50: %.2f us\n", s_arr[n / 2]);
printf(" p90: %.2f us\n", s_arr[(int)(n * 0.9)]);
printf(" p99: %.2f us\n", s_arr[(int)(n * 0.99)]);
printf(" max: %.2f us\n", s_arr[n - 1]);
printf(" mean: %.2f us\n", sum / n);
printf("\nfirst-call / steady-state median ratio: %.1fx\n",
t[0] / s_arr[n / 2]);
free(s_arr);
}
free(t); free(coeffs); free(dst); free(meta);
daedalus_ctx_destroy(ctx);
return 0;
}