8 Commits

Author SHA1 Message Date
marfrit c01754e849 Merge pull request 'v3d_runner: buffer pool for QPU dispatch hot path' (#6) from noether/v3d-buffer-pool into main
Reviewed-on: #6
2026-05-23 18:59:18 +00:00
claude-noether 98553278dd v3d_runner: persistent per-pipeline command buffer
Phase 2 of the QPU-default substrate campaign — eliminate
vkAllocateCommandBuffers from the dispatch hot path.

Attaches a VkCommandBuffer to each v3d_pipeline, allocated once in
v3d_runner_create_pipeline() and freed in destroy_pipeline().  The
five dispatch_*_qpu sites switch from v3d_runner_alloc_cmdbuf() to
v3d_runner_pipeline_cmdbuf_reset() — vkResetCommandBuffer is O(1)
versus the driver-side allocation walk.  Pool was already created
with VK_COMMAND_POOL_CREATE_RESET_COMMAND_BUFFER_BIT so reset is
permitted.

Microbench (hertz, Pi 5, kernel 6.18.29, V3D 7.1):

  before (task 160 pool only):
    steady-state p50: 76.44 us
    steady-state mean: 77.95 us
  after (task 160 pool + task 161 persistent cb):
    steady-state p50: 54.56 us
    steady-state mean: 56.00 us
    -> 28% per-dispatch reduction

The remaining ~54 us steady-state is dominated by vkQueueWaitIdle +
shader execution + the two memcpy(in/out) on the dst buffer — task
162 (dmabuf import for dst) targets the memcpy half.

test_api_idct stays bit-exact across CPU/QPU/AUTO substrates.

Refs daedalus-fourier task #161.
2026-05-23 19:56:35 +02:00
claude-noether 0a042a8e95 v3d_runner: buffer pool for QPU dispatch hot path
Per the QPU-default substrate decree 2026-05-23: the per-dispatch
vkAllocateMemory in dispatch_*_qpu was the biggest single fixable
contributor to QPU dispatch overhead.  This pools v3d_buffer
allocations by power-of-2 size class so the second-and-subsequent
dispatch hits a freelist instead of paying ~10-50us of Mesa-V3D7
memory-allocation cost per call.

API additions (v3d_runner.h):
  - v3d_runner_acquire_buffer(): pulls from per-bucket freelist;
    falls through to v3d_runner_create_buffer() on miss.
  - v3d_runner_release_buffer(): pushes back onto the freelist; the
    backing VkBuffer/VkDeviceMemory only get vkFreeMemory'd in
    v3d_runner_destroy().
  - v3d_runner_pool_total_bytes(): diagnostic watermark.

Size classes 2^8..2^23 (256 B to 8 MiB).  Oversize requests fall
through to non-pooled (vkAllocateMemory) for both ends — pool stays
correct, just degenerates to old behaviour for those calls.

Migration: daedalus_core.c dispatch_*_qpu paths globally swap
create_buffer → acquire_buffer and destroy_buffer → release_buffer.
All five QPU dispatch functions (idct8 / lpf / mc_8h / cdef /
h264_deblock) now reuse buffers across calls.  test_api_idct stays
bit-exact (4096/4096 bytes on CPU/QPU/AUTO substrates on hertz).

Microbench (tests/bench_pool_overhead.c) on hertz (Pi 5,
6.18.29+rpt-rpi-2712, V3D 7.1):

  call 0:  434.89 us  (cold — 3x vkAllocateMemory)
  call 1:  100.06 us  (pool hit on all 3 buffers)
  steady-state:
    p50:    76.44 us
    p90:    90.52 us
    mean:   77.95 us
  first-call / steady-state ratio: 5.7x

The remaining ~76us steady-state is dominated by vkQueueWaitIdle +
shader execution + per-call descriptor-set update + command-buffer
allocation — addressed in follow-on tasks 161 (persistent cmdbuf)
and 162 (dmabuf import for dst, eliminates memcpy in/out).

Refs daedalus-fourier task #160.
2026-05-23 19:52:50 +02:00
marfrit 3ecfc8b0ef Merge pull request 'docs: architecture backlog — correct fleet hardware mapping' (#4) from noether/architecture-backlog-fleet-fix into main
Reviewed-on: #4
2026-05-23 15:12:29 +00:00
marfrit c154253432 Merge pull request 'CMakeLists: make daedalus-fourier.pc relocatable via ${pcfiledir}' (#5) from noether/pkgconfig-relocatable-prefix into main
Reviewed-on: #5
2026-05-23 15:11:20 +00:00
claude-noether b3de96b21c CMakeLists: make daedalus-fourier.pc relocatable via ${pcfiledir}
The pkg-config file was generated at *configure* time with
`prefix=${CMAKE_INSTALL_PREFIX}`, which captured whatever
CMAKE_INSTALL_PREFIX happened to be set to at `cmake -B build`
time — typically the default `/usr/local`.  `cmake --install
build --prefix /foo` then put the files under /foo but the .pc
still pointed pkg-config at /usr/local/include and /usr/local/lib,
which broke downstream consumers configuring against the install
tree.

Concrete bite encountered today on hertz: the daedalus-v4l2 daemon
CMake configure on a /tmp/df-prefix install tree resolved
DAEDALUS_FOURIER_INCLUDE_DIRS to /usr/local/include (empty path on
the test host), so main.c failed to find <daedalus.h>.

Fix: write the .pc with `prefix=${pcfiledir}/<rel>` where <rel> is
the configure-time-computed relative path from
<prefix>/<libdir>/pkgconfig back to <prefix>.  pkg-config
substitutes ${pcfiledir} with the actual on-disk location of the
.pc at lookup time, so the resolved prefix tracks wherever the
install tree is moved to — including DESTDIR-staged builds, apt
package installs, and ad-hoc `cmake --install --prefix /tmp/foo`
test installs.

The relative-path computation handles GNUInstallDirs layouts that
add multiarch tuples (Debian's lib/aarch64-linux-gnu) without
hard-coding `../..`.  Tested on hertz (Debian trixie, libdir=lib):

  prefix=${pcfiledir}/../../
  ...
  $ pkg-config --variable=prefix daedalus-fourier
  /tmp/df-prefix-test/lib/pkgconfig/../../

  # mv preserves relocation:
  $ mv /tmp/df-prefix-test /tmp/df-prefix-moved
  $ pkg-config --variable=prefix daedalus-fourier
  /tmp/df-prefix-moved/lib/pkgconfig/../../

This unblocks the daedalus-v4l2 daemon out-of-tree builds against
local daedalus-fourier installs and is a prerequisite for tidy
test-rig deployments (per the hertz reload session 2026-05-23).
2026-05-23 16:55:31 +02:00
claude-noether 68dccd2911 docs: architecture backlog — correct fleet hardware mapping
Original draft (PR #3) speculated wrongly on host-to-SoC mapping:

  - hertz and tesla were listed under RK3588.  Verified via
    /proc/device-tree/compatible: both are raspberrypi,5-model-b /
    brcm,bcm2712 (tesla is an LXD container hosted on hertz, so
    necessarily shares the host SoC).
  - boltzmann (the only actual RK3588 in the fleet, 32 GB, kernel-
    dev / MCP hub, 8 W always-on) was omitted entirely.
  - noether (Pi 4 / BCM2711, the user's interactive workstation,
    where Firefox and mpv actually run) was omitted entirely.

Corrects the per-SoC coverage table:

    BCM2712 Pi 5  — higgs, hertz, broglie, tesla (LXD on hertz)
    BCM2711 Pi 4  — noether (workstation), dcw3, dcw2
    RK3588        — boltzmann
    Allwinner H6  — (not in fleet)

Reasoning consequences:

  - Pi 5 row is now four hosts but one SoC.  Adding a fifth Pi 5
    doesn't pressure-test the architecture; substrate decisions
    are identical across the row.
  - The realistic forcing function for the Pi 4 path is "HW decode
    on noether matters and rpivid is still unstable upstream" —
    noether is a daily-driver Pi 4 workstation, so this is closer
    than the original draft implied.
  - The realistic forcing function for an RK3588 caps file is
    "AV1 playback on boltzmann matters" — rkvdec doesn't cover
    AV1, so Mali Valhall compute substrate becomes the only HW
    acceleration option there.

"Re-read this when" list at the top + "Why deferred" section
+ decision log all updated.  No change to the architecture sketch
(caps directory, plugin layout, two-backend conclusion) — those
were correct in the original; only the host-to-SoC mapping
underneath them was wrong.

Refs PR #3 (the merged original).
2026-05-23 15:47:55 +02:00
marfrit 7d6f106919 Merge pull request 'docs: architecture backlog for multi-SoC daedalus generalization' (#3) from noether/architecture-backlog into main
Reviewed-on: #3
2026-05-23 03:31:56 +00:00
6 changed files with 408 additions and 66 deletions
+29 -1
View File
@@ -419,9 +419,33 @@ endif()
# pkg-config file. Vulkan goes in Requires.private (consumer's # pkg-config file. Vulkan goes in Requires.private (consumer's
# pkg-config call gets it via --static). pthread + dl are needed # pkg-config call gets it via --static). pthread + dl are needed
# by the static archive's runtime helpers. # by the static archive's runtime helpers.
#
# `prefix` is derived from ${pcfiledir} so the .pc is relocatable:
# pkg-config substitutes ${pcfiledir} with the directory holding the
# .pc at lookup time, and the relative path from
# <prefix>/<libdir>/pkgconfig back to <prefix> tells pkg-config the
# install prefix without baking it in. This is why
# `cmake --install build --prefix /foo` produces a .pc that correctly
# resolves `prefix=/foo` instead of baking whatever CMAKE_INSTALL_PREFIX
# was at *configure* time (default /usr/local). DESTDIR-staged
# installs work too: at runtime pkg-config sees the .pc at its real
# install path and computes the right prefix.
#
# Relative-path depth is computed from CMAKE_INSTALL_LIBDIR (and
# whatever multiarch tuple GNUInstallDirs adds) so Debian-style
# `lib/aarch64-linux-gnu/pkgconfig/...` resolves with the right number
# of `..` components. Layouts where libdir is *not* under prefix are
# not supported by this scheme; if a packager overrides libdir to an
# absolute path the relative-path machinery falls back to the absolute
# value (CMake's file(RELATIVE_PATH) prepends `..` until they meet),
# which is also relocatable but no longer prefix-agnostic.
file(RELATIVE_PATH PKGCONFIG_PCDIR_TO_PREFIX
"${CMAKE_INSTALL_PREFIX}/${CMAKE_INSTALL_LIBDIR}/pkgconfig"
"${CMAKE_INSTALL_PREFIX}")
set(PKGCONFIG_OUT ${CMAKE_CURRENT_BINARY_DIR}/daedalus-fourier.pc) set(PKGCONFIG_OUT ${CMAKE_CURRENT_BINARY_DIR}/daedalus-fourier.pc)
file(WRITE ${PKGCONFIG_OUT} file(WRITE ${PKGCONFIG_OUT}
"prefix=${CMAKE_INSTALL_PREFIX} "prefix=\${pcfiledir}/${PKGCONFIG_PCDIR_TO_PREFIX}
exec_prefix=\${prefix} exec_prefix=\${prefix}
libdir=\${prefix}/${CMAKE_INSTALL_LIBDIR} libdir=\${prefix}/${CMAKE_INSTALL_LIBDIR}
includedir=\${prefix}/${CMAKE_INSTALL_INCLUDEDIR} includedir=\${prefix}/${CMAKE_INSTALL_INCLUDEDIR}
@@ -468,6 +492,10 @@ add_executable(test_api_opportunistic_qpu tests/test_api_opportunistic_qpu.c)
target_link_libraries(test_api_opportunistic_qpu PRIVATE daedalus_core) target_link_libraries(test_api_opportunistic_qpu PRIVATE daedalus_core)
target_compile_options(test_api_opportunistic_qpu PRIVATE -O2) target_compile_options(test_api_opportunistic_qpu PRIVATE -O2)
add_executable(bench_pool_overhead tests/bench_pool_overhead.c)
target_link_libraries(bench_pool_overhead PRIVATE daedalus_core)
target_compile_options(bench_pool_overhead PRIVATE -O2)
if (DAEDALUS_BUILD_VULKAN) if (DAEDALUS_BUILD_VULKAN)
# (re-open the conditional so the closing endif() below balances) # (re-open the conditional so the closing endif() below balances)
+17 -12
View File
@@ -4,9 +4,9 @@
This document is forward-looking. It describes the generalized multi-SoC daedalus daemon architecture, but the immediate work block stays "finish Pi 5". Re-read this when: This document is forward-looking. It describes the generalized multi-SoC daedalus daemon architecture, but the immediate work block stays "finish Pi 5". Re-read this when:
- A second aarch64 host without a working kernel-side V4L2 stateless decoder shows up in the fleet (most likely candidate: Pi 4, which has V3D 4.x and no rpivid stable upstream). - HW decode on noether (Pi 4, the user's interactive workstation) becomes a real ask and rpivid upstream is still unstable. This is the most likely trigger — same SoC class as Pi 5 but weaker V3D 4.x, so the caps-file mechanism plus an extra row's worth of substrate measurements.
- A specific working-copy slowdown that the current Pi-5-only daedalus can't address motivates the generalization. - AV1 playback on boltzmann (RK3588) starts mattering. rkvdec doesn't cover AV1, so the daedalus path becomes the only HW-accelerated option, and Mali Valhall compute substrate decisions need their own caps row.
- libva-v4l2-request-fourier evolves to need multi-node negotiation (currently it picks the first matching V4L2 node). - libva-v4l2-request-fourier evolves to need multi-node negotiation (today it picks the first matching V4L2 node; a host with both rkvdec and daedalus-v4l2 nodes wants a preference policy).
Until then: this is decision context, not a TODO. Until then: this is decision context, not a TODO.
@@ -51,13 +51,17 @@ The mfritsche fleet has heterogeneous aarch64 hardware decoders:
| SoC | Host(s) | H.264 | HEVC | VP9 | AV1 | GPU compute | | SoC | Host(s) | H.264 | HEVC | VP9 | AV1 | GPU compute |
|---|---|---|---|---|---|---| |---|---|---|---|---|---|---|
| BCM2712 (Pi 5) | higgs, broglie | none | V3D7 (rpi-hevc-dec — SPS quirks) | none | none | V3D7 (Vulkan compute, queryable) | | BCM2712 (Pi 5) | higgs, hertz, broglie, tesla (LXD on hertz) | none | V3D7 (rpi-hevc-dec — SPS quirks) | none | none | V3D7 (Vulkan compute, queryable) |
| BCM2711 (Pi 4) | dcw3 | rpivid (out of tree, unstable) | rpivid (out of tree, unstable) | none | none | V3D4 (Vulkan compute, weaker) | | BCM2711 (Pi 4) | noether (interactive workstation), dcw3, dcw2 | rpivid (out of tree, unstable) | rpivid (out of tree, unstable) | none | none | V3D4 (Vulkan compute, weaker) |
| RK3588 | hertz, tesla | rkvdec V4L2 stateless (upstream) | rkvdec V4L2 stateless | rkvdec V4L2 stateless | none (rkvdec lacks AV1) | Mali Valhall (panvk) + RK NPU | | RK3588 | boltzmann (32 GB, kernel-dev / MCP hub, 8 W always-on) | rkvdec V4L2 stateless (upstream) | rkvdec V4L2 stateless | rkvdec V4L2 stateless | none (rkvdec lacks AV1) | Mali Valhall (panvk-bifrost-video in dev) + RK NPU |
| Allwinner H6 | (not in current fleet, but Cedrus exists) | Cedrus V4L2 | Cedrus V4L2 | none | none | Mali Bifrost | | Allwinner H6 | (not in current fleet, but Cedrus exists upstream) | Cedrus V4L2 | Cedrus V4L2 | none | none | Mali Bifrost |
No single SoC has a complete codec set. RK3588 lacks AV1; Pi 5 lacks H.264 + VP9 + AV1; Pi 4 has rpivid (out-of-tree, kernel-version-fragile); Allwinner Cedrus is H.264/HEVC only. No single SoC has a complete codec set. RK3588 lacks AV1; Pi 5 lacks H.264 + VP9 + AV1; Pi 4 has rpivid (out-of-tree, kernel-version-fragile); Allwinner Cedrus is H.264/HEVC only.
A note on the Pi 5 row: hertz and tesla share hardware (tesla is an LXD container hosted on hertz) but are operationally distinct — tesla is the distcc/MCP worker, hertz is the LXD host with all the cron automations and the 17-tool lmcp hub. From a daedalus deployment perspective they count as **one** Pi 5 substrate; from a workflow perspective they're separate boxes.
A note on noether: it's the user's interactive workstation (Pi 4, BCM2711). Firefox + mpv run here. Any "I want HW decode on my main box" pressure lands first on this host, which puts Pi 4 (V3D4 + maybe-rpivid) closer to the front of the queue than the original draft of this document suggested.
The current daedalus model — "kernel substitution + libavcodec front end" — is the right answer for **Pi 5 specifically**, where no usable kernel V4L2 stateless decoder exists for the codecs we care about, and a Vulkan-capable GPU (V3D7) is available to help on a few kernels. The current daedalus model — "kernel substitution + libavcodec front end" — is the right answer for **Pi 5 specifically**, where no usable kernel V4L2 stateless decoder exists for the codecs we care about, and a Vulkan-capable GPU (V3D7) is available to help on a few kernels.
The model is **not** the right answer for SoCs that already have working V4L2 stateless decoders for the requested codec — those should be passed through, not re-implemented through libavcodec + kernel substitution. The model is **not** the right answer for SoCs that already have working V4L2 stateless decoders for the requested codec — those should be passed through, not re-implemented through libavcodec + kernel substitution.
@@ -207,15 +211,15 @@ Pass-through plugins are *thin* — they translate the daedalus daemon's wire pr
**Today's calculus:** **Today's calculus:**
- Pi 5 daedalus path is the only thing in the fleet that uses daedalus daemon. Generalizing for a single user is overdesign. - Pi 5 (higgs + hertz + broglie + tesla) is **four hosts**, but **one SoC**. Adding the fifth Pi 5 host wouldn't pressure-test the architecture; they all share BCM2712 caps so the substrate decisions are identical across the row.
- RK3588 uses rkvdec directly through libva-v4l2-request-fourier; daedalus daemon is **not in the path** for any RK3588 codec. The "RK3588 support" the architecture above proposes is mostly a no-op routing decision plus an AV1 fallback that doesn't yet measure on Mali. - boltzmann (RK3588) is the only non-Pi-5 always-on host in the fleet, and it uses rkvdec directly through libva-v4l2-request-fourier daedalus daemon is **not in the path** for any RK3588 codec on it. The "RK3588 support" the architecture above proposes is mostly a no-op routing decision plus an AV1 fallback that doesn't yet measure on Mali. No forcing pressure from boltzmann today.
- Pi 4 with rpivid is the only realistic second motivator. rpivid upstream stability is the gate — if it lands cleanly, Pi 4 takes the pass-through path with no kernel substitution needed. If it stays out-of-tree-fragile, **then** the substrate-composed path with V3D4 + NEON becomes the right backend for Pi 4, and we need the per-SoC caps mechanism to handle V3D4's weaker compute. - noether (Pi 4, this user's interactive workstation) and dcw3/dcw2 (also Pi 4) are the real second-SoC candidates. The gate is rpivid upstream stability: if it lands cleanly, Pi 4 takes the pass-through path with zero kernel substitution work. If it stays out-of-tree-fragile, **then** the substrate-composed path with V3D4 + NEON becomes the right backend for Pi 4, and we need the per-SoC caps mechanism to handle V3D4's weaker compute.
- The recipe layer in daedalus-fourier already scales cleanly. Adding more substrates is incremental, not architectural. - The recipe layer in daedalus-fourier already scales cleanly. Adding more substrates is incremental, not architectural.
**The forcing function that flips this from "deferred" to "do it":** **The forcing function that flips this from "deferred" to "do it":**
- Pi 4 enters daily use and rpivid is still not stable upstream — implies we need a Pi 4 substrate-composed path, which means at minimum a second caps file and the loader for it. At that point, building the full pluggable scaffold becomes proportionate. - **noether-as-Firefox-host** — the user starts wanting HW decode on their main workstation and rpivid is still not stable upstream. Implies a Pi 4 substrate-composed path, which means at minimum a second caps file and the loader for it. At that point, building the full pluggable scaffold becomes proportionate. This is the most likely trigger; noether is already a daily-driver Pi 4.
- **Or:** an x86 host enters the fleet running mesa-panvk on a Pi-CM5-like board, and we need the daedalus daemon to discover it dynamically rather than being baked at build time. - **boltzmann-as-AV1-decoder** — RK3588 has no AV1 HW decoder, and the user wants AV1 playback there (currently CPU-only). Triggers a cycle-5equivalent measurement campaign on Mali Valhall to see whether `daedalus_recipe_dispatch_cdef_8x8` (or follow-on AV1 kernels) is worth running on Mali compute. If yes, we need an RK3588 caps file that overrides only the AV1 row while leaving H.264/HEVC/VP9 on rkvdec pass-through.
- **Or:** a third-party Pi 5 user needs to swap shaders for V3D firmware experiments without rebuilding the daemon — at that point dynamic shader loading + caps overrides become a feature ask. - **Or:** a third-party Pi 5 user needs to swap shaders for V3D firmware experiments without rebuilding the daemon — at that point dynamic shader loading + caps overrides become a feature ask.
Until one of those happens: keep daedalus daemon Pi 5 specific. Push cross-SoC abstraction *up* to libva-v4l2-request-fourier (which already does most of it) rather than *down* into the daemon. Until one of those happens: keep daedalus daemon Pi 5 specific. Push cross-SoC abstraction *up* to libva-v4l2-request-fourier (which already does most of it) rather than *down* into the daemon.
@@ -242,6 +246,7 @@ Until one of those happens: keep daedalus daemon Pi 5 specific. Push cross-SoC a
|---|---|---| |---|---|---|
| 2026-05-23 | **Defer generalization.** Finish Pi 5 substitution arc (cycle 9 PR #90 pending), then pivot to bug-fix backlog (daemon SEGV #145, D-state #146) before architecture work. | Architecture pivot is a multi-week scope; Pi 5 path is the only user-visible motivator today; deferring loses nothing because the recipe layer already abstracts kernels and libva-v4l2-request-fourier already abstracts V4L2 nodes. | | 2026-05-23 | **Defer generalization.** Finish Pi 5 substitution arc (cycle 9 PR #90 pending), then pivot to bug-fix backlog (daemon SEGV #145, D-state #146) before architecture work. | Architecture pivot is a multi-week scope; Pi 5 path is the only user-visible motivator today; deferring loses nothing because the recipe layer already abstracts kernels and libva-v4l2-request-fourier already abstracts V4L2 nodes. |
| 2026-05-23 | **Document the design now, even though it's deferred.** | Captures the conceptual gap (shaders ≠ hardware decoders) and the two-backend conclusion while the analysis is fresh; saves re-litigating in 36 months. | | 2026-05-23 | **Document the design now, even though it's deferred.** | Captures the conceptual gap (shaders ≠ hardware decoders) and the two-backend conclusion while the analysis is fresh; saves re-litigating in 36 months. |
| 2026-05-23 | **Correct fleet hardware mapping.** Original draft had hertz/tesla under RK3588 and omitted boltzmann + noether entirely. Verified via `/proc/device-tree/compatible`: hertz + tesla are Pi 5 (BCM2712), noether is Pi 4 (BCM2711), boltzmann is the only RK3588 in the fleet. Adjusted "Why deferred" / forcing-function reasoning accordingly — Pi 5 row is now 4 hosts (one SoC), noether is the realistic Pi 4 trigger, boltzmann is the realistic RK3588 trigger via AV1. | Original draft was speculative on host-to-SoC mapping; verified state changes which forcing functions are credible. |
--- ---
+53 -53
View File
@@ -291,13 +291,13 @@ static int dispatch_idct8_qpu(daedalus_ctx *ctx,
} }
v3d_buffer buf_coeffs = {0}, buf_dst = {0}, buf_meta = {0}; v3d_buffer buf_coeffs = {0}, buf_dst = {0}, buf_meta = {0};
if (v3d_runner_create_buffer(ctx->runner, coeff_bytes, &buf_coeffs)) return -1; if (v3d_runner_acquire_buffer(ctx->runner, coeff_bytes, &buf_coeffs)) return -1;
if (v3d_runner_create_buffer(ctx->runner, max_byte_touched, &buf_dst)) { if (v3d_runner_acquire_buffer(ctx->runner, max_byte_touched, &buf_dst)) {
v3d_runner_destroy_buffer(ctx->runner, &buf_coeffs); return -1; v3d_runner_release_buffer(ctx->runner, &buf_coeffs); return -1;
} }
if (v3d_runner_create_buffer(ctx->runner, meta_bytes, &buf_meta)) { if (v3d_runner_acquire_buffer(ctx->runner, meta_bytes, &buf_meta)) {
v3d_runner_destroy_buffer(ctx->runner, &buf_dst); v3d_runner_release_buffer(ctx->runner, &buf_dst);
v3d_runner_destroy_buffer(ctx->runner, &buf_coeffs); return -1; v3d_runner_release_buffer(ctx->runner, &buf_coeffs); return -1;
} }
/* Upload. Coeffs and meta are straight copies. Dst we copy the /* Upload. Coeffs and meta are straight copies. Dst we copy the
@@ -325,8 +325,8 @@ static int dispatch_idct8_qpu(daedalus_ctx *ctx,
._pad = 0, ._pad = 0,
}; };
VkCommandBuffer cb = v3d_runner_alloc_cmdbuf(ctx->runner); if (v3d_runner_pipeline_cmdbuf_reset(ctx->runner, &ctx->idct8_pipe)) goto fail;
if (cb == VK_NULL_HANDLE) goto fail; VkCommandBuffer cb = ctx->idct8_pipe.cb;
VkCommandBufferBeginInfo cbbi = { .sType = VK_STRUCTURE_TYPE_COMMAND_BUFFER_BEGIN_INFO }; VkCommandBufferBeginInfo cbbi = { .sType = VK_STRUCTURE_TYPE_COMMAND_BUFFER_BEGIN_INFO };
vkBeginCommandBuffer(cb, &cbbi); vkBeginCommandBuffer(cb, &cbbi);
vkCmdBindPipeline(cb, VK_PIPELINE_BIND_POINT_COMPUTE, vkCmdBindPipeline(cb, VK_PIPELINE_BIND_POINT_COMPUTE,
@@ -344,15 +344,15 @@ static int dispatch_idct8_qpu(daedalus_ctx *ctx,
/* Read-back dst. */ /* Read-back dst. */
memcpy(dst, buf_dst.mapped, max_byte_touched); memcpy(dst, buf_dst.mapped, max_byte_touched);
v3d_runner_destroy_buffer(ctx->runner, &buf_meta); v3d_runner_release_buffer(ctx->runner, &buf_meta);
v3d_runner_destroy_buffer(ctx->runner, &buf_dst); v3d_runner_release_buffer(ctx->runner, &buf_dst);
v3d_runner_destroy_buffer(ctx->runner, &buf_coeffs); v3d_runner_release_buffer(ctx->runner, &buf_coeffs);
return 0; return 0;
fail: fail:
v3d_runner_destroy_buffer(ctx->runner, &buf_meta); v3d_runner_release_buffer(ctx->runner, &buf_meta);
v3d_runner_destroy_buffer(ctx->runner, &buf_dst); v3d_runner_release_buffer(ctx->runner, &buf_dst);
v3d_runner_destroy_buffer(ctx->runner, &buf_coeffs); v3d_runner_release_buffer(ctx->runner, &buf_coeffs);
return -1; return -1;
} }
@@ -424,9 +424,9 @@ static int dispatch_lpf_qpu(daedalus_ctx *ctx, int wd_8,
size_t dst_window_size = hi - lo; size_t dst_window_size = hi - lo;
v3d_buffer buf_meta = {0}, buf_dst = {0}; v3d_buffer buf_meta = {0}, buf_dst = {0};
if (v3d_runner_create_buffer(ctx->runner, meta_bytes, &buf_meta)) return -1; if (v3d_runner_acquire_buffer(ctx->runner, meta_bytes, &buf_meta)) return -1;
if (v3d_runner_create_buffer(ctx->runner, dst_window_size, &buf_dst)) { if (v3d_runner_acquire_buffer(ctx->runner, dst_window_size, &buf_dst)) {
v3d_runner_destroy_buffer(ctx->runner, &buf_meta); return -1; v3d_runner_release_buffer(ctx->runner, &buf_meta); return -1;
} }
memcpy(buf_dst.mapped, dst + lo, dst_window_size); memcpy(buf_dst.mapped, dst + lo, dst_window_size);
@@ -442,8 +442,8 @@ static int dispatch_lpf_qpu(daedalus_ctx *ctx, int wd_8,
if (v3d_runner_bind_buffers(ctx->runner, p, binds, 2)) goto fail; if (v3d_runner_bind_buffers(ctx->runner, p, binds, 2)) goto fail;
uint32_t wg_count = (uint32_t)((n_edges + 31) / 32); uint32_t wg_count = (uint32_t)((n_edges + 31) / 32);
VkCommandBuffer cb = v3d_runner_alloc_cmdbuf(ctx->runner); if (v3d_runner_pipeline_cmdbuf_reset(ctx->runner, p)) goto fail;
if (cb == VK_NULL_HANDLE) goto fail; VkCommandBuffer cb = p->cb;
VkCommandBufferBeginInfo cbbi = { .sType = VK_STRUCTURE_TYPE_COMMAND_BUFFER_BEGIN_INFO }; VkCommandBufferBeginInfo cbbi = { .sType = VK_STRUCTURE_TYPE_COMMAND_BUFFER_BEGIN_INFO };
vkBeginCommandBuffer(cb, &cbbi); vkBeginCommandBuffer(cb, &cbbi);
vkCmdBindPipeline(cb, VK_PIPELINE_BIND_POINT_COMPUTE, p->pipeline); vkCmdBindPipeline(cb, VK_PIPELINE_BIND_POINT_COMPUTE, p->pipeline);
@@ -468,12 +468,12 @@ static int dispatch_lpf_qpu(daedalus_ctx *ctx, int wd_8,
memcpy(dst + lo, buf_dst.mapped, dst_window_size); memcpy(dst + lo, buf_dst.mapped, dst_window_size);
v3d_runner_destroy_buffer(ctx->runner, &buf_dst); v3d_runner_release_buffer(ctx->runner, &buf_dst);
v3d_runner_destroy_buffer(ctx->runner, &buf_meta); v3d_runner_release_buffer(ctx->runner, &buf_meta);
return 0; return 0;
fail: fail:
v3d_runner_destroy_buffer(ctx->runner, &buf_dst); v3d_runner_release_buffer(ctx->runner, &buf_dst);
v3d_runner_destroy_buffer(ctx->runner, &buf_meta); v3d_runner_release_buffer(ctx->runner, &buf_meta);
return -1; return -1;
} }
@@ -509,9 +509,9 @@ static int dispatch_mc_8h_qpu(daedalus_ctx *ctx,
} }
v3d_buffer bm = {0}, bd = {0}, bs = {0}; v3d_buffer bm = {0}, bd = {0}, bs = {0};
if (v3d_runner_create_buffer(ctx->runner, meta_bytes, &bm)) return -1; if (v3d_runner_acquire_buffer(ctx->runner, meta_bytes, &bm)) return -1;
if (v3d_runner_create_buffer(ctx->runner, dst_max, &bd)) { v3d_runner_destroy_buffer(ctx->runner, &bm); return -1; } if (v3d_runner_acquire_buffer(ctx->runner, dst_max, &bd)) { v3d_runner_release_buffer(ctx->runner, &bm); return -1; }
if (v3d_runner_create_buffer(ctx->runner, src_max, &bs)) { v3d_runner_destroy_buffer(ctx->runner, &bd); v3d_runner_destroy_buffer(ctx->runner, &bm); return -1; } if (v3d_runner_acquire_buffer(ctx->runner, src_max, &bs)) { v3d_runner_release_buffer(ctx->runner, &bd); v3d_runner_release_buffer(ctx->runner, &bm); return -1; }
memcpy(bs.mapped, src, src_max); memcpy(bs.mapped, src, src_max);
memcpy(bd.mapped, dst, dst_max); memcpy(bd.mapped, dst, dst_max);
@@ -530,8 +530,8 @@ static int dispatch_mc_8h_qpu(daedalus_ctx *ctx,
mc_pc pc = { .n_blocks = (uint32_t) n_blocks, mc_pc pc = { .n_blocks = (uint32_t) n_blocks,
.dst_stride_u8 = (uint32_t) dst_stride, .dst_stride_u8 = (uint32_t) dst_stride,
.src_stride_u8 = (uint32_t) src_stride }; .src_stride_u8 = (uint32_t) src_stride };
VkCommandBuffer cb = v3d_runner_alloc_cmdbuf(ctx->runner); if (v3d_runner_pipeline_cmdbuf_reset(ctx->runner, &ctx->mc8h_pipe)) goto fail;
if (cb == VK_NULL_HANDLE) goto fail; VkCommandBuffer cb = ctx->mc8h_pipe.cb;
VkCommandBufferBeginInfo cbbi = { .sType = VK_STRUCTURE_TYPE_COMMAND_BUFFER_BEGIN_INFO }; VkCommandBufferBeginInfo cbbi = { .sType = VK_STRUCTURE_TYPE_COMMAND_BUFFER_BEGIN_INFO };
vkBeginCommandBuffer(cb, &cbbi); vkBeginCommandBuffer(cb, &cbbi);
vkCmdBindPipeline(cb, VK_PIPELINE_BIND_POINT_COMPUTE, ctx->mc8h_pipe.pipeline); vkCmdBindPipeline(cb, VK_PIPELINE_BIND_POINT_COMPUTE, ctx->mc8h_pipe.pipeline);
@@ -545,14 +545,14 @@ static int dispatch_mc_8h_qpu(daedalus_ctx *ctx,
memcpy(dst, bd.mapped, dst_max); memcpy(dst, bd.mapped, dst_max);
v3d_runner_destroy_buffer(ctx->runner, &bs); v3d_runner_release_buffer(ctx->runner, &bs);
v3d_runner_destroy_buffer(ctx->runner, &bd); v3d_runner_release_buffer(ctx->runner, &bd);
v3d_runner_destroy_buffer(ctx->runner, &bm); v3d_runner_release_buffer(ctx->runner, &bm);
return 0; return 0;
fail: fail:
v3d_runner_destroy_buffer(ctx->runner, &bs); v3d_runner_release_buffer(ctx->runner, &bs);
v3d_runner_destroy_buffer(ctx->runner, &bd); v3d_runner_release_buffer(ctx->runner, &bd);
v3d_runner_destroy_buffer(ctx->runner, &bm); v3d_runner_release_buffer(ctx->runner, &bm);
return -1; return -1;
} }
@@ -588,9 +588,9 @@ static int dispatch_cdef_qpu(daedalus_ctx *ctx,
size_t tmp_bytes = tmp_max_u16 * sizeof(uint16_t); size_t tmp_bytes = tmp_max_u16 * sizeof(uint16_t);
v3d_buffer bm = {0}, bd = {0}, bt = {0}; v3d_buffer bm = {0}, bd = {0}, bt = {0};
if (v3d_runner_create_buffer(ctx->runner, meta_bytes, &bm)) return -1; if (v3d_runner_acquire_buffer(ctx->runner, meta_bytes, &bm)) return -1;
if (v3d_runner_create_buffer(ctx->runner, dst_max, &bd)) { v3d_runner_destroy_buffer(ctx->runner, &bm); return -1; } if (v3d_runner_acquire_buffer(ctx->runner, dst_max, &bd)) { v3d_runner_release_buffer(ctx->runner, &bm); return -1; }
if (v3d_runner_create_buffer(ctx->runner, tmp_bytes, &bt)) { v3d_runner_destroy_buffer(ctx->runner, &bd); v3d_runner_destroy_buffer(ctx->runner, &bm); return -1; } if (v3d_runner_acquire_buffer(ctx->runner, tmp_bytes, &bt)) { v3d_runner_release_buffer(ctx->runner, &bd); v3d_runner_release_buffer(ctx->runner, &bm); return -1; }
/* tmp may need padding before block-origin offset (caller-allocated). Just /* tmp may need padding before block-origin offset (caller-allocated). Just
* copy from caller; we assume meta[i].tmp_off_u16 is consistent with how * copy from caller; we assume meta[i].tmp_off_u16 is consistent with how
@@ -615,8 +615,8 @@ static int dispatch_cdef_qpu(daedalus_ctx *ctx,
cdef_pc pc = { .n_blocks = (uint32_t) n_blocks, cdef_pc pc = { .n_blocks = (uint32_t) n_blocks,
.tmp_stride_u16 = 16, .tmp_stride_u16 = 16,
.dst_stride_u8 = (uint32_t) dst_stride }; .dst_stride_u8 = (uint32_t) dst_stride };
VkCommandBuffer cb = v3d_runner_alloc_cmdbuf(ctx->runner); if (v3d_runner_pipeline_cmdbuf_reset(ctx->runner, &ctx->cdef_pipe)) goto fail;
if (cb == VK_NULL_HANDLE) goto fail; VkCommandBuffer cb = ctx->cdef_pipe.cb;
VkCommandBufferBeginInfo cbbi = { .sType = VK_STRUCTURE_TYPE_COMMAND_BUFFER_BEGIN_INFO }; VkCommandBufferBeginInfo cbbi = { .sType = VK_STRUCTURE_TYPE_COMMAND_BUFFER_BEGIN_INFO };
vkBeginCommandBuffer(cb, &cbbi); vkBeginCommandBuffer(cb, &cbbi);
vkCmdBindPipeline(cb, VK_PIPELINE_BIND_POINT_COMPUTE, ctx->cdef_pipe.pipeline); vkCmdBindPipeline(cb, VK_PIPELINE_BIND_POINT_COMPUTE, ctx->cdef_pipe.pipeline);
@@ -630,14 +630,14 @@ static int dispatch_cdef_qpu(daedalus_ctx *ctx,
memcpy(dst, bd.mapped, dst_max); memcpy(dst, bd.mapped, dst_max);
v3d_runner_destroy_buffer(ctx->runner, &bt); v3d_runner_release_buffer(ctx->runner, &bt);
v3d_runner_destroy_buffer(ctx->runner, &bd); v3d_runner_release_buffer(ctx->runner, &bd);
v3d_runner_destroy_buffer(ctx->runner, &bm); v3d_runner_release_buffer(ctx->runner, &bm);
return 0; return 0;
fail: fail:
v3d_runner_destroy_buffer(ctx->runner, &bt); v3d_runner_release_buffer(ctx->runner, &bt);
v3d_runner_destroy_buffer(ctx->runner, &bd); v3d_runner_release_buffer(ctx->runner, &bd);
v3d_runner_destroy_buffer(ctx->runner, &bm); v3d_runner_release_buffer(ctx->runner, &bm);
return -1; return -1;
} }
@@ -670,8 +670,8 @@ static int dispatch_h264_deblock_qpu(daedalus_ctx *ctx,
} }
v3d_buffer bm = {0}, bd = {0}; v3d_buffer bm = {0}, bd = {0};
if (v3d_runner_create_buffer(ctx->runner, meta_bytes, &bm)) return -1; if (v3d_runner_acquire_buffer(ctx->runner, meta_bytes, &bm)) return -1;
if (v3d_runner_create_buffer(ctx->runner, dst_max, &bd)) { v3d_runner_destroy_buffer(ctx->runner, &bm); return -1; } if (v3d_runner_acquire_buffer(ctx->runner, dst_max, &bd)) { v3d_runner_release_buffer(ctx->runner, &bm); return -1; }
memcpy(bd.mapped, dst, dst_max); memcpy(bd.mapped, dst, dst_max);
uint32_t *m = bm.mapped; uint32_t *m = bm.mapped;
@@ -691,8 +691,8 @@ static int dispatch_h264_deblock_qpu(daedalus_ctx *ctx,
uint32_t wg_count = (uint32_t)((n_edges + 15) / 16); uint32_t wg_count = (uint32_t)((n_edges + 15) / 16);
h264deblock_pc pc = { .n_edges = (uint32_t) n_edges, h264deblock_pc pc = { .n_edges = (uint32_t) n_edges,
.dst_stride_u8 = (uint32_t) dst_stride }; .dst_stride_u8 = (uint32_t) dst_stride };
VkCommandBuffer cb = v3d_runner_alloc_cmdbuf(ctx->runner); if (v3d_runner_pipeline_cmdbuf_reset(ctx->runner, &ctx->h264deblock_pipe)) goto fail;
if (cb == VK_NULL_HANDLE) goto fail; VkCommandBuffer cb = ctx->h264deblock_pipe.cb;
VkCommandBufferBeginInfo cbbi = { .sType = VK_STRUCTURE_TYPE_COMMAND_BUFFER_BEGIN_INFO }; VkCommandBufferBeginInfo cbbi = { .sType = VK_STRUCTURE_TYPE_COMMAND_BUFFER_BEGIN_INFO };
vkBeginCommandBuffer(cb, &cbbi); vkBeginCommandBuffer(cb, &cbbi);
vkCmdBindPipeline(cb, VK_PIPELINE_BIND_POINT_COMPUTE, ctx->h264deblock_pipe.pipeline); vkCmdBindPipeline(cb, VK_PIPELINE_BIND_POINT_COMPUTE, ctx->h264deblock_pipe.pipeline);
@@ -706,12 +706,12 @@ static int dispatch_h264_deblock_qpu(daedalus_ctx *ctx,
memcpy(dst, bd.mapped, dst_max); memcpy(dst, bd.mapped, dst_max);
v3d_runner_destroy_buffer(ctx->runner, &bd); v3d_runner_release_buffer(ctx->runner, &bd);
v3d_runner_destroy_buffer(ctx->runner, &bm); v3d_runner_release_buffer(ctx->runner, &bm);
return 0; return 0;
fail: fail:
v3d_runner_destroy_buffer(ctx->runner, &bd); v3d_runner_release_buffer(ctx->runner, &bd);
v3d_runner_destroy_buffer(ctx->runner, &bm); v3d_runner_release_buffer(ctx->runner, &bm);
return -1; return -1;
} }
+144
View File
@@ -17,6 +17,18 @@
fprintf(stderr, "v3d_runner: vulkan error %d at %s:%d (%s)\n", \ fprintf(stderr, "v3d_runner: vulkan error %d at %s:%d (%s)\n", \
r__, __FILE__, __LINE__, #call); return NULL; } } while (0) r__, __FILE__, __LINE__, #call); return NULL; } } while (0)
/* Power-of-2 size classes from 2^8 (256 B) up to 2^23 (8 MiB). Cycle
* 1's largest dispatch with n_blocks ≈ 8K is well under 8 MiB; oversize
* requests fall through to non-pooled allocation. */
#define V3D_POOL_MIN_LOG2 8
#define V3D_POOL_MAX_LOG2 23
#define V3D_POOL_BUCKETS (V3D_POOL_MAX_LOG2 - V3D_POOL_MIN_LOG2 + 1)
struct v3d_pool_entry {
v3d_buffer buf;
struct v3d_pool_entry *next;
};
struct v3d_runner { struct v3d_runner {
VkInstance instance; VkInstance instance;
VkPhysicalDevice phys; VkPhysicalDevice phys;
@@ -26,6 +38,15 @@ struct v3d_runner {
VkCommandPool pool; VkCommandPool pool;
char device_name[VK_MAX_PHYSICAL_DEVICE_NAME_SIZE]; char device_name[VK_MAX_PHYSICAL_DEVICE_NAME_SIZE];
VkPhysicalDeviceMemoryProperties mem_props; VkPhysicalDeviceMemoryProperties mem_props;
/* Buffer pool: per-bucket freelist of previously-released
* v3d_buffer. bucket index = ceil_log2(size) - V3D_POOL_MIN_LOG2.
* pool_total_bytes accumulates every successful vkAllocateMemory
* we've done through the pool — never decreases (the freelist
* just hands buffers around, no vkFreeMemory until destroy).
*/
struct v3d_pool_entry *pool_free[V3D_POOL_BUCKETS];
size_t pool_total_bytes;
}; };
static int pick_v3d_physical_device(VkInstance inst, VkPhysicalDevice *out, static int pick_v3d_physical_device(VkInstance inst, VkPhysicalDevice *out,
@@ -168,6 +189,21 @@ void v3d_runner_destroy(v3d_runner *r)
{ {
if (!r) return; if (!r) return;
if (r->device != VK_NULL_HANDLE) vkDeviceWaitIdle(r->device); if (r->device != VK_NULL_HANDLE) vkDeviceWaitIdle(r->device);
/* Drain the buffer pool BEFORE destroying device — the pool
* entries own VkBuffer/VkDeviceMemory handles, which need a live
* device for vkDestroyBuffer/vkFreeMemory. */
for (int b = 0; b < V3D_POOL_BUCKETS; b++) {
struct v3d_pool_entry *e = r->pool_free[b];
while (e) {
struct v3d_pool_entry *next = e->next;
v3d_runner_destroy_buffer(r, &e->buf);
free(e);
e = next;
}
r->pool_free[b] = NULL;
}
if (r->pool != VK_NULL_HANDLE) if (r->pool != VK_NULL_HANDLE)
vkDestroyCommandPool(r->device, r->pool, NULL); vkDestroyCommandPool(r->device, r->pool, NULL);
if (r->device != VK_NULL_HANDLE) vkDestroyDevice(r->device, NULL); if (r->device != VK_NULL_HANDLE) vkDestroyDevice(r->device, NULL);
@@ -175,6 +211,92 @@ void v3d_runner_destroy(v3d_runner *r)
free(r); free(r);
} }
/* ---- Buffer pool ----------------------------------------------- */
/* ceil_log2 for buffer pool bucket selection. */
static int v3d_pool_bucket_for(size_t size)
{
int log2;
size_t m;
if (size <= ((size_t)1 << V3D_POOL_MIN_LOG2))
return 0;
m = size - 1;
log2 = 0;
while (m) { log2++; m >>= 1; }
if (log2 < V3D_POOL_MIN_LOG2) log2 = V3D_POOL_MIN_LOG2;
if (log2 > V3D_POOL_MAX_LOG2) return -1;
return log2 - V3D_POOL_MIN_LOG2;
}
int v3d_runner_acquire_buffer(v3d_runner *r, size_t size, v3d_buffer *out)
{
int bucket;
size_t bucket_size;
struct v3d_pool_entry *e;
int rc;
if (!r || !out || size == 0) return -1;
bucket = v3d_pool_bucket_for(size);
if (bucket < 0) {
/* Oversize — fall through to non-pooled allocation. Caller
* still calls v3d_runner_release_buffer(), which detects the
* oversize bucket via bucket_for() and destroys. */
return v3d_runner_create_buffer(r, size, out);
}
bucket_size = (size_t)1 << (bucket + V3D_POOL_MIN_LOG2);
e = r->pool_free[bucket];
if (e) {
r->pool_free[bucket] = e->next;
*out = e->buf;
free(e);
return 0;
}
/* Miss — allocate fresh at the bucket size. Subsequent acquire/
* release for the same bucket reuses this buffer. */
rc = v3d_runner_create_buffer(r, bucket_size, out);
if (rc == 0)
r->pool_total_bytes += bucket_size;
return rc;
}
void v3d_runner_release_buffer(v3d_runner *r, v3d_buffer *buf)
{
int bucket;
struct v3d_pool_entry *e;
if (!r || !buf || buf->buffer == VK_NULL_HANDLE) return;
bucket = v3d_pool_bucket_for(buf->size);
if (bucket < 0) {
/* Oversize — destroy outright; never made it into the pool. */
v3d_runner_destroy_buffer(r, buf);
memset(buf, 0, sizeof(*buf));
return;
}
e = malloc(sizeof(*e));
if (!e) {
/* Allocator failure: just destroy. Pool degenerates to
* non-pooled behaviour but doesn't leak. */
v3d_runner_destroy_buffer(r, buf);
memset(buf, 0, sizeof(*buf));
return;
}
e->buf = *buf;
e->next = r->pool_free[bucket];
r->pool_free[bucket] = e;
memset(buf, 0, sizeof(*buf));
}
size_t v3d_runner_pool_total_bytes(v3d_runner *r)
{
return r ? r->pool_total_bytes : 0;
}
VkDevice v3d_runner_device(v3d_runner *r) { return r->device; } VkDevice v3d_runner_device(v3d_runner *r) { return r->device; }
VkQueue v3d_runner_queue(v3d_runner *r) { return r->queue; } VkQueue v3d_runner_queue(v3d_runner *r) { return r->queue; }
uint32_t v3d_runner_queue_family(v3d_runner *r) { return r->queue_family; } uint32_t v3d_runner_queue_family(v3d_runner *r) { return r->queue_family; }
@@ -364,12 +486,27 @@ int v3d_runner_create_pipeline(v3d_runner *r, const char *spv_path,
.pSetLayouts = &out->ds_layout, .pSetLayouts = &out->ds_layout,
}; };
CHK(vkAllocateDescriptorSets(r->device, &dsai, &out->desc_set)); CHK(vkAllocateDescriptorSets(r->device, &dsai, &out->desc_set));
/* Persistent command buffer — pool was created with
* RESET_COMMAND_BUFFER_BIT (see v3d_runner_create) so dispatch
* sites can call vkResetCommandBuffer on this same cb instead
* of paying vkAllocateCommandBuffers per call. */
VkCommandBufferAllocateInfo cbai = {
.sType = VK_STRUCTURE_TYPE_COMMAND_BUFFER_ALLOCATE_INFO,
.commandPool = r->pool,
.level = VK_COMMAND_BUFFER_LEVEL_PRIMARY,
.commandBufferCount = 1,
};
CHK(vkAllocateCommandBuffers(r->device, &cbai, &out->cb));
return 0; return 0;
} }
void v3d_runner_destroy_pipeline(v3d_runner *r, v3d_pipeline *p) void v3d_runner_destroy_pipeline(v3d_runner *r, v3d_pipeline *p)
{ {
if (!p || p->pipeline == VK_NULL_HANDLE) return; if (!p || p->pipeline == VK_NULL_HANDLE) return;
if (p->cb != VK_NULL_HANDLE)
vkFreeCommandBuffers(r->device, r->pool, 1, &p->cb);
vkDestroyPipeline(r->device, p->pipeline, NULL); vkDestroyPipeline(r->device, p->pipeline, NULL);
vkDestroyPipelineLayout(r->device, p->layout, NULL); vkDestroyPipelineLayout(r->device, p->layout, NULL);
vkDestroyDescriptorPool(r->device, p->pool, NULL); /* frees its set */ vkDestroyDescriptorPool(r->device, p->pool, NULL); /* frees its set */
@@ -377,6 +514,13 @@ void v3d_runner_destroy_pipeline(v3d_runner *r, v3d_pipeline *p)
memset(p, 0, sizeof(*p)); memset(p, 0, sizeof(*p));
} }
int v3d_runner_pipeline_cmdbuf_reset(v3d_runner *r, v3d_pipeline *p)
{
(void) r;
if (!p || p->cb == VK_NULL_HANDLE) return -1;
return vkResetCommandBuffer(p->cb, 0) == VK_SUCCESS ? 0 : -1;
}
int v3d_runner_bind_buffers(v3d_runner *r, v3d_pipeline *p, int v3d_runner_bind_buffers(v3d_runner *r, v3d_pipeline *p,
const v3d_buffer *bufs, uint32_t n) const v3d_buffer *bufs, uint32_t n)
{ {
+45
View File
@@ -34,6 +34,12 @@ typedef struct {
VkDescriptorSet desc_set; VkDescriptorSet desc_set;
uint32_t n_ssbos; uint32_t n_ssbos;
uint32_t push_const_size; uint32_t push_const_size;
/* Persistent command buffer. Allocated at create-pipeline time;
* dispatch sites use v3d_runner_pipeline_cmdbuf_reset() to
* vkResetCommandBuffer instead of paying vkAllocateCommandBuffers
* per dispatch. Pool flagged RESET_COMMAND_BUFFER_BIT so reset
* is permitted. */
VkCommandBuffer cb;
} v3d_pipeline; } v3d_pipeline;
/* /*
@@ -57,10 +63,43 @@ const char *v3d_runner_device_name(v3d_runner *r);
* host side. The mapping persists for the lifetime of the buffer. * host side. The mapping persists for the lifetime of the buffer.
* *
* Returns 0 on success, non-zero on failure. * Returns 0 on success, non-zero on failure.
*
* NOTE: prefer v3d_runner_acquire_buffer() on the dispatch hot path —
* create_buffer/destroy_buffer go straight to vkAllocateMemory each
* call, which on V3D7's Mesa stack costs ~10-50us. The acquire/
* release pair pulls from a freelist and pays vkAllocateMemory only
* on a cache miss.
*/ */
int v3d_runner_create_buffer(v3d_runner *r, size_t size, v3d_buffer *out); int v3d_runner_create_buffer(v3d_runner *r, size_t size, v3d_buffer *out);
void v3d_runner_destroy_buffer(v3d_runner *r, v3d_buffer *buf); void v3d_runner_destroy_buffer(v3d_runner *r, v3d_buffer *buf);
/*
* Pooled buffer acquisition. Returns a v3d_buffer whose .size is the
* smallest power-of-2 >= the requested size (so callers can pool
* across similar-sized requests). Backed by HOST_VISIBLE |
* HOST_COHERENT memory; mapped pointer is valid.
*
* On cache hit: zero-cost reuse of a previously-released buffer.
* On miss: falls through to v3d_runner_create_buffer(). Release with
* v3d_runner_release_buffer(); pool drains in v3d_runner_destroy().
*
* Lifetime contract: the returned buffer's .mapped contents are
* UNINITIALISED — the previous user's data may still be present.
* Callers that need a clean buffer must memset themselves. This is
* deliberate; the dispatch hot paths immediately overwrite the
* buffer with new coefficients / meta anyway.
*
* Thread-safety: NOT thread-safe. A daedalus_ctx is single-threaded
* by API contract; the pool inherits that constraint.
*/
int v3d_runner_acquire_buffer(v3d_runner *r, size_t size, v3d_buffer *out);
void v3d_runner_release_buffer(v3d_runner *r, v3d_buffer *buf);
/* Pool diagnostics: total allocated bytes (sum across all size
* classes, including currently-released entries). Useful for
* watermark logging. */
size_t v3d_runner_pool_total_bytes(v3d_runner *r);
/* Compute pipeline from a SPIR-V file path. The descriptor-set /* Compute pipeline from a SPIR-V file path. The descriptor-set
* layout exposes `n_ssbos` storage buffer bindings at binding * layout exposes `n_ssbos` storage buffer bindings at binding
* indices 0..n_ssbos-1, all visible to the compute stage. A push * indices 0..n_ssbos-1, all visible to the compute stage. A push
@@ -88,6 +127,12 @@ int v3d_runner_bind_buffers(v3d_runner *r,
/* Allocate a primary command buffer from the runner's pool. */ /* Allocate a primary command buffer from the runner's pool. */
VkCommandBuffer v3d_runner_alloc_cmdbuf(v3d_runner *r); VkCommandBuffer v3d_runner_alloc_cmdbuf(v3d_runner *r);
/* Reset @p->cb so it can be re-recorded. Returns 0 on success.
* Replaces v3d_runner_alloc_cmdbuf() on the dispatch hot path —
* vkResetCommandBuffer is O(1) vs vkAllocateCommandBuffers' ~1-5us
* driver cost. */
int v3d_runner_pipeline_cmdbuf_reset(v3d_runner *r, v3d_pipeline *p);
/* Submit `cb` to the queue and wait for completion. The classic /* Submit `cb` to the queue and wait for completion. The classic
* timed operation. Returns 0 on success. * timed operation. Returns 0 on success.
*/ */
+120
View File
@@ -0,0 +1,120 @@
/*
* bench_pool_overhead — measure QPU dispatch overhead with and without
* the v3d_runner buffer pool warm.
*
* Times N consecutive daedalus_recipe_dispatch_vp9_idct8 calls and
* prints the per-call distribution. The first call pays
* vkAllocateMemory (typically tens of microseconds on V3D7's Mesa);
* the second and subsequent should hit the pool freelist and amortise
* to the pure dispatch-floor cost.
*
* Purpose: provide a concrete before/after number for the QPU-default
* substrate decree (2026-05-23). Bench is non-gating and runs in
* fractions of a second.
*
* License: BSD-2-Clause.
*/
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>
#include <time.h>
#include "../include/daedalus.h"
extern size_t v3d_runner_pool_total_bytes(void *); /* exposed if we wanted it */
static double now_seconds(void)
{
struct timespec ts;
clock_gettime(CLOCK_MONOTONIC_RAW, &ts);
return ts.tv_sec + ts.tv_nsec * 1e-9;
}
static int cmp_double(const void *a, const void *b)
{
double da = *(const double *)a, db = *(const double *)b;
return da < db ? -1 : da > db ? 1 : 0;
}
int main(int argc, char **argv)
{
int n_calls = argc > 1 ? atoi(argv[1]) : 200;
int n_blocks = 8; /* one MB column of 8x8 IDCT blocks */
int stride = 64;
daedalus_ctx *ctx = daedalus_ctx_create();
if (!ctx) { fprintf(stderr, "ctx create failed\n"); return 1; }
int has_qpu = daedalus_ctx_has_qpu(ctx);
printf("ctx: has_qpu=%d\n", has_qpu);
if (!has_qpu) {
fprintf(stderr, "QPU not available on this device; bench needs V3D\n");
daedalus_ctx_destroy(ctx);
return 2;
}
/* Build a representative IDCT 8x8 batch and warm a dst buffer. */
int16_t *coeffs = calloc((size_t) n_blocks * 64, sizeof(int16_t));
uint8_t *dst = calloc((size_t) n_blocks * 8 * stride, 1);
daedalus_idct8_meta *meta = calloc((size_t) n_blocks, sizeof(*meta));
if (!coeffs || !dst || !meta) { fprintf(stderr, "alloc fail\n"); return 1; }
uint64_t s = 0x1234567abcdefULL;
for (size_t i = 0; i < (size_t) n_blocks * 64; i++) {
s ^= s << 13; s ^= s >> 7; s ^= s << 17;
coeffs[i] = (int16_t)(s & 0x7ff) - 0x400;
}
for (int b = 0; b < n_blocks; b++) {
meta[b].dst_off = (uint32_t) b * 8;
meta[b].block_x = (uint32_t) b;
meta[b].block_y = 0;
}
double *t = malloc((size_t) n_calls * sizeof(double));
int rc;
printf("=== dispatching %d times, n_blocks=%d/call ===\n",
n_calls, n_blocks);
for (int i = 0; i < n_calls; i++) {
double t0 = now_seconds();
rc = daedalus_dispatch_vp9_idct8(ctx, DAEDALUS_SUBSTRATE_QPU,
dst, (size_t) stride,
coeffs, (size_t) n_blocks, meta);
double t1 = now_seconds();
if (rc) { fprintf(stderr, "dispatch %d rc=%d\n", i, rc); return 1; }
t[i] = (t1 - t0) * 1e6; /* us */
}
/* Per-call distribution (first few + sorted summary on the steady-state) */
printf("\nfirst 5 calls (cold-warm transition):\n");
for (int i = 0; i < 5 && i < n_calls; i++)
printf(" call %d: %.2f us\n", i, t[i]);
int skip = 10; /* drop warm-up calls from the steady-state stats */
if (n_calls > skip + 10) {
int n = n_calls - skip;
double *s_arr = malloc((size_t) n * sizeof(double));
memcpy(s_arr, t + skip, (size_t) n * sizeof(double));
qsort(s_arr, (size_t) n, sizeof(double), cmp_double);
double sum = 0;
for (int i = 0; i < n; i++) sum += s_arr[i];
printf("\nsteady-state stats (calls %d..%d, n=%d):\n",
skip, n_calls - 1, n);
printf(" min: %.2f us\n", s_arr[0]);
printf(" p50: %.2f us\n", s_arr[n / 2]);
printf(" p90: %.2f us\n", s_arr[(int)(n * 0.9)]);
printf(" p99: %.2f us\n", s_arr[(int)(n * 0.99)]);
printf(" max: %.2f us\n", s_arr[n - 1]);
printf(" mean: %.2f us\n", sum / n);
printf("\nfirst-call / steady-state median ratio: %.1fx\n",
t[0] / s_arr[n / 2]);
free(s_arr);
}
free(t); free(coeffs); free(dst); free(meta);
daedalus_ctx_destroy(ctx);
return 0;
}