v3d_runner: buffer pool for QPU dispatch hot path

Per the QPU-default substrate decree 2026-05-23: the per-dispatch
vkAllocateMemory in dispatch_*_qpu was the biggest single fixable
contributor to QPU dispatch overhead. This pools v3d_buffer
allocations by power-of-2 size class so that the second and subsequent
dispatches hit a freelist instead of paying ~10-50us of Mesa-V3D7
memory-allocation cost per call.

API additions (v3d_runner.h):
- v3d_runner_acquire_buffer(): pulls from per-bucket freelist;
  falls through to v3d_runner_create_buffer() on miss.
- v3d_runner_release_buffer(): pushes back onto the freelist; the
  backing VkBuffer/VkDeviceMemory only get vkFreeMemory'd in
  v3d_runner_destroy().
- v3d_runner_pool_total_bytes(): diagnostic watermark.
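The acquire/release behaviour listed above can be sketched without any Vulkan dependency. This is only an illustration, not the code in v3d_runner.c: every name here (mock_buffer, mock_pool, mock_acquire, mock_release, mock_pool_destroy) is hypothetical, malloc() stands in for vkAllocateMemory, and the caller passes a bucket index directly instead of a byte size.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define N_BUCKETS 16                 /* one freelist per size class */

typedef struct mock_buffer {
    void  *mapped;                   /* stand-in for the mapped device memory */
    size_t size;                     /* capacity of the size class */
    int    bucket;                   /* freelist this buffer returns to */
    struct mock_buffer *next;        /* freelist link while released */
} mock_buffer;

typedef struct {
    mock_buffer *free_head[N_BUCKETS];
    size_t   total_bytes;            /* live + released bytes owned by the pool */
    unsigned allocs;                 /* number of times the slow path ran */
} mock_pool;

/* Slow path: stands in for v3d_runner_create_buffer() / vkAllocateMemory. */
static mock_buffer *mock_create(mock_pool *p, int bucket)
{
    mock_buffer *b = malloc(sizeof *b);
    if (!b)
        return NULL;
    b->size = (size_t)256 << bucket;
    b->mapped = malloc(b->size);
    if (!b->mapped) {
        free(b);
        return NULL;
    }
    b->bucket = bucket;
    b->next = NULL;
    p->total_bytes += b->size;
    p->allocs++;
    return b;
}

/* Acquire: pop from the bucket's freelist, or allocate on a miss. */
static mock_buffer *mock_acquire(mock_pool *p, int bucket)
{
    if (bucket < 0 || bucket >= N_BUCKETS)
        return NULL;
    mock_buffer *b = p->free_head[bucket];
    if (b) {                         /* hit: old contents are left untouched */
        p->free_head[bucket] = b->next;
        b->next = NULL;
        return b;
    }
    return mock_create(p, bucket);   /* miss: pay the allocation once */
}

/* Release: push back onto the freelist; nothing is freed here. */
static void mock_release(mock_pool *p, mock_buffer *b)
{
    b->next = p->free_head[b->bucket];
    p->free_head[b->bucket] = b;
}

/* Destroy: the only place backing memory is actually freed.
 * Buffers still held by a caller are not tracked in this sketch. */
static void mock_pool_destroy(mock_pool *p)
{
    for (int i = 0; i < N_BUCKETS; i++) {
        while (p->free_head[i]) {
            mock_buffer *b = p->free_head[i];
            p->free_head[i] = b->next;
            free(b->mapped);
            free(b);
        }
    }
    p->total_bytes = 0;
}
```

A second mock_acquire() on the same bucket after a mock_release() returns the same buffer without touching the allocator, which is the effect the pool is after. As the header comment in this commit notes for the real API, a reused buffer still holds the previous user's data, so callers that need zeroed memory must memset it themselves.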
Size classes 2^8..2^23 (256 B to 8 MiB). Oversize requests fall
through to the non-pooled path (vkAllocateMemory) on both the acquire
and release side; the pool stays correct and just degenerates to the
old behaviour for those calls.
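One way to map a request size onto the 16 classes described above is sketched here. The helper names (size_class_for, size_class_bytes) are hypothetical, and rounding requests below 256 B up to the smallest class is an assumption; the commit message does not say how the real code treats them.

```c
#include <assert.h>
#include <stddef.h>

#define MIN_SHIFT 8    /* 2^8  = 256 B, smallest class */
#define MAX_SHIFT 23   /* 2^23 = 8 MiB, largest class  */

/* Returns the bucket index (0..15) for a request, or -1 when the
 * request exceeds the largest class and should bypass the pool.
 * Assumption: anything below 256 B lands in bucket 0. */
static int size_class_for(size_t size)
{
    size_t cap = (size_t)1 << MIN_SHIFT;
    for (int shift = MIN_SHIFT; shift <= MAX_SHIFT; shift++, cap <<= 1) {
        if (size <= cap)
            return shift - MIN_SHIFT;
    }
    return -1;
}

/* Capacity in bytes of a given bucket (smallest power of 2 >= request). */
static size_t size_class_bytes(int bucket)
{
    return (size_t)1 << (bucket + MIN_SHIFT);
}
```

With this mapping a 257-byte request lands in the 512 B class and a request of 8 MiB + 1 byte returns -1, which the caller would treat as "allocate and free directly".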
Migration: daedalus_core.c dispatch_*_qpu paths globally swap
create_buffer → acquire_buffer and destroy_buffer → release_buffer.
All five QPU dispatch functions (idct8 / lpf / mc_8h / cdef /
h264_deblock) now reuse buffers across calls. test_api_idct stays
bit-exact (4096/4096 bytes on CPU/QPU/AUTO substrates on hertz).

Microbench (tests/bench_pool_overhead.c) on hertz (Pi 5,
6.18.29+rpt-rpi-2712, V3D 7.1):
    call 0:  434.89 us  (cold: 3x vkAllocateMemory)
    call 1:  100.06 us  (pool hit on all 3 buffers)
    steady-state:
      p50:   76.44 us
      p90:   90.52 us
      mean:  77.95 us
    first-call / steady-state (p50) ratio: 5.7x
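For readers who want to reproduce summary figures of this shape from their own timing samples, here is a small sketch. It is not taken from tests/bench_pool_overhead.c; the helper names are hypothetical, and the nearest-rank percentile definition is an assumption, since the commit does not say which definition the benchmark uses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* qsort comparator for doubles, ascending. */
static int cmp_double(const void *a, const void *b)
{
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Nearest-rank percentile (pct in 1..100) of n >= 1 samples.
 * Sorts the array in place. */
static double percentile(double *v, size_t n, unsigned pct)
{
    qsort(v, n, sizeof *v, cmp_double);
    size_t rank = (n * pct + 99) / 100;   /* ceil(n * pct / 100) */
    if (rank == 0)
        rank = 1;
    return v[rank - 1];
}

/* Arithmetic mean of n >= 1 samples. */
static double mean(const double *v, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += v[i];
    return sum / (double)n;
}
```

The ratio line in the table is simply the cold call divided by the steady-state median: 434.89 / 76.44 is roughly 5.69, reported as 5.7x.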
The remaining ~76us steady-state is dominated by vkQueueWaitIdle +
shader execution + per-call descriptor-set update + command-buffer
allocation — addressed in follow-on tasks 161 (persistent cmdbuf)
and 162 (dmabuf import for dst, eliminates memcpy in/out).

Refs daedalus-fourier task #160.
@@ -57,10 +57,43 @@ const char *v3d_runner_device_name(v3d_runner *r);
 * host side. The mapping persists for the lifetime of the buffer.
 *
 * Returns 0 on success, non-zero on failure.
 *
 * NOTE: prefer v3d_runner_acquire_buffer() on the dispatch hot path —
 * create_buffer/destroy_buffer go straight to vkAllocateMemory each
 * call, which on V3D7's Mesa stack costs ~10-50us. The acquire/
 * release pair pulls from a freelist and pays vkAllocateMemory only
 * on a cache miss.
 */
int v3d_runner_create_buffer(v3d_runner *r, size_t size, v3d_buffer *out);
void v3d_runner_destroy_buffer(v3d_runner *r, v3d_buffer *buf);

/*
 * Pooled buffer acquisition. Returns a v3d_buffer whose .size is the
 * smallest power-of-2 >= the requested size (so callers can pool
 * across similar-sized requests). Backed by HOST_VISIBLE |
 * HOST_COHERENT memory; mapped pointer is valid.
 *
 * On cache hit: zero-cost reuse of a previously-released buffer.
 * On miss: falls through to v3d_runner_create_buffer(). Release with
 * v3d_runner_release_buffer(); pool drains in v3d_runner_destroy().
 *
 * Lifetime contract: the returned buffer's .mapped contents are
 * UNINITIALISED — the previous user's data may still be present.
 * Callers that need a clean buffer must memset themselves. This is
 * deliberate; the dispatch hot paths immediately overwrite the
 * buffer with new coefficients / meta anyway.
 *
 * Thread-safety: NOT thread-safe. A daedalus_ctx is single-threaded
 * by API contract; the pool inherits that constraint.
 */
int v3d_runner_acquire_buffer(v3d_runner *r, size_t size, v3d_buffer *out);
void v3d_runner_release_buffer(v3d_runner *r, v3d_buffer *buf);

/* Pool diagnostics: total allocated bytes (sum across all size
 * classes, including currently-released entries). Useful for
 * watermark logging. */
size_t v3d_runner_pool_total_bytes(v3d_runner *r);

/* Compute pipeline from a SPIR-V file path. The descriptor-set
 * layout exposes `n_ssbos` storage buffer bindings at binding
 * indices 0..n_ssbos-1, all visible to the compute stage. A push