v3d_runner: buffer pool for QPU dispatch hot path #6
Phase 1 of the QPU-default-substrate campaign per the 2026-05-23 decree ("what can be done in QPU will be done in QPU").
Per-dispatch vkAllocateMemory was the biggest single fixable contributor to QPU dispatch overhead. This PR pools v3d_buffer allocations by power-of-2 size class, so the second and subsequent dispatches hit a freelist instead of paying ~10–50 μs of Mesa-V3D7 memory-allocation cost per call.

API

- v3d_runner_acquire_buffer(r, size, out): pulls from the per-bucket freelist; falls through to create_buffer on a miss.
- v3d_runner_release_buffer(r, buf): pushes the buffer back onto the freelist; the backing VkBuffer/VkDeviceMemory are only vkFreeMemory'd at v3d_runner_destroy().
- v3d_runner_pool_total_bytes(): diagnostic watermark.

Size classes are 2^8..2^23 (256 B to 8 MiB). Oversize requests degenerate to non-pooled at both ends.
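The size-class freelist described above can be sketched in plain C. This is only an illustration under stated assumptions, not the code in v3d_runner: every `sketch_*` name is hypothetical, `malloc` stands in for the VkBuffer/VkDeviceMemory allocation, and rounding requests below 256 B up to the smallest class is a guess, since the description does not say how small requests are handled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Size classes from the description: 2^8 .. 2^23 bytes. */
#define POOL_MIN_SHIFT   8
#define POOL_MAX_SHIFT   23
#define POOL_NUM_CLASSES (POOL_MAX_SHIFT - POOL_MIN_SHIFT + 1)

/* Hypothetical stand-in for v3d_buffer; malloc'd memory replaces Vulkan objects. */
typedef struct sketch_buffer {
    size_t capacity;            /* rounded-up size actually allocated */
    int size_class;             /* bucket index, or -1 when not pooled */
    void *mem;
    struct sketch_buffer *next; /* freelist link */
} sketch_buffer;

typedef struct {
    sketch_buffer *freelist[POOL_NUM_CLASSES];
    size_t total_bytes;         /* watermark: bytes ever allocated for pooled buffers */
} sketch_pool;

/* Bucket index for a request, or -1 if it exceeds the largest class.
 * Assumption: requests below 256 B round up to class 0. */
int sketch_size_class(size_t size)
{
    if (size > ((size_t)1 << POOL_MAX_SHIFT))
        return -1;
    int shift = POOL_MIN_SHIFT;
    while (((size_t)1 << shift) < size)
        shift++;
    return shift - POOL_MIN_SHIFT;
}

/* Pop from the bucket's freelist; allocate fresh on a miss or for oversize requests. */
sketch_buffer *sketch_acquire(sketch_pool *p, size_t size)
{
    assert(p != NULL);
    int cls = sketch_size_class(size);
    if (cls >= 0 && p->freelist[cls] != NULL) {
        sketch_buffer *hit = p->freelist[cls];
        p->freelist[cls] = hit->next;
        hit->next = NULL;
        return hit;
    }
    size_t cap = (cls >= 0) ? ((size_t)1 << (cls + POOL_MIN_SHIFT)) : size;
    sketch_buffer *b = malloc(sizeof *b);
    if (b == NULL)
        return NULL;
    b->mem = malloc(cap);
    if (b->mem == NULL) {
        free(b);
        return NULL;
    }
    b->capacity = cap;
    b->size_class = cls;
    b->next = NULL;
    if (cls >= 0)
        p->total_bytes += cap;
    return b;
}

/* Push a pooled buffer back onto its freelist; free non-pooled buffers immediately. */
void sketch_release(sketch_pool *p, sketch_buffer *b)
{
    assert(p != NULL);
    if (b == NULL)
        return;
    if (b->size_class < 0) {
        free(b->mem);
        free(b);
        return;
    }
    b->next = p->freelist[b->size_class];
    p->freelist[b->size_class] = b;
}

/* Free everything parked on the freelists (the point where the real pool frees memory). */
void sketch_pool_destroy(sketch_pool *p)
{
    assert(p != NULL);
    for (int i = 0; i < POOL_NUM_CLASSES; i++) {
        while (p->freelist[i] != NULL) {
            sketch_buffer *b = p->freelist[i];
            p->freelist[i] = b->next;
            free(b->mem);
            free(b);
        }
    }
    p->total_bytes = 0;
}
```

In this sketch a 1000-byte request and a later 600-byte request both land in the 1024-byte class, so the second acquire reuses the first buffer once it has been released. Like the pool in this PR, the sketch takes no locks and is meant for single-threaded use.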
Migration

The dispatch_*_qpu paths in daedalus_core.c globally swap create_buffer → acquire_buffer and destroy_buffer → release_buffer. All five QPU dispatch functions (idct8 / lpf / mc_8h / cdef / h264_deblock) now reuse buffers across calls. test_api_idct stays bit-exact (4096/4096 bytes on CPU/QPU/AUTO substrates on hertz).

Microbench result

tests/bench_pool_overhead.c on hertz (Pi 5, kernel 6.18.29, V3D 7.1, 200 dispatches of VP9 IDCT 8×8 with n_blocks=8):

- call 0: 434.89 μs (cold, 3x vkAllocateMemory)
- call 1: 100.06 μs (pool hit on all 3 buffers)
- steady-state p50: 76.44 μs
- steady-state p90: 90.52 μs
- steady-state mean: 77.95 μs
- first-call / steady-state ratio: 5.7x

The remaining ~76 μs steady-state is dominated by vkQueueWaitIdle + shader execution + per-call descriptor-set update + command-buffer allocation; follow-on tasks 161 (persistent cmdbuf) and 162 (dmabuf import for dst, which eliminates the in/out memcpy) attack these.

Thread-safety: the pool is not thread-safe (the freelist takes no locks). A daedalus_ctx * is already single-threaded by API contract per the public header; the pool inherits that.

Refs task #160.
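For readers who want to reproduce this kind of per-dispatch measurement, a timing loop with p50/p90/mean summaries might look like the sketch below. It is not the contents of tests/bench_pool_overhead.c: all `sketch_*` names are hypothetical, the nearest-rank percentile definition is an assumption (the PR does not say how p50 and p90 are computed), and the caller is expected to drop cold calls before summarising.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <time.h>

/* Monotonic clock reading in microseconds. */
double sketch_now_us(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec * 1e6 + (double)ts.tv_nsec / 1e3;
}

/* Time n back-to-back calls of fn(arg); out_us[i] receives the i-th latency. */
void sketch_time_calls(void (*fn)(void *), void *arg, double *out_us, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        double t0 = sketch_now_us();
        fn(arg);
        out_us[i] = sketch_now_us() - t0;
    }
}

static int cmp_double(const void *a, const void *b)
{
    double x = *(const double *)a;
    double y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Nearest-rank percentile (pct in 1..100). Sorts the samples in place. */
double sketch_percentile(double *samples, size_t n, unsigned pct)
{
    assert(n > 0 && pct >= 1 && pct <= 100);
    qsort(samples, n, sizeof *samples, cmp_double);
    size_t rank = ((size_t)pct * n + 99) / 100; /* ceil(pct * n / 100) */
    return samples[rank - 1];
}

/* Arithmetic mean of the samples. */
double sketch_mean(const double *samples, size_t n)
{
    assert(n > 0);
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += samples[i];
    return sum / (double)n;
}
```

To separate cold from steady-state behaviour, record call 0 and call 1 individually and pass only the later samples (for example `out_us + 2` with `n - 2`) to the summary functions.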
Per the QPU-default substrate decree 2026-05-23: the per-dispatch vkAllocateMemory in dispatch_*_qpu was the biggest single fixable contributor to QPU dispatch overhead. This pools v3d_buffer allocations by power-of-2 size class so the second and subsequent dispatches hit a freelist instead of paying ~10-50us of Mesa-V3D7 memory-allocation cost per call.

API additions (v3d_runner.h):
- v3d_runner_acquire_buffer(): pulls from per-bucket freelist; falls through to v3d_runner_create_buffer() on miss.
- v3d_runner_release_buffer(): pushes back onto the freelist; the backing VkBuffer/VkDeviceMemory only get vkFreeMemory'd in v3d_runner_destroy().
- v3d_runner_pool_total_bytes(): diagnostic watermark.

Size classes 2^8..2^23 (256 B to 8 MiB). Oversize requests fall through to non-pooled (vkAllocateMemory) for both ends; the pool stays correct and just degenerates to the old behaviour for those calls.

Migration: daedalus_core.c dispatch_*_qpu paths globally swap create_buffer → acquire_buffer and destroy_buffer → release_buffer. All five QPU dispatch functions (idct8 / lpf / mc_8h / cdef / h264_deblock) now reuse buffers across calls. test_api_idct stays bit-exact (4096/4096 bytes on CPU/QPU/AUTO substrates on hertz).

Microbench (tests/bench_pool_overhead.c) on hertz (Pi 5, 6.18.29+rpt-rpi-2712, V3D 7.1):
  call 0: 434.89 us (cold: 3x vkAllocateMemory)
  call 1: 100.06 us (pool hit on all 3 buffers)
  steady-state:
    p50: 76.44 us
    p90: 90.52 us
    mean: 77.95 us
  first-call / steady-state ratio: 5.7x

The remaining ~76us steady-state is dominated by vkQueueWaitIdle + shader execution + per-call descriptor-set update + command-buffer allocation, addressed in follow-on tasks 161 (persistent cmdbuf) and 162 (dmabuf import for dst, eliminates memcpy in/out).

Refs daedalus-fourier task #160.

Phase 2 of the QPU-default substrate campaign: eliminate vkAllocateCommandBuffers from the dispatch hot path.

Attaches a VkCommandBuffer to each v3d_pipeline, allocated once in v3d_runner_create_pipeline() and freed in destroy_pipeline(). The five dispatch_*_qpu sites switch from v3d_runner_alloc_cmdbuf() to v3d_runner_pipeline_cmdbuf_reset(); vkResetCommandBuffer is O(1) versus the driver-side allocation walk. The pool was already created with VK_COMMAND_POOL_CREATE_RESET_COMMAND_BUFFER_BIT, so reset is permitted.

Microbench (hertz, Pi 5, kernel 6.18.29, V3D 7.1):
  before (task 160 pool only):
    steady-state p50: 76.44 us
    steady-state mean: 77.95 us
  after (task 160 pool + task 161 persistent cb):
    steady-state p50: 54.56 us
    steady-state mean: 56.00 us
  -> 28% per-dispatch reduction

The remaining ~54 us steady-state is dominated by vkQueueWaitIdle + shader execution + the two memcpy(in/out) on the dst buffer; task 162 (dmabuf import for dst) targets the memcpy half. test_api_idct stays bit-exact across CPU/QPU/AUTO substrates.

Refs daedalus-fourier task #161.

Phase 2 added (commit
9855327): persistent per-pipeline command buffer. v3d_pipeline grows a VkCommandBuffer allocated once at create-pipeline time. The five dispatch sites switch from v3d_runner_alloc_cmdbuf() to v3d_runner_pipeline_cmdbuf_reset() (vkResetCommandBuffer, O(1)).

Microbench progression

- task 160, buffer pool only (0a042a8): steady-state p50 76.44 us, mean 77.95 us
- task 160 pool + task 161 persistent cmdbuf (9855327): steady-state p50 54.56 us, mean 56.00 us

About 29% additional per-dispatch reduction at p50 (28.6%; 28.2% on the mean). The remaining ~54 us is dominated by vkQueueWaitIdle + shader execution + the in/out memcpys on the dst buffer; task 162 (dmabuf import) attacks the memcpy half.

Closes tasks #160 and #161.
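The reduction figures can be checked from the steady-state numbers reported in the two commit messages. The helpers below are a hypothetical sketch written for this check (they are not part of the PR); the inputs in the comments are the reported p50, mean, and call-0 values.

```c
#include <assert.h>

/* Percentage reduction going from `before` to `after` (same unit for both). */
double sketch_reduction_pct(double before, double after)
{
    assert(before > 0.0);
    return 100.0 * (before - after) / before;
}

/* How many times slower `slow` is than `fast`. */
double sketch_ratio(double slow, double fast)
{
    assert(fast > 0.0);
    return slow / fast;
}

/*
 * Reported figures (microseconds):
 *   p50:  76.44 -> 54.56   sketch_reduction_pct gives about 28.6
 *   mean: 77.95 -> 56.00   sketch_reduction_pct gives about 28.2
 *   cold call 0 vs. pool-only steady-state p50:
 *         434.89 / 76.44   sketch_ratio gives about 5.7
 */
```

Rounding the p50 figure up gives the 29% quoted here, while rounding the mean figure down gives the 28% in the commit message, so the two numbers describe the same measurement.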