v3d_runner: buffer pool for QPU dispatch hot path #6

2026-05-23T17:53:15Z

marfrit commented

2026-05-23 17:53:15 +00:00

Phase 1 of the QPU-default-substrate campaign per the 2026-05-23 decree ("what can be done in QPU will be done in QPU").

Per-dispatch vkAllocateMemory was the biggest single fixable contributor to QPU dispatch overhead. This pools v3d_buffer allocations by power-of-2 size class so the second-and-subsequent dispatch hits a freelist instead of paying ~10–50 μs of Mesa-V3D7 memory-allocation cost per call.

API

- v3d_runner_acquire_buffer(r, size, out): pulls from the per-bucket freelist; falls through to create_buffer on a miss.
- v3d_runner_release_buffer(r, buf): pushes the buffer back onto the freelist; the backing VkBuffer/VkDeviceMemory is freed (vkFreeMemory) only in v3d_runner_destroy().
- v3d_runner_pool_total_bytes(): diagnostic watermark.

Size classes 2^8..2^23 (256 B to 8 MiB). Oversize requests degenerate to non-pooled at both ends.
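The size-class freelist described above can be sketched in plain C. This is a minimal model under stated assumptions, not the code in v3d_runner.c: every `sketch_*` name is hypothetical, `malloc` stands in for the VkBuffer/VkDeviceMemory pair, requests below 256 B are assumed to round up to the smallest class, and requests above 8 MiB are assumed to bypass the pool on both acquire and release.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define SKETCH_MIN_SHIFT 8   /* 2^8  = 256 B */
#define SKETCH_MAX_SHIFT 23  /* 2^23 = 8 MiB */
#define SKETCH_BUCKETS (SKETCH_MAX_SHIFT - SKETCH_MIN_SHIFT + 1)

/* Hypothetical stand-in for v3d_buffer; malloc() replaces the Vulkan objects. */
typedef struct sketch_buf {
    void *mem;
    size_t capacity;          /* size class in bytes, or exact size if unpooled */
    int bucket;               /* freelist index, or -1 if unpooled */
    struct sketch_buf *next;  /* freelist link, meaningful only while released */
} sketch_buf;

typedef struct {
    sketch_buf *free_head[SKETCH_BUCKETS];
    size_t total_bytes;       /* backing bytes owned by the pool */
    unsigned backing_allocs;  /* number of times the slow path ran */
} sketch_pool;

/* Smallest power-of-two class >= size, clamped up to 256 B.
 * Returns -1 above 8 MiB (assumption: such requests are not pooled). */
static int sketch_bucket_for(size_t size) {
    if (size > ((size_t)1 << SKETCH_MAX_SHIFT)) return -1;
    int shift = SKETCH_MIN_SHIFT;
    while (((size_t)1 << shift) < size) shift++;
    return shift - SKETCH_MIN_SHIFT;
}

static sketch_buf *sketch_acquire(sketch_pool *p, size_t size) {
    int b = sketch_bucket_for(size);
    if (b >= 0 && p->free_head[b]) {           /* hit: pop the freelist */
        sketch_buf *buf = p->free_head[b];
        p->free_head[b] = buf->next;
        buf->next = NULL;
        return buf;
    }
    /* Miss or oversize: fall through to a fresh backing allocation. */
    size_t cap = (b >= 0) ? (size_t)1 << (b + SKETCH_MIN_SHIFT) : size;
    sketch_buf *buf = malloc(sizeof *buf);
    if (!buf) return NULL;
    buf->mem = malloc(cap);
    if (!buf->mem) { free(buf); return NULL; }
    buf->capacity = cap;
    buf->bucket = b;
    buf->next = NULL;
    p->backing_allocs++;
    if (b >= 0) p->total_bytes += cap;
    return buf;
}

static void sketch_release(sketch_pool *p, sketch_buf *buf) {
    if (!buf) return;
    assert(buf->bucket < SKETCH_BUCKETS);
    if (buf->bucket < 0) {                     /* unpooled: free immediately */
        free(buf->mem);
        free(buf);
        return;
    }
    buf->next = p->free_head[buf->bucket];     /* pooled: push, keep the memory */
    p->free_head[buf->bucket] = buf;
}

/* Backing memory of released buffers is returned only here. */
static void sketch_pool_destroy(sketch_pool *p) {
    for (int b = 0; b < SKETCH_BUCKETS; b++) {
        while (p->free_head[b]) {
            sketch_buf *buf = p->free_head[b];
            p->free_head[b] = buf->next;
            free(buf->mem);
            free(buf);
        }
    }
    p->total_bytes = 0;
}
```

In this model a 1000-byte request and a later 600-byte request both land in the 1024-byte class, so the second one reuses the first buffer once it has been released. Rounding up wastes at most half of each buffer, which is the usual trade for O(1) lookup by class.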

Migration

daedalus_core.c dispatch_*_qpu paths globally swap create_buffer → acquire_buffer and destroy_buffer → release_buffer. All five QPU dispatch functions (idct8 / lpf / mc_8h / cdef / h264_deblock) now reuse buffers across calls.

test_api_idct stays bit-exact (4096/4096 bytes on CPU/QPU/AUTO substrates on hertz).

Microbench result

tests/bench_pool_overhead.c on hertz (Pi 5, kernel 6.18.29, V3D 7.1, 200 dispatches of VP9 IDCT 8×8 with n_blocks=8):

first 5 calls (cold→warm transition):
  call 0:  434.89 us  (cold — 3x vkAllocateMemory)
  call 1:  100.06 us  (pool hit on all 3 buffers)
  call 2:   89.80 us
  call 3:   89.63 us
  call 4:   96.06 us

steady-state stats (calls 10..99, n=90):
  min:    66.46 us
  p50:    76.44 us
  p90:    90.52 us
  p99:   103.57 us
  mean:   77.95 us

first-call / steady-state median ratio: 5.7x
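For readers who want to reproduce the summary from raw per-call timings, here is a minimal sketch of the statistics step. It assumes a nearest-rank percentile definition; tests/bench_pool_overhead.c may compute percentiles differently, and `bench_stats`, `nearest_rank` and `bench_summarize` are hypothetical names. The reported ratio is the first call divided by the steady-state median: 434.89 / 76.44 ≈ 5.69, which rounds to 5.7x.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
    double min, p50, p90, p99, mean;
} bench_stats;

static int cmp_double(const void *a, const void *b) {
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Nearest-rank percentile: the value at 1-based rank ceil(pct/100 * n). */
static double nearest_rank(const double *sorted, size_t n, unsigned pct) {
    assert(n > 0);
    size_t rank = (pct * n + 99) / 100;  /* integer ceiling */
    if (rank < 1) rank = 1;
    return sorted[rank - 1];
}

/* Summarize calls_us[first..last] inclusive, e.g. first=10, last=99 for n=90.
 * Returns 0 on success, -1 on an empty window or allocation failure. */
static int bench_summarize(const double *calls_us, size_t first, size_t last,
                           bench_stats *out) {
    if (last < first) return -1;
    size_t n = last - first + 1;
    double *w = malloc(n * sizeof *w);
    if (!w) return -1;
    memcpy(w, calls_us + first, n * sizeof *w);  /* keep the caller's order intact */
    qsort(w, n, sizeof *w, cmp_double);

    double sum = 0.0;
    for (size_t i = 0; i < n; i++) sum += w[i];

    out->min = w[0];
    out->p50 = nearest_rank(w, n, 50);
    out->p90 = nearest_rank(w, n, 90);
    out->p99 = nearest_rank(w, n, 99);
    out->mean = sum / (double)n;
    free(w);
    return 0;
}
```

Skipping the first few calls keeps the cold allocation and any warm-up effects out of the steady-state window.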

Remaining ~76 μs steady-state is dominated by vkQueueWaitIdle + shader execution + per-call descriptor-set update + command-buffer allocation — follow-on tasks 161 (persistent cmdbuf) and 162 (dmabuf import for dst, eliminates the in/out memcpy) attack these.

Thread-safety: the pool is not thread-safe; the freelist takes no locks. A daedalus_ctx * is already single-threaded by API contract per the public header, and the pool inherits that restriction.

Refs task #160.


marfrit added 1 commit 2026-05-23 17:53:15 +00:00

v3d_runner: buffer pool for QPU dispatch hot path 0a042a8e95

Per the QPU-default substrate decree 2026-05-23: the per-dispatch
vkAllocateMemory in dispatch_*_qpu was the biggest single fixable
contributor to QPU dispatch overhead.  This pools v3d_buffer
allocations by power-of-2 size class so the second-and-subsequent
dispatch hits a freelist instead of paying ~10-50us of Mesa-V3D7
memory-allocation cost per call.

API additions (v3d_runner.h):
  - v3d_runner_acquire_buffer(): pulls from per-bucket freelist;
    falls through to v3d_runner_create_buffer() on miss.
  - v3d_runner_release_buffer(): pushes back onto the freelist; the
    backing VkBuffer/VkDeviceMemory only get vkFreeMemory'd in
    v3d_runner_destroy().
  - v3d_runner_pool_total_bytes(): diagnostic watermark.

Size classes 2^8..2^23 (256 B to 8 MiB).  Oversize requests fall
through to non-pooled (vkAllocateMemory) for both ends — pool stays
correct, just degenerates to old behaviour for those calls.

Migration: daedalus_core.c dispatch_*_qpu paths globally swap
create_buffer → acquire_buffer and destroy_buffer → release_buffer.
All five QPU dispatch functions (idct8 / lpf / mc_8h / cdef /
h264_deblock) now reuse buffers across calls.  test_api_idct stays
bit-exact (4096/4096 bytes on CPU/QPU/AUTO substrates on hertz).

Microbench (tests/bench_pool_overhead.c) on hertz (Pi 5,
6.18.29+rpt-rpi-2712, V3D 7.1):

  call 0:  434.89 us  (cold — 3x vkAllocateMemory)
  call 1:  100.06 us  (pool hit on all 3 buffers)
  steady-state:
    p50:    76.44 us
    p90:    90.52 us
    mean:   77.95 us
  first-call / steady-state ratio: 5.7x

The remaining ~76us steady-state is dominated by vkQueueWaitIdle +
shader execution + per-call descriptor-set update + command-buffer
allocation — addressed in follow-on tasks 161 (persistent cmdbuf)
and 162 (dmabuf import for dst, eliminates memcpy in/out).

Refs daedalus-fourier task #160.

claude-noether added 1 commit 2026-05-23 17:56:36 +00:00

v3d_runner: persistent per-pipeline command buffer 98553278dd

Phase 2 of the QPU-default substrate campaign — eliminate
vkAllocateCommandBuffers from the dispatch hot path.

Attaches a VkCommandBuffer to each v3d_pipeline, allocated once in
v3d_runner_create_pipeline() and freed in destroy_pipeline().  The
five dispatch_*_qpu sites switch from v3d_runner_alloc_cmdbuf() to
v3d_runner_pipeline_cmdbuf_reset() — vkResetCommandBuffer is O(1)
versus the driver-side allocation walk.  Pool was already created
with VK_COMMAND_POOL_CREATE_RESET_COMMAND_BUFFER_BIT so reset is
permitted.

Microbench (hertz, Pi 5, kernel 6.18.29, V3D 7.1):

  before (task 160 pool only):
    steady-state p50: 76.44 us
    steady-state mean: 77.95 us
  after (task 160 pool + task 161 persistent cb):
    steady-state p50: 54.56 us
    steady-state mean: 56.00 us
    -> 28% per-dispatch reduction

The remaining ~54 us steady-state is dominated by vkQueueWaitIdle +
shader execution + the two memcpy(in/out) on the dst buffer — task
162 (dmabuf import for dst) targets the memcpy half.

test_api_idct stays bit-exact across CPU/QPU/AUTO substrates.

Refs daedalus-fourier task #161.

marfrit commented

2026-05-23 17:56:53 +00:00

Phase 2 added (commit 9855327): persistent per-pipeline command buffer.

v3d_pipeline grows a VkCommandBuffer allocated once at create-pipeline time. The five dispatch sites switch from v3d_runner_alloc_cmdbuf() to v3d_runner_pipeline_cmdbuf_reset() (vkResetCommandBuffer, O(1)).

Microbench progression

| | Steady-state p50 | Mean |
|---|---|---|
| baseline (pre-pool) | -- | -- |
| + pool (commit `0a042a8`) | 76.44 us | 77.95 us |
| + persistent cmdbuf (commit `9855327`) | 54.56 us | 56.00 us |

A further ~28% per-dispatch reduction on top of the pool (28.6% on p50, 28.2% on mean). Remaining ~54 us is dominated by vkQueueWaitIdle + shader execution + the in/out memcpys on the dst buffer; task 162 (dmabuf import) attacks the memcpy half.

Closes tasks #160 and #161.


marfrit referenced this pull request

2026-05-23 18:00:19 +00:00

QPU is default substrate: recipe table + ctx env-var override #7

claude-noether referenced this issue from a commit

2026-05-23 18:06:23 +00:00

cycle 6: V3D shader for H.264 IDCT 4x4 (first cycle-6 QPU dispatch)

marfrit referenced this pull request

2026-05-23 18:06:43 +00:00

QPU is default substrate: recipe table + ctx env-var override #7

claude-noether referenced this issue from a commit

2026-05-23 18:09:28 +00:00

cycle 7: V3D shader for H.264 IDCT 8x8