daedalus-fourier/tests at 0d54d68f3804e73571f3ff9936af99231666961e - daedalus-fourier - marfrit's space

marfrit/daedalus-fourier

claude-noether 0a042a8e95 v3d_runner: buffer pool for QPU dispatch hot path

Per the QPU-default substrate decree 2026-05-23: the per-dispatch
vkAllocateMemory in dispatch_*_qpu was the biggest single fixable
contributor to QPU dispatch overhead.  This pools v3d_buffer
allocations by power-of-2 size class so the second-and-subsequent
dispatch hits a freelist instead of paying ~10-50us of Mesa-V3D7
memory-allocation cost per call.

API additions (v3d_runner.h):
  - v3d_runner_acquire_buffer(): pulls from per-bucket freelist;
    falls through to v3d_runner_create_buffer() on miss.
  - v3d_runner_release_buffer(): pushes back onto the freelist; the
    backing VkBuffer/VkDeviceMemory only get vkFreeMemory'd in
    v3d_runner_destroy().
  - v3d_runner_pool_total_bytes(): diagnostic watermark.
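
A minimal sketch of the acquire/release pattern described above, not the actual v3d_runner code. Everything prefixed sketch_ is hypothetical: malloc stands in for vkAllocateMemory, sketch_buffer stands in for v3d_buffer, and total_bytes is assumed to count only pooled capacity (the commit message does not say what the watermark measures). Buffers still held by the caller when sketch_destroy() runs are not tracked in this sketch.

```c
#include <stddef.h>
#include <stdlib.h>

#define SKETCH_MIN_SHIFT   8    /* smallest class: 2^8 = 256 B */
#define SKETCH_NUM_CLASSES 16   /* 2^8 .. 2^23, per the commit message */

/* Hypothetical stand-in for v3d_buffer (which wraps VkBuffer/VkDeviceMemory). */
typedef struct sketch_buffer {
    struct sketch_buffer *next;   /* freelist link */
    size_t capacity;              /* rounded-up size actually allocated */
    int size_class;               /* bucket index, or -1 if not pooled */
    void *mem;                    /* malloc stands in for vkAllocateMemory */
} sketch_buffer;

typedef struct {
    sketch_buffer *free_head[SKETCH_NUM_CLASSES];
    size_t total_bytes;           /* assumed meaning: pooled capacity created so far */
    unsigned long allocs;         /* number of backing allocations performed */
} sketch_pool;

/* Smallest class whose capacity covers size, or -1 if size exceeds 2^23. */
static int sketch_class(size_t size)
{
    int c = 0;
    while (c < SKETCH_NUM_CLASSES &&
           ((size_t)1 << (SKETCH_MIN_SHIFT + c)) < size)
        c++;
    return c < SKETCH_NUM_CLASSES ? c : -1;
}

/* Pop from the bucket freelist on a hit; allocate fresh backing memory on a miss. */
static sketch_buffer *sketch_acquire(sketch_pool *p, size_t size)
{
    int c = sketch_class(size);
    if (c >= 0 && p->free_head[c]) {
        sketch_buffer *hit = p->free_head[c];
        p->free_head[c] = hit->next;
        hit->next = NULL;
        return hit;
    }
    size_t cap = c >= 0 ? (size_t)1 << (SKETCH_MIN_SHIFT + c) : size;
    sketch_buffer *b = malloc(sizeof *b);
    if (!b)
        return NULL;
    b->mem = malloc(cap);
    if (!b->mem) {
        free(b);
        return NULL;
    }
    b->next = NULL;
    b->capacity = cap;
    b->size_class = c;
    p->allocs++;
    if (c >= 0)
        p->total_bytes += cap;
    return b;
}

/* Pooled buffers go back on their freelist; oversize ones are freed at once. */
static void sketch_release(sketch_pool *p, sketch_buffer *b)
{
    if (!b)
        return;
    if (b->size_class < 0) {
        free(b->mem);
        free(b);
        return;
    }
    b->next = p->free_head[b->size_class];
    p->free_head[b->size_class] = b;
}

/* Backing memory of pooled buffers is only freed here, at teardown. */
static void sketch_destroy(sketch_pool *p)
{
    for (int c = 0; c < SKETCH_NUM_CLASSES; c++) {
        while (p->free_head[c]) {
            sketch_buffer *b = p->free_head[c];
            p->free_head[c] = b->next;
            free(b->mem);
            free(b);
        }
    }
    p->total_bytes = 0;
}
```

In this sketch a 1000-byte request and a later 600-byte request both land in the 1024-byte bucket, so the second acquire reuses the first buffer without touching the allocator.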

Size classes 2^8..2^23 (256 B to 8 MiB).  Oversize requests fall
through to the non-pooled path (vkAllocateMemory) on both acquire and
release; the pool stays correct and simply degenerates to the old
behaviour for those calls.
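
The size-class mapping can be sketched as follows. This is an illustration under stated assumptions, not the code in v3d_runner: pool_size_class() and pool_class_bytes() are hypothetical names, and the sketch assumes that requests below 256 B round up to the smallest class, which the commit message does not state.

```c
#include <assert.h>
#include <stddef.h>

/* Class bounds taken from the commit message: 2^8 .. 2^23 bytes. */
#define POOL_MIN_SHIFT   8u
#define POOL_MAX_SHIFT   23u
#define POOL_NUM_CLASSES (POOL_MAX_SHIFT - POOL_MIN_SHIFT + 1u)   /* 16 buckets */

/*
 * Hypothetical helper: bucket index in [0, 16) for a request of size bytes,
 * or -1 when the request exceeds 8 MiB and has to bypass the pool.
 * Assumption: anything smaller than 256 B uses bucket 0.
 */
static int pool_size_class(size_t size)
{
    unsigned shift = POOL_MIN_SHIFT;

    if (size > ((size_t)1 << POOL_MAX_SHIFT))
        return -1;
    while (((size_t)1 << shift) < size)
        shift++;
    return (int)(shift - POOL_MIN_SHIFT);
}

/* Capacity in bytes of a given bucket. */
static size_t pool_class_bytes(int cls)
{
    assert(cls >= 0 && cls < (int)POOL_NUM_CLASSES);
    return (size_t)1 << (POOL_MIN_SHIFT + (unsigned)cls);
}
```

Rounding up to a power of two wastes at most just under half of each buffer, in exchange for a constant number of buckets and an exact-fit freelist lookup.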

Migration: daedalus_core.c dispatch_*_qpu paths globally swap
create_buffer → acquire_buffer and destroy_buffer → release_buffer.
All five QPU dispatch functions (idct8 / lpf / mc_8h / cdef /
h264_deblock) now reuse buffers across calls.  test_api_idct stays
bit-exact (4096/4096 bytes on CPU/QPU/AUTO substrates on hertz).

Microbench (tests/bench_pool_overhead.c) on hertz (Pi 5,
6.18.29+rpt-rpi-2712, V3D 7.1):

  call 0:  434.89 us  (cold — 3x vkAllocateMemory)
  call 1:  100.06 us  (pool hit on all 3 buffers)
  steady-state:
    p50:    76.44 us
    p90:    90.52 us
    mean:   77.95 us
  first-call / steady-state ratio: 5.7x
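
For reference, summary figures of this kind can be derived from per-call timings roughly as sketched below. The helper names are hypothetical and the nearest-rank percentile definition is an assumption; bench_pool_overhead.c may compute its statistics differently. The reported 5.7x matches the cold call divided by the steady-state p50 (434.89 / 76.44).

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* qsort comparator for ascending doubles. */
static int cmp_double(const void *a, const void *b)
{
    double x = *(const double *)a;
    double y = *(const double *)b;
    return (x > y) - (x < y);
}

/*
 * Nearest-rank percentile (an assumed definition): sorts samples in place
 * and returns the value at rank ceil(pct/100 * n), for pct in 1..100.
 */
static double percentile_us(double *samples, size_t n, unsigned pct)
{
    assert(n > 0 && pct >= 1 && pct <= 100);
    qsort(samples, n, sizeof *samples, cmp_double);
    size_t rank = ((size_t)pct * n + 99) / 100;   /* integer ceiling */
    if (rank < 1)
        rank = 1;
    return samples[rank - 1];
}

/* Arithmetic mean of the samples. */
static double mean_us(const double *samples, size_t n)
{
    assert(n > 0);
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += samples[i];
    return sum / (double)n;
}

/* Cold-call latency divided by a steady-state latency. */
static double cold_ratio(double first_call_us, double steady_us)
{
    assert(steady_us > 0.0);
    return first_call_us / steady_us;
}
```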

The remaining ~76us steady-state is dominated by vkQueueWaitIdle +
shader execution + per-call descriptor-set update + command-buffer
allocation — addressed in follow-on tasks 161 (persistent cmdbuf)
and 162 (dmabuf import for dst, eliminates memcpy in/out).

Refs daedalus-fourier task #160.

2026-05-23 19:52:50 +02:00

..

Path B pivot + Phase 0-3 closed with first baseline numbers

2026-05-18 11:30:12 +00:00

.gitkeep

Path B pivot + Phase 0-3 closed with first baseline numbers

2026-05-18 11:30:12 +00:00

bench_concurrent_lpf8.c

Cycle 4 (LPF wd=8) closure: M1=100%, R=0.34, M4=+4.1%, PASS

2026-05-18 12:56:25 +00:00

bench_concurrent_lpf.c

Cycle 2 (LPF) closure: M1''=100%, R''=0.41, M4''=+6.9%, PASS

2026-05-18 12:39:26 +00:00

bench_concurrent_mc.c

Cycle 3 (MC interpolation) closure: M1'''=100%, R'''=0.067 RED, M4=-19.5%

2026-05-18 12:51:43 +00:00

bench_concurrent_mixed.c

Cycle 8 closed: H.264 deblock R8=0.061 RED, opportunistic helper

2026-05-18 14:44:21 +00:00

bench_concurrent.c

Phase 7 M4: mixed CPU+QPU beats pure 4-core NEON; project continues

2026-05-18 12:18:36 +00:00

bench_neon_cdef.c

Cycle 5 closed: CDEF QPU R5=0.116 ORANGE, opportunistic helper

2026-05-18 13:52:46 +00:00

bench_neon_h264deblock.c

Cycle 8 Phase 3 closed: H.264 deblock NEON = 92 Medge/s

2026-05-18 14:39:36 +00:00

bench_neon_h264idct4.c

Cycle 6 (H.264) opened — IDCT 4x4 Phase 1+3, M3 = 175 Mblock/s

2026-05-18 14:14:43 +00:00

bench_neon_h264idct8.c

Cycle 7 closed: H.264 IDCT 8x8 = 151 Mblock/s NEON, Phase 4 deferred

2026-05-18 14:16:42 +00:00

bench_neon_h264qpel_mc20.c

Cycle 9 closed: H.264 luma qpel mc20 = 131 Mblock/s NEON, CPU-only

2026-05-18 14:53:21 +00:00

bench_neon_idct.c

Path B pivot + Phase 0-3 closed with first baseline numbers

2026-05-18 11:30:12 +00:00

bench_neon_lpf8.c

Cycle 4 (LPF wd=8) closure: M1=100%, R=0.34, M4=+4.1%, PASS

2026-05-18 12:56:25 +00:00

bench_neon_lpf.c

Cycle 2 (deblocking) Phase 1-3: M3'' = 48.285 Medge/s baseline

2026-05-18 12:28:57 +00:00

bench_neon_mc.c

Cycle 3 (MC interpolation) closure: M1'''=100%, R'''=0.067 RED, M4=-19.5%

2026-05-18 12:51:43 +00:00

bench_pool_overhead.c

v3d_runner: buffer pool for QPU dispatch hot path

2026-05-23 19:52:50 +02:00

bench_v3d_cdef.c

Cycle 5 closed: CDEF QPU R5=0.116 ORANGE, opportunistic helper

2026-05-18 13:52:46 +00:00

bench_v3d_h264deblock.c

Cycle 8 closed: H.264 deblock R8=0.061 RED, opportunistic helper

2026-05-18 14:44:21 +00:00

bench_v3d_idct.c

Phase 6 (v1+v4 production) + Phase 7 closure: R = 0.92 ± 0.03 on hertz

2026-05-18 12:09:00 +00:00

bench_v3d_lpf8.c

Cycle 4 (LPF wd=8) closure: M1=100%, R=0.34, M4=+4.1%, PASS

2026-05-18 12:56:25 +00:00

bench_v3d_lpf.c

Cycle 2 (LPF) closure: M1''=100%, R''=0.41, M4''=+6.9%, PASS

2026-05-18 12:39:26 +00:00

bench_v3d_mc.c

Cycle 3 (MC interpolation) closure: M1'''=100%, R'''=0.067 RED, M4=-19.5%

2026-05-18 12:51:43 +00:00

bench_vulkan_dispatch.c

Path B pivot + Phase 0-3 closed with first baseline numbers

2026-05-18 11:30:12 +00:00

cdef_ref.c

Cycle 5 closed: CDEF QPU R5=0.116 ORANGE, opportunistic helper

2026-05-18 13:52:46 +00:00

h264_deblock_ref.c

Cycle 8 Phase 3 closed: H.264 deblock NEON = 92 Medge/s

2026-05-18 14:39:36 +00:00

h264_idct4_ref.c

Cycle 6 (H.264) opened — IDCT 4x4 Phase 1+3, M3 = 175 Mblock/s

2026-05-18 14:14:43 +00:00

h264_idct8_ref.c

Cycle 7 closed: H.264 IDCT 8x8 = 151 Mblock/s NEON, Phase 4 deferred

2026-05-18 14:16:42 +00:00

h264_qpel8_mc20_ref.c

Cycle 9 closed: H.264 luma qpel mc20 = 131 Mblock/s NEON, CPU-only

2026-05-18 14:53:21 +00:00

test_api_h264.c

Phase 8c: H.264 luma qpel mc20 through public API

2026-05-23 03:25:24 +02:00

test_api_idct.c

Phase 8: wire IDCT QPU dispatch through public API

2026-05-18 13:55:55 +00:00

test_api_lpf.c

Phase 8: wire LPF wd=4 + wd=8 QPU through public API

2026-05-18 13:57:25 +00:00

test_api_opportunistic_qpu.c

Phase 8b: opportunistic QPU paths through public API

2026-05-18 14:50:41 +00:00

vp9_idct8_ref.c

Path B pivot + Phase 0-3 closed with first baseline numbers

2026-05-18 11:30:12 +00:00

vp9_lpf8_ref.c

Cycle 4 (LPF wd=8) closure: M1=100%, R=0.34, M4=+4.1%, PASS

2026-05-18 12:56:25 +00:00

vp9_lpf_ref.c

Cycle 2 (deblocking) Phase 1-3: M3'' = 48.285 Medge/s baseline

2026-05-18 12:28:57 +00:00

vp9_mc_ref.c

Cycle 3 (MC interpolation) closure: M1'''=100%, R'''=0.067 RED, M4=-19.5%

2026-05-18 12:51:43 +00:00