daedalus-fourier/src/v3d_runner.h
marfrit d66f22f333 Phase 6 (v1+v4 production) + Phase 7 closure: R = 0.92 ± 0.03 on hertz
First QPU IDCT8 kernel running and bit-exact on V3D 7.1 via Mesa
v3dv compute. Five iterations through a Phase 7→Phase 4' loopback;
production kernel is v4.

New files:
- src/v3d_runner.{c,h}  — reusable Vulkan compute plumbing (instance,
                          V3D device picker, HOST_VISIBLE|COHERENT
                          SSBOs with mmap, compute pipeline from .spv,
                          enables storageBuffer{8,16}BitAccess)
- src/v3d_idct8.comp    — VP9 8x8 DCT_DCT IDCT add, v4 production:
                          256 invocations/WG, 2 blocks/subgroup
                          (no idle lanes), uint8 dst SSBO (race-free
                          per phase5 finding 5), unrolled writes
                          (no chained ternary), oob-flag pattern
                          (barrier-safe per phase5 finding 7)
- tests/bench_v3d_idct.c — M1' bit-exact gate + M2 throughput vs C ref
- docs/phase7.md         — full iteration journey + decision verdict

CMakeLists.txt updated to build the new shader, library, and bench
when DAEDALUS_BUILD_VULKAN=ON.

Iteration record (1920x1088 luma, 32640 blocks/dispatch, N=3):

  ver  change                              R       ns/block
  v1   first-light                         0.230   533
  v2   kill ternary + 2-blocks-per-sg      0.474   258
  v3   per-pass scope oN                   0.481   254  (noise)
  v4   WG 64 -> 256 invocations            0.947   129
  v5   packed uint32 coeff reads           0.938   130  (noise, reverted)
  v4 final N=3                             0.918 +/- 0.033

Bit-exactness 100.0000% across all iterations (10000-block sample
on 128x128, 32640-block sample on 1080p) against both the C
reference (tests/vp9_idct8_ref.c) and the vendored FFmpeg NEON
ff_vp9_idct_idct_8x8_add_neon.

Key learning over the Phase 5 review's prediction model: the
chained ternary was NOT a spill killer on V3D 7.1 (shaderdb
showed 0:0 spills:fills even in v1). The actual lever was
workgroup-size-driven latency hiding — going from 64 to 256
invocations doubled throughput with the same compiled code
(270 inst, 2 threads, 21 max-temps, 0 spills) because the
v3dv scheduler had 4x more in-flight work to overlap TMU
latency.

Verdict per phase1.md decision rules: YELLOW band (0.5 <= R < 1.0),
well clear of the lower bound and near the GREEN boundary. Phase 1 YELLOW rule:
add M4 (concurrent CPU+QPU throughput) before honest-close or
continue. M4 is the next measurement, not more shader tuning —
at R = 0.92 with all 4 A76 cores still 100% free for other work,
the question is whether the system aggregate beats pure 4-core
NEON.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 12:09:00 +00:00

97 lines
3.4 KiB
C

/*
 * v3d_runner: minimal Vulkan compute plumbing for V3D 7.1 on Pi 5.
 *
 * Factored out of tests/bench_vulkan_dispatch.c so successive kernel
 * benches can reuse the device/queue/buffer/pipeline machinery
 * without copy-paste. Kept deliberately small and concrete, with no
 * generality beyond what daedalus-fourier needs.
 *
 * License: BSD-2-Clause.
 */
#ifndef DAEDALUS_V3D_RUNNER_H
#define DAEDALUS_V3D_RUNNER_H

#include <stddef.h>
#include <stdint.h>
#include <vulkan/vulkan.h>

typedef struct v3d_runner v3d_runner;

/* Host-visible SSBO. .mapped is a CPU-side pointer to .size bytes. */
typedef struct {
    VkBuffer        buffer;
    VkDeviceMemory  memory;
    void           *mapped;
    size_t          size;
} v3d_buffer;

/* Compute pipeline + its descriptor set (one set per pipeline). */
typedef struct {
    VkPipeline             pipeline;
    VkPipelineLayout       layout;
    VkDescriptorSetLayout  ds_layout;
    VkDescriptorPool       pool;
    VkDescriptorSet        desc_set;
    uint32_t               n_ssbos;          /* storage-buffer bindings 0..n_ssbos-1 */
    uint32_t               push_const_size;  /* bytes; 0 means no push-constant range */
} v3d_pipeline;

/*
 * Create runner: Vulkan instance, V3D physical device, logical
 * device with storageBuffer{8,16}BitAccess features enabled,
 * compute queue, command pool.
 *
 * Returns NULL on failure (writes errors to stderr).
 */
v3d_runner *v3d_runner_create(void);
void        v3d_runner_destroy(v3d_runner *r);

/* Expose a few internals for code that wants direct vkCmd*. */
VkDevice      v3d_runner_device(v3d_runner *r);
VkQueue       v3d_runner_queue(v3d_runner *r);
uint32_t      v3d_runner_queue_family(v3d_runner *r);
VkCommandPool v3d_runner_cmd_pool(v3d_runner *r);
const char   *v3d_runner_device_name(v3d_runner *r);

/*
 * Storage buffer, HOST_VISIBLE | HOST_COHERENT, mapped on the
 * host side. The mapping persists for the lifetime of the buffer.
 *
 * Returns 0 on success, non-zero on failure.
 */
int  v3d_runner_create_buffer(v3d_runner *r, size_t size, v3d_buffer *out);
void v3d_runner_destroy_buffer(v3d_runner *r, v3d_buffer *buf);

/*
 * Compute pipeline from a SPIR-V file path. The descriptor-set
 * layout exposes `n_ssbos` storage buffer bindings at binding
 * indices 0..n_ssbos-1, all visible to the compute stage. A push
 * constant range of `push_const_size` bytes is added if non-zero.
 *
 * The single descriptor set is pre-allocated; bind buffers via
 * v3d_runner_bind_buffers().
 */
int  v3d_runner_create_pipeline(v3d_runner *r,
                                const char *spv_path,
                                uint32_t n_ssbos,
                                uint32_t push_const_size,
                                v3d_pipeline *out);
void v3d_runner_destroy_pipeline(v3d_runner *r, v3d_pipeline *p);

/*
 * Bind SSBOs to the pipeline's descriptor set. `bufs` must have
 * exactly `p->n_ssbos` entries, in binding order; `n` is the number
 * of entries in `bufs`. Idempotent; rebind freely between
 * dispatches if buffers change.
 */
int v3d_runner_bind_buffers(v3d_runner *r,
                            v3d_pipeline *p,
                            const v3d_buffer *bufs,
                            uint32_t n);

/* Allocate a primary command buffer from the runner's pool. */
VkCommandBuffer v3d_runner_alloc_cmdbuf(v3d_runner *r);

/*
 * Submit `cb` to the queue and block until it completes. This is
 * the operation a bench typically times. Returns 0 on success.
 */
int v3d_runner_submit_wait(v3d_runner *r, VkCommandBuffer cb);

#endif /* DAEDALUS_V3D_RUNNER_H */