Phase 6 (v1+v4 production) + Phase 7 closure: R = 0.92 ± 0.03 on hertz

First QPU IDCT8 kernel running and bit-exact on V3D 7.1 via Mesa v3dv
compute. Five iterations through a Phase 7→Phase 4' loopback;
production kernel is v4.

New files:
- src/v3d_runner.{c,h}: reusable Vulkan compute plumbing (instance,
  V3D device picker, HOST_VISIBLE|COHERENT SSBOs with mmap, compute
  pipeline from .spv, enables storageBuffer{8,16}BitAccess)
- src/v3d_idct8.comp: VP9 8x8 DCT_DCT IDCT add, v4 production:
  256 invocations/WG, 2 blocks/subgroup (no idle lanes), uint8 dst
  SSBO (race-free per phase5 finding 5), unrolled writes (no chained
  ternary), oob-flag pattern (barrier-safe per phase5 finding 7)
- tests/bench_v3d_idct.c: M1' bit-exact gate + M2 throughput vs C ref
- docs/phase7.md: full iteration journey + decision verdict

CMakeLists.txt updated to build the new shader, library, and bench
when DAEDALUS_BUILD_VULKAN=ON.

Iteration record (1920x1088 luma, 32640 blocks/dispatch, N=3):

  ver  change                           R       ns/block
  v1   first-light                      0.230   533
  v2   kill ternary + 2-blocks-per-sg   0.474   258
  v3   per-pass scope oN                0.481   254  (noise)
  v4   WG 64 -> 256 invocations         0.947   129
  v5   packed uint32 coeff reads        0.938   130  (noise, reverted)

  v4 final N=3                          0.918 +/- 0.033

Bit-exactness 100.0000% across all iterations (10000-block sample on
128x128, 32640-block sample on 1080p) against both the C reference
(tests/vp9_idct8_ref.c) and the vendored FFmpeg NEON
ff_vp9_idct_idct_8x8_add_neon.

Key learning over the Phase 5 review's prediction model: the chained
ternary was NOT a spill killer on V3D 7.1 (shaderdb showed 0:0
spills:fills even in v1). The actual lever was workgroup-size-driven
latency hiding: going from 64 to 256 invocations doubled throughput
with the same compiled code (270 inst, 2 threads, 21 max-temps,
0 spills) because the v3dv scheduler had 4x more in-flight work to
overlap TMU latency.

Verdict per phase1.md decision rules: YELLOW band (0.5 <= R < 1.0) by
a wide margin, near GREEN boundary.

Phase 1 YELLOW rule: add M4 (concurrent CPU+QPU throughput) before
honest-close or continue. M4 is the next measurement, not more shader
tuning: at R = 0.92 with all 4 A76 cores still 100% free for other
work, the question is whether the system aggregate beats pure 4-core
NEON.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
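
The figures in the iteration record can be cross-checked with a little arithmetic. The sketch below is not part of the commit. It assumes that the frame is tiled exactly by 8x8 blocks, that R is the CPU-reference ns/block divided by the QPU ns/block (an inference from the table, where R times ns/block stays near 122 in every row), and that the bands outside YELLOW are R >= 1.0 for GREEN and R < 0.5 for a band called RED here. Only YELLOW and GREEN are named in the message; the RED label and all function names are hypothetical.

```c
#include <assert.h>

/* Number of 8x8 blocks in a width x height plane. Assumes both
 * dimensions are multiples of 8, as 1920x1088 is. */
static unsigned blocks_in_plane(unsigned width, unsigned height)
{
    assert(width % 8u == 0 && height % 8u == 0);
    return (width / 8u) * (height / 8u);
}

/* Decision bands. Only the YELLOW range (0.5 <= R < 1.0) is stated
 * in the commit message; the other two are assumptions. */
typedef enum { BAND_RED, BAND_YELLOW, BAND_GREEN } band;

static band classify(double r)
{
    if (r >= 1.0) return BAND_GREEN;
    if (r >= 0.5) return BAND_YELLOW;
    return BAND_RED;
}

/* If R = cpu_ns / qpu_ns (assumed), the CPU reference cost implied
 * by one table row is R * qpu_ns. */
static double implied_cpu_ns(double r, double qpu_ns)
{
    return r * qpu_ns;
}
```

Under these assumptions, blocks_in_plane(1920, 1088) gives the 32640 blocks per dispatch quoted above, classify(0.918) lands in the YELLOW band, and implied_cpu_ns(0.947, 129.0) comes out at roughly 122 ns/block for the C reference.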
2026-05-18 12:09:00 +00:00
parent 71db72928f
commit d66f22f333
6 changed files with 1267 additions and 1 deletions
@@ -0,0 +1,96 @@
+/*
+ * v3d_runner — minimal Vulkan compute plumbing for V3D 7.1 on Pi 5.
+ *
+ * Factored out of tests/bench_vulkan_dispatch.c so successive kernel
+ * benches can reuse the device/queue/buffer/pipeline machinery
+ * without copy-paste. Kept deliberately small and concrete — no
+ * generality beyond what daedalus-fourier needs.
+ *
+ * License: BSD-2-Clause.
+ */
+#ifndef DAEDALUS_V3D_RUNNER_H
+#define DAEDALUS_V3D_RUNNER_H
+
+#include <stddef.h>
+#include <stdint.h>
+#include <vulkan/vulkan.h>
+
+typedef struct v3d_runner v3d_runner;
+
+/* Host-visible SSBO. .mapped is a CPU-side pointer to .size bytes. */
+typedef struct {
+    VkBuffer        buffer;
+    VkDeviceMemory  memory;
+    void           *mapped;
+    size_t          size;
+} v3d_buffer;
+
+/* Compute pipeline + its descriptor set (one set per pipeline). */
+typedef struct {
+    VkPipeline             pipeline;
+    VkPipelineLayout       layout;
+    VkDescriptorSetLayout  ds_layout;
+    VkDescriptorPool       pool;
+    VkDescriptorSet        desc_set;
+    uint32_t               n_ssbos;
+    uint32_t               push_const_size;
+} v3d_pipeline;
+
+/*
+ * Create runner: Vulkan instance, V3D physical device, logical
+ * device with storageBuffer{8,16}BitAccess features enabled,
+ * compute queue, command pool.
+ *
+ * Returns NULL on failure (writes errors to stderr).
+ */
+v3d_runner *v3d_runner_create(void);
+void        v3d_runner_destroy(v3d_runner *r);
+
+/* Expose a few internals for code that wants direct vkCmd*. */
+VkDevice         v3d_runner_device(v3d_runner *r);
+VkQueue          v3d_runner_queue(v3d_runner *r);
+uint32_t         v3d_runner_queue_family(v3d_runner *r);
+VkCommandPool    v3d_runner_cmd_pool(v3d_runner *r);
+const char      *v3d_runner_device_name(v3d_runner *r);
+
+/* Storage buffer, HOST_VISIBLE | HOST_COHERENT, mapped on the
+ * host side. The mapping persists for the lifetime of the buffer.
+ *
+ * Returns 0 on success, non-zero on failure.
+ */
+int  v3d_runner_create_buffer(v3d_runner *r, size_t size, v3d_buffer *out);
+void v3d_runner_destroy_buffer(v3d_runner *r, v3d_buffer *buf);
+
+/* Compute pipeline from a SPIR-V file path. The descriptor-set
+ * layout exposes `n_ssbos` storage buffer bindings at binding
+ * indices 0..n_ssbos-1, all visible to the compute stage. A push
+ * constant range of `push_const_size` bytes is added if non-zero.
+ *
+ * The single descriptor set is pre-allocated; bind buffers via
+ * v3d_runner_bind_buffers().
+ */
+int  v3d_runner_create_pipeline(v3d_runner *r,
+                                const char  *spv_path,
+                                uint32_t     n_ssbos,
+                                uint32_t     push_const_size,
+                                v3d_pipeline *out);
+void v3d_runner_destroy_pipeline(v3d_runner *r, v3d_pipeline *p);
+
+/* Bind SSBOs to the pipeline's descriptor set. `bufs` must have
+ * exactly `p->n_ssbos` entries, in binding order, and `n` is that
+ * entry count. Idempotent: rebind freely between dispatches.
+ */
+int  v3d_runner_bind_buffers(v3d_runner   *r,
+                             v3d_pipeline *p,
+                             const v3d_buffer *bufs,
+                             uint32_t      n);
+
+/* Allocate a primary command buffer from the runner's pool. */
+VkCommandBuffer v3d_runner_alloc_cmdbuf(v3d_runner *r);
+
+/* Submit `cb` to the queue and block until it completes. This is
+ * the call the benches time. Returns 0 on success.
+ */
+int v3d_runner_submit_wait(v3d_runner *r, VkCommandBuffer cb);
+
+#endif /* DAEDALUS_V3D_RUNNER_H */
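
The header says v3d_runner_create_pipeline() builds a pipeline from a SPIR-V file path but does not show how the file is read. The sketch below is one plausible pre-check a loader could run before handing the words to vkCreateShaderModule, which takes a byte size that must be a multiple of 4 and a pointer to 32-bit words. The helper name load_spirv is hypothetical and is not part of v3d_runner; the magic number 0x07230203 and the five-word header come from the SPIR-V specification.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define SPIRV_MAGIC        0x07230203u
#define SPIRV_HEADER_BYTES 20L  /* five 32-bit header words */

/* Hypothetical helper: read a whole .spv file into a malloc'd word
 * array. Returns NULL if the file is missing, shorter than a SPIR-V
 * header, not a multiple of 4 bytes, or lacks the host-endian magic
 * number. On success the caller owns the buffer and must free() it. */
static uint32_t *load_spirv(const char *path, size_t *size_bytes)
{
    FILE *f = fopen(path, "rb");
    if (!f)
        return NULL;

    if (fseek(f, 0, SEEK_END) != 0) {
        fclose(f);
        return NULL;
    }
    long len = ftell(f);
    if (len < SPIRV_HEADER_BYTES || len % 4 != 0) {
        fclose(f);
        return NULL;
    }
    rewind(f);

    uint32_t *words = malloc((size_t)len);
    if (!words) {
        fclose(f);
        return NULL;
    }

    size_t got = fread(words, 1, (size_t)len, f);
    fclose(f);
    if (got != (size_t)len || words[0] != SPIRV_MAGIC) {
        free(words);
        return NULL;
    }

    *size_bytes = (size_t)len;
    return words;
}
```

The returned pointer and byte count map directly onto the pCode and codeSize fields of VkShaderModuleCreateInfo. Failing early here gives a clearer error than a rejected shader module, for example when the build step that compiles src/v3d_idct8.comp did not run.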