1347fb961c
PR #36 reported a 4.30x QPU-over-CPU win for the H.264 1080p hot-path
sum. That number was a measurement artifact. This commit makes the
artifact impossible to reproduce by ANYONE running the bench again.

THE BUG
-------
v3d_runner read_spv() did fopen(spv_path, "rb") with no path search:
the caller passes a bare filename like "v3d_h264_idct4.spv" and fopen
resolves it relative to cwd. The cmake build puts SPVs in $builddir
(e.g. ~/src/daedalus-fourier/build/), but the bench (and
test_api_h264) were typically invoked from ~/src/daedalus-fourier/,
so fopen failed.

On failure read_spv printed perror and returned NULL; pipeline create
then returned -1; dispatch then returned -1; the bench loop ignored
the return value and timed the failure path. Each iter cost ~1-5 µs
(open + perror + return), which divided across 256 ops gave
~4-20 ns/op, looking convincingly like real-but-fast QPU work.

PR #36's "QPU 2.47 ns/op" for IDCT 4x4 was that artifact. PR #10's
much-slower "QPU 37.77 ms" measurement was REAL (SPV apparently found
that time, perhaps run from build/), so the artifact is what made it
look like the gap had closed. The gap never closed.

CORRECTED NUMBERS
-----------------
Run from hertz (Pi 5 V3D 7.1, 30 iters, 5 warmup) AFTER this commit:

  kernel             CPU ns/op   QPU ns/op   winner
  IDCT 4x4 luma          10.75      217.63   CPU 20.24x
  IDCT 8x8 luma          29.69      785.94   CPU 26.47x
  Deblock luma_v         17.63      467.42   CPU 26.51x
  Deblock luma_h         38.30      498.53   CPU 13.02x
  qpel mc20 (8x8)        30.17     1300.44   CPU 43.10x
  qpel mc02 (8x8)        17.69     1363.40   CPU 77.08x
  qpel mc22 (8x8)        71.60     1948.37   CPU 27.21x

1080p worst-case sum (IDCT4 + deblock luma + qpel mc22):

  CPU NEON only:    5.57 ms
  QPU only:       123.54 ms
  Ratio: CPU/QPU sum = 0.05x (QPU 22x SLOWER than CPU)

QPU is currently 13-77x slower per kernel. The post-buffer-pool /
post-persistent-cmdbuf dispatch overhead (tasks #160, #161) did NOT
close the gap with NEON. Whether those tasks helped at all needs
re-measurement: the previous "we saw a big win" reading was the same
artifact.

PR #36's commit-message claim "PR #10's verdict is reversed" is
withdrawn. PR #10 was right; PR #36 was wrong.

THE FIX
-------
Two changes:

1. v3d_runner: SPV search now tries, in order:
     - cwd                                   (legacy)
     - $DAEDALUS_SHADER_DIR                  (env override)
     - <readlink /proc/self/exe>/..          (binary-relative)
     - /opt/fourier/share/daedalus-fourier/  (Pi 5 install)
     - /usr/share/daedalus-fourier/          (system-wide)
   Found-anywhere succeeds silently. Found-nowhere prints one error
   naming all searched locations.

2. bench_h264_primitives: bench_fn now returns int. bench_ns does a
   single preflight call; if rc != 0 it prints
   "DISPATCH FAILED rc=N — kernel skipped" and bails on the kernel.
   Main loop counts QPU failures and exits 2 BEFORE printing the
   comparison table if any kernel failed, so the next person running
   this can't read fail-fast timings as substrate numbers.

POLICY IMPLICATIONS
-------------------
The QPU substrate decree (2026-05-23) was conceived as a policy choice
that overrides per-kernel measurement. With the corrected data the gap
is not a "fixable defect we'll close with one more optimization"; it is
an order of magnitude. Whether to keep the decree, soften it (auto =
QPU only where measured advantage), or revert is now a clear-eyed
decision for the user. This commit doesn't change the recipe table;
that's a separate question, taken on its own merits with this data in
hand.

Related: marfrit-packages PR #104 (libavcodec ctx flipped no_qpu →
qpu-capable) was justified by PR #36's artifact and should be
reverted; that revert lands in a follow-up to marfrit-packages.
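The search order listed under THE FIX can be sketched roughly as follows. This is a minimal illustration, not the actual v3d_runner code: spv_open_search() and exe_dir() are hypothetical names, the binary-relative entry is assumed to mean the directory that holds the executable, and /proc/self/exe makes the sketch Linux-only.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: directory holding the running binary.
 * Returns 0 on success, -1 on failure. Linux-specific. */
static int exe_dir(char *out, size_t out_sz)
{
    ssize_t n = readlink("/proc/self/exe", out, out_sz - 1);
    if (n <= 0) return -1;
    out[n] = '\0';                 /* readlink does not NUL-terminate */
    char *slash = strrchr(out, '/');
    if (!slash) return -1;
    *slash = '\0';                 /* drop the binary name, keep its dir */
    return 0;
}

/* Hypothetical stand-in for the SPV lookup: try each location in the
 * documented order and return the first file that opens. */
FILE *spv_open_search(const char *name)
{
    assert(name != NULL);

    /* 1. cwd (legacy): the name exactly as the caller passed it. */
    FILE *f = fopen(name, "rb");
    if (f) return f;

    char bindir[PATH_MAX];
    const char *dirs[4];
    int n = 0;

    const char *env = getenv("DAEDALUS_SHADER_DIR");       /* 2. env override */
    if (env && *env) dirs[n++] = env;
    if (exe_dir(bindir, sizeof(bindir)) == 0)               /* 3. binary-relative */
        dirs[n++] = bindir;
    dirs[n++] = "/opt/fourier/share/daedalus-fourier";     /* 4. Pi 5 install */
    dirs[n++] = "/usr/share/daedalus-fourier";             /* 5. system-wide */

    for (int i = 0; i < n; i++) {
        char path[PATH_MAX];
        int len = snprintf(path, sizeof(path), "%s/%s", dirs[i], name);
        if (len < 0 || (size_t)len >= sizeof(path)) continue;  /* too long */
        f = fopen(path, "rb");
        if (f) return f;           /* found anywhere: succeed silently */
    }

    /* Found nowhere: one error naming every location searched. */
    fprintf(stderr, "SPV '%s' not found; searched: cwd", name);
    for (int i = 0; i < n; i++) fprintf(stderr, ", %s", dirs[i]);
    fputc('\n', stderr);
    return NULL;
}
```

Keeping cwd first preserves the legacy behaviour for anyone who already runs from the build directory, while the later entries cover invocation from anywhere else.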
300 lines
12 KiB
C
/* SPDX-License-Identifier: BSD-2-Clause */
/* Needed for clock_gettime()/CLOCK_MONOTONIC under strict -std=c11
 * (CMAKE_C_EXTENSIONS=OFF). */
#define _POSIX_C_SOURCE 200809L
/*
 * bench_h264_primitives: latency baseline for the H.264 primitive
 * library landed across PRs #9–#35.
 *
 * Each kernel is exercised at a representative per-frame N for 1080p
 * (8160 MBs); the per-kernel total + ns/op + ms/frame are reported,
 * once per substrate (CPU NEON, QPU V3D7 compute). The QPU column
 * appears only when the host has a usable Vulkan device. When both
 * columns exist a CPU/QPU ratio is printed; that's the per-kernel
 * data the QPU-substrate decree (2026-05-23) deliberately overrides
 * but which is still useful to track over time as dispatch overhead
 * shrinks (buffer pool, persistent cmdbuf, dmabuf import; tasks 160-162).
 *
 * NOT a ctest: produces wall-time numbers, and the timings themselves
 * don't pass/fail. It does exit 2 if any QPU dispatch fails, so that
 * fail-fast timings are never reported as measurements.
 *
 * Invoke: ./build/bench_h264_primitives [iters [warmup]]
 * (default iters = 50, warmup = 5)
 */

#include "daedalus.h"

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

static uint64_t xs64_state = 0xfeedface5a5a5a5aULL;
static uint64_t xs64(void) {
    uint64_t x = xs64_state;
    x ^= x << 13; x ^= x >> 7; x ^= x << 17;
    return xs64_state = x;
}

static double now_ms(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1000.0 + ts.tv_nsec / 1.0e6;
}

/* Per-1080p-frame counts (8160 MBs at 1920x1088). */
#define MBS_1080P 8160

/* Standard benchmark loop. fn() is called n times per iteration.
 *
 * fn() now returns the dispatch's int rc. A single preflight call is
 * made before the hot loop; if rc != 0 (which on the QPU substrate
 * almost always means "SPV not found via any search path"), bench_ns
 * returns -1 and the caller must NOT report the kernel as measured.
 *
 * Without this, a missing SPV makes every dispatch fail fast at the
 * cost of one fprintf+open call (~1-5 µs), and the loop times that
 * cost as if it were real QPU work, producing absurdly small ns/op
 * numbers that look like a QPU speedup. This is exactly what made
 * PR #36's bench numbers a measurement artifact. */
typedef int (*bench_fn)(void);

static double bench_ns(const char *name, int iters, int warmup,
                       int ops_per_iter, bench_fn fn)
{
    int rc = fn();
    if (rc != 0) {
        printf(" %-32s DISPATCH FAILED rc=%d — kernel skipped\n", name, rc);
        return -1;
    }
    for (int i = 0; i < warmup; i++) fn();
    double t0 = now_ms();
    for (int i = 0; i < iters; i++) fn();
    double t1 = now_ms();
    double total_ms = (t1 - t0);
    double ns_per_op = (total_ms * 1e6) / ((double) iters * ops_per_iter);
    printf(" %-32s %10.2f ns/op (%d iters x %d ops)\n",
           name, ns_per_op, iters, ops_per_iter);
    return ns_per_op;
}

/* ---- Per-kernel scaffolding. Each section sets up the buffers +
 * meta, then defines a static fn() that calls the corresponding
 * dispatch with a representative N. The substrate is read from the
 * global g_sub so the same fn() can be re-driven with CPU then QPU. */

static daedalus_ctx *ctx;
static daedalus_substrate g_sub = DAEDALUS_SUBSTRATE_CPU;

/* --- IDCT 4x4 luma: N = 16 blocks per MB. Bench with 1024 blocks
 * per call (64 MBs worth). Per-MB the dispatch overhead is the
 * same regardless of N; we want ns per block. */
static int16_t idct4_coeffs[1024 * 16];
static daedalus_h264_block_meta idct4_meta[1024];
static uint8_t idct_dst[64 * 4 * 16 * 16]; /* 64 MB-rows × ... */

static int bench_idct4(void) {
    return daedalus_dispatch_h264_idct4(ctx, g_sub,
        idct_dst, 64*16, idct4_coeffs, 1024, idct4_meta);
}

/* --- IDCT 8x8 luma: 256 8x8 blocks per call. */
static int16_t idct8_coeffs[256 * 64];
static daedalus_h264_block_meta idct8_meta[256];

static int bench_idct8(void) {
    return daedalus_dispatch_h264_idct8(ctx, g_sub,
        idct_dst, 64*16, idct8_coeffs, 256, idct8_meta);
}

/* --- Deblock luma_v (cycle 8 baseline; M3 path). */
static daedalus_h264_deblock_meta deblock_meta[256];
static uint8_t deblock_dst[256 * 16 * 16];

static int bench_deblock_v(void) {
    return daedalus_dispatch_h264_deblock_luma_v(ctx, g_sub,
        deblock_dst, 16, 256, deblock_meta);
}

static int bench_deblock_h(void) {
    return daedalus_dispatch_h264_deblock_luma_h(ctx, g_sub,
        deblock_dst, 16, 256, deblock_meta);
}

/* --- qpel mc20 + mc02 + mc22 (the H/V/HV anchors). */
static uint8_t qpel_src[256 * 16 * 16];
static uint8_t qpel_dst[256 * 16 * 16];
static daedalus_h264_qpel_meta qpel_meta[256];

static int bench_qpel_mc20(void) {
    return daedalus_dispatch_h264_qpel_mc20(ctx, g_sub,
        qpel_dst, qpel_src, 16, 256, qpel_meta);
}
static int bench_qpel_mc02(void) {
    return daedalus_dispatch_h264_qpel_mc02(ctx, g_sub,
        qpel_dst, qpel_src, 16, 256, qpel_meta);
}
static int bench_qpel_mc22(void) {
    return daedalus_dispatch_h264_qpel_mc22(ctx, g_sub,
        qpel_dst, qpel_src, 16, 256, qpel_meta);
}

/* ---- One row of bench output:
 * - kernel name + N
 * - CPU ns/op
 * - QPU ns/op (or "n/a" if Vulkan absent)
 * - winner plus its speedup factor (slower ns/op divided by faster
 *   ns/op, so always >= 1; "n/a" when there is no QPU measurement) */
struct row {
    const char *name;
    int n_per_call;
    bench_fn fn;
    double cpu_ns;
    double qpu_ns; /* -1 if not measured */
    int frame_n;   /* count per 1080p frame */
};

static struct row rows[] = {
    {"IDCT 4x4 luma",   1024, bench_idct4,     0, -1, MBS_1080P * 16},
    {"IDCT 8x8 luma",    256, bench_idct8,     0, -1, MBS_1080P * 4},
    {"Deblock luma_v",   256, bench_deblock_v, 0, -1, MBS_1080P * 4},
    {"Deblock luma_h",   256, bench_deblock_h, 0, -1, MBS_1080P * 4},
    {"qpel mc20 (8x8)",  256, bench_qpel_mc20, 0, -1, MBS_1080P * 4},
    {"qpel mc02 (8x8)",  256, bench_qpel_mc02, 0, -1, MBS_1080P * 4},
    {"qpel mc22 (8x8)",  256, bench_qpel_mc22, 0, -1, MBS_1080P * 4},
};
#define N_ROWS ((int)(sizeof(rows)/sizeof(rows[0])))

int main(int argc, char **argv)
{
    int iters = argc > 1 ? atoi(argv[1]) : 50;
    int warmup = argc > 2 ? atoi(argv[2]) : 5;
    /* atoi() yields 0 on garbage; iters == 0 would make ns/op 0/0. */
    if (iters < 1 || warmup < 0) {
        fprintf(stderr, "usage: %s [iters >= 1 [warmup >= 0]]\n", argv[0]);
        return 1;
    }

    ctx = daedalus_ctx_create();
    if (!ctx) {
        fprintf(stderr, "ctx create failed (Vulkan?)\n");
        return 1;
    }
    int has_qpu = daedalus_ctx_has_qpu(ctx);

    /* Pre-fill all input buffers with random data so the NEON inner
     * loops see realistic memory access patterns. */
    for (size_t i = 0; i < sizeof(idct4_coeffs)/2; i++)
        idct4_coeffs[i] = (int16_t)((int)(xs64() % 1024) - 512);
    for (size_t i = 0; i < sizeof(idct8_coeffs)/2; i++)
        idct8_coeffs[i] = (int16_t)((int)(xs64() % 1024) - 512);
    for (size_t i = 0; i < sizeof(qpel_src); i++) qpel_src[i] = (uint8_t)(xs64() & 0xff);

    /* IDCT meta. */
    for (size_t i = 0; i < 1024; i++)
        idct4_meta[i].dst_off = (uint32_t)((i / 16) * 64 + (i % 16) * 4);
    for (size_t i = 0; i < 256; i++)
        idct8_meta[i].dst_off = (uint32_t)((i / 8) * 64 + (i % 8) * 8);

    /* Deblock meta: edge offsets within 256 16x16 tiles. */
    for (size_t i = 0; i < 256; i++) {
        deblock_meta[i].dst_off = (uint32_t)(i * 256 + 4 * 16);
        deblock_meta[i].alpha = 30;
        deblock_meta[i].beta = 10;
        for (int s = 0; s < 4; s++) deblock_meta[i].tc0[s] = (int8_t)(s + 1);
    }

    /* qpel meta. */
    for (size_t i = 0; i < 256; i++) {
        qpel_meta[i].src_off = (uint32_t)(i * 256 + 3 * 16 + 3);
        qpel_meta[i].dst_off = (uint32_t)(i * 256 + 3 * 16 + 3);
    }

    printf("bench_h264_primitives: %d iters (%d warmup)\n", iters, warmup);
    printf(" ctx has_qpu=%d (CPU pass always runs; QPU pass skipped without Vulkan)\n\n", has_qpu);

    /* Pass 1: CPU NEON. */
    g_sub = DAEDALUS_SUBSTRATE_CPU;
    printf("== CPU NEON ==\n");
    for (int i = 0; i < N_ROWS; i++)
        rows[i].cpu_ns = bench_ns(rows[i].name, iters, warmup, rows[i].n_per_call, rows[i].fn);

    /* Pass 2: QPU compute (if available). */
    int qpu_failures = 0;
    if (has_qpu) {
        g_sub = DAEDALUS_SUBSTRATE_QPU;
        printf("\n== QPU V3D7 compute ==\n");
        for (int i = 0; i < N_ROWS; i++) {
            rows[i].qpu_ns = bench_ns(rows[i].name, iters, warmup, rows[i].n_per_call, rows[i].fn);
            if (rows[i].qpu_ns < 0) qpu_failures++;
        }
        if (qpu_failures) {
            fprintf(stderr,
                "\nbench_h264_primitives: %d of %d QPU dispatches failed.\n"
                " Almost always means SPV files were not found via any of:\n"
                " cwd / $DAEDALUS_SHADER_DIR / binary-dir /\n"
                " /opt/fourier/share/daedalus-fourier / /usr/share/daedalus-fourier\n"
                " Set DAEDALUS_SHADER_DIR=<path> or run from a dir where the\n"
                " .spv files exist (e.g. the cmake build dir).\n",
                qpu_failures, N_ROWS);
            return 2;
        }
    }

    /* Summary table: both substrates side by side. */
    printf("\n== Per-kernel comparison ==\n");
    printf(" %-24s %12s %12s %8s %7s\n",
           "kernel", "CPU ns/op", "QPU ns/op", "winner", "ms/frame");
    for (int i = 0; i < N_ROWS; i++) {
        double cpu_ms = rows[i].cpu_ns * rows[i].frame_n / 1e6;
        double qpu_ms = rows[i].qpu_ns > 0 ? rows[i].qpu_ns * rows[i].frame_n / 1e6 : -1;
        const char *winner;
        char ratio[16];
        if (rows[i].qpu_ns <= 0) {
            winner = "CPU"; /* QPU n/a */
            snprintf(ratio, sizeof(ratio), "n/a");
        } else if (rows[i].cpu_ns < rows[i].qpu_ns) {
            winner = "CPU";
            snprintf(ratio, sizeof(ratio), "%.2fx", rows[i].qpu_ns / rows[i].cpu_ns);
        } else {
            winner = "QPU";
            snprintf(ratio, sizeof(ratio), "%.2fx", rows[i].cpu_ns / rows[i].qpu_ns);
        }
        char qpu_field[16];
        if (rows[i].qpu_ns > 0) snprintf(qpu_field, sizeof(qpu_field), "%.2f", rows[i].qpu_ns);
        else snprintf(qpu_field, sizeof(qpu_field), "n/a");
        char ms_field[24];
        if (qpu_ms > 0)
            snprintf(ms_field, sizeof(ms_field), "%.2f/%.2f", cpu_ms, qpu_ms);
        else
            snprintf(ms_field, sizeof(ms_field), "%.2f/n/a", cpu_ms);
        printf(" %-24s %12.2f %12s %3s %s %s\n",
               rows[i].name, rows[i].cpu_ns, qpu_field, winner, ratio, ms_field);
    }

    /* Per-frame budget summary at 1080p (8160 MBs). */
    double cpu_idct4 = rows[0].cpu_ns * MBS_1080P * 16 / 1e6;
    double cpu_debl = (rows[2].cpu_ns + rows[3].cpu_ns) * MBS_1080P * 4 / 1e6;
    double cpu_mc = rows[6].cpu_ns * MBS_1080P * 4 / 1e6; /* mc22 worst-case */
    double cpu_sum = cpu_idct4 + cpu_debl + cpu_mc;

    printf("\n== Projected 1080p worst-case (CPU NEON only) ==\n");
    printf(" IDCT 4x4 + deblock luma + qpel mc22: %.2f ms (30fps deadline 33.33)\n", cpu_sum);
    printf(" Margin: %+.2f ms\n", 33.33 - cpu_sum);

    if (has_qpu) {
        double qpu_idct4 = rows[0].qpu_ns * MBS_1080P * 16 / 1e6;
        double qpu_debl = (rows[2].qpu_ns + rows[3].qpu_ns) * MBS_1080P * 4 / 1e6;
        double qpu_mc = rows[6].qpu_ns * MBS_1080P * 4 / 1e6;
        double qpu_sum = qpu_idct4 + qpu_debl + qpu_mc;
        printf("\n== Projected 1080p worst-case (QPU V3D7 compute only) ==\n");
        printf(" IDCT 4x4 + deblock luma + qpel mc22: %.2f ms (30fps deadline 33.33)\n", qpu_sum);
        printf(" Margin: %+.2f ms\n", 33.33 - qpu_sum);
        printf("\n CPU vs QPU sum ratio: %.2fx (>1 means QPU wins)\n",
               qpu_sum > 0 ? cpu_sum / qpu_sum : 0.0);
    }

    printf("\n(NOT included: chroma deblock, chroma IDCT, intra prediction,\n");
    printf(" CABAC/CAVLC entropy. These bench numbers are a budget LOWER\n");
    printf(" bound; the real decode stack adds 20-40%% on top.\n");
    printf(" Per-kernel substrate decisions belong in daedalus_core.c recipe\n");
    printf(" table; the QPU substrate decree (2026-05-23) keeps everything\n");
    printf(" on QPU regardless of these numbers as a policy choice.)\n");

    daedalus_ctx_destroy(ctx);
    return 0;
}
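The per-frame projection at the end of main() can be checked by hand against the corrected table in the commit message. The sketch below is a standalone restatement of that arithmetic, not part of the bench; struct kernel_ns, frame_ms() and implied_call_us() are hypothetical names introduced only for illustration, and the per-MB multipliers (16 IDCT 4x4 blocks, 4 deblock and 4 mc22 units) are the ones main() uses.

```c
#include <assert.h>

#define MBS_1080P 8160   /* macroblocks in a 1920x1088 frame */

/* Hypothetical container for the four ns/op figures the sum uses. */
struct kernel_ns {
    double idct4;      /* IDCT 4x4 luma   */
    double deblock_v;  /* Deblock luma_v  */
    double deblock_h;  /* Deblock luma_h  */
    double mc22;       /* qpel mc22 (8x8) */
};

/* Worst-case ms per 1080p frame, using the same multipliers as main():
 * 16 IDCT blocks per MB, 4 deblock units (v + h) and 4 mc22 units per MB. */
double frame_ms(const struct kernel_ns *k)
{
    assert(k != 0);
    double idct4 = k->idct4 * MBS_1080P * 16 / 1e6;
    double debl  = (k->deblock_v + k->deblock_h) * MBS_1080P * 4 / 1e6;
    double mc    = k->mc22 * MBS_1080P * 4 / 1e6;
    return idct4 + debl + mc;
}

/* Inverse of bench_ns's ns/op formula: wall time of ONE fn() call in
 * microseconds, given the reported ns/op and the ops claimed per call. */
double implied_call_us(double ns_per_op, int ops_per_iter)
{
    return ns_per_op * ops_per_iter / 1000.0;
}
```

Feeding in the corrected table (CPU 10.75 / 17.63 / 38.30 / 71.60 and QPU 217.63 / 467.42 / 498.53 / 1948.37 ns/op) gives roughly 5.57 ms and 123.54 ms, matching the commit message. Applying implied_call_us() to PR #36's 2.47 ns/op at 1024 ops per call gives about 2.5 µs per call, which sits inside the 1-5 µs cost of the failure path rather than anything a real dispatch could achieve.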