daedalus-fourier/src/v3d_h264_idct4.comp
claude-noether 65bd5c3fe3 cycle 6: V3D shader for H.264 IDCT 4x4 (first cycle-6 QPU dispatch)
Per the QPU-default substrate decree of 2026-05-23, cycle 6 (H.264
IDCT 4x4 + add) was the highest-priority H.264 kernel to flip
from NEON-only to QPU-capable.  It has the same shape as VP9 IDCT
8x8 (cycle 1), a two-pass butterfly with a shared-memory transpose,
but at 4x4 scale: 4 lanes per block, 16 blocks per WG.
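
For illustration, the 4-lanes-per-block, 16-blocks-per-WG layout can
be modelled on the host as below.  This is a sketch of the index
arithmetic only; lane_map_t, map_lane and workgroups_for are
hypothetical names, not functions in daedalus_core.c.

```c
#include <stdint.h>

enum {
    LANES_PER_BLOCK = 4,
    BLOCKS_PER_WG   = 16,
    WG_SIZE         = LANES_PER_BLOCK * BLOCKS_PER_WG  /* 64 invocations */
};

/* Hypothetical helper type: where one global invocation lands. */
typedef struct {
    uint32_t block_idx;  /* global 4x4 block handled by this lane */
    uint32_t k;          /* lane within the block, 0..3 */
    int      oob;        /* nonzero if the block is past n_blocks */
} lane_map_t;

/* Same decomposition as the shader: gid -> (workgroup, block, lane). */
static lane_map_t map_lane(uint32_t gid, uint32_t n_blocks)
{
    lane_map_t m;
    uint32_t wg_id      = gid / WG_SIZE;
    uint32_t lane_in_wg = gid % WG_SIZE;

    m.block_idx = wg_id * BLOCKS_PER_WG + lane_in_wg / LANES_PER_BLOCK;
    m.k         = lane_in_wg % LANES_PER_BLOCK;
    m.oob       = m.block_idx >= n_blocks;
    return m;
}

/* Workgroups needed to cover n_blocks, rounding up to a whole WG. */
static uint32_t workgroups_for(uint32_t n_blocks)
{
    return (n_blocks + BLOCKS_PER_WG - 1u) / BLOCKS_PER_WG;
}
```

With n_blocks = 17 this gives two workgroups; the second carries one
live block and 15 out-of-bounds ones, whose lanes skip the loads and
stores but still reach the barrier.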

What's added:

  - src/v3d_h264_idct4.comp: GLSL compute shader implementing
    the H.264 §8.5.12.2 1D butterfly twice (row pass then column
    pass), with (val + 32) >> 6 rounding and clip-to-u8 add to
    dst.  Block memory layout is column-major (matches FFmpeg
    `ff_h264_idct_add_neon` convention).

  - CMakeLists: glslang rule + install entry for v3d_h264_idct4.spv.

  - dispatch_h264_idct4_qpu() in daedalus_core.c: lazy pipeline
    init, 3 SSBOs (coeffs / dst / meta as uvec4), push-constant
    (n_blocks, dst_stride), 16 blocks per WG dispatch.  Matches
    the existing dispatch_*_qpu patterns; uses
    v3d_runner_create_buffer / destroy_buffer (will swap to
    pool API once PR #6 lands).

  - daedalus_dispatch_h264_idct4() replaces ROUTE_CPU_ONLY with
    the same CPU/QPU substrate switch the deblock dispatch uses.

  - daedalus_recipe_substrate_for(H264_IDCT4) returns QPU now
    that the shader exists.
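
As a reading aid for the shader, the arithmetic it implements can be
written as a scalar C sketch.  This is not FFmpeg's source and not
the project's C reference; idct4_1d_ref, clip_u8 and
h264_idct4_add_ref are hypothetical names.  It assumes the compiler
implements >> on negative ints as an arithmetic shift, which is what
GLSL does for signed ints and what mainstream C compilers do.

```c
#include <stdint.h>

/* Clamp an int to the 0..255 range of a u8 pixel. */
static uint8_t clip_u8(int v)
{
    return (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
}

/* 1D butterfly: same four intermediate terms as the shader. */
static void idct4_1d_ref(const int d[4], int o[4])
{
    int e = d[0] + d[2];
    int f = d[0] - d[2];
    int g = (d[1] >> 1) - d[3];
    int h = d[1] + (d[3] >> 1);

    o[0] = e + h;
    o[1] = f + g;
    o[2] = f - g;
    o[3] = e - h;
}

/* block is column-major: block[c*4 + r] is the coefficient at (r, c). */
static void h264_idct4_add_ref(uint8_t *dst, const int16_t block[16],
                               int stride)
{
    int tmp[4][4];

    /* Row pass: transform each row, store it as a row of tmp. */
    for (int r = 0; r < 4; r++) {
        int d[4];
        for (int c = 0; c < 4; c++)
            d[c] = block[c * 4 + r];
        idct4_1d_ref(d, tmp[r]);
    }

    /* Column pass: transform each column, round, add, clip. */
    for (int c = 0; c < 4; c++) {
        int s[4], o[4];
        for (int r = 0; r < 4; r++)
            s[r] = tmp[r][c];
        idct4_1d_ref(s, o);
        for (int r = 0; r < 4; r++) {
            uint8_t *p = &dst[r * stride + c];
            *p = clip_u8(*p + ((o[r] + 32) >> 6));
        }
    }
}
```

The intermediate tmp array plays the role of tmp_shared in the
shader: rows are written in the first pass and columns are read in
the second, which is the transpose.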

Verification on hertz (Pi 5 + V3D 7.1):

  $ ./test_api_h264
  === Phase 8a API smoke: H.264 kernels via recipe dispatch ===
    H264_IDCT4 recipe substrate:      2 (1=CPU, 2=QPU)
    H264_IDCT8 recipe substrate:      1
    H264_DEBLOCK_LV recipe substrate: 2
    H264_QPEL_MC20 recipe substrate:  1
    H.264 IDCT 4x4: 2048/2048 bytes bit-exact (100.0000%)   ← QPU
    H.264 IDCT 8x8: 2048/2048 bytes bit-exact
    H.264 deblock luma v: 2048/2048 bytes bit-exact
    H.264 qpel mc20: 1024/1024 bytes bit-exact

The AUTO-substrate path now picks QPU for H.264 IDCT 4x4, and
the output is bit-exact against the C reference (which matches
the output of the NEON .S code by construction, since both derive
from the same FFmpeg upstream).
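
The "bytes bit-exact" lines above count output bytes that match the
reference.  A check of that kind could look like the sketch below;
it is not the test_api_h264 source, and count_matching_bytes and
report_bit_exact are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Number of positions where the two buffers hold the same byte. */
static size_t count_matching_bytes(const uint8_t *ref, const uint8_t *out,
                                   size_t n)
{
    size_t same = 0;
    for (size_t i = 0; i < n; i++)
        same += (ref[i] == out[i]);
    return same;
}

/* Prints one summary line; returns 1 only if every byte matches. */
static int report_bit_exact(const char *label, const uint8_t *ref,
                            const uint8_t *out, size_t n)
{
    size_t same = count_matching_bytes(ref, out, n);
    double pct  = n ? 100.0 * (double)same / (double)n : 100.0;

    printf("  %s: %zu/%zu bytes bit-exact (%.4f%%)\n", label, same, n, pct);
    return same == n;
}
```

Returning a pass/fail flag lets a smoke test fail on any mismatch
while the printed count still shows how far off a bad run was.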

Remaining work in task #165 (which covers cycles 6, 7 and 9):
  - cycle 7: H.264 IDCT 8x8 (same template shape; 8 lanes per
    block, fewer blocks per WG)
  - cycle 9: H.264 luma qpel mc20 (different shape: a 6-tap MC
    filter, not a transform)

This commit lands the cycle-6 piece of task #165.
2026-05-23 20:06:20 +02:00

// daedalus-fourier — H.264 4x4 inverse integer transform + add, V3D 7.1.
//
// H.264 spec §8.5.12.2 (transformation process for residual 4x4
// blocks). Pure integer arithmetic, no trig constants (unlike VP9
// IDCT 8x8). Row pass first, column pass second; round (+32) >> 6,
// add to dst, clip to u8.
//
// Block memory layout: COLUMN-MAJOR. block[c*4 + r] = coefficient at
// (row r, column c). Matches FFmpeg `ff_h264_idct_add_neon`.
//
// Workgroup layout: 64 invocations = 4 lanes/block × 16 blocks/WG.
//   - row pass:    lane k (0..3) reads row k of the block (4 coefficients,
//                  one from each column), runs the butterfly, writes 4
//                  outputs to one row of tmp_shared.
//   - column pass: lane k reads column k of tmp_shared (4 rows),
//                  runs the butterfly, writes 4 outputs to dst as
//                  column k at rows 0..3.
//
// shared = 16 × 16 × 4 B = 1 KiB. Well under V3D's 16 KiB limit.
//
// License: BSD-2-Clause.
#version 450
#extension GL_EXT_shader_8bit_storage : require
#extension GL_EXT_shader_16bit_storage : require
#extension GL_EXT_shader_explicit_arithmetic_types : require
layout(local_size_x = 64, local_size_y = 1, local_size_z = 1) in;
layout(binding = 0) readonly buffer Coeffs {
    int16_t coeffs[];   // N × 16 column-major
} u_coeffs;

layout(binding = 1) buffer Dst {
    uint8_t dst[];      // H × stride bytes (caller-provided base)
} u_dst;

layout(binding = 2) readonly buffer Meta {
    uvec4 meta[];       // .x = dst_off (byte offset into u_dst.dst)
} u_meta;

layout(push_constant) uniform PC {
    uint n_blocks;
    uint dst_stride_u8;
    uint _pad0, _pad1;
} pc;

// 16 blocks per WG × 16 ints per block = 256 ints = 1 KiB shared.
shared int tmp_shared[16 * 16];

// 1D butterfly per H.264 §8.5.12.2. d0..d3 in, o0..o3 out.
void idct4_1d(int d0, int d1, int d2, int d3,
              out int o0, out int o1, out int o2, out int o3)
{
    int e = d0 + d2;
    int f = d0 - d2;
    int g = (d1 >> 1) - d3;
    int h = d1 + (d3 >> 1);
    o0 = e + h;
    o1 = f + g;
    o2 = f - g;
    o3 = e - h;
}

void main()
{
    // Lane decomposition: local_size 64 = 16 blocks × 4 lanes/block.
    uint gid         = gl_GlobalInvocationID.x;
    uint wg_id       = gid / 64u;
    uint lane_in_wg  = gid & 63u;
    uint block_local = lane_in_wg >> 2;   // 0..15
    uint k           = lane_in_wg & 3u;   // 0..3
    uint block_idx   = wg_id * 16u + block_local;
    bool oob         = (block_idx >= pc.n_blocks);

    // ---- Row pass --------------------------------------------------
    // lane k handles row r=k. Reads block[c*4 + k] for c=0..3 (one
    // element from each column at fixed row).
    if (!oob) {
        uint base = block_idx * 16u;
        int d0 = int(u_coeffs.coeffs[base + 0u * 4u + k]);
        int d1 = int(u_coeffs.coeffs[base + 1u * 4u + k]);
        int d2 = int(u_coeffs.coeffs[base + 2u * 4u + k]);
        int d3 = int(u_coeffs.coeffs[base + 3u * 4u + k]);
        int o0, o1, o2, o3;
        idct4_1d(d0, d1, d2, d3, o0, o1, o2, o3);
        // Write row k of tmp_shared[block_local].
        uint tbase = block_local * 16u + k * 4u;
        tmp_shared[tbase + 0u] = o0;
        tmp_shared[tbase + 1u] = o1;
        tmp_shared[tbase + 2u] = o2;
        tmp_shared[tbase + 3u] = o3;
    }
    barrier();

    // ---- Column pass ----------------------------------------------
    // lane k handles column c=k. Reads tmp[r][k] for r=0..3.
    if (!oob) {
        uint tbase = block_local * 16u;
        int s0 = tmp_shared[tbase + 0u * 4u + k];
        int s1 = tmp_shared[tbase + 1u * 4u + k];
        int s2 = tmp_shared[tbase + 2u * 4u + k];
        int s3 = tmp_shared[tbase + 3u * 4u + k];
        int o0, o1, o2, o3;
        idct4_1d(s0, s1, s2, s3, o0, o1, o2, o3);
        // Column k at rows 0..3 of dst, offset by meta.x (dst_off).
        uint dst_off = u_meta.meta[block_idx].x;
        uint stride  = pc.dst_stride_u8;
        uint a0 = dst_off + 0u * stride + k;
        uint a1 = dst_off + 1u * stride + k;
        uint a2 = dst_off + 2u * stride + k;
        uint a3 = dst_off + 3u * stride + k;
        int p0 = int(u_dst.dst[a0]);
        int p1 = int(u_dst.dst[a1]);
        int p2 = int(u_dst.dst[a2]);
        int p3 = int(u_dst.dst[a3]);
        u_dst.dst[a0] = uint8_t(clamp(p0 + ((o0 + 32) >> 6), 0, 255));
        u_dst.dst[a1] = uint8_t(clamp(p1 + ((o1 + 32) >> 6), 0, 255));
        u_dst.dst[a2] = uint8_t(clamp(p2 + ((o2 + 32) >> 6), 0, 255));
        u_dst.dst[a3] = uint8_t(clamp(p3 + ((o3 + 32) >> 6), 0, 255));
    }
}