cycle 9: V3D shader for H.264 luma qpel mc20 — closes 9/9 QPU coverage #8
Reference in New Issue
Block a user
Delete Branch "noether/v3d-shader-h264-qpel-mc20"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Closes the QPU-default substrate campaign per the 2026-05-23 decree — every daedalus-fourier kernel that can be done in QPU is now done in QPU. Cycle 9 is the last piece: 6-tap horizontal half-pel luma MC, H.264 §8.4.2.2.1.
Shader
src/v3d_h264_qpel_mc20.comp:(1, -5, 20, 20, -5, 1) / 32filter with +16 rounding, clip to u8, write one dst byte.H264QpelContext.Dispatch wiring
h264_qpel_mc20_pipefield ondaedalus_ctx, lazy init.dispatch_h264_qpel_mc20_qpu(): 3 SSBOs (src / dst / meta), one WG per block.daedalus_dispatch_h264_qpel_mc20()swapsROUTE_CPU_ONLYfor the substrate-switch pattern.H264_QPEL_MC20returnsSUBSTRATE_QPU.Verification (hertz, Pi 5, V3D 7.1)
First-iteration result was 1017/1024 (99.32%) — off-by-7 traced to undersizing
src_maxin the host wrapper. The filter readssrc_off + 7*stride + 10at the last row last column; add 1 for memcpy's [0..N-1] semantic → 11. Fixed in the same patch.9 of 9 cycles QPU-by-recipe
The prototype now demonstrates GPU acceleration on every measured kernel — campaign complete.
Closes daedalus-fourier task #165.
Closes the QPU-default substrate campaign per the 2026-05-23 decree: every daedalus-fourier kernel that can be done in QPU is now done in QPU. Cycle 9 is the last piece — 6-tap horizontal half-pel luma motion compensation, H.264 §8.4.2.2.1. Shader (src/v3d_h264_qpel_mc20.comp): - local_size = 64, 1 lane per output pixel of one 8x8 block, 1 block per workgroup. Simplest layout that avoids any inter-lane communication — V3D's L2 cache handles the redundant reads from adjacent lanes computing adjacent output columns. - Per-pixel: read 6 src samples (cols c-2..c+3 in row r), apply the (1, -5, 20, 20, -5, 1) / 32 filter with +16 rounding, clip to u8, write one dst byte. - Single-stride convention matches FFmpeg's H264QpelContext (dst and src share `stride`; src+src_off points at output col 0 with the caller-guaranteed -2/+3 padding). Dispatch wiring (src/daedalus_core.c): - h264_qpel_mc20_pipe field on daedalus_ctx, lazy init. - dispatch_h264_qpel_mc20_qpu(): 3 SSBOs (src / dst / meta), src_max = src_off + 7*stride + 11 (covers the +3-col read footprint on the last row), dst_max = dst_off + 7*stride + 8. 1 block per WG. - daedalus_dispatch_h264_qpel_mc20() replaces ROUTE_CPU_ONLY with the substrate-switch pattern matching the other H.264 kernels. - Recipe table: H264_QPEL_MC20 returns SUBSTRATE_QPU. Verification (hertz, Pi 5, V3D 7.1): $ ./test_api_h264 === Phase 8a API smoke: H.264 kernels via recipe dispatch === H264_IDCT4 recipe substrate: 2 (1=CPU, 2=QPU) H264_IDCT8 recipe substrate: 2 H264_DEBLOCK_LV recipe substrate: 2 H264_QPEL_MC20 recipe substrate: 2 ← flipped H.264 IDCT 4x4: 2048/2048 bytes bit-exact H.264 IDCT 8x8: 2048/2048 bytes bit-exact H.264 deblock luma v: 2048/2048 bytes bit-exact H.264 qpel mc20: 1024/1024 bytes bit-exact ← QPU First-iteration result was 1017/1024 (99.32%) — off-by-7 traced to undersizing src_max in the host wrapper. The filter reads src_off + 7*stride + (7 + 3) = +10 at the last row last column; add 1 for memcpy's [0..N-1] semantic → 11. Fixed in the same patch. All 9 daedalus-fourier cycles now QPU-by-recipe: cycle 1 VP9 IDCT 8x8 QPU cycle 2 VP9 LPF wd=4 QPU cycle 3 VP9 MC 8h QPU cycle 4 VP9 LPF wd=8 QPU cycle 5 AV1 CDEF 8x8 QPU cycle 6 H.264 IDCT 4x4 QPU cycle 7 H.264 IDCT 8x8 QPU cycle 8 H.264 deblock luma-v QPU cycle 9 H.264 qpel mc20 QPU ← this commit Closes daedalus-fourier task #165. Per the decree memory [QPU is default substrate], the prototype now demonstrates GPU acceleration on every measured kernel.