claude-noether
|
79553c6e22
|
cycle 9: V3D shader for H.264 luma qpel mc20 — 9/9 QPU coverage
Closes the QPU-default substrate campaign per the 2026-05-23
decree: every daedalus-fourier kernel that can be done in QPU
is now done in QPU. Cycle 9 is the last piece — 6-tap horizontal
half-pel luma motion compensation, H.264 §8.4.2.2.1.
Shader (src/v3d_h264_qpel_mc20.comp):
- local_size = 64, 1 lane per output pixel of one 8x8 block,
1 block per workgroup. Simplest layout that avoids any
inter-lane communication — V3D's L2 cache handles the
redundant reads from adjacent lanes computing adjacent
output columns.
- Per-pixel: read 6 src samples (cols c-2..c+3 in row r),
apply the (1, -5, 20, 20, -5, 1) / 32 filter with +16
rounding, clip to u8, write one dst byte.
- Single-stride convention matches FFmpeg's H264QpelContext
(dst and src share `stride`; src+src_off points at output
col 0 with the caller-guaranteed -2/+3 padding).
Dispatch wiring (src/daedalus_core.c):
- h264_qpel_mc20_pipe field on daedalus_ctx, lazy init.
- dispatch_h264_qpel_mc20_qpu(): 3 SSBOs (src / dst / meta),
src_max = src_off + 7*stride + 11 (covers the +3-col read
footprint on the last row), dst_max = dst_off + 7*stride + 8.
1 block per WG.
- daedalus_dispatch_h264_qpel_mc20() replaces ROUTE_CPU_ONLY
with the substrate-switch pattern matching the other H.264
kernels.
- Recipe table: H264_QPEL_MC20 returns SUBSTRATE_QPU.
Verification (hertz, Pi 5, V3D 7.1):
$ ./test_api_h264
=== Phase 8a API smoke: H.264 kernels via recipe dispatch ===
H264_IDCT4 recipe substrate: 2 (1=CPU, 2=QPU)
H264_IDCT8 recipe substrate: 2
H264_DEBLOCK_LV recipe substrate: 2
H264_QPEL_MC20 recipe substrate: 2 ← flipped
H.264 IDCT 4x4: 2048/2048 bytes bit-exact
H.264 IDCT 8x8: 2048/2048 bytes bit-exact
H.264 deblock luma v: 2048/2048 bytes bit-exact
H.264 qpel mc20: 1024/1024 bytes bit-exact ← QPU
First-iteration result was 1017/1024 (99.32%) — off-by-7 traced
to undersizing src_max in the host wrapper. The filter reads
src_off + 7*stride + (7 + 3) = +10 at the last row last column;
add 1 for memcpy's [0..N-1] semantic → 11. Fixed in the same
patch.
All 9 daedalus-fourier cycles now QPU-by-recipe:
cycle 1 VP9 IDCT 8x8 QPU
cycle 2 VP9 LPF wd=4 QPU
cycle 3 VP9 MC 8h QPU
cycle 4 VP9 LPF wd=8 QPU
cycle 5 AV1 CDEF 8x8 QPU
cycle 6 H.264 IDCT 4x4 QPU
cycle 7 H.264 IDCT 8x8 QPU
cycle 8 H.264 deblock luma-v QPU
cycle 9 H.264 qpel mc20 QPU ← this commit
Closes daedalus-fourier task #165. Per the decree memory
[QPU is default substrate], the prototype now demonstrates GPU
acceleration on every measured kernel.
|
2026-05-23 21:05:36 +02:00 |
|