h264: V3D shader for qpel mc22 (2D half-pel 'j' position) #31

Merged
marfrit merged 1 commits from noether/v3d-shader-h264-qpel-mc22 into main 2026-05-25 17:00:35 +00:00

Author SHA1 Message Date
claude-noether 02d564b43e h264: V3D shader for qpel mc22 (2D half-pel "j" position)
Cascaded H+V 6-tap filter per H.264 §8.4.2.2.1.  Highest per-frame
impact among missing qpel positions (PR #24 bench: 71.5 ns/block
NEON, 2.33 ms/frame worst-case all-mc22 at 1080p).

Per-lane structure: each lane runs the FULL cascade independently.  It
computes 6 horizontal lowpass int16 intermediates at rows r-2..r+3
of its column, then a vertical lowpass on those with a final
(v + 512) >> 10 scale and clip to uint8.  ~50 ALU ops per lane.
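
A scalar C sketch of what one lane computes, following the description
above.  The names (tap6, round_clip_u8, mc22_pixel) are illustrative
and not taken from the shader source; src is assumed to point at the
block's (0, 0) sample.

```c
#include <stddef.h>
#include <stdint.h>

/* Unscaled H.264 6-tap luma filter: (1, -5, 20, 20, -5, 1), gain 32. */
static int tap6(int a, int b, int c, int d, int e, int f)
{
    return a - 5 * b + 20 * (c + d) - 5 * e + f;
}

/* (v + round) >> shift, clipped to 0..255.  Negative sums are clipped
 * before shifting so no signed right shift of a negative value occurs. */
static uint8_t round_clip_u8(int v, int round, int shift)
{
    int s = v + round;
    if (s < 0)
        return 0;
    s >>= shift;
    return (uint8_t)(s > 255 ? 255 : s);
}

/* One output pixel of the 'j' position: the work of a single lane.
 * Reads rows y-2..y+3 and cols x-2..x+3 relative to src. */
static uint8_t mc22_pixel(const uint8_t *src, ptrdiff_t stride, int x, int y)
{
    int16_t h[6]; /* h-lowpass intermediates, range -2550..10710 */

    for (int k = 0; k < 6; k++) {
        const uint8_t *p = src + (ptrdiff_t)(y - 2 + k) * stride + x;
        h[k] = (int16_t)tap6(p[-2], p[-1], p[0], p[1], p[2], p[3]);
    }
    /* Vertical lowpass on the unrounded intermediates, then one scale. */
    return round_clip_u8(tap6(h[0], h[1], h[2], h[3], h[4], h[5]), 512, 10);
}
```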

Design choice: NO shared memory / barriers.  Alternative was to
cache the h-lowpass intermediates in shared memory (13 rows × 8 cols
of int16 per WG), trading shared-memory bank pressure + a barrier
for less h-lowpass compute: 104 filter evaluations per 8×8 block
instead of 64 × 6 = 384, about 3.7× fewer (approaching 6× only for
much taller blocks).  The V3D L2 cache absorbs the redundant src
reads across lanes, and the per-lane compute is cheap (multiply-add
ALU units idle anyway during dst write).  Result: simpler shader,
fewer SPIR-V ops, easier to extend to mc12/mc21/etc. later.

CANNOT just cascade mc20 → mc02: mc20 rounds, shifts by 5 and clips
each row to uint8, but mc22 needs the unscaled int16 intermediates
(no per-stage round or clip).  The +512 >> 10 final scale assumes
both 32× filter gains are still present at the end of the pipeline.
Hence a dedicated kernel.
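
To illustrate the point, a scalar C sketch comparing the single-pass
form with a naive mc20 → mc02 chain that rounds and clips after each
stage.  Function names are hypothetical; src is assumed to point at
the block's (0, 0) sample.

```c
#include <stddef.h>
#include <stdint.h>

static int tap6(int a, int b, int c, int d, int e, int f)
{
    return a - 5 * b + 20 * (c + d) - 5 * e + f;
}

/* (v + round) >> shift clipped to 0..255; negatives clipped pre-shift. */
static uint8_t round_clip_u8(int v, int round, int shift)
{
    int s = v + round;
    if (s < 0)
        return 0;
    s >>= shift;
    return (uint8_t)(s > 255 ? 255 : s);
}

static int hrow(const uint8_t *src, ptrdiff_t stride, int x, int row)
{
    const uint8_t *p = src + (ptrdiff_t)row * stride + x;
    return tap6(p[-2], p[-1], p[0], p[1], p[2], p[3]);
}

/* Single pass: unscaled intermediates, one final (v + 512) >> 10. */
static uint8_t mc22_direct(const uint8_t *src, ptrdiff_t stride, int x, int y)
{
    int h[6];
    for (int k = 0; k < 6; k++)
        h[k] = hrow(src, stride, x, y - 2 + k);
    return round_clip_u8(tap6(h[0], h[1], h[2], h[3], h[4], h[5]), 512, 10);
}

/* Naive chain: each stage does its own (v + 16) >> 5 and clip. */
static uint8_t mc22_chained(const uint8_t *src, ptrdiff_t stride, int x, int y)
{
    int h[6];
    for (int k = 0; k < 6; k++)
        h[k] = round_clip_u8(hrow(src, stride, x, y - 2 + k), 16, 5);
    return round_clip_u8(tap6(h[0], h[1], h[2], h[3], h[4], h[5]), 16, 5);
}
```

Two small inputs separate the paths.  A lone sample of 16 at
(row 0, col -2) gives 0 direct vs 1 chained (double rounding).
Samples of 255 at (row -1, col -1) and (row +2, col -1) give 12
direct vs 0 chained (negative intermediates lost to the clip).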

dispatch_h264_qpel_mc22_qpu mirrors the existing mc20/mc02 shape;
src_max = src_off + 10*stride + 11 covers both the V (rows -2..+10)
and H (cols -2..+10) read windows in one bound.
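
A sketch of that bound in C, assuming src_off is the byte offset of
the block's (0, 0) sample and src_max is an exclusive end offset (the
last byte read is row +10, col +10).  The helper names and the
lower-bound check are illustrative additions, not the dispatcher's
actual code.

```c
#include <stdbool.h>
#include <stddef.h>

/* Exclusive end of the mc22 read window for an 8x8 block:
 * last sample read is at src_off + 10*stride + 10. */
static size_t mc22_src_max(size_t src_off, size_t stride)
{
    return src_off + 10 * stride + 11;
}

/* True if the whole 13x13 window (rows/cols -2..+10) lies inside a
 * buffer of buf_size bytes.  The first sample read is at
 * src_off - 2*stride - 2, so src_off must leave that much headroom. */
static bool mc22_window_fits(size_t src_off, size_t stride, size_t buf_size)
{
    if (src_off < 2 * stride + 2)
        return false;
    return mc22_src_max(src_off, stride) <= buf_size;
}
```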

Recipe table flips DAEDALUS_KERNEL_H264_QPEL_MC22 from CPU to QPU.

Verified on hertz:

  $ ./build/test_api_h264 | grep qpel
    H.264 qpel mc20: 1024/1024 bytes bit-exact (100.0000%)
    H.264 qpel mc02: 2048/2048 bytes bit-exact (100.0000%)
    H.264 qpel mc22: 2048/2048 bytes bit-exact (100.0000%)

Qpel QPU coverage now: 3 anchors (mc20 H, mc02 V, mc22 HV).  These
are the half-pel "building blocks" that the 12 other qpel positions
combine via L2 (two-input rounding) averaging, (a + b + 1) >> 1.
Remaining variants (quarter-pel singles mc01/03/10/30 and the 8
diagonals) can dispatch through the existing shaders + a small
L2-averaging compose step, or get dedicated kernels.
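
A minimal C sketch of such a compose step: a two-input rounding
average over a block.  avg2_block is a hypothetical name; which two
planes feed it depends on the position (for example, mc10 averages
the full-pel samples with the mc20 output).

```c
#include <stddef.h>
#include <stdint.h>

/* dst[y][x] = (a[y][x] + b[y][x] + 1) >> 1 over a w x h block.
 * Each plane has its own stride, so a and b may be a source frame
 * and a tightly packed half-pel scratch buffer. */
static void avg2_block(uint8_t *dst, ptrdiff_t dst_stride,
                       const uint8_t *a, ptrdiff_t a_stride,
                       const uint8_t *b, ptrdiff_t b_stride,
                       int w, int h)
{
    for (int y = 0; y < h; y++) {
        for (int x = 0; x < w; x++)
            dst[x] = (uint8_t)((a[x] + b[x] + 1) >> 1);
        dst += dst_stride;
        a += a_stride;
        b += b_stride;
    }
}
```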
2026-05-25 18:52:39 +02:00