Cascaded H+V 6-tap filter per H.264 §8.4.2.2.1. Highest per-frame
impact among missing qpel positions (PR #24 bench: 71.5 ns/block
NEON, 2.33 ms/frame worst-case all-mc22 at 1080p).
Per-lane structure: each lane runs the FULL cascade independently —
computes 6 horizontal lowpass int16 intermediates at rows r-2..r+3
of its column, then a vertical lowpass on those with +512 >> 10
final scale. ~50 ALU ops per lane.
Design choice: NO shared memory / barriers. Alternative was to
cache the h-lowpass intermediates in shared memory (13 rows × 8 cols
of int16 per WG), trading shared-memory bank pressure + a barrier
for ~6× less h-lowpass compute. V3D L2 absorbs the redundant src
reads across lanes; the per-lane compute is cheap (multiply-add ALU
units idle anyway during dst write). Simpler shader, fewer SPIR-V
ops, easier to extend to mc12/mc21/etc. later.
CANNOT just cascade mc20 → mc02 because the intermediate must be
int16 (no per-stage clip): the +512 >> 10 final scale assumes both
6-tap scalings preserved through the pipeline. Dedicated kernel.
dispatch_h264_qpel_mc22_qpu mirrors the existing mc20/mc02 shape;
src_max = src_off + 10*stride + 11 covers both the V (rows -2..+10)
and H (cols -2..+10) read windows in one bound.
Recipe table flips DAEDALUS_KERNEL_H264_QPEL_MC22 from CPU to QPU.
Verified on hertz:
$ ./build/test_api_h264 | grep qpel
H.264 qpel mc20: 1024/1024 bytes bit-exact (100.0000%)
H.264 qpel mc02: 2048/2048 bytes bit-exact (100.0000%)
H.264 qpel mc22: 2048/2048 bytes bit-exact (100.0000%)
Qpel QPU coverage now: 3 anchors (mc20 H, mc02 V, mc22 HV) — these
are the half-pel "building blocks" the 12 other qpel positions
combine via L2 averaging. Remaining variants (quarter-pel singles
mc01/03/10/30 and the 8 diagonals) can dispatch through the existing
shaders + a small L2-averaging compose step, or get dedicated kernels.