h264: V3D shader for qpel mc22 (2D half-pel 'j' position) #31

2026-05-25T16:52:42Z

marfrit commented

2026-05-25 16:52:42 +00:00

Cascaded H+V 6-tap per H.264 §8.4.2.2.1. Highest per-frame impact among missing qpel positions (2.33 ms/frame worst-case at 1080p per PR #24 bench).

Per-lane: each computes 6 horizontal lowpass int16 intermediates + 1 vertical lowpass with +512>>10 scale. ~50 ALU ops/lane. NO shared memory / barriers — V3D L2 absorbs redundant src reads (alternative was 13×8 int16 shmem + barrier; rejected as more complex without proven win).

CANNOT just cascade mc20→mc02 (int16 intermediate matters); dedicated kernel required.

Test on hertz: qpel mc22: 2048/2048 bytes bit-exact via QPU.

Qpel QPU anchors complete: mc20 ✓, mc02 ✓, mc22 ✓. The 12 remaining qpel positions are quarter-pel L2-averaged combinations of these — can either dispatch via compose step or get dedicated shaders.

Cascaded H+V 6-tap per H.264 §8.4.2.2.1. Highest per-frame impact among missing qpel positions (2.33 ms/frame worst-case at 1080p per PR #24 bench). Per-lane: each computes 6 horizontal lowpass int16 intermediates + 1 vertical lowpass with +512>>10 scale. ~50 ALU ops/lane. NO shared memory / barriers — V3D L2 absorbs redundant src reads (alternative was 13×8 int16 shmem + barrier; rejected as more complex without proven win). CANNOT just cascade mc20→mc02 (int16 intermediate matters); dedicated kernel required. **Test on hertz**: `qpel mc22: 2048/2048 bytes bit-exact` via QPU. Qpel QPU anchors complete: mc20 ✓, mc02 ✓, mc22 ✓. The 12 remaining qpel positions are quarter-pel L2-averaged combinations of these — can either dispatch via compose step or get dedicated shaders.

marfrit added 1 commit 2026-05-25 16:52:43 +00:00

h264: V3D shader for qpel mc22 (2D half-pel "j" position) 02d564b43e

Cascaded H+V 6-tap filter per H.264 §8.4.2.2.1.  Highest per-frame
impact among missing qpel positions (PR #24 bench: 71.5 ns/block
NEON, 2.33 ms/frame worst-case all-mc22 at 1080p).

Per-lane structure: each lane runs the FULL cascade independently —
computes 6 horizontal lowpass int16 intermediates at rows r-2..r+3
of its column, then a vertical lowpass on those with +512 >> 10
final scale.  ~50 ALU ops per lane.

Design choice: NO shared memory / barriers.  Alternative was to
cache the h-lowpass intermediates in shared memory (13 rows × 8 cols
of int16 per WG), trading shared-memory bank pressure + a barrier
for ~6× less h-lowpass compute.  V3D L2 absorbs the redundant src
reads across lanes; the per-lane compute is cheap (multiply-add ALU
units idle anyway during dst write).  Simpler shader, fewer SPIR-V
ops, easier to extend to mc12/mc21/etc. later.

CANNOT just cascade mc20 → mc02 because the intermediate must be
int16 (no per-stage clip): the +512 >> 10 final scale assumes both
6-tap scalings preserved through the pipeline.  Dedicated kernel.

dispatch_h264_qpel_mc22_qpu mirrors the existing mc20/mc02 shape;
src_max = src_off + 10*stride + 11 covers both the V (rows -2..+10)
and H (cols -2..+10) read windows in one bound.

Recipe table flips DAEDALUS_KERNEL_H264_QPEL_MC22 from CPU to QPU.

Verified on hertz:

  $ ./build/test_api_h264 | grep qpel
    H.264 qpel mc20: 1024/1024 bytes bit-exact (100.0000%)
    H.264 qpel mc02: 2048/2048 bytes bit-exact (100.0000%)
    H.264 qpel mc22: 2048/2048 bytes bit-exact (100.0000%)

Qpel QPU coverage now: 3 anchors (mc20 H, mc02 V, mc22 HV) — these
are the half-pel "building blocks" the 12 other qpel positions
combine via L2 averaging.  Remaining variants (quarter-pel singles
mc01/03/10/30 and the 8 diagonals) can dispatch through the existing
shaders + a small L2-averaging compose step, or get dedicated kernels.

marfrit merged commit 8b8e8dc6e8 into main

2026-05-25 17:00:35 +00:00

marfrit deleted branch noether/v3d-shader-h264-qpel-mc22

2026-05-25 17:00:45 +00:00

Sign in to join this conversation.

1 Participants

Notifications

Due Date

No due date set.

Dependencies

No dependencies set.

Reference: marfrit/daedalus-fourier#31