h264: V3D shaders for the 8 diagonal qpel positions #33

2026-05-25T17:15:23Z

marfrit commented

2026-05-25 17:15:23 +00:00

Closes the put_ qpel QPU matrix. All 15 useful positions now on QPU (mc00 = integer copy, no shader).

8 new shaders for mc11/12/13/21/23/31/32/33 — each composes 2 half-pel anchors (mc20/mc02/mc22) via L2 rounded-average, with the position-specific (r±1, c±1) offsets per H.264 §8.4.2.2.1. ~88 lines per shader. Generated from a python template against the same POSITIONS table as the C ref from fourier PR #18.

Shared dispatch_h264_qpel_diag_qpu helper (mc22-style src envelope: src_max = src_off + 10*stride + 11 covers rows -2..+10 and cols -2..+10 for any offset).

All 8 PASS 2048/2048 bytes bit-exact first try — meaningful because the asymmetric (r±1, c±1) shifts are easy to transpose between positions; passing on all means the position-specific shifts are correct in all 8 templates.

put_ QPU matrix complete: mc{00,10,20,30,01,11,21,31,02,12,22,32,03,13,23,33}. avg_ qpel (15 more positions) remain on CPU NEON — can land as a follow-up since avg_ is just put_ + one extra L2.

**Closes the put_ qpel QPU matrix.** All 15 useful positions now on QPU (mc00 = integer copy, no shader). 8 new shaders for mc11/12/13/21/23/31/32/33 — each composes 2 half-pel anchors (mc20/mc02/mc22) via L2 rounded-average, with the position-specific (r±1, c±1) offsets per H.264 §8.4.2.2.1. ~88 lines per shader. Generated from a python template against the same POSITIONS table as the C ref from fourier PR #18. Shared `dispatch_h264_qpel_diag_qpu` helper (mc22-style src envelope: src_max = src_off + 10*stride + 11 covers rows -2..+10 and cols -2..+10 for any offset). **All 8 PASS 2048/2048 bytes bit-exact first try** — meaningful because the asymmetric (r±1, c±1) shifts are easy to transpose between positions; passing on all means the position-specific shifts are correct in all 8 templates. put_ QPU matrix complete: mc{00,10,20,30,01,11,21,31,02,12,22,32,03,13,23,33}. avg_ qpel (15 more positions) remain on CPU NEON — can land as a follow-up since avg_ is just put_ + one extra L2.

marfrit added 1 commit 2026-05-25 17:15:28 +00:00

h264: V3D shaders for the 8 diagonal qpel positions 746533582e

Closes the put_ qpel QPU matrix.  Adds mc11/12/13/21/23/31/32/33 —
each composes two half-pel anchor outputs via L2 rounded-average:

  mc11 ¼¼ : avg(mc20[r,   c],   mc02[r,   c])
  mc12 ¼½ : avg(mc22[r,   c],   mc02[r,   c])
  mc13 ¼¾ : avg(mc20[r+1, c],   mc02[r,   c])
  mc21 ½¼ : avg(mc22[r,   c],   mc20[r,   c])
  mc23 ½¾ : avg(mc22[r,   c],   mc20[r+1, c])
  mc31 ¾¼ : avg(mc20[r,   c],   mc02[r,   c+1])
  mc32 ¾½ : avg(mc22[r,   c],   mc02[r,   c+1])
  mc33 ¾¾ : avg(mc20[r+1, c],   mc02[r,   c+1])

Per-lane structure: each lane runs the FULL cascade for BOTH anchors
at its own (r, c) target, then L2 averages.  No shared memory.
Shaders inline hpel_h() / hpel_v() / hpel_hv() helpers (the latter
does the 13×6 int16 cascade per cell).  ~88 lines each.

Shaders generated from a python template (POSITIONS table + format
string) — the 8 .comp files are 1:1 with the C reference's
DEFINE_DIAG_REF macro from fourier PR #18.

Dispatch plumbing: shared dispatch_h264_qpel_diag_qpu helper covers
all 8 (same src envelope as mc22: src_max = src_off + 10*stride + 11,
covering rows -2..+10 and cols -2..+10 for any (r±1, c±1) offset).

Recipe table: all 8 DAEDALUS_KERNEL_H264_QPEL_MC{11..33} flipped to
QPU.  Public dispatchers re-defined via DEFINE_QPEL_DIAG_PUBLIC macro
(replaces the old DEFINE_QPEL_DISPATCH which fast-failed QPU).

Verified on hertz:

  $ ./build/test_api_h264 | grep "qpel mc[1-3][1-3]"
    H.264 qpel mc11: 2048/2048 bytes bit-exact (100.0000%)
    H.264 qpel mc12: 2048/2048 bytes bit-exact (100.0000%)
    H.264 qpel mc13: 2048/2048 bytes bit-exact (100.0000%)
    H.264 qpel mc21: 2048/2048 bytes bit-exact (100.0000%)
    H.264 qpel mc23: 2048/2048 bytes bit-exact (100.0000%)
    H.264 qpel mc31: 2048/2048 bytes bit-exact (100.0000%)
    H.264 qpel mc32: 2048/2048 bytes bit-exact (100.0000%)
    H.264 qpel mc33: 2048/2048 bytes bit-exact (100.0000%)

  Meaningful: the (r±1, c±1) offsets are easy to transpose between
  positions; passing first try on the asymmetric variants (mc13/23/31/33)
  means the position-specific shifts are correct in all 8 templates.

put_ qpel QPU matrix is now COMPLETE: 15 of 15 useful positions
(mc00 = integer copy, no shader needed).  avg_ qpel positions
(15 more) remain on CPU NEON; can land as a follow-up since avg_
is just put_ + one extra L2 against existing dst.

  put_  mc20 ✓  mc02 ✓  mc22 ✓  (anchors)
        mc10 ✓  mc30 ✓  mc01 ✓  mc03 ✓  (single-axis ¼-pel)
        mc11 ✓  mc12 ✓  mc13 ✓  (this PR — row-1 diagonals)
        mc21 ✓                    mc23 ✓  (this PR — row-2 diagonals)
        mc31 ✓  mc32 ✓  mc33 ✓  (this PR — row-3 diagonals)
  avg_  all 15 — CPU NEON

marfrit merged commit 55d3618408 into main

2026-05-25 18:16:54 +00:00

marfrit deleted branch noether/v3d-shader-h264-qpel-diagonals

2026-05-25 18:16:55 +00:00

Sign in to join this conversation.

1 Participants

Notifications

Due Date

No due date set.

Dependencies

No dependencies set.

Reference: marfrit/daedalus-fourier#33