h264: V3D shaders for the 8 diagonal qpel positions #33

Merged

marfrit merged 1 commits from noether/v3d-shader-h264-qpel-diagonals into main

2026-05-25 18:16:54 +00:00

Author	SHA1	Message	Date
claude-noether	746533582e	h264: V3D shaders for the 8 diagonal qpel positions Closes the put_ qpel QPU matrix. Adds mc11/12/13/21/23/31/32/33 — each composes two half-pel anchor outputs via L2 rounded-average: mc11 ¼¼ : avg(mc20[r, c], mc02[r, c]) mc12 ¼½ : avg(mc22[r, c], mc02[r, c]) mc13 ¼¾ : avg(mc20[r+1, c], mc02[r, c]) mc21 ½¼ : avg(mc22[r, c], mc20[r, c]) mc23 ½¾ : avg(mc22[r, c], mc20[r+1, c]) mc31 ¾¼ : avg(mc20[r, c], mc02[r, c+1]) mc32 ¾½ : avg(mc22[r, c], mc02[r, c+1]) mc33 ¾¾ : avg(mc20[r+1, c], mc02[r, c+1]) Per-lane structure: each lane runs the FULL cascade for BOTH anchors at its own (r, c) target, then L2 averages. No shared memory. Shaders inline hpel_h() / hpel_v() / hpel_hv() helpers (the latter does the 13×6 int16 cascade per cell). ~88 lines each. Shaders generated from a python template (POSITIONS table + format string) — the 8 .comp files are 1:1 with the C reference's DEFINE_DIAG_REF macro from fourier PR #18. Dispatch plumbing: shared dispatch_h264_qpel_diag_qpu helper covers all 8 (same src envelope as mc22: src_max = src_off + 10*stride + 11, covering rows -2..+10 and cols -2..+10 for any (r±1, c±1) offset). Recipe table: all 8 DAEDALUS_KERNEL_H264_QPEL_MC{11..33} flipped to QPU. Public dispatchers re-defined via DEFINE_QPEL_DIAG_PUBLIC macro (replaces the old DEFINE_QPEL_DISPATCH which fast-failed QPU). Verified on hertz: $ ./build/test_api_h264 \| grep "qpel mc[1-3][1-3]" H.264 qpel mc11: 2048/2048 bytes bit-exact (100.0000%) H.264 qpel mc12: 2048/2048 bytes bit-exact (100.0000%) H.264 qpel mc13: 2048/2048 bytes bit-exact (100.0000%) H.264 qpel mc21: 2048/2048 bytes bit-exact (100.0000%) H.264 qpel mc23: 2048/2048 bytes bit-exact (100.0000%) H.264 qpel mc31: 2048/2048 bytes bit-exact (100.0000%) H.264 qpel mc32: 2048/2048 bytes bit-exact (100.0000%) H.264 qpel mc33: 2048/2048 bytes bit-exact (100.0000%) Meaningful: the (r±1, c±1) offsets are easy to transpose between positions; passing first try on the asymmetric variants (mc13/23/31/33) means the position-specific shifts are correct in all 8 templates. put_ qpel QPU matrix is now COMPLETE: 15 of 15 useful positions (mc00 = integer copy, no shader needed). avg_ qpel positions (15 more) remain on CPU NEON; can land as a follow-up since avg_ is just put_ + one extra L2 against existing dst. put_ mc20 ✓ mc02 ✓ mc22 ✓ (anchors) mc10 ✓ mc30 ✓ mc01 ✓ mc03 ✓ (single-axis ¼-pel) mc11 ✓ mc12 ✓ mc13 ✓ (this PR — row-1 diagonals) mc21 ✓ mc23 ✓ (this PR — row-2 diagonals) mc31 ✓ mc32 ✓ mc33 ✓ (this PR — row-3 diagonals) avg_ all 15 — CPU NEON	2026-05-25 19:14:42 +02:00

Author

SHA1

Message

Date

claude-noether

746533582e

h264: V3D shaders for the 8 diagonal qpel positions

Closes the put_ qpel QPU matrix.  Adds mc11/12/13/21/23/31/32/33 —
each composes two half-pel anchor outputs via L2 rounded-average:

  mc11 ¼¼ : avg(mc20[r,   c],   mc02[r,   c])
  mc12 ¼½ : avg(mc22[r,   c],   mc02[r,   c])
  mc13 ¼¾ : avg(mc20[r+1, c],   mc02[r,   c])
  mc21 ½¼ : avg(mc22[r,   c],   mc20[r,   c])
  mc23 ½¾ : avg(mc22[r,   c],   mc20[r+1, c])
  mc31 ¾¼ : avg(mc20[r,   c],   mc02[r,   c+1])
  mc32 ¾½ : avg(mc22[r,   c],   mc02[r,   c+1])
  mc33 ¾¾ : avg(mc20[r+1, c],   mc02[r,   c+1])

Per-lane structure: each lane runs the FULL cascade for BOTH anchors
at its own (r, c) target, then L2 averages.  No shared memory.
Shaders inline hpel_h() / hpel_v() / hpel_hv() helpers (the latter
does the 13×6 int16 cascade per cell).  ~88 lines each.

Shaders generated from a python template (POSITIONS table + format
string) — the 8 .comp files are 1:1 with the C reference's
DEFINE_DIAG_REF macro from fourier PR #18.

Dispatch plumbing: shared dispatch_h264_qpel_diag_qpu helper covers
all 8 (same src envelope as mc22: src_max = src_off + 10*stride + 11,
covering rows -2..+10 and cols -2..+10 for any (r±1, c±1) offset).

Recipe table: all 8 DAEDALUS_KERNEL_H264_QPEL_MC{11..33} flipped to
QPU.  Public dispatchers re-defined via DEFINE_QPEL_DIAG_PUBLIC macro
(replaces the old DEFINE_QPEL_DISPATCH which fast-failed QPU).

Verified on hertz:

  $ ./build/test_api_h264 | grep "qpel mc[1-3][1-3]"
    H.264 qpel mc11: 2048/2048 bytes bit-exact (100.0000%)
    H.264 qpel mc12: 2048/2048 bytes bit-exact (100.0000%)
    H.264 qpel mc13: 2048/2048 bytes bit-exact (100.0000%)
    H.264 qpel mc21: 2048/2048 bytes bit-exact (100.0000%)
    H.264 qpel mc23: 2048/2048 bytes bit-exact (100.0000%)
    H.264 qpel mc31: 2048/2048 bytes bit-exact (100.0000%)
    H.264 qpel mc32: 2048/2048 bytes bit-exact (100.0000%)
    H.264 qpel mc33: 2048/2048 bytes bit-exact (100.0000%)

  Meaningful: the (r±1, c±1) offsets are easy to transpose between
  positions; passing first try on the asymmetric variants (mc13/23/31/33)
  means the position-specific shifts are correct in all 8 templates.

put_ qpel QPU matrix is now COMPLETE: 15 of 15 useful positions
(mc00 = integer copy, no shader needed).  avg_ qpel positions
(15 more) remain on CPU NEON; can land as a follow-up since avg_
is just put_ + one extra L2 against existing dst.

  put_  mc20 ✓  mc02 ✓  mc22 ✓  (anchors)
        mc10 ✓  mc30 ✓  mc01 ✓  mc03 ✓  (single-axis ¼-pel)
        mc11 ✓  mc12 ✓  mc13 ✓  (this PR — row-1 diagonals)
        mc21 ✓                    mc23 ✓  (this PR — row-2 diagonals)
        mc31 ✓  mc32 ✓  mc33 ✓  (this PR — row-3 diagonals)
  avg_  all 15 — CPU NEON

2026-05-25 19:14:42 +02:00

h264: V3D shaders for the 8 diagonal qpel positions #33

1 Commits