h264: V3D shaders for all 15 avg_ qpel positions — qpel QPU complete #34

Merged
marfrit merged 1 commit from noether/v3d-shader-h264-qpel-avg into main 2026-05-25 18:23:14 +00:00
Owner

Generates 15 avg_ shader variants by templating from the existing put_ shaders with a Python script. Each avg_ variant adds one L2 averaging line at the tail (dst = (dst + result + 1) >> 1). This is the averaging step for B-slice biprediction per H.264 §8.4.2.3.1.

Dispatch reuses the existing dispatch_h264_qpel_diag_qpu helper for all 15, since they share the src envelope of the diag variants. This slightly over-allocates for the simpler positions but avoids 15 copies of the wrapper boilerplate.

All 15 PASS 2048/2048 bytes bit-exact via QPU.

Qpel QPU matrix is now complete: 30 of 30 useful positions (15 put_ + 15 avg_; mc00 = integer copy, no shader).
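
The 30-position count can be sanity-checked by enumerating the positions. This is a minimal sketch that assumes the `mcXY` suffix has X and Y each ranging over the quarter-pel offsets 0..3; the helper name `qpel_positions` is hypothetical and not part of daedalus.

```python
def qpel_positions():
    """Return the 15 fractional mcXY suffixes (mc00 is the integer copy)."""
    return [f"mc{x}{y}" for y in range(4) for x in range(4) if (x, y) != (0, 0)]


positions = qpel_positions()
# Each fractional position exists once as put_ and once as avg_.
kernels = [f"{prefix}{pos}" for prefix in ("put_", "avg_") for pos in positions]
print(len(positions), len(kernels))  # prints: 15 30
```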

H.264 8-bit 4:2:0 hot-path QPU coverage state:

  • IDCT 4x4/8x8 ✓
  • Deblock luma_v/h ✓
  • Deblock chroma_v/h ✓
  • qpel put_ 15 ✓
  • qpel avg_ 15 ✓ (this PR)

Apart from the *_intra deblock kernels and the chroma DC Hadamard, which stay on the CPU, the H.264 hot path is now fully on QPU for any consumer that initialises daedalus with a QPU-capable context.

marfrit added 1 commit 2026-05-25 18:22:42 +00:00
Generates 15 avg_ shader variants by templating from the existing
put_ shaders.  Each avg_ shader is identical to its put_ sibling
except the final write does an L2 average with the existing dst:

  put_:  dst[r,c] = result
  avg_:  dst[r,c] = (dst[r,c] + result + 1) >> 1

Per H.264 §8.4.2.3.1 (B-slice biprediction): caller pre-loads dst
with the list0 prediction; the avg_ call folds in list1.
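
The difference between the two writes can be illustrated on scalar samples. This is a sketch in Python rather than shader code; the function names and sample values are hypothetical and chosen only to show the rounding behaviour.

```python
def put_pixel(dst, result):
    """put_: overwrite dst with the interpolated sample."""
    return result


def avg_pixel(dst, result):
    """avg_: rounded average of the existing dst and the new sample."""
    return (dst + result + 1) >> 1


# dst holds the list0 prediction; result is the list1 interpolation.
list0 = [100, 101, 255, 0]
list1 = [103, 102, 255, 1]
bipred = [avg_pixel(d, r) for d, r in zip(list0, list1)]
print(bipred)  # prints: [102, 102, 255, 1]
```

For 8-bit inputs the result never exceeds 255, so no clamp is needed after the shift.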

Generated via Python (avg-shader-gen.py): reads each
v3d_h264_qpel_mcXY.comp, transforms the docstring header + final
write hunk, writes v3d_h264_qpel_avg_mcXY.comp.  ~88 lines each;
15 new shader files.
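
The final-write transform could look roughly like the sketch below. It is not the contents of avg-shader-gen.py: the store statement `dst[...] = result;`, the toy shader text, and the function name `put_to_avg` are all assumptions made for illustration, and the real shaders may write dst differently.

```python
import re

# Hypothetical final-write line; the real shaders' store statement may differ.
PUT_STORE = re.compile(r"dst\[(?P<idx>[^\]]+)\]\s*=\s*result;")


def put_to_avg(put_source: str) -> str:
    """Rewrite the single put_ store into an avg_ store, or fail loudly."""
    avg_source, count = PUT_STORE.subn(
        r"dst[\g<idx>] = (dst[\g<idx>] + result + 1u) >> 1u;", put_source
    )
    if count != 1:
        raise ValueError(f"expected exactly one final write, found {count}")
    return avg_source


toy_put = "uint result = filter(src);\ndst[r * stride + c] = result;\n"
print(put_to_avg(toy_put))
```

Failing when the pattern does not match exactly once keeps a generator like this from silently emitting an avg_ file that is still a put_ shader.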

Dispatch reuses the existing dispatch_h264_qpel_diag_qpu helper for
all 15 — same src envelope (10*stride+11 covers any (r±1, c±1)
shift), the L2 step only touches dst.  This slightly over-allocates
for the simpler positions (avg_mc20/02/10/30/01/03) at negligible
cost, and avoids 15 wrappers + 15 src_max bound calculations that
would otherwise be duplicated.
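
One possible reading of the `10*stride+11` figure is sketched below. It assumes 8x8 blocks and reads that extend up to 3 samples past the last row and column of the block; neither assumption is confirmed by this PR, and `src_extent` is a hypothetical name, not the daedalus src_max calculation.

```python
def src_extent(stride: int, block: int = 8, reach: int = 3) -> int:
    """Bytes from the base pointer through the farthest sample read."""
    last = (block - 1) + reach       # farthest row and column offset (10 here)
    return last * stride + last + 1  # index of the last sample, plus one


print(src_extent(stride=16))  # prints: 171, i.e. 10*16 + 11
```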

CMake foreach loops compile + install 15 new SPV files.  ctx grows
15 pipeline pairs.  Recipe table flips DAEDALUS_KERNEL_H264_QPEL_AVG_*
from CPU to QPU.  Public dispatchers re-defined via the existing
DEFINE_QPEL_DIAG_PUBLIC macro (replaces the CPU-only
DEFINE_QPEL_DISPATCH instantiations).

Verified on hertz:

  $ ./build/test_api_h264 | grep "qpel avg" | wc -l
  15
  $ ./build/test_api_h264 | grep "qpel avg" | grep -c "100.0000%"
  15

All 15 PASS 2048/2048 bytes bit-exact via QPU.
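
A bit-exact check of this kind amounts to a byte-for-byte comparison of two buffers. The sketch below is not the test_api_h264 code; the function name, the stand-in buffers, and the output format are assumptions, with the 2048-byte size and the four-decimal percentage borrowed from the output above.

```python
def compare_bytes(reference: bytes, candidate: bytes):
    """Return (matching, total, percent) for two equal-length, non-empty buffers."""
    if not reference or len(reference) != len(candidate):
        raise ValueError("buffers must be non-empty and equal in length")
    total = len(reference)
    matching = sum(a == b for a, b in zip(reference, candidate))
    return matching, total, 100.0 * matching / total


cpu = bytes(i % 256 for i in range(2048))  # stand-in for a reference output
qpu = bytes(cpu)                           # stand-in for the QPU output
matching, total, pct = compare_bytes(cpu, qpu)
print(f"{matching}/{total} bytes, {pct:.4f}%")  # prints: 2048/2048 bytes, 100.0000%
```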

QPU coverage for the H.264 8-bit 4:2:0 hot-path pixel kernels:

  Layer                Coverage
  ─────────────────────────────────────────────────────────────
  IDCT 4x4 luma        ✓ cycle 6 (one QPU shader, also handles chroma)
  IDCT 8x8 luma        ✓ cycle 7
  Chroma DC Hadamard   CPU only (4 adds + 4 subs; not worth offloading)
  Deblock luma_v       ✓ cycle 8
  Deblock luma_h       ✓ PR #28
  Deblock chroma_v/h   ✓ PR #29
  Deblock *_intra      CPU only (less common, structurally different)
  qpel put_ 15 pos     ✓ cycle 9 (mc20) + PRs #30-#33
  qpel avg_ 15 pos     ✓ THIS PR

Apart from the *_intra deblock kernels and the chroma DC Hadamard,
which stay on the CPU, the H.264 hot path is now fully on QPU for any
consumer that initialises daedalus with a QPU-capable context.
marfrit merged commit e506ef0803 into main 2026-05-25 18:23:14 +00:00
marfrit deleted branch noether/v3d-shader-h264-qpel-avg 2026-05-25 18:23:15 +00:00

Reference: marfrit/daedalus-fourier#34