h264: qpel mc22 (2D half-pel, CPU/NEON) #16

Merged
marfrit merged 1 commit from noether/h264-qpel-mc22 into main 2026-05-24 23:26:16 +00:00
Owner

Adds the 'j position' 2D half-pel via cascaded H+V 6-tap lowpass with intermediate 16-bit precision per H.264 §8.4.2.2.1. One of the most common qpel positions in real streams.

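Algorithmically distinct from the 1D siblings: calling mc20 and then mc02 would round and clip the intermediate to 8 bits before the second pass, giving the wrong result. The C reference uses an explicit tmp[13][8] int16 staging buffer; the vertical pass applies (v + 512) >> 10 because each 6-tap pass has a gain of 32, so the cascade has a gain of 1024.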

PASS 2048/2048 bytes bit-exact on the first try, which is meaningful because the (v + 512) >> 10 scaling on an int16 intermediate has several shift and sign pitfalls.

All 13 H.264 kernels in api_smoke now PASS. New kernel DAEDALUS_KERNEL_H264_QPEL_MC22=18 → CPU.

marfrit added 1 commit 2026-05-24 23:03:32 +00:00
Adds the "j position" 2D half-pel via cascaded H + V 6-tap lowpass
with intermediate 16-bit precision per H.264 §8.4.2.2.1.  One of the
most common qpel positions in real H.264 streams — many encoders
emit 1/2-1/2 motion vectors as their best-RD choice.

Algorithmically distinct from the 1D mc20/mc02 siblings:
  - Horizontal 6-tap produces 13 rows of int16 intermediate (no
    per-stage clip/round — full precision retained).
  - Vertical 6-tap on the intermediate, then (v + 512) >> 10 and
    clip255.  Each 6-tap pass has a gain of 32, so the cascade has
    a gain of 1024 = 2^10; 512 is the round-to-nearest offset.
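
The two passes above can be sketched in C roughly as follows. This is an illustrative sketch, not the contents of tests/h264_qpel8_mc22_ref.c: the names sketch_put_qpel8_mc22, tap6 and clip255 are hypothetical, and the code assumes that right-shifting a negative int is an arithmetic shift (implementation-defined in C, but the behaviour of mainstream compilers).

```c
#include <stddef.h>
#include <stdint.h>

/* Unscaled H.264 luma 6-tap filter (1, -5, 20, 20, -5, 1); gain is 32. */
static int tap6(int a, int b, int c, int d, int e, int f)
{
    return a - 5 * b + 20 * c + 20 * d - 5 * e + f;
}

static uint8_t clip255(int v)
{
    return (uint8_t)(v < 0 ? 0 : v > 255 ? 255 : v);
}

/* Hypothetical name.  src points at the top-left sample of the 8x8 block;
 * 2 samples above/left and 3 samples below/right must be readable. */
void sketch_put_qpel8_mc22(uint8_t *dst, ptrdiff_t dst_stride,
                           const uint8_t *src, ptrdiff_t src_stride)
{
    int16_t tmp[13][8];

    /* Horizontal pass over source rows -2..+10: no rounding, no clip. */
    for (int y = 0; y < 13; y++) {
        const uint8_t *row = src + (y - 2) * src_stride;
        for (int x = 0; x < 8; x++)
            tmp[y][x] = (int16_t)tap6(row[x - 2], row[x - 1], row[x],
                                      row[x + 1], row[x + 2], row[x + 3]);
    }

    /* Vertical pass on the intermediate: output row y uses tmp rows
     * y..y+5, accumulated in a 32-bit int, then rounded, scaled, clipped. */
    for (int y = 0; y < 8; y++) {
        for (int x = 0; x < 8; x++) {
            int v = tap6(tmp[y][x],     tmp[y + 1][x], tmp[y + 2][x],
                         tmp[y + 3][x], tmp[y + 4][x], tmp[y + 5][x]);
            dst[y * dst_stride + x] = clip255((v + 512) >> 10);
        }
    }
}
```

A flat source is a quick sanity check: every tap sum is 32 times the input, so the output equals the input. A horizontal ramp with step 10 yields the midpoint between neighbouring columns.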

The intermediate-precision requirement means the C reference can't
just be "call mc20 then mc02" — that would double-clip and produce
the wrong result.  The 13-row int16 tmp[] buffer is the central
invariant.
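
To make the double-clip problem concrete, here is a minimal single-sample sketch comparing the full-precision path with a naive "mc20 then mc02" cascade. The function names (j_full_precision, j_double_clipped, hfilt) are hypothetical and not part of this PR. As above, an arithmetic right shift for negative values is assumed.

```c
#include <stddef.h>
#include <stdint.h>

static int tap6(int a, int b, int c, int d, int e, int f)
{
    return a - 5 * b + 20 * c + 20 * d - 5 * e + f;
}

static int clip255(int v)
{
    return v < 0 ? 0 : v > 255 ? 255 : v;
}

/* Unscaled horizontal 6-tap on the row `row` lines below p. */
static int hfilt(const uint8_t *p, ptrdiff_t stride, int row)
{
    const uint8_t *r = p + row * stride;
    return tap6(r[-2], r[-1], r[0], r[1], r[2], r[3]);
}

/* 'j' sample with the intermediate kept at full precision. */
int j_full_precision(const uint8_t *p, ptrdiff_t stride)
{
    int h[6];
    for (int i = 0; i < 6; i++)
        h[i] = hfilt(p, stride, i - 2);
    return clip255((tap6(h[0], h[1], h[2], h[3], h[4], h[5]) + 512) >> 10);
}

/* Naive cascade: each horizontal result is rounded and clipped to 8 bits
 * (as mc20 would store it) before the vertical pass (as mc02 would run). */
int j_double_clipped(const uint8_t *p, ptrdiff_t stride)
{
    int b[6];
    for (int i = 0; i < 6; i++)
        b[i] = clip255((hfilt(p, stride, i - 2) + 16) >> 5);
    return clip255((tap6(b[0], b[1], b[2], b[3], b[4], b[5]) + 16) >> 5);
}
```

One input that separates the two: six filter rows consisting of an all-zero row, a row with only the two centre samples at 255, two rows of all 255, another centre-only row, and an all-zero row. The centre-only rows overshoot to 10200 unscaled (about 319 after scaling), which the naive cascade clips to 255. The full-precision path then returns 219, while the double-clipped path returns 239.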

Scope (same pattern as mc02 PR #15):
  - Public API: daedalus_dispatch_h264_qpel_mc22 + recipe wrapper.
  - Internal: dispatch_h264_qpel_mc22_cpu calling
    ff_put_h264_qpel8_mc22_neon.
  - Recipe table: DAEDALUS_KERNEL_H264_QPEL_MC22 = 18 → CPU.
  - C reference: tests/h264_qpel8_mc22_ref.c — explicit tmp[13][8]
    int16 staging buffer; spec-derived shifts and rounding.
  - Test: test_qpel_mc22 in test_api_h264, 8 tiles at 16×16 with
    output positioned at (SRC_ROW=3, SRC_COL=3) so the kernel's
    [-2 .. +10] read window stays in-tile.
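
The tile placement in the test item above comes down to a small bounds calculation, sketched here. The helper name mc22_window_in_tile and the TAPS_BEFORE/TAPS_AFTER constants are hypothetical and only illustrate the arithmetic; they are not taken from test_api_h264.

```c
#include <stdbool.h>

/* The 6-tap filter reaches 2 samples before and 3 samples after the
 * sample it is centred on, in both axes. */
enum { TAPS_BEFORE = 2, TAPS_AFTER = 3 };

/* True if a block x block mc22 output whose origin sits at offset `off`
 * (same in both axes) reads only samples inside a tile x tile source. */
bool mc22_window_in_tile(int tile, int off, int block)
{
    int first = off - TAPS_BEFORE;               /* lowest index read  */
    int last  = off + (block - 1) + TAPS_AFTER;  /* highest index read */
    return first >= 0 && last < tile;
}
```

For an 8x8 block the window relative to the output origin is -2 .. +10, so with an offset of 3 in a 16x16 tile the kernel reads indices 1 .. 13, and mc22_window_in_tile(16, 3, 8) holds.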

Verified on hertz:

  $ ./build/test_api_h264 | tail -5
    H.264 deblock chroma v intra: 256/256 bytes bit-exact (100.0000%)
    H.264 deblock chroma h intra: 256/256 bytes bit-exact (100.0000%)
    H.264 qpel mc20: 1024/1024 bytes bit-exact (100.0000%)
    H.264 qpel mc02: 2048/2048 bytes bit-exact (100.0000%)
    H.264 qpel mc22: 2048/2048 bytes bit-exact (100.0000%)

  All 13 H.264 kernels in api_smoke now bit-exact PASS.

mc22 being right first try is meaningful — the +512 >> 10 scaling
+ int16 intermediate sequence has multiple sign/shift/clip pitfalls
and any of them would surface on random inputs immediately.
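
One of those pitfalls is accumulator width. The sketch below computes conservative value ranges for each stage from the tap weights alone; the function names are hypothetical. It shows that the horizontal intermediate fits in int16 for 8-bit input, while the vertical sum does not and needs a wider accumulator before the shift.

```c
#include <stdint.h>

/* Sum of the positive and of the negative tap magnitudes of
 * (1, -5, 20, 20, -5, 1). */
enum { POS_TAPS = 1 + 20 + 20 + 1, NEG_TAPS = 5 + 5 };

/* Extremes of one unscaled 6-tap pass over inputs in [lo, hi]. */
static long pass_max(long lo, long hi) { return POS_TAPS * hi - NEG_TAPS * lo; }
static long pass_min(long lo, long hi) { return POS_TAPS * lo - NEG_TAPS * hi; }

/* Stage 1: horizontal pass over 8-bit samples. */
long mc22_stage1_min(void) { return pass_min(0, 255); }
long mc22_stage1_max(void) { return pass_max(0, 255); }

/* Stage 2: vertical pass over stage-1 values (conservative bound, since
 * it treats the six intermediates as independent). */
long mc22_stage2_min(void)
{
    return pass_min(mc22_stage1_min(), mc22_stage1_max());
}

long mc22_stage2_max(void)
{
    return pass_max(mc22_stage1_min(), mc22_stage1_max());
}
```

Stage 1 spans -2550 .. 10710, which is why the intermediate must be signed and why clipping it to 0 .. 255 loses information. Stage 2 is bounded by -214200 .. 475320, well outside the int16 range.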

Coverage matrix update:
  put_ mc20 ✓ (QPU+CPU)  put_ mc02 ✓ (CPU)  put_ mc22 ✓ (CPU)
  → 12 single put_ positions still missing (¼/¾ + HV combos with
  L2 averaging).
marfrit merged commit f3d4b15b9a into main 2026-05-24 23:26:16 +00:00
marfrit deleted branch noether/h264-qpel-mc22 2026-05-24 23:26:17 +00:00
Reference: marfrit/daedalus-fourier#16