h264: qpel mc22 (2D half-pel, CPU/NEON) #16
Adds the 'j position' 2D half-pel via cascaded H+V 6-tap lowpass with intermediate 16-bit precision per H.264 §8.4.2.2.1. One of the most common qpel positions in real streams.
Algorithmically distinct from the 1D siblings: can't just call mc20 then mc02 (that would double-clip). Reference uses an explicit tmp[13][8] int16 staging buffer; vertical pass applies +512 >> 10 to compensate for both 6-tap scalings.

PASS 2048/2048 bytes bit-exact first try. That is meaningful because the +512 >> 10 + int16 intermediate has multiple shift/sign pitfalls.
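To illustrate why the cascade is not equivalent, here is a minimal sketch for a single output sample. It assumes the naive path rounds with +16 >> 5 and clips after each 1D stage, as the standalone half-pel positions do. The names j_full, j_cascaded, lowpass6 and clip_u8 are hypothetical and do not come from this PR.

```c
#include <stdint.h>

/* Unscaled 6-tap lowpass (1, -5, 20, 20, -5, 1); gain is 32. */
static int lowpass6(const int *p)
{
    return p[0] - 5 * p[1] + 20 * p[2] + 20 * p[3] - 5 * p[4] + p[5];
}

/* Clamp to the 8-bit sample range. */
static int clip_u8(int v)
{
    return v < 0 ? 0 : (v > 255 ? 255 : v);
}

/* Horizontal 6-tap over row r of a 6x6 patch stored row-major. */
static int hfilter_row(const uint8_t *patch, int r)
{
    int row[6];
    for (int c = 0; c < 6; c++)
        row[c] = patch[r * 6 + c];
    return lowpass6(row);
}

/* Full-precision path: no rounding or clipping between the passes.
 * Right shift of a negative int is assumed to be arithmetic here. */
int j_full(const uint8_t *patch)
{
    int col[6];
    for (int r = 0; r < 6; r++)
        col[r] = hfilter_row(patch, r);
    return clip_u8((lowpass6(col) + 512) >> 10);
}

/* Naive cascade: round and clip after the horizontal stage, then again
 * after the vertical stage (the "double-clip" case). */
int j_cascaded(const uint8_t *patch)
{
    int col[6];
    for (int r = 0; r < 6; r++)
        col[r] = clip_u8((hfilter_row(patch, r) + 16) >> 5);
    return clip_u8((lowpass6(col) + 16) >> 5);
}
```

For a patch that is zero except for two adjacent 255 samples in the middle of the third row, j_full gives 199 while j_cascaded gives 159: the horizontal result (10200, about 319 after scaling) is clipped to 255 before the vertical filter sees it.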
All 13 H.264 kernels in api_smoke now PASS. New kernel DAEDALUS_KERNEL_H264_QPEL_MC22 = 18 → CPU.

Adds the "j position" 2D half-pel via cascaded H + V 6-tap lowpass with intermediate 16-bit precision per H.264 §8.4.2.2.1. One of the most common qpel positions in real H.264 streams: many encoders emit 1/2-1/2 motion vectors as their best-RD choice.

Algorithmically distinct from the 1D mc20/mc02 siblings:

- Horizontal 6-tap produces 13 rows of int16 intermediate (no per-stage clip/round; full precision retained).
- Vertical 6-tap on the intermediate, then +512 >> 10 (each 6-tap pass has a gain of 32, so the single 10-bit shift compensates for both scalings at once) + clip255.

The intermediate-precision requirement means the C reference can't just be "call mc20 then mc02": that would double-clip and produce the wrong result. The 13-row int16 tmp[] buffer is the central invariant.

Scope (same pattern as mc02 PR #15):

- Public API: daedalus_dispatch_h264_qpel_mc22 + recipe wrapper.
- Internal: dispatch_h264_qpel_mc22_cpu calling ff_put_h264_qpel8_mc22_neon.
- Recipe table: DAEDALUS_KERNEL_H264_QPEL_MC22 = 18 → CPU.
- C reference: tests/h264_qpel8_mc22_ref.c, with an explicit tmp[13][8] int16 staging buffer and spec-derived shifts and rounding.
- Test: test_qpel_mc22 in test_api_h264, 8 tiles at 16×16 with output positioned at (SRC_ROW=3, SRC_COL=3) so the kernel's [-2 .. +10] read window stays in-tile.

Verified on hertz:

    $ ./build/test_api_h264 | tail -5
    H.264 deblock chroma v intra: 256/256 bytes bit-exact (100.0000%)
    H.264 deblock chroma h intra: 256/256 bytes bit-exact (100.0000%)
    H.264 qpel mc20: 1024/1024 bytes bit-exact (100.0000%)
    H.264 qpel mc02: 2048/2048 bytes bit-exact (100.0000%)
    H.264 qpel mc22: 2048/2048 bytes bit-exact (100.0000%)

All 13 H.264 kernels in api_smoke now bit-exact PASS. mc22 being right first try is meaningful: the +512 >> 10 scaling + int16 intermediate sequence has multiple sign/shift/clip pitfalls, and any of them would surface on random inputs immediately.
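The two-pass structure described above can be sketched in C roughly as follows. This is an illustration under stated assumptions, not the contents of tests/h264_qpel8_mc22_ref.c: the name sketch_put_h264_qpel8_mc22 and the helpers tap6 and clip255 are hypothetical, src is assumed to point at the top-left output sample, and source and destination are assumed to share one stride.

```c
#include <stddef.h>
#include <stdint.h>

/* Clamp to the 8-bit sample range. */
static uint8_t clip255(int v)
{
    return (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
}

/* Unscaled 6-tap lowpass (1, -5, 20, 20, -5, 1); gain is 32. */
static int tap6(int a, int b, int c, int d, int e, int f)
{
    return a - 5 * b + 20 * c + 20 * d - 5 * e + f;
}

/* Hypothetical 8x8 "j position" reference. Reads rows and columns
 * -2..+10 relative to src, so the caller must provide that margin. */
void sketch_put_h264_qpel8_mc22(uint8_t *dst, const uint8_t *src,
                                ptrdiff_t stride)
{
    int16_t tmp[13][8];

    /* Horizontal pass over 13 rows (-2..+10). Results lie in
     * [-2550, 10710], so int16 holds them without clip or round. */
    for (int y = 0; y < 13; y++) {
        const uint8_t *s = src + (y - 2) * stride;
        for (int x = 0; x < 8; x++)
            tmp[y][x] = (int16_t)tap6(s[x - 2], s[x - 1], s[x],
                                      s[x + 1], s[x + 2], s[x + 3]);
    }

    /* Vertical pass on the intermediate. Combined gain is 32 * 32 = 1024,
     * hence +512 >> 10. Right shift of a negative int is assumed to be
     * arithmetic, as on mainstream compilers. */
    for (int y = 0; y < 8; y++) {
        for (int x = 0; x < 8; x++) {
            int v = tap6(tmp[y][x],     tmp[y + 1][x], tmp[y + 2][x],
                         tmp[y + 3][x], tmp[y + 4][x], tmp[y + 5][x]);
            dst[y * stride + x] = clip255((v + 512) >> 10);
        }
    }
}
```

Called with src = tile + 3 * 16 + 3 on a 16×16 tile, the read window covers rows and columns 1..13 and stays in-tile. On a flat tile every output sample equals the input value, which makes a quick sanity check before comparing against the NEON kernel on random data.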
Coverage matrix update:

- put_ mc20 ✓ (QPU+CPU)
- put_ mc02 ✓ (CPU)
- put_ mc22 ✓ (CPU)

→ 12 single put_ positions still missing (¼/¾ + HV combos with L2 averaging).