h264: qpel mc22 (2D half-pel, CPU/NEON)

Adds the "j position" 2D half-pel via cascaded H + V 6-tap lowpass
with intermediate 16-bit precision per H.264 §8.4.2.2.1.  One of the
most common qpel positions in real H.264 streams — many encoders
emit 1/2-1/2 motion vectors as their best-RD choice.

Algorithmically distinct from the 1D mc20/mc02 siblings:
  - Horizontal 6-tap produces 13 rows of int16 intermediate (no
    per-stage clip/round — full precision retained).
  - Vertical 6-tap on the intermediate, then +512 >> 10 (the
    double-shift compensates for both 6-tap scalings) + clip255.

The intermediate-precision requirement means the C reference can't
just be "call mc20 then mc02" — that would double-clip and produce
the wrong result.  The 13-row int16 tmp[] buffer is the central
invariant.

Scope (same pattern as mc02 PR #15):
  - Public API: daedalus_dispatch_h264_qpel_mc22 + recipe wrapper.
  - Internal: dispatch_h264_qpel_mc22_cpu calling
    ff_put_h264_qpel8_mc22_neon.
  - Recipe table: DAEDALUS_KERNEL_H264_QPEL_MC22 = 18 → CPU.
  - C reference: tests/h264_qpel8_mc22_ref.c — explicit tmp[13][8]
    int16 staging buffer; spec-derived shifts and rounding.
  - Test: test_qpel_mc22 in test_api_h264, 8 tiles at 16×16 with
    output positioned at (SRC_ROW=3, SRC_COL=3) so the kernel's
    [-2 .. +10] read window stays in-tile.

Verified on hertz:

  $ ./build/test_api_h264 | tail -5
    H.264 deblock chroma v intra: 256/256 bytes bit-exact (100.0000%)
    H.264 deblock chroma h intra: 256/256 bytes bit-exact (100.0000%)
    H.264 qpel mc20: 1024/1024 bytes bit-exact (100.0000%)
    H.264 qpel mc02: 2048/2048 bytes bit-exact (100.0000%)
    H.264 qpel mc22: 2048/2048 bytes bit-exact (100.0000%)

  All 13 H.264 kernels in api_smoke now bit-exact PASS.

mc22 being right first try is meaningful — the +512 >> 10 scaling
+ int16 intermediate sequence has multiple sign/shift/clip pitfalls
and any of them would surface on random inputs immediately.

Coverage matrix update:
  put_ mc20 ✓ (QPU+CPU)  put_ mc02 ✓ (CPU)  put_ mc22 ✓ (CPU)
  → 12 single put_ positions still missing (¼/¾ + HV combos with
  L2 averaging).
This commit is contained in:
2026-05-25 01:03:14 +02:00
parent a2575d5e42
commit 20a4299c5c
5 changed files with 174 additions and 0 deletions
+22
View File
@@ -415,6 +415,27 @@ int daedalus_dispatch_h264_qpel_mc02(daedalus_ctx *ctx, daedalus_substrate sub,
uint8_t *dst, const uint8_t *src, size_t stride,
size_t n_blocks, const daedalus_h264_qpel_meta *meta);
/* H.264 luma qpel mc22 (2D half-pel "j" position per spec §8.4.2.2.1).
* Horizontal 6-tap cascaded into vertical 6-tap with intermediate
* 16-bit precision; final +512 >> 10 with clip255. Common position
* in real H.264 streams.
*
* src + src_off points at row 0 col 0 of the OUTPUT block; the
* cascade reads rows -2..+10 (13 rows of context) and cols -2..+5
* (10 cols of context). Caller must guarantee.
*
* QPU shader not implemented yet (the HV lowpass is the meatiest
* qpel kernel; structurally distinct from the 1D mc20 shader).
* Recipe routes AUTO to CPU NEON. Explicit SUBSTRATE_QPU returns -1.
*/
int daedalus_recipe_dispatch_h264_qpel_mc22(daedalus_ctx *ctx,
uint8_t *dst, const uint8_t *src, size_t stride,
size_t n_blocks, const daedalus_h264_qpel_meta *meta);
int daedalus_dispatch_h264_qpel_mc22(daedalus_ctx *ctx, daedalus_substrate sub,
uint8_t *dst, const uint8_t *src, size_t stride,
size_t n_blocks, const daedalus_h264_qpel_meta *meta);
/* -------------------------------------------------------------------
* Recipe query — what does the API recommend for each kernel?
* ----------------------------------------------------------------- */
@@ -436,6 +457,7 @@ typedef enum {
DAEDALUS_KERNEL_H264_DEBLOCK_CV_INTRA = 15,
DAEDALUS_KERNEL_H264_DEBLOCK_CH_INTRA = 16,
DAEDALUS_KERNEL_H264_QPEL_MC02 = 17,
DAEDALUS_KERNEL_H264_QPEL_MC22 = 18,
} daedalus_kernel;
daedalus_substrate daedalus_recipe_substrate_for(daedalus_kernel k);