daedalus-fourier/tests/h264_qpel8_mc22_ref.c
claude-noether 20a4299c5c h264: qpel mc22 (2D half-pel, CPU/NEON)
Adds the "j position" 2D half-pel via cascaded H + V 6-tap lowpass
with intermediate 16-bit precision per H.264 §8.4.2.2.1.  One of the
most common qpel positions in real H.264 streams — many encoders
emit 1/2-1/2 motion vectors as their best-RD choice.

Algorithmically distinct from the 1D mc20/mc02 siblings:
  - Horizontal 6-tap produces 13 rows of int16 intermediate (no
    per-stage clip/round — full precision retained).
  - Vertical 6-tap on the intermediate, then +512 >> 10 + clip255.
    Each 6-tap pass has a gain of 32 (2^5), so the single 10-bit
    shift undoes both scalings at once.
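
The precision split above can be checked with a quick bound calculation.
The sketch below is illustrative only and not part of this change; the
helper names (tap6_min, tap6_max, check_mc22_ranges) are hypothetical.
It uses the fact that the positive taps sum to 42 and the negative taps
to -10 to bound each stage.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helpers: extreme outputs of the 6-tap filter
 * (1, -5, 20, 20, -5, 1) when every input lies in [lo, hi].
 * Positive taps sum to 42, negative taps sum to -10. */
static long tap6_min(long lo, long hi) { return 42 * lo - 10 * hi; }
static long tap6_max(long lo, long hi) { return 42 * hi - 10 * lo; }

/* Asserts that the horizontal stage fits int16 while the vertical
 * accumulator needs a wider type (but fits comfortably in 32 bits). */
static void check_mc22_ranges(void)
{
    long h_min = tap6_min(0, 255);        /* horizontal stage, uint8 in */
    long h_max = tap6_max(0, 255);
    long v_min = tap6_min(h_min, h_max);  /* vertical stage, tmp[] in   */
    long v_max = tap6_max(h_min, h_max);

    assert(h_min >= INT16_MIN && h_max <= INT16_MAX); /* tmp[] fits int16 */
    assert(v_max > INT16_MAX);                        /* the sum does not */
    assert(v_min >= INT32_MIN && v_max + 512 <= INT32_MAX);
}
```

With 8-bit input the horizontal stage stays within [-2550, 10710], which
is why int16 is enough for tmp[] while the vertical sum is accumulated
in a plain int.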

The intermediate-precision requirement means the C reference can't
just be "call mc20 then mc02" — that would double-clip and produce
the wrong result.  The 13-row int16 tmp[] buffer is the central
invariant.
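
To make the double-clip problem concrete, here is a minimal sketch that
computes one "j" sample from a 6x6 neighbourhood both ways. It is not
code from this change: the function names and the demo_px input are made
up for illustration, and the "wrong" variant only approximates what
chaining the 1D kernels would do (round and clip after each pass).

```c
#include <assert.h>

static int clip_u8(int v) { return v < 0 ? 0 : v > 255 ? 255 : v; }

/* 6-tap (1, -5, 20, 20, -5, 1) over six consecutive values. */
static int tap6(const int *p)
{
    return p[0] - 5 * p[1] + 20 * p[2] + 20 * p[3] - 5 * p[4] + p[5];
}

/* Full-precision cascade: no rounding or clipping between stages.
 * Assumes >> on a negative int is an arithmetic shift. */
static int j_full_precision(const int px[6][6])
{
    int tmp[6];
    for (int r = 0; r < 6; r++)
        tmp[r] = tap6(px[r]);
    return clip_u8((tap6(tmp) + 512) >> 10);
}

/* The wrong way: round and clip to 8 bits after each stage. */
static int j_double_clipped(const int px[6][6])
{
    int half[6];
    for (int r = 0; r < 6; r++)
        half[r] = clip_u8((tap6(px[r]) + 16) >> 5);
    return clip_u8((tap6(half) + 16) >> 5);
}

/* Hypothetical input: rows 1 and 4 overshoot 255 after the horizontal
 * pass, so the early clip discards information the vertical pass needs. */
static const int demo_px[6][6] = {
    {   0,   0,   0,   0,   0,   0 },
    {   0,   0, 255, 255,   0,   0 },
    { 128, 128, 128, 128, 128, 128 },
    { 128, 128, 128, 128, 128, 128 },
    {   0,   0, 255, 255,   0,   0 },
    {   0,   0,   0,   0,   0,   0 },
};

static void check_double_clip_differs(void)
{
    assert(j_full_precision(demo_px) != j_double_clipped(demo_px));
}
```

On this input the full-precision path yields 60 and the double-clipped
path yields 80, so the two approaches are not interchangeable.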

Scope (same pattern as mc02 PR #15):
  - Public API: daedalus_dispatch_h264_qpel_mc22 + recipe wrapper.
  - Internal: dispatch_h264_qpel_mc22_cpu calling
    ff_put_h264_qpel8_mc22_neon.
  - Recipe table: DAEDALUS_KERNEL_H264_QPEL_MC22 = 18 → CPU.
  - C reference: tests/h264_qpel8_mc22_ref.c — explicit tmp[13][8]
    int16 staging buffer; spec-derived shifts and rounding.
  - Test: test_qpel_mc22 in test_api_h264, 8 tiles at 16×16 with
    output positioned at (SRC_ROW=3, SRC_COL=3) so the kernel's
    [-2 .. +10] read window stays in-tile.
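
The in-tile claim for the test layout can be expressed as a small
predicate. This is a sketch rather than the actual test code: SRC_ROW
and SRC_COL are the values quoted above, while TILE, BLOCK, TAP_BEFORE,
TAP_AFTER and window_in_tile are names assumed here for illustration.

```c
#include <assert.h>

enum { TILE = 16, SRC_ROW = 3, SRC_COL = 3 };      /* TILE is an assumed name */
enum { BLOCK = 8, TAP_BEFORE = 2, TAP_AFTER = 3 }; /* 8 outputs, 6-tap reach  */

/* Returns 1 if an 8-sample output run starting at `origin` keeps its
 * read window [origin - 2, origin + 10] inside [0, tile). */
static int window_in_tile(int origin, int tile)
{
    int first = origin - TAP_BEFORE;
    int last  = origin + (BLOCK - 1) + TAP_AFTER;
    return first >= 0 && last < tile;
}

static void check_test_layout(void)
{
    assert(window_in_tile(SRC_ROW, TILE));  /* rows 1..13 of a 16-row tile */
    assert(window_in_tile(SRC_COL, TILE));  /* cols 1..13 of a 16-col tile */
}
```

The same predicate applies to both axes because the horizontal and
vertical passes use the same 6-tap footprint.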

Verified on hertz:

  $ ./build/test_api_h264 | tail -5
    H.264 deblock chroma v intra: 256/256 bytes bit-exact (100.0000%)
    H.264 deblock chroma h intra: 256/256 bytes bit-exact (100.0000%)
    H.264 qpel mc20: 1024/1024 bytes bit-exact (100.0000%)
    H.264 qpel mc02: 2048/2048 bytes bit-exact (100.0000%)
    H.264 qpel mc22: 2048/2048 bytes bit-exact (100.0000%)

  All 13 H.264 kernels in api_smoke now bit-exact PASS.

mc22 passing bit-exact on the first attempt is meaningful: the int16
intermediate followed by the +512 >> 10 scaling has several sign,
shift, and clip pitfalls, and any of them would show up immediately
on random inputs.

Coverage matrix update:
  put_ mc20 ✓ (QPU+CPU)  put_ mc02 ✓ (CPU)  put_ mc22 ✓ (CPU)
  → 12 single put_ positions still missing (¼/¾ + HV combos with
  L2 averaging).
2026-05-25 01:03:14 +02:00

/*
 * Standalone bit-exact C reference for H.264 luma qpel 8x8 mc22
 * (2D half-pel, "put" variant). Cascade of horizontal 6-tap then
 * vertical 6-tap with INTERMEDIATE 16-bit precision (no per-stage
 * clip/round), final +512 >> 10 to scale back.
 *
 * Per H.264 §8.4.2.2.1, "j" position:
 *
 *   tmp[r,c] = s[r,c-2] - 5*s[r,c-1] + 20*s[r,c] + 20*s[r,c+1]
 *              - 5*s[r,c+2] + s[r,c+3]              (16-bit signed)
 *
 *   dst[r,c] = clip255((tmp[r-2,c] - 5*tmp[r-1,c] + 20*tmp[r,c]
 *                       + 20*tmp[r+1,c] - 5*tmp[r+2,c] + tmp[r+3,c]
 *                       + 512) >> 10)
 *
 * The tmp[] array spans rows r-2 .. r+3 around each output row, so
 * we need 13 intermediate rows (rows -2..+10 of the SOURCE
 * neighbourhood) for 8 output rows. Caller's src must have 2 rows
 * of top context + 3 rows of bottom context AND 2 cols of left +
 * 3 cols of right context (FFmpeg's edge-emulated buffer provides
 * this at the frame boundary; same contract as mc20).
 *
 * Mirrors FFmpeg `ff_put_h264_qpel8_mc22_neon` (in
 * external/ffmpeg-snapshot/libavcodec/aarch64/h264qpel_neon.S
 * line 710, which tail-calls put_h264_qpel8_hv_lowpass_neon).
 *
 * Signature:
 *   void(uint8_t *dst, const uint8_t *src, ptrdiff_t stride);
 *
 * Same single-stride convention as mc20/mc02.
 *
 * License: LGPL-2.1-or-later.
 */
#include <stdint.h>
#include <stddef.h>

static inline int clip_u8(int v) { return v < 0 ? 0 : v > 255 ? 255 : v; }

void daedalus_put_h264_qpel8_mc22_ref(uint8_t *dst, const uint8_t *src, ptrdiff_t stride)
{
    /* 13 intermediate rows × 8 cols. The 8 output rows dst[0..7][0..7]
     * need the horizontal filter of source rows -2..+10, so tmp[] is
     * stored with an offset of 2: tmp[0..12] holds source rows
     * [-2..+10]. */
    int16_t tmp[13][8];

    /* Stage 1: horizontal 6-tap, kept at full precision (no round/clip). */
    for (int rr = 0; rr < 13; rr++) {
        int src_row = rr - 2;   /* maps tmp[0..12] → src rows [-2..+10] */
        const uint8_t *s = src + src_row * stride;
        for (int c = 0; c < 8; c++) {
            int v =       (int) s[c - 2] -  5 * (int) s[c - 1]
                  +  20 * (int) s[c]     + 20 * (int) s[c + 1]
                  -   5 * (int) s[c + 2] +      (int) s[c + 3];
            tmp[rr][c] = (int16_t) v;   /* range [-2550, 10710] fits int16 */
        }
    }

    /* Stage 2: vertical 6-tap on tmp[], then round, shift and clip. */
    for (int r = 0; r < 8; r++) {
        /* Spec rows r-2..r+3 are stored at tmp[r..r+5] (offset of 2). */
        for (int c = 0; c < 8; c++) {
            int v =      tmp[r + 0][c]   /* spec row r-2 */
                  -  5 * tmp[r + 1][c]   /* spec row r-1 */
                  + 20 * tmp[r + 2][c]   /* spec row r   */
                  + 20 * tmp[r + 3][c]   /* spec row r+1 */
                  -  5 * tmp[r + 4][c]   /* spec row r+2 */
                  +      tmp[r + 5][c]   /* spec row r+3 */
                  + 512;
            /* v may be negative; this relies on >> being an arithmetic
             * shift, which holds on the compilers targeted here. */
            dst[r * stride + c] = (uint8_t) clip_u8(v >> 10);
        }
    }
}