20a4299c5c
Adds the "j position" 2D half-pel via cascaded H + V 6-tap lowpass
with intermediate 16-bit precision per H.264 §8.4.2.2.1. One of the
most common qpel positions in real H.264 streams — many encoders
emit 1/2-1/2 motion vectors as their best-RD choice.
Algorithmically distinct from the 1D mc20/mc02 siblings:
- Horizontal 6-tap produces 13 rows of int16 intermediate (no
per-stage clip/round — full precision retained).
- Vertical 6-tap on the intermediate, then (sum + 512) >> 10 and
clip255. Each 6-tap has a gain of 32, so the cascade has a gain of
1024 = 2^10; the 512 is the round-to-nearest bias.
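The precision claim in the two bullets above can be checked with a short worst-case bound. This is only a sketch: SIXTAP and sixtap_bounds are hypothetical names for illustration and are not part of the daedalus tree.

```c
#include <stdint.h>

/* H.264 luma six-tap kernel; the taps sum to 32. */
static const int SIXTAP[6] = { 1, -5, 20, 20, -5, 1 };

/* Worst-case range of sum(SIXTAP[i] * x[i]) when every x[i] is in [lo, hi]. */
static void sixtap_bounds(int lo, int hi, int *min_out, int *max_out)
{
    int mn = 0, mx = 0;
    for (int i = 0; i < 6; i++) {
        /* A positive tap is maximised by hi, a negative tap by lo. */
        mx += SIXTAP[i] * (SIXTAP[i] > 0 ? hi : lo);
        mn += SIXTAP[i] * (SIXTAP[i] > 0 ? lo : hi);
    }
    *min_out = mn;
    *max_out = mx;
}
```

Called with (0, 255) this gives [-2550, 10710] for the horizontal stage, which fits in int16. Feeding that range back in gives [-214200, 475320] for the vertical sum, so the second stage needs a 32-bit accumulator and the final clip is still required.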
The intermediate-precision requirement means the C reference can't
just be "call mc20 then mc02": that would round and clip to 8 bits
after each stage and produce the wrong result. The 13-row int16 tmp[]
buffer is the central invariant.
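A minimal sketch of why the cascade cannot be built from the 1D kernels, computing a single "j" sample from a 6x6 patch both ways. The names j_exact, j_cascaded and demo_patch are hypothetical and exist only for this illustration.

```c
#include <stdint.h>

static int clip_u8(int v) { return v < 0 ? 0 : v > 255 ? 255 : v; }

static int sixtap(int a, int b, int c, int d, int e, int f)
{
    return a - 5 * b + 20 * c + 20 * d - 5 * e + f;
}

/* Full-precision path: unrounded horizontal sums, one final (+512) >> 10. */
static int j_exact(const uint8_t p[6][6])
{
    int t[6];
    for (int r = 0; r < 6; r++)
        t[r] = sixtap(p[r][0], p[r][1], p[r][2], p[r][3], p[r][4], p[r][5]);
    return clip_u8((sixtap(t[0], t[1], t[2], t[3], t[4], t[5]) + 512) >> 10);
}

/* "mc20 then mc02" path: round and clip to 8 bits after each stage. */
static int j_cascaded(const uint8_t p[6][6])
{
    int b[6];
    for (int r = 0; r < 6; r++)
        b[r] = clip_u8((sixtap(p[r][0], p[r][1], p[r][2],
                               p[r][3], p[r][4], p[r][5]) + 16) >> 5);
    return clip_u8((sixtap(b[0], b[1], b[2], b[3], b[4], b[5]) + 16) >> 5);
}

/* Rows 1 and 4 overshoot 255 horizontally, so the early clip loses data. */
static const uint8_t demo_patch[6][6] = {
    {   0,   0,   0,   0,   0,   0 },
    {   0,   0, 255, 255,   0,   0 },
    { 200, 200, 200, 200, 200, 200 },
    { 200, 200, 200, 200, 200, 200 },
    {   0,   0, 255, 255,   0,   0 },
    {   0,   0,   0,   0,   0,   0 },
};
```

For demo_patch, j_exact returns 150 and j_cascaded returns 170. The horizontal sum of rows 1 and 4 is 10200 (about 318.75 after scaling), which the cascaded path clips to 255 before the -5 taps are applied.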
Scope (same pattern as mc02 PR #15):
- Public API: daedalus_dispatch_h264_qpel_mc22 + recipe wrapper.
- Internal: dispatch_h264_qpel_mc22_cpu calling
ff_put_h264_qpel8_mc22_neon.
- Recipe table: DAEDALUS_KERNEL_H264_QPEL_MC22 = 18 → CPU.
- C reference: tests/h264_qpel8_mc22_ref.c — explicit tmp[13][8]
int16 staging buffer; spec-derived shifts and rounding.
- Test: test_qpel_mc22 in test_api_h264, 8 tiles at 16×16 with
output positioned at (SRC_ROW=3, SRC_COL=3) so the kernel's
[-2 .. +10] read window stays in-tile.
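The tile placement in the Test item can be sanity-checked with a few lines. This is a sketch only: SRC_ROW and SRC_COL come from the description above, while TILE, BLOCK and mc22_window_in_tile are hypothetical names, not the actual test code.

```c
enum { TILE = 16, BLOCK = 8, SRC_ROW = 3, SRC_COL = 3 };

/* Returns 1 when every sample mc22 reads for a BLOCK x BLOCK output whose
 * top-left corner is at (row, col) lies inside a TILE x TILE tile.
 * The kernel reads offsets -2 .. BLOCK + 2 in both directions. */
static int mc22_window_in_tile(int row, int col)
{
    int first_row = row - 2, last_row = row + BLOCK + 2;
    int first_col = col - 2, last_col = col + BLOCK + 2;
    return first_row >= 0 && last_row < TILE
        && first_col >= 0 && last_col < TILE;
}
```

With (SRC_ROW, SRC_COL) = (3, 3) the window covers rows and columns 1..13 of the tile. Under these assumptions any origin from 2 to 5 in each direction would also stay in-tile.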
Verified on hertz:
$ ./build/test_api_h264 | tail -5
H.264 deblock chroma v intra: 256/256 bytes bit-exact (100.0000%)
H.264 deblock chroma h intra: 256/256 bytes bit-exact (100.0000%)
H.264 qpel mc20: 1024/1024 bytes bit-exact (100.0000%)
H.264 qpel mc02: 2048/2048 bytes bit-exact (100.0000%)
H.264 qpel mc22: 2048/2048 bytes bit-exact (100.0000%)
All 13 H.264 kernels in api_smoke now bit-exact PASS.
mc22 being right first try is meaningful — the +512 >> 10 scaling
+ int16 intermediate sequence has multiple sign/shift/clip pitfalls
and any of them would surface on random inputs immediately.
Coverage matrix update:
put_ mc20 ✓ (QPU+CPU) put_ mc02 ✓ (CPU) put_ mc22 ✓ (CPU)
→ 12 single put_ positions still missing (¼/¾ + HV combos with
L2 averaging).
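The count of 12 follows from the 16 quarter-sample positions minus the full-pel mc00 and the three positions now covered. A small sketch, assuming the FFmpeg-style mcXY naming where X is the horizontal and Y the vertical quarter-sample offset; qpel_index and qpel_missing are hypothetical helpers, and the x + 4*y layout is an assumption made for this example.

```c
/* Index for position mcXY; x and y are quarter-sample offsets in 0..3. */
static int qpel_index(int x, int y) { return x + 4 * y; }

/* Counts fractional positions (everything except mc00) not marked done. */
static int qpel_missing(const unsigned char done[16])
{
    int missing = 0;
    for (int i = 1; i < 16; i++)   /* index 0 is mc00, the full-pel copy */
        if (!done[i])
            missing++;
    return missing;
}
```

Marking mc20, mc02 and mc22 as done and calling qpel_missing gives 12.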
71 lines
2.8 KiB
C
/*
 * Standalone bit-exact C reference for H.264 luma qpel 8x8 mc22
 * (2D half-pel, "put" variant). Cascade of horizontal 6-tap then
 * vertical 6-tap with INTERMEDIATE 16-bit precision (no per-stage
 * clip/round), final +512 >> 10 to scale back.
 *
 * Per H.264 §8.4.2.2.1, "j" position:
 *
 *   tmp[r,c] = s[r,c-2] - 5*s[r,c-1] + 20*s[r,c] + 20*s[r,c+1]
 *            - 5*s[r,c+2] + s[r,c+3]                 (16-bit signed)
 *
 *   dst[r,c] = clip255((tmp[r-2,c] - 5*tmp[r-1,c] + 20*tmp[r,c]
 *                       + 20*tmp[r+1,c] - 5*tmp[r+2,c] + tmp[r+3,c]
 *                       + 512) >> 10)
 *
 * The tmp[] array spans rows r-2 .. r+3 around each output row, so
 * we need 13 intermediate rows (rows -2..+10 of the SOURCE
 * neighbourhood) for 8 output rows. Caller's src must have 2 rows
 * of top context + 3 rows of bottom context AND 2 cols of left +
 * 3 cols of right context (FFmpeg's edge-emulated buffer provides
 * this at the frame boundary; same contract as mc20).
 *
 * Mirrors FFmpeg `ff_put_h264_qpel8_mc22_neon` (in
 * external/ffmpeg-snapshot/libavcodec/aarch64/h264qpel_neon.S
 * line 710, which tail-calls put_h264_qpel8_hv_lowpass_neon).
 *
 * Signature:
 *   void(uint8_t *dst, const uint8_t *src, ptrdiff_t stride);
 *
 * Same single-stride convention as mc20/mc02.
 *
 * License: LGPL-2.1-or-later.
 */
#include <stdint.h>
#include <stddef.h>

static inline int clip_u8(int v) { return v < 0 ? 0 : v > 255 ? 255 : v; }

void daedalus_put_h264_qpel8_mc22_ref(uint8_t *dst, const uint8_t *src, ptrdiff_t stride)
{
    /* 13 intermediate rows × 8 cols. The 8 output rows dst[0..7]
     * need filtered source rows -2..+10, so tmp[0..12] holds source
     * rows [-2..+10] (buffer index = source row + 2). */
    int16_t tmp[13][8];
    for (int rr = 0; rr < 13; rr++) {
        int src_row = rr - 2;  /* maps tmp[0..12] → src rows [-2..+10] */
        const uint8_t *s = src + src_row * stride;
        for (int c = 0; c < 8; c++) {
            int v = (int) s[c - 2] - 5 * (int) s[c - 1]
                  + 20 * (int) s[c] + 20 * (int) s[c + 1]
                  - 5 * (int) s[c + 2] + (int) s[c + 3];
            /* v lies in [-2550, 10710], so it fits in int16. */
            tmp[rr][c] = (int16_t) v;
        }
    }

    for (int r = 0; r < 8; r++) {
        /* Spec rows r-2..r+3 are tmp[r..r+5] after the +2 index offset. */
        for (int c = 0; c < 8; c++) {
            int v =      tmp[r + 0][c]   /* spec row r-2 */
                  -  5 * tmp[r + 1][c]   /* spec row r-1 */
                  + 20 * tmp[r + 2][c]   /* spec row r+0 */
                  + 20 * tmp[r + 3][c]   /* spec row r+1 */
                  -  5 * tmp[r + 4][c]   /* spec row r+2 */
                  +      tmp[r + 5][c]   /* spec row r+3 */
                  + 512;
            /* v can be negative; this relies on an arithmetic right shift. */
            dst[r * stride + c] = (uint8_t) clip_u8(v >> 10);
        }
    }
}