c3301b0c2e
Mirror of cycle 9's mc20, transposed to vertical orientation. Wires
up the second qpel half-pel position via the vendored
ff_put_h264_qpel8_mc02_neon symbol, closing the "missing vertical
sibling" gap that mc20 has left open since cycle 9.
Scope:
- Public API: daedalus_dispatch_h264_qpel_mc02 + recipe wrapper.
- Internal: dispatch_h264_qpel_mc02_cpu calling the NEON entry.
- Recipe table: DAEDALUS_KERNEL_H264_QPEL_MC02 = 17 → CPU.
  Explicit SUBSTRATE_QPU returns -1 (no shader yet).
- C reference: tests/h264_qpel8_mc02_ref.c, a vertical 6-tap
  transpose of mc20 (reads src[(r±N)*stride + c] instead of
  src[r*stride + c±N]).
- Test: test_qpel_mc02 in test_api_h264, 8 tiles × 16 cols
  × 16 rows (2048 bytes), random input, bit-exact compare against
  the C ref.
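
The index swap described for the C reference can be sketched with a single tap helper that takes a step: a step of 1 walks columns (the mc20 orientation), a step of `stride` walks rows (the mc02 orientation). The names `qpel_halfpel_tap6` and `qpel8_halfpel_sketch` are hypothetical and exist only for this illustration; the reference in this commit spells the vertical case out directly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static inline int clip_u8(int v) { return v < 0 ? 0 : v > 255 ? 255 : v; }

/* Hypothetical helper: 6-tap half-pel filter along one axis.
 * step == 1      -> taps at p[c-2..c+3] (horizontal, mc20-style)
 * step == stride -> taps at rows r-2..r+3 (vertical, mc02-style) */
static uint8_t qpel_halfpel_tap6(const uint8_t *p, ptrdiff_t step)
{
    int v = p[-2 * step] - 5 * p[-1 * step] + 20 * p[0]
          + 20 * p[1 * step] - 5 * p[2 * step] + p[3 * step] + 16;
    return (uint8_t)clip_u8(v >> 5);
}

/* Hypothetical 8x8 "put" driver covering both orientations.
 * The caller must provide 2 samples of context before and 3 after
 * the block along the filtered axis. */
static void qpel8_halfpel_sketch(uint8_t *dst, const uint8_t *src,
                                 ptrdiff_t stride, int vertical)
{
    assert(stride >= 8);
    ptrdiff_t step = vertical ? stride : 1;
    for (int r = 0; r < 8; r++)
        for (int c = 0; c < 8; c++)
            dst[r * stride + c] = qpel_halfpel_tap6(src + r * stride + c, step);
}
```

On a plane whose value depends only on the row, the horizontal pass returns the row value unchanged, while the vertical pass returns the rounded midpoint between a row and the one below it. That makes a quick sanity check for the orientation.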
Verified on hertz:
  $ ./build/test_api_h264
  ...
  H.264 qpel mc20: 1024/1024 bytes bit-exact (100.0000%)
  H.264 qpel mc02: 2048/2048 bytes bit-exact (100.0000%)
All 12 H.264 kernels in the api_smoke now bit-exact PASS.
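
For readers unfamiliar with the report format, a byte-wise comparison producing a line of that shape might look like the sketch below. This is not the code in test_api_h264; `count_bit_exact` and `report_bit_exact` are hypothetical names, and the real test may be structured differently.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Count the bytes that agree between two equally sized buffers. */
static size_t count_bit_exact(const uint8_t *got, const uint8_t *want, size_t n)
{
    size_t ok = 0;
    for (size_t i = 0; i < n; i++)
        ok += (got[i] == want[i]);
    return ok;
}

/* Print a summary line and return 1 only when every byte matched. */
static int report_bit_exact(const char *kernel, size_t ok, size_t total)
{
    assert(total > 0 && ok <= total);
    printf("H.264 qpel %s: %zu/%zu bytes bit-exact (%.4f%%)\n",
           kernel, ok, total, 100.0 * (double)ok / (double)total);
    return ok == total;
}
```

Counting matches rather than stopping at the first mismatch keeps the percentage meaningful when a kernel is only partly wrong, for example when just the bottom context rows are misread.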
Why CPU-only: same R-band logic as the deblock _h sibling pattern.
mc02 at ~7.6 ns per 8x8 block on NEON (per the cycle 9 baseline
measurements) gives ~250 us for 8160 MBs × 4 8x8 luma blocks
(32640 blocks) at 1080p, comfortably inside the 33 ms per-frame
budget. The QPU shader is a fast-follow once the V vs H shader work
is consolidated; the transpose for the V shader is not mechanical
because its SIMD access pattern differs from the H shader's.
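
The budget estimate is a straight multiplication, sketched below. It assumes the per-block cost scales linearly, with no call overhead or cache effects, and that every 8x8 luma block of a 1080p frame (coded as 1920×1088, i.e. 120 × 68 macroblocks) goes through mc02, which is a worst case. The helper names are hypothetical.

```c
#include <assert.h>

/* Macroblocks per frame, rounding each dimension up to a multiple of 16. */
static int frame_mb_count(int width, int height)
{
    assert(width > 0 && height > 0);
    return ((width + 15) / 16) * ((height + 15) / 16);
}

/* Linear cost model: blocks per frame times per-block cost, in microseconds. */
static double qpel_frame_cost_us(int mb_count, int blocks_per_mb, double ns_per_block)
{
    assert(mb_count >= 0 && blocks_per_mb >= 0 && ns_per_block >= 0.0);
    return (double)mb_count * blocks_per_mb * ns_per_block / 1000.0;
}
```

With the figures quoted above, `qpel_frame_cost_us(frame_mb_count(1920, 1080), 4, 7.6)` comes to roughly 248 us, which is under 1% of a 33 ms frame.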
Coverage matrix update:
  qpel position   put_ status   avg_ status
  -------------   -----------   -----------
  mc00 (copy)     not wired     not wired
  mc10 (¼-H)      not wired     not wired
  mc20 (½-H)      ✓ QPU+CPU     not wired
  mc30 (¾-H)      not wired     not wired
  mc01 (¼-V)      not wired     not wired
  mc02 (½-V)      ✓ CPU         not wired   (this PR)
  mc03 (¾-V)      not wired     not wired
  mc11..mc33      not wired     not wired
14 more qpel positions to go for the full put_ matrix (16 positions,
2 wired). Adding them follows the same template; each is a small,
contained PR.
46 lines
1.7 KiB
C
/*
 * Standalone bit-exact C reference for H.264 luma qpel 8×8 mc02
 * (vertical half-pel, "put" variant). Mirror of mc20 with rows
 * and columns transposed. 6-tap filter applied vertically:
 *
 *   dst[r,c] = clip255( (s[r-2,c] - 5*s[r-1,c] + 20*s[r,c]
 *                        + 20*s[r+1,c] - 5*s[r+2,c] + s[r+3,c]
 *                        + 16) >> 5 )
 *
 * Mirrors FFmpeg `ff_put_h264_qpel8_mc02_neon` (in
 * external/ffmpeg-snapshot/libavcodec/aarch64/h264qpel_neon.S
 * line 678, which tail-calls put_h264_qpel8_v_lowpass_neon).
 *
 * Signature:
 *   void(uint8_t *dst, const uint8_t *src, ptrdiff_t stride);
 *
 * Both dst and src use the SAME stride. src points at row 0 col 0
 * of the output block; the filter reads rows -2..+3 (2 rows of top
 * context, 3 rows of bottom context). Caller must guarantee the
 * source buffer has those rows available (FFmpeg's edge-emulated
 * buffer handles this at the frame boundary; matches the contract
 * documented for mc20).
 *
 * License: LGPL-2.1-or-later.
 */
#include <stdint.h>
#include <stddef.h>

static inline int clip_u8(int v) { return v < 0 ? 0 : v > 255 ? 255 : v; }

void daedalus_put_h264_qpel8_mc02_ref(uint8_t *dst, const uint8_t *src, ptrdiff_t stride)
{
    for (int r = 0; r < 8; r++) {
        for (int c = 0; c < 8; c++) {
            int s_m2 = src[(r - 2) * stride + c];
            int s_m1 = src[(r - 1) * stride + c];
            int s_0  = src[(r + 0) * stride + c];
            int s_p1 = src[(r + 1) * stride + c];
            int s_p2 = src[(r + 2) * stride + c];
            int s_p3 = src[(r + 3) * stride + c];
            int v = s_m2 - 5 * s_m1 + 20 * s_0 + 20 * s_p1 - 5 * s_p2 + s_p3 + 16;
            dst[r * stride + c] = (uint8_t) clip_u8(v >> 5);
        }
    }
}