h264: qpel single-axis quarter-pel — mc10/mc30/mc01/mc03 (CPU/NEON)
Closes the 4 single-axis quarter-pel positions in one PR. Each is
a half-pel lowpass clipped to u8, followed by an L2 rounded average
((a + b + 1) >> 1) with an integer-aligned source pixel, per
H.264 §8.4.2.2.1:
  mc10  ¼-H ("a" pos): clip255(mc20(s)) avg src[r,c]
  mc30  ¾-H ("c" pos): clip255(mc20(s)) avg src[r,c+1]
  mc01  ¼-V ("d" pos): clip255(mc02(s)) avg src[r,c]
  mc03  ¾-V ("n" pos): clip255(mc02(s)) avg src[r+1,c]
The mc10/mc30 pair and mc01/mc03 pair only differ in WHICH integer
source pixel they average with — the half-pel computation is the
same. Putting them in one PR is justified by that uniformity.
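A minimal scalar sketch of the four positions, for illustration only. It assumes the standard H.264 6-tap filter (1, -5, 20, 20, -5, 1) with +16 >> 5 rounding; the names qa_mode, hpel_6tap and qa_put_qpel8 are hypothetical and are not taken from the repository or from tests/h264_qpel8_quarter_axis_ref.c.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names: none of these identifiers come from the repository. */
typedef enum { QA_MC10, QA_MC30, QA_MC01, QA_MC03 } qa_mode;

/* H.264 6-tap half-pel filter (1,-5,20,20,-5,1) centred between p[0] and
 * p[step]; step = 1 filters horizontally, step = stride vertically. */
static uint8_t hpel_6tap(const uint8_t *p, ptrdiff_t step)
{
    int v = p[-2 * step] - 5 * p[-step] + 20 * p[0]
          + 20 * p[step] - 5 * p[2 * step] + p[3 * step] + 16;
    if (v < 0)
        return 0;                 /* clip low before shifting */
    v >>= 5;
    return (uint8_t)(v > 255 ? 255 : v);
}

/* Scalar reference for one 8x8 block. src needs 2 valid pixels before
 * and 3 after the block along the filtered axis. */
static void qa_put_qpel8(uint8_t *dst, const uint8_t *src,
                         ptrdiff_t stride, qa_mode mode)
{
    const int horiz = (mode == QA_MC10 || mode == QA_MC30);
    const ptrdiff_t step = horiz ? 1 : stride;
    /* 1/4 positions average with src[r,c]; 3/4 positions average with
     * the next integer pixel along the filtered axis. */
    const ptrdiff_t nbr = (mode == QA_MC30 || mode == QA_MC03) ? step : 0;

    for (int r = 0; r < 8; r++) {
        for (int c = 0; c < 8; c++) {
            const uint8_t *p = src + r * stride + c;
            dst[r * stride + c] =
                (uint8_t)((hpel_6tap(p, step) + p[nbr] + 1) >> 1);
        }
    }
}
```

The only per-mode state is the filter direction and the neighbour offset, which is the uniformity the paragraph above relies on.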
Scope:
- 4 new kernel enums: MC10=19, MC30=20, MC01=21, MC03=22 → CPU.
- 4 NEON externs for the vendored ff_put_h264_qpel8_mc{10,30,01,03}_neon.
- 4 CPU dispatch wrappers via DEFINE_QPEL_CPU_DISPATCH macro
(collapses ~50 LOC of repetition).
- 4 public dispatch fns via DEFINE_QPEL_DISPATCH macro.
- 4 recipe wrappers via DEFINE_QPEL_RECIPE macro.
- tests/h264_qpel8_quarter_axis_ref.c covers all four via shared
hpel_h() / hpel_v() inlines + per-mode L2 average.
- Test refactor: generic run_quarter_axis_qpel() harness exercises
all 4 positions through a single helper (~50 LOC for 4 tests vs
~200 if each were hand-rolled).
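The macro-based wrappers in the list above might look roughly like the sketch below. This is not the repository's DEFINE_QPEL_CPU_DISPATCH: the macro body, the stub kernels, the sketch_/stub_ names, and the assumption that blocks sit side by side 8 bytes apart are all hypothetical. The stubs stand in for the vendored NEON functions so the example builds on any host.

```c
#include <stddef.h>
#include <stdint.h>

/* Stub kernels standing in for the vendored NEON functions so the sketch
 * builds on any host. Each fills its 8x8 block with a tag byte. */
#define DEFINE_STUB_KERNEL(pos, tag)                                      \
    static void stub_put_h264_qpel8_##pos(uint8_t *dst,                   \
                                          const uint8_t *src,             \
                                          ptrdiff_t stride)               \
    {                                                                     \
        (void)src;                                                        \
        for (int r = 0; r < 8; r++)                                       \
            for (int c = 0; c < 8; c++)                                   \
                dst[r * stride + c] = (tag);                              \
    }

/* Hypothetical wrapper generator: one CPU dispatch function per position.
 * Assumes blocks sit side by side, 8 bytes apart, in one row. */
#define DEFINE_QPEL_CPU_DISPATCH_SKETCH(pos)                              \
    static int sketch_cpu_h264_qpel_##pos(uint8_t *dst,                   \
                                          const uint8_t *src,             \
                                          size_t stride, size_t n_blocks) \
    {                                                                     \
        if (!dst || !src)                                                 \
            return -1;                                                    \
        for (size_t i = 0; i < n_blocks; i++)                             \
            stub_put_h264_qpel8_##pos(dst + 8 * i, src + 8 * i,           \
                                      (ptrdiff_t)stride);                 \
        return 0;                                                         \
    }

DEFINE_STUB_KERNEL(mc10, 10)
DEFINE_STUB_KERNEL(mc30, 30)
DEFINE_STUB_KERNEL(mc01, 1)
DEFINE_STUB_KERNEL(mc03, 3)

DEFINE_QPEL_CPU_DISPATCH_SKETCH(mc10)
DEFINE_QPEL_CPU_DISPATCH_SKETCH(mc30)
DEFINE_QPEL_CPU_DISPATCH_SKETCH(mc01)
DEFINE_QPEL_CPU_DISPATCH_SKETCH(mc03)
```

Token pasting with ## keeps each new position to a single line, which is where the LOC saving comes from.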
Verified on hertz:
$ ./build/test_api_h264 | tail -8
H.264 deblock chroma h intra: 256/256 bytes bit-exact (100.0000%)
H.264 qpel mc20: 1024/1024 bytes bit-exact (100.0000%)
H.264 qpel mc02: 2048/2048 bytes bit-exact (100.0000%)
H.264 qpel mc22: 2048/2048 bytes bit-exact (100.0000%)
H.264 qpel mc10: 2048/2048 bytes bit-exact (100.0000%)
H.264 qpel mc30: 2048/2048 bytes bit-exact (100.0000%)
H.264 qpel mc01: 2048/2048 bytes bit-exact (100.0000%)
H.264 qpel mc03: 2048/2048 bytes bit-exact (100.0000%)
All 4 new positions bit-exact PASS first try.
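For readers unfamiliar with the report lines above, a bit-exact check of this kind can be sketched as a byte-wise comparison of the kernel output against the scalar reference. The function names count_bit_exact and report_bit_exact are hypothetical; the actual harness in test_api_h264 may be structured differently.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Count bytes that match exactly between kernel output and reference. */
static size_t count_bit_exact(const uint8_t *got, const uint8_t *ref, size_t n)
{
    size_t ok = 0;
    for (size_t i = 0; i < n; i++)
        ok += (got[i] == ref[i]);
    return ok;
}

/* Print a summary line in the same shape as the log above.
 * Returns 0 when every byte matches, 1 otherwise. */
static int report_bit_exact(const char *name, const uint8_t *got,
                            const uint8_t *ref, size_t n)
{
    size_t ok = count_bit_exact(got, ref, n);
    double pct = n ? 100.0 * (double)ok / (double)n : 0.0;
    printf("%s: %zu/%zu bytes bit-exact (%.4f%%)\n", name, ok, n, pct);
    return ok == n ? 0 : 1;
}
```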
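Coverage matrix update (put_; cell = mcXY, columns X = horizontal
quarter offset, rows Y = vertical quarter offset):
         X=0  X=1  X=2  X=3
  Y=0     -    ✓    ✓    ✓    ← mc10/mc30 new this PR
  Y=1     ✓    -    -    -    ← mc01 new this PR
  Y=2     ✓    -    ✓    -    ← mc02 + mc22 anchor
  Y=3     ✓    -    -    -    ← mc03 new this PR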
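After this PR: 7 of 16 qpel positions done (3 half-pel + 4
single-axis quarter-pel). The remaining 9 are mc00 and the 8
off-axis quarter-pel combinations
(mc11/mc12/mc13/mc21/mc23/mc31/mc32/mc33). Of those, the corners
(mc11/mc13/mc31/mc33) L2-average a horizontal and a vertical 1D
half-pel output, while mc12/mc21/mc23/mc32 L2-average the 2D mc22
intermediate with a 1D half-pel output.
Next PR scope.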
Why no QPU shaders: same R-band logic as the prior CPU additions.
At ~10 ns per 8x8 NEON block, all 16 qpel positions together
would land in ~1.3 ms/frame at 1080p worst case — comfortably
inside the 33 ms budget. QPU shader for mc20 already exists
(cycle 9 / v3d_h264_qpel_mc20.spv); the other 15 follow once a
clear perf reason emerges.
@@ -436,6 +436,45 @@ int daedalus_dispatch_h264_qpel_mc22(daedalus_ctx *ctx, daedalus_substrate sub,
     uint8_t *dst, const uint8_t *src, size_t stride,
     size_t n_blocks, const daedalus_h264_qpel_meta *meta);
 
+/* H.264 luma single-axis quarter-pel qpel positions ("put"):
+ *   mc10  ¼-H ("a" position): clip255(mc20(s)) avg src[r,c]
+ *   mc30  ¾-H ("c" position): clip255(mc20(s)) avg src[r,c+1]
+ *   mc01  ¼-V ("d" position): clip255(mc02(s)) avg src[r,c]
+ *   mc03  ¾-V ("n" position): clip255(mc02(s)) avg src[r+1,c]
+ *
+ * Each is a half-pel lowpass clipped to u8 then averaged with an
+ * integer-aligned source pixel (rounded +1 >> 1). Same edge
+ * context contract as mc20/mc02. CPU-only for now; QPU shaders
+ * not yet implemented. Explicit SUBSTRATE_QPU returns -1.
+ */
+int daedalus_recipe_dispatch_h264_qpel_mc10(daedalus_ctx *ctx,
+    uint8_t *dst, const uint8_t *src, size_t stride,
+    size_t n_blocks, const daedalus_h264_qpel_meta *meta);
+int daedalus_dispatch_h264_qpel_mc10(daedalus_ctx *ctx, daedalus_substrate sub,
+    uint8_t *dst, const uint8_t *src, size_t stride,
+    size_t n_blocks, const daedalus_h264_qpel_meta *meta);
+
+int daedalus_recipe_dispatch_h264_qpel_mc30(daedalus_ctx *ctx,
+    uint8_t *dst, const uint8_t *src, size_t stride,
+    size_t n_blocks, const daedalus_h264_qpel_meta *meta);
+int daedalus_dispatch_h264_qpel_mc30(daedalus_ctx *ctx, daedalus_substrate sub,
+    uint8_t *dst, const uint8_t *src, size_t stride,
+    size_t n_blocks, const daedalus_h264_qpel_meta *meta);
+
+int daedalus_recipe_dispatch_h264_qpel_mc01(daedalus_ctx *ctx,
+    uint8_t *dst, const uint8_t *src, size_t stride,
+    size_t n_blocks, const daedalus_h264_qpel_meta *meta);
+int daedalus_dispatch_h264_qpel_mc01(daedalus_ctx *ctx, daedalus_substrate sub,
+    uint8_t *dst, const uint8_t *src, size_t stride,
+    size_t n_blocks, const daedalus_h264_qpel_meta *meta);
+
+int daedalus_recipe_dispatch_h264_qpel_mc03(daedalus_ctx *ctx,
+    uint8_t *dst, const uint8_t *src, size_t stride,
+    size_t n_blocks, const daedalus_h264_qpel_meta *meta);
+int daedalus_dispatch_h264_qpel_mc03(daedalus_ctx *ctx, daedalus_substrate sub,
+    uint8_t *dst, const uint8_t *src, size_t stride,
+    size_t n_blocks, const daedalus_h264_qpel_meta *meta);
+
 /* -------------------------------------------------------------------
  * Recipe query — what does the API recommend for each kernel?
  * ----------------------------------------------------------------- */
@@ -458,6 +497,10 @@ typedef enum {
     DAEDALUS_KERNEL_H264_DEBLOCK_CH_INTRA = 16,
     DAEDALUS_KERNEL_H264_QPEL_MC02 = 17,
     DAEDALUS_KERNEL_H264_QPEL_MC22 = 18,
+    DAEDALUS_KERNEL_H264_QPEL_MC10 = 19,
+    DAEDALUS_KERNEL_H264_QPEL_MC30 = 20,
+    DAEDALUS_KERNEL_H264_QPEL_MC01 = 21,
+    DAEDALUS_KERNEL_H264_QPEL_MC03 = 22,
 } daedalus_kernel;
 
 daedalus_substrate daedalus_recipe_substrate_for(daedalus_kernel k);