c3301b0c2e
Mirror of cycle 9's mc20, transposed to vertical orientation. Wires
up the second qpel half-pel position via the vendored
ff_put_h264_qpel8_mc02_neon symbol and closes the "missing vertical
sibling" gap that mc20 has left open since cycle 9.
Scope:
- Public API: daedalus_dispatch_h264_qpel_mc02 + recipe wrapper.
- Internal: dispatch_h264_qpel_mc02_cpu calling the NEON entry.
- Recipe table: DAEDALUS_KERNEL_H264_QPEL_MC02 = 17 → CPU.
  Explicit SUBSTRATE_QPU returns -1 (no shader yet).
- C reference: tests/h264_qpel8_mc02_ref.c, a vertical 6-tap
  transpose of mc20 (reads src[(r±N)*stride + c] instead of
  src[r*stride + c±N]).
- Test: test_qpel_mc02 in test_api_h264, 8 tiles × 16 cols
  × 16 rows (2048 bytes), random input, bit-exact compare against
  the C ref.
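For readers who have not seen the reference file, the vertical 6-tap read
pattern from the C reference bullet can be sketched as below. This is an
illustrative sketch, not the contents of tests/h264_qpel8_mc02_ref.c: the
function names are hypothetical, the (1, -5, 20, 20, -5, 1) taps with +16
rounding and >>5 are the standard H.264 luma half-pel filter, and sharing
one stride between dst and src is an assumption.

```c
#include <stddef.h>
#include <stdint.h>

/* Round the 6-tap sum (+16, >>5) and clamp to the 8-bit pixel range.
 * Negative sums are clamped before shifting to avoid relying on
 * implementation-defined right shifts of negative values. */
static uint8_t round_clip_u8(int sum)
{
    int v = sum + 16;
    if (v < 0)
        return 0;
    v >>= 5;
    return (uint8_t)(v > 255 ? 255 : v);
}

/* Hypothetical sketch of the vertical half-pel (mc02) position on an
 * 8x8 block. src must have 2 readable rows above the block and 3 below.
 * Assumption: dst and src share the same stride. */
static void qpel8_mc02_ref_sketch(uint8_t *dst, const uint8_t *src,
                                  ptrdiff_t stride)
{
    for (int r = 0; r < 8; r++) {
        for (int c = 0; c < 8; c++) {
            /* Taps walk rows (r-2 .. r+3) at a fixed column c; mc20
             * walks columns (c-2 .. c+3) at a fixed row instead. */
            const uint8_t *p = src + r * stride + c;
            int sum = p[-2 * stride] - 5 * p[-1 * stride]
                    + 20 * p[0]      + 20 * p[1 * stride]
                    - 5 * p[2 * stride] + p[3 * stride];
            dst[r * stride + c] = round_clip_u8(sum);
        }
    }
}
```

The 2-above / 3-below row padding is what a test harness has to allocate
around each tile before calling either this sketch or the NEON entry.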
Verified on hertz:
    $ ./build/test_api_h264
    ...
    H.264 qpel mc20: 1024/1024 bytes bit-exact (100.0000%)
    H.264 qpel mc02: 2048/2048 bytes bit-exact (100.0000%)
All 12 H.264 kernels in the api_smoke now bit-exact PASS.
Why CPU-only: same R-band logic as the deblock _h sibling pattern.
At ~7.6 ns per 8x8 block on NEON (per the cycle 9 baseline
measurements), mc02 costs ~250 us for 8160 MBs × 4 8x8 luma blocks
at 1080p (8160 × 4 × 7.6 ns), comfortably inside the 33 ms budget.
The QPU shader is a fast-follow once the V vs H shader work is
consolidated; the transpose for the V shader is not mechanical,
because its SIMD access pattern differs from the H shader's.
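The frame-level estimate can be re-derived from the per-block figure with
a few lines of C. This is a back-of-envelope sketch: the helper names are
hypothetical, the inputs are the numbers quoted above, and it assumes the
worst case where every 8x8 luma block in the frame hits mc02.

```c
#include <stdio.h>

/* Macroblocks per frame: dimensions are rounded up to whole 16x16 MBs,
 * so 1920x1080 is coded as 120 x 68 macroblocks. */
static int mbs_per_frame(int width, int height)
{
    return ((width + 15) / 16) * ((height + 15) / 16);
}

/* Frame cost in microseconds from a per-8x8-block cost in nanoseconds. */
static double frame_cost_us(int mbs, int blocks_per_mb, double ns_per_block)
{
    return (double)mbs * blocks_per_mb * ns_per_block / 1000.0;
}

/* Print the estimate next to the 33 ms (33000 us) frame budget. */
static void print_mc02_budget(void)
{
    int mbs = mbs_per_frame(1920, 1080);
    printf("%d MBs, %.1f us of 33000 us\n", mbs, frame_cost_us(mbs, 4, 7.6));
}
```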
Coverage matrix update:
  qpel position   put_ status   avg_ status
  -------------   -----------   -----------
  mc00 (copy)     not wired     not wired
  mc10 (¼-H)      not wired     not wired
  mc20 (½-H)      ✓ QPU+CPU     not wired
  mc30 (¾-H)      not wired     not wired
  mc01 (¼-V)      not wired     not wired
  mc02 (½-V)      ✓ CPU         not wired    (this PR)
  mc03 (¾-V)      not wired     not wired
  mc11..mc33      not wired     not wired
14 more qpel positions (16 total, 2 wired) to go for the full put_
matrix. Adding them
follows the same template; each is a small contained PR.