h264: qpel mc02 (vertical half-pel, CPU/NEON) #15
Reference in New Issue
Block a user
Delete Branch "noether/h264-qpel-mc02"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Mirror of cycle 9's mc20 transposed to vertical orientation — wires up
ff_put_h264_qpel8_mc02_neonvia the established CPU-dispatch pattern. Closes the missing vertical half-pel sibling from the qpel matrix.New kernel
DAEDALUS_KERNEL_H264_QPEL_MC02 = 17→ CPU. Explicit SUBSTRATE_QPU returns -1 (no shader yet). C reference is the row/col transpose of mc20's reference. Test passes 2048/2048 bytes bit-exact on hertz.All 12 H.264 kernels in test_api_h264 now bit-exact PASS. 13 more qpel positions to go for full put_ matrix coverage (the 4 quarter-pel H, 3 V variants, 9 HV combinations) — same template per variant.
Mirror of cycle 9's mc20 transposed to vertical orientation. Wires up the second qpel half-pel position via the vendored ff_put_h264_qpel8_mc02_neon symbol, closes the "missing vertical sibling" gap that mc20 left open since cycle 9. Scope: - Public API: daedalus_dispatch_h264_qpel_mc02 + recipe wrapper. - Internal: dispatch_h264_qpel_mc02_cpu calling the NEON entry. - Recipe table: DAEDALUS_KERNEL_H264_QPEL_MC02 = 17 → CPU. Explicit SUBSTRATE_QPU returns -1 (no shader yet). - C reference: tests/h264_qpel8_mc02_ref.c — vertical 6-tap transpose of mc20 (reads src[(r±N)*stride + c] instead of src[r*stride + c±N]). - Test: test_qpel_mc02 in test_api_h264, 8 tiles × 16×16 cols × 16 rows, random input, bit-exact compare against the C ref. Verified on hertz: $ ./build/test_api_h264 ... H.264 qpel mc20: 1024/1024 bytes bit-exact (100.0000%) H.264 qpel mc02: 2048/2048 bytes bit-exact (100.0000%) All 12 H.264 kernels in the api_smoke now bit-exact PASS. Why CPU-only: same R-band logic as the deblock _h sibling pattern. mc02 at ~7.6 ns per 8x8 block on NEON (per the cycle 9 baseline measurements) gives ~700 us for 8160 MBs × 4 8x8 luma blocks at 1080p — comfortably inside the 33 ms budget. QPU shader is a fast-follow once the V vs H shader work is consolidated (the transpose for the V shader is not mechanical — different SIMD access pattern than the H shader). Coverage matrix update: qpel position put_ status avg_ status ------------- ----------- ----------- mc00 (copy) not wired not wired mc10 (¼-H) not wired not wired mc20 (½-H) ✓ QPU+CPU not wired mc30 (¾-H) not wired not wired mc01 (¼-V) not wired not wired mc02 (½-V) ✓ CPU not wired (this PR) mc03 (¾-V) not wired not wired mc11..mc33 not wired not wired 13 more qpel positions to go for the full put_ matrix. Adding them follows the same template; each is a small contained PR.