h264: qpel mc02 (vertical half-pel, CPU/NEON) #15

2026-05-24T22:47:51Z

marfrit commented

2026-05-24 22:47:51 +00:00

Mirror of cycle 9's mc20 transposed to vertical orientation — wires up ff_put_h264_qpel8_mc02_neon via the established CPU-dispatch pattern. Closes the missing vertical half-pel sibling from the qpel matrix.

New kernel DAEDALUS_KERNEL_H264_QPEL_MC02 = 17 → CPU. Explicit SUBSTRATE_QPU returns -1 (no shader yet). C reference is the row/col transpose of mc20's reference. Test passes 2048/2048 bytes bit-exact on hertz.

All 12 H.264 kernels in test_api_h264 now bit-exact PASS. 13 more qpel positions to go for full put_ matrix coverage (the 4 quarter-pel H, 3 V variants, 9 HV combinations) — same template per variant.

Mirror of cycle 9's mc20 transposed to vertical orientation — wires up `ff_put_h264_qpel8_mc02_neon` via the established CPU-dispatch pattern. Closes the missing vertical half-pel sibling from the qpel matrix. New kernel `DAEDALUS_KERNEL_H264_QPEL_MC02 = 17` → CPU. Explicit SUBSTRATE_QPU returns -1 (no shader yet). C reference is the row/col transpose of mc20's reference. Test passes 2048/2048 bytes bit-exact on hertz. **All 12 H.264 kernels in test_api_h264 now bit-exact PASS.** 13 more qpel positions to go for full put_ matrix coverage (the 4 quarter-pel H, 3 V variants, 9 HV combinations) — same template per variant.

marfrit added 1 commit 2026-05-24 22:47:52 +00:00

h264: qpel mc02 (vertical half-pel, CPU/NEON) c3301b0c2e

Mirror of cycle 9's mc20 transposed to vertical orientation.  Wires
up the second qpel half-pel position via the vendored
ff_put_h264_qpel8_mc02_neon symbol, closes the "missing vertical
sibling" gap that mc20 left open since cycle 9.

Scope:
  - Public API: daedalus_dispatch_h264_qpel_mc02 + recipe wrapper.
  - Internal: dispatch_h264_qpel_mc02_cpu calling the NEON entry.
  - Recipe table: DAEDALUS_KERNEL_H264_QPEL_MC02 = 17 → CPU.
    Explicit SUBSTRATE_QPU returns -1 (no shader yet).
  - C reference: tests/h264_qpel8_mc02_ref.c — vertical 6-tap
    transpose of mc20 (reads src[(r±N)*stride + c] instead of
    src[r*stride + c±N]).
  - Test: test_qpel_mc02 in test_api_h264, 8 tiles × 16×16 cols
    × 16 rows, random input, bit-exact compare against the C ref.

Verified on hertz:

  $ ./build/test_api_h264
  ...
    H.264 qpel mc20: 1024/1024 bytes bit-exact (100.0000%)
    H.264 qpel mc02: 2048/2048 bytes bit-exact (100.0000%)

  All 12 H.264 kernels in the api_smoke now bit-exact PASS.

Why CPU-only: same R-band logic as the deblock _h sibling pattern.
mc02 at ~7.6 ns per 8x8 block on NEON (per the cycle 9 baseline
measurements) gives ~700 us for 8160 MBs × 4 8x8 luma blocks at
1080p — comfortably inside the 33 ms budget.  QPU shader is a
fast-follow once the V vs H shader work is consolidated (the
transpose for the V shader is not mechanical — different SIMD
access pattern than the H shader).

Coverage matrix update:

  qpel position  put_ status  avg_ status
  -------------  -----------  -----------
  mc00 (copy)    not wired    not wired
  mc10 (¼-H)     not wired    not wired
  mc20 (½-H)    ✓ QPU+CPU     not wired
  mc30 (¾-H)     not wired    not wired
  mc01 (¼-V)     not wired    not wired
  mc02 (½-V)    ✓ CPU         not wired (this PR)
  mc03 (¾-V)     not wired    not wired
  mc11..mc33     not wired    not wired

13 more qpel positions to go for the full put_ matrix.  Adding them
follows the same template; each is a small contained PR.

marfrit merged commit a2575d5e42 into main

2026-05-24 22:59:42 +00:00

marfrit deleted branch noether/h264-qpel-mc02

2026-05-24 22:59:43 +00:00

claude-noether referenced this issue from a commit

2026-05-24 23:03:20 +00:00

h264: qpel mc22 (2D half-pel, CPU/NEON)

Sign in to join this conversation.

1 Participants

Notifications

Due Date

No due date set.

Dependencies

No dependencies set.

Reference: marfrit/daedalus-fourier#15