daedalus-fourier/tests at c3301b0c2e6efb852b18558a13fdf8a1f9eb4b43 - daedalus-fourier - marfrit's space

marfrit/daedalus-fourier


claude-noether c3301b0c2e h264: qpel mc02 (vertical half-pel, CPU/NEON)

Mirror of cycle 9's mc20, transposed to vertical orientation.  Wires
up the second qpel half-pel position via the vendored
ff_put_h264_qpel8_mc02_neon symbol and closes the "missing vertical
sibling" gap that mc20 has left open since cycle 9.

Scope:
  - Public API: daedalus_dispatch_h264_qpel_mc02 + recipe wrapper.
  - Internal: dispatch_h264_qpel_mc02_cpu calling the NEON entry.
  - Recipe table: DAEDALUS_KERNEL_H264_QPEL_MC02 = 17 → CPU.
    Explicit SUBSTRATE_QPU returns -1 (no shader yet).
  - C reference: tests/h264_qpel8_mc02_ref.c — vertical 6-tap
    transpose of mc20 (reads src[(r±N)*stride + c] instead of
    src[r*stride + c±N]).
  - Test: test_qpel_mc02 in test_api_h264, 8 tiles × 16 cols
    × 16 rows (2048 bytes), random input, bit-exact compare against
    the C ref.
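
The vertical 6-tap filter described in the scope list can be sketched as follows. This is a minimal illustration, not the contents of tests/h264_qpel8_mc02_ref.c: the function names are hypothetical, and the single shared stride for src and dst is an assumption borrowed from the usual FFmpeg qpel calling convention. The taps (1, -5, 20, 20, -5, 1) with +16 rounding and a shift by 5 are the standard H.264 luma half-pel filter.

```c
#include <stddef.h>
#include <stdint.h>

/* Round, shift and clamp a 6-tap sum to the 8-bit pixel range. */
static uint8_t round_shift_clip(int sum)
{
    int v = sum + 16;          /* rounding offset for the >> 5 below */
    if (v < 0)
        return 0;              /* avoids right-shifting a negative value */
    v >>= 5;
    return (uint8_t)(v > 255 ? 255 : v);
}

/*
 * Hypothetical name and signature, for illustration only.
 * Vertical half-pel "put" for one 8x8 block: every output pixel is the
 * 6-tap filter applied down a column, i.e. it reads
 * src[(r+k)*stride + c] for k = -2..3.
 * The caller must provide 2 valid rows above and 3 valid rows below
 * the block in src.
 */
static void qpel8_mc02_ref_sketch(uint8_t *dst, const uint8_t *src,
                                  ptrdiff_t stride)
{
    for (int r = 0; r < 8; r++) {
        for (int c = 0; c < 8; c++) {
            const uint8_t *p = src + r * stride + c;
            int sum = p[-2 * stride] - 5 * p[-1 * stride]
                    + 20 * p[0]      + 20 * p[1 * stride]
                    - 5 * p[2 * stride] + p[3 * stride];
            dst[r * stride + c] = round_shift_clip(sum);
        }
    }
}
```

Swapping the row offsets (k*stride) for column offsets (k) gives the horizontal mc20 form, which is the transpose relationship noted above.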

Verified on hertz:

  $ ./build/test_api_h264
  ...
    H.264 qpel mc20: 1024/1024 bytes bit-exact (100.0000%)
    H.264 qpel mc02: 2048/2048 bytes bit-exact (100.0000%)

  All 12 H.264 kernels in the api_smoke now bit-exact PASS.
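
For readers unfamiliar with the check, a bit-exact comparison of this kind can be sketched as below. The helper names are hypothetical and the real test_api_h264 may be structured differently; the sketch only shows the idea of counting identical bytes between the kernel output and the C reference and reporting the ratio.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: number of positions where the two buffers agree. */
static size_t count_matching_bytes(const uint8_t *got, const uint8_t *want,
                                   size_t n)
{
    size_t ok = 0;
    for (size_t i = 0; i < n; i++)
        ok += (got[i] == want[i]);
    return ok;
}

/* Prints a summary line and returns 1 only if every byte matches. */
static int report_bit_exact(const char *label, const uint8_t *got,
                            const uint8_t *want, size_t n)
{
    size_t ok = count_matching_bytes(got, want, n);
    double pct = n ? 100.0 * (double)ok / (double)n : 0.0;
    printf("  %s: %zu/%zu bytes bit-exact (%.4f%%)\n", label, ok, n, pct);
    return ok == n;
}
```

A single differing byte makes the check fail, which is the point: the NEON path is expected to reproduce the reference exactly, not approximately.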

Why CPU-only: same R-band logic as the deblock _h sibling pattern.
At ~7.6 ns per 8x8 block on NEON (the cycle 9 mc20 baseline of
131 Mblock/s, assumed to carry over to mc02), 8160 MBs × 4 8x8 luma
blocks at 1080p cost ~250 us, comfortably inside the 33 ms budget.
The QPU shader is a fast-follow once the V vs H shader work is
consolidated (the transpose for the V shader is not mechanical: it
has a different SIMD access pattern than the H shader).
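
As a rough cross-check of the budget estimate, the sketch below recomputes it from the figures quoted in this message (7.6 ns per block, a 1080p frame, four 8x8 luma blocks per macroblock). The function names are illustrative and not part of the repository, and the calculation assumes each block is filtered exactly once per frame.

```c
#include <stdio.h>

/* Macroblocks in a frame, rounding each dimension up to a multiple of 16. */
static long macroblocks(long width, long height)
{
    return ((width + 15) / 16) * ((height + 15) / 16);
}

/* Microseconds to process every block once at a fixed per-block cost. */
static double frame_cost_us(long mbs, int blocks_per_mb, double ns_per_block)
{
    return (double)mbs * blocks_per_mb * ns_per_block / 1000.0;
}

/* Prints the per-frame cost next to a frame budget given in milliseconds. */
static void print_budget(long width, long height, double ns_per_block,
                         double budget_ms)
{
    long mbs = macroblocks(width, height);
    double us = frame_cost_us(mbs, 4, ns_per_block);
    printf("%ld MBs, %.0f us per frame, %.2f%% of a %.0f ms budget\n",
           mbs, us, 100.0 * us / (budget_ms * 1000.0), budget_ms);
}
```

Calling print_budget(1920, 1080, 7.6, 33.0) works from 8160 macroblocks and a single-pass cost of roughly 0.25 ms, so even several passes per block would stay far below the 33 ms budget.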

Coverage matrix update:

  qpel position  put_ status  avg_ status
  -------------  -----------  -----------
  mc00 (copy)    not wired    not wired
  mc10 (¼-H)     not wired    not wired
  mc20 (½-H)     ✓ QPU+CPU    not wired
  mc30 (¾-H)     not wired    not wired
  mc01 (¼-V)     not wired    not wired
  mc02 (½-V)     ✓ CPU        not wired    (put_ wired in this PR)
  mc03 (¾-V)     not wired    not wired
  mc11..mc33     not wired    not wired

14 of the 16 qpel positions remain for the full put_ matrix (13 if
the mc00 copy is not counted).  Adding them follows the same template;
each is a small, contained PR.

2026-05-25 00:47:37 +02:00

..

Path B pivot + Phase 0-3 closed with first baseline numbers

2026-05-18 11:30:12 +00:00

.gitkeep

Path B pivot + Phase 0-3 closed with first baseline numbers

2026-05-18 11:30:12 +00:00

bench_concurrent_lpf8.c

Cycle 4 (LPF wd=8) closure: M1=100%, R=0.34, M4=+4.1%, PASS

2026-05-18 12:56:25 +00:00

bench_concurrent_lpf.c

Cycle 2 (LPF) closure: M1''=100%, R''=0.41, M4''=+6.9%, PASS

2026-05-18 12:39:26 +00:00

bench_concurrent_mc.c

Cycle 3 (MC interpolation) closure: M1'''=100%, R'''=0.067 RED, M4=-19.5%

2026-05-18 12:51:43 +00:00

bench_concurrent_mixed.c

Cycle 8 closed: H.264 deblock R8=0.061 RED, opportunistic helper

2026-05-18 14:44:21 +00:00

bench_concurrent.c

Phase 7 M4: mixed CPU+QPU beats pure 4-core NEON; project continues

2026-05-18 12:18:36 +00:00

bench_neon_cdef.c

Cycle 5 closed: CDEF QPU R5=0.116 ORANGE, opportunistic helper

2026-05-18 13:52:46 +00:00

bench_neon_h264deblock.c

Cycle 8 Phase 3 closed: H.264 deblock NEON = 92 Medge/s

2026-05-18 14:39:36 +00:00

bench_neon_h264idct4.c

Cycle 6 (H.264) opened — IDCT 4x4 Phase 1+3, M3 = 175 Mblock/s

2026-05-18 14:14:43 +00:00

bench_neon_h264idct8.c

Cycle 7 closed: H.264 IDCT 8x8 = 151 Mblock/s NEON, Phase 4 deferred

2026-05-18 14:16:42 +00:00

bench_neon_h264qpel_mc20.c

Cycle 9 closed: H.264 luma qpel mc20 = 131 Mblock/s NEON, CPU-only

2026-05-18 14:53:21 +00:00

bench_neon_idct.c

Path B pivot + Phase 0-3 closed with first baseline numbers

2026-05-18 11:30:12 +00:00

bench_neon_lpf8.c

Cycle 4 (LPF wd=8) closure: M1=100%, R=0.34, M4=+4.1%, PASS

2026-05-18 12:56:25 +00:00

bench_neon_lpf.c

Cycle 2 (deblocking) Phase 1-3: M3'' = 48.285 Medge/s baseline

2026-05-18 12:28:57 +00:00

bench_neon_mc.c

Cycle 3 (MC interpolation) closure: M1'''=100%, R'''=0.067 RED, M4=-19.5%

2026-05-18 12:51:43 +00:00

bench_pool_overhead.c

v3d_runner: buffer pool for QPU dispatch hot path

2026-05-23 19:52:50 +02:00

bench_v3d_cdef.c

Cycle 5 closed: CDEF QPU R5=0.116 ORANGE, opportunistic helper

2026-05-18 13:52:46 +00:00

bench_v3d_h264deblock.c

Cycle 8 closed: H.264 deblock R8=0.061 RED, opportunistic helper

2026-05-18 14:44:21 +00:00

bench_v3d_idct.c

Phase 6 (v1+v4 production) + Phase 7 closure: R = 0.92 ± 0.03 on hertz

2026-05-18 12:09:00 +00:00

bench_v3d_lpf8.c

Cycle 4 (LPF wd=8) closure: M1=100%, R=0.34, M4=+4.1%, PASS

2026-05-18 12:56:25 +00:00

bench_v3d_lpf.c

Cycle 2 (LPF) closure: M1''=100%, R''=0.41, M4''=+6.9%, PASS

2026-05-18 12:39:26 +00:00

bench_v3d_mc.c

Cycle 3 (MC interpolation) closure: M1'''=100%, R'''=0.067 RED, M4=-19.5%

2026-05-18 12:51:43 +00:00

bench_vulkan_dispatch.c

Path B pivot + Phase 0-3 closed with first baseline numbers

2026-05-18 11:30:12 +00:00

cdef_ref.c

Cycle 5 closed: CDEF QPU R5=0.116 ORANGE, opportunistic helper

2026-05-18 13:52:46 +00:00

h264_chroma_loop_filter_ref.c

h264: deblock chroma_v + chroma_h (CPU/NEON, bS<4)

2026-05-24 23:53:09 +02:00

h264_deblock_ref.c

Cycle 8 Phase 3 closed: H.264 deblock NEON = 92 Medge/s

2026-05-18 14:39:36 +00:00

h264_h_loop_filter_luma_ref.c

h264: deblock_luma_h — CPU/NEON via vendored ff_h264_h_loop_filter

2026-05-24 23:28:56 +02:00

h264_idct4_ref.c

Cycle 6 (H.264) opened — IDCT 4x4 Phase 1+3, M3 = 175 Mblock/s

2026-05-18 14:14:43 +00:00

h264_idct8_ref.c

Cycle 7 closed: H.264 IDCT 8x8 = 151 Mblock/s NEON, Phase 4 deferred

2026-05-18 14:16:42 +00:00

h264_intra_loop_filter_ref.c

h264: deblock bS=4 intra variants (luma + chroma, V + H)

2026-05-25 00:00:46 +02:00

h264_intra_pred_4x4_ref.c

h264: Intra_4x4 luma prediction — 9-mode C reference + spec gates

2026-05-25 00:14:51 +02:00

h264_intra_pred_16x16_ref.c

h264: Intra_16x16 luma prediction — 4-mode C reference + spec gates

2026-05-25 00:35:24 +02:00

h264_intra_pred_chroma8x8_ref.c

h264: Intra_8x8 chroma prediction — 4-mode C reference + spec gates

2026-05-25 00:42:49 +02:00

h264_qpel8_mc02_ref.c

h264: qpel mc02 (vertical half-pel, CPU/NEON)

2026-05-25 00:47:37 +02:00

h264_qpel8_mc20_ref.c

Cycle 9 closed: H.264 luma qpel mc20 = 131 Mblock/s NEON, CPU-only

2026-05-18 14:53:21 +00:00

test_api_h264.c

h264: qpel mc02 (vertical half-pel, CPU/NEON)

2026-05-25 00:47:37 +02:00

test_api_idct.c

Phase 8: wire IDCT QPU dispatch through public API

2026-05-18 13:55:55 +00:00

test_api_lpf.c

Phase 8: wire LPF wd=4 + wd=8 QPU through public API

2026-05-18 13:57:25 +00:00

test_api_opportunistic_qpu.c

Phase 8b: opportunistic QPU paths through public API

2026-05-18 14:50:41 +00:00

test_intra_pred_4x4.c

h264: Intra_4x4 luma prediction — 9-mode C reference + spec gates

2026-05-25 00:14:51 +02:00

test_intra_pred_16x16.c

h264: Intra_16x16 luma prediction — 4-mode C reference + spec gates

2026-05-25 00:35:24 +02:00

test_intra_pred_chroma8x8.c

h264: Intra_8x8 chroma prediction — 4-mode C reference + spec gates

2026-05-25 00:42:49 +02:00

vp9_idct8_ref.c

Path B pivot + Phase 0-3 closed with first baseline numbers

2026-05-18 11:30:12 +00:00

vp9_lpf8_ref.c

Cycle 4 (LPF wd=8) closure: M1=100%, R=0.34, M4=+4.1%, PASS

2026-05-18 12:56:25 +00:00

vp9_lpf_ref.c

Cycle 2 (deblocking) Phase 1-3: M3'' = 48.285 Medge/s baseline

2026-05-18 12:28:57 +00:00

vp9_mc_ref.c

Cycle 3 (MC interpolation) closure: M1'''=100%, R'''=0.067 RED, M4=-19.5%

2026-05-18 12:51:43 +00:00