daedalus-fourier/tests at 1cc0990c9f0e0036c82f176de55e3286f6df7792 - daedalus-fourier - marfrit's space

marfrit/daedalus-fourier

claude-noether 1113953f97 h264: qpel avg anchors (avg_mc20/02/22, biprediction support)

Begins the avg_ qpel buildout for B-slice biprediction.  Each avg_
form computes the same half-pel formula as its put_ sibling, then
L2-averages the result with the existing dst contents, i.e. the
rounding average (a + b + 1) >> 1.  The caller pre-loads dst with the
list0 prediction; the avg_ call blends in the list1 prediction, per
H.264 §8.4.2.3.1.
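To make the put_/avg_ relationship concrete, here is a minimal scalar sketch for the mc20 (horizontal half-pel) position. It is not the repository's reference file or the vendored NEON kernel: every `sketch_` name is hypothetical, and the single shared stride for src and dst is an assumption. The 6-tap filter and the rounding average follow the H.264 definitions.

```c
#include <stddef.h>
#include <stdint.h>

/* Clamp a filter result to the 8-bit sample range. */
static uint8_t sketch_clip_u8(int v)
{
    return (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
}

/* Horizontal half-pel sample (the mc20 position): 6-tap filter
 * (1, -5, 20, 20, -5, 1) with rounding, then divide by 32.
 * Reads src[-2] through src[3]. */
static uint8_t sketch_halfpel_h(const uint8_t *src)
{
    int v = src[-2] - 5 * src[-1] + 20 * src[0]
          + 20 * src[1] - 5 * src[2] + src[3] + 16;
    /* Clamp negatives before shifting to avoid shifting a negative int. */
    return sketch_clip_u8(v < 0 ? 0 : v >> 5);
}

/* put_ form: overwrite dst with the prediction. */
static void sketch_put_mc20_8x8(uint8_t *dst, const uint8_t *src,
                                ptrdiff_t stride)
{
    for (int y = 0; y < 8; y++)
        for (int x = 0; x < 8; x++)
            dst[y * stride + x] = sketch_halfpel_h(src + y * stride + x);
}

/* avg_ form: same prediction, then a rounding average with what dst
 * already holds (the list0 prediction in the biprediction case). */
static void sketch_avg_mc20_8x8(uint8_t *dst, const uint8_t *src,
                                ptrdiff_t stride)
{
    for (int y = 0; y < 8; y++) {
        for (int x = 0; x < 8; x++) {
            uint8_t p = sketch_halfpel_h(src + y * stride + x);
            uint8_t *d = &dst[y * stride + x];
            *d = (uint8_t)((*d + p + 1) >> 1);
        }
    }
}
```

A caller would pass a src pointer with at least 2 readable bytes to the left and 3 to the right of each 8-sample row, since the filter taps reach outside the block.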

Scope (3 anchors, sets the pattern for the remaining 13 avg_
variants):
  - 3 new kernel enums (AVG_MC20=31, AVG_MC02=32, AVG_MC22=33) → CPU.
  - 3 NEON externs for the vendored ff_avg_h264_qpel8_{mc20,mc02,mc22}_neon.
  - 3 CPU dispatches via existing DEFINE_QPEL_CPU_DISPATCH macro
    (the macro is type-agnostic so it didn't need changes for avg_).
  - 3 public dispatches via DEFINE_QPEL_DISPATCH macro.
  - 3 recipe wrappers via DEFINE_QPEL_RECIPE macro.
  - tests/h264_qpel8_avg_anchors_ref.c — per-cell helpers + L2 avg.
  - Test harness: run_avg_qpel() seeds dst with random content so
    the L2 averaging is actually exercised; without the seed, a
    put_-style overwrite could pass silently.
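The dst seeding described in the last bullet could look roughly like the sketch below. The real run_avg_qpel() is not shown on this page, so the helper names, the LCG, and the idea of seeding a reference buffer and an under-test buffer identically are all assumptions made for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Small deterministic LCG so a failing run can be reproduced from its
 * seed. The generator choice is illustrative only. */
static uint32_t sketch_lcg_next(uint32_t *state)
{
    *state = *state * 1664525u + 1013904223u;
    return *state;
}

/* Fill a buffer with pseudo-random bytes (top byte of each LCG step). */
static void sketch_fill_random(uint8_t *buf, size_t n, uint32_t *state)
{
    for (size_t i = 0; i < n; i++)
        buf[i] = (uint8_t)(sketch_lcg_next(state) >> 24);
}

/* Give the reference and the under-test destination the same noisy
 * starting contents. An avg_ kernel must blend with these bytes, so a
 * kernel that merely overwrites dst produces a visible mismatch. */
static void sketch_seed_dst_pair(uint8_t *dst_ref, uint8_t *dst_dut,
                                 size_t n, uint32_t seed)
{
    uint32_t state = seed;
    sketch_fill_random(dst_ref, n, &state);
    for (size_t i = 0; i < n; i++)
        dst_dut[i] = dst_ref[i];
}
```

After seeding, the harness would run the C reference on dst_ref and the kernel under test on dst_dut, then compare the two buffers byte for byte.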

Verified on hertz:

  $ ./build/test_api_h264 | tail -3
    H.264 qpel avg_mc20: 2048/2048 bytes bit-exact (100.0000%)
    H.264 qpel avg_mc02: 2048/2048 bytes bit-exact (100.0000%)
    H.264 qpel avg_mc22: 2048/2048 bytes bit-exact (100.0000%)

  All 3 anchors pass bit-exact on the first try.
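For readers unfamiliar with this kind of report, the bit-exact lines above can be produced by a byte-wise comparison along these lines. This is a sketch only: the function names are hypothetical, and the actual reporting code in test_api_h264.c is not shown here.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Count positions where the reference and the tested output agree. */
static size_t sketch_count_matching(const uint8_t *ref, const uint8_t *dut,
                                    size_t n)
{
    size_t ok = 0;
    for (size_t i = 0; i < n; i++)
        ok += (ref[i] == dut[i]);
    return ok;
}

/* Print one report line in the style shown above and return 1 only if
 * every byte matches. */
static int sketch_report_bit_exact(const char *name, const uint8_t *ref,
                                   const uint8_t *dut, size_t n)
{
    size_t ok = sketch_count_matching(ref, dut, n);
    double pct = n ? 100.0 * (double)ok / (double)n : 0.0;
    printf("  H.264 qpel %s: %zu/%zu bytes bit-exact (%.4f%%)\n",
           name, ok, n, pct);
    return ok == n;
}
```

Requiring an exact match, with no tolerance, is what lets a scalar C reference stand in for the specification when validating a vendored NEON kernel.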

Why anchors only in this PR: the avg_ pattern is uniform across all
16 positions (each is just "put_ result + L2 with dst").  Landing
the anchors first confirms the macro pattern works for both put_
and avg_; the remaining 13 (avg_mc10/30/01/03 + avg_mc11..33) follow
the same template in a follow-up PR.
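The claim that one macro pattern serves both put_ and avg_ can be illustrated with token pasting. The repository's DEFINE_QPEL_* macros are not shown on this page, so `SKETCH_QPEL_DISPATCH`, the `demo` position, and the stand-in kernels below are hypothetical; the kernels do an integer-pel copy and average only to keep the sketch short.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in put_ kernel: plain 8x8 copy (NOT the vendored NEON code). */
static void sketch_put_demo_impl(uint8_t *dst, const uint8_t *src,
                                 ptrdiff_t stride)
{
    for (int y = 0; y < 8; y++)
        for (int x = 0; x < 8; x++)
            dst[y * stride + x] = src[y * stride + x];
}

/* Stand-in avg_ kernel: same source value, rounding-averaged into dst. */
static void sketch_avg_demo_impl(uint8_t *dst, const uint8_t *src,
                                 ptrdiff_t stride)
{
    for (int y = 0; y < 8; y++)
        for (int x = 0; x < 8; x++)
            dst[y * stride + x] = (uint8_t)((dst[y * stride + x]
                                             + src[y * stride + x] + 1) >> 1);
}

/* Hypothetical wrapper macro: the operation prefix (put/avg) is just
 * another pasted token, so one definition covers both families as long
 * as the kernels share a signature. */
#define SKETCH_QPEL_DISPATCH(op, pos)                                   \
    static void sketch_##op##_##pos(uint8_t *dst, const uint8_t *src,   \
                                    ptrdiff_t stride)                   \
    {                                                                   \
        sketch_##op##_##pos##_impl(dst, src, stride);                   \
    }

SKETCH_QPEL_DISPATCH(put, demo)
SKETCH_QPEL_DISPATCH(avg, demo)
```

Under this assumption, adding another avg_ position only requires a new kernel and one more macro invocation, which matches the template-style follow-up the commit describes.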

State of the qpel matrix after this PR:
  put_ : 15 of 16 positions ✓ (mc00 is integer copy, no wrapper)
  avg_ :  3 of 16 positions ✓ (mc20, mc02, mc22 anchors);
         13 positions remain for the follow-up PR

2026-05-25 08:35:25 +02:00

| Name | Last commit | Date |
| --- | --- | --- |
| .gitkeep | Path B pivot + Phase 0-3 closed with first baseline numbers | 2026-05-18 11:30:12 +00:00 |
| bench_concurrent_lpf8.c | Cycle 4 (LPF wd=8) closure: M1=100%, R=0.34, M4=+4.1%, PASS | 2026-05-18 12:56:25 +00:00 |
| bench_concurrent_lpf.c | Cycle 2 (LPF) closure: M1''=100%, R''=0.41, M4''=+6.9%, PASS | 2026-05-18 12:39:26 +00:00 |
| bench_concurrent_mc.c | Cycle 3 (MC interpolation) closure: M1'''=100%, R'''=0.067 RED, M4=-19.5% | 2026-05-18 12:51:43 +00:00 |
| bench_concurrent_mixed.c | Cycle 8 closed: H.264 deblock R8=0.061 RED, opportunistic helper | 2026-05-18 14:44:21 +00:00 |
| bench_concurrent.c | Phase 7 M4: mixed CPU+QPU beats pure 4-core NEON; project continues | 2026-05-18 12:18:36 +00:00 |
| bench_neon_cdef.c | Cycle 5 closed: CDEF QPU R5=0.116 ORANGE, opportunistic helper | 2026-05-18 13:52:46 +00:00 |
| bench_neon_h264deblock.c | Cycle 8 Phase 3 closed: H.264 deblock NEON = 92 Medge/s | 2026-05-18 14:39:36 +00:00 |
| bench_neon_h264idct4.c | Cycle 6 (H.264) opened — IDCT 4x4 Phase 1+3, M3 = 175 Mblock/s | 2026-05-18 14:14:43 +00:00 |
| bench_neon_h264idct8.c | Cycle 7 closed: H.264 IDCT 8x8 = 151 Mblock/s NEON, Phase 4 deferred | 2026-05-18 14:16:42 +00:00 |
| bench_neon_h264qpel_mc20.c | Cycle 9 closed: H.264 luma qpel mc20 = 131 Mblock/s NEON, CPU-only | 2026-05-18 14:53:21 +00:00 |
| bench_neon_idct.c | Path B pivot + Phase 0-3 closed with first baseline numbers | 2026-05-18 11:30:12 +00:00 |
| bench_neon_lpf8.c | Cycle 4 (LPF wd=8) closure: M1=100%, R=0.34, M4=+4.1%, PASS | 2026-05-18 12:56:25 +00:00 |
| bench_neon_lpf.c | Cycle 2 (deblocking) Phase 1-3: M3'' = 48.285 Medge/s baseline | 2026-05-18 12:28:57 +00:00 |
| bench_neon_mc.c | Cycle 3 (MC interpolation) closure: M1'''=100%, R'''=0.067 RED, M4=-19.5% | 2026-05-18 12:51:43 +00:00 |
| bench_pool_overhead.c | v3d_runner: buffer pool for QPU dispatch hot path | 2026-05-23 19:52:50 +02:00 |
| bench_v3d_cdef.c | Cycle 5 closed: CDEF QPU R5=0.116 ORANGE, opportunistic helper | 2026-05-18 13:52:46 +00:00 |
| bench_v3d_h264deblock.c | Cycle 8 closed: H.264 deblock R8=0.061 RED, opportunistic helper | 2026-05-18 14:44:21 +00:00 |
| bench_v3d_idct.c | Phase 6 (v1+v4 production) + Phase 7 closure: R = 0.92 ± 0.03 on hertz | 2026-05-18 12:09:00 +00:00 |
| bench_v3d_lpf8.c | Cycle 4 (LPF wd=8) closure: M1=100%, R=0.34, M4=+4.1%, PASS | 2026-05-18 12:56:25 +00:00 |
| bench_v3d_lpf.c | Cycle 2 (LPF) closure: M1''=100%, R''=0.41, M4''=+6.9%, PASS | 2026-05-18 12:39:26 +00:00 |
| bench_v3d_mc.c | Cycle 3 (MC interpolation) closure: M1'''=100%, R'''=0.067 RED, M4=-19.5% | 2026-05-18 12:51:43 +00:00 |
| bench_vulkan_dispatch.c | Path B pivot + Phase 0-3 closed with first baseline numbers | 2026-05-18 11:30:12 +00:00 |
| cdef_ref.c | Cycle 5 closed: CDEF QPU R5=0.116 ORANGE, opportunistic helper | 2026-05-18 13:52:46 +00:00 |
| h264_chroma_loop_filter_ref.c | h264: deblock chroma_v + chroma_h (CPU/NEON, bS<4) | 2026-05-24 23:53:09 +02:00 |
| h264_deblock_ref.c | Cycle 8 Phase 3 closed: H.264 deblock NEON = 92 Medge/s | 2026-05-18 14:39:36 +00:00 |
| h264_h_loop_filter_luma_ref.c | h264: deblock_luma_h — CPU/NEON via vendored ff_h264_h_loop_filter | 2026-05-24 23:28:56 +02:00 |
| h264_idct4_ref.c | Cycle 6 (H.264) opened — IDCT 4x4 Phase 1+3, M3 = 175 Mblock/s | 2026-05-18 14:14:43 +00:00 |
| h264_idct8_ref.c | Cycle 7 closed: H.264 IDCT 8x8 = 151 Mblock/s NEON, Phase 4 deferred | 2026-05-18 14:16:42 +00:00 |
| h264_intra_loop_filter_ref.c | h264: deblock bS=4 intra variants (luma + chroma, V + H) | 2026-05-25 00:00:46 +02:00 |
| h264_intra_pred_4x4_ref.c | h264: Intra_4x4 luma prediction — 9-mode C reference + spec gates | 2026-05-25 00:14:51 +02:00 |
| h264_intra_pred_16x16_ref.c | h264: Intra_16x16 luma prediction — 4-mode C reference + spec gates | 2026-05-25 00:35:24 +02:00 |
| h264_intra_pred_chroma8x8_ref.c | h264: Intra_8x8 chroma prediction — 4-mode C reference + spec gates | 2026-05-25 00:42:49 +02:00 |
| h264_qpel8_avg_anchors_ref.c | h264: qpel avg anchors (avg_mc20/02/22, biprediction support) | 2026-05-25 08:35:25 +02:00 |
| h264_qpel8_diag_ref.c | h264: qpel diagonals — 8 positions (mc11/12/13/21/23/31/32/33) | 2026-05-25 07:49:12 +02:00 |
| h264_qpel8_mc02_ref.c | h264: qpel mc02 (vertical half-pel, CPU/NEON) | 2026-05-25 00:47:37 +02:00 |
| h264_qpel8_mc20_ref.c | Cycle 9 closed: H.264 luma qpel mc20 = 131 Mblock/s NEON, CPU-only | 2026-05-18 14:53:21 +00:00 |
| h264_qpel8_mc22_ref.c | h264: qpel mc22 (2D half-pel, CPU/NEON) | 2026-05-25 01:03:14 +02:00 |
| h264_qpel8_quarter_axis_ref.c | h264: qpel single-axis quarter-pel — mc10/mc30/mc01/mc03 (CPU/NEON) | 2026-05-25 01:29:52 +02:00 |
| test_api_h264.c | h264: qpel avg anchors (avg_mc20/02/22, biprediction support) | 2026-05-25 08:35:25 +02:00 |
| test_api_idct.c | Phase 8: wire IDCT QPU dispatch through public API | 2026-05-18 13:55:55 +00:00 |
| test_api_lpf.c | Phase 8: wire LPF wd=4 + wd=8 QPU through public API | 2026-05-18 13:57:25 +00:00 |
| test_api_opportunistic_qpu.c | Phase 8b: opportunistic QPU paths through public API | 2026-05-18 14:50:41 +00:00 |
| test_intra_pred_4x4.c | h264: Intra_4x4 luma prediction — 9-mode C reference + spec gates | 2026-05-25 00:14:51 +02:00 |
| test_intra_pred_16x16.c | h264: Intra_16x16 luma prediction — 4-mode C reference + spec gates | 2026-05-25 00:35:24 +02:00 |
| test_intra_pred_chroma8x8.c | h264: Intra_8x8 chroma prediction — 4-mode C reference + spec gates | 2026-05-25 00:42:49 +02:00 |
| vp9_idct8_ref.c | Path B pivot + Phase 0-3 closed with first baseline numbers | 2026-05-18 11:30:12 +00:00 |
| vp9_lpf8_ref.c | Cycle 4 (LPF wd=8) closure: M1=100%, R=0.34, M4=+4.1%, PASS | 2026-05-18 12:56:25 +00:00 |
| vp9_lpf_ref.c | Cycle 2 (deblocking) Phase 1-3: M3'' = 48.285 Medge/s baseline | 2026-05-18 12:28:57 +00:00 |
| vp9_mc_ref.c | Cycle 3 (MC interpolation) closure: M1'''=100%, R'''=0.067 RED, M4=-19.5% | 2026-05-18 12:51:43 +00:00 |