h264: qpel diagonals — 8 positions (mc11/12/13/21/23/31/32/33) #18

2026-05-25T05:49:34Z

marfrit commented

2026-05-25 05:49:34 +00:00

Closes the qpel buildout. All 8 remaining diagonal positions land in one PR. Each is the rounded average of two half-pel intermediates per H.264 §8.4.2.2.1 / Table 8-4, with the decomposition matching the FFmpeg .S structure (verified by reading h264qpel_neon.S lines 622-758). The position-specific (r±1, c±1) source offsets come from pre-incrementing the src pointer before branching into mc11/mc21.

Macros condense the wiring: 8 new kernel enums (MC11..MC33 = 23..30), 8 NEON externs, 8 CPU dispatches, 8 public APIs, 8 recipe wrappers, 8 ref functions via DEFINE_DIAG_REF, 8 test entries via the existing run_quarter_axis_qpel harness.

All 8 positions pass 2048/2048 bytes bit-exact on the first try. This is meaningful because the (r±1, c±1) offsets are easy to get wrong.

The H.264 qpel 8x8 put_ matrix is now complete (15 of 16 positions exposed; mc00 is integer copy, libavcodec sets the pointer directly). mc20 retains its QPU shader (cycle 9); the other 14 are CPU NEON.

Backlog: avg_ variants for biprediction (16 more positions), 16x16 qpel, QPU shaders for the non-mc20 positions.

marfrit added 1 commit 2026-05-25 05:49:35 +00:00

h264: qpel diagonals — 8 positions (mc11/12/13/21/23/31/32/33) 0894a46114

Closes the qpel buildout.  All 8 remaining diagonal positions land
in one PR.  Each is the rounded average of two half-pel intermediates
per H.264 §8.4.2.2.1 / Table 8-4, with the decomposition matching
the FFmpeg .S reference structure (verified by reading
external/ffmpeg-snapshot/.../h264qpel_neon.S lines 622-758).

Decomposition table (the formula for each output cell at (r,c)):

  mc11 ¼¼ : avg(mc20[r,   c],   mc02[r, c])
  mc12 ¼½ : avg(mc22[r,   c],   mc02[r, c])
  mc13 ¼¾ : avg(mc20[r+1, c],   mc02[r, c])
  mc21 ½¼ : avg(mc22[r,   c],   mc20[r, c])
  mc23 ½¾ : avg(mc22[r,   c],   mc20[r+1, c])
  mc31 ¾¼ : avg(mc20[r,   c],   mc02[r, c+1])
  mc32 ¾½ : avg(mc22[r,   c],   mc02[r, c+1])
  mc33 ¾¾ : avg(mc20[r+1, c],   mc02[r, c+1])

The (r+1, c+1) offsets in the table capture the position-dependent
shift that the FFmpeg .S encodes by pre-incrementing x1 (the src
pointer) before branching into the common mc11/mc21 code paths.
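
A minimal per-cell sketch of the decomposition table, for illustration only. The helper names hpel_h / hpel_v / hpel_hv are taken from the scope list in this commit message, but their signatures here are assumptions, and DIAG_REF_SKETCH is a hypothetical stand-in rather than the project's DEFINE_DIAG_REF. The filter is the standard H.264 6-tap (1, -5, 20, 20, -5, 1), and src is assumed to have at least 2 samples of margin before and 3 after each referenced cell.

```c
#include <stddef.h>
#include <stdint.h>

/* Clamp to the 8-bit sample range. */
static inline int clip_u8(int v) { return v < 0 ? 0 : (v > 255 ? 255 : v); }

/* 6-tap (1,-5,20,20,-5,1) sum; p is the first tap, step the tap spacing. */
static inline int tap6(const uint8_t *p, ptrdiff_t step)
{
    return p[0] - 5 * p[step] + 20 * p[2 * step]
         + 20 * p[3 * step] - 5 * p[4 * step] + p[5 * step];
}

/* Horizontal half-pel (mc20) at cell (r, c).  The >> on a possibly
 * negative sum assumes an arithmetic shift, as FFmpeg's C code does. */
static inline int hpel_h(const uint8_t *src, ptrdiff_t stride, int r, int c)
{
    return clip_u8((tap6(src + r * stride + c - 2, 1) + 16) >> 5);
}

/* Vertical half-pel (mc02) at cell (r, c). */
static inline int hpel_v(const uint8_t *src, ptrdiff_t stride, int r, int c)
{
    return clip_u8((tap6(src + (r - 2) * stride + c, stride) + 16) >> 5);
}

/* Centre half-pel (mc22): vertical 6-tap over six unrounded horizontal sums. */
static inline int hpel_hv(const uint8_t *src, ptrdiff_t stride, int r, int c)
{
    int t[6];
    for (int i = 0; i < 6; i++)
        t[i] = tap6(src + (r - 2 + i) * stride + c - 2, 1);
    int sum = t[0] - 5 * t[1] + 20 * t[2] + 20 * t[3] - 5 * t[4] + t[5];
    return clip_u8((sum + 512) >> 10);
}

/* Hypothetical generator: rounded average of two half-pel values, each
 * sampled at (r + row offset, c + column offset). */
#define DIAG_REF_SKETCH(name, fa, ra, ca, fb, rb, cb)                      \
    static inline uint8_t ref_##name(const uint8_t *src, ptrdiff_t stride, \
                                     int r, int c)                         \
    {                                                                      \
        int a = fa(src, stride, r + (ra), c + (ca));                       \
        int b = fb(src, stride, r + (rb), c + (cb));                       \
        return (uint8_t)((a + b + 1) >> 1);                                \
    }

/* Rows mirror the decomposition table above. */
DIAG_REF_SKETCH(mc11, hpel_h,  0, 0, hpel_v, 0, 0)
DIAG_REF_SKETCH(mc12, hpel_hv, 0, 0, hpel_v, 0, 0)
DIAG_REF_SKETCH(mc13, hpel_h,  1, 0, hpel_v, 0, 0)
DIAG_REF_SKETCH(mc21, hpel_hv, 0, 0, hpel_h, 0, 0)
DIAG_REF_SKETCH(mc23, hpel_hv, 0, 0, hpel_h, 1, 0)
DIAG_REF_SKETCH(mc31, hpel_h,  0, 0, hpel_v, 0, 1)
DIAG_REF_SKETCH(mc32, hpel_hv, 0, 0, hpel_v, 0, 1)
DIAG_REF_SKETCH(mc33, hpel_h,  1, 0, hpel_v, 0, 1)
```

Each generated ref_mcXY computes one output sample, so a block-level reference would loop it over the 8x8 cells.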

Scope (tightly macro-ised):
  - 8 new kernel enums (MC11..MC33 = 23..30) → CPU.
  - 8 NEON externs for the vendored ff_put_h264_qpel8_mc*_neon.
  - 8 CPU dispatches via existing DEFINE_QPEL_CPU_DISPATCH macro.
  - 8 public dispatches via DEFINE_QPEL_DISPATCH macro.
  - 8 recipe wrappers via DEFINE_QPEL_RECIPE macro.
  - Header decls condensed via a DECLARE_QPEL_DIAG macro that
    expands to both recipe + dispatch decls per name.
  - C references via DEFINE_DIAG_REF macro: each ref is a 6-line
    wrapper around the per-cell hpel_h / hpel_v / hpel_hv helpers
    (the latter being the per-cell version of mc22's 13-row int16
    tmp[] computation).
  - Test wrapper: test_qpel_diag_all() drives all 8 through the
    existing run_quarter_axis_qpel() harness.
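
The macro-ised wiring could look roughly like the X-macro sketch below. Every identifier in it (QPEL_DIAG_LIST, the SKETCH_KERNEL_ prefix, sketch_kernel_from_name) is hypothetical; only the numeric range 23..30 comes from this commit, and the order of positions within that range is assumed to follow the PR title.

```c
#include <string.h>

/* One entry per diagonal position; order assumed, see the note above. */
#define QPEL_DIAG_LIST(X)                                   \
    X(MC11, mc11) X(MC12, mc12) X(MC13, mc13) X(MC21, mc21) \
    X(MC23, mc23) X(MC31, mc31) X(MC32, mc32) X(MC33, mc33)

enum sketch_kernel {
    SKETCH_KERNEL_BEFORE_DIAG = 22, /* placeholder so the list starts at 23 */
#define X(UP, lo) SKETCH_KERNEL_##UP,
    QPEL_DIAG_LIST(X)
#undef X
    SKETCH_KERNEL_COUNT
};

/* Name table generated from the same list, indexed by enum value. */
static const char *const sketch_kernel_name[SKETCH_KERNEL_COUNT] = {
#define X(UP, lo) [SKETCH_KERNEL_##UP] = #lo,
    QPEL_DIAG_LIST(X)
#undef X
};

/* Map "mc11".."mc33" to its enum value; returns -1 if not a diagonal. */
static inline int sketch_kernel_from_name(const char *name)
{
    for (int k = 0; k < SKETCH_KERNEL_COUNT; k++)
        if (sketch_kernel_name[k] && strcmp(sketch_kernel_name[k], name) == 0)
            return k;
    return -1;
}
```

Keeping the position list in one place means the enum, the name table, and any further per-position declarations cannot drift apart when a position is added or reordered.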

Verified on hertz (Pi 5 / V3D 7.1):

  $ ./build/test_api_h264 | tail -8
    H.264 qpel mc11: 2048/2048 bytes bit-exact (100.0000%)
    H.264 qpel mc12: 2048/2048 bytes bit-exact (100.0000%)
    H.264 qpel mc13: 2048/2048 bytes bit-exact (100.0000%)
    H.264 qpel mc21: 2048/2048 bytes bit-exact (100.0000%)
    H.264 qpel mc23: 2048/2048 bytes bit-exact (100.0000%)
    H.264 qpel mc31: 2048/2048 bytes bit-exact (100.0000%)
    H.264 qpel mc32: 2048/2048 bytes bit-exact (100.0000%)
    H.264 qpel mc33: 2048/2048 bytes bit-exact (100.0000%)

All 8 diagonal positions pass bit-exact on the first try.  This is
meaningful because the position-dependent (r±1, c±1) source offsets
are easy to get wrong in transcription, and any such mistake would
surface immediately on random inputs.
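
For readers unfamiliar with this kind of check, a bit-exact comparison on random inputs can be sketched as follows. This is not the run_quarter_axis_qpel() harness itself: the function names, the xorshift generator, and the seed handling are assumptions made for illustration, and only the report line format is copied from the output above.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Small deterministic xorshift32 generator so runs are reproducible. */
static inline uint32_t sketch_rand(uint32_t *state)
{
    uint32_t x = *state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    return *state = x;
}

/* Fill a buffer with pseudo-random bytes; a zero seed is replaced by 1
 * because xorshift would otherwise stay at zero forever. */
static inline void fill_random(uint8_t *buf, size_t n, uint32_t seed)
{
    uint32_t state = seed ? seed : 1u;
    for (size_t i = 0; i < n; i++)
        buf[i] = (uint8_t)(sketch_rand(&state) >> 24);
}

/* Count positions where the kernel output equals the C reference. */
static inline size_t count_bit_exact(const uint8_t *ref, const uint8_t *out,
                                     size_t n)
{
    size_t match = 0;
    for (size_t i = 0; i < n; i++)
        match += (ref[i] == out[i]);
    return match;
}

/* Print one report line; returns 1 only when every byte matches. */
static inline int report_bit_exact(const char *label, const uint8_t *ref,
                                   const uint8_t *out, size_t n)
{
    size_t match = count_bit_exact(ref, out, n);
    double pct = n ? 100.0 * (double)match / (double)n : 0.0;
    printf("  H.264 qpel %s: %zu/%zu bytes bit-exact (%.4f%%)\n",
           label, match, n, pct);
    return match == n;
}
```

In such a setup the source block would be filled with fill_random(), run through both the reference and the kernel under test, and the two outputs passed to report_bit_exact().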

After this PR the H.264 qpel 8x8 put_ matrix is complete:
  mc00 mc01 mc02 mc03
  mc10 mc11 mc12 mc13
  mc20 mc21 mc22 mc23
  mc30 mc31 mc32 mc33

15 of 16 positions are exposed through the daedalus API; mc00 is just
an integer copy and rarely needs a dispatch wrapper (libavcodec sets
the function pointer table directly).  mc20 retains its QPU shader
(cycle 9 / v3d_h264_qpel_mc20.spv); the other 14 are CPU NEON.

What this does NOT cover (still in backlog):
  - avg_ variants (the "add" form for biprediction, 16 more
    positions).  Currently the API only exposes put_.
  - 16x16 qpel (separate function family in FFmpeg; the 8x8 path
    can be called four times, once per 8x8 quadrant, to substitute
    when 16x16 isn't critical).
  - QPU shaders for any qpel position other than mc20.
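
As a rough illustration of the 16x16 substitution mentioned in the backlog, a 16x16 block can be tiled from an 8x8 kernel, assuming one call per 8x8 quadrant. The sketch_qpel_fn type mirrors the (dst, src, stride) shape of FFmpeg's qpel function pointers; qpel16_from_qpel8 and sketch_copy8 are hypothetical names, and the copy kernel is only a stand-in to exercise the wrapper.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed kernel shape: 8x8 output, shared stride for dst and src. */
typedef void (*sketch_qpel_fn)(uint8_t *dst, const uint8_t *src,
                               ptrdiff_t stride);

/* Cover a 16x16 block with four 8x8 calls, one per quadrant. */
static inline void qpel16_from_qpel8(sketch_qpel_fn mc8, uint8_t *dst,
                                     const uint8_t *src, ptrdiff_t stride)
{
    mc8(dst,                  src,                  stride); /* top-left     */
    mc8(dst + 8,              src + 8,              stride); /* top-right    */
    mc8(dst + 8 * stride,     src + 8 * stride,     stride); /* bottom-left  */
    mc8(dst + 8 * stride + 8, src + 8 * stride + 8, stride); /* bottom-right */
}

/* Stand-in 8x8 kernel: plain integer copy (mc00 behaviour).  A real
 * qpel kernel would also read margin samples around the block. */
static inline void sketch_copy8(uint8_t *dst, const uint8_t *src,
                                ptrdiff_t stride)
{
    for (int r = 0; r < 8; r++)
        for (int c = 0; c < 8; c++)
            dst[r * stride + c] = src[r * stride + c];
}
```

This trades one dispatch for four and gives up any sharing of filter intermediates across quadrant boundaries, so it is a stopgap rather than a replacement for a dedicated 16x16 path.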

marfrit merged commit 76e3076670 into main

2026-05-25 06:32:03 +00:00

marfrit deleted branch noether/h264-qpel-diagonals

2026-05-25 06:32:03 +00:00

marfrit referenced this pull request

2026-05-25 17:15:24 +00:00

h264: V3D shaders for the 8 diagonal qpel positions #33

claude-noether referenced this issue from a commit

2026-05-25 17:15:30 +00:00

h264: V3D shaders for the 8 diagonal qpel positions
