daedalus-fourier/include/daedalus.h
claude-noether 01f782cfaf h264: qpel avg — 12 remaining variants (closes the matrix)
Closes the H.264 8x8 qpel buildout.  Adds the remaining 12 avg_
biprediction positions:
  4 quarter-axis: avg_mc{10,30,01,03}
  8 diagonals  : avg_mc{11,12,13,21,23,31,32,33}

Each follows the established pattern: same half-pel formula as the
put_ sibling, then L2 average with the existing dst contents per
H.264 §8.4.2.3.1.
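
A minimal sketch of the averaging step described above, assuming the usual rounded mean (a + b + 1) >> 1. The helper names avg_u8 and avg_block8x8 are hypothetical, not the library's, and the put_ result is taken as an already-computed 8x8 array rather than produced by the real kernels.

```c
#include <stdint.h>
#include <stddef.h>

/* Rounded mean of two 8-bit samples: (a + b + 1) >> 1. */
static uint8_t avg_u8(uint8_t a, uint8_t b)
{
    return (uint8_t)((a + b + 1) >> 1);
}

/* Hypothetical helper: average an 8x8 put_ result into dst in place.
 * dst already holds the first prediction; pred holds the second,
 * stored as 64 contiguous samples (row-major, stride 8). */
static void avg_block8x8(uint8_t *dst, size_t dst_stride,
                         const uint8_t pred[64])
{
    for (size_t r = 0; r < 8; r++)
        for (size_t c = 0; c < 8; c++)
            dst[r * dst_stride + c] =
                avg_u8(dst[r * dst_stride + c], pred[r * 8 + c]);
}
```

The sum is formed in int, so 255 + 255 + 1 cannot overflow before the shift.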

Scope:
  - 12 new kernel enums (MC10..MC33 avg_ = 34..45) → CPU.
  - 12 NEON externs for the vendored ff_avg_h264_qpel8_mc*_neon.
  - 12 CPU dispatches via existing DEFINE_QPEL_CPU_DISPATCH macro.
  - 12 public dispatches via DEFINE_QPEL_DISPATCH macro.
  - 12 recipe wrappers via DEFINE_QPEL_RECIPE macro.
  - 12 header decls via DECLARE_QPEL_AVG macro.
  - tests/h264_qpel8_avg_rest_ref.c — references via two parametric
    macros: DEFINE_AVG_QUARTER for the 4 ¼-pel L2 forms,
    DEFINE_AVG_DIAG for the 8 two-half-pel-avg forms.
  - Test harness extended with a RUN(MC) sub-macro that derives both
    the ref name and dispatch name from the bare mcXX.  (The ref
    is daedalus_avg_h264_qpel8_<mc>_ref; the dispatch is
    daedalus_recipe_dispatch_h264_qpel_avg_<mc>.  Macro had a typo
    on first try that duplicated "avg_" in the ref name — caught at
    compile, fixed.)
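
The name derivation in the last bullet can be sketched with preprocessor token pasting. This is an illustration under assumptions, not the harness's actual macro: the two function-name patterns come from the text above, while run_one, the stub bodies, and the int(void) signatures are hypothetical stand-ins so the fragment compiles on its own.

```c
#include <stdio.h>

/* Stand-in stubs (hypothetical): the real functions take pixel buffers
 * and metadata. Each returns a tag so the two sides can be compared. */
static int daedalus_avg_h264_qpel8_mc10_ref(void) { return 10; }
static int daedalus_recipe_dispatch_h264_qpel_avg_mc10(void) { return 10; }
static int daedalus_avg_h264_qpel8_mc33_ref(void) { return 33; }
static int daedalus_recipe_dispatch_h264_qpel_avg_mc33(void) { return 33; }

/* Hypothetical comparison helper: returns 1 when ref and dispatch agree. */
static int run_one(const char *label, int (*ref)(void), int (*dispatch)(void))
{
    int ok = ref() == dispatch();
    printf("H.264 qpel %s: %s\n", label, ok ? "PASS" : "FAIL");
    return ok;
}

/* RUN(MC) pastes the bare mcXX token into both function names and
 * stringizes it for the label. "avg_" appears exactly once per name. */
#define RUN(MC) \
    run_one("avg_" #MC, \
            daedalus_avg_h264_qpel8_ ## MC ## _ref, \
            daedalus_recipe_dispatch_h264_qpel_avg_ ## MC)
```

With this shape, RUN(mc10) expands to a call on daedalus_avg_h264_qpel8_mc10_ref and daedalus_recipe_dispatch_h264_qpel_avg_mc10. A misspelled or duplicated prefix produces an undeclared identifier, which is why such a typo surfaces at compile time.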

Verified on hertz:

  $ ./build/test_api_h264 | tail -12
    H.264 qpel avg_mc10: 2048/2048 bytes bit-exact (100.0000%)
    H.264 qpel avg_mc30: 2048/2048 bytes bit-exact (100.0000%)
    H.264 qpel avg_mc01: 2048/2048 bytes bit-exact (100.0000%)
    H.264 qpel avg_mc03: 2048/2048 bytes bit-exact (100.0000%)
    H.264 qpel avg_mc11: 2048/2048 bytes bit-exact (100.0000%)
    H.264 qpel avg_mc12: 2048/2048 bytes bit-exact (100.0000%)
    H.264 qpel avg_mc13: 2048/2048 bytes bit-exact (100.0000%)
    H.264 qpel avg_mc21: 2048/2048 bytes bit-exact (100.0000%)
    H.264 qpel avg_mc23: 2048/2048 bytes bit-exact (100.0000%)
    H.264 qpel avg_mc31: 2048/2048 bytes bit-exact (100.0000%)
    H.264 qpel avg_mc32: 2048/2048 bytes bit-exact (100.0000%)
    H.264 qpel avg_mc33: 2048/2048 bytes bit-exact (100.0000%)

  All 12 new positions bit-exact PASS first try.

Final qpel matrix state:
  put_:  mc00 (none — integer copy)
         mc01 ✓  mc02 ✓  mc03 ✓
         mc10 ✓  mc11 ✓  mc12 ✓  mc13 ✓
         mc20 ✓ (QPU+CPU)  mc21 ✓  mc22 ✓  mc23 ✓
         mc30 ✓  mc31 ✓  mc32 ✓  mc33 ✓
  avg_:  same 15-of-16 coverage, all CPU.

Every B-slice biprediction case the libavcodec intercept can throw
at us is now serviceable.  QPU shaders remain mc20-only (cycle 9);
the other 29 positions are CPU NEON.  Whether to write more QPU
shaders depends on real perf measurement — at NEON ~10 ns per
8x8 block, full qpel coverage at 1080p is ~2-3 ms of total work,
well inside budget.
2026-05-25 08:49:42 +02:00

/*
* daedalus-fourier — public C API.
*
* Stable surface for the integration layer (Phase 8 V4L2 shim,
* libva-v4l2-request-fourier consumer, or any future skin) to
* dispatch per-kernel work to the right substrate per the
* cycle 1-5 deployment recipe.
*
* Recipe (verdict at end of cycles 1-5, see docs/k*_phase7.md):
*
*   VP9 IDCT 8x8        → V3D QPU   (R=0.92  GREEN;  M4 +7.2 %)
*   VP9 LPF wd=4 inner  → V3D QPU   (R=0.41  ORANGE; M4 +6.9 %)
*   VP9 MC 8-tap horiz  → CPU NEON  (R=0.067 RED;    M4 -19.5 %)
*   VP9 LPF wd=8 inner  → V3D QPU   (R=0.34  ORANGE; M4 +4.1 %)
*   AV1 CDEF 8x8 luma   → CPU NEON  (R=0.116 ORANGE; QPU = opportunistic helper at 0.4 Mblock/s)
*
* The API exposes BOTH substrates for every kernel — the
* integration layer can override the recipe at runtime if it
* has scheduler knowledge the kernel-level R-band measurement
* didn't capture. The recommended path is to use
* `daedalus_recipe_dispatch_*` which picks the recipe substrate
* automatically.
*
* License: BSD-2-Clause. This header is part of the library API
* boundary; the implementation links against vendored
* LGPL-2.1+ FFmpeg snapshot and BSD-2-Clause dav1d snapshot.
*
* Threading: a `daedalus_ctx *` owns Vulkan + V3D state. A
* context is single-threaded; use one per worker thread if you
* need parallelism on the QPU side. NEON-side dispatch is
* stateless and re-entrant.
*
* ABI: pre-1.0 — no stability guarantees yet. The function names
* and signatures will become ABI-stable at v1.0; until then the
* integration layer should rebuild against the headers it links
* with.
*/
#ifndef DAEDALUS_FOURIER_H
#define DAEDALUS_FOURIER_H
#include <stdint.h>
#include <stddef.h>
#ifdef __cplusplus
extern "C" {
#endif
/* -------------------------------------------------------------------
* Substrate selection
*
* Most callers should NOT specify a substrate — use the
* `daedalus_recipe_dispatch_*` family below, which picks the
* substrate per the cycles-1-5 verdict. Explicit substrate
* selection is for benchmarking, debugging, and future
* runtime-aware schedulers.
* ----------------------------------------------------------------- */
typedef enum {
DAEDALUS_SUBSTRATE_AUTO = 0, /* per recipe table */
DAEDALUS_SUBSTRATE_CPU = 1, /* force ARM NEON */
DAEDALUS_SUBSTRATE_QPU = 2, /* force V3D compute */
} daedalus_substrate;
/* -------------------------------------------------------------------
* Context lifecycle
* ----------------------------------------------------------------- */
typedef struct daedalus_ctx daedalus_ctx;
/* Create a context. Initialises V3D Vulkan device if available;
* NEON-only fallback OK if V3D init fails. Returns NULL on alloc
* failure. */
daedalus_ctx *daedalus_ctx_create(void);
/* Same but skip V3D init — for callers that know they want CPU
* only and want a fast-creating context. */
daedalus_ctx *daedalus_ctx_create_no_qpu(void);
/* Returns 1 if QPU dispatch is available on this context, 0 if
* NEON-only. Useful for the integration layer to short-circuit
* QPU dispatch attempts. */
int daedalus_ctx_has_qpu(const daedalus_ctx *ctx);
void daedalus_ctx_destroy(daedalus_ctx *ctx);
/* -------------------------------------------------------------------
* VP9 IDCT 8x8 add — cycle 1 (QPU by recipe)
*
* For each of n_blocks: take 64 int16 coefficients, perform the 8x8
* inverse DCT, and add each output sample q to dst with rounding:
* dst[r,c] = clamp(dst[r,c] + ((q + 16) >> 5)), clamped to 0..255.
*
* `meta` is an array of (dst_byte_offset, block_x, block_y) for
* each block, where dst_byte_offset is byte offset into dst.
*
* Returns 0 on success, or a negative errno-style value on failure.
* ----------------------------------------------------------------- */
typedef struct {
uint32_t dst_off; /* byte offset into dst */
uint32_t block_x; /* used only by QPU path for placement */
uint32_t block_y;
uint32_t _pad;
} daedalus_idct8_meta;
int daedalus_recipe_dispatch_vp9_idct8(
daedalus_ctx *ctx,
uint8_t *dst, size_t dst_stride,
const int16_t *coeffs, size_t n_blocks,
const daedalus_idct8_meta *meta);
int daedalus_dispatch_vp9_idct8(
daedalus_ctx *ctx,
daedalus_substrate sub,
uint8_t *dst, size_t dst_stride,
const int16_t *coeffs, size_t n_blocks,
const daedalus_idct8_meta *meta);
/* -------------------------------------------------------------------
* VP9 LPF wd=4 / wd=8 — cycles 2 and 4 (QPU by recipe)
*
* Loop filter at horizontal edge crossing pixel column 4 of an
* 8x8 block. Per-edge thresholds (E, I, H).
* ----------------------------------------------------------------- */
typedef struct {
uint32_t dst_off; /* byte offset into dst, at col 4 of edge */
int32_t E, I, H;
} daedalus_lpf_meta;
int daedalus_recipe_dispatch_vp9_lpf4(
daedalus_ctx *ctx,
uint8_t *dst, size_t dst_stride,
size_t n_edges, const daedalus_lpf_meta *meta);
int daedalus_recipe_dispatch_vp9_lpf8(
daedalus_ctx *ctx,
uint8_t *dst, size_t dst_stride,
size_t n_edges, const daedalus_lpf_meta *meta);
int daedalus_dispatch_vp9_lpf4(daedalus_ctx *ctx, daedalus_substrate sub,
uint8_t *dst, size_t dst_stride,
size_t n_edges, const daedalus_lpf_meta *meta);
int daedalus_dispatch_vp9_lpf8(daedalus_ctx *ctx, daedalus_substrate sub,
uint8_t *dst, size_t dst_stride,
size_t n_edges, const daedalus_lpf_meta *meta);
/* -------------------------------------------------------------------
* VP9 MC 8-tap horizontal — cycle 3 (CPU by recipe)
*
* Subpel-fractional 8-tap horizontal filter; mx selects filter
* row. CPU path is the high-performance default; QPU path is
* available but never recommended by the recipe.
* ----------------------------------------------------------------- */
typedef struct {
uint32_t dst_off;
uint32_t src_off; /* raw, no pre-advance — shader handles -3 internally */
int32_t mx;
uint32_t _pad;
} daedalus_mc_meta;
int daedalus_recipe_dispatch_vp9_mc_8h(
daedalus_ctx *ctx,
uint8_t *dst, size_t dst_stride,
const uint8_t *src, size_t src_stride,
size_t n_blocks, const daedalus_mc_meta *meta);
int daedalus_dispatch_vp9_mc_8h(daedalus_ctx *ctx, daedalus_substrate sub,
uint8_t *dst, size_t dst_stride,
const uint8_t *src, size_t src_stride,
size_t n_blocks, const daedalus_mc_meta *meta);
/* -------------------------------------------------------------------
* AV1 CDEF 8x8 luma — cycle 5 (CPU by recipe; QPU opportunistic)
*
* tmp is an array of n_blocks * 192 uint16, with the padded-buffer
* layout that dav1d's NEON expects (stride 16, padding 2-rows-top +
* 2-cols-left + 2-cols-right + 2-rows-bottom). Caller supplies
* tmp populated with either source pixels (if all edges valid) or
* INT16_MIN sentinels at the boundary (if edge filtered out).
* ----------------------------------------------------------------- */
typedef struct {
uint32_t dst_off;
uint32_t tmp_off_u16; /* offset to block-origin in tmp[] (= padded_origin + 2*16+2) */
int32_t pri_strength; /* 1..7 */
int32_t sec_strength; /* 1..4 */
int32_t dir; /* 0..7 */
int32_t damping; /* 1..6 */
} daedalus_cdef_meta;
int daedalus_recipe_dispatch_cdef_8x8(
daedalus_ctx *ctx,
uint8_t *dst, size_t dst_stride,
const uint16_t *tmp,
size_t n_blocks, const daedalus_cdef_meta *meta);
int daedalus_dispatch_cdef_8x8(daedalus_ctx *ctx, daedalus_substrate sub,
uint8_t *dst, size_t dst_stride,
const uint16_t *tmp,
size_t n_blocks, const daedalus_cdef_meta *meta);
/* -------------------------------------------------------------------
* H.264 IDCT 4x4 + add — cycle 6 (CPU by recipe; QPU unused)
*
* Per H.264 §8.5.12.1, integer 4x4 inverse transform. block is
* COLUMN-major: block[c*4 + r] = coefficient at (row r, col c).
* Block is destructively zeroed after the transform (FFmpeg
* convention).
*
* `coeffs` is an array of n_blocks * 16 int16. `dst_off` is byte
* offset into dst per block.
* ----------------------------------------------------------------- */
typedef struct {
uint32_t dst_off;
uint32_t _pad0, _pad1, _pad2;
} daedalus_h264_block_meta;
int daedalus_recipe_dispatch_h264_idct4(daedalus_ctx *ctx,
uint8_t *dst, size_t dst_stride,
int16_t *coeffs, /* not const — destructively zeroed */
size_t n_blocks, const daedalus_h264_block_meta *meta);
int daedalus_dispatch_h264_idct4(daedalus_ctx *ctx, daedalus_substrate sub,
uint8_t *dst, size_t dst_stride,
int16_t *coeffs,
size_t n_blocks, const daedalus_h264_block_meta *meta);
/* H.264 IDCT 8x8 + add — cycle 7 (CPU by recipe).
* Per H.264 §8.5.13.2, integer 8x8 inverse transform.
* `coeffs` is an array of n_blocks * 64 int16, column-major per block.
*/
int daedalus_recipe_dispatch_h264_idct8(daedalus_ctx *ctx,
uint8_t *dst, size_t dst_stride,
int16_t *coeffs,
size_t n_blocks, const daedalus_h264_block_meta *meta);
int daedalus_dispatch_h264_idct8(daedalus_ctx *ctx, daedalus_substrate sub,
uint8_t *dst, size_t dst_stride,
int16_t *coeffs,
size_t n_blocks, const daedalus_h264_block_meta *meta);
/* -------------------------------------------------------------------
* H.264 luma "v_loop_filter" — cycle 8 (CPU primary; QPU opportunistic)
*
* Filter applied VERTICALLY across a HORIZONTAL edge (16 columns
* wide; pix points to row 0 of the bottom block). Non-intra
* (bS < 4) variant.
*
* Each tile is 16 cols × 8 rows of context (rows -4..+3 around
* the edge). dst_off points to row 0 col 0 of the bottom block.
*
* Constraint: dst_off >= 4 * dst_stride (the kernel reads p3 at
* -4*stride). Caller must ensure this.
* ----------------------------------------------------------------- */
typedef struct {
uint32_t dst_off;
int32_t alpha; /* 0..63 typical, table-derived */
int32_t beta; /* 0..63 typical */
int8_t tc0[4]; /* per-segment filter strength; -1 means skip */
} daedalus_h264_deblock_meta;
int daedalus_recipe_dispatch_h264_deblock_luma_v(daedalus_ctx *ctx,
uint8_t *dst, size_t dst_stride,
size_t n_edges, const daedalus_h264_deblock_meta *meta);
int daedalus_dispatch_h264_deblock_luma_v(daedalus_ctx *ctx, daedalus_substrate sub,
uint8_t *dst, size_t dst_stride,
size_t n_edges, const daedalus_h264_deblock_meta *meta);
/* H.264 luma "h_loop_filter" — sibling of _v, applies filter
* HORIZONTALLY across a VERTICAL edge (16 rows tall; pix points to
* row 0 of the right block, col 0 = leftmost output column). Same
* non-intra (bS < 4) variant.
*
* Each tile is 8 cols x 16 rows of context (cols -4..+3 around the
* edge). dst_off points to row 0 col 0 of the RIGHT block.
*
* Constraint: (dst_off % dst_stride) >= 4 (the kernel reads p3 at
* pix[-4]). Caller must ensure this.
*
* QPU shader for the H variant is not yet implemented; recipe table
* routes AUTO to CPU NEON. An explicit DAEDALUS_SUBSTRATE_QPU on
* the _h dispatch returns -1 rather than silently degrading.
*/
int daedalus_recipe_dispatch_h264_deblock_luma_h(daedalus_ctx *ctx,
uint8_t *dst, size_t dst_stride,
size_t n_edges, const daedalus_h264_deblock_meta *meta);
int daedalus_dispatch_h264_deblock_luma_h(daedalus_ctx *ctx, daedalus_substrate sub,
uint8_t *dst, size_t dst_stride,
size_t n_edges, const daedalus_h264_deblock_meta *meta);
/* H.264 chroma (4:2:0) loop filters — bS<4 variant. Chroma uses
* the SAME daedalus_h264_deblock_meta struct as luma but on smaller
* tiles: 8 cols × 4 rows for V (4 segments of 2 cols), 4 cols × 8
* rows for H (4 segments of 2 rows). Each segment has its own tc0
* strength (tc0[s] applies to both cells in segment s).
*
* Algorithm difference vs luma: chroma updates only p0 and q0
* (never p1/p2/q1/q2) and uses tC = tc0_seg + 1 directly (no
* luma-style ap/aq side-condition bonus).
*
* QPU shaders for chroma deblock not implemented yet; recipe table
* routes AUTO to CPU NEON. Explicit SUBSTRATE_QPU returns -1.
*/
int daedalus_recipe_dispatch_h264_deblock_chroma_v(daedalus_ctx *ctx,
uint8_t *dst, size_t dst_stride,
size_t n_edges, const daedalus_h264_deblock_meta *meta);
int daedalus_dispatch_h264_deblock_chroma_v(daedalus_ctx *ctx, daedalus_substrate sub,
uint8_t *dst, size_t dst_stride,
size_t n_edges, const daedalus_h264_deblock_meta *meta);
int daedalus_recipe_dispatch_h264_deblock_chroma_h(daedalus_ctx *ctx,
uint8_t *dst, size_t dst_stride,
size_t n_edges, const daedalus_h264_deblock_meta *meta);
int daedalus_dispatch_h264_deblock_chroma_h(daedalus_ctx *ctx, daedalus_substrate sub,
uint8_t *dst, size_t dst_stride,
size_t n_edges, const daedalus_h264_deblock_meta *meta);
/* H.264 bS=4 "intra" loop filters — used at macroblock edges where
* at least one adjacent macroblock is intra-coded, so boundary
* strength is forced to 4 per H.264 §8.7.2.1. Different algorithm
* from bS<4: luma chooses the strong or weak filter per side from
* the ap/aq < beta and |p0 - q0| < (alpha >> 2) + 2 conditions;
* chroma is always weak. No tc0: the daedalus_h264_deblock_meta
* struct's tc0[] field is IGNORED for intra dispatches (callers can
* leave it uninitialised or share a single edge list across both
* intra and non-intra kernels).
*
* Reuses the same meta layout as bS<4 dispatches for alpha + beta +
* dst_off; tile geometry per orientation is identical to the bS<4
* sibling (16-col / 16-row luma; 8-col / 8-row chroma).
*
* QPU shaders not implemented for any of the four; recipe routes
* AUTO to CPU NEON. Explicit SUBSTRATE_QPU returns -1 (fast fail).
*/
int daedalus_recipe_dispatch_h264_deblock_luma_v_intra(daedalus_ctx *ctx,
uint8_t *dst, size_t dst_stride,
size_t n_edges, const daedalus_h264_deblock_meta *meta);
int daedalus_dispatch_h264_deblock_luma_v_intra(daedalus_ctx *ctx, daedalus_substrate sub,
uint8_t *dst, size_t dst_stride,
size_t n_edges, const daedalus_h264_deblock_meta *meta);
int daedalus_recipe_dispatch_h264_deblock_luma_h_intra(daedalus_ctx *ctx,
uint8_t *dst, size_t dst_stride,
size_t n_edges, const daedalus_h264_deblock_meta *meta);
int daedalus_dispatch_h264_deblock_luma_h_intra(daedalus_ctx *ctx, daedalus_substrate sub,
uint8_t *dst, size_t dst_stride,
size_t n_edges, const daedalus_h264_deblock_meta *meta);
int daedalus_recipe_dispatch_h264_deblock_chroma_v_intra(daedalus_ctx *ctx,
uint8_t *dst, size_t dst_stride,
size_t n_edges, const daedalus_h264_deblock_meta *meta);
int daedalus_dispatch_h264_deblock_chroma_v_intra(daedalus_ctx *ctx, daedalus_substrate sub,
uint8_t *dst, size_t dst_stride,
size_t n_edges, const daedalus_h264_deblock_meta *meta);
int daedalus_recipe_dispatch_h264_deblock_chroma_h_intra(daedalus_ctx *ctx,
uint8_t *dst, size_t dst_stride,
size_t n_edges, const daedalus_h264_deblock_meta *meta);
int daedalus_dispatch_h264_deblock_chroma_h_intra(daedalus_ctx *ctx, daedalus_substrate sub,
uint8_t *dst, size_t dst_stride,
size_t n_edges, const daedalus_h264_deblock_meta *meta);
/* -------------------------------------------------------------------
* H.264 luma qpel mc20 (8×8, horizontal half-pel) — cycle 9
* (CPU by recipe; per-block 7.6 ns NEON, QPU not viable — see
* docs/k9_h264qpel_mc20.md for the R-band rationale).
*
* Per H.264 §8.4.2.2.1, horizontal half-pel luma 6-tap filter:
* dst[r,c] = clip255((s[r,c-2] - 5*s[r,c-1] + 20*s[r,c]
* + 20*s[r,c+1] - 5*s[r,c+2] + s[r,c+3]
* + 16) >> 5)
*
* Single-stride: dst and src share `stride`; this matches FFmpeg's
* H264QpelContext.put_h264_qpel_pixels_tab[][] convention and the
* vendored ff_put_h264_qpel8_mc20_neon signature.
*
* `src + src_off` points at the leftmost OUTPUT column (col 0); the
* filter reads cols -2..+3, so the caller must guarantee src has at
* least 2 pixels of left context and 3 pixels of right context per
* row. (FFmpeg already maintains an edge-emulated buffer for the
* frame boundary; this matches that contract.)
* ----------------------------------------------------------------- */
typedef struct {
uint32_t dst_off; /* byte offset into dst (block top-left) */
uint32_t src_off; /* byte offset into src (col 0, row 0) */
} daedalus_h264_qpel_meta;
int daedalus_recipe_dispatch_h264_qpel_mc20(daedalus_ctx *ctx,
uint8_t *dst, const uint8_t *src, size_t stride,
size_t n_blocks, const daedalus_h264_qpel_meta *meta);
int daedalus_dispatch_h264_qpel_mc20(daedalus_ctx *ctx, daedalus_substrate sub,
uint8_t *dst, const uint8_t *src, size_t stride,
size_t n_blocks, const daedalus_h264_qpel_meta *meta);
/* H.264 luma qpel mc02 (vertical half-pel) — mirror of mc20.
* 6-tap filter applied vertically:
* dst[r,c] = clip255((s[r-2,c] - 5*s[r-1,c] + 20*s[r,c]
* + 20*s[r+1,c] - 5*s[r+2,c] + s[r+3,c]
* + 16) >> 5)
*
* Same single-stride convention as mc20. src + src_off points at
* row 0 col 0 of the OUTPUT block; the filter reads rows -2..+3, so
* the caller must guarantee 2 rows of top context and 3 rows of
* bottom context per block (FFmpeg edge-emulated buffer handles
* frame boundaries; same contract as mc20).
*
* QPU shader not implemented yet; recipe table routes AUTO to CPU
* NEON. Explicit DAEDALUS_SUBSTRATE_QPU returns -1.
*/
int daedalus_recipe_dispatch_h264_qpel_mc02(daedalus_ctx *ctx,
uint8_t *dst, const uint8_t *src, size_t stride,
size_t n_blocks, const daedalus_h264_qpel_meta *meta);
int daedalus_dispatch_h264_qpel_mc02(daedalus_ctx *ctx, daedalus_substrate sub,
uint8_t *dst, const uint8_t *src, size_t stride,
size_t n_blocks, const daedalus_h264_qpel_meta *meta);
/* H.264 luma qpel mc22 (2D half-pel "j" position per spec §8.4.2.2.1).
* Horizontal 6-tap cascaded into vertical 6-tap with intermediate
* 16-bit precision; final +512 >> 10 with clip255. Common position
* in real H.264 streams.
*
* src + src_off points at row 0 col 0 of the OUTPUT block; the
* cascade reads rows -2..+10 (13 rows of context) and cols -2..+10
* (13 cols of context). The caller must guarantee this context.
*
* QPU shader not implemented yet (the HV lowpass is the meatiest
* qpel kernel; structurally distinct from the 1D mc20 shader).
* Recipe routes AUTO to CPU NEON. Explicit SUBSTRATE_QPU returns -1.
*/
int daedalus_recipe_dispatch_h264_qpel_mc22(daedalus_ctx *ctx,
uint8_t *dst, const uint8_t *src, size_t stride,
size_t n_blocks, const daedalus_h264_qpel_meta *meta);
int daedalus_dispatch_h264_qpel_mc22(daedalus_ctx *ctx, daedalus_substrate sub,
uint8_t *dst, const uint8_t *src, size_t stride,
size_t n_blocks, const daedalus_h264_qpel_meta *meta);
/* H.264 luma single-axis quarter-pel qpel positions ("put"):
* mc10 ¼-H ("a" position): clip255(mc20(s)) avg src[r,c]
* mc30 ¾-H ("c" position): clip255(mc20(s)) avg src[r,c+1]
* mc01 ¼-V ("d" position): clip255(mc02(s)) avg src[r,c]
* mc03 ¾-V ("n" position): clip255(mc02(s)) avg src[r+1,c]
*
* Each is a half-pel lowpass clipped to u8 then averaged with an
* integer-aligned source pixel (rounded +1 >> 1). Same edge
* context contract as mc20/mc02. CPU-only for now; QPU shaders
* not yet implemented. Explicit SUBSTRATE_QPU returns -1.
*/
int daedalus_recipe_dispatch_h264_qpel_mc10(daedalus_ctx *ctx,
uint8_t *dst, const uint8_t *src, size_t stride,
size_t n_blocks, const daedalus_h264_qpel_meta *meta);
int daedalus_dispatch_h264_qpel_mc10(daedalus_ctx *ctx, daedalus_substrate sub,
uint8_t *dst, const uint8_t *src, size_t stride,
size_t n_blocks, const daedalus_h264_qpel_meta *meta);
int daedalus_recipe_dispatch_h264_qpel_mc30(daedalus_ctx *ctx,
uint8_t *dst, const uint8_t *src, size_t stride,
size_t n_blocks, const daedalus_h264_qpel_meta *meta);
int daedalus_dispatch_h264_qpel_mc30(daedalus_ctx *ctx, daedalus_substrate sub,
uint8_t *dst, const uint8_t *src, size_t stride,
size_t n_blocks, const daedalus_h264_qpel_meta *meta);
int daedalus_recipe_dispatch_h264_qpel_mc01(daedalus_ctx *ctx,
uint8_t *dst, const uint8_t *src, size_t stride,
size_t n_blocks, const daedalus_h264_qpel_meta *meta);
int daedalus_dispatch_h264_qpel_mc01(daedalus_ctx *ctx, daedalus_substrate sub,
uint8_t *dst, const uint8_t *src, size_t stride,
size_t n_blocks, const daedalus_h264_qpel_meta *meta);
int daedalus_recipe_dispatch_h264_qpel_mc03(daedalus_ctx *ctx,
uint8_t *dst, const uint8_t *src, size_t stride,
size_t n_blocks, const daedalus_h264_qpel_meta *meta);
int daedalus_dispatch_h264_qpel_mc03(daedalus_ctx *ctx, daedalus_substrate sub,
uint8_t *dst, const uint8_t *src, size_t stride,
size_t n_blocks, const daedalus_h264_qpel_meta *meta);
/* H.264 luma diagonal qpel positions ("put", 8 variants). Each is
* the rounded average of two half-pel intermediates per H.264
* §8.4.2.2.1 / Table 8-4 (decomposition matches the FFmpeg .S
* structure; see test/h264_qpel8_diag_ref.c for the formulas).
*
* mc11 ¼¼ : avg(mc20[r,c], mc02[r,c])
* mc12 ¼½ : avg(mc22[r,c], mc02[r,c])
* mc13 ¼¾ : avg(mc20[r+1,c], mc02[r,c])
* mc21 ½¼ : avg(mc22[r,c], mc20[r,c])
* mc23 ½¾ : avg(mc22[r,c], mc20[r+1,c])
* mc31 ¾¼ : avg(mc20[r,c], mc02[r,c+1])
* mc32 ¾½ : avg(mc22[r,c], mc02[r,c+1])
* mc33 ¾¾ : avg(mc20[r+1,c], mc02[r,c+1])
*
* CPU-only via vendored FFmpeg NEON; QPU shaders pending.
* Explicit SUBSTRATE_QPU returns -1.
*/
#define DECLARE_QPEL_DIAG(name) \
int daedalus_recipe_dispatch_h264_qpel_ ## name(daedalus_ctx *ctx, \
uint8_t *dst, const uint8_t *src, size_t stride, \
size_t n_blocks, const daedalus_h264_qpel_meta *meta); \
int daedalus_dispatch_h264_qpel_ ## name(daedalus_ctx *ctx, daedalus_substrate sub, \
uint8_t *dst, const uint8_t *src, size_t stride, \
size_t n_blocks, const daedalus_h264_qpel_meta *meta);
DECLARE_QPEL_DIAG(mc11)
DECLARE_QPEL_DIAG(mc12)
DECLARE_QPEL_DIAG(mc13)
DECLARE_QPEL_DIAG(mc21)
DECLARE_QPEL_DIAG(mc23)
DECLARE_QPEL_DIAG(mc31)
DECLARE_QPEL_DIAG(mc32)
DECLARE_QPEL_DIAG(mc33)
#undef DECLARE_QPEL_DIAG
/* H.264 luma qpel avg_ biprediction: all 15 fractional positions
* (3 half-pel, 4 single-axis quarter-pel, 8 diagonal). The put_
* result is L2-averaged (rounded mean) into the existing dst buffer
* per H.264 §8.4.2.3.1. Caller is responsible for pre-loading dst
* with the list0 prediction; the avg_ call adds list1.
*
* Same single-stride convention as put_; CPU NEON only for now.
*/
#define DECLARE_QPEL_AVG(name) \
int daedalus_recipe_dispatch_h264_qpel_ ## name(daedalus_ctx *ctx, \
uint8_t *dst, const uint8_t *src, size_t stride, \
size_t n_blocks, const daedalus_h264_qpel_meta *meta); \
int daedalus_dispatch_h264_qpel_ ## name(daedalus_ctx *ctx, daedalus_substrate sub, \
uint8_t *dst, const uint8_t *src, size_t stride, \
size_t n_blocks, const daedalus_h264_qpel_meta *meta);
DECLARE_QPEL_AVG(avg_mc20)
DECLARE_QPEL_AVG(avg_mc02)
DECLARE_QPEL_AVG(avg_mc22)
DECLARE_QPEL_AVG(avg_mc10)
DECLARE_QPEL_AVG(avg_mc30)
DECLARE_QPEL_AVG(avg_mc01)
DECLARE_QPEL_AVG(avg_mc03)
DECLARE_QPEL_AVG(avg_mc11)
DECLARE_QPEL_AVG(avg_mc12)
DECLARE_QPEL_AVG(avg_mc13)
DECLARE_QPEL_AVG(avg_mc21)
DECLARE_QPEL_AVG(avg_mc23)
DECLARE_QPEL_AVG(avg_mc31)
DECLARE_QPEL_AVG(avg_mc32)
DECLARE_QPEL_AVG(avg_mc33)
#undef DECLARE_QPEL_AVG
/* -------------------------------------------------------------------
* Recipe query — what does the API recommend for each kernel?
* ----------------------------------------------------------------- */
typedef enum {
DAEDALUS_KERNEL_VP9_IDCT8 = 1,
DAEDALUS_KERNEL_VP9_LPF4_INNER = 2,
DAEDALUS_KERNEL_VP9_MC_8H = 3,
DAEDALUS_KERNEL_VP9_LPF8_INNER = 4,
DAEDALUS_KERNEL_AV1_CDEF_8X8 = 5,
DAEDALUS_KERNEL_H264_IDCT4 = 6,
DAEDALUS_KERNEL_H264_IDCT8 = 7,
DAEDALUS_KERNEL_H264_DEBLOCK_LV = 8,
DAEDALUS_KERNEL_H264_QPEL_MC20 = 9,
DAEDALUS_KERNEL_H264_DEBLOCK_LH = 10,
DAEDALUS_KERNEL_H264_DEBLOCK_CV = 11,
DAEDALUS_KERNEL_H264_DEBLOCK_CH = 12,
DAEDALUS_KERNEL_H264_DEBLOCK_LV_INTRA = 13,
DAEDALUS_KERNEL_H264_DEBLOCK_LH_INTRA = 14,
DAEDALUS_KERNEL_H264_DEBLOCK_CV_INTRA = 15,
DAEDALUS_KERNEL_H264_DEBLOCK_CH_INTRA = 16,
DAEDALUS_KERNEL_H264_QPEL_MC02 = 17,
DAEDALUS_KERNEL_H264_QPEL_MC22 = 18,
DAEDALUS_KERNEL_H264_QPEL_MC10 = 19,
DAEDALUS_KERNEL_H264_QPEL_MC30 = 20,
DAEDALUS_KERNEL_H264_QPEL_MC01 = 21,
DAEDALUS_KERNEL_H264_QPEL_MC03 = 22,
DAEDALUS_KERNEL_H264_QPEL_MC11 = 23,
DAEDALUS_KERNEL_H264_QPEL_MC12 = 24,
DAEDALUS_KERNEL_H264_QPEL_MC13 = 25,
DAEDALUS_KERNEL_H264_QPEL_MC21 = 26,
DAEDALUS_KERNEL_H264_QPEL_MC23 = 27,
DAEDALUS_KERNEL_H264_QPEL_MC31 = 28,
DAEDALUS_KERNEL_H264_QPEL_MC32 = 29,
DAEDALUS_KERNEL_H264_QPEL_MC33 = 30,
DAEDALUS_KERNEL_H264_QPEL_AVG_MC20 = 31,
DAEDALUS_KERNEL_H264_QPEL_AVG_MC02 = 32,
DAEDALUS_KERNEL_H264_QPEL_AVG_MC22 = 33,
DAEDALUS_KERNEL_H264_QPEL_AVG_MC10 = 34,
DAEDALUS_KERNEL_H264_QPEL_AVG_MC30 = 35,
DAEDALUS_KERNEL_H264_QPEL_AVG_MC01 = 36,
DAEDALUS_KERNEL_H264_QPEL_AVG_MC03 = 37,
DAEDALUS_KERNEL_H264_QPEL_AVG_MC11 = 38,
DAEDALUS_KERNEL_H264_QPEL_AVG_MC12 = 39,
DAEDALUS_KERNEL_H264_QPEL_AVG_MC13 = 40,
DAEDALUS_KERNEL_H264_QPEL_AVG_MC21 = 41,
DAEDALUS_KERNEL_H264_QPEL_AVG_MC23 = 42,
DAEDALUS_KERNEL_H264_QPEL_AVG_MC31 = 43,
DAEDALUS_KERNEL_H264_QPEL_AVG_MC32 = 44,
DAEDALUS_KERNEL_H264_QPEL_AVG_MC33 = 45,
} daedalus_kernel;
daedalus_substrate daedalus_recipe_substrate_for(daedalus_kernel k);
#ifdef __cplusplus
}
#endif
#endif /* DAEDALUS_FOURIER_H */
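
The mc20 half-pel formula and the mc10 quarter-pel rule documented in the header comments can be sketched in scalar C as follows. This is an illustrative reference under assumptions, not the library's implementation or its test reference: the names clip255, ref_mc20_px, ref_put_mc20_8x8, and ref_put_mc10_8x8 are hypothetical, and the functions operate on one 8x8 block with a single stride shared by dst and src, as the header describes.

```c
#include <stdint.h>
#include <stddef.h>

/* Clamp an intermediate value to the 8-bit sample range. */
static uint8_t clip255(int v)
{
    return (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
}

/* Horizontal half-pel sample for output (r, c), following the mc20
 * formula in the header comment. `s` points at row 0, col 0 of the
 * output block; columns c-2..c+3 of row r must be readable. */
static uint8_t ref_mc20_px(const uint8_t *s, size_t stride, int r, int c)
{
    const uint8_t *p = s + (size_t)r * stride + (size_t)c;
    int v = p[-2] - 5 * p[-1] + 20 * p[0] + 20 * p[1] - 5 * p[2] + p[3];
    int t = v + 16;
    /* A negative sum clips to 0; test first to avoid shifting a negative. */
    return t < 0 ? 0 : clip255(t >> 5);
}

/* Hypothetical scalar reference for put_ mc20 on one 8x8 block. */
static void ref_put_mc20_8x8(uint8_t *dst, const uint8_t *src, size_t stride)
{
    for (int r = 0; r < 8; r++)
        for (int c = 0; c < 8; c++)
            dst[(size_t)r * stride + (size_t)c] = ref_mc20_px(src, stride, r, c);
}

/* Hypothetical scalar reference for put_ mc10: rounded average of the
 * mc20 half-pel sample with the integer sample src[r,c]. */
static void ref_put_mc10_8x8(uint8_t *dst, const uint8_t *src, size_t stride)
{
    for (int r = 0; r < 8; r++)
        for (int c = 0; c < 8; c++) {
            int h = ref_mc20_px(src, stride, r, c);
            int g = src[(size_t)r * stride + (size_t)c];
            dst[(size_t)r * stride + (size_t)c] = (uint8_t)((h + g + 1) >> 1);
        }
}
```

The six taps sum to 32, so a flat region passes through unchanged after the >> 5. For an 8-wide block the reads span columns -2..+10 of each row, which is the edge context the caller has to provide.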