Commit Graph

29 Commits

Author SHA1 Message Date
claude-noether 2079fe39c6 h264: V3D shaders for all 15 avg_ qpel positions — qpel QPU complete
Generates 15 avg_ shader variants by templating from the existing
put_ shaders.  Each avg_ shader is identical to its put_ sibling
except the final write does an L2 average with the existing dst:

  put_:  dst[r,c] = result
  avg_:  dst[r,c] = (dst[r,c] + result + 1) >> 1

Per H.264 §8.4.2.3.1 (B-slice biprediction): caller pre-loads dst
with the list0 prediction; the avg_ call folds in list1.

Generated via python (avg-shader-gen.py): reads each
v3d_h264_qpel_mcXY.comp, transforms the docstring header + final
write hunk, writes v3d_h264_qpel_avg_mcXY.comp.  ~88 lines each;
15 new shader files.

Dispatch reuses the existing dispatch_h264_qpel_diag_qpu helper for
all 15 — same src envelope (10*stride+11 covers any (r±1, c±1)
shift), the L2 step only touches dst.  Slightly over-allocates for
the simpler positions (avg_mc20/02/10/30/01/03) but negligible
cost.  Eliminates 15 wrappers + 15 src_max bound calculations that
would otherwise duplicate.

CMake foreach loops compile + install 15 new SPV files.  ctx grows
15 pipeline pairs.  Recipe table flips DAEDALUS_KERNEL_H264_QPEL_AVG_*
from CPU to QPU.  Public dispatchers re-defined via the existing
DEFINE_QPEL_DIAG_PUBLIC macro (replaces the CPU-only
DEFINE_QPEL_DISPATCH instantiations).

Verified on hertz:

  $ ./build/test_api_h264 | grep "qpel avg" | wc -l
  15
  $ ./build/test_api_h264 | grep "qpel avg" | grep -c "100.0000%"
  15

All 15 PASS 2048/2048 bytes bit-exact via QPU.

QPU coverage for the H.264 8-bit 4:2:0 hot-path pixel kernels:

  Layer                Coverage
  ─────────────────────────────────────────────────────────────
  IDCT 4x4 luma        ✓ cycle 6 (one QPU shader, also handles chroma)
  IDCT 8x8 luma        ✓ cycle 7
  Chroma DC Hadamard   CPU only (4 adds + 4 subs; not worth)
  Deblock luma_v       ✓ cycle 8
  Deblock luma_h       ✓ PR #28
  Deblock chroma_v/h   ✓ PR #29
  Deblock *_intra      CPU only (less common, structurally different)
  qpel put_ 15 pos     ✓ cycle 9 (mc20) + PRs #30-#33
  qpel avg_ 15 pos     ✓ THIS PR

The H.264 non-intra-deblock hot path is now FULLY on QPU for any
consumer that initialises daedalus with a QPU-capable context.
2026-05-25 20:22:33 +02:00
claude-noether 746533582e h264: V3D shaders for the 8 diagonal qpel positions
Closes the put_ qpel QPU matrix.  Adds mc11/12/13/21/23/31/32/33 —
each composes two half-pel anchor outputs via L2 rounded-average:

  mc11 ¼¼ : avg(mc20[r,   c],   mc02[r,   c])
  mc12 ¼½ : avg(mc22[r,   c],   mc02[r,   c])
  mc13 ¼¾ : avg(mc20[r+1, c],   mc02[r,   c])
  mc21 ½¼ : avg(mc22[r,   c],   mc20[r,   c])
  mc23 ½¾ : avg(mc22[r,   c],   mc20[r+1, c])
  mc31 ¾¼ : avg(mc20[r,   c],   mc02[r,   c+1])
  mc32 ¾½ : avg(mc22[r,   c],   mc02[r,   c+1])
  mc33 ¾¾ : avg(mc20[r+1, c],   mc02[r,   c+1])

Per-lane structure: each lane runs the FULL cascade for BOTH anchors
at its own (r, c) target, then L2 averages.  No shared memory.
Shaders inline hpel_h() / hpel_v() / hpel_hv() helpers (the latter
does the 13×6 int16 cascade per cell).  ~88 lines each.

Shaders generated from a python template (POSITIONS table + format
string) — the 8 .comp files are 1:1 with the C reference's
DEFINE_DIAG_REF macro from fourier PR #18.

Dispatch plumbing: shared dispatch_h264_qpel_diag_qpu helper covers
all 8 (same src envelope as mc22: src_max = src_off + 10*stride + 11,
covering rows -2..+10 and cols -2..+10 for any (r±1, c±1) offset).

Recipe table: all 8 DAEDALUS_KERNEL_H264_QPEL_MC{11..33} flipped to
QPU.  Public dispatchers re-defined via DEFINE_QPEL_DIAG_PUBLIC macro
(replaces the old DEFINE_QPEL_DISPATCH which fast-failed QPU).

Verified on hertz:

  $ ./build/test_api_h264 | grep "qpel mc[1-3][1-3]"
    H.264 qpel mc11: 2048/2048 bytes bit-exact (100.0000%)
    H.264 qpel mc12: 2048/2048 bytes bit-exact (100.0000%)
    H.264 qpel mc13: 2048/2048 bytes bit-exact (100.0000%)
    H.264 qpel mc21: 2048/2048 bytes bit-exact (100.0000%)
    H.264 qpel mc23: 2048/2048 bytes bit-exact (100.0000%)
    H.264 qpel mc31: 2048/2048 bytes bit-exact (100.0000%)
    H.264 qpel mc32: 2048/2048 bytes bit-exact (100.0000%)
    H.264 qpel mc33: 2048/2048 bytes bit-exact (100.0000%)

  Meaningful: the (r±1, c±1) offsets are easy to transpose between
  positions; passing first try on the asymmetric variants (mc13/23/31/33)
  means the position-specific shifts are correct in all 8 templates.

put_ qpel QPU matrix is now COMPLETE: 15 of 15 useful positions
(mc00 = integer copy, no shader needed).  avg_ qpel positions
(15 more) remain on CPU NEON; can land as a follow-up since avg_
is just put_ + one extra L2 against existing dst.

  put_  mc20 ✓  mc02 ✓  mc22 ✓  (anchors)
        mc10 ✓  mc30 ✓  mc01 ✓  mc03 ✓  (single-axis ¼-pel)
        mc11 ✓  mc12 ✓  mc13 ✓  (this PR — row-1 diagonals)
        mc21 ✓                    mc23 ✓  (this PR — row-2 diagonals)
        mc31 ✓  mc32 ✓  mc33 ✓  (this PR — row-3 diagonals)
  avg_  all 15 — CPU NEON
2026-05-25 19:14:42 +02:00
claude-noether e3c28495ae h264: V3D shaders for the 4 single-axis quarter-pel qpel variants
mc10 (¼-H), mc30 (¾-H), mc01 (¼-V), mc03 (¾-V).  Each is the
corresponding half-pel filter (mc20 or mc02) with one extra L2
rounded-average step against an integer-source pixel at the tail:

  mc10[r,c] = avg(clip255(mc20(s)), s[r,   c   ])
  mc30[r,c] = avg(clip255(mc20(s)), s[r,   c+1])
  mc01[r,c] = avg(clip255(mc02(s)), s[r,   c  ])
  mc03[r,c] = avg(clip255(mc02(s)), s[r+1, c  ])

Each shader is ~45 lines (mc20-/mc02-pattern + 1 L2 line).

CMake foreach loop generates the 4 SPV compile rules.  Dispatch
helper `dispatch_h264_qpel_axis_qpu` shares plumbing across all 4
(axis flag selects src_max bounds: H reads cols -2..+10, V reads
rows -2..+10).  DEFINE_QPEL_AXIS_QPU + DEFINE_QPEL_DISPATCH_QPU
macros collapse ~200 LOC of boilerplate.

Recipe table flips DAEDALUS_KERNEL_H264_QPEL_MC{10,30,01,03} from
CPU to QPU.

Verified on hertz:

  $ ./build/test_api_h264 | grep "qpel mc[01230]"
    H.264 qpel mc10: 2048/2048 bytes bit-exact (100.0000%)
    H.264 qpel mc30: 2048/2048 bytes bit-exact (100.0000%)
    H.264 qpel mc01: 2048/2048 bytes bit-exact (100.0000%)
    H.264 qpel mc03: 2048/2048 bytes bit-exact (100.0000%)
    (+ mc20/mc02/mc22 anchors from previous PRs)

Qpel QPU coverage:

  put_  mc20 ✓  mc02 ✓  mc22 ✓                                  (3 anchors)
        mc10 ✓  mc30 ✓  mc01 ✓  mc03 ✓                          (4 quarter-axis, THIS PR)
        mc11/12/13/21/23/31/32/33 — CPU NEON                    (8 diagonals)
  avg_  all 15 positions — CPU NEON

7 of 15 useful put_ positions now on QPU.  The 8 diagonals each
compose two half-pel results via L2; can land via dedicated kernels
or by chaining existing anchor dispatches (the latter would need
the L2 step as a fourth dispatch — probably cheaper to write
dedicated 8x diagonal shaders).
2026-05-25 19:04:26 +02:00
claude-noether 02d564b43e h264: V3D shader for qpel mc22 (2D half-pel "j" position)
Cascaded H+V 6-tap filter per H.264 §8.4.2.2.1.  Highest per-frame
impact among missing qpel positions (PR #24 bench: 71.5 ns/block
NEON, 2.33 ms/frame worst-case all-mc22 at 1080p).

Per-lane structure: each lane runs the FULL cascade independently —
computes 6 horizontal lowpass int16 intermediates at rows r-2..r+3
of its column, then a vertical lowpass on those with +512 >> 10
final scale.  ~50 ALU ops per lane.

Design choice: NO shared memory / barriers.  Alternative was to
cache the h-lowpass intermediates in shared memory (13 rows × 8 cols
of int16 per WG), trading shared-memory bank pressure + a barrier
for ~6× less h-lowpass compute.  V3D L2 absorbs the redundant src
reads across lanes; the per-lane compute is cheap (multiply-add ALU
units idle anyway during dst write).  Simpler shader, fewer SPIR-V
ops, easier to extend to mc12/mc21/etc. later.

CANNOT just cascade mc20 → mc02 because the intermediate must be
int16 (no per-stage clip): the +512 >> 10 final scale assumes both
6-tap scalings preserved through the pipeline.  Dedicated kernel.

dispatch_h264_qpel_mc22_qpu mirrors the existing mc20/mc02 shape;
src_max = src_off + 10*stride + 11 covers both the V (rows -2..+10)
and H (cols -2..+10) read windows in one bound.

Recipe table flips DAEDALUS_KERNEL_H264_QPEL_MC22 from CPU to QPU.

Verified on hertz:

  $ ./build/test_api_h264 | grep qpel
    H.264 qpel mc20: 1024/1024 bytes bit-exact (100.0000%)
    H.264 qpel mc02: 2048/2048 bytes bit-exact (100.0000%)
    H.264 qpel mc22: 2048/2048 bytes bit-exact (100.0000%)

Qpel QPU coverage now: 3 anchors (mc20 H, mc02 V, mc22 HV) — these
are the half-pel "building blocks" the 12 other qpel positions
combine via L2 averaging.  Remaining variants (quarter-pel singles
mc01/03/10/30 and the 8 diagonals) can dispatch through the existing
shaders + a small L2-averaging compose step, or get dedicated kernels.
2026-05-25 18:52:39 +02:00
claude-noether bc5edf656d h264: V3D shader for qpel mc02 (vertical half-pel)
Sibling of cycle 9's v3d_h264_qpel_mc20.comp.  Same 6-tap H.264 luma
half-pel filter, transposed to vertical orientation: filter reads
rows [-2..+3] of source per output pixel instead of cols.

Shader is ~58 lines (vs mc20's 86) — same WG geometry (64 lanes /
1 block per WG / 1 lane per output pixel).  The address arithmetic
flips: row_base = src_off + r*stride + c (mc20) → col_base =
src_off + c, then col_base + (r±N)*stride (mc02).

dispatch_h264_qpel_mc02_qpu mirrors the mc20 QPU dispatch; src_max
calculation differs since the V kernel reads rows -2..+10 of source
(13 rows × stride wide) vs mc20's cols -2..+10 (8 rows × stride+11).
For 8x8 blocks: src_max = src_off + 10*stride + 8.

Recipe table flips DAEDALUS_KERNEL_H264_QPEL_MC02 from CPU to QPU.

Verified on hertz:

  $ ./build/test_api_h264 | grep qpel
    H.264 qpel mc20: 1024/1024 bytes bit-exact (100.0000%)
    H.264 qpel mc02: 2048/2048 bytes bit-exact (100.0000%)

QPU coverage for the 30 qpel positions:
  put_  mc20 ✓ (cycle 9)   mc02 ✓ (this PR)
        all 13 other put_  CPU NEON
  avg_  all 15 positions   CPU NEON

Next-priority candidates by per-frame impact (per PR #24 bench):
  mc22 (2D half-pel)  — 71.5 ns/block NEON × 32 640 blocks worst
                        case = 2.33 ms/frame at 1080p.  Most-used
                        qpel position in real H.264 streams.
  mc11/mc13/mc31/mc33 — corner ¼-pel positions, structurally similar
                        to mc20 + mc02 with L2 averaging.

The cascaded H+V structure of mc22 means it can either share the
existing mc20 + mc02 shaders' L2 (compute mc20 into tmp, then mc02
on tmp) or get a dedicated 2-stage pipeline.  Follow-up.
2026-05-25 18:38:38 +02:00
claude-noether d8de7754fa h264: V3D shaders for chroma deblock V + H (4:2:0)
Adds the QPU shader pair for chroma_v / chroma_h deblock (non-intra
bS<4), siblings of the cycle 8 luma_v shader and PR #28's luma_h.
Closes 4 of 8 deblock QPU coverage at non-intra:

  luma_v   ✓ cycle 8
  luma_h   ✓ PR #28
  chroma_v ✓ this PR
  chroma_h ✓ this PR
  *_intra  — CPU NEON (less common; smaller volume)

Per H.264 §8.7.2.4 chroma kernel is simpler than luma: only p0/q0
updated (never p1/p2/q1/q2), tC = tc0_seg + 1 (no luma-style ap/aq
side bonus), 8 cells per edge (vs luma's 16).  Shader: 64 lines
vs luma_v's 108 — same WG geometry (16 edges × 16 lanes, lanes
8..15 of each edge early-return).

4:2:0-only: 4:2:2 chroma_h has a 16-row edge geometry that this
shader doesn't address; daedalus_dispatch_h264_deblock_chroma_h is
4:2:0-only by design, caller-side gating already covers this in the
libavcodec substitution arc (marfrit-packages PR #98).

Recipe table flips DAEDALUS_KERNEL_H264_DEBLOCK_CV / CH from CPU to
QPU.  dispatch_h264_deblock_chroma_qpu factored to share QPU
plumbing between V and H (orientation passed as a flag for the
dst_max calculation).

Verified on hertz:

  $ ./build/test_api_h264 | grep "deblock chroma [vh]:"
    H.264 deblock chroma v: 256/256 bytes bit-exact (100.0000%)
    H.264 deblock chroma h: 256/256 bytes bit-exact (100.0000%)

  Recipe substrate now reports 2 (QPU) for both CV and CH.

Coverage now:
                bS<4 QPU     bS=4 (intra)
  luma_v        ✓ cycle 8    CPU NEON
  luma_h        ✓ PR #28     CPU NEON
  chroma_v      ✓ this PR    CPU NEON
  chroma_h      ✓ this PR    CPU NEON

Intra (bS=4) variants stay CPU NEON.  Less common case, smaller
per-frame contribution, and the algorithm is structurally different
(no tc0; strong-vs-weak filter quad-tree).  Can land as a follow-up
PR if perf demands.
2026-05-25 17:10:34 +02:00
claude-noether 3db059ffab h264: V3D shader for deblock_luma_h — first QPU port since cycle 9
Ports cycle 8's v3d_h264deblock.comp (V edge, horizontal across a row)
to the H orientation (V edge, horizontal across a column).  Same
algorithm, transposed access pattern:

  V variant: lane → column, reads/writes pix[±N*stride] (vertical I/O)
  H variant: lane → row,    reads/writes pix[±N]        (horizontal I/O)

  WG geometry unchanged: 256 invocations, 16 edges/WG, 16 lanes/edge.
  Lane-in-edge interpretation flips: column-index for V → row-index
  for H.  tc0 segment math unchanged (one tc0 byte per 4 lanes).
  dst_max calculation flips: V used dst_off + 3*stride + 16 (cols),
  H uses dst_off + 15*stride + 4 (rows).

Recipe table: DAEDALUS_KERNEL_H264_DEBLOCK_LH = QPU (was CPU).  AUTO
dispatch now picks QPU for the H edge as well as the V edge.  CPU
NEON path stays as the explicit-SUBSTRATE_CPU + has_qpu=0 fallback.

Verified on hertz (Pi 5 / V3D 7.1):

  $ ./build/test_api_h264 | grep luma_h
    H264_DEBLOCK_LH recipe substrate: 2     (was 1 — flipped to QPU)
    H.264 deblock luma h: 1024/1024 bytes bit-exact (100.0000%)

Bit-exact against the C reference (h264_h_loop_filter_luma_ref) on
8 tiles × 8 cols × 16 rows of random input.  Same correctness gate
as the cycle 8 V shader.

CMake plumbing: glslang rule for v3d_h264deblock_h.comp; new SPV
added to daedalus_shaders ALL list + install rule.  daedalus_ctx
gains a parallel h264deblock_h_pipe_ready / h264deblock_h_pipe pair
(can't share with V because pipelines bind a specific SPIR-V module
at create time).

What this changes for the substitution arc: PR #97's 0008-h264-
deblock-luma-h substitution patch already plumbed
daedalus_recipe_dispatch_h264_deblock_luma_h through libavcodec.
That path was NEON-by-recipe; with this PR it becomes QPU-by-recipe
(unless the libavcodec ctx is no-QPU per daedalus_ctx_create_no_qpu,
in which case it stays NEON — same shape as cycle 8's V shader).

Coverage state for H.264 8-bit 4:2:0 deblock kernels (QPU shaders):
  luma_v       ✓ cycle 8       ✓ now
  luma_h       —               ✓ THIS PR
  chroma_v/h   —               (CPU NEON; smaller tiles, lower-priority)
  *_intra (4)  —               (CPU NEON; less common)
2026-05-25 16:50:41 +02:00
claude-noether 01f782cfaf h264: qpel avg — 12 remaining variants (closes the matrix)
Closes the H.264 8x8 qpel buildout.  Adds the remaining 12 avg_
biprediction positions:
  4 quarter-axis: avg_mc{10,30,01,03}
  8 diagonals  : avg_mc{11,12,13,21,23,31,32,33}

Each follows the established pattern: same half-pel formula as the
put_ sibling, then L2 average with the existing dst contents per
H.264 §8.4.2.3.1.

Scope:
  - 12 new kernel enums (MC10..MC33 avg_ = 34..45) → CPU.
  - 12 NEON externs for the vendored ff_avg_h264_qpel8_mc*_neon.
  - 12 CPU dispatches via existing DEFINE_QPEL_CPU_DISPATCH macro.
  - 12 public dispatches via DEFINE_QPEL_DISPATCH macro.
  - 12 recipe wrappers via DEFINE_QPEL_RECIPE macro.
  - 12 header decls via DECLARE_QPEL_AVG macro.
  - tests/h264_qpel8_avg_rest_ref.c — references via two parametric
    macros: DEFINE_AVG_QUARTER for the 4 ¼-pel L2 forms,
    DEFINE_AVG_DIAG for the 8 two-half-pel-avg forms.
  - Test harness extended with a RUN(MC) sub-macro that derives both
    the ref name and dispatch name from the bare mcXX.  (The ref
    is daedalus_avg_h264_qpel8_<mc>_ref; the dispatch is
    daedalus_recipe_dispatch_h264_qpel_avg_<mc>.  Macro had a typo
    on first try that duplicated "avg_" in the ref name — caught at
    compile, fixed.)

Verified on hertz:

  $ ./build/test_api_h264 | tail -12
    H.264 qpel avg_mc10: 2048/2048 bytes bit-exact (100.0000%)
    H.264 qpel avg_mc30: 2048/2048 bytes bit-exact (100.0000%)
    H.264 qpel avg_mc01: 2048/2048 bytes bit-exact (100.0000%)
    H.264 qpel avg_mc03: 2048/2048 bytes bit-exact (100.0000%)
    H.264 qpel avg_mc11: 2048/2048 bytes bit-exact (100.0000%)
    H.264 qpel avg_mc12: 2048/2048 bytes bit-exact (100.0000%)
    H.264 qpel avg_mc13: 2048/2048 bytes bit-exact (100.0000%)
    H.264 qpel avg_mc21: 2048/2048 bytes bit-exact (100.0000%)
    H.264 qpel avg_mc23: 2048/2048 bytes bit-exact (100.0000%)
    H.264 qpel avg_mc31: 2048/2048 bytes bit-exact (100.0000%)
    H.264 qpel avg_mc32: 2048/2048 bytes bit-exact (100.0000%)
    H.264 qpel avg_mc33: 2048/2048 bytes bit-exact (100.0000%)

  All 12 new positions bit-exact PASS first try.

Final qpel matrix state:
  put_:  mc00 (none — integer copy)
         mc01 ✓  mc02 ✓  mc03 ✓
         mc10 ✓  mc11 ✓  mc12 ✓  mc13 ✓
         mc20 ✓ (QPU+CPU)  mc21 ✓  mc22 ✓  mc23 ✓
         mc30 ✓  mc31 ✓  mc32 ✓  mc33 ✓
  avg_:  same 15-of-16 coverage, all CPU.

Every B-slice biprediction case the libavcodec intercept can throw
at us is now serviceable.  QPU shaders remain mc20-only (cycle 9);
the other 29 positions are CPU NEON.  Whether to write more QPU
shaders depends on real perf measurement — at NEON ~10 ns per
8x8 block, full qpel coverage at 1080p is ~2-3 ms of total work,
well inside budget.
2026-05-25 08:49:42 +02:00
claude-noether 1113953f97 h264: qpel avg anchors (avg_mc20/02/22, biprediction support)
Begins the avg_ qpel buildout for B-slice biprediction.  Each avg_
form computes the same half-pel formula as its put_ sibling, then
L2-averages the result with the existing dst contents — the caller
pre-loads dst with the list0 prediction; the avg_ call adds list1
per H.264 §8.4.2.3.1.

Scope (3 anchors, sets the pattern for the remaining 13 avg_
variants):
  - 3 new kernel enums (AVG_MC20=31, AVG_MC02=32, AVG_MC22=33) → CPU.
  - 3 NEON externs for the vendored ff_avg_h264_qpel8_{mc20,mc02,mc22}_neon.
  - 3 CPU dispatches via existing DEFINE_QPEL_CPU_DISPATCH macro
    (the macro is type-agnostic so it didn't need changes for avg_).
  - 3 public dispatches via DEFINE_QPEL_DISPATCH macro.
  - 3 recipe wrappers via DEFINE_QPEL_RECIPE macro.
  - tests/h264_qpel8_avg_anchors_ref.c — per-cell helpers + L2 avg.
  - Test harness: run_avg_qpel() seeds dst with random content so
    the L2 averaging is actually exercised (not just put_-style
    overwrite that would silently pass).

Verified on hertz:

  $ ./build/test_api_h264 | tail -3
    H.264 qpel avg_mc20: 2048/2048 bytes bit-exact (100.0000%)
    H.264 qpel avg_mc02: 2048/2048 bytes bit-exact (100.0000%)
    H.264 qpel avg_mc22: 2048/2048 bytes bit-exact (100.0000%)

  All 3 anchors bit-exact PASS first try.

Why anchors only in this PR: the avg_ pattern is uniform across all
16 positions (each is just "put_ result + L2 with dst").  Landing
the anchors first confirms the macro pattern works for both put_
and avg_; the remaining 13 (avg_mc10/30/01/03 + avg_mc11..33) follow
the same template in a follow-up PR.

State of the qpel matrix after this PR:
  put_ : 15 of 16 positions ✓ (mc00 is integer copy, no wrapper)
  avg_ :  3 of 16 positions ✓ (mc20, mc02, mc22 anchors)
        13 follow-up positions
2026-05-25 08:35:25 +02:00
claude-noether 0894a46114 h264: qpel diagonals — 8 positions (mc11/12/13/21/23/31/32/33)
Closes the qpel buildout.  All 8 remaining diagonal positions land
in one PR.  Each is the rounded average of two half-pel intermediates
per H.264 §8.4.2.2.1 / Table 8-4, with the decomposition matching
the FFmpeg .S reference structure (verified by reading
external/ffmpeg-snapshot/.../h264qpel_neon.S lines 622-758).

Decomposition table (the formula for each output cell at (r,c)):

  mc11 ¼¼ : avg(mc20[r,   c],   mc02[r, c])
  mc12 ¼½ : avg(mc22[r,   c],   mc02[r, c])
  mc13 ¼¾ : avg(mc20[r+1, c],   mc02[r, c])
  mc21 ½¼ : avg(mc22[r,   c],   mc20[r, c])
  mc23 ½¾ : avg(mc22[r,   c],   mc20[r+1, c])
  mc31 ¾¼ : avg(mc20[r,   c],   mc02[r, c+1])
  mc32 ¾½ : avg(mc22[r,   c],   mc02[r, c+1])
  mc33 ¾¾ : avg(mc20[r+1, c],   mc02[r, c+1])

The (r±1, c±1) offsets capture the position-dependent shift that
the FFmpeg .S encodes by pre-incrementing x1 (src pointer) before
branching into the common mc11/mc21 code paths.

Scope (tightly macro-ised):
  - 8 new kernel enums (MC11..MC33 = 23..30) → CPU.
  - 8 NEON externs for the vendored ff_put_h264_qpel8_mc*_neon.
  - 8 CPU dispatches via existing DEFINE_QPEL_CPU_DISPATCH macro.
  - 8 public dispatches via DEFINE_QPEL_DISPATCH macro.
  - 8 recipe wrappers via DEFINE_QPEL_RECIPE macro.
  - Header decls condensed via a DECLARE_QPEL_DIAG macro that
    expands to both recipe + dispatch decls per name.
  - C references via DEFINE_DIAG_REF macro: each ref is a 6-line
    wrapper around the per-cell hpel_h / hpel_v / hpel_hv helpers
    (the latter being the per-cell version of mc22's 13-row int16
    tmp[] computation).
  - Test wrapper: test_qpel_diag_all() drives all 8 through the
    existing run_quarter_axis_qpel() harness.

Verified on hertz (Pi 5 / V3D 7.1):

  $ ./build/test_api_h264 | tail -8
    H.264 qpel mc11: 2048/2048 bytes bit-exact (100.0000%)
    H.264 qpel mc12: 2048/2048 bytes bit-exact (100.0000%)
    H.264 qpel mc13: 2048/2048 bytes bit-exact (100.0000%)
    H.264 qpel mc21: 2048/2048 bytes bit-exact (100.0000%)
    H.264 qpel mc23: 2048/2048 bytes bit-exact (100.0000%)
    H.264 qpel mc31: 2048/2048 bytes bit-exact (100.0000%)
    H.264 qpel mc32: 2048/2048 bytes bit-exact (100.0000%)
    H.264 qpel mc33: 2048/2048 bytes bit-exact (100.0000%)

ALL 8 diagonal positions bit-exact PASS first try.  Meaningful
because the position-dependent (r±1, c±1) source offsets are easy
to get wrong by transcription, and any of them would surface on
random inputs immediately.

After this PR the H.264 qpel 8x8 put_ matrix is complete:
  mc00 mc01 mc02 mc03
  mc10 mc11 mc12 mc13
  mc20 mc21 mc22 mc23
  mc30 mc31 mc32 mc33

15 of 16 positions exposed through the daedalus API; mc00 is just
integer copy and rarely needs a dispatch wrapper (libavcodec sets
the function pointer table directly).  mc20 retains its QPU shader
(cycle 9 / v3d_h264_qpel_mc20.spv); all other 14 are CPU NEON.

What this does NOT cover (still in backlog):
  - avg_ variants (the "add" form for biprediction, 16 more
    positions).  Currently the API only exposes put_.
  - 16x16 qpel (separate function family in FFmpeg; the 8x8 path
    can be used twice to substitute when 16x16 isn't critical).
  - QPU shaders for any qpel position other than mc20.
2026-05-25 07:49:12 +02:00
claude-noether e01f7bc7c6 h264: qpel single-axis quarter-pel — mc10/mc30/mc01/mc03 (CPU/NEON)
Closes the 4 single-axis quarter-pel positions in one PR.  Each is
a half-pel lowpass clipped to u8 followed by L2 rounded-average
with an integer-aligned source pixel per H.264 §8.4.2.2.1:

  mc10  ¼-H ("a" pos): clip255(mc20(s)) avg src[r,c]
  mc30  ¾-H ("c" pos): clip255(mc20(s)) avg src[r,c+1]
  mc01  ¼-V ("d" pos): clip255(mc02(s)) avg src[r,c]
  mc03  ¾-V ("n" pos): clip255(mc02(s)) avg src[r+1,c]

The mc10/mc30 pair and mc01/mc03 pair only differ in WHICH integer
source pixel they average with — the half-pel computation is the
same.  Putting them in one PR is justified by that uniformity.

Scope:
  - 4 new kernel enums: MC10=19, MC30=20, MC01=21, MC03=22 → CPU.
  - 4 NEON externs for the vendored ff_put_h264_qpel8_mc{10,30,01,03}_neon.
  - 4 CPU dispatch wrappers via DEFINE_QPEL_CPU_DISPATCH macro
    (collapses ~50 LOC of repetition).
  - 4 public dispatch fns via DEFINE_QPEL_DISPATCH macro.
  - 4 recipe wrappers via DEFINE_QPEL_RECIPE macro.
  - tests/h264_qpel8_quarter_axis_ref.c covers all four via shared
    hpel_h() / hpel_v() inlines + per-mode L2 average.
  - Test refactor: generic run_quarter_axis_qpel() harness exercises
    all 4 positions through a single helper (~50 LOC for 4 tests vs
    ~200 if each was hand-rolled).

Verified on hertz:

  $ ./build/test_api_h264 | tail -8
    H.264 deblock chroma h intra: 256/256 bytes bit-exact (100.0000%)
    H.264 qpel mc20: 1024/1024 bytes bit-exact (100.0000%)
    H.264 qpel mc02: 2048/2048 bytes bit-exact (100.0000%)
    H.264 qpel mc22: 2048/2048 bytes bit-exact (100.0000%)
    H.264 qpel mc10: 2048/2048 bytes bit-exact (100.0000%)
    H.264 qpel mc30: 2048/2048 bytes bit-exact (100.0000%)
    H.264 qpel mc01: 2048/2048 bytes bit-exact (100.0000%)
    H.264 qpel mc03: 2048/2048 bytes bit-exact (100.0000%)

  All 4 new positions bit-exact PASS first try.

Coverage matrix update:
  put_  mc00 mc10 mc20 mc30
  mc01     —    ✓    —    ✓
  mc11     —    —    ✓    —     ← this row
  mc21     —    —    —    —
  mc31     —    —    —    —
  mc02     —    —    ✓    —     ← mc02 + mc22 anchor
  mc03     —    —    ✓    —

After this PR: 7 of 16 single-axis + diagonal positions done.
Remaining 9 are the off-axis quarter-pel combinations
(mc11/mc12/mc13/mc21/mc23/mc31/mc32/mc33) — each combines a 2D
lowpass intermediate with L2 averaging against a 1D-lowpass output.
Next PR scope.

Why no QPU shaders: same R-band logic as the prior CPU additions.
At ~10 ns per 8x8 NEON block, all 16 qpel positions together
would land in ~1.3 ms/frame at 1080p worst case — comfortably
inside the 33 ms budget.  QPU shader for mc20 already exists
(cycle 9 / v3d_h264_qpel_mc20.spv); the other 15 follow once a
clear perf reason emerges.
2026-05-25 01:29:52 +02:00
claude-noether 20a4299c5c h264: qpel mc22 (2D half-pel, CPU/NEON)
Adds the "j position" 2D half-pel via cascaded H + V 6-tap lowpass
with intermediate 16-bit precision per H.264 §8.4.2.2.1.  One of the
most common qpel positions in real H.264 streams — many encoders
emit 1/2-1/2 motion vectors as their best-RD choice.

Algorithmically distinct from the 1D mc20/mc02 siblings:
  - Horizontal 6-tap produces 13 rows of int16 intermediate (no
    per-stage clip/round — full precision retained).
  - Vertical 6-tap on the intermediate, then +512 >> 10 (the
    double-shift compensates for both 6-tap scalings) + clip255.

The intermediate-precision requirement means the C reference can't
just be "call mc20 then mc02" — that would double-clip and produce
the wrong result.  The 13-row int16 tmp[] buffer is the central
invariant.

Scope (same pattern as mc02 PR #15):
  - Public API: daedalus_dispatch_h264_qpel_mc22 + recipe wrapper.
  - Internal: dispatch_h264_qpel_mc22_cpu calling
    ff_put_h264_qpel8_mc22_neon.
  - Recipe table: DAEDALUS_KERNEL_H264_QPEL_MC22 = 18 → CPU.
  - C reference: tests/h264_qpel8_mc22_ref.c — explicit tmp[13][8]
    int16 staging buffer; spec-derived shifts and rounding.
  - Test: test_qpel_mc22 in test_api_h264, 8 tiles at 16×16 with
    output positioned at (SRC_ROW=3, SRC_COL=3) so the kernel's
    [-2 .. +10] read window stays in-tile.

Verified on hertz:

  $ ./build/test_api_h264 | tail -5
    H.264 deblock chroma v intra: 256/256 bytes bit-exact (100.0000%)
    H.264 deblock chroma h intra: 256/256 bytes bit-exact (100.0000%)
    H.264 qpel mc20: 1024/1024 bytes bit-exact (100.0000%)
    H.264 qpel mc02: 2048/2048 bytes bit-exact (100.0000%)
    H.264 qpel mc22: 2048/2048 bytes bit-exact (100.0000%)

  All 13 H.264 kernels in api_smoke now bit-exact PASS.

mc22 being right first try is meaningful — the +512 >> 10 scaling
+ int16 intermediate sequence has multiple sign/shift/clip pitfalls
and any of them would surface on random inputs immediately.

Coverage matrix update:
  put_ mc20 ✓ (QPU+CPU)  put_ mc02 ✓ (CPU)  put_ mc22 ✓ (CPU)
  → 12 single put_ positions still missing (¼/¾ + HV combos with
  L2 averaging).
2026-05-25 01:03:14 +02:00
claude-noether c3301b0c2e h264: qpel mc02 (vertical half-pel, CPU/NEON)
Mirror of cycle 9's mc20 transposed to vertical orientation.  Wires
up the second qpel half-pel position via the vendored
ff_put_h264_qpel8_mc02_neon symbol, closes the "missing vertical
sibling" gap that mc20 left open since cycle 9.

Scope:
  - Public API: daedalus_dispatch_h264_qpel_mc02 + recipe wrapper.
  - Internal: dispatch_h264_qpel_mc02_cpu calling the NEON entry.
  - Recipe table: DAEDALUS_KERNEL_H264_QPEL_MC02 = 17 → CPU.
    Explicit SUBSTRATE_QPU returns -1 (no shader yet).
  - C reference: tests/h264_qpel8_mc02_ref.c — vertical 6-tap
    transpose of mc20 (reads src[(r±N)*stride + c] instead of
    src[r*stride + c±N]).
  - Test: test_qpel_mc02 in test_api_h264, 8 tiles × 16×16 cols
    × 16 rows, random input, bit-exact compare against the C ref.

Verified on hertz:

  $ ./build/test_api_h264
  ...
    H.264 qpel mc20: 1024/1024 bytes bit-exact (100.0000%)
    H.264 qpel mc02: 2048/2048 bytes bit-exact (100.0000%)

  All 12 H.264 kernels in the api_smoke now bit-exact PASS.

Why CPU-only: same R-band logic as the deblock _h sibling pattern.
mc02 at ~7.6 ns per 8x8 block on NEON (per the cycle 9 baseline
measurements) gives ~700 us for 8160 MBs × 4 8x8 luma blocks at
1080p — comfortably inside the 33 ms budget.  QPU shader is a
fast-follow once the V vs H shader work is consolidated (the
transpose for the V shader is not mechanical — different SIMD
access pattern than the H shader).

Coverage matrix update:

  qpel position  put_ status  avg_ status
  -------------  -----------  -----------
  mc00 (copy)    not wired    not wired
  mc10 (¼-H)     not wired    not wired
  mc20 (½-H)    ✓ QPU+CPU     not wired
  mc30 (¾-H)     not wired    not wired
  mc01 (¼-V)     not wired    not wired
  mc02 (½-V)    ✓ CPU         not wired (this PR)
  mc03 (¾-V)     not wired    not wired
  mc11..mc33     not wired    not wired

13 more qpel positions to go for the full put_ matrix.  Adding them
follows the same template; each is a small contained PR.
2026-05-25 00:47:37 +02:00
claude-noether 9b1c106dc5 h264: deblock bS=4 intra variants (luma + chroma, V + H)
Closes the deblock matrix: adds the four bS=4 intra-strength loop
filters used at I-MB edges (and other boundaries where H.264
§8.7.2.1 forces boundary strength to 4).  After this PR fourier
covers all 8 standard 8-bit 4:2:0 deblock combinations:

    bS<4   bS=4
    -----  -----
  luma_v  ✓ (cycle 8 QPU)   ✓ (CPU)
  luma_h  ✓ (CPU, PR #9)    ✓ (CPU)
  chrm_v  ✓ (CPU, PR #10)   ✓ (CPU)
  chrm_h  ✓ (CPU, PR #10)   ✓ (CPU)

Scope:
  - 4 new kernel enums (LV_INTRA=13, LH_INTRA=14, CV_INTRA=15,
    CH_INTRA=16), all → CPU substrate in the recipe table.
  - 4 new public dispatch fns + 4 recipe wrappers (defined via two
    DEFINE_INTRA_DISPATCH / DEFINE_INTRA_RECIPE macros to keep the
    boilerplate tight).
  - 4 new extern decls for the vendored
    ff_h264_{v,h}_loop_filter_{luma,chroma}_intra_neon symbols.
  - C reference: tests/h264_intra_loop_filter_ref.c covers all four
    orientations.  Algorithm per H.264 §8.7.2.3:

      Luma: per-side strong/weak filter selector
        strong_p = (|p2-p0| < β) AND (|p0-q0| < (α>>2)+2)
        strong_q = (|q2-q0| < β) AND (|p0-q0| < (α>>2)+2)
        Strong updates p0/p1/p2 (and mirror); weak updates p0 only.
      Chroma: always weak, only p0/q0 updated.

  - daedalus_h264_deblock_meta is REUSED for intra dispatches; the
    tc0[] field is ignored (bS=4 hardcodes the strength).  Callers
    can build a single edge list and route by kernel without an
    extra struct.

  - Test refactor: an intra_test_spec table + run_intra_test helper
    drives all four orientations through one harness, keeping the
    new test surface compact (~50 LOC for 4 kernels vs ~200 if each
    had its own test_deblock_*_intra fn).

Verified on hertz (Pi 5 / V3D 7.1):

  $ ./build/test_api_h264
  === Phase 8a API smoke: H.264 kernels via recipe dispatch ===
    ...
    H.264 deblock luma v intra: 1024/1024 bytes bit-exact (100.0000%)
    H.264 deblock luma h intra: 1024/1024 bytes bit-exact (100.0000%)
    H.264 deblock chroma v intra: 256/256 bytes bit-exact (100.0000%)
    H.264 deblock chroma h intra: 256/256 bytes bit-exact (100.0000%)
    ...

  All 11 H.264 kernels bit-exact PASS — the deblock matrix is closed.

The bit-exact match on first try is meaningful for these kernels:
the strong/weak filter selector + per-side asymmetry would have
surfaced any sign / shift / rounding mistake immediately.  The
C reference is now a usable spec checkpoint for the eventual QPU
shader work.

QPU shader follow-up: not in this PR.  The intra path's 3-cell
per-side update + strong/weak branch is structurally more complex
than the bS<4 path that already has a V shader (v3d_h264deblock.spv).
Per the prior R-band logic for deblock, intra edges are < 20% of
total deblock work at typical bit-rates, so NEON-only at ~ 10 ns/edge
fits comfortably in the budget.
2026-05-25 00:00:46 +02:00
claude-noether a5c47aa51c h264: deblock chroma_v + chroma_h (CPU/NEON, bS<4)
Continues the deblock buildout after PR #9 (luma_h).  Adds the two
chroma orientations via the same recipe-table-routed-to-CPU pattern;
QPU shaders for chroma deblock are still a follow-up.

Scope:
  - Public API: 4 new fns (dispatch + recipe wrapper × {v, h}).
  - Internal: dispatch_h264_deblock_chroma_{v,h}_cpu calling the
    vendored ff_h264_{v,h}_loop_filter_chroma_neon symbols.
  - Recipe table: DAEDALUS_KERNEL_H264_DEBLOCK_CV = 11,
    DAEDALUS_KERNEL_H264_DEBLOCK_CH = 12, both → CPU.  Explicit
    SUBSTRATE_QPU returns -1 (no shader yet).
  - C reference: tests/h264_chroma_loop_filter_ref.c — covers both
    orientations.  Algorithm per H.264 §8.7.2.4 (bS<4 chroma inter):
    tC = tc0_seg + 1 (no luma-style ap/aq side bonus); only p0/q0
    are updated (chroma never modifies p1/p2/q1/q2).
  - Tests: test_deblock_chroma_v (8x4 tile, edge at row 2) +
    test_deblock_chroma_h (4x8 tile, edge at col 2), 4 segments x
    2 cells per segment per spec.

Verified on hertz (Pi 5 / V3D 7.1):

  $ ./build/test_api_h264
  === Phase 8a API smoke: H.264 kernels via recipe dispatch ===
    H264_IDCT4 recipe substrate:      2 (1=CPU, 2=QPU)
    H264_IDCT8 recipe substrate:      2
    H264_DEBLOCK_LV recipe substrate: 2
    H264_QPEL_MC20 recipe substrate:  2
    H264_DEBLOCK_LH recipe substrate: 1 (CPU, no QPU H shader yet)
    H264_DEBLOCK_CV recipe substrate: 1 (CPU)
    H264_DEBLOCK_CH recipe substrate: 1 (CPU)
    H.264 IDCT 4x4: 2048/2048 bytes bit-exact (100.0000%)
    H.264 IDCT 8x8: 2048/2048 bytes bit-exact (100.0000%)
    H.264 deblock luma v: 2048/2048 bytes bit-exact (100.0000%)
    H.264 deblock luma h: 1024/1024 bytes bit-exact (100.0000%)
    H.264 deblock chroma v: 256/256 bytes bit-exact (100.0000%)
    H.264 deblock chroma h: 256/256 bytes bit-exact (100.0000%)
    H.264 qpel mc20: 1024/1024 bytes bit-exact (100.0000%)

  All 7 kernels bit-exact PASS.  Chroma test sizes are smaller (256
  bytes per orientation) because the per-MB chroma deblock surface is
  smaller than luma — accurate to the production geometry.

Why no QPU shader yet (per the established pattern):
  - Chroma deblock is ~25% of total deblock work at 4:2:0 (one quarter
    the pixel count of luma per MB) — modest QPU win even after the
    shader exists.
  - Same R-band considerations as the luma _h follow-up: the V shader
    transpose isn't mechanical, and the 8-cell tile is small enough
    that NEON's per-edge cost (~3 ns) is already inside the budget.
  - Total bench at 1080p: 8160 MBs × 4 chroma edges × 3 ns = ~100 us.
    Negligible compared to the IDCT layer's 10 ms (CPU NEON).

Now coverage in fourier for the bS<4 8-bit 4:2:0 deblock matrix is
complete: luma_v ✓, luma_h ✓, chroma_v ✓, chroma_h ✓.  Remaining
deblock work: bS=4 intra variants (luma + chroma, V + H).

What this unblocks downstream:
  - daedalus-decoder Stage 4 deblock can now dispatch all four bS<4
    edge categories that a typical inter MB needs.
2026-05-24 23:53:09 +02:00
claude-noether 9d5451e0fe h264: deblock_luma_h — CPU/NEON via vendored ff_h264_h_loop_filter
Adds the horizontal-edge sibling of cycle 8's deblock_luma_v.  The
vendored FFmpeg snapshot already includes ff_h264_h_loop_filter_luma_neon
in libavcodec/aarch64/h264dsp_neon.S — this PR wires up the symbol,
the bit-exact reference, and the recipe-table entry so daedalus-decoder
and other consumers can call the H variant through the same dispatch
shape they use for _v.

Scope:
  - Public API: daedalus_dispatch_h264_deblock_luma_h(ctx, sub, ...)
    + daedalus_recipe_dispatch_h264_deblock_luma_h(ctx, ...) wrapper.
  - Internal: dispatch_h264_deblock_h_cpu() calls the NEON entry.
  - Recipe table: new DAEDALUS_KERNEL_H264_DEBLOCK_LH = 10, mapped
    to DAEDALUS_SUBSTRATE_CPU until a QPU shader is written.  An
    explicit SUBSTRATE_QPU request on the H dispatch returns -1
    (fails fast, no silent CPU degradation).
  - C reference: tests/h264_h_loop_filter_luma_ref.c — the
    column-axis transpose of h264_deblock_ref.c.  Same per-segment
    kernel; pix[-4..+3] accesses cols instead of rows*stride.
  - Test: test_api_h264 grows a test_deblock_h() with 8 tiles
    (8 cols x 16 rows each, edge at col 4), random alpha/beta/tc0;
    compares NEON dispatch against reference byte-for-byte.

Verified on hertz (Pi 5 / V3D 7.1):

  $ ./build/test_api_h264
  === Phase 8a API smoke: H.264 kernels via recipe dispatch ===
    H264_IDCT4 recipe substrate:      2 (1=CPU, 2=QPU)
    H264_IDCT8 recipe substrate:      2
    H264_DEBLOCK_LV recipe substrate: 2
    H264_QPEL_MC20 recipe substrate:  2
    H264_DEBLOCK_LH recipe substrate: 1 (CPU, no QPU H shader yet)
    H.264 IDCT 4x4: 2048/2048 bytes bit-exact (100.0000%)
    H.264 IDCT 8x8: 2048/2048 bytes bit-exact (100.0000%)
    H.264 deblock luma v: 2048/2048 bytes bit-exact (100.0000%)
    H.264 deblock luma h: 1024/1024 bytes bit-exact (100.0000%)
    H.264 qpel mc20: 1024/1024 bytes bit-exact (100.0000%)

  All 5 kernels bit-exact PASS.  The new H variant joins the suite
  with 1024 random-input bytes per tile x 8 tiles.

Why CPU-only for now: the daedalus-decoder downstream needs the H
edge dispatched somewhere — even at CPU NEON cost (~6 ns/edge per
the cycle 8 M3 baseline) a frame's worth at 1080p is
~ 8160 MBs * 4 edges = 32 640 edges = ~200 us — well inside the
30 fps budget.  Writing the V3D H-edge shader is a follow-up
(would be cycle 8' or similar; the V-edge shader's transpose isn't
mechanical because of how the workgroup organisation maps to columns
vs rows).

Backlog addition (out of scope for this PR):
  - V3D shader for the H variant (mirror of v3d_h264deblock.spv).
  - bS=4 intra-strength filter (different algebra; both _v and _h).
  - Chroma deblock luma_v/_h (8-cell variants).
2026-05-24 23:28:56 +02:00
claude-noether 79553c6e22 cycle 9: V3D shader for H.264 luma qpel mc20 — 9/9 QPU coverage
Closes the QPU-default substrate campaign per the 2026-05-23
decree: every daedalus-fourier kernel that can be done in QPU
is now done in QPU.  Cycle 9 is the last piece — 6-tap horizontal
half-pel luma motion compensation, H.264 §8.4.2.2.1.

Shader (src/v3d_h264_qpel_mc20.comp):

  - local_size = 64, 1 lane per output pixel of one 8x8 block,
    1 block per workgroup.  Simplest layout that avoids any
    inter-lane communication — V3D's L2 cache handles the
    redundant reads from adjacent lanes computing adjacent
    output columns.
  - Per-pixel: read 6 src samples (cols c-2..c+3 in row r),
    apply the (1, -5, 20, 20, -5, 1) / 32 filter with +16
    rounding, clip to u8, write one dst byte.
  - Single-stride convention matches FFmpeg's H264QpelContext
    (dst and src share `stride`; src+src_off points at output
    col 0 with the caller-guaranteed -2/+3 padding).

Dispatch wiring (src/daedalus_core.c):

  - h264_qpel_mc20_pipe field on daedalus_ctx, lazy init.
  - dispatch_h264_qpel_mc20_qpu(): 3 SSBOs (src / dst / meta),
    src_max = src_off + 7*stride + 11 (covers the +3-col read
    footprint on the last row), dst_max = dst_off + 7*stride + 8.
    1 block per WG.
  - daedalus_dispatch_h264_qpel_mc20() replaces ROUTE_CPU_ONLY
    with the substrate-switch pattern matching the other H.264
    kernels.
  - Recipe table: H264_QPEL_MC20 returns SUBSTRATE_QPU.

Verification (hertz, Pi 5, V3D 7.1):

  $ ./test_api_h264
  === Phase 8a API smoke: H.264 kernels via recipe dispatch ===
    H264_IDCT4 recipe substrate:      2 (1=CPU, 2=QPU)
    H264_IDCT8 recipe substrate:      2
    H264_DEBLOCK_LV recipe substrate: 2
    H264_QPEL_MC20 recipe substrate:  2   ← flipped
    H.264 IDCT 4x4: 2048/2048 bytes bit-exact
    H.264 IDCT 8x8: 2048/2048 bytes bit-exact
    H.264 deblock luma v: 2048/2048 bytes bit-exact
    H.264 qpel mc20: 1024/1024 bytes bit-exact   ← QPU

First-iteration result was 1017/1024 (99.32%) — off-by-7 traced
to undersizing src_max in the host wrapper.  The filter reads
src_off + 7*stride + (7 + 3) = +10 at the last row last column;
add 1 for memcpy's [0..N-1] semantic → 11.  Fixed in the same
patch.

All 9 daedalus-fourier cycles now QPU-by-recipe:

  cycle 1 VP9 IDCT 8x8         QPU
  cycle 2 VP9 LPF wd=4         QPU
  cycle 3 VP9 MC 8h            QPU
  cycle 4 VP9 LPF wd=8         QPU
  cycle 5 AV1 CDEF 8x8         QPU
  cycle 6 H.264 IDCT 4x4       QPU
  cycle 7 H.264 IDCT 8x8       QPU
  cycle 8 H.264 deblock luma-v QPU
  cycle 9 H.264 qpel mc20      QPU   ← this commit

Closes daedalus-fourier task #165.  Per the decree memory
[QPU is default substrate], the prototype now demonstrates GPU
acceleration on every measured kernel.
2026-05-23 21:05:36 +02:00
marfrit a092ee34aa Merge pull request 'QPU is default substrate: recipe table + ctx env-var override' (#7) from noether/qpu-default-recipe-cycles-5-8 into main
Reviewed-on: #7
2026-05-23 18:59:34 +00:00
claude-noether 74687d9def cycle 7: V3D shader for H.264 IDCT 8x8
Mirrors cycle 6 (PR #7 prior commit) but at 8x8 scale: 8 lanes per
block, 8 blocks per WG.  H.264 §8.5.13.2 1D butterfly twice (row
pass, column pass), (val + 32) >> 6 rounded + clipped + added to
dst.

Bit-exact first try on hertz (Pi 5, V3D 7.1):

  H264_IDCT4 recipe substrate:      2 (QPU)
  H264_IDCT8 recipe substrate:      2 (QPU)    ← flipped
  H264_DEBLOCK_LV recipe substrate: 2 (QPU)
  H264_QPEL_MC20 recipe substrate:  1 (CPU)    ← task #165 remaining
  H.264 IDCT 4x4: 2048/2048 bytes bit-exact
  H.264 IDCT 8x8: 2048/2048 bytes bit-exact    ← QPU
  H.264 deblock luma v: 2048/2048 bytes bit-exact
  H.264 qpel mc20: 1024/1024 bytes bit-exact

8 of 9 daedalus-fourier cycles now QPU-by-recipe.  Only cycle 9
(H.264 luma qpel mc20) still CPU — different shape (6-tap MC
filter, not a transform) so needs its own shader template; task
#165 covers it as a follow-on.

Same pattern as cycle 6 commit (65bd5c3): adds h264_idct8_pipe
field + lazy init, dispatch_h264_idct8_qpu() with 3 SSBOs,
v3d_h264_idct8.spv install rule.

Uses v3d_runner_create_buffer / destroy_buffer (will swap to
pool API once PR #6 lands).
2026-05-23 20:09:25 +02:00
claude-noether 65bd5c3fe3 cycle 6: V3D shader for H.264 IDCT 4x4 (first cycle-6 QPU dispatch)
Per the QPU-default substrate decree 2026-05-23, cycle 6 (H.264
IDCT 4x4 + add) was the highest-priority H.264 kernel to flip
from NEON-only to QPU-capable.  The same shape as VP9 IDCT 8x8
(cycle 1) — two-pass butterfly with shared-memory transpose —
but at 4x4 scale: 4 lanes per block, 16 blocks per WG.

What's added:

  - src/v3d_h264_idct4.comp: GLSL compute shader implementing
    the H.264 §8.5.12.1 1D butterfly twice (row pass then column
    pass), with (val + 32) >> 6 rounding and clip-to-u8 add to
    dst.  Block memory layout is column-major (matches FFmpeg
    `ff_h264_idct_add_neon` convention).

  - CMakeLists: glslang rule + install entry for v3d_h264_idct4.spv.

  - dispatch_h264_idct4_qpu() in daedalus_core.c: lazy pipeline
    init, 3 SSBOs (coeffs / dst / meta as uvec4), push-constant
    (n_blocks, dst_stride), 16 blocks per WG dispatch.  Matches
    the existing dispatch_*_qpu patterns; uses
    v3d_runner_create_buffer / destroy_buffer (will swap to
    pool API once PR #6 lands).

  - daedalus_dispatch_h264_idct4() replaces ROUTE_CPU_ONLY with
    the same CPU/QPU substrate switch the deblock dispatch uses.

  - daedalus_recipe_substrate_for(H264_IDCT4) returns QPU now
    that the shader exists.

Verification on hertz (Pi 5 + V3D 7.1):

  $ ./test_api_h264
  === Phase 8a API smoke: H.264 kernels via recipe dispatch ===
    H264_IDCT4 recipe substrate:      2 (1=CPU, 2=QPU)
    H264_IDCT8 recipe substrate:      1
    H264_DEBLOCK_LV recipe substrate: 2
    H264_QPEL_MC20 recipe substrate:  1
    H.264 IDCT 4x4: 2048/2048 bytes bit-exact (100.0000%)   ← QPU
    H.264 IDCT 8x8: 2048/2048 bytes bit-exact
    H.264 deblock luma v: 2048/2048 bytes bit-exact
    H.264 qpel mc20: 1024/1024 bytes bit-exact

The AUTO-substrate path now picks QPU for H.264 IDCT 4x4, and
the output is bit-exact against the C reference (which is
identical to the NEON .S code by construction — same FFmpeg
upstream).

Remaining cycle-6/7/9 work in task #165:
  - cycle 7: H.264 IDCT 8x8 (template same shape; 8 lanes per
    block, fewer blocks per WG)
  - cycle 9: H.264 luma qpel mc20 (different shape — 6-tap MC
    not a transform)

This commit lands the cycle-6 piece of task #165.
2026-05-23 20:06:20 +02:00
claude-noether 737e87980d QPU is default substrate: recipe table + ctx env-var override
Per the user decree 2026-05-23 — "what can be done in QPU will
be done in QPU" — this lands two coupled changes that flip
production-decode kernels with existing V3D shaders from
CPU-by-recipe to QPU-by-recipe:

1) daedalus_recipe_substrate_for() returns SUBSTRATE_QPU for
   every kernel that has a shipped V3D compute shader:

    cycle 1 VP9 IDCT 8x8         QPU  (was QPU; unchanged)
    cycle 2 VP9 LPF wd=4         QPU  (was QPU; unchanged)
    cycle 3 VP9 MC 8h            QPU  (FLIPPED from CPU — v3d_mc_8h.spv)
    cycle 4 VP9 LPF wd=8         QPU  (was QPU; unchanged)
    cycle 5 AV1 CDEF 8x8         QPU  (FLIPPED from CPU — v3d_cdef.spv)
    cycle 6 H.264 IDCT 4x4       CPU  (no shader yet; task #165)
    cycle 7 H.264 IDCT 8x8       CPU  (no shader yet; task #165)
    cycle 8 H.264 deblock luma-v QPU  (FLIPPED from CPU — v3d_h264deblock.spv)
    cycle 9 H.264 qpel mc20      CPU  (no shader yet; task #165)

   The R-band cost/benefit framework still applies but is now
   superseded for substrate selection by the decree.  Where R
   stays RED, the cost is in dispatch overhead, which is a
   fixable engineering issue (tasks 160 buffer-pool, 161
   persistent cmdbuf, 162 dmabuf import).

2) daedalus_ctx_create_no_qpu() now honours an env-var override:
   set DAEDALUS_FORCE_QPU=1 in the process and create_no_qpu
   silently escalates to a full daedalus_ctx_create().  Lets
   the libavcodec substitution shims in marfrit-packages (which
   pthread_once a create_no_qpu ctx — see
   libavcodec/aarch64/h264_idct_daedalus.c) fire QPU paths
   without rebuilding those patches.

   Firefox / mpv consumers stay on the Vulkan-free path by
   default (env var unset).  The daedalus-v4l2 daemon will set
   DAEDALUS_FORCE_QPU=1 explicitly before dlopen'ing libavcodec
   (separate daedalus-v4l2 follow-up).

Smoke (hertz, Pi 5, kernel 6.18.29):

  === test_api_h264 ===
    H264_IDCT4 recipe substrate:      1 (1=CPU, 2=QPU)
    H264_IDCT8 recipe substrate:      1
    H264_DEBLOCK_LV recipe substrate: 2     ← flipped
    H264_QPEL_MC20 recipe substrate:  1
    H.264 IDCT 4x4: 2048/2048 bytes bit-exact
    H.264 IDCT 8x8: 2048/2048 bytes bit-exact
    H.264 deblock luma v: 2048/2048 bytes bit-exact   ← QPU path
    H.264 qpel mc20: 1024/1024 bytes bit-exact

  === test_api_idct === all substrates (CPU/QPU/AUTO) bit-exact
  === test_api_lpf  === all substrates bit-exact wd=4 and wd=8

The dispatch wrapper's fall-through logic (eff == SUBSTRATE_QPU
&& !ctx_has_qpu(ctx) → eff = SUBSTRATE_CPU) handles the case
where the recipe says QPU but the consumer didn't opt in — it
falls back to CPU silently, no regression.

Closes daedalus-fourier tasks #163, #164.
Refs the 2026-05-23 "QPU default substrate" decree.
2026-05-23 19:59:53 +02:00
claude-noether 98553278dd v3d_runner: persistent per-pipeline command buffer
Phase 2 of the QPU-default substrate campaign — eliminate
vkAllocateCommandBuffers from the dispatch hot path.

Attaches a VkCommandBuffer to each v3d_pipeline, allocated once in
v3d_runner_create_pipeline() and freed in destroy_pipeline().  The
five dispatch_*_qpu sites switch from v3d_runner_alloc_cmdbuf() to
v3d_runner_pipeline_cmdbuf_reset() — vkResetCommandBuffer is O(1)
versus the driver-side allocation walk.  Pool was already created
with VK_COMMAND_POOL_CREATE_RESET_COMMAND_BUFFER_BIT so reset is
permitted.

Microbench (hertz, Pi 5, kernel 6.18.29, V3D 7.1):

  before (task 160 pool only):
    steady-state p50: 76.44 us
    steady-state mean: 77.95 us
  after (task 160 pool + task 161 persistent cb):
    steady-state p50: 54.56 us
    steady-state mean: 56.00 us
    -> 28% per-dispatch reduction

The remaining ~54 us steady-state is dominated by vkQueueWaitIdle +
shader execution + the two memcpy(in/out) on the dst buffer — task
162 (dmabuf import for dst) targets the memcpy half.

test_api_idct stays bit-exact across CPU/QPU/AUTO substrates.

Refs daedalus-fourier task #161.
2026-05-23 19:56:35 +02:00
claude-noether 0a042a8e95 v3d_runner: buffer pool for QPU dispatch hot path
Per the QPU-default substrate decree 2026-05-23: the per-dispatch
vkAllocateMemory in dispatch_*_qpu was the biggest single fixable
contributor to QPU dispatch overhead.  This pools v3d_buffer
allocations by power-of-2 size class so the second-and-subsequent
dispatch hits a freelist instead of paying ~10-50us of Mesa-V3D7
memory-allocation cost per call.

API additions (v3d_runner.h):
  - v3d_runner_acquire_buffer(): pulls from per-bucket freelist;
    falls through to v3d_runner_create_buffer() on miss.
  - v3d_runner_release_buffer(): pushes back onto the freelist; the
    backing VkBuffer/VkDeviceMemory only get vkFreeMemory'd in
    v3d_runner_destroy().
  - v3d_runner_pool_total_bytes(): diagnostic watermark.

Size classes 2^8..2^23 (256 B to 8 MiB).  Oversize requests fall
through to non-pooled (vkAllocateMemory) for both ends — pool stays
correct, just degenerates to old behaviour for those calls.

Migration: daedalus_core.c dispatch_*_qpu paths globally swap
create_buffer → acquire_buffer and destroy_buffer → release_buffer.
All five QPU dispatch functions (idct8 / lpf / mc_8h / cdef /
h264_deblock) now reuse buffers across calls.  test_api_idct stays
bit-exact (4096/4096 bytes on CPU/QPU/AUTO substrates on hertz).

Microbench (tests/bench_pool_overhead.c) on hertz (Pi 5,
6.18.29+rpt-rpi-2712, V3D 7.1):

  call 0:  434.89 us  (cold — 3x vkAllocateMemory)
  call 1:  100.06 us  (pool hit on all 3 buffers)
  steady-state:
    p50:    76.44 us
    p90:    90.52 us
    mean:   77.95 us
  first-call / steady-state ratio: 5.7x

The remaining ~76us steady-state is dominated by vkQueueWaitIdle +
shader execution + per-call descriptor-set update + command-buffer
allocation — addressed in follow-on tasks 161 (persistent cmdbuf)
and 162 (dmabuf import for dst, eliminates memcpy in/out).

Refs daedalus-fourier task #160.
2026-05-23 19:52:50 +02:00
claude-noether 8fdef27a7d Phase 8c: H.264 luma qpel mc20 through public API
Extends daedalus-fourier with daedalus_recipe_dispatch_h264_qpel_mc20
so libavcodec.so can route H264QpelContext.put_h264_qpel_pixels_tab[1][2]
through the recipe layer instead of ff_put_h264_qpel8_mc20_neon directly.

API additions (header + library):
  - daedalus_h264_qpel_meta { dst_off, src_off }
  - daedalus_dispatch_h264_qpel_mc20(ctx, sub, dst, src, stride,
                                     n_blocks, meta)
  - daedalus_recipe_dispatch_h264_qpel_mc20(...)  (AUTO wrapper)
  - DAEDALUS_KERNEL_H264_QPEL_MC20 = 9 in the recipe-query enum
  - daedalus_recipe_substrate_for() returns CPU NEON for cycle 9

The 6-tap horizontal half-pel filter signature matches FFmpeg's
H264QpelContext convention exactly: dst and src share a single stride
and src already points at output column 0 (filter reads cols -2..+3).
Single-stride API to make the marfrit-packages FFmpeg shim a
straight pointer-pass; no buffer rearrangement.

Verdict per docs/k9_h264qpel_mc20.md: CPU NEON.  Per-block 7.6 ns
gives 135x margin over 30 fps 1080p; QPU dispatch floor at ~250 ns
makes any V3D shader strictly worse.  Recipe table reflects that —
the recipe_dispatch entry is a one-line forward to the CPU path.

CMakeLists changes:
  - h264qpel_neon.S added to the daedalus_core static lib (only the
    bench targets owned it before; now the public API needs it too)
  - tests/h264_qpel8_mc20_ref.c added to the test_api_h264 target

Phase 8a/8b smoke gains a 4th case (test_qpel_mc20): 1024/1024
bytes bit-exact via daedalus_recipe_dispatch_h264_qpel_mc20.

Refs reauktion/daedalus-v4l2#11 — substitution arc step 2 cycle 9.
2026-05-23 03:25:24 +02:00
marfrit 0a99b16489 Phase 8b: opportunistic QPU paths through public API
Wires QPU dispatch for cycles 3 (VP9 MC), 5 (AV1 CDEF), 8 (H.264
deblock) through the public API. These three kernels have recipe
substrate = CPU, but per Issue 003 the mixed-kernel helper value
is real — the dispatch path must exist so override-mode callers
can request QPU on the side.

Pattern mirrors dispatch_idct8_qpu (lazy pipeline + per-call SSBO
alloc + memcpy + dispatch + readback). Each kernel has its own
push-constant struct (mc_pc 3-field, cdef_pc 3-field, deblock_pc
2-field shared with lpf).

Notable bug caught + fixed in test_api_opportunistic_qpu: the
initial dispatch_mc_8h_qpu sized src_max using CPU-side reach
(src_off + 3 + 8 + 7*stride), but the QPU shader reads src[
src_off + row*stride + 0..14] for row=0..7. Last block had 3
uninitialized bytes → 99.8% match → 100% after fix.

After this commit, the public API surface fully covers cycles 1-8:
  Cycle 1 (IDCT 8x8): CPU + QPU + AUTO bit-exact
  Cycle 2 (LPF wd=4): CPU + QPU + AUTO bit-exact
  Cycle 3 (MC 8h): CPU recipe; QPU override bit-exact
  Cycle 4 (LPF wd=8): CPU + QPU + AUTO bit-exact
  Cycle 5 (CDEF): CPU recipe; QPU override (untested in this
    test — bench_v3d_cdef is the authoritative 3-way M1)
  Cycle 6 (H.264 IDCT 4x4): CPU only (no QPU shader by recipe)
  Cycle 7 (H.264 IDCT 8x8): CPU only
  Cycle 8 (H.264 deblock luma-v): CPU recipe; QPU override bit-exact

Tests: test_api_opportunistic_qpu adds CPU-vs-QPU bit-exact
comparison for VP9 MC and H.264 deblock through the API.
test_api_idct, test_api_lpf, test_api_h264 still pass.

Per the locked Phase 8 architecture (project_phase8_architecture
memory): next session opens daedalus-v4l2 sibling repo with
Option B (kernel V4L2 shim + userspace daemon), Option γ (dlopen
FFmpeg parser).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 14:50:41 +00:00
marfrit af8146a2cd Phase 8a: H.264 kernels through public API
Extends include/daedalus.h with cycles 6, 7, 8 (H.264 IDCT 4x4,
IDCT 8x8, luma deblock luma-v). All recipe-substrate = CPU
(matches per-cycle Phase 7 verdicts).

src/daedalus_core.c: NEON-path implementations + recipe routing.
daedalus_core library now links the full FFmpeg H.264 NEON
snapshot (h264idct + h264dsp) plus existing VP9 + dav1d.

tests/test_api_h264.c: smoke test covering all 3 H.264 kernels
via daedalus_recipe_dispatch_*. All pass 2048/2048 bit-exact.

Public API coverage after this commit:
- Cycles 1 IDCT 8x8 + 2 LPF4 + 4 LPF8: CPU+QPU+AUTO dispatch
  (test_api_idct, test_api_lpf, both pass)
- Cycle 3 MC 8h: CPU only (QPU dispatch stub returns -1)
- Cycle 5 CDEF: CPU only (QPU stub)
- Cycle 6 H.264 IDCT 4x4: CPU only (recipe + only NEON wired)
- Cycle 7 H.264 IDCT 8x8: CPU only
- Cycle 8 H.264 deblock: CPU only (QPU opportunistic — not wired
  through API yet; bench_v3d_h264deblock exists for direct test)

Next Phase 8 sub-step: wire opportunistic QPU dispatch for cycles
3+5+8 through the API (so override-mode users can request QPU).
Then surface V4L2-wrapper architecture decisions to user.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 14:46:03 +00:00
marfrit eb5cfb34c4 Phase 8: wire LPF wd=4 + wd=8 QPU through public API
Mirror the IDCT pattern (lazy pipeline + per-call SSBO alloc +
dispatch + readback) for cycles 2 (LPF wd=4) and 4 (LPF wd=8).

Important caught-empirically bug: the two LPF shaders disagree
on push-constant slot order — wd=4 puts dst_stride_u8 at slot 1,
wd=8 puts it at slot 2 (with unused blocks_per_row at slot 1).
Initial single-struct attempt silently corrupted wd=8 output
(1958/2048 = 95.6 % bit-exact on test_api_lpf). Fixed by keeping
separate lpf4_pc and lpf8_pc struct definitions.

dst-window calc handles both kernels (same -4..+3 byte footprint
per row).

test_api_lpf exercises both kernels in CPU / QPU / AUTO modes
against the C reference. All 6 mode/kernel combinations pass
2048/2048 bit-exact (32 edges × 8 rows × 8 bytes/edge).

Phase 8 status after this commit: 3 of 5 kernels wired through
API for QPU dispatch (IDCT, LPF wd=4, LPF wd=8 — i.e., all 3
QPU-default kernels per recipe). Cycle 3 MC and cycle 5 CDEF
still need wiring for opportunistic-override mode but aren't
needed for recipe-AUTO path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 13:57:25 +00:00
marfrit 1085c5699c Phase 8: wire IDCT QPU dispatch through public API
daedalus_ctx now owns a v3d_runner when V3D is available. The
public API's dispatch_vp9_idct8 routes QPU calls through a
new dispatch_idct8_qpu helper that: (1) lazy-creates the cycle 1
v4 pipeline on first use, (2) allocates 3 host-visible SSBOs
per call (coeffs/dst/meta), (3) memcpy host->GPU, (4) dispatch
with the v4 32-blocks-per-WG geometry, (5) memcpy GPU->host.

Per-call alloc is intentional for Phase 8 correctness-first
scope; buffer-pool perf optimization is deferred.

Added daedalus_ctx_create_no_qpu() for fast-path callers that
know they want CPU only.

test_api_idct extended to a 3-mode matrix: CPU forced, QPU
forced, AUTO recipe. All three deliver 4096/4096 bit-exact
on hertz with V3D 7.1.7.0:

  recipe substrate for VP9_IDCT8: 2 (QPU)
  [CPU] 4096/4096 bit-exact
  [QPU] 4096/4096 bit-exact (real QPU dispatch through the API)
  [AUTO] 4096/4096 bit-exact (recipe routes to QPU)

Next Phase 8 sub-step: same wiring pattern for cycle 2 LPF wd=4
and cycle 4 LPF wd=8 (the other two recipe-QPU kernels).
Cycle 3 MC and cycle 5 CDEF only need the dispatch hook
(recipe routes to CPU; QPU stays opportunistic via explicit
override).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 13:55:55 +00:00
marfrit 760f6a4060 Phase 8 skeleton: public C API + first end-to-end smoke test
include/daedalus.h: stable C API surface exposing the 5 cycles
(VP9 IDCT 8x8, LPF wd=4, MC 8h, LPF wd=8; AV1 CDEF). Per-kernel
recipe-dispatch helpers default to the cycle 1-5 verdict
substrate (QPU for cycles 1+2+4, CPU for cycles 3+5); explicit
override available for benchmarking and runtime-aware scheduling.

src/daedalus_core.c: NEON-path implementation of all 5 kernels
wrapped behind the public API. QPU path stubbed out (returns -1)
since wiring v3d_runner into daedalus_ctx is the next Phase 8
sub-step; with has_qpu=0 the recipe falls back to CPU cleanly.

tests/test_api_idct.c: 64-block IDCT through the public recipe
dispatch, bit-exact vs C ref. PASS 4096/4096 bytes — proves the
API surface compiles, library links, dispatch routing works, and
NEON fallback delivers correct results.

docs/phase8_scoping.md: architecture options (A=userspace V4L2,
B=kernel V4L2 shim, C=direct libva); pick A for v1; explicitly
out-of-scope work tracked.

Next Phase 8 sub-step: wire v3d_runner into daedalus_ctx so
has_qpu=1 and QPU dispatch goes through the API too. After that:
V4L2 ioctl glue, bitstream parser, superblock loop.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 13:54:43 +00:00