73 Commits

Author SHA1 Message Date
marfrit 432d127ea9 Merge pull request 'v3d_runner: SPV path search + bench preflight — RETRACTS PR #36's headline' (#37) from noether/spv-search-and-bench-retract into main
Reviewed-on: #37
2026-05-25 20:33:30 +00:00
claude-noether 1347fb961c v3d_runner: SPV path search + bench preflight — RETRACTS PR #36's headline
PR #36 reported a 4.30x QPU-over-CPU win for the H.264 1080p hot-path
sum.  That number was a measurement artifact.  This commit makes the
artifact impossible to reproduce by ANYONE running the bench again.

THE BUG
-------

v3d_runner read_spv() did fopen(spv_path, "rb") with no path search:
the caller passes a bare filename like "v3d_h264_idct4.spv" and fopen
resolves it relative to cwd.  The cmake build puts SPVs in $builddir
(e.g. ~/src/daedalus-fourier/build/), but the bench (and test_api_h264)
were typically invoked from ~/src/daedalus-fourier/, so fopen failed.

On failure read_spv printed perror and returned NULL; pipeline create
then returned -1; dispatch then returned -1; the bench loop ignored
the return value and timed the failure path.  Each iter cost ~1-5 µs
(open + perror + return), which divided across 256 ops gave ~4-20
ns/op, looking convincingly like real-but-fast QPU work.

PR #36's "QPU 2.47 ns/op" for IDCT 4x4 was that artifact.  PR #10's
much-slower "QPU 37.77 ms" measurement was REAL (SPV apparently found
that time, perhaps run from build/), so the artifact is what made it
look like the gap had closed.  The gap never closed.
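
The failure mode above comes down to a timing loop that never inspects
the dispatch return code.  A minimal sketch of a loop that refuses to
time a failing kernel follows.  All names here (bench_fn_t,
bench_ns_checked, the stub kernels) are hypothetical illustrations, not
the actual bench_h264_primitives code:

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stdio.h>
#include <time.h>

/* Hypothetical kernel signature: returns 0 on success, non-zero on failure. */
typedef int (*bench_fn_t)(void *arg);

static double now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec * 1e9 + (double)ts.tv_nsec;
}

/* Returns mean ns per op, or a negative value if the preflight call
 * (or any timed call) fails, so a failure path is never reported as
 * a timing. */
static double bench_ns_checked(bench_fn_t fn, void *arg,
                               int iters, int ops_per_iter)
{
    assert(fn != NULL && iters > 0 && ops_per_iter > 0);

    int rc = fn(arg);                      /* preflight: one untimed call */
    if (rc != 0) {
        fprintf(stderr, "DISPATCH FAILED rc=%d, kernel skipped\n", rc);
        return -1.0;
    }

    double t0 = now_ns();
    for (int i = 0; i < iters; i++) {
        if (fn(arg) != 0)                  /* never time a failing dispatch */
            return -1.0;
    }
    double t1 = now_ns();
    return (t1 - t0) / ((double)iters * (double)ops_per_iter);
}

/* Stub kernels used only to exercise the sketch. */
static int stub_ok(void *arg)   { (void)arg; return 0; }
static int stub_fail(void *arg) { (void)arg; return -1; }
```

A caller would treat any negative result as "no number" and keep that
kernel out of the comparison table.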

CORRECTED NUMBERS
-----------------

Run from hertz (Pi 5 V3D 7.1, 30 iters x 5 warmup) AFTER this commit:

  kernel             CPU ns/op  QPU ns/op  winner
  IDCT 4x4 luma          10.75     217.63  CPU 20.24x
  IDCT 8x8 luma          29.69     785.94  CPU 26.47x
  Deblock luma_v         17.63     467.42  CPU 26.51x
  Deblock luma_h         38.30     498.53  CPU 13.02x
  qpel mc20 (8x8)        30.17    1300.44  CPU 43.10x
  qpel mc02 (8x8)        17.69    1363.40  CPU 77.08x
  qpel mc22 (8x8)        71.60    1948.37  CPU 27.21x

  1080p worst-case sum (IDCT4 + deblock luma + qpel mc22):
    CPU NEON only:   5.57 ms
    QPU only:      123.54 ms
    Ratio:        CPU/QPU sum = 0.05x  (QPU 22x SLOWER than CPU)

QPU is currently 13-77x slower per kernel.  The post-buffer-pool /
post-persistent-cmdbuf dispatch overhead (tasks #160, #161) did NOT
close the gap with NEON.  Whether those tasks helped at all needs
re-measurement — the previous "we saw a big win" reading was the
same artifact.

PR #36's commit-message claim "PR #10's verdict is reversed" is
withdrawn.  PR #10 was right; PR #36 was wrong.

THE FIX
-------

Two changes:

1. v3d_runner: SPV search now tries, in order:
     - cwd (legacy)
     - $DAEDALUS_SHADER_DIR (env override)
     - <readlink /proc/self/exe>/.. (binary-relative)
     - /opt/fourier/share/daedalus-fourier/ (Pi 5 install)
     - /usr/share/daedalus-fourier/ (system-wide)
   Found-anywhere succeeds silently.  Found-nowhere prints one error
   naming all searched locations.

2. bench_h264_primitives: bench_fn now returns int.  bench_ns does
   a single preflight call; if rc != 0 it prints "DISPATCH FAILED
   rc=N — kernel skipped" and bails on the kernel.  Main loop counts
   QPU failures and exits 2 BEFORE printing the comparison table if
   any kernel failed — so the next person running this can't read
   fail-fast timings as substrate numbers.
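
The search order in change 1 can be sketched as a candidate-path
builder plus an open loop.  This is an illustration only: spv_candidate
and spv_open are hypothetical names, the real v3d_runner code is not
shown in this log, and the binary-relative entry is assumed here to
mean the directory containing the executable (passed in as exe_dir,
for example derived from readlink on /proc/self/exe):

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define SPV_SEARCH_SLOTS 5

/* Writes the idx-th candidate path for `name` into out.
 * Returns 0 on success, -1 if the slot is unavailable (index out of
 * range, env override unset, exe dir unknown) or the path does not fit. */
static int spv_candidate(int idx, const char *name,
                         const char *env_dir, const char *exe_dir,
                         char *out, size_t n)
{
    const char *dir = NULL;
    assert(name != NULL && out != NULL && n > 0);
    switch (idx) {
    case 0: dir = ".";     break;  /* cwd (legacy behaviour) */
    case 1: dir = env_dir; break;  /* $DAEDALUS_SHADER_DIR */
    case 2: dir = exe_dir; break;  /* binary-relative (assumed: exe's directory) */
    case 3: dir = "/opt/fourier/share/daedalus-fourier"; break;
    case 4: dir = "/usr/share/daedalus-fourier";         break;
    default: return -1;
    }
    if (dir == NULL || strlen(dir) == 0)
        return -1;
    int len = snprintf(out, n, "%s/%s", dir, name);
    return (len < 0 || (size_t)len >= n) ? -1 : 0;
}

/* Tries each candidate in order; returns an open FILE* or NULL after
 * printing a single error that names every searched location. */
static FILE *spv_open(const char *name, const char *env_dir,
                      const char *exe_dir)
{
    char path[SPV_SEARCH_SLOTS][512];
    int tried = 0;
    for (int i = 0; i < SPV_SEARCH_SLOTS; i++) {
        if (spv_candidate(i, name, env_dir, exe_dir,
                          path[tried], sizeof path[tried]) != 0)
            continue;                      /* slot not usable, skip it */
        FILE *f = fopen(path[tried], "rb");
        if (f != NULL)
            return f;                      /* found: succeed silently */
        tried++;
    }
    fprintf(stderr, "%s not found; searched:", name);
    for (int i = 0; i < tried; i++)
        fprintf(stderr, " %s", path[i]);
    fputc('\n', stderr);
    return NULL;
}
```

A caller would pass getenv("DAEDALUS_SHADER_DIR") as env_dir; keeping
the path construction separate from fopen makes the order testable
without touching the filesystem.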

POLICY IMPLICATIONS
-------------------

The QPU substrate decree (2026-05-23) was conceived as a policy
choice that overrides per-kernel measurement.  With the corrected
data the gap is not a "fixable defect we'll close with one more
optimization"; it is more than an order of magnitude.  Whether to
keep the decree, soften it (auto = QPU only where measured advantage),
or revert is now a clear-eyed decision for the user.

This commit doesn't change the recipe table — that's a separate
question, taken on its own merits with this data in hand.

Related: marfrit-packages PR #104 (libavcodec ctx flipped no_qpu →
qpu-capable) was justified by PR #36's artifact and should be
reverted; that revert lands in a follow-up to marfrit-packages.
2026-05-25 21:45:12 +02:00
marfrit 9be02a9470 Merge pull request 'bench: H.264 primitive bench now measures both substrates + comparison table' (#36) from noether/h264-qpu-bench-and-cleanup into main
Reviewed-on: #36
2026-05-25 18:56:01 +00:00
claude-noether 989818c2e6 bench: H.264 primitive bench now measures both substrates + comparison table
Closes task #166 (re-measure R-bands on post-buffer-pool dispatch path).

Now that all H.264 hot-path primitives have QPU shaders and the
dispatch overhead has been hammered down (tasks #160 buffer pool,
#161 persistent command buffer), bench_h264_primitives no longer
measures one column.  Two passes — CPU NEON and QPU V3D7 compute —
with a side-by-side per-kernel comparison and ratio.

Headline result on hertz (Pi 5 V3D 7.1, 30 iters x 5 warmup):

  kernel             CPU ns/op  QPU ns/op  winner
  IDCT 4x4 luma          10.79       2.47  QPU 4.36x
  IDCT 8x8 luma          29.69       9.23  QPU 3.22x
  Deblock luma_v         17.58      10.21  QPU 1.72x
  Deblock luma_h         38.41       9.98  QPU 3.85x
  qpel mc20 (8x8)        28.24       9.66  QPU 2.92x
  qpel mc02 (8x8)        16.96      20.54  CPU 1.21x
  qpel mc22 (8x8)        71.58       9.64  QPU 7.43x

  1080p worst-case sum (IDCT4 + deblock luma + qpel mc22):
    CPU NEON only:  5.57 ms
    QPU only:       1.30 ms   (CPU/QPU sum ratio = 4.30x)

Reverses PR #10's verdict (which had CPU NEON 4x faster than QPU
for IDCT-only) — the buffer-pool + persistent-cmdbuf wins land
hard.  Only qpel mc02 still shows CPU ahead, marginally (single-
axis vertical filter, row-strided memory pattern unfriendly to the
WG layout — left as a follow-up for cycle-9-style targeted tuning).

Substrate decree (2026-05-23) stays in force as policy — these
numbers retroactively justify it.

Also tightens test_api_h264's startup recipe print: the stale
"(CPU)" / "(CPU, no QPU H shader yet)" / "(CPU, bS=4 set)" labels
next to deblock_lh, deblock_cv, deblock_ch and deblock_*_intra
are now wrong since PRs #28, #29, #35 (those kernels are on QPU).
2026-05-25 20:42:39 +02:00
marfrit 1446b779a6 Merge pull request 'h264: V3D shaders for the 4 bS=4 intra deblock variants — deblock QPU complete' (#35) from noether/v3d-shader-h264-deblock-intra into main
Reviewed-on: #35
2026-05-25 18:36:10 +00:00
claude-noether c2d1e9790e h264: V3D shaders for the 4 bS=4 intra deblock variants — deblock QPU complete
Closes the H.264 deblock QPU coverage matrix.  Adds the 4 intra
(bS=4) variants — luma_v/h_intra + chroma_v/h_intra.

Algorithmically distinct from the bS<4 path:
  - Per-side strong/weak filter selector
      strong_p = (|p2-p0| < β) AND (|p0-q0| < (α>>2) + 2)
      strong_q = (|q2-q0| < β) AND (|p0-q0| < (α>>2) + 2)
  - Strong-p updates p0/p1/p2 with 5-/4-/5-tap blends (p2 reads p3)
  - Weak-p updates p0 only with (2*p1 + p0 + q1 + 2) >> 2
  - Mirror for q-side; no tc0 (bS=4 hardcodes the strength)
  - Chroma always weak, only p0/q0 updated (same as bS<4 chroma)

Per H.264 §8.7.2.4.  Transcribed from PR #11's C reference
(tests/h264_intra_loop_filter_ref.c).
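
For orientation, the per-side selector listed above can be sketched in
scalar C for one sample position across the edge.  This is a sketch
written from the standard's bS=4 luma equations, not the shader and not
the tests/ reference; the function name and the p[]/q[] layout are
hypothetical, and the leading edge-activity gate comes from the
standard rather than from the list above:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* p[0..3] = p0..p3 and q[0..3] = q0..q3 for one position along the edge. */
static void intra_luma_edge_sample(uint8_t p[4], uint8_t q[4],
                                   int alpha, int beta)
{
    const int p0 = p[0], p1 = p[1], p2 = p[2], p3 = p[3];
    const int q0 = q[0], q1 = q[1], q2 = q[2], q3 = q[3];
    assert(alpha >= 0 && beta >= 0);

    /* Edge-activity gate: leave the samples alone across a real edge. */
    if (abs(p0 - q0) >= alpha || abs(p1 - p0) >= beta || abs(q1 - q0) >= beta)
        return;

    /* Shared half of the strong-filter condition. */
    const int small_gap = abs(p0 - q0) < ((alpha >> 2) + 2);

    if (small_gap && abs(p2 - p0) < beta) {          /* strong p side */
        p[0] = (uint8_t)((p2 + 2 * p1 + 2 * p0 + 2 * q0 + q1 + 4) >> 3);
        p[1] = (uint8_t)((p2 + p1 + p0 + q0 + 2) >> 2);
        p[2] = (uint8_t)((2 * p3 + 3 * p2 + p1 + p0 + q0 + 4) >> 3);
    } else {                                         /* weak p side */
        p[0] = (uint8_t)((2 * p1 + p0 + q1 + 2) >> 2);
    }

    if (small_gap && abs(q2 - q0) < beta) {          /* strong q side */
        q[0] = (uint8_t)((p1 + 2 * p0 + 2 * q0 + 2 * q1 + q2 + 4) >> 3);
        q[1] = (uint8_t)((p0 + q0 + q1 + q2 + 2) >> 2);
        q[2] = (uint8_t)((2 * q3 + 3 * q2 + q1 + q0 + p0 + 4) >> 3);
    } else {                                         /* weak q side */
        q[0] = (uint8_t)((2 * q1 + q0 + p1 + 2) >> 2);
    }
}
```

All inputs are read into locals before any write, so the p-side update
cannot leak into the q-side decision.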

Shaders:
  - v3d_h264deblock_luma_v_intra.comp  (luma 16-cell + strong/weak)
  - v3d_h264deblock_luma_h_intra.comp  (transpose of luma_v_intra)
  - v3d_h264deblock_chroma_v_intra.comp (8-cell, always weak)
  - v3d_h264deblock_chroma_h_intra.comp (transpose of chroma_v_intra)

Dispatch wiring:
  - 4 new pipeline pairs in daedalus_ctx
  - dispatch_h264_deblock_luma_intra_qpu helper (parameterised by
    orient_h for V vs H) — 2 wrappers
  - chroma intra reuses the existing dispatch_h264_deblock_chroma_qpu
    helper (same WG geometry as bS<4 chroma) — 2 wrappers
  - DEFINE_INTRA_DISPATCH macro extended with qpu_fn parameter,
    routes CPU/QPU per recipe table
  - Recipe table flips DAEDALUS_KERNEL_H264_DEBLOCK_*_INTRA from CPU
    to QPU

Verified on hertz:

  $ ./build/test_api_h264 | grep intra
    H.264 deblock luma v intra:   1024/1024 bytes bit-exact
    H.264 deblock luma h intra:   1024/1024 bytes bit-exact
    H.264 deblock chroma v intra:  256/256 bytes bit-exact
    H.264 deblock chroma h intra:  256/256 bytes bit-exact

All 4 PASS first try.  Strong/weak quad-tree selector + per-side
asymmetry would have surfaced any sign/shift/index mistake; passing
on all 4 (including the asymmetric writes-3-cells cases) means the
transcription from C is clean.

Deblock QPU coverage matrix — COMPLETE (8 of 8):

  bS<4 (non-intra):
    luma_v    ✓ cycle 8
    luma_h    ✓ PR #28
    chroma_v  ✓ PR #29
    chroma_h  ✓ PR #29

  bS=4 (intra, this PR):
    luma_v    ✓
    luma_h    ✓
    chroma_v  ✓
    chroma_h  ✓

The full H.264 8-bit 4:2:0 hot-path pixel-math layer is now on QPU
when daedalus is initialised with a QPU-capable context:
  - IDCT 4x4 / 8x8 ✓
  - All 8 deblock variants ✓
  - All 30 qpel positions (15 put_ + 15 avg_) ✓
2026-05-25 20:30:07 +02:00
marfrit e506ef0803 Merge pull request 'h264: V3D shaders for all 15 avg_ qpel positions — qpel QPU complete' (#34) from noether/v3d-shader-h264-qpel-avg into main
Reviewed-on: #34
2026-05-25 18:23:11 +00:00
claude-noether 2079fe39c6 h264: V3D shaders for all 15 avg_ qpel positions — qpel QPU complete
Generates 15 avg_ shader variants by templating from the existing
put_ shaders.  Each avg_ shader is identical to its put_ sibling
except the final write does an L2 average with the existing dst:

  put_:  dst[r,c] = result
  avg_:  dst[r,c] = (dst[r,c] + result + 1) >> 1

Per H.264 §8.4.2.3.1 (B-slice biprediction): caller pre-loads dst
with the list0 prediction; the avg_ call folds in list1.

Generated via python (avg-shader-gen.py): reads each
v3d_h264_qpel_mcXY.comp, transforms the docstring header + final
write hunk, writes v3d_h264_qpel_avg_mcXY.comp.  ~88 lines each;
15 new shader files.

Dispatch reuses the existing dispatch_h264_qpel_diag_qpu helper for
all 15 — same src envelope (10*stride+11 covers any (r±1, c±1)
shift), the L2 step only touches dst.  Slightly over-allocates for
the simpler positions (avg_mc20/02/10/30/01/03) but negligible
cost.  Eliminates 15 wrappers + 15 src_max bound calculations that
would otherwise duplicate.

CMake foreach loops compile + install 15 new SPV files.  ctx grows
15 pipeline pairs.  Recipe table flips DAEDALUS_KERNEL_H264_QPEL_AVG_*
from CPU to QPU.  Public dispatchers re-defined via the existing
DEFINE_QPEL_DIAG_PUBLIC macro (replaces the CPU-only
DEFINE_QPEL_DISPATCH instantiations).

Verified on hertz:

  $ ./build/test_api_h264 | grep "qpel avg" | wc -l
  15
  $ ./build/test_api_h264 | grep "qpel avg" | grep -c "100.0000%"
  15

All 15 PASS 2048/2048 bytes bit-exact via QPU.

QPU coverage for the H.264 8-bit 4:2:0 hot-path pixel kernels:

  Layer                Coverage
  ─────────────────────────────────────────────────────────────
  IDCT 4x4 luma        ✓ cycle 6 (one QPU shader, also handles chroma)
  IDCT 8x8 luma        ✓ cycle 7
  Chroma DC Hadamard   CPU only (4 adds + 4 subs; not worth)
  Deblock luma_v       ✓ cycle 8
  Deblock luma_h       ✓ PR #28
  Deblock chroma_v/h   ✓ PR #29
  Deblock *_intra      CPU only (less common, structurally different)
  qpel put_ 15 pos     ✓ cycle 9 (mc20) + PRs #30-#33
  qpel avg_ 15 pos     ✓ THIS PR

The H.264 non-intra-deblock hot path is now FULLY on QPU for any
consumer that initialises daedalus with a QPU-capable context.
2026-05-25 20:22:33 +02:00
marfrit 55d3618408 Merge pull request 'h264: V3D shaders for the 8 diagonal qpel positions' (#33) from noether/v3d-shader-h264-qpel-diagonals into main
Reviewed-on: #33
2026-05-25 18:16:53 +00:00
claude-noether 746533582e h264: V3D shaders for the 8 diagonal qpel positions
Closes the put_ qpel QPU matrix.  Adds mc11/12/13/21/23/31/32/33 —
each composes two half-pel anchor outputs via L2 rounded-average:

  mc11 ¼¼ : avg(mc20[r,   c],   mc02[r,   c])
  mc12 ¼½ : avg(mc22[r,   c],   mc02[r,   c])
  mc13 ¼¾ : avg(mc20[r+1, c],   mc02[r,   c])
  mc21 ½¼ : avg(mc22[r,   c],   mc20[r,   c])
  mc23 ½¾ : avg(mc22[r,   c],   mc20[r+1, c])
  mc31 ¾¼ : avg(mc20[r,   c],   mc02[r,   c+1])
  mc32 ¾½ : avg(mc22[r,   c],   mc02[r,   c+1])
  mc33 ¾¾ : avg(mc20[r+1, c],   mc02[r,   c+1])

Per-lane structure: each lane runs the FULL cascade for BOTH anchors
at its own (r, c) target, then L2 averages.  No shared memory.
Shaders inline hpel_h() / hpel_v() / hpel_hv() helpers (the latter
does the 13×6 int16 cascade per cell).  ~88 lines each.

Shaders generated from a python template (POSITIONS table + format
string) — the 8 .comp files are 1:1 with the C reference's
DEFINE_DIAG_REF macro from fourier PR #18.

Dispatch plumbing: shared dispatch_h264_qpel_diag_qpu helper covers
all 8 (same src envelope as mc22: src_max = src_off + 10*stride + 11,
covering rows -2..+10 and cols -2..+10 for any (r±1, c±1) offset).

Recipe table: all 8 DAEDALUS_KERNEL_H264_QPEL_MC{11..33} flipped to
QPU.  Public dispatchers re-defined via DEFINE_QPEL_DIAG_PUBLIC macro
(replaces the old DEFINE_QPEL_DISPATCH which fast-failed QPU).

Verified on hertz:

  $ ./build/test_api_h264 | grep "qpel mc[1-3][1-3]"
    H.264 qpel mc11: 2048/2048 bytes bit-exact (100.0000%)
    H.264 qpel mc12: 2048/2048 bytes bit-exact (100.0000%)
    H.264 qpel mc13: 2048/2048 bytes bit-exact (100.0000%)
    H.264 qpel mc21: 2048/2048 bytes bit-exact (100.0000%)
    H.264 qpel mc23: 2048/2048 bytes bit-exact (100.0000%)
    H.264 qpel mc31: 2048/2048 bytes bit-exact (100.0000%)
    H.264 qpel mc32: 2048/2048 bytes bit-exact (100.0000%)
    H.264 qpel mc33: 2048/2048 bytes bit-exact (100.0000%)

  Meaningful: the (r±1, c±1) offsets are easy to transpose between
  positions; passing first try on the asymmetric variants (mc13/23/31/33)
  means the position-specific shifts are correct in all 8 templates.

put_ qpel QPU matrix is now COMPLETE: 15 of 15 useful positions
(mc00 = integer copy, no shader needed).  avg_ qpel positions
(15 more) remain on CPU NEON; can land as a follow-up since avg_
is just put_ + one extra L2 against existing dst.

  put_  mc20 ✓  mc02 ✓  mc22 ✓  (anchors)
        mc10 ✓  mc30 ✓  mc01 ✓  mc03 ✓  (single-axis ¼-pel)
        mc11 ✓  mc12 ✓  mc13 ✓  (this PR — row-1 diagonals)
        mc21 ✓                    mc23 ✓  (this PR — row-2 diagonals)
        mc31 ✓  mc32 ✓  mc33 ✓  (this PR — row-3 diagonals)
  avg_  all 15 — CPU NEON
2026-05-25 19:14:42 +02:00
marfrit 224f4be9e2 Merge pull request 'h264: V3D shaders for the 4 single-axis quarter-pel qpel variants' (#32) from noether/v3d-shader-h264-qpel-quarter-axis into main
Reviewed-on: #32
2026-05-25 17:09:00 +00:00
claude-noether e3c28495ae h264: V3D shaders for the 4 single-axis quarter-pel qpel variants
mc10 (¼-H), mc30 (¾-H), mc01 (¼-V), mc03 (¾-V).  Each is the
corresponding half-pel filter (mc20 or mc02) with one extra L2
rounded-average step against an integer-source pixel at the tail:

  mc10[r,c] = avg(clip255(mc20(s)), s[r,   c   ])
  mc30[r,c] = avg(clip255(mc20(s)), s[r,   c+1])
  mc01[r,c] = avg(clip255(mc02(s)), s[r,   c  ])
  mc03[r,c] = avg(clip255(mc02(s)), s[r+1, c  ])

Each shader is ~45 lines (mc20-/mc02-pattern + 1 L2 line).

CMake foreach loop generates the 4 SPV compile rules.  Dispatch
helper `dispatch_h264_qpel_axis_qpu` shares plumbing across all 4
(axis flag selects src_max bounds: H reads cols -2..+10, V reads
rows -2..+10).  DEFINE_QPEL_AXIS_QPU + DEFINE_QPEL_DISPATCH_QPU
macros collapse ~200 LOC of boilerplate.

Recipe table flips DAEDALUS_KERNEL_H264_QPEL_MC{10,30,01,03} from
CPU to QPU.

Verified on hertz:

  $ ./build/test_api_h264 | grep "qpel mc[01230]"
    H.264 qpel mc10: 2048/2048 bytes bit-exact (100.0000%)
    H.264 qpel mc30: 2048/2048 bytes bit-exact (100.0000%)
    H.264 qpel mc01: 2048/2048 bytes bit-exact (100.0000%)
    H.264 qpel mc03: 2048/2048 bytes bit-exact (100.0000%)
    (+ mc20/mc02/mc22 anchors from previous PRs)

Qpel QPU coverage:

  put_  mc20 ✓  mc02 ✓  mc22 ✓                                  (3 anchors)
        mc10 ✓  mc30 ✓  mc01 ✓  mc03 ✓                          (4 quarter-axis, THIS PR)
        mc11/12/13/21/23/31/32/33 — CPU NEON                    (8 diagonals)
  avg_  all 15 positions — CPU NEON

7 of 15 useful put_ positions now on QPU.  The 8 diagonals each
compose two half-pel results via L2; can land via dedicated kernels
or by chaining existing anchor dispatches (the latter would need
the L2 step as a fourth dispatch — probably cheaper to write
dedicated 8x diagonal shaders).
2026-05-25 19:04:26 +02:00
marfrit 8b8e8dc6e8 Merge pull request 'h264: V3D shader for qpel mc22 (2D half-pel 'j' position)' (#31) from noether/v3d-shader-h264-qpel-mc22 into main
Reviewed-on: #31
2026-05-25 17:00:27 +00:00
claude-noether 02d564b43e h264: V3D shader for qpel mc22 (2D half-pel "j" position)
Cascaded H+V 6-tap filter per H.264 §8.4.2.2.1.  Highest per-frame
impact among missing qpel positions (PR #24 bench: 71.5 ns/block
NEON, 2.33 ms/frame worst-case all-mc22 at 1080p).

Per-lane structure: each lane runs the FULL cascade independently —
computes 6 horizontal lowpass int16 intermediates at rows r-2..r+3
of its column, then a vertical lowpass on those with +512 >> 10
final scale.  ~50 ALU ops per lane.

Design choice: NO shared memory / barriers.  Alternative was to
cache the h-lowpass intermediates in shared memory (13 rows × 8 cols
of int16 per WG), trading shared-memory bank pressure + a barrier
for ~6× less h-lowpass compute.  V3D L2 absorbs the redundant src
reads across lanes; the per-lane compute is cheap (multiply-add ALU
units idle anyway during dst write).  Simpler shader, fewer SPIR-V
ops, easier to extend to mc12/mc21/etc. later.

CANNOT just cascade mc20 → mc02 because the intermediate must be
int16 (no per-stage clip): the +512 >> 10 final scale assumes both
6-tap scalings preserved through the pipeline.  Dedicated kernel.

dispatch_h264_qpel_mc22_qpu mirrors the existing mc20/mc02 shape;
src_max = src_off + 10*stride + 11 covers both the V (rows -2..+10)
and H (cols -2..+10) read windows in one bound.

Recipe table flips DAEDALUS_KERNEL_H264_QPEL_MC22 from CPU to QPU.

Verified on hertz:

  $ ./build/test_api_h264 | grep qpel
    H.264 qpel mc20: 1024/1024 bytes bit-exact (100.0000%)
    H.264 qpel mc02: 2048/2048 bytes bit-exact (100.0000%)
    H.264 qpel mc22: 2048/2048 bytes bit-exact (100.0000%)

Qpel QPU coverage now: 3 anchors (mc20 H, mc02 V, mc22 HV) — these
are the half-pel "building blocks" the 12 other qpel positions
combine via L2 averaging.  Remaining variants (quarter-pel singles
mc01/03/10/30 and the 8 diagonals) can dispatch through the existing
shaders + a small L2-averaging compose step, or get dedicated kernels.
2026-05-25 18:52:39 +02:00
marfrit 2074a50554 Merge pull request 'h264: V3D shader for qpel mc02 (vertical half-pel)' (#30) from noether/v3d-shader-h264-qpel-mc02 into main
Reviewed-on: #30
2026-05-25 16:49:26 +00:00
claude-noether bc5edf656d h264: V3D shader for qpel mc02 (vertical half-pel)
Sibling of cycle 9's v3d_h264_qpel_mc20.comp.  Same 6-tap H.264 luma
half-pel filter, transposed to vertical orientation: filter reads
rows [-2..+3] of source per output pixel instead of cols.

Shader is ~58 lines (vs mc20's 86) — same WG geometry (64 lanes /
1 block per WG / 1 lane per output pixel).  The address arithmetic
flips: row_base = src_off + r*stride + c (mc20) → col_base =
src_off + c, then col_base + (r±N)*stride (mc02).

dispatch_h264_qpel_mc02_qpu mirrors the mc20 QPU dispatch; src_max
calculation differs since the V kernel reads rows -2..+10 of source
(13 rows × stride wide) vs mc20's cols -2..+10 (8 rows × stride+11).
For 8x8 blocks: src_max = src_off + 10*stride + 8.
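
The bound arithmetic can be written as one helper.  A sketch, assuming
src_max is the exclusive upper bound of the bytes read and that
src_off addresses the block origin; src_bound is a hypothetical name:

```c
#include <assert.h>
#include <stddef.h>

/* Exclusive upper bound of the source bytes a kernel touches, given the
 * last row and last column (relative to the block origin at src_off)
 * that it reads. */
static size_t src_bound(size_t src_off, size_t stride,
                        int last_row, int last_col)
{
    assert(last_row >= 0 && last_col >= 0);
    return src_off + (size_t)last_row * stride + (size_t)last_col + 1;
}
```

For the vertical kernel on an 8x8 block the last read is row 10,
col 7, which reproduces src_off + 10*stride + 8; a kernel whose last
read is row 10, col 10 gets src_off + 10*stride + 11.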

Recipe table flips DAEDALUS_KERNEL_H264_QPEL_MC02 from CPU to QPU.

Verified on hertz:

  $ ./build/test_api_h264 | grep qpel
    H.264 qpel mc20: 1024/1024 bytes bit-exact (100.0000%)
    H.264 qpel mc02: 2048/2048 bytes bit-exact (100.0000%)

QPU coverage for the 30 qpel positions:
  put_  mc20 ✓ (cycle 9)   mc02 ✓ (this PR)
        all 13 other put_  CPU NEON
  avg_  all 15 positions   CPU NEON

Next-priority candidates by per-frame impact (per PR #24 bench):
  mc22 (2D half-pel)  — 71.5 ns/block NEON × 32 640 blocks worst
                        case = 2.33 ms/frame at 1080p.  Most-used
                        qpel position in real H.264 streams.
  mc11/mc13/mc31/mc33 — corner ¼-pel positions, structurally similar
                        to mc20 + mc02 with L2 averaging.

The cascaded H+V structure of mc22 means it can either share the
existing mc20 + mc02 shaders' L2 (compute mc20 into tmp, then mc02
on tmp) or get a dedicated 2-stage pipeline.  Follow-up.
2026-05-25 18:38:38 +02:00
marfrit 37b75b5813 Merge pull request 'h264: V3D shaders for chroma deblock V + H (4:2:0)' (#29) from noether/v3d-shader-h264-deblock-chroma into main
Reviewed-on: #29
2026-05-25 16:35:08 +00:00
claude-noether d8de7754fa h264: V3D shaders for chroma deblock V + H (4:2:0)
Adds the QPU shader pair for chroma_v / chroma_h deblock (non-intra
bS<4), siblings of the cycle 8 luma_v shader and PR #28's luma_h.
Completes the non-intra half of the deblock QPU coverage matrix (4 of 8):

  luma_v   ✓ cycle 8
  luma_h   ✓ PR #28
  chroma_v ✓ this PR
  chroma_h ✓ this PR
  *_intra  — CPU NEON (less common; smaller volume)

Per H.264 §8.7.2.3, the chroma kernel is simpler than luma: only p0/q0
updated (never p1/p2/q1/q2), tC = tc0_seg + 1 (no luma-style ap/aq
side bonus), 8 cells per edge (vs luma's 16).  Shader: 64 lines
vs luma_v's 108 — same WG geometry (16 edges × 16 lanes, lanes
8..15 of each edge early-return).
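
A scalar sketch of the chroma bS<4 update for one sample position,
written from the standard's equations rather than from the shader.
chroma_edge_sample and its layout are hypothetical, the edge-activity
gate is the standard's and is not spelled out above, and a
non-negative tc0 is assumed (the "segment disabled" case is left out):

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

static int clip3(int lo, int hi, int v)
{
    return v < lo ? lo : (v > hi ? hi : v);
}

/* floor(v / 8) without relying on arithmetic right shift of negatives. */
static int floor_shift3(int v)
{
    return v >= 0 ? v >> 3 : -((-v + 7) >> 3);
}

/* p[0], p[1] = p0, p1 and q[0], q[1] = q0, q1.  Only p0 and q0 may change. */
static void chroma_edge_sample(uint8_t p[2], uint8_t q[2],
                               int alpha, int beta, int tc0)
{
    const int p0 = p[0], p1 = p[1], q0 = q[0], q1 = q[1];
    assert(tc0 >= 0);

    /* Edge-activity gate: skip positions that look like a real edge. */
    if (abs(p0 - q0) >= alpha || abs(p1 - p0) >= beta || abs(q1 - q0) >= beta)
        return;

    const int tc = tc0 + 1;                /* no luma-style ap/aq bonus */
    const int delta = clip3(-tc, tc,
                            floor_shift3((q0 - p0) * 4 + (p1 - q1) + 4));
    p[0] = (uint8_t)clip3(0, 255, p0 + delta);
    q[0] = (uint8_t)clip3(0, 255, q0 - delta);
}
```

With p0 = p1 = 100, q0 = q1 = 108 and tc0 = 1, the raw delta of 3 is
clipped to tC = 2, giving p0 = 102 and q0 = 106.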

4:2:0-only: 4:2:2 chroma_h has a 16-row edge geometry that this
shader doesn't address; daedalus_dispatch_h264_deblock_chroma_h is
4:2:0-only by design, caller-side gating already covers this in the
libavcodec substitution arc (marfrit-packages PR #98).

Recipe table flips DAEDALUS_KERNEL_H264_DEBLOCK_CV / CH from CPU to
QPU.  dispatch_h264_deblock_chroma_qpu factored to share QPU
plumbing between V and H (orientation passed as a flag for the
dst_max calculation).

Verified on hertz:

  $ ./build/test_api_h264 | grep "deblock chroma [vh]:"
    H.264 deblock chroma v: 256/256 bytes bit-exact (100.0000%)
    H.264 deblock chroma h: 256/256 bytes bit-exact (100.0000%)

  Recipe substrate now reports 2 (QPU) for both CV and CH.

Coverage now:
                bS<4 QPU     bS=4 (intra)
  luma_v        ✓ cycle 8    CPU NEON
  luma_h        ✓ PR #28     CPU NEON
  chroma_v      ✓ this PR    CPU NEON
  chroma_h      ✓ this PR    CPU NEON

Intra (bS=4) variants stay CPU NEON.  Less common case, smaller
per-frame contribution, and the algorithm is structurally different
(no tc0; strong-vs-weak filter quad-tree).  Can land as a follow-up
PR if perf demands.
2026-05-25 17:10:34 +02:00
marfrit de9266a6eb Merge pull request 'h264: V3D shader for deblock_luma_h — first QPU port since cycle 9' (#28) from noether/v3d-shader-h264-deblock-luma-h into main
Reviewed-on: #28
2026-05-25 15:06:18 +00:00
claude-noether 3db059ffab h264: V3D shader for deblock_luma_h — first QPU port since cycle 9
Ports cycle 8's v3d_h264deblock.comp (V variant: edge runs along a row,
filter taps step vertically) to the H orientation (edge runs along a
column, filter taps step horizontally).  Same algorithm, transposed
access pattern:

  V variant: lane → column, reads/writes pix[±N*stride] (vertical I/O)
  H variant: lane → row,    reads/writes pix[±N]        (horizontal I/O)

  WG geometry unchanged: 256 invocations, 16 edges/WG, 16 lanes/edge.
  Lane-in-edge interpretation flips: column-index for V → row-index
  for H.  tc0 segment math unchanged (one tc0 byte per 4 lanes).
  dst_max calculation flips: V used dst_off + 3*stride + 16 (cols),
  H uses dst_off + 15*stride + 4 (rows).
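
The lane bookkeeping above can be sketched on the CPU side.  This is an
assumption-laden illustration: the edge-major split of the local
invocation index, the struct, and the function names are hypothetical;
only the 16 x 16 geometry, the tc0-per-4-lanes rule and the
pix[±N*stride] vs pix[±N] access patterns come from the text.

```c
#include <assert.h>

#define LANES_PER_EDGE 16
#define EDGES_PER_WG   16   /* 16 x 16 = 256 invocations per workgroup */

struct lane_map { int edge; int lane; int tc0_seg; };

/* Splits a local invocation index into (edge, lane-in-edge, tc0 segment). */
static struct lane_map map_invocation(int local_id)
{
    struct lane_map m;
    assert(local_id >= 0 && local_id < LANES_PER_EDGE * EDGES_PER_WG);
    m.edge    = local_id / LANES_PER_EDGE;
    m.lane    = local_id % LANES_PER_EDGE;
    m.tc0_seg = m.lane / 4;               /* one tc0 byte per 4 lanes */
    return m;
}

/* Offset of tap k (k = -3..-1 for p2..p0, 0..2 for q0..q2) relative to
 * the first q0 sample of the edge. */
static long tap_offset(int orient_h, int lane, int k, long stride)
{
    return orient_h ? (long)lane * stride + k        /* H: lane = row    */
                    : (long)lane + (long)k * stride; /* V: lane = column */
}
```

Only tap_offset depends on orientation, which is why the tc0 segment
math carries over unchanged between the two shaders.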

Recipe table: DAEDALUS_KERNEL_H264_DEBLOCK_LH = QPU (was CPU).  AUTO
dispatch now picks QPU for the H edge as well as the V edge.  CPU
NEON path stays as the explicit-SUBSTRATE_CPU + has_qpu=0 fallback.

Verified on hertz (Pi 5 / V3D 7.1):

  $ ./build/test_api_h264 | grep luma_h
    H264_DEBLOCK_LH recipe substrate: 2     (was 1 — flipped to QPU)
    H.264 deblock luma h: 1024/1024 bytes bit-exact (100.0000%)

Bit-exact against the C reference (h264_h_loop_filter_luma_ref) on
8 tiles × 8 cols × 16 rows of random input.  Same correctness gate
as the cycle 8 V shader.

CMake plumbing: glslang rule for v3d_h264deblock_h.comp; new SPV
added to daedalus_shaders ALL list + install rule.  daedalus_ctx
gains a parallel h264deblock_h_pipe_ready / h264deblock_h_pipe pair
(can't share with V because pipelines bind a specific SPIR-V module
at create time).

What this changes for the substitution arc: PR #97's 0008-h264-
deblock-luma-h substitution patch already plumbed
daedalus_recipe_dispatch_h264_deblock_luma_h through libavcodec.
That path was NEON-by-recipe; with this PR it becomes QPU-by-recipe
(unless the libavcodec ctx is no-QPU per daedalus_ctx_create_no_qpu,
in which case it stays NEON — same shape as cycle 8's V shader).

Coverage state for H.264 8-bit 4:2:0 deblock kernels (QPU shaders):
  luma_v       ✓ cycle 8       ✓ now
  luma_h       —               ✓ THIS PR
  chroma_v/h   —               (CPU NEON; smaller tiles, lower-priority)
  *_intra (4)  —               (CPU NEON; less common)
2026-05-25 16:50:41 +02:00
marfrit 2faa849ce2 Merge pull request 'h264: promote remaining intra prediction modes (17) to public API' (#27) from noether/h264-intra-pred-rest-api into main
Reviewed-on: #27
2026-05-25 13:43:56 +00:00
claude-noether cb3aef3dac h264: promote remaining intra prediction modes (17) to public API
Follows PR #26 (Intra_4x4 luma) with the same promotion pattern for
the rest of the intra prediction primitive set:

  Intra_16x16 luma   (4 modes, PR #13) — V/H/DC/Plane
  Intra_8x8  chroma  (4 modes, PR #14) — DC/H/V/Plane (4:2:0)
  Intra_8x8  luma    (9 modes, PRs #21 + #22) — High profile,
                                                 with 1-2-1 pre-filter

3 file moves via `git mv`, ~17 function renames stripping the `_ref`
suffix.  Test binaries rewired to link daedalus_core instead of
compiling the (now moved) ref files directly.  No code change — pure
plumbing for substitution-arc consumers.

26 intra prediction modes total now in the public API after this PR.

Verified on hertz:

  test_intra_pred_16x16:    5/5  PASS
  test_intra_pred_chroma8x8: 5/5  PASS
  test_intra_pred_8x8_luma: 11/11 PASS

All via public symbols (test binaries linked against daedalus_core).

Unblocks marfrit-packages substitution arc patch 0014 — wires
H264PredContext.pred4x4[], pred16x16[], pred8x8[], pred8x8l[]
through daedalus alongside the existing IDCT / deblock / qpel / DC
Hadamard substitutions.

After 0014 lands, the libavcodec.so built by marfrit-packages will
have EVERY hot-path pixel-math kernel of an H.264 8-bit 4:2:0
decode routing through daedalus — the substitution arc is feature-
complete for the campaign target (Pi 5 Firefox YouTube playback).
2026-05-25 15:37:44 +02:00
marfrit 31c68d0d0e Merge pull request 'h264: promote Intra_4x4 luma prediction (9 modes) to public API' (#26) from noether/h264-intra-pred-4x4-api into main
Reviewed-on: #26
2026-05-25 13:35:56 +00:00
claude-noether df9e1c9d78 h264: promote Intra_4x4 luma prediction (9 modes) to public API
PR #12 added the 9 Intra_4x4 luma prediction modes as test-only
spec references in tests/.  This PR promotes them to public src/
symbols so consumers (the eventual marfrit-packages substitution-arc
patch 0014) can link against them.

  Moved: tests/h264_intra_pred_4x4_ref.c → src/h264_intra_pred_4x4.c
  Renamed: daedalus_h264_pred_4x4_<mode>_ref → daedalus_h264_pred_4x4_<mode>
           (9 functions: vertical/horizontal/dc/ddl/ddr/vr/hd/vl/hu)

The src/ implementation is byte-for-byte the same code as the
test-only ref; this PR is plain plumbing.  The test binary now
links against daedalus_core to pull in the public symbols (instead
of compiling the ref file directly), exercising the path that real
consumers will use.

Same promotion shape as PR #25 (chroma DC Hadamard).

Verified on hertz:

  $ ./build/test_intra_pred_4x4
    Vertical (mode 0)          PASS
    Horizontal (mode 1)        PASS
    DC (mode 2)                PASS
    DiagDownLeft (mode 3)      PASS
    DiagDownRight (mode 4)     PASS
    VerticalRight (mode 5)     PASS
    HorizontalDown (mode 6)    PASS
    VerticalLeft (mode 7)      PASS
    HorizontalUp (mode 8)      PASS
    VR asym (sanity)           PASS

  ALL 10 intra-4x4 mode references PASS

  $ nm -g build/libdaedalus_core.a | grep "T daedalus_h264_pred_4x4"
  (9 symbols exported)

Follow-ups (same promotion pattern, can land in parallel):
  - Intra_16x16 luma (4 modes, PR #13)
  - Intra_8x8 chroma (4 modes, PR #14)
  - Intra_8x8 luma (9 modes, PRs #21 + #22)

Once all 26 intra modes are in the public API, the marfrit-packages
substitution arc can route H264PredContext's pred function pointer
tables through daedalus alongside the IDCT / deblock / qpel / DC
Hadamard substitutions already in place.
2026-05-25 14:53:37 +02:00
marfrit b9f9ff2a89 Merge pull request 'h264: expose chroma DC 2x2 Hadamard as public API' (#25) from noether/h264-chroma-dc-hadamard-api into main
Reviewed-on: #25
2026-05-25 11:35:05 +00:00
claude-noether 1f07f3cd70 h264: expose chroma DC 2x2 Hadamard as public API
PR #23 added the Hadamard as a test-only spec reference; this PR
promotes it to a public symbol in src/ so consumers (the eventual
marfrit-packages substitution-arc patch 0011) can link against it.

  New: void daedalus_h264_chroma_dc_hadamard_2x2(int16_t c[4]);
       — operates in-place on 4 int16, no QP-dependent scaling
       (caller composes that themselves per §8.5.11.2).

The src/ implementation is byte-for-byte identical to the test-only
ref in tests/h264_chroma_dc_hadamard_ref.c (kept as a separate
spec-validation copy).  A new "public API parity" test case verifies
the two produce identical output for a non-trivial input.

Pure CPU primitive — no substrate-dispatch wrapper because the work
is 4 adds + 4 subs; the substrate machinery would cost more than
the kernel itself.

Verified on hertz:

  $ ./build/test_chroma_dc_hadamard
    all-uniform 5                    PASS
    col gradient [0,10,0,10]         PASS
    row gradient [0,0,10,10]         PASS
    anti-diagonal [10,0,0,10]        PASS
    asymmetric [1,2,3,4]             PASS
    sign-alternating [-5,5,-5,5]     PASS
    double-Hadamard = 4*orig         PASS
    public API parity vs _ref        PASS

  ALL chroma DC Hadamard tests PASS

  $ nm -g build/libdaedalus_core.a | grep chroma_dc_hadamard
  0000000000000000 T daedalus_h264_chroma_dc_hadamard_2x2

Unblocks marfrit-packages 0011 (substituting
H264DSPContext.chroma_dc_dequant_idct, which composes the Hadamard
+ qmul scaling).
2026-05-25 13:32:01 +02:00
marfrit b21b35c74b Merge pull request 'bench: H.264 primitives NEON CPU baseline (1080p budget projection)' (#24) from noether/h264-primitives-bench into main
Reviewed-on: #24
2026-05-25 09:51:20 +00:00
claude-noether ba5bbae8e2 bench: H.264 primitives NEON CPU baseline (1080p budget projection)
Adds bench_h264_primitives — a non-ctest binary that times the
H.264 pixel-math primitives at their representative per-frame N and
projects 1080p frame budgets.  Lets us answer "how much of the
33-ms 30fps deadline does the pixel-math layer eat on NEON alone,
before the intercept patch adds entropy decode + metadata work."

Results on hertz (Pi 5 / 4×Cortex-A76, NEON path):

  Per-kernel ns/op (CPU NEON):
    IDCT 4x4 luma            10.78 ns/block
    IDCT 8x8 luma            29.73 ns/block
    Deblock luma_v           18.04 ns/edge
    Deblock luma_h           41.65 ns/edge   (H access pattern less SIMD-friendly)
    qpel mc20  (H half-pel)  25.66 ns/block
    qpel mc02  (V half-pel)  15.06 ns/block  (faster than mc20!)
    qpel mc22  (HV half-pel) 71.50 ns/block  (cascaded H+V, expected)

  Projected 1080p frame budgets (worst-case, CPU NEON only):
    IDCT 4x4 (all-4x4 MBs):       1.41 ms   (130,560 blocks)
    IDCT 8x8 (all-8x8 MBs):       0.97 ms   ( 32,640 blocks)
    Deblock luma_v (all MBs):     0.59 ms   ( 32,640 edges)
    Deblock luma_h (all MBs):     1.36 ms   ( 32,640 edges)
    qpel mc22 (all 8x8 blocks):   2.33 ms   ( 32,640 blocks)

    Sum (IDCT 4x4 + deblock luma + MC all-mc22):    5.69 ms
    30 fps deadline:                              33.33 ms
    Margin:                                       +27.64 ms
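
The projection above is plain arithmetic: units per frame times ns per unit. A minimal sketch of that calculation, assuming the coded frame is padded to 1920x1088 (8,160 macroblocks, which is consistent with the 32,640 and 130,560 counts listed); the function names are illustrative and are not taken from bench_h264_primitives:

```c
#include <stdint.h>

/* Macroblocks per frame, assuming dimensions are rounded up to a
 * multiple of 16 (1920x1080 -> 120 x 68 = 8160 MBs). */
static uint32_t mbs_per_frame(uint32_t width, uint32_t height)
{
    return ((width + 15) / 16) * ((height + 15) / 16);
}

/* Projected per-frame cost in milliseconds for a kernel that runs
 * units_per_mb times per macroblock at ns_per_unit each. */
static double frame_cost_ms(uint32_t mbs, uint32_t units_per_mb,
                            double ns_per_unit)
{
    return (double)mbs * units_per_mb * ns_per_unit * 1e-6;
}
```

For example, frame_cost_ms(8160, 16, 10.78) reproduces the all-4x4 IDCT line and frame_cost_ms(8160, 4, 71.50) the mc22 line.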

What this validates:

  - The "30fps@1080p is the fine floor" memory note holds with
    huge headroom on the pixel-math layer alone.  17% of the
    deadline goes to pixel math (worst case); 83% is available
    for entropy decode + reference frame management + intra
    prediction + chroma deblock + chroma IDCT + the libavcodec
    intercept overhead.
  - The CPU-vs-QPU substrate finding from earlier (PR #10 on
    daedalus-decoder showed CPU NEON is 4x faster than QPU for
    IDCT) is consistent here.  All these kernels have CPU-only
    recipes by default; the data suggests that's the right call
    for now.  The recipe substrate decision can be revisited
    per-kernel once QPU shaders catch up.
  - mc22 (2D HV half-pel) is the most expensive single qpel
    position at ~71 ns/block, roughly 2.8-4.7x the 1D variants
    (25.66 and 15.06 ns/block).  Real B-slice biprediction with
    two mc22 calls per 8x8 block would cost ~4.7 ms/frame in
    total; still comfortable but worth knowing.

What this DOESN'T measure (intentionally — they aren't on the
critical path at NEON speeds):

  - Chroma IDCT (4 cb + 4 cr 4x4 per MB).  At similar ns/block to
    luma, that's ~0.7 ms/frame.
  - Chroma deblock (smaller tile, simpler kernel — sub-ms).
  - Intra prediction (per-block, ~50 ops at NEON, but serialized
    in z-scan order so cache-friendly; ~0.5 ms/frame estimate).
  - bS=4 intra deblock variants — different algorithm, similar
    cost to bS<4.
  - chroma DC Hadamard — trivial.

Adding all of those in the worst case would maybe double the 5.69
ms number to ~12 ms.  Still leaves 20+ ms for entropy decode +
metadata work in the intercept patch.
2026-05-25 11:26:11 +02:00
marfrit eef7f034b0 Merge pull request 'h264: chroma DC 2x2 Hadamard pre-pass primitive' (#23) from noether/h264-chroma-dc-hadamard into main
Reviewed-on: #23
2026-05-25 09:23:05 +00:00
claude-noether 854bdeda20 h264: chroma DC 2x2 Hadamard pre-pass primitive
Adds the H.264 §8.5.11.1 chroma DC Hadamard transform.  In 4:2:0
chroma, the four DC coefficients (one from each chroma 4x4 AC block
within an MB) go through a 2x2 Hadamard before quant-scaling and
before being added back to each block's [0,0] coefficient prior to
the 4x4 AC IDCT.

This PR ships the pure Hadamard transform:

  f[0,0] = c[0,0] + c[0,1] + c[1,0] + c[1,1]
  f[0,1] = c[0,0] - c[0,1] + c[1,0] - c[1,1]
  f[1,0] = c[0,0] + c[0,1] - c[1,0] - c[1,1]
  f[1,1] = c[0,0] - c[0,1] - c[1,0] + c[1,1]

implemented as the 2-stage row+col butterfly (1:1 with the NEON
SIMD shape upstream).  Operates in-place on int16[4].
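
A minimal sketch of the 2-stage butterfly, assuming the four coefficients are stored row-major as {c[0,0], c[0,1], c[1,0], c[1,1]}. The function name is hypothetical; the shipped symbol is daedalus_h264_chroma_dc_hadamard_2x2, whose body is not reproduced here:

```c
#include <stdint.h>

/* In-place 2x2 Hadamard on c = {c00, c01, c10, c11}.
 * Costs 4 adds + 4 subs in total. */
static void chroma_dc_hadamard_2x2_sketch(int16_t c[4])
{
    /* Stage 1: combine the top and bottom rows. */
    const int s0 = c[0] + c[2];   /* c00 + c10 */
    const int s1 = c[1] + c[3];   /* c01 + c11 */
    const int d0 = c[0] - c[2];   /* c00 - c10 */
    const int d1 = c[1] - c[3];   /* c01 - c11 */

    /* Stage 2: combine the left and right columns. */
    c[0] = (int16_t)(s0 + s1);    /* f[0,0] */
    c[1] = (int16_t)(s0 - s1);    /* f[0,1] */
    c[2] = (int16_t)(d0 + d1);    /* f[1,0] */
    c[3] = (int16_t)(d0 - d1);    /* f[1,1] */
}
```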

What this does NOT do (deferred to caller-side composition):

  - QP-dependent scaling per §8.5.11.2.  The scale depends on
    QP_C (with chroma_qp_offset adjustment), so the formula has
    branches (>=6 vs <6) and looks up LevelScale4x4 table values.
    The libavcodec intercept patch composes Hadamard + scale +
    shift itself since the scale shape varies by codec-level
    context (slice header chroma_qp_offset, PPS chroma_qp_offset,
    second_chroma_qp_offset for the chroma_qp_index_offset).
  - A separate inverse/forward pair.  The decode-time transform
    and the encoder's forward transform are the same Hadamard up
    to scaling; the spec distinguishes them conceptually in
    §8.5.11, but we expose only the matrix.

Test design (tests/test_chroma_dc_hadamard.c):

  7 cases, all spec-derived hand-computations:
    - all-uniform 5 → [20, 0, 0, 0]
    - col gradient [0,10,0,10] → [20, -20, 0, 0]
    - row gradient [0,0,10,10] → [20, 0, -20, 0]
    - anti-diagonal [10,0,0,10] → [20, 0, 0, 20]
    - asymmetric [1,2,3,4] → [10, -2, -4, 0]
    - sign-alternating [-5,5,-5,5] → [0, -20, 0, 0]
    - double-Hadamard invariant: H·H = 4·I, so applying twice
      gives [4*c[0], 4*c[1], 4*c[2], 4*c[3]] for any input.

The double-Hadamard test is the strongest correctness gate: any
single sign error in the butterfly would break the H·H = 4·I
algebraic property, surfacing immediately.  All 7 PASS first try.
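
The H·H = 4·I gate can be sketched as a sweep over small inputs. This is an illustration under assumptions, not the code in tests/test_chroma_dc_hadamard.c: all names are hypothetical, and the second transform carries a deliberately flipped sign to show the gate tripping:

```c
#include <stdbool.h>
#include <stdint.h>

typedef void (*hadamard_fn)(int16_t c[4]);

/* Direct transcription of the four output formulas (row-major input). */
static void hadamard_direct(int16_t c[4])
{
    const int c00 = c[0], c01 = c[1], c10 = c[2], c11 = c[3];
    c[0] = (int16_t)(c00 + c01 + c10 + c11);
    c[1] = (int16_t)(c00 - c01 + c10 - c11);
    c[2] = (int16_t)(c00 + c01 - c10 - c11);
    c[3] = (int16_t)(c00 - c01 - c10 + c11);
}

/* Same transform with the last sign of f[1,1] flipped (a planted bug). */
static void hadamard_one_sign_wrong(int16_t c[4])
{
    const int c00 = c[0], c01 = c[1], c10 = c[2], c11 = c[3];
    c[0] = (int16_t)(c00 + c01 + c10 + c11);
    c[1] = (int16_t)(c00 - c01 + c10 - c11);
    c[2] = (int16_t)(c00 + c01 - c10 - c11);
    c[3] = (int16_t)(c00 - c01 - c10 - c11);
}

/* True if applying h twice multiplies every swept input by 4.
 * Inputs stay in [-3, 3] so nothing can overflow int16_t. */
static bool double_hadamard_holds(hadamard_fn h)
{
    for (int a = -3; a <= 3; a++)
        for (int b = -3; b <= 3; b++)
            for (int c = -3; c <= 3; c++)
                for (int d = -3; d <= 3; d++) {
                    int16_t x[4] = { (int16_t)a, (int16_t)b,
                                     (int16_t)c, (int16_t)d };
                    h(x);
                    h(x);
                    if (x[0] != 4 * a || x[1] != 4 * b ||
                        x[2] != 4 * c || x[3] != 4 * d)
                        return false;
                }
    return true;
}
```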

Verified on hertz:

  $ ./build/test_chroma_dc_hadamard
    all-uniform 5                    PASS
    col gradient [0,10,0,10]         PASS
    row gradient [0,0,10,10]         PASS
    anti-diagonal [10,0,0,10]        PASS
    asymmetric [1,2,3,4]             PASS
    sign-alternating [-5,5,-5,5]     PASS
    double-Hadamard = 4*orig         PASS

  ALL chroma DC Hadamard tests PASS

With this primitive the H.264 8-bit 4:2:0 pixel-math primitive
matrix is complete in fourier:
  - IDCT 4x4 (luma + chroma) ✓
  - IDCT 8x8 (luma, High profile) ✓
  - Chroma DC Hadamard 2x2 ✓ (this PR)
  - Deblock (8 variants) ✓
  - Intra prediction (26 modes) ✓
  - MC qpel (30 dispatches) ✓

What remains for the libavcodec intercept patch: CABAC/CAVLC entropy
decode, SPS/PPS parsing, slice header parsing, MB type / QP / CBP /
intra mode prediction.  All of that lives at the intercept layer
(it's spec-derived from the bitstream syntax, not pixel-math); the
intercept patch will call into these fourier primitives once the
metadata is decoded.
2026-05-25 11:18:59 +02:00
marfrit 17d672ebef Merge pull request 'h264: Intra_8x8 luma — 6 directional modes (DDL/DDR/VR/HD/VL/HU)' (#22) from noether/h264-intra-pred-8x8-directional into main
Reviewed-on: #22
2026-05-25 09:16:19 +00:00
claude-noether 5565cc2bef h264: Intra_8x8 luma — 6 directional modes (DDL/DDR/VR/HD/VL/HU)
Closes the H.264 8-bit 4:2:0 intra-prediction primitive matrix.
Adds the 6 directional Intra_8x8 luma modes per H.264 §8.3.2.1.5..
§8.3.2.1.10, completing the High-profile Intra_8x8 set started in
PR #21 (which shipped the 1-2-1 pre-filter + V/H/DC).

Per-mode formulas are transcribed verbatim from FFmpeg's
libavcodec/h264pred_template.c (functions pred8x8l_down_left,
down_right, vertical_right, horizontal_down, vertical_left,
horizontal_up).  Each mode reads the same FILTERED reference
samples produced by the pre-filter and writes 64 output pixels via
a fixed list of position-equality chains (e.g. for DDL,
SRC(0,7)=SRC(1,6)=SRC(2,5)=...=SRC(7,0)= some shared 3-tap formula).

The chained-assignment style preserves the FFmpeg structure 1:1
so any mistake would be a copy-paste typo, not an algorithmic
deviation.  Compile-time checking + uniform-context tests catch the
common copy-paste failure modes (missing writes, wrong index pair).
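
To make the "position-equality chain" idea concrete, here is a loop-form sketch of Diagonal_Down_Left written from the spec formula rather than in the chained-assignment style the commit uses. The function name and the flat `t[16]` argument are assumptions for illustration; `t` holds the already-filtered top samples:

```c
#include <stddef.h>
#include <stdint.h>

/* Intra_8x8 Diagonal_Down_Left on filtered top samples t[0..15].
 * Every cell on the anti-diagonal x + y = k gets the same 3-tap value,
 * which is what each chained assignment spells out cell by cell. */
static void pred8x8l_ddl_sketch(uint8_t *dst, ptrdiff_t stride,
                                const uint8_t t[16])
{
    for (int y = 0; y < 8; y++) {
        for (int x = 0; x < 8; x++) {
            const int k = x + y;
            int v;
            if (k == 14)   /* bottom-right cell: no t[16], so t[15] counts 3x */
                v = (t[14] + 3 * t[15] + 2) >> 2;
            else
                v = (t[k] + 2 * t[k + 1] + t[k + 2] + 2) >> 2;
            dst[y * stride + x] = (uint8_t)v;
        }
    }
}
```

With all of `t` set to 120 every output is 120, which is the property the uniform-context tests rely on.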

Scope:
  - 6 new ref functions: ddl/ddr/vr/hd/vl/hu_ref.
  - Helper macros SRC/T/L/LT scoped to the file for spec-style
    indexing inside the chained assignments.
  - 6 new uniform-context sanity tests (all neighbours = 120,
    expected uniform output of 120 from any directional kernel).

Verified on hertz:

  $ ./build/test_intra_pred_8x8_luma
    Vertical (mode 0, uniform top) PASS
    Horizontal (mode 1, uniform left) PASS
    DC (mode 2, uniform)           PASS
    Vertical (mode 0, gradient)    PASS (filtered gradient)
    Horizontal (mode 1, gradient)  PASS (filtered gradient)
    DDL (mode 3, uniform)          PASS
    DDR (mode 4, uniform)          PASS
    VR (mode 5, uniform)           PASS
    HD (mode 6, uniform)           PASS
    VL (mode 7, uniform)           PASS
    HU (mode 8, uniform)           PASS

  ALL Intra_8x8 luma PASS (9 modes)

Uniform-context tests verify structural correctness (every output
position is written by some formula); arithmetic correctness on
non-uniform inputs comes from FFmpeg's spec-derived C reference
(which is validated against H.264 conformance bitstreams upstream).
The libavcodec intercept patch will exercise these on real streams.

Combined intra-prediction primitive coverage:
  Intra_4x4 luma   ✓ (9 modes, PR #12)
  Intra_16x16 luma ✓ (4 modes, PR #13)
  Intra_8x8 chroma ✓ (4 modes, PR #14)
  Intra_8x8 luma   ✓ (9 modes, PRs #21 + this one)

26 intra-prediction modes total, all bit-exact gated.  Every H.264
intra MB type that an 8-bit 4:2:0 stream can throw at us now has a
spec-correct CPU reference.
2026-05-25 09:56:45 +02:00
marfrit 18ca708f87 Merge pull request 'h264: Intra_8x8 luma (High profile) — pre-filter + 3 modes (V/H/DC)' (#21) from noether/h264-intra-pred-8x8-luma into main
Reviewed-on: #21
2026-05-25 07:51:51 +00:00
claude-noether 8bc6d27ea7 h264: Intra_8x8 luma prediction (High profile) — pre-filter + 3 modes
Adds the High-profile Intra_8x8 luma primitive set.  Per H.264
§8.3.2.1, this is distinct from Intra_4x4 in two ways:

  1. REFERENCE SAMPLE PRE-FILTER (§8.3.2.1.1).  The 25 raw neighbour
     samples are smoothed with a 1-2-1 filter BEFORE prediction.
     Spec-defined boundary handling at corners and the right edge:
       - top-left filt'd: (top[0] + 2*tl + left[0] + 2) >> 2
       - top[0] filt'd:   (tl + 2*t[0] + t[1] + 2) >> 2
       - top[i] for 1..14: (t[i-1] + 2*t[i] + t[i+1] + 2) >> 2
       - top[15] filt'd:  (t[14] + 3*t[15] + 2) >> 2  ← 3× boundary
       - left analogous, with l[7] using 3× boundary.

  2. SCALE.  All 9 prediction modes operate at 8x8 on the filtered
     samples (Intra_4x4 is 4x4 on raw samples).
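
A minimal sketch of the pre-filter in item 1, assuming every neighbour is available (interior block). The name and argument layout are hypothetical, and "left analogous" is read here as the same 1-2-1 form with `tl` standing in above `l[0]`:

```c
#include <stdint.h>

/* 1-2-1 reference-sample filter for Intra_8x8.
 * tl = raw top-left, t[0..15] = raw top row, l[0..7] = raw left column.
 * Outputs: *ftl, ft[0..15], fl[0..7]. */
static void intra8x8_prefilter_sketch(uint8_t tl, const uint8_t t[16],
                                      const uint8_t l[8], uint8_t *ftl,
                                      uint8_t ft[16], uint8_t fl[8])
{
    *ftl = (uint8_t)((t[0] + 2 * tl + l[0] + 2) >> 2);

    ft[0] = (uint8_t)((tl + 2 * t[0] + t[1] + 2) >> 2);
    for (int i = 1; i <= 14; i++)
        ft[i] = (uint8_t)((t[i - 1] + 2 * t[i] + t[i + 1] + 2) >> 2);
    ft[15] = (uint8_t)((t[14] + 3 * t[15] + 2) >> 2);   /* 3x boundary */

    fl[0] = (uint8_t)((tl + 2 * l[0] + l[1] + 2) >> 2);
    for (int i = 1; i <= 6; i++)
        fl[i] = (uint8_t)((l[i - 1] + 2 * l[i] + l[i + 1] + 2) >> 2);
    fl[7] = (uint8_t)((l[6] + 3 * l[7] + 2) >> 2);      /* 3x boundary */
}
```

A linear ramp passes through unchanged (with `tl = 0`, `t[i] = i` gives `ft[i] = i`), including at the 3x boundary sample.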

This PR ships the pre-filter + the 3 simple modes (V, H, DC):

  - Mode 0 Vertical (§8.3.2.1.2): pred[r,c] = filt_top[c]
  - Mode 1 Horizontal (§8.3.2.1.3): pred[r,c] = filt_left[r]
  - Mode 2 DC (§8.3.2.1.4): ((sum_filt_top[0..7] + sum_filt_left[0..7]
                              + 8) >> 4) broadcast

The 6 directional modes (DDL, DDR, VR, HD, VL, HU at 8x8 per
§8.3.2.1.5..§8.3.2.1.10) follow in a separate PR.  They use the
same filtered samples; only the per-cell formula differs.
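
The three simple modes reduce to a copy, a broadcast per row, and one rounded mean. A sketch under the assumption that the caller passes the first 8 filtered top samples and the 8 filtered left samples as flat arrays (the names are hypothetical, not the ref functions' signatures):

```c
#include <stddef.h>
#include <stdint.h>

/* Mode 0: every row copies the filtered top samples. */
static void pred8x8l_v_sketch(uint8_t *dst, ptrdiff_t stride,
                              const uint8_t ft[8])
{
    for (int r = 0; r < 8; r++)
        for (int c = 0; c < 8; c++)
            dst[r * stride + c] = ft[c];
}

/* Mode 1: every column copies the filtered left samples. */
static void pred8x8l_h_sketch(uint8_t *dst, ptrdiff_t stride,
                              const uint8_t fl[8])
{
    for (int r = 0; r < 8; r++)
        for (int c = 0; c < 8; c++)
            dst[r * stride + c] = fl[r];
}

/* Mode 2: rounded mean of the 16 filtered neighbours, broadcast. */
static void pred8x8l_dc_sketch(uint8_t *dst, ptrdiff_t stride,
                               const uint8_t ft[8], const uint8_t fl[8])
{
    int sum = 8;                       /* rounding term */
    for (int i = 0; i < 8; i++)
        sum += ft[i] + fl[i];
    const uint8_t dc = (uint8_t)(sum >> 4);
    for (int r = 0; r < 8; r++)
        for (int c = 0; c < 8; c++)
            dst[r * stride + c] = dc;
}
```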

Test design (tests/test_intra_pred_8x8_luma.c):

  - 3 uniform-context tests, one per mode (sanity).
  - 2 gradient tests that exercise the pre-filter's interior +
    boundary cases:
      * Vertical with top = 0..15: spec arithmetic gives filtered
        top[c] = c for c in 0..7 (gradient input → identity through
        the 1-2-1 filter on the interior; boundaries arithmetically
        verify too).  Test expects pred[r,c] = c.
      * Horizontal with left = 0..7: same arithmetic chain on the
        left col.  Test expects pred[r,c] = r.

Verified on hertz:

  $ ./build/test_intra_pred_8x8_luma
    Vertical (mode 0, uniform top) PASS
    Horizontal (mode 1, uniform left) PASS
    DC (mode 2, uniform)           PASS
    Vertical (mode 0, gradient)    PASS (filtered gradient)
    Horizontal (mode 1, gradient)  PASS (filtered gradient)

  ALL Intra_8x8 luma PASS (3 modes — V, H, DC)

The pre-filter being right first try is meaningful — the boundary
samples use a 3× weight rather than 2× (filt[top 15] = (t[14] +
3*t[15] + 2) >> 2), which is easy to forget when transcribing.  The
gradient test would have surfaced any boundary mistake immediately.

Combined intra-prediction primitive coverage after this PR:
  Intra_4x4 luma   ✓ (9 modes, PR #12)
  Intra_16x16 luma ✓ (4 modes, PR #13)
  Intra_8x8 chroma ✓ (4 modes, PR #14)
  Intra_8x8 luma   △ (3 of 9 modes — V, H, DC ✓; DDL/DDR/VR/HD/VL/HU pending)

The 6 remaining Intra_8x8 luma directional modes are spec-mechanical
follow-ups; each is a ~30-line formula per §8.3.2.1.5+.
2026-05-25 09:35:49 +02:00
marfrit 1ee8b1c0ab Merge pull request 'h264: qpel avg — 12 remaining variants (closes the matrix)' (#20) from noether/h264-qpel-avg-rest into main
Reviewed-on: #20
2026-05-25 07:33:02 +00:00
claude-noether 01f782cfaf h264: qpel avg — 12 remaining variants (closes the matrix)
Closes the H.264 8x8 qpel buildout.  Adds the remaining 12 avg_
biprediction positions:
  4 quarter-axis: avg_mc{10,30,01,03}
  8 diagonals  : avg_mc{11,12,13,21,23,31,32,33}

Each follows the established pattern: same half-pel formula as the
put_ sibling, then L2 average with the existing dst contents per
H.264 §8.4.2.3.1.

Scope:
  - 12 new kernel enums (MC10..MC33 avg_ = 34..45) → CPU.
  - 12 NEON externs for the vendored ff_avg_h264_qpel8_mc*_neon.
  - 12 CPU dispatches via existing DEFINE_QPEL_CPU_DISPATCH macro.
  - 12 public dispatches via DEFINE_QPEL_DISPATCH macro.
  - 12 recipe wrappers via DEFINE_QPEL_RECIPE macro.
  - 12 header decls via DECLARE_QPEL_AVG macro.
  - tests/h264_qpel8_avg_rest_ref.c — references via two parametric
    macros: DEFINE_AVG_QUARTER for the 4 ¼-pel L2 forms,
    DEFINE_AVG_DIAG for the 8 two-half-pel-avg forms.
  - Test harness extended with a RUN(MC) sub-macro that derives both
    the ref name and dispatch name from the bare mcXX.  (The ref
    is daedalus_avg_h264_qpel8_<mc>_ref; the dispatch is
    daedalus_recipe_dispatch_h264_qpel_avg_<mc>.  Macro had a typo
    on first try that duplicated "avg_" in the ref name — caught at
    compile, fixed.)

Verified on hertz:

  $ ./build/test_api_h264 | tail -12
    H.264 qpel avg_mc10: 2048/2048 bytes bit-exact (100.0000%)
    H.264 qpel avg_mc30: 2048/2048 bytes bit-exact (100.0000%)
    H.264 qpel avg_mc01: 2048/2048 bytes bit-exact (100.0000%)
    H.264 qpel avg_mc03: 2048/2048 bytes bit-exact (100.0000%)
    H.264 qpel avg_mc11: 2048/2048 bytes bit-exact (100.0000%)
    H.264 qpel avg_mc12: 2048/2048 bytes bit-exact (100.0000%)
    H.264 qpel avg_mc13: 2048/2048 bytes bit-exact (100.0000%)
    H.264 qpel avg_mc21: 2048/2048 bytes bit-exact (100.0000%)
    H.264 qpel avg_mc23: 2048/2048 bytes bit-exact (100.0000%)
    H.264 qpel avg_mc31: 2048/2048 bytes bit-exact (100.0000%)
    H.264 qpel avg_mc32: 2048/2048 bytes bit-exact (100.0000%)
    H.264 qpel avg_mc33: 2048/2048 bytes bit-exact (100.0000%)

  All 12 new positions bit-exact PASS first try.

Final qpel matrix state:
  put_:  mc00 (none — integer copy)
         mc01 ✓  mc02 ✓  mc03 ✓
         mc10 ✓  mc11 ✓  mc12 ✓  mc13 ✓
         mc20 ✓ (QPU+CPU)  mc21 ✓  mc22 ✓  mc23 ✓
         mc30 ✓  mc31 ✓  mc32 ✓  mc33 ✓
  avg_:  same 15-of-16 coverage, all CPU.

Every B-slice biprediction case the libavcodec intercept can throw
at us is now serviceable.  QPU shaders remain mc20-only (cycle 9);
the other 29 positions are CPU NEON.  Whether to write more QPU
shaders depends on real perf measurement — at NEON ~10 ns per
8x8 block, full qpel coverage at 1080p is ~2-3 ms of total work,
well inside budget.
2026-05-25 08:49:42 +02:00
marfrit 1cc0990c9f Merge pull request 'h264: qpel avg anchors (avg_mc20/02/22, biprediction support)' (#19) from noether/h264-qpel-avg-anchors into main
Reviewed-on: #19
2026-05-25 06:45:34 +00:00
claude-noether 1113953f97 h264: qpel avg anchors (avg_mc20/02/22, biprediction support)
Begins the avg_ qpel buildout for B-slice biprediction.  Each avg_
form computes the same half-pel formula as its put_ sibling, then
L2-averages the result with the existing dst contents — the caller
pre-loads dst with the list0 prediction; the avg_ call adds list1
per H.264 §8.4.2.3.1.
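
The "L2 average" step is a rounded mean of two 8-bit samples. A minimal sketch of the avg_ write-back, assuming the list1 prediction has already been computed into a separate buffer (the real NEON kernels fuse the filter and the average; names here are illustrative):

```c
#include <stddef.h>
#include <stdint.h>

/* Rounded average of two 8-bit samples. */
static inline uint8_t avg2(uint8_t a, uint8_t b)
{
    return (uint8_t)((a + b + 1) >> 1);
}

/* avg_ form: blend an 8x8 prediction into dst, which the caller has
 * pre-loaded with the list0 prediction. */
static void avg_block8x8_sketch(uint8_t *dst, ptrdiff_t dst_stride,
                                const uint8_t *pred, ptrdiff_t pred_stride)
{
    for (int r = 0; r < 8; r++)
        for (int c = 0; c < 8; c++)
            dst[r * dst_stride + c] =
                avg2(dst[r * dst_stride + c], pred[r * pred_stride + c]);
}
```

This is also why the test harness has to seed dst with non-trivial content: with dst equal to pred, an accidental put_-style overwrite would produce the same bytes.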

Scope (3 anchors, sets the pattern for the remaining 13 avg_
variants):
  - 3 new kernel enums (AVG_MC20=31, AVG_MC02=32, AVG_MC22=33) → CPU.
  - 3 NEON externs for the vendored ff_avg_h264_qpel8_{mc20,mc02,mc22}_neon.
  - 3 CPU dispatches via existing DEFINE_QPEL_CPU_DISPATCH macro
    (the macro is type-agnostic so it didn't need changes for avg_).
  - 3 public dispatches via DEFINE_QPEL_DISPATCH macro.
  - 3 recipe wrappers via DEFINE_QPEL_RECIPE macro.
  - tests/h264_qpel8_avg_anchors_ref.c — per-cell helpers + L2 avg.
  - Test harness: run_avg_qpel() seeds dst with random content so
    the L2 averaging is actually exercised (not just put_-style
    overwrite that would silently pass).

Verified on hertz:

  $ ./build/test_api_h264 | tail -3
    H.264 qpel avg_mc20: 2048/2048 bytes bit-exact (100.0000%)
    H.264 qpel avg_mc02: 2048/2048 bytes bit-exact (100.0000%)
    H.264 qpel avg_mc22: 2048/2048 bytes bit-exact (100.0000%)

  All 3 anchors bit-exact PASS first try.

Why anchors only in this PR: the avg_ pattern is uniform across all
16 positions (each is just "put_ result + L2 with dst").  Landing
the anchors first confirms the macro pattern works for both put_
and avg_; the remaining 13 (avg_mc10/30/01/03 + avg_mc11..33) follow
the same template in a follow-up PR.

State of the qpel matrix after this PR:
  put_ : 15 of 16 positions ✓ (mc00 is integer copy, no wrapper)
  avg_ :  3 of 16 positions ✓ (mc20, mc02, mc22 anchors)
        13 follow-up positions
2026-05-25 08:35:25 +02:00
marfrit 76e3076670 Merge pull request 'h264: qpel diagonals — 8 positions (mc11/12/13/21/23/31/32/33)' (#18) from noether/h264-qpel-diagonals into main
Reviewed-on: #18
2026-05-25 06:32:02 +00:00
claude-noether 0894a46114 h264: qpel diagonals — 8 positions (mc11/12/13/21/23/31/32/33)
Closes the qpel buildout.  All 8 remaining diagonal positions land
in one PR.  Each is the rounded average of two half-pel intermediates
per H.264 §8.4.2.2.1 / Table 8-4, with the decomposition matching
the FFmpeg .S reference structure (verified by reading
external/ffmpeg-snapshot/.../h264qpel_neon.S lines 622-758).

Decomposition table (the formula for each output cell at (r,c)):

  mc11 ¼¼ : avg(mc20[r,   c],   mc02[r, c])
  mc12 ¼½ : avg(mc22[r,   c],   mc02[r, c])
  mc13 ¼¾ : avg(mc20[r+1, c],   mc02[r, c])
  mc21 ½¼ : avg(mc22[r,   c],   mc20[r, c])
  mc23 ½¾ : avg(mc22[r,   c],   mc20[r+1, c])
  mc31 ¾¼ : avg(mc20[r,   c],   mc02[r, c+1])
  mc32 ¾½ : avg(mc22[r,   c],   mc02[r, c+1])
  mc33 ¾¾ : avg(mc20[r+1, c],   mc02[r, c+1])

The (r+1, c+1) offsets capture the position-dependent shift that
the FFmpeg .S encodes by pre-incrementing x1 (src pointer) before
branching into the common mc11/mc21 code paths.
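
A sketch of the four corner diagonals from the table (mc11, mc31, mc13, mc33), which average a horizontal and a vertical half-pel. It assumes the standard (1, -5, 20, 20, -5, 1) 6-tap with +16 >> 5 rounding; the names and the offset-parameter interface are hypothetical and do not mirror DEFINE_DIAG_REF. The four mc22-based positions are not covered:

```c
#include <stddef.h>
#include <stdint.h>

/* 6-tap half-pel sample between s[0] and s[step], rounded and clipped. */
static inline uint8_t hpel6(const uint8_t *s, ptrdiff_t step)
{
    int v = s[-2 * step] - 5 * s[-step] + 20 * s[0]
          + 20 * s[step] - 5 * s[2 * step] + s[3 * step] + 16;
    if (v < 0)
        return 0;
    v >>= 5;
    return (uint8_t)(v > 255 ? 255 : v);
}

/* h_row_off selects mc20[r + h_row_off, c]; v_col_off selects
 * mc02[r, c + v_col_off].  (0,0)=mc11, (0,1)=mc31, (1,0)=mc13, (1,1)=mc33. */
static void qpel8_corner_diag_sketch(uint8_t *dst, ptrdiff_t dst_stride,
                                     const uint8_t *src, ptrdiff_t src_stride,
                                     int h_row_off, int v_col_off)
{
    for (int r = 0; r < 8; r++) {
        for (int c = 0; c < 8; c++) {
            const uint8_t *p = src + r * src_stride + c;
            const int h = hpel6(p + h_row_off * src_stride, 1);
            const int v = hpel6(p + v_col_off, src_stride);
            dst[r * dst_stride + c] = (uint8_t)((h + v + 1) >> 1);
        }
    }
}
```

The caller must leave at least 2 samples of margin above/left and 4 below/right of the 8x8 block, since the offset variants read one row or column further than mc20/mc02 alone.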

Scope (tightly macro-ised):
  - 8 new kernel enums (MC11..MC33 = 23..30) → CPU.
  - 8 NEON externs for the vendored ff_put_h264_qpel8_mc*_neon.
  - 8 CPU dispatches via existing DEFINE_QPEL_CPU_DISPATCH macro.
  - 8 public dispatches via DEFINE_QPEL_DISPATCH macro.
  - 8 recipe wrappers via DEFINE_QPEL_RECIPE macro.
  - Header decls condensed via a DECLARE_QPEL_DIAG macro that
    expands to both recipe + dispatch decls per name.
  - C references via DEFINE_DIAG_REF macro: each ref is a 6-line
    wrapper around the per-cell hpel_h / hpel_v / hpel_hv helpers
    (the latter being the per-cell version of mc22's 13-row int16
    tmp[] computation).
  - Test wrapper: test_qpel_diag_all() drives all 8 through the
    existing run_quarter_axis_qpel() harness.

Verified on hertz (Pi 5 / V3D 7.1):

  $ ./build/test_api_h264 | tail -8
    H.264 qpel mc11: 2048/2048 bytes bit-exact (100.0000%)
    H.264 qpel mc12: 2048/2048 bytes bit-exact (100.0000%)
    H.264 qpel mc13: 2048/2048 bytes bit-exact (100.0000%)
    H.264 qpel mc21: 2048/2048 bytes bit-exact (100.0000%)
    H.264 qpel mc23: 2048/2048 bytes bit-exact (100.0000%)
    H.264 qpel mc31: 2048/2048 bytes bit-exact (100.0000%)
    H.264 qpel mc32: 2048/2048 bytes bit-exact (100.0000%)
    H.264 qpel mc33: 2048/2048 bytes bit-exact (100.0000%)

ALL 8 diagonal positions bit-exact PASS first try.  Meaningful
because the position-dependent (r+1, c+1) source offsets are easy
to get wrong by transcription, and any of them would surface on
random inputs immediately.

After this PR the H.264 qpel 8x8 put_ matrix is complete:
  mc00 mc01 mc02 mc03
  mc10 mc11 mc12 mc13
  mc20 mc21 mc22 mc23
  mc30 mc31 mc32 mc33

15 of 16 positions exposed through the daedalus API; mc00 is just
integer copy and rarely needs a dispatch wrapper (libavcodec sets
the function pointer table directly).  mc20 retains its QPU shader
(cycle 9 / v3d_h264_qpel_mc20.spv); all other 14 are CPU NEON.

What this does NOT cover (still in backlog):
  - avg_ variants (the "add" form for biprediction, 16 more
    positions).  Currently the API only exposes put_.
  - 16x16 qpel (separate function family in FFmpeg; the 8x8 path
    can be used twice to substitute when 16x16 isn't critical).
  - QPU shaders for any qpel position other than mc20.
2026-05-25 07:49:12 +02:00
marfrit d0a1db3c8f Merge pull request 'h264: qpel single-axis quarter-pel — mc10/mc30/mc01/mc03 (CPU/NEON)' (#17) from noether/h264-qpel-quarter-axis into main
Reviewed-on: #17
2026-05-25 05:42:16 +00:00
claude-noether e01f7bc7c6 h264: qpel single-axis quarter-pel — mc10/mc30/mc01/mc03 (CPU/NEON)
Closes the 4 single-axis quarter-pel positions in one PR.  Each is
a half-pel lowpass clipped to u8 followed by L2 rounded-average
with an integer-aligned source pixel per H.264 §8.4.2.2.1:

  mc10  ¼-H ("a" pos): clip255(mc20(s)) avg src[r,c]
  mc30  ¾-H ("c" pos): clip255(mc20(s)) avg src[r,c+1]
  mc01  ¼-V ("d" pos): clip255(mc02(s)) avg src[r,c]
  mc03  ¾-V ("n" pos): clip255(mc02(s)) avg src[r+1,c]

The mc10/mc30 pair and mc01/mc03 pair only differ in WHICH integer
source pixel they average with — the half-pel computation is the
same.  Putting them in one PR is justified by that uniformity.
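
The uniformity can be sketched with one parametric function: the axis picks the half-pel filter direction and a flag picks which integer neighbour to average with. This is an illustrative reading of the four formulas above, assuming the usual 6-tap with +16 >> 5 rounding; it is not the structure of tests/h264_qpel8_quarter_axis_ref.c, and all names are hypothetical:

```c
#include <stddef.h>
#include <stdint.h>

/* 6-tap half-pel between s[0] and s[step], rounded and clipped to u8. */
static inline uint8_t lowpass6(const uint8_t *s, ptrdiff_t step)
{
    int v = s[-2 * step] - 5 * s[-step] + 20 * s[0]
          + 20 * s[step] - 5 * s[2 * step] + s[3 * step] + 16;
    if (v < 0)
        return 0;
    v >>= 5;
    return (uint8_t)(v > 255 ? 255 : v);
}

/* vertical=0: mc10 (far_side=0) / mc30 (far_side=1)
 * vertical=1: mc01 (far_side=0) / mc03 (far_side=1) */
static void qpel8_quarter_axis_sketch(uint8_t *dst, ptrdiff_t dst_stride,
                                      const uint8_t *src, ptrdiff_t src_stride,
                                      int vertical, int far_side)
{
    const ptrdiff_t step = vertical ? src_stride : 1;
    for (int r = 0; r < 8; r++) {
        for (int c = 0; c < 8; c++) {
            const uint8_t *p = src + r * src_stride + c;
            const int half = lowpass6(p, step);
            const int full = far_side ? p[step] : p[0];
            dst[r * dst_stride + c] = (uint8_t)((half + full + 1) >> 1);
        }
    }
}
```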

Scope:
  - 4 new kernel enums: MC10=19, MC30=20, MC01=21, MC03=22 → CPU.
  - 4 NEON externs for the vendored ff_put_h264_qpel8_mc{10,30,01,03}_neon.
  - 4 CPU dispatch wrappers via DEFINE_QPEL_CPU_DISPATCH macro
    (collapses ~50 LOC of repetition).
  - 4 public dispatch fns via DEFINE_QPEL_DISPATCH macro.
  - 4 recipe wrappers via DEFINE_QPEL_RECIPE macro.
  - tests/h264_qpel8_quarter_axis_ref.c covers all four via shared
    hpel_h() / hpel_v() inlines + per-mode L2 average.
  - Test refactor: generic run_quarter_axis_qpel() harness exercises
    all 4 positions through a single helper (~50 LOC for 4 tests vs
    ~200 if each was hand-rolled).

Verified on hertz:

  $ ./build/test_api_h264 | tail -8
    H.264 deblock chroma h intra: 256/256 bytes bit-exact (100.0000%)
    H.264 qpel mc20: 1024/1024 bytes bit-exact (100.0000%)
    H.264 qpel mc02: 2048/2048 bytes bit-exact (100.0000%)
    H.264 qpel mc22: 2048/2048 bytes bit-exact (100.0000%)
    H.264 qpel mc10: 2048/2048 bytes bit-exact (100.0000%)
    H.264 qpel mc30: 2048/2048 bytes bit-exact (100.0000%)
    H.264 qpel mc01: 2048/2048 bytes bit-exact (100.0000%)
    H.264 qpel mc03: 2048/2048 bytes bit-exact (100.0000%)

  All 4 new positions bit-exact PASS first try.

Coverage matrix update (put_; column = x quarter, row = y quarter):
         x=0      x=1      x=2      x=3
  y=0    mc00 -   mc10 ✓   mc20 ✓   mc30 ✓
  y=1    mc01 ✓   mc11 -   mc21 -   mc31 -
  y=2    mc02 ✓   mc12 -   mc22 ✓   mc32 -
  y=3    mc03 ✓   mc13 -   mc23 -   mc33 -

After this PR: 7 of 16 positions done (mc20, mc02, mc22 plus the
four added here).  The remaining 9 are mc00 (plain integer copy)
and the 8 off-axis quarter-pel combinations
(mc11/mc12/mc13/mc21/mc23/mc31/mc32/mc33); each of those is an L2
average of two half-pel intermediates.  Next PR scope.

Why no QPU shaders: same R-band logic as the prior CPU additions.
At ~10 ns per 8x8 NEON block, all 16 qpel positions together
would land in ~1.3 ms/frame at 1080p worst case — comfortably
inside the 33 ms budget.  QPU shader for mc20 already exists
(cycle 9 / v3d_h264_qpel_mc20.spv); the other 15 follow once a
clear perf reason emerges.
2026-05-25 01:29:52 +02:00
marfrit f3d4b15b9a Merge pull request 'h264: qpel mc22 (2D half-pel, CPU/NEON)' (#16) from noether/h264-qpel-mc22 into main
Reviewed-on: #16
2026-05-24 23:26:14 +00:00
claude-noether 20a4299c5c h264: qpel mc22 (2D half-pel, CPU/NEON)
Adds the "j position" 2D half-pel via cascaded H + V 6-tap lowpass
with intermediate 16-bit precision per H.264 §8.4.2.2.1.  One of the
most common qpel positions in real H.264 streams — many encoders
emit 1/2-1/2 motion vectors as their best-RD choice.

Algorithmically distinct from the 1D mc20/mc02 siblings:
  - Horizontal 6-tap produces 13 rows of int16 intermediate (no
    per-stage clip/round — full precision retained).
  - Vertical 6-tap on the intermediate, then +512 >> 10 (the
    double-shift compensates for both 6-tap scalings) + clip255.

The intermediate-precision requirement means the C reference can't
just be "call mc20 then mc02" — that would double-clip and produce
the wrong result.  The 13-row int16 tmp[] buffer is the central
invariant.
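
A minimal sketch of the two-pass structure described above, with the 13-row int16 staging buffer made explicit. It assumes the (1, -5, 20, 20, -5, 1) taps; the function name is hypothetical and this is not the contents of tests/h264_qpel8_mc22_ref.c:

```c
#include <stddef.h>
#include <stdint.h>

static void qpel8_mc22_sketch(uint8_t *dst, ptrdiff_t dst_stride,
                              const uint8_t *src, ptrdiff_t src_stride)
{
    int16_t tmp[13][8];

    /* Horizontal pass over source rows -2..+10: no rounding, no clip.
     * Values stay within [-2550, 10710], so int16_t is sufficient. */
    for (int r = 0; r < 13; r++) {
        const uint8_t *s = src + (r - 2) * src_stride;
        for (int c = 0; c < 8; c++)
            tmp[r][c] = (int16_t)(s[c - 2] - 5 * s[c - 1] + 20 * s[c]
                                + 20 * s[c + 1] - 5 * s[c + 2] + s[c + 3]);
    }

    /* Vertical pass on the intermediate; one combined round + shift
     * (+512 >> 10) undoes both 6-tap gains of 32, then clip to u8. */
    for (int r = 0; r < 8; r++) {
        for (int c = 0; c < 8; c++) {
            int v = tmp[r][c] - 5 * tmp[r + 1][c] + 20 * tmp[r + 2][c]
                  + 20 * tmp[r + 3][c] - 5 * tmp[r + 4][c] + tmp[r + 5][c]
                  + 512;
            if (v < 0)
                v = 0;
            v >>= 10;
            dst[r * dst_stride + c] = (uint8_t)(v > 255 ? 255 : v);
        }
    }
}
```

Output row r reads tmp rows r..r+5, which correspond to source rows r-2..r+3, so the block needs 2 rows/columns of margin before it and 3 after.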

Scope (same pattern as mc02 PR #15):
  - Public API: daedalus_dispatch_h264_qpel_mc22 + recipe wrapper.
  - Internal: dispatch_h264_qpel_mc22_cpu calling
    ff_put_h264_qpel8_mc22_neon.
  - Recipe table: DAEDALUS_KERNEL_H264_QPEL_MC22 = 18 → CPU.
  - C reference: tests/h264_qpel8_mc22_ref.c — explicit tmp[13][8]
    int16 staging buffer; spec-derived shifts and rounding.
  - Test: test_qpel_mc22 in test_api_h264, 8 tiles at 16×16 with
    output positioned at (SRC_ROW=3, SRC_COL=3) so the kernel's
    [-2 .. +10] read window stays in-tile.

Verified on hertz:

  $ ./build/test_api_h264 | tail -5
    H.264 deblock chroma v intra: 256/256 bytes bit-exact (100.0000%)
    H.264 deblock chroma h intra: 256/256 bytes bit-exact (100.0000%)
    H.264 qpel mc20: 1024/1024 bytes bit-exact (100.0000%)
    H.264 qpel mc02: 2048/2048 bytes bit-exact (100.0000%)
    H.264 qpel mc22: 2048/2048 bytes bit-exact (100.0000%)

  All 13 H.264 kernels in api_smoke now bit-exact PASS.

mc22 being right first try is meaningful — the +512 >> 10 scaling
+ int16 intermediate sequence has multiple sign/shift/clip pitfalls
and any of them would surface on random inputs immediately.

Coverage matrix update:
  put_ mc20 ✓ (QPU+CPU)  put_ mc02 ✓ (CPU)  put_ mc22 ✓ (CPU)
  → 12 single put_ positions still missing (¼/¾ + HV combos with
  L2 averaging).
2026-05-25 01:03:14 +02:00
marfrit a2575d5e42 Merge pull request 'h264: qpel mc02 (vertical half-pel, CPU/NEON)' (#15) from noether/h264-qpel-mc02 into main
Reviewed-on: #15
2026-05-24 22:59:38 +00:00
claude-noether c3301b0c2e h264: qpel mc02 (vertical half-pel, CPU/NEON)
Mirror of cycle 9's mc20 transposed to vertical orientation.  Wires
up the second qpel half-pel position via the vendored
ff_put_h264_qpel8_mc02_neon symbol and closes the "missing vertical
sibling" gap that mc20 left open since cycle 9.

Scope:
  - Public API: daedalus_dispatch_h264_qpel_mc02 + recipe wrapper.
  - Internal: dispatch_h264_qpel_mc02_cpu calling the NEON entry.
  - Recipe table: DAEDALUS_KERNEL_H264_QPEL_MC02 = 17 → CPU.
    Explicit SUBSTRATE_QPU returns -1 (no shader yet).
  - C reference: tests/h264_qpel8_mc02_ref.c — vertical 6-tap
    transpose of mc20 (reads src[(r±N)*stride + c] instead of
    src[r*stride + c±N]).
  - Test: test_qpel_mc02 in test_api_h264, 8 tiles × 16 cols
    × 16 rows, random input, bit-exact compare against the C ref.

Verified on hertz:

  $ ./build/test_api_h264
  ...
    H.264 qpel mc20: 1024/1024 bytes bit-exact (100.0000%)
    H.264 qpel mc02: 2048/2048 bytes bit-exact (100.0000%)

  All 12 H.264 kernels in the api_smoke now bit-exact PASS.

Why CPU-only: same R-band logic as the deblock _h sibling pattern.
mc02 at ~7.6 ns per 8x8 block on NEON (per the cycle 9 baseline
measurements) gives ~250 us for 8160 MBs × 4 8x8 luma blocks
(32,640 blocks) at 1080p, comfortably inside the 33 ms budget.
QPU shader is a fast-follow once the V vs H shader work is
consolidated (the transpose for the V shader is not mechanical:
it has a different SIMD access pattern than the H shader).

Coverage matrix update:

  qpel position  put_ status  avg_ status
  -------------  -----------  -----------
  mc00 (copy)    not wired    not wired
  mc10 (¼-H)     not wired    not wired
  mc20 (½-H)     ✓ QPU+CPU    not wired
  mc30 (¾-H)     not wired    not wired
  mc01 (¼-V)     not wired    not wired
  mc02 (½-V)     ✓ CPU        not wired    (this PR)
  mc03 (¾-V)     not wired    not wired
  mc11..mc33     not wired    not wired

13 more qpel positions to go for the full put_ matrix.  Adding them
follows the same template; each is a small contained PR.
2026-05-25 00:47:37 +02:00
marfrit 9abc73d308 Merge pull request 'h264: Intra_8x8 chroma prediction — 4-mode C reference + spec gates' (#14) from noether/h264-intra-pred-chroma8x8 into main
Reviewed-on: #14
2026-05-24 22:43:26 +00:00
claude-noether d7100459f2 h264: Intra_8x8 chroma prediction — 4-mode C reference + spec gates
Third intra-prediction primitive after PR #12 (Intra_4x4 luma) and
PR #13 (Intra_16x16 luma).  Covers Intra_8x8 chroma per H.264 §8.3.3:
4 modes used for BOTH Cb and Cr planes at 4:2:0.

Mode quirks worth flagging in code review:

  - Mode 0 DC is asymmetric per quadrant.  The 8x8 chroma block
    splits into four 4x4 quadrants with different DC formulas:
      (0,0) top-left  : (sum_top[0..3] + sum_left[0..3] + 4) >> 3
      (0,1) top-right : (sum_top[4..7]                  + 2) >> 2
      (1,0) bot-left  : (sum_left[4..7]                 + 2) >> 2
      (1,1) bot-right : (sum_top[4..7] + sum_left[4..7] + 4) >> 3
    The top-right quadrant deliberately IGNORES left[0..3] (and
    top[0..3]) even though both are available; that's per spec
    §8.3.3.2.

  - Mode 3 Plane uses slope coefficient 34 (not 5 like Intra_16x16
    luma).  Centre is (x-3, y-3) instead of (x-7, y-7).  Sums span
    4 differences instead of 8.  Easy to copy-paste-bug from the
    luma Plane if you don't notice the constants change.
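
The four quadrant formulas can be sketched directly, assuming all neighbours are available and that top/left are passed as flat 8-sample arrays (the function name and signature are hypothetical, not the ref's FFmpeg-style interface):

```c
#include <stddef.h>
#include <stdint.h>

/* Intra chroma 8x8 DC with per-quadrant formulas (interior-MB case). */
static void pred8x8_chroma_dc_sketch(uint8_t *dst, ptrdiff_t stride,
                                     const uint8_t top[8],
                                     const uint8_t left[8])
{
    int t0 = 0, t1 = 0, l0 = 0, l1 = 0;
    for (int i = 0; i < 4; i++) {
        t0 += top[i];       /* top[0..3]  */
        t1 += top[4 + i];   /* top[4..7]  */
        l0 += left[i];      /* left[0..3] */
        l1 += left[4 + i];  /* left[4..7] */
    }

    /* dc[row quadrant][column quadrant] */
    const uint8_t dc[2][2] = {
        { (uint8_t)((t0 + l0 + 4) >> 3), (uint8_t)((t1 + 2) >> 2) },
        { (uint8_t)((l1 + 2) >> 2),      (uint8_t)((t1 + l1 + 4) >> 3) },
    };

    for (int y = 0; y < 8; y++)
        for (int x = 0; x < 8; x++)
            dst[y * stride + x] = dc[y >> 2][x >> 2];
}
```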

Test highlights:

  - DC quadrants: distinct expected values per quadrant (16, 16,
    40, 28 from asymmetric top/left halves) — any quadrant mix-up
    would surface immediately.  Hand-derived from the formulas
    in the test comment.
  - Plane uniform: all-100 context → all-100 output (a = 3200,
    H = V = 0, (3200+16) >> 5 = 100 exactly).
  - Plane gradient: top + left = 0..7, hand-derives pred[0][0] = 1
    and pred[7][7] = 15 via the full arithmetic chain (H = V = 56,
    b = c = 30, a = 224).  Same hand-traced spec-walkthrough as
    the Intra_16x16 Plane gradient test.
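
The gradient walkthrough can be reproduced with a short sketch of the chroma Plane mode. It assumes the (34*H + 32) >> 6 slope rounding and a top-left sample of 0 for the gradient case (which is what yields H = V = 56); the name and signature are hypothetical, and right shifts of negative values are assumed to be arithmetic, as on the usual compilers:

```c
#include <stddef.h>
#include <stdint.h>

/* Chroma 8x8 Plane prediction, all neighbours available.
 * tl = top-left sample, top[0..7], left[0..7]. */
static void pred8x8_chroma_plane_sketch(uint8_t *dst, ptrdiff_t stride,
                                        uint8_t tl, const uint8_t top[8],
                                        const uint8_t left[8])
{
    int H = 0, V = 0;
    for (int i = 0; i < 4; i++) {
        /* The outermost difference (i == 3) reaches the top-left sample. */
        const int t_lo = (i == 3) ? tl : top[2 - i];
        const int l_lo = (i == 3) ? tl : left[2 - i];
        H += (i + 1) * (top[4 + i] - t_lo);
        V += (i + 1) * (left[4 + i] - l_lo);
    }
    const int a = 16 * (top[7] + left[7]);
    const int b = (34 * H + 32) >> 6;   /* assumes arithmetic shift */
    const int c = (34 * V + 32) >> 6;

    for (int y = 0; y < 8; y++) {
        for (int x = 0; x < 8; x++) {
            const int v = (a + b * (x - 3) + c * (y - 3) + 16) >> 5;
            dst[y * stride + x] = (uint8_t)(v < 0 ? 0 : v > 255 ? 255 : v);
        }
    }
}
```

With top = left = 0..7 this gives a = 224 and b = c = 30, so the corners come out as (224 - 180 + 16) >> 5 = 1 and (224 + 240 + 16) >> 5 = 15.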

Verified on hertz:

  $ ./build/test_intra_pred_chroma8x8
    Horizontal (mode 1)            PASS
    Vertical (mode 2)              PASS
    DC quadrants (mode 0)          PASS
    Plane uniform (mode 3)         PASS
    Plane gradient (mode 3)        PASS (corners 1, 15)

  ALL Intra_8x8 chroma mode references PASS

All 5 tests PASS first try.  The DC quadrant correctness is meaningful
(4 different formulas in one kernel) and the Plane gradient corners
validate the slope=34 + centre=(x-3,y-3) constants vs the luma
equivalents.

Combined coverage after this PR:
  - Intra_4x4 luma:   9 modes ✓ (PR #12, all 9 PASS)
  - Intra_16x16 luma: 4 modes ✓ (PR #13, all 5 tests PASS)
  - Intra_8x8 chroma: 4 modes ✓ (this PR, all 5 tests PASS)
  - Intra_8x8 luma (High profile): 9 modes + smoothing — pending.

Remaining backlog: Intra_8x8 luma (High profile, 9 modes + 1-2-1
smoothing pre-filter — distinct algorithm from Intra_4x4 because of
the pre-filter), neighbour-availability fallback, dispatch wrappers.
2026-05-25 00:42:49 +02:00
marfrit dff610e13d Merge pull request 'h264: Intra_16x16 luma prediction — 4-mode C reference + spec gates' (#13) from noether/h264-intra-pred-16x16 into main
Reviewed-on: #13
2026-05-24 22:40:29 +00:00
claude-noether c43ee84d8e h264: Intra_16x16 luma prediction — 4-mode C reference + spec gates
Second piece of the intra-prediction primitive set after PR #12
(Intra_4x4 luma 9 modes).  Covers the Intra_16x16 luma MB type
per H.264 §8.3.2: 4 modes (Vertical, Horizontal, DC, Plane).

Scope:
  - tests/h264_intra_pred_16x16_ref.c — 4 spec-derived modes.
    Same FFmpeg-style interface as the 4x4 sibling:
      void daedalus_h264_pred_16x16_<name>_ref(uint8_t *dst, ptrdiff_t stride);
    Assumes all neighbours valid (interior-MB case).

    The Plane mode is the algorithmically heaviest of the four:
    spec §8.3.2.4 has two slope sums (H, V) over the asymmetric
    top/left contexts, a clipped linear (planar) evaluation per
    cell, and a top-left-corner participant at i=7 / j=7.
    Implementation follows the spec straightforwardly with
    `clip_u8` on the final saturating cast.

  - tests/test_intra_pred_16x16.c — 5 test cases:
      * V, H, DC: standard contexts (gradient top / gradient left /
        small uniform pair).
      * Plane (uniform): all neighbours = 100 → H = V = 0 →
        output = (16*200 + 16) >> 5 = 100.  Verifies the
        orientation-free portion of the formula.
      * Plane (gradient): top + left both 0..15, spec-derived
        corner expectations pred[0][0] = 1 and pred[15][15] = 31.
        The arithmetic chain (H = V = 400 → b = c = 31, a = 480)
        is fully hand-traced in the test comment so the expected
        values are auditable.

  - CMakeLists.txt — new test_intra_pred_16x16 binary; pure-CPU
    library, no daedalus_core dependency (same separation as the
    4x4 ref).

Verified on hertz:

  $ ./build/test_intra_pred_16x16
    Vertical (mode 0)              PASS
    Horizontal (mode 1)            PASS
    DC (mode 2)                    PASS
    Plane (mode 3, uniform)        PASS
    Plane (mode 3, gradient)       PASS (corners 1, 31)

  ALL Intra_16x16 mode references PASS

Plane mode being right first try is meaningful — H/V sums, b/c
slope shifts, and the a-baseline arithmetic have many sign / index
error opportunities.  The asymmetric gradient test would have caught
any of them; it didn't.
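
The hand-traced gradient chain above can be replayed with a minimal
sketch.  Assumptions: `plane16_sketch` and `asr` are hypothetical
helpers (the real reference uses the dst/stride interface), the
neighbours are passed as two 17-entry arrays whose element 0 is the
shared corner, and the corner is taken to be 0, which is the value
that yields the quoted H = V = 400.

```c
#include <assert.h>
#include <stdint.h>

/* Saturate an int to the 0..255 range of an 8-bit sample. */
static uint8_t clip_u8(int v)
{
    return (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
}

/* Floor shift right that stays well defined for negative v. */
static int asr(int v, int n)
{
    return v >= 0 ? v >> n : ~(~v >> n);
}

/*
 * Hypothetical layout, not the repo's API: top[0] and left[0] hold the
 * shared top-left corner, top[1 + x] is the sample above column x and
 * left[1 + y] is the sample left of row y.
 */
static void plane16_sketch(const uint8_t top[17], const uint8_t left[17],
                           uint8_t pred[16][16])
{
    int H = 0, V = 0;

    assert(top[0] == left[0]); /* both views share one corner sample */

    for (int i = 0; i < 8; i++) {
        /* At i == 7 the subtracted index is 0, i.e. the corner. */
        H += (i + 1) * (top[9 + i] - top[7 - i]);
        V += (i + 1) * (left[9 + i] - left[7 - i]);
    }

    const int a = 16 * (left[16] + top[16]);
    const int b = asr(5 * H + 32, 6);
    const int c = asr(5 * V + 32, 6);

    for (int y = 0; y < 16; y++)
        for (int x = 0; x < 16; x++)
            pred[y][x] = clip_u8(asr(a + b * (x - 7) + c * (y - 7) + 16, 5));
}
```

With top and left both 0..15 this gives b = c = 31 and a = 480, so the
two corners come out as (480 - 434 + 16) >> 5 and (480 + 496 + 16) >> 5.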

What this does NOT cover (still in the intra-pred backlog):
  - Intra_8x8 chroma (4 modes per H.264 §8.3.3).
  - Intra_8x8 luma (High profile, 9 modes per §8.3.2.1 + the 1-2-1
    smoothing pre-filter — distinct algorithm from Intra_4x4).
  - Neighbour-availability fallback for boundary MBs.
  - Dispatch wrappers (same architectural question as before — wait
    for decoder Stage 2a strategy decision).
2026-05-25 00:35:24 +02:00
marfrit fad600000b Merge pull request 'h264: Intra_4x4 luma prediction — 9-mode C reference + spec gates' (#12) from noether/h264-intra-pred-4x4 into main
Reviewed-on: #12
2026-05-24 22:28:39 +00:00
claude-noether ce6703a862 h264: Intra_4x4 luma prediction — 9-mode C reference + spec gates
Lays the bit-exact gate for H.264 §8.3.1.4 Intra_4x4 luma prediction.
Spec-derived C reference covering all 9 modes; standalone test
exercises each against hand-computed expected 4x4 patterns.

Why fourier (not the decoder) gets this: it's a reusable spec-level
primitive — both daedalus-decoder (Phase 1 Stage 2a intra prediction)
and any future shader work will need the same bit-exact reference.
Putting it in fourier alongside the IDCT / deblock refs keeps the
"spec implementations" library cohesive.

Why CPU C reference, not NEON or QPU: the vendored FFmpeg snapshot
(external/ffmpeg-snapshot/libavcodec/aarch64/) has h264dsp/idct/qpel
but NOT h264pred.  Vendoring h264pred_neon.S would expand the snapshot
surface; deferring that pending real perf data.  The cycle 9 NEON
benches take ~5 ns per 8x8 qpel block; assuming a similar ~5 ns per
4x4 block, intra prediction costs ~5 ns × 16 blocks/MB × 8160 MBs =
~650 us/frame at 1080p.  That is well inside budget at NEON speed,
and a plain-C reference several times slower would still fit.  Not
the critical-path concern.
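
The per-frame estimate above is plain arithmetic and can be checked
with a small sketch.  The helper names are hypothetical; the 8160 MB
figure is what 1920x1080 gives once both dimensions are rounded up to
whole 16-pixel macroblocks (120 x 68).

```c
#include <stdint.h>

/* Macroblocks in a frame, rounding each dimension up to 16 pixels. */
static uint32_t mb_count(uint32_t width, uint32_t height)
{
    return ((width + 15u) / 16u) * ((height + 15u) / 16u);
}

/* Per-frame cost in ns for ops_per_mb operations of ns_per_op each. */
static uint64_t frame_cost_ns(uint32_t mbs, uint32_t ops_per_mb,
                              uint32_t ns_per_op)
{
    return (uint64_t)mbs * ops_per_mb * ns_per_op;
}
```

frame_cost_ns(mb_count(1920, 1080), 16, 5) evaluates to 652 800 ns,
i.e. the ~650 us quoted above.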

Scope:
  - tests/h264_intra_pred_4x4_ref.c — 9 prediction modes per
    H.264 spec §8.3.1.4 sub-clauses, FFmpeg-style interface:
      void daedalus_h264_pred_4x4_<name>_ref(uint8_t *dst, ptrdiff_t stride);
    Reads top/top-right/left/top-left neighbours from dst[-stride/-1]
    offsets, writes 4×4 output at dst[0..3][0..3].  Assumes all 13
    neighbour bytes are valid (interior-MB case; availability
    fallbacks are caller-side per spec).
  - tests/test_intra_pred_4x4.c — 10 cases:
      * 9 uniform-context degenerate tests (one per mode), establishing
        that nothing is structurally broken (all output cells must
        equal the uniform input value).
      * 1 asymmetric Vertical_Right sanity test with 16 distinct
        expected cells hand-computed from spec §8.3.1.4.6 — the
        "really exercise orientation + row/col arithmetic" gate.
  - CMakeLists.txt — new test_intra_pred_4x4 binary (no daedalus_core
    dependency; pure-CPU library doesn't need a context to construct).

Verified on hertz:

  $ ./build/test_intra_pred_4x4
    Vertical (mode 0)          PASS
    Horizontal (mode 1)        PASS
    DC (mode 2)                PASS
    DiagDownLeft (mode 3)      PASS
    DiagDownRight (mode 4)     PASS
    VerticalRight (mode 5)     PASS
    HorizontalDown (mode 6)    PASS
    VerticalLeft (mode 7)      PASS
    HorizontalUp (mode 8)      PASS
    VR asym (sanity)           PASS

  ALL 10 intra-4x4 mode references PASS

The VR asym test passed first try; the DC test failed on the first
attempt because my test expectation miscomputed the rounding shift
(I wrote 4, actual is 2 = (16+4)>>3).  Fixed in the test.  Reference
itself never had the bug.
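
For reference, a minimal sketch of the DC rounding that the test
expectation got wrong.  `dc4x4_value` is a hypothetical helper, and the
uniform neighbour value of 2 is an assumption chosen because it
reproduces the quoted (16+4)>>3 = 2; the message does not state the
actual test context.

```c
#include <stdint.h>

/*
 * DC value for a 4x4 block when all neighbours are available:
 * sum the 4 samples above and the 4 to the left, add 4 for rounding,
 * then divide by 8.
 */
static uint8_t dc4x4_value(const uint8_t top[4], const uint8_t left[4])
{
    unsigned sum = 0;

    for (int i = 0; i < 4; i++)
        sum += (unsigned)top[i] + left[i];

    return (uint8_t)((sum + 4u) >> 3);
}
```

The +4 is half of the divisor 8, so the shift rounds to nearest rather
than truncating.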

What this does NOT cover (next-step backlog):
  - Intra_16x16 luma prediction (4 modes per H.264 §8.3.2): vertical,
    horizontal, DC, plane.
  - Intra_8x8 chroma prediction (4 modes per H.264 §8.3.3): DC,
    horizontal, vertical, plane.
  - Intra_8x8 luma prediction (High profile, 9 modes per §8.3.2.1) —
    these are the High-profile siblings of the modes in this PR with
    the 1-2-1 smoothing pre-filter.  Different but well-defined.
  - Neighbour availability fallback (top-edge MB, left-edge MB,
    slice-boundary, top-right unavailable in some positions).
  - Dispatch wrappers — these refs aren't surfaced through
    daedalus_dispatch_*().  Whether to do that depends on the
    daedalus-decoder Stage 2a architecture (per-block CPU vs
    per-diagonal GPU wavefront — TBD).
2026-05-25 00:14:51 +02:00
marfrit 5306bf0f61 Merge pull request 'h264: deblock bS=4 intra variants (luma + chroma, V + H)' (#11) from noether/h264-deblock-intra into main
Reviewed-on: #11
2026-05-24 22:09:15 +00:00
claude-noether 9b1c106dc5 h264: deblock bS=4 intra variants (luma + chroma, V + H)
Closes the deblock matrix: adds the four bS=4 intra-strength loop
filters used at I-MB edges (and other boundaries where H.264
§8.7.2.1 forces boundary strength to 4).  After this PR fourier
covers all 8 standard 8-bit 4:2:0 deblock combinations:

              bS<4               bS=4
              -----------------  -------
  luma_v      ✓ (cycle 8 QPU)    ✓ (CPU)
  luma_h      ✓ (CPU, PR #9)     ✓ (CPU)
  chrm_v      ✓ (CPU, PR #10)    ✓ (CPU)
  chrm_h      ✓ (CPU, PR #10)    ✓ (CPU)

Scope:
  - 4 new kernel enums (LV_INTRA=13, LH_INTRA=14, CV_INTRA=15,
    CH_INTRA=16), all → CPU substrate in the recipe table.
  - 4 new public dispatch fns + 4 recipe wrappers (defined via two
    DEFINE_INTRA_DISPATCH / DEFINE_INTRA_RECIPE macros to keep the
    boilerplate tight).
  - 4 new extern decls for the vendored
    ff_h264_{v,h}_loop_filter_{luma,chroma}_intra_neon symbols.
  - C reference: tests/h264_intra_loop_filter_ref.c covers all four
    orientations.  Algorithm per H.264 §8.7.2.4:

      Luma: per-side strong/weak filter selector
        strong_p = (|p2-p0| < β) AND (|p0-q0| < (α>>2)+2)
        strong_q = (|q2-q0| < β) AND (|p0-q0| < (α>>2)+2)
        Strong updates p0/p1/p2 (and mirror); weak updates p0 only.
      Chroma: always weak, only p0/q0 updated.

  - daedalus_h264_deblock_meta is REUSED for intra dispatches; the
    tc0[] field is ignored (bS=4 hardcodes the strength).  Callers
    can build a single edge list and route by kernel without an
    extra struct.

  - Test refactor: an intra_test_spec table + run_intra_test helper
    drives all four orientations through one harness, keeping the
    new test surface compact (~50 LOC for 4 kernels vs ~200 if each
    had its own test_deblock_*_intra fn).
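
The luma strong/weak selector described above can be sketched for a
single line of samples across the edge.  This is an illustration under
stated assumptions, not the code in tests/h264_intra_loop_filter_ref.c:
the function name and the p[]/q[] array layout are hypothetical, and
the alpha/beta activity gate at the top is the general deblock
precondition from the spec rather than something spelled out in this
message.

```c
#include <stdint.h>
#include <stdlib.h>

/*
 * Filter one line of samples across a bS=4 luma edge.
 * p[0..3] = p0..p3 moving away from the edge, q[0..3] likewise.
 */
static void luma_intra_line_sketch(uint8_t p[4], uint8_t q[4],
                                   int alpha, int beta)
{
    /* Snapshot the inputs: every output uses the unfiltered values. */
    const int p0 = p[0], p1 = p[1], p2 = p[2], p3 = p[3];
    const int q0 = q[0], q1 = q[1], q2 = q[2], q3 = q[3];

    /* Activity gate: leave real picture edges untouched. */
    if (abs(p0 - q0) >= alpha || abs(p1 - p0) >= beta ||
        abs(q1 - q0) >= beta)
        return;

    const int small_step = abs(p0 - q0) < (alpha >> 2) + 2;

    if (small_step && abs(p2 - p0) < beta) {        /* strong_p */
        p[0] = (uint8_t)((p2 + 2 * p1 + 2 * p0 + 2 * q0 + q1 + 4) >> 3);
        p[1] = (uint8_t)((p2 + p1 + p0 + q0 + 2) >> 2);
        p[2] = (uint8_t)((2 * p3 + 3 * p2 + p1 + p0 + q0 + 4) >> 3);
    } else {                                        /* weak: p0 only */
        p[0] = (uint8_t)((2 * p1 + p0 + q1 + 2) >> 2);
    }

    if (small_step && abs(q2 - q0) < beta) {        /* strong_q */
        q[0] = (uint8_t)((q2 + 2 * q1 + 2 * q0 + 2 * p0 + p1 + 4) >> 3);
        q[1] = (uint8_t)((q2 + q1 + q0 + p0 + 2) >> 2);
        q[2] = (uint8_t)((2 * q3 + 3 * q2 + q1 + q0 + p0 + 4) >> 3);
    } else {                                        /* weak: q0 only */
        q[0] = (uint8_t)((2 * q1 + q0 + p1 + 2) >> 2);
    }
}
```

Every filter tap is a weighted average whose weights sum to the
divisor, so no clip is needed on the results.  The chroma variant would
always take the weak branch on both sides.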

Verified on hertz (Pi 5 / V3D 7.1):

  $ ./build/test_api_h264
  === Phase 8a API smoke: H.264 kernels via recipe dispatch ===
    ...
    H.264 deblock luma v intra: 1024/1024 bytes bit-exact (100.0000%)
    H.264 deblock luma h intra: 1024/1024 bytes bit-exact (100.0000%)
    H.264 deblock chroma v intra: 256/256 bytes bit-exact (100.0000%)
    H.264 deblock chroma h intra: 256/256 bytes bit-exact (100.0000%)
    ...

  All 11 H.264 kernels bit-exact PASS — the deblock matrix is closed.

The bit-exact match on first try is meaningful for these kernels:
the strong/weak filter selector + per-side asymmetry would have
surfaced any sign / shift / rounding mistake immediately.  The
C reference is now a usable spec checkpoint for the eventual QPU
shader work.

QPU shader follow-up: not in this PR.  The intra path's 3-cell
per-side update + strong/weak branch is structurally more complex
than the bS<4 path that already has a V shader (v3d_h264deblock.spv).
Per the prior R-band logic for deblock, intra edges are < 20% of
total deblock work at typical bit-rates, so NEON-only at ~ 10 ns/edge
fits comfortably in the budget.
2026-05-25 00:00:46 +02:00
marfrit ce436bfd96 Merge pull request 'h264: deblock chroma_v + chroma_h (CPU/NEON, bS<4)' (#10) from noether/h264-deblock-chroma into main
Reviewed-on: #10
2026-05-24 21:55:57 +00:00
claude-noether a5c47aa51c h264: deblock chroma_v + chroma_h (CPU/NEON, bS<4)
Continues the deblock buildout after PR #9 (luma_h).  Adds the two
chroma orientations via the same recipe-table-routed-to-CPU pattern;
QPU shaders for chroma deblock are still a follow-up.

Scope:
  - Public API: 4 new fns (dispatch + recipe wrapper × {v, h}).
  - Internal: dispatch_h264_deblock_chroma_{v,h}_cpu calling the
    vendored ff_h264_{v,h}_loop_filter_chroma_neon symbols.
  - Recipe table: DAEDALUS_KERNEL_H264_DEBLOCK_CV = 11,
    DAEDALUS_KERNEL_H264_DEBLOCK_CH = 12, both → CPU.  Explicit
    SUBSTRATE_QPU returns -1 (no shader yet).
  - C reference: tests/h264_chroma_loop_filter_ref.c — covers both
    orientations.  Algorithm per H.264 §8.7.2.3 (bS<4 chroma inter):
    tC = tc0_seg + 1 (no luma-style ap/aq side bonus); only p0/q0
    are updated (chroma never modifies p1/p2/q1/q2).
  - Tests: test_deblock_chroma_v (8x4 tile, edge at row 2) +
    test_deblock_chroma_h (4x8 tile, edge at col 2), 4 segments x
    2 cells per segment per spec.
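
A minimal sketch of the bS<4 chroma update for one p0/q0 pair, to make
the "tC = tc0 + 1, only p0/q0 change" rule concrete.  The function
name and argument layout are hypothetical, the alpha/beta gate is the
general deblock precondition, and any convention for skipping a segment
on a negative tc0 is deliberately not modelled here.

```c
#include <stdint.h>
#include <stdlib.h>

static int clip3(int lo, int hi, int v)
{
    return v < lo ? lo : (v > hi ? hi : v);
}

/* Update one p0/q0 pair across a chroma edge; p1 and q1 are read only. */
static void chroma_inter_pair_sketch(uint8_t *p0, uint8_t *q0,
                                     int p1, int q1,
                                     int alpha, int beta, int tc0)
{
    const int P0 = *p0, Q0 = *q0;

    if (abs(P0 - Q0) >= alpha || abs(p1 - P0) >= beta ||
        abs(q1 - Q0) >= beta)
        return;

    /* Chroma has no ap/aq bonus: the clamp is simply tc0 + 1. */
    const int tc = tc0 + 1;

    /* floor(num / 8), written so negative num is never shifted. */
    const int num = (Q0 - P0) * 4 + (p1 - q1) + 4;
    int delta = num >= 0 ? num >> 3 : -((-num + 7) >> 3);

    delta = clip3(-tc, tc, delta);
    *p0 = (uint8_t)clip3(0, 255, P0 + delta);
    *q0 = (uint8_t)clip3(0, 255, Q0 - delta);
}
```

For p1 = p0 = 100, q0 = q1 = 108 and tc0 = 1 the raw delta is 3, which
the tC = 2 clamp limits to 2.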

Verified on hertz (Pi 5 / V3D 7.1):

  $ ./build/test_api_h264
  === Phase 8a API smoke: H.264 kernels via recipe dispatch ===
    H264_IDCT4 recipe substrate:      2 (1=CPU, 2=QPU)
    H264_IDCT8 recipe substrate:      2
    H264_DEBLOCK_LV recipe substrate: 2
    H264_QPEL_MC20 recipe substrate:  2
    H264_DEBLOCK_LH recipe substrate: 1 (CPU, no QPU H shader yet)
    H264_DEBLOCK_CV recipe substrate: 1 (CPU)
    H264_DEBLOCK_CH recipe substrate: 1 (CPU)
    H.264 IDCT 4x4: 2048/2048 bytes bit-exact (100.0000%)
    H.264 IDCT 8x8: 2048/2048 bytes bit-exact (100.0000%)
    H.264 deblock luma v: 2048/2048 bytes bit-exact (100.0000%)
    H.264 deblock luma h: 1024/1024 bytes bit-exact (100.0000%)
    H.264 deblock chroma v: 256/256 bytes bit-exact (100.0000%)
    H.264 deblock chroma h: 256/256 bytes bit-exact (100.0000%)
    H.264 qpel mc20: 1024/1024 bytes bit-exact (100.0000%)

  All 7 kernels bit-exact PASS.  Chroma test sizes are smaller (256
  bytes per orientation) because the per-MB chroma deblock surface is
  smaller than luma — accurate to the production geometry.

Why no QPU shader yet (per the established pattern):
  - Chroma deblock is ~25% of total deblock work at 4:2:0 (one quarter
    the pixel count of luma per MB) — modest QPU win even after the
    shader exists.
  - Same R-band considerations as the luma _h follow-up: the V shader
    transpose isn't mechanical, and the 8-cell tile is small enough
    that NEON's per-edge cost (~3 ns) is already inside the budget.
  - Total bench at 1080p: 8160 MBs × 4 chroma edges × 3 ns = ~100 us.
    Negligible compared to the IDCT layer's 10 ms (CPU NEON).

Coverage in fourier for the bS<4 8-bit 4:2:0 deblock matrix is now
complete: luma_v ✓, luma_h ✓, chroma_v ✓, chroma_h ✓.  Remaining
deblock work: bS=4 intra variants (luma + chroma, V + H).

What this unblocks downstream:
  - daedalus-decoder Stage 4 deblock can now dispatch all four bS<4
    edge categories that a typical inter MB needs.
2026-05-24 23:53:09 +02:00
marfrit f4af24020f Merge pull request 'h264: deblock_luma_h (CPU/NEON via vendored ff_h264_h_loop_filter)' (#9) from noether/h264-deblock-luma-h into main
Reviewed-on: #9
2026-05-24 21:47:57 +00:00
claude-noether 818e71560e gitignore: exclude .claude/ runtime files
The previous commit unintentionally added .claude/scheduled_tasks.lock
which is an agent-runtime artefact, not source. Untrack it and add
.claude/ to .gitignore so it stays out of future commits.
2026-05-24 23:29:06 +02:00
claude-noether 9d5451e0fe h264: deblock_luma_h — CPU/NEON via vendored ff_h264_h_loop_filter
Adds the horizontal-edge sibling of cycle 8's deblock_luma_v.  The
vendored FFmpeg snapshot already includes ff_h264_h_loop_filter_luma_neon
in libavcodec/aarch64/h264dsp_neon.S — this PR wires up the symbol,
the bit-exact reference, and the recipe-table entry so daedalus-decoder
and other consumers can call the H variant through the same dispatch
shape they use for _v.

Scope:
  - Public API: daedalus_dispatch_h264_deblock_luma_h(ctx, sub, ...)
    + daedalus_recipe_dispatch_h264_deblock_luma_h(ctx, ...) wrapper.
  - Internal: dispatch_h264_deblock_h_cpu() calls the NEON entry.
  - Recipe table: new DAEDALUS_KERNEL_H264_DEBLOCK_LH = 10, mapped
    to DAEDALUS_SUBSTRATE_CPU until a QPU shader is written.  An
    explicit SUBSTRATE_QPU request on the H dispatch returns -1
    (fails fast, no silent CPU degradation).
  - C reference: tests/h264_h_loop_filter_luma_ref.c — the
    column-axis transpose of h264_deblock_ref.c.  Same per-segment
    kernel; pix[-4..+3] accesses cols instead of rows*stride.
  - Test: test_api_h264 grows a test_deblock_h() with 8 tiles
    (8 cols x 16 rows each, edge at col 4), random alpha/beta/tc0;
    compares NEON dispatch against reference byte-for-byte.

Verified on hertz (Pi 5 / V3D 7.1):

  $ ./build/test_api_h264
  === Phase 8a API smoke: H.264 kernels via recipe dispatch ===
    H264_IDCT4 recipe substrate:      2 (1=CPU, 2=QPU)
    H264_IDCT8 recipe substrate:      2
    H264_DEBLOCK_LV recipe substrate: 2
    H264_QPEL_MC20 recipe substrate:  2
    H264_DEBLOCK_LH recipe substrate: 1 (CPU, no QPU H shader yet)
    H.264 IDCT 4x4: 2048/2048 bytes bit-exact (100.0000%)
    H.264 IDCT 8x8: 2048/2048 bytes bit-exact (100.0000%)
    H.264 deblock luma v: 2048/2048 bytes bit-exact (100.0000%)
    H.264 deblock luma h: 1024/1024 bytes bit-exact (100.0000%)
    H.264 qpel mc20: 1024/1024 bytes bit-exact (100.0000%)

  All 5 kernels bit-exact PASS.  The new H variant joins the suite
  with 128 random-input bytes per tile x 8 tiles (1024 bytes total).

Why CPU-only for now: the daedalus-decoder downstream needs the H
edge dispatched somewhere — even at CPU NEON cost (~6 ns/edge per
the cycle 8 M3 baseline) a frame's worth at 1080p is
~ 8160 MBs * 4 edges = 32 640 edges = ~200 us — well inside the
30 fps budget.  Writing the V3D H-edge shader is a follow-up
(would be cycle 8' or similar; the V-edge shader's transpose isn't
mechanical because of how the workgroup organisation maps to columns
vs rows).

Backlog addition (out of scope for this PR):
  - V3D shader for the H variant (mirror of v3d_h264deblock.spv).
  - bS=4 intra-strength filter (different algebra; both _v and _h).
  - Chroma deblock _v/_h (8-cell variants).
2026-05-24 23:28:56 +02:00
marfrit 0d54d68f38 Merge pull request 'cycle 9: V3D shader for H.264 luma qpel mc20 — closes 9/9 QPU coverage' (#8) from noether/v3d-shader-h264-qpel-mc20 into main
Reviewed-on: #8
2026-05-23 19:14:44 +00:00
claude-noether 79553c6e22 cycle 9: V3D shader for H.264 luma qpel mc20 — 9/9 QPU coverage
Closes the QPU-default substrate campaign per the 2026-05-23
decree: every daedalus-fourier kernel that can be done in QPU
is now done in QPU.  Cycle 9 is the last piece — 6-tap horizontal
half-pel luma motion compensation, H.264 §8.4.2.2.1.

Shader (src/v3d_h264_qpel_mc20.comp):

  - local_size = 64, 1 lane per output pixel of one 8x8 block,
    1 block per workgroup.  Simplest layout that avoids any
    inter-lane communication — V3D's L2 cache handles the
    redundant reads from adjacent lanes computing adjacent
    output columns.
  - Per-pixel: read 6 src samples (cols c-2..c+3 in row r),
    apply the (1, -5, 20, 20, -5, 1) / 32 filter with +16
    rounding, clip to u8, write one dst byte.
  - Single-stride convention matches FFmpeg's H264QpelContext
    (dst and src share `stride`; src+src_off points at output
    col 0 with the caller-guaranteed -2/+3 padding).
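
The per-pixel work described above, written as a scalar C sketch
rather than GLSL.  The function names are hypothetical and this is not
the shader source; it only illustrates the 6-tap filter, the +16
rounding and the single-stride convention.

```c
#include <stddef.h>
#include <stdint.h>

/* One output pixel; s points at column c, s[-2..+3] must be readable. */
static uint8_t mc20_pixel(const uint8_t *s)
{
    int v = s[-2] - 5 * s[-1] + 20 * s[0] + 20 * s[1] - 5 * s[2] + s[3] + 16;

    if (v < 0)          /* clip low before shifting a negative value */
        return 0;
    v >>= 5;            /* divide by 32 */
    return (uint8_t)(v > 255 ? 255 : v);
}

/* One 8x8 block; dst and src share the same stride. */
static void mc20_8x8_sketch(uint8_t *dst, const uint8_t *src, ptrdiff_t stride)
{
    for (int r = 0; r < 8; r++)
        for (int c = 0; c < 8; c++)
            dst[r * stride + c] = mc20_pixel(src + r * stride + c);
}
```

The taps sum to 32, so a flat row passes through unchanged, and each
iteration of the inner loop corresponds to one of the 64 lanes.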

Dispatch wiring (src/daedalus_core.c):

  - h264_qpel_mc20_pipe field on daedalus_ctx, lazy init.
  - dispatch_h264_qpel_mc20_qpu(): 3 SSBOs (src / dst / meta),
    src_max = src_off + 7*stride + 11 (covers the +3-col read
    footprint on the last row), dst_max = dst_off + 7*stride + 8.
    1 block per WG.
  - daedalus_dispatch_h264_qpel_mc20() replaces ROUTE_CPU_ONLY
    with the substrate-switch pattern matching the other H.264
    kernels.
  - Recipe table: H264_QPEL_MC20 returns SUBSTRATE_QPU.

Verification (hertz, Pi 5, V3D 7.1):

  $ ./test_api_h264
  === Phase 8a API smoke: H.264 kernels via recipe dispatch ===
    H264_IDCT4 recipe substrate:      2 (1=CPU, 2=QPU)
    H264_IDCT8 recipe substrate:      2
    H264_DEBLOCK_LV recipe substrate: 2
    H264_QPEL_MC20 recipe substrate:  2   ← flipped
    H.264 IDCT 4x4: 2048/2048 bytes bit-exact
    H.264 IDCT 8x8: 2048/2048 bytes bit-exact
    H.264 deblock luma v: 2048/2048 bytes bit-exact
    H.264 qpel mc20: 1024/1024 bytes bit-exact   ← QPU

First-iteration result was 1017/1024 (99.32%) — off-by-7 traced
to undersizing src_max in the host wrapper.  The filter reads
src_off + 7*stride + (7 + 3) = +10 at the last row last column;
add 1 for memcpy's [0..N-1] semantic → 11.  Fixed in the same
patch.
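
The footprint arithmetic behind that fix, as a small sketch.  The
helper names are hypothetical; the constants come straight from the
description above (8 rows, 8 output columns, 3 columns of right-hand
filter overhang).

```c
#include <stddef.h>

/* Bytes of src that must be uploaded for one 8x8 mc20 block. */
static size_t qpel_mc20_src_bytes(size_t src_off, size_t stride)
{
    /* Last byte read: last row, last output column, +3 filter tap. */
    const size_t last_read = src_off + 7 * stride + 7 + 3;

    return last_read + 1;   /* index -> byte count */
}

/* Bytes of dst touched by one 8x8 block. */
static size_t qpel_mc20_dst_bytes(size_t dst_off, size_t stride)
{
    return dst_off + 7 * stride + 7 + 1;
}
```

Both reduce to the src_off + 7*stride + 11 and dst_off + 7*stride + 8
sizes used by the host wrapper.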

All 9 daedalus-fourier cycles now QPU-by-recipe:

  cycle 1 VP9 IDCT 8x8         QPU
  cycle 2 VP9 LPF wd=4         QPU
  cycle 3 VP9 MC 8h            QPU
  cycle 4 VP9 LPF wd=8         QPU
  cycle 5 AV1 CDEF 8x8         QPU
  cycle 6 H.264 IDCT 4x4       QPU
  cycle 7 H.264 IDCT 8x8       QPU
  cycle 8 H.264 deblock luma-v QPU
  cycle 9 H.264 qpel mc20      QPU   ← this commit

Closes daedalus-fourier task #165.  Per the decree memory
[QPU is default substrate], the prototype now demonstrates GPU
acceleration on every measured kernel.
2026-05-23 21:05:36 +02:00
marfrit a092ee34aa Merge pull request 'QPU is default substrate: recipe table + ctx env-var override' (#7) from noether/qpu-default-recipe-cycles-5-8 into main
Reviewed-on: #7
2026-05-23 18:59:34 +00:00
marfrit c01754e849 Merge pull request 'v3d_runner: buffer pool for QPU dispatch hot path' (#6) from noether/v3d-buffer-pool into main
Reviewed-on: #6
2026-05-23 18:59:18 +00:00
claude-noether 74687d9def cycle 7: V3D shader for H.264 IDCT 8x8
Mirrors cycle 6 (PR #7 prior commit) but at 8x8 scale: 8 lanes per
block, 8 blocks per WG.  H.264 §8.5.13.2 1D butterfly twice (row
pass, column pass), (val + 32) >> 6 rounded + clipped + added to
dst.

Bit-exact first try on hertz (Pi 5, V3D 7.1):

  H264_IDCT4 recipe substrate:      2 (QPU)
  H264_IDCT8 recipe substrate:      2 (QPU)    ← flipped
  H264_DEBLOCK_LV recipe substrate: 2 (QPU)
  H264_QPEL_MC20 recipe substrate:  1 (CPU)    ← task #165 remaining
  H.264 IDCT 4x4: 2048/2048 bytes bit-exact
  H.264 IDCT 8x8: 2048/2048 bytes bit-exact    ← QPU
  H.264 deblock luma v: 2048/2048 bytes bit-exact
  H.264 qpel mc20: 1024/1024 bytes bit-exact

8 of 9 daedalus-fourier cycles now QPU-by-recipe.  Only cycle 9
(H.264 luma qpel mc20) still CPU — different shape (6-tap MC
filter, not a transform) so needs its own shader template; task
#165 covers it as a follow-on.

Same pattern as cycle 6 commit (65bd5c3): adds h264_idct8_pipe
field + lazy init, dispatch_h264_idct8_qpu() with 3 SSBOs,
v3d_h264_idct8.spv install rule.

Uses v3d_runner_create_buffer / destroy_buffer (will swap to
pool API once PR #6 lands).
2026-05-23 20:09:25 +02:00
claude-noether 65bd5c3fe3 cycle 6: V3D shader for H.264 IDCT 4x4 (first cycle-6 QPU dispatch)
Per the QPU-default substrate decree 2026-05-23, cycle 6 (H.264
IDCT 4x4 + add) was the highest-priority H.264 kernel to flip
from NEON-only to QPU-capable.  It has the same shape as VP9 IDCT
8x8 (cycle 1), a two-pass butterfly with shared-memory transpose,
but at 4x4 scale: 4 lanes per block, 16 blocks per WG.

What's added:

  - src/v3d_h264_idct4.comp: GLSL compute shader implementing
    the H.264 §8.5.12.1 1D butterfly twice (row pass then column
    pass), with (val + 32) >> 6 rounding and clip-to-u8 add to
    dst.  Block memory layout is column-major (matches FFmpeg
    `ff_h264_idct_add_neon` convention).

  - CMakeLists: glslang rule + install entry for v3d_h264_idct4.spv.

  - dispatch_h264_idct4_qpu() in daedalus_core.c: lazy pipeline
    init, 3 SSBOs (coeffs / dst / meta as uvec4), push-constant
    (n_blocks, dst_stride), 16 blocks per WG dispatch.  Matches
    the existing dispatch_*_qpu patterns; uses
    v3d_runner_create_buffer / destroy_buffer (will swap to
    pool API once PR #6 lands).

  - daedalus_dispatch_h264_idct4() replaces ROUTE_CPU_ONLY with
    the same CPU/QPU substrate switch the deblock dispatch uses.

  - daedalus_recipe_substrate_for(H264_IDCT4) returns QPU now
    that the shader exists.
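
For orientation, a scalar C sketch of the two-pass butterfly the
shader implements.  Names are hypothetical and this is not the shader
or the vendored reference; it assumes the column-major block layout
mentioned above (index = 4*col + row) and applies the +32 rounding
once at the end.

```c
#include <stddef.h>
#include <stdint.h>

static uint8_t clip_u8(int v)
{
    return (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
}

/* Floor shift right that stays well defined for negative v. */
static int asr(int v, int n)
{
    return v >= 0 ? v >> n : ~(~v >> n);
}

static void idct4_add_sketch(uint8_t *dst, ptrdiff_t stride,
                             const int16_t coeffs[16])
{
    int t[16];

    /* Pass 1: 1D butterfly over elements i, i+4, i+8, i+12. */
    for (int i = 0; i < 4; i++) {
        const int a = coeffs[i], b = coeffs[i + 4];
        const int c = coeffs[i + 8], d = coeffs[i + 12];
        const int z0 = a + c, z1 = a - c;
        const int z2 = asr(b, 1) - d, z3 = b + asr(d, 1);

        t[i] = z0 + z3;
        t[i + 4] = z1 + z2;
        t[i + 8] = z1 - z2;
        t[i + 12] = z0 - z3;
    }

    /* Pass 2: other axis, then round, add to dst and clip. */
    for (int i = 0; i < 4; i++) {
        const int a = t[4 * i], b = t[4 * i + 1];
        const int c = t[4 * i + 2], d = t[4 * i + 3];
        const int z0 = a + c, z1 = a - c;
        const int z2 = asr(b, 1) - d, z3 = b + asr(d, 1);
        uint8_t *col = dst + i;   /* block column i maps to dst column i */

        col[0 * stride] = clip_u8(col[0 * stride] + asr(z0 + z3 + 32, 6));
        col[1 * stride] = clip_u8(col[1 * stride] + asr(z1 + z2 + 32, 6));
        col[2 * stride] = clip_u8(col[2 * stride] + asr(z1 - z2 + 32, 6));
        col[3 * stride] = clip_u8(col[3 * stride] + asr(z0 - z3 + 32, 6));
    }
}
```

A DC-only block with coeffs[0] = 64 adds exactly 1 to every pixel,
which makes a convenient smoke test for the rounding.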

Verification on hertz (Pi 5 + V3D 7.1):

  $ ./test_api_h264
  === Phase 8a API smoke: H.264 kernels via recipe dispatch ===
    H264_IDCT4 recipe substrate:      2 (1=CPU, 2=QPU)
    H264_IDCT8 recipe substrate:      1
    H264_DEBLOCK_LV recipe substrate: 2
    H264_QPEL_MC20 recipe substrate:  1
    H.264 IDCT 4x4: 2048/2048 bytes bit-exact (100.0000%)   ← QPU
    H.264 IDCT 8x8: 2048/2048 bytes bit-exact
    H.264 deblock luma v: 2048/2048 bytes bit-exact
    H.264 qpel mc20: 1024/1024 bytes bit-exact

The AUTO-substrate path now picks QPU for H.264 IDCT 4x4, and
the output is bit-exact against the C reference (which is
identical to the NEON .S code by construction — same FFmpeg
upstream).

Remaining cycle-7/9 work in task #165:
  - cycle 7: H.264 IDCT 8x8 (template same shape; 8 lanes per
    block, fewer blocks per WG)
  - cycle 9: H.264 luma qpel mc20 (different shape — 6-tap MC
    not a transform)

This commit lands the cycle-6 piece of task #165.
2026-05-23 20:06:20 +02:00
claude-noether 737e87980d QPU is default substrate: recipe table + ctx env-var override
Per the user decree 2026-05-23 — "what can be done in QPU will
be done in QPU" — this lands two coupled changes that flip
production-decode kernels with existing V3D shaders from
CPU-by-recipe to QPU-by-recipe:

1) daedalus_recipe_substrate_for() returns SUBSTRATE_QPU for
   every kernel that has a shipped V3D compute shader:

    cycle 1 VP9 IDCT 8x8         QPU  (was QPU; unchanged)
    cycle 2 VP9 LPF wd=4         QPU  (was QPU; unchanged)
    cycle 3 VP9 MC 8h            QPU  (FLIPPED from CPU — v3d_mc_8h.spv)
    cycle 4 VP9 LPF wd=8         QPU  (was QPU; unchanged)
    cycle 5 AV1 CDEF 8x8         QPU  (FLIPPED from CPU — v3d_cdef.spv)
    cycle 6 H.264 IDCT 4x4       CPU  (no shader yet; task #165)
    cycle 7 H.264 IDCT 8x8       CPU  (no shader yet; task #165)
    cycle 8 H.264 deblock luma-v QPU  (FLIPPED from CPU — v3d_h264deblock.spv)
    cycle 9 H.264 qpel mc20      CPU  (no shader yet; task #165)

   The R-band cost/benefit framework still applies but is now
   superseded for substrate selection by the decree.  Where R
   stays RED, the cost is in dispatch overhead, which is a
   fixable engineering issue (tasks 160 buffer-pool, 161
   persistent cmdbuf, 162 dmabuf import).

2) daedalus_ctx_create_no_qpu() now honours an env-var override:
   set DAEDALUS_FORCE_QPU=1 in the process and create_no_qpu
   silently escalates to a full daedalus_ctx_create().  Lets
   the libavcodec substitution shims in marfrit-packages (which
   pthread_once a create_no_qpu ctx — see
   libavcodec/aarch64/h264_idct_daedalus.c) fire QPU paths
   without rebuilding those patches.

   Firefox / mpv consumers stay on the Vulkan-free path by
   default (env var unset).  The daedalus-v4l2 daemon will set
   DAEDALUS_FORCE_QPU=1 explicitly before dlopen'ing libavcodec
   (separate daedalus-v4l2 follow-up).
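
   A minimal sketch of how such an override check might look.  Only
   the variable name DAEDALUS_FORCE_QPU and the value 1 come from this
   message; the helper names are hypothetical, and treating every
   other value as "off" is an assumption.

```c
#include <stdlib.h>
#include <string.h>

/* Decide from the raw variable value; NULL means the variable is unset. */
static int force_qpu_from_value(const char *value)
{
    return value != NULL && strcmp(value, "1") == 0;
}

/* Read the process environment once at context-creation time. */
static int force_qpu_requested(void)
{
    return force_qpu_from_value(getenv("DAEDALUS_FORCE_QPU"));
}
```

   Splitting the string test from the getenv() call keeps the policy
   checkable without touching the process environment.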

Smoke (hertz, Pi 5, kernel 6.18.29):

  === test_api_h264 ===
    H264_IDCT4 recipe substrate:      1 (1=CPU, 2=QPU)
    H264_IDCT8 recipe substrate:      1
    H264_DEBLOCK_LV recipe substrate: 2     ← flipped
    H264_QPEL_MC20 recipe substrate:  1
    H.264 IDCT 4x4: 2048/2048 bytes bit-exact
    H.264 IDCT 8x8: 2048/2048 bytes bit-exact
    H.264 deblock luma v: 2048/2048 bytes bit-exact   ← QPU path
    H.264 qpel mc20: 1024/1024 bytes bit-exact

  === test_api_idct === all substrates (CPU/QPU/AUTO) bit-exact
  === test_api_lpf  === all substrates bit-exact wd=4 and wd=8

The dispatch wrapper's fall-through logic (eff == SUBSTRATE_QPU
&& !ctx_has_qpu(ctx) → eff = SUBSTRATE_CPU) handles the case
where the recipe says QPU but the consumer didn't opt in — it
falls back to CPU silently, no regression.
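
The fall-through can be sketched as a pure function.  The enum and
function names are hypothetical stand-ins for the real
DAEDALUS_SUBSTRATE_* values; CPU = 1 and QPU = 2 match the smoke
output above, while AUTO = 0 is an assumption.

```c
/* Hypothetical mirror of the substrate enum; AUTO = 0 is assumed. */
enum substrate {
    SUBSTRATE_AUTO = 0,
    SUBSTRATE_CPU = 1,
    SUBSTRATE_QPU = 2
};

/*
 * Resolve the substrate a dispatch actually runs on.
 * requested:   what the caller asked for (AUTO defers to the recipe)
 * recipe:      what the recipe table says for this kernel
 * ctx_has_qpu: non-zero if the context was created with a QPU backend
 */
static enum substrate effective_substrate(enum substrate requested,
                                          enum substrate recipe,
                                          int ctx_has_qpu)
{
    enum substrate eff = (requested == SUBSTRATE_AUTO) ? recipe : requested;

    /* Recipe says QPU but the consumer never opted in: use CPU. */
    if (eff == SUBSTRATE_QPU && !ctx_has_qpu)
        eff = SUBSTRATE_CPU;

    return eff;
}
```

Whether an explicit QPU request on a QPU-less context should degrade
like this or fail is a policy choice; the sketch only encodes the
condition quoted above.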

Closes daedalus-fourier tasks #163, #164.
Refs the 2026-05-23 "QPU default substrate" decree.
2026-05-23 19:59:53 +02:00
claude-noether 98553278dd v3d_runner: persistent per-pipeline command buffer
Phase 2 of the QPU-default substrate campaign — eliminate
vkAllocateCommandBuffers from the dispatch hot path.

Attaches a VkCommandBuffer to each v3d_pipeline, allocated once in
v3d_runner_create_pipeline() and freed in destroy_pipeline().  The
five dispatch_*_qpu sites switch from v3d_runner_alloc_cmdbuf() to
v3d_runner_pipeline_cmdbuf_reset() — vkResetCommandBuffer is O(1)
versus the driver-side allocation walk.  Pool was already created
with VK_COMMAND_POOL_CREATE_RESET_COMMAND_BUFFER_BIT so reset is
permitted.

Microbench (hertz, Pi 5, kernel 6.18.29, V3D 7.1):

  before (task 160 pool only):
    steady-state p50: 76.44 us
    steady-state mean: 77.95 us
  after (task 160 pool + task 161 persistent cb):
    steady-state p50: 54.56 us
    steady-state mean: 56.00 us
    -> 28% per-dispatch reduction

The remaining ~54 us steady-state is dominated by vkQueueWaitIdle +
shader execution + the two memcpy(in/out) on the dst buffer — task
162 (dmabuf import for dst) targets the memcpy half.

test_api_idct stays bit-exact across CPU/QPU/AUTO substrates.

Refs daedalus-fourier task #161.
2026-05-23 19:56:35 +02:00
claude-noether 0a042a8e95 v3d_runner: buffer pool for QPU dispatch hot path
Per the QPU-default substrate decree 2026-05-23: the per-dispatch
vkAllocateMemory in dispatch_*_qpu was the biggest single fixable
contributor to QPU dispatch overhead.  This pools v3d_buffer
allocations by power-of-2 size class so the second-and-subsequent
dispatch hits a freelist instead of paying ~10-50us of Mesa-V3D7
memory-allocation cost per call.

API additions (v3d_runner.h):
  - v3d_runner_acquire_buffer(): pulls from per-bucket freelist;
    falls through to v3d_runner_create_buffer() on miss.
  - v3d_runner_release_buffer(): pushes back onto the freelist; the
    backing VkBuffer/VkDeviceMemory only get vkFreeMemory'd in
    v3d_runner_destroy().
  - v3d_runner_pool_total_bytes(): diagnostic watermark.

Size classes 2^8..2^23 (256 B to 8 MiB).  Oversize requests fall
through to non-pooled (vkAllocateMemory) for both ends — pool stays
correct, just degenerates to old behaviour for those calls.
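
A minimal sketch of the size-class mapping, assuming requests are
rounded up to the next power of two.  The macro and function names are
hypothetical (not part of v3d_runner.h), and returning -1 for a
zero-byte request is an assumption.

```c
#include <stddef.h>

#define POOL_MIN_SHIFT 8    /* smallest class: 2^8  = 256 B */
#define POOL_MAX_SHIFT 23   /* largest class:  2^23 = 8 MiB */

/*
 * Map a request size to a freelist bucket index (0..15), or -1 when
 * the request should bypass the pool.
 */
static int pool_size_class(size_t size)
{
    int shift = POOL_MIN_SHIFT;

    if (size == 0)
        return -1;

    /* Smallest class whose capacity covers the request. */
    while (shift <= POOL_MAX_SHIFT && ((size_t)1 << shift) < size)
        shift++;

    return shift > POOL_MAX_SHIFT ? -1 : shift - POOL_MIN_SHIFT;
}
```

With 16 buckets, a -1 result is the signal to take the non-pooled
path for that call.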

Migration: daedalus_core.c dispatch_*_qpu paths globally swap
create_buffer → acquire_buffer and destroy_buffer → release_buffer.
All five QPU dispatch functions (idct8 / lpf / mc_8h / cdef /
h264_deblock) now reuse buffers across calls.  test_api_idct stays
bit-exact (4096/4096 bytes on CPU/QPU/AUTO substrates on hertz).

Microbench (tests/bench_pool_overhead.c) on hertz (Pi 5,
6.18.29+rpt-rpi-2712, V3D 7.1):

  call 0:  434.89 us  (cold — 3x vkAllocateMemory)
  call 1:  100.06 us  (pool hit on all 3 buffers)
  steady-state:
    p50:    76.44 us
    p90:    90.52 us
    mean:   77.95 us
  first-call / steady-state ratio: 5.7x

The remaining ~76us steady-state is dominated by vkQueueWaitIdle +
shader execution + per-call descriptor-set update + command-buffer
allocation — addressed in follow-on tasks 161 (persistent cmdbuf)
and 162 (dmabuf import for dst, eliminates memcpy in/out).

Refs daedalus-fourier task #160.
2026-05-23 19:52:50 +02:00
marfrit 3ecfc8b0ef Merge pull request 'docs: architecture backlog — correct fleet hardware mapping' (#4) from noether/architecture-backlog-fleet-fix into main
Reviewed-on: #4
2026-05-23 15:12:29 +00:00
marfrit c154253432 Merge pull request 'CMakeLists: make daedalus-fourier.pc relocatable via ${pcfiledir}' (#5) from noether/pkgconfig-relocatable-prefix into main
Reviewed-on: #5
2026-05-23 15:11:20 +00:00
claude-noether b3de96b21c CMakeLists: make daedalus-fourier.pc relocatable via ${pcfiledir}
The pkg-config file was generated at *configure* time with
`prefix=${CMAKE_INSTALL_PREFIX}`, which captured whatever
CMAKE_INSTALL_PREFIX happened to be set to at `cmake -B build`
time — typically the default `/usr/local`.  `cmake --install
build --prefix /foo` then put the files under /foo but the .pc
still pointed pkg-config at /usr/local/include and /usr/local/lib,
which broke downstream consumers configuring against the install
tree.

Concrete bite encountered today on hertz: the daedalus-v4l2 daemon
CMake configure on a /tmp/df-prefix install tree resolved
DAEDALUS_FOURIER_INCLUDE_DIRS to /usr/local/include (empty path on
the test host), so main.c failed to find <daedalus.h>.

Fix: write the .pc with `prefix=${pcfiledir}/<rel>` where <rel> is
the configure-time-computed relative path from
<prefix>/<libdir>/pkgconfig back to <prefix>.  pkg-config
substitutes ${pcfiledir} with the actual on-disk location of the
.pc at lookup time, so the resolved prefix tracks wherever the
install tree is moved to — including DESTDIR-staged builds, apt
package installs, and ad-hoc `cmake --install --prefix /tmp/foo`
test installs.

The relative-path computation handles GNUInstallDirs layouts that
add multiarch tuples (Debian's lib/aarch64-linux-gnu) without
hard-coding `../..`.  Tested on hertz (Debian trixie, libdir=lib):

  prefix=${pcfiledir}/../../
  ...
  $ pkg-config --variable=prefix daedalus-fourier
  /tmp/df-prefix-test/lib/pkgconfig/../../

  # mv preserves relocation:
  $ mv /tmp/df-prefix-test /tmp/df-prefix-moved
  $ pkg-config --variable=prefix daedalus-fourier
  /tmp/df-prefix-moved/lib/pkgconfig/../../

This unblocks the daedalus-v4l2 daemon out-of-tree builds against
local daedalus-fourier installs and is a prerequisite for tidy
test-rig deployments (per the hertz reload session 2026-05-23).
2026-05-23 16:55:31 +02:00
claude-noether 68dccd2911 docs: architecture backlog — correct fleet hardware mapping
Original draft (PR #3) speculated wrongly on host-to-SoC mapping:

  - hertz and tesla were listed under RK3588.  Verified via
    /proc/device-tree/compatible: both are raspberrypi,5-model-b /
    brcm,bcm2712 (tesla is an LXD container hosted on hertz, so
    necessarily shares the host SoC).
  - boltzmann (the only actual RK3588 in the fleet, 32 GB, kernel-
    dev / MCP hub, 8 W always-on) was omitted entirely.
  - noether (Pi 4 / BCM2711, the user's interactive workstation,
    where Firefox and mpv actually run) was omitted entirely.

Corrects the per-SoC coverage table:

    BCM2712 Pi 5  — higgs, hertz, broglie, tesla (LXD on hertz)
    BCM2711 Pi 4  — noether (workstation), dcw3, dcw2
    RK3588        — boltzmann
    Allwinner H6  — (not in fleet)

Reasoning consequences:

  - Pi 5 row is now four hosts but one SoC.  Adding a fifth Pi 5
    doesn't pressure-test the architecture; substrate decisions
    are identical across the row.
  - The realistic forcing function for the Pi 4 path is "HW decode
    on noether matters and rpivid is still unstable upstream" —
    noether is a daily-driver Pi 4 workstation, so this is closer
    than the original draft implied.
  - The realistic forcing function for an RK3588 caps file is
    "AV1 playback on boltzmann matters" — rkvdec doesn't cover
    AV1, so Mali Valhall compute substrate becomes the only HW
    acceleration option there.

"Re-read this when" list at the top + "Why deferred" section
+ decision log all updated.  No change to the architecture sketch
(caps directory, plugin layout, two-backend conclusion) — those
were correct in the original; only the host-to-SoC mapping
underneath them was wrong.

Refs PR #3 (the merged original).
2026-05-23 15:47:55 +02:00
marfrit 7d6f106919 Merge pull request 'docs: architecture backlog for multi-SoC daedalus generalization' (#3) from noether/architecture-backlog into main
Reviewed-on: #3
2026-05-23 03:31:56 +00:00
69 changed files with 9279 additions and 81 deletions
+1
@@ -11,3 +11,4 @@ build-*/
# Forensic snapshot of the corrupted .git from 2026-05-18 10:25
# working-tree wipe. Retained on disk for inspection; not tracked.
.git-broken-2026-05-18/
.claude/
+262 -2
@@ -284,7 +284,145 @@ if (DAEDALUS_BUILD_VULKAN)
VERBATIM
)
add_custom_target(daedalus_shaders ALL DEPENDS ${NOOP_SPV} ${IDCT8_SPV} ${LPF_SPV} ${MC_SPV} ${LPF8_SPV} ${CDEF_SPV} ${H264DEBLOCK_SPV})
set(H264DEBLOCK_H_SPV ${CMAKE_BINARY_DIR}/v3d_h264deblock_h.spv)
add_custom_command(
OUTPUT ${H264DEBLOCK_H_SPV}
COMMAND ${GLSLANG_VALIDATOR} -V --target-env vulkan1.3
-o ${H264DEBLOCK_H_SPV}
${CMAKE_SOURCE_DIR}/src/v3d_h264deblock_h.comp
DEPENDS ${CMAKE_SOURCE_DIR}/src/v3d_h264deblock_h.comp
COMMENT "glslang: v3d_h264deblock_h.comp -> v3d_h264deblock_h.spv"
VERBATIM
)
set(H264DEBLOCK_CHROMA_V_SPV ${CMAKE_BINARY_DIR}/v3d_h264deblock_chroma_v.spv)
add_custom_command(
OUTPUT ${H264DEBLOCK_CHROMA_V_SPV}
COMMAND ${GLSLANG_VALIDATOR} -V --target-env vulkan1.3
-o ${H264DEBLOCK_CHROMA_V_SPV}
${CMAKE_SOURCE_DIR}/src/v3d_h264deblock_chroma_v.comp
DEPENDS ${CMAKE_SOURCE_DIR}/src/v3d_h264deblock_chroma_v.comp
COMMENT "glslang: v3d_h264deblock_chroma_v.comp -> .spv"
VERBATIM
)
set(H264DEBLOCK_CHROMA_H_SPV ${CMAKE_BINARY_DIR}/v3d_h264deblock_chroma_h.spv)
add_custom_command(
OUTPUT ${H264DEBLOCK_CHROMA_H_SPV}
COMMAND ${GLSLANG_VALIDATOR} -V --target-env vulkan1.3
-o ${H264DEBLOCK_CHROMA_H_SPV}
${CMAKE_SOURCE_DIR}/src/v3d_h264deblock_chroma_h.comp
DEPENDS ${CMAKE_SOURCE_DIR}/src/v3d_h264deblock_chroma_h.comp
COMMENT "glslang: v3d_h264deblock_chroma_h.comp -> .spv"
VERBATIM
)
# Intra (bS=4) deblock shaders — strong/weak filter selector per
# H.264 §8.3.2.3. 4 variants (luma_v/h + chroma_v/h).
foreach(_kind luma_v_intra luma_h_intra chroma_v_intra chroma_h_intra)
set(_spv ${CMAKE_BINARY_DIR}/v3d_h264deblock_${_kind}.spv)
add_custom_command(
OUTPUT ${_spv}
COMMAND ${GLSLANG_VALIDATOR} -V --target-env vulkan1.3
-o ${_spv}
${CMAKE_SOURCE_DIR}/src/v3d_h264deblock_${_kind}.comp
DEPENDS ${CMAKE_SOURCE_DIR}/src/v3d_h264deblock_${_kind}.comp
COMMENT "glslang: v3d_h264deblock_${_kind}.comp -> .spv"
VERBATIM
)
set(H264DEBLOCK_${_kind}_SPV ${_spv})
endforeach()
set(H264_IDCT4_SPV ${CMAKE_BINARY_DIR}/v3d_h264_idct4.spv)
add_custom_command(
OUTPUT ${H264_IDCT4_SPV}
COMMAND ${GLSLANG_VALIDATOR} -V --target-env vulkan1.3
-o ${H264_IDCT4_SPV}
${CMAKE_SOURCE_DIR}/src/v3d_h264_idct4.comp
DEPENDS ${CMAKE_SOURCE_DIR}/src/v3d_h264_idct4.comp
COMMENT "glslang: v3d_h264_idct4.comp -> v3d_h264_idct4.spv"
VERBATIM
)
set(H264_IDCT8_SPV ${CMAKE_BINARY_DIR}/v3d_h264_idct8.spv)
add_custom_command(
OUTPUT ${H264_IDCT8_SPV}
COMMAND ${GLSLANG_VALIDATOR} -V --target-env vulkan1.3
-o ${H264_IDCT8_SPV}
${CMAKE_SOURCE_DIR}/src/v3d_h264_idct8.comp
DEPENDS ${CMAKE_SOURCE_DIR}/src/v3d_h264_idct8.comp
COMMENT "glslang: v3d_h264_idct8.comp -> v3d_h264_idct8.spv"
VERBATIM
)
set(H264_QPEL_MC20_SPV ${CMAKE_BINARY_DIR}/v3d_h264_qpel_mc20.spv)
add_custom_command(
OUTPUT ${H264_QPEL_MC20_SPV}
COMMAND ${GLSLANG_VALIDATOR} -V --target-env vulkan1.3
-o ${H264_QPEL_MC20_SPV}
${CMAKE_SOURCE_DIR}/src/v3d_h264_qpel_mc20.comp
DEPENDS ${CMAKE_SOURCE_DIR}/src/v3d_h264_qpel_mc20.comp
COMMENT "glslang: v3d_h264_qpel_mc20.comp -> v3d_h264_qpel_mc20.spv"
VERBATIM
)
set(H264_QPEL_MC02_SPV ${CMAKE_BINARY_DIR}/v3d_h264_qpel_mc02.spv)
add_custom_command(
OUTPUT ${H264_QPEL_MC02_SPV}
COMMAND ${GLSLANG_VALIDATOR} -V --target-env vulkan1.3
-o ${H264_QPEL_MC02_SPV}
${CMAKE_SOURCE_DIR}/src/v3d_h264_qpel_mc02.comp
DEPENDS ${CMAKE_SOURCE_DIR}/src/v3d_h264_qpel_mc02.comp
COMMENT "glslang: v3d_h264_qpel_mc02.comp -> v3d_h264_qpel_mc02.spv"
VERBATIM
)
set(H264_QPEL_MC22_SPV ${CMAKE_BINARY_DIR}/v3d_h264_qpel_mc22.spv)
add_custom_command(
OUTPUT ${H264_QPEL_MC22_SPV}
COMMAND ${GLSLANG_VALIDATOR} -V --target-env vulkan1.3
-o ${H264_QPEL_MC22_SPV}
${CMAKE_SOURCE_DIR}/src/v3d_h264_qpel_mc22.comp
DEPENDS ${CMAKE_SOURCE_DIR}/src/v3d_h264_qpel_mc22.comp
COMMENT "glslang: v3d_h264_qpel_mc22.comp -> v3d_h264_qpel_mc22.spv"
VERBATIM
)
# Quarter-pel single-axis variants (mc10/30/01/03) + diagonal
# variants (mc11/12/13/21/23/31/32/33) — each composes 1-2 half-pel
# results with optional L2 averaging. Same WG geometry as mc20/mc02.
foreach(_mc mc10 mc30 mc01 mc03 mc11 mc12 mc13 mc21 mc23 mc31 mc32 mc33)
set(_spv ${CMAKE_BINARY_DIR}/v3d_h264_qpel_${_mc}.spv)
add_custom_command(
OUTPUT ${_spv}
COMMAND ${GLSLANG_VALIDATOR} -V --target-env vulkan1.3
-o ${_spv}
${CMAKE_SOURCE_DIR}/src/v3d_h264_qpel_${_mc}.comp
DEPENDS ${CMAKE_SOURCE_DIR}/src/v3d_h264_qpel_${_mc}.comp
COMMENT "glslang: v3d_h264_qpel_${_mc}.comp -> .spv"
VERBATIM
)
set(H264_QPEL_${_mc}_SPV ${_spv})
endforeach()
# avg_ biprediction variants — same shader as put_ + extra L2 with
# existing dst. All 15 useful positions.
foreach(_mc mc20 mc02 mc22 mc10 mc30 mc01 mc03
mc11 mc12 mc13 mc21 mc23 mc31 mc32 mc33)
set(_spv ${CMAKE_BINARY_DIR}/v3d_h264_qpel_avg_${_mc}.spv)
add_custom_command(
OUTPUT ${_spv}
COMMAND ${GLSLANG_VALIDATOR} -V --target-env vulkan1.3
-o ${_spv}
${CMAKE_SOURCE_DIR}/src/v3d_h264_qpel_avg_${_mc}.comp
DEPENDS ${CMAKE_SOURCE_DIR}/src/v3d_h264_qpel_avg_${_mc}.comp
COMMENT "glslang: v3d_h264_qpel_avg_${_mc}.comp -> .spv"
VERBATIM
)
set(H264_QPEL_avg_${_mc}_SPV ${_spv})
endforeach()
add_custom_target(daedalus_shaders ALL DEPENDS ${NOOP_SPV} ${IDCT8_SPV} ${LPF_SPV} ${MC_SPV} ${LPF8_SPV} ${CDEF_SPV} ${H264DEBLOCK_SPV} ${H264DEBLOCK_H_SPV} ${H264DEBLOCK_CHROMA_V_SPV} ${H264DEBLOCK_CHROMA_H_SPV} ${H264DEBLOCK_luma_v_intra_SPV} ${H264DEBLOCK_luma_h_intra_SPV} ${H264DEBLOCK_chroma_v_intra_SPV} ${H264DEBLOCK_chroma_h_intra_SPV} ${H264_IDCT4_SPV} ${H264_IDCT8_SPV} ${H264_QPEL_MC20_SPV} ${H264_QPEL_MC02_SPV} ${H264_QPEL_MC22_SPV} ${H264_QPEL_mc10_SPV} ${H264_QPEL_mc30_SPV} ${H264_QPEL_mc01_SPV} ${H264_QPEL_mc03_SPV} ${H264_QPEL_mc11_SPV} ${H264_QPEL_mc12_SPV} ${H264_QPEL_mc13_SPV} ${H264_QPEL_mc21_SPV} ${H264_QPEL_mc23_SPV} ${H264_QPEL_mc31_SPV} ${H264_QPEL_mc32_SPV} ${H264_QPEL_mc33_SPV} ${H264_QPEL_avg_mc20_SPV} ${H264_QPEL_avg_mc02_SPV} ${H264_QPEL_avg_mc22_SPV} ${H264_QPEL_avg_mc10_SPV} ${H264_QPEL_avg_mc30_SPV} ${H264_QPEL_avg_mc01_SPV} ${H264_QPEL_avg_mc03_SPV} ${H264_QPEL_avg_mc11_SPV} ${H264_QPEL_avg_mc12_SPV} ${H264_QPEL_avg_mc13_SPV} ${H264_QPEL_avg_mc21_SPV} ${H264_QPEL_avg_mc23_SPV} ${H264_QPEL_avg_mc31_SPV} ${H264_QPEL_avg_mc32_SPV} ${H264_QPEL_avg_mc33_SPV})
# v3d_runner — reusable Vulkan plumbing.
add_library(v3d_runner STATIC src/v3d_runner.c)
@@ -358,6 +496,11 @@ endif()
add_library(daedalus_core STATIC
src/daedalus_core.c
src/h264_chroma_dc.c
src/h264_intra_pred_4x4.c
src/h264_intra_pred_16x16.c
src/h264_intra_pred_chroma8x8.c
src/h264_intra_pred_8x8_luma.c
src/v3d_runner.c
${FFASM_SOURCES}
${FFASM_LPF_SOURCES}
@@ -412,6 +555,45 @@ if (DAEDALUS_BUILD_VULKAN)
${LPF8_SPV}
${CDEF_SPV}
${H264DEBLOCK_SPV}
${H264DEBLOCK_H_SPV}
${H264DEBLOCK_CHROMA_V_SPV}
${H264DEBLOCK_CHROMA_H_SPV}
${H264DEBLOCK_luma_v_intra_SPV}
${H264DEBLOCK_luma_h_intra_SPV}
${H264DEBLOCK_chroma_v_intra_SPV}
${H264DEBLOCK_chroma_h_intra_SPV}
${H264_IDCT4_SPV}
${H264_IDCT8_SPV}
${H264_QPEL_MC20_SPV}
${H264_QPEL_MC02_SPV}
${H264_QPEL_MC22_SPV}
${H264_QPEL_mc10_SPV}
${H264_QPEL_mc30_SPV}
${H264_QPEL_mc01_SPV}
${H264_QPEL_mc03_SPV}
${H264_QPEL_mc11_SPV}
${H264_QPEL_mc12_SPV}
${H264_QPEL_mc13_SPV}
${H264_QPEL_mc21_SPV}
${H264_QPEL_mc23_SPV}
${H264_QPEL_mc31_SPV}
${H264_QPEL_mc32_SPV}
${H264_QPEL_mc33_SPV}
${H264_QPEL_avg_mc20_SPV}
${H264_QPEL_avg_mc02_SPV}
${H264_QPEL_avg_mc22_SPV}
${H264_QPEL_avg_mc10_SPV}
${H264_QPEL_avg_mc30_SPV}
${H264_QPEL_avg_mc01_SPV}
${H264_QPEL_avg_mc03_SPV}
${H264_QPEL_avg_mc11_SPV}
${H264_QPEL_avg_mc12_SPV}
${H264_QPEL_avg_mc13_SPV}
${H264_QPEL_avg_mc21_SPV}
${H264_QPEL_avg_mc23_SPV}
${H264_QPEL_avg_mc31_SPV}
${H264_QPEL_avg_mc32_SPV}
${H264_QPEL_avg_mc33_SPV}
DESTINATION ${CMAKE_INSTALL_DATADIR}/daedalus-fourier/shaders
)
endif()
@@ -419,9 +601,33 @@ endif()
# pkg-config file. Vulkan goes in Requires.private (consumer's
# pkg-config call gets it via --static). pthread + dl are needed
# by the static archive's runtime helpers.
#
# `prefix` is derived from ${pcfiledir} so the .pc is relocatable:
# pkg-config substitutes ${pcfiledir} with the directory holding the
# .pc at lookup time, and the relative path from
# <prefix>/<libdir>/pkgconfig back to <prefix> tells pkg-config the
# install prefix without baking it in. This is why
# `cmake --install build --prefix /foo` produces a .pc that correctly
# resolves `prefix=/foo` instead of baking whatever CMAKE_INSTALL_PREFIX
# was at *configure* time (default /usr/local). DESTDIR-staged
# installs work too: at runtime pkg-config sees the .pc at its real
# install path and computes the right prefix.
#
# Relative-path depth is computed from CMAKE_INSTALL_LIBDIR (and
# whatever multiarch tuple GNUInstallDirs adds) so Debian-style
# `lib/aarch64-linux-gnu/pkgconfig/...` resolves with the right number
# of `..` components. Layouts where libdir is *not* under prefix are
# not supported by this scheme; if a packager overrides libdir to an
# absolute path the relative-path machinery falls back to the absolute
# value (CMake's file(RELATIVE_PATH) prepends `..` until they meet),
# which is also relocatable but no longer prefix-agnostic.
file(RELATIVE_PATH PKGCONFIG_PCDIR_TO_PREFIX
"${CMAKE_INSTALL_PREFIX}/${CMAKE_INSTALL_LIBDIR}/pkgconfig"
"${CMAKE_INSTALL_PREFIX}")
set(PKGCONFIG_OUT ${CMAKE_CURRENT_BINARY_DIR}/daedalus-fourier.pc)
file(WRITE ${PKGCONFIG_OUT}
"prefix=${CMAKE_INSTALL_PREFIX}
"prefix=\${pcfiledir}/${PKGCONFIG_PCDIR_TO_PREFIX}
exec_prefix=\${prefix}
libdir=\${prefix}/${CMAKE_INSTALL_LIBDIR}
includedir=\${prefix}/${CMAKE_INSTALL_INCLUDEDIR}
@@ -459,7 +665,16 @@ add_executable(test_api_h264
tests/h264_idct4_ref.c
tests/h264_idct8_ref.c
tests/h264_deblock_ref.c
tests/h264_h_loop_filter_luma_ref.c
tests/h264_chroma_loop_filter_ref.c
tests/h264_intra_loop_filter_ref.c
tests/h264_qpel8_mc20_ref.c
tests/h264_qpel8_mc02_ref.c
tests/h264_qpel8_mc22_ref.c
tests/h264_qpel8_quarter_axis_ref.c
tests/h264_qpel8_diag_ref.c
tests/h264_qpel8_avg_anchors_ref.c
tests/h264_qpel8_avg_rest_ref.c
)
target_link_libraries(test_api_h264 PRIVATE daedalus_core)
target_compile_options(test_api_h264 PRIVATE -O2)
@@ -468,6 +683,51 @@ add_executable(test_api_opportunistic_qpu tests/test_api_opportunistic_qpu.c)
target_link_libraries(test_api_opportunistic_qpu PRIVATE daedalus_core)
target_compile_options(test_api_opportunistic_qpu PRIVATE -O2)
# H.264 Intra_4x4 luma prediction (9 modes) — public src primitives.
# The bodies now live in src/h264_intra_pred_4x4.c (linked into
# daedalus_core for use by libavcodec.so substitution-arc consumers).
# This test exercises the public symbols.
add_executable(test_intra_pred_4x4 tests/test_intra_pred_4x4.c)
target_link_libraries(test_intra_pred_4x4 PRIVATE daedalus_core)
target_compile_options(test_intra_pred_4x4 PRIVATE -O2)
# H.264 Intra_16x16 luma prediction (4 modes) — public src primitives,
# linked from daedalus_core.
add_executable(test_intra_pred_16x16 tests/test_intra_pred_16x16.c)
target_link_libraries(test_intra_pred_16x16 PRIVATE daedalus_core)
target_compile_options(test_intra_pred_16x16 PRIVATE -O2)
# H.264 Intra_8x8 chroma prediction (4 modes) — public src primitives.
add_executable(test_intra_pred_chroma8x8 tests/test_intra_pred_chroma8x8.c)
target_link_libraries(test_intra_pred_chroma8x8 PRIVATE daedalus_core)
target_compile_options(test_intra_pred_chroma8x8 PRIVATE -O2)
# H.264 Intra_8x8 luma prediction (High profile, 9 modes + 1-2-1
# pre-filter) — public src primitives.
add_executable(test_intra_pred_8x8_luma tests/test_intra_pred_8x8_luma.c)
target_link_libraries(test_intra_pred_8x8_luma PRIVATE daedalus_core)
target_compile_options(test_intra_pred_8x8_luma PRIVATE -O2)
# H.264 chroma DC 2x2 Hadamard pre-pass primitive. Pure transform,
# no QP-dependent scaling (that's caller-side composition).
add_executable(test_chroma_dc_hadamard
tests/test_chroma_dc_hadamard.c
tests/h264_chroma_dc_hadamard_ref.c
)
# Links daedalus_core to pull in the public daedalus_h264_chroma_dc_hadamard_2x2
# symbol (for the public-API parity test added in this PR).
target_link_libraries(test_chroma_dc_hadamard PRIVATE daedalus_core)
target_compile_options(test_chroma_dc_hadamard PRIVATE -O2)
# H.264 primitives latency benchmark (NEON CPU baseline).
add_executable(bench_h264_primitives tests/bench_h264_primitives.c)
target_link_libraries(bench_h264_primitives PRIVATE daedalus_core)
target_compile_options(bench_h264_primitives PRIVATE -O2)
add_executable(bench_pool_overhead tests/bench_pool_overhead.c)
target_link_libraries(bench_pool_overhead PRIVATE daedalus_core)
target_compile_options(bench_pool_overhead PRIVATE -O2)
if (DAEDALUS_BUILD_VULKAN)
# (re-open the conditional so the closing endif() below balances)
+17 -12
@@ -4,9 +4,9 @@
This document is forward-looking. It describes the generalized multi-SoC daedalus daemon architecture, but the immediate work block stays "finish Pi 5". Re-read this when:
- A second aarch64 host without a working kernel-side V4L2 stateless decoder shows up in the fleet (most likely candidate: Pi 4, which has V3D 4.x and no rpivid stable upstream).
- A specific working-copy slowdown that the current Pi-5-only daedalus can't address motivates the generalization.
- libva-v4l2-request-fourier evolves to need multi-node negotiation (currently it picks the first matching V4L2 node).
- HW decode on noether (Pi 4, the user's interactive workstation) becomes a real ask and rpivid upstream is still unstable. This is the most likely trigger: Pi 4 is the same SoC family as Pi 5 but has the weaker V3D 4.x, so it would need the caps-file mechanism plus an extra row's worth of substrate measurements.
- AV1 playback on boltzmann (RK3588) starts mattering. rkvdec doesn't cover AV1, so the daedalus path becomes the only HW-accelerated option, and Mali Valhall compute substrate decisions need their own caps row.
- libva-v4l2-request-fourier evolves to need multi-node negotiation (today it picks the first matching V4L2 node; a host with both rkvdec and daedalus-v4l2 nodes wants a preference policy).
Until then: this is decision context, not a TODO.
@@ -51,13 +51,17 @@ The mfritsche fleet has heterogeneous aarch64 hardware decoders:
| SoC | Host(s) | H.264 | HEVC | VP9 | AV1 | GPU compute |
|---|---|---|---|---|---|---|
| BCM2712 (Pi 5) | higgs, broglie | none | V3D7 (rpi-hevc-dec — SPS quirks) | none | none | V3D7 (Vulkan compute, queryable) |
| BCM2711 (Pi 4) | dcw3 | rpivid (out of tree, unstable) | rpivid (out of tree, unstable) | none | none | V3D4 (Vulkan compute, weaker) |
| RK3588 | hertz, tesla | rkvdec V4L2 stateless (upstream) | rkvdec V4L2 stateless | rkvdec V4L2 stateless | none (rkvdec lacks AV1) | Mali Valhall (panvk) + RK NPU |
| Allwinner H6 | (not in current fleet, but Cedrus exists) | Cedrus V4L2 | Cedrus V4L2 | none | none | Mali Bifrost |
| BCM2712 (Pi 5) | higgs, hertz, broglie, tesla (LXD on hertz) | none | V3D7 (rpi-hevc-dec — SPS quirks) | none | none | V3D7 (Vulkan compute, queryable) |
| BCM2711 (Pi 4) | noether (interactive workstation), dcw3, dcw2 | rpivid (out of tree, unstable) | rpivid (out of tree, unstable) | none | none | V3D4 (Vulkan compute, weaker) |
| RK3588 | boltzmann (32 GB, kernel-dev / MCP hub, 8 W always-on) | rkvdec V4L2 stateless (upstream) | rkvdec V4L2 stateless | rkvdec V4L2 stateless | none (rkvdec lacks AV1) | Mali Valhall (panvk-bifrost-video in dev) + RK NPU |
| Allwinner H6 | (not in current fleet, but Cedrus exists upstream) | Cedrus V4L2 | Cedrus V4L2 | none | none | Mali Bifrost |
No single SoC has a complete codec set. RK3588 lacks AV1; Pi 5 lacks H.264 + VP9 + AV1; Pi 4 has rpivid (out-of-tree, kernel-version-fragile); Allwinner Cedrus is H.264/HEVC only.
A note on the Pi 5 row: hertz and tesla share hardware (tesla is an LXD container hosted on hertz) but are operationally distinct — tesla is the distcc/MCP worker, hertz is the LXD host with all the cron automations and the 17-tool lmcp hub. From a daedalus deployment perspective they count as **one** Pi 5 substrate; from a workflow perspective they're separate boxes.
A note on noether: it's the user's interactive workstation (Pi 4, BCM2711). Firefox + mpv run here. Any "I want HW decode on my main box" pressure lands first on this host, which puts Pi 4 (V3D4 + maybe-rpivid) closer to the front of the queue than the original draft of this document suggested.
The current daedalus model — "kernel substitution + libavcodec front end" — is the right answer for **Pi 5 specifically**, where no usable kernel V4L2 stateless decoder exists for the codecs we care about, and a Vulkan-capable GPU (V3D7) is available to help on a few kernels.
The model is **not** the right answer for SoCs that already have working V4L2 stateless decoders for the requested codec — those should be passed through, not re-implemented through libavcodec + kernel substitution.
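
The routing rule in the two paragraphs above can be sketched as a small lookup. This is only an illustration under stated assumptions: every identifier below (`kdec_state`, `backend_id`, `route`, and so on) is hypothetical and not part of daedalus, and the table is a hand transcription of the per-SoC codec table. Treating Pi 5 HEVC (`rpi-hevc-dec`, SPS quirks) and Pi 4 rpivid as "fragile" rather than pass-through-ready is an assumption made for the example.

```c
#include <assert.h>

/* All names here are hypothetical, for illustration only. */
typedef enum { CODEC_H264, CODEC_HEVC, CODEC_VP9, CODEC_AV1, CODEC_COUNT } codec_id;
typedef enum { SOC_BCM2712, SOC_BCM2711, SOC_RK3588, SOC_COUNT } soc_id;
typedef enum { KDEC_NONE, KDEC_FRAGILE, KDEC_STABLE } kdec_state;
typedef enum { BACKEND_PASS_THROUGH, BACKEND_SUBSTRATE_COMPOSED } backend_id;

/* Kernel V4L2 stateless decoder status per SoC and codec, transcribed
 * from the table above. Column order: H.264, HEVC, VP9, AV1. */
static const kdec_state kdec_table[SOC_COUNT][CODEC_COUNT] = {
    [SOC_BCM2712] = { KDEC_NONE,    KDEC_FRAGILE, KDEC_NONE,   KDEC_NONE },
    [SOC_BCM2711] = { KDEC_FRAGILE, KDEC_FRAGILE, KDEC_NONE,   KDEC_NONE },
    [SOC_RK3588]  = { KDEC_STABLE,  KDEC_STABLE,  KDEC_STABLE, KDEC_NONE },
};

/* A working kernel decoder is passed through; anything else falls back
 * to the libavcodec + kernel-substitution path. */
static backend_id route(soc_id soc, codec_id codec)
{
    return kdec_table[soc][codec] == KDEC_STABLE
               ? BACKEND_PASS_THROUGH
               : BACKEND_SUBSTRATE_COMPOSED;
}
```

Under these assumptions `route(SOC_RK3588, CODEC_H264)` selects pass-through while `route(SOC_RK3588, CODEC_AV1)` and every Pi 5 codec select the substrate-composed path, which is the split the document argues for.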
@@ -207,15 +211,15 @@ Pass-through plugins are *thin* — they translate the daedalus daemon's wire pr
**Today's calculus:**
- Pi 5 daedalus path is the only thing in the fleet that uses daedalus daemon. Generalizing for a single user is overdesign.
- RK3588 uses rkvdec directly through libva-v4l2-request-fourier; daedalus daemon is **not in the path** for any RK3588 codec. The "RK3588 support" the architecture above proposes is mostly a no-op routing decision plus an AV1 fallback that doesn't yet measure on Mali.
- Pi 4 with rpivid is the only realistic second motivator. rpivid upstream stability is the gate — if it lands cleanly, Pi 4 takes the pass-through path with no kernel substitution needed. If it stays out-of-tree-fragile, **then** the substrate-composed path with V3D4 + NEON becomes the right backend for Pi 4, and we need the per-SoC caps mechanism to handle V3D4's weaker compute.
- Pi 5 (higgs + hertz + broglie + tesla) is **four hosts**, but **one SoC**. Adding the fifth Pi 5 host wouldn't pressure-test the architecture; they all share BCM2712 caps so the substrate decisions are identical across the row.
- boltzmann (RK3588) is the only non-Pi-5 always-on host in the fleet, and it uses rkvdec directly through libva-v4l2-request-fourier; daedalus daemon is **not in the path** for any RK3588 codec on it. The "RK3588 support" the architecture above proposes is mostly a no-op routing decision plus an AV1 fallback that doesn't yet measure on Mali. No forcing pressure from boltzmann today.
- noether (Pi 4, this user's interactive workstation) and dcw3/dcw2 (also Pi 4) are the real second-SoC candidates. The gate is rpivid upstream stability: if it lands cleanly, Pi 4 takes the pass-through path with zero kernel substitution work. If it stays out-of-tree-fragile, **then** the substrate-composed path with V3D4 + NEON becomes the right backend for Pi 4, and we need the per-SoC caps mechanism to handle V3D4's weaker compute.
- The recipe layer in daedalus-fourier already scales cleanly. Adding more substrates is incremental, not architectural.
**The forcing function that flips this from "deferred" to "do it":**
- Pi 4 enters daily use and rpivid is still not stable upstream — implies we need a Pi 4 substrate-composed path, which means at minimum a second caps file and the loader for it. At that point, building the full pluggable scaffold becomes proportionate.
- **Or:** an x86 host enters the fleet running mesa-panvk on a Pi-CM5-like board, and we need the daedalus daemon to discover it dynamically rather than being baked at build time.
- **noether-as-Firefox-host** — the user starts wanting HW decode on their main workstation and rpivid is still not stable upstream. Implies a Pi 4 substrate-composed path, which means at minimum a second caps file and the loader for it. At that point, building the full pluggable scaffold becomes proportionate. This is the most likely trigger; noether is already a daily-driver Pi 4.
- **boltzmann-as-AV1-decoder** — RK3588 has no AV1 HW decoder, and the user wants AV1 playback there (currently CPU-only). Triggers a cycle-5-equivalent measurement campaign on Mali Valhall to see whether `daedalus_recipe_dispatch_cdef_8x8` (or follow-on AV1 kernels) is worth running on Mali compute. If yes, we need an RK3588 caps file that overrides only the AV1 row while leaving H.264/HEVC/VP9 on rkvdec pass-through.
- **Or:** a third-party Pi 5 user needs to swap shaders for V3D firmware experiments without rebuilding the daemon — at that point dynamic shader loading + caps overrides become a feature ask.
Until one of those happens: keep daedalus daemon Pi 5 specific. Push cross-SoC abstraction *up* to libva-v4l2-request-fourier (which already does most of it) rather than *down* into the daemon.
@@ -242,6 +246,7 @@ Until one of those happens: keep daedalus daemon Pi 5 specific. Push cross-SoC a
|---|---|---|
| 2026-05-23 | **Defer generalization.** Finish Pi 5 substitution arc (cycle 9 PR #90 pending), then pivot to bug-fix backlog (daemon SEGV #145, D-state #146) before architecture work. | Architecture pivot is a multi-week scope; Pi 5 path is the only user-visible motivator today; deferring loses nothing because the recipe layer already abstracts kernels and libva-v4l2-request-fourier already abstracts V4L2 nodes. |
| 2026-05-23 | **Document the design now, even though it's deferred.** | Captures the conceptual gap (shaders ≠ hardware decoders) and the two-backend conclusion while the analysis is fresh; saves re-litigating in 3-6 months. |
| 2026-05-23 | **Correct fleet hardware mapping.** Original draft had hertz/tesla under RK3588 and omitted boltzmann + noether entirely. Verified via `/proc/device-tree/compatible`: hertz + tesla are Pi 5 (BCM2712), noether is Pi 4 (BCM2711), boltzmann is the only RK3588 in the fleet. Adjusted "Why deferred" / forcing-function reasoning accordingly — Pi 5 row is now 4 hosts (one SoC), noether is the realistic Pi 4 trigger, boltzmann is the realistic RK3588 trigger via AV1. | Original draft was speculative on host-to-SoC mapping; verified state changes which forcing functions are credible. |
---
+366
@@ -263,6 +263,102 @@ int daedalus_dispatch_h264_deblock_luma_v(daedalus_ctx *ctx, daedalus_substrate
uint8_t *dst, size_t dst_stride,
size_t n_edges, const daedalus_h264_deblock_meta *meta);
/* H.264 luma "h_loop_filter" — sibling of _v, applies filter
* HORIZONTALLY across a VERTICAL edge (16 rows tall; pix points to
* row 0 of the right block, col 0 = leftmost output column). Same
* non-intra (bS < 4) variant.
*
* Each tile is 8 cols x 16 rows of context (cols -4..+3 around the
* edge). dst_off points to row 0 col 0 of the RIGHT block.
*
* Constraint: (dst_off % dst_stride) >= 4 (the kernel reads p3 at
* pix[-4]). Caller must ensure this.
*
* QPU shader for the H variant is not yet implemented; recipe table
* routes AUTO to CPU NEON. An explicit DAEDALUS_SUBSTRATE_QPU on
* the _h dispatch returns -1 rather than silently degrading.
*/
int daedalus_recipe_dispatch_h264_deblock_luma_h(daedalus_ctx *ctx,
uint8_t *dst, size_t dst_stride,
size_t n_edges, const daedalus_h264_deblock_meta *meta);
int daedalus_dispatch_h264_deblock_luma_h(daedalus_ctx *ctx, daedalus_substrate sub,
uint8_t *dst, size_t dst_stride,
size_t n_edges, const daedalus_h264_deblock_meta *meta);
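
The `(dst_off % dst_stride) >= 4` constraint documented above is left to the caller. A caller-side pre-check could look like the sketch below; it is an assumption-laden illustration, not library code. `edge_meta` is a hypothetical stand-in that carries only the `dst_off` field named in the comment (the real `daedalus_h264_deblock_meta` layout is not shown in this diff), and `first_bad_luma_h_edge` is an invented helper name.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in: only the dst_off field matters for this check. */
typedef struct { size_t dst_off; } edge_meta;

/* Returns the index of the first edge that violates the luma_h
 * constraint (column offset within the row must be >= 4 so that p3 at
 * pix[-4] stays inside the row), or -1 if every edge is acceptable. */
static long first_bad_luma_h_edge(const edge_meta *meta, size_t n_edges,
                                  size_t dst_stride)
{
    for (size_t i = 0; i < n_edges; i++) {
        /* A zero stride cannot satisfy the constraint for any edge. */
        if (dst_stride == 0 || meta[i].dst_off % dst_stride < 4)
            return (long)i;
    }
    return -1;
}
```

Running such a check once per edge list, before the dispatch call, keeps the kernel itself free of per-edge bounds logic.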
/* H.264 chroma (4:2:0) loop filters — bS<4 variant. Chroma uses
* the SAME daedalus_h264_deblock_meta struct as luma but on smaller
* tiles: 8 cols × 4 rows for V (4 segments of 2 cols), 4 cols × 8
* rows for H (4 segments of 2 rows). Each segment has its own tc0
* strength (tc0[s] applies to both cells in segment s).
*
* Algorithm difference vs luma: chroma updates only p0 and q0
* (never p1/p2/q1/q2) and uses tC = tc0_seg + 1 directly (no
* luma-style ap/aq side-condition bonus).
*
* QPU shaders for chroma deblock not implemented yet; recipe table
* routes AUTO to CPU NEON. Explicit SUBSTRATE_QPU returns -1.
*/
int daedalus_recipe_dispatch_h264_deblock_chroma_v(daedalus_ctx *ctx,
uint8_t *dst, size_t dst_stride,
size_t n_edges, const daedalus_h264_deblock_meta *meta);
int daedalus_dispatch_h264_deblock_chroma_v(daedalus_ctx *ctx, daedalus_substrate sub,
uint8_t *dst, size_t dst_stride,
size_t n_edges, const daedalus_h264_deblock_meta *meta);
int daedalus_recipe_dispatch_h264_deblock_chroma_h(daedalus_ctx *ctx,
uint8_t *dst, size_t dst_stride,
size_t n_edges, const daedalus_h264_deblock_meta *meta);
int daedalus_dispatch_h264_deblock_chroma_h(daedalus_ctx *ctx, daedalus_substrate sub,
uint8_t *dst, size_t dst_stride,
size_t n_edges, const daedalus_h264_deblock_meta *meta);
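
The chroma bS<4 behaviour described in the comment above (only p0 and q0 change, with tC = tc0 + 1) can be sketched per sample as follows. This is a scalar illustration based on the standard H.264 normal-filter equations, not the NEON kernel the recipe dispatches to. The helper names are hypothetical, and skipping when tc0 is negative is an assumption borrowed from the FFmpeg convention.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

static int clip3(int lo, int hi, int v)
{
    return v < lo ? lo : (v > hi ? hi : v);
}

/* Filters one sample position across a chroma edge: p1 p0 | q0 q1.
 * Only p0 and q0 are written; p1 and q1 are read-only context. */
static void chroma_filter_sample(uint8_t p1, uint8_t *p0, uint8_t *q0,
                                 uint8_t q1, int alpha, int beta, int tc0)
{
    int tc = tc0 + 1;
    if (tc <= 0)
        return; /* assumption: negative tc0 marks an unfiltered segment */
    if (abs(*p0 - *q0) >= alpha || abs(p1 - *p0) >= beta ||
        abs(q1 - *q0) >= beta)
        return; /* edge looks like real picture content; leave it alone */
    /* Right shift of a negative int is arithmetic on the targets here. */
    int delta = clip3(-tc, tc, (((*q0 - *p0) * 4) + (p1 - q1) + 4) >> 3);
    *p0 = (uint8_t)clip3(0, 255, *p0 + delta);
    *q0 = (uint8_t)clip3(0, 255, *q0 - delta);
}
```

With p1 = p0 = 100, q0 = q1 = 110, alpha = 20, beta = 5 and tc0 = 1, the raw delta is 4 and is clipped to tC = 2, so the edge samples move to 102 and 108.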
/* H.264 bS=4 "intra" loop filters — used at I-MB and inter
* macroblock boundaries where boundary strength is forced to 4 per
* H.264 §8.7.2.1. Different algorithm from bS<4: per-side strong
* vs weak filter decided by quad-tree condition (luma only);
* chroma is always weak. No tc0 — the daedalus_h264_deblock_meta
* struct's tc0[] field is IGNORED for intra dispatches (callers can
* leave it uninitialised or share a single edge list across both
* intra and non-intra kernels).
*
* Reuses the same meta layout as bS<4 dispatches for alpha + beta +
* dst_off; tile geometry per orientation is identical to the bS<4
* sibling (16-col / 16-row luma; 8-col / 8-row chroma).
*
* QPU shaders not implemented for any of the four; recipe routes
* AUTO to CPU NEON. Explicit SUBSTRATE_QPU returns -1 (fast fail).
*/
int daedalus_recipe_dispatch_h264_deblock_luma_v_intra(daedalus_ctx *ctx,
uint8_t *dst, size_t dst_stride,
size_t n_edges, const daedalus_h264_deblock_meta *meta);
int daedalus_dispatch_h264_deblock_luma_v_intra(daedalus_ctx *ctx, daedalus_substrate sub,
uint8_t *dst, size_t dst_stride,
size_t n_edges, const daedalus_h264_deblock_meta *meta);
int daedalus_recipe_dispatch_h264_deblock_luma_h_intra(daedalus_ctx *ctx,
uint8_t *dst, size_t dst_stride,
size_t n_edges, const daedalus_h264_deblock_meta *meta);
int daedalus_dispatch_h264_deblock_luma_h_intra(daedalus_ctx *ctx, daedalus_substrate sub,
uint8_t *dst, size_t dst_stride,
size_t n_edges, const daedalus_h264_deblock_meta *meta);
int daedalus_recipe_dispatch_h264_deblock_chroma_v_intra(daedalus_ctx *ctx,
uint8_t *dst, size_t dst_stride,
size_t n_edges, const daedalus_h264_deblock_meta *meta);
int daedalus_dispatch_h264_deblock_chroma_v_intra(daedalus_ctx *ctx, daedalus_substrate sub,
uint8_t *dst, size_t dst_stride,
size_t n_edges, const daedalus_h264_deblock_meta *meta);
int daedalus_recipe_dispatch_h264_deblock_chroma_h_intra(daedalus_ctx *ctx,
uint8_t *dst, size_t dst_stride,
size_t n_edges, const daedalus_h264_deblock_meta *meta);
int daedalus_dispatch_h264_deblock_chroma_h_intra(daedalus_ctx *ctx, daedalus_substrate sub,
uint8_t *dst, size_t dst_stride,
size_t n_edges, const daedalus_h264_deblock_meta *meta);
/* -------------------------------------------------------------------
* H.264 luma qpel mc20 (8×8, horizontal half-pel) — cycle 9
* (CPU by recipe; per-block 7.6 ns NEON, QPU not viable — see
@@ -296,6 +392,240 @@ int daedalus_dispatch_h264_qpel_mc20(daedalus_ctx *ctx, daedalus_substrate sub,
uint8_t *dst, const uint8_t *src, size_t stride,
size_t n_blocks, const daedalus_h264_qpel_meta *meta);
/* H.264 luma qpel mc02 (vertical half-pel) — mirror of mc20.
* 6-tap filter applied vertically:
* dst[r,c] = clip255((s[r-2,c] - 5*s[r-1,c] + 20*s[r,c]
* + 20*s[r+1,c] - 5*s[r+2,c] + s[r+3,c]
* + 16) >> 5)
*
* Same single-stride convention as mc20. src + src_off points at
* row 0 col 0 of the OUTPUT block; the filter reads rows -2..+3, so
* the caller must guarantee 2 rows of top context and 3 rows of
* bottom context per block (FFmpeg edge-emulated buffer handles
* frame boundaries; same contract as mc20).
*
* QPU shader not implemented yet; recipe table routes AUTO to CPU
* NEON. Explicit DAEDALUS_SUBSTRATE_QPU returns -1.
*/
int daedalus_recipe_dispatch_h264_qpel_mc02(daedalus_ctx *ctx,
uint8_t *dst, const uint8_t *src, size_t stride,
size_t n_blocks, const daedalus_h264_qpel_meta *meta);
int daedalus_dispatch_h264_qpel_mc02(daedalus_ctx *ctx, daedalus_substrate sub,
uint8_t *dst, const uint8_t *src, size_t stride,
size_t n_blocks, const daedalus_h264_qpel_meta *meta);
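
For reference, the mc02 formula in the comment above maps onto a short scalar loop. The sketch below is an illustration only: `qpel8_mc02_ref` is a hypothetical name, it handles a single 8x8 block rather than an `n_blocks` list, and it is not the vendored NEON implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static uint8_t clip255(int v)
{
    return (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
}

/* Vertical half-pel for one 8x8 block. src points at row 0 col 0 of the
 * output block; rows -2..+3 around each output row must be readable.
 * dst and src share one stride, as in the single-stride convention. */
static void qpel8_mc02_ref(uint8_t *dst, const uint8_t *src, size_t stride)
{
    const ptrdiff_t st = (ptrdiff_t)stride; /* signed, for negative rows */
    for (int r = 0; r < 8; r++) {
        for (int c = 0; c < 8; c++) {
            const uint8_t *s = src + r * st + c;
            int v = s[-2 * st] - 5 * s[-st] + 20 * s[0]
                  + 20 * s[st] - 5 * s[2 * st] + s[3 * st] + 16;
            dst[r * st + c] = clip255(v >> 5);
        }
    }
}
```

A quick sanity check: the taps sum to 32, so a flat input comes back unchanged, and a vertical ramp rising by 2 per row yields the midpoint between each row and the next.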
/* H.264 luma qpel mc22 (2D half-pel "j" position per spec §8.4.2.2.1).
* Horizontal 6-tap cascaded into vertical 6-tap with intermediate
* 16-bit precision; final +512 >> 10 with clip255. Common position
* in real H.264 streams.
*
* src + src_off points at row 0 col 0 of the OUTPUT block; the
* cascade reads rows -2..+10 (13 rows of context) and cols -2..+10
* (13 cols of context). The caller must guarantee this context.
*
* QPU shader not implemented yet (the HV lowpass is the meatiest
* qpel kernel; structurally distinct from the 1D mc20 shader).
* Recipe routes AUTO to CPU NEON. Explicit SUBSTRATE_QPU returns -1.
*/
int daedalus_recipe_dispatch_h264_qpel_mc22(daedalus_ctx *ctx,
uint8_t *dst, const uint8_t *src, size_t stride,
size_t n_blocks, const daedalus_h264_qpel_meta *meta);
int daedalus_dispatch_h264_qpel_mc22(daedalus_ctx *ctx, daedalus_substrate sub,
uint8_t *dst, const uint8_t *src, size_t stride,
size_t n_blocks, const daedalus_h264_qpel_meta *meta);
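
The horizontal-then-vertical cascade described above can be sketched in scalar C. This is an illustrative reference under assumptions, not the shipped kernel: `qpel8_mc22_ref` and `tap6` are hypothetical names, one 8x8 block is processed, and the intermediate is kept in `int16_t` as the comment suggests.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static uint8_t clip255(int v)
{
    return (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
}

/* The H.264 6-tap kernel (1, -5, 20, 20, -5, 1), unnormalised. */
static int tap6(int a, int b, int c, int d, int e, int f)
{
    return a - 5 * b + 20 * c + 20 * d - 5 * e + f;
}

/* 2D half-pel ("j") for one 8x8 block. src points at row 0 col 0 of
 * the output block; dst and src share one stride. */
static void qpel8_mc22_ref(uint8_t *dst, const uint8_t *src, size_t stride)
{
    const ptrdiff_t st = (ptrdiff_t)stride;
    int16_t tmp[13][8]; /* horizontal pass for rows -2..+10 */

    for (int r = -2; r <= 10; r++) {
        const uint8_t *s = src + r * st;
        for (int c = 0; c < 8; c++)
            tmp[r + 2][c] = (int16_t)tap6(s[c - 2], s[c - 1], s[c],
                                          s[c + 1], s[c + 2], s[c + 3]);
    }
    /* Vertical pass over the unrounded intermediates; output row r
     * uses intermediate rows r-2..r+3, stored at tmp[r]..tmp[r+5]. */
    for (int r = 0; r < 8; r++) {
        for (int c = 0; c < 8; c++) {
            int v = tap6(tmp[r][c], tmp[r + 1][c], tmp[r + 2][c],
                         tmp[r + 3][c], tmp[r + 4][c], tmp[r + 5][c]);
            dst[r * st + c] = clip255((v + 512) >> 10);
        }
    }
}
```

The intermediate range is -2550..10710 for 8-bit input, which is why 16 bits are enough for the first pass, while the second pass needs a wider accumulator before the `+512 >> 10` normalisation.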
/* H.264 luma single-axis quarter-pel qpel positions ("put"):
* mc10 ¼-H ("a" position): clip255(mc20(s)) avg src[r,c]
* mc30 ¾-H ("c" position): clip255(mc20(s)) avg src[r,c+1]
* mc01 ¼-V ("d" position): clip255(mc02(s)) avg src[r,c]
* mc03 ¾-V ("n" position): clip255(mc02(s)) avg src[r+1,c]
*
* Each is a half-pel lowpass clipped to u8 then averaged with an
* integer-aligned source pixel (rounded +1 >> 1). Same edge
* context contract as mc20/mc02. CPU-only for now; QPU shaders
* not yet implemented. Explicit SUBSTRATE_QPU returns -1.
*/
int daedalus_recipe_dispatch_h264_qpel_mc10(daedalus_ctx *ctx,
uint8_t *dst, const uint8_t *src, size_t stride,
size_t n_blocks, const daedalus_h264_qpel_meta *meta);
int daedalus_dispatch_h264_qpel_mc10(daedalus_ctx *ctx, daedalus_substrate sub,
uint8_t *dst, const uint8_t *src, size_t stride,
size_t n_blocks, const daedalus_h264_qpel_meta *meta);
int daedalus_recipe_dispatch_h264_qpel_mc30(daedalus_ctx *ctx,
uint8_t *dst, const uint8_t *src, size_t stride,
size_t n_blocks, const daedalus_h264_qpel_meta *meta);
int daedalus_dispatch_h264_qpel_mc30(daedalus_ctx *ctx, daedalus_substrate sub,
uint8_t *dst, const uint8_t *src, size_t stride,
size_t n_blocks, const daedalus_h264_qpel_meta *meta);
int daedalus_recipe_dispatch_h264_qpel_mc01(daedalus_ctx *ctx,
uint8_t *dst, const uint8_t *src, size_t stride,
size_t n_blocks, const daedalus_h264_qpel_meta *meta);
int daedalus_dispatch_h264_qpel_mc01(daedalus_ctx *ctx, daedalus_substrate sub,
uint8_t *dst, const uint8_t *src, size_t stride,
size_t n_blocks, const daedalus_h264_qpel_meta *meta);
int daedalus_recipe_dispatch_h264_qpel_mc03(daedalus_ctx *ctx,
uint8_t *dst, const uint8_t *src, size_t stride,
size_t n_blocks, const daedalus_h264_qpel_meta *meta);
int daedalus_dispatch_h264_qpel_mc03(daedalus_ctx *ctx, daedalus_substrate sub,
uint8_t *dst, const uint8_t *src, size_t stride,
size_t n_blocks, const daedalus_h264_qpel_meta *meta);
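
The quarter-pel composition above (half-pel lowpass, clip to u8, then rounded average with an integer pixel) is sketched below for the two horizontal positions. It is an illustration under assumptions: `qpel8_quarter_h_ref` and its `int_col` parameter are hypothetical, with `int_col = 0` standing for mc10 and `int_col = 1` for mc30; the vertical pair mc01/mc03 would follow the same pattern along rows.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static uint8_t clip255(int v)
{
    return (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
}

/* Horizontal quarter-pel for one 8x8 block. int_col selects the
 * integer-aligned neighbour: 0 gives src[r,c] (mc10), 1 gives
 * src[r,c+1] (mc30). dst and src share one stride. */
static void qpel8_quarter_h_ref(uint8_t *dst, const uint8_t *src,
                                size_t stride, int int_col)
{
    const ptrdiff_t st = (ptrdiff_t)stride;
    for (int r = 0; r < 8; r++) {
        for (int c = 0; c < 8; c++) {
            const uint8_t *s = src + r * st + c;
            /* Step 1: mc20-style half-pel sample, clipped to u8. */
            int half = clip255((s[-2] - 5 * s[-1] + 20 * s[0]
                              + 20 * s[1] - 5 * s[2] + s[3] + 16) >> 5);
            /* Step 2: rounded average with the integer sample. */
            dst[r * st + c] = (uint8_t)((half + s[int_col] + 1) >> 1);
        }
    }
}
```

The clip happens before the average, not after, so the two steps cannot be merged into a single wider filter without changing results near 0 and 255.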
/* H.264 luma diagonal qpel positions ("put", 8 variants). Each is
* the rounded average of two half-pel intermediates per H.264
* §8.4.2.2.1 / Table 8-4 (decomposition matches the FFmpeg .S
* structure; see tests/h264_qpel8_diag_ref.c for the formulas).
*
* mc11 ¼¼ : avg(mc20[r,c], mc02[r,c])
* mc12 ¼½ : avg(mc22[r,c], mc02[r,c])
* mc13 ¼¾ : avg(mc20[r+1,c], mc02[r,c])
* mc21 ½¼ : avg(mc22[r,c], mc20[r,c])
* mc23 ½¾ : avg(mc22[r,c], mc20[r+1,c])
* mc31 ¾¼ : avg(mc20[r,c], mc02[r,c+1])
* mc32 ¾½ : avg(mc22[r,c], mc02[r,c+1])
* mc33 ¾¾ : avg(mc20[r+1,c], mc02[r,c+1])
*
* CPU-only via vendored FFmpeg NEON; QPU shaders pending.
* Explicit SUBSTRATE_QPU returns -1.
*/
#define DECLARE_QPEL_DIAG(name) \
int daedalus_recipe_dispatch_h264_qpel_ ## name(daedalus_ctx *ctx, \
uint8_t *dst, const uint8_t *src, size_t stride, \
size_t n_blocks, const daedalus_h264_qpel_meta *meta); \
int daedalus_dispatch_h264_qpel_ ## name(daedalus_ctx *ctx, daedalus_substrate sub, \
uint8_t *dst, const uint8_t *src, size_t stride, \
size_t n_blocks, const daedalus_h264_qpel_meta *meta);
DECLARE_QPEL_DIAG(mc11)
DECLARE_QPEL_DIAG(mc12)
DECLARE_QPEL_DIAG(mc13)
DECLARE_QPEL_DIAG(mc21)
DECLARE_QPEL_DIAG(mc23)
DECLARE_QPEL_DIAG(mc31)
DECLARE_QPEL_DIAG(mc32)
DECLARE_QPEL_DIAG(mc33)
#undef DECLARE_QPEL_DIAG
/* H.264 luma qpel avg_ biprediction variants, one per put_ position
* (15 in total: the three half-pel anchors mc20, mc02, mc22, then the
* axis-aligned and diagonal quarter-pel positions). The put_ result is
* averaged with rounding into the existing dst buffer per
* H.264 §8.4.2.3.1. Caller is responsible for pre-loading dst with
* the list0 prediction; the avg_ call adds list1.
*
* Same single-stride convention as put_; CPU NEON only for now.
*/
#define DECLARE_QPEL_AVG(name) \
int daedalus_recipe_dispatch_h264_qpel_ ## name(daedalus_ctx *ctx, \
uint8_t *dst, const uint8_t *src, size_t stride, \
size_t n_blocks, const daedalus_h264_qpel_meta *meta); \
int daedalus_dispatch_h264_qpel_ ## name(daedalus_ctx *ctx, daedalus_substrate sub, \
uint8_t *dst, const uint8_t *src, size_t stride, \
size_t n_blocks, const daedalus_h264_qpel_meta *meta);
DECLARE_QPEL_AVG(avg_mc20)
DECLARE_QPEL_AVG(avg_mc02)
DECLARE_QPEL_AVG(avg_mc22)
DECLARE_QPEL_AVG(avg_mc10)
DECLARE_QPEL_AVG(avg_mc30)
DECLARE_QPEL_AVG(avg_mc01)
DECLARE_QPEL_AVG(avg_mc03)
DECLARE_QPEL_AVG(avg_mc11)
DECLARE_QPEL_AVG(avg_mc12)
DECLARE_QPEL_AVG(avg_mc13)
DECLARE_QPEL_AVG(avg_mc21)
DECLARE_QPEL_AVG(avg_mc23)
DECLARE_QPEL_AVG(avg_mc31)
DECLARE_QPEL_AVG(avg_mc32)
DECLARE_QPEL_AVG(avg_mc33)
#undef DECLARE_QPEL_AVG
/* -------------------------------------------------------------------
* H.264 chroma DC 2x2 Hadamard pre-pass (per H.264 §8.5.11.1).
*
* Operates in-place on 4 int16 (the DC coefficients of an MB's
* chroma 4x4 AC blocks). Pure CPU primitive — no substrate
* dispatch wrapper because the work is 4 adds + 4 subs. Callers
* compose with QP-dependent scaling themselves; the scale shape
* varies by slice/PPS chroma_qp offset context.
*
* Bit-exact validated against tests/h264_chroma_dc_hadamard_ref.c
* (7-case spec-derived test suite including the H·H = 4·I algebraic
* invariant; see PR #23).
* ----------------------------------------------------------------- */
void daedalus_h264_chroma_dc_hadamard_2x2(int16_t c[4]);
/* -------------------------------------------------------------------
* H.264 Intra_4x4 luma prediction (per H.264 §8.3.1.4). 9 modes.
*
* Pure CPU primitives — each is a small straightforward fill of a
* 4x4 output block from neighbour pixels in the same buffer. No
* substrate-dispatch wrapper (the work is too small to amortise).
*
* FFmpeg-style interface: `dst` at row 0 col 0 of the 4x4 output.
* Reads top-left at dst[-stride-1], top at dst[-stride..-stride+7]
* (top-right for DDL/VL), and left at dst[r*stride - 1] for r=0..3.
* Caller must ensure all 13 neighbour bytes are valid (interior-MB
* assumption — H.264 availability fallback handled at caller).
*
* Bit-exact validated against tests/test_intra_pred_4x4.c (10-case
* spec-derived test suite including the asymmetric Vertical_Right
* 16-cell hand-derived case; see fourier PR #12).
* ----------------------------------------------------------------- */
void daedalus_h264_pred_4x4_vertical (uint8_t *dst, ptrdiff_t stride);
void daedalus_h264_pred_4x4_horizontal(uint8_t *dst, ptrdiff_t stride);
void daedalus_h264_pred_4x4_dc (uint8_t *dst, ptrdiff_t stride);
void daedalus_h264_pred_4x4_ddl (uint8_t *dst, ptrdiff_t stride);
void daedalus_h264_pred_4x4_ddr (uint8_t *dst, ptrdiff_t stride);
void daedalus_h264_pred_4x4_vr (uint8_t *dst, ptrdiff_t stride);
void daedalus_h264_pred_4x4_hd (uint8_t *dst, ptrdiff_t stride);
void daedalus_h264_pred_4x4_vl (uint8_t *dst, ptrdiff_t stride);
void daedalus_h264_pred_4x4_hu (uint8_t *dst, ptrdiff_t stride);
/* -------------------------------------------------------------------
* H.264 Intra_16x16 luma prediction (per §8.3.2). 4 modes:
* Vertical / Horizontal / DC / Plane. Same FFmpeg-style interface
* as the 4x4 family at 16x16 scale. Same neighbour availability
* assumption (interior-MB).
* ----------------------------------------------------------------- */
void daedalus_h264_pred_16x16_vertical (uint8_t *dst, ptrdiff_t stride);
void daedalus_h264_pred_16x16_horizontal(uint8_t *dst, ptrdiff_t stride);
void daedalus_h264_pred_16x16_dc (uint8_t *dst, ptrdiff_t stride);
void daedalus_h264_pred_16x16_plane (uint8_t *dst, ptrdiff_t stride);
/* -------------------------------------------------------------------
* H.264 Intra_8x8 chroma prediction (per §8.3.3, 4:2:0). 4 modes:
* DC / Horizontal / Vertical / Plane. Note: DC is per-quadrant
* asymmetric; Plane uses slope coefficient 34 (not luma's 5).
* ----------------------------------------------------------------- */
void daedalus_h264_pred_chroma8x8_dc (uint8_t *dst, ptrdiff_t stride);
void daedalus_h264_pred_chroma8x8_horizontal(uint8_t *dst, ptrdiff_t stride);
void daedalus_h264_pred_chroma8x8_vertical (uint8_t *dst, ptrdiff_t stride);
void daedalus_h264_pred_chroma8x8_plane (uint8_t *dst, ptrdiff_t stride);
/* -------------------------------------------------------------------
* H.264 Intra_8x8 luma prediction (High profile, per §8.3.2.1).
* 9 modes with the spec-defined 1-2-1 reference-sample pre-filter
* applied internally to the 25 neighbours before the mode arithmetic.
*
* "_8x8l" naming follows the FFmpeg h264pred_template convention
* (pred8x8l_<mode>_c) to keep the substitution wrappers a 1:1 name
* map.
* ----------------------------------------------------------------- */
void daedalus_h264_pred_8x8l_vertical (uint8_t *dst, ptrdiff_t stride);
void daedalus_h264_pred_8x8l_horizontal(uint8_t *dst, ptrdiff_t stride);
void daedalus_h264_pred_8x8l_dc (uint8_t *dst, ptrdiff_t stride);
void daedalus_h264_pred_8x8l_ddl (uint8_t *dst, ptrdiff_t stride);
void daedalus_h264_pred_8x8l_ddr (uint8_t *dst, ptrdiff_t stride);
void daedalus_h264_pred_8x8l_vr (uint8_t *dst, ptrdiff_t stride);
void daedalus_h264_pred_8x8l_hd (uint8_t *dst, ptrdiff_t stride);
void daedalus_h264_pred_8x8l_vl (uint8_t *dst, ptrdiff_t stride);
void daedalus_h264_pred_8x8l_hu (uint8_t *dst, ptrdiff_t stride);
/* -------------------------------------------------------------------
* Recipe query — what does the API recommend for each kernel?
* ----------------------------------------------------------------- */
@@ -309,6 +639,42 @@ typedef enum {
DAEDALUS_KERNEL_H264_IDCT8 = 7,
DAEDALUS_KERNEL_H264_DEBLOCK_LV = 8,
DAEDALUS_KERNEL_H264_QPEL_MC20 = 9,
DAEDALUS_KERNEL_H264_DEBLOCK_LH = 10,
DAEDALUS_KERNEL_H264_DEBLOCK_CV = 11,
DAEDALUS_KERNEL_H264_DEBLOCK_CH = 12,
DAEDALUS_KERNEL_H264_DEBLOCK_LV_INTRA = 13,
DAEDALUS_KERNEL_H264_DEBLOCK_LH_INTRA = 14,
DAEDALUS_KERNEL_H264_DEBLOCK_CV_INTRA = 15,
DAEDALUS_KERNEL_H264_DEBLOCK_CH_INTRA = 16,
DAEDALUS_KERNEL_H264_QPEL_MC02 = 17,
DAEDALUS_KERNEL_H264_QPEL_MC22 = 18,
DAEDALUS_KERNEL_H264_QPEL_MC10 = 19,
DAEDALUS_KERNEL_H264_QPEL_MC30 = 20,
DAEDALUS_KERNEL_H264_QPEL_MC01 = 21,
DAEDALUS_KERNEL_H264_QPEL_MC03 = 22,
DAEDALUS_KERNEL_H264_QPEL_MC11 = 23,
DAEDALUS_KERNEL_H264_QPEL_MC12 = 24,
DAEDALUS_KERNEL_H264_QPEL_MC13 = 25,
DAEDALUS_KERNEL_H264_QPEL_MC21 = 26,
DAEDALUS_KERNEL_H264_QPEL_MC23 = 27,
DAEDALUS_KERNEL_H264_QPEL_MC31 = 28,
DAEDALUS_KERNEL_H264_QPEL_MC32 = 29,
DAEDALUS_KERNEL_H264_QPEL_MC33 = 30,
DAEDALUS_KERNEL_H264_QPEL_AVG_MC20 = 31,
DAEDALUS_KERNEL_H264_QPEL_AVG_MC02 = 32,
DAEDALUS_KERNEL_H264_QPEL_AVG_MC22 = 33,
DAEDALUS_KERNEL_H264_QPEL_AVG_MC10 = 34,
DAEDALUS_KERNEL_H264_QPEL_AVG_MC30 = 35,
DAEDALUS_KERNEL_H264_QPEL_AVG_MC01 = 36,
DAEDALUS_KERNEL_H264_QPEL_AVG_MC03 = 37,
DAEDALUS_KERNEL_H264_QPEL_AVG_MC11 = 38,
DAEDALUS_KERNEL_H264_QPEL_AVG_MC12 = 39,
DAEDALUS_KERNEL_H264_QPEL_AVG_MC13 = 40,
DAEDALUS_KERNEL_H264_QPEL_AVG_MC21 = 41,
DAEDALUS_KERNEL_H264_QPEL_AVG_MC23 = 42,
DAEDALUS_KERNEL_H264_QPEL_AVG_MC31 = 43,
DAEDALUS_KERNEL_H264_QPEL_AVG_MC32 = 44,
DAEDALUS_KERNEL_H264_QPEL_AVG_MC33 = 45,
} daedalus_kernel;
daedalus_substrate daedalus_recipe_substrate_for(daedalus_kernel k);
+1658 -65
File diff suppressed because it is too large
+34
@@ -0,0 +1,34 @@
/* SPDX-License-Identifier: BSD-2-Clause */
/*
* H.264 chroma DC 2x2 Hadamard pre-pass (public, in-tree CPU).
*
* The 4 DC coefficients of an MB's chroma 4x4 AC blocks go through
* this 2x2 Hadamard before quant-scaling and re-injection into the
* AC blocks' [0,0] coefficient. Algorithm per H.264 §8.5.11.1.
*
* Pure CPU primitive — there's no substrate-dispatch wrapper because
* the work is 4 adds + 4 subs. Callers compose with QP-dependent
* scaling themselves (the scale shape varies by slice/PPS chroma_qp
* offset context and shouldn't be baked into the kernel).
*
* Bit-exact validated against tests/h264_chroma_dc_hadamard_ref.c
* (7-case spec-derived test suite including the H·H = 4·I algebraic
* invariant; see PR #23). Same algorithm; this is the public
* src-tree copy.
*/
#include "daedalus.h"
#include <stdint.h>
void daedalus_h264_chroma_dc_hadamard_2x2(int16_t c[4])
{
int t0 = c[0] + c[1];
int t1 = c[0] - c[1];
int t2 = c[2] + c[3];
int t3 = c[2] - c[3];
c[0] = (int16_t)(t0 + t2); /* f[0,0] = sum of all 4 */
c[1] = (int16_t)(t1 + t3); /* f[0,1] = col-difference */
c[2] = (int16_t)(t0 - t2); /* f[1,0] = row-difference */
c[3] = (int16_t)(t1 - t3); /* f[1,1] = main diagonal minus anti-diagonal */
}
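The H·H = 4·I invariant cited in the header comment can be checked directly: applying the transform twice should return each input multiplied by 4. The following is a minimal sketch, not the project's test suite. `hadamard_2x2` is a local copy of the function above, and `hadamard_twice_is_4x` is a hypothetical helper introduced only for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Local copy of the transform above so the sketch compiles on its own. */
static void hadamard_2x2(int16_t c[4])
{
    int t0 = c[0] + c[1];
    int t1 = c[0] - c[1];
    int t2 = c[2] + c[3];
    int t3 = c[2] - c[3];
    c[0] = (int16_t)(t0 + t2);
    c[1] = (int16_t)(t1 + t3);
    c[2] = (int16_t)(t0 - t2);
    c[3] = (int16_t)(t1 - t3);
}

/* Hypothetical helper: returns 1 if two passes yield 4 * in[i] for all i. */
static int hadamard_twice_is_4x(const int16_t in[4])
{
    int16_t c[4];
    for (int i = 0; i < 4; i++) {
        /* Keep 4 * in[i] (and every intermediate sum) inside int16_t. */
        assert(in[i] >= -8191 && in[i] <= 8191);
        c[i] = in[i];
    }
    hadamard_2x2(c);
    hadamard_2x2(c);
    for (int i = 0; i < 4; i++)
        if (c[i] != 4 * in[i]) return 0;
    return 1;
}
```

The range assertion is an assumption of this sketch only; the kernel itself leaves overflow handling to the caller.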
+106
@@ -0,0 +1,106 @@
/*
* Standalone bit-exact C reference for H.264 luma Intra_16x16
* prediction modes (per H.264 spec §8.3.2). All 4 modes.
*
* Mode index → name (per H.264 Table 7-15):
* 0 = Vertical
* 1 = Horizontal
* 2 = DC
* 3 = Plane
*
* Calling convention (FFmpeg-style, matches the Intra_4x4 ref):
* pred_16x16_<mode>(uint8_t *dst, ptrdiff_t stride)
*
* `dst` points at row 0, col 0 of the 16x16 output block. Neighbours:
* top[0..15] = dst[-stride + 0 .. -stride + 15]
* top-left = dst[-stride - 1]
* left[0..15] = dst[ 0*stride - 1 .. 15*stride - 1]
*
* AVAILABILITY: assumes all neighbours valid (interior-MB case). The
* H.264 spec defines fallback for boundary cases (DC averages just
* the available side, etc.); the eventual libavcodec intercept
* handles availability before calling.
*
* License: BSD-2-Clause.
*/
#include <stdint.h>
#include <stddef.h>
static inline int clip_u8(int v) { return v < 0 ? 0 : v > 255 ? 255 : v; }
/* Mode 0 — Vertical: each col = top[col]. */
void daedalus_h264_pred_16x16_vertical(uint8_t *dst, ptrdiff_t stride)
{
const uint8_t *top = dst - stride;
for (int r = 0; r < 16; r++)
for (int c = 0; c < 16; c++) dst[r * stride + c] = top[c];
}
/* Mode 1 — Horizontal: each row = left[row]. */
void daedalus_h264_pred_16x16_horizontal(uint8_t *dst, ptrdiff_t stride)
{
for (int r = 0; r < 16; r++) {
uint8_t l = dst[r * stride - 1];
for (int c = 0; c < 16; c++) dst[r * stride + c] = l;
}
}
/* Mode 2 — DC: ((sum_top16 + sum_left16 + 16) >> 5) broadcast. */
void daedalus_h264_pred_16x16_dc(uint8_t *dst, ptrdiff_t stride)
{
const uint8_t *top = dst - stride;
int sum = 16; /* rounding for >> 5 over 32 samples */
for (int i = 0; i < 16; i++) sum += top[i];
for (int i = 0; i < 16; i++) sum += dst[i * stride - 1];
uint8_t v = (uint8_t)(sum >> 5);
for (int r = 0; r < 16; r++)
for (int c = 0; c < 16; c++) dst[r * stride + c] = v;
}
/* Mode 3 — Plane (per H.264 §8.3.2.4):
* H = sum_{i=0..7} (i+1) * (p[7+i+1, -1] - p[7-i-1, -1])
* = sum_{i=0..7} (i+1) * (top[8+i] - top[6-i])
* V = sum_{j=0..7} (j+1) * (p[-1, 7+j+1] - p[-1, 7-j-1])
* = sum_{j=0..7} (j+1) * (left[8+j] - left[6-j])
* b = (5*H + 32) >> 6
* c = (5*V + 32) >> 6
* a = 16 * (p[-1, 15] + p[15, -1])
* = 16 * (left[15] + top[15])
* pred[y][x] = Clip1((a + b*(x-7) + c*(y-7) + 16) >> 5)
*
* Note: the spec indexes samples as p[x, y] with x = column and
* y = row; the code below uses the FFmpeg convention
* pred[y][x] = pred[row][col]. H (horizontal slope) measures the
* TOP row's left-vs-right asymmetry; V (vertical slope) measures the
* LEFT column's top-vs-bottom asymmetry. The top-left corner
* p[-1,-1] participates in both sums: at i = 7 (and j = 7) the
* subtracted index 6 - i is -1, so the last term is
* p[15,-1] - p[-1,-1] for H and p[-1,15] - p[-1,-1] for V.
*/
void daedalus_h264_pred_16x16_plane(uint8_t *dst, ptrdiff_t stride)
{
const uint8_t *top = dst - stride;
/* H accumulates differences across the right vs left half of the
* top row. Per spec, the top-left p[-1,-1] participates: i=7 uses
* p[15,-1] - p[-1,-1]. We include it by reading top[-1]. */
int H = 0, V = 0;
for (int i = 0; i < 8; i++) {
int t_right = top[8 + i];
int t_left = (i == 7) ? top[-1] : top[6 - i];
H += (i + 1) * (t_right - t_left);
}
for (int j = 0; j < 8; j++) {
int l_bot = dst[(8 + j) * stride - 1];
int l_top = (j == 7) ? top[-1] : dst[(6 - j) * stride - 1];
V += (j + 1) * (l_bot - l_top);
}
int b = (5 * H + 32) >> 6;
int c = (5 * V + 32) >> 6;
int a = 16 * (dst[15 * stride - 1] + top[15]);
for (int y = 0; y < 16; y++) {
for (int x = 0; x < 16; x++) {
int v = (a + b * (x - 7) + c * (y - 7) + 16) >> 5;
dst[y * stride + x] = (uint8_t) clip_u8(v);
}
}
}
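One way to sanity-check the plane arithmetic above, including the corner term at i = 7, is to feed it a horizontal ramp: with top[x] = base + x for x = -1..15 and a constant left column, H = 408, b = 32, c = 0, and the predictor should reproduce base + x in every row. The sketch below restates the same arithmetic in a local function; `plane_16x16`, `plane_reproduces_ramp`, and the 17x17 tile layout are assumptions made for this illustration, not part of the source tree.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { TILE_W = 17 };   /* 1 neighbour column + 16 block columns */

static int clip_u8(int v) { return v < 0 ? 0 : v > 255 ? 255 : v; }

/* Same arithmetic as the plane mode above, restated so this compiles alone. */
static void plane_16x16(uint8_t *dst, ptrdiff_t stride)
{
    const uint8_t *top = dst - stride;
    int H = 0, V = 0;
    for (int i = 0; i < 8; i++) {
        /* At i = 7 both subtracted samples resolve to the top-left corner. */
        H += (i + 1) * (top[8 + i] - top[6 - i]);
        V += (i + 1) * (dst[(8 + i) * stride - 1] - dst[(6 - i) * stride - 1]);
    }
    int b = (5 * H + 32) >> 6;
    int c = (5 * V + 32) >> 6;
    int a = 16 * (dst[15 * stride - 1] + top[15]);
    for (int y = 0; y < 16; y++)
        for (int x = 0; x < 16; x++)
            dst[y * stride + x] =
                (uint8_t)clip_u8((a + b * (x - 7) + c * (y - 7) + 16) >> 5);
}

/* Hypothetical helper: fills the 33 neighbours with the ramp base + x,
 * runs the predictor, and returns 1 if every output equals base + x. */
static int plane_reproduces_ramp(int base)
{
    assert(base >= 1 && base <= 240);        /* keep base-1 .. base+15 in 0..255 */
    uint8_t buf[TILE_W * TILE_W] = {0};
    uint8_t *dst = buf + TILE_W + 1;         /* row 0, col 0 of the block */
    for (int x = -1; x < 16; x++)            /* top row, including the corner */
        dst[-TILE_W + x] = (uint8_t)(base + x);
    for (int y = 0; y < 16; y++)             /* left column sits at x = -1 */
        dst[y * TILE_W - 1] = (uint8_t)(base - 1);
    plane_16x16(dst, TILE_W);
    for (int y = 0; y < 16; y++)
        for (int x = 0; x < 16; x++)
            if (dst[y * TILE_W + x] != base + x) return 0;
    return 1;
}
```

If the corner sample were left out of the H sum, H would fall from 408 to a base-dependent value and the ramp would no longer be reproduced, which makes this a cheap regression check for the i = 7 term.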
+238
@@ -0,0 +1,238 @@
/*
* Standalone bit-exact C reference for H.264 luma Intra_4x4
* prediction modes (per H.264 spec §8.3.1.4). All 9 modes.
*
* Mode index → name (per H.264 Table 8-2):
* 0 = Vertical
* 1 = Horizontal
* 2 = DC
* 3 = Diagonal_Down_Left
* 4 = Diagonal_Down_Right
* 5 = Vertical_Right
* 6 = Horizontal_Down
* 7 = Vertical_Left
* 8 = Horizontal_Up
*
* Calling convention matches FFmpeg's h264pred:
* pred_4x4_<mode>(uint8_t *dst, ptrdiff_t stride)
*
* `dst` points at row 0, col 0 of the 4x4 output block. Neighbour
* pixels come from the already-decoded surrounding pixels in the same
* buffer:
* top-left = dst[-stride - 1]
* top[0..3] = dst[-stride + 0 .. -stride + 3]
* top-right = dst[-stride + 4 .. -stride + 7] (DDL / VL only)
* left[0..3] = dst[ 0*stride - 1 .. 3*stride - 1]
*
* AVAILABILITY: this reference assumes ALL neighbours are available
* (the "interior MB" case). The H.264 spec defines fallback behaviour
* for unavailable neighbours (e.g. DC averages only the available
* side, top-right substitution from top[3] for DDL/VL near the right
* frame edge); those branches are NOT modelled here. Tests must
* exercise the kernel with all 13 neighbour bytes valid. The eventual
* libavcodec intercept handles availability before calling.
*
* License: BSD-2-Clause for the reference + tests; the underlying
* algorithm is from H.264/ITU-T H.264 (2003) and AVC standards, free
* to implement.
*/
#include <stdint.h>
#include <stddef.h>
/* Helper: 3-tap weighted average ((a + 2*b + c + 2) >> 2). */
static inline uint8_t avg3(int a, int b, int c)
{
return (uint8_t)((a + 2*b + c + 2) >> 2);
}
/* Helper: 2-tap mean ((a + b + 1) >> 1). */
static inline uint8_t avg2(int a, int b)
{
return (uint8_t)((a + b + 1) >> 1);
}
/* Mode 0 — Vertical: each col = top[col]. */
void daedalus_h264_pred_4x4_vertical(uint8_t *dst, ptrdiff_t stride)
{
const uint8_t *top = dst - stride;
for (int r = 0; r < 4; r++) {
for (int c = 0; c < 4; c++) dst[r * stride + c] = top[c];
}
}
/* Mode 1 — Horizontal: each row = left[row]. */
void daedalus_h264_pred_4x4_horizontal(uint8_t *dst, ptrdiff_t stride)
{
for (int r = 0; r < 4; r++) {
uint8_t l = dst[r * stride - 1];
for (int c = 0; c < 4; c++) dst[r * stride + c] = l;
}
}
/* Mode 2 — DC: mean of top 4 + left 4, broadcast. */
void daedalus_h264_pred_4x4_dc(uint8_t *dst, ptrdiff_t stride)
{
const uint8_t *top = dst - stride;
int sum = 4; /* rounding for ((sum + 4) >> 3) */
for (int i = 0; i < 4; i++) sum += top[i];
for (int i = 0; i < 4; i++) sum += dst[i * stride - 1];
uint8_t v = (uint8_t)(sum >> 3);
for (int r = 0; r < 4; r++)
for (int c = 0; c < 4; c++) dst[r * stride + c] = v;
}
/* Mode 3 — Diagonal_Down_Left. Uses top[0..7] (incl. top-right). */
void daedalus_h264_pred_4x4_ddl(uint8_t *dst, ptrdiff_t stride)
{
const uint8_t *top = dst - stride;
int t0 = top[0], t1 = top[1], t2 = top[2], t3 = top[3];
int t4 = top[4], t5 = top[5], t6 = top[6], t7 = top[7];
/* zz[7] = top filtered with 3-tap; spec table 8-7. */
uint8_t zz[7];
zz[0] = avg3(t0, t1, t2);
zz[1] = avg3(t1, t2, t3);
zz[2] = avg3(t2, t3, t4);
zz[3] = avg3(t3, t4, t5);
zz[4] = avg3(t4, t5, t6);
zz[5] = avg3(t5, t6, t7);
zz[6] = avg3(t6, t7, t7); /* spec: t7 doubled at the boundary */
/* dst[r][c] = zz[c + r] */
for (int r = 0; r < 4; r++)
for (int c = 0; c < 4; c++) dst[r * stride + c] = zz[c + r];
}
/* Mode 4 — Diagonal_Down_Right. Uses top-left + top[0..3] + left[0..3]. */
void daedalus_h264_pred_4x4_ddr(uint8_t *dst, ptrdiff_t stride)
{
int tl = dst[-stride - 1];
int t0 = dst[-stride + 0], t1 = dst[-stride + 1];
int t2 = dst[-stride + 2], t3 = dst[-stride + 3];
int l0 = dst[ 0*stride - 1], l1 = dst[ 1*stride - 1];
int l2 = dst[ 2*stride - 1], l3 = dst[ 3*stride - 1];
/* zz indexed by (col - row): -3..+3 */
uint8_t zz_m3 = avg3(l1, l2, l3);
uint8_t zz_m2 = avg3(l0, l1, l2);
uint8_t zz_m1 = avg3(tl, l0, l1);
uint8_t zz_p0 = avg3(l0, tl, t0);
uint8_t zz_p1 = avg3(tl, t0, t1);
uint8_t zz_p2 = avg3(t0, t1, t2);
uint8_t zz_p3 = avg3(t1, t2, t3);
uint8_t zz[7] = { zz_m3, zz_m2, zz_m1, zz_p0, zz_p1, zz_p2, zz_p3 };
for (int r = 0; r < 4; r++)
for (int c = 0; c < 4; c++) dst[r * stride + c] = zz[(c - r) + 3];
}
/* Mode 5 — Vertical_Right. */
void daedalus_h264_pred_4x4_vr(uint8_t *dst, ptrdiff_t stride)
{
int tl = dst[-stride - 1];
int t0 = dst[-stride + 0], t1 = dst[-stride + 1];
int t2 = dst[-stride + 2], t3 = dst[-stride + 3];
int l0 = dst[ 0*stride - 1], l1 = dst[ 1*stride - 1];
int l2 = dst[ 2*stride - 1];
/* H.264 §8.3.1.4.6: two patterns based on (2c - r) parity. */
dst[0*stride + 0] = avg2(tl, t0);
dst[0*stride + 1] = avg2(t0, t1);
dst[0*stride + 2] = avg2(t1, t2);
dst[0*stride + 3] = avg2(t2, t3);
dst[1*stride + 0] = avg3(l0, tl, t0);
dst[1*stride + 1] = avg3(tl, t0, t1);
dst[1*stride + 2] = avg3(t0, t1, t2);
dst[1*stride + 3] = avg3(t1, t2, t3);
dst[2*stride + 0] = avg3(tl, l0, l1);
dst[2*stride + 1] = dst[0*stride + 0];
dst[2*stride + 2] = dst[0*stride + 1];
dst[2*stride + 3] = dst[0*stride + 2];
dst[3*stride + 0] = avg3(l0, l1, l2);
dst[3*stride + 1] = dst[1*stride + 0];
dst[3*stride + 2] = dst[1*stride + 1];
dst[3*stride + 3] = dst[1*stride + 2];
}
/* Mode 6 — Horizontal_Down. */
void daedalus_h264_pred_4x4_hd(uint8_t *dst, ptrdiff_t stride)
{
int tl = dst[-stride - 1];
int t0 = dst[-stride + 0], t1 = dst[-stride + 1], t2 = dst[-stride + 2];
int l0 = dst[ 0*stride - 1], l1 = dst[ 1*stride - 1];
int l2 = dst[ 2*stride - 1], l3 = dst[ 3*stride - 1];
dst[0*stride + 0] = avg2(tl, l0);
dst[0*stride + 1] = avg3(l0, tl, t0);
dst[0*stride + 2] = avg3(tl, t0, t1);
dst[0*stride + 3] = avg3(t0, t1, t2);
dst[1*stride + 0] = avg2(l0, l1);
dst[1*stride + 1] = avg3(tl, l0, l1);
dst[1*stride + 2] = dst[0*stride + 0];
dst[1*stride + 3] = dst[0*stride + 1];
dst[2*stride + 0] = avg2(l1, l2);
dst[2*stride + 1] = avg3(l0, l1, l2);
dst[2*stride + 2] = dst[1*stride + 0];
dst[2*stride + 3] = dst[1*stride + 1];
dst[3*stride + 0] = avg2(l2, l3);
dst[3*stride + 1] = avg3(l1, l2, l3);
dst[3*stride + 2] = dst[2*stride + 0];
dst[3*stride + 3] = dst[2*stride + 1];
}
/* Mode 7 — Vertical_Left. Uses top[0..7]. */
void daedalus_h264_pred_4x4_vl(uint8_t *dst, ptrdiff_t stride)
{
const uint8_t *top = dst - stride;
int t0=top[0], t1=top[1], t2=top[2], t3=top[3];
int t4=top[4], t5=top[5], t6=top[6], t7=top[7];
dst[0*stride + 0] = avg2(t0, t1);
dst[0*stride + 1] = avg2(t1, t2);
dst[0*stride + 2] = avg2(t2, t3);
dst[0*stride + 3] = avg2(t3, t4);
dst[1*stride + 0] = avg3(t0, t1, t2);
dst[1*stride + 1] = avg3(t1, t2, t3);
dst[1*stride + 2] = avg3(t2, t3, t4);
dst[1*stride + 3] = avg3(t3, t4, t5);
dst[2*stride + 0] = avg2(t1, t2);
dst[2*stride + 1] = avg2(t2, t3);
dst[2*stride + 2] = avg2(t3, t4);
dst[2*stride + 3] = avg2(t4, t5);
dst[3*stride + 0] = avg3(t1, t2, t3);
dst[3*stride + 1] = avg3(t2, t3, t4);
dst[3*stride + 2] = avg3(t3, t4, t5);
dst[3*stride + 3] = avg3(t4, t5, t6);
(void) t7; /* only top[0..6] feed 4x4 VL; t7 is loaded but unused */
}
/* Mode 8 — Horizontal_Up. Uses left[0..3] only. */
void daedalus_h264_pred_4x4_hu(uint8_t *dst, ptrdiff_t stride)
{
int l0 = dst[ 0*stride - 1], l1 = dst[ 1*stride - 1];
int l2 = dst[ 2*stride - 1], l3 = dst[ 3*stride - 1];
dst[0*stride + 0] = avg2(l0, l1);
dst[0*stride + 1] = avg3(l0, l1, l2);
dst[0*stride + 2] = avg2(l1, l2);
dst[0*stride + 3] = avg3(l1, l2, l3);
dst[1*stride + 0] = avg2(l1, l2);
dst[1*stride + 1] = avg3(l1, l2, l3);
dst[1*stride + 2] = avg2(l2, l3);
dst[1*stride + 3] = avg3(l2, l3, l3);
dst[2*stride + 0] = avg2(l2, l3);
dst[2*stride + 1] = avg3(l2, l3, l3);
dst[2*stride + 2] = l3;
dst[2*stride + 3] = l3;
dst[3*stride + 0] = l3;
dst[3*stride + 1] = l3;
dst[3*stride + 2] = l3;
dst[3*stride + 3] = l3;
}
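The unrolled Vertical_Right table above can be cross-checked against the parity case split that its comment mentions. The sketch below writes the mode as a loop over zVR = 2*x - y and compares it with the unrolled assignments on pseudo-random neighbours. The case split is my reading of the spec text and should be treated as an assumption; `vr_by_formula`, `vr_unrolled`, `vr_forms_agree`, and the 5x8 tile are hypothetical names introduced for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

static uint8_t avg3(int a, int b, int c) { return (uint8_t)((a + 2 * b + c + 2) >> 2); }
static uint8_t avg2(int a, int b)        { return (uint8_t)((a + b + 1) >> 1); }

/* Spec-style sample access: x = column, y = row, -1 selects a neighbour. */
#define P(x, y) ((int)dst[(y) * stride + (x)])

/* Vertical_Right as a case split on zVR = 2*x - y (assumed spec reading). */
static void vr_by_formula(const uint8_t *dst, ptrdiff_t stride, uint8_t out[16])
{
    for (int y = 0; y < 4; y++) {
        for (int x = 0; x < 4; x++) {
            int z = 2 * x - y, k = x - (y >> 1);
            uint8_t v;
            if (z >= 0 && (z & 1) == 0) v = avg2(P(k - 1, -1), P(k, -1));
            else if (z > 0)             v = avg3(P(k - 2, -1), P(k - 1, -1), P(k, -1));
            else if (z == -1)           v = avg3(P(-1, 0), P(-1, -1), P(0, -1));
            else                        v = avg3(P(-1, y - 1), P(-1, y - 2), P(-1, y - 3));
            out[y * 4 + x] = v;
        }
    }
}

/* The unrolled table from the reference above, writing to out[] instead of dst. */
static void vr_unrolled(const uint8_t *dst, ptrdiff_t stride, uint8_t out[16])
{
    int tl = P(-1, -1), t0 = P(0, -1), t1 = P(1, -1), t2 = P(2, -1), t3 = P(3, -1);
    int l0 = P(-1, 0), l1 = P(-1, 1), l2 = P(-1, 2);
    uint8_t r0[4] = { avg2(tl, t0), avg2(t0, t1), avg2(t1, t2), avg2(t2, t3) };
    uint8_t r1[4] = { avg3(l0, tl, t0), avg3(tl, t0, t1), avg3(t0, t1, t2), avg3(t1, t2, t3) };
    uint8_t r2[4] = { avg3(tl, l0, l1), r0[0], r0[1], r0[2] };
    uint8_t r3[4] = { avg3(l0, l1, l2), r1[0], r1[1], r1[2] };
    memcpy(out + 0, r0, 4);  memcpy(out + 4, r1, 4);
    memcpy(out + 8, r2, 4);  memcpy(out + 12, r3, 4);
}

/* Hypothetical check: returns 1 if both forms agree on a tile filled
 * from `seed` by a small LCG (used only to generate test data). */
static int vr_forms_agree(unsigned seed)
{
    enum { S = 8 };                       /* tile stride; 5 rows x 8 cols */
    uint8_t tile[5 * S];
    for (int i = 0; i < 5 * S; i++) {
        seed = seed * 1103515245u + 12345u;
        tile[i] = (uint8_t)(seed >> 16);
    }
    const uint8_t *dst = tile + S + 1;    /* row 0, col 0 of the 4x4 block */
    uint8_t a[16], b[16];
    vr_by_formula(dst, S, a);
    vr_unrolled(dst, S, b);
    return memcmp(a, b, 16) == 0;
}
```

Both functions read only the neighbour row and column, so the 4x4 interior of the tile can hold arbitrary bytes without affecting the comparison.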
+305
@@ -0,0 +1,305 @@
/*
* Standalone bit-exact C reference for H.264 luma Intra_8x8
* prediction modes (per H.264 spec §8.3.2.1). High-profile-only
* MB type — Baseline/Main/Extended profiles don't see Intra_8x8.
*
* Distinct from Intra_4x4 in two ways:
*
* 1. REFERENCE SAMPLE FILTERING (§8.3.2.1.1). The 25 raw
* neighbour samples are pre-filtered with a 1-2-1 smoothing
* filter BEFORE prediction. The filtering has spec-defined
* boundary handling at the corners and the right-edge of the
* top-row extension.
*
* 2. SCALE. All 9 prediction modes operate at 8x8 with the
* filtered samples (Intra_4x4 operates at 4x4 with the raw
* samples).
*
* This file implements the filter and all 9 modes: the 3 simple modes
* (Vertical, Horizontal, DC) and the 6 directional modes (DDL, DDR, VR,
* HD, VL, HU at 8x8), which share the same template with different
* formulas per spec sections §8.3.2.1.5..§8.3.2.1.10.
*
* Calling convention (FFmpeg-style):
* daedalus_h264_pred_8x8l_<mode>(uint8_t *dst, ptrdiff_t stride)
*
* `dst` points at row 0 col 0 of the 8x8 output block. Reads from
* top[0..15] = dst[-stride + 0..15]
* top-left = dst[-stride - 1]
* left[0..7] = dst[ 0*stride - 1 .. 7*stride - 1]
*
* AVAILABILITY: assumes all neighbours valid (interior-MB case).
*
* License: BSD-2-Clause.
*/
#include <stdint.h>
#include <stddef.h>
#include <string.h>
static inline int clip_u8(int v) { return v < 0 ? 0 : v > 255 ? 255 : v; }
/* H.264 §8.3.2.1.1 reference sample filtering. Filters the 25 raw
* samples around the 8x8 block into a `filt` array with the same
* indices. When called against an "all neighbours available" tile,
* the filtered output uses these spec-defined formulas:
*
* filt[top -1] (= filtered top-left) = (top[0] + 2*tl + left[0] + 2) >> 2
*
* filt[top 0] = (tl + 2*top[0] + top[1] + 2) >> 2
* filt[top i] for 1<=i<=14 = (top[i-1] + 2*top[i] + top[i+1] + 2) >> 2
* filt[top 15] = (top[14] + 3*top[15] + 2) >> 2 (boundary)
*
* filt[left 0] = (tl + 2*left[0] + left[1] + 2) >> 2
* filt[left j] for 1<=j<=6 = (left[j-1] + 2*left[j] + left[j+1] + 2) >> 2
* filt[left 7] = (left[6] + 3*left[7] + 2) >> 2 (boundary)
*
* Reads neighbours from the dst buffer; writes filtered values to
* a caller-provided 25-element array indexed as:
* filt[0] = filtered top-left
* filt[1..16] = filtered top[0..15]
* filt[17..24] = filtered left[0..7]
*/
static void filter_refs(const uint8_t *dst, ptrdiff_t stride,
uint8_t filt[25])
{
int tl = dst[-stride - 1];
int t[16];
for (int i = 0; i < 16; i++) t[i] = dst[-stride + i];
int l[8];
for (int j = 0; j < 8; j++) l[j] = dst[j * stride - 1];
/* Filtered top-left. */
filt[0] = (uint8_t)((t[0] + 2*tl + l[0] + 2) >> 2);
/* Filtered top. */
filt[1] = (uint8_t)((tl + 2*t[0] + t[1] + 2) >> 2);
for (int i = 1; i <= 14; i++)
filt[1 + i] = (uint8_t)((t[i-1] + 2*t[i] + t[i+1] + 2) >> 2);
filt[1 + 15] = (uint8_t)((t[14] + 3*t[15] + 2) >> 2);
/* Filtered left. */
filt[17 + 0] = (uint8_t)((tl + 2*l[0] + l[1] + 2) >> 2);
for (int j = 1; j <= 6; j++)
filt[17 + j] = (uint8_t)((l[j-1] + 2*l[j] + l[j+1] + 2) >> 2);
filt[17 + 7] = (uint8_t)((l[6] + 3*l[7] + 2) >> 2);
}
/* Convenience macros for accessing the filt[] array by spec-style index. */
#define FT(i) filt[1 + (i)] /* filtered top[i], i in 0..15 */
#define FL(j) filt[17 + (j)] /* filtered left[j], j in 0..7 */
#define FTL filt[0] /* filtered top-left */
/* Mode 0 Vertical (§8.3.2.1.2): pred[r,c] = filt_top[c]. */
void daedalus_h264_pred_8x8l_vertical(uint8_t *dst, ptrdiff_t stride)
{
uint8_t filt[25];
filter_refs(dst, stride, filt);
for (int r = 0; r < 8; r++)
for (int c = 0; c < 8; c++) dst[r * stride + c] = FT(c);
}
/* Mode 1 Horizontal (§8.3.2.1.3): pred[r,c] = filt_left[r]. */
void daedalus_h264_pred_8x8l_horizontal(uint8_t *dst, ptrdiff_t stride)
{
uint8_t filt[25];
filter_refs(dst, stride, filt);
for (int r = 0; r < 8; r++)
for (int c = 0; c < 8; c++) dst[r * stride + c] = FL(r);
}
/* Mode 2 DC (§8.3.2.1.4): ((sum_filt_top[0..7] + sum_filt_left[0..7]
* + 8) >> 4) broadcast. Note the +8 (not +4 like 4x4): there are
* 16 samples summed total, so >> 4 with half-step rounding +8. */
void daedalus_h264_pred_8x8l_dc(uint8_t *dst, ptrdiff_t stride)
{
uint8_t filt[25];
filter_refs(dst, stride, filt);
int sum = 8;
for (int i = 0; i < 8; i++) sum += FT(i);
for (int j = 0; j < 8; j++) sum += FL(j);
uint8_t v = (uint8_t)(sum >> 4);
for (int r = 0; r < 8; r++)
for (int c = 0; c < 8; c++) dst[r * stride + c] = v;
}
/* --- 6 directional modes for Intra_8x8 (H.264 §8.3.2.1.5..§8.3.2.1.10).
* Transcribed from FFmpeg libavcodec/h264pred_template.c
* pred8x8l_{down_left, down_right, vertical_right, horizontal_down,
* vertical_left, horizontal_up} (LGPL-2.1+ in the original; algorithm
* reproduced here for test purposes).
*
* All 6 use the same FILTERED reference samples produced by
* filter_refs() above. Mapping from FFmpeg's t0..t15 / l0..l7 / lt
* notation:
* tN = FT(N) for N in 0..15
* lN = FL(N) for N in 0..7
* lt = FTL
*
* SRC(x,y) maps to dst[y*stride + x] (col x, row y).
*/
#define SRC(x, y) dst[(y) * stride + (x)]
#define T(i) FT(i)
#define L(j) FL(j)
#define LT FTL
/* Mode 3 DDL (Diagonal_Down_Left) — uses TOP + TOP_RIGHT, no LEFT. */
void daedalus_h264_pred_8x8l_ddl(uint8_t *dst, ptrdiff_t stride)
{
uint8_t filt[25];
filter_refs(dst, stride, filt);
SRC(0,0)= (T(0) + 2*T(1) + T(2) + 2) >> 2;
SRC(0,1)=SRC(1,0)= (T(1) + 2*T(2) + T(3) + 2) >> 2;
SRC(0,2)=SRC(1,1)=SRC(2,0)= (T(2) + 2*T(3) + T(4) + 2) >> 2;
SRC(0,3)=SRC(1,2)=SRC(2,1)=SRC(3,0)= (T(3) + 2*T(4) + T(5) + 2) >> 2;
SRC(0,4)=SRC(1,3)=SRC(2,2)=SRC(3,1)=SRC(4,0)= (T(4) + 2*T(5) + T(6) + 2) >> 2;
SRC(0,5)=SRC(1,4)=SRC(2,3)=SRC(3,2)=SRC(4,1)=SRC(5,0)= (T(5) + 2*T(6) + T(7) + 2) >> 2;
SRC(0,6)=SRC(1,5)=SRC(2,4)=SRC(3,3)=SRC(4,2)=SRC(5,1)=SRC(6,0)= (T(6) + 2*T(7) + T(8) + 2) >> 2;
SRC(0,7)=SRC(1,6)=SRC(2,5)=SRC(3,4)=SRC(4,3)=SRC(5,2)=SRC(6,1)=SRC(7,0)= (T(7) + 2*T(8) + T(9) + 2) >> 2;
SRC(1,7)=SRC(2,6)=SRC(3,5)=SRC(4,4)=SRC(5,3)=SRC(6,2)=SRC(7,1)= (T(8) + 2*T(9) + T(10) + 2) >> 2;
SRC(2,7)=SRC(3,6)=SRC(4,5)=SRC(5,4)=SRC(6,3)=SRC(7,2)= (T(9) + 2*T(10) + T(11) + 2) >> 2;
SRC(3,7)=SRC(4,6)=SRC(5,5)=SRC(6,4)=SRC(7,3)= (T(10) + 2*T(11) + T(12) + 2) >> 2;
SRC(4,7)=SRC(5,6)=SRC(6,5)=SRC(7,4)= (T(11) + 2*T(12) + T(13) + 2) >> 2;
SRC(5,7)=SRC(6,6)=SRC(7,5)= (T(12) + 2*T(13) + T(14) + 2) >> 2;
SRC(6,7)=SRC(7,6)= (T(13) + 2*T(14) + T(15) + 2) >> 2;
SRC(7,7)= (T(14) + 3*T(15) + 2) >> 2;
}
/* Mode 4 DDR (Diagonal_Down_Right). */
void daedalus_h264_pred_8x8l_ddr(uint8_t *dst, ptrdiff_t stride)
{
uint8_t filt[25];
filter_refs(dst, stride, filt);
SRC(0,7)= (L(7) + 2*L(6) + L(5) + 2) >> 2;
SRC(0,6)=SRC(1,7)= (L(6) + 2*L(5) + L(4) + 2) >> 2;
SRC(0,5)=SRC(1,6)=SRC(2,7)= (L(5) + 2*L(4) + L(3) + 2) >> 2;
SRC(0,4)=SRC(1,5)=SRC(2,6)=SRC(3,7)= (L(4) + 2*L(3) + L(2) + 2) >> 2;
SRC(0,3)=SRC(1,4)=SRC(2,5)=SRC(3,6)=SRC(4,7)= (L(3) + 2*L(2) + L(1) + 2) >> 2;
SRC(0,2)=SRC(1,3)=SRC(2,4)=SRC(3,5)=SRC(4,6)=SRC(5,7)= (L(2) + 2*L(1) + L(0) + 2) >> 2;
SRC(0,1)=SRC(1,2)=SRC(2,3)=SRC(3,4)=SRC(4,5)=SRC(5,6)=SRC(6,7)= (L(1) + 2*L(0) + LT + 2) >> 2;
SRC(0,0)=SRC(1,1)=SRC(2,2)=SRC(3,3)=SRC(4,4)=SRC(5,5)=SRC(6,6)=SRC(7,7)= (L(0) + 2*LT + T(0) + 2) >> 2;
SRC(1,0)=SRC(2,1)=SRC(3,2)=SRC(4,3)=SRC(5,4)=SRC(6,5)=SRC(7,6)= (LT + 2*T(0) + T(1) + 2) >> 2;
SRC(2,0)=SRC(3,1)=SRC(4,2)=SRC(5,3)=SRC(6,4)=SRC(7,5)= (T(0) + 2*T(1) + T(2) + 2) >> 2;
SRC(3,0)=SRC(4,1)=SRC(5,2)=SRC(6,3)=SRC(7,4)= (T(1) + 2*T(2) + T(3) + 2) >> 2;
SRC(4,0)=SRC(5,1)=SRC(6,2)=SRC(7,3)= (T(2) + 2*T(3) + T(4) + 2) >> 2;
SRC(5,0)=SRC(6,1)=SRC(7,2)= (T(3) + 2*T(4) + T(5) + 2) >> 2;
SRC(6,0)=SRC(7,1)= (T(4) + 2*T(5) + T(6) + 2) >> 2;
SRC(7,0)= (T(5) + 2*T(6) + T(7) + 2) >> 2;
}
/* Mode 5 VR (Vertical_Right). */
void daedalus_h264_pred_8x8l_vr(uint8_t *dst, ptrdiff_t stride)
{
uint8_t filt[25];
filter_refs(dst, stride, filt);
SRC(0,6)= (L(5) + 2*L(4) + L(3) + 2) >> 2;
SRC(0,7)= (L(6) + 2*L(5) + L(4) + 2) >> 2;
SRC(0,4)=SRC(1,6)= (L(3) + 2*L(2) + L(1) + 2) >> 2;
SRC(0,5)=SRC(1,7)= (L(4) + 2*L(3) + L(2) + 2) >> 2;
SRC(0,2)=SRC(1,4)=SRC(2,6)= (L(1) + 2*L(0) + LT + 2) >> 2;
SRC(0,3)=SRC(1,5)=SRC(2,7)= (L(2) + 2*L(1) + L(0) + 2) >> 2;
SRC(0,1)=SRC(1,3)=SRC(2,5)=SRC(3,7)= (L(0) + 2*LT + T(0) + 2) >> 2;
SRC(0,0)=SRC(1,2)=SRC(2,4)=SRC(3,6)= (LT + T(0) + 1) >> 1;
SRC(1,1)=SRC(2,3)=SRC(3,5)=SRC(4,7)= (LT + 2*T(0) + T(1) + 2) >> 2;
SRC(1,0)=SRC(2,2)=SRC(3,4)=SRC(4,6)= (T(0) + T(1) + 1) >> 1;
SRC(2,1)=SRC(3,3)=SRC(4,5)=SRC(5,7)= (T(0) + 2*T(1) + T(2) + 2) >> 2;
SRC(2,0)=SRC(3,2)=SRC(4,4)=SRC(5,6)= (T(1) + T(2) + 1) >> 1;
SRC(3,1)=SRC(4,3)=SRC(5,5)=SRC(6,7)= (T(1) + 2*T(2) + T(3) + 2) >> 2;
SRC(3,0)=SRC(4,2)=SRC(5,4)=SRC(6,6)= (T(2) + T(3) + 1) >> 1;
SRC(4,1)=SRC(5,3)=SRC(6,5)=SRC(7,7)= (T(2) + 2*T(3) + T(4) + 2) >> 2;
SRC(4,0)=SRC(5,2)=SRC(6,4)=SRC(7,6)= (T(3) + T(4) + 1) >> 1;
SRC(5,1)=SRC(6,3)=SRC(7,5)= (T(3) + 2*T(4) + T(5) + 2) >> 2;
SRC(5,0)=SRC(6,2)=SRC(7,4)= (T(4) + T(5) + 1) >> 1;
SRC(6,1)=SRC(7,3)= (T(4) + 2*T(5) + T(6) + 2) >> 2;
SRC(6,0)=SRC(7,2)= (T(5) + T(6) + 1) >> 1;
SRC(7,1)= (T(5) + 2*T(6) + T(7) + 2) >> 2;
SRC(7,0)= (T(6) + T(7) + 1) >> 1;
}
/* Mode 6 HD (Horizontal_Down). */
void daedalus_h264_pred_8x8l_hd(uint8_t *dst, ptrdiff_t stride)
{
uint8_t filt[25];
filter_refs(dst, stride, filt);
SRC(0,7)= (L(6) + L(7) + 1) >> 1;
SRC(1,7)= (L(5) + 2*L(6) + L(7) + 2) >> 2;
SRC(0,6)=SRC(2,7)= (L(5) + L(6) + 1) >> 1;
SRC(1,6)=SRC(3,7)= (L(4) + 2*L(5) + L(6) + 2) >> 2;
SRC(0,5)=SRC(2,6)=SRC(4,7)= (L(4) + L(5) + 1) >> 1;
SRC(1,5)=SRC(3,6)=SRC(5,7)= (L(3) + 2*L(4) + L(5) + 2) >> 2;
SRC(0,4)=SRC(2,5)=SRC(4,6)=SRC(6,7)= (L(3) + L(4) + 1) >> 1;
SRC(1,4)=SRC(3,5)=SRC(5,6)=SRC(7,7)= (L(2) + 2*L(3) + L(4) + 2) >> 2;
SRC(0,3)=SRC(2,4)=SRC(4,5)=SRC(6,6)= (L(2) + L(3) + 1) >> 1;
SRC(1,3)=SRC(3,4)=SRC(5,5)=SRC(7,6)= (L(1) + 2*L(2) + L(3) + 2) >> 2;
SRC(0,2)=SRC(2,3)=SRC(4,4)=SRC(6,5)= (L(1) + L(2) + 1) >> 1;
SRC(1,2)=SRC(3,3)=SRC(5,4)=SRC(7,5)= (L(0) + 2*L(1) + L(2) + 2) >> 2;
SRC(0,1)=SRC(2,2)=SRC(4,3)=SRC(6,4)= (L(0) + L(1) + 1) >> 1;
SRC(1,1)=SRC(3,2)=SRC(5,3)=SRC(7,4)= (LT + 2*L(0) + L(1) + 2) >> 2;
SRC(0,0)=SRC(2,1)=SRC(4,2)=SRC(6,3)= (LT + L(0) + 1) >> 1;
SRC(1,0)=SRC(3,1)=SRC(5,2)=SRC(7,3)= (L(0) + 2*LT + T(0) + 2) >> 2;
SRC(2,0)=SRC(4,1)=SRC(6,2)= (T(1) + 2*T(0) + LT + 2) >> 2;
SRC(3,0)=SRC(5,1)=SRC(7,2)= (T(2) + 2*T(1) + T(0) + 2) >> 2;
SRC(4,0)=SRC(6,1)= (T(3) + 2*T(2) + T(1) + 2) >> 2;
SRC(5,0)=SRC(7,1)= (T(4) + 2*T(3) + T(2) + 2) >> 2;
SRC(6,0)= (T(5) + 2*T(4) + T(3) + 2) >> 2;
SRC(7,0)= (T(6) + 2*T(5) + T(4) + 2) >> 2;
}
/* Mode 7 VL (Vertical_Left) — uses TOP + TOP_RIGHT only. */
void daedalus_h264_pred_8x8l_vl(uint8_t *dst, ptrdiff_t stride)
{
uint8_t filt[25];
filter_refs(dst, stride, filt);
SRC(0,0)= (T(0) + T(1) + 1) >> 1;
SRC(0,1)= (T(0) + 2*T(1) + T(2) + 2) >> 2;
SRC(0,2)=SRC(1,0)= (T(1) + T(2) + 1) >> 1;
SRC(0,3)=SRC(1,1)= (T(1) + 2*T(2) + T(3) + 2) >> 2;
SRC(0,4)=SRC(1,2)=SRC(2,0)= (T(2) + T(3) + 1) >> 1;
SRC(0,5)=SRC(1,3)=SRC(2,1)= (T(2) + 2*T(3) + T(4) + 2) >> 2;
SRC(0,6)=SRC(1,4)=SRC(2,2)=SRC(3,0)= (T(3) + T(4) + 1) >> 1;
SRC(0,7)=SRC(1,5)=SRC(2,3)=SRC(3,1)= (T(3) + 2*T(4) + T(5) + 2) >> 2;
SRC(1,6)=SRC(2,4)=SRC(3,2)=SRC(4,0)= (T(4) + T(5) + 1) >> 1;
SRC(1,7)=SRC(2,5)=SRC(3,3)=SRC(4,1)= (T(4) + 2*T(5) + T(6) + 2) >> 2;
SRC(2,6)=SRC(3,4)=SRC(4,2)=SRC(5,0)= (T(5) + T(6) + 1) >> 1;
SRC(2,7)=SRC(3,5)=SRC(4,3)=SRC(5,1)= (T(5) + 2*T(6) + T(7) + 2) >> 2;
SRC(3,6)=SRC(4,4)=SRC(5,2)=SRC(6,0)= (T(6) + T(7) + 1) >> 1;
SRC(3,7)=SRC(4,5)=SRC(5,3)=SRC(6,1)= (T(6) + 2*T(7) + T(8) + 2) >> 2;
SRC(4,6)=SRC(5,4)=SRC(6,2)=SRC(7,0)= (T(7) + T(8) + 1) >> 1;
SRC(4,7)=SRC(5,5)=SRC(6,3)=SRC(7,1)= (T(7) + 2*T(8) + T(9) + 2) >> 2;
SRC(5,6)=SRC(6,4)=SRC(7,2)= (T(8) + T(9) + 1) >> 1;
SRC(5,7)=SRC(6,5)=SRC(7,3)= (T(8) + 2*T(9) + T(10) + 2) >> 2;
SRC(6,6)=SRC(7,4)= (T(9) + T(10) + 1) >> 1;
SRC(6,7)=SRC(7,5)= (T(9) + 2*T(10) + T(11) + 2) >> 2;
SRC(7,6)= (T(10) + T(11) + 1) >> 1;
SRC(7,7)= (T(10) + 2*T(11) + T(12) + 2) >> 2;
}
/* Mode 8 HU (Horizontal_Up) — uses LEFT only. */
void daedalus_h264_pred_8x8l_hu(uint8_t *dst, ptrdiff_t stride)
{
uint8_t filt[25];
filter_refs(dst, stride, filt);
SRC(0,0)= (L(0) + L(1) + 1) >> 1;
SRC(1,0)= (L(0) + 2*L(1) + L(2) + 2) >> 2;
SRC(0,1)=SRC(2,0)= (L(1) + L(2) + 1) >> 1;
SRC(1,1)=SRC(3,0)= (L(1) + 2*L(2) + L(3) + 2) >> 2;
SRC(0,2)=SRC(2,1)=SRC(4,0)= (L(2) + L(3) + 1) >> 1;
SRC(1,2)=SRC(3,1)=SRC(5,0)= (L(2) + 2*L(3) + L(4) + 2) >> 2;
SRC(0,3)=SRC(2,2)=SRC(4,1)=SRC(6,0)= (L(3) + L(4) + 1) >> 1;
SRC(1,3)=SRC(3,2)=SRC(5,1)=SRC(7,0)= (L(3) + 2*L(4) + L(5) + 2) >> 2;
SRC(0,4)=SRC(2,3)=SRC(4,2)=SRC(6,1)= (L(4) + L(5) + 1) >> 1;
SRC(1,4)=SRC(3,3)=SRC(5,2)=SRC(7,1)= (L(4) + 2*L(5) + L(6) + 2) >> 2;
SRC(0,5)=SRC(2,4)=SRC(4,3)=SRC(6,2)= (L(5) + L(6) + 1) >> 1;
SRC(1,5)=SRC(3,4)=SRC(5,3)=SRC(7,2)= (L(5) + 2*L(6) + L(7) + 2) >> 2;
SRC(0,6)=SRC(2,5)=SRC(4,4)=SRC(6,3)= (L(6) + L(7) + 1) >> 1;
SRC(1,6)=SRC(3,5)=SRC(5,4)=SRC(7,3)= (L(6) + 3*L(7) + 2) >> 2;
/* The remaining 20 positions all equal L(7), per FFmpeg lines 1097-1100. */
SRC(0,7)=SRC(1,7)=SRC(2,6)=SRC(2,7)=SRC(3,6)=
SRC(3,7)=SRC(4,5)=SRC(4,6)=SRC(4,7)=SRC(5,5)=
SRC(5,6)=SRC(5,7)=SRC(6,4)=SRC(6,5)=SRC(6,6)=
SRC(6,7)=SRC(7,4)=SRC(7,5)=SRC(7,6)=SRC(7,7)= L(7);
}
#undef SRC
#undef T
#undef L
#undef LT
+123
@@ -0,0 +1,123 @@
/*
* Standalone bit-exact C reference for H.264 chroma Intra_8x8
* prediction modes (per H.264 §8.3.3), used for both Cb and Cr
* planes at 4:2:0. All 4 modes.
*
* Mode index → name (per H.264 Table 7-16):
* 0 = DC (per-quadrant — asymmetric, see §8.3.3.2)
* 1 = Horizontal
* 2 = Vertical
* 3 = Plane (slope coefficient 34, distinct from luma's 5)
*
* Calling convention (same shape as luma intra refs):
* pred_chroma8x8_<mode>(uint8_t *dst, ptrdiff_t stride)
*
* `dst` points at row 0, col 0 of the 8x8 output block (single
* component plane — Cb or Cr, dispatched independently). Neighbours:
* top[0..7] = dst[-stride + 0 .. -stride + 7]
* top-left = dst[-stride - 1]
* left[0..7] = dst[ 0*stride - 1 .. 7*stride - 1]
*
* AVAILABILITY: assumes all neighbours valid (interior-MB case).
* The H.264 spec defines per-quadrant fallback for the DC mode at
* MB boundaries; that's caller-side via the libavcodec intercept.
*
* License: BSD-2-Clause.
*/
#include <stdint.h>
#include <stddef.h>
static inline int clip_u8(int v) { return v < 0 ? 0 : v > 255 ? 255 : v; }
/* Mode 0 — DC (per-quadrant, 4:2:0 layout per §8.3.3.2).
*
* The 8×8 block is split into four 4×4 quadrants. For interior
* MBs (all neighbours available), the DC value per quadrant uses:
* (0,0) top-left : (sum_top[0..3] + sum_left[0..3] + 4) >> 3
* (0,1) top-right : (sum_top[4..7] + 2) >> 2
* (1,0) bot-left : (sum_left[4..7] + 2) >> 2
* (1,1) bot-right : (sum_top[4..7] + sum_left[4..7] + 4) >> 3
*
* The asymmetry follows the spec's per-quadrant neighbour preference:
* the two diagonal quadrants average both edges, the top-right
* quadrant uses only top[4..7] (the samples directly above it), and
* the bottom-left quadrant uses only left[4..7] (directly to its left).
*/
void daedalus_h264_pred_chroma8x8_dc(uint8_t *dst, ptrdiff_t stride)
{
const uint8_t *top = dst - stride;
int top_lo = 0, top_hi = 0, left_lo = 0, left_hi = 0;
for (int i = 0; i < 4; i++) {
top_lo += top[i];
top_hi += top[4 + i];
left_lo += dst[i * stride - 1];
left_hi += dst[(4 + i) * stride - 1];
}
uint8_t dc00 = (uint8_t)((top_lo + left_lo + 4) >> 3); /* top-left */
uint8_t dc01 = (uint8_t)((top_hi + 2) >> 2); /* top-right */
uint8_t dc10 = (uint8_t)(( left_hi + 2) >> 2); /* bot-left */
uint8_t dc11 = (uint8_t)((top_hi + left_hi + 4) >> 3); /* bot-right */
for (int r = 0; r < 4; r++) {
for (int c = 0; c < 4; c++) {
dst[( r) * stride + c ] = dc00;
dst[( r) * stride + 4 + c ] = dc01;
dst[(4 + r) * stride + c ] = dc10;
dst[(4 + r) * stride + 4 + c ] = dc11;
}
}
}
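The per-quadrant DC arithmetic above can be checked by hand with a small standalone sketch. The helper name `chroma_dc_quadrants` and its flat top/left array arguments are hypothetical (they are not part of this PR); it only mirrors the interior-MB formulas from the comment.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not in the PR): interior-MB chroma DC values for the
 * four 4x4 quadrants, from 8 top and 8 left neighbour samples.
 * dc[0] = top-left, dc[1] = top-right, dc[2] = bot-left, dc[3] = bot-right. */
static void chroma_dc_quadrants(const uint8_t top[8], const uint8_t left[8],
                                uint8_t dc[4])
{
    assert(top != NULL && left != NULL && dc != NULL);
    int top_lo = 0, top_hi = 0, left_lo = 0, left_hi = 0;
    for (int i = 0; i < 4; i++) {
        top_lo  += top[i];
        top_hi  += top[4 + i];
        left_lo += left[i];
        left_hi += left[4 + i];
    }
    dc[0] = (uint8_t)((top_lo + left_lo + 4) >> 3); /* both edges, 8 samples */
    dc[1] = (uint8_t)((top_hi + 2) >> 2);           /* top[4..7] only        */
    dc[2] = (uint8_t)((left_hi + 2) >> 2);          /* left[4..7] only       */
    dc[3] = (uint8_t)((top_hi + left_hi + 4) >> 3); /* both edges, 8 samples */
}
```

With top = {10,10,10,10,20,20,20,20} and left = {30,30,30,30,40,40,40,40}, the four sums are 40, 80, 120 and 160, so the quadrant values come out as 20, 20, 40 and 30.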
/* Mode 1 — Horizontal: each row = left[row]. */
void daedalus_h264_pred_chroma8x8_horizontal(uint8_t *dst, ptrdiff_t stride)
{
for (int r = 0; r < 8; r++) {
uint8_t l = dst[r * stride - 1];
for (int c = 0; c < 8; c++) dst[r * stride + c] = l;
}
}
/* Mode 2 — Vertical: each col = top[col]. */
void daedalus_h264_pred_chroma8x8_vertical(uint8_t *dst, ptrdiff_t stride)
{
const uint8_t *top = dst - stride;
for (int r = 0; r < 8; r++)
for (int c = 0; c < 8; c++) dst[r * stride + c] = top[c];
}
/* Mode 3 — Plane (per H.264 §8.3.3.4):
* H = sum_{i=0..3} (i+1) * (p[4+i, -1] - p[2-i, -1]) ; i=3 uses p[-1,-1]
* V = sum_{j=0..3} (j+1) * (p[-1, 4+j] - p[-1, 2-j]) ; j=3 uses p[-1,-1]
* b = (34 * H + 32) >> 6
* c = (34 * V + 32) >> 6
* a = 16 * (p[-1, 7] + p[7, -1])
* pred[y][x] = Clip1((a + b*(x - 3) + c*(y - 3) + 16) >> 5)
*
* Distinct from the Intra_16x16 luma Plane:
* - Slope coefficient is 34 (not 5).
* - Centre is (x-3, y-3) (not x-7, y-7).
* - Spans 4 differences per sum (not 8).
*/
void daedalus_h264_pred_chroma8x8_plane(uint8_t *dst, ptrdiff_t stride)
{
const uint8_t *top = dst - stride;
int H = 0, V = 0;
for (int i = 0; i < 4; i++) {
int t_right = top[4 + i];
int t_left = (i == 3) ? top[-1] : top[2 - i];
H += (i + 1) * (t_right - t_left);
}
for (int j = 0; j < 4; j++) {
int l_bot = dst[(4 + j) * stride - 1];
int l_top = (j == 3) ? top[-1] : dst[(2 - j) * stride - 1];
V += (j + 1) * (l_bot - l_top);
}
int b = (34 * H + 32) >> 6;
int c = (34 * V + 32) >> 6;
int a = 16 * (dst[7 * stride - 1] + top[7]);
for (int y = 0; y < 8; y++) {
for (int x = 0; x < 8; x++) {
int v = (a + b * (x - 3) + c * (y - 3) + 16) >> 5;
dst[y * stride + x] = (uint8_t) clip_u8(v);
}
}
}
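As a worked check of the Plane formulas, the sketch below computes one slope term and one predicted sample in isolation. The names `plane_slope` and `plane_pixel`, and the 9-entry edge layout (corner first, then samples 0..7), are assumptions made for illustration only. Right shifts of negative values are assumed to be arithmetic, as they are on mainstream compilers.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: slope term b (or c) from one edge.
 * edge[0] is the corner sample p[-1,-1]; edge[1 + n] is edge sample n. */
static int plane_slope(const uint8_t edge[9])
{
    int sum = 0;
    for (int i = 0; i < 4; i++) {
        int hi = edge[1 + 4 + i];   /* sample 4+i                          */
        int lo = edge[1 + 2 - i];   /* sample 2-i; i == 3 hits the corner  */
        sum += (i + 1) * (hi - lo);
    }
    return (34 * sum + 32) >> 6;    /* assumes arithmetic >> on negatives  */
}

/* Hypothetical helper: one predicted sample, clipped to 0..255. */
static uint8_t plane_pixel(int a, int b, int c, int x, int y)
{
    assert(x >= 0 && x < 8 && y >= 0 && y < 8);
    int v = (a + b * (x - 3) + c * (y - 3) + 16) >> 5;
    return (uint8_t)(v < 0 ? 0 : v > 255 ? 255 : v);
}
```

For a top edge that ramps by 4 per sample (corner 96, then 100..128) and a flat left edge at 96, H is 240, so b = 128 and c = 0, and a = 16 * (96 + 128) = 3584. The prediction then reproduces the ramp: 100 + 4*x in every row.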
+129
@@ -0,0 +1,129 @@
// daedalus-fourier — H.264 4x4 inverse integer transform + add, V3D 7.1.
//
// H.264 spec §8.5.12.1. Pure integer arithmetic — no trig constants
// (unlike VP9 IDCT 8x8). Row pass first, column pass second; round
// (+32) >> 6, add to dst, clip to u8.
//
// Block memory layout: COLUMN-MAJOR. block[c*4 + r] = coefficient at
// (row r, column c). Matches FFmpeg `ff_h264_idct_add_neon`.
//
// Workgroup layout: 64 invocations = 4 lanes/block × 16 blocks/WG.
// - row pass: lane k (0..3) reads row k of the block (4 coefficients,
// one from each column), runs the butterfly, writes 4
// outputs to one row of tmp_shared.
// - column pass: lane k reads column k of tmp_shared (4 rows),
// runs the butterfly, writes 4 outputs to dst as
// column k at rows 0..3.
//
// shared = 16 × 16 × 4 B = 1 KiB. Well under V3D's 16 KiB limit.
//
// License: BSD-2-Clause.
#version 450
#extension GL_EXT_shader_8bit_storage : require
#extension GL_EXT_shader_16bit_storage : require
#extension GL_EXT_shader_explicit_arithmetic_types : require
layout(local_size_x = 64, local_size_y = 1, local_size_z = 1) in;
layout(binding = 0) readonly buffer Coeffs {
int16_t coeffs[]; // N × 16 column-major
} u_coeffs;
layout(binding = 1) buffer Dst {
uint8_t dst[]; // H × stride bytes (caller-provided base)
} u_dst;
layout(binding = 2) readonly buffer Meta {
uvec4 meta[]; // .x = dst_off (byte offset into u_dst.dst)
} u_meta;
layout(push_constant) uniform PC {
uint n_blocks;
uint dst_stride_u8;
uint _pad0, _pad1;
} pc;
// 16 blocks per WG × 16 ints per block = 256 ints = 1 KiB shared.
shared int tmp_shared[16 * 16];
// 1D butterfly per H.264 §8.5.12.1. d[0..3] in, o[0..3] out.
void idct4_1d(int d0, int d1, int d2, int d3,
out int o0, out int o1, out int o2, out int o3)
{
int e = d0 + d2;
int f = d0 - d2;
int g = (d1 >> 1) - d3;
int h = d1 + (d3 >> 1);
o0 = e + h;
o1 = f + g;
o2 = f - g;
o3 = e - h;
}
void main()
{
// Lane decomposition: local_size 64 = 16 blocks × 4 lanes/block.
uint gid = gl_GlobalInvocationID.x;
uint wg_id = gid / 64u;
uint lane_in_wg = gid & 63u;
uint block_local = lane_in_wg >> 2; // 0..15
uint k = lane_in_wg & 3u; // 0..3
uint block_idx = wg_id * 16u + block_local;
bool oob = (block_idx >= pc.n_blocks);
// ---- Row pass --------------------------------------------------
// lane k handles row r=k. Reads block[c*4 + k] for c=0..3 (one
// element from each column at fixed row).
if (!oob) {
uint base = block_idx * 16u;
int d0 = int(u_coeffs.coeffs[base + 0u * 4u + k]);
int d1 = int(u_coeffs.coeffs[base + 1u * 4u + k]);
int d2 = int(u_coeffs.coeffs[base + 2u * 4u + k]);
int d3 = int(u_coeffs.coeffs[base + 3u * 4u + k]);
int o0, o1, o2, o3;
idct4_1d(d0, d1, d2, d3, o0, o1, o2, o3);
// Write row k of tmp_shared[block_local].
uint tbase = block_local * 16u + k * 4u;
tmp_shared[tbase + 0u] = o0;
tmp_shared[tbase + 1u] = o1;
tmp_shared[tbase + 2u] = o2;
tmp_shared[tbase + 3u] = o3;
}
barrier();
// ---- Column pass ----------------------------------------------
// lane k handles column c=k. Reads tmp[r][k] for r=0..3.
if (!oob) {
uint tbase = block_local * 16u;
int s0 = tmp_shared[tbase + 0u * 4u + k];
int s1 = tmp_shared[tbase + 1u * 4u + k];
int s2 = tmp_shared[tbase + 2u * 4u + k];
int s3 = tmp_shared[tbase + 3u * 4u + k];
int o0, o1, o2, o3;
idct4_1d(s0, s1, s2, s3, o0, o1, o2, o3);
// Column k at rows 0..3 of dst, offset by meta.x (dst_off).
uint dst_off = u_meta.meta[block_idx].x;
uint stride = pc.dst_stride_u8;
uint a0 = dst_off + 0u * stride + k;
uint a1 = dst_off + 1u * stride + k;
uint a2 = dst_off + 2u * stride + k;
uint a3 = dst_off + 3u * stride + k;
int p0 = int(u_dst.dst[a0]);
int p1 = int(u_dst.dst[a1]);
int p2 = int(u_dst.dst[a2]);
int p3 = int(u_dst.dst[a3]);
u_dst.dst[a0] = uint8_t(clamp(p0 + ((o0 + 32) >> 6), 0, 255));
u_dst.dst[a1] = uint8_t(clamp(p1 + ((o1 + 32) >> 6), 0, 255));
u_dst.dst[a2] = uint8_t(clamp(p2 + ((o2 + 32) >> 6), 0, 255));
u_dst.dst[a3] = uint8_t(clamp(p3 + ((o3 + 32) >> 6), 0, 255));
}
}
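For cross-checking the shader on the CPU, a scalar version of the same two passes might look like the sketch below. The function name `idct4_add_ref` is hypothetical and this is not claimed to be the project's reference implementation; it simply replays the row pass, the column pass and the round/add/clip step on one column-major block. Right shifts of negative values are assumed to be arithmetic.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static uint8_t clip_u8(int v) { return (uint8_t)(v < 0 ? 0 : v > 255 ? 255 : v); }

/* Same 1D butterfly as the shader's idct4_1d. */
static void idct4_1d(const int d[4], int o[4])
{
    int e = d[0] + d[2];
    int f = d[0] - d[2];
    int g = (d[1] >> 1) - d[3];
    int h = d[1] + (d[3] >> 1);
    o[0] = e + h;
    o[1] = f + g;
    o[2] = f - g;
    o[3] = e - h;
}

/* Hypothetical scalar mirror of the shader. block is column-major:
 * block[c*4 + r] is the coefficient at (row r, column c). */
static void idct4_add_ref(uint8_t *dst, ptrdiff_t stride, const int16_t block[16])
{
    assert(dst != NULL && stride >= 4);
    int tmp[4][4];
    for (int r = 0; r < 4; r++) {          /* row pass: shader lane k == r */
        int d[4] = { block[0 * 4 + r], block[1 * 4 + r],
                     block[2 * 4 + r], block[3 * 4 + r] };
        idct4_1d(d, tmp[r]);
    }
    for (int c = 0; c < 4; c++) {          /* column pass: shader lane k == c */
        int s[4] = { tmp[0][c], tmp[1][c], tmp[2][c], tmp[3][c] };
        int o[4];
        idct4_1d(s, o);
        for (int r = 0; r < 4; r++) {
            uint8_t *p = dst + r * stride + c;
            *p = clip_u8(*p + ((o[r] + 32) >> 6));
        }
    }
}
```

A DC-only block with coefficient 64 makes every butterfly output 64, so each pixel gains (64 + 32) >> 6 = 1, and a pixel already at 255 stays clipped at 255.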
+175
@@ -0,0 +1,175 @@
// daedalus-fourier — H.264 8x8 inverse integer transform + add, V3D 7.1.
//
// H.264 spec §8.5.13.2 (High profile 8x8 IT). Pure integer arithmetic
// — different butterfly from VP9 IDCT 8x8 (cycle 1, uses cospi
// multipliers). Row pass first, column pass second; round (+32) >> 6,
// add to dst, clip to u8.
//
// Block layout: COLUMN-MAJOR. block[c*8 + r] = coefficient at
// (row r, column c). Matches FFmpeg `ff_h264_idct8_add_neon`.
//
// Workgroup layout: 64 invocations = 8 lanes/block × 8 blocks/WG.
// - row pass: lane k (0..7) reads row k of the block (8 coefficients,
// one from each column), runs the butterfly, writes 8
// outputs to one row of tmp_shared.
// - column pass: lane k reads column k of tmp_shared (8 rows),
// runs the butterfly, writes 8 outputs to dst as
// column k at rows 0..7.
//
// shared = 8 × 64 × 4 B = 2 KiB. Well under V3D's 16 KiB limit.
//
// License: BSD-2-Clause.
#version 450
#extension GL_EXT_shader_8bit_storage : require
#extension GL_EXT_shader_16bit_storage : require
#extension GL_EXT_shader_explicit_arithmetic_types : require
layout(local_size_x = 64, local_size_y = 1, local_size_z = 1) in;
layout(binding = 0) readonly buffer Coeffs {
int16_t coeffs[]; // N × 64 column-major
} u_coeffs;
layout(binding = 1) buffer Dst {
uint8_t dst[]; // H × stride bytes
} u_dst;
layout(binding = 2) readonly buffer Meta {
uvec4 meta[]; // .x = dst_off
} u_meta;
layout(push_constant) uniform PC {
uint n_blocks;
uint dst_stride_u8;
uint _pad0, _pad1;
} pc;
// 8 blocks/WG × 64 ints/block × 4 B = 2 KiB shared.
shared int tmp_shared[8 * 64];
// 1D 8-element butterfly per H.264 §8.5.13.2.
void idct8_1d(int d0, int d1, int d2, int d3,
int d4, int d5, int d6, int d7,
out int g0, out int g1, out int g2, out int g3,
out int g4, out int g5, out int g6, out int g7)
{
int e0 = d0 + d4;
int e1 = -d3 + d5 - d7 - (d7 >> 1);
int e2 = d0 - d4;
int e3 = d1 + d7 - d3 - (d3 >> 1);
int e4 = (d2 >> 1) - d6;
int e5 = -d1 + d7 + d5 + (d5 >> 1);
int e6 = d2 + (d6 >> 1);
int e7 = d3 + d5 + d1 + (d1 >> 1);
int f0 = e0 + e6;
int f1 = e1 + (e7 >> 2);
int f2 = e2 + e4;
int f3 = e3 + (e5 >> 2);
int f4 = e2 - e4;
int f5 = (e3 >> 2) - e5;
int f6 = e0 - e6;
int f7 = e7 - (e1 >> 2);
g0 = f0 + f7;
g1 = f2 + f5;
g2 = f4 + f3;
g3 = f6 + f1;
g4 = f6 - f1;
g5 = f4 - f3;
g6 = f2 - f5;
g7 = f0 - f7;
}
void main()
{
// local_size 64 = 8 blocks × 8 lanes/block.
uint gid = gl_GlobalInvocationID.x;
uint wg_id = gid / 64u;
uint lane_in_wg = gid & 63u;
uint block_local = lane_in_wg >> 3; // 0..7
uint k = lane_in_wg & 7u; // 0..7
uint block_idx = wg_id * 8u + block_local;
bool oob = (block_idx >= pc.n_blocks);
// ---- Row pass --------------------------------------------------
// lane k handles row r=k. Reads block[c*8 + k] for c=0..7.
if (!oob) {
uint base = block_idx * 64u;
int d0 = int(u_coeffs.coeffs[base + 0u * 8u + k]);
int d1 = int(u_coeffs.coeffs[base + 1u * 8u + k]);
int d2 = int(u_coeffs.coeffs[base + 2u * 8u + k]);
int d3 = int(u_coeffs.coeffs[base + 3u * 8u + k]);
int d4 = int(u_coeffs.coeffs[base + 4u * 8u + k]);
int d5 = int(u_coeffs.coeffs[base + 5u * 8u + k]);
int d6 = int(u_coeffs.coeffs[base + 6u * 8u + k]);
int d7 = int(u_coeffs.coeffs[base + 7u * 8u + k]);
int g0, g1, g2, g3, g4, g5, g6, g7;
idct8_1d(d0, d1, d2, d3, d4, d5, d6, d7,
g0, g1, g2, g3, g4, g5, g6, g7);
// Write row k of tmp_shared[block_local].
uint tbase = block_local * 64u + k * 8u;
tmp_shared[tbase + 0u] = g0;
tmp_shared[tbase + 1u] = g1;
tmp_shared[tbase + 2u] = g2;
tmp_shared[tbase + 3u] = g3;
tmp_shared[tbase + 4u] = g4;
tmp_shared[tbase + 5u] = g5;
tmp_shared[tbase + 6u] = g6;
tmp_shared[tbase + 7u] = g7;
}
barrier();
// ---- Column pass ----------------------------------------------
// lane k handles column c=k. Reads tmp[r][k] for r=0..7.
if (!oob) {
uint tbase = block_local * 64u;
int s0 = tmp_shared[tbase + 0u * 8u + k];
int s1 = tmp_shared[tbase + 1u * 8u + k];
int s2 = tmp_shared[tbase + 2u * 8u + k];
int s3 = tmp_shared[tbase + 3u * 8u + k];
int s4 = tmp_shared[tbase + 4u * 8u + k];
int s5 = tmp_shared[tbase + 5u * 8u + k];
int s6 = tmp_shared[tbase + 6u * 8u + k];
int s7 = tmp_shared[tbase + 7u * 8u + k];
int g0, g1, g2, g3, g4, g5, g6, g7;
idct8_1d(s0, s1, s2, s3, s4, s5, s6, s7,
g0, g1, g2, g3, g4, g5, g6, g7);
// Column k at rows 0..7 of dst, offset by meta.x.
uint dst_off = u_meta.meta[block_idx].x;
uint stride = pc.dst_stride_u8;
uint a0 = dst_off + 0u * stride + k;
uint a1 = dst_off + 1u * stride + k;
uint a2 = dst_off + 2u * stride + k;
uint a3 = dst_off + 3u * stride + k;
uint a4 = dst_off + 4u * stride + k;
uint a5 = dst_off + 5u * stride + k;
uint a6 = dst_off + 6u * stride + k;
uint a7 = dst_off + 7u * stride + k;
int p0 = int(u_dst.dst[a0]);
int p1 = int(u_dst.dst[a1]);
int p2 = int(u_dst.dst[a2]);
int p3 = int(u_dst.dst[a3]);
int p4 = int(u_dst.dst[a4]);
int p5 = int(u_dst.dst[a5]);
int p6 = int(u_dst.dst[a6]);
int p7 = int(u_dst.dst[a7]);
u_dst.dst[a0] = uint8_t(clamp(p0 + ((g0 + 32) >> 6), 0, 255));
u_dst.dst[a1] = uint8_t(clamp(p1 + ((g1 + 32) >> 6), 0, 255));
u_dst.dst[a2] = uint8_t(clamp(p2 + ((g2 + 32) >> 6), 0, 255));
u_dst.dst[a3] = uint8_t(clamp(p3 + ((g3 + 32) >> 6), 0, 255));
u_dst.dst[a4] = uint8_t(clamp(p4 + ((g4 + 32) >> 6), 0, 255));
u_dst.dst[a5] = uint8_t(clamp(p5 + ((g5 + 32) >> 6), 0, 255));
u_dst.dst[a6] = uint8_t(clamp(p6 + ((g6 + 32) >> 6), 0, 255));
u_dst.dst[a7] = uint8_t(clamp(p7 + ((g7 + 32) >> 6), 0, 255));
}
}
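The lane decomposition shared by the 4x4 and 8x8 transform shaders can be sketched on the host side as plain index arithmetic. The struct `lane_map` and the functions `decompose_lane` and `workgroups_for` are hypothetical names. The dispatch-size formula in particular is an assumption about how a caller would size the grid, not something shown in this diff.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint32_t wg_id;        /* workgroup index                     */
    uint32_t block_local;  /* block slot inside the workgroup     */
    uint32_t k;            /* row/column handled by this lane     */
    uint32_t block_idx;    /* global block index                  */
} lane_map;

/* Hypothetical mirror of the shader's index math for a 64-wide workgroup
 * with 4 lanes/block (IDCT 4x4) or 8 lanes/block (IDCT 8x8). */
static lane_map decompose_lane(uint32_t gid, uint32_t lanes_per_block)
{
    assert(lanes_per_block == 4u || lanes_per_block == 8u);
    const uint32_t wg_size = 64u;
    const uint32_t blocks_per_wg = wg_size / lanes_per_block;
    uint32_t lane_in_wg = gid % wg_size;
    lane_map m;
    m.wg_id       = gid / wg_size;
    m.block_local = lane_in_wg / lanes_per_block;
    m.k           = lane_in_wg % lanes_per_block;
    m.block_idx   = m.wg_id * blocks_per_wg + m.block_local;
    return m;
}

/* Assumed host-side dispatch size: enough workgroups to cover n_blocks;
 * lanes whose block_idx >= n_blocks take the shader's oob path. */
static uint32_t workgroups_for(uint32_t n_blocks, uint32_t lanes_per_block)
{
    uint32_t blocks_per_wg = 64u / lanes_per_block;
    return (n_blocks + blocks_per_wg - 1u) / blocks_per_wg;
}
```

For example, global invocation 200 in the 8x8 layout is lane 8 of workgroup 3, which is row/column 0 of block 25. Invocation 203 in the 4x4 layout is lane 11 of workgroup 3, which is row/column 3 of block 50.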
+52
@@ -0,0 +1,52 @@
// daedalus-fourier — H.264 luma qpel avg_mc01 (biprediction) (8x8, ¼-pel vertical),
// V3D 7.1. Per H.264 §8.4.2.2.1 "d" position:
//
// dst[r,c] = ((clip255(mc02(s)[r,c]) + s[r,c] + 1) >> 1)
//
// Sibling of v3d_h264_qpel_mc02.comp, with an added L2 step (two-input
// rounded average) against src[r, c].
//
//
// avg_ variant for B-slice biprediction per H.264 §8.4.2.3.1:
// dst[r,c] = avg(dst[r,c], mc01_value)
// Caller pre-loads dst with the list0 prediction; this shader
// folds in the list1 contribution.
//
// License: BSD-2-Clause.
#version 450
#extension GL_EXT_shader_8bit_storage : require
#extension GL_EXT_shader_explicit_arithmetic_types : require
layout(local_size_x = 64, local_size_y = 1, local_size_z = 1) in;
layout(binding = 0) readonly buffer Src { uint8_t src[]; } u_src;
layout(binding = 1) buffer Dst { uint8_t dst[]; } u_dst;
layout(binding = 2) readonly buffer Meta { uvec4 meta[]; } u_meta;
layout(push_constant) uniform PC { uint n_blocks, stride_u8, _p0, _p1; } pc;
void main()
{
uint block_idx = gl_WorkGroupID.x;
if (block_idx >= pc.n_blocks) return;
uint lane = gl_LocalInvocationID.x;
uint r = lane >> 3, c = lane & 7u;
uint dst_off = u_meta.meta[block_idx].x;
uint src_off = u_meta.meta[block_idx].y;
uint stride = pc.stride_u8;
uint col_base = src_off + c;
int s_m2 = int(u_src.src[col_base + (r - 2u) * stride]);
int s_m1 = int(u_src.src[col_base + (r - 1u) * stride]);
int s_0 = int(u_src.src[col_base + r * stride]);
int s_p1 = int(u_src.src[col_base + (r + 1u) * stride]);
int s_p2 = int(u_src.src[col_base + (r + 2u) * stride]);
int s_p3 = int(u_src.src[col_base + (r + 3u) * stride]);
int v = s_m2 - 5 * s_m1 + 20 * s_0 + 20 * s_p1 - 5 * s_p2 + s_p3 + 16;
int vp = clamp(v >> 5, 0, 255);
int avg = (vp + s_0 + 1) >> 1; // L2 with src[r, c]
uint final_off = dst_off + r * stride + c;
int prev = int(u_dst.dst[final_off]);
u_dst.dst[final_off] = uint8_t((prev + avg + 1) >> 1);
}
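The three stages of the vertical quarter-pel avg_ shaders (6-tap half-pel, L2 average with an integer sample, then the biprediction average with the existing dst value) can be traced per pixel with the scalar sketch below. The names `hpel_v_ref` and `qpel_v_avg_ref` and the `three_quarter` flag are hypothetical. The flag selects src[r+1, c] (the mc03 case) instead of src[r, c] (the mc01 case). Right shifts of negative values are assumed to be arithmetic.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static int clip255(int v) { return v < 0 ? 0 : v > 255 ? 255 : v; }

/* Vertical 6-tap half-pel value at the pixel src points to. */
static int hpel_v_ref(const uint8_t *src, ptrdiff_t stride)
{
    assert(stride != 0);
    int v = src[-2 * stride] - 5 * src[-1 * stride]
          + 20 * src[0]      + 20 * src[1 * stride]
          - 5 * src[2 * stride] + src[3 * stride] + 16;
    return clip255(v >> 5);
}

/* Hypothetical per-pixel mirror of avg_mc01 (three_quarter == 0) and
 * avg_mc03 (three_quarter != 0). prev is the list0 value already in dst. */
static uint8_t qpel_v_avg_ref(const uint8_t *src, ptrdiff_t stride,
                              int three_quarter, uint8_t prev)
{
    int half = hpel_v_ref(src, stride);
    int full = three_quarter ? src[stride] : src[0];
    int q    = (half + full + 1) >> 1;      /* quarter-pel value  */
    return (uint8_t)((prev + q + 1) >> 1);  /* fold into list0    */
}
```

On a column that steps from 0 to 255 right below the output pixel (samples 0, 0, 0, 255, 255, 255), the half-pel value is 4096 >> 5 = 128, the mc01 value is 64 and the mc03 value is 192.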
+77
@@ -0,0 +1,77 @@
// daedalus-fourier — H.264 luma qpel avg_mc02 (biprediction) (8x8, vertical half-pel), V3D 7.1.
//
// Sibling of cycle 9's v3d_h264_qpel_mc20.comp. Same 6-tap filter,
// transposed to vertical direction:
//
// dst[r,c] = clip255(
// ( s[r-2,c]
// - 5 * s[r-1,c]
// + 20 * s[r, c]
// + 20 * s[r+1,c]
// - 5 * s[r+2,c]
// + s[r+3,c]
// + 16
// ) >> 5)
//
// src+src_off points at row 0 col 0 of the OUTPUT block; for output
// row r the filter reads rows r-2..r+3, so the block needs 2 rows of
// top context and 3 rows of bottom context.
//
// Same WG layout as mc20: 64 lanes / 1 block-per-WG / 1 lane-per-pixel.
//
//
// avg_ variant for B-slice biprediction per H.264 §8.4.2.3.1:
// dst[r,c] = avg(dst[r,c], mc02_value)
// Caller pre-loads dst with the list0 prediction; this shader
// folds in the list1 contribution.
//
// License: BSD-2-Clause.
#version 450
#extension GL_EXT_shader_8bit_storage : require
#extension GL_EXT_shader_explicit_arithmetic_types : require
layout(local_size_x = 64, local_size_y = 1, local_size_z = 1) in;
layout(binding = 0) readonly buffer Src { uint8_t src[]; } u_src;
layout(binding = 1) buffer Dst { uint8_t dst[]; } u_dst;
layout(binding = 2) readonly buffer Meta { uvec4 meta[]; } u_meta;
layout(push_constant) uniform PC {
uint n_blocks;
uint stride_u8;
uint _pad0, _pad1;
} pc;
void main()
{
uint block_idx = gl_WorkGroupID.x;
if (block_idx >= pc.n_blocks) return;
uint lane = gl_LocalInvocationID.x;
uint r = lane >> 3;
uint c = lane & 7u;
uint dst_off = u_meta.meta[block_idx].x;
uint src_off = u_meta.meta[block_idx].y;
uint stride = pc.stride_u8;
// Read the 6 rows of vertical context at col (c) of THIS output row.
// src_off+r*stride+c is at the OUTPUT pixel position; the kernel
// samples r-2..r+3 along the column. Unsigned-safe because the
// public API contract guarantees src_off >= 2*stride.
uint col_base = src_off + c;
int s_m2 = int(u_src.src[col_base + (r - 2u) * stride]);
int s_m1 = int(u_src.src[col_base + (r - 1u) * stride]);
int s_0 = int(u_src.src[col_base + r * stride]);
int s_p1 = int(u_src.src[col_base + (r + 1u) * stride]);
int s_p2 = int(u_src.src[col_base + (r + 2u) * stride]);
int s_p3 = int(u_src.src[col_base + (r + 3u) * stride]);
int v = s_m2 - 5 * s_m1 + 20 * s_0 + 20 * s_p1 - 5 * s_p2 + s_p3 + 16;
int p = clamp(v >> 5, 0, 255);
uint final_off = dst_off + r * stride + c;
int prev = int(u_dst.dst[final_off]);
u_dst.dst[final_off] = uint8_t((prev + p + 1) >> 1);
}
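The "unsigned-safe" remark in the shader relies on modular arithmetic: for r < 2 the term (r - 2u) wraps, but the final sum still lands on the intended byte as long as src_off >= 2*stride. The sketch below replays that in C with `uint32_t`, which wraps the same way as GLSL `uint`. The function name `row_m2_offset` is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical mirror of the shader's address for the r-2 tap.
 * (r - 2u) wraps to a huge value when r < 2, and the multiply and add
 * wrap again, so the result equals src_off + c + (r - 2) * stride
 * modulo 2^32. That is a valid offset only if src_off >= 2 * stride. */
static uint32_t row_m2_offset(uint32_t src_off, uint32_t stride,
                              uint32_t r, uint32_t c)
{
    assert(src_off >= 2u * stride);  /* the API contract cited in the shader */
    uint32_t col_base = src_off + c;
    return col_base + (r - 2u) * stride;
}
```

With src_off = 64, stride = 16 and c = 3, output row 0 reads offset 35 (two rows above), row 1 reads 51 and row 2 reads 67, which is the block's own first row.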
+52
@@ -0,0 +1,52 @@
// daedalus-fourier — H.264 luma qpel avg_mc03 (biprediction) (8x8, ¾-pel vertical),
// V3D 7.1. Per H.264 §8.4.2.2.1 "n" position:
//
// dst[r,c] = ((clip255(mc02(s)[r,c]) + s[r+1, c] + 1) >> 1)
//
// Same as mc01 but L2-averages with src[r+1, c] instead of src[r, c].
//
//
// avg_ variant for B-slice biprediction per H.264 §8.4.2.3.1:
// dst[r,c] = avg(dst[r,c], mc03_value)
// Caller pre-loads dst with the list0 prediction; this shader
// folds in the list1 contribution.
//
// License: BSD-2-Clause.
#version 450
#extension GL_EXT_shader_8bit_storage : require
#extension GL_EXT_shader_explicit_arithmetic_types : require
layout(local_size_x = 64, local_size_y = 1, local_size_z = 1) in;
layout(binding = 0) readonly buffer Src { uint8_t src[]; } u_src;
layout(binding = 1) buffer Dst { uint8_t dst[]; } u_dst;
layout(binding = 2) readonly buffer Meta { uvec4 meta[]; } u_meta;
layout(push_constant) uniform PC { uint n_blocks, stride_u8, _p0, _p1; } pc;
void main()
{
uint block_idx = gl_WorkGroupID.x;
if (block_idx >= pc.n_blocks) return;
uint lane = gl_LocalInvocationID.x;
uint r = lane >> 3, c = lane & 7u;
uint dst_off = u_meta.meta[block_idx].x;
uint src_off = u_meta.meta[block_idx].y;
uint stride = pc.stride_u8;
uint col_base = src_off + c;
int s_m2 = int(u_src.src[col_base + (r - 2u) * stride]);
int s_m1 = int(u_src.src[col_base + (r - 1u) * stride]);
int s_0 = int(u_src.src[col_base + r * stride]);
int s_p1 = int(u_src.src[col_base + (r + 1u) * stride]);
int s_p2 = int(u_src.src[col_base + (r + 2u) * stride]);
int s_p3 = int(u_src.src[col_base + (r + 3u) * stride]);
int v = s_m2 - 5 * s_m1 + 20 * s_0 + 20 * s_p1 - 5 * s_p2 + s_p3 + 16;
int vp = clamp(v >> 5, 0, 255);
int avg = (vp + s_p1 + 1) >> 1; // L2 with src[r+1, c]
uint final_off = dst_off + r * stride + c;
int prev = int(u_dst.dst[final_off]);
u_dst.dst[final_off] = uint8_t((prev + avg + 1) >> 1);
}
+55
@@ -0,0 +1,55 @@
// daedalus-fourier — H.264 luma qpel avg_mc10 (biprediction) (8x8, ¼-pel horizontal),
// V3D 7.1. Per H.264 §8.4.2.2.1 "a" position:
//
// dst[r,c] = ((clip255(mc20(s)[r,c]) + s[r,c] + 1) >> 1)
//
// = horizontal half-pel filter, clipped to u8, then L2 rounded-averaged
// with the integer source pixel at the SAME position. Sibling of
// v3d_h264_qpel_mc20.comp with the L2 step added at the tail.
//
//
// avg_ variant for B-slice biprediction per H.264 §8.4.2.3.1:
// dst[r,c] = avg(dst[r,c], mc10_value)
// Caller pre-loads dst with the list0 prediction; this shader
// folds in the list1 contribution.
//
// License: BSD-2-Clause.
#version 450
#extension GL_EXT_shader_8bit_storage : require
#extension GL_EXT_shader_explicit_arithmetic_types : require
layout(local_size_x = 64, local_size_y = 1, local_size_z = 1) in;
layout(binding = 0) readonly buffer Src { uint8_t src[]; } u_src;
layout(binding = 1) buffer Dst { uint8_t dst[]; } u_dst;
layout(binding = 2) readonly buffer Meta { uvec4 meta[]; } u_meta;
layout(push_constant) uniform PC { uint n_blocks, stride_u8, _p0, _p1; } pc;
void main()
{
uint block_idx = gl_WorkGroupID.x;
if (block_idx >= pc.n_blocks) return;
uint lane = gl_LocalInvocationID.x;
uint r = lane >> 3, c = lane & 7u;
uint dst_off = u_meta.meta[block_idx].x;
uint src_off = u_meta.meta[block_idx].y;
uint stride = pc.stride_u8;
uint row_base = src_off + r * stride + c;
int s_m2 = int(u_src.src[row_base - 2u]);
int s_m1 = int(u_src.src[row_base - 1u]);
int s_0 = int(u_src.src[row_base ]);
int s_p1 = int(u_src.src[row_base + 1u]);
int s_p2 = int(u_src.src[row_base + 2u]);
int s_p3 = int(u_src.src[row_base + 3u]);
int v = s_m2 - 5 * s_m1 + 20 * s_0 + 20 * s_p1 - 5 * s_p2 + s_p3 + 16;
int hp = clamp(v >> 5, 0, 255);
// L2 average with the integer source at the SAME (r, c) position.
int avg = (hp + s_0 + 1) >> 1;
uint final_off = dst_off + r * stride + c;
int prev = int(u_dst.dst[final_off]);
u_dst.dst[final_off] = uint8_t((prev + avg + 1) >> 1);
}
+96
@@ -0,0 +1,96 @@
// daedalus-fourier — H.264 luma qpel avg_mc11 (biprediction) (8x8, diagonal quarter-pel),
// V3D 7.1. Per H.264 §8.4.2.2.1 (table 8-4) — composes two half-pel
// anchors via L2 rounded-average:
//
// mc11[r,c] = avg(mc20(r, c),
// mc02(r, c))
//
// Per-lane structure: each lane computes BOTH anchor outputs at its
// own (r, c) target offset, then L2 averages. No shared memory.
// Same WG geometry as the other qpel shaders.
//
//
// avg_ variant for B-slice biprediction per H.264 §8.4.2.3.1:
// dst[r,c] = avg(dst[r,c], mc11_value)
// Caller pre-loads dst with the list0 prediction; this shader
// folds in the list1 contribution.
//
// License: BSD-2-Clause.
#version 450
#extension GL_EXT_shader_8bit_storage : require
#extension GL_EXT_shader_explicit_arithmetic_types : require
layout(local_size_x = 64, local_size_y = 1, local_size_z = 1) in;
layout(binding = 0) readonly buffer Src { uint8_t src[]; } u_src;
layout(binding = 1) buffer Dst { uint8_t dst[]; } u_dst;
layout(binding = 2) readonly buffer Meta { uvec4 meta[]; } u_meta;
layout(push_constant) uniform PC { uint n_blocks, stride_u8, _p0, _p1; } pc;
int hpel_h(uint src_off, uint stride, uint r, uint c) {
uint row_base = src_off + r * stride + c;
int s_m2 = int(u_src.src[row_base - 2u]);
int s_m1 = int(u_src.src[row_base - 1u]);
int s_0 = int(u_src.src[row_base ]);
int s_p1 = int(u_src.src[row_base + 1u]);
int s_p2 = int(u_src.src[row_base + 2u]);
int s_p3 = int(u_src.src[row_base + 3u]);
int v = s_m2 - 5*s_m1 + 20*s_0 + 20*s_p1 - 5*s_p2 + s_p3 + 16;
return clamp(v >> 5, 0, 255);
}
int hpel_v(uint src_off, uint stride, uint r, uint c) {
uint col_base = src_off + c;
int s_m2 = int(u_src.src[col_base + (r - 2u) * stride]);
int s_m1 = int(u_src.src[col_base + (r - 1u) * stride]);
int s_0 = int(u_src.src[col_base + r * stride]);
int s_p1 = int(u_src.src[col_base + (r + 1u) * stride]);
int s_p2 = int(u_src.src[col_base + (r + 2u) * stride]);
int s_p3 = int(u_src.src[col_base + (r + 3u) * stride]);
int v = s_m2 - 5*s_m1 + 20*s_0 + 20*s_p1 - 5*s_p2 + s_p3 + 16;
return clamp(v >> 5, 0, 255);
}
int hpel_hv_row(uint src_off, uint stride, uint rr, uint c) {
// Single row's int16 horizontal lowpass (NOT clipped — used as
// intermediate for the vertical pass of hpel_hv).
uint row_base = src_off + rr * stride + c;
int s_m2 = int(u_src.src[row_base - 2u]);
int s_m1 = int(u_src.src[row_base - 1u]);
int s_0 = int(u_src.src[row_base ]);
int s_p1 = int(u_src.src[row_base + 1u]);
int s_p2 = int(u_src.src[row_base + 2u]);
int s_p3 = int(u_src.src[row_base + 3u]);
return s_m2 - 5*s_m1 + 20*s_0 + 20*s_p1 - 5*s_p2 + s_p3;
}
int hpel_hv(uint src_off, uint stride, uint r, uint c) {
int t0 = hpel_hv_row(src_off, stride, r - 2u, c);
int t1 = hpel_hv_row(src_off, stride, r - 1u, c);
int t2 = hpel_hv_row(src_off, stride, r, c);
int t3 = hpel_hv_row(src_off, stride, r + 1u, c);
int t4 = hpel_hv_row(src_off, stride, r + 2u, c);
int t5 = hpel_hv_row(src_off, stride, r + 3u, c);
int v = t0 - 5*t1 + 20*t2 + 20*t3 - 5*t4 + t5 + 512;
return clamp(v >> 10, 0, 255);
}
void main()
{
uint block_idx = gl_WorkGroupID.x;
if (block_idx >= pc.n_blocks) return;
uint lane = gl_LocalInvocationID.x;
uint r = lane >> 3, c = lane & 7u;
uint dst_off = u_meta.meta[block_idx].x;
uint src_off = u_meta.meta[block_idx].y;
uint stride = pc.stride_u8;
int a = hpel_h(src_off, stride, r, c);
int b = hpel_v(src_off, stride, r, c);
int avg = (a + b + 1) >> 1;
uint final_off = dst_off + r * stride + c;
int prev = int(u_dst.dst[final_off]);
u_dst.dst[final_off] = uint8_t((prev + avg + 1) >> 1);
}
+96
@@ -0,0 +1,96 @@
// daedalus-fourier — H.264 luma qpel avg_mc12 (biprediction) (8x8, diagonal quarter-pel),
// V3D 7.1. Per H.264 §8.4.2.2.1 (table 8-4) — composes two half-pel
// anchors via L2 rounded-average:
//
// mc12[r,c] = avg(mc22(r, c),
// mc02(r, c))
//
// Per-lane structure: each lane computes BOTH anchor outputs at its
// own (r, c) target offset, then L2 averages. No shared memory.
// Same WG geometry as the other qpel shaders.
//
//
// avg_ variant for B-slice biprediction per H.264 §8.4.2.3.1:
// dst[r,c] = avg(dst[r,c], mc12_value)
// Caller pre-loads dst with the list0 prediction; this shader
// folds in the list1 contribution.
//
// License: BSD-2-Clause.
#version 450
#extension GL_EXT_shader_8bit_storage : require
#extension GL_EXT_shader_explicit_arithmetic_types : require
layout(local_size_x = 64, local_size_y = 1, local_size_z = 1) in;
layout(binding = 0) readonly buffer Src { uint8_t src[]; } u_src;
layout(binding = 1) buffer Dst { uint8_t dst[]; } u_dst;
layout(binding = 2) readonly buffer Meta { uvec4 meta[]; } u_meta;
layout(push_constant) uniform PC { uint n_blocks, stride_u8, _p0, _p1; } pc;
int hpel_h(uint src_off, uint stride, uint r, uint c) {
uint row_base = src_off + r * stride + c;
int s_m2 = int(u_src.src[row_base - 2u]);
int s_m1 = int(u_src.src[row_base - 1u]);
int s_0 = int(u_src.src[row_base ]);
int s_p1 = int(u_src.src[row_base + 1u]);
int s_p2 = int(u_src.src[row_base + 2u]);
int s_p3 = int(u_src.src[row_base + 3u]);
int v = s_m2 - 5*s_m1 + 20*s_0 + 20*s_p1 - 5*s_p2 + s_p3 + 16;
return clamp(v >> 5, 0, 255);
}
int hpel_v(uint src_off, uint stride, uint r, uint c) {
uint col_base = src_off + c;
int s_m2 = int(u_src.src[col_base + (r - 2u) * stride]);
int s_m1 = int(u_src.src[col_base + (r - 1u) * stride]);
int s_0 = int(u_src.src[col_base + r * stride]);
int s_p1 = int(u_src.src[col_base + (r + 1u) * stride]);
int s_p2 = int(u_src.src[col_base + (r + 2u) * stride]);
int s_p3 = int(u_src.src[col_base + (r + 3u) * stride]);
int v = s_m2 - 5*s_m1 + 20*s_0 + 20*s_p1 - 5*s_p2 + s_p3 + 16;
return clamp(v >> 5, 0, 255);
}
int hpel_hv_row(uint src_off, uint stride, uint rr, uint c) {
// Single row's int16 horizontal lowpass (NOT clipped — used as
// intermediate for the vertical pass of hpel_hv).
uint row_base = src_off + rr * stride + c;
int s_m2 = int(u_src.src[row_base - 2u]);
int s_m1 = int(u_src.src[row_base - 1u]);
int s_0 = int(u_src.src[row_base ]);
int s_p1 = int(u_src.src[row_base + 1u]);
int s_p2 = int(u_src.src[row_base + 2u]);
int s_p3 = int(u_src.src[row_base + 3u]);
return s_m2 - 5*s_m1 + 20*s_0 + 20*s_p1 - 5*s_p2 + s_p3;
}
int hpel_hv(uint src_off, uint stride, uint r, uint c) {
int t0 = hpel_hv_row(src_off, stride, r - 2u, c);
int t1 = hpel_hv_row(src_off, stride, r - 1u, c);
int t2 = hpel_hv_row(src_off, stride, r, c);
int t3 = hpel_hv_row(src_off, stride, r + 1u, c);
int t4 = hpel_hv_row(src_off, stride, r + 2u, c);
int t5 = hpel_hv_row(src_off, stride, r + 3u, c);
int v = t0 - 5*t1 + 20*t2 + 20*t3 - 5*t4 + t5 + 512;
return clamp(v >> 10, 0, 255);
}
void main()
{
uint block_idx = gl_WorkGroupID.x;
if (block_idx >= pc.n_blocks) return;
uint lane = gl_LocalInvocationID.x;
uint r = lane >> 3, c = lane & 7u;
uint dst_off = u_meta.meta[block_idx].x;
uint src_off = u_meta.meta[block_idx].y;
uint stride = pc.stride_u8;
int a = hpel_hv(src_off, stride, r, c);
int b = hpel_v(src_off, stride, r, c);
int avg = (a + b + 1) >> 1;
uint final_off = dst_off + r * stride + c;
int prev = int(u_dst.dst[final_off]);
u_dst.dst[final_off] = uint8_t((prev + avg + 1) >> 1);
}
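The centre half-pel anchor (`hpel_hv`, used as mc22 above) differs from the other two anchors in that the horizontal pass is left unrounded and unclipped, and a single rounding of +512 with a shift by 10 is applied after the vertical pass. A scalar sketch under those assumptions follows. The names `lowpass_h_ref` and `hpel_hv_ref` are hypothetical, and right shifts of negative values are assumed to be arithmetic.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Unrounded, unclipped horizontal 6-tap sum at p (range -2550..10710). */
static int lowpass_h_ref(const uint8_t *p)
{
    return p[-2] - 5 * p[-1] + 20 * p[0] + 20 * p[1] - 5 * p[2] + p[3];
}

/* Hypothetical scalar mirror of the shader's hpel_hv: six horizontal sums
 * for rows r-2..r+3, then one vertical 6-tap pass with a single rounding. */
static int hpel_hv_ref(const uint8_t *src, ptrdiff_t stride)
{
    assert(stride != 0);
    int t[6];
    for (int i = 0; i < 6; i++)
        t[i] = lowpass_h_ref(src + (i - 2) * stride);
    int v = t[0] - 5 * t[1] + 20 * t[2] + 20 * t[3] - 5 * t[4] + t[5] + 512;
    v >>= 10;                        /* both 6-tap gains (32 * 32) at once */
    return v < 0 ? 0 : v > 255 ? 255 : v;
}
```

On a flat 6x6 patch of value 100 each row sum is 3200 and the result is 100. On a patch whose left three columns are 0 and right three are 255, each row sum is 4080 and the result is 131072 >> 10 = 128.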
+96
@@ -0,0 +1,96 @@
// daedalus-fourier — H.264 luma qpel avg_mc13 (biprediction) (8x8, diagonal quarter-pel),
// V3D 7.1. Per H.264 §8.4.2.2.1 (table 8-4) — composes two half-pel
// anchors via L2 rounded-average:
//
// mc13[r,c] = avg(mc20(r+1, c),
// mc02(r, c))
//
// Per-lane structure: each lane computes BOTH anchor outputs at its
// own (r, c) target offset, then L2 averages. No shared memory.
// Same WG geometry as the other qpel shaders.
//
//
// avg_ variant for B-slice biprediction per H.264 §8.4.2.3.1:
// dst[r,c] = avg(dst[r,c], mc13_value)
// Caller pre-loads dst with the list0 prediction; this shader
// folds in the list1 contribution.
//
// License: BSD-2-Clause.
#version 450
#extension GL_EXT_shader_8bit_storage : require
#extension GL_EXT_shader_explicit_arithmetic_types : require
layout(local_size_x = 64, local_size_y = 1, local_size_z = 1) in;
layout(binding = 0) readonly buffer Src { uint8_t src[]; } u_src;
layout(binding = 1) buffer Dst { uint8_t dst[]; } u_dst;
layout(binding = 2) readonly buffer Meta { uvec4 meta[]; } u_meta;
layout(push_constant) uniform PC { uint n_blocks, stride_u8, _p0, _p1; } pc;
int hpel_h(uint src_off, uint stride, uint r, uint c) {
uint row_base = src_off + r * stride + c;
int s_m2 = int(u_src.src[row_base - 2u]);
int s_m1 = int(u_src.src[row_base - 1u]);
int s_0 = int(u_src.src[row_base ]);
int s_p1 = int(u_src.src[row_base + 1u]);
int s_p2 = int(u_src.src[row_base + 2u]);
int s_p3 = int(u_src.src[row_base + 3u]);
int v = s_m2 - 5*s_m1 + 20*s_0 + 20*s_p1 - 5*s_p2 + s_p3 + 16;
return clamp(v >> 5, 0, 255);
}
int hpel_v(uint src_off, uint stride, uint r, uint c) {
uint col_base = src_off + c;
int s_m2 = int(u_src.src[col_base + (r - 2u) * stride]);
int s_m1 = int(u_src.src[col_base + (r - 1u) * stride]);
int s_0 = int(u_src.src[col_base + r * stride]);
int s_p1 = int(u_src.src[col_base + (r + 1u) * stride]);
int s_p2 = int(u_src.src[col_base + (r + 2u) * stride]);
int s_p3 = int(u_src.src[col_base + (r + 3u) * stride]);
int v = s_m2 - 5*s_m1 + 20*s_0 + 20*s_p1 - 5*s_p2 + s_p3 + 16;
return clamp(v >> 5, 0, 255);
}
int hpel_hv_row(uint src_off, uint stride, uint rr, uint c) {
// Single row's int16 horizontal lowpass (NOT clipped — used as
// intermediate for the vertical pass of hpel_hv).
uint row_base = src_off + rr * stride + c;
int s_m2 = int(u_src.src[row_base - 2u]);
int s_m1 = int(u_src.src[row_base - 1u]);
int s_0 = int(u_src.src[row_base ]);
int s_p1 = int(u_src.src[row_base + 1u]);
int s_p2 = int(u_src.src[row_base + 2u]);
int s_p3 = int(u_src.src[row_base + 3u]);
return s_m2 - 5*s_m1 + 20*s_0 + 20*s_p1 - 5*s_p2 + s_p3;
}
int hpel_hv(uint src_off, uint stride, uint r, uint c) {
int t0 = hpel_hv_row(src_off, stride, r - 2u, c);
int t1 = hpel_hv_row(src_off, stride, r - 1u, c);
int t2 = hpel_hv_row(src_off, stride, r, c);
int t3 = hpel_hv_row(src_off, stride, r + 1u, c);
int t4 = hpel_hv_row(src_off, stride, r + 2u, c);
int t5 = hpel_hv_row(src_off, stride, r + 3u, c);
int v = t0 - 5*t1 + 20*t2 + 20*t3 - 5*t4 + t5 + 512;
return clamp(v >> 10, 0, 255);
}
void main()
{
uint block_idx = gl_WorkGroupID.x;
if (block_idx >= pc.n_blocks) return;
uint lane = gl_LocalInvocationID.x;
uint r = lane >> 3, c = lane & 7u;
uint dst_off = u_meta.meta[block_idx].x;
uint src_off = u_meta.meta[block_idx].y;
uint stride = pc.stride_u8;
int a = hpel_h(src_off, stride, r+1u, c);
int b = hpel_v(src_off, stride, r, c);
int avg = (a + b + 1) >> 1;
uint final_off = dst_off + r * stride + c;
int prev = int(u_dst.dst[final_off]);
u_dst.dst[final_off] = uint8_t((prev + avg + 1) >> 1);
}
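The diagonal composition in the header comment above can be checked against a small CPU-side sketch. This is an illustration only: `clip255`, `hpel_h_ref`, `hpel_v_ref`, `mc13_ref` and `bipred_avg` are hypothetical names, not functions from this repository, and the pointer `p` is assumed to have 2 rows and 2 columns of readable context before it and 3 after it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Clip to the 8-bit sample range, like clamp(x, 0, 255) in the shader. */
static int clip255(int x) { return x < 0 ? 0 : (x > 255 ? 255 : x); }

/* Horizontal half-pel sample at p[0]; reads p[-2] .. p[3].
 * Assumes arithmetic right shift of negative ints (true on GCC/Clang). */
static int hpel_h_ref(const uint8_t *p)
{
    int v = p[-2] - 5 * p[-1] + 20 * p[0] + 20 * p[1] - 5 * p[2] + p[3] + 16;
    return clip255(v >> 5);
}

/* Vertical half-pel sample at p[0]; reads rows -2 .. +3 of the same column. */
static int hpel_v_ref(const uint8_t *p, ptrdiff_t stride)
{
    assert(stride > 0);
    int v = p[-2 * stride] - 5 * p[-stride] + 20 * p[0] + 20 * p[stride]
          - 5 * p[2 * stride] + p[3 * stride] + 16;
    return clip255(v >> 5);
}

/* mc13 at p = &src[r][c]: rounded average of the horizontal anchor one
 * row down and the vertical anchor at the pixel itself. */
static int mc13_ref(const uint8_t *p, ptrdiff_t stride)
{
    int a = hpel_h_ref(p + stride);   /* mc20(r+1, c) */
    int b = hpel_v_ref(p, stride);    /* mc02(r, c)   */
    return (a + b + 1) >> 1;
}

/* Biprediction fold-in used by every avg_ shader: dst = avg(dst, pred). */
static uint8_t bipred_avg(uint8_t prev, int pred)
{
    return (uint8_t)((prev + pred + 1) >> 1);
}
```

Calling `bipred_avg(dst[i], mc13_ref(&src[i], stride))` per pixel mirrors what one shader lane does; the other diagonal positions differ only in which two anchors are averaged.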
@@ -0,0 +1,91 @@
// daedalus-fourier — H.264 luma qpel avg_mc20 (biprediction) (8x8, horizontal half-pel), V3D 7.1.
//
// H.264 spec §8.4.2.2.1 horizontal 6-tap luma interpolation:
//
// dst[r,c] = clip255(
// ( s[r,c-2]
// - 5 * s[r,c-1]
// + 20 * s[r,c]
// + 20 * s[r,c+1]
// - 5 * s[r,c+2]
// + s[r,c+3]
// + 16
// ) >> 5)
//
// Single-stride: dst and src share `stride` (H264QpelContext
// convention). src+src_off already points at the leftmost output
// column (col 0); the filter reads cols -2..+3. Caller guarantees
// edge-padding context per the public API docstring.
//
// Workgroup layout: 64 invocations = 1 lane per output pixel.
// 1 block per WG; n_blocks WGs total. This is the simplest layout
// that avoids any inter-lane communication — each lane independently
// reads its 6 src samples and writes its 1 dst sample. V3D's L2
// cache handles the redundant reads from adjacent lanes.
//
//
// avg_ variant for B-slice biprediction per H.264 §8.4.2.3.1:
// dst[r,c] = avg(dst[r,c], mc20_value)
// Caller pre-loads dst with the list0 prediction; this shader
// folds in the list1 contribution.
//
// License: BSD-2-Clause.
#version 450
#extension GL_EXT_shader_8bit_storage : require
#extension GL_EXT_shader_explicit_arithmetic_types : require
layout(local_size_x = 64, local_size_y = 1, local_size_z = 1) in;
layout(binding = 0) readonly buffer Src {
uint8_t src[];
} u_src;
layout(binding = 1) buffer Dst {
uint8_t dst[];
} u_dst;
layout(binding = 2) readonly buffer Meta {
uvec4 meta[]; // .x = dst_off, .y = src_off
} u_meta;
layout(push_constant) uniform PC {
uint n_blocks;
uint stride_u8;
uint _pad0, _pad1;
} pc;
void main()
{
// 1 block per WG, 64 lanes covering the 8x8 output block.
uint wg_id = gl_WorkGroupID.x;
uint block_idx = wg_id;
if (block_idx >= pc.n_blocks) return;
uint lane = gl_LocalInvocationID.x;
uint r = lane >> 3; // 0..7 (row)
uint c = lane & 7u; // 0..7 (column)
uint dst_off = u_meta.meta[block_idx].x;
uint src_off = u_meta.meta[block_idx].y;
uint stride = pc.stride_u8;
// src points at output col 0 of the block; filter reads cols -2..+3
// of the current row. Negative col arithmetic is unsigned-safe
// because src_off >= 2 (caller-guaranteed left context).
uint row_base = src_off + r * stride + c;
int s_m2 = int(u_src.src[row_base - 2u]);
int s_m1 = int(u_src.src[row_base - 1u]);
int s_0 = int(u_src.src[row_base + 0u]);
int s_p1 = int(u_src.src[row_base + 1u]);
int s_p2 = int(u_src.src[row_base + 2u]);
int s_p3 = int(u_src.src[row_base + 3u]);
int v = s_m2 - 5 * s_m1 + 20 * s_0 + 20 * s_p1 - 5 * s_p2 + s_p3 + 16;
int p = clamp(v >> 5, 0, 255);
uint final_off = dst_off + r * stride + c;
int prev = int(u_dst.dst[final_off]);
u_dst.dst[final_off] = uint8_t((prev + p + 1) >> 1);
}
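For cross-checking the shader output on the CPU, the same per-lane work can be written as a plain loop. This is a sketch under assumptions, not the project's reference implementation: `avg_mc20_ref` and its helpers are hypothetical names, and `src` is assumed to have 2 readable columns to the left and 3 to the right of every row, as the header comment requires.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Clip to the 8-bit sample range, like clamp(x, 0, 255) in the shader. */
static int clip255(int x) { return x < 0 ? 0 : (x > 255 ? 255 : x); }

/* Horizontal 6-tap half-pel sample at p[0]; reads p[-2] .. p[3].
 * Assumes arithmetic right shift of negative ints (true on GCC/Clang). */
static int hpel_h_ref(const uint8_t *p)
{
    int v = p[-2] - 5 * p[-1] + 20 * p[0] + 20 * p[1] - 5 * p[2] + p[3] + 16;
    return clip255(v >> 5);
}

/* avg_mc20 for one 8x8 block: dst = (dst + mc20 + 1) >> 1.
 * dst and src share one stride, as in the shader. */
static void avg_mc20_ref(uint8_t *dst, const uint8_t *src, ptrdiff_t stride)
{
    assert(stride >= 8);
    for (int lane = 0; lane < 64; lane++) {
        int r = lane >> 3, c = lane & 7;   /* same lane mapping as the shader */
        int p = hpel_h_ref(src + r * stride + c);
        uint8_t *d = dst + r * stride + c;
        *d = (uint8_t)((*d + p + 1) >> 1); /* fold list1 into list0 */
    }
}
```

The loop index plays the role of `gl_LocalInvocationID.x`; one call covers the work of one workgroup.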
@@ -0,0 +1,96 @@
// daedalus-fourier — H.264 luma qpel avg_mc21 (biprediction) (8x8, diagonal quarter-pel),
// V3D 7.1. Per H.264 §8.4.2.2.1 (table 8-4) — composes two half-pel
// anchors via L2 rounded-average:
//
// mc21[r,c] = avg(mc22(r, c),
// mc20(r, c))
//
// Per-lane structure: each lane computes BOTH anchor outputs at its
// own (r, c) target offset, then L2 averages. No shared memory.
// Same WG geometry as the other qpel shaders.
//
//
// avg_ variant for B-slice biprediction per H.264 §8.4.2.3.1:
// dst[r,c] = avg(dst[r,c], mc21_value)
// Caller pre-loads dst with the list0 prediction; this shader
// folds in the list1 contribution.
//
// License: BSD-2-Clause.
#version 450
#extension GL_EXT_shader_8bit_storage : require
#extension GL_EXT_shader_explicit_arithmetic_types : require
layout(local_size_x = 64, local_size_y = 1, local_size_z = 1) in;
layout(binding = 0) readonly buffer Src { uint8_t src[]; } u_src;
layout(binding = 1) buffer Dst { uint8_t dst[]; } u_dst;
layout(binding = 2) readonly buffer Meta { uvec4 meta[]; } u_meta;
layout(push_constant) uniform PC { uint n_blocks, stride_u8, _p0, _p1; } pc;
int hpel_h(uint src_off, uint stride, uint r, uint c) {
uint row_base = src_off + r * stride + c;
int s_m2 = int(u_src.src[row_base - 2u]);
int s_m1 = int(u_src.src[row_base - 1u]);
int s_0 = int(u_src.src[row_base ]);
int s_p1 = int(u_src.src[row_base + 1u]);
int s_p2 = int(u_src.src[row_base + 2u]);
int s_p3 = int(u_src.src[row_base + 3u]);
int v = s_m2 - 5*s_m1 + 20*s_0 + 20*s_p1 - 5*s_p2 + s_p3 + 16;
return clamp(v >> 5, 0, 255);
}
int hpel_v(uint src_off, uint stride, uint r, uint c) {
uint col_base = src_off + c;
int s_m2 = int(u_src.src[col_base + (r - 2u) * stride]);
int s_m1 = int(u_src.src[col_base + (r - 1u) * stride]);
int s_0 = int(u_src.src[col_base + r * stride]);
int s_p1 = int(u_src.src[col_base + (r + 1u) * stride]);
int s_p2 = int(u_src.src[col_base + (r + 2u) * stride]);
int s_p3 = int(u_src.src[col_base + (r + 3u) * stride]);
int v = s_m2 - 5*s_m1 + 20*s_0 + 20*s_p1 - 5*s_p2 + s_p3 + 16;
return clamp(v >> 5, 0, 255);
}
int hpel_hv_row(uint src_off, uint stride, uint rr, uint c) {
// Single row's int16 horizontal lowpass (NOT clipped — used as
// intermediate for the vertical pass of hpel_hv).
uint row_base = src_off + rr * stride + c;
int s_m2 = int(u_src.src[row_base - 2u]);
int s_m1 = int(u_src.src[row_base - 1u]);
int s_0 = int(u_src.src[row_base ]);
int s_p1 = int(u_src.src[row_base + 1u]);
int s_p2 = int(u_src.src[row_base + 2u]);
int s_p3 = int(u_src.src[row_base + 3u]);
return s_m2 - 5*s_m1 + 20*s_0 + 20*s_p1 - 5*s_p2 + s_p3;
}
int hpel_hv(uint src_off, uint stride, uint r, uint c) {
int t0 = hpel_hv_row(src_off, stride, r - 2u, c);
int t1 = hpel_hv_row(src_off, stride, r - 1u, c);
int t2 = hpel_hv_row(src_off, stride, r, c);
int t3 = hpel_hv_row(src_off, stride, r + 1u, c);
int t4 = hpel_hv_row(src_off, stride, r + 2u, c);
int t5 = hpel_hv_row(src_off, stride, r + 3u, c);
int v = t0 - 5*t1 + 20*t2 + 20*t3 - 5*t4 + t5 + 512;
return clamp(v >> 10, 0, 255);
}
void main()
{
uint block_idx = gl_WorkGroupID.x;
if (block_idx >= pc.n_blocks) return;
uint lane = gl_LocalInvocationID.x;
uint r = lane >> 3, c = lane & 7u;
uint dst_off = u_meta.meta[block_idx].x;
uint src_off = u_meta.meta[block_idx].y;
uint stride = pc.stride_u8;
int a = hpel_hv(src_off, stride, r, c);
int b = hpel_h(src_off, stride, r, c);
int avg = (a + b + 1) >> 1;
uint final_off = dst_off + r * stride + c;
int prev = int(u_dst.dst[final_off]);
u_dst.dst[final_off] = uint8_t((prev + avg + 1) >> 1);
}
@@ -0,0 +1,94 @@
// daedalus-fourier — H.264 luma qpel avg_mc22 (biprediction) (8x8, 2D half-pel "j" position).
// V3D 7.1.
//
// Cascaded H+V 6-tap per H.264 §8.4.2.2.1 / FFmpeg ff_put_h264_qpel8_mc22_neon:
//
// tmp[r,c] = src[r,c-2] - 5*src[r,c-1] + 20*src[r,c] + 20*src[r,c+1]
// - 5*src[r,c+2] + src[r,c+3] (int16)
//
// dst[r,c] = clip255((tmp[r-2,c] - 5*tmp[r-1,c] + 20*tmp[r,c]
// + 20*tmp[r+1,c] - 5*tmp[r+2,c] + tmp[r+3,c]
// + 512) >> 10)
//
// The +512 >> 10 final scale compensates for both 6-tap scalings.
// CANNOT just cascade mc20→mc02 because intermediate must be int16
// (no per-stage clip), so this is a dedicated kernel.
//
// Per-lane structure: each lane computes its own (r, c) output by
// running the FULL cascade — 6 horizontal lowpass int16 values for
// rows r-2..r+3, then a vertical lowpass on those. ~50 ALU ops per
// lane. No shared memory / barriers needed; V3D L2 absorbs the
// redundant src reads across lanes.
//
// WG layout: 64 lanes / 1 block-per-WG / 1 lane-per-output-pixel
// (same as mc20 / mc02).
//
//
// avg_ variant for B-slice biprediction per H.264 §8.4.2.3.1:
// dst[r,c] = avg(dst[r,c], mc22_value)
// Caller pre-loads dst with the list0 prediction; this shader
// folds in the list1 contribution.
//
// License: BSD-2-Clause.
#version 450
#extension GL_EXT_shader_8bit_storage : require
#extension GL_EXT_shader_explicit_arithmetic_types : require
layout(local_size_x = 64, local_size_y = 1, local_size_z = 1) in;
layout(binding = 0) readonly buffer Src { uint8_t src[]; } u_src;
layout(binding = 1) buffer Dst { uint8_t dst[]; } u_dst;
layout(binding = 2) readonly buffer Meta { uvec4 meta[]; } u_meta;
layout(push_constant) uniform PC {
uint n_blocks;
uint stride_u8;
uint _pad0, _pad1;
} pc;
// Horizontal 6-tap filter at (row_off, c) — reads src at cols c-2..c+3
// of the row identified by row_off, returns int16 intermediate (NOT
// scaled — the v-pass does the +512 >> 10 for both stages).
int hpel_h(uint row_off, uint c)
{
int s_m2 = int(u_src.src[row_off + c - 2u]);
int s_m1 = int(u_src.src[row_off + c - 1u]);
int s_0 = int(u_src.src[row_off + c ]);
int s_p1 = int(u_src.src[row_off + c + 1u]);
int s_p2 = int(u_src.src[row_off + c + 2u]);
int s_p3 = int(u_src.src[row_off + c + 3u]);
return s_m2 - 5 * s_m1 + 20 * s_0 + 20 * s_p1 - 5 * s_p2 + s_p3;
}
void main()
{
uint block_idx = gl_WorkGroupID.x;
if (block_idx >= pc.n_blocks) return;
uint lane = gl_LocalInvocationID.x;
uint r = lane >> 3;
uint c = lane & 7u;
uint dst_off = u_meta.meta[block_idx].x;
uint src_off = u_meta.meta[block_idx].y;
uint stride = pc.stride_u8;
// Compute 6 horizontal lowpass values at rows r-2..r+3 (relative
// to the output row r) of column c. src_off+r*stride+c is the
// output pixel position; we sample rows r-2..r+3.
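// Unsigned-safe: (r - 2u) wraps mod 2^32, but the final index lands back
// in range as long as src_off >= 2*stride + 2 (two rows of top context
// plus two columns of left context), per the caller contract.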
int t0 = hpel_h(src_off + (r - 2u) * stride, c);
int t1 = hpel_h(src_off + (r - 1u) * stride, c);
int t2 = hpel_h(src_off + r * stride, c);
int t3 = hpel_h(src_off + (r + 1u) * stride, c);
int t4 = hpel_h(src_off + (r + 2u) * stride, c);
int t5 = hpel_h(src_off + (r + 3u) * stride, c);
int v = t0 - 5 * t1 + 20 * t2 + 20 * t3 - 5 * t4 + t5 + 512;
int p = clamp(v >> 10, 0, 255);
uint final_off = dst_off + r * stride + c;
int prev = int(u_dst.dst[final_off]);
u_dst.dst[final_off] = uint8_t((prev + p + 1) >> 1);
}
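The cascade described in the header comment, and the range of its unclipped intermediate, can be sketched on the CPU as follows. The names `lowpass6`, `hrow_ref` and `mc22_ref` are hypothetical and used here only for illustration. With 8-bit inputs the horizontal pass stays within -2550..10710, which is why the intermediate is described as int16 even though the shader holds it in a 32-bit `int`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Clip to the 8-bit sample range, like clamp(x, 0, 255) in the shader. */
static int clip255(int x) { return x < 0 ? 0 : (x > 255 ? 255 : x); }

/* The (1, -5, 20, 20, -5, 1) kernel with no rounding and no scaling. */
static int lowpass6(int a, int b, int c, int d, int e, int f)
{
    return a - 5 * b + 20 * c + 20 * d - 5 * e + f;
}

/* Unclipped horizontal pass for one row; reads p[-2] .. p[3]. */
static int hrow_ref(const uint8_t *p)
{
    return lowpass6(p[-2], p[-1], p[0], p[1], p[2], p[3]);
}

/* mc22 ("j" position) at p = &src[r][c]: vertical pass over six unclipped
 * horizontal results, then a single +512 >> 10 for both stages.
 * Assumes arithmetic right shift of negative ints (true on GCC/Clang). */
static int mc22_ref(const uint8_t *p, ptrdiff_t stride)
{
    assert(stride > 0);
    int t[6];
    for (int k = 0; k < 6; k++)
        t[k] = hrow_ref(p + (k - 2) * stride);   /* rows r-2 .. r+3 */
    int v = lowpass6(t[0], t[1], t[2], t[3], t[4], t[5]) + 512;
    return clip255(v >> 10);
}
```

On a flat region the two passes each multiply by 32, so the single `>> 10` returns the input value unchanged.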
@@ -0,0 +1,96 @@
// daedalus-fourier — H.264 luma qpel avg_mc23 (biprediction) (8x8, diagonal quarter-pel),
// V3D 7.1. Per H.264 §8.4.2.2.1 (table 8-4) — composes two half-pel
// anchors via L2 rounded-average:
//
// mc23[r,c] = avg(mc22(r, c),
// mc20(r+1, c))
//
// Per-lane structure: each lane computes BOTH anchor outputs at its
// own (r, c) target offset, then L2 averages. No shared memory.
// Same WG geometry as the other qpel shaders.
//
//
// avg_ variant for B-slice biprediction per H.264 §8.4.2.3.1:
// dst[r,c] = avg(dst[r,c], mc23_value)
// Caller pre-loads dst with the list0 prediction; this shader
// folds in the list1 contribution.
//
// License: BSD-2-Clause.
#version 450
#extension GL_EXT_shader_8bit_storage : require
#extension GL_EXT_shader_explicit_arithmetic_types : require
layout(local_size_x = 64, local_size_y = 1, local_size_z = 1) in;
layout(binding = 0) readonly buffer Src { uint8_t src[]; } u_src;
layout(binding = 1) buffer Dst { uint8_t dst[]; } u_dst;
layout(binding = 2) readonly buffer Meta { uvec4 meta[]; } u_meta;
layout(push_constant) uniform PC { uint n_blocks, stride_u8, _p0, _p1; } pc;
int hpel_h(uint src_off, uint stride, uint r, uint c) {
uint row_base = src_off + r * stride + c;
int s_m2 = int(u_src.src[row_base - 2u]);
int s_m1 = int(u_src.src[row_base - 1u]);
int s_0 = int(u_src.src[row_base ]);
int s_p1 = int(u_src.src[row_base + 1u]);
int s_p2 = int(u_src.src[row_base + 2u]);
int s_p3 = int(u_src.src[row_base + 3u]);
int v = s_m2 - 5*s_m1 + 20*s_0 + 20*s_p1 - 5*s_p2 + s_p3 + 16;
return clamp(v >> 5, 0, 255);
}
int hpel_v(uint src_off, uint stride, uint r, uint c) {
uint col_base = src_off + c;
int s_m2 = int(u_src.src[col_base + (r - 2u) * stride]);
int s_m1 = int(u_src.src[col_base + (r - 1u) * stride]);
int s_0 = int(u_src.src[col_base + r * stride]);
int s_p1 = int(u_src.src[col_base + (r + 1u) * stride]);
int s_p2 = int(u_src.src[col_base + (r + 2u) * stride]);
int s_p3 = int(u_src.src[col_base + (r + 3u) * stride]);
int v = s_m2 - 5*s_m1 + 20*s_0 + 20*s_p1 - 5*s_p2 + s_p3 + 16;
return clamp(v >> 5, 0, 255);
}
int hpel_hv_row(uint src_off, uint stride, uint rr, uint c) {
// Single row's int16 horizontal lowpass (NOT clipped — used as
// intermediate for the vertical pass of hpel_hv).
uint row_base = src_off + rr * stride + c;
int s_m2 = int(u_src.src[row_base - 2u]);
int s_m1 = int(u_src.src[row_base - 1u]);
int s_0 = int(u_src.src[row_base ]);
int s_p1 = int(u_src.src[row_base + 1u]);
int s_p2 = int(u_src.src[row_base + 2u]);
int s_p3 = int(u_src.src[row_base + 3u]);
return s_m2 - 5*s_m1 + 20*s_0 + 20*s_p1 - 5*s_p2 + s_p3;
}
int hpel_hv(uint src_off, uint stride, uint r, uint c) {
int t0 = hpel_hv_row(src_off, stride, r - 2u, c);
int t1 = hpel_hv_row(src_off, stride, r - 1u, c);
int t2 = hpel_hv_row(src_off, stride, r, c);
int t3 = hpel_hv_row(src_off, stride, r + 1u, c);
int t4 = hpel_hv_row(src_off, stride, r + 2u, c);
int t5 = hpel_hv_row(src_off, stride, r + 3u, c);
int v = t0 - 5*t1 + 20*t2 + 20*t3 - 5*t4 + t5 + 512;
return clamp(v >> 10, 0, 255);
}
void main()
{
uint block_idx = gl_WorkGroupID.x;
if (block_idx >= pc.n_blocks) return;
uint lane = gl_LocalInvocationID.x;
uint r = lane >> 3, c = lane & 7u;
uint dst_off = u_meta.meta[block_idx].x;
uint src_off = u_meta.meta[block_idx].y;
uint stride = pc.stride_u8;
int a = hpel_hv(src_off, stride, r, c);
int b = hpel_h(src_off, stride, r+1u, c);
int avg = (a + b + 1) >> 1;
uint final_off = dst_off + r * stride + c;
int prev = int(u_dst.dst[final_off]);
u_dst.dst[final_off] = uint8_t((prev + avg + 1) >> 1);
}
@@ -0,0 +1,52 @@
// daedalus-fourier — H.264 luma qpel avg_mc30 (biprediction) (8x8, ¾-pel horizontal),
// V3D 7.1. Per H.264 §8.4.2.2.1 "c" position:
//
// dst[r,c] = ((clip255(mc20(s)[r,c]) + s[r,c+1] + 1) >> 1)
//
// Same as mc10 but L2-averages with src[r, c+1] instead of src[r, c].
//
//
// avg_ variant for B-slice biprediction per H.264 §8.4.2.3.1:
// dst[r,c] = avg(dst[r,c], mc30_value)
// Caller pre-loads dst with the list0 prediction; this shader
// folds in the list1 contribution.
//
// License: BSD-2-Clause.
#version 450
#extension GL_EXT_shader_8bit_storage : require
#extension GL_EXT_shader_explicit_arithmetic_types : require
layout(local_size_x = 64, local_size_y = 1, local_size_z = 1) in;
layout(binding = 0) readonly buffer Src { uint8_t src[]; } u_src;
layout(binding = 1) buffer Dst { uint8_t dst[]; } u_dst;
layout(binding = 2) readonly buffer Meta { uvec4 meta[]; } u_meta;
layout(push_constant) uniform PC { uint n_blocks, stride_u8, _p0, _p1; } pc;
void main()
{
uint block_idx = gl_WorkGroupID.x;
if (block_idx >= pc.n_blocks) return;
uint lane = gl_LocalInvocationID.x;
uint r = lane >> 3, c = lane & 7u;
uint dst_off = u_meta.meta[block_idx].x;
uint src_off = u_meta.meta[block_idx].y;
uint stride = pc.stride_u8;
uint row_base = src_off + r * stride + c;
int s_m2 = int(u_src.src[row_base - 2u]);
int s_m1 = int(u_src.src[row_base - 1u]);
int s_0 = int(u_src.src[row_base ]);
int s_p1 = int(u_src.src[row_base + 1u]);
int s_p2 = int(u_src.src[row_base + 2u]);
int s_p3 = int(u_src.src[row_base + 3u]);
int v = s_m2 - 5 * s_m1 + 20 * s_0 + 20 * s_p1 - 5 * s_p2 + s_p3 + 16;
int hp = clamp(v >> 5, 0, 255);
int avg = (hp + s_p1 + 1) >> 1; // L2 with src[r, c+1]
uint final_off = dst_off + r * stride + c;
int prev = int(u_dst.dst[final_off]);
u_dst.dst[final_off] = uint8_t((prev + avg + 1) >> 1);
}
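The only difference between the "a" (mc10) and "c" (mc30) positions is which integer neighbour is averaged with the half-pel value. A minimal sketch, assuming a hypothetical helper `qpel_h_ref` whose `xfrac` parameter is 1 for mc10 and 3 for mc30 (neither name comes from the repository):

```c
#include <assert.h>
#include <stdint.h>

/* Clip to the 8-bit sample range, like clamp(x, 0, 255) in the shader. */
static int clip255(int x) { return x < 0 ? 0 : (x > 255 ? 255 : x); }

/* Horizontal 6-tap half-pel sample at p[0]; reads p[-2] .. p[3].
 * Assumes arithmetic right shift of negative ints (true on GCC/Clang). */
static int hpel_h_ref(const uint8_t *p)
{
    int v = p[-2] - 5 * p[-1] + 20 * p[0] + 20 * p[1] - 5 * p[2] + p[3] + 16;
    return clip255(v >> 5);
}

/* Horizontal quarter-pel sample: xfrac == 1 is the "a" position (mc10),
 * xfrac == 3 is the "c" position (mc30). */
static int qpel_h_ref(const uint8_t *p, int xfrac)
{
    assert(xfrac == 1 || xfrac == 3);
    int hp = hpel_h_ref(p);                    /* clipped mc20 value     */
    int nb = (xfrac == 1) ? p[0] : p[1];       /* nearest integer sample */
    return (hp + nb + 1) >> 1;
}
```

The avg_ variant then folds this value into `dst` with one more rounded average, exactly as in the shader's last line.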
@@ -0,0 +1,96 @@
// daedalus-fourier — H.264 luma qpel avg_mc31 (biprediction) (8x8, diagonal quarter-pel),
// V3D 7.1. Per H.264 §8.4.2.2.1 (table 8-4) — composes two half-pel
// anchors via L2 rounded-average:
//
// mc31[r,c] = avg(mc20(r, c),
// mc02(r, c+1))
//
// Per-lane structure: each lane computes BOTH anchor outputs at its
// own (r, c) target offset, then L2 averages. No shared memory.
// Same WG geometry as the other qpel shaders.
//
//
// avg_ variant for B-slice biprediction per H.264 §8.4.2.3.1:
// dst[r,c] = avg(dst[r,c], mc31_value)
// Caller pre-loads dst with the list0 prediction; this shader
// folds in the list1 contribution.
//
// License: BSD-2-Clause.
#version 450
#extension GL_EXT_shader_8bit_storage : require
#extension GL_EXT_shader_explicit_arithmetic_types : require
layout(local_size_x = 64, local_size_y = 1, local_size_z = 1) in;
layout(binding = 0) readonly buffer Src { uint8_t src[]; } u_src;
layout(binding = 1) buffer Dst { uint8_t dst[]; } u_dst;
layout(binding = 2) readonly buffer Meta { uvec4 meta[]; } u_meta;
layout(push_constant) uniform PC { uint n_blocks, stride_u8, _p0, _p1; } pc;
int hpel_h(uint src_off, uint stride, uint r, uint c) {
uint row_base = src_off + r * stride + c;
int s_m2 = int(u_src.src[row_base - 2u]);
int s_m1 = int(u_src.src[row_base - 1u]);
int s_0 = int(u_src.src[row_base ]);
int s_p1 = int(u_src.src[row_base + 1u]);
int s_p2 = int(u_src.src[row_base + 2u]);
int s_p3 = int(u_src.src[row_base + 3u]);
int v = s_m2 - 5*s_m1 + 20*s_0 + 20*s_p1 - 5*s_p2 + s_p3 + 16;
return clamp(v >> 5, 0, 255);
}
int hpel_v(uint src_off, uint stride, uint r, uint c) {
uint col_base = src_off + c;
int s_m2 = int(u_src.src[col_base + (r - 2u) * stride]);
int s_m1 = int(u_src.src[col_base + (r - 1u) * stride]);
int s_0 = int(u_src.src[col_base + r * stride]);
int s_p1 = int(u_src.src[col_base + (r + 1u) * stride]);
int s_p2 = int(u_src.src[col_base + (r + 2u) * stride]);
int s_p3 = int(u_src.src[col_base + (r + 3u) * stride]);
int v = s_m2 - 5*s_m1 + 20*s_0 + 20*s_p1 - 5*s_p2 + s_p3 + 16;
return clamp(v >> 5, 0, 255);
}
int hpel_hv_row(uint src_off, uint stride, uint rr, uint c) {
// Single row's int16 horizontal lowpass (NOT clipped — used as
// intermediate for the vertical pass of hpel_hv).
uint row_base = src_off + rr * stride + c;
int s_m2 = int(u_src.src[row_base - 2u]);
int s_m1 = int(u_src.src[row_base - 1u]);
int s_0 = int(u_src.src[row_base ]);
int s_p1 = int(u_src.src[row_base + 1u]);
int s_p2 = int(u_src.src[row_base + 2u]);
int s_p3 = int(u_src.src[row_base + 3u]);
return s_m2 - 5*s_m1 + 20*s_0 + 20*s_p1 - 5*s_p2 + s_p3;
}
int hpel_hv(uint src_off, uint stride, uint r, uint c) {
int t0 = hpel_hv_row(src_off, stride, r - 2u, c);
int t1 = hpel_hv_row(src_off, stride, r - 1u, c);
int t2 = hpel_hv_row(src_off, stride, r, c);
int t3 = hpel_hv_row(src_off, stride, r + 1u, c);
int t4 = hpel_hv_row(src_off, stride, r + 2u, c);
int t5 = hpel_hv_row(src_off, stride, r + 3u, c);
int v = t0 - 5*t1 + 20*t2 + 20*t3 - 5*t4 + t5 + 512;
return clamp(v >> 10, 0, 255);
}
void main()
{
uint block_idx = gl_WorkGroupID.x;
if (block_idx >= pc.n_blocks) return;
uint lane = gl_LocalInvocationID.x;
uint r = lane >> 3, c = lane & 7u;
uint dst_off = u_meta.meta[block_idx].x;
uint src_off = u_meta.meta[block_idx].y;
uint stride = pc.stride_u8;
int a = hpel_h(src_off, stride, r, c);
int b = hpel_v(src_off, stride, r, c+1u);
int avg = (a + b + 1) >> 1;
uint final_off = dst_off + r * stride + c;
int prev = int(u_dst.dst[final_off]);
u_dst.dst[final_off] = uint8_t((prev + avg + 1) >> 1);
}
@@ -0,0 +1,96 @@
// daedalus-fourier — H.264 luma qpel avg_mc32 (biprediction) (8x8, diagonal quarter-pel),
// V3D 7.1. Per H.264 §8.4.2.2.1 (table 8-4) — composes two half-pel
// anchors via L2 rounded-average:
//
// mc32[r,c] = avg(mc22(r, c),
// mc02(r, c+1))
//
// Per-lane structure: each lane computes BOTH anchor outputs at its
// own (r, c) target offset, then L2 averages. No shared memory.
// Same WG geometry as the other qpel shaders.
//
//
// avg_ variant for B-slice biprediction per H.264 §8.4.2.3.1:
// dst[r,c] = avg(dst[r,c], mc32_value)
// Caller pre-loads dst with the list0 prediction; this shader
// folds in the list1 contribution.
//
// License: BSD-2-Clause.
#version 450
#extension GL_EXT_shader_8bit_storage : require
#extension GL_EXT_shader_explicit_arithmetic_types : require
layout(local_size_x = 64, local_size_y = 1, local_size_z = 1) in;
layout(binding = 0) readonly buffer Src { uint8_t src[]; } u_src;
layout(binding = 1) buffer Dst { uint8_t dst[]; } u_dst;
layout(binding = 2) readonly buffer Meta { uvec4 meta[]; } u_meta;
layout(push_constant) uniform PC { uint n_blocks, stride_u8, _p0, _p1; } pc;
int hpel_h(uint src_off, uint stride, uint r, uint c) {
uint row_base = src_off + r * stride + c;
int s_m2 = int(u_src.src[row_base - 2u]);
int s_m1 = int(u_src.src[row_base - 1u]);
int s_0 = int(u_src.src[row_base ]);
int s_p1 = int(u_src.src[row_base + 1u]);
int s_p2 = int(u_src.src[row_base + 2u]);
int s_p3 = int(u_src.src[row_base + 3u]);
int v = s_m2 - 5*s_m1 + 20*s_0 + 20*s_p1 - 5*s_p2 + s_p3 + 16;
return clamp(v >> 5, 0, 255);
}
int hpel_v(uint src_off, uint stride, uint r, uint c) {
uint col_base = src_off + c;
int s_m2 = int(u_src.src[col_base + (r - 2u) * stride]);
int s_m1 = int(u_src.src[col_base + (r - 1u) * stride]);
int s_0 = int(u_src.src[col_base + r * stride]);
int s_p1 = int(u_src.src[col_base + (r + 1u) * stride]);
int s_p2 = int(u_src.src[col_base + (r + 2u) * stride]);
int s_p3 = int(u_src.src[col_base + (r + 3u) * stride]);
int v = s_m2 - 5*s_m1 + 20*s_0 + 20*s_p1 - 5*s_p2 + s_p3 + 16;
return clamp(v >> 5, 0, 255);
}
int hpel_hv_row(uint src_off, uint stride, uint rr, uint c) {
// Single row's int16 horizontal lowpass (NOT clipped — used as
// intermediate for the vertical pass of hpel_hv).
uint row_base = src_off + rr * stride + c;
int s_m2 = int(u_src.src[row_base - 2u]);
int s_m1 = int(u_src.src[row_base - 1u]);
int s_0 = int(u_src.src[row_base ]);
int s_p1 = int(u_src.src[row_base + 1u]);
int s_p2 = int(u_src.src[row_base + 2u]);
int s_p3 = int(u_src.src[row_base + 3u]);
return s_m2 - 5*s_m1 + 20*s_0 + 20*s_p1 - 5*s_p2 + s_p3;
}
int hpel_hv(uint src_off, uint stride, uint r, uint c) {
int t0 = hpel_hv_row(src_off, stride, r - 2u, c);
int t1 = hpel_hv_row(src_off, stride, r - 1u, c);
int t2 = hpel_hv_row(src_off, stride, r, c);
int t3 = hpel_hv_row(src_off, stride, r + 1u, c);
int t4 = hpel_hv_row(src_off, stride, r + 2u, c);
int t5 = hpel_hv_row(src_off, stride, r + 3u, c);
int v = t0 - 5*t1 + 20*t2 + 20*t3 - 5*t4 + t5 + 512;
return clamp(v >> 10, 0, 255);
}
void main()
{
uint block_idx = gl_WorkGroupID.x;
if (block_idx >= pc.n_blocks) return;
uint lane = gl_LocalInvocationID.x;
uint r = lane >> 3, c = lane & 7u;
uint dst_off = u_meta.meta[block_idx].x;
uint src_off = u_meta.meta[block_idx].y;
uint stride = pc.stride_u8;
int a = hpel_hv(src_off, stride, r, c);
int b = hpel_v(src_off, stride, r, c+1u);
int avg = (a + b + 1) >> 1;
uint final_off = dst_off + r * stride + c;
int prev = int(u_dst.dst[final_off]);
u_dst.dst[final_off] = uint8_t((prev + avg + 1) >> 1);
}
@@ -0,0 +1,96 @@
// daedalus-fourier — H.264 luma qpel avg_mc33 (biprediction) (8x8, diagonal quarter-pel),
// V3D 7.1. Per H.264 §8.4.2.2.1 (table 8-4) — composes two half-pel
// anchors via L2 rounded-average:
//
// mc33[r,c] = avg(mc20(r+1, c),
// mc02(r, c+1))
//
// Per-lane structure: each lane computes BOTH anchor outputs at its
// own (r, c) target offset, then L2 averages. No shared memory.
// Same WG geometry as the other qpel shaders.
//
//
// avg_ variant for B-slice biprediction per H.264 §8.4.2.3.1:
// dst[r,c] = avg(dst[r,c], mc33_value)
// Caller pre-loads dst with the list0 prediction; this shader
// folds in the list1 contribution.
//
// License: BSD-2-Clause.
#version 450
#extension GL_EXT_shader_8bit_storage : require
#extension GL_EXT_shader_explicit_arithmetic_types : require
layout(local_size_x = 64, local_size_y = 1, local_size_z = 1) in;
layout(binding = 0) readonly buffer Src { uint8_t src[]; } u_src;
layout(binding = 1) buffer Dst { uint8_t dst[]; } u_dst;
layout(binding = 2) readonly buffer Meta { uvec4 meta[]; } u_meta;
layout(push_constant) uniform PC { uint n_blocks, stride_u8, _p0, _p1; } pc;
int hpel_h(uint src_off, uint stride, uint r, uint c) {
uint row_base = src_off + r * stride + c;
int s_m2 = int(u_src.src[row_base - 2u]);
int s_m1 = int(u_src.src[row_base - 1u]);
int s_0 = int(u_src.src[row_base ]);
int s_p1 = int(u_src.src[row_base + 1u]);
int s_p2 = int(u_src.src[row_base + 2u]);
int s_p3 = int(u_src.src[row_base + 3u]);
int v = s_m2 - 5*s_m1 + 20*s_0 + 20*s_p1 - 5*s_p2 + s_p3 + 16;
return clamp(v >> 5, 0, 255);
}
int hpel_v(uint src_off, uint stride, uint r, uint c) {
uint col_base = src_off + c;
int s_m2 = int(u_src.src[col_base + (r - 2u) * stride]);
int s_m1 = int(u_src.src[col_base + (r - 1u) * stride]);
int s_0 = int(u_src.src[col_base + r * stride]);
int s_p1 = int(u_src.src[col_base + (r + 1u) * stride]);
int s_p2 = int(u_src.src[col_base + (r + 2u) * stride]);
int s_p3 = int(u_src.src[col_base + (r + 3u) * stride]);
int v = s_m2 - 5*s_m1 + 20*s_0 + 20*s_p1 - 5*s_p2 + s_p3 + 16;
return clamp(v >> 5, 0, 255);
}
int hpel_hv_row(uint src_off, uint stride, uint rr, uint c) {
// Single row's int16 horizontal lowpass (NOT clipped — used as
// intermediate for the vertical pass of hpel_hv).
uint row_base = src_off + rr * stride + c;
int s_m2 = int(u_src.src[row_base - 2u]);
int s_m1 = int(u_src.src[row_base - 1u]);
int s_0 = int(u_src.src[row_base ]);
int s_p1 = int(u_src.src[row_base + 1u]);
int s_p2 = int(u_src.src[row_base + 2u]);
int s_p3 = int(u_src.src[row_base + 3u]);
return s_m2 - 5*s_m1 + 20*s_0 + 20*s_p1 - 5*s_p2 + s_p3;
}
int hpel_hv(uint src_off, uint stride, uint r, uint c) {
int t0 = hpel_hv_row(src_off, stride, r - 2u, c);
int t1 = hpel_hv_row(src_off, stride, r - 1u, c);
int t2 = hpel_hv_row(src_off, stride, r, c);
int t3 = hpel_hv_row(src_off, stride, r + 1u, c);
int t4 = hpel_hv_row(src_off, stride, r + 2u, c);
int t5 = hpel_hv_row(src_off, stride, r + 3u, c);
int v = t0 - 5*t1 + 20*t2 + 20*t3 - 5*t4 + t5 + 512;
return clamp(v >> 10, 0, 255);
}
void main()
{
uint block_idx = gl_WorkGroupID.x;
if (block_idx >= pc.n_blocks) return;
uint lane = gl_LocalInvocationID.x;
uint r = lane >> 3, c = lane & 7u;
uint dst_off = u_meta.meta[block_idx].x;
uint src_off = u_meta.meta[block_idx].y;
uint stride = pc.stride_u8;
int a = hpel_h(src_off, stride, r+1u, c);
int b = hpel_v(src_off, stride, r, c+1u);
int avg = (a + b + 1) >> 1;
uint final_off = dst_off + r * stride + c;
int prev = int(u_dst.dst[final_off]);
u_dst.dst[final_off] = uint8_t((prev + avg + 1) >> 1);
}
@@ -0,0 +1,44 @@
// daedalus-fourier — H.264 luma qpel mc01 (8x8, ¼-pel vertical),
// V3D 7.1. Per H.264 §8.4.2.2.1 "d" position:
//
// dst[r,c] = ((clip255(mc02(s)[r,c]) + s[r,c] + 1) >> 1)
//
// Sibling of v3d_h264_qpel_mc02.comp with L2 step against src[r, c].
//
// License: BSD-2-Clause.
#version 450
#extension GL_EXT_shader_8bit_storage : require
#extension GL_EXT_shader_explicit_arithmetic_types : require
layout(local_size_x = 64, local_size_y = 1, local_size_z = 1) in;
layout(binding = 0) readonly buffer Src { uint8_t src[]; } u_src;
layout(binding = 1) buffer Dst { uint8_t dst[]; } u_dst;
layout(binding = 2) readonly buffer Meta { uvec4 meta[]; } u_meta;
layout(push_constant) uniform PC { uint n_blocks, stride_u8, _p0, _p1; } pc;
void main()
{
uint block_idx = gl_WorkGroupID.x;
if (block_idx >= pc.n_blocks) return;
uint lane = gl_LocalInvocationID.x;
uint r = lane >> 3, c = lane & 7u;
uint dst_off = u_meta.meta[block_idx].x;
uint src_off = u_meta.meta[block_idx].y;
uint stride = pc.stride_u8;
uint col_base = src_off + c;
int s_m2 = int(u_src.src[col_base + (r - 2u) * stride]);
int s_m1 = int(u_src.src[col_base + (r - 1u) * stride]);
int s_0 = int(u_src.src[col_base + r * stride]);
int s_p1 = int(u_src.src[col_base + (r + 1u) * stride]);
int s_p2 = int(u_src.src[col_base + (r + 2u) * stride]);
int s_p3 = int(u_src.src[col_base + (r + 3u) * stride]);
int v = s_m2 - 5 * s_m1 + 20 * s_0 + 20 * s_p1 - 5 * s_p2 + s_p3 + 16;
int vp = clamp(v >> 5, 0, 255);
int avg = (vp + s_0 + 1) >> 1; // L2 with src[r, c]
u_dst.dst[dst_off + r * stride + c] = uint8_t(avg);
}
@@ -0,0 +1,69 @@
// daedalus-fourier — H.264 luma qpel mc02 (8x8, vertical half-pel), V3D 7.1.
//
// Sibling of cycle 9's v3d_h264_qpel_mc20.comp. Same 6-tap filter,
// transposed to vertical direction:
//
// dst[r,c] = clip255(
// ( s[r-2,c]
// - 5 * s[r-1,c]
// + 20 * s[r, c]
// + 20 * s[r+1,c]
// - 5 * s[r+2,c]
// + s[r+3,c]
// + 16
// ) >> 5)
//
// src+src_off points at row 0 col 0 of the OUTPUT block; the filter
// reads rows -2..+3 (2 rows of top context, 3 rows of bottom).
//
// Same WG layout as mc20: 64 lanes / 1 block-per-WG / 1 lane-per-pixel.
//
// License: BSD-2-Clause.
#version 450
#extension GL_EXT_shader_8bit_storage : require
#extension GL_EXT_shader_explicit_arithmetic_types : require
layout(local_size_x = 64, local_size_y = 1, local_size_z = 1) in;
layout(binding = 0) readonly buffer Src { uint8_t src[]; } u_src;
layout(binding = 1) buffer Dst { uint8_t dst[]; } u_dst;
layout(binding = 2) readonly buffer Meta { uvec4 meta[]; } u_meta;
layout(push_constant) uniform PC {
uint n_blocks;
uint stride_u8;
uint _pad0, _pad1;
} pc;
void main()
{
uint block_idx = gl_WorkGroupID.x;
if (block_idx >= pc.n_blocks) return;
uint lane = gl_LocalInvocationID.x;
uint r = lane >> 3;
uint c = lane & 7u;
uint dst_off = u_meta.meta[block_idx].x;
uint src_off = u_meta.meta[block_idx].y;
uint stride = pc.stride_u8;
// Read the 6 rows of vertical context at col (c) of THIS output row.
// src_off+r*stride+c is at the OUTPUT pixel position; the kernel
// samples r-2..r+3 along the column. Unsigned-safe because the
// public API contract guarantees src_off >= 2*stride.
uint col_base = src_off + c;
int s_m2 = int(u_src.src[col_base + (r - 2u) * stride]);
int s_m1 = int(u_src.src[col_base + (r - 1u) * stride]);
int s_0 = int(u_src.src[col_base + r * stride]);
int s_p1 = int(u_src.src[col_base + (r + 1u) * stride]);
int s_p2 = int(u_src.src[col_base + (r + 2u) * stride]);
int s_p3 = int(u_src.src[col_base + (r + 3u) * stride]);
int v = s_m2 - 5 * s_m1 + 20 * s_0 + 20 * s_p1 - 5 * s_p2 + s_p3 + 16;
int p = clamp(v >> 5, 0, 255);
u_dst.dst[dst_off + r * stride + c] = uint8_t(p);
}
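For checking the shader above against a CPU result, the vertical 6-tap formula in its header can be written as a scalar C loop. This is a minimal sketch, not the project's CPU path: `ref_qpel8_mc02` and `scale_clip` are hypothetical names, and the caller is assumed to provide 2 readable rows above and 3 below the block, matching the shader's contract.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Shift right and clamp to 0..255. Negative sums are clamped before the
 * shift so the result does not depend on how C shifts negative ints. */
static int scale_clip(int v, int shift)
{
    if (v < 0)
        return 0;
    v >>= shift;
    return v > 255 ? 255 : v;
}

/* Hypothetical scalar reference for the 8x8 vertical half-pel filter.
 * src points at row 0, col 0 of the output block; dst and src share
 * one stride, as in the shader. */
static void ref_qpel8_mc02(uint8_t *dst, const uint8_t *src, ptrdiff_t stride)
{
    assert(stride >= 8);
    for (int r = 0; r < 8; r++) {
        for (int c = 0; c < 8; c++) {
            const uint8_t *p = src + r * stride + c;
            int v = p[-2 * stride] - 5 * p[-stride] + 20 * p[0]
                  + 20 * p[stride] - 5 * p[2 * stride] + p[3 * stride] + 16;
            dst[r * stride + c] = (uint8_t)scale_clip(v, 5);
        }
    }
}
```

A flat input is a quick sanity check: the taps sum to 32, so a block of constant value comes back unchanged.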
@@ -0,0 +1,44 @@
// daedalus-fourier — H.264 luma qpel mc03 (8x8, ¾-pel vertical),
// V3D 7.1. Per H.264 §8.4.2.2.1 "n" position:
//
// dst[r,c] = ((clip255(mc02(s)[r,c]) + s[r+1, c] + 1) >> 1)
//
// Same as mc01 but L2-averages with src[r+1, c] instead of src[r, c].
//
// License: BSD-2-Clause.
#version 450
#extension GL_EXT_shader_8bit_storage : require
#extension GL_EXT_shader_explicit_arithmetic_types : require
layout(local_size_x = 64, local_size_y = 1, local_size_z = 1) in;
layout(binding = 0) readonly buffer Src { uint8_t src[]; } u_src;
layout(binding = 1) buffer Dst { uint8_t dst[]; } u_dst;
layout(binding = 2) readonly buffer Meta { uvec4 meta[]; } u_meta;
layout(push_constant) uniform PC { uint n_blocks, stride_u8, _p0, _p1; } pc;
void main()
{
uint block_idx = gl_WorkGroupID.x;
if (block_idx >= pc.n_blocks) return;
uint lane = gl_LocalInvocationID.x;
uint r = lane >> 3, c = lane & 7u;
uint dst_off = u_meta.meta[block_idx].x;
uint src_off = u_meta.meta[block_idx].y;
uint stride = pc.stride_u8;
uint col_base = src_off + c;
int s_m2 = int(u_src.src[col_base + (r - 2u) * stride]);
int s_m1 = int(u_src.src[col_base + (r - 1u) * stride]);
int s_0 = int(u_src.src[col_base + r * stride]);
int s_p1 = int(u_src.src[col_base + (r + 1u) * stride]);
int s_p2 = int(u_src.src[col_base + (r + 2u) * stride]);
int s_p3 = int(u_src.src[col_base + (r + 3u) * stride]);
int v = s_m2 - 5 * s_m1 + 20 * s_0 + 20 * s_p1 - 5 * s_p2 + s_p3 + 16;
int vp = clamp(v >> 5, 0, 255);
int avg = (vp + s_p1 + 1) >> 1; // L2 with src[r+1, c]
u_dst.dst[dst_off + r * stride + c] = uint8_t(avg);
}
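The `(r - 2u) * stride` indexing used above wraps around for r < 2, yet still lands on the intended byte, because the whole sum is taken modulo 2^32 and the true index is non-negative under the caller contract. The C sketch below illustrates that argument; the function names are hypothetical and `uint32_t` stands in for GLSL `uint`.

```c
#include <assert.h>
#include <stdint.h>

/* Index computed like the shader: everything in uint32, so (r - rows_up)
 * wraps around when r < rows_up. */
static uint32_t tap_index_u32(uint32_t src_off, uint32_t stride,
                              uint32_t r, uint32_t c, uint32_t rows_up)
{
    uint32_t dr = r - rows_up;               /* may wrap to a huge value */
    return (uint32_t)(src_off + c + (uint32_t)(dr * stride));
}

/* The same index in 64-bit signed arithmetic, with no wraparound. */
static int64_t tap_index_i64(uint32_t src_off, uint32_t stride,
                             uint32_t r, uint32_t c, uint32_t rows_up)
{
    return (int64_t)src_off + (int64_t)c
         + ((int64_t)r - (int64_t)rows_up) * (int64_t)stride;
}

/* Assert the caller contract (src_off >= rows_up * stride keeps the exact
 * index non-negative), then return the index the shader would use. */
static uint32_t checked_tap_index(uint32_t src_off, uint32_t stride,
                                  uint32_t r, uint32_t c, uint32_t rows_up)
{
    assert(tap_index_i64(src_off, stride, r, c, rows_up) >= 0);
    return tap_index_u32(src_off, stride, r, c, rows_up);
}
```

When the contract is violated, the exact index is negative and the wrapped uint32 value points far past the end of the buffer, which is why the guarantee belongs in the public API rather than in the shader.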
@@ -0,0 +1,47 @@
// daedalus-fourier — H.264 luma qpel mc10 (8x8, ¼-pel horizontal),
// V3D 7.1. Per H.264 §8.4.2.2.1 "a" position:
//
// dst[r,c] = ((clip255(mc20(s)[r,c]) + s[r,c] + 1) >> 1)
//
// = horizontal half-pel filter, clipped to u8, then L2 rounded-averaged
// with the integer source pixel at the SAME position. Sibling of
// v3d_h264_qpel_mc20.comp with the L2 step added at the tail.
//
// License: BSD-2-Clause.
#version 450
#extension GL_EXT_shader_8bit_storage : require
#extension GL_EXT_shader_explicit_arithmetic_types : require
layout(local_size_x = 64, local_size_y = 1, local_size_z = 1) in;
layout(binding = 0) readonly buffer Src { uint8_t src[]; } u_src;
layout(binding = 1) buffer Dst { uint8_t dst[]; } u_dst;
layout(binding = 2) readonly buffer Meta { uvec4 meta[]; } u_meta;
layout(push_constant) uniform PC { uint n_blocks, stride_u8, _p0, _p1; } pc;
void main()
{
uint block_idx = gl_WorkGroupID.x;
if (block_idx >= pc.n_blocks) return;
uint lane = gl_LocalInvocationID.x;
uint r = lane >> 3, c = lane & 7u;
uint dst_off = u_meta.meta[block_idx].x;
uint src_off = u_meta.meta[block_idx].y;
uint stride = pc.stride_u8;
uint row_base = src_off + r * stride + c;
int s_m2 = int(u_src.src[row_base - 2u]);
int s_m1 = int(u_src.src[row_base - 1u]);
int s_0 = int(u_src.src[row_base ]);
int s_p1 = int(u_src.src[row_base + 1u]);
int s_p2 = int(u_src.src[row_base + 2u]);
int s_p3 = int(u_src.src[row_base + 3u]);
int v = s_m2 - 5 * s_m1 + 20 * s_0 + 20 * s_p1 - 5 * s_p2 + s_p3 + 16;
int hp = clamp(v >> 5, 0, 255);
// L2 average with the integer source at the SAME (r, c) position.
int avg = (hp + s_0 + 1) >> 1;
u_dst.dst[dst_off + r * stride + c] = uint8_t(avg);
}
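The horizontal positions mc10, mc20 and mc30 differ only in the tail: the clipped half-pel sample is used as is, or averaged with the integer pixel on its left or right. A per-pixel C sketch of that relationship follows; `hpel_h` mirrors the shader arithmetic, while `qpel_h` and its `quarter` parameter are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Horizontal half-pel sample between p[0] and p[1], clipped to 0..255.
 * Needs 2 readable pixels left of p and 3 right of it. */
static int hpel_h(const uint8_t *p)
{
    int v = p[-2] - 5 * p[-1] + 20 * p[0] + 20 * p[1] - 5 * p[2] + p[3] + 16;
    if (v < 0)
        return 0;
    v >>= 5;
    return v > 255 ? 255 : v;
}

/* quarter = 1, 2, 3 selects mc10, mc20, mc30 at pixel p. */
static int qpel_h(const uint8_t *p, int quarter)
{
    assert(quarter >= 1 && quarter <= 3);
    int half = hpel_h(p);
    if (quarter == 1)
        return (half + p[0] + 1) >> 1;   /* average with the left integer pixel */
    if (quarter == 3)
        return (half + p[1] + 1) >> 1;   /* average with the right integer pixel */
    return half;
}
```

The vertical positions mc01, mc02 and mc03 follow the same pattern with rows in place of columns.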
@@ -0,0 +1,88 @@
// daedalus-fourier — H.264 luma qpel mc11 (8x8, diagonal quarter-pel),
// V3D 7.1. Per H.264 §8.4.2.2.1 (table 8-4) — composes two half-pel
// anchors via L2 rounded-average:
//
// mc11[r,c] = avg(mc20(r, c),
// mc02(r, c))
//
// Per-lane structure: each lane computes BOTH anchor outputs at its
// own (r, c) target offset, then L2 averages. No shared memory.
// Same WG geometry as the other qpel shaders.
//
// License: BSD-2-Clause.
#version 450
#extension GL_EXT_shader_8bit_storage : require
#extension GL_EXT_shader_explicit_arithmetic_types : require
layout(local_size_x = 64, local_size_y = 1, local_size_z = 1) in;
layout(binding = 0) readonly buffer Src { uint8_t src[]; } u_src;
layout(binding = 1) buffer Dst { uint8_t dst[]; } u_dst;
layout(binding = 2) readonly buffer Meta { uvec4 meta[]; } u_meta;
layout(push_constant) uniform PC { uint n_blocks, stride_u8, _p0, _p1; } pc;
int hpel_h(uint src_off, uint stride, uint r, uint c) {
uint row_base = src_off + r * stride + c;
int s_m2 = int(u_src.src[row_base - 2u]);
int s_m1 = int(u_src.src[row_base - 1u]);
int s_0 = int(u_src.src[row_base ]);
int s_p1 = int(u_src.src[row_base + 1u]);
int s_p2 = int(u_src.src[row_base + 2u]);
int s_p3 = int(u_src.src[row_base + 3u]);
int v = s_m2 - 5*s_m1 + 20*s_0 + 20*s_p1 - 5*s_p2 + s_p3 + 16;
return clamp(v >> 5, 0, 255);
}
int hpel_v(uint src_off, uint stride, uint r, uint c) {
uint col_base = src_off + c;
int s_m2 = int(u_src.src[col_base + (r - 2u) * stride]);
int s_m1 = int(u_src.src[col_base + (r - 1u) * stride]);
int s_0 = int(u_src.src[col_base + r * stride]);
int s_p1 = int(u_src.src[col_base + (r + 1u) * stride]);
int s_p2 = int(u_src.src[col_base + (r + 2u) * stride]);
int s_p3 = int(u_src.src[col_base + (r + 3u) * stride]);
int v = s_m2 - 5*s_m1 + 20*s_0 + 20*s_p1 - 5*s_p2 + s_p3 + 16;
return clamp(v >> 5, 0, 255);
}
int hpel_hv_row(uint src_off, uint stride, uint rr, uint c) {
// Single row's int16 horizontal lowpass (NOT clipped — used as
// intermediate for the vertical pass of hpel_hv).
uint row_base = src_off + rr * stride + c;
int s_m2 = int(u_src.src[row_base - 2u]);
int s_m1 = int(u_src.src[row_base - 1u]);
int s_0 = int(u_src.src[row_base ]);
int s_p1 = int(u_src.src[row_base + 1u]);
int s_p2 = int(u_src.src[row_base + 2u]);
int s_p3 = int(u_src.src[row_base + 3u]);
return s_m2 - 5*s_m1 + 20*s_0 + 20*s_p1 - 5*s_p2 + s_p3;
}
int hpel_hv(uint src_off, uint stride, uint r, uint c) {
int t0 = hpel_hv_row(src_off, stride, r - 2u, c);
int t1 = hpel_hv_row(src_off, stride, r - 1u, c);
int t2 = hpel_hv_row(src_off, stride, r, c);
int t3 = hpel_hv_row(src_off, stride, r + 1u, c);
int t4 = hpel_hv_row(src_off, stride, r + 2u, c);
int t5 = hpel_hv_row(src_off, stride, r + 3u, c);
int v = t0 - 5*t1 + 20*t2 + 20*t3 - 5*t4 + t5 + 512;
return clamp(v >> 10, 0, 255);
}
void main()
{
uint block_idx = gl_WorkGroupID.x;
if (block_idx >= pc.n_blocks) return;
uint lane = gl_LocalInvocationID.x;
uint r = lane >> 3, c = lane & 7u;
uint dst_off = u_meta.meta[block_idx].x;
uint src_off = u_meta.meta[block_idx].y;
uint stride = pc.stride_u8;
int a = hpel_h(src_off, stride, r, c);
int b = hpel_v(src_off, stride, r, c);
int avg = (a + b + 1) >> 1;
u_dst.dst[dst_off + r * stride + c] = uint8_t(avg);
}
@@ -0,0 +1,88 @@
// daedalus-fourier — H.264 luma qpel mc12 (8x8, diagonal quarter-pel),
// V3D 7.1. Per H.264 §8.4.2.2.1 (table 8-4) — composes two half-pel
// anchors via L2 rounded-average:
//
// mc12[r,c] = avg(mc22(r, c),
// mc02(r, c))
//
// Per-lane structure: each lane computes BOTH anchor outputs at its
// own (r, c) target offset, then L2 averages. No shared memory.
// Same WG geometry as the other qpel shaders.
//
// License: BSD-2-Clause.
#version 450
#extension GL_EXT_shader_8bit_storage : require
#extension GL_EXT_shader_explicit_arithmetic_types : require
layout(local_size_x = 64, local_size_y = 1, local_size_z = 1) in;
layout(binding = 0) readonly buffer Src { uint8_t src[]; } u_src;
layout(binding = 1) buffer Dst { uint8_t dst[]; } u_dst;
layout(binding = 2) readonly buffer Meta { uvec4 meta[]; } u_meta;
layout(push_constant) uniform PC { uint n_blocks, stride_u8, _p0, _p1; } pc;
int hpel_h(uint src_off, uint stride, uint r, uint c) {
uint row_base = src_off + r * stride + c;
int s_m2 = int(u_src.src[row_base - 2u]);
int s_m1 = int(u_src.src[row_base - 1u]);
int s_0 = int(u_src.src[row_base ]);
int s_p1 = int(u_src.src[row_base + 1u]);
int s_p2 = int(u_src.src[row_base + 2u]);
int s_p3 = int(u_src.src[row_base + 3u]);
int v = s_m2 - 5*s_m1 + 20*s_0 + 20*s_p1 - 5*s_p2 + s_p3 + 16;
return clamp(v >> 5, 0, 255);
}
int hpel_v(uint src_off, uint stride, uint r, uint c) {
uint col_base = src_off + c;
int s_m2 = int(u_src.src[col_base + (r - 2u) * stride]);
int s_m1 = int(u_src.src[col_base + (r - 1u) * stride]);
int s_0 = int(u_src.src[col_base + r * stride]);
int s_p1 = int(u_src.src[col_base + (r + 1u) * stride]);
int s_p2 = int(u_src.src[col_base + (r + 2u) * stride]);
int s_p3 = int(u_src.src[col_base + (r + 3u) * stride]);
int v = s_m2 - 5*s_m1 + 20*s_0 + 20*s_p1 - 5*s_p2 + s_p3 + 16;
return clamp(v >> 5, 0, 255);
}
int hpel_hv_row(uint src_off, uint stride, uint rr, uint c) {
// Single row's int16 horizontal lowpass (NOT clipped — used as
// intermediate for the vertical pass of hpel_hv).
uint row_base = src_off + rr * stride + c;
int s_m2 = int(u_src.src[row_base - 2u]);
int s_m1 = int(u_src.src[row_base - 1u]);
int s_0 = int(u_src.src[row_base ]);
int s_p1 = int(u_src.src[row_base + 1u]);
int s_p2 = int(u_src.src[row_base + 2u]);
int s_p3 = int(u_src.src[row_base + 3u]);
return s_m2 - 5*s_m1 + 20*s_0 + 20*s_p1 - 5*s_p2 + s_p3;
}
int hpel_hv(uint src_off, uint stride, uint r, uint c) {
int t0 = hpel_hv_row(src_off, stride, r - 2u, c);
int t1 = hpel_hv_row(src_off, stride, r - 1u, c);
int t2 = hpel_hv_row(src_off, stride, r, c);
int t3 = hpel_hv_row(src_off, stride, r + 1u, c);
int t4 = hpel_hv_row(src_off, stride, r + 2u, c);
int t5 = hpel_hv_row(src_off, stride, r + 3u, c);
int v = t0 - 5*t1 + 20*t2 + 20*t3 - 5*t4 + t5 + 512;
return clamp(v >> 10, 0, 255);
}
void main()
{
uint block_idx = gl_WorkGroupID.x;
if (block_idx >= pc.n_blocks) return;
uint lane = gl_LocalInvocationID.x;
uint r = lane >> 3, c = lane & 7u;
uint dst_off = u_meta.meta[block_idx].x;
uint src_off = u_meta.meta[block_idx].y;
uint stride = pc.stride_u8;
int a = hpel_hv(src_off, stride, r, c);
int b = hpel_v(src_off, stride, r, c);
int avg = (a + b + 1) >> 1;
u_dst.dst[dst_off + r * stride + c] = uint8_t(avg);
}
@@ -0,0 +1,88 @@
// daedalus-fourier — H.264 luma qpel mc13 (8x8, diagonal quarter-pel),
// V3D 7.1. Per H.264 §8.4.2.2.1 (table 8-4) — composes two half-pel
// anchors via L2 rounded-average:
//
// mc13[r,c] = avg(mc20(r+1, c),
// mc02(r, c))
//
// Per-lane structure: each lane computes BOTH anchor outputs at its
// own (r, c) target offset, then L2 averages. No shared memory.
// Same WG geometry as the other qpel shaders.
//
// License: BSD-2-Clause.
#version 450
#extension GL_EXT_shader_8bit_storage : require
#extension GL_EXT_shader_explicit_arithmetic_types : require
layout(local_size_x = 64, local_size_y = 1, local_size_z = 1) in;
layout(binding = 0) readonly buffer Src { uint8_t src[]; } u_src;
layout(binding = 1) buffer Dst { uint8_t dst[]; } u_dst;
layout(binding = 2) readonly buffer Meta { uvec4 meta[]; } u_meta;
layout(push_constant) uniform PC { uint n_blocks, stride_u8, _p0, _p1; } pc;
int hpel_h(uint src_off, uint stride, uint r, uint c) {
uint row_base = src_off + r * stride + c;
int s_m2 = int(u_src.src[row_base - 2u]);
int s_m1 = int(u_src.src[row_base - 1u]);
int s_0 = int(u_src.src[row_base ]);
int s_p1 = int(u_src.src[row_base + 1u]);
int s_p2 = int(u_src.src[row_base + 2u]);
int s_p3 = int(u_src.src[row_base + 3u]);
int v = s_m2 - 5*s_m1 + 20*s_0 + 20*s_p1 - 5*s_p2 + s_p3 + 16;
return clamp(v >> 5, 0, 255);
}
int hpel_v(uint src_off, uint stride, uint r, uint c) {
uint col_base = src_off + c;
int s_m2 = int(u_src.src[col_base + (r - 2u) * stride]);
int s_m1 = int(u_src.src[col_base + (r - 1u) * stride]);
int s_0 = int(u_src.src[col_base + r * stride]);
int s_p1 = int(u_src.src[col_base + (r + 1u) * stride]);
int s_p2 = int(u_src.src[col_base + (r + 2u) * stride]);
int s_p3 = int(u_src.src[col_base + (r + 3u) * stride]);
int v = s_m2 - 5*s_m1 + 20*s_0 + 20*s_p1 - 5*s_p2 + s_p3 + 16;
return clamp(v >> 5, 0, 255);
}
int hpel_hv_row(uint src_off, uint stride, uint rr, uint c) {
// Single row's int16 horizontal lowpass (NOT clipped — used as
// intermediate for the vertical pass of hpel_hv).
uint row_base = src_off + rr * stride + c;
int s_m2 = int(u_src.src[row_base - 2u]);
int s_m1 = int(u_src.src[row_base - 1u]);
int s_0 = int(u_src.src[row_base ]);
int s_p1 = int(u_src.src[row_base + 1u]);
int s_p2 = int(u_src.src[row_base + 2u]);
int s_p3 = int(u_src.src[row_base + 3u]);
return s_m2 - 5*s_m1 + 20*s_0 + 20*s_p1 - 5*s_p2 + s_p3;
}
int hpel_hv(uint src_off, uint stride, uint r, uint c) {
int t0 = hpel_hv_row(src_off, stride, r - 2u, c);
int t1 = hpel_hv_row(src_off, stride, r - 1u, c);
int t2 = hpel_hv_row(src_off, stride, r, c);
int t3 = hpel_hv_row(src_off, stride, r + 1u, c);
int t4 = hpel_hv_row(src_off, stride, r + 2u, c);
int t5 = hpel_hv_row(src_off, stride, r + 3u, c);
int v = t0 - 5*t1 + 20*t2 + 20*t3 - 5*t4 + t5 + 512;
return clamp(v >> 10, 0, 255);
}
void main()
{
uint block_idx = gl_WorkGroupID.x;
if (block_idx >= pc.n_blocks) return;
uint lane = gl_LocalInvocationID.x;
uint r = lane >> 3, c = lane & 7u;
uint dst_off = u_meta.meta[block_idx].x;
uint src_off = u_meta.meta[block_idx].y;
uint stride = pc.stride_u8;
int a = hpel_h(src_off, stride, r+1u, c);
int b = hpel_v(src_off, stride, r, c);
int avg = (a + b + 1) >> 1;
u_dst.dst[dst_off + r * stride + c] = uint8_t(avg);
}
@@ -0,0 +1,83 @@
// daedalus-fourier — H.264 luma qpel mc20 (8x8, horizontal half-pel), V3D 7.1.
//
// H.264 spec §8.4.2.2.1 horizontal 6-tap luma interpolation:
//
//   dst[r,c] = clip255(
//       (        s[r,c-2]
//         -  5 * s[r,c-1]
//         + 20 * s[r,c]
//         + 20 * s[r,c+1]
//         -  5 * s[r,c+2]
//         +      s[r,c+3]
//         + 16
//       ) >> 5)
//
// Single-stride: dst and src share `stride` (H264QpelContext
// convention). src+src_off already points at the leftmost output
// column (col 0); the filter reads cols -2..+3. Caller guarantees
// edge-padding context per the public API docstring.
//
// Workgroup layout: 64 invocations = 1 lane per output pixel.
// 1 block per WG; n_blocks WGs total. This is the simplest layout
// that avoids any inter-lane communication — each lane independently
// reads its 6 src samples and writes its 1 dst sample. V3D's L2
// cache handles the redundant reads from adjacent lanes.
//
// License: BSD-2-Clause.
#version 450
#extension GL_EXT_shader_8bit_storage : require
#extension GL_EXT_shader_explicit_arithmetic_types : require
layout(local_size_x = 64, local_size_y = 1, local_size_z = 1) in;
layout(binding = 0) readonly buffer Src {
uint8_t src[];
} u_src;
layout(binding = 1) buffer Dst {
uint8_t dst[];
} u_dst;
layout(binding = 2) readonly buffer Meta {
uvec4 meta[]; // .x = dst_off, .y = src_off
} u_meta;
layout(push_constant) uniform PC {
uint n_blocks;
uint stride_u8;
uint _pad0, _pad1;
} pc;
void main()
{
// 1 block per WG, 64 lanes covering the 8x8 output block.
uint wg_id = gl_WorkGroupID.x;
uint block_idx = wg_id;
if (block_idx >= pc.n_blocks) return;
uint lane = gl_LocalInvocationID.x;
uint r = lane >> 3; // 0..7 (row)
uint c = lane & 7u; // 0..7 (column)
uint dst_off = u_meta.meta[block_idx].x;
uint src_off = u_meta.meta[block_idx].y;
uint stride = pc.stride_u8;
// src points at output col 0 of the block; filter reads cols -2..+3
// of the current row. Negative col arithmetic is unsigned-safe
// because src_off >= 2 (caller-guaranteed left context).
uint row_base = src_off + r * stride + c;
int s_m2 = int(u_src.src[row_base - 2u]);
int s_m1 = int(u_src.src[row_base - 1u]);
int s_0 = int(u_src.src[row_base + 0u]);
int s_p1 = int(u_src.src[row_base + 1u]);
int s_p2 = int(u_src.src[row_base + 2u]);
int s_p3 = int(u_src.src[row_base + 3u]);
int v = s_m2 - 5 * s_m1 + 20 * s_0 + 20 * s_p1 - 5 * s_p2 + s_p3 + 16;
int p = clamp(v >> 5, 0, 255);
u_dst.dst[dst_off + r * stride + c] = uint8_t(p);
}
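The workgroup layout described in the header (one block per workgroup, 64 lanes, one lane per output pixel) can be emulated on the CPU with two nested loops. This is a hedged sketch for cross-checking shader output: `emulate_mc20` and `struct block_meta` are hypothetical names, and only the `.x` and `.y` fields of each Meta entry are modeled.

```c
#include <assert.h>
#include <stdint.h>

/* Mirrors the .x/.y fields of one uvec4 entry in the Meta buffer. */
struct block_meta {
    uint32_t dst_off;
    uint32_t src_off;
};

/* Hypothetical CPU emulation of the mc20 dispatch: one outer iteration per
 * workgroup (block), one inner iteration per lane (output pixel). */
static void emulate_mc20(uint8_t *dst, const uint8_t *src,
                         const struct block_meta *meta,
                         uint32_t n_blocks, uint32_t stride)
{
    assert(stride >= 8);
    for (uint32_t wg = 0; wg < n_blocks; wg++) {
        for (uint32_t lane = 0; lane < 64; lane++) {
            uint32_t r = lane >> 3;          /* row 0..7 */
            uint32_t c = lane & 7u;          /* column 0..7 */
            const uint8_t *p = src + meta[wg].src_off + r * stride + c;
            int v = p[-2] - 5 * p[-1] + 20 * p[0]
                  + 20 * p[1] - 5 * p[2] + p[3] + 16;
            int px = v < 0 ? 0 : (v >> 5);
            if (px > 255)
                px = 255;
            dst[meta[wg].dst_off + r * stride + c] = (uint8_t)px;
        }
    }
}
```

Because every lane reads its own six samples and writes its own pixel, the loop order does not matter, which is the same property that lets the shader skip barriers and shared memory.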
@@ -0,0 +1,88 @@
// daedalus-fourier — H.264 luma qpel mc21 (8x8, diagonal quarter-pel),
// V3D 7.1. Per H.264 §8.4.2.2.1 (table 8-4) — composes two half-pel
// anchors via L2 rounded-average:
//
// mc21[r,c] = avg(mc22(r, c),
// mc20(r, c))
//
// Per-lane structure: each lane computes BOTH anchor outputs at its
// own (r, c) target offset, then L2 averages. No shared memory.
// Same WG geometry as the other qpel shaders.
//
// License: BSD-2-Clause.
#version 450
#extension GL_EXT_shader_8bit_storage : require
#extension GL_EXT_shader_explicit_arithmetic_types : require
layout(local_size_x = 64, local_size_y = 1, local_size_z = 1) in;
layout(binding = 0) readonly buffer Src { uint8_t src[]; } u_src;
layout(binding = 1) buffer Dst { uint8_t dst[]; } u_dst;
layout(binding = 2) readonly buffer Meta { uvec4 meta[]; } u_meta;
layout(push_constant) uniform PC { uint n_blocks, stride_u8, _p0, _p1; } pc;
int hpel_h(uint src_off, uint stride, uint r, uint c) {
uint row_base = src_off + r * stride + c;
int s_m2 = int(u_src.src[row_base - 2u]);
int s_m1 = int(u_src.src[row_base - 1u]);
int s_0 = int(u_src.src[row_base ]);
int s_p1 = int(u_src.src[row_base + 1u]);
int s_p2 = int(u_src.src[row_base + 2u]);
int s_p3 = int(u_src.src[row_base + 3u]);
int v = s_m2 - 5*s_m1 + 20*s_0 + 20*s_p1 - 5*s_p2 + s_p3 + 16;
return clamp(v >> 5, 0, 255);
}
int hpel_v(uint src_off, uint stride, uint r, uint c) {
uint col_base = src_off + c;
int s_m2 = int(u_src.src[col_base + (r - 2u) * stride]);
int s_m1 = int(u_src.src[col_base + (r - 1u) * stride]);
int s_0 = int(u_src.src[col_base + r * stride]);
int s_p1 = int(u_src.src[col_base + (r + 1u) * stride]);
int s_p2 = int(u_src.src[col_base + (r + 2u) * stride]);
int s_p3 = int(u_src.src[col_base + (r + 3u) * stride]);
int v = s_m2 - 5*s_m1 + 20*s_0 + 20*s_p1 - 5*s_p2 + s_p3 + 16;
return clamp(v >> 5, 0, 255);
}
int hpel_hv_row(uint src_off, uint stride, uint rr, uint c) {
// Single row's int16 horizontal lowpass (NOT clipped — used as
// intermediate for the vertical pass of hpel_hv).
uint row_base = src_off + rr * stride + c;
int s_m2 = int(u_src.src[row_base - 2u]);
int s_m1 = int(u_src.src[row_base - 1u]);
int s_0 = int(u_src.src[row_base ]);
int s_p1 = int(u_src.src[row_base + 1u]);
int s_p2 = int(u_src.src[row_base + 2u]);
int s_p3 = int(u_src.src[row_base + 3u]);
return s_m2 - 5*s_m1 + 20*s_0 + 20*s_p1 - 5*s_p2 + s_p3;
}
int hpel_hv(uint src_off, uint stride, uint r, uint c) {
int t0 = hpel_hv_row(src_off, stride, r - 2u, c);
int t1 = hpel_hv_row(src_off, stride, r - 1u, c);
int t2 = hpel_hv_row(src_off, stride, r, c);
int t3 = hpel_hv_row(src_off, stride, r + 1u, c);
int t4 = hpel_hv_row(src_off, stride, r + 2u, c);
int t5 = hpel_hv_row(src_off, stride, r + 3u, c);
int v = t0 - 5*t1 + 20*t2 + 20*t3 - 5*t4 + t5 + 512;
return clamp(v >> 10, 0, 255);
}
void main()
{
uint block_idx = gl_WorkGroupID.x;
if (block_idx >= pc.n_blocks) return;
uint lane = gl_LocalInvocationID.x;
uint r = lane >> 3, c = lane & 7u;
uint dst_off = u_meta.meta[block_idx].x;
uint src_off = u_meta.meta[block_idx].y;
uint stride = pc.stride_u8;
int a = hpel_hv(src_off, stride, r, c);
int b = hpel_h(src_off, stride, r, c);
int avg = (a + b + 1) >> 1;
u_dst.dst[dst_off + r * stride + c] = uint8_t(avg);
}
@@ -0,0 +1,86 @@
// daedalus-fourier — H.264 luma qpel mc22 (8x8, 2D half-pel "j" position).
// V3D 7.1.
//
// Cascaded H+V 6-tap per H.264 §8.4.2.2.1 / FFmpeg ff_put_h264_qpel8_mc22_neon:
//
//   tmp[r,c] = src[r,c-2] - 5*src[r,c-1] + 20*src[r,c] + 20*src[r,c+1]
//            - 5*src[r,c+2] + src[r,c+3]                       (int16)
//
//   dst[r,c] = clip255((tmp[r-2,c] - 5*tmp[r-1,c] + 20*tmp[r,c]
//                       + 20*tmp[r+1,c] - 5*tmp[r+2,c] + tmp[r+3,c]
//                       + 512) >> 10)
//
// The final +512 >> 10 compensates for both 6-tap scalings (>> 5 each).
// This CANNOT be built by cascading mc20→mc02: the intermediate must
// stay at int16 precision with no per-stage rounding or clip, so this
// is a dedicated kernel.
//
// Per-lane structure: each lane computes its own (r, c) output by
// running the FULL cascade — 6 horizontal lowpass int16 values for
// rows r-2..r+3, then a vertical lowpass on those. ~50 ALU ops per
// lane. No shared memory / barriers needed; V3D L2 absorbs the
// redundant src reads across lanes.
//
// WG layout: 64 lanes / 1 block-per-WG / 1 lane-per-output-pixel
// (same as mc20 / mc02).
//
// License: BSD-2-Clause.
#version 450
#extension GL_EXT_shader_8bit_storage : require
#extension GL_EXT_shader_explicit_arithmetic_types : require
layout(local_size_x = 64, local_size_y = 1, local_size_z = 1) in;
layout(binding = 0) readonly buffer Src { uint8_t src[]; } u_src;
layout(binding = 1) buffer Dst { uint8_t dst[]; } u_dst;
layout(binding = 2) readonly buffer Meta { uvec4 meta[]; } u_meta;
layout(push_constant) uniform PC {
uint n_blocks;
uint stride_u8;
uint _pad0, _pad1;
} pc;
// Horizontal 6-tap filter at (row_off, c): reads src at cols c-2..c+3
// of the row identified by row_off and returns the unscaled sum. The
// value fits int16 (-2550..10710) but is held in a 32-bit int; the
// v-pass applies the single +512 >> 10 that covers both stages.
int hpel_h(uint row_off, uint c)
{
int s_m2 = int(u_src.src[row_off + c - 2u]);
int s_m1 = int(u_src.src[row_off + c - 1u]);
int s_0 = int(u_src.src[row_off + c ]);
int s_p1 = int(u_src.src[row_off + c + 1u]);
int s_p2 = int(u_src.src[row_off + c + 2u]);
int s_p3 = int(u_src.src[row_off + c + 3u]);
return s_m2 - 5 * s_m1 + 20 * s_0 + 20 * s_p1 - 5 * s_p2 + s_p3;
}
void main()
{
uint block_idx = gl_WorkGroupID.x;
if (block_idx >= pc.n_blocks) return;
uint lane = gl_LocalInvocationID.x;
uint r = lane >> 3;
uint c = lane & 7u;
uint dst_off = u_meta.meta[block_idx].x;
uint src_off = u_meta.meta[block_idx].y;
uint stride = pc.stride_u8;
// Compute 6 horizontal lowpass values at rows r-2..r+3 (relative
// to the output row r) of column c. src_off+r*stride+c is the
// output pixel position; we sample rows r-2..r+3.
// Unsigned-safe because src_off >= 2*stride per the caller contract.
int t0 = hpel_h(src_off + (r - 2u) * stride, c);
int t1 = hpel_h(src_off + (r - 1u) * stride, c);
int t2 = hpel_h(src_off + r * stride, c);
int t3 = hpel_h(src_off + (r + 1u) * stride, c);
int t4 = hpel_h(src_off + (r + 2u) * stride, c);
int t5 = hpel_h(src_off + (r + 3u) * stride, c);
int v = t0 - 5 * t1 + 20 * t2 + 20 * t3 - 5 * t4 + t5 + 512;
int p = clamp(v >> 10, 0, 255);
u_dst.dst[dst_off + r * stride + c] = uint8_t(p);
}
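The header above states that mc22 cannot be obtained by running mc20 and then mc02. The C sketch below compares the two per pixel: `ref_mc22_pixel` follows the header formula (one rounding and one clip at the end), while `naive_cascade_pixel` rounds and clips after each stage. All function names are hypothetical, and the code is an illustration rather than the project's reference path.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static int clip255(int x) { return x < 0 ? 0 : (x > 255 ? 255 : x); }

/* Unscaled 6-tap lowpass over six values. */
static int lowpass6(const int t[6])
{
    return t[0] - 5 * t[1] + 20 * t[2] + 20 * t[3] - 5 * t[4] + t[5];
}

/* Fill tmp[0..5] with the unscaled horizontal lowpass of rows r-2..r+3,
 * where p points at the output pixel position. */
static void h_pass(const uint8_t *p, ptrdiff_t stride, int tmp[6])
{
    assert(stride >= 6);
    for (int k = 0; k < 6; k++) {
        int s[6];
        for (int j = 0; j < 6; j++)
            s[j] = p[(k - 2) * stride + (j - 2)];
        tmp[k] = lowpass6(s);
    }
}

/* mc22 as in the header: one rounding and one clip at the very end. */
static int ref_mc22_pixel(const uint8_t *p, ptrdiff_t stride)
{
    int tmp[6];
    h_pass(p, stride, tmp);
    int v = lowpass6(tmp) + 512;
    return v < 0 ? 0 : clip255(v >> 10);
}

/* Naive mc20 -> mc02 cascade: each stage rounds and clips on its own. */
static int naive_cascade_pixel(const uint8_t *p, ptrdiff_t stride)
{
    int tmp[6];
    h_pass(p, stride, tmp);
    for (int k = 0; k < 6; k++) {
        int h = tmp[k] + 16;
        tmp[k] = h < 0 ? 0 : clip255(h >> 5);
    }
    int v = lowpass6(tmp) + 16;
    return v < 0 ? 0 : clip255(v >> 5);
}
```

The two agree on flat input but diverge once a row's horizontal sum overshoots the 8-bit range: the per-stage clip discards the overshoot that the vertical pass would otherwise have weighted with its negative taps.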
@@ -0,0 +1,88 @@
// daedalus-fourier — H.264 luma qpel mc23 (8x8, diagonal quarter-pel),
// V3D 7.1. Per H.264 §8.4.2.2.1 (table 8-4) — composes two half-pel
// anchors via L2 rounded-average:
//
// mc23[r,c] = avg(mc22(r, c),
// mc20(r+1, c))
//
// Per-lane structure: each lane computes BOTH anchor outputs at its
// own (r, c) target offset, then L2 averages. No shared memory.
// Same WG geometry as the other qpel shaders.
//
// License: BSD-2-Clause.
#version 450
#extension GL_EXT_shader_8bit_storage : require
#extension GL_EXT_shader_explicit_arithmetic_types : require
layout(local_size_x = 64, local_size_y = 1, local_size_z = 1) in;
layout(binding = 0) readonly buffer Src { uint8_t src[]; } u_src;
layout(binding = 1) buffer Dst { uint8_t dst[]; } u_dst;
layout(binding = 2) readonly buffer Meta { uvec4 meta[]; } u_meta;
layout(push_constant) uniform PC { uint n_blocks, stride_u8, _p0, _p1; } pc;
int hpel_h(uint src_off, uint stride, uint r, uint c) {
uint row_base = src_off + r * stride + c;
int s_m2 = int(u_src.src[row_base - 2u]);
int s_m1 = int(u_src.src[row_base - 1u]);
int s_0 = int(u_src.src[row_base ]);
int s_p1 = int(u_src.src[row_base + 1u]);
int s_p2 = int(u_src.src[row_base + 2u]);
int s_p3 = int(u_src.src[row_base + 3u]);
int v = s_m2 - 5*s_m1 + 20*s_0 + 20*s_p1 - 5*s_p2 + s_p3 + 16;
return clamp(v >> 5, 0, 255);
}
int hpel_v(uint src_off, uint stride, uint r, uint c) {
uint col_base = src_off + c;
int s_m2 = int(u_src.src[col_base + (r - 2u) * stride]);
int s_m1 = int(u_src.src[col_base + (r - 1u) * stride]);
int s_0 = int(u_src.src[col_base + r * stride]);
int s_p1 = int(u_src.src[col_base + (r + 1u) * stride]);
int s_p2 = int(u_src.src[col_base + (r + 2u) * stride]);
int s_p3 = int(u_src.src[col_base + (r + 3u) * stride]);
int v = s_m2 - 5*s_m1 + 20*s_0 + 20*s_p1 - 5*s_p2 + s_p3 + 16;
return clamp(v >> 5, 0, 255);
}
int hpel_hv_row(uint src_off, uint stride, uint rr, uint c) {
// Single row's int16 horizontal lowpass (NOT clipped — used as
// intermediate for the vertical pass of hpel_hv).
uint row_base = src_off + rr * stride + c;
int s_m2 = int(u_src.src[row_base - 2u]);
int s_m1 = int(u_src.src[row_base - 1u]);
int s_0 = int(u_src.src[row_base ]);
int s_p1 = int(u_src.src[row_base + 1u]);
int s_p2 = int(u_src.src[row_base + 2u]);
int s_p3 = int(u_src.src[row_base + 3u]);
return s_m2 - 5*s_m1 + 20*s_0 + 20*s_p1 - 5*s_p2 + s_p3;
}
int hpel_hv(uint src_off, uint stride, uint r, uint c) {
int t0 = hpel_hv_row(src_off, stride, r - 2u, c);
int t1 = hpel_hv_row(src_off, stride, r - 1u, c);
int t2 = hpel_hv_row(src_off, stride, r, c);
int t3 = hpel_hv_row(src_off, stride, r + 1u, c);
int t4 = hpel_hv_row(src_off, stride, r + 2u, c);
int t5 = hpel_hv_row(src_off, stride, r + 3u, c);
int v = t0 - 5*t1 + 20*t2 + 20*t3 - 5*t4 + t5 + 512;
return clamp(v >> 10, 0, 255);
}
void main()
{
uint block_idx = gl_WorkGroupID.x;
if (block_idx >= pc.n_blocks) return;
uint lane = gl_LocalInvocationID.x;
uint r = lane >> 3, c = lane & 7u;
uint dst_off = u_meta.meta[block_idx].x;
uint src_off = u_meta.meta[block_idx].y;
uint stride = pc.stride_u8;
int a = hpel_hv(src_off, stride, r, c);
int b = hpel_h(src_off, stride, r+1u, c);
int avg = (a + b + 1) >> 1;
u_dst.dst[dst_off + r * stride + c] = uint8_t(avg);
}
@@ -0,0 +1,44 @@
// daedalus-fourier — H.264 luma qpel mc30 (8x8, ¾-pel horizontal),
// V3D 7.1. Per H.264 §8.4.2.2.1 "c" position:
//
// dst[r,c] = ((clip255(mc20(s)[r,c]) + s[r,c+1] + 1) >> 1)
//
// Same as mc10 but L2-averages with src[r, c+1] instead of src[r, c].
//
// License: BSD-2-Clause.
#version 450
#extension GL_EXT_shader_8bit_storage : require
#extension GL_EXT_shader_explicit_arithmetic_types : require
layout(local_size_x = 64, local_size_y = 1, local_size_z = 1) in;
layout(binding = 0) readonly buffer Src { uint8_t src[]; } u_src;
layout(binding = 1) buffer Dst { uint8_t dst[]; } u_dst;
layout(binding = 2) readonly buffer Meta { uvec4 meta[]; } u_meta;
layout(push_constant) uniform PC { uint n_blocks, stride_u8, _p0, _p1; } pc;
void main()
{
uint block_idx = gl_WorkGroupID.x;
if (block_idx >= pc.n_blocks) return;
uint lane = gl_LocalInvocationID.x;
uint r = lane >> 3, c = lane & 7u;
uint dst_off = u_meta.meta[block_idx].x;
uint src_off = u_meta.meta[block_idx].y;
uint stride = pc.stride_u8;
uint row_base = src_off + r * stride + c;
int s_m2 = int(u_src.src[row_base - 2u]);
int s_m1 = int(u_src.src[row_base - 1u]);
int s_0 = int(u_src.src[row_base ]);
int s_p1 = int(u_src.src[row_base + 1u]);
int s_p2 = int(u_src.src[row_base + 2u]);
int s_p3 = int(u_src.src[row_base + 3u]);
int v = s_m2 - 5 * s_m1 + 20 * s_0 + 20 * s_p1 - 5 * s_p2 + s_p3 + 16;
int hp = clamp(v >> 5, 0, 255);
int avg = (hp + s_p1 + 1) >> 1; // L2 with src[r, c+1]
u_dst.dst[dst_off + r * stride + c] = uint8_t(avg);
}
@@ -0,0 +1,88 @@
// daedalus-fourier — H.264 luma qpel mc31 (8x8, diagonal quarter-pel),
// V3D 7.1. Per H.264 §8.4.2.2.1 (table 8-4) — composes two half-pel
// anchors via L2 rounded-average:
//
// mc31[r,c] = avg(mc20(r, c),
// mc02(r, c+1))
//
// Per-lane structure: each lane computes BOTH anchor outputs at its
// own (r, c) target offset, then L2 averages. No shared memory.
// Same WG geometry as the other qpel shaders.
//
// License: BSD-2-Clause.
#version 450
#extension GL_EXT_shader_8bit_storage : require
#extension GL_EXT_shader_explicit_arithmetic_types : require
layout(local_size_x = 64, local_size_y = 1, local_size_z = 1) in;
layout(binding = 0) readonly buffer Src { uint8_t src[]; } u_src;
layout(binding = 1) buffer Dst { uint8_t dst[]; } u_dst;
layout(binding = 2) readonly buffer Meta { uvec4 meta[]; } u_meta;
layout(push_constant) uniform PC { uint n_blocks, stride_u8, _p0, _p1; } pc;
int hpel_h(uint src_off, uint stride, uint r, uint c) {
uint row_base = src_off + r * stride + c;
int s_m2 = int(u_src.src[row_base - 2u]);
int s_m1 = int(u_src.src[row_base - 1u]);
int s_0 = int(u_src.src[row_base ]);
int s_p1 = int(u_src.src[row_base + 1u]);
int s_p2 = int(u_src.src[row_base + 2u]);
int s_p3 = int(u_src.src[row_base + 3u]);
int v = s_m2 - 5*s_m1 + 20*s_0 + 20*s_p1 - 5*s_p2 + s_p3 + 16;
return clamp(v >> 5, 0, 255);
}
int hpel_v(uint src_off, uint stride, uint r, uint c) {
uint col_base = src_off + c;
int s_m2 = int(u_src.src[col_base + (r - 2u) * stride]);
int s_m1 = int(u_src.src[col_base + (r - 1u) * stride]);
int s_0 = int(u_src.src[col_base + r * stride]);
int s_p1 = int(u_src.src[col_base + (r + 1u) * stride]);
int s_p2 = int(u_src.src[col_base + (r + 2u) * stride]);
int s_p3 = int(u_src.src[col_base + (r + 3u) * stride]);
int v = s_m2 - 5*s_m1 + 20*s_0 + 20*s_p1 - 5*s_p2 + s_p3 + 16;
return clamp(v >> 5, 0, 255);
}
int hpel_hv_row(uint src_off, uint stride, uint rr, uint c) {
// Single row's int16 horizontal lowpass (NOT clipped — used as
// intermediate for the vertical pass of hpel_hv).
uint row_base = src_off + rr * stride + c;
int s_m2 = int(u_src.src[row_base - 2u]);
int s_m1 = int(u_src.src[row_base - 1u]);
int s_0 = int(u_src.src[row_base ]);
int s_p1 = int(u_src.src[row_base + 1u]);
int s_p2 = int(u_src.src[row_base + 2u]);
int s_p3 = int(u_src.src[row_base + 3u]);
return s_m2 - 5*s_m1 + 20*s_0 + 20*s_p1 - 5*s_p2 + s_p3;
}
int hpel_hv(uint src_off, uint stride, uint r, uint c) {
int t0 = hpel_hv_row(src_off, stride, r - 2u, c);
int t1 = hpel_hv_row(src_off, stride, r - 1u, c);
int t2 = hpel_hv_row(src_off, stride, r, c);
int t3 = hpel_hv_row(src_off, stride, r + 1u, c);
int t4 = hpel_hv_row(src_off, stride, r + 2u, c);
int t5 = hpel_hv_row(src_off, stride, r + 3u, c);
int v = t0 - 5*t1 + 20*t2 + 20*t3 - 5*t4 + t5 + 512;
return clamp(v >> 10, 0, 255);
}
void main()
{
uint block_idx = gl_WorkGroupID.x;
if (block_idx >= pc.n_blocks) return;
uint lane = gl_LocalInvocationID.x;
uint r = lane >> 3, c = lane & 7u;
uint dst_off = u_meta.meta[block_idx].x;
uint src_off = u_meta.meta[block_idx].y;
uint stride = pc.stride_u8;
int a = hpel_h(src_off, stride, r, c);
int b = hpel_v(src_off, stride, r, c+1u);
int avg = (a + b + 1) >> 1;
u_dst.dst[dst_off + r * stride + c] = uint8_t(avg);
}
@@ -0,0 +1,88 @@
// daedalus-fourier — H.264 luma qpel mc32 (8x8, diagonal quarter-pel),
// V3D 7.1. Per H.264 §8.4.2.2.1 (table 8-4) — composes two half-pel
// anchors via L2 rounded-average:
//
// mc32[r,c] = avg(mc22(r, c),
// mc02(r, c+1))
//
// Per-lane structure: each lane computes BOTH anchor outputs at its
// own (r, c) target offset, then L2 averages. No shared memory.
// Same WG geometry as the other qpel shaders.
//
// License: BSD-2-Clause.
#version 450
#extension GL_EXT_shader_8bit_storage : require
#extension GL_EXT_shader_explicit_arithmetic_types : require
layout(local_size_x = 64, local_size_y = 1, local_size_z = 1) in;
layout(binding = 0) readonly buffer Src { uint8_t src[]; } u_src;
layout(binding = 1) buffer Dst { uint8_t dst[]; } u_dst;
layout(binding = 2) readonly buffer Meta { uvec4 meta[]; } u_meta;
layout(push_constant) uniform PC { uint n_blocks, stride_u8, _p0, _p1; } pc;
int hpel_h(uint src_off, uint stride, uint r, uint c) {
uint row_base = src_off + r * stride + c;
int s_m2 = int(u_src.src[row_base - 2u]);
int s_m1 = int(u_src.src[row_base - 1u]);
int s_0 = int(u_src.src[row_base ]);
int s_p1 = int(u_src.src[row_base + 1u]);
int s_p2 = int(u_src.src[row_base + 2u]);
int s_p3 = int(u_src.src[row_base + 3u]);
int v = s_m2 - 5*s_m1 + 20*s_0 + 20*s_p1 - 5*s_p2 + s_p3 + 16;
return clamp(v >> 5, 0, 255);
}
int hpel_v(uint src_off, uint stride, uint r, uint c) {
uint col_base = src_off + c;
int s_m2 = int(u_src.src[col_base + (r - 2u) * stride]);
int s_m1 = int(u_src.src[col_base + (r - 1u) * stride]);
int s_0 = int(u_src.src[col_base + r * stride]);
int s_p1 = int(u_src.src[col_base + (r + 1u) * stride]);
int s_p2 = int(u_src.src[col_base + (r + 2u) * stride]);
int s_p3 = int(u_src.src[col_base + (r + 3u) * stride]);
int v = s_m2 - 5*s_m1 + 20*s_0 + 20*s_p1 - 5*s_p2 + s_p3 + 16;
return clamp(v >> 5, 0, 255);
}
int hpel_hv_row(uint src_off, uint stride, uint rr, uint c) {
// Single row's int16 horizontal lowpass (NOT clipped — used as
// intermediate for the vertical pass of hpel_hv).
uint row_base = src_off + rr * stride + c;
int s_m2 = int(u_src.src[row_base - 2u]);
int s_m1 = int(u_src.src[row_base - 1u]);
int s_0 = int(u_src.src[row_base ]);
int s_p1 = int(u_src.src[row_base + 1u]);
int s_p2 = int(u_src.src[row_base + 2u]);
int s_p3 = int(u_src.src[row_base + 3u]);
return s_m2 - 5*s_m1 + 20*s_0 + 20*s_p1 - 5*s_p2 + s_p3;
}
int hpel_hv(uint src_off, uint stride, uint r, uint c) {
int t0 = hpel_hv_row(src_off, stride, r - 2u, c);
int t1 = hpel_hv_row(src_off, stride, r - 1u, c);
int t2 = hpel_hv_row(src_off, stride, r, c);
int t3 = hpel_hv_row(src_off, stride, r + 1u, c);
int t4 = hpel_hv_row(src_off, stride, r + 2u, c);
int t5 = hpel_hv_row(src_off, stride, r + 3u, c);
int v = t0 - 5*t1 + 20*t2 + 20*t3 - 5*t4 + t5 + 512;
return clamp(v >> 10, 0, 255);
}
void main()
{
uint block_idx = gl_WorkGroupID.x;
if (block_idx >= pc.n_blocks) return;
uint lane = gl_LocalInvocationID.x;
uint r = lane >> 3, c = lane & 7u;
uint dst_off = u_meta.meta[block_idx].x;
uint src_off = u_meta.meta[block_idx].y;
uint stride = pc.stride_u8;
int a = hpel_hv(src_off, stride, r, c);
int b = hpel_v(src_off, stride, r, c+1u);
int avg = (a + b + 1) >> 1;
u_dst.dst[dst_off + r * stride + c] = uint8_t(avg);
}
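A scalar CPU sketch of the same mc32 composition can help when checking the shader output by hand. It assumes arithmetic right shift on negative ints (what GLSL guarantees and what common C compilers do), and the names `tap6`, `ref_hpel_v`, `ref_hpel_hv` and `ref_mc32` are hypothetical, not part of the repository:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Clamp an intermediate value to the 8-bit pixel range. */
static int clip_u8(int v) { return v < 0 ? 0 : (v > 255 ? 255 : v); }

/* Unclipped 6-tap lowpass (1, -5, 20, 20, -5, 1) along one axis.
 * p points at tap 0; step is 1 for horizontal, stride for vertical. */
static int tap6(const uint8_t *p, ptrdiff_t step)
{
    return p[-2 * step] - 5 * p[-step] + 20 * p[0]
         + 20 * p[step] - 5 * p[2 * step] + p[3 * step];
}

/* Vertical half-pel sample, rounded and clipped (mirrors hpel_v). */
static int ref_hpel_v(const uint8_t *src, ptrdiff_t stride, int r, int c)
{
    return clip_u8((tap6(src + r * stride + c, stride) + 16) >> 5);
}

/* Centre half-pel sample: vertical 6-tap over six unclipped
 * horizontal rows, then one rounding shift by 10 (mirrors hpel_hv). */
static int ref_hpel_hv(const uint8_t *src, ptrdiff_t stride, int r, int c)
{
    static const int w[6] = { 1, -5, 20, 20, -5, 1 };
    int acc = 0;
    for (int k = 0; k < 6; k++)
        acc += w[k] * tap6(src + (r + k - 2) * stride + c, 1);
    return clip_u8((acc + 512) >> 10);
}

/* mc32 = rounded average of the centre anchor at (r, c) and the
 * vertical anchor one column to the right. */
static uint8_t ref_mc32(const uint8_t *src, ptrdiff_t stride, int r, int c)
{
    int a = ref_hpel_hv(src, stride, r, c);
    int b = ref_hpel_v(src, stride, r, c + 1);
    return (uint8_t)((a + b + 1) >> 1);
}
```

On a flat image every filter stage reproduces the input value, which makes a cheap sanity check; `src` must leave at least 2 rows/columns of margin before and 4 after the 8x8 block.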
+88
@@ -0,0 +1,88 @@
// daedalus-fourier — H.264 luma qpel mc33 (8x8, diagonal quarter-pel),
// V3D 7.1. Per H.264 §8.4.2.2.1 (table 8-4) — composes two half-pel
// anchors via L2 rounded-average:
//
// mc33[r,c] = avg(mc20(r+1, c),
// mc02(r, c+1))
//
// Per-lane structure: each lane computes BOTH anchor outputs at its
// own (r, c) target offset, then L2 averages. No shared memory.
// Same WG geometry as the other qpel shaders.
//
// License: BSD-2-Clause.
#version 450
#extension GL_EXT_shader_8bit_storage : require
#extension GL_EXT_shader_explicit_arithmetic_types : require
layout(local_size_x = 64, local_size_y = 1, local_size_z = 1) in;
layout(binding = 0) readonly buffer Src { uint8_t src[]; } u_src;
layout(binding = 1) buffer Dst { uint8_t dst[]; } u_dst;
layout(binding = 2) readonly buffer Meta { uvec4 meta[]; } u_meta;
layout(push_constant) uniform PC { uint n_blocks, stride_u8, _p0, _p1; } pc;
int hpel_h(uint src_off, uint stride, uint r, uint c) {
uint row_base = src_off + r * stride + c;
int s_m2 = int(u_src.src[row_base - 2u]);
int s_m1 = int(u_src.src[row_base - 1u]);
int s_0 = int(u_src.src[row_base ]);
int s_p1 = int(u_src.src[row_base + 1u]);
int s_p2 = int(u_src.src[row_base + 2u]);
int s_p3 = int(u_src.src[row_base + 3u]);
int v = s_m2 - 5*s_m1 + 20*s_0 + 20*s_p1 - 5*s_p2 + s_p3 + 16;
return clamp(v >> 5, 0, 255);
}
int hpel_v(uint src_off, uint stride, uint r, uint c) {
uint col_base = src_off + c;
int s_m2 = int(u_src.src[col_base + (r - 2u) * stride]);
int s_m1 = int(u_src.src[col_base + (r - 1u) * stride]);
int s_0 = int(u_src.src[col_base + r * stride]);
int s_p1 = int(u_src.src[col_base + (r + 1u) * stride]);
int s_p2 = int(u_src.src[col_base + (r + 2u) * stride]);
int s_p3 = int(u_src.src[col_base + (r + 3u) * stride]);
int v = s_m2 - 5*s_m1 + 20*s_0 + 20*s_p1 - 5*s_p2 + s_p3 + 16;
return clamp(v >> 5, 0, 255);
}
int hpel_hv_row(uint src_off, uint stride, uint rr, uint c) {
// Single row's int16 horizontal lowpass (NOT clipped — used as
// intermediate for the vertical pass of hpel_hv).
uint row_base = src_off + rr * stride + c;
int s_m2 = int(u_src.src[row_base - 2u]);
int s_m1 = int(u_src.src[row_base - 1u]);
int s_0 = int(u_src.src[row_base ]);
int s_p1 = int(u_src.src[row_base + 1u]);
int s_p2 = int(u_src.src[row_base + 2u]);
int s_p3 = int(u_src.src[row_base + 3u]);
return s_m2 - 5*s_m1 + 20*s_0 + 20*s_p1 - 5*s_p2 + s_p3;
}
int hpel_hv(uint src_off, uint stride, uint r, uint c) {
int t0 = hpel_hv_row(src_off, stride, r - 2u, c);
int t1 = hpel_hv_row(src_off, stride, r - 1u, c);
int t2 = hpel_hv_row(src_off, stride, r, c);
int t3 = hpel_hv_row(src_off, stride, r + 1u, c);
int t4 = hpel_hv_row(src_off, stride, r + 2u, c);
int t5 = hpel_hv_row(src_off, stride, r + 3u, c);
int v = t0 - 5*t1 + 20*t2 + 20*t3 - 5*t4 + t5 + 512;
return clamp(v >> 10, 0, 255);
}
void main()
{
uint block_idx = gl_WorkGroupID.x;
if (block_idx >= pc.n_blocks) return;
uint lane = gl_LocalInvocationID.x;
uint r = lane >> 3, c = lane & 7u;
uint dst_off = u_meta.meta[block_idx].x;
uint src_off = u_meta.meta[block_idx].y;
uint stride = pc.stride_u8;
int a = hpel_h(src_off, stride, r+1u, c);
int b = hpel_v(src_off, stride, r, c+1u);
int avg = (a + b + 1) >> 1;
u_dst.dst[dst_off + r * stride + c] = uint8_t(avg);
}
+69
@@ -0,0 +1,69 @@
// daedalus-fourier — H.264 chroma 4:2:0 H loop filter (horizontal
// filter across a vertical edge), non-intra bS<4 variant.
//
// Sibling of v3d_h264deblock_chroma_v.comp; same kernel transposed
// to read pix[-2..+1] (cols) instead of pix[-2*stride..+1*stride]
// (rows). Same geometry (8 cells in 4 segments of 2 cells) and same WG
// layout: lanes 8..15 of each edge return early, so only 8 are active per edge.
//
// 4:2:0-only: 4:2:2 chroma_h has a 16-row edge that this shader
// doesn't address. daedalus_dispatch_h264_deblock_chroma_h is
// 4:2:0-only by design; caller (libavcodec init) gates accordingly.
//
// License: BSD-2-Clause.
#version 450
#extension GL_EXT_shader_8bit_storage : require
#extension GL_EXT_shader_explicit_arithmetic_types : require
layout(local_size_x = 256, local_size_y = 1, local_size_z = 1) in;
layout(binding = 0) readonly buffer Meta { uvec4 meta[]; } u_meta;
layout(binding = 1) buffer Dst { uint8_t dst[]; } u_dst;
layout(push_constant) uniform PC {
uint n_edges;
uint dst_stride_u8;
uint _pad0;
uint _pad1;
} pc;
void main()
{
uint lane_in_wg = gl_GlobalInvocationID.x & 255u;
uint edge_in_wg = lane_in_wg >> 4; // 0..15
uint row_in_edge = lane_in_wg & 15u; // 0..15 — only 0..7 active
uint edge_idx = gl_WorkGroupID.x * 16u + edge_in_wg;
if (edge_idx >= pc.n_edges) return;
if (row_in_edge >= 8u) return;
uvec4 m = u_meta.meta[edge_idx];
uint stride = pc.dst_stride_u8;
uint dst_off = m.x + row_in_edge * stride;
int alpha = int(m.y & 0xffu);
int beta = int((m.y >> 8) & 0xffu);
uint seg = row_in_edge >> 1;
uint tc0_byte = (m.z >> (seg * 8u)) & 0xffu;
int tc0_s = int(tc0_byte);
if (tc0_s >= 128) tc0_s -= 256;
if (alpha == 0 || beta == 0) return;
if (tc0_s < 0) return;
int p1 = int(u_dst.dst[dst_off - 2u]);
int p0 = int(u_dst.dst[dst_off - 1u]);
int q0 = int(u_dst.dst[dst_off ]);
int q1 = int(u_dst.dst[dst_off + 1u]);
if (abs(p0 - q0) >= alpha) return;
if (abs(p1 - p0) >= beta) return;
if (abs(q1 - q0) >= beta) return;
int tc = tc0_s + 1;
int delta = clamp(((q0 - p0) * 4 + (p1 - q1) + 4) >> 3, -tc, tc);
u_dst.dst[dst_off - 1u] = uint8_t(clamp(p0 + delta, 0, 255));
u_dst.dst[dst_off ] = uint8_t(clamp(q0 - delta, 0, 255));
}
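The per-lane arithmetic of this bS<4 chroma kernel can be sketched as a scalar C routine operating on one sample quadruple. This is an illustration only: `unpack_tc0` and `chroma_bs_lt4` are hypothetical names, and the code assumes arithmetic right shift on negative ints, as GLSL provides:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

static int clamp_i(int v, int lo, int hi)
{
    return v < lo ? lo : (v > hi ? hi : v);
}

/* Sign-extend the tc0 byte for segment seg (0..3) of the packed word,
 * the same way the shader turns m.z into tc0_s. */
static int unpack_tc0(uint32_t packed, unsigned seg)
{
    int t = (int)((packed >> (seg * 8u)) & 0xffu);
    return t >= 128 ? t - 256 : t;
}

/* Filter one sample quadruple across an edge.
 * pix[0..3] = p1, p0, q0, q1. Returns 1 if p0/q0 were rewritten,
 * 0 if any of the early-out conditions applied. */
static int chroma_bs_lt4(uint8_t pix[4], int alpha, int beta, int tc0)
{
    int p1 = pix[0], p0 = pix[1], q0 = pix[2], q1 = pix[3];

    if (alpha == 0 || beta == 0 || tc0 < 0) return 0;
    if (abs(p0 - q0) >= alpha) return 0;
    if (abs(p1 - p0) >= beta)  return 0;
    if (abs(q1 - q0) >= beta)  return 0;

    int tc = tc0 + 1;  /* chroma: no ap/aq bonus */
    int delta = clamp_i(((q0 - p0) * 4 + (p1 - q1) + 4) >> 3, -tc, tc);
    pix[1] = (uint8_t)clamp_i(p0 + delta, 0, 255);
    pix[2] = (uint8_t)clamp_i(q0 - delta, 0, 255);
    return 1;
}
```

For the H variant the four samples are consecutive bytes of one row; for the V variant they are one column sampled at stride intervals.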
+44
@@ -0,0 +1,44 @@
// daedalus-fourier — H.264 chroma 4:2:0 intra (bS=4) H deblock —
// V3D 7.1. Transpose of v3d_h264deblock_chroma_v_intra.comp.
//
// License: BSD-2-Clause.
#version 450
#extension GL_EXT_shader_8bit_storage : require
#extension GL_EXT_shader_explicit_arithmetic_types : require
layout(local_size_x = 256, local_size_y = 1, local_size_z = 1) in;
layout(binding = 0) readonly buffer Meta { uvec4 meta[]; } u_meta;
layout(binding = 1) buffer Dst { uint8_t dst[]; } u_dst;
layout(push_constant) uniform PC {
uint n_edges, dst_stride_u8, _p0, _p1;
} pc;
void main()
{
uint lane_in_wg = gl_GlobalInvocationID.x & 255u;
uint edge_in_wg = lane_in_wg >> 4;
uint row_in_edge = lane_in_wg & 15u;
uint edge_idx = gl_WorkGroupID.x * 16u + edge_in_wg;
if (edge_idx >= pc.n_edges) return;
if (row_in_edge >= 8u) return;
uvec4 m = u_meta.meta[edge_idx];
uint stride = pc.dst_stride_u8;
uint dst_off = m.x + row_in_edge * stride;
int alpha = int(m.y & 0xffu);
int beta = int((m.y >> 8) & 0xffu);
if ((alpha | beta) == 0) return;
int p1 = int(u_dst.dst[dst_off - 2u]);
int p0 = int(u_dst.dst[dst_off - 1u]);
int q0 = int(u_dst.dst[dst_off ]);
int q1 = int(u_dst.dst[dst_off + 1u]);
if (abs(p0 - q0) >= alpha) return;
if (abs(p1 - p0) >= beta) return;
if (abs(q1 - q0) >= beta) return;
u_dst.dst[dst_off - 1u] = uint8_t(clamp((2*p1 + p0 + q1 + 2) >> 2, 0, 255));
u_dst.dst[dst_off ] = uint8_t(clamp((2*q1 + q0 + p1 + 2) >> 2, 0, 255));
}
+76
@@ -0,0 +1,76 @@
// daedalus-fourier — H.264 chroma 4:2:0 V loop filter (vertical
// filter across a horizontal edge), non-intra bS<4 variant.
//
// Per H.264 §8.7.2.3 (edges with bS < 4): the chroma kernel is simpler
// than luma's bS<4 kernel. Only p0 / q0 are updated (chroma never
// modifies p1, p2, q1, q2), tC = tc0_seg + 1 (no luma-style ap/aq side
// bonus), and the edge spans 8 cells (4 segments × 2 cells/seg).
//
// V3D 7.1 via Mesa v3dv compute. WG geometry kept identical to the
// luma shader (16 edges × 16 lanes/WG) for uniform dispatch math
// across the deblock family; lanes 8..15 of each edge early-return.
//
// License: BSD-2-Clause.
#version 450
#extension GL_EXT_shader_8bit_storage : require
#extension GL_EXT_shader_explicit_arithmetic_types : require
layout(local_size_x = 256, local_size_y = 1, local_size_z = 1) in;
layout(binding = 0) readonly buffer Meta {
uvec4 meta[]; // per edge: (dst_off, alpha|beta<<8, packed_tc0, _pad)
} u_meta;
layout(binding = 1) buffer Dst {
uint8_t dst[];
} u_dst;
layout(push_constant) uniform PC {
uint n_edges;
uint dst_stride_u8;
uint _pad0;
uint _pad1;
} pc;
void main()
{
uint lane_in_wg = gl_GlobalInvocationID.x & 255u;
uint edge_in_wg = lane_in_wg >> 4; // 0..15
uint col_in_edge = lane_in_wg & 15u; // 0..15 — only 0..7 active
uint edge_idx = gl_WorkGroupID.x * 16u + edge_in_wg;
if (edge_idx >= pc.n_edges) return;
if (col_in_edge >= 8u) return; // 8 cells per chroma edge
uvec4 m = u_meta.meta[edge_idx];
uint dst_off = m.x + col_in_edge;
uint stride = pc.dst_stride_u8;
int alpha = int(m.y & 0xffu);
int beta = int((m.y >> 8) & 0xffu);
// 8 cells / 4 segments = 2 cells per segment.
uint seg = col_in_edge >> 1;
uint tc0_byte = (m.z >> (seg * 8u)) & 0xffu;
int tc0_s = int(tc0_byte);
if (tc0_s >= 128) tc0_s -= 256;
if (alpha == 0 || beta == 0) return;
if (tc0_s < 0) return;
int p1 = int(u_dst.dst[dst_off - 2u * stride]);
int p0 = int(u_dst.dst[dst_off - 1u * stride]);
int q0 = int(u_dst.dst[dst_off]);
int q1 = int(u_dst.dst[dst_off + 1u * stride]);
if (abs(p0 - q0) >= alpha) return;
if (abs(p1 - p0) >= beta) return;
if (abs(q1 - q0) >= beta) return;
int tc = tc0_s + 1;
int delta = clamp(((q0 - p0) * 4 + (p1 - q1) + 4) >> 3, -tc, tc);
u_dst.dst[dst_off - 1u * stride] = uint8_t(clamp(p0 + delta, 0, 255));
u_dst.dst[dst_off ] = uint8_t(clamp(q0 - delta, 0, 255));
// p1, q1 untouched — chroma kernel only updates p0/q0.
}
+54
@@ -0,0 +1,54 @@
// daedalus-fourier — H.264 chroma 4:2:0 intra (bS=4) V deblock —
// V3D 7.1. Per H.264 §8.7.2.4 (edges with bS = 4), chroma path: simpler
// than luma, always the weak filter, only p0/q0 updated, 8 cells per edge.
//
// p0' = (2*p1 + p0 + q1 + 2) >> 2
// q0' = (2*q1 + q0 + p1 + 2) >> 2
//
// Same 16-edges × 16-lanes/edge WG shape as luma; lanes 8..15 of each
// edge early-return (chroma edges are only 8 cells wide).
//
// 4:2:0-only — caller-side gating handles 4:2:2 (chroma_format_idc>1)
// at the libavcodec init layer.
//
// License: BSD-2-Clause.
#version 450
#extension GL_EXT_shader_8bit_storage : require
#extension GL_EXT_shader_explicit_arithmetic_types : require
layout(local_size_x = 256, local_size_y = 1, local_size_z = 1) in;
layout(binding = 0) readonly buffer Meta { uvec4 meta[]; } u_meta;
layout(binding = 1) buffer Dst { uint8_t dst[]; } u_dst;
layout(push_constant) uniform PC {
uint n_edges, dst_stride_u8, _p0, _p1;
} pc;
void main()
{
uint lane_in_wg = gl_GlobalInvocationID.x & 255u;
uint edge_in_wg = lane_in_wg >> 4;
uint col_in_edge = lane_in_wg & 15u;
uint edge_idx = gl_WorkGroupID.x * 16u + edge_in_wg;
if (edge_idx >= pc.n_edges) return;
if (col_in_edge >= 8u) return;
uvec4 m = u_meta.meta[edge_idx];
uint dst_off = m.x + col_in_edge;
uint stride = pc.dst_stride_u8;
int alpha = int(m.y & 0xffu);
int beta = int((m.y >> 8) & 0xffu);
if ((alpha | beta) == 0) return;
int p1 = int(u_dst.dst[dst_off - 2u * stride]);
int p0 = int(u_dst.dst[dst_off - 1u * stride]);
int q0 = int(u_dst.dst[dst_off]);
int q1 = int(u_dst.dst[dst_off + 1u * stride]);
if (abs(p0 - q0) >= alpha) return;
if (abs(p1 - p0) >= beta) return;
if (abs(q1 - q0) >= beta) return;
u_dst.dst[dst_off - 1u * stride] = uint8_t(clamp((2*p1 + p0 + q1 + 2) >> 2, 0, 255));
u_dst.dst[dst_off ] = uint8_t(clamp((2*q1 + q0 + p1 + 2) >> 2, 0, 255));
}
+111
@@ -0,0 +1,111 @@
// daedalus-fourier — H.264 luma "h_loop_filter" (horizontal filtering
// across a vertical edge), non-intra bS<4 variant. Sibling of cycle 8's
// v3d_h264deblock.comp; same algorithm with row/col access transposed.
//
// V3D 7.1 via Mesa v3dv compute. Same WG geometry as the V shader:
// - 256 invocations / WG, 16 edges/WG (16 lanes/edge = 1 sg/edge)
// - uint8_t dst SSBO via storageBuffer8BitAccess
// - No barrier (each lane independent)
// - lane_in_edge = ROW index (0..15) along the vertical edge
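// - meta.dst_off points to (row 0, col 0) of the RIGHT block;
// the kernel reads cols [-3..+2] of each row and writes [-2..+1]
// (p3/q3 at cols -4/+3 are not loaded in the bS<4 path).
//
// Filter contract (per H.264 §8.7.2.3):
// 1. (m.x % pc.dst_stride_u8) ≥ 4 (reference-kernel contract: p3 sits at pix[-4], although this bS<4 shader does not load it)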
// 2. pc.dst_stride_u8 = byte stride between rows
// 3. tc0_s pre-stored as signed int8 in m.z packed 4 bytes (one per
// 4-row segment along the 16-row edge)
//
// License: BSD-2-Clause. Algorithm transcribed from
// tests/h264_h_loop_filter_luma_ref.c which mirrors FFmpeg
// ff_h264_h_loop_filter_luma_neon (LGPL-2.1+).
#version 450
#extension GL_EXT_shader_8bit_storage : require
#extension GL_EXT_shader_explicit_arithmetic_types : require
layout(local_size_x = 256, local_size_y = 1, local_size_z = 1) in;
layout(binding = 0) readonly buffer Meta {
uvec4 meta[]; // per edge: (dst_off, alpha|beta<<8, packed_tc0, _pad)
} u_meta;
layout(binding = 1) buffer Dst {
uint8_t dst[];
} u_dst;
layout(push_constant) uniform PC {
uint n_edges;
uint dst_stride_u8;
uint _pad0;
uint _pad1;
} pc;
void main()
{
uint gid = gl_GlobalInvocationID.x;
uint wg_id = gl_WorkGroupID.x;
uint lane_in_wg = gid & 255u;
uint edge_in_wg = lane_in_wg >> 4; // 0..15 (16 edges/WG)
uint row_in_edge = lane_in_wg & 15u; // 0..15 — ROW along the V edge
uint edge_idx = wg_id * 16u + edge_in_wg;
if (edge_idx >= pc.n_edges) return;
uvec4 m = u_meta.meta[edge_idx];
uint stride = pc.dst_stride_u8;
// dst_off addresses row 0 col 0 of the right block; advance by row * stride
// to land at this lane's row. The kernel reads pix[-3..+2] AT THIS ROW.
uint dst_off = m.x + row_in_edge * stride;
int alpha = int(m.y & 0xffu);
int beta = int((m.y >> 8) & 0xffu);
// tc0 segment = 0..3 indexed by (row_in_edge / 4).
uint seg = row_in_edge >> 2;
uint tc0_byte = (m.z >> (seg * 8u)) & 0xffu;
int tc0_s = int(tc0_byte);
if (tc0_s >= 128) tc0_s -= 256;
if (alpha == 0 || beta == 0) return;
if (tc0_s < 0) return; // segment skip
// Horizontal access pattern — read cols at offsets [-3..+2] of this row.
// p3 (col -4) unused in bS<4; same DCE comment as the V shader.
int p2 = int(u_dst.dst[dst_off - 3u]);
int p1 = int(u_dst.dst[dst_off - 2u]);
int p0 = int(u_dst.dst[dst_off - 1u]);
int q0 = int(u_dst.dst[dst_off ]);
int q1 = int(u_dst.dst[dst_off + 1u]);
int q2 = int(u_dst.dst[dst_off + 2u]);
// Edge preconditions (same as V).
if (abs(p0 - q0) >= alpha) return;
if (abs(p1 - p0) >= beta) return;
if (abs(q1 - q0) >= beta) return;
int ap = abs(p2 - p0);
int aq = abs(q2 - q0);
bool ap_lt = ap < beta;
bool aq_lt = aq < beta;
int tc = tc0_s + int(ap_lt) + int(aq_lt);
int delta = clamp(((q0 - p0) * 4 + (p1 - q1) + 4) >> 3, -tc, tc);
int p0p = clamp(p0 + delta, 0, 255);
int q0p = clamp(q0 - delta, 0, 255);
int p1p = p1;
if (ap_lt) {
int d_p1 = clamp((p2 + ((p0 + q0 + 1) >> 1) - 2*p1) >> 1, -tc0_s, tc0_s);
p1p = clamp(p1 + d_p1, 0, 255);
}
int q1p = q1;
if (aq_lt) {
int d_q1 = clamp((q2 + ((p0 + q0 + 1) >> 1) - 2*q1) >> 1, -tc0_s, tc0_s);
q1p = clamp(q1 + d_q1, 0, 255);
}
u_dst.dst[dst_off - 2u] = uint8_t(p1p);
u_dst.dst[dst_off - 1u] = uint8_t(p0p);
u_dst.dst[dst_off ] = uint8_t(q0p);
u_dst.dst[dst_off + 1u] = uint8_t(q1p);
}
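The index arithmetic shared by the deblock shaders (256 lanes per workgroup, 16 edges per workgroup, 4-row tc0 segments for luma) can be modelled on the host as below. The workgroup-count formula is an assumption about the dispatch side, which this diff does not show, and `lane_map`, `map_luma_lane` and `workgroups_for` are hypothetical names:

```c
#include <assert.h>
#include <stdint.h>

#define EDGES_PER_WG   16u
#define LANES_PER_EDGE 16u

struct lane_map {
    uint32_t edge_idx;     /* global edge index handled by this lane */
    uint32_t row_in_edge;  /* 0..15, position along the edge */
    uint32_t seg;          /* 0..3, which tc0 byte applies (luma: 4 rows each) */
    int      active;       /* 0 if the lane falls past n_edges */
};

/* Mirror of the shader's decomposition of a local invocation id. */
static struct lane_map map_luma_lane(uint32_t wg_id, uint32_t local_id,
                                     uint32_t n_edges)
{
    struct lane_map m;
    assert(local_id < EDGES_PER_WG * LANES_PER_EDGE);
    m.edge_idx    = wg_id * EDGES_PER_WG + (local_id >> 4);
    m.row_in_edge = local_id & 15u;
    m.seg         = m.row_in_edge >> 2;
    m.active      = m.edge_idx < n_edges;
    return m;
}

/* Assumed host-side dispatch size: round n_edges up to whole workgroups. */
static uint32_t workgroups_for(uint32_t n_edges)
{
    return (n_edges + EDGES_PER_WG - 1u) / EDGES_PER_WG;
}
```

The chroma shaders use the same mapping but derive the segment as `row_in_edge >> 1` and deactivate lanes 8..15.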
+70
@@ -0,0 +1,70 @@
// daedalus-fourier — H.264 luma intra (bS=4) H deblock — V3D 7.1.
//
// Sibling of v3d_h264deblock_luma_v_intra.comp transposed to the
// horizontal axis: lane → row, reads pix[-4..+3] (cols) instead of
// pix[-4*stride..+3*stride] (rows). Same strong/weak filter
// selector + same write-back algebra.
//
// dst_off contract: (m.x % stride) ≥ 4 (kernel reads p3 at pix[-4]).
//
// License: BSD-2-Clause.
#version 450
#extension GL_EXT_shader_8bit_storage : require
#extension GL_EXT_shader_explicit_arithmetic_types : require
layout(local_size_x = 256, local_size_y = 1, local_size_z = 1) in;
layout(binding = 0) readonly buffer Meta { uvec4 meta[]; } u_meta;
layout(binding = 1) buffer Dst { uint8_t dst[]; } u_dst;
layout(push_constant) uniform PC {
uint n_edges, dst_stride_u8, _p0, _p1;
} pc;
void main()
{
uint lane_in_wg = gl_GlobalInvocationID.x & 255u;
uint edge_in_wg = lane_in_wg >> 4;
uint row_in_edge = lane_in_wg & 15u;
uint edge_idx = gl_WorkGroupID.x * 16u + edge_in_wg;
if (edge_idx >= pc.n_edges) return;
uvec4 m = u_meta.meta[edge_idx];
uint stride = pc.dst_stride_u8;
uint dst_off = m.x + row_in_edge * stride;
int alpha = int(m.y & 0xffu);
int beta = int((m.y >> 8) & 0xffu);
if ((alpha | beta) == 0) return;
int p3 = int(u_dst.dst[dst_off - 4u]);
int p2 = int(u_dst.dst[dst_off - 3u]);
int p1 = int(u_dst.dst[dst_off - 2u]);
int p0 = int(u_dst.dst[dst_off - 1u]);
int q0 = int(u_dst.dst[dst_off ]);
int q1 = int(u_dst.dst[dst_off + 1u]);
int q2 = int(u_dst.dst[dst_off + 2u]);
int q3 = int(u_dst.dst[dst_off + 3u]);
if (abs(p0 - q0) >= alpha) return;
if (abs(p1 - p0) >= beta) return;
if (abs(q1 - q0) >= beta) return;
bool strong_common = abs(p0 - q0) < (alpha >> 2) + 2;
bool strong_p = strong_common && abs(p2 - p0) < beta;
bool strong_q = strong_common && abs(q2 - q0) < beta;
if (strong_p) {
u_dst.dst[dst_off - 1u] = uint8_t(clamp((p2 + 2*p1 + 2*p0 + 2*q0 + q1 + 4) >> 3, 0, 255));
u_dst.dst[dst_off - 2u] = uint8_t(clamp((p2 + p1 + p0 + q0 + 2) >> 2, 0, 255));
u_dst.dst[dst_off - 3u] = uint8_t(clamp((2*p3 + 3*p2 + p1 + p0 + q0 + 4) >> 3, 0, 255));
} else {
u_dst.dst[dst_off - 1u] = uint8_t(clamp((2*p1 + p0 + q1 + 2) >> 2, 0, 255));
}
if (strong_q) {
u_dst.dst[dst_off ] = uint8_t(clamp((q2 + 2*q1 + 2*q0 + 2*p0 + p1 + 4) >> 3, 0, 255));
u_dst.dst[dst_off + 1u] = uint8_t(clamp((q2 + q1 + q0 + p0 + 2) >> 2, 0, 255));
u_dst.dst[dst_off + 2u] = uint8_t(clamp((2*q3 + 3*q2 + q1 + q0 + p0 + 4) >> 3, 0, 255));
} else {
u_dst.dst[dst_off ] = uint8_t(clamp((2*q1 + q0 + p1 + 2) >> 2, 0, 255));
}
}
+81
@@ -0,0 +1,81 @@
// daedalus-fourier — H.264 luma intra (bS=4) V deblock — V3D 7.1.
//
// Per H.264 §8.7.2.4: at I-MB edges and certain inter-MB edges that
// force boundary strength to 4, the deblock kernel is structurally
// different from bS<4 — it has a per-side strong/weak filter
// selector that decides whether to update 3 cells (strong) or 1
// (weak), reads p3/q3, and ignores tc0.
//
// strong_common = |p0-q0| < (α>>2) + 2
// strong_p = strong_common AND |p2-p0| < β
// strong_q = strong_common AND |q2-q0| < β
//
// Strong-p updates p0/p1/p2 with specific 5-/4-/5-tap blends.
// Weak-p updates p0 only with (2*p1 + p0 + q1 + 2) >> 2.
// Mirror for q-side.
//
// WG geometry identical to v3d_h264deblock.comp (16 edges × 16 lanes/WG).
// dst_off contract: m.x ≥ 4*stride (kernel reads p3 at -4*stride).
//
// License: BSD-2-Clause. Algorithm transcribed from
// tests/h264_intra_loop_filter_ref.c (PR #11).
#version 450
#extension GL_EXT_shader_8bit_storage : require
#extension GL_EXT_shader_explicit_arithmetic_types : require
layout(local_size_x = 256, local_size_y = 1, local_size_z = 1) in;
layout(binding = 0) readonly buffer Meta { uvec4 meta[]; } u_meta;
layout(binding = 1) buffer Dst { uint8_t dst[]; } u_dst;
layout(push_constant) uniform PC {
uint n_edges, dst_stride_u8, _p0, _p1;
} pc;
void main()
{
uint lane_in_wg = gl_GlobalInvocationID.x & 255u;
uint edge_in_wg = lane_in_wg >> 4;
uint col_in_edge = lane_in_wg & 15u;
uint edge_idx = gl_WorkGroupID.x * 16u + edge_in_wg;
if (edge_idx >= pc.n_edges) return;
uvec4 m = u_meta.meta[edge_idx];
uint dst_off = m.x + col_in_edge;
uint stride = pc.dst_stride_u8;
int alpha = int(m.y & 0xffu);
int beta = int((m.y >> 8) & 0xffu);
if ((alpha | beta) == 0) return;
int p3 = int(u_dst.dst[dst_off - 4u * stride]);
int p2 = int(u_dst.dst[dst_off - 3u * stride]);
int p1 = int(u_dst.dst[dst_off - 2u * stride]);
int p0 = int(u_dst.dst[dst_off - 1u * stride]);
int q0 = int(u_dst.dst[dst_off]);
int q1 = int(u_dst.dst[dst_off + 1u * stride]);
int q2 = int(u_dst.dst[dst_off + 2u * stride]);
int q3 = int(u_dst.dst[dst_off + 3u * stride]);
if (abs(p0 - q0) >= alpha) return;
if (abs(p1 - p0) >= beta) return;
if (abs(q1 - q0) >= beta) return;
bool strong_common = abs(p0 - q0) < (alpha >> 2) + 2;
bool strong_p = strong_common && abs(p2 - p0) < beta;
bool strong_q = strong_common && abs(q2 - q0) < beta;
if (strong_p) {
u_dst.dst[dst_off - 1u * stride] = uint8_t(clamp((p2 + 2*p1 + 2*p0 + 2*q0 + q1 + 4) >> 3, 0, 255));
u_dst.dst[dst_off - 2u * stride] = uint8_t(clamp((p2 + p1 + p0 + q0 + 2) >> 2, 0, 255));
u_dst.dst[dst_off - 3u * stride] = uint8_t(clamp((2*p3 + 3*p2 + p1 + p0 + q0 + 4) >> 3, 0, 255));
} else {
u_dst.dst[dst_off - 1u * stride] = uint8_t(clamp((2*p1 + p0 + q1 + 2) >> 2, 0, 255));
}
if (strong_q) {
u_dst.dst[dst_off ] = uint8_t(clamp((q2 + 2*q1 + 2*q0 + 2*p0 + p1 + 4) >> 3, 0, 255));
u_dst.dst[dst_off + 1u * stride] = uint8_t(clamp((q2 + q1 + q0 + p0 + 2) >> 2, 0, 255));
u_dst.dst[dst_off + 2u * stride] = uint8_t(clamp((2*q3 + 3*q2 + q1 + q0 + p0 + 4) >> 3, 0, 255));
} else {
u_dst.dst[dst_off ] = uint8_t(clamp((2*q1 + q0 + p1 + 2) >> 2, 0, 255));
}
}
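The strong/weak selection is easier to follow on a single line of eight samples. The routine below is a scalar sketch of the same algebra, not the project's reference implementation; `luma_intra_bs4` is a hypothetical name. Every blend is a weighted average whose weights sum to the divisor, so the results stay within 0..255 without an explicit clamp:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Filter one line across a bS=4 luma edge.
 * pix[0..7] = p3, p2, p1, p0, q0, q1, q2, q3.
 * Returns 1 if the line was filtered, 0 if it was skipped. */
static int luma_intra_bs4(uint8_t pix[8], int alpha, int beta)
{
    int p3 = pix[0], p2 = pix[1], p1 = pix[2], p0 = pix[3];
    int q0 = pix[4], q1 = pix[5], q2 = pix[6], q3 = pix[7];

    if ((alpha | beta) == 0) return 0;
    if (abs(p0 - q0) >= alpha) return 0;
    if (abs(p1 - p0) >= beta)  return 0;
    if (abs(q1 - q0) >= beta)  return 0;

    int strong_common = abs(p0 - q0) < (alpha >> 2) + 2;

    if (strong_common && abs(p2 - p0) < beta) {
        /* Strong p side: rewrite p0, p1 and p2. */
        pix[3] = (uint8_t)((p2 + 2*p1 + 2*p0 + 2*q0 + q1 + 4) >> 3);
        pix[2] = (uint8_t)((p2 + p1 + p0 + q0 + 2) >> 2);
        pix[1] = (uint8_t)((2*p3 + 3*p2 + p1 + p0 + q0 + 4) >> 3);
    } else {
        /* Weak p side: p0 only. */
        pix[3] = (uint8_t)((2*p1 + p0 + q1 + 2) >> 2);
    }

    if (strong_common && abs(q2 - q0) < beta) {
        /* Strong q side: mirror image of the p side. */
        pix[4] = (uint8_t)((q2 + 2*q1 + 2*q0 + 2*p0 + p1 + 4) >> 3);
        pix[5] = (uint8_t)((q2 + q1 + q0 + p0 + 2) >> 2);
        pix[6] = (uint8_t)((2*q3 + 3*q2 + q1 + q0 + p0 + 4) >> 3);
    } else {
        pix[4] = (uint8_t)((2*q1 + q0 + p1 + 2) >> 2);
    }
    return 1;
}
```

All inputs are read before any output is written, matching the shader, which loads p3..q3 into locals before its stores.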
+206 -2
@@ -8,6 +8,8 @@
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <limits.h>
#define CHK(call) do { VkResult r__ = (call); if (r__ != VK_SUCCESS) { \
fprintf(stderr, "v3d_runner: vulkan error %d at %s:%d (%s)\n", \
@@ -17,6 +19,18 @@
fprintf(stderr, "v3d_runner: vulkan error %d at %s:%d (%s)\n", \
r__, __FILE__, __LINE__, #call); return NULL; } } while (0)
/* Power-of-2 size classes from 2^8 (256 B) up to 2^23 (8 MiB). Cycle
* 1's largest dispatch with n_blocks ≈ 8K is well under 8 MiB; oversize
* requests fall through to non-pooled allocation. */
#define V3D_POOL_MIN_LOG2 8
#define V3D_POOL_MAX_LOG2 23
#define V3D_POOL_BUCKETS (V3D_POOL_MAX_LOG2 - V3D_POOL_MIN_LOG2 + 1)
struct v3d_pool_entry {
v3d_buffer buf;
struct v3d_pool_entry *next;
};
struct v3d_runner {
VkInstance instance;
VkPhysicalDevice phys;
@@ -26,6 +40,15 @@ struct v3d_runner {
VkCommandPool pool;
char device_name[VK_MAX_PHYSICAL_DEVICE_NAME_SIZE];
VkPhysicalDeviceMemoryProperties mem_props;
/* Buffer pool: per-bucket freelist of previously-released
* v3d_buffer. bucket index = ceil_log2(size) - V3D_POOL_MIN_LOG2.
* pool_total_bytes accumulates every successful vkAllocateMemory
* we've done through the pool — never decreases (the freelist
* just hands buffers around, no vkFreeMemory until destroy).
*/
struct v3d_pool_entry *pool_free[V3D_POOL_BUCKETS];
size_t pool_total_bytes;
};
static int pick_v3d_physical_device(VkInstance inst, VkPhysicalDevice *out,
@@ -168,6 +191,21 @@ void v3d_runner_destroy(v3d_runner *r)
{
if (!r) return;
if (r->device != VK_NULL_HANDLE) vkDeviceWaitIdle(r->device);
/* Drain the buffer pool BEFORE destroying device — the pool
* entries own VkBuffer/VkDeviceMemory handles, which need a live
* device for vkDestroyBuffer/vkFreeMemory. */
for (int b = 0; b < V3D_POOL_BUCKETS; b++) {
struct v3d_pool_entry *e = r->pool_free[b];
while (e) {
struct v3d_pool_entry *next = e->next;
v3d_runner_destroy_buffer(r, &e->buf);
free(e);
e = next;
}
r->pool_free[b] = NULL;
}
if (r->pool != VK_NULL_HANDLE)
vkDestroyCommandPool(r->device, r->pool, NULL);
if (r->device != VK_NULL_HANDLE) vkDestroyDevice(r->device, NULL);
@@ -175,6 +213,92 @@ void v3d_runner_destroy(v3d_runner *r)
free(r);
}
/* ---- Buffer pool ----------------------------------------------- */
/* ceil_log2 for buffer pool bucket selection. */
static int v3d_pool_bucket_for(size_t size)
{
int log2;
size_t m;
if (size <= ((size_t)1 << V3D_POOL_MIN_LOG2))
return 0;
m = size - 1;
log2 = 0;
while (m) { log2++; m >>= 1; }
if (log2 < V3D_POOL_MIN_LOG2) log2 = V3D_POOL_MIN_LOG2;
if (log2 > V3D_POOL_MAX_LOG2) return -1;
return log2 - V3D_POOL_MIN_LOG2;
}
int v3d_runner_acquire_buffer(v3d_runner *r, size_t size, v3d_buffer *out)
{
int bucket;
size_t bucket_size;
struct v3d_pool_entry *e;
int rc;
if (!r || !out || size == 0) return -1;
bucket = v3d_pool_bucket_for(size);
if (bucket < 0) {
/* Oversize — fall through to non-pooled allocation. Caller
* still calls v3d_runner_release_buffer(), which detects the
* oversize bucket via bucket_for() and destroys. */
return v3d_runner_create_buffer(r, size, out);
}
bucket_size = (size_t)1 << (bucket + V3D_POOL_MIN_LOG2);
e = r->pool_free[bucket];
if (e) {
r->pool_free[bucket] = e->next;
*out = e->buf;
free(e);
return 0;
}
/* Miss — allocate fresh at the bucket size. Subsequent acquire/
* release for the same bucket reuses this buffer. */
rc = v3d_runner_create_buffer(r, bucket_size, out);
if (rc == 0)
r->pool_total_bytes += bucket_size;
return rc;
}
void v3d_runner_release_buffer(v3d_runner *r, v3d_buffer *buf)
{
int bucket;
struct v3d_pool_entry *e;
if (!r || !buf || buf->buffer == VK_NULL_HANDLE) return;
bucket = v3d_pool_bucket_for(buf->size);
if (bucket < 0) {
/* Oversize — destroy outright; never made it into the pool. */
v3d_runner_destroy_buffer(r, buf);
memset(buf, 0, sizeof(*buf));
return;
}
e = malloc(sizeof(*e));
if (!e) {
/* Allocator failure: just destroy. Pool degenerates to
* non-pooled behaviour but doesn't leak. */
v3d_runner_destroy_buffer(r, buf);
memset(buf, 0, sizeof(*buf));
return;
}
e->buf = *buf;
e->next = r->pool_free[bucket];
r->pool_free[bucket] = e;
memset(buf, 0, sizeof(*buf));
}
size_t v3d_runner_pool_total_bytes(v3d_runner *r)
{
return r ? r->pool_total_bytes : 0;
}
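The size-class arithmetic of the pool can be checked in isolation. The sketch below restates the bucket function with local constants (256 B minimum, 8 MiB maximum, as in the defines above) and adds a hypothetical `bucket_size` helper for the inverse mapping:

```c
#include <assert.h>
#include <stddef.h>

#define POOL_MIN_LOG2 8   /* 256 B */
#define POOL_MAX_LOG2 23  /* 8 MiB */

/* Bucket index for a request, or -1 if it exceeds the largest class.
 * Equivalent to ceil(log2(size)) - POOL_MIN_LOG2, floored at 0. */
static int bucket_for(size_t size)
{
    int log2 = 0;
    size_t m;

    if (size <= ((size_t)1 << POOL_MIN_LOG2))
        return 0;
    /* Count the bits of size - 1: that is ceil(log2(size)) for size > 1. */
    for (m = size - 1; m; m >>= 1)
        log2++;
    if (log2 > POOL_MAX_LOG2)
        return -1;
    return log2 - POOL_MIN_LOG2;
}

/* Capacity in bytes of a given bucket (hypothetical helper). */
static size_t bucket_size(int bucket)
{
    assert(bucket >= 0 && bucket <= POOL_MAX_LOG2 - POOL_MIN_LOG2);
    return (size_t)1 << (bucket + POOL_MIN_LOG2);
}
```

A request of 257 bytes therefore lands in the 512-byte class, and anything above 8 MiB is reported as oversize and bypasses the pool.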
VkDevice v3d_runner_device(v3d_runner *r) { return r->device; }
VkQueue v3d_runner_queue(v3d_runner *r) { return r->queue; }
uint32_t v3d_runner_queue_family(v3d_runner *r) { return r->queue_family; }
@@ -246,10 +370,68 @@ void v3d_runner_destroy_buffer(v3d_runner *r, v3d_buffer *buf)
/* ---- Pipelines -------------------------------------------------- */
/* SPV lookup tries a small set of locations. The caller passes a bare
* filename (e.g. "v3d_h264_idct4.spv"); we try, in order:
*
* 1. cwd-relative (legacy contract; works when run from build/)
* 2. $DAEDALUS_SHADER_DIR (env override for tests / packaged installs)
* 3. <binary-dir>/<name> (so the bench/test binary finds the SPV next
* to itself regardless of cwd — this is the
* fix for the silent-no-SPV regression that
* made PR #36's bench numbers meaningless)
* 4. /opt/fourier/share/daedalus-fourier/<name> (Pi 5 install layout)
* 5. /usr/share/daedalus-fourier/<name> (system-wide install)
*
* Returns NULL only if every location fails; the caller (read_spv) then
* prints a single diagnostic naming the bare filename so the user can
* grep for it. */
static FILE *open_spv(const char *name)
{
FILE *f = fopen(name, "rb");
if (f) return f;
const char *envdir = getenv("DAEDALUS_SHADER_DIR");
if (envdir && *envdir) {
char p[PATH_MAX];
snprintf(p, sizeof(p), "%s/%s", envdir, name);
f = fopen(p, "rb");
if (f) return f;
}
char exe[PATH_MAX];
ssize_t n = readlink("/proc/self/exe", exe, sizeof(exe) - 1);
if (n > 0) {
exe[n] = 0;
char *slash = strrchr(exe, '/');
if (slash) {
*slash = 0;
char p[PATH_MAX];
snprintf(p, sizeof(p), "%s/%s", exe, name);
f = fopen(p, "rb");
if (f) return f;
}
}
char p[PATH_MAX];
snprintf(p, sizeof(p), "/opt/fourier/share/daedalus-fourier/%s", name);
f = fopen(p, "rb");
if (f) return f;
snprintf(p, sizeof(p), "/usr/share/daedalus-fourier/%s", name);
f = fopen(p, "rb");
if (f) return f;
return NULL;
}
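One detail the lookup above leaves implicit is what happens when a candidate path does not fit in PATH_MAX: snprintf truncates silently and fopen is then attempted on a shortened path. A minimal sketch of a truncation-aware join follows; `join_path` is a hypothetical helper, not a function in v3d_runner:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Join dir and name as "dir/name" into out.
 * Returns 0 on success, -1 on bad arguments or if the result
 * would not fit (in which case out must not be used). */
static int join_path(char *out, size_t out_size,
                     const char *dir, const char *name)
{
    int n;

    if (!out || out_size == 0 || !dir || !name || strlen(name) == 0)
        return -1;
    /* snprintf returns the length it wanted to write, excluding the
     * terminator; a value >= out_size means the output was truncated. */
    n = snprintf(out, out_size, "%s/%s", dir, name);
    if (n < 0 || (size_t)n >= out_size)
        return -1;
    return 0;
}
```

A caller would skip the candidate when the join fails and move on to the next location in the search order.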
static uint32_t *read_spv(const char *path, size_t *out_size)
{
-FILE *f = fopen(path, "rb");
-if (!f) { perror(path); return NULL; }
FILE *f = open_spv(path);
if (!f) {
fprintf(stderr,
"daedalus: SPV not found via cwd / $DAEDALUS_SHADER_DIR / "
"binary-dir / /opt/fourier/share / /usr/share: %s\n", path);
return NULL;
}
fseek(f, 0, SEEK_END);
long sz = ftell(f);
fseek(f, 0, SEEK_SET);
@@ -364,12 +546,27 @@ int v3d_runner_create_pipeline(v3d_runner *r, const char *spv_path,
.pSetLayouts = &out->ds_layout,
};
CHK(vkAllocateDescriptorSets(r->device, &dsai, &out->desc_set));
/* Persistent command buffer — pool was created with
* RESET_COMMAND_BUFFER_BIT (see v3d_runner_create) so dispatch
* sites can call vkResetCommandBuffer on this same cb instead
* of paying vkAllocateCommandBuffers per call. */
VkCommandBufferAllocateInfo cbai = {
.sType = VK_STRUCTURE_TYPE_COMMAND_BUFFER_ALLOCATE_INFO,
.commandPool = r->pool,
.level = VK_COMMAND_BUFFER_LEVEL_PRIMARY,
.commandBufferCount = 1,
};
CHK(vkAllocateCommandBuffers(r->device, &cbai, &out->cb));
return 0;
}
void v3d_runner_destroy_pipeline(v3d_runner *r, v3d_pipeline *p)
{
if (!p || p->pipeline == VK_NULL_HANDLE) return;
if (p->cb != VK_NULL_HANDLE)
vkFreeCommandBuffers(r->device, r->pool, 1, &p->cb);
vkDestroyPipeline(r->device, p->pipeline, NULL);
vkDestroyPipelineLayout(r->device, p->layout, NULL);
vkDestroyDescriptorPool(r->device, p->pool, NULL); /* frees its set */
@@ -377,6 +574,13 @@ void v3d_runner_destroy_pipeline(v3d_runner *r, v3d_pipeline *p)
memset(p, 0, sizeof(*p));
}
int v3d_runner_pipeline_cmdbuf_reset(v3d_runner *r, v3d_pipeline *p)
{
(void) r;
if (!p || p->cb == VK_NULL_HANDLE) return -1;
return vkResetCommandBuffer(p->cb, 0) == VK_SUCCESS ? 0 : -1;
}
int v3d_runner_bind_buffers(v3d_runner *r, v3d_pipeline *p,
const v3d_buffer *bufs, uint32_t n)
{
+45
@@ -34,6 +34,12 @@ typedef struct {
VkDescriptorSet desc_set;
uint32_t n_ssbos;
uint32_t push_const_size;
/* Persistent command buffer. Allocated at create-pipeline time;
* dispatch sites use v3d_runner_pipeline_cmdbuf_reset() to
* vkResetCommandBuffer instead of paying vkAllocateCommandBuffers
* per dispatch. Pool flagged RESET_COMMAND_BUFFER_BIT so reset
* is permitted. */
VkCommandBuffer cb;
} v3d_pipeline;
/*
@@ -57,10 +63,43 @@ const char *v3d_runner_device_name(v3d_runner *r);
* host side. The mapping persists for the lifetime of the buffer.
*
* Returns 0 on success, non-zero on failure.
*
* NOTE: prefer v3d_runner_acquire_buffer() on the dispatch hot path —
* create_buffer/destroy_buffer go straight to vkAllocateMemory each
* call, which on V3D7's Mesa stack costs ~10-50us. The acquire/
* release pair pulls from a freelist and pays vkAllocateMemory only
* on a cache miss.
*/
int v3d_runner_create_buffer(v3d_runner *r, size_t size, v3d_buffer *out);
void v3d_runner_destroy_buffer(v3d_runner *r, v3d_buffer *buf);
/*
* Pooled buffer acquisition. Returns a v3d_buffer whose .size is the
* smallest power of two >= the requested size, with a 256 B minimum
* (so callers can pool across similar-sized requests). Requests above
* the largest size class (8 MiB) bypass the pool and go straight to
* v3d_runner_create_buffer(). Backed by HOST_VISIBLE |
* HOST_COHERENT memory; mapped pointer is valid.
*
* On cache hit: reuses a previously-released buffer with no Vulkan
* allocation call.
* On miss: falls through to v3d_runner_create_buffer(). Release with
* v3d_runner_release_buffer(); pool drains in v3d_runner_destroy().
*
* Lifetime contract: the returned buffer's .mapped contents are
* UNINITIALISED — the previous user's data may still be present.
* Callers that need a clean buffer must memset themselves. This is
* deliberate; the dispatch hot paths immediately overwrite the
* buffer with new coefficients / meta anyway.
*
* Thread-safety: NOT thread-safe. A daedalus_ctx is single-threaded
* by API contract; the pool inherits that constraint.
*/
int v3d_runner_acquire_buffer(v3d_runner *r, size_t size, v3d_buffer *out);
void v3d_runner_release_buffer(v3d_runner *r, v3d_buffer *buf);
/* Pool diagnostics: total allocated bytes (sum across all size
* classes, including currently-released entries). Useful for
* watermark logging. */
size_t v3d_runner_pool_total_bytes(v3d_runner *r);
/* Compute pipeline from a SPIR-V file path. The descriptor-set
* layout exposes `n_ssbos` storage buffer bindings at binding
* indices 0..n_ssbos-1, all visible to the compute stage. A push
@@ -88,6 +127,12 @@ int v3d_runner_bind_buffers(v3d_runner *r,
/* Allocate a primary command buffer from the runner's pool. */
VkCommandBuffer v3d_runner_alloc_cmdbuf(v3d_runner *r);
/* Reset @p->cb so it can be re-recorded. Returns 0 on success.
* Replaces v3d_runner_alloc_cmdbuf() on the dispatch hot path —
* vkResetCommandBuffer is O(1) vs vkAllocateCommandBuffers' ~1-5us
* driver cost. */
int v3d_runner_pipeline_cmdbuf_reset(v3d_runner *r, v3d_pipeline *p);
/* Submit `cb` to the queue and wait for completion. The classic
* timed operation. Returns 0 on success.
*/
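The size-class rule documented for v3d_runner_acquire_buffer() above (".size is the smallest power-of-2 >= the requested size") can be sketched as follows. This is only an illustration of the rounding arithmetic, not the runner's actual code: pool_round_up_pow2 and pool_size_class are hypothetical names, and the idea of one freelist per class index is an assumption about how such a pool might be organised.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not part of v3d_runner): smallest power of two
 * that is >= n.  Requires 1 <= n <= SIZE_MAX/2 + 1 so the shift below
 * cannot overflow. */
static size_t pool_round_up_pow2(size_t n)
{
    assert(n >= 1 && n <= SIZE_MAX / 2 + 1);
    size_t p = 1;
    while (p < n)
        p <<= 1;
    return p;
}

/* Hypothetical helper: size-class index, i.e. log2 of the rounded
 * size.  A pool could keep one freelist per index (assumption). */
static unsigned pool_size_class(size_t n)
{
    size_t p = pool_round_up_pow2(n);
    unsigned idx = 0;
    while (p > 1) {
        p >>= 1;
        idx++;
    }
    return idx;
}
```

Under this rule a 4097-byte request and an 8192-byte request land in the same class, which is what lets similar-sized dispatches share released buffers.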
@@ -0,0 +1,299 @@
/* SPDX-License-Identifier: BSD-2-Clause */
/* clock_gettime()/CLOCK_MONOTONIC require _POSIX_C_SOURCE under -std=c11 (CMAKE_C_EXTENSIONS=OFF). */
#define _POSIX_C_SOURCE 200809L
/*
* bench_h264_primitives — latency baseline for the H.264 primitive
* library landed across PRs #9-#35.
*
* Each kernel is exercised at a representative per-frame N for 1080p
* (8160 MBs); the per-kernel total + ns/op + ms/frame are reported,
* once per substrate (CPU NEON, QPU V3D7 compute). The QPU column
* appears only when the host has a usable Vulkan device. When both
* columns exist a CPU/QPU ratio is printed; that's the per-kernel
* data the QPU-substrate decree (2026-05-23) deliberately overrides
* but which is still useful to track over time as dispatch overhead
* shrinks (buffer pool, persistent cmdbuf, dmabuf import — tasks 160-162).
*
* NOT a ctest — produces wall-time numbers, doesn't pass/fail.
*
* Invoke: ./build/bench_h264_primitives [iters [warmup]]
* (default iters = 50, warmup = 5)
*/
#include "daedalus.h"
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
static uint64_t xs64_state = 0xfeedface5a5a5a5aULL;
static uint64_t xs64(void) {
uint64_t x = xs64_state;
x ^= x << 13; x ^= x >> 7; x ^= x << 17;
return xs64_state = x;
}
static double now_ms(void) {
struct timespec ts;
clock_gettime(CLOCK_MONOTONIC, &ts);
return ts.tv_sec * 1000.0 + ts.tv_nsec / 1.0e6;
}
/* Per-1080p-frame counts (8160 MBs at 1920x1088). */
#define MBS_1080P 8160
/* Standard benchmark loop. fn() is called n times per iteration.
*
* fn() now returns the dispatch's int rc. A single preflight call is
* made before the hot loop; if rc != 0 (which on the QPU substrate
* almost always means "SPV not found via any search path"), bench_ns
* returns -1 and the caller must NOT report the kernel as measured.
*
* Without this, a missing SPV makes every dispatch fail fast at the
* cost of one fprintf+open call (~1-5 µs), and the loop times that
* cost as if it were real QPU work — producing absurdly-small ns/op
* numbers that look like a QPU speedup. This is exactly what made
* PR #36's bench numbers a measurement artifact. */
typedef int (*bench_fn)(void);
static double bench_ns(const char *name, int iters, int warmup,
int ops_per_iter, bench_fn fn)
{
int rc = fn();
if (rc != 0) {
printf(" %-32s DISPATCH FAILED rc=%d — kernel skipped\n", name, rc);
return -1;
}
for (int i = 0; i < warmup; i++) fn();
double t0 = now_ms();
for (int i = 0; i < iters; i++) fn();
double t1 = now_ms();
double total_ms = (t1 - t0);
double ns_per_op = (total_ms * 1e6) / ((double) iters * ops_per_iter);
printf(" %-32s %10.2f ns/op (%d iters x %d ops)\n",
name, ns_per_op, iters, ops_per_iter);
return ns_per_op;
}
/* ---- Per-kernel scaffolding. Each section sets up the buffers +
* meta, then defines a static fn() that calls the corresponding
* dispatch with a representative N. The substrate is read from the
* global g_sub so the same fn() can be re-driven with CPU then QPU. */
static daedalus_ctx *ctx;
static daedalus_substrate g_sub = DAEDALUS_SUBSTRATE_CPU;
/* --- IDCT 4x4 luma: N = 16 blocks per MB. Bench with 1024 blocks
* per call (64 MBs worth). Per-MB the dispatch overhead is the
* same regardless of N — we want ns per block. */
static int16_t idct4_coeffs[1024 * 16];
static daedalus_h264_block_meta idct4_meta[1024];
static uint8_t idct_dst[64 * 4 * 16 * 16]; /* 64 MB-rows × ... */
static int bench_idct4(void) {
return daedalus_dispatch_h264_idct4(ctx, g_sub,
idct_dst, 64*16, idct4_coeffs, 1024, idct4_meta);
}
/* --- IDCT 8x8 luma: 256 8x8 blocks per call. */
static int16_t idct8_coeffs[256 * 64];
static daedalus_h264_block_meta idct8_meta[256];
static int bench_idct8(void) {
return daedalus_dispatch_h264_idct8(ctx, g_sub,
idct_dst, 64*16, idct8_coeffs, 256, idct8_meta);
}
/* --- Deblock luma_v (cycle 8 baseline; M3 path). */
static daedalus_h264_deblock_meta deblock_meta[256];
static uint8_t deblock_dst[256 * 16 * 16];
static int bench_deblock_v(void) {
return daedalus_dispatch_h264_deblock_luma_v(ctx, g_sub,
deblock_dst, 16, 256, deblock_meta);
}
static int bench_deblock_h(void) {
return daedalus_dispatch_h264_deblock_luma_h(ctx, g_sub,
deblock_dst, 16, 256, deblock_meta);
}
/* --- qpel mc20 + mc02 + mc22 (the H/V/HV anchors). */
static uint8_t qpel_src[256 * 16 * 16];
static uint8_t qpel_dst[256 * 16 * 16];
static daedalus_h264_qpel_meta qpel_meta[256];
static int bench_qpel_mc20(void) {
return daedalus_dispatch_h264_qpel_mc20(ctx, g_sub,
qpel_dst, qpel_src, 16, 256, qpel_meta);
}
static int bench_qpel_mc02(void) {
return daedalus_dispatch_h264_qpel_mc02(ctx, g_sub,
qpel_dst, qpel_src, 16, 256, qpel_meta);
}
static int bench_qpel_mc22(void) {
return daedalus_dispatch_h264_qpel_mc22(ctx, g_sub,
qpel_dst, qpel_src, 16, 256, qpel_meta);
}
/* ---- One row of bench output:
* - kernel name + N
* - CPU ns/op
* - QPU ns/op (or "n/a" if Vulkan absent)
* - winner and its speedup ratio (always >= 1x, in the winner's favour) */
struct row {
const char *name;
int n_per_call;
bench_fn fn;
double cpu_ns;
double qpu_ns; /* -1 if not measured */
int frame_n; /* count per 1080p frame */
};
static struct row rows[] = {
{"IDCT 4x4 luma", 1024, bench_idct4, 0, -1, MBS_1080P * 16},
{"IDCT 8x8 luma", 256, bench_idct8, 0, -1, MBS_1080P * 4},
{"Deblock luma_v", 256, bench_deblock_v, 0, -1, MBS_1080P * 4},
{"Deblock luma_h", 256, bench_deblock_h, 0, -1, MBS_1080P * 4},
{"qpel mc20 (8x8)", 256, bench_qpel_mc20, 0, -1, MBS_1080P * 4},
{"qpel mc02 (8x8)", 256, bench_qpel_mc02, 0, -1, MBS_1080P * 4},
{"qpel mc22 (8x8)", 256, bench_qpel_mc22, 0, -1, MBS_1080P * 4},
};
#define N_ROWS ((int)(sizeof(rows)/sizeof(rows[0])))
int main(int argc, char **argv)
{
int iters = argc > 1 ? atoi(argv[1]) : 50;
int warmup = argc > 2 ? atoi(argv[2]) : 5;
ctx = daedalus_ctx_create();
if (!ctx) {
fprintf(stderr, "ctx create failed (Vulkan?)\n");
return 1;
}
int has_qpu = daedalus_ctx_has_qpu(ctx);
/* Pre-fill all input buffers with random data so the NEON inner
* loops see realistic memory access patterns. */
for (size_t i = 0; i < sizeof(idct4_coeffs)/2; i++)
idct4_coeffs[i] = (int16_t)((int)(xs64() % 1024) - 512);
for (size_t i = 0; i < sizeof(idct8_coeffs)/2; i++)
idct8_coeffs[i] = (int16_t)((int)(xs64() % 1024) - 512);
for (size_t i = 0; i < sizeof(qpel_src); i++) qpel_src[i] = (uint8_t)(xs64() & 0xff);
/* IDCT meta. */
for (size_t i = 0; i < 1024; i++)
idct4_meta[i].dst_off = (uint32_t)((i / 16) * 64 + (i % 16) * 4);
for (size_t i = 0; i < 256; i++)
idct8_meta[i].dst_off = (uint32_t)((i / 8) * 64 + (i % 8) * 8);
/* Deblock meta: edge offsets within 256 16x16 tiles. */
for (size_t i = 0; i < 256; i++) {
deblock_meta[i].dst_off = (uint32_t)(i * 256 + 4 * 16);
deblock_meta[i].alpha = 30;
deblock_meta[i].beta = 10;
for (int s = 0; s < 4; s++) deblock_meta[i].tc0[s] = (int8_t)(s + 1);
}
/* qpel meta. */
for (size_t i = 0; i < 256; i++) {
qpel_meta[i].src_off = (uint32_t)(i * 256 + 3 * 16 + 3);
qpel_meta[i].dst_off = (uint32_t)(i * 256 + 3 * 16 + 3);
}
printf("bench_h264_primitives: %d iters (%d warmup)\n", iters, warmup);
printf(" ctx has_qpu=%d (CPU pass always runs; QPU pass skipped without Vulkan)\n\n", has_qpu);
/* Pass 1: CPU NEON. */
g_sub = DAEDALUS_SUBSTRATE_CPU;
printf("== CPU NEON ==\n");
for (int i = 0; i < N_ROWS; i++)
rows[i].cpu_ns = bench_ns(rows[i].name, iters, warmup, rows[i].n_per_call, rows[i].fn);
/* Pass 2: QPU compute (if available). */
int qpu_failures = 0;
if (has_qpu) {
g_sub = DAEDALUS_SUBSTRATE_QPU;
printf("\n== QPU V3D7 compute ==\n");
for (int i = 0; i < N_ROWS; i++) {
rows[i].qpu_ns = bench_ns(rows[i].name, iters, warmup, rows[i].n_per_call, rows[i].fn);
if (rows[i].qpu_ns < 0) qpu_failures++;
}
if (qpu_failures) {
fprintf(stderr,
"\nbench_h264_primitives: %d of %d QPU dispatches failed.\n"
" Almost always means SPV files were not found via any of:\n"
" cwd / $DAEDALUS_SHADER_DIR / binary-dir /\n"
" /opt/fourier/share/daedalus-fourier / /usr/share/daedalus-fourier\n"
" Set DAEDALUS_SHADER_DIR=<path> or run from a dir where the\n"
" .spv files exist (e.g. the cmake build dir).\n",
qpu_failures, N_ROWS);
return 2;
}
}
/* Summary table — both substrates side by side. */
printf("\n== Per-kernel comparison ==\n");
printf(" %-24s %12s %12s %8s %7s\n",
"kernel", "CPU ns/op", "QPU ns/op", "winner", "ms/frame");
for (int i = 0; i < N_ROWS; i++) {
double cpu_ms = rows[i].cpu_ns * rows[i].frame_n / 1e6;
double qpu_ms = rows[i].qpu_ns > 0 ? rows[i].qpu_ns * rows[i].frame_n / 1e6 : -1;
const char *winner;
char ratio[16];
if (rows[i].qpu_ns <= 0) {
winner = "CPU"; /* QPU n/a */
snprintf(ratio, sizeof(ratio), "n/a");
} else if (rows[i].cpu_ns < rows[i].qpu_ns) {
winner = "CPU";
snprintf(ratio, sizeof(ratio), "%.2fx", rows[i].qpu_ns / rows[i].cpu_ns);
} else {
winner = "QPU";
snprintf(ratio, sizeof(ratio), "%.2fx", rows[i].cpu_ns / rows[i].qpu_ns);
}
char qpu_field[16];
if (rows[i].qpu_ns > 0) snprintf(qpu_field, sizeof(qpu_field), "%.2f", rows[i].qpu_ns);
else snprintf(qpu_field, sizeof(qpu_field), "n/a");
char ms_field[24];
if (qpu_ms > 0)
snprintf(ms_field, sizeof(ms_field), "%.2f/%.2f", cpu_ms, qpu_ms);
else
snprintf(ms_field, sizeof(ms_field), "%.2f/n/a", cpu_ms);
printf(" %-24s %12.2f %12s %3s %s %s\n",
rows[i].name, rows[i].cpu_ns, qpu_field, winner, ratio, ms_field);
}
/* Per-frame budget summary at 1080p (8160 MBs). */
double cpu_idct4 = rows[0].cpu_ns * MBS_1080P * 16 / 1e6;
double cpu_debl = (rows[2].cpu_ns + rows[3].cpu_ns) * MBS_1080P * 4 / 1e6;
double cpu_mc = rows[6].cpu_ns * MBS_1080P * 4 / 1e6; /* mc22 worst-case */
double cpu_sum = cpu_idct4 + cpu_debl + cpu_mc;
printf("\n== Projected 1080p worst-case (CPU NEON only) ==\n");
printf(" IDCT 4x4 + deblock luma + qpel mc22: %.2f ms (30fps deadline 33.33)\n", cpu_sum);
printf(" Margin: %+.2f ms\n", 33.33 - cpu_sum);
if (has_qpu) {
double qpu_idct4 = rows[0].qpu_ns * MBS_1080P * 16 / 1e6;
double qpu_debl = (rows[2].qpu_ns + rows[3].qpu_ns) * MBS_1080P * 4 / 1e6;
double qpu_mc = rows[6].qpu_ns * MBS_1080P * 4 / 1e6;
double qpu_sum = qpu_idct4 + qpu_debl + qpu_mc;
printf("\n== Projected 1080p worst-case (QPU V3D7 compute only) ==\n");
printf(" IDCT 4x4 + deblock luma + qpel mc22: %.2f ms (30fps deadline 33.33)\n", qpu_sum);
printf(" Margin: %+.2f ms\n", 33.33 - qpu_sum);
printf("\n CPU vs QPU sum ratio: %.2fx (>1 means QPU wins)\n",
qpu_sum > 0 ? cpu_sum / qpu_sum : 0.0);
}
printf("\n(NOT included: chroma deblock, chroma IDCT, intra prediction,\n");
printf(" CABAC/CAVLC entropy. These bench numbers are a budget LOWER\n");
printf(" bound; the real decode stack adds 20-40%% on top.\n");
printf(" Per-kernel substrate decisions belong in daedalus_core.c recipe\n");
printf(" table; the QPU substrate decree (2026-05-23) keeps everything\n");
printf(" on QPU regardless of these numbers as a policy choice.)\n");
daedalus_ctx_destroy(ctx);
return 0;
}
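The ns/op and ms/frame arithmetic used by bench_ns() and the summary table can be checked in isolation. The sketch below restates it with two hypothetical helpers (ns_per_op and ms_per_frame are illustrative names, not functions in the bench). It also shows why a fast-failing dispatch looks like a speedup: an assumed 3 µs failure path, inside the ~1-5 µs range quoted in the comment on bench_ns(), spread over 1024 ops per call comes out at under 3 ns/op.

```c
#include <assert.h>

/* Hypothetical helper: total wall time in ns divided by the number
 * of operations executed (iters calls x ops_per_iter ops each). */
static double ns_per_op(double total_ns, int iters, int ops_per_iter)
{
    assert(iters > 0 && ops_per_iter > 0);
    return total_ns / ((double) iters * ops_per_iter);
}

/* Hypothetical helper: projected per-frame cost in ms for a kernel
 * that runs ops_per_frame times per frame at ns_op ns per op. */
static double ms_per_frame(double ns_op, int ops_per_frame)
{
    assert(ops_per_frame > 0);
    return ns_op * ops_per_frame / 1e6;
}
```

For example, 50 failing calls at 3000 ns each give ns_per_op(150000.0, 50, 1024), roughly 2.93 ns/op, although no kernel ran. The rc preflight is there to reject exactly this case before the timed loop starts.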
@@ -0,0 +1,120 @@
/*
* bench_pool_overhead — measure QPU dispatch overhead with and without
* the v3d_runner buffer pool warm.
*
* Times N consecutive daedalus_dispatch_vp9_idct8 calls and
* prints the per-call distribution. The first call pays
* vkAllocateMemory (typically tens of microseconds on V3D7's Mesa);
* the second and subsequent should hit the pool freelist and amortise
* to the pure dispatch-floor cost.
*
* Purpose: provide a concrete before/after number for the QPU-default
* substrate decree (2026-05-23). Bench is non-gating and runs in
* fractions of a second.
*
* License: BSD-2-Clause.
*/
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>
#include <time.h>
#include "../include/daedalus.h"
extern size_t v3d_runner_pool_total_bytes(void *); /* exposed if we wanted it */
static double now_seconds(void)
{
struct timespec ts;
clock_gettime(CLOCK_MONOTONIC_RAW, &ts);
return ts.tv_sec + ts.tv_nsec * 1e-9;
}
static int cmp_double(const void *a, const void *b)
{
double da = *(const double *)a, db = *(const double *)b;
return da < db ? -1 : da > db ? 1 : 0;
}
int main(int argc, char **argv)
{
int n_calls = argc > 1 ? atoi(argv[1]) : 200;
int n_blocks = 8; /* one MB column of 8x8 IDCT blocks */
int stride = 64;
daedalus_ctx *ctx = daedalus_ctx_create();
if (!ctx) { fprintf(stderr, "ctx create failed\n"); return 1; }
int has_qpu = daedalus_ctx_has_qpu(ctx);
printf("ctx: has_qpu=%d\n", has_qpu);
if (!has_qpu) {
fprintf(stderr, "QPU not available on this device; bench needs V3D\n");
daedalus_ctx_destroy(ctx);
return 2;
}
/* Build a representative IDCT 8x8 batch and warm a dst buffer. */
int16_t *coeffs = calloc((size_t) n_blocks * 64, sizeof(int16_t));
uint8_t *dst = calloc((size_t) n_blocks * 8 * stride, 1);
daedalus_idct8_meta *meta = calloc((size_t) n_blocks, sizeof(*meta));
if (!coeffs || !dst || !meta) { fprintf(stderr, "alloc fail\n"); return 1; }
uint64_t s = 0x1234567abcdefULL;
for (size_t i = 0; i < (size_t) n_blocks * 64; i++) {
s ^= s << 13; s ^= s >> 7; s ^= s << 17;
coeffs[i] = (int16_t)(s & 0x7ff) - 0x400;
}
for (int b = 0; b < n_blocks; b++) {
meta[b].dst_off = (uint32_t) b * 8;
meta[b].block_x = (uint32_t) b;
meta[b].block_y = 0;
}
double *t = malloc((size_t) n_calls * sizeof(double));
if (!t) { fprintf(stderr, "alloc fail\n"); return 1; }
int rc;
printf("=== dispatching %d times, n_blocks=%d/call ===\n",
n_calls, n_blocks);
for (int i = 0; i < n_calls; i++) {
double t0 = now_seconds();
rc = daedalus_dispatch_vp9_idct8(ctx, DAEDALUS_SUBSTRATE_QPU,
dst, (size_t) stride,
coeffs, (size_t) n_blocks, meta);
double t1 = now_seconds();
if (rc) { fprintf(stderr, "dispatch %d rc=%d\n", i, rc); return 1; }
t[i] = (t1 - t0) * 1e6; /* us */
}
/* Per-call distribution (first few + sorted summary on the steady-state) */
printf("\nfirst 5 calls (cold-warm transition):\n");
for (int i = 0; i < 5 && i < n_calls; i++)
printf(" call %d: %.2f us\n", i, t[i]);
int skip = 10; /* drop warm-up calls from the steady-state stats */
if (n_calls > skip + 10) {
int n = n_calls - skip;
double *s_arr = malloc((size_t) n * sizeof(double));
if (!s_arr) { fprintf(stderr, "alloc fail\n"); return 1; }
memcpy(s_arr, t + skip, (size_t) n * sizeof(double));
qsort(s_arr, (size_t) n, sizeof(double), cmp_double);
double sum = 0;
for (int i = 0; i < n; i++) sum += s_arr[i];
printf("\nsteady-state stats (calls %d..%d, n=%d):\n",
skip, n_calls - 1, n);
printf(" min: %.2f us\n", s_arr[0]);
printf(" p50: %.2f us\n", s_arr[n / 2]);
printf(" p90: %.2f us\n", s_arr[(int)(n * 0.9)]);
printf(" p99: %.2f us\n", s_arr[(int)(n * 0.99)]);
printf(" max: %.2f us\n", s_arr[n - 1]);
printf(" mean: %.2f us\n", sum / n);
printf("\nfirst-call / steady-state median ratio: %.1fx\n",
t[0] / s_arr[n / 2]);
free(s_arr);
}
free(t); free(coeffs); free(dst); free(meta);
daedalus_ctx_destroy(ctx);
return 0;
}
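The steady-state statistics above index a sorted array at (int)(n * q). A minimal sketch of that lookup, with an explicit clamp so q = 1.0 stays in bounds, is shown below. percentile_sorted and sample_us are hypothetical names introduced for illustration, and the sample values are made up, not measurements.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative, already-sorted sample (made-up values, in us). */
static const double sample_us[10] = {
    1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0
};

/* Hypothetical helper: element at index floor(n * q) of an ascending
 * array, clamped to the last element.  Same indexing rule as the
 * bench; no interpolation between neighbours. */
static double percentile_sorted(const double *sorted, size_t n, double q)
{
    assert(sorted != NULL && n > 0 && q >= 0.0 && q <= 1.0);
    size_t idx = (size_t) ((double) n * q);
    if (idx >= n)
        idx = n - 1;
    return sorted[idx];
}
```

With only n = 10 samples, p90 and p99 both resolve to the maximum, so the tail percentiles only become informative with a few hundred calls.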
@@ -0,0 +1,53 @@
/*
* Standalone bit-exact C reference for the H.264 chroma DC 2x2
* Hadamard transform (per H.264 §8.5.11.1).
*
* In 4:2:0 chroma, the four DC coefficients (one from each chroma
* 4x4 AC block within an MB) are arranged into a 2x2 block:
*
* c[0,0] c[0,1] block (0,0) DC block (0,1) DC
* c[1,0] c[1,1] block (1,0) DC block (1,1) DC
*
* The 2x2 Hadamard transform:
*
* f[0,0] = c[0,0] + c[0,1] + c[1,0] + c[1,1]
* f[0,1] = c[0,0] - c[0,1] + c[1,0] - c[1,1]
* f[1,0] = c[0,0] + c[0,1] - c[1,0] - c[1,1]
* f[1,1] = c[0,0] - c[0,1] - c[1,0] + c[1,1]
*
* Equivalently expressed as 2-stage butterflies (row then col), which
* the NEON impl uses for SIMD friendliness — we present that form
* here too so the QPU/NEON ports are 1:1.
*
* Output f[] replaces the input c[]. The QP-dependent scaling per
* §8.5.11.2 happens AFTER this primitive — the intercept patch
* composes Hadamard + LevelScale + shift itself, since the scaling
* shape depends on QP and on whether we're in the chroma_qp_offset
* adjustment regime.
*
* Input/output layout:
* c[0..3] in row-major order: [c[0,0], c[0,1], c[1,0], c[1,1]]
*
* License: BSD-2-Clause. Algorithm is in the H.264 spec.
*/
#include <stdint.h>
void daedalus_h264_chroma_dc_hadamard_2x2_ref(int16_t c[4])
{
/* Stage 1: butterfly along rows.
* t[0] = c[0,0] + c[0,1] = c[0] + c[1]
* t[1] = c[0,0] - c[0,1] = c[0] - c[1]
* t[2] = c[1,0] + c[1,1] = c[2] + c[3]
* t[3] = c[1,0] - c[1,1] = c[2] - c[3]
*/
int t0 = c[0] + c[1];
int t1 = c[0] - c[1];
int t2 = c[2] + c[3];
int t3 = c[2] - c[3];
/* Stage 2: butterfly along cols. */
c[0] = (int16_t)(t0 + t2); /* f[0,0] = t0+t2 = sum of all 4 */
c[1] = (int16_t)(t1 + t3); /* f[0,1] = (c0-c1) + (c2-c3) */
c[2] = (int16_t)(t0 - t2); /* f[1,0] = (c0+c1) - (c2+c3) */
c[3] = (int16_t)(t1 - t3); /* f[1,1] = (c0-c1) - (c2-c3) */
}
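A quick way to sanity-check a port of this butterfly is that the 2x2 Hadamard is its own inverse up to a factor of 4: applying it twice returns 4x the input. The sketch below restates the two-stage butterfly out-of-place and checks that property. hadamard2x2, hadamard_roundtrip_is_4x and demo_in are hypothetical names for illustration, and the input values are arbitrary.

```c
#include <assert.h>
#include <stdint.h>

/* Arbitrary illustrative input, row-major [c00, c01, c10, c11]. */
static const int16_t demo_in[4] = { 1, 2, 3, 4 };

/* Out-of-place restatement of the two-stage butterfly (hypothetical
 * helper; the reference above works in place). */
static void hadamard2x2(const int16_t in[4], int16_t out[4])
{
    int t0 = in[0] + in[1];   /* row 0 sum  */
    int t1 = in[0] - in[1];   /* row 0 diff */
    int t2 = in[2] + in[3];   /* row 1 sum  */
    int t3 = in[2] - in[3];   /* row 1 diff */
    out[0] = (int16_t)(t0 + t2);
    out[1] = (int16_t)(t1 + t3);
    out[2] = (int16_t)(t0 - t2);
    out[3] = (int16_t)(t1 - t3);
}

/* Returns 1 if applying the transform twice yields 4 * in[].
 * Assumes the inputs are small enough that nothing overflows int16. */
static int hadamard_roundtrip_is_4x(const int16_t in[4])
{
    int16_t a[4], b[4];
    hadamard2x2(in, a);
    hadamard2x2(a, b);
    for (int i = 0; i < 4; i++) {
        assert(in[i] >= -2047 && in[i] <= 2047);
        if (b[i] != 4 * in[i])
            return 0;
    }
    return 1;
}
```

For the input {1, 2, 3, 4} the first pass gives {10, -2, -4, 0} and the second gives {4, 8, 12, 16}.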
@@ -0,0 +1,110 @@
/*
* Standalone bit-exact C reference for H.264 chroma loop filters
* (bS < 4 variant; "intra" / bS=4 variant lives in a separate file
* when added). Covers both orientations:
*
* v_loop_filter_chroma: filter applied VERTICALLY across a
* HORIZONTAL edge. Tile is 8 cols × 4 rows of context
* (rows -2..+1); pix points to row 0 of the bottom block.
* h_loop_filter_chroma: filter applied HORIZONTALLY across a
* VERTICAL edge. Tile is 4 cols × 8 rows of context
* (cols -2..+1); pix points to col 0 of the right block.
*
* Mirrors FFmpeg `ff_h264_v_loop_filter_chroma_neon` (line 412) and
* `ff_h264_h_loop_filter_chroma_neon` (line 430) in
* external/ffmpeg-snapshot/libavcodec/aarch64/h264dsp_neon.S.
*
* Algorithm per H.264 §8.7.2.3 (filtering for bS < 4; chroma inter case):
* - Same edge preconditions as luma: |p0-q0|<α, |p1-p0|<β, |q1-q0|<β.
* - tC = tc0_seg + 1 (chroma's tc has no luma-style ap/aq side bonus).
* - δ = clip3((((q0-p0)<<2) + (p1-q1) + 4) >> 3, -tC, tC).
* - p0' = clip255(p0+δ); q0' = clip255(q0-δ).
* - Chroma NEVER updates p1, p2, q1, q2 (unlike luma).
*
* tc0[4]: 4 segments × 2 cells per segment = 8 cells per edge, which
* matches the 4:2:0 chroma plane geometry in both orientations (8 cols
* for the V filter, 8 rows for the H filter).
*
* Signature (matches FFmpeg + the existing luma refs):
* void(uint8_t *pix, ptrdiff_t stride,
* int alpha, int beta, int8_t tc0[4]);
*
* License: LGPL-2.1-or-later (matches FFmpeg upstream).
*/
#include <stdint.h>
#include <stddef.h>
static inline int clip_u8(int v) { return v < 0 ? 0 : v > 255 ? 255 : v; }
static inline int clip3(int v, int lo, int hi) {
return v < lo ? lo : v > hi ? hi : v;
}
static inline int abs_i(int x) { return x < 0 ? -x : x; }
/* Per-cell chroma filter, vertical-direction access (one column
* across the horizontal edge). p1 is at pix[-2*stride], q1 at
* pix[+1*stride]. */
static void h264_chroma_cell_v(uint8_t *pix, ptrdiff_t stride,
int alpha, int beta, int tc0_s)
{
int p1 = pix[-2*stride], p0 = pix[-1*stride];
int q0 = pix[ 0*stride], q1 = pix[ 1*stride];
if (abs_i(p0 - q0) >= alpha) return;
if (abs_i(p1 - p0) >= beta) return;
if (abs_i(q1 - q0) >= beta) return;
int tc = tc0_s + 1;
int delta = clip3(((q0 - p0) * 4 + (p1 - q1) + 4) >> 3, -tc, tc);
pix[-1*stride] = (uint8_t) clip_u8(p0 + delta);
pix[ 0*stride] = (uint8_t) clip_u8(q0 - delta);
}
/* Same kernel, horizontal-direction access (one row across the
* vertical edge). p1 at pix[-2], q1 at pix[+1]. */
static void h264_chroma_cell_h(uint8_t *pix,
int alpha, int beta, int tc0_s)
{
int p1 = pix[-2], p0 = pix[-1];
int q0 = pix[ 0], q1 = pix[ 1];
if (abs_i(p0 - q0) >= alpha) return;
if (abs_i(p1 - p0) >= beta) return;
if (abs_i(q1 - q0) >= beta) return;
int tc = tc0_s + 1;
int delta = clip3(((q0 - p0) * 4 + (p1 - q1) + 4) >> 3, -tc, tc);
pix[-1] = (uint8_t) clip_u8(p0 + delta);
pix[ 0] = (uint8_t) clip_u8(q0 - delta);
}
void daedalus_h264_v_loop_filter_chroma_ref(
uint8_t *pix, ptrdiff_t stride,
int alpha, int beta, int8_t tc0[4])
{
if (alpha == 0 || beta == 0) return;
if (tc0[0] < 0 && tc0[1] < 0 && tc0[2] < 0 && tc0[3] < 0) return;
/* 8 cols divided into 4 segments of 2 cols each. */
for (int s = 0; s < 4; s++) {
int tc0_s = tc0[s];
if (tc0_s < 0) continue;
for (int c = 0; c < 2; c++) {
int col = s * 2 + c;
h264_chroma_cell_v(pix + col, stride, alpha, beta, tc0_s);
}
}
}
void daedalus_h264_h_loop_filter_chroma_ref(
uint8_t *pix, ptrdiff_t stride,
int alpha, int beta, int8_t tc0[4])
{
if (alpha == 0 || beta == 0) return;
if (tc0[0] < 0 && tc0[1] < 0 && tc0[2] < 0 && tc0[3] < 0) return;
/* 8 rows divided into 4 segments of 2 rows each. */
for (int s = 0; s < 4; s++) {
int tc0_s = tc0[s];
if (tc0_s < 0) continue;
for (int r = 0; r < 2; r++) {
int row = s * 2 + r;
h264_chroma_cell_h(pix + row * stride, alpha, beta, tc0_s);
}
}
}
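The per-cell chroma update is small enough to work through by hand. The sketch below isolates only the delta computation (tC = tc0 + 1, then the clipped delta) so that individual cells can be checked against a port. chroma_delta and clip3_i are hypothetical names, and the edge preconditions (|p0-q0| < alpha and so on) are deliberately left out. As in the reference above, the right shift of a negative value is assumed to be arithmetic, which C leaves implementation-defined.

```c
#include <assert.h>

static int clip3_i(int v, int lo, int hi)
{
    return v < lo ? lo : v > hi ? hi : v;
}

/* Hypothetical helper: clipped delta for one chroma cell on a bS<4
 * edge.  Caller has already checked the alpha/beta preconditions and
 * that the segment's tc0 is >= 0. */
static int chroma_delta(int p1, int p0, int q0, int q1, int tc0)
{
    assert(tc0 >= 0);
    int tc = tc0 + 1;                               /* no ap/aq bonus */
    int raw = ((q0 - p0) * 4 + (p1 - q1) + 4) >> 3; /* unclipped delta */
    return clip3_i(raw, -tc, tc);
}
```

With p1, p0, q0, q1 = 60, 62, 70, 72 the unclipped delta is (32 - 12 + 4) >> 3 = 3. With tc0 = 2 it passes through unchanged, giving p0' = 65 and q0' = 67. With tc0 = 0 it is clipped to 1.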
@@ -0,0 +1,116 @@
/*
* Standalone bit-exact C reference for H.264 luma "horizontal"
* loop filter (h_loop_filter_luma): applies filter HORIZONTALLY
* across a VERTICAL edge. The edge spans the 16-row macroblock
* height, between columns -1 and 0.
*
* Mirrors FFmpeg `ff_h264_h_loop_filter_luma_neon` in
* external/ffmpeg-snapshot/libavcodec/aarch64/h264dsp_neon.S
* line 134. Operates on an 8-col × 16-row region:
* pix[r*stride + c] for r in 0..15, c in -4..+3
* with pix pointing to row 0, col 0 of the right-hand block (the
* first column to the right of the edge).
*
* 16 rows divided into 4 segments of 4 rows; each segment has its
* own tc0 strength (tc0[0..3]).
*
* Note: FFmpeg's "h_loop_filter" naming uses the FILTER DIRECTION
* (horizontal = across the edge from the left), not the edge
* orientation (vertical). H.264 spec calls this the "vertical
* edge" filter.
*
* This is the column-axis transpose of h264_v_loop_filter_luma_ref:
* - v variant: p3..p0 above the edge (pix[-4*stride..-1*stride]),
* q0..q3 below (pix[0..+3*stride]). 16 columns × 4 segments.
* - h variant: p3..p0 left of the edge (pix[-4..-1]),
* q0..q3 right (pix[0..+3]). 16 rows × 4 segments.
* Same per-segment kernel; only the address arithmetic transposes.
*
* Signature:
* void(uint8_t *pix, ptrdiff_t stride,
* int alpha, int beta, int8_t tc0[4]);
*
* License: LGPL-2.1-or-later (matches FFmpeg upstream).
*/
#include <stdint.h>
#include <stddef.h>
static inline int clip_u8(int v) { return v < 0 ? 0 : v > 255 ? 255 : v; }
static inline int clip3(int v, int lo, int hi) {
return v < lo ? lo : v > hi ? hi : v;
}
static inline int abs_i(int x) { return x < 0 ? -x : x; }
/* Apply luma deblock to one ROW at the vertical edge.
* p0..p3 are pixels left of the edge (pix[-1..-4]),
* q0..q3 right (pix[0..+3]).
* tc0_s is the segment's tc0 value (already known >= 0).
*
* Writes back to pix[-2], pix[-1], pix[0], pix[+1]
* (= p1, p0, q0, q1).
*/
static void h264_deblock_luma_row(uint8_t *pix,
int alpha, int beta, int tc0_s)
{
int p3 = pix[-4], p2 = pix[-3], p1 = pix[-2], p0 = pix[-1];
int q0 = pix[ 0], q1 = pix[ 1], q2 = pix[ 2], q3 = pix[ 3];
(void) p3; (void) q3; /* not used in bS<4 path */
/* Edge pre-conditions. */
if (abs_i(p0 - q0) >= alpha) return;
if (abs_i(p1 - p0) >= beta) return;
if (abs_i(q1 - q0) >= beta) return;
/* Side conditions. */
int ap = abs_i(p2 - p0);
int aq = abs_i(q2 - q0);
int ap_lt_beta = (ap < beta);
int aq_lt_beta = (aq < beta);
/* Combined filter strength. */
int tc = tc0_s + ap_lt_beta + aq_lt_beta;
/* p0 / q0 update. */
int delta = clip3(((q0 - p0) * 4 + (p1 - q1) + 4) >> 3, -tc, tc);
int p0p = clip_u8(p0 + delta);
int q0p = clip_u8(q0 - delta);
/* p1 update (only if ap<beta). */
int p1p = p1;
if (ap_lt_beta) {
int delta_p1 = clip3((p2 + ((p0 + q0 + 1) >> 1) - 2*p1) >> 1, -tc0_s, tc0_s);
p1p = p1 + delta_p1;
}
/* q1 update (only if aq<beta). */
int q1p = q1;
if (aq_lt_beta) {
int delta_q1 = clip3((q2 + ((p0 + q0 + 1) >> 1) - 2*q1) >> 1, -tc0_s, tc0_s);
q1p = q1 + delta_q1;
}
pix[-2] = (uint8_t) p1p;
pix[-1] = (uint8_t) p0p;
pix[ 0] = (uint8_t) q0p;
pix[ 1] = (uint8_t) q1p;
}
void daedalus_h264_h_loop_filter_luma_ref(
uint8_t *pix, ptrdiff_t stride,
int alpha, int beta, int8_t tc0[4])
{
/* H.264 deblock "outer" precondition: alpha == 0 OR beta == 0
* skips filtering. Also if ALL tc0[*] == -1, skip
* (h264_loop_filter_start macro check). */
if (alpha == 0 || beta == 0) return;
if (tc0[0] < 0 && tc0[1] < 0 && tc0[2] < 0 && tc0[3] < 0) return;
/* 16 rows divided into 4 segments of 4 rows each. */
for (int s = 0; s < 4; s++) {
int tc0_s = tc0[s];
if (tc0_s < 0) continue; /* bS = 0 segment → skip */
for (int r = 0; r < 4; r++) {
int row = s * 4 + r;
h264_deblock_luma_row(pix + row * stride, alpha, beta, tc0_s);
}
}
}
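What distinguishes the luma bS<4 path from chroma is the combined strength tc = tc0 + (ap < beta) + (aq < beta), and the fact that the p1/q1 updates are clipped to tc0 rather than tc. The sketch below isolates those two pieces for hand-checking. luma_tc and luma_p1_delta are hypothetical names, and the edge preconditions are assumed to have been tested by the caller.

```c
#include <assert.h>
#include <stdlib.h>

static int clip3_i(int v, int lo, int hi)
{
    return v < lo ? lo : v > hi ? hi : v;
}

/* Hypothetical helper: combined strength used to clip the p0/q0
 * delta.  Each side whose inner gradient is below beta adds 1. */
static int luma_tc(int p2, int p0, int q0, int q2, int beta, int tc0)
{
    assert(tc0 >= 0);
    int ap = abs(p2 - p0);
    int aq = abs(q2 - q0);
    return tc0 + (ap < beta) + (aq < beta);
}

/* Hypothetical helper: delta added to p1 when ap < beta.  Note the
 * clip range is +/- tc0, not the combined tc. */
static int luma_p1_delta(int p2, int p1, int p0, int q0, int tc0)
{
    assert(tc0 >= 0);
    int raw = (p2 + ((p0 + q0 + 1) >> 1) - 2 * p1) >> 1;
    return clip3_i(raw, -tc0, tc0);
}
```

For p2, p0, q0, q2 = 60, 62, 70, 90 with beta = 10 and tc0 = 1, only the p side qualifies (ap = 2, aq = 20), so tc = 2.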
@@ -0,0 +1,184 @@
/*
* Standalone bit-exact C reference for H.264 luma + chroma "intra"
* loop filters (bS = 4 variant, used at I-MB edges where the
* boundary strength is forced to 4). Covers all four orientations:
*
* v_loop_filter_luma_intra — 16 cols × 8 rows, edge between
* rows -1 and 0
* h_loop_filter_luma_intra — 8 cols × 16 rows, edge between
* cols -1 and 0
* v_loop_filter_chroma_intra — 8 cols × 4 rows
* h_loop_filter_chroma_intra — 4 cols × 8 rows
*
* Mirrors FFmpeg's `ff_h264_{v,h}_loop_filter_{luma,chroma}_intra_neon`
* in external/ffmpeg-snapshot/libavcodec/aarch64/h264dsp_neon.S.
*
* Algorithm per H.264 §8.7.2.4 (bS=4):
*
* Preconditions (same as bS<4):
* |p0-q0| < α AND |p1-p0| < β AND |q1-q0| < β
*
* Luma — strong/weak filter selector per side:
* strong_p = (|p2-p0| < β) AND (|p0-q0| < (α>>2)+2)
* strong_q = (|q2-q0| < β) AND (|p0-q0| < (α>>2)+2)
*
* If strong_p, update p0/p1/p2:
* p0' = (p2 + 2*p1 + 2*p0 + 2*q0 + q1 + 4) >> 3
* p1' = (p2 + p1 + p0 + q0 + 2) >> 2
* p2' = (2*p3 + 3*p2 + p1 + p0 + q0 + 4) >> 3
* Else weak (single cell):
* p0' = (2*p1 + p0 + q1 + 2) >> 2
* Mirror for q-side.
*
* Chroma, always weak (no strong/weak selector):
* p0' = (2*p1 + p0 + q1 + 2) >> 2
* q0' = (2*q1 + q0 + p1 + 2) >> 2
* Chroma never updates p1/p2/q1/q2.
*
* Signature (no tc0 in the intra path — the daedalus_h264_deblock_meta
* struct's tc0 field is ignored at the dispatch layer):
* void(uint8_t *pix, ptrdiff_t stride, int alpha, int beta);
*
* License: LGPL-2.1-or-later (matches FFmpeg upstream).
*/
#include <stdint.h>
#include <stddef.h>
static inline int clip_u8(int v) { return v < 0 ? 0 : v > 255 ? 255 : v; }
static inline int abs_i(int x) { return x < 0 ? -x : x; }
/* --- luma intra, one column across the horizontal edge --- */
static void h264_luma_intra_cell_v(uint8_t *pix, ptrdiff_t stride,
int alpha, int beta)
{
int p3 = pix[-4*stride], p2 = pix[-3*stride];
int p1 = pix[-2*stride], p0 = pix[-1*stride];
int q0 = pix[ 0*stride], q1 = pix[ 1*stride];
int q2 = pix[ 2*stride], q3 = pix[ 3*stride];
if (abs_i(p0 - q0) >= alpha) return;
if (abs_i(p1 - p0) >= beta) return;
if (abs_i(q1 - q0) >= beta) return;
int strong_common = abs_i(p0 - q0) < ((alpha >> 2) + 2);
int strong_p = strong_common && (abs_i(p2 - p0) < beta);
int strong_q = strong_common && (abs_i(q2 - q0) < beta);
if (strong_p) {
pix[-1*stride] = (uint8_t) clip_u8((p2 + 2*p1 + 2*p0 + 2*q0 + q1 + 4) >> 3);
pix[-2*stride] = (uint8_t) clip_u8((p2 + p1 + p0 + q0 + 2) >> 2);
pix[-3*stride] = (uint8_t) clip_u8((2*p3 + 3*p2 + p1 + p0 + q0 + 4) >> 3);
} else {
pix[-1*stride] = (uint8_t) clip_u8((2*p1 + p0 + q1 + 2) >> 2);
}
if (strong_q) {
pix[ 0*stride] = (uint8_t) clip_u8((q2 + 2*q1 + 2*q0 + 2*p0 + p1 + 4) >> 3);
pix[ 1*stride] = (uint8_t) clip_u8((q2 + q1 + q0 + p0 + 2) >> 2);
pix[ 2*stride] = (uint8_t) clip_u8((2*q3 + 3*q2 + q1 + q0 + p0 + 4) >> 3);
} else {
pix[ 0*stride] = (uint8_t) clip_u8((2*q1 + q0 + p1 + 2) >> 2);
}
}
/* --- luma intra, one row across the vertical edge --- */
static void h264_luma_intra_cell_h(uint8_t *pix, int alpha, int beta)
{
int p3 = pix[-4], p2 = pix[-3], p1 = pix[-2], p0 = pix[-1];
int q0 = pix[ 0], q1 = pix[ 1], q2 = pix[ 2], q3 = pix[ 3];
if (abs_i(p0 - q0) >= alpha) return;
if (abs_i(p1 - p0) >= beta) return;
if (abs_i(q1 - q0) >= beta) return;
int strong_common = abs_i(p0 - q0) < ((alpha >> 2) + 2);
int strong_p = strong_common && (abs_i(p2 - p0) < beta);
int strong_q = strong_common && (abs_i(q2 - q0) < beta);
if (strong_p) {
pix[-1] = (uint8_t) clip_u8((p2 + 2*p1 + 2*p0 + 2*q0 + q1 + 4) >> 3);
pix[-2] = (uint8_t) clip_u8((p2 + p1 + p0 + q0 + 2) >> 2);
pix[-3] = (uint8_t) clip_u8((2*p3 + 3*p2 + p1 + p0 + q0 + 4) >> 3);
} else {
pix[-1] = (uint8_t) clip_u8((2*p1 + p0 + q1 + 2) >> 2);
}
if (strong_q) {
pix[ 0] = (uint8_t) clip_u8((q2 + 2*q1 + 2*q0 + 2*p0 + p1 + 4) >> 3);
pix[ 1] = (uint8_t) clip_u8((q2 + q1 + q0 + p0 + 2) >> 2);
pix[ 2] = (uint8_t) clip_u8((2*q3 + 3*q2 + q1 + q0 + p0 + 4) >> 3);
} else {
pix[ 0] = (uint8_t) clip_u8((2*q1 + q0 + p1 + 2) >> 2);
}
}
/* --- chroma intra, one column across the horizontal edge --- */
static void h264_chroma_intra_cell_v(uint8_t *pix, ptrdiff_t stride,
int alpha, int beta)
{
int p1 = pix[-2*stride], p0 = pix[-1*stride];
int q0 = pix[ 0*stride], q1 = pix[ 1*stride];
if (abs_i(p0 - q0) >= alpha) return;
if (abs_i(p1 - p0) >= beta) return;
if (abs_i(q1 - q0) >= beta) return;
pix[-1*stride] = (uint8_t) clip_u8((2*p1 + p0 + q1 + 2) >> 2);
pix[ 0*stride] = (uint8_t) clip_u8((2*q1 + q0 + p1 + 2) >> 2);
}
/* --- chroma intra, one row across the vertical edge --- */
static void h264_chroma_intra_cell_h(uint8_t *pix, int alpha, int beta)
{
int p1 = pix[-2], p0 = pix[-1];
int q0 = pix[ 0], q1 = pix[ 1];
if (abs_i(p0 - q0) >= alpha) return;
if (abs_i(p1 - p0) >= beta) return;
if (abs_i(q1 - q0) >= beta) return;
pix[-1] = (uint8_t) clip_u8((2*p1 + p0 + q1 + 2) >> 2);
pix[ 0] = (uint8_t) clip_u8((2*q1 + q0 + p1 + 2) >> 2);
}
/* --- public refs --- */
void daedalus_h264_v_loop_filter_luma_intra_ref(
uint8_t *pix, ptrdiff_t stride, int alpha, int beta)
{
/* Note: the FFmpeg .S `h264_loop_filter_start_intra` macro
* returns early if (alpha|beta) == 0. For non-zero alpha or
* non-zero beta it runs the filter; the per-cell preconditions
* (abs(p0-q0)<alpha etc.) then decide whether each column
* actually updates pixels. Match that here. */
if ((alpha | beta) == 0) return;
/* 16 columns; no per-segment tc0 in the intra path (bS=4 is
* uniform across the edge, so there is no tc0_seg < 0 skip). */
for (int c = 0; c < 16; c++)
h264_luma_intra_cell_v(pix + c, stride, alpha, beta);
}
void daedalus_h264_h_loop_filter_luma_intra_ref(
uint8_t *pix, ptrdiff_t stride, int alpha, int beta)
{
if ((alpha | beta) == 0) return;
for (int r = 0; r < 16; r++)
h264_luma_intra_cell_h(pix + r * stride, alpha, beta);
}
void daedalus_h264_v_loop_filter_chroma_intra_ref(
uint8_t *pix, ptrdiff_t stride, int alpha, int beta)
{
if ((alpha | beta) == 0) return;
for (int c = 0; c < 8; c++)
h264_chroma_intra_cell_v(pix + c, stride, alpha, beta);
}
void daedalus_h264_h_loop_filter_chroma_intra_ref(
uint8_t *pix, ptrdiff_t stride, int alpha, int beta)
{
if ((alpha | beta) == 0) return;
for (int r = 0; r < 8; r++)
h264_chroma_intra_cell_h(pix + r * stride, alpha, beta);
}
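The strong/weak selection on the p side can be checked on a single cell. The sketch below restates just that side as a pure function returning the three candidate outputs. luma_intra_p_side and the p_side struct are hypothetical names used only for illustration. The final clip to 0..255 is omitted because every formula is a rounded weighted average of 8-bit inputs and cannot leave that range.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical result type: filtered p0, p1, p2 for one cell. */
typedef struct { int p0, p1, p2; } p_side;

/* Hypothetical helper: p-side of the luma bS=4 filter for one cell.
 * Inputs are assumed to be 8-bit sample values. */
static p_side luma_intra_p_side(int p3, int p2, int p1, int p0,
                                int q0, int q1, int alpha, int beta)
{
    p_side out = { p0, p1, p2 };          /* default: unfiltered */
    int d = abs(p0 - q0);
    assert(alpha >= 0 && beta >= 0);

    /* Edge preconditions, same as the bS<4 path. */
    if (d >= alpha) return out;
    if (abs(p1 - p0) >= beta) return out;
    if (abs(q1 - q0) >= beta) return out;

    if (d < (alpha >> 2) + 2 && abs(p2 - p0) < beta) {
        /* Strong filter: three samples change. */
        out.p0 = (p2 + 2*p1 + 2*p0 + 2*q0 + q1 + 4) >> 3;
        out.p1 = (p2 + p1 + p0 + q0 + 2) >> 2;
        out.p2 = (2*p3 + 3*p2 + p1 + p0 + q0 + 4) >> 3;
    } else {
        /* Weak filter: only p0 changes. */
        out.p0 = (2*p1 + p0 + q1 + 2) >> 2;
    }
    return out;
}
```

For a flat step of 60 | 64 with beta = 6, alpha = 20 selects the strong filter (4 < (20>>2)+2) and yields p0, p1, p2 = 62, 61, 61. alpha = 8 fails the strong test and only moves p0 to 61. alpha = 4 fails the edge precondition and leaves the cell untouched.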
@@ -0,0 +1,79 @@
/*
* Standalone bit-exact C references for the avg_ qpel anchors —
* the biprediction "average against existing dst" form of mc20,
* mc02, mc22. Used in B-slices where two qpel-interpolated samples
* (one from list0, one from list1) are averaged per H.264 §8.4.2.3.
*
* Each kernel computes the same half-pel formula as the put_ form,
* then averages with dst[r,c] via L2 ((dst + put_val + 1) >> 1).
* The dst buffer carries the list0 prediction on entry; the avg_
* call adds the list1 contribution.
*
* Mirrors FFmpeg's `ff_avg_h264_qpel8_{mc20,mc02,mc22}_neon` in
* external/ffmpeg-snapshot/libavcodec/aarch64/h264qpel_neon.S
* (same `\type=avg` expansion as the put_ functions).
*
* License: LGPL-2.1-or-later.
*/
#include <stdint.h>
#include <stddef.h>
static inline int clip_u8(int v) { return v < 0 ? 0 : v > 255 ? 255 : v; }
static inline uint8_t avg2(uint8_t a, uint8_t b) { return (uint8_t)((a + b + 1) >> 1); }
/* Same per-cell helpers as the diag/quarter-axis refs. Duplicated
* here (rather than extern'd) so this TU compiles standalone. */
static inline uint8_t hpel_h(const uint8_t *s, int r, int c, ptrdiff_t stride)
{
int v = (int) s[r*stride + c-2] - 5 * (int) s[r*stride + c-1]
+ 20 * (int) s[r*stride + c] + 20 * (int) s[r*stride + c+1]
- 5 * (int) s[r*stride + c+2] + (int) s[r*stride + c+3]
+ 16;
return (uint8_t) clip_u8(v >> 5);
}
static inline uint8_t hpel_v(const uint8_t *s, int r, int c, ptrdiff_t stride)
{
int v = (int) s[(r-2)*stride + c] - 5 * (int) s[(r-1)*stride + c]
+ 20 * (int) s[r*stride + c] + 20 * (int) s[(r+1)*stride + c]
- 5 * (int) s[(r+2)*stride + c] + (int) s[(r+3)*stride + c]
+ 16;
return (uint8_t) clip_u8(v >> 5);
}
void daedalus_avg_h264_qpel8_mc20_ref(uint8_t *dst, const uint8_t *src, ptrdiff_t stride)
{
for (int r = 0; r < 8; r++)
for (int c = 0; c < 8; c++)
dst[r*stride + c] = avg2(dst[r*stride + c], hpel_h(src, r, c, stride));
}
void daedalus_avg_h264_qpel8_mc02_ref(uint8_t *dst, const uint8_t *src, ptrdiff_t stride)
{
for (int r = 0; r < 8; r++)
for (int c = 0; c < 8; c++)
dst[r*stride + c] = avg2(dst[r*stride + c], hpel_v(src, r, c, stride));
}
void daedalus_avg_h264_qpel8_mc22_ref(uint8_t *dst, const uint8_t *src, ptrdiff_t stride)
{
/* Per-cell mc22: same 13-row int16 tmp[] computation as the
* put_ reference, then L2 with dst. */
int16_t tmp[13][8];
for (int rr = 0; rr < 13; rr++) {
int src_row = rr - 2;
const uint8_t *s = src + src_row * stride;
for (int c = 0; c < 8; c++) {
int v = (int) s[c-2] - 5 * (int) s[c-1]
+ 20 * (int) s[c] + 20 * (int) s[c+1]
- 5 * (int) s[c+2] + (int) s[c+3];
tmp[rr][c] = (int16_t) v;
}
}
for (int r = 0; r < 8; r++)
for (int c = 0; c < 8; c++) {
int v = tmp[r+0][c] - 5*tmp[r+1][c] + 20*tmp[r+2][c]
+ 20*tmp[r+3][c] - 5*tmp[r+4][c] + tmp[r+5][c] + 512;
uint8_t p = (uint8_t) clip_u8(v >> 10);
dst[r*stride + c] = avg2(dst[r*stride + c], p);
}
}
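The relation "avg_ = put_ result, then rounded average with dst" described in the header comment can be sketched as a generic adapter. This is only an illustration under stated assumptions: `avg_from_put`, `put_copy8` and the caller-supplied `scratch` buffer are hypothetical names, and the refs above inline the averaging instead of using such an adapter.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef void (*put_fn)(uint8_t *dst, const uint8_t *src, ptrdiff_t stride);

/* Stand-in put_ kernel for the demo: integer-position copy of an 8x8 block. */
static void put_copy8(uint8_t *dst, const uint8_t *src, ptrdiff_t stride)
{
    for (int r = 0; r < 8; r++)
        memcpy(dst + r * stride, src + r * stride, 8);
}

/* Hypothetical adapter: run a put_ kernel into scratch, then fold the
 * result into dst with the rounded average. scratch must hold at least
 * 7 * stride + 8 bytes, because the kernels share one stride for src
 * and dst. Assumes a positive stride. */
static void avg_from_put(put_fn put, uint8_t *dst, const uint8_t *src,
                         uint8_t *scratch, ptrdiff_t stride)
{
    put(scratch, src, stride);
    for (int r = 0; r < 8; r++)
        for (int c = 0; c < 8; c++) {
            uint8_t *d = &dst[r * stride + c];
            *d = (uint8_t) ((*d + scratch[r * stride + c] + 1) >> 1);
        }
}
```

With dst preloaded to 50 (the list0 prediction) and a source of 100 (the list1 prediction), every output pixel becomes (50 + 100 + 1) >> 1 = 75.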
+97
@@ -0,0 +1,97 @@
/*
* Standalone bit-exact C references for the 12 remaining avg_
* biprediction qpel positions (B-slice list0 + list1 averaging):
* 4 quarter-axis: avg_mc{10,30,01,03}
* 8 diagonals : avg_mc{11,12,13,21,23,31,32,33}
*
* Each is the put_ formula (per H.264 §8.4.2.2.1 / Table 8-4) with
* a final L2 average against the existing dst contents per §8.4.2.3.1.
* Caller pre-loads dst with the list0 prediction; the avg_ call
* folds in list1.
*
* Mirror FFmpeg's `ff_avg_h264_qpel8_mc{XY}_neon` (in
* external/ffmpeg-snapshot/libavcodec/aarch64/h264qpel_neon.S
* — same `\type=avg` expansion as the put_ functions).
*
* License: LGPL-2.1-or-later.
*/
#include <stdint.h>
#include <stddef.h>
static inline int clip_u8(int v) { return v < 0 ? 0 : v > 255 ? 255 : v; }
static inline uint8_t avg2(uint8_t a, uint8_t b) { return (uint8_t)((a + b + 1) >> 1); }
static inline uint8_t hpel_h(const uint8_t *s, int r, int c, ptrdiff_t stride)
{
int v = (int) s[r*stride + c-2] - 5 * (int) s[r*stride + c-1]
+ 20 * (int) s[r*stride + c] + 20 * (int) s[r*stride + c+1]
- 5 * (int) s[r*stride + c+2] + (int) s[r*stride + c+3]
+ 16;
return (uint8_t) clip_u8(v >> 5);
}
static inline uint8_t hpel_v(const uint8_t *s, int r, int c, ptrdiff_t stride)
{
int v = (int) s[(r-2)*stride + c] - 5 * (int) s[(r-1)*stride + c]
+ 20 * (int) s[r*stride + c] + 20 * (int) s[(r+1)*stride + c]
- 5 * (int) s[(r+2)*stride + c] + (int) s[(r+3)*stride + c]
+ 16;
return (uint8_t) clip_u8(v >> 5);
}
static inline uint8_t hpel_hv(const uint8_t *s, int r, int c, ptrdiff_t stride)
{
int t[6];
for (int i = 0; i < 6; i++) {
int rr = r - 2 + i;
t[i] = (int) s[rr*stride + c-2] - 5 * (int) s[rr*stride + c-1]
+ 20 * (int) s[rr*stride + c] + 20 * (int) s[rr*stride + c+1]
- 5 * (int) s[rr*stride + c+2] + (int) s[rr*stride + c+3];
}
int v = t[0] - 5*t[1] + 20*t[2] + 20*t[3] - 5*t[4] + t[5] + 512;
return (uint8_t) clip_u8(v >> 10);
}
/* Quarter-axis variants: half-pel + L2 with integer source, then
* L2 again with dst. */
#define DEFINE_AVG_QUARTER(NAME, A_EXPR, INT_EXPR) \
void daedalus_avg_h264_qpel8_ ## NAME ## _ref(uint8_t *dst, \
const uint8_t *src, ptrdiff_t stride) \
{ \
for (int r = 0; r < 8; r++) \
for (int c = 0; c < 8; c++) { \
uint8_t a = (A_EXPR); \
uint8_t p = (uint8_t)((a + (INT_EXPR) + 1) >> 1); \
dst[r*stride + c] = avg2(dst[r*stride + c], p); \
} \
}
DEFINE_AVG_QUARTER(mc10, hpel_h(src, r, c, stride), src[r*stride + c ])
DEFINE_AVG_QUARTER(mc30, hpel_h(src, r, c, stride), src[r*stride + c + 1])
DEFINE_AVG_QUARTER(mc01, hpel_v(src, r, c, stride), src[(r )*stride + c])
DEFINE_AVG_QUARTER(mc03, hpel_v(src, r, c, stride), src[(r + 1)*stride + c])
#undef DEFINE_AVG_QUARTER
/* Diagonal variants: avg of two half-pels, then L2 with dst. */
#define DEFINE_AVG_DIAG(NAME, A_EXPR, B_EXPR) \
void daedalus_avg_h264_qpel8_ ## NAME ## _ref(uint8_t *dst, \
const uint8_t *src, ptrdiff_t stride) \
{ \
for (int r = 0; r < 8; r++) \
for (int c = 0; c < 8; c++) { \
uint8_t a = (A_EXPR); \
uint8_t b = (B_EXPR); \
uint8_t p = avg2(a, b); \
dst[r*stride + c] = avg2(dst[r*stride + c], p); \
} \
}
DEFINE_AVG_DIAG(mc11, hpel_h(src, r, c, stride), hpel_v(src, r, c, stride))
DEFINE_AVG_DIAG(mc12, hpel_hv(src, r, c, stride), hpel_v(src, r, c, stride))
DEFINE_AVG_DIAG(mc13, hpel_h(src, r+1, c, stride), hpel_v(src, r, c, stride))
DEFINE_AVG_DIAG(mc21, hpel_hv(src, r, c, stride), hpel_h(src, r, c, stride))
DEFINE_AVG_DIAG(mc23, hpel_hv(src, r, c, stride), hpel_h(src, r+1, c, stride))
DEFINE_AVG_DIAG(mc31, hpel_h(src, r, c, stride), hpel_v(src, r, c+1, stride))
DEFINE_AVG_DIAG(mc32, hpel_hv(src, r, c, stride), hpel_v(src, r, c+1, stride))
DEFINE_AVG_DIAG(mc33, hpel_h(src, r+1, c, stride), hpel_v(src, r, c+1, stride))
#undef DEFINE_AVG_DIAG
+98
@@ -0,0 +1,98 @@
/*
* Standalone bit-exact C references for the 8 diagonal H.264 luma
* qpel positions (mc11, mc12, mc13, mc21, mc23, mc31, mc32, mc33).
* Each is the rounded average of two half-pel intermediates per
* H.264 §8.4.2.2.1 / Table 8-4, decomposed to match the FFmpeg .S
* reference structure (see comments in mc{11,12,21,...}_neon in
* external/ffmpeg-snapshot/libavcodec/aarch64/h264qpel_neon.S).
*
* Position decompositions (verified against the .S). Letters follow
* the spec's fractional-sample naming; fractions are the horizontal
* offset, then the vertical offset:
* mc11 (e, ¼¼): avg(mc20[r,c], mc02[r,c])
* mc12 (i, ¼½): avg(mc22[r,c], mc02[r,c])
* mc13 (p, ¼¾): avg(mc20[r+1,c], mc02[r,c])
* mc21 (f, ½¼): avg(mc22[r,c], mc20[r,c])
* mc23 (q, ½¾): avg(mc22[r,c], mc20[r+1,c])
* mc31 (g, ¾¼): avg(mc20[r,c], mc02[r,c+1])
* mc32 (k, ¾½): avg(mc22[r,c], mc02[r,c+1])
* mc33 (r, ¾¾): avg(mc20[r+1,c], mc02[r,c+1])
*
* (The mc20[r,c] notation means "the mc20-style horizontal half-pel
* result at source-relative integer position (r, c)"; analogously
* for mc02 and mc22.)
*
* Single-stride convention; same edge-context contract as the simpler
* variants (the cells "[r+1,c]" etc. demand one extra row/col of
* source context beyond what mc20/mc02 alone would need).
*
* License: LGPL-2.1-or-later.
*/
#include <stdint.h>
#include <stddef.h>
static inline int clip_u8(int v) { return v < 0 ? 0 : v > 255 ? 255 : v; }
/* Single-cell helpers — same arithmetic as the dedicated mc20/mc02
* refs but computed point-by-point so the diagonal refs can mix them
* cheaply. Each returns a u8 (already clipped). */
static inline uint8_t hpel_h(const uint8_t *s, int r, int c, ptrdiff_t stride)
{
int v = (int) s[r*stride + c-2] - 5 * (int) s[r*stride + c-1]
+ 20 * (int) s[r*stride + c] + 20 * (int) s[r*stride + c+1]
- 5 * (int) s[r*stride + c+2] + (int) s[r*stride + c+3]
+ 16;
return (uint8_t) clip_u8(v >> 5);
}
static inline uint8_t hpel_v(const uint8_t *s, int r, int c, ptrdiff_t stride)
{
int v = (int) s[(r-2)*stride + c] - 5 * (int) s[(r-1)*stride + c]
+ 20 * (int) s[r*stride + c] + 20 * (int) s[(r+1)*stride + c]
- 5 * (int) s[(r+2)*stride + c] + (int) s[(r+3)*stride + c]
+ 16;
return (uint8_t) clip_u8(v >> 5);
}
/* hpel_hv — 2D half-pel at (r, c) per the H.264 §8.4.2.2.1 "j"
* cascade. Computes the 6 vertical intermediates needed for the
* column at offsets -2..+3 around (r, c), each as a 16-bit signed
* h-lowpass over the 6 source samples in the same row. Then v-lowpass
* over those 6 intermediates with the +512 >> 10 final scale. Same
* as the mc22 ref, just expressed point-by-point. */
static inline uint8_t hpel_hv(const uint8_t *s, int r, int c, ptrdiff_t stride)
{
int t[6]; /* tmp at rows r-2..r+3 of the same col c */
for (int i = 0; i < 6; i++) {
int rr = r - 2 + i;
t[i] = (int) s[rr*stride + c-2] - 5 * (int) s[rr*stride + c-1]
+ 20 * (int) s[rr*stride + c] + 20 * (int) s[rr*stride + c+1]
- 5 * (int) s[rr*stride + c+2] + (int) s[rr*stride + c+3];
}
int v = t[0] - 5 * t[1] + 20 * t[2] + 20 * t[3] - 5 * t[4] + t[5] + 512;
return (uint8_t) clip_u8(v >> 10);
}
/* Rounded average ((a + b + 1) >> 1). Both inputs are already u8, so
* the result always fits in u8 and no further clip is needed. */
static inline uint8_t avg2(uint8_t a, uint8_t b) { return (uint8_t)((a + b + 1) >> 1); }
#define DEFINE_DIAG_REF(NAME, A_EXPR, B_EXPR) \
void daedalus_put_h264_qpel8_ ## NAME ## _ref(uint8_t *dst, \
const uint8_t *src, ptrdiff_t stride) \
{ \
for (int r = 0; r < 8; r++) \
for (int c = 0; c < 8; c++) { \
uint8_t a = (A_EXPR); \
uint8_t b = (B_EXPR); \
dst[r*stride + c] = avg2(a, b); \
} \
}
DEFINE_DIAG_REF(mc11, hpel_h(src, r, c, stride), hpel_v(src, r, c, stride))
DEFINE_DIAG_REF(mc12, hpel_hv(src, r, c, stride), hpel_v(src, r, c, stride))
DEFINE_DIAG_REF(mc13, hpel_h(src, r+1, c, stride), hpel_v(src, r, c, stride))
DEFINE_DIAG_REF(mc21, hpel_hv(src, r, c, stride), hpel_h(src, r, c, stride))
DEFINE_DIAG_REF(mc23, hpel_hv(src, r, c, stride), hpel_h(src, r+1, c, stride))
DEFINE_DIAG_REF(mc31, hpel_h(src, r, c, stride), hpel_v(src, r, c+1, stride))
DEFINE_DIAG_REF(mc32, hpel_hv(src, r, c, stride), hpel_v(src, r, c+1, stride))
DEFINE_DIAG_REF(mc33, hpel_h(src, r+1, c, stride), hpel_v(src, r, c+1, stride))
#undef DEFINE_DIAG_REF
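The eight macro instantiations above follow a regular pattern, which the sketch below expresses as a lookup rule. It is a reading of the decomposition table in the header comment, not code from the repository: `hp_kind`, `hp_term` and `diag_terms` are hypothetical names introduced for illustration.

```c
typedef enum { HP_H, HP_V, HP_HV } hp_kind;

typedef struct {
    hp_kind kind;  /* which half-pel filter supplies the term     */
    int dr, dc;    /* row/col offset of the cell it is centred on */
} hp_term;

/* Hypothetical helper: for quarter offsets x (horizontal) and y
 * (vertical), both in 1..3 and not both 2, fill in the two terms whose
 * rounded average gives mcXY. Returns 0 on success, -1 otherwise. */
static int diag_terms(int x, int y, hp_term *a, hp_term *b)
{
    if (x < 1 || x > 3 || y < 1 || y > 3 || (x == 2 && y == 2))
        return -1;
    hp_term h  = { HP_H,  y == 3, 0 };  /* row below when y is 3/4   */
    hp_term v  = { HP_V,  0, x == 3 };  /* next column when x is 3/4 */
    hp_term hv = { HP_HV, 0, 0 };
    if (x == 2)      { *a = hv; *b = h; }
    else if (y == 2) { *a = hv; *b = v; }
    else             { *a = h;  *b = v; }
    return 0;
}
```

For example, `diag_terms(1, 3, &a, &b)` yields the horizontal half-pel at (r+1, c) and the vertical half-pel at (r, c), which matches the mc13 line of the table.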
+45
@@ -0,0 +1,45 @@
/*
* Standalone bit-exact C reference for H.264 luma qpel 8×8 mc02
* (vertical half-pel, "put" variant). Mirror of mc20 with rows
* and columns transposed. 6-tap filter applied vertically:
*
* dst[r,c] = clip255( (s[r-2,c] - 5*s[r-1,c] + 20*s[r,c]
* + 20*s[r+1,c] - 5*s[r+2,c] + s[r+3,c]
* + 16) >> 5 )
*
* Mirrors FFmpeg `ff_put_h264_qpel8_mc02_neon` (in
* external/ffmpeg-snapshot/libavcodec/aarch64/h264qpel_neon.S
* line 678, which tail-calls put_h264_qpel8_v_lowpass_neon).
*
* Signature:
* void(uint8_t *dst, const uint8_t *src, ptrdiff_t stride);
*
* Both dst and src use the SAME stride. src points at row 0 col 0
* of the output block; the filter reads rows -2..+3 (2 rows of top
* context, 3 rows of bottom context). Caller must guarantee the
* source buffer has those rows available (FFmpeg's edge-emulated
* buffer handles this at the frame boundary; matches the contract
* documented for mc20).
*
* License: LGPL-2.1-or-later.
*/
#include <stdint.h>
#include <stddef.h>
static inline int clip_u8(int v) { return v < 0 ? 0 : v > 255 ? 255 : v; }
void daedalus_put_h264_qpel8_mc02_ref(uint8_t *dst, const uint8_t *src, ptrdiff_t stride)
{
for (int r = 0; r < 8; r++) {
for (int c = 0; c < 8; c++) {
int s_m2 = src[(r - 2) * stride + c];
int s_m1 = src[(r - 1) * stride + c];
int s_0 = src[(r + 0) * stride + c];
int s_p1 = src[(r + 1) * stride + c];
int s_p2 = src[(r + 2) * stride + c];
int s_p3 = src[(r + 3) * stride + c];
int v = s_m2 - 5 * s_m1 + 20 * s_0 + 20 * s_p1 - 5 * s_p2 + s_p3 + 16;
dst[r * stride + c] = (uint8_t) clip_u8(v >> 5);
}
}
}
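A single-pixel version of the 6-tap formula makes the role of the final clip easier to see. The sketch below is a hypothetical helper (`sixtap_u8` is not a function in this repository) that applies the same taps, rounding and clip to six samples along the filter axis. Right-shifting a negative int is implementation-defined in C; the sketch assumes the arithmetic shift that GCC and Clang provide.

```c
#include <stdint.h>

/* Hypothetical single-pixel form of the 6-tap half-pel filter.
 * s[0..5] are the samples at offsets -2..+3 along the filter axis. */
static uint8_t sixtap_u8(const uint8_t s[6])
{
    int v = s[0] - 5 * s[1] + 20 * s[2] + 20 * s[3] - 5 * s[4] + s[5] + 16;
    v >>= 5;  /* assumes arithmetic shift for negative v */
    return (uint8_t) (v < 0 ? 0 : v > 255 ? 255 : v);
}
```

A flat input of 100 stays at 100, since (32 * 100 + 16) >> 5 = 100. The samples {0, 0, 255, 255, 0, 0} overshoot to 319 before the clip and come out as 255, and {255, 255, 0, 0, 255, 255} undershoot to a negative value and come out as 0.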
+70
@@ -0,0 +1,70 @@
/*
* Standalone bit-exact C reference for H.264 luma qpel 8x8 mc22
* (2D half-pel, "put" variant). Cascade of horizontal 6-tap then
* vertical 6-tap with INTERMEDIATE 16-bit precision (no per-stage
* clip/round), final +512 >> 10 to scale back.
*
* Per H.264 §8.4.2.2.1, "j" position:
*
* tmp[r,c] = s[r,c-2] - 5*s[r,c-1] + 20*s[r,c] + 20*s[r,c+1]
* - 5*s[r,c+2] + s[r,c+3] (16-bit signed)
*
* dst[r,c] = clip255((tmp[r-2,c] - 5*tmp[r-1,c] + 20*tmp[r,c]
* + 20*tmp[r+1,c] - 5*tmp[r+2,c] + tmp[r+3,c]
* + 512) >> 10)
*
* The tmp[] array spans rows r-2 .. r+3 around each output row, so
* we need 13 intermediate rows (rows -2..+10 of the SOURCE
* neighbourhood) for 8 output rows. Caller's src must have 2 rows
* of top context + 3 rows of bottom context AND 2 cols of left +
* 3 cols of right context (FFmpeg's edge-emulated buffer provides
* this at the frame boundary; same contract as mc20).
*
* Mirrors FFmpeg `ff_put_h264_qpel8_mc22_neon` (in
* external/ffmpeg-snapshot/libavcodec/aarch64/h264qpel_neon.S
* line 710, which tail-calls put_h264_qpel8_hv_lowpass_neon).
*
* Signature:
* void(uint8_t *dst, const uint8_t *src, ptrdiff_t stride);
*
* Same single-stride convention as mc20/mc02.
*
* License: LGPL-2.1-or-later.
*/
#include <stdint.h>
#include <stddef.h>
static inline int clip_u8(int v) { return v < 0 ? 0 : v > 255 ? 255 : v; }
void daedalus_put_h264_qpel8_mc22_ref(uint8_t *dst, const uint8_t *src, ptrdiff_t stride)
{
/* 13 intermediate rows × 8 cols. The 8 output rows dst[0..7][0..7]
* need h-lowpass rows -2..+10 of the source. tmp[] is 0-based, so
* tmp[0..12] corresponds to source rows [-2..+10]. */
int16_t tmp[13][8];
for (int rr = 0; rr < 13; rr++) {
int src_row = rr - 2; /* maps tmp[0..12] → src rows [-2..+10] */
const uint8_t *s = src + src_row * stride;
for (int c = 0; c < 8; c++) {
int v = (int) s[c - 2] - 5 * (int) s[c - 1]
+ 20 * (int) s[c] + 20 * (int) s[c + 1]
- 5 * (int) s[c + 2] + (int) s[c + 3];
tmp[rr][c] = (int16_t) v;
}
}
for (int r = 0; r < 8; r++) {
/* Source rows r-2..r+3 of the filter window → tmp[r..r+5]. */
for (int c = 0; c < 8; c++) {
int v = tmp[r + 0][c] /* "r-2" + shift 2 */
- 5 * tmp[r + 1][c] /* "r-1" */
+ 20 * tmp[r + 2][c] /* "r+0" */
+ 20 * tmp[r + 3][c] /* "r+1" */
- 5 * tmp[r + 4][c] /* "r+2" */
+ tmp[r + 5][c] /* "r+3" */
+ 512;
dst[r * stride + c] = (uint8_t) clip_u8(v >> 10);
}
}
}
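The choice of an int16_t intermediate and an int accumulator can be checked with a short range computation. The sketch below is not part of the repository: `sixtap_range` is a hypothetical helper that propagates a worst-case input interval through the taps (1, -5, 20, 20, -5, 1) without the rounding term.

```c
/* Hypothetical helper: given inputs in [in_lo, in_hi], compute the
 * smallest and largest value the unrounded 6-tap sum can take. */
static void sixtap_range(int in_lo, int in_hi, int *out_lo, int *out_hi)
{
    static const int taps[6] = { 1, -5, 20, 20, -5, 1 };
    int lo = 0, hi = 0;
    for (int i = 0; i < 6; i++) {
        if (taps[i] > 0) {
            lo += taps[i] * in_lo;   /* positive tap: min input gives min */
            hi += taps[i] * in_hi;
        } else {
            lo += taps[i] * in_hi;   /* negative tap: max input gives min */
            hi += taps[i] * in_lo;
        }
    }
    *out_lo = lo;
    *out_hi = hi;
}
```

For u8 input the first stage spans [-2550, 10710], which fits in int16_t. Feeding that interval through the second stage gives [-214200, 475320], which no longer fits in 16 bits, so the vertical sum has to be accumulated in a wider type before the +512 >> 10 scaling.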
+82
@@ -0,0 +1,82 @@
/*
* Standalone bit-exact C references for the four single-axis quarter-
* pel luma qpel positions (H.264 §8.4.2.2.1, "put" variants). Each
* is a half-pel lowpass clipped to u8 followed by an L2 rounded-average
* with an integer-position source pixel.
*
* mc10 ("a" pos, ¼ horiz): a = clip255(mc20(s)); dst = (a + s[r,c] + 1) >> 1
* mc30 ("c" pos, ¾ horiz): a = clip255(mc20(s)); dst = (a + s[r,c+1] + 1) >> 1
* mc01 ("d" pos, ¼ vert ): a = clip255(mc02(s)); dst = (a + s[r, c] + 1) >> 1
* mc03 ("n" pos, ¾ vert ): a = clip255(mc02(s)); dst = (a + s[r+1,c] + 1) >> 1
*
* Mirror FFmpeg's `ff_put_h264_qpel8_mc{10,30,01,03}_neon` (in
* external/ffmpeg-snapshot/libavcodec/aarch64/h264qpel_neon.S
* lines 587, 603, 611, 729 — each tail-calls the corresponding
* lowpass_l2 helper).
*
* Same single-stride convention as mc20/mc02: dst and src share the
* same stride; src points at row 0 col 0 of the output
* block, with appropriate edge context already in-buffer.
*
* License: LGPL-2.1-or-later.
*/
#include <stdint.h>
#include <stddef.h>
static inline int clip_u8(int v) { return v < 0 ? 0 : v > 255 ? 255 : v; }
/* Compute one horizontal half-pel pixel at (r, c) — same as mc20. */
static inline uint8_t hpel_h(const uint8_t *s, int r, int c, ptrdiff_t stride)
{
int v = (int) s[r*stride + c-2] - 5 * (int) s[r*stride + c-1]
+ 20 * (int) s[r*stride + c] + 20 * (int) s[r*stride + c+1]
- 5 * (int) s[r*stride + c+2] + (int) s[r*stride + c+3]
+ 16;
return (uint8_t) clip_u8(v >> 5);
}
/* Compute one vertical half-pel pixel at (r, c) — same as mc02. */
static inline uint8_t hpel_v(const uint8_t *s, int r, int c, ptrdiff_t stride)
{
int v = (int) s[(r-2)*stride + c] - 5 * (int) s[(r-1)*stride + c]
+ 20 * (int) s[r*stride + c] + 20 * (int) s[(r+1)*stride + c]
- 5 * (int) s[(r+2)*stride + c] + (int) s[(r+3)*stride + c]
+ 16;
return (uint8_t) clip_u8(v >> 5);
}
void daedalus_put_h264_qpel8_mc10_ref(uint8_t *dst, const uint8_t *src, ptrdiff_t stride)
{
for (int r = 0; r < 8; r++)
for (int c = 0; c < 8; c++) {
uint8_t a = hpel_h(src, r, c, stride);
dst[r*stride + c] = (uint8_t) ((a + src[r*stride + c ] + 1) >> 1);
}
}
void daedalus_put_h264_qpel8_mc30_ref(uint8_t *dst, const uint8_t *src, ptrdiff_t stride)
{
for (int r = 0; r < 8; r++)
for (int c = 0; c < 8; c++) {
uint8_t a = hpel_h(src, r, c, stride);
dst[r*stride + c] = (uint8_t) ((a + src[r*stride + c + 1] + 1) >> 1);
}
}
void daedalus_put_h264_qpel8_mc01_ref(uint8_t *dst, const uint8_t *src, ptrdiff_t stride)
{
for (int r = 0; r < 8; r++)
for (int c = 0; c < 8; c++) {
uint8_t a = hpel_v(src, r, c, stride);
dst[r*stride + c] = (uint8_t) ((a + src[(r )*stride + c] + 1) >> 1);
}
}
void daedalus_put_h264_qpel8_mc03_ref(uint8_t *dst, const uint8_t *src, ptrdiff_t stride)
{
for (int r = 0; r < 8; r++)
for (int c = 0; c < 8; c++) {
uint8_t a = hpel_v(src, r, c, stride);
dst[r*stride + c] = (uint8_t) ((a + src[(r + 1)*stride + c] + 1) >> 1);
}
}
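A worked example on a linear ramp shows how mc10 and mc30 differ only in which integer neighbour enters the final average. The helpers below are hypothetical single-row forms written for this illustration (`hpel_row`, `mc10_pixel`, `mc30_pixel` are not names from the repository); they use the same arithmetic as the refs above.

```c
#include <stdint.h>

/* Horizontal half-pel between row[c] and row[c+1]; reads row[c-2..c+3]. */
static uint8_t hpel_row(const uint8_t *row, int c)
{
    int v = row[c - 2] - 5 * row[c - 1] + 20 * row[c] + 20 * row[c + 1]
          - 5 * row[c + 2] + row[c + 3] + 16;
    v >>= 5;  /* assumes arithmetic shift for negative v */
    return (uint8_t) (v < 0 ? 0 : v > 255 ? 255 : v);
}

/* Quarter position left of the half-pel: average with row[c]. */
static uint8_t mc10_pixel(const uint8_t *row, int c)
{
    return (uint8_t) ((hpel_row(row, c) + row[c] + 1) >> 1);
}

/* Quarter position right of the half-pel: average with row[c+1]. */
static uint8_t mc30_pixel(const uint8_t *row, int c)
{
    return (uint8_t) ((hpel_row(row, c) + row[c + 1] + 1) >> 1);
}
```

On the row {0, 10, 20, 30, 40, 50} with c = 2, the half-pel is (800 + 16) >> 5 = 25, so mc10 gives (25 + 20 + 1) >> 1 = 23 and mc30 gives (25 + 30 + 1) >> 1 = 28.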
+505
@@ -16,8 +16,57 @@
extern void daedalus_h264_idct_add_ref(uint8_t *dst, int16_t *block, ptrdiff_t stride);
extern void daedalus_h264_idct8_add_ref(uint8_t *dst, int16_t *block, ptrdiff_t stride);
extern void daedalus_h264_h_loop_filter_luma_ref(uint8_t *pix, ptrdiff_t stride,
int alpha, int beta, int8_t tc0[4]);
extern void daedalus_h264_v_loop_filter_chroma_ref(uint8_t *pix, ptrdiff_t stride,
int alpha, int beta, int8_t tc0[4]);
extern void daedalus_h264_h_loop_filter_chroma_ref(uint8_t *pix, ptrdiff_t stride,
int alpha, int beta, int8_t tc0[4]);
extern void daedalus_h264_v_loop_filter_luma_intra_ref(uint8_t *pix, ptrdiff_t stride,
int alpha, int beta);
extern void daedalus_h264_h_loop_filter_luma_intra_ref(uint8_t *pix, ptrdiff_t stride,
int alpha, int beta);
extern void daedalus_h264_v_loop_filter_chroma_intra_ref(uint8_t *pix, ptrdiff_t stride,
int alpha, int beta);
extern void daedalus_h264_h_loop_filter_chroma_intra_ref(uint8_t *pix, ptrdiff_t stride,
int alpha, int beta);
extern void daedalus_h264_v_loop_filter_luma_ref(uint8_t *pix, ptrdiff_t stride,
int alpha, int beta, int8_t tc0[4]);
extern void daedalus_put_h264_qpel8_mc02_ref(uint8_t *dst, const uint8_t *src,
ptrdiff_t stride);
extern void daedalus_put_h264_qpel8_mc22_ref(uint8_t *dst, const uint8_t *src,
ptrdiff_t stride);
extern void daedalus_put_h264_qpel8_mc10_ref(uint8_t *dst, const uint8_t *src,
ptrdiff_t stride);
extern void daedalus_put_h264_qpel8_mc30_ref(uint8_t *dst, const uint8_t *src,
ptrdiff_t stride);
extern void daedalus_put_h264_qpel8_mc01_ref(uint8_t *dst, const uint8_t *src,
ptrdiff_t stride);
extern void daedalus_put_h264_qpel8_mc03_ref(uint8_t *dst, const uint8_t *src,
ptrdiff_t stride);
extern void daedalus_put_h264_qpel8_mc11_ref(uint8_t *dst, const uint8_t *src, ptrdiff_t stride);
extern void daedalus_put_h264_qpel8_mc12_ref(uint8_t *dst, const uint8_t *src, ptrdiff_t stride);
extern void daedalus_put_h264_qpel8_mc13_ref(uint8_t *dst, const uint8_t *src, ptrdiff_t stride);
extern void daedalus_put_h264_qpel8_mc21_ref(uint8_t *dst, const uint8_t *src, ptrdiff_t stride);
extern void daedalus_put_h264_qpel8_mc23_ref(uint8_t *dst, const uint8_t *src, ptrdiff_t stride);
extern void daedalus_put_h264_qpel8_mc31_ref(uint8_t *dst, const uint8_t *src, ptrdiff_t stride);
extern void daedalus_put_h264_qpel8_mc32_ref(uint8_t *dst, const uint8_t *src, ptrdiff_t stride);
extern void daedalus_put_h264_qpel8_mc33_ref(uint8_t *dst, const uint8_t *src, ptrdiff_t stride);
extern void daedalus_avg_h264_qpel8_mc20_ref(uint8_t *dst, const uint8_t *src, ptrdiff_t stride);
extern void daedalus_avg_h264_qpel8_mc02_ref(uint8_t *dst, const uint8_t *src, ptrdiff_t stride);
extern void daedalus_avg_h264_qpel8_mc22_ref(uint8_t *dst, const uint8_t *src, ptrdiff_t stride);
extern void daedalus_avg_h264_qpel8_mc10_ref(uint8_t *dst, const uint8_t *src, ptrdiff_t stride);
extern void daedalus_avg_h264_qpel8_mc30_ref(uint8_t *dst, const uint8_t *src, ptrdiff_t stride);
extern void daedalus_avg_h264_qpel8_mc01_ref(uint8_t *dst, const uint8_t *src, ptrdiff_t stride);
extern void daedalus_avg_h264_qpel8_mc03_ref(uint8_t *dst, const uint8_t *src, ptrdiff_t stride);
extern void daedalus_avg_h264_qpel8_mc11_ref(uint8_t *dst, const uint8_t *src, ptrdiff_t stride);
extern void daedalus_avg_h264_qpel8_mc12_ref(uint8_t *dst, const uint8_t *src, ptrdiff_t stride);
extern void daedalus_avg_h264_qpel8_mc13_ref(uint8_t *dst, const uint8_t *src, ptrdiff_t stride);
extern void daedalus_avg_h264_qpel8_mc21_ref(uint8_t *dst, const uint8_t *src, ptrdiff_t stride);
extern void daedalus_avg_h264_qpel8_mc23_ref(uint8_t *dst, const uint8_t *src, ptrdiff_t stride);
extern void daedalus_avg_h264_qpel8_mc31_ref(uint8_t *dst, const uint8_t *src, ptrdiff_t stride);
extern void daedalus_avg_h264_qpel8_mc32_ref(uint8_t *dst, const uint8_t *src, ptrdiff_t stride);
extern void daedalus_avg_h264_qpel8_mc33_ref(uint8_t *dst, const uint8_t *src, ptrdiff_t stride);
extern void daedalus_put_h264_qpel8_mc20_ref(uint8_t *dst, const uint8_t *src,
ptrdiff_t stride);
@@ -145,6 +194,206 @@ static int test_deblock(void)
return diff == 0 ? 0 : 1;
}
static int test_deblock_h(void)
{
/* Mirror of test_deblock but for the H variant. Per-tile layout
* is now 8 cols x 16 rows (one vertical edge between cols 3 and 4
* of the tile); EDGE_COL = 4 puts dst_off at the leftmost output
* column of the right block so the kernel's pix[-4..+3] read sits
* inside the tile. */
enum { N_EDGES = 8, TILE_STRIDE = 8, TILE_ROWS = 16,
TILE_BYTES = TILE_STRIDE * TILE_ROWS,
TOTAL = N_EDGES * TILE_BYTES, EDGE_COL = 4 };
daedalus_ctx *ctx = daedalus_ctx_create();
if (!ctx) return 1;
uint8_t dst[TOTAL], dst_ref[TOTAL];
daedalus_h264_deblock_meta meta[N_EDGES];
for (int i = 0; i < TOTAL; i++) dst[i] = dst_ref[i] = (uint8_t)(xs() & 0xff);
for (int i = 0; i < N_EDGES; i++) {
meta[i].dst_off = i * TILE_BYTES + EDGE_COL;
meta[i].alpha = (int)(xs() % 64) + 1;
meta[i].beta = (int)(xs() % 16) + 1;
for (int s = 0; s < 4; s++) {
int r = (int)(xs() % 8);
meta[i].tc0[s] = (int8_t)(r == 0 ? -1 : (r - 1));
}
}
for (int i = 0; i < N_EDGES; i++) {
int8_t tc0_local[4] = { meta[i].tc0[0], meta[i].tc0[1], meta[i].tc0[2], meta[i].tc0[3] };
daedalus_h264_h_loop_filter_luma_ref(dst_ref + meta[i].dst_off, TILE_STRIDE,
meta[i].alpha, meta[i].beta, tc0_local);
}
int rc = daedalus_recipe_dispatch_h264_deblock_luma_h(ctx, dst, TILE_STRIDE,
N_EDGES, meta);
if (rc) { fprintf(stderr, "deblock_h dispatch rc=%d\n", rc); return 1; }
int diff = 0;
for (int i = 0; i < TOTAL; i++) if (dst[i] != dst_ref[i]) diff++;
printf(" H.264 deblock luma h: %d/%d bytes bit-exact (%.4f%%)\n",
TOTAL - diff, TOTAL, 100.0 * (TOTAL - diff) / TOTAL);
daedalus_ctx_destroy(ctx);
return diff == 0 ? 0 : 1;
}
static int test_deblock_chroma_v(void)
{
/* Chroma V: per-tile 8 cols × 4 rows, edge between rows 1 and 2
* (EDGE_ROW=2 lets the kernel read pix[-2..+1]*stride safely). */
enum { N_EDGES = 8, TILE_STRIDE = 8, TILE_ROWS = 4,
TILE_BYTES = TILE_STRIDE * TILE_ROWS,
TOTAL = N_EDGES * TILE_BYTES, EDGE_ROW = 2,
EDGE_OFF = EDGE_ROW * TILE_STRIDE };
daedalus_ctx *ctx = daedalus_ctx_create();
if (!ctx) return 1;
uint8_t dst[TOTAL], dst_ref[TOTAL];
daedalus_h264_deblock_meta meta[N_EDGES];
for (int i = 0; i < TOTAL; i++) dst[i] = dst_ref[i] = (uint8_t)(xs() & 0xff);
for (int i = 0; i < N_EDGES; i++) {
meta[i].dst_off = i * TILE_BYTES + EDGE_OFF;
meta[i].alpha = (int)(xs() % 64) + 1;
meta[i].beta = (int)(xs() % 16) + 1;
for (int s = 0; s < 4; s++) {
int r = (int)(xs() % 8);
meta[i].tc0[s] = (int8_t)(r == 0 ? -1 : (r - 1));
}
}
for (int i = 0; i < N_EDGES; i++) {
int8_t tc0_local[4] = { meta[i].tc0[0], meta[i].tc0[1], meta[i].tc0[2], meta[i].tc0[3] };
daedalus_h264_v_loop_filter_chroma_ref(dst_ref + meta[i].dst_off, TILE_STRIDE,
meta[i].alpha, meta[i].beta, tc0_local);
}
int rc = daedalus_recipe_dispatch_h264_deblock_chroma_v(ctx, dst, TILE_STRIDE,
N_EDGES, meta);
if (rc) { fprintf(stderr, "deblock_chroma_v dispatch rc=%d\n", rc); return 1; }
int diff = 0;
for (int i = 0; i < TOTAL; i++) if (dst[i] != dst_ref[i]) diff++;
printf(" H.264 deblock chroma v: %d/%d bytes bit-exact (%.4f%%)\n",
TOTAL - diff, TOTAL, 100.0 * (TOTAL - diff) / TOTAL);
daedalus_ctx_destroy(ctx);
return diff == 0 ? 0 : 1;
}
static int test_deblock_chroma_h(void)
{
/* Chroma H: per-tile 4 cols × 8 rows, edge between cols 1 and 2
* (EDGE_COL=2 lets the kernel read pix[-2..+1] safely). */
enum { N_EDGES = 8, TILE_STRIDE = 4, TILE_ROWS = 8,
TILE_BYTES = TILE_STRIDE * TILE_ROWS,
TOTAL = N_EDGES * TILE_BYTES, EDGE_COL = 2 };
daedalus_ctx *ctx = daedalus_ctx_create();
if (!ctx) return 1;
uint8_t dst[TOTAL], dst_ref[TOTAL];
daedalus_h264_deblock_meta meta[N_EDGES];
for (int i = 0; i < TOTAL; i++) dst[i] = dst_ref[i] = (uint8_t)(xs() & 0xff);
for (int i = 0; i < N_EDGES; i++) {
meta[i].dst_off = i * TILE_BYTES + EDGE_COL;
meta[i].alpha = (int)(xs() % 64) + 1;
meta[i].beta = (int)(xs() % 16) + 1;
for (int s = 0; s < 4; s++) {
int r = (int)(xs() % 8);
meta[i].tc0[s] = (int8_t)(r == 0 ? -1 : (r - 1));
}
}
for (int i = 0; i < N_EDGES; i++) {
int8_t tc0_local[4] = { meta[i].tc0[0], meta[i].tc0[1], meta[i].tc0[2], meta[i].tc0[3] };
daedalus_h264_h_loop_filter_chroma_ref(dst_ref + meta[i].dst_off, TILE_STRIDE,
meta[i].alpha, meta[i].beta, tc0_local);
}
int rc = daedalus_recipe_dispatch_h264_deblock_chroma_h(ctx, dst, TILE_STRIDE,
N_EDGES, meta);
if (rc) { fprintf(stderr, "deblock_chroma_h dispatch rc=%d\n", rc); return 1; }
int diff = 0;
for (int i = 0; i < TOTAL; i++) if (dst[i] != dst_ref[i]) diff++;
printf(" H.264 deblock chroma h: %d/%d bytes bit-exact (%.4f%%)\n",
TOTAL - diff, TOTAL, 100.0 * (TOTAL - diff) / TOTAL);
daedalus_ctx_destroy(ctx);
return diff == 0 ? 0 : 1;
}
/* --- bS=4 intra-strength deblock tests ---
* Tile geometry per orientation matches the bS<4 variant; only the
* dispatch + reference function change. alpha/beta are non-trivial
* (the C ref + NEON both early-return when alpha|beta == 0).
*/
typedef struct {
const char *name;
int n_edges, tile_stride, tile_rows, edge_off;
void (*ref)(uint8_t *pix, ptrdiff_t stride, int alpha, int beta);
int (*dispatch)(daedalus_ctx *ctx, uint8_t *dst, size_t dst_stride,
size_t n_edges, const daedalus_h264_deblock_meta *meta);
} intra_test_spec;
static int run_intra_test(const intra_test_spec *t)
{
int total = t->n_edges * t->tile_stride * t->tile_rows;
daedalus_ctx *ctx = daedalus_ctx_create();
if (!ctx) return 1;
uint8_t *dst = malloc((size_t) total);
uint8_t *dst_ref = malloc((size_t) total);
daedalus_h264_deblock_meta *meta = calloc((size_t) t->n_edges, sizeof(*meta));
if (!dst || !dst_ref || !meta) {
free(meta); free(dst_ref); free(dst);
daedalus_ctx_destroy(ctx);
return 1;
}
for (int i = 0; i < total; i++) dst[i] = dst_ref[i] = (uint8_t)(xs() & 0xff);
int tile_bytes = t->tile_stride * t->tile_rows;
for (int i = 0; i < t->n_edges; i++) {
meta[i].dst_off = (uint32_t)(i * tile_bytes + t->edge_off);
meta[i].alpha = (int)(xs() % 64) + 1;
meta[i].beta = (int)(xs() % 16) + 1;
/* tc0[] unused for intra; leave at 0 from calloc. */
}
for (int i = 0; i < t->n_edges; i++) {
t->ref(dst_ref + meta[i].dst_off,
(ptrdiff_t) t->tile_stride,
meta[i].alpha, meta[i].beta);
}
int rc = t->dispatch(ctx, dst, (size_t) t->tile_stride,
(size_t) t->n_edges, meta);
if (rc) {
fprintf(stderr, "%s dispatch rc=%d\n", t->name, rc);
free(meta); free(dst_ref); free(dst);
daedalus_ctx_destroy(ctx);
return 1;
}
int diff = 0;
for (int i = 0; i < total; i++) if (dst[i] != dst_ref[i]) diff++;
printf(" H.264 deblock %s: %d/%d bytes bit-exact (%.4f%%)\n",
t->name, total - diff, total, 100.0 * (total - diff) / total);
free(meta); free(dst_ref); free(dst);
daedalus_ctx_destroy(ctx);
return diff == 0 ? 0 : 1;
}
static int test_deblock_intra_all(void)
{
intra_test_spec specs[] = {
{ "luma v intra", 8, 16, 8, 4 * 16,
daedalus_h264_v_loop_filter_luma_intra_ref,
daedalus_recipe_dispatch_h264_deblock_luma_v_intra },
{ "luma h intra", 8, 8, 16, 4,
daedalus_h264_h_loop_filter_luma_intra_ref,
daedalus_recipe_dispatch_h264_deblock_luma_h_intra },
{ "chroma v intra", 8, 8, 4, 2 * 8,
daedalus_h264_v_loop_filter_chroma_intra_ref,
daedalus_recipe_dispatch_h264_deblock_chroma_v_intra },
{ "chroma h intra", 8, 4, 8, 2,
daedalus_h264_h_loop_filter_chroma_intra_ref,
daedalus_recipe_dispatch_h264_deblock_chroma_h_intra },
};
int fail = 0;
for (size_t i = 0; i < sizeof(specs)/sizeof(specs[0]); i++)
fail |= run_intra_test(&specs[i]);
return fail;
}
static int test_qpel_mc20(void)
{
/* Cycle 9 — one 8x8 block per 16-wide row-tile, 8 tiles. Each tile
@@ -185,6 +434,243 @@ static int test_qpel_mc20(void)
return diff == 0 ? 0 : 1;
}
static int test_qpel_mc02(void)
{
/* mc02: vertical 6-tap. Tile is 16 cols × 16 rows so the kernel
* can read rows [SRC_ROW-2 .. SRC_ROW+7+3] inside the buffer.
* SRC_ROW = 3 leaves rows -2..-1 above the output (rows 1..2 of
* the tile) and rows 8..10 below (rows 11..13). */
enum { N = 8, TILE_STRIDE = 16, TILE_ROWS = 16,
TILE_BYTES = TILE_ROWS * TILE_STRIDE, TOTAL = N * TILE_BYTES,
SRC_ROW = 3 };
daedalus_ctx *ctx = daedalus_ctx_create();
if (!ctx) return 1;
uint8_t src[TOTAL], dst[TOTAL], dst_ref[TOTAL];
daedalus_h264_qpel_meta meta[N];
for (int i = 0; i < TOTAL; i++) src[i] = (uint8_t)(xs() & 0xff);
memset(dst, 0, sizeof(dst));
memset(dst_ref, 0, sizeof(dst_ref));
for (int i = 0; i < N; i++) {
meta[i].src_off = (uint32_t)(i * TILE_BYTES + SRC_ROW * TILE_STRIDE);
meta[i].dst_off = (uint32_t)(i * TILE_BYTES + SRC_ROW * TILE_STRIDE);
}
for (int i = 0; i < N; i++)
daedalus_put_h264_qpel8_mc02_ref(dst_ref + meta[i].dst_off,
src + meta[i].src_off,
TILE_STRIDE);
int rc = daedalus_recipe_dispatch_h264_qpel_mc02(ctx, dst, src,
TILE_STRIDE, N, meta);
if (rc) { fprintf(stderr, "qpel_mc02 dispatch rc=%d\n", rc); return 1; }
int diff = 0;
for (int i = 0; i < TOTAL; i++) if (dst[i] != dst_ref[i]) diff++;
printf(" H.264 qpel mc02: %d/%d bytes bit-exact (%.4f%%)\n",
TOTAL - diff, TOTAL, 100.0 * (TOTAL - diff) / TOTAL);
daedalus_ctx_destroy(ctx);
return diff == 0 ? 0 : 1;
}
static int test_qpel_mc22(void)
{
/* mc22: 2D HV lowpass. Needs 2 cols left + 3 cols right + 2 rows
* top + 3 rows bottom of context per 8x8 output. Tile is 16x16
* with output positioned at (SRC_ROW=3, SRC_COL=3) so the read
* range [SRC_*-2 .. SRC_*+7+3] stays inside the tile. */
enum { N = 8, TILE_STRIDE = 16, TILE_ROWS = 16,
TILE_BYTES = TILE_ROWS * TILE_STRIDE, TOTAL = N * TILE_BYTES,
SRC_ROW = 3, SRC_COL = 3 };
daedalus_ctx *ctx = daedalus_ctx_create();
if (!ctx) return 1;
uint8_t src[TOTAL], dst[TOTAL], dst_ref[TOTAL];
daedalus_h264_qpel_meta meta[N];
for (int i = 0; i < TOTAL; i++) src[i] = (uint8_t)(xs() & 0xff);
memset(dst, 0, sizeof(dst));
memset(dst_ref, 0, sizeof(dst_ref));
for (int i = 0; i < N; i++) {
meta[i].src_off = (uint32_t)(i * TILE_BYTES + SRC_ROW * TILE_STRIDE + SRC_COL);
meta[i].dst_off = (uint32_t)(i * TILE_BYTES + SRC_ROW * TILE_STRIDE + SRC_COL);
}
for (int i = 0; i < N; i++)
daedalus_put_h264_qpel8_mc22_ref(dst_ref + meta[i].dst_off,
src + meta[i].src_off,
TILE_STRIDE);
int rc = daedalus_recipe_dispatch_h264_qpel_mc22(ctx, dst, src,
TILE_STRIDE, N, meta);
if (rc) { fprintf(stderr, "qpel_mc22 dispatch rc=%d\n", rc); return 1; }
int diff = 0;
for (int i = 0; i < TOTAL; i++) if (dst[i] != dst_ref[i]) diff++;
printf(" H.264 qpel mc22: %d/%d bytes bit-exact (%.4f%%)\n",
TOTAL - diff, TOTAL, 100.0 * (TOTAL - diff) / TOTAL);
daedalus_ctx_destroy(ctx);
return diff == 0 ? 0 : 1;
}
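The tile geometry used by the qpel tests (16x16 tile, output block at row 3 and, where needed, column 3) can be sanity-checked with a small bounds helper. This is a hypothetical sketch, not part of the test file: `qpel_window_fits` and its parameters are names introduced here, and it only models the -2..+3 reach of the 6-tap filters around an 8x8 block.

```c
/* Hypothetical check: does an 8x8 qpel block placed at (row, col) of a
 * tile_rows x tile_cols tile keep its whole read window inside the
 * tile? need_h / need_v say whether the horizontal / vertical 6-tap
 * filter is used. */
static int qpel_window_fits(int tile_rows, int tile_cols,
                            int row, int col, int need_h, int need_v)
{
    int top    = row - (need_v ? 2 : 0);
    int bottom = row + 7 + (need_v ? 3 : 0);
    int left   = col - (need_h ? 2 : 0);
    int right  = col + 7 + (need_h ? 3 : 0);
    return top >= 0 && left >= 0 &&
           bottom < tile_rows && right < tile_cols;
}
```

For the mc02 layout, `qpel_window_fits(16, 16, 3, 0, 0, 1)` holds because the window covers tile rows 1..13. For mc22, `qpel_window_fits(16, 16, 3, 3, 1, 1)` holds with rows and columns 1..13, while moving the block to column 6 would push the right edge to column 16 and fail.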
/* Generic harness for the 4 single-axis quarter-pel positions. It
* reuses the mc22 tile geometry, which covers both lowpass windows:
* mc10/mc30 read cols c-2..c+3 per output pixel and mc01/mc03 read
* rows r-2..r+3. The extra integer-position sample (col c+1 or row
* r+1) already lies inside that window. */
typedef void (*qpel_ref_fn)(uint8_t *dst, const uint8_t *src, ptrdiff_t stride);
typedef int (*qpel_dispatch_fn)(daedalus_ctx *ctx, uint8_t *dst,
const uint8_t *src, size_t stride,
size_t n_blocks, const daedalus_h264_qpel_meta *meta);
static int run_quarter_axis_qpel(const char *name,
qpel_ref_fn ref, qpel_dispatch_fn dispatch)
{
enum { N = 8, TILE_STRIDE = 16, TILE_ROWS = 16,
TILE_BYTES = TILE_ROWS * TILE_STRIDE, TOTAL = N * TILE_BYTES,
SRC_ROW = 3, SRC_COL = 3 };
daedalus_ctx *ctx = daedalus_ctx_create();
if (!ctx) return 1;
uint8_t src[TOTAL], dst[TOTAL], dst_ref[TOTAL];
daedalus_h264_qpel_meta meta[N];
for (int i = 0; i < TOTAL; i++) src[i] = (uint8_t)(xs() & 0xff);
memset(dst, 0, sizeof(dst));
memset(dst_ref, 0, sizeof(dst_ref));
for (int i = 0; i < N; i++) {
meta[i].src_off = (uint32_t)(i * TILE_BYTES + SRC_ROW * TILE_STRIDE + SRC_COL);
meta[i].dst_off = (uint32_t)(i * TILE_BYTES + SRC_ROW * TILE_STRIDE + SRC_COL);
}
for (int i = 0; i < N; i++)
ref(dst_ref + meta[i].dst_off, src + meta[i].src_off, TILE_STRIDE);
int rc = dispatch(ctx, dst, src, TILE_STRIDE, N, meta);
if (rc) { fprintf(stderr, "%s dispatch rc=%d\n", name, rc); daedalus_ctx_destroy(ctx); return 1; }
int diff = 0;
for (int i = 0; i < TOTAL; i++) if (dst[i] != dst_ref[i]) diff++;
printf(" H.264 qpel %s: %d/%d bytes bit-exact (%.4f%%)\n",
name, TOTAL - diff, TOTAL, 100.0 * (TOTAL - diff) / TOTAL);
daedalus_ctx_destroy(ctx);
return diff == 0 ? 0 : 1;
}
static int test_qpel_quarter_axis_all(void)
{
int fail = 0;
fail |= run_quarter_axis_qpel("mc10", daedalus_put_h264_qpel8_mc10_ref,
daedalus_recipe_dispatch_h264_qpel_mc10);
fail |= run_quarter_axis_qpel("mc30", daedalus_put_h264_qpel8_mc30_ref,
daedalus_recipe_dispatch_h264_qpel_mc30);
fail |= run_quarter_axis_qpel("mc01", daedalus_put_h264_qpel8_mc01_ref,
daedalus_recipe_dispatch_h264_qpel_mc01);
fail |= run_quarter_axis_qpel("mc03", daedalus_put_h264_qpel8_mc03_ref,
daedalus_recipe_dispatch_h264_qpel_mc03);
return fail;
}
static int test_qpel_diag_all(void)
{
/* Diagonal positions need TWO half-pel intermediates per output;
* some of them read at (r+1,c) or (r,c+1) so the test geometry
* needs an extra row + col of context. run_quarter_axis_qpel
* already provides plenty (SRC_ROW=3, SRC_COL=3, 16x16 tile)
* — reusing that harness is fine. */
int fail = 0;
fail |= run_quarter_axis_qpel("mc11", daedalus_put_h264_qpel8_mc11_ref,
daedalus_recipe_dispatch_h264_qpel_mc11);
fail |= run_quarter_axis_qpel("mc12", daedalus_put_h264_qpel8_mc12_ref,
daedalus_recipe_dispatch_h264_qpel_mc12);
fail |= run_quarter_axis_qpel("mc13", daedalus_put_h264_qpel8_mc13_ref,
daedalus_recipe_dispatch_h264_qpel_mc13);
fail |= run_quarter_axis_qpel("mc21", daedalus_put_h264_qpel8_mc21_ref,
daedalus_recipe_dispatch_h264_qpel_mc21);
fail |= run_quarter_axis_qpel("mc23", daedalus_put_h264_qpel8_mc23_ref,
daedalus_recipe_dispatch_h264_qpel_mc23);
fail |= run_quarter_axis_qpel("mc31", daedalus_put_h264_qpel8_mc31_ref,
daedalus_recipe_dispatch_h264_qpel_mc31);
fail |= run_quarter_axis_qpel("mc32", daedalus_put_h264_qpel8_mc32_ref,
daedalus_recipe_dispatch_h264_qpel_mc32);
fail |= run_quarter_axis_qpel("mc33", daedalus_put_h264_qpel8_mc33_ref,
daedalus_recipe_dispatch_h264_qpel_mc33);
return fail;
}
/* Avg-form harness: pre-loads dst + dst_ref with the same random
* content so we can verify the L2 averaging is happening (not just
* put_-style overwrite). If the dispatch incorrectly overwrote
* dst, the bit-exact compare would still catch the mismatch against
* the avg_ reference. */
static int run_avg_qpel(const char *name,
qpel_ref_fn ref, qpel_dispatch_fn dispatch)
{
enum { N = 8, TILE_STRIDE = 16, TILE_ROWS = 16,
TILE_BYTES = TILE_ROWS * TILE_STRIDE, TOTAL = N * TILE_BYTES,
SRC_ROW = 3, SRC_COL = 3 };
daedalus_ctx *ctx = daedalus_ctx_create();
if (!ctx) return 1;
uint8_t src[TOTAL], dst[TOTAL], dst_ref[TOTAL];
daedalus_h264_qpel_meta meta[N];
/* Two random buffers: src for the qpel input, dst seeded with
* different random content as the "list0 prediction" — both
* dst and dst_ref get the SAME seed so the avg compare is fair. */
for (int i = 0; i < TOTAL; i++) src[i] = (uint8_t)(xs() & 0xff);
for (int i = 0; i < TOTAL; i++) {
uint8_t v = (uint8_t)(xs() & 0xff);
dst[i] = dst_ref[i] = v;
}
for (int i = 0; i < N; i++) {
meta[i].src_off = (uint32_t)(i * TILE_BYTES + SRC_ROW * TILE_STRIDE + SRC_COL);
meta[i].dst_off = (uint32_t)(i * TILE_BYTES + SRC_ROW * TILE_STRIDE + SRC_COL);
}
for (int i = 0; i < N; i++)
ref(dst_ref + meta[i].dst_off, src + meta[i].src_off, TILE_STRIDE);
int rc = dispatch(ctx, dst, src, TILE_STRIDE, N, meta);
if (rc) { fprintf(stderr, "%s dispatch rc=%d\n", name, rc); daedalus_ctx_destroy(ctx); return 1; }
int diff = 0;
for (int i = 0; i < TOTAL; i++) if (dst[i] != dst_ref[i]) diff++;
printf(" H.264 qpel %s: %d/%d bytes bit-exact (%.4f%%)\n",
name, TOTAL - diff, TOTAL, 100.0 * (TOTAL - diff) / TOTAL);
daedalus_ctx_destroy(ctx);
return diff == 0 ? 0 : 1;
}
static int test_qpel_avg_anchors(void)
{
int fail = 0;
fail |= run_avg_qpel("avg_mc20", daedalus_avg_h264_qpel8_mc20_ref,
daedalus_recipe_dispatch_h264_qpel_avg_mc20);
fail |= run_avg_qpel("avg_mc02", daedalus_avg_h264_qpel8_mc02_ref,
daedalus_recipe_dispatch_h264_qpel_avg_mc02);
fail |= run_avg_qpel("avg_mc22", daedalus_avg_h264_qpel8_mc22_ref,
daedalus_recipe_dispatch_h264_qpel_avg_mc22);
return fail;
}
static int test_qpel_avg_rest(void)
{
int fail = 0;
/* Ref fns are named daedalus_avg_h264_qpel8_<mcXX>_ref (no
* second "avg_"); dispatch fns are named ..._avg_mcXX. Macro
* builds both from the bare mcXX name. */
#define RUN(MC) fail |= run_avg_qpel("avg_" #MC, \
daedalus_avg_h264_qpel8_ ## MC ## _ref, \
daedalus_recipe_dispatch_h264_qpel_avg_ ## MC)
RUN(mc10); RUN(mc30); RUN(mc01); RUN(mc03);
RUN(mc11); RUN(mc12); RUN(mc13);
RUN(mc21); RUN(mc23);
RUN(mc31); RUN(mc32); RUN(mc33);
#undef RUN
return fail;
}
int main(void)
{
printf("=== Phase 8a API smoke: H.264 kernels via recipe dispatch ===\n");
@@ -197,10 +683,29 @@ int main(void)
printf(" H264_QPEL_MC20 recipe substrate: %d\n",
(int) daedalus_recipe_substrate_for(DAEDALUS_KERNEL_H264_QPEL_MC20));
printf(" H264_DEBLOCK_LH recipe substrate: %d\n",
(int) daedalus_recipe_substrate_for(DAEDALUS_KERNEL_H264_DEBLOCK_LH));
printf(" H264_DEBLOCK_CV recipe substrate: %d\n",
(int) daedalus_recipe_substrate_for(DAEDALUS_KERNEL_H264_DEBLOCK_CV));
printf(" H264_DEBLOCK_CH recipe substrate: %d\n",
(int) daedalus_recipe_substrate_for(DAEDALUS_KERNEL_H264_DEBLOCK_CH));
printf(" H264_DEBLOCK_*_INTRA recipe substrate: %d (bS=4 family, all on QPU)\n",
(int) daedalus_recipe_substrate_for(DAEDALUS_KERNEL_H264_DEBLOCK_LV_INTRA));
int fail = 0;
fail |= test_idct4();
fail |= test_idct8();
fail |= test_deblock();
fail |= test_deblock_h();
fail |= test_deblock_chroma_v();
fail |= test_deblock_chroma_h();
fail |= test_deblock_intra_all();
fail |= test_qpel_mc20();
fail |= test_qpel_mc02();
fail |= test_qpel_mc22();
fail |= test_qpel_quarter_axis_all();
fail |= test_qpel_diag_all();
fail |= test_qpel_avg_anchors();
fail |= test_qpel_avg_rest();
return fail;
}
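The tile geometry shared by the qpel harnesses above can be checked in isolation. The sketch below assumes the same constants as the tests (8 tiles of 16x16 bytes, block origin at row 3, col 3); the helper names `tile_block_offset` and `footprint_inside_tile` are hypothetical and are not part of the daedalus API.

```c
#include <assert.h>
#include <stdint.h>

/* Tile geometry copied from the qpel tests; nothing here touches the library. */
enum { N = 8, TILE_STRIDE = 16, TILE_ROWS = 16,
       TILE_BYTES = TILE_ROWS * TILE_STRIDE,
       SRC_ROW = 3, SRC_COL = 3 };

/* Hypothetical helper: byte offset of the 8x8 block origin inside tile i. */
static uint32_t tile_block_offset(int i)
{
    assert(i >= 0 && i < N);
    return (uint32_t)(i * TILE_BYTES + SRC_ROW * TILE_STRIDE + SRC_COL);
}

/* Hypothetical helper: returns 1 when the 6-tap footprint of an 8x8 block,
 * [origin-2 .. origin+7+3] on each axis, stays inside one 16x16 tile. */
static int footprint_inside_tile(void)
{
    const int lo_row = SRC_ROW - 2, hi_row = SRC_ROW + 7 + 3;
    const int lo_col = SRC_COL - 2, hi_col = SRC_COL + 7 + 3;
    return lo_row >= 0 && hi_row < TILE_ROWS &&
           lo_col >= 0 && hi_col < TILE_STRIDE;
}
```

With these constants the footprint spans rows and cols 1..13, so neighbouring tiles are never read and a whole-buffer compare also detects stray writes outside the 8x8 block.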
+136
@@ -0,0 +1,136 @@
/*
* Tests the H.264 chroma DC 2x2 Hadamard primitive against
* spec-derived expected outputs.
*
* f[0,0] = c[0,0] + c[0,1] + c[1,0] + c[1,1] "sum"
* f[0,1] = c[0,0] - c[0,1] + c[1,0] - c[1,1] "col-diff"
* f[1,0] = c[0,0] + c[0,1] - c[1,0] - c[1,1] "row-diff"
* f[1,1] = c[0,0] - c[0,1] - c[1,0] + c[1,1] "anti-diag"
*/
#include <stdint.h>
#include <stdio.h>
#include <string.h>
extern void daedalus_h264_chroma_dc_hadamard_2x2_ref(int16_t c[4]);
extern void daedalus_h264_chroma_dc_hadamard_2x2(int16_t c[4]); /* public API */
static int check(const char *name, int16_t in[4], int16_t expect[4])
{
int16_t c[4]; memcpy(c, in, sizeof(c));
daedalus_h264_chroma_dc_hadamard_2x2_ref(c);
int fail = 0;
for (int i = 0; i < 4; i++) {
if (c[i] != expect[i]) {
fprintf(stderr, "%s: c[%d] = %d, expected %d\n",
name, i, c[i], expect[i]);
fail = 1;
}
}
if (!fail) printf(" %-32s PASS\n", name);
else printf(" %-32s FAIL\n", name);
return fail;
}
int main(void)
{
int fail = 0;
/* Test 1: All-same input.
* c = [5, 5, 5, 5]
* f[0,0] = 20, f[0,1] = 0, f[1,0] = 0, f[1,1] = 0
*/
{ int16_t in[4] = { 5, 5, 5, 5 };
int16_t ex[4] = { 20, 0, 0, 0 };
fail |= check("all-uniform 5", in, ex); }
/* Test 2: Single-axis variation (col 0 = 0, col 1 = 10).
* c = [0, 10, 0, 10]
* f[0,0] = 0+10+0+10 = 20
* f[0,1] = 0-10+0-10 = -20
* f[1,0] = 0+10-0-10 = 0
* f[1,1] = 0-10-0+10 = 0
*/
{ int16_t in[4] = { 0, 10, 0, 10 };
int16_t ex[4] = { 20, -20, 0, 0 };
fail |= check("col gradient [0,10,0,10]", in, ex); }
/* Test 3: Row gradient.
* c = [0, 0, 10, 10]
* f[0,0] = 20, f[0,1] = 0, f[1,0] = 0-20 = -20, f[1,1] = 0
*/
{ int16_t in[4] = { 0, 0, 10, 10 };
int16_t ex[4] = { 20, 0, -20, 0 };
fail |= check("row gradient [0,0,10,10]", in, ex); }
/* Test 4: Main-diagonal input (excites only the sum and the f[1,1] "anti-diag" band).
* c = [10, 0, 0, 10]
* f[0,0] = 20
* f[0,1] = 10-0+0-10 = 0
* f[1,0] = 10+0-0-10 = 0
* f[1,1] = 10-0-0+10 = 20
*/
{ int16_t in[4] = { 10, 0, 0, 10 };
int16_t ex[4] = { 20, 0, 0, 20 };
fail |= check("anti-diagonal [10,0,0,10]", in, ex); }
/* Test 5: Asymmetric — all bands non-zero.
* c = [1, 2, 3, 4]
* f[0,0] = 10
* f[0,1] = 1-2+3-4 = -2
* f[1,0] = 1+2-3-4 = -4
* f[1,1] = 1-2-3+4 = 0
*/
{ int16_t in[4] = { 1, 2, 3, 4 };
int16_t ex[4] = { 10, -2, -4, 0 };
fail |= check("asymmetric [1,2,3,4]", in, ex); }
/* Test 6: Negative inputs (Hadamard is linear, so signs preserve).
* c = [-5, 5, -5, 5]
* f[0,0] = -5+5-5+5 = 0
* f[0,1] = -5-5-5-5 = -20
* f[1,0] = -5+5+5-5 = 0
* f[1,1] = -5-5+5+5 = 0
*/
{ int16_t in[4] = { -5, 5, -5, 5 };
int16_t ex[4] = { 0, -20, 0, 0 };
fail |= check("sign-alternating [-5,5,-5,5]", in, ex); }
/* Test 7: Inverse-property check. H * H = 4*I for the unscaled
* 2x2 Hadamard. So applying twice multiplies each by 4.
* c = [1, 2, 3, 4]
* First Hadamard: [10, -2, -4, 0]
* Second Hadamard: [4, 8, 12, 16]
*/
{ int16_t in[4] = { 1, 2, 3, 4 };
int16_t ex[4] = { 4, 8, 12, 16 };
int16_t c[4]; memcpy(c, in, sizeof(c));
daedalus_h264_chroma_dc_hadamard_2x2_ref(c);
daedalus_h264_chroma_dc_hadamard_2x2_ref(c);
int local_fail = 0;
for (int i = 0; i < 4; i++) if (c[i] != ex[i]) local_fail = 1;
printf(" %-32s %s\n", "double-Hadamard = 4*orig",
local_fail ? "FAIL" : "PASS");
fail |= local_fail;
}
/* Test 8: public API parity. The public symbol must produce
* byte-identical output to the test-only ref for the same input.
* If the src/ Hadamard ever drifts from the spec, this catches it. */
{
int16_t input[4] = { 7, -11, 23, -42 };
int16_t a[4], b[4];
memcpy(a, input, sizeof(a));
memcpy(b, input, sizeof(b));
daedalus_h264_chroma_dc_hadamard_2x2_ref(a);
daedalus_h264_chroma_dc_hadamard_2x2(b);
int local_fail = 0;
for (int i = 0; i < 4; i++) if (a[i] != b[i]) local_fail = 1;
printf(" %-32s %s\n", "public API parity vs _ref",
local_fail ? "FAIL" : "PASS");
fail |= local_fail;
}
if (fail == 0) printf("\nALL chroma DC Hadamard tests PASS\n");
else fprintf(stderr, "\nOne or more tests FAILED\n");
return fail ? 1 : 0;
}
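The double-application property used by Test 7 follows directly from the four formulas in the file header. A minimal sketch, assuming an in-place transform on a row-major `c[4]`; `hadamard_2x2_sketch` and `double_hadamard_is_4x` are hypothetical stand-ins written for illustration, not the library's `daedalus_h264_chroma_dc_hadamard_2x2`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in: in-place unscaled 2x2 Hadamard on
 * c = { c[0,0], c[0,1], c[1,0], c[1,1] }. */
static void hadamard_2x2_sketch(int16_t c[4])
{
    const int a = c[0], b = c[1], d = c[2], e = c[3];
    c[0] = (int16_t)(a + b + d + e);   /* sum       */
    c[1] = (int16_t)(a - b + d - e);   /* col-diff  */
    c[2] = (int16_t)(a + b - d - e);   /* row-diff  */
    c[3] = (int16_t)(a - b - d + e);   /* anti-diag */
}

/* Returns 1 when applying the transform twice yields 4 * input. */
static int double_hadamard_is_4x(const int16_t in[4])
{
    assert(in != NULL);
    int16_t c[4] = { in[0], in[1], in[2], in[3] };
    hadamard_2x2_sketch(c);
    hadamard_2x2_sketch(c);
    for (int i = 0; i < 4; i++)
        if (c[i] != 4 * in[i]) return 0;
    return 1;
}
```

The check holds for any input whose intermediate values fit in `int16_t`, which is the case for the small vectors used in the tests.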
+167
@@ -0,0 +1,167 @@
/*
* Tests the 4 H.264 Intra_16x16 luma prediction modes against
* spec-derived expected patterns. Same layout as the 4x4 test:
* a buffer that holds the 16x16 output plus 1-pixel top/left
* context and 1-pixel top-left corner.
*
* row 0: [tl][t0..t15]
* row 1: [l0][output row 0]
* row 2: [l1][output row 1]
* ...
* row 16: [l15][output row 15]
*
* Buffer dimensions: 17 rows × 17 cols, total 289 bytes.
* dst (passed to the pred fns) points at row 1 col 1.
*/
#include <stdint.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>
extern void daedalus_h264_pred_16x16_vertical(uint8_t *dst, ptrdiff_t stride);
extern void daedalus_h264_pred_16x16_horizontal(uint8_t *dst, ptrdiff_t stride);
extern void daedalus_h264_pred_16x16_dc(uint8_t *dst, ptrdiff_t stride);
extern void daedalus_h264_pred_16x16_plane(uint8_t *dst, ptrdiff_t stride);
#define STRIDE 17
#define ROWS 17
static void set_ctx(uint8_t buf[ROWS][STRIDE], int tl,
const int t[16], const int l[16])
{
for (int r = 0; r < ROWS; r++)
for (int c = 0; c < STRIDE; c++) buf[r][c] = 0xff;
buf[0][0] = (uint8_t) tl;
for (int c = 0; c < 16; c++) buf[0][1 + c] = (uint8_t) t[c];
for (int r = 0; r < 16; r++) buf[1 + r][0] = (uint8_t) l[r];
}
static int check(const uint8_t buf[ROWS][STRIDE], const char *name,
uint8_t (*expect_at)(int r, int c, void *), void *cookie)
{
int diff = 0;
int first_r = 0, first_c = 0, first_got = 0, first_exp = 0;
for (int r = 0; r < 16; r++) {
for (int c = 0; c < 16; c++) {
uint8_t got = buf[1 + r][1 + c];
uint8_t exp = expect_at(r, c, cookie);
if (got != exp) {
if (diff == 0) {
first_r = r; first_c = c;
first_got = got; first_exp = exp;
}
diff++;
}
}
}
if (diff == 0)
printf(" %-30s PASS\n", name);
else
printf(" %-30s FAIL (%d/256 wrong, first r=%d c=%d got=%u exp=%u)\n",
name, diff, first_r, first_c, first_got, first_exp);
return diff == 0 ? 0 : 1;
}
/* Expectation helpers for each mode. */
static uint8_t expect_uniform(int r, int c, void *cookie)
{ (void)r; (void)c; return *(uint8_t *)cookie; }
struct vertical_ctx { const int *t; };
static uint8_t expect_vertical(int r, int c, void *cookie)
{ (void)r; return (uint8_t) ((struct vertical_ctx *)cookie)->t[c]; }
struct horizontal_ctx { const int *l; };
static uint8_t expect_horizontal(int r, int c, void *cookie)
{ (void)c; return (uint8_t) ((struct horizontal_ctx *)cookie)->l[r]; }
int main(void)
{
int fail = 0;
/* --- Mode 0 Vertical: each col = top[col] --- */
{
uint8_t buf[ROWS][STRIDE];
int t[16], l[16];
for (int i = 0; i < 16; i++) { t[i] = 10 + i; l[i] = 0; }
set_ctx(buf, 0, t, l);
daedalus_h264_pred_16x16_vertical(&buf[1][1], STRIDE);
struct vertical_ctx vc = { t };
fail |= check(buf, "Vertical (mode 0)", expect_vertical, &vc);
}
/* --- Mode 1 Horizontal: each row = left[row] --- */
{
uint8_t buf[ROWS][STRIDE];
int t[16] = {0}, l[16];
for (int i = 0; i < 16; i++) l[i] = 50 + i;
set_ctx(buf, 0, t, l);
daedalus_h264_pred_16x16_horizontal(&buf[1][1], STRIDE);
struct horizontal_ctx hc = { l };
fail |= check(buf, "Horizontal (mode 1)", expect_horizontal, &hc);
}
/* --- Mode 2 DC: ((sum + 16) >> 5) --- */
/* All top = 2, all left = 6: sum = 32 + 96 = 128, +16 = 144,
* >>5 = floor(144/32) = 4. */
{
uint8_t buf[ROWS][STRIDE];
int t[16], l[16];
for (int i = 0; i < 16; i++) { t[i] = 2; l[i] = 6; }
set_ctx(buf, 99, t, l);
daedalus_h264_pred_16x16_dc(&buf[1][1], STRIDE);
uint8_t exp_val = 4;
fail |= check(buf, "DC (mode 2)", expect_uniform, &exp_val);
}
/* --- Mode 3 Plane: uniform neighbours → uniform output --- */
/* H=V=0 when neighbours are uniform. a = 16*(p+p) = 32p.
* pred[y][x] = (32p + 0 + 0 + 16) >> 5 = (32p + 16) >> 5 = p
* (exact for any p: the +16 rounding term is below 32, so the shift
* discards it). Verifies the orientation-free portion of the formula. */
{
uint8_t buf[ROWS][STRIDE];
int t[16], l[16];
for (int i = 0; i < 16; i++) { t[i] = 100; l[i] = 100; }
set_ctx(buf, 100, t, l); /* uniform tl too — H/V sums actually zero */
daedalus_h264_pred_16x16_plane(&buf[1][1], STRIDE);
uint8_t exp_val = 100;
fail |= check(buf, "Plane (mode 3, uniform)", expect_uniform, &exp_val);
}
/* --- Mode 3 Plane: gradient sanity ---
* Top row = 0..15 (gradient), left col = 0..15, tl = 0.
* H = sum_{i=0..7} (i+1) * (t[8+i] - t[6-i]), where t[-1] (the i=7 term) is tl
* = 1*(8-6) + 2*(9-5) + 3*(10-4) + 4*(11-3) + 5*(12-2) + 6*(13-1)
* + 7*(14-0) + 8*(15-0)
* = 2 + 8 + 18 + 32 + 50 + 72 + 98 + 120 = 400
* V = same shape on left col = 400
* b = (5*400 + 32) >> 6 = 2032 >> 6 = 31
* c = (5*400 + 32) >> 6 = 31
* a = 16 * (l[15] + t[15]) = 16 * (15 + 15) = 480
* pred[0][0] = (480 + 31*(-7) + 31*(-7) + 16) >> 5
* = (480 - 217 - 217 + 16) >> 5
* = 62 >> 5 = 1
* pred[15][15] = (480 + 31*8 + 31*8 + 16) >> 5
* = (480 + 248 + 248 + 16) >> 5
* = 992 >> 5 = 31
* Just spot-check those two corners. */
{
uint8_t buf[ROWS][STRIDE];
int t[16], l[16];
for (int i = 0; i < 16; i++) { t[i] = i; l[i] = i; }
set_ctx(buf, 0, t, l);
daedalus_h264_pred_16x16_plane(&buf[1][1], STRIDE);
uint8_t tl_actual = buf[1 + 0][1 + 0];
uint8_t br_actual = buf[1 + 15][1 + 15];
int spot_fail = 0;
if (tl_actual != 1) { fprintf(stderr, "Plane gradient pred[0][0] = %u, expected 1\n", tl_actual); spot_fail = 1; }
if (br_actual != 31) { fprintf(stderr, "Plane gradient pred[15][15] = %u, expected 31\n", br_actual); spot_fail = 1; }
if (!spot_fail) printf(" %-30s PASS (corners 1, 31)\n", "Plane (mode 3, gradient)");
else printf(" %-30s FAIL\n", "Plane (mode 3, gradient)");
fail |= spot_fail;
}
if (fail == 0) printf("\nALL Intra_16x16 mode references PASS\n");
else fprintf(stderr, "\nOne or more tests FAILED\n");
return fail ? 1 : 0;
}
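The hand-computed corner values in the Plane gradient comment can be reproduced with a few lines of C. This is a sketch under stated assumptions: `plane16_sample` is a hypothetical per-sample helper that follows the H/V/a/b/c arithmetic written out in the comment, not the library's `daedalus_h264_pred_16x16_plane`, and it relies on `>>` behaving as an arithmetic shift if the pre-shift sum is ever negative.

```c
#include <assert.h>
#include <stdint.h>

static uint8_t clip_u8(int v)
{
    return (uint8_t)(v < 0 ? 0 : v > 255 ? 255 : v);
}

/* Hypothetical sketch: one Intra_16x16 Plane sample at (y, x).
 * t[0..15] is the top row, l[0..15] the left column, tl the corner. */
static uint8_t plane16_sample(const int t[16], const int l[16], int tl,
                              int y, int x)
{
    assert(y >= 0 && y < 16 && x >= 0 && x < 16);
    int H = 0, V = 0;
    for (int i = 0; i < 8; i++) {
        /* For i == 7 the low index is -1, which is the top-left corner. */
        const int t_lo = (i < 7) ? t[6 - i] : tl;
        const int l_lo = (i < 7) ? l[6 - i] : tl;
        H += (i + 1) * (t[8 + i] - t_lo);
        V += (i + 1) * (l[8 + i] - l_lo);
    }
    const int a = 16 * (l[15] + t[15]);
    const int b = (5 * H + 32) >> 6;
    const int c = (5 * V + 32) >> 6;
    /* Assumes arithmetic right shift when the sum is negative. */
    return clip_u8((a + b * (x - 7) + c * (y - 7) + 16) >> 5);
}
```

For the 0..15 gradient this gives H = V = 400 and b = c = 31, matching the two corners the test spot-checks.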
+246
@@ -0,0 +1,246 @@
/*
* Tests the 9 H.264 Intra_4x4 luma prediction modes against
* spec-derived expected patterns. Goal: catch any mistake in
* the reference (sign / shift / table mapping) before it lands
* downstream. Each mode is exercised with a deterministic
* neighbour context and checked against a hand-computed (or
* spec-derived) expected 4x4 output.
*
* The test buffer layout reserves a 1-pixel top/left context border
* + a 4-pixel top-right (for modes 3 / 7):
*
* row 0: [tl][t0 t1 t2 t3 t4 t5 t6 t7] <- STRIDE = 9 bytes
* row 1: [l0][ 4x4 output goes here ]
* row 2: [l1][ ]
* row 3: [l2][ ]
* row 4: [l3][ ]
*
* dst (passed to the pred fns) points at row 1 col 1.
*/
#include <stdint.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>
extern void daedalus_h264_pred_4x4_vertical(uint8_t *dst, ptrdiff_t stride);
extern void daedalus_h264_pred_4x4_horizontal(uint8_t *dst, ptrdiff_t stride);
extern void daedalus_h264_pred_4x4_dc(uint8_t *dst, ptrdiff_t stride);
extern void daedalus_h264_pred_4x4_ddl(uint8_t *dst, ptrdiff_t stride);
extern void daedalus_h264_pred_4x4_ddr(uint8_t *dst, ptrdiff_t stride);
extern void daedalus_h264_pred_4x4_vr(uint8_t *dst, ptrdiff_t stride);
extern void daedalus_h264_pred_4x4_hd(uint8_t *dst, ptrdiff_t stride);
extern void daedalus_h264_pred_4x4_vl(uint8_t *dst, ptrdiff_t stride);
extern void daedalus_h264_pred_4x4_hu(uint8_t *dst, ptrdiff_t stride);
#define STRIDE 9
typedef void (*pred_fn)(uint8_t *dst, ptrdiff_t stride);
/* Set up the buffer: 5 rows × STRIDE cols.
* top-left = tl, top[0..7] = t[0..7], left[0..3] = l[0..3].
* The whole buffer is first filled with 0xff sentinels, so any unwritten
* cell of the 4x4 output region (rows 1..4, cols 1..4) shows up as 255
* in the compare. */
static void set_ctx(uint8_t buf[5][STRIDE], int tl, const int t[8], const int l[4])
{
for (int r = 0; r < 5; r++) for (int c = 0; c < STRIDE; c++) buf[r][c] = 0xff;
buf[0][0] = (uint8_t) tl;
for (int c = 0; c < 8; c++) buf[0][1 + c] = (uint8_t) t[c];
for (int r = 0; r < 4; r++) buf[1 + r][0] = (uint8_t) l[r];
}
static int check(const uint8_t buf[5][STRIDE], const char *name,
const uint8_t expect[4][4])
{
int diff = 0;
for (int r = 0; r < 4; r++) {
for (int c = 0; c < 4; c++) {
uint8_t got = buf[1 + r][1 + c];
uint8_t exp = expect[r][c];
if (got != exp) {
if (diff == 0)
fprintf(stderr,
"%s: first mismatch r=%d c=%d got=%u exp=%u\n",
name, r, c, got, exp);
diff++;
}
}
}
if (diff == 0)
printf(" %-26s PASS\n", name);
else
printf(" %-26s FAIL (%d/16 bytes wrong)\n", name, diff);
return diff == 0 ? 0 : 1;
}
int main(void)
{
int fail = 0;
/* Mode 0 — Vertical: each col = top[col]. */
{
uint8_t buf[5][STRIDE];
int tl = 0;
int t[8] = { 10, 20, 30, 40, 0, 0, 0, 0 };
int l[4] = { 0, 0, 0, 0 };
set_ctx(buf, tl, t, l);
daedalus_h264_pred_4x4_vertical(&buf[1][1], STRIDE);
uint8_t exp[4][4] = {
{10,20,30,40}, {10,20,30,40}, {10,20,30,40}, {10,20,30,40}
};
fail |= check(buf, "Vertical (mode 0)", exp);
}
/* Mode 1 — Horizontal: each row = left[row]. */
{
uint8_t buf[5][STRIDE];
int t[8] = { 0,0,0,0, 0,0,0,0 };
int l[4] = { 50, 60, 70, 80 };
set_ctx(buf, 0, t, l);
daedalus_h264_pred_4x4_horizontal(&buf[1][1], STRIDE);
uint8_t exp[4][4] = {
{50,50,50,50}, {60,60,60,60}, {70,70,70,70}, {80,80,80,80}
};
fail |= check(buf, "Horizontal (mode 1)", exp);
}
/* Mode 2 — DC: all 8 neighbours valid → ((sum + 4) >> 3) broadcast.
* top sum = 4*1 = 4, left sum = 4*3 = 12, total 16, +4 = 20,
* >>3 = 2. */
{
uint8_t buf[5][STRIDE];
int t[8] = { 1,1,1,1, 0,0,0,0 };
int l[4] = { 3,3,3,3 };
set_ctx(buf, 99, t, l); /* tl unused for DC */
daedalus_h264_pred_4x4_dc(&buf[1][1], STRIDE);
uint8_t exp[4][4] = {
{2,2,2,2}, {2,2,2,2}, {2,2,2,2}, {2,2,2,2}
};
fail |= check(buf, "DC (mode 2)", exp);
}
/* Mode 3 — Diagonal_Down_Left: zz[i] = avg3(t[i], t[i+1], t[i+2]);
* dst[r][c] = zz[c + r].
* With all t[]=100 → all zz=100 → all dst=100. */
{
uint8_t buf[5][STRIDE];
int t[8] = { 100,100,100,100, 100,100,100,100 };
int l[4] = { 0,0,0,0 };
set_ctx(buf, 0, t, l);
daedalus_h264_pred_4x4_ddl(&buf[1][1], STRIDE);
uint8_t exp[4][4] = {
{100,100,100,100}, {100,100,100,100},
{100,100,100,100}, {100,100,100,100}
};
fail |= check(buf, "DiagDownLeft (mode 3)", exp);
}
/* Mode 4 — Diagonal_Down_Right: zz[c-r] with c-r ∈ {-3..+3}.
* If all 9 surrounding pixels = 200 → all zz = 200 → all dst = 200. */
{
uint8_t buf[5][STRIDE];
int t[8] = { 200,200,200,200, 0,0,0,0 };
int l[4] = { 200,200,200,200 };
set_ctx(buf, 200, t, l);
daedalus_h264_pred_4x4_ddr(&buf[1][1], STRIDE);
uint8_t exp[4][4] = {
{200,200,200,200}, {200,200,200,200},
{200,200,200,200}, {200,200,200,200}
};
fail |= check(buf, "DiagDownRight (mode 4)", exp);
}
/* Mode 5 — Vertical_Right. With all neighbours = 80 the 3-tap
* (a+2b+c+2)>>2 and 2-tap (a+b+1)>>1 both yield 80. */
{
uint8_t buf[5][STRIDE];
int t[8] = { 80,80,80,80, 0,0,0,0 };
int l[4] = { 80,80,80,80 };
set_ctx(buf, 80, t, l);
daedalus_h264_pred_4x4_vr(&buf[1][1], STRIDE);
uint8_t exp[4][4] = {
{80,80,80,80}, {80,80,80,80}, {80,80,80,80}, {80,80,80,80}
};
fail |= check(buf, "VerticalRight (mode 5)", exp);
}
/* Mode 6 — Horizontal_Down. Same uniform-context degenerate case. */
{
uint8_t buf[5][STRIDE];
int t[8] = { 120,120,120,120, 0,0,0,0 };
int l[4] = { 120,120,120,120 };
set_ctx(buf, 120, t, l);
daedalus_h264_pred_4x4_hd(&buf[1][1], STRIDE);
uint8_t exp[4][4] = {
{120,120,120,120}, {120,120,120,120},
{120,120,120,120}, {120,120,120,120}
};
fail |= check(buf, "HorizontalDown (mode 6)", exp);
}
/* Mode 7 — Vertical_Left. Uniform context. */
{
uint8_t buf[5][STRIDE];
int t[8] = { 64,64,64,64, 64,64,64,64 };
int l[4] = { 0,0,0,0 };
set_ctx(buf, 0, t, l);
daedalus_h264_pred_4x4_vl(&buf[1][1], STRIDE);
uint8_t exp[4][4] = {
{64,64,64,64}, {64,64,64,64}, {64,64,64,64}, {64,64,64,64}
};
fail |= check(buf, "VerticalLeft (mode 7)", exp);
}
/* Mode 8 — Horizontal_Up. Uniform context. */
{
uint8_t buf[5][STRIDE];
int t[8] = { 0,0,0,0, 0,0,0,0 };
int l[4] = { 200,200,200,200 };
set_ctx(buf, 0, t, l);
daedalus_h264_pred_4x4_hu(&buf[1][1], STRIDE);
uint8_t exp[4][4] = {
{200,200,200,200}, {200,200,200,200},
{200,200,200,200}, {200,200,200,200}
};
fail |= check(buf, "HorizontalUp (mode 8)", exp);
}
/* Asymmetric Vertical_Right test: detects orientation /
* row-vs-col confusion. Top=10,20,30,40, Left=50,60,70 (l3 is not
* read by this mode), top-left=5. Spec-derived expected output
* computed by hand from §8.3.1.2.6.
*
* d[0][0] = (tl+t0+1)>>1 = (5+10+1)>>1 = 8
* d[0][1] = (t0+t1+1)>>1 = (10+20+1)>>1 = 15
* d[0][2] = (t1+t2+1)>>1 = (20+30+1)>>1 = 25
* d[0][3] = (t2+t3+1)>>1 = (30+40+1)>>1 = 35
* d[1][0] = avg3(l0,tl,t0) = (50+2*5+10+2)>>2 = 72/4 = 18
* d[1][1] = avg3(tl,t0,t1) = (5+20+20+2)>>2 = 47/4 = 11
* d[1][2] = avg3(t0,t1,t2) = (10+40+30+2)>>2 = 82/4 = 20
* d[1][3] = avg3(t1,t2,t3) = (20+60+40+2)>>2 = 122/4 = 30
* d[2][0] = avg3(tl,l0,l1) = (5+100+60+2)>>2 = 167/4 = 41
* d[2][1] = d[0][0] = 8
* d[2][2] = d[0][1] = 15
* d[2][3] = d[0][2] = 25
* d[3][0] = avg3(l0,l1,l2) = (50+120+70+2)>>2 = 242/4 = 60
* d[3][1] = d[1][0] = 18
* d[3][2] = d[1][1] = 11
* d[3][3] = d[1][2] = 20
*/
{
uint8_t buf[5][STRIDE];
int t[8] = { 10,20,30,40, 0,0,0,0 };
int l[4] = { 50,60,70,0 };
set_ctx(buf, 5, t, l);
daedalus_h264_pred_4x4_vr(&buf[1][1], STRIDE);
uint8_t exp[4][4] = {
{ 8,15,25,35},
{18,11,20,30},
{41, 8,15,25},
{60,18,11,20},
};
fail |= check(buf, "VR asym (sanity)", exp);
}
if (fail == 0) printf("\nALL %d intra-4x4 mode references PASS\n", 10);
else fprintf(stderr, "\nOne or more tests FAILED\n");
return fail ? 1 : 0;
}
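The asymmetric Vertical_Right expectation above was derived by hand; the same table can be regenerated from the zVR = 2x - y rule of the H.264 Intra_4x4_Vertical_Right mode. A minimal sketch, assuming that rule: `nb` and `vr4x4_sketch` are hypothetical names, and the function writes into a plain 4x4 array rather than the bordered buffer the real pred functions use.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical neighbour accessor p(x, y): y == -1 selects the top row
 * (x == -1 is the corner), otherwise x == -1 selects the left column. */
static int nb(const int t[4], const int l[4], int tl, int x, int y)
{
    if (y == -1) return (x == -1) ? tl : t[x];
    assert(x == -1);
    return l[y];
}

/* Hypothetical sketch of Intra_4x4 Vertical_Right; d[y][x] is row-major. */
static void vr4x4_sketch(uint8_t d[4][4], const int t[4], const int l[4], int tl)
{
    for (int y = 0; y < 4; y++) {
        for (int x = 0; x < 4; x++) {
            const int z = 2 * x - y;       /* zVR in the spec */
            const int k = x - (y >> 1);
            int v;
            if (z >= 0 && (z & 1) == 0)    /* 2-tap on the top row */
                v = (nb(t, l, tl, k - 1, -1) + nb(t, l, tl, k, -1) + 1) >> 1;
            else if (z >= 0)               /* 3-tap on the top row */
                v = (nb(t, l, tl, k - 2, -1) + 2 * nb(t, l, tl, k - 1, -1)
                     + nb(t, l, tl, k, -1) + 2) >> 2;
            else if (z == -1)              /* 3-tap around the corner */
                v = (l[0] + 2 * tl + t[0] + 2) >> 2;
            else                           /* 3-tap down the left column */
                v = (nb(t, l, tl, -1, y - 1) + 2 * nb(t, l, tl, -1, y - 2)
                     + nb(t, l, tl, -1, y - 3) + 2) >> 2;
            d[y][x] = (uint8_t)v;
        }
    }
}
```

Only `t[0..3]`, `l[0..2]` and `tl` are read, which is why the test can leave `l[3]` at zero without affecting the result.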
+170
@@ -0,0 +1,170 @@
/*
* Tests the H.264 Intra_8x8 luma prediction modes against spec-derived
* expectations. Buffer layout is 9 rows × 17 cols; the extra cols hold
* the top-right extension that DDL/VL read and that the 1-2-1 reference
* filter touches at t[8] when it smooths t[7]:
*
* row 0: [tl][t0..t15] — 17 bytes
* row 1: [l0][output row 0 ..] — 17 bytes
* ...
* row 8: [l7][output row 7 ..]
*/
#include <stdint.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>
extern void daedalus_h264_pred_8x8l_vertical(uint8_t *dst, ptrdiff_t stride);
extern void daedalus_h264_pred_8x8l_horizontal(uint8_t *dst, ptrdiff_t stride);
extern void daedalus_h264_pred_8x8l_dc(uint8_t *dst, ptrdiff_t stride);
extern void daedalus_h264_pred_8x8l_ddl(uint8_t *dst, ptrdiff_t stride);
extern void daedalus_h264_pred_8x8l_ddr(uint8_t *dst, ptrdiff_t stride);
extern void daedalus_h264_pred_8x8l_vr(uint8_t *dst, ptrdiff_t stride);
extern void daedalus_h264_pred_8x8l_hd(uint8_t *dst, ptrdiff_t stride);
extern void daedalus_h264_pred_8x8l_vl(uint8_t *dst, ptrdiff_t stride);
extern void daedalus_h264_pred_8x8l_hu(uint8_t *dst, ptrdiff_t stride);
#define STRIDE 17
#define ROWS 9
static void set_ctx(uint8_t buf[ROWS][STRIDE], int tl,
const int t[16], const int l[8])
{
for (int r = 0; r < ROWS; r++)
for (int c = 0; c < STRIDE; c++) buf[r][c] = 0xff;
buf[0][0] = (uint8_t) tl;
for (int c = 0; c < 16; c++) buf[0][1 + c] = (uint8_t) t[c];
for (int r = 0; r < 8; r++) buf[1 + r][0] = (uint8_t) l[r];
}
static int check_uniform(const uint8_t buf[ROWS][STRIDE], const char *name,
uint8_t expect_val)
{
int diff = 0;
for (int r = 0; r < 8; r++)
for (int c = 0; c < 8; c++)
if (buf[1+r][1+c] != expect_val) diff++;
if (diff == 0) printf(" %-30s PASS\n", name);
else printf(" %-30s FAIL (%d/64 wrong, expected %u)\n", name, diff, expect_val);
return diff == 0 ? 0 : 1;
}
int main(void)
{
int fail = 0;
/* Mode 0 Vertical with uniform top → uniform output.
* Filtered top[c] = (a + 2*a + a + 2) >> 2 = a for uniform a. */
{
uint8_t buf[ROWS][STRIDE];
int t[16], l[8];
for (int i = 0; i < 16; i++) t[i] = 50;
for (int j = 0; j < 8; j++) l[j] = 0;
set_ctx(buf, 50, t, l);
daedalus_h264_pred_8x8l_vertical(&buf[1][1], STRIDE);
fail |= check_uniform(buf, "Vertical (mode 0, uniform top)", 50);
}
/* Mode 1 Horizontal with uniform left → uniform output. */
{
uint8_t buf[ROWS][STRIDE];
int t[16] = {0}, l[8];
for (int j = 0; j < 8; j++) l[j] = 70;
set_ctx(buf, 70, t, l);
daedalus_h264_pred_8x8l_horizontal(&buf[1][1], STRIDE);
fail |= check_uniform(buf, "Horizontal (mode 1, uniform left)", 70);
}
/* Mode 2 DC with all-uniform neighbours → uniform output.
* Filtered top[c] = top for uniform; filtered left[j] = left.
* sum = 8*a + 8*a + 8 = 16a + 8; (16a + 8) >> 4 = a (the +8 rounding
* term is discarded by the shift). */
{
uint8_t buf[ROWS][STRIDE];
int t[16], l[8];
for (int i = 0; i < 16; i++) t[i] = 33;
for (int j = 0; j < 8; j++) l[j] = 33;
set_ctx(buf, 33, t, l);
daedalus_h264_pred_8x8l_dc(&buf[1][1], STRIDE);
fail |= check_uniform(buf, "DC (mode 2, uniform)", 33);
}
/* Mode 0 Vertical with NON-uniform top: gradient 0..15. Filtered
* top[c] for c in 1..14 = (t[c-1] + 2*t[c] + t[c+1] + 2) >> 2
* = (c-1 + 2c + c+1 + 2) >> 2
* = (4c + 2) >> 2 = c (4c is a multiple of 4 and the +2 rounding
* term is below 4, so the shift discards it). So filtered = c for
* c=1..14.
* filt[0] (top-left) = (t[0] + 2*tl + l[0] + 2) >> 2 (not exercised
* directly by Vertical mode).
* filt[top 0] = (tl + 2*t[0] + t[1] + 2) >> 2 = (0 + 0 + 1 + 2) >> 2 = 0
* (tl=0, t[0]=0, t[1]=1)
* filt[top 15] = (t[14] + 3*t[15] + 2) >> 2 = (14 + 45 + 2) >> 2
* = 61 >> 2 = 15
*
* So Vertical output col 0 = filt[top 0] = 0, col 1 = filt[top 1] = 1,
* ..., col 7 = filt[top 7] = 7. Same for all 8 rows. */
{
uint8_t buf[ROWS][STRIDE];
int t[16], l[8] = {0};
for (int i = 0; i < 16; i++) t[i] = i;
set_ctx(buf, 0, t, l);
daedalus_h264_pred_8x8l_vertical(&buf[1][1], STRIDE);
int diff = 0;
for (int r = 0; r < 8; r++)
for (int c = 0; c < 8; c++)
if (buf[1+r][1+c] != c) diff++;
if (diff == 0) printf(" %-30s PASS (filtered gradient)\n", "Vertical (mode 0, gradient)");
else printf(" %-30s FAIL (%d/64 wrong)\n", "Vertical (mode 0, gradient)", diff);
fail |= (diff == 0) ? 0 : 1;
}
/* Mode 1 Horizontal gradient: left = 0..7. Filtered left:
* filt[left 0] = (tl + 2*l[0] + l[1] + 2) >> 2 = (0 + 0 + 1 + 2) >> 2 = 0
* filt[left j] for j=1..6 = (l[j-1] + 2*l[j] + l[j+1] + 2) >> 2 = j
* (same arithmetic as top)
* filt[left 7] = (l[6] + 3*l[7] + 2) >> 2 = (6 + 21 + 2) >> 2 = 7
* So Horizontal output row 0 = 0, row 7 = 7. */
{
uint8_t buf[ROWS][STRIDE];
int t[16] = {0}, l[8];
for (int j = 0; j < 8; j++) l[j] = j;
set_ctx(buf, 0, t, l);
daedalus_h264_pred_8x8l_horizontal(&buf[1][1], STRIDE);
int diff = 0;
for (int r = 0; r < 8; r++)
for (int c = 0; c < 8; c++)
if (buf[1+r][1+c] != r) diff++;
if (diff == 0) printf(" %-30s PASS (filtered gradient)\n", "Horizontal (mode 1, gradient)");
else printf(" %-30s FAIL (%d/64 wrong)\n", "Horizontal (mode 1, gradient)", diff);
fail |= (diff == 0) ? 0 : 1;
}
/* Directional modes — uniform-context sanity tests. With all
* neighbours = N, the 1-2-1 filter produces uniform N, and any
* 3-tap / 2-tap on uniform N produces N. So every directional
* mode should output uniform N on uniform input. */
{
typedef void (*pred_fn_t)(uint8_t *dst, ptrdiff_t stride);
struct { const char *name; pred_fn_t fn; } modes[] = {
{ "DDL (mode 3, uniform)", daedalus_h264_pred_8x8l_ddl },
{ "DDR (mode 4, uniform)", daedalus_h264_pred_8x8l_ddr },
{ "VR (mode 5, uniform)", daedalus_h264_pred_8x8l_vr },
{ "HD (mode 6, uniform)", daedalus_h264_pred_8x8l_hd },
{ "VL (mode 7, uniform)", daedalus_h264_pred_8x8l_vl },
{ "HU (mode 8, uniform)", daedalus_h264_pred_8x8l_hu },
};
for (size_t i = 0; i < sizeof(modes)/sizeof(modes[0]); i++) {
uint8_t buf[ROWS][STRIDE];
int t[16], l[8];
for (int k = 0; k < 16; k++) t[k] = 120;
for (int k = 0; k < 8; k++) l[k] = 120;
set_ctx(buf, 120, t, l);
modes[i].fn(&buf[1][1], STRIDE);
fail |= check_uniform(buf, modes[i].name, 120);
}
}
if (fail == 0) printf("\nALL Intra_8x8 luma PASS (9 modes — V, H, DC, DDL, DDR, VR, HD, VL, HU)\n");
else fprintf(stderr, "\nOne or more tests FAILED\n");
return fail ? 1 : 0;
}
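The gradient tests above depend on the 1-2-1 reference-sample filter leaving a linear ramp unchanged. The sketch below applies the three formulas quoted in the comments to a 16-sample top row; `filter_top_sketch` is a hypothetical helper that assumes the top-left and top-right neighbours are all available, and it is not taken from the library.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch of the Intra_8x8 1-2-1 filter on the top row.
 * t[0..15] are the unfiltered top samples, tl is the top-left corner. */
static void filter_top_sketch(int out[16], const int t[16], int tl)
{
    assert(out != NULL && t != NULL);
    /* First sample uses the corner as its left neighbour. */
    out[0] = (tl + 2 * t[0] + t[1] + 2) >> 2;
    for (int c = 1; c < 15; c++)
        out[c] = (t[c - 1] + 2 * t[c] + t[c + 1] + 2) >> 2;
    /* Last sample has no right neighbour, so t[15] is weighted 3x. */
    out[15] = (t[14] + 3 * t[15] + 2) >> 2;
}
```

On the ramp t[c] = c with tl = 0 every filtered sample equals its index, so the Vertical test can compare each output column against `c` directly; the same reasoning applies to the left column in the Horizontal test.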
+170
@@ -0,0 +1,170 @@
/*
* Tests the 4 H.264 Intra_8x8 chroma prediction modes against
* spec-derived expected patterns. Same buffer layout idea as the
* other intra tests: a buffer that holds the 8x8 output + 1-pixel
* top/left context + 1-pixel top-left corner.
*
* row 0: [tl][t0..t7]
* row 1: [l0][output row 0]
* ...
* row 8: [l7][output row 7]
*
* Dimensions: 9 rows × 9 cols. dst (passed to pred fns) = &buf[1][1].
*/
#include <stdint.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>
extern void daedalus_h264_pred_chroma8x8_dc(uint8_t *dst, ptrdiff_t stride);
extern void daedalus_h264_pred_chroma8x8_horizontal(uint8_t *dst, ptrdiff_t stride);
extern void daedalus_h264_pred_chroma8x8_vertical(uint8_t *dst, ptrdiff_t stride);
extern void daedalus_h264_pred_chroma8x8_plane(uint8_t *dst, ptrdiff_t stride);
#define STRIDE 9
#define ROWS 9
static void set_ctx(uint8_t buf[ROWS][STRIDE], int tl,
const int t[8], const int l[8])
{
for (int r = 0; r < ROWS; r++)
for (int c = 0; c < STRIDE; c++) buf[r][c] = 0xff;
buf[0][0] = (uint8_t) tl;
for (int c = 0; c < 8; c++) buf[0][1 + c] = (uint8_t) t[c];
for (int r = 0; r < 8; r++) buf[1 + r][0] = (uint8_t) l[r];
}
static int check_per_cell(const uint8_t buf[ROWS][STRIDE], const char *name,
const uint8_t expect[8][8])
{
int diff = 0;
int first_r = 0, first_c = 0, first_got = 0, first_exp = 0;
for (int r = 0; r < 8; r++) {
for (int c = 0; c < 8; c++) {
uint8_t got = buf[1 + r][1 + c];
uint8_t exp = expect[r][c];
if (got != exp) {
if (diff == 0) {
first_r = r; first_c = c;
first_got = got; first_exp = exp;
}
diff++;
}
}
}
if (diff == 0)
printf(" %-30s PASS\n", name);
else
printf(" %-30s FAIL (%d/64 wrong, first r=%d c=%d got=%u exp=%u)\n",
name, diff, first_r, first_c, first_got, first_exp);
return diff == 0 ? 0 : 1;
}
int main(void)
{
int fail = 0;
/* --- Mode 1 Horizontal --- */
{
uint8_t buf[ROWS][STRIDE];
int t[8] = {0}, l[8] = {10, 20, 30, 40, 50, 60, 70, 80};
set_ctx(buf, 0, t, l);
daedalus_h264_pred_chroma8x8_horizontal(&buf[1][1], STRIDE);
uint8_t exp[8][8];
for (int r = 0; r < 8; r++) for (int c = 0; c < 8; c++) exp[r][c] = (uint8_t) l[r];
fail |= check_per_cell(buf, "Horizontal (mode 1)", exp);
}

    /* --- Mode 2 Vertical --- */
    {
        uint8_t buf[ROWS][STRIDE];
        int t[8] = {15, 25, 35, 45, 55, 65, 75, 85}, l[8] = {0};
        set_ctx(buf, 0, t, l);
        daedalus_h264_pred_chroma8x8_vertical(&buf[1][1], STRIDE);
        uint8_t exp[8][8];
        for (int r = 0; r < 8; r++)
            for (int c = 0; c < 8; c++)
                exp[r][c] = (uint8_t) t[c];
        fail |= check_per_cell(buf, "Vertical (mode 2)", exp);
    }

    /* --- Mode 0 DC: per-quadrant. Test with distinct halves so any
     * quadrant mix-up surfaces immediately.
     *
     *   top[0..3]  = 4 × 8  → sum_top_lo  = 32
     *   top[4..7]  = 4 × 16 → sum_top_hi  = 64
     *   left[0..3] = 4 × 24 → sum_left_lo = 96
     *   left[4..7] = 4 × 40 → sum_left_hi = 160
     *
     *   dc00 = (32 +  96 + 4) >> 3 = 132/8 = 16
     *   dc01 = (64       + 2) >> 2 =  66/4 = 16
     *   dc10 = (     160 + 2) >> 2 = 162/4 = 40
     *   dc11 = (64 + 160 + 4) >> 3 = 228/8 = 28
     */
    {
        uint8_t buf[ROWS][STRIDE];
        int t[8] = {  8,  8,  8,  8, 16, 16, 16, 16 };
        int l[8] = { 24, 24, 24, 24, 40, 40, 40, 40 };
        set_ctx(buf, 99, t, l);
        daedalus_h264_pred_chroma8x8_dc(&buf[1][1], STRIDE);
        uint8_t exp[8][8] = {
            {16,16,16,16, 16,16,16,16},
            {16,16,16,16, 16,16,16,16},
            {16,16,16,16, 16,16,16,16},
            {16,16,16,16, 16,16,16,16},
            {40,40,40,40, 28,28,28,28},
            {40,40,40,40, 28,28,28,28},
            {40,40,40,40, 28,28,28,28},
            {40,40,40,40, 28,28,28,28},
        };
        fail |= check_per_cell(buf, "DC quadrants (mode 0)", exp);
    }

    /* --- Mode 3 Plane (uniform): H = V = 0; a = 16 * (100 + 100) = 3200.
     * pred[y][x] = (3200 + 0 + 0 + 16) >> 5 = 3216 >> 5 = 100. */
    {
        uint8_t buf[ROWS][STRIDE];
        int t[8], l[8];
        for (int i = 0; i < 8; i++) { t[i] = 100; l[i] = 100; }
        set_ctx(buf, 100, t, l);
        daedalus_h264_pred_chroma8x8_plane(&buf[1][1], STRIDE);
        uint8_t exp[8][8];
        for (int r = 0; r < 8; r++)
            for (int c = 0; c < 8; c++)
                exp[r][c] = 100;
        fail |= check_per_cell(buf, "Plane uniform (mode 3)", exp);
    }

    /* --- Mode 3 Plane gradient sanity ---
     * t = 0..7, l = 0..7, tl = 0.
     *   H = 1*(t[4]-t[2]) + 2*(t[5]-t[1]) + 3*(t[6]-t[0]) + 4*(t[7]-tl)
     *     = 1*(4-2) + 2*(5-1) + 3*(6-0) + 4*(7-0)
     *     = 2 + 8 + 18 + 28 = 56
     *   V = same shape on left = 56
     *   b = (34*56 + 32) >> 6 = 1936 >> 6 = 30
     *   c = 30
     *   a = 16 * (l[7] + t[7]) = 16 * (7 + 7) = 224
     *
     *   pred[0][0] = (224 + 30*(-3) + 30*(-3) + 16) >> 5
     *              = (224 - 90 - 90 + 16) >> 5
     *              = 60 >> 5 = 1
     *   pred[7][7] = (224 + 30*4 + 30*4 + 16) >> 5
     *              = (224 + 120 + 120 + 16) >> 5
     *              = 480 >> 5 = 15
     * Spot-check those two corners. */
    {
        uint8_t buf[ROWS][STRIDE];
        int t[8], l[8];
        for (int i = 0; i < 8; i++) { t[i] = i; l[i] = i; }
        set_ctx(buf, 0, t, l);
        daedalus_h264_pred_chroma8x8_plane(&buf[1][1], STRIDE);
        uint8_t tl_actual = buf[1 + 0][1 + 0];
        uint8_t br_actual = buf[1 + 7][1 + 7];
        int spot_fail = 0;
        if (tl_actual != 1) {
            fprintf(stderr, "Plane gradient pred[0][0] = %u, expected 1\n", tl_actual);
            spot_fail = 1;
        }
        if (br_actual != 15) {
            fprintf(stderr, "Plane gradient pred[7][7] = %u, expected 15\n", br_actual);
            spot_fail = 1;
        }
        if (!spot_fail)
            printf(" %-30s PASS (corners 1, 15)\n", "Plane gradient (mode 3)");
        else
            printf(" %-30s FAIL\n", "Plane gradient (mode 3)");
        fail |= spot_fail;
    }
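
    /* fail is OR-accumulated, so it is only ever 0 or 1 and is not a
     * count of failing cases; do not print it as one. */
    if (fail == 0)
        printf("\nALL Intra_8x8 chroma mode references PASS\n");
    else
        fprintf(stderr, "\nOne or more tests FAILED\n");
    return fail ? 1 : 0;
}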