26 Commits

Author SHA1 Message Date
marfrit 432d127ea9 Merge pull request 'v3d_runner: SPV path search + bench preflight — RETRACTS PR #36's headline' (#37) from noether/spv-search-and-bench-retract into main
Reviewed-on: #37
2026-05-25 20:33:30 +00:00
claude-noether 1347fb961c v3d_runner: SPV path search + bench preflight — RETRACTS PR #36's headline
PR #36 reported a 4.30x QPU-over-CPU win for the H.264 1080p hot-path
sum.  That number was a measurement artifact.  This commit makes the
artifact impossible to reproduce by ANYONE running the bench again.

THE BUG
-------

v3d_runner read_spv() did fopen(spv_path, "rb") with no path search:
the caller passes a bare filename like "v3d_h264_idct4.spv" and fopen
resolves it relative to cwd.  The cmake build puts SPVs in $builddir
(e.g. ~/src/daedalus-fourier/build/), but the bench (and test_api_h264)
were typically invoked from ~/src/daedalus-fourier/, so fopen failed.

On failure read_spv printed perror and returned NULL; pipeline create
then returned -1; dispatch then returned -1; the bench loop ignored
the return value and timed the failure path.  Each iter cost ~1-5 µs
(open + perror + return), which divided across 256 ops gave ~4-20
ns/op, looking convincingly like real-but-fast QPU work.

PR #36's "QPU 2.47 ns/op" for IDCT 4x4 was that artifact.  PR #10's
much-slower "QPU 37.77 ms" measurement was REAL (SPV apparently found
that time, perhaps run from build/), so the artifact is what made it
look like the gap had closed.  The gap never closed.
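
A minimal sketch of a timing loop that cannot mistake a fast failure
for fast work, assuming a kernel callback that returns 0 on success.
bench_ns_checked, bench_fn and the two stub kernels are hypothetical
names for illustration, not the bench's actual code:

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <time.h>

/* Hypothetical kernel signature: 0 on success, non-zero on failure. */
typedef int (*bench_fn)(void *ctx);

/* Mean ns per call, or -1.0 if the kernel failed (nothing is reported). */
double bench_ns_checked(const char *label, bench_fn fn, void *ctx,
                        int warmup, int iters)
{
    /* Preflight: one untimed call whose return code is inspected. */
    int rc = fn(ctx);
    if (rc != 0) {
        fprintf(stderr, "%s: DISPATCH FAILED rc=%d, kernel skipped\n",
                label, rc);
        return -1.0;
    }
    for (int i = 0; i < warmup; i++)
        if (fn(ctx) != 0)
            return -1.0;

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < iters; i++)
        if (fn(ctx) != 0)
            return -1.0;    /* a mid-run failure invalidates the sample */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (double)(t1.tv_sec - t0.tv_sec) * 1e9
              + (double)(t1.tv_nsec - t0.tv_nsec);
    return ns / iters;
}

/* Stand-in kernels for illustration only. */
int stub_missing_spv(void *ctx) { (void)ctx; return -1; }
int stub_counting(void *ctx)    { ++*(int *)ctx; return 0; }
```

With this shape a missing SPV yields a negative sentinel rather than a
plausible-looking ns/op figure.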

CORRECTED NUMBERS
-----------------

Run from hertz (Pi 5 V3D 7.1, 30 iters x 5 warmup) AFTER this commit:

  kernel             CPU ns/op  QPU ns/op  winner
  IDCT 4x4 luma          10.75     217.63  CPU 20.24x
  IDCT 8x8 luma          29.69     785.94  CPU 26.47x
  Deblock luma_v         17.63     467.42  CPU 26.51x
  Deblock luma_h         38.30     498.53  CPU 13.02x
  qpel mc20 (8x8)        30.17    1300.44  CPU 43.10x
  qpel mc02 (8x8)        17.69    1363.40  CPU 77.08x
  qpel mc22 (8x8)        71.60    1948.37  CPU 27.21x

  1080p worst-case sum (IDCT4 + deblock luma + qpel mc22):
    CPU NEON only:   5.57 ms
    QPU only:      123.54 ms
    Ratio:        CPU/QPU sum = 0.05x  (QPU 22x SLOWER than CPU)

QPU is currently 13-77x slower per kernel.  The post-buffer-pool /
post-persistent-cmdbuf dispatch overhead (tasks #160, #161) did NOT
close the gap with NEON.  Whether those tasks helped at all needs
re-measurement — the previous "we saw a big win" reading was the
same artifact.

PR #36's commit-message claim "PR #10's verdict is reversed" is
withdrawn.  PR #10 was right; PR #36 was wrong.

THE FIX
-------

Two changes:

1. v3d_runner: SPV search now tries, in order:
     - cwd (legacy)
     - $DAEDALUS_SHADER_DIR (env override)
     - <readlink /proc/self/exe>/.. (binary-relative)
     - /opt/fourier/share/daedalus-fourier/ (Pi 5 install)
     - /usr/share/daedalus-fourier/ (system-wide)
   Found-anywhere succeeds silently.  Found-nowhere prints one error
   naming all searched locations.

2. bench_h264_primitives: bench_fn now returns int.  bench_ns does
   a single preflight call; if rc != 0 it prints "DISPATCH FAILED
   rc=N — kernel skipped" and bails on the kernel.  Main loop counts
   QPU failures and exits 2 BEFORE printing the comparison table if
   any kernel failed — so the next person running this can't read
   fail-fast timings as substrate numbers.
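
The search order in item 1 can be sketched as below.  This is an
illustration under assumptions, not v3d_runner's code: open_spv and
try_dir are hypothetical names, and the binary-relative entry is
approximated as the directory that contains the executable:

```c
#define _POSIX_C_SOURCE 200809L
#include <libgen.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Try "<dir>/<name>"; dir == NULL means "relative to cwd". */
static FILE *try_dir(const char *dir, const char *name,
                     char *found, size_t found_len)
{
    if (dir)
        snprintf(found, found_len, "%s/%s", dir, name);
    else
        snprintf(found, found_len, "%s", name);
    return fopen(found, "rb");
}

/* Returns an open file and writes the winning path to `found`,
 * or returns NULL after printing one error listing every location. */
FILE *open_spv(const char *name, char *found, size_t found_len)
{
    char exe[4096];
    const char *dirs[5];
    int n = 0;

    dirs[n++] = NULL;                               /* cwd (legacy) */
    const char *env = getenv("DAEDALUS_SHADER_DIR");
    if (env && *env)
        dirs[n++] = env;                            /* env override */
    ssize_t len = readlink("/proc/self/exe", exe, sizeof exe - 1);
    if (len > 0) {
        exe[len] = '\0';                /* readlink does not terminate */
        dirs[n++] = dirname(exe);       /* may modify exe in place */
    }
    dirs[n++] = "/opt/fourier/share/daedalus-fourier";
    dirs[n++] = "/usr/share/daedalus-fourier";

    for (int i = 0; i < n; i++) {
        FILE *f = try_dir(dirs[i], name, found, found_len);
        if (f)
            return f;                   /* found anywhere: stay silent */
    }

    fprintf(stderr, "%s: not found; searched:", name);
    for (int i = 0; i < n; i++)
        fprintf(stderr, " %s", dirs[i] ? dirs[i] : ".");
    fputc('\n', stderr);
    if (found_len)
        found[0] = '\0';
    return NULL;
}
```

Returning the resolved path alongside the handle lets a caller log
which copy of a shader was actually loaded.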

POLICY IMPLICATIONS
-------------------

The QPU substrate decree (2026-05-23) was conceived as a policy
choice that overrides per-kernel measurement.  With the corrected
data the gap is not a "fixable defect we'll close with one more
optimization"; it is more than an order of magnitude (13-77x per
kernel).  Whether to keep the
decree, soften it (auto = QPU only where measured advantage), or
revert is now a clear-eyed decision for the user.

This commit doesn't change the recipe table — that's a separate
question, taken on its own merits with this data in hand.

Related: marfrit-packages PR #104 (libavcodec ctx flipped no_qpu →
qpu-capable) was justified by PR #36's artifact and should be
reverted; that revert lands in a follow-up to marfrit-packages.
2026-05-25 21:45:12 +02:00
marfrit 9be02a9470 Merge pull request 'bench: H.264 primitive bench now measures both substrates + comparison table' (#36) from noether/h264-qpu-bench-and-cleanup into main
Reviewed-on: #36
2026-05-25 18:56:01 +00:00
claude-noether 989818c2e6 bench: H.264 primitive bench now measures both substrates + comparison table
Closes task #166 (re-measure R-bands on post-buffer-pool dispatch path).

Now that all H.264 hot-path primitives have QPU shaders and the
dispatch overhead has been hammered down (tasks #160 buffer pool,
#161 persistent command buffer), bench_h264_primitives no longer
measures one column.  Two passes — CPU NEON and QPU V3D7 compute —
with a side-by-side per-kernel comparison and ratio.

Headline result on hertz (Pi 5 V3D 7.1, 30 iters x 5 warmup):

  kernel             CPU ns/op  QPU ns/op  winner
  IDCT 4x4 luma          10.79       2.47  QPU 4.36x
  IDCT 8x8 luma          29.69       9.23  QPU 3.22x
  Deblock luma_v         17.58      10.21  QPU 1.72x
  Deblock luma_h         38.41       9.98  QPU 3.85x
  qpel mc20 (8x8)        28.24       9.66  QPU 2.92x
  qpel mc02 (8x8)        16.96      20.54  CPU 1.21x
  qpel mc22 (8x8)        71.58       9.64  QPU 7.43x

  1080p worst-case sum (IDCT4 + deblock luma + qpel mc22):
    CPU NEON only:  5.57 ms
    QPU only:       1.30 ms   (CPU/QPU sum ratio = 4.30x)

Reverses PR #10's verdict (which had CPU NEON 4x faster than QPU
for IDCT-only) — the buffer-pool + persistent-cmdbuf wins land
hard.  Only qpel mc02 still shows CPU ahead, marginally (single-
axis vertical filter, row-strided memory pattern unfriendly to the
WG layout — left as a follow-up for cycle-9-style targeted tuning).

Substrate decree (2026-05-23) stays in force as policy — these
numbers retroactively justify it.

Also tightens test_api_h264's startup recipe print by fixing the stale
"(CPU)" / "(CPU, no QPU H shader yet)" / "(CPU, bS=4 set)" labels
next to deblock_lh, deblock_cv, deblock_ch and deblock_*_intra,
which have been wrong since PRs #28, #29, #35 (those kernels are on QPU).
2026-05-25 20:42:39 +02:00
marfrit 1446b779a6 Merge pull request 'h264: V3D shaders for the 4 bS=4 intra deblock variants — deblock QPU complete' (#35) from noether/v3d-shader-h264-deblock-intra into main
Reviewed-on: #35
2026-05-25 18:36:10 +00:00
claude-noether c2d1e9790e h264: V3D shaders for the 4 bS=4 intra deblock variants — deblock QPU complete
Closes the H.264 deblock QPU coverage matrix.  Adds the 4 intra
(bS=4) variants — luma_v/h_intra + chroma_v/h_intra.

Algorithmically distinct from the bS<4 path:
  - Per-side strong/weak filter selector
      strong_p = (|p2-p0| < β) AND (|p0-q0| < (α>>2) + 2)
      strong_q = (|q2-q0| < β) AND (|p0-q0| < (α>>2) + 2)
  - Strong-p updates p0/p1/p2 with 5-/4-/3-tap blends (reads p3)
  - Weak-p updates p0 only with (2*p1 + p0 + q1 + 2) >> 2
  - Mirror for q-side; no tc0 (bS=4 hardcodes the strength)
  - Chroma always weak, only p0/q0 updated (same as bS<4 chroma)

Per H.264 §8.7.2.4 (filtering for bS equal to 4).  Transcribed from PR #11's C reference
(tests/h264_intra_loop_filter_ref.c).
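
A minimal sketch of the bS=4 luma filter for one line of samples
across the edge, following the selector above and the blend formulas
from the H.264 standard.  deblock_luma_intra_line and the 8-sample
array layout are hypothetical, chosen for illustration; the alpha/beta
edge gate at the top comes from the standard, not from this message:

```c
#include <stdint.h>
#include <stdlib.h>

/* s[0..7] = p3 p2 p1 p0 | q0 q1 q2 q3 */
void deblock_luma_intra_line(uint8_t s[8], int alpha, int beta)
{
    const int p3 = s[0], p2 = s[1], p1 = s[2], p0 = s[3];
    const int q0 = s[4], q1 = s[5], q2 = s[6], q3 = s[7];

    /* Edge gate: leave genuine image edges untouched. */
    if (abs(p0 - q0) >= alpha || abs(p1 - p0) >= beta ||
        abs(q1 - q0) >= beta)
        return;

    /* Per-side strong/weak selector. */
    const int small_step = abs(p0 - q0) < (alpha >> 2) + 2;
    const int strong_p = small_step && abs(p2 - p0) < beta;
    const int strong_q = small_step && abs(q2 - q0) < beta;

    if (strong_p) {                       /* updates p0, p1, p2; reads p3 */
        s[3] = (uint8_t)((p2 + 2*p1 + 2*p0 + 2*q0 + q1 + 4) >> 3);
        s[2] = (uint8_t)((p2 + p1 + p0 + q0 + 2) >> 2);
        s[1] = (uint8_t)((2*p3 + 3*p2 + p1 + p0 + q0 + 4) >> 3);
    } else {                              /* weak: p0 only */
        s[3] = (uint8_t)((2*p1 + p0 + q1 + 2) >> 2);
    }

    if (strong_q) {                       /* mirror for the q side */
        s[4] = (uint8_t)((p1 + 2*p0 + 2*q0 + 2*q1 + q2 + 4) >> 3);
        s[5] = (uint8_t)((p0 + q0 + q1 + q2 + 2) >> 2);
        s[6] = (uint8_t)((2*q3 + 3*q2 + q1 + q0 + p0 + 4) >> 3);
    } else {
        s[4] = (uint8_t)((2*q1 + q0 + p1 + 2) >> 2);
    }
}
```

All new values are computed from the original samples before any are
written back, which is what makes the per-side asymmetry safe.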

Shaders:
  - v3d_h264deblock_luma_v_intra.comp  (luma 16-cell + strong/weak)
  - v3d_h264deblock_luma_h_intra.comp  (transpose of luma_v_intra)
  - v3d_h264deblock_chroma_v_intra.comp (8-cell, always weak)
  - v3d_h264deblock_chroma_h_intra.comp (transpose of chroma_v_intra)

Dispatch wiring:
  - 4 new pipeline pairs in daedalus_ctx
  - dispatch_h264_deblock_luma_intra_qpu helper (parameterised by
    orient_h for V vs H) — 2 wrappers
  - chroma intra reuses the existing dispatch_h264_deblock_chroma_qpu
    helper (same WG geometry as bS<4 chroma) — 2 wrappers
  - DEFINE_INTRA_DISPATCH macro extended with qpu_fn parameter,
    routes CPU/QPU per recipe table
  - Recipe table flips DAEDALUS_KERNEL_H264_DEBLOCK_*_INTRA from CPU
    to QPU

Verified on hertz:

  $ ./build/test_api_h264 | grep intra
    H.264 deblock luma v intra:   1024/1024 bytes bit-exact
    H.264 deblock luma h intra:   1024/1024 bytes bit-exact
    H.264 deblock chroma v intra:  256/256 bytes bit-exact
    H.264 deblock chroma h intra:  256/256 bytes bit-exact

All 4 PASS first try.  Strong/weak quad-tree selector + per-side
asymmetry would have surfaced any sign/shift/index mistake; passing
on all 4 (including the asymmetric writes-3-cells cases) means the
transcription from C is clean.

Deblock QPU coverage matrix — COMPLETE (8 of 8):

  bS<4 (non-intra):
    luma_v    ✓ cycle 8
    luma_h    ✓ PR #28
    chroma_v  ✓ PR #29
    chroma_h  ✓ PR #29

  bS=4 (intra, this PR):
    luma_v    ✓
    luma_h    ✓
    chroma_v  ✓
    chroma_h  ✓

The full H.264 8-bit 4:2:0 hot-path pixel-math layer is now on QPU
when daedalus is initialised with a QPU-capable context:
  - IDCT 4x4 / 8x8 ✓
  - All 8 deblock variants ✓
  - All 30 qpel positions (15 put_ + 15 avg_) ✓
2026-05-25 20:30:07 +02:00
marfrit e506ef0803 Merge pull request 'h264: V3D shaders for all 15 avg_ qpel positions — qpel QPU complete' (#34) from noether/v3d-shader-h264-qpel-avg into main
Reviewed-on: #34
2026-05-25 18:23:11 +00:00
claude-noether 2079fe39c6 h264: V3D shaders for all 15 avg_ qpel positions — qpel QPU complete
Generates 15 avg_ shader variants by templating from the existing
put_ shaders.  Each avg_ shader is identical to its put_ sibling
except the final write does an L2 average with the existing dst:

  put_:  dst[r,c] = result
  avg_:  dst[r,c] = (dst[r,c] + result + 1) >> 1

Per H.264 §8.4.2.3.1 (B-slice biprediction): caller pre-loads dst
with the list0 prediction; the avg_ call folds in list1.
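
The difference between the two final writes can be sketched in C as
follows; avg2, store_put and store_avg are hypothetical names used only
for this illustration:

```c
#include <stdint.h>

/* Rounded average of two 8-bit samples (the "L2" step). */
uint8_t avg2(int a, int b)
{
    return (uint8_t)((a + b + 1) >> 1);
}

/* put_: overwrite dst with the filter result. */
void store_put(uint8_t *dst, int result)
{
    *dst = (uint8_t)result;
}

/* avg_: fold the result into whatever dst already holds, e.g. a
 * list0 prediction written by an earlier put_ call. */
void store_avg(uint8_t *dst, int result)
{
    *dst = avg2(*dst, result);
}
```

For biprediction a caller would run the put_ variant for list0 and the
avg_ variant for list1 on the same dst.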

Generated via python (avg-shader-gen.py): reads each
v3d_h264_qpel_mcXY.comp, transforms the docstring header + final
write hunk, writes v3d_h264_qpel_avg_mcXY.comp.  ~88 lines each;
15 new shader files.

Dispatch reuses the existing dispatch_h264_qpel_diag_qpu helper for
all 15 — same src envelope (10*stride+11 covers any (r±1, c±1)
shift), the L2 step only touches dst.  Slightly over-allocates for
the simpler positions (avg_mc20/02/10/30/01/03) but negligible
cost.  Eliminates 15 wrappers + 15 src_max bound calculations that
would otherwise duplicate.

CMake foreach loops compile + install 15 new SPV files.  ctx grows
15 pipeline pairs.  Recipe table flips DAEDALUS_KERNEL_H264_QPEL_AVG_*
from CPU to QPU.  Public dispatchers re-defined via the existing
DEFINE_QPEL_DIAG_PUBLIC macro (replaces the CPU-only
DEFINE_QPEL_DISPATCH instantiations).

Verified on hertz:

  $ ./build/test_api_h264 | grep "qpel avg" | wc -l
  15
  $ ./build/test_api_h264 | grep "qpel avg" | grep -c "100.0000%"
  15

All 15 PASS 2048/2048 bytes bit-exact via QPU.

QPU coverage for the H.264 8-bit 4:2:0 hot-path pixel kernels:

  Layer                Coverage
  ─────────────────────────────────────────────────────────────
  IDCT 4x4 luma        ✓ cycle 6 (one QPU shader, also handles chroma)
  IDCT 8x8 luma        ✓ cycle 7
  Chroma DC Hadamard   CPU only (4 adds + 4 subs; not worth)
  Deblock luma_v       ✓ cycle 8
  Deblock luma_h       ✓ PR #28
  Deblock chroma_v/h   ✓ PR #29
  Deblock *_intra      CPU only (less common, structurally different)
  qpel put_ 15 pos     ✓ cycle 9 (mc20) + PRs #30-#33
  qpel avg_ 15 pos     ✓ THIS PR

The H.264 non-intra-deblock hot path is now FULLY on QPU for any
consumer that initialises daedalus with a QPU-capable context.
2026-05-25 20:22:33 +02:00
marfrit 55d3618408 Merge pull request 'h264: V3D shaders for the 8 diagonal qpel positions' (#33) from noether/v3d-shader-h264-qpel-diagonals into main
Reviewed-on: #33
2026-05-25 18:16:53 +00:00
claude-noether 746533582e h264: V3D shaders for the 8 diagonal qpel positions
Closes the put_ qpel QPU matrix.  Adds mc11/12/13/21/23/31/32/33 —
each composes two half-pel anchor outputs via L2 rounded-average:

  mc11 ¼¼ : avg(mc20[r,   c],   mc02[r,   c])
  mc12 ¼½ : avg(mc22[r,   c],   mc02[r,   c])
  mc13 ¼¾ : avg(mc20[r+1, c],   mc02[r,   c])
  mc21 ½¼ : avg(mc22[r,   c],   mc20[r,   c])
  mc23 ½¾ : avg(mc22[r,   c],   mc20[r+1, c])
  mc31 ¾¼ : avg(mc20[r,   c],   mc02[r,   c+1])
  mc32 ¾½ : avg(mc22[r,   c],   mc02[r,   c+1])
  mc33 ¾¾ : avg(mc20[r+1, c],   mc02[r,   c+1])

Per-lane structure: each lane runs the FULL cascade for BOTH anchors
at its own (r, c) target, then L2 averages.  No shared memory.
Shaders inline hpel_h() / hpel_v() / hpel_hv() helpers (the latter
does the 13×6 int16 cascade per cell).  ~88 lines each.
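
A scalar C sketch of the per-lane computation, with the composition
table transcribed from the list above.  The helper names mirror the
ones mentioned in this message, but the bodies, the DIAG table layout
and qpel_diag itself are assumptions for illustration, not the shader
source.  The code relies on arithmetic right shift of negative ints:

```c
#include <stdint.h>

static int clip255(int v) { return v < 0 ? 0 : (v > 255 ? 255 : v); }

/* H.264 6-tap kernel (1, -5, 20, 20, -5, 1), unscaled. */
static int tap6(int a, int b, int c, int d, int e, int f)
{
    return a - 5*b + 20*c + 20*d - 5*e + f;
}

int hpel_h(const uint8_t *s, int stride, int r, int c)
{
    const uint8_t *p = s + r*stride + c;
    return clip255((tap6(p[-2], p[-1], p[0], p[1], p[2], p[3]) + 16) >> 5);
}

int hpel_v(const uint8_t *s, int stride, int r, int c)
{
    const uint8_t *p = s + r*stride + c;
    return clip255((tap6(p[-2*stride], p[-stride], p[0],
                         p[stride], p[2*stride], p[3*stride]) + 16) >> 5);
}

int hpel_hv(const uint8_t *s, int stride, int r, int c)
{
    int t[6];                           /* unclipped horizontal results */
    for (int k = 0; k < 6; k++) {
        const uint8_t *p = s + (r - 2 + k)*stride + c;
        t[k] = tap6(p[-2], p[-1], p[0], p[1], p[2], p[3]);
    }
    return clip255((tap6(t[0], t[1], t[2], t[3], t[4], t[5]) + 512) >> 10);
}

enum anchor { A_H, A_V, A_HV };         /* mc20, mc02, mc22 */
struct diag { enum anchor a; int a_dr, a_dc; enum anchor b; int b_dr, b_dc; };

static const struct diag DIAG[8] = {
    /* mc11 */ { A_H,  0, 0, A_V, 0, 0 },
    /* mc12 */ { A_HV, 0, 0, A_V, 0, 0 },
    /* mc13 */ { A_H,  1, 0, A_V, 0, 0 },
    /* mc21 */ { A_HV, 0, 0, A_H, 0, 0 },
    /* mc23 */ { A_HV, 0, 0, A_H, 1, 0 },
    /* mc31 */ { A_H,  0, 0, A_V, 0, 1 },
    /* mc32 */ { A_HV, 0, 0, A_V, 0, 1 },
    /* mc33 */ { A_H,  1, 0, A_V, 0, 1 },
};

static int anchor_at(enum anchor k, const uint8_t *s, int stride, int r, int c)
{
    if (k == A_H) return hpel_h(s, stride, r, c);
    if (k == A_V) return hpel_v(s, stride, r, c);
    return hpel_hv(s, stride, r, c);
}

/* pos indexes DIAG in the order listed above (0 = mc11 ... 7 = mc33). */
int qpel_diag(int pos, const uint8_t *s, int stride, int r, int c)
{
    const struct diag *d = &DIAG[pos];
    int a = anchor_at(d->a, s, stride, r + d->a_dr, c + d->a_dc);
    int b = anchor_at(d->b, s, stride, r + d->b_dr, c + d->b_dc);
    return (a + b + 1) >> 1;            /* L2 rounded average */
}
```

The caller must guarantee that rows r-2..r+4 and columns c-2..c+4 are
readable for every position it requests.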

Shaders generated from a python template (POSITIONS table + format
string) — the 8 .comp files are 1:1 with the C reference's
DEFINE_DIAG_REF macro from fourier PR #18.

Dispatch plumbing: shared dispatch_h264_qpel_diag_qpu helper covers
all 8 (same src envelope as mc22: src_max = src_off + 10*stride + 11,
covering rows -2..+10 and cols -2..+10 for any (r±1, c±1) offset).

Recipe table: all 8 DAEDALUS_KERNEL_H264_QPEL_MC{11..33} flipped to
QPU.  Public dispatchers re-defined via DEFINE_QPEL_DIAG_PUBLIC macro
(replaces the old DEFINE_QPEL_DISPATCH which fast-failed QPU).

Verified on hertz:

  $ ./build/test_api_h264 | grep "qpel mc[1-3][1-3]"
    H.264 qpel mc11: 2048/2048 bytes bit-exact (100.0000%)
    H.264 qpel mc12: 2048/2048 bytes bit-exact (100.0000%)
    H.264 qpel mc13: 2048/2048 bytes bit-exact (100.0000%)
    H.264 qpel mc21: 2048/2048 bytes bit-exact (100.0000%)
    H.264 qpel mc23: 2048/2048 bytes bit-exact (100.0000%)
    H.264 qpel mc31: 2048/2048 bytes bit-exact (100.0000%)
    H.264 qpel mc32: 2048/2048 bytes bit-exact (100.0000%)
    H.264 qpel mc33: 2048/2048 bytes bit-exact (100.0000%)

  Meaningful: the (r±1, c±1) offsets are easy to transpose between
  positions; passing first try on the asymmetric variants (mc13/23/31/33)
  means the position-specific shifts are correct in all 8 templates.

put_ qpel QPU matrix is now COMPLETE: 15 of 15 useful positions
(mc00 = integer copy, no shader needed).  avg_ qpel positions
(15 more) remain on CPU NEON; can land as a follow-up since avg_
is just put_ + one extra L2 against existing dst.

  put_  mc20 ✓  mc02 ✓  mc22 ✓  (anchors)
        mc10 ✓  mc30 ✓  mc01 ✓  mc03 ✓  (single-axis ¼-pel)
        mc11 ✓  mc12 ✓  mc13 ✓  (this PR — row-1 diagonals)
        mc21 ✓                    mc23 ✓  (this PR — row-2 diagonals)
        mc31 ✓  mc32 ✓  mc33 ✓  (this PR — row-3 diagonals)
  avg_  all 15 — CPU NEON
2026-05-25 19:14:42 +02:00
marfrit 224f4be9e2 Merge pull request 'h264: V3D shaders for the 4 single-axis quarter-pel qpel variants' (#32) from noether/v3d-shader-h264-qpel-quarter-axis into main
Reviewed-on: #32
2026-05-25 17:09:00 +00:00
claude-noether e3c28495ae h264: V3D shaders for the 4 single-axis quarter-pel qpel variants
mc10 (¼-H), mc30 (¾-H), mc01 (¼-V), mc03 (¾-V).  Each is the
corresponding half-pel filter (mc20 or mc02) with one extra L2
rounded-average step against an integer-source pixel at the tail:

  mc10[r,c] = avg(clip255(mc20(s)), s[r,   c   ])
  mc30[r,c] = avg(clip255(mc20(s)), s[r,   c+1])
  mc01[r,c] = avg(clip255(mc02(s)), s[r,   c  ])
  mc03[r,c] = avg(clip255(mc02(s)), s[r+1, c  ])

Each shader is ~45 lines (mc20-/mc02-pattern + 1 L2 line).

CMake foreach loop generates the 4 SPV compile rules.  Dispatch
helper `dispatch_h264_qpel_axis_qpu` shares plumbing across all 4
(axis flag selects src_max bounds: H reads cols -2..+10, V reads
rows -2..+10).  DEFINE_QPEL_AXIS_QPU + DEFINE_QPEL_DISPATCH_QPU
macros collapse ~200 LOC of boilerplate.

Recipe table flips DAEDALUS_KERNEL_H264_QPEL_MC{10,30,01,03} from
CPU to QPU.

Verified on hertz:

  $ ./build/test_api_h264 | grep "qpel mc[01230]"
    H.264 qpel mc10: 2048/2048 bytes bit-exact (100.0000%)
    H.264 qpel mc30: 2048/2048 bytes bit-exact (100.0000%)
    H.264 qpel mc01: 2048/2048 bytes bit-exact (100.0000%)
    H.264 qpel mc03: 2048/2048 bytes bit-exact (100.0000%)
    (+ mc20/mc02/mc22 anchors from previous PRs)

Qpel QPU coverage:

  put_  mc20 ✓  mc02 ✓  mc22 ✓                                  (3 anchors)
        mc10 ✓  mc30 ✓  mc01 ✓  mc03 ✓                          (4 quarter-axis, THIS PR)
        mc11/12/13/21/23/31/32/33 — CPU NEON                    (8 diagonals)
  avg_  all 15 positions — CPU NEON

7 of 15 useful put_ positions now on QPU.  The 8 diagonals each
compose two half-pel results via L2; can land via dedicated kernels
or by chaining existing anchor dispatches (the latter would need
the L2 step as a fourth dispatch — probably cheaper to write
dedicated 8x diagonal shaders).
2026-05-25 19:04:26 +02:00
marfrit 8b8e8dc6e8 Merge pull request 'h264: V3D shader for qpel mc22 (2D half-pel 'j' position)' (#31) from noether/v3d-shader-h264-qpel-mc22 into main
Reviewed-on: #31
2026-05-25 17:00:27 +00:00
claude-noether 02d564b43e h264: V3D shader for qpel mc22 (2D half-pel "j" position)
Cascaded H+V 6-tap filter per H.264 §8.4.2.2.1.  Highest per-frame
impact among missing qpel positions (PR #24 bench: 71.5 ns/block
NEON, 2.33 ms/frame worst-case all-mc22 at 1080p).

Per-lane structure: each lane runs the FULL cascade independently —
computes 6 horizontal lowpass int16 intermediates at rows r-2..r+3
of its column, then a vertical lowpass on those with +512 >> 10
final scale.  ~50 ALU ops per lane.

Design choice: NO shared memory / barriers.  Alternative was to
cache the h-lowpass intermediates in shared memory (13 rows × 8 cols
of int16 per WG), trading shared-memory bank pressure + a barrier
for ~6× less h-lowpass compute.  V3D L2 absorbs the redundant src
reads across lanes; the per-lane compute is cheap (multiply-add ALU
units idle anyway during dst write).  Simpler shader, fewer SPIR-V
ops, easier to extend to mc12/mc21/etc. later.

CANNOT just cascade mc20 → mc02 because the intermediate must be
int16 (no per-stage clip): the +512 >> 10 final scale assumes both
6-tap scalings preserved through the pipeline.  Dedicated kernel.
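
The point about the unclipped intermediate can be checked with a small
C sketch.  mc22_exact keeps the horizontal results at full scale in
int16; mc22_clipped_stages rounds and clips each stage to 8 bits, as a
naive mc20 then mc02 chain would.  Both function names and the flat
6x6 neighbourhood layout are hypothetical, and the code relies on
arithmetic right shift of negative ints:

```c
#include <stdint.h>

static int clip255(int v) { return v < 0 ? 0 : (v > 255 ? 255 : v); }

static int tap6(int a, int b, int c, int d, int e, int f)
{
    return a - 5*b + 20*c + 20*d - 5*e + f;
}

/* n[k*6 + j]: source rows r-2..r+3 (k) and columns c-2..c+3 (j). */
int mc22_exact(const uint8_t *n)
{
    int16_t t[6];                       /* range -2550..10710 fits int16 */
    for (int k = 0; k < 6; k++)
        t[k] = (int16_t)tap6(n[k*6+0], n[k*6+1], n[k*6+2],
                             n[k*6+3], n[k*6+4], n[k*6+5]);
    return clip255((tap6(t[0], t[1], t[2], t[3], t[4], t[5]) + 512) >> 10);
}

int mc22_clipped_stages(const uint8_t *n)
{
    int h[6];                           /* each stage rounded and clipped */
    for (int k = 0; k < 6; k++)
        h[k] = clip255((tap6(n[k*6+0], n[k*6+1], n[k*6+2],
                             n[k*6+3], n[k*6+4], n[k*6+5]) + 16) >> 5);
    return clip255((tap6(h[0], h[1], h[2], h[3], h[4], h[5]) + 16) >> 5);
}
```

The two agree on smooth content and diverge once a horizontal
intermediate overshoots the 8-bit range on a row that the vertical
filter weights negatively.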

dispatch_h264_qpel_mc22_qpu mirrors the existing mc20/mc02 shape;
src_max = src_off + 10*stride + 11 covers both the V (rows -2..+10)
and H (cols -2..+10) read windows in one bound.

Recipe table flips DAEDALUS_KERNEL_H264_QPEL_MC22 from CPU to QPU.

Verified on hertz:

  $ ./build/test_api_h264 | grep qpel
    H.264 qpel mc20: 1024/1024 bytes bit-exact (100.0000%)
    H.264 qpel mc02: 2048/2048 bytes bit-exact (100.0000%)
    H.264 qpel mc22: 2048/2048 bytes bit-exact (100.0000%)

Qpel QPU coverage now: 3 anchors (mc20 H, mc02 V, mc22 HV) — these
are the half-pel "building blocks" the 12 other qpel positions
combine via L2 averaging.  Remaining variants (quarter-pel singles
mc01/03/10/30 and the 8 diagonals) can dispatch through the existing
shaders + a small L2-averaging compose step, or get dedicated kernels.
2026-05-25 18:52:39 +02:00
marfrit 2074a50554 Merge pull request 'h264: V3D shader for qpel mc02 (vertical half-pel)' (#30) from noether/v3d-shader-h264-qpel-mc02 into main
Reviewed-on: #30
2026-05-25 16:49:26 +00:00
claude-noether bc5edf656d h264: V3D shader for qpel mc02 (vertical half-pel)
Sibling of cycle 9's v3d_h264_qpel_mc20.comp.  Same 6-tap H.264 luma
half-pel filter, transposed to vertical orientation: filter reads
rows [-2..+3] of source per output pixel instead of cols.

Shader is ~58 lines (vs mc20's 86) — same WG geometry (64 lanes /
1 block per WG / 1 lane per output pixel).  The address arithmetic
flips: row_base = src_off + r*stride + c (mc20) → col_base =
src_off + c, then col_base + (r±N)*stride (mc02).

dispatch_h264_qpel_mc02_qpu mirrors the mc20 QPU dispatch; src_max
calculation differs since the V kernel reads rows -2..+10 of source
(13 rows × stride wide) vs mc20's cols -2..+10 (8 rows × stride+11).
For 8x8 blocks: src_max = src_off + 10*stride + 8.
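
A scalar C sketch of the vertical addressing and the bound above.
put_mc02_8x8 and mc02_src_max are hypothetical names, and the loop
stands in for the 64 lanes of one workgroup; the caller is assumed to
guarantee src_off >= 2*stride.  Arithmetic right shift of negative
ints is assumed:

```c
#include <stddef.h>
#include <stdint.h>

static int clip255(int v) { return v < 0 ? 0 : (v > 255 ? 255 : v); }

/* One-past-the-end source offset that an 8x8 mc02 block may read. */
size_t mc02_src_max(size_t src_off, size_t stride)
{
    return src_off + 10 * stride + 8;
}

void put_mc02_8x8(uint8_t *dst, size_t dst_stride,
                  const uint8_t *src, size_t src_off, size_t stride)
{
    const ptrdiff_t s = (ptrdiff_t)stride;
    for (int r = 0; r < 8; r++) {
        for (int c = 0; c < 8; c++) {
            /* mc02 addressing: fix the column, then step whole rows. */
            const uint8_t *col = src + src_off + c;
            int sum = col[(r - 2) * s] - 5 * col[(r - 1) * s]
                    + 20 * col[r * s] + 20 * col[(r + 1) * s]
                    - 5 * col[(r + 2) * s] + col[(r + 3) * s];
            dst[(size_t)r * dst_stride + (size_t)c] =
                (uint8_t)clip255((sum + 16) >> 5);
        }
    }
}
```

The highest byte touched is at src_off + 10*stride + 7 (row 7+3,
column 7), which is why the exclusive bound ends in + 8.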

Recipe table flips DAEDALUS_KERNEL_H264_QPEL_MC02 from CPU to QPU.

Verified on hertz:

  $ ./build/test_api_h264 | grep qpel
    H.264 qpel mc20: 1024/1024 bytes bit-exact (100.0000%)
    H.264 qpel mc02: 2048/2048 bytes bit-exact (100.0000%)

QPU coverage for the 30 qpel positions:
  put_  mc20 ✓ (cycle 9)   mc02 ✓ (this PR)
        all 13 other put_  CPU NEON
  avg_  all 15 positions   CPU NEON

Next-priority candidates by per-frame impact (per PR #24 bench):
  mc22 (2D half-pel)  — 71.5 ns/block NEON × 32 640 blocks worst
                        case = 2.33 ms/frame at 1080p.  Most-used
                        qpel position in real H.264 streams.
  mc11/mc13/mc31/mc33 — corner ¼-pel positions, structurally similar
                        to mc20 + mc02 with L2 averaging.
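
The worst-case mc22 figure above is plain arithmetic, sketched here in
C.  blocks_8x8 and worst_case_ms are hypothetical helpers, and padding
the frame to whole 16x16 macroblocks (1080 rows become 1088) is an
assumption that happens to reproduce the 32 640 block count:

```c
/* 8x8 luma blocks in a frame padded up to whole 16x16 macroblocks. */
long blocks_8x8(int width, int height)
{
    long mb_w = (width + 15) / 16;
    long mb_h = (height + 15) / 16;
    return mb_w * mb_h * 4;             /* four 8x8 blocks per macroblock */
}

/* Frame cost in ms if every block pays ns_per_block. */
double worst_case_ms(double ns_per_block, long blocks)
{
    return ns_per_block * (double)blocks / 1e6;
}
```

For 1920x1080 this gives 120 x 68 macroblocks, and 71.5 ns per block
lands at roughly 2.33 ms per frame.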

The cascaded H+V structure of mc22 means it can either share the
existing mc20 + mc02 shaders' L2 (compute mc20 into tmp, then mc02
on tmp) or get a dedicated 2-stage pipeline.  Follow-up.
2026-05-25 18:38:38 +02:00
marfrit 37b75b5813 Merge pull request 'h264: V3D shaders for chroma deblock V + H (4:2:0)' (#29) from noether/v3d-shader-h264-deblock-chroma into main
Reviewed-on: #29
2026-05-25 16:35:08 +00:00
claude-noether d8de7754fa h264: V3D shaders for chroma deblock V + H (4:2:0)
Adds the QPU shader pair for chroma_v / chroma_h deblock (non-intra
bS<4), siblings of the cycle 8 luma_v shader and PR #28's luma_h.
Brings deblock QPU coverage to 4 of 8 variants (all four non-intra):

  luma_v   ✓ cycle 8
  luma_h   ✓ PR #28
  chroma_v ✓ this PR
  chroma_h ✓ this PR
  *_intra  — CPU NEON (less common; smaller volume)

Per H.264 §8.7.2.3 (edges with bS < 4) the chroma kernel is simpler than luma: only p0/q0
updated (never p1/p2/q1/q2), tC = tc0_seg + 1 (no luma-style ap/aq
side bonus), 8 cells per edge (vs luma's 16).  Shader: 64 lines
vs luma_v's 108 — same WG geometry (16 edges × 16 lanes, lanes
8..15 of each edge early-return).
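
A minimal C sketch of the bS<4 chroma filter for one line of samples,
following the description above.  deblock_chroma_line and the 4-sample
layout are hypothetical; treating a negative tc0 as "segment not
filtered" is an assumption borrowed from common decoder practice, and
the alpha/beta gate comes from the standard.  Arithmetic right shift
of negative ints is assumed:

```c
#include <stdint.h>
#include <stdlib.h>

static int clip3(int lo, int hi, int v)
{
    return v < lo ? lo : (v > hi ? hi : v);
}

/* s[0..3] = p1 p0 | q0 q1 */
void deblock_chroma_line(uint8_t s[4], int alpha, int beta, int tc0)
{
    const int p1 = s[0], p0 = s[1], q0 = s[2], q1 = s[3];

    if (tc0 < 0)
        return;                         /* segment not filtered */
    if (abs(p0 - q0) >= alpha || abs(p1 - p0) >= beta ||
        abs(q1 - q0) >= beta)
        return;                         /* genuine edge: leave alone */

    const int tc = tc0 + 1;             /* no ap/aq side bonus for chroma */
    const int delta =
        clip3(-tc, tc, ((q0 - p0) * 4 + (p1 - q1) + 4) >> 3);

    s[1] = (uint8_t)clip3(0, 255, p0 + delta);   /* only p0 and q0 move */
    s[2] = (uint8_t)clip3(0, 255, q0 - delta);
}
```

p1 and q1 are read but never written, which matches the "only p0/q0
updated" property above.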

4:2:0-only: 4:2:2 chroma_h has a 16-row edge geometry that this
shader doesn't address; daedalus_dispatch_h264_deblock_chroma_h is
4:2:0-only by design, caller-side gating already covers this in the
libavcodec substitution arc (marfrit-packages PR #98).

Recipe table flips DAEDALUS_KERNEL_H264_DEBLOCK_CV / CH from CPU to
QPU.  dispatch_h264_deblock_chroma_qpu factored to share QPU
plumbing between V and H (orientation passed as a flag for the
dst_max calculation).

Verified on hertz:

  $ ./build/test_api_h264 | grep "deblock chroma [vh]:"
    H.264 deblock chroma v: 256/256 bytes bit-exact (100.0000%)
    H.264 deblock chroma h: 256/256 bytes bit-exact (100.0000%)

  Recipe substrate now reports 2 (QPU) for both CV and CH.

Coverage now:
                bS<4 QPU     bS=4 (intra)
  luma_v        ✓ cycle 8    CPU NEON
  luma_h        ✓ PR #28     CPU NEON
  chroma_v      ✓ this PR    CPU NEON
  chroma_h      ✓ this PR    CPU NEON

Intra (bS=4) variants stay CPU NEON.  Less common case, smaller
per-frame contribution, and the algorithm is structurally different
(no tc0; strong-vs-weak filter quad-tree).  Can land as a follow-up
PR if perf demands.
2026-05-25 17:10:34 +02:00
marfrit de9266a6eb Merge pull request 'h264: V3D shader for deblock_luma_h — first QPU port since cycle 9' (#28) from noether/v3d-shader-h264-deblock-luma-h into main
Reviewed-on: #28
2026-05-25 15:06:18 +00:00
claude-noether 3db059ffab h264: V3D shader for deblock_luma_h — first QPU port since cycle 9
Ports cycle 8's v3d_h264deblock.comp (V variant: one lane per column,
filtering vertically across the edge) to the H orientation (one lane
per row, filtering horizontally across the edge).  Same algorithm,
transposed access pattern:

  V variant: lane → column, reads/writes pix[±N*stride] (vertical I/O)
  H variant: lane → row,    reads/writes pix[±N]        (horizontal I/O)

  WG geometry unchanged: 256 invocations, 16 edges/WG, 16 lanes/edge.
  Lane-in-edge interpretation flips: column-index for V → row-index
  for H.  tc0 segment math unchanged (one tc0 byte per 4 lanes).
  dst_max calculation flips: V used dst_off + 3*stride + 16 (cols),
  H uses dst_off + 15*stride + 4 (rows).

Recipe table: DAEDALUS_KERNEL_H264_DEBLOCK_LH = QPU (was CPU).  AUTO
dispatch now picks QPU for the H edge as well as the V edge.  CPU
NEON path stays as the explicit-SUBSTRATE_CPU + has_qpu=0 fallback.

Verified on hertz (Pi 5 / V3D 7.1):

  $ ./build/test_api_h264 | grep luma_h
    H264_DEBLOCK_LH recipe substrate: 2     (was 1 — flipped to QPU)
    H.264 deblock luma h: 1024/1024 bytes bit-exact (100.0000%)

Bit-exact against the C reference (h264_h_loop_filter_luma_ref) on
8 tiles × 8 cols × 16 rows of random input.  Same correctness gate
as the cycle 8 V shader.

CMake plumbing: glslang rule for v3d_h264deblock_h.comp; new SPV
added to daedalus_shaders ALL list + install rule.  daedalus_ctx
gains a parallel h264deblock_h_pipe_ready / h264deblock_h_pipe pair
(can't share with V because pipelines bind a specific SPIR-V module
at create time).

What this changes for the substitution arc: PR #97's 0008-h264-
deblock-luma-h substitution patch already plumbed
daedalus_recipe_dispatch_h264_deblock_luma_h through libavcodec.
That path was NEON-by-recipe; with this PR it becomes QPU-by-recipe
(unless the libavcodec ctx is no-QPU per daedalus_ctx_create_no_qpu,
in which case it stays NEON — same shape as cycle 8's V shader).

Coverage state for H.264 8-bit 4:2:0 deblock kernels (QPU shaders):
  kernel       before          after this PR
  luma_v       ✓ cycle 8       ✓ now
  luma_h       —               ✓ THIS PR
  chroma_v/h   —               (CPU NEON; smaller tiles, lower-priority)
  *_intra (4)  —               (CPU NEON; less common)
2026-05-25 16:50:41 +02:00
marfrit 2faa849ce2 Merge pull request 'h264: promote remaining intra prediction modes (17) to public API' (#27) from noether/h264-intra-pred-rest-api into main
Reviewed-on: #27
2026-05-25 13:43:56 +00:00
claude-noether cb3aef3dac h264: promote remaining intra prediction modes (17) to public API
Follows PR #26 (Intra_4x4 luma) with the same promotion pattern for
the rest of the intra prediction primitive set:

  Intra_16x16 luma   (4 modes, PR #13) — V/H/DC/Plane
  Intra_8x8  chroma  (4 modes, PR #14) — DC/H/V/Plane (4:2:0)
  Intra_8x8  luma    (9 modes, PRs #21 + #22) — High profile,
                                                 with 1-2-1 pre-filter

3 file moves via `git mv`, ~17 function renames stripping the `_ref`
suffix.  Test binaries rewired to link daedalus_core instead of
compiling the (now moved) ref files directly.  No code change — pure
plumbing for substitution-arc consumers.

26 intra prediction modes total now in the public API after this PR.

Verified on hertz:

  test_intra_pred_16x16:    5/5  PASS
  test_intra_pred_chroma8x8: 5/5  PASS
  test_intra_pred_8x8_luma: 11/11 PASS

All via public symbols (test binaries linked against daedalus_core).

Unblocks marfrit-packages substitution arc patch 0014 — wires
H264PredContext.pred4x4[], pred16x16[], pred8x8[], pred8x8l[]
through daedalus alongside the existing IDCT / deblock / qpel / DC
Hadamard substitutions.

After 0014 lands, the libavcodec.so built by marfrit-packages will
have EVERY hot-path pixel-math kernel of an H.264 8-bit 4:2:0
decode routing through daedalus — the substitution arc is feature-
complete for the campaign target (Pi 5 Firefox YouTube playback).
2026-05-25 15:37:44 +02:00
marfrit 31c68d0d0e Merge pull request 'h264: promote Intra_4x4 luma prediction (9 modes) to public API' (#26) from noether/h264-intra-pred-4x4-api into main
Reviewed-on: #26
2026-05-25 13:35:56 +00:00
claude-noether df9e1c9d78 h264: promote Intra_4x4 luma prediction (9 modes) to public API
PR #12 added the 9 Intra_4x4 luma intra prediction modes as test-only
spec references in tests/.  This PR promotes them to public src/
symbols so consumers (the eventual marfrit-packages substitution-arc
patch 0014) can link against them.

  Moved: tests/h264_intra_pred_4x4_ref.c → src/h264_intra_pred_4x4.c
  Renamed: daedalus_h264_pred_4x4_<mode>_ref → daedalus_h264_pred_4x4_<mode>
           (9 functions: vertical/horizontal/dc/ddl/ddr/vr/hd/vl/hu)

The src/ implementation is byte-for-byte the same code as the
test-only ref; this PR is plain plumbing.  The test binary now
links against daedalus_core to pull in the public symbols (instead
of compiling the ref file directly), exercising the path that real
consumers will use.

Same promotion shape as PR #25 (chroma DC Hadamard).

Verified on hertz:

  $ ./build/test_intra_pred_4x4
    Vertical (mode 0)          PASS
    Horizontal (mode 1)        PASS
    DC (mode 2)                PASS
    DiagDownLeft (mode 3)      PASS
    DiagDownRight (mode 4)     PASS
    VerticalRight (mode 5)     PASS
    HorizontalDown (mode 6)    PASS
    VerticalLeft (mode 7)      PASS
    HorizontalUp (mode 8)      PASS
    VR asym (sanity)           PASS

  ALL 10 intra-4x4 mode references PASS

  $ nm -g build/libdaedalus_core.a | grep "T daedalus_h264_pred_4x4"
  (9 symbols exported)

Follow-ups (same promotion pattern, can land in parallel):
  - Intra_16x16 luma (4 modes, PR #13)
  - Intra_8x8 chroma (4 modes, PR #14)
  - Intra_8x8 luma (9 modes, PRs #21 + #22)

Once all 26 intra modes are in the public API, the marfrit-packages
substitution arc can route H264PredContext's pred function pointer
tables through daedalus alongside the IDCT / deblock / qpel / DC
Hadamard substitutions already in place.
2026-05-25 14:53:37 +02:00
marfrit b9f9ff2a89 Merge pull request 'h264: expose chroma DC 2x2 Hadamard as public API' (#25) from noether/h264-chroma-dc-hadamard-api into main
Reviewed-on: #25
2026-05-25 11:35:05 +00:00
claude-noether 1f07f3cd70 h264: expose chroma DC 2x2 Hadamard as public API
PR #23 added the Hadamard as a test-only spec reference; this PR
promotes it to a public symbol in src/ so consumers (the eventual
marfrit-packages substitution-arc patch 0011) can link against it.

  New: void daedalus_h264_chroma_dc_hadamard_2x2(int16_t c[4]);
       — operates in-place on 4 int16, no QP-dependent scaling
       (caller composes that themselves per §8.5.11.2).

The src/ implementation is byte-for-byte identical to the test-only
ref in tests/h264_chroma_dc_hadamard_ref.c (kept as a separate
spec-validation copy).  A new "public API parity" test case verifies
the two produce identical output for a non-trivial input.

Pure CPU primitive — no substrate-dispatch wrapper because the work
is 4 adds + 4 subs; the substrate machinery would cost more than
the kernel itself.
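
For reference, a 2x2 Hadamard in exactly 4 adds and 4 subs can be
sketched as below.  The function name is deliberately different from
the public symbol, and the row-major layout c = {c00, c01, c10, c11}
is an assumption for illustration:

```c
#include <stdint.h>

/* In-place 2x2 Hadamard: f = H * c * H with H = [[1, 1], [1, -1]]. */
void chroma_dc_hadamard_2x2_sketch(int16_t c[4])
{
    const int a = c[0] + c[1];          /* add */
    const int b = c[0] - c[1];          /* sub */
    const int d = c[2] + c[3];          /* add */
    const int e = c[2] - c[3];          /* sub */

    c[0] = (int16_t)(a + d);            /* add */
    c[1] = (int16_t)(b + e);            /* add */
    c[2] = (int16_t)(a - d);            /* sub */
    c[3] = (int16_t)(b - e);            /* sub */
}
```

Applying the transform twice returns four times the input, which is
the property the "double-Hadamard" test case below relies on.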

Verified on hertz:

  $ ./build/test_chroma_dc_hadamard
    all-uniform 5                    PASS
    col gradient [0,10,0,10]         PASS
    row gradient [0,0,10,10]         PASS
    anti-diagonal [10,0,0,10]        PASS
    asymmetric [1,2,3,4]             PASS
    sign-alternating [-5,5,-5,5]     PASS
    double-Hadamard = 4*orig         PASS
    public API parity vs _ref        PASS

  ALL chroma DC Hadamard tests PASS

  $ nm -g build/libdaedalus_core.a | grep chroma_dc_hadamard
  0000000000000000 T daedalus_h264_chroma_dc_hadamard_2x2

Unblocks marfrit-packages 0011 (substituting
H264DSPContext.chroma_dc_dequant_idct, which composes the Hadamard
+ qmul scaling).
2026-05-25 13:32:01 +02:00
52 changed files with 4237 additions and 291 deletions
+166 -26
@@ -284,6 +284,55 @@ if (DAEDALUS_BUILD_VULKAN)
VERBATIM
)
set(H264DEBLOCK_H_SPV ${CMAKE_BINARY_DIR}/v3d_h264deblock_h.spv)
add_custom_command(
OUTPUT ${H264DEBLOCK_H_SPV}
COMMAND ${GLSLANG_VALIDATOR} -V --target-env vulkan1.3
-o ${H264DEBLOCK_H_SPV}
${CMAKE_SOURCE_DIR}/src/v3d_h264deblock_h.comp
DEPENDS ${CMAKE_SOURCE_DIR}/src/v3d_h264deblock_h.comp
COMMENT "glslang: v3d_h264deblock_h.comp -> v3d_h264deblock_h.spv"
VERBATIM
)
set(H264DEBLOCK_CHROMA_V_SPV ${CMAKE_BINARY_DIR}/v3d_h264deblock_chroma_v.spv)
add_custom_command(
OUTPUT ${H264DEBLOCK_CHROMA_V_SPV}
COMMAND ${GLSLANG_VALIDATOR} -V --target-env vulkan1.3
-o ${H264DEBLOCK_CHROMA_V_SPV}
${CMAKE_SOURCE_DIR}/src/v3d_h264deblock_chroma_v.comp
DEPENDS ${CMAKE_SOURCE_DIR}/src/v3d_h264deblock_chroma_v.comp
COMMENT "glslang: v3d_h264deblock_chroma_v.comp -> .spv"
VERBATIM
)
set(H264DEBLOCK_CHROMA_H_SPV ${CMAKE_BINARY_DIR}/v3d_h264deblock_chroma_h.spv)
add_custom_command(
OUTPUT ${H264DEBLOCK_CHROMA_H_SPV}
COMMAND ${GLSLANG_VALIDATOR} -V --target-env vulkan1.3
-o ${H264DEBLOCK_CHROMA_H_SPV}
${CMAKE_SOURCE_DIR}/src/v3d_h264deblock_chroma_h.comp
DEPENDS ${CMAKE_SOURCE_DIR}/src/v3d_h264deblock_chroma_h.comp
COMMENT "glslang: v3d_h264deblock_chroma_h.comp -> .spv"
VERBATIM
)
# Intra (bS=4) deblock shaders — strong/weak filter selector per
# H.264 §8.7.2.4. 4 variants (luma_v/h + chroma_v/h).
foreach(_kind luma_v_intra luma_h_intra chroma_v_intra chroma_h_intra)
set(_spv ${CMAKE_BINARY_DIR}/v3d_h264deblock_${_kind}.spv)
add_custom_command(
OUTPUT ${_spv}
COMMAND ${GLSLANG_VALIDATOR} -V --target-env vulkan1.3
-o ${_spv}
${CMAKE_SOURCE_DIR}/src/v3d_h264deblock_${_kind}.comp
DEPENDS ${CMAKE_SOURCE_DIR}/src/v3d_h264deblock_${_kind}.comp
COMMENT "glslang: v3d_h264deblock_${_kind}.comp -> .spv"
VERBATIM
)
set(H264DEBLOCK_${_kind}_SPV ${_spv})
endforeach()
set(H264_IDCT4_SPV ${CMAKE_BINARY_DIR}/v3d_h264_idct4.spv)
add_custom_command(
OUTPUT ${H264_IDCT4_SPV}
@@ -317,7 +366,63 @@ if (DAEDALUS_BUILD_VULKAN)
VERBATIM
)
add_custom_target(daedalus_shaders ALL DEPENDS ${NOOP_SPV} ${IDCT8_SPV} ${LPF_SPV} ${MC_SPV} ${LPF8_SPV} ${CDEF_SPV} ${H264DEBLOCK_SPV} ${H264_IDCT4_SPV} ${H264_IDCT8_SPV} ${H264_QPEL_MC20_SPV})
set(H264_QPEL_MC02_SPV ${CMAKE_BINARY_DIR}/v3d_h264_qpel_mc02.spv)
add_custom_command(
OUTPUT ${H264_QPEL_MC02_SPV}
COMMAND ${GLSLANG_VALIDATOR} -V --target-env vulkan1.3
-o ${H264_QPEL_MC02_SPV}
${CMAKE_SOURCE_DIR}/src/v3d_h264_qpel_mc02.comp
DEPENDS ${CMAKE_SOURCE_DIR}/src/v3d_h264_qpel_mc02.comp
COMMENT "glslang: v3d_h264_qpel_mc02.comp -> v3d_h264_qpel_mc02.spv"
VERBATIM
)
set(H264_QPEL_MC22_SPV ${CMAKE_BINARY_DIR}/v3d_h264_qpel_mc22.spv)
add_custom_command(
OUTPUT ${H264_QPEL_MC22_SPV}
COMMAND ${GLSLANG_VALIDATOR} -V --target-env vulkan1.3
-o ${H264_QPEL_MC22_SPV}
${CMAKE_SOURCE_DIR}/src/v3d_h264_qpel_mc22.comp
DEPENDS ${CMAKE_SOURCE_DIR}/src/v3d_h264_qpel_mc22.comp
COMMENT "glslang: v3d_h264_qpel_mc22.comp -> v3d_h264_qpel_mc22.spv"
VERBATIM
)
# Quarter-pel single-axis variants (mc10/30/01/03) + diagonal
# variants (mc11/12/13/21/23/31/32/33) — each composes 1-2 half-pel
# results with optional L2 averaging. Same WG geometry as mc20/mc02.
foreach(_mc mc10 mc30 mc01 mc03 mc11 mc12 mc13 mc21 mc23 mc31 mc32 mc33)
set(_spv ${CMAKE_BINARY_DIR}/v3d_h264_qpel_${_mc}.spv)
add_custom_command(
OUTPUT ${_spv}
COMMAND ${GLSLANG_VALIDATOR} -V --target-env vulkan1.3
-o ${_spv}
${CMAKE_SOURCE_DIR}/src/v3d_h264_qpel_${_mc}.comp
DEPENDS ${CMAKE_SOURCE_DIR}/src/v3d_h264_qpel_${_mc}.comp
COMMENT "glslang: v3d_h264_qpel_${_mc}.comp -> .spv"
VERBATIM
)
set(H264_QPEL_${_mc}_SPV ${_spv})
endforeach()
# avg_ biprediction variants — same shader as put_ + extra L2 with
# existing dst. All 15 useful positions.
foreach(_mc mc20 mc02 mc22 mc10 mc30 mc01 mc03
mc11 mc12 mc13 mc21 mc23 mc31 mc32 mc33)
set(_spv ${CMAKE_BINARY_DIR}/v3d_h264_qpel_avg_${_mc}.spv)
add_custom_command(
OUTPUT ${_spv}
COMMAND ${GLSLANG_VALIDATOR} -V --target-env vulkan1.3
-o ${_spv}
${CMAKE_SOURCE_DIR}/src/v3d_h264_qpel_avg_${_mc}.comp
DEPENDS ${CMAKE_SOURCE_DIR}/src/v3d_h264_qpel_avg_${_mc}.comp
COMMENT "glslang: v3d_h264_qpel_avg_${_mc}.comp -> .spv"
VERBATIM
)
set(H264_QPEL_avg_${_mc}_SPV ${_spv})
endforeach()
add_custom_target(daedalus_shaders ALL DEPENDS ${NOOP_SPV} ${IDCT8_SPV} ${LPF_SPV} ${MC_SPV} ${LPF8_SPV} ${CDEF_SPV} ${H264DEBLOCK_SPV} ${H264DEBLOCK_H_SPV} ${H264DEBLOCK_CHROMA_V_SPV} ${H264DEBLOCK_CHROMA_H_SPV} ${H264DEBLOCK_luma_v_intra_SPV} ${H264DEBLOCK_luma_h_intra_SPV} ${H264DEBLOCK_chroma_v_intra_SPV} ${H264DEBLOCK_chroma_h_intra_SPV} ${H264_IDCT4_SPV} ${H264_IDCT8_SPV} ${H264_QPEL_MC20_SPV} ${H264_QPEL_MC02_SPV} ${H264_QPEL_MC22_SPV} ${H264_QPEL_mc10_SPV} ${H264_QPEL_mc30_SPV} ${H264_QPEL_mc01_SPV} ${H264_QPEL_mc03_SPV} ${H264_QPEL_mc11_SPV} ${H264_QPEL_mc12_SPV} ${H264_QPEL_mc13_SPV} ${H264_QPEL_mc21_SPV} ${H264_QPEL_mc23_SPV} ${H264_QPEL_mc31_SPV} ${H264_QPEL_mc32_SPV} ${H264_QPEL_mc33_SPV} ${H264_QPEL_avg_mc20_SPV} ${H264_QPEL_avg_mc02_SPV} ${H264_QPEL_avg_mc22_SPV} ${H264_QPEL_avg_mc10_SPV} ${H264_QPEL_avg_mc30_SPV} ${H264_QPEL_avg_mc01_SPV} ${H264_QPEL_avg_mc03_SPV} ${H264_QPEL_avg_mc11_SPV} ${H264_QPEL_avg_mc12_SPV} ${H264_QPEL_avg_mc13_SPV} ${H264_QPEL_avg_mc21_SPV} ${H264_QPEL_avg_mc23_SPV} ${H264_QPEL_avg_mc31_SPV} ${H264_QPEL_avg_mc32_SPV} ${H264_QPEL_avg_mc33_SPV})
# v3d_runner — reusable Vulkan plumbing.
add_library(v3d_runner STATIC src/v3d_runner.c)
@@ -391,6 +496,11 @@ endif()
add_library(daedalus_core STATIC
src/daedalus_core.c
src/h264_chroma_dc.c
src/h264_intra_pred_4x4.c
src/h264_intra_pred_16x16.c
src/h264_intra_pred_chroma8x8.c
src/h264_intra_pred_8x8_luma.c
src/v3d_runner.c
${FFASM_SOURCES}
${FFASM_LPF_SOURCES}
@@ -445,9 +555,45 @@ if (DAEDALUS_BUILD_VULKAN)
${LPF8_SPV}
${CDEF_SPV}
${H264DEBLOCK_SPV}
${H264DEBLOCK_H_SPV}
${H264DEBLOCK_CHROMA_V_SPV}
${H264DEBLOCK_CHROMA_H_SPV}
${H264DEBLOCK_luma_v_intra_SPV}
${H264DEBLOCK_luma_h_intra_SPV}
${H264DEBLOCK_chroma_v_intra_SPV}
${H264DEBLOCK_chroma_h_intra_SPV}
${H264_IDCT4_SPV}
${H264_IDCT8_SPV}
${H264_QPEL_MC20_SPV}
${H264_QPEL_MC02_SPV}
${H264_QPEL_MC22_SPV}
${H264_QPEL_mc10_SPV}
${H264_QPEL_mc30_SPV}
${H264_QPEL_mc01_SPV}
${H264_QPEL_mc03_SPV}
${H264_QPEL_mc11_SPV}
${H264_QPEL_mc12_SPV}
${H264_QPEL_mc13_SPV}
${H264_QPEL_mc21_SPV}
${H264_QPEL_mc23_SPV}
${H264_QPEL_mc31_SPV}
${H264_QPEL_mc32_SPV}
${H264_QPEL_mc33_SPV}
${H264_QPEL_avg_mc20_SPV}
${H264_QPEL_avg_mc02_SPV}
${H264_QPEL_avg_mc22_SPV}
${H264_QPEL_avg_mc10_SPV}
${H264_QPEL_avg_mc30_SPV}
${H264_QPEL_avg_mc01_SPV}
${H264_QPEL_avg_mc03_SPV}
${H264_QPEL_avg_mc11_SPV}
${H264_QPEL_avg_mc12_SPV}
${H264_QPEL_avg_mc13_SPV}
${H264_QPEL_avg_mc21_SPV}
${H264_QPEL_avg_mc23_SPV}
${H264_QPEL_avg_mc31_SPV}
${H264_QPEL_avg_mc32_SPV}
${H264_QPEL_avg_mc33_SPV}
DESTINATION ${CMAKE_INSTALL_DATADIR}/daedalus-fourier/shaders
)
endif()
@@ -537,38 +683,29 @@ add_executable(test_api_opportunistic_qpu tests/test_api_opportunistic_qpu.c)
target_link_libraries(test_api_opportunistic_qpu PRIVATE daedalus_core)
target_compile_options(test_api_opportunistic_qpu PRIVATE -O2)
# H.264 Intra_4x4 luma prediction (9 modes) — reference + tests.
# Pure CPU + spec-derived; no daedalus_core dependency yet (this is
# the bit-exact gate for the eventual shader / dispatch wiring).
add_executable(test_intra_pred_4x4
tests/test_intra_pred_4x4.c
tests/h264_intra_pred_4x4_ref.c
)
# H.264 Intra_4x4 luma prediction (9 modes) — public src primitives.
# The bodies now live in src/h264_intra_pred_4x4.c (linked into
# daedalus_core for use by libavcodec.so substitution-arc consumers).
# This test exercises the public symbols.
add_executable(test_intra_pred_4x4 tests/test_intra_pred_4x4.c)
target_link_libraries(test_intra_pred_4x4 PRIVATE daedalus_core)
target_compile_options(test_intra_pred_4x4 PRIVATE -O2)
# H.264 Intra_16x16 luma prediction (4 modes: V, H, DC, Plane) —
# reference + tests. Same spec-gate role as the 4x4 sibling.
add_executable(test_intra_pred_16x16
tests/test_intra_pred_16x16.c
tests/h264_intra_pred_16x16_ref.c
)
# H.264 Intra_16x16 luma prediction (4 modes) — public src primitives,
# linked from daedalus_core.
add_executable(test_intra_pred_16x16 tests/test_intra_pred_16x16.c)
target_link_libraries(test_intra_pred_16x16 PRIVATE daedalus_core)
target_compile_options(test_intra_pred_16x16 PRIVATE -O2)
# H.264 Intra_8x8 chroma prediction (4 modes: DC, H, V, Plane) —
# reference + tests. DC is per-quadrant (asymmetric); Plane uses
# slope coefficient 34 instead of luma's 5.
add_executable(test_intra_pred_chroma8x8
tests/test_intra_pred_chroma8x8.c
tests/h264_intra_pred_chroma8x8_ref.c
)
# H.264 Intra_8x8 chroma prediction (4 modes) — public src primitives.
add_executable(test_intra_pred_chroma8x8 tests/test_intra_pred_chroma8x8.c)
target_link_libraries(test_intra_pred_chroma8x8 PRIVATE daedalus_core)
target_compile_options(test_intra_pred_chroma8x8 PRIVATE -O2)
# H.264 Intra_8x8 luma prediction (High profile, 9 modes + 1-2-1
# reference-sample pre-filter).
add_executable(test_intra_pred_8x8_luma
tests/test_intra_pred_8x8_luma.c
tests/h264_intra_pred_8x8_luma_ref.c
)
# pre-filter) — public src primitives.
add_executable(test_intra_pred_8x8_luma tests/test_intra_pred_8x8_luma.c)
target_link_libraries(test_intra_pred_8x8_luma PRIVATE daedalus_core)
target_compile_options(test_intra_pred_8x8_luma PRIVATE -O2)
# H.264 chroma DC 2x2 Hadamard pre-pass primitive. Pure transform,
@@ -577,6 +714,9 @@ add_executable(test_chroma_dc_hadamard
tests/test_chroma_dc_hadamard.c
tests/h264_chroma_dc_hadamard_ref.c
)
# Links daedalus_core to pull in the public daedalus_h264_chroma_dc_hadamard_2x2
# symbol (for the public-API parity test added in this PR).
target_link_libraries(test_chroma_dc_hadamard PRIVATE daedalus_core)
target_compile_options(test_chroma_dc_hadamard PRIVATE -O2)
# H.264 primitives latency benchmark (NEON CPU baseline).
+82
@@ -544,6 +544,88 @@ DECLARE_QPEL_AVG(avg_mc33)
#undef DECLARE_QPEL_AVG
/* -------------------------------------------------------------------
* H.264 chroma DC 2x2 Hadamard pre-pass (per H.264 §8.5.11.1).
*
* Operates in-place on 4 int16 (the DC coefficients of an MB's
* chroma 4x4 AC blocks). Pure CPU primitive — no substrate
* dispatch wrapper because the work is 4 adds + 4 subs. Callers
* compose with QP-dependent scaling themselves; the scale shape
* varies by slice/PPS chroma_qp offset context.
*
* Bit-exact validated against tests/h264_chroma_dc_hadamard_ref.c
* (7-case spec-derived test suite including the H·H = 4·I algebraic
* invariant; see PR #23).
* ----------------------------------------------------------------- */
void daedalus_h264_chroma_dc_hadamard_2x2(int16_t c[4]);
/* -------------------------------------------------------------------
* H.264 Intra_4x4 luma prediction (per H.264 §8.3.1.4). 9 modes.
*
* Pure CPU primitives — each is a small straightforward fill of a
* 4x4 output block from neighbour pixels in the same buffer. No
* substrate-dispatch wrapper (the work is too small to amortise).
*
* FFmpeg-style interface: `dst` at row 0 col 0 of the 4x4 output.
* Reads top-left at dst[-stride-1], top at dst[-stride..-stride+7]
* (top-right for DDL/VL), and left at dst[r*stride - 1] for r=0..3.
* Caller must ensure all 13 neighbour bytes are valid (interior-MB
* assumption — H.264 availability fallback handled at caller).
*
* Bit-exact validated against tests/test_intra_pred_4x4.c (10-case
* spec-derived test suite including the asymmetric Vertical_Right
* 16-cell hand-derived case; see fourier PR #12).
* ----------------------------------------------------------------- */
void daedalus_h264_pred_4x4_vertical (uint8_t *dst, ptrdiff_t stride);
void daedalus_h264_pred_4x4_horizontal(uint8_t *dst, ptrdiff_t stride);
void daedalus_h264_pred_4x4_dc (uint8_t *dst, ptrdiff_t stride);
void daedalus_h264_pred_4x4_ddl (uint8_t *dst, ptrdiff_t stride);
void daedalus_h264_pred_4x4_ddr (uint8_t *dst, ptrdiff_t stride);
void daedalus_h264_pred_4x4_vr (uint8_t *dst, ptrdiff_t stride);
void daedalus_h264_pred_4x4_hd (uint8_t *dst, ptrdiff_t stride);
void daedalus_h264_pred_4x4_vl (uint8_t *dst, ptrdiff_t stride);
void daedalus_h264_pred_4x4_hu (uint8_t *dst, ptrdiff_t stride);
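The neighbour layout described in the comment above can be illustrated with a small scratch frame. This is only a sketch under stated assumptions: `sketch_pred_4x4_dc` and `sketch_run_dc` are hypothetical local names, not library symbols, and the DC rule `((sum + 4) >> 3)` is taken from the comment in the 4x4 source further down. The helper places the block at row 1, col 1 of a 16-byte-stride frame so that the top-left byte, the 8 top bytes and the 4 left bytes (13 in total) are all valid.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in with the same (dst, stride) shape as the
 * declarations above; mirrors the DC rule ((sum + 4) >> 3). */
static void sketch_pred_4x4_dc(uint8_t *dst, ptrdiff_t stride)
{
    const uint8_t *top = dst - stride;
    int sum = 4;                              /* rounding term for >> 3 */
    for (int i = 0; i < 4; i++)
        sum += top[i] + dst[i * stride - 1];  /* top[i] + left[i] */
    uint8_t dc = (uint8_t)(sum >> 3);
    for (int r = 0; r < 4; r++)
        for (int c = 0; c < 4; c++)
            dst[r * stride + c] = dc;
}

/* Builds a scratch frame with valid neighbours around the block at
 * row 1, col 1, runs the predictor, and returns the value written
 * at block position (0, 0). */
static int sketch_run_dc(uint8_t top_val, uint8_t left_val)
{
    enum { STRIDE = 16, ROWS = 5 };
    uint8_t frame[STRIDE * ROWS] = { 0 };
    uint8_t *dst = frame + 1 * STRIDE + 1;

    for (int i = -1; i < 8; i++)      /* top-left, top, top-right */
        dst[-STRIDE + i] = top_val;
    for (int r = 0; r < 4; r++)       /* left column */
        dst[r * STRIDE - 1] = left_val;

    sketch_pred_4x4_dc(dst, STRIDE);
    return dst[0];
}
```

Calling `sketch_run_dc(10, 30)` sums four top samples of 10 and four left samples of 30 plus the rounding term, then shifts right by 3. A real caller would pass a pointer into its own frame buffer and handle unavailable neighbours before calling, as the comment above notes.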
/* -------------------------------------------------------------------
* H.264 Intra_16x16 luma prediction (per §8.3.3). 4 modes:
* Vertical / Horizontal / DC / Plane. Same FFmpeg-style interface
* as the 4x4 family at 16x16 scale. Same neighbour availability
* assumption (interior-MB).
* ----------------------------------------------------------------- */
void daedalus_h264_pred_16x16_vertical (uint8_t *dst, ptrdiff_t stride);
void daedalus_h264_pred_16x16_horizontal(uint8_t *dst, ptrdiff_t stride);
void daedalus_h264_pred_16x16_dc (uint8_t *dst, ptrdiff_t stride);
void daedalus_h264_pred_16x16_plane (uint8_t *dst, ptrdiff_t stride);
/* -------------------------------------------------------------------
* H.264 Intra_8x8 chroma prediction (per §8.3.4, 4:2:0). 4 modes:
* DC / Horizontal / Vertical / Plane. Note: DC is per-quadrant
* asymmetric; Plane uses slope coefficient 34 (not luma's 5).
* ----------------------------------------------------------------- */
void daedalus_h264_pred_chroma8x8_dc (uint8_t *dst, ptrdiff_t stride);
void daedalus_h264_pred_chroma8x8_horizontal(uint8_t *dst, ptrdiff_t stride);
void daedalus_h264_pred_chroma8x8_vertical (uint8_t *dst, ptrdiff_t stride);
void daedalus_h264_pred_chroma8x8_plane (uint8_t *dst, ptrdiff_t stride);
/* -------------------------------------------------------------------
* H.264 Intra_8x8 luma prediction (High profile, per §8.3.2).
* 9 modes with the spec-defined 1-2-1 reference-sample pre-filter
* applied internally to the 25 neighbours before the mode arithmetic.
*
* "_8x8l" naming follows the FFmpeg h264pred_template convention
* (pred8x8l_<mode>_c) to keep the substitution wrappers a 1:1 name
* map.
* ----------------------------------------------------------------- */
void daedalus_h264_pred_8x8l_vertical (uint8_t *dst, ptrdiff_t stride);
void daedalus_h264_pred_8x8l_horizontal(uint8_t *dst, ptrdiff_t stride);
void daedalus_h264_pred_8x8l_dc (uint8_t *dst, ptrdiff_t stride);
void daedalus_h264_pred_8x8l_ddl (uint8_t *dst, ptrdiff_t stride);
void daedalus_h264_pred_8x8l_ddr (uint8_t *dst, ptrdiff_t stride);
void daedalus_h264_pred_8x8l_vr (uint8_t *dst, ptrdiff_t stride);
void daedalus_h264_pred_8x8l_hd (uint8_t *dst, ptrdiff_t stride);
void daedalus_h264_pred_8x8l_vl (uint8_t *dst, ptrdiff_t stride);
void daedalus_h264_pred_8x8l_hu (uint8_t *dst, ptrdiff_t stride);
/* -------------------------------------------------------------------
* Recipe query — what does the API recommend for each kernel?
* ----------------------------------------------------------------- */
+843 -94
File diff suppressed because it is too large.
+34
@@ -0,0 +1,34 @@
/* SPDX-License-Identifier: BSD-2-Clause */
/*
* H.264 chroma DC 2x2 Hadamard pre-pass (public, in-tree CPU).
*
* The 4 DC coefficients of an MB's chroma 4x4 AC blocks go through
* this 2x2 Hadamard before quant-scaling and re-injection into the
* AC blocks' [0,0] coefficient. Algorithm per H.264 §8.5.11.1.
*
* Pure CPU primitive — there's no substrate-dispatch wrapper because
* the work is 4 adds + 4 subs. Callers compose with QP-dependent
* scaling themselves (the scale shape varies by slice/PPS chroma_qp
* offset context and shouldn't be baked into the kernel).
*
* Bit-exact validated against tests/h264_chroma_dc_hadamard_ref.c
* (7-case spec-derived test suite including the H·H = 4·I algebraic
* invariant; see PR #23). Same algorithm; this is the public
* src-tree copy.
*/
#include "daedalus.h"
#include <stdint.h>
void daedalus_h264_chroma_dc_hadamard_2x2(int16_t c[4])
{
int t0 = c[0] + c[1];
int t1 = c[0] - c[1];
int t2 = c[2] + c[3];
int t3 = c[2] - c[3];
c[0] = (int16_t)(t0 + t2); /* f[0,0] = sum of all 4 */
c[1] = (int16_t)(t1 + t3); /* f[0,1] = col-difference */
c[2] = (int16_t)(t0 - t2); /* f[1,0] = row-difference */
c[3] = (int16_t)(t1 - t3); /* f[1,1] = anti-diagonal */
}
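The H·H = 4·I invariant cited in the header comment can be checked in a few lines. The sketch below is a standalone illustration: `sketch_hadamard_2x2` is a local copy of the arithmetic above and `sketch_double_hadamard_is_4x` is a hypothetical helper, so neither is a library symbol and the block compiles without daedalus.h.

```c
#include <stdint.h>

/* Local stand-in for the 2x2 Hadamard above (hypothetical name). */
static void sketch_hadamard_2x2(int16_t c[4])
{
    int t0 = c[0] + c[1];
    int t1 = c[0] - c[1];
    int t2 = c[2] + c[3];
    int t3 = c[2] - c[3];
    c[0] = (int16_t)(t0 + t2);
    c[1] = (int16_t)(t1 + t3);
    c[2] = (int16_t)(t0 - t2);
    c[3] = (int16_t)(t1 - t3);
}

/* Returns 1 if applying the transform twice yields 4 * the input,
 * 0 otherwise. Assumes inputs small enough that 4 * in[i] fits in
 * int16_t. */
static int sketch_double_hadamard_is_4x(const int16_t in[4])
{
    int16_t c[4] = { in[0], in[1], in[2], in[3] };
    sketch_hadamard_2x2(c);
    sketch_hadamard_2x2(c);
    for (int i = 0; i < 4; i++)
        if (c[i] != 4 * in[i])
            return 0;
    return 1;
}
```

For the asymmetric input [1,2,3,4], one pass gives [10,-2,-4,0] and a second pass gives [4,8,12,16], which is the property the "double-Hadamard = 4*orig" test case exercises.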
@@ -29,7 +29,7 @@
static inline int clip_u8(int v) { return v < 0 ? 0 : v > 255 ? 255 : v; }
/* Mode 0 — Vertical: each col = top[col]. */
-void daedalus_h264_pred_16x16_vertical_ref(uint8_t *dst, ptrdiff_t stride)
+void daedalus_h264_pred_16x16_vertical(uint8_t *dst, ptrdiff_t stride)
{
const uint8_t *top = dst - stride;
for (int r = 0; r < 16; r++)
@@ -37,7 +37,7 @@ void daedalus_h264_pred_16x16_vertical_ref(uint8_t *dst, ptrdiff_t stride)
}
/* Mode 1 — Horizontal: each row = left[row]. */
-void daedalus_h264_pred_16x16_horizontal_ref(uint8_t *dst, ptrdiff_t stride)
+void daedalus_h264_pred_16x16_horizontal(uint8_t *dst, ptrdiff_t stride)
{
for (int r = 0; r < 16; r++) {
uint8_t l = dst[r * stride - 1];
@@ -46,7 +46,7 @@ void daedalus_h264_pred_16x16_horizontal_ref(uint8_t *dst, ptrdiff_t stride)
}
/* Mode 2 — DC: ((sum_top16 + sum_left16 + 16) >> 5) broadcast. */
-void daedalus_h264_pred_16x16_dc_ref(uint8_t *dst, ptrdiff_t stride)
+void daedalus_h264_pred_16x16_dc(uint8_t *dst, ptrdiff_t stride)
{
const uint8_t *top = dst - stride;
int sum = 16; /* rounding for >> 5 over 32 samples */
@@ -77,7 +77,7 @@ void daedalus_h264_pred_16x16_dc_ref(uint8_t *dst, ptrdiff_t stride)
* (it does NOT participate in the H/V sums in the 16x16 case only
* for the chroma 8x8 plane mode).
*/
-void daedalus_h264_pred_16x16_plane_ref(uint8_t *dst, ptrdiff_t stride)
+void daedalus_h264_pred_16x16_plane(uint8_t *dst, ptrdiff_t stride)
{
const uint8_t *top = dst - stride;
/* H accumulates differences across the right vs left half of the
@@ -52,7 +52,7 @@ static inline uint8_t avg2(int a, int b)
}
/* Mode 0 — Vertical: each col = top[col]. */
-void daedalus_h264_pred_4x4_vertical_ref(uint8_t *dst, ptrdiff_t stride)
+void daedalus_h264_pred_4x4_vertical(uint8_t *dst, ptrdiff_t stride)
{
const uint8_t *top = dst - stride;
for (int r = 0; r < 4; r++) {
@@ -61,7 +61,7 @@ void daedalus_h264_pred_4x4_vertical_ref(uint8_t *dst, ptrdiff_t stride)
}
/* Mode 1 — Horizontal: each row = left[row]. */
-void daedalus_h264_pred_4x4_horizontal_ref(uint8_t *dst, ptrdiff_t stride)
+void daedalus_h264_pred_4x4_horizontal(uint8_t *dst, ptrdiff_t stride)
{
for (int r = 0; r < 4; r++) {
uint8_t l = dst[r * stride - 1];
@@ -70,7 +70,7 @@ void daedalus_h264_pred_4x4_horizontal_ref(uint8_t *dst, ptrdiff_t stride)
}
/* Mode 2 — DC: mean of top 4 + left 4, broadcast. */
-void daedalus_h264_pred_4x4_dc_ref(uint8_t *dst, ptrdiff_t stride)
+void daedalus_h264_pred_4x4_dc(uint8_t *dst, ptrdiff_t stride)
{
const uint8_t *top = dst - stride;
int sum = 4; /* rounding for ((sum + 4) >> 3) */
@@ -82,7 +82,7 @@ void daedalus_h264_pred_4x4_dc_ref(uint8_t *dst, ptrdiff_t stride)
}
/* Mode 3 — Diagonal_Down_Left. Uses top[0..7] (incl. top-right). */
-void daedalus_h264_pred_4x4_ddl_ref(uint8_t *dst, ptrdiff_t stride)
+void daedalus_h264_pred_4x4_ddl(uint8_t *dst, ptrdiff_t stride)
{
const uint8_t *top = dst - stride;
int t0 = top[0], t1 = top[1], t2 = top[2], t3 = top[3];
@@ -102,7 +102,7 @@ void daedalus_h264_pred_4x4_ddl_ref(uint8_t *dst, ptrdiff_t stride)
}
/* Mode 4 — Diagonal_Down_Right. Uses top-left + top[0..3] + left[0..3]. */
-void daedalus_h264_pred_4x4_ddr_ref(uint8_t *dst, ptrdiff_t stride)
+void daedalus_h264_pred_4x4_ddr(uint8_t *dst, ptrdiff_t stride)
{
int tl = dst[-stride - 1];
int t0 = dst[-stride + 0], t1 = dst[-stride + 1];
@@ -123,7 +123,7 @@ void daedalus_h264_pred_4x4_ddr_ref(uint8_t *dst, ptrdiff_t stride)
}
/* Mode 5 — Vertical_Right. */
-void daedalus_h264_pred_4x4_vr_ref(uint8_t *dst, ptrdiff_t stride)
+void daedalus_h264_pred_4x4_vr(uint8_t *dst, ptrdiff_t stride)
{
int tl = dst[-stride - 1];
int t0 = dst[-stride + 0], t1 = dst[-stride + 1];
@@ -153,7 +153,7 @@ void daedalus_h264_pred_4x4_vr_ref(uint8_t *dst, ptrdiff_t stride)
}
/* Mode 6 — Horizontal_Down. */
-void daedalus_h264_pred_4x4_hd_ref(uint8_t *dst, ptrdiff_t stride)
+void daedalus_h264_pred_4x4_hd(uint8_t *dst, ptrdiff_t stride)
{
int tl = dst[-stride - 1];
int t0 = dst[-stride + 0], t1 = dst[-stride + 1], t2 = dst[-stride + 2];
@@ -182,7 +182,7 @@ void daedalus_h264_pred_4x4_hd_ref(uint8_t *dst, ptrdiff_t stride)
}
/* Mode 7 — Vertical_Left. Uses top[0..7]. */
-void daedalus_h264_pred_4x4_vl_ref(uint8_t *dst, ptrdiff_t stride)
+void daedalus_h264_pred_4x4_vl(uint8_t *dst, ptrdiff_t stride)
{
const uint8_t *top = dst - stride;
int t0=top[0], t1=top[1], t2=top[2], t3=top[3];
@@ -211,7 +211,7 @@ void daedalus_h264_pred_4x4_vl_ref(uint8_t *dst, ptrdiff_t stride)
}
/* Mode 8 — Horizontal_Up. Uses left[0..3] only. */
-void daedalus_h264_pred_4x4_hu_ref(uint8_t *dst, ptrdiff_t stride)
+void daedalus_h264_pred_4x4_hu(uint8_t *dst, ptrdiff_t stride)
{
int l0 = dst[ 0*stride - 1], l1 = dst[ 1*stride - 1];
int l2 = dst[ 2*stride - 1], l3 = dst[ 3*stride - 1];
@@ -90,7 +90,7 @@ static void filter_refs(const uint8_t *dst, ptrdiff_t stride,
#define FTL filt[0] /* filtered top-left */
/* Mode 0 Vertical (§8.3.2.1.2): pred[r,c] = filt_top[c]. */
-void daedalus_h264_pred_8x8l_vertical_ref(uint8_t *dst, ptrdiff_t stride)
+void daedalus_h264_pred_8x8l_vertical(uint8_t *dst, ptrdiff_t stride)
{
uint8_t filt[25];
filter_refs(dst, stride, filt);
@@ -99,7 +99,7 @@ void daedalus_h264_pred_8x8l_vertical_ref(uint8_t *dst, ptrdiff_t stride)
}
/* Mode 1 Horizontal (§8.3.2.1.3): pred[r,c] = filt_left[r]. */
-void daedalus_h264_pred_8x8l_horizontal_ref(uint8_t *dst, ptrdiff_t stride)
+void daedalus_h264_pred_8x8l_horizontal(uint8_t *dst, ptrdiff_t stride)
{
uint8_t filt[25];
filter_refs(dst, stride, filt);
@@ -110,7 +110,7 @@ void daedalus_h264_pred_8x8l_horizontal_ref(uint8_t *dst, ptrdiff_t stride)
/* Mode 2 DC (§8.3.2.1.4): ((sum_filt_top[0..7] + sum_filt_left[0..7]
* + 8) >> 4) broadcast. Note the +8 (not +4 like 4x4): there are
* 16 samples summed total, so >> 4 with half-step rounding +8. */
-void daedalus_h264_pred_8x8l_dc_ref(uint8_t *dst, ptrdiff_t stride)
+void daedalus_h264_pred_8x8l_dc(uint8_t *dst, ptrdiff_t stride)
{
uint8_t filt[25];
filter_refs(dst, stride, filt);
@@ -143,7 +143,7 @@ void daedalus_h264_pred_8x8l_dc_ref(uint8_t *dst, ptrdiff_t stride)
#define LT FTL
/* Mode 3 DDL (Diagonal_Down_Left) — uses TOP + TOP_RIGHT, no LEFT. */
-void daedalus_h264_pred_8x8l_ddl_ref(uint8_t *dst, ptrdiff_t stride)
+void daedalus_h264_pred_8x8l_ddl(uint8_t *dst, ptrdiff_t stride)
{
uint8_t filt[25];
filter_refs(dst, stride, filt);
@@ -165,7 +165,7 @@ void daedalus_h264_pred_8x8l_ddl_ref(uint8_t *dst, ptrdiff_t stride)
}
/* Mode 4 DDR (Diagonal_Down_Right). */
-void daedalus_h264_pred_8x8l_ddr_ref(uint8_t *dst, ptrdiff_t stride)
+void daedalus_h264_pred_8x8l_ddr(uint8_t *dst, ptrdiff_t stride)
{
uint8_t filt[25];
filter_refs(dst, stride, filt);
@@ -187,7 +187,7 @@ void daedalus_h264_pred_8x8l_ddr_ref(uint8_t *dst, ptrdiff_t stride)
}
/* Mode 5 VR (Vertical_Right). */
-void daedalus_h264_pred_8x8l_vr_ref(uint8_t *dst, ptrdiff_t stride)
+void daedalus_h264_pred_8x8l_vr(uint8_t *dst, ptrdiff_t stride)
{
uint8_t filt[25];
filter_refs(dst, stride, filt);
@@ -216,7 +216,7 @@ void daedalus_h264_pred_8x8l_vr_ref(uint8_t *dst, ptrdiff_t stride)
}
/* Mode 6 HD (Horizontal_Down). */
-void daedalus_h264_pred_8x8l_hd_ref(uint8_t *dst, ptrdiff_t stride)
+void daedalus_h264_pred_8x8l_hd(uint8_t *dst, ptrdiff_t stride)
{
uint8_t filt[25];
filter_refs(dst, stride, filt);
@@ -245,7 +245,7 @@ void daedalus_h264_pred_8x8l_hd_ref(uint8_t *dst, ptrdiff_t stride)
}
/* Mode 7 VL (Vertical_Left) — uses TOP + TOP_RIGHT only. */
-void daedalus_h264_pred_8x8l_vl_ref(uint8_t *dst, ptrdiff_t stride)
+void daedalus_h264_pred_8x8l_vl(uint8_t *dst, ptrdiff_t stride)
{
uint8_t filt[25];
filter_refs(dst, stride, filt);
@@ -274,7 +274,7 @@ void daedalus_h264_pred_8x8l_vl_ref(uint8_t *dst, ptrdiff_t stride)
}
/* Mode 8 HU (Horizontal_Up) — uses LEFT only. */
-void daedalus_h264_pred_8x8l_hu_ref(uint8_t *dst, ptrdiff_t stride)
+void daedalus_h264_pred_8x8l_hu(uint8_t *dst, ptrdiff_t stride)
{
uint8_t filt[25];
filter_refs(dst, stride, filt);
@@ -43,7 +43,7 @@ static inline int clip_u8(int v) { return v < 0 ? 0 : v > 255 ? 255 : v; }
* quadrant ignores the top-left-half because that half is "vertically
* above" the top-left quadrant; the spec uses top[4..7] only.
*/
-void daedalus_h264_pred_chroma8x8_dc_ref(uint8_t *dst, ptrdiff_t stride)
+void daedalus_h264_pred_chroma8x8_dc(uint8_t *dst, ptrdiff_t stride)
{
const uint8_t *top = dst - stride;
int top_lo = 0, top_hi = 0, left_lo = 0, left_hi = 0;
@@ -68,7 +68,7 @@ void daedalus_h264_pred_chroma8x8_dc_ref(uint8_t *dst, ptrdiff_t stride)
}
/* Mode 1 — Horizontal: each row = left[row]. */
-void daedalus_h264_pred_chroma8x8_horizontal_ref(uint8_t *dst, ptrdiff_t stride)
+void daedalus_h264_pred_chroma8x8_horizontal(uint8_t *dst, ptrdiff_t stride)
{
for (int r = 0; r < 8; r++) {
uint8_t l = dst[r * stride - 1];
@@ -77,7 +77,7 @@ void daedalus_h264_pred_chroma8x8_horizontal_ref(uint8_t *dst, ptrdiff_t stride)
}
/* Mode 2 — Vertical: each col = top[col]. */
-void daedalus_h264_pred_chroma8x8_vertical_ref(uint8_t *dst, ptrdiff_t stride)
+void daedalus_h264_pred_chroma8x8_vertical(uint8_t *dst, ptrdiff_t stride)
{
const uint8_t *top = dst - stride;
for (int r = 0; r < 8; r++)
@@ -97,7 +97,7 @@ void daedalus_h264_pred_chroma8x8_vertical_ref(uint8_t *dst, ptrdiff_t stride)
* - Centre is (x-3, y-3) (not x-7, y-7).
* - Spans 4 differences per sum (not 8).
*/
-void daedalus_h264_pred_chroma8x8_plane_ref(uint8_t *dst, ptrdiff_t stride)
+void daedalus_h264_pred_chroma8x8_plane(uint8_t *dst, ptrdiff_t stride)
{
const uint8_t *top = dst - stride;
int H = 0, V = 0;
+52
@@ -0,0 +1,52 @@
// daedalus-fourier — H.264 luma qpel avg_mc01 (biprediction) (8x8, ¼-pel vertical),
// V3D 7.1. Per H.264 §8.4.2.2.1 "d" position:
//
// dst[r,c] = ((clip255(mc02(s)[r,c]) + s[r,c] + 1) >> 1)
//
// Sibling of v3d_h264_qpel_mc02.comp with L2 step against src[r, c].
//
//
// avg_ variant for B-slice biprediction per H.264 §8.4.2.3.1:
// dst[r,c] = avg(dst[r,c], mc01_value)
// Caller pre-loads dst with the list0 prediction; this shader
// folds in the list1 contribution.
//
// License: BSD-2-Clause.
#version 450
#extension GL_EXT_shader_8bit_storage : require
#extension GL_EXT_shader_explicit_arithmetic_types : require
layout(local_size_x = 64, local_size_y = 1, local_size_z = 1) in;
layout(binding = 0) readonly buffer Src { uint8_t src[]; } u_src;
layout(binding = 1) buffer Dst { uint8_t dst[]; } u_dst;
layout(binding = 2) readonly buffer Meta { uvec4 meta[]; } u_meta;
layout(push_constant) uniform PC { uint n_blocks, stride_u8, _p0, _p1; } pc;
void main()
{
uint block_idx = gl_WorkGroupID.x;
if (block_idx >= pc.n_blocks) return;
uint lane = gl_LocalInvocationID.x;
uint r = lane >> 3, c = lane & 7u;
uint dst_off = u_meta.meta[block_idx].x;
uint src_off = u_meta.meta[block_idx].y;
uint stride = pc.stride_u8;
uint col_base = src_off + c;
int s_m2 = int(u_src.src[col_base + (r - 2u) * stride]);
int s_m1 = int(u_src.src[col_base + (r - 1u) * stride]);
int s_0 = int(u_src.src[col_base + r * stride]);
int s_p1 = int(u_src.src[col_base + (r + 1u) * stride]);
int s_p2 = int(u_src.src[col_base + (r + 2u) * stride]);
int s_p3 = int(u_src.src[col_base + (r + 3u) * stride]);
int v = s_m2 - 5 * s_m1 + 20 * s_0 + 20 * s_p1 - 5 * s_p2 + s_p3 + 16;
int vp = clamp(v >> 5, 0, 255);
int avg = (vp + s_0 + 1) >> 1; // L2 with src[r, c]
uint final_off = dst_off + r * stride + c;
int prev = int(u_dst.dst[final_off]);
u_dst.dst[final_off] = uint8_t((prev + avg + 1) >> 1);
}
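The avg_mc01 shader above applies two rounded averages in sequence: first the quarter-pel step against the integer sample, then the biprediction fold into dst. A minimal per-pixel CPU sketch of that chain follows; `sketch_l2` and `sketch_avg_mc01_pixel` are hypothetical names introduced here for illustration and are not part of the library.

```c
#include <stdint.h>

/* Rounded average of two samples, the "L2" step in the shader
 * comments. */
static int sketch_l2(int a, int b)
{
    return (a + b + 1) >> 1;
}

/* hp:   clipped vertical half-pel value (the mc02 result at r, c)
 * s0:   integer source sample at (r, c)
 * prev: list0 prediction the caller already stored in dst
 * All three are expected to be in 0..255. */
static uint8_t sketch_avg_mc01_pixel(int hp, int s0, int prev)
{
    int qpel = sketch_l2(hp, s0);           /* put_ mc01 value */
    return (uint8_t)sketch_l2(prev, qpel);  /* fold list1 into dst */
}
```

Because each stage rounds up on ties, the result is not in general equal to a single three-way average. Going by the mc03 header comment, that variant would follow the same chain with `s[r+1, c]` passed in place of `s0`.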
+77
@@ -0,0 +1,77 @@
// daedalus-fourier — H.264 luma qpel avg_mc02 (biprediction) (8x8, vertical half-pel), V3D 7.1.
//
// Sibling of cycle 9's v3d_h264_qpel_mc20.comp. Same 6-tap filter,
// transposed to vertical direction:
//
// dst[r,c] = clip255(
// ( s[r-2,c]
// - 5 * s[r-1,c]
// + 20 * s[r, c]
// + 20 * s[r+1,c]
// - 5 * s[r+2,c]
// + s[r+3,c]
// + 16
// ) >> 5)
//
// src+src_off points at row 0 col 0 of the OUTPUT block; the filter
// reads rows -2..+3 (2 rows of top context, 3 rows of bottom).
//
// Same WG layout as mc20: 64 lanes / 1 block-per-WG / 1 lane-per-pixel.
//
//
// avg_ variant for B-slice biprediction per H.264 §8.4.2.3.1:
// dst[r,c] = avg(dst[r,c], mc02_value)
// Caller pre-loads dst with the list0 prediction; this shader
// folds in the list1 contribution.
//
// License: BSD-2-Clause.
#version 450
#extension GL_EXT_shader_8bit_storage : require
#extension GL_EXT_shader_explicit_arithmetic_types : require
layout(local_size_x = 64, local_size_y = 1, local_size_z = 1) in;
layout(binding = 0) readonly buffer Src { uint8_t src[]; } u_src;
layout(binding = 1) buffer Dst { uint8_t dst[]; } u_dst;
layout(binding = 2) readonly buffer Meta { uvec4 meta[]; } u_meta;
layout(push_constant) uniform PC {
uint n_blocks;
uint stride_u8;
uint _pad0, _pad1;
} pc;
void main()
{
uint block_idx = gl_WorkGroupID.x;
if (block_idx >= pc.n_blocks) return;
uint lane = gl_LocalInvocationID.x;
uint r = lane >> 3;
uint c = lane & 7u;
uint dst_off = u_meta.meta[block_idx].x;
uint src_off = u_meta.meta[block_idx].y;
uint stride = pc.stride_u8;
// Read the 6 rows of vertical context at col (c) of THIS output row.
// src_off+r*stride+c is at the OUTPUT pixel position; the kernel
// samples r-2..r+3 along the column. Unsigned-safe because the
// public API contract guarantees src_off >= 2*stride.
uint col_base = src_off + c;
int s_m2 = int(u_src.src[col_base + (r - 2u) * stride]);
int s_m1 = int(u_src.src[col_base + (r - 1u) * stride]);
int s_0 = int(u_src.src[col_base + r * stride]);
int s_p1 = int(u_src.src[col_base + (r + 1u) * stride]);
int s_p2 = int(u_src.src[col_base + (r + 2u) * stride]);
int s_p3 = int(u_src.src[col_base + (r + 3u) * stride]);
int v = s_m2 - 5 * s_m1 + 20 * s_0 + 20 * s_p1 - 5 * s_p2 + s_p3 + 16;
int p = clamp(v >> 5, 0, 255);
uint final_off = dst_off + r * stride + c;
int prev = int(u_dst.dst[final_off]);
u_dst.dst[final_off] = uint8_t((prev + p + 1) >> 1);
}
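The "unsigned-safe" remark in the shader can be reproduced on the CPU. The sketch below mirrors the vertical 6-tap filter with `uint32_t` offsets, so `(r - 2u)` wraps for `r < 2` in the same way as GLSL `uint` arithmetic. The function names are hypothetical, and the block assumes `src_off >= 2 * stride` (the contract stated in the shader comment) and an arithmetic right shift for negative intermediate values, which mainstream C compilers provide.

```c
#include <stdint.h>

static int sketch_clip_u8(int v)
{
    return v < 0 ? 0 : v > 255 ? 255 : v;
}

/* Vertical half-pel value at output position (r, c).
 * For r < 2, (r - 2u) * stride wraps modulo 2^32, but adding it to
 * col_base wraps back to the intended index as long as
 * src_off >= 2 * stride. */
static int sketch_hpel_v(const uint8_t *src, uint32_t src_off,
                         uint32_t stride, uint32_t r, uint32_t c)
{
    uint32_t col_base = src_off + c;
    int s_m2 = src[(uint32_t)(col_base + (r - 2u) * stride)];
    int s_m1 = src[(uint32_t)(col_base + (r - 1u) * stride)];
    int s_0  = src[(uint32_t)(col_base +  r       * stride)];
    int s_p1 = src[(uint32_t)(col_base + (r + 1u) * stride)];
    int s_p2 = src[(uint32_t)(col_base + (r + 2u) * stride)];
    int s_p3 = src[(uint32_t)(col_base + (r + 3u) * stride)];
    int v = s_m2 - 5 * s_m1 + 20 * s_0 + 20 * s_p1 - 5 * s_p2 + s_p3 + 16;
    return sketch_clip_u8(v >> 5);
}

/* avg_ variant: rounded average of the filter output with the value
 * already present in dst (passed here as prev). */
static uint8_t sketch_avg_mc02(const uint8_t *src, uint32_t src_off,
                               uint32_t stride, uint32_t r, uint32_t c,
                               uint8_t prev)
{
    int p = sketch_hpel_v(src, src_off, stride, r, c);
    return (uint8_t)((prev + p + 1) >> 1);
}
```

On a flat source the taps sum to 32 times the sample value, so the filter returns the input unchanged; on a linear vertical ramp it returns the midpoint between rows r and r+1, rounded. Both make convenient sanity checks before comparing against the shader output.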
+52
@@ -0,0 +1,52 @@
// daedalus-fourier — H.264 luma qpel avg_mc03 (biprediction) (8x8, ¾-pel vertical),
// V3D 7.1. Per H.264 §8.4.2.2.1 "n" position:
//
// dst[r,c] = ((clip255(mc02(s)[r,c]) + s[r+1, c] + 1) >> 1)
//
// Same as mc01 but L2-averages with src[r+1, c] instead of src[r, c].
//
//
// avg_ variant for B-slice biprediction per H.264 §8.4.2.3.1:
// dst[r,c] = avg(dst[r,c], mc03_value)
// Caller pre-loads dst with the list0 prediction; this shader
// folds in the list1 contribution.
//
// License: BSD-2-Clause.
#version 450
#extension GL_EXT_shader_8bit_storage : require
#extension GL_EXT_shader_explicit_arithmetic_types : require
layout(local_size_x = 64, local_size_y = 1, local_size_z = 1) in;
layout(binding = 0) readonly buffer Src { uint8_t src[]; } u_src;
layout(binding = 1) buffer Dst { uint8_t dst[]; } u_dst;
layout(binding = 2) readonly buffer Meta { uvec4 meta[]; } u_meta;
layout(push_constant) uniform PC { uint n_blocks, stride_u8, _p0, _p1; } pc;
void main()
{
uint block_idx = gl_WorkGroupID.x;
if (block_idx >= pc.n_blocks) return;
uint lane = gl_LocalInvocationID.x;
uint r = lane >> 3, c = lane & 7u;
uint dst_off = u_meta.meta[block_idx].x;
uint src_off = u_meta.meta[block_idx].y;
uint stride = pc.stride_u8;
uint col_base = src_off + c;
int s_m2 = int(u_src.src[col_base + (r - 2u) * stride]);
int s_m1 = int(u_src.src[col_base + (r - 1u) * stride]);
int s_0 = int(u_src.src[col_base + r * stride]);
int s_p1 = int(u_src.src[col_base + (r + 1u) * stride]);
int s_p2 = int(u_src.src[col_base + (r + 2u) * stride]);
int s_p3 = int(u_src.src[col_base + (r + 3u) * stride]);
int v = s_m2 - 5 * s_m1 + 20 * s_0 + 20 * s_p1 - 5 * s_p2 + s_p3 + 16;
int vp = clamp(v >> 5, 0, 255);
int avg = (vp + s_p1 + 1) >> 1; // L2 with src[r+1, c]
uint final_off = dst_off + r * stride + c;
int prev = int(u_dst.dst[final_off]);
u_dst.dst[final_off] = uint8_t((prev + avg + 1) >> 1);
}
+55
@@ -0,0 +1,55 @@
// daedalus-fourier — H.264 luma qpel avg_mc10 (biprediction) (8x8, ¼-pel horizontal),
// V3D 7.1. Per H.264 §8.4.2.2.1 "a" position:
//
// dst[r,c] = ((clip255(mc20(s)[r,c]) + s[r,c] + 1) >> 1)
//
// = horizontal half-pel filter, clipped to u8, then L2 rounded-averaged
// with the integer source pixel at the SAME position. Sibling of
// v3d_h264_qpel_mc20.comp with the L2 step added at the tail.
//
//
// avg_ variant for B-slice biprediction per H.264 §8.4.2.3.1:
// dst[r,c] = avg(dst[r,c], mc10_value)
// Caller pre-loads dst with the list0 prediction; this shader
// folds in the list1 contribution.
//
// License: BSD-2-Clause.
#version 450
#extension GL_EXT_shader_8bit_storage : require
#extension GL_EXT_shader_explicit_arithmetic_types : require
layout(local_size_x = 64, local_size_y = 1, local_size_z = 1) in;
layout(binding = 0) readonly buffer Src { uint8_t src[]; } u_src;
layout(binding = 1) buffer Dst { uint8_t dst[]; } u_dst;
layout(binding = 2) readonly buffer Meta { uvec4 meta[]; } u_meta;
layout(push_constant) uniform PC { uint n_blocks, stride_u8, _p0, _p1; } pc;
void main()
{
uint block_idx = gl_WorkGroupID.x;
if (block_idx >= pc.n_blocks) return;
uint lane = gl_LocalInvocationID.x;
uint r = lane >> 3, c = lane & 7u;
uint dst_off = u_meta.meta[block_idx].x;
uint src_off = u_meta.meta[block_idx].y;
uint stride = pc.stride_u8;
uint row_base = src_off + r * stride + c;
int s_m2 = int(u_src.src[row_base - 2u]);
int s_m1 = int(u_src.src[row_base - 1u]);
int s_0 = int(u_src.src[row_base ]);
int s_p1 = int(u_src.src[row_base + 1u]);
int s_p2 = int(u_src.src[row_base + 2u]);
int s_p3 = int(u_src.src[row_base + 3u]);
int v = s_m2 - 5 * s_m1 + 20 * s_0 + 20 * s_p1 - 5 * s_p2 + s_p3 + 16;
int hp = clamp(v >> 5, 0, 255);
// L2 average with the integer source at the SAME (r, c) position.
int avg = (hp + s_0 + 1) >> 1;
uint final_off = dst_off + r * stride + c;
int prev = int(u_dst.dst[final_off]);
u_dst.dst[final_off] = uint8_t((prev + avg + 1) >> 1);
}
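As a cross-check for the avg_mc10 shader above, the per-pixel arithmetic can be sketched as a scalar C reference. This is a minimal sketch, not part of the repository: the names `clip255`, `hpel_h_ref`, and `avg_mc10_ref` are hypothetical, and the caller is assumed to pass a pointer with two valid samples to the left and three to the right.

```c
#include <stdint.h>

/* Clamp to the 8-bit pixel range, like GLSL clamp(v, 0, 255). */
static int clip255(int v) { return v < 0 ? 0 : (v > 255 ? 255 : v); }

/* Horizontal 6-tap half-pel filter centred between s[0] and s[1].
 * Reads s[-2] .. s[3]. Negative sums are clipped before the shift so
 * the result does not depend on how C shifts negative values. */
static int hpel_h_ref(const uint8_t *s)
{
    int v = s[-2] - 5 * s[-1] + 20 * s[0] + 20 * s[1] - 5 * s[2] + s[3] + 16;
    if (v < 0)
        return 0;
    return clip255(v >> 5);
}

/* avg_mc10 for one pixel: quarter-pel value (half-pel rounded-averaged
 * with the integer sample s[0]), then folded into the value already
 * in dst (`prev`, the list0 prediction). */
static uint8_t avg_mc10_ref(const uint8_t *s, uint8_t prev)
{
    int hp = hpel_h_ref(s);
    int q  = (hp + s[0] + 1) >> 1;
    return (uint8_t)((prev + q + 1) >> 1);
}
```

Replacing `s[0]` with `s[1]` in the quarter-pel average gives the mc30 variant described in its shader header.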
@@ -0,0 +1,96 @@
// daedalus-fourier — H.264 luma qpel avg_mc11 (biprediction) (8x8, diagonal quarter-pel),
// V3D 7.1. Per H.264 §8.4.2.2.1 (table 8-4) — composes two half-pel
// anchors via L2 rounded-average:
//
// mc11[r,c] = avg(mc20(r, c),
// mc02(r, c))
//
// Per-lane structure: each lane computes BOTH anchor outputs at its
// own (r, c) target offset, then L2 averages. No shared memory.
// Same WG geometry as the other qpel shaders.
//
//
// avg_ variant for B-slice biprediction per H.264 §8.4.2.3.1:
// dst[r,c] = avg(dst[r,c], mc11_value)
// Caller pre-loads dst with the list0 prediction; this shader
// folds in the list1 contribution.
//
// License: BSD-2-Clause.
#version 450
#extension GL_EXT_shader_8bit_storage : require
#extension GL_EXT_shader_explicit_arithmetic_types : require
layout(local_size_x = 64, local_size_y = 1, local_size_z = 1) in;
layout(binding = 0) readonly buffer Src { uint8_t src[]; } u_src;
layout(binding = 1) buffer Dst { uint8_t dst[]; } u_dst;
layout(binding = 2) readonly buffer Meta { uvec4 meta[]; } u_meta;
layout(push_constant) uniform PC { uint n_blocks, stride_u8, _p0, _p1; } pc;
int hpel_h(uint src_off, uint stride, uint r, uint c) {
uint row_base = src_off + r * stride + c;
int s_m2 = int(u_src.src[row_base - 2u]);
int s_m1 = int(u_src.src[row_base - 1u]);
int s_0 = int(u_src.src[row_base ]);
int s_p1 = int(u_src.src[row_base + 1u]);
int s_p2 = int(u_src.src[row_base + 2u]);
int s_p3 = int(u_src.src[row_base + 3u]);
int v = s_m2 - 5*s_m1 + 20*s_0 + 20*s_p1 - 5*s_p2 + s_p3 + 16;
return clamp(v >> 5, 0, 255);
}
int hpel_v(uint src_off, uint stride, uint r, uint c) {
uint col_base = src_off + c;
int s_m2 = int(u_src.src[col_base + (r - 2u) * stride]);
int s_m1 = int(u_src.src[col_base + (r - 1u) * stride]);
int s_0 = int(u_src.src[col_base + r * stride]);
int s_p1 = int(u_src.src[col_base + (r + 1u) * stride]);
int s_p2 = int(u_src.src[col_base + (r + 2u) * stride]);
int s_p3 = int(u_src.src[col_base + (r + 3u) * stride]);
int v = s_m2 - 5*s_m1 + 20*s_0 + 20*s_p1 - 5*s_p2 + s_p3 + 16;
return clamp(v >> 5, 0, 255);
}
int hpel_hv_row(uint src_off, uint stride, uint rr, uint c) {
// Single row's int16 horizontal lowpass (NOT clipped — used as
// intermediate for the vertical pass of hpel_hv).
uint row_base = src_off + rr * stride + c;
int s_m2 = int(u_src.src[row_base - 2u]);
int s_m1 = int(u_src.src[row_base - 1u]);
int s_0 = int(u_src.src[row_base ]);
int s_p1 = int(u_src.src[row_base + 1u]);
int s_p2 = int(u_src.src[row_base + 2u]);
int s_p3 = int(u_src.src[row_base + 3u]);
return s_m2 - 5*s_m1 + 20*s_0 + 20*s_p1 - 5*s_p2 + s_p3;
}
int hpel_hv(uint src_off, uint stride, uint r, uint c) {
int t0 = hpel_hv_row(src_off, stride, r - 2u, c);
int t1 = hpel_hv_row(src_off, stride, r - 1u, c);
int t2 = hpel_hv_row(src_off, stride, r, c);
int t3 = hpel_hv_row(src_off, stride, r + 1u, c);
int t4 = hpel_hv_row(src_off, stride, r + 2u, c);
int t5 = hpel_hv_row(src_off, stride, r + 3u, c);
int v = t0 - 5*t1 + 20*t2 + 20*t3 - 5*t4 + t5 + 512;
return clamp(v >> 10, 0, 255);
}
void main()
{
uint block_idx = gl_WorkGroupID.x;
if (block_idx >= pc.n_blocks) return;
uint lane = gl_LocalInvocationID.x;
uint r = lane >> 3, c = lane & 7u;
uint dst_off = u_meta.meta[block_idx].x;
uint src_off = u_meta.meta[block_idx].y;
uint stride = pc.stride_u8;
int a = hpel_h(src_off, stride, r, c);
int b = hpel_v(src_off, stride, r, c);
int avg = (a + b + 1) >> 1;
uint final_off = dst_off + r * stride + c;
int prev = int(u_dst.dst[final_off]);
u_dst.dst[final_off] = uint8_t((prev + avg + 1) >> 1);
}
@@ -0,0 +1,96 @@
// daedalus-fourier — H.264 luma qpel avg_mc12 (biprediction) (8x8, diagonal quarter-pel),
// V3D 7.1. Per H.264 §8.4.2.2.1 (table 8-4) — composes two half-pel
// anchors via L2 rounded-average:
//
// mc12[r,c] = avg(mc22(r, c),
// mc02(r, c))
//
// Per-lane structure: each lane computes BOTH anchor outputs at its
// own (r, c) target offset, then L2 averages. No shared memory.
// Same WG geometry as the other qpel shaders.
//
//
// avg_ variant for B-slice biprediction per H.264 §8.4.2.3.1:
// dst[r,c] = avg(dst[r,c], mc12_value)
// Caller pre-loads dst with the list0 prediction; this shader
// folds in the list1 contribution.
//
// License: BSD-2-Clause.
#version 450
#extension GL_EXT_shader_8bit_storage : require
#extension GL_EXT_shader_explicit_arithmetic_types : require
layout(local_size_x = 64, local_size_y = 1, local_size_z = 1) in;
layout(binding = 0) readonly buffer Src { uint8_t src[]; } u_src;
layout(binding = 1) buffer Dst { uint8_t dst[]; } u_dst;
layout(binding = 2) readonly buffer Meta { uvec4 meta[]; } u_meta;
layout(push_constant) uniform PC { uint n_blocks, stride_u8, _p0, _p1; } pc;
int hpel_h(uint src_off, uint stride, uint r, uint c) {
uint row_base = src_off + r * stride + c;
int s_m2 = int(u_src.src[row_base - 2u]);
int s_m1 = int(u_src.src[row_base - 1u]);
int s_0 = int(u_src.src[row_base ]);
int s_p1 = int(u_src.src[row_base + 1u]);
int s_p2 = int(u_src.src[row_base + 2u]);
int s_p3 = int(u_src.src[row_base + 3u]);
int v = s_m2 - 5*s_m1 + 20*s_0 + 20*s_p1 - 5*s_p2 + s_p3 + 16;
return clamp(v >> 5, 0, 255);
}
int hpel_v(uint src_off, uint stride, uint r, uint c) {
uint col_base = src_off + c;
int s_m2 = int(u_src.src[col_base + (r - 2u) * stride]);
int s_m1 = int(u_src.src[col_base + (r - 1u) * stride]);
int s_0 = int(u_src.src[col_base + r * stride]);
int s_p1 = int(u_src.src[col_base + (r + 1u) * stride]);
int s_p2 = int(u_src.src[col_base + (r + 2u) * stride]);
int s_p3 = int(u_src.src[col_base + (r + 3u) * stride]);
int v = s_m2 - 5*s_m1 + 20*s_0 + 20*s_p1 - 5*s_p2 + s_p3 + 16;
return clamp(v >> 5, 0, 255);
}
int hpel_hv_row(uint src_off, uint stride, uint rr, uint c) {
// Single row's int16 horizontal lowpass (NOT clipped — used as
// intermediate for the vertical pass of hpel_hv).
uint row_base = src_off + rr * stride + c;
int s_m2 = int(u_src.src[row_base - 2u]);
int s_m1 = int(u_src.src[row_base - 1u]);
int s_0 = int(u_src.src[row_base ]);
int s_p1 = int(u_src.src[row_base + 1u]);
int s_p2 = int(u_src.src[row_base + 2u]);
int s_p3 = int(u_src.src[row_base + 3u]);
return s_m2 - 5*s_m1 + 20*s_0 + 20*s_p1 - 5*s_p2 + s_p3;
}
int hpel_hv(uint src_off, uint stride, uint r, uint c) {
int t0 = hpel_hv_row(src_off, stride, r - 2u, c);
int t1 = hpel_hv_row(src_off, stride, r - 1u, c);
int t2 = hpel_hv_row(src_off, stride, r, c);
int t3 = hpel_hv_row(src_off, stride, r + 1u, c);
int t4 = hpel_hv_row(src_off, stride, r + 2u, c);
int t5 = hpel_hv_row(src_off, stride, r + 3u, c);
int v = t0 - 5*t1 + 20*t2 + 20*t3 - 5*t4 + t5 + 512;
return clamp(v >> 10, 0, 255);
}
void main()
{
uint block_idx = gl_WorkGroupID.x;
if (block_idx >= pc.n_blocks) return;
uint lane = gl_LocalInvocationID.x;
uint r = lane >> 3, c = lane & 7u;
uint dst_off = u_meta.meta[block_idx].x;
uint src_off = u_meta.meta[block_idx].y;
uint stride = pc.stride_u8;
int a = hpel_hv(src_off, stride, r, c);
int b = hpel_v(src_off, stride, r, c);
int avg = (a + b + 1) >> 1;
uint final_off = dst_off + r * stride + c;
int prev = int(u_dst.dst[final_off]);
u_dst.dst[final_off] = uint8_t((prev + avg + 1) >> 1);
}
@@ -0,0 +1,96 @@
// daedalus-fourier — H.264 luma qpel avg_mc13 (biprediction) (8x8, diagonal quarter-pel),
// V3D 7.1. Per H.264 §8.4.2.2.1 (table 8-4) — composes two half-pel
// anchors via L2 rounded-average:
//
// mc13[r,c] = avg(mc20(r+1, c),
// mc02(r, c))
//
// Per-lane structure: each lane computes BOTH anchor outputs at its
// own (r, c) target offset, then L2 averages. No shared memory.
// Same WG geometry as the other qpel shaders.
//
//
// avg_ variant for B-slice biprediction per H.264 §8.4.2.3.1:
// dst[r,c] = avg(dst[r,c], mc13_value)
// Caller pre-loads dst with the list0 prediction; this shader
// folds in the list1 contribution.
//
// License: BSD-2-Clause.
#version 450
#extension GL_EXT_shader_8bit_storage : require
#extension GL_EXT_shader_explicit_arithmetic_types : require
layout(local_size_x = 64, local_size_y = 1, local_size_z = 1) in;
layout(binding = 0) readonly buffer Src { uint8_t src[]; } u_src;
layout(binding = 1) buffer Dst { uint8_t dst[]; } u_dst;
layout(binding = 2) readonly buffer Meta { uvec4 meta[]; } u_meta;
layout(push_constant) uniform PC { uint n_blocks, stride_u8, _p0, _p1; } pc;
int hpel_h(uint src_off, uint stride, uint r, uint c) {
uint row_base = src_off + r * stride + c;
int s_m2 = int(u_src.src[row_base - 2u]);
int s_m1 = int(u_src.src[row_base - 1u]);
int s_0 = int(u_src.src[row_base ]);
int s_p1 = int(u_src.src[row_base + 1u]);
int s_p2 = int(u_src.src[row_base + 2u]);
int s_p3 = int(u_src.src[row_base + 3u]);
int v = s_m2 - 5*s_m1 + 20*s_0 + 20*s_p1 - 5*s_p2 + s_p3 + 16;
return clamp(v >> 5, 0, 255);
}
int hpel_v(uint src_off, uint stride, uint r, uint c) {
uint col_base = src_off + c;
int s_m2 = int(u_src.src[col_base + (r - 2u) * stride]);
int s_m1 = int(u_src.src[col_base + (r - 1u) * stride]);
int s_0 = int(u_src.src[col_base + r * stride]);
int s_p1 = int(u_src.src[col_base + (r + 1u) * stride]);
int s_p2 = int(u_src.src[col_base + (r + 2u) * stride]);
int s_p3 = int(u_src.src[col_base + (r + 3u) * stride]);
int v = s_m2 - 5*s_m1 + 20*s_0 + 20*s_p1 - 5*s_p2 + s_p3 + 16;
return clamp(v >> 5, 0, 255);
}
int hpel_hv_row(uint src_off, uint stride, uint rr, uint c) {
// Single row's int16 horizontal lowpass (NOT clipped — used as
// intermediate for the vertical pass of hpel_hv).
uint row_base = src_off + rr * stride + c;
int s_m2 = int(u_src.src[row_base - 2u]);
int s_m1 = int(u_src.src[row_base - 1u]);
int s_0 = int(u_src.src[row_base ]);
int s_p1 = int(u_src.src[row_base + 1u]);
int s_p2 = int(u_src.src[row_base + 2u]);
int s_p3 = int(u_src.src[row_base + 3u]);
return s_m2 - 5*s_m1 + 20*s_0 + 20*s_p1 - 5*s_p2 + s_p3;
}
int hpel_hv(uint src_off, uint stride, uint r, uint c) {
int t0 = hpel_hv_row(src_off, stride, r - 2u, c);
int t1 = hpel_hv_row(src_off, stride, r - 1u, c);
int t2 = hpel_hv_row(src_off, stride, r, c);
int t3 = hpel_hv_row(src_off, stride, r + 1u, c);
int t4 = hpel_hv_row(src_off, stride, r + 2u, c);
int t5 = hpel_hv_row(src_off, stride, r + 3u, c);
int v = t0 - 5*t1 + 20*t2 + 20*t3 - 5*t4 + t5 + 512;
return clamp(v >> 10, 0, 255);
}
void main()
{
uint block_idx = gl_WorkGroupID.x;
if (block_idx >= pc.n_blocks) return;
uint lane = gl_LocalInvocationID.x;
uint r = lane >> 3, c = lane & 7u;
uint dst_off = u_meta.meta[block_idx].x;
uint src_off = u_meta.meta[block_idx].y;
uint stride = pc.stride_u8;
int a = hpel_h(src_off, stride, r+1u, c);
int b = hpel_v(src_off, stride, r, c);
int avg = (a + b + 1) >> 1;
uint final_off = dst_off + r * stride + c;
int prev = int(u_dst.dst[final_off]);
u_dst.dst[final_off] = uint8_t((prev + avg + 1) >> 1);
}
@@ -0,0 +1,91 @@
// daedalus-fourier — H.264 luma qpel avg_mc20 (biprediction) (8x8, horizontal half-pel), V3D 7.1.
//
// H.264 spec §8.4.2.2.1 horizontal 6-tap luma interpolation:
//
// dst[r,c] = clip255(
// ( s[r,c-2]
// - 5 * s[r,c-1]
// + 20 * s[r,c]
// + 20 * s[r,c+1]
// - 5 * s[r,c+2]
// + s[r,c+3]
// + 16
// ) >> 5)
//
// Single-stride: dst and src share `stride` (H264QpelContext
// convention). src+src_off already points at the leftmost output
// column (col 0); the filter reads cols -2..+3. Caller guarantees
// edge-padding context per the public API docstring.
//
// Workgroup layout: 64 invocations = 1 lane per output pixel.
// 1 block per WG; n_blocks WGs total. This is the simplest layout
// that avoids any inter-lane communication — each lane independently
// reads its 6 src samples and writes its 1 dst sample. V3D's L2
// cache handles the redundant reads from adjacent lanes.
//
//
// avg_ variant for B-slice biprediction per H.264 §8.4.2.3.1:
// dst[r,c] = avg(dst[r,c], mc20_value)
// Caller pre-loads dst with the list0 prediction; this shader
// folds in the list1 contribution.
//
// License: BSD-2-Clause.
#version 450
#extension GL_EXT_shader_8bit_storage : require
#extension GL_EXT_shader_explicit_arithmetic_types : require
layout(local_size_x = 64, local_size_y = 1, local_size_z = 1) in;
layout(binding = 0) readonly buffer Src {
uint8_t src[];
} u_src;
layout(binding = 1) buffer Dst {
uint8_t dst[];
} u_dst;
layout(binding = 2) readonly buffer Meta {
uvec4 meta[]; // .x = dst_off, .y = src_off
} u_meta;
layout(push_constant) uniform PC {
uint n_blocks;
uint stride_u8;
uint _pad0, _pad1;
} pc;
void main()
{
// 1 block per WG, 64 lanes covering the 8x8 output block.
uint wg_id = gl_WorkGroupID.x;
uint block_idx = wg_id;
if (block_idx >= pc.n_blocks) return;
uint lane = gl_LocalInvocationID.x;
uint r = lane >> 3; // 0..7 (row)
uint c = lane & 7u; // 0..7 (column)
uint dst_off = u_meta.meta[block_idx].x;
uint src_off = u_meta.meta[block_idx].y;
uint stride = pc.stride_u8;
// src points at output col 0 of the block; filter reads cols -2..+3
// of the current row. Negative col arithmetic is unsigned-safe
// because src_off >= 2 (caller-guaranteed left context).
uint row_base = src_off + r * stride + c;
int s_m2 = int(u_src.src[row_base - 2u]);
int s_m1 = int(u_src.src[row_base - 1u]);
int s_0 = int(u_src.src[row_base + 0u]);
int s_p1 = int(u_src.src[row_base + 1u]);
int s_p2 = int(u_src.src[row_base + 2u]);
int s_p3 = int(u_src.src[row_base + 3u]);
int v = s_m2 - 5 * s_m1 + 20 * s_0 + 20 * s_p1 - 5 * s_p2 + s_p3 + 16;
int p = clamp(v >> 5, 0, 255);
uint final_off = dst_off + r * stride + c;
int prev = int(u_dst.dst[final_off]);
u_dst.dst[final_off] = uint8_t((prev + p + 1) >> 1);
}
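The unsigned-safety note in the avg_mc20 shader above also matters for the vertical taps in the sibling shaders, where `(r - 2u) * stride` wraps for `r < 2`. The following C sketch mirrors that index arithmetic with `uint32_t` so the wraparound can be checked on the CPU. The helper name `tap_index` and the example offsets are hypothetical.

```c
#include <stdint.h>

/* Byte index of the sample `dr` rows and `dc` columns away from (r, c),
 * computed in 32-bit unsigned arithmetic like the shaders do. Negative
 * offsets wrap, but the final sum is correct modulo 2^32 whenever the
 * true index is non-negative. */
static uint32_t tap_index(uint32_t src_off, uint32_t stride,
                          uint32_t r, uint32_t c, int dr, int dc)
{
    return src_off + (r + (uint32_t)dr) * stride + c + (uint32_t)dc;
}
```

With `stride = 64`, `tap_index(130, 64, 0, 0, -2, -2)` is 0, the first byte of the buffer, while `src_off = 128` gives `0xFFFFFFFE`, an out-of-range index. A tap that moves both two rows up and two columns left therefore needs `src_off >= 2 * stride + 2`.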
@@ -0,0 +1,96 @@
// daedalus-fourier — H.264 luma qpel avg_mc21 (biprediction) (8x8, diagonal quarter-pel),
// V3D 7.1. Per H.264 §8.4.2.2.1 (table 8-4) — composes two half-pel
// anchors via L2 rounded-average:
//
// mc21[r,c] = avg(mc22(r, c),
// mc20(r, c))
//
// Per-lane structure: each lane computes BOTH anchor outputs at its
// own (r, c) target offset, then L2 averages. No shared memory.
// Same WG geometry as the other qpel shaders.
//
//
// avg_ variant for B-slice biprediction per H.264 §8.4.2.3.1:
// dst[r,c] = avg(dst[r,c], mc21_value)
// Caller pre-loads dst with the list0 prediction; this shader
// folds in the list1 contribution.
//
// License: BSD-2-Clause.
#version 450
#extension GL_EXT_shader_8bit_storage : require
#extension GL_EXT_shader_explicit_arithmetic_types : require
layout(local_size_x = 64, local_size_y = 1, local_size_z = 1) in;
layout(binding = 0) readonly buffer Src { uint8_t src[]; } u_src;
layout(binding = 1) buffer Dst { uint8_t dst[]; } u_dst;
layout(binding = 2) readonly buffer Meta { uvec4 meta[]; } u_meta;
layout(push_constant) uniform PC { uint n_blocks, stride_u8, _p0, _p1; } pc;
int hpel_h(uint src_off, uint stride, uint r, uint c) {
uint row_base = src_off + r * stride + c;
int s_m2 = int(u_src.src[row_base - 2u]);
int s_m1 = int(u_src.src[row_base - 1u]);
int s_0 = int(u_src.src[row_base ]);
int s_p1 = int(u_src.src[row_base + 1u]);
int s_p2 = int(u_src.src[row_base + 2u]);
int s_p3 = int(u_src.src[row_base + 3u]);
int v = s_m2 - 5*s_m1 + 20*s_0 + 20*s_p1 - 5*s_p2 + s_p3 + 16;
return clamp(v >> 5, 0, 255);
}
int hpel_v(uint src_off, uint stride, uint r, uint c) {
uint col_base = src_off + c;
int s_m2 = int(u_src.src[col_base + (r - 2u) * stride]);
int s_m1 = int(u_src.src[col_base + (r - 1u) * stride]);
int s_0 = int(u_src.src[col_base + r * stride]);
int s_p1 = int(u_src.src[col_base + (r + 1u) * stride]);
int s_p2 = int(u_src.src[col_base + (r + 2u) * stride]);
int s_p3 = int(u_src.src[col_base + (r + 3u) * stride]);
int v = s_m2 - 5*s_m1 + 20*s_0 + 20*s_p1 - 5*s_p2 + s_p3 + 16;
return clamp(v >> 5, 0, 255);
}
int hpel_hv_row(uint src_off, uint stride, uint rr, uint c) {
// Single row's int16 horizontal lowpass (NOT clipped — used as
// intermediate for the vertical pass of hpel_hv).
uint row_base = src_off + rr * stride + c;
int s_m2 = int(u_src.src[row_base - 2u]);
int s_m1 = int(u_src.src[row_base - 1u]);
int s_0 = int(u_src.src[row_base ]);
int s_p1 = int(u_src.src[row_base + 1u]);
int s_p2 = int(u_src.src[row_base + 2u]);
int s_p3 = int(u_src.src[row_base + 3u]);
return s_m2 - 5*s_m1 + 20*s_0 + 20*s_p1 - 5*s_p2 + s_p3;
}
int hpel_hv(uint src_off, uint stride, uint r, uint c) {
int t0 = hpel_hv_row(src_off, stride, r - 2u, c);
int t1 = hpel_hv_row(src_off, stride, r - 1u, c);
int t2 = hpel_hv_row(src_off, stride, r, c);
int t3 = hpel_hv_row(src_off, stride, r + 1u, c);
int t4 = hpel_hv_row(src_off, stride, r + 2u, c);
int t5 = hpel_hv_row(src_off, stride, r + 3u, c);
int v = t0 - 5*t1 + 20*t2 + 20*t3 - 5*t4 + t5 + 512;
return clamp(v >> 10, 0, 255);
}
void main()
{
uint block_idx = gl_WorkGroupID.x;
if (block_idx >= pc.n_blocks) return;
uint lane = gl_LocalInvocationID.x;
uint r = lane >> 3, c = lane & 7u;
uint dst_off = u_meta.meta[block_idx].x;
uint src_off = u_meta.meta[block_idx].y;
uint stride = pc.stride_u8;
int a = hpel_hv(src_off, stride, r, c);
int b = hpel_h(src_off, stride, r, c);
int avg = (a + b + 1) >> 1;
uint final_off = dst_off + r * stride + c;
int prev = int(u_dst.dst[final_off]);
u_dst.dst[final_off] = uint8_t((prev + avg + 1) >> 1);
}
@@ -0,0 +1,94 @@
// daedalus-fourier — H.264 luma qpel avg_mc22 (biprediction) (8x8, 2D half-pel "j" position).
// V3D 7.1.
//
// Cascaded H+V 6-tap per H.264 §8.4.2.2.1 / FFmpeg ff_put_h264_qpel8_mc22_neon:
//
// tmp[r,c] = src[r,c-2] - 5*src[r,c-1] + 20*src[r,c] + 20*src[r,c+1]
// - 5*src[r,c+2] + src[r,c+3] (int16)
//
// dst[r,c] = clip255((tmp[r-2,c] - 5*tmp[r-1,c] + 20*tmp[r,c]
// + 20*tmp[r+1,c] - 5*tmp[r+2,c] + tmp[r+3,c]
// + 512) >> 10)
//
// The (+512) >> 10 final step rounds and divides by 32 * 32 = 1024,
// the combined gain of the two 6-tap passes. Cascading mc20 -> mc02
// would NOT be equivalent: the intermediate must stay unclipped and
// unrounded (it fits in int16), so this is a dedicated kernel.
//
// Per-lane structure: each lane computes its own (r, c) output by
// running the FULL cascade — 6 horizontal lowpass int16 values for
// rows r-2..r+3, then a vertical lowpass on those. ~50 ALU ops per
// lane. No shared memory / barriers needed; V3D L2 absorbs the
// redundant src reads across lanes.
//
// WG layout: 64 lanes / 1 block-per-WG / 1 lane-per-output-pixel
// (same as mc20 / mc02).
//
//
// avg_ variant for B-slice biprediction per H.264 §8.4.2.3.1:
// dst[r,c] = avg(dst[r,c], mc22_value)
// Caller pre-loads dst with the list0 prediction; this shader
// folds in the list1 contribution.
//
// License: BSD-2-Clause.
#version 450
#extension GL_EXT_shader_8bit_storage : require
#extension GL_EXT_shader_explicit_arithmetic_types : require
layout(local_size_x = 64, local_size_y = 1, local_size_z = 1) in;
layout(binding = 0) readonly buffer Src { uint8_t src[]; } u_src;
layout(binding = 1) buffer Dst { uint8_t dst[]; } u_dst;
layout(binding = 2) readonly buffer Meta { uvec4 meta[]; } u_meta;
layout(push_constant) uniform PC {
uint n_blocks;
uint stride_u8;
uint _pad0, _pad1;
} pc;
// Horizontal 6-tap filter at (row_off, c) — reads src at cols c-2..c+3
// of the row identified by row_off, returns int16 intermediate (NOT
// scaled — the v-pass does the +512 >> 10 for both stages).
int hpel_h(uint row_off, uint c)
{
int s_m2 = int(u_src.src[row_off + c - 2u]);
int s_m1 = int(u_src.src[row_off + c - 1u]);
int s_0 = int(u_src.src[row_off + c ]);
int s_p1 = int(u_src.src[row_off + c + 1u]);
int s_p2 = int(u_src.src[row_off + c + 2u]);
int s_p3 = int(u_src.src[row_off + c + 3u]);
return s_m2 - 5 * s_m1 + 20 * s_0 + 20 * s_p1 - 5 * s_p2 + s_p3;
}
void main()
{
uint block_idx = gl_WorkGroupID.x;
if (block_idx >= pc.n_blocks) return;
uint lane = gl_LocalInvocationID.x;
uint r = lane >> 3;
uint c = lane & 7u;
uint dst_off = u_meta.meta[block_idx].x;
uint src_off = u_meta.meta[block_idx].y;
uint stride = pc.stride_u8;
// Compute 6 horizontal lowpass values at rows r-2..r+3 (relative
// to the output row r) of column c. src_off+r*stride+c is the
// output pixel position; we sample rows r-2..r+3.
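// Unsigned-safe: (r - 2u) * stride wraps for r < 2, but the sum is
// correct modulo 2^32 as long as src_off >= 2*stride + 2 (two rows
// above and two columns left), which the caller's edge-padding
// contract is expected to provide.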
int t0 = hpel_h(src_off + (r - 2u) * stride, c);
int t1 = hpel_h(src_off + (r - 1u) * stride, c);
int t2 = hpel_h(src_off + r * stride, c);
int t3 = hpel_h(src_off + (r + 1u) * stride, c);
int t4 = hpel_h(src_off + (r + 2u) * stride, c);
int t5 = hpel_h(src_off + (r + 3u) * stride, c);
int v = t0 - 5 * t1 + 20 * t2 + 20 * t3 - 5 * t4 + t5 + 512;
int p = clamp(v >> 10, 0, 255);
uint final_off = dst_off + r * stride + c;
int prev = int(u_dst.dst[final_off]);
u_dst.dst[final_off] = uint8_t((prev + p + 1) >> 1);
}
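The cascade described in the avg_mc22 header can be sketched as a scalar C reference for one output pixel, before the final biprediction average. This is an illustrative sketch only: `h_row_ref` and `mc22_ref` are hypothetical names, and the caller is assumed to provide two rows above, three rows below, two columns left, and three columns right of the target pixel.

```c
#include <stdint.h>

/* Unclipped, unrounded horizontal 6-tap sum at p[0]. For 8-bit input
 * the result lies in [-2550, 10710], so it fits in int16. */
static int h_row_ref(const uint8_t *p)
{
    return p[-2] - 5 * p[-1] + 20 * p[0] + 20 * p[1] - 5 * p[2] + p[3];
}

/* mc22 ("j" position) for the pixel at s; rows are `stride` bytes apart.
 * The vertical pass runs over six unclipped horizontal sums, then one
 * rounding step (+512) and one shift (>> 10) cover both passes. */
static int mc22_ref(const uint8_t *s, int stride)
{
    int t[6];
    for (int k = 0; k < 6; k++)
        t[k] = h_row_ref(s + (k - 2) * stride);
    int v = t[0] - 5 * t[1] + 20 * t[2] + 20 * t[3] - 5 * t[4] + t[5] + 512;
    if (v < 0)
        return 0;            /* clip before shifting a negative value */
    v >>= 10;
    return v > 255 ? 255 : v;
}
```

The avg_ variant then stores `(prev + mc22_ref(...) + 1) >> 1`, matching the last line of the shader.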
@@ -0,0 +1,96 @@
// daedalus-fourier — H.264 luma qpel avg_mc23 (biprediction) (8x8, diagonal quarter-pel),
// V3D 7.1. Per H.264 §8.4.2.2.1 (table 8-4) — composes two half-pel
// anchors via L2 rounded-average:
//
// mc23[r,c] = avg(mc22(r, c),
// mc20(r+1, c))
//
// Per-lane structure: each lane computes BOTH anchor outputs at its
// own (r, c) target offset, then L2 averages. No shared memory.
// Same WG geometry as the other qpel shaders.
//
//
// avg_ variant for B-slice biprediction per H.264 §8.4.2.3.1:
// dst[r,c] = avg(dst[r,c], mc23_value)
// Caller pre-loads dst with the list0 prediction; this shader
// folds in the list1 contribution.
//
// License: BSD-2-Clause.
#version 450
#extension GL_EXT_shader_8bit_storage : require
#extension GL_EXT_shader_explicit_arithmetic_types : require
layout(local_size_x = 64, local_size_y = 1, local_size_z = 1) in;
layout(binding = 0) readonly buffer Src { uint8_t src[]; } u_src;
layout(binding = 1) buffer Dst { uint8_t dst[]; } u_dst;
layout(binding = 2) readonly buffer Meta { uvec4 meta[]; } u_meta;
layout(push_constant) uniform PC { uint n_blocks, stride_u8, _p0, _p1; } pc;
int hpel_h(uint src_off, uint stride, uint r, uint c) {
uint row_base = src_off + r * stride + c;
int s_m2 = int(u_src.src[row_base - 2u]);
int s_m1 = int(u_src.src[row_base - 1u]);
int s_0 = int(u_src.src[row_base ]);
int s_p1 = int(u_src.src[row_base + 1u]);
int s_p2 = int(u_src.src[row_base + 2u]);
int s_p3 = int(u_src.src[row_base + 3u]);
int v = s_m2 - 5*s_m1 + 20*s_0 + 20*s_p1 - 5*s_p2 + s_p3 + 16;
return clamp(v >> 5, 0, 255);
}
int hpel_v(uint src_off, uint stride, uint r, uint c) {
uint col_base = src_off + c;
int s_m2 = int(u_src.src[col_base + (r - 2u) * stride]);
int s_m1 = int(u_src.src[col_base + (r - 1u) * stride]);
int s_0 = int(u_src.src[col_base + r * stride]);
int s_p1 = int(u_src.src[col_base + (r + 1u) * stride]);
int s_p2 = int(u_src.src[col_base + (r + 2u) * stride]);
int s_p3 = int(u_src.src[col_base + (r + 3u) * stride]);
int v = s_m2 - 5*s_m1 + 20*s_0 + 20*s_p1 - 5*s_p2 + s_p3 + 16;
return clamp(v >> 5, 0, 255);
}
int hpel_hv_row(uint src_off, uint stride, uint rr, uint c) {
// Single row's int16 horizontal lowpass (NOT clipped — used as
// intermediate for the vertical pass of hpel_hv).
uint row_base = src_off + rr * stride + c;
int s_m2 = int(u_src.src[row_base - 2u]);
int s_m1 = int(u_src.src[row_base - 1u]);
int s_0 = int(u_src.src[row_base ]);
int s_p1 = int(u_src.src[row_base + 1u]);
int s_p2 = int(u_src.src[row_base + 2u]);
int s_p3 = int(u_src.src[row_base + 3u]);
return s_m2 - 5*s_m1 + 20*s_0 + 20*s_p1 - 5*s_p2 + s_p3;
}
int hpel_hv(uint src_off, uint stride, uint r, uint c) {
int t0 = hpel_hv_row(src_off, stride, r - 2u, c);
int t1 = hpel_hv_row(src_off, stride, r - 1u, c);
int t2 = hpel_hv_row(src_off, stride, r, c);
int t3 = hpel_hv_row(src_off, stride, r + 1u, c);
int t4 = hpel_hv_row(src_off, stride, r + 2u, c);
int t5 = hpel_hv_row(src_off, stride, r + 3u, c);
int v = t0 - 5*t1 + 20*t2 + 20*t3 - 5*t4 + t5 + 512;
return clamp(v >> 10, 0, 255);
}
void main()
{
uint block_idx = gl_WorkGroupID.x;
if (block_idx >= pc.n_blocks) return;
uint lane = gl_LocalInvocationID.x;
uint r = lane >> 3, c = lane & 7u;
uint dst_off = u_meta.meta[block_idx].x;
uint src_off = u_meta.meta[block_idx].y;
uint stride = pc.stride_u8;
int a = hpel_hv(src_off, stride, r, c);
int b = hpel_h(src_off, stride, r+1u, c);
int avg = (a + b + 1) >> 1;
uint final_off = dst_off + r * stride + c;
int prev = int(u_dst.dst[final_off]);
u_dst.dst[final_off] = uint8_t((prev + avg + 1) >> 1);
}
@@ -0,0 +1,52 @@
// daedalus-fourier — H.264 luma qpel avg_mc30 (biprediction) (8x8, ¾-pel horizontal),
// V3D 7.1. Per H.264 §8.4.2.2.1 "c" position:
//
// dst[r,c] = ((clip255(mc20(s)[r,c]) + s[r,c+1] + 1) >> 1)
//
// Same as mc10 but L2-averages with src[r, c+1] instead of src[r, c].
//
//
// avg_ variant for B-slice biprediction per H.264 §8.4.2.3.1:
// dst[r,c] = avg(dst[r,c], mc30_value)
// Caller pre-loads dst with the list0 prediction; this shader
// folds in the list1 contribution.
//
// License: BSD-2-Clause.
#version 450
#extension GL_EXT_shader_8bit_storage : require
#extension GL_EXT_shader_explicit_arithmetic_types : require
layout(local_size_x = 64, local_size_y = 1, local_size_z = 1) in;
layout(binding = 0) readonly buffer Src { uint8_t src[]; } u_src;
layout(binding = 1) buffer Dst { uint8_t dst[]; } u_dst;
layout(binding = 2) readonly buffer Meta { uvec4 meta[]; } u_meta;
layout(push_constant) uniform PC { uint n_blocks, stride_u8, _p0, _p1; } pc;
void main()
{
uint block_idx = gl_WorkGroupID.x;
if (block_idx >= pc.n_blocks) return;
uint lane = gl_LocalInvocationID.x;
uint r = lane >> 3, c = lane & 7u;
uint dst_off = u_meta.meta[block_idx].x;
uint src_off = u_meta.meta[block_idx].y;
uint stride = pc.stride_u8;
uint row_base = src_off + r * stride + c;
int s_m2 = int(u_src.src[row_base - 2u]);
int s_m1 = int(u_src.src[row_base - 1u]);
int s_0 = int(u_src.src[row_base ]);
int s_p1 = int(u_src.src[row_base + 1u]);
int s_p2 = int(u_src.src[row_base + 2u]);
int s_p3 = int(u_src.src[row_base + 3u]);
int v = s_m2 - 5 * s_m1 + 20 * s_0 + 20 * s_p1 - 5 * s_p2 + s_p3 + 16;
int hp = clamp(v >> 5, 0, 255);
int avg = (hp + s_p1 + 1) >> 1; // L2 with src[r, c+1]
uint final_off = dst_off + r * stride + c;
int prev = int(u_dst.dst[final_off]);
u_dst.dst[final_off] = uint8_t((prev + avg + 1) >> 1);
}
@@ -0,0 +1,96 @@
// daedalus-fourier — H.264 luma qpel avg_mc31 (biprediction) (8x8, diagonal quarter-pel),
// V3D 7.1. Per H.264 §8.4.2.2.1 (table 8-4) — composes two half-pel
// anchors via L2 rounded-average:
//
// mc31[r,c] = avg(mc20(r, c),
// mc02(r, c+1))
//
// Per-lane structure: each lane computes BOTH anchor outputs at its
// own (r, c) target offset, then L2 averages. No shared memory.
// Same WG geometry as the other qpel shaders.
//
//
// avg_ variant for B-slice biprediction per H.264 §8.4.2.3.1:
// dst[r,c] = avg(dst[r,c], mc31_value)
// Caller pre-loads dst with the list0 prediction; this shader
// folds in the list1 contribution.
//
// License: BSD-2-Clause.
#version 450
#extension GL_EXT_shader_8bit_storage : require
#extension GL_EXT_shader_explicit_arithmetic_types : require
layout(local_size_x = 64, local_size_y = 1, local_size_z = 1) in;
layout(binding = 0) readonly buffer Src { uint8_t src[]; } u_src;
layout(binding = 1) buffer Dst { uint8_t dst[]; } u_dst;
layout(binding = 2) readonly buffer Meta { uvec4 meta[]; } u_meta;
layout(push_constant) uniform PC { uint n_blocks, stride_u8, _p0, _p1; } pc;
int hpel_h(uint src_off, uint stride, uint r, uint c) {
uint row_base = src_off + r * stride + c;
int s_m2 = int(u_src.src[row_base - 2u]);
int s_m1 = int(u_src.src[row_base - 1u]);
int s_0 = int(u_src.src[row_base ]);
int s_p1 = int(u_src.src[row_base + 1u]);
int s_p2 = int(u_src.src[row_base + 2u]);
int s_p3 = int(u_src.src[row_base + 3u]);
int v = s_m2 - 5*s_m1 + 20*s_0 + 20*s_p1 - 5*s_p2 + s_p3 + 16;
return clamp(v >> 5, 0, 255);
}
int hpel_v(uint src_off, uint stride, uint r, uint c) {
uint col_base = src_off + c;
int s_m2 = int(u_src.src[col_base + (r - 2u) * stride]);
int s_m1 = int(u_src.src[col_base + (r - 1u) * stride]);
int s_0 = int(u_src.src[col_base + r * stride]);
int s_p1 = int(u_src.src[col_base + (r + 1u) * stride]);
int s_p2 = int(u_src.src[col_base + (r + 2u) * stride]);
int s_p3 = int(u_src.src[col_base + (r + 3u) * stride]);
int v = s_m2 - 5*s_m1 + 20*s_0 + 20*s_p1 - 5*s_p2 + s_p3 + 16;
return clamp(v >> 5, 0, 255);
}
int hpel_hv_row(uint src_off, uint stride, uint rr, uint c) {
// Single row's int16 horizontal lowpass (NOT clipped — used as
// intermediate for the vertical pass of hpel_hv).
uint row_base = src_off + rr * stride + c;
int s_m2 = int(u_src.src[row_base - 2u]);
int s_m1 = int(u_src.src[row_base - 1u]);
int s_0 = int(u_src.src[row_base ]);
int s_p1 = int(u_src.src[row_base + 1u]);
int s_p2 = int(u_src.src[row_base + 2u]);
int s_p3 = int(u_src.src[row_base + 3u]);
return s_m2 - 5*s_m1 + 20*s_0 + 20*s_p1 - 5*s_p2 + s_p3;
}
int hpel_hv(uint src_off, uint stride, uint r, uint c) {
int t0 = hpel_hv_row(src_off, stride, r - 2u, c);
int t1 = hpel_hv_row(src_off, stride, r - 1u, c);
int t2 = hpel_hv_row(src_off, stride, r, c);
int t3 = hpel_hv_row(src_off, stride, r + 1u, c);
int t4 = hpel_hv_row(src_off, stride, r + 2u, c);
int t5 = hpel_hv_row(src_off, stride, r + 3u, c);
int v = t0 - 5*t1 + 20*t2 + 20*t3 - 5*t4 + t5 + 512;
return clamp(v >> 10, 0, 255);
}
void main()
{
uint block_idx = gl_WorkGroupID.x;
if (block_idx >= pc.n_blocks) return;
uint lane = gl_LocalInvocationID.x;
uint r = lane >> 3, c = lane & 7u;
uint dst_off = u_meta.meta[block_idx].x;
uint src_off = u_meta.meta[block_idx].y;
uint stride = pc.stride_u8;
int a = hpel_h(src_off, stride, r, c);
int b = hpel_v(src_off, stride, r, c+1u);
int avg = (a + b + 1) >> 1;
uint final_off = dst_off + r * stride + c;
int prev = int(u_dst.dst[final_off]);
u_dst.dst[final_off] = uint8_t((prev + avg + 1) >> 1);
}
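The four corner quarter-pel shaders (mc11, mc31, mc13, mc33) differ only in where the two half-pel anchors are sampled, as their header comments state. A small C table can summarise that pattern. This is a hedged sketch: the struct, the table, and `diag_lookup` are hypothetical names and do not exist in the repository.

```c
#include <stdint.h>

/* Offsets applied to the two half-pel anchors of a corner position. */
struct diag_anchors {
    int h_dr;   /* row offset passed to the horizontal anchor (mc20) */
    int v_dc;   /* column offset passed to the vertical anchor (mc02) */
};

/* Indexed by [y_frac == 3][x_frac == 3], i.e. mcXY with X, Y in {1, 3}. */
static const struct diag_anchors diag_table[2][2] = {
    { {0, 0}, {0, 1} },   /* mc11, mc31 */
    { {1, 0}, {1, 1} },   /* mc13, mc33 */
};

/* Look up the anchor offsets for quarter-pel fractions in {1, 3}. */
static struct diag_anchors diag_lookup(int x_frac, int y_frac)
{
    return diag_table[y_frac == 3][x_frac == 3];
}
```

For example, `diag_lookup(3, 1)` describes mc31: the horizontal anchor stays on row `r` and the vertical anchor moves to column `c + 1`, as in the `main()` of the shader above.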
@@ -0,0 +1,96 @@
// daedalus-fourier — H.264 luma qpel avg_mc32 (biprediction) (8x8, diagonal quarter-pel),
// V3D 7.1. Per H.264 §8.4.2.2.1 (table 8-4) — composes two half-pel
// anchors via L2 rounded-average:
//
// mc32[r,c] = avg(mc22(r, c),
// mc02(r, c+1))
//
// Per-lane structure: each lane computes BOTH anchor outputs at its
// own (r, c) target offset, then L2 averages. No shared memory.
// Same WG geometry as the other qpel shaders.
//
//
// avg_ variant for B-slice biprediction per H.264 §8.4.2.3.1:
// dst[r,c] = avg(dst[r,c], mc32_value)
// Caller pre-loads dst with the list0 prediction; this shader
// folds in the list1 contribution.
//
// License: BSD-2-Clause.
#version 450
#extension GL_EXT_shader_8bit_storage : require
#extension GL_EXT_shader_explicit_arithmetic_types : require
layout(local_size_x = 64, local_size_y = 1, local_size_z = 1) in;
layout(binding = 0) readonly buffer Src { uint8_t src[]; } u_src;
layout(binding = 1) buffer Dst { uint8_t dst[]; } u_dst;
layout(binding = 2) readonly buffer Meta { uvec4 meta[]; } u_meta;
layout(push_constant) uniform PC { uint n_blocks, stride_u8, _p0, _p1; } pc;
int hpel_h(uint src_off, uint stride, uint r, uint c) {
uint row_base = src_off + r * stride + c;
int s_m2 = int(u_src.src[row_base - 2u]);
int s_m1 = int(u_src.src[row_base - 1u]);
int s_0 = int(u_src.src[row_base ]);
int s_p1 = int(u_src.src[row_base + 1u]);
int s_p2 = int(u_src.src[row_base + 2u]);
int s_p3 = int(u_src.src[row_base + 3u]);
int v = s_m2 - 5*s_m1 + 20*s_0 + 20*s_p1 - 5*s_p2 + s_p3 + 16;
return clamp(v >> 5, 0, 255);
}
int hpel_v(uint src_off, uint stride, uint r, uint c) {
uint col_base = src_off + c;
int s_m2 = int(u_src.src[col_base + (r - 2u) * stride]);
int s_m1 = int(u_src.src[col_base + (r - 1u) * stride]);
int s_0 = int(u_src.src[col_base + r * stride]);
int s_p1 = int(u_src.src[col_base + (r + 1u) * stride]);
int s_p2 = int(u_src.src[col_base + (r + 2u) * stride]);
int s_p3 = int(u_src.src[col_base + (r + 3u) * stride]);
int v = s_m2 - 5*s_m1 + 20*s_0 + 20*s_p1 - 5*s_p2 + s_p3 + 16;
return clamp(v >> 5, 0, 255);
}
int hpel_hv_row(uint src_off, uint stride, uint rr, uint c) {
// Single row's int16 horizontal lowpass (NOT clipped — used as
// intermediate for the vertical pass of hpel_hv).
uint row_base = src_off + rr * stride + c;
int s_m2 = int(u_src.src[row_base - 2u]);
int s_m1 = int(u_src.src[row_base - 1u]);
int s_0 = int(u_src.src[row_base ]);
int s_p1 = int(u_src.src[row_base + 1u]);
int s_p2 = int(u_src.src[row_base + 2u]);
int s_p3 = int(u_src.src[row_base + 3u]);
return s_m2 - 5*s_m1 + 20*s_0 + 20*s_p1 - 5*s_p2 + s_p3;
}
int hpel_hv(uint src_off, uint stride, uint r, uint c) {
int t0 = hpel_hv_row(src_off, stride, r - 2u, c);
int t1 = hpel_hv_row(src_off, stride, r - 1u, c);
int t2 = hpel_hv_row(src_off, stride, r, c);
int t3 = hpel_hv_row(src_off, stride, r + 1u, c);
int t4 = hpel_hv_row(src_off, stride, r + 2u, c);
int t5 = hpel_hv_row(src_off, stride, r + 3u, c);
int v = t0 - 5*t1 + 20*t2 + 20*t3 - 5*t4 + t5 + 512;
return clamp(v >> 10, 0, 255);
}
void main()
{
uint block_idx = gl_WorkGroupID.x;
if (block_idx >= pc.n_blocks) return;
uint lane = gl_LocalInvocationID.x;
uint r = lane >> 3, c = lane & 7u;
uint dst_off = u_meta.meta[block_idx].x;
uint src_off = u_meta.meta[block_idx].y;
uint stride = pc.stride_u8;
int a = hpel_hv(src_off, stride, r, c);
int b = hpel_v(src_off, stride, r, c+1u);
int avg = (a + b + 1) >> 1;
uint final_off = dst_off + r * stride + c;
int prev = int(u_dst.dst[final_off]);
u_dst.dst[final_off] = uint8_t((prev + avg + 1) >> 1);
}
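The biprediction fold in the last three lines of the shader above can be sketched on the CPU as a plain rounded average. This is a minimal scalar sketch, not the project's reference code: `rnd_avg_u8` and `bipred_fold_8x8` are hypothetical names, and the list1 prediction is assumed to sit in a buffer with the same stride as `dst`.

```c
#include <assert.h>
#include <stdint.h>

/* Rounded average of two 8-bit samples: (a + b + 1) >> 1. */
static uint8_t rnd_avg_u8(uint8_t a, uint8_t b)
{
    return (uint8_t)(((unsigned)a + (unsigned)b + 1u) >> 1);
}

/* Hypothetical helper: dst already holds the list0 prediction for an
 * 8x8 block; fold in the list1 prediction pred1 (same stride assumed). */
static void bipred_fold_8x8(uint8_t *dst, const uint8_t *pred1, uint32_t stride)
{
    assert(stride >= 8u); /* rows must not overlap */
    for (uint32_t r = 0; r < 8u; r++) {
        for (uint32_t c = 0; c < 8u; c++) {
            uint32_t off = r * stride + c;
            dst[off] = rnd_avg_u8(dst[off], pred1[off]);
        }
    }
}
```

The shader computes the list1 sample per lane instead of reading it from a buffer, but the final read-modify-write on `dst` follows the same formula.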
@@ -0,0 +1,96 @@
// daedalus-fourier — H.264 luma qpel avg_mc33 (biprediction) (8x8, diagonal quarter-pel),
// V3D 7.1. Per H.264 §8.4.2.2.1 (table 8-4) — composes two half-pel
// anchors via L2 rounded-average:
//
// mc33[r,c] = avg(mc20(r+1, c),
// mc02(r, c+1))
//
// Per-lane structure: each lane computes BOTH anchor outputs at its
// own (r, c) target offset, then L2 averages. No shared memory.
// Same WG geometry as the other qpel shaders.
//
// avg_ variant for B-slice biprediction per H.264 §8.4.2.3.1:
// dst[r,c] = avg(dst[r,c], mc33_value)
// Caller pre-loads dst with the list0 prediction; this shader
// folds in the list1 contribution.
//
// License: BSD-2-Clause.
#version 450
#extension GL_EXT_shader_8bit_storage : require
#extension GL_EXT_shader_explicit_arithmetic_types : require
layout(local_size_x = 64, local_size_y = 1, local_size_z = 1) in;
layout(binding = 0) readonly buffer Src { uint8_t src[]; } u_src;
layout(binding = 1) buffer Dst { uint8_t dst[]; } u_dst;
layout(binding = 2) readonly buffer Meta { uvec4 meta[]; } u_meta;
layout(push_constant) uniform PC { uint n_blocks, stride_u8, _p0, _p1; } pc;
int hpel_h(uint src_off, uint stride, uint r, uint c) {
uint row_base = src_off + r * stride + c;
int s_m2 = int(u_src.src[row_base - 2u]);
int s_m1 = int(u_src.src[row_base - 1u]);
int s_0 = int(u_src.src[row_base ]);
int s_p1 = int(u_src.src[row_base + 1u]);
int s_p2 = int(u_src.src[row_base + 2u]);
int s_p3 = int(u_src.src[row_base + 3u]);
int v = s_m2 - 5*s_m1 + 20*s_0 + 20*s_p1 - 5*s_p2 + s_p3 + 16;
return clamp(v >> 5, 0, 255);
}
int hpel_v(uint src_off, uint stride, uint r, uint c) {
uint col_base = src_off + c;
int s_m2 = int(u_src.src[col_base + (r - 2u) * stride]);
int s_m1 = int(u_src.src[col_base + (r - 1u) * stride]);
int s_0 = int(u_src.src[col_base + r * stride]);
int s_p1 = int(u_src.src[col_base + (r + 1u) * stride]);
int s_p2 = int(u_src.src[col_base + (r + 2u) * stride]);
int s_p3 = int(u_src.src[col_base + (r + 3u) * stride]);
int v = s_m2 - 5*s_m1 + 20*s_0 + 20*s_p1 - 5*s_p2 + s_p3 + 16;
return clamp(v >> 5, 0, 255);
}
int hpel_hv_row(uint src_off, uint stride, uint rr, uint c) {
// Single row's int16 horizontal lowpass (NOT clipped — used as
// intermediate for the vertical pass of hpel_hv).
uint row_base = src_off + rr * stride + c;
int s_m2 = int(u_src.src[row_base - 2u]);
int s_m1 = int(u_src.src[row_base - 1u]);
int s_0 = int(u_src.src[row_base ]);
int s_p1 = int(u_src.src[row_base + 1u]);
int s_p2 = int(u_src.src[row_base + 2u]);
int s_p3 = int(u_src.src[row_base + 3u]);
return s_m2 - 5*s_m1 + 20*s_0 + 20*s_p1 - 5*s_p2 + s_p3;
}
int hpel_hv(uint src_off, uint stride, uint r, uint c) {
int t0 = hpel_hv_row(src_off, stride, r - 2u, c);
int t1 = hpel_hv_row(src_off, stride, r - 1u, c);
int t2 = hpel_hv_row(src_off, stride, r, c);
int t3 = hpel_hv_row(src_off, stride, r + 1u, c);
int t4 = hpel_hv_row(src_off, stride, r + 2u, c);
int t5 = hpel_hv_row(src_off, stride, r + 3u, c);
int v = t0 - 5*t1 + 20*t2 + 20*t3 - 5*t4 + t5 + 512;
return clamp(v >> 10, 0, 255);
}
void main()
{
uint block_idx = gl_WorkGroupID.x;
if (block_idx >= pc.n_blocks) return;
uint lane = gl_LocalInvocationID.x;
uint r = lane >> 3, c = lane & 7u;
uint dst_off = u_meta.meta[block_idx].x;
uint src_off = u_meta.meta[block_idx].y;
uint stride = pc.stride_u8;
int a = hpel_h(src_off, stride, r+1u, c);
int b = hpel_v(src_off, stride, r, c+1u);
int avg = (a + b + 1) >> 1;
uint final_off = dst_off + r * stride + c;
int prev = int(u_dst.dst[final_off]);
u_dst.dst[final_off] = uint8_t((prev + avg + 1) >> 1);
}
@@ -0,0 +1,44 @@
// daedalus-fourier — H.264 luma qpel mc01 (8x8, ¼-pel vertical),
// V3D 7.1. Per H.264 §8.4.2.2.1 "d" position:
//
// dst[r,c] = ((clip255(mc02(s)[r,c]) + s[r,c] + 1) >> 1)
//
// Sibling of v3d_h264_qpel_mc02.comp with L2 step against src[r, c].
//
// License: BSD-2-Clause.
#version 450
#extension GL_EXT_shader_8bit_storage : require
#extension GL_EXT_shader_explicit_arithmetic_types : require
layout(local_size_x = 64, local_size_y = 1, local_size_z = 1) in;
layout(binding = 0) readonly buffer Src { uint8_t src[]; } u_src;
layout(binding = 1) buffer Dst { uint8_t dst[]; } u_dst;
layout(binding = 2) readonly buffer Meta { uvec4 meta[]; } u_meta;
layout(push_constant) uniform PC { uint n_blocks, stride_u8, _p0, _p1; } pc;
void main()
{
uint block_idx = gl_WorkGroupID.x;
if (block_idx >= pc.n_blocks) return;
uint lane = gl_LocalInvocationID.x;
uint r = lane >> 3, c = lane & 7u;
uint dst_off = u_meta.meta[block_idx].x;
uint src_off = u_meta.meta[block_idx].y;
uint stride = pc.stride_u8;
uint col_base = src_off + c;
int s_m2 = int(u_src.src[col_base + (r - 2u) * stride]);
int s_m1 = int(u_src.src[col_base + (r - 1u) * stride]);
int s_0 = int(u_src.src[col_base + r * stride]);
int s_p1 = int(u_src.src[col_base + (r + 1u) * stride]);
int s_p2 = int(u_src.src[col_base + (r + 2u) * stride]);
int s_p3 = int(u_src.src[col_base + (r + 3u) * stride]);
int v = s_m2 - 5 * s_m1 + 20 * s_0 + 20 * s_p1 - 5 * s_p2 + s_p3 + 16;
int vp = clamp(v >> 5, 0, 255);
int avg = (vp + s_0 + 1) >> 1; // L2 with src[r, c]
u_dst.dst[dst_off + r * stride + c] = uint8_t(avg);
}
@@ -0,0 +1,69 @@
// daedalus-fourier — H.264 luma qpel mc02 (8x8, vertical half-pel), V3D 7.1.
//
// Sibling of cycle 9's v3d_h264_qpel_mc20.comp. Same 6-tap filter,
// transposed to vertical direction:
//
// dst[r,c] = clip255(
// ( s[r-2,c]
// - 5 * s[r-1,c]
// + 20 * s[r, c]
// + 20 * s[r+1,c]
// - 5 * s[r+2,c]
// + s[r+3,c]
// + 16
// ) >> 5)
//
// src+src_off points at row 0 col 0 of the OUTPUT block; the filter
// reads rows -2..+3 (2 rows of top context, 3 rows of bottom).
//
// Same WG layout as mc20: 64 lanes / 1 block-per-WG / 1 lane-per-pixel.
//
// License: BSD-2-Clause.
#version 450
#extension GL_EXT_shader_8bit_storage : require
#extension GL_EXT_shader_explicit_arithmetic_types : require
layout(local_size_x = 64, local_size_y = 1, local_size_z = 1) in;
layout(binding = 0) readonly buffer Src { uint8_t src[]; } u_src;
layout(binding = 1) buffer Dst { uint8_t dst[]; } u_dst;
layout(binding = 2) readonly buffer Meta { uvec4 meta[]; } u_meta;
layout(push_constant) uniform PC {
uint n_blocks;
uint stride_u8;
uint _pad0, _pad1;
} pc;
void main()
{
uint block_idx = gl_WorkGroupID.x;
if (block_idx >= pc.n_blocks) return;
uint lane = gl_LocalInvocationID.x;
uint r = lane >> 3;
uint c = lane & 7u;
uint dst_off = u_meta.meta[block_idx].x;
uint src_off = u_meta.meta[block_idx].y;
uint stride = pc.stride_u8;
// Read the 6 rows of vertical context at column c for THIS output row.
// src_off + r*stride + c is the OUTPUT pixel position; the kernel
// samples rows r-2..r+3 along the column. For r < 2 the term (r - 2u)
// wraps, but the index sum is taken modulo 2^32, so the final address
// is exact provided src_off >= 2*stride (guaranteed by the public API
// contract).
uint col_base = src_off + c;
int s_m2 = int(u_src.src[col_base + (r - 2u) * stride]);
int s_m1 = int(u_src.src[col_base + (r - 1u) * stride]);
int s_0 = int(u_src.src[col_base + r * stride]);
int s_p1 = int(u_src.src[col_base + (r + 1u) * stride]);
int s_p2 = int(u_src.src[col_base + (r + 2u) * stride]);
int s_p3 = int(u_src.src[col_base + (r + 3u) * stride]);
int v = s_m2 - 5 * s_m1 + 20 * s_0 + 20 * s_p1 - 5 * s_p2 + s_p3 + 16;
int p = clamp(v >> 5, 0, 255);
u_dst.dst[dst_off + r * stride + c] = uint8_t(p);
}
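A scalar CPU sketch of the same vertical half-pel filter can help when checking the shader's output and its unsigned index arithmetic. This is an illustrative sketch under stated assumptions, not the project's reference implementation: `clip_u8` and `hpel_v_ref` are hypothetical names, and the right shift of a negative `int` is assumed to be arithmetic (true for GLSL and for mainstream C compilers, but implementation-defined in ISO C).

```c
#include <assert.h>
#include <stdint.h>

static int clip_u8(int v) { return v < 0 ? 0 : (v > 255 ? 255 : v); }

/* Vertical 6-tap half-pel sample at (r, c) of the output block.
 * src_off addresses row 0, col 0 of the output block, as in the shader. */
static uint8_t hpel_v_ref(const uint8_t *src, uint32_t src_off,
                          uint32_t stride, uint32_t r, uint32_t c)
{
    assert(src_off >= 2u * stride); /* two rows of top context required */
    uint32_t col_base = src_off + c;
    int s[6];
    for (uint32_t k = 0; k < 6u; k++) {
        /* (r + k - 2u) wraps for r + k < 2; the sum is modulo 2^32,
         * so idx is still the exact byte offset of row r + k - 2. */
        uint32_t idx = col_base + (r + k - 2u) * stride;
        s[k] = src[idx];
    }
    int v = s[0] - 5 * s[1] + 20 * s[2] + 20 * s[3] - 5 * s[4] + s[5] + 16;
    return (uint8_t)clip_u8(v >> 5);
}
```

On a vertical ramp the filter returns the midpoint of the two centre rows, which makes a quick sanity check for the tap order and the rounding term.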
@@ -0,0 +1,44 @@
// daedalus-fourier — H.264 luma qpel mc03 (8x8, ¾-pel vertical),
// V3D 7.1. Per H.264 §8.4.2.2.1 "n" position:
//
// dst[r,c] = ((clip255(mc02(s)[r,c]) + s[r+1, c] + 1) >> 1)
//
// Same as mc01 but L2-averages with src[r+1, c] instead of src[r, c].
//
// License: BSD-2-Clause.
#version 450
#extension GL_EXT_shader_8bit_storage : require
#extension GL_EXT_shader_explicit_arithmetic_types : require
layout(local_size_x = 64, local_size_y = 1, local_size_z = 1) in;
layout(binding = 0) readonly buffer Src { uint8_t src[]; } u_src;
layout(binding = 1) buffer Dst { uint8_t dst[]; } u_dst;
layout(binding = 2) readonly buffer Meta { uvec4 meta[]; } u_meta;
layout(push_constant) uniform PC { uint n_blocks, stride_u8, _p0, _p1; } pc;
void main()
{
uint block_idx = gl_WorkGroupID.x;
if (block_idx >= pc.n_blocks) return;
uint lane = gl_LocalInvocationID.x;
uint r = lane >> 3, c = lane & 7u;
uint dst_off = u_meta.meta[block_idx].x;
uint src_off = u_meta.meta[block_idx].y;
uint stride = pc.stride_u8;
uint col_base = src_off + c;
int s_m2 = int(u_src.src[col_base + (r - 2u) * stride]);
int s_m1 = int(u_src.src[col_base + (r - 1u) * stride]);
int s_0 = int(u_src.src[col_base + r * stride]);
int s_p1 = int(u_src.src[col_base + (r + 1u) * stride]);
int s_p2 = int(u_src.src[col_base + (r + 2u) * stride]);
int s_p3 = int(u_src.src[col_base + (r + 3u) * stride]);
int v = s_m2 - 5 * s_m1 + 20 * s_0 + 20 * s_p1 - 5 * s_p2 + s_p3 + 16;
int vp = clamp(v >> 5, 0, 255);
int avg = (vp + s_p1 + 1) >> 1; // L2 with src[r+1, c]
u_dst.dst[dst_off + r * stride + c] = uint8_t(avg);
}
@@ -0,0 +1,47 @@
// daedalus-fourier — H.264 luma qpel mc10 (8x8, ¼-pel horizontal),
// V3D 7.1. Per H.264 §8.4.2.2.1 "a" position:
//
// dst[r,c] = ((clip255(mc20(s)[r,c]) + s[r,c] + 1) >> 1)
//
// = horizontal half-pel filter, clipped to u8, then L2 rounded-averaged
// with the integer source pixel at the SAME position. Sibling of
// v3d_h264_qpel_mc20.comp with the L2 step added at the tail.
//
// License: BSD-2-Clause.
#version 450
#extension GL_EXT_shader_8bit_storage : require
#extension GL_EXT_shader_explicit_arithmetic_types : require
layout(local_size_x = 64, local_size_y = 1, local_size_z = 1) in;
layout(binding = 0) readonly buffer Src { uint8_t src[]; } u_src;
layout(binding = 1) buffer Dst { uint8_t dst[]; } u_dst;
layout(binding = 2) readonly buffer Meta { uvec4 meta[]; } u_meta;
layout(push_constant) uniform PC { uint n_blocks, stride_u8, _p0, _p1; } pc;
void main()
{
uint block_idx = gl_WorkGroupID.x;
if (block_idx >= pc.n_blocks) return;
uint lane = gl_LocalInvocationID.x;
uint r = lane >> 3, c = lane & 7u;
uint dst_off = u_meta.meta[block_idx].x;
uint src_off = u_meta.meta[block_idx].y;
uint stride = pc.stride_u8;
uint row_base = src_off + r * stride + c;
int s_m2 = int(u_src.src[row_base - 2u]);
int s_m1 = int(u_src.src[row_base - 1u]);
int s_0 = int(u_src.src[row_base ]);
int s_p1 = int(u_src.src[row_base + 1u]);
int s_p2 = int(u_src.src[row_base + 2u]);
int s_p3 = int(u_src.src[row_base + 3u]);
int v = s_m2 - 5 * s_m1 + 20 * s_0 + 20 * s_p1 - 5 * s_p2 + s_p3 + 16;
int hp = clamp(v >> 5, 0, 255);
// L2 average with the integer source at the SAME (r, c) position.
int avg = (hp + s_0 + 1) >> 1;
u_dst.dst[dst_off + r * stride + c] = uint8_t(avg);
}
@@ -0,0 +1,88 @@
// daedalus-fourier — H.264 luma qpel mc11 (8x8, diagonal quarter-pel),
// V3D 7.1. Per H.264 §8.4.2.2.1 (table 8-4) — composes two half-pel
// anchors via L2 rounded-average:
//
// mc11[r,c] = avg(mc20(r, c),
// mc02(r, c))
//
// Per-lane structure: each lane computes BOTH anchor outputs at its
// own (r, c) target offset, then L2 averages. No shared memory.
// Same WG geometry as the other qpel shaders.
//
// License: BSD-2-Clause.
#version 450
#extension GL_EXT_shader_8bit_storage : require
#extension GL_EXT_shader_explicit_arithmetic_types : require
layout(local_size_x = 64, local_size_y = 1, local_size_z = 1) in;
layout(binding = 0) readonly buffer Src { uint8_t src[]; } u_src;
layout(binding = 1) buffer Dst { uint8_t dst[]; } u_dst;
layout(binding = 2) readonly buffer Meta { uvec4 meta[]; } u_meta;
layout(push_constant) uniform PC { uint n_blocks, stride_u8, _p0, _p1; } pc;
int hpel_h(uint src_off, uint stride, uint r, uint c) {
uint row_base = src_off + r * stride + c;
int s_m2 = int(u_src.src[row_base - 2u]);
int s_m1 = int(u_src.src[row_base - 1u]);
int s_0 = int(u_src.src[row_base ]);
int s_p1 = int(u_src.src[row_base + 1u]);
int s_p2 = int(u_src.src[row_base + 2u]);
int s_p3 = int(u_src.src[row_base + 3u]);
int v = s_m2 - 5*s_m1 + 20*s_0 + 20*s_p1 - 5*s_p2 + s_p3 + 16;
return clamp(v >> 5, 0, 255);
}
int hpel_v(uint src_off, uint stride, uint r, uint c) {
uint col_base = src_off + c;
int s_m2 = int(u_src.src[col_base + (r - 2u) * stride]);
int s_m1 = int(u_src.src[col_base + (r - 1u) * stride]);
int s_0 = int(u_src.src[col_base + r * stride]);
int s_p1 = int(u_src.src[col_base + (r + 1u) * stride]);
int s_p2 = int(u_src.src[col_base + (r + 2u) * stride]);
int s_p3 = int(u_src.src[col_base + (r + 3u) * stride]);
int v = s_m2 - 5*s_m1 + 20*s_0 + 20*s_p1 - 5*s_p2 + s_p3 + 16;
return clamp(v >> 5, 0, 255);
}
int hpel_hv_row(uint src_off, uint stride, uint rr, uint c) {
// Single row's int16 horizontal lowpass (NOT clipped — used as
// intermediate for the vertical pass of hpel_hv).
uint row_base = src_off + rr * stride + c;
int s_m2 = int(u_src.src[row_base - 2u]);
int s_m1 = int(u_src.src[row_base - 1u]);
int s_0 = int(u_src.src[row_base ]);
int s_p1 = int(u_src.src[row_base + 1u]);
int s_p2 = int(u_src.src[row_base + 2u]);
int s_p3 = int(u_src.src[row_base + 3u]);
return s_m2 - 5*s_m1 + 20*s_0 + 20*s_p1 - 5*s_p2 + s_p3;
}
int hpel_hv(uint src_off, uint stride, uint r, uint c) {
int t0 = hpel_hv_row(src_off, stride, r - 2u, c);
int t1 = hpel_hv_row(src_off, stride, r - 1u, c);
int t2 = hpel_hv_row(src_off, stride, r, c);
int t3 = hpel_hv_row(src_off, stride, r + 1u, c);
int t4 = hpel_hv_row(src_off, stride, r + 2u, c);
int t5 = hpel_hv_row(src_off, stride, r + 3u, c);
int v = t0 - 5*t1 + 20*t2 + 20*t3 - 5*t4 + t5 + 512;
return clamp(v >> 10, 0, 255);
}
void main()
{
uint block_idx = gl_WorkGroupID.x;
if (block_idx >= pc.n_blocks) return;
uint lane = gl_LocalInvocationID.x;
uint r = lane >> 3, c = lane & 7u;
uint dst_off = u_meta.meta[block_idx].x;
uint src_off = u_meta.meta[block_idx].y;
uint stride = pc.stride_u8;
int a = hpel_h(src_off, stride, r, c);
int b = hpel_v(src_off, stride, r, c);
int avg = (a + b + 1) >> 1;
u_dst.dst[dst_off + r * stride + c] = uint8_t(avg);
}
@@ -0,0 +1,88 @@
// daedalus-fourier — H.264 luma qpel mc12 (8x8, diagonal quarter-pel),
// V3D 7.1. Per H.264 §8.4.2.2.1 (table 8-4) — composes two half-pel
// anchors via L2 rounded-average:
//
// mc12[r,c] = avg(mc22(r, c),
// mc02(r, c))
//
// Per-lane structure: each lane computes BOTH anchor outputs at its
// own (r, c) target offset, then L2 averages. No shared memory.
// Same WG geometry as the other qpel shaders.
//
// License: BSD-2-Clause.
#version 450
#extension GL_EXT_shader_8bit_storage : require
#extension GL_EXT_shader_explicit_arithmetic_types : require
layout(local_size_x = 64, local_size_y = 1, local_size_z = 1) in;
layout(binding = 0) readonly buffer Src { uint8_t src[]; } u_src;
layout(binding = 1) buffer Dst { uint8_t dst[]; } u_dst;
layout(binding = 2) readonly buffer Meta { uvec4 meta[]; } u_meta;
layout(push_constant) uniform PC { uint n_blocks, stride_u8, _p0, _p1; } pc;
int hpel_h(uint src_off, uint stride, uint r, uint c) {
uint row_base = src_off + r * stride + c;
int s_m2 = int(u_src.src[row_base - 2u]);
int s_m1 = int(u_src.src[row_base - 1u]);
int s_0 = int(u_src.src[row_base ]);
int s_p1 = int(u_src.src[row_base + 1u]);
int s_p2 = int(u_src.src[row_base + 2u]);
int s_p3 = int(u_src.src[row_base + 3u]);
int v = s_m2 - 5*s_m1 + 20*s_0 + 20*s_p1 - 5*s_p2 + s_p3 + 16;
return clamp(v >> 5, 0, 255);
}
int hpel_v(uint src_off, uint stride, uint r, uint c) {
uint col_base = src_off + c;
int s_m2 = int(u_src.src[col_base + (r - 2u) * stride]);
int s_m1 = int(u_src.src[col_base + (r - 1u) * stride]);
int s_0 = int(u_src.src[col_base + r * stride]);
int s_p1 = int(u_src.src[col_base + (r + 1u) * stride]);
int s_p2 = int(u_src.src[col_base + (r + 2u) * stride]);
int s_p3 = int(u_src.src[col_base + (r + 3u) * stride]);
int v = s_m2 - 5*s_m1 + 20*s_0 + 20*s_p1 - 5*s_p2 + s_p3 + 16;
return clamp(v >> 5, 0, 255);
}
int hpel_hv_row(uint src_off, uint stride, uint rr, uint c) {
// Single row's int16 horizontal lowpass (NOT clipped — used as
// intermediate for the vertical pass of hpel_hv).
uint row_base = src_off + rr * stride + c;
int s_m2 = int(u_src.src[row_base - 2u]);
int s_m1 = int(u_src.src[row_base - 1u]);
int s_0 = int(u_src.src[row_base ]);
int s_p1 = int(u_src.src[row_base + 1u]);
int s_p2 = int(u_src.src[row_base + 2u]);
int s_p3 = int(u_src.src[row_base + 3u]);
return s_m2 - 5*s_m1 + 20*s_0 + 20*s_p1 - 5*s_p2 + s_p3;
}
int hpel_hv(uint src_off, uint stride, uint r, uint c) {
int t0 = hpel_hv_row(src_off, stride, r - 2u, c);
int t1 = hpel_hv_row(src_off, stride, r - 1u, c);
int t2 = hpel_hv_row(src_off, stride, r, c);
int t3 = hpel_hv_row(src_off, stride, r + 1u, c);
int t4 = hpel_hv_row(src_off, stride, r + 2u, c);
int t5 = hpel_hv_row(src_off, stride, r + 3u, c);
int v = t0 - 5*t1 + 20*t2 + 20*t3 - 5*t4 + t5 + 512;
return clamp(v >> 10, 0, 255);
}
void main()
{
uint block_idx = gl_WorkGroupID.x;
if (block_idx >= pc.n_blocks) return;
uint lane = gl_LocalInvocationID.x;
uint r = lane >> 3, c = lane & 7u;
uint dst_off = u_meta.meta[block_idx].x;
uint src_off = u_meta.meta[block_idx].y;
uint stride = pc.stride_u8;
int a = hpel_hv(src_off, stride, r, c);
int b = hpel_v(src_off, stride, r, c);
int avg = (a + b + 1) >> 1;
u_dst.dst[dst_off + r * stride + c] = uint8_t(avg);
}
@@ -0,0 +1,88 @@
// daedalus-fourier — H.264 luma qpel mc13 (8x8, diagonal quarter-pel),
// V3D 7.1. Per H.264 §8.4.2.2.1 (table 8-4) — composes two half-pel
// anchors via L2 rounded-average:
//
// mc13[r,c] = avg(mc20(r+1, c),
// mc02(r, c))
//
// Per-lane structure: each lane computes BOTH anchor outputs at its
// own (r, c) target offset, then L2 averages. No shared memory.
// Same WG geometry as the other qpel shaders.
//
// License: BSD-2-Clause.
#version 450
#extension GL_EXT_shader_8bit_storage : require
#extension GL_EXT_shader_explicit_arithmetic_types : require
layout(local_size_x = 64, local_size_y = 1, local_size_z = 1) in;
layout(binding = 0) readonly buffer Src { uint8_t src[]; } u_src;
layout(binding = 1) buffer Dst { uint8_t dst[]; } u_dst;
layout(binding = 2) readonly buffer Meta { uvec4 meta[]; } u_meta;
layout(push_constant) uniform PC { uint n_blocks, stride_u8, _p0, _p1; } pc;
int hpel_h(uint src_off, uint stride, uint r, uint c) {
uint row_base = src_off + r * stride + c;
int s_m2 = int(u_src.src[row_base - 2u]);
int s_m1 = int(u_src.src[row_base - 1u]);
int s_0 = int(u_src.src[row_base ]);
int s_p1 = int(u_src.src[row_base + 1u]);
int s_p2 = int(u_src.src[row_base + 2u]);
int s_p3 = int(u_src.src[row_base + 3u]);
int v = s_m2 - 5*s_m1 + 20*s_0 + 20*s_p1 - 5*s_p2 + s_p3 + 16;
return clamp(v >> 5, 0, 255);
}
int hpel_v(uint src_off, uint stride, uint r, uint c) {
uint col_base = src_off + c;
int s_m2 = int(u_src.src[col_base + (r - 2u) * stride]);
int s_m1 = int(u_src.src[col_base + (r - 1u) * stride]);
int s_0 = int(u_src.src[col_base + r * stride]);
int s_p1 = int(u_src.src[col_base + (r + 1u) * stride]);
int s_p2 = int(u_src.src[col_base + (r + 2u) * stride]);
int s_p3 = int(u_src.src[col_base + (r + 3u) * stride]);
int v = s_m2 - 5*s_m1 + 20*s_0 + 20*s_p1 - 5*s_p2 + s_p3 + 16;
return clamp(v >> 5, 0, 255);
}
int hpel_hv_row(uint src_off, uint stride, uint rr, uint c) {
// Single row's int16 horizontal lowpass (NOT clipped — used as
// intermediate for the vertical pass of hpel_hv).
uint row_base = src_off + rr * stride + c;
int s_m2 = int(u_src.src[row_base - 2u]);
int s_m1 = int(u_src.src[row_base - 1u]);
int s_0 = int(u_src.src[row_base ]);
int s_p1 = int(u_src.src[row_base + 1u]);
int s_p2 = int(u_src.src[row_base + 2u]);
int s_p3 = int(u_src.src[row_base + 3u]);
return s_m2 - 5*s_m1 + 20*s_0 + 20*s_p1 - 5*s_p2 + s_p3;
}
int hpel_hv(uint src_off, uint stride, uint r, uint c) {
int t0 = hpel_hv_row(src_off, stride, r - 2u, c);
int t1 = hpel_hv_row(src_off, stride, r - 1u, c);
int t2 = hpel_hv_row(src_off, stride, r, c);
int t3 = hpel_hv_row(src_off, stride, r + 1u, c);
int t4 = hpel_hv_row(src_off, stride, r + 2u, c);
int t5 = hpel_hv_row(src_off, stride, r + 3u, c);
int v = t0 - 5*t1 + 20*t2 + 20*t3 - 5*t4 + t5 + 512;
return clamp(v >> 10, 0, 255);
}
void main()
{
uint block_idx = gl_WorkGroupID.x;
if (block_idx >= pc.n_blocks) return;
uint lane = gl_LocalInvocationID.x;
uint r = lane >> 3, c = lane & 7u;
uint dst_off = u_meta.meta[block_idx].x;
uint src_off = u_meta.meta[block_idx].y;
uint stride = pc.stride_u8;
int a = hpel_h(src_off, stride, r+1u, c);
int b = hpel_v(src_off, stride, r, c);
int avg = (a + b + 1) >> 1;
u_dst.dst[dst_off + r * stride + c] = uint8_t(avg);
}
@@ -0,0 +1,88 @@
// daedalus-fourier — H.264 luma qpel mc21 (8x8, diagonal quarter-pel),
// V3D 7.1. Per H.264 §8.4.2.2.1 (table 8-4) — composes two half-pel
// anchors via L2 rounded-average:
//
// mc21[r,c] = avg(mc22(r, c),
// mc20(r, c))
//
// Per-lane structure: each lane computes BOTH anchor outputs at its
// own (r, c) target offset, then L2 averages. No shared memory.
// Same WG geometry as the other qpel shaders.
//
// License: BSD-2-Clause.
#version 450
#extension GL_EXT_shader_8bit_storage : require
#extension GL_EXT_shader_explicit_arithmetic_types : require
layout(local_size_x = 64, local_size_y = 1, local_size_z = 1) in;
layout(binding = 0) readonly buffer Src { uint8_t src[]; } u_src;
layout(binding = 1) buffer Dst { uint8_t dst[]; } u_dst;
layout(binding = 2) readonly buffer Meta { uvec4 meta[]; } u_meta;
layout(push_constant) uniform PC { uint n_blocks, stride_u8, _p0, _p1; } pc;
int hpel_h(uint src_off, uint stride, uint r, uint c) {
uint row_base = src_off + r * stride + c;
int s_m2 = int(u_src.src[row_base - 2u]);
int s_m1 = int(u_src.src[row_base - 1u]);
int s_0 = int(u_src.src[row_base ]);
int s_p1 = int(u_src.src[row_base + 1u]);
int s_p2 = int(u_src.src[row_base + 2u]);
int s_p3 = int(u_src.src[row_base + 3u]);
int v = s_m2 - 5*s_m1 + 20*s_0 + 20*s_p1 - 5*s_p2 + s_p3 + 16;
return clamp(v >> 5, 0, 255);
}
int hpel_v(uint src_off, uint stride, uint r, uint c) {
uint col_base = src_off + c;
int s_m2 = int(u_src.src[col_base + (r - 2u) * stride]);
int s_m1 = int(u_src.src[col_base + (r - 1u) * stride]);
int s_0 = int(u_src.src[col_base + r * stride]);
int s_p1 = int(u_src.src[col_base + (r + 1u) * stride]);
int s_p2 = int(u_src.src[col_base + (r + 2u) * stride]);
int s_p3 = int(u_src.src[col_base + (r + 3u) * stride]);
int v = s_m2 - 5*s_m1 + 20*s_0 + 20*s_p1 - 5*s_p2 + s_p3 + 16;
return clamp(v >> 5, 0, 255);
}
int hpel_hv_row(uint src_off, uint stride, uint rr, uint c) {
// Single row's int16 horizontal lowpass (NOT clipped — used as
// intermediate for the vertical pass of hpel_hv).
uint row_base = src_off + rr * stride + c;
int s_m2 = int(u_src.src[row_base - 2u]);
int s_m1 = int(u_src.src[row_base - 1u]);
int s_0 = int(u_src.src[row_base ]);
int s_p1 = int(u_src.src[row_base + 1u]);
int s_p2 = int(u_src.src[row_base + 2u]);
int s_p3 = int(u_src.src[row_base + 3u]);
return s_m2 - 5*s_m1 + 20*s_0 + 20*s_p1 - 5*s_p2 + s_p3;
}
int hpel_hv(uint src_off, uint stride, uint r, uint c) {
int t0 = hpel_hv_row(src_off, stride, r - 2u, c);
int t1 = hpel_hv_row(src_off, stride, r - 1u, c);
int t2 = hpel_hv_row(src_off, stride, r, c);
int t3 = hpel_hv_row(src_off, stride, r + 1u, c);
int t4 = hpel_hv_row(src_off, stride, r + 2u, c);
int t5 = hpel_hv_row(src_off, stride, r + 3u, c);
int v = t0 - 5*t1 + 20*t2 + 20*t3 - 5*t4 + t5 + 512;
return clamp(v >> 10, 0, 255);
}
void main()
{
uint block_idx = gl_WorkGroupID.x;
if (block_idx >= pc.n_blocks) return;
uint lane = gl_LocalInvocationID.x;
uint r = lane >> 3, c = lane & 7u;
uint dst_off = u_meta.meta[block_idx].x;
uint src_off = u_meta.meta[block_idx].y;
uint stride = pc.stride_u8;
int a = hpel_hv(src_off, stride, r, c);
int b = hpel_h(src_off, stride, r, c);
int avg = (a + b + 1) >> 1;
u_dst.dst[dst_off + r * stride + c] = uint8_t(avg);
}
@@ -0,0 +1,86 @@
// daedalus-fourier — H.264 luma qpel mc22 (8x8, 2D half-pel "j" position).
// V3D 7.1.
//
// Cascaded H+V 6-tap per H.264 §8.4.2.2.1 / FFmpeg ff_put_h264_qpel8_mc22_neon:
//
// tmp[r,c] = src[r,c-2] - 5*src[r,c-1] + 20*src[r,c] + 20*src[r,c+1]
// - 5*src[r,c+2] + src[r,c+3] (int16)
//
// dst[r,c] = clip255((tmp[r-2,c] - 5*tmp[r-1,c] + 20*tmp[r,c]
// + 20*tmp[r+1,c] - 5*tmp[r+2,c] + tmp[r+3,c]
// + 512) >> 10)
//
// The final +512 >> 10 covers the scaling of both 6-tap stages (2 x 5 bits).
// This CANNOT be built by cascading mc20 into mc02: the intermediate must
// stay unrounded and unclipped (int16 range), so this is a dedicated kernel.
//
// Per-lane structure: each lane computes its own (r, c) output by
// running the FULL cascade — 6 horizontal lowpass int16 values for
// rows r-2..r+3, then a vertical lowpass on those. ~50 ALU ops per
// lane. No shared memory / barriers needed; the V3D L2 cache absorbs
// the redundant src reads across lanes.
//
// WG layout: 64 lanes / 1 block-per-WG / 1 lane-per-output-pixel
// (same as mc20 / mc02).
//
// License: BSD-2-Clause.
#version 450
#extension GL_EXT_shader_8bit_storage : require
#extension GL_EXT_shader_explicit_arithmetic_types : require
layout(local_size_x = 64, local_size_y = 1, local_size_z = 1) in;
layout(binding = 0) readonly buffer Src { uint8_t src[]; } u_src;
layout(binding = 1) buffer Dst { uint8_t dst[]; } u_dst;
layout(binding = 2) readonly buffer Meta { uvec4 meta[]; } u_meta;
layout(push_constant) uniform PC {
uint n_blocks;
uint stride_u8;
uint _pad0, _pad1;
} pc;
// Horizontal 6-tap filter at (row_off, c) — reads src at cols c-2..c+3
// of the row identified by row_off, returns int16 intermediate (NOT
// scaled — the v-pass does the +512 >> 10 for both stages).
int hpel_h(uint row_off, uint c)
{
int s_m2 = int(u_src.src[row_off + c - 2u]);
int s_m1 = int(u_src.src[row_off + c - 1u]);
int s_0 = int(u_src.src[row_off + c ]);
int s_p1 = int(u_src.src[row_off + c + 1u]);
int s_p2 = int(u_src.src[row_off + c + 2u]);
int s_p3 = int(u_src.src[row_off + c + 3u]);
return s_m2 - 5 * s_m1 + 20 * s_0 + 20 * s_p1 - 5 * s_p2 + s_p3;
}
void main()
{
uint block_idx = gl_WorkGroupID.x;
if (block_idx >= pc.n_blocks) return;
uint lane = gl_LocalInvocationID.x;
uint r = lane >> 3;
uint c = lane & 7u;
uint dst_off = u_meta.meta[block_idx].x;
uint src_off = u_meta.meta[block_idx].y;
uint stride = pc.stride_u8;
// Compute 6 horizontal lowpass values at rows r-2..r+3 (relative
// to the output row r) of column c. src_off+r*stride+c is the
// output pixel position; we sample rows r-2..r+3.
// Unsigned-safe because src_off >= 2*stride per the caller contract;
// the horizontal taps at c-2 additionally need two bytes of left
// context, i.e. src_off >= 2*stride + 2 in total.
int t0 = hpel_h(src_off + (r - 2u) * stride, c);
int t1 = hpel_h(src_off + (r - 1u) * stride, c);
int t2 = hpel_h(src_off + r * stride, c);
int t3 = hpel_h(src_off + (r + 1u) * stride, c);
int t4 = hpel_h(src_off + (r + 2u) * stride, c);
int t5 = hpel_h(src_off + (r + 3u) * stride, c);
int v = t0 - 5 * t1 + 20 * t2 + 20 * t3 - 5 * t4 + t5 + 512;
int p = clamp(v >> 10, 0, 255);
u_dst.dst[dst_off + r * stride + c] = uint8_t(p);
}
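The cascade used by the mc22 kernel can be sketched in scalar C as follows. This is a hedged illustration, not the project's reference code: `tap6_h_raw` and `hpel_hv_ref` are hypothetical names, and the right shift of a negative `int` is assumed to be arithmetic. The horizontal stage deliberately returns the raw, unscaled sum so that rounding and clipping happen once, after the vertical stage.

```c
#include <assert.h>
#include <stdint.h>

static int clip_u8(int v) { return v < 0 ? 0 : (v > 255 ? 255 : v); }

/* Unscaled, unclipped horizontal 6-tap sum centred between pos and pos+1. */
static int tap6_h_raw(const uint8_t *src, uint32_t pos)
{
    return src[pos - 2u] - 5 * src[pos - 1u] + 20 * src[pos]
         + 20 * src[pos + 1u] - 5 * src[pos + 2u] + src[pos + 3u];
}

/* 2D half-pel ("j" position) sample at (r, c) of the output block. */
static uint8_t hpel_hv_ref(const uint8_t *src, uint32_t src_off,
                           uint32_t stride, uint32_t r, uint32_t c)
{
    /* Needs two rows above and two columns to the left of the block. */
    assert(src_off >= 2u * stride + 2u);
    int t[6];
    for (uint32_t k = 0; k < 6u; k++) {
        /* Wraps for r + k < 2; the modulo-2^32 sum is still exact. */
        uint32_t pos = src_off + (r + k - 2u) * stride + c;
        t[k] = tap6_h_raw(src, pos);
    }
    int v = t[0] - 5 * t[1] + 20 * t[2] + 20 * t[3] - 5 * t[4] + t[5] + 512;
    return (uint8_t)clip_u8(v >> 10); /* single rounding + clip at the end */
}
```

On a flat input the filter gain is 32 x 32 = 1024, so the `+512 >> 10` step returns the input value unchanged, which is an easy first check before comparing against real frames.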
@@ -0,0 +1,88 @@
// daedalus-fourier — H.264 luma qpel mc23 (8x8, diagonal quarter-pel),
// V3D 7.1. Per H.264 §8.4.2.2.1 (table 8-4) — composes two half-pel
// anchors via L2 rounded-average:
//
// mc23[r,c] = avg(mc22(r, c),
// mc20(r+1, c))
//
// Per-lane structure: each lane computes BOTH anchor outputs at its
// own (r, c) target offset, then L2 averages. No shared memory.
// Same WG geometry as the other qpel shaders.
//
// License: BSD-2-Clause.
#version 450
#extension GL_EXT_shader_8bit_storage : require
#extension GL_EXT_shader_explicit_arithmetic_types : require
layout(local_size_x = 64, local_size_y = 1, local_size_z = 1) in;
layout(binding = 0) readonly buffer Src { uint8_t src[]; } u_src;
layout(binding = 1) buffer Dst { uint8_t dst[]; } u_dst;
layout(binding = 2) readonly buffer Meta { uvec4 meta[]; } u_meta;
layout(push_constant) uniform PC { uint n_blocks, stride_u8, _p0, _p1; } pc;
int hpel_h(uint src_off, uint stride, uint r, uint c) {
uint row_base = src_off + r * stride + c;
int s_m2 = int(u_src.src[row_base - 2u]);
int s_m1 = int(u_src.src[row_base - 1u]);
int s_0 = int(u_src.src[row_base ]);
int s_p1 = int(u_src.src[row_base + 1u]);
int s_p2 = int(u_src.src[row_base + 2u]);
int s_p3 = int(u_src.src[row_base + 3u]);
int v = s_m2 - 5*s_m1 + 20*s_0 + 20*s_p1 - 5*s_p2 + s_p3 + 16;
return clamp(v >> 5, 0, 255);
}
int hpel_v(uint src_off, uint stride, uint r, uint c) {
uint col_base = src_off + c;
int s_m2 = int(u_src.src[col_base + (r - 2u) * stride]);
int s_m1 = int(u_src.src[col_base + (r - 1u) * stride]);
int s_0 = int(u_src.src[col_base + r * stride]);
int s_p1 = int(u_src.src[col_base + (r + 1u) * stride]);
int s_p2 = int(u_src.src[col_base + (r + 2u) * stride]);
int s_p3 = int(u_src.src[col_base + (r + 3u) * stride]);
int v = s_m2 - 5*s_m1 + 20*s_0 + 20*s_p1 - 5*s_p2 + s_p3 + 16;
return clamp(v >> 5, 0, 255);
}
int hpel_hv_row(uint src_off, uint stride, uint rr, uint c) {
// Single row's int16 horizontal lowpass (NOT clipped — used as
// intermediate for the vertical pass of hpel_hv).
uint row_base = src_off + rr * stride + c;
int s_m2 = int(u_src.src[row_base - 2u]);
int s_m1 = int(u_src.src[row_base - 1u]);
int s_0 = int(u_src.src[row_base ]);
int s_p1 = int(u_src.src[row_base + 1u]);
int s_p2 = int(u_src.src[row_base + 2u]);
int s_p3 = int(u_src.src[row_base + 3u]);
return s_m2 - 5*s_m1 + 20*s_0 + 20*s_p1 - 5*s_p2 + s_p3;
}
int hpel_hv(uint src_off, uint stride, uint r, uint c) {
int t0 = hpel_hv_row(src_off, stride, r - 2u, c);
int t1 = hpel_hv_row(src_off, stride, r - 1u, c);
int t2 = hpel_hv_row(src_off, stride, r, c);
int t3 = hpel_hv_row(src_off, stride, r + 1u, c);
int t4 = hpel_hv_row(src_off, stride, r + 2u, c);
int t5 = hpel_hv_row(src_off, stride, r + 3u, c);
int v = t0 - 5*t1 + 20*t2 + 20*t3 - 5*t4 + t5 + 512;
return clamp(v >> 10, 0, 255);
}
void main()
{
uint block_idx = gl_WorkGroupID.x;
if (block_idx >= pc.n_blocks) return;
uint lane = gl_LocalInvocationID.x;
uint r = lane >> 3, c = lane & 7u;
uint dst_off = u_meta.meta[block_idx].x;
uint src_off = u_meta.meta[block_idx].y;
uint stride = pc.stride_u8;
int a = hpel_hv(src_off, stride, r, c);
int b = hpel_h(src_off, stride, r+1u, c);
int avg = (a + b + 1) >> 1;
u_dst.dst[dst_off + r * stride + c] = uint8_t(avg);
}
@@ -0,0 +1,44 @@
// daedalus-fourier — H.264 luma qpel mc30 (8x8, ¾-pel horizontal),
// V3D 7.1. Per H.264 §8.4.2.2.1 "c" position:
//
// dst[r,c] = ((clip255(mc20(s)[r,c]) + s[r,c+1] + 1) >> 1)
//
// Same as mc10 but L2-averages with src[r, c+1] instead of src[r, c].
//
// License: BSD-2-Clause.
#version 450
#extension GL_EXT_shader_8bit_storage : require
#extension GL_EXT_shader_explicit_arithmetic_types : require
layout(local_size_x = 64, local_size_y = 1, local_size_z = 1) in;
layout(binding = 0) readonly buffer Src { uint8_t src[]; } u_src;
layout(binding = 1) buffer Dst { uint8_t dst[]; } u_dst;
layout(binding = 2) readonly buffer Meta { uvec4 meta[]; } u_meta;
layout(push_constant) uniform PC { uint n_blocks, stride_u8, _p0, _p1; } pc;
void main()
{
uint block_idx = gl_WorkGroupID.x;
if (block_idx >= pc.n_blocks) return;
uint lane = gl_LocalInvocationID.x;
uint r = lane >> 3, c = lane & 7u;
uint dst_off = u_meta.meta[block_idx].x;
uint src_off = u_meta.meta[block_idx].y;
uint stride = pc.stride_u8;
uint row_base = src_off + r * stride + c;
int s_m2 = int(u_src.src[row_base - 2u]);
int s_m1 = int(u_src.src[row_base - 1u]);
int s_0 = int(u_src.src[row_base ]);
int s_p1 = int(u_src.src[row_base + 1u]);
int s_p2 = int(u_src.src[row_base + 2u]);
int s_p3 = int(u_src.src[row_base + 3u]);
int v = s_m2 - 5 * s_m1 + 20 * s_0 + 20 * s_p1 - 5 * s_p2 + s_p3 + 16;
int hp = clamp(v >> 5, 0, 255);
int avg = (hp + s_p1 + 1) >> 1; // L2 with src[r, c+1]
u_dst.dst[dst_off + r * stride + c] = uint8_t(avg);
}
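The mc30 arithmetic above can be cross-checked on the CPU with a few lines of scalar C. This is a minimal sketch, not the repo's reference code: the names `clip255`, `hpel_h_ref`, `mc30_ref` and `mc30_ref_example` are hypothetical, and it assumes the caller guarantees that `p[-2..+3]` are readable, as the shader does for its source buffer.

```c
#include <assert.h>
#include <stdint.h>

/* Clamp to the 8-bit sample range, like GLSL clamp(v, 0, 255). */
static int clip255(int v) { return v < 0 ? 0 : (v > 255 ? 255 : v); }

/* 6-tap half-pel filter (1, -5, 20, 20, -5, 1) along one row.
 * p points at src[r][c]; p[-2..+3] must be readable.
 * Note: >> on a negative int is implementation-defined in C; this
 * sketch assumes an arithmetic shift, as on mainstream compilers. */
static int hpel_h_ref(const uint8_t *p)
{
    int v = p[-2] - 5 * p[-1] + 20 * p[0] + 20 * p[1] - 5 * p[2] + p[3] + 16;
    return clip255(v >> 5);
}

/* mc30 sample: rounded average of the half-pel value and src[r][c+1]. */
static uint8_t mc30_ref(const uint8_t *p)
{
    return (uint8_t)((hpel_h_ref(p) + p[1] + 1) >> 1);
}

/* Hypothetical self-check, not part of the repo's test suite. */
void mc30_ref_example(void)
{
    const uint8_t flat[6] = {100, 100, 100, 100, 100, 100};
    const uint8_t step[6] = {0, 0, 0, 255, 255, 255};
    assert(mc30_ref(flat + 2) == 100);    /* taps sum to 32, so flat stays flat */
    assert(hpel_h_ref(step + 2) == 128);  /* (16*255 + 16) >> 5 */
    assert(mc30_ref(step + 2) == 192);    /* (128 + 255 + 1) >> 1 */
}
```

A flat row is a useful first probe because the taps sum to 32: any output other than the input value points at a rounding or indexing slip.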
+88
@@ -0,0 +1,88 @@
// daedalus-fourier — H.264 luma qpel mc31 (8x8, diagonal quarter-pel),
// V3D 7.1. Per H.264 §8.4.2.2.1 (table 8-4) — composes two half-pel
// anchors via L2 rounded-average:
//
// mc31[r,c] = avg(mc20(r, c),
// mc02(r, c+1))
//
// Per-lane structure: each lane computes BOTH anchor outputs at its
// own (r, c) target offset, then L2 averages. No shared memory.
// Same WG geometry as the other qpel shaders.
//
// License: BSD-2-Clause.
#version 450
#extension GL_EXT_shader_8bit_storage : require
#extension GL_EXT_shader_explicit_arithmetic_types : require
layout(local_size_x = 64, local_size_y = 1, local_size_z = 1) in;
layout(binding = 0) readonly buffer Src { uint8_t src[]; } u_src;
layout(binding = 1) buffer Dst { uint8_t dst[]; } u_dst;
layout(binding = 2) readonly buffer Meta { uvec4 meta[]; } u_meta;
layout(push_constant) uniform PC { uint n_blocks, stride_u8, _p0, _p1; } pc;
int hpel_h(uint src_off, uint stride, uint r, uint c) {
uint row_base = src_off + r * stride + c;
int s_m2 = int(u_src.src[row_base - 2u]);
int s_m1 = int(u_src.src[row_base - 1u]);
int s_0 = int(u_src.src[row_base ]);
int s_p1 = int(u_src.src[row_base + 1u]);
int s_p2 = int(u_src.src[row_base + 2u]);
int s_p3 = int(u_src.src[row_base + 3u]);
int v = s_m2 - 5*s_m1 + 20*s_0 + 20*s_p1 - 5*s_p2 + s_p3 + 16;
return clamp(v >> 5, 0, 255);
}
int hpel_v(uint src_off, uint stride, uint r, uint c) {
uint col_base = src_off + c;
int s_m2 = int(u_src.src[col_base + (r - 2u) * stride]);
int s_m1 = int(u_src.src[col_base + (r - 1u) * stride]);
int s_0 = int(u_src.src[col_base + r * stride]);
int s_p1 = int(u_src.src[col_base + (r + 1u) * stride]);
int s_p2 = int(u_src.src[col_base + (r + 2u) * stride]);
int s_p3 = int(u_src.src[col_base + (r + 3u) * stride]);
int v = s_m2 - 5*s_m1 + 20*s_0 + 20*s_p1 - 5*s_p2 + s_p3 + 16;
return clamp(v >> 5, 0, 255);
}
int hpel_hv_row(uint src_off, uint stride, uint rr, uint c) {
// Single row's int16 horizontal lowpass (NOT clipped — used as
// intermediate for the vertical pass of hpel_hv).
uint row_base = src_off + rr * stride + c;
int s_m2 = int(u_src.src[row_base - 2u]);
int s_m1 = int(u_src.src[row_base - 1u]);
int s_0 = int(u_src.src[row_base ]);
int s_p1 = int(u_src.src[row_base + 1u]);
int s_p2 = int(u_src.src[row_base + 2u]);
int s_p3 = int(u_src.src[row_base + 3u]);
return s_m2 - 5*s_m1 + 20*s_0 + 20*s_p1 - 5*s_p2 + s_p3;
}
int hpel_hv(uint src_off, uint stride, uint r, uint c) {
int t0 = hpel_hv_row(src_off, stride, r - 2u, c);
int t1 = hpel_hv_row(src_off, stride, r - 1u, c);
int t2 = hpel_hv_row(src_off, stride, r, c);
int t3 = hpel_hv_row(src_off, stride, r + 1u, c);
int t4 = hpel_hv_row(src_off, stride, r + 2u, c);
int t5 = hpel_hv_row(src_off, stride, r + 3u, c);
int v = t0 - 5*t1 + 20*t2 + 20*t3 - 5*t4 + t5 + 512;
return clamp(v >> 10, 0, 255);
}
void main()
{
uint block_idx = gl_WorkGroupID.x;
if (block_idx >= pc.n_blocks) return;
uint lane = gl_LocalInvocationID.x;
uint r = lane >> 3, c = lane & 7u;
uint dst_off = u_meta.meta[block_idx].x;
uint src_off = u_meta.meta[block_idx].y;
uint stride = pc.stride_u8;
int a = hpel_h(src_off, stride, r, c);
int b = hpel_v(src_off, stride, r, c+1u);
int avg = (a + b + 1) >> 1;
u_dst.dst[dst_off + r * stride + c] = uint8_t(avg);
}
+88
@@ -0,0 +1,88 @@
// daedalus-fourier — H.264 luma qpel mc32 (8x8, diagonal quarter-pel),
// V3D 7.1. Per H.264 §8.4.2.2.1 (table 8-4) — composes two half-pel
// anchors via L2 rounded-average:
//
// mc32[r,c] = avg(mc22(r, c),
// mc02(r, c+1))
//
// Per-lane structure: each lane computes BOTH anchor outputs at its
// own (r, c) target offset, then L2 averages. No shared memory.
// Same WG geometry as the other qpel shaders.
//
// License: BSD-2-Clause.
#version 450
#extension GL_EXT_shader_8bit_storage : require
#extension GL_EXT_shader_explicit_arithmetic_types : require
layout(local_size_x = 64, local_size_y = 1, local_size_z = 1) in;
layout(binding = 0) readonly buffer Src { uint8_t src[]; } u_src;
layout(binding = 1) buffer Dst { uint8_t dst[]; } u_dst;
layout(binding = 2) readonly buffer Meta { uvec4 meta[]; } u_meta;
layout(push_constant) uniform PC { uint n_blocks, stride_u8, _p0, _p1; } pc;
int hpel_h(uint src_off, uint stride, uint r, uint c) {
uint row_base = src_off + r * stride + c;
int s_m2 = int(u_src.src[row_base - 2u]);
int s_m1 = int(u_src.src[row_base - 1u]);
int s_0 = int(u_src.src[row_base ]);
int s_p1 = int(u_src.src[row_base + 1u]);
int s_p2 = int(u_src.src[row_base + 2u]);
int s_p3 = int(u_src.src[row_base + 3u]);
int v = s_m2 - 5*s_m1 + 20*s_0 + 20*s_p1 - 5*s_p2 + s_p3 + 16;
return clamp(v >> 5, 0, 255);
}
int hpel_v(uint src_off, uint stride, uint r, uint c) {
uint col_base = src_off + c;
int s_m2 = int(u_src.src[col_base + (r - 2u) * stride]);
int s_m1 = int(u_src.src[col_base + (r - 1u) * stride]);
int s_0 = int(u_src.src[col_base + r * stride]);
int s_p1 = int(u_src.src[col_base + (r + 1u) * stride]);
int s_p2 = int(u_src.src[col_base + (r + 2u) * stride]);
int s_p3 = int(u_src.src[col_base + (r + 3u) * stride]);
int v = s_m2 - 5*s_m1 + 20*s_0 + 20*s_p1 - 5*s_p2 + s_p3 + 16;
return clamp(v >> 5, 0, 255);
}
int hpel_hv_row(uint src_off, uint stride, uint rr, uint c) {
// Single row's int16 horizontal lowpass (NOT clipped — used as
// intermediate for the vertical pass of hpel_hv).
uint row_base = src_off + rr * stride + c;
int s_m2 = int(u_src.src[row_base - 2u]);
int s_m1 = int(u_src.src[row_base - 1u]);
int s_0 = int(u_src.src[row_base ]);
int s_p1 = int(u_src.src[row_base + 1u]);
int s_p2 = int(u_src.src[row_base + 2u]);
int s_p3 = int(u_src.src[row_base + 3u]);
return s_m2 - 5*s_m1 + 20*s_0 + 20*s_p1 - 5*s_p2 + s_p3;
}
int hpel_hv(uint src_off, uint stride, uint r, uint c) {
int t0 = hpel_hv_row(src_off, stride, r - 2u, c);
int t1 = hpel_hv_row(src_off, stride, r - 1u, c);
int t2 = hpel_hv_row(src_off, stride, r, c);
int t3 = hpel_hv_row(src_off, stride, r + 1u, c);
int t4 = hpel_hv_row(src_off, stride, r + 2u, c);
int t5 = hpel_hv_row(src_off, stride, r + 3u, c);
int v = t0 - 5*t1 + 20*t2 + 20*t3 - 5*t4 + t5 + 512;
return clamp(v >> 10, 0, 255);
}
void main()
{
uint block_idx = gl_WorkGroupID.x;
if (block_idx >= pc.n_blocks) return;
uint lane = gl_LocalInvocationID.x;
uint r = lane >> 3, c = lane & 7u;
uint dst_off = u_meta.meta[block_idx].x;
uint src_off = u_meta.meta[block_idx].y;
uint stride = pc.stride_u8;
int a = hpel_hv(src_off, stride, r, c);
int b = hpel_v(src_off, stride, r, c+1u);
int avg = (a + b + 1) >> 1;
u_dst.dst[dst_off + r * stride + c] = uint8_t(avg);
}
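The centre half-pel anchor used by mc32 (`hpel_hv`) differs from the 1-D anchors in that the row sums stay unclipped and a single rounding step (+512, >> 10) is applied at the end. A scalar C sketch of that two-pass structure follows; `hv_row_ref`, `hpel_hv_ref` and the example function are hypothetical names, and the caller is assumed to provide a 2-sample border above and to the left and a 3-sample border below and to the right.

```c
#include <assert.h>
#include <stdint.h>

static int clip255(int v) { return v < 0 ? 0 : (v > 255 ? 255 : v); }

/* Unclipped horizontal 6-tap sum for one row.
 * Range is [-2550, 10710], so it fits in int16. */
static int hv_row_ref(const uint8_t *p)
{
    return p[-2] - 5 * p[-1] + 20 * p[0] + 20 * p[1] - 5 * p[2] + p[3];
}

/* Centre half-pel sample at (r, c): vertical 6-tap over six unclipped
 * row sums, then one rounding step of +512 and >> 10.
 * Assumes arithmetic >> for negative intermediates. */
static int hpel_hv_ref(const uint8_t *src, int stride, int r, int c)
{
    int t[6];
    for (int k = 0; k < 6; k++)
        t[k] = hv_row_ref(src + (r - 2 + k) * stride + c);
    int v = t[0] - 5 * t[1] + 20 * t[2] + 20 * t[3] - 5 * t[4] + t[5] + 512;
    return clip255(v >> 10);
}

/* Hypothetical self-check on a 6x6 patch. */
void hpel_hv_ref_example(void)
{
    uint8_t flat[36], step[36];
    for (int i = 0; i < 36; i++) {
        flat[i] = 77;
        step[i] = (i / 6 >= 3) ? 255 : 0;  /* rows 3..5 are white */
    }
    assert(hpel_hv_ref(flat, 6, 2, 2) == 77);
    assert(hpel_hv_ref(step, 6, 2, 2) == 128);
}
```

Clipping the row sums before the vertical pass would give different results near strong edges, which is why the shader keeps a separate unclipped `hpel_hv_row` helper.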
+88
@@ -0,0 +1,88 @@
// daedalus-fourier — H.264 luma qpel mc33 (8x8, diagonal quarter-pel),
// V3D 7.1. Per H.264 §8.4.2.2.1 (table 8-4) — composes two half-pel
// anchors via L2 rounded-average:
//
// mc33[r,c] = avg(mc20(r+1, c),
// mc02(r, c+1))
//
// Per-lane structure: each lane computes BOTH anchor outputs at its
// own (r, c) target offset, then L2 averages. No shared memory.
// Same WG geometry as the other qpel shaders.
//
// License: BSD-2-Clause.
#version 450
#extension GL_EXT_shader_8bit_storage : require
#extension GL_EXT_shader_explicit_arithmetic_types : require
layout(local_size_x = 64, local_size_y = 1, local_size_z = 1) in;
layout(binding = 0) readonly buffer Src { uint8_t src[]; } u_src;
layout(binding = 1) buffer Dst { uint8_t dst[]; } u_dst;
layout(binding = 2) readonly buffer Meta { uvec4 meta[]; } u_meta;
layout(push_constant) uniform PC { uint n_blocks, stride_u8, _p0, _p1; } pc;
int hpel_h(uint src_off, uint stride, uint r, uint c) {
uint row_base = src_off + r * stride + c;
int s_m2 = int(u_src.src[row_base - 2u]);
int s_m1 = int(u_src.src[row_base - 1u]);
int s_0 = int(u_src.src[row_base ]);
int s_p1 = int(u_src.src[row_base + 1u]);
int s_p2 = int(u_src.src[row_base + 2u]);
int s_p3 = int(u_src.src[row_base + 3u]);
int v = s_m2 - 5*s_m1 + 20*s_0 + 20*s_p1 - 5*s_p2 + s_p3 + 16;
return clamp(v >> 5, 0, 255);
}
int hpel_v(uint src_off, uint stride, uint r, uint c) {
uint col_base = src_off + c;
int s_m2 = int(u_src.src[col_base + (r - 2u) * stride]);
int s_m1 = int(u_src.src[col_base + (r - 1u) * stride]);
int s_0 = int(u_src.src[col_base + r * stride]);
int s_p1 = int(u_src.src[col_base + (r + 1u) * stride]);
int s_p2 = int(u_src.src[col_base + (r + 2u) * stride]);
int s_p3 = int(u_src.src[col_base + (r + 3u) * stride]);
int v = s_m2 - 5*s_m1 + 20*s_0 + 20*s_p1 - 5*s_p2 + s_p3 + 16;
return clamp(v >> 5, 0, 255);
}
int hpel_hv_row(uint src_off, uint stride, uint rr, uint c) {
// Single row's int16 horizontal lowpass (NOT clipped — used as
// intermediate for the vertical pass of hpel_hv).
uint row_base = src_off + rr * stride + c;
int s_m2 = int(u_src.src[row_base - 2u]);
int s_m1 = int(u_src.src[row_base - 1u]);
int s_0 = int(u_src.src[row_base ]);
int s_p1 = int(u_src.src[row_base + 1u]);
int s_p2 = int(u_src.src[row_base + 2u]);
int s_p3 = int(u_src.src[row_base + 3u]);
return s_m2 - 5*s_m1 + 20*s_0 + 20*s_p1 - 5*s_p2 + s_p3;
}
int hpel_hv(uint src_off, uint stride, uint r, uint c) {
int t0 = hpel_hv_row(src_off, stride, r - 2u, c);
int t1 = hpel_hv_row(src_off, stride, r - 1u, c);
int t2 = hpel_hv_row(src_off, stride, r, c);
int t3 = hpel_hv_row(src_off, stride, r + 1u, c);
int t4 = hpel_hv_row(src_off, stride, r + 2u, c);
int t5 = hpel_hv_row(src_off, stride, r + 3u, c);
int v = t0 - 5*t1 + 20*t2 + 20*t3 - 5*t4 + t5 + 512;
return clamp(v >> 10, 0, 255);
}
void main()
{
uint block_idx = gl_WorkGroupID.x;
if (block_idx >= pc.n_blocks) return;
uint lane = gl_LocalInvocationID.x;
uint r = lane >> 3, c = lane & 7u;
uint dst_off = u_meta.meta[block_idx].x;
uint src_off = u_meta.meta[block_idx].y;
uint stride = pc.stride_u8;
int a = hpel_h(src_off, stride, r+1u, c);
int b = hpel_v(src_off, stride, r, c+1u);
int avg = (a + b + 1) >> 1;
u_dst.dst[dst_off + r * stride + c] = uint8_t(avg);
}
+69
@@ -0,0 +1,69 @@
// daedalus-fourier — H.264 chroma 4:2:0 H loop filter (horizontal
// filter across a vertical edge), non-intra bS<4 variant.
//
// Sibling of v3d_h264deblock_chroma_v.comp; same kernel transposed
// to read pix[-2..+1] (cols) instead of pix[-2*stride..+1*stride]
// (rows). Same geometry (8 cells = 4 segments × 2 cells), same WG layout (lanes
// 8..15 of each edge early-return, so only 8 are active per edge).
//
// 4:2:0-only: 4:2:2 chroma_h has a 16-row edge that this shader
// doesn't address. daedalus_dispatch_h264_deblock_chroma_h is
// 4:2:0-only by design; caller (libavcodec init) gates accordingly.
//
// License: BSD-2-Clause.
#version 450
#extension GL_EXT_shader_8bit_storage : require
#extension GL_EXT_shader_explicit_arithmetic_types : require
layout(local_size_x = 256, local_size_y = 1, local_size_z = 1) in;
layout(binding = 0) readonly buffer Meta { uvec4 meta[]; } u_meta;
layout(binding = 1) buffer Dst { uint8_t dst[]; } u_dst;
layout(push_constant) uniform PC {
uint n_edges;
uint dst_stride_u8;
uint _pad0;
uint _pad1;
} pc;
void main()
{
uint lane_in_wg = gl_GlobalInvocationID.x & 255u;
uint edge_in_wg = lane_in_wg >> 4; // 0..15
uint row_in_edge = lane_in_wg & 15u; // 0..15 — only 0..7 active
uint edge_idx = gl_WorkGroupID.x * 16u + edge_in_wg;
if (edge_idx >= pc.n_edges) return;
if (row_in_edge >= 8u) return;
uvec4 m = u_meta.meta[edge_idx];
uint stride = pc.dst_stride_u8;
uint dst_off = m.x + row_in_edge * stride;
int alpha = int(m.y & 0xffu);
int beta = int((m.y >> 8) & 0xffu);
uint seg = row_in_edge >> 1;
uint tc0_byte = (m.z >> (seg * 8u)) & 0xffu;
int tc0_s = int(tc0_byte);
if (tc0_s >= 128) tc0_s -= 256;
if (alpha == 0 || beta == 0) return;
if (tc0_s < 0) return;
int p1 = int(u_dst.dst[dst_off - 2u]);
int p0 = int(u_dst.dst[dst_off - 1u]);
int q0 = int(u_dst.dst[dst_off ]);
int q1 = int(u_dst.dst[dst_off + 1u]);
if (abs(p0 - q0) >= alpha) return;
if (abs(p1 - p0) >= beta) return;
if (abs(q1 - q0) >= beta) return;
int tc = tc0_s + 1;
int delta = clamp(((q0 - p0) * 4 + (p1 - q1) + 4) >> 3, -tc, tc);
u_dst.dst[dst_off - 1u] = uint8_t(clamp(p0 + delta, 0, 255));
u_dst.dst[dst_off ] = uint8_t(clamp(q0 - delta, 0, 255));
}
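The per-lane work of the chroma bS<4 shader above reduces to unpacking one signed tc0 byte and filtering four samples. A scalar C sketch of that logic is given below for checking individual lines by hand; `tc0_for_segment`, `chroma_filter_line` and the example function are hypothetical names, and the `px[0..3] = p1, p0, q0, q1` layout is an assumption made for the sketch rather than the shader's buffer layout.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>  /* abs */

static int clip3(int lo, int hi, int v) { return v < lo ? lo : (v > hi ? hi : v); }

/* Extract the signed 8-bit tc0 for one segment (0..3) from the packed word. */
static int tc0_for_segment(uint32_t packed, unsigned seg)
{
    int b = (int)((packed >> (seg * 8u)) & 0xffu);
    return b >= 128 ? b - 256 : b;
}

/* Filter one chroma line in place; px[0..3] = p1, p0, q0, q1.
 * Returns 1 if the line was modified, 0 if any early-out fired.
 * Assumes arithmetic >> for the (possibly negative) delta numerator. */
static int chroma_filter_line(uint8_t px[4], int alpha, int beta, int tc0)
{
    int p1 = px[0], p0 = px[1], q0 = px[2], q1 = px[3];
    if (alpha == 0 || beta == 0 || tc0 < 0) return 0;
    if (abs(p0 - q0) >= alpha) return 0;
    if (abs(p1 - p0) >= beta || abs(q1 - q0) >= beta) return 0;
    int tc = tc0 + 1;  /* chroma: no ap/aq bonus */
    int delta = clip3(-tc, tc, ((q0 - p0) * 4 + (p1 - q1) + 4) >> 3);
    px[1] = (uint8_t)clip3(0, 255, p0 + delta);
    px[2] = (uint8_t)clip3(0, 255, q0 - delta);
    return 1;
}

/* Hypothetical self-check. */
void chroma_filter_example(void)
{
    uint8_t px[4] = {100, 100, 108, 108};
    assert(tc0_for_segment(0xFF020100u, 3) == -1);  /* 0xFF marks a skipped segment */
    assert(chroma_filter_line(px, 20, 10, 1) == 1);
    assert(px[1] == 102 && px[2] == 106);           /* raw delta 3, clipped to tc = 2 */
}
```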
+44
@@ -0,0 +1,44 @@
// daedalus-fourier — H.264 chroma 4:2:0 intra (bS=4) H deblock —
// V3D 7.1. Transpose of v3d_h264deblock_chroma_v_intra.comp.
//
// License: BSD-2-Clause.
#version 450
#extension GL_EXT_shader_8bit_storage : require
#extension GL_EXT_shader_explicit_arithmetic_types : require
layout(local_size_x = 256, local_size_y = 1, local_size_z = 1) in;
layout(binding = 0) readonly buffer Meta { uvec4 meta[]; } u_meta;
layout(binding = 1) buffer Dst { uint8_t dst[]; } u_dst;
layout(push_constant) uniform PC {
uint n_edges, dst_stride_u8, _p0, _p1;
} pc;
void main()
{
uint lane_in_wg = gl_GlobalInvocationID.x & 255u;
uint edge_in_wg = lane_in_wg >> 4;
uint row_in_edge = lane_in_wg & 15u;
uint edge_idx = gl_WorkGroupID.x * 16u + edge_in_wg;
if (edge_idx >= pc.n_edges) return;
if (row_in_edge >= 8u) return;
uvec4 m = u_meta.meta[edge_idx];
uint stride = pc.dst_stride_u8;
uint dst_off = m.x + row_in_edge * stride;
int alpha = int(m.y & 0xffu);
int beta = int((m.y >> 8) & 0xffu);
if ((alpha | beta) == 0) return;
int p1 = int(u_dst.dst[dst_off - 2u]);
int p0 = int(u_dst.dst[dst_off - 1u]);
int q0 = int(u_dst.dst[dst_off ]);
int q1 = int(u_dst.dst[dst_off + 1u]);
if (abs(p0 - q0) >= alpha) return;
if (abs(p1 - p0) >= beta) return;
if (abs(q1 - q0) >= beta) return;
u_dst.dst[dst_off - 1u] = uint8_t(clamp((2*p1 + p0 + q1 + 2) >> 2, 0, 255));
u_dst.dst[dst_off ] = uint8_t(clamp((2*q1 + q0 + p1 + 2) >> 2, 0, 255));
}
+76
@@ -0,0 +1,76 @@
// daedalus-fourier — H.264 chroma 4:2:0 V loop filter (vertical
// filter across a horizontal edge), non-intra bS<4 variant.
//
// Per H.264 §8.7.2.3, the chroma kernel is simpler than luma's bS<4:
// only p0 / q0 are updated (chroma never modifies p1, p2, q1, q2),
// tC = tc0_seg + 1 (no luma-style ap/aq side bonus), and the edge
// spans 8 cells (4 segments × 2 cells/seg).
//
// V3D 7.1 via Mesa v3dv compute. WG geometry kept identical to the
// luma shader (16 edges × 16 lanes/WG) for uniform dispatch math
// across the deblock family; lanes 8..15 of each edge early-return.
//
// License: BSD-2-Clause.
#version 450
#extension GL_EXT_shader_8bit_storage : require
#extension GL_EXT_shader_explicit_arithmetic_types : require
layout(local_size_x = 256, local_size_y = 1, local_size_z = 1) in;
layout(binding = 0) readonly buffer Meta {
uvec4 meta[]; // per edge: (dst_off, alpha|beta<<8, packed_tc0, _pad)
} u_meta;
layout(binding = 1) buffer Dst {
uint8_t dst[];
} u_dst;
layout(push_constant) uniform PC {
uint n_edges;
uint dst_stride_u8;
uint _pad0;
uint _pad1;
} pc;
void main()
{
uint lane_in_wg = gl_GlobalInvocationID.x & 255u;
uint edge_in_wg = lane_in_wg >> 4; // 0..15
uint col_in_edge = lane_in_wg & 15u; // 0..15 — only 0..7 active
uint edge_idx = gl_WorkGroupID.x * 16u + edge_in_wg;
if (edge_idx >= pc.n_edges) return;
if (col_in_edge >= 8u) return; // 8 cells per chroma edge
uvec4 m = u_meta.meta[edge_idx];
uint dst_off = m.x + col_in_edge;
uint stride = pc.dst_stride_u8;
int alpha = int(m.y & 0xffu);
int beta = int((m.y >> 8) & 0xffu);
// 8 cells / 4 segments = 2 cells per segment.
uint seg = col_in_edge >> 1;
uint tc0_byte = (m.z >> (seg * 8u)) & 0xffu;
int tc0_s = int(tc0_byte);
if (tc0_s >= 128) tc0_s -= 256;
if (alpha == 0 || beta == 0) return;
if (tc0_s < 0) return;
int p1 = int(u_dst.dst[dst_off - 2u * stride]);
int p0 = int(u_dst.dst[dst_off - 1u * stride]);
int q0 = int(u_dst.dst[dst_off]);
int q1 = int(u_dst.dst[dst_off + 1u * stride]);
if (abs(p0 - q0) >= alpha) return;
if (abs(p1 - p0) >= beta) return;
if (abs(q1 - q0) >= beta) return;
int tc = tc0_s + 1;
int delta = clamp(((q0 - p0) * 4 + (p1 - q1) + 4) >> 3, -tc, tc);
u_dst.dst[dst_off - 1u * stride] = uint8_t(clamp(p0 + delta, 0, 255));
u_dst.dst[dst_off ] = uint8_t(clamp(q0 - delta, 0, 255));
// p1, q1 untouched — chroma kernel only updates p0/q0.
}
+54
@@ -0,0 +1,54 @@
// daedalus-fourier — H.264 chroma 4:2:0 intra (bS=4) V deblock —
// V3D 7.1. Per H.264 §8.7.2.4 chroma intra path: simpler than luma
// — always weak filter, only p0/q0 updated, 8 cells per edge.
//
// p0' = (2*p1 + p0 + q1 + 2) >> 2
// q0' = (2*q1 + q0 + p1 + 2) >> 2
//
// Same 16-edges × 16-lanes/edge WG shape as luma; lanes 8..15 of each
// edge early-return (chroma edges are only 8 cells wide).
//
// 4:2:0-only — caller-side gating handles 4:2:2 (chroma_format_idc>1)
// at the libavcodec init layer.
//
// License: BSD-2-Clause.
#version 450
#extension GL_EXT_shader_8bit_storage : require
#extension GL_EXT_shader_explicit_arithmetic_types : require
layout(local_size_x = 256, local_size_y = 1, local_size_z = 1) in;
layout(binding = 0) readonly buffer Meta { uvec4 meta[]; } u_meta;
layout(binding = 1) buffer Dst { uint8_t dst[]; } u_dst;
layout(push_constant) uniform PC {
uint n_edges, dst_stride_u8, _p0, _p1;
} pc;
void main()
{
uint lane_in_wg = gl_GlobalInvocationID.x & 255u;
uint edge_in_wg = lane_in_wg >> 4;
uint col_in_edge = lane_in_wg & 15u;
uint edge_idx = gl_WorkGroupID.x * 16u + edge_in_wg;
if (edge_idx >= pc.n_edges) return;
if (col_in_edge >= 8u) return;
uvec4 m = u_meta.meta[edge_idx];
uint dst_off = m.x + col_in_edge;
uint stride = pc.dst_stride_u8;
int alpha = int(m.y & 0xffu);
int beta = int((m.y >> 8) & 0xffu);
if ((alpha | beta) == 0) return;
int p1 = int(u_dst.dst[dst_off - 2u * stride]);
int p0 = int(u_dst.dst[dst_off - 1u * stride]);
int q0 = int(u_dst.dst[dst_off]);
int q1 = int(u_dst.dst[dst_off + 1u * stride]);
if (abs(p0 - q0) >= alpha) return;
if (abs(p1 - p0) >= beta) return;
if (abs(q1 - q0) >= beta) return;
u_dst.dst[dst_off - 1u * stride] = uint8_t(clamp((2*p1 + p0 + q1 + 2) >> 2, 0, 255));
u_dst.dst[dst_off ] = uint8_t(clamp((2*q1 + q0 + p1 + 2) >> 2, 0, 255));
}
+111
@@ -0,0 +1,111 @@
// daedalus-fourier — H.264 luma "h_loop_filter" (horizontal filtering
// across a vertical edge), non-intra bS<4 variant. Sibling of cycle 8's
// v3d_h264deblock.comp; same algorithm with row/col access transposed.
//
// V3D 7.1 via Mesa v3dv compute. Same WG geometry as the V shader:
// - 256 invocations / WG, 16 edges/WG (16 lanes/edge = 1 sg/edge)
// - uint8_t dst SSBO via storageBuffer8BitAccess
// - No barrier (each lane independent)
// - lane_in_edge = ROW index (0..15) along the vertical edge
// - meta.dst_off points to (row 0, col 0) of the RIGHT block;
// the kernel reads cols [-3..+2] of each row and writes [-2..+1].
//
// Filter contract (per H.264 §8.7.2.3):
// 1. (m.x % pc.dst_stride_u8) ≥ 4 (the bS=4 sibling reads p3 at pix[-4]; this kernel reads down to pix[-3])
// 2. pc.dst_stride_u8 = byte stride between rows
// 3. tc0_s pre-stored as signed int8 in m.z packed 4 bytes (one per
// 4-row segment along the 16-row edge)
//
// License: BSD-2-Clause. Algorithm transcribed from
// tests/h264_h_loop_filter_luma_ref.c which mirrors FFmpeg
// ff_h264_h_loop_filter_luma_neon (LGPL-2.1+).
#version 450
#extension GL_EXT_shader_8bit_storage : require
#extension GL_EXT_shader_explicit_arithmetic_types : require
layout(local_size_x = 256, local_size_y = 1, local_size_z = 1) in;
layout(binding = 0) readonly buffer Meta {
uvec4 meta[]; // per edge: (dst_off, alpha|beta<<8, packed_tc0, _pad)
} u_meta;
layout(binding = 1) buffer Dst {
uint8_t dst[];
} u_dst;
layout(push_constant) uniform PC {
uint n_edges;
uint dst_stride_u8;
uint _pad0;
uint _pad1;
} pc;
void main()
{
uint gid = gl_GlobalInvocationID.x;
uint wg_id = gl_WorkGroupID.x;
uint lane_in_wg = gid & 255u;
uint edge_in_wg = lane_in_wg >> 4; // 0..15 (16 edges/WG)
uint row_in_edge = lane_in_wg & 15u; // 0..15 — ROW along the V edge
uint edge_idx = wg_id * 16u + edge_in_wg;
if (edge_idx >= pc.n_edges) return;
uvec4 m = u_meta.meta[edge_idx];
uint stride = pc.dst_stride_u8;
// dst_off addresses row 0 col 0 of the right block; advance by row * stride
// to land at this lane's row. The kernel reads pix[-3..+2] AT THIS ROW.
uint dst_off = m.x + row_in_edge * stride;
int alpha = int(m.y & 0xffu);
int beta = int((m.y >> 8) & 0xffu);
// tc0 segment = 0..3 indexed by (row_in_edge / 4).
uint seg = row_in_edge >> 2;
uint tc0_byte = (m.z >> (seg * 8u)) & 0xffu;
int tc0_s = int(tc0_byte);
if (tc0_s >= 128) tc0_s -= 256;
if (alpha == 0 || beta == 0) return;
if (tc0_s < 0) return; // segment skip
// Horizontal access pattern — read cols at offsets [-3..+2] of this row.
// p3 (col -4) unused in bS<4; same DCE comment as the V shader.
int p2 = int(u_dst.dst[dst_off - 3u]);
int p1 = int(u_dst.dst[dst_off - 2u]);
int p0 = int(u_dst.dst[dst_off - 1u]);
int q0 = int(u_dst.dst[dst_off ]);
int q1 = int(u_dst.dst[dst_off + 1u]);
int q2 = int(u_dst.dst[dst_off + 2u]);
// Edge preconditions (same as V).
if (abs(p0 - q0) >= alpha) return;
if (abs(p1 - p0) >= beta) return;
if (abs(q1 - q0) >= beta) return;
int ap = abs(p2 - p0);
int aq = abs(q2 - q0);
bool ap_lt = ap < beta;
bool aq_lt = aq < beta;
int tc = tc0_s + int(ap_lt) + int(aq_lt);
int delta = clamp(((q0 - p0) * 4 + (p1 - q1) + 4) >> 3, -tc, tc);
int p0p = clamp(p0 + delta, 0, 255);
int q0p = clamp(q0 - delta, 0, 255);
int p1p = p1;
if (ap_lt) {
int d_p1 = clamp((p2 + ((p0 + q0 + 1) >> 1) - 2*p1) >> 1, -tc0_s, tc0_s);
p1p = clamp(p1 + d_p1, 0, 255);
}
int q1p = q1;
if (aq_lt) {
int d_q1 = clamp((q2 + ((p0 + q0 + 1) >> 1) - 2*q1) >> 1, -tc0_s, tc0_s);
q1p = clamp(q1 + d_q1, 0, 255);
}
u_dst.dst[dst_off - 2u] = uint8_t(p1p);
u_dst.dst[dst_off - 1u] = uint8_t(p0p);
u_dst.dst[dst_off ] = uint8_t(q0p);
u_dst.dst[dst_off + 1u] = uint8_t(q1p);
}
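The lane decomposition shared by the deblock shaders (256 lanes per workgroup, 16 edges per workgroup, 16 lanes per edge, 4 rows per tc0 segment for luma) can be written out on the host side as follows. This is a sketch under stated assumptions: `lane_map`, `map_lane` and `wg_count` are hypothetical names, and the round-up in `wg_count` is an assumption about how a dispatcher would size the grid, since the dispatch code is not shown in this diff.

```c
#include <assert.h>

/* Result of decomposing a lane index the way the luma H shader does. */
typedef struct {
    unsigned edge_idx;     /* global edge index */
    unsigned row_in_edge;  /* 0..15 along the vertical edge */
    unsigned seg;          /* 0..3, selects the tc0 byte */
} lane_map;

static lane_map map_lane(unsigned wg_id, unsigned lane_in_wg)
{
    lane_map m;
    m.edge_idx    = wg_id * 16u + (lane_in_wg >> 4);  /* 16 edges per WG */
    m.row_in_edge = lane_in_wg & 15u;                 /* 16 lanes per edge */
    m.seg         = m.row_in_edge >> 2;               /* 4 rows per segment */
    return m;
}

/* Workgroups needed to cover n_edges at 16 edges each (assumed round-up). */
static unsigned wg_count(unsigned n_edges) { return (n_edges + 15u) / 16u; }

/* Hypothetical self-check. */
void lane_map_example(void)
{
    lane_map m = map_lane(2u, 0x37u);  /* lane 55 of workgroup 2 */
    assert(m.edge_idx == 35u && m.row_in_edge == 7u && m.seg == 1u);
    assert(wg_count(17u) == 2u);       /* last WG is mostly idle */
}
```

With a round-up grid the final workgroup can contain lanes whose `edge_idx` is past the end, which is what the `edge_idx >= pc.n_edges` early return in each shader guards against.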
+70
@@ -0,0 +1,70 @@
// daedalus-fourier — H.264 luma intra (bS=4) H deblock — V3D 7.1.
//
// Sibling of v3d_h264deblock_luma_v_intra.comp transposed to the
// horizontal axis: lane → row, reads pix[-4..+3] (cols) instead of
// pix[-4*stride..+3*stride] (rows). Same strong/weak filter
// selector + same write-back algebra.
//
// dst_off contract: (m.x % stride) ≥ 4 (kernel reads p3 at pix[-4]).
//
// License: BSD-2-Clause.
#version 450
#extension GL_EXT_shader_8bit_storage : require
#extension GL_EXT_shader_explicit_arithmetic_types : require
layout(local_size_x = 256, local_size_y = 1, local_size_z = 1) in;
layout(binding = 0) readonly buffer Meta { uvec4 meta[]; } u_meta;
layout(binding = 1) buffer Dst { uint8_t dst[]; } u_dst;
layout(push_constant) uniform PC {
uint n_edges, dst_stride_u8, _p0, _p1;
} pc;
void main()
{
uint lane_in_wg = gl_GlobalInvocationID.x & 255u;
uint edge_in_wg = lane_in_wg >> 4;
uint row_in_edge = lane_in_wg & 15u;
uint edge_idx = gl_WorkGroupID.x * 16u + edge_in_wg;
if (edge_idx >= pc.n_edges) return;
uvec4 m = u_meta.meta[edge_idx];
uint stride = pc.dst_stride_u8;
uint dst_off = m.x + row_in_edge * stride;
int alpha = int(m.y & 0xffu);
int beta = int((m.y >> 8) & 0xffu);
if ((alpha | beta) == 0) return;
int p3 = int(u_dst.dst[dst_off - 4u]);
int p2 = int(u_dst.dst[dst_off - 3u]);
int p1 = int(u_dst.dst[dst_off - 2u]);
int p0 = int(u_dst.dst[dst_off - 1u]);
int q0 = int(u_dst.dst[dst_off ]);
int q1 = int(u_dst.dst[dst_off + 1u]);
int q2 = int(u_dst.dst[dst_off + 2u]);
int q3 = int(u_dst.dst[dst_off + 3u]);
if (abs(p0 - q0) >= alpha) return;
if (abs(p1 - p0) >= beta) return;
if (abs(q1 - q0) >= beta) return;
bool strong_common = abs(p0 - q0) < (alpha >> 2) + 2;
bool strong_p = strong_common && abs(p2 - p0) < beta;
bool strong_q = strong_common && abs(q2 - q0) < beta;
if (strong_p) {
u_dst.dst[dst_off - 1u] = uint8_t(clamp((p2 + 2*p1 + 2*p0 + 2*q0 + q1 + 4) >> 3, 0, 255));
u_dst.dst[dst_off - 2u] = uint8_t(clamp((p2 + p1 + p0 + q0 + 2) >> 2, 0, 255));
u_dst.dst[dst_off - 3u] = uint8_t(clamp((2*p3 + 3*p2 + p1 + p0 + q0 + 4) >> 3, 0, 255));
} else {
u_dst.dst[dst_off - 1u] = uint8_t(clamp((2*p1 + p0 + q1 + 2) >> 2, 0, 255));
}
if (strong_q) {
u_dst.dst[dst_off ] = uint8_t(clamp((q2 + 2*q1 + 2*q0 + 2*p0 + p1 + 4) >> 3, 0, 255));
u_dst.dst[dst_off + 1u] = uint8_t(clamp((q2 + q1 + q0 + p0 + 2) >> 2, 0, 255));
u_dst.dst[dst_off + 2u] = uint8_t(clamp((2*q3 + 3*q2 + q1 + q0 + p0 + 4) >> 3, 0, 255));
} else {
u_dst.dst[dst_off ] = uint8_t(clamp((2*q1 + q0 + p1 + 2) >> 2, 0, 255));
}
}
+81
@@ -0,0 +1,81 @@
// daedalus-fourier — H.264 luma intra (bS=4) V deblock — V3D 7.1.
//
// Per H.264 §8.7.2.4: at I-MB edges and certain inter-MB edges that
// force boundary strength to 4, the deblock kernel is structurally
// different from bS<4 — it has a per-side strong/weak filter
// selector that decides whether to update 3 cells (strong) or 1
// (weak), reads p3/q3, and ignores tc0.
//
// strong_common = |p0-q0| < (α>>2) + 2
// strong_p = strong_common AND |p2-p0| < β
// strong_q = strong_common AND |q2-q0| < β
//
// Strong-p updates p0/p1/p2 with specific 5-/4-/5-tap blends.
// Weak-p updates p0 only with (2*p1 + p0 + q1 + 2) >> 2.
// Mirror for q-side.
//
// WG geometry identical to v3d_h264deblock.comp (16 edges × 16 lanes/WG).
// dst_off contract: m.x ≥ 4*stride (kernel reads p3 at -4*stride).
//
// License: BSD-2-Clause. Algorithm transcribed from
// tests/h264_intra_loop_filter_ref.c (PR #11).
#version 450
#extension GL_EXT_shader_8bit_storage : require
#extension GL_EXT_shader_explicit_arithmetic_types : require
layout(local_size_x = 256, local_size_y = 1, local_size_z = 1) in;
layout(binding = 0) readonly buffer Meta { uvec4 meta[]; } u_meta;
layout(binding = 1) buffer Dst { uint8_t dst[]; } u_dst;
layout(push_constant) uniform PC {
uint n_edges, dst_stride_u8, _p0, _p1;
} pc;
void main()
{
uint lane_in_wg = gl_GlobalInvocationID.x & 255u;
uint edge_in_wg = lane_in_wg >> 4;
uint col_in_edge = lane_in_wg & 15u;
uint edge_idx = gl_WorkGroupID.x * 16u + edge_in_wg;
if (edge_idx >= pc.n_edges) return;
uvec4 m = u_meta.meta[edge_idx];
uint dst_off = m.x + col_in_edge;
uint stride = pc.dst_stride_u8;
int alpha = int(m.y & 0xffu);
int beta = int((m.y >> 8) & 0xffu);
if ((alpha | beta) == 0) return;
int p3 = int(u_dst.dst[dst_off - 4u * stride]);
int p2 = int(u_dst.dst[dst_off - 3u * stride]);
int p1 = int(u_dst.dst[dst_off - 2u * stride]);
int p0 = int(u_dst.dst[dst_off - 1u * stride]);
int q0 = int(u_dst.dst[dst_off]);
int q1 = int(u_dst.dst[dst_off + 1u * stride]);
int q2 = int(u_dst.dst[dst_off + 2u * stride]);
int q3 = int(u_dst.dst[dst_off + 3u * stride]);
if (abs(p0 - q0) >= alpha) return;
if (abs(p1 - p0) >= beta) return;
if (abs(q1 - q0) >= beta) return;
bool strong_common = abs(p0 - q0) < (alpha >> 2) + 2;
bool strong_p = strong_common && abs(p2 - p0) < beta;
bool strong_q = strong_common && abs(q2 - q0) < beta;
if (strong_p) {
u_dst.dst[dst_off - 1u * stride] = uint8_t(clamp((p2 + 2*p1 + 2*p0 + 2*q0 + q1 + 4) >> 3, 0, 255));
u_dst.dst[dst_off - 2u * stride] = uint8_t(clamp((p2 + p1 + p0 + q0 + 2) >> 2, 0, 255));
u_dst.dst[dst_off - 3u * stride] = uint8_t(clamp((2*p3 + 3*p2 + p1 + p0 + q0 + 4) >> 3, 0, 255));
} else {
u_dst.dst[dst_off - 1u * stride] = uint8_t(clamp((2*p1 + p0 + q1 + 2) >> 2, 0, 255));
}
if (strong_q) {
u_dst.dst[dst_off ] = uint8_t(clamp((q2 + 2*q1 + 2*q0 + 2*p0 + p1 + 4) >> 3, 0, 255));
u_dst.dst[dst_off + 1u * stride] = uint8_t(clamp((q2 + q1 + q0 + p0 + 2) >> 2, 0, 255));
u_dst.dst[dst_off + 2u * stride] = uint8_t(clamp((2*q3 + 3*q2 + q1 + q0 + p0 + 4) >> 3, 0, 255));
} else {
u_dst.dst[dst_off ] = uint8_t(clamp((2*q1 + q0 + p1 + 2) >> 2, 0, 255));
}
}
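The per-side strong/weak selection in the intra (bS=4) luma kernels can be exercised on a single 8-sample line in scalar C. The sketch below mirrors the shader's formulas; `intra_filter_line` and the example function are hypothetical names, and the `s[0..7] = p3..q3` layout is an assumption of the sketch. The shader's clamp is omitted here because every output is a weighted average of 8-bit samples with weights summing to the divisor, so it cannot leave 0..255.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>  /* abs */

/* Filter one line across the edge in place.
 * s[0..7] = p3, p2, p1, p0, q0, q1, q2, q3. Returns 1 if modified. */
static int intra_filter_line(uint8_t s[8], int alpha, int beta)
{
    int p3 = s[0], p2 = s[1], p1 = s[2], p0 = s[3];
    int q0 = s[4], q1 = s[5], q2 = s[6], q3 = s[7];
    if ((alpha | beta) == 0) return 0;
    if (abs(p0 - q0) >= alpha) return 0;
    if (abs(p1 - p0) >= beta || abs(q1 - q0) >= beta) return 0;
    int strong_common = abs(p0 - q0) < (alpha >> 2) + 2;
    if (strong_common && abs(p2 - p0) < beta) {   /* strong p side: 3 cells */
        s[3] = (uint8_t)((p2 + 2*p1 + 2*p0 + 2*q0 + q1 + 4) >> 3);
        s[2] = (uint8_t)((p2 + p1 + p0 + q0 + 2) >> 2);
        s[1] = (uint8_t)((2*p3 + 3*p2 + p1 + p0 + q0 + 4) >> 3);
    } else {                                      /* weak p side: p0 only */
        s[3] = (uint8_t)((2*p1 + p0 + q1 + 2) >> 2);
    }
    if (strong_common && abs(q2 - q0) < beta) {   /* strong q side: 3 cells */
        s[4] = (uint8_t)((q2 + 2*q1 + 2*q0 + 2*p0 + p1 + 4) >> 3);
        s[5] = (uint8_t)((q2 + q1 + q0 + p0 + 2) >> 2);
        s[6] = (uint8_t)((2*q3 + 3*q2 + q1 + q0 + p0 + 4) >> 3);
    } else {                                      /* weak q side: q0 only */
        s[4] = (uint8_t)((2*q1 + q0 + p1 + 2) >> 2);
    }
    return 1;
}

/* Hypothetical self-check: a 10 -> 20 step becomes a monotone ramp. */
void intra_filter_example(void)
{
    uint8_t s[8] = {10, 10, 10, 10, 20, 20, 20, 20};
    const uint8_t want[8] = {10, 11, 13, 14, 16, 18, 19, 20};
    assert(intra_filter_line(s, 40, 10) == 1);
    for (int i = 0; i < 8; i++) assert(s[i] == want[i]);
}
```

All eight samples are read before any write, matching the shader, so the q-side formulas see the original p0 and p1 rather than the filtered ones.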
+62 -2
@@ -8,6 +8,8 @@
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <limits.h>
#define CHK(call) do { VkResult r__ = (call); if (r__ != VK_SUCCESS) { \
fprintf(stderr, "v3d_runner: vulkan error %d at %s:%d (%s)\n", \
@@ -368,10 +370,68 @@ void v3d_runner_destroy_buffer(v3d_runner *r, v3d_buffer *buf)
/* ---- Pipelines -------------------------------------------------- */
/* SPV lookup tries a small set of locations. The caller passes a bare
* filename (e.g. "v3d_h264_idct4.spv"); we try, in order:
*
* 1. cwd-relative (legacy contract; works when run from build/)
* 2. $DAEDALUS_SHADER_DIR (env override for tests / packaged installs)
* 3. <binary-dir>/<name> (so the bench/test binary finds the SPV next
* to itself regardless of cwd — this is the
* fix for the silent-no-SPV regression that
* made PR #36's bench numbers meaningless)
* 4. /opt/fourier/share/daedalus-fourier/<name> (Pi 5 install layout)
* 5. /usr/share/daedalus-fourier/<name> (system-wide install)
*
* Returns NULL only if every location fails; read_spv() then prints a
* single diagnostic naming the bare filename so the user can grep for it. */
static FILE *open_spv(const char *name)
{
FILE *f = fopen(name, "rb");
if (f) return f;
const char *envdir = getenv("DAEDALUS_SHADER_DIR");
if (envdir && *envdir) {
char p[PATH_MAX];
snprintf(p, sizeof(p), "%s/%s", envdir, name);
f = fopen(p, "rb");
if (f) return f;
}
char exe[PATH_MAX];
ssize_t n = readlink("/proc/self/exe", exe, sizeof(exe) - 1);
if (n > 0) {
exe[n] = 0;
char *slash = strrchr(exe, '/');
if (slash) {
*slash = 0;
char p[PATH_MAX];
snprintf(p, sizeof(p), "%s/%s", exe, name);
f = fopen(p, "rb");
if (f) return f;
}
}
char p[PATH_MAX];
snprintf(p, sizeof(p), "/opt/fourier/share/daedalus-fourier/%s", name);
f = fopen(p, "rb");
if (f) return f;
snprintf(p, sizeof(p), "/usr/share/daedalus-fourier/%s", name);
f = fopen(p, "rb");
if (f) return f;
return NULL;
}
static uint32_t *read_spv(const char *path, size_t *out_size)
{
FILE *f = fopen(path, "rb");
if (!f) { perror(path); return NULL; }
FILE *f = open_spv(path);
if (!f) {
fprintf(stderr,
"daedalus: SPV not found via cwd / $DAEDALUS_SHADER_DIR / "
"binary-dir / /opt/fourier/share / /usr/share: %s\n", path);
return NULL;
}
fseek(f, 0, SEEK_END);
long sz = ftell(f);
fseek(f, 0, SEEK_SET);
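The search in `open_spv` builds each candidate with `snprintf` and does not inspect the return value, so an over-long directory would yield a truncated path that simply fails `fopen` and falls through to the next location. If one wanted to make that case explicit, a small join helper could report truncation. The sketch below is a possible hardening, not code from this PR; `join_path` and the example function are hypothetical names.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Join dir and name as "dir/name" into out (capacity cap bytes).
 * Returns 0 on success, -1 if the result would not fit or on an
 * snprintf encoding error. */
static int join_path(char *out, size_t cap, const char *dir, const char *name)
{
    int n = snprintf(out, cap, "%s/%s", dir, name);
    return (n < 0 || (size_t)n >= cap) ? -1 : 0;
}

/* Hypothetical self-check. */
void join_path_example(void)
{
    char ok[64], small[8];
    assert(join_path(ok, sizeof ok, "/usr/share/daedalus-fourier",
                     "v3d_h264_idct4.spv") == 0);
    assert(strcmp(ok, "/usr/share/daedalus-fourier/v3d_h264_idct4.spv") == 0);
    assert(join_path(small, sizeof small, "/usr/share", "a.spv") == -1);
}
```

`snprintf` returns the length the full string would have had, so comparing it against the buffer capacity is enough to detect truncation without a second pass.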
+161 -82
@@ -2,25 +2,22 @@
/* CLOCK_MONOTONIC under -std=c11 -CMAKE_C_EXTENSIONS=OFF. */
#define _POSIX_C_SOURCE 200809L
/*
* bench_h264_primitives — NEON-path latency baseline for the H.264
* primitive library landed across PRs #9-#23.
* bench_h264_primitives — latency baseline for the H.264 primitive
* library landed across PRs #9-#35.
*
* Each kernel is exercised at a representative per-frame N for 1080p
* (8160 MBs); the per-kernel total + ns/op + ms/frame are reported.
* Lets us answer "what's the total NEON-only budget for the H.264
* decode at 1080p" — useful for sizing intercept-patch decisions
* (which kernels NEED QPU shaders vs which are budget-fine on NEON).
* (8160 MBs); the per-kernel total + ns/op + ms/frame are reported,
* once per substrate (CPU NEON, QPU V3D7 compute). The QPU column
* appears only when the host has a usable Vulkan device. When both
* columns exist a CPU/QPU ratio is printed; that's the per-kernel
* data the QPU-substrate decree (2026-05-23) deliberately overrides
* but which is still useful to track over time as dispatch overhead
* shrinks (buffer pool, persistent cmdbuf, dmabuf import — tasks 160-162).
*
* NOT a ctest — produces wall-time numbers, doesn't pass/fail.
*
* Invoke: ./build/bench_h264_primitives [iters]
* (default iters = 50, post-warmup = 5)
*
* NB: results are inherently approximate — single-core, includes
* loop overhead + memory access patterns that may not match what
* a real decode would hit (we touch a small set of pages repeatedly).
* The numbers are useful for relative comparison and order-of-
* magnitude sizing, not absolute perf claims.
* Invoke: ./build/bench_h264_primitives [iters [warmup]]
* (default iters = 50, warmup = 5)
*/
#include "daedalus.h"
@@ -46,34 +43,47 @@ static double now_ms(void) {
/* Per-1080p-frame counts (8160 MBs at 1920x1088). */
#define MBS_1080P 8160
#define LUMA_4x4_PER_MB 16 /* if transform_8x8=0 */
#define LUMA_8x8_PER_MB 4 /* if transform_8x8=1 */
#define CHROMA_4x4_PER_MB 8 /* 4 Cb + 4 Cr */
#define DEBLOCK_LUMA_EDGES_PER_MB 4 /* 4 horiz + 4 vert internal+MB-edge — ~4 each */
#define DEBLOCK_CHROMA_EDGES_PER_MB 2 /* 2 each direction */
/* Standard benchmark loop. fn() is called n times per iteration. */
typedef void (*bench_fn)(void);
/* Standard benchmark loop. fn() is called n times per iteration.
*
* fn() now returns the dispatch's int rc. A single preflight call is
* made before the hot loop; if rc != 0 (which on the QPU substrate
* almost always means "SPV not found via any search path"), bench_ns
* returns -1 and the caller must NOT report the kernel as measured.
*
* Without this, a missing SPV makes every dispatch fail fast at the
* cost of one fprintf+open call (~1-5 µs), and the loop times that
* cost as if it were real QPU work — producing absurdly-small ns/op
* numbers that look like a QPU speedup. This is exactly what made
* PR #36's bench numbers a measurement artifact. */
typedef int (*bench_fn)(void);
static double bench_ns(const char *name, int iters, int warmup,
int ops_per_iter, bench_fn fn)
{
int rc = fn();
if (rc != 0) {
printf(" %-32s DISPATCH FAILED rc=%d — kernel skipped\n", name, rc);
return -1;
}
for (int i = 0; i < warmup; i++) fn();
double t0 = now_ms();
for (int i = 0; i < iters; i++) fn();
double t1 = now_ms();
double total_ms = (t1 - t0);
double ns_per_op = (total_ms * 1e6) / ((double) iters * ops_per_iter);
printf(" %-32s %8.2f ns/op (%d iters x %d ops)\n",
printf(" %-32s %10.2f ns/op (%d iters x %d ops)\n",
name, ns_per_op, iters, ops_per_iter);
return ns_per_op;
}
/* ---- Per-kernel scaffolding. Each section sets up the buffers +
* meta, then defines a static fn() that calls the corresponding
* dispatch with a representative N. */
* dispatch with a representative N. The substrate is read from the
* global g_sub so the same fn() can be re-driven with CPU then QPU. */
static daedalus_ctx *ctx;
static daedalus_ctx *ctx;
static daedalus_substrate g_sub = DAEDALUS_SUBSTRATE_CPU;
/* --- IDCT 4x4 luma: N = 16 blocks per MB. Bench with 1024 blocks
* per call (64 MBs worth). Per-MB the dispatch overhead is the
@@ -82,8 +92,8 @@ static int16_t idct4_coeffs[1024 * 16];
static daedalus_h264_block_meta idct4_meta[1024];
static uint8_t idct_dst[64 * 4 * 16 * 16]; /* 64 MB-rows × ... */
static void bench_idct4(void) {
daedalus_dispatch_h264_idct4(ctx, DAEDALUS_SUBSTRATE_CPU,
static int bench_idct4(void) {
return daedalus_dispatch_h264_idct4(ctx, g_sub,
idct_dst, 64*16, idct4_coeffs, 1024, idct4_meta);
}
@@ -91,8 +101,8 @@ static void bench_idct4(void) {
static int16_t idct8_coeffs[256 * 64];
static daedalus_h264_block_meta idct8_meta[256];
static void bench_idct8(void) {
daedalus_dispatch_h264_idct8(ctx, DAEDALUS_SUBSTRATE_CPU,
static int bench_idct8(void) {
return daedalus_dispatch_h264_idct8(ctx, g_sub,
idct_dst, 64*16, idct8_coeffs, 256, idct8_meta);
}
@@ -100,13 +110,13 @@ static void bench_idct8(void) {
static daedalus_h264_deblock_meta deblock_meta[256];
static uint8_t deblock_dst[256 * 16 * 16];
static void bench_deblock_v(void) {
daedalus_dispatch_h264_deblock_luma_v(ctx, DAEDALUS_SUBSTRATE_CPU,
static int bench_deblock_v(void) {
return daedalus_dispatch_h264_deblock_luma_v(ctx, g_sub,
deblock_dst, 16, 256, deblock_meta);
}
static void bench_deblock_h(void) {
daedalus_dispatch_h264_deblock_luma_h(ctx, DAEDALUS_SUBSTRATE_CPU,
static int bench_deblock_h(void) {
return daedalus_dispatch_h264_deblock_luma_h(ctx, g_sub,
deblock_dst, 16, 256, deblock_meta);
}
@@ -115,19 +125,44 @@ static uint8_t qpel_src[256 * 16 * 16];
static uint8_t qpel_dst[256 * 16 * 16];
static daedalus_h264_qpel_meta qpel_meta[256];
static void bench_qpel_mc20(void) {
daedalus_dispatch_h264_qpel_mc20(ctx, DAEDALUS_SUBSTRATE_CPU,
static int bench_qpel_mc20(void) {
return daedalus_dispatch_h264_qpel_mc20(ctx, g_sub,
qpel_dst, qpel_src, 16, 256, qpel_meta);
}
static void bench_qpel_mc02(void) {
daedalus_dispatch_h264_qpel_mc02(ctx, DAEDALUS_SUBSTRATE_CPU,
static int bench_qpel_mc02(void) {
return daedalus_dispatch_h264_qpel_mc02(ctx, g_sub,
qpel_dst, qpel_src, 16, 256, qpel_meta);
}
static void bench_qpel_mc22(void) {
daedalus_dispatch_h264_qpel_mc22(ctx, DAEDALUS_SUBSTRATE_CPU,
static int bench_qpel_mc22(void) {
return daedalus_dispatch_h264_qpel_mc22(ctx, g_sub,
qpel_dst, qpel_src, 16, 256, qpel_meta);
}
/* ---- One row of bench output:
* - kernel name + N
* - CPU ns/op
* - QPU ns/op (or "n/a" if Vulkan absent)
* - winner and its speedup factor (always >= 1x, e.g. "CPU 20.24x") */
struct row {
const char *name;
int n_per_call;
bench_fn fn;
double cpu_ns;
double qpu_ns; /* -1 if not measured */
int frame_n; /* count per 1080p frame */
};
static struct row rows[] = {
{"IDCT 4x4 luma", 1024, bench_idct4, 0, -1, MBS_1080P * 16},
{"IDCT 8x8 luma", 256, bench_idct8, 0, -1, MBS_1080P * 4},
{"Deblock luma_v", 256, bench_deblock_v, 0, -1, MBS_1080P * 4},
{"Deblock luma_h", 256, bench_deblock_h, 0, -1, MBS_1080P * 4},
{"qpel mc20 (8x8)", 256, bench_qpel_mc20, 0, -1, MBS_1080P * 4},
{"qpel mc02 (8x8)", 256, bench_qpel_mc02, 0, -1, MBS_1080P * 4},
{"qpel mc22 (8x8)", 256, bench_qpel_mc22, 0, -1, MBS_1080P * 4},
};
#define N_ROWS ((int)(sizeof(rows)/sizeof(rows[0])))
int main(int argc, char **argv)
{
int iters = argc > 1 ? atoi(argv[1]) : 50;
@@ -138,6 +173,7 @@ int main(int argc, char **argv)
fprintf(stderr, "ctx create failed (Vulkan?)\n");
return 1;
}
int has_qpu = daedalus_ctx_has_qpu(ctx);
/* Pre-fill all input buffers with random data so the NEON inner
* loops see realistic memory access patterns. */
@@ -147,8 +183,7 @@ int main(int argc, char **argv)
idct8_coeffs[i] = (int16_t)((int)(xs64() % 1024) - 512);
for (size_t i = 0; i < sizeof(qpel_src); i++) qpel_src[i] = (uint8_t)(xs64() & 0xff);
/* IDCT meta: each block at offset i*16 (row layout matters less
* here since we're just measuring per-block latency). */
/* IDCT meta. */
for (size_t i = 0; i < 1024; i++)
idct4_meta[i].dst_off = (uint32_t)((i / 16) * 64 + (i % 16) * 4);
for (size_t i = 0; i < 256; i++)
@@ -162,58 +197,102 @@ int main(int argc, char **argv)
for (int s = 0; s < 4; s++) deblock_meta[i].tc0[s] = (int8_t)(s + 1);
}
/* qpel meta: src and dst at row 3 col 3 of each 16x16 tile. */
/* qpel meta. */
for (size_t i = 0; i < 256; i++) {
qpel_meta[i].src_off = (uint32_t)(i * 256 + 3 * 16 + 3);
qpel_meta[i].dst_off = (uint32_t)(i * 256 + 3 * 16 + 3);
}
printf("bench_h264_primitives: %d iters (%d warmup), substrate=CPU NEON\n",
iters, warmup);
printf("Per-call N is set per kernel; ns/op is per BLOCK or EDGE.\n\n");
printf("bench_h264_primitives: %d iters (%d warmup)\n", iters, warmup);
printf(" ctx has_qpu=%d (CPU pass always runs; QPU pass skipped without Vulkan)\n\n", has_qpu);
double idct4_ns = bench_ns("IDCT 4x4 luma", iters, warmup, 1024, bench_idct4);
double idct8_ns = bench_ns("IDCT 8x8 luma", iters, warmup, 256, bench_idct8);
double debl_v_ns = bench_ns("Deblock luma_v", iters, warmup, 256, bench_deblock_v);
double debl_h_ns = bench_ns("Deblock luma_h", iters, warmup, 256, bench_deblock_h);
double qmc20_ns = bench_ns("qpel mc20 (8x8)", iters, warmup, 256, bench_qpel_mc20);
double qmc02_ns = bench_ns("qpel mc02 (8x8)", iters, warmup, 256, bench_qpel_mc02);
double qmc22_ns = bench_ns("qpel mc22 (8x8)", iters, warmup, 256, bench_qpel_mc22);
/* Pass 1: CPU NEON. */
g_sub = DAEDALUS_SUBSTRATE_CPU;
printf("== CPU NEON ==\n");
for (int i = 0; i < N_ROWS; i++)
rows[i].cpu_ns = bench_ns(rows[i].name, iters, warmup, rows[i].n_per_call, rows[i].fn);
/* Per-frame budget summary at 1080p (8160 MBs). Worst-case
* assumptions:
* - All MBs are transform_4x4 (16 4x4 IDCTs each) — so 130,560
* IDCT 4x4 blocks per frame. If High profile transform_8x8,
* it'd be 32,640 IDCT 8x8 blocks instead.
* - All MBs are intra (no MC — qpel zero) OR all inter (no
* intra prediction). We report MC at "all inter, all qpel
* mc22" worst case.
* - Deblock: ~4 luma_v + 4 luma_h edges per MB; assume all 8
* edges trigger filtering. */
printf("\nProjected 1080p frame budgets (worst-case, CPU NEON only):\n");
printf(" IDCT 4x4 (all-4x4 MBs): %7.2f ms (%d blocks)\n",
idct4_ns * MBS_1080P * 16 / 1e6, MBS_1080P * 16);
printf(" IDCT 8x8 (all-8x8 MBs): %7.2f ms (%d blocks)\n",
idct8_ns * MBS_1080P * 4 / 1e6, MBS_1080P * 4);
printf(" Deblock luma_v (all MBs): %7.2f ms (%d edges)\n",
debl_v_ns * MBS_1080P * 4 / 1e6, MBS_1080P * 4);
printf(" Deblock luma_h (all MBs): %7.2f ms (%d edges)\n",
debl_h_ns * MBS_1080P * 4 / 1e6, MBS_1080P * 4);
printf(" qpel mc22 (all 8x8 blocks): %7.2f ms (%d blocks)\n",
qmc22_ns * MBS_1080P * 4 / 1e6, MBS_1080P * 4);
/* Pass 2: QPU compute (if available). */
int qpu_failures = 0;
if (has_qpu) {
g_sub = DAEDALUS_SUBSTRATE_QPU;
printf("\n== QPU V3D7 compute ==\n");
for (int i = 0; i < N_ROWS; i++) {
rows[i].qpu_ns = bench_ns(rows[i].name, iters, warmup, rows[i].n_per_call, rows[i].fn);
if (rows[i].qpu_ns < 0) qpu_failures++;
}
if (qpu_failures) {
fprintf(stderr,
"\nbench_h264_primitives: %d of %d QPU dispatches failed.\n"
" Almost always means SPV files were not found via any of:\n"
" cwd / $DAEDALUS_SHADER_DIR / binary-dir /\n"
" /opt/fourier/share/daedalus-fourier / /usr/share/daedalus-fourier\n"
" Set DAEDALUS_SHADER_DIR=<path> or run from a dir where the\n"
" .spv files exist (e.g. the cmake build dir).\n",
qpu_failures, N_ROWS);
return 2;
}
}
/* Summary table — both substrates side by side. */
printf("\n== Per-kernel comparison ==\n");
printf(" %-24s %12s %12s %8s %7s\n",
"kernel", "CPU ns/op", "QPU ns/op", "winner", "ms/frame");
for (int i = 0; i < N_ROWS; i++) {
double cpu_ms = rows[i].cpu_ns * rows[i].frame_n / 1e6;
double qpu_ms = rows[i].qpu_ns > 0 ? rows[i].qpu_ns * rows[i].frame_n / 1e6 : -1;
const char *winner;
char ratio[16];
if (rows[i].qpu_ns <= 0) {
winner = "CPU"; /* QPU n/a */
snprintf(ratio, sizeof(ratio), "n/a");
} else if (rows[i].cpu_ns < rows[i].qpu_ns) {
winner = "CPU";
snprintf(ratio, sizeof(ratio), "%.2fx", rows[i].qpu_ns / rows[i].cpu_ns);
} else {
winner = "QPU";
snprintf(ratio, sizeof(ratio), "%.2fx", rows[i].cpu_ns / rows[i].qpu_ns);
}
char qpu_field[16];
if (rows[i].qpu_ns > 0) snprintf(qpu_field, sizeof(qpu_field), "%.2f", rows[i].qpu_ns);
else snprintf(qpu_field, sizeof(qpu_field), "n/a");
char ms_field[24];
if (qpu_ms > 0)
snprintf(ms_field, sizeof(ms_field), "%.2f/%.2f", cpu_ms, qpu_ms);
else
snprintf(ms_field, sizeof(ms_field), "%.2f/n/a", cpu_ms);
printf(" %-24s %12.2f %12s %3s %s %s\n",
rows[i].name, rows[i].cpu_ns, qpu_field, winner, ratio, ms_field);
}
/* Per-frame budget summary at 1080p (8160 MBs). */
double cpu_idct4 = rows[0].cpu_ns * MBS_1080P * 16 / 1e6;
double cpu_debl = (rows[2].cpu_ns + rows[3].cpu_ns) * MBS_1080P * 4 / 1e6;
double cpu_mc = rows[6].cpu_ns * MBS_1080P * 4 / 1e6; /* mc22 worst-case */
double cpu_sum = cpu_idct4 + cpu_debl + cpu_mc;
printf("\n== Projected 1080p worst-case (CPU NEON only) ==\n");
printf(" IDCT 4x4 + deblock luma + qpel mc22: %.2f ms (30fps deadline 33.33)\n", cpu_sum);
printf(" Margin: %+.2f ms\n", 33.33 - cpu_sum);
if (has_qpu) {
double qpu_idct4 = rows[0].qpu_ns * MBS_1080P * 16 / 1e6;
double qpu_debl = (rows[2].qpu_ns + rows[3].qpu_ns) * MBS_1080P * 4 / 1e6;
double qpu_mc = rows[6].qpu_ns * MBS_1080P * 4 / 1e6;
double qpu_sum = qpu_idct4 + qpu_debl + qpu_mc;
printf("\n== Projected 1080p worst-case (QPU V3D7 compute only) ==\n");
printf(" IDCT 4x4 + deblock luma + qpel mc22: %.2f ms (30fps deadline 33.33)\n", qpu_sum);
printf(" Margin: %+.2f ms\n", 33.33 - qpu_sum);
printf("\n CPU vs QPU sum ratio: %.2fx (>1 means QPU wins)\n",
qpu_sum > 0 ? cpu_sum / qpu_sum : 0.0);
}
double sum_idct_4x4 = idct4_ns * MBS_1080P * 16 / 1e6;
double sum_deblock = (debl_v_ns + debl_h_ns) * MBS_1080P * 4 / 1e6;
double sum_mc = qmc22_ns * MBS_1080P * 4 / 1e6; /* worst-case all-mc22 */
printf("\n Sum (IDCT 4x4 + deblock luma + MC all-mc22): %7.2f ms\n",
sum_idct_4x4 + sum_deblock + sum_mc);
printf(" 30 fps deadline: 33.33 ms\n");
printf(" Margin: %+.2f ms\n",
33.33 - (sum_idct_4x4 + sum_deblock + sum_mc));
printf("\n(NOT included: chroma deblock, chroma IDCT, intra prediction,\n");
printf(" CABAC/CAVLC entropy. These bench numbers are a budget LOWER\n");
printf(" bound; the real decode stack adds 20-40%% on top.)\n");
(void) qmc20_ns; (void) qmc02_ns;
printf(" bound; the real decode stack adds 20-40%% on top.\n");
printf(" Per-kernel substrate decisions belong in daedalus_core.c recipe\n");
printf(" table; the QPU substrate decree (2026-05-23) keeps everything\n");
printf(" on QPU regardless of these numbers as a policy choice.)\n");
daedalus_ctx_destroy(ctx);
return 0;
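The preflight idea in bench_ns can be isolated into a small standalone sketch. It assumes the same contract as the diff (fn() returns 0 on success) and goes one step further than the PR by also checking the return code inside the timed loop, so a dispatch that starts failing mid-run is rejected too. `bench_ns_checked`, `always_ok` and `always_fails` are hypothetical names used only for illustration.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L   /* clock_gettime under strict -std=c11 */
#endif
#include <assert.h>
#include <time.h>

typedef int (*bench_fn)(void);

static double now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec * 1e9 + (double)ts.tv_nsec;
}

/* Returns ns per op, or -1.0 if any call to fn() reported failure.
 * A failing dispatch is never timed, so a fast failure path cannot
 * show up as a suspiciously good ns/op figure. */
static double bench_ns_checked(int iters, int warmup, int ops_per_iter,
                               bench_fn fn)
{
    assert(iters > 0 && ops_per_iter > 0);
    if (fn() != 0) return -1.0;                 /* preflight */
    for (int i = 0; i < warmup; i++) (void)fn();

    int failed = 0;
    double t0 = now_ns();
    for (int i = 0; i < iters; i++) failed |= (fn() != 0);
    double t1 = now_ns();
    if (failed) return -1.0;                    /* assumption: reject mid-run failures */

    return (t1 - t0) / ((double)iters * ops_per_iter);
}

/* Stand-ins for a working and a broken dispatch. */
static volatile unsigned sink;
static int always_ok(void)    { sink += 1u; return 0; }
static int always_fails(void) { return -1; }
```

With these stand-ins, bench_ns_checked(10, 2, 256, always_fails) yields the -1.0 sentinel, while always_ok yields a non-negative ns/op value. Accumulating the failure flag with a bitwise OR keeps the extra work inside the timed loop small.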
+4 -4
@@ -683,13 +683,13 @@ int main(void)
printf(" H264_QPEL_MC20 recipe substrate: %d\n",
(int) daedalus_recipe_substrate_for(DAEDALUS_KERNEL_H264_QPEL_MC20));
printf(" H264_DEBLOCK_LH recipe substrate: %d (CPU, no QPU H shader yet)\n",
printf(" H264_DEBLOCK_LH recipe substrate: %d\n",
(int) daedalus_recipe_substrate_for(DAEDALUS_KERNEL_H264_DEBLOCK_LH));
printf(" H264_DEBLOCK_CV recipe substrate: %d (CPU)\n",
printf(" H264_DEBLOCK_CV recipe substrate: %d\n",
(int) daedalus_recipe_substrate_for(DAEDALUS_KERNEL_H264_DEBLOCK_CV));
printf(" H264_DEBLOCK_CH recipe substrate: %d (CPU)\n",
printf(" H264_DEBLOCK_CH recipe substrate: %d\n",
(int) daedalus_recipe_substrate_for(DAEDALUS_KERNEL_H264_DEBLOCK_CH));
printf(" H264_DEBLOCK_*_INTRA recipe substrate: %d (CPU, bS=4 set)\n",
printf(" H264_DEBLOCK_*_INTRA recipe substrate: %d (bS=4 family, all on QPU)\n",
(int) daedalus_recipe_substrate_for(DAEDALUS_KERNEL_H264_DEBLOCK_LV_INTRA));
int fail = 0;
+18
@@ -12,6 +12,7 @@
#include <string.h>
extern void daedalus_h264_chroma_dc_hadamard_2x2_ref(int16_t c[4]);
extern void daedalus_h264_chroma_dc_hadamard_2x2(int16_t c[4]); /* public API */
static int check(const char *name, int16_t in[4], int16_t expect[4])
{
@@ -112,6 +113,23 @@ int main(void)
fail |= local_fail;
}
/* Test 8: public API parity. The public symbol must produce
* byte-identical output to the test-only ref for the same input.
* If the src/ Hadamard ever drifts from the spec, this catches it. */
{
int16_t input[4] = { 7, -11, 23, -42 };
int16_t a[4], b[4];
memcpy(a, input, sizeof(a));
memcpy(b, input, sizeof(b));
daedalus_h264_chroma_dc_hadamard_2x2_ref(a);
daedalus_h264_chroma_dc_hadamard_2x2(b);
int local_fail = 0;
for (int i = 0; i < 4; i++) if (a[i] != b[i]) local_fail = 1;
printf(" %-32s %s\n", "public API parity vs _ref",
local_fail ? "FAIL" : "PASS");
fail |= local_fail;
}
if (fail == 0) printf("\nALL chroma DC Hadamard tests PASS\n");
else fprintf(stderr, "\n%d test(s) FAILED\n", fail);
return fail ? 1 : 0;
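The parity test above compares two implementations on one input. The same pattern, applied to a plain 2x2 Hadamard transform, can be sketched as follows. The assumptions are that the four coefficients are stored in raster order (c00, c01, c10, c11) and that no dequantisation or scaling is applied; the real daedalus_h264_chroma_dc_hadamard_2x2 may differ on both points. `hadamard2x2_direct`, `hadamard2x2_butterfly` and `parity_ok` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Direct form of H * C * H with H = [[1, 1], [1, -1]]. */
static void hadamard2x2_direct(int16_t c[4])
{
    int a = c[0], b = c[1], p = c[2], q = c[3];
    c[0] = (int16_t)(a + b + p + q);
    c[1] = (int16_t)(a - b + p - q);
    c[2] = (int16_t)(a + b - p - q);
    c[3] = (int16_t)(a - b - p + q);
}

/* Butterfly form: two row stages followed by two column stages. */
static void hadamard2x2_butterfly(int16_t c[4])
{
    int t0 = c[0] + c[1], t1 = c[0] - c[1];
    int t2 = c[2] + c[3], t3 = c[2] - c[3];
    c[0] = (int16_t)(t0 + t2);
    c[1] = (int16_t)(t1 + t3);
    c[2] = (int16_t)(t0 - t2);
    c[3] = (int16_t)(t1 - t3);
}

/* Returns 1 when both forms agree on this input. */
static int parity_ok(const int16_t in[4])
{
    int16_t a[4], b[4];
    assert(in != NULL);
    memcpy(a, in, sizeof(a));
    memcpy(b, in, sizeof(b));
    hadamard2x2_direct(a);
    hadamard2x2_butterfly(b);
    return memcmp(a, b, sizeof(a)) == 0;
}
```

Copying the input before each call matters because both functions transform in place, which is also why Test 8 uses two memcpy calls.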
+9 -9
@@ -18,10 +18,10 @@
#include <stdio.h>
#include <string.h>
extern void daedalus_h264_pred_16x16_vertical_ref(uint8_t *dst, ptrdiff_t stride);
extern void daedalus_h264_pred_16x16_horizontal_ref(uint8_t *dst, ptrdiff_t stride);
extern void daedalus_h264_pred_16x16_dc_ref(uint8_t *dst, ptrdiff_t stride);
extern void daedalus_h264_pred_16x16_plane_ref(uint8_t *dst, ptrdiff_t stride);
extern void daedalus_h264_pred_16x16_vertical(uint8_t *dst, ptrdiff_t stride);
extern void daedalus_h264_pred_16x16_horizontal(uint8_t *dst, ptrdiff_t stride);
extern void daedalus_h264_pred_16x16_dc(uint8_t *dst, ptrdiff_t stride);
extern void daedalus_h264_pred_16x16_plane(uint8_t *dst, ptrdiff_t stride);
#define STRIDE 17
#define ROWS 17
@@ -84,7 +84,7 @@ int main(void)
int t[16], l[16];
for (int i = 0; i < 16; i++) { t[i] = 10 + i; l[i] = 0; }
set_ctx(buf, 0, t, l);
daedalus_h264_pred_16x16_vertical_ref(&buf[1][1], STRIDE);
daedalus_h264_pred_16x16_vertical(&buf[1][1], STRIDE);
struct vertical_ctx vc = { t };
fail |= check(buf, "Vertical (mode 0)", expect_vertical, &vc);
}
@@ -95,7 +95,7 @@ int main(void)
int t[16] = {0}, l[16];
for (int i = 0; i < 16; i++) l[i] = 50 + i;
set_ctx(buf, 0, t, l);
daedalus_h264_pred_16x16_horizontal_ref(&buf[1][1], STRIDE);
daedalus_h264_pred_16x16_horizontal(&buf[1][1], STRIDE);
struct horizontal_ctx hc = { l };
fail |= check(buf, "Horizontal (mode 1)", expect_horizontal, &hc);
}
@@ -108,7 +108,7 @@ int main(void)
int t[16], l[16];
for (int i = 0; i < 16; i++) { t[i] = 2; l[i] = 6; }
set_ctx(buf, 99, t, l);
daedalus_h264_pred_16x16_dc_ref(&buf[1][1], STRIDE);
daedalus_h264_pred_16x16_dc(&buf[1][1], STRIDE);
uint8_t exp_val = 4;
fail |= check(buf, "DC (mode 2)", expect_uniform, &exp_val);
}
@@ -123,7 +123,7 @@ int main(void)
int t[16], l[16];
for (int i = 0; i < 16; i++) { t[i] = 100; l[i] = 100; }
set_ctx(buf, 100, t, l); /* uniform tl too — H/V sums actually zero */
daedalus_h264_pred_16x16_plane_ref(&buf[1][1], STRIDE);
daedalus_h264_pred_16x16_plane(&buf[1][1], STRIDE);
uint8_t exp_val = 100;
fail |= check(buf, "Plane (mode 3, uniform)", expect_uniform, &exp_val);
}
@@ -150,7 +150,7 @@ int main(void)
int t[16], l[16];
for (int i = 0; i < 16; i++) { t[i] = i; l[i] = i; }
set_ctx(buf, 0, t, l);
daedalus_h264_pred_16x16_plane_ref(&buf[1][1], STRIDE);
daedalus_h264_pred_16x16_plane(&buf[1][1], STRIDE);
uint8_t tl_actual = buf[1 + 0][1 + 0];
uint8_t br_actual = buf[1 + 15][1 + 15];
int spot_fail = 0;
+19 -19
@@ -22,15 +22,15 @@
#include <stdio.h>
#include <string.h>
extern void daedalus_h264_pred_4x4_vertical_ref(uint8_t *dst, ptrdiff_t stride);
extern void daedalus_h264_pred_4x4_horizontal_ref(uint8_t *dst, ptrdiff_t stride);
extern void daedalus_h264_pred_4x4_dc_ref(uint8_t *dst, ptrdiff_t stride);
extern void daedalus_h264_pred_4x4_ddl_ref(uint8_t *dst, ptrdiff_t stride);
extern void daedalus_h264_pred_4x4_ddr_ref(uint8_t *dst, ptrdiff_t stride);
extern void daedalus_h264_pred_4x4_vr_ref(uint8_t *dst, ptrdiff_t stride);
extern void daedalus_h264_pred_4x4_hd_ref(uint8_t *dst, ptrdiff_t stride);
extern void daedalus_h264_pred_4x4_vl_ref(uint8_t *dst, ptrdiff_t stride);
extern void daedalus_h264_pred_4x4_hu_ref(uint8_t *dst, ptrdiff_t stride);
extern void daedalus_h264_pred_4x4_vertical(uint8_t *dst, ptrdiff_t stride);
extern void daedalus_h264_pred_4x4_horizontal(uint8_t *dst, ptrdiff_t stride);
extern void daedalus_h264_pred_4x4_dc(uint8_t *dst, ptrdiff_t stride);
extern void daedalus_h264_pred_4x4_ddl(uint8_t *dst, ptrdiff_t stride);
extern void daedalus_h264_pred_4x4_ddr(uint8_t *dst, ptrdiff_t stride);
extern void daedalus_h264_pred_4x4_vr(uint8_t *dst, ptrdiff_t stride);
extern void daedalus_h264_pred_4x4_hd(uint8_t *dst, ptrdiff_t stride);
extern void daedalus_h264_pred_4x4_vl(uint8_t *dst, ptrdiff_t stride);
extern void daedalus_h264_pred_4x4_hu(uint8_t *dst, ptrdiff_t stride);
#define STRIDE 9
typedef void (*pred_fn)(uint8_t *dst, ptrdiff_t stride);
@@ -82,7 +82,7 @@ int main(void)
int t[8] = { 10, 20, 30, 40, 0, 0, 0, 0 };
int l[4] = { 0, 0, 0, 0 };
set_ctx(buf, tl, t, l);
daedalus_h264_pred_4x4_vertical_ref(&buf[1][1], STRIDE);
daedalus_h264_pred_4x4_vertical(&buf[1][1], STRIDE);
uint8_t exp[4][4] = {
{10,20,30,40}, {10,20,30,40}, {10,20,30,40}, {10,20,30,40}
};
@@ -95,7 +95,7 @@ int main(void)
int t[8] = { 0,0,0,0, 0,0,0,0 };
int l[4] = { 50, 60, 70, 80 };
set_ctx(buf, 0, t, l);
daedalus_h264_pred_4x4_horizontal_ref(&buf[1][1], STRIDE);
daedalus_h264_pred_4x4_horizontal(&buf[1][1], STRIDE);
uint8_t exp[4][4] = {
{50,50,50,50}, {60,60,60,60}, {70,70,70,70}, {80,80,80,80}
};
@@ -110,7 +110,7 @@ int main(void)
int t[8] = { 1,1,1,1, 0,0,0,0 };
int l[4] = { 3,3,3,3 };
set_ctx(buf, 99, t, l); /* tl unused for DC */
daedalus_h264_pred_4x4_dc_ref(&buf[1][1], STRIDE);
daedalus_h264_pred_4x4_dc(&buf[1][1], STRIDE);
uint8_t exp[4][4] = {
{2,2,2,2}, {2,2,2,2}, {2,2,2,2}, {2,2,2,2}
};
@@ -125,7 +125,7 @@ int main(void)
int t[8] = { 100,100,100,100, 100,100,100,100 };
int l[4] = { 0,0,0,0 };
set_ctx(buf, 0, t, l);
daedalus_h264_pred_4x4_ddl_ref(&buf[1][1], STRIDE);
daedalus_h264_pred_4x4_ddl(&buf[1][1], STRIDE);
uint8_t exp[4][4] = {
{100,100,100,100}, {100,100,100,100},
{100,100,100,100}, {100,100,100,100}
@@ -140,7 +140,7 @@ int main(void)
int t[8] = { 200,200,200,200, 0,0,0,0 };
int l[4] = { 200,200,200,200 };
set_ctx(buf, 200, t, l);
daedalus_h264_pred_4x4_ddr_ref(&buf[1][1], STRIDE);
daedalus_h264_pred_4x4_ddr(&buf[1][1], STRIDE);
uint8_t exp[4][4] = {
{200,200,200,200}, {200,200,200,200},
{200,200,200,200}, {200,200,200,200}
@@ -155,7 +155,7 @@ int main(void)
int t[8] = { 80,80,80,80, 0,0,0,0 };
int l[4] = { 80,80,80,80 };
set_ctx(buf, 80, t, l);
daedalus_h264_pred_4x4_vr_ref(&buf[1][1], STRIDE);
daedalus_h264_pred_4x4_vr(&buf[1][1], STRIDE);
uint8_t exp[4][4] = {
{80,80,80,80}, {80,80,80,80}, {80,80,80,80}, {80,80,80,80}
};
@@ -168,7 +168,7 @@ int main(void)
int t[8] = { 120,120,120,120, 0,0,0,0 };
int l[4] = { 120,120,120,120 };
set_ctx(buf, 120, t, l);
daedalus_h264_pred_4x4_hd_ref(&buf[1][1], STRIDE);
daedalus_h264_pred_4x4_hd(&buf[1][1], STRIDE);
uint8_t exp[4][4] = {
{120,120,120,120}, {120,120,120,120},
{120,120,120,120}, {120,120,120,120}
@@ -182,7 +182,7 @@ int main(void)
int t[8] = { 64,64,64,64, 64,64,64,64 };
int l[4] = { 0,0,0,0 };
set_ctx(buf, 0, t, l);
daedalus_h264_pred_4x4_vl_ref(&buf[1][1], STRIDE);
daedalus_h264_pred_4x4_vl(&buf[1][1], STRIDE);
uint8_t exp[4][4] = {
{64,64,64,64}, {64,64,64,64}, {64,64,64,64}, {64,64,64,64}
};
@@ -195,7 +195,7 @@ int main(void)
int t[8] = { 0,0,0,0, 0,0,0,0 };
int l[4] = { 200,200,200,200 };
set_ctx(buf, 0, t, l);
daedalus_h264_pred_4x4_hu_ref(&buf[1][1], STRIDE);
daedalus_h264_pred_4x4_hu(&buf[1][1], STRIDE);
uint8_t exp[4][4] = {
{200,200,200,200}, {200,200,200,200},
{200,200,200,200}, {200,200,200,200}
@@ -230,7 +230,7 @@ int main(void)
int t[8] = { 10,20,30,40, 0,0,0,0 };
int l[4] = { 50,60,70,0 };
set_ctx(buf, 5, t, l);
daedalus_h264_pred_4x4_vr_ref(&buf[1][1], STRIDE);
daedalus_h264_pred_4x4_vr(&buf[1][1], STRIDE);
uint8_t exp[4][4] = {
{ 8,15,25,35},
{18,11,20,30},
+20 -20
@@ -14,15 +14,15 @@
#include <stdio.h>
#include <string.h>
extern void daedalus_h264_pred_8x8l_vertical_ref(uint8_t *dst, ptrdiff_t stride);
extern void daedalus_h264_pred_8x8l_horizontal_ref(uint8_t *dst, ptrdiff_t stride);
extern void daedalus_h264_pred_8x8l_dc_ref(uint8_t *dst, ptrdiff_t stride);
extern void daedalus_h264_pred_8x8l_ddl_ref(uint8_t *dst, ptrdiff_t stride);
extern void daedalus_h264_pred_8x8l_ddr_ref(uint8_t *dst, ptrdiff_t stride);
extern void daedalus_h264_pred_8x8l_vr_ref(uint8_t *dst, ptrdiff_t stride);
extern void daedalus_h264_pred_8x8l_hd_ref(uint8_t *dst, ptrdiff_t stride);
extern void daedalus_h264_pred_8x8l_vl_ref(uint8_t *dst, ptrdiff_t stride);
extern void daedalus_h264_pred_8x8l_hu_ref(uint8_t *dst, ptrdiff_t stride);
extern void daedalus_h264_pred_8x8l_vertical(uint8_t *dst, ptrdiff_t stride);
extern void daedalus_h264_pred_8x8l_horizontal(uint8_t *dst, ptrdiff_t stride);
extern void daedalus_h264_pred_8x8l_dc(uint8_t *dst, ptrdiff_t stride);
extern void daedalus_h264_pred_8x8l_ddl(uint8_t *dst, ptrdiff_t stride);
extern void daedalus_h264_pred_8x8l_ddr(uint8_t *dst, ptrdiff_t stride);
extern void daedalus_h264_pred_8x8l_vr(uint8_t *dst, ptrdiff_t stride);
extern void daedalus_h264_pred_8x8l_hd(uint8_t *dst, ptrdiff_t stride);
extern void daedalus_h264_pred_8x8l_vl(uint8_t *dst, ptrdiff_t stride);
extern void daedalus_h264_pred_8x8l_hu(uint8_t *dst, ptrdiff_t stride);
#define STRIDE 17
#define ROWS 9
@@ -61,7 +61,7 @@ int main(void)
for (int i = 0; i < 16; i++) t[i] = 50;
for (int j = 0; j < 8; j++) l[j] = 0;
set_ctx(buf, 50, t, l);
daedalus_h264_pred_8x8l_vertical_ref(&buf[1][1], STRIDE);
daedalus_h264_pred_8x8l_vertical(&buf[1][1], STRIDE);
fail |= check_uniform(buf, "Vertical (mode 0, uniform top)", 50);
}
@@ -71,7 +71,7 @@ int main(void)
int t[16] = {0}, l[8];
for (int j = 0; j < 8; j++) l[j] = 70;
set_ctx(buf, 70, t, l);
daedalus_h264_pred_8x8l_horizontal_ref(&buf[1][1], STRIDE);
daedalus_h264_pred_8x8l_horizontal(&buf[1][1], STRIDE);
fail |= check_uniform(buf, "Horizontal (mode 1, uniform left)", 70);
}
@@ -84,7 +84,7 @@ int main(void)
for (int i = 0; i < 16; i++) t[i] = 33;
for (int j = 0; j < 8; j++) l[j] = 33;
set_ctx(buf, 33, t, l);
daedalus_h264_pred_8x8l_dc_ref(&buf[1][1], STRIDE);
daedalus_h264_pred_8x8l_dc(&buf[1][1], STRIDE);
fail |= check_uniform(buf, "DC (mode 2, uniform)", 33);
}
@@ -108,7 +108,7 @@ int main(void)
int t[16], l[8] = {0};
for (int i = 0; i < 16; i++) t[i] = i;
set_ctx(buf, 0, t, l);
daedalus_h264_pred_8x8l_vertical_ref(&buf[1][1], STRIDE);
daedalus_h264_pred_8x8l_vertical(&buf[1][1], STRIDE);
int diff = 0;
for (int r = 0; r < 8; r++)
for (int c = 0; c < 8; c++)
@@ -129,7 +129,7 @@ int main(void)
int t[16] = {0}, l[8];
for (int j = 0; j < 8; j++) l[j] = j;
set_ctx(buf, 0, t, l);
daedalus_h264_pred_8x8l_horizontal_ref(&buf[1][1], STRIDE);
daedalus_h264_pred_8x8l_horizontal(&buf[1][1], STRIDE);
int diff = 0;
for (int r = 0; r < 8; r++)
for (int c = 0; c < 8; c++)
@@ -146,12 +146,12 @@ int main(void)
{
typedef void (*pred_fn_t)(uint8_t *dst, ptrdiff_t stride);
struct { const char *name; pred_fn_t fn; } modes[] = {
{ "DDL (mode 3, uniform)", daedalus_h264_pred_8x8l_ddl_ref },
{ "DDR (mode 4, uniform)", daedalus_h264_pred_8x8l_ddr_ref },
{ "VR (mode 5, uniform)", daedalus_h264_pred_8x8l_vr_ref },
{ "HD (mode 6, uniform)", daedalus_h264_pred_8x8l_hd_ref },
{ "VL (mode 7, uniform)", daedalus_h264_pred_8x8l_vl_ref },
{ "HU (mode 8, uniform)", daedalus_h264_pred_8x8l_hu_ref },
{ "DDL (mode 3, uniform)", daedalus_h264_pred_8x8l_ddl },
{ "DDR (mode 4, uniform)", daedalus_h264_pred_8x8l_ddr },
{ "VR (mode 5, uniform)", daedalus_h264_pred_8x8l_vr },
{ "HD (mode 6, uniform)", daedalus_h264_pred_8x8l_hd },
{ "VL (mode 7, uniform)", daedalus_h264_pred_8x8l_vl },
{ "HU (mode 8, uniform)", daedalus_h264_pred_8x8l_hu },
};
for (size_t i = 0; i < sizeof(modes)/sizeof(modes[0]); i++) {
uint8_t buf[ROWS][STRIDE];
+9 -9
@@ -16,10 +16,10 @@
#include <stdio.h>
#include <string.h>
extern void daedalus_h264_pred_chroma8x8_dc_ref(uint8_t *dst, ptrdiff_t stride);
extern void daedalus_h264_pred_chroma8x8_horizontal_ref(uint8_t *dst, ptrdiff_t stride);
extern void daedalus_h264_pred_chroma8x8_vertical_ref(uint8_t *dst, ptrdiff_t stride);
extern void daedalus_h264_pred_chroma8x8_plane_ref(uint8_t *dst, ptrdiff_t stride);
extern void daedalus_h264_pred_chroma8x8_dc(uint8_t *dst, ptrdiff_t stride);
extern void daedalus_h264_pred_chroma8x8_horizontal(uint8_t *dst, ptrdiff_t stride);
extern void daedalus_h264_pred_chroma8x8_vertical(uint8_t *dst, ptrdiff_t stride);
extern void daedalus_h264_pred_chroma8x8_plane(uint8_t *dst, ptrdiff_t stride);
#define STRIDE 9
#define ROWS 9
@@ -69,7 +69,7 @@ int main(void)
uint8_t buf[ROWS][STRIDE];
int t[8] = {0}, l[8] = {10, 20, 30, 40, 50, 60, 70, 80};
set_ctx(buf, 0, t, l);
daedalus_h264_pred_chroma8x8_horizontal_ref(&buf[1][1], STRIDE);
daedalus_h264_pred_chroma8x8_horizontal(&buf[1][1], STRIDE);
uint8_t exp[8][8];
for (int r = 0; r < 8; r++) for (int c = 0; c < 8; c++) exp[r][c] = (uint8_t) l[r];
fail |= check_per_cell(buf, "Horizontal (mode 1)", exp);
@@ -80,7 +80,7 @@ int main(void)
uint8_t buf[ROWS][STRIDE];
int t[8] = {15, 25, 35, 45, 55, 65, 75, 85}, l[8] = {0};
set_ctx(buf, 0, t, l);
daedalus_h264_pred_chroma8x8_vertical_ref(&buf[1][1], STRIDE);
daedalus_h264_pred_chroma8x8_vertical(&buf[1][1], STRIDE);
uint8_t exp[8][8];
for (int r = 0; r < 8; r++) for (int c = 0; c < 8; c++) exp[r][c] = (uint8_t) t[c];
fail |= check_per_cell(buf, "Vertical (mode 2)", exp);
@@ -104,7 +104,7 @@ int main(void)
int t[8] = { 8, 8, 8, 8, 16, 16, 16, 16 };
int l[8] = { 24, 24, 24, 24, 40, 40, 40, 40 };
set_ctx(buf, 99, t, l);
daedalus_h264_pred_chroma8x8_dc_ref(&buf[1][1], STRIDE);
daedalus_h264_pred_chroma8x8_dc(&buf[1][1], STRIDE);
uint8_t exp[8][8] = {
{16,16,16,16, 16,16,16,16},
{16,16,16,16, 16,16,16,16},
@@ -125,7 +125,7 @@ int main(void)
int t[8], l[8];
for (int i = 0; i < 8; i++) { t[i] = 100; l[i] = 100; }
set_ctx(buf, 100, t, l);
daedalus_h264_pred_chroma8x8_plane_ref(&buf[1][1], STRIDE);
daedalus_h264_pred_chroma8x8_plane(&buf[1][1], STRIDE);
uint8_t exp[8][8];
for (int r = 0; r < 8; r++) for (int c = 0; c < 8; c++) exp[r][c] = 100;
fail |= check_per_cell(buf, "Plane uniform (mode 3)", exp);
@@ -153,7 +153,7 @@ int main(void)
int t[8], l[8];
for (int i = 0; i < 8; i++) { t[i] = i; l[i] = i; }
set_ctx(buf, 0, t, l);
daedalus_h264_pred_chroma8x8_plane_ref(&buf[1][1], STRIDE);
daedalus_h264_pred_chroma8x8_plane(&buf[1][1], STRIDE);
uint8_t tl_actual = buf[1 + 0][1 + 0];
uint8_t br_actual = buf[1 + 7][1 + 7];
int spot_fail = 0;