32 Commits

Author SHA1 Message Date
marfrit 432d127ea9 Merge pull request 'v3d_runner: SPV path search + bench preflight — RETRACTS PR #36's headline' (#37) from noether/spv-search-and-bench-retract into main
Reviewed-on: #37
2026-05-25 20:33:30 +00:00
claude-noether 1347fb961c v3d_runner: SPV path search + bench preflight — RETRACTS PR #36's headline
PR #36 reported a 4.30x QPU-over-CPU win for the H.264 1080p hot-path
sum.  That number was a measurement artifact.  This commit makes the
artifact impossible to reproduce by ANYONE running the bench again.

THE BUG
-------

v3d_runner read_spv() did fopen(spv_path, "rb") with no path search:
the caller passes a bare filename like "v3d_h264_idct4.spv" and fopen
resolves it relative to cwd.  The cmake build puts SPVs in $builddir
(e.g. ~/src/daedalus-fourier/build/), but the bench (and test_api_h264)
were typically invoked from ~/src/daedalus-fourier/, so fopen failed.

On failure read_spv printed perror and returned NULL; pipeline create
then returned -1; dispatch then returned -1; the bench loop ignored
the return value and timed the failure path.  Each iter cost ~1-5 µs
(open + perror + return), which divided across 256 ops gave ~4-20
ns/op, looking convincingly like real-but-fast QPU work.

PR #36's "QPU 2.47 ns/op" for IDCT 4x4 was that artifact.  PR #10's
much-slower "QPU 37.77 ms" measurement was REAL (SPV apparently found
that time, perhaps run from build/), so the artifact is what made it
look like the gap had closed.  The gap never closed.
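
A minimal sketch of the failure mode and the guard against it, not the
actual bench code: bench_fn_t, bench_ns_checked and the two stub kernels
are hypothetical names.  A dispatch that fails fast still takes
measurable time, so a loop that ignores the return value yields a
plausible ns/op; one untimed preflight call rejects it before timing.

```c
#include <stddef.h>
#include <time.h>

/* Hypothetical kernel signature: returns 0 on success, nonzero on failure. */
typedef int (*bench_fn_t)(void *arg);

/* Stub kernels, for illustration only. */
static int stub_dispatch_ok(void *arg)   { (void)arg; return 0; }
static int stub_dispatch_fail(void *arg) { (void)arg; return -1; }

static double now_ns(void)
{
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);   /* C11 wall clock; good enough for a sketch */
    return (double)ts.tv_sec * 1e9 + (double)ts.tv_nsec;
}

/*
 * Times `iters` calls of fn and stores ns per op in *ns_per_op.
 * Returns 0 on success, or fn's nonzero rc if the preflight call or any
 * timed call fails; in that case *ns_per_op is left untouched.
 */
static int bench_ns_checked(bench_fn_t fn, void *arg, int iters,
                            int ops_per_iter, double *ns_per_op)
{
    int rc = fn(arg);              /* preflight: one untimed call */
    if (rc != 0)
        return rc;

    double t0 = now_ns();
    for (int i = 0; i < iters; i++) {
        rc = fn(arg);
        if (rc != 0)               /* never report the failure path as a timing */
            return rc;
    }
    double t1 = now_ns();

    *ns_per_op = (t1 - t0) / ((double)iters * (double)ops_per_iter);
    return 0;
}
```

A caller would skip the kernel (and refuse to print a comparison row)
whenever the return value is nonzero.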

CORRECTED NUMBERS
-----------------

Run from hertz (Pi 5 V3D 7.1, 30 iters x 5 warmup) AFTER this commit:

  kernel             CPU ns/op  QPU ns/op  winner
  IDCT 4x4 luma          10.75     217.63  CPU 20.24x
  IDCT 8x8 luma          29.69     785.94  CPU 26.47x
  Deblock luma_v         17.63     467.42  CPU 26.51x
  Deblock luma_h         38.30     498.53  CPU 13.02x
  qpel mc20 (8x8)        30.17    1300.44  CPU 43.10x
  qpel mc02 (8x8)        17.69    1363.40  CPU 77.08x
  qpel mc22 (8x8)        71.60    1948.37  CPU 27.21x

  1080p worst-case sum (IDCT4 + deblock luma + qpel mc22):
    CPU NEON only:   5.57 ms
    QPU only:      123.54 ms
    Ratio:        CPU/QPU sum = 0.05x  (QPU 22x SLOWER than CPU)

QPU is currently 13-77x slower per kernel.  The post-buffer-pool /
post-persistent-cmdbuf dispatch overhead (tasks #160, #161) did NOT
close the gap with NEON.  Whether those tasks helped at all needs
re-measurement — the previous "we saw a big win" reading was the
same artifact.

PR #36's commit-message claim "PR #10's verdict is reversed" is
withdrawn.  PR #10 was right; PR #36 was wrong.

THE FIX
-------

Two changes:

1. v3d_runner: SPV search now tries, in order:
     - cwd (legacy)
     - $DAEDALUS_SHADER_DIR (env override)
     - <readlink /proc/self/exe>/.. (binary-relative)
     - /opt/fourier/share/daedalus-fourier/ (Pi 5 install)
     - /usr/share/daedalus-fourier/ (system-wide)
   Found-anywhere succeeds silently.  Found-nowhere prints one error
   naming all searched locations.

2. bench_h264_primitives: bench_fn now returns int.  bench_ns does
   a single preflight call; if rc != 0 it prints "DISPATCH FAILED
   rc=N — kernel skipped" and bails on the kernel.  Main loop counts
   QPU failures and exits 2 BEFORE printing the comparison table if
   any kernel failed — so the next person running this can't read
   fail-fast timings as substrate numbers.
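
The ordered search in item 1 could look roughly like the sketch below.
open_spv_searched is a hypothetical name, not the v3d_runner function.
The sketch assumes the caller builds the directory list itself (".",
getenv("DAEDALUS_SHADER_DIR"), the binary-relative directory, then the
two install paths) and passes NULL for any entry that is unavailable.

```c
#include <stdio.h>
#include <stddef.h>

/*
 * Tries dirs[0..ndirs-1] in order and opens the first "<dir>/<name>" that
 * exists.  On success returns the FILE* and leaves the resolved path in
 * `found`.  On failure prints one error naming every location tried,
 * clears `found`, and returns NULL.  NULL or empty entries are skipped.
 */
static FILE *open_spv_searched(const char *name,
                               const char *const dirs[], size_t ndirs,
                               char *found, size_t found_sz)
{
    for (size_t i = 0; i < ndirs; i++) {
        if (dirs[i] == NULL || dirs[i][0] == '\0')
            continue;
        int n = snprintf(found, found_sz, "%s/%s", dirs[i], name);
        if (n < 0 || (size_t)n >= found_sz)
            continue;              /* path would be truncated: skip it */
        FILE *f = fopen(found, "rb");
        if (f != NULL)
            return f;              /* found-anywhere succeeds silently */
    }

    fprintf(stderr, "SPV '%s' not found; searched:", name);
    for (size_t i = 0; i < ndirs; i++)
        if (dirs[i] != NULL && dirs[i][0] != '\0')
            fprintf(stderr, " %s", dirs[i]);
    fprintf(stderr, "\n");
    if (found_sz > 0)
        found[0] = '\0';
    return NULL;
}
```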

POLICY IMPLICATIONS
-------------------

The QPU substrate decree (2026-05-23) was conceived as a policy
choice that overrides per-kernel measurement.  With the corrected
data the gap is not a "fixable defect we'll close with one more
optimization"; it is more than an order of magnitude on every kernel
(13x at best, 77x at worst).  Whether to keep the decree, soften it
(auto = QPU only where measured advantage), or revert is now a
clear-eyed decision for the user.

This commit doesn't change the recipe table — that's a separate
question, taken on its own merits with this data in hand.

Related: marfrit-packages PR #104 (libavcodec ctx flipped no_qpu →
qpu-capable) was justified by PR #36's artifact and should be
reverted; that revert lands in a follow-up to marfrit-packages.
2026-05-25 21:45:12 +02:00
marfrit 9be02a9470 Merge pull request 'bench: H.264 primitive bench now measures both substrates + comparison table' (#36) from noether/h264-qpu-bench-and-cleanup into main
Reviewed-on: #36
2026-05-25 18:56:01 +00:00
claude-noether 989818c2e6 bench: H.264 primitive bench now measures both substrates + comparison table
Closes task #166 (re-measure R-bands on post-buffer-pool dispatch path).

Now that all H.264 hot-path primitives have QPU shaders and the
dispatch overhead has been hammered down (tasks #160 buffer pool,
#161 persistent command buffer), bench_h264_primitives no longer
measures one column.  Two passes — CPU NEON and QPU V3D7 compute —
with a side-by-side per-kernel comparison and ratio.

Headline result on hertz (Pi 5 V3D 7.1, 30 iters x 5 warmup):

  kernel             CPU ns/op  QPU ns/op  winner
  IDCT 4x4 luma          10.79       2.47  QPU 4.36x
  IDCT 8x8 luma          29.69       9.23  QPU 3.22x
  Deblock luma_v         17.58      10.21  QPU 1.72x
  Deblock luma_h         38.41       9.98  QPU 3.85x
  qpel mc20 (8x8)        28.24       9.66  QPU 2.92x
  qpel mc02 (8x8)        16.96      20.54  CPU 1.21x
  qpel mc22 (8x8)        71.58       9.64  QPU 7.43x

  1080p worst-case sum (IDCT4 + deblock luma + qpel mc22):
    CPU NEON only:  5.57 ms
    QPU only:       1.30 ms   (CPU/QPU sum ratio = 4.30x)

Reverses PR #10's verdict (which had CPU NEON 4x faster than QPU
for IDCT-only) — the buffer-pool + persistent-cmdbuf wins land
hard.  Only qpel mc02 still shows CPU ahead, marginally (single-
axis vertical filter, row-strided memory pattern unfriendly to the
WG layout — left as a follow-up for cycle-9-style targeted tuning).

Substrate decree (2026-05-23) stays in force as policy — these
numbers retroactively justify it.

Also tightens test_api_h264's startup recipe print: corrects the stale
"(CPU)" / "(CPU, no QPU H shader yet)" / "(CPU, bS=4 set)" labels
next to deblock_lh, deblock_cv, deblock_ch and deblock_*_intra,
which have been wrong since PRs #28, #29, #35 moved those kernels to QPU.
2026-05-25 20:42:39 +02:00
marfrit 1446b779a6 Merge pull request 'h264: V3D shaders for the 4 bS=4 intra deblock variants — deblock QPU complete' (#35) from noether/v3d-shader-h264-deblock-intra into main
Reviewed-on: #35
2026-05-25 18:36:10 +00:00
claude-noether c2d1e9790e h264: V3D shaders for the 4 bS=4 intra deblock variants — deblock QPU complete
Closes the H.264 deblock QPU coverage matrix.  Adds the 4 intra
(bS=4) variants — luma_v/h_intra + chroma_v/h_intra.

Algorithmically distinct from the bS<4 path:
  - Per-side strong/weak filter selector
      strong_p = (|p2-p0| < β) AND (|p0-q0| < (α>>2) + 2)
      strong_q = (|q2-q0| < β) AND (|p0-q0| < (α>>2) + 2)
  - Strong-p updates p0/p1/p2 with 5-/4-/3-tap blends (reads p3)
  - Weak-p updates p0 only with (2*p1 + p0 + q1 + 2) >> 2
  - Mirror for q-side; no tc0 (bS=4 hardcodes the strength)
  - Chroma always weak, only p0/q0 updated (same as bS<4 chroma)

Per H.264 §8.7.2.4.  Transcribed from PR #11's C reference
(tests/h264_intra_loop_filter_ref.c).
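
For orientation, the bS=4 luma rules listed above can be written as a
scalar C sketch for one line of samples across the edge.  This is an
illustration, not the contents of the C reference or the shaders:
deblock_luma_intra_line is a hypothetical name, and the alpha/beta
edge-activity gate at the top comes from the H.264 spec rather than
from the list above.

```c
#include <stdint.h>
#include <stdlib.h>
#include <stddef.h>

/*
 * Filters one line of samples across a bS=4 luma edge, in place.
 * pix points at q0; p0..p3 sit at pix[-1*step]..pix[-4*step] and
 * q1..q3 at pix[1*step]..pix[3*step].
 */
static void deblock_luma_intra_line(uint8_t *pix, ptrdiff_t step,
                                    int alpha, int beta)
{
    const int p3 = pix[-4 * step], p2 = pix[-3 * step];
    const int p1 = pix[-2 * step], p0 = pix[-1 * step];
    const int q0 = pix[0],         q1 = pix[1 * step];
    const int q2 = pix[2 * step],  q3 = pix[3 * step];

    /* Edge-activity gate from the spec (not spelled out in the list). */
    if (abs(p0 - q0) >= alpha || abs(p1 - p0) >= beta || abs(q1 - q0) >= beta)
        return;

    /* Shared half of the strong_p / strong_q selectors. */
    const int small_step = abs(p0 - q0) < ((alpha >> 2) + 2);

    if (small_step && abs(p2 - p0) < beta) {      /* strong_p: p0, p1, p2 */
        pix[-1 * step] = (uint8_t)((p2 + 2*p1 + 2*p0 + 2*q0 + q1 + 4) >> 3);
        pix[-2 * step] = (uint8_t)((p2 + p1 + p0 + q0 + 2) >> 2);
        pix[-3 * step] = (uint8_t)((2*p3 + 3*p2 + p1 + p0 + q0 + 4) >> 3);
    } else {                                      /* weak_p: p0 only */
        pix[-1 * step] = (uint8_t)((2*p1 + p0 + q1 + 2) >> 2);
    }

    if (small_step && abs(q2 - q0) < beta) {      /* strong_q: mirror of p */
        pix[0]        = (uint8_t)((q2 + 2*q1 + 2*q0 + 2*p0 + p1 + 4) >> 3);
        pix[1 * step] = (uint8_t)((q2 + q1 + q0 + p0 + 2) >> 2);
        pix[2 * step] = (uint8_t)((2*q3 + 3*q2 + q1 + q0 + p0 + 4) >> 3);
    } else {                                      /* weak_q: q0 only */
        pix[0] = (uint8_t)((2*q1 + q0 + p1 + 2) >> 2);
    }
}
```

With step = 1 this walks a row (H orientation); with step = stride it
walks a column (V orientation).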

Shaders:
  - v3d_h264deblock_luma_v_intra.comp  (luma 16-cell + strong/weak)
  - v3d_h264deblock_luma_h_intra.comp  (transpose of luma_v_intra)
  - v3d_h264deblock_chroma_v_intra.comp (8-cell, always weak)
  - v3d_h264deblock_chroma_h_intra.comp (transpose of chroma_v_intra)

Dispatch wiring:
  - 4 new pipeline pairs in daedalus_ctx
  - dispatch_h264_deblock_luma_intra_qpu helper (parameterised by
    orient_h for V vs H) — 2 wrappers
  - chroma intra reuses the existing dispatch_h264_deblock_chroma_qpu
    helper (same WG geometry as bS<4 chroma) — 2 wrappers
  - DEFINE_INTRA_DISPATCH macro extended with qpu_fn parameter,
    routes CPU/QPU per recipe table
  - Recipe table flips DAEDALUS_KERNEL_H264_DEBLOCK_*_INTRA from CPU
    to QPU

Verified on hertz:

  $ ./build/test_api_h264 | grep intra
    H.264 deblock luma v intra:   1024/1024 bytes bit-exact
    H.264 deblock luma h intra:   1024/1024 bytes bit-exact
    H.264 deblock chroma v intra:  256/256 bytes bit-exact
    H.264 deblock chroma h intra:  256/256 bytes bit-exact

All 4 PASS first try.  Strong/weak quad-tree selector + per-side
asymmetry would have surfaced any sign/shift/index mistake; passing
on all 4 (including the asymmetric writes-3-cells cases) means the
transcription from C is clean.

Deblock QPU coverage matrix — COMPLETE (8 of 8):

  bS<4 (non-intra):
    luma_v    ✓ cycle 8
    luma_h    ✓ PR #28
    chroma_v  ✓ PR #29
    chroma_h  ✓ PR #29

  bS=4 (intra, this PR):
    luma_v    ✓
    luma_h    ✓
    chroma_v  ✓
    chroma_h  ✓

The full H.264 8-bit 4:2:0 hot-path pixel-math layer is now on QPU
when daedalus is initialised with a QPU-capable context:
  - IDCT 4x4 / 8x8 ✓
  - All 8 deblock variants ✓
  - All 30 qpel positions (15 put_ + 15 avg_) ✓
2026-05-25 20:30:07 +02:00
marfrit e506ef0803 Merge pull request 'h264: V3D shaders for all 15 avg_ qpel positions — qpel QPU complete' (#34) from noether/v3d-shader-h264-qpel-avg into main
Reviewed-on: #34
2026-05-25 18:23:11 +00:00
claude-noether 2079fe39c6 h264: V3D shaders for all 15 avg_ qpel positions — qpel QPU complete
Generates 15 avg_ shader variants by templating from the existing
put_ shaders.  Each avg_ shader is identical to its put_ sibling
except the final write does an L2 average with the existing dst:

  put_:  dst[r,c] = result
  avg_:  dst[r,c] = (dst[r,c] + result + 1) >> 1

Per H.264 §8.4.2.3.1 (B-slice biprediction): caller pre-loads dst
with the list0 prediction; the avg_ call folds in list1.
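
In scalar C the put_/avg_ difference is one line.  The sketch below is
illustrative only; qpel_store_8x8 is a hypothetical helper, and the
real shaders fold this into their final write.

```c
#include <stdint.h>
#include <stddef.h>

/*
 * Writes an 8x8 block of filtered results to dst.
 * avg == 0: put_ (overwrite).  avg != 0: avg_ (rounded average with the
 * value already in dst, i.e. the list0 prediction the caller pre-loaded).
 */
static void qpel_store_8x8(uint8_t *dst, ptrdiff_t stride,
                           const uint8_t result[64], int avg)
{
    for (int r = 0; r < 8; r++) {
        for (int c = 0; c < 8; c++) {
            uint8_t *d = dst + r * stride + c;
            const int v = result[r * 8 + c];
            *d = (uint8_t)(avg ? (*d + v + 1) >> 1 : v);
        }
    }
}
```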

Generated via python (avg-shader-gen.py): reads each
v3d_h264_qpel_mcXY.comp, transforms the docstring header + final
write hunk, writes v3d_h264_qpel_avg_mcXY.comp.  ~88 lines each;
15 new shader files.

Dispatch reuses the existing dispatch_h264_qpel_diag_qpu helper for
all 15: the src envelope is the same (10*stride+11 covers any
(r±1, c±1) shift) and the L2 step only touches dst.  This slightly
over-allocates for the simpler positions (avg_mc20/02/10/30/01/03),
but the cost is negligible, and it avoids 15 wrappers + 15 src_max
bound calculations that would otherwise be duplicated.

CMake foreach loops compile + install 15 new SPV files.  ctx grows
15 pipeline pairs.  Recipe table flips DAEDALUS_KERNEL_H264_QPEL_AVG_*
from CPU to QPU.  Public dispatchers re-defined via the existing
DEFINE_QPEL_DIAG_PUBLIC macro (replaces the CPU-only
DEFINE_QPEL_DISPATCH instantiations).

Verified on hertz:

  $ ./build/test_api_h264 | grep "qpel avg" | wc -l
  15
  $ ./build/test_api_h264 | grep "qpel avg" | grep -c "100.0000%"
  15

All 15 PASS 2048/2048 bytes bit-exact via QPU.

QPU coverage for the H.264 8-bit 4:2:0 hot-path pixel kernels:

  Layer                Coverage
  ─────────────────────────────────────────────────────────────
  IDCT 4x4 luma        ✓ cycle 6 (one QPU shader, also handles chroma)
  IDCT 8x8 luma        ✓ cycle 7
  Chroma DC Hadamard   CPU only (4 adds + 4 subs; not worth)
  Deblock luma_v       ✓ cycle 8
  Deblock luma_h       ✓ PR #28
  Deblock chroma_v/h   ✓ PR #29
  Deblock *_intra      CPU only (less common, structurally different)
  qpel put_ 15 pos     ✓ cycle 9 (mc20) + PRs #30-#33
  qpel avg_ 15 pos     ✓ THIS PR

The H.264 non-intra-deblock hot path is now FULLY on QPU for any
consumer that initialises daedalus with a QPU-capable context.
2026-05-25 20:22:33 +02:00
marfrit 55d3618408 Merge pull request 'h264: V3D shaders for the 8 diagonal qpel positions' (#33) from noether/v3d-shader-h264-qpel-diagonals into main
Reviewed-on: #33
2026-05-25 18:16:53 +00:00
claude-noether 746533582e h264: V3D shaders for the 8 diagonal qpel positions
Closes the put_ qpel QPU matrix.  Adds mc11/12/13/21/23/31/32/33 —
each composes two half-pel anchor outputs via L2 rounded-average:

  mc11 ¼¼ : avg(mc20[r,   c],   mc02[r,   c])
  mc12 ¼½ : avg(mc22[r,   c],   mc02[r,   c])
  mc13 ¼¾ : avg(mc20[r+1, c],   mc02[r,   c])
  mc21 ½¼ : avg(mc22[r,   c],   mc20[r,   c])
  mc23 ½¾ : avg(mc22[r,   c],   mc20[r+1, c])
  mc31 ¾¼ : avg(mc20[r,   c],   mc02[r,   c+1])
  mc32 ¾½ : avg(mc22[r,   c],   mc02[r,   c+1])
  mc33 ¾¾ : avg(mc20[r+1, c],   mc02[r,   c+1])

Per-lane structure: each lane runs the FULL cascade for BOTH anchors
at its own (r, c) target, then L2 averages.  No shared memory.
Shaders inline hpel_h() / hpel_v() / hpel_hv() helpers (the latter
does the 13×6 int16 cascade per cell).  ~88 lines each.
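
A scalar C sketch of the per-lane composition for the four corner
positions (mc11/13/31/33), which need only the H and V anchors.  The
function names are hypothetical and this is not the shader source;
mc12/21/23/32 would swap one anchor for the HV cascade.  The >> on a
possibly negative sum assumes an arithmetic shift, as on common
compilers.

```c
#include <stdint.h>
#include <stddef.h>

static int clip255(int v) { return v < 0 ? 0 : (v > 255 ? 255 : v); }

/* Unscaled 6-tap (1,-5,20,20,-5,1) over samples spaced `step` apart. */
static int tap6(const uint8_t *s, ptrdiff_t step)
{
    return s[-2 * step] - 5 * s[-1 * step] + 20 * s[0]
         + 20 * s[step] - 5 * s[2 * step] + s[3 * step];
}

/* Half-pel anchors at (r, c): mc20 filters along the row, mc02 along the column. */
static int hpel_h(const uint8_t *s, ptrdiff_t stride, int r, int c)
{
    return clip255((tap6(s + r * stride + c, 1) + 16) >> 5);
}

static int hpel_v(const uint8_t *s, ptrdiff_t stride, int r, int c)
{
    return clip255((tap6(s + r * stride + c, stride) + 16) >> 5);
}

/* L2: rounded average of two 8-bit values. */
static int avg2(int a, int b) { return (a + b + 1) >> 1; }

/* Corner diagonals, following the table above. */
static int qpel_mc11(const uint8_t *s, ptrdiff_t stride, int r, int c)
{
    return avg2(hpel_h(s, stride, r, c), hpel_v(s, stride, r, c));
}

static int qpel_mc13(const uint8_t *s, ptrdiff_t stride, int r, int c)
{
    return avg2(hpel_h(s, stride, r + 1, c), hpel_v(s, stride, r, c));
}

static int qpel_mc31(const uint8_t *s, ptrdiff_t stride, int r, int c)
{
    return avg2(hpel_h(s, stride, r, c), hpel_v(s, stride, r, c + 1));
}

static int qpel_mc33(const uint8_t *s, ptrdiff_t stride, int r, int c)
{
    return avg2(hpel_h(s, stride, r + 1, c), hpel_v(s, stride, r, c + 1));
}
```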

Shaders generated from a python template (POSITIONS table + format
string) — the 8 .comp files are 1:1 with the C reference's
DEFINE_DIAG_REF macro from fourier PR #18.

Dispatch plumbing: shared dispatch_h264_qpel_diag_qpu helper covers
all 8 (same src envelope as mc22: src_max = src_off + 10*stride + 11,
covering rows -2..+10 and cols -2..+10 for any (r±1, c±1) offset).

Recipe table: all 8 DAEDALUS_KERNEL_H264_QPEL_MC{11..33} flipped to
QPU.  Public dispatchers re-defined via DEFINE_QPEL_DIAG_PUBLIC macro
(replaces the old DEFINE_QPEL_DISPATCH which fast-failed QPU).

Verified on hertz:

  $ ./build/test_api_h264 | grep "qpel mc[1-3][1-3]"
    H.264 qpel mc11: 2048/2048 bytes bit-exact (100.0000%)
    H.264 qpel mc12: 2048/2048 bytes bit-exact (100.0000%)
    H.264 qpel mc13: 2048/2048 bytes bit-exact (100.0000%)
    H.264 qpel mc21: 2048/2048 bytes bit-exact (100.0000%)
    H.264 qpel mc23: 2048/2048 bytes bit-exact (100.0000%)
    H.264 qpel mc31: 2048/2048 bytes bit-exact (100.0000%)
    H.264 qpel mc32: 2048/2048 bytes bit-exact (100.0000%)
    H.264 qpel mc33: 2048/2048 bytes bit-exact (100.0000%)

  Meaningful: the (r±1, c±1) offsets are easy to mix up between
  positions; passing first try on the asymmetric variants (mc13/23/31/33)
  means the position-specific shifts are correct in all 8 templates.

put_ qpel QPU matrix is now COMPLETE: 15 of 15 useful positions
(mc00 = integer copy, no shader needed).  avg_ qpel positions
(15 more) remain on CPU NEON; can land as a follow-up since avg_
is just put_ + one extra L2 against existing dst.

  put_  mc20 ✓  mc02 ✓  mc22 ✓  (anchors)
        mc10 ✓  mc30 ✓  mc01 ✓  mc03 ✓  (single-axis ¼-pel)
        mc11 ✓  mc12 ✓  mc13 ✓  (this PR — row-1 diagonals)
        mc21 ✓                    mc23 ✓  (this PR — row-2 diagonals)
        mc31 ✓  mc32 ✓  mc33 ✓  (this PR — row-3 diagonals)
  avg_  all 15 — CPU NEON
2026-05-25 19:14:42 +02:00
marfrit 224f4be9e2 Merge pull request 'h264: V3D shaders for the 4 single-axis quarter-pel qpel variants' (#32) from noether/v3d-shader-h264-qpel-quarter-axis into main
Reviewed-on: #32
2026-05-25 17:09:00 +00:00
claude-noether e3c28495ae h264: V3D shaders for the 4 single-axis quarter-pel qpel variants
mc10 (¼-H), mc30 (¾-H), mc01 (¼-V), mc03 (¾-V).  Each is the
corresponding half-pel filter (mc20 or mc02) with one extra L2
rounded-average step against an integer-source pixel at the tail:

  mc10[r,c] = avg(clip255(mc20(s)), s[r,   c   ])
  mc30[r,c] = avg(clip255(mc20(s)), s[r,   c+1])
  mc01[r,c] = avg(clip255(mc02(s)), s[r,   c  ])
  mc03[r,c] = avg(clip255(mc02(s)), s[r+1, c  ])

Each shader is ~45 lines (mc20-/mc02-pattern + 1 L2 line).
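
The four formulas above, as a scalar C sketch for a single output
pixel.  Names are hypothetical and this is not the shader source; the
>> on a possibly negative sum assumes an arithmetic shift.

```c
#include <stdint.h>
#include <stddef.h>

static int clip255(int v) { return v < 0 ? 0 : (v > 255 ? 255 : v); }

/* Unscaled 6-tap (1,-5,20,20,-5,1) over samples spaced `step` apart. */
static int tap6(const uint8_t *s, ptrdiff_t step)
{
    return s[-2 * step] - 5 * s[-1 * step] + 20 * s[0]
         + 20 * s[step] - 5 * s[2 * step] + s[3 * step];
}

/* L2: rounded average of two 8-bit values. */
static int avg2(int a, int b) { return (a + b + 1) >> 1; }

/* mc20 / mc02 half-pel results for the pixel that p points at. */
static int half_h(const uint8_t *p)                   { return clip255((tap6(p, 1) + 16) >> 5); }
static int half_v(const uint8_t *p, ptrdiff_t stride) { return clip255((tap6(p, stride) + 16) >> 5); }

/* Quarter-pel = half-pel averaged with the nearer integer-source pixel. */
static int qpel_mc10(const uint8_t *p)                   { return avg2(half_h(p), p[0]); }
static int qpel_mc30(const uint8_t *p)                   { return avg2(half_h(p), p[1]); }
static int qpel_mc01(const uint8_t *p, ptrdiff_t stride) { return avg2(half_v(p, stride), p[0]); }
static int qpel_mc03(const uint8_t *p, ptrdiff_t stride) { return avg2(half_v(p, stride), p[stride]); }
```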

CMake foreach loop generates the 4 SPV compile rules.  Dispatch
helper `dispatch_h264_qpel_axis_qpu` shares plumbing across all 4
(axis flag selects src_max bounds: H reads cols -2..+10, V reads
rows -2..+10).  DEFINE_QPEL_AXIS_QPU + DEFINE_QPEL_DISPATCH_QPU
macros collapse ~200 LOC of boilerplate.

Recipe table flips DAEDALUS_KERNEL_H264_QPEL_MC{10,30,01,03} from
CPU to QPU.

Verified on hertz:

  $ ./build/test_api_h264 | grep "qpel mc[01230]"
    H.264 qpel mc10: 2048/2048 bytes bit-exact (100.0000%)
    H.264 qpel mc30: 2048/2048 bytes bit-exact (100.0000%)
    H.264 qpel mc01: 2048/2048 bytes bit-exact (100.0000%)
    H.264 qpel mc03: 2048/2048 bytes bit-exact (100.0000%)
    (+ mc20/mc02/mc22 anchors from previous PRs)

Qpel QPU coverage:

  put_  mc20 ✓  mc02 ✓  mc22 ✓                                  (3 anchors)
        mc10 ✓  mc30 ✓  mc01 ✓  mc03 ✓                          (4 quarter-axis, THIS PR)
        mc11/12/13/21/23/31/32/33 — CPU NEON                    (8 diagonals)
  avg_  all 15 positions — CPU NEON

7 of 15 useful put_ positions now on QPU.  The 8 diagonals each
compose two half-pel results via L2; can land via dedicated kernels
or by chaining existing anchor dispatches (the latter would need
the L2 step as a fourth dispatch — probably cheaper to write
dedicated 8x diagonal shaders).
2026-05-25 19:04:26 +02:00
marfrit 8b8e8dc6e8 Merge pull request 'h264: V3D shader for qpel mc22 (2D half-pel 'j' position)' (#31) from noether/v3d-shader-h264-qpel-mc22 into main
Reviewed-on: #31
2026-05-25 17:00:27 +00:00
claude-noether 02d564b43e h264: V3D shader for qpel mc22 (2D half-pel "j" position)
Cascaded H+V 6-tap filter per H.264 §8.4.2.2.1.  Highest per-frame
impact among missing qpel positions (PR #24 bench: 71.5 ns/block
NEON, 2.33 ms/frame worst-case all-mc22 at 1080p).

Per-lane structure: each lane runs the FULL cascade independently —
computes 6 horizontal lowpass int16 intermediates at rows r-2..r+3
of its column, then a vertical lowpass on those with +512 >> 10
final scale.  ~50 ALU ops per lane.

Design choice: NO shared memory / barriers.  Alternative was to
cache the h-lowpass intermediates in shared memory (13 rows × 8 cols
of int16 per WG), trading shared-memory bank pressure + a barrier
for ~6× less h-lowpass compute.  V3D L2 absorbs the redundant src
reads across lanes; the per-lane compute is cheap (multiply-add ALU
units idle anyway during dst write).  Simpler shader, fewer SPIR-V
ops, easier to extend to mc12/mc21/etc. later.

CANNOT just cascade mc20 → mc02 because the intermediate must be
int16 (no per-stage clip): the +512 >> 10 final scale assumes both
6-tap scalings preserved through the pipeline.  Dedicated kernel.
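
A scalar C sketch of the per-lane cascade for one output pixel,
assuming the structure described above (six unclipped int16 horizontal
intermediates, one vertical 6-tap, a single +512 >> 10 at the end).
hpel_hv_pixel is a hypothetical name, not the shader source.

```c
#include <stdint.h>
#include <stddef.h>

static int clip255(int v) { return v < 0 ? 0 : (v > 255 ? 255 : v); }

/*
 * mc22 ("j") for the pixel at (r, c) of block origin s.
 * Stage 1 keeps the horizontal 6-tap sums unscaled and unclipped; for
 * 8-bit input they lie in [-2550, 10710], so int16 is wide enough.
 * Stage 2 runs the vertical 6-tap over those sums and applies both
 * deferred /32 scalings at once: (v + 512) >> 10.
 */
static int hpel_hv_pixel(const uint8_t *s, ptrdiff_t stride, int r, int c)
{
    int16_t mid[6];

    for (int k = 0; k < 6; k++) {                 /* rows r-2 .. r+3 */
        const uint8_t *row = s + (r - 2 + k) * stride + c;
        mid[k] = (int16_t)(row[-2] - 5 * row[-1] + 20 * row[0]
                         + 20 * row[1] - 5 * row[2] + row[3]);
    }

    const int v = mid[0] - 5 * mid[1] + 20 * mid[2]
                + 20 * mid[3] - 5 * mid[4] + mid[5];
    return clip255((v + 512) >> 10);
}
```

Running mc20 and then mc02 on its output would round and clip after
the first stage, which is exactly what the single final scale avoids.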

dispatch_h264_qpel_mc22_qpu mirrors the existing mc20/mc02 shape;
src_max = src_off + 10*stride + 11 covers both the V (rows -2..+10)
and H (cols -2..+10) read windows in one bound.

Recipe table flips DAEDALUS_KERNEL_H264_QPEL_MC22 from CPU to QPU.

Verified on hertz:

  $ ./build/test_api_h264 | grep qpel
    H.264 qpel mc20: 1024/1024 bytes bit-exact (100.0000%)
    H.264 qpel mc02: 2048/2048 bytes bit-exact (100.0000%)
    H.264 qpel mc22: 2048/2048 bytes bit-exact (100.0000%)

Qpel QPU coverage now: 3 anchors (mc20 H, mc02 V, mc22 HV) — these
are the half-pel "building blocks" the 12 other qpel positions
combine via L2 averaging.  Remaining variants (quarter-pel singles
mc01/03/10/30 and the 8 diagonals) can dispatch through the existing
shaders + a small L2-averaging compose step, or get dedicated kernels.
2026-05-25 18:52:39 +02:00
marfrit 2074a50554 Merge pull request 'h264: V3D shader for qpel mc02 (vertical half-pel)' (#30) from noether/v3d-shader-h264-qpel-mc02 into main
Reviewed-on: #30
2026-05-25 16:49:26 +00:00
claude-noether bc5edf656d h264: V3D shader for qpel mc02 (vertical half-pel)
Sibling of cycle 9's v3d_h264_qpel_mc20.comp.  Same 6-tap H.264 luma
half-pel filter, transposed to vertical orientation: filter reads
rows [-2..+3] of source per output pixel instead of cols.

Shader is ~58 lines (vs mc20's 86) — same WG geometry (64 lanes /
1 block per WG / 1 lane per output pixel).  The address arithmetic
flips: row_base = src_off + r*stride + c (mc20) → col_base =
src_off + c, then col_base + (r±N)*stride (mc02).

dispatch_h264_qpel_mc02_qpu mirrors the mc20 QPU dispatch; src_max
calculation differs since the V kernel reads rows -2..+10 of source
(13 rows × stride wide) vs mc20's cols -2..+10 (8 rows × stride+11).
For 8x8 blocks: src_max = src_off + 10*stride + 8.

Recipe table flips DAEDALUS_KERNEL_H264_QPEL_MC02 from CPU to QPU.

Verified on hertz:

  $ ./build/test_api_h264 | grep qpel
    H.264 qpel mc20: 1024/1024 bytes bit-exact (100.0000%)
    H.264 qpel mc02: 2048/2048 bytes bit-exact (100.0000%)

QPU coverage for the 30 qpel positions:
  put_  mc20 ✓ (cycle 9)   mc02 ✓ (this PR)
        all 13 other put_  CPU NEON
  avg_  all 15 positions   CPU NEON

Next-priority candidates by per-frame impact (per PR #24 bench):
  mc22 (2D half-pel)  — 71.5 ns/block NEON × 32 640 blocks worst
                        case = 2.33 ms/frame at 1080p.  Most-used
                        qpel position in real H.264 streams.
  mc11/mc13/mc31/mc33 — corner ¼-pel positions, structurally similar
                        to mc20 + mc02 with L2 averaging.

The cascaded H+V structure of mc22 means it can either share the
existing mc20 + mc02 shaders' L2 (compute mc20 into tmp, then mc02
on tmp) or get a dedicated 2-stage pipeline.  Follow-up.
2026-05-25 18:38:38 +02:00
marfrit 37b75b5813 Merge pull request 'h264: V3D shaders for chroma deblock V + H (4:2:0)' (#29) from noether/v3d-shader-h264-deblock-chroma into main
Reviewed-on: #29
2026-05-25 16:35:08 +00:00
claude-noether d8de7754fa h264: V3D shaders for chroma deblock V + H (4:2:0)
Adds the QPU shader pair for chroma_v / chroma_h deblock (non-intra
bS<4), siblings of the cycle 8 luma_v shader and PR #28's luma_h.
Completes the non-intra half (4 of 8) of the deblock QPU coverage matrix:

  luma_v   ✓ cycle 8
  luma_h   ✓ PR #28
  chroma_v ✓ this PR
  chroma_h ✓ this PR
  *_intra  — CPU NEON (less common; smaller volume)

Per H.264 §8.7.2.3, the chroma kernel is simpler than luma's: only p0/q0
updated (never p1/p2/q1/q2), tC = tc0_seg + 1 (no luma-style ap/aq
side bonus), 8 cells per edge (vs luma's 16).  Shader: 64 lines
vs luma_v's 108 — same WG geometry (16 edges × 16 lanes, lanes
8..15 of each edge early-return).
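
One chroma sample pair under these rules, as a scalar C sketch.
deblock_chroma_line is a hypothetical name; the alpha/beta gate and
the delta formula are the standard bS<4 ones from the spec rather than
something stated above, and the >> on a negative value assumes an
arithmetic shift.

```c
#include <stdint.h>
#include <stdlib.h>
#include <stddef.h>

static int clip3(int lo, int hi, int v) { return v < lo ? lo : (v > hi ? hi : v); }

/*
 * Filters one line across a bS<4 chroma edge, in place.
 * pix points at q0; p0, p1 sit at pix[-1*step], pix[-2*step], q1 at pix[step].
 * Only p0 and q0 are written.
 */
static void deblock_chroma_line(uint8_t *pix, ptrdiff_t step,
                                int alpha, int beta, int tc0)
{
    const int p1 = pix[-2 * step], p0 = pix[-1 * step];
    const int q0 = pix[0],         q1 = pix[step];

    if (abs(p0 - q0) >= alpha || abs(p1 - p0) >= beta || abs(q1 - q0) >= beta)
        return;

    const int tc = tc0 + 1;        /* chroma: no ap/aq side bonus */
    const int delta = clip3(-tc, tc, ((q0 - p0) * 4 + (p1 - q1) + 4) >> 3);

    pix[-1 * step] = (uint8_t)clip3(0, 255, p0 + delta);
    pix[0]         = (uint8_t)clip3(0, 255, q0 - delta);
}
```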

4:2:0-only: 4:2:2 chroma_h has a 16-row edge geometry that this
shader doesn't address; daedalus_dispatch_h264_deblock_chroma_h is
4:2:0-only by design, caller-side gating already covers this in the
libavcodec substitution arc (marfrit-packages PR #98).

Recipe table flips DAEDALUS_KERNEL_H264_DEBLOCK_CV / CH from CPU to
QPU.  dispatch_h264_deblock_chroma_qpu factored to share QPU
plumbing between V and H (orientation passed as a flag for the
dst_max calculation).

Verified on hertz:

  $ ./build/test_api_h264 | grep "deblock chroma [vh]:"
    H.264 deblock chroma v: 256/256 bytes bit-exact (100.0000%)
    H.264 deblock chroma h: 256/256 bytes bit-exact (100.0000%)

  Recipe substrate now reports 2 (QPU) for both CV and CH.

Coverage now:
                bS<4 QPU     bS=4 (intra)
  luma_v        ✓ cycle 8    CPU NEON
  luma_h        ✓ PR #28     CPU NEON
  chroma_v      ✓ this PR    CPU NEON
  chroma_h      ✓ this PR    CPU NEON

Intra (bS=4) variants stay CPU NEON.  Less common case, smaller
per-frame contribution, and the algorithm is structurally different
(no tc0; strong-vs-weak filter quad-tree).  Can land as a follow-up
PR if perf demands.
2026-05-25 17:10:34 +02:00
marfrit de9266a6eb Merge pull request 'h264: V3D shader for deblock_luma_h — first QPU port since cycle 9' (#28) from noether/v3d-shader-h264-deblock-luma-h into main
Reviewed-on: #28
2026-05-25 15:06:18 +00:00
claude-noether 3db059ffab h264: V3D shader for deblock_luma_h — first QPU port since cycle 9
Ports cycle 8's v3d_h264deblock.comp (V variant: one lane per column,
taps stepping vertically across the edge) to the H orientation (one
lane per row, taps stepping horizontally across the edge).  Same
algorithm, transposed access pattern:

  V variant: lane → column, reads/writes pix[±N*stride] (vertical I/O)
  H variant: lane → row,    reads/writes pix[±N]        (horizontal I/O)

  WG geometry unchanged: 256 invocations, 16 edges/WG, 16 lanes/edge.
  Lane-in-edge interpretation flips: column-index for V → row-index
  for H.  tc0 segment math unchanged (one tc0 byte per 4 lanes).
  dst_max calculation flips: V used dst_off + 3*stride + 16 (cols),
  H uses dst_off + 15*stride + 4 (rows).
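
The lane-to-address flip and the two dst_max bounds can be sketched in
C as below.  Function names are hypothetical, and the tap range k =
-4..+3 (p3..q3) is an assumption; with it, each bound is the largest
touched offset plus one.

```c
#include <stddef.h>

/*
 * Offset (relative to the edge origin) of tap k for a given lane.
 * k < 0 is the p side, k >= 0 the q side; lane is 0..15 along the edge.
 */
static ptrdiff_t deblock_tap_offset(int orient_h, ptrdiff_t stride,
                                    int lane, int k)
{
    return orient_h ? (ptrdiff_t)lane * stride + k   /* H: lane = row, taps step by 1 */
                    : (ptrdiff_t)k * stride + lane;  /* V: lane = column, taps step by stride */
}

/* Exclusive upper bound of the bytes one 16-lane edge can touch. */
static size_t deblock_dst_max(size_t dst_off, size_t stride, int orient_h)
{
    return orient_h ? dst_off + 15 * stride + 4
                    : dst_off + 3 * stride + 16;
}
```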

Recipe table: DAEDALUS_KERNEL_H264_DEBLOCK_LH = QPU (was CPU).  AUTO
dispatch now picks QPU for the H edge as well as the V edge.  CPU
NEON path stays as the explicit-SUBSTRATE_CPU + has_qpu=0 fallback.

Verified on hertz (Pi 5 / V3D 7.1):

  $ ./build/test_api_h264 | grep luma_h
    H264_DEBLOCK_LH recipe substrate: 2     (was 1 — flipped to QPU)
    H.264 deblock luma h: 1024/1024 bytes bit-exact (100.0000%)

Bit-exact against the C reference (h264_h_loop_filter_luma_ref) on
8 tiles × 8 cols × 16 rows of random input.  Same correctness gate
as the cycle 8 V shader.

CMake plumbing: glslang rule for v3d_h264deblock_h.comp; new SPV
added to daedalus_shaders ALL list + install rule.  daedalus_ctx
gains a parallel h264deblock_h_pipe_ready / h264deblock_h_pipe pair
(can't share with V because pipelines bind a specific SPIR-V module
at create time).

What this changes for the substitution arc: PR #97's 0008-h264-
deblock-luma-h substitution patch already plumbed
daedalus_recipe_dispatch_h264_deblock_luma_h through libavcodec.
That path was NEON-by-recipe; with this PR it becomes QPU-by-recipe
(unless the libavcodec ctx is no-QPU per daedalus_ctx_create_no_qpu,
in which case it stays NEON — same shape as cycle 8's V shader).

Coverage state for H.264 8-bit 4:2:0 deblock kernels (QPU shaders):
  kernel       before          after this PR
  luma_v       ✓ cycle 8       ✓ now
  luma_h       -               ✓ THIS PR
  chroma_v/h   -               (CPU NEON; smaller tiles, lower-priority)
  *_intra (4)  -               (CPU NEON; less common)
2026-05-25 16:50:41 +02:00
marfrit 2faa849ce2 Merge pull request 'h264: promote remaining intra prediction modes (17) to public API' (#27) from noether/h264-intra-pred-rest-api into main
Reviewed-on: #27
2026-05-25 13:43:56 +00:00
claude-noether cb3aef3dac h264: promote remaining intra prediction modes (17) to public API
Follows PR #26 (Intra_4x4 luma) with the same promotion pattern for
the rest of the intra prediction primitive set:

  Intra_16x16 luma   (4 modes, PR #13) — V/H/DC/Plane
  Intra_8x8  chroma  (4 modes, PR #14) — DC/H/V/Plane (4:2:0)
  Intra_8x8  luma    (9 modes, PRs #21 + #22) — High profile,
                                                 with 1-2-1 pre-filter

3 file moves via `git mv`, ~17 function renames stripping the `_ref`
suffix.  Test binaries rewired to link daedalus_core instead of
compiling the (now moved) ref files directly.  No code change — pure
plumbing for substitution-arc consumers.

26 intra prediction modes total now in the public API after this PR.

Verified on hertz:

  test_intra_pred_16x16:    5/5  PASS
  test_intra_pred_chroma8x8: 5/5  PASS
  test_intra_pred_8x8_luma: 11/11 PASS

All via public symbols (test binaries linked against daedalus_core).

Unblocks marfrit-packages substitution arc patch 0014 — wires
H264PredContext.pred4x4[], pred16x16[], pred8x8[], pred8x8l[]
through daedalus alongside the existing IDCT / deblock / qpel / DC
Hadamard substitutions.

After 0014 lands, the libavcodec.so built by marfrit-packages will
have EVERY hot-path pixel-math kernel of an H.264 8-bit 4:2:0
decode routing through daedalus — the substitution arc is feature-
complete for the campaign target (Pi 5 Firefox YouTube playback).
2026-05-25 15:37:44 +02:00
marfrit 31c68d0d0e Merge pull request 'h264: promote Intra_4x4 luma prediction (9 modes) to public API' (#26) from noether/h264-intra-pred-4x4-api into main
Reviewed-on: #26
2026-05-25 13:35:56 +00:00
claude-noether df9e1c9d78 h264: promote Intra_4x4 luma prediction (9 modes) to public API
PR #12 added the 9 Intra_4x4 luma prediction modes as test-only
spec references in tests/.  This PR promotes them to public src/
symbols so consumers (the eventual marfrit-packages substitution-arc
patch 0014) can link against them.

  Moved: tests/h264_intra_pred_4x4_ref.c → src/h264_intra_pred_4x4.c
  Renamed: daedalus_h264_pred_4x4_<mode>_ref → daedalus_h264_pred_4x4_<mode>
           (9 functions: vertical/horizontal/dc/ddl/ddr/vr/hd/vl/hu)

The src/ implementation is byte-for-byte the same code as the
test-only ref; this PR is plain plumbing.  The test binary now
links against daedalus_core to pull in the public symbols (instead
of compiling the ref file directly), exercising the path that real
consumers will use.

Same promotion shape as PR #25 (chroma DC Hadamard).

Verified on hertz:

  $ ./build/test_intra_pred_4x4
    Vertical (mode 0)          PASS
    Horizontal (mode 1)        PASS
    DC (mode 2)                PASS
    DiagDownLeft (mode 3)      PASS
    DiagDownRight (mode 4)     PASS
    VerticalRight (mode 5)     PASS
    HorizontalDown (mode 6)    PASS
    VerticalLeft (mode 7)      PASS
    HorizontalUp (mode 8)      PASS
    VR asym (sanity)           PASS

  ALL 10 intra-4x4 mode references PASS

  $ nm -g build/libdaedalus_core.a | grep "T daedalus_h264_pred_4x4"
  (9 symbols exported)

Follow-ups (same promotion pattern, can land in parallel):
  - Intra_16x16 luma (4 modes, PR #13)
  - Intra_8x8 chroma (4 modes, PR #14)
  - Intra_8x8 luma (9 modes, PRs #21 + #22)

Once all 26 intra modes are in the public API, the marfrit-packages
substitution arc can route H264PredContext's pred function pointer
tables through daedalus alongside the IDCT / deblock / qpel / DC
Hadamard substitutions already in place.
2026-05-25 14:53:37 +02:00
marfrit b9f9ff2a89 Merge pull request 'h264: expose chroma DC 2x2 Hadamard as public API' (#25) from noether/h264-chroma-dc-hadamard-api into main
Reviewed-on: #25
2026-05-25 11:35:05 +00:00
claude-noether 1f07f3cd70 h264: expose chroma DC 2x2 Hadamard as public API
PR #23 added the Hadamard as a test-only spec reference; this PR
promotes it to a public symbol in src/ so consumers (the eventual
marfrit-packages substitution-arc patch 0011) can link against it.

  New: void daedalus_h264_chroma_dc_hadamard_2x2(int16_t c[4]);
       — operates in-place on 4 int16, no QP-dependent scaling
       (caller composes that themselves per §8.5.11.2).

The src/ implementation is byte-for-byte identical to the test-only
ref in tests/h264_chroma_dc_hadamard_ref.c (kept as a separate
spec-validation copy).  A new "public API parity" test case verifies
the two produce identical output for a non-trivial input.

Pure CPU primitive — no substrate-dispatch wrapper because the work
is 4 adds + 4 subs; the substrate machinery would cost more than
the kernel itself.
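
To make the "4 adds + 4 subs" concrete, a sketch of a 2x2 Hadamard on
four int16 values.  hadamard_2x2_sketch is a hypothetical stand-in,
not the library source, and it assumes row-major order
[c00, c01, c10, c11].

```c
#include <stdint.h>

/* In-place 2x2 Hadamard: f = H * c * H with H = [[1, 1], [1, -1]]. */
static void hadamard_2x2_sketch(int16_t c[4])
{
    const int a = c[0] + c[1];     /* row 0 sum  */
    const int b = c[0] - c[1];     /* row 0 diff */
    const int d = c[2] + c[3];     /* row 1 sum  */
    const int e = c[2] - c[3];     /* row 1 diff */

    c[0] = (int16_t)(a + d);
    c[1] = (int16_t)(b + e);
    c[2] = (int16_t)(a - d);
    c[3] = (int16_t)(b - e);
}
```

Applying it twice returns 4x the original input, which is the property
the "double-Hadamard" test case relies on.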

Verified on hertz:

  $ ./build/test_chroma_dc_hadamard
    all-uniform 5                    PASS
    col gradient [0,10,0,10]         PASS
    row gradient [0,0,10,10]         PASS
    anti-diagonal [10,0,0,10]        PASS
    asymmetric [1,2,3,4]             PASS
    sign-alternating [-5,5,-5,5]     PASS
    double-Hadamard = 4*orig         PASS
    public API parity vs _ref        PASS

  ALL chroma DC Hadamard tests PASS

  $ nm -g build/libdaedalus_core.a | grep chroma_dc_hadamard
  0000000000000000 T daedalus_h264_chroma_dc_hadamard_2x2

Unblocks marfrit-packages 0011 (substituting
H264DSPContext.chroma_dc_dequant_idct, which composes the Hadamard
+ qmul scaling).
2026-05-25 13:32:01 +02:00
marfrit b21b35c74b Merge pull request 'bench: H.264 primitives NEON CPU baseline (1080p budget projection)' (#24) from noether/h264-primitives-bench into main
Reviewed-on: #24
2026-05-25 09:51:20 +00:00
claude-noether ba5bbae8e2 bench: H.264 primitives NEON CPU baseline (1080p budget projection)
Adds bench_h264_primitives — a non-ctest binary that times the
H.264 pixel-math primitives at their representative per-frame N and
projects 1080p frame budgets.  Lets us answer "how much of the
33-ms 30fps deadline does the pixel-math layer eat on NEON alone,
before the intercept patch adds entropy decode + metadata work."

Results on hertz (Pi 5 / 4×Cortex-A76, NEON path):

  Per-kernel ns/op (CPU NEON):
    IDCT 4x4 luma            10.78 ns/block
    IDCT 8x8 luma            29.73 ns/block
    Deblock luma_v           18.04 ns/edge
    Deblock luma_h           41.65 ns/edge   (H access pattern less SIMD-friendly)
    qpel mc20  (H half-pel)  25.66 ns/block
    qpel mc02  (V half-pel)  15.06 ns/block  (faster than mc20!)
    qpel mc22  (HV half-pel) 71.50 ns/block  (cascaded H+V, expected)

  Projected 1080p frame budgets (worst-case, CPU NEON only):
    IDCT 4x4 (all-4x4 MBs):       1.41 ms   (130,560 blocks)
    IDCT 8x8 (all-8x8 MBs):       0.97 ms   ( 32,640 blocks)
    Deblock luma_v (all MBs):     0.59 ms   ( 32,640 edges)
    Deblock luma_h (all MBs):     1.36 ms   ( 32,640 edges)
    qpel mc22 (all 8x8 blocks):   2.33 ms   ( 32,640 blocks)

    Sum (IDCT 4x4 + deblock luma + MC all-mc22):    5.69 ms
    30 fps deadline:                              33.33 ms
    Margin:                                       +27.64 ms
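
The projection above is plain arithmetic (ns/op times ops per frame). A minimal sketch of it follows, assuming 1080p is coded as 120 x 68 macroblocks and that the 32,640 counts in the table are 4 per MB; `frame_ms`, `worst_case_sum_ms` and `check_budget` are hypothetical names, not functions from bench_h264_primitives.

```c
#include <assert.h>

/* 1080p is coded as 1920x1088 luma samples, i.e. 120 x 68 macroblocks. */
enum { MBS_1080P = 120 * 68 };                /* 8,160 MBs */

/* Hypothetical helper (not part of the bench binary): per-frame cost
 * in milliseconds for `ops` calls at `ns_per_op` nanoseconds each. */
static double frame_ms(double ns_per_op, long ops)
{
    return ns_per_op * (double)ops / 1e6;
}

/* Worst-case sum from the table: all-4x4 IDCT, both luma deblock
 * directions, and mc22 for every 8x8 block. */
static double worst_case_sum_ms(void)
{
    const long blocks_4x4 = MBS_1080P * 16L;  /* 130,560 */
    const long four_per_mb = MBS_1080P * 4L;  /*  32,640 (assumed 4 per MB) */
    return frame_ms(10.78, blocks_4x4)        /* IDCT 4x4       */
         + frame_ms(18.04, four_per_mb)       /* deblock luma_v */
         + frame_ms(41.65, four_per_mb)       /* deblock luma_h */
         + frame_ms(71.50, four_per_mb);      /* qpel mc22      */
}

/* Usage: the sum and margin should match the table to two decimals. */
static void check_budget(void)
{
    const double sum = worst_case_sum_ms();
    assert(sum > 5.68 && sum < 5.70);         /* table: 5.69 ms   */
    assert(1000.0 / 30.0 - sum > 27.63);      /* table: +27.64 ms */
}
```

Keeping the op counts as multiples of the MB count makes it easy to re-project for another resolution by changing only `MBS_1080P`.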

What this validates:

  - The "30fps@1080p is the fine floor" memory note holds with
    huge headroom on the pixel-math layer alone.  17% of the
    deadline goes to pixel math (worst case); 83% is available
    for entropy decode + reference frame management + intra
    prediction + chroma deblock + chroma IDCT + the libavcodec
    intercept overhead.
  - The CPU-vs-QPU substrate finding from earlier (PR #10 on
    daedalus-decoder showed CPU NEON is 4x faster than QPU for
    IDCT) is consistent here.  All these kernels have CPU-only
    recipes by default; the data suggests that's the right call
    for now.  The recipe substrate decision can be revisited
    per-kernel once QPU shaders catch up.
  - mc22 (2D HV half-pel) is the most expensive single qpel
    position at ~71 ns/block, roughly 2.8-4.7x the 1D variants
    (25.66 and 15.06 ns/block).  Real B-slice biprediction with two
    mc22 calls per 8x8 block would cost ~4.7 ms/frame (2 x 2.33 ms);
    still comfortable but worth knowing.

What this DOESN'T measure (intentionally — they aren't on the
critical path at NEON speeds):

  - Chroma IDCT (4 cb + 4 cr 4x4 per MB).  At similar ns/block to
    luma, that's ~0.7 ms/frame.
  - Chroma deblock (smaller tile, simpler kernel — sub-ms).
  - Intra prediction (per-block, ~50 ops at NEON, but serialized
    in z-scan order so cache-friendly; ~0.5 ms/frame estimate).
  - bS=4 intra deblock variants — different algorithm, similar
    cost to bS<4.
  - chroma DC Hadamard — trivial.

Adding all of those in the worst case would maybe double the 5.69
ms number to ~12 ms.  Still leaves 20+ ms for entropy decode +
metadata work in the intercept patch.
2026-05-25 11:26:11 +02:00
marfrit eef7f034b0 Merge pull request 'h264: chroma DC 2x2 Hadamard pre-pass primitive' (#23) from noether/h264-chroma-dc-hadamard into main
Reviewed-on: #23
2026-05-25 09:23:05 +00:00
claude-noether 854bdeda20 h264: chroma DC 2x2 Hadamard pre-pass primitive
Adds the H.264 §8.5.11.1 chroma DC Hadamard transform.  In 4:2:0
chroma, the four DC coefficients (one from each chroma 4x4 AC block
within an MB) go through a 2x2 Hadamard before quant-scaling and
before being added back to each block's [0,0] coefficient prior to
the 4x4 AC IDCT.

This PR ships the pure Hadamard transform:

  f[0,0] = c[0,0] + c[0,1] + c[1,0] + c[1,1]
  f[0,1] = c[0,0] - c[0,1] + c[1,0] - c[1,1]
  f[1,0] = c[0,0] + c[0,1] - c[1,0] - c[1,1]
  f[1,1] = c[0,0] - c[0,1] - c[1,0] + c[1,1]

implemented as the 2-stage row+col butterfly (1:1 with the NEON
SIMD shape upstream).  Operates in-place on int16[4].

What this does NOT do (deferred to caller-side composition):

  - QP-dependent scaling per §8.5.11.2.  The scale depends on
    QP_C (with chroma_qp_offset adjustment), so the formula has
    branches (>=6 vs <6) and looks up LevelScale4x4 table values.
    The libavcodec intercept patch composes Hadamard + scale +
    shift itself since the scale shape varies by codec-level
    context (slice header chroma_qp_offset, PPS chroma_qp_offset,
    second_chroma_qp_offset for the chroma_qp_index_offset).
  - A separate forward/inverse pair.  The 2x2 Hadamard is its own
    inverse up to scaling, so the matrix used at decode time is the
    same one the forward direction uses; conceptually the spec
    distinguishes them in §8.5.11, but we expose only the matrix.

Test design (tests/test_chroma_dc_hadamard.c):

  7 cases, all spec-derived hand-computations:
    - all-uniform 5 → [20, 0, 0, 0]
    - col gradient [0,10,0,10] → [20, -20, 0, 0]
    - row gradient [0,0,10,10] → [20, 0, -20, 0]
    - anti-diagonal [10,0,0,10] → [20, 0, 0, 20]
    - asymmetric [1,2,3,4] → [10, -2, -4, 0]
    - sign-alternating [-5,5,-5,5] → [0, -20, 0, 0]
    - double-Hadamard invariant: H·H = 4·I, so applying twice
      gives [4*c[0], 4*c[1], 4*c[2], 4*c[3]] for any input.

The double-Hadamard test is the strongest correctness gate: any
single sign error in the butterfly would break the H·H = 4·I
algebraic property, surfacing immediately.  All 7 PASS first try.
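
A minimal sketch of the double-Hadamard gate described above, assuming a local copy of the butterfly so the example is self-contained; `hadamard_2x2`, `double_hadamard_is_4x` and `check_double_hadamard` are illustrative names, not the symbols in tests/test_chroma_dc_hadamard.c.

```c
#include <assert.h>
#include <stdint.h>

/* Local copy of the 2-stage row+col butterfly (illustrative name). */
static void hadamard_2x2(int16_t c[4])
{
    int t0 = c[0] + c[1];
    int t1 = c[0] - c[1];
    int t2 = c[2] + c[3];
    int t3 = c[2] - c[3];
    c[0] = (int16_t)(t0 + t2);
    c[1] = (int16_t)(t1 + t3);
    c[2] = (int16_t)(t0 - t2);
    c[3] = (int16_t)(t1 - t3);
}

/* Returns 1 if applying the transform twice yields 4 * input. */
static int double_hadamard_is_4x(const int16_t in[4])
{
    int16_t c[4] = { in[0], in[1], in[2], in[3] };
    hadamard_2x2(c);
    hadamard_2x2(c);
    for (int i = 0; i < 4; i++)
        if (c[i] != 4 * in[i])
            return 0;
    return 1;
}

/* Usage: two of the inputs from the case list above. */
static void check_double_hadamard(void)
{
    const int16_t asym[4] = { 1, 2, 3, 4 };
    const int16_t alt[4]  = { -5, 5, -5, 5 };
    assert(double_hadamard_is_4x(asym));
    assert(double_hadamard_is_4x(alt));
}
```

One caveat worth keeping in mind: swapping the c[1] and c[2] outputs still satisfies H·H = 4·I, so the invariant alone does not pin down output ordering; the col-gradient and row-gradient cases are what cover that.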

Verified on hertz:

  $ ./build/test_chroma_dc_hadamard
    all-uniform 5                    PASS
    col gradient [0,10,0,10]         PASS
    row gradient [0,0,10,10]         PASS
    anti-diagonal [10,0,0,10]        PASS
    asymmetric [1,2,3,4]             PASS
    sign-alternating [-5,5,-5,5]     PASS
    double-Hadamard = 4*orig         PASS

  ALL chroma DC Hadamard tests PASS

With this primitive the H.264 8-bit 4:2:0 pixel-math primitive
matrix is complete in fourier:
  - IDCT 4x4 (luma + chroma) ✓
  - IDCT 8x8 (luma, High profile) ✓
  - Chroma DC Hadamard 2x2 ✓ (this PR)
  - Deblock (8 variants) ✓
  - Intra prediction (26 modes) ✓
  - MC qpel (30 dispatches) ✓

What remains for the libavcodec intercept patch: CABAC/CAVLC entropy
decode, SPS/PPS parsing, slice header parsing, MB type / QP / CBP /
intra mode prediction.  All of that lives at the intercept layer
(it's spec-derived from the bitstream syntax, not pixel-math); the
intercept patch will call into these fourier primitives once the
metadata is decoded.
2026-05-25 11:18:59 +02:00
marfrit 17d672ebef Merge pull request 'h264: Intra_8x8 luma — 6 directional modes (DDL/DDR/VR/HD/VL/HU)' (#22) from noether/h264-intra-pred-8x8-directional into main
Reviewed-on: #22
2026-05-25 09:16:19 +00:00
claude-noether 5565cc2bef h264: Intra_8x8 luma — 6 directional modes (DDL/DDR/VR/HD/VL/HU)
Closes the H.264 8-bit 4:2:0 intra-prediction primitive matrix.
Adds the 6 directional Intra_8x8 luma modes per H.264 §8.3.2.1.5..
§8.3.2.1.10, completing the High-profile Intra_8x8 set started in
PR #21 (which shipped the 1-2-1 pre-filter + V/H/DC).

Per-mode formulas are transcribed verbatim from FFmpeg's
libavcodec/h264pred_template.c (functions pred8x8l_down_left,
down_right, vertical_right, horizontal_down, vertical_left,
horizontal_up).  Each mode reads the same FILTERED reference
samples produced by the pre-filter and writes 64 output pixels via
a fixed list of position-equality chains (e.g. for DDL,
SRC(0,7)=SRC(1,6)=SRC(2,5)=...=SRC(7,0)= some shared 3-tap formula).

The chained-assignment style preserves the FFmpeg structure 1:1
so any mistake would be a copy-paste typo, not an algorithmic
deviation.  Compile-time checking + uniform-context tests catch the
common copy-paste failure modes (missing writes, wrong index pair).
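
For orientation, the DDL equality chains above can be sketched in loop form: every output position with the same x+y shares one 3-tap value. This is an assumption-laden illustration, not the chained-assignment code this commit ships; `ddl_8x8_sketch`, the `ft[16]` input (already-filtered top samples) and the `pred[y][x]` layout are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Loop-form Intra_8x8 Diagonal_Down_Left over already-filtered top
 * samples ft[0..15].  pred is indexed [y][x] (row, column). */
static void ddl_8x8_sketch(const uint8_t ft[16], uint8_t pred[8][8])
{
    for (int y = 0; y < 8; y++) {
        for (int x = 0; x < 8; x++) {
            int k = x + y;  /* constant along each anti-diagonal */
            if (k == 14)    /* bottom-right corner: there is no ft[16] */
                pred[y][x] = (uint8_t)((ft[14] + 3 * ft[15] + 2) >> 2);
            else
                pred[y][x] = (uint8_t)((ft[k] + 2 * ft[k + 1]
                                        + ft[k + 2] + 2) >> 2);
        }
    }
}

/* Usage: a uniform context of 120 must produce a uniform 120 block. */
static void check_ddl_uniform(void)
{
    uint8_t ft[16], pred[8][8];
    for (int i = 0; i < 16; i++)
        ft[i] = 120;
    ddl_8x8_sketch(ft, pred);
    for (int y = 0; y < 8; y++)
        for (int x = 0; x < 8; x++)
            assert(pred[y][x] == 120);
}
```

The loop form is easier to audit by eye, while the chained form keeps the 1:1 correspondence with FFmpeg that the paragraph above argues for.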

Scope:
  - 6 new ref functions: ddl/ddr/vr/hd/vl/hu_ref.
  - Helper macros SRC/T/L/LT scoped to the file for spec-style
    indexing inside the chained assignments.
  - 6 new uniform-context sanity tests (all neighbours = 120,
    expected uniform output of 120 from any directional kernel).

Verified on hertz:

  $ ./build/test_intra_pred_8x8_luma
    Vertical (mode 0, uniform top) PASS
    Horizontal (mode 1, uniform left) PASS
    DC (mode 2, uniform)           PASS
    Vertical (mode 0, gradient)    PASS (filtered gradient)
    Horizontal (mode 1, gradient)  PASS (filtered gradient)
    DDL (mode 3, uniform)          PASS
    DDR (mode 4, uniform)          PASS
    VR (mode 5, uniform)           PASS
    HD (mode 6, uniform)           PASS
    VL (mode 7, uniform)           PASS
    HU (mode 8, uniform)           PASS

  ALL Intra_8x8 luma PASS (9 modes)

Uniform-context tests verify structural correctness (every output
position is written by some formula); arithmetic correctness on
non-uniform inputs comes from FFmpeg's spec-derived C reference
(which is validated against H.264 conformance bitstreams upstream).
The libavcodec intercept patch will exercise these on real streams.
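
Why a uniform context must give a uniform output can be sketched as follows: every tap involved is a rounded weighted average whose weights sum to 4, so a constant input passes through unchanged. The helper names (`tap_121`, `tap_13`, `taps_preserve_constants`) are assumptions for illustration only.

```c
#include <assert.h>

/* 1-2-1 smoothing tap used by the pre-filter and the directional modes. */
static int tap_121(int a, int b, int c) { return (a + 2 * b + c + 2) >> 2; }

/* 1-3 boundary tap used where the third neighbour does not exist. */
static int tap_13(int a, int b) { return (a + 3 * b + 2) >> 2; }

/* Returns 1 if every 8-bit constant v passes through both taps unchanged. */
static int taps_preserve_constants(void)
{
    for (int v = 0; v <= 255; v++)
        if (tap_121(v, v, v) != v || tap_13(v, v) != v)
            return 0;
    return 1;
}

/* Usage: the value used by the uniform-context tests above. */
static void check_uniform_120(void)
{
    assert(taps_preserve_constants());
    assert(tap_121(120, 120, 120) == 120);
    assert(tap_13(120, 120) == 120);
}
```

The same property is what limits these tests: with all neighbours equal, a read from the wrong reference sample still returns 120, which is why arithmetic correctness leans on the upstream reference as stated above.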

Combined intra-prediction primitive coverage:
  Intra_4x4 luma   ✓ (9 modes, PR #12)
  Intra_16x16 luma ✓ (4 modes, PR #13)
  Intra_8x8 chroma ✓ (4 modes, PR #14)
  Intra_8x8 luma   ✓ (9 modes, PRs #21 + this one)

26 intra-prediction modes total, all bit-exact gated.  Every H.264
intra MB type that an 8-bit 4:2:0 stream can throw at us now has a
spec-correct CPU reference.
2026-05-25 09:56:45 +02:00
54 changed files with 4875 additions and 313 deletions
+179 -27
@@ -284,6 +284,55 @@ if (DAEDALUS_BUILD_VULKAN)
VERBATIM
)
set(H264DEBLOCK_H_SPV ${CMAKE_BINARY_DIR}/v3d_h264deblock_h.spv)
add_custom_command(
OUTPUT ${H264DEBLOCK_H_SPV}
COMMAND ${GLSLANG_VALIDATOR} -V --target-env vulkan1.3
-o ${H264DEBLOCK_H_SPV}
${CMAKE_SOURCE_DIR}/src/v3d_h264deblock_h.comp
DEPENDS ${CMAKE_SOURCE_DIR}/src/v3d_h264deblock_h.comp
COMMENT "glslang: v3d_h264deblock_h.comp -> v3d_h264deblock_h.spv"
VERBATIM
)
set(H264DEBLOCK_CHROMA_V_SPV ${CMAKE_BINARY_DIR}/v3d_h264deblock_chroma_v.spv)
add_custom_command(
OUTPUT ${H264DEBLOCK_CHROMA_V_SPV}
COMMAND ${GLSLANG_VALIDATOR} -V --target-env vulkan1.3
-o ${H264DEBLOCK_CHROMA_V_SPV}
${CMAKE_SOURCE_DIR}/src/v3d_h264deblock_chroma_v.comp
DEPENDS ${CMAKE_SOURCE_DIR}/src/v3d_h264deblock_chroma_v.comp
COMMENT "glslang: v3d_h264deblock_chroma_v.comp -> .spv"
VERBATIM
)
set(H264DEBLOCK_CHROMA_H_SPV ${CMAKE_BINARY_DIR}/v3d_h264deblock_chroma_h.spv)
add_custom_command(
OUTPUT ${H264DEBLOCK_CHROMA_H_SPV}
COMMAND ${GLSLANG_VALIDATOR} -V --target-env vulkan1.3
-o ${H264DEBLOCK_CHROMA_H_SPV}
${CMAKE_SOURCE_DIR}/src/v3d_h264deblock_chroma_h.comp
DEPENDS ${CMAKE_SOURCE_DIR}/src/v3d_h264deblock_chroma_h.comp
COMMENT "glslang: v3d_h264deblock_chroma_h.comp -> .spv"
VERBATIM
)
# Intra (bS=4) deblock shaders: strong/weak filter selector per
# H.264 §8.7.2.4. 4 variants (luma_v/h + chroma_v/h).
foreach(_kind luma_v_intra luma_h_intra chroma_v_intra chroma_h_intra)
set(_spv ${CMAKE_BINARY_DIR}/v3d_h264deblock_${_kind}.spv)
add_custom_command(
OUTPUT ${_spv}
COMMAND ${GLSLANG_VALIDATOR} -V --target-env vulkan1.3
-o ${_spv}
${CMAKE_SOURCE_DIR}/src/v3d_h264deblock_${_kind}.comp
DEPENDS ${CMAKE_SOURCE_DIR}/src/v3d_h264deblock_${_kind}.comp
COMMENT "glslang: v3d_h264deblock_${_kind}.comp -> .spv"
VERBATIM
)
set(H264DEBLOCK_${_kind}_SPV ${_spv})
endforeach()
set(H264_IDCT4_SPV ${CMAKE_BINARY_DIR}/v3d_h264_idct4.spv)
add_custom_command(
OUTPUT ${H264_IDCT4_SPV}
@@ -317,7 +366,63 @@ if (DAEDALUS_BUILD_VULKAN)
VERBATIM
)
add_custom_target(daedalus_shaders ALL DEPENDS ${NOOP_SPV} ${IDCT8_SPV} ${LPF_SPV} ${MC_SPV} ${LPF8_SPV} ${CDEF_SPV} ${H264DEBLOCK_SPV} ${H264_IDCT4_SPV} ${H264_IDCT8_SPV} ${H264_QPEL_MC20_SPV})
set(H264_QPEL_MC02_SPV ${CMAKE_BINARY_DIR}/v3d_h264_qpel_mc02.spv)
add_custom_command(
OUTPUT ${H264_QPEL_MC02_SPV}
COMMAND ${GLSLANG_VALIDATOR} -V --target-env vulkan1.3
-o ${H264_QPEL_MC02_SPV}
${CMAKE_SOURCE_DIR}/src/v3d_h264_qpel_mc02.comp
DEPENDS ${CMAKE_SOURCE_DIR}/src/v3d_h264_qpel_mc02.comp
COMMENT "glslang: v3d_h264_qpel_mc02.comp -> v3d_h264_qpel_mc02.spv"
VERBATIM
)
set(H264_QPEL_MC22_SPV ${CMAKE_BINARY_DIR}/v3d_h264_qpel_mc22.spv)
add_custom_command(
OUTPUT ${H264_QPEL_MC22_SPV}
COMMAND ${GLSLANG_VALIDATOR} -V --target-env vulkan1.3
-o ${H264_QPEL_MC22_SPV}
${CMAKE_SOURCE_DIR}/src/v3d_h264_qpel_mc22.comp
DEPENDS ${CMAKE_SOURCE_DIR}/src/v3d_h264_qpel_mc22.comp
COMMENT "glslang: v3d_h264_qpel_mc22.comp -> v3d_h264_qpel_mc22.spv"
VERBATIM
)
# Quarter-pel single-axis variants (mc10/30/01/03) + diagonal
# variants (mc11/12/13/21/23/31/32/33) — each composes 1-2 half-pel
# results with optional L2 averaging. Same WG geometry as mc20/mc02.
foreach(_mc mc10 mc30 mc01 mc03 mc11 mc12 mc13 mc21 mc23 mc31 mc32 mc33)
set(_spv ${CMAKE_BINARY_DIR}/v3d_h264_qpel_${_mc}.spv)
add_custom_command(
OUTPUT ${_spv}
COMMAND ${GLSLANG_VALIDATOR} -V --target-env vulkan1.3
-o ${_spv}
${CMAKE_SOURCE_DIR}/src/v3d_h264_qpel_${_mc}.comp
DEPENDS ${CMAKE_SOURCE_DIR}/src/v3d_h264_qpel_${_mc}.comp
COMMENT "glslang: v3d_h264_qpel_${_mc}.comp -> .spv"
VERBATIM
)
set(H264_QPEL_${_mc}_SPV ${_spv})
endforeach()
# avg_ biprediction variants — same shader as put_ + extra L2 with
# existing dst. All 15 useful positions.
foreach(_mc mc20 mc02 mc22 mc10 mc30 mc01 mc03
mc11 mc12 mc13 mc21 mc23 mc31 mc32 mc33)
set(_spv ${CMAKE_BINARY_DIR}/v3d_h264_qpel_avg_${_mc}.spv)
add_custom_command(
OUTPUT ${_spv}
COMMAND ${GLSLANG_VALIDATOR} -V --target-env vulkan1.3
-o ${_spv}
${CMAKE_SOURCE_DIR}/src/v3d_h264_qpel_avg_${_mc}.comp
DEPENDS ${CMAKE_SOURCE_DIR}/src/v3d_h264_qpel_avg_${_mc}.comp
COMMENT "glslang: v3d_h264_qpel_avg_${_mc}.comp -> .spv"
VERBATIM
)
set(H264_QPEL_avg_${_mc}_SPV ${_spv})
endforeach()
add_custom_target(daedalus_shaders ALL DEPENDS ${NOOP_SPV} ${IDCT8_SPV} ${LPF_SPV} ${MC_SPV} ${LPF8_SPV} ${CDEF_SPV} ${H264DEBLOCK_SPV} ${H264DEBLOCK_H_SPV} ${H264DEBLOCK_CHROMA_V_SPV} ${H264DEBLOCK_CHROMA_H_SPV} ${H264DEBLOCK_luma_v_intra_SPV} ${H264DEBLOCK_luma_h_intra_SPV} ${H264DEBLOCK_chroma_v_intra_SPV} ${H264DEBLOCK_chroma_h_intra_SPV} ${H264_IDCT4_SPV} ${H264_IDCT8_SPV} ${H264_QPEL_MC20_SPV} ${H264_QPEL_MC02_SPV} ${H264_QPEL_MC22_SPV} ${H264_QPEL_mc10_SPV} ${H264_QPEL_mc30_SPV} ${H264_QPEL_mc01_SPV} ${H264_QPEL_mc03_SPV} ${H264_QPEL_mc11_SPV} ${H264_QPEL_mc12_SPV} ${H264_QPEL_mc13_SPV} ${H264_QPEL_mc21_SPV} ${H264_QPEL_mc23_SPV} ${H264_QPEL_mc31_SPV} ${H264_QPEL_mc32_SPV} ${H264_QPEL_mc33_SPV} ${H264_QPEL_avg_mc20_SPV} ${H264_QPEL_avg_mc02_SPV} ${H264_QPEL_avg_mc22_SPV} ${H264_QPEL_avg_mc10_SPV} ${H264_QPEL_avg_mc30_SPV} ${H264_QPEL_avg_mc01_SPV} ${H264_QPEL_avg_mc03_SPV} ${H264_QPEL_avg_mc11_SPV} ${H264_QPEL_avg_mc12_SPV} ${H264_QPEL_avg_mc13_SPV} ${H264_QPEL_avg_mc21_SPV} ${H264_QPEL_avg_mc23_SPV} ${H264_QPEL_avg_mc31_SPV} ${H264_QPEL_avg_mc32_SPV} ${H264_QPEL_avg_mc33_SPV})
# v3d_runner — reusable Vulkan plumbing.
add_library(v3d_runner STATIC src/v3d_runner.c)
@@ -391,6 +496,11 @@ endif()
add_library(daedalus_core STATIC
src/daedalus_core.c
src/h264_chroma_dc.c
src/h264_intra_pred_4x4.c
src/h264_intra_pred_16x16.c
src/h264_intra_pred_chroma8x8.c
src/h264_intra_pred_8x8_luma.c
src/v3d_runner.c
${FFASM_SOURCES}
${FFASM_LPF_SOURCES}
@@ -445,9 +555,45 @@ if (DAEDALUS_BUILD_VULKAN)
${LPF8_SPV}
${CDEF_SPV}
${H264DEBLOCK_SPV}
${H264DEBLOCK_H_SPV}
${H264DEBLOCK_CHROMA_V_SPV}
${H264DEBLOCK_CHROMA_H_SPV}
${H264DEBLOCK_luma_v_intra_SPV}
${H264DEBLOCK_luma_h_intra_SPV}
${H264DEBLOCK_chroma_v_intra_SPV}
${H264DEBLOCK_chroma_h_intra_SPV}
${H264_IDCT4_SPV}
${H264_IDCT8_SPV}
${H264_QPEL_MC20_SPV}
${H264_QPEL_MC02_SPV}
${H264_QPEL_MC22_SPV}
${H264_QPEL_mc10_SPV}
${H264_QPEL_mc30_SPV}
${H264_QPEL_mc01_SPV}
${H264_QPEL_mc03_SPV}
${H264_QPEL_mc11_SPV}
${H264_QPEL_mc12_SPV}
${H264_QPEL_mc13_SPV}
${H264_QPEL_mc21_SPV}
${H264_QPEL_mc23_SPV}
${H264_QPEL_mc31_SPV}
${H264_QPEL_mc32_SPV}
${H264_QPEL_mc33_SPV}
${H264_QPEL_avg_mc20_SPV}
${H264_QPEL_avg_mc02_SPV}
${H264_QPEL_avg_mc22_SPV}
${H264_QPEL_avg_mc10_SPV}
${H264_QPEL_avg_mc30_SPV}
${H264_QPEL_avg_mc01_SPV}
${H264_QPEL_avg_mc03_SPV}
${H264_QPEL_avg_mc11_SPV}
${H264_QPEL_avg_mc12_SPV}
${H264_QPEL_avg_mc13_SPV}
${H264_QPEL_avg_mc21_SPV}
${H264_QPEL_avg_mc23_SPV}
${H264_QPEL_avg_mc31_SPV}
${H264_QPEL_avg_mc32_SPV}
${H264_QPEL_avg_mc33_SPV}
DESTINATION ${CMAKE_INSTALL_DATADIR}/daedalus-fourier/shaders
)
endif()
@@ -537,41 +683,47 @@ add_executable(test_api_opportunistic_qpu tests/test_api_opportunistic_qpu.c)
target_link_libraries(test_api_opportunistic_qpu PRIVATE daedalus_core)
target_compile_options(test_api_opportunistic_qpu PRIVATE -O2)
# H.264 Intra_4x4 luma prediction (9 modes) — reference + tests.
# Pure CPU + spec-derived; no daedalus_core dependency yet (this is
# the bit-exact gate for the eventual shader / dispatch wiring).
add_executable(test_intra_pred_4x4
tests/test_intra_pred_4x4.c
tests/h264_intra_pred_4x4_ref.c
)
# H.264 Intra_4x4 luma prediction (9 modes) — public src primitives.
# The bodies now live in src/h264_intra_pred_4x4.c (linked into
# daedalus_core for use by libavcodec.so substitution-arc consumers).
# This test exercises the public symbols.
add_executable(test_intra_pred_4x4 tests/test_intra_pred_4x4.c)
target_link_libraries(test_intra_pred_4x4 PRIVATE daedalus_core)
target_compile_options(test_intra_pred_4x4 PRIVATE -O2)
# H.264 Intra_16x16 luma prediction (4 modes: V, H, DC, Plane) —
# reference + tests. Same spec-gate role as the 4x4 sibling.
add_executable(test_intra_pred_16x16
tests/test_intra_pred_16x16.c
tests/h264_intra_pred_16x16_ref.c
)
# H.264 Intra_16x16 luma prediction (4 modes) — public src primitives,
# linked from daedalus_core.
add_executable(test_intra_pred_16x16 tests/test_intra_pred_16x16.c)
target_link_libraries(test_intra_pred_16x16 PRIVATE daedalus_core)
target_compile_options(test_intra_pred_16x16 PRIVATE -O2)
# H.264 Intra_8x8 chroma prediction (4 modes: DC, H, V, Plane) —
# reference + tests. DC is per-quadrant (asymmetric); Plane uses
# slope coefficient 34 instead of luma's 5.
add_executable(test_intra_pred_chroma8x8
tests/test_intra_pred_chroma8x8.c
tests/h264_intra_pred_chroma8x8_ref.c
)
# H.264 Intra_8x8 chroma prediction (4 modes) — public src primitives.
add_executable(test_intra_pred_chroma8x8 tests/test_intra_pred_chroma8x8.c)
target_link_libraries(test_intra_pred_chroma8x8 PRIVATE daedalus_core)
target_compile_options(test_intra_pred_chroma8x8 PRIVATE -O2)
# H.264 Intra_8x8 luma prediction (High profile, 9 modes + 1-2-1
# reference-sample pre-filter). This PR ships the pre-filter + the
# 3 simple modes (V, H, DC); the 6 directional modes follow.
add_executable(test_intra_pred_8x8_luma
tests/test_intra_pred_8x8_luma.c
tests/h264_intra_pred_8x8_luma_ref.c
)
# pre-filter) — public src primitives.
add_executable(test_intra_pred_8x8_luma tests/test_intra_pred_8x8_luma.c)
target_link_libraries(test_intra_pred_8x8_luma PRIVATE daedalus_core)
target_compile_options(test_intra_pred_8x8_luma PRIVATE -O2)
# H.264 chroma DC 2x2 Hadamard pre-pass primitive. Pure transform,
# no QP-dependent scaling (that's caller-side composition).
add_executable(test_chroma_dc_hadamard
tests/test_chroma_dc_hadamard.c
tests/h264_chroma_dc_hadamard_ref.c
)
# Links daedalus_core to pull in the public daedalus_h264_chroma_dc_hadamard_2x2
# symbol (for the public-API parity test added in this PR).
target_link_libraries(test_chroma_dc_hadamard PRIVATE daedalus_core)
target_compile_options(test_chroma_dc_hadamard PRIVATE -O2)
# H.264 primitives latency benchmark (NEON CPU baseline).
add_executable(bench_h264_primitives tests/bench_h264_primitives.c)
target_link_libraries(bench_h264_primitives PRIVATE daedalus_core)
target_compile_options(bench_h264_primitives PRIVATE -O2)
add_executable(bench_pool_overhead tests/bench_pool_overhead.c)
target_link_libraries(bench_pool_overhead PRIVATE daedalus_core)
target_compile_options(bench_pool_overhead PRIVATE -O2)
+82
@@ -544,6 +544,88 @@ DECLARE_QPEL_AVG(avg_mc33)
#undef DECLARE_QPEL_AVG
/* -------------------------------------------------------------------
* H.264 chroma DC 2x2 Hadamard pre-pass (per H.264 §8.5.11.1).
*
* Operates in-place on 4 int16 (the DC coefficients of an MB's
* chroma 4x4 AC blocks). Pure CPU primitive — no substrate
* dispatch wrapper because the work is 4 adds + 4 subs. Callers
* compose with QP-dependent scaling themselves; the scale shape
* varies by slice/PPS chroma_qp offset context.
*
* Bit-exact validated against tests/h264_chroma_dc_hadamard_ref.c
* (7-case spec-derived test suite including the H·H = 4·I algebraic
* invariant; see PR #23).
* ----------------------------------------------------------------- */
void daedalus_h264_chroma_dc_hadamard_2x2(int16_t c[4]);
/* -------------------------------------------------------------------
* H.264 Intra_4x4 luma prediction (per H.264 §8.3.1.4). 9 modes.
*
* Pure CPU primitives — each is a small straightforward fill of a
* 4x4 output block from neighbour pixels in the same buffer. No
* substrate-dispatch wrapper (the work is too small to amortise).
*
* FFmpeg-style interface: `dst` at row 0 col 0 of the 4x4 output.
* Reads top-left at dst[-stride-1], top at dst[-stride..-stride+7]
* (top-right for DDL/VL), and left at dst[r*stride - 1] for r=0..3.
* Caller must ensure all 13 neighbour bytes are valid (interior-MB
* assumption — H.264 availability fallback handled at caller).
*
* Bit-exact validated against tests/test_intra_pred_4x4.c (10-case
* spec-derived test suite including the asymmetric Vertical_Right
* 16-cell hand-derived case; see fourier PR #12).
* ----------------------------------------------------------------- */
void daedalus_h264_pred_4x4_vertical (uint8_t *dst, ptrdiff_t stride);
void daedalus_h264_pred_4x4_horizontal(uint8_t *dst, ptrdiff_t stride);
void daedalus_h264_pred_4x4_dc (uint8_t *dst, ptrdiff_t stride);
void daedalus_h264_pred_4x4_ddl (uint8_t *dst, ptrdiff_t stride);
void daedalus_h264_pred_4x4_ddr (uint8_t *dst, ptrdiff_t stride);
void daedalus_h264_pred_4x4_vr (uint8_t *dst, ptrdiff_t stride);
void daedalus_h264_pred_4x4_hd (uint8_t *dst, ptrdiff_t stride);
void daedalus_h264_pred_4x4_vl (uint8_t *dst, ptrdiff_t stride);
void daedalus_h264_pred_4x4_hu (uint8_t *dst, ptrdiff_t stride);
/* -------------------------------------------------------------------
* H.264 Intra_16x16 luma prediction (per §8.3.3). 4 modes:
* Vertical / Horizontal / DC / Plane. Same FFmpeg-style interface
* as the 4x4 family at 16x16 scale. Same neighbour availability
* assumption (interior-MB).
* ----------------------------------------------------------------- */
void daedalus_h264_pred_16x16_vertical (uint8_t *dst, ptrdiff_t stride);
void daedalus_h264_pred_16x16_horizontal(uint8_t *dst, ptrdiff_t stride);
void daedalus_h264_pred_16x16_dc (uint8_t *dst, ptrdiff_t stride);
void daedalus_h264_pred_16x16_plane (uint8_t *dst, ptrdiff_t stride);
/* -------------------------------------------------------------------
* H.264 Intra_8x8 chroma prediction (per §8.3.4, 4:2:0). 4 modes:
* DC / Horizontal / Vertical / Plane. Note: DC is per-quadrant
* asymmetric; Plane uses slope coefficient 34 (not luma's 5).
* ----------------------------------------------------------------- */
void daedalus_h264_pred_chroma8x8_dc (uint8_t *dst, ptrdiff_t stride);
void daedalus_h264_pred_chroma8x8_horizontal(uint8_t *dst, ptrdiff_t stride);
void daedalus_h264_pred_chroma8x8_vertical (uint8_t *dst, ptrdiff_t stride);
void daedalus_h264_pred_chroma8x8_plane (uint8_t *dst, ptrdiff_t stride);
/* -------------------------------------------------------------------
* H.264 Intra_8x8 luma prediction (High profile, per §8.3.2.1).
* 9 modes with the spec-defined 1-2-1 reference-sample pre-filter
* applied internally to the 25 neighbours before the mode arithmetic.
*
* "_8x8l" naming follows the FFmpeg h264pred_template convention
* (pred8x8l_<mode>_c) to keep the substitution wrappers a 1:1 name
* map.
* ----------------------------------------------------------------- */
void daedalus_h264_pred_8x8l_vertical (uint8_t *dst, ptrdiff_t stride);
void daedalus_h264_pred_8x8l_horizontal(uint8_t *dst, ptrdiff_t stride);
void daedalus_h264_pred_8x8l_dc (uint8_t *dst, ptrdiff_t stride);
void daedalus_h264_pred_8x8l_ddl (uint8_t *dst, ptrdiff_t stride);
void daedalus_h264_pred_8x8l_ddr (uint8_t *dst, ptrdiff_t stride);
void daedalus_h264_pred_8x8l_vr (uint8_t *dst, ptrdiff_t stride);
void daedalus_h264_pred_8x8l_hd (uint8_t *dst, ptrdiff_t stride);
void daedalus_h264_pred_8x8l_vl (uint8_t *dst, ptrdiff_t stride);
void daedalus_h264_pred_8x8l_hu (uint8_t *dst, ptrdiff_t stride);
/* -------------------------------------------------------------------
* Recipe query — what does the API recommend for each kernel?
* ----------------------------------------------------------------- */
+843 -94
File diff suppressed because it is too large.
+34
@@ -0,0 +1,34 @@
/* SPDX-License-Identifier: BSD-2-Clause */
/*
* H.264 chroma DC 2x2 Hadamard pre-pass (public, in-tree CPU).
*
* The 4 DC coefficients of an MB's chroma 4x4 AC blocks go through
* this 2x2 Hadamard before quant-scaling and re-injection into the
* AC blocks' [0,0] coefficient. Algorithm per H.264 §8.5.11.1.
*
* Pure CPU primitive — there's no substrate-dispatch wrapper because
* the work is 4 adds + 4 subs. Callers compose with QP-dependent
* scaling themselves (the scale shape varies by slice/PPS chroma_qp
* offset context and shouldn't be baked into the kernel).
*
* Bit-exact validated against tests/h264_chroma_dc_hadamard_ref.c
* (7-case spec-derived test suite including the H·H = 4·I algebraic
* invariant; see PR #23). Same algorithm; this is the public
* src-tree copy.
*/
#include "daedalus.h"
#include <stdint.h>
void daedalus_h264_chroma_dc_hadamard_2x2(int16_t c[4])
{
int t0 = c[0] + c[1];
int t1 = c[0] - c[1];
int t2 = c[2] + c[3];
int t3 = c[2] - c[3];
c[0] = (int16_t)(t0 + t2); /* f[0,0] = sum of all 4 */
c[1] = (int16_t)(t1 + t3); /* f[0,1] = col-difference */
c[2] = (int16_t)(t0 - t2); /* f[1,0] = row-difference */
c[3] = (int16_t)(t1 - t3); /* f[1,1] = anti-diagonal */
}
@@ -29,7 +29,7 @@
static inline int clip_u8(int v) { return v < 0 ? 0 : v > 255 ? 255 : v; }
/* Mode 0 — Vertical: each col = top[col]. */
void daedalus_h264_pred_16x16_vertical_ref(uint8_t *dst, ptrdiff_t stride)
void daedalus_h264_pred_16x16_vertical(uint8_t *dst, ptrdiff_t stride)
{
const uint8_t *top = dst - stride;
for (int r = 0; r < 16; r++)
@@ -37,7 +37,7 @@ void daedalus_h264_pred_16x16_vertical_ref(uint8_t *dst, ptrdiff_t stride)
}
/* Mode 1 — Horizontal: each row = left[row]. */
void daedalus_h264_pred_16x16_horizontal_ref(uint8_t *dst, ptrdiff_t stride)
void daedalus_h264_pred_16x16_horizontal(uint8_t *dst, ptrdiff_t stride)
{
for (int r = 0; r < 16; r++) {
uint8_t l = dst[r * stride - 1];
@@ -46,7 +46,7 @@ void daedalus_h264_pred_16x16_horizontal_ref(uint8_t *dst, ptrdiff_t stride)
}
/* Mode 2 — DC: ((sum_top16 + sum_left16 + 16) >> 5) broadcast. */
void daedalus_h264_pred_16x16_dc_ref(uint8_t *dst, ptrdiff_t stride)
void daedalus_h264_pred_16x16_dc(uint8_t *dst, ptrdiff_t stride)
{
const uint8_t *top = dst - stride;
int sum = 16; /* rounding for >> 5 over 32 samples */
@@ -77,7 +77,7 @@ void daedalus_h264_pred_16x16_dc_ref(uint8_t *dst, ptrdiff_t stride)
* (it does NOT participate in the H/V sums in the 16x16 case only
* for the chroma 8x8 plane mode).
*/
void daedalus_h264_pred_16x16_plane_ref(uint8_t *dst, ptrdiff_t stride)
void daedalus_h264_pred_16x16_plane(uint8_t *dst, ptrdiff_t stride)
{
const uint8_t *top = dst - stride;
/* H accumulates differences across the right vs left half of the
@@ -52,7 +52,7 @@ static inline uint8_t avg2(int a, int b)
}
/* Mode 0 — Vertical: each col = top[col]. */
void daedalus_h264_pred_4x4_vertical_ref(uint8_t *dst, ptrdiff_t stride)
void daedalus_h264_pred_4x4_vertical(uint8_t *dst, ptrdiff_t stride)
{
const uint8_t *top = dst - stride;
for (int r = 0; r < 4; r++) {
@@ -61,7 +61,7 @@ void daedalus_h264_pred_4x4_vertical_ref(uint8_t *dst, ptrdiff_t stride)
}
/* Mode 1 — Horizontal: each row = left[row]. */
void daedalus_h264_pred_4x4_horizontal_ref(uint8_t *dst, ptrdiff_t stride)
void daedalus_h264_pred_4x4_horizontal(uint8_t *dst, ptrdiff_t stride)
{
for (int r = 0; r < 4; r++) {
uint8_t l = dst[r * stride - 1];
@@ -70,7 +70,7 @@ void daedalus_h264_pred_4x4_horizontal_ref(uint8_t *dst, ptrdiff_t stride)
}
/* Mode 2 — DC: mean of top 4 + left 4, broadcast. */
void daedalus_h264_pred_4x4_dc_ref(uint8_t *dst, ptrdiff_t stride)
void daedalus_h264_pred_4x4_dc(uint8_t *dst, ptrdiff_t stride)
{
const uint8_t *top = dst - stride;
int sum = 4; /* rounding for ((sum + 4) >> 3) */
@@ -82,7 +82,7 @@ void daedalus_h264_pred_4x4_dc_ref(uint8_t *dst, ptrdiff_t stride)
}
/* Mode 3 — Diagonal_Down_Left. Uses top[0..7] (incl. top-right). */
void daedalus_h264_pred_4x4_ddl_ref(uint8_t *dst, ptrdiff_t stride)
void daedalus_h264_pred_4x4_ddl(uint8_t *dst, ptrdiff_t stride)
{
const uint8_t *top = dst - stride;
int t0 = top[0], t1 = top[1], t2 = top[2], t3 = top[3];
@@ -102,7 +102,7 @@ void daedalus_h264_pred_4x4_ddl_ref(uint8_t *dst, ptrdiff_t stride)
}
/* Mode 4 — Diagonal_Down_Right. Uses top-left + top[0..3] + left[0..3]. */
void daedalus_h264_pred_4x4_ddr_ref(uint8_t *dst, ptrdiff_t stride)
void daedalus_h264_pred_4x4_ddr(uint8_t *dst, ptrdiff_t stride)
{
int tl = dst[-stride - 1];
int t0 = dst[-stride + 0], t1 = dst[-stride + 1];
@@ -123,7 +123,7 @@ void daedalus_h264_pred_4x4_ddr_ref(uint8_t *dst, ptrdiff_t stride)
}
/* Mode 5 — Vertical_Right. */
void daedalus_h264_pred_4x4_vr_ref(uint8_t *dst, ptrdiff_t stride)
void daedalus_h264_pred_4x4_vr(uint8_t *dst, ptrdiff_t stride)
{
int tl = dst[-stride - 1];
int t0 = dst[-stride + 0], t1 = dst[-stride + 1];
@@ -153,7 +153,7 @@ void daedalus_h264_pred_4x4_vr_ref(uint8_t *dst, ptrdiff_t stride)
}
/* Mode 6 — Horizontal_Down. */
void daedalus_h264_pred_4x4_hd_ref(uint8_t *dst, ptrdiff_t stride)
void daedalus_h264_pred_4x4_hd(uint8_t *dst, ptrdiff_t stride)
{
int tl = dst[-stride - 1];
int t0 = dst[-stride + 0], t1 = dst[-stride + 1], t2 = dst[-stride + 2];
@@ -182,7 +182,7 @@ void daedalus_h264_pred_4x4_hd_ref(uint8_t *dst, ptrdiff_t stride)
}
/* Mode 7 — Vertical_Left. Uses top[0..7]. */
void daedalus_h264_pred_4x4_vl_ref(uint8_t *dst, ptrdiff_t stride)
void daedalus_h264_pred_4x4_vl(uint8_t *dst, ptrdiff_t stride)
{
const uint8_t *top = dst - stride;
int t0=top[0], t1=top[1], t2=top[2], t3=top[3];
@@ -211,7 +211,7 @@ void daedalus_h264_pred_4x4_vl_ref(uint8_t *dst, ptrdiff_t stride)
}
/* Mode 8 — Horizontal_Up. Uses left[0..3] only. */
void daedalus_h264_pred_4x4_hu_ref(uint8_t *dst, ptrdiff_t stride)
void daedalus_h264_pred_4x4_hu(uint8_t *dst, ptrdiff_t stride)
{
int l0 = dst[ 0*stride - 1], l1 = dst[ 1*stride - 1];
int l2 = dst[ 2*stride - 1], l3 = dst[ 3*stride - 1];
+305
@@ -0,0 +1,305 @@
/*
* Standalone bit-exact C reference for H.264 luma Intra_8x8
* prediction modes (per H.264 spec §8.3.2.1). High-profile-only
* MB type — Baseline/Main/Extended profiles don't see Intra_8x8.
*
* Distinct from Intra_4x4 in two ways:
*
* 1. REFERENCE SAMPLE FILTERING (§8.3.2.1.1). The 25 raw
* neighbour samples are pre-filtered with a 1-2-1 smoothing
* filter BEFORE prediction. The filtering has spec-defined
* boundary handling at the corners and the right-edge of the
* top-row extension.
*
* 2. SCALE. All 9 prediction modes operate at 8x8 with the
* filtered samples (Intra_4x4 operates at 4x4 with the raw
* samples).
*
* This PR implements the filter + the 3 simple modes (Vertical,
* Horizontal, DC). The 6 directional modes (DDL, DDR, VR, HD, VL,
* HU at 8x8) follow in a separate PR — same template, different
* formulas per spec sections §8.3.2.1.5..§8.3.2.1.10.
*
* Calling convention (FFmpeg-style):
* daedalus_h264_pred_8x8l_<mode>(uint8_t *dst, ptrdiff_t stride)
*
* `dst` points at row 0 col 0 of the 8x8 output block. Reads from
* top[0..15] = dst[-stride + 0..15]
* top-left = dst[-stride - 1]
* left[0..7] = dst[ 0*stride - 1 .. 7*stride - 1]
*
* AVAILABILITY: assumes all neighbours valid (interior-MB case).
*
* License: BSD-2-Clause.
*/
#include <stdint.h>
#include <stddef.h>
#include <string.h>
static inline int clip_u8(int v) { return v < 0 ? 0 : v > 255 ? 255 : v; }
/* H.264 §8.3.2.1.1 reference sample filtering. Filters the 25 raw
* samples around the 8x8 block into a `filt` array with the same
* indices. When called against an "all neighbours available" tile,
* the filtered output uses these spec-defined formulas:
*
* filt[top -1] (= filtered top-left) = (top[0] + 2*tl + left[0] + 2) >> 2
*
* filt[top 0] = (tl + 2*top[0] + top[1] + 2) >> 2
* filt[top i] for 1<=i<=14 = (top[i-1] + 2*top[i] + top[i+1] + 2) >> 2
* filt[top 15] = (top[14] + 3*top[15] + 2) >> 2 (boundary)
*
* filt[left 0] = (tl + 2*left[0] + left[1] + 2) >> 2
* filt[left j] for 1<=j<=6 = (left[j-1] + 2*left[j] + left[j+1] + 2) >> 2
* filt[left 7] = (left[6] + 3*left[7] + 2) >> 2 (boundary)
*
* Reads neighbours from the dst buffer; writes filtered values to
* a caller-provided 25-element array indexed as:
* filt[0] = filtered top-left
* filt[1..16] = filtered top[0..15]
* filt[17..24] = filtered left[0..7]
*/
static void filter_refs(const uint8_t *dst, ptrdiff_t stride,
uint8_t filt[25])
{
int tl = dst[-stride - 1];
int t[16];
for (int i = 0; i < 16; i++) t[i] = dst[-stride + i];
int l[8];
for (int j = 0; j < 8; j++) l[j] = dst[j * stride - 1];
/* Filtered top-left. */
filt[0] = (uint8_t)((t[0] + 2*tl + l[0] + 2) >> 2);
/* Filtered top. */
filt[1] = (uint8_t)((tl + 2*t[0] + t[1] + 2) >> 2);
for (int i = 1; i <= 14; i++)
filt[1 + i] = (uint8_t)((t[i-1] + 2*t[i] + t[i+1] + 2) >> 2);
filt[1 + 15] = (uint8_t)((t[14] + 3*t[15] + 2) >> 2);
/* Filtered left. */
filt[17 + 0] = (uint8_t)((tl + 2*l[0] + l[1] + 2) >> 2);
for (int j = 1; j <= 6; j++)
filt[17 + j] = (uint8_t)((l[j-1] + 2*l[j] + l[j+1] + 2) >> 2);
filt[17 + 7] = (uint8_t)((l[6] + 3*l[7] + 2) >> 2);
}
/* Convenience macros for accessing the filt[] array by spec-style index. */
#define FT(i) filt[1 + (i)] /* filtered top[i], i in 0..15 */
#define FL(j) filt[17 + (j)] /* filtered left[j], j in 0..7 */
#define FTL filt[0] /* filtered top-left */
/* Mode 0 Vertical (§8.3.2.1.2): pred[r,c] = filt_top[c]. */
void daedalus_h264_pred_8x8l_vertical(uint8_t *dst, ptrdiff_t stride)
{
uint8_t filt[25];
filter_refs(dst, stride, filt);
for (int r = 0; r < 8; r++)
for (int c = 0; c < 8; c++) dst[r * stride + c] = FT(c);
}
/* Mode 1 Horizontal (§8.3.2.1.3): pred[r,c] = filt_left[r]. */
void daedalus_h264_pred_8x8l_horizontal(uint8_t *dst, ptrdiff_t stride)
{
uint8_t filt[25];
filter_refs(dst, stride, filt);
for (int r = 0; r < 8; r++)
for (int c = 0; c < 8; c++) dst[r * stride + c] = FL(r);
}
/* Mode 2 DC (§8.3.2.1.4): ((sum_filt_top[0..7] + sum_filt_left[0..7]
* + 8) >> 4) broadcast. Note the +8 (not +4 like 4x4): there are
* 16 samples summed total, so >> 4 with half-step rounding +8. */
void daedalus_h264_pred_8x8l_dc(uint8_t *dst, ptrdiff_t stride)
{
uint8_t filt[25];
filter_refs(dst, stride, filt);
int sum = 8;
for (int i = 0; i < 8; i++) sum += FT(i);
for (int j = 0; j < 8; j++) sum += FL(j);
uint8_t v = (uint8_t)(sum >> 4);
for (int r = 0; r < 8; r++)
for (int c = 0; c < 8; c++) dst[r * stride + c] = v;
}
/* --- 6 directional modes for Intra_8x8 (H.264 §8.3.2.1.5..§8.3.2.1.10).
* Transcribed from FFmpeg libavcodec/h264pred_template.c
* pred8x8l_{down_left, down_right, vertical_right, horizontal_down,
* vertical_left, horizontal_up} (LGPL-2.1+ in the original; algorithm
* reproduced here for test purposes).
*
* All 6 use the same FILTERED reference samples produced by
* filter_refs() above. Mapping from FFmpeg's t0..t15 / l0..l7 / lt
* notation:
* tN = FT(N) for N in 0..15
* lN = FL(N) for N in 0..7
* lt = FTL
*
* SRC(x,y) maps to dst[y*stride + x] (col x, row y).
*/
#define SRC(x, y) dst[(y) * stride + (x)]
#define T(i) FT(i)
#define L(j) FL(j)
#define LT FTL
/* Mode 3 DDL (Diagonal_Down_Left) — uses TOP + TOP_RIGHT, no LEFT. */
void daedalus_h264_pred_8x8l_ddl(uint8_t *dst, ptrdiff_t stride)
{
uint8_t filt[25];
filter_refs(dst, stride, filt);
SRC(0,0)= (T(0) + 2*T(1) + T(2) + 2) >> 2;
SRC(0,1)=SRC(1,0)= (T(1) + 2*T(2) + T(3) + 2) >> 2;
SRC(0,2)=SRC(1,1)=SRC(2,0)= (T(2) + 2*T(3) + T(4) + 2) >> 2;
SRC(0,3)=SRC(1,2)=SRC(2,1)=SRC(3,0)= (T(3) + 2*T(4) + T(5) + 2) >> 2;
SRC(0,4)=SRC(1,3)=SRC(2,2)=SRC(3,1)=SRC(4,0)= (T(4) + 2*T(5) + T(6) + 2) >> 2;
SRC(0,5)=SRC(1,4)=SRC(2,3)=SRC(3,2)=SRC(4,1)=SRC(5,0)= (T(5) + 2*T(6) + T(7) + 2) >> 2;
SRC(0,6)=SRC(1,5)=SRC(2,4)=SRC(3,3)=SRC(4,2)=SRC(5,1)=SRC(6,0)= (T(6) + 2*T(7) + T(8) + 2) >> 2;
SRC(0,7)=SRC(1,6)=SRC(2,5)=SRC(3,4)=SRC(4,3)=SRC(5,2)=SRC(6,1)=SRC(7,0)= (T(7) + 2*T(8) + T(9) + 2) >> 2;
SRC(1,7)=SRC(2,6)=SRC(3,5)=SRC(4,4)=SRC(5,3)=SRC(6,2)=SRC(7,1)= (T(8) + 2*T(9) + T(10) + 2) >> 2;
SRC(2,7)=SRC(3,6)=SRC(4,5)=SRC(5,4)=SRC(6,3)=SRC(7,2)= (T(9) + 2*T(10) + T(11) + 2) >> 2;
SRC(3,7)=SRC(4,6)=SRC(5,5)=SRC(6,4)=SRC(7,3)= (T(10) + 2*T(11) + T(12) + 2) >> 2;
SRC(4,7)=SRC(5,6)=SRC(6,5)=SRC(7,4)= (T(11) + 2*T(12) + T(13) + 2) >> 2;
SRC(5,7)=SRC(6,6)=SRC(7,5)= (T(12) + 2*T(13) + T(14) + 2) >> 2;
SRC(6,7)=SRC(7,6)= (T(13) + 2*T(14) + T(15) + 2) >> 2;
SRC(7,7)= (T(14) + 3*T(15) + 2) >> 2;
}
/* Mode 4 DDR (Diagonal_Down_Right). */
void daedalus_h264_pred_8x8l_ddr(uint8_t *dst, ptrdiff_t stride)
{
uint8_t filt[25];
filter_refs(dst, stride, filt);
SRC(0,7)= (L(7) + 2*L(6) + L(5) + 2) >> 2;
SRC(0,6)=SRC(1,7)= (L(6) + 2*L(5) + L(4) + 2) >> 2;
SRC(0,5)=SRC(1,6)=SRC(2,7)= (L(5) + 2*L(4) + L(3) + 2) >> 2;
SRC(0,4)=SRC(1,5)=SRC(2,6)=SRC(3,7)= (L(4) + 2*L(3) + L(2) + 2) >> 2;
SRC(0,3)=SRC(1,4)=SRC(2,5)=SRC(3,6)=SRC(4,7)= (L(3) + 2*L(2) + L(1) + 2) >> 2;
SRC(0,2)=SRC(1,3)=SRC(2,4)=SRC(3,5)=SRC(4,6)=SRC(5,7)= (L(2) + 2*L(1) + L(0) + 2) >> 2;
SRC(0,1)=SRC(1,2)=SRC(2,3)=SRC(3,4)=SRC(4,5)=SRC(5,6)=SRC(6,7)= (L(1) + 2*L(0) + LT + 2) >> 2;
SRC(0,0)=SRC(1,1)=SRC(2,2)=SRC(3,3)=SRC(4,4)=SRC(5,5)=SRC(6,6)=SRC(7,7)= (L(0) + 2*LT + T(0) + 2) >> 2;
SRC(1,0)=SRC(2,1)=SRC(3,2)=SRC(4,3)=SRC(5,4)=SRC(6,5)=SRC(7,6)= (LT + 2*T(0) + T(1) + 2) >> 2;
SRC(2,0)=SRC(3,1)=SRC(4,2)=SRC(5,3)=SRC(6,4)=SRC(7,5)= (T(0) + 2*T(1) + T(2) + 2) >> 2;
SRC(3,0)=SRC(4,1)=SRC(5,2)=SRC(6,3)=SRC(7,4)= (T(1) + 2*T(2) + T(3) + 2) >> 2;
SRC(4,0)=SRC(5,1)=SRC(6,2)=SRC(7,3)= (T(2) + 2*T(3) + T(4) + 2) >> 2;
SRC(5,0)=SRC(6,1)=SRC(7,2)= (T(3) + 2*T(4) + T(5) + 2) >> 2;
SRC(6,0)=SRC(7,1)= (T(4) + 2*T(5) + T(6) + 2) >> 2;
SRC(7,0)= (T(5) + 2*T(6) + T(7) + 2) >> 2;
}
/* Mode 5 VR (Vertical_Right). */
void daedalus_h264_pred_8x8l_vr(uint8_t *dst, ptrdiff_t stride)
{
uint8_t filt[25];
filter_refs(dst, stride, filt);
SRC(0,6)= (L(5) + 2*L(4) + L(3) + 2) >> 2;
SRC(0,7)= (L(6) + 2*L(5) + L(4) + 2) >> 2;
SRC(0,4)=SRC(1,6)= (L(3) + 2*L(2) + L(1) + 2) >> 2;
SRC(0,5)=SRC(1,7)= (L(4) + 2*L(3) + L(2) + 2) >> 2;
SRC(0,2)=SRC(1,4)=SRC(2,6)= (L(1) + 2*L(0) + LT + 2) >> 2;
SRC(0,3)=SRC(1,5)=SRC(2,7)= (L(2) + 2*L(1) + L(0) + 2) >> 2;
SRC(0,1)=SRC(1,3)=SRC(2,5)=SRC(3,7)= (L(0) + 2*LT + T(0) + 2) >> 2;
SRC(0,0)=SRC(1,2)=SRC(2,4)=SRC(3,6)= (LT + T(0) + 1) >> 1;
SRC(1,1)=SRC(2,3)=SRC(3,5)=SRC(4,7)= (LT + 2*T(0) + T(1) + 2) >> 2;
SRC(1,0)=SRC(2,2)=SRC(3,4)=SRC(4,6)= (T(0) + T(1) + 1) >> 1;
SRC(2,1)=SRC(3,3)=SRC(4,5)=SRC(5,7)= (T(0) + 2*T(1) + T(2) + 2) >> 2;
SRC(2,0)=SRC(3,2)=SRC(4,4)=SRC(5,6)= (T(1) + T(2) + 1) >> 1;
SRC(3,1)=SRC(4,3)=SRC(5,5)=SRC(6,7)= (T(1) + 2*T(2) + T(3) + 2) >> 2;
SRC(3,0)=SRC(4,2)=SRC(5,4)=SRC(6,6)= (T(2) + T(3) + 1) >> 1;
SRC(4,1)=SRC(5,3)=SRC(6,5)=SRC(7,7)= (T(2) + 2*T(3) + T(4) + 2) >> 2;
SRC(4,0)=SRC(5,2)=SRC(6,4)=SRC(7,6)= (T(3) + T(4) + 1) >> 1;
SRC(5,1)=SRC(6,3)=SRC(7,5)= (T(3) + 2*T(4) + T(5) + 2) >> 2;
SRC(5,0)=SRC(6,2)=SRC(7,4)= (T(4) + T(5) + 1) >> 1;
SRC(6,1)=SRC(7,3)= (T(4) + 2*T(5) + T(6) + 2) >> 2;
SRC(6,0)=SRC(7,2)= (T(5) + T(6) + 1) >> 1;
SRC(7,1)= (T(5) + 2*T(6) + T(7) + 2) >> 2;
SRC(7,0)= (T(6) + T(7) + 1) >> 1;
}
/* Mode 6 HD (Horizontal_Down). */
void daedalus_h264_pred_8x8l_hd(uint8_t *dst, ptrdiff_t stride)
{
uint8_t filt[25];
filter_refs(dst, stride, filt);
SRC(0,7)= (L(6) + L(7) + 1) >> 1;
SRC(1,7)= (L(5) + 2*L(6) + L(7) + 2) >> 2;
SRC(0,6)=SRC(2,7)= (L(5) + L(6) + 1) >> 1;
SRC(1,6)=SRC(3,7)= (L(4) + 2*L(5) + L(6) + 2) >> 2;
SRC(0,5)=SRC(2,6)=SRC(4,7)= (L(4) + L(5) + 1) >> 1;
SRC(1,5)=SRC(3,6)=SRC(5,7)= (L(3) + 2*L(4) + L(5) + 2) >> 2;
SRC(0,4)=SRC(2,5)=SRC(4,6)=SRC(6,7)= (L(3) + L(4) + 1) >> 1;
SRC(1,4)=SRC(3,5)=SRC(5,6)=SRC(7,7)= (L(2) + 2*L(3) + L(4) + 2) >> 2;
SRC(0,3)=SRC(2,4)=SRC(4,5)=SRC(6,6)= (L(2) + L(3) + 1) >> 1;
SRC(1,3)=SRC(3,4)=SRC(5,5)=SRC(7,6)= (L(1) + 2*L(2) + L(3) + 2) >> 2;
SRC(0,2)=SRC(2,3)=SRC(4,4)=SRC(6,5)= (L(1) + L(2) + 1) >> 1;
SRC(1,2)=SRC(3,3)=SRC(5,4)=SRC(7,5)= (L(0) + 2*L(1) + L(2) + 2) >> 2;
SRC(0,1)=SRC(2,2)=SRC(4,3)=SRC(6,4)= (L(0) + L(1) + 1) >> 1;
SRC(1,1)=SRC(3,2)=SRC(5,3)=SRC(7,4)= (LT + 2*L(0) + L(1) + 2) >> 2;
SRC(0,0)=SRC(2,1)=SRC(4,2)=SRC(6,3)= (LT + L(0) + 1) >> 1;
SRC(1,0)=SRC(3,1)=SRC(5,2)=SRC(7,3)= (L(0) + 2*LT + T(0) + 2) >> 2;
SRC(2,0)=SRC(4,1)=SRC(6,2)= (T(1) + 2*T(0) + LT + 2) >> 2;
SRC(3,0)=SRC(5,1)=SRC(7,2)= (T(2) + 2*T(1) + T(0) + 2) >> 2;
SRC(4,0)=SRC(6,1)= (T(3) + 2*T(2) + T(1) + 2) >> 2;
SRC(5,0)=SRC(7,1)= (T(4) + 2*T(3) + T(2) + 2) >> 2;
SRC(6,0)= (T(5) + 2*T(4) + T(3) + 2) >> 2;
SRC(7,0)= (T(6) + 2*T(5) + T(4) + 2) >> 2;
}
/* Mode 7 VL (Vertical_Left) — uses TOP + TOP_RIGHT only. */
void daedalus_h264_pred_8x8l_vl(uint8_t *dst, ptrdiff_t stride)
{
uint8_t filt[25];
filter_refs(dst, stride, filt);
SRC(0,0)= (T(0) + T(1) + 1) >> 1;
SRC(0,1)= (T(0) + 2*T(1) + T(2) + 2) >> 2;
SRC(0,2)=SRC(1,0)= (T(1) + T(2) + 1) >> 1;
SRC(0,3)=SRC(1,1)= (T(1) + 2*T(2) + T(3) + 2) >> 2;
SRC(0,4)=SRC(1,2)=SRC(2,0)= (T(2) + T(3) + 1) >> 1;
SRC(0,5)=SRC(1,3)=SRC(2,1)= (T(2) + 2*T(3) + T(4) + 2) >> 2;
SRC(0,6)=SRC(1,4)=SRC(2,2)=SRC(3,0)= (T(3) + T(4) + 1) >> 1;
SRC(0,7)=SRC(1,5)=SRC(2,3)=SRC(3,1)= (T(3) + 2*T(4) + T(5) + 2) >> 2;
SRC(1,6)=SRC(2,4)=SRC(3,2)=SRC(4,0)= (T(4) + T(5) + 1) >> 1;
SRC(1,7)=SRC(2,5)=SRC(3,3)=SRC(4,1)= (T(4) + 2*T(5) + T(6) + 2) >> 2;
SRC(2,6)=SRC(3,4)=SRC(4,2)=SRC(5,0)= (T(5) + T(6) + 1) >> 1;
SRC(2,7)=SRC(3,5)=SRC(4,3)=SRC(5,1)= (T(5) + 2*T(6) + T(7) + 2) >> 2;
SRC(3,6)=SRC(4,4)=SRC(5,2)=SRC(6,0)= (T(6) + T(7) + 1) >> 1;
SRC(3,7)=SRC(4,5)=SRC(5,3)=SRC(6,1)= (T(6) + 2*T(7) + T(8) + 2) >> 2;
SRC(4,6)=SRC(5,4)=SRC(6,2)=SRC(7,0)= (T(7) + T(8) + 1) >> 1;
SRC(4,7)=SRC(5,5)=SRC(6,3)=SRC(7,1)= (T(7) + 2*T(8) + T(9) + 2) >> 2;
SRC(5,6)=SRC(6,4)=SRC(7,2)= (T(8) + T(9) + 1) >> 1;
SRC(5,7)=SRC(6,5)=SRC(7,3)= (T(8) + 2*T(9) + T(10) + 2) >> 2;
SRC(6,6)=SRC(7,4)= (T(9) + T(10) + 1) >> 1;
SRC(6,7)=SRC(7,5)= (T(9) + 2*T(10) + T(11) + 2) >> 2;
SRC(7,6)= (T(10) + T(11) + 1) >> 1;
SRC(7,7)= (T(10) + 2*T(11) + T(12) + 2) >> 2;
}
/* Mode 8 HU (Horizontal_Up) — uses LEFT only. */
void daedalus_h264_pred_8x8l_hu(uint8_t *dst, ptrdiff_t stride)
{
uint8_t filt[25];
filter_refs(dst, stride, filt);
SRC(0,0)= (L(0) + L(1) + 1) >> 1;
SRC(1,0)= (L(0) + 2*L(1) + L(2) + 2) >> 2;
SRC(0,1)=SRC(2,0)= (L(1) + L(2) + 1) >> 1;
SRC(1,1)=SRC(3,0)= (L(1) + 2*L(2) + L(3) + 2) >> 2;
SRC(0,2)=SRC(2,1)=SRC(4,0)= (L(2) + L(3) + 1) >> 1;
SRC(1,2)=SRC(3,1)=SRC(5,0)= (L(2) + 2*L(3) + L(4) + 2) >> 2;
SRC(0,3)=SRC(2,2)=SRC(4,1)=SRC(6,0)= (L(3) + L(4) + 1) >> 1;
SRC(1,3)=SRC(3,2)=SRC(5,1)=SRC(7,0)= (L(3) + 2*L(4) + L(5) + 2) >> 2;
SRC(0,4)=SRC(2,3)=SRC(4,2)=SRC(6,1)= (L(4) + L(5) + 1) >> 1;
SRC(1,4)=SRC(3,3)=SRC(5,2)=SRC(7,1)= (L(4) + 2*L(5) + L(6) + 2) >> 2;
SRC(0,5)=SRC(2,4)=SRC(4,3)=SRC(6,2)= (L(5) + L(6) + 1) >> 1;
SRC(1,5)=SRC(3,4)=SRC(5,3)=SRC(7,2)= (L(5) + 2*L(6) + L(7) + 2) >> 2;
SRC(0,6)=SRC(2,5)=SRC(4,4)=SRC(6,3)= (L(6) + L(7) + 1) >> 1;
SRC(1,6)=SRC(3,5)=SRC(5,4)=SRC(7,3)= (L(6) + 3*L(7) + 2) >> 2;
/* 20 positions all = L(7) per FFmpeg lines 1097-1100. */
SRC(0,7)=SRC(1,7)=SRC(2,6)=SRC(2,7)=SRC(3,6)=
SRC(3,7)=SRC(4,5)=SRC(4,6)=SRC(4,7)=SRC(5,5)=
SRC(5,6)=SRC(5,7)=SRC(6,4)=SRC(6,5)=SRC(6,6)=
SRC(6,7)=SRC(7,4)=SRC(7,5)=SRC(7,6)=SRC(7,7)= L(7);
}
#undef SRC
#undef T
#undef L
#undef LT
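Review note: the unrolled Diagonal_Down_Left table above follows one closed form. Every sample on the anti-diagonal x + y = k shares a single 3-tap value of the filtered top row, and only (7,7) uses the boundary tap. The sketch below is a hypothetical cross-check, not part of the PR: `ddl_closed_form`, `ft`, and the dense `pred[y][x]` layout are names assumed for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: closed-form Intra_8x8 Diagonal_Down_Left.
 * ft[0..15] are the already-filtered top and top-right samples;
 * pred is a dense 8x8 block indexed pred[y][x]. */
static void ddl_closed_form(const uint8_t ft[16], uint8_t pred[8][8])
{
    for (int y = 0; y < 8; y++) {
        for (int x = 0; x < 8; x++) {
            int k = x + y;              /* anti-diagonal index, 0..14 */
            int v;
            if (k == 14)                /* only (7,7): boundary tap */
                v = (ft[14] + 3 * ft[15] + 2) >> 2;
            else
                v = (ft[k] + 2 * ft[k + 1] + ft[k + 2] + 2) >> 2;
            assert(v >= 0 && v <= 255); /* weighted mean of u8 stays in u8 */
            pred[y][x] = (uint8_t)v;
        }
    }
}
```

Comparing this against `daedalus_h264_pred_8x8l_ddl` on random neighbours would be one way to catch a transcription slip in the unrolled assignments.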
@@ -43,7 +43,7 @@ static inline int clip_u8(int v) { return v < 0 ? 0 : v > 255 ? 255 : v; }
* quadrant ignores the top-left-half because that half is "vertically
* above" the top-left quadrant; the spec uses top[4..7] only.
*/
-void daedalus_h264_pred_chroma8x8_dc_ref(uint8_t *dst, ptrdiff_t stride)
+void daedalus_h264_pred_chroma8x8_dc(uint8_t *dst, ptrdiff_t stride)
{
const uint8_t *top = dst - stride;
int top_lo = 0, top_hi = 0, left_lo = 0, left_hi = 0;
@@ -68,7 +68,7 @@ void daedalus_h264_pred_chroma8x8_dc_ref(uint8_t *dst, ptrdiff_t stride)
}
/* Mode 1 — Horizontal: each row = left[row]. */
-void daedalus_h264_pred_chroma8x8_horizontal_ref(uint8_t *dst, ptrdiff_t stride)
+void daedalus_h264_pred_chroma8x8_horizontal(uint8_t *dst, ptrdiff_t stride)
{
for (int r = 0; r < 8; r++) {
uint8_t l = dst[r * stride - 1];
@@ -77,7 +77,7 @@ void daedalus_h264_pred_chroma8x8_horizontal_ref(uint8_t *dst, ptrdiff_t stride)
}
/* Mode 2 — Vertical: each col = top[col]. */
-void daedalus_h264_pred_chroma8x8_vertical_ref(uint8_t *dst, ptrdiff_t stride)
+void daedalus_h264_pred_chroma8x8_vertical(uint8_t *dst, ptrdiff_t stride)
{
const uint8_t *top = dst - stride;
for (int r = 0; r < 8; r++)
@@ -97,7 +97,7 @@ void daedalus_h264_pred_chroma8x8_vertical_ref(uint8_t *dst, ptrdiff_t stride)
* - Centre is (x-3, y-3) (not x-7, y-7).
* - Spans 4 differences per sum (not 8).
*/
-void daedalus_h264_pred_chroma8x8_plane_ref(uint8_t *dst, ptrdiff_t stride)
+void daedalus_h264_pred_chroma8x8_plane(uint8_t *dst, ptrdiff_t stride)
{
const uint8_t *top = dst - stride;
int H = 0, V = 0;
@@ -0,0 +1,52 @@
// daedalus-fourier — H.264 luma qpel avg_mc01 (biprediction) (8x8, ¼-pel vertical),
// V3D 7.1. Per H.264 §8.4.2.2.1 "d" position:
//
// dst[r,c] = ((clip255(mc02(s)[r,c]) + s[r,c] + 1) >> 1)
//
// Sibling of v3d_h264_qpel_mc02.comp with an L2 (rounded two-sample
// average) step against src[r, c].
//
//
// avg_ variant for B-slice biprediction per H.264 §8.4.2.3.1:
// dst[r,c] = avg(dst[r,c], mc01_value)
// Caller pre-loads dst with the list0 prediction; this shader
// folds in the list1 contribution.
//
// License: BSD-2-Clause.
#version 450
#extension GL_EXT_shader_8bit_storage : require
#extension GL_EXT_shader_explicit_arithmetic_types : require
layout(local_size_x = 64, local_size_y = 1, local_size_z = 1) in;
layout(binding = 0) readonly buffer Src { uint8_t src[]; } u_src;
layout(binding = 1) buffer Dst { uint8_t dst[]; } u_dst;
layout(binding = 2) readonly buffer Meta { uvec4 meta[]; } u_meta;
layout(push_constant) uniform PC { uint n_blocks, stride_u8, _p0, _p1; } pc;
void main()
{
uint block_idx = gl_WorkGroupID.x;
if (block_idx >= pc.n_blocks) return;
uint lane = gl_LocalInvocationID.x;
uint r = lane >> 3, c = lane & 7u;
uint dst_off = u_meta.meta[block_idx].x;
uint src_off = u_meta.meta[block_idx].y;
uint stride = pc.stride_u8;
uint col_base = src_off + c;
int s_m2 = int(u_src.src[col_base + (r - 2u) * stride]);
int s_m1 = int(u_src.src[col_base + (r - 1u) * stride]);
int s_0 = int(u_src.src[col_base + r * stride]);
int s_p1 = int(u_src.src[col_base + (r + 1u) * stride]);
int s_p2 = int(u_src.src[col_base + (r + 2u) * stride]);
int s_p3 = int(u_src.src[col_base + (r + 3u) * stride]);
int v = s_m2 - 5 * s_m1 + 20 * s_0 + 20 * s_p1 - 5 * s_p2 + s_p3 + 16;
int vp = clamp(v >> 5, 0, 255);
int avg = (vp + s_0 + 1) >> 1; // L2 with src[r, c]
uint final_off = dst_off + r * stride + c;
int prev = int(u_dst.dst[final_off]);
u_dst.dst[final_off] = uint8_t((prev + avg + 1) >> 1);
}
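Review note: a scalar model of one lane can make the three stages of this shader easier to check by hand: vertical 6-tap half-pel, rounded average with the integer sample, then the biprediction fold-in. This is a minimal sketch under assumptions, not the project's CPU reference: `avg_mc01_pixel` is a hypothetical name, and `src` is assumed to point at the integer sample with 2 valid rows above and 3 below.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static int clip_u8(int v) { return v < 0 ? 0 : v > 255 ? 255 : v; }

/* Hypothetical scalar model of one avg_mc01 lane.
 * prev is the list0 prediction already stored in dst. */
static uint8_t avg_mc01_pixel(const uint8_t *src, ptrdiff_t stride, uint8_t prev)
{
    assert(stride > 0);
    int v = src[-2 * stride] - 5 * src[-1 * stride] + 20 * src[0]
          + 20 * src[stride] - 5 * src[2 * stride] + src[3 * stride] + 16;
    /* Assumes arithmetic right shift for negative v, as in the GLSL. */
    int half = clip_u8(v >> 5);              /* mc02 half-pel sample   */
    int quarter = (half + src[0] + 1) >> 1;  /* "d" position, mc01     */
    return (uint8_t)((prev + quarter + 1) >> 1); /* fold into list0    */
}
```

Swapping `src[0]` for `src[stride]` in the `quarter` line would model the mc03 ("n") variant in the same way.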
@@ -0,0 +1,77 @@
// daedalus-fourier — H.264 luma qpel avg_mc02 (biprediction) (8x8, vertical half-pel), V3D 7.1.
//
// Sibling of cycle 9's v3d_h264_qpel_mc20.comp. Same 6-tap filter,
// transposed to vertical direction:
//
// dst[r,c] = clip255(
// ( s[r-2,c]
// - 5 * s[r-1,c]
// + 20 * s[r, c]
// + 20 * s[r+1,c]
// - 5 * s[r+2,c]
// + s[r+3,c]
// + 16
// ) >> 5)
//
// src+src_off points at row 0 col 0 of the OUTPUT block; for each
// output row r the filter reads rows r-2..r+3 (2 rows of top context,
// 3 rows of bottom context around the block).
//
// Same WG layout as mc20: 64 lanes / 1 block-per-WG / 1 lane-per-pixel.
//
//
// avg_ variant for B-slice biprediction per H.264 §8.4.2.3.1:
// dst[r,c] = avg(dst[r,c], mc02_value)
// Caller pre-loads dst with the list0 prediction; this shader
// folds in the list1 contribution.
//
// License: BSD-2-Clause.
#version 450
#extension GL_EXT_shader_8bit_storage : require
#extension GL_EXT_shader_explicit_arithmetic_types : require
layout(local_size_x = 64, local_size_y = 1, local_size_z = 1) in;
layout(binding = 0) readonly buffer Src { uint8_t src[]; } u_src;
layout(binding = 1) buffer Dst { uint8_t dst[]; } u_dst;
layout(binding = 2) readonly buffer Meta { uvec4 meta[]; } u_meta;
layout(push_constant) uniform PC {
uint n_blocks;
uint stride_u8;
uint _pad0, _pad1;
} pc;
void main()
{
uint block_idx = gl_WorkGroupID.x;
if (block_idx >= pc.n_blocks) return;
uint lane = gl_LocalInvocationID.x;
uint r = lane >> 3;
uint c = lane & 7u;
uint dst_off = u_meta.meta[block_idx].x;
uint src_off = u_meta.meta[block_idx].y;
uint stride = pc.stride_u8;
// Read the 6 rows of vertical context at col (c) of THIS output row.
// src_off+r*stride+c is at the OUTPUT pixel position; the kernel
// samples r-2..r+3 along the column. Unsigned-safe because the
// public API contract guarantees src_off >= 2*stride.
uint col_base = src_off + c;
int s_m2 = int(u_src.src[col_base + (r - 2u) * stride]);
int s_m1 = int(u_src.src[col_base + (r - 1u) * stride]);
int s_0 = int(u_src.src[col_base + r * stride]);
int s_p1 = int(u_src.src[col_base + (r + 1u) * stride]);
int s_p2 = int(u_src.src[col_base + (r + 2u) * stride]);
int s_p3 = int(u_src.src[col_base + (r + 3u) * stride]);
int v = s_m2 - 5 * s_m1 + 20 * s_0 + 20 * s_p1 - 5 * s_p2 + s_p3 + 16;
int p = clamp(v >> 5, 0, 255);
uint final_off = dst_off + r * stride + c;
int prev = int(u_dst.dst[final_off]);
u_dst.dst[final_off] = uint8_t((prev + p + 1) >> 1);
}
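Review note: the "unsigned-safe" comment in this shader relies on modulo-2^32 arithmetic. For r < 2 the term `(r - 2u) * stride` wraps, and adding `col_base` wraps it back to the intended offset as long as `src_off >= 2*stride`. The sketch below models that addressing in C with `uint32_t` standing in for GLSL `uint`; `tap_offset` and its `dr` parameter are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the shader's row addressing.  dr is the tap's
 * row delta (-2..+3); the result is the byte offset into src[]. */
static uint32_t tap_offset(uint32_t src_off, uint32_t stride,
                           uint32_t r, uint32_t c, int32_t dr)
{
    /* The API contract the shader comment depends on. */
    assert(src_off >= 2u * stride);
    uint32_t col_base = src_off + c;
    /* (r + dr) may wrap below zero; the final sum wraps back. */
    return col_base + (r + (uint32_t)dr) * stride;
}
```

If the contract were violated, the sum would stay wrapped and index far past the buffer, so a host-side check of `src_off` before dispatch seems worthwhile.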
@@ -0,0 +1,52 @@
// daedalus-fourier — H.264 luma qpel avg_mc03 (biprediction) (8x8, ¾-pel vertical),
// V3D 7.1. Per H.264 §8.4.2.2.1 "n" position:
//
// dst[r,c] = ((clip255(mc02(s)[r,c]) + s[r+1, c] + 1) >> 1)
//
// Same as mc01 but L2-averages with src[r+1, c] instead of src[r, c].
//
//
// avg_ variant for B-slice biprediction per H.264 §8.4.2.3.1:
// dst[r,c] = avg(dst[r,c], mc03_value)
// Caller pre-loads dst with the list0 prediction; this shader
// folds in the list1 contribution.
//
// License: BSD-2-Clause.
#version 450
#extension GL_EXT_shader_8bit_storage : require
#extension GL_EXT_shader_explicit_arithmetic_types : require
layout(local_size_x = 64, local_size_y = 1, local_size_z = 1) in;
layout(binding = 0) readonly buffer Src { uint8_t src[]; } u_src;
layout(binding = 1) buffer Dst { uint8_t dst[]; } u_dst;
layout(binding = 2) readonly buffer Meta { uvec4 meta[]; } u_meta;
layout(push_constant) uniform PC { uint n_blocks, stride_u8, _p0, _p1; } pc;
void main()
{
uint block_idx = gl_WorkGroupID.x;
if (block_idx >= pc.n_blocks) return;
uint lane = gl_LocalInvocationID.x;
uint r = lane >> 3, c = lane & 7u;
uint dst_off = u_meta.meta[block_idx].x;
uint src_off = u_meta.meta[block_idx].y;
uint stride = pc.stride_u8;
uint col_base = src_off + c;
int s_m2 = int(u_src.src[col_base + (r - 2u) * stride]);
int s_m1 = int(u_src.src[col_base + (r - 1u) * stride]);
int s_0 = int(u_src.src[col_base + r * stride]);
int s_p1 = int(u_src.src[col_base + (r + 1u) * stride]);
int s_p2 = int(u_src.src[col_base + (r + 2u) * stride]);
int s_p3 = int(u_src.src[col_base + (r + 3u) * stride]);
int v = s_m2 - 5 * s_m1 + 20 * s_0 + 20 * s_p1 - 5 * s_p2 + s_p3 + 16;
int vp = clamp(v >> 5, 0, 255);
int avg = (vp + s_p1 + 1) >> 1; // L2 with src[r+1, c]
uint final_off = dst_off + r * stride + c;
int prev = int(u_dst.dst[final_off]);
u_dst.dst[final_off] = uint8_t((prev + avg + 1) >> 1);
}
@@ -0,0 +1,55 @@
// daedalus-fourier — H.264 luma qpel avg_mc10 (biprediction) (8x8, ¼-pel horizontal),
// V3D 7.1. Per H.264 §8.4.2.2.1 "a" position:
//
// dst[r,c] = ((clip255(mc20(s)[r,c]) + s[r,c] + 1) >> 1)
//
// = horizontal half-pel filter, clipped to u8, then L2 rounded-averaged
// with the integer source pixel at the SAME position. Sibling of
// v3d_h264_qpel_mc20.comp with the L2 step added at the tail.
//
//
// avg_ variant for B-slice biprediction per H.264 §8.4.2.3.1:
// dst[r,c] = avg(dst[r,c], mc10_value)
// Caller pre-loads dst with the list0 prediction; this shader
// folds in the list1 contribution.
//
// License: BSD-2-Clause.
#version 450
#extension GL_EXT_shader_8bit_storage : require
#extension GL_EXT_shader_explicit_arithmetic_types : require
layout(local_size_x = 64, local_size_y = 1, local_size_z = 1) in;
layout(binding = 0) readonly buffer Src { uint8_t src[]; } u_src;
layout(binding = 1) buffer Dst { uint8_t dst[]; } u_dst;
layout(binding = 2) readonly buffer Meta { uvec4 meta[]; } u_meta;
layout(push_constant) uniform PC { uint n_blocks, stride_u8, _p0, _p1; } pc;
void main()
{
uint block_idx = gl_WorkGroupID.x;
if (block_idx >= pc.n_blocks) return;
uint lane = gl_LocalInvocationID.x;
uint r = lane >> 3, c = lane & 7u;
uint dst_off = u_meta.meta[block_idx].x;
uint src_off = u_meta.meta[block_idx].y;
uint stride = pc.stride_u8;
uint row_base = src_off + r * stride + c;
int s_m2 = int(u_src.src[row_base - 2u]);
int s_m1 = int(u_src.src[row_base - 1u]);
int s_0 = int(u_src.src[row_base ]);
int s_p1 = int(u_src.src[row_base + 1u]);
int s_p2 = int(u_src.src[row_base + 2u]);
int s_p3 = int(u_src.src[row_base + 3u]);
int v = s_m2 - 5 * s_m1 + 20 * s_0 + 20 * s_p1 - 5 * s_p2 + s_p3 + 16;
int hp = clamp(v >> 5, 0, 255);
// L2 average with the integer source at the SAME (r, c) position.
int avg = (hp + s_0 + 1) >> 1;
uint final_off = dst_off + r * stride + c;
int prev = int(u_dst.dst[final_off]);
u_dst.dst[final_off] = uint8_t((prev + avg + 1) >> 1);
}
@@ -0,0 +1,96 @@
// daedalus-fourier — H.264 luma qpel avg_mc11 (biprediction) (8x8, diagonal quarter-pel),
// V3D 7.1. Per H.264 §8.4.2.2.1 (table 8-4) — composes two half-pel
// anchors via L2 rounded-average:
//
// mc11[r,c] = avg(mc20(r, c),
// mc02(r, c))
//
// Per-lane structure: each lane computes BOTH anchor outputs at its
// own (r, c) target offset, then L2 averages. No shared memory.
// Same WG geometry as the other qpel shaders.
//
//
// avg_ variant for B-slice biprediction per H.264 §8.4.2.3.1:
// dst[r,c] = avg(dst[r,c], mc11_value)
// Caller pre-loads dst with the list0 prediction; this shader
// folds in the list1 contribution.
//
// License: BSD-2-Clause.
#version 450
#extension GL_EXT_shader_8bit_storage : require
#extension GL_EXT_shader_explicit_arithmetic_types : require
layout(local_size_x = 64, local_size_y = 1, local_size_z = 1) in;
layout(binding = 0) readonly buffer Src { uint8_t src[]; } u_src;
layout(binding = 1) buffer Dst { uint8_t dst[]; } u_dst;
layout(binding = 2) readonly buffer Meta { uvec4 meta[]; } u_meta;
layout(push_constant) uniform PC { uint n_blocks, stride_u8, _p0, _p1; } pc;
int hpel_h(uint src_off, uint stride, uint r, uint c) {
uint row_base = src_off + r * stride + c;
int s_m2 = int(u_src.src[row_base - 2u]);
int s_m1 = int(u_src.src[row_base - 1u]);
int s_0 = int(u_src.src[row_base ]);
int s_p1 = int(u_src.src[row_base + 1u]);
int s_p2 = int(u_src.src[row_base + 2u]);
int s_p3 = int(u_src.src[row_base + 3u]);
int v = s_m2 - 5*s_m1 + 20*s_0 + 20*s_p1 - 5*s_p2 + s_p3 + 16;
return clamp(v >> 5, 0, 255);
}
int hpel_v(uint src_off, uint stride, uint r, uint c) {
uint col_base = src_off + c;
int s_m2 = int(u_src.src[col_base + (r - 2u) * stride]);
int s_m1 = int(u_src.src[col_base + (r - 1u) * stride]);
int s_0 = int(u_src.src[col_base + r * stride]);
int s_p1 = int(u_src.src[col_base + (r + 1u) * stride]);
int s_p2 = int(u_src.src[col_base + (r + 2u) * stride]);
int s_p3 = int(u_src.src[col_base + (r + 3u) * stride]);
int v = s_m2 - 5*s_m1 + 20*s_0 + 20*s_p1 - 5*s_p2 + s_p3 + 16;
return clamp(v >> 5, 0, 255);
}
int hpel_hv_row(uint src_off, uint stride, uint rr, uint c) {
// Single row's horizontal lowpass, unclipped and unshifted. The value
// fits the int16 range (-2550..10710) and feeds the vertical pass of hpel_hv.
uint row_base = src_off + rr * stride + c;
int s_m2 = int(u_src.src[row_base - 2u]);
int s_m1 = int(u_src.src[row_base - 1u]);
int s_0 = int(u_src.src[row_base ]);
int s_p1 = int(u_src.src[row_base + 1u]);
int s_p2 = int(u_src.src[row_base + 2u]);
int s_p3 = int(u_src.src[row_base + 3u]);
return s_m2 - 5*s_m1 + 20*s_0 + 20*s_p1 - 5*s_p2 + s_p3;
}
int hpel_hv(uint src_off, uint stride, uint r, uint c) {
int t0 = hpel_hv_row(src_off, stride, r - 2u, c);
int t1 = hpel_hv_row(src_off, stride, r - 1u, c);
int t2 = hpel_hv_row(src_off, stride, r, c);
int t3 = hpel_hv_row(src_off, stride, r + 1u, c);
int t4 = hpel_hv_row(src_off, stride, r + 2u, c);
int t5 = hpel_hv_row(src_off, stride, r + 3u, c);
int v = t0 - 5*t1 + 20*t2 + 20*t3 - 5*t4 + t5 + 512;
return clamp(v >> 10, 0, 255);
}
void main()
{
uint block_idx = gl_WorkGroupID.x;
if (block_idx >= pc.n_blocks) return;
uint lane = gl_LocalInvocationID.x;
uint r = lane >> 3, c = lane & 7u;
uint dst_off = u_meta.meta[block_idx].x;
uint src_off = u_meta.meta[block_idx].y;
uint stride = pc.stride_u8;
int a = hpel_h(src_off, stride, r, c);
int b = hpel_v(src_off, stride, r, c);
int avg = (a + b + 1) >> 1;
uint final_off = dst_off + r * stride + c;
int prev = int(u_dst.dst[final_off]);
u_dst.dst[final_off] = uint8_t((prev + avg + 1) >> 1);
}
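Review note: the avg_ shaders round twice, once for the quarter-pel sample and once for the biprediction fold-in. Fusing both into a single `>> 2` is tempting but not bit-exact. The sketch below shows a small counterexample; `ravg`, `nested`, and `fused` are hypothetical names, and `fused` is shown only as the shortcut to avoid.

```c
#include <assert.h>

/* Rounded average as used at both stages of the avg_ shaders. */
static int ravg(int a, int b) { return (a + b + 1) >> 1; }

/* Two-stage form: quarter-pel sample first, then biprediction fold-in. */
static int nested(int prev, int a, int b)
{
    assert(prev >= 0 && prev <= 255);
    assert(a >= 0 && a <= 255 && b >= 0 && b <= 255);
    return ravg(prev, ravg(a, b));
}

/* Single-stage shortcut; NOT equivalent to nested(). */
static int fused(int prev, int a, int b)
{
    return (2 * prev + a + b + 2) >> 2;
}
```

With prev = 0, a = 0, b = 1 the two forms disagree, which is why the shader keeps the intermediate `avg` as a separate rounded value.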
@@ -0,0 +1,96 @@
// daedalus-fourier — H.264 luma qpel avg_mc12 (biprediction) (8x8, diagonal quarter-pel),
// V3D 7.1. Per H.264 §8.4.2.2.1 (table 8-4) — composes two half-pel
// anchors via L2 rounded-average:
//
// mc12[r,c] = avg(mc22(r, c),
// mc02(r, c))
//
// Per-lane structure: each lane computes BOTH anchor outputs at its
// own (r, c) target offset, then L2 averages. No shared memory.
// Same WG geometry as the other qpel shaders.
//
//
// avg_ variant for B-slice biprediction per H.264 §8.4.2.3.1:
// dst[r,c] = avg(dst[r,c], mc12_value)
// Caller pre-loads dst with the list0 prediction; this shader
// folds in the list1 contribution.
//
// License: BSD-2-Clause.
#version 450
#extension GL_EXT_shader_8bit_storage : require
#extension GL_EXT_shader_explicit_arithmetic_types : require
layout(local_size_x = 64, local_size_y = 1, local_size_z = 1) in;
layout(binding = 0) readonly buffer Src { uint8_t src[]; } u_src;
layout(binding = 1) buffer Dst { uint8_t dst[]; } u_dst;
layout(binding = 2) readonly buffer Meta { uvec4 meta[]; } u_meta;
layout(push_constant) uniform PC { uint n_blocks, stride_u8, _p0, _p1; } pc;
int hpel_h(uint src_off, uint stride, uint r, uint c) {
uint row_base = src_off + r * stride + c;
int s_m2 = int(u_src.src[row_base - 2u]);
int s_m1 = int(u_src.src[row_base - 1u]);
int s_0 = int(u_src.src[row_base ]);
int s_p1 = int(u_src.src[row_base + 1u]);
int s_p2 = int(u_src.src[row_base + 2u]);
int s_p3 = int(u_src.src[row_base + 3u]);
int v = s_m2 - 5*s_m1 + 20*s_0 + 20*s_p1 - 5*s_p2 + s_p3 + 16;
return clamp(v >> 5, 0, 255);
}
int hpel_v(uint src_off, uint stride, uint r, uint c) {
uint col_base = src_off + c;
int s_m2 = int(u_src.src[col_base + (r - 2u) * stride]);
int s_m1 = int(u_src.src[col_base + (r - 1u) * stride]);
int s_0 = int(u_src.src[col_base + r * stride]);
int s_p1 = int(u_src.src[col_base + (r + 1u) * stride]);
int s_p2 = int(u_src.src[col_base + (r + 2u) * stride]);
int s_p3 = int(u_src.src[col_base + (r + 3u) * stride]);
int v = s_m2 - 5*s_m1 + 20*s_0 + 20*s_p1 - 5*s_p2 + s_p3 + 16;
return clamp(v >> 5, 0, 255);
}
int hpel_hv_row(uint src_off, uint stride, uint rr, uint c) {
// Single row's horizontal lowpass, unclipped and unshifted. The value
// fits the int16 range (-2550..10710) and feeds the vertical pass of hpel_hv.
uint row_base = src_off + rr * stride + c;
int s_m2 = int(u_src.src[row_base - 2u]);
int s_m1 = int(u_src.src[row_base - 1u]);
int s_0 = int(u_src.src[row_base ]);
int s_p1 = int(u_src.src[row_base + 1u]);
int s_p2 = int(u_src.src[row_base + 2u]);
int s_p3 = int(u_src.src[row_base + 3u]);
return s_m2 - 5*s_m1 + 20*s_0 + 20*s_p1 - 5*s_p2 + s_p3;
}
int hpel_hv(uint src_off, uint stride, uint r, uint c) {
int t0 = hpel_hv_row(src_off, stride, r - 2u, c);
int t1 = hpel_hv_row(src_off, stride, r - 1u, c);
int t2 = hpel_hv_row(src_off, stride, r, c);
int t3 = hpel_hv_row(src_off, stride, r + 1u, c);
int t4 = hpel_hv_row(src_off, stride, r + 2u, c);
int t5 = hpel_hv_row(src_off, stride, r + 3u, c);
int v = t0 - 5*t1 + 20*t2 + 20*t3 - 5*t4 + t5 + 512;
return clamp(v >> 10, 0, 255);
}
void main()
{
uint block_idx = gl_WorkGroupID.x;
if (block_idx >= pc.n_blocks) return;
uint lane = gl_LocalInvocationID.x;
uint r = lane >> 3, c = lane & 7u;
uint dst_off = u_meta.meta[block_idx].x;
uint src_off = u_meta.meta[block_idx].y;
uint stride = pc.stride_u8;
int a = hpel_hv(src_off, stride, r, c);
int b = hpel_v(src_off, stride, r, c);
int avg = (a + b + 1) >> 1;
uint final_off = dst_off + r * stride + c;
int prev = int(u_dst.dst[final_off]);
u_dst.dst[final_off] = uint8_t((prev + avg + 1) >> 1);
}
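Review note: mc12 is the one shader in this group that actually calls `hpel_hv`, so the range of its intermediate is worth checking. The horizontal pass is kept unclipped and unshifted, and the vertical pass applies a single `+512 >> 10`. The sketch below evaluates the kernel at worst-case 8-bit inputs; `tap6` and `hv_from_rows` are hypothetical names, and the six row sums are passed in directly rather than read from a buffer.

```c
#include <assert.h>
#include <stdint.h>

/* The 6-tap H.264 luma kernel (1, -5, 20, 20, -5, 1), no rounding or shift. */
static int tap6(int a, int b, int c, int d, int e, int f)
{
    return a - 5 * b + 20 * c + 20 * d - 5 * e + f;
}

/* Hypothetical model of the centre sample: vertical pass over six
 * unclipped horizontal sums, then one +512 >> 10 rounding and a clip. */
static int hv_from_rows(const int t[6])
{
    for (int i = 0; i < 6; i++)
        assert(t[i] >= -2550 && t[i] <= 10710); /* first-pass bounds */
    int v = tap6(t[0], t[1], t[2], t[3], t[4], t[5]) + 512;
    v >>= 10; /* assumes arithmetic right shift for negative v */
    return v < 0 ? 0 : v > 255 ? 255 : v;
}
```

The first-pass sums fit in 16 bits, but the second-pass accumulator does not, so a 32-bit `int` for `v` in the shader looks like the right choice.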
@@ -0,0 +1,96 @@
// daedalus-fourier — H.264 luma qpel avg_mc13 (biprediction) (8x8, diagonal quarter-pel),
// V3D 7.1. Per H.264 §8.4.2.2.1 (table 8-4) — composes two half-pel
// anchors via L2 rounded-average:
//
// mc13[r,c] = avg(mc20(r+1, c),
// mc02(r, c))
//
// Per-lane structure: each lane computes BOTH anchor outputs at its
// own (r, c) target offset, then L2 averages. No shared memory.
// Same WG geometry as the other qpel shaders.
//
//
// avg_ variant for B-slice biprediction per H.264 §8.4.2.3.1:
// dst[r,c] = avg(dst[r,c], mc13_value)
// Caller pre-loads dst with the list0 prediction; this shader
// folds in the list1 contribution.
//
// License: BSD-2-Clause.
#version 450
#extension GL_EXT_shader_8bit_storage : require
#extension GL_EXT_shader_explicit_arithmetic_types : require
layout(local_size_x = 64, local_size_y = 1, local_size_z = 1) in;
layout(binding = 0) readonly buffer Src { uint8_t src[]; } u_src;
layout(binding = 1) buffer Dst { uint8_t dst[]; } u_dst;
layout(binding = 2) readonly buffer Meta { uvec4 meta[]; } u_meta;
layout(push_constant) uniform PC { uint n_blocks, stride_u8, _p0, _p1; } pc;
int hpel_h(uint src_off, uint stride, uint r, uint c) {
uint row_base = src_off + r * stride + c;
int s_m2 = int(u_src.src[row_base - 2u]);
int s_m1 = int(u_src.src[row_base - 1u]);
int s_0 = int(u_src.src[row_base ]);
int s_p1 = int(u_src.src[row_base + 1u]);
int s_p2 = int(u_src.src[row_base + 2u]);
int s_p3 = int(u_src.src[row_base + 3u]);
int v = s_m2 - 5*s_m1 + 20*s_0 + 20*s_p1 - 5*s_p2 + s_p3 + 16;
return clamp(v >> 5, 0, 255);
}
int hpel_v(uint src_off, uint stride, uint r, uint c) {
uint col_base = src_off + c;
int s_m2 = int(u_src.src[col_base + (r - 2u) * stride]);
int s_m1 = int(u_src.src[col_base + (r - 1u) * stride]);
int s_0 = int(u_src.src[col_base + r * stride]);
int s_p1 = int(u_src.src[col_base + (r + 1u) * stride]);
int s_p2 = int(u_src.src[col_base + (r + 2u) * stride]);
int s_p3 = int(u_src.src[col_base + (r + 3u) * stride]);
int v = s_m2 - 5*s_m1 + 20*s_0 + 20*s_p1 - 5*s_p2 + s_p3 + 16;
return clamp(v >> 5, 0, 255);
}
int hpel_hv_row(uint src_off, uint stride, uint rr, uint c) {
// Single row's horizontal lowpass, unclipped and unshifted. The value
// fits the int16 range (-2550..10710) and feeds the vertical pass of hpel_hv.
uint row_base = src_off + rr * stride + c;
int s_m2 = int(u_src.src[row_base - 2u]);
int s_m1 = int(u_src.src[row_base - 1u]);
int s_0 = int(u_src.src[row_base ]);
int s_p1 = int(u_src.src[row_base + 1u]);
int s_p2 = int(u_src.src[row_base + 2u]);
int s_p3 = int(u_src.src[row_base + 3u]);
return s_m2 - 5*s_m1 + 20*s_0 + 20*s_p1 - 5*s_p2 + s_p3;
}
int hpel_hv(uint src_off, uint stride, uint r, uint c) {
int t0 = hpel_hv_row(src_off, stride, r - 2u, c);
int t1 = hpel_hv_row(src_off, stride, r - 1u, c);
int t2 = hpel_hv_row(src_off, stride, r, c);
int t3 = hpel_hv_row(src_off, stride, r + 1u, c);
int t4 = hpel_hv_row(src_off, stride, r + 2u, c);
int t5 = hpel_hv_row(src_off, stride, r + 3u, c);
int v = t0 - 5*t1 + 20*t2 + 20*t3 - 5*t4 + t5 + 512;
return clamp(v >> 10, 0, 255);
}
void main()
{
uint block_idx = gl_WorkGroupID.x;
if (block_idx >= pc.n_blocks) return;
uint lane = gl_LocalInvocationID.x;
uint r = lane >> 3, c = lane & 7u;
uint dst_off = u_meta.meta[block_idx].x;
uint src_off = u_meta.meta[block_idx].y;
uint stride = pc.stride_u8;
int a = hpel_h(src_off, stride, r+1u, c);
int b = hpel_v(src_off, stride, r, c);
int avg = (a + b + 1) >> 1;
uint final_off = dst_off + r * stride + c;
int prev = int(u_dst.dst[final_off]);
u_dst.dst[final_off] = uint8_t((prev + avg + 1) >> 1);
}
@@ -0,0 +1,91 @@
// daedalus-fourier — H.264 luma qpel avg_mc20 (biprediction) (8x8, horizontal half-pel), V3D 7.1.
//
// H.264 spec §8.4.2.2.1 horizontal 6-tap luma interpolation:
//
// dst[r,c] = clip255(
// ( s[r,c-2]
// - 5 * s[r,c-1]
// + 20 * s[r,c]
// + 20 * s[r,c+1]
// - 5 * s[r,c+2]
// + s[r,c+3]
// + 16
// ) >> 5)
//
// Single-stride: dst and src share `stride` (H264QpelContext
// convention). src+src_off already points at the leftmost output
// column (col 0); the filter reads cols -2..+3. Caller guarantees
// edge-padding context per the public API docstring.
//
// Workgroup layout: 64 invocations = 1 lane per output pixel.
// 1 block per WG; n_blocks WGs total. This is the simplest layout
// that avoids any inter-lane communication — each lane independently
// reads its 6 src samples and writes its 1 dst sample. V3D's L2
// cache handles the redundant reads from adjacent lanes.
//
//
// avg_ variant for B-slice biprediction per H.264 §8.4.2.3.1:
// dst[r,c] = avg(dst[r,c], mc20_value)
// Caller pre-loads dst with the list0 prediction; this shader
// folds in the list1 contribution.
//
// License: BSD-2-Clause.
#version 450
#extension GL_EXT_shader_8bit_storage : require
#extension GL_EXT_shader_explicit_arithmetic_types : require
layout(local_size_x = 64, local_size_y = 1, local_size_z = 1) in;
layout(binding = 0) readonly buffer Src {
uint8_t src[];
} u_src;
layout(binding = 1) buffer Dst {
uint8_t dst[];
} u_dst;
layout(binding = 2) readonly buffer Meta {
uvec4 meta[]; // .x = dst_off, .y = src_off
} u_meta;
layout(push_constant) uniform PC {
uint n_blocks;
uint stride_u8;
uint _pad0, _pad1;
} pc;
void main()
{
// 1 block per WG, 64 lanes covering the 8x8 output block.
uint wg_id = gl_WorkGroupID.x;
uint block_idx = wg_id;
if (block_idx >= pc.n_blocks) return;
uint lane = gl_LocalInvocationID.x;
uint r = lane >> 3; // 0..7 (row)
uint c = lane & 7u; // 0..7 (column)
uint dst_off = u_meta.meta[block_idx].x;
uint src_off = u_meta.meta[block_idx].y;
uint stride = pc.stride_u8;
// src points at output col 0 of the block; filter reads cols -2..+3
// of the current row. Negative col arithmetic is unsigned-safe
// because src_off >= 2 (caller-guaranteed left context).
uint row_base = src_off + r * stride + c;
int s_m2 = int(u_src.src[row_base - 2u]);
int s_m1 = int(u_src.src[row_base - 1u]);
int s_0 = int(u_src.src[row_base + 0u]);
int s_p1 = int(u_src.src[row_base + 1u]);
int s_p2 = int(u_src.src[row_base + 2u]);
int s_p3 = int(u_src.src[row_base + 3u]);
int v = s_m2 - 5 * s_m1 + 20 * s_0 + 20 * s_p1 - 5 * s_p2 + s_p3 + 16;
int p = clamp(v >> 5, 0, 255);
uint final_off = dst_off + r * stride + c;
int prev = int(u_dst.dst[final_off]);
u_dst.dst[final_off] = uint8_t((prev + p + 1) >> 1);
}
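A scalar C reference can help when checking the shader above one pixel at a time. The following is a minimal sketch, not the project's CPU path: the names `shift_clip`, `ref_mc20_pixel` and `ref_avg_mc20_pixel` are hypothetical, and `pos` stands for the shader's `row_base`. Negative sums are clipped before shifting so the result does not depend on how the C compiler shifts negative values.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Equivalent to clamp(v >> shift, 0, 255) in the shader. Negative sums are
 * mapped to 0 before the shift, so no signed right shift is ever performed. */
static int shift_clip(int v, int shift)
{
    if (v < 0) return 0;
    v >>= shift;
    return v > 255 ? 255 : v;
}

/* Horizontal 6-tap half-pel value at src[pos]; reads pos-2 .. pos+3. */
static int ref_mc20_pixel(const uint8_t *src, size_t pos)
{
    assert(pos >= 2); /* two samples of left context are required */
    int v = src[pos - 2] - 5 * src[pos - 1] + 20 * src[pos]
          + 20 * src[pos + 1] - 5 * src[pos + 2] + src[pos + 3] + 16;
    return shift_clip(v, 5);
}

/* avg_ variant: rounded average of the value already in dst (prev)
 * with the freshly computed half-pel prediction. */
static uint8_t ref_avg_mc20_pixel(const uint8_t *src, size_t pos, uint8_t prev)
{
    return (uint8_t)((prev + ref_mc20_pixel(src, pos) + 1) >> 1);
}
```

On a flat row of 100s the filter returns 100 (the taps sum to 32, and the `>> 5` undoes that gain), so folding it into a previous value of 50 gives 75.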
@@ -0,0 +1,96 @@
// daedalus-fourier — H.264 luma qpel avg_mc21 (biprediction) (8x8, diagonal quarter-pel),
// V3D 7.1. Per H.264 §8.4.2.2.1 (table 8-4) — composes two half-pel
// anchors via an "L2" (two-source) rounded average, (a + b + 1) >> 1:
//
// mc21[r,c] = avg(mc22(r, c),
// mc20(r, c))
//
// Per-lane structure: each lane computes BOTH anchor outputs at its
// own (r, c) target offset, then L2 averages. No shared memory.
// Same WG geometry as the other qpel shaders.
//
//
// avg_ variant for B-slice biprediction per H.264 §8.4.2.3.1:
// dst[r,c] = avg(dst[r,c], mc21_value)
// Caller pre-loads dst with the list0 prediction; this shader
// folds in the list1 contribution.
//
// License: BSD-2-Clause.
#version 450
#extension GL_EXT_shader_8bit_storage : require
#extension GL_EXT_shader_explicit_arithmetic_types : require
layout(local_size_x = 64, local_size_y = 1, local_size_z = 1) in;
layout(binding = 0) readonly buffer Src { uint8_t src[]; } u_src;
layout(binding = 1) buffer Dst { uint8_t dst[]; } u_dst;
layout(binding = 2) readonly buffer Meta { uvec4 meta[]; } u_meta;
layout(push_constant) uniform PC { uint n_blocks, stride_u8, _p0, _p1; } pc;
int hpel_h(uint src_off, uint stride, uint r, uint c) {
uint row_base = src_off + r * stride + c;
int s_m2 = int(u_src.src[row_base - 2u]);
int s_m1 = int(u_src.src[row_base - 1u]);
int s_0 = int(u_src.src[row_base ]);
int s_p1 = int(u_src.src[row_base + 1u]);
int s_p2 = int(u_src.src[row_base + 2u]);
int s_p3 = int(u_src.src[row_base + 3u]);
int v = s_m2 - 5*s_m1 + 20*s_0 + 20*s_p1 - 5*s_p2 + s_p3 + 16;
return clamp(v >> 5, 0, 255);
}
int hpel_v(uint src_off, uint stride, uint r, uint c) {
uint col_base = src_off + c;
int s_m2 = int(u_src.src[col_base + (r - 2u) * stride]);
int s_m1 = int(u_src.src[col_base + (r - 1u) * stride]);
int s_0 = int(u_src.src[col_base + r * stride]);
int s_p1 = int(u_src.src[col_base + (r + 1u) * stride]);
int s_p2 = int(u_src.src[col_base + (r + 2u) * stride]);
int s_p3 = int(u_src.src[col_base + (r + 3u) * stride]);
int v = s_m2 - 5*s_m1 + 20*s_0 + 20*s_p1 - 5*s_p2 + s_p3 + 16;
return clamp(v >> 5, 0, 255);
}
int hpel_hv_row(uint src_off, uint stride, uint rr, uint c) {
// Single row's int16 horizontal lowpass (NOT clipped — used as
// intermediate for the vertical pass of hpel_hv).
uint row_base = src_off + rr * stride + c;
int s_m2 = int(u_src.src[row_base - 2u]);
int s_m1 = int(u_src.src[row_base - 1u]);
int s_0 = int(u_src.src[row_base ]);
int s_p1 = int(u_src.src[row_base + 1u]);
int s_p2 = int(u_src.src[row_base + 2u]);
int s_p3 = int(u_src.src[row_base + 3u]);
return s_m2 - 5*s_m1 + 20*s_0 + 20*s_p1 - 5*s_p2 + s_p3;
}
int hpel_hv(uint src_off, uint stride, uint r, uint c) {
int t0 = hpel_hv_row(src_off, stride, r - 2u, c);
int t1 = hpel_hv_row(src_off, stride, r - 1u, c);
int t2 = hpel_hv_row(src_off, stride, r, c);
int t3 = hpel_hv_row(src_off, stride, r + 1u, c);
int t4 = hpel_hv_row(src_off, stride, r + 2u, c);
int t5 = hpel_hv_row(src_off, stride, r + 3u, c);
int v = t0 - 5*t1 + 20*t2 + 20*t3 - 5*t4 + t5 + 512;
return clamp(v >> 10, 0, 255);
}
void main()
{
uint block_idx = gl_WorkGroupID.x;
if (block_idx >= pc.n_blocks) return;
uint lane = gl_LocalInvocationID.x;
uint r = lane >> 3, c = lane & 7u;
uint dst_off = u_meta.meta[block_idx].x;
uint src_off = u_meta.meta[block_idx].y;
uint stride = pc.stride_u8;
int a = hpel_hv(src_off, stride, r, c);
int b = hpel_h(src_off, stride, r, c);
int avg = (a + b + 1) >> 1;
uint final_off = dst_off + r * stride + c;
int prev = int(u_dst.dst[final_off]);
u_dst.dst[final_off] = uint8_t((prev + avg + 1) >> 1);
}
@@ -0,0 +1,94 @@
// daedalus-fourier — H.264 luma qpel avg_mc22 (biprediction) (8x8, 2D half-pel "j" position).
// V3D 7.1.
//
// Cascaded H+V 6-tap per H.264 §8.4.2.2.1 / FFmpeg ff_put_h264_qpel8_mc22_neon:
//
// tmp[r,c] = src[r,c-2] - 5*src[r,c-1] + 20*src[r,c] + 20*src[r,c+1]
// - 5*src[r,c+2] + src[r,c+3] (int16)
//
// dst[r,c] = clip255((tmp[r-2,c] - 5*tmp[r-1,c] + 20*tmp[r,c]
// + 20*tmp[r+1,c] - 5*tmp[r+2,c] + tmp[r+3,c]
// + 512) >> 10)
//
// The +512 >> 10 final scale compensates for both 6-tap scalings.
// CANNOT just cascade mc20→mc02 because intermediate must be int16
// (no per-stage clip), so this is a dedicated kernel.
//
// Per-lane structure: each lane computes its own (r, c) output by
// running the FULL cascade — 6 horizontal lowpass int16 values for
// rows r-2..r+3, then a vertical lowpass on those. ~50 ALU ops per
// lane. No shared memory / barriers needed; V3D L2 absorbs the
// redundant src reads across lanes.
//
// WG layout: 64 lanes / 1 block-per-WG / 1 lane-per-output-pixel
// (same as mc20 / mc02).
//
//
// avg_ variant for B-slice biprediction per H.264 §8.4.2.3.1:
// dst[r,c] = avg(dst[r,c], mc22_value)
// Caller pre-loads dst with the list0 prediction; this shader
// folds in the list1 contribution.
//
// License: BSD-2-Clause.
#version 450
#extension GL_EXT_shader_8bit_storage : require
#extension GL_EXT_shader_explicit_arithmetic_types : require
layout(local_size_x = 64, local_size_y = 1, local_size_z = 1) in;
layout(binding = 0) readonly buffer Src { uint8_t src[]; } u_src;
layout(binding = 1) buffer Dst { uint8_t dst[]; } u_dst;
layout(binding = 2) readonly buffer Meta { uvec4 meta[]; } u_meta;
layout(push_constant) uniform PC {
uint n_blocks;
uint stride_u8;
uint _pad0, _pad1;
} pc;
// Horizontal 6-tap filter at (row_off, c) — reads src at cols c-2..c+3
// of the row identified by row_off, returns int16 intermediate (NOT
// scaled — the v-pass does the +512 >> 10 for both stages).
int hpel_h(uint row_off, uint c)
{
int s_m2 = int(u_src.src[row_off + c - 2u]);
int s_m1 = int(u_src.src[row_off + c - 1u]);
int s_0 = int(u_src.src[row_off + c ]);
int s_p1 = int(u_src.src[row_off + c + 1u]);
int s_p2 = int(u_src.src[row_off + c + 2u]);
int s_p3 = int(u_src.src[row_off + c + 3u]);
return s_m2 - 5 * s_m1 + 20 * s_0 + 20 * s_p1 - 5 * s_p2 + s_p3;
}
void main()
{
uint block_idx = gl_WorkGroupID.x;
if (block_idx >= pc.n_blocks) return;
uint lane = gl_LocalInvocationID.x;
uint r = lane >> 3;
uint c = lane & 7u;
uint dst_off = u_meta.meta[block_idx].x;
uint src_off = u_meta.meta[block_idx].y;
uint stride = pc.stride_u8;
// Compute the 6 horizontal lowpass values for column c at rows
// r-2..r+3; src_off + r*stride + c is the output pixel position.
// The lowest index read is src_off - 2*stride - 2 (row r-2, col c-2),
// so the unsigned arithmetic stays in range only if the caller
// contract provides src_off >= 2*stride + 2.
int t0 = hpel_h(src_off + (r - 2u) * stride, c);
int t1 = hpel_h(src_off + (r - 1u) * stride, c);
int t2 = hpel_h(src_off + r * stride, c);
int t3 = hpel_h(src_off + (r + 1u) * stride, c);
int t4 = hpel_h(src_off + (r + 2u) * stride, c);
int t5 = hpel_h(src_off + (r + 3u) * stride, c);
int v = t0 - 5 * t1 + 20 * t2 + 20 * t3 - 5 * t4 + t5 + 512;
int p = clamp(v >> 10, 0, 255);
uint final_off = dst_off + r * stride + c;
int prev = int(u_dst.dst[final_off]);
u_dst.dst[final_off] = uint8_t((prev + p + 1) >> 1);
}
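The header comment of this shader says the horizontal intermediate must stay unclipped at int16 precision. The arithmetic behind that can be sketched in C as below; the helper names (`tap_max`, `tap_min`, `ref_mc22_from_rows`) are hypothetical and the bounds are derived only from the tap weights (1, -5, 20, 20, -5, 1), not taken from the project. For 8-bit input the horizontal sum lies in [-2550, 10710], which fits in int16, while the vertical sum over those values can reach 475320 and needs a 32-bit accumulator.

```c
#include <assert.h>
#include <stdint.h>

/* Positive taps sum to 1 + 20 + 20 + 1 = 42, negative taps to 5 + 5 = 10. */
enum { POS_TAPS = 42, NEG_TAPS = 10 };

/* Extremes of the 6-tap sum when every input lies in [lo, hi]. */
static int32_t tap_max(int32_t lo, int32_t hi)
{
    assert(lo <= hi);
    return POS_TAPS * hi - NEG_TAPS * lo;
}

static int32_t tap_min(int32_t lo, int32_t hi)
{
    assert(lo <= hi);
    return POS_TAPS * lo - NEG_TAPS * hi;
}

/* 2D half-pel sample from six unclipped horizontal sums t[0..5]
 * (rows r-2 .. r+3), mirroring the shader's "+ 512, >> 10, clamp". */
static int ref_mc22_from_rows(const int32_t t[6])
{
    int32_t v = t[0] - 5 * t[1] + 20 * t[2] + 20 * t[3] - 5 * t[4] + t[5] + 512;
    if (v < 0) return 0;          /* clip before shifting: no signed shift */
    v >>= 10;
    return v > 255 ? 255 : (int)v;
}
```

Feeding the second stage six rows that each hold 32 * 100 (a flat image of 100s after the horizontal pass) returns 100, since the two filter gains of 32 multiply to the 1024 that `>> 10` removes.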
@@ -0,0 +1,96 @@
// daedalus-fourier — H.264 luma qpel avg_mc23 (biprediction) (8x8, diagonal quarter-pel),
// V3D 7.1. Per H.264 §8.4.2.2.1 (table 8-4) — composes two half-pel
// anchors via L2 rounded-average:
//
// mc23[r,c] = avg(mc22(r, c),
// mc20(r+1, c))
//
// Per-lane structure: each lane computes BOTH anchor outputs at its
// own (r, c) target offset, then L2 averages. No shared memory.
// Same WG geometry as the other qpel shaders.
//
//
// avg_ variant for B-slice biprediction per H.264 §8.4.2.3.1:
// dst[r,c] = avg(dst[r,c], mc23_value)
// Caller pre-loads dst with the list0 prediction; this shader
// folds in the list1 contribution.
//
// License: BSD-2-Clause.
#version 450
#extension GL_EXT_shader_8bit_storage : require
#extension GL_EXT_shader_explicit_arithmetic_types : require
layout(local_size_x = 64, local_size_y = 1, local_size_z = 1) in;
layout(binding = 0) readonly buffer Src { uint8_t src[]; } u_src;
layout(binding = 1) buffer Dst { uint8_t dst[]; } u_dst;
layout(binding = 2) readonly buffer Meta { uvec4 meta[]; } u_meta;
layout(push_constant) uniform PC { uint n_blocks, stride_u8, _p0, _p1; } pc;
int hpel_h(uint src_off, uint stride, uint r, uint c) {
uint row_base = src_off + r * stride + c;
int s_m2 = int(u_src.src[row_base - 2u]);
int s_m1 = int(u_src.src[row_base - 1u]);
int s_0 = int(u_src.src[row_base ]);
int s_p1 = int(u_src.src[row_base + 1u]);
int s_p2 = int(u_src.src[row_base + 2u]);
int s_p3 = int(u_src.src[row_base + 3u]);
int v = s_m2 - 5*s_m1 + 20*s_0 + 20*s_p1 - 5*s_p2 + s_p3 + 16;
return clamp(v >> 5, 0, 255);
}
int hpel_v(uint src_off, uint stride, uint r, uint c) {
uint col_base = src_off + c;
int s_m2 = int(u_src.src[col_base + (r - 2u) * stride]);
int s_m1 = int(u_src.src[col_base + (r - 1u) * stride]);
int s_0 = int(u_src.src[col_base + r * stride]);
int s_p1 = int(u_src.src[col_base + (r + 1u) * stride]);
int s_p2 = int(u_src.src[col_base + (r + 2u) * stride]);
int s_p3 = int(u_src.src[col_base + (r + 3u) * stride]);
int v = s_m2 - 5*s_m1 + 20*s_0 + 20*s_p1 - 5*s_p2 + s_p3 + 16;
return clamp(v >> 5, 0, 255);
}
int hpel_hv_row(uint src_off, uint stride, uint rr, uint c) {
// Single row's int16 horizontal lowpass (NOT clipped — used as
// intermediate for the vertical pass of hpel_hv).
uint row_base = src_off + rr * stride + c;
int s_m2 = int(u_src.src[row_base - 2u]);
int s_m1 = int(u_src.src[row_base - 1u]);
int s_0 = int(u_src.src[row_base ]);
int s_p1 = int(u_src.src[row_base + 1u]);
int s_p2 = int(u_src.src[row_base + 2u]);
int s_p3 = int(u_src.src[row_base + 3u]);
return s_m2 - 5*s_m1 + 20*s_0 + 20*s_p1 - 5*s_p2 + s_p3;
}
int hpel_hv(uint src_off, uint stride, uint r, uint c) {
int t0 = hpel_hv_row(src_off, stride, r - 2u, c);
int t1 = hpel_hv_row(src_off, stride, r - 1u, c);
int t2 = hpel_hv_row(src_off, stride, r, c);
int t3 = hpel_hv_row(src_off, stride, r + 1u, c);
int t4 = hpel_hv_row(src_off, stride, r + 2u, c);
int t5 = hpel_hv_row(src_off, stride, r + 3u, c);
int v = t0 - 5*t1 + 20*t2 + 20*t3 - 5*t4 + t5 + 512;
return clamp(v >> 10, 0, 255);
}
void main()
{
uint block_idx = gl_WorkGroupID.x;
if (block_idx >= pc.n_blocks) return;
uint lane = gl_LocalInvocationID.x;
uint r = lane >> 3, c = lane & 7u;
uint dst_off = u_meta.meta[block_idx].x;
uint src_off = u_meta.meta[block_idx].y;
uint stride = pc.stride_u8;
int a = hpel_hv(src_off, stride, r, c);
int b = hpel_h(src_off, stride, r+1u, c);
int avg = (a + b + 1) >> 1;
uint final_off = dst_off + r * stride + c;
int prev = int(u_dst.dst[final_off]);
u_dst.dst[final_off] = uint8_t((prev + avg + 1) >> 1);
}
@@ -0,0 +1,52 @@
// daedalus-fourier — H.264 luma qpel avg_mc30 (biprediction) (8x8, ¾-pel horizontal),
// V3D 7.1. Per H.264 §8.4.2.2.1 "c" position:
//
// dst[r,c] = ((clip255(mc20(s)[r,c]) + s[r,c+1] + 1) >> 1)
//
// Same as mc10 but L2-averages with src[r, c+1] instead of src[r, c].
//
//
// avg_ variant for B-slice biprediction per H.264 §8.4.2.3.1:
// dst[r,c] = avg(dst[r,c], mc30_value)
// Caller pre-loads dst with the list0 prediction; this shader
// folds in the list1 contribution.
//
// License: BSD-2-Clause.
#version 450
#extension GL_EXT_shader_8bit_storage : require
#extension GL_EXT_shader_explicit_arithmetic_types : require
layout(local_size_x = 64, local_size_y = 1, local_size_z = 1) in;
layout(binding = 0) readonly buffer Src { uint8_t src[]; } u_src;
layout(binding = 1) buffer Dst { uint8_t dst[]; } u_dst;
layout(binding = 2) readonly buffer Meta { uvec4 meta[]; } u_meta;
layout(push_constant) uniform PC { uint n_blocks, stride_u8, _p0, _p1; } pc;
void main()
{
uint block_idx = gl_WorkGroupID.x;
if (block_idx >= pc.n_blocks) return;
uint lane = gl_LocalInvocationID.x;
uint r = lane >> 3, c = lane & 7u;
uint dst_off = u_meta.meta[block_idx].x;
uint src_off = u_meta.meta[block_idx].y;
uint stride = pc.stride_u8;
uint row_base = src_off + r * stride + c;
int s_m2 = int(u_src.src[row_base - 2u]);
int s_m1 = int(u_src.src[row_base - 1u]);
int s_0 = int(u_src.src[row_base ]);
int s_p1 = int(u_src.src[row_base + 1u]);
int s_p2 = int(u_src.src[row_base + 2u]);
int s_p3 = int(u_src.src[row_base + 3u]);
int v = s_m2 - 5 * s_m1 + 20 * s_0 + 20 * s_p1 - 5 * s_p2 + s_p3 + 16;
int hp = clamp(v >> 5, 0, 255);
int avg = (hp + s_p1 + 1) >> 1; // L2 with src[r, c+1]
uint final_off = dst_off + r * stride + c;
int prev = int(u_dst.dst[final_off]);
u_dst.dst[final_off] = uint8_t((prev + avg + 1) >> 1);
}
@@ -0,0 +1,96 @@
// daedalus-fourier — H.264 luma qpel avg_mc31 (biprediction) (8x8, diagonal quarter-pel),
// V3D 7.1. Per H.264 §8.4.2.2.1 (table 8-4) — composes two half-pel
// anchors via L2 rounded-average:
//
// mc31[r,c] = avg(mc20(r, c),
// mc02(r, c+1))
//
// Per-lane structure: each lane computes BOTH anchor outputs at its
// own (r, c) target offset, then L2 averages. No shared memory.
// Same WG geometry as the other qpel shaders.
//
//
// avg_ variant for B-slice biprediction per H.264 §8.4.2.3.1:
// dst[r,c] = avg(dst[r,c], mc31_value)
// Caller pre-loads dst with the list0 prediction; this shader
// folds in the list1 contribution.
//
// License: BSD-2-Clause.
#version 450
#extension GL_EXT_shader_8bit_storage : require
#extension GL_EXT_shader_explicit_arithmetic_types : require
layout(local_size_x = 64, local_size_y = 1, local_size_z = 1) in;
layout(binding = 0) readonly buffer Src { uint8_t src[]; } u_src;
layout(binding = 1) buffer Dst { uint8_t dst[]; } u_dst;
layout(binding = 2) readonly buffer Meta { uvec4 meta[]; } u_meta;
layout(push_constant) uniform PC { uint n_blocks, stride_u8, _p0, _p1; } pc;
int hpel_h(uint src_off, uint stride, uint r, uint c) {
uint row_base = src_off + r * stride + c;
int s_m2 = int(u_src.src[row_base - 2u]);
int s_m1 = int(u_src.src[row_base - 1u]);
int s_0 = int(u_src.src[row_base ]);
int s_p1 = int(u_src.src[row_base + 1u]);
int s_p2 = int(u_src.src[row_base + 2u]);
int s_p3 = int(u_src.src[row_base + 3u]);
int v = s_m2 - 5*s_m1 + 20*s_0 + 20*s_p1 - 5*s_p2 + s_p3 + 16;
return clamp(v >> 5, 0, 255);
}
int hpel_v(uint src_off, uint stride, uint r, uint c) {
uint col_base = src_off + c;
int s_m2 = int(u_src.src[col_base + (r - 2u) * stride]);
int s_m1 = int(u_src.src[col_base + (r - 1u) * stride]);
int s_0 = int(u_src.src[col_base + r * stride]);
int s_p1 = int(u_src.src[col_base + (r + 1u) * stride]);
int s_p2 = int(u_src.src[col_base + (r + 2u) * stride]);
int s_p3 = int(u_src.src[col_base + (r + 3u) * stride]);
int v = s_m2 - 5*s_m1 + 20*s_0 + 20*s_p1 - 5*s_p2 + s_p3 + 16;
return clamp(v >> 5, 0, 255);
}
int hpel_hv_row(uint src_off, uint stride, uint rr, uint c) {
// Single row's int16 horizontal lowpass (NOT clipped — used as
// intermediate for the vertical pass of hpel_hv).
uint row_base = src_off + rr * stride + c;
int s_m2 = int(u_src.src[row_base - 2u]);
int s_m1 = int(u_src.src[row_base - 1u]);
int s_0 = int(u_src.src[row_base ]);
int s_p1 = int(u_src.src[row_base + 1u]);
int s_p2 = int(u_src.src[row_base + 2u]);
int s_p3 = int(u_src.src[row_base + 3u]);
return s_m2 - 5*s_m1 + 20*s_0 + 20*s_p1 - 5*s_p2 + s_p3;
}
int hpel_hv(uint src_off, uint stride, uint r, uint c) {
int t0 = hpel_hv_row(src_off, stride, r - 2u, c);
int t1 = hpel_hv_row(src_off, stride, r - 1u, c);
int t2 = hpel_hv_row(src_off, stride, r, c);
int t3 = hpel_hv_row(src_off, stride, r + 1u, c);
int t4 = hpel_hv_row(src_off, stride, r + 2u, c);
int t5 = hpel_hv_row(src_off, stride, r + 3u, c);
int v = t0 - 5*t1 + 20*t2 + 20*t3 - 5*t4 + t5 + 512;
return clamp(v >> 10, 0, 255);
}
void main()
{
uint block_idx = gl_WorkGroupID.x;
if (block_idx >= pc.n_blocks) return;
uint lane = gl_LocalInvocationID.x;
uint r = lane >> 3, c = lane & 7u;
uint dst_off = u_meta.meta[block_idx].x;
uint src_off = u_meta.meta[block_idx].y;
uint stride = pc.stride_u8;
int a = hpel_h(src_off, stride, r, c);
int b = hpel_v(src_off, stride, r, c+1u);
int avg = (a + b + 1) >> 1;
uint final_off = dst_off + r * stride + c;
int prev = int(u_dst.dst[final_off]);
u_dst.dst[final_off] = uint8_t((prev + avg + 1) >> 1);
}
@@ -0,0 +1,96 @@
// daedalus-fourier — H.264 luma qpel avg_mc32 (biprediction) (8x8, diagonal quarter-pel),
// V3D 7.1. Per H.264 §8.4.2.2.1 (table 8-4) — composes two half-pel
// anchors via L2 rounded-average:
//
// mc32[r,c] = avg(mc22(r, c),
// mc02(r, c+1))
//
// Per-lane structure: each lane computes BOTH anchor outputs at its
// own (r, c) target offset, then L2 averages. No shared memory.
// Same WG geometry as the other qpel shaders.
//
//
// avg_ variant for B-slice biprediction per H.264 §8.4.2.3.1:
// dst[r,c] = avg(dst[r,c], mc32_value)
// Caller pre-loads dst with the list0 prediction; this shader
// folds in the list1 contribution.
//
// License: BSD-2-Clause.
#version 450
#extension GL_EXT_shader_8bit_storage : require
#extension GL_EXT_shader_explicit_arithmetic_types : require
layout(local_size_x = 64, local_size_y = 1, local_size_z = 1) in;
layout(binding = 0) readonly buffer Src { uint8_t src[]; } u_src;
layout(binding = 1) buffer Dst { uint8_t dst[]; } u_dst;
layout(binding = 2) readonly buffer Meta { uvec4 meta[]; } u_meta;
layout(push_constant) uniform PC { uint n_blocks, stride_u8, _p0, _p1; } pc;
int hpel_h(uint src_off, uint stride, uint r, uint c) {
uint row_base = src_off + r * stride + c;
int s_m2 = int(u_src.src[row_base - 2u]);
int s_m1 = int(u_src.src[row_base - 1u]);
int s_0 = int(u_src.src[row_base ]);
int s_p1 = int(u_src.src[row_base + 1u]);
int s_p2 = int(u_src.src[row_base + 2u]);
int s_p3 = int(u_src.src[row_base + 3u]);
int v = s_m2 - 5*s_m1 + 20*s_0 + 20*s_p1 - 5*s_p2 + s_p3 + 16;
return clamp(v >> 5, 0, 255);
}
int hpel_v(uint src_off, uint stride, uint r, uint c) {
uint col_base = src_off + c;
int s_m2 = int(u_src.src[col_base + (r - 2u) * stride]);
int s_m1 = int(u_src.src[col_base + (r - 1u) * stride]);
int s_0 = int(u_src.src[col_base + r * stride]);
int s_p1 = int(u_src.src[col_base + (r + 1u) * stride]);
int s_p2 = int(u_src.src[col_base + (r + 2u) * stride]);
int s_p3 = int(u_src.src[col_base + (r + 3u) * stride]);
int v = s_m2 - 5*s_m1 + 20*s_0 + 20*s_p1 - 5*s_p2 + s_p3 + 16;
return clamp(v >> 5, 0, 255);
}
int hpel_hv_row(uint src_off, uint stride, uint rr, uint c) {
// Single row's int16 horizontal lowpass (NOT clipped — used as
// intermediate for the vertical pass of hpel_hv).
uint row_base = src_off + rr * stride + c;
int s_m2 = int(u_src.src[row_base - 2u]);
int s_m1 = int(u_src.src[row_base - 1u]);
int s_0 = int(u_src.src[row_base ]);
int s_p1 = int(u_src.src[row_base + 1u]);
int s_p2 = int(u_src.src[row_base + 2u]);
int s_p3 = int(u_src.src[row_base + 3u]);
return s_m2 - 5*s_m1 + 20*s_0 + 20*s_p1 - 5*s_p2 + s_p3;
}
int hpel_hv(uint src_off, uint stride, uint r, uint c) {
int t0 = hpel_hv_row(src_off, stride, r - 2u, c);
int t1 = hpel_hv_row(src_off, stride, r - 1u, c);
int t2 = hpel_hv_row(src_off, stride, r, c);
int t3 = hpel_hv_row(src_off, stride, r + 1u, c);
int t4 = hpel_hv_row(src_off, stride, r + 2u, c);
int t5 = hpel_hv_row(src_off, stride, r + 3u, c);
int v = t0 - 5*t1 + 20*t2 + 20*t3 - 5*t4 + t5 + 512;
return clamp(v >> 10, 0, 255);
}
void main()
{
uint block_idx = gl_WorkGroupID.x;
if (block_idx >= pc.n_blocks) return;
uint lane = gl_LocalInvocationID.x;
uint r = lane >> 3, c = lane & 7u;
uint dst_off = u_meta.meta[block_idx].x;
uint src_off = u_meta.meta[block_idx].y;
uint stride = pc.stride_u8;
int a = hpel_hv(src_off, stride, r, c);
int b = hpel_v(src_off, stride, r, c+1u);
int avg = (a + b + 1) >> 1;
uint final_off = dst_off + r * stride + c;
int prev = int(u_dst.dst[final_off]);
u_dst.dst[final_off] = uint8_t((prev + avg + 1) >> 1);
}
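All of these shaders share the same `PC` push-constant block, `Meta` buffer and lane-to-pixel mapping. A host-side mirror of that interface might look like the sketch below. The struct and function names are hypothetical (the real v3d_runner types are not shown in this diff), and the 16-byte sizes assume the default std430 layout that Vulkan GLSL applies to push-constant and storage-buffer blocks.

```c
#include <assert.h>
#include <stdint.h>

/* Mirror of `layout(push_constant) uniform PC`: four 32-bit words. */
typedef struct {
    uint32_t n_blocks;
    uint32_t stride_u8;
    uint32_t pad0, pad1;
} qpel_push_constants;

/* Mirror of one `uvec4 meta[]` entry: .x = dst_off, .y = src_off;
 * .z and .w are not read by the shaders shown here. */
typedef struct {
    uint32_t dst_off;
    uint32_t src_off;
    uint32_t unused_z, unused_w;
} qpel_block_meta;

/* Same mapping as the shaders: lane 0..63 -> (row, col) in an 8x8 block. */
static void lane_to_rc(uint32_t lane, uint32_t *r, uint32_t *c)
{
    assert(lane < 64);
    *r = lane >> 3;
    *c = lane & 7u;
}

/* Byte index in dst that a given lane writes (the shaders' final_off). */
static uint32_t lane_dst_index(const qpel_block_meta *m, uint32_t stride,
                               uint32_t lane)
{
    uint32_t r, c;
    lane_to_rc(lane, &r, &c);
    return m->dst_off + r * stride + c;
}
```

With `dst_off = 100` and a stride of 64, lane 10 maps to row 1, column 2 and therefore writes byte 166.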
@@ -0,0 +1,96 @@
// daedalus-fourier — H.264 luma qpel avg_mc33 (biprediction) (8x8, diagonal quarter-pel),
// V3D 7.1. Per H.264 §8.4.2.2.1 (table 8-4) — composes two half-pel
// anchors via L2 rounded-average:
//
// mc33[r,c] = avg(mc20(r+1, c),
// mc02(r, c+1))
//
// Per-lane structure: each lane computes BOTH anchor outputs at its
// own (r, c) target offset, then L2 averages. No shared memory.
// Same WG geometry as the other qpel shaders.
//
//
// avg_ variant for B-slice biprediction per H.264 §8.4.2.3.1:
// dst[r,c] = avg(dst[r,c], mc33_value)
// Caller pre-loads dst with the list0 prediction; this shader
// folds in the list1 contribution.
//
// License: BSD-2-Clause.
#version 450
#extension GL_EXT_shader_8bit_storage : require
#extension GL_EXT_shader_explicit_arithmetic_types : require
layout(local_size_x = 64, local_size_y = 1, local_size_z = 1) in;
layout(binding = 0) readonly buffer Src { uint8_t src[]; } u_src;
layout(binding = 1) buffer Dst { uint8_t dst[]; } u_dst;
layout(binding = 2) readonly buffer Meta { uvec4 meta[]; } u_meta;
layout(push_constant) uniform PC { uint n_blocks, stride_u8, _p0, _p1; } pc;
int hpel_h(uint src_off, uint stride, uint r, uint c) {
uint row_base = src_off + r * stride + c;
int s_m2 = int(u_src.src[row_base - 2u]);
int s_m1 = int(u_src.src[row_base - 1u]);
int s_0 = int(u_src.src[row_base ]);
int s_p1 = int(u_src.src[row_base + 1u]);
int s_p2 = int(u_src.src[row_base + 2u]);
int s_p3 = int(u_src.src[row_base + 3u]);
int v = s_m2 - 5*s_m1 + 20*s_0 + 20*s_p1 - 5*s_p2 + s_p3 + 16;
return clamp(v >> 5, 0, 255);
}
int hpel_v(uint src_off, uint stride, uint r, uint c) {
uint col_base = src_off + c;
int s_m2 = int(u_src.src[col_base + (r - 2u) * stride]);
int s_m1 = int(u_src.src[col_base + (r - 1u) * stride]);
int s_0 = int(u_src.src[col_base + r * stride]);
int s_p1 = int(u_src.src[col_base + (r + 1u) * stride]);
int s_p2 = int(u_src.src[col_base + (r + 2u) * stride]);
int s_p3 = int(u_src.src[col_base + (r + 3u) * stride]);
int v = s_m2 - 5*s_m1 + 20*s_0 + 20*s_p1 - 5*s_p2 + s_p3 + 16;
return clamp(v >> 5, 0, 255);
}
int hpel_hv_row(uint src_off, uint stride, uint rr, uint c) {
// Single row's int16 horizontal lowpass (NOT clipped — used as
// intermediate for the vertical pass of hpel_hv).
uint row_base = src_off + rr * stride + c;
int s_m2 = int(u_src.src[row_base - 2u]);
int s_m1 = int(u_src.src[row_base - 1u]);
int s_0 = int(u_src.src[row_base ]);
int s_p1 = int(u_src.src[row_base + 1u]);
int s_p2 = int(u_src.src[row_base + 2u]);
int s_p3 = int(u_src.src[row_base + 3u]);
return s_m2 - 5*s_m1 + 20*s_0 + 20*s_p1 - 5*s_p2 + s_p3;
}
int hpel_hv(uint src_off, uint stride, uint r, uint c) {
int t0 = hpel_hv_row(src_off, stride, r - 2u, c);
int t1 = hpel_hv_row(src_off, stride, r - 1u, c);
int t2 = hpel_hv_row(src_off, stride, r, c);
int t3 = hpel_hv_row(src_off, stride, r + 1u, c);
int t4 = hpel_hv_row(src_off, stride, r + 2u, c);
int t5 = hpel_hv_row(src_off, stride, r + 3u, c);
int v = t0 - 5*t1 + 20*t2 + 20*t3 - 5*t4 + t5 + 512;
return clamp(v >> 10, 0, 255);
}
void main()
{
uint block_idx = gl_WorkGroupID.x;
if (block_idx >= pc.n_blocks) return;
uint lane = gl_LocalInvocationID.x;
uint r = lane >> 3, c = lane & 7u;
uint dst_off = u_meta.meta[block_idx].x;
uint src_off = u_meta.meta[block_idx].y;
uint stride = pc.stride_u8;
int a = hpel_h(src_off, stride, r+1u, c);
int b = hpel_v(src_off, stride, r, c+1u);
int avg = (a + b + 1) >> 1;
uint final_off = dst_off + r * stride + c;
int prev = int(u_dst.dst[final_off]);
u_dst.dst[final_off] = uint8_t((prev + avg + 1) >> 1);
}
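The diagonal quarter-pel shaders differ only in which two half-pel anchors they average and at which offset each anchor is evaluated. The table below collects the combinations stated in the header comments of this section; it is an illustrative sketch with hypothetical type and function names, and it deliberately lists only the positions whose headers appear here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Which half-pel filter an anchor uses: horizontal (mc20), vertical (mc02),
 * or the 2D cascade (mc22). */
typedef enum { ANCHOR_H, ANCHOR_V, ANCHOR_HV } anchor_kind;

/* An anchor evaluated at (r + dr, c + dc) relative to the output pixel. */
typedef struct { anchor_kind kind; int dr, dc; } anchor;

typedef struct { const char *name; anchor a, b; } diag_recipe;

static const diag_recipe k_recipes[] = {
    { "mc11", { ANCHOR_H,  0, 0 }, { ANCHOR_V, 0, 0 } },
    { "mc21", { ANCHOR_HV, 0, 0 }, { ANCHOR_H, 0, 0 } },
    { "mc23", { ANCHOR_HV, 0, 0 }, { ANCHOR_H, 1, 0 } },
    { "mc31", { ANCHOR_H,  0, 0 }, { ANCHOR_V, 0, 1 } },
    { "mc32", { ANCHOR_HV, 0, 0 }, { ANCHOR_V, 0, 1 } },
    { "mc33", { ANCHOR_H,  1, 0 }, { ANCHOR_V, 0, 1 } },
};

/* Linear lookup by name; returns NULL for positions not in the table. */
static const diag_recipe *find_recipe(const char *name)
{
    assert(name != NULL);
    for (size_t i = 0; i < sizeof k_recipes / sizeof k_recipes[0]; i++) {
        if (strcmp(k_recipes[i].name, name) == 0)
            return &k_recipes[i];
    }
    return NULL;
}
```

For instance, `find_recipe("mc33")` yields a horizontal anchor one row down and a vertical anchor one column right, matching the `hpel_h(..., r+1u, c)` and `hpel_v(..., r, c+1u)` calls in the shader above.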
@@ -0,0 +1,44 @@
// daedalus-fourier — H.264 luma qpel mc01 (8x8, ¼-pel vertical),
// V3D 7.1. Per H.264 §8.4.2.2.1 "d" position:
//
// dst[r,c] = ((clip255(mc02(s)[r,c]) + s[r,c] + 1) >> 1)
//
// Sibling of v3d_h264_qpel_mc02.comp with L2 step against src[r, c].
//
// License: BSD-2-Clause.
#version 450
#extension GL_EXT_shader_8bit_storage : require
#extension GL_EXT_shader_explicit_arithmetic_types : require
layout(local_size_x = 64, local_size_y = 1, local_size_z = 1) in;
layout(binding = 0) readonly buffer Src { uint8_t src[]; } u_src;
layout(binding = 1) buffer Dst { uint8_t dst[]; } u_dst;
layout(binding = 2) readonly buffer Meta { uvec4 meta[]; } u_meta;
layout(push_constant) uniform PC { uint n_blocks, stride_u8, _p0, _p1; } pc;
void main()
{
uint block_idx = gl_WorkGroupID.x;
if (block_idx >= pc.n_blocks) return;
uint lane = gl_LocalInvocationID.x;
uint r = lane >> 3, c = lane & 7u;
uint dst_off = u_meta.meta[block_idx].x;
uint src_off = u_meta.meta[block_idx].y;
uint stride = pc.stride_u8;
uint col_base = src_off + c;
int s_m2 = int(u_src.src[col_base + (r - 2u) * stride]);
int s_m1 = int(u_src.src[col_base + (r - 1u) * stride]);
int s_0 = int(u_src.src[col_base + r * stride]);
int s_p1 = int(u_src.src[col_base + (r + 1u) * stride]);
int s_p2 = int(u_src.src[col_base + (r + 2u) * stride]);
int s_p3 = int(u_src.src[col_base + (r + 3u) * stride]);
int v = s_m2 - 5 * s_m1 + 20 * s_0 + 20 * s_p1 - 5 * s_p2 + s_p3 + 16;
int vp = clamp(v >> 5, 0, 255);
int avg = (vp + s_0 + 1) >> 1; // L2 with src[r, c]
u_dst.dst[dst_off + r * stride + c] = uint8_t(avg);
}
@@ -0,0 +1,69 @@
// daedalus-fourier — H.264 luma qpel mc02 (8x8, vertical half-pel), V3D 7.1.
//
// Sibling of cycle 9's v3d_h264_qpel_mc20.comp. Same 6-tap filter,
// transposed to vertical direction:
//
// dst[r,c] = clip255(
// ( s[r-2,c]
// - 5 * s[r-1,c]
// + 20 * s[r, c]
// + 20 * s[r+1,c]
// - 5 * s[r+2,c]
// + s[r+3,c]
// + 16
// ) >> 5)
//
// src+src_off points at row 0 col 0 of the OUTPUT block; the filter
// reads rows -2..+3 (2 rows of top context, 3 rows of bottom).
//
// Same WG layout as mc20: 64 lanes / 1 block-per-WG / 1 lane-per-pixel.
//
// License: BSD-2-Clause.
#version 450
#extension GL_EXT_shader_8bit_storage : require
#extension GL_EXT_shader_explicit_arithmetic_types : require
layout(local_size_x = 64, local_size_y = 1, local_size_z = 1) in;
layout(binding = 0) readonly buffer Src { uint8_t src[]; } u_src;
layout(binding = 1) buffer Dst { uint8_t dst[]; } u_dst;
layout(binding = 2) readonly buffer Meta { uvec4 meta[]; } u_meta;
layout(push_constant) uniform PC {
uint n_blocks;
uint stride_u8;
uint _pad0, _pad1;
} pc;
void main()
{
uint block_idx = gl_WorkGroupID.x;
if (block_idx >= pc.n_blocks) return;
uint lane = gl_LocalInvocationID.x;
uint r = lane >> 3;
uint c = lane & 7u;
uint dst_off = u_meta.meta[block_idx].x;
uint src_off = u_meta.meta[block_idx].y;
uint stride = pc.stride_u8;
// Read the 6 rows of vertical context at col (c) of THIS output row.
// src_off+r*stride+c is at the OUTPUT pixel position; the kernel
// samples r-2..r+3 along the column. Unsigned-safe because the
// public API contract guarantees src_off >= 2*stride.
uint col_base = src_off + c;
int s_m2 = int(u_src.src[col_base + (r - 2u) * stride]);
int s_m1 = int(u_src.src[col_base + (r - 1u) * stride]);
int s_0 = int(u_src.src[col_base + r * stride]);
int s_p1 = int(u_src.src[col_base + (r + 1u) * stride]);
int s_p2 = int(u_src.src[col_base + (r + 2u) * stride]);
int s_p3 = int(u_src.src[col_base + (r + 3u) * stride]);
int v = s_m2 - 5 * s_m1 + 20 * s_0 + 20 * s_p1 - 5 * s_p2 + s_p3 + 16;
int p = clamp(v >> 5, 0, 255);
u_dst.dst[dst_off + r * stride + c] = uint8_t(p);
}
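The "unsigned-safe" remark in this shader relies on 32-bit modular arithmetic: for output row 0, `(r - 2u)` wraps to a huge value, but the final sum wraps back to the intended index as long as the caller supplied enough top context. A small C sketch of that index computation follows; `tap_index` is a hypothetical helper, and `uint32_t` is used to match GLSL's 32-bit `uint`.

```c
#include <assert.h>
#include <stdint.h>

/* Index of the sample `dr` rows away from output row r in column c,
 * computed as the shader does: entirely in 32-bit unsigned arithmetic. */
static uint32_t tap_index(uint32_t src_off, uint32_t stride,
                          uint32_t r, uint32_t c, int32_t dr)
{
    assert(dr >= -2 && dr <= 3);   /* the 6-tap filter reads rows -2 .. +3 */
    uint32_t col_base = src_off + c;
    /* (uint32_t)dr wraps for negative dr; the multiply and add wrap too,
     * and the wraps cancel when src_off is large enough. */
    return col_base + (r + (uint32_t)dr) * stride;
}
```

With `src_off = 1000` and `stride = 64`, the top tap of row 0, column 5 lands on index 877, exactly `1000 + 5 - 2*64`. If the contract is violated (for example `src_off = 64`), the same expression yields an index near 2^32 rather than a small negative number, so the read goes far out of bounds.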
@@ -0,0 +1,44 @@
// daedalus-fourier — H.264 luma qpel mc03 (8x8, ¾-pel vertical),
// V3D 7.1. Per H.264 §8.4.2.2.1 "n" position:
//
// dst[r,c] = ((clip255(mc02(s)[r,c]) + s[r+1, c] + 1) >> 1)
//
// Same as mc01 but L2-averages with src[r+1, c] instead of src[r, c].
//
// License: BSD-2-Clause.
#version 450
#extension GL_EXT_shader_8bit_storage : require
#extension GL_EXT_shader_explicit_arithmetic_types : require
layout(local_size_x = 64, local_size_y = 1, local_size_z = 1) in;
layout(binding = 0) readonly buffer Src { uint8_t src[]; } u_src;
layout(binding = 1) buffer Dst { uint8_t dst[]; } u_dst;
layout(binding = 2) readonly buffer Meta { uvec4 meta[]; } u_meta;
layout(push_constant) uniform PC { uint n_blocks, stride_u8, _p0, _p1; } pc;
void main()
{
uint block_idx = gl_WorkGroupID.x;
if (block_idx >= pc.n_blocks) return;
uint lane = gl_LocalInvocationID.x;
uint r = lane >> 3, c = lane & 7u;
uint dst_off = u_meta.meta[block_idx].x;
uint src_off = u_meta.meta[block_idx].y;
uint stride = pc.stride_u8;
uint col_base = src_off + c;
int s_m2 = int(u_src.src[col_base + (r - 2u) * stride]);
int s_m1 = int(u_src.src[col_base + (r - 1u) * stride]);
int s_0 = int(u_src.src[col_base + r * stride]);
int s_p1 = int(u_src.src[col_base + (r + 1u) * stride]);
int s_p2 = int(u_src.src[col_base + (r + 2u) * stride]);
int s_p3 = int(u_src.src[col_base + (r + 3u) * stride]);
int v = s_m2 - 5 * s_m1 + 20 * s_0 + 20 * s_p1 - 5 * s_p2 + s_p3 + 16;
int vp = clamp(v >> 5, 0, 255);
int avg = (vp + s_p1 + 1) >> 1; // L2 with src[r+1, c]
u_dst.dst[dst_off + r * stride + c] = uint8_t(avg);
}
@@ -0,0 +1,47 @@
// daedalus-fourier — H.264 luma qpel mc10 (8x8, ¼-pel horizontal),
// V3D 7.1. Per H.264 §8.4.2.2.1 "a" position:
//
// dst[r,c] = ((clip255(mc20(s)[r,c]) + s[r,c] + 1) >> 1)
//
// = horizontal half-pel filter, clipped to u8, then L2 rounded-averaged
// with the integer source pixel at the SAME position. Sibling of
// v3d_h264_qpel_mc20.comp with the L2 step added at the tail.
//
// License: BSD-2-Clause.
#version 450
#extension GL_EXT_shader_8bit_storage : require
#extension GL_EXT_shader_explicit_arithmetic_types : require
layout(local_size_x = 64, local_size_y = 1, local_size_z = 1) in;
layout(binding = 0) readonly buffer Src { uint8_t src[]; } u_src;
layout(binding = 1) buffer Dst { uint8_t dst[]; } u_dst;
layout(binding = 2) readonly buffer Meta { uvec4 meta[]; } u_meta;
layout(push_constant) uniform PC { uint n_blocks, stride_u8, _p0, _p1; } pc;
void main()
{
uint block_idx = gl_WorkGroupID.x;
if (block_idx >= pc.n_blocks) return;
uint lane = gl_LocalInvocationID.x;
uint r = lane >> 3, c = lane & 7u;
uint dst_off = u_meta.meta[block_idx].x;
uint src_off = u_meta.meta[block_idx].y;
uint stride = pc.stride_u8;
uint row_base = src_off + r * stride + c;
int s_m2 = int(u_src.src[row_base - 2u]);
int s_m1 = int(u_src.src[row_base - 1u]);
int s_0 = int(u_src.src[row_base ]);
int s_p1 = int(u_src.src[row_base + 1u]);
int s_p2 = int(u_src.src[row_base + 2u]);
int s_p3 = int(u_src.src[row_base + 3u]);
int v = s_m2 - 5 * s_m1 + 20 * s_0 + 20 * s_p1 - 5 * s_p2 + s_p3 + 16;
int hp = clamp(v >> 5, 0, 255);
// L2 average with the integer source at the SAME (r, c) position.
int avg = (hp + s_0 + 1) >> 1;
u_dst.dst[dst_off + r * stride + c] = uint8_t(avg);
}
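The per-lane arithmetic of the mc10 shader above can be checked on the CPU with a small scalar reference. This is a minimal sketch, not the project's own CPU path: the names `shift_clip255`, `hpel_h_ref`, `qpel_mc10_ref` and `qpel_mc30_ref` are hypothetical, and the caller is assumed to pass a pointer with at least 2 readable bytes to the left and 3 to the right.

```c
#include <assert.h>
#include <stdint.h>

/* Shift right, then clip to [0, 255]. Negative sums are clipped to 0
 * before the shift, which sidesteps C's implementation-defined shift of
 * negative ints and gives the same result as clamp(v >> n, 0, 255)
 * under an arithmetic shift. */
int shift_clip255(int v, int shift)
{
    assert(shift >= 0 && shift < 31);
    if (v < 0)
        return 0;
    v >>= shift;
    return v > 255 ? 255 : v;
}

/* Horizontal 6-tap half-pel at p; reads p[-2] .. p[3]. */
int hpel_h_ref(const uint8_t *p)
{
    int v = p[-2] - 5 * p[-1] + 20 * p[0] + 20 * p[1] - 5 * p[2] + p[3] + 16;
    return shift_clip255(v, 5);
}

/* mc10: rounded average of the half-pel value and the pixel at p. */
uint8_t qpel_mc10_ref(const uint8_t *p)
{
    return (uint8_t)((hpel_h_ref(p) + p[0] + 1) >> 1);
}

/* mc30: same filter, averaged with the pixel one column to the right. */
uint8_t qpel_mc30_ref(const uint8_t *p)
{
    return (uint8_t)((hpel_h_ref(p) + p[1] + 1) >> 1);
}
```

On a step edge `{0, 0, 0, 255, 255, 255}` with `p` at index 2, the half-pel value is 128, so mc10 gives 64 and mc30 gives 192: the two quarter-pel outputs sit on either side of the half-pel sample.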
@@ -0,0 +1,88 @@
// daedalus-fourier — H.264 luma qpel mc11 (8x8, diagonal quarter-pel),
// V3D 7.1. Per H.264 §8.4.2.2.1 (table 8-4) — composes two half-pel
// anchors via L2 rounded-average:
//
// mc11[r,c] = avg(mc20(r, c),
// mc02(r, c))
//
// Per-lane structure: each lane computes BOTH anchor outputs at its
// own (r, c) target offset, then L2 averages. No shared memory.
// Same WG geometry as the other qpel shaders.
//
// License: BSD-2-Clause.
#version 450
#extension GL_EXT_shader_8bit_storage : require
#extension GL_EXT_shader_explicit_arithmetic_types : require
layout(local_size_x = 64, local_size_y = 1, local_size_z = 1) in;
layout(binding = 0) readonly buffer Src { uint8_t src[]; } u_src;
layout(binding = 1) buffer Dst { uint8_t dst[]; } u_dst;
layout(binding = 2) readonly buffer Meta { uvec4 meta[]; } u_meta;
layout(push_constant) uniform PC { uint n_blocks, stride_u8, _p0, _p1; } pc;
int hpel_h(uint src_off, uint stride, uint r, uint c) {
uint row_base = src_off + r * stride + c;
int s_m2 = int(u_src.src[row_base - 2u]);
int s_m1 = int(u_src.src[row_base - 1u]);
int s_0 = int(u_src.src[row_base ]);
int s_p1 = int(u_src.src[row_base + 1u]);
int s_p2 = int(u_src.src[row_base + 2u]);
int s_p3 = int(u_src.src[row_base + 3u]);
int v = s_m2 - 5*s_m1 + 20*s_0 + 20*s_p1 - 5*s_p2 + s_p3 + 16;
return clamp(v >> 5, 0, 255);
}
int hpel_v(uint src_off, uint stride, uint r, uint c) {
uint col_base = src_off + c;
int s_m2 = int(u_src.src[col_base + (r - 2u) * stride]);
int s_m1 = int(u_src.src[col_base + (r - 1u) * stride]);
int s_0 = int(u_src.src[col_base + r * stride]);
int s_p1 = int(u_src.src[col_base + (r + 1u) * stride]);
int s_p2 = int(u_src.src[col_base + (r + 2u) * stride]);
int s_p3 = int(u_src.src[col_base + (r + 3u) * stride]);
int v = s_m2 - 5*s_m1 + 20*s_0 + 20*s_p1 - 5*s_p2 + s_p3 + 16;
return clamp(v >> 5, 0, 255);
}
int hpel_hv_row(uint src_off, uint stride, uint rr, uint c) {
// Single row's int16 horizontal lowpass (NOT clipped — used as
// intermediate for the vertical pass of hpel_hv).
uint row_base = src_off + rr * stride + c;
int s_m2 = int(u_src.src[row_base - 2u]);
int s_m1 = int(u_src.src[row_base - 1u]);
int s_0 = int(u_src.src[row_base ]);
int s_p1 = int(u_src.src[row_base + 1u]);
int s_p2 = int(u_src.src[row_base + 2u]);
int s_p3 = int(u_src.src[row_base + 3u]);
return s_m2 - 5*s_m1 + 20*s_0 + 20*s_p1 - 5*s_p2 + s_p3;
}
int hpel_hv(uint src_off, uint stride, uint r, uint c) {
int t0 = hpel_hv_row(src_off, stride, r - 2u, c);
int t1 = hpel_hv_row(src_off, stride, r - 1u, c);
int t2 = hpel_hv_row(src_off, stride, r, c);
int t3 = hpel_hv_row(src_off, stride, r + 1u, c);
int t4 = hpel_hv_row(src_off, stride, r + 2u, c);
int t5 = hpel_hv_row(src_off, stride, r + 3u, c);
int v = t0 - 5*t1 + 20*t2 + 20*t3 - 5*t4 + t5 + 512;
return clamp(v >> 10, 0, 255);
}
void main()
{
uint block_idx = gl_WorkGroupID.x;
if (block_idx >= pc.n_blocks) return;
uint lane = gl_LocalInvocationID.x;
uint r = lane >> 3, c = lane & 7u;
uint dst_off = u_meta.meta[block_idx].x;
uint src_off = u_meta.meta[block_idx].y;
uint stride = pc.stride_u8;
int a = hpel_h(src_off, stride, r, c);
int b = hpel_v(src_off, stride, r, c);
int avg = (a + b + 1) >> 1;
u_dst.dst[dst_off + r * stride + c] = uint8_t(avg);
}
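The eight diagonal quarter-pel shaders in this change share the same helpers and differ only in which two anchors `main()` averages. The table-driven CPU sketch below collects those pairings in one place; the anchor choices are read off the `main()` bodies in this diff, while the type and function names (`struct tap`, `QPEL_DIAG`, `anchor_eval`, `qpel_diag_ref`) are hypothetical and pointer-based addressing stands in for the shader's buffer offsets.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum anchor { ANCHOR_H, ANCHOR_V, ANCHOR_HV };

/* One half-pel anchor, sampled at (r + dr, c + dc). */
struct tap { enum anchor kind; int dr, dc; };

/* Anchor pairs as used by main() of each diagonal shader. */
const struct { const char *name; struct tap a, b; } QPEL_DIAG[8] = {
    { "mc11", { ANCHOR_H,  0, 0 }, { ANCHOR_V, 0, 0 } },
    { "mc12", { ANCHOR_HV, 0, 0 }, { ANCHOR_V, 0, 0 } },
    { "mc13", { ANCHOR_H,  1, 0 }, { ANCHOR_V, 0, 0 } },
    { "mc21", { ANCHOR_HV, 0, 0 }, { ANCHOR_H, 0, 0 } },
    { "mc23", { ANCHOR_HV, 0, 0 }, { ANCHOR_H, 1, 0 } },
    { "mc31", { ANCHOR_H,  0, 0 }, { ANCHOR_V, 0, 1 } },
    { "mc32", { ANCHOR_HV, 0, 0 }, { ANCHOR_V, 0, 1 } },
    { "mc33", { ANCHOR_H,  1, 0 }, { ANCHOR_V, 0, 1 } },
};

/* Unscaled 6-tap sum over samples spaced `step` apart around p. */
int tap6(const uint8_t *p, ptrdiff_t step)
{
    return p[-2 * step] - 5 * p[-step] + 20 * p[0]
         + 20 * p[step] - 5 * p[2 * step] + p[3 * step];
}

/* Shift right and clip to [0, 255]; negatives clip to 0 first. */
int clip_shift(int v, int shift)
{
    if (v < 0)
        return 0;
    v >>= shift;
    return v > 255 ? 255 : v;
}

int anchor_eval(struct tap t, const uint8_t *p, ptrdiff_t stride)
{
    const uint8_t *q = p + t.dr * stride + t.dc;
    int rows[6];
    switch (t.kind) {
    case ANCHOR_H:
        return clip_shift(tap6(q, 1) + 16, 5);
    case ANCHOR_V:
        return clip_shift(tap6(q, stride) + 16, 5);
    default: /* ANCHOR_HV: unclipped horizontal pass, then vertical */
        for (int i = 0; i < 6; i++)
            rows[i] = tap6(q + (i - 2) * stride, 1);
        return clip_shift(rows[0] - 5 * rows[1] + 20 * rows[2]
                          + 20 * rows[3] - 5 * rows[4] + rows[5] + 512, 10);
    }
}

/* Rounded average of the two anchors for diagonal position idx. */
uint8_t qpel_diag_ref(int idx, const uint8_t *p, ptrdiff_t stride)
{
    assert(idx >= 0 && idx < 8);
    int a = anchor_eval(QPEL_DIAG[idx].a, p, stride);
    int b = anchor_eval(QPEL_DIAG[idx].b, p, stride);
    return (uint8_t)((a + b + 1) >> 1);
}
```

A table like this could also serve as a cross-check when comparing shader output against a CPU result, since a wrong anchor offset in one shader shows up as a mismatch for exactly one index.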
@@ -0,0 +1,88 @@
// daedalus-fourier — H.264 luma qpel mc12 (8x8, diagonal quarter-pel),
// V3D 7.1. Per H.264 §8.4.2.2.1 (table 8-4) — composes two half-pel
// anchors via L2 rounded-average:
//
// mc12[r,c] = avg(mc22(r, c),
// mc02(r, c))
//
// Per-lane structure: each lane computes BOTH anchor outputs at its
// own (r, c) target offset, then L2 averages. No shared memory.
// Same WG geometry as the other qpel shaders.
//
// License: BSD-2-Clause.
#version 450
#extension GL_EXT_shader_8bit_storage : require
#extension GL_EXT_shader_explicit_arithmetic_types : require
layout(local_size_x = 64, local_size_y = 1, local_size_z = 1) in;
layout(binding = 0) readonly buffer Src { uint8_t src[]; } u_src;
layout(binding = 1) buffer Dst { uint8_t dst[]; } u_dst;
layout(binding = 2) readonly buffer Meta { uvec4 meta[]; } u_meta;
layout(push_constant) uniform PC { uint n_blocks, stride_u8, _p0, _p1; } pc;
int hpel_h(uint src_off, uint stride, uint r, uint c) {
uint row_base = src_off + r * stride + c;
int s_m2 = int(u_src.src[row_base - 2u]);
int s_m1 = int(u_src.src[row_base - 1u]);
int s_0 = int(u_src.src[row_base ]);
int s_p1 = int(u_src.src[row_base + 1u]);
int s_p2 = int(u_src.src[row_base + 2u]);
int s_p3 = int(u_src.src[row_base + 3u]);
int v = s_m2 - 5*s_m1 + 20*s_0 + 20*s_p1 - 5*s_p2 + s_p3 + 16;
return clamp(v >> 5, 0, 255);
}
int hpel_v(uint src_off, uint stride, uint r, uint c) {
uint col_base = src_off + c;
int s_m2 = int(u_src.src[col_base + (r - 2u) * stride]);
int s_m1 = int(u_src.src[col_base + (r - 1u) * stride]);
int s_0 = int(u_src.src[col_base + r * stride]);
int s_p1 = int(u_src.src[col_base + (r + 1u) * stride]);
int s_p2 = int(u_src.src[col_base + (r + 2u) * stride]);
int s_p3 = int(u_src.src[col_base + (r + 3u) * stride]);
int v = s_m2 - 5*s_m1 + 20*s_0 + 20*s_p1 - 5*s_p2 + s_p3 + 16;
return clamp(v >> 5, 0, 255);
}
int hpel_hv_row(uint src_off, uint stride, uint rr, uint c) {
// Single row's int16 horizontal lowpass (NOT clipped — used as
// intermediate for the vertical pass of hpel_hv).
uint row_base = src_off + rr * stride + c;
int s_m2 = int(u_src.src[row_base - 2u]);
int s_m1 = int(u_src.src[row_base - 1u]);
int s_0 = int(u_src.src[row_base ]);
int s_p1 = int(u_src.src[row_base + 1u]);
int s_p2 = int(u_src.src[row_base + 2u]);
int s_p3 = int(u_src.src[row_base + 3u]);
return s_m2 - 5*s_m1 + 20*s_0 + 20*s_p1 - 5*s_p2 + s_p3;
}
int hpel_hv(uint src_off, uint stride, uint r, uint c) {
int t0 = hpel_hv_row(src_off, stride, r - 2u, c);
int t1 = hpel_hv_row(src_off, stride, r - 1u, c);
int t2 = hpel_hv_row(src_off, stride, r, c);
int t3 = hpel_hv_row(src_off, stride, r + 1u, c);
int t4 = hpel_hv_row(src_off, stride, r + 2u, c);
int t5 = hpel_hv_row(src_off, stride, r + 3u, c);
int v = t0 - 5*t1 + 20*t2 + 20*t3 - 5*t4 + t5 + 512;
return clamp(v >> 10, 0, 255);
}
void main()
{
uint block_idx = gl_WorkGroupID.x;
if (block_idx >= pc.n_blocks) return;
uint lane = gl_LocalInvocationID.x;
uint r = lane >> 3, c = lane & 7u;
uint dst_off = u_meta.meta[block_idx].x;
uint src_off = u_meta.meta[block_idx].y;
uint stride = pc.stride_u8;
int a = hpel_hv(src_off, stride, r, c);
int b = hpel_v(src_off, stride, r, c);
int avg = (a + b + 1) >> 1;
u_dst.dst[dst_off + r * stride + c] = uint8_t(avg);
}
@@ -0,0 +1,88 @@
// daedalus-fourier — H.264 luma qpel mc13 (8x8, diagonal quarter-pel),
// V3D 7.1. Per H.264 §8.4.2.2.1 (table 8-4) — composes two half-pel
// anchors via L2 rounded-average:
//
// mc13[r,c] = avg(mc20(r+1, c),
// mc02(r, c))
//
// Per-lane structure: each lane computes BOTH anchor outputs at its
// own (r, c) target offset, then L2 averages. No shared memory.
// Same WG geometry as the other qpel shaders.
//
// License: BSD-2-Clause.
#version 450
#extension GL_EXT_shader_8bit_storage : require
#extension GL_EXT_shader_explicit_arithmetic_types : require
layout(local_size_x = 64, local_size_y = 1, local_size_z = 1) in;
layout(binding = 0) readonly buffer Src { uint8_t src[]; } u_src;
layout(binding = 1) buffer Dst { uint8_t dst[]; } u_dst;
layout(binding = 2) readonly buffer Meta { uvec4 meta[]; } u_meta;
layout(push_constant) uniform PC { uint n_blocks, stride_u8, _p0, _p1; } pc;
int hpel_h(uint src_off, uint stride, uint r, uint c) {
uint row_base = src_off + r * stride + c;
int s_m2 = int(u_src.src[row_base - 2u]);
int s_m1 = int(u_src.src[row_base - 1u]);
int s_0 = int(u_src.src[row_base ]);
int s_p1 = int(u_src.src[row_base + 1u]);
int s_p2 = int(u_src.src[row_base + 2u]);
int s_p3 = int(u_src.src[row_base + 3u]);
int v = s_m2 - 5*s_m1 + 20*s_0 + 20*s_p1 - 5*s_p2 + s_p3 + 16;
return clamp(v >> 5, 0, 255);
}
int hpel_v(uint src_off, uint stride, uint r, uint c) {
uint col_base = src_off + c;
int s_m2 = int(u_src.src[col_base + (r - 2u) * stride]);
int s_m1 = int(u_src.src[col_base + (r - 1u) * stride]);
int s_0 = int(u_src.src[col_base + r * stride]);
int s_p1 = int(u_src.src[col_base + (r + 1u) * stride]);
int s_p2 = int(u_src.src[col_base + (r + 2u) * stride]);
int s_p3 = int(u_src.src[col_base + (r + 3u) * stride]);
int v = s_m2 - 5*s_m1 + 20*s_0 + 20*s_p1 - 5*s_p2 + s_p3 + 16;
return clamp(v >> 5, 0, 255);
}
int hpel_hv_row(uint src_off, uint stride, uint rr, uint c) {
// Single row's int16 horizontal lowpass (NOT clipped — used as
// intermediate for the vertical pass of hpel_hv).
uint row_base = src_off + rr * stride + c;
int s_m2 = int(u_src.src[row_base - 2u]);
int s_m1 = int(u_src.src[row_base - 1u]);
int s_0 = int(u_src.src[row_base ]);
int s_p1 = int(u_src.src[row_base + 1u]);
int s_p2 = int(u_src.src[row_base + 2u]);
int s_p3 = int(u_src.src[row_base + 3u]);
return s_m2 - 5*s_m1 + 20*s_0 + 20*s_p1 - 5*s_p2 + s_p3;
}
int hpel_hv(uint src_off, uint stride, uint r, uint c) {
int t0 = hpel_hv_row(src_off, stride, r - 2u, c);
int t1 = hpel_hv_row(src_off, stride, r - 1u, c);
int t2 = hpel_hv_row(src_off, stride, r, c);
int t3 = hpel_hv_row(src_off, stride, r + 1u, c);
int t4 = hpel_hv_row(src_off, stride, r + 2u, c);
int t5 = hpel_hv_row(src_off, stride, r + 3u, c);
int v = t0 - 5*t1 + 20*t2 + 20*t3 - 5*t4 + t5 + 512;
return clamp(v >> 10, 0, 255);
}
void main()
{
uint block_idx = gl_WorkGroupID.x;
if (block_idx >= pc.n_blocks) return;
uint lane = gl_LocalInvocationID.x;
uint r = lane >> 3, c = lane & 7u;
uint dst_off = u_meta.meta[block_idx].x;
uint src_off = u_meta.meta[block_idx].y;
uint stride = pc.stride_u8;
int a = hpel_h(src_off, stride, r+1u, c);
int b = hpel_v(src_off, stride, r, c);
int avg = (a + b + 1) >> 1;
u_dst.dst[dst_off + r * stride + c] = uint8_t(avg);
}
@@ -0,0 +1,88 @@
// daedalus-fourier — H.264 luma qpel mc21 (8x8, diagonal quarter-pel),
// V3D 7.1. Per H.264 §8.4.2.2.1 (table 8-4) — composes two half-pel
// anchors via L2 rounded-average:
//
// mc21[r,c] = avg(mc22(r, c),
// mc20(r, c))
//
// Per-lane structure: each lane computes BOTH anchor outputs at its
// own (r, c) target offset, then L2 averages. No shared memory.
// Same WG geometry as the other qpel shaders.
//
// License: BSD-2-Clause.
#version 450
#extension GL_EXT_shader_8bit_storage : require
#extension GL_EXT_shader_explicit_arithmetic_types : require
layout(local_size_x = 64, local_size_y = 1, local_size_z = 1) in;
layout(binding = 0) readonly buffer Src { uint8_t src[]; } u_src;
layout(binding = 1) buffer Dst { uint8_t dst[]; } u_dst;
layout(binding = 2) readonly buffer Meta { uvec4 meta[]; } u_meta;
layout(push_constant) uniform PC { uint n_blocks, stride_u8, _p0, _p1; } pc;
int hpel_h(uint src_off, uint stride, uint r, uint c) {
uint row_base = src_off + r * stride + c;
int s_m2 = int(u_src.src[row_base - 2u]);
int s_m1 = int(u_src.src[row_base - 1u]);
int s_0 = int(u_src.src[row_base ]);
int s_p1 = int(u_src.src[row_base + 1u]);
int s_p2 = int(u_src.src[row_base + 2u]);
int s_p3 = int(u_src.src[row_base + 3u]);
int v = s_m2 - 5*s_m1 + 20*s_0 + 20*s_p1 - 5*s_p2 + s_p3 + 16;
return clamp(v >> 5, 0, 255);
}
int hpel_v(uint src_off, uint stride, uint r, uint c) {
uint col_base = src_off + c;
int s_m2 = int(u_src.src[col_base + (r - 2u) * stride]);
int s_m1 = int(u_src.src[col_base + (r - 1u) * stride]);
int s_0 = int(u_src.src[col_base + r * stride]);
int s_p1 = int(u_src.src[col_base + (r + 1u) * stride]);
int s_p2 = int(u_src.src[col_base + (r + 2u) * stride]);
int s_p3 = int(u_src.src[col_base + (r + 3u) * stride]);
int v = s_m2 - 5*s_m1 + 20*s_0 + 20*s_p1 - 5*s_p2 + s_p3 + 16;
return clamp(v >> 5, 0, 255);
}
int hpel_hv_row(uint src_off, uint stride, uint rr, uint c) {
// Single row's int16 horizontal lowpass (NOT clipped — used as
// intermediate for the vertical pass of hpel_hv).
uint row_base = src_off + rr * stride + c;
int s_m2 = int(u_src.src[row_base - 2u]);
int s_m1 = int(u_src.src[row_base - 1u]);
int s_0 = int(u_src.src[row_base ]);
int s_p1 = int(u_src.src[row_base + 1u]);
int s_p2 = int(u_src.src[row_base + 2u]);
int s_p3 = int(u_src.src[row_base + 3u]);
return s_m2 - 5*s_m1 + 20*s_0 + 20*s_p1 - 5*s_p2 + s_p3;
}
int hpel_hv(uint src_off, uint stride, uint r, uint c) {
int t0 = hpel_hv_row(src_off, stride, r - 2u, c);
int t1 = hpel_hv_row(src_off, stride, r - 1u, c);
int t2 = hpel_hv_row(src_off, stride, r, c);
int t3 = hpel_hv_row(src_off, stride, r + 1u, c);
int t4 = hpel_hv_row(src_off, stride, r + 2u, c);
int t5 = hpel_hv_row(src_off, stride, r + 3u, c);
int v = t0 - 5*t1 + 20*t2 + 20*t3 - 5*t4 + t5 + 512;
return clamp(v >> 10, 0, 255);
}
void main()
{
uint block_idx = gl_WorkGroupID.x;
if (block_idx >= pc.n_blocks) return;
uint lane = gl_LocalInvocationID.x;
uint r = lane >> 3, c = lane & 7u;
uint dst_off = u_meta.meta[block_idx].x;
uint src_off = u_meta.meta[block_idx].y;
uint stride = pc.stride_u8;
int a = hpel_hv(src_off, stride, r, c);
int b = hpel_h(src_off, stride, r, c);
int avg = (a + b + 1) >> 1;
u_dst.dst[dst_off + r * stride + c] = uint8_t(avg);
}
@@ -0,0 +1,86 @@
// daedalus-fourier — H.264 luma qpel mc22 (8x8, 2D half-pel "j" position).
// V3D 7.1.
//
// Cascaded H+V 6-tap per H.264 §8.4.2.2.1 / FFmpeg ff_put_h264_qpel8_mc22_neon:
//
// tmp[r,c] = src[r,c-2] - 5*src[r,c-1] + 20*src[r,c] + 20*src[r,c+1]
// - 5*src[r,c+2] + src[r,c+3] (int16)
//
// dst[r,c] = clip255((tmp[r-2,c] - 5*tmp[r-1,c] + 20*tmp[r,c]
// + 20*tmp[r+1,c] - 5*tmp[r+2,c] + tmp[r+3,c]
// + 512) >> 10)
//
// The +512 >> 10 final scale compensates for both 6-tap scalings.
// CANNOT just cascade mc20→mc02 because intermediate must be int16
// (no per-stage clip), so this is a dedicated kernel.
//
// Per-lane structure: each lane computes its own (r, c) output by
// running the FULL cascade — 6 horizontal lowpass int16 values for
// rows r-2..r+3, then a vertical lowpass on those. ~50 ALU ops per
// lane. No shared memory / barriers needed; the V3D L2 cache is
// expected to absorb the redundant src reads across lanes.
//
// WG layout: 64 lanes / 1 block-per-WG / 1 lane-per-output-pixel
// (same as mc20 / mc02).
//
// License: BSD-2-Clause.
#version 450
#extension GL_EXT_shader_8bit_storage : require
#extension GL_EXT_shader_explicit_arithmetic_types : require
layout(local_size_x = 64, local_size_y = 1, local_size_z = 1) in;
layout(binding = 0) readonly buffer Src { uint8_t src[]; } u_src;
layout(binding = 1) buffer Dst { uint8_t dst[]; } u_dst;
layout(binding = 2) readonly buffer Meta { uvec4 meta[]; } u_meta;
layout(push_constant) uniform PC {
uint n_blocks;
uint stride_u8;
uint _pad0, _pad1;
} pc;
// Horizontal 6-tap filter at (row_off, c) — reads src at cols c-2..c+3
// of the row identified by row_off, returns int16 intermediate (NOT
// scaled — the v-pass does the +512 >> 10 for both stages).
int hpel_h(uint row_off, uint c)
{
int s_m2 = int(u_src.src[row_off + c - 2u]);
int s_m1 = int(u_src.src[row_off + c - 1u]);
int s_0 = int(u_src.src[row_off + c ]);
int s_p1 = int(u_src.src[row_off + c + 1u]);
int s_p2 = int(u_src.src[row_off + c + 2u]);
int s_p3 = int(u_src.src[row_off + c + 3u]);
return s_m2 - 5 * s_m1 + 20 * s_0 + 20 * s_p1 - 5 * s_p2 + s_p3;
}
void main()
{
uint block_idx = gl_WorkGroupID.x;
if (block_idx >= pc.n_blocks) return;
uint lane = gl_LocalInvocationID.x;
uint r = lane >> 3;
uint c = lane & 7u;
uint dst_off = u_meta.meta[block_idx].x;
uint src_off = u_meta.meta[block_idx].y;
uint stride = pc.stride_u8;
// Compute 6 horizontal lowpass values at rows r-2..r+3 of column c,
// where src_off + r*stride + c is the source pixel co-located with
// this lane's output.
// Unsigned-safe: (r - 2u) wraps for r < 2, but adding src_off cancels
// the wrap modulo 2^32, given src_off >= 2*stride per the caller
// contract (the c - 2 tap in hpel_h needs 2 more bytes of headroom).
int t0 = hpel_h(src_off + (r - 2u) * stride, c);
int t1 = hpel_h(src_off + (r - 1u) * stride, c);
int t2 = hpel_h(src_off + r * stride, c);
int t3 = hpel_h(src_off + (r + 1u) * stride, c);
int t4 = hpel_h(src_off + (r + 2u) * stride, c);
int t5 = hpel_h(src_off + (r + 3u) * stride, c);
int v = t0 - 5 * t1 + 20 * t2 + 20 * t3 - 5 * t4 + t5 + 512;
int p = clamp(v >> 10, 0, 255);
u_dst.dst[dst_off + r * stride + c] = uint8_t(p);
}
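The mc22 shader above relies on unsigned offsets wrapping and then cancelling once `src_off` is added. The scalar sketch below mirrors that index arithmetic with `uint32_t` so the wraparound can be exercised on the CPU. It is an illustration under assumptions: `hpel_h_raw_ref` and `qpel_mc22_ref` are hypothetical names, and the assert encodes the headroom the arithmetic needs for lane (0, 0), which may be stricter than whatever the real caller contract states.

```c
#include <assert.h>
#include <stdint.h>

/* Unscaled horizontal 6-tap at src[row_off + c], using the same
 * unsigned index arithmetic as the shader. */
int hpel_h_raw_ref(const uint8_t *src, uint32_t row_off, uint32_t c)
{
    int s_m2 = src[(uint32_t)(row_off + c - 2u)];
    int s_m1 = src[(uint32_t)(row_off + c - 1u)];
    int s_0  = src[(uint32_t)(row_off + c)];
    int s_p1 = src[(uint32_t)(row_off + c + 1u)];
    int s_p2 = src[(uint32_t)(row_off + c + 2u)];
    int s_p3 = src[(uint32_t)(row_off + c + 3u)];
    return s_m2 - 5 * s_m1 + 20 * s_0 + 20 * s_p1 - 5 * s_p2 + s_p3;
}

/* 2D half-pel ("j" position) for the output pixel at (r, c). */
uint8_t qpel_mc22_ref(const uint8_t *src, uint32_t src_off,
                      uint32_t stride, uint32_t r, uint32_t c)
{
    /* Lane (0, 0) reads index src_off - 2*stride - 2, which must not
     * drop below zero. */
    assert(src_off >= 2u * stride + 2u);

    int t[6];
    for (uint32_t i = 0; i < 6u; i++) {
        /* (r + i - 2u) wraps for small r; adding src_off cancels the
         * wrap modulo 2^32, leaving the intended row offset. */
        uint32_t row_off = (uint32_t)(src_off + (r + i - 2u) * stride);
        t[i] = hpel_h_raw_ref(src, row_off, c);
    }

    /* Single rounding step for both passes: +512, then >> 10. */
    int v = t[0] - 5 * t[1] + 20 * t[2] + 20 * t[3] - 5 * t[4] + t[5] + 512;
    if (v < 0)
        return 0;
    v >>= 10;
    return (uint8_t)(v > 255 ? 255 : v);
}
```

With a 6x6 buffer, `stride = 6` and `src_off = 14`, lane (0, 0) touches indices 0 through 35, so the first tap lands exactly on the start of the buffer.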
@@ -0,0 +1,88 @@
// daedalus-fourier — H.264 luma qpel mc23 (8x8, diagonal quarter-pel),
// V3D 7.1. Per H.264 §8.4.2.2.1 (table 8-4) — composes two half-pel
// anchors via L2 rounded-average:
//
// mc23[r,c] = avg(mc22(r, c),
// mc20(r+1, c))
//
// Per-lane structure: each lane computes BOTH anchor outputs at its
// own (r, c) target offset, then L2 averages. No shared memory.
// Same WG geometry as the other qpel shaders.
//
// License: BSD-2-Clause.
#version 450
#extension GL_EXT_shader_8bit_storage : require
#extension GL_EXT_shader_explicit_arithmetic_types : require
layout(local_size_x = 64, local_size_y = 1, local_size_z = 1) in;
layout(binding = 0) readonly buffer Src { uint8_t src[]; } u_src;
layout(binding = 1) buffer Dst { uint8_t dst[]; } u_dst;
layout(binding = 2) readonly buffer Meta { uvec4 meta[]; } u_meta;
layout(push_constant) uniform PC { uint n_blocks, stride_u8, _p0, _p1; } pc;
int hpel_h(uint src_off, uint stride, uint r, uint c) {
uint row_base = src_off + r * stride + c;
int s_m2 = int(u_src.src[row_base - 2u]);
int s_m1 = int(u_src.src[row_base - 1u]);
int s_0 = int(u_src.src[row_base ]);
int s_p1 = int(u_src.src[row_base + 1u]);
int s_p2 = int(u_src.src[row_base + 2u]);
int s_p3 = int(u_src.src[row_base + 3u]);
int v = s_m2 - 5*s_m1 + 20*s_0 + 20*s_p1 - 5*s_p2 + s_p3 + 16;
return clamp(v >> 5, 0, 255);
}
int hpel_v(uint src_off, uint stride, uint r, uint c) {
uint col_base = src_off + c;
int s_m2 = int(u_src.src[col_base + (r - 2u) * stride]);
int s_m1 = int(u_src.src[col_base + (r - 1u) * stride]);
int s_0 = int(u_src.src[col_base + r * stride]);
int s_p1 = int(u_src.src[col_base + (r + 1u) * stride]);
int s_p2 = int(u_src.src[col_base + (r + 2u) * stride]);
int s_p3 = int(u_src.src[col_base + (r + 3u) * stride]);
int v = s_m2 - 5*s_m1 + 20*s_0 + 20*s_p1 - 5*s_p2 + s_p3 + 16;
return clamp(v >> 5, 0, 255);
}
int hpel_hv_row(uint src_off, uint stride, uint rr, uint c) {
// Single row's int16 horizontal lowpass (NOT clipped — used as
// intermediate for the vertical pass of hpel_hv).
uint row_base = src_off + rr * stride + c;
int s_m2 = int(u_src.src[row_base - 2u]);
int s_m1 = int(u_src.src[row_base - 1u]);
int s_0 = int(u_src.src[row_base ]);
int s_p1 = int(u_src.src[row_base + 1u]);
int s_p2 = int(u_src.src[row_base + 2u]);
int s_p3 = int(u_src.src[row_base + 3u]);
return s_m2 - 5*s_m1 + 20*s_0 + 20*s_p1 - 5*s_p2 + s_p3;
}
int hpel_hv(uint src_off, uint stride, uint r, uint c) {
int t0 = hpel_hv_row(src_off, stride, r - 2u, c);
int t1 = hpel_hv_row(src_off, stride, r - 1u, c);
int t2 = hpel_hv_row(src_off, stride, r, c);
int t3 = hpel_hv_row(src_off, stride, r + 1u, c);
int t4 = hpel_hv_row(src_off, stride, r + 2u, c);
int t5 = hpel_hv_row(src_off, stride, r + 3u, c);
int v = t0 - 5*t1 + 20*t2 + 20*t3 - 5*t4 + t5 + 512;
return clamp(v >> 10, 0, 255);
}
void main()
{
uint block_idx = gl_WorkGroupID.x;
if (block_idx >= pc.n_blocks) return;
uint lane = gl_LocalInvocationID.x;
uint r = lane >> 3, c = lane & 7u;
uint dst_off = u_meta.meta[block_idx].x;
uint src_off = u_meta.meta[block_idx].y;
uint stride = pc.stride_u8;
int a = hpel_hv(src_off, stride, r, c);
int b = hpel_h(src_off, stride, r+1u, c);
int avg = (a + b + 1) >> 1;
u_dst.dst[dst_off + r * stride + c] = uint8_t(avg);
}
@@ -0,0 +1,44 @@
// daedalus-fourier — H.264 luma qpel mc30 (8x8, ¾-pel horizontal),
// V3D 7.1. Per H.264 §8.4.2.2.1 "c" position:
//
// dst[r,c] = ((clip255(mc20(s)[r,c]) + s[r,c+1] + 1) >> 1)
//
// Same as mc10 but L2-averages with src[r, c+1] instead of src[r, c].
//
// License: BSD-2-Clause.
#version 450
#extension GL_EXT_shader_8bit_storage : require
#extension GL_EXT_shader_explicit_arithmetic_types : require
layout(local_size_x = 64, local_size_y = 1, local_size_z = 1) in;
layout(binding = 0) readonly buffer Src { uint8_t src[]; } u_src;
layout(binding = 1) buffer Dst { uint8_t dst[]; } u_dst;
layout(binding = 2) readonly buffer Meta { uvec4 meta[]; } u_meta;
layout(push_constant) uniform PC { uint n_blocks, stride_u8, _p0, _p1; } pc;
void main()
{
uint block_idx = gl_WorkGroupID.x;
if (block_idx >= pc.n_blocks) return;
uint lane = gl_LocalInvocationID.x;
uint r = lane >> 3, c = lane & 7u;
uint dst_off = u_meta.meta[block_idx].x;
uint src_off = u_meta.meta[block_idx].y;
uint stride = pc.stride_u8;
uint row_base = src_off + r * stride + c;
int s_m2 = int(u_src.src[row_base - 2u]);
int s_m1 = int(u_src.src[row_base - 1u]);
int s_0 = int(u_src.src[row_base ]);
int s_p1 = int(u_src.src[row_base + 1u]);
int s_p2 = int(u_src.src[row_base + 2u]);
int s_p3 = int(u_src.src[row_base + 3u]);
int v = s_m2 - 5 * s_m1 + 20 * s_0 + 20 * s_p1 - 5 * s_p2 + s_p3 + 16;
int hp = clamp(v >> 5, 0, 255);
int avg = (hp + s_p1 + 1) >> 1; // L2 with src[r, c+1]
u_dst.dst[dst_off + r * stride + c] = uint8_t(avg);
}
@@ -0,0 +1,88 @@
// daedalus-fourier — H.264 luma qpel mc31 (8x8, diagonal quarter-pel),
// V3D 7.1. Per H.264 §8.4.2.2.1 (table 8-4) — composes two half-pel
// anchors via L2 rounded-average:
//
// mc31[r,c] = avg(mc20(r, c),
// mc02(r, c+1))
//
// Per-lane structure: each lane computes BOTH anchor outputs at its
// own (r, c) target offset, then L2 averages. No shared memory.
// Same WG geometry as the other qpel shaders.
//
// License: BSD-2-Clause.
#version 450
#extension GL_EXT_shader_8bit_storage : require
#extension GL_EXT_shader_explicit_arithmetic_types : require
layout(local_size_x = 64, local_size_y = 1, local_size_z = 1) in;
layout(binding = 0) readonly buffer Src { uint8_t src[]; } u_src;
layout(binding = 1) buffer Dst { uint8_t dst[]; } u_dst;
layout(binding = 2) readonly buffer Meta { uvec4 meta[]; } u_meta;
layout(push_constant) uniform PC { uint n_blocks, stride_u8, _p0, _p1; } pc;
int hpel_h(uint src_off, uint stride, uint r, uint c) {
uint row_base = src_off + r * stride + c;
int s_m2 = int(u_src.src[row_base - 2u]);
int s_m1 = int(u_src.src[row_base - 1u]);
int s_0 = int(u_src.src[row_base ]);
int s_p1 = int(u_src.src[row_base + 1u]);
int s_p2 = int(u_src.src[row_base + 2u]);
int s_p3 = int(u_src.src[row_base + 3u]);
int v = s_m2 - 5*s_m1 + 20*s_0 + 20*s_p1 - 5*s_p2 + s_p3 + 16;
return clamp(v >> 5, 0, 255);
}
int hpel_v(uint src_off, uint stride, uint r, uint c) {
uint col_base = src_off + c;
int s_m2 = int(u_src.src[col_base + (r - 2u) * stride]);
int s_m1 = int(u_src.src[col_base + (r - 1u) * stride]);
int s_0 = int(u_src.src[col_base + r * stride]);
int s_p1 = int(u_src.src[col_base + (r + 1u) * stride]);
int s_p2 = int(u_src.src[col_base + (r + 2u) * stride]);
int s_p3 = int(u_src.src[col_base + (r + 3u) * stride]);
int v = s_m2 - 5*s_m1 + 20*s_0 + 20*s_p1 - 5*s_p2 + s_p3 + 16;
return clamp(v >> 5, 0, 255);
}
int hpel_hv_row(uint src_off, uint stride, uint rr, uint c) {
// Single row's int16 horizontal lowpass (NOT clipped — used as
// intermediate for the vertical pass of hpel_hv).
uint row_base = src_off + rr * stride + c;
int s_m2 = int(u_src.src[row_base - 2u]);
int s_m1 = int(u_src.src[row_base - 1u]);
int s_0 = int(u_src.src[row_base ]);
int s_p1 = int(u_src.src[row_base + 1u]);
int s_p2 = int(u_src.src[row_base + 2u]);
int s_p3 = int(u_src.src[row_base + 3u]);
return s_m2 - 5*s_m1 + 20*s_0 + 20*s_p1 - 5*s_p2 + s_p3;
}
int hpel_hv(uint src_off, uint stride, uint r, uint c) {
int t0 = hpel_hv_row(src_off, stride, r - 2u, c);
int t1 = hpel_hv_row(src_off, stride, r - 1u, c);
int t2 = hpel_hv_row(src_off, stride, r, c);
int t3 = hpel_hv_row(src_off, stride, r + 1u, c);
int t4 = hpel_hv_row(src_off, stride, r + 2u, c);
int t5 = hpel_hv_row(src_off, stride, r + 3u, c);
int v = t0 - 5*t1 + 20*t2 + 20*t3 - 5*t4 + t5 + 512;
return clamp(v >> 10, 0, 255);
}
void main()
{
uint block_idx = gl_WorkGroupID.x;
if (block_idx >= pc.n_blocks) return;
uint lane = gl_LocalInvocationID.x;
uint r = lane >> 3, c = lane & 7u;
uint dst_off = u_meta.meta[block_idx].x;
uint src_off = u_meta.meta[block_idx].y;
uint stride = pc.stride_u8;
int a = hpel_h(src_off, stride, r, c);
int b = hpel_v(src_off, stride, r, c+1u);
int avg = (a + b + 1) >> 1;
u_dst.dst[dst_off + r * stride + c] = uint8_t(avg);
}
@@ -0,0 +1,88 @@
// daedalus-fourier — H.264 luma qpel mc32 (8x8, diagonal quarter-pel),
// V3D 7.1. Per H.264 §8.4.2.2.1 (table 8-4) — composes two half-pel
// anchors via L2 rounded-average:
//
// mc32[r,c] = avg(mc22(r, c),
// mc02(r, c+1))
//
// Per-lane structure: each lane computes BOTH anchor outputs at its
// own (r, c) target offset, then L2 averages. No shared memory.
// Same WG geometry as the other qpel shaders.
//
// License: BSD-2-Clause.
#version 450
#extension GL_EXT_shader_8bit_storage : require
#extension GL_EXT_shader_explicit_arithmetic_types : require
layout(local_size_x = 64, local_size_y = 1, local_size_z = 1) in;
layout(binding = 0) readonly buffer Src { uint8_t src[]; } u_src;
layout(binding = 1) buffer Dst { uint8_t dst[]; } u_dst;
layout(binding = 2) readonly buffer Meta { uvec4 meta[]; } u_meta;
layout(push_constant) uniform PC { uint n_blocks, stride_u8, _p0, _p1; } pc;
int hpel_h(uint src_off, uint stride, uint r, uint c) {
uint row_base = src_off + r * stride + c;
int s_m2 = int(u_src.src[row_base - 2u]);
int s_m1 = int(u_src.src[row_base - 1u]);
int s_0 = int(u_src.src[row_base ]);
int s_p1 = int(u_src.src[row_base + 1u]);
int s_p2 = int(u_src.src[row_base + 2u]);
int s_p3 = int(u_src.src[row_base + 3u]);
int v = s_m2 - 5*s_m1 + 20*s_0 + 20*s_p1 - 5*s_p2 + s_p3 + 16;
return clamp(v >> 5, 0, 255);
}
int hpel_v(uint src_off, uint stride, uint r, uint c) {
uint col_base = src_off + c;
int s_m2 = int(u_src.src[col_base + (r - 2u) * stride]);
int s_m1 = int(u_src.src[col_base + (r - 1u) * stride]);
int s_0 = int(u_src.src[col_base + r * stride]);
int s_p1 = int(u_src.src[col_base + (r + 1u) * stride]);
int s_p2 = int(u_src.src[col_base + (r + 2u) * stride]);
int s_p3 = int(u_src.src[col_base + (r + 3u) * stride]);
int v = s_m2 - 5*s_m1 + 20*s_0 + 20*s_p1 - 5*s_p2 + s_p3 + 16;
return clamp(v >> 5, 0, 255);
}
int hpel_hv_row(uint src_off, uint stride, uint rr, uint c) {
// Single row's int16 horizontal lowpass (NOT clipped — used as
// intermediate for the vertical pass of hpel_hv).
uint row_base = src_off + rr * stride + c;
int s_m2 = int(u_src.src[row_base - 2u]);
int s_m1 = int(u_src.src[row_base - 1u]);
int s_0 = int(u_src.src[row_base ]);
int s_p1 = int(u_src.src[row_base + 1u]);
int s_p2 = int(u_src.src[row_base + 2u]);
int s_p3 = int(u_src.src[row_base + 3u]);
return s_m2 - 5*s_m1 + 20*s_0 + 20*s_p1 - 5*s_p2 + s_p3;
}
int hpel_hv(uint src_off, uint stride, uint r, uint c) {
int t0 = hpel_hv_row(src_off, stride, r - 2u, c);
int t1 = hpel_hv_row(src_off, stride, r - 1u, c);
int t2 = hpel_hv_row(src_off, stride, r, c);
int t3 = hpel_hv_row(src_off, stride, r + 1u, c);
int t4 = hpel_hv_row(src_off, stride, r + 2u, c);
int t5 = hpel_hv_row(src_off, stride, r + 3u, c);
int v = t0 - 5*t1 + 20*t2 + 20*t3 - 5*t4 + t5 + 512;
return clamp(v >> 10, 0, 255);
}
void main()
{
uint block_idx = gl_WorkGroupID.x;
if (block_idx >= pc.n_blocks) return;
uint lane = gl_LocalInvocationID.x;
uint r = lane >> 3, c = lane & 7u;
uint dst_off = u_meta.meta[block_idx].x;
uint src_off = u_meta.meta[block_idx].y;
uint stride = pc.stride_u8;
int a = hpel_hv(src_off, stride, r, c);
int b = hpel_v(src_off, stride, r, c+1u);
int avg = (a + b + 1) >> 1;
u_dst.dst[dst_off + r * stride + c] = uint8_t(avg);
}
@@ -0,0 +1,88 @@
// daedalus-fourier — H.264 luma qpel mc33 (8x8, diagonal quarter-pel),
// V3D 7.1. Per H.264 §8.4.2.2.1 (table 8-4) — composes two half-pel
// anchors via L2 rounded-average:
//
// mc33[r,c] = avg(mc20(r+1, c),
// mc02(r, c+1))
//
// Per-lane structure: each lane computes BOTH anchor outputs at its
// own (r, c) target offset, then L2 averages. No shared memory.
// Same WG geometry as the other qpel shaders.
//
// License: BSD-2-Clause.
#version 450
#extension GL_EXT_shader_8bit_storage : require
#extension GL_EXT_shader_explicit_arithmetic_types : require
layout(local_size_x = 64, local_size_y = 1, local_size_z = 1) in;
layout(binding = 0) readonly buffer Src { uint8_t src[]; } u_src;
layout(binding = 1) buffer Dst { uint8_t dst[]; } u_dst;
layout(binding = 2) readonly buffer Meta { uvec4 meta[]; } u_meta;
layout(push_constant) uniform PC { uint n_blocks, stride_u8, _p0, _p1; } pc;
int hpel_h(uint src_off, uint stride, uint r, uint c) {
uint row_base = src_off + r * stride + c;
int s_m2 = int(u_src.src[row_base - 2u]);
int s_m1 = int(u_src.src[row_base - 1u]);
int s_0 = int(u_src.src[row_base ]);
int s_p1 = int(u_src.src[row_base + 1u]);
int s_p2 = int(u_src.src[row_base + 2u]);
int s_p3 = int(u_src.src[row_base + 3u]);
int v = s_m2 - 5*s_m1 + 20*s_0 + 20*s_p1 - 5*s_p2 + s_p3 + 16;
return clamp(v >> 5, 0, 255);
}
int hpel_v(uint src_off, uint stride, uint r, uint c) {
uint col_base = src_off + c;
int s_m2 = int(u_src.src[col_base + (r - 2u) * stride]);
int s_m1 = int(u_src.src[col_base + (r - 1u) * stride]);
int s_0 = int(u_src.src[col_base + r * stride]);
int s_p1 = int(u_src.src[col_base + (r + 1u) * stride]);
int s_p2 = int(u_src.src[col_base + (r + 2u) * stride]);
int s_p3 = int(u_src.src[col_base + (r + 3u) * stride]);
int v = s_m2 - 5*s_m1 + 20*s_0 + 20*s_p1 - 5*s_p2 + s_p3 + 16;
return clamp(v >> 5, 0, 255);
}
int hpel_hv_row(uint src_off, uint stride, uint rr, uint c) {
// Single row's int16 horizontal lowpass (NOT clipped — used as
// intermediate for the vertical pass of hpel_hv).
uint row_base = src_off + rr * stride + c;
int s_m2 = int(u_src.src[row_base - 2u]);
int s_m1 = int(u_src.src[row_base - 1u]);
int s_0 = int(u_src.src[row_base ]);
int s_p1 = int(u_src.src[row_base + 1u]);
int s_p2 = int(u_src.src[row_base + 2u]);
int s_p3 = int(u_src.src[row_base + 3u]);
return s_m2 - 5*s_m1 + 20*s_0 + 20*s_p1 - 5*s_p2 + s_p3;
}
int hpel_hv(uint src_off, uint stride, uint r, uint c) {
int t0 = hpel_hv_row(src_off, stride, r - 2u, c);
int t1 = hpel_hv_row(src_off, stride, r - 1u, c);
int t2 = hpel_hv_row(src_off, stride, r, c);
int t3 = hpel_hv_row(src_off, stride, r + 1u, c);
int t4 = hpel_hv_row(src_off, stride, r + 2u, c);
int t5 = hpel_hv_row(src_off, stride, r + 3u, c);
int v = t0 - 5*t1 + 20*t2 + 20*t3 - 5*t4 + t5 + 512;
return clamp(v >> 10, 0, 255);
}
void main()
{
uint block_idx = gl_WorkGroupID.x;
if (block_idx >= pc.n_blocks) return;
uint lane = gl_LocalInvocationID.x;
uint r = lane >> 3, c = lane & 7u;
uint dst_off = u_meta.meta[block_idx].x;
uint src_off = u_meta.meta[block_idx].y;
uint stride = pc.stride_u8;
int a = hpel_h(src_off, stride, r+1u, c);
int b = hpel_v(src_off, stride, r, c+1u);
int avg = (a + b + 1) >> 1;
u_dst.dst[dst_off + r * stride + c] = uint8_t(avg);
}
@@ -0,0 +1,69 @@
// daedalus-fourier — H.264 chroma 4:2:0 H loop filter (horizontal
// filter across a vertical edge), non-intra bS<4 variant.
//
// Sibling of v3d_h264deblock_chroma_v.comp; same kernel transposed
// to read pix[-2..+1] (cols) instead of pix[-2*stride..+1*stride]
// (rows). Same 8-cell × 4-segment geometry, same WG layout (lanes
// 8..15 of each edge early-return — only 8 active per edge).
//
// 4:2:0-only: 4:2:2 chroma_h has a 16-row edge that this shader
// doesn't address. daedalus_dispatch_h264_deblock_chroma_h is
// 4:2:0-only by design; caller (libavcodec init) gates accordingly.
//
// License: BSD-2-Clause.
#version 450
#extension GL_EXT_shader_8bit_storage : require
#extension GL_EXT_shader_explicit_arithmetic_types : require
layout(local_size_x = 256, local_size_y = 1, local_size_z = 1) in;
layout(binding = 0) readonly buffer Meta { uvec4 meta[]; } u_meta;
layout(binding = 1) buffer Dst { uint8_t dst[]; } u_dst;
layout(push_constant) uniform PC {
uint n_edges;
uint dst_stride_u8;
uint _pad0;
uint _pad1;
} pc;
void main()
{
uint lane_in_wg = gl_GlobalInvocationID.x & 255u;
uint edge_in_wg = lane_in_wg >> 4; // 0..15
uint row_in_edge = lane_in_wg & 15u; // 0..15 — only 0..7 active
uint edge_idx = gl_WorkGroupID.x * 16u + edge_in_wg;
if (edge_idx >= pc.n_edges) return;
if (row_in_edge >= 8u) return;
uvec4 m = u_meta.meta[edge_idx];
uint stride = pc.dst_stride_u8;
uint dst_off = m.x + row_in_edge * stride;
int alpha = int(m.y & 0xffu);
int beta = int((m.y >> 8) & 0xffu);
uint seg = row_in_edge >> 1;
uint tc0_byte = (m.z >> (seg * 8u)) & 0xffu;
int tc0_s = int(tc0_byte);
if (tc0_s >= 128) tc0_s -= 256;
if (alpha == 0 || beta == 0) return;
if (tc0_s < 0) return;
int p1 = int(u_dst.dst[dst_off - 2u]);
int p0 = int(u_dst.dst[dst_off - 1u]);
int q0 = int(u_dst.dst[dst_off ]);
int q1 = int(u_dst.dst[dst_off + 1u]);
if (abs(p0 - q0) >= alpha) return;
if (abs(p1 - p0) >= beta) return;
if (abs(q1 - q0) >= beta) return;
int tc = tc0_s + 1;
int delta = clamp(((q0 - p0) * 4 + (p1 - q1) + 4) >> 3, -tc, tc);
u_dst.dst[dst_off - 1u] = uint8_t(clamp(p0 + delta, 0, 255));
u_dst.dst[dst_off ] = uint8_t(clamp(q0 - delta, 0, 255));
}
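The lane decomposition shared by the deblock shaders (256 lanes per workgroup, 16 lanes per edge, chroma lanes 8..15 idle) can be mirrored on the host, for example to size a dispatch. The following is a minimal C sketch under that reading of the shader; `lane_map`, `wg_count_for_edges` and `map_chroma_lane` are hypothetical names, not part of daedalus-fourier.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical host-side mirror of the shader's lane decomposition. */
typedef struct {
    uint32_t edge_idx;    /* global edge index                     */
    uint32_t row_in_edge; /* 0..15; chroma kernels use only 0..7   */
    int      active;      /* 1 if this lane gets past early-return */
} lane_map;

/* Workgroups needed when each 256-lane WG covers 16 edges. */
static uint32_t wg_count_for_edges(uint32_t n_edges)
{
    return (n_edges + 15u) / 16u;
}

/* Same arithmetic as the shader: edge = lane >> 4, row = lane & 15. */
static lane_map map_chroma_lane(uint32_t wg_id, uint32_t lane_in_wg,
                                uint32_t n_edges)
{
    assert(lane_in_wg < 256u);
    lane_map m;
    m.edge_idx    = wg_id * 16u + (lane_in_wg >> 4);
    m.row_in_edge = lane_in_wg & 15u;
    m.active      = (m.edge_idx < n_edges) && (m.row_in_edge < 8u);
    return m;
}
```

For 17 edges this yields 2 workgroups, and lane 37 of workgroup 1 maps to edge 18, row 5.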
+44
@@ -0,0 +1,44 @@
// daedalus-fourier — H.264 chroma 4:2:0 intra (bS=4) H deblock —
// V3D 7.1. Transpose of v3d_h264deblock_chroma_v_intra.comp.
//
// License: BSD-2-Clause.
#version 450
#extension GL_EXT_shader_8bit_storage : require
#extension GL_EXT_shader_explicit_arithmetic_types : require
layout(local_size_x = 256, local_size_y = 1, local_size_z = 1) in;
layout(binding = 0) readonly buffer Meta { uvec4 meta[]; } u_meta;
layout(binding = 1) buffer Dst { uint8_t dst[]; } u_dst;
layout(push_constant) uniform PC {
uint n_edges, dst_stride_u8, _p0, _p1;
} pc;
void main()
{
uint lane_in_wg = gl_GlobalInvocationID.x & 255u;
uint edge_in_wg = lane_in_wg >> 4;
uint row_in_edge = lane_in_wg & 15u;
uint edge_idx = gl_WorkGroupID.x * 16u + edge_in_wg;
if (edge_idx >= pc.n_edges) return;
if (row_in_edge >= 8u) return;
uvec4 m = u_meta.meta[edge_idx];
uint stride = pc.dst_stride_u8;
uint dst_off = m.x + row_in_edge * stride;
int alpha = int(m.y & 0xffu);
int beta = int((m.y >> 8) & 0xffu);
if ((alpha | beta) == 0) return;
int p1 = int(u_dst.dst[dst_off - 2u]);
int p0 = int(u_dst.dst[dst_off - 1u]);
int q0 = int(u_dst.dst[dst_off ]);
int q1 = int(u_dst.dst[dst_off + 1u]);
if (abs(p0 - q0) >= alpha) return;
if (abs(p1 - p0) >= beta) return;
if (abs(q1 - q0) >= beta) return;
u_dst.dst[dst_off - 1u] = uint8_t(clamp((2*p1 + p0 + q1 + 2) >> 2, 0, 255));
u_dst.dst[dst_off ] = uint8_t(clamp((2*q1 + q0 + p1 + 2) >> 2, 0, 255));
}
+76
@@ -0,0 +1,76 @@
// daedalus-fourier — H.264 chroma 4:2:0 V loop filter (vertical
// filter across a horizontal edge), non-intra bS<4 variant.
//
// Per H.264 §8.7.2.3 (edges with bS < 4): the chroma kernel is simpler
// than luma's bS<4 kernel. Only p0 / q0 are updated (chroma never
// modifies p1, p2, q1, q2), tC = tc0_seg + 1 (no luma-style ap/aq side
// bonus), and the edge spans 8 cells (4 segments × 2 cells/seg).
//
// V3D 7.1 via Mesa v3dv compute. WG geometry kept identical to the
// luma shader (16 edges × 16 lanes/WG) for uniform dispatch math
// across the deblock family; lanes 8..15 of each edge early-return.
//
// License: BSD-2-Clause.
#version 450
#extension GL_EXT_shader_8bit_storage : require
#extension GL_EXT_shader_explicit_arithmetic_types : require
layout(local_size_x = 256, local_size_y = 1, local_size_z = 1) in;
layout(binding = 0) readonly buffer Meta {
uvec4 meta[]; // per edge: (dst_off, alpha|beta<<8, packed_tc0, _pad)
} u_meta;
layout(binding = 1) buffer Dst {
uint8_t dst[];
} u_dst;
layout(push_constant) uniform PC {
uint n_edges;
uint dst_stride_u8;
uint _pad0;
uint _pad1;
} pc;
void main()
{
uint lane_in_wg = gl_GlobalInvocationID.x & 255u;
uint edge_in_wg = lane_in_wg >> 4; // 0..15
uint col_in_edge = lane_in_wg & 15u; // 0..15 — only 0..7 active
uint edge_idx = gl_WorkGroupID.x * 16u + edge_in_wg;
if (edge_idx >= pc.n_edges) return;
if (col_in_edge >= 8u) return; // 8 cells per chroma edge
uvec4 m = u_meta.meta[edge_idx];
uint dst_off = m.x + col_in_edge;
uint stride = pc.dst_stride_u8;
int alpha = int(m.y & 0xffu);
int beta = int((m.y >> 8) & 0xffu);
// 8 cells / 4 segments = 2 cells per segment.
uint seg = col_in_edge >> 1;
uint tc0_byte = (m.z >> (seg * 8u)) & 0xffu;
int tc0_s = int(tc0_byte);
if (tc0_s >= 128) tc0_s -= 256;
if (alpha == 0 || beta == 0) return;
if (tc0_s < 0) return;
int p1 = int(u_dst.dst[dst_off - 2u * stride]);
int p0 = int(u_dst.dst[dst_off - 1u * stride]);
int q0 = int(u_dst.dst[dst_off]);
int q1 = int(u_dst.dst[dst_off + 1u * stride]);
if (abs(p0 - q0) >= alpha) return;
if (abs(p1 - p0) >= beta) return;
if (abs(q1 - q0) >= beta) return;
int tc = tc0_s + 1;
int delta = clamp(((q0 - p0) * 4 + (p1 - q1) + 4) >> 3, -tc, tc);
u_dst.dst[dst_off - 1u * stride] = uint8_t(clamp(p0 + delta, 0, 255));
u_dst.dst[dst_off ] = uint8_t(clamp(q0 - delta, 0, 255));
// p1, q1 untouched — chroma kernel only updates p0/q0.
}
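The per-lane arithmetic of the chroma bS<4 kernel can be checked on the CPU with a scalar version. This is a minimal C sketch of the same steps as the shader body, not the project's reference implementation; `chroma_bs_lt4` and `clamp_i` are hypothetical names, and the sketch assumes `>>` on a negative `int` is an arithmetic shift (true on mainstream compilers, and matching GLSL).

```c
#include <assert.h>
#include <stdlib.h>

static int clamp_i(int v, int lo, int hi)
{
    return v < lo ? lo : (v > hi ? hi : v);
}

/* Filters one (p1, p0, q0, q1) sample set; only p0 and q0 may change.
 * Returns 1 if the filter was applied, 0 if any early-out fired. */
static int chroma_bs_lt4(int p1, int *p0, int *q0, int q1,
                         int alpha, int beta, int tc0)
{
    assert(alpha >= 0 && beta >= 0);
    if (alpha == 0 || beta == 0) return 0;  /* filtering disabled     */
    if (tc0 < 0) return 0;                  /* segment skip           */
    if (abs(*p0 - *q0) >= alpha) return 0;  /* real edge, keep it     */
    if (abs(p1 - *p0) >= beta) return 0;
    if (abs(q1 - *q0) >= beta) return 0;

    int tc = tc0 + 1;                       /* chroma: no ap/aq bonus */
    /* Assumes arithmetic right shift for negative values. */
    int delta = clamp_i(((*q0 - *p0) * 4 + (p1 - q1) + 4) >> 3, -tc, tc);
    *p0 = clamp_i(*p0 + delta, 0, 255);
    *q0 = clamp_i(*q0 - delta, 0, 255);
    return 1;
}
```

With p1=60, p0=62, q0=70, q1=72, alpha=30, beta=10 and tc0=1, the unclamped delta is 3, tC is 2, and the result is p0=64, q0=68.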
+54
@@ -0,0 +1,54 @@
// daedalus-fourier — H.264 chroma 4:2:0 intra (bS=4) V deblock —
// V3D 7.1. Per H.264 §8.7.2.4 (bS = 4), chroma path: simpler than luma;
// always the weak filter, only p0/q0 updated, 8 cells per edge.
//
// p0' = (2*p1 + p0 + q1 + 2) >> 2
// q0' = (2*q1 + q0 + p1 + 2) >> 2
//
// Same 16-edges × 16-lanes/edge WG shape as luma; lanes 8..15 of each
// edge early-return (chroma edges are only 8 cells wide).
//
// 4:2:0-only — caller-side gating handles 4:2:2 (chroma_format_idc>1)
// at the libavcodec init layer.
//
// License: BSD-2-Clause.
#version 450
#extension GL_EXT_shader_8bit_storage : require
#extension GL_EXT_shader_explicit_arithmetic_types : require
layout(local_size_x = 256, local_size_y = 1, local_size_z = 1) in;
layout(binding = 0) readonly buffer Meta { uvec4 meta[]; } u_meta;
layout(binding = 1) buffer Dst { uint8_t dst[]; } u_dst;
layout(push_constant) uniform PC {
uint n_edges, dst_stride_u8, _p0, _p1;
} pc;
void main()
{
uint lane_in_wg = gl_GlobalInvocationID.x & 255u;
uint edge_in_wg = lane_in_wg >> 4;
uint col_in_edge = lane_in_wg & 15u;
uint edge_idx = gl_WorkGroupID.x * 16u + edge_in_wg;
if (edge_idx >= pc.n_edges) return;
if (col_in_edge >= 8u) return;
uvec4 m = u_meta.meta[edge_idx];
uint dst_off = m.x + col_in_edge;
uint stride = pc.dst_stride_u8;
int alpha = int(m.y & 0xffu);
int beta = int((m.y >> 8) & 0xffu);
if ((alpha | beta) == 0) return;
int p1 = int(u_dst.dst[dst_off - 2u * stride]);
int p0 = int(u_dst.dst[dst_off - 1u * stride]);
int q0 = int(u_dst.dst[dst_off]);
int q1 = int(u_dst.dst[dst_off + 1u * stride]);
if (abs(p0 - q0) >= alpha) return;
if (abs(p1 - p0) >= beta) return;
if (abs(q1 - q0) >= beta) return;
u_dst.dst[dst_off - 1u * stride] = uint8_t(clamp((2*p1 + p0 + q1 + 2) >> 2, 0, 255));
u_dst.dst[dst_off ] = uint8_t(clamp((2*q1 + q0 + p1 + 2) >> 2, 0, 255));
}
+111
@@ -0,0 +1,111 @@
// daedalus-fourier — H.264 luma "h_loop_filter" (horizontal filtering
// across a vertical edge), non-intra bS<4 variant. Sibling of cycle 8's
// v3d_h264deblock.comp; same algorithm with row/col access transposed.
//
// V3D 7.1 via Mesa v3dv compute. Same WG geometry as the V shader:
// - 256 invocations / WG, 16 edges/WG (16 lanes/edge = 1 sg/edge)
// - uint8_t dst SSBO via storageBuffer8BitAccess
// - No barrier (each lane independent)
// - lane_in_edge = ROW index (0..15) along the vertical edge
// - meta.dst_off points to (row 0, col 0) of the RIGHT block;
// the kernel reads cols [-3..+2] of each row and writes [-2..+1].
//
// Filter contract (per H.264 §8.7.2.3):
// 1. (m.x % pc.dst_stride_u8) ≥ 4 (same headroom as the intra sibling,
// which reads p3 at pix[-4]; this bS<4 kernel reads down to p2 at pix[-3])
// 2. pc.dst_stride_u8 = byte stride between rows
// 3. tc0_s pre-stored as signed int8 in m.z packed 4 bytes (one per
// 4-row segment along the 16-row edge)
//
// License: BSD-2-Clause. Algorithm transcribed from
// tests/h264_h_loop_filter_luma_ref.c which mirrors FFmpeg
// ff_h264_h_loop_filter_luma_neon (LGPL-2.1+).
#version 450
#extension GL_EXT_shader_8bit_storage : require
#extension GL_EXT_shader_explicit_arithmetic_types : require
layout(local_size_x = 256, local_size_y = 1, local_size_z = 1) in;
layout(binding = 0) readonly buffer Meta {
uvec4 meta[]; // per edge: (dst_off, alpha|beta<<8, packed_tc0, _pad)
} u_meta;
layout(binding = 1) buffer Dst {
uint8_t dst[];
} u_dst;
layout(push_constant) uniform PC {
uint n_edges;
uint dst_stride_u8;
uint _pad0;
uint _pad1;
} pc;
void main()
{
uint gid = gl_GlobalInvocationID.x;
uint wg_id = gl_WorkGroupID.x;
uint lane_in_wg = gid & 255u;
uint edge_in_wg = lane_in_wg >> 4; // 0..15 (16 edges/WG)
uint row_in_edge = lane_in_wg & 15u; // 0..15 — ROW along the V edge
uint edge_idx = wg_id * 16u + edge_in_wg;
if (edge_idx >= pc.n_edges) return;
uvec4 m = u_meta.meta[edge_idx];
uint stride = pc.dst_stride_u8;
// dst_off addresses row 0 col 0 of the right block; advance by row * stride
// to land at this lane's row. The kernel reads pix[-3..+2] AT THIS ROW.
uint dst_off = m.x + row_in_edge * stride;
int alpha = int(m.y & 0xffu);
int beta = int((m.y >> 8) & 0xffu);
// tc0 segment = 0..3 indexed by (row_in_edge / 4).
uint seg = row_in_edge >> 2;
uint tc0_byte = (m.z >> (seg * 8u)) & 0xffu;
int tc0_s = int(tc0_byte);
if (tc0_s >= 128) tc0_s -= 256;
if (alpha == 0 || beta == 0) return;
if (tc0_s < 0) return; // segment skip
// Horizontal access pattern — read cols at offsets [-3..+2] of this row.
// p3 (col -4) unused in bS<4; same DCE comment as the V shader.
int p2 = int(u_dst.dst[dst_off - 3u]);
int p1 = int(u_dst.dst[dst_off - 2u]);
int p0 = int(u_dst.dst[dst_off - 1u]);
int q0 = int(u_dst.dst[dst_off ]);
int q1 = int(u_dst.dst[dst_off + 1u]);
int q2 = int(u_dst.dst[dst_off + 2u]);
// Edge preconditions (same as V).
if (abs(p0 - q0) >= alpha) return;
if (abs(p1 - p0) >= beta) return;
if (abs(q1 - q0) >= beta) return;
int ap = abs(p2 - p0);
int aq = abs(q2 - q0);
bool ap_lt = ap < beta;
bool aq_lt = aq < beta;
int tc = tc0_s + int(ap_lt) + int(aq_lt);
int delta = clamp(((q0 - p0) * 4 + (p1 - q1) + 4) >> 3, -tc, tc);
int p0p = clamp(p0 + delta, 0, 255);
int q0p = clamp(q0 - delta, 0, 255);
int p1p = p1;
if (ap_lt) {
int d_p1 = clamp((p2 + ((p0 + q0 + 1) >> 1) - 2*p1) >> 1, -tc0_s, tc0_s);
p1p = clamp(p1 + d_p1, 0, 255);
}
int q1p = q1;
if (aq_lt) {
int d_q1 = clamp((q2 + ((p0 + q0 + 1) >> 1) - 2*q1) >> 1, -tc0_s, tc0_s);
q1p = clamp(q1 + d_q1, 0, 255);
}
u_dst.dst[dst_off - 2u] = uint8_t(p1p);
u_dst.dst[dst_off - 1u] = uint8_t(p0p);
u_dst.dst[dst_off ] = uint8_t(q0p);
u_dst.dst[dst_off + 1u] = uint8_t(q1p);
}
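The shaders unpack `m.y` as `alpha | beta << 8` and `m.z` as four signed bytes, one tc0 per segment, with manual sign extension. A minimal C sketch of a matching host-side pack and unpack follows. The packing side is an assumption inferred from the shader's unpacking and the `(dst_off, alpha|beta<<8, packed_tc0, _pad)` comment; the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* alpha in bits 0..7, beta in bits 8..15, as the shader reads m.y. */
static uint32_t pack_alpha_beta(uint32_t alpha, uint32_t beta)
{
    assert(alpha <= 255u && beta <= 255u);
    return alpha | (beta << 8);
}

/* One signed byte per segment, segment 0 in the low byte (m.z). */
static uint32_t pack_tc0(const int8_t tc0[4])
{
    uint32_t w = 0;
    for (int s = 0; s < 4; s++)
        w |= (uint32_t)(uint8_t)tc0[s] << (8 * s);
    return w;
}

/* Mirror of the shader: extract the byte, then sign-extend by hand. */
static int unpack_tc0(uint32_t packed, uint32_t seg)
{
    assert(seg < 4u);
    int v = (int)((packed >> (seg * 8u)) & 0xffu);
    if (v >= 128) v -= 256;
    return v;
}
```

A negative tc0 (for example -1, stored as 0xff) survives the round trip and triggers the shader's per-segment skip.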
+70
@@ -0,0 +1,70 @@
// daedalus-fourier — H.264 luma intra (bS=4) H deblock — V3D 7.1.
//
// Sibling of v3d_h264deblock_luma_v_intra.comp transposed to the
// horizontal axis: lane → row, reads pix[-4..+3] (cols) instead of
// pix[-4*stride..+3*stride] (rows). Same strong/weak filter
// selector + same write-back algebra.
//
// dst_off contract: (m.x % stride) ≥ 4 (kernel reads p3 at pix[-4]).
//
// License: BSD-2-Clause.
#version 450
#extension GL_EXT_shader_8bit_storage : require
#extension GL_EXT_shader_explicit_arithmetic_types : require
layout(local_size_x = 256, local_size_y = 1, local_size_z = 1) in;
layout(binding = 0) readonly buffer Meta { uvec4 meta[]; } u_meta;
layout(binding = 1) buffer Dst { uint8_t dst[]; } u_dst;
layout(push_constant) uniform PC {
uint n_edges, dst_stride_u8, _p0, _p1;
} pc;
void main()
{
uint lane_in_wg = gl_GlobalInvocationID.x & 255u;
uint edge_in_wg = lane_in_wg >> 4;
uint row_in_edge = lane_in_wg & 15u;
uint edge_idx = gl_WorkGroupID.x * 16u + edge_in_wg;
if (edge_idx >= pc.n_edges) return;
uvec4 m = u_meta.meta[edge_idx];
uint stride = pc.dst_stride_u8;
uint dst_off = m.x + row_in_edge * stride;
int alpha = int(m.y & 0xffu);
int beta = int((m.y >> 8) & 0xffu);
if ((alpha | beta) == 0) return;
int p3 = int(u_dst.dst[dst_off - 4u]);
int p2 = int(u_dst.dst[dst_off - 3u]);
int p1 = int(u_dst.dst[dst_off - 2u]);
int p0 = int(u_dst.dst[dst_off - 1u]);
int q0 = int(u_dst.dst[dst_off ]);
int q1 = int(u_dst.dst[dst_off + 1u]);
int q2 = int(u_dst.dst[dst_off + 2u]);
int q3 = int(u_dst.dst[dst_off + 3u]);
if (abs(p0 - q0) >= alpha) return;
if (abs(p1 - p0) >= beta) return;
if (abs(q1 - q0) >= beta) return;
bool strong_common = abs(p0 - q0) < (alpha >> 2) + 2;
bool strong_p = strong_common && abs(p2 - p0) < beta;
bool strong_q = strong_common && abs(q2 - q0) < beta;
if (strong_p) {
u_dst.dst[dst_off - 1u] = uint8_t(clamp((p2 + 2*p1 + 2*p0 + 2*q0 + q1 + 4) >> 3, 0, 255));
u_dst.dst[dst_off - 2u] = uint8_t(clamp((p2 + p1 + p0 + q0 + 2) >> 2, 0, 255));
u_dst.dst[dst_off - 3u] = uint8_t(clamp((2*p3 + 3*p2 + p1 + p0 + q0 + 4) >> 3, 0, 255));
} else {
u_dst.dst[dst_off - 1u] = uint8_t(clamp((2*p1 + p0 + q1 + 2) >> 2, 0, 255));
}
if (strong_q) {
u_dst.dst[dst_off ] = uint8_t(clamp((q2 + 2*q1 + 2*q0 + 2*p0 + p1 + 4) >> 3, 0, 255));
u_dst.dst[dst_off + 1u] = uint8_t(clamp((q2 + q1 + q0 + p0 + 2) >> 2, 0, 255));
u_dst.dst[dst_off + 2u] = uint8_t(clamp((2*q3 + 3*q2 + q1 + q0 + p0 + 4) >> 3, 0, 255));
} else {
u_dst.dst[dst_off ] = uint8_t(clamp((2*q1 + q0 + p1 + 2) >> 2, 0, 255));
}
}
+81
@@ -0,0 +1,81 @@
// daedalus-fourier — H.264 luma intra (bS=4) V deblock — V3D 7.1.
//
// Per H.264 §8.7.2.4: at I-MB edges and certain inter-MB edges that
// force boundary strength to 4, the deblock kernel is structurally
// different from bS<4: it has a per-side strong/weak filter
// selector that decides whether to update 3 cells (strong) or 1
// (weak), reads p3/q3, and ignores tc0.
//
// strong_common = |p0-q0| < (α>>2) + 2
// strong_p = strong_common AND |p2-p0| < β
// strong_q = strong_common AND |q2-q0| < β
//
// Strong-p updates p0/p1/p2 with specific 5-/4-/5-tap blends.
// Weak-p updates p0 only with (2*p1 + p0 + q1 + 2) >> 2.
// Mirror for q-side.
//
// WG geometry identical to v3d_h264deblock.comp (16 edges × 16 lanes/WG).
// dst_off contract: m.x ≥ 4*stride (kernel reads p3 at -4*stride).
//
// License: BSD-2-Clause. Algorithm transcribed from
// tests/h264_intra_loop_filter_ref.c (PR #11).
#version 450
#extension GL_EXT_shader_8bit_storage : require
#extension GL_EXT_shader_explicit_arithmetic_types : require
layout(local_size_x = 256, local_size_y = 1, local_size_z = 1) in;
layout(binding = 0) readonly buffer Meta { uvec4 meta[]; } u_meta;
layout(binding = 1) buffer Dst { uint8_t dst[]; } u_dst;
layout(push_constant) uniform PC {
uint n_edges, dst_stride_u8, _p0, _p1;
} pc;
void main()
{
uint lane_in_wg = gl_GlobalInvocationID.x & 255u;
uint edge_in_wg = lane_in_wg >> 4;
uint col_in_edge = lane_in_wg & 15u;
uint edge_idx = gl_WorkGroupID.x * 16u + edge_in_wg;
if (edge_idx >= pc.n_edges) return;
uvec4 m = u_meta.meta[edge_idx];
uint dst_off = m.x + col_in_edge;
uint stride = pc.dst_stride_u8;
int alpha = int(m.y & 0xffu);
int beta = int((m.y >> 8) & 0xffu);
if ((alpha | beta) == 0) return;
int p3 = int(u_dst.dst[dst_off - 4u * stride]);
int p2 = int(u_dst.dst[dst_off - 3u * stride]);
int p1 = int(u_dst.dst[dst_off - 2u * stride]);
int p0 = int(u_dst.dst[dst_off - 1u * stride]);
int q0 = int(u_dst.dst[dst_off]);
int q1 = int(u_dst.dst[dst_off + 1u * stride]);
int q2 = int(u_dst.dst[dst_off + 2u * stride]);
int q3 = int(u_dst.dst[dst_off + 3u * stride]);
if (abs(p0 - q0) >= alpha) return;
if (abs(p1 - p0) >= beta) return;
if (abs(q1 - q0) >= beta) return;
bool strong_common = abs(p0 - q0) < (alpha >> 2) + 2;
bool strong_p = strong_common && abs(p2 - p0) < beta;
bool strong_q = strong_common && abs(q2 - q0) < beta;
if (strong_p) {
u_dst.dst[dst_off - 1u * stride] = uint8_t(clamp((p2 + 2*p1 + 2*p0 + 2*q0 + q1 + 4) >> 3, 0, 255));
u_dst.dst[dst_off - 2u * stride] = uint8_t(clamp((p2 + p1 + p0 + q0 + 2) >> 2, 0, 255));
u_dst.dst[dst_off - 3u * stride] = uint8_t(clamp((2*p3 + 3*p2 + p1 + p0 + q0 + 4) >> 3, 0, 255));
} else {
u_dst.dst[dst_off - 1u * stride] = uint8_t(clamp((2*p1 + p0 + q1 + 2) >> 2, 0, 255));
}
if (strong_q) {
u_dst.dst[dst_off ] = uint8_t(clamp((q2 + 2*q1 + 2*q0 + 2*p0 + p1 + 4) >> 3, 0, 255));
u_dst.dst[dst_off + 1u * stride] = uint8_t(clamp((q2 + q1 + q0 + p0 + 2) >> 2, 0, 255));
u_dst.dst[dst_off + 2u * stride] = uint8_t(clamp((2*q3 + 3*q2 + q1 + q0 + p0 + 4) >> 3, 0, 255));
} else {
u_dst.dst[dst_off ] = uint8_t(clamp((2*q1 + q0 + p1 + 2) >> 2, 0, 255));
}
}
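The strong/weak selection for the p side of the luma intra kernel can be exercised in isolation. This is a minimal scalar C sketch of the same formulas as the shader, not the project's `tests/h264_intra_loop_filter_ref.c`; `luma_intra_p_side` is a hypothetical name. The q side is the mirror image.

```c
#include <assert.h>
#include <stdlib.h>

/* p[0..3] = p0..p3, q[0..1] = q0..q1. Rewrites the p side in place.
 * Returns the number of p samples rewritten:
 * 3 = strong filter, 1 = weak filter, 0 = edge left untouched. */
static int luma_intra_p_side(int p[4], const int q[2], int alpha, int beta)
{
    assert(alpha >= 0 && beta >= 0);
    int p0 = p[0], p1 = p[1], p2 = p[2], p3 = p[3];
    int q0 = q[0], q1 = q[1];

    /* Edge preconditions, identical to the bS<4 kernels. */
    if (abs(p0 - q0) >= alpha) return 0;
    if (abs(p1 - p0) >= beta)  return 0;
    if (abs(q1 - q0) >= beta)  return 0;

    int strong_common = abs(p0 - q0) < (alpha >> 2) + 2;
    int strong_p      = strong_common && abs(p2 - p0) < beta;

    if (strong_p) {
        p[0] = (p2 + 2*p1 + 2*p0 + 2*q0 + q1 + 4) >> 3;
        p[1] = (p2 + p1 + p0 + q0 + 2) >> 2;
        p[2] = (2*p3 + 3*p2 + p1 + p0 + q0 + 4) >> 3;
        return 3;
    }
    p[0] = (2*p1 + p0 + q1 + 2) >> 2;   /* weak: p0 only */
    return 1;
}
```

A flat p side of 100 against a q side of 104 (alpha=40, beta=10) takes the strong path and becomes p0=102, p1=101, p2=101; making p2 differ from p0 by 20 forces the weak path, which rewrites p0 only.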
+62 -2
@@ -8,6 +8,8 @@
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <limits.h>
#define CHK(call) do { VkResult r__ = (call); if (r__ != VK_SUCCESS) { \
fprintf(stderr, "v3d_runner: vulkan error %d at %s:%d (%s)\n", \
@@ -368,10 +370,68 @@ void v3d_runner_destroy_buffer(v3d_runner *r, v3d_buffer *buf)
/* ---- Pipelines -------------------------------------------------- */
/* SPV lookup tries a small set of locations. The caller passes a bare
* filename (e.g. "v3d_h264_idct4.spv"); we try, in order:
*
* 1. cwd-relative (legacy contract; works when run from build/)
* 2. $DAEDALUS_SHADER_DIR (env override for tests / packaged installs)
* 3. <binary-dir>/<name> (so the bench/test binary finds the SPV next
* to itself regardless of cwd; this is the
* fix for the silent-no-SPV regression that
* made PR #36's bench numbers meaningless)
* 4. /opt/fourier/share/daedalus-fourier/<name> (Pi 5 install layout)
* 5. /usr/share/daedalus-fourier/<name> (system-wide install)
*
* Returns NULL only if every location fails; the caller (read_spv) then
* prints a single diagnostic naming the bare filename so the user can
* grep for it. */
static FILE *open_spv(const char *name)
{
FILE *f = fopen(name, "rb");
if (f) return f;
const char *envdir = getenv("DAEDALUS_SHADER_DIR");
if (envdir && *envdir) {
char p[PATH_MAX];
snprintf(p, sizeof(p), "%s/%s", envdir, name);
f = fopen(p, "rb");
if (f) return f;
}
char exe[PATH_MAX];
ssize_t n = readlink("/proc/self/exe", exe, sizeof(exe) - 1);
if (n > 0) {
exe[n] = 0;
char *slash = strrchr(exe, '/');
if (slash) {
*slash = 0;
char p[PATH_MAX];
snprintf(p, sizeof(p), "%s/%s", exe, name);
f = fopen(p, "rb");
if (f) return f;
}
}
char p[PATH_MAX];
snprintf(p, sizeof(p), "/opt/fourier/share/daedalus-fourier/%s", name);
f = fopen(p, "rb");
if (f) return f;
snprintf(p, sizeof(p), "/usr/share/daedalus-fourier/%s", name);
f = fopen(p, "rb");
if (f) return f;
return NULL;
}
static uint32_t *read_spv(const char *path, size_t *out_size)
{
-FILE *f = fopen(path, "rb");
-if (!f) { perror(path); return NULL; }
+FILE *f = open_spv(path);
+if (!f) {
+fprintf(stderr,
+"daedalus: SPV not found via cwd / $DAEDALUS_SHADER_DIR / "
+"binary-dir / /opt/fourier/share / /usr/share: %s\n", path);
+return NULL;
+}
fseek(f, 0, SEEK_END);
long sz = ftell(f);
fseek(f, 0, SEEK_SET);
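open_spv formats each candidate path with snprintf into a PATH_MAX buffer and does not inspect the return value, so an over-long directory would produce a silently truncated path. A minimal C sketch of a truncation-checked join follows; `join_path` and `try_open_in_dir` are hypothetical helpers, not part of v3d_runner, and the 4096-byte buffer is an arbitrary stand-in for PATH_MAX.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Writes "dir/name" into out[cap].
 * Returns 0 on success, -1 if the result (plus NUL) would not fit. */
static int join_path(char *out, size_t cap, const char *dir, const char *name)
{
    assert(out && dir && name && cap > 0);
    int n = snprintf(out, cap, "%s/%s", dir, name);
    return (n < 0 || (size_t)n >= cap) ? -1 : 0;
}

/* Opens dir/name for binary reading; NULL if the path does not fit
 * or the file cannot be opened. */
static FILE *try_open_in_dir(const char *dir, const char *name)
{
    char p[4096];
    if (join_path(p, sizeof p, dir, name) != 0) return NULL;
    return fopen(p, "rb");
}
```

With such a helper each search location becomes one call, and a truncated candidate is skipped instead of being handed to fopen.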
+299
@@ -0,0 +1,299 @@
/* SPDX-License-Identifier: BSD-2-Clause */
/* Needed for clock_gettime/CLOCK_MONOTONIC under -std=c11 with
* CMAKE_C_EXTENSIONS=OFF. */
#define _POSIX_C_SOURCE 200809L
/*
* bench_h264_primitives: latency baseline for the H.264 primitive
* library landed across PRs #9..#35.
*
* Each kernel is exercised at a representative per-frame N for 1080p
* (8160 MBs); the per-kernel total + ns/op + ms/frame are reported,
* once per substrate (CPU NEON, QPU V3D7 compute). The QPU column
* appears only when the host has a usable Vulkan device. When both
* columns exist a CPU/QPU ratio is printed; that's the per-kernel
* data the QPU-substrate decree (2026-05-23) deliberately overrides
* but which is still useful to track over time as dispatch overhead
* shrinks (buffer pool, persistent cmdbuf, dmabuf import: tasks 160-162).
*
* NOT a ctest: produces wall-time numbers, doesn't pass/fail.
*
* Invoke: ./build/bench_h264_primitives [iters [warmup]]
* (default iters = 50, warmup = 5)
*/
#include "daedalus.h"
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
static uint64_t xs64_state = 0xfeedface5a5a5a5aULL;
static uint64_t xs64(void) {
uint64_t x = xs64_state;
x ^= x << 13; x ^= x >> 7; x ^= x << 17;
return xs64_state = x;
}
static double now_ms(void) {
struct timespec ts;
clock_gettime(CLOCK_MONOTONIC, &ts);
return ts.tv_sec * 1000.0 + ts.tv_nsec / 1.0e6;
}
/* Per-1080p-frame counts (8160 MBs at 1920x1088). */
#define MBS_1080P 8160
/* Standard benchmark loop. fn() is called n times per iteration.
*
* fn() now returns the dispatch's int rc. A single preflight call is
* made before the hot loop; if rc != 0 (which on the QPU substrate
* almost always means "SPV not found via any search path"), bench_ns
* returns -1 and the caller must NOT report the kernel as measured.
*
* Without this, a missing SPV makes every dispatch fail fast at the
* cost of one fprintf+open call (~1-5 µs), and the loop times that
* cost as if it were real QPU work, producing absurdly-small ns/op
* numbers that look like a QPU speedup. This is exactly what made
* PR #36's bench numbers a measurement artifact. */
typedef int (*bench_fn)(void);
static double bench_ns(const char *name, int iters, int warmup,
int ops_per_iter, bench_fn fn)
{
int rc = fn();
if (rc != 0) {
printf(" %-32s DISPATCH FAILED rc=%d — kernel skipped\n", name, rc);
return -1;
}
for (int i = 0; i < warmup; i++) fn();
double t0 = now_ms();
for (int i = 0; i < iters; i++) fn();
double t1 = now_ms();
double total_ms = (t1 - t0);
double ns_per_op = (total_ms * 1e6) / ((double) iters * ops_per_iter);
printf(" %-32s %10.2f ns/op (%d iters x %d ops)\n",
name, ns_per_op, iters, ops_per_iter);
return ns_per_op;
}
/* ---- Per-kernel scaffolding. Each section sets up the buffers +
* meta, then defines a static fn() that calls the corresponding
* dispatch with a representative N. The substrate is read from the
* global g_sub so the same fn() can be re-driven with CPU then QPU. */
static daedalus_ctx *ctx;
static daedalus_substrate g_sub = DAEDALUS_SUBSTRATE_CPU;
/* --- IDCT 4x4 luma: N = 16 blocks per MB. Bench with 1024 blocks
* per call (64 MBs worth). Per-MB the dispatch overhead is the
* same regardless of N; we want ns per block. */
static int16_t idct4_coeffs[1024 * 16];
static daedalus_h264_block_meta idct4_meta[1024];
static uint8_t idct_dst[64 * 4 * 16 * 16]; /* 64 MB-rows × ... */
static int bench_idct4(void) {
return daedalus_dispatch_h264_idct4(ctx, g_sub,
idct_dst, 64*16, idct4_coeffs, 1024, idct4_meta);
}
/* --- IDCT 8x8 luma: 256 8x8 blocks per call. */
static int16_t idct8_coeffs[256 * 64];
static daedalus_h264_block_meta idct8_meta[256];
static int bench_idct8(void) {
return daedalus_dispatch_h264_idct8(ctx, g_sub,
idct_dst, 64*16, idct8_coeffs, 256, idct8_meta);
}
/* --- Deblock luma_v (cycle 8 baseline; M3 path). */
static daedalus_h264_deblock_meta deblock_meta[256];
static uint8_t deblock_dst[256 * 16 * 16];
static int bench_deblock_v(void) {
return daedalus_dispatch_h264_deblock_luma_v(ctx, g_sub,
deblock_dst, 16, 256, deblock_meta);
}
static int bench_deblock_h(void) {
return daedalus_dispatch_h264_deblock_luma_h(ctx, g_sub,
deblock_dst, 16, 256, deblock_meta);
}
/* --- qpel mc20 + mc02 + mc22 (the H/V/HV anchors). */
static uint8_t qpel_src[256 * 16 * 16];
static uint8_t qpel_dst[256 * 16 * 16];
static daedalus_h264_qpel_meta qpel_meta[256];
static int bench_qpel_mc20(void) {
return daedalus_dispatch_h264_qpel_mc20(ctx, g_sub,
qpel_dst, qpel_src, 16, 256, qpel_meta);
}
static int bench_qpel_mc02(void) {
return daedalus_dispatch_h264_qpel_mc02(ctx, g_sub,
qpel_dst, qpel_src, 16, 256, qpel_meta);
}
static int bench_qpel_mc22(void) {
return daedalus_dispatch_h264_qpel_mc22(ctx, g_sub,
qpel_dst, qpel_src, 16, 256, qpel_meta);
}
/* ---- One row of bench output:
* - kernel name + N
* - CPU ns/op
* - QPU ns/op (or "n/a" if Vulkan absent)
* - winner and speed ratio (slower ns/op / faster ns/op, always >= 1) */
struct row {
const char *name;
int n_per_call;
bench_fn fn;
double cpu_ns;
double qpu_ns; /* -1 if not measured */
int frame_n; /* count per 1080p frame */
};
static struct row rows[] = {
{"IDCT 4x4 luma", 1024, bench_idct4, 0, -1, MBS_1080P * 16},
{"IDCT 8x8 luma", 256, bench_idct8, 0, -1, MBS_1080P * 4},
{"Deblock luma_v", 256, bench_deblock_v, 0, -1, MBS_1080P * 4},
{"Deblock luma_h", 256, bench_deblock_h, 0, -1, MBS_1080P * 4},
{"qpel mc20 (8x8)", 256, bench_qpel_mc20, 0, -1, MBS_1080P * 4},
{"qpel mc02 (8x8)", 256, bench_qpel_mc02, 0, -1, MBS_1080P * 4},
{"qpel mc22 (8x8)", 256, bench_qpel_mc22, 0, -1, MBS_1080P * 4},
};
#define N_ROWS ((int)(sizeof(rows)/sizeof(rows[0])))
int main(int argc, char **argv)
{
int iters = argc > 1 ? atoi(argv[1]) : 50;
int warmup = argc > 2 ? atoi(argv[2]) : 5;
ctx = daedalus_ctx_create();
if (!ctx) {
fprintf(stderr, "ctx create failed (Vulkan?)\n");
return 1;
}
int has_qpu = daedalus_ctx_has_qpu(ctx);
/* Pre-fill all input buffers with random data so the NEON inner
* loops see realistic memory access patterns. */
for (size_t i = 0; i < sizeof(idct4_coeffs)/2; i++)
idct4_coeffs[i] = (int16_t)((int)(xs64() % 1024) - 512);
for (size_t i = 0; i < sizeof(idct8_coeffs)/2; i++)
idct8_coeffs[i] = (int16_t)((int)(xs64() % 1024) - 512);
for (size_t i = 0; i < sizeof(qpel_src); i++) qpel_src[i] = (uint8_t)(xs64() & 0xff);
/* IDCT meta. */
for (size_t i = 0; i < 1024; i++)
idct4_meta[i].dst_off = (uint32_t)((i / 16) * 64 + (i % 16) * 4);
for (size_t i = 0; i < 256; i++)
idct8_meta[i].dst_off = (uint32_t)((i / 8) * 64 + (i % 8) * 8);
/* Deblock meta: edge offsets within 256 16x16 tiles. */
for (size_t i = 0; i < 256; i++) {
deblock_meta[i].dst_off = (uint32_t)(i * 256 + 4 * 16);
deblock_meta[i].alpha = 30;
deblock_meta[i].beta = 10;
for (int s = 0; s < 4; s++) deblock_meta[i].tc0[s] = (int8_t)(s + 1);
}
/* qpel meta. */
for (size_t i = 0; i < 256; i++) {
qpel_meta[i].src_off = (uint32_t)(i * 256 + 3 * 16 + 3);
qpel_meta[i].dst_off = (uint32_t)(i * 256 + 3 * 16 + 3);
}
printf("bench_h264_primitives: %d iters (%d warmup)\n", iters, warmup);
printf(" ctx has_qpu=%d (CPU pass always runs; QPU pass skipped without Vulkan)\n\n", has_qpu);
/* Pass 1: CPU NEON. */
g_sub = DAEDALUS_SUBSTRATE_CPU;
printf("== CPU NEON ==\n");
for (int i = 0; i < N_ROWS; i++)
rows[i].cpu_ns = bench_ns(rows[i].name, iters, warmup, rows[i].n_per_call, rows[i].fn);
/* Pass 2: QPU compute (if available). */
int qpu_failures = 0;
if (has_qpu) {
g_sub = DAEDALUS_SUBSTRATE_QPU;
printf("\n== QPU V3D7 compute ==\n");
for (int i = 0; i < N_ROWS; i++) {
rows[i].qpu_ns = bench_ns(rows[i].name, iters, warmup, rows[i].n_per_call, rows[i].fn);
if (rows[i].qpu_ns < 0) qpu_failures++;
}
if (qpu_failures) {
fprintf(stderr,
"\nbench_h264_primitives: %d of %d QPU dispatches failed.\n"
" Almost always means SPV files were not found via any of:\n"
" cwd / $DAEDALUS_SHADER_DIR / binary-dir /\n"
" /opt/fourier/share/daedalus-fourier / /usr/share/daedalus-fourier\n"
" Set DAEDALUS_SHADER_DIR=<path> or run from a dir where the\n"
" .spv files exist (e.g. the cmake build dir).\n",
qpu_failures, N_ROWS);
return 2;
}
}
/* Summary table — both substrates side by side. */
printf("\n== Per-kernel comparison ==\n");
printf(" %-24s %12s %12s %8s %7s\n",
"kernel", "CPU ns/op", "QPU ns/op", "winner", "ms/frame");
for (int i = 0; i < N_ROWS; i++) {
double cpu_ms = rows[i].cpu_ns * rows[i].frame_n / 1e6;
double qpu_ms = rows[i].qpu_ns > 0 ? rows[i].qpu_ns * rows[i].frame_n / 1e6 : -1;
const char *winner;
char ratio[16];
if (rows[i].qpu_ns <= 0) {
winner = "CPU"; /* QPU n/a */
snprintf(ratio, sizeof(ratio), "n/a");
} else if (rows[i].cpu_ns < rows[i].qpu_ns) {
winner = "CPU";
snprintf(ratio, sizeof(ratio), "%.2fx", rows[i].qpu_ns / rows[i].cpu_ns);
} else {
winner = "QPU";
snprintf(ratio, sizeof(ratio), "%.2fx", rows[i].cpu_ns / rows[i].qpu_ns);
}
char qpu_field[16];
if (rows[i].qpu_ns > 0) snprintf(qpu_field, sizeof(qpu_field), "%.2f", rows[i].qpu_ns);
else snprintf(qpu_field, sizeof(qpu_field), "n/a");
char ms_field[24];
if (qpu_ms > 0)
snprintf(ms_field, sizeof(ms_field), "%.2f/%.2f", cpu_ms, qpu_ms);
else
snprintf(ms_field, sizeof(ms_field), "%.2f/n/a", cpu_ms);
printf(" %-24s %12.2f %12s %3s %s %s\n",
rows[i].name, rows[i].cpu_ns, qpu_field, winner, ratio, ms_field);
}
/* Per-frame budget summary at 1080p (8160 MBs). */
double cpu_idct4 = rows[0].cpu_ns * MBS_1080P * 16 / 1e6;
double cpu_debl = (rows[2].cpu_ns + rows[3].cpu_ns) * MBS_1080P * 4 / 1e6;
double cpu_mc = rows[6].cpu_ns * MBS_1080P * 4 / 1e6; /* mc22 worst-case */
double cpu_sum = cpu_idct4 + cpu_debl + cpu_mc;
printf("\n== Projected 1080p worst-case (CPU NEON only) ==\n");
printf(" IDCT 4x4 + deblock luma + qpel mc22: %.2f ms (30fps deadline 33.33)\n", cpu_sum);
printf(" Margin: %+.2f ms\n", 33.33 - cpu_sum);
if (has_qpu) {
double qpu_idct4 = rows[0].qpu_ns * MBS_1080P * 16 / 1e6;
double qpu_debl = (rows[2].qpu_ns + rows[3].qpu_ns) * MBS_1080P * 4 / 1e6;
double qpu_mc = rows[6].qpu_ns * MBS_1080P * 4 / 1e6;
double qpu_sum = qpu_idct4 + qpu_debl + qpu_mc;
printf("\n== Projected 1080p worst-case (QPU V3D7 compute only) ==\n");
printf(" IDCT 4x4 + deblock luma + qpel mc22: %.2f ms (30fps deadline 33.33)\n", qpu_sum);
printf(" Margin: %+.2f ms\n", 33.33 - qpu_sum);
printf("\n CPU vs QPU sum ratio: %.2fx (>1 means QPU wins)\n",
qpu_sum > 0 ? cpu_sum / qpu_sum : 0.0);
}
printf("\n(NOT included: chroma deblock, chroma IDCT, intra prediction,\n");
printf(" CABAC/CAVLC entropy. These bench numbers are a budget LOWER\n");
printf(" bound; the real decode stack adds 20-40%% on top.\n");
printf(" Per-kernel substrate decisions belong in daedalus_core.c recipe\n");
printf(" table; the QPU substrate decree (2026-05-23) keeps everything\n");
printf(" on QPU regardless of these numbers as a policy choice.)\n");
daedalus_ctx_destroy(ctx);
return 0;
}
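The ns/op and ms/frame arithmetic used by the bench shows how a failing dispatch could pass for fast QPU work. The sketch below repeats the two formulas from bench_ns and the summary table; the function names are hypothetical. The 5 µs figure is the upper end of the failure-path cost quoted in the commit message, and 217.63 ns/op is the corrected IDCT 4x4 QPU number from the same message.

```c
#include <assert.h>

/* Same formula as bench_ns: wall time spread over every op timed. */
static double ns_per_op(double total_ms, int iters, int ops_per_iter)
{
    assert(iters > 0 && ops_per_iter > 0);
    return (total_ms * 1e6) / ((double)iters * ops_per_iter);
}

/* Same projection as the summary table: ns/op scaled to one frame. */
static double ms_per_frame(double ns_op, int ops_per_frame)
{
    return ns_op * ops_per_frame / 1e6;
}

/* 30 iterations that each fail after 5 us take 0.15 ms in total. */
static double artifact_ns_per_op(void)
{
    return ns_per_op(30 * 5e-3, 30, 256);
}

/* Corrected QPU IDCT 4x4 figure projected to a 1080p frame
 * (8160 MBs x 16 blocks). */
static double idct4_qpu_ms_per_frame(void)
{
    return ms_per_frame(217.63, 8160 * 16);
}
```

The failure path works out to about 19.5 ns/op when divided across 256 ops, which is why it read as a plausible kernel time, while the corrected 217.63 ns/op projects to about 28.4 ms per frame for IDCT 4x4 alone.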
+53
@@ -0,0 +1,53 @@
/*
* Standalone bit-exact C reference for the H.264 chroma DC 2x2
* Hadamard transform (per H.264 §8.5.11.1).
*
* In 4:2:0 chroma, the four DC coefficients (one from each chroma
* 4x4 AC block within an MB) are arranged into a 2x2 block:
*
* c[0,0] c[0,1] = block (0,0) DC, block (0,1) DC
* c[1,0] c[1,1] = block (1,0) DC, block (1,1) DC
*
* The 2x2 Hadamard transform:
*
* f[0,0] = c[0,0] + c[0,1] + c[1,0] + c[1,1]
* f[0,1] = c[0,0] - c[0,1] + c[1,0] - c[1,1]
* f[1,0] = c[0,0] + c[0,1] - c[1,0] - c[1,1]
* f[1,1] = c[0,0] - c[0,1] - c[1,0] + c[1,1]
*
* Equivalently expressed as 2-stage butterflies (row then col), which
* the NEON impl uses for SIMD friendliness; we present that form
* here too so the QPU/NEON ports are 1:1.
*
* Output f[] replaces the input c[]. The QP-dependent scaling per
* §8.5.11.2 happens AFTER this primitive; the intercept patch
* composes Hadamard + LevelScale + shift itself, since the scaling
* shape depends on QP and on whether we're in the chroma_qp_offset
* adjustment regime.
*
* Input/output layout:
* c[0..3] in row-major order: [c[0,0], c[0,1], c[1,0], c[1,1]]
*
* License: BSD-2-Clause. Algorithm is in the H.264 spec.
*/
#include <stdint.h>
void daedalus_h264_chroma_dc_hadamard_2x2_ref(int16_t c[4])
{
/* Stage 1: butterfly along rows.
* t[0] = c[0,0] + c[0,1] = c[0] + c[1]
* t[1] = c[0,0] - c[0,1] = c[0] - c[1]
* t[2] = c[1,0] + c[1,1] = c[2] + c[3]
* t[3] = c[1,0] - c[1,1] = c[2] - c[3]
*/
int t0 = c[0] + c[1];
int t1 = c[0] - c[1];
int t2 = c[2] + c[3];
int t3 = c[2] - c[3];
/* Stage 2: butterfly along cols. */
c[0] = (int16_t)(t0 + t2); /* f[0,0] = t0+t2 = sum of all 4 */
c[1] = (int16_t)(t1 + t3); /* f[0,1] = (c0-c1) + (c2-c3) */
c[2] = (int16_t)(t0 - t2); /* f[1,0] = (c0+c1) - (c2+c3) */
c[3] = (int16_t)(t1 - t3); /* f[1,1] = (c0-c1) - (c2-c3) */
}
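The equivalence between the four-term formulas and the two-stage butterfly can be checked directly. This is a minimal C sketch with local copies of both forms under hypothetical names (`hadamard_2x2_direct`, `hadamard_2x2_butterfly`, `check_forms_agree`); it does not call the daedalus reference function.

```c
#include <assert.h>
#include <stdint.h>

/* Direct form: each output is a signed sum of all four inputs. */
static void hadamard_2x2_direct(const int16_t c[4], int16_t f[4])
{
    f[0] = (int16_t)(c[0] + c[1] + c[2] + c[3]);
    f[1] = (int16_t)(c[0] - c[1] + c[2] - c[3]);
    f[2] = (int16_t)(c[0] + c[1] - c[2] - c[3]);
    f[3] = (int16_t)(c[0] - c[1] - c[2] + c[3]);
}

/* Butterfly form: rows first, then columns, in place. */
static void hadamard_2x2_butterfly(int16_t c[4])
{
    int t0 = c[0] + c[1], t1 = c[0] - c[1];
    int t2 = c[2] + c[3], t3 = c[2] - c[3];
    c[0] = (int16_t)(t0 + t2);
    c[1] = (int16_t)(t1 + t3);
    c[2] = (int16_t)(t0 - t2);
    c[3] = (int16_t)(t1 - t3);
}

/* Asserts that both forms produce the same output for one input. */
static void check_forms_agree(const int16_t in[4])
{
    int16_t a[4], b[4] = { in[0], in[1], in[2], in[3] };
    hadamard_2x2_direct(in, a);
    hadamard_2x2_butterfly(b);
    for (int i = 0; i < 4; i++) assert(a[i] == b[i]);
}
```

For the input {1, 2, 3, 4} both forms give {10, -2, -4, 0}. Applying the transform a second time returns four times the original input, since the 2x2 Hadamard matrix squared is twice the identity.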
-123
@@ -1,123 +0,0 @@
/*
* Standalone bit-exact C reference for H.264 luma Intra_8x8
* prediction modes (per H.264 spec §8.3.2.1). High-profile-only
* MB type: Baseline/Main/Extended profiles don't see Intra_8x8.
*
* Distinct from Intra_4x4 in two ways:
*
* 1. REFERENCE SAMPLE FILTERING (§8.3.2.1.1). The 25 raw
* neighbour samples are pre-filtered with a 1-2-1 smoothing
* filter BEFORE prediction. The filtering has spec-defined
* boundary handling at the corners and the right-edge of the
* top-row extension.
*
* 2. SCALE. All 9 prediction modes operate at 8x8 with the
* filtered samples (Intra_4x4 operates at 4x4 with the raw
* samples).
*
* This PR implements the filter + the 3 simple modes (Vertical,
* Horizontal, DC). The 6 directional modes (DDL, DDR, VR, HD, VL,
* HU at 8x8) follow in a separate PR: same template, different
* formulas per spec sections §8.3.2.1.4..§8.3.2.1.9.
*
* Calling convention (FFmpeg-style):
* pred_8x8_<mode>_ref(uint8_t *dst, ptrdiff_t stride)
*
* `dst` points at row 0 col 0 of the 8x8 output block. Reads from
* top[0..15] = dst[-stride + 0..15]
* top-left = dst[-stride - 1]
* left[0..7] = dst[ 0*stride - 1 .. 7*stride - 1]
*
* AVAILABILITY: assumes all neighbours valid (interior-MB case).
*
* License: BSD-2-Clause.
*/
#include <stdint.h>
#include <stddef.h>
#include <string.h>
static inline int clip_u8(int v) { return v < 0 ? 0 : v > 255 ? 255 : v; }
/* H.264 §8.3.2.1.1 reference sample filtering. Filters the 25 raw
* samples around the 8x8 block into a `filt` array with the same
* indices. When called against an "all neighbours available" tile,
* the filtered output uses these spec-defined formulas:
*
* filt[top -1] (= filtered top-left) = (top[0] + 2*tl + left[0] + 2) >> 2
*
* filt[top 0] = (tl + 2*top[0] + top[1] + 2) >> 2
* filt[top i] for 1<=i<=14 = (top[i-1] + 2*top[i] + top[i+1] + 2) >> 2
* filt[top 15] = (top[14] + 3*top[15] + 2) >> 2 (boundary)
*
* filt[left 0] = (tl + 2*left[0] + left[1] + 2) >> 2
* filt[left j] for 1<=j<=6 = (left[j-1] + 2*left[j] + left[j+1] + 2) >> 2
* filt[left 7] = (left[6] + 3*left[7] + 2) >> 2 (boundary)
*
* Reads neighbours from the dst buffer; writes filtered values to
* a caller-provided 25-element array indexed as:
* filt[0] = filtered top-left
* filt[1..16] = filtered top[0..15]
* filt[17..24] = filtered left[0..7]
*/
static void filter_refs(const uint8_t *dst, ptrdiff_t stride,
uint8_t filt[25])
{
int tl = dst[-stride - 1];
int t[16];
for (int i = 0; i < 16; i++) t[i] = dst[-stride + i];
int l[8];
for (int j = 0; j < 8; j++) l[j] = dst[j * stride - 1];
/* Filtered top-left. */
filt[0] = (uint8_t)((t[0] + 2*tl + l[0] + 2) >> 2);
/* Filtered top. */
filt[1] = (uint8_t)((tl + 2*t[0] + t[1] + 2) >> 2);
for (int i = 1; i <= 14; i++)
filt[1 + i] = (uint8_t)((t[i-1] + 2*t[i] + t[i+1] + 2) >> 2);
filt[1 + 15] = (uint8_t)((t[14] + 3*t[15] + 2) >> 2);
/* Filtered left. */
filt[17 + 0] = (uint8_t)((tl + 2*l[0] + l[1] + 2) >> 2);
for (int j = 1; j <= 6; j++)
filt[17 + j] = (uint8_t)((l[j-1] + 2*l[j] + l[j+1] + 2) >> 2);
filt[17 + 7] = (uint8_t)((l[6] + 3*l[7] + 2) >> 2);
}
/* Convenience macros for accessing the filt[] array by spec-style index. */
#define FT(i) filt[1 + (i)] /* filtered top[i], i in 0..15 */
#define FL(j) filt[17 + (j)] /* filtered left[j], j in 0..7 */
#define FTL filt[0] /* filtered top-left */
/* Mode 0, Vertical (§8.3.2.1.2): pred[r,c] = filt_top[c]. */
void daedalus_h264_pred_8x8l_vertical_ref(uint8_t *dst, ptrdiff_t stride)
{
uint8_t filt[25];
filter_refs(dst, stride, filt);
for (int r = 0; r < 8; r++)
for (int c = 0; c < 8; c++) dst[r * stride + c] = FT(c);
}
/* Mode 1, Horizontal (§8.3.2.1.3): pred[r,c] = filt_left[r]. */
void daedalus_h264_pred_8x8l_horizontal_ref(uint8_t *dst, ptrdiff_t stride)
{
uint8_t filt[25];
filter_refs(dst, stride, filt);
for (int r = 0; r < 8; r++)
for (int c = 0; c < 8; c++) dst[r * stride + c] = FL(r);
}
/* Mode 2, DC (§8.3.2.1.4): ((sum_filt_top[0..7] + sum_filt_left[0..7]
* + 8) >> 4) broadcast. Note the +8 (not +4 like 4x4): there are
* 16 samples summed total, so >> 4 with half-step rounding +8. */
void daedalus_h264_pred_8x8l_dc_ref(uint8_t *dst, ptrdiff_t stride)
{
uint8_t filt[25];
filter_refs(dst, stride, filt);
int sum = 8;
for (int i = 0; i < 8; i++) sum += FT(i);
for (int j = 0; j < 8; j++) sum += FL(j);
uint8_t v = (uint8_t)(sum >> 4);
for (int r = 0; r < 8; r++)
for (int c = 0; c < 8; c++) dst[r * stride + c] = v;
}
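The filter and DC formulas in this file share one property: the tap weights sum to the divisor, so a uniform neighbourhood passes through unchanged. A minimal sketch of the three arithmetic building blocks, assuming nothing beyond the formulas quoted in the comments above, is given here. The helper names `tap_121`, `tap_13` and `dc_from_filtered` are hypothetical and not part of the repo.

```c
#include <stdint.h>

/* Hypothetical helper: interior 1-2-1 tap, as used for filt[top i] and
 * filt[left j]. The weights sum to 4, matching the >> 2. */
static uint8_t tap_121(int prev, int cur, int next)
{
    return (uint8_t)((prev + 2 * cur + next + 2) >> 2);
}

/* Hypothetical helper: boundary 1-3 tap, as used for filt[top 15] and
 * filt[left 7], where no "next" sample exists. Weights again sum to 4. */
static uint8_t tap_13(int prev, int cur)
{
    return (uint8_t)((prev + 3 * cur + 2) >> 2);
}

/* Hypothetical helper: DC value from 8 filtered top and 8 filtered left
 * samples. Adding 8 before >> 4 rounds sum/16 to nearest, ties upward. */
static uint8_t dc_from_filtered(const uint8_t top[8], const uint8_t left[8])
{
    int sum = 8;
    for (int i = 0; i < 8; i++)
        sum += top[i] + left[i];
    return (uint8_t)(sum >> 4);
}
```

With all inputs equal to 120, both taps return 120. With eight top samples of 2 and eight left samples of 3, the exact mean is 2.5 and `dc_from_filtered` returns 3.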
+4 -4
@@ -683,13 +683,13 @@ int main(void)
printf(" H264_QPEL_MC20 recipe substrate: %d\n",
(int) daedalus_recipe_substrate_for(DAEDALUS_KERNEL_H264_QPEL_MC20));
-printf(" H264_DEBLOCK_LH recipe substrate: %d (CPU, no QPU H shader yet)\n",
+printf(" H264_DEBLOCK_LH recipe substrate: %d\n",
(int) daedalus_recipe_substrate_for(DAEDALUS_KERNEL_H264_DEBLOCK_LH));
-printf(" H264_DEBLOCK_CV recipe substrate: %d (CPU)\n",
+printf(" H264_DEBLOCK_CV recipe substrate: %d\n",
(int) daedalus_recipe_substrate_for(DAEDALUS_KERNEL_H264_DEBLOCK_CV));
-printf(" H264_DEBLOCK_CH recipe substrate: %d (CPU)\n",
+printf(" H264_DEBLOCK_CH recipe substrate: %d\n",
(int) daedalus_recipe_substrate_for(DAEDALUS_KERNEL_H264_DEBLOCK_CH));
-printf(" H264_DEBLOCK_*_INTRA recipe substrate: %d (CPU, bS=4 set)\n",
+printf(" H264_DEBLOCK_*_INTRA recipe substrate: %d (bS=4 family, all on QPU)\n",
(int) daedalus_recipe_substrate_for(DAEDALUS_KERNEL_H264_DEBLOCK_LV_INTRA));
int fail = 0;
+136
@@ -0,0 +1,136 @@
/*
* Tests the H.264 chroma DC 2x2 Hadamard primitive against
* spec-derived expected outputs.
*
* f[0,0] = c[0,0] + c[0,1] + c[1,0] + c[1,1] "sum"
* f[0,1] = c[0,0] - c[0,1] + c[1,0] - c[1,1] "col-diff"
* f[1,0] = c[0,0] + c[0,1] - c[1,0] - c[1,1] "row-diff"
* f[1,1] = c[0,0] - c[0,1] - c[1,0] + c[1,1] "anti-diag"
*/
#include <stdint.h>
#include <stdio.h>
#include <string.h>
extern void daedalus_h264_chroma_dc_hadamard_2x2_ref(int16_t c[4]);
extern void daedalus_h264_chroma_dc_hadamard_2x2(int16_t c[4]); /* public API */
static int check(const char *name, int16_t in[4], int16_t expect[4])
{
int16_t c[4]; memcpy(c, in, sizeof(c));
daedalus_h264_chroma_dc_hadamard_2x2_ref(c);
int fail = 0;
for (int i = 0; i < 4; i++) {
if (c[i] != expect[i]) {
fprintf(stderr, "%s: c[%d] = %d, expected %d\n",
name, i, c[i], expect[i]);
fail = 1;
}
}
if (!fail) printf(" %-32s PASS\n", name);
else printf(" %-32s FAIL\n", name);
return fail;
}
int main(void)
{
int fail = 0;
/* Test 1: All-same input.
* c = [5, 5, 5, 5]
* f[0,0] = 20, f[0,1] = 0, f[1,0] = 0, f[1,1] = 0
*/
{ int16_t in[4] = { 5, 5, 5, 5 };
int16_t ex[4] = { 20, 0, 0, 0 };
fail |= check("all-uniform 5", in, ex); }
/* Test 2: Single-axis variation (col 0 = 0, col 1 = 10).
* c = [0, 10, 0, 10]
* f[0,0] = 0+10+0+10 = 20
* f[0,1] = 0-10+0-10 = -20
* f[1,0] = 0+10-0-10 = 0
* f[1,1] = 0-10-0+10 = 0
*/
{ int16_t in[4] = { 0, 10, 0, 10 };
int16_t ex[4] = { 20, -20, 0, 0 };
fail |= check("col gradient [0,10,0,10]", in, ex); }
/* Test 3: Row gradient.
* c = [0, 0, 10, 10]
* f[0,0] = 20, f[0,1] = 0, f[1,0] = 0-20 = -20, f[1,1] = 0
*/
{ int16_t in[4] = { 0, 0, 10, 10 };
int16_t ex[4] = { 20, 0, -20, 0 };
fail |= check("row gradient [0,0,10,10]", in, ex); }
/* Test 4: Anti-diagonal pattern.
* c = [10, 0, 0, 10]
* f[0,0] = 20
* f[0,1] = 10-0+0-10 = 0
* f[1,0] = 10+0-0-10 = 0
* f[1,1] = 10-0-0+10 = 20
*/
{ int16_t in[4] = { 10, 0, 0, 10 };
int16_t ex[4] = { 20, 0, 0, 20 };
fail |= check("anti-diagonal [10,0,0,10]", in, ex); }
/* Test 5: Asymmetric input; three of the four bands are non-zero.
* c = [1, 2, 3, 4]
* f[0,0] = 10
* f[0,1] = 1-2+3-4 = -2
* f[1,0] = 1+2-3-4 = -4
* f[1,1] = 1-2-3+4 = 0
*/
{ int16_t in[4] = { 1, 2, 3, 4 };
int16_t ex[4] = { 10, -2, -4, 0 };
fail |= check("asymmetric [1,2,3,4]", in, ex); }
/* Test 6: Negative inputs (Hadamard is linear, so signs preserve).
* c = [-5, 5, -5, 5]
* f[0,0] = -5+5-5+5 = 0
* f[0,1] = -5-5-5-5 = -20
* f[1,0] = -5+5+5-5 = 0
* f[1,1] = -5-5+5+5 = 0
*/
{ int16_t in[4] = { -5, 5, -5, 5 };
int16_t ex[4] = { 0, -20, 0, 0 };
fail |= check("sign-alternating [-5,5,-5,5]", in, ex); }
/* Test 7: Inverse-property check. H * H = 4*I for the unscaled
* 2x2 Hadamard, so applying it twice multiplies each element by 4.
* c = [1, 2, 3, 4]
* First Hadamard: [10, -2, -4, 0]
* Second Hadamard: [4, 8, 12, 16]
*/
{ int16_t in[4] = { 1, 2, 3, 4 };
int16_t ex[4] = { 4, 8, 12, 16 };
int16_t c[4]; memcpy(c, in, sizeof(c));
daedalus_h264_chroma_dc_hadamard_2x2_ref(c);
daedalus_h264_chroma_dc_hadamard_2x2_ref(c);
int local_fail = 0;
for (int i = 0; i < 4; i++) if (c[i] != ex[i]) local_fail = 1;
printf(" %-32s %s\n", "double-Hadamard = 4*orig",
local_fail ? "FAIL" : "PASS");
fail |= local_fail;
}
/* Test 8: public API parity. The public symbol must produce
* byte-identical output to the test-only ref for the same input.
* If the src/ Hadamard ever drifts from the spec, this catches it. */
{
int16_t input[4] = { 7, -11, 23, -42 };
int16_t a[4], b[4];
memcpy(a, input, sizeof(a));
memcpy(b, input, sizeof(b));
daedalus_h264_chroma_dc_hadamard_2x2_ref(a);
daedalus_h264_chroma_dc_hadamard_2x2(b);
int local_fail = 0;
for (int i = 0; i < 4; i++) if (a[i] != b[i]) local_fail = 1;
printf(" %-32s %s\n", "public API parity vs _ref",
local_fail ? "FAIL" : "PASS");
fail |= local_fail;
}
if (fail == 0) printf("\nALL chroma DC Hadamard tests PASS\n");
else fprintf(stderr, "\nchroma DC Hadamard test(s) FAILED\n"); /* fail is an OR-ed flag, not a count */
return fail ? 1 : 0;
}
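Test 7 above checks the double-application property on a single vector. As a hedged sketch of how that check could be widened, the sweep below applies the transform twice to every input on a small grid and counts the vectors where the result is not 4 times the input. `hadamard_2x2` is a local copy of the reference butterfly so the sketch compiles on its own, and `count_double_hadamard_mismatches` is a hypothetical name, not something in the repo.

```c
#include <stdint.h>

/* Local copy of the two-stage butterfly from the reference, so this
 * sketch is self-contained. */
static void hadamard_2x2(int16_t c[4])
{
    int t0 = c[0] + c[1], t1 = c[0] - c[1];
    int t2 = c[2] + c[3], t3 = c[2] - c[3];
    c[0] = (int16_t)(t0 + t2);
    c[1] = (int16_t)(t1 + t3);
    c[2] = (int16_t)(t0 - t2);
    c[3] = (int16_t)(t1 - t3);
}

/* Hypothetical sweep: for every input whose elements lie in
 * [-range, range], apply the transform twice and count the vectors whose
 * result differs from 4 * input. Keep 4 * range within int16_t so the
 * intermediate values after the first pass do not overflow. */
static long count_double_hadamard_mismatches(int range)
{
    long bad = 0;
    for (int a = -range; a <= range; a++)
    for (int b = -range; b <= range; b++)
    for (int d = -range; d <= range; d++)
    for (int e = -range; e <= range; e++) {
        const int in[4] = { a, b, d, e };
        int16_t c[4] = { (int16_t)a, (int16_t)b, (int16_t)d, (int16_t)e };
        hadamard_2x2(c);
        hadamard_2x2(c);
        int mismatch = 0;
        for (int i = 0; i < 4; i++)
            if (c[i] != 4 * in[i]) mismatch = 1;
        bad += mismatch;
    }
    return bad;
}
```

A return value of 0 for a given `range` means every vector on that grid satisfied the property. The grid grows as (2 * range + 1)^4, so small ranges keep the sweep cheap.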
+9 -9
@@ -18,10 +18,10 @@
#include <stdio.h>
#include <string.h>
-extern void daedalus_h264_pred_16x16_vertical_ref(uint8_t *dst, ptrdiff_t stride);
-extern void daedalus_h264_pred_16x16_horizontal_ref(uint8_t *dst, ptrdiff_t stride);
-extern void daedalus_h264_pred_16x16_dc_ref(uint8_t *dst, ptrdiff_t stride);
-extern void daedalus_h264_pred_16x16_plane_ref(uint8_t *dst, ptrdiff_t stride);
+extern void daedalus_h264_pred_16x16_vertical(uint8_t *dst, ptrdiff_t stride);
+extern void daedalus_h264_pred_16x16_horizontal(uint8_t *dst, ptrdiff_t stride);
+extern void daedalus_h264_pred_16x16_dc(uint8_t *dst, ptrdiff_t stride);
+extern void daedalus_h264_pred_16x16_plane(uint8_t *dst, ptrdiff_t stride);
#define STRIDE 17
#define ROWS 17
@@ -84,7 +84,7 @@ int main(void)
int t[16], l[16];
for (int i = 0; i < 16; i++) { t[i] = 10 + i; l[i] = 0; }
set_ctx(buf, 0, t, l);
-daedalus_h264_pred_16x16_vertical_ref(&buf[1][1], STRIDE);
+daedalus_h264_pred_16x16_vertical(&buf[1][1], STRIDE);
struct vertical_ctx vc = { t };
fail |= check(buf, "Vertical (mode 0)", expect_vertical, &vc);
}
@@ -95,7 +95,7 @@ int main(void)
int t[16] = {0}, l[16];
for (int i = 0; i < 16; i++) l[i] = 50 + i;
set_ctx(buf, 0, t, l);
-daedalus_h264_pred_16x16_horizontal_ref(&buf[1][1], STRIDE);
+daedalus_h264_pred_16x16_horizontal(&buf[1][1], STRIDE);
struct horizontal_ctx hc = { l };
fail |= check(buf, "Horizontal (mode 1)", expect_horizontal, &hc);
}
@@ -108,7 +108,7 @@ int main(void)
int t[16], l[16];
for (int i = 0; i < 16; i++) { t[i] = 2; l[i] = 6; }
set_ctx(buf, 99, t, l);
-daedalus_h264_pred_16x16_dc_ref(&buf[1][1], STRIDE);
+daedalus_h264_pred_16x16_dc(&buf[1][1], STRIDE);
uint8_t exp_val = 4;
fail |= check(buf, "DC (mode 2)", expect_uniform, &exp_val);
}
@@ -123,7 +123,7 @@ int main(void)
int t[16], l[16];
for (int i = 0; i < 16; i++) { t[i] = 100; l[i] = 100; }
set_ctx(buf, 100, t, l); /* uniform tl too — H/V sums actually zero */
-daedalus_h264_pred_16x16_plane_ref(&buf[1][1], STRIDE);
+daedalus_h264_pred_16x16_plane(&buf[1][1], STRIDE);
uint8_t exp_val = 100;
fail |= check(buf, "Plane (mode 3, uniform)", expect_uniform, &exp_val);
}
@@ -150,7 +150,7 @@ int main(void)
int t[16], l[16];
for (int i = 0; i < 16; i++) { t[i] = i; l[i] = i; }
set_ctx(buf, 0, t, l);
-daedalus_h264_pred_16x16_plane_ref(&buf[1][1], STRIDE);
+daedalus_h264_pred_16x16_plane(&buf[1][1], STRIDE);
uint8_t tl_actual = buf[1 + 0][1 + 0];
uint8_t br_actual = buf[1 + 15][1 + 15];
int spot_fail = 0;
+19 -19
@@ -22,15 +22,15 @@
#include <stdio.h>
#include <string.h>
-extern void daedalus_h264_pred_4x4_vertical_ref(uint8_t *dst, ptrdiff_t stride);
-extern void daedalus_h264_pred_4x4_horizontal_ref(uint8_t *dst, ptrdiff_t stride);
-extern void daedalus_h264_pred_4x4_dc_ref(uint8_t *dst, ptrdiff_t stride);
-extern void daedalus_h264_pred_4x4_ddl_ref(uint8_t *dst, ptrdiff_t stride);
-extern void daedalus_h264_pred_4x4_ddr_ref(uint8_t *dst, ptrdiff_t stride);
-extern void daedalus_h264_pred_4x4_vr_ref(uint8_t *dst, ptrdiff_t stride);
-extern void daedalus_h264_pred_4x4_hd_ref(uint8_t *dst, ptrdiff_t stride);
-extern void daedalus_h264_pred_4x4_vl_ref(uint8_t *dst, ptrdiff_t stride);
-extern void daedalus_h264_pred_4x4_hu_ref(uint8_t *dst, ptrdiff_t stride);
+extern void daedalus_h264_pred_4x4_vertical(uint8_t *dst, ptrdiff_t stride);
+extern void daedalus_h264_pred_4x4_horizontal(uint8_t *dst, ptrdiff_t stride);
+extern void daedalus_h264_pred_4x4_dc(uint8_t *dst, ptrdiff_t stride);
+extern void daedalus_h264_pred_4x4_ddl(uint8_t *dst, ptrdiff_t stride);
+extern void daedalus_h264_pred_4x4_ddr(uint8_t *dst, ptrdiff_t stride);
+extern void daedalus_h264_pred_4x4_vr(uint8_t *dst, ptrdiff_t stride);
+extern void daedalus_h264_pred_4x4_hd(uint8_t *dst, ptrdiff_t stride);
+extern void daedalus_h264_pred_4x4_vl(uint8_t *dst, ptrdiff_t stride);
+extern void daedalus_h264_pred_4x4_hu(uint8_t *dst, ptrdiff_t stride);
#define STRIDE 9
typedef void (*pred_fn)(uint8_t *dst, ptrdiff_t stride);
@@ -82,7 +82,7 @@ int main(void)
int t[8] = { 10, 20, 30, 40, 0, 0, 0, 0 };
int l[4] = { 0, 0, 0, 0 };
set_ctx(buf, tl, t, l);
-daedalus_h264_pred_4x4_vertical_ref(&buf[1][1], STRIDE);
+daedalus_h264_pred_4x4_vertical(&buf[1][1], STRIDE);
uint8_t exp[4][4] = {
{10,20,30,40}, {10,20,30,40}, {10,20,30,40}, {10,20,30,40}
};
@@ -95,7 +95,7 @@ int main(void)
int t[8] = { 0,0,0,0, 0,0,0,0 };
int l[4] = { 50, 60, 70, 80 };
set_ctx(buf, 0, t, l);
-daedalus_h264_pred_4x4_horizontal_ref(&buf[1][1], STRIDE);
+daedalus_h264_pred_4x4_horizontal(&buf[1][1], STRIDE);
uint8_t exp[4][4] = {
{50,50,50,50}, {60,60,60,60}, {70,70,70,70}, {80,80,80,80}
};
@@ -110,7 +110,7 @@ int main(void)
int t[8] = { 1,1,1,1, 0,0,0,0 };
int l[4] = { 3,3,3,3 };
set_ctx(buf, 99, t, l); /* tl unused for DC */
-daedalus_h264_pred_4x4_dc_ref(&buf[1][1], STRIDE);
+daedalus_h264_pred_4x4_dc(&buf[1][1], STRIDE);
uint8_t exp[4][4] = {
{2,2,2,2}, {2,2,2,2}, {2,2,2,2}, {2,2,2,2}
};
@@ -125,7 +125,7 @@ int main(void)
int t[8] = { 100,100,100,100, 100,100,100,100 };
int l[4] = { 0,0,0,0 };
set_ctx(buf, 0, t, l);
-daedalus_h264_pred_4x4_ddl_ref(&buf[1][1], STRIDE);
+daedalus_h264_pred_4x4_ddl(&buf[1][1], STRIDE);
uint8_t exp[4][4] = {
{100,100,100,100}, {100,100,100,100},
{100,100,100,100}, {100,100,100,100}
@@ -140,7 +140,7 @@ int main(void)
int t[8] = { 200,200,200,200, 0,0,0,0 };
int l[4] = { 200,200,200,200 };
set_ctx(buf, 200, t, l);
-daedalus_h264_pred_4x4_ddr_ref(&buf[1][1], STRIDE);
+daedalus_h264_pred_4x4_ddr(&buf[1][1], STRIDE);
uint8_t exp[4][4] = {
{200,200,200,200}, {200,200,200,200},
{200,200,200,200}, {200,200,200,200}
@@ -155,7 +155,7 @@ int main(void)
int t[8] = { 80,80,80,80, 0,0,0,0 };
int l[4] = { 80,80,80,80 };
set_ctx(buf, 80, t, l);
-daedalus_h264_pred_4x4_vr_ref(&buf[1][1], STRIDE);
+daedalus_h264_pred_4x4_vr(&buf[1][1], STRIDE);
uint8_t exp[4][4] = {
{80,80,80,80}, {80,80,80,80}, {80,80,80,80}, {80,80,80,80}
};
@@ -168,7 +168,7 @@ int main(void)
int t[8] = { 120,120,120,120, 0,0,0,0 };
int l[4] = { 120,120,120,120 };
set_ctx(buf, 120, t, l);
-daedalus_h264_pred_4x4_hd_ref(&buf[1][1], STRIDE);
+daedalus_h264_pred_4x4_hd(&buf[1][1], STRIDE);
uint8_t exp[4][4] = {
{120,120,120,120}, {120,120,120,120},
{120,120,120,120}, {120,120,120,120}
@@ -182,7 +182,7 @@ int main(void)
int t[8] = { 64,64,64,64, 64,64,64,64 };
int l[4] = { 0,0,0,0 };
set_ctx(buf, 0, t, l);
-daedalus_h264_pred_4x4_vl_ref(&buf[1][1], STRIDE);
+daedalus_h264_pred_4x4_vl(&buf[1][1], STRIDE);
uint8_t exp[4][4] = {
{64,64,64,64}, {64,64,64,64}, {64,64,64,64}, {64,64,64,64}
};
@@ -195,7 +195,7 @@ int main(void)
int t[8] = { 0,0,0,0, 0,0,0,0 };
int l[4] = { 200,200,200,200 };
set_ctx(buf, 0, t, l);
-daedalus_h264_pred_4x4_hu_ref(&buf[1][1], STRIDE);
+daedalus_h264_pred_4x4_hu(&buf[1][1], STRIDE);
uint8_t exp[4][4] = {
{200,200,200,200}, {200,200,200,200},
{200,200,200,200}, {200,200,200,200}
@@ -230,7 +230,7 @@ int main(void)
int t[8] = { 10,20,30,40, 0,0,0,0 };
int l[4] = { 50,60,70,0 };
set_ctx(buf, 5, t, l);
-daedalus_h264_pred_4x4_vr_ref(&buf[1][1], STRIDE);
+daedalus_h264_pred_4x4_vr(&buf[1][1], STRIDE);
uint8_t exp[4][4] = {
{ 8,15,25,35},
{18,11,20,30},
+40 -9
@@ -14,9 +14,15 @@
#include <stdio.h>
#include <string.h>
-extern void daedalus_h264_pred_8x8l_vertical_ref(uint8_t *dst, ptrdiff_t stride);
-extern void daedalus_h264_pred_8x8l_horizontal_ref(uint8_t *dst, ptrdiff_t stride);
-extern void daedalus_h264_pred_8x8l_dc_ref(uint8_t *dst, ptrdiff_t stride);
+extern void daedalus_h264_pred_8x8l_vertical(uint8_t *dst, ptrdiff_t stride);
+extern void daedalus_h264_pred_8x8l_horizontal(uint8_t *dst, ptrdiff_t stride);
+extern void daedalus_h264_pred_8x8l_dc(uint8_t *dst, ptrdiff_t stride);
+extern void daedalus_h264_pred_8x8l_ddl(uint8_t *dst, ptrdiff_t stride);
+extern void daedalus_h264_pred_8x8l_ddr(uint8_t *dst, ptrdiff_t stride);
+extern void daedalus_h264_pred_8x8l_vr(uint8_t *dst, ptrdiff_t stride);
+extern void daedalus_h264_pred_8x8l_hd(uint8_t *dst, ptrdiff_t stride);
+extern void daedalus_h264_pred_8x8l_vl(uint8_t *dst, ptrdiff_t stride);
+extern void daedalus_h264_pred_8x8l_hu(uint8_t *dst, ptrdiff_t stride);
#define STRIDE 17
#define ROWS 9
@@ -55,7 +61,7 @@ int main(void)
for (int i = 0; i < 16; i++) t[i] = 50;
for (int j = 0; j < 8; j++) l[j] = 0;
set_ctx(buf, 50, t, l);
-daedalus_h264_pred_8x8l_vertical_ref(&buf[1][1], STRIDE);
+daedalus_h264_pred_8x8l_vertical(&buf[1][1], STRIDE);
fail |= check_uniform(buf, "Vertical (mode 0, uniform top)", 50);
}
@@ -65,7 +71,7 @@ int main(void)
int t[16] = {0}, l[8];
for (int j = 0; j < 8; j++) l[j] = 70;
set_ctx(buf, 70, t, l);
-daedalus_h264_pred_8x8l_horizontal_ref(&buf[1][1], STRIDE);
+daedalus_h264_pred_8x8l_horizontal(&buf[1][1], STRIDE);
fail |= check_uniform(buf, "Horizontal (mode 1, uniform left)", 70);
}
@@ -78,7 +84,7 @@ int main(void)
for (int i = 0; i < 16; i++) t[i] = 33;
for (int j = 0; j < 8; j++) l[j] = 33;
set_ctx(buf, 33, t, l);
-daedalus_h264_pred_8x8l_dc_ref(&buf[1][1], STRIDE);
+daedalus_h264_pred_8x8l_dc(&buf[1][1], STRIDE);
fail |= check_uniform(buf, "DC (mode 2, uniform)", 33);
}
@@ -102,7 +108,7 @@ int main(void)
int t[16], l[8] = {0};
for (int i = 0; i < 16; i++) t[i] = i;
set_ctx(buf, 0, t, l);
-daedalus_h264_pred_8x8l_vertical_ref(&buf[1][1], STRIDE);
+daedalus_h264_pred_8x8l_vertical(&buf[1][1], STRIDE);
int diff = 0;
for (int r = 0; r < 8; r++)
for (int c = 0; c < 8; c++)
@@ -123,7 +129,7 @@ int main(void)
int t[16] = {0}, l[8];
for (int j = 0; j < 8; j++) l[j] = j;
set_ctx(buf, 0, t, l);
-daedalus_h264_pred_8x8l_horizontal_ref(&buf[1][1], STRIDE);
+daedalus_h264_pred_8x8l_horizontal(&buf[1][1], STRIDE);
int diff = 0;
for (int r = 0; r < 8; r++)
for (int c = 0; c < 8; c++)
@@ -133,7 +139,32 @@ int main(void)
fail |= (diff == 0) ? 0 : 1;
}
-if (fail == 0) printf("\nALL Intra_8x8 luma PASS (3 modes — V, H, DC)\n");
+/* Directional modes — uniform-context sanity tests. With all
+* neighbours = N, the 1-2-1 filter produces uniform N, and any
+* 3-tap / 2-tap on uniform N produces N. So every directional
+* mode should output uniform N on uniform input. */
+{
+typedef void (*pred_fn_t)(uint8_t *dst, ptrdiff_t stride);
+struct { const char *name; pred_fn_t fn; } modes[] = {
+{ "DDL (mode 3, uniform)", daedalus_h264_pred_8x8l_ddl },
+{ "DDR (mode 4, uniform)", daedalus_h264_pred_8x8l_ddr },
+{ "VR (mode 5, uniform)", daedalus_h264_pred_8x8l_vr },
+{ "HD (mode 6, uniform)", daedalus_h264_pred_8x8l_hd },
+{ "VL (mode 7, uniform)", daedalus_h264_pred_8x8l_vl },
+{ "HU (mode 8, uniform)", daedalus_h264_pred_8x8l_hu },
+};
+for (size_t i = 0; i < sizeof(modes)/sizeof(modes[0]); i++) {
+uint8_t buf[ROWS][STRIDE];
+int t[16], l[8];
+for (int k = 0; k < 16; k++) t[k] = 120;
+for (int k = 0; k < 8; k++) l[k] = 120;
+set_ctx(buf, 120, t, l);
+modes[i].fn(&buf[1][1], STRIDE);
+fail |= check_uniform(buf, modes[i].name, 120);
+}
+}
+if (fail == 0) printf("\nALL Intra_8x8 luma PASS (9 modes — V, H, DC, DDL, DDR, VR, HD, VL, HU)\n");
else fprintf(stderr, "\n%d test(s) FAILED\n", fail);
return fail ? 1 : 0;
}
+9 -9
@@ -16,10 +16,10 @@
#include <stdio.h>
#include <string.h>
-extern void daedalus_h264_pred_chroma8x8_dc_ref(uint8_t *dst, ptrdiff_t stride);
-extern void daedalus_h264_pred_chroma8x8_horizontal_ref(uint8_t *dst, ptrdiff_t stride);
-extern void daedalus_h264_pred_chroma8x8_vertical_ref(uint8_t *dst, ptrdiff_t stride);
-extern void daedalus_h264_pred_chroma8x8_plane_ref(uint8_t *dst, ptrdiff_t stride);
+extern void daedalus_h264_pred_chroma8x8_dc(uint8_t *dst, ptrdiff_t stride);
+extern void daedalus_h264_pred_chroma8x8_horizontal(uint8_t *dst, ptrdiff_t stride);
+extern void daedalus_h264_pred_chroma8x8_vertical(uint8_t *dst, ptrdiff_t stride);
+extern void daedalus_h264_pred_chroma8x8_plane(uint8_t *dst, ptrdiff_t stride);
#define STRIDE 9
#define ROWS 9
@@ -69,7 +69,7 @@ int main(void)
uint8_t buf[ROWS][STRIDE];
int t[8] = {0}, l[8] = {10, 20, 30, 40, 50, 60, 70, 80};
set_ctx(buf, 0, t, l);
-daedalus_h264_pred_chroma8x8_horizontal_ref(&buf[1][1], STRIDE);
+daedalus_h264_pred_chroma8x8_horizontal(&buf[1][1], STRIDE);
uint8_t exp[8][8];
for (int r = 0; r < 8; r++) for (int c = 0; c < 8; c++) exp[r][c] = (uint8_t) l[r];
fail |= check_per_cell(buf, "Horizontal (mode 1)", exp);
@@ -80,7 +80,7 @@ int main(void)
uint8_t buf[ROWS][STRIDE];
int t[8] = {15, 25, 35, 45, 55, 65, 75, 85}, l[8] = {0};
set_ctx(buf, 0, t, l);
-daedalus_h264_pred_chroma8x8_vertical_ref(&buf[1][1], STRIDE);
+daedalus_h264_pred_chroma8x8_vertical(&buf[1][1], STRIDE);
uint8_t exp[8][8];
for (int r = 0; r < 8; r++) for (int c = 0; c < 8; c++) exp[r][c] = (uint8_t) t[c];
fail |= check_per_cell(buf, "Vertical (mode 2)", exp);
@@ -104,7 +104,7 @@ int main(void)
int t[8] = { 8, 8, 8, 8, 16, 16, 16, 16 };
int l[8] = { 24, 24, 24, 24, 40, 40, 40, 40 };
set_ctx(buf, 99, t, l);
-daedalus_h264_pred_chroma8x8_dc_ref(&buf[1][1], STRIDE);
+daedalus_h264_pred_chroma8x8_dc(&buf[1][1], STRIDE);
uint8_t exp[8][8] = {
{16,16,16,16, 16,16,16,16},
{16,16,16,16, 16,16,16,16},
@@ -125,7 +125,7 @@ int main(void)
int t[8], l[8];
for (int i = 0; i < 8; i++) { t[i] = 100; l[i] = 100; }
set_ctx(buf, 100, t, l);
-daedalus_h264_pred_chroma8x8_plane_ref(&buf[1][1], STRIDE);
+daedalus_h264_pred_chroma8x8_plane(&buf[1][1], STRIDE);
uint8_t exp[8][8];
for (int r = 0; r < 8; r++) for (int c = 0; c < 8; c++) exp[r][c] = 100;
fail |= check_per_cell(buf, "Plane uniform (mode 3)", exp);
@@ -153,7 +153,7 @@ int main(void)
int t[8], l[8];
for (int i = 0; i < 8; i++) { t[i] = i; l[i] = i; }
set_ctx(buf, 0, t, l);
-daedalus_h264_pred_chroma8x8_plane_ref(&buf[1][1], STRIDE);
+daedalus_h264_pred_chroma8x8_plane(&buf[1][1], STRIDE);
uint8_t tl_actual = buf[1 + 0][1 + 0];
uint8_t br_actual = buf[1 + 7][1 + 7];
int spot_fail = 0;