cycle 9: V3D shader for H.264 luma qpel mc20 — closes 9/9 QPU coverage #8

2026-05-23T19:05:59Z

marfrit commented

2026-05-23 19:05:59 +00:00

Closes the QPU-default substrate campaign per the 2026-05-23 decree — every daedalus-fourier kernel that can be done in QPU is now done in QPU. Cycle 9 is the last piece: 6-tap horizontal half-pel luma MC, H.264 §8.4.2.2.1.

Shader

src/v3d_h264_qpel_mc20.comp:

local_size = 64, 1 lane per output pixel of one 8x8 block, 1 block per workgroup. Simplest layout — no inter-lane communication needed; V3D's L2 cache handles the redundant reads from adjacent lanes computing adjacent output columns.
Per-pixel: read 6 src samples (cols c-2..c+3 in row r), apply (1, -5, 20, 20, -5, 1) / 32 filter with +16 rounding, clip to u8, write one dst byte.
Single-stride convention matches FFmpeg's H264QpelContext.

Dispatch wiring

h264_qpel_mc20_pipe field on daedalus_ctx, lazy init.
dispatch_h264_qpel_mc20_qpu(): 3 SSBOs (src / dst / meta), one WG per block.
daedalus_dispatch_h264_qpel_mc20() swaps ROUTE_CPU_ONLY for the substrate-switch pattern.
Recipe table: H264_QPEL_MC20 returns SUBSTRATE_QPU.

Verification (hertz, Pi 5, V3D 7.1)

=== Phase 8a API smoke: H.264 kernels via recipe dispatch ===
  H264_IDCT4 recipe substrate:      2 (1=CPU, 2=QPU)
  H264_IDCT8 recipe substrate:      2
  H264_DEBLOCK_LV recipe substrate: 2
  H264_QPEL_MC20 recipe substrate:  2   ← flipped
  H.264 IDCT 4x4: 2048/2048 bytes bit-exact
  H.264 IDCT 8x8: 2048/2048 bytes bit-exact
  H.264 deblock luma v: 2048/2048 bytes bit-exact
  H.264 qpel mc20: 1024/1024 bytes bit-exact   ← QPU

First-iteration result was 1017/1024 (99.32%) — off-by-7 traced to undersizing src_max in the host wrapper. The filter reads src_off + 7*stride + 10 at the last row last column; add 1 for memcpy's [0..N-1] semantic → 11. Fixed in the same patch.

9 of 9 cycles QPU-by-recipe

Cycle	Kernel	Substrate	V3D shader
1	VP9 IDCT 8x8	QPU	yes
2	VP9 LPF wd=4	QPU	yes
3	VP9 MC 8h	QPU	yes
4	VP9 LPF wd=8	QPU	yes
5	AV1 CDEF 8x8	QPU	yes
6	H.264 IDCT 4x4	QPU	yes
7	H.264 IDCT 8x8	QPU	yes
8	H.264 deblock luma-v	QPU	yes
9	H.264 qpel mc20	QPU	this PR

The prototype now demonstrates GPU acceleration on every measured kernel — campaign complete.

Closes daedalus-fourier task #165.

Closes the QPU-default substrate campaign per the 2026-05-23 decree — every daedalus-fourier kernel that can be done in QPU is now done in QPU. Cycle 9 is the last piece: 6-tap horizontal half-pel luma MC, H.264 §8.4.2.2.1. ## Shader `src/v3d_h264_qpel_mc20.comp`: - local_size = 64, **1 lane per output pixel** of one 8x8 block, 1 block per workgroup. Simplest layout — no inter-lane communication needed; V3D's L2 cache handles the redundant reads from adjacent lanes computing adjacent output columns. - Per-pixel: read 6 src samples (cols c-2..c+3 in row r), apply `(1, -5, 20, 20, -5, 1) / 32` filter with +16 rounding, clip to u8, write one dst byte. - Single-stride convention matches FFmpeg's `H264QpelContext`. ## Dispatch wiring - `h264_qpel_mc20_pipe` field on `daedalus_ctx`, lazy init. - `dispatch_h264_qpel_mc20_qpu()`: 3 SSBOs (src / dst / meta), one WG per block. - `daedalus_dispatch_h264_qpel_mc20()` swaps `ROUTE_CPU_ONLY` for the substrate-switch pattern. - Recipe table: `H264_QPEL_MC20` returns `SUBSTRATE_QPU`. ## Verification (hertz, Pi 5, V3D 7.1) ``` === Phase 8a API smoke: H.264 kernels via recipe dispatch === H264_IDCT4 recipe substrate: 2 (1=CPU, 2=QPU) H264_IDCT8 recipe substrate: 2 H264_DEBLOCK_LV recipe substrate: 2 H264_QPEL_MC20 recipe substrate: 2 ← flipped H.264 IDCT 4x4: 2048/2048 bytes bit-exact H.264 IDCT 8x8: 2048/2048 bytes bit-exact H.264 deblock luma v: 2048/2048 bytes bit-exact H.264 qpel mc20: 1024/1024 bytes bit-exact ← QPU ``` First-iteration result was 1017/1024 (99.32%) — off-by-7 traced to undersizing `src_max` in the host wrapper. The filter reads `src_off + 7*stride + 10` at the last row last column; add 1 for memcpy's [0..N-1] semantic → 11. Fixed in the same patch. ## 9 of 9 cycles QPU-by-recipe | Cycle | Kernel | Substrate | V3D shader | |---|---|---|---| | 1 | VP9 IDCT 8x8 | QPU | yes | | 2 | VP9 LPF wd=4 | QPU | yes | | 3 | VP9 MC 8h | QPU | yes | | 4 | VP9 LPF wd=8 | QPU | yes | | 5 | AV1 CDEF 8x8 | QPU | yes | | 6 | H.264 IDCT 4x4 | QPU | yes | | 7 | H.264 IDCT 8x8 | QPU | yes | | 8 | H.264 deblock luma-v | QPU | yes | | 9 | H.264 qpel mc20 | **QPU** | **this PR** | The prototype now demonstrates GPU acceleration on every measured kernel — campaign complete. Closes daedalus-fourier task #165.

marfrit added 1 commit 2026-05-23 19:06:00 +00:00

cycle 9: V3D shader for H.264 luma qpel mc20 — 9/9 QPU coverage 79553c6e22

Closes the QPU-default substrate campaign per the 2026-05-23
decree: every daedalus-fourier kernel that can be done in QPU
is now done in QPU.  Cycle 9 is the last piece — 6-tap horizontal
half-pel luma motion compensation, H.264 §8.4.2.2.1.

Shader (src/v3d_h264_qpel_mc20.comp):

  - local_size = 64, 1 lane per output pixel of one 8x8 block,
    1 block per workgroup.  Simplest layout that avoids any
    inter-lane communication — V3D's L2 cache handles the
    redundant reads from adjacent lanes computing adjacent
    output columns.
  - Per-pixel: read 6 src samples (cols c-2..c+3 in row r),
    apply the (1, -5, 20, 20, -5, 1) / 32 filter with +16
    rounding, clip to u8, write one dst byte.
  - Single-stride convention matches FFmpeg's H264QpelContext
    (dst and src share `stride`; src+src_off points at output
    col 0 with the caller-guaranteed -2/+3 padding).

Dispatch wiring (src/daedalus_core.c):

  - h264_qpel_mc20_pipe field on daedalus_ctx, lazy init.
  - dispatch_h264_qpel_mc20_qpu(): 3 SSBOs (src / dst / meta),
    src_max = src_off + 7*stride + 11 (covers the +3-col read
    footprint on the last row), dst_max = dst_off + 7*stride + 8.
    1 block per WG.
  - daedalus_dispatch_h264_qpel_mc20() replaces ROUTE_CPU_ONLY
    with the substrate-switch pattern matching the other H.264
    kernels.
  - Recipe table: H264_QPEL_MC20 returns SUBSTRATE_QPU.

Verification (hertz, Pi 5, V3D 7.1):

  $ ./test_api_h264
  === Phase 8a API smoke: H.264 kernels via recipe dispatch ===
    H264_IDCT4 recipe substrate:      2 (1=CPU, 2=QPU)
    H264_IDCT8 recipe substrate:      2
    H264_DEBLOCK_LV recipe substrate: 2
    H264_QPEL_MC20 recipe substrate:  2   ← flipped
    H.264 IDCT 4x4: 2048/2048 bytes bit-exact
    H.264 IDCT 8x8: 2048/2048 bytes bit-exact
    H.264 deblock luma v: 2048/2048 bytes bit-exact
    H.264 qpel mc20: 1024/1024 bytes bit-exact   ← QPU

First-iteration result was 1017/1024 (99.32%) — off-by-7 traced
to undersizing src_max in the host wrapper.  The filter reads
src_off + 7*stride + (7 + 3) = +10 at the last row last column;
add 1 for memcpy's [0..N-1] semantic → 11.  Fixed in the same
patch.

All 9 daedalus-fourier cycles now QPU-by-recipe:

  cycle 1 VP9 IDCT 8x8         QPU
  cycle 2 VP9 LPF wd=4         QPU
  cycle 3 VP9 MC 8h            QPU
  cycle 4 VP9 LPF wd=8         QPU
  cycle 5 AV1 CDEF 8x8         QPU
  cycle 6 H.264 IDCT 4x4       QPU
  cycle 7 H.264 IDCT 8x8       QPU
  cycle 8 H.264 deblock luma-v QPU
  cycle 9 H.264 qpel mc20      QPU   ← this commit

Closes daedalus-fourier task #165.  Per the decree memory
[QPU is default substrate], the prototype now demonstrates GPU
acceleration on every measured kernel.

marfrit merged commit 0d54d68f38 into main

2026-05-23 19:14:45 +00:00

marfrit deleted branch noether/v3d-shader-h264-qpel-mc20

2026-05-23 19:14:45 +00:00

Sign in to join this conversation.

1 Participants

Notifications

Due Date

No due date set.

Dependencies

No dependencies set.

Reference: marfrit/daedalus-fourier#8