cycle 9: V3D shader for H.264 luma qpel mc20 — closes 9/9 QPU coverage #8

Merged
marfrit merged 1 commit from noether/v3d-shader-h264-qpel-mc20 into main 2026-05-23 19:14:45 +00:00
Owner

Closes the QPU-default substrate campaign per the 2026-05-23 decree — every daedalus-fourier kernel that can be done in QPU is now done in QPU. Cycle 9 is the last piece: 6-tap horizontal half-pel luma MC, H.264 §8.4.2.2.1.

Shader

src/v3d_h264_qpel_mc20.comp:

  • local_size = 64, 1 lane per output pixel of one 8x8 block, 1 block per workgroup. Simplest layout — no inter-lane communication needed; V3D's L2 cache handles the redundant reads from adjacent lanes computing adjacent output columns.
  • Per-pixel: read 6 src samples (cols c-2..c+3 in row r), apply (1, -5, 20, 20, -5, 1) / 32 filter with +16 rounding, clip to u8, write one dst byte.
  • Single-stride convention matches FFmpeg's H264QpelContext.
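For reviewers who want the per-pixel arithmetic spelled out, here is a minimal scalar sketch in C of what each lane computes, under the single-stride convention described above. The function names are hypothetical and this is not the shader source; one inner-loop iteration corresponds to one lane.

```c
#include <stddef.h>
#include <stdint.h>

/* Add the +16 rounding term, divide by 32, clip to [0, 255].
 * Negative sums are clipped before the shift so the result does not
 * depend on how the compiler shifts negative values. */
static uint8_t mc20_round_clip_u8(int sum)
{
    int v = sum + 16;
    if (v < 0)
        return 0;
    v >>= 5;
    return (uint8_t)(v > 255 ? 255 : v);
}

/* Hypothetical scalar reference for one 8x8 block.
 * src points at output column 0; the caller must guarantee that
 * columns -2..+10 of each of the 8 rows are readable. */
static void h264_qpel8_mc20_ref(uint8_t *dst, const uint8_t *src,
                                ptrdiff_t stride)
{
    for (int r = 0; r < 8; r++) {
        const uint8_t *s = src + r * stride;
        for (int c = 0; c < 8; c++) {
            /* 6-tap filter (1, -5, 20, 20, -5, 1) over cols c-2..c+3 */
            int sum = s[c - 2] - 5 * s[c - 1] + 20 * s[c]
                    + 20 * s[c + 1] - 5 * s[c + 2] + s[c + 3];
            dst[r * stride + c] = mc20_round_clip_u8(sum);
        }
    }
}
```

The taps sum to 32, so a flat input passes through unchanged. That makes a quick sanity check before comparing against a real decoder's output.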

Dispatch wiring

  • h264_qpel_mc20_pipe field on daedalus_ctx, lazy init.
  • dispatch_h264_qpel_mc20_qpu(): 3 SSBOs (src / dst / meta), one WG per block.
  • daedalus_dispatch_h264_qpel_mc20() swaps ROUTE_CPU_ONLY for the substrate-switch pattern.
  • Recipe table: H264_QPEL_MC20 returns SUBSTRATE_QPU.
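As a rough illustration of the substrate-switch pattern named above, the following C sketch shows a recipe lookup followed by a switch on the result. Only `H264_QPEL_MC20`, `SUBSTRATE_QPU`, and the 1=CPU / 2=QPU values come from this PR; every other name (`SUBSTRATE_CPU`, `recipe_substrate`, `dispatch_by_recipe`, the stubs) is an assumption made for the example and may not match `daedalus_core.c`.

```c
#include <stddef.h>

/* Values follow the smoke-test legend "1=CPU, 2=QPU".
 * SUBSTRATE_CPU is an assumed name. */
enum substrate { SUBSTRATE_CPU = 1, SUBSTRATE_QPU = 2 };

/* Hypothetical kernel ids; only the H.264 kernels are listed here. */
enum kernel_id {
    H264_IDCT4, H264_IDCT8, H264_DEBLOCK_LV, H264_QPEL_MC20, KERNEL_COUNT
};

/* Hypothetical recipe table: every listed kernel routes to the QPU. */
static enum substrate recipe_substrate(enum kernel_id k)
{
    static const enum substrate table[KERNEL_COUNT] = {
        [H264_IDCT4]      = SUBSTRATE_QPU,
        [H264_IDCT8]      = SUBSTRATE_QPU,
        [H264_DEBLOCK_LV] = SUBSTRATE_QPU,
        [H264_QPEL_MC20]  = SUBSTRATE_QPU,
    };
    return table[k];
}

typedef int (*kernel_fn)(void *args);

/* Substrate switch: pick the backend the recipe names, defaulting to
 * the CPU path for any value the switch does not recognise. */
static int dispatch_by_recipe(enum kernel_id k, kernel_fn cpu,
                              kernel_fn qpu, void *args)
{
    switch (recipe_substrate(k)) {
    case SUBSTRATE_QPU:
        return qpu(args);
    case SUBSTRATE_CPU:
    default:
        return cpu(args);
    }
}

/* Stub backends so the sketch can be exercised without a GPU;
 * each returns the id of the substrate it stands in for. */
static int run_cpu_stub(void *args) { (void)args; return SUBSTRATE_CPU; }
static int run_qpu_stub(void *args) { (void)args; return SUBSTRATE_QPU; }
```

With the table above, `dispatch_by_recipe(H264_QPEL_MC20, run_cpu_stub, run_qpu_stub, NULL)` takes the QPU branch. Flipping one table entry is then the only change needed to move a kernel between substrates.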

Verification (hertz, Pi 5, V3D 7.1)

=== Phase 8a API smoke: H.264 kernels via recipe dispatch ===
  H264_IDCT4 recipe substrate:      2 (1=CPU, 2=QPU)
  H264_IDCT8 recipe substrate:      2
  H264_DEBLOCK_LV recipe substrate: 2
  H264_QPEL_MC20 recipe substrate:  2   ← flipped
  H.264 IDCT 4x4: 2048/2048 bytes bit-exact
  H.264 IDCT 8x8: 2048/2048 bytes bit-exact
  H.264 deblock luma v: 2048/2048 bytes bit-exact
  H.264 qpel mc20: 1024/1024 bytes bit-exact   ← QPU

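The first iteration matched only 1017/1024 bytes (99.32%). The 7 mismatched bytes were traced to an undersized src_max in the host wrapper. At the last row and last column, the highest byte the filter reads is at index src_off + 7*stride + 10 (output column 7 plus the +3 tap). A buffer length must be one more than its highest valid index, so src_max = src_off + 7*stride + 11. Fixed in the same patch.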
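The sizing arithmetic can be written down as two small helpers. This is a sketch with hypothetical function names; the formulas are the ones stated in this PR's commit message (src_max = src_off + 7*stride + 11, dst_max = dst_off + 7*stride + 8).

```c
#include <stddef.h>

/* Bytes of src that must be uploaded for one 8x8 block.
 * Highest index read: last row (7), output col 7, tap +3 -> col 10.
 * Length is highest index + 1.  The -2 tap also requires src_off >= 2,
 * which the caller is assumed to guarantee. */
static size_t mc20_src_max(size_t src_off, size_t stride)
{
    return src_off + 7 * stride + 10 + 1;
}

/* Bytes of dst touched by one 8x8 block.
 * Highest index written: last row (7), col 7.  Length is index + 1. */
static size_t mc20_dst_max(size_t dst_off, size_t stride)
{
    return dst_off + 7 * stride + 7 + 1;
}
```

For example, with src_off = 2 and stride = 16 the source footprint is 2 + 112 + 11 = 125 bytes, and with dst_off = 0 the destination footprint is 112 + 8 = 120 bytes.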

9 of 9 cycles QPU-by-recipe

| Cycle | Kernel | Substrate | V3D shader |
|---|---|---|---|
| 1 | VP9 IDCT 8x8 | QPU | yes |
| 2 | VP9 LPF wd=4 | QPU | yes |
| 3 | VP9 MC 8h | QPU | yes |
| 4 | VP9 LPF wd=8 | QPU | yes |
| 5 | AV1 CDEF 8x8 | QPU | yes |
| 6 | H.264 IDCT 4x4 | QPU | yes |
| 7 | H.264 IDCT 8x8 | QPU | yes |
| 8 | H.264 deblock luma-v | QPU | yes |
| 9 | H.264 qpel mc20 | QPU | this PR |

The prototype now demonstrates GPU acceleration on every measured kernel — campaign complete.

Closes daedalus-fourier task #165.

marfrit added 1 commit 2026-05-23 19:06:00 +00:00
Closes the QPU-default substrate campaign per the 2026-05-23
decree: every daedalus-fourier kernel that can be done in QPU
is now done in QPU.  Cycle 9 is the last piece — 6-tap horizontal
half-pel luma motion compensation, H.264 §8.4.2.2.1.

Shader (src/v3d_h264_qpel_mc20.comp):

  - local_size = 64, 1 lane per output pixel of one 8x8 block,
    1 block per workgroup.  Simplest layout that avoids any
    inter-lane communication — V3D's L2 cache handles the
    redundant reads from adjacent lanes computing adjacent
    output columns.
  - Per-pixel: read 6 src samples (cols c-2..c+3 in row r),
    apply the (1, -5, 20, 20, -5, 1) / 32 filter with +16
    rounding, clip to u8, write one dst byte.
  - Single-stride convention matches FFmpeg's H264QpelContext
    (dst and src share `stride`; src+src_off points at output
    col 0 with the caller-guaranteed -2/+3 padding).

Dispatch wiring (src/daedalus_core.c):

  - h264_qpel_mc20_pipe field on daedalus_ctx, lazy init.
  - dispatch_h264_qpel_mc20_qpu(): 3 SSBOs (src / dst / meta),
    src_max = src_off + 7*stride + 11 (covers the +3-col read
    footprint on the last row), dst_max = dst_off + 7*stride + 8.
    1 block per WG.
  - daedalus_dispatch_h264_qpel_mc20() replaces ROUTE_CPU_ONLY
    with the substrate-switch pattern matching the other H.264
    kernels.
  - Recipe table: H264_QPEL_MC20 returns SUBSTRATE_QPU.

Verification (hertz, Pi 5, V3D 7.1):

  $ ./test_api_h264
  === Phase 8a API smoke: H.264 kernels via recipe dispatch ===
    H264_IDCT4 recipe substrate:      2 (1=CPU, 2=QPU)
    H264_IDCT8 recipe substrate:      2
    H264_DEBLOCK_LV recipe substrate: 2
    H264_QPEL_MC20 recipe substrate:  2   ← flipped
    H.264 IDCT 4x4: 2048/2048 bytes bit-exact
    H.264 IDCT 8x8: 2048/2048 bytes bit-exact
    H.264 deblock luma v: 2048/2048 bytes bit-exact
    H.264 qpel mc20: 1024/1024 bytes bit-exact   ← QPU

First-iteration result was 1017/1024 (99.32%) — off-by-7 traced
to undersizing src_max in the host wrapper.  The filter reads
src_off + 7*stride + (7 + 3) = +10 at the last row last column;
add 1 for memcpy's [0..N-1] semantic → 11.  Fixed in the same
patch.

All 9 daedalus-fourier cycles now QPU-by-recipe:

  cycle 1 VP9 IDCT 8x8         QPU
  cycle 2 VP9 LPF wd=4         QPU
  cycle 3 VP9 MC 8h            QPU
  cycle 4 VP9 LPF wd=8         QPU
  cycle 5 AV1 CDEF 8x8         QPU
  cycle 6 H.264 IDCT 4x4       QPU
  cycle 7 H.264 IDCT 8x8       QPU
  cycle 8 H.264 deblock luma-v QPU
  cycle 9 H.264 qpel mc20      QPU   ← this commit

Closes daedalus-fourier task #165.  Per the decree memory
[QPU is default substrate], the prototype now demonstrates GPU
acceleration on every measured kernel.
marfrit merged commit 0d54d68f38 into main 2026-05-23 19:14:45 +00:00
marfrit deleted branch noether/v3d-shader-h264-qpel-mc20 2026-05-23 19:14:45 +00:00

Reference: marfrit/daedalus-fourier#8