h264: deblock bS=4 intra variants (luma + chroma, V + H) #11

2026-05-24T22:01:03Z

marfrit commented

2026-05-24 22:01:03 +00:00

Closes the H.264 deblock matrix: adds the four bS=4 intra-strength loop filters per H.264 §8.7.2.4 (the filtering process for edges with bS equal to 4). After this PR, fourier covers all 8 standard 8-bit 4:2:0 deblock combinations (luma+chroma × v+h × inter+intra).

Adds four kernel enums (LV_INTRA=13, LH_INTRA=14, CV_INTRA=15, CH_INTRA=16), all mapped to the CPU substrate, plus four dispatch functions and four recipe wrappers generated by macros to keep the boilerplate tight. The vendored FFmpeg already provides all four NEON symbols.

The C reference covers all four orientations. Luma intra has a per-side strong/weak filter selector (strong updates p0/p1/p2 on the p side and q0/q1/q2 on the q side; weak updates only p0 or q0); chroma is always weak. All 11 H.264 kernels are bit-exact PASS on hertz on the first try, which is meaningful because the strong/weak selector would surface any sign, shift, or rounding mistake immediately.

daedalus_h264_deblock_meta is reused for intra (tc0 field ignored since bS=4 hardcodes strength) — callers can build a single edge list and route by kernel.
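
The "single edge list, route by kernel" idea can be sketched roughly as follows. This is only an illustration: the PR does not show the layout of daedalus_h264_deblock_meta, so `struct sketch_edge`, its `kernel` field, the `SKETCH_*` identifier spellings, and the callback type are all hypothetical. Only the four enum values (13 to 16) come from the PR text.

```c
#include <stddef.h>
#include <stdint.h>

/* Numeric values are from the PR; the identifier spellings are placeholders. */
enum sketch_kernel {
    SKETCH_LV_INTRA = 13,
    SKETCH_LH_INTRA = 14,
    SKETCH_CV_INTRA = 15,
    SKETCH_CH_INTRA = 16
};

/* Hypothetical stand-in for daedalus_h264_deblock_meta (real layout unknown). */
struct sketch_edge {
    int    kernel;   /* which deblock kernel should process this edge */
    int    alpha;    /* edge threshold alpha */
    int    beta;     /* edge threshold beta */
    int8_t tc0[4];   /* read by bS<4 kernels, ignored by bS=4 kernels */
};

typedef void (*sketch_edge_fn)(const struct sketch_edge *e, void *ctx);

static int sketch_is_intra(int kernel)
{
    return kernel >= SKETCH_LV_INTRA && kernel <= SKETCH_CH_INTRA;
}

/* Walk one mixed edge list and hand each edge to the matching path. */
static void sketch_route_edges(const struct sketch_edge *edges, size_t n,
                               sketch_edge_fn inter_fn, sketch_edge_fn intra_fn,
                               void *ctx)
{
    for (size_t i = 0; i < n; i++) {
        sketch_edge_fn fn = sketch_is_intra(edges[i].kernel) ? intra_fn : inter_fn;
        fn(&edges[i], ctx);
    }
}

/* Example callbacks: count how many edges reach each path. */
struct sketch_counts { size_t inter, intra; };

static void sketch_count_inter(const struct sketch_edge *e, void *ctx)
{
    (void)e;
    ((struct sketch_counts *)ctx)->inter++;
}

static void sketch_count_intra(const struct sketch_edge *e, void *ctx)
{
    (void)e;
    ((struct sketch_counts *)ctx)->intra++;
}
```

The point of the design is that the intra path simply never reads tc0, so one struct type can describe both bS<4 and bS=4 edges without a second list.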

marfrit added 1 commit 2026-05-24 22:01:04 +00:00

h264: deblock bS=4 intra variants (luma + chroma, V + H) 9b1c106dc5

Closes the deblock matrix: adds the four bS=4 intra-strength loop
filters used at I-MB edges (and other boundaries where H.264
§8.7.2.1 forces boundary strength to 4).  After this PR fourier
covers all 8 standard 8-bit 4:2:0 deblock combinations:

            bS<4               bS=4
            -----------------  -------
  luma_v    ✓ (cycle 8 QPU)    ✓ (CPU)
  luma_h    ✓ (CPU, PR #9)     ✓ (CPU)
  chrm_v    ✓ (CPU, PR #10)    ✓ (CPU)
  chrm_h    ✓ (CPU, PR #10)    ✓ (CPU)

Scope:
  - 4 new kernel enums (LV_INTRA=13, LH_INTRA=14, CV_INTRA=15,
    CH_INTRA=16), all → CPU substrate in the recipe table.
  - 4 new public dispatch fns + 4 recipe wrappers (defined via two
    DEFINE_INTRA_DISPATCH / DEFINE_INTRA_RECIPE macros to keep the
    boilerplate tight).
  - 4 new extern decls for the vendored
    ff_h264_{v,h}_loop_filter_{luma,chroma}_intra_neon symbols.
  - C reference: tests/h264_intra_loop_filter_ref.c covers all four
    orientations.  Algorithm per H.264 §8.7.2.4:

      Luma: per-side strong/weak filter selector
        strong_p = (|p2-p0| < β) AND (|p0-q0| < (α>>2)+2)
        strong_q = (|q2-q0| < β) AND (|p0-q0| < (α>>2)+2)
        Strong updates p0/p1/p2 (mirrored as q0/q1/q2 on the q side);
        weak updates only p0 (or q0).
      Chroma: always weak, only p0/q0 updated.

  - daedalus_h264_deblock_meta is REUSED for intra dispatches; the
    tc0[] field is ignored (bS=4 hardcodes the strength).  Callers
    can build a single edge list and route by kernel without an
    extra struct.

  - Test refactor: an intra_test_spec table + run_intra_test helper
    drives all four orientations through one harness, keeping the
    new test surface compact (~50 LOC for 4 kernels vs ~200 if each
    had its own test_deblock_*_intra fn).
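
The luma and chroma rules listed above can be sketched for a single line of samples as follows. This is a minimal illustration written from the filter equations of the H.264 specification, not the contents of tests/h264_intra_loop_filter_ref.c; the function names and the `step` parameter are assumptions made for the example.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Sketch only: filters ONE line of luma samples across an edge.
 * pix points at q0; step is the distance between neighbouring samples
 * across the edge (1 or the row stride, depending on orientation). */
static void sketch_luma_intra_line(uint8_t *pix, ptrdiff_t step, int alpha, int beta)
{
    const int p3 = pix[-4 * step], p2 = pix[-3 * step];
    const int p1 = pix[-2 * step], p0 = pix[-1 * step];
    const int q0 = pix[0],         q1 = pix[1 * step];
    const int q2 = pix[2 * step],  q3 = pix[3 * step];

    /* Edge activity gate: leave the line untouched if it fails. */
    if (abs(p0 - q0) >= alpha || abs(p1 - p0) >= beta || abs(q1 - q0) >= beta)
        return;

    /* Second half of both strong_p and strong_q. */
    const int small_step = abs(p0 - q0) < (alpha >> 2) + 2;

    if (small_step && abs(p2 - p0) < beta) {   /* strong_p: p0, p1, p2 */
        pix[-1 * step] = (uint8_t)((p2 + 2 * p1 + 2 * p0 + 2 * q0 + q1 + 4) >> 3);
        pix[-2 * step] = (uint8_t)((p2 + p1 + p0 + q0 + 2) >> 2);
        pix[-3 * step] = (uint8_t)((2 * p3 + 3 * p2 + p1 + p0 + q0 + 4) >> 3);
    } else {                                   /* weak: p0 only */
        pix[-1 * step] = (uint8_t)((2 * p1 + p0 + q1 + 2) >> 2);
    }

    if (small_step && abs(q2 - q0) < beta) {   /* strong_q: q0, q1, q2 */
        pix[0]        = (uint8_t)((q2 + 2 * q1 + 2 * q0 + 2 * p0 + p1 + 4) >> 3);
        pix[1 * step] = (uint8_t)((q2 + q1 + q0 + p0 + 2) >> 2);
        pix[2 * step] = (uint8_t)((2 * q3 + 3 * q2 + q1 + q0 + p0 + 4) >> 3);
    } else {                                   /* weak: q0 only */
        pix[0] = (uint8_t)((2 * q1 + q0 + p1 + 2) >> 2);
    }
}

/* Chroma: same gate, always the weak filter, only p0 and q0 change. */
static void sketch_chroma_intra_line(uint8_t *pix, ptrdiff_t step, int alpha, int beta)
{
    const int p1 = pix[-2 * step], p0 = pix[-1 * step];
    const int q0 = pix[0],         q1 = pix[1 * step];

    if (abs(p0 - q0) >= alpha || abs(p1 - p0) >= beta || abs(q1 - q0) >= beta)
        return;

    pix[-1 * step] = (uint8_t)((2 * p1 + p0 + q1 + 2) >> 2);
    pix[0]         = (uint8_t)((2 * q1 + q0 + p1 + 2) >> 2);
}
```

Because the p and q sides choose strong or weak independently, a line can take the strong filter on one side and the weak filter on the other, which is the per-side asymmetry the tests exercise.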

Verified on hertz (Pi 5 / V3D 7.1):

  $ ./build/test_api_h264
  === Phase 8a API smoke: H.264 kernels via recipe dispatch ===
    ...
    H.264 deblock luma v intra: 1024/1024 bytes bit-exact (100.0000%)
    H.264 deblock luma h intra: 1024/1024 bytes bit-exact (100.0000%)
    H.264 deblock chroma v intra: 256/256 bytes bit-exact (100.0000%)
    H.264 deblock chroma h intra: 256/256 bytes bit-exact (100.0000%)
    ...

  All 11 H.264 kernels bit-exact PASS — the deblock matrix is closed.

The bit-exact match on first try is meaningful for these kernels:
the strong/weak filter selector + per-side asymmetry would have
surfaced any sign / shift / rounding mistake immediately.  The
C reference is now a usable spec checkpoint for the eventual QPU
shader work.
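
A bit-exact check of the kind reported above can be sketched as a byte-for-byte comparison between the reference output and the kernel output. The helper names below are hypothetical and are not taken from the test harness in this PR; the print format merely imitates the log lines shown.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Count the bytes that are identical in both buffers. */
static size_t sketch_count_matching(const uint8_t *ref, const uint8_t *out, size_t n)
{
    size_t same = 0;
    for (size_t i = 0; i < n; i++)
        same += (ref[i] == out[i]);
    return same;
}

/* Print a summary line and return 1 only if every byte matched. */
static int sketch_report_bit_exact(const char *name, const uint8_t *ref,
                                   const uint8_t *out, size_t n)
{
    const size_t same = sketch_count_matching(ref, out, n);
    const double pct  = n ? 100.0 * (double)same / (double)n : 0.0;

    printf("  %s: %zu/%zu bytes bit-exact (%.4f%%)\n", name, same, n, pct);
    return same == n;
}
```

Requiring an exact match rather than a tolerance is what makes the result a usable spec checkpoint: a single wrong rounding term changes at least one byte.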

QPU shader follow-up: not in this PR.  The intra path's 3-cell
per-side update + strong/weak branch is structurally more complex
than the bS<4 path that already has a V shader (v3d_h264deblock.spv).
Per the prior R-band logic for deblock, intra edges are < 20% of
total deblock work at typical bit-rates, so NEON-only at ~ 10 ns/edge
fits comfortably in the budget.

marfrit merged commit 5306bf0f61 into main

2026-05-24 22:09:16 +00:00

marfrit deleted branch noether/h264-deblock-intra

2026-05-24 22:09:16 +00:00

marfrit referenced this pull request

2026-05-25 18:30:09 +00:00

h264: V3D shaders for the 4 bS=4 intra deblock variants — deblock QPU complete #35

claude-noether referenced this issue from a commit

2026-05-25 18:30:10 +00:00

h264: V3D shaders for the 4 bS=4 intra deblock variants — deblock QPU complete
