daedalus-fourier/docs/k8_h264deblock_phase1.md
marfrit 5a085e7180 Cycle 8 (H.264 deblock) opened — Phase 1 + NEON vendored
Targets the one H.264 kernel most likely to be QPU-worthy:
in-loop deblock. Cycles 6 and 7 (IDCT 4x4 and 8x8) both came in
CPU-only because H.264 transforms are NEON-trivial. H.264
deblock has analogous structure to VP9 LPF (cycles 2+4, both
GREEN) so predicted R8 = ORANGE/YELLOW.

This commit:
- Vendors ff_h264_*_loop_filter_*_neon from h264dsp_neon.S
  (1076 lines, includes both v/h luma + chroma + intra variants
  + weight/biweight)
- PROVENANCE.md updated with the new vendored file
- Phase 1 doc captures the full plan: start with luma vertical
  non-intra (most common case), defer Phase 3+ to next session

H.264 deblock C ref scope is ~2 hours (per-row branching,
per-4-row-segment tc0, ap/aq side conditions, alpha/beta
thresholds — much more complex than VP9 LPF wd=4's
single-branch filter). Deferring to fresh attention next
session rather than rushing now.

After cycle 8 closes, the H.264 QPU surface is well-characterised
and the cycles-1-8 inventory drives the Phase 8 V4L2 wrapper's
substrate-routing recipe.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 14:18:19 +00:00


cycle: 8
phase: 1
status: open (Phase 3 deferred to next session; scope larger than VP9 LPF)
date_opened: 2026-05-18
codec: H.264
kernel: in-loop deblock filter (luma vertical edge variant first)
parent: project_h264_scope_added.md (memory), k7_h264idct8_phase3_and_4.md (lesson)
predicted_R: 0.3-0.8 (ORANGE/YELLOW), analogous to VP9 LPF cycles 2/4 which were GREEN

Cycle 8, Phase 1 — H.264 in-loop deblock (luma vertical edge first)

After cycles 6 and 7 both came in as "predicted GREEN, measured CPU-only" for H.264 transforms (transforms too lightweight on NEON), cycle 8 targets the one H.264 kernel most likely to actually benefit from QPU offload: the in-loop deblock filter.

Why deblock as the H.264 QPU candidate

Per cycle 7's Phase 9 update:

  • H.264 transforms (cycles 6+7) NEON-saturated at ~150 Mblock/s, no QPU need
  • H.264 MC (luma qpel, chroma) likely analogous to cycle 3 VP9 MC (R=0.067 RED), QPU loses
  • Deblock is bandwidth-bound with per-pixel branching, analogous to VP9 LPF (cycle 2 R=0.41 GREEN, cycle 4 R=0.34 GREEN)
  • H.264 deblock processes 16-pixel-wide MB edges (vs VP9's smaller 8-pixel edges), so per-edge work is heavier, which favours QPU amortization

Predicted R₈ band: ORANGE/YELLOW (R ≈ 0.3-0.8), based on the VP9 LPF analog.

Scope decision: start with luma vertical edge

H.264 deblock has many variants:

  1. Luma vertical edge (v_loop_filter_luma) — 16-row × 8-col region
  2. Luma horizontal edge (h_loop_filter_luma) — 4-row × 16-col region
  3. Luma intra (stronger filter, bS=4)
  4. Chroma {v,h} edge
  5. Chroma intra
  6. Chroma 4:2:2 variants

Start with the luma vertical edge, non-intra variant. It is the most common case (most MB-internal edges are non-intra). The other variants are follow-up cycles (8a, 8b, etc.) using the same QPU shader template.

NEON reference

ff_h264_v_loop_filter_luma_neon (external/ffmpeg-snapshot/libavcodec/aarch64/h264dsp_neon.S line 111, vendored 2026-05-18).

Signature inferred from h264_loop_filter_start macro:

void ff_h264_v_loop_filter_luma_neon(uint8_t *pix,
                                      ptrdiff_t stride,
                                      int alpha, int beta,
                                      int8_t *tc0);

Where:

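  • pix: pointer to the edge centre; pix[0] is the q0 sample of the first filtered line, and the p samples lie on the other side of the edge
  • stride: byte stride between rows (the frame's line size, typically the picture width plus any padding)
  • alpha: threshold on |p0 - q0| across the edge (0..255 for 8-bit video, looked up from the QP-derived indexA)
  • beta: threshold on |p1 - p0| and |q1 - q0| within each side (0..18 for 8-bit video, looked up from the QP-derived indexB)
  • tc0: array of 4 int8 values, one clipping strength per 4-line segment; a negative entry marks the segment as unfiltered (bS = 0)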

The 16-row edge is divided into 4 segments of 4 rows each; each segment can have its own tc0 (encoder-derived filter strength parameter).
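As a rough illustration of the segment layout, the sketch below maps each of the 16 lines along the edge to the tc0 entry that governs it. The helper names (tc0_for_line, count_active_lines) are hypothetical and do not exist in the vendored NEON file; the only assumptions are the 4 × 4 segment split described above and that a negative tc0 marks a segment as skipped.

```c
#include <assert.h>
#include <stdint.h>

enum { EDGE_LINES = 16, LINES_PER_SEGMENT = 4 };

/* Return the tc0 entry governing line `line` (0..15) along the edge.
 * Hypothetical helper for illustration only. */
static int tc0_for_line(const int8_t tc0[4], int line)
{
    assert(line >= 0 && line < EDGE_LINES);
    return tc0[line / LINES_PER_SEGMENT];
}

/* Count the lines whose segment is eligible for filtering (tc0 >= 0).
 * Eligible lines can still be skipped later by the alpha/beta checks. */
static int count_active_lines(const int8_t tc0[4])
{
    int active = 0;
    for (int line = 0; line < EDGE_LINES; line++)
        active += tc0_for_line(tc0, line) >= 0;
    return active;
}
```

With tc0 = {0, -1, 3, -1}, lines 0..3 and 8..11 are eligible and the other eight are skipped outright.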

Algorithm summary (H.264 §8.7.2.3, edges with bS < 4)

For each row of each 4-row segment:

  1. Compute pre-conditions:
    • bS > 0 (tc0[segment] != -1)
    • |p0 - q0| < alpha
    • |p1 - p0| < beta
    • |q1 - q0| < beta
  2. If precondition fails → no filter for this row
  3. Compute ap = |p2 - p0|, aq = |q2 - q0|
  4. Compute tc = tc0 + (ap < beta) + (aq < beta)
  5. delta = clip3(-tc, tc, (((q0-p0)*4 + (p1-q1) + 4) >> 3))
  6. Apply:
    • p0' = clip255(p0 + delta)
    • q0' = clip255(q0 - delta)
    • If ap < beta: p1' = p1 + clip3(-tc0, tc0, ...)
    • If aq < beta: q1' = q1 + clip3(-tc0, tc0, ...)

The multiple branches per row make a bit-exact C ref harder to write than the cycle 2/4 LPF one: roughly 80-100 LOC of C, with particular care needed over the clip3 ranges.
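The steps above can be sketched for a single line of samples as follows. This is a minimal scalar sketch, not the planned tests/h264_deblock_ref.c and not FFmpeg's code: the function name deblock_luma_line and the xstride parameter (the step that crosses the edge, so p0 = pix[-xstride]) are assumptions for illustration. The p1/q1 update term, which the summary elides, is taken from the bS < 4 luma filter in the H.264 specification. The code assumes 8-bit samples and an arithmetic right shift for negative values.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Clamp v into [lo, hi]. */
static int clip3(int lo, int hi, int v)
{
    return v < lo ? lo : (v > hi ? hi : v);
}

/* Filter one line across a luma edge with bS < 4 (hypothetical helper).
 * pix points at q0; xstride is the step across the edge.
 * Returns 1 if the line was filtered, 0 if it was left untouched. */
static int deblock_luma_line(uint8_t *pix, ptrdiff_t xstride,
                             int alpha, int beta, int tc0)
{
    assert(xstride != 0);
    if (tc0 < 0)
        return 0;                      /* bS == 0: segment skipped */

    const int p0 = pix[-1 * xstride];
    const int p1 = pix[-2 * xstride];
    const int p2 = pix[-3 * xstride];
    const int q0 = pix[0];
    const int q1 = pix[1 * xstride];
    const int q2 = pix[2 * xstride];

    /* Step 1-2: alpha/beta preconditions. */
    if (abs(p0 - q0) >= alpha || abs(p1 - p0) >= beta || abs(q1 - q0) >= beta)
        return 0;

    const int avg = (p0 + q0 + 1) >> 1;
    int tc = tc0;

    /* Steps 3-4 and the p1/q1 updates, all from the unfiltered samples. */
    if (abs(p2 - p0) < beta) {         /* ap < beta */
        pix[-2 * xstride] = (uint8_t)(p1 + clip3(-tc0, tc0, (p2 + avg - 2 * p1) >> 1));
        tc++;
    }
    if (abs(q2 - q0) < beta) {         /* aq < beta */
        pix[1 * xstride] = (uint8_t)(q1 + clip3(-tc0, tc0, (q2 + avg - 2 * q1) >> 1));
        tc++;
    }

    /* Steps 5-6: clipped delta applied to p0 and q0. */
    const int delta = clip3(-tc, tc, ((q0 - p0) * 4 + (p1 - q1) + 4) >> 3);
    pix[-1 * xstride] = (uint8_t)clip3(0, 255, p0 + delta);
    pix[0]            = (uint8_t)clip3(0, 255, q0 - delta);
    return 1;
}
```

For the samples {100, 100, 100 | 104, 104, 104} with alpha = 10, beta = 5 and tc0 = 1, both side conditions hold, tc becomes 3, delta is 2, and the line becomes {100, 101, 102 | 102, 103, 104}. Whether xstride is the row stride or 1 depends on the orientation of the edge variant being modelled.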

30fps@1080p H.264 deblock floor

A 1920×1080 frame has 120 × 67.5 = 8100 luma MBs (8160 if the coded height of 1088 lines is counted). With 4 vertical edges per MB and 4 four-row segments per edge, that is 8100 × 4 × 4 = 129 600 segment-edges per frame. The 4 horizontal edges per MB contribute the same number again.

At 30 fps this requires ~3.9 M segment-edges/s for luma vertical alone and ~7.8 M/s for vertical and horizontal together. In realistic streams many edges skip the filter (bS=0, or the alpha/beta thresholds fail); an estimated ~30-50 % of them actually filter, giving an effective ~2-4 M edges/s.

30fps@1080p deblock floor (realistic): 2-4 M edges/s.
30fps@1080p deblock floor (worst case): 8 M edges/s.
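The floor arithmetic can be reproduced with a few lines of C. This is only a sketch of the estimate above: the function names are hypothetical, and it uses the same simplifications as the text (width × height / 256 macroblocks, 4 edges per MB per direction, 4 segments per edge, every segment counted as filtered).

```c
#include <assert.h>

enum { MB_SIZE = 16, EDGES_PER_MB = 4, SEGMENTS_PER_EDGE = 4 };

/* Segment-edges per frame for one edge direction (worst case, no skips). */
static long segment_edges_per_frame(int width, int height)
{
    assert(width > 0 && height > 0);
    long mbs = (long)width * height / (MB_SIZE * MB_SIZE);
    return mbs * EDGES_PER_MB * SEGMENTS_PER_EDGE;
}

/* Segment-edges per second; directions is 1 (vertical only) or 2 (v + h). */
static long segment_edges_per_second(int width, int height,
                                     int fps, int directions)
{
    assert(fps > 0 && (directions == 1 || directions == 2));
    return segment_edges_per_frame(width, height) * fps * directions;
}
```

For 1920×1080 at 30 fps this gives 3 888 000 segment-edges/s for one direction and 7 776 000 for both, which round to the ~3.9 M and ~7.8 M figures; scaling by the estimated 30-50 % filter rate yields the realistic band.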

Acceptance for Phase 7

  • M1: 100.0000% bit-exact (NEON vs C ref, 10000+ random 4-row segments)
  • M3: captured
  • M2: captured
  • R₈: classified
  • M4: same-kernel mixed bench
  • 30fps@1080p floor margin reported
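One possible shape for the M1 check is sketched below: run two implementations with the documented signature on identical random inputs and count the trials whose output buffers differ. Everything here is an assumption for illustration (the deblock_fn typedef name, the buffer geometry, the xorshift generator, the parameter ranges, and the two stub filters that stand in for the NEON routine and the C ref); it is not the planned bench.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef void (*deblock_fn)(uint8_t *pix, ptrdiff_t stride,
                           int alpha, int beta, int8_t *tc0);

enum { BUF_STRIDE = 32, BUF_ROWS = 24, BUF_SIZE = BUF_STRIDE * BUF_ROWS };

/* Small deterministic PRNG so failures are reproducible from the seed. */
static uint32_t xorshift32(uint32_t *state)
{
    uint32_t x = *state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    return *state = x;
}

/* Return how many of `trials` random inputs produce differing output.
 * The whole buffer is compared, so stray writes outside the edge show up. */
static long count_mismatches(deblock_fn ref, deblock_fn dut,
                             long trials, uint32_t seed)
{
    assert(seed != 0 && trials >= 0);
    uint8_t a[BUF_SIZE], b[BUF_SIZE];
    long bad = 0;

    for (long t = 0; t < trials; t++) {
        for (int i = 0; i < BUF_SIZE; i++)
            a[i] = (uint8_t)(xorshift32(&seed) >> 24);
        memcpy(b, a, sizeof a);

        int alpha = (int)(xorshift32(&seed) % 256);   /* assumed 8-bit range */
        int beta  = (int)(xorshift32(&seed) % 19);    /* assumed 8-bit range */
        int8_t tc_a[4], tc_b[4];
        for (int i = 0; i < 4; i++)                   /* -1 = skipped segment */
            tc_a[i] = tc_b[i] = (int8_t)((int)(xorshift32(&seed) % 27) - 1);

        /* pix sits at row 4, column 8, leaving margin on every side. */
        ref(a + 4 * BUF_STRIDE + 8, BUF_STRIDE, alpha, beta, tc_a);
        dut(b + 4 * BUF_STRIDE + 8, BUF_STRIDE, alpha, beta, tc_b);
        bad += memcmp(a, b, sizeof a) != 0;
    }
    return bad;
}

/* Placeholder filters, used only to exercise the harness. */
static void stub_identity(uint8_t *pix, ptrdiff_t stride,
                          int alpha, int beta, int8_t *tc0)
{
    (void)pix; (void)stride; (void)alpha; (void)beta; (void)tc0;
}

static void stub_flip_q0(uint8_t *pix, ptrdiff_t stride,
                         int alpha, int beta, int8_t *tc0)
{
    (void)stride; (void)alpha; (void)beta; (void)tc0;
    pix[0] ^= 1;
}
```

Comparing stub_identity against itself reports 0 mismatches, and against stub_flip_q0 reports one per trial. Uniformly random pixels rarely pass the alpha/beta preconditions, so a real harness would probably need smoother or edge-biased input to exercise the filtering branches.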

Cycle 8 deliverables

  1. external/ffmpeg-snapshot/libavcodec/aarch64/h264dsp_neon.S (already vendored this phase, 1076 lines)
  2. tests/h264_deblock_ref.c — C reference for luma vertical non-intra deblock (luma_v_filter_normal)
  3. tests/bench_neon_h264deblock.c — Phase 3 bench
  4. src/v3d_h264deblock.comp — Phase 6 shader (likely follow cycle 2 LPF v3d shader structure, but with deblock branching)
  5. tests/bench_v3d_h264deblock.c — Phase 6+7 bench
  6. CMakeLists.txt wiring

What lands in THIS session

  • This Phase 1 doc
  • h264dsp_neon.S vendored (file present in repo)
  • PROVENANCE.md updated

What's NOT in this session (deferred to next):

  • C reference (~2 hours)
  • NEON bench
  • M1+M3 capture
  • Phase 4-7

Why defer Phase 3+ from this session

Cycle 8 NEON-baseline scope is materially larger than cycles 6/7 because the H.264 deblock has:

  • Per-row branching (filter applies or not based on alpha/beta)
  • Per-4-row-segment tc0 strength
  • 4 separate output adjustments per row (p0, q0, p1, q1)
  • ap/aq side-condition checks
  • All of these must be bit-exact in the C ref against NEON's vectorised version

Better to write the C ref with fresh attention next session than rush it now and have it M1-fail like cycle 6's first attempt.

The Phase 1 doc itself captures the analysis so next session can pick up cleanly from here.

Estimated effort for Phase 3 next session

  • C ref: ~2 hours (careful transcription from spec + cross-check against FFmpeg C reference)
  • Bench: ~30 min
  • M1 debugging (likely needed; cycle 6 took 90 min for column-major-block discovery, similar discoveries may apply here): 30-90 min
  • M3 capture: 5 min

Total: 3-4 hours for Phase 3 closure.

Linkage with cycles 6+7 closure

Cycles 6, 7 and 8 together form the H.264 NEON inventory, with cycle 8 as the single most promising QPU target. After cycle 8 closes, the H.264 QPU surface area is well-characterised:

  • IDCT 4×4: CPU
  • IDCT 8×8: CPU
  • Deblock: TBD (cycle 8)
  • MC luma qpel: CPU (predicted; cycle 9 if measured)
  • MC chroma: CPU (predicted; cycle 10 if measured)

Likely H.264 contribution to daedalus-fourier: CPU for transforms and MC, QPU for deblock only if cycle 8 lands GREEN.