daedalus-fourier/docs/k8_h264deblock_phase1.md
marfrit 5a085e7180 Cycle 8 (H.264 deblock) opened — Phase 1 + NEON vendored
Targets the one H.264 kernel most likely to be QPU-worthy:
in-loop deblock. Cycles 6 and 7 (IDCT 4x4 and 8x8) both came in
CPU-only because H.264 transforms are NEON-trivial. H.264
deblock has analogous structure to VP9 LPF (cycles 2+4, both
GREEN) so predicted R8 = ORANGE/YELLOW.

This commit:
- Vendors ff_h264_*_loop_filter_*_neon from h264dsp_neon.S
  (1076 lines, includes both v/h luma + chroma + intra variants
  + weight/biweight)
- PROVENANCE.md updated with the new vendored file
- Phase 1 doc captures the full plan: start with luma vertical
  non-intra (most common case), defer Phase 3+ to next session

H.264 deblock C ref scope is ~2 hours (per-row branching,
per-4-row-segment tc0, ap/aq side conditions, alpha/beta
thresholds — much more complex than VP9 LPF wd=4's
single-branch filter). Deferring to fresh attention next
session rather than rushing now.

After cycle 8 closes, the H.264 QPU surface is well-characterised
and the cycles-1-8 inventory drives the Phase 8 V4L2 wrapper's
substrate-routing recipe.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 14:18:19 +00:00

---
cycle: 8
phase: 1
status: open (Phase 3 deferred to next session — scope larger than VP9 LPF)
date_opened: 2026-05-18
codec: H.264
kernel: in-loop deblock filter (luma vertical edge variant first)
parent: project_h264_scope_added.md (memory), k7_h264idct8_phase3_and_4.md (lesson)
predicted_R: 0.3-0.8 (ORANGE/YELLOW) — analogous to VP9 LPF cycles 2/4 which were GREEN
---
# Cycle 8, Phase 1 — H.264 in-loop deblock (luma vertical edge first)
After cycles 6 and 7 both came in as "predicted GREEN, measured
CPU-only" for H.264 transforms (transforms too lightweight on
NEON), cycle 8 targets the one H.264 kernel most likely to actually
benefit from QPU offload: the **in-loop deblock filter**.
## Why deblock as the H.264 QPU candidate
Per cycle 7's Phase 9 update:
- H.264 transforms (cycles 6+7) NEON-saturated at ~150 Mblock/s,
no QPU need
- H.264 MC (luma qpel, chroma) likely analogous to cycle 3 VP9 MC
(R=0.067 RED), QPU loses
- **Deblock is bandwidth-bound** with per-pixel branching, analogous
to VP9 LPF (cycle 2 R=0.41 GREEN, cycle 4 R=0.34 GREEN)
- H.264 deblock processes 16-pixel-wide MB edges (vs VP9's smaller
8-pixel edges), so per-edge work is heavier, which helps QPU
amortization
Predicted R₈ band: **ORANGE to GREEN** based on the VP9 LPF analog.
## Scope decision: start with luma vertical edge
H.264 deblock has many variants:
1. Luma vertical edge (v_loop_filter_luma): 16-row × 8-col region
2. Luma horizontal edge (h_loop_filter_luma): 8-row × 16-col region
3. Luma intra (stronger filter, bS=4)
4. Chroma {v,h} edge
5. Chroma intra
6. Chroma 4:2:2 variants
Start with **luma vertical edge non-intra**. Most common case
(most MB-internal edges are non-intra). Other variants are
follow-up cycles (8a, 8b, etc.) using the same QPU shader
template.
## NEON reference
`ff_h264_v_loop_filter_luma_neon`
(external/ffmpeg-snapshot/libavcodec/aarch64/h264dsp_neon.S
line 111, vendored 2026-05-18).
Signature inferred from `h264_loop_filter_start` macro:
```
void ff_h264_v_loop_filter_luma_neon(uint8_t *pix,
                                     ptrdiff_t stride,
                                     int alpha, int beta,
                                     int8_t *tc0);
```
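Where:
- `pix`: pointer to the edge centre; `pix[0]` is the q0 pixel of the first row
- `stride`: byte stride between picture rows (the frame linesize, typically the picture width plus any padding)
- `alpha`: threshold on |p0 - q0| across the edge (0..255 for 8-bit video, looked up from the QP-derived indexA)
- `beta`: threshold on |p1 - p0| and |q1 - q0| (0..18 for 8-bit video, looked up from the QP-derived indexB)
- `tc0`: array of 4 int8 values, one clipping strength per 4-pixel segment; a negative entry marks a segment that is not filtered
The 16-row edge is divided into 4 segments of 4 rows each; each
segment can have its own tc0 (a clipping bound derived from bS and
the QP-derived indexA).
To be confirmed against the vendored source before the C ref is
written: in FFmpeg the `v_`/`h_` prefix names the filtering direction,
not the edge orientation. `v_loop_filter_luma` reads its taps `stride`
bytes apart, so it filters across a horizontal edge whose 16 positions
are adjacent bytes; the 16-row vertical edge described here would then
correspond to the `h_` variant. The C ref must use the same layout as
whichever NEON entry point it is compared against.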
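As a minimal sketch of how these parameters address samples, assuming the `v_` layout as it appears in FFmpeg's C template (taps `stride` bytes apart across the edge, the 16 edge positions in adjacent bytes). `struct taps`, `load_taps` and `tc0_for_position` are hypothetical names used for illustration, not FFmpeg API:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper type: the six samples the normal (bS < 4) luma
 * filter reads at one position along the edge. */
struct taps { int p2, p1, p0, q0, q1, q2; };

/* Assumed v_ layout: p-side samples sit before pix, q-side samples at and
 * after it, each `stride` bytes apart. x in 0..15 selects the position
 * along the edge. */
static struct taps load_taps(const uint8_t *pix, ptrdiff_t stride, int x)
{
    const uint8_t *s = pix + x;
    struct taps t;
    t.p2 = s[-3 * stride];
    t.p1 = s[-2 * stride];
    t.p0 = s[-1 * stride];
    t.q0 = s[0];
    t.q1 = s[1 * stride];
    t.q2 = s[2 * stride];
    return t;
}

/* tc0 holds one entry per group of 4 adjacent edge positions. */
static int tc0_for_position(const int8_t *tc0, int x)
{
    return tc0[x >> 2];
}
```

For the `h_` layout the two strides swap roles: taps become adjacent bytes and edge positions are `stride` bytes apart.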
## Algorithm summary (H.264 §8.7.2.3, edges with bS < 4)
For each row within each 4-row segment:
1. Compute pre-conditions:
- `bS > 0` (tc0[segment] != -1)
- `|p0 - q0| < alpha`
- `|p1 - p0| < beta`
- `|q1 - q0| < beta`
2. If any pre-condition fails → no filter for this row
3. Compute `ap = |p2 - p0|`, `aq = |q2 - q0|`
4. Compute `tc = tc0 + (ap < beta) + (aq < beta)`
5. `delta = clip3(-tc, tc, (((q0-p0)*4 + (p1-q1) + 4) >> 3))`
6. Apply:
- `p0' = clip255(p0 + delta)`
- `q0' = clip255(q0 - delta)`
- If `ap < beta`: `p1' = p1 + clip3(-tc0, tc0, ...)`
- If `aq < beta`: `q1' = q1 + clip3(-tc0, tc0, ...)`
Multiple branches per row → harder to write a bit-exact C ref
than cycle 2/4 LPF. ~80-100 LOC of C, careful with the clip3
ranges.
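The steps above can be sketched in C as follows. This is an illustrative draft, not the planned `tests/h264_deblock_ref.c`: it assumes the `v_` layout (taps `stride` bytes apart, 16 adjacent edge positions), fills in the p1/q1 update from the form used in FFmpeg's C template, and has not been checked against the NEON routine. The helper names `clip3` and `clip255` follow the summary; `luma_v_filter_normal` is the name from the deliverables list.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

static int clip3(int lo, int hi, int v) { return v < lo ? lo : (v > hi ? hi : v); }
static uint8_t clip255(int v) { return (uint8_t)clip3(0, 255, v); }

/* Sketch of the normal (bS < 4) luma filter over one 16-position edge.
 * Like FFmpeg's C code, it relies on arithmetic right shift of negatives. */
static void luma_v_filter_normal(uint8_t *pix, ptrdiff_t stride,
                                 int alpha, int beta, const int8_t *tc0)
{
    for (int x = 0; x < 16; x++) {
        const int t0 = tc0[x >> 2];          /* per-4-position strength */
        if (t0 < 0)
            continue;                        /* segment not filtered */
        uint8_t *s = pix + x;
        const int p2 = s[-3 * stride], p1 = s[-2 * stride], p0 = s[-stride];
        const int q0 = s[0], q1 = s[stride], q2 = s[2 * stride];
        if (abs(p0 - q0) >= alpha || abs(p1 - p0) >= beta || abs(q1 - q0) >= beta)
            continue;                        /* pre-conditions failed */
        const int avg = (p0 + q0 + 1) >> 1;
        int tc = t0;
        if (abs(p2 - p0) < beta) {           /* ap < beta */
            if (t0)
                s[-2 * stride] = (uint8_t)(p1 + clip3(-t0, t0, ((p2 + avg) >> 1) - p1));
            tc++;
        }
        if (abs(q2 - q0) < beta) {           /* aq < beta */
            if (t0)
                s[stride] = (uint8_t)(q1 + clip3(-t0, t0, ((q2 + avg) >> 1) - q1));
            tc++;
        }
        const int delta = clip3(-tc, tc, ((q0 - p0) * 4 + (p1 - q1) + 4) >> 3);
        s[-stride] = clip255(p0 + delta);
        s[0]       = clip255(q0 - delta);
    }
}
```

Note that p1 and q1 are clipped with tc0 while p0 and q0 are clipped with the incremented tc; mixing the two up is an easy way to fail M1.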
## 30fps@1080p H.264 deblock floor
A 1920×1080 frame has 120 × 67.5 = 8100 luma MBs × 4 vertical
edges per MB (the left MB boundary plus 3 internal 4×4 boundaries)
× 4 segments per edge = 129 600 segment-edges per frame, counting
each 4-row segment as one edge. The 4 horizontal edges per MB add
the same count again.
At 30fps: ~3.9 M edges/s required for luma vertical alone, ~7.8 M
edges/s for both v and h. Realistically, many edges skip the filter
(bS=0 or failed alpha/beta thresholds), so an assumed ~30-50 % of
them actually filter → effective ~2-4 M edges/s.
**30fps@1080p deblock floor (realistic): 2-4 M edges/s.**
**30fps@1080p deblock floor (worst case): 8 M edges/s.**
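The arithmetic behind these two floors can be reproduced with a few lines of C. This is a worked check only; `struct deblock_floor` and `deblock_floor_1080p30` are hypothetical names, and the 30 % and 50 % filter rates are the estimate from the text, not measurements.

```c
/* Hypothetical helper: segment-edge throughput needed for 1080p at 30 fps. */
struct deblock_floor {
    long segs_per_frame; /* segment-edges per frame, one direction */
    long v_per_s;        /* one direction, per second */
    long vh_per_s;       /* both directions, per second (worst case) */
    long lo_per_s;       /* 30 % of vh_per_s actually filtered */
    long hi_per_s;       /* 50 % of vh_per_s actually filtered */
};

static struct deblock_floor deblock_floor_1080p30(void)
{
    struct deblock_floor f;
    /* 120 MB columns x 67.5 MB rows = 8100, using the unpadded 1080-line
     * height as the text does. */
    const long mbs = (1920 / 16) * 1080 / 16;
    f.segs_per_frame = mbs * 4 * 4;          /* 4 edges x 4 segments per MB */
    f.v_per_s  = f.segs_per_frame * 30;      /* 30 fps */
    f.vh_per_s = 2 * f.v_per_s;              /* vertical + horizontal */
    f.lo_per_s = f.vh_per_s * 30 / 100;
    f.hi_per_s = f.vh_per_s * 50 / 100;
    return f;
}
```

The function returns 129 600 segment-edges per frame, 3 888 000 per second for one direction, 7 776 000 for both, and 2 332 800 to 3 888 000 at the assumed filter rates, which rounds to the 2-4 M and 8 M figures.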
## Acceptance for Phase 7
- M1: 100.0000% bit-exact (NEON vs C ref, 10000+ random 4-row segments)
- M3: captured
- M2: captured
- R₈: classified
- M4: same-kernel mixed bench
- 30fps@1080p floor margin reported
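One possible shape for the M1 check is sketched below: run two implementations with the same prototype on identical random inputs and count the trials whose output buffers differ. All names here (`deblock_fn`, `count_mismatches`, the two stand-in filters) are hypothetical, the buffer assumes the `v_` layout, and the parameter ranges are assumed 8-bit ranges rather than values taken from the project.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Same shape as the prototype in the NEON reference section. */
typedef void (*deblock_fn)(uint8_t *pix, ptrdiff_t stride,
                           int alpha, int beta, int8_t *tc0);

/* Small deterministic PRNG so a failure can be replayed; seed must be nonzero. */
static uint32_t xorshift32(uint32_t *s)
{
    uint32_t x = *s;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    return *s = x;
}

/* Returns the number of trials whose outputs differ (0 means bit-exact). */
static long count_mismatches(deblock_fn ref, deblock_fn dut,
                             long trials, uint32_t seed)
{
    enum { STRIDE = 16, ROWS = 8 };
    uint8_t a[ROWS * STRIDE], b[ROWS * STRIDE];
    long bad = 0;

    for (long t = 0; t < trials; t++) {
        int8_t tc_a[4], tc_b[4];
        for (int i = 0; i < ROWS * STRIDE; i++)
            a[i] = (uint8_t)(xorshift32(&seed) >> 24);
        memcpy(b, a, sizeof a);
        /* Assumed ranges: alpha 0..255, beta 0..18, tc0 -1..25. */
        const int alpha = (int)(xorshift32(&seed) % 256);
        const int beta  = (int)(xorshift32(&seed) % 19);
        for (int i = 0; i < 4; i++)
            tc_a[i] = tc_b[i] = (int8_t)((int)(xorshift32(&seed) % 27) - 1);
        /* pix points at row 4, leaving 4 rows on each side of the edge. */
        ref(a + 4 * STRIDE, STRIDE, alpha, beta, tc_a);
        dut(b + 4 * STRIDE, STRIDE, alpha, beta, tc_b);
        bad += memcmp(a, b, sizeof a) != 0;
    }
    return bad;
}

/* Stand-ins used only to exercise the harness; neither is a real filter. */
static void noop_filter(uint8_t *pix, ptrdiff_t stride,
                        int alpha, int beta, int8_t *tc0)
{
    (void)pix; (void)stride; (void)alpha; (void)beta; (void)tc0;
}

static void flip_filter(uint8_t *pix, ptrdiff_t stride,
                        int alpha, int beta, int8_t *tc0)
{
    (void)stride; (void)alpha; (void)beta; (void)tc0;
    pix[0] ^= 1; /* always changes one sample */
}
```

Uniformly random pixels rarely pass the alpha/beta pre-conditions, so a real run would likely want inputs biased toward smooth edges to exercise the filtering branches.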
## Cycle 8 deliverables
1. `external/ffmpeg-snapshot/libavcodec/aarch64/h264dsp_neon.S`
(already vendored this phase, 1076 lines)
2. `tests/h264_deblock_ref.c` — C reference for luma vertical
non-intra deblock (luma_v_filter_normal)
3. `tests/bench_neon_h264deblock.c` — Phase 3 bench
4. `src/v3d_h264deblock.comp` — Phase 6 shader (likely follow
cycle 2 LPF v3d shader structure, but with deblock branching)
5. `tests/bench_v3d_h264deblock.c` — Phase 6+7 bench
6. CMakeLists.txt wiring
## What lands in THIS session
- This Phase 1 doc
- `h264dsp_neon.S` vendored (file present in repo)
- PROVENANCE.md updated
What's NOT in this session (deferred to next):
- C reference (~2 hours)
- NEON bench
- M1+M3 capture
- Phase 4-7
## Why defer Phase 3+ from this session
Cycle 8 NEON-baseline scope is materially larger than cycles 6/7
because the H.264 deblock has:
- Per-row branching (filter applies or not based on alpha/beta)
- Per-4-row-segment tc0 strength
- 4 separate output adjustments per row (p0, q0, p1, q1)
- ap/aq side-condition checks
- All of these must be bit-exact in the C ref against NEON's
vectorised version
Better to write the C ref with fresh attention next session than
rush it now and have it M1-fail like cycle 6's first attempt.
The Phase 1 doc itself captures the analysis so next session can
pick up cleanly from here.
## Estimated effort for Phase 3 next session
- C ref: ~2 hours (careful transcription from spec + cross-check
against FFmpeg C reference)
- Bench: ~30 min
- M1 debugging (likely needed; cycle 6 took 90 min for column-
major-block discovery, similar discoveries may apply here): 30-90 min
- M3 capture: 5 min
Total: 3-4 hours for Phase 3 closure.
## Linkage with cycles 6+7 closure
Cycles 6, 7 and 8 together form the H.264 NEON inventory, with
cycle 8 as the single most promising QPU target. After cycle 8
closes, the H.264 QPU surface area is well-characterised:
- IDCT 4×4: CPU
- IDCT 8×8: CPU
- Deblock: TBD (cycle 8)
- MC luma qpel: CPU (predicted; cycle 9 if measured)
- MC chroma: CPU (predicted; cycle 10 if measured)
H.264 contribution to daedalus-fourier likely: CPU for transforms
and MC, QPU for deblock IF cycle 8 lands GREEN.