Cycle 8 (H.264 deblock) opened — Phase 1 + NEON vendored
Targets the one H.264 kernel most likely to be QPU-worthy: in-loop deblock. Cycles 6 and 7 (IDCT 4x4 and 8x8) both came in CPU-only because H.264 transforms are NEON-trivial. H.264 deblock has analogous structure to VP9 LPF (cycles 2+4, both GREEN) so predicted R8 = ORANGE/YELLOW.

This commit:
- Vendors ff_h264_*_loop_filter_*_neon from h264dsp_neon.S (1076 lines, includes both v/h luma + chroma + intra variants + weight/biweight)
- PROVENANCE.md updated with the new vendored file
- Phase 1 doc captures the full plan: start with luma vertical non-intra (most common case), defer Phase 3+ to next session

H.264 deblock C ref scope is ~2 hours (per-row branching, per-4-row-segment tc0, ap/aq side conditions, alpha/beta thresholds; much more complex than VP9 LPF wd=4's single-branch filter). Deferring to fresh attention next session rather than rushing now.

After cycle 8 closes, the H.264 QPU surface is well-characterised and the cycles-1-8 inventory drives the Phase 8 V4L2 wrapper's substrate-routing recipe.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -0,0 +1,183 @@
---
cycle: 8
phase: 1
status: open (Phase 3 deferred to next session; scope larger than VP9 LPF)
date_opened: 2026-05-18
codec: H.264
kernel: in-loop deblock filter (luma vertical edge variant first)
parent: project_h264_scope_added.md (memory), k7_h264idct8_phase3_and_4.md (lesson)
predicted_R: 0.3-0.8 (ORANGE/YELLOW); analogous to VP9 LPF cycles 2/4 which were GREEN
---

# Cycle 8, Phase 1: H.264 in-loop deblock (luma vertical edge first)

After cycles 6 and 7 both came in as "predicted GREEN, measured
CPU-only" for H.264 transforms (transforms too lightweight on
NEON), cycle 8 targets the one H.264 kernel most likely to actually
benefit from QPU offload: the **in-loop deblock filter**.

## Why deblock as the H.264 QPU candidate

Per cycle 7's Phase 9 update:
- H.264 transforms (cycles 6+7) are NEON-saturated at ~150 Mblock/s,
  so there is no QPU need
- H.264 MC (luma qpel, chroma) is likely analogous to cycle 3 VP9 MC
  (R=0.067 RED), where the QPU loses
- **Deblock is bandwidth-bound** with per-pixel branching, analogous
  to VP9 LPF (cycle 2 R=0.41 GREEN, cycle 4 R=0.34 GREEN)
- H.264 deblock processes 16-pixel-wide MB edges (vs VP9's 8-pixel
  edges), so per-edge work is heavier, which is better for QPU
  amortization

Predicted R₈ band: **ORANGE to GREEN** based on the VP9 LPF analog.

## Scope decision: start with luma vertical edge

H.264 deblock has many variants:
1. Luma `v_loop_filter_luma`: in FFmpeg naming `v_` is the filtering
   direction, so the taps p2..q2 run down each column of a
   16-pixel-wide edge (6-row × 16-col region for bS < 4)
2. Luma `h_loop_filter_luma`: the taps run along each row of a
   16-pixel-tall edge (16-row × 6-col region for bS < 4)
3. Luma intra (stronger filter, bS=4)
4. Chroma {v,h} edge
5. Chroma intra
6. Chroma 4:2:2 variants

Start with the **luma `v_loop_filter_luma` non-intra** variant (the
"luma vertical" case in this doc). It is the most common case, since
most MB-internal edges are non-intra. The other variants are
follow-up cycles (8a, 8b, etc.) using the same QPU shader
template.

## NEON reference

`ff_h264_v_loop_filter_luma_neon`
(external/ffmpeg-snapshot/libavcodec/aarch64/h264dsp_neon.S
line 111, vendored 2026-05-18).

Signature inferred from the `h264_loop_filter_start` macro:
```
void ff_h264_v_loop_filter_luma_neon(uint8_t *pix,
                                     ptrdiff_t stride,
                                     int alpha, int beta,
                                     int8_t *tc0);
```

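Where:
- `pix`: pointer to the edge centre, i.e. the first q0 sample. For the
  `v_` variant the taps are `stride` bytes apart, so p0 = `pix[-stride]`
  and q0 = `pix[0]`, and successive edge positions are adjacent bytes
  (`pix[x]`, x = 0..15)
- `stride`: byte stride between rows (the padded line size, which is
  at least the picture width)
- `alpha`: threshold on `|p0 - q0|` across the edge (0..255 for 8-bit,
  looked up from QP plus the slice offset)
- `beta`: threshold on the same-side differences `|p1 - p0|` and
  `|q1 - q0|` (0..18 for 8-bit, looked up from QP plus the slice offset)
- `tc0`: array of 4 int8 values, one clipping strength per 4-pixel
  segment; -1 marks a segment with bS = 0 (not filtered)

The 16-pixel edge is divided into 4 segments of 4 pixels each; each
segment can have its own tc0, which is derived from the boundary
strength bS and the QP-based index rather than signalled directly.
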
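A minimal sketch of the addressing this signature implies. In FFmpeg's C template the `v_` luma filter steps across the edge in units of `stride`, so the six taps of one edge position sit in one column. The helper names `load_taps_v` and `tc0_segment` are hypothetical, introduced only for illustration; they are not FFmpeg functions.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not an FFmpeg function): gather the six samples
 * p2 p1 p0 | q0 q1 q2 that the bS < 4 luma filter reads at position x
 * (0..15) along the edge. Assumes the v_ layout: successive taps are one
 * row (stride bytes) apart and pix points at q0 of position 0. */
static void load_taps_v(const uint8_t *pix, ptrdiff_t stride, int x,
                        uint8_t taps[6])
{
    const uint8_t *c = pix + x;
    taps[0] = c[-3 * stride]; /* p2 */
    taps[1] = c[-2 * stride]; /* p1 */
    taps[2] = c[-1 * stride]; /* p0 */
    taps[3] = c[0];           /* q0 */
    taps[4] = c[1 * stride];  /* q1 */
    taps[5] = c[2 * stride];  /* q2 */
}

/* Index into tc0[] for position x: positions 4k..4k+3 share tc0[k]. */
static int tc0_segment(int x)
{
    return x >> 2;
}
```

The caller must guarantee that three rows above `pix` and two rows below it are readable, which is why a test buffer for this kernel needs at least six rows.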
## Algorithm summary (H.264 §8.7.2.3, bS < 4)

For each pixel position along the edge, using the tc0 of its 4-pixel segment:
1. Compute pre-conditions:
   - `bS > 0` (tc0[segment] != -1)
   - `|p0 - q0| < alpha`
   - `|p1 - p0| < beta`
   - `|q1 - q0| < beta`
2. If any precondition fails → no filter for this position
3. Compute `ap = |p2 - p0|`, `aq = |q2 - q0|`
4. Compute `tc = tc0 + (ap < beta) + (aq < beta)`
5. `delta = clip3(-tc, tc, (((q0-p0)*4 + (p1-q1) + 4) >> 3))`
6. Apply:
   - `p0' = clip255(p0 + delta)`
   - `q0' = clip255(q0 - delta)`
   - If `ap < beta`: `p1' = p1 + clip3(-tc0, tc0, ...)`
   - If `aq < beta`: `q1' = q1 + clip3(-tc0, tc0, ...)`

Multiple branches per position make a bit-exact C ref harder to write
than the cycle 2/4 LPF one: ~80-100 LOC of C, with care needed on the
clip3 ranges (p0/q0 clip to ±tc, p1/q1 to ±tc0).

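The steps above can be sketched in C as follows. This is a minimal 8-bit sketch under stated assumptions, not the project's `tests/h264_deblock_ref.c`: the name `luma_filter_normal` and the `xstride`/`ystride` parameterisation are illustrative (FFmpeg's C template uses a similar pair of strides). The p1/q1 term that the summary elides as `...` is taken here from the H.264 bS < 4 filter, namely the clipped difference between `(p2 + ((p0 + q0 + 1) >> 1)) >> 1` and `p1` (and symmetrically for q1).

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

static int clip3(int lo, int hi, int v)
{
    return v < lo ? lo : (v > hi ? hi : v);
}

/* Sketch of the bS < 4 luma filter over one 16-position edge (8-bit).
 * xstride: byte distance between taps across the edge.
 * ystride: byte distance between positions along the edge.
 * Hypothetical name and signature, for illustration only. */
static void luma_filter_normal(uint8_t *pix, ptrdiff_t xstride,
                               ptrdiff_t ystride, int alpha, int beta,
                               const int8_t *tc0)
{
    for (int seg = 0; seg < 4; seg++) {
        const int tc_orig = tc0[seg];
        if (tc_orig < 0) {              /* bS == 0: skip all 4 positions */
            pix += 4 * ystride;
            continue;
        }
        for (int d = 0; d < 4; d++, pix += ystride) {
            /* Read all six taps before writing anything back. */
            const int p2 = pix[-3 * xstride], p1 = pix[-2 * xstride];
            const int p0 = pix[-1 * xstride], q0 = pix[0];
            const int q1 = pix[1 * xstride],  q2 = pix[2 * xstride];

            if (abs(p0 - q0) >= alpha || abs(p1 - p0) >= beta ||
                abs(q1 - q0) >= beta)
                continue;               /* step 2: preconditions failed */

            const int avg = (p0 + q0 + 1) >> 1;
            int tc = tc_orig;
            if (abs(p2 - p0) < beta) {  /* ap < beta: adjust p1, widen tc */
                pix[-2 * xstride] = (uint8_t)(p1 +
                    clip3(-tc_orig, tc_orig, ((p2 + avg) >> 1) - p1));
                tc++;
            }
            if (abs(q2 - q0) < beta) {  /* aq < beta: adjust q1, widen tc */
                pix[1 * xstride] = (uint8_t)(q1 +
                    clip3(-tc_orig, tc_orig, ((q2 + avg) >> 1) - q1));
                tc++;
            }
            /* Step 5; relies on >> being arithmetic for negative ints. */
            const int delta =
                clip3(-tc, tc, (((q0 - p0) * 4) + (p1 - q1) + 4) >> 3);
            pix[-1 * xstride] = (uint8_t)clip3(0, 255, p0 + delta);
            pix[0]            = (uint8_t)clip3(0, 255, q0 - delta);
        }
    }
}
```

For the `v_` geometry the call would be `luma_filter_normal(pix, stride, 1, alpha, beta, tc0)`; swapping the two strides gives the `h_` geometry. Right-shifting a negative `int` is implementation-defined in C, so a bit-exact reference should confirm the target compiler shifts arithmetically (GCC and Clang do).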
## 30fps@1080p H.264 deblock floor

A 1920×1080 frame is coded as 1920×1088, i.e. 120 × 68 = 8160 luma
MBs (67.5 MB rows round up to 68). Each MB has 4 luma edges per
direction (the MB boundary plus 3 internal 4×4 block edges) and 4
segments per edge, giving 8160 × 4 × 4 = ~130 560 segment-edges per
frame for one direction. The other direction adds the same count again.

At 30fps: ~3.9 M segment-edges/s required for one luma direction
alone, ~7.8 M segment-edges/s for both v and h. Realistically, many
edges skip the filter (via bS=0 or the alpha/beta thresholds), so only
~30-50 % of these actually filter, for an effective ~2-4 M
segment-edges/s.

**30fps@1080p deblock floor (realistic): 2-4 M segment-edges/s.**
**30fps@1080p deblock floor (worst case): 8 M segment-edges/s.**

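The floor arithmetic can be reproduced with a few lines of C. This is a sketch under the assumptions used in the estimate (16×16 MBs, 4 luma edges per MB per direction, 4 segments per edge); the function name `segment_edges_per_sec` is hypothetical, and the frame height is rounded up to whole macroblocks (68 MB rows for 1080 lines).

```c
#include <stdint.h>

/* Luma segment-edges per second for ONE filter direction.
 * Assumes 16x16 macroblocks, 4 edges per MB per direction and
 * 4 segments per edge; dimensions are rounded up to whole MBs. */
static uint64_t segment_edges_per_sec(int width, int height, int fps)
{
    const uint64_t mb_w = (uint64_t)(width + 15) / 16;
    const uint64_t mb_h = (uint64_t)(height + 15) / 16;
    return mb_w * mb_h * 4 /* edges */ * 4 /* segments */ * (uint64_t)fps;
}
```

For 1920×1080 at 30 fps this yields 3 916 800 for one direction and twice that for both, which matches the ~3.9 M and ~7.8 M figures; the realistic floor then follows by scaling with the assumed 30-50 % filter rate.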
## Acceptance for Phase 7

- M1: 100.0000% bit-exact (NEON vs C ref, 10000+ random 4-pixel segments)
- M3: captured
- M2: captured
- R₈: classified
- M4: same-kernel mixed bench
- 30fps@1080p floor margin reported

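The M1 criterion amounts to running both implementations on identical random input and comparing the output buffers byte for byte. A minimal harness sketch follows; `deblock_fn`, `count_mismatches` and the two stub filters are hypothetical names, and the stubs only stand in for the real C reference and NEON function. The parameter ranges (alpha 0..255, beta 0..18, tc0 -1..25) are assumed from the 8-bit H.264 threshold tables.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef void (*deblock_fn)(uint8_t *pix, ptrdiff_t stride,
                           int alpha, int beta, int8_t *tc0);

/* Small xorshift PRNG so a failing run can be replayed from its seed.
 * The seed must be non-zero. */
static uint32_t xorshift32(uint32_t *s)
{
    uint32_t x = *s;
    x ^= x << 13; x ^= x >> 17; x ^= x << 5;
    return *s = x;
}

/* Run both implementations on identical random input and count the
 * trials whose output buffers differ. 0 means bit-exact on this sample. */
static int count_mismatches(deblock_fn ref, deblock_fn dut,
                            int trials, uint32_t seed)
{
    enum { STRIDE = 16, ROWS = 6 };     /* rows p2..q2 of a 16-wide edge */
    uint8_t a[STRIDE * ROWS], b[STRIDE * ROWS];
    int bad = 0;
    for (int t = 0; t < trials; t++) {
        for (int i = 0; i < STRIDE * ROWS; i++)
            a[i] = (uint8_t)xorshift32(&seed);
        memcpy(b, a, sizeof a);
        int8_t tc_a[4], tc_b[4];
        for (int i = 0; i < 4; i++)     /* tc0 in -1..25 */
            tc_a[i] = tc_b[i] = (int8_t)((int)(xorshift32(&seed) % 27) - 1);
        const int alpha = (int)(xorshift32(&seed) % 256);
        const int beta  = (int)(xorshift32(&seed) % 19);
        ref(a + 3 * STRIDE, STRIDE, alpha, beta, tc_a);
        dut(b + 3 * STRIDE, STRIDE, alpha, beta, tc_b);
        bad += memcmp(a, b, sizeof a) != 0;
    }
    return bad;
}

/* Stand-ins for the real implementations, for demonstration only. */
static void stub_identity(uint8_t *pix, ptrdiff_t stride,
                          int alpha, int beta, int8_t *tc0)
{
    (void)pix; (void)stride; (void)alpha; (void)beta; (void)tc0;
}

static void stub_flip_q0(uint8_t *pix, ptrdiff_t stride,
                         int alpha, int beta, int8_t *tc0)
{
    (void)stride; (void)alpha; (void)beta; (void)tc0;
    pix[0] ^= 1;                        /* deliberately wrong output */
}
```

Uniformly random pixels rarely satisfy the alpha/beta preconditions, so a real M1 run would probably also want inputs biased toward small differences across the edge; otherwise most trials only exercise the "no filter" path.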
## Cycle 8 deliverables

1. `external/ffmpeg-snapshot/libavcodec/aarch64/h264dsp_neon.S`
   (already vendored this phase, 1076 lines)
2. `tests/h264_deblock_ref.c`: C reference for luma vertical
   non-intra deblock (luma_v_filter_normal)
3. `tests/bench_neon_h264deblock.c`: Phase 3 bench
4. `src/v3d_h264deblock.comp`: Phase 6 shader (likely to follow the
   cycle 2 LPF v3d shader structure, but with deblock branching)
5. `tests/bench_v3d_h264deblock.c`: Phase 6+7 bench
6. CMakeLists.txt wiring

## What lands in THIS session

- This Phase 1 doc
- `h264dsp_neon.S` vendored (file present in repo)
- PROVENANCE.md updated

What's NOT in this session (deferred to next):
- C reference (~2 hours)
- NEON bench
- M1+M3 capture
- Phase 4-7

## Why defer Phase 3+ from this session

Cycle 8 NEON-baseline scope is materially larger than cycles 6/7
because the H.264 deblock has:
- Per-pixel branching (filter applies or not based on alpha/beta)
- Per-4-pixel-segment tc0 strength
- 4 separate output adjustments per position (p0, q0, p1, q1)
- ap/aq side-condition checks

All of these must be bit-exact in the C ref against NEON's vectorised
version.

Better to write the C ref with fresh attention next session than
rush it now and have it M1-fail like cycle 6's first attempt.

The Phase 1 doc itself captures the analysis so next session can
pick up cleanly from here.

## Estimated effort for Phase 3 next session

- C ref: ~2 hours (careful transcription from spec + cross-check
  against FFmpeg C reference)
- Bench: ~30 min
- M1 debugging: 30-90 min (likely needed; cycle 6 took 90 min for the
  column-major-block discovery, and similar discoveries may apply here)
- M3 capture: 5 min

Total: 3-4 hours for Phase 3 closure.

## Linkage with cycles 6+7 closure

Cycles 6 + 7 + 8 together form the H.264 NEON inventory, with cycle 8
as the single most promising QPU target. After cycle 8 closes,
the H.264 QPU surface area is well-characterised:
- IDCT 4×4: CPU
- IDCT 8×8: CPU
- Deblock: TBD (cycle 8)
- MC luma qpel: CPU (predicted; cycle 9 if measured)
- MC chroma: CPU (predicted; cycle 10 if measured)

Likely H.264 contribution to daedalus-fourier: CPU for transforms
and MC, QPU for deblock IF cycle 8 lands GREEN.

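One way the inventory above could feed a substrate-routing table is sketched below. Everything here is hypothetical (the type names, kernel strings and the CPU fallback are assumptions made for illustration) and it is not the Phase 8 wrapper's actual design; it only encodes the five entries listed in this section.

```c
#include <stddef.h>
#include <string.h>

typedef enum { SUBSTRATE_CPU, SUBSTRATE_QPU, SUBSTRATE_TBD } substrate_t;

typedef struct {
    const char *kernel;     /* hypothetical kernel identifier */
    substrate_t substrate;
    int measured;           /* 1 = measured in a cycle, 0 = prediction/pending */
} route_entry;

static const route_entry h264_routes[] = {
    { "idct4x4",      SUBSTRATE_CPU, 1 },  /* cycle 6 */
    { "idct8x8",      SUBSTRATE_CPU, 1 },  /* cycle 7 */
    { "deblock_luma", SUBSTRATE_TBD, 0 },  /* cycle 8, pending */
    { "mc_luma_qpel", SUBSTRATE_CPU, 0 },  /* predicted only */
    { "mc_chroma",    SUBSTRATE_CPU, 0 },  /* predicted only */
};

/* Look up the substrate for a kernel; unknown kernels fall back to CPU
 * (an assumed default, since the CPU path always exists). */
static substrate_t route_for(const char *kernel)
{
    for (size_t i = 0; i < sizeof h264_routes / sizeof h264_routes[0]; i++)
        if (strcmp(h264_routes[i].kernel, kernel) == 0)
            return h264_routes[i].substrate;
    return SUBSTRATE_CPU;
}
```

Keeping a `measured` flag next to each entry separates the two measured CPU verdicts from the predictions, so a later cycle can flip an entry without changing the lookup code.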