---
cycle: 8
phase: 1
status: open (Phase 3 deferred to next session; scope larger than VP9 LPF)
date_opened: 2026-05-18
codec: H.264
kernel: in-loop deblock filter (luma vertical edge variant first)
parent: project_h264_scope_added.md (memory), k7_h264idct8_phase3_and_4.md (lesson)
predicted_R: 0.3-0.8 (ORANGE/YELLOW), analogous to VP9 LPF cycles 2/4 which were GREEN
---

# Cycle 8, Phase 1: H.264 in-loop deblock (luma vertical edge first)

After cycles 6 and 7 both came in as "predicted GREEN, measured CPU-only" for H.264 transforms (transforms too lightweight on NEON), cycle 8 targets the one H.264 kernel most likely to actually benefit from QPU offload: the **in-loop deblock filter**.

## Why deblock as the H.264 QPU candidate

Per cycle 7's Phase 9 update:

- H.264 transforms (cycles 6+7) NEON-saturated at ~150 Mblock/s, no QPU need
- H.264 MC (luma qpel, chroma) likely analogous to cycle 3 VP9 MC (R=0.067 RED), QPU loses
- **Deblock is bandwidth-bound** with per-pixel branching, analogous to VP9 LPF (cycle 2 R=0.41 GREEN, cycle 4 R=0.34 GREEN)
- H.264 deblock processes 16-pixel-wide MB edges (vs VP9's smaller 8-pixel edges), so per-edge work is heavier, which is better for QPU amortization

Predicted R₈ band: **ORANGE to GREEN** based on the VP9 LPF analog.

## Scope decision: start with luma vertical edge

H.264 deblock has many variants:

1. Luma vertical edge (v_loop_filter_luma): 16-row × 8-col region
2. Luma horizontal edge (h_loop_filter_luma): 4-row × 16-col region
3. Luma intra (stronger filter, bS=4)
4. Chroma {v,h} edge
5. Chroma intra
6. Chroma 4:2:2 variants

Start with **luma vertical edge non-intra**. Most common case (most MB-internal edges are non-intra). Other variants are follow-up cycles (8a, 8b, etc.) using the same QPU shader template.

## NEON reference

`ff_h264_v_loop_filter_luma_neon` (external/ffmpeg-snapshot/libavcodec/aarch64/h264dsp_neon.S line 111, vendored 2026-05-18).
Signature inferred from `h264_loop_filter_start` macro:

```
void ff_h264_v_loop_filter_luma_neon(uint8_t *pix, ptrdiff_t stride,
                                     int alpha, int beta, int8_t *tc0);
```

Where:

- `pix`: pointer to the edge centre; pix[0] = q0 pixel of first row
- `stride`: byte stride between rows (typically picture width)
- `alpha`: filter strength threshold, compared against `|p0 - q0|` (0..255 for 8-bit video, QP-derived per MB)
- `beta`: block-boundary threshold, compared against `|p1 - p0|` and `|q1 - q0|` (0..18 for 8-bit video, QP-derived per MB)
- `tc0`: array of 4 int8 values, one tc0 strength per 4-pixel segment

The 16-row edge is divided into 4 segments of 4 rows each; each segment can have its own tc0 (encoder-derived filter strength parameter).

## Algorithm summary (H.264 §8.7.2.4)

Per row, for each 4-row segment:

1. Compute pre-conditions:
   - `bS > 0` (tc0[segment] != -1)
   - `|p0 - q0| < alpha`
   - `|p1 - p0| < beta`
   - `|q1 - q0| < beta`
2. If any pre-condition fails → no filter for this row
3. Compute `ap = |p2 - p0|`, `aq = |q2 - q0|`
4. Compute `tc = tc0 + (ap < beta) + (aq < beta)`
5. `delta = clip3(-tc, tc, (((q0-p0)*4 + (p1-q1) + 4) >> 3))`
6. Apply:
   - `p0' = clip255(p0 + delta)`
   - `q0' = clip255(q0 - delta)`
   - If `ap < beta`: `p1' = p1 + clip3(-tc0, tc0, ...)`
   - If `aq < beta`: `q1' = q1 + clip3(-tc0, tc0, ...)`

Multiple branches per row → harder to write a bit-exact C ref than cycle 2/4 LPF. Expect ~80-100 LOC of C, with care needed on the clip3 ranges (`tc` for delta, `tc0` for the p1/q1 updates).

## 30fps@1080p H.264 deblock floor

A 1920×1080 frame has 120 × 67.5 = 8100 luma MBs. With 4 vertical edges per MB and 4 segments per edge, that is 8100 × 4 × 4 = 129 600 segment-edges per frame, plus the same number again for the 4 horizontal edges per MB.

At 30fps: ~3.9 M edges/s required for luma vertical alone, ~7.8 M edges/s for both v and h.

Realistic (many edges skip the filter via bS=0 or the alpha/beta thresholds): ~30-50 % of these actually filter → effective ~2-4 M edges/s.
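The floor arithmetic above can be reproduced with a few lines of C. This is only a sketch: `deblock_segment_edges_per_sec` and `scale_percent` are hypothetical helper names, not part of the repo, and the counting (4 edges per MB per direction, 4 segments per edge) simply mirrors the text.

```c
#include <stdint.h>

/* Hypothetical helper (not in the repo): luma segment-edges per second for
 * ONE edge direction, using the same counting as the text above. */
static int64_t deblock_segment_edges_per_sec(int width, int height, int fps)
{
    /* 16x16 luma MBs: 1920x1080 gives 120 x 67.5 = 8100, as in the text. */
    const int64_t mbs = (int64_t)width * height / (16 * 16);
    const int64_t edges_per_mb = 4;      /* edges per MB in one direction */
    const int64_t segments_per_edge = 4; /* 4 segments of 4 rows each */
    return mbs * edges_per_mb * segments_per_edge * fps;
}

/* Scale a rate by an integer percentage, e.g. the 30-50 % band of edges
 * that actually get filtered. */
static int64_t scale_percent(int64_t rate, int percent)
{
    return rate * percent / 100;
}
```

For 1920×1080 at 30 fps this gives 3 888 000 segment-edges/s for one direction and 7 776 000 for both. The 30-50 % band then spans 2 332 800 to 3 888 000, consistent with the rounded ~3.9 M, ~7.8 M and 2-4 M figures.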
**30fps@1080p deblock floor (realistic): 2-4 M edges/s.**

**30fps@1080p deblock floor (worst case): 8 M edges/s.**

## Acceptance for Phase 7

- M1: 100.0000% bit-exact (NEON vs C ref, 10000+ random 4-row segments)
- M3: captured
- M2: captured
- R₈: classified
- M4: same-kernel mixed bench
- 30fps@1080p floor margin reported

## Cycle 8 deliverables

1. `external/ffmpeg-snapshot/libavcodec/aarch64/h264dsp_neon.S` (already vendored this phase, 1076 lines)
2. `tests/h264_deblock_ref.c`: C reference for luma vertical non-intra deblock (luma_v_filter_normal)
3. `tests/bench_neon_h264deblock.c`: Phase 3 bench
4. `src/v3d_h264deblock.comp`: Phase 6 shader (likely follows the cycle 2 LPF v3d shader structure, but with deblock branching)
5. `tests/bench_v3d_h264deblock.c`: Phase 6+7 bench
6. CMakeLists.txt wiring

## What lands in THIS session

- This Phase 1 doc
- `h264dsp_neon.S` vendored (file present in repo)
- PROVENANCE.md updated

What's NOT in this session (deferred to next):

- C reference (~2 hours)
- NEON bench
- M1+M3 capture
- Phase 4-7

## Why defer Phase 3+ from this session

Cycle 8 NEON-baseline scope is materially larger than cycles 6/7 because the H.264 deblock has:

- Per-row branching (filter applies or not based on alpha/beta)
- Per-4-row-segment tc0 strength
- 4 separate output adjustments per row (p0, q0, p1, q1)
- ap/aq side-condition checks
- All of these must be bit-exact in the C ref against NEON's vectorised version

Better to write the C ref with fresh attention next session than to rush it now and have it fail M1 like cycle 6's first attempt. The Phase 1 doc itself captures the analysis so the next session can pick up cleanly from here.
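To make the pick-up point concrete, the branching listed above can be sketched as a scalar C routine. This is a starting sketch under stated assumptions, not the Phase 3 reference itself:

- 8-bit samples only.
- The function name `h264_luma_filter_normal_ref` and the `across`/`along` stride split are hypothetical, chosen for illustration.
- Any negative `tc0` marks a skipped segment.
- The p1/q1 update term, which the algorithm summary elides, is taken from the H.264 normal-filter equations.
- The delta computation assumes arithmetic right shift of negative values, as is usual on the target compilers.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

static int clip3(int lo, int hi, int v) { return v < lo ? lo : (v > hi ? hi : v); }
static uint8_t clip255(int v) { return (uint8_t)clip3(0, 255, v); }

/* Hypothetical scalar reference for the non-intra (bS < 4) luma filter.
 * `across` steps from q0 towards q1 (and -across towards p0, p1, p2);
 * `along` steps to the next row of the 16-row edge. */
static void h264_luma_filter_normal_ref(uint8_t *pix, ptrdiff_t across,
                                        ptrdiff_t along, int alpha, int beta,
                                        const int8_t tc0[4])
{
    for (int seg = 0; seg < 4; seg++) {
        const int tc_seg = tc0[seg];
        if (tc_seg < 0)              /* assumption: negative tc0 = no filter */
            continue;
        for (int row = 0; row < 4; row++) {
            uint8_t *px = pix + (seg * 4 + row) * along;
            const int p0 = px[-1 * across], p1 = px[-2 * across], p2 = px[-3 * across];
            const int q0 = px[0], q1 = px[across], q2 = px[2 * across];

            /* Pre-conditions: any failure leaves this row untouched. */
            if (abs(p0 - q0) >= alpha || abs(p1 - p0) >= beta || abs(q1 - q0) >= beta)
                continue;

            int tc = tc_seg;
            const int avg = (p0 + q0 + 1) >> 1;
            if (abs(p2 - p0) < beta) {   /* ap < beta: adjust p1, widen tc */
                px[-2 * across] = (uint8_t)(p1 + clip3(-tc_seg, tc_seg, ((p2 + avg) >> 1) - p1));
                tc++;
            }
            if (abs(q2 - q0) < beta) {   /* aq < beta: adjust q1, widen tc */
                px[across] = (uint8_t)(q1 + clip3(-tc_seg, tc_seg, ((q2 + avg) >> 1) - q1));
                tc++;
            }
            /* Assumes arithmetic right shift when the numerator is negative. */
            const int delta = clip3(-tc, tc, ((q0 - p0) * 4 + (p1 - q1) + 4) >> 3);
            px[-across] = clip255(p0 + delta);
            px[0] = clip255(q0 - delta);
        }
    }
}
```

Called with `across = 1` and `along = stride`, each row's `pix[-3..2]` is treated as p2..q2. For a flat 100|104 step with `alpha = 10`, `beta = 5` and `tc0 = 1`, a filtered row becomes 100, 101, 102 | 102, 103, 104. Which stride assignment matches `ff_h264_v_loop_filter_luma_neon` still has to be confirmed by the M1 comparison.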
## Estimated effort for Phase 3 next session

- C ref: ~2 hours (careful transcription from spec + cross-check against FFmpeg C reference)
- Bench: ~30 min
- M1 debugging (likely needed; cycle 6 took 90 min for the column-major-block discovery, and similar discoveries may apply here): 30-90 min
- M3 capture: 5 min

Total: 3-4 hours for Phase 3 closure.

## Linkage with cycles 6+7 closure

Cycles 6 + 7 + 8 together form the H.264 NEON inventory plus the single most promising QPU target (cycle 8). After cycle 8 closes, the H.264 QPU surface area is well-characterised:

- IDCT 4×4: CPU
- IDCT 8×8: CPU
- Deblock: TBD (cycle 8)
- MC luma qpel: CPU (predicted; cycle 9 if measured)
- MC chroma: CPU (predicted; cycle 10 if measured)

H.264 contribution to daedalus-fourier likely: CPU for transforms and MC, QPU for deblock IF cycle 8 lands GREEN.