---
cycle: 8
phase: 3
status: closed 2026-05-18 - M1 PASS, M3₈ = 91.95 Medge/s
date_opened: 2026-05-18
date_closed: 2026-05-18
parent: k8_h264deblock_phase1.md
host: hertz
---

# Cycle 8, Phase 3: H.264 luma deblock NEON baseline

## M1 + M3

```
=== M1₈ bit-exact (10000 random edges) ===
M1₈ correctness: 10000 / 10000 edges bit-exact (100.0000%)
filter triggered on 2507/10000 edges (25.07%)

=== M3₈ NEON throughput ===
total edges: 20 443 136
elapsed (kernel)=0.222 s
throughput = 91.947 Medge/s
per-edge = 10.9 ns
H.264 1080p30 worst-case floor: 11.49x margin
H.264 1080p30 realistic floor: 30.65x margin
```

Filter triggers 25 % of the time, which is realistic gating: random alpha/beta/tc0 cover both filter-applies and skip cases.

## Key Phase 9 lesson: H.264 v_loop_filter is VERTICAL filtering of HORIZONTAL edges

The FFmpeg naming convention "v_loop_filter_luma" / "h_loop_filter_luma" refers to the **filter direction**, not the edge orientation:

- `v_loop_filter_luma`: filter applied VERTICALLY across a HORIZONTAL edge (16-col wide edge between row -1 and row 0). `pix` points to row 0, column 0 of the bottom block.
- `h_loop_filter_luma`: filter applied HORIZONTALLY across a VERTICAL edge (16-row tall edge between col -1 and col 0).

This is the H.264 spec convention but it tripped up the cycle 8 first C-ref draft (which assumed v_loop_filter operated on a vertical edge with row-wise filtering). Trace showed only ±1 pixel differences which initially looked like a rounding issue but was actually a layout misinterpretation:

- The 16 "columns" in the NEON's vector lanes correspond to image COLUMNS spanning the edge horizontally.
- The 8 "rows" (p3..p0 / q0..q3 context) span the edge vertically.

Cycle 6 had a similar lesson with column-major-block; cycle 8 has this related-but-distinct edge-orientation lesson. Encoded for future cycles.

## R₈ prediction (revised from Phase 1)

Phase 1 predicted R₈ = 0.3-0.8 ORANGE/YELLOW based on VP9 LPF analog.
With M3₈ = 92 Medge/s captured (vs cycle 2's 48 Medge/s), the picture refines:

- H.264 deblock per-edge 10.9 ns vs cycle 2's 20.7 ns: **H.264 is ~2× faster on NEON per edge**
- Cycle 2 QPU was 19.6 Medge/s, i.e. R = 19.6 / 48.3 = 0.41 GREEN
- H.264 deblock is MORE complex per edge (alpha/beta gating, tc0 array, ap/aq side conditions, conditional p1/q1 writes) → QPU work per edge likely 1.5-2× heavier than cycle 2's QPU
- Expected QPU M2 = 8-13 Medge/s
- **Predicted R₈ = M2 / M3₈ = 0.09-0.14 → ORANGE (lower than the Phase 1 prediction)**

Still likely worth building the QPU shader because:

- ORANGE is in the "M4 may still rescue" band (per cycle 1 calibration where R=0.92 turned into +7.2% M4)
- For real deployment, mixed-kernel (Issue 003) helper value matters more than isolation R
- Even at modest QPU contribution, the 25 %-of-edges-trigger reality means QPU only needs to handle the 25 % that actually filter; that's a 4× effective contribution multiplier

## Cycle comparison

| | Cycle 2 LPF wd=4 | Cycle 8 H.264 deblock |
|---|---|---|
| Codec | VP9 | H.264 |
| Edge size | 8 rows, 4-tap | 8 rows, 4-tap (similar) |
| NEON M3 | 48.285 Medge/s | **91.947 Medge/s** (1.9× faster) |
| Per-edge ns | 20.7 | **10.9** |
| Filter triggering rate | ~30 % (cycle 2 bench) | 25 % |
| Verdict | GREEN (M4 +6.9 %) | TBD (predicted ORANGE) |

H.264 deblock's per-edge work is comparable to VP9 LPF but 2× faster on NEON due to:

- 16 columns processed in parallel (vs VP9 LPF 4-tap's 8 columns)
- More efficient byte-vector ops in FFmpeg's NEON implementation
- H.264 deblock doesn't have VP9's wd=4/8/16 variant overhead

## Acceptance for Phase 7

- ✓ M1 bit-exact (100.00 % on 10 000 random edges)
- ✓ M3 captured (91.947 Medge/s)
- ✓ 30fps@1080p floor exceeded by 11× worst-case
- → Phase 4 plan QPU shader (next)

## Cycle 8 next phase

Phase 4: plan `v3d_h264deblock.comp`. Likely follows cycle 2 LPF shader template (no barrier, edge per lane decomposition, uint8 dst SSBO).
Differences:

- 16 columns per edge (not 8)
- alpha/beta gating with multiple short-circuit conditions
- tc0 per 4-col segment
- ap/aq side conditions affecting p1/q1 writes
- More compute per pixel than cycle 2

Then Phase 5 Sonnet review (non-skippable), Phase 6 implement, Phase 7 measure.
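
The per-edge work listed above (alpha/beta gating, tc0 per 4-column segment, ap/aq side conditions, conditional p1/q1 writes) can be sketched in scalar C. This is a minimal sketch, not the cycle 8 C-ref or FFmpeg's code: it assumes 8-bit luma and the normal-strength (bS < 4) filter only, and the names `deblock_luma_h_edge_ref` and `clip3` are hypothetical. It uses the `v_loop_filter_luma` layout described earlier: `pix` points at row 0, column 0 of the bottom block, and each of the 16 columns is filtered vertically across the horizontal edge.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: clamp v into [lo, hi]. */
static int clip3(int v, int lo, int hi)
{
    return v < lo ? lo : (v > hi ? hi : v);
}

/* Hypothetical scalar reference for one 16-column HORIZONTAL luma edge,
 * filtered VERTICALLY (the v_loop_filter_luma layout), bS < 4 only.
 * pix    : row 0, column 0 of the bottom block (the q0 row)
 * stride : bytes between image rows
 * tc0[4] : one value per 4-column segment; negative means "skip segment"
 * Needs rows -3..+2 around pix to be readable. */
void deblock_luma_h_edge_ref(uint8_t *pix, ptrdiff_t stride,
                             int alpha, int beta, const int8_t tc0[4])
{
    for (int col = 0; col < 16; col++) {
        const int tc_orig = tc0[col >> 2];
        if (tc_orig < 0)
            continue;                       /* segment disabled */

        uint8_t *x = pix + col;             /* q0 sample of this column */
        const int p2 = x[-3 * stride], p1 = x[-2 * stride], p0 = x[-stride];
        const int q0 = x[0], q1 = x[stride], q2 = x[2 * stride];

        /* alpha/beta gate: all three conditions must hold to filter. */
        if (abs(p0 - q0) >= alpha || abs(p1 - p0) >= beta ||
            abs(q1 - q0) >= beta)
            continue;

        int tc = tc_orig;
        const int avg = (p0 + q0 + 1) >> 1;

        if (abs(p2 - p0) < beta) {          /* ap side condition */
            if (tc_orig)
                x[-2 * stride] = (uint8_t)(p1 +
                    clip3(((p2 + avg) >> 1) - p1, -tc_orig, tc_orig));
            tc++;
        }
        if (abs(q2 - q0) < beta) {          /* aq side condition */
            if (tc_orig)
                x[stride] = (uint8_t)(q1 +
                    clip3(((q2 + avg) >> 1) - q1, -tc_orig, tc_orig));
            tc++;
        }

        /* Relies on arithmetic right shift for negative values, which is
         * implementation-defined in C but holds on common compilers. */
        const int delta =
            clip3(((q0 - p0) * 4 + (p1 - q1) + 4) >> 3, -tc, tc);
        x[-stride] = (uint8_t)clip3(p0 + delta, 0, 255);
        x[0]       = (uint8_t)clip3(q0 - delta, 0, 255);
    }
}
```

Swapping the two step sizes (step 1 across the edge, step `stride` along it) would give the `h_loop_filter_luma` layout, which is exactly the mix-up the first C-ref draft made. For a QPU shader, the `continue` paths and the two conditional p1/q1 writes are the parts that differ most from the cycle 2 template.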