Cycle 8 Phase 3 closed: H.264 deblock NEON = 92 Medge/s
M1: 10000/10000 bit-exact (after orientation fix: ff_h264_v_loop_filter is
"vertical filtering of horizontal edges", not "vertical edge"; 16 columns
process the edge horizontally with 8 rows of vertical context).
M3: 91.947 Medge/s per core. Per-edge 10.9 ns. 11x worst-case 1080p30
floor, 30x realistic floor. Filter triggers on 25 % of edges (random
alpha/beta/tc0 covers both gating paths).

Cycle 8 Phase 9 lesson: H.264/FFmpeg "v_loop_filter" naming uses filter
DIRECTION (vertical), not edge orientation. Edge is horizontal; filter
operates vertically across it. Distinct from cycle 6's column-major-block
lesson but related discovery pattern. Encoded for future cycles.

R8 prediction revised: 0.09-0.14 ORANGE (down from Phase 1's 0.3-0.8
estimate). H.264 deblock is 2x faster on NEON than VP9 LPF wd=4 (cycle 2),
but H.264 deblock has more per-edge branches that hurt QPU more.

Worth building anyway:
- ORANGE in cycle 1's "M4 may rescue" band
- Mixed-kernel deployment helper value (Issue 003) matters more than
  isolation R
- 25%-trigger rate gives 4x effective contribution multiplier on QPU side

- tests/h264_deblock_ref.c (column-walking C ref per row segment)
- tests/bench_neon_h264deblock.c (M1 + M3 bench)
- CMakeLists.txt: cycle 8 NEON bench wiring + h264dsp_neon.S
- docs/k8_h264deblock_phase3.md (closure)

Next: Phase 4 plan QPU shader, Phase 5 Sonnet review.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -0,0 +1,116 @@
---
cycle: 8
phase: 3
status: closed 2026-05-18 — M1 PASS, M3₈ = 91.95 Medge/s
date_opened: 2026-05-18
date_closed: 2026-05-18
parent: k8_h264deblock_phase1.md
host: hertz
---

# Cycle 8, Phase 3 — H.264 luma deblock NEON baseline

## M1 + M3

```
=== M1₈ bit-exact (10000 random edges) ===
M1₈ correctness: 10000 / 10000 edges bit-exact (100.0000%)
filter triggered on 2507/10000 edges (25.07%)

=== M3₈ NEON throughput ===
total edges: 20 443 136
elapsed (kernel)=0.222 s
throughput = 91.947 Medge/s
per-edge = 10.9 ns
H.264 1080p30 worst-case floor: 11.49x margin
H.264 1080p30 realistic floor: 30.65x margin
```

The filter triggers on 25 % of edges, which is realistic gating: the
random alpha/beta/tc0 inputs cover both the filter-applies and the
skip paths.

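The derived figures in the log follow from the throughput alone. A minimal C sketch of that arithmetic, assuming each margin is defined as throughput divided by a required edge rate (the function names are illustrative and are not taken from `tests/bench_neon_h264deblock.c`):

```c
#include <assert.h>

/* Nanoseconds per edge for a throughput given in Medge/s (1e6 edges/s). */
double per_edge_ns(double medge_per_s)
{
    assert(medge_per_s > 0.0);
    return 1000.0 / medge_per_s;   /* 1e9 ns/s over 1e6 edges per Medge */
}

/* Required edge rate (Medge/s) implied by a reported margin.
 * ASSUMPTION: margin = throughput / required rate. */
double implied_floor_medge(double medge_per_s, double margin)
{
    assert(margin > 0.0);
    return medge_per_s / margin;
}
```

With the logged values, `per_edge_ns(91.947)` comes out at about 10.9 ns, and under the stated assumption the 11.49x and 30.65x margins correspond to required rates of roughly 8.0 and 3.0 Medge/s.
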
## Key Phase 9 lesson — H.264 v_loop_filter is VERTICAL filtering of HORIZONTAL edges

The FFmpeg names `v_loop_filter_luma` and `h_loop_filter_luma` refer to
the **filter direction**, not the edge orientation:

- `v_loop_filter_luma`: filter applied VERTICALLY across a
  HORIZONTAL edge (16-col wide edge between row -1 and row 0).
  `pix` points to row 0, column 0 of the bottom block.
- `h_loop_filter_luma`: filter applied HORIZONTALLY across a
  VERTICAL edge (16-row tall edge between col -1 and col 0).

This is the H.264 spec convention, but it tripped up the first cycle 8
C-ref draft, which assumed `v_loop_filter` operated on a vertical edge
with row-wise filtering. The trace showed only ±1 pixel differences,
which initially looked like a rounding issue but were actually a
layout misinterpretation:

- The 16 "columns" in the NEON vector lanes correspond to image
  COLUMNS spanning the edge horizontally.
- The 8 "rows" (p3..p0 / q0..q3 context) span the edge vertically.

Cycle 6 had a similar lesson with column-major-block; cycle 8 adds
this related-but-distinct edge-orientation lesson. Encoded for
future cycles.

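To make the orientation concrete, here is a hedged scalar sketch of the normal-strength (bS < 4) luma filter as it is commonly written for a horizontal edge. It is not the project's `tests/h264_deblock_ref.c`; the function and helper names are hypothetical, and it assumes the standard H.264 normal-filter equations with `tc0[i]` gating columns `4*i .. 4*i+3`:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

static int clip3(int v, int lo, int hi) { return v < lo ? lo : (v > hi ? hi : v); }
static uint8_t clip_u8(int v) { return (uint8_t)clip3(v, 0, 255); }

/* pix points at row 0, column 0 of the block BELOW the horizontal edge.
 * p samples live at negative multiples of stride (rows above the edge),
 * q samples at non-negative multiples (rows below). The loop walks the
 * 16 image columns; the filter itself reads and writes vertically. */
void v_loop_filter_luma_sketch(uint8_t *pix, ptrdiff_t stride,
                               int alpha, int beta, const int8_t tc0[4])
{
    assert(pix != NULL && tc0 != NULL);
    for (int col = 0; col < 16; col++) {
        const int tc_orig = tc0[col >> 2];      /* one tc0 per 4-column segment */
        uint8_t *c = pix + col;
        if (tc_orig < 0)
            continue;                           /* segment disabled */

        const int p2 = c[-3 * stride], p1 = c[-2 * stride], p0 = c[-stride];
        const int q0 = c[0], q1 = c[stride], q2 = c[2 * stride];

        /* alpha/beta gating: skip columns that look like a real image edge */
        if (abs(p0 - q0) >= alpha || abs(p1 - p0) >= beta || abs(q1 - q0) >= beta)
            continue;

        int tc = tc_orig;
        const int avg = (p0 + q0 + 1) >> 1;
        if (abs(p2 - p0) < beta) {              /* ap side condition */
            if (tc_orig)
                c[-2 * stride] = (uint8_t)(p1 + clip3(((p2 + avg) >> 1) - p1, -tc_orig, tc_orig));
            tc++;
        }
        if (abs(q2 - q0) < beta) {              /* aq side condition */
            if (tc_orig)
                c[stride] = (uint8_t)(q1 + clip3(((q2 + avg) >> 1) - q1, -tc_orig, tc_orig));
            tc++;
        }
        /* relies on arithmetic right shift of negative ints, as is common in codec code */
        const int delta = clip3((((q0 - p0) * 4) + (p1 - q1) + 4) >> 3, -tc, tc);
        c[-stride] = clip_u8(p0 + delta);
        c[0]       = clip_u8(q0 - delta);
    }
}
```

Swapping the roles of `stride` and the column step (stepping by `stride` per iteration and reading p/q at byte offsets -3..2) would give the `h_loop_filter_luma` orientation instead, which is the layout the first draft mistakenly used.
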
## R₈ prediction (revised from Phase 1)

Phase 1 predicted R₈ = 0.3-0.8 (ORANGE/YELLOW) by analogy with the
VP9 LPF. With M3₈ = 92 Medge/s captured (vs cycle 2's 48 Medge/s),
the picture refines:

- H.264 deblock takes 10.9 ns per edge vs cycle 2's 20.7 ns: **H.264
  is ~2× faster on NEON per edge**
- Cycle 2 QPU reached 19.6 Medge/s, i.e. R = 0.41 (GREEN)
- H.264 deblock is MORE complex per edge (alpha/beta gating, tc0
  array, ap/aq side conditions, conditional p1/q1 writes) → QPU
  work per edge likely 1.5-2× heavier than cycle 2's QPU
- Expected QPU M2 = 8-13 Medge/s
- **Predicted R₈ = 0.09-0.14 → ORANGE (lower than Phase 1 predicted)**

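The prediction can be reproduced with two divisions. The sketch below assumes R is the QPU throughput divided by the NEON throughput, which is consistent with cycle 2's 19.6 / 48.285 ≈ 0.41; the function names are illustrative only:

```c
#include <assert.h>

/* ASSUMPTION: R = QPU throughput / NEON throughput, both in Medge/s. */
double r_ratio(double qpu_medge, double neon_medge)
{
    assert(neon_medge > 0.0);
    return qpu_medge / neon_medge;
}

/* QPU throughput estimate when per-edge work is `heaviness` times
 * that of a baseline shader running at base_qpu_medge. */
double scaled_qpu_medge(double base_qpu_medge, double heaviness)
{
    assert(heaviness > 0.0);
    return base_qpu_medge / heaviness;
}
```

Scaling cycle 2's 19.6 Medge/s by 1.5-2× gives roughly 13.1 down to 9.8 Medge/s, so the quoted 8-13 Medge/s range leaves some extra room at the low end. Feeding 8 and 13 into `r_ratio` against 91.947 Medge/s yields about 0.09 and 0.14.
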
Still likely worth building the QPU shader because:

- ORANGE is in the "M4 may still rescue" band (per cycle 1
  calibration, where R = 0.92 turned into +7.2 % M4)
- For real deployment, the mixed-kernel helper value (Issue 003)
  matters more than isolation R
- Even at a modest QPU contribution, only 25 % of edges trigger the
  filter, so the QPU only needs to handle that 25 %; that is a 4×
  effective contribution multiplier

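The 4× multiplier in the last bullet is the reciprocal of the trigger fraction. A small sketch, under the optimistic assumption that skipped edges cost the QPU nothing and that the gating decision is made elsewhere for free (the function name is hypothetical):

```c
#include <assert.h>

/* Total edge rate covered if only the triggered fraction of edges
 * ever reaches the QPU. ASSUMPTION: untriggered edges are free. */
double effective_edge_rate(double qpu_medge, double trigger_fraction)
{
    assert(trigger_fraction > 0.0 && trigger_fraction <= 1.0);
    return qpu_medge / trigger_fraction;
}
```

At a 25 % trigger rate, a QPU that filters 8-13 Medge/s would cover 32-52 Medge/s of total edges under this assumption; with the measured 25.07 % the factor is marginally below 4.
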
## Cycle comparison

| Metric | Cycle 2 LPF wd=4 | Cycle 8 H.264 deblock |
|---|---|---|
| Codec | VP9 | H.264 |
| Edge size | 8 rows, 4-tap | 8 rows, 4-tap (similar) |
| NEON M3 | 48.285 Medge/s | **91.947 Medge/s** (1.9× faster) |
| Per-edge ns | 20.7 | **10.9** |
| Filter triggering rate | ~30 % (cycle 2 bench) | 25 % |
| Verdict | GREEN (M4 +6.9 %) | TBD (predicted ORANGE) |

H.264 deblock's per-edge work is comparable to VP9 LPF but runs
about 1.9× faster on NEON due to:
- 16 columns processed in parallel (vs VP9 LPF 4-tap's 8 columns)
- More efficient byte-vector ops in FFmpeg's NEON implementation
- H.264 deblock doesn't have VP9's wd=4/8/16 variant overhead

## Acceptance for Phase 7

- ✓ M1 bit-exact (100.00 % on 10 000 random edges)
- ✓ M3 captured (91.947 Medge/s)
- ✓ 1080p30 worst-case floor exceeded by 11.49×
- → Phase 4 plan QPU shader (next)

## Cycle 8 next phase

Phase 4: plan `v3d_h264deblock.comp`. It will likely follow the cycle 2
LPF shader template (no barrier, edge-per-lane decomposition,
uint8 dst SSBO). Differences:
- 16 columns per edge (not 8)
- alpha/beta gating with multiple short-circuit conditions
- tc0 per 4-col segment
- ap/aq side conditions affecting p1/q1 writes
- More compute per pixel than cycle 2

Then Phase 5 Sonnet review (non-skippable), Phase 6 implement,
Phase 7 measure.