
marfrit 436a5c4f74 Cycle 8 Phase 3 closed: H.264 deblock NEON = 92 Medge/s

M1: 10000/10000 bit-exact (after orientation fix:
ff_h264_v_loop_filter is "vertical filtering of horizontal edges",
not "vertical edge"; the 16 columns run along the edge horizontally,
with 8 rows of vertical context across it).

M3: 91.947 Medge/s per core. Per-edge 10.9 ns. 11x worst-case
1080p30 floor, 30x realistic floor. Filter triggers on 25 % of
edges (random alpha/beta/tc0 covers both gating paths).

Cycle 8 Phase 9 lesson: H.264/FFmpeg "v_loop_filter" naming uses
filter DIRECTION (vertical) not edge orientation. Edge is
horizontal; filter operates vertically across it. Distinct from
cycle 6's column-major-block lesson but related discovery
pattern. Encoded for future cycles.

R8 prediction revised: 0.09-0.14 ORANGE (down from Phase 1's
0.3-0.8 estimate). H.264 deblock is 2x faster on NEON than VP9
LPF wd=4 (cycle 2) but H.264 deblock has more per-edge branches
that hurt QPU more. Worth building anyway:
- ORANGE in cycle 1's "M4 may rescue" band
- Mixed-kernel deployment helper value (Issue 003) matters more
  than isolation R
- 25%-trigger rate gives 4x effective contribution multiplier
  on QPU side

- tests/h264_deblock_ref.c (column-walking C ref per row segment)
- tests/bench_neon_h264deblock.c (M1 + M3 bench)
- CMakeLists.txt: cycle 8 NEON bench wiring + h264dsp_neon.S
- docs/k8_h264deblock_phase3.md (closure)

Next: Phase 4 plan QPU shader, Phase 5 Sonnet review.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-18 14:39:36 +00:00



cycle	phase	status	date_opened	date_closed	parent	host
8	3	closed 2026-05-18 — M1 PASS, M3₈ = 91.95 Medge/s	2026-05-18	2026-05-18	k8_h264deblock_phase1.md	hertz

Cycle 8, Phase 3 — H.264 luma deblock NEON baseline

M1 + M3

=== M1₈ bit-exact (10000 random edges) ===
M1₈ correctness: 10000 / 10000 edges bit-exact (100.0000%)
  filter triggered on 2507/10000 edges (25.07%)

=== M3₈ NEON throughput ===
  total edges:    20 443 136
  elapsed (kernel)=0.222 s
  throughput      = 91.947 Medge/s
  per-edge        = 10.9 ns
  H.264 1080p30 worst-case floor: 11.49x margin
  H.264 1080p30 realistic floor:  30.65x margin

The filter triggers on 25.07 % of edges (2507/10000), so the random alpha/beta/tc0 inputs exercise both the filter-applies path and the skip path. This is a realistic gating mix.
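The derived figures in the M3₈ output follow from the raw numbers by simple arithmetic. The helpers below are a minimal sketch of that arithmetic, not the code in tests/bench_neon_h264deblock.c. The function names are hypothetical, and the floor values used in the usage note (8.0 and 3.0 Medge/s) are an assumption back-computed from the reported margins, since this document does not state them.

```c
/* Sketch only: reproduces the derived M3 figures from the raw ones.
 * Function names are hypothetical, not taken from the bench source. */

/* Per-edge latency in nanoseconds for a throughput given in Medge/s. */
static double per_edge_ns(double medge_per_s)
{
    return 1000.0 / medge_per_s;
}

/* Margin over a required throughput floor (both in Medge/s). */
static double margin(double measured_medge, double floor_medge)
{
    return measured_medge / floor_medge;
}

/* Kernel time in seconds implied by an edge count and a throughput. */
static double elapsed_s(double edges, double medge_per_s)
{
    return edges / (medge_per_s * 1e6);
}
```

With the measured 91.947 Medge/s, `per_edge_ns(91.947)` gives about 10.88 ns, and `elapsed_s(20443136.0, 91.947)` gives about 0.222 s. Under the assumed floors, `margin(91.947, 8.0)` and `margin(91.947, 3.0)` land at roughly 11.49 and 30.65, matching the reported margins.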

Key Phase 9 lesson — H.264 v_loop_filter is VERTICAL filtering of HORIZONTAL edges

The FFmpeg naming convention "v_loop_filter_luma" / "h_loop_filter_luma" refers to the filter direction, not the edge orientation:

v_loop_filter_luma — filter applied VERTICALLY across a HORIZONTAL edge (16-col wide edge between row -1 and row 0). pix points to row 0, column 0 of the bottom block.
h_loop_filter_luma — filter applied HORIZONTALLY across a VERTICAL edge (16-row tall edge between col -1 and col 0).

This matches the H.264 spec convention, but it tripped up the first C-ref draft in cycle 8, which assumed v_loop_filter operated on a vertical edge with row-wise filtering. The trace showed only ±1 pixel differences, which initially looked like a rounding issue but turned out to be a layout misinterpretation:

The 16 vector lanes in the NEON kernel correspond to 16 image COLUMNS laid out along the horizontal edge.
The 8 "rows" (p3..p0 / q0..q3 context) run vertically across the edge, four on each side.

Cycle 6 had a similar lesson with column-major-block; cycle 8 has this related-but-distinct edge-orientation lesson. Encoded for future cycles.
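The orientation lesson comes down to which stride steps across the edge and which steps along it. The sketch below illustrates that with a hypothetical gather helper that collects p3..q3 for one lane. The names `gather_across_edge`, `gather_v` and `gather_h` are invented for illustration and are not part of FFmpeg or of tests/h264_deblock_ref.c.

```c
#include <stddef.h>
#include <stdint.h>

/* Sketch only: collect the 8 samples p3,p2,p1,p0,q0,q1,q2,q3 that
 * straddle the edge for one lane.
 *   pix    : first q0 sample of the edge (lane 0)
 *   across : byte step that crosses the edge (p side is negative)
 *   along  : byte step from one lane to the next along the edge */
static void gather_across_edge(const uint8_t *pix, ptrdiff_t across,
                               ptrdiff_t along, int lane, uint8_t out[8])
{
    const uint8_t *q0 = pix + lane * along;
    for (int k = 0; k < 8; k++)
        out[k] = q0[(k - 4) * across]; /* k=3 is p0, k=4 is q0 */
}

/* v_loop_filter layout: HORIZONTAL edge, lanes are image columns,
 * the filter walks vertically (one image row per step). */
static void gather_v(const uint8_t *pix, ptrdiff_t stride, int col,
                     uint8_t out[8])
{
    gather_across_edge(pix, stride, 1, col, out);
}

/* h_loop_filter layout: VERTICAL edge, lanes are image rows,
 * the filter walks horizontally (one pixel per step). */
static void gather_h(const uint8_t *pix, ptrdiff_t stride, int row,
                     uint8_t out[8])
{
    gather_across_edge(pix, 1, stride, row, out);
}
```

The two variants differ only in swapping the stride pair. A reference that uses the `gather_h` strides where `gather_v` was intended reads a transposed neighbourhood, which is the kind of layout mix-up described above.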

R₈ prediction (revised from Phase 1)

Phase 1 predicted R₈ = 0.3-0.8 (ORANGE/YELLOW) by analogy with the VP9 LPF. With M3₈ = 92 Medge/s now measured (vs cycle 2's 48 Medge/s), the picture refines:

H.264 deblock takes 10.9 ns per edge vs cycle 2's 20.7 ns, so H.264 is ~1.9× faster on NEON per edge
Cycle 2 QPU reached 19.6 Medge/s, giving R = 0.41 (GREEN)
H.264 deblock is MORE complex per edge (alpha/beta gating, tc0 array, ap/aq side conditions, conditional p1/q1 writes) → QPU work per edge likely 1.5-2× heavier than cycle 2's QPU
Expected QPU M2 = 8-13 Medge/s
Predicted R₈ = 0.09-0.14 → ORANGE (lower than the Phase 1 prediction)
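The prediction can be checked numerically. The sketch below assumes R is the ratio of QPU throughput (M2) to NEON throughput (M3), a definition inferred from the cycle 2 figures (19.6 / 48.285 ≈ 0.41) rather than stated here. Both function names are hypothetical.

```c
/* Sketch only. ASSUMPTION: R = QPU M2 / NEON M3, both in Medge/s,
 * inferred from the cycle 2 numbers quoted in this document. */
static double ratio_r(double qpu_m2_medge, double neon_m3_medge)
{
    return qpu_m2_medge / neon_m3_medge;
}

/* Scale a known QPU throughput by an estimated per-edge work factor
 * (e.g. 1.5 means each edge costs 1.5x as much QPU work). */
static double scaled_m2(double base_m2_medge, double work_factor)
{
    return base_m2_medge / work_factor;
}
```

`ratio_r(19.6, 48.285)` gives about 0.406, which matches cycle 2. `ratio_r(8.0, 91.947)` and `ratio_r(13.0, 91.947)` give about 0.087 and 0.141, which matches the predicted band. Scaling 19.6 Medge/s by the 1.5-2× work factor gives roughly 13.1 down to 9.8 Medge/s, so the 8 Medge/s lower bound is somewhat more conservative than the work factor alone implies.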

Still likely worth building the QPU shader because:

ORANGE is in the "M4 may still rescue" band (per cycle 1 calibration where R=0.92 turned into +7.2% M4)
In real deployment, the QPU's value as a helper in a mixed-kernel setup (Issue 003) matters more than its R measured in isolation
Only about 25 % of edges actually trigger the filter, so the QPU needs to handle just that quarter; even a modest QPU contribution therefore gets a 4× effective multiplier

Cycle comparison

Metric	Cycle 2 LPF wd=4	Cycle 8 H.264 deblock
Codec	VP9	H.264
Edge size	8 rows, 4-tap	8 rows, 4-tap (similar)
NEON M3	48.285 Medge/s	91.947 Medge/s (1.9× faster)
Per-edge ns	20.7	10.9
Filter triggering rate	~30 % (cycle 2 bench)	25 %
Verdict	GREEN (M4 +6.9 %)	TBD (predicted ORANGE)

H.264 deblock's per-edge work is comparable to VP9 LPF, but it runs about 1.9× faster on NEON due to:

16 columns processed in parallel (vs VP9 LPF 4-tap's 8 columns)
More efficient byte-vector ops in FFmpeg's NEON implementation
H.264 deblock doesn't have VP9's wd=4/8/16 variant overhead

Acceptance for Phase 7

✓ M1 bit-exact (100.00 % on 10 000 random edges)
✓ M3 captured (91.947 Medge/s)
✓ 1080p30 worst-case floor exceeded by 11× (30× for the realistic floor)
→ Phase 4 plan QPU shader (next)

Cycle 8 next phase

Phase 4: plan v3d_h264deblock.comp. It will likely follow the cycle 2 LPF shader template (no barrier, edge-per-lane decomposition, uint8 dst SSBO). Differences from cycle 2:

16 columns per edge (not 8)
alpha/beta gating with multiple short-circuit conditions
tc0 per 4-col segment
ap/aq side conditions affecting p1/q1 writes
More compute per pixel than cycle 2
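The per-column work listed above can be sketched in scalar C. This follows the standard H.264 normal-strength (bS < 4) luma filter as it appears in FFmpeg's C reference path. It is an illustration of the control flow a shader would need to mirror, not the contents of tests/h264_deblock_ref.c or the planned shader. The names `sketch_v_loop_filter_luma` and `clip3` are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

static int clip3(int v, int lo, int hi)
{
    return v < lo ? lo : (v > hi ? hi : v);
}

/* Sketch only: filter one 16-column HORIZONTAL luma edge (bS < 4).
 * pix points at row 0, column 0 of the lower block. Rows -3..2 are
 * read, rows -2..1 may be written. Returns the number of columns
 * filtered. Relies on arithmetic right shift of negative ints, as
 * common compilers provide. */
static int sketch_v_loop_filter_luma(uint8_t *pix, ptrdiff_t stride,
                                     int alpha, int beta,
                                     const int8_t tc0[4])
{
    int modified = 0;
    for (int col = 0; col < 16; col++) {
        const int tc_orig = tc0[col >> 2];   /* one tc0 per 4 columns */
        if (tc_orig < 0)
            continue;                        /* segment disabled */

        uint8_t *c = pix + col;
        const int p2 = c[-3 * stride], p1 = c[-2 * stride], p0 = c[-stride];
        const int q0 = c[0], q1 = c[stride], q2 = c[2 * stride];

        /* alpha/beta gating: any failed test skips the column */
        if (abs(p0 - q0) >= alpha || abs(p1 - p0) >= beta ||
            abs(q1 - q0) >= beta)
            continue;

        int tc = tc_orig;
        const int avg = (p0 + q0 + 1) >> 1;

        if (abs(p2 - p0) < beta) {           /* ap: conditional p1 write */
            if (tc_orig)
                c[-2 * stride] = (uint8_t)(p1 +
                    clip3(((p2 + avg) >> 1) - p1, -tc_orig, tc_orig));
            tc++;
        }
        if (abs(q2 - q0) < beta) {           /* aq: conditional q1 write */
            if (tc_orig)
                c[stride] = (uint8_t)(q1 +
                    clip3(((q2 + avg) >> 1) - q1, -tc_orig, tc_orig));
            tc++;
        }

        const int delta =
            clip3((((q0 - p0) * 4) + (p1 - q1) + 4) >> 3, -tc, tc);
        c[-stride] = (uint8_t)clip3(p0 + delta, 0, 255);
        c[0]       = (uint8_t)clip3(q0 - delta, 0, 255);
        modified++;
    }
    return modified;
}
```

Each column has up to three early-outs (tc0 < 0 plus the alpha/beta tests) and two independent side conditions. On a lane-parallel target these would probably become predicated writes rather than branches, but that design choice belongs to Phase 4.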

Then Phase 5 Sonnet review (non-skippable), Phase 6 implement, Phase 7 measure.
