Cycle 2 (deblocking) Phase 1-3: M3'' = 48.285 Medge/s baseline

Second kernel candidate per phase7_M4.md verdict "next-kernel cycle
authorised". VP9 4-tap inner loop filter, horizontal direction, 8-pixel
edge (libavcodec ff_vp9_loop_filter_h_4_8_neon as baseline). Different
workload shape from IDCT - boundary streaming, lighter compute per
unit, per-row conditionals - tests whether QPU win generalises.

docs/k2_deblock_phase1.md - goal-setting. Same R-band decision rules as
cycle 1 (phase1.md), with the cycle-1 calibration adjustment: ORANGE
band is no longer auto-close because M4 showed mixed > pure CPU even at
modest R when CPU bandwidth-saturates.

docs/k2_deblock_phase2.md - situation analysis. C reference already in
vendored snapshot (vp9dsp_template.c:1780-1898). Fetched vp9lpf_neon.S
fresh (1334 lines, LGPL-2.1+, FFmpeg n7.1.3 pin, SHA-256 384e49e7...).
PROVENANCE.md updated.

docs/k2_deblock_phase3.md - NEON baseline:
  M1''_c bit-exact     100.0000 % (10000 random edges)
  M3'' throughput      48.285 Medge/s (20.7 ns/edge, single A76)
  per-frame 1080p-eq   748 FPS (worst case 64 530 edges/frame)
  cycles/edge          ~58 (=20.7ns x 2.8GHz), ~7 cycles/row

LPF is 5.9x faster per-unit than IDCT M3 (20.7 vs 122 ns), so the QPU
break-even point moves down. Predicted R''_v1 band ~0.5-0.9 -
frame-level batching amortises the same 33us dispatch overhead;
workload becomes bandwidth-bound rather than compute-bound (~5.7
MB/frame traffic at 64 530 edges x ~88 B per edge).

New artifacts:
- tests/vp9_lpf_ref.c - standalone bit-exact C ref (8-bit, wd=4 inner
  only; clean transcription)
- tests/bench_neon_lpf.c - M1''_c gate + M3'' time-based bench (5s
  window, edge-content-biased RNG for realistic fm/hev hit rates)
- external/ffmpeg-snapshot/libavcodec/aarch64/vp9lpf_neon.S
- CMakeLists.txt updated with bench_neon_lpf target

Phase 4 next: plan the QPU LPF compute shader. Cycle 1's phase4.md +
phase5.md + phase7.md learnings apply directly - bake in the v4 winning
patterns from the start (WG=256, edges-per-subgroup pattern adapted
from blocks, uint8_t dst SSBO, oob flag, unrolled writes).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 12:28:57 +00:00
parent 8182e43c15
commit be7ff5587c
8 changed files with 2024 additions and 1 deletions
@@ -0,0 +1,107 @@
+---
+cycle: 2
+phase: 3
+status: closed 2026-05-18
+date_opened: 2026-05-18
+date_closed: 2026-05-18
+parent: k2_deblock_phase2.md
+host: hertz (Pi 5, 8 GB, Debian Trixie, kernel 6.12.75+rpt-rpi-2712,
+      Mesa 25.0.7-2+rpt4, V3D 7.1.7 @ 1 GHz, A76 @ 2.8 GHz)
+---
+
+# Cycle 2, Phase 3 — NEON M3'' baseline
+
+Per `dev_process.md`: real measurements, before any changes.
+
+## Raw
+
+```
+=== M1''_c: bit-exact correctness (10000 random edges) ===
+M1''_c correctness: 10000 / 10000 edges bit-exact (100.0000%)
+
+=== M3'': NEON throughput ===
+M3'' NEON throughput:
+  edges/batch:     65536
+  batches done:    2009
+  total edges:     131 661 824
+  elapsed (kernel)=2.726785 s  (setup-subtracted)
+  elapsed (setup) =2.273954 s
+  throughput      = 48.285 Medge/s
+  per-edge        = 20.7 ns
+  equiv 1080p     = 748.3 FPS  (~64530 edges/frame, worst case)
+```
+
+## Numbers
+
+| | |
+|---|---|
+| **M1''_c (bit-exact)** | **100.0000 %** vs `daedalus_vp9_loop_filter_h_4_8_ref` |
+| **M3'' (throughput)** | **48.285 Medge/s** (single A76 core @ 2.8 GHz) |
+| per-edge | 20.7 ns |
+| cycles/edge | 20.7 ns × 2.8 GHz ≈ 58 cycles (~7 cycles per pixel-row) |
+| 1080p FPS-equivalent | 748 FPS (worst-case 64 530 edges) |
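The derived figures in the table can be re-checked from the raw counters alone. The following is a minimal Python sketch, not part of the commit. The variable names are illustrative, and `ROWS_PER_EDGE = 8` assumes that one h_4_8 edge spans 8 pixel rows, as the "8-pixel edge" wording suggests.

```python
# Raw counters copied from the bench output in the "Raw" section.
EDGES_PER_BATCH = 65_536
BATCHES = 2_009
KERNEL_SECONDS = 2.726785   # elapsed (kernel), setup-subtracted
CPU_HZ = 2.8e9              # A76 clock from the front matter
EDGES_PER_FRAME = 64_530    # worst-case 1080p edge count quoted above
ROWS_PER_EDGE = 8           # assumption: one h_4_8 edge covers 8 pixel rows

total_edges = EDGES_PER_BATCH * BATCHES
edges_per_second = total_edges / KERNEL_SECONDS
ns_per_edge = 1e9 / edges_per_second
cycles_per_edge = ns_per_edge * 1e-9 * CPU_HZ
cycles_per_row = cycles_per_edge / ROWS_PER_EDGE
fps_1080p = edges_per_second / EDGES_PER_FRAME

print(f"total edges     {total_edges}")
print(f"throughput      {edges_per_second / 1e6:.3f} Medge/s")
print(f"per-edge        {ns_per_edge:.1f} ns")
print(f"cycles/edge     {cycles_per_edge:.0f}")
print(f"cycles/row      {cycles_per_row:.1f}")
print(f"1080p FPS-eq    {fps_1080p:.1f}")
```

Recomputing from the counters rather than from the rounded table entries avoids compounding rounding error when these numbers feed later predictions.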
+
+## Comparison vs cycle-1 IDCT M3
+
+| | IDCT 8×8 | LPF h_4_8 | ratio |
+|---|---|---|---|
+| Per-unit (block / edge) | 122.4 ns | 20.7 ns | **LPF 5.9× faster** |
+| 1080p FPS-eq, single core | 252 FPS | 748 FPS | LPF 3.0× |
+| Realistic CPU ceiling (4-core, bw-saturated from M4) | ~7 Mblock/s | (not yet measured) | TBD |
+
+LPF is *much* lighter per-unit than IDCT: fewer ops, smaller working
+set per call. Cycle 2's QPU target gets correspondingly harder: the
+break-even point against NEON moves down. Phase 4 will put a number
+on that break-even point.
+
+## Setup overhead caveat
+
+Notable: setup (memcpy of 65 536 × 64 B per batch = 4 MiB pred restore)
+is 45 % of total wall-clock (2.274 s of 5.001 s). The subtraction step
+matters here more than for IDCT (where setup was ~9 %). The Phase 3
+capture validates that the subtraction is working: the kernel-only
+number is consistent across runs.
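One way to obtain a setup-subtracted number is to time the restore step and the kernel step separately inside each batch and accumulate the two totals independently. The sketch below is an assumption about the method, not a transcription of `tests/bench_neon_lpf.c`. `timed_bench`, `restore` and `fake_kernel` are hypothetical names, and the stand-in workload only mimics the 4 MiB restore size.

```python
import time

def timed_bench(setup, kernel, batches):
    """Time setup and kernel separately; return (kernel_s, setup_s)."""
    kernel_s = 0.0
    setup_s = 0.0
    for _ in range(batches):
        t0 = time.perf_counter()
        setup()                  # restore the input buffer for this batch
        t1 = time.perf_counter()
        kernel()                 # the work whose throughput is reported
        t2 = time.perf_counter()
        setup_s += t1 - t0
        kernel_s += t2 - t1
    return kernel_s, setup_s

def setup_fraction(kernel_s, setup_s):
    """Share of total wall-clock spent in setup."""
    return setup_s / (kernel_s + setup_s)

# Hypothetical stand-in workload: 65 536 edges x 64 B = 4 MiB per batch.
pristine = bytes(65_536 * 64)
work = bytearray(len(pristine))

def restore():
    work[:] = pristine           # plays the role of the memcpy restore

def fake_kernel():
    for i in range(0, len(work), 4096):
        work[i] ^= 1             # touch the buffer in place

kernel_s, setup_s = timed_bench(restore, fake_kernel, batches=8)

# Fraction implied by the two elapsed values reported in the raw output.
reported_fraction = setup_fraction(2.726785, 2.273954)
print(f"reported setup share: {reported_fraction:.1%}")
```

Keeping the two accumulators separate means the reported throughput does not depend on how expensive the restore is, which matters more as the kernel itself gets cheaper.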
+
+## Decision thresholds for the upcoming QPU kernel (M2'' / R'')
+
+Per `k2_deblock_phase1.md §"Decision rules"`, R'' = M2'' / M3'' bands:
+
+| R'' | Verdict | Implication |
+|---|---|---|
+| ≥ 1.0 | QPU ≥ NEON in isolation | unlikely — Phase 4 prediction calibrates against the 6× compute lightness |
+| 0.5 ≤ R'' < 1.0 | YELLOW: M4'' decides | the likeliest band, given that LPF is bandwidth-bound on a small working set |
+| 0.1 ≤ R'' < 0.5 | ORANGE: M4'' may still rescue | run M4'' anyway per cycle-1 calibration |
+| < 0.1 | RED: structural | Phase 9 close cycle 2 |
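The band table can be read as a simple threshold function. This is a minimal sketch, taking M2'' to be the QPU kernel's throughput in the same units as M3''. The function name and the `"QPU>=NEON"` label for the top band are illustrative, since the table gives that band no colour name.

```python
M3_MEDGE_PER_S = 48.285  # NEON baseline from this phase

def classify_r(r):
    """Map R'' = M2'' / M3'' onto the bands in the table above."""
    if r >= 1.0:
        return "QPU>=NEON"
    if r >= 0.5:
        return "YELLOW"
    if r >= 0.1:
        return "ORANGE"
    return "RED"

def classify_throughput(m2_medge_per_s):
    """Classify a measured QPU throughput (Medge/s) against M3''."""
    return classify_r(m2_medge_per_s / M3_MEDGE_PER_S)

# Band edges expressed as QPU throughput, for orientation.
for edge in (1.0, 0.5, 0.1):
    print(f"R'' = {edge:.1f}  <->  M2'' = {edge * M3_MEDGE_PER_S:.2f} Medge/s")
```

Under the cycle-1 calibration noted in the table, only the RED result closes the cycle outright; YELLOW and ORANGE both proceed to M4''.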
+
+Naive prediction for M2'': the IDCT cycle hit R = 0.92, but LPF's
+per-edge compute is much lighter than IDCT's per-block compute. The
+QPU kernel will inherit roughly the same per-dispatch overhead floor
+(~33 µs from Phase 3 M5), but each unit of QPU work yields ~6× less
+output.
+**Predicted R''_v1: 0.15–0.30 if the kernel is bandwidth/launch-bound,
+0.5+ if computation is hidden under dispatch/sync.** Phase 4 will
+sharpen this.
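To get a feel for how the ~33 µs dispatch floor interacts with batch size, a toy linear cost model can help. This is only an illustrative assumption (one dispatch per batch, QPU time = fixed overhead + edges × per-edge cost) and is not the Phase 4 prediction. `r_model` and its parameters are hypothetical names.

```python
DISPATCH_OVERHEAD_S = 33e-6   # per-dispatch floor quoted above
NEON_NS_PER_EDGE = 20.7       # M3'' per-edge time
EDGES_PER_FRAME = 64_530      # worst-case 1080p edge count

def r_model(qpu_ns_per_edge, edges_per_dispatch):
    """Toy R'': NEON time over (overhead + linear QPU time) for one batch."""
    neon_s = edges_per_dispatch * NEON_NS_PER_EDGE * 1e-9
    qpu_s = DISPATCH_OVERHEAD_S + edges_per_dispatch * qpu_ns_per_edge * 1e-9
    return neon_s / qpu_s

# How many edges NEON finishes in the time of one bare dispatch.
overhead_in_edges = DISPATCH_OVERHEAD_S / (NEON_NS_PER_EDGE * 1e-9)

# NEON time for one worst-case frame, in milliseconds.
neon_frame_ms = EDGES_PER_FRAME * NEON_NS_PER_EDGE * 1e-6

print(f"overhead ~ {overhead_in_edges:.0f} NEON edges")
print(f"NEON frame time ~ {neon_frame_ms:.3f} ms")
# Even with QPU per-edge cost equal to NEON's, small batches score poorly.
for n in (1_024, 8_192, EDGES_PER_FRAME):
    print(n, round(r_model(NEON_NS_PER_EDGE, n), 3))
```

In this model the overhead is a small share of a full-frame batch but dominates small batches, which is one reason to batch at frame level. Real behaviour will also depend on memory bandwidth and synchronisation, which the model ignores.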
+
+## What's not in this number
+
+- M3'' is single-core. Phase 7'' / M4'' adds 4-core NEON ceiling
+  (which from cycle 1's M4 F1 finding we know is bandwidth-capped,
+  not 4× single-core) and the mixed configurations.
+- Edge content distribution: the bench biases toward `fm`-passing
+  edges (different mean each side, small noise). Real bitstream
+  distributions may flip the fm-pass rate. Phase 7 may revisit.
+- The vertical variant (`ff_vp9_loop_filter_v_4_8_neon`) has a
+  different memory access pattern; throughput is expected to be
+  similar, to be confirmed in Phase 7.
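The `fm` and `hev` conditions mentioned in the edge-content bullet can be sketched as below. The formulas are written from memory of the VP9 loop filter as implemented in FFmpeg's `vp9dsp_template.c` and should be checked against that file. The generator `make_edge`, its parameters, and the threshold values `E`, `I`, `H` used here are hypothetical and are not the bench's actual RNG or settings.

```python
import random

def filter_mask(px, E, I):
    """fm: filter only if neighbouring differences stay under the limits."""
    p3, p2, p1, p0, q0, q1, q2, q3 = px
    return (abs(p3 - p2) <= I and abs(p2 - p1) <= I and
            abs(p1 - p0) <= I and abs(q1 - q0) <= I and
            abs(q2 - q1) <= I and abs(q3 - q2) <= I and
            abs(p0 - q0) * 2 + (abs(p1 - q1) >> 1) <= E)

def high_edge_variance(px, H):
    """hev: true when either side has a sharp step next to the edge."""
    p3, p2, p1, p0, q0, q1, q2, q3 = px
    return abs(p1 - p0) > H or abs(q1 - q0) > H

def make_edge(rng, step, noise):
    """Hypothetical biased row: different mean on each side, small noise."""
    left = rng.randint(32, 223 - step)
    right = left + step
    clamp = lambda v: max(0, min(255, v))
    return ([clamp(left + rng.randint(-noise, noise)) for _ in range(4)] +
            [clamp(right + rng.randint(-noise, noise)) for _ in range(4)])

rng = random.Random(0)
rows = [make_edge(rng, step=4, noise=2) for _ in range(10_000)]
fm_rate = sum(filter_mask(r, E=20, I=10) for r in rows) / len(rows)
hev_rate = sum(high_edge_variance(r, H=2) for r in rows) / len(rows)
print(f"fm pass rate {fm_rate:.3f}, hev rate {hev_rate:.3f}")
```

Varying `step` and `noise` shows how sensitive the pass rates are to the synthetic distribution, which is the concern the bullet raises about real bitstreams.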
+
+## Artifacts
+
+- `tests/vp9_lpf_ref.c` — standalone C reference (clean transcription
+  of vp9dsp_template.c:1780-1898, 4-tap inner only)
+- `tests/bench_neon_lpf.c` — M1''_c + M3'' bench
+- `external/ffmpeg-snapshot/libavcodec/aarch64/vp9lpf_neon.S` —
+  vendored at FFmpeg n7.1.3 commit f46e514 (SHA-256 in PROVENANCE.md)
+- `CMakeLists.txt` — adds `bench_neon_lpf` target with the LPF .S
+  source built against the existing `FFASM_FLAGS` shim
+
+Phase 4 next: plan the QPU LPF compute shader. The IDCT cycle's
+`phase4.md` is the template; constraints C1-C10 carry forward
+unchanged.