marfrit be7ff5587c Cycle 2 (deblocking) Phase 1-3: M3'' = 48.285 Medge/s baseline
Second kernel candidate per phase7_M4.md verdict "next-kernel cycle
authorised". VP9 4-tap inner loop filter, horizontal direction,
8-pixel edge (libavcodec ff_vp9_loop_filter_h_4_8_neon as baseline).
Different workload shape from IDCT - boundary streaming, lighter
compute per unit, per-row conditionals - tests whether QPU win
generalises.

docs/k2_deblock_phase1.md - goal-setting. Same R-band decision rules
as cycle 1 (phase1.md), with the cycle-1 calibration adjustment:
ORANGE band is no longer auto-close because M4 showed mixed > pure
CPU even at modest R when CPU bandwidth-saturates.

docs/k2_deblock_phase2.md - situation analysis. C reference already
in vendored snapshot (vp9dsp_template.c:1780-1898). Fetched
vp9lpf_neon.S fresh (1334 lines, LGPL-2.1+, FFmpeg n7.1.3 pin,
SHA-256 384e49e7...). PROVENANCE.md updated.

docs/k2_deblock_phase3.md - NEON baseline:

  M1''_c bit-exact     100.0000 % (10000 random edges)
  M3'' throughput      48.285 Medge/s  (20.7 ns/edge, single A76)
  per-frame 1080p-eq   748 FPS (worst case 64 530 edges/frame)
  cycles/edge          ~58 (=20.7ns x 2.8GHz), ~7 cycles/row

LPF is 5.9x faster per-unit than IDCT M3 (20.7 vs 122 ns), so the
QPU break-even point moves down. Predicted R''_v1 band ~0.5-0.9
- frame-level batching amortises the same 33us dispatch overhead;
workload becomes bandwidth-bound rather than compute-bound
(~5.7 MB/frame traffic at 64 530 edges x ~88 B per edge).

New artifacts:
- tests/vp9_lpf_ref.c    - standalone bit-exact C ref (8-bit, wd=4
                           inner only; clean transcription)
- tests/bench_neon_lpf.c - M1''_c gate + M3'' time-based bench
                           (5s window, edge-content-biased RNG for
                           realistic fm/hev hit rates)
- external/ffmpeg-snapshot/libavcodec/aarch64/vp9lpf_neon.S
- CMakeLists.txt updated with bench_neon_lpf target

Phase 4 next: plan the QPU LPF compute shader. Cycle 1's phase4.md
+ phase5.md + phase7.md learnings apply directly - bake in the v4
winning patterns from the start (WG=256, edges-per-subgroup
pattern adapted from blocks, uint8_t dst SSBO, oob flag, unrolled
writes).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 12:28:57 +00:00

---
cycle: 2
phase: 2
status: closed 2026-05-18
date_opened: 2026-05-18
parent: k2_deblock_phase1.md
target_kernel: VP9 loop filter h_4_8 (4-tap inner, 8-pixel horizontal-direction-on-vertical-edge)
---
# Cycle 2, Phase 2 — Loop filter situation analysis
## 1. Reference implementations
### 1.1 C reference (bit-exact gate)
- **Source**: `external/ffmpeg-snapshot/libavcodec/vp9dsp_template.c:1780-1898`
(already vendored; no additional fetch needed).
- **Function entry point**: `loop_filter_h_4_8_c` — generated by the macro
`lf_8_fn(h, 4, stride, 1)` at line 1892 + `lf_8_fns(4)` at 1900.
- **Signature**:
```c
void loop_filter_h_4_8_c(uint8_t *dst, ptrdiff_t stride,
int E, int I, int H);
```
- **Spec basis**: VP9 specification §8.8.1 (Loop filter process).
- **Algorithm (4-tap inner, the simplest path)**:
  1. For each of 8 rows along the edge (`i = 0..7, dst += stride`):
     1. Read 8 pixels straddling the edge: `p3, p2, p1, p0 | q0, q1, q2, q3`
        (4 each side at strideb=1 spacing).
     2. Compute `fm` (filter mask), the gate: if false, skip this row.
     3. Compute `hev` (high edge variance) test from `(p1 - p0)` and `(q1 - q0)`.
     4. If hev: write 2 pixels (`p0, q0`) with clipping.
        If !hev: write 4 pixels (`p1, p0, q0, q1`) with clipping.
  - All arithmetic is signed `int`; clipping via `av_clip_pixel` (8-bit → [0, 255]).
- Filter is **conditional per row**: `fm` may skip the row, and `hev` selects
between the 2-pixel and 4-pixel update. SIMD handles this shape well only
when divergence is rare; on real bitstreams it is frequent.
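
The per-row steps above can be sketched in plain C. This is a minimal 8-bit sketch written from the general shape of the VP9 4-tap filter, not a copy of the vendored template, so it should be checked against `vp9dsp_template.c` before being used as a gate. The name `lpf_h_4_8_sketch` and the helper names are hypothetical, and the code assumes `>>` on negative `int` is an arithmetic shift (true on GCC and Clang).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Clamp to the signed 8-bit range used for the filter value. */
static int clip_int8(int x) { return x < -128 ? -128 : x > 127 ? 127 : x; }

/* Clamp to [0, 255], the 8-bit pixel range. */
static uint8_t clip_pixel(int x) { return (uint8_t)(x < 0 ? 0 : x > 255 ? 255 : x); }

static int min_int(int a, int b) { return a < b ? a : b; }

/* Hypothetical name. Filters one 8-row edge; dst points at q0 of row 0. */
void lpf_h_4_8_sketch(uint8_t *dst, ptrdiff_t stride, int E, int I, int H)
{
    assert(dst != NULL);
    for (int i = 0; i < 8; i++, dst += stride) {
        int p3 = dst[-4], p2 = dst[-3], p1 = dst[-2], p0 = dst[-1];
        int q0 = dst[0],  q1 = dst[1],  q2 = dst[2],  q3 = dst[3];

        /* fm: every neighbour step within I and the edge step within E. */
        int fm = abs(p3 - p2) <= I && abs(p2 - p1) <= I && abs(p1 - p0) <= I &&
                 abs(q1 - q0) <= I && abs(q2 - q1) <= I && abs(q3 - q2) <= I &&
                 abs(p0 - q0) * 2 + (abs(p1 - q1) >> 1) <= E;
        if (!fm)
            continue;               /* row left untouched */

        int hev = abs(p1 - p0) > H || abs(q1 - q0) > H;
        int f = hev ? clip_int8(p1 - q1) : 0;
        f = clip_int8(3 * (q0 - p0) + f);
        int f1 = min_int(f + 4, 127) >> 3;   /* assumes arithmetic shift */
        int f2 = min_int(f + 3, 127) >> 3;

        dst[-1] = clip_pixel(p0 + f2);
        dst[0]  = clip_pixel(q0 - f1);
        if (!hev) {                 /* smooth area: also adjust p1 and q1 */
            int f3 = (f1 + 1) >> 1;
            dst[-2] = clip_pixel(p1 + f3);
            dst[1]  = clip_pixel(q1 - f3);
        }
    }
}
```

With `E=100, I=20, H=5`, a row `100 100 100 100 | 104 104 104 104` takes the 4-pixel path and becomes `100 100 101 101 | 102 103 104 104`, while a hard `0 | 255` step with `E=10` fails `fm` and is left as-is.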
### 1.2 NEON reference (M3'' baseline)
- **Source**: `external/ffmpeg-snapshot/libavcodec/aarch64/vp9lpf_neon.S`
(vendored 2026-05-18; SHA-256
`384e49e7a6e838d9e38aedc00838ed4aebfa6c5bdb343ecaf23ef639bc10fbb7`).
- **Symbol**: `ff_vp9_loop_filter_h_4_8_neon`
- **Signature** (same as C):
```c
void ff_vp9_loop_filter_h_4_8_neon(uint8_t *dst, ptrdiff_t stride,
int E, int I, int H);
```
Registers: `x0=dst, x1=stride, w2=E, w3=I, w4=H`.
- **Dependencies** (all already vendored):
- `libavutil/aarch64/asm.S` — `function`/`endfunc`/`movrel` macros
- `libavcodec/aarch64/neon.S` — `transpose_8x8B` / `transpose_4x8B`
- **Size**: ~40-60 instructions per export (after `.macro loop_filter` expansion).
Significantly simpler than the IDCT 8×8 (~270 inst, butterflies).
- **License**: LGPL-2.1-or-later (Google 2016, same as vp9itxfm_neon.S).
The vendored snapshot now covers cycle 1 + cycle 2 references with the
same FFmpeg n7.1.3 pin.
## 2. Workload model
Each call to `ff_vp9_loop_filter_h_4_8_neon` processes **one
8-pixel-tall edge** = 8 rows × 8 pixel-positions = 64 pixels touched
(but only a subset written depending on `fm`/`hev`).
For a 1920×1080 luma plane with VP9's 8×8-min-block partitioning, the
worst-case edge count is approximately:
- Vertical edges: (1920/8 - 1) × (1080/8) blocks-worth = 239 × 135 = 32 265 edges
- Horizontal edges: approximately the same, ~32 265 edges (the exact count by the same formula is 134 × 240 = 32 160)
- Total per frame: ~64 530 edges
Real bitstreams have fewer edges (larger blocks merge edges away).
Phase 4/7 may model a realistic edge count from a sample stream;
the Phase 3 baseline measures raw edges/sec.
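
The worst-case count can be reproduced with a few lines of C. This is a sketch under the same assumption as the text (every 8×8 block boundary inside the plane is filtered). The names `edge_count` and `worst_case_edges` are hypothetical. Applying the formula exactly gives a slightly smaller horizontal count than the vertical one, so ~64 530 is a slight over-estimate of the exact total.

```c
#include <assert.h>

/* Worst-case 8-pixel edge segments for one plane. */
typedef struct {
    long vertical;    /* segments on vertical block boundaries */
    long horizontal;  /* segments on horizontal block boundaries */
    long total;
} edge_count;

/* Hypothetical helper: assumes every interior 8x8 boundary is filtered. */
edge_count worst_case_edges(int width, int height)
{
    assert(width >= 8 && height >= 8);
    int cols = width / 8;   /* 8x8 blocks per row */
    int rows = height / 8;  /* 8x8 blocks per column */
    edge_count n;
    n.vertical = (long)(cols - 1) * rows;
    n.horizontal = (long)(rows - 1) * cols;
    n.total = n.vertical + n.horizontal;
    return n;
}
```

For 1920×1080 this yields 32 265 vertical and 32 160 horizontal segments, 64 425 in total.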
**Memory access shape**: per edge, read 8 rows of 8 pixels each
= 64 bytes (512 bits) worst case. Write 2-4 pixels per row
× 8 rows = 16-32 bytes. Per-edge read-modify-write footprint is
therefore ~80-96 bytes. Per-frame memory traffic (worst case, all edges
processed) ≈ 64 530 × 64 B ≈ 4.1 MB read + 64 530 × 32 B ≈ 2.1 MB
written = ~6.2 MB/frame, *the same order as IDCT's 8 MB/frame*. Bandwidth
prediction transfers.
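
As a rough cross-check of the traffic estimate, the per-frame figure is just the edge count times the per-edge bytes. The sketch below uses the per-edge numbers quoted above (64 B read, 16 to 32 B written); the function name `frame_traffic_bytes` is hypothetical, and the model ignores cache-line granularity.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: bytes moved per frame for a given edge count. */
uint64_t frame_traffic_bytes(uint64_t edges,
                             unsigned read_per_edge,
                             unsigned written_per_edge)
{
    assert(edges > 0);
    return edges * ((uint64_t)read_per_edge + written_per_edge);
}
```

`frame_traffic_bytes(64530, 64, 32)` gives 6 194 880 bytes for the all-rows-write-4-pixels case, and `frame_traffic_bytes(64530, 64, 16)` gives 5 162 400 bytes when every row takes the 2-pixel path.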
## 3. Per-edge workload diversity (vs IDCT)
| | IDCT 8×8 | LPF h_4_8 |
|---|---|---|
| Per-block math | Heavy: 30 ops × 2 passes per block | Light: ~10-20 ops per row × 8 rows = 80-160 ops per edge |
| Per-block memory | 256B in (coeffs) + 64B in (pred) + 64B out | 64B in + 16-32B out per edge |
| Parallelism | Fully data-parallel, no conditionals | Per-row conditionals (`fm`, `hev`) cause divergence |
| Compute / memory | High | Low (memory-bound) |
| Predicted v3d fit | "good" — fits the SMUL24 + Q14 shape | "marginal" — divergence cost, lighter compute |
The LPF kernel is **deliberately a different workload class** so we
test whether v3d wins generalise.
## 4. Constraints carried from cycle 1
All cycle-1 V3D 7.1 device limits (Phase 0 §2) apply unchanged.
Specifically:
- C2 shared mem ≤ 16 KiB — LPF needs even less than IDCT (no
intermediate transposed scratch)
- C3 ≤ 8 SSBO bindings — LPF needs only 2 (dst, edge_meta)
- C5 SMUL24 — covers the small constants in clip/abs
- shaderInt8 = false — uint8_t writes via storageBuffer8BitAccess
(same race-safe pattern as cycle 1)
## 5. What Phase 2 does *not* close
- Per-edge meta layout (E/I/H thresholds as packed u32 per edge, or
uniform across all edges?). Phase 4 picks. For Phase 3 NEON
baseline, we use the same thresholds for every edge to simplify.
- Divergence handling: NEON's hand-tuned LPF predicates per lane;
the QPU shader must either predicate too (some lanes
idle when `fm` fails) or always execute (writing back unchanged
pixels when `fm` fails). Phase 4 picks.
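- Vertical vs horizontal: Phase 1 picked `h_4_8`. The `v_4_8`
variant has a different memory access shape (its taps run down a
column, a stride apart, with the 8 filtered positions contiguous
along a row, whereas `h_4_8` has contiguous taps and rows a stride
apart) and would be a useful comparator in Phase 7.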
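
To make the "always-execute" option concrete, here is a scalar C sketch of one row in which `fm` and `hev` become masks and all four stores happen unconditionally. It illustrates the idea only and is not the planned shader: the name `lpf_row_always_execute` is hypothetical, the row is passed as `p3..q3` in `px[0..7]`, and the code assumes two's-complement `int` with arithmetic `>>`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

static int clip_int8(int x) { return x < -128 ? -128 : x > 127 ? 127 : x; }
static uint8_t clip_pixel(int x) { return (uint8_t)(x < 0 ? 0 : x > 255 ? 255 : x); }
static int min_int(int a, int b) { return a < b ? a : b; }

/* Hypothetical name. px[0..7] = p3 p2 p1 p0 q0 q1 q2 q3 of one row. */
void lpf_row_always_execute(uint8_t px[8], int E, int I, int H)
{
    assert(px != NULL);
    int p3 = px[0], p2 = px[1], p1 = px[2], p0 = px[3];
    int q0 = px[4], q1 = px[5], q2 = px[6], q3 = px[7];

    /* Conditions evaluated with '&' so each is 0 or 1 without short-circuit. */
    int fm = (abs(p3 - p2) <= I) & (abs(p2 - p1) <= I) & (abs(p1 - p0) <= I) &
             (abs(q1 - q0) <= I) & (abs(q2 - q1) <= I) & (abs(q3 - q2) <= I) &
             (abs(p0 - q0) * 2 + (abs(p1 - q1) >> 1) <= E);
    int hev = (abs(p1 - p0) > H) | (abs(q1 - q0) > H);

    int m_inner = -fm;           /* all ones when p0/q0 are updated */
    int m_outer = -(fm & !hev);  /* all ones when p1/q1 are updated too */

    /* The p1 - q1 term only contributes on the hev path. */
    int f  = clip_int8(3 * (q0 - p0) + (clip_int8(p1 - q1) & -hev));
    int f1 = min_int(f + 4, 127) >> 3;
    int f2 = min_int(f + 3, 127) >> 3;
    int f3 = (f1 + 1) >> 1;

    /* Masked deltas are zero when the condition fails, so the store
       writes the original pixel back. */
    px[3] = clip_pixel(p0 + (f2 & m_inner));
    px[4] = clip_pixel(q0 - (f1 & m_inner));
    px[2] = clip_pixel(p1 + (f3 & m_outer));
    px[5] = clip_pixel(q1 - (f3 & m_outer));
}
```

The trade-off is the one named in the bullet: every lane does the full arithmetic and four stores regardless of `fm`, in exchange for no control-flow divergence. Whether that wins on the QPU is for Phase 4 to measure.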
Phase 3 next: build `tests/bench_neon_lpf.c` (clone of
`bench_neon_idct.c` shape, swap kernel) and capture M3'' baseline.
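
A bit-exact gate of the kind Phase 3 needs could look roughly like the sketch below: run a reference and a candidate kernel on identical random 8×8 neighbourhoods and count differing edges. This is an assumption about the harness shape, not the contents of `tests/bench_neon_lpf.c`. The names `lpf_fn`, `count_mismatching_edges`, `lpf_noop` and `lpf_flip_q0` are hypothetical, the thresholds `40/10/2` are arbitrary fixed values, and the pixels are uniformly random rather than edge-content-biased.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Signature shared by the C and NEON entry points. */
typedef void (*lpf_fn)(uint8_t *dst, ptrdiff_t stride, int E, int I, int H);

enum { EDGE_W = 8, EDGE_H = 8 };

/* xorshift32, so a run is reproducible from its (non-zero) seed. */
static uint32_t next_u32(uint32_t *s)
{
    *s ^= *s << 13;
    *s ^= *s >> 17;
    *s ^= *s << 5;
    return *s;
}

/* Returns how many of n_edges random edges differ between ref and cand. */
int count_mismatching_edges(lpf_fn ref, lpf_fn cand, int n_edges, uint32_t seed)
{
    assert(ref != NULL && cand != NULL && seed != 0);
    int mismatches = 0;
    for (int n = 0; n < n_edges; n++) {
        uint8_t a[EDGE_W * EDGE_H], b[EDGE_W * EDGE_H];
        for (int i = 0; i < EDGE_W * EDGE_H; i++)
            a[i] = (uint8_t)(next_u32(&seed) >> 24);
        memcpy(b, a, sizeof a);
        /* dst points at q0, four pixels into each row. */
        ref(a + 4, EDGE_W, 40, 10, 2);
        cand(b + 4, EDGE_W, 40, 10, 2);
        mismatches += memcmp(a, b, sizeof a) != 0;
    }
    return mismatches;
}

/* Stand-in kernels, present only to exercise the harness. */
void lpf_noop(uint8_t *dst, ptrdiff_t stride, int E, int I, int H)
{
    (void)dst; (void)stride; (void)E; (void)I; (void)H;
}

void lpf_flip_q0(uint8_t *dst, ptrdiff_t stride, int E, int I, int H)
{
    (void)E; (void)I; (void)H;
    for (int i = 0; i < EDGE_H; i++)
        dst[i * stride] ^= 1;   /* deliberately wrong on every row */
}
```

In a real gate, `ref` would be the C reference and `cand` the NEON symbol; a pass is a return value of 0 over the full edge set.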