# Issue 002: VP9 LPF vertical variants (v_4_8 / v_8_8)

**Status**: open, not blocking
**Type**: kernel-cycle (cycle 5/6 candidate)
**Predicted verdict**: similar to horizontal cousins (k2/k4 = YELLOW PASS)
**Priority**: low (different memory pattern; completeness)
**Filed**: 2026-05-18

## Background

Cycles 2 and 4 implemented the **horizontal-direction** LPF inner filters (`h_4_8`, `h_8_8`). The corresponding **vertical-direction** filters (`v_4_8`, `v_8_8`) have the same arithmetic but a different memory access pattern: column-strided reads of 8 pixels (one per row) vs row-strided reads of 8 pixels (one per column).

Concretely, from `vp9dsp_template.c`:

- `h_*_*_neon`: stridea=stride, strideb=1 (advance rows, neighborhood in cols)
- `v_*_*_neon`: stridea=1, strideb=stride (advance cols, neighborhood in rows)

The vertical variant tests whether the QPU's "8 lanes per row, contiguous read" assumption (cycles 2/4, wd=4/wd=8) generalises to the strided memory pattern. The TMU's coalescing behaviour may differ significantly when 8 lanes need to load from 8 different rows of the same column (prone to cache-line misses) vs 8 different cols of the same row (sequential).

## What to do

Cycle 5 or 6 (after CDEF), one cycle per variant:

1. **v_4_8**: vertical 4-tap inner, 8-pixel edge (filtering across a horizontal edge; the taps span the rows above and below it).
2. Optional **v_8_8**: vertical 8-tap inner.

Each cycle has the same shape as cycles 2/4, except:

- C reference: same `loop_filter` function, instantiated via `lf_8_fn(v, 4, 1, stride)` (note: stridea and strideb are swapped relative to the `h` instantiation).
- NEON: `ff_vp9_loop_filter_v_4_8_neon` (in vendored `vp9lpf_neon.S`).
- QPU geometry: same 32-edges/WG, but the per-edge memory access shape changes. Lanes now span 8 rows (strided by stride) of one column.

## Key question to answer

**Does the QPU's mixed-mode +6.9 % win (cycle 2 wd=4 horizontal) hold for the vertical variant?**

The TMU latency / cache behaviour on column-strided reads is the main unknown.
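To make the stride swap and the coalescing concern concrete, here is a minimal host-side sketch in C. It is not part of the kernel and does not model the real TMU. The names `tap_offset`, `lines_touched` and `CACHE_LINE_BYTES` are hypothetical, and the 64-byte line size and one-byte pixels are assumptions that should be checked against the device profile.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: 64-byte cache lines. Replace with the line size from the
 * device profile; the real TMU may coalesce reads differently. */
#define CACHE_LINE_BYTES 64

/* Byte offset of tap k (negative = p side, non-negative = q side) for edge
 * pixel i. This mirrors the stridea/strideb convention: advance by stridea
 * per edge pixel, index the taps by strideb. Assumes one byte per pixel. */
static ptrdiff_t tap_offset(int i, int k, ptrdiff_t stridea, ptrdiff_t strideb)
{
    return i * stridea + k * strideb;
}

/* Number of distinct cache lines hit when each lane reads one byte and
 * lane n reads at byte address base + n * lane_step. */
static int lines_touched(size_t base, size_t lane_step, int lanes)
{
    size_t seen[16];
    int count = 0;

    assert(lanes > 0 && lanes <= 16);
    for (int lane = 0; lane < lanes; lane++) {
        size_t line = (base + (size_t)lane * lane_step) / CACHE_LINE_BYTES;
        int dup = 0;

        /* Linear scan is enough for at most 16 lanes. */
        for (int j = 0; j < count; j++) {
            if (seen[j] == line)
                dup = 1;
        }
        if (!dup)
            seen[count++] = line;
    }
    return count;
}
```

With an illustrative stride of 1920 bytes, `tap_offset(1, -1, 1920, 1)` is 1919 for the `h` convention and `tap_offset(1, -1, 1, 1920)` is -1919 for the `v` convention. Under the same assumptions, `lines_touched(0, 1, 8)` is 1 (eight lanes along one row) while `lines_touched(0, 1920, 8)` is 8 (eight lanes down one column). This is only an upper-level count of lines touched; whether it translates into a measurable TMU penalty is exactly what the cycle has to measure.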
- If positive: deployment recipe gains v variants symmetrically.
- If negative: deployment recipe needs to split by orientation (h on QPU, v on CPU).

## Expected outcome

| Quantity | Predicted |
|---|---|
| M1 bit-exact | 100 % |
| M3 NEON | similar to h (NEON handles both orientations well) |
| M2 QPU isolation | possibly LOWER than h variant (TMU column reads less coalesced) |
| R isolation | 0.30-0.45 (ORANGE) |
| M4 mixed | UNKNOWN (this is the load-bearing experiment) |

## Acceptance criteria

- v_4_8 cycle 1-7 complete with M4 measurement
- Decision: "v variants → QPU same as h" OR "v variants → CPU only"
- Deployment recipe updated
- Optional: v_8_8 follow-on cycle if v_4_8 was positive

## Why deferred

- Out of cycle 4's compressed scope (cycle 4 was a focused wd=4 → wd=8 extension)
- User-stated cycle 5 direction was CDEF (AV1 coverage), not VP9 variant completeness

## Related

- `docs/k2_deblock_phase4.md §"3. Workgroup geometry"` discusses the 32-edges-per-WG mapping that needs revisiting for the v variant
- `external/ffmpeg-snapshot/libavcodec/aarch64/vp9lpf_neon.S`: NEON refs already vendored for both v_4_8 and v_8_8
- `phase0.md §2` device profile: TMU read patterns relevant for the column-strided question