--- cycle: 4 phases: 1-3 (combined doc — straight extension of cycle 2) status: phase 3 in progress date_opened: 2026-05-18 parent_cycle: k3_mc_phase7.md target_kernel: VP9 loop filter wd=8 inner-edge horizontal (h_8_8) --- # Cycle 4, Phases 1-3 — LPF wd=8 Compact combined doc — cycle 4 is a *width extension* of cycle 2 (same kernel family, same shape, same NEON file). ## Phase 1 — goal **Kernel**: VP9 loop filter, 8-tap inner-edge variant (wd=8), horizontal direction, 8-pixel edge. libavcodec symbol `ff_vp9_loop_filter_h_8_8_neon` (already in vendored `vp9lpf_neon.S`). **Why this kernel**: completes VP9 LPF coverage alongside cycle 2's wd=4. The wd=8 path adds the `flat8in` test (6 abs comparisons) and a 6-pixel "flat region" write path — meaningfully more conditional branches than wd=4 within the same kernel family. **Measurable success** (cycle-4 numbering, `''''` superscript): | ID | Measurement | Gate | |---|---|---| | M1'''' | Bit-exact vs C reference | 100.0000 % | | M2'''' | QPU throughput Medge/s | recorded | | M3'''' | NEON `ff_vp9_loop_filter_h_8_8_neon` Medge/s | recorded | | M4'''' | Mixed NEON-3 + QPU vs pure NEON-4 (Medge/s) | recorded if YELLOW | Same R bands + 30fps-floor calibration as cycles 2/3. **Predicted R''''**: 0.3–0.5. Cycle 2 LPF wd=4 hit R=0.41; wd=8 adds ~20 % more conditional logic (flat8in test) and additional writes when flat8in passes. Likely modestly worse R than wd=4. The 6-write flat8in path under SIMD divergence may dominate. ## Phase 2 — situation C reference: `external/ffmpeg-snapshot/libavcodec/vp9dsp_template.c`, the same `loop_filter()` function (lines 1780-1898) used in cycle 2 but invoked with wd=8 via the `lf_8_fn(h, 8, stride, 1)` macro instantiation. The wd=8 path activates the `if (wd >= 8 && flat8in)` branch. NEON reference: already vendored at `external/ffmpeg-snapshot/libavcodec/aarch64/vp9lpf_neon.S`, symbol `ff_vp9_loop_filter_h_8_8_neon`. Same calling convention as wd=4: `(uint8_t *dst, ptrdiff_t stride, int E, int I, int H)`. No new vendored sources needed. **Workload model per edge (worst case, flat8in passes):** - 8 rows × 6 written + 2 unwritten = 48 writes per edge (vs wd=4's 16-32) - 8 rows × 8 reads = 64 reads (same as wd=4) - ~12 abs+compares per row × 8 = ~96 per edge (vs wd=4's ~50) Memory traffic similar to cycle 2 (~80-110 bytes per edge). Compute moderately higher (more conditional branches + more writes when flat8in fires). ## Phase 3 — NEON M3'''' baseline (captured below after build + run)