Per user direction at cycle-4 close: file wd=16 (trend prediction validation) and vertical variants (column-stride TMU behaviour unknown) as local issues for future cycles. Progress instead to CDEF (AV1) for codec breadth.

docs/issues/001: wd=16 prediction validation. Per cycle 4 lesson 4, trend says wd=16 likely flips M4 negative. Quick incremental cycle when revisited.

docs/issues/002: vertical variants. Different memory access pattern (column-strided vs row-strided). The load-bearing unknown is whether the cycle 2 +6.9% mixed gain survives the TMU coalescing shift. If positive, deployment recipe gains symmetry; if negative, must split by orientation.

Both issues have acceptance criteria + expected outcomes documented. Cycle 5 next: CDEF (AV1), codec-breadth expansion.

No Gitea repo exists for daedalus-fourier yet (project is local-only). If a tracker is wanted, create the repo and migrate these .md files. For now they live in-tree as part of the project history.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Issue 002 — VP9 LPF vertical variants (v_4_8 / v_8_8)
Status: open, not blocking
Type: kernel-cycle (cycle 5/6 candidate)
Predicted verdict: similar to horizontal cousins (k2/k4 = YELLOW PASS)
Priority: low (different memory pattern; completeness)
Filed: 2026-05-18
Background
Cycles 2 and 4 implemented the horizontal-direction LPF inner
filters (h_4_8, h_8_8). The corresponding vertical-direction
filters (v_4_8, v_8_8) have the same arithmetic but a different
memory access pattern: column-strided reads of 8 pixels (one per row)
vs row-strided reads of 8 pixels (one per column).
Concretely from vp9dsp_template.c:
- `h_*_*_neon`: stridea=stride, strideb=1 (advance rows, neighborhood in cols)
- `v_*_*_neon`: stridea=1, strideb=stride (advance cols, neighborhood in rows)
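The stridea/strideb pairing above can be made concrete with a small address sketch. This is an illustration only: `tap_offset` and `tap_offsets` are hypothetical helpers, not FFmpeg functions, and the 8-tap window `k = -4..3` is an assumption based on the "reads of 8 pixels" description.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of FFmpeg: offset (in pixels) from the
 * edge origin of tap k for the i-th pixel along the edge.
 *   stridea: step between successive pixels along the edge
 *   strideb: step between neighbouring taps across the edge
 * Assumed window: taps k = -4..-1 on one side of the edge, 0..3 on the other. */
static ptrdiff_t tap_offset(int i, int k, ptrdiff_t stridea, ptrdiff_t strideb)
{
    assert(i >= 0 && i < 8);
    assert(k >= -4 && k < 4);
    return (ptrdiff_t)i * stridea + (ptrdiff_t)k * strideb;
}

/* Fill out[8] with the eight tap offsets of pixel i. */
static void tap_offsets(int i, ptrdiff_t stridea, ptrdiff_t strideb,
                        ptrdiff_t out[8])
{
    for (int k = -4; k < 4; k++)
        out[k + 4] = tap_offset(i, k, stridea, strideb);
}
```

With an assumed row pitch of 64 pixels, the h pairing (stridea=64, strideb=1) puts the taps of pixel 1 at the contiguous offsets 60..67, while the v pairing (stridea=1, strideb=64) puts them at -255, -191, ..., 193, one full row apart.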
The vertical variant tests whether the QPU's "8 lanes per row, contiguous read" assumption (cycles 2/4 wd=4/wd=8) generalises to the strided memory pattern. The TMU's coalescing behaviour may differ significantly when 8 lanes load from 8 different rows of the same column (each read is likely to land in a different cache line) rather than from 8 different columns of the same row (sequential addresses).
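A rough way to see why the strided pattern is the concern is to count how many distinct cache lines one 8-lane read touches. The sketch below models the address footprint only, not real TMU behaviour. The helper name `lines_touched` and the 64-byte line size used in the examples are assumptions; the real cache geometry should come from the device profile.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LANES 8

/* Count the distinct cache lines touched by LANES byte addresses of the
 * form base + lane * lane_step. line_bytes is an assumed parameter of
 * this sketch, not a measured property of the TMU. */
static int lines_touched(uintptr_t base, size_t lane_step, size_t line_bytes)
{
    uintptr_t seen[LANES];
    int n = 0;

    assert(line_bytes > 0);
    for (int lane = 0; lane < LANES; lane++) {
        uintptr_t line = (base + (uintptr_t)lane * lane_step) / line_bytes;
        int dup = 0;
        for (int j = 0; j < n; j++) {
            if (seen[j] == line) {
                dup = 1;
                break;
            }
        }
        if (!dup)
            seen[n++] = line;
    }
    return n;
}
```

Under these assumptions, a contiguous read (`lane_step = 1`) from an aligned base touches a single 64-byte line, whereas a column read with `lane_step` equal to a 1920-byte row pitch touches eight. Whether that footprint difference translates into a measurable TMU penalty is exactly what the cycle has to establish.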
What to do
Cycle 5 or 6 (after CDEF), one cycle per variant:
- v_4_8: vertical 4-tap inner, 8-pixel edge (the filter taps span rows above/below, so the edge being crossed is horizontal).
- Optional v_8_8: vertical 8-tap inner.
Each cycle: same shape as cycle 2/4 but
- C reference: same `loop_filter` function, instantiated via `lf_8_fn(v, 4, 1, stride)` (note: stridea + strideb swapped).
- NEON: `ff_vp9_loop_filter_v_4_8_neon` (in vendored `vp9lpf_neon.S`).
- QPU geometry: same 32-edges/WG, but per-edge memory access shape changes: lanes now span 8 rows (strided by `stride`) of one column.
Key question to answer
Does the QPU's mixed-mode +6.9 % win (cycle 2 wd=4 horizontal) hold for the vertical variant? The TMU latency / cache behaviour on column-strided reads is the main unknown. If positive: deployment recipe gains v variants symmetrically. If negative: deployment recipe needs to split by orientation (h on QPU, v on CPU).
Expected outcome
| Quantity | Predicted |
|---|---|
| M1 bit-exact | 100 % |
| M3 NEON | similar to h (NEON handles both orientations well) |
| M2 QPU isolation | possibly LOWER than h variant (TMU column reads less coalesced) |
| R isolation | 0.30-0.45 (ORANGE) |
| M4 mixed | UNKNOWN — this is the load-bearing experiment |
Acceptance criteria
- v_4_8 cycle 1-7 complete with M4 measurement
- Decision: "v variants → QPU same as h" OR "v variants → CPU only"
- Deployment recipe updated
- Optional: v_8_8 follow-on cycle if v_4_8 was positive
Why deferred
- Out of cycle 4's compressed scope (cycle 4 was a focused wd=4 → wd=8 extension)
- User-stated cycle 5 direction was CDEF (AV1 coverage), not VP9 variant completeness
Related
- `docs/k2_deblock_phase4.md` §"3. Workgroup geometry" discusses the 32-edges-per-WG mapping that needs revisiting for the v variant
- `external/ffmpeg-snapshot/libavcodec/aarch64/vp9lpf_neon.S`: NEON refs already vendored for both v_4_8 and v_8_8
- `phase0.md` §2 device profile: TMU read patterns relevant for the column-strided question