
marfrit 20e3d004ae Issues 001+002: defer LPF wd=16 + LPF vertical variants

Per user direction at cycle-4 close: file wd=16 (trend prediction
validation) and vertical variants (column-stride TMU behaviour
unknown) as local issues for future cycles. Progress instead to
CDEF (AV1) for codec breadth.

docs/issues/001 — wd=16 prediction validation. Per cycle 4 lesson 4,
trend says wd=16 likely flips M4 negative. Quick incremental cycle
when revisited.

docs/issues/002 — vertical variants. Different memory access pattern
(column-strided vs row-strided). The load-bearing unknown is
whether the cycle 2 +6.9% mixed gain survives the TMU coalescing
shift. If positive, deployment recipe gains symmetry; if negative,
must split by orientation.

Both issues have acceptance criteria + expected outcomes documented.

Cycle 5 next: CDEF (AV1) — codec-breadth expansion.

No Gitea repo exists for daedalus-fourier yet (project is local-
only). If a tracker is wanted, create the repo and migrate these
.md files. For now they live in-tree as part of the project history.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-18 13:09:51 +00:00


Issue 002 — VP9 LPF vertical variants (v_4_8 / v_8_8)

Status: open, not blocking
Type: kernel-cycle (cycle 5/6 candidate)
Predicted verdict: similar to horizontal cousins (k2/k4 = YELLOW PASS)
Priority: low (different memory pattern; completeness)
Filed: 2026-05-18

Background

Cycles 2 and 4 implemented the horizontal-direction LPF inner filters (h_4_8, h_8_8). The corresponding vertical-direction filters (v_4_8, v_8_8) have the same arithmetic but a different memory access pattern: column-strided reads of 8 pixels (one per row) vs row-strided reads of 8 pixels (one per column).

Concretely from vp9dsp_template.c:

h_*_*_neon: stridea=stride, strideb=1 (advance rows, neighborhood in cols)
v_*_*_neon: stridea=1, strideb=stride (advance cols, neighborhood in rows)
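
As a minimal sketch of what the stride swap means for addressing, assuming an 8-bit plane and taps indexed k = -4..3 (p3..q3) around the edge: `tap_offset` is a hypothetical helper written for illustration, not a function from vp9dsp_template.c.

```c
#include <stddef.h>

/* Hypothetical helper (not from vp9dsp_template.c): offset of tap k
 * (k = -4..3, i.e. p3..q3) at edge position i (0..7), relative to dst.
 * stridea steps along the edge; strideb steps across it. */
static ptrdiff_t tap_offset(int i, int k, ptrdiff_t stridea, ptrdiff_t strideb)
{
    return (ptrdiff_t)i * stridea + (ptrdiff_t)k * strideb;
}
```

With a 64-byte stride, the h instantiation (stridea=64, strideb=1) places adjacent taps 1 byte apart within a row, while the v instantiation (stridea=1, strideb=64) places them 64 bytes apart, one per row.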

The vertical variant tests whether the QPU's "8 lanes per row, contiguous read" assumption (cycles 2/4, wd=4/wd=8) generalises to a strided memory pattern. The TMU's coalescing behaviour may differ significantly when 8 lanes load from 8 different rows of the same column (prone to cache-line misses) than when they load from 8 different cols of the same row (sequential).
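
The coalescing concern can be illustrated with a toy model that counts how many distinct cache lines 8 lanes touch for a given lane-to-lane step. This is only a sketch: `lines_touched` is a hypothetical helper, and `line_bytes` is an assumed parameter, not a measured TMU property. Real TMU behaviour is exactly what this issue proposes to measure.

```c
#include <stddef.h>
#include <stdint.h>

#define LANES 8

/* Toy model: count distinct cache lines touched when LANES lanes each load
 * one byte at base + lane * lane_step. line_bytes is an ASSUMED line size. */
static int lines_touched(uintptr_t base, size_t lane_step, size_t line_bytes)
{
    uintptr_t seen[LANES];
    int n = 0;

    for (int lane = 0; lane < LANES; lane++) {
        uintptr_t line = (base + lane * lane_step) / line_bytes;
        int dup = 0;

        /* Linear scan is fine for 8 entries. */
        for (int j = 0; j < n; j++)
            if (seen[j] == line)
                dup = 1;
        if (!dup)
            seen[n++] = line;
    }
    return n;
}
```

Under this model, a lane step of 1 byte (lanes across one row) from a line-aligned base touches a single line, while a lane step equal to a stride of at least `line_bytes` (lanes down one column) touches one line per lane.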

What to do

Cycle 5 or 6 (after CDEF), one cycle per variant:

v_4_8 — vertical 4-tap inner, 8-pixel edge (vertical edge, filter spans rows above/below).
Optional v_8_8 — vertical 8-tap inner.

Each cycle has the same shape as cycles 2/4, except:

C reference: same loop_filter function, instantiated via lf_8_fn(v, 4, 1, stride) (note: stridea + strideb swapped).
NEON: ff_vp9_loop_filter_v_4_8_neon (in vendored vp9lpf_neon.S).
QPU geometry: same 32-edges/WG, but per-edge memory access shape changes — lanes now span 8 rows (strided by stride) of one column.

Key question to answer

Does the QPU's mixed-mode +6.9 % win (cycle 2 wd=4 horizontal) hold for the vertical variant? The TMU latency / cache behaviour on column-strided reads is the main unknown. If positive: deployment recipe gains v variants symmetrically. If negative: deployment recipe needs to split by orientation (h on QPU, v on CPU).
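
The two outcomes can be written down as a small decision rule. This is a sketch only: the enum and function names are hypothetical, and treating any gain above zero as "positive" is an assumption, since the document does not state a threshold.

```c
/* Hypothetical names; the issue only specifies the two outcomes. */
enum lpf_target { LPF_ON_QPU, LPF_ON_CPU };

/* m4_gain_pct: signed M4 mixed-mode result in percent, positive = win.
 * The 0.0 threshold is an assumption, not a documented criterion. */
static enum lpf_target choose_v_target(double m4_gain_pct)
{
    return m4_gain_pct > 0.0 ? LPF_ON_QPU : LPF_ON_CPU;
}
```

A result like the horizontal wd=4 figure (+6.9) would map to `LPF_ON_QPU`; a zero or negative result would keep the v variants on the CPU.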

Expected outcome

Quantity	Predicted
M1 bit-exact	100 %
M3 NEON	similar to h (NEON handles both orientations well)
M2 QPU isolation	possibly LOWER than h variant (TMU column reads less coalesced)
R isolation	0.30-0.45 (ORANGE)
M4 mixed	UNKNOWN — this is the load-bearing experiment

Acceptance criteria

v_4_8 cycle 1-7 complete with M4 measurement
Decision: "v variants → QPU same as h" OR "v variants → CPU only"
Deployment recipe updated
Optional: v_8_8 follow-on cycle if v_4_8 was positive

Why deferred

Out of cycle 4's compressed scope (cycle 4 was a focused wd=4 → wd=8 extension)
User-stated cycle 5 direction was CDEF (AV1 coverage), not VP9 variant completeness

docs/k2_deblock_phase4.md §"3. Workgroup geometry" discusses the 32-edges-per-WG mapping that needs revisiting for v variant
external/ffmpeg-snapshot/libavcodec/aarch64/vp9lpf_neon.S — NEON refs already vendored for both v_4_8 and v_8_8
phase0.md §2 device profile — TMU read patterns relevant for the column-strided question
