daedalus-fourier/docs/k2_deblock_phase1.md
marfrit be7ff5587c Cycle 2 (deblocking) Phase 1-3: M3'' = 48.285 Medge/s baseline
Second kernel candidate per phase7_M4.md verdict "next-kernel cycle
authorised". VP9 4-tap inner loop filter, horizontal direction,
8-pixel edge (libavcodec ff_vp9_loop_filter_h_4_8_neon as baseline).
Different workload shape from IDCT - boundary streaming, lighter
compute per unit, per-row conditionals - tests whether QPU win
generalises.

docs/k2_deblock_phase1.md - goal-setting. Same R-band decision rules
as cycle 1 (phase1.md), with the cycle-1 calibration adjustment:
ORANGE band is no longer auto-close because M4 showed mixed > pure
CPU even at modest R when CPU bandwidth-saturates.

docs/k2_deblock_phase2.md - situation analysis. C reference already
in vendored snapshot (vp9dsp_template.c:1780-1898). Fetched
vp9lpf_neon.S fresh (1334 lines, LGPL-2.1+, FFmpeg n7.1.3 pin,
SHA-256 384e49e7...). PROVENANCE.md updated.

docs/k2_deblock_phase3.md - NEON baseline:

  M1''_c bit-exact     100.0000 % (10000 random edges)
  M3'' throughput      48.285 Medge/s  (20.7 ns/edge, single A76)
  per-frame 1080p-eq   748 FPS (worst case 64 530 edges/frame)
  cycles/edge          ~58 (=20.7ns x 2.8GHz), ~7 cycles/row

LPF is 5.9x faster per-unit than IDCT M3 (20.7 vs 122 ns), so the
QPU break-even point moves down. Predicted R''_v1 band ~0.5-0.9
- frame-level batching amortises the same 33us dispatch overhead;
workload becomes bandwidth-bound rather than compute-bound
(~5.7 MB/frame traffic at 64 530 edges x ~88 B per edge).

New artifacts:
- tests/vp9_lpf_ref.c    - standalone bit-exact C ref (8-bit, wd=4
                           inner only; clean transcription)
- tests/bench_neon_lpf.c - M1''_c gate + M3'' time-based bench
                           (5s window, edge-content-biased RNG for
                           realistic fm/hev hit rates)
- external/ffmpeg-snapshot/libavcodec/aarch64/vp9lpf_neon.S
- CMakeLists.txt updated with bench_neon_lpf target

Phase 4 next: plan the QPU LPF compute shader. Cycle 1's phase4.md
+ phase5.md + phase7.md learnings apply directly - bake in the v4
winning patterns from the start (WG=256, edges-per-subgroup
pattern adapted from blocks, uint8_t dst SSBO, oob flag, unrolled
writes).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 12:28:57 +00:00

---
cycle: 2
phase: 1
status: open
date_opened: 2026-05-18
parent_cycle1: phase9 (lessons distilled inline below)
target_kernel: VP9 loop filter — 4-tap inner-edge variant (horizontal direction, 8-pixel boundary)
dev_host: hertz
---
# Cycle 2, Phase 1 — Loop filter kernel goal
Cycle 1 (8×8 IDCT) closed with `phase7_M4.md` verdict GO. Per
cycle-1 `phase1.md` §"Decision rules", the next-kernel cycle is authorised.
This doc is compact; it references cycle-1 phase docs for the
substrate framework rather than re-deriving it.
## Why deblocking, why this variant
Three candidates were on the table from `phase0.md §5`:
| candidate | covers | shape | why pick / skip |
|---|---|---|---|
| **VP9 loop filter (4-tap inner)** | **VP9 + AV1** (similar) | boundary streaming | **Picked.** Different memory-access pattern from IDCT → tests whether QPU win generalises beyond compute-bound small transforms |
| AV1 CDEF | AV1 only | per-superblock, 8-px halo | AV1-only is narrower; can come later |
| MC interpolation | VP9 + AV1 | convolution, multiply-heavy | Pure-multiply workload — V3D's SMUL24 + no INT8 MAC may bite harder than for IDCT; defer until we have more substrate confidence |

The specific variant: **VP9 4-tap inner-edge horizontal loop
filter, 8-pixel edge.** libavcodec symbol
`ff_vp9_loop_filter_h_4_8_neon` from
`libavcodec/aarch64/vp9lpf_neon.S` (already vendored in
`external/ffmpeg-snapshot/` at the FFmpeg n7.1.3 pin — verify in
Phase 2). Inner-edge means we *assume* the filter strength
parameters have been pre-computed by the caller (skipping the
per-edge strength-decision tree, which is the codec's contextual
work, not the filter itself).
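To make the scope of "4-tap inner-edge" concrete, the following is a minimal Python sketch of one row of such a filter for 8-bit pixels. It follows the general shape of libavcodec's wd=4 path as understood here; it is an illustration written from memory, not the verbatim C reference that Phase 2 must quote. The names `lpf4_row`, `clip_int8` and `clip_u8` are hypothetical, and `E`, `I`, `H` are assumed to be the caller-supplied edge, interior and high-edge-variance thresholds.

```python
def clip_int8(v):
    """Clamp to the signed 8-bit range."""
    return max(-128, min(127, v))


def clip_u8(v):
    """Clamp to the unsigned 8-bit pixel range."""
    return max(0, min(255, v))


def lpf4_row(px, E, I, H):
    """Filter one row of 8 pixels [p3, p2, p1, p0, q0, q1, q2, q3].

    Returns a new list; the input is not modified.
    """
    p3, p2, p1, p0, q0, q1, q2, q3 = px

    # Filter mask (fm): leave the row alone unless both sides are smooth
    # and the step across the edge is small enough.
    fm = (abs(p3 - p2) <= I and abs(p2 - p1) <= I and abs(p1 - p0) <= I and
          abs(q1 - q0) <= I and abs(q2 - q1) <= I and abs(q3 - q2) <= I and
          abs(p0 - q0) * 2 + (abs(p1 - q1) >> 1) <= E)
    if not fm:
        return list(px)

    # High edge variance (hev): only p0/q0 are adjusted in that case.
    hev = abs(p1 - p0) > H or abs(q1 - q0) > H
    if hev:
        f = clip_int8(3 * (q0 - p0) + clip_int8(p1 - q1))
    else:
        f = clip_int8(3 * (q0 - p0))

    f1 = min(f + 4, 127) >> 3  # Python's >> is an arithmetic shift
    f2 = min(f + 3, 127) >> 3
    new_p0, new_q0 = clip_u8(p0 + f2), clip_u8(q0 - f1)

    if not hev:
        g = (f1 + 1) >> 1
        p1, q1 = clip_u8(p1 + g), clip_u8(q1 - g)

    return [p3, p2, p1, new_p0, new_q0, q1, q2, q3]
```

Under these assumptions, a clean 100 → 110 step with `E=40, I=10, H=2` is smoothed to `[100, 100, 102, 104, 106, 108, 110, 110]`, while the same row with `E=10` fails the mask and is returned unchanged. The `fm` and `hev` branches are the per-row conditionals that make this workload differ from the IDCT.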
## Measurable success criteria
Reusing `phase1.md §"Measurable success criteria"` structure
with cycle-2 numbering:
| ID | Measurement | Gate |
|---|---|---|
| **M1''** | Bit-exact match rate vs libavcodec C reference, ≥10 000 random edges | 100.000 % |
| **M2''** | QPU throughput in Medge/s (millions of edges processed per second) | recorded |
| **M3''** | NEON `ff_vp9_loop_filter_h_4_8_neon` throughput on the same host (hertz), single-core, time-based | recorded |
| **M4''** | Concurrent NEON-3 + QPU (three NEON cores plus the QPU) vs pure NEON-4 (four NEON cores), both running deblocking | recorded |

Derived: **R'' = M2'' / M3''**.
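The unit conversions behind M2''/M3'' and the derived ratio can be sketched as follows. The helper names are hypothetical; the only measured number used in the note after the block is the Phase 3 NEON baseline of 48.285 Medge/s quoted in the commit message.

```python
def medge_per_s(edges, seconds):
    """Throughput in millions of edges per second."""
    return edges / seconds / 1e6


def ns_per_edge(medge_s):
    """Average time per edge in nanoseconds, given Medge/s."""
    return 1e3 / medge_s


def ratio(m2_qpu, m3_neon):
    """R'' = M2'' / M3''; both arguments in Medge/s."""
    return m2_qpu / m3_neon
```

With M3'' = 48.285 Medge/s, `ns_per_edge(48.285)` comes out at about 20.7 ns, which agrees with the per-edge figure recorded for Phase 3.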
## Decision rules (publish before measure)
Same R bands as cycle 1 — the substrate hasn't changed:
| R'' | Verdict | Next |
|---|---|---|
| ≥ 1.0 | QPU beats NEON in isolation | Phase 9 → Phase 1 of kernel 3 |
| 0.5 ≤ R'' < 1.0 | YELLOW: M4'' gate decides | Run M4''; if mixed > pure-CPU → continue |
| 0.1 ≤ R'' < 0.5 | ORANGE: M4'' may still rescue if QPU adds *anything* on top of saturated CPU (per cycle-1 F1+F2 findings) | Run M4'' anyway, given that cycle-1 M4 surprised |
| < 0.1 | RED: structural | Phase 9 close, deblocking unsuitable for QPU |

**Cycle-1 calibration adjustment:** the orange band is no longer
auto-close. Cycle 1 M4 showed mixed > pure-CPU even at R = 0.92;
similar bandwidth-contention dynamics may hold at lower R if the
QPU's memory channel stays underutilised by the CPU. Run M4'' as
the deciding measurement regardless of M2''.
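As a compact restatement of the band table, a small classifier might look like the sketch below. The function name and the `ISOLATION_WIN` label are hypothetical (the table does not name the top band); the second tuple element records whether the table hands the verdict to M4''.

```python
def classify(r):
    """Return (band, m4_decides) for a measured R'' = M2'' / M3''."""
    if r >= 1.0:
        return ("ISOLATION_WIN", False)  # hypothetical label for the top band
    if r >= 0.5:
        return ("YELLOW", True)          # M4'' gate decides
    if r >= 0.1:
        return ("ORANGE", True)          # no longer auto-close: run M4''
    return ("RED", False)                # structural: close the cycle
```

For example, `classify(0.92)` yields `("YELLOW", True)` and `classify(0.05)` yields `("RED", False)`.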
## Cycle-1 lessons carried in (compressed)
From `phase7.md` + `phase7_M4.md`:
1. **The single biggest perf lever was workgroup-size scaling**
(64 → 256 invocations gave 2× throughput from latency hiding).
For cycle 2: jump straight to max WG size where shared-mem
fits, skip the small-WG exploration of cycle 1.
2. **`V3D_DEBUG=shaderdb` is a load-bearing diagnostic.** Read
instruction count / threads / max-temps / spills:fills after
first compile. Multiply that by lane occupancy to predict
per-block cycle cost.
3. **Chained-ternary "spill killer" optimisation was a bust**:
v3d_compiler had already coalesced. Don't pre-emptively
restructure for spills; let shaderdb tell you first.
4. **Pi 5 LPDDR4x bandwidth is the realistic ceiling.** Per-core
NEON delivers 12.6 Mblock/s on cold-cache 1080p IDCT but only
1.77 Mblock/s when 4 cores compete. The QPU lives in an
underutilised channel; the marginal contribution counts.
5. **uint8_t SSBO with `storageBuffer8BitAccess`** is the
race-free dst write pattern (cycle-1 phase-5 finding 5).
Same applies to loop-filter output pixels.
6. **Barrier-safe oob flag pattern** (cycle-1 phase-5 finding 7):
never early-return before `barrier()`. The loop filter doesn't
need a barrier within the kernel (the filter is a straight pass), so
this may not bite; still good to keep in mind.
## What cycle-2 Phase 1 does *not* lock
- Vulkan-compute vs direct-DRM dispatch path. Cycle 1 picked
Vulkan; loop filter has the same justification (debuggability,
spirv-toolchain reuse).
- WG geometry (number of edges per WG). Phase 4 picks based on
shared-mem and SIMD-width arithmetic.
- Vertical vs horizontal variant — Phase 1 picks horizontal
arbitrarily; Phase 4/7 may revisit if there's a perf reason.
## Phase 2 → Phase 3 hand-off
Phase 2 inventory must produce:
- Verbatim quote of the C reference for `loop_filter_h_4_8`
(will be in `external/ffmpeg-snapshot/libavcodec/vp9dsp_template.c`
or `vp9lpf_template.c` — Phase 2 finds it).
- The NEON symbol signature (likely `void(uint8_t *dst, ptrdiff_t
stride, int E, int I, int H)` or similar).
- VP9 spec §8.8.1 (loop filter process) — at minimum which
conditions select the 4-tap inner filter.
- Whether the inner `loop_filter` function is exposed in the
vendored snapshot or needs additional .c files vendoring.
Phase 3 will then build `tests/bench_neon_lpf.c` and capture M3''.
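The "time-based" measurement named for M3'' can be sketched as a loop that keeps calling the kernel until a wall-clock window has elapsed, then divides edges processed by elapsed time. The Python below only illustrates that structure: `kernel` is a stand-in callable, and `edges_per_call` and `window_s` are hypothetical parameters. The actual bench is the C program `tests/bench_neon_lpf.c`, whose internals are not reproduced here.

```python
import time


def timed_bench(kernel, edges_per_call, window_s):
    """Call `kernel` repeatedly for about `window_s` seconds.

    Returns (edges_processed, elapsed_seconds, throughput_in_Medge_per_s).
    """
    calls = 0
    start = time.perf_counter()
    deadline = start + window_s
    while True:
        kernel()                      # one batch of `edges_per_call` edges
        calls += 1
        now = time.perf_counter()
        if now >= deadline:
            break
    elapsed = now - start
    edges = calls * edges_per_call
    return edges, elapsed, edges / elapsed / 1e6
```

Reading the clock once per call has a cost comparable to a very cheap kernel, so each call should cover a batch of edges large enough that the timer overhead is negligible.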