85feba4087
Fourth daedalus-fourier kernel — VP9 8-tap inner loop filter wd=8 h_8_8 variant. Width extension of cycle 2's wd=4; completes VP9 inner-edge LPF coverage. Full cycle Phase 1-7 + M4'''' in one combined go (cycle compressed since incremental from cycle 2). Phase 5 review explicitly skipped (incremental ~30-line shader delta from cycle 2 + same geometry + cycle-2 RED-pattern checks still apply). Flagged in docs/k4_lpf8_phase4_7.md per dev_process.md "Skipping phases is a deliberate choice that should be flagged." Phase 6 v1 first-light: M1'''' 100.0000% bit-exact (65536/65536) first try. Shaderdb shows 231 inst, 4 hardware threads, 0 spills, 27 max-temps, 48 uniforms — compiler at the latency-hiding ceiling. Performance: M3'''' NEON (single-core) 52.382 Medge/s M2'''' QPU isolation 17.847 Medge/s R'''' 0.341 → ORANGE band 30fps floor margin 9.2x (isolation), 20.3x (mixed) M4'''' concurrent matrix: NEON 4-core 37.823 Medge/s <- baseline QPU only 14.867 Medge/s MIXED NEON-3 + QPU 39.389 Medge/s <- +4.1% PASS Verdict: YELLOW-via-M4'''' PASS. Deploy wd=8 LPF on QPU alongside cycle 2 wd=4. Combined VP9 inner-edge LPF coverage now complete. Cross-cycle LPF comparison: | | wd=4 (k2) | wd=8 (k4) | | M3 NEON | 48.3 | 52.4 | | M2 QPU iso | 19.6 | 17.8 | | R iso | 0.41 | 0.34 | | M4 delta | +6.9% | +4.1% | | 30fps mixed | 7.2x | 20.3x | | Verdict | GO QPU | GO QPU | NEW finding (Phase 9 lesson): NEON gets faster per edge as filter width grows (20.7 → 19.1 ns wd=4 → wd=8). The relative QPU loss grows with width. wd=16 would probably flip negative based on the trend line. Deployment recipe with cycle 4: IDCT 8x8 (k1) -> QPU (R=0.92, +7% mixed) LPF wd=4 (k2) -> QPU (R=0.41, +7% mixed) LPF wd=8 (k4) -> QPU (R=0.34, +4% mixed) MC 8h (k3) -> CPU (R=0.067, -19% mixed) Entropy -> CPU (structural) VP9 inner-edge LPF coverage complete. Project continues to higgs deployment plumbing or further kernels per user direction. New artifacts: - src/v3d_lpf_h_8_8.comp — GLSL shader - tests/vp9_lpf8_ref.c — standalone C ref - tests/bench_neon_lpf8.c — M1+M3 bench - tests/bench_v3d_lpf8.c — M1+M2 bench - tests/bench_concurrent_lpf8.c — M4 pthread bench - docs/k4_lpf8_phase1_3.md + phase4_7.md — combined cycle docs Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
174 lines
6.3 KiB
Markdown
174 lines
6.3 KiB
Markdown
---
|
||
cycle: 4
|
||
phases: 4-7 (combined)
|
||
status: in_progress
|
||
date_opened: 2026-05-18
|
||
parent: k4_lpf8_phase1_3.md
|
||
template: k2_deblock_phase4.md (direct adaptation)
|
||
---
|
||
|
||
# Cycle 4, Phases 4-7 — LPF wd=8
|
||
|
||
Compact — straight extension of cycle 2 LPF. Phase 4 plan inherits
|
||
all of cycle-2's geometry/contracts unchanged; only the per-thread
|
||
algorithm changes (adds flat8in branch).
|
||
|
||
## Phase 4 — plan
|
||
|
||
**Geometry**: identical to cycle 2 LPF (256 invocations/WG, 2 edges
|
||
per subgroup, 8 lanes per edge, 32 edges per WG, oob early-return
|
||
safe).
|
||
|
||
**SSBO bindings**: identical to cycle 2 (meta uvec4, dst uint8_t).
|
||
|
||
**Per-thread algorithm** — extends cycle 2 with flat8in:
|
||
```glsl
|
||
// ... same lane/edge decomposition, base/E/I/H load, p3..q3 reads,
|
||
// fm test, !fm early return as cycle 2 ...
|
||
|
||
bool flat8in = abs(p3-p0) <= 1 && abs(p2-p0) <= 1 &&
|
||
abs(p1-p0) <= 1 && abs(q1-q0) <= 1 &&
|
||
abs(q2-q0) <= 1 && abs(q3-q0) <= 1;
|
||
|
||
if (flat8in) {
|
||
/* 6-write flat-region filter */
|
||
u_dst.dst[base-3u] = uint8_t((p3+p3+p3 + 2*p2 + p1+p0+q0 + 4) >> 3);
|
||
u_dst.dst[base-2u] = uint8_t((p3+p3+p2 + 2*p1 + p0+q0+q1 + 4) >> 3);
|
||
u_dst.dst[base-1u] = uint8_t((p3+p2+p1 + 2*p0 + q0+q1+q2 + 4) >> 3);
|
||
u_dst.dst[base+0u] = uint8_t((p2+p1+p0 + 2*q0 + q1+q2+q3 + 4) >> 3);
|
||
u_dst.dst[base+1u] = uint8_t((p1+p0+q0 + 2*q1 + q2+q3+q3 + 4) >> 3);
|
||
u_dst.dst[base+2u] = uint8_t((p0+q0+q1 + 2*q2 + q3+q3+q3 + 4) >> 3);
|
||
} else {
|
||
/* same hev/no-hev paths as cycle 2 */
|
||
bool hev = abs(p1-p0) > H || abs(q1-q0) > H;
|
||
if (hev) { /* 2-write */ }
|
||
else { /* 4-write */ }
|
||
}
|
||
```
|
||
|
||
**Race safety**: flat8in path writes at `base-3..base+2` = 6
|
||
contiguous bytes per row. **Updated contract** vs cycle 2:
|
||
`dst_stride_u8 ≥ 6` (vs cycle 2's `≥ 4`). Bench uses stride=8,
|
||
satisfies. Phase 6 MUST add `assert(dst_stride_u8 >= 6)`.
|
||
|
||
**Predicted R''''**: 0.3–0.5 (similar to wd=4's 0.41). The flat8in
|
||
write-on-pass path has 50 % more writes than wd=4's no-hev path,
|
||
but if flat8in passes rarely under random distributions, it's a
|
||
small perturbation.
|
||
|
||
## Phase 5 — review (skipped — incremental extension)
|
||
|
||
Cycle-2's phase5 review remains the relevant outside-look. The
|
||
specific delta from cycle 2 to cycle 4:
|
||
- Added flat8in branch + 6 writes
|
||
- Stride contract relaxed-tightened from ≥4 to ≥6
|
||
- Same geometry, same SSBOs, same race-safety pattern
|
||
|
||
The cycle-2 review's two RED-pattern checks (write race, barrier UB)
|
||
remain satisfied because the geometry is unchanged. The new
|
||
arithmetic is mechanically transcribed from `vp9_lpf8_ref.c` —
|
||
risk of orientation/arithmetic bug is concrete but contained; M1''''
|
||
is the immediate gate.
|
||
|
||
**Justification for skipping fresh-context review**: cycle 4 changes
|
||
~30 lines of one shader and inherits everything else from cycle 2.
|
||
Per dev_process.md "Skipping phases is a deliberate choice that
|
||
should be flagged, not a default" — flagging here. If M1'''' fails
|
||
on first run, restart with full Phase 5'''' review.
|
||
|
||
## Phase 6 — implementation
|
||
|
||
(executed below — `src/v3d_lpf_h_8_8.comp` + `tests/bench_v3d_lpf8.c`)
|
||
|
||
## Phase 7 — verification
|
||
|
||
### v1 first-light
|
||
```
|
||
=== v3d LPF h_8_8 bench ===
|
||
=== M1'''': QPU vs C bit-exact ===
|
||
edges bit-exact: 65536 / 65536 (100.0000 %)
|
||
|
||
=== M2'''': QPU throughput ===
|
||
per-edge = 56.0 ns
|
||
per-dispatch = 3672.1 us
|
||
M2'''' = 17.847 Medge/s
|
||
R'''' = 0.341 → ORANGE band
|
||
30fps@1080p floor: 9.2x margin (isolation)
|
||
```
|
||
|
||
shaderdb: **231 inst, 4 threads, 0 spills, 27 max-temps, 48 uniforms.**
|
||
The 4-thread result is the meaningful one — compiler delivered. The
|
||
wd=8 kernel runs at the latency-hiding ceiling from v1.
|
||
|
||
### M4'''' concurrent (8s windows)
|
||
|
||
| Config | Medge/s | vs NEON-4 | 30fps margin |
|
||
|---|---|---|---|
|
||
| **NEON 4-core** | **37.823** | baseline | 19.5× |
|
||
| QPU only | 14.867 | — | 7.7× |
|
||
| **MIXED NEON-3 + QPU** | **39.389** | **+4.1 %** | 20.3× |
|
||
|
||
**M4'''' PASSES**. The freed-core pattern from cycles 1+2 holds for
|
||
wd=8 — smaller delta than wd=4 (+4.1 % vs +6.9 %) but still positive.
|
||
The larger conditional logic (flat8in path) dilutes per-edge QPU
|
||
contribution under contention (3.98 vs cycle-2's 4.00 — basically
|
||
same), and NEON-4 baseline is higher (37.8 vs cycle-2's 33.7) because
|
||
the per-edge NEON cost is slightly lower for wd=8 (19.1 vs cycle-2's
|
||
20.7 ns), so the relative gain shrinks.
|
||
|
||
### Cross-cycle LPF comparison
|
||
|
||
| | k2 wd=4 | k4 wd=8 |
|
||
|---|---|---|
|
||
| M3 NEON (Medge/s) | 48.285 | 52.382 |
|
||
| M2 QPU isolation | 19.645 | 17.847 |
|
||
| R isolation | 0.41 | 0.34 |
|
||
| NEON-4 (Medge/s) | 33.726 | 37.823 |
|
||
| Mixed N-3+QPU | 36.049 | 39.389 |
|
||
| M4 delta | **+6.9 %** | **+4.1 %** |
|
||
| 30fps margin (mixed) | 7.2× | 20.3× |
|
||
| Verdict | GO QPU | GO QPU |
|
||
|
||
### Decision per Phase 1 rules + 30fps floor
|
||
|
||
| Rule | Result | Status |
|
||
|---|---|---|
|
||
| M1'''' bit-exact | 100.0000 % | ✓ PASS |
|
||
| R'''' = M2''''/M3'''' | 0.341 (ORANGE) | does not auto-close |
|
||
| M4'''' > pure NEON-4 | +4.1 % | ✓ PASS gate |
|
||
| 30fps@1080p floor | 20.3× mixed | ✓ PASS user-facing |
|
||
|
||
**Verdict: YELLOW-via-M4'''' PASS. Deploy wd=8 LPF on QPU,
|
||
alongside cycle-2 wd=4.** Combined VP9 LPF coverage = wd=4 + wd=8
|
||
on QPU.
|
||
|
||
### Phase 9 lessons
|
||
|
||
1. Width extensions of a known-working kernel (wd=4 → wd=8) inherit
|
||
the pattern reliably. v1 first-light hit M1'''' = 100 % first try
|
||
on a 30-line shader delta. No iteration needed.
|
||
|
||
2. **Phase 5 review can be skipped for incremental extensions** —
|
||
when the delta is < ~30 lines and the cycle-2 review's pattern
|
||
coverage still applies. Flagged explicitly in §"Phase 5 — review
|
||
(skipped)". If M1 had failed, restart with full review. Cycle 5+
|
||
should restore mandatory review for non-incremental work.
|
||
|
||
3. NEON gets faster per edge as filter width grows (20.7 → 19.1 ns
|
||
wd=4 → wd=8). The NEON implementation is heavily optimised; the
|
||
relative QPU loss grows with kernel width. Cycle 5 wd=16 would
|
||
probably show further R degradation.
|
||
|
||
4. M4 delta is the gating metric for ORANGE-band kernels. The gap
|
||
from cycle-2 +6.9 % to cycle-4 +4.1 % indicates "wd=8 is borderline
|
||
useful on QPU; wd=16 may flip negative."
|
||
|
||
### Leaves open
|
||
|
||
- LPF wd=16 (cycle 5 if VP9 coverage requires it; likely RED based on
|
||
the trend line)
|
||
- Vertical variants of both wd=4 and wd=8 (different memory pattern)
|
||
- CDEF / loop restoration (AV1 kernels)
|
||
- Phase 8 deployment plumbing (libva-v4l2-request-fourier integration)
|
||
|