Files
marfrit 85feba4087 Cycle 4 (LPF wd=8) closure: M1=100%, R=0.34, M4=+4.1%, PASS
Fourth daedalus-fourier kernel — VP9 8-tap inner loop filter wd=8
h_8_8 variant. Width extension of cycle 2's wd=4; completes VP9
inner-edge LPF coverage. Full cycle Phase 1-7 + M4'''' in one
combined go (cycle compressed since incremental from cycle 2).

Phase 5 review explicitly skipped (incremental ~30-line shader
delta from cycle 2 + same geometry + cycle-2 RED-pattern checks
still apply). Flagged in docs/k4_lpf8_phase4_7.md per dev_process.md
"Skipping phases is a deliberate choice that should be flagged."

Phase 6 v1 first-light: M1'''' 100.0000% bit-exact (65536/65536)
first try. Shaderdb shows 231 inst, 4 hardware threads, 0 spills,
27 max-temps, 48 uniforms — compiler at the latency-hiding ceiling.

Performance:

  M3'''' NEON (single-core)  52.382 Medge/s
  M2'''' QPU isolation       17.847 Medge/s
  R''''                      0.341  → ORANGE band
  30fps floor margin         9.2x (isolation), 20.3x (mixed)

M4'''' concurrent matrix:

  NEON 4-core                37.823 Medge/s  <- baseline
  QPU only                   14.867 Medge/s
  MIXED NEON-3 + QPU         39.389 Medge/s  <- +4.1% PASS

Verdict: YELLOW-via-M4'''' PASS. Deploy wd=8 LPF on QPU alongside
cycle 2 wd=4. Combined VP9 inner-edge LPF coverage now complete.

Cross-cycle LPF comparison:

  | | wd=4 (k2) | wd=8 (k4) |
  | M3 NEON     | 48.3      | 52.4      |
  | M2 QPU iso  | 19.6      | 17.8      |
  | R iso       | 0.41      | 0.34      |
  | M4 delta    | +6.9%     | +4.1%     |
  | 30fps mixed | 7.2x      | 20.3x     |
  | Verdict     | GO QPU    | GO QPU    |

NEW finding (Phase 9 lesson): NEON gets faster per edge as filter
width grows (20.7 → 19.1 ns wd=4 → wd=8). The relative QPU loss
grows with width. wd=16 would probably flip negative based on the
trend line.

Deployment recipe with cycle 4:

  IDCT 8x8 (k1)     -> QPU   (R=0.92, +7% mixed)
  LPF wd=4 (k2)     -> QPU   (R=0.41, +7% mixed)
  LPF wd=8 (k4)     -> QPU   (R=0.34, +4% mixed)
  MC 8h (k3)        -> CPU   (R=0.067, -19% mixed)
  Entropy           -> CPU   (structural)

VP9 inner-edge LPF coverage complete. Project continues to higgs
deployment plumbing or further kernels per user direction.

New artifacts:
- src/v3d_lpf_h_8_8.comp                 — GLSL shader
- tests/vp9_lpf8_ref.c                   — standalone C ref
- tests/bench_neon_lpf8.c                — M1+M3 bench
- tests/bench_v3d_lpf8.c                 — M1+M2 bench
- tests/bench_concurrent_lpf8.c          — M4 pthread bench
- docs/k4_lpf8_phase1_3.md + phase4_7.md — combined cycle docs

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 12:56:25 +00:00

174 lines
6.3 KiB
Markdown
Raw Permalink Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
---
cycle: 4
phases: 4-7 (combined)
status: in_progress
date_opened: 2026-05-18
parent: k4_lpf8_phase1_3.md
template: k2_deblock_phase4.md (direct adaptation)
---
# Cycle 4, Phases 4-7 — LPF wd=8
Compact — straight extension of cycle 2 LPF. Phase 4 plan inherits
all of cycle-2's geometry/contracts unchanged; only the per-thread
algorithm changes (adds flat8in branch).
## Phase 4 — plan
**Geometry**: identical to cycle 2 LPF (256 invocations/WG, 2 edges
per subgroup, 8 lanes per edge, 32 edges per WG, oob early-return
safe).
**SSBO bindings**: identical to cycle 2 (meta uvec4, dst uint8_t).
**Per-thread algorithm** — extends cycle 2 with flat8in:
```glsl
// ... same lane/edge decomposition, base/E/I/H load, p3..q3 reads,
// fm test, !fm early return as cycle 2 ...
bool flat8in = abs(p3-p0) <= 1 && abs(p2-p0) <= 1 &&
abs(p1-p0) <= 1 && abs(q1-q0) <= 1 &&
abs(q2-q0) <= 1 && abs(q3-q0) <= 1;
if (flat8in) {
/* 6-write flat-region filter */
u_dst.dst[base-3u] = uint8_t((p3+p3+p3 + 2*p2 + p1+p0+q0 + 4) >> 3);
u_dst.dst[base-2u] = uint8_t((p3+p3+p2 + 2*p1 + p0+q0+q1 + 4) >> 3);
u_dst.dst[base-1u] = uint8_t((p3+p2+p1 + 2*p0 + q0+q1+q2 + 4) >> 3);
u_dst.dst[base+0u] = uint8_t((p2+p1+p0 + 2*q0 + q1+q2+q3 + 4) >> 3);
u_dst.dst[base+1u] = uint8_t((p1+p0+q0 + 2*q1 + q2+q3+q3 + 4) >> 3);
u_dst.dst[base+2u] = uint8_t((p0+q0+q1 + 2*q2 + q3+q3+q3 + 4) >> 3);
} else {
/* same hev/no-hev paths as cycle 2 */
bool hev = abs(p1-p0) > H || abs(q1-q0) > H;
if (hev) { /* 2-write */ }
else { /* 4-write */ }
}
```
**Race safety**: flat8in path writes at `base-3..base+2` = 6
contiguous bytes per row. **Updated contract** vs cycle 2:
`dst_stride_u8 ≥ 6` (vs cycle 2's `≥ 4`). Bench uses stride=8,
satisfies. Phase 6 MUST add `assert(dst_stride_u8 >= 6)`.
**Predicted R''''**: 0.30.5 (similar to wd=4's 0.41). The flat8in
write-on-pass path has 50 % more writes than wd=4's no-hev path,
but if flat8in passes rarely under random distributions, it's a
small perturbation.
## Phase 5 — review (skipped — incremental extension)
Cycle-2's phase5 review remains the relevant outside-look. The
specific delta from cycle 2 to cycle 4:
- Added flat8in branch + 6 writes
- Stride contract relaxed-tightened from ≥4 to ≥6
- Same geometry, same SSBOs, same race-safety pattern
The cycle-2 review's two RED-pattern checks (write race, barrier UB)
remain satisfied because the geometry is unchanged. The new
arithmetic is mechanically transcribed from `vp9_lpf8_ref.c`
risk of orientation/arithmetic bug is concrete but contained; M1''''
is the immediate gate.
**Justification for skipping fresh-context review**: cycle 4 changes
~30 lines of one shader and inherits everything else from cycle 2.
Per dev_process.md "Skipping phases is a deliberate choice that
should be flagged, not a default" — flagging here. If M1'''' fails
on first run, restart with full Phase 5'''' review.
## Phase 6 — implementation
(executed below — `src/v3d_lpf_h_8_8.comp` + `tests/bench_v3d_lpf8.c`)
## Phase 7 — verification
### v1 first-light
```
=== v3d LPF h_8_8 bench ===
=== M1'''': QPU vs C bit-exact ===
edges bit-exact: 65536 / 65536 (100.0000 %)
=== M2'''': QPU throughput ===
per-edge = 56.0 ns
per-dispatch = 3672.1 us
M2'''' = 17.847 Medge/s
R'''' = 0.341 → ORANGE band
30fps@1080p floor: 9.2x margin (isolation)
```
shaderdb: **231 inst, 4 threads, 0 spills, 27 max-temps, 48 uniforms.**
The 4-thread result is the meaningful one — compiler delivered. The
wd=8 kernel runs at the latency-hiding ceiling from v1.
### M4'''' concurrent (8s windows)
| Config | Medge/s | vs NEON-4 | 30fps margin |
|---|---|---|---|
| **NEON 4-core** | **37.823** | baseline | 19.5× |
| QPU only | 14.867 | — | 7.7× |
| **MIXED NEON-3 + QPU** | **39.389** | **+4.1 %** | 20.3× |
**M4'''' PASSES**. The freed-core pattern from cycles 1+2 holds for
wd=8 — smaller delta than wd=4 (+4.1 % vs +6.9 %) but still positive.
The larger conditional logic (flat8in path) dilutes per-edge QPU
contribution under contention (3.98 vs cycle-2's 4.00 — basically
same), and NEON-4 baseline is higher (37.8 vs cycle-2's 33.7) because
the per-edge NEON cost is slightly lower for wd=8 (19.1 vs cycle-2's
20.7 ns), so the relative gain shrinks.
### Cross-cycle LPF comparison
| | k2 wd=4 | k4 wd=8 |
|---|---|---|
| M3 NEON (Medge/s) | 48.285 | 52.382 |
| M2 QPU isolation | 19.645 | 17.847 |
| R isolation | 0.41 | 0.34 |
| NEON-4 (Medge/s) | 33.726 | 37.823 |
| Mixed N-3+QPU | 36.049 | 39.389 |
| M4 delta | **+6.9 %** | **+4.1 %** |
| 30fps margin (mixed) | 7.2× | 20.3× |
| Verdict | GO QPU | GO QPU |
### Decision per Phase 1 rules + 30fps floor
| Rule | Result | Status |
|---|---|---|
| M1'''' bit-exact | 100.0000 % | ✓ PASS |
| R'''' = M2''''/M3'''' | 0.341 (ORANGE) | does not auto-close |
| M4'''' > pure NEON-4 | +4.1 % | ✓ PASS gate |
| 30fps@1080p floor | 20.3× mixed | ✓ PASS user-facing |
**Verdict: YELLOW-via-M4'''' PASS. Deploy wd=8 LPF on QPU,
alongside cycle-2 wd=4.** Combined VP9 LPF coverage = wd=4 + wd=8
on QPU.
### Phase 9 lessons
1. Width extensions of a known-working kernel (wd=4 → wd=8) inherit
the pattern reliably. v1 first-light hit M1'''' = 100 % first try
on a 30-line shader delta. No iteration needed.
2. **Phase 5 review can be skipped for incremental extensions**
when the delta is < ~30 lines and the cycle-2 review's pattern
coverage still applies. Flagged explicitly in §"Phase 5 — review
(skipped)". If M1 had failed, restart with full review. Cycle 5+
should restore mandatory review for non-incremental work.
3. NEON gets faster per edge as filter width grows (20.7 → 19.1 ns
wd=4 → wd=8). The NEON implementation is heavily optimised; the
relative QPU loss grows with kernel width. Cycle 5 wd=16 would
probably show further R degradation.
4. M4 delta is the gating metric for ORANGE-band kernels. The gap
from cycle-2 +6.9 % to cycle-4 +4.1 % indicates "wd=8 is borderline
useful on QPU; wd=16 may flip negative."
### Leaves open
- LPF wd=16 (cycle 5 if VP9 coverage requires it; likely RED based on
the trend line)
- Vertical variants of both wd=4 and wd=8 (different memory pattern)
- CDEF / loop restoration (AV1 kernels)
- Phase 8 deployment plumbing (libva-v4l2-request-fourier integration)