Files
daedalus-fourier/docs/k3_mc_phase7.md
T
marfrit 356e446a49 Cycle 3 (MC interpolation) closure: M1'''=100%, R'''=0.067 RED, M4=-19.5%
Third daedalus-fourier kernel — VP9 8-tap regular subpel filter,
horizontal direction, 8-wide output. Multiply-heavy by design to
stress V3D's no-DP4A deficit. Full cycle Phase 1-7 + M4'''.

Phase 5''' second-model review delivered cleanly — caught 1 RED
bug pre-implementation (src_off off-by-3 indexing convention) and
2 YELLOW gaps (assert MUST language, shaderdb filter-LUT gate).
Without the review, M1''' would have failed silently on first run
with cryptic "high-index source pixels wrong" symptoms.

Phase 6 v1 first-light: M1''' 100.0000% bit-exact (65536/65536
blocks across all 16 mx phases). Phase 5''' filter-LUT prediction
materialised exactly: 197 uniforms (gate was 144), 2 threads (down
from cycle-2's 4 due to register pressure).

Performance:

  M2''' = 1.413 Mblock/s     (707.9 ns/block)
  M3''' = 20.997 Mblock/s    (NEON baseline phase3)
  R'''  = 0.067              (RED band — structural mismatch)
  shaderdb: 488 inst, 2 threads, 197 uniforms, 25 max-temps, 0 spills

M4''' concurrent matrix (8s windows):

  NEON 1-core           14.479 Mblock/s
  NEON 4-core           15.248 Mblock/s   <- baseline (compute-bound,
                                              not bandwidth-saturated
                                              like cycles 1+2!)
  QPU only               1.380 Mblock/s
  MIXED NEON-3 + QPU    12.277 Mblock/s   <- -19.5% (FAIL gate)
  MIXED NEON-4 + QPU    12.158 Mblock/s   <- -20.3%

NEW cross-cycle finding (Phase 9 lesson 2): compute-bound CPU
workloads make the QPU-offload story collapse. Cycles 1+2 were
bandwidth-saturated (4-core scaling 0.56-0.82x of 1-core), so
freeing a CPU core via QPU offload added throughput. Cycle 3 MC
is compute-bound (4-core scaling 1.05x of 1-core — near-linear),
no free cycles to free. QPU contribution (0.45 Mblock/s in
contention) doesn't compensate for losing 1 NEON core delivering
~3.8 Mblock/s.

But 30fps@1080p floor: PASS in every config (1.4x to 15.7x
isolation margin). Per project_30fps_floor_is_fine.md, user-facing
test never fails — daily YouTube playback works fine on any CPU/QPU
split.

DEPLOYMENT RECIPE for higgs (cycle 3 confirmed split):

  IDCT (k1)  -> QPU   (R=0.92, +7% mixed, frees CPU core)
  LPF  (k2)  -> QPU   (R=0.41, +7% mixed, frees CPU core)
  MC   (k3)  -> CPU   (R=0.067, -19.5% mixed — stays on CPU)
  Entropy    -> CPU   (structurally serial)

Mixed-substrate deployment, not "QPU does everything". Realistic for
higgs: entropy + MC on 2-3 ARM cores; IDCT + LPF dispatched to QPU
concurrently; 1-2 ARM cores left for vscode etc.

New artifacts:
- src/v3d_mc_8h.comp               — GLSL kernel
- tests/vp9_mc_ref.c               — standalone C ref (REGULAR filter
                                     embedded; clean transcription)
- tests/bench_neon_mc.c            — M1'''_c + M3''' bench
- tests/bench_v3d_mc.c             — M1''' + M2''' bench with contract
                                     asserts + 30fps margin display
- tests/bench_concurrent_mc.c      — M4''' pthread bench
- external/ffmpeg-snapshot/libavcodec/aarch64/vp9mc_neon.S    (vendored)
- external/ffmpeg-snapshot/libavcodec/vp9_subpel_filters_table.c
                                     (hand-extracted; provides
                                      ff_vp9_subpel_filters symbol
                                      without dragging in full vp9dsp.c)
- docs/k3_mc_phase{1,2,3,4,5,7}.md — full cycle documentation

Memory updates: project_30fps_floor_is_fine.md (user's 30fps target
recalibration), MEMORY.md index updated.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 12:51:43 +00:00

153 lines
6.3 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
---
cycle: 3
phase: 7
status: closed 2026-05-18 — RED engineering / PASS 30fps-floor / M4 NEGATIVE
date_opened: 2026-05-18
date_closed: 2026-05-18
parent: k3_mc_phase4.md (revised per phase5''')
host: hertz
verdict: cycle 3 closes; MC stays on CPU for higgs deployment; engineering negative documented
---
# Cycle 3, Phase 7 — Verification (v1 + M4''')
## v1 first-light
```
=== v3d MC 8h bench ===
n_blocks: 65536 iters: 100
=== M1''': QPU vs C reference bit-exact ===
blocks bit-exact: 65536 / 65536 (100.0000 %)
=== M2''': QPU throughput ===
M2''' = 1.413 Mblock/s
per-block = 707.9 ns
per-dispatch = 46390.5 us
R''' = 0.067 → RED band
30fps@1080p floor: 1.5x margin (isolation)
```
shaderdb (v1 MC):
```
SHADER-DB-ffcca249...: 488 inst, 2 threads, 0 loops, 197 uniforms,
25 max-temps, 0:0 spills:fills, 0 sfu-stalls, 488 inst-and-stalls, 7 nops
```
**Phase 5''' finding 2 prediction confirmed**: filter LUT inflated
uniforms to 197 (gate was at ~144). Compiler also forced to 2 threads
(from cycle-2's 4) due to register pressure (25 max-temps vs cycle-2's
21). The "no DP4A" structural deficit shows up directly here — 8
SMUL24 + 7 ADD per output pixel × 64 pixels per block × 8-lane
geometry = 488 instructions, 30× heavier than the LPF kernel.
## M4''' concurrent matrix (8s windows)
| Config | Mblock/s | per-core (NEON) | vs NEON-4 | 30fps |
|---|---|---|---|---|
| NEON 1-core | 14.479 | — | — | 14.9× |
| **NEON 4-core** | **15.248** | 3.24 4.48 | **baseline** | 15.7× |
| QPU only | 1.380 | — | — | 1.4× |
| **Mixed NEON-3 + QPU** | **12.277** | 3.78 4.16 | **19.5 %** | 12.6× |
| Mixed NEON-4 + QPU | 12.158 | 2.49 3.35 | 20.3 % | 12.5× |
**M4 gate: FAIL.** Mixed (12.28) < pure NEON-4 (15.25) by 2.97
Mblock/s. The QPU's 0.45 Mblock/s contribution under contention
doesn't compensate for losing one NEON core that delivers ~3.8.
## Cross-cycle comparison
| | Cycle 1 IDCT | Cycle 2 LPF | Cycle 3 MC |
|---|---|---|---|
| R isolation | 0.92 | 0.41 | **0.067** |
| 30fps floor margin (isolation) | 7.9× | 10× | **1.5×** |
| M4 mixed vs pure NEON-4 | +7.2 % | +6.9 % | **19.5 %** |
| 30fps floor margin (mixed) | 7.2× | 7.2× | **12.6×** |
| Verdict for higgs | GO QPU | GO QPU | **STAY CPU** |
| NEON 4-core scaling vs 1-core | 0.56× (bw-bound) | 0.82× (bw-bound) | **1.05× (compute-bound)** |
The MC result is **structurally consistent** with the V3D substrate
profile from `phase0.md`:
- No DP4A → 8-wide convolution doesn't pack as it does on NEON SDOT
- Filter coefficients drive uniform count high → register pressure → 2 threads
- High per-output-pixel multiply count → compiled instruction count
3× cycle 1, 6× cycle 2
NEON 4-core is *compute*-bound for MC (not bandwidth-bound like
the other two kernels). So 4-core scales nearly linearly with cores —
the NEON CPU has plenty of headroom and the QPU has nothing to add
even in concurrent mode.
## Deployment recipe (for higgs / libva-v4l2-request-fourier)
Per `project_consumer_target.md`, the eventual integration target is
V4L2 stateless → libva-v4l2-request-fourier → firefox-fourier. The
back-end-on-QPU/CPU split for the consumed decoder pipeline:
- **IDCT (cycle 1)** → QPU. R = 0.92, +7 % mixed, frees a CPU core.
- **LPF (cycle 2)** → QPU. R = 0.41, +7 % mixed, frees a CPU core.
- **MC (cycle 3)** → **CPU NEON**. R = 0.067, 19.5 % mixed.
Compute-bound on CPU but CPU already comfortably exceeds 30fps;
offload makes things worse.
- **Entropy** (VP9 Bool / AV1 ANS) → CPU. Structurally serial.
This is a **mixed-substrate deployment**, not a "QPU does everything"
plan. Realistic for higgs: entropy + MC on 2-3 ARM cores; IDCT + LPF
dispatched to QPU concurrently; 1-2 ARM cores left for vscode / etc.
## Decision per Phase 1 rules + 30fps-floor calibration
| Rule | Result | Status |
|---|---|---|
| M1''' bit-exact | 100.0000 % | ✓ PASS |
| R''' = M2'''/M3''' | 0.067 (RED) | structural mismatch |
| M4''' > pure-CPU 4-core | 19.5 % | ✗ FAIL gate |
| 30fps@1080p floor (isolation) | 1.5× | ✓ PASS (user-facing) |
| 30fps@1080p floor (mixed) | 12.6× | ✓ PASS (user-facing) |
**Engineering cycle verdict: do not deploy MC on QPU; deploy on CPU.**
**User-facing cycle verdict: 30fps floor easily met in any
configuration; either path works for daily YouTube.**
For the deployment recipe above, **MC stays on CPU**. The Phase 1
ORANGE/RED "honest close" rule applies here: cycle 3 closes as a
documented negative for this kernel without affecting the
project-level "continue" verdict (cycles 1+2 GO results stand).
## Phase 9 lessons (added to project memory)
1. **Multiply-heavy workloads expose V3D's no-DP4A deficit** in a way
that cycle 1+2 didn't. CPU SDOT/UDOT pack 4 INT8 MACs in one
instruction; V3D's SMUL24 is one scalar mult at a time. The 4×
gap shows up directly as a 6-15× per-block slowdown.
2. **Compute-bound CPU workloads make the QPU offload story collapse.**
When NEON 4-core scales near-linearly (not bandwidth-saturated),
the "freed-core" argument from cycle 1+2 doesn't apply — there
are no free cycles to free. Mixed mode is strictly worse.
3. **The 30fps@1080p user-facing test (`project_30fps_floor_is_fine.md`)
passes regardless of engineering verdict.** All three cycles pass
it in isolation. This is a project-level win to communicate
separately from per-cycle engineering R numbers.
4. **The shaderdb filter-LUT gate from phase5''' finding 2 fired
exactly as predicted** (197 uniforms > 144 threshold; 2 threads
instead of 4). This validates the cycle-discipline of running
`V3D_DEBUG=shaderdb` early and using the result as an actionable
gate. Cycle 4 (if any) should bake this in from Phase 4 §design.
## Leaves open
- Cycle 3 v2 with filter LUT escalated to SSBO (per phase5''' finding 2
trigger). Would reduce uniforms to ~30, potentially restore 4
threads. Expected upside: ~2× → R''' = 0.13. Still RED, still M4-
negative. Skipped — even doubling doesn't change the deployment
recipe.
- Vertical / hv / 4-tap / wider variants — all of cycle 3 same
multiply-shape, same structural verdict expected. Not worth Phase
1+ for those.
- Cycle 4 candidates (per phase7_M4.md §"Cycle 3 candidates"):
CDEF (AV1-only directional filter), Loop Restoration (AV1-only),
or higgs deployment plumbing.