Files
daedalus-fourier/docs/k3_mc_phase1.md
T
marfrit 356e446a49 Cycle 3 (MC interpolation) closure: M1'''=100%, R'''=0.067 RED, M4=-19.5%
Third daedalus-fourier kernel — VP9 8-tap regular subpel filter,
horizontal direction, 8-wide output. Multiply-heavy by design to
stress V3D's no-DP4A deficit. Full cycle Phase 1-7 + M4'''.

Phase 5''' second-model review delivered cleanly — caught 1 RED
bug pre-implementation (src_off off-by-3 indexing convention) and
2 YELLOW gaps (assert MUST language, shaderdb filter-LUT gate).
Without the review, M1''' would have failed silently on first run
with cryptic "high-index source pixels wrong" symptoms.

Phase 6 v1 first-light: M1''' 100.0000% bit-exact (65536/65536
blocks across all 16 mx phases). Phase 5''' filter-LUT prediction
materialised exactly: 197 uniforms (gate was 144), 2 threads (down
from cycle-2's 4 due to register pressure).

Performance:

  M2''' = 1.413 Mblock/s     (707.9 ns/block)
  M3''' = 20.997 Mblock/s    (NEON baseline phase3)
  R'''  = 0.067              (RED band — structural mismatch)
  shaderdb: 488 inst, 2 threads, 197 uniforms, 25 max-temps, 0 spills

M4''' concurrent matrix (8s windows):

  NEON 1-core           14.479 Mblock/s
  NEON 4-core           15.248 Mblock/s   <- baseline (compute-bound,
                                              not bandwidth-saturated
                                              like cycles 1+2!)
  QPU only               1.380 Mblock/s
  MIXED NEON-3 + QPU    12.277 Mblock/s   <- -19.5% (FAIL gate)
  MIXED NEON-4 + QPU    12.158 Mblock/s   <- -20.3%

NEW cross-cycle finding (Phase 9 lesson 2): compute-bound CPU
workloads make the QPU-offload story collapse. Cycles 1+2 were
bandwidth-saturated (4-core scaling 0.56-0.82x of 1-core), so
freeing a CPU core via QPU offload added throughput. Cycle 3 MC
is compute-bound (4-core scaling 1.05x of 1-core — near-linear),
no free cycles to free. QPU contribution (0.45 Mblock/s in
contention) doesn't compensate for losing 1 NEON core delivering
~3.8 Mblock/s.

But 30fps@1080p floor: PASS in every config (1.4x to 15.7x
isolation margin). Per project_30fps_floor_is_fine.md, user-facing
test never fails — daily YouTube playback works fine on any CPU/QPU
split.

DEPLOYMENT RECIPE for higgs (cycle 3 confirmed split):

  IDCT (k1)  -> QPU   (R=0.92, +7% mixed, frees CPU core)
  LPF  (k2)  -> QPU   (R=0.41, +7% mixed, frees CPU core)
  MC   (k3)  -> CPU   (R=0.067, -19.5% mixed — stays on CPU)
  Entropy    -> CPU   (structurally serial)

Mixed-substrate deployment, not "QPU does everything". Realistic for
higgs: entropy + MC on 2-3 ARM cores; IDCT + LPF dispatched to QPU
concurrently; 1-2 ARM cores left for vscode etc.

New artifacts:
- src/v3d_mc_8h.comp               — GLSL kernel
- tests/vp9_mc_ref.c               — standalone C ref (REGULAR filter
                                     embedded; clean transcription)
- tests/bench_neon_mc.c            — M1'''_c + M3''' bench
- tests/bench_v3d_mc.c             — M1''' + M2''' bench with contract
                                     asserts + 30fps margin display
- tests/bench_concurrent_mc.c      — M4''' pthread bench
- external/ffmpeg-snapshot/libavcodec/aarch64/vp9mc_neon.S    (vendored)
- external/ffmpeg-snapshot/libavcodec/vp9_subpel_filters_table.c
                                     (hand-extracted; provides
                                      ff_vp9_subpel_filters symbol
                                      without dragging in full vp9dsp.c)
- docs/k3_mc_phase{1,2,3,4,5,7}.md — full cycle documentation

Memory updates: project_30fps_floor_is_fine.md (user's 30fps target
recalibration), MEMORY.md index updated.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 12:51:43 +00:00

105 lines
3.9 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
---
cycle: 3
phase: 1
status: open
date_opened: 2026-05-18
parent_cycle: k2_deblock_phase7.md (cycle 2 closed YELLOW-via-M4'' PASS)
target_kernel: VP9 8-tap MC interpolation, regular filter, horizontal, 8×N block
dev_host: hertz
---
# Cycle 3, Phase 1 — MC interpolation kernel goal
Per `k2_deblock_phase7.md` verdict (project continues). MC interpolation
chosen because: most-common per-frame work in real bitstreams (every
inter block); multiply-heavy → stresses V3D SMUL24 / lack of DP4A
directly; VP9+AV1 both use the same 8-tap structure.
## Kernel under test
**VP9 8-tap regular subpel filter, horizontal direction, 8×N block,
"put" (non-averaging) mode.**
libavcodec symbol: `ff_vp9_put_8tap_regular_8h_neon` (and equivalents
for smooth/sharp filter types). C reference: `put_8tap_regular_8h_c`
from `libavcodec/vp9dsp_template.c` (instantiated via the
`filter_fn_1d(8, h, mx, regular, FILTER_8TAP_REGULAR, put)` macro
expansion).
I/O contract (per VP9 spec § 8.5.1 — subpel motion compensation):
```c
void put_8tap_regular_8h_c(uint8_t *dst, ptrdiff_t dst_stride,
const uint8_t *src, ptrdiff_t src_stride,
int h, int mx, int my);
```
- `dst` : destination block, written
- `dst_stride` : destination row stride
- `src` : source block, read (with -3..+4 column overhang for horizontal)
- `src_stride` : source row stride
- `h` : block height (typically 8 for 8×8)
- `mx` : x-axis subpel phase ∈ [0, 15]
- `my` : y-axis subpel phase (unused for horizontal-only filter)
Per output pixel:
```
out[r][c] = clip(sum_{k=0..7} filter[k] * src[r][c+k-3] + 64) >> 7
```
Filter coefficients: `ff_vp9_subpel_filters[FILTER_8TAP_REGULAR][mx][0..7]`
(int16, signed; 16 phases; sum to 128).
## Measurable success criteria (cycle-3 numbering)
| ID | Measurement | Gate |
|---|---|---|
| **M1'''** | Bit-exact match rate vs C reference, ≥10 000 random 8×8 blocks (all 16 mx phases sampled) | 100.0000 % |
| **M2'''** | QPU throughput in Mblock/s | recorded |
| **M3'''** | NEON `ff_vp9_put_8tap_regular_8h_neon` throughput, single-core | recorded |
| **M4'''** | MIXED NEON-3 + QPU vs pure NEON-4 (only if YELLOW band) | conditional |
Derived: **R''' = M2''' / M3'''**.
## Decision rules (same as cycle 1/2)
R''' bands and verdicts unchanged (see `phase1.md` and `k2_deblock_phase1.md`).
Cycle-2 calibration adjustment: ORANGE band (0.1 ≤ R''' < 0.5) is
no longer auto-close — run M4''' regardless.
Predicted R''' band: **0.40.8.**
- MC is more compute-bound than LPF (8 mults + 7 adds per output
pixel; 64 pixels per block → ~960 ops per block)
- Bandwidth-equivalent to LPF (per-block ~120 B read + 64 B write
≈ 184 B → similar 5-6 MB/frame at 32 400 blocks)
- V3D SMUL24 covers the 8b×8b → 16b mults without overflow
- But no DP4A means we lose the typical "4× INT8 speedup" CPUs get
via SDOT — V3D does these as scalar SMUL24
## Cycle 1+2 lessons baked in from start
Per `k2_deblock_phase7.md §"Phase 9 lessons"`:
1. WG=256, 2-per-subgroup adaptation, uint8_t SSBO, oob early-return,
NO chained ternary — these are the v1 defaults.
2. Phase 5 second-model review is mandatory.
3. R isolation is misleading; M4''' is the real gate.
4. Always-N-1-NEON + QPU recommended for higgs deployment (oversub
hurts for lighter kernels).
5. shaderdb at 4 threads / 0 spills = compiler delivered; further
optimisation must target algorithm, not compile shape.
## Phase 2 → Phase 3 hand-off
Phase 2 must:
- Vendor `libavcodec/aarch64/vp9mc_neon.S` from FFmpeg n7.1.3
(matches existing snapshot pin)
- Confirm `ff_vp9_subpel_filters` definition source
(`libavcodec/vp9dsp.c:32`, just the 16 × 8 REGULAR row needed)
- Pin the exact NEON symbol naming
Phase 3 must:
- Write standalone C ref (`tests/vp9_mc_ref.c`) with REGULAR filter
table embedded
- Write `tests/bench_neon_mc.c` (M1'''_c gate + M3''')
- Capture M3''' before any QPU work