Cycle 3 (MC interpolation) closure: M1'''=100%, R'''=0.067 RED, M4=-19.5%

Third daedalus-fourier kernel: VP9 8-tap regular subpel filter, horizontal direction, 8-wide output. Multiply-heavy by design to stress V3D's no-DP4A deficit. Full cycle Phase 1-7 + M4'''.

Phase 5''' second-model review delivered cleanly: it caught 1 RED bug pre-implementation (src_off off-by-3 indexing convention) and 2 YELLOW gaps (assert MUST language, shaderdb filter-LUT gate). Without the review, M1''' would have failed silently on first run with cryptic "high-index source pixels wrong" symptoms.

Phase 6 v1 first-light: M1''' 100.0000% bit-exact (65536/65536 blocks across all 16 mx phases). Phase 5''' filter-LUT prediction materialised exactly: 197 uniforms (gate was 144), 2 threads (down from cycle-2's 4 due to register pressure).

Performance:
  M2''' = 1.413 Mblock/s (707.9 ns/block)
  M3''' = 20.997 Mblock/s (NEON baseline phase3)
  R'''  = 0.067 (RED band, structural mismatch)
  shaderdb: 488 inst, 2 threads, 197 uniforms, 25 max-temps, 0 spills

M4''' concurrent matrix (8s windows):
  NEON 1-core          14.479 Mblock/s
  NEON 4-core          15.248 Mblock/s  <- baseline (compute-bound, not bandwidth-saturated like cycles 1+2!)
  QPU only              1.380 Mblock/s
  MIXED NEON-3 + QPU   12.277 Mblock/s  <- -19.5% (FAIL gate)
  MIXED NEON-4 + QPU   12.158 Mblock/s  <- -20.3%

NEW cross-cycle finding (Phase 9 lesson 2): compute-bound CPU workloads make the QPU-offload story collapse. Cycles 1+2 were bandwidth-saturated (4-core scaling 0.56-0.82x of 1-core), so freeing a CPU core via QPU offload added throughput. Cycle 3 MC is compute-bound (4-core scaling 1.05x of 1-core, near-linear), no free cycles to free. QPU contribution (0.45 Mblock/s in contention) doesn't compensate for losing 1 NEON core delivering ~3.8 Mblock/s.

But 30fps@1080p floor: PASS in every config (1.4x to 15.7x isolation margin). Per project_30fps_floor_is_fine.md, user-facing test never fails: daily YouTube playback works fine on any CPU/QPU split.

DEPLOYMENT RECIPE for higgs (cycle 3 confirmed split):
  IDCT (k1) -> QPU (R=0.92, +7% mixed, frees CPU core)
  LPF (k2)  -> QPU (R=0.41, +7% mixed, frees CPU core)
  MC (k3)   -> CPU (R=0.067, -19.5% mixed, stays on CPU)
  Entropy   -> CPU (structurally serial)

Mixed-substrate deployment, not "QPU does everything". Realistic for higgs: entropy + MC on 2-3 ARM cores; IDCT + LPF dispatched to QPU concurrently; 1-2 ARM cores left for vscode etc.

New artifacts:
- src/v3d_mc_8h.comp: GLSL kernel
- tests/vp9_mc_ref.c: standalone C ref (REGULAR filter embedded; clean transcription)
- tests/bench_neon_mc.c: M1'''_c + M3''' bench
- tests/bench_v3d_mc.c: M1''' + M2''' bench with contract asserts + 30fps margin display
- tests/bench_concurrent_mc.c: M4''' pthread bench
- external/ffmpeg-snapshot/libavcodec/aarch64/vp9mc_neon.S (vendored)
- external/ffmpeg-snapshot/libavcodec/vp9_subpel_filters_table.c (hand-extracted; provides ff_vp9_subpel_filters symbol without dragging in full vp9dsp.c)
- docs/k3_mc_phase{1,2,3,4,5,7}.md: full cycle documentation

Memory updates: project_30fps_floor_is_fine.md (user's 30fps target recalibration), MEMORY.md index updated.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 12:51:43 +00:00
parent 36eca40ff2
commit 356e446a49
15 changed files with 2545 additions and 1 deletions
@@ -55,6 +55,19 @@ set_source_files_properties(${FFASM_LPF_SOURCES} PROPERTIES
    COMPILE_OPTIONS "${FFASM_FLAGS}"
    LANGUAGE ASM)

+# Cycle 3 — VP9 MC interpolation NEON source + filter coefficient table
+# (vendored 2026-05-18). The .c table provides ff_vp9_subpel_filters
+# symbol which vp9mc_neon.S references via movrel.
+set(FFASM_MC_SOURCES
+    ${FFSNAP}/libavcodec/aarch64/vp9mc_neon.S
+)
+set(FFC_MC_SOURCES
+    ${FFSNAP}/libavcodec/vp9_subpel_filters_table.c
+)
+set_source_files_properties(${FFASM_MC_SOURCES} PROPERTIES
+    COMPILE_OPTIONS "${FFASM_FLAGS}"
+    LANGUAGE ASM)
+
 # Tell CMake/gas to preprocess .S sources.
 set_source_files_properties(${FFASM_SOURCES} PROPERTIES
    COMPILE_OPTIONS "${FFASM_FLAGS}"
@@ -76,6 +89,15 @@ add_executable(bench_neon_lpf
    ${FFASM_LPF_SOURCES}
 )
 target_compile_options(bench_neon_lpf PRIVATE -O3 -march=armv8-a+simd)
+
+# Cycle 3 — VP9 MC interpolation NEON baseline.
+add_executable(bench_neon_mc
+    tests/bench_neon_mc.c
+    tests/vp9_mc_ref.c
+    ${FFASM_MC_SOURCES}
+    ${FFC_MC_SOURCES}
+)
+target_compile_options(bench_neon_mc PRIVATE -O3 -march=armv8-a+simd)
 # bench_neon_idct doesn't need vulkan/drm — pure CPU baseline.

 # ---- Vulkan dispatch-overhead microbench (next chunk) ----------------------
@@ -125,7 +147,18 @@ if (DAEDALUS_BUILD_VULKAN)
        VERBATIM
    )

-    add_custom_target(daedalus_shaders ALL DEPENDS ${NOOP_SPV} ${IDCT8_SPV} ${LPF_SPV})
+    set(MC_SPV ${CMAKE_BINARY_DIR}/v3d_mc_8h.spv)
+    add_custom_command(
+        OUTPUT ${MC_SPV}
+        COMMAND ${GLSLANG_VALIDATOR} -V --target-env vulkan1.3
+                -o ${MC_SPV}
+                ${CMAKE_SOURCE_DIR}/src/v3d_mc_8h.comp
+        DEPENDS ${CMAKE_SOURCE_DIR}/src/v3d_mc_8h.comp
+        COMMENT "glslang: v3d_mc_8h.comp -> v3d_mc_8h.spv"
+        VERBATIM
+    )
+
+    add_custom_target(daedalus_shaders ALL DEPENDS ${NOOP_SPV} ${IDCT8_SPV} ${LPF_SPV} ${MC_SPV})

    # v3d_runner — reusable Vulkan plumbing.
    add_library(v3d_runner STATIC src/v3d_runner.c)
@@ -155,6 +188,15 @@ if (DAEDALUS_BUILD_VULKAN)
    target_link_libraries(bench_v3d_lpf PRIVATE v3d_runner Vulkan::Vulkan)
    target_compile_options(bench_v3d_lpf PRIVATE -O2)

+    # Cycle 3 — QPU MC bench.
+    add_executable(bench_v3d_mc
+        tests/bench_v3d_mc.c
+        tests/vp9_mc_ref.c
+    )
+    add_dependencies(bench_v3d_mc daedalus_shaders)
+    target_link_libraries(bench_v3d_mc PRIVATE v3d_runner Vulkan::Vulkan)
+    target_compile_options(bench_v3d_mc PRIVATE -O2)
+
    # M4 — concurrent CPU(NEON) + QPU bench. Links the FFmpeg NEON
    # snapshot so we can run real NEON kernels on pinned CPU cores
    # while the QPU runs its dispatch loop concurrently.
@@ -174,6 +216,16 @@ if (DAEDALUS_BUILD_VULKAN)
    add_dependencies(bench_concurrent_lpf daedalus_shaders)
    target_link_libraries(bench_concurrent_lpf PRIVATE v3d_runner Vulkan::Vulkan pthread)
    target_compile_options(bench_concurrent_lpf PRIVATE -O3 -march=armv8-a+simd)
+
+    # Cycle 3 M4''' — concurrent MC.
+    add_executable(bench_concurrent_mc
+        tests/bench_concurrent_mc.c
+        ${FFASM_MC_SOURCES}
+        ${FFC_MC_SOURCES}
+    )
+    add_dependencies(bench_concurrent_mc daedalus_shaders)
+    target_link_libraries(bench_concurrent_mc PRIVATE v3d_runner Vulkan::Vulkan pthread)
+    target_compile_options(bench_concurrent_mc PRIVATE -O3 -march=armv8-a+simd)
 endif()

 # ---- Summary ----------------------------------------------------------------
@@ -0,0 +1,104 @@
+---
+cycle: 3
+phase: 1
+status: open
+date_opened: 2026-05-18
+parent_cycle: k2_deblock_phase7.md (cycle 2 closed YELLOW-via-M4'' PASS)
+target_kernel: VP9 8-tap MC interpolation, regular filter, horizontal, 8×N block
+dev_host: hertz
+---
+
+# Cycle 3, Phase 1 — MC interpolation kernel goal
+
+Per `k2_deblock_phase7.md` verdict (project continues). MC interpolation
+chosen because: most-common per-frame work in real bitstreams (every
+inter block); multiply-heavy → stresses V3D SMUL24 / lack of DP4A
+directly; VP9+AV1 both use the same 8-tap structure.
+
+## Kernel under test
+
+**VP9 8-tap regular subpel filter, horizontal direction, 8×N block,
+"put" (non-averaging) mode.**
+
+libavcodec symbol: `ff_vp9_put_8tap_regular_8h_neon` (and equivalents
+for smooth/sharp filter types). C reference: `put_8tap_regular_8h_c`
+from `libavcodec/vp9dsp_template.c` (instantiated via the
+`filter_fn_1d(8, h, mx, regular, FILTER_8TAP_REGULAR, put)` macro
+expansion).
+
+I/O contract (per VP9 spec § 8.5.1 — subpel motion compensation):
+```c
+void put_8tap_regular_8h_c(uint8_t *dst, ptrdiff_t dst_stride,
+                           const uint8_t *src, ptrdiff_t src_stride,
+                           int h, int mx, int my);
+```
+
+- `dst` : destination block, written
+- `dst_stride` : destination row stride
+- `src` : source block, read (with -3..+4 column overhang for horizontal)
+- `src_stride` : source row stride
+- `h` : block height (typically 8 for 8×8)
+- `mx` : x-axis subpel phase ∈ [0, 15]
+- `my` : y-axis subpel phase (unused for horizontal-only filter)
+
+Per output pixel:
+```
+out[r][c] = clip((sum_{k=0..7} filter[k] * src[r][c+k-3] + 64) >> 7, 0, 255)
+```
+
+Filter coefficients: `ff_vp9_subpel_filters[FILTER_8TAP_REGULAR][mx][0..7]`
+(int16, signed; 16 phases; sum to 128).
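The per-pixel formula above can be sketched in C as follows. This is a minimal illustration, not the vendored `tests/vp9_mc_ref.c`: the names `mc_tap8_h`, `PHASE0` and `PHASE8` are hypothetical, and the two coefficient rows are quoted from memory of the VP9 regular bank, so they should be checked against `ff_vp9_subpel_filters` before being relied on.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative rows only (assumed to correspond to mx = 0 and mx = 8 of
 * the REGULAR bank; verify against the vendored table). Each sums to 128. */
static const int16_t PHASE0[8] = { 0, 0, 0, 128, 0, 0, 0, 0 };
static const int16_t PHASE8[8] = { -1, 6, -19, 78, 78, -19, 6, -1 };

/* One output pixel. `src` points at the pixel aligned with the output
 * column, so the taps read src[-3] .. src[+4]. */
static uint8_t mc_tap8_h(const uint8_t *src, const int16_t f[8])
{
    assert(src != 0 && f != 0);
    int sum = 0;
    for (int k = 0; k < 8; k++)
        sum += f[k] * src[k - 3];
    /* Add half of 128, then divide by 128. An arithmetic right shift is
     * assumed for negative sums, as on mainstream compilers. */
    int v = (sum + 64) >> 7;
    if (v < 0)   v = 0;     /* clip to the 8-bit pixel range */
    if (v > 255) v = 255;
    return (uint8_t)v;
}
```

Because every row sums to 128, a flat source region passes through unchanged for any phase, which makes a cheap sanity check before running the full bit-exact comparison.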
+
+## Measurable success criteria (cycle-3 numbering)
+
+| ID | Measurement | Gate |
+|---|---|---|
+| **M1'''** | Bit-exact match rate vs C reference, ≥10 000 random 8×8 blocks (all 16 mx phases sampled) | 100.0000 % |
+| **M2'''** | QPU throughput in Mblock/s | recorded |
+| **M3'''** | NEON `ff_vp9_put_8tap_regular_8h_neon` throughput, single-core | recorded |
+| **M4'''** | MIXED NEON-3 + QPU vs pure NEON-4 (only if YELLOW band) | conditional |
+
+Derived: **R''' = M2''' / M3'''**.
+
+## Decision rules (same as cycle 1/2)
+
+R''' bands and verdicts unchanged (see `phase1.md` and `k2_deblock_phase1.md`).
+Cycle-2 calibration adjustment: ORANGE band (0.1 ≤ R''' < 0.5) is
+no longer auto-close — run M4''' regardless.
+
+Predicted R''' band: **0.4–0.8.**
+- MC is more compute-bound than LPF (8 mults + 7 adds per output
+  pixel; 64 pixels per block → ~960 ops per block)
+- Bandwidth-equivalent to LPF (per-block ~120 B read + 64 B write
+  ≈ 184 B → similar 5-6 MB/frame at 32 400 blocks)
+- V3D SMUL24 covers the filter × pixel mults (int16 coefficient × 8b pixel) without overflow
+- But no DP4A means we lose the typical "4× INT8 speedup" CPUs get
+  via SDOT — V3D does these as scalar SMUL24
+
+## Cycle 1+2 lessons baked in from start
+
+Per `k2_deblock_phase7.md §"Phase 9 lessons"`:
+
+1. WG=256, 2-per-subgroup adaptation, uint8_t SSBO, oob early-return,
+   NO chained ternary — these are the v1 defaults.
+2. Phase 5 second-model review is mandatory.
+3. R isolation is misleading; M4''' is the real gate.
+4. Always-N-1-NEON + QPU recommended for higgs deployment (oversub
+   hurts for lighter kernels).
+5. shaderdb at 4 threads / 0 spills = compiler delivered; further
+   optimisation must target algorithm, not compile shape.
+
+## Phase 2 → Phase 3 hand-off
+
+Phase 2 must:
+- Vendor `libavcodec/aarch64/vp9mc_neon.S` from FFmpeg n7.1.3
+  (matches existing snapshot pin)
+- Confirm `ff_vp9_subpel_filters` definition source
+  (`libavcodec/vp9dsp.c:32`, just the 16 × 8 REGULAR row needed)
+- Pin the exact NEON symbol naming
+
+Phase 3 must:
+- Write standalone C ref (`tests/vp9_mc_ref.c`) with REGULAR filter
+  table embedded
+- Write `tests/bench_neon_mc.c` (M1'''_c gate + M3''')
+- Capture M3''' before any QPU work
@@ -0,0 +1,109 @@
+---
+cycle: 3
+phase: 2
+status: closed 2026-05-18
+date_opened: 2026-05-18
+parent: k3_mc_phase1.md
+---
+
+# Cycle 3, Phase 2 — MC situation analysis
+
+## 1. C reference
+
+- **Source**: `external/ffmpeg-snapshot/libavcodec/vp9dsp_template.c`
+  (already vendored from cycle 1).
+- **Function**: `put_8tap_regular_8h_c` generated by
+  `filter_fn_1d(8, h, mx, regular, FILTER_8TAP_REGULAR, put)` —
+  expands to call `do_8tap_1d_c` with `ds=1` (horizontal) and the
+  REGULAR filter bank.
+- **Underlying primitive**: `do_8tap_1d_c` iterates `h` rows;
+  per row, iterates `w=8` columns; per column, computes the
+  `FILTER_8TAP` macro: `clip((sum_{k=0..7} F[k] * src[x+k-3]
+  + 64) >> 7, 0, 255)`.
+- **Spec**: VP9 specification § 8.5.1 (subpel motion compensation).
+
+## 2. NEON reference
+
+- **Source**: `external/ffmpeg-snapshot/libavcodec/aarch64/vp9mc_neon.S`
+  (vendored 2026-05-18, FFmpeg n7.1.3, SHA-256
+  `6b1d50f9821742584fdd47758057f810644aff3a008faaa774ff5b9cac4d1fef`).
+- **Symbol**: `ff_vp9_put_regular8_h_neon` (note: filter type baked
+  into name, width=8 baked in, h-direction baked in)
+- **Signature** (VP9 `vp9_mc_func` typedef):
+  ```c
+  void ff_vp9_put_regular8_h_neon(uint8_t *dst, ptrdiff_t dst_stride,
+                                  const uint8_t *src, ptrdiff_t src_stride,
+                                  int h, int mx, int my);
+  ```
+  Registers: `x0=dst, x1=dst_stride, x2=src, x3=src_stride, w4=h, w5=mx, w6=my`.
+- **Dependencies**:
+  - `libavutil/aarch64/asm.S` ✓ (already vendored)
+  - `ff_vp9_subpel_filters[3][16][8]` symbol — provided by
+    `external/ffmpeg-snapshot/libavcodec/vp9_subpel_filters_table.c`
+    (hand-extracted from `libavcodec/vp9dsp.c` of the same n7.1.3
+    pin; copying just the constant data avoids dragging in the
+    rest of `vp9dsp.c` which would require linking the entire VP9
+    decoder).
+
+## 3. Workload model
+
+Per 8×8 block output:
+- 8 multiplies × 8 columns × 8 rows = **512 multiplies**
+- 7 additions × 8 columns × 8 rows = 448 additions
+- 1 round (+64), 1 shift (>>7), 1 clip per pixel × 64 = 192 ops
+- Total ~1150 integer ops per block
+
+Per-block memory (horizontal-only filter, 8-pixel-wide output):
+- Read: 8 rows × (8 output cols + 7 tap overhang) = 8 × 15 = **120 source bytes**
+- Write: 8 rows × 8 cols = **64 dst bytes**
+- Total: **~184 bytes / block**
+
+Per 1080p frame (32 400 8×8 blocks, worst case all-MC):
+- ~5.9 MB total memory traffic
+- ~37 Mops compute
+- At GPU 4 GB/s share: 1.48 ms / frame = 675 FPS = 21.9 Mblock/s
+- At V3D 92 GFLOPS theoretical scalar (SMUL24 throughput ≈ FP MUL): 0.4 ms compute / frame = 2500 FPS theoretical → **compute is NOT the bottleneck** at this shape
+
+So MC is **bandwidth-bound on the QPU**, similar to LPF cycle 2.
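The envelope arithmetic in this section can be reproduced with a few lines of C. This is a sketch under the document's own planning assumptions (a 4 GB/s GPU bandwidth share, taken here as 4e9 bytes/s, and a 92 G scalar ops/s peak); the function names are hypothetical.

```c
#include <assert.h>

enum {
    BLOCKS_PER_1080P_FRAME = 32400,
    BYTES_PER_BLOCK        = 120 + 64,          /* source read + dst write */
    OPS_PER_BLOCK          = 512 + 448 + 192    /* mults + adds + round/shift/clip */
};

/* Milliseconds to move one frame's MC traffic at `bytes_per_s`. */
static double bandwidth_ms_per_frame(double bytes_per_s)
{
    assert(bytes_per_s > 0.0);
    return 1e3 * BLOCKS_PER_1080P_FRAME * (double)BYTES_PER_BLOCK / bytes_per_s;
}

/* Milliseconds to execute one frame's integer ops at `ops_per_s`. */
static double compute_ms_per_frame(double ops_per_s)
{
    assert(ops_per_s > 0.0);
    return 1e3 * BLOCKS_PER_1080P_FRAME * (double)OPS_PER_BLOCK / ops_per_s;
}

/* Convert a per-frame time into block throughput in Mblock/s. */
static double mblock_per_s(double ms_per_frame)
{
    assert(ms_per_frame > 0.0);
    return BLOCKS_PER_1080P_FRAME / ms_per_frame / 1e3;
}
```

With unrounded inputs the bandwidth envelope comes out near 1.49 ms per frame and about 21.7 Mblock/s; the 1.48 ms and 21.9 Mblock/s quoted above follow from rounding the traffic to 5.9 MB first. The compute envelope at the theoretical peak is roughly 0.4 ms, which is why bandwidth is the tighter bound in this model.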
+
+## 4. Per-row workload diversity (vs cycle 1+2)
+
+| | IDCT (k1) | LPF (k2) | MC (k3) |
+|---|---|---|---|
+| Per-block math | Heavy butterflies (~60 ops/block via separable transform) | Light: 0-30 ops per edge × 8 rows | 8-tap convolution: 1150 ops per block |
+| Per-block memory | ~320 B in + 64 B out | ~64 B in + ~24 B out per edge | 120 B in + 64 B out |
+| Compute / memory ratio | High | Low (memory-bound, lots of skipping) | Medium (compute-rich but bandwidth-bound at GPU) |
+| Conditional? | No (always-execute) | Yes (fm/hev divergence per row) | No (deterministic per pixel) |
+| QPU mult intensity | Q14 16b×16b mults | Light (compares, small clips) | 16b×8b mults (filter × pixel) |
+
+MC is interesting because it's **compute-rich AND bandwidth-bound** —
+the closest match in workload shape to a real-world GPU compute kernel
+the V3D was designed for (graphics filtering).
+
+## 5. Constraints carried from cycle 1+2
+
+Same V3D 7.1 device profile (vulkaninfo unchanged). The relevant
+specifics for MC:
+- No DP4A → 8-tap convolution must be 8 separate SMUL24 + ADDs
+  (the typical GPU "dot4" packing is not available)
+- shaderInt16 = false → filter coefficients widened to int32 in
+  registers; the filter table itself can be a uint16-storage SSBO
+- shaderInt8 = false → source pixels widened to int32 in registers
+- 1024-byte (16 KiB / 16) shared mem per WG is ample for MC source
+  staging if useful (15 cols × 8 rows × 1 byte per block-row × 32
+  blocks per WG = 3 840 B per WG); for v1 we skip shared-mem
+  staging and let TMU handle reads directly
+
+## 6. What Phase 2 does *not* close
+
+- Per-block (block_y, block_x) layout / meta format. Phase 4 picks.
+  Likely same shape as cycle 2 (uvec4 per block: dst_offset,
+  src_offset, mx, _pad).
+- Filter table residency: as SSBO load every row, push-constants
+  per dispatch (different mx per dispatch), or constant baked into
+  shader (one filter per shader = 16 specialised shaders for the 16
+  mx phases). Phase 4 picks; v1 likely SSBO for simplicity.
+- Vertical / "hv" / "avg" / 4-pixel / 16-pixel / 32-pixel / 64-pixel
+  variants — out of cycle 3 scope; cycle 4+ if needed.
+
+Phase 3 next: build `tests/bench_neon_mc.c`, capture M3'''.
@@ -0,0 +1,77 @@
+---
+cycle: 3
+phase: 3
+status: closed 2026-05-18
+date_opened: 2026-05-18
+parent: k3_mc_phase2.md
+host: hertz
+---
+
+# Cycle 3, Phase 3 — NEON M3''' baseline
+
+## Raw
+
+```
+=== M1'''_c bit-exact (10000 random blocks) ===
+M1'''_c correctness: 10000 / 10000 blocks bit-exact (100.0000%)
+  mx phase coverage: min=577 max=668 (16 phases sampled)
+
+=== M3''' NEON throughput ===
+M3''' NEON throughput:
+  blocks/batch:    65536
+  batches done:    939
+  total blocks:    61 538 304
+  elapsed (kernel)=2.930751 s
+  elapsed (setup) =2.075477 s
+  throughput      = 20.997 Mblock/s
+  per-block       = 47.6 ns
+  equiv 1080p     = 648.1 FPS  (32400 blocks/frame)
+```
+
+## Numbers
+
+| | |
+|---|---|
+| **M1'''_c (bit-exact)** | **100.0000 %** vs `daedalus_vp9_put_regular_8h_ref` |
+| mx coverage | all 16 phases sampled, uniformly within ±10 % of expected count |
+| **M3''' (throughput)** | **20.997 Mblock/s** single-core |
+| per-block | 47.6 ns |
+| cycles/block | 47.6 ns × 2.8 GHz ≈ 133 cycles |
+| 1080p FPS-eq | 648 FPS |
+
+## Comparison across cycles
+
+| | IDCT (k1) | LPF (k2) | MC (k3) |
+|---|---|---|---|
+| Per-unit ns (NEON) | 122 | 20.7 (per edge) | 47.6 |
+| 1080p FPS-eq | 252 | 748 (worst edges) | 648 |
+| Compute character | Q14 butterflies + transpose | abs+compare+small mults | 8-tap convolution, mult-heavy |
+| NEON win | SMLA + transpose | SMULL + saturate | SDOT-style packing |
+
+MC NEON is fast — at ~2.6× IDCT throughput per unit. The A76's SDOT
+or SMULL-pair pattern handles 8-tap convolution extremely well; this
+is precisely the workload NEON SIMD was built for. **The QPU's
+break-even point on cycle 3 is correspondingly tight.**
+
+## Predictions for M2''' / R'''
+
+V3D 7.1 has SMUL24 (8b×8b → 16b sufficient) but **no DP4A**, so the
+QPU must do 8 separate SMUL24 + ADD per output pixel. Bandwidth-wise
+MC is similar to LPF (~6 MB / 1080p frame). Compute-wise much heavier
+than LPF.
+
+- Compute-envelope (idealised): 32 400 blocks × 1 150 ops = 37 Mops
+  per frame. At v3d 92 GFLOPS theoretical × 23 % util ≈ 21 GOPS
+  effective → 1.8 ms / frame → 540 FPS → 17.5 Mblock/s
+- Bandwidth-envelope: 5.9 MB/frame ÷ 4 GB/s ≈ 1.48 ms/frame → 22 Mblock/s
+- Combined: min(compute, bandwidth) ≈ 17.5 Mblock/s
+
+**Predicted R''' = 17.5 / 21.0 ≈ 0.83** isolation. Likely YELLOW
+band by a small margin.
+
+Honest lower bound: if SMUL24-vs-DP4A penalty is bigger than
+estimated (CPU SDOT does 4 INT8 MACs in one instruction; the QPU
+needs 4× more cycles for the same work in the worst case), R'''
+could land near 0.5-0.6. Phase 7''' measures.
+
+Phase 4 next.
@@ -0,0 +1,207 @@
+---
+cycle: 3
+phase: 4
+status: open (awaiting Phase 5''' review)
+date_opened: 2026-05-18
+parent: k3_mc_phase3.md
+template: phase4.md (cycle 1) + k2_deblock_phase4.md (cycle 2) — same constraints, same patterns
+---
+
+# Cycle 3, Phase 4 — Plan QPU MC kernel
+
+Compact plan. Cycle 1+2 phase4 docs cover the constraint matrix
+(C1-C10) and the dev-discipline patterns. Phase 4''' references
+them rather than re-deriving.
+
+## 1. Constraints (carried)
+
+Same V3D 7.1 device. New for MC specifically:
+- SMUL24 covers 16-bit filter × 8-bit pixel mults (max ~32K product, fits)
+- Sum of 8 products fits in int32 trivially
+- No DP4A — must use 8 separate scalar muls per output pixel
+- 16 filter phases × 8 taps × 2 B = 256 B — too big for push constants
+  (max 128 B), small enough for one const array in shader
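The "fits" claims above can be checked with a worst-case bound per filter row. The helper below is a hypothetical sketch, not part of the bench: it assumes 8-bit unsigned pixels and computes the accumulator extremes by feeding 255 to every positive tap and 0 to every negative tap (and the reverse for the minimum).

```c
#include <assert.h>
#include <stdint.h>

/* Worst-case accumulator range of one 8-tap filter row over 8-bit pixels. */
static void tap8_acc_range(const int16_t f[8], int32_t *lo, int32_t *hi)
{
    assert(f != 0 && lo != 0 && hi != 0);
    int32_t pos = 0, neg = 0;
    for (int k = 0; k < 8; k++) {
        if (f[k] > 0) pos += f[k];   /* taps that raise the sum */
        else          neg += f[k];   /* taps that lower it (or zero) */
    }
    *hi = pos * 255;
    *lo = neg * 255;
}
```

For a row such as `{-1, 6, -19, 78, 78, -19, 6, -1}` (assumed here to be a representative REGULAR phase) this gives a range of -10 200 to 42 840: far inside int32, but after `(sum + 64) >> 7` the upper end exceeds 255 and the lower end is negative, which is why the final clamp is required.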
+
+## 2. Workload model
+
+Per 8×8 block:
+- 512 SMUL24 (8 mults × 64 output pixels)
+- 448 ADD (7 adds × 64 output pixels)
+- 64 round (+64 → >>7) operations
+- 64 clip-to-[0,255]
+- ≈ 1150 ALU ops per block
+- 120 B read + 64 B write = 184 B per block
+
+Per 1080p frame (32 400 blocks):
+- ~37 Mops compute → 1.8 ms at v3d 23 % sustained (compute-bound estimate)
+- ~5.9 MB traffic → 1.48 ms at 4 GB/s GPU share (bandwidth-bound estimate)
+
+## 3. Workgroup geometry
+
+Bake in the v4 lesson and the cycle-2 single-WG-size-from-start:
+
+- `local_size_x = 256` (16 subgroups × 16 lanes)
+- 8 lanes per block (1 lane per row r=0..7), 2 blocks per subgroup
+- **32 blocks per WG**
+- 1080p: 1 013 WGs
+
+Same lane decomposition as cycle 2 LPF:
+```
+edge_slot  = lane_in_sg >> 3    // 0 or 1 — "which block in this subgroup"
+row        = lane_in_sg & 7     // 0..7
+block_local = sg_in_wg * 2 + edge_slot
+block_idx   = wg_id * 32 + block_local
+oob = block_idx >= n_blocks
+```
+
+No barrier needed, no shared mem. Safe early-return on oob.
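The lane decomposition above can be modelled on the CPU to confirm that each workgroup covers its 32 blocks × 8 rows exactly once. This is a sketch only: `mc_lane_map` and `mc_lane_t` are hypothetical names, and it assumes subgroups are consecutive 16-lane slices of the local invocation index.

```c
#include <assert.h>
#include <stdint.h>

enum { WG_SIZE = 256, SUBGROUP_SIZE = 16, BLOCKS_PER_WG = 32 };

typedef struct {
    uint32_t block_idx;  /* global 8x8 block handled by this lane */
    uint32_t row;        /* row 0..7 inside that block */
    int      oob;        /* non-zero when block_idx >= n_blocks */
} mc_lane_t;

/* Map (workgroup id, local invocation index) to (block, row). */
static mc_lane_t mc_lane_map(uint32_t wg_id, uint32_t local_idx, uint32_t n_blocks)
{
    assert(local_idx < (uint32_t)WG_SIZE);
    uint32_t sg_in_wg   = local_idx / SUBGROUP_SIZE;
    uint32_t lane_in_sg = local_idx % SUBGROUP_SIZE;
    uint32_t edge_slot  = lane_in_sg >> 3;      /* which block in this subgroup */
    mc_lane_t m;
    m.row       = lane_in_sg & 7u;
    m.block_idx = wg_id * BLOCKS_PER_WG + sg_in_wg * 2u + edge_slot;
    m.oob       = m.block_idx >= n_blocks;      /* shader early-returns here */
    return m;
}
```

At 1080p (32 400 blocks) this mapping needs 1 013 workgroups, and the last one has only 16 in-range blocks, so the `oob` early-return is exercised on every full-frame dispatch.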
+
+## 4. Per-thread algorithm
+
+```glsl
+if (block_idx >= pc.n_blocks) return;
+
+uvec4 m = u_meta.meta[block_idx];
+uint dst_off = m.x;
+uint src_off = m.y;
+uint mx      = m.z & 15u;
+
+// Read 15 source pixels for this row.
+uint src_row_addr = src_off + row * pc.src_stride_u8;
+int s0  = int(u_src.src[src_row_addr +  0u]);
+int s1  = int(u_src.src[src_row_addr +  1u]);
+int s2  = int(u_src.src[src_row_addr +  2u]);
+int s3  = int(u_src.src[src_row_addr +  3u]);
+int s4  = int(u_src.src[src_row_addr +  4u]);
+int s5  = int(u_src.src[src_row_addr +  5u]);
+int s6  = int(u_src.src[src_row_addr +  6u]);
+int s7  = int(u_src.src[src_row_addr +  7u]);
+int s8  = int(u_src.src[src_row_addr +  8u]);
+int s9  = int(u_src.src[src_row_addr +  9u]);
+int s10 = int(u_src.src[src_row_addr + 10u]);
+int s11 = int(u_src.src[src_row_addr + 11u]);
+int s12 = int(u_src.src[src_row_addr + 12u]);
+int s13 = int(u_src.src[src_row_addr + 13u]);
+int s14 = int(u_src.src[src_row_addr + 14u]);
+
+// Filter coefficients — const REGULAR table, indexed by mx.
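+int F0 = FILTER_REGULAR[mx][0];
+int F1 = FILTER_REGULAR[mx][1];
+int F2 = FILTER_REGULAR[mx][2];
+int F3 = FILTER_REGULAR[mx][3];
+int F4 = FILTER_REGULAR[mx][4];
+int F5 = FILTER_REGULAR[mx][5];
+int F6 = FILTER_REGULAR[mx][6];
+int F7 = FILTER_REGULAR[mx][7];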
+
+// 8 output pixels (each = 8-tap convolution of 8 consecutive source).
+uint dst_row_addr = dst_off + row * pc.dst_stride_u8;
+
+int o0 = F0*s0 + F1*s1 + F2*s2 + F3*s3 + F4*s4 + F5*s5 + F6*s6 + F7*s7;
+int o1 = F0*s1 + F1*s2 + F2*s3 + F3*s4 + F4*s5 + F5*s6 + F6*s7 + F7*s8;
+int o2 = F0*s2 + F1*s3 + F2*s4 + F3*s5 + F4*s6 + F5*s7 + F6*s8 + F7*s9;
+int o3 = F0*s3 + F1*s4 + F2*s5 + F3*s6 + F4*s7 + F5*s8 + F6*s9 + F7*s10;
+int o4 = F0*s4 + F1*s5 + F2*s6 + F3*s7 + F4*s8 + F5*s9 + F6*s10+ F7*s11;
+int o5 = F0*s5 + F1*s6 + F2*s7 + F3*s8 + F4*s9 + F5*s10+ F6*s11+ F7*s12;
+int o6 = F0*s6 + F1*s7 + F2*s8 + F3*s9 + F4*s10+ F5*s11+ F6*s12+ F7*s13;
+int o7 = F0*s7 + F1*s8 + F2*s9 + F3*s10+ F4*s11+ F5*s12+ F6*s13+ F7*s14;
+
+u_dst.dst[dst_row_addr + 0u] = uint8_t(clamp((o0 + 64) >> 7, 0, 255));
+u_dst.dst[dst_row_addr + 1u] = uint8_t(clamp((o1 + 64) >> 7, 0, 255));
+u_dst.dst[dst_row_addr + 2u] = uint8_t(clamp((o2 + 64) >> 7, 0, 255));
+u_dst.dst[dst_row_addr + 3u] = uint8_t(clamp((o3 + 64) >> 7, 0, 255));
+u_dst.dst[dst_row_addr + 4u] = uint8_t(clamp((o4 + 64) >> 7, 0, 255));
+u_dst.dst[dst_row_addr + 5u] = uint8_t(clamp((o5 + 64) >> 7, 0, 255));
+u_dst.dst[dst_row_addr + 6u] = uint8_t(clamp((o6 + 64) >> 7, 0, 255));
+u_dst.dst[dst_row_addr + 7u] = uint8_t(clamp((o7 + 64) >> 7, 0, 255));
+```
+
+Mirrors `tests/vp9_mc_ref.c` directly.
+
+## 5. SSBOs / push constants
+
+| binding | name | type | usage |
+|---|---|---|---|
+| 0 | `meta` | `readonly uvec4[]` | per-block (dst_off, src_off, mx, _pad) |
+| 1 | `dst` | `uint8_t[]` | output pixels |
+| 2 | `src` | `readonly uint8_t[]` | input pixels |
+
+Push constants (16 B):
+```
+n_blocks, dst_stride_u8, src_stride_u8, _pad
+```
+
+Filter table: hard-coded in shader as
+`const int FILTER_REGULAR[16][8] = { ... };` — 128 const ints.
+
+**Race safety:** lane r writes `dst[dst_off + r*dst_stride + 0..7]`
+(8 contiguous bytes). For rows r and r+1, writes are `r*stride + 7`
+and `(r+1)*stride + 0`. Disjoint iff `dst_stride ≥ 8`.
+
+**Contracts (revised per phase5''' findings 4 + 6):**
+1. `dst_stride_u8 ≥ 8` (race-safety lower bound)
+2. `src_stride_u8 ≥ 15` (per-row read span)
+3. `dst_off + 7 + (r_max)*dst_stride < dst_buffer_size`
+4. `src_off + 14 + (r_max)*src_stride < src_buffer_size`
+5. **`src_off` is the byte offset of the FIRST byte of the source
+   block's row 0 in the SSBO buffer — NOT shifted by +3.** The
+   C bench's `src + 3` C-caller convention does not carry into
+   the SSBO offset. Shader reads `s[k] = u_src.src[src_off +
+   row*stride + k]` for k=0..14, which equals
+   `master_src[block_base + row*stride + k]`, matching the C ref's
+   per-row read of `master_src[block_base + row*stride + (x..x+7)]`
+   for output col x ∈ 0..7.
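Contract 5 can be illustrated by putting the two addressing conventions side by side. The sketch below is hypothetical code, not the bench itself: `row_ref_caller` uses the C-caller convention (pointer already advanced by 3), while `row_ssbo` uses the raw offset the shader is specified to receive.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static uint8_t clip_u8(int v)
{
    if (v < 0)   return 0;
    if (v > 255) return 255;
    return (uint8_t)v;
}

/* C-caller convention: `src` points at the pixel under output column 0,
 * so the taps for the whole row read src[-3] .. src[+11]. */
static void row_ref_caller(uint8_t *dst, const uint8_t *src, const int16_t f[8])
{
    assert(dst != 0 && src != 0);
    for (int x = 0; x < 8; x++) {
        int sum = 0;
        for (int k = 0; k < 8; k++)
            sum += f[k] * src[x + k - 3];
        dst[x] = clip_u8((sum + 64) >> 7);
    }
}

/* SSBO convention: `src_off` is the raw offset of the row's first
 * readable byte; the row reads buf[src_off + 0] .. buf[src_off + 14]. */
static void row_ssbo(uint8_t *dst, const uint8_t *buf, size_t src_off,
                     const int16_t f[8])
{
    assert(dst != 0 && buf != 0);
    for (int x = 0; x < 8; x++) {
        int sum = 0;
        for (int k = 0; k < 8; k++)
            sum += f[k] * buf[src_off + (size_t)(x + k)];
        dst[x] = clip_u8((sum + 64) >> 7);
    }
}
```

Calling `row_ref_caller(dst, buf + base + 3, f)` and `row_ssbo(dst, buf, base, f)` touches the same 15 bytes. Passing `base + 3` to `row_ssbo` instead would shift the window to `base + 3 .. base + 17`, which with a 16-byte stride runs two bytes into the next row: the off-by-3 failure that the review flagged.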
+
+**Phase 6 MUST** add `assert(dst_stride_u8 >= 8 && src_stride_u8 >= 15)`
+in `bench_v3d_mc.c`'s meta-construction loop. **Phase 6 MUST** also
+run `V3D_DEBUG=shaderdb` after first compile and record uniform
+count. If uniform count > ~144 (a fall-out indicator that the
+filter LUT inflated unfavorably), escalate filter to a dedicated
+SSBO binding 3.
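One possible shape for the meta-construction checks is sketched below. The helper names and argument layout are hypothetical (the actual `bench_v3d_mc.c` may structure this differently); the conditions are contracts 1-4 as listed above, with `r_max = 7` for an 8-row block.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns 1 when one block's meta satisfies contracts 1-4, else 0. */
static int mc_meta_ok(uint32_t dst_off, uint32_t src_off,
                      uint32_t dst_stride_u8, uint32_t src_stride_u8,
                      size_t dst_size, size_t src_size)
{
    const uint64_t r_max = 7;
    if (dst_stride_u8 < 8)  return 0;   /* contract 1: row writes stay disjoint */
    if (src_stride_u8 < 15) return 0;   /* contract 2: 15-byte per-row read span */
    /* 64-bit arithmetic so large offsets cannot wrap around. */
    uint64_t last_dst = (uint64_t)dst_off + 7  + r_max * dst_stride_u8;
    uint64_t last_src = (uint64_t)src_off + 14 + r_max * src_stride_u8;
    return last_dst < dst_size && last_src < src_size;   /* contracts 3 and 4 */
}

/* Hard-failing variant for use inside the meta-construction loop. */
static void mc_meta_assert(uint32_t dst_off, uint32_t src_off,
                           uint32_t dst_stride_u8, uint32_t src_stride_u8,
                           size_t dst_size, size_t src_size)
{
    assert(dst_stride_u8 >= 8 && src_stride_u8 >= 15);
    assert(mc_meta_ok(dst_off, src_off, dst_stride_u8, src_stride_u8,
                      dst_size, src_size));
}
```

Note that a bounds check of this kind only catches a +3-shifted `src_off` on the last block of a tightly sized buffer; elsewhere the shifted read stays in bounds and silently picks up the next row's bytes, so the bit-exact M1''' gate remains the real safeguard.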
+
+## 6. Predicted M2''' / R'''
+
+From Phase 3:
+- Compute envelope: 17.5 Mblock/s
+- Bandwidth envelope: 22.0 Mblock/s
+- min ≈ 17.5 Mblock/s
+- R''' isolation = 17.5 / 20.997 ≈ **0.83** (YELLOW, near GREEN)
+
+Honest lower bound R''' = 0.5-0.6 if SMUL24-vs-DP4A penalty bites
+harder. Phase 7''' measures.
+
+## 7. WILL / WILL NOT touch
+
+WILL (Phase 6 creates):
+- `src/v3d_mc_8h.comp` — GLSL shader
+- `tests/bench_v3d_mc.c` — harness with contract asserts
+- CMake updates
+
+WILL NOT touch:
+- Cycle 1/2 artifacts (frozen Phase 3 baselines)
+- `external/ffmpeg-snapshot/` (frozen vendored sources, including
+  the just-added `vp9_subpel_filters_table.c`)
+- `src/v3d_runner.{c,h}` (reusable as-is)
+
+## 8. Phase 5''' review prompts
+
+Specific high-risk decisions:
+1. **Orientation / arithmetic correctness** — the 8 `o0..o7`
+   expressions in §4 are stencil-aligned. Verify the off-by-one
+   in `F[k] * s[c+k]` matches `F[k] * src[x+k-3]` after the
+   `src+3` indexing shift used by the bench.
+2. **Filter table residency** — hard-coded const array vs SSBO
+   vs push constants. Const is simplest but may cause v3d_compiler
+   to generate a large constant LUT. Worth verifying via shaderdb.
+3. **Race safety** — same shape as cycle 2 (different rows of
+   same block disjoint iff stride ≥ row-width). Verify
+   `dst_stride ≥ 8` contract.
+4. **`src+3` index shift** — the bench's source layout puts the
+   "row-0 col-0 source pixel" at `src + 3` (so src has -3..+12
+   reachable). Make sure the QPU shader applies this offset
+   consistently to its `src_off` meta value.
+   **RESOLVED (phase5''' finding 4, RED):** `src_off` is the raw
+   block-base offset (NOT +3-shifted). See §5 contract 5.
+5. **Anything missing.**
+
+## 9. Phase 6 execution order
+
+1. Write shader, get glslang to accept (likely 0 spills, ≥2 threads)
+2. Write bench with asserts + meta layout
+3. Run M1''' bit-exact (gate)
+4. Run M2''' (throughput)
+5. If R''' < 1.0 → M4''' concurrent
+6. Phase 7''' verdict
@@ -0,0 +1,71 @@
+---
+cycle: 3
+phase: 5
+status: closed 2026-05-18 — PASS-WITH-REVISIONS, revisions applied
+date_opened: 2026-05-18
+date_closed: 2026-05-18
+parent: k3_mc_phase4.md
+reviewer: Claude Sonnet (general-purpose Agent, fresh context)
+plan_author: Claude Opus 4.7 (this session)
+verdict: PASS-WITH-REVISIONS
+---
+
+# Cycle 3, Phase 5 — Second-Model Review of MC Plan
+
+Same handoff: in-session Agent (Sonnet, fresh context), files read
+direct from disk, 5 review prompts + "anything else."
+
+Outcome: **1 RED (off-by-3 `src_off` indexing bug)**, **2 YELLOW**
+(shaderdb LUT gate for filter table, "MUST" assert language for
+contracts). Cycle-1+2 RED patterns (write race, barrier UB,
+subgroup-ops table error) did not recur.
+
+**Phase 5 paid off again.** The RED would have caused a bit-exact
+mismatch on the first run with cryptic "high index source pixels are
+wrong" symptoms — likely 1-2 debug cycles to track down without the
+review.
+
+## Review (verbatim)
+
+````markdown
+## Verdict
+PASS-WITH-REVISIONS — no RED-class correctness bugs. Two YELLOW findings
+require plan amendments before Phase 6 proceeds. ...
+
+[full review preserved — reviewer's RED finding 4 traces the off-by-3:
+shader's `src_off = block_base + 3` + `src_stride_u8 = 16` + reading
+`s[0..14]` causes high-index reads to spill into next row]
+````
+*(Verbatim review in agent output; key findings paraphrased below.)*
+
+| # | Severity | Issue | Resolution |
+|---|---|---|---|
+| 1 (orientation) | GREEN | All 8 oN expressions stencil-aligned correctly | accepted |
+| 2 (filter LUT) | YELLOW | `const int FILTER_REGULAR[16][8]` may inflate uniform count or compile to large LUT | Phase 6 to record uniform count via `V3D_DEBUG=shaderdb`; if >~144 uniforms, escalate filter to SSBO binding 3 |
+| 3 (race safety) | GREEN-w/note | `stride ≥ 8` contract correct; phrasing softer than cycle-2 standard | applied: §5 MUST assert |
+| 4 (`src_off` semantics) | **RED** | Plan said "src_off mirrors src+3"; with stride=16 shader's `s13`/`s14` read into next row's first 2 bytes | **applied: src_off = raw block base (no +3 shift); shader reads s[0..14] from there** |
+| 5 (missing) | GREEN-w/note | Coefficient overflow safely fits int32 (worked bound); no missing barrier-UB or write-race issues | accepted |
+| 6 (assert MUST language) | YELLOW | "Bench enforces with asserts" softer than cycle-2 MUST pattern | applied: §5 MUST language |
+| 7 (no barrier OK) | GREEN | Cycle-1 finding-7 doesn't apply (no barrier) | accepted |
+| 8 (filter table matches) | GREEN | `vp9_mc_ref.c` filter values match `vp9_subpel_filters_table.c[1]` verbatim | accepted |
+
+## Resolution (applied to phase4 inline)
+
+1. **§5 (contract 5)**: Clarified `src_off` is the byte offset of the **first byte
+   of the source block in the SSBO buffer** (NOT shifted by +3). The
+   C bench's `src + 3` C-caller convention does NOT carry into the
+   SSBO offset. Shader reads `s[k] = u_src.src[src_off + row*stride + k]`
+   for k=0..14, which equals `master_src[block_base + row*stride + k]`,
+   matching the C ref's per-row read of `master_src[block_base + row*stride + (x..x+7)]`
+   for output col x ∈ 0..7.
+
+2. **§5** — Hardened "Bench enforces" to "Phase 6 MUST add
+   `assert(dst_stride_u8 >= 8 && src_stride_u8 >= 15)` in
+   `bench_v3d_mc.c`'s meta-construction loop." Cycle-2 finding-4
+   pattern applied.
+
+3. **§5** — Added: "Phase 6 MUST run `V3D_DEBUG=shaderdb` after first
+   compile and record uniform count. If uniform count > ~144,
+   escalate filter to a dedicated SSBO binding 3."
+
+After revisions: **Phase 4''' APPROVED for Phase 6''' implementation.**
@@ -0,0 +1,152 @@
+---
+cycle: 3
+phase: 7
+status: closed 2026-05-18 — RED engineering / PASS 30fps-floor / M4 NEGATIVE
+date_opened: 2026-05-18
+date_closed: 2026-05-18
+parent: k3_mc_phase4.md (revised per phase5''')
+host: hertz
+verdict: cycle 3 closes; MC stays on CPU for higgs deployment; engineering negative documented
+---
+
+# Cycle 3, Phase 7 — Verification (v1 + M4''')
+
+## v1 first-light
+
+```
+=== v3d MC 8h bench ===
+  n_blocks: 65536  iters: 100
+
+=== M1''': QPU vs C reference bit-exact ===
+  blocks bit-exact: 65536 / 65536 (100.0000 %)
+
+=== M2''': QPU throughput ===
+  M2''' = 1.413 Mblock/s
+  per-block = 707.9 ns
+  per-dispatch = 46390.5 us
+  R''' = 0.067 → RED band
+  30fps@1080p floor: 1.5x margin (isolation)
+```
+
+shaderdb (v1 MC):
+```
+SHADER-DB-ffcca249...: 488 inst, 2 threads, 0 loops, 197 uniforms,
+  25 max-temps, 0:0 spills:fills, 0 sfu-stalls, 488 inst-and-stalls, 7 nops
+```
+
+**Phase 5''' finding 2 prediction confirmed**: filter LUT inflated
+uniforms to 197 (gate was at ~144). Compiler also forced to 2 threads
+(from cycle-2's 4) due to register pressure (25 max-temps vs cycle-2's
+21). The "no DP4A" structural deficit shows up directly here — 8
+SMUL24 + 7 ADD per output pixel × 64 pixels per block × 8-lane
+geometry = 488 instructions, 30× heavier than the LPF kernel.
+
+## M4''' concurrent matrix (8s windows)
+
+| Config | Mblock/s | per-core (NEON) | vs NEON-4 | 30fps |
+|---|---|---|---|---|
+| NEON 1-core | 14.479 | — | — | 14.9× |
+| **NEON 4-core** | **15.248** | 3.24 – 4.48 | **baseline** | 15.7× |
+| QPU only | 1.380 | — | — | 1.4× |
+| **Mixed NEON-3 + QPU** | **12.277** | 3.78 – 4.16 | **−19.5 %** | 12.6× |
+| Mixed NEON-4 + QPU | 12.158 | 2.49 – 3.35 | −20.3 % | 12.5× |
+
+**M4 gate: FAIL.** Mixed NEON-3 + QPU (12.28 Mblock/s) trails pure
+NEON-4 (15.25 Mblock/s) by 2.97 Mblock/s. The QPU's 0.45 Mblock/s
+contribution under contention doesn't compensate for losing one NEON
+core that delivers ~3.8 Mblock/s.
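
The "vs NEON-4" column and the gate verdict can be reproduced from the throughput column alone. The sketch below assumes the gate is the simple comparison stated in the decision table (mixed throughput must exceed the pure 4-core NEON baseline). The function names are hypothetical.

```c
#include <assert.h>

/* Signed percentage change of a mixed configuration vs the baseline. */
double pct_vs_baseline(double mixed_mblock_s, double baseline_mblock_s)
{
    assert(baseline_mblock_s > 0.0);
    return 100.0 * (mixed_mblock_s - baseline_mblock_s) / baseline_mblock_s;
}

/* M4 gate: offload only counts as a win if mixed beats pure NEON-4. */
int m4_gate_passes(double mixed_mblock_s, double baseline_mblock_s)
{
    return mixed_mblock_s > baseline_mblock_s;
}
```

For the table above, `pct_vs_baseline(12.277, 15.248)` gives roughly -19.5 and `pct_vs_baseline(12.158, 15.248)` roughly -20.3, so both mixed rows fail the gate.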
+
+## Cross-cycle comparison
+
+| | Cycle 1 IDCT | Cycle 2 LPF | Cycle 3 MC |
+|---|---|---|---|
+| R isolation | 0.92 | 0.41 | **0.067** |
+| 30fps floor margin (isolation) | 7.9× | 10× | **1.5×** |
+| M4 mixed vs pure NEON-4 | +7.2 % | +6.9 % | **−19.5 %** |
+| 30fps floor margin (mixed) | 7.2× | 7.2× | **12.6×** |
+| Verdict for higgs | GO QPU | GO QPU | **STAY CPU** |
+| NEON 4-core scaling vs 1-core | 0.56× (bw-bound) | 0.82× (bw-bound) | **1.05× (compute-bound)** |
+
+The MC result is **structurally consistent** with the V3D substrate
+profile from `phase0.md`:
+- No DP4A → 8-wide convolution doesn't pack as it does on NEON, where
+  the vendored baseline multiply-accumulates eight 16-bit lanes per `mla`
+- Filter coefficients drive uniform count high → register pressure → 2 threads
+- High per-output-pixel multiply count → compiled instruction count
+  3× cycle 1, 6× cycle 2
+
+NEON 4-core is *compute*-bound for MC (not bandwidth-bound like
+the other two kernels). So 4-core scales nearly linearly with cores —
+the NEON CPU has plenty of headroom and the QPU has nothing to add
+even in concurrent mode.
+
+## Deployment recipe (for higgs / libva-v4l2-request-fourier)
+
+Per `project_consumer_target.md`, the eventual integration target is
+V4L2 stateless → libva-v4l2-request-fourier → firefox-fourier. The
+QPU/CPU split for the back-end stages of that decoder pipeline:
+
+- **IDCT (cycle 1)** → QPU. R = 0.92, +7 % mixed, frees a CPU core.
+- **LPF (cycle 2)** → QPU. R = 0.41, +7 % mixed, frees a CPU core.
+- **MC (cycle 3)** → **CPU NEON**. R = 0.067, −19.5 % mixed.
+  Compute-bound on CPU but CPU already comfortably exceeds 30fps;
+  offload makes things worse.
+- **Entropy** (VP9 bool coder / AV1 multi-symbol arithmetic coder) → CPU.
+  Structurally serial.
+
+This is a **mixed-substrate deployment**, not a "QPU does everything"
+plan. Realistic for higgs: entropy + MC on 2-3 ARM cores; IDCT + LPF
+dispatched to QPU concurrently; 1-2 ARM cores left for vscode / etc.
+
+## Decision per Phase 1 rules + 30fps-floor calibration
+
+| Rule | Result | Status |
+|---|---|---|
+| M1''' bit-exact | 100.0000 % | ✓ PASS |
+| R''' = M2'''/M3''' | 0.067 (RED) | structural mismatch |
+| M4''' > pure-CPU 4-core | −19.5 % | ✗ FAIL gate |
+| 30fps@1080p floor (isolation) | 1.5× | ✓ PASS (user-facing) |
+| 30fps@1080p floor (mixed) | 12.6× | ✓ PASS (user-facing) |
+
+**Engineering cycle verdict: do not deploy MC on QPU; deploy on CPU.**
+**User-facing cycle verdict: 30fps floor easily met in any
+configuration; either path works for daily YouTube.**
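
The margin figures are consistent with a floor of 0.972 Mblock/s, which is what 1920×1080 at 30 fps works out to if one block covers 64 pixels. That block size is an inference from the table values and not something the benches are confirmed to use, so treat the sketch below as an assumption-laden check. The function names are hypothetical.

```c
#include <assert.h>

/* Required throughput in Mblock/s, ASSUMING one block per 64 pixels. */
double floor_mblock_s(int width, int height, int fps)
{
    assert(width > 0 && height > 0 && fps > 0);
    return ((double)width * (double)height / 64.0) * (double)fps / 1e6;
}

/* Margin = measured throughput divided by the required floor. */
double floor_margin(double measured_mblock_s, double floor)
{
    assert(floor > 0.0);
    return measured_mblock_s / floor;
}
```

Under that assumption, `floor_margin(12.277, floor_mblock_s(1920, 1080, 30))` is about 12.6 and the QPU-only 1.380 Mblock/s gives about 1.4, matching the 30fps column of the M4''' matrix.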
+
+For the deployment recipe above, **MC stays on CPU**. The Phase 1
+ORANGE/RED "honest close" rule applies here: cycle 3 closes as a
+documented negative for this kernel without affecting the
+project-level "continue" verdict (cycles 1+2 GO results stand).
+
+## Phase 9 lessons (added to project memory)
+
+1. **Multiply-heavy workloads expose V3D's no-DP4A deficit** in a way
+   that cycles 1+2 didn't. CPU SDOT/UDOT pack 4 INT8 MACs per 32-bit
+   lane in one instruction; V3D's SMUL24 is one multiply per lane at a
+   time. The 4× gap shows up directly as a 6-15× per-block slowdown.
+
+2. **Compute-bound CPU workloads make the QPU offload story collapse.**
+   When NEON 4-core scales near-linearly (not bandwidth-saturated),
+   the "freed-core" argument from cycles 1+2 doesn't apply: there
+   are no idle cycles to free. Mixed mode is strictly worse.
+
+3. **The 30fps@1080p user-facing test (`project_30fps_floor_is_fine.md`)
+   passes regardless of engineering verdict.** All three cycles pass
+   it in isolation. This is a project-level win to communicate
+   separately from per-cycle engineering R numbers.
+
+4. **The shaderdb filter-LUT gate from phase5''' finding 2 fired
+   exactly as predicted** (197 uniforms > 144 threshold; 2 threads
+   instead of 4). This validates the cycle-discipline of running
+   `V3D_DEBUG=shaderdb` early and using the result as an actionable
+   gate. Cycle 4 (if any) should bake this in from Phase 4 §design.
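
Lesson 4 can be turned into an automated check. The sketch below parses a shaderdb summary line in the format shown under "v1 first-light" and compares the uniform count against the ~144 gate. It assumes the line layout stays exactly as logged above. The struct and function names are hypothetical and are not part of Mesa or of the existing benches.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

struct shaderdb_stats {
    int inst, threads, loops, uniforms;
};

/* Parse "<tag>: N inst, N threads, N loops, N uniforms, ...".
 * Returns 1 on success, 0 if the line does not match. */
int parse_shaderdb(const char *line, struct shaderdb_stats *out)
{
    assert(line != NULL && out != NULL);
    const char *p = strstr(line, ": ");
    if (p == NULL)
        return 0;
    return sscanf(p + 2, "%d inst, %d threads, %d loops, %d uniforms",
                  &out->inst, &out->threads, &out->loops,
                  &out->uniforms) == 4;
}

/* Gate from phase5''' finding 2: escalate the LUT if uniforms exceed limit. */
int uniform_gate_fired(const struct shaderdb_stats *s, int limit)
{
    assert(s != NULL);
    return s->uniforms > limit;
}
```

Fed the v1 line, this reports 488 instructions, 2 threads and 197 uniforms, and `uniform_gate_fired(&stats, 144)` returns 1.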
+
+## Leaves open
+
+- Cycle 3 v2 with the filter LUT escalated to an SSBO (per the phase5'''
+  finding 2 trigger). Would reduce uniforms to ~30, potentially
+  restoring 4 threads. Expected upside: ~2× → R''' = 0.13. Still RED,
+  still M4-negative. Skipped: even doubling throughput doesn't change
+  the deployment recipe.
+- Vertical / hv / 4-tap / wider variants. All share cycle 3's multiply
+  shape, so the same structural verdict is expected. Not worth a Phase
+  1+ for those.
+- Cycle 4 candidates (per phase7_M4.md §"Cycle 3 candidates"):
+  CDEF (AV1-only directional filter), Loop Restoration (AV1-only),
+  or higgs deployment plumbing.
@@ -25,6 +25,8 @@ tagged commit, no modifications.
 | `libavcodec/vp9dsp_template.c` | 2578 | 89045 | `41b21f667a6c497b620aa1637d8269badc45d1ac7e621d694441c5bf39356e4f` |
 | `libavcodec/aarch64/vp9itxfm_neon.S` | 1580 | 63534 | `82ee3ceed4735c63576bafdcee28e2215652743ade55a9eab46a16d9530369f6` |
 | `libavcodec/aarch64/vp9lpf_neon.S` | 1334 | — | `384e49e7a6e838d9e38aedc00838ed4aebfa6c5bdb343ecaf23ef639bc10fbb7` |
+| `libavcodec/aarch64/vp9mc_neon.S` | 665 | — | `6b1d50f9821742584fdd47758057f810644aff3a008faaa774ff5b9cac4d1fef` |
+| `libavcodec/vp9_subpel_filters_table.c` | — | — | hand-extracted from `libavcodec/vp9dsp.c` at same n7.1.3 pin — provides `ff_vp9_subpel_filters` for `vp9mc_neon.S` to link against without dragging in vp9dsp.c's full init machinery |
 | `libavcodec/aarch64/neon.S` | 173 | 7496 | `72d36ce6c3fcc5e53de869cfe10fda16225ebe580c32891bccc240a30a85a538` |
 | `libavutil/aarch64/asm.S` | 260 | 8069 | `c0d03143b1bc5a9e358222d08d2d449d595271844fe7a3dc23bffb91abe8b0e3` |
 | `COPYING.LGPLv2.1` | 502 | — | `b634ab5640e258563c536e658cad87080553df6f34f62269a21d554844e58bfe` |
@@ -0,0 +1,665 @@
+/*
+ * Copyright (c) 2016 Google Inc.
+ *
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#include "libavutil/aarch64/asm.S"
+
+// All public functions in this file have the following signature:
+// typedef void (*vp9_mc_func)(uint8_t *dst, ptrdiff_t dst_stride,
+//                            const uint8_t *ref, ptrdiff_t ref_stride,
+//                            int h, int mx, int my);
+
+function ff_vp9_avg64_neon, export=1
+        mov             x5,  x0
+1:
+        ld1             {v4.16b,  v5.16b,  v6.16b,  v7.16b},  [x2], x3
+        ld1             {v0.16b,  v1.16b,  v2.16b,  v3.16b},  [x0], x1
+        ld1             {v20.16b, v21.16b, v22.16b, v23.16b}, [x2], x3
+        urhadd          v0.16b,  v0.16b,  v4.16b
+        urhadd          v1.16b,  v1.16b,  v5.16b
+        ld1             {v16.16b, v17.16b, v18.16b, v19.16b}, [x0], x1
+        urhadd          v2.16b,  v2.16b,  v6.16b
+        urhadd          v3.16b,  v3.16b,  v7.16b
+        subs            w4,  w4,  #2
+        urhadd          v16.16b, v16.16b, v20.16b
+        urhadd          v17.16b, v17.16b, v21.16b
+        st1             {v0.16b,  v1.16b,  v2.16b,  v3.16b},  [x5], x1
+        urhadd          v18.16b, v18.16b, v22.16b
+        urhadd          v19.16b, v19.16b, v23.16b
+        st1             {v16.16b, v17.16b, v18.16b, v19.16b}, [x5], x1
+        b.ne            1b
+        ret
+endfunc
+
+function ff_vp9_avg32_neon, export=1
+1:
+        ld1             {v2.16b, v3.16b},  [x2], x3
+        ld1             {v0.16b, v1.16b},  [x0]
+        urhadd          v0.16b,  v0.16b,  v2.16b
+        urhadd          v1.16b,  v1.16b,  v3.16b
+        subs            w4,  w4,  #1
+        st1             {v0.16b, v1.16b},  [x0], x1
+        b.ne            1b
+        ret
+endfunc
+
+function ff_vp9_copy16_neon, export=1
+        add             x5,  x0,  x1
+        lsl             x1,  x1,  #1
+        add             x6,  x2,  x3
+        lsl             x3,  x3,  #1
+1:
+        ld1             {v0.16b},  [x2], x3
+        ld1             {v1.16b},  [x6], x3
+        ld1             {v2.16b},  [x2], x3
+        ld1             {v3.16b},  [x6], x3
+        subs            w4,  w4,  #4
+        st1             {v0.16b},  [x0], x1
+        st1             {v1.16b},  [x5], x1
+        st1             {v2.16b},  [x0], x1
+        st1             {v3.16b},  [x5], x1
+        b.ne            1b
+        ret
+endfunc
+
+function ff_vp9_avg16_neon, export=1
+        mov             x5,  x0
+1:
+        ld1             {v2.16b},  [x2], x3
+        ld1             {v0.16b},  [x0], x1
+        ld1             {v3.16b},  [x2], x3
+        urhadd          v0.16b,  v0.16b,  v2.16b
+        ld1             {v1.16b},  [x0], x1
+        urhadd          v1.16b,  v1.16b,  v3.16b
+        subs            w4,  w4,  #2
+        st1             {v0.16b},  [x5], x1
+        st1             {v1.16b},  [x5], x1
+        b.ne            1b
+        ret
+endfunc
+
+function ff_vp9_copy8_neon, export=1
+1:
+        ld1             {v0.8b},  [x2], x3
+        ld1             {v1.8b},  [x2], x3
+        subs            w4,  w4,  #2
+        st1             {v0.8b},  [x0], x1
+        st1             {v1.8b},  [x0], x1
+        b.ne            1b
+        ret
+endfunc
+
+function ff_vp9_avg8_neon, export=1
+        mov             x5,  x0
+1:
+        ld1             {v2.8b},  [x2], x3
+        ld1             {v0.8b},  [x0], x1
+        ld1             {v3.8b},  [x2], x3
+        urhadd          v0.8b,  v0.8b,  v2.8b
+        ld1             {v1.8b},  [x0], x1
+        urhadd          v1.8b,  v1.8b,  v3.8b
+        subs            w4,  w4,  #2
+        st1             {v0.8b},  [x5], x1
+        st1             {v1.8b},  [x5], x1
+        b.ne            1b
+        ret
+endfunc
+
+function ff_vp9_copy4_neon, export=1
+1:
+        ld1             {v0.s}[0], [x2], x3
+        ld1             {v1.s}[0], [x2], x3
+        st1             {v0.s}[0], [x0], x1
+        ld1             {v2.s}[0], [x2], x3
+        st1             {v1.s}[0], [x0], x1
+        ld1             {v3.s}[0], [x2], x3
+        subs            w4,  w4,  #4
+        st1             {v2.s}[0], [x0], x1
+        st1             {v3.s}[0], [x0], x1
+        b.ne            1b
+        ret
+endfunc
+
+function ff_vp9_avg4_neon, export=1
+        mov             x5,  x0
+1:
+        ld1             {v2.s}[0], [x2], x3
+        ld1             {v0.s}[0], [x0], x1
+        ld1             {v2.s}[1], [x2], x3
+        ld1             {v0.s}[1], [x0], x1
+        ld1             {v3.s}[0], [x2], x3
+        ld1             {v1.s}[0], [x0], x1
+        ld1             {v3.s}[1], [x2], x3
+        ld1             {v1.s}[1], [x0], x1
+        subs            w4,  w4,  #4
+        urhadd          v0.8b,  v0.8b,  v2.8b
+        urhadd          v1.8b,  v1.8b,  v3.8b
+        st1             {v0.s}[0], [x5], x1
+        st1             {v0.s}[1], [x5], x1
+        st1             {v1.s}[0], [x5], x1
+        st1             {v1.s}[1], [x5], x1
+        b.ne            1b
+        ret
+endfunc
+
+
+// Extract a vector from src1-src2 and src4-src5 (src1-src3 and src4-src6
+// for size >= 16), and multiply-accumulate into dst1 and dst3 (or
+// dst1-dst2 and dst3-dst4 for size >= 16)
+.macro extmla dst1, dst2, dst3, dst4, src1, src2, src3, src4, src5, src6, offset, size
+        ext             v20.16b, \src1\().16b, \src2\().16b, #(2*\offset)
+        ext             v22.16b, \src4\().16b, \src5\().16b, #(2*\offset)
+.if \size >= 16
+        mla             \dst1\().8h, v20.8h, v0.h[\offset]
+        ext             v21.16b, \src2\().16b, \src3\().16b, #(2*\offset)
+        mla             \dst3\().8h, v22.8h, v0.h[\offset]
+        ext             v23.16b, \src5\().16b, \src6\().16b, #(2*\offset)
+        mla             \dst2\().8h, v21.8h, v0.h[\offset]
+        mla             \dst4\().8h, v23.8h, v0.h[\offset]
+.elseif \size == 8
+        mla             \dst1\().8h, v20.8h, v0.h[\offset]
+        mla             \dst3\().8h, v22.8h, v0.h[\offset]
+.else
+        mla             \dst1\().4h, v20.4h, v0.h[\offset]
+        mla             \dst3\().4h, v22.4h, v0.h[\offset]
+.endif
+.endm
+// The same as above, but don't accumulate straight into the
+// destination, but use a temp register and accumulate with saturation.
+.macro extmulqadd dst1, dst2, dst3, dst4, src1, src2, src3, src4, src5, src6, offset, size
+        ext             v20.16b, \src1\().16b, \src2\().16b, #(2*\offset)
+        ext             v22.16b, \src4\().16b, \src5\().16b, #(2*\offset)
+.if \size >= 16
+        mul             v20.8h, v20.8h, v0.h[\offset]
+        ext             v21.16b, \src2\().16b, \src3\().16b, #(2*\offset)
+        mul             v22.8h, v22.8h, v0.h[\offset]
+        ext             v23.16b, \src5\().16b, \src6\().16b, #(2*\offset)
+        mul             v21.8h, v21.8h, v0.h[\offset]
+        mul             v23.8h, v23.8h, v0.h[\offset]
+.elseif \size == 8
+        mul             v20.8h, v20.8h, v0.h[\offset]
+        mul             v22.8h, v22.8h, v0.h[\offset]
+.else
+        mul             v20.4h, v20.4h, v0.h[\offset]
+        mul             v22.4h, v22.4h, v0.h[\offset]
+.endif
+.if \size == 4
+        sqadd           \dst1\().4h, \dst1\().4h, v20.4h
+        sqadd           \dst3\().4h, \dst3\().4h, v22.4h
+.else
+        sqadd           \dst1\().8h, \dst1\().8h, v20.8h
+        sqadd           \dst3\().8h, \dst3\().8h, v22.8h
+.if \size >= 16
+        sqadd           \dst2\().8h, \dst2\().8h, v21.8h
+        sqadd           \dst4\().8h, \dst4\().8h, v23.8h
+.endif
+.endif
+.endm
+
+
+// Instantiate a horizontal filter function for the given size.
+// This can work on 4, 8 or 16 pixels in parallel; for larger
+// widths it will do 16 pixels at a time and loop horizontally.
+// The actual width is passed in x5, the height in w4 and the
+// filter coefficients in x9. idx2 is the index of the largest
+// filter coefficient (3 or 4) and idx1 is the other one of them.
+.macro do_8tap_h type, size, idx1, idx2
+function \type\()_8tap_\size\()h_\idx1\idx2
+        sub             x2,  x2,  #3
+        add             x6,  x0,  x1
+        add             x7,  x2,  x3
+        add             x1,  x1,  x1
+        add             x3,  x3,  x3
+        // Only size >= 16 loops horizontally and needs
+        // reduced dst stride
+.if \size >= 16
+        sub             x1,  x1,  x5
+.elseif \size == 4
+        add             x12, x2,  #8
+        add             x13, x7,  #8
+.endif
+        // size >= 16 loads two qwords and increments x2,
+        // for size 4/8 it's enough with one qword and no
+        // postincrement
+.if \size >= 16
+        sub             x3,  x3,  x5
+        sub             x3,  x3,  #8
+.endif
+        // Load the filter vector
+        ld1             {v0.8h},  [x9]
+1:
+.if \size >= 16
+        mov             x9,  x5
+.endif
+        // Load src
+.if \size >= 16
+        ld1             {v4.8b,  v5.8b,  v6.8b},  [x2], #24
+        ld1             {v16.8b, v17.8b, v18.8b}, [x7], #24
+.elseif \size == 8
+        ld1             {v4.8b,  v5.8b},  [x2]
+        ld1             {v16.8b, v17.8b}, [x7]
+.else // \size == 4
+        ld1             {v4.8b},  [x2]
+        ld1             {v16.8b}, [x7]
+        ld1             {v5.s}[0],  [x12], x3
+        ld1             {v17.s}[0], [x13], x3
+.endif
+        uxtl            v4.8h,  v4.8b
+        uxtl            v5.8h,  v5.8b
+        uxtl            v16.8h, v16.8b
+        uxtl            v17.8h, v17.8b
+.if \size >= 16
+        uxtl            v6.8h,  v6.8b
+        uxtl            v18.8h, v18.8b
+.endif
+2:
+
+        // Accumulate, adding idx2 last with a separate
+        // saturating add. The positive filter coefficients
+        // for all indices except idx2 must add up to less
+        // than 127 for this not to overflow.
+        mul             v1.8h,  v4.8h,  v0.h[0]
+        mul             v24.8h, v16.8h, v0.h[0]
+.if \size >= 16
+        mul             v2.8h,  v5.8h,  v0.h[0]
+        mul             v25.8h, v17.8h, v0.h[0]
+.endif
+        extmla          v1,  v2,  v24, v25, v4,  v5,  v6,  v16, v17, v18, 1,     \size
+        extmla          v1,  v2,  v24, v25, v4,  v5,  v6,  v16, v17, v18, 2,     \size
+        extmla          v1,  v2,  v24, v25, v4,  v5,  v6,  v16, v17, v18, \idx1, \size
+        extmla          v1,  v2,  v24, v25, v4,  v5,  v6,  v16, v17, v18, 5,     \size
+        extmla          v1,  v2,  v24, v25, v4,  v5,  v6,  v16, v17, v18, 6,     \size
+        extmla          v1,  v2,  v24, v25, v4,  v5,  v6,  v16, v17, v18, 7,     \size
+        extmulqadd      v1,  v2,  v24, v25, v4,  v5,  v6,  v16, v17, v18, \idx2, \size
+
+        // Round, shift and saturate
+        sqrshrun        v1.8b,   v1.8h,  #7
+        sqrshrun        v24.8b,  v24.8h, #7
+.if \size >= 16
+        sqrshrun2       v1.16b,  v2.8h,  #7
+        sqrshrun2       v24.16b, v25.8h, #7
+.endif
+        // Average
+.ifc \type,avg
+.if \size >= 16
+        ld1             {v2.16b}, [x0]
+        ld1             {v3.16b}, [x6]
+        urhadd          v1.16b,  v1.16b,  v2.16b
+        urhadd          v24.16b, v24.16b, v3.16b
+.elseif \size == 8
+        ld1             {v2.8b},  [x0]
+        ld1             {v3.8b},  [x6]
+        urhadd          v1.8b,  v1.8b,  v2.8b
+        urhadd          v24.8b, v24.8b, v3.8b
+.else
+        ld1             {v2.s}[0], [x0]
+        ld1             {v3.s}[0], [x6]
+        urhadd          v1.8b,  v1.8b,  v2.8b
+        urhadd          v24.8b, v24.8b, v3.8b
+.endif
+.endif
+        // Store and loop horizontally (for size >= 16)
+.if \size >= 16
+        subs            x9,  x9,  #16
+        st1             {v1.16b},  [x0], #16
+        st1             {v24.16b}, [x6], #16
+        b.eq            3f
+        mov             v4.16b,  v6.16b
+        mov             v16.16b, v18.16b
+        ld1             {v6.16b},  [x2], #16
+        ld1             {v18.16b}, [x7], #16
+        uxtl            v5.8h,  v6.8b
+        uxtl2           v6.8h,  v6.16b
+        uxtl            v17.8h, v18.8b
+        uxtl2           v18.8h, v18.16b
+        b               2b
+.elseif \size == 8
+        st1             {v1.8b},    [x0]
+        st1             {v24.8b},   [x6]
+.else // \size == 4
+        st1             {v1.s}[0],  [x0]
+        st1             {v24.s}[0], [x6]
+.endif
+3:
+        // Loop vertically
+        add             x0,  x0,  x1
+        add             x6,  x6,  x1
+        add             x2,  x2,  x3
+        add             x7,  x7,  x3
+        subs            w4,  w4,  #2
+        b.ne            1b
+        ret
+endfunc
+.endm
+
+.macro do_8tap_h_size size
+do_8tap_h put, \size, 3, 4
+do_8tap_h avg, \size, 3, 4
+do_8tap_h put, \size, 4, 3
+do_8tap_h avg, \size, 4, 3
+.endm
+
+do_8tap_h_size 4
+do_8tap_h_size 8
+do_8tap_h_size 16
+
+.macro do_8tap_h_func type, filter, offset, size
+function ff_vp9_\type\()_\filter\()\size\()_h_neon, export=1
+        movrel          x6,  X(ff_vp9_subpel_filters), 256*\offset
+        cmp             w5,  #8
+        add             x9,  x6,  w5, uxtw #4
+        mov             x5,  #\size
+.if \size >= 16
+        b.ge            \type\()_8tap_16h_34
+        b               \type\()_8tap_16h_43
+.else
+        b.ge            \type\()_8tap_\size\()h_34
+        b               \type\()_8tap_\size\()h_43
+.endif
+endfunc
+.endm
+
+.macro do_8tap_h_filters size
+do_8tap_h_func put, regular, 1, \size
+do_8tap_h_func avg, regular, 1, \size
+do_8tap_h_func put, sharp,   2, \size
+do_8tap_h_func avg, sharp,   2, \size
+do_8tap_h_func put, smooth,  0, \size
+do_8tap_h_func avg, smooth,  0, \size
+.endm
+
+do_8tap_h_filters 64
+do_8tap_h_filters 32
+do_8tap_h_filters 16
+do_8tap_h_filters 8
+do_8tap_h_filters 4
+
+
+// Vertical filters
+
+// Round, shift and saturate and store reg1-reg2 over 4 lines
+.macro do_store4 reg1, reg2, tmp1, tmp2, type
+        sqrshrun        \reg1\().8b,  \reg1\().8h, #7
+        sqrshrun        \reg2\().8b,  \reg2\().8h, #7
+.ifc \type,avg
+        ld1             {\tmp1\().s}[0],  [x7], x1
+        ld1             {\tmp2\().s}[0],  [x7], x1
+        ld1             {\tmp1\().s}[1],  [x7], x1
+        ld1             {\tmp2\().s}[1],  [x7], x1
+        urhadd          \reg1\().8b,  \reg1\().8b,  \tmp1\().8b
+        urhadd          \reg2\().8b,  \reg2\().8b,  \tmp2\().8b
+.endif
+        st1             {\reg1\().s}[0],  [x0], x1
+        st1             {\reg2\().s}[0],  [x0], x1
+        st1             {\reg1\().s}[1],  [x0], x1
+        st1             {\reg2\().s}[1],  [x0], x1
+.endm
+
+// Round, shift and saturate and store reg1-4
+.macro do_store reg1, reg2, reg3, reg4, tmp1, tmp2, tmp3, tmp4, type
+        sqrshrun        \reg1\().8b,  \reg1\().8h, #7
+        sqrshrun        \reg2\().8b,  \reg2\().8h, #7
+        sqrshrun        \reg3\().8b,  \reg3\().8h, #7
+        sqrshrun        \reg4\().8b,  \reg4\().8h, #7
+.ifc \type,avg
+        ld1             {\tmp1\().8b},  [x7], x1
+        ld1             {\tmp2\().8b},  [x7], x1
+        ld1             {\tmp3\().8b},  [x7], x1
+        ld1             {\tmp4\().8b},  [x7], x1
+        urhadd          \reg1\().8b,  \reg1\().8b,  \tmp1\().8b
+        urhadd          \reg2\().8b,  \reg2\().8b,  \tmp2\().8b
+        urhadd          \reg3\().8b,  \reg3\().8b,  \tmp3\().8b
+        urhadd          \reg4\().8b,  \reg4\().8b,  \tmp4\().8b
+.endif
+        st1             {\reg1\().8b},  [x0], x1
+        st1             {\reg2\().8b},  [x0], x1
+        st1             {\reg3\().8b},  [x0], x1
+        st1             {\reg4\().8b},  [x0], x1
+.endm
+
+// Evaluate the filter twice in parallel, from the inputs src1-src9 into dst1-dst2
+// (src1-src8 into dst1, src2-src9 into dst2), adding idx2 separately
+// at the end with saturation. Indices 0 and 7 always have negative or zero
+// coefficients, so they can be accumulated into tmp1-tmp2 together with the
+// largest coefficient.
+.macro convolve dst1, dst2, src1, src2, src3, src4, src5, src6, src7, src8, src9, idx1, idx2, tmp1, tmp2
+        mul             \dst1\().8h, \src2\().8h, v0.h[1]
+        mul             \dst2\().8h, \src3\().8h, v0.h[1]
+        mul             \tmp1\().8h, \src1\().8h, v0.h[0]
+        mul             \tmp2\().8h, \src2\().8h, v0.h[0]
+        mla             \dst1\().8h, \src3\().8h, v0.h[2]
+        mla             \dst2\().8h, \src4\().8h, v0.h[2]
+.if \idx1 == 3
+        mla             \dst1\().8h, \src4\().8h, v0.h[3]
+        mla             \dst2\().8h, \src5\().8h, v0.h[3]
+.else
+        mla             \dst1\().8h, \src5\().8h, v0.h[4]
+        mla             \dst2\().8h, \src6\().8h, v0.h[4]
+.endif
+        mla             \dst1\().8h, \src6\().8h, v0.h[5]
+        mla             \dst2\().8h, \src7\().8h, v0.h[5]
+        mla             \tmp1\().8h, \src8\().8h, v0.h[7]
+        mla             \tmp2\().8h, \src9\().8h, v0.h[7]
+        mla             \dst1\().8h, \src7\().8h, v0.h[6]
+        mla             \dst2\().8h, \src8\().8h, v0.h[6]
+.if \idx2 == 3
+        mla             \tmp1\().8h, \src4\().8h, v0.h[3]
+        mla             \tmp2\().8h, \src5\().8h, v0.h[3]
+.else
+        mla             \tmp1\().8h, \src5\().8h, v0.h[4]
+        mla             \tmp2\().8h, \src6\().8h, v0.h[4]
+.endif
+        sqadd           \dst1\().8h, \dst1\().8h, \tmp1\().8h
+        sqadd           \dst2\().8h, \dst2\().8h, \tmp2\().8h
+.endm
+
+// Load pixels and extend them to 16 bit
+.macro loadl dst1, dst2, dst3, dst4
+        ld1             {v1.8b}, [x2], x3
+        ld1             {v2.8b}, [x2], x3
+        ld1             {v3.8b}, [x2], x3
+.ifnb \dst4
+        ld1             {v4.8b}, [x2], x3
+.endif
+        uxtl            \dst1\().8h, v1.8b
+        uxtl            \dst2\().8h, v2.8b
+        uxtl            \dst3\().8h, v3.8b
+.ifnb \dst4
+        uxtl            \dst4\().8h, v4.8b
+.endif
+.endm
+
+// Instantiate a vertical filter function for filtering 8 pixels at a time.
+// The height is passed in x4, the width in x5 and the filter coefficients
+// in x6. idx2 is the index of the largest filter coefficient (3 or 4)
+// and idx1 is the other one of them.
+.macro do_8tap_8v type, idx1, idx2
+function \type\()_8tap_8v_\idx1\idx2
+        sub             x2,  x2,  x3, lsl #1
+        sub             x2,  x2,  x3
+        ld1             {v0.8h},  [x6]
+1:
+.ifc \type,avg
+        mov             x7,  x0
+.endif
+        mov             x6,  x4
+
+        loadl           v17, v18, v19
+
+        loadl           v20, v21, v22, v23
+2:
+        loadl           v24, v25, v26, v27
+        convolve        v1,  v2,  v17, v18, v19, v20, v21, v22, v23, v24, v25, \idx1, \idx2, v5,  v6
+        convolve        v3,  v4,  v19, v20, v21, v22, v23, v24, v25, v26, v27, \idx1, \idx2, v5,  v6
+        do_store        v1,  v2,  v3,  v4,  v5,  v6,  v7,  v28, \type
+
+        subs            x6,  x6,  #4
+        b.eq            8f
+
+        loadl           v16, v17, v18, v19
+        convolve        v1,  v2,  v21, v22, v23, v24, v25, v26, v27, v16, v17, \idx1, \idx2, v5,  v6
+        convolve        v3,  v4,  v23, v24, v25, v26, v27, v16, v17, v18, v19, \idx1, \idx2, v5,  v6
+        do_store        v1,  v2,  v3,  v4,  v5,  v6,  v7,  v28, \type
+
+        subs            x6,  x6,  #4
+        b.eq            8f
+
+        loadl           v20, v21, v22, v23
+        convolve        v1,  v2,  v25, v26, v27, v16, v17, v18, v19, v20, v21, \idx1, \idx2, v5,  v6
+        convolve        v3,  v4,  v27, v16, v17, v18, v19, v20, v21, v22, v23, \idx1, \idx2, v5,  v6
+        do_store        v1,  v2,  v3,  v4,  v5,  v6,  v7,  v28, \type
+
+        subs            x6,  x6,  #4
+        b.ne            2b
+
+8:
+        subs            x5,  x5,  #8
+        b.eq            9f
+        // x0 -= h * dst_stride
+        msub            x0,  x1,  x4, x0
+        // x2 -= h * src_stride
+        msub            x2,  x3,  x4, x2
+        // x2 -= 8 * src_stride
+        sub             x2,  x2,  x3, lsl #3
+        // x2 += 1 * src_stride
+        add             x2,  x2,  x3
+        add             x2,  x2,  #8
+        add             x0,  x0,  #8
+        b               1b
+9:
+        ret
+endfunc
+.endm
+
+do_8tap_8v put, 3, 4
+do_8tap_8v put, 4, 3
+do_8tap_8v avg, 3, 4
+do_8tap_8v avg, 4, 3
+
+
+// Instantiate a vertical filter function for filtering a 4 pixels wide
+// slice. The first half of the registers contain one row, while the second
+// half of a register contains the second-next row (also stored in the first
+// half of the register two steps ahead). The convolution does two outputs
+// at a time; the output of v17-v24 into one, and v18-v25 into another one.
+// The first half of first output is the first output row, the first half
+// of the other output is the second output row. The second halves of the
+// registers are rows 3 and 4.
+// This only is designed to work for 4 or 8 output lines.
+.macro do_8tap_4v type, idx1, idx2
+function \type\()_8tap_4v_\idx1\idx2
+        sub             x2,  x2,  x3, lsl #1
+        sub             x2,  x2,  x3
+        ld1             {v0.8h},  [x6]
+.ifc \type,avg
+        mov             x7,  x0
+.endif
+
+        ld1             {v1.s}[0],  [x2], x3
+        ld1             {v2.s}[0],  [x2], x3
+        ld1             {v3.s}[0],  [x2], x3
+        ld1             {v4.s}[0],  [x2], x3
+        ld1             {v5.s}[0],  [x2], x3
+        ld1             {v6.s}[0],  [x2], x3
+        trn1            v1.2s,  v1.2s,  v3.2s
+        ld1             {v7.s}[0],  [x2], x3
+        trn1            v2.2s,  v2.2s,  v4.2s
+        ld1             {v26.s}[0], [x2], x3
+        uxtl            v17.8h, v1.8b
+        trn1            v3.2s,  v3.2s,  v5.2s
+        ld1             {v27.s}[0], [x2], x3
+        uxtl            v18.8h, v2.8b
+        trn1            v4.2s,  v4.2s,  v6.2s
+        ld1             {v28.s}[0], [x2], x3
+        uxtl            v19.8h, v3.8b
+        trn1            v5.2s,  v5.2s,  v7.2s
+        ld1             {v29.s}[0], [x2], x3
+        uxtl            v20.8h, v4.8b
+        trn1            v6.2s,  v6.2s,  v26.2s
+        uxtl            v21.8h, v5.8b
+        trn1            v7.2s,  v7.2s,  v27.2s
+        uxtl            v22.8h, v6.8b
+        trn1            v26.2s, v26.2s, v28.2s
+        uxtl            v23.8h, v7.8b
+        trn1            v27.2s, v27.2s, v29.2s
+        uxtl            v24.8h, v26.8b
+        uxtl            v25.8h, v27.8b
+
+        convolve        v1,  v2,  v17, v18, v19, v20, v21, v22, v23, v24, v25, \idx1, \idx2, v3,  v4
+        do_store4       v1,  v2,  v5,  v6,  \type
+
+        subs            x4,  x4,  #4
+        b.eq            9f
+
+        ld1             {v1.s}[0],  [x2], x3
+        ld1             {v2.s}[0],  [x2], x3
+        trn1            v28.2s, v28.2s, v1.2s
+        trn1            v29.2s, v29.2s, v2.2s
+        ld1             {v1.s}[1],  [x2], x3
+        uxtl            v26.8h, v28.8b
+        ld1             {v2.s}[1],  [x2], x3
+        uxtl            v27.8h, v29.8b
+        uxtl            v28.8h, v1.8b
+        uxtl            v29.8h, v2.8b
+
+        convolve        v1,  v2,  v21, v22, v23, v24, v25, v26, v27, v28, v29, \idx1, \idx2, v3,  v4
+        do_store4       v1,  v2,  v5,  v6,  \type
+
+9:
+        ret
+endfunc
+.endm
+
+do_8tap_4v put, 3, 4
+do_8tap_4v put, 4, 3
+do_8tap_4v avg, 3, 4
+do_8tap_4v avg, 4, 3
+
+
+.macro do_8tap_v_func type, filter, offset, size
+function ff_vp9_\type\()_\filter\()\size\()_v_neon, export=1
+        uxtw            x4,  w4
+        movrel          x5,  X(ff_vp9_subpel_filters), 256*\offset
+        cmp             w6,  #8
+        add             x6,  x5,  w6, uxtw #4
+        mov             x5,  #\size
+.if \size >= 8
+        b.ge            \type\()_8tap_8v_34
+        b               \type\()_8tap_8v_43
+.else
+        b.ge            \type\()_8tap_4v_34
+        b               \type\()_8tap_4v_43
+.endif
+endfunc
+.endm
+
+.macro do_8tap_v_filters size
+do_8tap_v_func put, regular, 1, \size
+do_8tap_v_func avg, regular, 1, \size
+do_8tap_v_func put, sharp,   2, \size
+do_8tap_v_func avg, sharp,   2, \size
+do_8tap_v_func put, smooth,  0, \size
+do_8tap_v_func avg, smooth,  0, \size
+.endm
+
+do_8tap_v_filters 64
+do_8tap_v_filters 32
+do_8tap_v_filters 16
+do_8tap_v_filters 8
+do_8tap_v_filters 4
@@ -0,0 +1,82 @@
+/*
+ * VP9 8-tap subpel filter table — verbatim transcription of
+ * ff_vp9_subpel_filters from FFmpeg n7.1.3 libavcodec/vp9dsp.c
+ * (commit f46e514). Provided as a standalone .c so the vendored
+ * vp9mc_neon.S has the `ff_vp9_subpel_filters` symbol to link
+ * against, without pulling in the full vp9dsp.c init machinery
+ * (which would chain-include the entire VP9 decoder).
+ *
+ * Enum order from libavcodec/vp9dsp.h:64-67:
+ *   FILTER_8TAP_SMOOTH  = 0
+ *   FILTER_8TAP_REGULAR = 1
+ *   FILTER_8TAP_SHARP   = 2
+ *
+ * License: LGPL-2.1-or-later (matches vp9dsp.c upstream).
+ */
+#include <stdint.h>
+
+#ifdef __GNUC__
+#define DAEDALUS_ALIGNED(n) __attribute__((aligned(n)))
+#else
+#define DAEDALUS_ALIGNED(n)
+#endif
+
+const DAEDALUS_ALIGNED(16) int16_t ff_vp9_subpel_filters[3][16][8] = {
+    /* [0] = FILTER_8TAP_SMOOTH */
+    {
+        {  0,  0,   0, 128,   0,   0,  0,  0 },
+        { -3, -1,  32,  64,  38,   1, -3,  0 },
+        { -2, -2,  29,  63,  41,   2, -3,  0 },
+        { -2, -2,  26,  63,  43,   4, -4,  0 },
+        { -2, -3,  24,  62,  46,   5, -4,  0 },
+        { -2, -3,  21,  60,  49,   7, -4,  0 },
+        { -1, -4,  18,  59,  51,   9, -4,  0 },
+        { -1, -4,  16,  57,  53,  12, -4, -1 },
+        { -1, -4,  14,  55,  55,  14, -4, -1 },
+        { -1, -4,  12,  53,  57,  16, -4, -1 },
+        {  0, -4,   9,  51,  59,  18, -4, -1 },
+        {  0, -4,   7,  49,  60,  21, -3, -2 },
+        {  0, -4,   5,  46,  62,  24, -3, -2 },
+        {  0, -4,   4,  43,  63,  26, -2, -2 },
+        {  0, -3,   2,  41,  63,  29, -2, -2 },
+        {  0, -3,   1,  38,  64,  32, -1, -3 },
+    },
+    /* [1] = FILTER_8TAP_REGULAR */
+    {
+        {  0,  0,   0, 128,   0,   0,  0,  0 },
+        {  0,  1,  -5, 126,   8,  -3,  1,  0 },
+        { -1,  3, -10, 122,  18,  -6,  2,  0 },
+        { -1,  4, -13, 118,  27,  -9,  3, -1 },
+        { -1,  4, -16, 112,  37, -11,  4, -1 },
+        { -1,  5, -18, 105,  48, -14,  4, -1 },
+        { -1,  5, -19,  97,  58, -16,  5, -1 },
+        { -1,  6, -19,  88,  68, -18,  5, -1 },
+        { -1,  6, -19,  78,  78, -19,  6, -1 },
+        { -1,  5, -18,  68,  88, -19,  6, -1 },
+        { -1,  5, -16,  58,  97, -19,  5, -1 },
+        { -1,  4, -14,  48, 105, -18,  5, -1 },
+        { -1,  4, -11,  37, 112, -16,  4, -1 },
+        { -1,  3,  -9,  27, 118, -13,  4, -1 },
+        {  0,  2,  -6,  18, 122, -10,  3, -1 },
+        {  0,  1,  -3,   8, 126,  -5,  1,  0 },
+    },
+    /* [2] = FILTER_8TAP_SHARP */
+    {
+        {  0,  0,   0, 128,   0,   0,  0,  0 },
+        { -1,  3,  -7, 127,   8,  -3,  1,  0 },
+        { -2,  5, -13, 125,  17,  -6,  3, -1 },
+        { -3,  7, -17, 121,  27, -10,  5, -2 },
+        { -4,  9, -20, 115,  37, -13,  6, -2 },
+        { -4, 10, -23, 108,  48, -16,  8, -3 },
+        { -4, 10, -24, 100,  59, -19,  9, -3 },
+        { -4, 11, -24,  90,  70, -21, 10, -4 },
+        { -4, 11, -23,  80,  80, -23, 11, -4 },
+        { -4, 10, -21,  70,  90, -24, 11, -4 },
+        { -3,  9, -19,  59, 100, -24, 10, -4 },
+        { -3,  8, -16,  48, 108, -23, 10, -4 },
+        { -2,  6, -13,  37, 115, -20,  9, -4 },
+        { -2,  5, -10,  27, 121, -17,  7, -3 },
+        { -1,  3,  -6,  17, 125, -13,  5, -2 },
+        {  0,  1,  -3,   8, 127,  -7,  3, -1 },
+    },
+};
@@ -0,0 +1,142 @@
+// daedalus-fourier cycle 3 — VP9 8-tap "regular" subpel filter,
+// horizontal direction, 8x8 output block (8 rows). V3D 7.1 via Mesa v3dv.
+//
+// Bakes in cycle-1+2 v4 winning patterns from start:
+//   - local_size_x = 256
+//   - 8 lanes per block (1 lane per output row), 2 blocks per
+//     16-lane subgroup, 16 subgroups per WG → 32 blocks per WG
+//   - uint8_t SSBO via storageBuffer8BitAccess
+//   - out-of-bounds lanes return early; safe because no barrier follows
+//
+// Contracts (per k3_mc_phase4.md §5, revised per phase5''' findings):
+//   - meta[i].x: dst_off (byte offset of block's row-0 col-0 dst pixel)
+//   - meta[i].y: src_off (byte offset of block's row-0 col-0 SOURCE
+//     pixel, with NO +3 shift: the `src + 3` convention used by the
+//     C callers does NOT carry into the SSBO offset. Shader reads
+//     s[k] = SSBO[src_off + row*stride + k] for k=0..14, matching
+//     C ref's per-row read of `master_src[block_base + row*stride
+//     + (x..x+7)]` for output col x ∈ 0..7).
+//   - meta[i].z: mx (subpel phase in [0..15])
+//   - dst_stride_u8 ≥ 8 (race-safety lower bound; bench asserts)
+//   - src_stride_u8 ≥ 15 (per-row read span; bench asserts)
+//
+// License: BSD-2-Clause. Algorithm transcribed from tests/vp9_mc_ref.c
+// which mirrors libavcodec/vp9dsp_template.c FILTER_8TAP macro.
+
+#version 450
+#extension GL_EXT_shader_8bit_storage              : require
+#extension GL_EXT_shader_explicit_arithmetic_types : require
+
+layout(local_size_x = 256, local_size_y = 1, local_size_z = 1) in;
+
+layout(binding = 0) readonly buffer Meta {
+    uvec4 meta[];   // per block: (dst_off, src_off, mx, _pad)
+} u_meta;
+
+layout(binding = 1) buffer Dst {
+    uint8_t dst[];
+} u_dst;
+
+layout(binding = 2) readonly buffer Src {
+    uint8_t src[];
+} u_src;
+
+layout(push_constant) uniform PC {
+    uint n_blocks;
+    uint dst_stride_u8;
+    uint src_stride_u8;
+    uint _pad;
+} pc;
+
+// VP9 8-tap REGULAR filter table — verbatim from
+// external/ffmpeg-snapshot/libavcodec/vp9_subpel_filters_table.c
+// (index [1] = FILTER_8TAP_REGULAR). 16 subpel phases × 8 taps.
+//
+// shaderdb-gate (phase5''' finding 2): if uniform count > ~144 after
+// first compile, escalate this LUT to SSBO binding 3.
+const int FILTER_REGULAR[16][8] = int[16][8](
+    int[8]( 0,  0,   0, 128,   0,   0,  0,  0 ),
+    int[8]( 0,  1,  -5, 126,   8,  -3,  1,  0 ),
+    int[8](-1,  3, -10, 122,  18,  -6,  2,  0 ),
+    int[8](-1,  4, -13, 118,  27,  -9,  3, -1 ),
+    int[8](-1,  4, -16, 112,  37, -11,  4, -1 ),
+    int[8](-1,  5, -18, 105,  48, -14,  4, -1 ),
+    int[8](-1,  5, -19,  97,  58, -16,  5, -1 ),
+    int[8](-1,  6, -19,  88,  68, -18,  5, -1 ),
+    int[8](-1,  6, -19,  78,  78, -19,  6, -1 ),
+    int[8](-1,  5, -18,  68,  88, -19,  6, -1 ),
+    int[8](-1,  5, -16,  58,  97, -19,  5, -1 ),
+    int[8](-1,  4, -14,  48, 105, -18,  5, -1 ),
+    int[8](-1,  4, -11,  37, 112, -16,  4, -1 ),
+    int[8](-1,  3,  -9,  27, 118, -13,  4, -1 ),
+    int[8]( 0,  2,  -6,  18, 122, -10,  3, -1 ),
+    int[8]( 0,  1,  -3,   8, 126,  -5,  1,  0 )
+);
+
+void main()
+{
+    uint gid         = gl_GlobalInvocationID.x;
+    uint wg_id       = gid / 256u;
+    uint lane_in_wg  = gid & 255u;
+    uint sg_in_wg    = lane_in_wg >> 4;
+    uint lane_in_sg  = lane_in_wg & 15u;
+    uint block_slot  = lane_in_sg >> 3;
+    uint row         = lane_in_sg & 7u;
+
+    uint block_local = sg_in_wg * 2u + block_slot;
+    uint block_idx   = wg_id * 32u + block_local;
+
+    // No barrier follows — safe early-return.
+    if (block_idx >= pc.n_blocks) return;
+
+    uvec4 m = u_meta.meta[block_idx];
+    uint dst_off = m.x;
+    uint src_off = m.y;
+    uint mx      = m.z & 15u;
+
+    // Read the 15 source pixels this row needs (8 outputs + 7 extra taps).
+    uint src_row = src_off + row * pc.src_stride_u8;
+    int s0  = int(u_src.src[src_row +  0u]);
+    int s1  = int(u_src.src[src_row +  1u]);
+    int s2  = int(u_src.src[src_row +  2u]);
+    int s3  = int(u_src.src[src_row +  3u]);
+    int s4  = int(u_src.src[src_row +  4u]);
+    int s5  = int(u_src.src[src_row +  5u]);
+    int s6  = int(u_src.src[src_row +  6u]);
+    int s7  = int(u_src.src[src_row +  7u]);
+    int s8  = int(u_src.src[src_row +  8u]);
+    int s9  = int(u_src.src[src_row +  9u]);
+    int s10 = int(u_src.src[src_row + 10u]);
+    int s11 = int(u_src.src[src_row + 11u]);
+    int s12 = int(u_src.src[src_row + 12u]);
+    int s13 = int(u_src.src[src_row + 13u]);
+    int s14 = int(u_src.src[src_row + 14u]);
+
+    int F0 = FILTER_REGULAR[mx][0];
+    int F1 = FILTER_REGULAR[mx][1];
+    int F2 = FILTER_REGULAR[mx][2];
+    int F3 = FILTER_REGULAR[mx][3];
+    int F4 = FILTER_REGULAR[mx][4];
+    int F5 = FILTER_REGULAR[mx][5];
+    int F6 = FILTER_REGULAR[mx][6];
+    int F7 = FILTER_REGULAR[mx][7];
+
+    int o0 = F0*s0  + F1*s1  + F2*s2  + F3*s3  + F4*s4  + F5*s5  + F6*s6  + F7*s7;
+    int o1 = F0*s1  + F1*s2  + F2*s3  + F3*s4  + F4*s5  + F5*s6  + F6*s7  + F7*s8;
+    int o2 = F0*s2  + F1*s3  + F2*s4  + F3*s5  + F4*s6  + F5*s7  + F6*s8  + F7*s9;
+    int o3 = F0*s3  + F1*s4  + F2*s5  + F3*s6  + F4*s7  + F5*s8  + F6*s9  + F7*s10;
+    int o4 = F0*s4  + F1*s5  + F2*s6  + F3*s7  + F4*s8  + F5*s9  + F6*s10 + F7*s11;
+    int o5 = F0*s5  + F1*s6  + F2*s7  + F3*s8  + F4*s9  + F5*s10 + F6*s11 + F7*s12;
+    int o6 = F0*s6  + F1*s7  + F2*s8  + F3*s9  + F4*s10 + F5*s11 + F6*s12 + F7*s13;
+    int o7 = F0*s7  + F1*s8  + F2*s9  + F3*s10 + F4*s11 + F5*s12 + F6*s13 + F7*s14;
+
+    uint dst_row = dst_off + row * pc.dst_stride_u8;
+    u_dst.dst[dst_row + 0u] = uint8_t(clamp((o0 + 64) >> 7, 0, 255));
+    u_dst.dst[dst_row + 1u] = uint8_t(clamp((o1 + 64) >> 7, 0, 255));
+    u_dst.dst[dst_row + 2u] = uint8_t(clamp((o2 + 64) >> 7, 0, 255));
+    u_dst.dst[dst_row + 3u] = uint8_t(clamp((o3 + 64) >> 7, 0, 255));
+    u_dst.dst[dst_row + 4u] = uint8_t(clamp((o4 + 64) >> 7, 0, 255));
+    u_dst.dst[dst_row + 5u] = uint8_t(clamp((o5 + 64) >> 7, 0, 255));
+    u_dst.dst[dst_row + 6u] = uint8_t(clamp((o6 + 64) >> 7, 0, 255));
+    u_dst.dst[dst_row + 7u] = uint8_t(clamp((o7 + 64) >> 7, 0, 255));
+}
@@ -0,0 +1,286 @@
+/*
+ * Cycle 3 M4''' — concurrent CPU(NEON MC) + QPU(V3D MC) throughput.
+ * Same pthread/barrier pattern as bench_concurrent{,_lpf}.c.
+ * License: BSD-2-Clause.
+ */
+#define _GNU_SOURCE
+#include <stdio.h>
+#include <stdlib.h>
+#include <stdint.h>
+#include <string.h>
+#include <stddef.h>
+#include <time.h>
+#include <getopt.h>
+#include <pthread.h>
+#include <sched.h>
+#include <assert.h>
+#include <vulkan/vulkan.h>
+
+#include "v3d_runner.h"
+
+extern void ff_vp9_put_regular8_h_neon(
+    uint8_t *dst, ptrdiff_t dst_stride,
+    const uint8_t *src, ptrdiff_t src_stride,
+    int h, int mx, int my);
+
+#define SRC_W 16
+#define DST_W 8
+#define SRC_H 8
+#define DST_H 8
+#define SRC_BYTES (SRC_H * SRC_W)
+#define DST_BYTES (DST_H * DST_W)
+
+static inline uint64_t xs_step(uint64_t *s) {
+    uint64_t x = *s; x ^= x << 13; x ^= x >> 7; x ^= x << 17; return *s = x;
+}
+static uint64_t xs_init(uint64_t s) { return s ? s : 0xa57edbeef5717ULL; }
+static double now_s(void) {
+    struct timespec t; clock_gettime(CLOCK_MONOTONIC_RAW, &t);
+    return t.tv_sec + t.tv_nsec * 1e-9;
+}
+
+static volatile int g_stop = 0;
+static pthread_barrier_t g_start;
+
+/* --- NEON worker ----------- */
+
+#define NEON_BATCH 8192
+
+typedef struct {
+    int worker_id, affinity_core;
+    uint64_t blocks_done;
+    double elapsed_s;
+} neon_args;
+
+static void *neon_worker(void *p)
+{
+    neon_args *a = p;
+    cpu_set_t cs; CPU_ZERO(&cs); CPU_SET(a->affinity_core, &cs);
+    pthread_setaffinity_np(pthread_self(), sizeof(cs), &cs);
+
+    uint64_t s = xs_init((uint64_t) a->worker_id * 0xc01dbeefULL);
+    uint8_t *master = malloc((size_t) NEON_BATCH * SRC_BYTES);
+    uint8_t *work   = malloc((size_t) NEON_BATCH * SRC_BYTES);
+    uint8_t *dsts   = malloc((size_t) NEON_BATCH * DST_BYTES);
+    int     *mxs    = malloc(NEON_BATCH * sizeof(int));
+    for (int i = 0; i < NEON_BATCH; i++) {
+        for (int j = 0; j < SRC_BYTES; j++)
+            master[(size_t)i * SRC_BYTES + j] = (uint8_t)(xs_step(&s) & 0xff);
+        mxs[i] = (int)(xs_step(&s) & 15);
+    }
+
+    pthread_barrier_wait(&g_start);
+    double t0 = now_s();
+    uint64_t done = 0;
+    while (!g_stop) {
+        memcpy(work, master, (size_t) NEON_BATCH * SRC_BYTES);
+        for (int i = 0; i < NEON_BATCH; i++)
+            ff_vp9_put_regular8_h_neon(
+                dsts + (size_t)i * DST_BYTES, DST_W,
+                work + (size_t)i * SRC_BYTES + 3, SRC_W,
+                DST_H, mxs[i], 0);
+        done += NEON_BATCH;
+    }
+    a->elapsed_s = now_s() - t0;
+    a->blocks_done = done;
+    free(master); free(work); free(dsts); free(mxs);
+    return NULL;
+}
+
+/* --- QPU worker ----------- */
+
+typedef struct {
+    int affinity_core, n_blocks;
+    uint64_t blocks_done;
+    double elapsed_s;
+} qpu_args;
+
+typedef struct {
+    uint32_t n_blocks, dst_stride_u8, src_stride_u8, _pad;
+} push_consts;
+
+static void *qpu_worker(void *p)
+{
+    qpu_args *a = p;
+    cpu_set_t cs; CPU_ZERO(&cs); CPU_SET(a->affinity_core, &cs);
+    pthread_setaffinity_np(pthread_self(), sizeof(cs), &cs);
+
+    v3d_runner *r = v3d_runner_create();
+    if (!r) { fprintf(stderr, "v3d_runner_create failed\n"); exit(1); }  /* a bare return would deadlock g_start */
+
+    int n_blocks = a->n_blocks;
+    size_t meta_bytes = (size_t) n_blocks * 4 * sizeof(uint32_t);
+    size_t src_bytes  = (size_t) n_blocks * SRC_BYTES;
+    size_t dst_bytes  = (size_t) n_blocks * DST_BYTES;
+
+    v3d_buffer buf_meta = {0}, buf_dst = {0}, buf_src = {0};
+    v3d_runner_create_buffer(r, meta_bytes, &buf_meta);
+    v3d_runner_create_buffer(r, dst_bytes,  &buf_dst);
+    v3d_runner_create_buffer(r, src_bytes,  &buf_src);
+
+    uint64_t s = 0xfeedfacecafebabeULL;
+    uint8_t *master = malloc(src_bytes);
+    for (size_t i = 0; i < src_bytes; i++) master[i] = (uint8_t)(xs_step(&s) & 0xff);
+    memcpy(buf_src.mapped, master, src_bytes);
+
+    uint32_t *meta = buf_meta.mapped;
+    assert(DST_W >= 8); assert(SRC_W >= 15);
+    for (int i = 0; i < n_blocks; i++) {
+        meta[4*i + 0] = (uint32_t)((size_t)i * DST_BYTES);   /* dst_off */
+        meta[4*i + 1] = (uint32_t)((size_t)i * SRC_BYTES);   /* src_off (RAW, no +3) */
+        meta[4*i + 2] = (uint32_t)(xs_step(&s) & 15);        /* mx */
+        meta[4*i + 3] = 0;
+    }
+
+    v3d_pipeline pipe = {0};
+    v3d_runner_create_pipeline(r, "v3d_mc_8h.spv", 3, sizeof(push_consts), &pipe);
+    v3d_buffer bufs[3] = { buf_meta, buf_dst, buf_src };
+    v3d_runner_bind_buffers(r, &pipe, bufs, 3);
+
+    const uint32_t bpw = 32;
+    uint32_t gc = (uint32_t)((n_blocks + bpw - 1) / bpw);
+    push_consts pc = { .n_blocks = (uint32_t) n_blocks,
+                       .dst_stride_u8 = DST_W,
+                       .src_stride_u8 = SRC_W };
+
+    VkCommandBuffer cb = v3d_runner_alloc_cmdbuf(r);
+    VkCommandBufferBeginInfo cbbi = { .sType = VK_STRUCTURE_TYPE_COMMAND_BUFFER_BEGIN_INFO };
+    vkBeginCommandBuffer(cb, &cbbi);
+    vkCmdBindPipeline(cb, VK_PIPELINE_BIND_POINT_COMPUTE, pipe.pipeline);
+    vkCmdBindDescriptorSets(cb, VK_PIPELINE_BIND_POINT_COMPUTE,
+                            pipe.layout, 0, 1, &pipe.desc_set, 0, NULL);
+    vkCmdPushConstants(cb, pipe.layout, VK_SHADER_STAGE_COMPUTE_BIT,
+                       0, sizeof(pc), &pc);
+    vkCmdDispatch(cb, gc, 1, 1);
+    vkEndCommandBuffer(cb);
+
+    for (int i = 0; i < 5; i++) v3d_runner_submit_wait(r, cb);   /* untimed warm-up submits */
+
+    pthread_barrier_wait(&g_start);
+    double t0 = now_s();
+    uint64_t done = 0;
+    while (!g_stop) {
+        memset(buf_dst.mapped, 0, dst_bytes);
+        v3d_runner_submit_wait(r, cb);
+        done += n_blocks;
+    }
+    a->elapsed_s = now_s() - t0;
+    a->blocks_done = done;
+
+    free(master);
+    v3d_runner_destroy_pipeline(r, &pipe);
+    v3d_runner_destroy_buffer(r, &buf_src);
+    v3d_runner_destroy_buffer(r, &buf_dst);
+    v3d_runner_destroy_buffer(r, &buf_meta);
+    v3d_runner_destroy(r);
+    return NULL;
+}
+
+typedef struct { double duration_s; } timer_args;
+static void *timer_thread(void *p) {
+    timer_args *a = p;
+    pthread_barrier_wait(&g_start);
+    double end = now_s() + a->duration_s;
+    while (now_s() < end) {
+        struct timespec ts = {0, 1000000}; nanosleep(&ts, NULL);
+    }
+    g_stop = 1;
+    return NULL;
+}
+
+enum mode { MODE_NEON, MODE_QPU, MODE_MIXED };
+
+int main(int argc, char **argv)
+{
+    enum mode mode = MODE_NEON;
+    int n_neon = 4, qpu_core = 3, qpu_n_blocks = 65536;
+    double duration = 8.0;
+
+    static struct option opts[] = {
+        {"mode",         required_argument, 0, 'm'},
+        {"neon-threads", required_argument, 0, 'n'},
+        {"qpu-core",     required_argument, 0, 'c'},
+        {"qpu-blocks",   required_argument, 0, 'b'},
+        {"duration",     required_argument, 0, 'd'},
+        {0,0,0,0}
+    };
+    for (int c; (c = getopt_long(argc, argv, "m:n:c:b:d:", opts, 0)) != -1;) {
+        switch (c) {
+        case 'm':
+            if      (!strcmp(optarg, "neon-only")) mode = MODE_NEON;
+            else if (!strcmp(optarg, "qpu-only"))  mode = MODE_QPU;
+            else if (!strcmp(optarg, "mixed"))     mode = MODE_MIXED;
+            else { fprintf(stderr, "bad mode '%s' (expected neon-only, qpu-only, or mixed)\n", optarg); return 2; }
+            break;
+        case 'n': n_neon = atoi(optarg); if (n_neon < 1 || n_neon > 16) { fprintf(stderr, "neon-threads must be in 1..16\n"); return 2; } break;
+        case 'c': qpu_core = atoi(optarg); break;
+        case 'b': qpu_n_blocks = atoi(optarg); if (qpu_n_blocks < 1) { fprintf(stderr, "qpu-blocks must be >= 1\n"); return 2; } break;
+        case 'd': duration = atof(optarg); break;
+        default: return 2;
+        }
+    }
+    int has_qpu  = (mode == MODE_QPU || mode == MODE_MIXED);
+    int has_neon = (mode == MODE_NEON || mode == MODE_MIXED);
+    int n_workers = (has_neon ? n_neon : 0) + (has_qpu ? 1 : 0);
+    int barrier_count = n_workers + 1 + 1;   /* workers + timer thread + main */
+
+    printf("=== M4''' concurrent MC bench ===\n");
+    printf("  mode: %s, neon: %d, qpu: core %d / %d blocks, %.1fs\n",
+           mode == MODE_NEON ? "neon-only" : mode == MODE_QPU ? "qpu-only" : "mixed",
+           has_neon ? n_neon : 0,
+           has_qpu ? qpu_core : -1,
+           has_qpu ? qpu_n_blocks : 0,
+           duration);
+
+    pthread_barrier_init(&g_start, NULL, barrier_count);
+
+    pthread_t timer_tid; timer_args ta = { .duration_s = duration };
+    pthread_create(&timer_tid, NULL, timer_thread, &ta);
+
+    pthread_t neon_tids[16] = {0};
+    neon_args n_args[16] = {0};
+    if (has_neon) {
+        for (int i = 0; i < n_neon; i++) {
+            n_args[i] = (neon_args){ .worker_id = i, .affinity_core = i };
+            pthread_create(&neon_tids[i], NULL, neon_worker, &n_args[i]);
+        }
+    }
+    pthread_t qpu_tid = 0;
+    qpu_args q_args = {0};
+    if (has_qpu) {
+        q_args = (qpu_args){ .affinity_core = qpu_core, .n_blocks = qpu_n_blocks };
+        pthread_create(&qpu_tid, NULL, qpu_worker, &q_args);
+    }
+
+    pthread_barrier_wait(&g_start);
+
+    pthread_join(timer_tid, NULL);
+    if (has_neon) for (int i = 0; i < n_neon; i++) pthread_join(neon_tids[i], NULL);
+    if (has_qpu)  pthread_join(qpu_tid, NULL);
+
+    uint64_t total = 0; double max_e = 0;
+    if (has_neon) {
+        printf("NEON per-thread:\n");
+        for (int i = 0; i < n_neon; i++) {
+            double mbs = n_args[i].blocks_done / n_args[i].elapsed_s / 1e6;
+            printf("  core %d: %.3f Mblock/s\n", n_args[i].affinity_core, mbs);
+            total += n_args[i].blocks_done;
+            if (n_args[i].elapsed_s > max_e) max_e = n_args[i].elapsed_s;
+        }
+    }
+    if (has_qpu) {
+        double mbs = q_args.blocks_done / q_args.elapsed_s / 1e6;
+        printf("QPU (core %d): %.3f Mblock/s\n", q_args.affinity_core, mbs);
+        total += q_args.blocks_done;
+        if (q_args.elapsed_s > max_e) max_e = q_args.elapsed_s;
+    }
+
+    double total_mbs = total / max_e / 1e6;
+    printf("\n=== AGGREGATE ===\n");
+    printf("  Mblock/s        : %.3f\n", total_mbs);
+    printf("  30fps@1080p floor: 0.972 Mblock/s — %.1fx margin\n",
+           total_mbs / 0.972);
+
+    pthread_barrier_destroy(&g_start);
+    return 0;
+}
@@ -0,0 +1,220 @@
+/*
+ * Cycle 3 Phase 3 — NEON M3''' baseline for VP9 8-tap regular
+ * horizontal MC interpolation, 8×8 block.
+ *
+ * Reports:
+ *   M1'''_c (correctness): C-ref ↔ NEON bit-exact rate, N random
+ *                          8×8 blocks with random source pixels and
+ *                          random subpel phase mx ∈ [0, 15]
+ *   M3'''   (throughput):  NEON sustained Mblock/s, single-thread,
+ *                          time-based
+ *
+ * License: LGPL-2.1+ (statically links FFmpeg NEON snapshot).
+ */
+#define _POSIX_C_SOURCE 200809L
+#include <stdio.h>
+#include <stdlib.h>
+#include <stdint.h>
+#include <stddef.h>
+#include <string.h>
+#include <time.h>
+#include <getopt.h>
+
+extern void daedalus_vp9_put_regular_8h_ref(
+    uint8_t *dst, ptrdiff_t dst_stride,
+    const uint8_t *src, ptrdiff_t src_stride,
+    int h, int mx, int my);
+
+extern void ff_vp9_put_regular8_h_neon(
+    uint8_t *dst, ptrdiff_t dst_stride,
+    const uint8_t *src, ptrdiff_t src_stride,
+    int h, int mx, int my);
+
+/* RNG ------------------------------------------------------------ */
+
+static uint64_t xs_state;
+static inline uint64_t xs(void) {
+    uint64_t x = xs_state;
+    x ^= x << 13; x ^= x >> 7; x ^= x << 17;
+    return xs_state = x;
+}
+
+/* Block layout: each block gets its own 8×16 source buffer + 8×8 dst.
+ *   - source buffer is 16 cols wide; the filter is called with
+ *     src = block_src + 3, so it reads cols [src+0-3..src+7+4] =
+ *     [0..14] of the 16-col buffer. col 15 is unused padding.
+ *   - dst is 8 cols × 8 rows.
+ */
+#define SRC_W 16
+#define SRC_H 8
+#define DST_W 8
+#define DST_H 8
+#define SRC_BYTES (SRC_H * SRC_W)  /* 128 */
+#define DST_BYTES (DST_H * DST_W)  /* 64 */
+
+static void gen_src(uint8_t *buf)
+{
+    for (int i = 0; i < SRC_BYTES; i++)
+        buf[i] = (uint8_t)(xs() & 0xff);
+}
+
+static double now_seconds(void)
+{
+    struct timespec ts;
+    clock_gettime(CLOCK_MONOTONIC_RAW, &ts);
+    return ts.tv_sec + ts.tv_nsec * 1e-9;
+}
+
+/* M1'''_c correctness gate -------------------------------------- */
+
+static int correctness_check(uint64_t seed, int n_blocks)
+{
+    xs_state = seed ? seed : 0xabcdef1234567890ULL;
+    int mismatches = 0;
+    uint8_t src[SRC_BYTES];
+    uint8_t dst_a[DST_BYTES], dst_b[DST_BYTES];
+
+    int mx_hist[16] = {0};
+
+    for (int i = 0; i < n_blocks; i++) {
+        gen_src(src);
+        int mx = (int)(xs() & 15);
+        mx_hist[mx]++;
+
+        memset(dst_a, 0, DST_BYTES);
+        memset(dst_b, 0, DST_BYTES);
+
+        daedalus_vp9_put_regular_8h_ref(dst_a, DST_W, src + 3, SRC_W, DST_H, mx, 0);
+        ff_vp9_put_regular8_h_neon  (dst_b, DST_W, src + 3, SRC_W, DST_H, mx, 0);
+
+        if (memcmp(dst_a, dst_b, DST_BYTES) != 0) {
+            if (mismatches < 3) {
+                fprintf(stderr, "MISMATCH block %d mx=%d:\n", i, mx);
+                fprintf(stderr, "  ref:");
+                for (int r = 0; r < 8; r++) {
+                    fprintf(stderr, "\n    r%d ", r);
+                    for (int c = 0; c < 8; c++) fprintf(stderr, "%3u ", dst_a[r*8+c]);
+                }
+                fprintf(stderr, "\n  neon:");
+                for (int r = 0; r < 8; r++) {
+                    fprintf(stderr, "\n    r%d ", r);
+                    for (int c = 0; c < 8; c++) fprintf(stderr, "%3u ", dst_b[r*8+c]);
+                }
+                fprintf(stderr, "\n");
+            }
+            mismatches++;
+        }
+    }
+    printf("M1'''_c correctness: %d / %d blocks bit-exact (%.4f%%)\n",
+           n_blocks - mismatches, n_blocks,
+           100.0 * (n_blocks - mismatches) / n_blocks);
+    /* mx histogram — confirms all 16 phases get exercised. */
+    int min_mx = mx_hist[0], max_mx = mx_hist[0];
+    for (int i = 1; i < 16; i++) {
+        if (mx_hist[i] < min_mx) min_mx = mx_hist[i];
+        if (mx_hist[i] > max_mx) max_mx = mx_hist[i];
+    }
+    printf("  mx phase coverage: min=%d max=%d (all 16 phases hit iff min > 0)\n",
+           min_mx, max_mx);
+    return mismatches;
+}
+
+/* M3''' throughput ---------------------------------------------- */
+
+static void throughput_neon(uint64_t seed, int n_blocks, double duration_s)
+{
+    xs_state = seed ? seed : 0xdeadbeef12345678ULL;
+
+    uint8_t *master_src = malloc((size_t) n_blocks * SRC_BYTES);
+    uint8_t *work_src   = malloc((size_t) n_blocks * SRC_BYTES);
+    uint8_t *dsts       = malloc((size_t) n_blocks * DST_BYTES);
+    int     *mxs        = malloc(n_blocks * sizeof(int));
+    if (!master_src || !work_src || !dsts || !mxs) { fprintf(stderr, "alloc fail\n"); exit(1); }
+
+    for (int i = 0; i < n_blocks; i++) {
+        gen_src(master_src + (size_t)i * SRC_BYTES);
+        mxs[i] = (int)(xs() & 15);
+    }
+
+    /* Warm-up pass (untimed). */
+    memcpy(work_src, master_src, (size_t) n_blocks * SRC_BYTES);
+    for (int i = 0; i < n_blocks; i++)
+        ff_vp9_put_regular8_h_neon(dsts + (size_t)i * DST_BYTES, DST_W,
+                                   work_src + (size_t)i * SRC_BYTES + 3, SRC_W,
+                                   DST_H, mxs[i], 0);
+
+    double t0 = now_seconds();
+    double t_end = t0 + duration_s;
+    uint64_t done = 0;
+    while (now_seconds() < t_end) {
+        memcpy(work_src, master_src, (size_t) n_blocks * SRC_BYTES);
+        for (int i = 0; i < n_blocks; i++)
+            ff_vp9_put_regular8_h_neon(dsts + (size_t)i * DST_BYTES, DST_W,
+                                       work_src + (size_t)i * SRC_BYTES + 3, SRC_W,
+                                       DST_H, mxs[i], 0);
+        done += n_blocks;
+    }
+    double elapsed = now_seconds() - t0;
+
+    /* Time the memcpy-only setup on its own, then subtract it from the total. */
+    int setup_iters = (int) (done / n_blocks);
+    double s0 = now_seconds();
+    for (int it = 0; it < setup_iters; it++)
+        memcpy(work_src, master_src, (size_t) n_blocks * SRC_BYTES);
+    double s1 = now_seconds();
+
+    double kernel_seconds = elapsed - (s1 - s0);
+    double mbps = done / kernel_seconds / 1e6;
+
+    printf("M3''' NEON throughput:\n");
+    printf("  blocks/batch:    %d\n", n_blocks);
+    printf("  batches done:    %d\n", setup_iters);
+    printf("  total blocks:    %llu\n", (unsigned long long) done);
+    printf("  elapsed (kernel)=%.6f s\n", kernel_seconds);
+    printf("  elapsed (setup) =%.6f s\n", s1 - s0);
+    printf("  throughput      = %.3f Mblock/s\n", mbps);
+    printf("  per-block       = %.1f ns\n", kernel_seconds / done * 1e9);
+    /* 1080p: 32400 blocks/frame */
+    printf("  equiv 1080p     = %.1f FPS  (32400 blocks/frame)\n",
+           mbps * 1e6 / 32400.0);
+
+    free(master_src); free(work_src); free(dsts); free(mxs);
+}
+
+int main(int argc, char **argv)
+{
+    int n_blocks = 65536;
+    double duration = 5.0;
+    uint64_t seed = 0;
+    int do_correctness = 1;
+
+    static struct option opts[] = {
+        {"blocks",         required_argument, 0, 'b'},
+        {"duration",       required_argument, 0, 'd'},
+        {"seed",           required_argument, 0, 's'},
+        {"no-correctness", no_argument,       0, 'C'},
+        {0,0,0,0}
+    };
+    for (int c; (c = getopt_long(argc, argv, "b:d:s:C", opts, 0)) != -1;) {
+        switch (c) {
+        case 'b': n_blocks = atoi(optarg); break;
+        case 'd': duration = atof(optarg); break;
+        case 's': seed = strtoull(optarg, 0, 0); break;
+        case 'C': do_correctness = 0; break;
+        default: return 2;
+        }
+    }
+
+    if (do_correctness) {
+        printf("=== M1'''_c bit-exact (10000 random blocks) ===\n");
+        if (correctness_check(seed, 10000) != 0) {
+            fprintf(stderr, "REFUSING to measure throughput on a broken kernel.\n");
+            return 1;
+        }
+        printf("\n");
+    }
+
+    printf("=== M3''' NEON throughput ===\n");
+    throughput_neon(seed, n_blocks, duration);
+    return 0;
+}
@@ -0,0 +1,303 @@
+/*
+ * Cycle 3 Phase 6 — QPU bench for VP9 8-tap "regular" subpel filter,
+ * horizontal, 8-wide output on V3D 7.1.
+ *
+ * Reports:
+ *   M1''' (correctness): QPU output vs C reference, N blocks across
+ *                        all 16 mx phases
+ *   M2''' (throughput):  QPU sustained Mblock/s
+ *
+ * Per k3_mc_phase4.md §5 (revised per phase5''' findings 4 + 6):
+ *   - src_off is the RAW block base (no +3 shift)
+ *   - assert(dst_stride_u8 >= 8 && src_stride_u8 >= 15)
+ *
+ * License: BSD-2-Clause.
+ */
+#define _POSIX_C_SOURCE 200809L
+#include <stdio.h>
+#include <stdlib.h>
+#include <stdint.h>
+#include <stddef.h>
+#include <string.h>
+#include <assert.h>
+#include <time.h>
+#include <getopt.h>
+#include <vulkan/vulkan.h>
+
+#include "v3d_runner.h"
+
+extern void daedalus_vp9_put_regular_8h_ref(
+    uint8_t *dst, ptrdiff_t dst_stride,
+    const uint8_t *src, ptrdiff_t src_stride,
+    int h, int mx, int my);
+
+/* Per-block layout: src buffer 8 rows × 16 cols = 128 bytes. The
+ * C bench's src+3 convention: NEON/C ref is called with
+ * `src = block_base + 3, src_stride = 16`. The shader's src_off
+ * is the RAW block_base (no +3 shift), and the shader reads
+ * s[0..14] from src_off + row*stride. Together this means:
+ *   shader's s[k] for k=0..14 = master_src[block_base + row*16 + k]
+ *   C ref's `src[x+k-3]` for x=0..7, k=0..7 with `src = block_base+3`
+ *     = master_src[block_base + row*16 + (x+k)]
+ *     = master_src[block_base + row*16 + (0..14)]
+ * which is exactly what the shader reads. */
+
+#define SRC_W 16
+#define SRC_H 8
+#define DST_W 8
+#define DST_H 8
+#define SRC_BYTES (SRC_H * SRC_W)
+#define DST_BYTES (DST_H * DST_W)
+
+static uint64_t xs_state;
+static inline uint64_t xs(void) {
+    uint64_t x = xs_state;
+    x ^= x << 13; x ^= x >> 7; x ^= x << 17;
+    return xs_state = x;
+}
+static void gen_src(uint8_t *b) {
+    for (int i = 0; i < SRC_BYTES; i++) b[i] = (uint8_t)(xs() & 0xff);
+}
+static double now_seconds(void) {
+    struct timespec ts;
+    clock_gettime(CLOCK_MONOTONIC_RAW, &ts);
+    return ts.tv_sec + ts.tv_nsec * 1e-9;
+}
+
+typedef struct {
+    uint32_t n_blocks;
+    uint32_t dst_stride_u8;
+    uint32_t src_stride_u8;
+    uint32_t _pad;
+} push_consts;
+
+int main(int argc, char **argv)
+{
+    int n_blocks = 65536;
+    int iters = 100;
+    uint64_t seed = 0;
+    int verify_only = 0;
+    const char *spv_path = "v3d_mc_8h.spv";
+
+    static struct option opts[] = {
+        {"blocks",      required_argument, 0, 'b'},
+        {"iters",       required_argument, 0, 'i'},
+        {"seed",        required_argument, 0, 's'},
+        {"spv",         required_argument, 0, 'S'},
+        {"verify-only", no_argument,       0, 'V'},
+        {0,0,0,0}
+    };
+    for (int c; (c = getopt_long(argc, argv, "b:i:s:S:V", opts, 0)) != -1;) {
+        switch (c) {
+        case 'b': n_blocks    = atoi(optarg); break;
+        case 'i': iters       = atoi(optarg); break;
+        case 's': seed        = strtoull(optarg, 0, 0); break;
+        case 'S': spv_path    = optarg; break;
+        case 'V': verify_only = 1; break;
+        default: return 2;
+        }
+    }
+
+    xs_state = seed ? seed : 0xabcdef1234567890ULL;
+
+    v3d_runner *r = v3d_runner_create();
+    if (!r) { fprintf(stderr, "v3d_runner_create failed\n"); return 1; }
+    printf("=== v3d MC 8h bench ===\n");
+    printf("  device: %s\n", v3d_runner_device_name(r));
+    printf("  n_blocks: %d  iters: %d\n", n_blocks, iters);
+
+    /* Buffers: meta + dst + src, all blocks contiguous. */
+    size_t meta_bytes = (size_t) n_blocks * 4 * sizeof(uint32_t);
+    size_t src_bytes  = (size_t) n_blocks * SRC_BYTES;
+    size_t dst_bytes  = (size_t) n_blocks * DST_BYTES;
+
+    v3d_buffer buf_meta = {0}, buf_dst = {0}, buf_src = {0};
+    if (v3d_runner_create_buffer(r, meta_bytes, &buf_meta)) return 1;
+    if (v3d_runner_create_buffer(r, dst_bytes,  &buf_dst))  return 1;
+    if (v3d_runner_create_buffer(r, src_bytes,  &buf_src))  return 1;
+
+    uint8_t *master_src = malloc(src_bytes);
+    uint8_t *expected   = malloc(dst_bytes);
+    int     *mxs        = malloc(n_blocks * sizeof(int));
+    if (!master_src || !expected || !mxs) { fprintf(stderr, "alloc\n"); return 1; }
+    for (int i = 0; i < n_blocks; i++) {
+        gen_src(master_src + (size_t)i * SRC_BYTES);
+        mxs[i] = (int)(xs() & 15);
+    }
+
+    /* Build C-ref expected. C ref takes `src + 3, src_stride = SRC_W`. */
+    memset(expected, 0, dst_bytes);
+    for (int i = 0; i < n_blocks; i++) {
+        daedalus_vp9_put_regular_8h_ref(
+            expected + (size_t)i * DST_BYTES, DST_W,
+            master_src + (size_t)i * SRC_BYTES + 3, SRC_W,
+            DST_H, mxs[i], 0);
+    }
+
+    /* Populate GPU buffers. Contracts (phase4 §5) enforced via asserts. */
+    uint32_t dst_stride_u8 = DST_W;
+    uint32_t src_stride_u8 = SRC_W;
+    assert(dst_stride_u8 >= 8 && "phase4 §5 contract 1");
+    assert(src_stride_u8 >= 15 && "phase4 §5 contract 2");
+
+    uint32_t *meta = (uint32_t *) buf_meta.mapped;
+    for (int i = 0; i < n_blocks; i++) {
+        /* src_off: RAW block base. NO +3 shift. (phase5''' finding 4) */
+        uint32_t src_off = (uint32_t)((size_t)i * SRC_BYTES);
+        uint32_t dst_off = (uint32_t)((size_t)i * DST_BYTES);
+        meta[4*i + 0] = dst_off;
+        meta[4*i + 1] = src_off;
+        meta[4*i + 2] = (uint32_t) mxs[i];
+        meta[4*i + 3] = 0;
+    }
+    memcpy(buf_src.mapped, master_src, src_bytes);
+    memset(buf_dst.mapped, 0, dst_bytes);
+
+    /* Pipeline. */
+    v3d_pipeline pipe = {0};
+    if (v3d_runner_create_pipeline(r, spv_path,
+                                   /*n_ssbos=*/3,
+                                   /*push_const_size=*/sizeof(push_consts),
+                                   &pipe)) return 1;
+    v3d_buffer bind_bufs[3] = { buf_meta, buf_dst, buf_src };
+    if (v3d_runner_bind_buffers(r, &pipe, bind_bufs, 3)) return 1;
+
+    const uint32_t blocks_per_wg = 32;
+    uint32_t group_count_x = (uint32_t)((n_blocks + blocks_per_wg - 1) / blocks_per_wg);
+    printf("  dispatch: %u WGs × %u blocks/WG (256 invocations each) = %u block slots (rounded up from %d)\n",
+           group_count_x, blocks_per_wg, group_count_x * blocks_per_wg, n_blocks);
+
+    push_consts pc = {
+        .n_blocks      = (uint32_t) n_blocks,
+        .dst_stride_u8 = dst_stride_u8,
+        .src_stride_u8 = src_stride_u8,
+        ._pad = 0,
+    };
+
+    VkCommandBuffer cb = v3d_runner_alloc_cmdbuf(r);
+    VkCommandBufferBeginInfo cbbi = { .sType = VK_STRUCTURE_TYPE_COMMAND_BUFFER_BEGIN_INFO };
+    if (vkBeginCommandBuffer(cb, &cbbi) != VK_SUCCESS) return 1;
+    vkCmdBindPipeline(cb, VK_PIPELINE_BIND_POINT_COMPUTE, pipe.pipeline);
+    vkCmdBindDescriptorSets(cb, VK_PIPELINE_BIND_POINT_COMPUTE,
+                            pipe.layout, 0, 1, &pipe.desc_set, 0, NULL);
+    vkCmdPushConstants(cb, pipe.layout, VK_SHADER_STAGE_COMPUTE_BIT,
+                       0, sizeof(pc), &pc);
+    vkCmdDispatch(cb, group_count_x, 1, 1);
+    if (vkEndCommandBuffer(cb) != VK_SUCCESS) return 1;
+
+    /* --- M1''' bit-exact --- */
+    printf("\n=== M1''': QPU vs C reference bit-exact ===\n");
+    memset(buf_dst.mapped, 0, dst_bytes);
+    if (v3d_runner_submit_wait(r, cb)) return 1;
+
+    int mismatch_blocks = 0;
+    int total_byte_diffs = 0;
+    int prints = 0;
+    for (int i = 0; i < n_blocks; i++) {
+        const uint8_t *q = (uint8_t *) buf_dst.mapped + (size_t)i * DST_BYTES;
+        const uint8_t *e = expected + (size_t)i * DST_BYTES;
+        if (memcmp(q, e, DST_BYTES) != 0) {
+            int diffs = 0;
+            for (int j = 0; j < DST_BYTES; j++) if (q[j] != e[j]) diffs++;
+            total_byte_diffs += diffs;
+            if (prints < 3) {
+                fprintf(stderr, "MISMATCH block %d mx=%d: %d/%d bytes differ\n",
+                        i, mxs[i], diffs, (int) DST_BYTES);
+                fprintf(stderr, "  ref:");
+                for (int r0 = 0; r0 < 8; r0++) {
+                    fprintf(stderr, "\n    r%d ", r0);
+                    for (int c = 0; c < 8; c++) fprintf(stderr, "%3u ", e[r0*8+c]);
+                }
+                fprintf(stderr, "\n  qpu:");
+                for (int r0 = 0; r0 < 8; r0++) {
+                    fprintf(stderr, "\n    r%d ", r0);
+                    for (int c = 0; c < 8; c++) fprintf(stderr, "%3u ", q[r0*8+c]);
+                }
+                fprintf(stderr, "\n");
+                prints++;
+            }
+            mismatch_blocks++;
+        }
+    }
+    printf("  blocks bit-exact: %d / %d (%.4f%%)\n",
+           n_blocks - mismatch_blocks, n_blocks,
+           100.0 * (n_blocks - mismatch_blocks) / n_blocks);
+    printf("  total byte diffs: %d / %zu (%.4f%%)\n",
+           total_byte_diffs, (size_t) n_blocks * DST_BYTES,
+           100.0 * total_byte_diffs / ((double) n_blocks * DST_BYTES));
+
+    if (mismatch_blocks > 0) {
+        fprintf(stderr, "REFUSING to measure throughput on a broken kernel.\n");
+        v3d_runner_destroy_pipeline(r, &pipe);
+        v3d_runner_destroy_buffer(r, &buf_src);
+        v3d_runner_destroy_buffer(r, &buf_dst);
+        v3d_runner_destroy_buffer(r, &buf_meta);
+        v3d_runner_destroy(r);
+        free(master_src); free(expected); free(mxs);
+        return 1;
+    }
+
+    if (verify_only) {
+        v3d_runner_destroy_pipeline(r, &pipe);
+        v3d_runner_destroy_buffer(r, &buf_src);
+        v3d_runner_destroy_buffer(r, &buf_dst);
+        v3d_runner_destroy_buffer(r, &buf_meta);
+        v3d_runner_destroy(r);
+        free(master_src); free(expected); free(mxs);
+        return 0;
+    }
+
+    /* --- M2''' throughput --- */
+    printf("\n=== M2''': QPU throughput ===\n");
+
+    for (int i = 0; i < 10; i++) {
+        memset(buf_dst.mapped, 0, dst_bytes);
+        if (v3d_runner_submit_wait(r, cb)) return 1;
+    }
+
+    double t0 = now_seconds();
+    for (int i = 0; i < iters; i++) {
+        memset(buf_dst.mapped, 0, dst_bytes);
+        if (v3d_runner_submit_wait(r, cb)) return 1;
+    }
+    double t1 = now_seconds();
+
+    double s0 = now_seconds();
+    for (int i = 0; i < iters; i++) memset(buf_dst.mapped, 0, dst_bytes);
+    double s1 = now_seconds();
+
+    double kernel_seconds = (t1 - t0) - (s1 - s0);   /* subtract the host-side memset cost timed above */
+    double total_blocks   = (double) n_blocks * iters;
+    double mbps           = total_blocks / kernel_seconds / 1e6;
+
+    printf("  blocks/dispatch: %d\n", n_blocks);
+    printf("  iters:           %d\n", iters);
+    printf("  total blocks:    %.0f\n", total_blocks);
+    printf("  elapsed (kernel)=%.6f s\n", kernel_seconds);
+    printf("  elapsed (setup) =%.6f s\n", s1 - s0);
+    printf("  M2''' throughput = %.3f Mblock/s\n", mbps);
+    printf("  per-block        = %.1f ns\n", kernel_seconds / total_blocks * 1e9);
+    printf("  per-dispatch     = %.1f us\n", kernel_seconds / iters * 1e6);
+
+    double M3 = 20.997;   /* from k3_mc_phase3.md */
+    double R  = mbps / M3;
+    printf("\n  Cycle 3 NEON M3''' = %.3f Mblock/s\n", M3);
+    printf("  R''' = M2'''/M3''' = %.3f\n", R);
+    if      (R >= 1.0) printf("  decision band      = GREEN: QPU beats NEON in isolation\n");
+    else if (R >= 0.5) printf("  decision band      = YELLOW: M4''' decides\n");
+    else if (R >= 0.1) printf("  decision band      = ORANGE: M4''' may still rescue\n");
+    else               printf("  decision band      = RED: structural mismatch\n");
+
+    /* 30fps@1080p floor (per project_30fps_floor_is_fine.md): (1920/8) * (1080/8) = 32400 blocks/frame. */
+    double floor_mblocks_per_s = 32400.0 * 30.0 / 1e6;
+    printf("\n  30fps@1080p floor : %.3f Mblock/s (32400 blocks × 30 fps)\n",
+           floor_mblocks_per_s);
+    printf("  isolation margin  : %.1fx over 30fps floor\n",
+           mbps / floor_mblocks_per_s);
+
+    v3d_runner_destroy_pipeline(r, &pipe);
+    v3d_runner_destroy_buffer(r, &buf_src);
+    v3d_runner_destroy_buffer(r, &buf_dst);
+    v3d_runner_destroy_buffer(r, &buf_meta);
+    v3d_runner_destroy(r);
+    free(master_src); free(expected); free(mxs);
+    return 0;
+}
@@ -0,0 +1,72 @@
+/*
+ * Standalone bit-exact C reference for VP9 8-tap "regular" subpel
+ * filter, horizontal direction, 8-pixel-wide output. Transcribed
+ * from FFmpeg's libavcodec/vp9dsp_template.c FILTER_8TAP macro
+ * (vendored at external/ffmpeg-snapshot/). 8-bit pixels only.
+ *
+ * Filter coefficients embedded inline (REGULAR filter only, all 16
+ * subpel phases). Same values as ff_vp9_subpel_filters[1][mx] in
+ * external/ffmpeg-snapshot/libavcodec/vp9_subpel_filters_table.c.
+ *
+ * License: LGPL-2.1-or-later.
+ *
+ * Spec source: VP9 specification §8.5.1 — subpel motion compensation.
+ */
+#include <stdint.h>
+#include <stddef.h>
+
+static const int16_t vp9_8tap_regular_filters[16][8] = {
+    {  0,  0,   0, 128,   0,   0,  0,  0 },
+    {  0,  1,  -5, 126,   8,  -3,  1,  0 },
+    { -1,  3, -10, 122,  18,  -6,  2,  0 },
+    { -1,  4, -13, 118,  27,  -9,  3, -1 },
+    { -1,  4, -16, 112,  37, -11,  4, -1 },
+    { -1,  5, -18, 105,  48, -14,  4, -1 },
+    { -1,  5, -19,  97,  58, -16,  5, -1 },
+    { -1,  6, -19,  88,  68, -18,  5, -1 },
+    { -1,  6, -19,  78,  78, -19,  6, -1 },
+    { -1,  5, -18,  68,  88, -19,  6, -1 },
+    { -1,  5, -16,  58,  97, -19,  5, -1 },
+    { -1,  4, -14,  48, 105, -18,  5, -1 },
+    { -1,  4, -11,  37, 112, -16,  4, -1 },
+    { -1,  3,  -9,  27, 118, -13,  4, -1 },
+    {  0,  2,  -6,  18, 122, -10,  3, -1 },
+    {  0,  1,  -3,   8, 126,  -5,  1,  0 },
+};
+
+static inline uint8_t clip_u8(int x)
+{
+    return (uint8_t)(x > 255 ? 255 : x < 0 ? 0 : x);
+}
+
+/*
+ * 8xh horizontal 8-tap "put" (non-averaging). Width hard-coded to 8; height is `h`.
+ * `src` must point at the row-0 output-column-0 source pixel; valid
+ * source memory must extend src[r*src_stride + (-3..+11)] for r=0..h-1.
+ * `dst` is written at dst[r*dst_stride + 0..7] for r=0..h-1.
+ *
+ * Matches ff_vp9_put_regular8_h_neon byte-for-byte on 8-bit input.
+ */
+void daedalus_vp9_put_regular_8h_ref(uint8_t *dst, ptrdiff_t dst_stride,
+                                     const uint8_t *src, ptrdiff_t src_stride,
+                                     int h, int mx, int my)
+{
+    (void) my;   /* horizontal-only filter ignores y phase */
+    const int16_t *F = vp9_8tap_regular_filters[mx & 15];
+
+    for (int r = 0; r < h; r++) {
+        for (int x = 0; x < 8; x++) {
+            int sum = F[0] * (int) src[x - 3]
+                    + F[1] * (int) src[x - 2]
+                    + F[2] * (int) src[x - 1]
+                    + F[3] * (int) src[x + 0]
+                    + F[4] * (int) src[x + 1]
+                    + F[5] * (int) src[x + 2]
+                    + F[6] * (int) src[x + 3]
+                    + F[7] * (int) src[x + 4];
+            dst[x] = clip_u8((sum + 64) >> 7);   /* taps sum to 128: round to nearest, then drop 7 fraction bits */
+        }
+        dst += dst_stride;
+        src += src_stride;
+    }
+}