Cycle 3 (MC interpolation) closure: M1'''=100%, R'''=0.067 RED, M4=-19.5%

Third daedalus-fourier kernel: VP9 8-tap regular subpel filter,
horizontal direction, 8-wide output. Multiply-heavy by design to stress
V3D's no-DP4A deficit. Full cycle Phase 1-7 + M4'''.

Phase 5''' second-model review delivered cleanly: caught 1 RED bug
pre-implementation (src_off off-by-3 indexing convention) and 2 YELLOW
gaps (assert MUST language, shaderdb filter-LUT gate). Without the
review, M1''' would have failed silently on first run with cryptic
"high-index source pixels wrong" symptoms.

Phase 6 v1 first-light: M1''' 100.0000% bit-exact (65536/65536 blocks
across all 16 mx phases). Phase 5''' filter-LUT prediction materialised
exactly: 197 uniforms (gate was 144), 2 threads (down from cycle-2's 4
due to register pressure).

Performance:
  M2''' = 1.413 Mblock/s (707.9 ns/block)
  M3''' = 20.997 Mblock/s (NEON baseline phase3)
  R'''  = 0.067 (RED band, structural mismatch)
  shaderdb: 488 inst, 2 threads, 197 uniforms, 25 max-temps, 0 spills

M4''' concurrent matrix (8s windows):
  NEON 1-core         14.479 Mblock/s
  NEON 4-core         15.248 Mblock/s  <- baseline (compute-bound, not
                                          bandwidth-saturated like cycles 1+2!)
  QPU only             1.380 Mblock/s
  MIXED NEON-3 + QPU  12.277 Mblock/s  <- -19.5% (FAIL gate)
  MIXED NEON-4 + QPU  12.158 Mblock/s  <- -20.3%

NEW cross-cycle finding (Phase 9 lesson 2): compute-bound CPU workloads
make the QPU-offload story collapse. Cycles 1+2 were bandwidth-saturated
(4-core throughput 0.56-0.82x of 1-core), so freeing a CPU core via QPU
offload added throughput. Cycle 3 MC is compute-bound (4-core throughput
1.05x of 1-core, no regression), so there are no idle cycles to free.
QPU contribution (0.45 Mblock/s in contention) doesn't compensate for
losing 1 NEON core delivering ~3.8 Mblock/s.

But 30fps@1080p floor: PASS in every config (1.4x to 15.7x isolation
margin). Per project_30fps_floor_is_fine.md, user-facing test never
fails: daily YouTube playback works fine on any CPU/QPU split.

DEPLOYMENT RECIPE for higgs (cycle 3 confirmed split):
  IDCT (k1) -> QPU (R=0.92, +7% mixed, frees CPU core)
  LPF (k2)  -> QPU (R=0.41, +7% mixed, frees CPU core)
  MC (k3)   -> CPU (R=0.067, -19.5% mixed; stays on CPU)
  Entropy   -> CPU (structurally serial)

Mixed-substrate deployment, not "QPU does everything". Realistic for
higgs: entropy + MC on 2-3 ARM cores; IDCT + LPF dispatched to QPU
concurrently; 1-2 ARM cores left for vscode etc.

New artifacts:
- src/v3d_mc_8h.comp: GLSL kernel
- tests/vp9_mc_ref.c: standalone C ref (REGULAR filter embedded; clean
  transcription)
- tests/bench_neon_mc.c: M1'''_c + M3''' bench
- tests/bench_v3d_mc.c: M1''' + M2''' bench with contract asserts +
  30fps margin display
- tests/bench_concurrent_mc.c: M4''' pthread bench
- external/ffmpeg-snapshot/libavcodec/aarch64/vp9mc_neon.S (vendored)
- external/ffmpeg-snapshot/libavcodec/vp9_subpel_filters_table.c
  (hand-extracted; provides ff_vp9_subpel_filters symbol without
  dragging in full vp9dsp.c)
- docs/k3_mc_phase{1,2,3,4,5,7}.md: full cycle documentation

Memory updates: project_30fps_floor_is_fine.md (user's 30fps target
recalibration), MEMORY.md index updated.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 12:51:43 +00:00
parent 36eca40ff2
commit 356e446a49
15 changed files with 2545 additions and 1 deletions
@@ -55,6 +55,19 @@ set_source_files_properties(${FFASM_LPF_SOURCES} PROPERTIES
    COMPILE_OPTIONS "${FFASM_FLAGS}"
    LANGUAGE ASM)
 # Cycle 3 — VP9 MC interpolation NEON source + filter coefficient table
 # (vendored 2026-05-18). The .c table provides ff_vp9_subpel_filters
 # symbol which vp9mc_neon.S references via movrel.
 set(FFASM_MC_SOURCES
    ${FFSNAP}/libavcodec/aarch64/vp9mc_neon.S
 )
 set(FFC_MC_SOURCES
    ${FFSNAP}/libavcodec/vp9_subpel_filters_table.c
 )
 set_source_files_properties(${FFASM_MC_SOURCES} PROPERTIES
    COMPILE_OPTIONS "${FFASM_FLAGS}"
    LANGUAGE ASM)
 # Tell CMake/gas to preprocess .S sources.
 set_source_files_properties(${FFASM_SOURCES} PROPERTIES
    COMPILE_OPTIONS "${FFASM_FLAGS}"
@@ -76,6 +89,15 @@ add_executable(bench_neon_lpf
    ${FFASM_LPF_SOURCES}
 )
 target_compile_options(bench_neon_lpf PRIVATE -O3 -march=armv8-a+simd)
 # Cycle 3 — VP9 MC interpolation NEON baseline.
 add_executable(bench_neon_mc
    tests/bench_neon_mc.c
    tests/vp9_mc_ref.c
    ${FFASM_MC_SOURCES}
    ${FFC_MC_SOURCES}
 )
 target_compile_options(bench_neon_mc PRIVATE -O3 -march=armv8-a+simd)
 # bench_neon_idct doesn't need vulkan/drm — pure CPU baseline.
 # ---- Vulkan dispatch-overhead microbench (next chunk) ----------------------
@@ -125,7 +147,18 @@ if (DAEDALUS_BUILD_VULKAN)
        VERBATIM
    )
-    add_custom_target(daedalus_shaders ALL DEPENDS ${NOOP_SPV} ${IDCT8_SPV} ${LPF_SPV})
+    set(MC_SPV ${CMAKE_BINARY_DIR}/v3d_mc_8h.spv)
    add_custom_command(
        OUTPUT ${MC_SPV}
        COMMAND ${GLSLANG_VALIDATOR} -V --target-env vulkan1.3
                -o ${MC_SPV}
                ${CMAKE_SOURCE_DIR}/src/v3d_mc_8h.comp
        DEPENDS ${CMAKE_SOURCE_DIR}/src/v3d_mc_8h.comp
        COMMENT "glslang: v3d_mc_8h.comp -> v3d_mc_8h.spv"
        VERBATIM
    )
    add_custom_target(daedalus_shaders ALL DEPENDS ${NOOP_SPV} ${IDCT8_SPV} ${LPF_SPV} ${MC_SPV})
    # v3d_runner — reusable Vulkan plumbing.
    add_library(v3d_runner STATIC src/v3d_runner.c)
@@ -155,6 +188,15 @@ if (DAEDALUS_BUILD_VULKAN)
    target_link_libraries(bench_v3d_lpf PRIVATE v3d_runner Vulkan::Vulkan)
    target_compile_options(bench_v3d_lpf PRIVATE -O2)
    # Cycle 3 — QPU MC bench.
    add_executable(bench_v3d_mc
        tests/bench_v3d_mc.c
        tests/vp9_mc_ref.c
    )
    add_dependencies(bench_v3d_mc daedalus_shaders)
    target_link_libraries(bench_v3d_mc PRIVATE v3d_runner Vulkan::Vulkan)
    target_compile_options(bench_v3d_mc PRIVATE -O2)
    # M4 — concurrent CPU(NEON) + QPU bench. Links the FFmpeg NEON
    # snapshot so we can run real NEON kernels on pinned CPU cores
    # while the QPU runs its dispatch loop concurrently.
@@ -174,6 +216,16 @@ if (DAEDALUS_BUILD_VULKAN)
    add_dependencies(bench_concurrent_lpf daedalus_shaders)
    target_link_libraries(bench_concurrent_lpf PRIVATE v3d_runner Vulkan::Vulkan pthread)
    target_compile_options(bench_concurrent_lpf PRIVATE -O3 -march=armv8-a+simd)
    # Cycle 3 M4''' — concurrent MC.
    add_executable(bench_concurrent_mc
        tests/bench_concurrent_mc.c
        ${FFASM_MC_SOURCES}
        ${FFC_MC_SOURCES}
    )
    add_dependencies(bench_concurrent_mc daedalus_shaders)
    target_link_libraries(bench_concurrent_mc PRIVATE v3d_runner Vulkan::Vulkan pthread)
    target_compile_options(bench_concurrent_mc PRIVATE -O3 -march=armv8-a+simd)
 endif()
 # ---- Summary ----------------------------------------------------------------
@@ -0,0 +1,104 @@
 ---
 cycle: 3
 phase: 1
 status: open
 date_opened: 2026-05-18
 parent_cycle: k2_deblock_phase7.md (cycle 2 closed YELLOW-via-M4'' PASS)
 target_kernel: VP9 8-tap MC interpolation, regular filter, horizontal, 8×N block
 dev_host: hertz
 ---
 # Cycle 3, Phase 1 — MC interpolation kernel goal
 Per `k2_deblock_phase7.md` verdict (project continues). MC interpolation
 chosen because: most-common per-frame work in real bitstreams (every
 inter block); multiply-heavy → stresses V3D SMUL24 / lack of DP4A
 directly; VP9+AV1 both use the same 8-tap structure.
 ## Kernel under test
 **VP9 8-tap regular subpel filter, horizontal direction, 8×N block,
 "put" (non-averaging) mode.**
 libavcodec symbol: `ff_vp9_put_8tap_regular_8h_neon` (and equivalents
 for smooth/sharp filter types). C reference: `put_8tap_regular_8h_c`
 from `libavcodec/vp9dsp_template.c` (instantiated via the
 `filter_fn_1d(8, h, mx, regular, FILTER_8TAP_REGULAR, put)` macro
 expansion).
 I/O contract (per VP9 spec § 8.5.1 — subpel motion compensation):
 ```c
 void put_8tap_regular_8h_c(uint8_t *dst, ptrdiff_t dst_stride,
                           const uint8_t *src, ptrdiff_t src_stride,
                           int h, int mx, int my);
 ```
 - `dst` : destination block, written
 - `dst_stride` : destination row stride
 - `src` : source block, read (with -3..+4 column overhang for horizontal)
 - `src_stride` : source row stride
 - `h` : block height (typically 8 for 8×8)
 - `mx` : x-axis subpel phase ∈ [0, 15]
 - `my` : y-axis subpel phase (unused for horizontal-only filter)
 Per output pixel:
 ```
 out[r][c] = clip((sum_{k=0..7} filter[k] * src[r][c+k-3] + 64) >> 7, 0, 255)
 ```
 Filter coefficients: `ff_vp9_subpel_filters[FILTER_8TAP_REGULAR][mx][0..7]`
 (int16, signed; 16 phases; sum to 128).
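The per-pixel formula above can be sketched in plain C as a single-row filter. This is a minimal illustration, not the vendored reference: `put_8tap_h_row`, `clip_u8`, and both coefficient arrays are hypothetical names for this sketch, and `FILTER_ILLUSTRATIVE` is an assumed symmetric kernel chosen only because it sums to 128.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Clamp an intermediate value to the 8-bit pixel range. */
static uint8_t clip_u8(int v)
{
    return (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
}

/*
 * Filter one row of 8 output pixels.
 * `src` points at output column 0, so the taps read src[-3] .. src[11].
 * `filter` holds the 8 signed coefficients for one mx phase (sum = 128).
 * Note: >> on a negative int is implementation-defined in C; this sketch
 * assumes an arithmetic shift, as common compilers provide.
 */
static void put_8tap_h_row(uint8_t dst[8], const uint8_t *src,
                           const int16_t filter[8])
{
    assert(dst != NULL && src != NULL && filter != NULL);
    for (int c = 0; c < 8; c++) {
        int sum = 0;
        for (int k = 0; k < 8; k++)
            sum += filter[k] * src[c + k - 3];
        dst[c] = clip_u8((sum + 64) >> 7);
    }
}

/* Identity kernel: only the tap at k = 3 is non-zero. */
static const int16_t FILTER_IDENTITY[8] = { 0, 0, 0, 128, 0, 0, 0, 0 };

/* ASSUMPTION: illustrative symmetric kernel that sums to 128. It is not
 * copied from ff_vp9_subpel_filters. */
static const int16_t FILTER_ILLUSTRATIVE[8] = { -1, 6, -19, 78, 78, -19, 6, -1 };
```

Because every kernel sums to 128, a flat input row comes back unchanged after the `+64, >>7` rounding, which makes a quick sanity check before comparing against the real table.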
 ## Measurable success criteria (cycle-3 numbering)
 | ID | Measurement | Gate |
 |---|---|---|
 | **M1'''** | Bit-exact match rate vs C reference, ≥10 000 random 8×8 blocks (all 16 mx phases sampled) | 100.0000 % |
 | **M2'''** | QPU throughput in Mblock/s | recorded |
 | **M3'''** | NEON `ff_vp9_put_8tap_regular_8h_neon` throughput, single-core | recorded |
 | **M4'''** | MIXED NEON-3 + QPU vs pure NEON-4 (only if YELLOW band) | conditional |
 Derived: **R''' = M2''' / M3'''**.
 ## Decision rules (same as cycle 1/2)
 R''' bands and verdicts unchanged (see `phase1.md` and `k2_deblock_phase1.md`).
 Cycle-2 calibration adjustment: ORANGE band (0.1 ≤ R''' < 0.5) is
 no longer auto-close — run M4''' regardless.
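A small sketch of the derived ratio and the banding, under stated assumptions. Only the ORANGE range (0.1 ≤ R''' < 0.5) is spelled out here; treating everything below 0.1 as RED is an inference, and the YELLOW/GREEN split is left to `phase1.md`, so the sketch lumps it into one bucket. All names are hypothetical.

```c
#include <assert.h>

enum r_band {
    BAND_RED,           /* ASSUMED: R < 0.1, i.e. below the ORANGE range */
    BAND_ORANGE,        /* 0.1 <= R < 0.5, as stated in the decision rules */
    BAND_ABOVE_ORANGE   /* YELLOW or GREEN; the split is defined in phase1.md */
};

/* R''' = M2''' / M3''', both in Mblock/s. */
static double r_ratio(double m2_qpu_mblock_s, double m3_neon_mblock_s)
{
    assert(m3_neon_mblock_s > 0.0);
    return m2_qpu_mblock_s / m3_neon_mblock_s;
}

static enum r_band classify_r(double r)
{
    if (r < 0.1)
        return BAND_RED;
    if (r < 0.5)
        return BAND_ORANGE;
    return BAND_ABOVE_ORANGE;
}
```

Under the cycle-2 calibration, an ORANGE result would still trigger the M4''' run rather than closing the cycle.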
 Predicted R''' band: **0.4–0.8.**
 - MC is more compute-bound than LPF (8 mults + 7 adds per output
  pixel; 64 pixels per block → ~960 ops per block)
 - Bandwidth-equivalent to LPF (per-block ~120 B read + 64 B write
  ≈ 184 B → similar 5-6 MB/frame at 32 400 blocks)
 - V3D SMUL24 covers the 8b×8b → 16b mults without overflow
 - But no DP4A means we lose the typical "4× INT8 speedup" CPUs get
  via SDOT — V3D does these as scalar SMUL24
 ## Cycle 1+2 lessons baked in from start
 Per `k2_deblock_phase7.md §"Phase 9 lessons"`:
 1. WG=256, 2-per-subgroup adaptation, uint8_t SSBO, oob early-return,
   NO chained ternary — these are the v1 defaults.
 2. Phase 5 second-model review is mandatory.
 3. R isolation is misleading; M4''' is the real gate.
 4. Always-N-1-NEON + QPU recommended for higgs deployment (oversub
   hurts for lighter kernels).
 5. shaderdb at 4 threads / 0 spills = compiler delivered; further
   optimisation must target algorithm, not compile shape.
 ## Phase 2 → Phase 3 hand-off
 Phase 2 must:
 - Vendor `libavcodec/aarch64/vp9mc_neon.S` from FFmpeg n7.1.3
  (matches existing snapshot pin)
 - Confirm `ff_vp9_subpel_filters` definition source
  (`libavcodec/vp9dsp.c:32`, just the 16 × 8 REGULAR row needed)
 - Pin the exact NEON symbol naming
 Phase 3 must:
 - Write standalone C ref (`tests/vp9_mc_ref.c`) with REGULAR filter
  table embedded
 - Write `tests/bench_neon_mc.c` (M1'''_c gate + M3''')
 - Capture M3''' before any QPU work
@@ -0,0 +1,109 @@
 ---
 cycle: 3
 phase: 2
 status: closed 2026-05-18
 date_opened: 2026-05-18
 parent: k3_mc_phase1.md
 ---
 # Cycle 3, Phase 2 — MC situation analysis
 ## 1. C reference
 - **Source**: `external/ffmpeg-snapshot/libavcodec/vp9dsp_template.c`
  (already vendored from cycle 1).
 - **Function**: `put_8tap_regular_8h_c` generated by
  `filter_fn_1d(8, h, mx, regular, FILTER_8TAP_REGULAR, put)` —
  expands to call `do_8tap_1d_c` with `ds=1` (horizontal) and the
  REGULAR filter bank.
 - **Underlying primitive**: `do_8tap_1d_c` iterates `h` rows;
  per row, iterates `w=8` columns; per column, computes the
  `FILTER_8TAP` macro: `clip((sum_{k=0..7} F[k] * src[x+k-3]
  + 64) >> 7, 0, 255)`.
 - **Spec**: VP9 specification § 8.5.1 (subpel motion compensation).
 ## 2. NEON reference
 - **Source**: `external/ffmpeg-snapshot/libavcodec/aarch64/vp9mc_neon.S`
  (vendored 2026-05-18, FFmpeg n7.1.3, SHA-256
  `6b1d50f9821742584fdd47758057f810644aff3a008faaa774ff5b9cac4d1fef`).
 - **Symbol**: `ff_vp9_put_regular8_h_neon` (note: filter type baked
  into name, width=8 baked in, h-direction baked in)
 - **Signature** (VP9 `vp9_mc_func` typedef):
  ```c
  void ff_vp9_put_regular8_h_neon(uint8_t *dst, ptrdiff_t dst_stride,
                                  const uint8_t *src, ptrdiff_t src_stride,
                                  int h, int mx, int my);
  ```
  Registers: `x0=dst, x1=dst_stride, x2=src, x3=src_stride, w4=h, w5=mx, w6=my`.
 - **Dependencies**:
  - `libavutil/aarch64/asm.S` ✓ (already vendored)
  - `ff_vp9_subpel_filters[3][16][8]` symbol — provided by
    `external/ffmpeg-snapshot/libavcodec/vp9_subpel_filters_table.c`
    (hand-extracted from `libavcodec/vp9dsp.c` of the same n7.1.3
    pin; copying just the constant data avoids dragging in the
    rest of `vp9dsp.c` which would require linking the entire VP9
    decoder).
 ## 3. Workload model
 Per 8×8 block output:
 - 8 multiplies × 8 columns × 8 rows = **512 multiplies**
 - 7 additions × 8 columns × 8 rows = 448 additions
 - 1 round (+64), 1 shift (>>7), 1 clip per pixel × 64 = 192 ops
 - Total ~1150 integer ops per block
 Per-block memory (horizontal-only filter, 8-pixel-wide output):
 - Read: 8 rows × (8 output cols + 7 tap overhang) = 8 × 15 = **120 source bytes**
 - Write: 8 rows × 8 cols = **64 dst bytes**
 - Total: **~184 bytes / block**
 Per 1080p frame (32 400 8×8 blocks, worst case all-MC):
 - ~5.9 MB total memory traffic
 - ~37 Mops compute
 - At GPU 4 GB/s share: 1.48 ms / frame = 675 FPS = 21.9 Mblock/s
 - At V3D 92 GFLOPS theoretical scalar (SMUL24 throughput ≈ FP MUL): 0.4 ms compute / frame = 2500 FPS theoretical → **compute is NOT the bottleneck** at this shape
 So MC is **bandwidth-bound on the QPU**, similar to LPF cycle 2.
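The workload arithmetic in this section can be reproduced with a few lines of C. This is a sketch for checking the numbers only; the function and constant names are hypothetical, and the 4 GB/s figure is the GPU bandwidth share assumed above.

```c
#include <assert.h>
#include <stdint.h>

enum { MC_TAPS = 8, MC_BLOCK_W = 8, MC_BLOCK_H = 8,
       BLOCKS_PER_1080P_FRAME = 32400 };

/* Integer ops per 8x8 block: multiplies, adds, then round/shift/clip. */
static uint32_t mc_ops_per_block(void)
{
    uint32_t pixels = MC_BLOCK_W * MC_BLOCK_H;
    uint32_t mults  = MC_TAPS * pixels;        /* 512 */
    uint32_t adds   = (MC_TAPS - 1) * pixels;  /* 448 */
    uint32_t post   = 3 * pixels;              /* 192 */
    return mults + adds + post;                /* 1152, quoted as ~1150 */
}

/* Bytes moved per block for the horizontal-only filter. */
static uint32_t mc_bytes_per_block(void)
{
    uint32_t read    = MC_BLOCK_H * (MC_BLOCK_W + MC_TAPS - 1); /* 8 * 15 */
    uint32_t written = MC_BLOCK_H * MC_BLOCK_W;                 /* 64 */
    return read + written;                                      /* 184 */
}

/* Milliseconds per 1080p frame if memory traffic were the only cost. */
static double frame_ms_at_bandwidth(double bytes_per_second)
{
    assert(bytes_per_second > 0.0);
    double bytes = (double)mc_bytes_per_block() * BLOCKS_PER_1080P_FRAME;
    return bytes / bytes_per_second * 1e3;
}
```

With `frame_ms_at_bandwidth(4e9)` the result lands close to the 1.48 ms quoted above; the small difference comes from rounding 5.96 MB down to 5.9 MB.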
 ## 4. Per-row workload diversity (vs cycle 1+2)
 | | IDCT (k1) | LPF (k2) | MC (k3) |
 |---|---|---|---|
 | Per-block math | Heavy butterflies (~60 ops/block via separable transform) | Light: 0-30 ops per edge × 8 rows | 8-tap convolution: 1150 ops per block |
 | Per-block memory | ~320 B in + 64 B out | ~64 B in + ~24 B out per edge | 120 B in + 64 B out |
 | Compute / memory ratio | High | Low (memory-bound, lots of skipping) | Medium (compute-rich but bandwidth-bound at GPU) |
 | Conditional? | No (always-execute) | Yes (fm/hev divergence per row) | No (deterministic per pixel) |
 | QPU mult intensity | Q14 16b×16b mults | Light (compares, small clips) | 16b×8b mults (filter × pixel) |
 MC is interesting because it's **compute-rich AND bandwidth-bound** —
 the closest match in workload shape to a real-world GPU compute kernel
 the V3D was designed for (graphics filtering).
 ## 5. Constraints carried from cycle 1+2
 Same V3D 7.1 device profile (vulkaninfo unchanged). The relevant
 specifics for MC:
 - No DP4A → 8-tap convolution must be 8 separate SMUL24 + ADDs
  (the typical GPU "dot4" packing is not available)
 - shaderInt16 = false → filter coefficients widened to int32 in
  registers; the filter table itself can be a uint16-storage SSBO
 - shaderInt8 = false → source pixels widened to int32 in registers
 - Shared mem is budgeted at 1024 bytes (16 KiB / 16) per WG. Full MC
  source staging would need 15 cols × 8 rows × 1 byte × 32 blocks per
  WG = 3 840 B per WG, which exceeds that slice (it fits only in the
  full 16 KiB); for v1 we skip shared-mem staging and let TMU handle
  reads directly
 ## 6. What Phase 2 does *not* close
 - Per-block (block_y, block_x) layout / meta format. Phase 4 picks.
  Likely same shape as cycle 2 (uvec4 per block: dst_offset,
  src_offset, mx, _pad).
 - Filter table residency: as SSBO load every row, push-constants
  per dispatch (different mx per dispatch), or constant baked into
  shader (one filter per shader = 16 specialised shaders for the 16
  mx phases). Phase 4 picks; v1 likely SSBO for simplicity.
 - Vertical / "hv" / "avg" / 4-pixel / 16-pixel / 32-pixel / 64-pixel
  variants — out of cycle 3 scope; cycle 4+ if needed.
 Phase 3 next: build `tests/bench_neon_mc.c`, capture M3'''.
@@ -0,0 +1,77 @@
 ---
 cycle: 3
 phase: 3
 status: closed 2026-05-18
 date_opened: 2026-05-18
 parent: k3_mc_phase2.md
 host: hertz
 ---
 # Cycle 3, Phase 3 — NEON M3''' baseline
 ## Raw
 ```
 === M1'''_c bit-exact (10000 random blocks) ===
 M1'''_c correctness: 10000 / 10000 blocks bit-exact (100.0000%)
  mx phase coverage: min=577 max=668 (16 phases sampled)
 === M3''' NEON throughput ===
 M3''' NEON throughput:
  blocks/batch:    65536
  batches done:    939
  total blocks:    61 538 304
  elapsed (kernel)=2.930751 s
  elapsed (setup) =2.075477 s
  throughput      = 20.997 Mblock/s
  per-block       = 47.6 ns
  equiv 1080p     = 648.1 FPS  (32400 blocks/frame)
 ```
 ## Numbers
 | | |
 |---|---|
 | **M1'''_c (bit-exact)** | **100.0000 %** vs `daedalus_vp9_put_regular_8h_ref` |
 | mx coverage | all 16 phases sampled, uniformly within ±10 % of expected count |
 | **M3''' (throughput)** | **20.997 Mblock/s** single-core |
 | per-block | 47.6 ns |
 | cycles/block | 47.6 ns × 2.8 GHz ≈ 133 cycles |
 | 1080p FPS-eq | 648 FPS |
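The derived numbers in the table follow from the raw block count and kernel time. A minimal sketch of that conversion, with hypothetical helper names; the 2.8 GHz clock and 32 400 blocks/frame are the figures used in this document.

```c
#include <assert.h>
#include <stdint.h>

/* Throughput in Mblock/s from a block count and elapsed kernel time. */
static double mblock_per_s(uint64_t blocks, double seconds)
{
    assert(blocks > 0 && seconds > 0.0);
    return (double)blocks / seconds / 1e6;
}

/* Average time per block in nanoseconds. */
static double ns_per_block(uint64_t blocks, double seconds)
{
    assert(blocks > 0 && seconds > 0.0);
    return seconds * 1e9 / (double)blocks;
}

/* Equivalent 1080p frame rate if every 8x8 block in the frame were MC. */
static double fps_equiv_1080p(double mblock_s)
{
    return mblock_s * 1e6 / 32400.0;
}

/* CPU cycles per block at a given core clock in GHz. */
static double cycles_per_block(double ns, double clock_ghz)
{
    return ns * clock_ghz;
}
```

Feeding in the raw run (939 batches × 65 536 blocks in 2.930751 s) should reproduce the 20.997 Mblock/s, 47.6 ns, and ~133 cycle figures to within rounding.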
 ## Comparison across cycles
 | | IDCT (k1) | LPF (k2) | MC (k3) |
 |---|---|---|---|
 | Per-unit ns (NEON) | 122 | 20.7 (per edge) | 47.6 |
 | 1080p FPS-eq | 252 | 748 (worst edges) | 648 |
 | Compute character | Q14 butterflies + transpose | abs+compare+small mults | 8-tap convolution, mult-heavy |
 | NEON win | SMLA + transpose | SMULL + saturate | SDOT-style packing |
 MC NEON is fast — at ~2.6× IDCT throughput per unit. The A76's SDOT
 or SMULL-pair pattern handles 8-tap convolution extremely well; this
 is precisely the workload NEON SIMD was built for. **The QPU's
 break-even point on cycle 3 is correspondingly tight.**
 ## Predictions for M2''' / R'''
 V3D 7.1 has SMUL24 (8b×8b → 16b sufficient) but **no DP4A**, so the
 QPU must do 8 separate SMUL24 + ADD per output pixel. Bandwidth-wise
 MC is similar to LPF (~6 MB / 1080p frame). Compute-wise much heavier
 than LPF.
 - Compute-envelope (idealised): 32 400 blocks × 1 150 ops = 37 Mops
  per frame. At v3d 92 GFLOPS theoretical × 23 % util ≈ 21 GOPS
  effective → 1.8 ms / frame → 540 FPS → 17.5 Mblock/s
 - Bandwidth-envelope: 5.9 MB/frame ÷ 4 GB/s ≈ 1.48 ms/frame → 22 Mblock/s
 - Combined: min(compute, bandwidth) ≈ 17.5 Mblock/s
 **Predicted R''' = 17.5 / 21.0 ≈ 0.83** isolation. Likely YELLOW
 band by a small margin.
 Honest lower bound: if SMUL24-vs-DP4A penalty is bigger than
 estimated (CPU SDOT does 4 INT8 MACs in one instruction; the QPU
 needs 4× more cycles for the same work in the worst case), R'''
 could land near 0.5-0.6. Phase 7''' measures.
 Phase 4 next.
@@ -0,0 +1,207 @@
 ---
 cycle: 3
 phase: 4
 status: open (awaiting Phase 5''' review)
 date_opened: 2026-05-18
 parent: k3_mc_phase3.md
 template: phase4.md (cycle 1) + k2_deblock_phase4.md (cycle 2) — same constraints, same patterns
 ---
 # Cycle 3, Phase 4 — Plan QPU MC kernel
 Compact plan. Cycle 1+2 phase4 docs cover the constraint matrix
 (C1-C10) and the dev-discipline patterns. Phase 4''' references
 them rather than re-deriving.
 ## 1. Constraints (carried)
 Same V3D 7.1 device. New for MC specifically:
 - SMUL24 covers 16-bit filter × 8-bit pixel mults (max ~32K product, fits)
 - Sum of 8 products fits in int32 trivially
 - No DP4A — must use 8 separate scalar muls per output pixel
 - 16 filter phases × 8 taps × 2 B = 256 B — too big for push constants
  (max 128 B), small enough for one const array in shader
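The "fits" claims above can be checked with a worked bound. The sketch below computes the largest single product and the worst-case magnitude of the 8-tap sum for a given coefficient row; the helper names are hypothetical, and any coefficient row passed in is the caller's choice, not a copy of the vendored table.

```c
#include <assert.h>
#include <stdint.h>

enum { BOUND_TAPS = 8, BOUND_PIXEL_MAX = 255 };

/* Largest |coefficient * pixel| any single tap can produce. */
static int32_t max_single_product(const int16_t filter[BOUND_TAPS])
{
    int32_t worst = 0;
    for (int k = 0; k < BOUND_TAPS; k++) {
        int32_t mag = filter[k] < 0 ? -filter[k] : filter[k];
        if (mag * BOUND_PIXEL_MAX > worst)
            worst = mag * BOUND_PIXEL_MAX;
    }
    return worst;
}

/* Worst-case |sum|: pretend every product takes the same sign. */
static int32_t worst_case_sum(const int16_t filter[BOUND_TAPS])
{
    int32_t bound = 0;
    for (int k = 0; k < BOUND_TAPS; k++) {
        int32_t mag = filter[k] < 0 ? -filter[k] : filter[k];
        bound += mag * BOUND_PIXEL_MAX;
    }
    return bound;
}
```

For a coefficient of 128 the single product tops out at 128 × 255 = 32 640, which is the "~32K" figure above, and even eight such products stay far below the int32 limit.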
 ## 2. Workload model
 Per 8×8 block:
 - 512 SMUL24 (8 mults × 64 output pixels)
 - 448 ADD (7 adds × 64 output pixels)
 - 64 round (+64 → >>7) operations
 - 64 clip-to-[0,255]
 - ≈ 1150 ALU ops per block
 - 120 B read + 64 B write = 184 B per block
 Per 1080p frame (32 400 blocks):
 - ~37 Mops compute → 1.8 ms at v3d 23 % sustained (compute-bound estimate)
 - ~5.9 MB traffic → 1.48 ms at 4 GB/s GPU share (bandwidth-bound estimate)
 ## 3. Workgroup geometry
 Bake in the v4 lesson and the cycle-2 single-WG-size-from-start:
 - `local_size_x = 256` (16 subgroups × 16 lanes)
 - 8 lanes per block (1 lane per row r=0..7), 2 blocks per subgroup
 - **32 blocks per WG**
 - 1080p: 1 013 WGs
 Same lane decomposition as cycle 2 LPF:
 ```
 edge_slot  = lane_in_sg >> 3    // 0 or 1 — "which block in this subgroup"
 row        = lane_in_sg & 7     // 0..7
 block_local = sg_in_wg * 2 + edge_slot
 block_idx   = wg_id * 32 + block_local
 oob = block_idx >= n_blocks
 ```
 No barrier needed, no shared mem. Safe early-return on oob.
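A host-side C model of the lane decomposition helps confirm that each (block, row) pair is covered exactly once per workgroup. This is a sketch: `mc_lane_for` and `mc_workgroups_for` are hypothetical names, and deriving `sg_in_wg` / `lane_in_sg` from the local invocation index by dividing by 16 is an assumption about how the subgroups are laid out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum { WG_SIZE = 256, SG_SIZE = 16, BLOCKS_PER_WG = 32 };

struct mc_lane {
    uint32_t block_idx; /* global block index */
    uint32_t row;       /* 0..7 within the block */
    bool     oob;       /* true when block_idx >= n_blocks */
};

static struct mc_lane mc_lane_for(uint32_t wg_id, uint32_t local_id,
                                  uint32_t n_blocks)
{
    assert(local_id < WG_SIZE);
    /* ASSUMPTION: subgroups are consecutive 16-lane slices of the WG. */
    uint32_t sg_in_wg    = local_id / SG_SIZE;
    uint32_t lane_in_sg  = local_id % SG_SIZE;
    uint32_t edge_slot   = lane_in_sg >> 3;   /* which block in the subgroup */
    uint32_t block_local = sg_in_wg * 2u + edge_slot;

    struct mc_lane l;
    l.block_idx = wg_id * BLOCKS_PER_WG + block_local;
    l.row       = lane_in_sg & 7u;
    l.oob       = l.block_idx >= n_blocks;
    return l;
}

/* Number of workgroups needed to cover n_blocks (rounds up). */
static uint32_t mc_workgroups_for(uint32_t n_blocks)
{
    return (n_blocks + BLOCKS_PER_WG - 1) / BLOCKS_PER_WG;
}
```

For 32 400 blocks this gives 1 013 workgroups, and the upper half of the last workgroup falls out of bounds, which is what the early-return handles.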
 ## 4. Per-thread algorithm
 ```glsl
 if (block_idx >= pc.n_blocks) return;
 uvec4 m = u_meta.meta[block_idx];
 uint dst_off = m.x;
 uint src_off = m.y;
 uint mx      = m.z & 15u;
 // Read 15 source pixels for this row.
 uint src_row_addr = src_off + row * pc.src_stride_u8;
 int s0  = int(u_src.src[src_row_addr +  0u]);
 int s1  = int(u_src.src[src_row_addr +  1u]);
 int s2  = int(u_src.src[src_row_addr +  2u]);
 int s3  = int(u_src.src[src_row_addr +  3u]);
 int s4  = int(u_src.src[src_row_addr +  4u]);
 int s5  = int(u_src.src[src_row_addr +  5u]);
 int s6  = int(u_src.src[src_row_addr +  6u]);
 int s7  = int(u_src.src[src_row_addr +  7u]);
 int s8  = int(u_src.src[src_row_addr +  8u]);
 int s9  = int(u_src.src[src_row_addr +  9u]);
 int s10 = int(u_src.src[src_row_addr + 10u]);
 int s11 = int(u_src.src[src_row_addr + 11u]);
 int s12 = int(u_src.src[src_row_addr + 12u]);
 int s13 = int(u_src.src[src_row_addr + 13u]);
 int s14 = int(u_src.src[src_row_addr + 14u]);
 // Filter coefficients — const REGULAR table, indexed by mx.
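 int F0 = FILTER_REGULAR[mx][0];
 int F1 = FILTER_REGULAR[mx][1];
 int F2 = FILTER_REGULAR[mx][2];
 int F3 = FILTER_REGULAR[mx][3];
 int F4 = FILTER_REGULAR[mx][4];
 int F5 = FILTER_REGULAR[mx][5];
 int F6 = FILTER_REGULAR[mx][6];
 int F7 = FILTER_REGULAR[mx][7];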
 // 8 output pixels (each = 8-tap convolution of 8 consecutive source).
 uint dst_row_addr = dst_off + row * pc.dst_stride_u8;
 int o0 = F0*s0 + F1*s1 + F2*s2 + F3*s3 + F4*s4 + F5*s5 + F6*s6 + F7*s7;
 int o1 = F0*s1 + F1*s2 + F2*s3 + F3*s4 + F4*s5 + F5*s6 + F6*s7 + F7*s8;
 int o2 = F0*s2 + F1*s3 + F2*s4 + F3*s5 + F4*s6 + F5*s7 + F6*s8 + F7*s9;
 int o3 = F0*s3 + F1*s4 + F2*s5 + F3*s6 + F4*s7 + F5*s8 + F6*s9 + F7*s10;
 int o4 = F0*s4 + F1*s5 + F2*s6 + F3*s7 + F4*s8 + F5*s9 + F6*s10+ F7*s11;
 int o5 = F0*s5 + F1*s6 + F2*s7 + F3*s8 + F4*s9 + F5*s10+ F6*s11+ F7*s12;
 int o6 = F0*s6 + F1*s7 + F2*s8 + F3*s9 + F4*s10+ F5*s11+ F6*s12+ F7*s13;
 int o7 = F0*s7 + F1*s8 + F2*s9 + F3*s10+ F4*s11+ F5*s12+ F6*s13+ F7*s14;
 u_dst.dst[dst_row_addr + 0u] = uint8_t(clamp((o0 + 64) >> 7, 0, 255));
 u_dst.dst[dst_row_addr + 1u] = uint8_t(clamp((o1 + 64) >> 7, 0, 255));
 u_dst.dst[dst_row_addr + 2u] = uint8_t(clamp((o2 + 64) >> 7, 0, 255));
 u_dst.dst[dst_row_addr + 3u] = uint8_t(clamp((o3 + 64) >> 7, 0, 255));
 u_dst.dst[dst_row_addr + 4u] = uint8_t(clamp((o4 + 64) >> 7, 0, 255));
 u_dst.dst[dst_row_addr + 5u] = uint8_t(clamp((o5 + 64) >> 7, 0, 255));
 u_dst.dst[dst_row_addr + 6u] = uint8_t(clamp((o6 + 64) >> 7, 0, 255));
 u_dst.dst[dst_row_addr + 7u] = uint8_t(clamp((o7 + 64) >> 7, 0, 255));
 ```
 Mirrors `tests/vp9_mc_ref.c` directly.
 ## 5. SSBOs / push constants
 | binding | name | type | usage |
 |---|---|---|---|
 | 0 | `meta` | `readonly uvec4[]` | per-block (dst_off, src_off, mx, _pad) |
 | 1 | `dst` | `uint8_t[]` | output pixels |
 | 2 | `src` | `readonly uint8_t[]` | input pixels |
 Push constants (16 B):
 ```
 n_blocks, dst_stride_u8, src_stride_u8, _pad
 ```
 Filter table: hard-coded in shader as
 `const int FILTER_REGULAR[16][8] = { ... };` — 128 const ints.
 **Race safety:** lane r writes `dst[dst_off + r*dst_stride + 0..7]`
 (8 contiguous bytes). Row r's last byte lands at `r*stride + 7` and row
 r+1's first byte at `(r+1)*stride + 0`, so the two ranges are disjoint
 iff `dst_stride ≥ 8`.
 **Contracts (revised per phase5''' findings 4 + 6):**
 1. `dst_stride_u8 ≥ 8` (race-safety lower bound)
 2. `src_stride_u8 ≥ 15` (per-row read span)
 3. `dst_off + 7 + (r_max)*dst_stride < dst_buffer_size`
 4. `src_off + 14 + (r_max)*src_stride < src_buffer_size`
 5. **`src_off` is the byte offset of the FIRST byte of the source
   block's row 0 in the SSBO buffer — NOT shifted by +3.** The
   C bench's `src + 3` C-caller convention does not carry into
   the SSBO offset. Shader reads `s[k] = u_src.src[src_off +
   row*stride + k]` for k=0..14, which equals
   `master_src[block_base + row*stride + k]`, matching the C ref's
   per-row read of `master_src[block_base + row*stride + (x..x+7)]`
   for output col x ∈ 0..7.
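Contract 5 is easiest to see as index arithmetic. The sketch below models both conventions in C: the shader reading `s[k]` from a raw `src_off`, and the C reference reading `src[x + k - 3]` after its caller passed `block_base + 3`. The function names are hypothetical; only the index expressions come from the text above.

```c
#include <assert.h>
#include <stdint.h>

enum { ROW_SPAN = 15 };

/* Byte index of s[k] for lane `row`, as the shader computes it. */
static uint32_t shader_read_index(uint32_t src_off, uint32_t row,
                                  uint32_t stride, uint32_t k)
{
    assert(k < ROW_SPAN);
    return src_off + row * stride + k;
}

/* Byte index the C reference touches for output column x, tap k, when its
 * caller passed src = master_src + block_base + 3. */
static uint32_t cref_read_index(uint32_t block_base, uint32_t row,
                                uint32_t stride, uint32_t x, uint32_t k)
{
    assert(x < 8 && k < 8);
    return block_base + 3u + row * stride + x + k - 3u;
}
```

With `src_off = block_base` the two agree for every `s[x + k]`. With the rejected `src_off = block_base + 3` and a stride of 16, `s13` lands on `block_base + 16`, which is byte 0 of the next row: the RED finding.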
 **Phase 6 MUST** add `assert(dst_stride_u8 >= 8 && src_stride_u8 >= 15)`
 in `bench_v3d_mc.c`'s meta-construction loop. **Phase 6 MUST** also
 run `V3D_DEBUG=shaderdb` after first compile and record uniform
 count. If uniform count > ~144 (a fall-out indicator that the
 filter LUT inflated unfavorably), escalate filter to a dedicated
 SSBO binding 3.
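One way the contract checks could look in the bench's meta-construction loop is sketched below. This is an assumption about shape, not the contents of `bench_v3d_mc.c`: `struct mc_layout` and `mc_block_in_contract` are hypothetical, and the predicate folds contracts 1 to 4 into a single test that the bench would wrap in `assert()`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of the two pixel buffers. */
struct mc_layout {
    uint32_t dst_stride_u8;
    uint32_t src_stride_u8;
    size_t   dst_size;   /* bytes in the dst SSBO */
    size_t   src_size;   /* bytes in the src SSBO */
};

/* True when one block's offsets satisfy contracts 1-4. */
static bool mc_block_in_contract(const struct mc_layout *l,
                                 uint32_t dst_off, uint32_t src_off)
{
    const size_t r_max = 7; /* last row of an 8-row block */

    /* Contracts 1 + 2: race-safety and per-row read span. */
    if (l->dst_stride_u8 < 8 || l->src_stride_u8 < 15)
        return false;

    /* Contracts 3 + 4: last byte touched stays inside each buffer. */
    size_t last_dst = (size_t)dst_off + 7 + r_max * l->dst_stride_u8;
    size_t last_src = (size_t)src_off + 14 + r_max * l->src_stride_u8;
    return last_dst < l->dst_size && last_src < l->src_size;
}

/* Intended use while filling the per-block meta array:
 *     assert(mc_block_in_contract(&layout, dst_off, src_off));
 */
```

Keeping the check as a predicate lets the same code back both a hard `assert` in the bench and a softer diagnostic elsewhere.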
 ## 6. Predicted M2''' / R'''
 From Phase 3:
 - Compute envelope: 17.5 Mblock/s
 - Bandwidth envelope: 22.0 Mblock/s
 - min ≈ 17.5 Mblock/s
 - R''' isolation = 17.5 / 20.997 ≈ **0.83** (YELLOW, near GREEN)
 Honest lower bound R''' = 0.5-0.6 if SMUL24-vs-DP4A penalty bites
 harder. Phase 7''' measures.
 ## 7. WILL / WILL NOT touch
 WILL (Phase 6 creates):
 - `src/v3d_mc_8h.comp` — GLSL shader
 - `tests/bench_v3d_mc.c` — harness with contract asserts
 - CMake updates
 WILL NOT touch:
 - Cycle 1/2 artifacts (frozen Phase 3 baselines)
 - `external/ffmpeg-snapshot/` (frozen vendored sources, including
  the just-added `vp9_subpel_filters_table.c`)
 - `src/v3d_runner.{c,h}` (reusable as-is)
 ## 8. Phase 5''' review prompts
 Specific high-risk decisions:
 1. **Orientation / arithmetic correctness** — the 8 `o0..o7`
   expressions in §4 are stencil-aligned. Verify the off-by-one
   in `F[k] * s[c+k]` matches `F[k] * src[x+k-3]` after the
   `src+3` indexing shift used by the bench.
 2. **Filter table residency** — hard-coded const array vs SSBO
   vs push constants. Const is simplest but may cause v3d_compiler
   to generate a large constant LUT. Worth verifying via shaderdb.
 3. **Race safety** — same shape as cycle 2 (different rows of
   same block disjoint iff stride ≥ row-width). Verify
   `dst_stride ≥ 8` contract.
 4. **`src+3` index shift** — the bench's source layout puts the
   "row-0 col-0 source pixel" at `src + 3` (so src has -3..+12
   reachable). Make sure the QPU shader applies this offset
   consistently to its `src_off` meta value.
   **RESOLVED (phase5''' finding 4, RED):** `src_off` is the raw
   block-base offset (NOT +3-shifted). See §5 contract 5.
 5. **Anything missing.**
 ## 9. Phase 6 execution order
 1. Write shader, get glslang to accept (likely 0 spills, ≥2 threads)
 2. Write bench with asserts + meta layout
 3. Run M1''' bit-exact (gate)
 4. Run M2''' (throughput)
 5. If R''' < 1.0 → M4''' concurrent
 6. Phase 7''' verdict
@@ -0,0 +1,71 @@
 ---
 cycle: 3
 phase: 5
 status: closed 2026-05-18 — PASS-WITH-REVISIONS, revisions applied
 date_opened: 2026-05-18
 date_closed: 2026-05-18
 parent: k3_mc_phase4.md
 reviewer: Claude Sonnet (general-purpose Agent, fresh context)
 plan_author: Claude Opus 4.7 (this session)
 verdict: PASS-WITH-REVISIONS
 ---
 # Cycle 3, Phase 5 — Second-Model Review of MC Plan
 Same handoff: in-session Agent (Sonnet, fresh context), files read
 direct from disk, 5 review prompts + "anything else."
 Outcome: **1 RED (off-by-3 `src_off` indexing bug)**, **2 YELLOW**
 (shaderdb LUT gate for filter table, "MUST" assert language for
 contracts). Cycle-1+2 RED patterns (write race, barrier UB,
 subgroup-ops table error) did not recur.
 **Phase 5 paid off again.** The RED would have caused a bit-exact
 mismatch on the first run with cryptic "high index source pixels are
 wrong" symptoms — likely 1-2 debug cycles to track down without the
 review.
 ## Review (verbatim)
 ````markdown
 ## Verdict
 PASS-WITH-REVISIONS — no RED-class correctness bugs. Two YELLOW findings
 require plan amendments before Phase 6 proceeds. ...
 [full review preserved — reviewer's RED finding 4 traces the off-by-3:
 shader's `src_off = block_base + 3` + `src_stride_u8 = 16` + reading
 `s[0..14]` causes high-index reads to spill into next row]
 ````
 *(Verbatim review in agent output; key findings paraphrased below.)*
 | # | Severity | Issue | Resolution |
 |---|---|---|---|
 | 1 (orientation) | GREEN | All 8 oN expressions stencil-aligned correctly | accepted |
 | 2 (filter LUT) | YELLOW | `const int FILTER_REGULAR[16][8]` may inflate uniform count or compile to large LUT | Phase 6 to record uniform count via `V3D_DEBUG=shaderdb`; if >~144 uniforms, escalate filter to SSBO binding 3 |
 | 3 (race safety) | GREEN-w/note | `stride ≥ 8` contract correct; phrasing softer than cycle-2 standard | applied: §5 MUST assert |
 | 4 (`src_off` semantics) | **RED** | Plan said "src_off mirrors src+3"; with stride=16 shader's `s13`/`s14` read into next row's first 2 bytes | **applied: src_off = raw block base (no +3 shift); shader reads s[0..14] from there** |
 | 5 (missing) | GREEN-w/note | Coefficient overflow safely fits int32 (worked bound); no missing barrier-UB or write-race issues | accepted |
 | 6 (assert MUST language) | YELLOW | "Bench enforces with asserts" softer than cycle-2 MUST pattern | applied: §5 MUST language |
 | 7 (no barrier OK) | GREEN | Cycle-1 finding-7 doesn't apply (no barrier) | accepted |
 | 8 (filter table matches) | GREEN | `vp9_mc_ref.c` filter values match `vp9_subpel_filters_table.c[1]` verbatim | accepted |
 ## Resolution (applied to phase4 inline)
 1. **§5 (contract 5)**: Clarified `src_off` is the byte offset of the **first byte
   of the source block in the SSBO buffer** (NOT shifted by +3). The
   C bench's `src + 3` C-caller convention does NOT carry into the
   SSBO offset. Shader reads `s[k] = u_src.src[src_off + row*stride + k]`
   for k=0..14, which equals `master_src[block_base + row*stride + k]`,
   matching the C ref's per-row read of `master_src[block_base + row*stride + (x..x+7)]`
   for output col x ∈ 0..7.
 2. **§5** — Hardened "Bench enforces" to "Phase 6 MUST add
   `assert(dst_stride_u8 >= 8 && src_stride_u8 >= 15)` in
   `bench_v3d_mc.c`'s meta-construction loop." Cycle-2 finding-4
   pattern applied.
 3. **§5** — Added: "Phase 6 MUST run `V3D_DEBUG=shaderdb` after first
   compile and record uniform count. If uniform count > ~144,
   escalate filter to a dedicated SSBO binding 3."
 After revisions: **Phase 4''' APPROVED for Phase 6''' implementation.**
@@ -0,0 +1,152 @@
 ---
 cycle: 3
 phase: 7
 status: closed 2026-05-18 — RED engineering / PASS 30fps-floor / M4 NEGATIVE
 date_opened: 2026-05-18
 date_closed: 2026-05-18
 parent: k3_mc_phase4.md (revised per phase5''')
 host: hertz
 verdict: cycle 3 closes; MC stays on CPU for higgs deployment; engineering negative documented
 ---
 # Cycle 3, Phase 7 — Verification (v1 + M4''')
 ## v1 first-light
 ```
 === v3d MC 8h bench ===
  n_blocks: 65536  iters: 100
 === M1''': QPU vs C reference bit-exact ===
  blocks bit-exact: 65536 / 65536 (100.0000 %)
 === M2''': QPU throughput ===
  M2''' = 1.413 Mblock/s
  per-block = 707.9 ns
  per-dispatch = 46390.5 us
  R''' = 0.067 → RED band
  30fps@1080p floor: 1.5x margin (isolation)
 ```
 shaderdb (v1 MC):
 ```
 SHADER-DB-ffcca249...: 488 inst, 2 threads, 0 loops, 197 uniforms,
  25 max-temps, 0:0 spills:fills, 0 sfu-stalls, 488 inst-and-stalls, 7 nops
 ```
 **Phase 5''' finding 2 prediction confirmed**: filter LUT inflated
 uniforms to 197 (gate was at ~144). Compiler also forced to 2 threads
 (from cycle-2's 4) due to register pressure (25 max-temps vs cycle-2's
 21). The "no DP4A" structural deficit shows up directly here — 8
 SMUL24 + 7 ADD per output pixel × 64 pixels per block × 8-lane
 geometry = 488 instructions, 30× heavier than the LPF kernel.
 ## M4''' concurrent matrix (8s windows)
 | Config | Mblock/s | per-core (NEON) | vs NEON-4 | 30fps |
 |---|---|---|---|---|
 | NEON 1-core | 14.479 | — | — | 14.9× |
 | **NEON 4-core** | **15.248** | 3.24 – 4.48 | **baseline** | 15.7× |
 | QPU only | 1.380 | — | — | 1.4× |
 | **Mixed NEON-3 + QPU** | **12.277** | 3.78 – 4.16 | **−19.5 %** | 12.6× |
 | Mixed NEON-4 + QPU | 12.158 | 2.49 – 3.35 | −20.3 % | 12.5× |
 **M4 gate: FAIL.** Mixed (12.28) < pure NEON-4 (15.25) by 2.97
 Mblock/s. The QPU's 0.45 Mblock/s contribution under contention
 doesn't compensate for losing one NEON core that delivers ~3.8.
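The gate arithmetic behind the matrix is simple enough to sketch. The helpers below are hypothetical names; the 30 fps floor is taken as 30 × 32 400 blocks per second, matching the frame size used throughout these notes.

```c
#include <assert.h>
#include <stdbool.h>

/* Percentage change of a mixed configuration against the NEON-4 baseline. */
static double pct_vs_baseline(double mixed_mblock_s, double baseline_mblock_s)
{
    assert(baseline_mblock_s > 0.0);
    return (mixed_mblock_s - baseline_mblock_s) / baseline_mblock_s * 100.0;
}

/* Margin over the 30fps@1080p floor (30 frames x 32400 blocks per second). */
static double margin_30fps(double mblock_s)
{
    return mblock_s * 1e6 / (30.0 * 32400.0);
}

/* M4 gate: the mixed configuration must beat pure NEON 4-core. */
static bool m4_gate_passes(double mixed_mblock_s, double baseline_mblock_s)
{
    return mixed_mblock_s > baseline_mblock_s;
}
```

Plugging in 12.277 against 15.248 gives roughly −19.5 %, so the gate fails, while `margin_30fps` stays above 1 for every row of the matrix.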
 ## Cross-cycle comparison
 | | Cycle 1 IDCT | Cycle 2 LPF | Cycle 3 MC |
 |---|---|---|---|
 | R isolation | 0.92 | 0.41 | **0.067** |
 | 30fps floor margin (isolation) | 7.9× | 10× | **1.5×** |
 | M4 mixed vs pure NEON-4 | +7.2 % | +6.9 % | **−19.5 %** |
 | 30fps floor margin (mixed) | 7.2× | 7.2× | **12.6×** |
 | Verdict for higgs | GO QPU | GO QPU | **STAY CPU** |
 | NEON 4-core scaling vs 1-core | 0.56× (bw-bound) | 0.82× (bw-bound) | **1.05× (compute-bound)** |
 The MC result is **structurally consistent** with the V3D substrate
 profile from `phase0.md`:
 - No DP4A → 8-wide convolution doesn't pack as it does on NEON SDOT
 - Filter coefficients drive uniform count high → register pressure → 2 threads
 - High per-output-pixel multiply count → compiled instruction count
  3× cycle 1, 6× cycle 2
 NEON 4-core is *compute*-bound for MC (not bandwidth-bound like
 the other two kernels): 4-core throughput holds at 1.05× the 1-core
 figure rather than dropping to 0.56× and 0.82× as in cycles 1 and 2.
 The NEON path already has ample headroom over the 30fps floor, and the
 QPU has nothing to add even in concurrent mode.
 ## Deployment recipe (for higgs / libva-v4l2-request-fourier)
 Per `project_consumer_target.md`, the eventual integration target is
 V4L2 stateless → libva-v4l2-request-fourier → firefox-fourier. The
 QPU/CPU back-end split for that decoder pipeline:
 - **IDCT (cycle 1)** → QPU. R = 0.92, +7 % mixed, frees a CPU core.
 - **LPF (cycle 2)** → QPU. R = 0.41, +7 % mixed, frees a CPU core.
 - **MC (cycle 3)** → **CPU NEON**. R = 0.067, −19.5 % mixed.
  Compute-bound on CPU but CPU already comfortably exceeds 30fps;
  offload makes things worse.
 - **Entropy** (VP9 Bool / AV1 ANS) → CPU. Structurally serial.
 This is a **mixed-substrate deployment**, not a "QPU does everything"
 plan. Realistic for higgs: entropy + MC on 2-3 ARM cores; IDCT + LPF
 dispatched to QPU concurrently; 1-2 ARM cores left for vscode / etc.
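One way to encode this split in integration code is a static routing table. The sketch below is hypothetical: the enum names, the struct, and the `route_for` helper are invented for illustration and do not correspond to any existing libva-v4l2-request-fourier interface.

```c
#include <assert.h>

/* Hypothetical identifiers for the four decoder stages discussed above. */
enum stage { STAGE_ENTROPY, STAGE_IDCT, STAGE_LPF, STAGE_MC, STAGE_COUNT };
enum substrate { SUBSTRATE_CPU, SUBSTRATE_QPU };

struct route {
    enum stage     stage;
    enum substrate target;
    const char    *reason;
};

/* Cycle 1-3 verdicts, one row per stage. */
static const struct route routes[STAGE_COUNT] = {
    { STAGE_ENTROPY, SUBSTRATE_CPU, "structurally serial" },
    { STAGE_IDCT,    SUBSTRATE_QPU, "R = 0.92, +7 % mixed" },
    { STAGE_LPF,     SUBSTRATE_QPU, "R = 0.41, +7 % mixed" },
    { STAGE_MC,      SUBSTRATE_CPU, "R = 0.067, -19.5 % mixed" },
};

/* Look up where a stage should run; unknown stages fall back to the CPU. */
enum substrate route_for(enum stage s)
{
    for (int i = 0; i < STAGE_COUNT; i++)
        if (routes[i].stage == s)
            return routes[i].target;
    return SUBSTRATE_CPU;
}
```

Keeping the reason string next to each route makes it easy to see which measurement would have to change before a stage moves to the other substrate.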
 ## Decision per Phase 1 rules + 30fps-floor calibration
 | Rule | Result | Status |
 |---|---|---|
 | M1''' bit-exact | 100.0000 % | ✓ PASS |
 | R''' = M2'''/M3''' | 0.067 (RED) | structural mismatch |
 | M4''' > pure-CPU 4-core | −19.5 % | ✗ FAIL gate |
 | 30fps@1080p floor (isolation) | 1.5× | ✓ PASS (user-facing) |
 | 30fps@1080p floor (mixed) | 12.6× | ✓ PASS (user-facing) |
 **Engineering cycle verdict: do not deploy MC on QPU; deploy on CPU.**
 **User-facing cycle verdict: 30fps floor easily met in any
 configuration; either path works for daily YouTube.**
 For the deployment recipe above, **MC stays on CPU**. The Phase 1
 ORANGE/RED "honest close" rule applies here: cycle 3 closes as a
 documented negative for this kernel without affecting the
 project-level "continue" verdict (cycles 1+2 GO results stand).
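The 30fps margins quoted in the tables are consistent with a floor of 0.972 Mblock/s, which is what 1920×1080 at 30 fps requires if every 8×8 block needs one pass (32,400 blocks per frame). That block accounting is an inference from the reported numbers, not something this document states. The C sketch below uses it only to show the arithmetic, and its names are hypothetical.

```c
#include <assert.h>

/* Assumed floor: 1920x1080 frame, 8x8 blocks, 30 frames per second. */
double floor_mblock_per_s(void)
{
    const double blocks_per_frame = (1920.0 / 8.0) * (1080.0 / 8.0); /* 32400 */
    return blocks_per_frame * 30.0 / 1e6;                            /* 0.972 */
}

/* Margin over the floor for a measured throughput in Mblock/s. */
double margin_30fps(double mblock_per_s)
{
    return mblock_per_s / floor_mblock_per_s();
}
```

Under these assumptions, 12.277 Mblock/s gives a margin of about 12.6 and 1.413 Mblock/s about 1.45, in line with the 12.6× and 1.5× rows after rounding.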
 ## Phase 9 lessons (added to project memory)
 1. **Multiply-heavy workloads expose V3D's no-DP4A deficit** in a way
   that cycles 1+2 didn't. CPU SDOT/UDOT pack 4 INT8 MACs per 32-bit
   lane in one instruction; V3D's SMUL24 is one scalar multiply at a
   time. The 4× gap shows up directly as a 6-15× per-block slowdown.
 2. **Compute-bound CPU workloads make the QPU offload story collapse.**
   When NEON 4-core scales near-linearly (not bandwidth-saturated),
   the "freed-core" argument from cycle 1+2 doesn't apply — there
   are no free cycles to free. Mixed mode is strictly worse.
 3. **The 30fps@1080p user-facing test (`project_30fps_floor_is_fine.md`)
   passes regardless of engineering verdict.** All three cycles pass
   it in isolation. This is a project-level win to communicate
   separately from per-cycle engineering R numbers.
 4. **The shaderdb filter-LUT gate from phase5''' finding 2 fired
   exactly as predicted** (197 uniforms > 144 threshold; 2 threads
   instead of 4). This validates the cycle-discipline of running
   `V3D_DEBUG=shaderdb` early and using the result as an actionable
   gate. Cycle 4 (if any) should bake this in from Phase 4 §design.
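A minimal sketch of the gate from lesson 4, assuming the shaderdb summary line has the shape shown in the v1 first-light section (`488 inst, 2 threads, 0 loops, 197 uniforms, ...`) and using the 144-uniform threshold quoted above. The struct and function names are hypothetical, and real `V3D_DEBUG=shaderdb` output may need a more tolerant parser.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

struct shaderdb_stats {
    int inst, threads, loops, uniforms;
};

/*
 * Parse the counters that follow the "SHADER-DB-<hash>: " prefix.
 * Returns 0 on success, -1 if the line does not match the expected shape.
 */
int parse_shaderdb(const char *line, struct shaderdb_stats *out)
{
    const char *p = strstr(line, ": ");
    if (!p)
        return -1;
    if (sscanf(p + 2, "%d inst, %d threads, %d loops, %d uniforms",
               &out->inst, &out->threads, &out->loops, &out->uniforms) != 4)
        return -1;
    return 0;
}

/* Gate: 1 if the uniform count stays within the budget, 0 if it is exceeded. */
int uniform_gate_ok(const struct shaderdb_stats *s, int max_uniforms)
{
    return s->uniforms <= max_uniforms;
}
```

Fed the v1 MC line, the parser yields 197 uniforms and `uniform_gate_ok(&s, 144)` returns 0, which is the condition that would trigger the SSBO escalation.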
 ## Leaves open
 - Cycle 3 v2 with filter LUT escalated to SSBO (per phase5''' finding 2
  trigger). Would reduce uniforms to ~30, potentially restoring 4
  threads. Expected upside: ~2× → R''' = 0.13. Still RED, still
  M4-negative. Skipped: even doubling doesn't change the deployment
  recipe.
 - Vertical / hv / 4-tap / wider variants: all share cycle 3's
  multiply shape, so the same structural verdict is expected. Not
  worth a Phase 1+ for those.
 - Cycle 4 candidates (per phase7_M4.md §"Cycle 3 candidates"):
  CDEF (AV1-only directional filter), Loop Restoration (AV1-only),
  or higgs deployment plumbing.
@@ -25,6 +25,8 @@ tagged commit, no modifications.
 | `libavcodec/vp9dsp_template.c` | 2578 | 89045 | `41b21f667a6c497b620aa1637d8269badc45d1ac7e621d694441c5bf39356e4f` |
 | `libavcodec/aarch64/vp9itxfm_neon.S` | 1580 | 63534 | `82ee3ceed4735c63576bafdcee28e2215652743ade55a9eab46a16d9530369f6` |
 | `libavcodec/aarch64/vp9lpf_neon.S` | 1334 | — | `384e49e7a6e838d9e38aedc00838ed4aebfa6c5bdb343ecaf23ef639bc10fbb7` |
 | `libavcodec/aarch64/vp9mc_neon.S` | 665 | — | `6b1d50f9821742584fdd47758057f810644aff3a008faaa774ff5b9cac4d1fef` |
 | `libavcodec/vp9_subpel_filters_table.c` | — | — | hand-extracted from `libavcodec/vp9dsp.c` at same n7.1.3 pin — provides `ff_vp9_subpel_filters` for `vp9mc_neon.S` to link against without dragging in vp9dsp.c's full init machinery |
 | `libavcodec/aarch64/neon.S` | 173 | 7496 | `72d36ce6c3fcc5e53de869cfe10fda16225ebe580c32891bccc240a30a85a538` |
 | `libavutil/aarch64/asm.S` | 260 | 8069 | `c0d03143b1bc5a9e358222d08d2d449d595271844fe7a3dc23bffb91abe8b0e3` |
 | `COPYING.LGPLv2.1` | 502 | — | `b634ab5640e258563c536e658cad87080553df6f34f62269a21d554844e58bfe` |
@@ -0,0 +1,665 @@
 /*
 * Copyright (c) 2016 Google Inc.
 *
 * This file is part of FFmpeg.
 *
 * FFmpeg is free software; you can redistribute it and/or
 * modify it under the terms of the GNU Lesser General Public
 * License as published by the Free Software Foundation; either
 * version 2.1 of the License, or (at your option) any later version.
 *
 * FFmpeg is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
 * Lesser General Public License for more details.
 *
 * You should have received a copy of the GNU Lesser General Public
 * License along with FFmpeg; if not, write to the Free Software
 * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
 */
 #include "libavutil/aarch64/asm.S"
 // All public functions in this file have the following signature:
 // typedef void (*vp9_mc_func)(uint8_t *dst, ptrdiff_t dst_stride,
 //                            const uint8_t *ref, ptrdiff_t ref_stride,
 //                            int h, int mx, int my);
 function ff_vp9_avg64_neon, export=1
        mov             x5,  x0
 1:
        ld1             {v4.16b,  v5.16b,  v6.16b,  v7.16b},  [x2], x3
        ld1             {v0.16b,  v1.16b,  v2.16b,  v3.16b},  [x0], x1
        ld1             {v20.16b, v21.16b, v22.16b, v23.16b}, [x2], x3
        urhadd          v0.16b,  v0.16b,  v4.16b
        urhadd          v1.16b,  v1.16b,  v5.16b
        ld1             {v16.16b, v17.16b, v18.16b, v19.16b}, [x0], x1
        urhadd          v2.16b,  v2.16b,  v6.16b
        urhadd          v3.16b,  v3.16b,  v7.16b
        subs            w4,  w4,  #2
        urhadd          v16.16b, v16.16b, v20.16b
        urhadd          v17.16b, v17.16b, v21.16b
        st1             {v0.16b,  v1.16b,  v2.16b,  v3.16b},  [x5], x1
        urhadd          v18.16b, v18.16b, v22.16b
        urhadd          v19.16b, v19.16b, v23.16b
        st1             {v16.16b, v17.16b, v18.16b, v19.16b}, [x5], x1
        b.ne            1b
        ret
 endfunc
 function ff_vp9_avg32_neon, export=1
 1:
        ld1             {v2.16b, v3.16b},  [x2], x3
        ld1             {v0.16b, v1.16b},  [x0]
        urhadd          v0.16b,  v0.16b,  v2.16b
        urhadd          v1.16b,  v1.16b,  v3.16b
        subs            w4,  w4,  #1
        st1             {v0.16b, v1.16b},  [x0], x1
        b.ne            1b
        ret
 endfunc
 function ff_vp9_copy16_neon, export=1
        add             x5,  x0,  x1
        lsl             x1,  x1,  #1
        add             x6,  x2,  x3
        lsl             x3,  x3,  #1
 1:
        ld1             {v0.16b},  [x2], x3
        ld1             {v1.16b},  [x6], x3
        ld1             {v2.16b},  [x2], x3
        ld1             {v3.16b},  [x6], x3
        subs            w4,  w4,  #4
        st1             {v0.16b},  [x0], x1
        st1             {v1.16b},  [x5], x1
        st1             {v2.16b},  [x0], x1
        st1             {v3.16b},  [x5], x1
        b.ne            1b
        ret
 endfunc
 function ff_vp9_avg16_neon, export=1
        mov             x5,  x0
 1:
        ld1             {v2.16b},  [x2], x3
        ld1             {v0.16b},  [x0], x1
        ld1             {v3.16b},  [x2], x3
        urhadd          v0.16b,  v0.16b,  v2.16b
        ld1             {v1.16b},  [x0], x1
        urhadd          v1.16b,  v1.16b,  v3.16b
        subs            w4,  w4,  #2
        st1             {v0.16b},  [x5], x1
        st1             {v1.16b},  [x5], x1
        b.ne            1b
        ret
 endfunc
 function ff_vp9_copy8_neon, export=1
 1:
        ld1             {v0.8b},  [x2], x3
        ld1             {v1.8b},  [x2], x3
        subs            w4,  w4,  #2
        st1             {v0.8b},  [x0], x1
        st1             {v1.8b},  [x0], x1
        b.ne            1b
        ret
 endfunc
 function ff_vp9_avg8_neon, export=1
        mov             x5,  x0
 1:
        ld1             {v2.8b},  [x2], x3
        ld1             {v0.8b},  [x0], x1
        ld1             {v3.8b},  [x2], x3
        urhadd          v0.8b,  v0.8b,  v2.8b
        ld1             {v1.8b},  [x0], x1
        urhadd          v1.8b,  v1.8b,  v3.8b
        subs            w4,  w4,  #2
        st1             {v0.8b},  [x5], x1
        st1             {v1.8b},  [x5], x1
        b.ne            1b
        ret
 endfunc
 function ff_vp9_copy4_neon, export=1
 1:
        ld1             {v0.s}[0], [x2], x3
        ld1             {v1.s}[0], [x2], x3
        st1             {v0.s}[0], [x0], x1
        ld1             {v2.s}[0], [x2], x3
        st1             {v1.s}[0], [x0], x1
        ld1             {v3.s}[0], [x2], x3
        subs            w4,  w4,  #4
        st1             {v2.s}[0], [x0], x1
        st1             {v3.s}[0], [x0], x1
        b.ne            1b
        ret
 endfunc
 function ff_vp9_avg4_neon, export=1
        mov             x5,  x0
 1:
        ld1             {v2.s}[0], [x2], x3
        ld1             {v0.s}[0], [x0], x1
        ld1             {v2.s}[1], [x2], x3
        ld1             {v0.s}[1], [x0], x1
        ld1             {v3.s}[0], [x2], x3
        ld1             {v1.s}[0], [x0], x1
        ld1             {v3.s}[1], [x2], x3
        ld1             {v1.s}[1], [x0], x1
        subs            w4,  w4,  #4
        urhadd          v0.8b,  v0.8b,  v2.8b
        urhadd          v1.8b,  v1.8b,  v3.8b
        st1             {v0.s}[0], [x5], x1
        st1             {v0.s}[1], [x5], x1
        st1             {v1.s}[0], [x5], x1
        st1             {v1.s}[1], [x5], x1
        b.ne            1b
        ret
 endfunc
 // Extract a vector from src1-src2 and src4-src5 (src1-src3 and src4-src6
 // for size >= 16), and multiply-accumulate into dst1 and dst3 (or
 // dst1-dst2 and dst3-dst4 for size >= 16)
 .macro extmla dst1, dst2, dst3, dst4, src1, src2, src3, src4, src5, src6, offset, size
        ext             v20.16b, \src1\().16b, \src2\().16b, #(2*\offset)
        ext             v22.16b, \src4\().16b, \src5\().16b, #(2*\offset)
 .if \size >= 16
        mla             \dst1\().8h, v20.8h, v0.h[\offset]
        ext             v21.16b, \src2\().16b, \src3\().16b, #(2*\offset)
        mla             \dst3\().8h, v22.8h, v0.h[\offset]
        ext             v23.16b, \src5\().16b, \src6\().16b, #(2*\offset)
        mla             \dst2\().8h, v21.8h, v0.h[\offset]
        mla             \dst4\().8h, v23.8h, v0.h[\offset]
 .elseif \size == 8
        mla             \dst1\().8h, v20.8h, v0.h[\offset]
        mla             \dst3\().8h, v22.8h, v0.h[\offset]
 .else
        mla             \dst1\().4h, v20.4h, v0.h[\offset]
        mla             \dst3\().4h, v22.4h, v0.h[\offset]
 .endif
 .endm
 // The same as above, but don't accumulate straight into the
 // destination, but use a temp register and accumulate with saturation.
 .macro extmulqadd dst1, dst2, dst3, dst4, src1, src2, src3, src4, src5, src6, offset, size
        ext             v20.16b, \src1\().16b, \src2\().16b, #(2*\offset)
        ext             v22.16b, \src4\().16b, \src5\().16b, #(2*\offset)
 .if \size >= 16
        mul             v20.8h, v20.8h, v0.h[\offset]
        ext             v21.16b, \src2\().16b, \src3\().16b, #(2*\offset)
        mul             v22.8h, v22.8h, v0.h[\offset]
        ext             v23.16b, \src5\().16b, \src6\().16b, #(2*\offset)
        mul             v21.8h, v21.8h, v0.h[\offset]
        mul             v23.8h, v23.8h, v0.h[\offset]
 .elseif \size == 8
        mul             v20.8h, v20.8h, v0.h[\offset]
        mul             v22.8h, v22.8h, v0.h[\offset]
 .else
        mul             v20.4h, v20.4h, v0.h[\offset]
        mul             v22.4h, v22.4h, v0.h[\offset]
 .endif
 .if \size == 4
        sqadd           \dst1\().4h, \dst1\().4h, v20.4h
        sqadd           \dst3\().4h, \dst3\().4h, v22.4h
 .else
        sqadd           \dst1\().8h, \dst1\().8h, v20.8h
        sqadd           \dst3\().8h, \dst3\().8h, v22.8h
 .if \size >= 16
        sqadd           \dst2\().8h, \dst2\().8h, v21.8h
        sqadd           \dst4\().8h, \dst4\().8h, v23.8h
 .endif
 .endif
 .endm
 // Instantiate a horizontal filter function for the given size.
 // This can work on 4, 8 or 16 pixels in parallel; for larger
 // widths it will do 16 pixels at a time and loop horizontally.
 // The actual width is passed in x5, the height in w4 and the
 // filter coefficients in x9. idx2 is the index of the largest
 // filter coefficient (3 or 4) and idx1 is the other one of them.
 .macro do_8tap_h type, size, idx1, idx2
 function \type\()_8tap_\size\()h_\idx1\idx2
        sub             x2,  x2,  #3
        add             x6,  x0,  x1
        add             x7,  x2,  x3
        add             x1,  x1,  x1
        add             x3,  x3,  x3
        // Only size >= 16 loops horizontally and needs
        // reduced dst stride
 .if \size >= 16
        sub             x1,  x1,  x5
 .elseif \size == 4
        add             x12, x2,  #8
        add             x13, x7,  #8
 .endif
        // size >= 16 loads two qwords and increments x2,
        // for size 4/8 it's enough with one qword and no
        // postincrement
 .if \size >= 16
        sub             x3,  x3,  x5
        sub             x3,  x3,  #8
 .endif
        // Load the filter vector
        ld1             {v0.8h},  [x9]
 1:
 .if \size >= 16
        mov             x9,  x5
 .endif
        // Load src
 .if \size >= 16
        ld1             {v4.8b,  v5.8b,  v6.8b},  [x2], #24
        ld1             {v16.8b, v17.8b, v18.8b}, [x7], #24
 .elseif \size == 8
        ld1             {v4.8b,  v5.8b},  [x2]
        ld1             {v16.8b, v17.8b}, [x7]
 .else // \size == 4
        ld1             {v4.8b},  [x2]
        ld1             {v16.8b}, [x7]
        ld1             {v5.s}[0],  [x12], x3
        ld1             {v17.s}[0], [x13], x3
 .endif
        uxtl            v4.8h,  v4.8b
        uxtl            v5.8h,  v5.8b
        uxtl            v16.8h, v16.8b
        uxtl            v17.8h, v17.8b
 .if \size >= 16
        uxtl            v6.8h,  v6.8b
        uxtl            v18.8h, v18.8b
 .endif
 2:
        // Accumulate, adding idx2 last with a separate
        // saturating add. The positive filter coefficients
        // for all indices except idx2 must add up to less
        // than 127 for this not to overflow.
        mul             v1.8h,  v4.8h,  v0.h[0]
        mul             v24.8h, v16.8h, v0.h[0]
 .if \size >= 16
        mul             v2.8h,  v5.8h,  v0.h[0]
        mul             v25.8h, v17.8h, v0.h[0]
 .endif
        extmla          v1,  v2,  v24, v25, v4,  v5,  v6,  v16, v17, v18, 1,     \size
        extmla          v1,  v2,  v24, v25, v4,  v5,  v6,  v16, v17, v18, 2,     \size
        extmla          v1,  v2,  v24, v25, v4,  v5,  v6,  v16, v17, v18, \idx1, \size
        extmla          v1,  v2,  v24, v25, v4,  v5,  v6,  v16, v17, v18, 5,     \size
        extmla          v1,  v2,  v24, v25, v4,  v5,  v6,  v16, v17, v18, 6,     \size
        extmla          v1,  v2,  v24, v25, v4,  v5,  v6,  v16, v17, v18, 7,     \size
        extmulqadd      v1,  v2,  v24, v25, v4,  v5,  v6,  v16, v17, v18, \idx2, \size
        // Round, shift and saturate
        sqrshrun        v1.8b,   v1.8h,  #7
        sqrshrun        v24.8b,  v24.8h, #7
 .if \size >= 16
        sqrshrun2       v1.16b,  v2.8h,  #7
        sqrshrun2       v24.16b, v25.8h, #7
 .endif
        // Average
 .ifc \type,avg
 .if \size >= 16
        ld1             {v2.16b}, [x0]
        ld1             {v3.16b}, [x6]
        urhadd          v1.16b,  v1.16b,  v2.16b
        urhadd          v24.16b, v24.16b, v3.16b
 .elseif \size == 8
        ld1             {v2.8b},  [x0]
        ld1             {v3.8b},  [x6]
        urhadd          v1.8b,  v1.8b,  v2.8b
        urhadd          v24.8b, v24.8b, v3.8b
 .else
        ld1             {v2.s}[0], [x0]
        ld1             {v3.s}[0], [x6]
        urhadd          v1.8b,  v1.8b,  v2.8b
        urhadd          v24.8b, v24.8b, v3.8b
 .endif
 .endif
        // Store and loop horizontally (for size >= 16)
 .if \size >= 16
        subs            x9,  x9,  #16
        st1             {v1.16b},  [x0], #16
        st1             {v24.16b}, [x6], #16
        b.eq            3f
        mov             v4.16b,  v6.16b
        mov             v16.16b, v18.16b
        ld1             {v6.16b},  [x2], #16
        ld1             {v18.16b}, [x7], #16
        uxtl            v5.8h,  v6.8b
        uxtl2           v6.8h,  v6.16b
        uxtl            v17.8h, v18.8b
        uxtl2           v18.8h, v18.16b
        b               2b
 .elseif \size == 8
        st1             {v1.8b},    [x0]
        st1             {v24.8b},   [x6]
 .else // \size == 4
        st1             {v1.s}[0],  [x0]
        st1             {v24.s}[0], [x6]
 .endif
 3:
        // Loop vertically
        add             x0,  x0,  x1
        add             x6,  x6,  x1
        add             x2,  x2,  x3
        add             x7,  x7,  x3
        subs            w4,  w4,  #2
        b.ne            1b
        ret
 endfunc
 .endm
 .macro do_8tap_h_size size
 do_8tap_h put, \size, 3, 4
 do_8tap_h avg, \size, 3, 4
 do_8tap_h put, \size, 4, 3
 do_8tap_h avg, \size, 4, 3
 .endm
 do_8tap_h_size 4
 do_8tap_h_size 8
 do_8tap_h_size 16
 .macro do_8tap_h_func type, filter, offset, size
 function ff_vp9_\type\()_\filter\()\size\()_h_neon, export=1
        movrel          x6,  X(ff_vp9_subpel_filters), 256*\offset
        cmp             w5,  #8
        add             x9,  x6,  w5, uxtw #4
        mov             x5,  #\size
 .if \size >= 16
        b.ge            \type\()_8tap_16h_34
        b               \type\()_8tap_16h_43
 .else
        b.ge            \type\()_8tap_\size\()h_34
        b               \type\()_8tap_\size\()h_43
 .endif
 endfunc
 .endm
 .macro do_8tap_h_filters size
 do_8tap_h_func put, regular, 1, \size
 do_8tap_h_func avg, regular, 1, \size
 do_8tap_h_func put, sharp,   2, \size
 do_8tap_h_func avg, sharp,   2, \size
 do_8tap_h_func put, smooth,  0, \size
 do_8tap_h_func avg, smooth,  0, \size
 .endm
 do_8tap_h_filters 64
 do_8tap_h_filters 32
 do_8tap_h_filters 16
 do_8tap_h_filters 8
 do_8tap_h_filters 4
 // Vertical filters
 // Round, shift and saturate and store reg1-reg2 over 4 lines
 .macro do_store4 reg1, reg2, tmp1, tmp2, type
        sqrshrun        \reg1\().8b,  \reg1\().8h, #7
        sqrshrun        \reg2\().8b,  \reg2\().8h, #7
 .ifc \type,avg
        ld1             {\tmp1\().s}[0],  [x7], x1
        ld1             {\tmp2\().s}[0],  [x7], x1
        ld1             {\tmp1\().s}[1],  [x7], x1
        ld1             {\tmp2\().s}[1],  [x7], x1
        urhadd          \reg1\().8b,  \reg1\().8b,  \tmp1\().8b
        urhadd          \reg2\().8b,  \reg2\().8b,  \tmp2\().8b
 .endif
        st1             {\reg1\().s}[0],  [x0], x1
        st1             {\reg2\().s}[0],  [x0], x1
        st1             {\reg1\().s}[1],  [x0], x1
        st1             {\reg2\().s}[1],  [x0], x1
 .endm
 // Round, shift and saturate and store reg1-4
 .macro do_store reg1, reg2, reg3, reg4, tmp1, tmp2, tmp3, tmp4, type
        sqrshrun        \reg1\().8b,  \reg1\().8h, #7
        sqrshrun        \reg2\().8b,  \reg2\().8h, #7
        sqrshrun        \reg3\().8b,  \reg3\().8h, #7
        sqrshrun        \reg4\().8b,  \reg4\().8h, #7
 .ifc \type,avg
        ld1             {\tmp1\().8b},  [x7], x1
        ld1             {\tmp2\().8b},  [x7], x1
        ld1             {\tmp3\().8b},  [x7], x1
        ld1             {\tmp4\().8b},  [x7], x1
        urhadd          \reg1\().8b,  \reg1\().8b,  \tmp1\().8b
        urhadd          \reg2\().8b,  \reg2\().8b,  \tmp2\().8b
        urhadd          \reg3\().8b,  \reg3\().8b,  \tmp3\().8b
        urhadd          \reg4\().8b,  \reg4\().8b,  \tmp4\().8b
 .endif
        st1             {\reg1\().8b},  [x0], x1
        st1             {\reg2\().8b},  [x0], x1
        st1             {\reg3\().8b},  [x0], x1
        st1             {\reg4\().8b},  [x0], x1
 .endm
 // Evaluate the filter twice in parallel, from the inputs src1-src9 into dst1-dst2
 // (src1-src8 into dst1, src2-src9 into dst2), adding idx2 separately
 // at the end with saturation. Indices 0 and 7 always have negative or zero
 // coefficients, so they can be accumulated into tmp1-tmp2 together with the
 // largest coefficient.
 .macro convolve dst1, dst2, src1, src2, src3, src4, src5, src6, src7, src8, src9, idx1, idx2, tmp1, tmp2
        mul             \dst1\().8h, \src2\().8h, v0.h[1]
        mul             \dst2\().8h, \src3\().8h, v0.h[1]
        mul             \tmp1\().8h, \src1\().8h, v0.h[0]
        mul             \tmp2\().8h, \src2\().8h, v0.h[0]
        mla             \dst1\().8h, \src3\().8h, v0.h[2]
        mla             \dst2\().8h, \src4\().8h, v0.h[2]
 .if \idx1 == 3
        mla             \dst1\().8h, \src4\().8h, v0.h[3]
        mla             \dst2\().8h, \src5\().8h, v0.h[3]
 .else
        mla             \dst1\().8h, \src5\().8h, v0.h[4]
        mla             \dst2\().8h, \src6\().8h, v0.h[4]
 .endif
        mla             \dst1\().8h, \src6\().8h, v0.h[5]
        mla             \dst2\().8h, \src7\().8h, v0.h[5]
        mla             \tmp1\().8h, \src8\().8h, v0.h[7]
        mla             \tmp2\().8h, \src9\().8h, v0.h[7]
        mla             \dst1\().8h, \src7\().8h, v0.h[6]
        mla             \dst2\().8h, \src8\().8h, v0.h[6]
 .if \idx2 == 3
        mla             \tmp1\().8h, \src4\().8h, v0.h[3]
        mla             \tmp2\().8h, \src5\().8h, v0.h[3]
 .else
        mla             \tmp1\().8h, \src5\().8h, v0.h[4]
        mla             \tmp2\().8h, \src6\().8h, v0.h[4]
 .endif
        sqadd           \dst1\().8h, \dst1\().8h, \tmp1\().8h
        sqadd           \dst2\().8h, \dst2\().8h, \tmp2\().8h
 .endm
 // Load pixels and extend them to 16 bit
 .macro loadl dst1, dst2, dst3, dst4
        ld1             {v1.8b}, [x2], x3
        ld1             {v2.8b}, [x2], x3
        ld1             {v3.8b}, [x2], x3
 .ifnb \dst4
        ld1             {v4.8b}, [x2], x3
 .endif
        uxtl            \dst1\().8h, v1.8b
        uxtl            \dst2\().8h, v2.8b
        uxtl            \dst3\().8h, v3.8b
 .ifnb \dst4
        uxtl            \dst4\().8h, v4.8b
 .endif
 .endm
 // Instantiate a vertical filter function for filtering 8 pixels at a time.
 // The height is passed in x4, the width in x5 and the filter coefficients
 // in x6. idx2 is the index of the largest filter coefficient (3 or 4)
 // and idx1 is the other one of them.
 .macro do_8tap_8v type, idx1, idx2
 function \type\()_8tap_8v_\idx1\idx2
        sub             x2,  x2,  x3, lsl #1
        sub             x2,  x2,  x3
        ld1             {v0.8h},  [x6]
 1:
 .ifc \type,avg
        mov             x7,  x0
 .endif
        mov             x6,  x4
        loadl           v17, v18, v19
        loadl           v20, v21, v22, v23
 2:
        loadl           v24, v25, v26, v27
        convolve        v1,  v2,  v17, v18, v19, v20, v21, v22, v23, v24, v25, \idx1, \idx2, v5,  v6
        convolve        v3,  v4,  v19, v20, v21, v22, v23, v24, v25, v26, v27, \idx1, \idx2, v5,  v6
        do_store        v1,  v2,  v3,  v4,  v5,  v6,  v7,  v28, \type
        subs            x6,  x6,  #4
        b.eq            8f
        loadl           v16, v17, v18, v19
        convolve        v1,  v2,  v21, v22, v23, v24, v25, v26, v27, v16, v17, \idx1, \idx2, v5,  v6
        convolve        v3,  v4,  v23, v24, v25, v26, v27, v16, v17, v18, v19, \idx1, \idx2, v5,  v6
        do_store        v1,  v2,  v3,  v4,  v5,  v6,  v7,  v28, \type
        subs            x6,  x6,  #4
        b.eq            8f
        loadl           v20, v21, v22, v23
        convolve        v1,  v2,  v25, v26, v27, v16, v17, v18, v19, v20, v21, \idx1, \idx2, v5,  v6
        convolve        v3,  v4,  v27, v16, v17, v18, v19, v20, v21, v22, v23, \idx1, \idx2, v5,  v6
        do_store        v1,  v2,  v3,  v4,  v5,  v6,  v7,  v28, \type
        subs            x6,  x6,  #4
        b.ne            2b
 8:
        subs            x5,  x5,  #8
        b.eq            9f
        // x0 -= h * dst_stride
        msub            x0,  x1,  x4, x0
        // x2 -= h * src_stride
        msub            x2,  x3,  x4, x2
        // x2 -= 8 * src_stride
        sub             x2,  x2,  x3, lsl #3
        // x2 += 1 * src_stride
        add             x2,  x2,  x3
        add             x2,  x2,  #8
        add             x0,  x0,  #8
        b               1b
 9:
        ret
 endfunc
 .endm
 do_8tap_8v put, 3, 4
 do_8tap_8v put, 4, 3
 do_8tap_8v avg, 3, 4
 do_8tap_8v avg, 4, 3
 // Instantiate a vertical filter function for filtering a 4 pixels wide
 // slice. The first half of the registers contain one row, while the second
 // half of a register contains the second-next row (also stored in the first
 // half of the register two steps ahead). The convolution does two outputs
 // at a time; the output of v17-v24 into one, and v18-v25 into another one.
 // The first half of first output is the first output row, the first half
 // of the other output is the second output row. The second halves of the
 // registers are rows 3 and 4.
 // This only is designed to work for 4 or 8 output lines.
 .macro do_8tap_4v type, idx1, idx2
 function \type\()_8tap_4v_\idx1\idx2
        sub             x2,  x2,  x3, lsl #1
        sub             x2,  x2,  x3
        ld1             {v0.8h},  [x6]
 .ifc \type,avg
        mov             x7,  x0
 .endif
        ld1             {v1.s}[0],  [x2], x3
        ld1             {v2.s}[0],  [x2], x3
        ld1             {v3.s}[0],  [x2], x3
        ld1             {v4.s}[0],  [x2], x3
        ld1             {v5.s}[0],  [x2], x3
        ld1             {v6.s}[0],  [x2], x3
        trn1            v1.2s,  v1.2s,  v3.2s
        ld1             {v7.s}[0],  [x2], x3
        trn1            v2.2s,  v2.2s,  v4.2s
        ld1             {v26.s}[0], [x2], x3
        uxtl            v17.8h, v1.8b
        trn1            v3.2s,  v3.2s,  v5.2s
        ld1             {v27.s}[0], [x2], x3
        uxtl            v18.8h, v2.8b
        trn1            v4.2s,  v4.2s,  v6.2s
        ld1             {v28.s}[0], [x2], x3
        uxtl            v19.8h, v3.8b
        trn1            v5.2s,  v5.2s,  v7.2s
        ld1             {v29.s}[0], [x2], x3
        uxtl            v20.8h, v4.8b
        trn1            v6.2s,  v6.2s,  v26.2s
        uxtl            v21.8h, v5.8b
        trn1            v7.2s,  v7.2s,  v27.2s
        uxtl            v22.8h, v6.8b
        trn1            v26.2s, v26.2s, v28.2s
        uxtl            v23.8h, v7.8b
        trn1            v27.2s, v27.2s, v29.2s
        uxtl            v24.8h, v26.8b
        uxtl            v25.8h, v27.8b
        convolve        v1,  v2,  v17, v18, v19, v20, v21, v22, v23, v24, v25, \idx1, \idx2, v3,  v4
        do_store4       v1,  v2,  v5,  v6,  \type
        subs            x4,  x4,  #4
        b.eq            9f
        ld1             {v1.s}[0],  [x2], x3
        ld1             {v2.s}[0],  [x2], x3
        trn1            v28.2s, v28.2s, v1.2s
        trn1            v29.2s, v29.2s, v2.2s
        ld1             {v1.s}[1],  [x2], x3
        uxtl            v26.8h, v28.8b
        ld1             {v2.s}[1],  [x2], x3
        uxtl            v27.8h, v29.8b
        uxtl            v28.8h, v1.8b
        uxtl            v29.8h, v2.8b
        convolve        v1,  v2,  v21, v22, v23, v24, v25, v26, v27, v28, v29, \idx1, \idx2, v3,  v4
        do_store4       v1,  v2,  v5,  v6,  \type
 9:
        ret
 endfunc
 .endm
 do_8tap_4v put, 3, 4
 do_8tap_4v put, 4, 3
 do_8tap_4v avg, 3, 4
 do_8tap_4v avg, 4, 3
 .macro do_8tap_v_func type, filter, offset, size
 function ff_vp9_\type\()_\filter\()\size\()_v_neon, export=1
        uxtw            x4,  w4
        movrel          x5,  X(ff_vp9_subpel_filters), 256*\offset
        cmp             w6,  #8
        add             x6,  x5,  w6, uxtw #4
        mov             x5,  #\size
 .if \size >= 8
        b.ge            \type\()_8tap_8v_34
        b               \type\()_8tap_8v_43
 .else
        b.ge            \type\()_8tap_4v_34
        b               \type\()_8tap_4v_43
 .endif
 endfunc
 .endm
 .macro do_8tap_v_filters size
 do_8tap_v_func put, regular, 1, \size
 do_8tap_v_func avg, regular, 1, \size
 do_8tap_v_func put, sharp,   2, \size
 do_8tap_v_func avg, sharp,   2, \size
 do_8tap_v_func put, smooth,  0, \size
 do_8tap_v_func avg, smooth,  0, \size
 .endm
 do_8tap_v_filters 64
 do_8tap_v_filters 32
 do_8tap_v_filters 16
 do_8tap_v_filters 8
 do_8tap_v_filters 4
@@ -0,0 +1,82 @@
 /*
 * VP9 8-tap subpel filter table — verbatim transcription of
 * ff_vp9_subpel_filters from FFmpeg n7.1.3 libavcodec/vp9dsp.c
 * (commit f46e514). Provided as a standalone .c so the vendored
 * vp9mc_neon.S has the `ff_vp9_subpel_filters` symbol to link
 * against, without pulling in the full vp9dsp.c init machinery
 * (which would chain-include the entire VP9 decoder).
 *
 * Enum order from libavcodec/vp9dsp.h:64-67:
 *   FILTER_8TAP_SMOOTH  = 0
 *   FILTER_8TAP_REGULAR = 1
 *   FILTER_8TAP_SHARP   = 2
 *
 * License: LGPL-2.1-or-later (matches vp9dsp.c upstream).
 */
 #include <stdint.h>
 #ifdef __GNUC__
 #define DAEDALUS_ALIGNED(n) __attribute__((aligned(n)))
 #else
 #define DAEDALUS_ALIGNED(n)
 #endif
 const DAEDALUS_ALIGNED(16) int16_t ff_vp9_subpel_filters[3][16][8] = {
    /* [0] = FILTER_8TAP_SMOOTH */
    {
        {  0,  0,   0, 128,   0,   0,  0,  0 },
        { -3, -1,  32,  64,  38,   1, -3,  0 },
        { -2, -2,  29,  63,  41,   2, -3,  0 },
        { -2, -2,  26,  63,  43,   4, -4,  0 },
        { -2, -3,  24,  62,  46,   5, -4,  0 },
        { -2, -3,  21,  60,  49,   7, -4,  0 },
        { -1, -4,  18,  59,  51,   9, -4,  0 },
        { -1, -4,  16,  57,  53,  12, -4, -1 },
        { -1, -4,  14,  55,  55,  14, -4, -1 },
        { -1, -4,  12,  53,  57,  16, -4, -1 },
        {  0, -4,   9,  51,  59,  18, -4, -1 },
        {  0, -4,   7,  49,  60,  21, -3, -2 },
        {  0, -4,   5,  46,  62,  24, -3, -2 },
        {  0, -4,   4,  43,  63,  26, -2, -2 },
        {  0, -3,   2,  41,  63,  29, -2, -2 },
        {  0, -3,   1,  38,  64,  32, -1, -3 },
    },
    /* [1] = FILTER_8TAP_REGULAR */
    {
        {  0,  0,   0, 128,   0,   0,  0,  0 },
        {  0,  1,  -5, 126,   8,  -3,  1,  0 },
        { -1,  3, -10, 122,  18,  -6,  2,  0 },
        { -1,  4, -13, 118,  27,  -9,  3, -1 },
        { -1,  4, -16, 112,  37, -11,  4, -1 },
        { -1,  5, -18, 105,  48, -14,  4, -1 },
        { -1,  5, -19,  97,  58, -16,  5, -1 },
        { -1,  6, -19,  88,  68, -18,  5, -1 },
        { -1,  6, -19,  78,  78, -19,  6, -1 },
        { -1,  5, -18,  68,  88, -19,  6, -1 },
        { -1,  5, -16,  58,  97, -19,  5, -1 },
        { -1,  4, -14,  48, 105, -18,  5, -1 },
        { -1,  4, -11,  37, 112, -16,  4, -1 },
        { -1,  3,  -9,  27, 118, -13,  4, -1 },
        {  0,  2,  -6,  18, 122, -10,  3, -1 },
        {  0,  1,  -3,   8, 126,  -5,  1,  0 },
    },
    /* [2] = FILTER_8TAP_SHARP */
    {
        {  0,  0,   0, 128,   0,   0,  0,  0 },
        { -1,  3,  -7, 127,   8,  -3,  1,  0 },
        { -2,  5, -13, 125,  17,  -6,  3, -1 },
        { -3,  7, -17, 121,  27, -10,  5, -2 },
        { -4,  9, -20, 115,  37, -13,  6, -2 },
        { -4, 10, -23, 108,  48, -16,  8, -3 },
        { -4, 10, -24, 100,  59, -19,  9, -3 },
        { -4, 11, -24,  90,  70, -21, 10, -4 },
        { -4, 11, -23,  80,  80, -23, 11, -4 },
        { -4, 10, -21,  70,  90, -24, 11, -4 },
        { -3,  9, -19,  59, 100, -24, 10, -4 },
        { -3,  8, -16,  48, 108, -23, 10, -4 },
        { -2,  6, -13,  37, 115, -20,  9, -4 },
        { -2,  5, -10,  27, 121, -17,  7, -3 },
        { -1,  3,  -6,  17, 125, -13,  5, -2 },
        {  0,  1,  -3,   8, 127,  -7,  3, -1 },
    },
 };
@@ -0,0 +1,142 @@
 // daedalus-fourier cycle 3 — VP9 8-tap "regular" subpel filter,
 // horizontal direction, 8-wide output, 8 rows per block (row = lane & 7,
 // so h is fixed at 8). V3D 7.1 via Mesa v3dv.
 //
 // Bakes in cycle-1+2 v4 winning patterns from start:
 //   - local_size_x = 256
 //   - 8 lanes per block (1 lane per output row), 2 blocks per
 //     16-lane subgroup, 16 subgroups per WG → 32 blocks per WG
 //   - uint8_t SSBO via storageBuffer8BitAccess
 //   - oob early-return safe (no barrier)
 //
 // Contracts (per k3_mc_phase4.md §5, revised per phase5''' findings):
 //   - meta[i].x: dst_off (byte offset of block's row-0 col-0 dst pixel)
 //   - meta[i].y: src_off (byte offset of block's row-0 col-0 SOURCE
 //     pixel — note: NO +3 shift; the C bench's `src + 3` C-caller
 //     convention does NOT carry into the SSBO offset. Shader reads
 //     s[k] = SSBO[src_off + row*stride + k] for k=0..14, matching
 //     C ref's per-row read of `master_src[block_base + row*stride
 //     + (x..x+7)]` for output col x ∈ 0..7).
 //   - meta[i].z: mx (subpel phase in [0..15])
 //   - dst_stride_u8 ≥ 8 (race-safety lower bound; bench asserts)
 //   - src_stride_u8 ≥ 15 (per-row read span; bench asserts)
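 //
 // Worked index example (illustrative values, derived from the contract
 // above rather than taken from a run): with src_off = 128,
 // src_stride_u8 = 16 and row = 2, the lane reads SSBO bytes 160..174 as
 // s0..s14. Output col 7 uses s7..s14, so byte 175 (col 15 of the
 // 16-wide source row) is never touched.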
 //
 // License: BSD-2-Clause. Algorithm transcribed from tests/vp9_mc_ref.c
 // which mirrors libavcodec/vp9dsp_template.c FILTER_8TAP macro.
 #version 450
 #extension GL_EXT_shader_8bit_storage              : require
 #extension GL_EXT_shader_explicit_arithmetic_types : require
 layout(local_size_x = 256, local_size_y = 1, local_size_z = 1) in;
 layout(binding = 0) readonly buffer Meta {
    uvec4 meta[];   // per block: (dst_off, src_off, mx, _pad)
 } u_meta;
 layout(binding = 1) buffer Dst {
    uint8_t dst[];
 } u_dst;
 layout(binding = 2) readonly buffer Src {
    uint8_t src[];
 } u_src;
 layout(push_constant) uniform PC {
    uint n_blocks;
    uint dst_stride_u8;
    uint src_stride_u8;
    uint _pad;
 } pc;
 // VP9 8-tap REGULAR filter table — verbatim from
 // external/ffmpeg-snapshot/libavcodec/vp9_subpel_filters_table.c
 // (index [1] = FILTER_8TAP_REGULAR). 16 subpel phases × 8 taps.
 //
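 // shaderdb-gate (phase5''' finding 2): if uniform count > ~144 after
 // first compile, escalate this LUT to SSBO binding 3. First compile
 // measured 197 uniforms (gate exceeded) at 2 threads; the LUT is still
 // a const array in this version.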
 const int FILTER_REGULAR[16][8] = int[16][8](
    int[8]( 0,  0,   0, 128,   0,   0,  0,  0 ),
    int[8]( 0,  1,  -5, 126,   8,  -3,  1,  0 ),
    int[8](-1,  3, -10, 122,  18,  -6,  2,  0 ),
    int[8](-1,  4, -13, 118,  27,  -9,  3, -1 ),
    int[8](-1,  4, -16, 112,  37, -11,  4, -1 ),
    int[8](-1,  5, -18, 105,  48, -14,  4, -1 ),
    int[8](-1,  5, -19,  97,  58, -16,  5, -1 ),
    int[8](-1,  6, -19,  88,  68, -18,  5, -1 ),
    int[8](-1,  6, -19,  78,  78, -19,  6, -1 ),
    int[8](-1,  5, -18,  68,  88, -19,  6, -1 ),
    int[8](-1,  5, -16,  58,  97, -19,  5, -1 ),
    int[8](-1,  4, -14,  48, 105, -18,  5, -1 ),
    int[8](-1,  4, -11,  37, 112, -16,  4, -1 ),
    int[8](-1,  3,  -9,  27, 118, -13,  4, -1 ),
    int[8]( 0,  2,  -6,  18, 122, -10,  3, -1 ),
    int[8]( 0,  1,  -3,   8, 126,  -5,  1,  0 )
 );
 void main()
 {
    uint gid         = gl_GlobalInvocationID.x;
    uint wg_id       = gid / 256u;
    uint lane_in_wg  = gid & 255u;
    uint sg_in_wg    = lane_in_wg >> 4;
    uint lane_in_sg  = lane_in_wg & 15u;
    uint block_slot  = lane_in_sg >> 3;
    uint row         = lane_in_sg & 7u;
    uint block_local = sg_in_wg * 2u + block_slot;
    uint block_idx   = wg_id * 32u + block_local;
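    // Worked mapping example (illustrative gid, derived from the lines above):
    //   gid = 300 -> wg_id = 1, lane_in_wg = 44, sg_in_wg = 2, lane_in_sg = 12,
    //   block_slot = 1, row = 4, block_local = 5, block_idx = 1*32 + 5 = 37.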
    // No barrier follows — safe early-return.
    if (block_idx >= pc.n_blocks) return;
    uvec4 m = u_meta.meta[block_idx];
    uint dst_off = m.x;
    uint src_off = m.y;
    uint mx      = m.z & 15u;
    // Read 15 source pixels for this row.
    uint src_row = src_off + row * pc.src_stride_u8;
    int s0  = int(u_src.src[src_row +  0u]);
    int s1  = int(u_src.src[src_row +  1u]);
    int s2  = int(u_src.src[src_row +  2u]);
    int s3  = int(u_src.src[src_row +  3u]);
    int s4  = int(u_src.src[src_row +  4u]);
    int s5  = int(u_src.src[src_row +  5u]);
    int s6  = int(u_src.src[src_row +  6u]);
    int s7  = int(u_src.src[src_row +  7u]);
    int s8  = int(u_src.src[src_row +  8u]);
    int s9  = int(u_src.src[src_row +  9u]);
    int s10 = int(u_src.src[src_row + 10u]);
    int s11 = int(u_src.src[src_row + 11u]);
    int s12 = int(u_src.src[src_row + 12u]);
    int s13 = int(u_src.src[src_row + 13u]);
    int s14 = int(u_src.src[src_row + 14u]);
    int F0 = FILTER_REGULAR[mx][0];
    int F1 = FILTER_REGULAR[mx][1];
    int F2 = FILTER_REGULAR[mx][2];
    int F3 = FILTER_REGULAR[mx][3];
    int F4 = FILTER_REGULAR[mx][4];
    int F5 = FILTER_REGULAR[mx][5];
    int F6 = FILTER_REGULAR[mx][6];
    int F7 = FILTER_REGULAR[mx][7];
    int o0 = F0*s0  + F1*s1  + F2*s2  + F3*s3  + F4*s4  + F5*s5  + F6*s6  + F7*s7;
    int o1 = F0*s1  + F1*s2  + F2*s3  + F3*s4  + F4*s5  + F5*s6  + F6*s7  + F7*s8;
    int o2 = F0*s2  + F1*s3  + F2*s4  + F3*s5  + F4*s6  + F5*s7  + F6*s8  + F7*s9;
    int o3 = F0*s3  + F1*s4  + F2*s5  + F3*s6  + F4*s7  + F5*s8  + F6*s9  + F7*s10;
    int o4 = F0*s4  + F1*s5  + F2*s6  + F3*s7  + F4*s8  + F5*s9  + F6*s10 + F7*s11;
    int o5 = F0*s5  + F1*s6  + F2*s7  + F3*s8  + F4*s9  + F5*s10 + F6*s11 + F7*s12;
    int o6 = F0*s6  + F1*s7  + F2*s8  + F3*s9  + F4*s10 + F5*s11 + F6*s12 + F7*s13;
    int o7 = F0*s7  + F1*s8  + F2*s9  + F3*s10 + F4*s11 + F5*s12 + F6*s13 + F7*s14;
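    // Worked rounding example (illustrative inputs, not from a bench run),
    // using the mx = 8 taps (-1, 6, -19, 78, 78, -19, 6, -1), which sum to 128:
    //   flat row, all s = 255:            o = 128*255 = 32640  -> (o+64)>>7 = 255
    //   255 under the positive taps only: o = 168*255 = 42840  -> 335, clamped to 255
    //   255 under the negative taps only: o = -40*255 = -10200 -> -80, clamped to 0
    // (>> on a signed int sign-extends in GLSL, so the negative case floors.)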
    uint dst_row = dst_off + row * pc.dst_stride_u8;
    u_dst.dst[dst_row + 0u] = uint8_t(clamp((o0 + 64) >> 7, 0, 255));
    u_dst.dst[dst_row + 1u] = uint8_t(clamp((o1 + 64) >> 7, 0, 255));
    u_dst.dst[dst_row + 2u] = uint8_t(clamp((o2 + 64) >> 7, 0, 255));
    u_dst.dst[dst_row + 3u] = uint8_t(clamp((o3 + 64) >> 7, 0, 255));
    u_dst.dst[dst_row + 4u] = uint8_t(clamp((o4 + 64) >> 7, 0, 255));
    u_dst.dst[dst_row + 5u] = uint8_t(clamp((o5 + 64) >> 7, 0, 255));
    u_dst.dst[dst_row + 6u] = uint8_t(clamp((o6 + 64) >> 7, 0, 255));
    u_dst.dst[dst_row + 7u] = uint8_t(clamp((o7 + 64) >> 7, 0, 255));
 }
@@ -0,0 +1,286 @@
 /*
 * Cycle 3 M4''' — concurrent CPU(NEON MC) + QPU(V3D MC) throughput.
 * Same pthread/barrier pattern as bench_concurrent{,_lpf}.c.
 * License: BSD-2-Clause.
 */
 #define _GNU_SOURCE
 #include <stdio.h>
 #include <stdlib.h>
 #include <stdint.h>
 #include <string.h>
 #include <stddef.h>
 #include <time.h>
 #include <getopt.h>
 #include <pthread.h>
 #include <sched.h>
 #include <assert.h>
 #include <vulkan/vulkan.h>
 #include "v3d_runner.h"
 extern void ff_vp9_put_regular8_h_neon(
    uint8_t *dst, ptrdiff_t dst_stride,
    const uint8_t *src, ptrdiff_t src_stride,
    int h, int mx, int my);
 #define SRC_W 16
 #define DST_W 8
 #define SRC_H 8
 #define DST_H 8
 #define SRC_BYTES (SRC_H * SRC_W)
 #define DST_BYTES (DST_H * DST_W)
 static inline uint64_t xs_step(uint64_t *s) {
    uint64_t x = *s; x ^= x << 13; x ^= x >> 7; x ^= x << 17; return *s = x;
 }
 static uint64_t xs_init(uint64_t s) { return s ? s : 0xa57edbeef5717ULL; }
 static double now_s(void) {
    struct timespec t; clock_gettime(CLOCK_MONOTONIC_RAW, &t);
    return t.tv_sec + t.tv_nsec * 1e-9;
 }
 static _Atomic int g_stop = 0;   /* C11 atomic: set by the timer thread, polled by workers */
 static pthread_barrier_t g_start;
 /* --- NEON worker ----------- */
 #define NEON_BATCH 8192
 typedef struct {
    int worker_id, affinity_core;
    uint64_t blocks_done;
    double elapsed_s;
 } neon_args;
 static void *neon_worker(void *p)
 {
    neon_args *a = p;
    cpu_set_t cs; CPU_ZERO(&cs); CPU_SET(a->affinity_core, &cs);
    pthread_setaffinity_np(pthread_self(), sizeof(cs), &cs);
    uint64_t s = xs_init((uint64_t) a->worker_id * 0xc01dbeefULL);
    uint8_t *master = malloc((size_t) NEON_BATCH * SRC_BYTES);
    uint8_t *work   = malloc((size_t) NEON_BATCH * SRC_BYTES);
    uint8_t *dsts   = malloc((size_t) NEON_BATCH * DST_BYTES);
    int     *mxs    = malloc(NEON_BATCH * sizeof(int));
    for (int i = 0; i < NEON_BATCH; i++) {
        for (int j = 0; j < SRC_BYTES; j++)
            master[(size_t)i * SRC_BYTES + j] = (uint8_t)(xs_step(&s) & 0xff);
        mxs[i] = (int)(xs_step(&s) & 15);
    }
    pthread_barrier_wait(&g_start);
    double t0 = now_s();
    uint64_t done = 0;
    while (!g_stop) {
        memcpy(work, master, (size_t) NEON_BATCH * SRC_BYTES);
        for (int i = 0; i < NEON_BATCH; i++)
            ff_vp9_put_regular8_h_neon(
                dsts + (size_t)i * DST_BYTES, DST_W,
                work + (size_t)i * SRC_BYTES + 3, SRC_W,
                DST_H, mxs[i], 0);
        done += NEON_BATCH;
    }
    a->elapsed_s = now_s() - t0;
    a->blocks_done = done;
    free(master); free(work); free(dsts); free(mxs);
    return NULL;
 }
 /* --- QPU worker ----------- */
 typedef struct {
    int affinity_core, n_blocks;
    uint64_t blocks_done;
    double elapsed_s;
 } qpu_args;
 typedef struct {
    uint32_t n_blocks, dst_stride_u8, src_stride_u8, _pad;
 } push_consts;
 static void *qpu_worker(void *p)
 {
    qpu_args *a = p;
    cpu_set_t cs; CPU_ZERO(&cs); CPU_SET(a->affinity_core, &cs);
    pthread_setaffinity_np(pthread_self(), sizeof(cs), &cs);
    v3d_runner *r = v3d_runner_create();
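    /* Bail out of the whole process: returning here would skip the start
     * barrier and leave every other thread blocked in pthread_barrier_wait. */
    if (!r) { fprintf(stderr, "v3d_runner_create failed\n"); exit(1); }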
    int n_blocks = a->n_blocks;
    size_t meta_bytes = (size_t) n_blocks * 4 * sizeof(uint32_t);
    size_t src_bytes  = (size_t) n_blocks * SRC_BYTES;
    size_t dst_bytes  = (size_t) n_blocks * DST_BYTES;
    v3d_buffer buf_meta = {0}, buf_dst = {0}, buf_src = {0};
    v3d_runner_create_buffer(r, meta_bytes, &buf_meta);
    v3d_runner_create_buffer(r, dst_bytes,  &buf_dst);
    v3d_runner_create_buffer(r, src_bytes,  &buf_src);
    uint64_t s = 0xfeedfacecafebabeULL;
    uint8_t *master = malloc(src_bytes);
    for (size_t i = 0; i < src_bytes; i++) master[i] = (uint8_t)(xs_step(&s) & 0xff);
    memcpy(buf_src.mapped, master, src_bytes);
    uint32_t *meta = buf_meta.mapped;
    assert(DST_W >= 8); assert(SRC_W >= 15);
    for (int i = 0; i < n_blocks; i++) {
        meta[4*i + 0] = (uint32_t)((size_t)i * DST_BYTES);   /* dst_off */
        meta[4*i + 1] = (uint32_t)((size_t)i * SRC_BYTES);   /* src_off (RAW, no +3) */
        meta[4*i + 2] = (uint32_t)(xs_step(&s) & 15);        /* mx */
        meta[4*i + 3] = 0;
    }
    v3d_pipeline pipe = {0};
    v3d_runner_create_pipeline(r, "v3d_mc_8h.spv", 3, sizeof(push_consts), &pipe);
    v3d_buffer bufs[3] = { buf_meta, buf_dst, buf_src };
    v3d_runner_bind_buffers(r, &pipe, bufs, 3);
    const uint32_t bpw = 32;
    uint32_t gc = (uint32_t)((n_blocks + bpw - 1) / bpw);
    push_consts pc = { .n_blocks = (uint32_t) n_blocks,
                       .dst_stride_u8 = DST_W,
                       .src_stride_u8 = SRC_W };
    VkCommandBuffer cb = v3d_runner_alloc_cmdbuf(r);
    VkCommandBufferBeginInfo cbbi = { .sType = VK_STRUCTURE_TYPE_COMMAND_BUFFER_BEGIN_INFO };
    vkBeginCommandBuffer(cb, &cbbi);
    vkCmdBindPipeline(cb, VK_PIPELINE_BIND_POINT_COMPUTE, pipe.pipeline);
    vkCmdBindDescriptorSets(cb, VK_PIPELINE_BIND_POINT_COMPUTE,
                            pipe.layout, 0, 1, &pipe.desc_set, 0, NULL);
    vkCmdPushConstants(cb, pipe.layout, VK_SHADER_STAGE_COMPUTE_BIT,
                       0, sizeof(pc), &pc);
    vkCmdDispatch(cb, gc, 1, 1);
    vkEndCommandBuffer(cb);
    for (int i = 0; i < 5; i++) v3d_runner_submit_wait(r, cb);
    pthread_barrier_wait(&g_start);
    double t0 = now_s();
    uint64_t done = 0;
    while (!g_stop) {
        memset(buf_dst.mapped, 0, dst_bytes);
        v3d_runner_submit_wait(r, cb);
        done += n_blocks;
    }
    a->elapsed_s = now_s() - t0;
    a->blocks_done = done;
    free(master);
    v3d_runner_destroy_pipeline(r, &pipe);
    v3d_runner_destroy_buffer(r, &buf_src);
    v3d_runner_destroy_buffer(r, &buf_dst);
    v3d_runner_destroy_buffer(r, &buf_meta);
    v3d_runner_destroy(r);
    return NULL;
 }
 typedef struct { double duration_s; } timer_args;
 static void *timer_thread(void *p) {
    timer_args *a = p;
    pthread_barrier_wait(&g_start);
    double end = now_s() + a->duration_s;
    while (now_s() < end) {
        struct timespec ts = {0, 1000000}; nanosleep(&ts, NULL);
    }
    g_stop = 1;
    return NULL;
 }
 enum mode { MODE_NEON, MODE_QPU, MODE_MIXED };
 int main(int argc, char **argv)
 {
    enum mode mode = MODE_NEON;
    int n_neon = 4, qpu_core = 3, qpu_n_blocks = 65536;
    double duration = 8.0;
    static struct option opts[] = {
        {"mode",         required_argument, 0, 'm'},
        {"neon-threads", required_argument, 0, 'n'},
        {"qpu-core",     required_argument, 0, 'c'},
        {"qpu-blocks",   required_argument, 0, 'b'},
        {"duration",     required_argument, 0, 'd'},
        {0,0,0,0}
    };
    for (int c; (c = getopt_long(argc, argv, "m:n:c:b:d:", opts, 0)) != -1;) {
        switch (c) {
        case 'm':
            if      (!strcmp(optarg, "neon-only")) mode = MODE_NEON;
            else if (!strcmp(optarg, "qpu-only"))  mode = MODE_QPU;
            else if (!strcmp(optarg, "mixed"))     mode = MODE_MIXED;
            else { fprintf(stderr, "bad mode\n"); return 2; }
            break;
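        case 'n':
            n_neon = atoi(optarg);
            /* neon_tids[] and n_args[] below hold at most 16 entries. */
            if (n_neon < 0 || n_neon > 16) { fprintf(stderr, "bad neon-threads\n"); return 2; }
            break;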
        case 'c': qpu_core = atoi(optarg); break;
        case 'b': qpu_n_blocks = atoi(optarg); break;
        case 'd': duration = atof(optarg); break;
        default: return 2;
        }
    }
    int has_qpu  = (mode == MODE_QPU || mode == MODE_MIXED);
    int has_neon = (mode == MODE_NEON || mode == MODE_MIXED);
    int n_workers = (has_neon ? n_neon : 0) + (has_qpu ? 1 : 0);
    int barrier_count = n_workers + 1 + 1;   /* workers + timer thread + main thread */
    printf("=== M4''' concurrent MC bench ===\n");
    printf("  mode: %s, neon: %d, qpu: core %d / %d blocks, %.1fs\n",
           mode == MODE_NEON ? "neon-only" : mode == MODE_QPU ? "qpu-only" : "mixed",
           has_neon ? n_neon : 0,
           has_qpu ? qpu_core : -1,
           has_qpu ? qpu_n_blocks : 0,
           duration);
    pthread_barrier_init(&g_start, NULL, barrier_count);
    pthread_t timer_tid; timer_args ta = { .duration_s = duration };
    pthread_create(&timer_tid, NULL, timer_thread, &ta);
    pthread_t neon_tids[16] = {0};
    neon_args n_args[16] = {0};
    if (has_neon) {
        for (int i = 0; i < n_neon; i++) {
            n_args[i] = (neon_args){ .worker_id = i, .affinity_core = i };
            pthread_create(&neon_tids[i], NULL, neon_worker, &n_args[i]);
        }
    }
    pthread_t qpu_tid = 0;
    qpu_args q_args = {0};
    if (has_qpu) {
        q_args = (qpu_args){ .affinity_core = qpu_core, .n_blocks = qpu_n_blocks };
        pthread_create(&qpu_tid, NULL, qpu_worker, &q_args);
    }
    pthread_barrier_wait(&g_start);
    pthread_join(timer_tid, NULL);
    if (has_neon) for (int i = 0; i < n_neon; i++) pthread_join(neon_tids[i], NULL);
    if (has_qpu)  pthread_join(qpu_tid, NULL);
    uint64_t total = 0; double max_e = 0;
    if (has_neon) {
        printf("NEON per-thread:\n");
        for (int i = 0; i < n_neon; i++) {
            double mbs = n_args[i].blocks_done / n_args[i].elapsed_s / 1e6;
            printf("  core %d: %.3f Mblock/s\n", n_args[i].affinity_core, mbs);
            total += n_args[i].blocks_done;
            if (n_args[i].elapsed_s > max_e) max_e = n_args[i].elapsed_s;
        }
    }
    if (has_qpu) {
        double mbs = q_args.blocks_done / q_args.elapsed_s / 1e6;
        printf("QPU (core %d): %.3f Mblock/s\n", q_args.affinity_core, mbs);
        total += q_args.blocks_done;
        if (q_args.elapsed_s > max_e) max_e = q_args.elapsed_s;
    }
    double total_mbs = total / max_e / 1e6;
    printf("\n=== AGGREGATE ===\n");
    printf("  Mblock/s        : %.3f\n", total_mbs);
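    /* Floor derivation: a 1920x1080 frame holds (1920/8) * (1080/8) =
     * 240 * 135 = 32400 8x8 blocks; at 30 fps that is 972000 blocks/s,
     * i.e. 0.972 Mblock/s. Assumes one filter pass per block. */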
    printf("  30fps@1080p floor: 0.972 Mblock/s — %.1fx margin\n",
           total_mbs / 0.972);
    pthread_barrier_destroy(&g_start);
    return 0;
 }
@@ -0,0 +1,220 @@
 /*
 * Cycle 3 Phase 3 — NEON M3''' baseline for VP9 8-tap regular
 * horizontal MC interpolation, 8×8 block.
 *
 * Reports:
 *   M1'''_c (correctness): C-ref ↔ NEON bit-exact rate, N random
 *                          8×8 blocks with random source pixels and
 *                          random subpel phase mx ∈ [0, 15]
 *   M3'''   (throughput):  NEON sustained Mblock/s, single-thread,
 *                          time-based
 *
 * License: LGPL-2.1+ (statically links FFmpeg NEON snapshot).
 */
 #define _POSIX_C_SOURCE 200809L
 #include <stdio.h>
 #include <stdlib.h>
 #include <stdint.h>
 #include <stddef.h>
 #include <string.h>
 #include <time.h>
 #include <getopt.h>
 extern void daedalus_vp9_put_regular_8h_ref(
    uint8_t *dst, ptrdiff_t dst_stride,
    const uint8_t *src, ptrdiff_t src_stride,
    int h, int mx, int my);
 extern void ff_vp9_put_regular8_h_neon(
    uint8_t *dst, ptrdiff_t dst_stride,
    const uint8_t *src, ptrdiff_t src_stride,
    int h, int mx, int my);
 /* RNG ------------------------------------------------------------ */
 static uint64_t xs_state;
 static inline uint64_t xs(void) {
    uint64_t x = xs_state;
    x ^= x << 13; x ^= x >> 7; x ^= x << 17;
    return xs_state = x;
 }
 /* Block layout: each block gets its own 8×16 source buffer + 8×8 dst.
 *   - source buffer is 16 cols wide; the filter is called with
 *     src = block_src + 3, so it reads src[0-3 .. 7+4] = cols
 *     [0..14] of the 16-col buffer. col 15 is unused padding.
 *   - dst is 8 cols × 8 rows.
 */
 #define SRC_W 16
 #define SRC_H 8
 #define DST_W 8
 #define DST_H 8
 #define SRC_BYTES (SRC_H * SRC_W)  /* 128 */
 #define DST_BYTES (DST_H * DST_W)  /* 64 */
 static void gen_src(uint8_t *buf)
 {
    for (int i = 0; i < SRC_BYTES; i++)
        buf[i] = (uint8_t)(xs() & 0xff);
 }
 static double now_seconds(void)
 {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC_RAW, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
 }
 /* M1'''_c correctness gate -------------------------------------- */
 static int correctness_check(uint64_t seed, int n_blocks)
 {
    xs_state = seed ? seed : 0xabcdef1234567890ULL;
    int mismatches = 0;
    uint8_t src[SRC_BYTES];
    uint8_t dst_a[DST_BYTES], dst_b[DST_BYTES];
    int mx_hist[16] = {0};
    for (int i = 0; i < n_blocks; i++) {
        gen_src(src);
        int mx = (int)(xs() & 15);
        mx_hist[mx]++;
        memset(dst_a, 0, DST_BYTES);
        memset(dst_b, 0, DST_BYTES);
        daedalus_vp9_put_regular_8h_ref(dst_a, DST_W, src + 3, SRC_W, DST_H, mx, 0);
        ff_vp9_put_regular8_h_neon  (dst_b, DST_W, src + 3, SRC_W, DST_H, mx, 0);
        if (memcmp(dst_a, dst_b, DST_BYTES) != 0) {
            if (mismatches < 3) {
                fprintf(stderr, "MISMATCH block %d mx=%d:\n", i, mx);
                fprintf(stderr, "  ref:");
                for (int r = 0; r < 8; r++) {
                    fprintf(stderr, "\n    r%d ", r);
                    for (int c = 0; c < 8; c++) fprintf(stderr, "%3u ", dst_a[r*8+c]);
                }
                fprintf(stderr, "\n  neon:");
                for (int r = 0; r < 8; r++) {
                    fprintf(stderr, "\n    r%d ", r);
                    for (int c = 0; c < 8; c++) fprintf(stderr, "%3u ", dst_b[r*8+c]);
                }
                fprintf(stderr, "\n");
            }
            mismatches++;
        }
    }
    printf("M1'''_c correctness: %d / %d blocks bit-exact (%.4f%%)\n",
           n_blocks - mismatches, n_blocks,
           100.0 * (n_blocks - mismatches) / n_blocks);
    /* mx histogram — confirms all 16 phases get exercised. */
    int min_mx = mx_hist[0], max_mx = mx_hist[0];
    for (int i = 1; i < 16; i++) {
        if (mx_hist[i] < min_mx) min_mx = mx_hist[i];
        if (mx_hist[i] > max_mx) max_mx = mx_hist[i];
    }
    printf("  mx phase coverage: min=%d max=%d (16 phases sampled)\n",
           min_mx, max_mx);
    return mismatches;
 }
 /* M3''' throughput ---------------------------------------------- */
 static void throughput_neon(uint64_t seed, int n_blocks, double duration_s)
 {
    xs_state = seed ? seed : 0xdeadbeef12345678ULL;
    uint8_t *master_src = malloc((size_t) n_blocks * SRC_BYTES);
    uint8_t *work_src   = malloc((size_t) n_blocks * SRC_BYTES);
    uint8_t *dsts       = malloc((size_t) n_blocks * DST_BYTES);
    int     *mxs        = malloc(n_blocks * sizeof(int));
    if (!master_src || !work_src || !dsts || !mxs) { fprintf(stderr, "alloc fail\n"); exit(1); }
    for (int i = 0; i < n_blocks; i++) {
        gen_src(master_src + (size_t)i * SRC_BYTES);
        mxs[i] = (int)(xs() & 15);
    }
    /* Warm. */
    memcpy(work_src, master_src, (size_t) n_blocks * SRC_BYTES);
    for (int i = 0; i < n_blocks; i++)
        ff_vp9_put_regular8_h_neon(dsts + (size_t)i * DST_BYTES, DST_W,
                                   work_src + (size_t)i * SRC_BYTES + 3, SRC_W,
                                   DST_H, mxs[i], 0);
    double t0 = now_seconds();
    double t_end = t0 + duration_s;
    uint64_t done = 0;
    while (now_seconds() < t_end) {
        memcpy(work_src, master_src, (size_t) n_blocks * SRC_BYTES);
        for (int i = 0; i < n_blocks; i++)
            ff_vp9_put_regular8_h_neon(dsts + (size_t)i * DST_BYTES, DST_W,
                                       work_src + (size_t)i * SRC_BYTES + 3, SRC_W,
                                       DST_H, mxs[i], 0);
        done += n_blocks;
    }
    double elapsed = now_seconds() - t0;
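    /* Setup-only subtraction: re-time the per-batch memcpy alone, for the
     * same number of batches, and subtract it so the reported figure
     * covers only the filter calls. */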
    int setup_iters = (int) (done / n_blocks);
    double s0 = now_seconds();
    for (int it = 0; it < setup_iters; it++)
        memcpy(work_src, master_src, (size_t) n_blocks * SRC_BYTES);
    double s1 = now_seconds();
    double kernel_seconds = elapsed - (s1 - s0);
    double mbps = done / kernel_seconds / 1e6;
    printf("M3''' NEON throughput:\n");
    printf("  blocks/batch:    %d\n", n_blocks);
    printf("  batches done:    %d\n", setup_iters);
    printf("  total blocks:    %llu\n", (unsigned long long) done);
    printf("  elapsed (kernel)=%.6f s\n", kernel_seconds);
    printf("  elapsed (setup) =%.6f s\n", s1 - s0);
    printf("  throughput      = %.3f Mblock/s\n", mbps);
    printf("  per-block       = %.1f ns\n", kernel_seconds / done * 1e9);
    /* 1080p: 32400 blocks/frame */
    printf("  equiv 1080p     = %.1f FPS  (32400 blocks/frame)\n",
           mbps * 1e6 / 32400.0);
    free(master_src); free(work_src); free(dsts); free(mxs);
 }
 int main(int argc, char **argv)
 {
    int n_blocks = 65536;
    double duration = 5.0;
    uint64_t seed = 0;
    int do_correctness = 1;
    static struct option opts[] = {
        {"blocks",         required_argument, 0, 'b'},
        {"duration",       required_argument, 0, 'd'},
        {"seed",           required_argument, 0, 's'},
        {"no-correctness", no_argument,       0, 'C'},
        {0,0,0,0}
    };
    for (int c; (c = getopt_long(argc, argv, "b:d:s:C", opts, 0)) != -1;) {
        switch (c) {
        case 'b': n_blocks = atoi(optarg); break;
        case 'd': duration = atof(optarg); break;
        case 's': seed = strtoull(optarg, 0, 0); break;
        case 'C': do_correctness = 0; break;
        default: return 2;
        }
    }
    if (do_correctness) {
        printf("=== M1'''_c bit-exact (10000 random blocks) ===\n");
        if (correctness_check(seed, 10000) != 0) {
            fprintf(stderr, "REFUSING to measure throughput on a broken kernel.\n");
            return 1;
        }
        printf("\n");
    }
    printf("=== M3''' NEON throughput ===\n");
    throughput_neon(seed, n_blocks, duration);
    return 0;
 }
@@ -0,0 +1,303 @@
 /*
 * Cycle 3 Phase 6 — QPU bench for VP9 8-tap "regular" subpel filter,
 * horizontal, 8-wide output on V3D 7.1.
 *
 * Reports:
 *   M1''' (correctness): QPU output vs C reference, N blocks across
 *                        all 16 mx phases
 *   M2''' (throughput):  QPU sustained Mblock/s
 *
 * Per k3_mc_phase4.md §5 (revised per phase5''' findings 4 + 6):
 *   - src_off is the RAW block base (no +3 shift)
 *   - assert(dst_stride_u8 >= 8 && src_stride_u8 >= 15)
 *
 * License: BSD-2-Clause.
 */
 #define _POSIX_C_SOURCE 200809L
 #include <stdio.h>
 #include <stdlib.h>
 #include <stdint.h>
 #include <stddef.h>
 #include <string.h>
 #include <assert.h>
 #include <time.h>
 #include <getopt.h>
 #include <vulkan/vulkan.h>
 #include "v3d_runner.h"
 extern void daedalus_vp9_put_regular_8h_ref(
    uint8_t *dst, ptrdiff_t dst_stride,
    const uint8_t *src, ptrdiff_t src_stride,
    int h, int mx, int my);
 /* Per-block layout: src buffer 8 rows × 16 cols = 128 bytes. The
 * C bench's src+3 convention: NEON/C ref is called with
 * `src = block_base + 3, src_stride = 16`. The shader's src_off
 * is the RAW block_base (no +3 shift), and the shader reads
 * s[0..14] from src_off + row*stride. Together this means:
 *   shader's s[k] for k=0..14 = master_src[block_base + row*16 + k]
 *   C ref's `src[x+k-3]` for x=0..7, k=0..7 with `src = block_base+3`
 *     = master_src[block_base + row*16 + (x+k)]
 *     = master_src[block_base + row*16 + (0..14)]
 * which is exactly what the shader reads. */
 #define SRC_W 16
 #define SRC_H 8
 #define DST_W 8
 #define DST_H 8
 #define SRC_BYTES (SRC_H * SRC_W)
 #define DST_BYTES (DST_H * DST_W)
 static uint64_t xs_state;
 static inline uint64_t xs(void) {
    uint64_t x = xs_state;
    x ^= x << 13; x ^= x >> 7; x ^= x << 17;
    return xs_state = x;
 }
 static void gen_src(uint8_t *b) {
    for (int i = 0; i < SRC_BYTES; i++) b[i] = (uint8_t)(xs() & 0xff);
 }
 static double now_seconds(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC_RAW, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
 }
 typedef struct {
    uint32_t n_blocks;
    uint32_t dst_stride_u8;
    uint32_t src_stride_u8;
    uint32_t _pad;
 } push_consts;
 int main(int argc, char **argv)
 {
    int n_blocks = 65536;
    int iters = 100;
    uint64_t seed = 0;
    int verify_only = 0;
    const char *spv_path = "v3d_mc_8h.spv";
    static struct option opts[] = {
        {"blocks",      required_argument, 0, 'b'},
        {"iters",       required_argument, 0, 'i'},
        {"seed",        required_argument, 0, 's'},
        {"spv",         required_argument, 0, 'S'},
        {"verify-only", no_argument,       0, 'V'},
        {0,0,0,0}
    };
    for (int c; (c = getopt_long(argc, argv, "b:i:s:S:V", opts, 0)) != -1;) {
        switch (c) {
        case 'b': n_blocks    = atoi(optarg); break;
        case 'i': iters       = atoi(optarg); break;
        case 's': seed        = strtoull(optarg, 0, 0); break;
        case 'S': spv_path    = optarg; break;
        case 'V': verify_only = 1; break;
        default: return 2;
        }
    }
    xs_state = seed ? seed : 0xabcdef1234567890ULL;
    v3d_runner *r = v3d_runner_create();
    if (!r) { fprintf(stderr, "v3d_runner_create failed\n"); return 1; }
    printf("=== v3d MC 8h bench ===\n");
    printf("  device: %s\n", v3d_runner_device_name(r));
    printf("  n_blocks: %d  iters: %d\n", n_blocks, iters);
    /* Buffers: meta + dst + src, all blocks contiguous. */
    size_t meta_bytes = (size_t) n_blocks * 4 * sizeof(uint32_t);
    size_t src_bytes  = (size_t) n_blocks * SRC_BYTES;
    size_t dst_bytes  = (size_t) n_blocks * DST_BYTES;
    v3d_buffer buf_meta = {0}, buf_dst = {0}, buf_src = {0};
    if (v3d_runner_create_buffer(r, meta_bytes, &buf_meta)) return 1;
    if (v3d_runner_create_buffer(r, dst_bytes,  &buf_dst))  return 1;
    if (v3d_runner_create_buffer(r, src_bytes,  &buf_src))  return 1;
    uint8_t *master_src = malloc(src_bytes);
    uint8_t *expected   = malloc(dst_bytes);
    int     *mxs        = malloc(n_blocks * sizeof(int));
    if (!master_src || !expected || !mxs) { fprintf(stderr, "alloc\n"); return 1; }
    for (int i = 0; i < n_blocks; i++) {
        gen_src(master_src + (size_t)i * SRC_BYTES);
        mxs[i] = (int)(xs() & 15);
    }
    /* Build C-ref expected. C ref takes `src + 3, src_stride = SRC_W`. */
    memset(expected, 0, dst_bytes);
    for (int i = 0; i < n_blocks; i++) {
        daedalus_vp9_put_regular_8h_ref(
            expected + (size_t)i * DST_BYTES, DST_W,
            master_src + (size_t)i * SRC_BYTES + 3, SRC_W,
            DST_H, mxs[i], 0);
    }
    /* Populate GPU buffers. Contracts (phase4 §5) enforced via asserts. */
    uint32_t dst_stride_u8 = DST_W;
    uint32_t src_stride_u8 = SRC_W;
    assert(dst_stride_u8 >= 8 && "phase4 §5 contract 1");
    assert(src_stride_u8 >= 15 && "phase4 §5 contract 2");
    uint32_t *meta = (uint32_t *) buf_meta.mapped;
    for (int i = 0; i < n_blocks; i++) {
        /* src_off: RAW block base. NO +3 shift. (phase5''' finding 4) */
        uint32_t src_off = (uint32_t)((size_t)i * SRC_BYTES);
        uint32_t dst_off = (uint32_t)((size_t)i * DST_BYTES);
        meta[4*i + 0] = dst_off;
        meta[4*i + 1] = src_off;
        meta[4*i + 2] = (uint32_t) mxs[i];
        meta[4*i + 3] = 0;
    }
    memcpy(buf_src.mapped, master_src, src_bytes);
    memset(buf_dst.mapped, 0, dst_bytes);
    /* Pipeline. */
    v3d_pipeline pipe = {0};
    if (v3d_runner_create_pipeline(r, spv_path,
                                   /*n_ssbos=*/3,
                                   /*push_const_size=*/sizeof(push_consts),
                                   &pipe)) return 1;
    v3d_buffer bind_bufs[3] = { buf_meta, buf_dst, buf_src };
    if (v3d_runner_bind_buffers(r, &pipe, bind_bufs, 3)) return 1;
    const uint32_t blocks_per_wg = 32;
    uint32_t group_count_x = (uint32_t)((n_blocks + blocks_per_wg - 1) / blocks_per_wg);
    printf("  dispatch: %u WGs × 256 invocations = %u blocks (rounded up from %d)\n",
           group_count_x, group_count_x * blocks_per_wg, n_blocks);
    push_consts pc = {
        .n_blocks      = (uint32_t) n_blocks,
        .dst_stride_u8 = dst_stride_u8,
        .src_stride_u8 = src_stride_u8,
        ._pad = 0,
    };
    VkCommandBuffer cb = v3d_runner_alloc_cmdbuf(r);
    VkCommandBufferBeginInfo cbbi = { .sType = VK_STRUCTURE_TYPE_COMMAND_BUFFER_BEGIN_INFO };
    vkBeginCommandBuffer(cb, &cbbi);
    vkCmdBindPipeline(cb, VK_PIPELINE_BIND_POINT_COMPUTE, pipe.pipeline);
    vkCmdBindDescriptorSets(cb, VK_PIPELINE_BIND_POINT_COMPUTE,
                            pipe.layout, 0, 1, &pipe.desc_set, 0, NULL);
    vkCmdPushConstants(cb, pipe.layout, VK_SHADER_STAGE_COMPUTE_BIT,
                       0, sizeof(pc), &pc);
    vkCmdDispatch(cb, group_count_x, 1, 1);
    vkEndCommandBuffer(cb);
    /* --- M1''' bit-exact --- */
    printf("\n=== M1''': QPU vs C reference bit-exact ===\n");
    memset(buf_dst.mapped, 0, dst_bytes);
    if (v3d_runner_submit_wait(r, cb)) return 1;
    int mismatch_blocks = 0;
    int total_byte_diffs = 0;
    int prints = 0;
    for (int i = 0; i < n_blocks; i++) {
        const uint8_t *q = (uint8_t *) buf_dst.mapped + (size_t)i * DST_BYTES;
        const uint8_t *e = expected + (size_t)i * DST_BYTES;
        if (memcmp(q, e, DST_BYTES) != 0) {
            int diffs = 0;
            for (int j = 0; j < DST_BYTES; j++) if (q[j] != e[j]) diffs++;
            total_byte_diffs += diffs;
            if (prints < 3) {
                fprintf(stderr, "MISMATCH block %d mx=%d: %d/64 bytes differ\n",
                        i, mxs[i], diffs);
                fprintf(stderr, "  ref:");
                for (int r0 = 0; r0 < 8; r0++) {
                    fprintf(stderr, "\n    r%d ", r0);
                    for (int c = 0; c < 8; c++) fprintf(stderr, "%3u ", e[r0*8+c]);
                }
                fprintf(stderr, "\n  qpu:");
                for (int r0 = 0; r0 < 8; r0++) {
                    fprintf(stderr, "\n    r%d ", r0);
                    for (int c = 0; c < 8; c++) fprintf(stderr, "%3u ", q[r0*8+c]);
                }
                fprintf(stderr, "\n");
                prints++;
            }
            mismatch_blocks++;
        }
    }
    printf("  blocks bit-exact: %d / %d (%.4f%%)\n",
           n_blocks - mismatch_blocks, n_blocks,
           100.0 * (n_blocks - mismatch_blocks) / n_blocks);
    printf("  total byte diffs: %d / %zu (%.4f%%)\n",
           total_byte_diffs, (size_t) n_blocks * DST_BYTES,
           100.0 * total_byte_diffs / ((double) n_blocks * DST_BYTES));
    if (mismatch_blocks > 0) {
        fprintf(stderr, "REFUSING to measure throughput on a broken kernel.\n");
        v3d_runner_destroy_pipeline(r, &pipe);
        v3d_runner_destroy_buffer(r, &buf_src);
        v3d_runner_destroy_buffer(r, &buf_dst);
        v3d_runner_destroy_buffer(r, &buf_meta);
        v3d_runner_destroy(r);
        free(master_src); free(expected); free(mxs);
        return 1;
    }
    if (verify_only) {
        v3d_runner_destroy_pipeline(r, &pipe);
        v3d_runner_destroy_buffer(r, &buf_src);
        v3d_runner_destroy_buffer(r, &buf_dst);
        v3d_runner_destroy_buffer(r, &buf_meta);
        v3d_runner_destroy(r);
        free(master_src); free(expected); free(mxs);
        return 0;
    }
    /* --- M2''' throughput --- */
    printf("\n=== M2''': QPU throughput ===\n");
    /* Warm-up: 10 untimed dispatches before the measured loop. */
    for (int i = 0; i < 10; i++) {
        memset(buf_dst.mapped, 0, dst_bytes);
        if (v3d_runner_submit_wait(r, cb)) return 1;
    }
    double t0 = now_seconds();
    for (int i = 0; i < iters; i++) {
        memset(buf_dst.mapped, 0, dst_bytes);
        if (v3d_runner_submit_wait(r, cb)) return 1;
    }
    double t1 = now_seconds();
    /* Time the memset alone so its cost can be subtracted from the dispatch loop. */
    double s0 = now_seconds();
    for (int i = 0; i < iters; i++) memset(buf_dst.mapped, 0, dst_bytes);
    double s1 = now_seconds();
    double kernel_seconds = (t1 - t0) - (s1 - s0);
    double total_blocks   = (double) n_blocks * iters;
    double mbps           = total_blocks / kernel_seconds / 1e6;
    printf("  blocks/dispatch: %d\n", n_blocks);
    printf("  iters:           %d\n", iters);
    printf("  total blocks:    %.0f\n", total_blocks);
    printf("  elapsed (kernel)=%.6f s\n", kernel_seconds);
    printf("  elapsed (setup) =%.6f s\n", s1 - s0);
    printf("  M2''' throughput = %.3f Mblock/s\n", mbps);
    printf("  per-block        = %.1f ns\n", kernel_seconds / total_blocks * 1e9);
    printf("  per-dispatch     = %.1f us\n", kernel_seconds / iters * 1e6);
    double M3 = 20.997;   /* from k3_mc_phase3.md */
    double R  = mbps / M3;
    printf("\n  Cycle 3 NEON M3''' = %.3f Mblock/s\n", M3);
    printf("  R''' = M2'''/M3''' = %.3f\n", R);
    if      (R >= 1.0) printf("  decision band      = GREEN: QPU beats NEON in isolation\n");
    else if (R >= 0.5) printf("  decision band      = YELLOW: M4''' decides\n");
    else if (R >= 0.1) printf("  decision band      = ORANGE: M4''' may still rescue\n");
    else               printf("  decision band      = RED: structural mismatch\n");
    /* 30fps@1080p floor check (per project_30fps_floor_is_fine.md):
     * 1920x1080 / (8x8) = 240 * 135 = 32400 blocks per frame. */
    double mblocks_per_1080p = 32400.0 * 30.0 / 1e6;
    printf("\n  30fps@1080p floor : %.3f Mblock/s (32400 blocks × 30 fps)\n",
           mblocks_per_1080p);
    printf("  isolation margin  : %.1fx over 30fps floor\n",
           mbps / mblocks_per_1080p);
    v3d_runner_destroy_pipeline(r, &pipe);
    v3d_runner_destroy_buffer(r, &buf_src);
    v3d_runner_destroy_buffer(r, &buf_dst);
    v3d_runner_destroy_buffer(r, &buf_meta);
    v3d_runner_destroy(r);
    free(master_src); free(expected); free(mxs);
    return 0;
 }
@@ -0,0 +1,72 @@
 /*
 * Standalone bit-exact C reference for VP9 8-tap "regular" subpel
 * filter, horizontal direction, 8-pixel-wide output. Transcribed
 * from FFmpeg's libavcodec/vp9dsp_template.c FILTER_8TAP macro
 * (vendored at external/ffmpeg-snapshot/). 8-bit pixels only.
 *
 * Filter coefficients embedded inline (REGULAR filter only, all 16
 * subpel phases). Same values as ff_vp9_subpel_filters[1][mx] in
 * external/ffmpeg-snapshot/libavcodec/vp9_subpel_filters_table.c.
 *
 * License: LGPL-2.1-or-later.
 *
 * Spec source: VP9 specification §8.5.1 — subpel motion compensation.
 */
 #include <stdint.h>
 #include <stddef.h>
 static const int16_t vp9_8tap_regular_filters[16][8] = {
    {  0,  0,   0, 128,   0,   0,  0,  0 },
    {  0,  1,  -5, 126,   8,  -3,  1,  0 },
    { -1,  3, -10, 122,  18,  -6,  2,  0 },
    { -1,  4, -13, 118,  27,  -9,  3, -1 },
    { -1,  4, -16, 112,  37, -11,  4, -1 },
    { -1,  5, -18, 105,  48, -14,  4, -1 },
    { -1,  5, -19,  97,  58, -16,  5, -1 },
    { -1,  6, -19,  88,  68, -18,  5, -1 },
    { -1,  6, -19,  78,  78, -19,  6, -1 },
    { -1,  5, -18,  68,  88, -19,  6, -1 },
    { -1,  5, -16,  58,  97, -19,  5, -1 },
    { -1,  4, -14,  48, 105, -18,  5, -1 },
    { -1,  4, -11,  37, 112, -16,  4, -1 },
    { -1,  3,  -9,  27, 118, -13,  4, -1 },
    {  0,  2,  -6,  18, 122, -10,  3, -1 },
    {  0,  1,  -3,   8, 126,  -5,  1,  0 },
 };
 static inline uint8_t clip_u8(int x)
 {
    return (uint8_t)(x > 255 ? 255 : x < 0 ? 0 : x);
 }
 /*
 * 8xh horizontal 8-tap "put" (non-averaging). Width hard-coded 8; height is `h`.
 * `src` must point at the row-0 output-column-0 source pixel; valid
 * source memory must extend src[r*src_stride + (-3..+11)] for r=0..h-1.
 * `dst` is written at dst[r*dst_stride + 0..7] for r=0..h-1.
 *
 * Matches ff_vp9_put_regular8_h_neon byte-for-byte on 8-bit input.
 */
 void daedalus_vp9_put_regular_8h_ref(uint8_t *dst, ptrdiff_t dst_stride,
                                     const uint8_t *src, ptrdiff_t src_stride,
                                     int h, int mx, int my)
 {
    (void) my;   /* horizontal-only filter ignores y phase */
    const int16_t *F = vp9_8tap_regular_filters[mx & 15];
    for (int r = 0; r < h; r++) {
        for (int x = 0; x < 8; x++) {
            int sum = F[0] * (int) src[x - 3]
                    + F[1] * (int) src[x - 2]
                    + F[2] * (int) src[x - 1]
                    + F[3] * (int) src[x + 0]
                    + F[4] * (int) src[x + 1]
                    + F[5] * (int) src[x + 2]
                    + F[6] * (int) src[x + 3]
                    + F[7] * (int) src[x + 4];
            /* Relies on arithmetic >> for negative sums (implementation-defined in C). */
            dst[x] = clip_u8((sum + 64) >> 7);
        }
        dst += dst_stride;
        src += src_stride;
    }
 }