Cycle 3 (MC interpolation) closure: M1'''=100%, R'''=0.067 RED, M4=-19.5%

Third daedalus-fourier kernel — VP9 8-tap regular subpel filter,
horizontal direction, 8-wide output. Multiply-heavy by design to
stress V3D's no-DP4A deficit. Full cycle Phase 1-7 + M4'''.

Phase 5''' second-model review delivered cleanly — caught 1 RED
bug pre-implementation (src_off off-by-3 indexing convention) and
2 YELLOW gaps (assert MUST language, shaderdb filter-LUT gate).
Without the review, M1''' would have failed silently on first run
with cryptic "high-index source pixels wrong" symptoms.

Phase 6 v1 first-light: M1''' 100.0000% bit-exact (65536/65536
blocks across all 16 mx phases). Phase 5''' filter-LUT prediction
materialised exactly: 197 uniforms (gate was 144), 2 threads (down
from cycle-2's 4 due to register pressure).

Performance:

  M2''' = 1.413 Mblock/s     (707.9 ns/block)
  M3''' = 20.997 Mblock/s    (NEON baseline phase3)
  R'''  = 0.067              (RED band — structural mismatch)
  shaderdb: 488 inst, 2 threads, 197 uniforms, 25 max-temps, 0 spills

M4''' concurrent matrix (8s windows):

  NEON 1-core           14.479 Mblock/s
  NEON 4-core           15.248 Mblock/s   <- baseline (compute-bound,
                                              not bandwidth-saturated
                                              like cycles 1+2!)
  QPU only               1.380 Mblock/s
  MIXED NEON-3 + QPU    12.277 Mblock/s   <- -19.5% (FAIL gate)
  MIXED NEON-4 + QPU    12.158 Mblock/s   <- -20.3%

NEW cross-cycle finding (Phase 9 lesson 2): compute-bound CPU
workloads make the QPU-offload story collapse. Cycles 1+2 were
bandwidth-saturated (4-core scaling 0.56-0.82x of 1-core), so
freeing a CPU core via QPU offload added throughput. Cycle 3 MC
is compute-bound (4-core scaling 1.05x of 1-core — near-linear),
no free cycles to free. QPU contribution (0.45 Mblock/s in
contention) doesn't compensate for losing 1 NEON core delivering
~3.8 Mblock/s.

But 30fps@1080p floor: PASS in every config (1.4x to 15.7x
isolation margin). Per project_30fps_floor_is_fine.md, user-facing
test never fails — daily YouTube playback works fine on any CPU/QPU
split.

DEPLOYMENT RECIPE for higgs (cycle 3 confirmed split):

  IDCT (k1)  -> QPU   (R=0.92, +7% mixed, frees CPU core)
  LPF  (k2)  -> QPU   (R=0.41, +7% mixed, frees CPU core)
  MC   (k3)  -> CPU   (R=0.067, -19.5% mixed — stays on CPU)
  Entropy    -> CPU   (structurally serial)

Mixed-substrate deployment, not "QPU does everything". Realistic for
higgs: entropy + MC on 2-3 ARM cores; IDCT + LPF dispatched to QPU
concurrently; 1-2 ARM cores left for vscode etc.

New artifacts:
- src/v3d_mc_8h.comp               — GLSL kernel
- tests/vp9_mc_ref.c               — standalone C ref (REGULAR filter
                                     embedded; clean transcription)
- tests/bench_neon_mc.c            — M1'''_c + M3''' bench
- tests/bench_v3d_mc.c             — M1''' + M2''' bench with contract
                                     asserts + 30fps margin display
- tests/bench_concurrent_mc.c      — M4''' pthread bench
- external/ffmpeg-snapshot/libavcodec/aarch64/vp9mc_neon.S    (vendored)
- external/ffmpeg-snapshot/libavcodec/vp9_subpel_filters_table.c
                                     (hand-extracted; provides
                                      ff_vp9_subpel_filters symbol
                                      without dragging in full vp9dsp.c)
- docs/k3_mc_phase{1,2,3,4,5,7}.md — full cycle documentation

Memory updates: project_30fps_floor_is_fine.md (user's 30fps target
recalibration), MEMORY.md index updated.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 12:51:43 +00:00
parent 36eca40ff2
commit 356e446a49
15 changed files with 2545 additions and 1 deletions
+53 -1
@@ -55,6 +55,19 @@ set_source_files_properties(${FFASM_LPF_SOURCES} PROPERTIES
COMPILE_OPTIONS "${FFASM_FLAGS}"
LANGUAGE ASM)
# Cycle 3 — VP9 MC interpolation NEON source + filter coefficient table
# (vendored 2026-05-18). The .c table provides ff_vp9_subpel_filters
# symbol which vp9mc_neon.S references via movrel.
set(FFASM_MC_SOURCES
${FFSNAP}/libavcodec/aarch64/vp9mc_neon.S
)
set(FFC_MC_SOURCES
${FFSNAP}/libavcodec/vp9_subpel_filters_table.c
)
set_source_files_properties(${FFASM_MC_SOURCES} PROPERTIES
COMPILE_OPTIONS "${FFASM_FLAGS}"
LANGUAGE ASM)
# Tell CMake/gas to preprocess .S sources.
set_source_files_properties(${FFASM_SOURCES} PROPERTIES
COMPILE_OPTIONS "${FFASM_FLAGS}"
@@ -76,6 +89,15 @@ add_executable(bench_neon_lpf
${FFASM_LPF_SOURCES}
)
target_compile_options(bench_neon_lpf PRIVATE -O3 -march=armv8-a+simd)
# Cycle 3 — VP9 MC interpolation NEON baseline.
add_executable(bench_neon_mc
tests/bench_neon_mc.c
tests/vp9_mc_ref.c
${FFASM_MC_SOURCES}
${FFC_MC_SOURCES}
)
target_compile_options(bench_neon_mc PRIVATE -O3 -march=armv8-a+simd)
# bench_neon_idct doesn't need vulkan/drm — pure CPU baseline.
# ---- Vulkan dispatch-overhead microbench (next chunk) ----------------------
@@ -125,7 +147,18 @@ if (DAEDALUS_BUILD_VULKAN)
VERBATIM
)
add_custom_target(daedalus_shaders ALL DEPENDS ${NOOP_SPV} ${IDCT8_SPV} ${LPF_SPV})
set(MC_SPV ${CMAKE_BINARY_DIR}/v3d_mc_8h.spv)
add_custom_command(
OUTPUT ${MC_SPV}
COMMAND ${GLSLANG_VALIDATOR} -V --target-env vulkan1.3
-o ${MC_SPV}
${CMAKE_SOURCE_DIR}/src/v3d_mc_8h.comp
DEPENDS ${CMAKE_SOURCE_DIR}/src/v3d_mc_8h.comp
COMMENT "glslang: v3d_mc_8h.comp -> v3d_mc_8h.spv"
VERBATIM
)
add_custom_target(daedalus_shaders ALL DEPENDS ${NOOP_SPV} ${IDCT8_SPV} ${LPF_SPV} ${MC_SPV})
# v3d_runner — reusable Vulkan plumbing.
add_library(v3d_runner STATIC src/v3d_runner.c)
@@ -155,6 +188,15 @@ if (DAEDALUS_BUILD_VULKAN)
target_link_libraries(bench_v3d_lpf PRIVATE v3d_runner Vulkan::Vulkan)
target_compile_options(bench_v3d_lpf PRIVATE -O2)
# Cycle 3 — QPU MC bench.
add_executable(bench_v3d_mc
tests/bench_v3d_mc.c
tests/vp9_mc_ref.c
)
add_dependencies(bench_v3d_mc daedalus_shaders)
target_link_libraries(bench_v3d_mc PRIVATE v3d_runner Vulkan::Vulkan)
target_compile_options(bench_v3d_mc PRIVATE -O2)
# M4 — concurrent CPU(NEON) + QPU bench. Links the FFmpeg NEON
# snapshot so we can run real NEON kernels on pinned CPU cores
# while the QPU runs its dispatch loop concurrently.
@@ -174,6 +216,16 @@ if (DAEDALUS_BUILD_VULKAN)
add_dependencies(bench_concurrent_lpf daedalus_shaders)
target_link_libraries(bench_concurrent_lpf PRIVATE v3d_runner Vulkan::Vulkan pthread)
target_compile_options(bench_concurrent_lpf PRIVATE -O3 -march=armv8-a+simd)
# Cycle 3 M4''' — concurrent MC.
add_executable(bench_concurrent_mc
tests/bench_concurrent_mc.c
${FFASM_MC_SOURCES}
${FFC_MC_SOURCES}
)
add_dependencies(bench_concurrent_mc daedalus_shaders)
target_link_libraries(bench_concurrent_mc PRIVATE v3d_runner Vulkan::Vulkan pthread)
target_compile_options(bench_concurrent_mc PRIVATE -O3 -march=armv8-a+simd)
endif()
# ---- Summary ----------------------------------------------------------------
+104
@@ -0,0 +1,104 @@
---
cycle: 3
phase: 1
status: open
date_opened: 2026-05-18
parent_cycle: k2_deblock_phase7.md (cycle 2 closed YELLOW-via-M4'' PASS)
target_kernel: VP9 8-tap MC interpolation, regular filter, horizontal, 8×N block
dev_host: hertz
---
# Cycle 3, Phase 1 — MC interpolation kernel goal
Per `k2_deblock_phase7.md` verdict (project continues). MC interpolation
chosen because: most-common per-frame work in real bitstreams (every
inter block); multiply-heavy → stresses V3D SMUL24 / lack of DP4A
directly; VP9+AV1 both use the same 8-tap structure.
## Kernel under test
**VP9 8-tap regular subpel filter, horizontal direction, 8×N block,
"put" (non-averaging) mode.**
libavcodec symbol: `ff_vp9_put_8tap_regular_8h_neon` (and equivalents
for smooth/sharp filter types). C reference: `put_8tap_regular_8h_c`
from `libavcodec/vp9dsp_template.c` (instantiated via the
`filter_fn_1d(8, h, mx, regular, FILTER_8TAP_REGULAR, put)` macro
expansion).
I/O contract (per VP9 spec § 8.5.1 — subpel motion compensation):
```c
void put_8tap_regular_8h_c(uint8_t *dst, ptrdiff_t dst_stride,
const uint8_t *src, ptrdiff_t src_stride,
int h, int mx, int my);
```
- `dst` : destination block, written
- `dst_stride` : destination row stride
- `src` : source block, read (with -3..+4 column overhang for horizontal)
- `src_stride` : source row stride
- `h` : block height (typically 8 for 8×8)
- `mx` : x-axis subpel phase ∈ [0, 15]
- `my` : y-axis subpel phase (unused for horizontal-only filter)
Per output pixel:
```
out[r][c] = clip((sum_{k=0..7} filter[k] * src[r][c+k-3] + 64) >> 7, 0, 255)
```
Filter coefficients: `ff_vp9_subpel_filters[FILTER_8TAP_REGULAR][mx][0..7]`
(int16, signed; 16 phases; sum to 128).
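The per-pixel formula above can be sketched in plain C as follows. This is a minimal illustration, not the contents of `tests/vp9_mc_ref.c`: the names `clip_u8` and `put_8tap_8h_sketch` are hypothetical, and the filter row is passed in by the caller instead of being looked up in `ff_vp9_subpel_filters`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Clamp an int to the 8-bit pixel range [0, 255]. */
static uint8_t clip_u8(int v)
{
    return (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
}

/*
 * Horizontal 8-tap "put" filter for an 8-wide block (hypothetical helper).
 * `src` points at output column 0, so the caller must guarantee that
 * src[-3] .. src[11] are readable on every row (15 bytes per row).
 * `filter` holds the 8 taps of one mx phase; the taps sum to 128.
 */
static void put_8tap_8h_sketch(uint8_t *dst, ptrdiff_t dst_stride,
                               const uint8_t *src, ptrdiff_t src_stride,
                               int h, const int16_t filter[8])
{
    assert(h > 0);
    for (int r = 0; r < h; r++) {
        for (int c = 0; c < 8; c++) {
            int sum = 0;
            for (int k = 0; k < 8; k++)
                sum += filter[k] * src[c + k - 3];
            /* Round (+64), scale (>>7), then clip. The shift of a negative
             * sum relies on arithmetic right shift, which mainstream
             * compilers provide. */
            dst[c] = clip_u8((sum + 64) >> 7);
        }
        dst += dst_stride;
        src += src_stride;
    }
}
```

With a tap row of `{0, 0, 0, 128, 0, 0, 0, 0}` (an illustrative identity row) the function copies the block unchanged, which makes a convenient sanity check before wiring in a real coefficient table.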
## Measurable success criteria (cycle-3 numbering)
| ID | Measurement | Gate |
|---|---|---|
| **M1'''** | Bit-exact match rate vs C reference, ≥10 000 random 8×8 blocks (all 16 mx phases sampled) | 100.0000 % |
| **M2'''** | QPU throughput in Mblock/s | recorded |
| **M3'''** | NEON `ff_vp9_put_8tap_regular_8h_neon` throughput, single-core | recorded |
| **M4'''** | MIXED NEON-3 + QPU vs pure NEON-4 (only if YELLOW band) | conditional |
Derived: **R''' = M2''' / M3'''**.
## Decision rules (same as cycle 1/2)
R''' bands and verdicts unchanged (see `phase1.md` and `k2_deblock_phase1.md`).
Cycle-2 calibration adjustment: ORANGE band (0.1 ≤ R''' < 0.5) is
no longer auto-close — run M4''' regardless.
Predicted R''' band: **0.4-0.8.**
- MC is more compute-bound than LPF (8 mults + 7 adds per output
pixel; 64 pixels per block → ~960 ops per block)
- Bandwidth-equivalent to LPF (per-block ~120 B read + 64 B write
≈ 184 B → similar 5-6 MB/frame at 32 400 blocks)
- V3D SMUL24 covers the 8b×8b → 16b mults without overflow
- But no DP4A means we lose the typical "4× INT8 speedup" CPUs get
via SDOT — V3D does these as scalar SMUL24
## Cycle 1+2 lessons baked in from start
Per `k2_deblock_phase7.md §"Phase 9 lessons"`:
1. WG=256, 2-per-subgroup adaptation, uint8_t SSBO, oob early-return,
NO chained ternary — these are the v1 defaults.
2. Phase 5 second-model review is mandatory.
3. R isolation is misleading; M4''' is the real gate.
4. Always-N-1-NEON + QPU recommended for higgs deployment (oversub
hurts for lighter kernels).
5. shaderdb at 4 threads / 0 spills = compiler delivered; further
optimisation must target algorithm, not compile shape.
## Phase 2 → Phase 3 hand-off
Phase 2 must:
- Vendor `libavcodec/aarch64/vp9mc_neon.S` from FFmpeg n7.1.3
(matches existing snapshot pin)
- Confirm `ff_vp9_subpel_filters` definition source
(`libavcodec/vp9dsp.c:32`, just the 16 × 8 REGULAR row needed)
- Pin the exact NEON symbol naming
Phase 3 must:
- Write standalone C ref (`tests/vp9_mc_ref.c`) with REGULAR filter
table embedded
- Write `tests/bench_neon_mc.c` (M1'''_c gate + M3''')
- Capture M3''' before any QPU work
+109
@@ -0,0 +1,109 @@
---
cycle: 3
phase: 2
status: closed 2026-05-18
date_opened: 2026-05-18
parent: k3_mc_phase1.md
---
# Cycle 3, Phase 2 — MC situation analysis
## 1. C reference
- **Source**: `external/ffmpeg-snapshot/libavcodec/vp9dsp_template.c`
(already vendored from cycle 1).
- **Function**: `put_8tap_regular_8h_c`, generated by
`filter_fn_1d(8, h, mx, regular, FILTER_8TAP_REGULAR, put)`, which
expands to a call to `do_8tap_1d_c` with `ds=1` (horizontal) and the
REGULAR filter bank.
- **Underlying primitive**: `do_8tap_1d_c` iterates `h` rows;
per row, iterates `w=8` columns; per column, computes the
`FILTER_8TAP` macro: `clip((sum_{k=0..7} F[k] * src[x+k-3]
+ 64) >> 7, 0, 255)`.
- **Spec**: VP9 specification § 8.5.1 (subpel motion compensation).
## 2. NEON reference
- **Source**: `external/ffmpeg-snapshot/libavcodec/aarch64/vp9mc_neon.S`
(vendored 2026-05-18, FFmpeg n7.1.3, SHA-256
`6b1d50f9821742584fdd47758057f810644aff3a008faaa774ff5b9cac4d1fef`).
- **Symbol**: `ff_vp9_put_regular8_h_neon` (note: filter type baked
into name, width=8 baked in, h-direction baked in)
- **Signature** (VP9 `vp9_mc_func` typedef):
```c
void ff_vp9_put_regular8_h_neon(uint8_t *dst, ptrdiff_t dst_stride,
const uint8_t *src, ptrdiff_t src_stride,
int h, int mx, int my);
```
Registers: `x0=dst, x1=dst_stride, x2=src, x3=src_stride, w4=h, w5=mx, w6=my`.
- **Dependencies**:
- `libavutil/aarch64/asm.S` ✓ (already vendored)
- `ff_vp9_subpel_filters[3][16][8]` symbol — provided by
`external/ffmpeg-snapshot/libavcodec/vp9_subpel_filters_table.c`
(hand-extracted from `libavcodec/vp9dsp.c` of the same n7.1.3
pin; copying just the constant data avoids dragging in the
rest of `vp9dsp.c` which would require linking the entire VP9
decoder).
## 3. Workload model
Per 8×8 block output:
- 8 multiplies × 8 columns × 8 rows = **512 multiplies**
- 7 additions × 8 columns × 8 rows = 448 additions
- 1 round (+64), 1 shift (>>7), 1 clip per pixel × 64 = 192 ops
- Total ~1150 integer ops per block
Per-block memory (horizontal-only filter, 8-pixel-wide output):
- Read: 8 rows × (8 output cols + 7 tap overhang) = 8 × 15 = **120 source bytes**
- Write: 8 rows × 8 cols = **64 dst bytes**
- Total: **~184 bytes / block**
Per 1080p frame (32 400 8×8 blocks, worst case all-MC):
- ~5.9 MB total memory traffic
- ~37 Mops compute
- At GPU 4 GB/s share: 1.48 ms / frame = 675 FPS = 21.9 Mblock/s
- At V3D 92 GFLOPS theoretical scalar (SMUL24 throughput ≈ FP MUL): 0.4 ms compute / frame = 2500 FPS theoretical → **compute is NOT the bottleneck** at this shape
So MC is **bandwidth-bound on the QPU**, similar to LPF cycle 2.
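The envelope arithmetic in this section can be reproduced with a few helper functions. This is a sketch under stated assumptions: the helper names are hypothetical, "4 GB/s" is taken as 4e9 bytes per second, and the ops-per-block and ops-per-second figures are the ones quoted above, not new measurements.

```c
#include <assert.h>

/* Per-frame figures from the workload model above. */
enum { BLOCKS_PER_1080P_FRAME = 32400, BYTES_PER_BLOCK = 184 };

/* Frame time in ms if only memory traffic limited throughput. */
static double bandwidth_frame_ms(double bytes_per_second)
{
    assert(bytes_per_second > 0.0);
    double bytes_per_frame = (double)BLOCKS_PER_1080P_FRAME * BYTES_PER_BLOCK;
    return 1e3 * bytes_per_frame / bytes_per_second;
}

/* Frame time in ms if only ALU throughput limited it. */
static double compute_frame_ms(double ops_per_block, double ops_per_second)
{
    assert(ops_per_second > 0.0);
    return 1e3 * BLOCKS_PER_1080P_FRAME * ops_per_block / ops_per_second;
}

/* Convert a per-frame time in ms into Mblock/s. */
static double mblocks_per_second(double frame_ms)
{
    assert(frame_ms > 0.0);
    return BLOCKS_PER_1080P_FRAME / frame_ms / 1e3;
}
```

Calling `bandwidth_frame_ms(4e9)` gives roughly 1.5 ms and `compute_frame_ms(1150, 92e9)` roughly 0.4 ms, so under these assumptions the bandwidth term is the larger of the two, which is the basis for the bandwidth-bound prediction.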
## 4. Per-row workload diversity (vs cycle 1+2)
| | IDCT (k1) | LPF (k2) | MC (k3) |
|---|---|---|---|
| Per-block math | Heavy butterflies (~60 ops/block via separable transform) | Light: 0-30 ops per edge × 8 rows | 8-tap convolution: 1150 ops per block |
| Per-block memory | ~320 B in + 64 B out | ~64 B in + ~24 B out per edge | 120 B in + 64 B out |
| Compute / memory ratio | High | Low (memory-bound, lots of skipping) | Medium (compute-rich but bandwidth-bound at GPU) |
| Conditional? | No (always-execute) | Yes (fm/hev divergence per row) | No (deterministic per pixel) |
| QPU mult intensity | Q14 16b×16b mults | Light (compares, small clips) | 16b×8b mults (filter × pixel) |
MC is interesting because it's **compute-rich AND bandwidth-bound** —
the closest match in workload shape to a real-world GPU compute kernel
the V3D was designed for (graphics filtering).
## 5. Constraints carried from cycle 1+2
Same V3D 7.1 device profile (vulkaninfo unchanged). The relevant
specifics for MC:
- No DP4A → 8-tap convolution must be 8 separate SMUL24 + ADDs
(the typical GPU "dot4" packing is not available)
- shaderInt16 = false → filter coefficients widened to int32 in
registers; the filter table itself can be a uint16-storage SSBO
- shaderInt8 = false → source pixels widened to int32 in registers
- 1024-byte (16 KiB / 16) shared mem per WG is ample for MC source
staging if useful (15 cols × 8 rows × 1 byte per block-row × 32
blocks per WG = 3 840 B per row); for v1 we skip shared-mem
staging and let TMU handle reads directly
## 6. What Phase 2 does *not* close
- Per-block (block_y, block_x) layout / meta format. Phase 4 picks.
Likely same shape as cycle 2 (uvec4 per block: dst_offset,
src_offset, mx, _pad).
- Filter table residency: as SSBO load every row, push-constants
per dispatch (different mx per dispatch), or constant baked into
shader (one filter per shader = 16 specialised shaders for the 16
mx phases). Phase 4 picks; v1 likely SSBO for simplicity.
- Vertical / "hv" / "avg" / 4-pixel / 16-pixel / 32-pixel / 64-pixel
variants — out of cycle 3 scope; cycle 4+ if needed.
Phase 3 next: build `tests/bench_neon_mc.c`, capture M3'''.
+77
@@ -0,0 +1,77 @@
---
cycle: 3
phase: 3
status: closed 2026-05-18
date_opened: 2026-05-18
parent: k3_mc_phase2.md
host: hertz
---
# Cycle 3, Phase 3 — NEON M3''' baseline
## Raw
```
=== M1'''_c bit-exact (10000 random blocks) ===
M1'''_c correctness: 10000 / 10000 blocks bit-exact (100.0000%)
mx phase coverage: min=577 max=668 (16 phases sampled)
=== M3''' NEON throughput ===
M3''' NEON throughput:
blocks/batch: 65536
batches done: 939
total blocks: 61 538 304
elapsed (kernel)=2.930751 s
elapsed (setup) =2.075477 s
throughput = 20.997 Mblock/s
per-block = 47.6 ns
equiv 1080p = 648.1 FPS (32400 blocks/frame)
```
## Numbers
| | |
|---|---|
| **M1'''_c (bit-exact)** | **100.0000 %** vs `daedalus_vp9_put_regular_8h_ref` |
| mx coverage | all 16 phases sampled, uniformly within ±10 % of expected count |
| **M3''' (throughput)** | **20.997 Mblock/s** single-core |
| per-block | 47.6 ns |
| cycles/block | 47.6 ns × 2.8 GHz ≈ 133 cycles |
| 1080p FPS-eq | 648 FPS |
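The derived rows in the table follow directly from the raw block count and kernel time. A minimal sketch of that conversion, with a hypothetical struct and function name (the bench's own reporting code is not shown here):

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    double mblock_per_s;  /* millions of 8x8 blocks per second */
    double ns_per_block;  /* average wall time per block */
    double fps_1080p;     /* frames per second at 32 400 blocks per frame */
} mc_throughput;

/* Derive the reported figures from a run's block count and kernel time. */
static mc_throughput mc_throughput_from_run(uint64_t total_blocks,
                                            double kernel_seconds)
{
    assert(total_blocks > 0 && kernel_seconds > 0.0);
    double blocks_per_s = (double)total_blocks / kernel_seconds;
    mc_throughput t;
    t.mblock_per_s = blocks_per_s / 1e6;
    t.ns_per_block = 1e9 / blocks_per_s;
    t.fps_1080p    = blocks_per_s / 32400.0;
    return t;
}
```

Feeding in the raw values above (61 538 304 blocks, 2.930751 s) yields figures that round to the 20.997 Mblock/s, 47.6 ns and 648 FPS shown in the table.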
## Comparison across cycles
| | IDCT (k1) | LPF (k2) | MC (k3) |
|---|---|---|---|
| Per-unit ns (NEON) | 122 | 20.7 (per edge) | 47.6 |
| 1080p FPS-eq | 252 | 748 (worst edges) | 648 |
| Compute character | Q14 butterflies + transpose | abs+compare+small mults | 8-tap convolution, mult-heavy |
| NEON win | SMLA + transpose | SMULL + saturate | SDOT-style packing |
MC NEON is fast — at ~2.6× IDCT throughput per unit. The A76's SDOT
or SMULL-pair pattern handles 8-tap convolution extremely well; this
is precisely the workload NEON SIMD was built for. **The QPU's
break-even point on cycle 3 is correspondingly tight.**
## Predictions for M2''' / R'''
V3D 7.1 has SMUL24 (8b×8b → 16b sufficient) but **no DP4A**, so the
QPU must do 8 separate SMUL24 + ADD per output pixel. Bandwidth-wise
MC is similar to LPF (~6 MB / 1080p frame). Compute-wise much heavier
than LPF.
- Compute-envelope (idealised): 32 400 blocks × 1 150 ops = 37 Mops
per frame. At v3d 92 GFLOPS theoretical × 23 % util ≈ 21 GOPS
effective → 1.8 ms / frame → 540 FPS → 17.5 Mblock/s
- Bandwidth-envelope: 5.9 MB/frame ÷ 4 GB/s ≈ 1.48 ms/frame → 22 Mblock/s
- Combined: min(compute, bandwidth) ≈ 17.5 Mblock/s
**Predicted R''' = 17.5 / 21.0 ≈ 0.83** isolation. Likely YELLOW
band by a small margin.
Honest lower bound: if SMUL24-vs-DP4A penalty is bigger than
estimated (CPU SDOT does 4 INT8 MACs in one instruction; the QPU
needs 4× more cycles for the same work in the worst case), R'''
could land near 0.5-0.6. Phase 7''' measures.
Phase 4 next.
+207
@@ -0,0 +1,207 @@
---
cycle: 3
phase: 4
status: open (awaiting Phase 5''' review)
date_opened: 2026-05-18
parent: k3_mc_phase3.md
template: phase4.md (cycle 1) + k2_deblock_phase4.md (cycle 2) — same constraints, same patterns
---
# Cycle 3, Phase 4 — Plan QPU MC kernel
Compact plan. Cycle 1+2 phase4 docs cover the constraint matrix
(C1-C10) and the dev-discipline patterns. Phase 4''' references
them rather than re-deriving.
## 1. Constraints (carried)
Same V3D 7.1 device. New for MC specifically:
- SMUL24 covers 16-bit filter × 8-bit pixel mults (max ~32K product, fits)
- Sum of 8 products fits in int32 trivially
- No DP4A — must use 8 separate scalar muls per output pixel
- 16 filter phases × 8 taps × 2 B = 256 B — too big for push constants
(max 128 B), small enough for one const array in shader
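The range claims in this list can be checked with a short worst-case calculation. The helpers below are hypothetical and deliberately pessimistic: they bound the accumulator for any int16 tap row on 8-bit pixels, not only for the REGULAR table.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Largest |accumulator| an 8-tap filter can reach on 8-bit pixels:
 * every pixel at 255 and every tap contributing with the same sign. */
static int32_t worst_case_accumulator(const int16_t filter[8])
{
    int32_t bound = 0;
    for (int k = 0; k < 8; k++) {
        int32_t mag = filter[k] < 0 ? -(int32_t)filter[k] : (int32_t)filter[k];
        bound += mag * 255;
    }
    return bound;
}

/* Size in bytes of a phases x taps table of int16 coefficients. */
static size_t filter_table_bytes(size_t phases, size_t taps)
{
    return phases * taps * sizeof(int16_t);
}
```

Even with all eight taps at the int16 extreme the bound stays near 67 million, far below `INT32_MAX`, and `filter_table_bytes(16, 8)` evaluates to 256, which is the figure compared against the 128 B push-constant limit above.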
## 2. Workload model
Per 8×8 block:
- 512 SMUL24 (8 mults × 64 output pixels)
- 448 ADD (7 adds × 64 output pixels)
- 64 round (+64 → >>7) operations
- 64 clip-to-[0,255]
- ≈ 1150 ALU ops per block
- 120 B read + 64 B write = 184 B per block
Per 1080p frame (32 400 blocks):
- ~37 Mops compute → 1.8 ms at v3d 23 % sustained (compute-bound estimate)
- ~5.9 MB traffic → 1.48 ms at 4 GB/s GPU share (bandwidth-bound estimate)
## 3. Workgroup geometry
Bake in the v4 lesson and the cycle-2 single-WG-size-from-start:
- `local_size_x = 256` (16 subgroups × 16 lanes)
- 8 lanes per block (1 lane per row r=0..7), 2 blocks per subgroup
- **32 blocks per WG**
- 1080p: 1 013 WGs
Same lane decomposition as cycle 2 LPF:
```
edge_slot = lane_in_sg >> 3 // 0 or 1 — "which block in this subgroup"
row = lane_in_sg & 7 // 0..7
block_local = sg_in_wg * 2 + edge_slot
block_idx = wg_id * 32 + block_local
oob = block_idx >= n_blocks
```
No barrier needed, no shared mem. Safe early-return on oob.
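The lane decomposition above can be mirrored on the CPU to check that every (block, row) pair is covered exactly once. This sketch assumes subgroups occupy contiguous ranges of the local invocation index (`sg_in_wg = local_id / 16`), which the plan does not state explicitly; the struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum { WG_SIZE = 256, SUBGROUP_SIZE = 16, BLOCKS_PER_WG = 32 };

typedef struct {
    uint32_t block_idx;  /* global block index */
    uint32_t row;        /* row within the block, 0..7 */
    bool     oob;        /* true when the lane has no block to process */
} mc_lane;

/* Map (workgroup id, local invocation id) to a block and row. */
static mc_lane mc_lane_map(uint32_t wg_id, uint32_t local_id, uint32_t n_blocks)
{
    assert(local_id < (uint32_t)WG_SIZE);
    uint32_t sg_in_wg   = local_id / SUBGROUP_SIZE;  /* assumed layout */
    uint32_t lane_in_sg = local_id % SUBGROUP_SIZE;
    uint32_t slot       = lane_in_sg >> 3;           /* which block in the subgroup */
    mc_lane l;
    l.row       = lane_in_sg & 7u;
    l.block_idx = wg_id * BLOCKS_PER_WG + sg_in_wg * 2u + slot;
    l.oob       = l.block_idx >= n_blocks;
    return l;
}

/* Workgroups needed to cover n_blocks: ceil(n_blocks / 32). */
static uint32_t mc_wg_count(uint32_t n_blocks)
{
    return (n_blocks + BLOCKS_PER_WG - 1u) / BLOCKS_PER_WG;
}
```

For a 1080p frame `mc_wg_count(32400)` returns 1013, and the final workgroup contains lanes whose `oob` flag is set, which is why the early return is needed.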
## 4. Per-thread algorithm
```glsl
if (block_idx >= pc.n_blocks) return;
uvec4 m = u_meta.meta[block_idx];
uint dst_off = m.x;
uint src_off = m.y;
uint mx = m.z & 15u;
// Read 15 source pixels for this row.
uint src_row_addr = src_off + row * pc.src_stride_u8;
int s0 = int(u_src.src[src_row_addr + 0u]);
int s1 = int(u_src.src[src_row_addr + 1u]);
int s2 = int(u_src.src[src_row_addr + 2u]);
int s3 = int(u_src.src[src_row_addr + 3u]);
int s4 = int(u_src.src[src_row_addr + 4u]);
int s5 = int(u_src.src[src_row_addr + 5u]);
int s6 = int(u_src.src[src_row_addr + 6u]);
int s7 = int(u_src.src[src_row_addr + 7u]);
int s8 = int(u_src.src[src_row_addr + 8u]);
int s9 = int(u_src.src[src_row_addr + 9u]);
int s10 = int(u_src.src[src_row_addr + 10u]);
int s11 = int(u_src.src[src_row_addr + 11u]);
int s12 = int(u_src.src[src_row_addr + 12u]);
int s13 = int(u_src.src[src_row_addr + 13u]);
int s14 = int(u_src.src[src_row_addr + 14u]);
// Filter coefficients — const REGULAR table, indexed by mx.
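int F0 = FILTER_REGULAR[mx][0];
int F1 = FILTER_REGULAR[mx][1];
int F2 = FILTER_REGULAR[mx][2];
int F3 = FILTER_REGULAR[mx][3];
int F4 = FILTER_REGULAR[mx][4];
int F5 = FILTER_REGULAR[mx][5];
int F6 = FILTER_REGULAR[mx][6];
int F7 = FILTER_REGULAR[mx][7];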
// 8 output pixels (each = 8-tap convolution of 8 consecutive source).
uint dst_row_addr = dst_off + row * pc.dst_stride_u8;
int o0 = F0*s0 + F1*s1 + F2*s2 + F3*s3 + F4*s4 + F5*s5 + F6*s6 + F7*s7;
int o1 = F0*s1 + F1*s2 + F2*s3 + F3*s4 + F4*s5 + F5*s6 + F6*s7 + F7*s8;
int o2 = F0*s2 + F1*s3 + F2*s4 + F3*s5 + F4*s6 + F5*s7 + F6*s8 + F7*s9;
int o3 = F0*s3 + F1*s4 + F2*s5 + F3*s6 + F4*s7 + F5*s8 + F6*s9 + F7*s10;
int o4 = F0*s4 + F1*s5 + F2*s6 + F3*s7 + F4*s8 + F5*s9 + F6*s10+ F7*s11;
int o5 = F0*s5 + F1*s6 + F2*s7 + F3*s8 + F4*s9 + F5*s10+ F6*s11+ F7*s12;
int o6 = F0*s6 + F1*s7 + F2*s8 + F3*s9 + F4*s10+ F5*s11+ F6*s12+ F7*s13;
int o7 = F0*s7 + F1*s8 + F2*s9 + F3*s10+ F4*s11+ F5*s12+ F6*s13+ F7*s14;
u_dst.dst[dst_row_addr + 0u] = uint8_t(clamp((o0 + 64) >> 7, 0, 255));
u_dst.dst[dst_row_addr + 1u] = uint8_t(clamp((o1 + 64) >> 7, 0, 255));
u_dst.dst[dst_row_addr + 2u] = uint8_t(clamp((o2 + 64) >> 7, 0, 255));
u_dst.dst[dst_row_addr + 3u] = uint8_t(clamp((o3 + 64) >> 7, 0, 255));
u_dst.dst[dst_row_addr + 4u] = uint8_t(clamp((o4 + 64) >> 7, 0, 255));
u_dst.dst[dst_row_addr + 5u] = uint8_t(clamp((o5 + 64) >> 7, 0, 255));
u_dst.dst[dst_row_addr + 6u] = uint8_t(clamp((o6 + 64) >> 7, 0, 255));
u_dst.dst[dst_row_addr + 7u] = uint8_t(clamp((o7 + 64) >> 7, 0, 255));
```
Mirrors `tests/vp9_mc_ref.c` directly.
## 5. SSBOs / push constants
| binding | name | type | usage |
|---|---|---|---|
| 0 | `meta` | `readonly uvec4[]` | per-block (dst_off, src_off, mx, _pad) |
| 1 | `dst` | `uint8_t[]` | output pixels |
| 2 | `src` | `readonly uint8_t[]` | input pixels |
Push constants (16 B):
```
n_blocks, dst_stride_u8, src_stride_u8, _pad
```
Filter table: hard-coded in shader as
`const int FILTER_REGULAR[16][8] = { ... };` — 128 const ints.
**Race safety:** lane r writes `dst[dst_off + r*dst_stride + 0..7]`
(8 contiguous bytes). For rows r and r+1, writes are `r*stride + 7`
and `(r+1)*stride + 0`. Disjoint iff `dst_stride ≥ 8`.
**Contracts (revised per phase5''' findings 4 + 6):**
1. `dst_stride_u8 ≥ 8` (race-safety lower bound)
2. `src_stride_u8 ≥ 15` (per-row read span)
3. `dst_off + 7 + (r_max)*dst_stride < dst_buffer_size`
4. `src_off + 14 + (r_max)*src_stride < src_buffer_size`
5. **`src_off` is the byte offset of the FIRST byte of the source
block's row 0 in the SSBO buffer — NOT shifted by +3.** The
C bench's `src + 3` C-caller convention does not carry into
the SSBO offset. Shader reads `s[k] = u_src.src[src_off +
row*stride + k]` for k=0..14, which equals
`master_src[block_base + row*stride + k]`, matching the C ref's
per-row read of `master_src[block_base + row*stride + (x..x+7)]`
for output col x ∈ 0..7.
**Phase 6 MUST** add `assert(dst_stride_u8 >= 8 && src_stride_u8 >= 15)`
in `bench_v3d_mc.c`'s meta-construction loop. **Phase 6 MUST** also
run `V3D_DEBUG=shaderdb` after first compile and record uniform
count. If uniform count > ~144 (a fall-out indicator that the
filter LUT inflated unfavorably), escalate filter to a dedicated
SSBO binding 3.
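Contracts 1 to 4 can be expressed as a single host-side check. This is a sketch of what such a check might look like, not the code in `bench_v3d_mc.c`; the function name and the split between `assert` (stride contracts) and a boolean result (buffer bounds) are assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum { MC_ROWS = 8, MC_OUT_COLS = 8, MC_SRC_COLS = 15 };

/* Returns true when one block's reads and writes stay inside the buffers. */
static bool mc_block_in_bounds(uint32_t dst_off, uint32_t src_off,
                               uint32_t dst_stride_u8, uint32_t src_stride_u8,
                               uint64_t dst_size, uint64_t src_size)
{
    /* Contracts 1 and 2: stride lower bounds (race safety, read span). */
    assert(dst_stride_u8 >= MC_OUT_COLS && src_stride_u8 >= MC_SRC_COLS);

    /* Contracts 3 and 4: last byte touched on the last row (r_max = 7). */
    uint64_t r_max    = MC_ROWS - 1;
    uint64_t last_dst = (uint64_t)dst_off + (MC_OUT_COLS - 1) + r_max * dst_stride_u8;
    uint64_t last_src = (uint64_t)src_off + (MC_SRC_COLS - 1) + r_max * src_stride_u8;
    return last_dst < dst_size && last_src < src_size;
}
```

A meta-construction loop could call this once per block and abort on a `false` result, so a bad offset is caught on the host before any dispatch is issued.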
## 6. Predicted M2''' / R'''
From Phase 3:
- Compute envelope: 17.5 Mblock/s
- Bandwidth envelope: 22.0 Mblock/s
- min ≈ 17.5 Mblock/s
- R''' isolation = 17.5 / 20.997 ≈ **0.83** (YELLOW, near GREEN)
Honest lower bound R''' = 0.5-0.6 if SMUL24-vs-DP4A penalty bites
harder. Phase 7''' measures.
## 7. WILL / WILL NOT touch
WILL (Phase 6 creates):
- `src/v3d_mc_8h.comp` — GLSL shader
- `tests/bench_v3d_mc.c` — harness with contract asserts
- CMake updates
WILL NOT touch:
- Cycle 1/2 artifacts (frozen Phase 3 baselines)
- `external/ffmpeg-snapshot/` (frozen vendored sources, including
the just-added `vp9_subpel_filters_table.c`)
- `src/v3d_runner.{c,h}` (reusable as-is)
## 8. Phase 5''' review prompts
Specific high-risk decisions:
1. **Orientation / arithmetic correctness** — the 8 `o0..o7`
expressions in §4 are stencil-aligned. Verify the off-by-one
in `F[k] * s[c+k]` matches `F[k] * src[x+k-3]` after the
`src+3` indexing shift used by the bench.
2. **Filter table residency** — hard-coded const array vs SSBO
vs push constants. Const is simplest but may cause v3d_compiler
to generate a large constant LUT. Worth verifying via shaderdb.
3. **Race safety** — same shape as cycle 2 (different rows of
same block disjoint iff stride ≥ row-width). Verify
`dst_stride ≥ 8` contract.
4. **`src+3` index shift** — the bench's source layout puts the
"row-0 col-0 source pixel" at `src + 3` (so src has -3..+12
reachable). Make sure the QPU shader applies this offset
consistently to its `src_off` meta value.
**RESOLVED (phase5''' finding 4, RED):** `src_off` is the raw
block-base offset (NOT +3-shifted). See §5 contract 5.
5. **Anything missing.**
## 9. Phase 6 execution order
1. Write shader, get glslang to accept (likely 0 spills, ≥2 threads)
2. Write bench with asserts + meta layout
3. Run M1''' bit-exact (gate)
4. Run M2''' (throughput)
5. If R''' < 1.0 → M4''' concurrent
6. Phase 7''' verdict
+71
@@ -0,0 +1,71 @@
---
cycle: 3
phase: 5
status: closed 2026-05-18 — PASS-WITH-REVISIONS, revisions applied
date_opened: 2026-05-18
date_closed: 2026-05-18
parent: k3_mc_phase4.md
reviewer: Claude Sonnet (general-purpose Agent, fresh context)
plan_author: Claude Opus 4.7 (this session)
verdict: PASS-WITH-REVISIONS
---
# Cycle 3, Phase 5 — Second-Model Review of MC Plan
Same handoff: in-session Agent (Sonnet, fresh context), files read
direct from disk, 5 review prompts + "anything else."
Outcome: **1 RED (off-by-3 `src_off` indexing bug)**, **2 YELLOW**
(shaderdb LUT gate for filter table, "MUST" assert language for
contracts). Cycle-1+2 RED patterns (write race, barrier UB,
subgroup-ops table error) did not recur.
**Phase 5 paid off again.** The RED would have caused a bit-exact
mismatch on the first run with cryptic "high index source pixels are
wrong" symptoms — likely 1-2 debug cycles to track down without the
review.
## Review (verbatim)
````markdown
## Verdict
PASS-WITH-REVISIONS — no RED-class correctness bugs. Two YELLOW findings
require plan amendments before Phase 6 proceeds. ...
[full review preserved — reviewer's RED finding 4 traces the off-by-3:
shader's `src_off = block_base + 3` + `src_stride_u8 = 16` + reading
`s[0..14]` causes high-index reads to spill into next row]
````
*(Verbatim review in agent output; key findings paraphrased below.)*
| # | Severity | Issue | Resolution |
|---|---|---|---|
| 1 (orientation) | GREEN | All 8 oN expressions stencil-aligned correctly | accepted |
| 2 (filter LUT) | YELLOW | `const int FILTER_REGULAR[16][8]` may inflate uniform count or compile to large LUT | Phase 6 to record uniform count via `V3D_DEBUG=shaderdb`; if >~144 uniforms, escalate filter to SSBO binding 3 |
| 3 (race safety) | GREEN-w/note | `stride ≥ 8` contract correct; phrasing softer than cycle-2 standard | applied: §5 MUST assert |
| 4 (`src_off` semantics) | **RED** | Plan said "src_off mirrors src+3"; with stride=16 shader's `s13`/`s14` read into next row's first 2 bytes | **applied: src_off = raw block base (no +3 shift); shader reads s[0..14] from there** |
| 5 (missing) | GREEN-w/note | Coefficient overflow safely fits int32 (worked bound); no missing barrier-UB or write-race issues | accepted |
| 6 (assert MUST language) | YELLOW | "Bench enforces with asserts" softer than cycle-2 MUST pattern | applied: §5 MUST language |
| 7 (no barrier OK) | GREEN | Cycle-1 finding-7 doesn't apply (no barrier) | accepted |
| 8 (filter table matches) | GREEN | `vp9_mc_ref.c` filter values match `vp9_subpel_filters_table.c[1]` verbatim | accepted |
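Finding 4 in the table above comes down to index arithmetic, which the following sketch reproduces. It assumes a 16-byte source stride and computes which buffer row a given tap read lands in; `mc_tap_row` is a hypothetical helper, not part of the bench.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Buffer row (relative to the block's row 0) that the read
 * u_src.src[src_off + row*stride + k] lands in, for a block whose
 * first source byte sits at block_base.
 */
static uint32_t mc_tap_row(uint32_t block_base, uint32_t src_off,
                           uint32_t stride, uint32_t row, uint32_t k)
{
    assert(stride > 0 && src_off >= block_base);
    uint32_t addr = src_off + row * stride + k;
    return (addr - block_base) / stride;
}
```

With `src_off = block_base` every tap `k = 0..14` of row `r` stays in row `r`. With `src_off = block_base + 3` and a stride of 16, taps 13 and 14 land at byte offsets 16 and 17 of the row, so they fall into row `r + 1`, which is the "high-index source pixels wrong" symptom the review anticipated.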
## Resolution (applied to phase4 inline)
1. **§4** — Clarified `src_off` is the byte offset of the **first byte
of the source block in the SSBO buffer** (NOT shifted by +3). The
C bench's `src + 3` C-caller convention does NOT carry into the
SSBO offset. Shader reads `s[k] = u_src.src[src_off + row*stride + k]`
for k=0..14, which equals `master_src[block_base + row*stride + k]`,
matching the C ref's per-row read of `master_src[block_base + row*stride + (x..x+7)]`
for output col x ∈ 0..7.
2. **§5** — Hardened "Bench enforces" to "Phase 6 MUST add
`assert(dst_stride_u8 >= 8 && src_stride_u8 >= 15)` in
`bench_v3d_mc.c`'s meta-construction loop." Cycle-2 finding-4
pattern applied.
3. **§5** — Added: "Phase 6 MUST run `V3D_DEBUG=shaderdb` after first
compile and record uniform count. If uniform count > ~144,
escalate filter to a dedicated SSBO binding 3."
After revisions: **Phase 4''' APPROVED for Phase 6''' implementation.**
+152
@@ -0,0 +1,152 @@
---
cycle: 3
phase: 7
status: closed 2026-05-18 — RED engineering / PASS 30fps-floor / M4 NEGATIVE
date_opened: 2026-05-18
date_closed: 2026-05-18
parent: k3_mc_phase4.md (revised per phase5''')
host: hertz
verdict: cycle 3 closes; MC stays on CPU for higgs deployment; engineering negative documented
---
# Cycle 3, Phase 7 — Verification (v1 + M4''')
## v1 first-light
```
=== v3d MC 8h bench ===
n_blocks: 65536 iters: 100
=== M1''': QPU vs C reference bit-exact ===
blocks bit-exact: 65536 / 65536 (100.0000 %)
=== M2''': QPU throughput ===
M2''' = 1.413 Mblock/s
per-block = 707.9 ns
per-dispatch = 46390.5 us
R''' = 0.067 → RED band
30fps@1080p floor: 1.5x margin (isolation)
```
shaderdb (v1 MC):
```
SHADER-DB-ffcca249...: 488 inst, 2 threads, 0 loops, 197 uniforms,
25 max-temps, 0:0 spills:fills, 0 sfu-stalls, 488 inst-and-stalls, 7 nops
```
**Phase 5''' finding 2 prediction confirmed**: filter LUT inflated
uniforms to 197 (gate was at ~144). Compiler also forced to 2 threads
(from cycle-2's 4) due to register pressure (25 max-temps vs cycle-2's
21). The "no DP4A" structural deficit shows up directly here — 8
SMUL24 + 7 ADD per output pixel × 64 pixels per block × 8-lane
geometry = 488 instructions, 30× heavier than the LPF kernel.
## M4''' concurrent matrix (8s windows)
| Config | Mblock/s | per-core (NEON) | vs NEON-4 | 30fps |
|---|---|---|---|---|
| NEON 1-core | 14.479 | — | — | 14.9× |
| **NEON 4-core** | **15.248** | 3.24–4.48 | **baseline** | 15.7× |
| QPU only | 1.380 | — | — | 1.4× |
| **Mixed NEON-3 + QPU** | **12.277** | 3.78–4.16 | **-19.5 %** | 12.6× |
| Mixed NEON-4 + QPU | 12.158 | 2.49–3.35 | -20.3 % | 12.5× |
**M4 gate: FAIL.** Mixed (12.28) < pure NEON-4 (15.25) by 2.97
Mblock/s. The QPU's 0.45 Mblock/s contribution under contention
doesn't compensate for losing one NEON core that delivers ~3.8.
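
The "vs NEON-4" column is a relative delta against the pure NEON 4-core row. A small sketch of that arithmetic, using the throughputs from the table (the function names are illustrative):

```c
#include <assert.h>

/* Percentage change of a mixed configuration relative to the baseline. */
static double pct_vs_baseline(double mixed_mblock_s, double baseline_mblock_s)
{
    assert(baseline_mblock_s > 0.0);
    return 100.0 * (mixed_mblock_s - baseline_mblock_s) / baseline_mblock_s;
}

/* The M4 gate passes only if the mixed run beats the pure-CPU baseline. */
static int m4_gate_pass(double mixed_mblock_s, double baseline_mblock_s)
{
    return pct_vs_baseline(mixed_mblock_s, baseline_mblock_s) > 0.0;
}
```

`pct_vs_baseline(12.277, 15.248)` is about -19.5 and `pct_vs_baseline(12.158, 15.248)` about -20.3, so both mixed rows fail the gate.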
## Cross-cycle comparison
| | Cycle 1 IDCT | Cycle 2 LPF | Cycle 3 MC |
|---|---|---|---|
| R isolation | 0.92 | 0.41 | **0.067** |
| 30fps floor margin (isolation) | 7.9× | 10× | **1.5×** |
| M4 mixed vs pure NEON-4 | +7.2 % | +6.9 % | **-19.5 %** |
| 30fps floor margin (mixed) | 7.2× | 7.2× | **12.6×** |
| Verdict for higgs | GO QPU | GO QPU | **STAY CPU** |
| NEON 4-core scaling vs 1-core | 0.56× (bw-bound) | 0.82× (bw-bound) | **1.05× (compute-bound)** |
The MC result is **structurally consistent** with the V3D substrate
profile from `phase0.md`:
- No DP4A → 8-wide convolution doesn't pack as it does on NEON SDOT
- Filter coefficients drive uniform count high → register pressure → 2 threads
- High per-output-pixel multiply count → compiled instruction count
3× cycle 1, 6× cycle 2
NEON 4-core is *compute*-bound for MC (not bandwidth-bound like
the other two kernels). So 4-core scales nearly linearly with cores —
the NEON CPU has plenty of headroom and the QPU has nothing to add
even in concurrent mode.
## Deployment recipe (for higgs / libva-v4l2-request-fourier)
Per `project_consumer_target.md`, the eventual integration target is
V4L2 stateless → libva-v4l2-request-fourier → firefox-fourier. The
back-end-on-QPU/CPU split for the consumed decoder pipeline:
- **IDCT (cycle 1)** → QPU. R = 0.92, +7 % mixed, frees a CPU core.
- **LPF (cycle 2)** → QPU. R = 0.41, +7 % mixed, frees a CPU core.
- **MC (cycle 3)** → **CPU NEON**. R = 0.067, -19.5 % mixed.
Compute-bound on CPU but CPU already comfortably exceeds 30fps;
offload makes things worse.
- **Entropy** (VP9 Bool / AV1 ANS) → CPU. Structurally serial.
This is a **mixed-substrate deployment**, not a "QPU does everything"
plan. Realistic for higgs: entropy + MC on 2-3 ARM cores; IDCT + LPF
dispatched to QPU concurrently; 1-2 ARM cores left for vscode / etc.
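
One way to picture the split is a static stage-to-substrate table. This is only a sketch of the recipe above: the enum and function names are hypothetical and do not correspond to any existing daedalus-fourier or libva API.

```c
#include <assert.h>

typedef enum { SUBSTRATE_CPU, SUBSTRATE_QPU } substrate;

typedef enum {
    STAGE_ENTROPY, /* VP9 Bool / AV1 ANS: structurally serial   */
    STAGE_IDCT,    /* cycle 1: R = 0.92                          */
    STAGE_LPF,     /* cycle 2: R = 0.41                          */
    STAGE_MC,      /* cycle 3: R = 0.067, mixed mode is a loss   */
    STAGE_COUNT
} stage;

/* Placement per the deployment recipe: only IDCT and LPF go to the QPU. */
static const substrate k_stage_substrate[STAGE_COUNT] = {
    [STAGE_ENTROPY] = SUBSTRATE_CPU,
    [STAGE_IDCT]    = SUBSTRATE_QPU,
    [STAGE_LPF]     = SUBSTRATE_QPU,
    [STAGE_MC]      = SUBSTRATE_CPU,
};

static substrate stage_substrate(stage s)
{
    assert(s >= 0 && s < STAGE_COUNT);
    return k_stage_substrate[s];
}
```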
## Decision per Phase 1 rules + 30fps-floor calibration
| Rule | Result | Status |
|---|---|---|
| M1''' bit-exact | 100.0000 % | ✓ PASS |
| R''' = M2'''/M3''' | 0.067 (RED) | structural mismatch |
| M4''' > pure-CPU 4-core | -19.5 % | ✗ FAIL gate |
| 30fps@1080p floor (isolation) | 1.5× | ✓ PASS (user-facing) |
| 30fps@1080p floor (mixed) | 12.6× | ✓ PASS (user-facing) |
**Engineering cycle verdict: do not deploy MC on QPU; deploy on CPU.**
**User-facing cycle verdict: 30fps floor easily met in any
configuration; either path works for daily YouTube.**
For the deployment recipe above, **MC stays on CPU**. The Phase 1
ORANGE/RED "honest close" rule applies here: cycle 3 closes as a
documented negative for this kernel without affecting the
project-level "continue" verdict (cycles 1+2 GO results stand).
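
The 30fps margins quoted in this document are consistent with a floor of one 8×8 block per 64 pixels of a 1920×1080 frame at 30 frames per second. That definition is an assumption inferred from the reported numbers, not something stated in the source; the sketch below shows the arithmetic under that assumption.

```c
#include <assert.h>

enum { FRAME_W = 1920, FRAME_H = 1080, BLOCK_W = 8, BLOCK_H = 8, FPS = 30 };

/* Assumed floor: every 8x8 block of a 1080p frame once per frame, in Mblock/s. */
static double floor_mblock_per_s(void)
{
    double blocks_per_frame =
        (double)(FRAME_W * FRAME_H) / (double)(BLOCK_W * BLOCK_H);
    return blocks_per_frame * FPS / 1e6;
}

/* How many times over a given throughput clears the assumed floor. */
static double floor_margin(double mblock_per_s)
{
    assert(mblock_per_s > 0.0);
    return mblock_per_s / floor_mblock_per_s();
}
```

Under this assumption the floor is 0.972 Mblock/s, so the mixed NEON-3 + QPU run at 12.277 Mblock/s clears it about 12.6 times over and the QPU-only run at 1.380 Mblock/s about 1.4 times over.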
## Phase 9 lessons (added to project memory)
1. **Multiply-heavy workloads expose V3D's no-DP4A deficit** in a way
that cycle 1+2 didn't. CPU SDOT/UDOT pack 4 INT8 MACs in one
instruction; V3D's SMUL24 is one scalar mult at a time. The 4×
gap shows up directly as a 6-15× per-block slowdown.
2. **Compute-bound CPU workloads make the QPU offload story collapse.**
When NEON 4-core scales near-linearly (not bandwidth-saturated),
the "freed-core" argument from cycle 1+2 doesn't apply — there
are no free cycles to free. Mixed mode is strictly worse.
3. **The 30fps@1080p user-facing test (`project_30fps_floor_is_fine.md`)
passes regardless of engineering verdict.** All three cycles pass
it in isolation. This is a project-level win to communicate
separately from per-cycle engineering R numbers.
4. **The shaderdb filter-LUT gate from phase5''' finding 2 fired
exactly as predicted** (197 uniforms > 144 threshold; 2 threads
instead of 4). This validates the cycle-discipline of running
`V3D_DEBUG=shaderdb` early and using the result as an actionable
gate. Cycle 4 (if any) should bake this in from Phase 4 §design.
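
The shaderdb gate in lesson 4 can be automated with a few lines of C. This sketch assumes the summary line keeps the shape shown in the log earlier in this document (`..., 197 uniforms, ...`); the function names and the parsing approach are illustrative, and Mesa's output format is not guaranteed to stay stable.

```c
#include <assert.h>
#include <string.h>

enum { UNIFORM_GATE = 144 }; /* threshold from phase5''' finding 2 */

/* Return the number preceding " uniforms" in a shaderdb line, or -1. */
static int shaderdb_uniforms(const char *line)
{
    assert(line != NULL);
    const char *end = strstr(line, " uniforms");
    if (!end)
        return -1;
    const char *p = end;
    while (p > line && p[-1] >= '0' && p[-1] <= '9')
        p--;                 /* walk back over the digits */
    if (p == end)
        return -1;           /* no digits before the keyword */
    int n = 0;
    for (; p < end; p++)
        n = n * 10 + (*p - '0');
    return n;
}

/* Nonzero when the filter LUT should be escalated to an SSBO. */
static int uniform_gate_fired(const char *line)
{
    return shaderdb_uniforms(line) > UNIFORM_GATE;
}
```

Fed the cycle 3 line, `shaderdb_uniforms` yields 197 and the gate fires.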
## Leaves open
- Cycle 3 v2 with filter LUT escalated to SSBO (per phase5''' finding 2
trigger). Would reduce uniforms to ~30, potentially restore 4
threads. Expected upside: ~2× → R''' = 0.13. Still RED, still M4-
negative. Skipped — even doubling doesn't change the deployment
recipe.
- Vertical / hv / 4-tap / wider variants: all share cycle 3's
multiply shape, so the same structural verdict is expected. Not worth
a Phase 1+ cycle for those.
- Cycle 4 candidates (per phase7_M4.md §"Cycle 3 candidates"):
CDEF (AV1-only directional filter), Loop Restoration (AV1-only),
or higgs deployment plumbing.
+2
@@ -25,6 +25,8 @@ tagged commit, no modifications.
| `libavcodec/vp9dsp_template.c` | 2578 | 89045 | `41b21f667a6c497b620aa1637d8269badc45d1ac7e621d694441c5bf39356e4f` |
| `libavcodec/aarch64/vp9itxfm_neon.S` | 1580 | 63534 | `82ee3ceed4735c63576bafdcee28e2215652743ade55a9eab46a16d9530369f6` |
| `libavcodec/aarch64/vp9lpf_neon.S` | 1334 | — | `384e49e7a6e838d9e38aedc00838ed4aebfa6c5bdb343ecaf23ef639bc10fbb7` |
| `libavcodec/aarch64/vp9mc_neon.S` | 665 | — | `6b1d50f9821742584fdd47758057f810644aff3a008faaa774ff5b9cac4d1fef` |
| `libavcodec/vp9_subpel_filters_table.c` | — | — | hand-extracted from `libavcodec/vp9dsp.c` at same n7.1.3 pin — provides `ff_vp9_subpel_filters` for `vp9mc_neon.S` to link against without dragging in vp9dsp.c's full init machinery |
| `libavcodec/aarch64/neon.S` | 173 | 7496 | `72d36ce6c3fcc5e53de869cfe10fda16225ebe580c32891bccc240a30a85a538` |
| `libavutil/aarch64/asm.S` | 260 | 8069 | `c0d03143b1bc5a9e358222d08d2d449d595271844fe7a3dc23bffb91abe8b0e3` |
| `COPYING.LGPLv2.1` | 502 | — | `b634ab5640e258563c536e658cad87080553df6f34f62269a21d554844e58bfe` |
+665
@@ -0,0 +1,665 @@
/*
* Copyright (c) 2016 Google Inc.
*
* This file is part of FFmpeg.
*
* FFmpeg is free software; you can redistribute it and/or
* modify it under the terms of the GNU Lesser General Public
* License as published by the Free Software Foundation; either
* version 2.1 of the License, or (at your option) any later version.
*
* FFmpeg is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
* Lesser General Public License for more details.
*
* You should have received a copy of the GNU Lesser General Public
* License along with FFmpeg; if not, write to the Free Software
* Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
*/
#include "libavutil/aarch64/asm.S"
// All public functions in this file have the following signature:
// typedef void (*vp9_mc_func)(uint8_t *dst, ptrdiff_t dst_stride,
// const uint8_t *ref, ptrdiff_t ref_stride,
// int h, int mx, int my);
function ff_vp9_avg64_neon, export=1
mov x5, x0
1:
ld1 {v4.16b, v5.16b, v6.16b, v7.16b}, [x2], x3
ld1 {v0.16b, v1.16b, v2.16b, v3.16b}, [x0], x1
ld1 {v20.16b, v21.16b, v22.16b, v23.16b}, [x2], x3
urhadd v0.16b, v0.16b, v4.16b
urhadd v1.16b, v1.16b, v5.16b
ld1 {v16.16b, v17.16b, v18.16b, v19.16b}, [x0], x1
urhadd v2.16b, v2.16b, v6.16b
urhadd v3.16b, v3.16b, v7.16b
subs w4, w4, #2
urhadd v16.16b, v16.16b, v20.16b
urhadd v17.16b, v17.16b, v21.16b
st1 {v0.16b, v1.16b, v2.16b, v3.16b}, [x5], x1
urhadd v18.16b, v18.16b, v22.16b
urhadd v19.16b, v19.16b, v23.16b
st1 {v16.16b, v17.16b, v18.16b, v19.16b}, [x5], x1
b.ne 1b
ret
endfunc
function ff_vp9_avg32_neon, export=1
1:
ld1 {v2.16b, v3.16b}, [x2], x3
ld1 {v0.16b, v1.16b}, [x0]
urhadd v0.16b, v0.16b, v2.16b
urhadd v1.16b, v1.16b, v3.16b
subs w4, w4, #1
st1 {v0.16b, v1.16b}, [x0], x1
b.ne 1b
ret
endfunc
function ff_vp9_copy16_neon, export=1
add x5, x0, x1
lsl x1, x1, #1
add x6, x2, x3
lsl x3, x3, #1
1:
ld1 {v0.16b}, [x2], x3
ld1 {v1.16b}, [x6], x3
ld1 {v2.16b}, [x2], x3
ld1 {v3.16b}, [x6], x3
subs w4, w4, #4
st1 {v0.16b}, [x0], x1
st1 {v1.16b}, [x5], x1
st1 {v2.16b}, [x0], x1
st1 {v3.16b}, [x5], x1
b.ne 1b
ret
endfunc
function ff_vp9_avg16_neon, export=1
mov x5, x0
1:
ld1 {v2.16b}, [x2], x3
ld1 {v0.16b}, [x0], x1
ld1 {v3.16b}, [x2], x3
urhadd v0.16b, v0.16b, v2.16b
ld1 {v1.16b}, [x0], x1
urhadd v1.16b, v1.16b, v3.16b
subs w4, w4, #2
st1 {v0.16b}, [x5], x1
st1 {v1.16b}, [x5], x1
b.ne 1b
ret
endfunc
function ff_vp9_copy8_neon, export=1
1:
ld1 {v0.8b}, [x2], x3
ld1 {v1.8b}, [x2], x3
subs w4, w4, #2
st1 {v0.8b}, [x0], x1
st1 {v1.8b}, [x0], x1
b.ne 1b
ret
endfunc
function ff_vp9_avg8_neon, export=1
mov x5, x0
1:
ld1 {v2.8b}, [x2], x3
ld1 {v0.8b}, [x0], x1
ld1 {v3.8b}, [x2], x3
urhadd v0.8b, v0.8b, v2.8b
ld1 {v1.8b}, [x0], x1
urhadd v1.8b, v1.8b, v3.8b
subs w4, w4, #2
st1 {v0.8b}, [x5], x1
st1 {v1.8b}, [x5], x1
b.ne 1b
ret
endfunc
function ff_vp9_copy4_neon, export=1
1:
ld1 {v0.s}[0], [x2], x3
ld1 {v1.s}[0], [x2], x3
st1 {v0.s}[0], [x0], x1
ld1 {v2.s}[0], [x2], x3
st1 {v1.s}[0], [x0], x1
ld1 {v3.s}[0], [x2], x3
subs w4, w4, #4
st1 {v2.s}[0], [x0], x1
st1 {v3.s}[0], [x0], x1
b.ne 1b
ret
endfunc
function ff_vp9_avg4_neon, export=1
mov x5, x0
1:
ld1 {v2.s}[0], [x2], x3
ld1 {v0.s}[0], [x0], x1
ld1 {v2.s}[1], [x2], x3
ld1 {v0.s}[1], [x0], x1
ld1 {v3.s}[0], [x2], x3
ld1 {v1.s}[0], [x0], x1
ld1 {v3.s}[1], [x2], x3
ld1 {v1.s}[1], [x0], x1
subs w4, w4, #4
urhadd v0.8b, v0.8b, v2.8b
urhadd v1.8b, v1.8b, v3.8b
st1 {v0.s}[0], [x5], x1
st1 {v0.s}[1], [x5], x1
st1 {v1.s}[0], [x5], x1
st1 {v1.s}[1], [x5], x1
b.ne 1b
ret
endfunc
// Extract a vector from src1-src2 and src4-src5 (src1-src3 and src4-src6
// for size >= 16), and multiply-accumulate into dst1 and dst3 (or
// dst1-dst2 and dst3-dst4 for size >= 16)
.macro extmla dst1, dst2, dst3, dst4, src1, src2, src3, src4, src5, src6, offset, size
ext v20.16b, \src1\().16b, \src2\().16b, #(2*\offset)
ext v22.16b, \src4\().16b, \src5\().16b, #(2*\offset)
.if \size >= 16
mla \dst1\().8h, v20.8h, v0.h[\offset]
ext v21.16b, \src2\().16b, \src3\().16b, #(2*\offset)
mla \dst3\().8h, v22.8h, v0.h[\offset]
ext v23.16b, \src5\().16b, \src6\().16b, #(2*\offset)
mla \dst2\().8h, v21.8h, v0.h[\offset]
mla \dst4\().8h, v23.8h, v0.h[\offset]
.elseif \size == 8
mla \dst1\().8h, v20.8h, v0.h[\offset]
mla \dst3\().8h, v22.8h, v0.h[\offset]
.else
mla \dst1\().4h, v20.4h, v0.h[\offset]
mla \dst3\().4h, v22.4h, v0.h[\offset]
.endif
.endm
// The same as above, but don't accumulate straight into the
// destination, but use a temp register and accumulate with saturation.
.macro extmulqadd dst1, dst2, dst3, dst4, src1, src2, src3, src4, src5, src6, offset, size
ext v20.16b, \src1\().16b, \src2\().16b, #(2*\offset)
ext v22.16b, \src4\().16b, \src5\().16b, #(2*\offset)
.if \size >= 16
mul v20.8h, v20.8h, v0.h[\offset]
ext v21.16b, \src2\().16b, \src3\().16b, #(2*\offset)
mul v22.8h, v22.8h, v0.h[\offset]
ext v23.16b, \src5\().16b, \src6\().16b, #(2*\offset)
mul v21.8h, v21.8h, v0.h[\offset]
mul v23.8h, v23.8h, v0.h[\offset]
.elseif \size == 8
mul v20.8h, v20.8h, v0.h[\offset]
mul v22.8h, v22.8h, v0.h[\offset]
.else
mul v20.4h, v20.4h, v0.h[\offset]
mul v22.4h, v22.4h, v0.h[\offset]
.endif
.if \size == 4
sqadd \dst1\().4h, \dst1\().4h, v20.4h
sqadd \dst3\().4h, \dst3\().4h, v22.4h
.else
sqadd \dst1\().8h, \dst1\().8h, v20.8h
sqadd \dst3\().8h, \dst3\().8h, v22.8h
.if \size >= 16
sqadd \dst2\().8h, \dst2\().8h, v21.8h
sqadd \dst4\().8h, \dst4\().8h, v23.8h
.endif
.endif
.endm
// Instantiate a horizontal filter function for the given size.
// This can work on 4, 8 or 16 pixels in parallel; for larger
// widths it will do 16 pixels at a time and loop horizontally.
// The actual width is passed in x5, the height in w4 and the
// filter coefficients in x9. idx2 is the index of the largest
// filter coefficient (3 or 4) and idx1 is the other one of them.
.macro do_8tap_h type, size, idx1, idx2
function \type\()_8tap_\size\()h_\idx1\idx2
sub x2, x2, #3
add x6, x0, x1
add x7, x2, x3
add x1, x1, x1
add x3, x3, x3
// Only size >= 16 loops horizontally and needs
// reduced dst stride
.if \size >= 16
sub x1, x1, x5
.elseif \size == 4
add x12, x2, #8
add x13, x7, #8
.endif
// size >= 16 loads two qwords and increments x2,
// for size 4/8 it's enough with one qword and no
// postincrement
.if \size >= 16
sub x3, x3, x5
sub x3, x3, #8
.endif
// Load the filter vector
ld1 {v0.8h}, [x9]
1:
.if \size >= 16
mov x9, x5
.endif
// Load src
.if \size >= 16
ld1 {v4.8b, v5.8b, v6.8b}, [x2], #24
ld1 {v16.8b, v17.8b, v18.8b}, [x7], #24
.elseif \size == 8
ld1 {v4.8b, v5.8b}, [x2]
ld1 {v16.8b, v17.8b}, [x7]
.else // \size == 4
ld1 {v4.8b}, [x2]
ld1 {v16.8b}, [x7]
ld1 {v5.s}[0], [x12], x3
ld1 {v17.s}[0], [x13], x3
.endif
uxtl v4.8h, v4.8b
uxtl v5.8h, v5.8b
uxtl v16.8h, v16.8b
uxtl v17.8h, v17.8b
.if \size >= 16
uxtl v6.8h, v6.8b
uxtl v18.8h, v18.8b
.endif
2:
// Accumulate, adding idx2 last with a separate
// saturating add. The positive filter coefficients
// for all indices except idx2 must add up to less
// than 127 for this not to overflow.
mul v1.8h, v4.8h, v0.h[0]
mul v24.8h, v16.8h, v0.h[0]
.if \size >= 16
mul v2.8h, v5.8h, v0.h[0]
mul v25.8h, v17.8h, v0.h[0]
.endif
extmla v1, v2, v24, v25, v4, v5, v6, v16, v17, v18, 1, \size
extmla v1, v2, v24, v25, v4, v5, v6, v16, v17, v18, 2, \size
extmla v1, v2, v24, v25, v4, v5, v6, v16, v17, v18, \idx1, \size
extmla v1, v2, v24, v25, v4, v5, v6, v16, v17, v18, 5, \size
extmla v1, v2, v24, v25, v4, v5, v6, v16, v17, v18, 6, \size
extmla v1, v2, v24, v25, v4, v5, v6, v16, v17, v18, 7, \size
extmulqadd v1, v2, v24, v25, v4, v5, v6, v16, v17, v18, \idx2, \size
// Round, shift and saturate
sqrshrun v1.8b, v1.8h, #7
sqrshrun v24.8b, v24.8h, #7
.if \size >= 16
sqrshrun2 v1.16b, v2.8h, #7
sqrshrun2 v24.16b, v25.8h, #7
.endif
// Average
.ifc \type,avg
.if \size >= 16
ld1 {v2.16b}, [x0]
ld1 {v3.16b}, [x6]
urhadd v1.16b, v1.16b, v2.16b
urhadd v24.16b, v24.16b, v3.16b
.elseif \size == 8
ld1 {v2.8b}, [x0]
ld1 {v3.8b}, [x6]
urhadd v1.8b, v1.8b, v2.8b
urhadd v24.8b, v24.8b, v3.8b
.else
ld1 {v2.s}[0], [x0]
ld1 {v3.s}[0], [x6]
urhadd v1.8b, v1.8b, v2.8b
urhadd v24.8b, v24.8b, v3.8b
.endif
.endif
// Store and loop horizontally (for size >= 16)
.if \size >= 16
subs x9, x9, #16
st1 {v1.16b}, [x0], #16
st1 {v24.16b}, [x6], #16
b.eq 3f
mov v4.16b, v6.16b
mov v16.16b, v18.16b
ld1 {v6.16b}, [x2], #16
ld1 {v18.16b}, [x7], #16
uxtl v5.8h, v6.8b
uxtl2 v6.8h, v6.16b
uxtl v17.8h, v18.8b
uxtl2 v18.8h, v18.16b
b 2b
.elseif \size == 8
st1 {v1.8b}, [x0]
st1 {v24.8b}, [x6]
.else // \size == 4
st1 {v1.s}[0], [x0]
st1 {v24.s}[0], [x6]
.endif
3:
// Loop vertically
add x0, x0, x1
add x6, x6, x1
add x2, x2, x3
add x7, x7, x3
subs w4, w4, #2
b.ne 1b
ret
endfunc
.endm
.macro do_8tap_h_size size
do_8tap_h put, \size, 3, 4
do_8tap_h avg, \size, 3, 4
do_8tap_h put, \size, 4, 3
do_8tap_h avg, \size, 4, 3
.endm
do_8tap_h_size 4
do_8tap_h_size 8
do_8tap_h_size 16
.macro do_8tap_h_func type, filter, offset, size
function ff_vp9_\type\()_\filter\()\size\()_h_neon, export=1
movrel x6, X(ff_vp9_subpel_filters), 256*\offset
cmp w5, #8
add x9, x6, w5, uxtw #4
mov x5, #\size
.if \size >= 16
b.ge \type\()_8tap_16h_34
b \type\()_8tap_16h_43
.else
b.ge \type\()_8tap_\size\()h_34
b \type\()_8tap_\size\()h_43
.endif
endfunc
.endm
.macro do_8tap_h_filters size
do_8tap_h_func put, regular, 1, \size
do_8tap_h_func avg, regular, 1, \size
do_8tap_h_func put, sharp, 2, \size
do_8tap_h_func avg, sharp, 2, \size
do_8tap_h_func put, smooth, 0, \size
do_8tap_h_func avg, smooth, 0, \size
.endm
do_8tap_h_filters 64
do_8tap_h_filters 32
do_8tap_h_filters 16
do_8tap_h_filters 8
do_8tap_h_filters 4
// Vertical filters
// Round, shift and saturate and store reg1-reg2 over 4 lines
.macro do_store4 reg1, reg2, tmp1, tmp2, type
sqrshrun \reg1\().8b, \reg1\().8h, #7
sqrshrun \reg2\().8b, \reg2\().8h, #7
.ifc \type,avg
ld1 {\tmp1\().s}[0], [x7], x1
ld1 {\tmp2\().s}[0], [x7], x1
ld1 {\tmp1\().s}[1], [x7], x1
ld1 {\tmp2\().s}[1], [x7], x1
urhadd \reg1\().8b, \reg1\().8b, \tmp1\().8b
urhadd \reg2\().8b, \reg2\().8b, \tmp2\().8b
.endif
st1 {\reg1\().s}[0], [x0], x1
st1 {\reg2\().s}[0], [x0], x1
st1 {\reg1\().s}[1], [x0], x1
st1 {\reg2\().s}[1], [x0], x1
.endm
// Round, shift and saturate and store reg1-4
.macro do_store reg1, reg2, reg3, reg4, tmp1, tmp2, tmp3, tmp4, type
sqrshrun \reg1\().8b, \reg1\().8h, #7
sqrshrun \reg2\().8b, \reg2\().8h, #7
sqrshrun \reg3\().8b, \reg3\().8h, #7
sqrshrun \reg4\().8b, \reg4\().8h, #7
.ifc \type,avg
ld1 {\tmp1\().8b}, [x7], x1
ld1 {\tmp2\().8b}, [x7], x1
ld1 {\tmp3\().8b}, [x7], x1
ld1 {\tmp4\().8b}, [x7], x1
urhadd \reg1\().8b, \reg1\().8b, \tmp1\().8b
urhadd \reg2\().8b, \reg2\().8b, \tmp2\().8b
urhadd \reg3\().8b, \reg3\().8b, \tmp3\().8b
urhadd \reg4\().8b, \reg4\().8b, \tmp4\().8b
.endif
st1 {\reg1\().8b}, [x0], x1
st1 {\reg2\().8b}, [x0], x1
st1 {\reg3\().8b}, [x0], x1
st1 {\reg4\().8b}, [x0], x1
.endm
// Evaluate the filter twice in parallel, from the inputs src1-src9 into dst1-dst2
// (src1-src8 into dst1, src2-src9 into dst2), adding idx2 separately
// at the end with saturation. Indices 0 and 7 always have negative or zero
// coefficients, so they can be accumulated into tmp1-tmp2 together with the
// largest coefficient.
.macro convolve dst1, dst2, src1, src2, src3, src4, src5, src6, src7, src8, src9, idx1, idx2, tmp1, tmp2
mul \dst1\().8h, \src2\().8h, v0.h[1]
mul \dst2\().8h, \src3\().8h, v0.h[1]
mul \tmp1\().8h, \src1\().8h, v0.h[0]
mul \tmp2\().8h, \src2\().8h, v0.h[0]
mla \dst1\().8h, \src3\().8h, v0.h[2]
mla \dst2\().8h, \src4\().8h, v0.h[2]
.if \idx1 == 3
mla \dst1\().8h, \src4\().8h, v0.h[3]
mla \dst2\().8h, \src5\().8h, v0.h[3]
.else
mla \dst1\().8h, \src5\().8h, v0.h[4]
mla \dst2\().8h, \src6\().8h, v0.h[4]
.endif
mla \dst1\().8h, \src6\().8h, v0.h[5]
mla \dst2\().8h, \src7\().8h, v0.h[5]
mla \tmp1\().8h, \src8\().8h, v0.h[7]
mla \tmp2\().8h, \src9\().8h, v0.h[7]
mla \dst1\().8h, \src7\().8h, v0.h[6]
mla \dst2\().8h, \src8\().8h, v0.h[6]
.if \idx2 == 3
mla \tmp1\().8h, \src4\().8h, v0.h[3]
mla \tmp2\().8h, \src5\().8h, v0.h[3]
.else
mla \tmp1\().8h, \src5\().8h, v0.h[4]
mla \tmp2\().8h, \src6\().8h, v0.h[4]
.endif
sqadd \dst1\().8h, \dst1\().8h, \tmp1\().8h
sqadd \dst2\().8h, \dst2\().8h, \tmp2\().8h
.endm
// Load pixels and extend them to 16 bit
.macro loadl dst1, dst2, dst3, dst4
ld1 {v1.8b}, [x2], x3
ld1 {v2.8b}, [x2], x3
ld1 {v3.8b}, [x2], x3
.ifnb \dst4
ld1 {v4.8b}, [x2], x3
.endif
uxtl \dst1\().8h, v1.8b
uxtl \dst2\().8h, v2.8b
uxtl \dst3\().8h, v3.8b
.ifnb \dst4
uxtl \dst4\().8h, v4.8b
.endif
.endm
// Instantiate a vertical filter function for filtering 8 pixels at a time.
// The height is passed in x4, the width in x5 and the filter coefficients
// in x6. idx2 is the index of the largest filter coefficient (3 or 4)
// and idx1 is the other one of them.
.macro do_8tap_8v type, idx1, idx2
function \type\()_8tap_8v_\idx1\idx2
sub x2, x2, x3, lsl #1
sub x2, x2, x3
ld1 {v0.8h}, [x6]
1:
.ifc \type,avg
mov x7, x0
.endif
mov x6, x4
loadl v17, v18, v19
loadl v20, v21, v22, v23
2:
loadl v24, v25, v26, v27
convolve v1, v2, v17, v18, v19, v20, v21, v22, v23, v24, v25, \idx1, \idx2, v5, v6
convolve v3, v4, v19, v20, v21, v22, v23, v24, v25, v26, v27, \idx1, \idx2, v5, v6
do_store v1, v2, v3, v4, v5, v6, v7, v28, \type
subs x6, x6, #4
b.eq 8f
loadl v16, v17, v18, v19
convolve v1, v2, v21, v22, v23, v24, v25, v26, v27, v16, v17, \idx1, \idx2, v5, v6
convolve v3, v4, v23, v24, v25, v26, v27, v16, v17, v18, v19, \idx1, \idx2, v5, v6
do_store v1, v2, v3, v4, v5, v6, v7, v28, \type
subs x6, x6, #4
b.eq 8f
loadl v20, v21, v22, v23
convolve v1, v2, v25, v26, v27, v16, v17, v18, v19, v20, v21, \idx1, \idx2, v5, v6
convolve v3, v4, v27, v16, v17, v18, v19, v20, v21, v22, v23, \idx1, \idx2, v5, v6
do_store v1, v2, v3, v4, v5, v6, v7, v28, \type
subs x6, x6, #4
b.ne 2b
8:
subs x5, x5, #8
b.eq 9f
// x0 -= h * dst_stride
msub x0, x1, x4, x0
// x2 -= h * src_stride
msub x2, x3, x4, x2
// x2 -= 8 * src_stride
sub x2, x2, x3, lsl #3
// x2 += 1 * src_stride
add x2, x2, x3
add x2, x2, #8
add x0, x0, #8
b 1b
9:
ret
endfunc
.endm
do_8tap_8v put, 3, 4
do_8tap_8v put, 4, 3
do_8tap_8v avg, 3, 4
do_8tap_8v avg, 4, 3
// Instantiate a vertical filter function for filtering a 4 pixels wide
// slice. The first half of the registers contain one row, while the second
// half of a register contains the second-next row (also stored in the first
// half of the register two steps ahead). The convolution does two outputs
// at a time; the output of v17-v24 into one, and v18-v25 into another one.
// The first half of first output is the first output row, the first half
// of the other output is the second output row. The second halves of the
// registers are rows 3 and 4.
// This only is designed to work for 4 or 8 output lines.
.macro do_8tap_4v type, idx1, idx2
function \type\()_8tap_4v_\idx1\idx2
sub x2, x2, x3, lsl #1
sub x2, x2, x3
ld1 {v0.8h}, [x6]
.ifc \type,avg
mov x7, x0
.endif
ld1 {v1.s}[0], [x2], x3
ld1 {v2.s}[0], [x2], x3
ld1 {v3.s}[0], [x2], x3
ld1 {v4.s}[0], [x2], x3
ld1 {v5.s}[0], [x2], x3
ld1 {v6.s}[0], [x2], x3
trn1 v1.2s, v1.2s, v3.2s
ld1 {v7.s}[0], [x2], x3
trn1 v2.2s, v2.2s, v4.2s
ld1 {v26.s}[0], [x2], x3
uxtl v17.8h, v1.8b
trn1 v3.2s, v3.2s, v5.2s
ld1 {v27.s}[0], [x2], x3
uxtl v18.8h, v2.8b
trn1 v4.2s, v4.2s, v6.2s
ld1 {v28.s}[0], [x2], x3
uxtl v19.8h, v3.8b
trn1 v5.2s, v5.2s, v7.2s
ld1 {v29.s}[0], [x2], x3
uxtl v20.8h, v4.8b
trn1 v6.2s, v6.2s, v26.2s
uxtl v21.8h, v5.8b
trn1 v7.2s, v7.2s, v27.2s
uxtl v22.8h, v6.8b
trn1 v26.2s, v26.2s, v28.2s
uxtl v23.8h, v7.8b
trn1 v27.2s, v27.2s, v29.2s
uxtl v24.8h, v26.8b
uxtl v25.8h, v27.8b
convolve v1, v2, v17, v18, v19, v20, v21, v22, v23, v24, v25, \idx1, \idx2, v3, v4
do_store4 v1, v2, v5, v6, \type
subs x4, x4, #4
b.eq 9f
ld1 {v1.s}[0], [x2], x3
ld1 {v2.s}[0], [x2], x3
trn1 v28.2s, v28.2s, v1.2s
trn1 v29.2s, v29.2s, v2.2s
ld1 {v1.s}[1], [x2], x3
uxtl v26.8h, v28.8b
ld1 {v2.s}[1], [x2], x3
uxtl v27.8h, v29.8b
uxtl v28.8h, v1.8b
uxtl v29.8h, v2.8b
convolve v1, v2, v21, v22, v23, v24, v25, v26, v27, v28, v29, \idx1, \idx2, v3, v4
do_store4 v1, v2, v5, v6, \type
9:
ret
endfunc
.endm
do_8tap_4v put, 3, 4
do_8tap_4v put, 4, 3
do_8tap_4v avg, 3, 4
do_8tap_4v avg, 4, 3
.macro do_8tap_v_func type, filter, offset, size
function ff_vp9_\type\()_\filter\()\size\()_v_neon, export=1
uxtw x4, w4
movrel x5, X(ff_vp9_subpel_filters), 256*\offset
cmp w6, #8
add x6, x5, w6, uxtw #4
mov x5, #\size
.if \size >= 8
b.ge \type\()_8tap_8v_34
b \type\()_8tap_8v_43
.else
b.ge \type\()_8tap_4v_34
b \type\()_8tap_4v_43
.endif
endfunc
.endm
.macro do_8tap_v_filters size
do_8tap_v_func put, regular, 1, \size
do_8tap_v_func avg, regular, 1, \size
do_8tap_v_func put, sharp, 2, \size
do_8tap_v_func avg, sharp, 2, \size
do_8tap_v_func put, smooth, 0, \size
do_8tap_v_func avg, smooth, 0, \size
.endm
do_8tap_v_filters 64
do_8tap_v_filters 32
do_8tap_v_filters 16
do_8tap_v_filters 8
do_8tap_v_filters 4
@@ -0,0 +1,82 @@
/*
* VP9 8-tap subpel filter table — verbatim transcription of
* ff_vp9_subpel_filters from FFmpeg n7.1.3 libavcodec/vp9dsp.c
* (commit f46e514). Provided as a standalone .c so the vendored
* vp9mc_neon.S has the `ff_vp9_subpel_filters` symbol to link
* against, without pulling in the full vp9dsp.c init machinery
* (which would chain-include the entire VP9 decoder).
*
* Enum order from libavcodec/vp9dsp.h:64-67:
* FILTER_8TAP_SMOOTH = 0
* FILTER_8TAP_REGULAR = 1
* FILTER_8TAP_SHARP = 2
*
* License: LGPL-2.1-or-later (matches vp9dsp.c upstream).
*/
#include <stdint.h>
#ifdef __GNUC__
#define DAEDALUS_ALIGNED(n) __attribute__((aligned(n)))
#else
#define DAEDALUS_ALIGNED(n)
#endif
const DAEDALUS_ALIGNED(16) int16_t ff_vp9_subpel_filters[3][16][8] = {
/* [0] = FILTER_8TAP_SMOOTH */
{
{ 0, 0, 0, 128, 0, 0, 0, 0 },
{ -3, -1, 32, 64, 38, 1, -3, 0 },
{ -2, -2, 29, 63, 41, 2, -3, 0 },
{ -2, -2, 26, 63, 43, 4, -4, 0 },
{ -2, -3, 24, 62, 46, 5, -4, 0 },
{ -2, -3, 21, 60, 49, 7, -4, 0 },
{ -1, -4, 18, 59, 51, 9, -4, 0 },
{ -1, -4, 16, 57, 53, 12, -4, -1 },
{ -1, -4, 14, 55, 55, 14, -4, -1 },
{ -1, -4, 12, 53, 57, 16, -4, -1 },
{ 0, -4, 9, 51, 59, 18, -4, -1 },
{ 0, -4, 7, 49, 60, 21, -3, -2 },
{ 0, -4, 5, 46, 62, 24, -3, -2 },
{ 0, -4, 4, 43, 63, 26, -2, -2 },
{ 0, -3, 2, 41, 63, 29, -2, -2 },
{ 0, -3, 1, 38, 64, 32, -1, -3 },
},
/* [1] = FILTER_8TAP_REGULAR */
{
{ 0, 0, 0, 128, 0, 0, 0, 0 },
{ 0, 1, -5, 126, 8, -3, 1, 0 },
{ -1, 3, -10, 122, 18, -6, 2, 0 },
{ -1, 4, -13, 118, 27, -9, 3, -1 },
{ -1, 4, -16, 112, 37, -11, 4, -1 },
{ -1, 5, -18, 105, 48, -14, 4, -1 },
{ -1, 5, -19, 97, 58, -16, 5, -1 },
{ -1, 6, -19, 88, 68, -18, 5, -1 },
{ -1, 6, -19, 78, 78, -19, 6, -1 },
{ -1, 5, -18, 68, 88, -19, 6, -1 },
{ -1, 5, -16, 58, 97, -19, 5, -1 },
{ -1, 4, -14, 48, 105, -18, 5, -1 },
{ -1, 4, -11, 37, 112, -16, 4, -1 },
{ -1, 3, -9, 27, 118, -13, 4, -1 },
{ 0, 2, -6, 18, 122, -10, 3, -1 },
{ 0, 1, -3, 8, 126, -5, 1, 0 },
},
/* [2] = FILTER_8TAP_SHARP */
{
{ 0, 0, 0, 128, 0, 0, 0, 0 },
{ -1, 3, -7, 127, 8, -3, 1, 0 },
{ -2, 5, -13, 125, 17, -6, 3, -1 },
{ -3, 7, -17, 121, 27, -10, 5, -2 },
{ -4, 9, -20, 115, 37, -13, 6, -2 },
{ -4, 10, -23, 108, 48, -16, 8, -3 },
{ -4, 10, -24, 100, 59, -19, 9, -3 },
{ -4, 11, -24, 90, 70, -21, 10, -4 },
{ -4, 11, -23, 80, 80, -23, 11, -4 },
{ -4, 10, -21, 70, 90, -24, 11, -4 },
{ -3, 9, -19, 59, 100, -24, 10, -4 },
{ -3, 8, -16, 48, 108, -23, 10, -4 },
{ -2, 6, -13, 37, 115, -20, 9, -4 },
{ -2, 5, -10, 27, 121, -17, 7, -3 },
{ -1, 3, -6, 17, 125, -13, 5, -2 },
{ 0, 1, -3, 8, 127, -7, 3, -1 },
},
};
+142
@@ -0,0 +1,142 @@
// daedalus-fourier cycle 3 — VP9 8-tap "regular" subpel filter,
// horizontal direction, 8-wide output, h rows. V3D 7.1 via Mesa v3dv.
//
// Bakes in cycle-1+2 v4 winning patterns from start:
// - local_size_x = 256
// - 8 lanes per block (1 lane per output row), 2 blocks per
// 16-lane subgroup, 16 subgroups per WG → 32 blocks per WG
// - uint8_t SSBO via storageBuffer8BitAccess
// - oob early-return safe (no barrier)
//
// Contracts (per k3_mc_phase4.md §5, revised per phase5''' findings):
// - meta[i].x: dst_off (byte offset of block's row-0 col-0 dst pixel)
// - meta[i].y: src_off (byte offset of block's row-0 col-0 SOURCE
// pixel — note: NO +3 shift; the C bench's `src + 3` C-caller
// convention does NOT carry into the SSBO offset. Shader reads
// s[k] = SSBO[src_off + row*stride + k] for k=0..14, matching
// C ref's per-row read of `master_src[block_base + row*stride
// + (x..x+7)]` for output col x ∈ 0..7).
// - meta[i].z: mx (subpel phase in [0..15])
// - dst_stride_u8 ≥ 8 (race-safety lower bound; bench asserts)
// - src_stride_u8 ≥ 15 (per-row read span; bench asserts)
//
// License: BSD-2-Clause. Algorithm transcribed from tests/vp9_mc_ref.c
// which mirrors libavcodec/vp9dsp_template.c FILTER_8TAP macro.
#version 450
#extension GL_EXT_shader_8bit_storage : require
#extension GL_EXT_shader_explicit_arithmetic_types : require
layout(local_size_x = 256, local_size_y = 1, local_size_z = 1) in;
layout(binding = 0) readonly buffer Meta {
uvec4 meta[]; // per block: (dst_off, src_off, mx, _pad)
} u_meta;
layout(binding = 1) buffer Dst {
uint8_t dst[];
} u_dst;
layout(binding = 2) readonly buffer Src {
uint8_t src[];
} u_src;
layout(push_constant) uniform PC {
uint n_blocks;
uint dst_stride_u8;
uint src_stride_u8;
uint _pad;
} pc;
// VP9 8-tap REGULAR filter table — verbatim from
// external/ffmpeg-snapshot/libavcodec/vp9_subpel_filters_table.c
// (index [1] = FILTER_8TAP_REGULAR). 16 subpel phases × 8 taps.
//
// shaderdb-gate (phase5''' finding 2): if uniform count > ~144 after
// first compile, escalate this LUT to SSBO binding 3.
const int FILTER_REGULAR[16][8] = int[16][8](
int[8]( 0, 0, 0, 128, 0, 0, 0, 0 ),
int[8]( 0, 1, -5, 126, 8, -3, 1, 0 ),
int[8](-1, 3, -10, 122, 18, -6, 2, 0 ),
int[8](-1, 4, -13, 118, 27, -9, 3, -1 ),
int[8](-1, 4, -16, 112, 37, -11, 4, -1 ),
int[8](-1, 5, -18, 105, 48, -14, 4, -1 ),
int[8](-1, 5, -19, 97, 58, -16, 5, -1 ),
int[8](-1, 6, -19, 88, 68, -18, 5, -1 ),
int[8](-1, 6, -19, 78, 78, -19, 6, -1 ),
int[8](-1, 5, -18, 68, 88, -19, 6, -1 ),
int[8](-1, 5, -16, 58, 97, -19, 5, -1 ),
int[8](-1, 4, -14, 48, 105, -18, 5, -1 ),
int[8](-1, 4, -11, 37, 112, -16, 4, -1 ),
int[8](-1, 3, -9, 27, 118, -13, 4, -1 ),
int[8]( 0, 2, -6, 18, 122, -10, 3, -1 ),
int[8]( 0, 1, -3, 8, 126, -5, 1, 0 )
);
void main()
{
uint gid = gl_GlobalInvocationID.x;
uint wg_id = gid / 256u;
uint lane_in_wg = gid & 255u;
uint sg_in_wg = lane_in_wg >> 4;
uint lane_in_sg = lane_in_wg & 15u;
uint block_slot = lane_in_sg >> 3;
uint row = lane_in_sg & 7u;
uint block_local = sg_in_wg * 2u + block_slot;
uint block_idx = wg_id * 32u + block_local;
// No barrier follows — safe early-return.
if (block_idx >= pc.n_blocks) return;
uvec4 m = u_meta.meta[block_idx];
uint dst_off = m.x;
uint src_off = m.y;
uint mx = m.z & 15u;
// Read 15 source pixels for this row.
uint src_row = src_off + row * pc.src_stride_u8;
int s0 = int(u_src.src[src_row + 0u]);
int s1 = int(u_src.src[src_row + 1u]);
int s2 = int(u_src.src[src_row + 2u]);
int s3 = int(u_src.src[src_row + 3u]);
int s4 = int(u_src.src[src_row + 4u]);
int s5 = int(u_src.src[src_row + 5u]);
int s6 = int(u_src.src[src_row + 6u]);
int s7 = int(u_src.src[src_row + 7u]);
int s8 = int(u_src.src[src_row + 8u]);
int s9 = int(u_src.src[src_row + 9u]);
int s10 = int(u_src.src[src_row + 10u]);
int s11 = int(u_src.src[src_row + 11u]);
int s12 = int(u_src.src[src_row + 12u]);
int s13 = int(u_src.src[src_row + 13u]);
int s14 = int(u_src.src[src_row + 14u]);
int F0 = FILTER_REGULAR[mx][0];
int F1 = FILTER_REGULAR[mx][1];
int F2 = FILTER_REGULAR[mx][2];
int F3 = FILTER_REGULAR[mx][3];
int F4 = FILTER_REGULAR[mx][4];
int F5 = FILTER_REGULAR[mx][5];
int F6 = FILTER_REGULAR[mx][6];
int F7 = FILTER_REGULAR[mx][7];
int o0 = F0*s0 + F1*s1 + F2*s2 + F3*s3 + F4*s4 + F5*s5 + F6*s6 + F7*s7;
int o1 = F0*s1 + F1*s2 + F2*s3 + F3*s4 + F4*s5 + F5*s6 + F6*s7 + F7*s8;
int o2 = F0*s2 + F1*s3 + F2*s4 + F3*s5 + F4*s6 + F5*s7 + F6*s8 + F7*s9;
int o3 = F0*s3 + F1*s4 + F2*s5 + F3*s6 + F4*s7 + F5*s8 + F6*s9 + F7*s10;
int o4 = F0*s4 + F1*s5 + F2*s6 + F3*s7 + F4*s8 + F5*s9 + F6*s10 + F7*s11;
int o5 = F0*s5 + F1*s6 + F2*s7 + F3*s8 + F4*s9 + F5*s10 + F6*s11 + F7*s12;
int o6 = F0*s6 + F1*s7 + F2*s8 + F3*s9 + F4*s10 + F5*s11 + F6*s12 + F7*s13;
int o7 = F0*s7 + F1*s8 + F2*s9 + F3*s10 + F4*s11 + F5*s12 + F6*s13 + F7*s14;
uint dst_row = dst_off + row * pc.dst_stride_u8;
u_dst.dst[dst_row + 0u] = uint8_t(clamp((o0 + 64) >> 7, 0, 255));
u_dst.dst[dst_row + 1u] = uint8_t(clamp((o1 + 64) >> 7, 0, 255));
u_dst.dst[dst_row + 2u] = uint8_t(clamp((o2 + 64) >> 7, 0, 255));
u_dst.dst[dst_row + 3u] = uint8_t(clamp((o3 + 64) >> 7, 0, 255));
u_dst.dst[dst_row + 4u] = uint8_t(clamp((o4 + 64) >> 7, 0, 255));
u_dst.dst[dst_row + 5u] = uint8_t(clamp((o5 + 64) >> 7, 0, 255));
u_dst.dst[dst_row + 6u] = uint8_t(clamp((o6 + 64) >> 7, 0, 255));
u_dst.dst[dst_row + 7u] = uint8_t(clamp((o7 + 64) >> 7, 0, 255));
}
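
For comparison, the per-row computation each shader invocation performs can be written as a scalar C routine. This is a sketch for illustration, not the project's `tests/vp9_mc_ref.c`: the function names are hypothetical, and the taps are passed in by the caller rather than looked up from the table.

```c
#include <assert.h>
#include <stdint.h>

/* Clamp to the 8-bit pixel range, like clamp(v, 0, 255) in the shader. */
static uint8_t clip_u8(int v)
{
    return (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
}

/* One row of the 8-tap horizontal filter: 15 source bytes in, 8 pixels out.
 * taps[] is one row of the subpel filter table (coefficients sum to 128).
 * Assumes >> on a negative int is an arithmetic shift, as on GCC and Clang. */
static void mc_8tap_h_row(uint8_t dst[8], const uint8_t src[15],
                          const int16_t taps[8])
{
    assert(dst && src && taps);
    for (int x = 0; x < 8; x++) {
        int acc = 0;
        for (int k = 0; k < 8; k++)
            acc += taps[k] * src[x + k];
        dst[x] = clip_u8((acc + 64) >> 7);  /* round, shift by 7, saturate */
    }
}
```

With the phase-0 taps `{0, 0, 0, 128, 0, 0, 0, 0}` the routine copies `src[x + 3]` to `dst[x]`, which is the integer-pel case.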
+286
@@ -0,0 +1,286 @@
/*
* Cycle 3 M4''' — concurrent CPU(NEON MC) + QPU(V3D MC) throughput.
* Same pthread/barrier pattern as bench_concurrent{,_lpf}.c.
* License: BSD-2-Clause.
*/
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>
#include <stddef.h>
#include <time.h>
#include <getopt.h>
#include <pthread.h>
#include <sched.h>
#include <assert.h>
#include <vulkan/vulkan.h>
#include "v3d_runner.h"
extern void ff_vp9_put_regular8_h_neon(
uint8_t *dst, ptrdiff_t dst_stride,
const uint8_t *src, ptrdiff_t src_stride,
int h, int mx, int my);
#define SRC_W 16
#define DST_W 8
#define SRC_H 8
#define DST_H 8
#define SRC_BYTES (SRC_H * SRC_W)
#define DST_BYTES (DST_H * DST_W)
static inline uint64_t xs_step(uint64_t *s) {
uint64_t x = *s; x ^= x << 13; x ^= x >> 7; x ^= x << 17; return *s = x;
}
static uint64_t xs_init(uint64_t s) { return s ? s : 0xa57edbeef5717ULL; }
static double now_s(void) {
struct timespec t; clock_gettime(CLOCK_MONOTONIC_RAW, &t);
return t.tv_sec + t.tv_nsec * 1e-9;
}
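#include <stdatomic.h>
/* Stop flag written by the timer thread and polled by workers; atomic
 * rather than volatile so the cross-thread access is well-defined. */
static atomic_int g_stop = 0;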
static pthread_barrier_t g_start;
/* --- NEON worker ----------- */
#define NEON_BATCH 8192
typedef struct {
int worker_id, affinity_core;
uint64_t blocks_done;
double elapsed_s;
} neon_args;
static void *neon_worker(void *p)
{
neon_args *a = p;
cpu_set_t cs; CPU_ZERO(&cs); CPU_SET(a->affinity_core, &cs);
pthread_setaffinity_np(pthread_self(), sizeof(cs), &cs);
uint64_t s = xs_init((uint64_t) a->worker_id * 0xc01dbeefULL);
uint8_t *master = malloc((size_t) NEON_BATCH * SRC_BYTES);
uint8_t *work = malloc((size_t) NEON_BATCH * SRC_BYTES);
uint8_t *dsts = malloc((size_t) NEON_BATCH * DST_BYTES);
int *mxs = malloc(NEON_BATCH * sizeof(int));
if (!master || !work || !dsts || !mxs) { fprintf(stderr, "alloc fail\n"); exit(1); }
for (int i = 0; i < NEON_BATCH; i++) {
for (int j = 0; j < SRC_BYTES; j++)
master[(size_t)i * SRC_BYTES + j] = (uint8_t)(xs_step(&s) & 0xff);
mxs[i] = (int)(xs_step(&s) & 15);
}
pthread_barrier_wait(&g_start);
double t0 = now_s();
uint64_t done = 0;
while (!g_stop) {
memcpy(work, master, (size_t) NEON_BATCH * SRC_BYTES);
for (int i = 0; i < NEON_BATCH; i++)
ff_vp9_put_regular8_h_neon(
dsts + (size_t)i * DST_BYTES, DST_W,
work + (size_t)i * SRC_BYTES + 3, SRC_W,
DST_H, mxs[i], 0);
done += NEON_BATCH;
}
a->elapsed_s = now_s() - t0;
a->blocks_done = done;
free(master); free(work); free(dsts); free(mxs);
return NULL;
}
/* --- QPU worker ----------- */
typedef struct {
int affinity_core, n_blocks;
uint64_t blocks_done;
double elapsed_s;
} qpu_args;
typedef struct {
uint32_t n_blocks, dst_stride_u8, src_stride_u8, _pad;
} push_consts;
static void *qpu_worker(void *p)
{
qpu_args *a = p;
cpu_set_t cs; CPU_ZERO(&cs); CPU_SET(a->affinity_core, &cs);
pthread_setaffinity_np(pthread_self(), sizeof(cs), &cs);
v3d_runner *r = v3d_runner_create();
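if (!r) {
/* The start barrier was sized to include this worker, so returning
 * without waiting on it would deadlock every other thread. */
fprintf(stderr, "v3d_runner_create failed\n");
pthread_barrier_wait(&g_start);
return NULL;
}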
int n_blocks = a->n_blocks;
size_t meta_bytes = (size_t) n_blocks * 4 * sizeof(uint32_t);
size_t src_bytes = (size_t) n_blocks * SRC_BYTES;
size_t dst_bytes = (size_t) n_blocks * DST_BYTES;
v3d_buffer buf_meta = {0}, buf_dst = {0}, buf_src = {0};
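/* Non-zero return means failure (same convention as the QPU bench). */
if (v3d_runner_create_buffer(r, meta_bytes, &buf_meta) ||
v3d_runner_create_buffer(r, dst_bytes, &buf_dst) ||
v3d_runner_create_buffer(r, src_bytes, &buf_src)) {
fprintf(stderr, "v3d_runner_create_buffer failed\n");
exit(1);
}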
uint64_t s = 0xfeedfacecafebabeULL;
uint8_t *master = malloc(src_bytes);
for (size_t i = 0; i < src_bytes; i++) master[i] = (uint8_t)(xs_step(&s) & 0xff);
memcpy(buf_src.mapped, master, src_bytes);
uint32_t *meta = buf_meta.mapped;
assert(DST_W >= 8); assert(SRC_W >= 15);
for (int i = 0; i < n_blocks; i++) {
meta[4*i + 0] = (uint32_t)((size_t)i * DST_BYTES); /* dst_off */
meta[4*i + 1] = (uint32_t)((size_t)i * SRC_BYTES); /* src_off (RAW, no +3) */
meta[4*i + 2] = (uint32_t)(xs_step(&s) & 15); /* mx */
meta[4*i + 3] = 0;
}
v3d_pipeline pipe = {0};
v3d_runner_create_pipeline(r, "v3d_mc_8h.spv", 3, sizeof(push_consts), &pipe);
v3d_buffer bufs[3] = { buf_meta, buf_dst, buf_src };
v3d_runner_bind_buffers(r, &pipe, bufs, 3);
const uint32_t bpw = 32;
uint32_t gc = (uint32_t)((n_blocks + bpw - 1) / bpw);
push_consts pc = { .n_blocks = (uint32_t) n_blocks,
.dst_stride_u8 = DST_W,
.src_stride_u8 = SRC_W };
VkCommandBuffer cb = v3d_runner_alloc_cmdbuf(r);
VkCommandBufferBeginInfo cbbi = { .sType = VK_STRUCTURE_TYPE_COMMAND_BUFFER_BEGIN_INFO };
vkBeginCommandBuffer(cb, &cbbi);
vkCmdBindPipeline(cb, VK_PIPELINE_BIND_POINT_COMPUTE, pipe.pipeline);
vkCmdBindDescriptorSets(cb, VK_PIPELINE_BIND_POINT_COMPUTE,
pipe.layout, 0, 1, &pipe.desc_set, 0, NULL);
vkCmdPushConstants(cb, pipe.layout, VK_SHADER_STAGE_COMPUTE_BIT,
0, sizeof(pc), &pc);
vkCmdDispatch(cb, gc, 1, 1);
vkEndCommandBuffer(cb);
for (int i = 0; i < 5; i++) v3d_runner_submit_wait(r, cb);
pthread_barrier_wait(&g_start);
double t0 = now_s();
uint64_t done = 0;
while (!g_stop) {
memset(buf_dst.mapped, 0, dst_bytes);
v3d_runner_submit_wait(r, cb);
done += n_blocks;
}
a->elapsed_s = now_s() - t0;
a->blocks_done = done;
free(master);
v3d_runner_destroy_pipeline(r, &pipe);
v3d_runner_destroy_buffer(r, &buf_src);
v3d_runner_destroy_buffer(r, &buf_dst);
v3d_runner_destroy_buffer(r, &buf_meta);
v3d_runner_destroy(r);
return NULL;
}
typedef struct { double duration_s; } timer_args;
static void *timer_thread(void *p) {
timer_args *a = p;
pthread_barrier_wait(&g_start);
double end = now_s() + a->duration_s;
while (now_s() < end) {
struct timespec ts = {0, 1000000}; nanosleep(&ts, NULL);
}
g_stop = 1;
return NULL;
}
enum mode { MODE_NEON, MODE_QPU, MODE_MIXED };
int main(int argc, char **argv)
{
enum mode mode = MODE_NEON;
int n_neon = 4, qpu_core = 3, qpu_n_blocks = 65536;
double duration = 8.0;
static struct option opts[] = {
{"mode", required_argument, 0, 'm'},
{"neon-threads", required_argument, 0, 'n'},
{"qpu-core", required_argument, 0, 'c'},
{"qpu-blocks", required_argument, 0, 'b'},
{"duration", required_argument, 0, 'd'},
{0,0,0,0}
};
for (int c; (c = getopt_long(argc, argv, "m:n:c:b:d:", opts, 0)) != -1;) {
switch (c) {
case 'm':
if (!strcmp(optarg, "neon-only")) mode = MODE_NEON;
else if (!strcmp(optarg, "qpu-only")) mode = MODE_QPU;
else if (!strcmp(optarg, "mixed")) mode = MODE_MIXED;
else { fprintf(stderr, "bad mode\n"); return 2; }
break;
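case 'n':
n_neon = atoi(optarg);
/* neon_tids[] and n_args[] below hold at most 16 entries. */
if (n_neon < 1 || n_neon > 16) { fprintf(stderr, "neon-threads must be 1..16\n"); return 2; }
break;
case 'c': qpu_core = atoi(optarg); break;
case 'b':
qpu_n_blocks = atoi(optarg);
if (qpu_n_blocks < 1) { fprintf(stderr, "qpu-blocks must be >= 1\n"); return 2; }
break;
case 'd': duration = atof(optarg); break;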
default: return 2;
}
}
int has_qpu = (mode == MODE_QPU || mode == MODE_MIXED);
int has_neon = (mode == MODE_NEON || mode == MODE_MIXED);
int n_workers = (has_neon ? n_neon : 0) + (has_qpu ? 1 : 0);
int barrier_count = n_workers + 1 + 1; /* workers + timer thread + main */
printf("=== M4''' concurrent MC bench ===\n");
printf(" mode: %s, neon: %d, qpu: core %d / %d blocks, %.1fs\n",
mode == MODE_NEON ? "neon-only" : mode == MODE_QPU ? "qpu-only" : "mixed",
has_neon ? n_neon : 0,
has_qpu ? qpu_core : -1,
has_qpu ? qpu_n_blocks : 0,
duration);
pthread_barrier_init(&g_start, NULL, barrier_count);
pthread_t timer_tid; timer_args ta = { .duration_s = duration };
pthread_create(&timer_tid, NULL, timer_thread, &ta);
pthread_t neon_tids[16] = {0};
neon_args n_args[16] = {0};
if (has_neon) {
for (int i = 0; i < n_neon; i++) {
n_args[i] = (neon_args){ .worker_id = i, .affinity_core = i };
pthread_create(&neon_tids[i], NULL, neon_worker, &n_args[i]);
}
}
pthread_t qpu_tid = 0;
qpu_args q_args = {0};
if (has_qpu) {
q_args = (qpu_args){ .affinity_core = qpu_core, .n_blocks = qpu_n_blocks };
pthread_create(&qpu_tid, NULL, qpu_worker, &q_args);
}
pthread_barrier_wait(&g_start);
pthread_join(timer_tid, NULL);
if (has_neon) for (int i = 0; i < n_neon; i++) pthread_join(neon_tids[i], NULL);
if (has_qpu) pthread_join(qpu_tid, NULL);
uint64_t total = 0; double max_e = 0;
if (has_neon) {
printf("NEON per-thread:\n");
for (int i = 0; i < n_neon; i++) {
double mbs = n_args[i].blocks_done / n_args[i].elapsed_s / 1e6;
printf(" core %d: %.3f Mblock/s\n", n_args[i].affinity_core, mbs);
total += n_args[i].blocks_done;
if (n_args[i].elapsed_s > max_e) max_e = n_args[i].elapsed_s;
}
}
if (has_qpu) {
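/* elapsed_s stays 0 if the QPU worker never ran; avoid printing NaN. */
double mbs = q_args.elapsed_s > 0 ? q_args.blocks_done / q_args.elapsed_s / 1e6 : 0.0;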
printf("QPU (core %d): %.3f Mblock/s\n", q_args.affinity_core, mbs);
total += q_args.blocks_done;
if (q_args.elapsed_s > max_e) max_e = q_args.elapsed_s;
}
double total_mbs = max_e > 0 ? total / max_e / 1e6 : 0.0;
printf("\n=== AGGREGATE ===\n");
printf(" Mblock/s : %.3f\n", total_mbs);
printf(" 30fps@1080p floor: 0.972 Mblock/s — %.1fx margin\n",
total_mbs / 0.972);
pthread_barrier_destroy(&g_start);
return 0;
}
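The aggregate printed by this bench only becomes an M4''' verdict once it is compared against the NEON-only baseline. A minimal sketch of that comparison, assuming the delta is a plain percentage change relative to the NEON 4-core figure; the helper name `m4_delta_pct` is hypothetical and not part of the bench:

```c
#include <assert.h>

/* Percentage change of a mixed-mode aggregate relative to the
 * NEON-only baseline; both arguments in Mblock/s.
 * Hypothetical helper for illustration only. */
static double m4_delta_pct(double mixed_mbs, double baseline_mbs)
{
    assert(baseline_mbs > 0.0);
    return 100.0 * (mixed_mbs - baseline_mbs) / baseline_mbs;
}
```

With the figures from the commit message, `m4_delta_pct(12.277, 15.248)` gives roughly -19.5 and `m4_delta_pct(12.158, 15.248)` roughly -20.3.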
@@ -0,0 +1,220 @@
/*
* Cycle 3 Phase 3 — NEON M3''' baseline for VP9 8-tap regular
* horizontal MC interpolation, 8×8 block.
*
* Reports:
* M1'''_c (correctness): C-ref ↔ NEON bit-exact rate, N random
* 8×8 blocks with random source pixels and
* random subpel phase mx ∈ [0, 15]
* M3''' (throughput): NEON sustained Mblock/s, single-thread,
* time-based
*
* License: LGPL-2.1+ (statically links FFmpeg NEON snapshot).
*/
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <stddef.h>
#include <string.h>
#include <time.h>
#include <getopt.h>
extern void daedalus_vp9_put_regular_8h_ref(
uint8_t *dst, ptrdiff_t dst_stride,
const uint8_t *src, ptrdiff_t src_stride,
int h, int mx, int my);
extern void ff_vp9_put_regular8_h_neon(
uint8_t *dst, ptrdiff_t dst_stride,
const uint8_t *src, ptrdiff_t src_stride,
int h, int mx, int my);
/* RNG ------------------------------------------------------------ */
static uint64_t xs_state;
static inline uint64_t xs(void) {
uint64_t x = xs_state;
x ^= x << 13; x ^= x >> 7; x ^= x << 17;
return xs_state = x;
}
/* Block layout: each block gets its own 8×16 source buffer + 8×8 dst.
* - source buffer is 16 cols wide; the filter is called with
* src = block_src + 3, so it reads src[-3] through src[7+4], i.e.
* cols [0..14] of the 16-col buffer. col 15 is unused padding.
* - dst is 8 cols × 8 rows.
*/
#define SRC_W 16
#define SRC_H 8
#define DST_W 8
#define DST_H 8
#define SRC_BYTES (SRC_H * SRC_W) /* 128 */
#define DST_BYTES (DST_H * DST_W) /* 64 */
static void gen_src(uint8_t *buf)
{
for (int i = 0; i < SRC_BYTES; i++)
buf[i] = (uint8_t)(xs() & 0xff);
}
static double now_seconds(void)
{
struct timespec ts;
clock_gettime(CLOCK_MONOTONIC_RAW, &ts);
return ts.tv_sec + ts.tv_nsec * 1e-9;
}
/* M1'''_c correctness gate -------------------------------------- */
static int correctness_check(uint64_t seed, int n_blocks)
{
xs_state = seed ? seed : 0xabcdef1234567890ULL;
int mismatches = 0;
uint8_t src[SRC_BYTES];
uint8_t dst_a[DST_BYTES], dst_b[DST_BYTES];
int mx_hist[16] = {0};
for (int i = 0; i < n_blocks; i++) {
gen_src(src);
int mx = (int)(xs() & 15);
mx_hist[mx]++;
memset(dst_a, 0, DST_BYTES);
memset(dst_b, 0, DST_BYTES);
daedalus_vp9_put_regular_8h_ref(dst_a, DST_W, src + 3, SRC_W, DST_H, mx, 0);
ff_vp9_put_regular8_h_neon (dst_b, DST_W, src + 3, SRC_W, DST_H, mx, 0);
if (memcmp(dst_a, dst_b, DST_BYTES) != 0) {
if (mismatches < 3) {
fprintf(stderr, "MISMATCH block %d mx=%d:\n", i, mx);
fprintf(stderr, " ref:");
for (int r = 0; r < 8; r++) {
fprintf(stderr, "\n r%d ", r);
for (int c = 0; c < 8; c++) fprintf(stderr, "%3u ", dst_a[r*8+c]);
}
fprintf(stderr, "\n neon:");
for (int r = 0; r < 8; r++) {
fprintf(stderr, "\n r%d ", r);
for (int c = 0; c < 8; c++) fprintf(stderr, "%3u ", dst_b[r*8+c]);
}
fprintf(stderr, "\n");
}
mismatches++;
}
}
printf("M1'''_c correctness: %d / %d blocks bit-exact (%.4f%%)\n",
n_blocks - mismatches, n_blocks,
100.0 * (n_blocks - mismatches) / n_blocks);
/* mx histogram — confirms all 16 phases get exercised. */
int min_mx = mx_hist[0], max_mx = mx_hist[0];
for (int i = 1; i < 16; i++) {
if (mx_hist[i] < min_mx) min_mx = mx_hist[i];
if (mx_hist[i] > max_mx) max_mx = mx_hist[i];
}
printf(" mx phase coverage: min=%d max=%d (16 phases sampled)\n",
min_mx, max_mx);
return mismatches;
}
/* M3''' throughput ---------------------------------------------- */
static void throughput_neon(uint64_t seed, int n_blocks, double duration_s)
{
xs_state = seed ? seed : 0xdeadbeef12345678ULL;
uint8_t *master_src = malloc((size_t) n_blocks * SRC_BYTES);
uint8_t *work_src = malloc((size_t) n_blocks * SRC_BYTES);
uint8_t *dsts = malloc((size_t) n_blocks * DST_BYTES);
int *mxs = malloc(n_blocks * sizeof(int));
if (!master_src || !work_src || !dsts || !mxs) { fprintf(stderr, "alloc fail\n"); exit(1); }
for (int i = 0; i < n_blocks; i++) {
gen_src(master_src + (size_t)i * SRC_BYTES);
mxs[i] = (int)(xs() & 15);
}
/* Warm. */
memcpy(work_src, master_src, (size_t) n_blocks * SRC_BYTES);
for (int i = 0; i < n_blocks; i++)
ff_vp9_put_regular8_h_neon(dsts + (size_t)i * DST_BYTES, DST_W,
work_src + (size_t)i * SRC_BYTES + 3, SRC_W,
DST_H, mxs[i], 0);
double t0 = now_seconds();
double t_end = t0 + duration_s;
uint64_t done = 0;
while (now_seconds() < t_end) {
memcpy(work_src, master_src, (size_t) n_blocks * SRC_BYTES);
for (int i = 0; i < n_blocks; i++)
ff_vp9_put_regular8_h_neon(dsts + (size_t)i * DST_BYTES, DST_W,
work_src + (size_t)i * SRC_BYTES + 3, SRC_W,
DST_H, mxs[i], 0);
done += n_blocks;
}
double elapsed = now_seconds() - t0;
/* setup-only subtraction */
int setup_iters = (int) (done / n_blocks);
double s0 = now_seconds();
for (int it = 0; it < setup_iters; it++)
memcpy(work_src, master_src, (size_t) n_blocks * SRC_BYTES);
double s1 = now_seconds();
double kernel_seconds = elapsed - (s1 - s0);
double mbps = done / kernel_seconds / 1e6;
printf("M3''' NEON throughput:\n");
printf(" blocks/batch: %d\n", n_blocks);
printf(" batches done: %d\n", setup_iters);
printf(" total blocks: %llu\n", (unsigned long long) done);
printf(" elapsed (kernel)=%.6f s\n", kernel_seconds);
printf(" elapsed (setup) =%.6f s\n", s1 - s0);
printf(" throughput = %.3f Mblock/s\n", mbps);
printf(" per-block = %.1f ns\n", kernel_seconds / done * 1e9);
/* 1080p: 32400 blocks/frame */
printf(" equiv 1080p = %.1f FPS (32400 blocks/frame)\n",
mbps * 1e6 / 32400.0);
free(master_src); free(work_src); free(dsts); free(mxs);
}
int main(int argc, char **argv)
{
int n_blocks = 65536;
double duration = 5.0;
uint64_t seed = 0;
int do_correctness = 1;
static struct option opts[] = {
{"blocks", required_argument, 0, 'b'},
{"duration", required_argument, 0, 'd'},
{"seed", required_argument, 0, 's'},
{"no-correctness", no_argument, 0, 'C'},
{0,0,0,0}
};
for (int c; (c = getopt_long(argc, argv, "b:d:s:C", opts, 0)) != -1;) {
switch (c) {
case 'b': n_blocks = atoi(optarg); break;
case 'd': duration = atof(optarg); break;
case 's': seed = strtoull(optarg, 0, 0); break;
case 'C': do_correctness = 0; break;
default: return 2;
}
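}
/* n_blocks is used as a divisor and an allocation size below. */
if (n_blocks < 1) { fprintf(stderr, "blocks must be >= 1\n"); return 2; }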
if (do_correctness) {
printf("=== M1'''_c bit-exact (10000 random blocks) ===\n");
if (correctness_check(seed, 10000) != 0) {
fprintf(stderr, "REFUSING to measure throughput on a broken kernel.\n");
return 1;
}
printf("\n");
}
printf("=== M3''' NEON throughput ===\n");
throughput_neon(seed, n_blocks, duration);
return 0;
}
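The "equiv 1080p" line and the 0.972 Mblock/s floor both come from the same block arithmetic. A minimal sketch, assuming the frame is tiled by 8×8 blocks with partial blocks rounded up; the helper names are hypothetical and not part of the bench:

```c
#include <assert.h>
#include <stdint.h>

/* Number of 8x8 blocks covering a w x h frame (partial blocks round up). */
static uint32_t blocks_per_frame(uint32_t w, uint32_t h)
{
    assert(w > 0 && h > 0);
    return ((w + 7u) / 8u) * ((h + 7u) / 8u);
}

/* Frames per second sustained at a given throughput in Mblock/s. */
static double fps_from_mblocks(double mblocks_per_s, uint32_t w, uint32_t h)
{
    return mblocks_per_s * 1e6 / (double) blocks_per_frame(w, h);
}

/* Throughput floor in Mblock/s needed to hold `fps` at w x h. */
static double floor_mblocks(double fps, uint32_t w, uint32_t h)
{
    return (double) blocks_per_frame(w, h) * fps / 1e6;
}
```

For 1920×1080 this yields 240 × 135 = 32400 blocks per frame, so 30 fps needs 0.972 Mblock/s.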
@@ -0,0 +1,303 @@
/*
* Cycle 3 Phase 6 — QPU bench for VP9 8-tap "regular" subpel filter,
* horizontal, 8-wide output on V3D 7.1.
*
* Reports:
* M1''' (correctness): QPU output vs C reference, N blocks across
* all 16 mx phases
* M2''' (throughput): QPU sustained Mblock/s
*
* Per k3_mc_phase4.md §5 (revised per phase5''' findings 4 + 6):
* - src_off is the RAW block base (no +3 shift)
* - assert(dst_stride_u8 >= 8 && src_stride_u8 >= 15)
*
* License: BSD-2-Clause.
*/
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <stddef.h>
#include <string.h>
#include <assert.h>
#include <time.h>
#include <getopt.h>
#include <vulkan/vulkan.h>
#include "v3d_runner.h"
extern void daedalus_vp9_put_regular_8h_ref(
uint8_t *dst, ptrdiff_t dst_stride,
const uint8_t *src, ptrdiff_t src_stride,
int h, int mx, int my);
/* Per-block layout: src buffer 8 rows × 16 cols = 128 bytes. The
* C bench's src+3 convention: NEON/C ref is called with
* `src = block_base + 3, src_stride = 16`. The shader's src_off
* is the RAW block_base (no +3 shift), and the shader reads
* s[0..14] from src_off + row*stride. Together this means:
* shader's s[k] for k=0..14 = master_src[block_base + row*16 + k]
* C ref's `src[x+k-3]` for x=0..7, k=0..7 with `src = block_base+3`
* = master_src[block_base + row*16 + (x+k)]
* = master_src[block_base + row*16 + (0..14)]
* which is exactly what the shader reads. */
#define SRC_W 16
#define SRC_H 8
#define DST_W 8
#define DST_H 8
#define SRC_BYTES (SRC_H * SRC_W)
#define DST_BYTES (DST_H * DST_W)
static uint64_t xs_state;
static inline uint64_t xs(void) {
uint64_t x = xs_state;
x ^= x << 13; x ^= x >> 7; x ^= x << 17;
return xs_state = x;
}
static void gen_src(uint8_t *b) {
for (int i = 0; i < SRC_BYTES; i++) b[i] = (uint8_t)(xs() & 0xff);
}
static double now_seconds(void) {
struct timespec ts;
clock_gettime(CLOCK_MONOTONIC_RAW, &ts);
return ts.tv_sec + ts.tv_nsec * 1e-9;
}
typedef struct {
uint32_t n_blocks;
uint32_t dst_stride_u8;
uint32_t src_stride_u8;
uint32_t _pad;
} push_consts;
int main(int argc, char **argv)
{
int n_blocks = 65536;
int iters = 100;
uint64_t seed = 0;
int verify_only = 0;
const char *spv_path = "v3d_mc_8h.spv";
static struct option opts[] = {
{"blocks", required_argument, 0, 'b'},
{"iters", required_argument, 0, 'i'},
{"seed", required_argument, 0, 's'},
{"spv", required_argument, 0, 'S'},
{"verify-only", no_argument, 0, 'V'},
{0,0,0,0}
};
for (int c; (c = getopt_long(argc, argv, "b:i:s:S:V", opts, 0)) != -1;) {
switch (c) {
case 'b': n_blocks = atoi(optarg); break;
case 'i': iters = atoi(optarg); break;
case 's': seed = strtoull(optarg, 0, 0); break;
case 'S': spv_path = optarg; break;
case 'V': verify_only = 1; break;
default: return 2;
}
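}
/* Both values are used as divisors and sizes below. */
if (n_blocks < 1 || iters < 1) { fprintf(stderr, "blocks and iters must be >= 1\n"); return 2; }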
xs_state = seed ? seed : 0xabcdef1234567890ULL;
v3d_runner *r = v3d_runner_create();
if (!r) { fprintf(stderr, "v3d_runner_create failed\n"); return 1; }
printf("=== v3d MC 8h bench ===\n");
printf(" device: %s\n", v3d_runner_device_name(r));
printf(" n_blocks: %d iters: %d\n", n_blocks, iters);
/* Buffers: meta + dst + src, all blocks contiguous. */
size_t meta_bytes = (size_t) n_blocks * 4 * sizeof(uint32_t);
size_t src_bytes = (size_t) n_blocks * SRC_BYTES;
size_t dst_bytes = (size_t) n_blocks * DST_BYTES;
v3d_buffer buf_meta = {0}, buf_dst = {0}, buf_src = {0};
if (v3d_runner_create_buffer(r, meta_bytes, &buf_meta)) return 1;
if (v3d_runner_create_buffer(r, dst_bytes, &buf_dst)) return 1;
if (v3d_runner_create_buffer(r, src_bytes, &buf_src)) return 1;
uint8_t *master_src = malloc(src_bytes);
uint8_t *expected = malloc(dst_bytes);
int *mxs = malloc(n_blocks * sizeof(int));
if (!master_src || !expected || !mxs) { fprintf(stderr, "alloc\n"); return 1; }
for (int i = 0; i < n_blocks; i++) {
gen_src(master_src + (size_t)i * SRC_BYTES);
mxs[i] = (int)(xs() & 15);
}
/* Build C-ref expected. C ref takes `src + 3, src_stride = SRC_W`. */
memset(expected, 0, dst_bytes);
for (int i = 0; i < n_blocks; i++) {
daedalus_vp9_put_regular_8h_ref(
expected + (size_t)i * DST_BYTES, DST_W,
master_src + (size_t)i * SRC_BYTES + 3, SRC_W,
DST_H, mxs[i], 0);
}
/* Populate GPU buffers. Contracts (phase4 §5) enforced via asserts. */
uint32_t dst_stride_u8 = DST_W;
uint32_t src_stride_u8 = SRC_W;
assert(dst_stride_u8 >= 8 && "phase4 §5 contract 1");
assert(src_stride_u8 >= 15 && "phase4 §5 contract 2");
uint32_t *meta = (uint32_t *) buf_meta.mapped;
for (int i = 0; i < n_blocks; i++) {
/* src_off: RAW block base. NO +3 shift. (phase5''' finding 4) */
uint32_t src_off = (uint32_t)((size_t)i * SRC_BYTES);
uint32_t dst_off = (uint32_t)((size_t)i * DST_BYTES);
meta[4*i + 0] = dst_off;
meta[4*i + 1] = src_off;
meta[4*i + 2] = (uint32_t) mxs[i];
meta[4*i + 3] = 0;
}
memcpy(buf_src.mapped, master_src, src_bytes);
memset(buf_dst.mapped, 0, dst_bytes);
/* Pipeline. */
v3d_pipeline pipe = {0};
if (v3d_runner_create_pipeline(r, spv_path,
/*n_ssbos=*/3,
/*push_const_size=*/sizeof(push_consts),
&pipe)) return 1;
v3d_buffer bind_bufs[3] = { buf_meta, buf_dst, buf_src };
if (v3d_runner_bind_buffers(r, &pipe, bind_bufs, 3)) return 1;
const uint32_t blocks_per_wg = 32;
uint32_t group_count_x = (uint32_t)((n_blocks + blocks_per_wg - 1) / blocks_per_wg);
printf(" dispatch: %u WGs × 256 invocations = %u blocks (rounded up from %d)\n",
group_count_x, group_count_x * blocks_per_wg, n_blocks);
push_consts pc = {
.n_blocks = (uint32_t) n_blocks,
.dst_stride_u8 = dst_stride_u8,
.src_stride_u8 = src_stride_u8,
._pad = 0,
};
VkCommandBuffer cb = v3d_runner_alloc_cmdbuf(r);
VkCommandBufferBeginInfo cbbi = { .sType = VK_STRUCTURE_TYPE_COMMAND_BUFFER_BEGIN_INFO };
vkBeginCommandBuffer(cb, &cbbi);
vkCmdBindPipeline(cb, VK_PIPELINE_BIND_POINT_COMPUTE, pipe.pipeline);
vkCmdBindDescriptorSets(cb, VK_PIPELINE_BIND_POINT_COMPUTE,
pipe.layout, 0, 1, &pipe.desc_set, 0, NULL);
vkCmdPushConstants(cb, pipe.layout, VK_SHADER_STAGE_COMPUTE_BIT,
0, sizeof(pc), &pc);
vkCmdDispatch(cb, group_count_x, 1, 1);
vkEndCommandBuffer(cb);
/* --- M1''' bit-exact --- */
printf("\n=== M1''': QPU vs C reference bit-exact ===\n");
memset(buf_dst.mapped, 0, dst_bytes);
if (v3d_runner_submit_wait(r, cb)) return 1;
int mismatch_blocks = 0;
int total_byte_diffs = 0;
int prints = 0;
for (int i = 0; i < n_blocks; i++) {
const uint8_t *q = (uint8_t *) buf_dst.mapped + (size_t)i * DST_BYTES;
const uint8_t *e = expected + (size_t)i * DST_BYTES;
if (memcmp(q, e, DST_BYTES) != 0) {
int diffs = 0;
for (int j = 0; j < DST_BYTES; j++) if (q[j] != e[j]) diffs++;
total_byte_diffs += diffs;
if (prints < 3) {
fprintf(stderr, "MISMATCH block %d mx=%d: %d/64 bytes differ\n",
i, mxs[i], diffs);
fprintf(stderr, " ref:");
for (int r0 = 0; r0 < 8; r0++) {
fprintf(stderr, "\n r%d ", r0);
for (int c = 0; c < 8; c++) fprintf(stderr, "%3u ", e[r0*8+c]);
}
fprintf(stderr, "\n qpu:");
for (int r0 = 0; r0 < 8; r0++) {
fprintf(stderr, "\n r%d ", r0);
for (int c = 0; c < 8; c++) fprintf(stderr, "%3u ", q[r0*8+c]);
}
fprintf(stderr, "\n");
prints++;
}
mismatch_blocks++;
}
}
printf(" blocks bit-exact: %d / %d (%.4f%%)\n",
n_blocks - mismatch_blocks, n_blocks,
100.0 * (n_blocks - mismatch_blocks) / n_blocks);
printf(" total byte diffs: %d / %zu (%.4f%%)\n",
total_byte_diffs, (size_t) n_blocks * DST_BYTES,
100.0 * total_byte_diffs / ((double) n_blocks * DST_BYTES));
if (mismatch_blocks > 0) {
fprintf(stderr, "REFUSING to measure throughput on a broken kernel.\n");
v3d_runner_destroy_pipeline(r, &pipe);
v3d_runner_destroy_buffer(r, &buf_src);
v3d_runner_destroy_buffer(r, &buf_dst);
v3d_runner_destroy_buffer(r, &buf_meta);
v3d_runner_destroy(r);
return 1;
}
if (verify_only) {
v3d_runner_destroy_pipeline(r, &pipe);
v3d_runner_destroy_buffer(r, &buf_src);
v3d_runner_destroy_buffer(r, &buf_dst);
v3d_runner_destroy_buffer(r, &buf_meta);
v3d_runner_destroy(r);
return 0;
}
/* --- M2''' throughput --- */
printf("\n=== M2''': QPU throughput ===\n");
for (int i = 0; i < 10; i++) {
memset(buf_dst.mapped, 0, dst_bytes);
if (v3d_runner_submit_wait(r, cb)) return 1;
}
double t0 = now_seconds();
for (int i = 0; i < iters; i++) {
memset(buf_dst.mapped, 0, dst_bytes);
if (v3d_runner_submit_wait(r, cb)) return 1;
}
double t1 = now_seconds();
double s0 = now_seconds();
for (int i = 0; i < iters; i++) memset(buf_dst.mapped, 0, dst_bytes);
double s1 = now_seconds();
double kernel_seconds = (t1 - t0) - (s1 - s0);
double total_blocks = (double) n_blocks * iters;
double mbps = total_blocks / kernel_seconds / 1e6;
printf(" blocks/dispatch: %d\n", n_blocks);
printf(" iters: %d\n", iters);
printf(" total blocks: %.0f\n", total_blocks);
printf(" elapsed (kernel)=%.6f s\n", kernel_seconds);
printf(" elapsed (setup) =%.6f s\n", s1 - s0);
printf(" M2''' throughput = %.3f Mblock/s\n", mbps);
printf(" per-block = %.1f ns\n", kernel_seconds / total_blocks * 1e9);
printf(" per-dispatch = %.1f us\n", kernel_seconds / iters * 1e6);
double M3 = 20.997; /* from k3_mc_phase3.md */
double R = mbps / M3;
printf("\n Cycle 3 NEON M3''' = %.3f Mblock/s\n", M3);
printf(" R''' = M2'''/M3''' = %.3f\n", R);
if (R >= 1.0) printf(" decision band = GREEN: QPU beats NEON in isolation\n");
else if (R >= 0.5) printf(" decision band = YELLOW: M4''' decides\n");
else if (R >= 0.1) printf(" decision band = ORANGE: M4''' may still rescue\n");
else printf(" decision band = RED: structural mismatch\n");
/* 30fps@1080p floor check (per project_30fps_floor_is_fine.md) */
double mblocks_per_1080p = 32400.0 * 30.0 / 1e6;
printf("\n 30fps@1080p floor : %.3f Mblock/s (32400 blocks × 30 fps)\n",
mblocks_per_1080p);
printf(" isolation margin : %.1fx over 30fps floor\n",
mbps / mblocks_per_1080p);
v3d_runner_destroy_pipeline(r, &pipe);
v3d_runner_destroy_buffer(r, &buf_src);
v3d_runner_destroy_buffer(r, &buf_dst);
v3d_runner_destroy_buffer(r, &buf_meta);
v3d_runner_destroy(r);
free(master_src); free(expected); free(mxs);
return 0;
}
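The decision-band ladder at the end of this bench can be pulled out into a small pure function, which makes the thresholds easy to test in isolation. A minimal sketch using the same cut-offs (1.0, 0.5, 0.1) as the printf chain above; the `band` enum and `classify_ratio` name are hypothetical:

```c
#include <assert.h>

typedef enum { BAND_RED, BAND_ORANGE, BAND_YELLOW, BAND_GREEN } band;

/* Classify R = M2/M3 (QPU throughput over NEON throughput) into the
 * same bands the bench prints. Hypothetical helper for illustration. */
static band classify_ratio(double m2_mbs, double m3_mbs)
{
    assert(m3_mbs > 0.0);
    double r = m2_mbs / m3_mbs;
    if (r >= 1.0) return BAND_GREEN;   /* QPU beats NEON in isolation */
    if (r >= 0.5) return BAND_YELLOW;  /* M4 decides */
    if (r >= 0.1) return BAND_ORANGE;  /* M4 may still rescue */
    return BAND_RED;                   /* structural mismatch */
}
```

With the cycle-3 numbers, `classify_ratio(1.413, 20.997)` lands in `BAND_RED`, since the ratio is about 0.067.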
@@ -0,0 +1,72 @@
/*
* Standalone bit-exact C reference for VP9 8-tap "regular" subpel
* filter, horizontal direction, 8-pixel-wide output. Transcribed
* from FFmpeg's libavcodec/vp9dsp_template.c FILTER_8TAP macro
* (vendored at external/ffmpeg-snapshot/). 8-bit pixels only.
*
* Filter coefficients embedded inline (REGULAR filter only, all 16
* subpel phases). Same values as ff_vp9_subpel_filters[1][mx] in
* external/ffmpeg-snapshot/libavcodec/vp9_subpel_filters_table.c.
*
* License: LGPL-2.1-or-later.
*
* Spec source: VP9 specification §8.5.1 — subpel motion compensation.
*/
#include <stdint.h>
#include <stddef.h>
static const int16_t vp9_8tap_regular_filters[16][8] = {
{ 0, 0, 0, 128, 0, 0, 0, 0 },
{ 0, 1, -5, 126, 8, -3, 1, 0 },
{ -1, 3, -10, 122, 18, -6, 2, 0 },
{ -1, 4, -13, 118, 27, -9, 3, -1 },
{ -1, 4, -16, 112, 37, -11, 4, -1 },
{ -1, 5, -18, 105, 48, -14, 4, -1 },
{ -1, 5, -19, 97, 58, -16, 5, -1 },
{ -1, 6, -19, 88, 68, -18, 5, -1 },
{ -1, 6, -19, 78, 78, -19, 6, -1 },
{ -1, 5, -18, 68, 88, -19, 6, -1 },
{ -1, 5, -16, 58, 97, -19, 5, -1 },
{ -1, 4, -14, 48, 105, -18, 5, -1 },
{ -1, 4, -11, 37, 112, -16, 4, -1 },
{ -1, 3, -9, 27, 118, -13, 4, -1 },
{ 0, 2, -6, 18, 122, -10, 3, -1 },
{ 0, 1, -3, 8, 126, -5, 1, 0 },
};
static inline uint8_t clip_u8(int x)
{
return (uint8_t)(x > 255 ? 255 : x < 0 ? 0 : x);
}
/*
* 8-wide horizontal 8-tap "put" (non-averaging), h rows. Width hard-coded 8.
* `src` must point at the row-0 output-column-0 source pixel; valid
* source memory must extend src[r*src_stride + (-3..+11)] for r=0..h-1.
* `dst` is written at dst[r*dst_stride + 0..7] for r=0..h-1.
*
* Matches ff_vp9_put_regular8_h_neon byte-for-byte on 8-bit input.
*/
void daedalus_vp9_put_regular_8h_ref(uint8_t *dst, ptrdiff_t dst_stride,
const uint8_t *src, ptrdiff_t src_stride,
int h, int mx, int my)
{
(void) my; /* horizontal-only filter ignores y phase */
const int16_t *F = vp9_8tap_regular_filters[mx & 15];
for (int r = 0; r < h; r++) {
for (int x = 0; x < 8; x++) {
int sum = F[0] * (int) src[x - 3]
+ F[1] * (int) src[x - 2]
+ F[2] * (int) src[x - 1]
+ F[3] * (int) src[x + 0]
+ F[4] * (int) src[x + 1]
+ F[5] * (int) src[x + 2]
+ F[6] * (int) src[x + 3]
+ F[7] * (int) src[x + 4];
dst[x] = clip_u8((sum + 64) >> 7);
}
dst += dst_stride;
src += src_stride;
}
}
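The src+3 convention is the detail the Phase 5''' review flagged, so it is worth checking on a single row. The sketch below models the two indexing conventions side by side: the reference reads `src[x + k - 3]` with `src = row + 3`, while the shader reads `s[x + k]` from the raw row base. It is an illustrative model only, with hypothetical function names, and it uses just the mx = 8 coefficients from the table above:

```c
#include <assert.h>
#include <stdint.h>

static uint8_t clip_u8_demo(int x)
{
    return (uint8_t)(x > 255 ? 255 : x < 0 ? 0 : x);
}

/* Reference convention: `src` points at output column 0,
 * so the taps for output x span src[x-3 .. x+4]. */
static void row_centered(uint8_t out[8], const uint8_t *src, const int16_t F[8])
{
    for (int x = 0; x < 8; x++) {
        int sum = 0;
        for (int k = 0; k < 8; k++) sum += F[k] * (int) src[x + k - 3];
        out[x] = clip_u8_demo((sum + 64) >> 7);
    }
}

/* Shader convention: `base` is the raw row start (no +3 shift),
 * so the taps for output x span base[x .. x+7]. */
static void row_raw(uint8_t out[8], const uint8_t *base, const int16_t F[8])
{
    for (int x = 0; x < 8; x++) {
        int sum = 0;
        for (int k = 0; k < 8; k++) sum += F[k] * (int) base[x + k];
        out[x] = clip_u8_demo((sum + 64) >> 7);
    }
}

/* Returns 1 when both conventions agree on one 16-byte row, phase mx = 8. */
static int conventions_agree(const uint8_t row[16])
{
    static const int16_t F8[8] = { -1, 6, -19, 78, 78, -19, 6, -1 };
    uint8_t a[8], b[8];
    row_centered(a, row + 3, F8); /* what the C ref / NEON call receives */
    row_raw(b, row, F8);          /* what the shader's src_off points at */
    for (int x = 0; x < 8; x++)
        if (a[x] != b[x]) return 0;
    return 1;
}
```

Passing `row + 3` to `row_raw` instead would shift every tap three pixels to the right, and output columns 5 to 7 would then read past the 16-byte row. That is the off-by-3 the review caught.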