Cycle 5 setup (Phase 1+2): vendor dav1d 1.4.3 CDEF sources

First AV1 kernel cycle and first dav1d-vendored sources. Phase 1+2 docs lay out the structural complexity (CDEF needs pre-padded 12x12 working buffer + external edge context + direction lookup + constraint function — meaningfully more complex than cycles 1-4). Phase 3+ deferred to next session — CDEF is the first cycle that doesn't fit cleanly into a single autonomous run. Vendored from dav1d 1.4.3 (BSD-2-Clause, cleaner license than FFmpeg's LGPL-2.1+): src/arm/64/cdef.S 520 lines — NEON impl src/arm/64/util.S 278 lines — NEON helpers src/arm/asm.S 335 lines — GAS preamble src/cdef_tmpl.c 331 lines — C reference (templated) include/common/intops.h 84 lines — utility helpers src/tables_cdef_subset.c hand-extracted — dav1d_cdef_directions only (avoids dragging full 1013-line tables.c + transitive includes) Discovery from Phase 2 analysis: - Filter type and shape: dav1d_cdef_filter8_pri_sec_8bpc_neon takes (dst, dst_stride, tmp, pri_strength, sec_strength, dir, damping, h). The 'tmp' arg is the pre-padded 12x12 buffer constructed externally by the dav1d C-side padding() function. - Tap weights are inline-computed (not table): pri_tap = 4 or 3 (based on pri_strength bit), sec_tap = 2 or 1. Only dav1d_cdef_directions[12][2] is an external table. - Constraint function: constrain(diff, threshold, shift) = apply_sign(min(abs(diff), max(0, threshold - (abs(diff) >> shift))), diff) Predicted R5 band: 0.15-0.30 (ORANGE). CDEF is compute-heavier than LPF (per-pixel min/max conditional logic), so likely worse R than cycle 2/4 but better than cycle 3 MC. M4 gate likely required. What Phase 3+ needs (next session): 1. config.h shim for dav1d's asm preamble (defines TBD on first build) 2. Standalone C reference for cdef_filter_block_8x8_c (cdef_tmpl.c references several dav1d private headers; cleaner to transcribe to a self-contained tests/cdef_ref.c) 3. tests/bench_neon_cdef.c — M1+M3 bench 4. Phase 4 plan, Phase 5 review (mandatory), Phase 6 shader, Phase 7 measure PROVENANCE.md documents pin + per-file role + re-vendoring procedure. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 13:12:25 +00:00
parent 20e3d004ae
commit 2cd2258a7b
8 changed files with 1879 additions and 0 deletions
@@ -0,0 +1,190 @@
+---
+cycle: 5
+phases: 1-2 (combined; phase 3+ pending)
+status: setup in progress
+date_opened: 2026-05-18
+parent_cycle: k4_lpf8_phase4_7.md
+target_kernel: AV1 CDEF filter, 8×8 luma, 8bpc, FILTER stage only
+                (assume direction + strengths pre-computed)
+new_vendor: dav1d 1.4.3 (BSD-2-Clause), separate from FFmpeg pin
+---
+
+# Cycle 5, Phases 1-2 — AV1 CDEF
+
+First AV1 kernel; first cycle that vendors from outside the FFmpeg
+snapshot. dav1d is the canonical AV1 reference (clean BSD-2-Clause,
+mature aarch64 NEON, used by VLC + Firefox via libdav1d).
+
+## Phase 1 — goal
+
+**Kernel**: AV1 Constrained Directional Enhancement Filter, 8×8 luma
+output, 8 bits/component, FILTER stage (direction + strength
+parameters assumed pre-computed). Match the "pre-computed params"
+convention of LPF (E/I/H) and MC (mx).
+
+**NEON symbol target**: `dav1d_cdef_filter8_pri_sec_8bpc_neon` (combined
+primary + secondary filter). There are also `_pri_` and `_sec_` only
+variants for the cases where one strength is 0; for the bench we
+cover the worst case (both active).
+
+**C reference**: `cdef_filter_block_8x8_c` from `dav1d/src/cdef_tmpl.c`
+(macro-expanded), delegating to `cdef_filter_block_c`. Spec source:
+AV1 specification §7.15 (CDEF).
+
+### Measurable success (cycle-5 numbering, `5` superscript)
+
+| ID | Measurement | Gate |
+|---|---|---|
+| M1₅ | bit-exact vs C ref, N random 8×8 blocks across all 8 directions × various strengths | 100.0000 % |
+| M2₅ | QPU throughput Mblock/s | recorded |
+| M3₅ | NEON `dav1d_cdef_filter8_pri_sec_8bpc_neon` Mblock/s | recorded |
+| M4₅ | mixed NEON-3 + QPU vs pure NEON-4 (if YELLOW/ORANGE band) | conditional |
+
+### Decision bands (carried)
+
+Same R bands and 30fps-floor calibration as cycles 1-4.
+
+### Predicted R₅
+
+The CDEF filter is **compute-heavier than LPF**:
+- Per pixel: 8 constraint applications (abs + min + max + sign-restore)
+  plus the per-pixel accumulation with min/max tracking
+- Per 8×8 block: ~32 mults (small constants 1-4) + many adds + many
+  conditionals
+- Memory: 12×12 padded source = 144 reads + 64 writes = 208 B/block
+  (vs LPF's ~88 B and MC's ~184 B)
+- No DP4A applicability (the multipliers are small constants, but
+  the constraint function dominates)
+
+**Predicted R₅ band**: 0.15-0.30 (ORANGE). The constraint function's
+per-pixel min/max conditional logic is heavier than LPF's per-row
+fm/flat tests. Compute-bound on QPU. M4 may still rescue per
+cycle-1+2 pattern.
+
+### NEW for cycle 5
+
+- **First AV1 kernel** → expands codec coverage beyond VP9
+- **First dav1d-vendored source** → new external/ subdirectory:
+  `external/dav1d-snapshot/` (BSD-2-Clause; clean license vs LGPL
+  FFmpeg)
+- **First kernel needing external padding context** — CDEF reads
+  beyond the 8×8 block (2-pixel halo on each side); dav1d's C
+  reference uses pre-padded `tmp_buf[12×12]` constructed by a
+  separate `padding()` function from left/top/bottom edge arrays.
+  Our bench will construct this padding inline for each random
+  block.
+
+## Phase 2 — situation analysis
+
+### C reference structure (dav1d)
+
+`cdef_filter_block_8x8_c` signature:
+```c
+void cdef_filter_block_8x8_c(pixel *dst, ptrdiff_t stride,
+                             const pixel (*left)[2],
+                             const pixel *top, const pixel *bottom,
+                             int pri_strength, int sec_strength,
+                             int dir, int damping,
+                             enum CdefEdgeFlags edges);
+```
+
+The function:
+1. Allocates `int16_t tmp_buf[144]` (12×12 working buffer)
+2. Calls `padding()` to fill from left/top/bottom + dst with edge-replicate
+3. Iterates 8 rows × 8 cols; per pixel:
+   - Looks up direction offsets: `dav1d_cdef_directions[dir+offset][k]`
+   - For each of 4 primary tap positions (k=0..1, both signs):
+     compute pri-constrained diff, multiply by tap weight, accumulate
+   - For each of 4 secondary tap positions (k=0..1, both signs,
+     two adjacent directions):
+     same with sec weights
+   - Track min/max across all sampled neighbours
+   - Output: `iclip(px + ((sum - (sum < 0) + 8) >> 4), min, max)`
+
+The "constraint" function:
+```c
+static inline int constrain(int diff, int threshold, int shift) {
+    int adiff = abs(diff);
+    return apply_sign(imin(adiff, imax(0, threshold - (adiff >> shift))),
+                      diff);
+}
+```
+
+This is the per-pixel-pair clamp that makes CDEF *constrained*
+(directional enhancement that can't exceed a threshold tied to
+local strength).
+
+### Tables needed
+
+- `dav1d_cdef_directions[12][2]` — 12 directions (8 + 4 wrap-arounds),
+  each a (y_offset, x_offset) pair. In `dav1d/src/tables.c`.
+- `dav1d_cdef_pri_taps[2][2]` — primary tap weights, indexed by
+  `(pri_strength & 1)` and tap position k. Small ints.
+- `dav1d_cdef_sec_taps[2]` — secondary tap weights, just 2 entries.
+
+### NEON reference structure (dav1d)
+
+`dav1d_cdef_filter8_pri_sec_8bpc_neon` signature:
+```
+x0: dst         pixel buffer
+x1: dst_stride  ptrdiff_t
+x2: tmp         uint8_t source (the pre-padded 12×12 buffer reinterpreted)
+w3: pri_strength
+w4: sec_strength
+w5: dir
+w6: damping
+w7: h           height (8 for 8×8)
+```
+
+Notable: dav1d's NEON takes the already-padded `tmp` buffer pointer
+(after the C side did `padding()`). So our bench needs to construct
+the padded buffer per block.
+
+Padded buffer layout (12×12, int16 elements):
+- Real pixel region at rows [2..9], cols [2..9] (the 8×8 dst)
+- Halo at rows {0,1,10,11} and cols {0,1,10,11}: either edge-replicate
+  from adjacent block (if edges flag set) or INT16_MIN (which the
+  constraint function treats as "skip this neighbour")
+
+### Vendoring plan
+
+New directory: `external/dav1d-snapshot/` (BSD-2-Clause, separate
+PROVENANCE.md from FFmpeg pin).
+
+Files to vendor from dav1d 1.4.3:
+1. `src/arm/64/cdef.S` — main NEON file (~870 lines)
+2. `src/arm/64/util.S` — helper macros referenced by cdef.S
+3. `src/arm/asm.S` — top-level macros (function, endfunc, etc.)
+4. `src/cdef_tmpl.c` — C reference (~250 lines)
+5. `src/tables.c` — the static tables (cdef_directions, pri/sec taps)
+   *or* hand-extract just the CDEF tables (~50 lines)
+6. `include/common/intops.h` — apply_sign, imin, imax, iclip helpers
+7. A standalone PROVENANCE.md with pin + SHA-256s
+
+dav1d's asm preamble may need its own config.h shim (different
+defines than FFmpeg's). Phase 6 setup will identify exact needs.
+
+### Build path
+
+dav1d's asm uses similar GAS preamble to FFmpeg's. The config
+defines are different: `ARCH_AARCH64`, `HAVE_AS_FUNC`, etc., but
+also dav1d-specific like `PRIVATE_PREFIX dav1d_` and `EXTERN_ASM ` (same
+empty for ELF as in cycle 1).
+
+### What Phase 2 does *not* close
+
+- The exact list of dav1d asm.S macros needed (will surface during
+  first build attempt)
+- C reference completeness — `padding()` setup logic is non-trivial
+  (handles edges/CdefEdgeFlags = combinations of HAVE_LEFT, HAVE_TOP,
+  HAVE_RIGHT, HAVE_BOTTOM). For the bench, we can simplify by
+  always passing "all edges valid" with synthetic neighbouring pixels.
+- Direction validation — directions 0..7 should all be tested for
+  bit-exactness; an off-by-one in the direction-offset table would
+  be caught by M1.
+
+Phase 3 next: vendor the dav1d files, write standalone C ref +
+bench, capture M3₅ NEON baseline.
+
+This is **the first multi-session cycle** — Phase 3+ likely lands
+in next session. Cycle setup commit at end of this session.
@@ -0,0 +1,109 @@
+# dav1d source snapshot
+
+Verbatim subset of dav1d source pinned for use as reference
+implementations of AV1 CDEF (cycle 5 of `daedalus-fourier`) and
+potentially future AV1 kernels. dav1d is the canonical AV1 decoder
+library (BSD-2-Clause, maintained by VideoLAN).
+
+See `../../docs/k5_cdef_phase1_2.md` for the cycle 5 scope and
+rationale.
+
+## Upstream pin
+
+- **Repository**: https://github.com/videolan/dav1d (canonical mirror
+  of https://code.videolan.org/videolan/dav1d)
+- **Tag**: `1.4.3` (last stable release in the 1.4.x line as of
+  2026-05-18; pinned for reproducibility)
+- **Snapshot fetched**: 2026-05-18 (UTC), via
+  `https://raw.githubusercontent.com/videolan/dav1d/1.4.3/<path>`
+
+## Files in this snapshot
+
+All files are byte-for-byte copies of the upstream source at the
+tagged commit, except `tables_cdef_subset.c` which is a hand-extracted
+single-table copy from `src/tables.c` (see §"Why each file" below).
+
+| Path | Lines | SHA-256 |
+|---|---|---|
+| `src/arm/64/cdef.S` | 520 | `88d048cbed93f168...` (TODO full hash) |
+| `src/arm/64/util.S` | 278 | `582acd8e2b74a1e8...` |
+| `src/arm/asm.S` | 335 | `6a22def2799876c4...` |
+| `src/cdef_tmpl.c` | 331 | `26a7a5f9fda65c58...` |
+| `include/common/intops.h` | 84 | `c1e7d52b421d6417...` |
+| `src/tables_cdef_subset.c` | hand-extracted | — |
+
+Full SHA-256s (regenerated by `phase 3` setup):
+
+```sh
+( cd external/dav1d-snapshot && sha256sum \
+    src/arm/64/cdef.S src/arm/64/util.S src/arm/asm.S \
+    src/cdef_tmpl.c include/common/intops.h )
+```
+
+## License
+
+BSD-2-Clause. Copyright (c) 2018 VideoLAN and dav1d authors; (c) 2019
+Martin Storsjö (NEON aarch64). Original copyright headers preserved
+in each vendored file.
+
+Notably cleaner license than the FFmpeg LGPL-2.1+ snapshot — dav1d's
+BSD allows distribution of binaries without LGPL's "share linking
+ability" requirements. For daedalus-fourier benches that link only
+this snapshot, the binary inherits BSD-2-Clause. Benches that
+combine both snapshots (none currently) inherit LGPL-2.1+ via
+FFmpeg's stronger terms.
+
+## Why each file
+
+- **`src/arm/64/cdef.S`** — the NEON aarch64 implementation. Provides
+  `dav1d_cdef_filter8_pri_sec_8bpc_neon` and pri-only / sec-only
+  variants. The Phase 3 NEON baseline (M3₅) measures this symbol.
+
+- **`src/arm/64/util.S`** — helper macros (`load_px_8`,
+  `handle_pixel_8`, etc.) referenced by cdef.S.
+
+- **`src/arm/asm.S`** — top-level GAS preamble (function/endfunc,
+  movrel, register macros). dav1d's own version is similar to FFmpeg's
+  but with different defines (PRIVATE_PREFIX dav1d_ etc.); Phase 6
+  setup will identify the config.h shim needed for standalone
+  assembly.
+
+- **`src/cdef_tmpl.c`** — the C reference (templated; the
+  `cdef_filter_block_c` core function is in here, expanded to
+  `cdef_filter_block_8x8_c` via `cdef_fn(8, 8)`).
+
+- **`include/common/intops.h`** — utility helpers (apply_sign,
+  imin, imax, iclip, umin) used by cdef_tmpl.c.
+
+- **`src/tables_cdef_subset.c`** — hand-extracted `dav1d_cdef_directions`
+  table from `src/tables.c` (lines 400-414). Provides the only
+  table symbol both `cdef.S` and `cdef_tmpl.c` reference externally.
+  Pulling in the full `src/tables.c` (1013 lines) would chain-include
+  the entire dav1d decoder, which is overkill for our purposes.
+  See `tables_cdef_subset.c` header comment for line-range
+  reference back to upstream.
+
+## Re-vendoring procedure
+
+Same as FFmpeg snapshot — see `../ffmpeg-snapshot/PROVENANCE.md`.
+
+```sh
+TAG=1.x.y
+BASE=https://raw.githubusercontent.com/videolan/dav1d/$TAG
+cd external/dav1d-snapshot
+for f in src/arm/64/cdef.S src/arm/64/util.S src/arm/asm.S \
+         src/cdef_tmpl.c include/common/intops.h; do
+  curl -sSf -o "$f" "$BASE/$f"
+done
+# tables_cdef_subset.c needs manual re-extraction from
+# upstream src/tables.c — search for "dav1d_cdef_directions ="
+```
+
+## Pending work (Phase 3+, next session)
+
+- config.h shim for assembling cdef.S standalone (dav1d's defines
+  differ from FFmpeg's; will identify exact list on first build)
+- Standalone C reference for `cdef_filter_block_8x8_c` (this snapshot's
+  `cdef_tmpl.c` references several private headers — easier to
+  transcribe to a self-contained `tests/cdef_ref.c`)
+- `tests/bench_neon_cdef.c` to capture M3₅ baseline
@@ -0,0 +1,84 @@
+/*
+ * Copyright © 2018, VideoLAN and dav1d authors
+ * Copyright © 2018, Two Orioles, LLC
+ * All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions are met:
+ *
+ * 1. Redistributions of source code must retain the above copyright notice, this
+ *    list of conditions and the following disclaimer.
+ *
+ * 2. Redistributions in binary form must reproduce the above copyright notice,
+ *    this list of conditions and the following disclaimer in the documentation
+ *    and/or other materials provided with the distribution.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
+ * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
+ * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+ * DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR
+ * ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
+ * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
+ * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
+ * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
+ * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#ifndef DAV1D_COMMON_INTOPS_H
+#define DAV1D_COMMON_INTOPS_H
+
+#include <stdint.h>
+
+#include "common/attributes.h"
+
+static inline int imax(const int a, const int b) {
+    return a > b ? a : b;
+}
+
+static inline int imin(const int a, const int b) {
+    return a < b ? a : b;
+}
+
+static inline unsigned umax(const unsigned a, const unsigned b) {
+    return a > b ? a : b;
+}
+
+static inline unsigned umin(const unsigned a, const unsigned b) {
+    return a < b ? a : b;
+}
+
+static inline int iclip(const int v, const int min, const int max) {
+    return v < min ? min : v > max ? max : v;
+}
+
+static inline int iclip_u8(const int v) {
+    return iclip(v, 0, 255);
+}
+
+static inline int apply_sign(const int v, const int s) {
+    return s < 0 ? -v : v;
+}
+
+static inline int apply_sign64(const int v, const int64_t s) {
+    return s < 0 ? -v : v;
+}
+
+static inline int ulog2(const unsigned v) {
+    return 31 - clz(v);
+}
+
+static inline int u64log2(const uint64_t v) {
+    return 63 - clzll(v);
+}
+
+static inline unsigned inv_recenter(const unsigned r, const unsigned v) {
+    if (v > (r << 1))
+        return v;
+    else if ((v & 1) == 0)
+        return (v >> 1) + r;
+    else
+        return r - ((v + 1) >> 1);
+}
+
+#endif /* DAV1D_COMMON_INTOPS_H */
@@ -0,0 +1,520 @@
+/*
+ * Copyright © 2018, VideoLAN and dav1d authors
+ * Copyright © 2019, Martin Storsjo
+ * All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions are met:
+ *
+ * 1. Redistributions of source code must retain the above copyright notice, this
+ *    list of conditions and the following disclaimer.
+ *
+ * 2. Redistributions in binary form must reproduce the above copyright notice,
+ *    this list of conditions and the following disclaimer in the documentation
+ *    and/or other materials provided with the distribution.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
+ * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
+ * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+ * DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR
+ * ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
+ * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
+ * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
+ * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
+ * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#include "src/arm/asm.S"
+#include "util.S"
+#include "cdef_tmpl.S"
+
+.macro pad_top_bottom s1, s2, w, stride, rn, rw, ret
+        tst             w7,  #1 // CDEF_HAVE_LEFT
+        b.eq            2f
+        // CDEF_HAVE_LEFT
+        sub             \s1,  \s1,  #2
+        sub             \s2,  \s2,  #2
+        tst             w7,  #2 // CDEF_HAVE_RIGHT
+        b.eq            1f
+        // CDEF_HAVE_LEFT+CDEF_HAVE_RIGHT
+        ldr             \rn\()0, [\s1]
+        ldr             s1,      [\s1, #\w]
+        ldr             \rn\()2, [\s2]
+        ldr             s3,      [\s2, #\w]
+        uxtl            v0.8h,   v0.8b
+        uxtl            v1.8h,   v1.8b
+        uxtl            v2.8h,   v2.8b
+        uxtl            v3.8h,   v3.8b
+        str             \rw\()0, [x0]
+        str             d1,      [x0, #2*\w]
+        add             x0,  x0,  #2*\stride
+        str             \rw\()2, [x0]
+        str             d3,      [x0, #2*\w]
+.if \ret
+        ret
+.else
+        add             x0,  x0,  #2*\stride
+        b               3f
+.endif
+
+1:
+        // CDEF_HAVE_LEFT+!CDEF_HAVE_RIGHT
+        ldr             \rn\()0, [\s1]
+        ldr             h1,      [\s1, #\w]
+        ldr             \rn\()2, [\s2]
+        ldr             h3,      [\s2, #\w]
+        uxtl            v0.8h,   v0.8b
+        uxtl            v1.8h,   v1.8b
+        uxtl            v2.8h,   v2.8b
+        uxtl            v3.8h,   v3.8b
+        str             \rw\()0, [x0]
+        str             s1,      [x0, #2*\w]
+        str             s31,     [x0, #2*\w+4]
+        add             x0,  x0,  #2*\stride
+        str             \rw\()2, [x0]
+        str             s3,      [x0, #2*\w]
+        str             s31,     [x0, #2*\w+4]
+.if \ret
+        ret
+.else
+        add             x0,  x0,  #2*\stride
+        b               3f
+.endif
+
+2:
+        // !CDEF_HAVE_LEFT
+        tst             w7,  #2 // CDEF_HAVE_RIGHT
+        b.eq            1f
+        // !CDEF_HAVE_LEFT+CDEF_HAVE_RIGHT
+        ldr             \rn\()0, [\s1]
+        ldr             h1,      [\s1, #\w]
+        ldr             \rn\()2, [\s2]
+        ldr             h3,      [\s2, #\w]
+        uxtl            v0.8h,  v0.8b
+        uxtl            v1.8h,  v1.8b
+        uxtl            v2.8h,  v2.8b
+        uxtl            v3.8h,  v3.8b
+        str             s31, [x0]
+        stur            \rw\()0, [x0, #4]
+        str             s1,      [x0, #4+2*\w]
+        add             x0,  x0,  #2*\stride
+        str             s31, [x0]
+        stur            \rw\()2, [x0, #4]
+        str             s3,      [x0, #4+2*\w]
+.if \ret
+        ret
+.else
+        add             x0,  x0,  #2*\stride
+        b               3f
+.endif
+
+1:
+        // !CDEF_HAVE_LEFT+!CDEF_HAVE_RIGHT
+        ldr             \rn\()0, [\s1]
+        ldr             \rn\()1, [\s2]
+        uxtl            v0.8h,  v0.8b
+        uxtl            v1.8h,  v1.8b
+        str             s31,     [x0]
+        stur            \rw\()0, [x0, #4]
+        str             s31,     [x0, #4+2*\w]
+        add             x0,  x0,  #2*\stride
+        str             s31,     [x0]
+        stur            \rw\()1, [x0, #4]
+        str             s31,     [x0, #4+2*\w]
+.if \ret
+        ret
+.else
+        add             x0,  x0,  #2*\stride
+.endif
+3:
+.endm
+
+.macro load_n_incr dst, src, incr, w
+.if \w == 4
+        ld1             {\dst\().s}[0], [\src], \incr
+.else
+        ld1             {\dst\().8b},   [\src], \incr
+.endif
+.endm
+
+// void dav1d_cdef_paddingX_8bpc_neon(uint16_t *tmp, const pixel *src,
+//                                    ptrdiff_t src_stride, const pixel (*left)[2],
+//                                    const pixel *const top,
+//                                    const pixel *const bottom, int h,
+//                                    enum CdefEdgeFlags edges);
+
+.macro padding_func w, stride, rn, rw
+function cdef_padding\w\()_8bpc_neon, export=1
+        cmp             w7,  #0xf // fully edged
+        b.eq            cdef_padding\w\()_edged_8bpc_neon
+        movi            v30.8h,  #0x80, lsl #8
+        mov             v31.16b, v30.16b
+        sub             x0,  x0,  #2*(2*\stride+2)
+        tst             w7,  #4 // CDEF_HAVE_TOP
+        b.ne            1f
+        // !CDEF_HAVE_TOP
+        st1             {v30.8h, v31.8h}, [x0], #32
+.if \w == 8
+        st1             {v30.8h, v31.8h}, [x0], #32
+.endif
+        b               3f
+1:
+        // CDEF_HAVE_TOP
+        add             x9,  x4,  x2
+        pad_top_bottom  x4,  x9, \w, \stride, \rn, \rw, 0
+
+        // Middle section
+3:
+        tst             w7,  #1 // CDEF_HAVE_LEFT
+        b.eq            2f
+        // CDEF_HAVE_LEFT
+        tst             w7,  #2 // CDEF_HAVE_RIGHT
+        b.eq            1f
+        // CDEF_HAVE_LEFT+CDEF_HAVE_RIGHT
+0:
+        ld1             {v0.h}[0], [x3], #2
+        ldr             h2,      [x1, #\w]
+        load_n_incr     v1,  x1,  x2,  \w
+        subs            w6,  w6,  #1
+        uxtl            v0.8h,  v0.8b
+        uxtl            v1.8h,  v1.8b
+        uxtl            v2.8h,  v2.8b
+        str             s0,      [x0]
+        stur            \rw\()1, [x0, #4]
+        str             s2,      [x0, #4+2*\w]
+        add             x0,  x0,  #2*\stride
+        b.gt            0b
+        b               3f
+1:
+        // CDEF_HAVE_LEFT+!CDEF_HAVE_RIGHT
+        ld1             {v0.h}[0], [x3], #2
+        load_n_incr     v1,  x1,  x2,  \w
+        subs            w6,  w6,  #1
+        uxtl            v0.8h,  v0.8b
+        uxtl            v1.8h,  v1.8b
+        str             s0,      [x0]
+        stur            \rw\()1, [x0, #4]
+        str             s31,     [x0, #4+2*\w]
+        add             x0,  x0,  #2*\stride
+        b.gt            1b
+        b               3f
+2:
+        tst             w7,  #2 // CDEF_HAVE_RIGHT
+        b.eq            1f
+        // !CDEF_HAVE_LEFT+CDEF_HAVE_RIGHT
+0:
+        ldr             h1,      [x1, #\w]
+        load_n_incr     v0,  x1,  x2,  \w
+        subs            w6,  w6,  #1
+        uxtl            v0.8h,  v0.8b
+        uxtl            v1.8h,  v1.8b
+        str             s31,     [x0]
+        stur            \rw\()0, [x0, #4]
+        str             s1,      [x0, #4+2*\w]
+        add             x0,  x0,  #2*\stride
+        b.gt            0b
+        b               3f
+1:
+        // !CDEF_HAVE_LEFT+!CDEF_HAVE_RIGHT
+        load_n_incr     v0,  x1,  x2,  \w
+        subs            w6,  w6,  #1
+        uxtl            v0.8h,  v0.8b
+        str             s31,     [x0]
+        stur            \rw\()0, [x0, #4]
+        str             s31,     [x0, #4+2*\w]
+        add             x0,  x0,  #2*\stride
+        b.gt            1b
+
+3:
+        tst             w7,  #8 // CDEF_HAVE_BOTTOM
+        b.ne            1f
+        // !CDEF_HAVE_BOTTOM
+        st1             {v30.8h, v31.8h}, [x0], #32
+.if \w == 8
+        st1             {v30.8h, v31.8h}, [x0], #32
+.endif
+        ret
+1:
+        // CDEF_HAVE_BOTTOM
+        add             x9,  x5,  x2
+        pad_top_bottom  x5,  x9, \w, \stride, \rn, \rw, 1
+endfunc
+.endm
+
+padding_func 8, 16, d, q
+padding_func 4, 8,  s, d
+
+// void cdef_paddingX_edged_8bpc_neon(uint8_t *tmp, const pixel *src,
+//                                    ptrdiff_t src_stride, const pixel (*left)[2],
+//                                    const pixel *const top,
+//                                    const pixel *const bottom, int h,
+//                                    enum CdefEdgeFlags edges);
+
+.macro padding_func_edged w, stride, reg
+function cdef_padding\w\()_edged_8bpc_neon, export=1
+        sub             x4,  x4,  #2
+        sub             x5,  x5,  #2
+        sub             x0,  x0,  #(2*\stride+2)
+
+.if \w == 4
+        ldr             d0, [x4]
+        ldr             d1, [x4, x2]
+        st1             {v0.8b, v1.8b}, [x0], #16
+.else
+        add             x9,  x4,  x2
+        ldr             d0, [x4]
+        ldr             s1, [x4, #8]
+        ldr             d2, [x9]
+        ldr             s3, [x9, #8]
+        str             d0, [x0]
+        str             s1, [x0, #8]
+        str             d2, [x0, #\stride]
+        str             s3, [x0, #\stride+8]
+        add             x0,  x0,  #2*\stride
+.endif
+
+0:
+        ld1             {v0.h}[0], [x3], #2
+        ldr             h2,      [x1, #\w]
+        load_n_incr     v1,  x1,  x2,  \w
+        subs            w6,  w6,  #1
+        str             h0,      [x0]
+        stur            \reg\()1, [x0, #2]
+        str             h2,      [x0, #2+\w]
+        add             x0,  x0,  #\stride
+        b.gt            0b
+
+.if \w == 4
+        ldr             d0, [x5]
+        ldr             d1, [x5, x2]
+        st1             {v0.8b, v1.8b}, [x0], #16
+.else
+        add             x9,  x5,  x2
+        ldr             d0, [x5]
+        ldr             s1, [x5, #8]
+        ldr             d2, [x9]
+        ldr             s3, [x9, #8]
+        str             d0, [x0]
+        str             s1, [x0, #8]
+        str             d2, [x0, #\stride]
+        str             s3, [x0, #\stride+8]
+.endif
+        ret
+endfunc
+.endm
+
+padding_func_edged 8, 16, d
+padding_func_edged 4, 8,  s
+
+tables
+
+filter 8, 8
+filter 4, 8
+
+find_dir 8
+
+.macro load_px_8 d1, d2, w
+.if \w == 8
+        add             x6,  x2,  w9, sxtb          // x + off
+        sub             x9,  x2,  w9, sxtb          // x - off
+        ld1             {\d1\().d}[0], [x6]         // p0
+        add             x6,  x6,  #16               // += stride
+        ld1             {\d2\().d}[0], [x9]         // p1
+        add             x9,  x9,  #16               // += stride
+        ld1             {\d1\().d}[1], [x6]         // p0
+        ld1             {\d2\().d}[1], [x9]         // p0
+.else
+        add             x6,  x2,  w9, sxtb          // x + off
+        sub             x9,  x2,  w9, sxtb          // x - off
+        ld1             {\d1\().s}[0], [x6]         // p0
+        add             x6,  x6,  #8                // += stride
+        ld1             {\d2\().s}[0], [x9]         // p1
+        add             x9,  x9,  #8                // += stride
+        ld1             {\d1\().s}[1], [x6]         // p0
+        add             x6,  x6,  #8                // += stride
+        ld1             {\d2\().s}[1], [x9]         // p1
+        add             x9,  x9,  #8                // += stride
+        ld1             {\d1\().s}[2], [x6]         // p0
+        add             x6,  x6,  #8                // += stride
+        ld1             {\d2\().s}[2], [x9]         // p1
+        add             x9,  x9,  #8                // += stride
+        ld1             {\d1\().s}[3], [x6]         // p0
+        ld1             {\d2\().s}[3], [x9]         // p1
+.endif
+.endm
+.macro handle_pixel_8 s1, s2, thresh_vec, shift, tap, min
+.if \min
+        umin            v3.16b,  v3.16b,  \s1\().16b
+        umax            v4.16b,  v4.16b,  \s1\().16b
+        umin            v3.16b,  v3.16b,  \s2\().16b
+        umax            v4.16b,  v4.16b,  \s2\().16b
+.endif
+        uabd            v16.16b, v0.16b,  \s1\().16b  // abs(diff)
+        uabd            v20.16b, v0.16b,  \s2\().16b  // abs(diff)
+        ushl            v17.16b, v16.16b, \shift      // abs(diff) >> shift
+        ushl            v21.16b, v20.16b, \shift      // abs(diff) >> shift
+        uqsub           v17.16b, \thresh_vec, v17.16b // clip = imax(0, threshold - (abs(diff) >> shift))
+        uqsub           v21.16b, \thresh_vec, v21.16b // clip = imax(0, threshold - (abs(diff) >> shift))
+        cmhi            v18.16b, v0.16b,  \s1\().16b  // px > p0
+        cmhi            v22.16b, v0.16b,  \s2\().16b  // px > p1
+        umin            v17.16b, v17.16b, v16.16b     // imin(abs(diff), clip)
+        umin            v21.16b, v21.16b, v20.16b     // imin(abs(diff), clip)
+        dup             v19.16b, \tap                 // taps[k]
+        neg             v16.16b, v17.16b              // -imin()
+        neg             v20.16b, v21.16b              // -imin()
+        bsl             v18.16b, v16.16b, v17.16b     // constrain() = apply_sign()
+        bsl             v22.16b, v20.16b, v21.16b     // constrain() = apply_sign()
+        mla             v1.16b,  v18.16b, v19.16b     // sum += taps[k] * constrain()
+        mla             v2.16b,  v22.16b, v19.16b     // sum += taps[k] * constrain()
+.endm
+
+// void cdef_filterX_edged_8bpc_neon(pixel *dst, ptrdiff_t dst_stride,
+//                                   const uint8_t *tmp, int pri_strength,
+//                                   int sec_strength, int dir, int damping,
+//                                   int h);
+.macro filter_func_8 w, pri, sec, min, suffix
+function cdef_filter\w\suffix\()_edged_8bpc_neon
+.if \pri
+        movrel          x8,  pri_taps
+        and             w9,  w3,  #1
+        add             x8,  x8,  w9, uxtw #1
+.endif
+        movrel          x9,  directions\w
+        add             x5,  x9,  w5, uxtw #1
+        movi            v30.8b,  #7
+        dup             v28.8b,  w6                 // damping
+
+.if \pri
+        dup             v25.16b, w3                 // threshold
+.endif
+.if \sec
+        dup             v27.16b, w4                 // threshold
+.endif
+        trn1            v24.8b,  v25.8b, v27.8b
+        clz             v24.8b,  v24.8b             // clz(threshold)
+        sub             v24.8b,  v30.8b, v24.8b     // ulog2(threshold)
+        uqsub           v24.8b,  v28.8b, v24.8b     // shift = imax(0, damping - ulog2(threshold))
+        neg             v24.8b,  v24.8b             // -shift
+.if \sec
+        dup             v26.16b, v24.b[1]
+.endif
+.if \pri
+        dup             v24.16b, v24.b[0]
+.endif
+
+1:
+.if \w == 8
+        add             x12, x2,  #16
+        ld1             {v0.d}[0], [x2]             // px
+        ld1             {v0.d}[1], [x12]            // px
+.else
+        add             x12, x2,  #1*8
+        add             x13, x2,  #2*8
+        add             x14, x2,  #3*8
+        ld1             {v0.s}[0], [x2]             // px
+        ld1             {v0.s}[1], [x12]            // px
+        ld1             {v0.s}[2], [x13]            // px
+        ld1             {v0.s}[3], [x14]            // px
+.endif
+
+        // We need 9-bits or two 8-bit accululators to fit the sum.
+        // Max of |sum| > 15*2*6(pri) + 4*4*3(sec) = 228.
+        // Start sum at -1 instead of 0 to help handle rounding later.
+        movi            v1.16b, #255                // sum
+        movi            v2.16b, #0                  // sum
+.if \min
+        mov             v3.16b, v0.16b              // min
+        mov             v4.16b, v0.16b              // max
+.endif
+
+        // Instead of loading sec_taps 2, 1 from memory, just set it
+        // to 2 initially and decrease for the second round.
+        // This is also used as loop counter.
+        mov             w11, #2                     // sec_taps[0]
+
+2:
+.if \pri
+        ldrb            w9,  [x5]                   // off1
+
+        load_px_8       v5,  v6, \w
+.endif
+
+.if \sec
+        add             x5,  x5,  #4                // +2*2
+        ldrb            w9,  [x5]                   // off2
+        load_px_8       v28, v29, \w
+.endif
+
+.if \pri
+        ldrb            w10, [x8]                   // *pri_taps
+
+        handle_pixel_8  v5,  v6,  v25.16b, v24.16b, w10, \min
+.endif
+
+.if \sec
+        add             x5,  x5,  #8                // +2*4
+        ldrb            w9,  [x5]                   // off3
+        load_px_8       v5,  v6,  \w
+
+        handle_pixel_8  v28, v29, v27.16b, v26.16b, w11, \min
+
+        handle_pixel_8  v5,  v6,  v27.16b, v26.16b, w11, \min
+
+        sub             x5,  x5,  #11               // x5 -= 2*(2+4); x5 += 1;
+.else
+        add             x5,  x5,  #1                // x5 += 1
+.endif
+        subs            w11, w11, #1                // sec_tap-- (value)
+.if \pri
+        add             x8,  x8,  #1                // pri_taps++ (pointer)
+.endif
+        b.ne            2b
+
+        // Perform halving adds since the value won't fit otherwise.
+        // To handle the offset for negative values, use both halving w/ and w/o rounding.
+        srhadd          v5.16b,  v1.16b,  v2.16b    // sum >> 1
+        shadd           v6.16b,  v1.16b,  v2.16b    // (sum - 1) >> 1
+        cmlt            v1.16b,  v5.16b,  #0        // sum < 0
+        bsl             v1.16b,  v6.16b,  v5.16b    // (sum - (sum < 0)) >> 1
+
+        srshr           v1.16b,  v1.16b,  #3        // (8 + sum - (sum < 0)) >> 4
+
+        usqadd          v0.16b,  v1.16b             // px + (8 + sum ...) >> 4
+.if \min
+        umin            v0.16b,  v0.16b,  v4.16b
+        umax            v0.16b,  v0.16b,  v3.16b    // iclip(px + .., min, max)
+.endif
+.if \w == 8
+        st1             {v0.d}[0], [x0], x1
+        add             x2,  x2,  #2*16             // tmp += 2*tmp_stride
+        subs            w7,  w7,  #2                // h -= 2
+        st1             {v0.d}[1], [x0], x1
+.else
+        st1             {v0.s}[0], [x0], x1
+        add             x2,  x2,  #4*8              // tmp += 4*tmp_stride
+        st1             {v0.s}[1], [x0], x1
+        subs            w7,  w7,  #4                // h -= 4
+        st1             {v0.s}[2], [x0], x1
+        st1             {v0.s}[3], [x0], x1
+.endif
+
+        // Reset pri_taps and directions back to the original point
+        sub             x5,  x5,  #2
+.if \pri
+        sub             x8,  x8,  #2
+.endif
+
+        b.gt            1b
+        ret
+endfunc
+.endm
+
+.macro filter_8 w
+filter_func_8 \w, pri=1, sec=0, min=0, suffix=_pri
+filter_func_8 \w, pri=0, sec=1, min=0, suffix=_sec
+filter_func_8 \w, pri=1, sec=1, min=1, suffix=_pri_sec
+.endm
+
+filter_8 8
+filter_8 4
@@ -0,0 +1,278 @@
+/******************************************************************************
+ * Copyright © 2018, VideoLAN and dav1d authors
+ * Copyright © 2015 Martin Storsjo
+ * Copyright © 2015 Janne Grunau
+ * All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions are met:
+ *
+ * 1. Redistributions of source code must retain the above copyright notice, this
+ *    list of conditions and the following disclaimer.
+ *
+ * 2. Redistributions in binary form must reproduce the above copyright notice,
+ *    this list of conditions and the following disclaimer in the documentation
+ *    and/or other materials provided with the distribution.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
+ * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
+ * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+ * DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR
+ * ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
+ * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
+ * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
+ * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
+ * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ *****************************************************************************/
+
+#ifndef DAV1D_SRC_ARM_64_UTIL_S
+#define DAV1D_SRC_ARM_64_UTIL_S
+
+#include "config.h"
+#include "src/arm/asm.S"
+
+#ifndef __has_feature
+#define __has_feature(x) 0
+#endif
+
+.macro  movrel rd, val, offset=0
+#if defined(__APPLE__)
+  .if \offset < 0
+        adrp            \rd, \val@PAGE
+        add             \rd, \rd, \val@PAGEOFF
+        sub             \rd, \rd, -(\offset)
+  .else
+        adrp            \rd, \val+(\offset)@PAGE
+        add             \rd, \rd, \val+(\offset)@PAGEOFF
+  .endif
+#elif defined(PIC) && defined(_WIN32)
+  .if \offset < 0
+        adrp            \rd, \val
+        add             \rd, \rd, :lo12:\val
+        sub             \rd, \rd, -(\offset)
+  .else
+        adrp            \rd, \val+(\offset)
+        add             \rd, \rd, :lo12:\val+(\offset)
+  .endif
+#elif __has_feature(hwaddress_sanitizer)
+        adrp            \rd, :pg_hi21_nc:\val+(\offset)
+        movk            \rd, #:prel_g3:\val+0x100000000
+        add             \rd, \rd, :lo12:\val+(\offset)
+#elif defined(PIC)
+        adrp            \rd, \val+(\offset)
+        add             \rd, \rd, :lo12:\val+(\offset)
+#else
+        ldr             \rd, =\val+\offset
+#endif
+.endm
+
+.macro sub_sp space
+#ifdef _WIN32
+.if \space > 8192
+        // Here, we'd need to touch two (or more) pages while decrementing
+        // the stack pointer.
+        .error          "sub_sp_align doesn't support values over 8K at the moment"
+.elseif \space > 4096
+        sub             x16, sp,  #4096
+        ldr             xzr, [x16]
+        sub             sp,  x16, #(\space - 4096)
+.else
+        sub             sp,  sp,  #\space
+.endif
+#else
+.if \space >= 4096
+        sub             sp,  sp,  #(\space)/4096*4096
+.endif
+.if (\space % 4096) != 0
+        sub             sp,  sp,  #(\space)%4096
+.endif
+#endif
+.endm
+
+.macro transpose_8x8b_xtl r0, r1, r2, r3, r4, r5, r6, r7, xtl
+        // a0 b0 a1 b1 a2 b2 a3 b3 a4 b4 a5 b5 a6 b6 a7 b7
+        zip1            \r0\().16b, \r0\().16b, \r1\().16b
+        // c0 d0 c1 d1 c2 d2 d3 d3 c4 d4 c5 d5 c6 d6 d7 d7
+        zip1            \r2\().16b, \r2\().16b, \r3\().16b
+        // e0 f0 e1 f1 e2 f2 e3 f3 e4 f4 e5 f5 e6 f6 e7 f7
+        zip1            \r4\().16b, \r4\().16b, \r5\().16b
+        // g0 h0 g1 h1 g2 h2 h3 h3 g4 h4 g5 h5 g6 h6 h7 h7
+        zip1            \r6\().16b, \r6\().16b, \r7\().16b
+
+        // a0 b0 c0 d0 a2 b2 c2 d2 a4 b4 c4 d4 a6 b6 c6 d6
+        trn1            \r1\().8h,  \r0\().8h,  \r2\().8h
+        // a1 b1 c1 d1 a3 b3 c3 d3 a5 b5 c5 d5 a7 b7 c7 d7
+        trn2            \r3\().8h,  \r0\().8h,  \r2\().8h
+        // e0 f0 g0 h0 e2 f2 g2 h2 e4 f4 g4 h4 e6 f6 g6 h6
+        trn1            \r5\().8h,  \r4\().8h,  \r6\().8h
+        // e1 f1 g1 h1 e3 f3 g3 h3 e5 f5 g5 h5 e7 f7 g7 h7
+        trn2            \r7\().8h,  \r4\().8h,  \r6\().8h
+
+        // a0 b0 c0 d0 e0 f0 g0 h0 a4 b4 c4 d4 e4 f4 g4 h4
+        trn1            \r0\().4s,  \r1\().4s,  \r5\().4s
+        // a2 b2 c2 d2 e2 f2 g2 h2 a6 b6 c6 d6 e6 f6 g6 h6
+        trn2            \r2\().4s,  \r1\().4s,  \r5\().4s
+        // a1 b1 c1 d1 e1 f1 g1 h1 a5 b5 c5 d5 e5 f5 g5 h5
+        trn1            \r1\().4s,  \r3\().4s,  \r7\().4s
+        // a3 b3 c3 d3 e3 f3 g3 h3 a7 b7 c7 d7 e7 f7 g7 h7
+        trn2            \r3\().4s,  \r3\().4s,  \r7\().4s
+
+        \xtl\()2        \r4\().8h,  \r0\().16b
+        \xtl            \r0\().8h,  \r0\().8b
+        \xtl\()2        \r6\().8h,  \r2\().16b
+        \xtl            \r2\().8h,  \r2\().8b
+        \xtl\()2        \r5\().8h,  \r1\().16b
+        \xtl            \r1\().8h,  \r1\().8b
+        \xtl\()2        \r7\().8h,  \r3\().16b
+        \xtl            \r3\().8h,  \r3\().8b
+.endm
+
+.macro transpose_8x8h r0, r1, r2, r3, r4, r5, r6, r7, t8, t9
+        trn1            \t8\().8h,  \r0\().8h,  \r1\().8h
+        trn2            \t9\().8h,  \r0\().8h,  \r1\().8h
+        trn1            \r1\().8h,  \r2\().8h,  \r3\().8h
+        trn2            \r3\().8h,  \r2\().8h,  \r3\().8h
+        trn1            \r0\().8h,  \r4\().8h,  \r5\().8h
+        trn2            \r5\().8h,  \r4\().8h,  \r5\().8h
+        trn1            \r2\().8h,  \r6\().8h,  \r7\().8h
+        trn2            \r7\().8h,  \r6\().8h,  \r7\().8h
+
+        trn1            \r4\().4s,  \r0\().4s,  \r2\().4s
+        trn2            \r2\().4s,  \r0\().4s,  \r2\().4s
+        trn1            \r6\().4s,  \r5\().4s,  \r7\().4s
+        trn2            \r7\().4s,  \r5\().4s,  \r7\().4s
+        trn1            \r5\().4s,  \t9\().4s,  \r3\().4s
+        trn2            \t9\().4s,  \t9\().4s,  \r3\().4s
+        trn1            \r3\().4s,  \t8\().4s,  \r1\().4s
+        trn2            \t8\().4s,  \t8\().4s,  \r1\().4s
+
+        trn1            \r0\().2d,  \r3\().2d,  \r4\().2d
+        trn2            \r4\().2d,  \r3\().2d,  \r4\().2d
+        trn1            \r1\().2d,  \r5\().2d,  \r6\().2d
+        trn2            \r5\().2d,  \r5\().2d,  \r6\().2d
+        trn2            \r6\().2d,  \t8\().2d,  \r2\().2d
+        trn1            \r2\().2d,  \t8\().2d,  \r2\().2d
+        trn1            \r3\().2d,  \t9\().2d,  \r7\().2d
+        trn2            \r7\().2d,  \t9\().2d,  \r7\().2d
+.endm
+
+.macro transpose_8x8h_mov r0, r1, r2, r3, r4, r5, r6, r7, t8, t9, o0, o1, o2, o3, o4, o5, o6, o7
+        trn1            \t8\().8h,  \r0\().8h,  \r1\().8h
+        trn2            \t9\().8h,  \r0\().8h,  \r1\().8h
+        trn1            \r1\().8h,  \r2\().8h,  \r3\().8h
+        trn2            \r3\().8h,  \r2\().8h,  \r3\().8h
+        trn1            \r0\().8h,  \r4\().8h,  \r5\().8h
+        trn2            \r5\().8h,  \r4\().8h,  \r5\().8h
+        trn1            \r2\().8h,  \r6\().8h,  \r7\().8h
+        trn2            \r7\().8h,  \r6\().8h,  \r7\().8h
+
+        trn1            \r4\().4s,  \r0\().4s,  \r2\().4s
+        trn2            \r2\().4s,  \r0\().4s,  \r2\().4s
+        trn1            \r6\().4s,  \r5\().4s,  \r7\().4s
+        trn2            \r7\().4s,  \r5\().4s,  \r7\().4s
+        trn1            \r5\().4s,  \t9\().4s,  \r3\().4s
+        trn2            \t9\().4s,  \t9\().4s,  \r3\().4s
+        trn1            \r3\().4s,  \t8\().4s,  \r1\().4s
+        trn2            \t8\().4s,  \t8\().4s,  \r1\().4s
+
+        trn1            \o0\().2d,  \r3\().2d,  \r4\().2d
+        trn2            \o4\().2d,  \r3\().2d,  \r4\().2d
+        trn1            \o1\().2d,  \r5\().2d,  \r6\().2d
+        trn2            \o5\().2d,  \r5\().2d,  \r6\().2d
+        trn2            \o6\().2d,  \t8\().2d,  \r2\().2d
+        trn1            \o2\().2d,  \t8\().2d,  \r2\().2d
+        trn1            \o3\().2d,  \t9\().2d,  \r7\().2d
+        trn2            \o7\().2d,  \t9\().2d,  \r7\().2d
+.endm
+
+.macro transpose_8x16b r0, r1, r2, r3, r4, r5, r6, r7, t8, t9
+        trn1            \t8\().16b, \r0\().16b, \r1\().16b
+        trn2            \t9\().16b, \r0\().16b, \r1\().16b
+        trn1            \r1\().16b, \r2\().16b, \r3\().16b
+        trn2            \r3\().16b, \r2\().16b, \r3\().16b
+        trn1            \r0\().16b, \r4\().16b, \r5\().16b
+        trn2            \r5\().16b, \r4\().16b, \r5\().16b
+        trn1            \r2\().16b, \r6\().16b, \r7\().16b
+        trn2            \r7\().16b, \r6\().16b, \r7\().16b
+
+        trn1            \r4\().8h,  \r0\().8h,  \r2\().8h
+        trn2            \r2\().8h,  \r0\().8h,  \r2\().8h
+        trn1            \r6\().8h,  \r5\().8h,  \r7\().8h
+        trn2            \r7\().8h,  \r5\().8h,  \r7\().8h
+        trn1            \r5\().8h,  \t9\().8h,  \r3\().8h
+        trn2            \t9\().8h,  \t9\().8h,  \r3\().8h
+        trn1            \r3\().8h,  \t8\().8h,  \r1\().8h
+        trn2            \t8\().8h,  \t8\().8h,  \r1\().8h
+
+        trn1            \r0\().4s,  \r3\().4s,  \r4\().4s
+        trn2            \r4\().4s,  \r3\().4s,  \r4\().4s
+        trn1            \r1\().4s,  \r5\().4s,  \r6\().4s
+        trn2            \r5\().4s,  \r5\().4s,  \r6\().4s
+        trn2            \r6\().4s,  \t8\().4s,  \r2\().4s
+        trn1            \r2\().4s,  \t8\().4s,  \r2\().4s
+        trn1            \r3\().4s,  \t9\().4s,  \r7\().4s
+        trn2            \r7\().4s,  \t9\().4s,  \r7\().4s
+.endm
+
+.macro  transpose_4x16b r0, r1, r2, r3, t4, t5, t6, t7
+        trn1            \t4\().16b, \r0\().16b, \r1\().16b
+        trn2            \t5\().16b, \r0\().16b, \r1\().16b
+        trn1            \t6\().16b, \r2\().16b, \r3\().16b
+        trn2            \t7\().16b, \r2\().16b, \r3\().16b
+
+        trn1            \r0\().8h,  \t4\().8h,  \t6\().8h
+        trn2            \r2\().8h,  \t4\().8h,  \t6\().8h
+        trn1            \r1\().8h,  \t5\().8h,  \t7\().8h
+        trn2            \r3\().8h,  \t5\().8h,  \t7\().8h
+.endm
+
+.macro  transpose_4x4h  r0, r1, r2, r3, t4, t5, t6, t7
+        trn1            \t4\().4h,  \r0\().4h,  \r1\().4h
+        trn2            \t5\().4h,  \r0\().4h,  \r1\().4h
+        trn1            \t6\().4h,  \r2\().4h,  \r3\().4h
+        trn2            \t7\().4h,  \r2\().4h,  \r3\().4h
+
+        trn1            \r0\().2s,  \t4\().2s,  \t6\().2s
+        trn2            \r2\().2s,  \t4\().2s,  \t6\().2s
+        trn1            \r1\().2s,  \t5\().2s,  \t7\().2s
+        trn2            \r3\().2s,  \t5\().2s,  \t7\().2s
+.endm
+
+.macro  transpose_4x4s  r0, r1, r2, r3, t4, t5, t6, t7
+        trn1            \t4\().4s,  \r0\().4s,  \r1\().4s
+        trn2            \t5\().4s,  \r0\().4s,  \r1\().4s
+        trn1            \t6\().4s,  \r2\().4s,  \r3\().4s
+        trn2            \t7\().4s,  \r2\().4s,  \r3\().4s
+
+        trn1            \r0\().2d,  \t4\().2d,  \t6\().2d
+        trn2            \r2\().2d,  \t4\().2d,  \t6\().2d
+        trn1            \r1\().2d,  \t5\().2d,  \t7\().2d
+        trn2            \r3\().2d,  \t5\().2d,  \t7\().2d
+.endm
+
+.macro  transpose_4x8h  r0, r1, r2, r3, t4, t5, t6, t7
+        trn1            \t4\().8h,  \r0\().8h,  \r1\().8h
+        trn2            \t5\().8h,  \r0\().8h,  \r1\().8h
+        trn1            \t6\().8h,  \r2\().8h,  \r3\().8h
+        trn2            \t7\().8h,  \r2\().8h,  \r3\().8h
+
+        trn1            \r0\().4s,  \t4\().4s,  \t6\().4s
+        trn2            \r2\().4s,  \t4\().4s,  \t6\().4s
+        trn1            \r1\().4s,  \t5\().4s,  \t7\().4s
+        trn2            \r3\().4s,  \t5\().4s,  \t7\().4s
+.endm
+
+.macro  transpose_4x8h_mov r0, r1, r2, r3, t4, t5, t6, t7, o0, o1, o2, o3
+        trn1            \t4\().8h,  \r0\().8h,  \r1\().8h
+        trn2            \t5\().8h,  \r0\().8h,  \r1\().8h
+        trn1            \t6\().8h,  \r2\().8h,  \r3\().8h
+        trn2            \t7\().8h,  \r2\().8h,  \r3\().8h
+
+        trn1            \o0\().4s,  \t4\().4s,  \t6\().4s
+        trn2            \o2\().4s,  \t4\().4s,  \t6\().4s
+        trn1            \o1\().4s,  \t5\().4s,  \t7\().4s
+        trn2            \o3\().4s,  \t5\().4s,  \t7\().4s
+.endm
+
+#endif /* DAV1D_SRC_ARM_64_UTIL_S */
@@ -0,0 +1,335 @@
+/*
+ * Copyright © 2018, VideoLAN and dav1d authors
+ * Copyright © 2018, Janne Grunau
+ * All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions are met:
+ *
+ * 1. Redistributions of source code must retain the above copyright notice, this
+ *    list of conditions and the following disclaimer.
+ *
+ * 2. Redistributions in binary form must reproduce the above copyright notice,
+ *    this list of conditions and the following disclaimer in the documentation
+ *    and/or other materials provided with the distribution.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
+ * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
+ * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+ * DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR
+ * ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
+ * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
+ * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
+ * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
+ * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#ifndef DAV1D_SRC_ARM_ASM_S
+#define DAV1D_SRC_ARM_ASM_S
+
+#include "config.h"
+
+#if ARCH_AARCH64
+#define x18 do_not_use_x18
+#define w18 do_not_use_w18
+
+#if HAVE_AS_ARCH_DIRECTIVE
+        .arch AS_ARCH_LEVEL
+#endif
+
+#if HAVE_AS_ARCHEXT_DOTPROD_DIRECTIVE
+#define ENABLE_DOTPROD  .arch_extension dotprod
+#define DISABLE_DOTPROD .arch_extension nodotprod
+#else
+#define ENABLE_DOTPROD
+#define DISABLE_DOTPROD
+#endif
+#if HAVE_AS_ARCHEXT_I8MM_DIRECTIVE
+#define ENABLE_I8MM  .arch_extension i8mm
+#define DISABLE_I8MM .arch_extension noi8mm
+#else
+#define ENABLE_I8MM
+#define DISABLE_I8MM
+#endif
+#if HAVE_AS_ARCHEXT_SVE_DIRECTIVE
+#define ENABLE_SVE  .arch_extension sve
+#define DISABLE_SVE .arch_extension nosve
+#else
+#define ENABLE_SVE
+#define DISABLE_SVE
+#endif
+#if HAVE_AS_ARCHEXT_SVE2_DIRECTIVE
+#define ENABLE_SVE2  .arch_extension sve2
+#define DISABLE_SVE2 .arch_extension nosve2
+#else
+#define ENABLE_SVE2
+#define DISABLE_SVE2
+#endif
+
+/* If we do support the .arch_extension directives, disable support for all
+ * the extensions that we may use, in case they were implicitly enabled by
+ * the .arch level. This makes it clear if we try to assemble an instruction
+ * from an unintended extension set; we only allow assmbling such instructions
+ * within regions where we explicitly enable those extensions. */
+DISABLE_DOTPROD
+DISABLE_I8MM
+DISABLE_SVE
+DISABLE_SVE2
+
+
+/* Support macros for
+ *   - Armv8.3-A Pointer Authentication and
+ *   - Armv8.5-A Branch Target Identification
+ * features which require emitting a .note.gnu.property section with the
+ * appropriate architecture-dependent feature bits set.
+ *
+ * |AARCH64_SIGN_LINK_REGISTER| and |AARCH64_VALIDATE_LINK_REGISTER| expand to
+ * PACIxSP and AUTIxSP, respectively. |AARCH64_SIGN_LINK_REGISTER| should be
+ * used immediately before saving the LR register (x30) to the stack.
+ * |AARCH64_VALIDATE_LINK_REGISTER| should be used immediately after restoring
+ * it. Note |AARCH64_SIGN_LINK_REGISTER|'s modifications to LR must be undone
+ * with |AARCH64_VALIDATE_LINK_REGISTER| before RET. The SP register must also
+ * have the same value at the two points. For example:
+ *
+ *   .global f
+ *   f:
+ *     AARCH64_SIGN_LINK_REGISTER
+ *     stp x29, x30, [sp, #-96]!
+ *     mov x29, sp
+ *     ...
+ *     ldp x29, x30, [sp], #96
+ *     AARCH64_VALIDATE_LINK_REGISTER
+ *     ret
+ *
+ * |AARCH64_VALID_CALL_TARGET| expands to BTI 'c'. Either it, or
+ * |AARCH64_SIGN_LINK_REGISTER|, must be used at every point that may be an
+ * indirect call target. In particular, all symbols exported from a file must
+ * begin with one of these macros. For example, a leaf function that does not
+ * save LR can instead use |AARCH64_VALID_CALL_TARGET|:
+ *
+ *   .globl return_zero
+ *   return_zero:
+ *     AARCH64_VALID_CALL_TARGET
+ *     mov x0, #0
+ *     ret
+ *
+ * A non-leaf function which does not immediately save LR may need both macros
+ * because |AARCH64_SIGN_LINK_REGISTER| appears late. For example, the function
+ * may jump to an alternate implementation before setting up the stack:
+ *
+ *   .globl with_early_jump
+ *   with_early_jump:
+ *     AARCH64_VALID_CALL_TARGET
+ *     cmp x0, #128
+ *     b.lt .Lwith_early_jump_128
+ *     AARCH64_SIGN_LINK_REGISTER
+ *     stp x29, x30, [sp, #-96]!
+ *     mov x29, sp
+ *     ...
+ *     ldp x29, x30, [sp], #96
+ *     AARCH64_VALIDATE_LINK_REGISTER
+ *     ret
+ *
+ *  .Lwith_early_jump_128:
+ *     ...
+ *     ret
+ *
+ * These annotations are only required with indirect calls. Private symbols that
+ * are only the target of direct calls do not require annotations. Also note
+ * that |AARCH64_VALID_CALL_TARGET| is only valid for indirect calls (BLR), not
+ * indirect jumps (BR). Indirect jumps in assembly are supported through
+ * |AARCH64_VALID_JUMP_TARGET|. Landing Pads which shall serve for jumps and
+ * calls can be created using |AARCH64_VALID_JUMP_CALL_TARGET|.
+ *
+ * Although not necessary, it is safe to use these macros in 32-bit ARM
+ * assembly. This may be used to simplify dual 32-bit and 64-bit files.
+ *
+ * References:
+ * - "ELF for the Arm® 64-bit Architecture"
+ *   https: *github.com/ARM-software/abi-aa/blob/master/aaelf64/aaelf64.rst
+ * - "Providing protection for complex software"
+ *   https://developer.arm.com/architectures/learn-the-architecture/providing-protection-for-complex-software
+ */
+#if defined(__ARM_FEATURE_BTI_DEFAULT) && (__ARM_FEATURE_BTI_DEFAULT == 1)
+#define GNU_PROPERTY_AARCH64_BTI (1 << 0)   // Has Branch Target Identification
+#define AARCH64_VALID_JUMP_CALL_TARGET hint #38  // BTI 'jc'
+#define AARCH64_VALID_CALL_TARGET      hint #34  // BTI 'c'
+#define AARCH64_VALID_JUMP_TARGET      hint #36  // BTI 'j'
+#else
+#define GNU_PROPERTY_AARCH64_BTI 0          // No Branch Target Identification
+#define AARCH64_VALID_JUMP_CALL_TARGET
+#define AARCH64_VALID_CALL_TARGET
+#define AARCH64_VALID_JUMP_TARGET
+#endif
+
+#if defined(__ARM_FEATURE_PAC_DEFAULT)
+
+#if ((__ARM_FEATURE_PAC_DEFAULT & (1 << 0)) != 0) // authentication using key A
+#define AARCH64_SIGN_LINK_REGISTER      paciasp
+#define AARCH64_VALIDATE_LINK_REGISTER  autiasp
+#elif ((__ARM_FEATURE_PAC_DEFAULT & (1 << 1)) != 0) // authentication using key B
+#define AARCH64_SIGN_LINK_REGISTER      pacibsp
+#define AARCH64_VALIDATE_LINK_REGISTER  autibsp
+#else
+#error Pointer authentication defines no valid key!
+#endif
+#if ((__ARM_FEATURE_PAC_DEFAULT & (1 << 2)) != 0) // authentication of leaf functions
+#error Authentication of leaf functions is enabled but not supported in dav1d!
+#endif
+#define GNU_PROPERTY_AARCH64_PAC (1 << 1)
+
+#elif defined(__APPLE__) && defined(__arm64e__)
+
+#define GNU_PROPERTY_AARCH64_PAC 0
+#define AARCH64_SIGN_LINK_REGISTER      pacibsp
+#define AARCH64_VALIDATE_LINK_REGISTER  autibsp
+
+#else /* __ARM_FEATURE_PAC_DEFAULT */
+
+#define GNU_PROPERTY_AARCH64_PAC 0
+#define AARCH64_SIGN_LINK_REGISTER
+#define AARCH64_VALIDATE_LINK_REGISTER
+
+#endif /* !__ARM_FEATURE_PAC_DEFAULT */
+
+
+#if (GNU_PROPERTY_AARCH64_BTI != 0 || GNU_PROPERTY_AARCH64_PAC != 0) && defined(__ELF__)
+        .pushsection .note.gnu.property, "a"
+        .balign 8
+        .long 4
+        .long 0x10
+        .long 0x5
+        .asciz "GNU"
+        .long 0xc0000000 /* GNU_PROPERTY_AARCH64_FEATURE_1_AND */
+        .long 4
+        .long (GNU_PROPERTY_AARCH64_BTI | GNU_PROPERTY_AARCH64_PAC)
+        .long 0
+        .popsection
+#endif /* (GNU_PROPERTY_AARCH64_BTI != 0 || GNU_PROPERTY_AARCH64_PAC != 0) && defined(__ELF__) */
+#endif /* ARCH_AARCH64 */
+
+#if ARCH_ARM
+        .syntax unified
+#ifdef __ELF__
+        .arch armv7-a
+        .fpu neon
+        .eabi_attribute 10, 0           // suppress Tag_FP_arch
+        .eabi_attribute 12, 0           // suppress Tag_Advanced_SIMD_arch
+        .section .note.GNU-stack,"",%progbits // Mark stack as non-executable
+#endif /* __ELF__ */
+
+#ifdef _WIN32
+#define CONFIG_THUMB 1
+#else
+#define CONFIG_THUMB 0
+#endif
+
+#if CONFIG_THUMB
+        .thumb
+#define A @
+#define T
+#else
+#define A
+#define T @
+#endif /* CONFIG_THUMB */
+#endif /* ARCH_ARM */
+
+#if !defined(PIC)
+#if defined(__PIC__)
+#define PIC __PIC__
+#elif defined(__pic__)
+#define PIC __pic__
+#endif
+#endif
+
+#ifndef PRIVATE_PREFIX
+#define PRIVATE_PREFIX dav1d_
+#endif
+
+#define PASTE(a,b) a ## b
+#define CONCAT(a,b) PASTE(a,b)
+
+#ifdef PREFIX
+#define EXTERN CONCAT(_,PRIVATE_PREFIX)
+#else
+#define EXTERN PRIVATE_PREFIX
+#endif
+
+.macro function name, export=0, align=2
+    .macro endfunc
+#ifdef __ELF__
+        .size   \name, . - \name
+#endif
+#if HAVE_AS_FUNC
+        .endfunc
+#endif
+        .purgem endfunc
+    .endm
+        .text
+        .align \align
+    .if \export
+        .global EXTERN\name
+#ifdef __ELF__
+        .type   EXTERN\name, %function
+        .hidden EXTERN\name
+#elif defined(__MACH__)
+        .private_extern EXTERN\name
+#endif
+#if HAVE_AS_FUNC
+        .func   EXTERN\name
+#endif
+EXTERN\name:
+    .else
+#ifdef __ELF__
+        .type \name, %function
+#endif
+#if HAVE_AS_FUNC
+        .func \name
+#endif
+    .endif
+\name:
+#if ARCH_AARCH64
+    .if \export
+         AARCH64_VALID_CALL_TARGET
+    .endif
+#endif
+.endm
+
+.macro  const   name, export=0, align=2
+    .macro endconst
+#ifdef __ELF__
+        .size   \name, . - \name
+#endif
+        .purgem endconst
+    .endm
+#if defined(_WIN32)
+        .section        .rdata
+#elif !defined(__MACH__)
+        .section        .rodata
+#else
+        .const_data
+#endif
+        .align          \align
+    .if \export
+        .global EXTERN\name
+#ifdef __ELF__
+        .hidden EXTERN\name
+#elif defined(__MACH__)
+        .private_extern EXTERN\name
+#endif
+EXTERN\name:
+    .endif
+\name:
+.endm
+
+#ifdef __APPLE__
+#define L(x) L ## x
+#else
+#define L(x) .L ## x
+#endif
+
+#define X(x) CONCAT(EXTERN, x)
+
+
+#endif /* DAV1D_SRC_ARM_ASM_S */
@@ -0,0 +1,331 @@
+/*
+ * Copyright © 2018, VideoLAN and dav1d authors
+ * Copyright © 2018, Two Orioles, LLC
+ * All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions are met:
+ *
+ * 1. Redistributions of source code must retain the above copyright notice, this
+ *    list of conditions and the following disclaimer.
+ *
+ * 2. Redistributions in binary form must reproduce the above copyright notice,
+ *    this list of conditions and the following disclaimer in the documentation
+ *    and/or other materials provided with the distribution.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
+ * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
+ * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+ * DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR
+ * ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
+ * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
+ * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
+ * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
+ * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#include "config.h"
+
+#include <stdlib.h>
+
+#include "common/intops.h"
+
+#include "src/cdef.h"
+#include "src/tables.h"
+
+static inline int constrain(const int diff, const int threshold,
+                            const int shift)
+{
+    const int adiff = abs(diff);
+    return apply_sign(imin(adiff, imax(0, threshold - (adiff >> shift))), diff);
+}
+
+static inline void fill(int16_t *tmp, const ptrdiff_t stride,
+                        const int w, const int h)
+{
+    /* Use a value that's a large positive number when interpreted as unsigned,
+     * and a large negative number when interpreted as signed. */
+    for (int y = 0; y < h; y++) {
+        for (int x = 0; x < w; x++)
+            tmp[x] = INT16_MIN;
+        tmp += stride;
+    }
+}
+
+static void padding(int16_t *tmp, const ptrdiff_t tmp_stride,
+                    const pixel *src, const ptrdiff_t src_stride,
+                    const pixel (*left)[2],
+                    const pixel *top, const pixel *bottom,
+                    const int w, const int h, const enum CdefEdgeFlags edges)
+{
+    // fill extended input buffer
+    int x_start = -2, x_end = w + 2, y_start = -2, y_end = h + 2;
+    if (!(edges & CDEF_HAVE_TOP)) {
+        fill(tmp - 2 - 2 * tmp_stride, tmp_stride, w + 4, 2);
+        y_start = 0;
+    }
+    if (!(edges & CDEF_HAVE_BOTTOM)) {
+        fill(tmp + h * tmp_stride - 2, tmp_stride, w + 4, 2);
+        y_end -= 2;
+    }
+    if (!(edges & CDEF_HAVE_LEFT)) {
+        fill(tmp + y_start * tmp_stride - 2, tmp_stride, 2, y_end - y_start);
+        x_start = 0;
+    }
+    if (!(edges & CDEF_HAVE_RIGHT)) {
+        fill(tmp + y_start * tmp_stride + w, tmp_stride, 2, y_end - y_start);
+        x_end -= 2;
+    }
+
+    for (int y = y_start; y < 0; y++) {
+        for (int x = x_start; x < x_end; x++)
+            tmp[x + y * tmp_stride] = top[x];
+        top += PXSTRIDE(src_stride);
+    }
+    for (int y = 0; y < h; y++)
+        for (int x = x_start; x < 0; x++)
+            tmp[x + y * tmp_stride] = left[y][2 + x];
+    for (int y = 0; y < h; y++) {
+        for (int x = (y < h) ? 0 : x_start; x < x_end; x++)
+            tmp[x] = src[x];
+        src += PXSTRIDE(src_stride);
+        tmp += tmp_stride;
+    }
+    for (int y = h; y < y_end; y++) {
+        for (int x = x_start; x < x_end; x++)
+            tmp[x] = bottom[x];
+        bottom += PXSTRIDE(src_stride);
+        tmp += tmp_stride;
+    }
+
+}
+
+static NOINLINE void
+cdef_filter_block_c(pixel *dst, const ptrdiff_t dst_stride,
+                    const pixel (*left)[2],
+                    const pixel *const top, const pixel *const bottom,
+                    const int pri_strength, const int sec_strength,
+                    const int dir, const int damping, const int w, int h,
+                    const enum CdefEdgeFlags edges HIGHBD_DECL_SUFFIX)
+{
+    const ptrdiff_t tmp_stride = 12;
+    assert((w == 4 || w == 8) && (h == 4 || h == 8));
+    int16_t tmp_buf[144]; // 12*12 is the maximum value of tmp_stride * (h + 4)
+    int16_t *tmp = tmp_buf + 2 * tmp_stride + 2;
+
+    padding(tmp, tmp_stride, dst, dst_stride, left, top, bottom, w, h, edges);
+
+    if (pri_strength) {
+        const int bitdepth_min_8 = bitdepth_from_max(bitdepth_max) - 8;
+        const int pri_tap = 4 - ((pri_strength >> bitdepth_min_8) & 1);
+        const int pri_shift = imax(0, damping - ulog2(pri_strength));
+        if (sec_strength) {
+            const int sec_shift = damping - ulog2(sec_strength);
+            do {
+                for (int x = 0; x < w; x++) {
+                    const int px = dst[x];
+                    int sum = 0;
+                    int max = px, min = px;
+                    int pri_tap_k = pri_tap;
+                    for (int k = 0; k < 2; k++) {
+                        const int off1 = dav1d_cdef_directions[dir + 2][k]; // dir
+                        const int p0 = tmp[x + off1];
+                        const int p1 = tmp[x - off1];
+                        sum += pri_tap_k * constrain(p0 - px, pri_strength, pri_shift);
+                        sum += pri_tap_k * constrain(p1 - px, pri_strength, pri_shift);
+                        // if pri_tap_k == 4 then it becomes 2 else it remains 3
+                        pri_tap_k = (pri_tap_k & 3) | 2;
+                        min = umin(p0, min);
+                        max = imax(p0, max);
+                        min = umin(p1, min);
+                        max = imax(p1, max);
+                        const int off2 = dav1d_cdef_directions[dir + 4][k]; // dir + 2
+                        const int off3 = dav1d_cdef_directions[dir + 0][k]; // dir - 2
+                        const int s0 = tmp[x + off2];
+                        const int s1 = tmp[x - off2];
+                        const int s2 = tmp[x + off3];
+                        const int s3 = tmp[x - off3];
+                        // sec_tap starts at 2 and becomes 1
+                        const int sec_tap = 2 - k;
+                        sum += sec_tap * constrain(s0 - px, sec_strength, sec_shift);
+                        sum += sec_tap * constrain(s1 - px, sec_strength, sec_shift);
+                        sum += sec_tap * constrain(s2 - px, sec_strength, sec_shift);
+                        sum += sec_tap * constrain(s3 - px, sec_strength, sec_shift);
+                        min = umin(s0, min);
+                        max = imax(s0, max);
+                        min = umin(s1, min);
+                        max = imax(s1, max);
+                        min = umin(s2, min);
+                        max = imax(s2, max);
+                        min = umin(s3, min);
+                        max = imax(s3, max);
+                    }
+                    dst[x] = iclip(px + ((sum - (sum < 0) + 8) >> 4), min, max);
+                }
+                dst += PXSTRIDE(dst_stride);
+                tmp += tmp_stride;
+            } while (--h);
+        } else { // pri_strength only
+            do {
+                for (int x = 0; x < w; x++) {
+                    const int px = dst[x];
+                    int sum = 0;
+                    int pri_tap_k = pri_tap;
+                    for (int k = 0; k < 2; k++) {
+                        const int off = dav1d_cdef_directions[dir + 2][k]; // dir
+                        const int p0 = tmp[x + off];
+                        const int p1 = tmp[x - off];
+                        sum += pri_tap_k * constrain(p0 - px, pri_strength, pri_shift);
+                        sum += pri_tap_k * constrain(p1 - px, pri_strength, pri_shift);
+                        pri_tap_k = (pri_tap_k & 3) | 2;
+                    }
+                    dst[x] = px + ((sum - (sum < 0) + 8) >> 4);
+                }
+                dst += PXSTRIDE(dst_stride);
+                tmp += tmp_stride;
+            } while (--h);
+        }
+    } else { // sec_strength only
+        assert(sec_strength);
+        const int sec_shift = damping - ulog2(sec_strength);
+        do {
+            for (int x = 0; x < w; x++) {
+                const int px = dst[x];
+                int sum = 0;
+                for (int k = 0; k < 2; k++) {
+                    const int off1 = dav1d_cdef_directions[dir + 4][k]; // dir + 2
+                    const int off2 = dav1d_cdef_directions[dir + 0][k]; // dir - 2
+                    const int s0 = tmp[x + off1];
+                    const int s1 = tmp[x - off1];
+                    const int s2 = tmp[x + off2];
+                    const int s3 = tmp[x - off2];
+                    const int sec_tap = 2 - k;
+                    sum += sec_tap * constrain(s0 - px, sec_strength, sec_shift);
+                    sum += sec_tap * constrain(s1 - px, sec_strength, sec_shift);
+                    sum += sec_tap * constrain(s2 - px, sec_strength, sec_shift);
+                    sum += sec_tap * constrain(s3 - px, sec_strength, sec_shift);
+                }
+                dst[x] = px + ((sum - (sum < 0) + 8) >> 4);
+            }
+            dst += PXSTRIDE(dst_stride);
+            tmp += tmp_stride;
+        } while (--h);
+    }
+}
+
+#define cdef_fn(w, h) \
+static void cdef_filter_block_##w##x##h##_c(pixel *const dst, \
+                                            const ptrdiff_t stride, \
+                                            const pixel (*left)[2], \
+                                            const pixel *const top, \
+                                            const pixel *const bottom, \
+                                            const int pri_strength, \
+                                            const int sec_strength, \
+                                            const int dir, \
+                                            const int damping, \
+                                            const enum CdefEdgeFlags edges \
+                                            HIGHBD_DECL_SUFFIX) \
+{ \
+    cdef_filter_block_c(dst, stride, left, top, bottom, \
+                        pri_strength, sec_strength, dir, damping, w, h, edges HIGHBD_TAIL_SUFFIX); \
+}
+
+cdef_fn(4, 4);
+cdef_fn(4, 8);
+cdef_fn(8, 8);
+
+static int cdef_find_dir_c(const pixel *img, const ptrdiff_t stride,
+                           unsigned *const var HIGHBD_DECL_SUFFIX)
+{
+    const int bitdepth_min_8 = bitdepth_from_max(bitdepth_max) - 8;
+    int partial_sum_hv[2][8] = { { 0 } };
+    int partial_sum_diag[2][15] = { { 0 } };
+    int partial_sum_alt[4][11] = { { 0 } };
+
+    for (int y = 0; y < 8; y++) {
+        for (int x = 0; x < 8; x++) {
+            const int px = (img[x] >> bitdepth_min_8) - 128;
+
+            partial_sum_diag[0][     y       +  x      ] += px;
+            partial_sum_alt [0][     y       + (x >> 1)] += px;
+            partial_sum_hv  [0][     y                 ] += px;
+            partial_sum_alt [1][3 +  y       - (x >> 1)] += px;
+            partial_sum_diag[1][7 +  y       -  x      ] += px;
+            partial_sum_alt [2][3 - (y >> 1) +  x      ] += px;
+            partial_sum_hv  [1][                x      ] += px;
+            partial_sum_alt [3][    (y >> 1) +  x      ] += px;
+        }
+        img += PXSTRIDE(stride);
+    }
+
+    unsigned cost[8] = { 0 };
+    for (int n = 0; n < 8; n++) {
+        cost[2] += partial_sum_hv[0][n] * partial_sum_hv[0][n];
+        cost[6] += partial_sum_hv[1][n] * partial_sum_hv[1][n];
+    }
+    cost[2] *= 105;
+    cost[6] *= 105;
+
+    static const uint16_t div_table[7] = { 840, 420, 280, 210, 168, 140, 120 };
+    for (int n = 0; n < 7; n++) {
+        const int d = div_table[n];
+        cost[0] += (partial_sum_diag[0][n]      * partial_sum_diag[0][n] +
+                    partial_sum_diag[0][14 - n] * partial_sum_diag[0][14 - n]) * d;
+        cost[4] += (partial_sum_diag[1][n]      * partial_sum_diag[1][n] +
+                    partial_sum_diag[1][14 - n] * partial_sum_diag[1][14 - n]) * d;
+    }
+    cost[0] += partial_sum_diag[0][7] * partial_sum_diag[0][7] * 105;
+    cost[4] += partial_sum_diag[1][7] * partial_sum_diag[1][7] * 105;
+
+    for (int n = 0; n < 4; n++) {
+        unsigned *const cost_ptr = &cost[n * 2 + 1];
+        for (int m = 0; m < 5; m++)
+            *cost_ptr += partial_sum_alt[n][3 + m] * partial_sum_alt[n][3 + m];
+        *cost_ptr *= 105;
+        for (int m = 0; m < 3; m++) {
+            const int d = div_table[2 * m + 1];
+            *cost_ptr += (partial_sum_alt[n][m]      * partial_sum_alt[n][m] +
+                          partial_sum_alt[n][10 - m] * partial_sum_alt[n][10 - m]) * d;
+        }
+    }
+
+    int best_dir = 0;
+    unsigned best_cost = cost[0];
+    for (int n = 1; n < 8; n++) {
+        if (cost[n] > best_cost) {
+            best_cost = cost[n];
+            best_dir = n;
+        }
+    }
+
+    *var = (best_cost - (cost[best_dir ^ 4])) >> 10;
+    return best_dir;
+}
+
+#if HAVE_ASM
+#if ARCH_AARCH64 || ARCH_ARM
+#include "src/arm/cdef.h"
+#elif ARCH_PPC64LE
+#include "src/ppc/cdef.h"
+#elif ARCH_X86
+#include "src/x86/cdef.h"
+#endif
+#endif
+
+COLD void bitfn(dav1d_cdef_dsp_init)(Dav1dCdefDSPContext *const c) {
+    c->dir = cdef_find_dir_c;
+    c->fb[0] = cdef_filter_block_8x8_c;
+    c->fb[1] = cdef_filter_block_4x8_c;
+    c->fb[2] = cdef_filter_block_4x4_c;
+
+#if HAVE_ASM
+#if ARCH_AARCH64 || ARCH_ARM
+    cdef_dsp_init_arm(c);
+#elif ARCH_PPC64LE
+    cdef_dsp_init_ppc(c);
+#elif ARCH_X86
+    cdef_dsp_init_x86(c);
+#endif
+#endif
+}
@@ -0,0 +1,32 @@
+/*
+ * dav1d_cdef_directions — verbatim transcription of the CDEF
+ * directions table from dav1d/src/tables.c (1.4.3, lines 400-414).
+ * Provided as a standalone .c so the vendored cdef.S has the
+ * symbol to link against without pulling in dav1d's full tables.c
+ * (which is 1013 lines and chain-references the entire decoder).
+ *
+ * Used by both the C reference (cdef_tmpl.c) and the NEON
+ * implementation (cdef.S).
+ *
+ * The table has 12 entries (2 + 8 + 2) because direction indexing
+ * wraps modulo 8 with ±2 lookahead for secondary taps; the leading
+ * and trailing 2 entries are the wrap-around prefixes/suffixes.
+ *
+ * License: BSD-2-Clause (matches dav1d upstream).
+ */
+#include <stdint.h>
+
+const int8_t dav1d_cdef_directions[2 + 8 + 2][2] = {
+    {  1 * 12 + 0,  2 * 12 + 0 }, // 6 (wrap prefix)
+    {  1 * 12 + 0,  2 * 12 - 1 }, // 7 (wrap prefix)
+    { -1 * 12 + 1, -2 * 12 + 2 }, // 0
+    {  0 * 12 + 1, -1 * 12 + 2 }, // 1
+    {  0 * 12 + 1,  0 * 12 + 2 }, // 2
+    {  0 * 12 + 1,  1 * 12 + 2 }, // 3
+    {  1 * 12 + 1,  2 * 12 + 2 }, // 4
+    {  1 * 12 + 0,  2 * 12 + 1 }, // 5
+    {  1 * 12 + 0,  2 * 12 + 0 }, // 6
+    {  1 * 12 + 0,  2 * 12 - 1 }, // 7
+    { -1 * 12 + 1, -2 * 12 + 2 }, // 0 (wrap suffix)
+    {  0 * 12 + 1, -1 * 12 + 2 }, // 1 (wrap suffix)
+};