Cycle 5 setup (Phase 1+2): vendor dav1d 1.4.3 CDEF sources
First AV1 kernel cycle and first dav1d-vendored sources. Phase 1+2
docs lay out the structural complexity (CDEF needs pre-padded 12x12
working buffer + external edge context + direction lookup +
constraint function — meaningfully more complex than cycles 1-4).
Phase 3+ deferred to next session — CDEF is the first cycle that
doesn't fit cleanly into a single autonomous run.
Vendored from dav1d 1.4.3 (BSD-2-Clause, cleaner license than
FFmpeg's LGPL-2.1+):
src/arm/64/cdef.S 520 lines — NEON impl
src/arm/64/util.S 278 lines — NEON helpers
src/arm/asm.S 335 lines — GAS preamble
src/cdef_tmpl.c 331 lines — C reference (templated)
include/common/intops.h 84 lines — utility helpers
src/tables_cdef_subset.c hand-extracted — dav1d_cdef_directions
only (avoids dragging full 1013-line
tables.c + transitive includes)
Discovery from Phase 2 analysis:
- Filter type and shape: dav1d_cdef_filter8_pri_sec_8bpc_neon takes
(dst, dst_stride, tmp, pri_strength, sec_strength, dir, damping, h).
The 'tmp' arg is the pre-padded 12x12 buffer constructed externally
by the dav1d C-side padding() function.
- Tap weights are inline-computed (not table): pri_tap = 4 or 3
(based on pri_strength bit), sec_tap = 2 or 1. Only
dav1d_cdef_directions[12][2] is an external table.
- Constraint function: constrain(diff, threshold, shift) =
apply_sign(min(abs(diff), max(0, threshold - (abs(diff) >> shift))),
diff)
Predicted R5 band: 0.15-0.30 (ORANGE). CDEF is compute-heavier than
LPF (per-pixel min/max conditional logic), so likely worse R than
cycle 2/4 but better than cycle 3 MC. M4 gate likely required.
What Phase 3+ needs (next session):
1. config.h shim for dav1d's asm preamble (defines TBD on first build)
2. Standalone C reference for cdef_filter_block_8x8_c
(cdef_tmpl.c references several dav1d private headers; cleaner to
transcribe to a self-contained tests/cdef_ref.c)
3. tests/bench_neon_cdef.c — M1+M3 bench
4. Phase 4 plan, Phase 5 review (mandatory), Phase 6 shader, Phase 7 measure
PROVENANCE.md documents pin + per-file role + re-vendoring procedure.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
@@ -0,0 +1,190 @@
|
||||
---
|
||||
cycle: 5
|
||||
phases: 1-2 (combined; phase 3+ pending)
|
||||
status: setup in progress
|
||||
date_opened: 2026-05-18
|
||||
parent_cycle: k4_lpf8_phase4_7.md
|
||||
target_kernel: AV1 CDEF filter, 8×8 luma, 8bpc, FILTER stage only
|
||||
(assume direction + strengths pre-computed)
|
||||
new_vendor: dav1d 1.4.3 (BSD-2-Clause), separate from FFmpeg pin
|
||||
---
|
||||
|
||||
# Cycle 5, Phases 1-2 — AV1 CDEF
|
||||
|
||||
First AV1 kernel; first cycle that vendors from outside the FFmpeg
|
||||
snapshot. dav1d is the canonical AV1 reference (clean BSD-2-Clause,
|
||||
mature aarch64 NEON, used by VLC + Firefox via libdav1d).
|
||||
|
||||
## Phase 1 — goal
|
||||
|
||||
**Kernel**: AV1 Constrained Directional Enhancement Filter, 8×8 luma
|
||||
output, 8 bits/component, FILTER stage (direction + strength
|
||||
parameters assumed pre-computed). Match the "pre-computed params"
|
||||
convention of LPF (E/I/H) and MC (mx).
|
||||
|
||||
**NEON symbol target**: `dav1d_cdef_filter8_pri_sec_8bpc_neon` (combined
|
||||
primary + secondary filter). There are also `_pri_` and `_sec_` only
|
||||
variants for the cases where one strength is 0; for the bench we
|
||||
cover the worst case (both active).
|
||||
|
||||
**C reference**: `cdef_filter_block_8x8_c` from `dav1d/src/cdef_tmpl.c`
|
||||
(macro-expanded), delegating to `cdef_filter_block_c`. Spec source:
|
||||
AV1 specification §7.15 (CDEF).
|
||||
|
||||
### Measurable success (cycle-5 numbering, `5` superscript)
|
||||
|
||||
| ID | Measurement | Gate |
|
||||
|---|---|---|
|
||||
| M1₅ | bit-exact vs C ref, N random 8×8 blocks across all 8 directions × various strengths | 100.0000 % |
|
||||
| M2₅ | QPU throughput Mblock/s | recorded |
|
||||
| M3₅ | NEON `dav1d_cdef_filter8_pri_sec_8bpc_neon` Mblock/s | recorded |
|
||||
| M4₅ | mixed NEON-3 + QPU vs pure NEON-4 (if YELLOW/ORANGE band) | conditional |
|
||||
|
||||
### Decision bands (carried)
|
||||
|
||||
Same R bands and 30fps-floor calibration as cycles 1-4.
|
||||
|
||||
### Predicted R₅
|
||||
|
||||
The CDEF filter is **compute-heavier than LPF**:
|
||||
- Per pixel: 8 constraint applications (abs + min + max + sign-restore)
|
||||
plus the per-pixel accumulation with min/max tracking
|
||||
- Per 8×8 block: ~32 mults (small constants 1-4) + many adds + many
|
||||
conditionals
|
||||
- Memory: 12×12 padded source = 144 reads + 64 writes = 208 B/block
|
||||
(vs LPF's ~88 B and MC's ~184 B)
|
||||
- No DP4A applicability (the multipliers are small constants, but
|
||||
the constraint function dominates)
|
||||
|
||||
**Predicted R₅ band**: 0.15-0.30 (ORANGE). The constraint function's
|
||||
per-pixel min/max conditional logic is heavier than LPF's per-row
|
||||
fm/flat tests. Compute-bound on QPU. M4 may still rescue per
|
||||
cycle-1+2 pattern.
|
||||
|
||||
### NEW for cycle 5
|
||||
|
||||
- **First AV1 kernel** → expands codec coverage beyond VP9
|
||||
- **First dav1d-vendored source** → new external/ subdirectory:
|
||||
`external/dav1d-snapshot/` (BSD-2-Clause; clean license vs LGPL
|
||||
FFmpeg)
|
||||
- **First kernel needing external padding context** — CDEF reads
|
||||
beyond the 8×8 block (2-pixel halo on each side); dav1d's C
|
||||
reference uses pre-padded `tmp_buf[12×12]` constructed by a
|
||||
separate `padding()` function from left/top/bottom edge arrays.
|
||||
Our bench will construct this padding inline for each random
|
||||
block.
|
||||
|
||||
## Phase 2 — situation analysis
|
||||
|
||||
### C reference structure (dav1d)
|
||||
|
||||
`cdef_filter_block_8x8_c` signature:
|
||||
```c
|
||||
void cdef_filter_block_8x8_c(pixel *dst, ptrdiff_t stride,
|
||||
const pixel (*left)[2],
|
||||
const pixel *top, const pixel *bottom,
|
||||
int pri_strength, int sec_strength,
|
||||
int dir, int damping,
|
||||
enum CdefEdgeFlags edges);
|
||||
```
|
||||
|
||||
The function:
|
||||
1. Allocates `int16_t tmp_buf[144]` (12×12 working buffer)
|
||||
2. Calls `padding()` to fill from left/top/bottom + dst with edge-replicate
|
||||
3. Iterates 8 rows × 8 cols; per pixel:
|
||||
- Looks up direction offsets: `dav1d_cdef_directions[dir+offset][k]`
|
||||
- For each of 4 primary tap positions (k=0..1, both signs):
|
||||
compute pri-constrained diff, multiply by tap weight, accumulate
|
||||
- For each of 4 secondary tap positions (k=0..1, both signs,
|
||||
two adjacent directions):
|
||||
same with sec weights
|
||||
- Track min/max across all sampled neighbours
|
||||
- Output: `iclip(px + ((sum - (sum < 0) + 8) >> 4), min, max)`
|
||||
|
||||
The "constraint" function:
|
||||
```c
|
||||
static inline int constrain(int diff, int threshold, int shift) {
|
||||
int adiff = abs(diff);
|
||||
return apply_sign(imin(adiff, imax(0, threshold - (adiff >> shift))),
|
||||
diff);
|
||||
}
|
||||
```
|
||||
|
||||
This is the per-pixel-pair clamp that makes CDEF *constrained*
|
||||
(directional enhancement that can't exceed a threshold tied to
|
||||
local strength).
|
||||
|
||||
### Tables needed
|
||||
|
||||
- `dav1d_cdef_directions[12][2]` — 12 directions (8 + 4 wrap-arounds),
|
||||
each a (y_offset, x_offset) pair. In `dav1d/src/tables.c`.
|
||||
- `dav1d_cdef_pri_taps[2][2]` — primary tap weights, indexed by
|
||||
`(pri_strength & 1)` and tap position k. Small ints.
|
||||
- `dav1d_cdef_sec_taps[2]` — secondary tap weights, just 2 entries.
|
||||
|
||||
### NEON reference structure (dav1d)
|
||||
|
||||
`dav1d_cdef_filter8_pri_sec_8bpc_neon` signature:
|
||||
```
|
||||
x0: dst pixel buffer
|
||||
x1: dst_stride ptrdiff_t
|
||||
x2: tmp uint8_t source (the pre-padded 12×12 buffer reinterpreted)
|
||||
w3: pri_strength
|
||||
w4: sec_strength
|
||||
w5: dir
|
||||
w6: damping
|
||||
w7: h height (8 for 8×8)
|
||||
```
|
||||
|
||||
Notable: dav1d's NEON takes the already-padded `tmp` buffer pointer
|
||||
(after the C side did `padding()`). So our bench needs to construct
|
||||
the padded buffer per block.
|
||||
|
||||
Padded buffer layout (12×12, int16 elements):
|
||||
- Real pixel region at rows [2..9], cols [2..9] (the 8×8 dst)
|
||||
- Halo at rows {0,1,10,11} and cols {0,1,10,11}: either edge-replicate
|
||||
from adjacent block (if edges flag set) or INT16_MIN (which the
|
||||
constraint function treats as "skip this neighbour")
|
||||
|
||||
### Vendoring plan
|
||||
|
||||
New directory: `external/dav1d-snapshot/` (BSD-2-Clause, separate
|
||||
PROVENANCE.md from FFmpeg pin).
|
||||
|
||||
Files to vendor from dav1d 1.4.3:
|
||||
1. `src/arm/64/cdef.S` — main NEON file (~870 lines)
|
||||
2. `src/arm/64/util.S` — helper macros referenced by cdef.S
|
||||
3. `src/arm/asm.S` — top-level macros (function, endfunc, etc.)
|
||||
4. `src/cdef_tmpl.c` — C reference (~250 lines)
|
||||
5. `src/tables.c` — the static tables (cdef_directions, pri/sec taps)
|
||||
*or* hand-extract just the CDEF tables (~50 lines)
|
||||
6. `include/common/intops.h` — apply_sign, imin, imax, iclip helpers
|
||||
7. A standalone PROVENANCE.md with pin + SHA-256s
|
||||
|
||||
dav1d's asm preamble may need its own config.h shim (different
|
||||
defines than FFmpeg's). Phase 6 setup will identify exact needs.
|
||||
|
||||
### Build path
|
||||
|
||||
dav1d's asm uses similar GAS preamble to FFmpeg's. The config
|
||||
defines are different: `ARCH_AARCH64`, `HAVE_AS_FUNC`, etc., but
|
||||
also dav1d-specific like `PRIVATE_PREFIX dav1d_` and `EXTERN_ASM ` (same
|
||||
empty for ELF as in cycle 1).
|
||||
|
||||
### What Phase 2 does *not* close
|
||||
|
||||
- The exact list of dav1d asm.S macros needed (will surface during
|
||||
first build attempt)
|
||||
- C reference completeness — `padding()` setup logic is non-trivial
|
||||
(handles edges/CdefEdgeFlags = combinations of HAVE_LEFT, HAVE_TOP,
|
||||
HAVE_RIGHT, HAVE_BOTTOM). For the bench, we can simplify by
|
||||
always passing "all edges valid" with synthetic neighbouring pixels.
|
||||
- Direction validation — directions 0..7 should all be tested for
|
||||
bit-exactness; an off-by-one in the direction-offset table would
|
||||
be caught by M1.
|
||||
|
||||
Phase 3 next: vendor the dav1d files, write standalone C ref +
|
||||
bench, capture M3₅ NEON baseline.
|
||||
|
||||
This is **the first multi-session cycle** — Phase 3+ likely lands
|
||||
in next session. Cycle setup commit at end of this session.
|
||||
+109
@@ -0,0 +1,109 @@
|
||||
# dav1d source snapshot
|
||||
|
||||
Verbatim subset of dav1d source pinned for use as reference
|
||||
implementations of AV1 CDEF (cycle 5 of `daedalus-fourier`) and
|
||||
potentially future AV1 kernels. dav1d is the canonical AV1 decoder
|
||||
library (BSD-2-Clause, maintained by VideoLAN).
|
||||
|
||||
See `../../docs/k5_cdef_phase1_2.md` for the cycle 5 scope and
|
||||
rationale.
|
||||
|
||||
## Upstream pin
|
||||
|
||||
- **Repository**: https://github.com/videolan/dav1d (canonical mirror
|
||||
of https://code.videolan.org/videolan/dav1d)
|
||||
- **Tag**: `1.4.3` (last stable release in the 1.4.x line as of
|
||||
2026-05-18; pinned for reproducibility)
|
||||
- **Snapshot fetched**: 2026-05-18 (UTC), via
|
||||
`https://raw.githubusercontent.com/videolan/dav1d/1.4.3/<path>`
|
||||
|
||||
## Files in this snapshot
|
||||
|
||||
All files are byte-for-byte copies of the upstream source at the
|
||||
tagged commit, except `tables_cdef_subset.c` which is a hand-extracted
|
||||
single-table copy from `src/tables.c` (see §"Why each file" below).
|
||||
|
||||
| Path | Lines | SHA-256 |
|
||||
|---|---|---|
|
||||
| `src/arm/64/cdef.S` | 520 | `88d048cbed93f168...` (TODO full hash) |
|
||||
| `src/arm/64/util.S` | 278 | `582acd8e2b74a1e8...` |
|
||||
| `src/arm/asm.S` | 335 | `6a22def2799876c4...` |
|
||||
| `src/cdef_tmpl.c` | 331 | `26a7a5f9fda65c58...` |
|
||||
| `include/common/intops.h` | 84 | `c1e7d52b421d6417...` |
|
||||
| `src/tables_cdef_subset.c` | hand-extracted | — |
|
||||
|
||||
Full SHA-256s (regenerated by `phase 3` setup):
|
||||
|
||||
```sh
|
||||
( cd external/dav1d-snapshot && sha256sum \
|
||||
src/arm/64/cdef.S src/arm/64/util.S src/arm/asm.S \
|
||||
src/cdef_tmpl.c include/common/intops.h )
|
||||
```
|
||||
|
||||
## License
|
||||
|
||||
BSD-2-Clause. Copyright (c) 2018 VideoLAN and dav1d authors; (c) 2019
|
||||
Martin Storsjö (NEON aarch64). Original copyright headers preserved
|
||||
in each vendored file.
|
||||
|
||||
Notably cleaner license than the FFmpeg LGPL-2.1+ snapshot — dav1d's
|
||||
BSD allows distribution of binaries without LGPL's "share linking
|
||||
ability" requirements. For daedalus-fourier benches that link only
|
||||
this snapshot, the binary inherits BSD-2-Clause. Benches that
|
||||
combine both snapshots (none currently) inherit LGPL-2.1+ via
|
||||
FFmpeg's stronger terms.
|
||||
|
||||
## Why each file
|
||||
|
||||
- **`src/arm/64/cdef.S`** — the NEON aarch64 implementation. Provides
|
||||
`dav1d_cdef_filter8_pri_sec_8bpc_neon` and pri-only / sec-only
|
||||
variants. The Phase 3 NEON baseline (M3₅) measures this symbol.
|
||||
|
||||
- **`src/arm/64/util.S`** — helper macros (`load_px_8`,
|
||||
`handle_pixel_8`, etc.) referenced by cdef.S.
|
||||
|
||||
- **`src/arm/asm.S`** — top-level GAS preamble (function/endfunc,
|
||||
movrel, register macros). dav1d's own version is similar to FFmpeg's
|
||||
but with different defines (PRIVATE_PREFIX dav1d_ etc.); Phase 6
|
||||
setup will identify the config.h shim needed for standalone
|
||||
assembly.
|
||||
|
||||
- **`src/cdef_tmpl.c`** — the C reference (templated; the
|
||||
`cdef_filter_block_c` core function is in here, expanded to
|
||||
`cdef_filter_block_8x8_c` via `cdef_fn(8, 8)`).
|
||||
|
||||
- **`include/common/intops.h`** — utility helpers (apply_sign,
|
||||
imin, imax, iclip, umin) used by cdef_tmpl.c.
|
||||
|
||||
- **`src/tables_cdef_subset.c`** — hand-extracted `dav1d_cdef_directions`
|
||||
table from `src/tables.c` (lines 400-414). Provides the only
|
||||
table symbol both `cdef.S` and `cdef_tmpl.c` reference externally.
|
||||
Pulling in the full `src/tables.c` (1013 lines) would chain-include
|
||||
the entire dav1d decoder, which is overkill for our purposes.
|
||||
See `tables_cdef_subset.c` header comment for line-range
|
||||
reference back to upstream.
|
||||
|
||||
## Re-vendoring procedure
|
||||
|
||||
Same as FFmpeg snapshot — see `../ffmpeg-snapshot/PROVENANCE.md`.
|
||||
|
||||
```sh
|
||||
TAG=1.x.y
|
||||
BASE=https://raw.githubusercontent.com/videolan/dav1d/$TAG
|
||||
cd external/dav1d-snapshot
|
||||
for f in src/arm/64/cdef.S src/arm/64/util.S src/arm/asm.S \
|
||||
src/cdef_tmpl.c include/common/intops.h; do
|
||||
curl -sSf -o "$f" "$BASE/$f"
|
||||
done
|
||||
# tables_cdef_subset.c needs manual re-extraction from
|
||||
# upstream src/tables.c — search for "dav1d_cdef_directions ="
|
||||
```
|
||||
|
||||
## Pending work (Phase 3+, next session)
|
||||
|
||||
- config.h shim for assembling cdef.S standalone (dav1d's defines
|
||||
differ from FFmpeg's; will identify exact list on first build)
|
||||
- Standalone C reference for `cdef_filter_block_8x8_c` (this snapshot's
|
||||
`cdef_tmpl.c` references several private headers — easier to
|
||||
transcribe to a self-contained `tests/cdef_ref.c`)
|
||||
- `tests/bench_neon_cdef.c` to capture M3₅ baseline
|
||||
+84
@@ -0,0 +1,84 @@
|
||||
/*
|
||||
* Copyright © 2018, VideoLAN and dav1d authors
|
||||
* Copyright © 2018, Two Orioles, LLC
|
||||
* All rights reserved.
|
||||
*
|
||||
* Redistribution and use in source and binary forms, with or without
|
||||
* modification, are permitted provided that the following conditions are met:
|
||||
*
|
||||
* 1. Redistributions of source code must retain the above copyright notice, this
|
||||
* list of conditions and the following disclaimer.
|
||||
*
|
||||
* 2. Redistributions in binary form must reproduce the above copyright notice,
|
||||
* this list of conditions and the following disclaimer in the documentation
|
||||
* and/or other materials provided with the distribution.
|
||||
*
|
||||
* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
|
||||
* ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
|
||||
* WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
|
||||
* DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR
|
||||
* ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
|
||||
* (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
|
||||
* LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
|
||||
* ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
|
||||
* (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
|
||||
* SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
|
||||
*/
|
||||
|
||||
#ifndef DAV1D_COMMON_INTOPS_H
|
||||
#define DAV1D_COMMON_INTOPS_H
|
||||
|
||||
#include <stdint.h>
|
||||
|
||||
#include "common/attributes.h"
|
||||
|
||||
static inline int imax(const int a, const int b) {
|
||||
return a > b ? a : b;
|
||||
}
|
||||
|
||||
static inline int imin(const int a, const int b) {
|
||||
return a < b ? a : b;
|
||||
}
|
||||
|
||||
static inline unsigned umax(const unsigned a, const unsigned b) {
|
||||
return a > b ? a : b;
|
||||
}
|
||||
|
||||
static inline unsigned umin(const unsigned a, const unsigned b) {
|
||||
return a < b ? a : b;
|
||||
}
|
||||
|
||||
static inline int iclip(const int v, const int min, const int max) {
|
||||
return v < min ? min : v > max ? max : v;
|
||||
}
|
||||
|
||||
static inline int iclip_u8(const int v) {
|
||||
return iclip(v, 0, 255);
|
||||
}
|
||||
|
||||
static inline int apply_sign(const int v, const int s) {
|
||||
return s < 0 ? -v : v;
|
||||
}
|
||||
|
||||
static inline int apply_sign64(const int v, const int64_t s) {
|
||||
return s < 0 ? -v : v;
|
||||
}
|
||||
|
||||
static inline int ulog2(const unsigned v) {
|
||||
return 31 - clz(v);
|
||||
}
|
||||
|
||||
static inline int u64log2(const uint64_t v) {
|
||||
return 63 - clzll(v);
|
||||
}
|
||||
|
||||
static inline unsigned inv_recenter(const unsigned r, const unsigned v) {
|
||||
if (v > (r << 1))
|
||||
return v;
|
||||
else if ((v & 1) == 0)
|
||||
return (v >> 1) + r;
|
||||
else
|
||||
return r - ((v + 1) >> 1);
|
||||
}
|
||||
|
||||
#endif /* DAV1D_COMMON_INTOPS_H */
|
||||
+520
@@ -0,0 +1,520 @@
|
||||
/*
|
||||
* Copyright © 2018, VideoLAN and dav1d authors
|
||||
* Copyright © 2019, Martin Storsjo
|
||||
* All rights reserved.
|
||||
*
|
||||
* Redistribution and use in source and binary forms, with or without
|
||||
* modification, are permitted provided that the following conditions are met:
|
||||
*
|
||||
* 1. Redistributions of source code must retain the above copyright notice, this
|
||||
* list of conditions and the following disclaimer.
|
||||
*
|
||||
* 2. Redistributions in binary form must reproduce the above copyright notice,
|
||||
* this list of conditions and the following disclaimer in the documentation
|
||||
* and/or other materials provided with the distribution.
|
||||
*
|
||||
* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
|
||||
* ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
|
||||
* WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
|
||||
* DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR
|
||||
* ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
|
||||
* (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
|
||||
* LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
|
||||
* ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
|
||||
* (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
|
||||
* SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
|
||||
*/
|
||||
|
||||
#include "src/arm/asm.S"
|
||||
#include "util.S"
|
||||
#include "cdef_tmpl.S"
|
||||
|
||||
.macro pad_top_bottom s1, s2, w, stride, rn, rw, ret
|
||||
tst w7, #1 // CDEF_HAVE_LEFT
|
||||
b.eq 2f
|
||||
// CDEF_HAVE_LEFT
|
||||
sub \s1, \s1, #2
|
||||
sub \s2, \s2, #2
|
||||
tst w7, #2 // CDEF_HAVE_RIGHT
|
||||
b.eq 1f
|
||||
// CDEF_HAVE_LEFT+CDEF_HAVE_RIGHT
|
||||
ldr \rn\()0, [\s1]
|
||||
ldr s1, [\s1, #\w]
|
||||
ldr \rn\()2, [\s2]
|
||||
ldr s3, [\s2, #\w]
|
||||
uxtl v0.8h, v0.8b
|
||||
uxtl v1.8h, v1.8b
|
||||
uxtl v2.8h, v2.8b
|
||||
uxtl v3.8h, v3.8b
|
||||
str \rw\()0, [x0]
|
||||
str d1, [x0, #2*\w]
|
||||
add x0, x0, #2*\stride
|
||||
str \rw\()2, [x0]
|
||||
str d3, [x0, #2*\w]
|
||||
.if \ret
|
||||
ret
|
||||
.else
|
||||
add x0, x0, #2*\stride
|
||||
b 3f
|
||||
.endif
|
||||
|
||||
1:
|
||||
// CDEF_HAVE_LEFT+!CDEF_HAVE_RIGHT
|
||||
ldr \rn\()0, [\s1]
|
||||
ldr h1, [\s1, #\w]
|
||||
ldr \rn\()2, [\s2]
|
||||
ldr h3, [\s2, #\w]
|
||||
uxtl v0.8h, v0.8b
|
||||
uxtl v1.8h, v1.8b
|
||||
uxtl v2.8h, v2.8b
|
||||
uxtl v3.8h, v3.8b
|
||||
str \rw\()0, [x0]
|
||||
str s1, [x0, #2*\w]
|
||||
str s31, [x0, #2*\w+4]
|
||||
add x0, x0, #2*\stride
|
||||
str \rw\()2, [x0]
|
||||
str s3, [x0, #2*\w]
|
||||
str s31, [x0, #2*\w+4]
|
||||
.if \ret
|
||||
ret
|
||||
.else
|
||||
add x0, x0, #2*\stride
|
||||
b 3f
|
||||
.endif
|
||||
|
||||
2:
|
||||
// !CDEF_HAVE_LEFT
|
||||
tst w7, #2 // CDEF_HAVE_RIGHT
|
||||
b.eq 1f
|
||||
// !CDEF_HAVE_LEFT+CDEF_HAVE_RIGHT
|
||||
ldr \rn\()0, [\s1]
|
||||
ldr h1, [\s1, #\w]
|
||||
ldr \rn\()2, [\s2]
|
||||
ldr h3, [\s2, #\w]
|
||||
uxtl v0.8h, v0.8b
|
||||
uxtl v1.8h, v1.8b
|
||||
uxtl v2.8h, v2.8b
|
||||
uxtl v3.8h, v3.8b
|
||||
str s31, [x0]
|
||||
stur \rw\()0, [x0, #4]
|
||||
str s1, [x0, #4+2*\w]
|
||||
add x0, x0, #2*\stride
|
||||
str s31, [x0]
|
||||
stur \rw\()2, [x0, #4]
|
||||
str s3, [x0, #4+2*\w]
|
||||
.if \ret
|
||||
ret
|
||||
.else
|
||||
add x0, x0, #2*\stride
|
||||
b 3f
|
||||
.endif
|
||||
|
||||
1:
|
||||
// !CDEF_HAVE_LEFT+!CDEF_HAVE_RIGHT
|
||||
ldr \rn\()0, [\s1]
|
||||
ldr \rn\()1, [\s2]
|
||||
uxtl v0.8h, v0.8b
|
||||
uxtl v1.8h, v1.8b
|
||||
str s31, [x0]
|
||||
stur \rw\()0, [x0, #4]
|
||||
str s31, [x0, #4+2*\w]
|
||||
add x0, x0, #2*\stride
|
||||
str s31, [x0]
|
||||
stur \rw\()1, [x0, #4]
|
||||
str s31, [x0, #4+2*\w]
|
||||
.if \ret
|
||||
ret
|
||||
.else
|
||||
add x0, x0, #2*\stride
|
||||
.endif
|
||||
3:
|
||||
.endm
|
||||
|
||||
.macro load_n_incr dst, src, incr, w
|
||||
.if \w == 4
|
||||
ld1 {\dst\().s}[0], [\src], \incr
|
||||
.else
|
||||
ld1 {\dst\().8b}, [\src], \incr
|
||||
.endif
|
||||
.endm
|
||||
|
||||
// void dav1d_cdef_paddingX_8bpc_neon(uint16_t *tmp, const pixel *src,
|
||||
// ptrdiff_t src_stride, const pixel (*left)[2],
|
||||
// const pixel *const top,
|
||||
// const pixel *const bottom, int h,
|
||||
// enum CdefEdgeFlags edges);
|
||||
|
||||
.macro padding_func w, stride, rn, rw
|
||||
function cdef_padding\w\()_8bpc_neon, export=1
|
||||
cmp w7, #0xf // fully edged
|
||||
b.eq cdef_padding\w\()_edged_8bpc_neon
|
||||
movi v30.8h, #0x80, lsl #8
|
||||
mov v31.16b, v30.16b
|
||||
sub x0, x0, #2*(2*\stride+2)
|
||||
tst w7, #4 // CDEF_HAVE_TOP
|
||||
b.ne 1f
|
||||
// !CDEF_HAVE_TOP
|
||||
st1 {v30.8h, v31.8h}, [x0], #32
|
||||
.if \w == 8
|
||||
st1 {v30.8h, v31.8h}, [x0], #32
|
||||
.endif
|
||||
b 3f
|
||||
1:
|
||||
// CDEF_HAVE_TOP
|
||||
add x9, x4, x2
|
||||
pad_top_bottom x4, x9, \w, \stride, \rn, \rw, 0
|
||||
|
||||
// Middle section
|
||||
3:
|
||||
tst w7, #1 // CDEF_HAVE_LEFT
|
||||
b.eq 2f
|
||||
// CDEF_HAVE_LEFT
|
||||
tst w7, #2 // CDEF_HAVE_RIGHT
|
||||
b.eq 1f
|
||||
// CDEF_HAVE_LEFT+CDEF_HAVE_RIGHT
|
||||
0:
|
||||
ld1 {v0.h}[0], [x3], #2
|
||||
ldr h2, [x1, #\w]
|
||||
load_n_incr v1, x1, x2, \w
|
||||
subs w6, w6, #1
|
||||
uxtl v0.8h, v0.8b
|
||||
uxtl v1.8h, v1.8b
|
||||
uxtl v2.8h, v2.8b
|
||||
str s0, [x0]
|
||||
stur \rw\()1, [x0, #4]
|
||||
str s2, [x0, #4+2*\w]
|
||||
add x0, x0, #2*\stride
|
||||
b.gt 0b
|
||||
b 3f
|
||||
1:
|
||||
// CDEF_HAVE_LEFT+!CDEF_HAVE_RIGHT
|
||||
ld1 {v0.h}[0], [x3], #2
|
||||
load_n_incr v1, x1, x2, \w
|
||||
subs w6, w6, #1
|
||||
uxtl v0.8h, v0.8b
|
||||
uxtl v1.8h, v1.8b
|
||||
str s0, [x0]
|
||||
stur \rw\()1, [x0, #4]
|
||||
str s31, [x0, #4+2*\w]
|
||||
add x0, x0, #2*\stride
|
||||
b.gt 1b
|
||||
b 3f
|
||||
2:
|
||||
tst w7, #2 // CDEF_HAVE_RIGHT
|
||||
b.eq 1f
|
||||
// !CDEF_HAVE_LEFT+CDEF_HAVE_RIGHT
|
||||
0:
|
||||
ldr h1, [x1, #\w]
|
||||
load_n_incr v0, x1, x2, \w
|
||||
subs w6, w6, #1
|
||||
uxtl v0.8h, v0.8b
|
||||
uxtl v1.8h, v1.8b
|
||||
str s31, [x0]
|
||||
stur \rw\()0, [x0, #4]
|
||||
str s1, [x0, #4+2*\w]
|
||||
add x0, x0, #2*\stride
|
||||
b.gt 0b
|
||||
b 3f
|
||||
1:
|
||||
// !CDEF_HAVE_LEFT+!CDEF_HAVE_RIGHT
|
||||
load_n_incr v0, x1, x2, \w
|
||||
subs w6, w6, #1
|
||||
uxtl v0.8h, v0.8b
|
||||
str s31, [x0]
|
||||
stur \rw\()0, [x0, #4]
|
||||
str s31, [x0, #4+2*\w]
|
||||
add x0, x0, #2*\stride
|
||||
b.gt 1b
|
||||
|
||||
3:
|
||||
tst w7, #8 // CDEF_HAVE_BOTTOM
|
||||
b.ne 1f
|
||||
// !CDEF_HAVE_BOTTOM
|
||||
st1 {v30.8h, v31.8h}, [x0], #32
|
||||
.if \w == 8
|
||||
st1 {v30.8h, v31.8h}, [x0], #32
|
||||
.endif
|
||||
ret
|
||||
1:
|
||||
// CDEF_HAVE_BOTTOM
|
||||
add x9, x5, x2
|
||||
pad_top_bottom x5, x9, \w, \stride, \rn, \rw, 1
|
||||
endfunc
|
||||
.endm
|
||||
|
||||
padding_func 8, 16, d, q
|
||||
padding_func 4, 8, s, d
|
||||
|
||||
// void cdef_paddingX_edged_8bpc_neon(uint8_t *tmp, const pixel *src,
|
||||
// ptrdiff_t src_stride, const pixel (*left)[2],
|
||||
// const pixel *const top,
|
||||
// const pixel *const bottom, int h,
|
||||
// enum CdefEdgeFlags edges);
|
||||
|
||||
.macro padding_func_edged w, stride, reg
|
||||
function cdef_padding\w\()_edged_8bpc_neon, export=1
|
||||
sub x4, x4, #2
|
||||
sub x5, x5, #2
|
||||
sub x0, x0, #(2*\stride+2)
|
||||
|
||||
.if \w == 4
|
||||
ldr d0, [x4]
|
||||
ldr d1, [x4, x2]
|
||||
st1 {v0.8b, v1.8b}, [x0], #16
|
||||
.else
|
||||
add x9, x4, x2
|
||||
ldr d0, [x4]
|
||||
ldr s1, [x4, #8]
|
||||
ldr d2, [x9]
|
||||
ldr s3, [x9, #8]
|
||||
str d0, [x0]
|
||||
str s1, [x0, #8]
|
||||
str d2, [x0, #\stride]
|
||||
str s3, [x0, #\stride+8]
|
||||
add x0, x0, #2*\stride
|
||||
.endif
|
||||
|
||||
0:
|
||||
ld1 {v0.h}[0], [x3], #2
|
||||
ldr h2, [x1, #\w]
|
||||
load_n_incr v1, x1, x2, \w
|
||||
subs w6, w6, #1
|
||||
str h0, [x0]
|
||||
stur \reg\()1, [x0, #2]
|
||||
str h2, [x0, #2+\w]
|
||||
add x0, x0, #\stride
|
||||
b.gt 0b
|
||||
|
||||
.if \w == 4
|
||||
ldr d0, [x5]
|
||||
ldr d1, [x5, x2]
|
||||
st1 {v0.8b, v1.8b}, [x0], #16
|
||||
.else
|
||||
add x9, x5, x2
|
||||
ldr d0, [x5]
|
||||
ldr s1, [x5, #8]
|
||||
ldr d2, [x9]
|
||||
ldr s3, [x9, #8]
|
||||
str d0, [x0]
|
||||
str s1, [x0, #8]
|
||||
str d2, [x0, #\stride]
|
||||
str s3, [x0, #\stride+8]
|
||||
.endif
|
||||
ret
|
||||
endfunc
|
||||
.endm
|
||||
|
||||
padding_func_edged 8, 16, d
|
||||
padding_func_edged 4, 8, s
|
||||
|
||||
tables
|
||||
|
||||
filter 8, 8
|
||||
filter 4, 8
|
||||
|
||||
find_dir 8
|
||||
|
||||
.macro load_px_8 d1, d2, w
|
||||
.if \w == 8
|
||||
add x6, x2, w9, sxtb // x + off
|
||||
sub x9, x2, w9, sxtb // x - off
|
||||
ld1 {\d1\().d}[0], [x6] // p0
|
||||
add x6, x6, #16 // += stride
|
||||
ld1 {\d2\().d}[0], [x9] // p1
|
||||
add x9, x9, #16 // += stride
|
||||
ld1 {\d1\().d}[1], [x6] // p0
|
||||
ld1 {\d2\().d}[1], [x9] // p0
|
||||
.else
|
||||
add x6, x2, w9, sxtb // x + off
|
||||
sub x9, x2, w9, sxtb // x - off
|
||||
ld1 {\d1\().s}[0], [x6] // p0
|
||||
add x6, x6, #8 // += stride
|
||||
ld1 {\d2\().s}[0], [x9] // p1
|
||||
add x9, x9, #8 // += stride
|
||||
ld1 {\d1\().s}[1], [x6] // p0
|
||||
add x6, x6, #8 // += stride
|
||||
ld1 {\d2\().s}[1], [x9] // p1
|
||||
add x9, x9, #8 // += stride
|
||||
ld1 {\d1\().s}[2], [x6] // p0
|
||||
add x6, x6, #8 // += stride
|
||||
ld1 {\d2\().s}[2], [x9] // p1
|
||||
add x9, x9, #8 // += stride
|
||||
ld1 {\d1\().s}[3], [x6] // p0
|
||||
ld1 {\d2\().s}[3], [x9] // p1
|
||||
.endif
|
||||
.endm
|
||||
.macro handle_pixel_8 s1, s2, thresh_vec, shift, tap, min
|
||||
.if \min
|
||||
umin v3.16b, v3.16b, \s1\().16b
|
||||
umax v4.16b, v4.16b, \s1\().16b
|
||||
umin v3.16b, v3.16b, \s2\().16b
|
||||
umax v4.16b, v4.16b, \s2\().16b
|
||||
.endif
|
||||
uabd v16.16b, v0.16b, \s1\().16b // abs(diff)
|
||||
uabd v20.16b, v0.16b, \s2\().16b // abs(diff)
|
||||
ushl v17.16b, v16.16b, \shift // abs(diff) >> shift
|
||||
ushl v21.16b, v20.16b, \shift // abs(diff) >> shift
|
||||
uqsub v17.16b, \thresh_vec, v17.16b // clip = imax(0, threshold - (abs(diff) >> shift))
|
||||
uqsub v21.16b, \thresh_vec, v21.16b // clip = imax(0, threshold - (abs(diff) >> shift))
|
||||
cmhi v18.16b, v0.16b, \s1\().16b // px > p0
|
||||
cmhi v22.16b, v0.16b, \s2\().16b // px > p1
|
||||
umin v17.16b, v17.16b, v16.16b // imin(abs(diff), clip)
|
||||
umin v21.16b, v21.16b, v20.16b // imin(abs(diff), clip)
|
||||
dup v19.16b, \tap // taps[k]
|
||||
neg v16.16b, v17.16b // -imin()
|
||||
neg v20.16b, v21.16b // -imin()
|
||||
bsl v18.16b, v16.16b, v17.16b // constrain() = apply_sign()
|
||||
bsl v22.16b, v20.16b, v21.16b // constrain() = apply_sign()
|
||||
mla v1.16b, v18.16b, v19.16b // sum += taps[k] * constrain()
|
||||
mla v2.16b, v22.16b, v19.16b // sum += taps[k] * constrain()
|
||||
.endm
|
||||
|
||||
// void cdef_filterX_edged_8bpc_neon(pixel *dst, ptrdiff_t dst_stride,
|
||||
// const uint8_t *tmp, int pri_strength,
|
||||
// int sec_strength, int dir, int damping,
|
||||
// int h);
|
||||
.macro filter_func_8 w, pri, sec, min, suffix
|
||||
function cdef_filter\w\suffix\()_edged_8bpc_neon
|
||||
.if \pri
|
||||
movrel x8, pri_taps
|
||||
and w9, w3, #1
|
||||
add x8, x8, w9, uxtw #1
|
||||
.endif
|
||||
movrel x9, directions\w
|
||||
add x5, x9, w5, uxtw #1
|
||||
movi v30.8b, #7
|
||||
dup v28.8b, w6 // damping
|
||||
|
||||
.if \pri
|
||||
dup v25.16b, w3 // threshold
|
||||
.endif
|
||||
.if \sec
|
||||
dup v27.16b, w4 // threshold
|
||||
.endif
|
||||
trn1 v24.8b, v25.8b, v27.8b
|
||||
clz v24.8b, v24.8b // clz(threshold)
|
||||
sub v24.8b, v30.8b, v24.8b // ulog2(threshold)
|
||||
uqsub v24.8b, v28.8b, v24.8b // shift = imax(0, damping - ulog2(threshold))
|
||||
neg v24.8b, v24.8b // -shift
|
||||
.if \sec
|
||||
dup v26.16b, v24.b[1]
|
||||
.endif
|
||||
.if \pri
|
||||
dup v24.16b, v24.b[0]
|
||||
.endif
|
||||
|
||||
1:
|
||||
.if \w == 8
|
||||
add x12, x2, #16
|
||||
ld1 {v0.d}[0], [x2] // px
|
||||
ld1 {v0.d}[1], [x12] // px
|
||||
.else
|
||||
add x12, x2, #1*8
|
||||
add x13, x2, #2*8
|
||||
add x14, x2, #3*8
|
||||
ld1 {v0.s}[0], [x2] // px
|
||||
ld1 {v0.s}[1], [x12] // px
|
||||
ld1 {v0.s}[2], [x13] // px
|
||||
ld1 {v0.s}[3], [x14] // px
|
||||
.endif
|
||||
|
||||
// We need 9-bits or two 8-bit accululators to fit the sum.
|
||||
// Max of |sum| > 15*2*6(pri) + 4*4*3(sec) = 228.
|
||||
// Start sum at -1 instead of 0 to help handle rounding later.
|
||||
movi v1.16b, #255 // sum
|
||||
movi v2.16b, #0 // sum
|
||||
.if \min
|
||||
mov v3.16b, v0.16b // min
|
||||
mov v4.16b, v0.16b // max
|
||||
.endif
|
||||
|
||||
// Instead of loading sec_taps 2, 1 from memory, just set it
|
||||
// to 2 initially and decrease for the second round.
|
||||
// This is also used as loop counter.
|
||||
mov w11, #2 // sec_taps[0]
|
||||
|
||||
2:
|
||||
.if \pri
|
||||
ldrb w9, [x5] // off1
|
||||
|
||||
load_px_8 v5, v6, \w
|
||||
.endif
|
||||
|
||||
.if \sec
|
||||
add x5, x5, #4 // +2*2
|
||||
ldrb w9, [x5] // off2
|
||||
load_px_8 v28, v29, \w
|
||||
.endif
|
||||
|
||||
.if \pri
|
||||
ldrb w10, [x8] // *pri_taps
|
||||
|
||||
handle_pixel_8 v5, v6, v25.16b, v24.16b, w10, \min
|
||||
.endif
|
||||
|
||||
.if \sec
|
||||
add x5, x5, #8 // +2*4
|
||||
ldrb w9, [x5] // off3
|
||||
load_px_8 v5, v6, \w
|
||||
|
||||
handle_pixel_8 v28, v29, v27.16b, v26.16b, w11, \min
|
||||
|
||||
handle_pixel_8 v5, v6, v27.16b, v26.16b, w11, \min
|
||||
|
||||
sub x5, x5, #11 // x5 -= 2*(2+4); x5 += 1;
|
||||
.else
|
||||
add x5, x5, #1 // x5 += 1
|
||||
.endif
|
||||
subs w11, w11, #1 // sec_tap-- (value)
|
||||
.if \pri
|
||||
add x8, x8, #1 // pri_taps++ (pointer)
|
||||
.endif
|
||||
b.ne 2b
|
||||
|
||||
// Perform halving adds since the value won't fit otherwise.
|
||||
// To handle the offset for negative values, use both halving w/ and w/o rounding.
|
||||
srhadd v5.16b, v1.16b, v2.16b // sum >> 1
|
||||
shadd v6.16b, v1.16b, v2.16b // (sum - 1) >> 1
|
||||
cmlt v1.16b, v5.16b, #0 // sum < 0
|
||||
bsl v1.16b, v6.16b, v5.16b // (sum - (sum < 0)) >> 1
|
||||
|
||||
srshr v1.16b, v1.16b, #3 // (8 + sum - (sum < 0)) >> 4
|
||||
|
||||
usqadd v0.16b, v1.16b // px + (8 + sum ...) >> 4
|
||||
.if \min
|
||||
umin v0.16b, v0.16b, v4.16b
|
||||
umax v0.16b, v0.16b, v3.16b // iclip(px + .., min, max)
|
||||
.endif
|
||||
.if \w == 8
|
||||
st1 {v0.d}[0], [x0], x1
|
||||
add x2, x2, #2*16 // tmp += 2*tmp_stride
|
||||
subs w7, w7, #2 // h -= 2
|
||||
st1 {v0.d}[1], [x0], x1
|
||||
.else
|
||||
st1 {v0.s}[0], [x0], x1
|
||||
add x2, x2, #4*8 // tmp += 4*tmp_stride
|
||||
st1 {v0.s}[1], [x0], x1
|
||||
subs w7, w7, #4 // h -= 4
|
||||
st1 {v0.s}[2], [x0], x1
|
||||
st1 {v0.s}[3], [x0], x1
|
||||
.endif
|
||||
|
||||
// Reset pri_taps and directions back to the original point
|
||||
sub x5, x5, #2
|
||||
.if \pri
|
||||
sub x8, x8, #2
|
||||
.endif
|
||||
|
||||
b.gt 1b
|
||||
ret
|
||||
endfunc
|
||||
.endm
|
||||
|
||||
.macro filter_8 w
|
||||
filter_func_8 \w, pri=1, sec=0, min=0, suffix=_pri
|
||||
filter_func_8 \w, pri=0, sec=1, min=0, suffix=_sec
|
||||
filter_func_8 \w, pri=1, sec=1, min=1, suffix=_pri_sec
|
||||
.endm
|
||||
|
||||
filter_8 8
|
||||
filter_8 4
|
||||
+278
@@ -0,0 +1,278 @@
|
||||
/******************************************************************************
|
||||
* Copyright © 2018, VideoLAN and dav1d authors
|
||||
* Copyright © 2015 Martin Storsjo
|
||||
* Copyright © 2015 Janne Grunau
|
||||
* All rights reserved.
|
||||
*
|
||||
* Redistribution and use in source and binary forms, with or without
|
||||
* modification, are permitted provided that the following conditions are met:
|
||||
*
|
||||
* 1. Redistributions of source code must retain the above copyright notice, this
|
||||
* list of conditions and the following disclaimer.
|
||||
*
|
||||
* 2. Redistributions in binary form must reproduce the above copyright notice,
|
||||
* this list of conditions and the following disclaimer in the documentation
|
||||
* and/or other materials provided with the distribution.
|
||||
*
|
||||
* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
|
||||
* ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
|
||||
* WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
|
||||
* DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR
|
||||
* ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
|
||||
* (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
|
||||
* LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
|
||||
* ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
|
||||
* (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
|
||||
* SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
|
||||
*****************************************************************************/
|
||||
|
||||
#ifndef DAV1D_SRC_ARM_64_UTIL_S
|
||||
#define DAV1D_SRC_ARM_64_UTIL_S
|
||||
|
||||
#include "config.h"
|
||||
#include "src/arm/asm.S"
|
||||
|
||||
#ifndef __has_feature
|
||||
#define __has_feature(x) 0
|
||||
#endif
|
||||
|
||||
.macro movrel rd, val, offset=0
|
||||
#if defined(__APPLE__)
|
||||
.if \offset < 0
|
||||
adrp \rd, \val@PAGE
|
||||
add \rd, \rd, \val@PAGEOFF
|
||||
sub \rd, \rd, -(\offset)
|
||||
.else
|
||||
adrp \rd, \val+(\offset)@PAGE
|
||||
add \rd, \rd, \val+(\offset)@PAGEOFF
|
||||
.endif
|
||||
#elif defined(PIC) && defined(_WIN32)
|
||||
.if \offset < 0
|
||||
adrp \rd, \val
|
||||
add \rd, \rd, :lo12:\val
|
||||
sub \rd, \rd, -(\offset)
|
||||
.else
|
||||
adrp \rd, \val+(\offset)
|
||||
add \rd, \rd, :lo12:\val+(\offset)
|
||||
.endif
|
||||
#elif __has_feature(hwaddress_sanitizer)
|
||||
adrp \rd, :pg_hi21_nc:\val+(\offset)
|
||||
movk \rd, #:prel_g3:\val+0x100000000
|
||||
add \rd, \rd, :lo12:\val+(\offset)
|
||||
#elif defined(PIC)
|
||||
adrp \rd, \val+(\offset)
|
||||
add \rd, \rd, :lo12:\val+(\offset)
|
||||
#else
|
||||
ldr \rd, =\val+\offset
|
||||
#endif
|
||||
.endm
|
||||
|
||||
.macro sub_sp space
|
||||
#ifdef _WIN32
|
||||
.if \space > 8192
|
||||
// Here, we'd need to touch two (or more) pages while decrementing
|
||||
// the stack pointer.
|
||||
.error "sub_sp_align doesn't support values over 8K at the moment"
|
||||
.elseif \space > 4096
|
||||
sub x16, sp, #4096
|
||||
ldr xzr, [x16]
|
||||
sub sp, x16, #(\space - 4096)
|
||||
.else
|
||||
sub sp, sp, #\space
|
||||
.endif
|
||||
#else
|
||||
.if \space >= 4096
|
||||
sub sp, sp, #(\space)/4096*4096
|
||||
.endif
|
||||
.if (\space % 4096) != 0
|
||||
sub sp, sp, #(\space)%4096
|
||||
.endif
|
||||
#endif
|
||||
.endm
|
||||
|
||||
.macro transpose_8x8b_xtl r0, r1, r2, r3, r4, r5, r6, r7, xtl
|
||||
// a0 b0 a1 b1 a2 b2 a3 b3 a4 b4 a5 b5 a6 b6 a7 b7
|
||||
zip1 \r0\().16b, \r0\().16b, \r1\().16b
|
||||
// c0 d0 c1 d1 c2 d2 d3 d3 c4 d4 c5 d5 c6 d6 d7 d7
|
||||
zip1 \r2\().16b, \r2\().16b, \r3\().16b
|
||||
// e0 f0 e1 f1 e2 f2 e3 f3 e4 f4 e5 f5 e6 f6 e7 f7
|
||||
zip1 \r4\().16b, \r4\().16b, \r5\().16b
|
||||
// g0 h0 g1 h1 g2 h2 h3 h3 g4 h4 g5 h5 g6 h6 h7 h7
|
||||
zip1 \r6\().16b, \r6\().16b, \r7\().16b
|
||||
|
||||
// a0 b0 c0 d0 a2 b2 c2 d2 a4 b4 c4 d4 a6 b6 c6 d6
|
||||
trn1 \r1\().8h, \r0\().8h, \r2\().8h
|
||||
// a1 b1 c1 d1 a3 b3 c3 d3 a5 b5 c5 d5 a7 b7 c7 d7
|
||||
trn2 \r3\().8h, \r0\().8h, \r2\().8h
|
||||
// e0 f0 g0 h0 e2 f2 g2 h2 e4 f4 g4 h4 e6 f6 g6 h6
|
||||
trn1 \r5\().8h, \r4\().8h, \r6\().8h
|
||||
// e1 f1 g1 h1 e3 f3 g3 h3 e5 f5 g5 h5 e7 f7 g7 h7
|
||||
trn2 \r7\().8h, \r4\().8h, \r6\().8h
|
||||
|
||||
// a0 b0 c0 d0 e0 f0 g0 h0 a4 b4 c4 d4 e4 f4 g4 h4
|
||||
trn1 \r0\().4s, \r1\().4s, \r5\().4s
|
||||
// a2 b2 c2 d2 e2 f2 g2 h2 a6 b6 c6 d6 e6 f6 g6 h6
|
||||
trn2 \r2\().4s, \r1\().4s, \r5\().4s
|
||||
// a1 b1 c1 d1 e1 f1 g1 h1 a5 b5 c5 d5 e5 f5 g5 h5
|
||||
trn1 \r1\().4s, \r3\().4s, \r7\().4s
|
||||
// a3 b3 c3 d3 e3 f3 g3 h3 a7 b7 c7 d7 e7 f7 g7 h7
|
||||
trn2 \r3\().4s, \r3\().4s, \r7\().4s
|
||||
|
||||
\xtl\()2 \r4\().8h, \r0\().16b
|
||||
\xtl \r0\().8h, \r0\().8b
|
||||
\xtl\()2 \r6\().8h, \r2\().16b
|
||||
\xtl \r2\().8h, \r2\().8b
|
||||
\xtl\()2 \r5\().8h, \r1\().16b
|
||||
\xtl \r1\().8h, \r1\().8b
|
||||
\xtl\()2 \r7\().8h, \r3\().16b
|
||||
\xtl \r3\().8h, \r3\().8b
|
||||
.endm
|
||||
|
||||
.macro transpose_8x8h r0, r1, r2, r3, r4, r5, r6, r7, t8, t9
|
||||
trn1 \t8\().8h, \r0\().8h, \r1\().8h
|
||||
trn2 \t9\().8h, \r0\().8h, \r1\().8h
|
||||
trn1 \r1\().8h, \r2\().8h, \r3\().8h
|
||||
trn2 \r3\().8h, \r2\().8h, \r3\().8h
|
||||
trn1 \r0\().8h, \r4\().8h, \r5\().8h
|
||||
trn2 \r5\().8h, \r4\().8h, \r5\().8h
|
||||
trn1 \r2\().8h, \r6\().8h, \r7\().8h
|
||||
trn2 \r7\().8h, \r6\().8h, \r7\().8h
|
||||
|
||||
trn1 \r4\().4s, \r0\().4s, \r2\().4s
|
||||
trn2 \r2\().4s, \r0\().4s, \r2\().4s
|
||||
trn1 \r6\().4s, \r5\().4s, \r7\().4s
|
||||
trn2 \r7\().4s, \r5\().4s, \r7\().4s
|
||||
trn1 \r5\().4s, \t9\().4s, \r3\().4s
|
||||
trn2 \t9\().4s, \t9\().4s, \r3\().4s
|
||||
trn1 \r3\().4s, \t8\().4s, \r1\().4s
|
||||
trn2 \t8\().4s, \t8\().4s, \r1\().4s
|
||||
|
||||
trn1 \r0\().2d, \r3\().2d, \r4\().2d
|
||||
trn2 \r4\().2d, \r3\().2d, \r4\().2d
|
||||
trn1 \r1\().2d, \r5\().2d, \r6\().2d
|
||||
trn2 \r5\().2d, \r5\().2d, \r6\().2d
|
||||
trn2 \r6\().2d, \t8\().2d, \r2\().2d
|
||||
trn1 \r2\().2d, \t8\().2d, \r2\().2d
|
||||
trn1 \r3\().2d, \t9\().2d, \r7\().2d
|
||||
trn2 \r7\().2d, \t9\().2d, \r7\().2d
|
||||
.endm
|
||||
|
||||
.macro transpose_8x8h_mov r0, r1, r2, r3, r4, r5, r6, r7, t8, t9, o0, o1, o2, o3, o4, o5, o6, o7
|
||||
trn1 \t8\().8h, \r0\().8h, \r1\().8h
|
||||
trn2 \t9\().8h, \r0\().8h, \r1\().8h
|
||||
trn1 \r1\().8h, \r2\().8h, \r3\().8h
|
||||
trn2 \r3\().8h, \r2\().8h, \r3\().8h
|
||||
trn1 \r0\().8h, \r4\().8h, \r5\().8h
|
||||
trn2 \r5\().8h, \r4\().8h, \r5\().8h
|
||||
trn1 \r2\().8h, \r6\().8h, \r7\().8h
|
||||
trn2 \r7\().8h, \r6\().8h, \r7\().8h
|
||||
|
||||
trn1 \r4\().4s, \r0\().4s, \r2\().4s
|
||||
trn2 \r2\().4s, \r0\().4s, \r2\().4s
|
||||
trn1 \r6\().4s, \r5\().4s, \r7\().4s
|
||||
trn2 \r7\().4s, \r5\().4s, \r7\().4s
|
||||
trn1 \r5\().4s, \t9\().4s, \r3\().4s
|
||||
trn2 \t9\().4s, \t9\().4s, \r3\().4s
|
||||
trn1 \r3\().4s, \t8\().4s, \r1\().4s
|
||||
trn2 \t8\().4s, \t8\().4s, \r1\().4s
|
||||
|
||||
trn1 \o0\().2d, \r3\().2d, \r4\().2d
|
||||
trn2 \o4\().2d, \r3\().2d, \r4\().2d
|
||||
trn1 \o1\().2d, \r5\().2d, \r6\().2d
|
||||
trn2 \o5\().2d, \r5\().2d, \r6\().2d
|
||||
trn2 \o6\().2d, \t8\().2d, \r2\().2d
|
||||
trn1 \o2\().2d, \t8\().2d, \r2\().2d
|
||||
trn1 \o3\().2d, \t9\().2d, \r7\().2d
|
||||
trn2 \o7\().2d, \t9\().2d, \r7\().2d
|
||||
.endm
|
||||
|
||||
.macro transpose_8x16b r0, r1, r2, r3, r4, r5, r6, r7, t8, t9
|
||||
trn1 \t8\().16b, \r0\().16b, \r1\().16b
|
||||
trn2 \t9\().16b, \r0\().16b, \r1\().16b
|
||||
trn1 \r1\().16b, \r2\().16b, \r3\().16b
|
||||
trn2 \r3\().16b, \r2\().16b, \r3\().16b
|
||||
trn1 \r0\().16b, \r4\().16b, \r5\().16b
|
||||
trn2 \r5\().16b, \r4\().16b, \r5\().16b
|
||||
trn1 \r2\().16b, \r6\().16b, \r7\().16b
|
||||
trn2 \r7\().16b, \r6\().16b, \r7\().16b
|
||||
|
||||
trn1 \r4\().8h, \r0\().8h, \r2\().8h
|
||||
trn2 \r2\().8h, \r0\().8h, \r2\().8h
|
||||
trn1 \r6\().8h, \r5\().8h, \r7\().8h
|
||||
trn2 \r7\().8h, \r5\().8h, \r7\().8h
|
||||
trn1 \r5\().8h, \t9\().8h, \r3\().8h
|
||||
trn2 \t9\().8h, \t9\().8h, \r3\().8h
|
||||
trn1 \r3\().8h, \t8\().8h, \r1\().8h
|
||||
trn2 \t8\().8h, \t8\().8h, \r1\().8h
|
||||
|
||||
trn1 \r0\().4s, \r3\().4s, \r4\().4s
|
||||
trn2 \r4\().4s, \r3\().4s, \r4\().4s
|
||||
trn1 \r1\().4s, \r5\().4s, \r6\().4s
|
||||
trn2 \r5\().4s, \r5\().4s, \r6\().4s
|
||||
trn2 \r6\().4s, \t8\().4s, \r2\().4s
|
||||
trn1 \r2\().4s, \t8\().4s, \r2\().4s
|
||||
trn1 \r3\().4s, \t9\().4s, \r7\().4s
|
||||
trn2 \r7\().4s, \t9\().4s, \r7\().4s
|
||||
.endm
|
||||
|
||||
.macro transpose_4x16b r0, r1, r2, r3, t4, t5, t6, t7
|
||||
trn1 \t4\().16b, \r0\().16b, \r1\().16b
|
||||
trn2 \t5\().16b, \r0\().16b, \r1\().16b
|
||||
trn1 \t6\().16b, \r2\().16b, \r3\().16b
|
||||
trn2 \t7\().16b, \r2\().16b, \r3\().16b
|
||||
|
||||
trn1 \r0\().8h, \t4\().8h, \t6\().8h
|
||||
trn2 \r2\().8h, \t4\().8h, \t6\().8h
|
||||
trn1 \r1\().8h, \t5\().8h, \t7\().8h
|
||||
trn2 \r3\().8h, \t5\().8h, \t7\().8h
|
||||
.endm
|
||||
|
||||
.macro transpose_4x4h r0, r1, r2, r3, t4, t5, t6, t7
|
||||
trn1 \t4\().4h, \r0\().4h, \r1\().4h
|
||||
trn2 \t5\().4h, \r0\().4h, \r1\().4h
|
||||
trn1 \t6\().4h, \r2\().4h, \r3\().4h
|
||||
trn2 \t7\().4h, \r2\().4h, \r3\().4h
|
||||
|
||||
trn1 \r0\().2s, \t4\().2s, \t6\().2s
|
||||
trn2 \r2\().2s, \t4\().2s, \t6\().2s
|
||||
trn1 \r1\().2s, \t5\().2s, \t7\().2s
|
||||
trn2 \r3\().2s, \t5\().2s, \t7\().2s
|
||||
.endm
|
||||
|
||||
.macro transpose_4x4s r0, r1, r2, r3, t4, t5, t6, t7
|
||||
trn1 \t4\().4s, \r0\().4s, \r1\().4s
|
||||
trn2 \t5\().4s, \r0\().4s, \r1\().4s
|
||||
trn1 \t6\().4s, \r2\().4s, \r3\().4s
|
||||
trn2 \t7\().4s, \r2\().4s, \r3\().4s
|
||||
|
||||
trn1 \r0\().2d, \t4\().2d, \t6\().2d
|
||||
trn2 \r2\().2d, \t4\().2d, \t6\().2d
|
||||
trn1 \r1\().2d, \t5\().2d, \t7\().2d
|
||||
trn2 \r3\().2d, \t5\().2d, \t7\().2d
|
||||
.endm
|
||||
|
||||
.macro transpose_4x8h r0, r1, r2, r3, t4, t5, t6, t7
|
||||
trn1 \t4\().8h, \r0\().8h, \r1\().8h
|
||||
trn2 \t5\().8h, \r0\().8h, \r1\().8h
|
||||
trn1 \t6\().8h, \r2\().8h, \r3\().8h
|
||||
trn2 \t7\().8h, \r2\().8h, \r3\().8h
|
||||
|
||||
trn1 \r0\().4s, \t4\().4s, \t6\().4s
|
||||
trn2 \r2\().4s, \t4\().4s, \t6\().4s
|
||||
trn1 \r1\().4s, \t5\().4s, \t7\().4s
|
||||
trn2 \r3\().4s, \t5\().4s, \t7\().4s
|
||||
.endm
|
||||
|
||||
.macro transpose_4x8h_mov r0, r1, r2, r3, t4, t5, t6, t7, o0, o1, o2, o3
|
||||
trn1 \t4\().8h, \r0\().8h, \r1\().8h
|
||||
trn2 \t5\().8h, \r0\().8h, \r1\().8h
|
||||
trn1 \t6\().8h, \r2\().8h, \r3\().8h
|
||||
trn2 \t7\().8h, \r2\().8h, \r3\().8h
|
||||
|
||||
trn1 \o0\().4s, \t4\().4s, \t6\().4s
|
||||
trn2 \o2\().4s, \t4\().4s, \t6\().4s
|
||||
trn1 \o1\().4s, \t5\().4s, \t7\().4s
|
||||
trn2 \o3\().4s, \t5\().4s, \t7\().4s
|
||||
.endm
|
||||
|
||||
#endif /* DAV1D_SRC_ARM_64_UTIL_S */
|
||||
+335
@@ -0,0 +1,335 @@
|
||||
/*
|
||||
* Copyright © 2018, VideoLAN and dav1d authors
|
||||
* Copyright © 2018, Janne Grunau
|
||||
* All rights reserved.
|
||||
*
|
||||
* Redistribution and use in source and binary forms, with or without
|
||||
* modification, are permitted provided that the following conditions are met:
|
||||
*
|
||||
* 1. Redistributions of source code must retain the above copyright notice, this
|
||||
* list of conditions and the following disclaimer.
|
||||
*
|
||||
* 2. Redistributions in binary form must reproduce the above copyright notice,
|
||||
* this list of conditions and the following disclaimer in the documentation
|
||||
* and/or other materials provided with the distribution.
|
||||
*
|
||||
* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
|
||||
* ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
|
||||
* WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
|
||||
* DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR
|
||||
* ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
|
||||
* (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
|
||||
* LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
|
||||
* ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
|
||||
* (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
|
||||
* SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
|
||||
*/
|
||||
|
||||
#ifndef DAV1D_SRC_ARM_ASM_S
|
||||
#define DAV1D_SRC_ARM_ASM_S
|
||||
|
||||
#include "config.h"
|
||||
|
||||
#if ARCH_AARCH64
|
||||
#define x18 do_not_use_x18
|
||||
#define w18 do_not_use_w18
|
||||
|
||||
#if HAVE_AS_ARCH_DIRECTIVE
|
||||
.arch AS_ARCH_LEVEL
|
||||
#endif
|
||||
|
||||
#if HAVE_AS_ARCHEXT_DOTPROD_DIRECTIVE
|
||||
#define ENABLE_DOTPROD .arch_extension dotprod
|
||||
#define DISABLE_DOTPROD .arch_extension nodotprod
|
||||
#else
|
||||
#define ENABLE_DOTPROD
|
||||
#define DISABLE_DOTPROD
|
||||
#endif
|
||||
#if HAVE_AS_ARCHEXT_I8MM_DIRECTIVE
|
||||
#define ENABLE_I8MM .arch_extension i8mm
|
||||
#define DISABLE_I8MM .arch_extension noi8mm
|
||||
#else
|
||||
#define ENABLE_I8MM
|
||||
#define DISABLE_I8MM
|
||||
#endif
|
||||
#if HAVE_AS_ARCHEXT_SVE_DIRECTIVE
|
||||
#define ENABLE_SVE .arch_extension sve
|
||||
#define DISABLE_SVE .arch_extension nosve
|
||||
#else
|
||||
#define ENABLE_SVE
|
||||
#define DISABLE_SVE
|
||||
#endif
|
||||
#if HAVE_AS_ARCHEXT_SVE2_DIRECTIVE
|
||||
#define ENABLE_SVE2 .arch_extension sve2
|
||||
#define DISABLE_SVE2 .arch_extension nosve2
|
||||
#else
|
||||
#define ENABLE_SVE2
|
||||
#define DISABLE_SVE2
|
||||
#endif
|
||||
|
||||
/* If we do support the .arch_extension directives, disable support for all
|
||||
* the extensions that we may use, in case they were implicitly enabled by
|
||||
* the .arch level. This makes it clear if we try to assemble an instruction
|
||||
* from an unintended extension set; we only allow assmbling such instructions
|
||||
* within regions where we explicitly enable those extensions. */
|
||||
DISABLE_DOTPROD
|
||||
DISABLE_I8MM
|
||||
DISABLE_SVE
|
||||
DISABLE_SVE2
|
||||
|
||||
|
||||
/* Support macros for
|
||||
* - Armv8.3-A Pointer Authentication and
|
||||
* - Armv8.5-A Branch Target Identification
|
||||
* features which require emitting a .note.gnu.property section with the
|
||||
* appropriate architecture-dependent feature bits set.
|
||||
*
|
||||
* |AARCH64_SIGN_LINK_REGISTER| and |AARCH64_VALIDATE_LINK_REGISTER| expand to
|
||||
* PACIxSP and AUTIxSP, respectively. |AARCH64_SIGN_LINK_REGISTER| should be
|
||||
* used immediately before saving the LR register (x30) to the stack.
|
||||
* |AARCH64_VALIDATE_LINK_REGISTER| should be used immediately after restoring
|
||||
* it. Note |AARCH64_SIGN_LINK_REGISTER|'s modifications to LR must be undone
|
||||
* with |AARCH64_VALIDATE_LINK_REGISTER| before RET. The SP register must also
|
||||
* have the same value at the two points. For example:
|
||||
*
|
||||
* .global f
|
||||
* f:
|
||||
* AARCH64_SIGN_LINK_REGISTER
|
||||
* stp x29, x30, [sp, #-96]!
|
||||
* mov x29, sp
|
||||
* ...
|
||||
* ldp x29, x30, [sp], #96
|
||||
* AARCH64_VALIDATE_LINK_REGISTER
|
||||
* ret
|
||||
*
|
||||
* |AARCH64_VALID_CALL_TARGET| expands to BTI 'c'. Either it, or
|
||||
* |AARCH64_SIGN_LINK_REGISTER|, must be used at every point that may be an
|
||||
* indirect call target. In particular, all symbols exported from a file must
|
||||
* begin with one of these macros. For example, a leaf function that does not
|
||||
* save LR can instead use |AARCH64_VALID_CALL_TARGET|:
|
||||
*
|
||||
* .globl return_zero
|
||||
* return_zero:
|
||||
* AARCH64_VALID_CALL_TARGET
|
||||
* mov x0, #0
|
||||
* ret
|
||||
*
|
||||
* A non-leaf function which does not immediately save LR may need both macros
|
||||
* because |AARCH64_SIGN_LINK_REGISTER| appears late. For example, the function
|
||||
* may jump to an alternate implementation before setting up the stack:
|
||||
*
|
||||
* .globl with_early_jump
|
||||
* with_early_jump:
|
||||
* AARCH64_VALID_CALL_TARGET
|
||||
* cmp x0, #128
|
||||
* b.lt .Lwith_early_jump_128
|
||||
* AARCH64_SIGN_LINK_REGISTER
|
||||
* stp x29, x30, [sp, #-96]!
|
||||
* mov x29, sp
|
||||
* ...
|
||||
* ldp x29, x30, [sp], #96
|
||||
* AARCH64_VALIDATE_LINK_REGISTER
|
||||
* ret
|
||||
*
|
||||
* .Lwith_early_jump_128:
|
||||
* ...
|
||||
* ret
|
||||
*
|
||||
* These annotations are only required with indirect calls. Private symbols that
|
||||
* are only the target of direct calls do not require annotations. Also note
|
||||
* that |AARCH64_VALID_CALL_TARGET| is only valid for indirect calls (BLR), not
|
||||
* indirect jumps (BR). Indirect jumps in assembly are supported through
|
||||
* |AARCH64_VALID_JUMP_TARGET|. Landing Pads which shall serve for jumps and
|
||||
* calls can be created using |AARCH64_VALID_JUMP_CALL_TARGET|.
|
||||
*
|
||||
* Although not necessary, it is safe to use these macros in 32-bit ARM
|
||||
* assembly. This may be used to simplify dual 32-bit and 64-bit files.
|
||||
*
|
||||
* References:
|
||||
* - "ELF for the Arm® 64-bit Architecture"
|
||||
* https: *github.com/ARM-software/abi-aa/blob/master/aaelf64/aaelf64.rst
|
||||
* - "Providing protection for complex software"
|
||||
* https://developer.arm.com/architectures/learn-the-architecture/providing-protection-for-complex-software
|
||||
*/
|
||||
#if defined(__ARM_FEATURE_BTI_DEFAULT) && (__ARM_FEATURE_BTI_DEFAULT == 1)
|
||||
#define GNU_PROPERTY_AARCH64_BTI (1 << 0) // Has Branch Target Identification
|
||||
#define AARCH64_VALID_JUMP_CALL_TARGET hint #38 // BTI 'jc'
|
||||
#define AARCH64_VALID_CALL_TARGET hint #34 // BTI 'c'
|
||||
#define AARCH64_VALID_JUMP_TARGET hint #36 // BTI 'j'
|
||||
#else
|
||||
#define GNU_PROPERTY_AARCH64_BTI 0 // No Branch Target Identification
|
||||
#define AARCH64_VALID_JUMP_CALL_TARGET
|
||||
#define AARCH64_VALID_CALL_TARGET
|
||||
#define AARCH64_VALID_JUMP_TARGET
|
||||
#endif
|
||||
|
||||
#if defined(__ARM_FEATURE_PAC_DEFAULT)
|
||||
|
||||
#if ((__ARM_FEATURE_PAC_DEFAULT & (1 << 0)) != 0) // authentication using key A
|
||||
#define AARCH64_SIGN_LINK_REGISTER paciasp
|
||||
#define AARCH64_VALIDATE_LINK_REGISTER autiasp
|
||||
#elif ((__ARM_FEATURE_PAC_DEFAULT & (1 << 1)) != 0) // authentication using key B
|
||||
#define AARCH64_SIGN_LINK_REGISTER pacibsp
|
||||
#define AARCH64_VALIDATE_LINK_REGISTER autibsp
|
||||
#else
|
||||
#error Pointer authentication defines no valid key!
|
||||
#endif
|
||||
#if ((__ARM_FEATURE_PAC_DEFAULT & (1 << 2)) != 0) // authentication of leaf functions
|
||||
#error Authentication of leaf functions is enabled but not supported in dav1d!
|
||||
#endif
|
||||
#define GNU_PROPERTY_AARCH64_PAC (1 << 1)
|
||||
|
||||
#elif defined(__APPLE__) && defined(__arm64e__)
|
||||
|
||||
#define GNU_PROPERTY_AARCH64_PAC 0
|
||||
#define AARCH64_SIGN_LINK_REGISTER pacibsp
|
||||
#define AARCH64_VALIDATE_LINK_REGISTER autibsp
|
||||
|
||||
#else /* __ARM_FEATURE_PAC_DEFAULT */
|
||||
|
||||
#define GNU_PROPERTY_AARCH64_PAC 0
|
||||
#define AARCH64_SIGN_LINK_REGISTER
|
||||
#define AARCH64_VALIDATE_LINK_REGISTER
|
||||
|
||||
#endif /* !__ARM_FEATURE_PAC_DEFAULT */
|
||||
|
||||
|
||||
#if (GNU_PROPERTY_AARCH64_BTI != 0 || GNU_PROPERTY_AARCH64_PAC != 0) && defined(__ELF__)
|
||||
.pushsection .note.gnu.property, "a"
|
||||
.balign 8
|
||||
.long 4
|
||||
.long 0x10
|
||||
.long 0x5
|
||||
.asciz "GNU"
|
||||
.long 0xc0000000 /* GNU_PROPERTY_AARCH64_FEATURE_1_AND */
|
||||
.long 4
|
||||
.long (GNU_PROPERTY_AARCH64_BTI | GNU_PROPERTY_AARCH64_PAC)
|
||||
.long 0
|
||||
.popsection
|
||||
#endif /* (GNU_PROPERTY_AARCH64_BTI != 0 || GNU_PROPERTY_AARCH64_PAC != 0) && defined(__ELF__) */
|
||||
#endif /* ARCH_AARCH64 */
|
||||
|
||||
#if ARCH_ARM
|
||||
.syntax unified
|
||||
#ifdef __ELF__
|
||||
.arch armv7-a
|
||||
.fpu neon
|
||||
.eabi_attribute 10, 0 // suppress Tag_FP_arch
|
||||
.eabi_attribute 12, 0 // suppress Tag_Advanced_SIMD_arch
|
||||
.section .note.GNU-stack,"",%progbits // Mark stack as non-executable
|
||||
#endif /* __ELF__ */
|
||||
|
||||
#ifdef _WIN32
|
||||
#define CONFIG_THUMB 1
|
||||
#else
|
||||
#define CONFIG_THUMB 0
|
||||
#endif
|
||||
|
||||
#if CONFIG_THUMB
|
||||
.thumb
|
||||
#define A @
|
||||
#define T
|
||||
#else
|
||||
#define A
|
||||
#define T @
|
||||
#endif /* CONFIG_THUMB */
|
||||
#endif /* ARCH_ARM */
|
||||
|
||||
#if !defined(PIC)
|
||||
#if defined(__PIC__)
|
||||
#define PIC __PIC__
|
||||
#elif defined(__pic__)
|
||||
#define PIC __pic__
|
||||
#endif
|
||||
#endif
|
||||
|
||||
#ifndef PRIVATE_PREFIX
|
||||
#define PRIVATE_PREFIX dav1d_
|
||||
#endif
|
||||
|
||||
#define PASTE(a,b) a ## b
|
||||
#define CONCAT(a,b) PASTE(a,b)
|
||||
|
||||
#ifdef PREFIX
|
||||
#define EXTERN CONCAT(_,PRIVATE_PREFIX)
|
||||
#else
|
||||
#define EXTERN PRIVATE_PREFIX
|
||||
#endif
|
||||
|
||||
.macro function name, export=0, align=2
|
||||
.macro endfunc
|
||||
#ifdef __ELF__
|
||||
.size \name, . - \name
|
||||
#endif
|
||||
#if HAVE_AS_FUNC
|
||||
.endfunc
|
||||
#endif
|
||||
.purgem endfunc
|
||||
.endm
|
||||
.text
|
||||
.align \align
|
||||
.if \export
|
||||
.global EXTERN\name
|
||||
#ifdef __ELF__
|
||||
.type EXTERN\name, %function
|
||||
.hidden EXTERN\name
|
||||
#elif defined(__MACH__)
|
||||
.private_extern EXTERN\name
|
||||
#endif
|
||||
#if HAVE_AS_FUNC
|
||||
.func EXTERN\name
|
||||
#endif
|
||||
EXTERN\name:
|
||||
.else
|
||||
#ifdef __ELF__
|
||||
.type \name, %function
|
||||
#endif
|
||||
#if HAVE_AS_FUNC
|
||||
.func \name
|
||||
#endif
|
||||
.endif
|
||||
\name:
|
||||
#if ARCH_AARCH64
|
||||
.if \export
|
||||
AARCH64_VALID_CALL_TARGET
|
||||
.endif
|
||||
#endif
|
||||
.endm
|
||||
|
||||
.macro const name, export=0, align=2
|
||||
.macro endconst
|
||||
#ifdef __ELF__
|
||||
.size \name, . - \name
|
||||
#endif
|
||||
.purgem endconst
|
||||
.endm
|
||||
#if defined(_WIN32)
|
||||
.section .rdata
|
||||
#elif !defined(__MACH__)
|
||||
.section .rodata
|
||||
#else
|
||||
.const_data
|
||||
#endif
|
||||
.align \align
|
||||
.if \export
|
||||
.global EXTERN\name
|
||||
#ifdef __ELF__
|
||||
.hidden EXTERN\name
|
||||
#elif defined(__MACH__)
|
||||
.private_extern EXTERN\name
|
||||
#endif
|
||||
EXTERN\name:
|
||||
.endif
|
||||
\name:
|
||||
.endm
|
||||
|
||||
#ifdef __APPLE__
|
||||
#define L(x) L ## x
|
||||
#else
|
||||
#define L(x) .L ## x
|
||||
#endif
|
||||
|
||||
#define X(x) CONCAT(EXTERN, x)
|
||||
|
||||
|
||||
#endif /* DAV1D_SRC_ARM_ASM_S */
|
||||
+331
@@ -0,0 +1,331 @@
|
||||
/*
|
||||
* Copyright © 2018, VideoLAN and dav1d authors
|
||||
* Copyright © 2018, Two Orioles, LLC
|
||||
* All rights reserved.
|
||||
*
|
||||
* Redistribution and use in source and binary forms, with or without
|
||||
* modification, are permitted provided that the following conditions are met:
|
||||
*
|
||||
* 1. Redistributions of source code must retain the above copyright notice, this
|
||||
* list of conditions and the following disclaimer.
|
||||
*
|
||||
* 2. Redistributions in binary form must reproduce the above copyright notice,
|
||||
* this list of conditions and the following disclaimer in the documentation
|
||||
* and/or other materials provided with the distribution.
|
||||
*
|
||||
* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
|
||||
* ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
|
||||
* WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
|
||||
* DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR
|
||||
* ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
|
||||
* (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
|
||||
* LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
|
||||
* ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
|
||||
* (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
|
||||
* SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
|
||||
*/
|
||||
|
||||
#include "config.h"
|
||||
|
||||
#include <stdlib.h>
|
||||
|
||||
#include "common/intops.h"
|
||||
|
||||
#include "src/cdef.h"
|
||||
#include "src/tables.h"
|
||||
|
||||
static inline int constrain(const int diff, const int threshold,
|
||||
const int shift)
|
||||
{
|
||||
const int adiff = abs(diff);
|
||||
return apply_sign(imin(adiff, imax(0, threshold - (adiff >> shift))), diff);
|
||||
}
|
||||
|
||||
static inline void fill(int16_t *tmp, const ptrdiff_t stride,
|
||||
const int w, const int h)
|
||||
{
|
||||
/* Use a value that's a large positive number when interpreted as unsigned,
|
||||
* and a large negative number when interpreted as signed. */
|
||||
for (int y = 0; y < h; y++) {
|
||||
for (int x = 0; x < w; x++)
|
||||
tmp[x] = INT16_MIN;
|
||||
tmp += stride;
|
||||
}
|
||||
}
|
||||
|
||||
static void padding(int16_t *tmp, const ptrdiff_t tmp_stride,
|
||||
const pixel *src, const ptrdiff_t src_stride,
|
||||
const pixel (*left)[2],
|
||||
const pixel *top, const pixel *bottom,
|
||||
const int w, const int h, const enum CdefEdgeFlags edges)
|
||||
{
|
||||
// fill extended input buffer
|
||||
int x_start = -2, x_end = w + 2, y_start = -2, y_end = h + 2;
|
||||
if (!(edges & CDEF_HAVE_TOP)) {
|
||||
fill(tmp - 2 - 2 * tmp_stride, tmp_stride, w + 4, 2);
|
||||
y_start = 0;
|
||||
}
|
||||
if (!(edges & CDEF_HAVE_BOTTOM)) {
|
||||
fill(tmp + h * tmp_stride - 2, tmp_stride, w + 4, 2);
|
||||
y_end -= 2;
|
||||
}
|
||||
if (!(edges & CDEF_HAVE_LEFT)) {
|
||||
fill(tmp + y_start * tmp_stride - 2, tmp_stride, 2, y_end - y_start);
|
||||
x_start = 0;
|
||||
}
|
||||
if (!(edges & CDEF_HAVE_RIGHT)) {
|
||||
fill(tmp + y_start * tmp_stride + w, tmp_stride, 2, y_end - y_start);
|
||||
x_end -= 2;
|
||||
}
|
||||
|
||||
for (int y = y_start; y < 0; y++) {
|
||||
for (int x = x_start; x < x_end; x++)
|
||||
tmp[x + y * tmp_stride] = top[x];
|
||||
top += PXSTRIDE(src_stride);
|
||||
}
|
||||
for (int y = 0; y < h; y++)
|
||||
for (int x = x_start; x < 0; x++)
|
||||
tmp[x + y * tmp_stride] = left[y][2 + x];
|
||||
for (int y = 0; y < h; y++) {
|
||||
for (int x = (y < h) ? 0 : x_start; x < x_end; x++)
|
||||
tmp[x] = src[x];
|
||||
src += PXSTRIDE(src_stride);
|
||||
tmp += tmp_stride;
|
||||
}
|
||||
for (int y = h; y < y_end; y++) {
|
||||
for (int x = x_start; x < x_end; x++)
|
||||
tmp[x] = bottom[x];
|
||||
bottom += PXSTRIDE(src_stride);
|
||||
tmp += tmp_stride;
|
||||
}
|
||||
|
||||
}
|
||||
|
||||
static NOINLINE void
|
||||
cdef_filter_block_c(pixel *dst, const ptrdiff_t dst_stride,
|
||||
const pixel (*left)[2],
|
||||
const pixel *const top, const pixel *const bottom,
|
||||
const int pri_strength, const int sec_strength,
|
||||
const int dir, const int damping, const int w, int h,
|
||||
const enum CdefEdgeFlags edges HIGHBD_DECL_SUFFIX)
|
||||
{
|
||||
const ptrdiff_t tmp_stride = 12;
|
||||
assert((w == 4 || w == 8) && (h == 4 || h == 8));
|
||||
int16_t tmp_buf[144]; // 12*12 is the maximum value of tmp_stride * (h + 4)
|
||||
int16_t *tmp = tmp_buf + 2 * tmp_stride + 2;
|
||||
|
||||
padding(tmp, tmp_stride, dst, dst_stride, left, top, bottom, w, h, edges);
|
||||
|
||||
if (pri_strength) {
|
||||
const int bitdepth_min_8 = bitdepth_from_max(bitdepth_max) - 8;
|
||||
const int pri_tap = 4 - ((pri_strength >> bitdepth_min_8) & 1);
|
||||
const int pri_shift = imax(0, damping - ulog2(pri_strength));
|
||||
if (sec_strength) {
|
||||
const int sec_shift = damping - ulog2(sec_strength);
|
||||
do {
|
||||
for (int x = 0; x < w; x++) {
|
||||
const int px = dst[x];
|
||||
int sum = 0;
|
||||
int max = px, min = px;
|
||||
int pri_tap_k = pri_tap;
|
||||
for (int k = 0; k < 2; k++) {
|
||||
const int off1 = dav1d_cdef_directions[dir + 2][k]; // dir
|
||||
const int p0 = tmp[x + off1];
|
||||
const int p1 = tmp[x - off1];
|
||||
sum += pri_tap_k * constrain(p0 - px, pri_strength, pri_shift);
|
||||
sum += pri_tap_k * constrain(p1 - px, pri_strength, pri_shift);
|
||||
// if pri_tap_k == 4 then it becomes 2 else it remains 3
|
||||
pri_tap_k = (pri_tap_k & 3) | 2;
|
||||
min = umin(p0, min);
|
||||
max = imax(p0, max);
|
||||
min = umin(p1, min);
|
||||
max = imax(p1, max);
|
||||
const int off2 = dav1d_cdef_directions[dir + 4][k]; // dir + 2
|
||||
const int off3 = dav1d_cdef_directions[dir + 0][k]; // dir - 2
|
||||
const int s0 = tmp[x + off2];
|
||||
const int s1 = tmp[x - off2];
|
||||
const int s2 = tmp[x + off3];
|
||||
const int s3 = tmp[x - off3];
|
||||
// sec_tap starts at 2 and becomes 1
|
||||
const int sec_tap = 2 - k;
|
||||
sum += sec_tap * constrain(s0 - px, sec_strength, sec_shift);
|
||||
sum += sec_tap * constrain(s1 - px, sec_strength, sec_shift);
|
||||
sum += sec_tap * constrain(s2 - px, sec_strength, sec_shift);
|
||||
sum += sec_tap * constrain(s3 - px, sec_strength, sec_shift);
|
||||
min = umin(s0, min);
|
||||
max = imax(s0, max);
|
||||
min = umin(s1, min);
|
||||
max = imax(s1, max);
|
||||
min = umin(s2, min);
|
||||
max = imax(s2, max);
|
||||
min = umin(s3, min);
|
||||
max = imax(s3, max);
|
||||
}
|
||||
dst[x] = iclip(px + ((sum - (sum < 0) + 8) >> 4), min, max);
|
||||
}
|
||||
dst += PXSTRIDE(dst_stride);
|
||||
tmp += tmp_stride;
|
||||
} while (--h);
|
||||
} else { // pri_strength only
|
||||
do {
|
||||
for (int x = 0; x < w; x++) {
|
||||
const int px = dst[x];
|
||||
int sum = 0;
|
||||
int pri_tap_k = pri_tap;
|
||||
for (int k = 0; k < 2; k++) {
|
||||
const int off = dav1d_cdef_directions[dir + 2][k]; // dir
|
||||
const int p0 = tmp[x + off];
|
||||
const int p1 = tmp[x - off];
|
||||
sum += pri_tap_k * constrain(p0 - px, pri_strength, pri_shift);
|
||||
sum += pri_tap_k * constrain(p1 - px, pri_strength, pri_shift);
|
||||
pri_tap_k = (pri_tap_k & 3) | 2;
|
||||
}
|
||||
dst[x] = px + ((sum - (sum < 0) + 8) >> 4);
|
||||
}
|
||||
dst += PXSTRIDE(dst_stride);
|
||||
tmp += tmp_stride;
|
||||
} while (--h);
|
||||
}
|
||||
} else { // sec_strength only
|
||||
assert(sec_strength);
|
||||
const int sec_shift = damping - ulog2(sec_strength);
|
||||
do {
|
||||
for (int x = 0; x < w; x++) {
|
||||
const int px = dst[x];
|
||||
int sum = 0;
|
||||
for (int k = 0; k < 2; k++) {
|
||||
const int off1 = dav1d_cdef_directions[dir + 4][k]; // dir + 2
|
||||
const int off2 = dav1d_cdef_directions[dir + 0][k]; // dir - 2
|
||||
const int s0 = tmp[x + off1];
|
||||
const int s1 = tmp[x - off1];
|
||||
const int s2 = tmp[x + off2];
|
||||
const int s3 = tmp[x - off2];
|
||||
const int sec_tap = 2 - k;
|
||||
sum += sec_tap * constrain(s0 - px, sec_strength, sec_shift);
|
||||
sum += sec_tap * constrain(s1 - px, sec_strength, sec_shift);
|
||||
sum += sec_tap * constrain(s2 - px, sec_strength, sec_shift);
|
||||
sum += sec_tap * constrain(s3 - px, sec_strength, sec_shift);
|
||||
}
|
||||
dst[x] = px + ((sum - (sum < 0) + 8) >> 4);
|
||||
}
|
||||
dst += PXSTRIDE(dst_stride);
|
||||
tmp += tmp_stride;
|
||||
} while (--h);
|
||||
}
|
||||
}
|
||||
|
||||
#define cdef_fn(w, h) \
|
||||
static void cdef_filter_block_##w##x##h##_c(pixel *const dst, \
|
||||
const ptrdiff_t stride, \
|
||||
const pixel (*left)[2], \
|
||||
const pixel *const top, \
|
||||
const pixel *const bottom, \
|
||||
const int pri_strength, \
|
||||
const int sec_strength, \
|
||||
const int dir, \
|
||||
const int damping, \
|
||||
const enum CdefEdgeFlags edges \
|
||||
HIGHBD_DECL_SUFFIX) \
|
||||
{ \
|
||||
cdef_filter_block_c(dst, stride, left, top, bottom, \
|
||||
pri_strength, sec_strength, dir, damping, w, h, edges HIGHBD_TAIL_SUFFIX); \
|
||||
}
|
||||
|
||||
cdef_fn(4, 4);
|
||||
cdef_fn(4, 8);
|
||||
cdef_fn(8, 8);
|
||||
|
||||
static int cdef_find_dir_c(const pixel *img, const ptrdiff_t stride,
|
||||
unsigned *const var HIGHBD_DECL_SUFFIX)
|
||||
{
|
||||
const int bitdepth_min_8 = bitdepth_from_max(bitdepth_max) - 8;
|
||||
int partial_sum_hv[2][8] = { { 0 } };
|
||||
int partial_sum_diag[2][15] = { { 0 } };
|
||||
int partial_sum_alt[4][11] = { { 0 } };
|
||||
|
||||
for (int y = 0; y < 8; y++) {
|
||||
for (int x = 0; x < 8; x++) {
|
||||
const int px = (img[x] >> bitdepth_min_8) - 128;
|
||||
|
||||
partial_sum_diag[0][ y + x ] += px;
|
||||
partial_sum_alt [0][ y + (x >> 1)] += px;
|
||||
partial_sum_hv [0][ y ] += px;
|
||||
partial_sum_alt [1][3 + y - (x >> 1)] += px;
|
||||
partial_sum_diag[1][7 + y - x ] += px;
|
||||
partial_sum_alt [2][3 - (y >> 1) + x ] += px;
|
||||
partial_sum_hv [1][ x ] += px;
|
||||
partial_sum_alt [3][ (y >> 1) + x ] += px;
|
||||
}
|
||||
img += PXSTRIDE(stride);
|
||||
}
|
||||
|
||||
unsigned cost[8] = { 0 };
|
||||
for (int n = 0; n < 8; n++) {
|
||||
cost[2] += partial_sum_hv[0][n] * partial_sum_hv[0][n];
|
||||
cost[6] += partial_sum_hv[1][n] * partial_sum_hv[1][n];
|
||||
}
|
||||
cost[2] *= 105;
|
||||
cost[6] *= 105;
|
||||
|
||||
static const uint16_t div_table[7] = { 840, 420, 280, 210, 168, 140, 120 };
|
||||
for (int n = 0; n < 7; n++) {
|
||||
const int d = div_table[n];
|
||||
cost[0] += (partial_sum_diag[0][n] * partial_sum_diag[0][n] +
|
||||
partial_sum_diag[0][14 - n] * partial_sum_diag[0][14 - n]) * d;
|
||||
cost[4] += (partial_sum_diag[1][n] * partial_sum_diag[1][n] +
|
||||
partial_sum_diag[1][14 - n] * partial_sum_diag[1][14 - n]) * d;
|
||||
}
|
||||
cost[0] += partial_sum_diag[0][7] * partial_sum_diag[0][7] * 105;
|
||||
cost[4] += partial_sum_diag[1][7] * partial_sum_diag[1][7] * 105;
|
||||
|
||||
for (int n = 0; n < 4; n++) {
|
||||
unsigned *const cost_ptr = &cost[n * 2 + 1];
|
||||
for (int m = 0; m < 5; m++)
|
||||
*cost_ptr += partial_sum_alt[n][3 + m] * partial_sum_alt[n][3 + m];
|
||||
*cost_ptr *= 105;
|
||||
for (int m = 0; m < 3; m++) {
|
||||
const int d = div_table[2 * m + 1];
|
||||
*cost_ptr += (partial_sum_alt[n][m] * partial_sum_alt[n][m] +
|
||||
partial_sum_alt[n][10 - m] * partial_sum_alt[n][10 - m]) * d;
|
||||
}
|
||||
}
|
||||
|
||||
int best_dir = 0;
|
||||
unsigned best_cost = cost[0];
|
||||
for (int n = 1; n < 8; n++) {
|
||||
if (cost[n] > best_cost) {
|
||||
best_cost = cost[n];
|
||||
best_dir = n;
|
||||
}
|
||||
}
|
||||
|
||||
*var = (best_cost - (cost[best_dir ^ 4])) >> 10;
|
||||
return best_dir;
|
||||
}
|
||||
|
||||
#if HAVE_ASM
|
||||
#if ARCH_AARCH64 || ARCH_ARM
|
||||
#include "src/arm/cdef.h"
|
||||
#elif ARCH_PPC64LE
|
||||
#include "src/ppc/cdef.h"
|
||||
#elif ARCH_X86
|
||||
#include "src/x86/cdef.h"
|
||||
#endif
|
||||
#endif
|
||||
|
||||
COLD void bitfn(dav1d_cdef_dsp_init)(Dav1dCdefDSPContext *const c) {
|
||||
c->dir = cdef_find_dir_c;
|
||||
c->fb[0] = cdef_filter_block_8x8_c;
|
||||
c->fb[1] = cdef_filter_block_4x8_c;
|
||||
c->fb[2] = cdef_filter_block_4x4_c;
|
||||
|
||||
#if HAVE_ASM
|
||||
#if ARCH_AARCH64 || ARCH_ARM
|
||||
cdef_dsp_init_arm(c);
|
||||
#elif ARCH_PPC64LE
|
||||
cdef_dsp_init_ppc(c);
|
||||
#elif ARCH_X86
|
||||
cdef_dsp_init_x86(c);
|
||||
#endif
|
||||
#endif
|
||||
}
|
||||
@@ -0,0 +1,32 @@
|
||||
/*
|
||||
* dav1d_cdef_directions — verbatim transcription of the CDEF
|
||||
* directions table from dav1d/src/tables.c (1.4.3, lines 400-414).
|
||||
* Provided as a standalone .c so the vendored cdef.S has the
|
||||
* symbol to link against without pulling in dav1d's full tables.c
|
||||
* (which is 1013 lines and chain-references the entire decoder).
|
||||
*
|
||||
* Used by both the C reference (cdef_tmpl.c) and the NEON
|
||||
* implementation (cdef.S).
|
||||
*
|
||||
* The table has 12 entries (2 + 8 + 2) because direction indexing
|
||||
* wraps modulo 8 with ±2 lookahead for secondary taps; the leading
|
||||
* and trailing 2 entries are the wrap-around prefixes/suffixes.
|
||||
*
|
||||
* License: BSD-2-Clause (matches dav1d upstream).
|
||||
*/
|
||||
#include <stdint.h>
|
||||
|
||||
const int8_t dav1d_cdef_directions[2 + 8 + 2][2] = {
|
||||
{ 1 * 12 + 0, 2 * 12 + 0 }, // 6 (wrap prefix)
|
||||
{ 1 * 12 + 0, 2 * 12 - 1 }, // 7 (wrap prefix)
|
||||
{ -1 * 12 + 1, -2 * 12 + 2 }, // 0
|
||||
{ 0 * 12 + 1, -1 * 12 + 2 }, // 1
|
||||
{ 0 * 12 + 1, 0 * 12 + 2 }, // 2
|
||||
{ 0 * 12 + 1, 1 * 12 + 2 }, // 3
|
||||
{ 1 * 12 + 1, 2 * 12 + 2 }, // 4
|
||||
{ 1 * 12 + 0, 2 * 12 + 1 }, // 5
|
||||
{ 1 * 12 + 0, 2 * 12 + 0 }, // 6
|
||||
{ 1 * 12 + 0, 2 * 12 - 1 }, // 7
|
||||
{ -1 * 12 + 1, -2 * 12 + 2 }, // 0 (wrap suffix)
|
||||
{ 0 * 12 + 1, -1 * 12 + 2 }, // 1 (wrap suffix)
|
||||
};
|
||||
Reference in New Issue
Block a user