diff --git a/docs/k8_h264deblock_phase1.md b/docs/k8_h264deblock_phase1.md
new file mode 100644
index 0000000..eb14b1e
--- /dev/null
+++ b/docs/k8_h264deblock_phase1.md
@@ -0,0 +1,183 @@
+---
+cycle: 8
+phase: 1
+status: open (Phase 3 deferred to next session — scope larger than VP9 LPF)
+date_opened: 2026-05-18
+codec: H.264
+kernel: in-loop deblock filter (luma vertical edge variant first)
+parent: project_h264_scope_added.md (memory), k7_h264idct8_phase3_and_4.md (lesson)
+predicted_R: 0.3-0.8 (ORANGE/YELLOW) — analogous to VP9 LPF cycles 2/4 which were GREEN
+---
+
+# Cycle 8, Phase 1 — H.264 in-loop deblock (luma vertical edge first)
+
+After cycles 6 and 7 both came in as "predicted GREEN, measured
+CPU-only" for H.264 transforms (transforms too lightweight on
+NEON), cycle 8 targets the one H.264 kernel most likely to actually
+benefit from QPU offload: the **in-loop deblock filter**.
+
+## Why deblock as the H.264 QPU candidate
+
+Per cycle 7's Phase 9 update:
+- H.264 transforms (cycles 6+7) NEON-saturated at ~150 Mblock/s,
+  no QPU need
+- H.264 MC (luma qpel, chroma) likely analogous to cycle 3 VP9 MC
+  (R=0.067 RED), QPU loses
+- **Deblock is bandwidth-bound** with per-pixel branching, analogous
+  to VP9 LPF (cycle 2 R=0.41 GREEN, cycle 4 R=0.34 GREEN)
+- H.264 deblock processes 16-pixel-wide MB edges (vs VP9's smaller
+  8-pixel edges), so per-edge work is heavier — better for QPU
+  amortization
+
+Predicted R₈ band: **ORANGE to GREEN** based on the VP9 LPF analog.
+
+## Scope decision: start with luma vertical edge
+
+H.264 deblock has many variants:
+1. Luma `v` filter (v_loop_filter_luma; this doc's "luma vertical edge"): filters vertically across a horizontal edge, 6-row × 16-col region (p2..q2)
+2. Luma `h` filter (h_loop_filter_luma): filters horizontally across a vertical edge, 16-row × 8-col region loaded
+3. Luma intra (stronger filter, bS=4)
+4. Chroma {v,h} edge
+5. Chroma intra
+6. Chroma 4:2:2 variants
+
+Start with **luma vertical edge non-intra**. Most common case
+(most MB-internal edges are non-intra). Other variants are
+follow-up cycles (8a, 8b, etc.) using the same QPU shader
+template.
+
+## NEON reference
+
+`ff_h264_v_loop_filter_luma_neon`
+(external/ffmpeg-snapshot/libavcodec/aarch64/h264dsp_neon.S
+line 111, vendored 2026-05-18).
+
+Signature inferred from `h264_loop_filter_start` macro:
+```
+void ff_h264_v_loop_filter_luma_neon(uint8_t *pix,
+                                     ptrdiff_t stride,
+                                     int alpha, int beta,
+                                     int8_t *tc0);
+```
+
+Where:
+- `pix`: pointer to the edge centre; pix[0] is the q0 pixel of the first column, with p0..p2 at pix[-stride]..pix[-3*stride] and q1, q2 at pix[stride], pix[2*stride]
+- `stride`: byte stride between rows (typically picture width)
+- `alpha`: threshold on |p0 - q0| (0..255 for 8-bit video, MB-derived)
+- `beta`: threshold on |p1 - p0| and |q1 - q0| (0..18 for 8-bit video, MB-derived)
+- `tc0`: array of 4 int8 values — per-4-pixel-segment tc0 strengths
+
+The 16-pixel edge is divided into 4 segments of 4 pixels each (4 columns
+per segment for this `v` variant); each segment can have its own tc0
+(encoder-derived filter strength parameter).
+
+## Algorithm summary (H.264 §8.7.2.3, bS < 4)
+
+Per pixel position along the edge (16 positions, 4 per tc0 segment):
+1. Compute pre-conditions:
+   - `bS > 0` (tc0[segment] >= 0; the NEON code masks out lanes with negative tc0)
+   - `|p0 - q0| < alpha`
+   - `|p1 - p0| < beta`
+   - `|q1 - q0| < beta`
+2. If precondition fails → no filter for this position
+3. Compute `ap = |p2 - p0|`, `aq = |q2 - q0|`
+4. Compute `tc = tc0 + (ap < beta) + (aq < beta)`
+5. `delta = clip3(-tc, tc, (((q0-p0)*4 + (p1-q1) + 4) >> 3))`
+6. Apply:
+   - `p0' = clip255(p0 + delta)`
+   - `q0' = clip255(q0 - delta)`
+   - If `ap < beta`: `p1' = p1 + clip3(-tc0, tc0, ...)`
+   - If `aq < beta`: `q1' = q1 + clip3(-tc0, tc0, ...)`
+
+Multiple branches per pixel → harder to write a bit-exact C ref
+than cycle 2/4 LPF. ~80-100 LOC of C, careful with the clip3
+ranges.
+
+## 30fps@1080p H.264 deblock floor
+
+A 1920×1080 frame has 120 × 67.5 = 8100 luma MBs × 4 vertical
+edges per MB × 4 segments per edge = ~129 600 segment-edges per
+frame. Horizontal edges add the same count again (4 per MB).
+
+At 30fps: ~3.9 M edges/s required for luma vertical alone, ~7.8 M
+edges/s for both v and h. Realistic (many edges skip filter via
+bS=0 or alpha/beta thresholds): ~30-50 % of these actually filter
+→ effective ~2-4 M edges/s.
+
+**30fps@1080p deblock floor (realistic): 2-4 M edges/s.**
+**30fps@1080p deblock floor (worst case): 8 M edges/s.**
+
+## Acceptance for Phase 7
+
+- M1: 100.0000% bit-exact (NEON vs C ref, 10000+ random 4-row segments)
+- M3: captured
+- M2: captured
+- R₈: classified
+- M4: same-kernel mixed bench
+- 30fps@1080p floor margin reported
+
+## Cycle 8 deliverables
+
+1. `external/ffmpeg-snapshot/libavcodec/aarch64/h264dsp_neon.S`
+   (already vendored this phase, 1076 lines)
+2. `tests/h264_deblock_ref.c` — C reference for luma vertical
+   non-intra deblock (luma_v_filter_normal)
+3. `tests/bench_neon_h264deblock.c` — Phase 3 bench
+4. `src/v3d_h264deblock.comp` — Phase 6 shader (likely follows the
+   cycle 2 LPF v3d shader structure, but with deblock branching)
+5. `tests/bench_v3d_h264deblock.c` — Phase 6+7 bench
+6. CMakeLists.txt wiring
+
+## What lands in THIS session
+
+- This Phase 1 doc
+- `h264dsp_neon.S` vendored (file present in repo)
+- PROVENANCE.md updated
+
+What's NOT in this session (deferred to next):
+- C reference (~2 hours)
+- NEON bench
+- M1+M3 capture
+- Phase 4-7
+
+## Why defer Phase 3+ from this session
+
+Cycle 8 NEON-baseline scope is materially larger than cycles 6/7
+because the H.264 deblock has:
+- Per-pixel branching (filter applies or not based on alpha/beta)
+- Per-4-pixel-segment tc0 strength
+- 4 separate output adjustments per pixel position (p0, q0, p1, q1)
+- ap/aq side-condition checks
+- All of these must be bit-exact in the C ref against NEON's
+  vectorised version
+
+Better to write the C ref with fresh attention next session than
+rush it now and have it M1-fail like cycle 6's first attempt.
+
+The Phase 1 doc itself captures the analysis so next session can
+pick up cleanly from here.
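The branching listed in this section (alpha/beta gate, per-segment tc0, ap/aq side conditions, four outputs) can be sketched in scalar C. This is a minimal sketch for orientation only, not the planned `tests/h264_deblock_ref.c`. The names `deblock_luma_v_normal_sketch`, `clip3` and `clip255` are hypothetical. The p1/q1 expressions are assumed from the bS < 4 luma equations of the H.264 spec, and the memory layout (p rows at negative stride offsets, q rows from `pix` down) is assumed from the load order in `ff_h264_v_loop_filter_luma_neon`. It has not been checked bit-exact against the NEON code; that is what M1 is for.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Clamp v into [lo, hi]. */
static int clip3(int lo, int hi, int v)
{
    return v < lo ? lo : (v > hi ? hi : v);
}

/* Clamp to the 8-bit pixel range. */
static uint8_t clip255(int v)
{
    return (uint8_t)clip3(0, 255, v);
}

/*
 * Hypothetical scalar sketch of the non-intra (bS < 4) luma filter for the
 * `v` variant: pix points at the q0 row, 16 pixels wide; tc0[i] covers
 * columns 4*i .. 4*i+3. Assumes arithmetic right shift of negative ints.
 */
void deblock_luma_v_normal_sketch(uint8_t *pix, ptrdiff_t stride,
                                  int alpha, int beta, const int8_t *tc0)
{
    assert(pix != NULL && tc0 != NULL);

    for (int seg = 0; seg < 4; seg++) {
        const int tc_seg = tc0[seg];
        if (tc_seg < 0)             /* negative tc0: segment not filtered */
            continue;

        for (int d = 0; d < 4; d++) {
            uint8_t *s = pix + 4 * seg + d;
            /* Read all six taps before any write. */
            const int p2 = s[-3 * stride], p1 = s[-2 * stride], p0 = s[-stride];
            const int q0 = s[0], q1 = s[stride], q2 = s[2 * stride];

            /* Pre-conditions: any failure leaves this column untouched. */
            if (abs(p0 - q0) >= alpha ||
                abs(p1 - p0) >= beta ||
                abs(q1 - q0) >= beta)
                continue;

            int tc = tc_seg;
            const int avg = (p0 + q0 + 1) >> 1;

            if (abs(p2 - p0) < beta) {   /* ap side condition */
                s[-2 * stride] =
                    (uint8_t)(p1 + clip3(-tc_seg, tc_seg, ((p2 + avg) >> 1) - p1));
                tc++;
            }
            if (abs(q2 - q0) < beta) {   /* aq side condition */
                s[stride] =
                    (uint8_t)(q1 + clip3(-tc_seg, tc_seg, ((q2 + avg) >> 1) - q1));
                tc++;
            }

            const int delta =
                clip3(-tc, tc, ((q0 - p0) * 4 + (p1 - q1) + 4) >> 3);
            s[-stride] = clip255(p0 + delta);
            s[0]       = clip255(q0 - delta);
        }
    }
}
```

Feeding random 6-row × 16-col blocks through a function of this shape and through the NEON kernel, then comparing the four written rows, is the shape of the M1 check. The sketch makes no bit-exactness claim until that comparison has actually run.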
+
+## Estimated effort for Phase 3 next session
+
+- C ref: ~2 hours (careful transcription from spec + cross-check
+  against FFmpeg C reference)
+- Bench: ~30 min
+- M1 debugging (likely needed; cycle 6 took 90 min for
+  column-major-block discovery, similar discoveries may apply here): 30-90 min
+- M3 capture: 5 min
+
+Total: 3-4 hours for Phase 3 closure.
+
+## Linkage with cycles 6+7 closure
+
+Cycles 6 + 7 + 8 together form the H.264 NEON inventory and the
+single-most-promising-QPU-target (cycle 8). After cycle 8 closes,
+the H.264 QPU surface area is well-characterised:
+- IDCT 4×4: CPU
+- IDCT 8×8: CPU
+- Deblock: TBD (cycle 8)
+- MC luma qpel: CPU (predicted; cycle 9 if measured)
+- MC chroma: CPU (predicted; cycle 10 if measured)
+
+H.264 contribution to daedalus-fourier likely: CPU for transforms
+and MC, QPU for deblock IF cycle 8 lands GREEN.
diff --git a/external/ffmpeg-snapshot/PROVENANCE.md b/external/ffmpeg-snapshot/PROVENANCE.md
index 9f813d5..61097f7 100644
--- a/external/ffmpeg-snapshot/PROVENANCE.md
+++ b/external/ffmpeg-snapshot/PROVENANCE.md
@@ -27,6 +27,7 @@ tagged commit, no modifications.
 | `libavcodec/aarch64/vp9lpf_neon.S` | 1334 | — | `384e49e7a6e838d9e38aedc00838ed4aebfa6c5bdb343ecaf23ef639bc10fbb7` |
 | `libavcodec/aarch64/vp9mc_neon.S` | 665 | — | `6b1d50f9821742584fdd47758057f810644aff3a008faaa774ff5b9cac4d1fef` |
 | `libavcodec/aarch64/h264idct_neon.S` | 415 | 16269 | `963ffe5f31b5a6a422e13b0d394cf5630126927abfb23aa214f7cbe83d60683f` — H.264 IDCT 4×4/8×8/DC NEON kernels for cycle 6+ |
+| `libavcodec/aarch64/h264dsp_neon.S` | 1076 | — | `978e076f0020e688b40c6dd827708c3d53e17c64a99fd0052e43d983536ce638` — H.264 in-loop deblock + weight/biweight kernels for cycle 8+ |
 | `libavcodec/vp9_subpel_filters_table.c` | — | — | hand-extracted from `libavcodec/vp9dsp.c` at same n7.1.3 pin — provides `ff_vp9_subpel_filters` for `vp9mc_neon.S` to link against without dragging in vp9dsp.c's full init machinery |
 | `libavcodec/aarch64/neon.S` | 173 | 7496 | `72d36ce6c3fcc5e53de869cfe10fda16225ebe580c32891bccc240a30a85a538` |
 | `libavutil/aarch64/asm.S` | 260 | 8069 | `c0d03143b1bc5a9e358222d08d2d449d595271844fe7a3dc23bffb91abe8b0e3` |
diff --git a/external/ffmpeg-snapshot/libavcodec/aarch64/h264dsp_neon.S b/external/ffmpeg-snapshot/libavcodec/aarch64/h264dsp_neon.S
new file mode 100644
index 0000000..723b692
--- /dev/null
+++ b/external/ffmpeg-snapshot/libavcodec/aarch64/h264dsp_neon.S
@@ -0,0 +1,1076 @@
+/*
+ * Copyright (c) 2008 Mans Rullgard
+ * Copyright (c) 2013 Janne Grunau
+ * Copyright (c) 2014 Janne Grunau
+ *
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * Lesser General Public License for more details.
+ * + * You should have received a copy of the GNU Lesser General Public + * License along with FFmpeg; if not, write to the Free Software + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA + */ + +#include "libavutil/aarch64/asm.S" +#include "neon.S" + +.macro h264_loop_filter_start + cmp w2, #0 + ldr w6, [x4] + ccmp w3, #0, #0, ne + mov v24.s[0], w6 + and w8, w6, w6, lsl #16 + b.eq 1f + ands w8, w8, w8, lsl #8 + b.ge 2f +1: + ret +2: +.endm + +.macro h264_loop_filter_luma + dup v22.16b, w2 // alpha + uxtl v24.8h, v24.8b + uabd v21.16b, v16.16b, v0.16b // abs(p0 - q0) + uxtl v24.4s, v24.4h + uabd v28.16b, v18.16b, v16.16b // abs(p1 - p0) + sli v24.8h, v24.8h, #8 + uabd v30.16b, v2.16b, v0.16b // abs(q1 - q0) + sli v24.4s, v24.4s, #16 + cmhi v21.16b, v22.16b, v21.16b // < alpha + dup v22.16b, w3 // beta + cmlt v23.16b, v24.16b, #0 + cmhi v28.16b, v22.16b, v28.16b // < beta + cmhi v30.16b, v22.16b, v30.16b // < beta + bic v21.16b, v21.16b, v23.16b + uabd v17.16b, v20.16b, v16.16b // abs(p2 - p0) + and v21.16b, v21.16b, v28.16b + uabd v19.16b, v4.16b, v0.16b // abs(q2 - q0) + and v21.16b, v21.16b, v30.16b // < beta + shrn v30.8b, v21.8h, #4 + mov x7, v30.d[0] + cmhi v17.16b, v22.16b, v17.16b // < beta + cmhi v19.16b, v22.16b, v19.16b // < beta + cbz x7, 9f + and v17.16b, v17.16b, v21.16b + and v19.16b, v19.16b, v21.16b + and v24.16b, v24.16b, v21.16b + urhadd v28.16b, v16.16b, v0.16b + sub v21.16b, v24.16b, v17.16b + uqadd v23.16b, v18.16b, v24.16b + uhadd v20.16b, v20.16b, v28.16b + sub v21.16b, v21.16b, v19.16b + uhadd v28.16b, v4.16b, v28.16b + umin v23.16b, v23.16b, v20.16b + uqsub v22.16b, v18.16b, v24.16b + uqadd v4.16b, v2.16b, v24.16b + umax v23.16b, v23.16b, v22.16b + uqsub v22.16b, v2.16b, v24.16b + umin v28.16b, v4.16b, v28.16b + uxtl v4.8h, v0.8b + umax v28.16b, v28.16b, v22.16b + uxtl2 v20.8h, v0.16b + usubw v4.8h, v4.8h, v16.8b + usubw2 v20.8h, v20.8h, v16.16b + shl v4.8h, v4.8h, #2 + shl v20.8h, v20.8h, #2 + uaddw v4.8h, 
v4.8h, v18.8b + uaddw2 v20.8h, v20.8h, v18.16b + usubw v4.8h, v4.8h, v2.8b + usubw2 v20.8h, v20.8h, v2.16b + rshrn v4.8b, v4.8h, #3 + rshrn2 v4.16b, v20.8h, #3 + bsl v17.16b, v23.16b, v18.16b + bsl v19.16b, v28.16b, v2.16b + neg v23.16b, v21.16b + uxtl v28.8h, v16.8b + smin v4.16b, v4.16b, v21.16b + uxtl2 v21.8h, v16.16b + smax v4.16b, v4.16b, v23.16b + uxtl v22.8h, v0.8b + uxtl2 v24.8h, v0.16b + saddw v28.8h, v28.8h, v4.8b + saddw2 v21.8h, v21.8h, v4.16b + ssubw v22.8h, v22.8h, v4.8b + ssubw2 v24.8h, v24.8h, v4.16b + sqxtun v16.8b, v28.8h + sqxtun2 v16.16b, v21.8h + sqxtun v0.8b, v22.8h + sqxtun2 v0.16b, v24.8h +.endm + +function ff_h264_v_loop_filter_luma_neon, export=1 + h264_loop_filter_start + + ld1 {v0.16b}, [x0], x1 + ld1 {v2.16b}, [x0], x1 + ld1 {v4.16b}, [x0], x1 + sub x0, x0, x1, lsl #2 + sub x0, x0, x1, lsl #1 + ld1 {v20.16b}, [x0], x1 + ld1 {v18.16b}, [x0], x1 + ld1 {v16.16b}, [x0], x1 + + h264_loop_filter_luma + + sub x0, x0, x1, lsl #1 + st1 {v17.16b}, [x0], x1 + st1 {v16.16b}, [x0], x1 + st1 {v0.16b}, [x0], x1 + st1 {v19.16b}, [x0] +9: + ret +endfunc + +function ff_h264_h_loop_filter_luma_neon, export=1 + h264_loop_filter_start + + sub x0, x0, #4 + ld1 {v6.8b}, [x0], x1 + ld1 {v20.8b}, [x0], x1 + ld1 {v18.8b}, [x0], x1 + ld1 {v16.8b}, [x0], x1 + ld1 {v0.8b}, [x0], x1 + ld1 {v2.8b}, [x0], x1 + ld1 {v4.8b}, [x0], x1 + ld1 {v26.8b}, [x0], x1 + ld1 {v6.d}[1], [x0], x1 + ld1 {v20.d}[1], [x0], x1 + ld1 {v18.d}[1], [x0], x1 + ld1 {v16.d}[1], [x0], x1 + ld1 {v0.d}[1], [x0], x1 + ld1 {v2.d}[1], [x0], x1 + ld1 {v4.d}[1], [x0], x1 + ld1 {v26.d}[1], [x0], x1 + + transpose_8x16B v6, v20, v18, v16, v0, v2, v4, v26, v21, v23 + + h264_loop_filter_luma + + transpose_4x16B v17, v16, v0, v19, v21, v23, v25, v27 + + sub x0, x0, x1, lsl #4 + add x0, x0, #2 + st1 {v17.s}[0], [x0], x1 + st1 {v16.s}[0], [x0], x1 + st1 {v0.s}[0], [x0], x1 + st1 {v19.s}[0], [x0], x1 + st1 {v17.s}[1], [x0], x1 + st1 {v16.s}[1], [x0], x1 + st1 {v0.s}[1], [x0], x1 + st1 {v19.s}[1], [x0], x1 + 
st1 {v17.s}[2], [x0], x1 + st1 {v16.s}[2], [x0], x1 + st1 {v0.s}[2], [x0], x1 + st1 {v19.s}[2], [x0], x1 + st1 {v17.s}[3], [x0], x1 + st1 {v16.s}[3], [x0], x1 + st1 {v0.s}[3], [x0], x1 + st1 {v19.s}[3], [x0], x1 +9: + ret +endfunc + + +.macro h264_loop_filter_start_intra + orr w4, w2, w3 + cbnz w4, 1f + ret +1: + dup v30.16b, w2 // alpha + dup v31.16b, w3 // beta +.endm + +.macro h264_loop_filter_luma_intra + uabd v16.16b, v7.16b, v0.16b // abs(p0 - q0) + uabd v17.16b, v6.16b, v7.16b // abs(p1 - p0) + uabd v18.16b, v1.16b, v0.16b // abs(q1 - q0) + cmhi v19.16b, v30.16b, v16.16b // < alpha + cmhi v17.16b, v31.16b, v17.16b // < beta + cmhi v18.16b, v31.16b, v18.16b // < beta + + movi v29.16b, #2 + ushr v30.16b, v30.16b, #2 // alpha >> 2 + add v30.16b, v30.16b, v29.16b // (alpha >> 2) + 2 + cmhi v16.16b, v30.16b, v16.16b // < (alpha >> 2) + 2 + + and v19.16b, v19.16b, v17.16b + and v19.16b, v19.16b, v18.16b + shrn v20.8b, v19.8h, #4 + mov x4, v20.d[0] + cbz x4, 9f + + ushll v20.8h, v6.8b, #1 + ushll v22.8h, v1.8b, #1 + ushll2 v21.8h, v6.16b, #1 + ushll2 v23.8h, v1.16b, #1 + uaddw v20.8h, v20.8h, v7.8b + uaddw v22.8h, v22.8h, v0.8b + uaddw2 v21.8h, v21.8h, v7.16b + uaddw2 v23.8h, v23.8h, v0.16b + uaddw v20.8h, v20.8h, v1.8b + uaddw v22.8h, v22.8h, v6.8b + uaddw2 v21.8h, v21.8h, v1.16b + uaddw2 v23.8h, v23.8h, v6.16b + + rshrn v24.8b, v20.8h, #2 // p0'_1 + rshrn v25.8b, v22.8h, #2 // q0'_1 + rshrn2 v24.16b, v21.8h, #2 // p0'_1 + rshrn2 v25.16b, v23.8h, #2 // q0'_1 + + uabd v17.16b, v5.16b, v7.16b // abs(p2 - p0) + uabd v18.16b, v2.16b, v0.16b // abs(q2 - q0) + cmhi v17.16b, v31.16b, v17.16b // < beta + cmhi v18.16b, v31.16b, v18.16b // < beta + + and v17.16b, v16.16b, v17.16b // if_2 && if_3 + and v18.16b, v16.16b, v18.16b // if_2 && if_4 + + not v30.16b, v17.16b + not v31.16b, v18.16b + + and v30.16b, v30.16b, v19.16b // if_1 && !(if_2 && if_3) + and v31.16b, v31.16b, v19.16b // if_1 && !(if_2 && if_4) + + and v17.16b, v19.16b, v17.16b // if_1 && if_2 && if_3 + and 
v18.16b, v19.16b, v18.16b // if_1 && if_2 && if_4 + + //calc p, v7, v6, v5, v4, v17, v7, v6, v5, v4 + uaddl v26.8h, v5.8b, v7.8b + uaddl2 v27.8h, v5.16b, v7.16b + uaddw v26.8h, v26.8h, v0.8b + uaddw2 v27.8h, v27.8h, v0.16b + add v20.8h, v20.8h, v26.8h + add v21.8h, v21.8h, v27.8h + uaddw v20.8h, v20.8h, v0.8b + uaddw2 v21.8h, v21.8h, v0.16b + rshrn v20.8b, v20.8h, #3 // p0'_2 + rshrn2 v20.16b, v21.8h, #3 // p0'_2 + uaddw v26.8h, v26.8h, v6.8b + uaddw2 v27.8h, v27.8h, v6.16b + rshrn v21.8b, v26.8h, #2 // p1'_2 + rshrn2 v21.16b, v27.8h, #2 // p1'_2 + uaddl v28.8h, v4.8b, v5.8b + uaddl2 v29.8h, v4.16b, v5.16b + shl v28.8h, v28.8h, #1 + shl v29.8h, v29.8h, #1 + add v28.8h, v28.8h, v26.8h + add v29.8h, v29.8h, v27.8h + rshrn v19.8b, v28.8h, #3 // p2'_2 + rshrn2 v19.16b, v29.8h, #3 // p2'_2 + + //calc q, v0, v1, v2, v3, v18, v0, v1, v2, v3 + uaddl v26.8h, v2.8b, v0.8b + uaddl2 v27.8h, v2.16b, v0.16b + uaddw v26.8h, v26.8h, v7.8b + uaddw2 v27.8h, v27.8h, v7.16b + add v22.8h, v22.8h, v26.8h + add v23.8h, v23.8h, v27.8h + uaddw v22.8h, v22.8h, v7.8b + uaddw2 v23.8h, v23.8h, v7.16b + rshrn v22.8b, v22.8h, #3 // q0'_2 + rshrn2 v22.16b, v23.8h, #3 // q0'_2 + uaddw v26.8h, v26.8h, v1.8b + uaddw2 v27.8h, v27.8h, v1.16b + rshrn v23.8b, v26.8h, #2 // q1'_2 + rshrn2 v23.16b, v27.8h, #2 // q1'_2 + uaddl v28.8h, v2.8b, v3.8b + uaddl2 v29.8h, v2.16b, v3.16b + shl v28.8h, v28.8h, #1 + shl v29.8h, v29.8h, #1 + add v28.8h, v28.8h, v26.8h + add v29.8h, v29.8h, v27.8h + rshrn v26.8b, v28.8h, #3 // q2'_2 + rshrn2 v26.16b, v29.8h, #3 // q2'_2 + + bit v7.16b, v24.16b, v30.16b // p0'_1 + bit v0.16b, v25.16b, v31.16b // q0'_1 + bit v7.16b, v20.16b, v17.16b // p0'_2 + bit v6.16b, v21.16b, v17.16b // p1'_2 + bit v5.16b, v19.16b, v17.16b // p2'_2 + bit v0.16b, v22.16b, v18.16b // q0'_2 + bit v1.16b, v23.16b, v18.16b // q1'_2 + bit v2.16b, v26.16b, v18.16b // q2'_2 +.endm + +function ff_h264_v_loop_filter_luma_intra_neon, export=1 + h264_loop_filter_start_intra + + ld1 {v0.16b}, [x0], x1 // q0 + 
ld1 {v1.16b}, [x0], x1 // q1 + ld1 {v2.16b}, [x0], x1 // q2 + ld1 {v3.16b}, [x0], x1 // q3 + sub x0, x0, x1, lsl #3 + ld1 {v4.16b}, [x0], x1 // p3 + ld1 {v5.16b}, [x0], x1 // p2 + ld1 {v6.16b}, [x0], x1 // p1 + ld1 {v7.16b}, [x0] // p0 + + h264_loop_filter_luma_intra + + sub x0, x0, x1, lsl #1 + st1 {v5.16b}, [x0], x1 // p2 + st1 {v6.16b}, [x0], x1 // p1 + st1 {v7.16b}, [x0], x1 // p0 + st1 {v0.16b}, [x0], x1 // q0 + st1 {v1.16b}, [x0], x1 // q1 + st1 {v2.16b}, [x0] // q2 +9: + ret +endfunc + +function ff_h264_h_loop_filter_luma_intra_neon, export=1 + h264_loop_filter_start_intra + + sub x0, x0, #4 + ld1 {v4.8b}, [x0], x1 + ld1 {v5.8b}, [x0], x1 + ld1 {v6.8b}, [x0], x1 + ld1 {v7.8b}, [x0], x1 + ld1 {v0.8b}, [x0], x1 + ld1 {v1.8b}, [x0], x1 + ld1 {v2.8b}, [x0], x1 + ld1 {v3.8b}, [x0], x1 + ld1 {v4.d}[1], [x0], x1 + ld1 {v5.d}[1], [x0], x1 + ld1 {v6.d}[1], [x0], x1 + ld1 {v7.d}[1], [x0], x1 + ld1 {v0.d}[1], [x0], x1 + ld1 {v1.d}[1], [x0], x1 + ld1 {v2.d}[1], [x0], x1 + ld1 {v3.d}[1], [x0], x1 + + transpose_8x16B v4, v5, v6, v7, v0, v1, v2, v3, v21, v23 + + h264_loop_filter_luma_intra + + transpose_8x16B v4, v5, v6, v7, v0, v1, v2, v3, v21, v23 + + sub x0, x0, x1, lsl #4 + st1 {v4.8b}, [x0], x1 + st1 {v5.8b}, [x0], x1 + st1 {v6.8b}, [x0], x1 + st1 {v7.8b}, [x0], x1 + st1 {v0.8b}, [x0], x1 + st1 {v1.8b}, [x0], x1 + st1 {v2.8b}, [x0], x1 + st1 {v3.8b}, [x0], x1 + st1 {v4.d}[1], [x0], x1 + st1 {v5.d}[1], [x0], x1 + st1 {v6.d}[1], [x0], x1 + st1 {v7.d}[1], [x0], x1 + st1 {v0.d}[1], [x0], x1 + st1 {v1.d}[1], [x0], x1 + st1 {v2.d}[1], [x0], x1 + st1 {v3.d}[1], [x0], x1 +9: + ret +endfunc + +.macro h264_loop_filter_chroma + dup v22.8b, w2 // alpha + dup v23.8b, w3 // beta + uxtl v24.8h, v24.8b + uabd v26.8b, v16.8b, v0.8b // abs(p0 - q0) + uabd v28.8b, v18.8b, v16.8b // abs(p1 - p0) + uabd v30.8b, v2.8b, v0.8b // abs(q1 - q0) + cmhi v26.8b, v22.8b, v26.8b // < alpha + cmhi v28.8b, v23.8b, v28.8b // < beta + cmhi v30.8b, v23.8b, v30.8b // < beta + uxtl v4.8h, v0.8b + and 
v26.8b, v26.8b, v28.8b + usubw v4.8h, v4.8h, v16.8b + and v26.8b, v26.8b, v30.8b + shl v4.8h, v4.8h, #2 + mov x8, v26.d[0] + sli v24.8h, v24.8h, #8 + uaddw v4.8h, v4.8h, v18.8b + cbz x8, 9f + usubw v4.8h, v4.8h, v2.8b + rshrn v4.8b, v4.8h, #3 + smin v4.8b, v4.8b, v24.8b + neg v25.8b, v24.8b + smax v4.8b, v4.8b, v25.8b + uxtl v22.8h, v0.8b + and v4.8b, v4.8b, v26.8b + uxtl v28.8h, v16.8b + saddw v28.8h, v28.8h, v4.8b + ssubw v22.8h, v22.8h, v4.8b + sqxtun v16.8b, v28.8h + sqxtun v0.8b, v22.8h +.endm + +function ff_h264_v_loop_filter_chroma_neon, export=1 + h264_loop_filter_start + + sub x0, x0, x1, lsl #1 + ld1 {v18.8b}, [x0], x1 + ld1 {v16.8b}, [x0], x1 + ld1 {v0.8b}, [x0], x1 + ld1 {v2.8b}, [x0] + + h264_loop_filter_chroma + + sub x0, x0, x1, lsl #1 + st1 {v16.8b}, [x0], x1 + st1 {v0.8b}, [x0], x1 +9: + ret +endfunc + +function ff_h264_h_loop_filter_chroma_neon, export=1 + h264_loop_filter_start + + sub x0, x0, #2 +h_loop_filter_chroma420: + ld1 {v18.s}[0], [x0], x1 + ld1 {v16.s}[0], [x0], x1 + ld1 {v0.s}[0], [x0], x1 + ld1 {v2.s}[0], [x0], x1 + ld1 {v18.s}[1], [x0], x1 + ld1 {v16.s}[1], [x0], x1 + ld1 {v0.s}[1], [x0], x1 + ld1 {v2.s}[1], [x0], x1 + + transpose_4x8B v18, v16, v0, v2, v28, v29, v30, v31 + + h264_loop_filter_chroma + + transpose_4x8B v18, v16, v0, v2, v28, v29, v30, v31 + + sub x0, x0, x1, lsl #3 + st1 {v18.s}[0], [x0], x1 + st1 {v16.s}[0], [x0], x1 + st1 {v0.s}[0], [x0], x1 + st1 {v2.s}[0], [x0], x1 + st1 {v18.s}[1], [x0], x1 + st1 {v16.s}[1], [x0], x1 + st1 {v0.s}[1], [x0], x1 + st1 {v2.s}[1], [x0], x1 +9: + ret +endfunc + +function ff_h264_h_loop_filter_chroma422_neon, export=1 + h264_loop_filter_start + add x5, x0, x1 + sub x0, x0, #2 + add x1, x1, x1 + mov x7, x30 + bl h_loop_filter_chroma420 + mov x30, x7 + sub x0, x5, #2 + mov v24.s[0], w6 + b h_loop_filter_chroma420 +endfunc + +.macro h264_loop_filter_chroma_intra + uabd v26.8b, v16.8b, v17.8b // abs(p0 - q0) + uabd v27.8b, v18.8b, v16.8b // abs(p1 - p0) + uabd v28.8b, v19.8b, v17.8b // 
abs(q1 - q0) + cmhi v26.8b, v30.8b, v26.8b // < alpha + cmhi v27.8b, v31.8b, v27.8b // < beta + cmhi v28.8b, v31.8b, v28.8b // < beta + and v26.8b, v26.8b, v27.8b + and v26.8b, v26.8b, v28.8b + mov x2, v26.d[0] + + ushll v4.8h, v18.8b, #1 + ushll v6.8h, v19.8b, #1 + cbz x2, 9f + uaddl v20.8h, v16.8b, v19.8b + uaddl v22.8h, v17.8b, v18.8b + add v20.8h, v20.8h, v4.8h + add v22.8h, v22.8h, v6.8h + uqrshrn v24.8b, v20.8h, #2 + uqrshrn v25.8b, v22.8h, #2 + bit v16.8b, v24.8b, v26.8b + bit v17.8b, v25.8b, v26.8b +.endm + +function ff_h264_v_loop_filter_chroma_intra_neon, export=1 + h264_loop_filter_start_intra + + sub x0, x0, x1, lsl #1 + ld1 {v18.8b}, [x0], x1 + ld1 {v16.8b}, [x0], x1 + ld1 {v17.8b}, [x0], x1 + ld1 {v19.8b}, [x0] + + h264_loop_filter_chroma_intra + + sub x0, x0, x1, lsl #1 + st1 {v16.8b}, [x0], x1 + st1 {v17.8b}, [x0], x1 + +9: + ret +endfunc + +function ff_h264_h_loop_filter_chroma_mbaff_intra_neon, export=1 + h264_loop_filter_start_intra + + sub x4, x0, #2 + sub x0, x0, #1 + ld1 {v18.8b}, [x4], x1 + ld1 {v16.8b}, [x4], x1 + ld1 {v17.8b}, [x4], x1 + ld1 {v19.8b}, [x4], x1 + + transpose_4x8B v18, v16, v17, v19, v26, v27, v28, v29 + + h264_loop_filter_chroma_intra + + st2 {v16.b,v17.b}[0], [x0], x1 + st2 {v16.b,v17.b}[1], [x0], x1 + st2 {v16.b,v17.b}[2], [x0], x1 + st2 {v16.b,v17.b}[3], [x0], x1 + +9: + ret +endfunc + +function ff_h264_h_loop_filter_chroma_intra_neon, export=1 + h264_loop_filter_start_intra + + sub x4, x0, #2 + sub x0, x0, #1 +h_loop_filter_chroma420_intra: + ld1 {v18.8b}, [x4], x1 + ld1 {v16.8b}, [x4], x1 + ld1 {v17.8b}, [x4], x1 + ld1 {v19.8b}, [x4], x1 + ld1 {v18.s}[1], [x4], x1 + ld1 {v16.s}[1], [x4], x1 + ld1 {v17.s}[1], [x4], x1 + ld1 {v19.s}[1], [x4], x1 + + transpose_4x8B v18, v16, v17, v19, v26, v27, v28, v29 + + h264_loop_filter_chroma_intra + + st2 {v16.b,v17.b}[0], [x0], x1 + st2 {v16.b,v17.b}[1], [x0], x1 + st2 {v16.b,v17.b}[2], [x0], x1 + st2 {v16.b,v17.b}[3], [x0], x1 + st2 {v16.b,v17.b}[4], [x0], x1 + st2 
{v16.b,v17.b}[5], [x0], x1 + st2 {v16.b,v17.b}[6], [x0], x1 + st2 {v16.b,v17.b}[7], [x0], x1 + +9: + ret +endfunc + +function ff_h264_h_loop_filter_chroma422_intra_neon, export=1 + h264_loop_filter_start_intra + sub x4, x0, #2 + add x5, x0, x1, lsl #3 + sub x0, x0, #1 + mov x7, x30 + bl h_loop_filter_chroma420_intra + sub x0, x5, #1 + mov x30, x7 + b h_loop_filter_chroma420_intra +endfunc + +.macro biweight_16 macs, macd + dup v0.16b, w5 + dup v1.16b, w6 + mov v4.16b, v16.16b + mov v6.16b, v16.16b +1: subs w3, w3, #2 + ld1 {v20.16b}, [x0], x2 + \macd v4.8h, v0.8b, v20.8b + \macd\()2 v6.8H, v0.16B, v20.16B + ld1 {v22.16b}, [x1], x2 + \macs v4.8h, v1.8b, v22.8b + \macs\()2 v6.8H, v1.16B, v22.16B + mov v24.16b, v16.16b + ld1 {v28.16b}, [x0], x2 + mov v26.16b, v16.16b + \macd v24.8h, v0.8b, v28.8b + \macd\()2 v26.8H, v0.16B, v28.16B + ld1 {v30.16b}, [x1], x2 + \macs v24.8h, v1.8b, v30.8b + \macs\()2 v26.8H, v1.16B, v30.16B + sshl v4.8h, v4.8h, v18.8h + sshl v6.8h, v6.8h, v18.8h + sqxtun v4.8b, v4.8h + sqxtun2 v4.16b, v6.8h + sshl v24.8h, v24.8h, v18.8h + sshl v26.8h, v26.8h, v18.8h + sqxtun v24.8b, v24.8h + sqxtun2 v24.16b, v26.8h + mov v6.16b, v16.16b + st1 {v4.16b}, [x7], x2 + mov v4.16b, v16.16b + st1 {v24.16b}, [x7], x2 + b.ne 1b + ret +.endm + +.macro biweight_8 macs, macd + dup v0.8b, w5 + dup v1.8b, w6 + mov v2.16b, v16.16b + mov v20.16b, v16.16b +1: subs w3, w3, #2 + ld1 {v4.8b}, [x0], x2 + \macd v2.8h, v0.8b, v4.8b + ld1 {v5.8b}, [x1], x2 + \macs v2.8h, v1.8b, v5.8b + ld1 {v6.8b}, [x0], x2 + \macd v20.8h, v0.8b, v6.8b + ld1 {v7.8b}, [x1], x2 + \macs v20.8h, v1.8b, v7.8b + sshl v2.8h, v2.8h, v18.8h + sqxtun v2.8b, v2.8h + sshl v20.8h, v20.8h, v18.8h + sqxtun v4.8b, v20.8h + mov v20.16b, v16.16b + st1 {v2.8b}, [x7], x2 + mov v2.16b, v16.16b + st1 {v4.8b}, [x7], x2 + b.ne 1b + ret +.endm + +.macro biweight_4 macs, macd + dup v0.8b, w5 + dup v1.8b, w6 + mov v2.16b, v16.16b + mov v20.16b,v16.16b +1: subs w3, w3, #4 + ld1 {v4.s}[0], [x0], x2 + ld1 {v4.s}[1], [x0], 
x2 + \macd v2.8h, v0.8b, v4.8b + ld1 {v5.s}[0], [x1], x2 + ld1 {v5.s}[1], [x1], x2 + \macs v2.8h, v1.8b, v5.8b + b.lt 2f + ld1 {v6.s}[0], [x0], x2 + ld1 {v6.s}[1], [x0], x2 + \macd v20.8h, v0.8b, v6.8b + ld1 {v7.s}[0], [x1], x2 + ld1 {v7.s}[1], [x1], x2 + \macs v20.8h, v1.8b, v7.8b + sshl v2.8h, v2.8h, v18.8h + sqxtun v2.8b, v2.8h + sshl v20.8h, v20.8h, v18.8h + sqxtun v4.8b, v20.8h + mov v20.16b, v16.16b + st1 {v2.s}[0], [x7], x2 + st1 {v2.s}[1], [x7], x2 + mov v2.16b, v16.16b + st1 {v4.s}[0], [x7], x2 + st1 {v4.s}[1], [x7], x2 + b.ne 1b + ret +2: sshl v2.8h, v2.8h, v18.8h + sqxtun v2.8b, v2.8h + st1 {v2.s}[0], [x7], x2 + st1 {v2.s}[1], [x7], x2 + ret +.endm + +.macro biweight_func w +function ff_biweight_h264_pixels_\w\()_neon, export=1 + lsr w8, w5, #31 + add w7, w7, #1 + eor w8, w8, w6, lsr #30 + orr w7, w7, #1 + dup v18.8h, w4 + lsl w7, w7, w4 + not v18.16b, v18.16b + dup v16.8h, w7 + mov x7, x0 + cbz w8, 10f + subs w8, w8, #1 + b.eq 20f + subs w8, w8, #1 + b.eq 30f + b 40f +10: biweight_\w umlal, umlal +20: neg w5, w5 + biweight_\w umlal, umlsl +30: neg w5, w5 + neg w6, w6 + biweight_\w umlsl, umlsl +40: neg w6, w6 + biweight_\w umlsl, umlal +endfunc +.endm + + biweight_func 16 + biweight_func 8 + biweight_func 4 + +.macro weight_16 add + dup v0.16b, w4 +1: subs w2, w2, #2 + ld1 {v20.16b}, [x0], x1 + umull v4.8h, v0.8b, v20.8b + umull2 v6.8h, v0.16b, v20.16b + ld1 {v28.16b}, [x0], x1 + umull v24.8h, v0.8b, v28.8b + umull2 v26.8h, v0.16b, v28.16b + \add v4.8h, v16.8h, v4.8h + srshl v4.8h, v4.8h, v18.8h + \add v6.8h, v16.8h, v6.8h + srshl v6.8h, v6.8h, v18.8h + sqxtun v4.8b, v4.8h + sqxtun2 v4.16b, v6.8h + \add v24.8h, v16.8h, v24.8h + srshl v24.8h, v24.8h, v18.8h + \add v26.8h, v16.8h, v26.8h + srshl v26.8h, v26.8h, v18.8h + sqxtun v24.8b, v24.8h + sqxtun2 v24.16b, v26.8h + st1 {v4.16b}, [x5], x1 + st1 {v24.16b}, [x5], x1 + b.ne 1b + ret +.endm + +.macro weight_8 add + dup v0.8b, w4 +1: subs w2, w2, #2 + ld1 {v4.8b}, [x0], x1 + umull v2.8h, v0.8b, v4.8b + ld1 
{v6.8b}, [x0], x1 + umull v20.8h, v0.8b, v6.8b + \add v2.8h, v16.8h, v2.8h + srshl v2.8h, v2.8h, v18.8h + sqxtun v2.8b, v2.8h + \add v20.8h, v16.8h, v20.8h + srshl v20.8h, v20.8h, v18.8h + sqxtun v4.8b, v20.8h + st1 {v2.8b}, [x5], x1 + st1 {v4.8b}, [x5], x1 + b.ne 1b + ret +.endm + +.macro weight_4 add + dup v0.8b, w4 +1: subs w2, w2, #4 + ld1 {v4.s}[0], [x0], x1 + ld1 {v4.s}[1], [x0], x1 + umull v2.8h, v0.8b, v4.8b + b.lt 2f + ld1 {v6.s}[0], [x0], x1 + ld1 {v6.s}[1], [x0], x1 + umull v20.8h, v0.8b, v6.8b + \add v2.8h, v16.8h, v2.8h + srshl v2.8h, v2.8h, v18.8h + sqxtun v2.8b, v2.8h + \add v20.8h, v16.8h, v20.8h + srshl v20.8h, v20.8h, v18.8h + sqxtun v4.8b, v20.8h + st1 {v2.s}[0], [x5], x1 + st1 {v2.s}[1], [x5], x1 + st1 {v4.s}[0], [x5], x1 + st1 {v4.s}[1], [x5], x1 + b.ne 1b + ret +2: \add v2.8h, v16.8h, v2.8h + srshl v2.8h, v2.8h, v18.8h + sqxtun v2.8b, v2.8h + st1 {v2.s}[0], [x5], x1 + st1 {v2.s}[1], [x5], x1 + ret +.endm + +.macro weight_func w +function ff_weight_h264_pixels_\w\()_neon, export=1 + cmp w3, #1 + mov w6, #1 + lsl w5, w5, w3 + dup v16.8h, w5 + mov x5, x0 + b.le 20f + sub w6, w6, w3 + dup v18.8h, w6 + cmp w4, #0 + b.lt 10f + weight_\w shadd +10: neg w4, w4 + weight_\w shsub +20: neg w6, w3 + dup v18.8h, w6 + cmp w4, #0 + b.lt 10f + weight_\w add +10: neg w4, w4 + weight_\w sub +endfunc +.endm + + weight_func 16 + weight_func 8 + weight_func 4 + +.macro h264_loop_filter_start_10 + cmp w2, #0 + ldr w6, [x4] + ccmp w3, #0, #0, ne + lsl w2, w2, #2 + mov v24.s[0], w6 + lsl w3, w3, #2 + and w8, w6, w6, lsl #16 + b.eq 1f + ands w8, w8, w8, lsl #8 + b.ge 2f +1: + ret +2: +.endm + +.macro h264_loop_filter_start_intra_10 + orr w4, w2, w3 + cbnz w4, 1f + ret +1: + lsl w2, w2, #2 + lsl w3, w3, #2 + dup v30.8h, w2 // alpha + dup v31.8h, w3 // beta +.endm + +.macro h264_loop_filter_chroma_10 + dup v22.8h, w2 // alpha + dup v23.8h, w3 // beta + uxtl v24.8h, v24.8b // tc0 + + uabd v26.8h, v16.8h, v0.8h // abs(p0 - q0) + uabd v28.8h, v18.8h, v16.8h // abs(p1 - p0) 
+ uabd v30.8h, v2.8h, v0.8h // abs(q1 - q0) + cmhi v26.8h, v22.8h, v26.8h // < alpha + cmhi v28.8h, v23.8h, v28.8h // < beta + cmhi v30.8h, v23.8h, v30.8h // < beta + + and v26.16b, v26.16b, v28.16b + mov v4.16b, v0.16b + sub v4.8h, v4.8h, v16.8h + and v26.16b, v26.16b, v30.16b + shl v4.8h, v4.8h, #2 + mov x8, v26.d[0] + mov x9, v26.d[1] + sli v24.8h, v24.8h, #8 + uxtl v24.8h, v24.8b + add v4.8h, v4.8h, v18.8h + adds x8, x8, x9 + shl v24.8h, v24.8h, #2 + + b.eq 9f + + movi v31.8h, #3 // (tc0 - 1) << (BIT_DEPTH - 8)) + 1 + uqsub v24.8h, v24.8h, v31.8h + sub v4.8h, v4.8h, v2.8h + srshr v4.8h, v4.8h, #3 + smin v4.8h, v4.8h, v24.8h + neg v25.8h, v24.8h + smax v4.8h, v4.8h, v25.8h + and v4.16b, v4.16b, v26.16b + add v16.8h, v16.8h, v4.8h + sub v0.8h, v0.8h, v4.8h + + mvni v4.8h, #0xFC, lsl #8 // 1023 for clipping + movi v5.8h, #0 + smin v0.8h, v0.8h, v4.8h + smin v16.8h, v16.8h, v4.8h + smax v0.8h, v0.8h, v5.8h + smax v16.8h, v16.8h, v5.8h +.endm + +function ff_h264_v_loop_filter_chroma_neon_10, export=1 + h264_loop_filter_start_10 + + mov x10, x0 + sub x0, x0, x1, lsl #1 + ld1 {v18.8h}, [x0 ], x1 + ld1 {v0.8h}, [x10], x1 + ld1 {v16.8h}, [x0 ], x1 + ld1 {v2.8h}, [x10] + + h264_loop_filter_chroma_10 + + sub x0, x10, x1, lsl #1 + st1 {v16.8h}, [x0], x1 + st1 {v0.8h}, [x0], x1 +9: + ret +endfunc + +function ff_h264_h_loop_filter_chroma_neon_10, export=1 + h264_loop_filter_start_10 + + sub x0, x0, #4 // access the 2nd left pixel +h_loop_filter_chroma420_10: + add x10, x0, x1, lsl #2 + ld1 {v18.d}[0], [x0 ], x1 + ld1 {v18.d}[1], [x10], x1 + ld1 {v16.d}[0], [x0 ], x1 + ld1 {v16.d}[1], [x10], x1 + ld1 {v0.d}[0], [x0 ], x1 + ld1 {v0.d}[1], [x10], x1 + ld1 {v2.d}[0], [x0 ], x1 + ld1 {v2.d}[1], [x10], x1 + + transpose_4x8H v18, v16, v0, v2, v28, v29, v30, v31 + + h264_loop_filter_chroma_10 + + transpose_4x8H v18, v16, v0, v2, v28, v29, v30, v31 + + sub x0, x10, x1, lsl #3 + st1 {v18.d}[0], [x0], x1 + st1 {v16.d}[0], [x0], x1 + st1 {v0.d}[0], [x0], x1 + st1 {v2.d}[0], [x0], x1 + 
st1 {v18.d}[1], [x0], x1 + st1 {v16.d}[1], [x0], x1 + st1 {v0.d}[1], [x0], x1 + st1 {v2.d}[1], [x0], x1 +9: + ret +endfunc + +function ff_h264_h_loop_filter_chroma422_neon_10, export=1 + h264_loop_filter_start_10 + add x5, x0, x1 + sub x0, x0, #4 + add x1, x1, x1 + mov x7, x30 + bl h_loop_filter_chroma420_10 + mov x30, x7 + sub x0, x5, #4 + mov v24.s[0], w6 + b h_loop_filter_chroma420_10 +endfunc + +.macro h264_loop_filter_chroma_intra_10 + uabd v26.8h, v16.8h, v17.8h // abs(p0 - q0) + uabd v27.8h, v18.8h, v16.8h // abs(p1 - p0) + uabd v28.8h, v19.8h, v17.8h // abs(q1 - q0) + cmhi v26.8h, v30.8h, v26.8h // < alpha + cmhi v27.8h, v31.8h, v27.8h // < beta + cmhi v28.8h, v31.8h, v28.8h // < beta + and v26.16b, v26.16b, v27.16b + and v26.16b, v26.16b, v28.16b + mov x2, v26.d[0] + mov x3, v26.d[1] + + shl v4.8h, v18.8h, #1 + shl v6.8h, v19.8h, #1 + + adds x2, x2, x3 + b.eq 9f + + add v20.8h, v16.8h, v19.8h + add v22.8h, v17.8h, v18.8h + add v20.8h, v20.8h, v4.8h + add v22.8h, v22.8h, v6.8h + urshr v24.8h, v20.8h, #2 + urshr v25.8h, v22.8h, #2 + bit v16.16b, v24.16b, v26.16b + bit v17.16b, v25.16b, v26.16b +.endm + +function ff_h264_v_loop_filter_chroma_intra_neon_10, export=1 + h264_loop_filter_start_intra_10 + mov x9, x0 + sub x0, x0, x1, lsl #1 + ld1 {v18.8h}, [x0], x1 + ld1 {v17.8h}, [x9], x1 + ld1 {v16.8h}, [x0], x1 + ld1 {v19.8h}, [x9] + + h264_loop_filter_chroma_intra_10 + + sub x0, x9, x1, lsl #1 + st1 {v16.8h}, [x0], x1 + st1 {v17.8h}, [x0], x1 + +9: + ret +endfunc + +function ff_h264_h_loop_filter_chroma_mbaff_intra_neon_10, export=1 + h264_loop_filter_start_intra_10 + + sub x4, x0, #4 + sub x0, x0, #2 + add x9, x4, x1, lsl #1 + ld1 {v18.8h}, [x4], x1 + ld1 {v17.8h}, [x9], x1 + ld1 {v16.8h}, [x4], x1 + ld1 {v19.8h}, [x9], x1 + + transpose_4x8H v18, v16, v17, v19, v26, v27, v28, v29 + + h264_loop_filter_chroma_intra_10 + + st2 {v16.h,v17.h}[0], [x0], x1 + st2 {v16.h,v17.h}[1], [x0], x1 + st2 {v16.h,v17.h}[2], [x0], x1 + st2 {v16.h,v17.h}[3], [x0], x1 + +9: + ret 
+endfunc + +function ff_h264_h_loop_filter_chroma_intra_neon_10, export=1 + h264_loop_filter_start_intra_10 + sub x4, x0, #4 + sub x0, x0, #2 +h_loop_filter_chroma420_intra_10: + add x9, x4, x1, lsl #2 + ld1 {v18.4h}, [x4], x1 + ld1 {v18.d}[1], [x9], x1 + ld1 {v16.4h}, [x4], x1 + ld1 {v16.d}[1], [x9], x1 + ld1 {v17.4h}, [x4], x1 + ld1 {v17.d}[1], [x9], x1 + ld1 {v19.4h}, [x4], x1 + ld1 {v19.d}[1], [x9], x1 + + transpose_4x8H v18, v16, v17, v19, v26, v27, v28, v29 + + h264_loop_filter_chroma_intra_10 + + st2 {v16.h,v17.h}[0], [x0], x1 + st2 {v16.h,v17.h}[1], [x0], x1 + st2 {v16.h,v17.h}[2], [x0], x1 + st2 {v16.h,v17.h}[3], [x0], x1 + st2 {v16.h,v17.h}[4], [x0], x1 + st2 {v16.h,v17.h}[5], [x0], x1 + st2 {v16.h,v17.h}[6], [x0], x1 + st2 {v16.h,v17.h}[7], [x0], x1 + +9: + ret +endfunc + +function ff_h264_h_loop_filter_chroma422_intra_neon_10, export=1 + h264_loop_filter_start_intra_10 + sub x4, x0, #4 + add x5, x0, x1, lsl #3 + sub x0, x0, #2 + mov x7, x30 + bl h_loop_filter_chroma420_intra_10 + mov x4, x9 + sub x0, x5, #2 + mov x30, x7 + b h_loop_filter_chroma420_intra_10 +endfunc