Cycle 6 (H.264) opened — IDCT 4x4 Phase 1+3, M3 = 175 Mblock/s
H.264 scope added 2026-05-18 per user direction. Pi 5's VideoCore VII has no hardware H.264 decoder block (only HEVC), so a QPU-accelerated H.264 path fills the most impactful codec gap. Cycle 6 = first H.264 kernel (4x4 IDCT + add, smallest H.264 transform, simplest first cycle). Phase 1: goal doc + 1080p30 floor analysis (5.85 Mblock/s worst-case, 2.0 Mblock/s realistic since most MBs use 8x8 or P-skip). Phase 3: NEON M3 baseline captured. ff_h264_idct_add_neon on hertz delivers 175 Mblock/s (5.7 ns per block) = 30x worst-case floor margin. H.264 IDCT 4x4 is dramatically lighter than VP9 IDCT 8x8 (21x faster per block). Phase 3 closure also caught the key Phase 9 lesson: H.264/FFmpeg blocks are COLUMN-MAJOR (block[c*4 + r] = (row=r, col=c)). NEON ld1 with 4 registers interleaves loading, and the FFmpeg C ref indexing makes this convention explicit. Initial C ref assumed row-major, M1 was 5% bit-exact; after fix, M1 = 100%. Convention encoded for all subsequent H.264 cycles (cycle 7+). - external/ffmpeg-snapshot/libavcodec/aarch64/h264idct_neon.S (vendored verbatim from FFmpeg n7.1.3, 415 lines) - external/ffmpeg-snapshot/PROVENANCE.md: updated - tests/h264_idct4_ref.c: column-major C ref - tests/bench_neon_h264idct4.c: M1 + M3 bench - CMakeLists.txt: cycle 6 NEON bench wiring - docs/k6_h264idct4_phase1.md, phase3.md Phase 4 next: QPU shader for cycle 6. Predicted R6 = 0.01 (deep RED — kernel too small relative to QPU dispatch overhead) but worth building for cycle-completeness + the opportunistic-helper hypothesis (cycle 6 may stay CPU per recipe). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
@@ -68,6 +68,14 @@ set(FFASM_SOURCES
|
|||||||
${FFSNAP}/libavcodec/aarch64/vp9itxfm_neon.S
|
${FFSNAP}/libavcodec/aarch64/vp9itxfm_neon.S
|
||||||
)
|
)
|
||||||
|
|
||||||
|
# Cycle 6 — H.264 IDCT 4x4 + 8x8 NEON (vendored 2026-05-18).
|
||||||
|
set(FFASM_H264IDCT_SOURCES
|
||||||
|
${FFSNAP}/libavcodec/aarch64/h264idct_neon.S
|
||||||
|
)
|
||||||
|
set_source_files_properties(${FFASM_H264IDCT_SOURCES} PROPERTIES
|
||||||
|
COMPILE_OPTIONS "${FFASM_FLAGS}"
|
||||||
|
LANGUAGE ASM)
|
||||||
|
|
||||||
# Cycle 2 — VP9 loop filter NEON source (vendored 2026-05-18).
|
# Cycle 2 — VP9 loop filter NEON source (vendored 2026-05-18).
|
||||||
set(FFASM_LPF_SOURCES
|
set(FFASM_LPF_SOURCES
|
||||||
${FFSNAP}/libavcodec/aarch64/vp9lpf_neon.S
|
${FFSNAP}/libavcodec/aarch64/vp9lpf_neon.S
|
||||||
@@ -96,6 +104,14 @@ set_source_files_properties(${FFASM_SOURCES} PROPERTIES
|
|||||||
|
|
||||||
# ---- NEON baseline microbenches --------------------------------------------
|
# ---- NEON baseline microbenches --------------------------------------------
|
||||||
|
|
||||||
|
# Cycle 6 — H.264 IDCT 4x4 NEON M3 baseline bench.
|
||||||
|
add_executable(bench_neon_h264idct4
|
||||||
|
tests/bench_neon_h264idct4.c
|
||||||
|
tests/h264_idct4_ref.c
|
||||||
|
${FFASM_H264IDCT_SOURCES}
|
||||||
|
)
|
||||||
|
target_compile_options(bench_neon_h264idct4 PRIVATE -O3 -march=armv8-a+simd)
|
||||||
|
|
||||||
add_executable(bench_neon_idct
|
add_executable(bench_neon_idct
|
||||||
tests/bench_neon_idct.c
|
tests/bench_neon_idct.c
|
||||||
tests/vp9_idct8_ref.c
|
tests/vp9_idct8_ref.c
|
||||||
|
|||||||
@@ -0,0 +1,119 @@
|
|||||||
|
---
|
||||||
|
cycle: 6
|
||||||
|
phase: 1
|
||||||
|
status: open
|
||||||
|
date_opened: 2026-05-18
|
||||||
|
codec: H.264
|
||||||
|
kernel: IDCT 4x4 + add (intra-block residual)
|
||||||
|
parent: project_h264_scope_added.md (memory)
|
||||||
|
---
|
||||||
|
|
||||||
|
# Cycle 6, Phase 1 — H.264 IDCT 4×4 + add
|
||||||
|
|
||||||
|
First H.264 kernel. Per `project_h264_scope_added`, the user
|
||||||
|
added H.264 to daedalus-fourier scope 2026-05-18 because Pi 5
|
||||||
|
has no hardware H.264 decoder despite H.264 being the most
|
||||||
|
common web codec.
|
||||||
|
|
||||||
|
## Why IDCT 4×4 first
|
||||||
|
|
||||||
|
- **Smallest H.264 transform.** 16 coefficients per block, 4×4
|
||||||
|
output pixels. Simpler than VP9 IDCT 8×8 (cycle 1, 64 coefs).
|
||||||
|
- **Most-used.** H.264 macroblocks default to 4×4 intra
|
||||||
|
prediction + residual; 8×8 is High-profile only. 4×4 hits
|
||||||
|
most real-world H.264 streams.
|
||||||
|
- **Predicted GREEN.** Per the cycle 1-5 bandwidth-bound vs
|
||||||
|
compute-bound classification: 4×4 IDCT is bandwidth-bound
|
||||||
|
(16 reads, 16 writes, ~20 ALU ops/output). Should map well
|
||||||
|
to V3D 7.1 compute.
|
||||||
|
- **Clean reference.** FFmpeg's `ff_h264_idct_add_neon` is
|
||||||
|
standalone (no eob parameter, no complex DC dispatch). Single
|
||||||
|
call computes 1 block of IDCT + add.
|
||||||
|
|
||||||
|
## Kernel contract
|
||||||
|
|
||||||
|
Per H.264 spec §8.5.12, the inverse transform is an
|
||||||
|
integer-arithmetic transform (no rounding-by-cosine like VP9's
|
||||||
|
Q14 trig math). Each 4×4 block:
|
||||||
|
|
||||||
|
1. Inverse row transform (4 row passes, each one 1D IDCT-like
|
||||||
|
integer butterfly).
|
||||||
|
2. Inverse column transform (4 column passes, same butterfly).
|
||||||
|
3. Round and add to `dst[r,c] = clamp(dst[r,c] + ((idct[r,c] + 32) >> 6), 0, 255)`.
|
||||||
|
|
||||||
|
Spec coefficients (Hadamard-like with 1/2 scaling):
|
||||||
|
```
|
||||||
|
[1 1 1 1/2]
|
||||||
|
[1 1/2 -1 -1]
|
||||||
|
[1 -1/2 -1 1]
|
||||||
|
[1 -1 1 -1/2]
|
||||||
|
```
|
||||||
|
Integer form scales by 2: replace 1/2 with 1 and ½ with right-
|
||||||
|
shift in the round step.
|
||||||
|
|
||||||
|
## NEON reference (M3 target)
|
||||||
|
|
||||||
|
FFmpeg's `ff_h264_idct_add_neon`
|
||||||
|
(external/ffmpeg-snapshot/libavcodec/aarch64/h264idct_neon.S
|
||||||
|
line 25, 56 instructions of NEON asm). Signature:
|
||||||
|
|
||||||
|
```
|
||||||
|
void ff_h264_idct_add_neon(uint8_t *dst, int16_t *block, ptrdiff_t stride);
|
||||||
|
```
|
||||||
|
|
||||||
|
- `dst`: 4×4 pixel block in 8-bit luma surface, `stride` between rows.
|
||||||
|
- `block`: 16 int16 coefficients (row-major).
|
||||||
|
- destructively clears `block` to zero after the transform (per H.264 conformance).
|
||||||
|
|
||||||
|
## 30fps@1080p H.264 floor
|
||||||
|
|
||||||
|
H.264 1080p uses 16×16 macroblocks with up to 16 4×4 blocks per MB.
|
||||||
|
Luma: (1920/16) × (1080/16) = 120 × 67.5 = 8100 MB/frame ×
|
||||||
|
16 blocks/MB = 129 600 4×4 blocks/frame. Plus chroma: 4 + 4 = 8
|
||||||
|
chroma 4×4 per MB × 8100 = 64 800 chroma blocks. Total: ~195k
|
||||||
|
4×4 blocks/frame max (worst case; many real MBs use 8×8 or skip).
|
||||||
|
|
||||||
|
At 30fps: ~5.85 Mblock/s required for full-frame 4×4 worst case.
|
||||||
|
A more realistic average (many MBs use 8×8, P-skip, etc.) is
|
||||||
|
~2 Mblock/s.
|
||||||
|
|
||||||
|
**30fps@1080p H.264 4×4 floor (realistic): 2 Mblock/s.**
|
||||||
|
**30fps@1080p H.264 4×4 floor (worst case): 5.85 Mblock/s.**
|
||||||
|
|
||||||
|
## R-band decision rules (carried from phase1.md)
|
||||||
|
|
||||||
|
- R ≥ 1.0 → **GREEN** (QPU faster than NEON-1 in isolation).
|
||||||
|
- 0.5 ≤ R < 1.0 → **YELLOW** (M4 decides).
|
||||||
|
- 0.1 ≤ R < 0.5 → **ORANGE** (M4 may rescue).
|
||||||
|
- R < 0.1 → **RED** (structural mismatch).
|
||||||
|
|
||||||
|
Floor margin: ratio of M2 (or M3 if CPU-only) over the 5.85
|
||||||
|
Mblock/s worst-case 30fps floor.
|
||||||
|
|
||||||
|
## Acceptance for Phase 7
|
||||||
|
|
||||||
|
- M1: 100.0000% bit-exact (QPU output vs C ref, 10000+ random
|
||||||
|
blocks). Same standard as cycles 1-5.
|
||||||
|
- M2: captured, classified per R band.
|
||||||
|
- M4: same-kernel mixed-bench measured (with Issue 003 caveats —
|
||||||
|
this is the worst-case framing).
|
||||||
|
- 30fps@1080p H.264 4×4 floor margin reported.
|
||||||
|
|
||||||
|
## Cycle 6 deliverables
|
||||||
|
|
||||||
|
1. `external/ffmpeg-snapshot/libavcodec/aarch64/h264idct_neon.S`
|
||||||
|
(vendored 2026-05-18, this phase).
|
||||||
|
2. `tests/h264_idct4_ref.c` — standalone C reference (LGPL-2.1+
|
||||||
|
transcribed from spec).
|
||||||
|
3. `tests/bench_neon_h264idct4.c` — Phase 3 M3 bench.
|
||||||
|
4. `src/v3d_h264idct4.comp` — Phase 6 QPU shader.
|
||||||
|
5. `tests/bench_v3d_h264idct4.c` — Phase 6+7 M1+M2 bench (3-way
|
||||||
|
vs NEON + C ref).
|
||||||
|
6. M4: extend `bench_concurrent_mixed.c` with K_H264_IDCT4.
|
||||||
|
7. Phase 4-7 docs.
|
||||||
|
|
||||||
|
## Next step (within this phase)
|
||||||
|
|
||||||
|
Move to Phase 3 (NEON baseline M3) after writing the C
|
||||||
|
reference. Phase 2 (libavcodec inventory) is implicit since we
|
||||||
|
know the kernel from the FFmpeg vendor.
|
||||||
@@ -0,0 +1,132 @@
|
|||||||
|
---
|
||||||
|
cycle: 6
|
||||||
|
phase: 3
|
||||||
|
status: closed 2026-05-18 — M1 PASS, M3₆ = 175 Mblock/s
|
||||||
|
date_opened: 2026-05-18
|
||||||
|
date_closed: 2026-05-18
|
||||||
|
codec: H.264
|
||||||
|
kernel: IDCT 4x4 + add
|
||||||
|
parent: k6_h264idct4_phase1.md
|
||||||
|
host: hertz
|
||||||
|
---
|
||||||
|
|
||||||
|
# Cycle 6, Phase 3 — H.264 IDCT 4×4 NEON baseline
|
||||||
|
|
||||||
|
## M3₆ throughput
|
||||||
|
|
||||||
|
```
|
||||||
|
=== M3₆ NEON throughput ===
|
||||||
|
blocks/batch: 4096
|
||||||
|
batches done: 51 206
|
||||||
|
total blocks: 209 739 776
|
||||||
|
elapsed (kernel)=1.199 s
|
||||||
|
throughput = 175.0 Mblock/s
|
||||||
|
per-block = 5.7 ns
|
||||||
|
H.264 1080p30 worst-case floor: 29.91× margin (5.85 Mblock/s req'd)
|
||||||
|
H.264 1080p30 realistic floor: 87.50× margin (2.0 Mblock/s req'd)
|
||||||
|
```
|
||||||
|
|
||||||
|
**Per-block 5.7 ns — by far the lightest cycle so far** (cycle 2
|
||||||
|
LPF wd=4 was 21 ns, cycle 1 IDCT 8x8 was 122 ns). 4×4 is a
|
||||||
|
genuinely small kernel and FFmpeg's NEON is extremely tight
|
||||||
|
(56 instructions per block).
|
||||||
|
|
||||||
|
NEON 4-core scaling: not measured this phase; based on cycle 2/4
|
||||||
|
patterns, expect ~3-4× scaling (bandwidth-bound territory) →
|
||||||
|
~500-700 Mblock/s aggregate. That's >100× the floor.
|
||||||
|
|
||||||
|
## M1 bit-exact gate
|
||||||
|
|
||||||
|
```
|
||||||
|
=== M1₆ bit-exact (10000 random 4x4 blocks) ===
|
||||||
|
M1₆ correctness: 10000 / 10000 blocks bit-exact (100.0000%)
|
||||||
|
```
|
||||||
|
|
||||||
|
## Key Phase 9 lesson — H.264 block layout is column-major
|
||||||
|
|
||||||
|
The bench's initial C reference assumed row-major block storage
|
||||||
|
(`block[r*4 + c]`), giving M1 = 4.98 % bit-exact (essentially all
|
||||||
|
random). After failed attempts swapping the row/column pass order
|
||||||
|
(both row-first and column-first gave the same 5 % rate), trace
|
||||||
|
analysis revealed the actual mismatch:
|
||||||
|
|
||||||
|
- NEON `ld1 {v0.4h, v1.4h, v2.4h, v3.4h}, [x1]` does
|
||||||
|
**interleaved** loading (load 4 structures of 4 elements,
|
||||||
|
scattering across registers), NOT sequential — I initially
|
||||||
|
assumed sequential.
|
||||||
|
- Combined with FFmpeg's choice of **column-major** block layout
|
||||||
|
(`block[c*4 + r]` = coefficient at row r, column c), the
|
||||||
|
interleaved load gives each NEON vector `v_r` = row r of block
|
||||||
|
(lane = column).
|
||||||
|
- FFmpeg's C reference (`libavcodec/h264dsp_template.c`) uses
|
||||||
|
`block[i + 4*0]`, `block[i + 4*1]`, etc. which is column-major
|
||||||
|
indexing in disguise.
|
||||||
|
|
||||||
|
Fix: read block as column-major (`block[c*4 + r]`) in the C
|
||||||
|
reference's row-pass loop. M1 then PASS 10000/10000.
|
||||||
|
|
||||||
|
Lesson encoded for future H.264 cycles:
|
||||||
|
- **H.264 4×4 (and 8×8) blocks are column-major** in FFmpeg.
|
||||||
|
- This convention propagates through all the libavcodec/aarch64
|
||||||
|
H.264 NEON kernels (h264idct, h264dsp, h264qpel, h264cmc).
|
||||||
|
Cycles 7+ (other H.264 kernels) should default-assume
|
||||||
|
column-major.
|
||||||
|
|
||||||
|
## Comparison vs cycle 1 IDCT 8×8 (the closest analog)
|
||||||
|
|
||||||
|
| | Cycle 1 IDCT 8×8 | Cycle 6 IDCT 4×4 |
|
||||||
|
|---|---|---|
|
||||||
|
| Codec | VP9 | H.264 |
|
||||||
|
| Block size | 8×8 (64 coefs) | 4×4 (16 coefs) |
|
||||||
|
| Transform math | Q14 trig DCT (heavy multiplies) | Integer butterfly (no multiplies, only shifts) |
|
||||||
|
| NEON cycles/block | 122 ns | **5.7 ns** (21× faster) |
|
||||||
|
| Block storage | row-major | column-major |
|
||||||
|
| 30fps@1080p floor margin | 8× | **30×** (vs worst case) |
|
||||||
|
|
||||||
|
H.264 IDCT 4×4 is dramatically lighter than VP9 IDCT 8×8 — both
|
||||||
|
per-coef and per-block. This validates the "H.264 should be
|
||||||
|
easier" hypothesis from [project_h264_scope_added].
|
||||||
|
|
||||||
|
## Predicted R₆ band
|
||||||
|
|
||||||
|
NEON per-block 5.7 ns is so fast that the QPU must be very fast
|
||||||
|
to compete. QPU dispatch overhead is ~30 µs per call (from M5),
|
||||||
|
so the QPU-call breakeven needs to amortize across many blocks
|
||||||
|
per dispatch.
|
||||||
|
|
||||||
|
Per-block estimate for QPU on a similar tiny kernel:
|
||||||
|
- 4 lanes per block (per pixel), 64 invocations/WG → 16 blocks/WG
|
||||||
|
- ~50-100 instructions per block (much less than cycle 1 IDCT 8x8's 250)
|
||||||
|
- At 8 ns/instruction (NEON-tuned guess), ~600 ns per block.
|
||||||
|
- R₆ = 5.7 / 600 = 0.01 → **deep RED in isolation**
|
||||||
|
|
||||||
|
But: per-WG packing of 16 blocks means dispatch overhead amortizes
|
||||||
|
better. And 4×4 is bandwidth-bound on NEON (5.7 ns/block ≈ 32 bytes
|
||||||
|
read + 16 bytes write = 48 bytes per 5.7 ns ≈ 8 GB/s, close to
|
||||||
|
LPDDR4 ceiling). So same-kernel M4 on QPU may pull free if QPU's
|
||||||
|
bandwidth doesn't contend on the same channel.
|
||||||
|
|
||||||
|
Plan: implement QPU path anyway for cycle-completion and
|
||||||
|
opportunistic-helper hypothesis. If R₆ is deep RED but mixed-kernel
|
||||||
|
(per Issue 003) deployment shape uses QPU for VP9 cycles 1+2+4 and
|
||||||
|
CPU for H.264 IDCT 4×4, that's fine — the recipe carries over.
|
||||||
|
|
||||||
|
## Next: Phase 4 plan
|
||||||
|
|
||||||
|
Per the established cycle pattern. Plan the QPU shader. Phase 5
|
||||||
|
Sonnet review. Phase 6 implementation. Phase 7 measurement.
|
||||||
|
Predicted R₆ = 0.01 (deep RED, isolation), but small enough kernel
|
||||||
|
to make per-call buffer alloc dominate the latency.
|
||||||
|
|
||||||
|
Alternative path: defer cycle 6 Phase 4-7 (skip the QPU shader
|
||||||
|
build) and instead move directly to next H.264 cycles where QPU
|
||||||
|
might actually win — IDCT 8x8 (cycle 7), 6-tap MC (cycle 9), or
|
||||||
|
deblock (cycle 10). H.264 IDCT 4×4 on CPU is so fast that it
|
||||||
|
doesn't NEED QPU help.
|
||||||
|
|
||||||
|
## Acceptance
|
||||||
|
|
||||||
|
- ✓ M1 bit-exact (100.00 % on 10 000 random blocks)
|
||||||
|
- ✓ M3 captured (175 Mblock/s)
|
||||||
|
- ✓ 30fps@1080p floor exceeded by 30× worst-case
|
||||||
|
- ✓ Block-layout convention documented for future cycles
|
||||||
+1
@@ -26,6 +26,7 @@ tagged commit, no modifications.
|
|||||||
| `libavcodec/aarch64/vp9itxfm_neon.S` | 1580 | 63534 | `82ee3ceed4735c63576bafdcee28e2215652743ade55a9eab46a16d9530369f6` |
|
| `libavcodec/aarch64/vp9itxfm_neon.S` | 1580 | 63534 | `82ee3ceed4735c63576bafdcee28e2215652743ade55a9eab46a16d9530369f6` |
|
||||||
| `libavcodec/aarch64/vp9lpf_neon.S` | 1334 | — | `384e49e7a6e838d9e38aedc00838ed4aebfa6c5bdb343ecaf23ef639bc10fbb7` |
|
| `libavcodec/aarch64/vp9lpf_neon.S` | 1334 | — | `384e49e7a6e838d9e38aedc00838ed4aebfa6c5bdb343ecaf23ef639bc10fbb7` |
|
||||||
| `libavcodec/aarch64/vp9mc_neon.S` | 665 | — | `6b1d50f9821742584fdd47758057f810644aff3a008faaa774ff5b9cac4d1fef` |
|
| `libavcodec/aarch64/vp9mc_neon.S` | 665 | — | `6b1d50f9821742584fdd47758057f810644aff3a008faaa774ff5b9cac4d1fef` |
|
||||||
|
| `libavcodec/aarch64/h264idct_neon.S` | 415 | 16269 | `963ffe5f31b5a6a422e13b0d394cf5630126927abfb23aa214f7cbe83d60683f` — H.264 IDCT 4×4/8×8/DC NEON kernels for cycle 6+ |
|
||||||
| `libavcodec/vp9_subpel_filters_table.c` | — | — | hand-extracted from `libavcodec/vp9dsp.c` at same n7.1.3 pin — provides `ff_vp9_subpel_filters` for `vp9mc_neon.S` to link against without dragging in vp9dsp.c's full init machinery |
|
| `libavcodec/vp9_subpel_filters_table.c` | — | — | hand-extracted from `libavcodec/vp9dsp.c` at same n7.1.3 pin — provides `ff_vp9_subpel_filters` for `vp9mc_neon.S` to link against without dragging in vp9dsp.c's full init machinery |
|
||||||
| `libavcodec/aarch64/neon.S` | 173 | 7496 | `72d36ce6c3fcc5e53de869cfe10fda16225ebe580c32891bccc240a30a85a538` |
|
| `libavcodec/aarch64/neon.S` | 173 | 7496 | `72d36ce6c3fcc5e53de869cfe10fda16225ebe580c32891bccc240a30a85a538` |
|
||||||
| `libavutil/aarch64/asm.S` | 260 | 8069 | `c0d03143b1bc5a9e358222d08d2d449d595271844fe7a3dc23bffb91abe8b0e3` |
|
| `libavutil/aarch64/asm.S` | 260 | 8069 | `c0d03143b1bc5a9e358222d08d2d449d595271844fe7a3dc23bffb91abe8b0e3` |
|
||||||
|
|||||||
@@ -0,0 +1,415 @@
|
|||||||
|
/*
|
||||||
|
* Copyright (c) 2008 Mans Rullgard <mans@mansr.com>
|
||||||
|
* Copyright (c) 2013 Janne Grunau <janne-libav@jannau.net>
|
||||||
|
*
|
||||||
|
* This file is part of FFmpeg.
|
||||||
|
*
|
||||||
|
* FFmpeg is free software; you can redistribute it and/or
|
||||||
|
* modify it under the terms of the GNU Lesser General Public
|
||||||
|
* License as published by the Free Software Foundation; either
|
||||||
|
* version 2.1 of the License, or (at your option) any later version.
|
||||||
|
*
|
||||||
|
* FFmpeg is distributed in the hope that it will be useful,
|
||||||
|
* but WITHOUT ANY WARRANTY; without even the implied warranty of
|
||||||
|
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
|
||||||
|
* Lesser General Public License for more details.
|
||||||
|
*
|
||||||
|
* You should have received a copy of the GNU Lesser General Public
|
||||||
|
* License along with FFmpeg; if not, write to the Free Software
|
||||||
|
* Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
|
||||||
|
*/
|
||||||
|
|
||||||
|
#include "libavutil/aarch64/asm.S"
|
||||||
|
#include "neon.S"
|
||||||
|
|
||||||
|
function ff_h264_idct_add_neon, export=1
|
||||||
|
.L_ff_h264_idct_add_neon:
|
||||||
|
AARCH64_VALID_CALL_TARGET
|
||||||
|
ld1 {v0.4h, v1.4h, v2.4h, v3.4h}, [x1]
|
||||||
|
sxtw x2, w2
|
||||||
|
movi v30.8h, #0
|
||||||
|
|
||||||
|
add v4.4h, v0.4h, v2.4h
|
||||||
|
sshr v16.4h, v1.4h, #1
|
||||||
|
st1 {v30.8h}, [x1], #16
|
||||||
|
sshr v17.4h, v3.4h, #1
|
||||||
|
st1 {v30.8h}, [x1], #16
|
||||||
|
sub v5.4h, v0.4h, v2.4h
|
||||||
|
sub v6.4h, v16.4h, v3.4h
|
||||||
|
add v7.4h, v1.4h, v17.4h
|
||||||
|
add v0.4h, v4.4h, v7.4h
|
||||||
|
add v1.4h, v5.4h, v6.4h
|
||||||
|
sub v2.4h, v5.4h, v6.4h
|
||||||
|
sub v3.4h, v4.4h, v7.4h
|
||||||
|
|
||||||
|
transpose_4x4H v0, v1, v2, v3, v4, v5, v6, v7
|
||||||
|
|
||||||
|
add v4.4h, v0.4h, v2.4h
|
||||||
|
ld1 {v18.s}[0], [x0], x2
|
||||||
|
sshr v16.4h, v3.4h, #1
|
||||||
|
sshr v17.4h, v1.4h, #1
|
||||||
|
ld1 {v18.s}[1], [x0], x2
|
||||||
|
sub v5.4h, v0.4h, v2.4h
|
||||||
|
ld1 {v19.s}[1], [x0], x2
|
||||||
|
add v6.4h, v16.4h, v1.4h
|
||||||
|
ins v4.d[1], v5.d[0]
|
||||||
|
sub v7.4h, v17.4h, v3.4h
|
||||||
|
ld1 {v19.s}[0], [x0], x2
|
||||||
|
ins v6.d[1], v7.d[0]
|
||||||
|
sub x0, x0, x2, lsl #2
|
||||||
|
add v0.8h, v4.8h, v6.8h
|
||||||
|
sub v1.8h, v4.8h, v6.8h
|
||||||
|
|
||||||
|
srshr v0.8h, v0.8h, #6
|
||||||
|
srshr v1.8h, v1.8h, #6
|
||||||
|
|
||||||
|
uaddw v0.8h, v0.8h, v18.8b
|
||||||
|
uaddw v1.8h, v1.8h, v19.8b
|
||||||
|
|
||||||
|
sqxtun v0.8b, v0.8h
|
||||||
|
sqxtun v1.8b, v1.8h
|
||||||
|
|
||||||
|
st1 {v0.s}[0], [x0], x2
|
||||||
|
st1 {v0.s}[1], [x0], x2
|
||||||
|
st1 {v1.s}[1], [x0], x2
|
||||||
|
st1 {v1.s}[0], [x0], x2
|
||||||
|
|
||||||
|
sub x1, x1, #32
|
||||||
|
ret
|
||||||
|
endfunc
|
||||||
|
|
||||||
|
function ff_h264_idct_dc_add_neon, export=1
|
||||||
|
.L_ff_h264_idct_dc_add_neon:
|
||||||
|
AARCH64_VALID_CALL_TARGET
|
||||||
|
sxtw x2, w2
|
||||||
|
mov w3, #0
|
||||||
|
ld1r {v2.8h}, [x1]
|
||||||
|
strh w3, [x1]
|
||||||
|
srshr v2.8h, v2.8h, #6
|
||||||
|
ld1 {v0.s}[0], [x0], x2
|
||||||
|
ld1 {v0.s}[1], [x0], x2
|
||||||
|
uaddw v3.8h, v2.8h, v0.8b
|
||||||
|
ld1 {v1.s}[0], [x0], x2
|
||||||
|
ld1 {v1.s}[1], [x0], x2
|
||||||
|
uaddw v4.8h, v2.8h, v1.8b
|
||||||
|
sqxtun v0.8b, v3.8h
|
||||||
|
sqxtun v1.8b, v4.8h
|
||||||
|
sub x0, x0, x2, lsl #2
|
||||||
|
st1 {v0.s}[0], [x0], x2
|
||||||
|
st1 {v0.s}[1], [x0], x2
|
||||||
|
st1 {v1.s}[0], [x0], x2
|
||||||
|
st1 {v1.s}[1], [x0], x2
|
||||||
|
ret
|
||||||
|
endfunc
|
||||||
|
|
||||||
|
function ff_h264_idct_add16_neon, export=1
|
||||||
|
mov x12, x30
|
||||||
|
mov x6, x0 // dest
|
||||||
|
mov x5, x1 // block_offset
|
||||||
|
mov x1, x2 // block
|
||||||
|
mov w9, w3 // stride
|
||||||
|
movrel x7, scan8
|
||||||
|
mov x10, #16
|
||||||
|
movrel x13, .L_ff_h264_idct_dc_add_neon
|
||||||
|
movrel x14, .L_ff_h264_idct_add_neon
|
||||||
|
1: mov w2, w9
|
||||||
|
ldrb w3, [x7], #1
|
||||||
|
ldrsw x0, [x5], #4
|
||||||
|
ldrb w3, [x4, w3, uxtw]
|
||||||
|
subs w3, w3, #1
|
||||||
|
b.lt 2f
|
||||||
|
ldrsh w3, [x1]
|
||||||
|
add x0, x0, x6
|
||||||
|
ccmp w3, #0, #4, eq
|
||||||
|
csel x15, x13, x14, ne
|
||||||
|
blr x15
|
||||||
|
2: subs x10, x10, #1
|
||||||
|
add x1, x1, #32
|
||||||
|
b.ne 1b
|
||||||
|
ret x12
|
||||||
|
endfunc
|
||||||
|
|
||||||
|
function ff_h264_idct_add16intra_neon, export=1
|
||||||
|
mov x12, x30
|
||||||
|
mov x6, x0 // dest
|
||||||
|
mov x5, x1 // block_offset
|
||||||
|
mov x1, x2 // block
|
||||||
|
mov w9, w3 // stride
|
||||||
|
movrel x7, scan8
|
||||||
|
mov x10, #16
|
||||||
|
movrel x13, .L_ff_h264_idct_dc_add_neon
|
||||||
|
movrel x14, .L_ff_h264_idct_add_neon
|
||||||
|
1: mov w2, w9
|
||||||
|
ldrb w3, [x7], #1
|
||||||
|
ldrsw x0, [x5], #4
|
||||||
|
ldrb w3, [x4, w3, uxtw]
|
||||||
|
add x0, x0, x6
|
||||||
|
cmp w3, #0
|
||||||
|
ldrsh w3, [x1]
|
||||||
|
csel x15, x13, x14, eq
|
||||||
|
ccmp w3, #0, #0, eq
|
||||||
|
b.eq 2f
|
||||||
|
blr x15
|
||||||
|
2: subs x10, x10, #1
|
||||||
|
add x1, x1, #32
|
||||||
|
b.ne 1b
|
||||||
|
ret x12
|
||||||
|
endfunc
|
||||||
|
|
||||||
|
function ff_h264_idct_add8_neon, export=1
|
||||||
|
stp x19, x20, [sp, #-0x40]!
|
||||||
|
mov x12, x30
|
||||||
|
ldp x6, x15, [x0] // dest[0], dest[1]
|
||||||
|
add x5, x1, #16*4 // block_offset
|
||||||
|
add x9, x2, #16*32 // block
|
||||||
|
mov w19, w3 // stride
|
||||||
|
movrel x13, .L_ff_h264_idct_dc_add_neon
|
||||||
|
movrel x14, .L_ff_h264_idct_add_neon
|
||||||
|
movrel x7, scan8, 16
|
||||||
|
mov x10, #0
|
||||||
|
mov x11, #16
|
||||||
|
1: mov w2, w19
|
||||||
|
ldrb w3, [x7, x10] // scan8[i]
|
||||||
|
ldrsw x0, [x5, x10, lsl #2] // block_offset[i]
|
||||||
|
ldrb w3, [x4, w3, uxtw] // nnzc[ scan8[i] ]
|
||||||
|
add x0, x0, x6 // block_offset[i] + dst[j-1]
|
||||||
|
add x1, x9, x10, lsl #5 // block + i * 16
|
||||||
|
cmp w3, #0
|
||||||
|
ldrsh w3, [x1] // block[i*16]
|
||||||
|
csel x20, x13, x14, eq
|
||||||
|
ccmp w3, #0, #0, eq
|
||||||
|
b.eq 2f
|
||||||
|
blr x20
|
||||||
|
2: add x10, x10, #1
|
||||||
|
cmp x10, #4
|
||||||
|
csel x10, x11, x10, eq // mov x10, #16
|
||||||
|
csel x6, x15, x6, eq
|
||||||
|
cmp x10, #20
|
||||||
|
b.lt 1b
|
||||||
|
ldp x19, x20, [sp], #0x40
|
||||||
|
ret x12
|
||||||
|
endfunc
|
||||||
|
|
||||||
|
.macro idct8x8_cols pass
|
||||||
|
.if \pass == 0
|
||||||
|
va .req v18
|
||||||
|
vb .req v30
|
||||||
|
sshr v18.8h, v26.8h, #1
|
||||||
|
add v16.8h, v24.8h, v28.8h
|
||||||
|
ld1 {v30.8h, v31.8h}, [x1]
|
||||||
|
st1 {v19.8h}, [x1], #16
|
||||||
|
st1 {v19.8h}, [x1], #16
|
||||||
|
sub v17.8h, v24.8h, v28.8h
|
||||||
|
sshr v19.8h, v30.8h, #1
|
||||||
|
sub v18.8h, v18.8h, v30.8h
|
||||||
|
add v19.8h, v19.8h, v26.8h
|
||||||
|
.else
|
||||||
|
va .req v30
|
||||||
|
vb .req v18
|
||||||
|
sshr v30.8h, v26.8h, #1
|
||||||
|
sshr v19.8h, v18.8h, #1
|
||||||
|
add v16.8h, v24.8h, v28.8h
|
||||||
|
sub v17.8h, v24.8h, v28.8h
|
||||||
|
sub v30.8h, v30.8h, v18.8h
|
||||||
|
add v19.8h, v19.8h, v26.8h
|
||||||
|
.endif
|
||||||
|
add v26.8h, v17.8h, va.8h
|
||||||
|
sub v28.8h, v17.8h, va.8h
|
||||||
|
add v24.8h, v16.8h, v19.8h
|
||||||
|
sub vb.8h, v16.8h, v19.8h
|
||||||
|
sub v16.8h, v29.8h, v27.8h
|
||||||
|
add v17.8h, v31.8h, v25.8h
|
||||||
|
sub va.8h, v31.8h, v25.8h
|
||||||
|
add v19.8h, v29.8h, v27.8h
|
||||||
|
sub v16.8h, v16.8h, v31.8h
|
||||||
|
sub v17.8h, v17.8h, v27.8h
|
||||||
|
add va.8h, va.8h, v29.8h
|
||||||
|
add v19.8h, v19.8h, v25.8h
|
||||||
|
sshr v25.8h, v25.8h, #1
|
||||||
|
sshr v27.8h, v27.8h, #1
|
||||||
|
sshr v29.8h, v29.8h, #1
|
||||||
|
sshr v31.8h, v31.8h, #1
|
||||||
|
sub v16.8h, v16.8h, v31.8h
|
||||||
|
sub v17.8h, v17.8h, v27.8h
|
||||||
|
add va.8h, va.8h, v29.8h
|
||||||
|
add v19.8h, v19.8h, v25.8h
|
||||||
|
sshr v25.8h, v16.8h, #2
|
||||||
|
sshr v27.8h, v17.8h, #2
|
||||||
|
sshr v29.8h, va.8h, #2
|
||||||
|
sshr v31.8h, v19.8h, #2
|
||||||
|
sub v19.8h, v19.8h, v25.8h
|
||||||
|
sub va.8h, v27.8h, va.8h
|
||||||
|
add v17.8h, v17.8h, v29.8h
|
||||||
|
add v16.8h, v16.8h, v31.8h
|
||||||
|
.if \pass == 0
|
||||||
|
sub v31.8h, v24.8h, v19.8h
|
||||||
|
add v24.8h, v24.8h, v19.8h
|
||||||
|
add v25.8h, v26.8h, v18.8h
|
||||||
|
sub v18.8h, v26.8h, v18.8h
|
||||||
|
add v26.8h, v28.8h, v17.8h
|
||||||
|
add v27.8h, v30.8h, v16.8h
|
||||||
|
sub v29.8h, v28.8h, v17.8h
|
||||||
|
sub v28.8h, v30.8h, v16.8h
|
||||||
|
.else
|
||||||
|
sub v31.8h, v24.8h, v19.8h
|
||||||
|
add v24.8h, v24.8h, v19.8h
|
||||||
|
add v25.8h, v26.8h, v30.8h
|
||||||
|
sub v30.8h, v26.8h, v30.8h
|
||||||
|
add v26.8h, v28.8h, v17.8h
|
||||||
|
sub v29.8h, v28.8h, v17.8h
|
||||||
|
add v27.8h, v18.8h, v16.8h
|
||||||
|
sub v28.8h, v18.8h, v16.8h
|
||||||
|
.endif
|
||||||
|
.unreq va
|
||||||
|
.unreq vb
|
||||||
|
.endm
|
||||||
|
|
||||||
|
function ff_h264_idct8_add_neon, export=1
|
||||||
|
.L_ff_h264_idct8_add_neon:
|
||||||
|
AARCH64_VALID_CALL_TARGET
|
||||||
|
movi v19.8h, #0
|
||||||
|
sxtw x2, w2
|
||||||
|
ld1 {v24.8h, v25.8h}, [x1]
|
||||||
|
st1 {v19.8h}, [x1], #16
|
||||||
|
st1 {v19.8h}, [x1], #16
|
||||||
|
ld1 {v26.8h, v27.8h}, [x1]
|
||||||
|
st1 {v19.8h}, [x1], #16
|
||||||
|
st1 {v19.8h}, [x1], #16
|
||||||
|
ld1 {v28.8h, v29.8h}, [x1]
|
||||||
|
st1 {v19.8h}, [x1], #16
|
||||||
|
st1 {v19.8h}, [x1], #16
|
||||||
|
|
||||||
|
idct8x8_cols 0
|
||||||
|
transpose_8x8H v24, v25, v26, v27, v28, v29, v18, v31, v6, v7
|
||||||
|
idct8x8_cols 1
|
||||||
|
|
||||||
|
mov x3, x0
|
||||||
|
srshr v24.8h, v24.8h, #6
|
||||||
|
ld1 {v0.8b}, [x0], x2
|
||||||
|
srshr v25.8h, v25.8h, #6
|
||||||
|
ld1 {v1.8b}, [x0], x2
|
||||||
|
srshr v26.8h, v26.8h, #6
|
||||||
|
ld1 {v2.8b}, [x0], x2
|
||||||
|
srshr v27.8h, v27.8h, #6
|
||||||
|
ld1 {v3.8b}, [x0], x2
|
||||||
|
srshr v28.8h, v28.8h, #6
|
||||||
|
ld1 {v4.8b}, [x0], x2
|
||||||
|
srshr v29.8h, v29.8h, #6
|
||||||
|
ld1 {v5.8b}, [x0], x2
|
||||||
|
srshr v30.8h, v30.8h, #6
|
||||||
|
ld1 {v6.8b}, [x0], x2
|
||||||
|
srshr v31.8h, v31.8h, #6
|
||||||
|
ld1 {v7.8b}, [x0], x2
|
||||||
|
uaddw v24.8h, v24.8h, v0.8b
|
||||||
|
uaddw v25.8h, v25.8h, v1.8b
|
||||||
|
uaddw v26.8h, v26.8h, v2.8b
|
||||||
|
sqxtun v0.8b, v24.8h
|
||||||
|
uaddw v27.8h, v27.8h, v3.8b
|
||||||
|
sqxtun v1.8b, v25.8h
|
||||||
|
uaddw v28.8h, v28.8h, v4.8b
|
||||||
|
sqxtun v2.8b, v26.8h
|
||||||
|
st1 {v0.8b}, [x3], x2
|
||||||
|
uaddw v29.8h, v29.8h, v5.8b
|
||||||
|
sqxtun v3.8b, v27.8h
|
||||||
|
st1 {v1.8b}, [x3], x2
|
||||||
|
uaddw v30.8h, v30.8h, v6.8b
|
||||||
|
sqxtun v4.8b, v28.8h
|
||||||
|
st1 {v2.8b}, [x3], x2
|
||||||
|
uaddw v31.8h, v31.8h, v7.8b
|
||||||
|
sqxtun v5.8b, v29.8h
|
||||||
|
st1 {v3.8b}, [x3], x2
|
||||||
|
sqxtun v6.8b, v30.8h
|
||||||
|
sqxtun v7.8b, v31.8h
|
||||||
|
st1 {v4.8b}, [x3], x2
|
||||||
|
st1 {v5.8b}, [x3], x2
|
||||||
|
st1 {v6.8b}, [x3], x2
|
||||||
|
st1 {v7.8b}, [x3], x2
|
||||||
|
|
||||||
|
sub x1, x1, #128
|
||||||
|
ret
|
||||||
|
endfunc
|
||||||
|
|
||||||
|
function ff_h264_idct8_dc_add_neon, export=1
|
||||||
|
.L_ff_h264_idct8_dc_add_neon:
|
||||||
|
AARCH64_VALID_CALL_TARGET
|
||||||
|
mov w3, #0
|
||||||
|
sxtw x2, w2
|
||||||
|
ld1r {v31.8h}, [x1]
|
||||||
|
strh w3, [x1]
|
||||||
|
ld1 {v0.8b}, [x0], x2
|
||||||
|
srshr v31.8h, v31.8h, #6
|
||||||
|
ld1 {v1.8b}, [x0], x2
|
||||||
|
ld1 {v2.8b}, [x0], x2
|
||||||
|
uaddw v24.8h, v31.8h, v0.8b
|
||||||
|
ld1 {v3.8b}, [x0], x2
|
||||||
|
uaddw v25.8h, v31.8h, v1.8b
|
||||||
|
ld1 {v4.8b}, [x0], x2
|
||||||
|
uaddw v26.8h, v31.8h, v2.8b
|
||||||
|
ld1 {v5.8b}, [x0], x2
|
||||||
|
uaddw v27.8h, v31.8h, v3.8b
|
||||||
|
ld1 {v6.8b}, [x0], x2
|
||||||
|
uaddw v28.8h, v31.8h, v4.8b
|
||||||
|
ld1 {v7.8b}, [x0], x2
|
||||||
|
uaddw v29.8h, v31.8h, v5.8b
|
||||||
|
uaddw v30.8h, v31.8h, v6.8b
|
||||||
|
uaddw v31.8h, v31.8h, v7.8b
|
||||||
|
sqxtun v0.8b, v24.8h
|
||||||
|
sqxtun v1.8b, v25.8h
|
||||||
|
sqxtun v2.8b, v26.8h
|
||||||
|
sqxtun v3.8b, v27.8h
|
||||||
|
sub x0, x0, x2, lsl #3
|
||||||
|
st1 {v0.8b}, [x0], x2
|
||||||
|
sqxtun v4.8b, v28.8h
|
||||||
|
st1 {v1.8b}, [x0], x2
|
||||||
|
sqxtun v5.8b, v29.8h
|
||||||
|
st1 {v2.8b}, [x0], x2
|
||||||
|
sqxtun v6.8b, v30.8h
|
||||||
|
st1 {v3.8b}, [x0], x2
|
||||||
|
sqxtun v7.8b, v31.8h
|
||||||
|
st1 {v4.8b}, [x0], x2
|
||||||
|
st1 {v5.8b}, [x0], x2
|
||||||
|
st1 {v6.8b}, [x0], x2
|
||||||
|
st1 {v7.8b}, [x0], x2
|
||||||
|
ret
|
||||||
|
endfunc
|
||||||
|
|
||||||
|
function ff_h264_idct8_add4_neon, export=1
|
||||||
|
mov x12, x30
|
||||||
|
mov x6, x0
|
||||||
|
mov x5, x1
|
||||||
|
mov x1, x2
|
||||||
|
mov w2, w3
|
||||||
|
movrel x7, scan8
|
||||||
|
mov w10, #16
|
||||||
|
movrel x13, .L_ff_h264_idct8_dc_add_neon
|
||||||
|
movrel x14, .L_ff_h264_idct8_add_neon
|
||||||
|
1: ldrb w9, [x7], #4
|
||||||
|
ldrsw x0, [x5], #16
|
||||||
|
ldrb w9, [x4, w9, uxtw]
|
||||||
|
subs w9, w9, #1
|
||||||
|
b.lt 2f
|
||||||
|
ldrsh w11, [x1]
|
||||||
|
add x0, x6, x0
|
||||||
|
ccmp w11, #0, #4, eq
|
||||||
|
csel x15, x13, x14, ne
|
||||||
|
blr x15
|
||||||
|
2: subs w10, w10, #4
|
||||||
|
add x1, x1, #128
|
||||||
|
b.ne 1b
|
||||||
|
ret x12
|
||||||
|
endfunc
|
||||||
|
|
||||||
|
const scan8
|
||||||
|
.byte 4+ 1*8, 5+ 1*8, 4+ 2*8, 5+ 2*8
|
||||||
|
.byte 6+ 1*8, 7+ 1*8, 6+ 2*8, 7+ 2*8
|
||||||
|
.byte 4+ 3*8, 5+ 3*8, 4+ 4*8, 5+ 4*8
|
||||||
|
.byte 6+ 3*8, 7+ 3*8, 6+ 4*8, 7+ 4*8
|
||||||
|
.byte 4+ 6*8, 5+ 6*8, 4+ 7*8, 5+ 7*8
|
||||||
|
.byte 6+ 6*8, 7+ 6*8, 6+ 7*8, 7+ 7*8
|
||||||
|
.byte 4+ 8*8, 5+ 8*8, 4+ 9*8, 5+ 9*8
|
||||||
|
.byte 6+ 8*8, 7+ 8*8, 6+ 9*8, 7+ 9*8
|
||||||
|
.byte 4+11*8, 5+11*8, 4+12*8, 5+12*8
|
||||||
|
.byte 6+11*8, 7+11*8, 6+12*8, 7+12*8
|
||||||
|
.byte 4+13*8, 5+13*8, 4+14*8, 5+14*8
|
||||||
|
.byte 6+13*8, 7+13*8, 6+14*8, 7+14*8
|
||||||
|
endconst
|
||||||
@@ -0,0 +1,210 @@
|
|||||||
|
/*
|
||||||
|
* Cycle 6 Phase 3 — NEON M3 baseline for H.264 IDCT 4x4 + add.
|
||||||
|
*
|
||||||
|
* Calls FFmpeg `ff_h264_idct_add_neon`. Reports M1 bit-exact vs
|
||||||
|
* the standalone C reference, plus M3 throughput.
|
||||||
|
*
|
||||||
|
* License: BSD-2-Clause; links FFmpeg LGPL-2.1+ snapshot.
|
||||||
|
*/
|
||||||
|
#define _POSIX_C_SOURCE 200809L
|
||||||
|
#include <stdio.h>
|
||||||
|
#include <stdlib.h>
|
||||||
|
#include <stdint.h>
|
||||||
|
#include <stddef.h>
|
||||||
|
#include <string.h>
|
||||||
|
#include <time.h>
|
||||||
|
#include <getopt.h>
|
||||||
|
|
||||||
|
extern void daedalus_h264_idct_add_ref(uint8_t *dst, int16_t *block, ptrdiff_t stride);
|
||||||
|
extern void ff_h264_idct_add_neon(uint8_t *dst, int16_t *block, ptrdiff_t stride);
|
||||||
|
|
||||||
|
#define DST_STRIDE 16 /* arbitrary stride for the test surface */
|
||||||
|
#define DST_ROWS 4
|
||||||
|
#define DST_BYTES (DST_ROWS * DST_STRIDE)
|
||||||
|
|
||||||
|
static uint64_t xs_state;
|
||||||
|
static inline uint64_t xs(void) {
|
||||||
|
uint64_t x = xs_state;
|
||||||
|
x ^= x << 13; x ^= x >> 7; x ^= x << 17;
|
||||||
|
return xs_state = x;
|
||||||
|
}
|
||||||
|
|
||||||
|
static void gen_block(int16_t b[16])
|
||||||
|
{
|
||||||
|
/* Realistic H.264 residual: small coefficients, mostly zero,
|
||||||
|
* a few non-zero in low-frequency positions. */
|
||||||
|
memset(b, 0, 16 * sizeof(int16_t));
|
||||||
|
int n_nonzero = 1 + (int)(xs() % 8);
|
||||||
|
for (int i = 0; i < n_nonzero; i++) {
|
||||||
|
int pos = (int)(xs() % 16);
|
||||||
|
int16_t v = (int16_t)((int)(xs() % 1024) - 512);
|
||||||
|
b[pos] = v;
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
static double now_seconds(void) {
|
||||||
|
struct timespec ts;
|
||||||
|
clock_gettime(CLOCK_MONOTONIC_RAW, &ts);
|
||||||
|
return ts.tv_sec + ts.tv_nsec * 1e-9;
|
||||||
|
}
|
||||||
|
|
||||||
|
static int correctness_check(uint64_t seed, int n)
|
||||||
|
{
|
||||||
|
xs_state = seed ? seed : 0xc0de264cULL;
|
||||||
|
int mismatches = 0;
|
||||||
|
int prints = 0;
|
||||||
|
|
||||||
|
int16_t block_a[16], block_b[16], block_saved[16];
|
||||||
|
uint8_t dst_a[DST_BYTES], dst_b[DST_BYTES], dst_initial[DST_BYTES];
|
||||||
|
|
||||||
|
for (int i = 0; i < n; i++) {
|
||||||
|
gen_block(block_a);
|
||||||
|
memcpy(block_b, block_a, sizeof(block_a));
|
||||||
|
memcpy(block_saved, block_a, sizeof(block_a));
|
||||||
|
|
||||||
|
/* Random initial dst (4×4 region at offset 0, row stride DST_STRIDE). */
|
||||||
|
for (int r = 0; r < 4; r++)
|
||||||
|
for (int c = 0; c < 4; c++)
|
||||||
|
dst_a[r * DST_STRIDE + c] = dst_b[r * DST_STRIDE + c] = (uint8_t)(xs() & 0xff);
|
||||||
|
memcpy(dst_initial, dst_a, DST_BYTES);
|
||||||
|
|
||||||
|
daedalus_h264_idct_add_ref(dst_a, block_a, DST_STRIDE);
|
||||||
|
ff_h264_idct_add_neon(dst_b, block_b, DST_STRIDE);
|
||||||
|
|
||||||
|
int diff = 0;
|
||||||
|
for (int r = 0; r < 4; r++)
|
||||||
|
for (int c = 0; c < 4; c++)
|
||||||
|
if (dst_a[r*DST_STRIDE + c] != dst_b[r*DST_STRIDE + c]) diff++;
|
||||||
|
if (diff) {
|
||||||
|
if (prints < 3) {
|
||||||
|
fprintf(stderr, "MISMATCH block %d (%d/16 pix diff):\n", i, diff);
|
||||||
|
fprintf(stderr, " input block (row-major):");
|
||||||
|
for (int r = 0; r < 4; r++) {
|
||||||
|
fprintf(stderr, "\n r%d ", r);
|
||||||
|
for (int c = 0; c < 4; c++) fprintf(stderr, "%6d ", block_saved[r*4 + c]);
|
||||||
|
}
|
||||||
|
fprintf(stderr, "\n initial dst:");
|
||||||
|
for (int r = 0; r < 4; r++) {
|
||||||
|
fprintf(stderr, "\n r%d ", r);
|
||||||
|
for (int c = 0; c < 4; c++) fprintf(stderr, "%3u ", dst_initial[r*DST_STRIDE + c]);
|
||||||
|
}
|
||||||
|
fprintf(stderr, "\n");
|
||||||
|
fprintf(stderr, " ref:");
|
||||||
|
for (int r = 0; r < 4; r++) {
|
||||||
|
fprintf(stderr, "\n r%d ", r);
|
||||||
|
for (int c = 0; c < 4; c++) fprintf(stderr, "%3u ", dst_a[r*DST_STRIDE+c]);
|
||||||
|
}
|
||||||
|
fprintf(stderr, "\n neon:");
|
||||||
|
for (int r = 0; r < 4; r++) {
|
||||||
|
fprintf(stderr, "\n r%d ", r);
|
||||||
|
for (int c = 0; c < 4; c++) fprintf(stderr, "%3u ", dst_b[r*DST_STRIDE+c]);
|
||||||
|
}
|
||||||
|
fprintf(stderr, "\n");
|
||||||
|
prints++;
|
||||||
|
}
|
||||||
|
mismatches++;
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
printf("M1₆ correctness: %d / %d blocks bit-exact (%.4f%%)\n",
|
||||||
|
n - mismatches, n, 100.0 * (n - mismatches) / n);
|
||||||
|
return mismatches;
|
||||||
|
}
|
||||||
|
|
||||||
|
static void throughput_neon(uint64_t seed, int n_blocks, double duration_s)
|
||||||
|
{
|
||||||
|
xs_state = seed ? seed : 0xc0de264cULL;
|
||||||
|
int16_t *master_blocks = malloc((size_t) n_blocks * 16 * sizeof(int16_t));
|
||||||
|
int16_t *work_blocks = malloc((size_t) n_blocks * 16 * sizeof(int16_t));
|
||||||
|
uint8_t *master_dst = malloc((size_t) n_blocks * 16);
|
||||||
|
uint8_t *work_dst = malloc((size_t) n_blocks * 16);
|
||||||
|
if (!master_blocks || !work_blocks || !master_dst || !work_dst) {
|
||||||
|
fprintf(stderr, "alloc fail\n"); exit(1);
|
||||||
|
}
|
||||||
|
for (int i = 0; i < n_blocks; i++) {
|
||||||
|
gen_block(master_blocks + i * 16);
|
||||||
|
for (int j = 0; j < 16; j++) master_dst[i * 16 + j] = (uint8_t)(xs() & 0xff);
|
||||||
|
}
|
||||||
|
|
||||||
|
/* Warm-up. */
|
||||||
|
memcpy(work_blocks, master_blocks, (size_t) n_blocks * 16 * sizeof(int16_t));
|
||||||
|
memcpy(work_dst, master_dst, (size_t) n_blocks * 16);
|
||||||
|
for (int i = 0; i < n_blocks; i++)
|
||||||
|
ff_h264_idct_add_neon(work_dst + i * 16, work_blocks + i * 16, 4);
|
||||||
|
|
||||||
|
double t0 = now_seconds();
|
||||||
|
double t_end = t0 + duration_s;
|
||||||
|
uint64_t done = 0;
|
||||||
|
while (now_seconds() < t_end) {
|
||||||
|
memcpy(work_blocks, master_blocks, (size_t) n_blocks * 16 * sizeof(int16_t));
|
||||||
|
memcpy(work_dst, master_dst, (size_t) n_blocks * 16);
|
||||||
|
for (int i = 0; i < n_blocks; i++)
|
||||||
|
ff_h264_idct_add_neon(work_dst + i * 16, work_blocks + i * 16, 4);
|
||||||
|
done += n_blocks;
|
||||||
|
}
|
||||||
|
double elapsed = now_seconds() - t0;
|
||||||
|
|
||||||
|
/* Subtract setup cost. */
|
||||||
|
int iters = (int)(done / n_blocks);
|
||||||
|
double s0 = now_seconds();
|
||||||
|
for (int i = 0; i < iters; i++) {
|
||||||
|
memcpy(work_blocks, master_blocks, (size_t) n_blocks * 16 * sizeof(int16_t));
|
||||||
|
memcpy(work_dst, master_dst, (size_t) n_blocks * 16);
|
||||||
|
}
|
||||||
|
double s1 = now_seconds();
|
||||||
|
|
||||||
|
double kernel_seconds = elapsed - (s1 - s0);
|
||||||
|
double mbps = done / kernel_seconds / 1e6;
|
||||||
|
|
||||||
|
printf("M3₆ NEON throughput:\n");
|
||||||
|
printf(" blocks/batch: %d\n", n_blocks);
|
||||||
|
printf(" batches done: %d\n", iters);
|
||||||
|
printf(" total blocks: %llu\n", (unsigned long long) done);
|
||||||
|
printf(" elapsed (kernel)=%.6f s\n", kernel_seconds);
|
||||||
|
printf(" throughput = %.3f Mblock/s\n", mbps);
|
||||||
|
printf(" per-block = %.1f ns\n", kernel_seconds / done * 1e9);
|
||||||
|
/* H.264 1080p 4×4 floor: ~5.85 Mblock/s worst-case, ~2 realistic. */
|
||||||
|
printf(" H.264 1080p30 worst-case floor: %.2fx margin (5.85 Mblock/s req'd)\n", mbps / 5.85);
|
||||||
|
printf(" H.264 1080p30 realistic floor: %.2fx margin (2.0 Mblock/s req'd)\n", mbps / 2.0);
|
||||||
|
|
||||||
|
free(master_blocks); free(work_blocks); free(master_dst); free(work_dst);
|
||||||
|
}
|
||||||
|
|
||||||
|
int main(int argc, char **argv)
|
||||||
|
{
|
||||||
|
int n_blocks = 65536;
|
||||||
|
double duration = 5.0;
|
||||||
|
uint64_t seed = 0;
|
||||||
|
int do_correctness = 1;
|
||||||
|
|
||||||
|
static struct option opts[] = {
|
||||||
|
{"blocks", required_argument, 0, 'b'},
|
||||||
|
{"duration", required_argument, 0, 'd'},
|
||||||
|
{"seed", required_argument, 0, 's'},
|
||||||
|
{"no-correctness", no_argument, 0, 'C'},
|
||||||
|
{0,0,0,0}
|
||||||
|
};
|
||||||
|
for (int c; (c = getopt_long(argc, argv, "b:d:s:C", opts, 0)) != -1;) {
|
||||||
|
switch (c) {
|
||||||
|
case 'b': n_blocks = atoi(optarg); break;
|
||||||
|
case 'd': duration = atof(optarg); break;
|
||||||
|
case 's': seed = strtoull(optarg, 0, 0); break;
|
||||||
|
case 'C': do_correctness = 0; break;
|
||||||
|
default: return 2;
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
if (do_correctness) {
|
||||||
|
printf("=== M1₆ bit-exact (10000 random 4x4 blocks) ===\n");
|
||||||
|
int mis = correctness_check(seed, 10000);
|
||||||
|
if (mis != 0) {
|
||||||
|
fprintf(stderr, "M1 gate FAILED — refusing to measure throughput.\n");
|
||||||
|
return 1;
|
||||||
|
}
|
||||||
|
printf("\n");
|
||||||
|
}
|
||||||
|
|
||||||
|
printf("=== M3₆ NEON throughput ===\n");
|
||||||
|
throughput_neon(seed, n_blocks, duration);
|
||||||
|
return 0;
|
||||||
|
}
|
||||||
@@ -0,0 +1,81 @@
|
|||||||
|
/*
|
||||||
|
* Standalone bit-exact C reference for H.264 4x4 inverse integer
|
||||||
|
* transform + add. Algorithm per H.264 spec §8.5.12.1 (4x4 IT for
|
||||||
|
* blocks coded with TransformBypassFlag = 0).
|
||||||
|
*
|
||||||
|
* Mirrors FFmpeg `ff_h264_idct_add_neon` in
|
||||||
|
* external/ffmpeg-snapshot/libavcodec/aarch64/h264idct_neon.S
|
||||||
|
* (n7.1.3 pin). Destructively zeroes `block` to match upstream
|
||||||
|
* convention (post-call block must be zero for the H.264 conformance
|
||||||
|
* residual loop).
|
||||||
|
*
|
||||||
|
* Signature mirrors the NEON convention:
|
||||||
|
* void(uint8_t *dst, int16_t *block, ptrdiff_t stride);
|
||||||
|
*
|
||||||
|
* License: LGPL-2.1-or-later (matches FFmpeg upstream the algorithm
|
||||||
|
* was transcribed from). Spec is H.264 ITU-T Rec H.264 / ISO/IEC
|
||||||
|
* 14496-10.
|
||||||
|
*/
|
||||||
|
#include <stdint.h>
|
||||||
|
#include <stddef.h>
|
||||||
|
#include <string.h>
|
||||||
|
|
||||||
|
static inline int clip_u8(int v) { return v < 0 ? 0 : v > 255 ? 255 : v; }
|
||||||
|
|
||||||
|
/* 1D butterfly per H.264 spec §8.5.12.1.
|
||||||
|
* d[0..3] are input, e/f/g/h are intermediate, h_c[0..3] are output. */
|
||||||
|
static inline void h264_idct4_butterfly(const int d[4], int h_c[4])
|
||||||
|
{
|
||||||
|
int e = d[0] + d[2];
|
||||||
|
int f = d[0] - d[2];
|
||||||
|
int g = (d[1] >> 1) - d[3];
|
||||||
|
int h = d[1] + (d[3] >> 1);
|
||||||
|
h_c[0] = e + h;
|
||||||
|
h_c[1] = f + g;
|
||||||
|
h_c[2] = f - g;
|
||||||
|
h_c[3] = e - h;
|
||||||
|
}
|
||||||
|
|
||||||
|
void daedalus_h264_idct_add_ref(uint8_t *dst, int16_t *block, ptrdiff_t stride)
|
||||||
|
{
|
||||||
|
/* H.264/FFmpeg block layout is COLUMN-MAJOR:
|
||||||
|
* block[c*4 + r] = coefficient at row r, column c.
|
||||||
|
* NEON ld1.4h{4 regs} interleaves consecutive memory across
|
||||||
|
* registers; with column-major source this gives v_r[c] = block at
|
||||||
|
* (row=r, col=c). The first lane-wise butterfly (v0+v2 etc.) then
|
||||||
|
* combines column 0 and column 2 within each row → row pass.
|
||||||
|
* JM and FFmpeg C reference both do row-first then column-pass.
|
||||||
|
*
|
||||||
|
* dst is row-major (dst[r*stride + c]).
|
||||||
|
*/
|
||||||
|
int tmp[4][4];
|
||||||
|
|
||||||
|
/* Row pass FIRST. Read block as column-major (block[c*4 + r]). */
|
||||||
|
for (int r = 0; r < 4; r++) {
|
||||||
|
int d[4] = { block[0*4 + r], block[1*4 + r],
|
||||||
|
block[2*4 + r], block[3*4 + r] };
|
||||||
|
int h_c[4];
|
||||||
|
h264_idct4_butterfly(d, h_c);
|
||||||
|
for (int c = 0; c < 4; c++) tmp[r][c] = h_c[c];
|
||||||
|
}
|
||||||
|
|
||||||
|
/* Column pass NEXT (on row-major tmp). */
|
||||||
|
int col_out[4][4];
|
||||||
|
for (int c = 0; c < 4; c++) {
|
||||||
|
int d[4] = { tmp[0][c], tmp[1][c], tmp[2][c], tmp[3][c] };
|
||||||
|
int h_c[4];
|
||||||
|
h264_idct4_butterfly(d, h_c);
|
||||||
|
for (int r = 0; r < 4; r++) col_out[r][c] = h_c[r];
|
||||||
|
}
|
||||||
|
|
||||||
|
/* Round (+32) >> 6, add to dst, clip to u8. */
|
||||||
|
for (int r = 0; r < 4; r++) {
|
||||||
|
for (int c = 0; c < 4; c++) {
|
||||||
|
int rounded = (col_out[r][c] + 32) >> 6;
|
||||||
|
dst[r * stride + c] = (uint8_t) clip_u8(dst[r * stride + c] + rounded);
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
/* FFmpeg convention: zero the block after the transform. */
|
||||||
|
memset(block, 0, 16 * sizeof(int16_t));
|
||||||
|
}
|
||||||
Reference in New Issue
Block a user