Cycle 6 (H.264) opened — IDCT 4x4 Phase 1+3, M3 = 175 Mblock/s

H.264 scope added 2026-05-18 per user direction. Pi 5's VideoCore VII has no hardware H.264 decoder block (only HEVC), so a QPU-accelerated H.264 path fills the most impactful codec gap.

Cycle 6 = first H.264 kernel (4x4 IDCT + add, smallest H.264 transform, simplest first cycle).

Phase 1: goal doc + 1080p30 floor analysis (5.85 Mblock/s worst-case, 2.0 Mblock/s realistic since most MBs use 8x8 or P-skip).

Phase 3: NEON M3 baseline captured. ff_h264_idct_add_neon on hertz delivers 175 Mblock/s (5.7 ns per block) = 30x worst-case floor margin. H.264 IDCT 4x4 is dramatically lighter than VP9 IDCT 8x8 (21x faster per block).

Phase 3 closure also caught the key Phase 9 lesson: H.264/FFmpeg blocks are COLUMN-MAJOR (block[c*4 + r] = (row=r, col=c)). NEON ld1 with 4 registers loads the 16 coefficients in memory order (it does not de-interleave the way ld4 would), so each register holds one column, and the FFmpeg C ref indexing makes this convention explicit. Initial C ref assumed row-major, M1 was 5% bit-exact; after fix, M1 = 100%. Convention encoded for all subsequent H.264 cycles (cycle 7+).

- external/ffmpeg-snapshot/libavcodec/aarch64/h264idct_neon.S (vendored verbatim from FFmpeg n7.1.3, 415 lines)
- external/ffmpeg-snapshot/PROVENANCE.md: updated
- tests/h264_idct4_ref.c: column-major C ref
- tests/bench_neon_h264idct4.c: M1 + M3 bench
- CMakeLists.txt: cycle 6 NEON bench wiring
- docs/k6_h264idct4_phase1.md, phase3.md

Phase 4 next: QPU shader for cycle 6. Predicted R6 = 0.01 (deep RED: kernel too small relative to QPU dispatch overhead) but worth building for cycle-completeness + the opportunistic-helper hypothesis (cycle 6 may stay CPU per recipe).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 14:14:43 +00:00
parent eb5cfb34c4
commit f92dc40f43
7 changed files with 974 additions and 0 deletions
@@ -68,6 +68,14 @@ set(FFASM_SOURCES
    ${FFSNAP}/libavcodec/aarch64/vp9itxfm_neon.S
 )

+# Cycle 6 — H.264 IDCT 4x4 + 8x8 NEON (vendored 2026-05-18).
+set(FFASM_H264IDCT_SOURCES
+    ${FFSNAP}/libavcodec/aarch64/h264idct_neon.S
+)
+set_source_files_properties(${FFASM_H264IDCT_SOURCES} PROPERTIES
+    COMPILE_OPTIONS "${FFASM_FLAGS}"
+    LANGUAGE ASM)
+
 # Cycle 2 — VP9 loop filter NEON source (vendored 2026-05-18).
 set(FFASM_LPF_SOURCES
    ${FFSNAP}/libavcodec/aarch64/vp9lpf_neon.S
@@ -96,6 +104,14 @@ set_source_files_properties(${FFASM_SOURCES} PROPERTIES

 # ---- NEON baseline microbenches --------------------------------------------

+# Cycle 6 — H.264 IDCT 4x4 NEON M3 baseline bench.
+add_executable(bench_neon_h264idct4
+    tests/bench_neon_h264idct4.c
+    tests/h264_idct4_ref.c
+    ${FFASM_H264IDCT_SOURCES}
+)
+target_compile_options(bench_neon_h264idct4 PRIVATE -O3 -march=armv8-a+simd)
+
 add_executable(bench_neon_idct
    tests/bench_neon_idct.c
    tests/vp9_idct8_ref.c
@@ -0,0 +1,119 @@
+---
+cycle: 6
+phase: 1
+status: open
+date_opened: 2026-05-18
+codec: H.264
+kernel: IDCT 4x4 + add (intra-block residual)
+parent: project_h264_scope_added.md (memory)
+---
+
+# Cycle 6, Phase 1 — H.264 IDCT 4×4 + add
+
+First H.264 kernel. Per `project_h264_scope_added`, the user
+added H.264 to daedalus-fourier scope 2026-05-18 because Pi 5
+has no hardware H.264 decoder despite H.264 being the most
+common web codec.
+
+## Why IDCT 4×4 first
+
+- **Smallest H.264 transform.** 16 coefficients per block, 4×4
+  output pixels. Simpler than VP9 IDCT 8×8 (cycle 1, 64 coefs).
+- **Most-used.** H.264 macroblocks default to 4×4 intra
+  prediction + residual; 8×8 is High-profile only. 4×4 hits
+  most real-world H.264 streams.
+- **Predicted GREEN.** Per the cycle 1-5 bandwidth-bound vs
+  compute-bound classification: 4×4 IDCT is bandwidth-bound
+  (16 reads, 16 writes, ~20 ALU ops/output). Should map well
+  to V3D 7.1 compute.
+- **Clean reference.** FFmpeg's `ff_h264_idct_add_neon` is
+  standalone (no eob parameter, no complex DC dispatch). Single
+  call computes 1 block of IDCT + add.
+
+## Kernel contract
+
+Per H.264 spec §8.5.12, the inverse transform is an
+integer-arithmetic transform (no rounding-by-cosine like VP9's
+Q14 trig math). Each 4×4 block:
+
+1. Inverse row transform (4 row passes, each one 1D IDCT-like
+   integer butterfly).
+2. Inverse column transform (4 column passes, same butterfly).
+3. Round, add to the existing `dst` pixels, and clamp: `dst[r,c] = clamp(dst[r,c] + ((idct[r,c] + 32) >> 6), 0, 255)`.
+
+Spec coefficients (Hadamard-like with 1/2 scaling):
+```
+  [1   1    1   1/2]
+  [1  1/2  -1  -1  ]
+  [1 -1/2  -1   1  ]
+  [1  -1    1  -1/2]
+```
+Integer form: each 1/2 factor is an arithmetic right shift by 1
+inside the butterfly (no multiplies), and the overall scaling is
+removed by the `(x + 32) >> 6` in the round step.
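The three steps above can be sketched in C as follows. This is an illustrative sketch, not the project's `tests/h264_idct4_ref.c`: the names `sketch_h264_idct4_add` and `clip_u8` are hypothetical, it assumes FFmpeg's column-major storage (`block[c*4 + r]`, as recorded in this commit's Phase 3 notes), and it assumes `>>` on a negative `int` is an arithmetic shift, which C leaves implementation-defined. It uses 32-bit intermediates, so it is not claimed to match the 16-bit NEON code for out-of-range coefficients.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Clamp an int to the 8-bit pixel range. */
uint8_t clip_u8(int v) { return (uint8_t)(v < 0 ? 0 : v > 255 ? 255 : v); }

/* Hypothetical name. Assumes block[c*4 + r] holds (row r, column c). */
void sketch_h264_idct4_add(uint8_t *dst, int16_t *block, ptrdiff_t stride)
{
    int t[16];

    /* Pass 1 (row transform): for row i, combine its four coefficients,
     * which sit at block[i + 4*c] for c = 0..3. */
    for (int i = 0; i < 4; i++) {
        int z0 = block[i + 0] + block[i + 8];
        int z1 = block[i + 0] - block[i + 8];
        int z2 = (block[i + 4] >> 1) - block[i + 12];
        int z3 = block[i + 4] + (block[i + 12] >> 1);
        t[i + 0]  = z0 + z3;
        t[i + 4]  = z1 + z2;
        t[i + 8]  = z1 - z2;
        t[i + 12] = z0 - z3;
    }

    /* Pass 2 (column transform): for column c, combine t[4*c + 0..3],
     * then round, add to dst and clamp. */
    for (int c = 0; c < 4; c++) {
        int z0 = t[4 * c + 0] + t[4 * c + 2];
        int z1 = t[4 * c + 0] - t[4 * c + 2];
        int z2 = (t[4 * c + 1] >> 1) - t[4 * c + 3];
        int z3 = t[4 * c + 1] + (t[4 * c + 3] >> 1);
        int out[4] = { z0 + z3, z1 + z2, z1 - z2, z0 - z3 };
        for (int r = 0; r < 4; r++)
            dst[r * stride + c] =
                clip_u8(dst[r * stride + c] + ((out[r] + 32) >> 6));
    }

    /* Mirror the NEON kernel's side effect: coefficients are zeroed. */
    memset(block, 0, 16 * sizeof(int16_t));
}
```

A DC-only block with `block[0] = 64` should add 1 to every pixel, since `(64 + 32) >> 6 = 1`.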
+
+## NEON reference (M3 target)
+
+FFmpeg's `ff_h264_idct_add_neon`
+(external/ffmpeg-snapshot/libavcodec/aarch64/h264idct_neon.S
+line 25, 56 instructions of NEON asm). Signature:
+
+```
+void ff_h264_idct_add_neon(uint8_t *dst, int16_t *block, ptrdiff_t stride);
+```
+
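+- `dst`: 4×4 pixel block in 8-bit luma surface, `stride` bytes between rows
+  (the asm sign-extends `stride` from 32 bits with `sxtw`).
+- `block`: 16 int16 coefficients, column-major in FFmpeg
+  (`block[c*4 + r]` = row r, column c; see the Phase 3 notes).
+- destructively clears `block` to zero during the call (FFmpeg's
+  calling convention; the asm stores 32 bytes of zeros back to `block`).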
+
+## 30fps@1080p H.264 floor
+
+H.264 1080p uses 16×16 macroblocks with up to 16 luma 4×4 blocks per MB.
+Luma: (1920/16) × (1080/16) = 120 × 67.5 = 8100 MB/frame ×
+16 blocks/MB = 129 600 4×4 blocks/frame. Plus chroma (4:2:0): 4 + 4 = 8
+chroma 4×4 per MB × 8100 = 64 800 chroma blocks. Total: 194 400 ≈ 195k
+4×4 blocks/frame max (worst case; many real MBs use 8×8 or skip).
+The coded height is actually 1088 (68 MB rows, 8160 MBs), which raises
+these counts by under 1 % and does not change the conclusions.
+
+At 30fps: ~5.85 Mblock/s required for full-frame 4×4 worst case.
+A more realistic average (many MBs use 8×8, P-skip, etc.) is
+~2 Mblock/s.
+
+**30fps@1080p H.264 4×4 floor (realistic): 2 Mblock/s.**
+**30fps@1080p H.264 4×4 floor (worst case): 5.85 Mblock/s.**
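The worst-case floor arithmetic above can be reproduced with a couple of small helpers. This is a sketch under the same 4:2:0 assumption the text uses (4 + 4 chroma blocks per MB); `BLOCKS_PER_MB_420`, `mbs_per_frame_coded` and `worst_case_blocks_per_second` are hypothetical names, not project API.

```c
#include <stdint.h>

/* 4x4 blocks per macroblock in 4:2:0: 16 luma + 4 Cb + 4 Cr. */
enum { BLOCKS_PER_MB_420 = 16 + 4 + 4 };

/* Macroblocks per frame with each dimension rounded up to a multiple
 * of 16, which is how a 1080-line picture is coded (1920x1088). */
uint32_t mbs_per_frame_coded(uint32_t width, uint32_t height)
{
    return ((width + 15) / 16) * ((height + 15) / 16);
}

/* Upper bound on 4x4 blocks per second if every MB is fully 4x4-coded. */
uint64_t worst_case_blocks_per_second(uint32_t mbs, uint32_t fps)
{
    return (uint64_t)mbs * BLOCKS_PER_MB_420 * fps;
}
```

With the document's 8100 MB/frame this gives 5 832 000 blocks/s, which the text rounds up to 5.85 Mblock/s via the 195k figure; with the coded 8160 MB/frame it gives 5 875 200 blocks/s.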
+
+## R-band decision rules (carried from phase1.md)
+
+- R ≥ 1.0 → **GREEN** (QPU faster than NEON-1 in isolation).
+- 0.5 ≤ R < 1.0 → **YELLOW** (M4 decides).
+- 0.1 ≤ R < 0.5 → **ORANGE** (M4 may rescue).
+- R < 0.1 → **RED** (structural mismatch).
+
+Floor margin: ratio of M2 (or M3 if CPU-only) over the 5.85
+Mblock/s worst-case 30fps floor.
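The band rules and the floor-margin ratio above can be written down directly. A minimal sketch, assuming R is the ratio of QPU throughput to single-core NEON throughput (equivalently NEON ns per block over QPU ns per block, which is how the Phase 3 notes compute R₆); the enum and function names are hypothetical.

```c
/* Hypothetical names; thresholds are the ones listed in the rules above. */
typedef enum { BAND_RED, BAND_ORANGE, BAND_YELLOW, BAND_GREEN } r_band;

r_band classify_r(double r)
{
    if (r >= 1.0) return BAND_GREEN;   /* QPU faster than NEON-1 in isolation */
    if (r >= 0.5) return BAND_YELLOW;  /* M4 decides */
    if (r >= 0.1) return BAND_ORANGE;  /* M4 may rescue */
    return BAND_RED;                   /* structural mismatch */
}

/* Floor margin: measured throughput over the required floor, both in
 * the same unit (Mblock/s here). */
double floor_margin(double measured, double floor)
{
    return measured / floor;
}
```

For example, `floor_margin(175.0, 5.85)` is about 29.9, the worst-case margin reported in Phase 3.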
+
+## Acceptance for Phase 7
+
+- M1: 100.0000% bit-exact (QPU output vs C ref, 10000+ random
+  blocks). Same standard as cycles 1-5.
+- M2: captured, classified per R band.
+- M4: same-kernel mixed-bench measured (with Issue 003 caveats —
+  this is the worst-case framing).
+- 30fps@1080p H.264 4×4 floor margin reported.
+
+## Cycle 6 deliverables
+
+1. `external/ffmpeg-snapshot/libavcodec/aarch64/h264idct_neon.S`
+   (vendored 2026-05-18, this phase).
+2. `tests/h264_idct4_ref.c` — standalone C reference (LGPL-2.1+
+   transcribed from spec).
+3. `tests/bench_neon_h264idct4.c` — Phase 3 M3 bench.
+4. `src/v3d_h264idct4.comp` — Phase 6 QPU shader.
+5. `tests/bench_v3d_h264idct4.c` — Phase 6+7 M1+M2 bench (3-way
+   vs NEON + C ref).
+6. M4: extend `bench_concurrent_mixed.c` with K_H264_IDCT4.
+7. Phase 4-7 docs.
+
+## Next step (within this phase)
+
+Move to Phase 3 (NEON baseline M3) after writing the C
+reference. Phase 2 (libavcodec inventory) is implicit since we
+know the kernel from the FFmpeg vendor.
@@ -0,0 +1,132 @@
+---
+cycle: 6
+phase: 3
+status: closed 2026-05-18 — M1 PASS, M3₆ = 175 Mblock/s
+date_opened: 2026-05-18
+date_closed: 2026-05-18
+codec: H.264
+kernel: IDCT 4x4 + add
+parent: k6_h264idct4_phase1.md
+host: hertz
+---
+
+# Cycle 6, Phase 3 — H.264 IDCT 4×4 NEON baseline
+
+## M3₆ throughput
+
+```
+=== M3₆ NEON throughput ===
+  blocks/batch:    4096
+  batches done:    51 206
+  total blocks:    209 739 776
+  elapsed (kernel)=1.199 s
+  throughput      = 175.0 Mblock/s
+  per-block       = 5.7 ns
+  H.264 1080p30 worst-case floor: 29.91× margin (5.85 Mblock/s req'd)
+  H.264 1080p30 realistic floor:  87.50× margin (2.0 Mblock/s req'd)
+```
+
+**Per-block 5.7 ns — by far the lightest cycle so far** (cycle 2
+LPF wd=4 was 21 ns, cycle 1 IDCT 8x8 was 122 ns). 4×4 is a
+genuinely small kernel and FFmpeg's NEON is extremely tight
+(56 instructions per block).
+
+NEON 4-core scaling: not measured this phase; based on cycle 2/4
+patterns, expect ~3-4× scaling (bandwidth-bound territory) →
+~500-700 Mblock/s aggregate, i.e. roughly 85-120× the worst-case floor.
+
+## M1 bit-exact gate
+
+```
+=== M1₆ bit-exact (10000 random 4x4 blocks) ===
+M1₆ correctness: 10000 / 10000 blocks bit-exact (100.0000%)
+```
+
+## Key Phase 9 lesson — H.264 block layout is column-major
+
+The bench's initial C reference assumed row-major block storage
+(`block[r*4 + c]`), giving M1 = 4.98 % bit-exact (essentially all
+random). After failed attempts swapping the row/column pass order
+(both row-first and column-first gave the same 5 % rate), trace
+analysis revealed the actual mismatch:
+
+- NEON `ld1 {v0.4h, v1.4h, v2.4h, v3.4h}, [x1]` is a plain
+  **sequential** load: register `v_k` receives `block[4k .. 4k+3]`.
+  It does not de-interleave (that would be `ld4`), so the load
+  itself does no reordering.
+- Combined with FFmpeg's choice of **column-major** block layout
+  (`block[c*4 + r]` = coefficient at row r, column c), each NEON
+  vector `v_c` therefore holds column c of the block (lane = row),
+  which is the opposite of what the row-major C reference assumed.
+- FFmpeg's C reference (`libavcodec/h264dsp_template.c`) uses
+  `block[i + 4*0]`, `block[i + 4*1]`, etc. which is column-major
+  indexing in disguise.
+
+Fix: read block as column-major (`block[c*4 + r]`) in the C
+reference's row-pass loop. M1 then PASS 10000/10000.
+
+Lesson encoded for future H.264 cycles:
+- **H.264 4×4 (and 8×8) blocks are column-major** in FFmpeg.
+- This applies to kernels that consume coefficient blocks (the
+  h264idct family here). Kernels that work on pixel surfaces
+  (h264dsp, h264qpel, h264cmc) take a row stride instead, so
+  cycles 7+ should default-assume column-major only for
+  coefficient blocks and check each kernel's C reference.
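The layout rule above can be captured in a small index helper, plus a converter for code that still produces row-major blocks. A minimal sketch; `h264_blk_idx` and `rowmajor_to_colmajor` are hypothetical names, not FFmpeg or project API.

```c
#include <stdint.h>

/* Storage index of the coefficient at (row r, column c) in FFmpeg's
 * column-major 4x4 block layout. */
int h264_blk_idx(int r, int c)
{
    return c * 4 + r;
}

/* Convert a row-major 4x4 block (in[r*4 + c]) to column-major storage. */
void rowmajor_to_colmajor(const int16_t in[16], int16_t out[16])
{
    for (int r = 0; r < 4; r++)
        for (int c = 0; c < 4; c++)
            out[h264_blk_idx(r, c)] = in[r * 4 + c];
}
```

Routing every coefficient access through one helper like this keeps the convention in a single place when later cycles reuse the reference code.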
+
+## Comparison vs cycle 1 IDCT 8×8 (the closest analog)
+
+| | Cycle 1 IDCT 8×8 | Cycle 6 IDCT 4×4 |
+|---|---|---|
+| Codec | VP9 | H.264 |
+| Block size | 8×8 (64 coefs) | 4×4 (16 coefs) |
+| Transform math | Q14 trig DCT (heavy multiplies) | Integer butterfly (adds and shifts, no multiplies) |
+| NEON time/block | 122 ns | **5.7 ns** (21× faster) |
+| Block storage | row-major | column-major |
+| 30fps@1080p floor margin | 8× | **30×** (vs worst case) |
+
+H.264 IDCT 4×4 is dramatically lighter than VP9 IDCT 8×8 — both
+per-coef and per-block. This validates the "H.264 should be
+easier" hypothesis from [project_h264_scope_added].
+
+## Predicted R₆ band
+
+NEON per-block 5.7 ns is so fast that the QPU must be very fast
+to compete. QPU dispatch overhead is ~30 µs per call (from M5),
+so the QPU-call breakeven needs to amortize across many blocks
+per dispatch.
+
+Per-block estimate for QPU on a similar tiny kernel:
+- 4 lanes per block (per pixel), 64 invocations/WG → 16 blocks/WG
+- ~50-100 instructions per block (much less than cycle 1 IDCT 8x8's 250)
+- At 8 ns/instruction (NEON-tuned guess), ~600 ns per block.
+- R₆ = 5.7 / 600 = 0.01 → **deep RED in isolation**
+
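+But: per-WG packing of 16 blocks means dispatch overhead amortizes
+better. And 4×4 may be bandwidth-bound on NEON: per block the kernel
+reads 32 bytes of coefficients and 16 bytes of dst, then writes 16
+bytes of dst and 32 bytes of zeros back to `block`, so 96 bytes per
+5.7 ns ≈ 17 GB/s of load/store traffic. Whether that traffic reaches
+DRAM depends on the bench working set, which this phase did not
+measure. If it does, same-kernel M4 on QPU may come nearly for free
+as long as the QPU's bandwidth doesn't contend on the same channel.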
+
+Plan: implement QPU path anyway for cycle-completion and
+opportunistic-helper hypothesis. If R₆ is deep RED but mixed-kernel
+(per Issue 003) deployment shape uses QPU for VP9 cycles 1+2+4 and
+CPU for H.264 IDCT 4×4, that's fine — the recipe carries over.
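The estimate above, including the dispatch-amortization point, can be put into a small model. This is a back-of-envelope sketch only: the function names are hypothetical, 75 instructions per block is an assumed midpoint of the 50-100 range, and 4096 blocks per dispatch is an assumption borrowed from the NEON bench's batch size rather than a measured QPU configuration.

```c
/* Estimated QPU time per block: ALU cost plus the per-dispatch overhead
 * spread across all blocks submitted in that dispatch. */
double qpu_ns_per_block(double instr_per_block, double ns_per_instr,
                        double dispatch_us, double blocks_per_dispatch)
{
    return instr_per_block * ns_per_instr
         + dispatch_us * 1000.0 / blocks_per_dispatch;
}

/* R = NEON time per block / QPU time per block (higher is better for QPU). */
double predicted_r(double neon_ns_per_block, double qpu_ns)
{
    return neon_ns_per_block / qpu_ns;
}
```

With 75 instructions at 8 ns, 30 µs dispatch and 4096 blocks per dispatch, the dispatch term adds only about 7 ns to the 600 ns ALU estimate, so under these assumptions the prediction stays deep RED even with good amortization.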
+
+## Next: Phase 4 plan
+
+Follow the established cycle pattern: Phase 4 plans the QPU shader,
+Phase 5 is the Sonnet review, Phase 6 the implementation, and Phase 7
+the measurement. Predicted R₆ = 0.01 (deep RED in isolation), and the
+kernel is small enough that per-call buffer allocation is likely to
+dominate latency.
+
+Alternative path: defer cycle 6 Phase 4-7 (skip the QPU shader
+build) and instead move directly to next H.264 cycles where QPU
+might actually win — IDCT 8x8 (cycle 7), 6-tap MC (cycle 9), or
+deblock (cycle 10). H.264 IDCT 4×4 on CPU is so fast that it
+doesn't NEED QPU help.
+
+## Acceptance
+
+- ✓ M1 bit-exact (100.00 % on 10 000 random blocks)
+- ✓ M3 captured (175 Mblock/s)
+- ✓ 30fps@1080p floor exceeded by 30× worst-case
+- ✓ Block-layout convention documented for future cycles
@@ -26,6 +26,7 @@ tagged commit, no modifications.
 | `libavcodec/aarch64/vp9itxfm_neon.S` | 1580 | 63534 | `82ee3ceed4735c63576bafdcee28e2215652743ade55a9eab46a16d9530369f6` |
 | `libavcodec/aarch64/vp9lpf_neon.S` | 1334 | — | `384e49e7a6e838d9e38aedc00838ed4aebfa6c5bdb343ecaf23ef639bc10fbb7` |
 | `libavcodec/aarch64/vp9mc_neon.S` | 665 | — | `6b1d50f9821742584fdd47758057f810644aff3a008faaa774ff5b9cac4d1fef` |
+| `libavcodec/aarch64/h264idct_neon.S` | 415 | 16269 | `963ffe5f31b5a6a422e13b0d394cf5630126927abfb23aa214f7cbe83d60683f` — H.264 IDCT 4×4/8×8/DC NEON kernels for cycle 6+ |
 | `libavcodec/vp9_subpel_filters_table.c` | — | — | hand-extracted from `libavcodec/vp9dsp.c` at same n7.1.3 pin — provides `ff_vp9_subpel_filters` for `vp9mc_neon.S` to link against without dragging in vp9dsp.c's full init machinery |
 | `libavcodec/aarch64/neon.S` | 173 | 7496 | `72d36ce6c3fcc5e53de869cfe10fda16225ebe580c32891bccc240a30a85a538` |
 | `libavutil/aarch64/asm.S` | 260 | 8069 | `c0d03143b1bc5a9e358222d08d2d449d595271844fe7a3dc23bffb91abe8b0e3` |
@@ -0,0 +1,415 @@
+/*
+ * Copyright (c) 2008 Mans Rullgard <mans@mansr.com>
+ * Copyright (c) 2013 Janne Grunau <janne-libav@jannau.net>
+ *
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#include "libavutil/aarch64/asm.S"
+#include "neon.S"
+
+function ff_h264_idct_add_neon, export=1
+.L_ff_h264_idct_add_neon:
+        AARCH64_VALID_CALL_TARGET
+        ld1             {v0.4h, v1.4h, v2.4h, v3.4h},  [x1]
+        sxtw            x2,     w2
+        movi            v30.8h, #0
+
+        add             v4.4h,  v0.4h,  v2.4h
+        sshr            v16.4h, v1.4h,  #1
+        st1             {v30.8h},    [x1], #16
+        sshr            v17.4h, v3.4h,  #1
+        st1             {v30.8h},    [x1], #16
+        sub             v5.4h,  v0.4h,  v2.4h
+        sub             v6.4h,  v16.4h, v3.4h
+        add             v7.4h,  v1.4h,  v17.4h
+        add             v0.4h,  v4.4h,  v7.4h
+        add             v1.4h,  v5.4h,  v6.4h
+        sub             v2.4h,  v5.4h,  v6.4h
+        sub             v3.4h,  v4.4h,  v7.4h
+
+        transpose_4x4H  v0, v1, v2, v3, v4, v5, v6, v7
+
+        add             v4.4h,  v0.4h,  v2.4h
+        ld1             {v18.s}[0], [x0], x2
+        sshr            v16.4h,  v3.4h,  #1
+        sshr            v17.4h,  v1.4h,  #1
+        ld1             {v18.s}[1], [x0], x2
+        sub             v5.4h,  v0.4h,  v2.4h
+        ld1             {v19.s}[1], [x0], x2
+        add             v6.4h,  v16.4h, v1.4h
+        ins             v4.d[1],  v5.d[0]
+        sub             v7.4h,  v17.4h, v3.4h
+        ld1             {v19.s}[0], [x0], x2
+        ins             v6.d[1],  v7.d[0]
+        sub             x0,  x0,  x2, lsl #2
+        add             v0.8h,  v4.8h,  v6.8h
+        sub             v1.8h,  v4.8h,  v6.8h
+
+        srshr           v0.8h,  v0.8h,  #6
+        srshr           v1.8h,  v1.8h,  #6
+
+        uaddw           v0.8h,  v0.8h,  v18.8b
+        uaddw           v1.8h,  v1.8h,  v19.8b
+
+        sqxtun          v0.8b, v0.8h
+        sqxtun          v1.8b, v1.8h
+
+        st1             {v0.s}[0],  [x0], x2
+        st1             {v0.s}[1],  [x0], x2
+        st1             {v1.s}[1],  [x0], x2
+        st1             {v1.s}[0],  [x0], x2
+
+        sub             x1,  x1,  #32
+        ret
+endfunc
+
+function ff_h264_idct_dc_add_neon, export=1
+.L_ff_h264_idct_dc_add_neon:
+        AARCH64_VALID_CALL_TARGET
+        sxtw            x2,  w2
+        mov             w3,       #0
+        ld1r            {v2.8h},  [x1]
+        strh            w3,       [x1]
+        srshr           v2.8h,  v2.8h,  #6
+        ld1             {v0.s}[0],  [x0], x2
+        ld1             {v0.s}[1],  [x0], x2
+        uaddw           v3.8h,  v2.8h,  v0.8b
+        ld1             {v1.s}[0],  [x0], x2
+        ld1             {v1.s}[1],  [x0], x2
+        uaddw           v4.8h,  v2.8h,  v1.8b
+        sqxtun          v0.8b,  v3.8h
+        sqxtun          v1.8b,  v4.8h
+        sub             x0,  x0,  x2, lsl #2
+        st1             {v0.s}[0],  [x0], x2
+        st1             {v0.s}[1],  [x0], x2
+        st1             {v1.s}[0],  [x0], x2
+        st1             {v1.s}[1],  [x0], x2
+        ret
+endfunc
+
+function ff_h264_idct_add16_neon, export=1
+        mov             x12, x30
+        mov             x6,  x0         // dest
+        mov             x5,  x1         // block_offset
+        mov             x1,  x2         // block
+        mov             w9,  w3         // stride
+        movrel          x7,  scan8
+        mov             x10, #16
+        movrel          x13, .L_ff_h264_idct_dc_add_neon
+        movrel          x14, .L_ff_h264_idct_add_neon
+1:      mov             w2,  w9
+        ldrb            w3,  [x7], #1
+        ldrsw           x0,  [x5], #4
+        ldrb            w3,  [x4,  w3,  uxtw]
+        subs            w3,  w3,  #1
+        b.lt            2f
+        ldrsh           w3,  [x1]
+        add             x0,  x0,  x6
+        ccmp            w3,  #0,  #4,  eq
+        csel            x15, x13, x14, ne
+        blr             x15
+2:      subs            x10, x10, #1
+        add             x1,  x1,  #32
+        b.ne            1b
+        ret             x12
+endfunc
+
+function ff_h264_idct_add16intra_neon, export=1
+        mov             x12, x30
+        mov             x6,  x0         // dest
+        mov             x5,  x1         // block_offset
+        mov             x1,  x2         // block
+        mov             w9,  w3         // stride
+        movrel          x7,  scan8
+        mov             x10, #16
+        movrel          x13, .L_ff_h264_idct_dc_add_neon
+        movrel          x14, .L_ff_h264_idct_add_neon
+1:      mov             w2,  w9
+        ldrb            w3,  [x7], #1
+        ldrsw           x0,  [x5], #4
+        ldrb            w3,  [x4,  w3,  uxtw]
+        add             x0,  x0,  x6
+        cmp             w3,  #0
+        ldrsh           w3,  [x1]
+        csel            x15, x13, x14, eq
+        ccmp            w3,  #0,  #0,  eq
+        b.eq            2f
+        blr             x15
+2:      subs            x10, x10, #1
+        add             x1,  x1,  #32
+        b.ne            1b
+        ret             x12
+endfunc
+
+function ff_h264_idct_add8_neon, export=1
+        stp             x19, x20, [sp, #-0x40]!
+        mov             x12, x30
+        ldp             x6,  x15, [x0]          // dest[0], dest[1]
+        add             x5,  x1,  #16*4         // block_offset
+        add             x9,  x2,  #16*32        // block
+        mov             w19, w3                 // stride
+        movrel          x13, .L_ff_h264_idct_dc_add_neon
+        movrel          x14, .L_ff_h264_idct_add_neon
+        movrel          x7,  scan8, 16
+        mov             x10, #0
+        mov             x11, #16
+1:      mov             w2,  w19
+        ldrb            w3,  [x7, x10]          // scan8[i]
+        ldrsw           x0,  [x5, x10, lsl #2]  // block_offset[i]
+        ldrb            w3,  [x4, w3,  uxtw]    // nnzc[ scan8[i] ]
+        add             x0,  x0,  x6            // block_offset[i] + dst[j-1]
+        add             x1,  x9,  x10, lsl #5   // block + i * 16
+        cmp             w3,  #0
+        ldrsh           w3,  [x1]               // block[i*16]
+        csel            x20, x13, x14, eq
+        ccmp            w3,  #0,  #0,  eq
+        b.eq            2f
+        blr             x20
+2:      add             x10, x10, #1
+        cmp             x10, #4
+        csel            x10, x11, x10, eq     // mov x10, #16
+        csel            x6,  x15, x6,  eq
+        cmp             x10, #20
+        b.lt            1b
+        ldp             x19, x20, [sp], #0x40
+        ret             x12
+endfunc
+
+.macro  idct8x8_cols    pass
+  .if \pass == 0
+        va      .req    v18
+        vb      .req    v30
+        sshr            v18.8h, v26.8h, #1
+        add             v16.8h, v24.8h, v28.8h
+        ld1             {v30.8h, v31.8h}, [x1]
+        st1             {v19.8h}, [x1],  #16
+        st1             {v19.8h}, [x1],  #16
+        sub             v17.8h,  v24.8h, v28.8h
+        sshr            v19.8h,  v30.8h, #1
+        sub             v18.8h,  v18.8h,  v30.8h
+        add             v19.8h,  v19.8h,  v26.8h
+  .else
+        va      .req    v30
+        vb      .req    v18
+        sshr            v30.8h, v26.8h, #1
+        sshr            v19.8h, v18.8h, #1
+        add             v16.8h, v24.8h, v28.8h
+        sub             v17.8h, v24.8h, v28.8h
+        sub             v30.8h, v30.8h, v18.8h
+        add             v19.8h, v19.8h, v26.8h
+  .endif
+        add             v26.8h, v17.8h, va.8h
+        sub             v28.8h, v17.8h, va.8h
+        add             v24.8h, v16.8h, v19.8h
+        sub             vb.8h,  v16.8h, v19.8h
+        sub             v16.8h, v29.8h, v27.8h
+        add             v17.8h, v31.8h, v25.8h
+        sub             va.8h,  v31.8h, v25.8h
+        add             v19.8h, v29.8h, v27.8h
+        sub             v16.8h, v16.8h, v31.8h
+        sub             v17.8h, v17.8h, v27.8h
+        add             va.8h,  va.8h,  v29.8h
+        add             v19.8h, v19.8h, v25.8h
+        sshr            v25.8h, v25.8h, #1
+        sshr            v27.8h, v27.8h, #1
+        sshr            v29.8h, v29.8h, #1
+        sshr            v31.8h, v31.8h, #1
+        sub             v16.8h, v16.8h, v31.8h
+        sub             v17.8h, v17.8h, v27.8h
+        add             va.8h,  va.8h,  v29.8h
+        add             v19.8h, v19.8h, v25.8h
+        sshr            v25.8h, v16.8h, #2
+        sshr            v27.8h, v17.8h, #2
+        sshr            v29.8h, va.8h,  #2
+        sshr            v31.8h, v19.8h, #2
+        sub             v19.8h, v19.8h, v25.8h
+        sub             va.8h,  v27.8h, va.8h
+        add             v17.8h, v17.8h, v29.8h
+        add             v16.8h, v16.8h, v31.8h
+  .if \pass == 0
+        sub             v31.8h, v24.8h, v19.8h
+        add             v24.8h, v24.8h, v19.8h
+        add             v25.8h, v26.8h, v18.8h
+        sub             v18.8h, v26.8h, v18.8h
+        add             v26.8h, v28.8h, v17.8h
+        add             v27.8h, v30.8h, v16.8h
+        sub             v29.8h, v28.8h, v17.8h
+        sub             v28.8h, v30.8h, v16.8h
+  .else
+        sub             v31.8h, v24.8h, v19.8h
+        add             v24.8h, v24.8h, v19.8h
+        add             v25.8h, v26.8h, v30.8h
+        sub             v30.8h, v26.8h, v30.8h
+        add             v26.8h, v28.8h, v17.8h
+        sub             v29.8h, v28.8h, v17.8h
+        add             v27.8h, v18.8h, v16.8h
+        sub             v28.8h, v18.8h, v16.8h
+  .endif
+        .unreq          va
+        .unreq          vb
+.endm
+
+function ff_h264_idct8_add_neon, export=1
+.L_ff_h264_idct8_add_neon:
+        AARCH64_VALID_CALL_TARGET
+        movi            v19.8h,   #0
+        sxtw            x2,       w2
+        ld1             {v24.8h, v25.8h}, [x1]
+        st1             {v19.8h},  [x1],   #16
+        st1             {v19.8h},  [x1],   #16
+        ld1             {v26.8h, v27.8h}, [x1]
+        st1             {v19.8h},  [x1],   #16
+        st1             {v19.8h},  [x1],   #16
+        ld1             {v28.8h, v29.8h}, [x1]
+        st1             {v19.8h},  [x1],   #16
+        st1             {v19.8h},  [x1],   #16
+
+        idct8x8_cols    0
+        transpose_8x8H  v24, v25, v26, v27, v28, v29, v18, v31, v6, v7
+        idct8x8_cols    1
+
+        mov             x3,  x0
+        srshr           v24.8h, v24.8h, #6
+        ld1             {v0.8b},     [x0], x2
+        srshr           v25.8h, v25.8h, #6
+        ld1             {v1.8b},     [x0], x2
+        srshr           v26.8h, v26.8h, #6
+        ld1             {v2.8b},     [x0], x2
+        srshr           v27.8h, v27.8h, #6
+        ld1             {v3.8b},     [x0], x2
+        srshr           v28.8h, v28.8h, #6
+        ld1             {v4.8b},     [x0], x2
+        srshr           v29.8h, v29.8h, #6
+        ld1             {v5.8b},     [x0], x2
+        srshr           v30.8h, v30.8h, #6
+        ld1             {v6.8b},     [x0], x2
+        srshr           v31.8h, v31.8h, #6
+        ld1             {v7.8b},     [x0], x2
+        uaddw           v24.8h, v24.8h, v0.8b
+        uaddw           v25.8h, v25.8h, v1.8b
+        uaddw           v26.8h, v26.8h, v2.8b
+        sqxtun          v0.8b,  v24.8h
+        uaddw           v27.8h, v27.8h, v3.8b
+        sqxtun          v1.8b,  v25.8h
+        uaddw           v28.8h, v28.8h, v4.8b
+        sqxtun          v2.8b,  v26.8h
+        st1             {v0.8b},     [x3], x2
+        uaddw           v29.8h, v29.8h, v5.8b
+        sqxtun          v3.8b,  v27.8h
+        st1             {v1.8b},     [x3], x2
+        uaddw           v30.8h, v30.8h, v6.8b
+        sqxtun          v4.8b,  v28.8h
+        st1             {v2.8b},     [x3], x2
+        uaddw           v31.8h, v31.8h, v7.8b
+        sqxtun          v5.8b,  v29.8h
+        st1             {v3.8b},     [x3], x2
+        sqxtun          v6.8b,  v30.8h
+        sqxtun          v7.8b,  v31.8h
+        st1             {v4.8b},     [x3], x2
+        st1             {v5.8b},     [x3], x2
+        st1             {v6.8b},     [x3], x2
+        st1             {v7.8b},     [x3], x2
+
+        sub             x1,  x1,  #128
+        ret
+endfunc
+
+function ff_h264_idct8_dc_add_neon, export=1
+.L_ff_h264_idct8_dc_add_neon:
+        AARCH64_VALID_CALL_TARGET
+        mov             w3,       #0
+        sxtw            x2,       w2
+        ld1r            {v31.8h}, [x1]
+        strh            w3,       [x1]
+        ld1             {v0.8b},  [x0], x2
+        srshr           v31.8h, v31.8h, #6
+        ld1             {v1.8b},     [x0], x2
+        ld1             {v2.8b},     [x0], x2
+        uaddw           v24.8h, v31.8h, v0.8b
+        ld1             {v3.8b},     [x0], x2
+        uaddw           v25.8h, v31.8h, v1.8b
+        ld1             {v4.8b},     [x0], x2
+        uaddw           v26.8h, v31.8h, v2.8b
+        ld1             {v5.8b},     [x0], x2
+        uaddw           v27.8h, v31.8h, v3.8b
+        ld1             {v6.8b},     [x0], x2
+        uaddw           v28.8h, v31.8h, v4.8b
+        ld1             {v7.8b},     [x0], x2
+        uaddw           v29.8h, v31.8h, v5.8b
+        uaddw           v30.8h, v31.8h, v6.8b
+        uaddw           v31.8h, v31.8h, v7.8b
+        sqxtun          v0.8b,  v24.8h
+        sqxtun          v1.8b,  v25.8h
+        sqxtun          v2.8b,  v26.8h
+        sqxtun          v3.8b,  v27.8h
+        sub             x0,  x0,  x2, lsl #3
+        st1             {v0.8b},     [x0], x2
+        sqxtun          v4.8b,  v28.8h
+        st1             {v1.8b},     [x0], x2
+        sqxtun          v5.8b,  v29.8h
+        st1             {v2.8b},     [x0], x2
+        sqxtun          v6.8b,  v30.8h
+        st1             {v3.8b},     [x0], x2
+        sqxtun          v7.8b,  v31.8h
+        st1             {v4.8b},     [x0], x2
+        st1             {v5.8b},     [x0], x2
+        st1             {v6.8b},     [x0], x2
+        st1             {v7.8b},     [x0], x2
+        ret
+endfunc
+
+function ff_h264_idct8_add4_neon, export=1
+        mov             x12, x30
+        mov             x6,  x0
+        mov             x5,  x1
+        mov             x1,  x2
+        mov             w2,  w3
+        movrel          x7,  scan8
+        mov             w10, #16
+        movrel          x13, .L_ff_h264_idct8_dc_add_neon
+        movrel          x14, .L_ff_h264_idct8_add_neon
+1:      ldrb            w9,  [x7], #4
+        ldrsw           x0,  [x5], #16
+        ldrb            w9,  [x4, w9, uxtw]
+        subs            w9,  w9,  #1
+        b.lt            2f
+        ldrsh           w11,  [x1]
+        add             x0,  x6,  x0
+        ccmp            w11, #0,  #4,  eq
+        csel            x15, x13, x14, ne
+        blr             x15
+2:      subs            w10, w10, #4
+        add             x1,  x1,  #128
+        b.ne            1b
+        ret             x12
+endfunc
+
+const   scan8
+        .byte           4+ 1*8, 5+ 1*8, 4+ 2*8, 5+ 2*8
+        .byte           6+ 1*8, 7+ 1*8, 6+ 2*8, 7+ 2*8
+        .byte           4+ 3*8, 5+ 3*8, 4+ 4*8, 5+ 4*8
+        .byte           6+ 3*8, 7+ 3*8, 6+ 4*8, 7+ 4*8
+        .byte           4+ 6*8, 5+ 6*8, 4+ 7*8, 5+ 7*8
+        .byte           6+ 6*8, 7+ 6*8, 6+ 7*8, 7+ 7*8
+        .byte           4+ 8*8, 5+ 8*8, 4+ 9*8, 5+ 9*8
+        .byte           6+ 8*8, 7+ 8*8, 6+ 9*8, 7+ 9*8
+        .byte           4+11*8, 5+11*8, 4+12*8, 5+12*8
+        .byte           6+11*8, 7+11*8, 6+12*8, 7+12*8
+        .byte           4+13*8, 5+13*8, 4+14*8, 5+14*8
+        .byte           6+13*8, 7+13*8, 6+14*8, 7+14*8
+endconst
@@ -0,0 +1,210 @@
+/*
+ * Cycle 6 Phase 3 — NEON M3 baseline for H.264 IDCT 4x4 + add.
+ *
+ * Calls FFmpeg `ff_h264_idct_add_neon`. Reports M1 bit-exact vs
+ * the standalone C reference, plus M3 throughput.
+ *
+ * License: BSD-2-Clause; links FFmpeg LGPL-2.1+ snapshot.
+ */
+#define _POSIX_C_SOURCE 200809L
+#include <stdio.h>
+#include <stdlib.h>
+#include <stdint.h>
+#include <stddef.h>
+#include <string.h>
+#include <time.h>
+#include <getopt.h>
+
+extern void daedalus_h264_idct_add_ref(uint8_t *dst, int16_t *block, ptrdiff_t stride);
+extern void ff_h264_idct_add_neon(uint8_t *dst, int16_t *block, ptrdiff_t stride);
+
+#define DST_STRIDE 16   /* arbitrary stride for the test surface */
+#define DST_ROWS    4
+#define DST_BYTES  (DST_ROWS * DST_STRIDE)
+
+static uint64_t xs_state;
+static inline uint64_t xs(void) {
+    uint64_t x = xs_state;
+    x ^= x << 13; x ^= x >> 7; x ^= x << 17;
+    return xs_state = x;
+}
+
+static void gen_block(int16_t b[16])
+{
+    /* Synthetic H.264 residual: small coefficients, mostly zero,
+     * with 1-8 non-zero values at uniformly random positions
+     * (positions may repeat, so fewer can survive). */
+    memset(b, 0, 16 * sizeof(int16_t));
+    int n_nonzero = 1 + (int)(xs() % 8);
+    for (int i = 0; i < n_nonzero; i++) {
+        int pos = (int)(xs() % 16);
+        int16_t v = (int16_t)((int)(xs() % 1024) - 512);
+        b[pos] = v;
+    }
+}
+
+static double now_seconds(void) {
+    struct timespec ts;
+    clock_gettime(CLOCK_MONOTONIC_RAW, &ts);
+    return ts.tv_sec + ts.tv_nsec * 1e-9;
+}
+
+static int correctness_check(uint64_t seed, int n)
+{
+    xs_state = seed ? seed : 0xc0de264cULL;
+    int mismatches = 0;
+    int prints = 0;
+
+    int16_t block_a[16], block_b[16], block_saved[16];
+    uint8_t dst_a[DST_BYTES], dst_b[DST_BYTES], dst_initial[DST_BYTES];
+
+    for (int i = 0; i < n; i++) {
+        gen_block(block_a);
+        memcpy(block_b, block_a, sizeof(block_a));
+        memcpy(block_saved, block_a, sizeof(block_a));
+
+        /* Random initial dst (4×4 region at offset 0, row stride DST_STRIDE). */
+        for (int r = 0; r < 4; r++)
+            for (int c = 0; c < 4; c++)
+                dst_a[r * DST_STRIDE + c] = dst_b[r * DST_STRIDE + c] = (uint8_t)(xs() & 0xff);
+        memcpy(dst_initial, dst_a, DST_BYTES);
+
+        daedalus_h264_idct_add_ref(dst_a, block_a, DST_STRIDE);
+        ff_h264_idct_add_neon(dst_b, block_b, DST_STRIDE);
+
+        int diff = 0;
+        for (int r = 0; r < 4; r++)
+            for (int c = 0; c < 4; c++)
+                if (dst_a[r*DST_STRIDE + c] != dst_b[r*DST_STRIDE + c]) diff++;
+        if (diff) {
+            if (prints < 3) {
+                fprintf(stderr, "MISMATCH block %d (%d/16 pix diff):\n", i, diff);
+                fprintf(stderr, "  input block (as row r, col c; stored column-major):");
+                for (int r = 0; r < 4; r++) {
+                    fprintf(stderr, "\n    r%d ", r);
+                    for (int c = 0; c < 4; c++) fprintf(stderr, "%6d ", block_saved[c*4 + r]);
+                }
+                fprintf(stderr, "\n  initial dst:");
+                for (int r = 0; r < 4; r++) {
+                    fprintf(stderr, "\n    r%d ", r);
+                    for (int c = 0; c < 4; c++) fprintf(stderr, "%3u ", dst_initial[r*DST_STRIDE + c]);
+                }
+                fprintf(stderr, "\n");
+                fprintf(stderr, "  ref:");
+                for (int r = 0; r < 4; r++) {
+                    fprintf(stderr, "\n    r%d ", r);
+                    for (int c = 0; c < 4; c++) fprintf(stderr, "%3u ", dst_a[r*DST_STRIDE+c]);
+                }
+                fprintf(stderr, "\n  neon:");
+                for (int r = 0; r < 4; r++) {
+                    fprintf(stderr, "\n    r%d ", r);
+                    for (int c = 0; c < 4; c++) fprintf(stderr, "%3u ", dst_b[r*DST_STRIDE+c]);
+                }
+                fprintf(stderr, "\n");
+                prints++;
+            }
+            mismatches++;
+        }
+    }
+
+    printf("M1₆ correctness: %d / %d blocks bit-exact (%.4f%%)\n",
+           n - mismatches, n, 100.0 * (n - mismatches) / n);
+    return mismatches;
+}
+
+static void throughput_neon(uint64_t seed, int n_blocks, double duration_s)
+{
+    xs_state = seed ? seed : 0xc0de264cULL;
+    int16_t *master_blocks = malloc((size_t) n_blocks * 16 * sizeof(int16_t));
+    int16_t *work_blocks   = malloc((size_t) n_blocks * 16 * sizeof(int16_t));
+    uint8_t *master_dst    = malloc((size_t) n_blocks * 16);
+    uint8_t *work_dst      = malloc((size_t) n_blocks * 16);
+    if (!master_blocks || !work_blocks || !master_dst || !work_dst) {
+        fprintf(stderr, "alloc fail\n"); exit(1);
+    }
+    for (int i = 0; i < n_blocks; i++) {
+        gen_block(master_blocks + i * 16);
+        for (int j = 0; j < 16; j++) master_dst[i * 16 + j] = (uint8_t)(xs() & 0xff);
+    }
+
+    /* Warm-up. Each block writes its own packed 4x4 dst tile, so the
+     * stride passed to the kernel is 4 bytes. */
+    memcpy(work_blocks, master_blocks, (size_t) n_blocks * 16 * sizeof(int16_t));
+    memcpy(work_dst,    master_dst,    (size_t) n_blocks * 16);
+    for (int i = 0; i < n_blocks; i++)
+        ff_h264_idct_add_neon(work_dst + i * 16, work_blocks + i * 16, 4);
+
+    double t0 = now_seconds();
+    double t_end = t0 + duration_s;
+    uint64_t done = 0;
+    while (now_seconds() < t_end) {
+        memcpy(work_blocks, master_blocks, (size_t) n_blocks * 16 * sizeof(int16_t));
+        memcpy(work_dst,    master_dst,    (size_t) n_blocks * 16);
+        for (int i = 0; i < n_blocks; i++)
+            ff_h264_idct_add_neon(work_dst + i * 16, work_blocks + i * 16, 4);
+        done += n_blocks;
+    }
+    double elapsed = now_seconds() - t0;
+
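+    /* Subtract setup cost: time the per-batch memcpy refresh alone for
+     * the same number of batches and remove it from the elapsed time. */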
+    int iters = (int)(done / n_blocks);
+    double s0 = now_seconds();
+    for (int i = 0; i < iters; i++) {
+        memcpy(work_blocks, master_blocks, (size_t) n_blocks * 16 * sizeof(int16_t));
+        memcpy(work_dst,    master_dst,    (size_t) n_blocks * 16);
+    }
+    double s1 = now_seconds();
+
+    double kernel_seconds = elapsed - (s1 - s0);
+    double mbps = done / kernel_seconds / 1e6;
+
+    printf("M3₆ NEON throughput:\n");
+    printf("  blocks/batch:    %d\n", n_blocks);
+    printf("  batches done:    %d\n", iters);
+    printf("  total blocks:    %llu\n", (unsigned long long) done);
+    printf("  elapsed (kernel)=%.6f s\n", kernel_seconds);
+    printf("  throughput      = %.3f Mblock/s\n", mbps);
+    printf("  per-block       = %.1f ns\n", kernel_seconds / done * 1e9);
+    /* H.264 1080p 4×4 floor: ~5.85 Mblock/s worst-case, ~2 realistic. */
+    printf("  H.264 1080p30 worst-case floor: %.2fx margin (5.85 Mblock/s req'd)\n", mbps / 5.85);
+    printf("  H.264 1080p30 realistic floor: %.2fx margin (2.0 Mblock/s req'd)\n", mbps / 2.0);
+
+    free(master_blocks); free(work_blocks); free(master_dst); free(work_dst);
+}
+
+int main(int argc, char **argv)
+{
+    int n_blocks = 65536;
+    double duration = 5.0;
+    uint64_t seed = 0;
+    int do_correctness = 1;
+
+    static struct option opts[] = {
+        {"blocks",         required_argument, 0, 'b'},
+        {"duration",       required_argument, 0, 'd'},
+        {"seed",           required_argument, 0, 's'},
+        {"no-correctness", no_argument,       0, 'C'},
+        {0,0,0,0}
+    };
+    for (int c; (c = getopt_long(argc, argv, "b:d:s:C", opts, 0)) != -1;) {
+        switch (c) {
+        case 'b': n_blocks = atoi(optarg); break;
+        case 'd': duration = atof(optarg); break;
+        case 's': seed = strtoull(optarg, 0, 0); break;
+        case 'C': do_correctness = 0; break;
+        default: return 2;
+        }
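+    }
+    if (n_blocks <= 0 || duration <= 0.0) {
+        fprintf(stderr, "--blocks and --duration must be positive\n");
+        return 2;
+    }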
+
+    if (do_correctness) {
+        printf("=== M1₆ bit-exact (10000 random 4x4 blocks) ===\n");
+        int mis = correctness_check(seed, 10000);
+        if (mis != 0) {
+            fprintf(stderr, "M1 gate FAILED — refusing to measure throughput.\n");
+            return 1;
+        }
+        printf("\n");
+    }
+
+    printf("=== M3₆ NEON throughput ===\n");
+    throughput_neon(seed, n_blocks, duration);
+    return 0;
+}
@@ -0,0 +1,81 @@
+/*
+ * Standalone bit-exact C reference for H.264 4x4 inverse integer
+ * transform + add. Algorithm per H.264 spec §8.5.12.2 (transformation
+ * process for residual 4x4 blocks, TransformBypassModeFlag = 0).
+ *
+ * Mirrors FFmpeg `ff_h264_idct_add_neon` in
+ * external/ffmpeg-snapshot/libavcodec/aarch64/h264idct_neon.S
+ * (n7.1.3 pin). Destructively zeroes `block` to match upstream
+ * convention (post-call block must be zero for the H.264 conformance
+ * residual loop).
+ *
+ * Signature mirrors the NEON convention:
+ *   void(uint8_t *dst, int16_t *block, ptrdiff_t stride);
+ *
+ * License: LGPL-2.1-or-later (matches the FFmpeg upstream the algorithm
+ * was transcribed from). Spec: ITU-T Rec. H.264 | ISO/IEC 14496-10.
+ */
+#include <stdint.h>
+#include <stddef.h>
+#include <string.h>
+
+static inline int clip_u8(int v) { return v < 0 ? 0 : v > 255 ? 255 : v; }
+
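+/* 1D butterfly per H.264 spec §8.5.12.2.
+ * d[0..3] are input, e/f/g/h are intermediate, h_c[0..3] are output.
+ * Bit-exactness assumes >> on a negative int is an arithmetic shift
+ * (implementation-defined in C; GCC and Clang behave this way). */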
+static inline void h264_idct4_butterfly(const int d[4], int h_c[4])
+{
+    int e = d[0] + d[2];
+    int f = d[0] - d[2];
+    int g = (d[1] >> 1) - d[3];
+    int h = d[1] + (d[3] >> 1);
+    h_c[0] = e + h;
+    h_c[1] = f + g;
+    h_c[2] = f - g;
+    h_c[3] = e - h;
+}
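+
+/* Worked example (illustrative arithmetic added as a reading aid, not
+ * taken from FFmpeg or the spec text): a DC-only block with
+ * block[0] = 64 and every other coefficient zero, run through the
+ * two-pass transform defined below.
+ *   Row pass:    row 0 has d = {64, 0, 0, 0}, so e = f = 64 and
+ *                g = h = 0, giving 64 in all four columns of row 0;
+ *                rows 1..3 stay zero.
+ *   Column pass: every column now has d = {64, 0, 0, 0}, so all 16
+ *                positions become 64.
+ *   Rounding:    (64 + 32) >> 6 = 1, so each of the 16 dst pixels
+ *                increases by 1 before clipping to [0, 255].
+ */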
+
+void daedalus_h264_idct_add_ref(uint8_t *dst, int16_t *block, ptrdiff_t stride)
+{
+    /* H.264/FFmpeg block layout is COLUMN-MAJOR:
+     *   block[c*4 + r] = coefficient at row r, column c.
+     * NEON ld1 {v0.4h-v3.4h} loads consecutive memory into consecutive
+     * registers without de-interleaving (that would be ld4), so with a
+     * column-major source register v_c holds column c and lane r holds
+     * row r: v_c[r] = block at (row=r, col=c). The first lane-wise
+     * butterfly (v0+v2 etc.) then combines column 0 and column 2
+     * within each row → row pass.
+     * JM and FFmpeg C reference both do row-first then column-pass.
+     *
+     * dst is row-major (dst[r*stride + c]).
+     */
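+    /* Illustrative index map for the column-major layout (a reading aid
+     * added here, not text from FFmpeg or the spec); entries are (row,col):
+     *   block[ 0.. 3] = (0,0) (1,0) (2,0) (3,0)    column 0
+     *   block[ 4.. 7] = (0,1) (1,1) (2,1) (3,1)    column 1
+     *   block[ 8..11] = (0,2) (1,2) (2,2) (3,2)    column 2
+     *   block[12..15] = (0,3) (1,3) (2,3) (3,3)    column 3
+     * Example: block[6] = block[1*4 + 2] is the coefficient at row 2, col 1.
+     */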
+    int tmp[4][4];
+
+    /* Row pass FIRST. Read block as column-major (block[c*4 + r]). */
+    for (int r = 0; r < 4; r++) {
+        int d[4] = { block[0*4 + r], block[1*4 + r],
+                     block[2*4 + r], block[3*4 + r] };
+        int h_c[4];
+        h264_idct4_butterfly(d, h_c);
+        for (int c = 0; c < 4; c++) tmp[r][c] = h_c[c];
+    }
+
+    /* Column pass NEXT (on row-major tmp). */
+    int col_out[4][4];
+    for (int c = 0; c < 4; c++) {
+        int d[4] = { tmp[0][c], tmp[1][c], tmp[2][c], tmp[3][c] };
+        int h_c[4];
+        h264_idct4_butterfly(d, h_c);
+        for (int r = 0; r < 4; r++) col_out[r][c] = h_c[r];
+    }
+
+    /* Round (+32) >> 6, add to dst, clip to u8. */
+    for (int r = 0; r < 4; r++) {
+        for (int c = 0; c < 4; c++) {
+            int rounded = (col_out[r][c] + 32) >> 6;
+            dst[r * stride + c] = (uint8_t) clip_u8(dst[r * stride + c] + rounded);
+        }
+    }
+
+    /* FFmpeg convention: zero the block after the transform. */
+    memset(block, 0, 16 * sizeof(int16_t));
+}