f92dc40f43
H.264 scope added 2026-05-18 per user direction. Pi 5's VideoCore VII has no hardware H.264 decoder block (only HEVC), so a QPU-accelerated H.264 path fills the most impactful codec gap.

Cycle 6 = first H.264 kernel (4x4 IDCT + add: the smallest H.264 transform, so the simplest first cycle).

Phase 1: goal doc + 1080p30 floor analysis (5.85 Mblock/s worst-case, 2.0 Mblock/s realistic since most MBs use 8x8 or P-skip).

Phase 3: NEON M3 baseline captured. ff_h264_idct_add_neon on hertz delivers 175 Mblock/s (5.7 ns per block) = 30x margin over the worst-case floor. H.264 IDCT 4x4 is dramatically lighter than VP9 IDCT 8x8 (21x faster per block).

Phase 3 closure also caught the key Phase 9 lesson: H.264/FFmpeg blocks are COLUMN-MAJOR (block[c*4 + r] = (row=r, col=c)). NEON ld1 with 4 registers loads consecutive memory into consecutive registers, so each register holds one column, and the FFmpeg C ref indexing makes this convention explicit. The initial C ref assumed row-major and M1 was 5% bit-exact; after the fix, M1 = 100%. Convention encoded for all subsequent H.264 cycles (cycle 7+).

- external/ffmpeg-snapshot/libavcodec/aarch64/h264idct_neon.S (vendored verbatim from FFmpeg n7.1.3, 415 lines)
- external/ffmpeg-snapshot/PROVENANCE.md: updated
- tests/h264_idct4_ref.c: column-major C ref
- tests/bench_neon_h264idct4.c: M1 + M3 bench
- CMakeLists.txt: cycle 6 NEON bench wiring
- docs/k6_h264idct4_phase1.md, phase3.md

Phase 4 next: QPU shader for cycle 6. Predicted R6 = 0.01 (deep RED: kernel too small relative to QPU dispatch overhead), but worth building for cycle-completeness + the opportunistic-helper hypothesis (cycle 6 may stay CPU per recipe).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
82 lines
2.8 KiB
C
/*
 * Standalone bit-exact C reference for H.264 4x4 inverse integer
 * transform + add. Algorithm per H.264 spec §8.5.12.1 (4x4 IT for
 * blocks coded with TransformBypassFlag = 0).
 *
 * Mirrors FFmpeg `ff_h264_idct_add_neon` in
 *   external/ffmpeg-snapshot/libavcodec/aarch64/h264idct_neon.S
 * (n7.1.3 pin). Destructively zeroes `block` to match upstream
 * convention (post-call block must be zero for the H.264 conformance
 * residual loop).
 *
 * Signature mirrors the NEON convention:
 *   void(uint8_t *dst, int16_t *block, ptrdiff_t stride);
 *
 * License: LGPL-2.1-or-later (matches FFmpeg upstream the algorithm
 * was transcribed from). Spec is H.264 ITU-T Rec H.264 / ISO/IEC
 * 14496-10.
 */
#include <stdint.h>
#include <stddef.h>
#include <string.h>

static inline int clip_u8(int v) { return v < 0 ? 0 : v > 255 ? 255 : v; }

/* 1D butterfly per H.264 spec §8.5.12.1.
 * d[0..3] are input, e/f/g/h are intermediate, h_c[0..3] are output. */
static inline void h264_idct4_butterfly(const int d[4], int h_c[4])
{
    int e = d[0] + d[2];
    int f = d[0] - d[2];
    int g = (d[1] >> 1) - d[3];
    int h = d[1] + (d[3] >> 1);
    h_c[0] = e + h;
    h_c[1] = f + g;
    h_c[2] = f - g;
    h_c[3] = e - h;
}

void daedalus_h264_idct_add_ref(uint8_t *dst, int16_t *block, ptrdiff_t stride)
{
    /* H.264/FFmpeg block layout is COLUMN-MAJOR:
     *   block[c*4 + r] = coefficient at row r, column c.
     * NEON ld1 {v0.4h-v3.4h} loads consecutive memory into consecutive
     * registers (no de-interleaving, unlike ld4); with a column-major
     * source, register v_c holds column c, so lane r of v_c = block at
     * (row=r, col=c). The first lane-wise butterfly (v0+v2 etc.) then
     * combines column 0 and column 2 within each row → row pass.
     * JM and FFmpeg C reference both do the row pass first, then the
     * column pass.
     *
     * dst is row-major (dst[r*stride + c]).
     */
    int tmp[4][4];

    /* Row pass FIRST. Read block as column-major (block[c*4 + r]). */
    for (int r = 0; r < 4; r++) {
        int d[4] = { block[0*4 + r], block[1*4 + r],
                     block[2*4 + r], block[3*4 + r] };
        int h_c[4];
        h264_idct4_butterfly(d, h_c);
        for (int c = 0; c < 4; c++) tmp[r][c] = h_c[c];
    }

    /* Column pass NEXT (on row-major tmp). */
    int col_out[4][4];
    for (int c = 0; c < 4; c++) {
        int d[4] = { tmp[0][c], tmp[1][c], tmp[2][c], tmp[3][c] };
        int h_c[4];
        h264_idct4_butterfly(d, h_c);
        for (int r = 0; r < 4; r++) col_out[r][c] = h_c[r];
    }

    /* Round (+32) >> 6, add to dst, clip to u8. */
    for (int r = 0; r < 4; r++) {
        for (int c = 0; c < 4; c++) {
            int rounded = (col_out[r][c] + 32) >> 6;
            dst[r * stride + c] = (uint8_t) clip_u8(dst[r * stride + c] + rounded);
        }
    }

    /* FFmpeg convention: zero the block after the transform. */
    memset(block, 0, 16 * sizeof(int16_t));
}
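
The column-major convention above is easy to get wrong on the caller side, so here is a minimal sketch of how a test might prepare input and predict output for the simplest case. Both helper names (`pack_column_major`, `dc_only_delta`) are hypothetical and not part of this repository. The first copies coefficients held in row-major order into the `block[c*4 + r]` layout the reference expects. The second gives the per-pixel delta for a block whose only non-zero coefficient is `block[0]`: both butterfly passes then spread the DC value unchanged to all 16 positions, so the final rounding step reduces to `(dc + 32) >> 6`.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not in the repository): copy coefficients stored
 * row-major (rowmajor[r*4 + c]) into the column-major layout used by the
 * reference, where block[c*4 + r] is the coefficient at row r, column c. */
static void pack_column_major(const int16_t rowmajor[16], int16_t block[16])
{
    for (int r = 0; r < 4; r++)
        for (int c = 0; c < 4; c++)
            block[c * 4 + r] = rowmajor[r * 4 + c];
}

/* Hypothetical helper: value added to every pixel (before clipping to
 * 0..255) when only the DC coefficient block[0] is non-zero. With
 * d = {dc, 0, 0, 0} the butterfly yields {dc, dc, dc, dc} in both
 * passes, leaving only the (x + 32) >> 6 rounding step.
 * Assumption: non-negative dc, so the result does not depend on how the
 * compiler shifts negative values. */
static int dc_only_delta(int dc)
{
    return (dc + 32) >> 6;
}
```

After packing, `block` can be passed to `daedalus_h264_idct_add_ref` as-is. For example, a block with `block[0] = 64` and all other coefficients zero should raise every pixel of the 4x4 destination by 1, subject to clipping.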