h264: Intra_4x4 luma prediction — 9-mode C reference + spec gates #12
Lays the bit-exact gate for H.264 §8.3.1.4 Intra_4x4 luma prediction. Spec-derived C reference covering all 9 modes (Vertical/Horizontal/DC/DDL/DDR/VR/HD/VL/HU); standalone test_intra_pred_4x4 exercises each against hand-computed expected patterns. 10/10 PASS, including an asymmetric Vertical_Right sanity test with 16 distinct hand-computed cells.

CPU C reference (no NEON yet: h264pred_neon.S is not in the vendored snapshot, deferred pending perf data; estimated ~650 µs/frame at 1080p, comfortably inside budget). Pure spec primitive with no daedalus_core dependency; both daedalus-decoder Stage 2a and eventual QPU shaders will use this as the bit-exact gate.

Does NOT cover: Intra_16x16 luma, Intra_8x8 chroma, Intra_8x8 luma (High profile), neighbour-availability fallbacks, dispatch wrappers. Each is a follow-up.
Lays the bit-exact gate for H.264 §8.3.1.4 Intra_4x4 luma prediction. Spec-derived C reference covering all 9 modes; standalone test exercises each against hand-computed expected 4x4 patterns.

Why fourier (not the decoder) gets this: it's a reusable spec-level primitive. Both daedalus-decoder (Phase 1 Stage 2a intra prediction) and any future shader work will need the same bit-exact reference. Putting it in fourier alongside the IDCT / deblock refs keeps the "spec implementations" library cohesive.

Why CPU C reference, not NEON or QPU: the vendored FFmpeg snapshot (external/ffmpeg-snapshot/libavcodec/aarch64/) has h264dsp/idct/qpel but NOT h264pred. Vendoring h264pred_neon.S would expand the snapshot surface, so that is deferred pending real perf data. The cycle 9 NEON benches take ~5 ns per 8x8 qpel block; assuming a similar ~5 ns per 4x4 block, intra prediction costs ~5 ns × 16 blocks/MB × 8160 MBs ≈ 650 µs/frame at 1080p, which is well inside budget. The ~5 ns figure is borrowed from NEON timings, so the plain C reference will likely land somewhat higher, but intra prediction is not the critical-path concern.

Scope:

- tests/h264_intra_pred_4x4_ref.c: 9 prediction modes per H.264 spec §8.3.1.4 sub-clauses, FFmpeg-style interface:

      void daedalus_h264_pred_4x4_<name>_ref(uint8_t *dst, ptrdiff_t stride);

  Reads top/top-right/left/top-left neighbours from dst[-stride/-1] offsets, writes 4×4 output at dst[0..3][0..3]. Assumes all 13 neighbour bytes are valid (interior-MB case; availability fallbacks are caller-side per spec).

- tests/test_intra_pred_4x4.c: 10 cases:
  * 9 uniform-context degenerate tests (one per mode), establishing that nothing is structurally broken (all output cells must equal the uniform input value).
  * 1 asymmetric Vertical_Right sanity test with 16 distinct expected cells hand-computed from spec §8.3.1.4.6. This is the gate that really exercises orientation and row/col arithmetic.

- CMakeLists.txt: new test_intra_pred_4x4 binary (no daedalus_core dependency; pure-CPU library doesn't need a context to construct).
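To make the interface in the scope list concrete, here is a minimal sketch of three of the nine modes (Vertical, Horizontal, DC) plus a uniform-context check in the spirit of the degenerate tests. It is an illustration under stated assumptions, not the code in this PR: every `sketch_*` name, the `SK_STRIDE` buffer layout, and the 0xAA poison value are hypothetical and chosen only for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch; NOT the daedalus_h264_pred_4x4_*_ref code from this PR.
 * All functions read neighbours relative to dst and write dst[0..3][0..3]. */

/* Vertical (mode 0): each row copies the 4 samples directly above the block. */
static void sketch_pred_4x4_vertical(uint8_t *dst, ptrdiff_t stride)
{
    const uint8_t *top = dst - stride;
    for (int y = 0; y < 4; y++)
        memcpy(dst + y * stride, top, 4);
}

/* Horizontal (mode 1): each row is filled with its left neighbour. */
static void sketch_pred_4x4_horizontal(uint8_t *dst, ptrdiff_t stride)
{
    for (int y = 0; y < 4; y++)
        memset(dst + y * stride, dst[y * stride - 1], 4);
}

/* DC (mode 2), all neighbours available: rounded mean of 4 top + 4 left. */
static void sketch_pred_4x4_dc(uint8_t *dst, ptrdiff_t stride)
{
    unsigned sum = 4; /* rounding term for the >> 3 below */
    for (int i = 0; i < 4; i++)
        sum += dst[i - stride] + dst[i * stride - 1];
    for (int y = 0; y < 4; y++)
        memset(dst + y * stride, (int)(sum >> 3), 4);
}

enum { SK_STRIDE = 16 }; /* assumed row pitch of the scratch buffer */

/* Fill the 13 neighbour bytes (top-left, 4 top, 4 top-right, 4 left) with v,
 * poison everything else, and return a pointer to the 4x4 block origin. */
static uint8_t *sketch_make_uniform(uint8_t buf[5 * SK_STRIDE], uint8_t v)
{
    memset(buf, 0xAA, 5 * SK_STRIDE);
    memset(buf, v, 9);                  /* row above: TL + top + top-right */
    for (int y = 1; y <= 4; y++)
        buf[y * SK_STRIDE] = v;         /* left column */
    return buf + SK_STRIDE + 1;
}

/* Return 1 if all 16 output cells equal v, else 0. */
static int sketch_block_is(const uint8_t *dst, ptrdiff_t stride, uint8_t v)
{
    for (int y = 0; y < 4; y++)
        for (int x = 0; x < 4; x++)
            if (dst[y * stride + x] != v)
                return 0;
    return 1;
}

typedef void (*sketch_pred_fn)(uint8_t *dst, ptrdiff_t stride);

/* Uniform-context check: returns the number of sketched modes whose output
 * is not uniformly v when every neighbour is v. */
static int sketch_uniform_failures(uint8_t v)
{
    static const sketch_pred_fn fns[] = {
        sketch_pred_4x4_vertical,
        sketch_pred_4x4_horizontal,
        sketch_pred_4x4_dc,
    };
    int failures = 0;
    for (size_t i = 0; i < sizeof fns / sizeof fns[0]; i++) {
        uint8_t buf[5 * SK_STRIDE];
        uint8_t *dst = sketch_make_uniform(buf, v);
        fns[i](dst, SK_STRIDE);
        failures += !sketch_block_is(dst, SK_STRIDE, v);
    }
    return failures;
}
```

Calling `sketch_uniform_failures(v)` for a few values of `v` should return 0 if the three sketched modes are structurally sound. A uniform context cannot catch orientation or indexing mistakes, which is why the PR adds the asymmetric Vertical_Right case on top.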
Verified on hertz:

    $ ./build/test_intra_pred_4x4
    Vertical (mode 0) PASS
    Horizontal (mode 1) PASS
    DC (mode 2) PASS
    DiagDownLeft (mode 3) PASS
    DiagDownRight (mode 4) PASS
    VerticalRight (mode 5) PASS
    HorizontalDown (mode 6) PASS
    VerticalLeft (mode 7) PASS
    HorizontalUp (mode 8) PASS
    VR asym (sanity) PASS
    ALL 10 intra-4x4 mode references PASS

The VR asym test passed first try; the DC test failed on the first attempt because my test expectation miscomputed the rounding shift (I wrote 4, actual is 2 = (16+4)>>3). Fixed in the test. Reference itself never had the bug.

What this does NOT cover (next-step backlog):

- Intra_16x16 luma prediction (4 modes per H.264 §8.3.2): vertical, horizontal, DC, plane.
- Intra_8x8 chroma prediction (4 modes per H.264 §8.3.3): DC, horizontal, vertical, plane.
- Intra_8x8 luma prediction (High profile, 9 modes per §8.3.2.1). These are the High-profile siblings of the modes in this PR with the 1-2-1 smoothing pre-filter. Different but well-defined.
- Neighbour availability fallback (top-edge MB, left-edge MB, slice-boundary, top-right unavailable in some positions).
- Dispatch wrappers: these refs aren't surfaced through daedalus_dispatch_*(). Whether to do that depends on the daedalus-decoder Stage 2a architecture (per-block CPU vs per-diagonal GPU wavefront, TBD).
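For reference, the DC rounding mentioned above and the neighbour-availability fallback from the backlog can be sketched together. This is a hedged illustration of how the DC rule is commonly stated for 8-bit samples (both sides available: (sum of 8 + 4) >> 3; one side only: (sum of 4 + 2) >> 2; neither: 128). It is not part of this PR, and the `sketch_*` names, the NULL-for-unavailable convention, and the availability flags are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper, not in this PR.
 * top:  pointer to the 4 samples above the block, or NULL if unavailable.
 * left: pointer to the sample left of row 0, or NULL if unavailable;
 *       successive left samples are `stride` bytes apart. */
static uint8_t sketch_dc4x4_value(const uint8_t *top, const uint8_t *left,
                                  ptrdiff_t stride)
{
    unsigned sum = 0;
    if (top)
        for (int x = 0; x < 4; x++)
            sum += top[x];
    if (left)
        for (int y = 0; y < 4; y++)
            sum += left[y * stride];

    if (top && left)
        return (uint8_t)((sum + 4) >> 3); /* 8 samples: e.g. (16+4)>>3 == 2 */
    if (top || left)
        return (uint8_t)((sum + 2) >> 2); /* 4 samples from one side only */
    return 128;                           /* mid-grey for 8-bit samples */
}

/* Hypothetical DC predictor with caller-supplied availability flags,
 * using the same dst/stride convention as the reference interface. */
static void sketch_pred_4x4_dc_avail(uint8_t *dst, ptrdiff_t stride,
                                     int top_avail, int left_avail)
{
    uint8_t dc = sketch_dc4x4_value(top_avail ? dst - stride : NULL,
                                    left_avail ? dst - 1 : NULL,
                                    stride);
    for (int y = 0; y < 4; y++)
        memset(dst + y * stride, dc, 4);
}

/* Demo helper: every byte of a scratch buffer is v, so all neighbours are v.
 * Returns the bottom-right predicted cell. */
static uint8_t sketch_dc_uniform(uint8_t v, int top_avail, int left_avail)
{
    enum { S = 8 };
    uint8_t buf[5 * S];
    memset(buf, v, sizeof buf);
    uint8_t *dst = buf + S + 1;
    sketch_pred_4x4_dc_avail(dst, S, top_avail, left_avail);
    return dst[3 * S + 3];
}
```

With all eight neighbours equal to 2, `sketch_dc_uniform(2, 1, 1)` evaluates (16+4)>>3 and yields 2, matching the corrected test expectation; with neither side available the result is 128 regardless of the buffer contents.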