daedalus-decoder/CMakeLists.txt
claude-noether 44e92fa3dc Stage 2 PR-A3b: real H.264 coefficients through daedalus-decoder, byte-exact
Final option-A deliverable.  CLI now extracts real per-MB
coefficients from libavcodec via the inspection callback +
side-buffer (marfrit-packages 0016 + 0017), reconstructs the
pre-residual predicted samples P via inverse-of-IDCT-add, and
feeds daedalus-decoder with real (P, C, no edges).  Daedalus
output is BYTE-EXACT against libavcodec's pre-deblock AVFrame
across 5 frames at 320x240 and 3 frames at 1920x1088, on all three
substrates (auto / cpu / qpu).

Path summary
------------

avctx->thread_count = 1                  (single-threaded decode — 0017's
                                           side buffer is per-H264Context;
                                           multi-threaded would race)
avctx->skip_loop_filter = AVDISCARD_ALL  (AVFrame stays pre-deblock so the
                                           P-recovery subtraction is exact)
ff_h264_set_mb_inspect_cb               (registers the callback)

Inspection callback (per MB, fires post-hl_decode_mb):
  - Gate on IS_INTRA4x4 && !IS_8x8DCT && !IS_INTRA_PCM (MBs that fail
    the gate fall back to identity-passthrough in the main loop)
  - Snapshot pre-deblock pixels from h->cur_pic.f->data[0]
  - Read coefficients from h->mb_inspect_coeffs (= sl->mb copy, the
    0017 side buffer)
  - For each 4x4 block (16/MB in raster order, indexed via
    raster_to_zscan[] to find its slot in the z-scan-ordered side
    buffer): compute IDCT(C) using a transcribed H.264 C reference,
    derive P = clip(pre_deblock - ((IDCT + 32) >> 6))
  - Stash per-MB capture (P + C) for the main loop
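
The per-block recovery in the callback steps above can be sketched in C as follows. This is a minimal illustration, not the code from this PR: the function names, the contents of the `raster_to_zscan` table (taken here to be the usual H.264 4x4 block scan inside a macroblock), the coefficient layout (`c[4*row + col]`, 16 coefficients per z-scan slot) and the 16x16 raster `pred` buffer are all assumptions. The transform is the standard H.264 4x4 inverse integer transform with the `(x + 32) >> 6` rounding quoted above.

```c
#include <assert.h>
#include <stdint.h>

/* Raster index (4*by + bx) of a 4x4 luma block -> its z-scan slot.
 * Assumed contents: the usual H.264 block scan within a macroblock. */
static const uint8_t raster_to_zscan[16] = {
     0,  1,  4,  5,
     2,  3,  6,  7,
     8,  9, 12, 13,
    10, 11, 14, 15,
};

/* H.264 4x4 inverse integer transform: row pass, then column pass.
 * Assumed layout is [4*row + col]; `out` holds the unrounded result.
 * Relies on arithmetic right shift of negative ints (GCC/Clang behaviour). */
static void idct4x4(const int16_t in[16], int out[16])
{
    int t[16];
    for (int r = 0; r < 4; r++) {
        const int16_t *b = in + 4 * r;
        int z0 = b[0] + b[2], z1 = b[0] - b[2];
        int z2 = (b[1] >> 1) - b[3], z3 = b[1] + (b[3] >> 1);
        t[4 * r + 0] = z0 + z3; t[4 * r + 1] = z1 + z2;
        t[4 * r + 2] = z1 - z2; t[4 * r + 3] = z0 - z3;
    }
    for (int c = 0; c < 4; c++) {
        int z0 = t[c] + t[8 + c], z1 = t[c] - t[8 + c];
        int z2 = (t[4 + c] >> 1) - t[12 + c], z3 = t[4 + c] + (t[12 + c] >> 1);
        out[c]     = z0 + z3; out[4 + c]  = z1 + z2;
        out[8 + c] = z1 - z2; out[12 + c] = z0 - z3;
    }
}

static uint8_t clip_u8(int v)
{
    return v < 0 ? 0 : v > 255 ? 255 : (uint8_t)v;
}

/* Recover the predicted luma samples of one macroblock:
 *   P = clip(pre_deblock - ((IDCT(C) + 32) >> 6))
 * recon:     top-left pre-deblock luma sample of the MB
 * mb_coeffs: 16 slots of 16 coefficients, slots in z-scan order (assumed)
 * pred:      16x16 output in raster order */
static void recover_pred_mb(const uint8_t *recon, int stride,
                            const int16_t mb_coeffs[256], uint8_t pred[256])
{
    assert(recon && mb_coeffs && pred && stride >= 16);
    for (int r = 0; r < 16; r++) {
        int bx = r & 3, by = r >> 2;
        int res[16];
        idct4x4(mb_coeffs + 16 * raster_to_zscan[r], res);
        for (int y = 0; y < 4; y++) {
            for (int x = 0; x < 4; x++) {
                int px = 4 * bx + x, py = 4 * by + y;
                int residual = (res[4 * y + x] + 32) >> 6;
                pred[16 * py + px] = clip_u8(recon[py * stride + px] - residual);
            }
        }
    }
}
```

For a DC-only block with coefficient 64, every residual sample is (64 + 32) >> 6 = 1, so a pre-deblock value of 50 yields P = 49 across that 4x4 block and 50 everywhere else.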

Main loop:
  - Default identity-passthrough (predicted = AVFrame pixels, coeffs = 0)
  - For real-coeffs-valid MBs: override luma with captured P + C
  - flush_frame, byte-exact compare against AVFrame
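
The byte-exact compare in the last step amounts to counting differing bytes per plane, which is what figures such as "Y diff 0/76800" report (76800 = 320 x 240). A minimal sketch in C, assuming a hypothetical helper name and signature (the commit does not show the CLI's actual comparison code); each plane carries its own stride, and only the visible `width` bytes of each row are compared:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Count bytes that differ between two planes of width x height samples.
 * Row padding beyond `width` is ignored. Strides are in bytes and may
 * differ between the two buffers (hypothetical helper, for illustration). */
static size_t plane_diff_count(const uint8_t *a, int a_stride,
                               const uint8_t *b, int b_stride,
                               int width, int height)
{
    size_t diff = 0;
    assert(a && b && width >= 0 && height >= 0);
    for (int y = 0; y < height; y++) {
        const uint8_t *ra = a + (ptrdiff_t)y * a_stride;
        const uint8_t *rb = b + (ptrdiff_t)y * b_stride;
        for (int x = 0; x < width; x++)
            diff += (ra[x] != rb[x]);
    }
    return diff;
}
```

A frame passes when the count is zero for the Y plane and for both chroma planes.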

A diagnostic also asserts (silently when passing) that the
callback's pre_deblock snapshot equals AVFrame at each real-coeffs
MB position — i.e. h->cur_pic.f IS the eventual AVFrame buffer
under skip_loop_filter=AVDISCARD_ALL with thread_count=1.

Bug hunted in this PR
---------------------

Initial implementation transposed the coefficients from row-major
(sl->mb) to "column-major" (the layout that daedalus_decoder.h's
mb_input.coeffs docstring describes).  This caused ~0.2% Y pixel
divergence on real streams (~150/frame at 320x240).  Root cause
identified via a standalone /tmp/idct_compare.c harness running
daedalus's C ref IDCT and FFmpeg's reference C IDCT on identical
int16[16] inputs: outputs IDENTICAL.  The two functions implement
the spec H.264 IDCT on the array regardless of layout
interpretation; the "column-major" label is decoration.  Removed
the transpose; PR is now byte-exact.
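
The kind of harness described above can be sketched as follows. This is not /tmp/idct_compare.c itself, and the two transforms are stand-ins rather than daedalus's or FFmpeg's code: one adds the rounding bias per sample at the end, the other folds it into the DC coefficient before the passes, a common restructuring that gives identical integers. All names are hypothetical. The point is the method: feed both functions the same int16[16] array, untouched, and compare outputs bit for bit.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef void (*idct4x4_fn)(const int16_t in[16], int out[16]);

/* Shared butterflies: row pass then column pass, in place on b[4*row + col].
 * Relies on arithmetic right shift of negative ints (GCC/Clang behaviour). */
static void idct_passes(int b[16])
{
    for (int r = 0; r < 4; r++) {
        int *p = b + 4 * r;
        int z0 = p[0] + p[2], z1 = p[0] - p[2];
        int z2 = (p[1] >> 1) - p[3], z3 = p[1] + (p[3] >> 1);
        p[0] = z0 + z3; p[1] = z1 + z2; p[2] = z1 - z2; p[3] = z0 - z3;
    }
    for (int c = 0; c < 4; c++) {
        int *p = b + c;
        int z0 = p[0] + p[8], z1 = p[0] - p[8];
        int z2 = (p[4] >> 1) - p[12], z3 = p[4] + (p[12] >> 1);
        p[0] = z0 + z3; p[4] = z1 + z2; p[8] = z1 - z2; p[12] = z0 - z3;
    }
}

/* Stand-in A: rounding bias added to every sample after the passes. */
static void idct_round_at_end(const int16_t in[16], int out[16])
{
    for (int i = 0; i < 16; i++) out[i] = in[i];
    idct_passes(out);
    for (int i = 0; i < 16; i++) out[i] = (out[i] + 32) >> 6;
}

/* Stand-in B: rounding bias folded into the DC term before the passes. */
static void idct_bias_in_dc(const int16_t in[16], int out[16])
{
    for (int i = 0; i < 16; i++) out[i] = in[i];
    out[0] += 32;
    idct_passes(out);
    for (int i = 0; i < 16; i++) out[i] >>= 6;
}

/* Small LCG so the run is reproducible without any library RNG. */
static uint32_t lcg_next(uint32_t *s)
{
    *s = *s * 1664525u + 1013904223u;
    return *s;
}

/* Run both transforms on identical inputs; return how many trials differ. */
static int count_idct_mismatches(idct4x4_fn fa, idct4x4_fn fb,
                                 int trials, uint32_t seed)
{
    int mismatches = 0;
    for (int t = 0; t < trials; t++) {
        int16_t c[16];
        int oa[16], ob[16];
        for (int i = 0; i < 16; i++)   /* coefficients in [-2048, 2047] */
            c[i] = (int16_t)((int)(lcg_next(&seed) >> 20) - 2048);
        fa(c, oa);
        fb(c, ob);
        if (memcmp(oa, ob, sizeof oa) != 0)
            mismatches++;
    }
    return mismatches;
}
```

A mismatch count of zero over many random blocks says the two functions agree on the raw array, so any layout relabelling applied on one side only (such as the transpose removed here) can only introduce divergence.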

Follow-up task #184: clarify daedalus_decoder.h's mb_input.coeffs
docstring so future integrators don't repeat this transpose
mistake.

Result on hertz (Pi 5 V3D 7.1)
------------------------------

testsrc2 I-only via libx264 -bf 0 -g 1:

  320x240,    5 frames, substrate=auto:  Y diff 0/76800,    UV diff 0/38400   PASS
  320x240,    5 frames, substrate=cpu:   Y diff 0/76800,    UV diff 0/38400   PASS
  320x240,    5 frames, substrate=qpu:   Y diff 0/76800,    UV diff 0/38400   PASS
  1920x1088,  3 frames, substrate=auto:  Y diff 0/2088960,  UV diff 0/1044480 PASS

Real-coeffs path engaged for 77-95 MBs per 320x240 frame and
598-643 MBs per 1080p frame (testsrc2 is mostly flat → many
Intra_16x16 MBs that fall back to identity passthrough; richer
content streams would engage real-coeffs more).

Followups
---------

  - PR-A4: extend the gate to Intra_16x16 (chroma DC Hadamard +
    Intra_16x16 luma DC Hadamard pre-pass) — currently ~30-60%
    of MBs fall back to identity-passthrough due to this.
  - PR-A5: extend to 8x8 transform (separate IDCT 8x8 dispatch
    path on the daedalus-decoder side, similar plumbing).
  - PR-A6: enable libavcodec's deblock (skip_loop_filter=AVDISCARD_NONE)
    and have daedalus's deblock produce the post-deblock output
    that matches AVFrame.  Closes the loop on the full I-only
    pipeline.
  - Task #184: daedalus_decoder.h coeffs docstring clarification.
2026-05-26 11:19:11 +02:00


# SPDX-License-Identifier: BSD-2-Clause
#
# daedalus-decoder — frame-level GPU H.264 decoder for V3D7 (Pi 5).
# Phase 1 scaffold; see DESIGN.md for architecture.
#
# Build dependencies:
#   - daedalus-fourier ≥ 0.1.0 (kernel pack, V3D primitives + recipe layer)
#     resolved via pkg-config; install via the daedalus-fourier upstream
#     `cmake --install` rule (PR #5 made the .pc relocatable, so any
#     install prefix works as long as $PKG_CONFIG_PATH is set).
#   - Vulkan headers + libvulkan (pulled in transitively via
#     daedalus-fourier, listed here explicitly for the link order).
#
# Build:
#   cmake -B build -G Ninja -DCMAKE_BUILD_TYPE=Release
#   cmake --build build
#   ctest --test-dir build
cmake_minimum_required(VERSION 3.20)
project(daedalus-decoder
    VERSION 0.0.1
    DESCRIPTION "Frame-level GPU H.264 decoder for Raspberry Pi 5 / V3D7"
    LANGUAGES C)
set(CMAKE_C_STANDARD 11)
set(CMAKE_C_STANDARD_REQUIRED ON)
set(CMAKE_C_EXTENSIONS OFF)
if(NOT CMAKE_BUILD_TYPE)
    set(CMAKE_BUILD_TYPE Release)
endif()
# Pi 5 is the only supported target. Other aarch64 SoCs (Pi 4 V3D4,
# RK3588 Mali, …) might work but would need explicit substrate +
# shader-pack validation per the daedalus-fourier architecture
# backlog. Don't pretend to support what we haven't validated.
if(NOT CMAKE_SYSTEM_PROCESSOR MATCHES "aarch64")
    message(WARNING
        "daedalus-decoder is designed for aarch64 (Pi 5 BCM2712 / V3D7). "
        "Build will proceed but is unlikely to function.")
endif()
add_compile_options(-Wall -Wextra -Wno-unused-parameter)
# ---- Dependencies --------------------------------------------------
find_package(PkgConfig REQUIRED)
# daedalus-fourier — find_package via pkg-config per the Phase 1
# decision §9.6. Minimum version 0.1.0 (the cycle 6-9 shaders + pool
# + recipe-flip baseline). PKG_CONFIG_PATH should point at the
# directory holding daedalus-fourier.pc (e.g. /usr/local/lib/pkgconfig
# or a custom install prefix).
pkg_check_modules(DAEDALUS_FOURIER REQUIRED daedalus-fourier>=0.1.0)
# Vulkan — daedalus-fourier already depends on this; we add it
# explicitly so the link order stays correct (daedalus-fourier static
# archive contains undefined vk* symbols that the loader resolves).
find_package(Vulkan REQUIRED)
# ---- Version string baked into the library ------------------------
# git rev tagged onto the version string for traceability; degrades
# gracefully to bare semver if git isn't available.
execute_process(
    COMMAND git -C ${CMAKE_CURRENT_SOURCE_DIR} rev-parse --short=7 HEAD
    OUTPUT_VARIABLE DAEDALUS_DECODER_GITREV
    OUTPUT_STRIP_TRAILING_WHITESPACE
    ERROR_QUIET)
if(DAEDALUS_DECODER_GITREV)
    set(DAEDALUS_DECODER_VERSION "${PROJECT_VERSION}+g${DAEDALUS_DECODER_GITREV}")
else()
    set(DAEDALUS_DECODER_VERSION "${PROJECT_VERSION}")
endif()
message(STATUS "daedalus-decoder version: ${DAEDALUS_DECODER_VERSION}")
# ---- Library ------------------------------------------------------
add_library(daedalus_decoder STATIC
    src/daedalus_decoder.c
)
target_include_directories(daedalus_decoder
    PUBLIC
        $<BUILD_INTERFACE:${CMAKE_CURRENT_SOURCE_DIR}/include>
        $<INSTALL_INTERFACE:include>
    PRIVATE
        src
        ${DAEDALUS_FOURIER_INCLUDE_DIRS}
)
target_link_directories(daedalus_decoder
    PUBLIC
        ${DAEDALUS_FOURIER_LIBRARY_DIRS}
)
target_link_libraries(daedalus_decoder
    PUBLIC
        # Order matters: the daedalus-fourier static archive references
        # vulkan symbols, so the linker needs daedalus-fourier first and
        # vulkan after it to resolve them.
        ${DAEDALUS_FOURIER_LIBRARIES}
        Vulkan::Vulkan
)
target_compile_definitions(daedalus_decoder
    PRIVATE
        DAEDALUS_DECODER_VERSION="${DAEDALUS_DECODER_VERSION}"
)
target_compile_options(daedalus_decoder PRIVATE -O2)
# ---- Smoke test ---------------------------------------------------
enable_testing()
add_executable(test_smoke tests/test_smoke.c)
target_link_libraries(test_smoke PRIVATE daedalus_decoder)
target_compile_options(test_smoke PRIVATE -O2)
add_test(NAME smoke COMMAND test_smoke)
add_executable(test_idct_bitexact tests/test_idct_bitexact.c)
target_link_libraries(test_idct_bitexact PRIVATE daedalus_decoder)
target_compile_options(test_idct_bitexact PRIVATE -O2)
# 320x240 QVGA — fast inner-loop test (300 MBs, sub-second).
add_test(NAME idct_bitexact COMMAND test_idct_bitexact)
# Same QVGA test re-run on the CPU NEON path (forces fallback even on
# V3D7 hosts). Catches silent drift between the V3D shader and the
# NEON reference path — both must produce identical output for the
# same coefficient input. Also keeps the bit-exact gate alive on
# hosts without V3D7 (CI runners, x86 dev boxes).
add_test(NAME idct_bitexact_cpu COMMAND test_idct_bitexact 320 240
         0xfeedface5a5a5a5a cpu)
# 1920x1088 1080p — deployment-scale test (8160 MBs, ~0.25 s on hertz).
# Validates the per-MB block index + pixel offset math at full coded
# height (1088, not 1080 — see daedalus_decoder.h on H.264 coded vs
# displayed dims). Cheap enough to run unconditionally; if it ever
# gets slow we'll split into a CTest LABEL for opt-in.
add_test(NAME idct_bitexact_1080p COMMAND test_idct_bitexact 1920 1088)
# ---- Stage 2 PR-b deblock smoke ------------------------------------
#
# Validates flush_frame's per-frame deblock dispatch (luma + chroma,
# V + H, bS<4 + bS=4 intra — up to 8 dispatches added after IDCT).
# Strategy: same input through substrate=CPU and substrate=QPU, assert
# byte-exact match (transitive bit-exact gate — daedalus-fourier's own
# test_api_h264 already validates each substrate against a C reference,
# so CPU-QPU equivalence here means both match the spec). Plus an
# anti-no-op check: run a third pass with edges removed and assert
# different output, proving deblock actually ran.
add_executable(test_deblock_smoke tests/test_deblock_smoke.c)
target_link_libraries(test_deblock_smoke PRIVATE daedalus_decoder)
target_compile_options(test_deblock_smoke PRIVATE -O2)
add_test(NAME deblock_smoke COMMAND test_deblock_smoke)
# ---- Benchmarks (not gated by ctest) ------------------------------
#
# Build-time only; user runs them by hand when checking perf. Adding
# them as ctest would make every CI run slow and the numbers would
# get drowned in pass/fail noise. See the header of each .c for what
# they measure.
add_executable(bench_flush_frame tests/bench_flush_frame.c)
target_link_libraries(bench_flush_frame PRIVATE daedalus_decoder)
target_compile_options(bench_flush_frame PRIVATE -O2)
# ---- Tools (not gated by ctest; opt-in via DAEDALUS_BUILD_TOOLS) ----
#
# daedalus_decode_h264 — option A standalone test harness that
# wraps libavcodec + daedalus-decoder and bit-exact-compares their
# outputs on real H.264 streams. Identity-passthrough mode in this
# first iteration (predicted = AVFrame pixels, coeffs = 0, no
# deblock edges); follow-up PRs use the per-MB inspection callback
# (marfrit-packages patch 0016) to feed REAL per-MB state.
#
# Requires libavcodec + libavformat headers + libs. Off by default
# so the standard ctest build doesn't pull in FFmpeg as a hard dep.
option(DAEDALUS_BUILD_TOOLS "Build daedalus-decoder CLI tools (requires libavcodec)" OFF)
if(DAEDALUS_BUILD_TOOLS)
    # Optional path to a private FFmpeg install carrying the per-MB
    # inspection callback (marfrit-packages patch 0016). When set,
    # the CLI links against it instead of the system FFmpeg and the
    # inspection-callback code path is compiled in.
    set(DAEDALUS_FFMPEG_PREFIX "" CACHE PATH
        "Path to a patched FFmpeg install (with 0016 mb-inspect-callback) for daedalus_decode_h264. Empty = use system pkg-config FFmpeg.")
    if(DAEDALUS_FFMPEG_PREFIX)
        message(STATUS "daedalus_decode_h264: patched FFmpeg at ${DAEDALUS_FFMPEG_PREFIX}")
        set(FFMPEG_INCLUDE_DIRS ${DAEDALUS_FFMPEG_PREFIX}/include)
        set(FFMPEG_LIBRARY_DIRS ${DAEDALUS_FFMPEG_PREFIX}/lib)
        # Patched libavcodec is built static (no shared libs in the private prefix).
        # System pull-ins are still needed for libav* dependencies.
        set(FFMPEG_LIBRARIES
            ${DAEDALUS_FFMPEG_PREFIX}/lib/libavformat.a
            ${DAEDALUS_FFMPEG_PREFIX}/lib/libavcodec.a
            ${DAEDALUS_FFMPEG_PREFIX}/lib/libavutil.a
            ${DAEDALUS_FFMPEG_PREFIX}/lib/libswresample.a
            m z pthread)
        set(FFMPEG_CFLAGS_OTHER "-DDAEDALUS_HAVE_H264_MB_INSPECT_CB=1")
        # PR-A3+ optional: also point at the patched FFmpeg SOURCE TREE
        # so the CLI can include libavcodec/h264dec.h directly and
        # dereference H264Context fields (the side-buffer mb_inspect_coeffs
        # added in marfrit-packages patch 0017, the cur_pic.f for
        # pre-deblock pixel access, etc.). When set, the internal-header
        # include codepath is compiled in.
        set(DAEDALUS_FFMPEG_SRC "" CACHE PATH
            "Path to patched FFmpeg source tree (= path to FFmpeg/ checkout where build was run; contains config.h + libavcodec/h264dec.h). Empty = h264dec.h includes are disabled.")
        if(DAEDALUS_FFMPEG_SRC)
            message(STATUS "daedalus_decode_h264: FFmpeg source at ${DAEDALUS_FFMPEG_SRC}")
            # IMPORTANT: source tree FIRST in -I order — its
            # libavutil/common.h does #include "intmath.h" with HAVE_AV_CONFIG_H,
            # which resolves to libavutil/intmath.h (in the source tree
            # only — that header isn't installed since it's arch-dispatched).
            # The installed-prefix include path's libavutil/common.h is the
            # same file textually but resolves "intmath.h" against the
            # install dir where it doesn't exist.
            set(FFMPEG_INCLUDE_DIRS ${DAEDALUS_FFMPEG_SRC} ${FFMPEG_INCLUDE_DIRS})
            set(FFMPEG_CFLAGS_OTHER
                "${FFMPEG_CFLAGS_OTHER} -DDAEDALUS_HAVE_H264_MB_INSPECT_COEFFS=1 -DHAVE_AV_CONFIG_H")
            # Convert space-separated string to list (CMake idiom for compile flags).
            separate_arguments(FFMPEG_CFLAGS_OTHER UNIX_COMMAND "${FFMPEG_CFLAGS_OTHER}")
        endif()
    else()
        pkg_check_modules(FFMPEG REQUIRED libavcodec libavformat libavutil)
        message(STATUS "daedalus_decode_h264: system FFmpeg (no inspection callback)")
    endif()
    add_executable(daedalus_decode_h264 tools/daedalus_decode_h264.c)
    target_link_libraries(daedalus_decode_h264
        PRIVATE daedalus_decoder ${FFMPEG_LIBRARIES})
    target_include_directories(daedalus_decode_h264
        PRIVATE ${FFMPEG_INCLUDE_DIRS})
    target_link_directories(daedalus_decode_h264
        PRIVATE ${FFMPEG_LIBRARY_DIRS})
    target_compile_options(daedalus_decode_h264
        PRIVATE -O2 ${FFMPEG_CFLAGS_OTHER})
endif()
# ---- Install ------------------------------------------------------
#
# Library + public header. Stage 2/3 will add a pkg-config file and
# CMake config exports once the API stabilises; pre-0.1 the scaffold
# install just gives the static archive a home.
include(GNUInstallDirs)
install(TARGETS daedalus_decoder
    ARCHIVE DESTINATION ${CMAKE_INSTALL_LIBDIR})
install(FILES include/daedalus_decoder.h
    DESTINATION ${CMAKE_INSTALL_INCLUDEDIR})