Commit Graph

10 Commits

Author SHA1 Message Date
marfrit df339c07fd Install daedalus-decoder.pc for sibling consumers
Adds pkg-config plumbing so consumers (daedalus-v4l2 daemon for the
upcoming PR-Q3a shadow-mode wiring; the daedalus_decode_h264 CLI when
built outside this tree) can locate libdaedalus_decoder.a + the public
header via pkg_check_modules / pkg-config.

Mirrors daedalus-fourier's relocatable-prefix scheme: prefix is derived
from ${pcfiledir} so cmake --install --prefix /foo produces a .pc that
resolves to /foo at lookup time. Verified across two install prefixes.

daedalus-fourier is declared as a public Requires: because consumers
static-linking libdaedalus_decoder.a also need libdaedalus_core.a in
their link line to resolve the daedalus_ctx_* / daedalus_recipe_*
symbols this archive references.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-26 13:32:58 +02:00
claude-noether 44e92fa3dc Stage 2 PR-A3b: real H.264 coefficients through daedalus-decoder, byte-exact
Final option-A deliverable.  CLI now extracts real per-MB
coefficients from libavcodec via the inspection callback +
side-buffer (marfrit-packages 0016 + 0017), reconstructs the
pre-residual predicted samples P via inverse-of-IDCT-add, and
feeds daedalus-decoder with real (P, C, no edges).  Daedalus
output BYTE-EXACT against libavcodec's pre-deblock AVFrame
across 5 frames at 320x240 on all three substrates (auto / cpu / qpu)
and 3 frames at 1920x1088 on substrate=auto.

Path summary
------------

avctx->thread_count = 1                  (single-threaded decode — 0017's
                                           side buffer is per-H264Context;
                                           multi-threaded would race)
avctx->skip_loop_filter = AVDISCARD_ALL  (AVFrame stays pre-deblock so the
                                           P-recovery subtraction is exact)
ff_h264_set_mb_inspect_cb               (registers the callback)

Inspection callback (per MB, fires post-hl_decode_mb):
  - Gate on IS_INTRA4x4 && !IS_8x8DCT && !IS_INTRA_PCM (skipped MBs
    fall back to identity-passthrough in the main loop)
  - Snapshot pre-deblock pixels from h->cur_pic.f->data[0]
  - Read coefficients from h->mb_inspect_coeffs (= sl->mb copy, the
    0017 side buffer)
  - For each 4x4 block (16/MB in raster order, indexed via
    raster_to_zscan[] to find its slot in the z-scan-ordered side
    buffer): compute IDCT(C) using a transcribed H.264 C reference,
    derive P = clip(pre_deblock - ((IDCT + 32) >> 6))
  - Stash per-MB capture (P + C) for the main loop
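
The P-recovery step in the list above can be sketched as follows.  This
is a minimal illustration, not the CLI's code: it assumes a row-major
int16[16] block of already-dequantised coefficients, and the names
clip255, butterfly4, residual4x4 and recover_pred4x4 are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Clamp to the 8-bit sample range. */
static uint8_t clip255(int v)
{
    return (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
}

/* One 1D pass of the H.264 4x4 inverse-transform butterfly.
 * Assumes arithmetic right shift of negative ints (true on mainstream
 * compilers, but implementation-defined in ISO C). */
static void butterfly4(const int in[4], int out[4])
{
    int e0 = in[0] + in[2];
    int e1 = in[0] - in[2];
    int e2 = (in[1] >> 1) - in[3];
    int e3 = in[1] + (in[3] >> 1);
    out[0] = e0 + e3;
    out[1] = e1 + e2;
    out[2] = e1 - e2;
    out[3] = e0 - e3;
}

/* Residual r = (IDCT(c) + 32) >> 6 for one 4x4 block (row-major here). */
static void residual4x4(const int16_t c[16], int r[16])
{
    int tmp[16];
    for (int i = 0; i < 4; i++) {              /* horizontal pass */
        int in[4] = { c[4*i], c[4*i + 1], c[4*i + 2], c[4*i + 3] };
        butterfly4(in, &tmp[4*i]);
    }
    for (int j = 0; j < 4; j++) {              /* vertical pass */
        int in[4] = { tmp[j], tmp[4 + j], tmp[8 + j], tmp[12 + j] };
        int out[4];
        butterfly4(in, out);
        for (int i = 0; i < 4; i++)
            r[4*i + j] = (out[i] + 32) >> 6;
    }
}

/* Recover predicted samples: P = clip(pre_deblock - residual). */
static void recover_pred4x4(const uint8_t *recon, int stride,
                            const int16_t c[16], uint8_t pred[16])
{
    int r[16];
    assert(stride >= 4);
    residual4x4(c, r);
    for (int y = 0; y < 4; y++)
        for (int x = 0; x < 4; x++)
            pred[4*y + x] = clip255(recon[y*stride + x] - r[4*y + x]);
}
```

The subtraction only round-trips where the forward reconstruction
clip(P + r) did not saturate at 0 or 255; a saturated sample loses the
information needed to recover P.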

Main loop:
  - Default identity-passthrough (predicted = AVFrame pixels, coeffs = 0)
  - For real-coeffs-valid MBs: override luma with captured P + C
  - flush_frame, byte-exact compare against AVFrame

A diagnostic also asserts (silently when passing) that the
callback's pre_deblock snapshot equals AVFrame at each real-coeffs
MB position — i.e. h->cur_pic.f IS the eventual AVFrame buffer
under skip_loop_filter=AVDISCARD_ALL with thread_count=1.

Bug hunted in this PR
---------------------

Initial implementation transposed the coefficients from row-major
(sl->mb) to "column-major" (the layout that daedalus_decoder.h's
mb_input.coeffs docstring describes).  This caused ~0.2% Y pixel
divergence on real streams (~150/frame at 320x240).  Root cause
identified via a standalone /tmp/idct_compare.c harness running
daedalus's C ref IDCT and FFmpeg's reference C IDCT on identical
int16[16] inputs: outputs IDENTICAL.  The two functions implement
the spec H.264 IDCT on the array regardless of layout
interpretation; the "column-major" label is decoration.  Removed
the transpose; PR is now byte-exact.

Follow-up task #184: clarify daedalus_decoder.h's mb_input.coeffs
docstring so future integrators don't repeat this transpose
mistake.

Result on hertz (Pi 5 V3D 7.1)
------------------------------

testsrc2 I-only via libx264 -bf 0 -g 1:

  320x240,    5 frames, substrate=auto:  Y diff 0/76800,    UV diff 0/38400   PASS
  320x240,    5 frames, substrate=cpu:   Y diff 0/76800,    UV diff 0/38400   PASS
  320x240,    5 frames, substrate=qpu:   Y diff 0/76800,    UV diff 0/38400   PASS
  1920x1088,  3 frames, substrate=auto:  Y diff 0/2088960,  UV diff 0/1044480 PASS

Real-coeffs path engaged for 77-95 MBs per 320x240 frame and
598-643 MBs per 1080p frame (testsrc2 is mostly flat → many
Intra_16x16 MBs that fall back to identity passthrough; richer
content streams would engage real-coeffs more).

Followups
---------

  - PR-A4: extend the gate to Intra_16x16 (chroma DC Hadamard +
    Intra_16x16 luma DC Hadamard pre-pass).  Per the counts above,
    roughly 70% of MBs at 320x240 and over 90% at 1080p currently
    fall back to identity-passthrough on testsrc2, largely due to this.
  - PR-A5: extend to 8x8 transform (separate IDCT 8x8 dispatch
    path on the daedalus-decoder side, similar plumbing).
  - PR-A6: enable libavcodec's deblock (skip_loop_filter=AVDISCARD_NONE)
    and have daedalus's deblock produce the post-deblock output
    that matches AVFrame.  Closes the loop on the full I-only
    pipeline.
  - Task #184: daedalus_decoder.h coeffs docstring clarification.
2026-05-26 11:19:11 +02:00
claude-noether 86a28d2a3b Stage 2 PR-A2: per-MB inspection callback wiring + invariant checks
Validates marfrit-packages patch 0016 (PR #106) end-to-end against
the daedalus_decode_h264 CLI.  Callback fires once per macroblock
in coded order; this PR checks the count + uniqueness invariants
WITHOUT yet driving daedalus-decoder differently — that's PR-A3.

Infrastructure landed
---------------------

CMake gains DAEDALUS_FFMPEG_PREFIX option pointing at a private
FFmpeg install carrying patch 0016.  When set, the CLI links
against it (static .a's from $prefix/lib) and the inspection
codepath is compiled in (DAEDALUS_HAVE_H264_MB_INSPECT_CB).  When
unset, the CLI falls back to the pkg-config-discovered system
FFmpeg and behaves as PR-A1b did (identity-passthrough only, no
callback).

The H264Context struct stays opaque (forward-decl only — its
real definition lives in libavcodec's internal h264dec.h which
isn't installed).  Real per-MB state extraction (sl->mb coeffs,
mb_type, intra modes, deblock params) will land in PR-A3
alongside an internal-header include path.

The callback's only job in this PR: assert (mb_x, mb_y) lies in
the coded grid, mark "seen" in a per-frame bitmap, count
invocations.  At end-of-frame: assert seen-count == mb_w*mb_h,
0 duplicates, 0 out-of-bounds.
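
A minimal sketch of the "each MB exactly once" check described above.
The struct and function names are hypothetical, and one byte per MB is
used in place of a packed bitmap for brevity.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical per-frame tracker; not the CLI's actual data structure. */
struct mb_tracker {
    int mb_w, mb_h;
    uint8_t *seen;               /* one byte per MB, 0 = not yet seen */
    unsigned calls, dups, oob;
};

static int mb_tracker_init(struct mb_tracker *t, int mb_w, int mb_h)
{
    assert(mb_w > 0 && mb_h > 0);
    memset(t, 0, sizeof *t);
    t->mb_w = mb_w;
    t->mb_h = mb_h;
    t->seen = calloc((size_t)mb_w * (size_t)mb_h, 1);
    return t->seen ? 0 : -1;
}

/* Called once per callback invocation; order of calls does not matter. */
static void mb_tracker_mark(struct mb_tracker *t, int mb_x, int mb_y)
{
    t->calls++;
    if (mb_x < 0 || mb_y < 0 || mb_x >= t->mb_w || mb_y >= t->mb_h) {
        t->oob++;
        return;
    }
    uint8_t *slot = &t->seen[(size_t)mb_y * (size_t)t->mb_w + (size_t)mb_x];
    if (*slot)
        t->dups++;
    else
        *slot = 1;
}

/* End-of-frame: number of grid positions never reported. */
static unsigned mb_tracker_missing(const struct mb_tracker *t)
{
    unsigned missing = 0;
    size_t n = (size_t)t->mb_w * (size_t)t->mb_h;
    for (size_t i = 0; i < n; i++)
        missing += (t->seen[i] == 0);
    return missing;
}

static void mb_tracker_free(struct mb_tracker *t)
{
    free(t->seen);
    t->seen = NULL;
}
```

A frame passes when missing, dups and oob are all zero, which implies
calls == mb_w * mb_h without requiring raster order.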

Per-frame mb-grid init goes BEFORE first avcodec_send_packet
(callbacks fire from inside send_packet, before the first
receive_frame ever returns — lazy init from AVFrame would miss
all of frame 0).  Dims come from codecpar->width/height rounded
up to the next multiple of 16 (a 1080-row display height is coded
as 1088 rows).
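
The display-to-coded rounding can be sketched as below.  The helper
names are hypothetical; only the arithmetic is taken from the text.

```c
#include <assert.h>

/* Round a display dimension up to the next multiple of 16. */
static int coded_dim(int display)
{
    assert(display >= 0);
    return (display + 15) & ~15;
}

/* Number of 16-pixel macroblocks needed to cover a display dimension. */
static int mb_count(int display)
{
    assert(display >= 0);
    return (display + 15) >> 4;
}
```

For a 1920x1080 stream this gives a 1920x1088 coded frame and a 120x68
MB grid, matching the numbers reported in the results below.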

Raster-order check considered but dropped: libavcodec uses
MB-level threading in some configs so callbacks fire out of
raster order.  The contract is "each MB exactly once", not "in
raster order"; the bitmap check captures that.

Result on hertz (Pi 5, patched FFmpeg at /tmp/ffmpeg-inspect-prefix)
-------------------------------------------------------------------

  320x240 I-only, 3 frames:
    mb-grid 20x15
    callback invocations: 900 (= 3 * 300)
    missing/duplicates/oob: 0/0/0
    identity-passthrough Y diff 0/230400, UV diff 0/115200
    PASS

  1920x1088 I-only, 3 frames:
    mb-grid 120x68
    callback invocations: 24480 (= 3 * 8160)
    missing/duplicates/oob: 0/0/0
    identity-passthrough Y diff 0/6266880, UV diff 0/3133440
    PASS

Followups
---------

  - PR-A3: include libavcodec/h264dec.h via -I to access H264Context
    internals; extract sl->mb coefficients in the callback, compute
    P = pre-deblock pixels - IDCT(C) using a transcribed C reference;
    feed daedalus_decoder with REAL (P, C, edges) instead of identity.
    Use avctx->skip_loop_filter = AVDISCARD_ALL to make libavcodec
    output pre-deblock so the subtraction is exact.
  - PR-A4 onwards: extend to P/B frames + chroma DC + intra prediction
    coverage.
2026-05-26 07:06:31 +02:00
claude-noether 56f8498057 Stage 2 PR-A1b: tools/daedalus_decode_h264 — H.264 standalone test harness
Option A's standalone end-to-end gate against real H.264 streams.
First iteration: identity-passthrough validation — daedalus-decoder
produces output byte-exact to libavcodec's AVFrame when fed the
reconstructed pixels as `predicted`, zero coeffs, no deblock edges.

Validates: daedalus-decoder data path (append_mb + flush_frame +
NV12 output + coded-vs-display dim handling) at real-stream frame
sizes (320x240 and 1920x1088) with real H.264-decoded predicted-
sample distributions — not the random patterns the existing
test_idct_bitexact + test_deblock_smoke synthesize.

Identity-passthrough math:
  - mb_input.predicted = AVFrame pixels at MB raster position
  - mb_input.coeffs    = 384 int16 values, all zero
  - mb_input.edges     = NULL, n_edges = 0
  flush_frame:
    scratch_y/_uv pre-fill from predicted (= AVFrame pixels)
    IDCT dispatches with all-zero coeffs add 0 (no-op compute)
    No deblock dispatches (no edges)
    copy-out → caller's NV12 planes
  Result MUST equal AVFrame pixels byte-for-byte.

Build
-----

New cmake option DAEDALUS_BUILD_TOOLS (default OFF).  When enabled,
pkg-checks libavcodec / libavformat / libavutil and builds the
daedalus_decode_h264 binary against the system FFmpeg.

Stock libavcodec is sufficient for THIS PR (identity passthrough
reads from AVFrame after avcodec_receive_frame; no per-MB internal
state extraction needed).  Follow-up PRs (A2+) will use the per-MB
inspection callback added in marfrit-packages patch 0016 (PR #106)
to feed REAL per-MB state (pre-residual predicted samples, residual
coeffs, deblock edges) for actual non-trivial daedalus-decoder
validation.

Usage
-----

  daedalus_decode_h264 [--substrate cpu|qpu|auto]
                       [--max-frames N]
                       <input.h264> <output_dadec.yuv> <output_ref.yuv>

Exit codes:
  0 = byte-exact match across all frames
  1 = argument / setup error
  2 = decode error from libavcodec
  3 = daedalus-decoder error (ctx, append, flush)
  4 = bit-exact comparison failed

Result on hertz (Pi 5 V3D 7.1)
------------------------------

I-only test clip via ffmpeg testsrc2 + libx264 -bf 0 -g 1:

  320x240, 5 frames:
    substrate=auto:  Y diff 0/76800   UV diff 0/38400   PASS
    substrate=cpu:   Y diff 0/76800   UV diff 0/38400   PASS
    substrate=qpu:   Y diff 0/76800   UV diff 0/38400   PASS

  1920x1088 (coded; 1080 display), 3 frames:
    substrate=auto:  Y diff 0/2088960 UV diff 0/1044480 PASS

Followups
---------

  - PR-A2: wire the per-MB inspection callback (marfrit-packages
    0016) so per-MB state — coeffs (sl->mb), predicted-before-
    residual (from prediction kernels), bS/alpha/beta — flows into
    mb_input instead of zeros, and IDCT / deblock dispatches do
    real GPU work.  At that point we're decoding real H.264 streams
    through daedalus-decoder for real.
  - PR-A3: extend to P/B frames once MC dispatch lands.
2026-05-26 06:12:51 +02:00
claude-noether 92453d7019 wip: deblock smoke test
2026-05-25 23:16:08 +02:00
claude-noether 44ca4e550f phase1: substrate selector API + cross-substrate bit-exact ctest
Surfaces daedalus-fourier's substrate-override capability at the
decoder boundary.  Lets tests run on CPU-only hosts (CI runners,
x86 dev boxes) AND cross-checks V3D shader output against NEON
reference on hosts that have both.

API additions (pre-0.1 ABI, additive):

  - daedalus_decoder_substrate enum { AUTO, CPU, QPU }
    (mirrors daedalus_substrate; isolated for ABI reasons).
  - daedalus_decoder_set_substrate(dec, sub) setter, same
    mid-frame-change restrictions as set_output_format.
  - Default remains AUTO — the only sensible choice for production.

Internal:

  - flush_frame now calls daedalus_dispatch_h264_idct{4,8} with an
    explicit substrate instead of daedalus_recipe_dispatch_*.  Mapped
    via a small map_substrate() helper.  No perf delta on AUTO (recipe
    layer was just doing the same dispatch under the hood).
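
A sketch of what such a mapping helper might look like.  Both enums are
stand-ins: the enumerator names and numeric values are assumptions, and
the core-side values are deliberately ordered differently to show why
an explicit switch is safer than a cast when two enums are kept apart
for ABI reasons.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the decoder-facing enum (names and values assumed). */
typedef enum {
    DEC_SUBSTRATE_AUTO = 0,
    DEC_SUBSTRATE_CPU  = 1,
    DEC_SUBSTRATE_QPU  = 2
} dec_substrate;

/* Stand-in for the core library's enum (names and values assumed). */
typedef enum {
    CORE_SUBSTRATE_AUTO = 0,
    CORE_SUBSTRATE_QPU  = 1,
    CORE_SUBSTRATE_CPU  = 2
} core_substrate;

/* Translate by name, never by numeric value.  Returns 0 on success,
 * -1 for a value outside the decoder-facing enum. */
static int sketch_map_substrate(dec_substrate in, core_substrate *out)
{
    assert(out != NULL);
    switch (in) {
    case DEC_SUBSTRATE_AUTO: *out = CORE_SUBSTRATE_AUTO; return 0;
    case DEC_SUBSTRATE_CPU:  *out = CORE_SUBSTRATE_CPU;  return 0;
    case DEC_SUBSTRATE_QPU:  *out = CORE_SUBSTRATE_QPU;  return 0;
    default:                 return -1;
    }
}
```

Rejecting unknown values here is what lets a setter report a bogus
substrate instead of silently dispatching on AUTO.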

Test changes:

  - test_smoke: new EXPECTs for set_substrate (valid + bogus).
  - test_idct_bitexact: new argv[4] takes "auto" (default), "cpu", or
    "qpu" to force the substrate.
  - CMakeLists.txt: new ctest entry `idct_bitexact_cpu` re-runs the
    QVGA case forcing the CPU path.  Catches silent drift between
    the V3D shader and the NEON reference; both must produce
    identical output for the same coefficient input (and they do —
    see ctest log below).

Verified on hertz (Pi 5 / V3D 7.1 / daedalus-fourier 0.1.0):

  $ ctest --test-dir build --output-on-failure
      Start 1: smoke
  1/4 Test #1: smoke ............................   Passed    0.10 sec
      Start 2: idct_bitexact
  2/4 Test #2: idct_bitexact ....................   Passed    0.03 sec
      Start 3: idct_bitexact_cpu
  3/4 Test #3: idct_bitexact_cpu ................   Passed    0.03 sec
      Start 4: idct_bitexact_1080p
  4/4 Test #4: idct_bitexact_1080p ..............   Passed    0.06 sec

  100% tests passed, 0 tests failed out of 4

CPU substrate produces byte-identical Y + Cb + Cr planes against the
same C reference that the AUTO/QPU path matches — confirming the V3D
shaders and the daedalus-fourier NEON path agree at the spec level.

Why we plumbed the lower-level dispatch instead of leaving recipe in
place: recipe is just a thin wrapper that calls dispatch with
AUTO.  Once we needed substrate control, the wrapper became a
liability (would have required adding a parallel recipe API for each
substrate); going direct is simpler and the AUTO path is unchanged.

Coverage note: idct_bitexact_cpu runs at QVGA (300 MBs) only.  It is
not repeated at 1080p because the CPU path's wall time scales linearly
with block count, and a 1080p CPU run takes ~0.5s on hertz: fine
standalone, but slow enough in ctest to tempt opt-in gating.  The
bit-exact content is the same regardless of frame size; the 1080p
variant only exists to gate index-arithmetic bugs that surface above
small int boundaries.
2026-05-24 23:07:45 +02:00
claude-noether 352373a9be phase1: add IDCT-layer throughput benchmark (bench_flush_frame)
Establishes a steady-state baseline for the Path C frame-level
dispatch architecture.  Times daedalus_decoder_flush_frame at a
configurable coded resolution with random coefficients, reporting
per-frame latency stats and fps.

NOT a ctest — produces wall-time numbers, doesn't pass/fail.  Run
manually:

  ./build/bench_flush_frame [width] [height] [iters] [warmup]

Defaults to 1920x1088, 100 iters, 5-frame warmup (excludes shader-
pipeline-pool materialisation cost from the timing average).

Measured on hertz (Pi 5 / V3D 7.1 / daedalus-fourier 0.1.0):

  $ ./build/bench_flush_frame
  bench_flush_frame: 1920x1088 (8160 MBs), 100 iters (5 warmup)
  ctx has_qpu=1

  flush_frame (post-warmup, 95 samples):
    min    =   9.699 ms
    median =   9.905 ms
    mean   =  10.014 ms
    p99    =  12.011 ms
    max    =  12.011 ms

  throughput (steady-state, IDCT only — NO intra/MC/deblock):
    mean   = 99.9 fps
    median = 101.0 fps
    target = 30.0 fps (project_30fps_floor_is_fine.md)
    status = MEETS target (with 3.4x headroom for intra/MC/deblock)

Interpretation:

  Per-frame work measured:
    - CPU partition + flat-pack of 8160 MBs into luma_4x4, luma_8x8,
      chroma meta+coeffs buffers
    - 3 GPU dispatches (luma 4x4, luma 8x8, chroma 4x4) with their
      respective vkQueueSubmit + vkQueueWaitIdle round-trips
    - CPU NV12 interleave (chroma planar → UV)
    - calloc/free for scratch_y / coeffs / meta buffers

  Doing all of that in ~10 ms means the architecture pays back the
  Path C design bet: ONE Vulkan submit per dispatch (cycle 8b buffer
  pool keeps amortised cost low) is the right granularity.  The
  per-block dispatch fail-mode that motivated Path C (~6500 ms/frame
  from the libavcodec substitution arc) is roughly 650x slower than this.

  3.4x headroom from 101 fps → 30 fps target gives a budget of
  ~23 ms/frame for the remaining decode work (intra prediction
  wavefront, MC, deblock).  Each of those needs to fit inside that
  budget at steady state for the end-to-end decoder to hit 30 fps
  at 1080p.

  p99 latency 12 ms means even worst-case frames clear the 33-ms
  deadline (30 fps) easily; tail latency isn't a concern at this
  stage.
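
The summary statistics and the budget arithmetic above can be
reproduced with a sketch like the following.  It is not the benchmark's
source: the names are hypothetical, and the nearest-rank definition of
p99 is an assumption (it is consistent with p99 equalling max at 95
samples).

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct lat_stats { double min, median, mean, p99, max; };

static int cmp_double(const void *a, const void *b)
{
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Sorts ms[] in place and summarises it; n must be >= 1. */
static struct lat_stats lat_stats_compute(double *ms, size_t n)
{
    struct lat_stats s;
    double sum = 0.0;
    assert(n >= 1);
    qsort(ms, n, sizeof ms[0], cmp_double);
    for (size_t i = 0; i < n; i++)
        sum += ms[i];
    /* Nearest-rank p99: 1-based rank = ceil(0.99 * n), in integers. */
    size_t rank = (99 * n + 99) / 100;
    s.min    = ms[0];
    s.max    = ms[n - 1];
    s.mean   = sum / (double)n;
    s.median = (n & 1) ? ms[n / 2] : 0.5 * (ms[n / 2 - 1] + ms[n / 2]);
    s.p99    = ms[rank - 1];
    return s;
}

/* Frames per second for a given per-frame latency in milliseconds. */
static double fps_from_ms(double frame_ms)
{
    return 1000.0 / frame_ms;
}

/* Time left per frame for other stages at a target frame rate. */
static double remaining_budget_ms(double frame_ms, double target_fps)
{
    return 1000.0 / target_fps - frame_ms;
}
```

With the median of 9.905 ms and a 30 fps target, remaining_budget_ms
returns about 23.4 ms, which is the budget quoted above.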

What this number does NOT validate:
  - Intra prediction shader dispatch overhead (likely per-anti-diagonal
    or per-MB-wavefront; dispatch count goes up)
  - MC dispatch (per qpel-block; up to several per MB)
  - Deblock dispatch (4 edges per MB; per-edge meta entries)
  - Real H.264 streams (random coeffs ≠ real residuals; perf shape
    of memory access is content-independent, but cache pressure may
    differ at scale).
2026-05-24 22:53:49 +02:00
claude-noether 045553ccaf phase1: add deployment-scale bit-exact ctest (1080p, 8160 MBs)
The existing 320x240 bit-exact test (300 MBs) is the fast inner-loop
gate, but it's small enough that index arithmetic bugs that only
surface above 16-bit boundaries would slip through.  This adds a
second ctest entry that runs the same binary against a full coded
1080p frame (1920x1088, 8160 MBs):

  - 4080 MBs at transform_8x8=0 → 65,280 luma 4x4 blocks
  - 4080 MBs at transform_8x8=1 → 16,320 luma 8x8 blocks
  - 65,280 chroma 4x4 blocks (32,640 Cb + 32,640 Cr)
  - 146,880 IDCTs total across 3 separate luma_4x4 + luma_8x8 +
    chroma dispatches; bit-exact compared against the in-test C
    reference for each.
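
The block counts above follow from simple per-MB arithmetic for 4:2:0
content.  A sketch, with hypothetical names:

```c
#include <assert.h>

struct idct_counts { long luma4x4, luma8x8, chroma4x4, total; };

/* Count IDCT blocks for a frame split by luma transform size (4:2:0). */
static struct idct_counts count_idcts(long mbs_4x4, long mbs_8x8)
{
    struct idct_counts c;
    assert(mbs_4x4 >= 0 && mbs_8x8 >= 0);
    c.luma4x4   = mbs_4x4 * 16;              /* 16 luma 4x4 blocks per MB */
    c.luma8x8   = mbs_8x8 * 4;               /* 4 luma 8x8 blocks per MB  */
    c.chroma4x4 = (mbs_4x4 + mbs_8x8) * 8;   /* 4 Cb + 4 Cr per MB        */
    c.total     = c.luma4x4 + c.luma8x8 + c.chroma4x4;
    return c;
}
```

Calling count_idcts(4080, 4080) yields 65,280 + 16,320 + 65,280 =
146,880, the total stated above.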

No code change to the test binary itself — it already accepted
width/height as argv[1..2].  Just a second `add_test` in
CMakeLists.txt that invokes it with `1920 1088`.

Coverage rationale:

  - dst_off is uint32_t in daedalus_h264_block_meta; at 1920x1088
    the max offset is ~2.1 MiB, still well within uint32 range, but
    the test exercises the largest stride math we'll see in production
    (per-MB chroma offset = mb_y*8 + cb_plane_size = up to 1.06 MiB).
  - flush_frame partitions 8160 MBs by transform mode → exercises the
    bi4 == 4080*16 and bi8 == 4080*4 accumulators at frame scale.
  - Verifies the 1088 coded height handling (the displayed 1080 +
    8 cropped rows trap that catches Pi 5 H.264 integrations).

Verified on hertz (Pi 5 / V3D 7.1 / daedalus-fourier 0.1.0):

  $ ctest --test-dir build --output-on-failure
      Start 1: smoke
  1/3 Test #1: smoke ............................   Passed    0.09 sec
      Start 2: idct_bitexact
  2/3 Test #2: idct_bitexact ....................   Passed    0.03 sec
      Start 3: idct_bitexact_1080p
  3/3 Test #3: idct_bitexact_1080p ..............   Passed    0.06 sec

  100% tests passed, 0 tests failed out of 3

  $ ./build/test_idct_bitexact 1920 1088
  test_idct_bitexact: 1920x1088 (8160 MBs), seed=0xfeedface5a5a5a5a
  MB mix: 4080 4x4 MBs, 4080 8x8 MBs
  Y bytes total:  2088960
  Y bytes diff:   0 (0.0000%)
  Cb bytes total: 522240  diff: 0 (0.0000%)
  Cr bytes total: 522240  diff: 0 (0.0000%)
  BIT-EXACT PASS (Y + Cb + Cr)

(0.06 s when shader pool warm; ~0.2 s cold via the standalone
invocation above — the 1080p run happens after smoke, so pool is
already primed by the time it runs in ctest.)
2026-05-24 22:49:01 +02:00
claude-noether 948697ef0d phase1/stage1: bit-exact gate for the frame-scaled luma IDCT 4x4
Adds test_idct_bitexact that exercises daedalus_decoder_flush_frame
end-to-end with random coefficients and compares every output byte
against an inline C reference of the H.264 §8.5.12.1 1D butterfly.
Closes the validation gap from the previous PR ("dispatch succeeds"
becomes "dispatch is bit-exact").

What's tested:

  - 320×240 coded frame (300 MBs), enough to cover multiple workgroups
    of the V3D shader (4800 blocks at 16 blocks/WG → 300 WGs)
  - Per-MB → flat-raster block layout consistent with flush_frame
  - Random coeffs in [-512, 511] (same range as daedalus-fourier
    cycle-6 M1 gate)
  - Inline C reference: H.264 §8.5.12.1 butterfly with column-major
    block layout, +32 rounding, >>6, add-to-predicted (=0), clip255 —
    mirrors daedalus-fourier tests/h264_idct4_ref.c

Verified on hertz (Pi 5 / V3D 7.1 / daedalus-fourier 0.1.0):

  $ ctest --test-dir build --output-on-failure
    Start 1: smoke
  1/2 Test #1: smoke ............................   Passed    0.16 sec
    Start 2: idct_bitexact
  2/2 Test #2: idct_bitexact ....................   Passed    0.03 sec

  100% tests passed, 0 tests failed out of 2

Bit-exact PASS first try — daedalus-fourier's V3D IDCT 4x4 shader
produces identical pixels to the C reference for all 4800 blocks in
the test frame.  Validates BOTH the shader correctness AND the
frame-batched-dispatch correctness (this is the first time
n_blocks > ~30 has been exercised at the recipe-dispatch layer; the
substitution arc only ever called with n_blocks=1).

What is NOT tested by this PR (deferred to follow-ons):

  - Non-zero predicted pixels — flush_frame zero-initialises scratch_y,
    so the IDCT-ADD reduces to clip255(IDCT).  Real predicted comes
    from Stage 2a intra prediction.
  - Z-scan permutation between FFmpeg's per-MB coeffs layout and our
    per-MB → flat raster — the test uses its own coefficient generator
    that already matches our layout, so it doesn't exercise the
    permutation.  The libavcodec-intercept patch is where the
    permutation lands and gets validated against real H.264 streams.
  - Chroma 4×4 IDCT.
  - IDCT 8×8 (High profile).
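
For reference, the Z-scan permutation named in the list above can be
sketched as a bit interleave of a 4x4 block's position within the MB.
This shows the standard H.264 4x4 luma block scan; whether a given
coefficient buffer uses exactly this order is an assumption to verify
against the producer, and the function names are hypothetical.

```c
#include <assert.h>

/* Z-scan index of the 4x4 luma block at column x, row y (0..3 each):
 * interleave the bits as y1 x1 y0 x0. */
static int zscan_from_xy(int x, int y)
{
    assert(x >= 0 && x < 4 && y >= 0 && y < 4);
    return (x & 1) | ((y & 1) << 1) | ((x & 2) << 1) | ((y & 2) << 2);
}

/* Fill table[r] with the z-scan index of raster block r = y * 4 + x. */
static void build_raster_to_zscan(int table[16])
{
    for (int r = 0; r < 16; r++)
        table[r] = zscan_from_xy(r & 3, r >> 2);
}
```

The resulting table is 0 1 4 5 / 2 3 6 7 / 8 9 12 13 / 10 11 14 15 when
read in raster order, so raster block 2 (top row, third column) lives
in z-scan slot 4.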

Stacked on noether/phase1-stage1-idct (PR #3, the frame-scaled
dispatch).  Rebase on main after #3 lands; the diff is purely additive
(one new test file + 5 lines of CMake).
2026-05-24 22:20:21 +02:00
claude-noether 08080f062c scaffold: CMake + API skeleton + smoke test
First code on daedalus-decoder per the Phase 1 decisions merged 2026-05-24.
Repo skeleton only — no Vulkan pipeline yet, no shaders, no libavcodec
intercept.  Establishes the build shape so subsequent work has a place
to land.

Layout:

  LICENSE                          BSD-2-Clause (matches daedalus-fourier)
  .gitignore                       build/, CMake artefacts, *.spv
  CMakeLists.txt                   top-level — finds daedalus-fourier
                                   ≥0.1.0 via pkg-config (per §9.6
                                   decision: find_package, pinned to
                                   tagged release; .pc consumed via
                                   pkg_check_modules until we ship a
                                   CMake config), Vulkan via
                                   find_package, builds static lib
                                   + smoke test, GNUInstallDirs install
  include/daedalus_decoder.h       public API surface:
                                     - daedalus_decoder_{create,destroy,
                                                         version,has_qpu}
                                     - daedalus_decoder_set_output_format
                                       (NV12 default, RGBA opt-in per §5)
                                     - daedalus_decoder_append_mb +
                                       struct daedalus_decoder_mb_input
                                       (matches §3 per-MB descriptor)
                                     - daedalus_decoder_flush_frame
                                       (per-frame submit + wait)
                                     - daedalus_decoder_export_dmabuf
                                       (Vulkan-native VkImage export per
                                       §9.4 decision)
                                   Dimensions are CODED frame size
                                   (mod-16), not displayed — caller
                                   translates from SPS + crop offsets.
  src/internal.h                   internal mb_desc struct (matches
                                   shader std430 layout, to be nailed
                                   down once shaders exist) + per-ctx
                                   state
  src/daedalus_decoder.c           stub bodies:
                                     - create/destroy with proper resource
                                       lifecycle
                                     - append_mb validates + writes CPU
                                       staging buffers (no GPU yet)
                                     - flush_frame returns -2 (not
                                       implemented) — Phase 1 work
                                     - export_dmabuf returns -1
                                     - has_qpu / version diagnostics
  tests/test_smoke.c               link + lifecycle test: bad dims
                                   reject, OOB MB reject, null inputs
                                   reject, raster-order enforcement,
                                   mid-frame format-change reject,
                                   incomplete-frame flush reject.
                                   On hosts without V3D7 Vulkan,
                                   SKIPs gracefully (returns 0).

Verified on hertz (Pi 5 / V3D 7.1 / Mesa V3DV via daedalus-fourier
0.1.0):

  $ cmake -B build -G Ninja -DCMAKE_BUILD_TYPE=Release
  $ cmake --build build
  $ ctest --test-dir build --output-on-failure
  Test #1: smoke ... Passed

  $ ./build/test_smoke
  daedalus-decoder version: 0.0.1
  ctx created: 1920x1088, has_qpu=1
  smoke OK

Note the coded-vs-displayed dims trap: 1080p H.264 has coded height
1088 with 8 rows cropped via SPS frame_cropping_*.  Header docstring
on daedalus_decoder_create() spells this out so future callers don't
hit the multiple-of-16 reject (smoke test caught it during scaffold
write).

Next: Phase 1 implementation begins — IDCT 4×4 / 8×8 frame-scaled
dispatch (reusing daedalus-fourier shaders per Appendix A), intra
prediction wavefront, reconstruct stage, NV12 output via dmabuf
export.  Smoke test grows from "ctx lifecycle works" to
"I-frame-only Baseline decode bit-exact vs FFmpeg reference".
2026-05-24 22:08:46 +02:00