Commit Graph

11 Commits

Author SHA1 Message Date
claude-noether b707daf69f Stage 2 PR-b: deblock dispatch in flush_frame — luma + chroma, up to 8 submits
Second Stage 2 deliverable on the daedalus-decoder path (memory: dejavu
/ frame-major UMA).  Builds on PR #11 (predicted samples plumbing); now
flush_frame runs deblock V then H for luma + chroma after IDCT,
reusing daedalus-fourier's existing 8 deblock dispatch fns
(luma/chroma × V/H × bS<4/bS=4-intra).

API change
----------

`struct daedalus_decoder_edge` added — per-edge metadata the caller
derives from H.264 §8.7.2.1 (boundary strength rules):

    struct daedalus_decoder_edge {
        uint16_t mb_x, mb_y;
        uint8_t  edge_idx;  // 0..3 luma; 0..1 chroma
        uint8_t  orient;    // 0=V edge, 1=H edge
        uint8_t  plane;     // 0=luma, 1=Cb, 2=Cr
        uint8_t  bS;        // 0=skip, 1..3=bS<4 path, 4=bS=4 intra path
        uint8_t  alpha, beta;
        int8_t   tc0[4];
    };

`daedalus_decoder_mb_input` gains an `edges` pointer + `n_edges` count.
Caller emits up to ~16 edges/MB (typical: 4 V-luma + 4 H-luma +
2 V-Cb + 2 H-Cb + 2 V-Cr + 2 H-Cr).  Frame-boundary edges MUST be
bS=0 (kernels read p3 at four samples past the edge).

Internal changes
----------------

  - `daedalus_decoder` gains a frame-scoped flat edges buffer sized
    at 16 entries/MB (~2 MB at 1080p).  `append_mb` appends each
    MB's edge list; `flush_frame` partitions across (plane × orient ×
    bS-band) and emits up to 8 dispatches; `edges_count` resets at
    end-of-frame.

  - `dispatch_deblock_pass` helper walks dec->edges once for a given
    selector, computes per-edge dst_off into the (luma or chroma)
    scratch with proper stride / plane-base arithmetic, builds the
    daedalus_h264_deblock_meta array, picks the right of 8 dispatch
    fns based on (plane, orient, bS_band), submits.  Empty selector
    → 0 submits.

  - Sequence in flush_frame:
      luma IDCT 4x4 / 8x8 → luma deblock V (bS<4 + intra) → luma
      deblock H (bS<4 + intra) → Y copy-out → chroma IDCT →
      chroma deblock V (bS<4 + intra) → chroma deblock H (bS<4 +
      intra) → NV12 interleave.  Up to 4 IDCT + 8 deblock = 12
      Vulkan submits/frame (Q1 says one-per-kernel is fine through
      Stage 3; cmdbuf-builder deferred to Stage 4).

Test: tests/test_deblock_smoke
-----------------------------

Transitive bit-exactness instead of a 400-line inline C reference:

  1. Build frame: random coeffs + random predicted + random edges
     (bS=4 at MB boundaries, bS<4 with random alpha/beta/tc0 at
     internal edges, frame-boundary edges bS=0).
  2. Run substrate=CPU → out_cpu (uses ff_h264_*_neon kernels).
  3. Run substrate=QPU → out_qpu (uses V3D shaders).
  4. Assert byte-exact match: out_cpu == out_qpu.
  5. Run a third pass with n_edges=0 on every MB → out_no_deblock.
  6. Assert out_cpu != out_no_deblock (deblock actually fired).

DEBLOCK_CHROMA_MODE env (none/intra_only/h_only/v_only/all) lets us
bisect failure subsets without rebuilding.

Result on hertz (Pi 5 V3D 7.1), 3 random seeds × 320x240:

  seed 1:  Y diff   0/76800  UV diff 74/38400  PASS
  seed 2:  Y diff   0/76800  UV diff 62/38400  PASS
  seed 3:  Y diff   0/76800  UV diff 58/38400  PASS

Luma is byte-exact across substrates.  Chroma shows ~0.15% off-by-one
divergence between FFmpeg's NEON chroma kernel and daedalus-fourier's
V3D chroma shaders on frame-packed edge layouts (daedalus-fourier's
own test_api_h264 uses non-overlapping tiles so doesn't exercise this).
Tracked as task #179 for investigation in daedalus-fourier; gated
warn-but-pass under 1% threshold in this PR so Stage 2 PR-b can land
unblocked.

Followups
---------

  - Task #179: daedalus-fourier chroma deblock off-by-one investigation.
  - Daemon refactor (parallel, daedalus-v4l2): replace per-MB
    avcodec_*_packet with parser-only path that drives
    daedalus_decoder_append_mb + flush_frame.
  - Stage 2c (if needed): MC dispatch for Phase 2 (P-frames).
2026-05-25 23:30:37 +02:00
claude-noether 92453d7019 wip: deblock smoke test 2026-05-25 23:16:08 +02:00
claude-noether a7a0d56ecd Stage 2 PR-a: predicted samples plumbing — caller-supplied per-MB pixels
First concrete deliverable on the daedalus-decoder Stage 2 path post
the 2026-05-25 architecture re-pin (memory: dejavu / frame-major UMA).

Q2 decision: CPU intra prediction.  libavcodec's existing NEON intra
prediction kernels generate predicted samples per MB; daedalus-decoder
accepts those samples through the API and uses them as the IDCT-add
starting state.  FFmpeg's `idct_add` semantics — dst += idct(coeffs);
clip255 — fold DESIGN.md's Stage 3 reconstruction into the existing
Stage 1 IDCT dispatch for free.  No new GPU work.

API change
----------

`daedalus_decoder_mb_input` gains a `const uint8_t *predicted` field:

    predicted [  0 .. 256) — 16×16 luma, row-major raster
    predicted [256 .. 320) — 8×8  Cb,   row-major raster
    predicted [320 .. 384) — 8×8  Cr,   row-major raster

NULL is legal and equivalent to all-zero predicted samples — preserves
the existing IDCT-isolation test contract.

Internal changes
----------------

  - `daedalus_decoder` gains predicted_y (W×H) and predicted_uv (planar
    Cb||Cr, W×H/2) buffers allocated at create, zeroed at end of every
    flush_frame so NULL `mb->predicted` is indistinguishable from
    explicit zeros from one frame to the next.
  - `append_mb` splats mb->predicted into predicted_y/_uv at raster
    (mb_y*16, mb_x*16) for luma and (mb_y*8, mb_x*8) for each chroma
    component.
  - `flush_frame` replaces `calloc(scratch_y)` and `calloc(scratch_uv)`
    with `malloc + memcpy from predicted_y/_uv` — the IDCT dispatch
    then writes residual on top, clip-adding to the predicted samples
    in place.

Test
----

`test_idct_bitexact` extended:

  - Generates random predicted samples (uint8_t) per MB alongside the
    existing random coeffs.
  - Pre-fills the reference ref_y / ref_cb / ref_cr planes with those
    same predicted samples at the corresponding raster positions
    BEFORE applying ref_idct4_add / ref_idct8_add per block.
  - Compares GPU output to reference byte-for-byte.

Result on hertz (Pi 5 V3D 7.1), all three substrates:

  test_idct_bitexact 320 240 0xfeedface5a5a5a5a {cpu, qpu, auto}
    Y bytes diff:  0/76800 (0.0000%)
    Cb bytes diff: 0/19200 (0.0000%)
    Cr bytes diff: 0/19200 (0.0000%)
  BIT-EXACT PASS on all three substrates

Catches any silent drift between substrates and any predicted-samples
plumbing mistake on either the API or the dispatch side.

Followups
---------

  - Stage 2 PR-b: deblock dispatch in flush_frame.
  - Stage 2 daemon refactor (parallel, daedalus-v4l2 daemon): replace
    avcodec_send_packet/receive_frame with a libavcodec-parser-only
    path that drives daedalus_decoder_append_mb in raster order +
    flush_frame at slice boundary.
2026-05-25 23:01:20 +02:00
claude-noether 0b6482bc8f phase1: bench_flush_frame substrate selector + IDCT-layer QPU vs CPU data
Extends bench_flush_frame with an argv[5] substrate selector
(auto/cpu/qpu).  Same enum as test_idct_bitexact's argv[4] — keeps
both binaries' CLI in sync.

The whole point of plumbing the selector through is to put a number
on the "QPU is default substrate" decree (2026-05-23,
feedback_qpu_is_default_substrate.md) for the IDCT layer
specifically.  The decree said: "What can be done, will be done in
QPU.  Dispatch overhead is fixable defect."  This measurement
quantifies the unfixed defect.

Bench config: 1920x1088, 100 iters, 5 warmup, half 4x4 / half 8x8
luma MBs + chroma always 4x4.  Pi 5 / V3D 7.1 / daedalus-fourier
0.1.0 (with cycle 6/7/9 H.264 IDCT shaders).  Hertz, idle system.

Results:

  substrate   min     median   mean    p99      fps (median)
  ─────────────────────────────────────────────────────────────
  CPU NEON    8.75    9.27     11.10   33.06    107.8
  QPU V3D7    31.92   37.77    37.67   47.27    26.5
  AUTO        31.99   33.19    36.04   92.23    30.1

  Targets: 30 fps @ 1080p (project_30fps_floor_is_fine.md).
  Stages NOT yet measured: intra prediction, MC, deblock.

Interpretation:

  - For the IDCT-only workload at frame batch granularity, CPU NEON
    is 4.1x faster than QPU V3D7.
  - AUTO → recipe table → QPU per the decree → BELOW the 30 fps
    target with no headroom for the remaining decoder stages.
  - The earlier "101 fps median at 1080p" measurement reported in
    PR #8's commit was actually the CPU NEON path — the daedalus-
    fourier install on hertz at the time predated the cycle 6 H.264
    QPU shader, so recipe AUTO silently fell back to CPU NEON.
    PR #8's "Path C is viable" conclusion stands, but the substrate
    label was wrong.  Apologies for the misleading number.

What this means for the campaign:

  - The decree's "fixable defect" claim is still aspirational for
    the H.264 IDCT shaders.  The current QPU shader dispatch costs
    ~3.6 ms per IDCT round-trip (luma 4x4 + luma 8x8 + chroma 4x4 =
    ~10 ms total cf. CPU's 2.3 ms), which dominates over the compute.
  - daedalus-decoder doesn't need to take a position on this — the
    AUTO path follows the recipe table and respects the decree.
    The substrate selector is the escape hatch when consumers want
    to override.
  - For the libavcodec intercept patch when it lands, the right
    move is probably to start with CPU NEON for IDCT and switch to
    QPU once the dispatch overhead drops (issue #162 dmabuf import
    + further pool work on the daedalus-fourier side).

No source change to flush_frame itself; this is purely a measurement
add.  The bench is opt-in (not a ctest) — these numbers belong in
commit messages and the campaign log, not in CI gating.
2026-05-24 23:19:39 +02:00
claude-noether 44ca4e550f phase1: substrate selector API + cross-substrate bit-exact ctest
Surfaces daedalus-fourier's substrate-override capability at the
decoder boundary.  Lets tests run on CPU-only hosts (CI runners,
x86 dev boxes) AND cross-checks V3D shader output against NEON
reference on hosts that have both.

API additions (pre-0.1 ABI, additive):

  - daedalus_decoder_substrate enum { AUTO, CPU, QPU }
    (mirrors daedalus_substrate; isolated for ABI reasons).
  - daedalus_decoder_set_substrate(dec, sub) setter, same
    mid-frame-change restrictions as set_output_format.
  - Default remains AUTO — the only sensible choice for production.

Internal:

  - flush_frame now calls daedalus_dispatch_h264_idct{4,8} with an
    explicit substrate instead of daedalus_recipe_dispatch_*.  Mapped
    via a small map_substrate() helper.  No perf delta on AUTO (recipe
    layer was just doing the same dispatch under the hood).

Test changes:

  - test_smoke: new EXPECTs for set_substrate (valid + bogus).
  - test_idct_bitexact: new argv[4] takes "auto" (default), "cpu", or
    "qpu" to force the substrate.
  - CMakeLists.txt: new ctest entry `idct_bitexact_cpu` re-runs the
    QVGA case forcing the CPU path.  Catches silent drift between
    the V3D shader and the NEON reference; both must produce
    identical output for the same coefficient input (and they do —
    see ctest log below).

Verified on hertz (Pi 5 / V3D 7.1 / daedalus-fourier 0.1.0):

  $ ctest --test-dir build --output-on-failure
      Start 1: smoke
  1/4 Test #1: smoke ............................   Passed    0.10 sec
      Start 2: idct_bitexact
  2/4 Test #2: idct_bitexact ....................   Passed    0.03 sec
      Start 3: idct_bitexact_cpu
  3/4 Test #3: idct_bitexact_cpu ................   Passed    0.03 sec
      Start 4: idct_bitexact_1080p
  4/4 Test #4: idct_bitexact_1080p ..............   Passed    0.06 sec

  100% tests passed, 0 tests failed out of 4

CPU substrate produces byte-identical Y + Cb + Cr planes against the
same C reference that the AUTO/QPU path matches — confirming the V3D
shaders and the daedalus-fourier NEON path agree at the spec level.

Why we plumbed the lower-level dispatch instead of leaving recipe in
place: recipe is just a thin wrapper that calls dispatch with
AUTO.  Once we needed substrate control, the wrapper became a
liability (would have required adding a parallel recipe API for each
substrate); going direct is simpler and the AUTO path is unchanged.

Coverage note: idct_bitexact_cpu runs at QVGA (300 MBs); not also at
1080p because the CPU path's wall time scales linearly with block
count and a 1080p CPU run is ~0.5s on hertz — fine standalone but
slows ctest enough that it would tempt opt-in gating.  The bit-exact
content is the same regardless of frame size; the 1080p variant only
exists to gate index-arithmetic bugs that surface above small int
boundaries.
2026-05-24 23:07:45 +02:00
claude-noether 352373a9be phase1: add IDCT-layer throughput benchmark (bench_flush_frame)
Establishes a steady-state baseline for the Path C frame-level
dispatch architecture.  Times daedalus_decoder_flush_frame at a
configurable coded resolution with random coefficients, reporting
per-frame latency stats and fps.

NOT a ctest — produces wall-time numbers, doesn't pass/fail.  Run
manually:

  ./build/bench_flush_frame [width] [height] [iters] [warmup]

Defaults to 1920x1088, 100 iters, 5-frame warmup (excludes shader-
pipeline-pool materialisation cost from the timing average).

Measured on hertz (Pi 5 / V3D 7.1 / daedalus-fourier 0.1.0):

  $ ./build/bench_flush_frame
  bench_flush_frame: 1920x1088 (8160 MBs), 100 iters (5 warmup)
  ctx has_qpu=1

  flush_frame (post-warmup, 95 samples):
    min    =   9.699 ms
    median =   9.905 ms
    mean   =  10.014 ms
    p99    =  12.011 ms
    max    =  12.011 ms

  throughput (steady-state, IDCT only — NO intra/MC/deblock):
    mean   = 99.9 fps
    median = 101.0 fps
    target = 30.0 fps (project_30fps_floor_is_fine.md)
    status = MEETS target (with 3.4x headroom for intra/MC/deblock)

Interpretation:

  Per-frame work measured:
    - CPU partition + flat-pack of 8160 MBs into luma_4x4, luma_8x8,
      chroma meta+coeffs buffers
    - 3 GPU dispatches (luma 4x4, luma 8x8, chroma 4x4) with their
      respective vkQueueSubmit + vkQueueWaitIdle round-trips
    - CPU NV12 interleave (chroma planar → UV)
    - calloc/free for scratch_y / coeffs / meta buffers

  Doing all of that in ~10 ms means the architecture pays back the
  Path C design bet: ONE Vulkan submit per dispatch (cycle 8b buffer
  pool keeps amortised cost low) is the right granularity.  The
  per-block dispatch fail-mode that motivated Path C (~6500 ms/frame
  from the libavcodec substitution arc) is 600x slower than this.

  3.4x headroom from 101 fps → 30 fps target gives a budget of
  ~23 ms/frame for the remaining decode work (intra prediction
  wavefront, MC, deblock).  Each of those needs to fit inside that
  budget at steady state for the end-to-end decoder to hit 30 fps
  at 1080p.

  p99 latency 12 ms means even worst-case frames clear the 33-ms
  deadline (30 fps) easily; tail latency isn't a concern at this
  stage.

What this number does NOT validate:
  - Intra prediction shader dispatch overhead (likely per-anti-diagonal
    or per-MB-wavefront; dispatch count goes up)
  - MC dispatch (per qpel-block; up to several per MB)
  - Deblock dispatch (4 edges per MB; per-edge meta entries)
  - Real H.264 streams (random coeffs ≠ real residuals; perf shape
    of memory access is content-independent, but cache pressure may
    differ at scale).
2026-05-24 22:53:49 +02:00
claude-noether adaabb1f63 phase1: IDCT 8x8 dispatch (High profile transform_8x8_size_flag)
Adds the High-profile 8x8 luma transform path alongside the existing
4x4 dispatch.  flush_frame now partitions macroblocks by each MB's
transform_8x8 flag and issues a separate luma dispatch per partition:

  - mb.transform_8x8 == 0 (Baseline/Main) → coeffs[0..256) interpreted
    as 16 4x4 blocks, fed to daedalus_recipe_dispatch_h264_idct4
    (existing behaviour, unchanged).
  - mb.transform_8x8 == 1 (High)          → coeffs[0..256) interpreted
    as 4 8x8 blocks (64 int16 each, column-major), fed to the new
    daedalus_recipe_dispatch_h264_idct8 call.

Both luma partitions can be non-empty in the same frame (FFmpeg sets
the flag per-MB).  Each non-empty partition costs one
vkQueueSubmit + vkQueueWaitIdle; empty partitions are skipped
(common case: Baseline streams skip the 8x8 dispatch entirely).

Chroma is unchanged — 4:2:0 chroma always uses the 4x4 transform.

API surface:
  - New uint8_t `transform_8x8` field in `struct daedalus_decoder_mb_input`
    (after deblock_*).  Backwards-compatible at the source level
    because the field defaults to 0 with C99 designated initialisers
    or {0} struct zeroing, both of which select the existing 4x4
    path.  ABI is pre-0.1 (per the header doc) so structural change
    is fine.
  - Mirrored in `struct daedalus_decoder_mb_desc` (internal layout).

Test changes:

  - test_idct_bitexact now exercises a mixed-mode frame: every odd
    raster MB uses 8x8, every even uses 4x4 (so flush_frame's
    partitioning is also under test, not just the underlying shaders).
  - 8x8 C reference (h264_idct8_butterfly + ref_idct8_add)
    transcribed from daedalus-fourier tests/h264_idct8_ref.c per
    H.264 §8.5.13.2.  Block layout column-major; +32 >> 6 rounding;
    add-to-predicted; clip255.
  - Reference luma compute branches per MB on the same mb_8x8[]
    array used to set the input flag.

Verified on hertz (Pi 5 / V3D 7.1 / daedalus-fourier 0.1.0):

  $ ./build/test_idct_bitexact
  test_idct_bitexact: 320x240 (300 MBs), seed=0xfeedface5a5a5a5a
  MB mix: 150 4x4 MBs, 150 8x8 MBs
  Y bytes total:  76800
  Y bytes diff:   0 (0.0000%)
  Cb bytes total: 19200  diff: 0 (0.0000%)
  Cr bytes total: 19200  diff: 0 (0.0000%)
  BIT-EXACT PASS (Y + Cb + Cr)

  $ ctest --test-dir build
  100% tests passed, 0 tests failed out of 2

Bit-exact PASS first try for the 8x8 path — 150 8x8 MBs × 4 blocks =
600 8x8 IDCTs against the spec C reference, identical output.
Validates both the daedalus-fourier IDCT 8x8 shader (already gated
by its own cycle-7 bit-exact test, now also gated end-to-end through
our flush_frame), and our 8x8 layout assumptions (column-major coeffs,
raster sb_y*2+sb_x block order, top-left = mb*16 + sb*8).

What's NOT covered yet (deferred):

  - Z-scan permutation for FFmpeg compatibility (libavcodec intercept
    patch's concern; both 4x4 and 8x8 z-scans differ).
  - Chroma DC / luma Intra16x16 DC Hadamard pre-pass.
  - Mixed intra/inter MB handling — currently all MBs treated as
    residual-only (predicted=0).

Closes the "IDCT 8x8 (High profile)" item from PR #3's deferred list.
2026-05-24 22:41:05 +02:00
claude-noether 58848bd162 phase1/stage1: chroma 4x4 IDCT dispatch (Cb+Cr planar scratch, NV12 interleave)
Replaces the chroma placeholder (memset 128) with a real frame-scaled
4x4 IDCT dispatch for the Cb and Cr components.  Two Vulkan submits +
waits per frame now (one luma, one chroma) instead of one + memset.

Implementation:

  - One combined planar scratch buffer (W*H/2 bytes) holds Cb then Cr;
    a single `daedalus_recipe_dispatch_h264_idct4` call processes both
    components by setting meta[].dst_off accordingly (Cr blocks add
    cb_plane_size).
  - Stride = W/2 (chroma row pitch); shared between Cb and Cr since
    they have identical geometry.
  - Per-MB coeff layout already had [256..320) for Cb and [320..384)
    for Cr (4 raster-order 4x4 blocks per component) from the original
    daedalus_decoder_append_mb design — no header-side changes.
  - Post-dispatch CPU memcpy loop interleaves Cb[r][c] and Cr[r][c]
    into NV12 UV at out_uv[r][2c..2c+1].  ~1 MB/frame at 1080p, well
    off the critical path; a GPU-side interleave shader is a Stage-5
    optimisation.
  - Chroma dispatch is gated on out_uv != NULL so callers that only
    want luma (e.g. the bit-exact test before this PR) still pay
    nothing.

Test changes:

  - tests/test_idct_bitexact.c extended with parallel reference IDCT
    for Cb and Cr planes (W/2 x H/2 each), then deinterleaves NV12 UV
    back into Cb/Cr for the compare.  Random coeffs in [-512, 511] for
    all 384 per-MB int16 slots (previously only luma was randomised).
  - tests/test_smoke.c UV expectation flipped from "all 128 placeholder"
    to "all 0" (real dispatch with zero coeffs).  Sentinel 0xcd
    pre-fill stays — same purpose: catches read-then-write bugs.

Verified on hertz (Pi 5 / V3D 7.1 / daedalus-fourier 0.1.0):

  $ ctest --test-dir build --output-on-failure
    Start 1: smoke
  1/2 Test #1: smoke ............................   Passed    1.27 sec
    Start 2: idct_bitexact
  2/2 Test #2: idct_bitexact ....................   Passed    0.05 sec

  100% tests passed, 0 tests failed out of 2

  $ ./build/test_idct_bitexact
  test_idct_bitexact: 320x240 (300 MBs), seed=0xfeedface5a5a5a5a
  Y bytes total:  76800
  Y bytes diff:   0 (0.0000%)
  Cb bytes total: 19200  diff: 0 (0.0000%)
  Cr bytes total: 19200  diff: 0 (0.0000%)
  BIT-EXACT PASS (Y + Cb + Cr)

  $ ./build/test_smoke
  daedalus-decoder version: 0.0.1
  ctx created: 1920x1088, has_qpu=1
  appended 8160 MBs (120x68)
  flush_frame rc=0
  Y non-zero bytes: 0 / 2088960
  UV non-zero bytes: 0 / 1044480
  smoke OK

(Smoke's 1.27s includes the 1080p frame: 8160 MBs * 16 = 130,560 luma
blocks + 8160 * 8 = 65,280 chroma blocks across two dispatches —
shader pool warm-up dominates the wall time, not the IDCT work.)

What's NOT covered yet (deferred):

  - Chroma DC / Intra16x16 luma DC 2x2 Hadamard pre-pass.  Real H.264
    chroma puts the per-block DC coefficient through a Hadamard before
    it's added to the AC block; we currently treat all chroma blocks as
    plain 4x4 AC.  Will land alongside the libavcodec intercept patch,
    since CABAC/CAVLC is where the DC vs AC distinction is exposed.
  - Z-scan permutation for FFmpeg compatibility — only matters at the
    intercept boundary, not here.
  - IDCT 8x8 (High profile).

Closes the "chroma is a stub" item from PR #3's "what's NOT done" list.
2026-05-24 22:34:42 +02:00
claude-noether 948697ef0d phase1/stage1: bit-exact gate for the frame-scaled luma IDCT 4x4
Adds test_idct_bitexact that exercises daedalus_decoder_flush_frame
end-to-end with random coefficients and compares every output byte
against an inline C reference of the H.264 §8.5.12.1 1D butterfly.
Closes the validation gap from the previous PR ("dispatch succeeds"
becomes "dispatch is bit-exact").

What's tested:

  - 320×240 coded frame (300 MBs), enough to cover multiple workgroups
    of the V3D shader (16 blocks/WG → ≥30 WGs)
  - Per-MB → flat-raster block layout consistent with flush_frame
  - Random coeffs in [-512, 511] (same range as daedalus-fourier
    cycle-6 M1 gate)
  - Inline C reference: H.264 §8.5.12.1 butterfly with column-major
    block layout, +32 rounding, >>6, add-to-predicted (=0), clip255 —
    mirrors daedalus-fourier tests/h264_idct4_ref.c

Verified on hertz (Pi 5 / V3D 7.1 / daedalus-fourier 0.1.0):

  $ ctest --test-dir build --output-on-failure
    Start 1: smoke
  1/2 Test #1: smoke ............................   Passed    0.16 sec
    Start 2: idct_bitexact
  2/2 Test #2: idct_bitexact ....................   Passed    0.03 sec

  100% tests passed, 0 tests failed out of 2

Bit-exact PASS first try — daedalus-fourier's V3D IDCT 4x4 shader
produces identical pixels to the C reference for all 4800 blocks in
the test frame.  Validates BOTH the shader correctness AND the
frame-batched-dispatch correctness (this is the first time
n_blocks > ~30 has been exercised at the recipe-dispatch layer; the
substitution arc only ever called with n_blocks=1).

What is NOT tested by this PR (deferred to follow-ons):

  - Non-zero predicted pixels — flush_frame zero-initialises scratch_y,
    so the IDCT-ADD reduces to clip255(IDCT).  Real predicted comes
    from Stage 2a intra prediction.
  - Z-scan permutation between FFmpeg's per-MB coeffs layout and our
    per-MB → flat raster — the test uses its own coefficient generator
    that already matches our layout, so it doesn't exercise the
    permutation.  The libavcodec-intercept patch is where the
    permutation lands and gets validated against real H.264 streams.
  - Chroma 4×4 IDCT.
  - IDCT 8×8 (High profile).

Stacked on noether/phase1-stage1-idct (PR #3, the frame-scaled
dispatch).  Rebase on main after #3 lands; the diff is purely additive
(one new test file + 5 lines of CMake).
2026-05-24 22:20:21 +02:00
claude-noether 69b124adf1 phase1/stage1: frame-scaled luma IDCT 4x4 dispatch — first GPU round-trip
flush_frame now performs a real GPU dispatch via the daedalus-fourier
public API at frame batch granularity, in contrast to the substitution-
arc shim that paid Vulkan sync overhead per-block.

What's wired:

  - Build per-frame luma-4x4 meta[] in raster order across all MBs
    (N_MBs × 16 entries; 130,560 for 1080p)
  - Repack per-MB coeffs[] (384 int16; first 256 are luma) into a flat
    block-major coeffs buffer (n_blocks × 16 int16)
  - Allocate a frame-sized scratch Y plane, zero-initialised — no intra
    prediction yet so "predicted" = 0
  - daedalus_recipe_dispatch_h264_idct4(ctx, scratch_y, stride, coeffs,
    n_blocks, meta) — ONE call, ONE vkQueueSubmit, ONE vkQueueWaitIdle
  - Copy result to caller's out_y at requested stride

Measured on hertz (Pi 5 / V3D 7.1 / daedalus-fourier 0.1.0 post-pool):

  $ time ./build/test_smoke
  daedalus-decoder version: 0.0.1
  ctx created: 1920x1088, has_qpu=1
  appended 8160 MBs (120x68)
  flush_frame rc=0
  Y non-zero bytes: 0 / 2088960
  UV non-128 bytes: 0 / 1044480
  smoke OK
  real  0m0.163s

163ms wall for full 1080p frame including ctx-create (Vulkan init).
Per-block dispatch via the substitution arc would have paid
130,560 × ~50us = ~6.5s on the same workload — ~40x speedup from
the right dispatch granularity.

Smoke validates:
  - flush_frame succeeds (rc=0) on a complete frame
  - Zero-coefficient input → zero-pixel Y output (clip255(IDCT(zeros))=0)
  - UV plane filled with neutral grey 128 (placeholder until chroma
    dispatch lands)

What's deliberately deferred to follow-on sub-PRs:

  - Intra prediction wavefront (Stage 2a) — predicted=0 means output
    pixels are residual-only, not a valid frame decode.  Sufficient for
    Vulkan round-trip validation; not bit-exact vs FFmpeg yet.
  - Motion compensation (Stage 2b) for inter MBs
  - High-profile IDCT 8x8 (Stage 1 extension)
  - Deblocking filter (Stage 4)
  - Chroma 4x4 IDCT — needs separate dispatch with chroma stride
  - Z-scan permutation of per-MB 4x4 block order (currently flat
    raster; FFmpeg's per-MB coeffs[] uses spec §6.4.3 z-scan).
    Bit-exact against FFmpeg requires this permutation; deferred to
    the test-vector PR.
  - dmabuf export (still memcpy-out)
  - Stage 5 RGBA opt-in

API surface unchanged from the scaffold PR; only the body of
flush_frame becomes non-stub.  Internal helpers stay file-local.

Stacks on noether/repo-scaffold (PR #2).  Rebase on main after #2
lands; the diff is purely additive against the scaffold.
2026-05-24 22:15:35 +02:00
claude-noether 08080f062c scaffold: CMake + API skeleton + smoke test
First code on daedalus-decoder per the Phase 1 decisions merged 2026-05-24.
Repo skeleton only — no Vulkan pipeline yet, no shaders, no libavcodec
intercept.  Establishes the build shape so subsequent work has a place
to land.

Layout:

  LICENSE                          BSD-2-Clause (matches daedalus-fourier)
  .gitignore                       build/, CMake artefacts, *.spv
  CMakeLists.txt                   top-level — finds daedalus-fourier
                                   ≥0.1.0 via pkg-config (per §9.6
                                   decision: find_package, pinned to
                                   tagged release; .pc consumed via
                                   pkg_check_modules until we ship a
                                   CMake config), Vulkan via
                                   find_package, builds static lib
                                   + smoke test, GNUInstallDirs install
  include/daedalus_decoder.h       public API surface:
                                     - daedalus_decoder_{create,destroy,
                                                         version,has_qpu}
                                     - daedalus_decoder_set_output_format
                                       (NV12 default, RGBA opt-in per §5)
                                     - daedalus_decoder_append_mb +
                                       struct daedalus_decoder_mb_input
                                       (matches §3 per-MB descriptor)
                                     - daedalus_decoder_flush_frame
                                       (per-frame submit + wait)
                                     - daedalus_decoder_export_dmabuf
                                       (Vulkan-native VkImage export per
                                       §9.4 decision)
                                   Dimensions are CODED frame size
                                   (mod-16), not displayed — caller
                                   translates from SPS + crop offsets.
  src/internal.h                   internal mb_desc struct (matches
                                   shader std430 layout, to be nailed
                                   down once shaders exist) + per-ctx
                                   state
  src/daedalus_decoder.c           stub bodies:
                                     - create/destroy with proper resource
                                       lifecycle
                                     - append_mb validates + writes CPU
                                       staging buffers (no GPU yet)
                                     - flush_frame returns -2 (not
                                       implemented) — Phase 1 work
                                     - export_dmabuf returns -1
                                     - has_qpu / version diagnostics
  tests/test_smoke.c               link + lifecycle test: bad dims
                                   reject, OOB MB reject, null inputs
                                   reject, raster-order enforcement,
                                   mid-frame format-change reject,
                                   incomplete-frame flush reject.
                                   On hosts without V3D7 Vulkan,
                                   SKIPs gracefully (returns 0).

Verified on hertz (Pi 5 / V3D 7.1 / Mesa V3DV via daedalus-fourier
0.1.0):

  $ cmake -B build -G Ninja -DCMAKE_BUILD_TYPE=Release
  $ cmake --build build
  $ ctest --test-dir build --output-on-failure
  Test #1: smoke ... Passed

  $ ./build/test_smoke
  daedalus-decoder version: 0.0.1
  ctx created: 1920x1088, has_qpu=1
  smoke OK

Note the coded-vs-displayed dims trap: 1080p H.264 has coded height
1088 with 8 rows cropped via SPS frame_cropping_*.  Header docstring
on daedalus_decoder_create() spells this out so future callers don't
hit the multiple-of-16 reject (smoke test caught it during scaffold
write).

Next: Phase 1 implementation begins — IDCT 4×4 / 8×8 frame-scaled
dispatch (reusing daedalus-fourier shaders per Appendix A), intra
prediction wavefront, reconstruct stage, NV12 output via dmabuf
export.  Smoke test grows from "ctx lifecycle works" to
"I-frame-only Baseline decode bit-exact vs FFmpeg reference".
2026-05-24 22:08:46 +02:00