a7a0d56ecd
First concrete deliverable on the daedalus-decoder Stage 2 path post
the 2026-05-25 architecture re-pin (memory: dejavu / frame-major UMA).
Q2 decision: CPU intra prediction. libavcodec's existing NEON intra
prediction kernels generate predicted samples per MB; daedalus-decoder
accepts those samples through the API and uses them as the IDCT-add
starting state. FFmpeg's `idct_add` semantics — dst += idct(coeffs);
clip255 — fold DESIGN.md's Stage 3 reconstruction into the existing
Stage 1 IDCT dispatch for free. No new GPU work.
API change
----------
`daedalus_decoder_mb_input` gains a `const uint8_t *predicted` field:
predicted [ 0 .. 256) — 16×16 luma, row-major raster
predicted [256 .. 320) — 8×8 Cb, row-major raster
predicted [320 .. 384) — 8×8 Cr, row-major raster
NULL is legal and equivalent to all-zero predicted samples — preserves
the existing IDCT-isolation test contract.
Internal changes
----------------
- `daedalus_decoder` gains predicted_y (W×H) and predicted_uv (planar
Cb||Cr, W×H/2) buffers allocated at create, zeroed at end of every
flush_frame so NULL `mb->predicted` is indistinguishable from
explicit zeros from one frame to the next.
- `append_mb` splats mb->predicted into predicted_y/_uv at raster
(mb_y*16, mb_x*16) for luma and (mb_y*8, mb_x*8) for each chroma
component.
- `flush_frame` replaces `calloc(scratch_y)` and `calloc(scratch_uv)`
with `malloc + memcpy from predicted_y/_uv` — the IDCT dispatch
then writes residual on top, clip-adding to the predicted samples
in place.
Test
----
`test_idct_bitexact` extended:
- Generates random predicted samples (uint8_t) per MB alongside the
existing random coeffs.
- Pre-fills the reference ref_y / ref_cb / ref_cr planes with those
same predicted samples at the corresponding raster positions
BEFORE applying ref_idct4_add / ref_idct8_add per block.
- Compares GPU output to reference byte-for-byte.
Result on hertz (Pi 5 V3D 7.1), all three substrates:
test_idct_bitexact 320 240 0xfeedface5a5a5a5a {cpu, qpu, auto}
Y bytes diff: 0/76800 (0.0000%)
Cb bytes diff: 0/19200 (0.0000%)
Cr bytes diff: 0/19200 (0.0000%)
BIT-EXACT PASS on all three substrates
Catches any silent drift between substrates and any predicted-samples
plumbing mistake on either the API or the dispatch side.
Followups
---------
- Stage 2 PR-b: deblock dispatch in flush_frame.
- Stage 2 daemon refactor (parallel, daedalus-v4l2 daemon): replace
avcodec_send_packet/receive_frame with a libavcodec-parser-only
path that drives daedalus_decoder_append_mb in raster order +
flush_frame at slice boundary.