b707daf69f12af44f716c6523340feabd22c7de1
5 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
a7a0d56ecd |
Stage 2 PR-a: predicted samples plumbing — caller-supplied per-MB pixels
First concrete deliverable on the daedalus-decoder Stage 2 path after
the 2026-05-25 architecture re-pin (memory: dejavu / frame-major UMA).
Q2 decision: CPU intra prediction. libavcodec's existing NEON intra
prediction kernels generate predicted samples per MB; daedalus-decoder
accepts those samples through the API and uses them as the IDCT-add
starting state. FFmpeg's `idct_add` semantics — dst += idct(coeffs);
clip255 — fold DESIGN.md's Stage 3 reconstruction into the existing
Stage 1 IDCT dispatch for free. No new GPU work.
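A minimal sketch of the clip-add semantics described above, assuming the residual has already been inverse-transformed and rounded. The helper names (`clip255`, `add_residual_4x4`) and the row-major residual layout are illustrative assumptions, not the daedalus-decoder or FFmpeg code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Clamp an int to the 8-bit pixel range [0, 255]. */
static inline uint8_t clip255(int v)
{
    return (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
}

/* Hypothetical helper: add a post-IDCT 4x4 residual onto dst in place.
 * On entry dst holds predicted samples; on exit it holds reconstructed
 * samples. With dst pre-zeroed this reduces to clip255(residual). */
static void add_residual_4x4(uint8_t *dst, size_t stride,
                             const int16_t residual[16])
{
    assert(dst && residual);
    for (size_t r = 0; r < 4; r++)
        for (size_t c = 0; c < 4; c++)
            dst[r * stride + c] =
                clip255(dst[r * stride + c] + residual[r * 4 + c]);
}
```

Because the destination doubles as the prediction input, no separate reconstruction pass is needed once the predicted samples are in place before the dispatch.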
API change
----------
`daedalus_decoder_mb_input` gains a `const uint8_t *predicted` field:
predicted [ 0 .. 256) — 16×16 luma, row-major raster
predicted [256 .. 320) — 8×8 Cb, row-major raster
predicted [320 .. 384) — 8×8 Cr, row-major raster
NULL is legal and equivalent to all-zero predicted samples — preserves
the existing IDCT-isolation test contract.
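The 384-byte layout above can be sketched as follows. The constant and function names (`PRED_*`, `pack_predicted`, `predicted_luma_at`) are hypothetical and only illustrate the offsets and the NULL-means-zero rule; they are not part of the daedalus-decoder header.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Offsets into the per-MB predicted buffer, as listed above. */
enum {
    PRED_Y_OFF  = 0,   /* 16x16 luma, row-major */
    PRED_CB_OFF = 256, /* 8x8 Cb, row-major     */
    PRED_CR_OFF = 320, /* 8x8 Cr, row-major     */
    PRED_SIZE   = 384
};

/* Hypothetical helper: pack separately held predicted blocks for one
 * macroblock into the single buffer the API expects. */
static void pack_predicted(uint8_t out[PRED_SIZE], const uint8_t y[256],
                           const uint8_t cb[64], const uint8_t cr[64])
{
    assert(out && y && cb && cr);
    memcpy(out + PRED_Y_OFF, y, 256);
    memcpy(out + PRED_CB_OFF, cb, 64);
    memcpy(out + PRED_CR_OFF, cr, 64);
}

/* Read one predicted luma sample; a NULL buffer reads as zero. */
static uint8_t predicted_luma_at(const uint8_t *predicted, int row, int col)
{
    assert(row >= 0 && row < 16 && col >= 0 && col < 16);
    return predicted ? predicted[PRED_Y_OFF + row * 16 + col] : 0;
}
```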
Internal changes
----------------
- `daedalus_decoder` gains predicted_y (W×H) and predicted_uv (planar
Cb||Cr, W×H/2) buffers allocated at create, zeroed at end of every
flush_frame so NULL `mb->predicted` is indistinguishable from
explicit zeros from one frame to the next.
- `append_mb` splats mb->predicted into predicted_y/_uv at raster
(mb_y*16, mb_x*16) for luma and (mb_y*8, mb_x*8) for each chroma
component.
- `flush_frame` replaces `calloc(scratch_y)` and `calloc(scratch_uv)`
with `malloc + memcpy from predicted_y/_uv` — the IDCT dispatch
then writes residual on top, clip-adding to the predicted samples
in place.
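The splat step can be sketched roughly as below, assuming frame dimensions that are multiples of 16 and a planar Cb-then-Cr chroma buffer of W×H/2 bytes. `splat_predicted` is a hypothetical stand-in for what `append_mb` does internally; the real function's signature is not shown in this commit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch: copy one MB's predicted samples into the
 * frame-sized predicted planes. pred_uv is planar: Cb plane first,
 * then Cr, each (width/2) x (height/2). */
static void splat_predicted(uint8_t *pred_y, uint8_t *pred_uv,
                            int width, int height,
                            int mb_x, int mb_y, const uint8_t *predicted)
{
    assert(width % 16 == 0 && height % 16 == 0);
    assert(mb_x >= 0 && mb_x < width / 16);
    assert(mb_y >= 0 && mb_y < height / 16);
    if (!predicted)
        return; /* planes are assumed pre-zeroed, so NULL means zeros */

    const size_t cw = (size_t)width / 2;              /* chroma row pitch */
    const size_t cb_size = cw * ((size_t)height / 2); /* Cr starts here   */

    /* Luma: 16 rows of 16 bytes at raster (mb_y*16, mb_x*16). */
    for (size_t r = 0; r < 16; r++)
        memcpy(pred_y + ((size_t)mb_y * 16 + r) * (size_t)width
                      + (size_t)mb_x * 16,
               predicted + r * 16, 16);

    /* Chroma: 8 rows of 8 bytes at raster (mb_y*8, mb_x*8) per plane. */
    for (size_t r = 0; r < 8; r++) {
        const size_t off = ((size_t)mb_y * 8 + r) * cw + (size_t)mb_x * 8;
        memcpy(pred_uv + off, predicted + 256 + r * 8, 8);
        memcpy(pred_uv + cb_size + off, predicted + 320 + r * 8, 8);
    }
}
```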
Test
----
`test_idct_bitexact` extended:
- Generates random predicted samples (uint8_t) per MB alongside the
existing random coeffs.
- Pre-fills the reference ref_y / ref_cb / ref_cr planes with those
same predicted samples at the corresponding raster positions
BEFORE applying ref_idct4_add / ref_idct8_add per block.
- Compares GPU output to reference byte-for-byte.
Result on hertz (Pi 5 V3D 7.1), all three substrates:
test_idct_bitexact 320 240 0xfeedface5a5a5a5a {cpu, qpu, auto}
Y bytes diff: 0/76800 (0.0000%)
Cb bytes diff: 0/19200 (0.0000%)
Cr bytes diff: 0/19200 (0.0000%)
BIT-EXACT PASS on all three substrates
Catches any silent drift between substrates and any predicted-samples
plumbing mistake on either the API or the dispatch side.
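For illustration, a byte-for-byte plane comparison producing a report line in the same shape as the result above might look like this. `count_diff_bytes` and `report_plane` are hypothetical names; the test's actual helpers are not shown in this commit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Count positions where the decoder output differs from the reference. */
static size_t count_diff_bytes(const uint8_t *got, const uint8_t *ref,
                               size_t n)
{
    assert(got && ref);
    size_t diff = 0;
    for (size_t i = 0; i < n; i++)
        diff += (got[i] != ref[i]);
    return diff;
}

/* Print "<name> bytes diff: d/n (p%)" and return 1 when bit-exact. */
static int report_plane(const char *name, const uint8_t *got,
                        const uint8_t *ref, size_t n)
{
    const size_t diff = count_diff_bytes(got, ref, n);
    const double pct = n ? 100.0 * (double)diff / (double)n : 0.0;
    printf("%s bytes diff: %zu/%zu (%.4f%%)\n", name, diff, n, pct);
    return diff == 0;
}
```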
Followups
---------
- Stage 2 PR-b: deblock dispatch in flush_frame.
- Stage 2 daemon refactor (parallel, daedalus-v4l2 daemon): replace
avcodec_send_packet/receive_frame with a libavcodec-parser-only
path that drives daedalus_decoder_append_mb in raster order +
flush_frame at slice boundary.
|
||
|
|
44ca4e550f |
phase1: substrate selector API + cross-substrate bit-exact ctest
Surfaces daedalus-fourier's substrate-override capability at the
decoder boundary. Lets tests run on CPU-only hosts (CI runners,
x86 dev boxes) AND cross-checks V3D shader output against NEON
reference on hosts that have both.
API additions (pre-0.1 ABI, additive):
- daedalus_decoder_substrate enum { AUTO, CPU, QPU }
(mirrors daedalus_substrate; isolated for ABI reasons).
- daedalus_decoder_set_substrate(dec, sub) setter, same
mid-frame-change restrictions as set_output_format.
- Default remains AUTO — the only sensible choice for production.
Internal:
- flush_frame now calls daedalus_dispatch_h264_idct{4,8} with an
explicit substrate instead of daedalus_recipe_dispatch_*. Mapped
via a small map_substrate() helper. No perf delta on AUTO (recipe
layer was just doing the same dispatch under the hood).
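A rough sketch of the mapping helper, under stated assumptions: the two enums below are stand-ins for `daedalus_decoder_substrate` and `daedalus_substrate`, and every enumerator name and value is hypothetical. The values are deliberately different to show why an explicit mapping is safer than a cast when the two enums are kept separate for ABI reasons.

```c
#include <assert.h>

/* Stand-in for the decoder-facing enum (enumerator names are assumed). */
typedef enum { DEC_SUB_AUTO = 0, DEC_SUB_CPU = 1, DEC_SUB_QPU = 2 } dec_substrate;

/* Stand-in for the library-side enum (names and values are assumed). */
typedef enum { LIB_SUB_AUTO = 10, LIB_SUB_CPU = 11, LIB_SUB_QPU = 12 } lib_substrate;

/* Translate the decoder-facing value to the library-side one.
 * Returns 0 on success, -1 for a value outside the known set. */
static int map_substrate(dec_substrate in, lib_substrate *out)
{
    assert(out);
    switch (in) {
    case DEC_SUB_AUTO: *out = LIB_SUB_AUTO; return 0;
    case DEC_SUB_CPU:  *out = LIB_SUB_CPU;  return 0;
    case DEC_SUB_QPU:  *out = LIB_SUB_QPU;  return 0;
    }
    return -1; /* bogus value: leave *out untouched */
}
```

A setter built on such a helper can reject bogus values up front, which matches the "valid + bogus" expectations added to test_smoke.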
Test changes:
- test_smoke: new EXPECTs for set_substrate (valid + bogus).
- test_idct_bitexact: new argv[4] takes "auto" (default), "cpu", or
"qpu" to force the substrate.
- CMakeLists.txt: new ctest entry `idct_bitexact_cpu` re-runs the
QVGA case forcing the CPU path. Catches silent drift between
the V3D shader and the NEON reference; both must produce
identical output for the same coefficient input (and they do —
see ctest log below).
Verified on hertz (Pi 5 / V3D 7.1 / daedalus-fourier 0.1.0):
$ ctest --test-dir build --output-on-failure
Start 1: smoke
1/4 Test #1: smoke ............................ Passed 0.10 sec
Start 2: idct_bitexact
2/4 Test #2: idct_bitexact .................... Passed 0.03 sec
Start 3: idct_bitexact_cpu
3/4 Test #3: idct_bitexact_cpu ................ Passed 0.03 sec
Start 4: idct_bitexact_1080p
4/4 Test #4: idct_bitexact_1080p .............. Passed 0.06 sec
100% tests passed, 0 tests failed out of 4
CPU substrate produces byte-identical Y + Cb + Cr planes against the
same C reference that the AUTO/QPU path matches — confirming the V3D
shaders and the daedalus-fourier NEON path agree at the spec level.
Why we plumbed the lower-level dispatch instead of leaving recipe in
place: recipe is just a thin wrapper that calls dispatch with
AUTO. Once we needed substrate control, the wrapper became a
liability (would have required adding a parallel recipe API for each
substrate); going direct is simpler and the AUTO path is unchanged.
Coverage note: idct_bitexact_cpu runs at QVGA (300 MBs); not also at
1080p because the CPU path's wall time scales linearly with block
count and a 1080p CPU run is ~0.5s on hertz — fine standalone but
slows ctest enough that it would tempt opt-in gating. The bit-exact
content is the same regardless of frame size; the 1080p variant only
exists to gate index-arithmetic bugs that surface above small int
boundaries.
|
||
|
|
adaabb1f63 |
phase1: IDCT 8x8 dispatch (High profile transform_size_8x8_flag)
Adds the High-profile 8x8 luma transform path alongside the existing
4x4 dispatch. flush_frame now partitions macroblocks by each MB's
transform_8x8 flag and issues a separate luma dispatch per partition:
- mb.transform_8x8 == 0 (Baseline/Main) → coeffs[0..256) interpreted
as 16 4x4 blocks, fed to daedalus_recipe_dispatch_h264_idct4
(existing behaviour, unchanged).
- mb.transform_8x8 == 1 (High) → coeffs[0..256) interpreted
as 4 8x8 blocks (64 int16 each, column-major), fed to the new
daedalus_recipe_dispatch_h264_idct8 call.
Both luma partitions can be non-empty in the same frame (FFmpeg sets
the flag per-MB). Each non-empty partition costs one
vkQueueSubmit + vkQueueWaitIdle; empty partitions are skipped
(common case: Baseline streams skip the 8x8 dispatch entirely).
Chroma is unchanged — 4:2:0 chroma always uses the 4x4 transform.
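The partitioning described above can be sketched as a single pass over the per-MB flags. The function names and the index-array representation are assumptions for the example; flush_frame's real bookkeeping is not shown in this commit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch: split raster MB indices into a 4x4 partition and
 * an 8x8 partition according to each MB's transform_8x8 flag.
 * idx4 and idx8 must each have room for n_mbs entries. */
static void partition_by_transform(const uint8_t *transform_8x8, size_t n_mbs,
                                   uint32_t *idx4, size_t *n4,
                                   uint32_t *idx8, size_t *n8)
{
    assert(transform_8x8 && idx4 && idx8 && n4 && n8);
    *n4 = 0;
    *n8 = 0;
    for (size_t i = 0; i < n_mbs; i++) {
        if (transform_8x8[i])
            idx8[(*n8)++] = (uint32_t)i;
        else
            idx4[(*n4)++] = (uint32_t)i;
    }
}

/* One luma dispatch per non-empty partition; empty ones are skipped. */
static int luma_dispatch_count(size_t n4, size_t n8)
{
    return (n4 > 0) + (n8 > 0);
}
```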
API surface:
- New uint8_t `transform_8x8` field in `struct daedalus_decoder_mb_input`
(after deblock_*). Backwards-compatible at the source level
because the field defaults to 0 with C99 designated initialisers
or {0} struct zeroing, both of which select the existing 4x4
path. ABI is pre-0.1 (per the header doc), so a structural change
is fine.
- Mirrored in `struct daedalus_decoder_mb_desc` (internal layout).
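To illustrate the source-level compatibility argument: with a designated initialiser or `{0}`, any field the caller does not name is zero, so existing callers keep selecting the 4x4 path. The struct below is a cut-down hypothetical mirror, not the real `struct daedalus_decoder_mb_input`; only `transform_8x8` is taken from the text, and `deblock_flags` is a placeholder for the deblock_* fields.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Cut-down stand-in for the input struct (field set is assumed). */
struct mb_input_sketch {
    const int16_t *coeffs;
    uint8_t deblock_flags;  /* placeholder for the deblock_* fields */
    uint8_t transform_8x8;  /* new field, placed after deblock_*    */
};

/* A caller written before the field existed: names only what it knows.
 * Unnamed members are zero-initialised by the language. */
static struct mb_input_sketch legacy_caller_init(const int16_t *coeffs)
{
    struct mb_input_sketch mb = { .coeffs = coeffs };
    return mb;
}

/* Select the luma transform path from the flag. */
static int uses_idct8(const struct mb_input_sketch *mb)
{
    assert(mb);
    return mb->transform_8x8 != 0;
}
```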
Test changes:
- test_idct_bitexact now exercises a mixed-mode frame: every odd
raster MB uses 8x8, every even uses 4x4 (so flush_frame's
partitioning is also under test, not just the underlying shaders).
- 8x8 C reference (h264_idct8_butterfly + ref_idct8_add)
transcribed from daedalus-fourier tests/h264_idct8_ref.c per
H.264 §8.5.13.2. Block layout column-major; +32 >> 6 rounding;
add-to-predicted; clip255.
- Reference luma compute branches per MB on the same mb_8x8[]
array used to set the input flag.
Verified on hertz (Pi 5 / V3D 7.1 / daedalus-fourier 0.1.0):
$ ./build/test_idct_bitexact
test_idct_bitexact: 320x240 (300 MBs), seed=0xfeedface5a5a5a5a
MB mix: 150 4x4 MBs, 150 8x8 MBs
Y bytes total: 76800
Y bytes diff: 0 (0.0000%)
Cb bytes total: 19200 diff: 0 (0.0000%)
Cr bytes total: 19200 diff: 0 (0.0000%)
BIT-EXACT PASS (Y + Cb + Cr)
$ ctest --test-dir build
100% tests passed, 0 tests failed out of 2
Bit-exact PASS first try for the 8x8 path — 150 8x8 MBs × 4 blocks =
600 8x8 IDCTs against the spec C reference, identical output.
Validates both the daedalus-fourier IDCT 8x8 shader (already gated
by its own cycle-7 bit-exact test, now also gated end-to-end through
our flush_frame), and our 8x8 layout assumptions (column-major coeffs,
raster sb_y*2+sb_x block order, top-left = mb*16 + sb*8).
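The 8x8 layout assumptions listed above can be written down as small index helpers. The names are hypothetical; the arithmetic follows the text (column-major coefficients, raster sb_y*2+sb_x block order, top-left at mb*16 + sb*8).

```c
#include <assert.h>
#include <stddef.h>

/* Byte offset of the top-left pixel of 8x8 sub-block sb (0..3) of the
 * macroblock at (mb_x, mb_y) in a luma plane with the given stride. */
static size_t luma8x8_dst_offset(size_t stride, unsigned mb_x, unsigned mb_y,
                                 unsigned sb)
{
    assert(sb < 4);
    const size_t sb_x = sb & 1u; /* sb = sb_y*2 + sb_x */
    const size_t sb_y = sb >> 1;
    return ((size_t)mb_y * 16 + sb_y * 8) * stride
         + (size_t)mb_x * 16 + sb_x * 8;
}

/* Offset of sub-block sb inside the MB's coeffs[0..256) luma range. */
static size_t luma8x8_coeff_offset(unsigned sb)
{
    assert(sb < 4);
    return (size_t)sb * 64;
}

/* Column-major index of coefficient (row, col) inside one 8x8 block. */
static size_t coeff8x8_index(unsigned row, unsigned col)
{
    assert(row < 8 && col < 8);
    return (size_t)col * 8 + row;
}
```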
What's NOT covered yet (deferred):
- Z-scan permutation for FFmpeg compatibility (libavcodec intercept
patch's concern; both 4x4 and 8x8 z-scans differ).
- Chroma DC / luma Intra16x16 DC Hadamard pre-pass.
- Mixed intra/inter MB handling — currently all MBs treated as
residual-only (predicted=0).
Closes the "IDCT 8x8 (High profile)" item from PR #3's deferred list.
|
||
|
|
58848bd162 |
phase1/stage1: chroma 4x4 IDCT dispatch (Cb+Cr planar scratch, NV12 interleave)
Replaces the chroma placeholder (memset 128) with a real frame-scaled
4x4 IDCT dispatch for the Cb and Cr components. Two Vulkan submits +
waits per frame now (one luma, one chroma) instead of one + memset.
Implementation:
- One combined planar scratch buffer (W*H/2 bytes) holds Cb then Cr;
a single `daedalus_recipe_dispatch_h264_idct4` call processes both
components by setting meta[].dst_off accordingly (Cr blocks add
cb_plane_size).
- Stride = W/2 (chroma row pitch); shared between Cb and Cr since
they have identical geometry.
- Per-MB coeff layout already had [256..320) for Cb and [320..384)
for Cr (4 raster-order 4x4 blocks per component) from the original
daedalus_decoder_append_mb design — no header-side changes.
- Post-dispatch CPU memcpy loop interleaves Cb[r][c] and Cr[r][c]
into NV12 UV at out_uv[r][2c..2c+1]. ~1 MB/frame at 1080p, well
off the critical path; a GPU-side interleave shader is a Stage-5
optimisation.
- Chroma dispatch is gated on out_uv != NULL so callers that only
want luma (e.g. the bit-exact test before this PR) still pay
nothing.
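A sketch of the planar-to-NV12 step and of the Cr destination offset, assuming even frame dimensions and a UV output pitch equal to the frame width. Function names are hypothetical; `cb_plane_size` follows the wording above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Destination offset of a chroma sample inside the combined planar
 * scratch: Cr samples sit cb_plane_size bytes after their Cb twins. */
static size_t chroma_dst_off(size_t width, size_t height, int is_cr,
                             size_t row, size_t col)
{
    const size_t cw = width / 2;
    const size_t cb_plane_size = cw * (height / 2);
    return (is_cr ? cb_plane_size : 0) + row * cw + col;
}

/* Interleave planar Cb then Cr (each W/2 x H/2) into NV12 UV:
 * out_uv[r][2c] = Cb[r][c], out_uv[r][2c+1] = Cr[r][c]. */
static void interleave_nv12_uv(uint8_t *out_uv, const uint8_t *scratch_uv,
                               size_t width, size_t height)
{
    assert(out_uv && scratch_uv);
    assert(width % 2 == 0 && height % 2 == 0);
    const size_t cw = width / 2, ch = height / 2;
    const uint8_t *cb = scratch_uv;
    const uint8_t *cr = scratch_uv + cw * ch;
    for (size_t r = 0; r < ch; r++) {
        for (size_t c = 0; c < cw; c++) {
            out_uv[r * width + 2 * c]     = cb[r * cw + c];
            out_uv[r * width + 2 * c + 1] = cr[r * cw + c];
        }
    }
}
```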
Test changes:
- tests/test_idct_bitexact.c extended with parallel reference IDCT
for Cb and Cr planes (W/2 x H/2 each), then deinterleaves NV12 UV
back into Cb/Cr for the compare. Random coeffs in [-512, 511] for
all 384 per-MB int16 slots (previously only luma was randomised).
- tests/test_smoke.c UV expectation flipped from "all 128 placeholder"
to "all 0" (real dispatch with zero coeffs). Sentinel 0xcd
pre-fill stays — same purpose: catches read-then-write bugs.
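For reference, one way to fill all 384 per-MB slots with seeded values in [-512, 511]. The generator the test actually uses is not shown here; this sketch swaps in SplitMix64 as a stand-in, and `fill_random_coeffs` is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* SplitMix64 step: advances *state and returns a 64-bit value. */
static uint64_t splitmix64(uint64_t *state)
{
    uint64_t z = (*state += 0x9E3779B97F4A7C15ULL);
    z = (z ^ (z >> 30)) * 0xBF58476D1CE4E5B9ULL;
    z = (z ^ (z >> 27)) * 0x94D049BB133111EBULL;
    return z ^ (z >> 31);
}

/* Fill n int16 slots with values in [-512, 511]: take the low 10 bits
 * (0..1023) and recentre. The same seed gives the same sequence. */
static void fill_random_coeffs(int16_t *coeffs, size_t n, uint64_t *state)
{
    assert(coeffs && state);
    for (size_t i = 0; i < n; i++)
        coeffs[i] = (int16_t)((int)(splitmix64(state) & 1023u) - 512);
}
```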
Verified on hertz (Pi 5 / V3D 7.1 / daedalus-fourier 0.1.0):
$ ctest --test-dir build --output-on-failure
Start 1: smoke
1/2 Test #1: smoke ............................ Passed 1.27 sec
Start 2: idct_bitexact
2/2 Test #2: idct_bitexact .................... Passed 0.05 sec
100% tests passed, 0 tests failed out of 2
$ ./build/test_idct_bitexact
test_idct_bitexact: 320x240 (300 MBs), seed=0xfeedface5a5a5a5a
Y bytes total: 76800
Y bytes diff: 0 (0.0000%)
Cb bytes total: 19200 diff: 0 (0.0000%)
Cr bytes total: 19200 diff: 0 (0.0000%)
BIT-EXACT PASS (Y + Cb + Cr)
$ ./build/test_smoke
daedalus-decoder version: 0.0.1
ctx created: 1920x1088, has_qpu=1
appended 8160 MBs (120x68)
flush_frame rc=0
Y non-zero bytes: 0 / 2088960
UV non-zero bytes: 0 / 1044480
smoke OK
(Smoke's 1.27s includes the 1080p frame: 8160 MBs * 16 = 130,560 luma
blocks + 8160 * 8 = 65,280 chroma blocks across two dispatches —
shader pool warm-up dominates the wall time, not the IDCT work.)
What's NOT covered yet (deferred):
- Chroma DC (2x2 Hadamard) and Intra16x16 luma DC (4x4 Hadamard)
pre-pass. Real H.264 puts each component's chroma DC coefficients
through a Hadamard before they rejoin their AC blocks; we currently
treat all chroma blocks as plain 4x4 AC. Will land alongside the
libavcodec intercept patch, since CABAC/CAVLC is where the DC vs AC
distinction is exposed.
- Z-scan permutation for FFmpeg compatibility — only matters at the
intercept boundary, not here.
- IDCT 8x8 (High profile).
Closes the "chroma is a stub" item from PR #3's "what's NOT done" list.
|
||
|
|
948697ef0d |
phase1/stage1: bit-exact gate for the frame-scaled luma IDCT 4x4
Adds test_idct_bitexact that exercises daedalus_decoder_flush_frame
end-to-end with random coefficients and compares every output byte
against an inline C reference of the H.264 §8.5.12.1 1D butterfly.
Closes the validation gap from the previous PR ("dispatch succeeds"
becomes "dispatch is bit-exact").
What's tested:
- 320×240 coded frame (300 MBs), enough to cover multiple workgroups
of the V3D shader (16 blocks/WG; 4800 blocks → 300 WGs)
- Per-MB → flat-raster block layout consistent with flush_frame
- Random coeffs in [-512, 511] (same range as daedalus-fourier
cycle-6 M1 gate)
- Inline C reference: H.264 §8.5.12.1 butterfly with column-major
block layout, +32 rounding, >>6, add-to-predicted (=0), clip255 —
mirrors daedalus-fourier tests/h264_idct4_ref.c
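A minimal sketch of a 4x4 reference in the shape the list above describes: column-major coefficients, +32 rounding, >>6, add-to-destination, clip255. It follows the widely used two-pass butterfly form and is not a copy of tests/h264_idct4_ref.c; it assumes arithmetic right shift of negative ints, as mainstream compilers provide.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static inline uint8_t clip255(int v)
{
    return (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
}

/* coeffs are column-major: index = col*4 + row. dst holds predicted
 * samples on entry (zeros in this test) and the result on exit. */
static void ref_idct4_add(uint8_t *dst, size_t stride,
                          const int16_t coeffs[16])
{
    assert(dst && coeffs);
    int t[16];
    for (int i = 0; i < 16; i++)
        t[i] = coeffs[i];
    t[0] += 32; /* rounding term; DC reaches every output with weight 1 */

    /* Pass 1: horizontal butterfly across the four columns of each row. */
    for (int row = 0; row < 4; row++) {
        const int z0 = t[row + 0] + t[row + 8];
        const int z1 = t[row + 0] - t[row + 8];
        const int z2 = (t[row + 4] >> 1) - t[row + 12];
        const int z3 = t[row + 4] + (t[row + 12] >> 1);
        t[row + 0]  = z0 + z3;
        t[row + 4]  = z1 + z2;
        t[row + 8]  = z1 - z2;
        t[row + 12] = z0 - z3;
    }

    /* Pass 2: vertical butterfly down each column, then >>6, add, clip. */
    for (int col = 0; col < 4; col++) {
        const int *c = &t[col * 4];
        const int z0 = c[0] + c[2];
        const int z1 = c[0] - c[2];
        const int z2 = (c[1] >> 1) - c[3];
        const int z3 = c[1] + (c[3] >> 1);
        uint8_t *d = dst + col;
        d[0 * stride] = clip255(d[0 * stride] + ((z0 + z3) >> 6));
        d[1 * stride] = clip255(d[1 * stride] + ((z1 + z2) >> 6));
        d[2 * stride] = clip255(d[2 * stride] + ((z1 - z2) >> 6));
        d[3 * stride] = clip255(d[3 * stride] + ((z0 - z3) >> 6));
    }
}
```

A DC-only block with coefficient 64 adds exactly 1 to every pixel, which makes a convenient sanity check before moving on to random coefficients.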
Verified on hertz (Pi 5 / V3D 7.1 / daedalus-fourier 0.1.0):
$ ctest --test-dir build --output-on-failure
Start 1: smoke
1/2 Test #1: smoke ............................ Passed 0.16 sec
Start 2: idct_bitexact
2/2 Test #2: idct_bitexact .................... Passed 0.03 sec
100% tests passed, 0 tests failed out of 2
Bit-exact PASS first try — daedalus-fourier's V3D IDCT 4x4 shader
produces identical pixels to the C reference for all 4800 blocks in
the test frame. Validates BOTH the shader correctness AND the
frame-batched-dispatch correctness (this is the first time
n_blocks > ~30 has been exercised at the recipe-dispatch layer; the
substitution arc only ever called with n_blocks=1).
What is NOT tested by this PR (deferred to follow-ons):
- Non-zero predicted pixels — flush_frame zero-initialises scratch_y,
so the IDCT-ADD reduces to clip255(IDCT). Real predicted comes
from Stage 2a intra prediction.
- Z-scan permutation between FFmpeg's per-MB coeffs layout and our
per-MB → flat raster — the test uses its own coefficient generator
that already matches our layout, so it doesn't exercise the
permutation. The libavcodec-intercept patch is where the
permutation lands and gets validated against real H.264 streams.
- Chroma 4×4 IDCT.
- IDCT 8×8 (High profile).
Stacked on noether/phase1-stage1-idct (PR #3, the frame-scaled
dispatch). Rebase on main after #3 lands; the diff is purely additive
(one new test file + 5 lines of CMake).
|