58848bd162dbead99f91737372e3c900b97d6383
2 Commits
58848bd162
phase1/stage1: chroma 4x4 IDCT dispatch (Cb+Cr planar scratch, NV12 interleave)
Replaces the chroma placeholder (memset 128) with a real frame-scaled
4x4 IDCT dispatch for the Cb and Cr components. Each frame now takes
two Vulkan submits and waits (one luma, one chroma) instead of one
submit plus a memset.
Implementation:
- One combined planar scratch buffer (W*H/2 bytes) holds Cb then Cr;
a single `daedalus_recipe_dispatch_h264_idct4` call processes both
components by setting meta[].dst_off accordingly (Cr blocks add
cb_plane_size).
- Stride = W/2 (chroma row pitch); shared between Cb and Cr since
they have identical geometry.
- Per-MB coeff layout already had [256..320) for Cb and [320..384)
for Cr (4 raster-order 4x4 blocks per component) from the original
daedalus_decoder_append_mb design — no header-side changes.
- Post-dispatch CPU memcpy loop interleaves Cb[r][c] and Cr[r][c]
into NV12 UV at out_uv[r][2c..2c+1]. ~1 MB/frame at 1080p, well
off the critical path; a GPU-side interleave shader is a Stage-5
optimisation.
- Chroma dispatch is gated on out_uv != NULL so callers that only
want luma (e.g. the bit-exact test before this PR) still pay
nothing.
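The offset arithmetic and the interleave loop in the list above can be sketched as follows. This is a minimal illustration, not the code from this commit: `chroma_dst_off` and `interleave_nv12` are hypothetical names, the block-to-pixel mapping assumes the 4 raster-order 4x4 blocks per component cover an 8x8 chroma area per MB, and the NV12 UV rows are assumed to be tightly packed (pitch = W).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: byte offset of one chroma 4x4 block inside the
 * combined planar scratch (Cb plane followed by Cr plane).
 * mb_x/mb_y are macroblock coordinates, blk is 0..3 in raster order
 * inside the MB's 8x8 chroma area, is_cr selects the Cr plane. */
static size_t chroma_dst_off(int width, int height, int mb_x, int mb_y,
                             int blk, int is_cr)
{
    size_t stride = (size_t)width / 2;                 /* chroma row pitch */
    size_t cb_plane_size = stride * ((size_t)height / 2);
    size_t x = (size_t)mb_x * 8 + (size_t)(blk & 1) * 4;
    size_t y = (size_t)mb_y * 8 + (size_t)(blk >> 1) * 4;
    return (is_cr ? cb_plane_size : 0) + y * stride + x;
}

/* Hypothetical helper: interleave planar Cb/Cr into NV12 UV rows,
 * i.e. out_uv[r][2c] = Cb[r][c] and out_uv[r][2c+1] = Cr[r][c].
 * Assumes a tightly packed UV plane (row pitch == width). */
static void interleave_nv12(const uint8_t *cb, const uint8_t *cr,
                            uint8_t *out_uv, int width, int height)
{
    int cw = width / 2;
    int ch = height / 2;
    for (int r = 0; r < ch; r++) {
        const uint8_t *cb_row = cb + (size_t)r * cw;
        const uint8_t *cr_row = cr + (size_t)r * cw;
        uint8_t *uv_row = out_uv + (size_t)r * width;
        for (int c = 0; c < cw; c++) {
            uv_row[2 * c]     = cb_row[c];
            uv_row[2 * c + 1] = cr_row[c];
        }
    }
}
```

With a combined scratch, a caller would pass `scratch` as `cb` and `scratch + cb_plane_size` as `cr`.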
Test changes:
- tests/test_idct_bitexact.c now runs a parallel reference IDCT for
the Cb and Cr planes (W/2 x H/2 each), then deinterleaves the NV12 UV
output back into Cb/Cr for the comparison. Random coeffs in
[-512, 511] for all 384 per-MB int16 slots (previously only luma was
randomised).
- tests/test_smoke.c UV expectation flipped from "all 128 placeholder"
to "all 0" (real dispatch with zero coeffs). Sentinel 0xcd
pre-fill stays — same purpose: catches read-then-write bugs.
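The test-side deinterleave and byte compare might look like the sketch below. Both function names are hypothetical and do not come from tests/test_idct_bitexact.c; the UV plane is assumed to be tightly packed (row pitch = W).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: split an NV12 UV plane back into planar Cb and Cr
 * (each W/2 x H/2) so they can be compared against a planar reference. */
static void deinterleave_nv12(const uint8_t *uv, uint8_t *cb, uint8_t *cr,
                              int width, int height)
{
    int cw = width / 2;
    int ch = height / 2;
    for (int r = 0; r < ch; r++) {
        const uint8_t *uv_row = uv + (size_t)r * width;
        for (int c = 0; c < cw; c++) {
            cb[(size_t)r * cw + c] = uv_row[2 * c];
            cr[(size_t)r * cw + c] = uv_row[2 * c + 1];
        }
    }
}

/* Hypothetical helper: number of bytes that differ between two planes. */
static size_t count_diff(const uint8_t *a, const uint8_t *b, size_t n)
{
    size_t diff = 0;
    for (size_t i = 0; i < n; i++)
        diff += (a[i] != b[i]);
    return diff;
}
```

A bit-exact pass corresponds to `count_diff` returning 0 for each of the Cb and Cr planes.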
Verified on hertz (Pi 5 / V3D 7.1 / daedalus-fourier 0.1.0):
$ ctest --test-dir build --output-on-failure
Start 1: smoke
1/2 Test #1: smoke ............................ Passed 1.27 sec
Start 2: idct_bitexact
2/2 Test #2: idct_bitexact .................... Passed 0.05 sec
100% tests passed, 0 tests failed out of 2
$ ./build/test_idct_bitexact
test_idct_bitexact: 320x240 (300 MBs), seed=0xfeedface5a5a5a5a
Y bytes total: 76800
Y bytes diff: 0 (0.0000%)
Cb bytes total: 19200 diff: 0 (0.0000%)
Cr bytes total: 19200 diff: 0 (0.0000%)
BIT-EXACT PASS (Y + Cb + Cr)
$ ./build/test_smoke
daedalus-decoder version: 0.0.1
ctx created: 1920x1088, has_qpu=1
appended 8160 MBs (120x68)
flush_frame rc=0
Y non-zero bytes: 0 / 2088960
UV non-zero bytes: 0 / 1044480
smoke OK
(Smoke's 1.27s includes the 1080p frame: 8160 MBs * 16 = 130,560 luma
blocks + 8160 * 8 = 65,280 chroma blocks across two dispatches —
shader pool warm-up dominates the wall time, not the IDCT work.)
What's NOT covered yet (deferred):
- Chroma DC (2x2 Hadamard) and Intra16x16 luma DC (4x4 Hadamard)
pre-passes. Real H.264 gathers the per-block DC coefficients of a
chroma component and runs them through a Hadamard before each result
rejoins its AC block; we currently treat all chroma blocks as plain
4x4 AC. Will land alongside the libavcodec intercept patch, since
CABAC/CAVLC is where the DC vs AC distinction is exposed.
- Z-scan permutation for FFmpeg compatibility — only matters at the
intercept boundary, not here.
- IDCT 8x8 (High profile).
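For orientation, the deferred chroma DC pre-pass for 4:2:0 is a 2x2 Hadamard over the four DC coefficients of one chroma component. The sketch below shows only that butterfly; the dequantisation scaling that follows it in H.264 is deliberately omitted, and `chroma_dc_hadamard2x2` is a hypothetical name, not a function in this repository.

```c
#include <stdint.h>

/* Hypothetical helper: 2x2 Hadamard, f = H * c * H with H = [[1,1],[1,-1]].
 * in/out hold the 2x2 array in raster order: {c00, c01, c10, c11}.
 * Dequantisation scaling is NOT applied here. */
static void chroma_dc_hadamard2x2(const int16_t in[4], int out[4])
{
    int a = in[0], b = in[1], c = in[2], d = in[3];
    out[0] = a + b + c + d;
    out[1] = a - b + c - d;
    out[2] = a + b - c - d;
    out[3] = a - b - c + d;
}
```

Each output would then be scaled and written into the DC slot of the matching 4x4 block before the regular 4x4 IDCT runs.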
Closes the "chroma is a stub" item from PR #3's "what's NOT done" list.
948697ef0d
phase1/stage1: bit-exact gate for the frame-scaled luma IDCT 4x4
Adds test_idct_bitexact that exercises daedalus_decoder_flush_frame
end-to-end with random coefficients and compares every output byte
against an inline C reference of the H.264 §8.5.12.1 1D butterfly.
Closes the validation gap from the previous PR ("dispatch succeeds"
becomes "dispatch is bit-exact").
What's tested:
- 320×240 coded frame (300 MBs = 4800 luma blocks), enough to cover
multiple workgroups of the V3D shader (16 blocks/WG → 300 WGs)
- Per-MB → flat-raster block layout consistent with flush_frame
- Random coeffs in [-512, 511] (same range as daedalus-fourier
cycle-6 M1 gate)
- Inline C reference: H.264 §8.5.12.1 butterfly with column-major
block layout, +32 rounding, >>6, add-to-predicted (=0), clip255 —
mirrors daedalus-fourier tests/h264_idct4_ref.c
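A reference of the kind described in the last bullet can be sketched as below. It is an illustrative reconstruction of the standard H.264 4x4 inverse transform, not a copy of tests/h264_idct4_ref.c; the names `butterfly4` and `idct4_add_ref` are hypothetical, and the column-major indexing `coeff[col * 4 + row]` is an assumption about what "column-major block layout" means here.

```c
#include <stddef.h>
#include <stdint.h>

static uint8_t clip255(int v)
{
    return (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
}

/* 1D butterfly of the H.264 4x4 inverse transform.
 * Relies on arithmetic right shift for negative values, which is what
 * mainstream compilers implement. */
static void butterfly4(const int d[4], int f[4])
{
    int e0 = d[0] + d[2];
    int e1 = d[0] - d[2];
    int e2 = (d[1] >> 1) - d[3];
    int e3 = d[1] + (d[3] >> 1);
    f[0] = e0 + e3;
    f[1] = e1 + e2;
    f[2] = e1 - e2;
    f[3] = e0 - e3;
}

/* Hypothetical reference: row pass, column pass, +32 rounding, >>6,
 * add to the predicted pixels already in dst, clip to 0..255.
 * Assumes coeff is stored column-major: coeff[col * 4 + row]. */
static void idct4_add_ref(uint8_t *dst, size_t stride, const int16_t coeff[16])
{
    int tmp[4][4]; /* tmp[row][col] after the row pass */

    for (int r = 0; r < 4; r++) {
        int d[4], f[4];
        for (int c = 0; c < 4; c++)
            d[c] = coeff[c * 4 + r];
        butterfly4(d, f);
        for (int c = 0; c < 4; c++)
            tmp[r][c] = f[c];
    }

    for (int c = 0; c < 4; c++) {
        int d[4], f[4];
        for (int r = 0; r < 4; r++)
            d[r] = tmp[r][c];
        butterfly4(d, f);
        for (int r = 0; r < 4; r++) {
            uint8_t *p = dst + (size_t)r * stride + c;
            *p = clip255(*p + ((f[r] + 32) >> 6));
        }
    }
}
```

With dst zero-initialised, as flush_frame does for scratch_y, the add step reduces to clip255 of the rounded IDCT output.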
Verified on hertz (Pi 5 / V3D 7.1 / daedalus-fourier 0.1.0):
$ ctest --test-dir build --output-on-failure
Start 1: smoke
1/2 Test #1: smoke ............................ Passed 0.16 sec
Start 2: idct_bitexact
2/2 Test #2: idct_bitexact .................... Passed 0.03 sec
100% tests passed, 0 tests failed out of 2
Bit-exact PASS first try — daedalus-fourier's V3D IDCT 4x4 shader
produces identical pixels to the C reference for all 4800 blocks in
the test frame. Validates BOTH the shader correctness AND the
frame-batched-dispatch correctness (this is the first time
n_blocks > ~30 has been exercised at the recipe-dispatch layer; the
substitution arc only ever called with n_blocks=1).
What is NOT tested by this PR (deferred to follow-ons):
- Non-zero predicted pixels: flush_frame zero-initialises scratch_y,
so the IDCT-ADD reduces to clip255(IDCT). Real predicted pixels come
from Stage 2a intra prediction.
- Z-scan permutation between FFmpeg's per-MB coeffs layout and our
per-MB → flat raster — the test uses its own coefficient generator
that already matches our layout, so it doesn't exercise the
permutation. The libavcodec-intercept patch is where the
permutation lands and gets validated against real H.264 streams.
- Chroma 4×4 IDCT.
- IDCT 8×8 (High profile).
Stacked on noether/phase1-stage1-idct (PR #3, the frame-scaled
dispatch). Rebase on main after #3 lands; the diff is purely additive
(one new test file + 5 lines of CMake).