69b124adf1501cdce818614f5510493cf6521b94
3 Commits
69b124adf1
phase1/stage1: frame-scaled luma IDCT 4x4 dispatch — first GPU round-trip
flush_frame now performs a real GPU dispatch via the daedalus-fourier
public API at frame batch granularity, in contrast to the substitution-
arc shim that paid Vulkan sync overhead per-block.
What's wired:
- Build per-frame luma-4x4 meta[] in raster order across all MBs
  (N_MBs × 16 entries; 130,560 for 1080p)
- Repack per-MB coeffs[] (384 int16; first 256 are luma) into a flat
  block-major coeffs buffer (n_blocks × 16 int16)
- Allocate a frame-sized scratch Y plane, zero-initialised; no intra
  prediction yet, so "predicted" = 0
- daedalus_recipe_dispatch_h264_idct4(ctx, scratch_y, stride, coeffs,
  n_blocks, meta): ONE call, ONE vkQueueSubmit, ONE vkQueueWaitIdle
- Copy result to caller's out_y at requested stride
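The repack and meta[] steps above can be sketched as follows. This is a minimal illustration, not the code in this commit: `struct blk_meta`, `repack_luma`, and the idea that each meta entry holds the block's top-left pixel position are assumptions made for the example. The real meta layout is whatever daedalus_recipe_dispatch_h264_idct4 expects.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MB_COEFFS      384  /* int16 per MB: 256 luma + 128 chroma */
#define MB_LUMA_COEFFS 256  /* 16 luma blocks x 16 coefficients */
#define MB_LUMA_BLOCKS 16

/* Hypothetical per-block descriptor: top-left pixel of the 4x4 block. */
struct blk_meta {
    uint32_t x;
    uint32_t y;
};

/*
 * Repack per-MB coefficient arrays into one block-major luma buffer and
 * fill one meta entry per 4x4 block. Blocks inside an MB are taken in
 * flat raster order (no z-scan permutation), as the commit describes.
 * Returns the number of blocks written (mb_w * mb_h * 16).
 */
static size_t repack_luma(const int16_t *mb_coeffs,   /* n_mbs * 384 */
                          unsigned mb_w, unsigned mb_h,
                          int16_t *out_coeffs,        /* n_mbs * 256 */
                          struct blk_meta *out_meta)  /* n_mbs * 16  */
{
    size_t n = 0;
    for (unsigned mby = 0; mby < mb_h; mby++) {
        for (unsigned mbx = 0; mbx < mb_w; mbx++) {
            size_t mb = (size_t)mby * mb_w + mbx;
            /* Luma is the first 256 int16 of the MB; chroma is skipped. */
            memcpy(out_coeffs + mb * MB_LUMA_COEFFS,
                   mb_coeffs + mb * MB_COEFFS,
                   MB_LUMA_COEFFS * sizeof(int16_t));
            for (unsigned b = 0; b < MB_LUMA_BLOCKS; b++, n++) {
                out_meta[n].x = mbx * 16 + (b % 4) * 4;
                out_meta[n].y = mby * 16 + (b / 4) * 4;
            }
        }
    }
    return n;
}
```

For a 120x68 MB frame this yields 130,560 meta entries and a 130,560 × 16 int16 coefficient buffer, matching the counts quoted above.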
Measured on hertz (Pi 5 / V3D 7.1 / daedalus-fourier 0.1.0 post-pool):
    $ time ./build/test_smoke
    daedalus-decoder version: 0.0.1
    ctx created: 1920x1088, has_qpu=1
    appended 8160 MBs (120x68)
    flush_frame rc=0
    Y non-zero bytes: 0 / 2088960
    UV non-128 bytes: 0 / 1044480
    smoke OK
    real 0m0.163s
163 ms wall-clock for a full 1080p frame, including ctx-create (Vulkan init).
Per-block dispatch via the substitution arc would have paid
130,560 × ~50us = ~6.5s on the same workload, so dispatching at frame
granularity is roughly a 40x speedup (a conservative figure, since the
163 ms also covers Vulkan init).
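The arithmetic behind that estimate can be written out as a small sketch. The helper names are made up for illustration, and the 50 us per-dispatch cost is the approximate figure quoted in this log, not a new measurement.

```c
#include <stdint.h>

/* Number of 4x4 luma blocks in a frame of the given coded size
 * (both dimensions are assumed to be multiples of 16). */
static uint32_t luma_blocks_4x4(uint32_t coded_w, uint32_t coded_h)
{
    return (coded_w / 16) * (coded_h / 16) * 16;
}

/* Total sync overhead in microseconds if every block paid one
 * submit/wait round-trip of per_dispatch_us. */
static uint64_t per_block_overhead_us(uint32_t n_blocks,
                                      uint32_t per_dispatch_us)
{
    return (uint64_t)n_blocks * per_dispatch_us;
}
```

With 1920x1088 and 50 us this gives 130,560 blocks and 6,528,000 us (about 6.5 s); dividing by the measured 163 ms gives a ratio of about 40.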
Smoke validates:
- flush_frame succeeds (rc=0) on a complete frame
- Zero-coefficient input → zero-pixel Y output (clip255(IDCT(zeros))=0)
- UV plane filled with neutral grey 128 (placeholder until chroma
  dispatch lands)
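For reference, the zero-in, zero-out property follows directly from the scalar form of the transform. The sketch below is a plain CPU version of the H.264 4x4 inverse transform with add-and-clip, written for illustration only. It is not the daedalus-fourier shader, and it assumes row-major, already dequantised coefficients, which may differ from the layout the GPU path uses.

```c
#include <stdint.h>

static uint8_t clip255(int v)
{
    return (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
}

/* One 1-D pass of the 4x4 inverse transform butterfly. Reads
 * in[0], in[stride], in[2*stride], in[3*stride] and writes the same
 * positions in out. Assumes >> is an arithmetic shift for negative
 * values, as on common compilers. */
static void idct4_1d(const int *in, int *out, int stride)
{
    int d0 = in[0], d1 = in[stride];
    int d2 = in[2 * stride], d3 = in[3 * stride];
    int e0 = d0 + d2;
    int e1 = d0 - d2;
    int e2 = (d1 >> 1) - d3;
    int e3 = d1 + (d3 >> 1);
    out[0]          = e0 + e3;
    out[stride]     = e1 + e2;
    out[2 * stride] = e1 - e2;
    out[3 * stride] = e0 - e3;
}

/* Add the inverse-transformed residual of one 4x4 block to dst, which
 * already holds the prediction (all zeros in this commit). */
static void idct4x4_add_ref(uint8_t *dst, int dst_stride,
                            const int16_t coeffs[16])
{
    int a[16], b[16];
    for (int i = 0; i < 16; i++)
        a[i] = coeffs[i];
    for (int r = 0; r < 4; r++)          /* horizontal pass, per row */
        idct4_1d(a + 4 * r, b + 4 * r, 1);
    for (int c = 0; c < 4; c++)          /* vertical pass, per column */
        idct4_1d(b + c, a + c, 4);
    for (int y = 0; y < 4; y++)
        for (int x = 0; x < 4; x++) {
            uint8_t *p = dst + y * dst_stride + x;
            *p = clip255(*p + ((a[4 * y + x] + 32) >> 6));
        }
}
```

With all-zero coefficients every residual is (0 + 32) >> 6 = 0, so a zeroed scratch plane stays zero, which is what the smoke test checks.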
What's deliberately deferred to follow-on sub-PRs:
- Intra prediction wavefront (Stage 2a): predicted=0 means output
  pixels are residual-only, not a valid frame decode. Sufficient for
  Vulkan round-trip validation; not bit-exact vs FFmpeg yet.
- Motion compensation (Stage 2b) for inter MBs
- High-profile IDCT 8x8 (Stage 1 extension)
- Deblocking filter (Stage 4)
- Chroma 4x4 IDCT: needs separate dispatch with chroma stride
- Z-scan permutation of per-MB 4x4 block order (currently flat
  raster; FFmpeg's per-MB coeffs[] uses spec §6.4.3 z-scan).
  Bit-exact against FFmpeg requires this permutation; deferred to
  the test-vector PR.
- dmabuf export (still memcpy-out)
- Stage 5 RGBA opt-in
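The deferred z-scan permutation can be sketched from the inverse 4x4 luma block scan in spec §6.4.3. The function names below are hypothetical and this is not the code planned for the test-vector PR; it only shows the index mapping that the flat-raster repack currently skips.

```c
#include <stdint.h>

/* Top-left pixel offset, inside its macroblock, of the 4x4 luma block
 * with z-scan index blk_idx (0..15). */
static void zscan_block_xy(unsigned blk_idx, unsigned *x, unsigned *y)
{
    *x = ((blk_idx / 4) % 2) * 8 + ((blk_idx % 4) % 2) * 4;
    *y = ((blk_idx / 4) / 2) * 8 + ((blk_idx % 4) / 2) * 4;
}

/* Map a z-scan block index to the flat raster index (0..15, four
 * blocks per row) inside the macroblock. */
static unsigned zscan_to_raster(unsigned blk_idx)
{
    unsigned x, y;
    zscan_block_xy(blk_idx, &x, &y);
    return (y / 4) * 4 + (x / 4);
}
```

Z-scan indices 0..15 map to raster indices 0, 1, 4, 5, 2, 3, 6, 7, 8, 9, 12, 13, 10, 11, 14, 15. Until this mapping is applied, any block whose two indices differ lands in the wrong place within its MB.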
API surface unchanged from the scaffold PR; only the body of
flush_frame becomes non-stub. Internal helpers stay file-local.
Stacks on noether/repo-scaffold (PR #2). Rebase on main after #2
lands; the diff is purely additive against the scaffold.

08080f062c
scaffold: CMake + API skeleton + smoke test
First code on daedalus-decoder per the Phase 1 decisions merged 2026-05-24.
Repo skeleton only — no Vulkan pipeline yet, no shaders, no libavcodec
intercept. Establishes the build shape so subsequent work has a place
to land.
Layout:
  LICENSE                     BSD-2-Clause (matches daedalus-fourier)
  .gitignore                  build/, CMake artefacts, *.spv
  CMakeLists.txt              top-level: finds daedalus-fourier
                              ≥0.1.0 via pkg-config (per §9.6
                              decision: find_package, pinned to
                              tagged release; .pc consumed via
                              pkg_check_modules until we ship a
                              CMake config), Vulkan via
                              find_package, builds static lib
                              + smoke test, GNUInstallDirs install
  include/daedalus_decoder.h  public API surface:
                              - daedalus_decoder_{create,destroy,
                                version,has_qpu}
                              - daedalus_decoder_set_output_format
                                (NV12 default, RGBA opt-in per §5)
                              - daedalus_decoder_append_mb +
                                struct daedalus_decoder_mb_input
                                (matches §3 per-MB descriptor)
                              - daedalus_decoder_flush_frame
                                (per-frame submit + wait)
                              - daedalus_decoder_export_dmabuf
                                (Vulkan-native VkImage export per
                                §9.4 decision)
                              Dimensions are CODED frame size
                              (mod-16), not displayed; caller
                              translates from SPS + crop offsets.
  src/internal.h              internal mb_desc struct (matches
                              shader std430 layout, to be nailed
                              down once shaders exist) + per-ctx
                              state
  src/daedalus_decoder.c      stub bodies:
                              - create/destroy with proper resource
                                lifecycle
                              - append_mb validates + writes CPU
                                staging buffers (no GPU yet)
                              - flush_frame returns -2 (not
                                implemented); Phase 1 work
                              - export_dmabuf returns -1
                              - has_qpu / version diagnostics
  tests/test_smoke.c          link + lifecycle test: bad dims
                              reject, OOB MB reject, null inputs
                              reject, raster-order enforcement,
                              mid-frame format-change reject,
                              incomplete-frame flush reject.
                              On hosts without V3D7 Vulkan,
                              SKIPs gracefully (returns 0).
Verified on hertz (Pi 5 / V3D 7.1 / Mesa V3DV via daedalus-fourier
0.1.0):
    $ cmake -B build -G Ninja -DCMAKE_BUILD_TYPE=Release
    $ cmake --build build
    $ ctest --test-dir build --output-on-failure
    Test #1: smoke ... Passed
    $ ./build/test_smoke
    daedalus-decoder version: 0.0.1
    ctx created: 1920x1088, has_qpu=1
    smoke OK
Note the coded-vs-displayed dims trap: 1080p H.264 has coded height
1088, with the extra 8 rows cropped via the SPS frame_crop_*_offset
fields (gated by frame_cropping_flag). The header docstring on
daedalus_decoder_create() spells this out so future callers don't hit
the multiple-of-16 reject (the smoke test caught it while the scaffold
was being written).
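A caller-side helper for this translation might look like the sketch below. `coded_dim` is a hypothetical name, not part of the daedalus-decoder API. It assumes progressive content, which matches the project scope; interlaced streams round the coded height differently.

```c
#include <stdint.h>

/* Round a displayed dimension up to the next multiple of 16, i.e. the
 * coded size in whole macroblocks. */
static uint32_t coded_dim(uint32_t displayed)
{
    return (displayed + 15u) & ~15u;
}
```

For 1920x1080 this gives 1920x1088. The 8-row difference is the crop that must come from the SPS rather than being passed to daedalus_decoder_create().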
Next: Phase 1 implementation begins — IDCT 4×4 / 8×8 frame-scaled
dispatch (reusing daedalus-fourier shaders per Appendix A), intra
prediction wavefront, reconstruct stage, NV12 output via dmabuf
export. Smoke test grows from "ctx lifecycle works" to
"I-frame-only Baseline decode bit-exact vs FFmpeg reference".

59885dd868
initial design doc — frame-level GPU H.264 decoder for V3D7
Path C of the 2026-05-23 architecture decision after the daedalus-
fourier substitution arc's per-block QPU dispatch was measured to be
>600x slower than NEON in production. Root cause: per-block synchronous
Vulkan dispatch from inside libavcodec's per-MB loops, paying ~50us of
queue-submit/wait round-trip per ~30ns of NEON-equivalent arithmetic.
NVDEC and Vulkan Video escape this by dispatching at picture-level.
Pi 5 has no dedicated H.264 hardware decode block and Mesa V3DV does
not implement VK_KHR_video_decode_h264; this project builds the same
*shape* (one submit per frame, one fence wait per frame, encoded
bitstream in, NV12 out) using V3D7 Vulkan compute as the substrate.
DESIGN.md covers:
- architecture sketch (CPU side keeps entropy decode + descriptors;
  GPU runs 4-stage compute pipeline per frame)
- per-MB descriptor layout (frame-shaped SSBO, 8160 entries for 1080p)
- inter-stage dependencies (vkCmdPipelineBarrier within one command
  buffer)
- intra prediction wavefront (~187 dispatches per frame on diagonals)
- libavcodec intercept point (macroblock-level, evolves the
  substitution shim from "dispatch now" to "append to frame buffer")
- shader inventory (existing daedalus-fourier reuse + ~14 new ones)
- 4-phase plan, 4-6 months total budget
- 7 open questions including DPB allocation, qpel parameterization,
  daemon integration shape
- explicit out-of-scope: VP9 / AV1 / HEVC / 10-bit / interlaced
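The ~187 figure for the intra wavefront is consistent with one dispatch per anti-diagonal of the 120x68 macroblock grid. The sketch below shows that count under the assumption that a macroblock on diagonal d (mbx + mby == d) only needs neighbours from earlier diagonals. The function names are illustrative and are not taken from DESIGN.md.

```c
#include <stdint.h>

/* Number of anti-diagonals in an mb_w x mb_h macroblock grid. */
static uint32_t wavefront_steps(uint32_t mb_w, uint32_t mb_h)
{
    return mb_w + mb_h - 1;
}

/* Number of macroblocks on anti-diagonal d, i.e. with mbx + mby == d. */
static uint32_t wavefront_width(uint32_t mb_w, uint32_t mb_h, uint32_t d)
{
    uint32_t lo = d < mb_w ? 0 : d - (mb_w - 1); /* smallest valid mby */
    uint32_t hi = d < mb_h ? d : mb_h - 1;       /* largest valid mby  */
    return hi >= lo ? hi - lo + 1 : 0;
}
```

For 1080p, wavefront_steps(120, 68) is 187, and the widest diagonals hold 68 macroblocks. Note that the top-right neighbour sits on the same anti-diagonal, so a schedule that must also honour it would need a steeper skew (for example mbx + 2*mby, which gives 254 steps for this grid).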
This is design only. No code beyond README.md and DESIGN.md. User
review + redirect expected before Phase 1 implementation begins.