All seven questions from the initial design draft decided in the
user's 2026-05-24 review:
1. Intra prediction: GPU wavefront in Phase 1, revisit if bottleneck
2. libavcodec intercept: macroblock-level for Phase 1
3. Shader parameterisation: measure both during Phase 2 MC, pick winner
4. DPB allocation: Vulkan-native VkImage with dma_buf export
5. Daemon integration: library link
6. daedalus-fourier dep: CMake find_package, pinned to tagged release
7. Codec scope: H.264 first; HEVC/10-bit/interlaced/FMO/ASO firmly out;
VP9 + AV1 deferred to Phase 5+ but NOT firmly out (scope expansion
vs the initial draft which had grouped them with HEVC)
Section heading renamed "Open questions" → "Phase 1 decisions" with
explicit user-confirmed annotations. Each item preserves the original
wording for traceability.
§8 Phasing extended with a Phase 5+ paragraph clarifying the VP9/AV1
deferral and reaffirming HEVC's firmly-out status.
No architecture changes; only decisions captured. Phase 1
implementation can now begin against this baseline.
Read-only research done autonomously while push to marfrit/daedalus-decoder
is blocked on user perms. All findings appended to DESIGN.md; no new
files, no architecture changes.
Appendix A — daedalus-fourier shader reuse audit
- 2 shaders directly reusable (v3d_h264_idct4, v3d_h264_idct8)
just at frame scale instead of n_blocks=1 per call
- 2 shaders partial-reuse (v3d_h264deblock + v3d_h264_qpel_mc20)
serve as templates for ~20 sibling variants (horizontal/chroma
deblock variants, 15 missing qpel positions + 16x16 size + avg)
- 5 daedalus-fourier shaders not reusable (VP9/AV1 codec-specific)
- 7 brand-new shaders required (iquant, intra prediction modes,
chroma MC, reconstruct, optional yuv→rgba)
- ~22 H.264 shaders total; estimate 6-10 weeks for the inventory
if done in sequence with M1 bit-exact gate methodology
Appendix B — libavcodec intercept point
- decode_slice() at libavcodec/h264_slice.c:2598 is the loop site
- Per-MB sequence: ff_h264_decode_mb_cabac → ff_h264_hl_decode_mb
- Intercept replaces ff_h264_hl_decode_mb with a stub that snapshots
sl->mb[] (coefficients), MV/ref caches, intra modes, mb_type, QP,
non_zero_count_cache into a frame-shaped descriptor SSBO
- End-of-slice flush builds + submits the GPU pipeline
- CABAC/CAVLC stay in libavcodec (we don't re-implement entropy)
- New FFmpeg patch in marfrit-packages, sibling to 0003-0007:
0008-h264-daedalus-decoder-frame-pipeline.patch
- daedalus_decoder_active(h) gates the intercept; default OFF =
no-op = full coexistence with the kernel-pack substitution arc
Appendix C — risk register
- 6 risks catalogued: intra wavefront perf, qpel shader explosion,
Stage 5 colourspace bugs, Mesa V3DV concurrency, daedalus-fourier
pin drift, Phase 4 30fps@1080p target miss
- Highest impact: project fails to beat NEON. Acknowledged from
project start (§10), explicit pivot language.
User question 2026-05-23: 'Wayland does need a conversion of NV12 to
its output format. Could we cram that in?'
Yes — trivially. Added Stage 5 to the pipeline doc with:
- 5-line per-pixel compute shader (BT.709 limited-range example
given; matrix selected from H.264 VUI at runtime)
- explicit OPT-IN flag, off by default
- rationale for default-off: most consumers (V4L2 stateless,
Wayland zwp_linux_dmabuf NV12 passthrough, Firefox/mpv VAAPI
paths) want NV12 because compositors convert during composition
essentially for free. RGBA8 is 4x the bandwidth of NV12 — not
worth burning DMA + electrons when no downstream needs it
- colourspace metadata plumbing requirement: SPS vui_parameters
(colour_primaries, transfer_characteristics, matrix_coefficients,
video_full_range_flag) MUST flow through to the shader; default
BT.709 limited-range with warning if VUI absent
Updated the new-shader inventory to include v3d_h264_yuv_to_rgba.
Total dispatches/frame remains ~190-200; Stage 5 adds one.
Path C of the 2026-05-23 architecture decision after the daedalus-
fourier substitution arc's per-block QPU dispatch was measured to be
>600x slower than NEON in production. Root cause: per-block synchronous
Vulkan dispatch from inside libavcodec's per-MB loops, paying ~50us of
queue-submit/wait round-trip per ~30ns of NEON-equivalent arithmetic.
NVDEC and Vulkan Video escape this by dispatching at picture-level.
Pi 5 has no dedicated H.264 hardware decode block and Mesa V3DV does
not implement VK_KHR_video_decode_h264; this project builds the same
*shape* (one submit per frame, one fence wait per frame, encoded
bitstream in, NV12 out) using V3D7 Vulkan compute as the substrate.
DESIGN.md covers:
- architecture sketch (CPU side keeps entropy decode + descriptors;
GPU runs 4-stage compute pipeline per frame)
- per-MB descriptor layout (frame-shaped SSBO, ~8160 entries for 1080p)
- inter-stage dependencies (vkCmdPipelineBarrier within one command
buffer)
- intra prediction wavefront (~187 dispatches per frame on diagonals)
- libavcodec intercept point (macroblock-level, evolves the
substitution shim from "dispatch now" to "append to frame buffer")
- shader inventory (existing daedalus-fourier reuse + ~14 new ones)
- 4-phase plan, 4-6 months total budget
- 7 open questions including DPB allocation, qpel parameterization,
daemon integration shape
- explicit out-of-scope: VP9 / AV1 / HEVC / 10-bit / interlaced
This is design only. No code beyond README.md and DESIGN.md. User
review + redirect expected before Phase 1 implementation begins.