All seven questions from the initial design draft decided in the
user's 2026-05-24 review:
1. Intra prediction: GPU wavefront in Phase 1, revisit if bottleneck
2. libavcodec intercept: macroblock-level for Phase 1
3. Shader parameterisation: measure both during Phase 2 MC, pick winner
4. DPB allocation: Vulkan-native VkImage with dma_buf export
5. Daemon integration: library link
6. daedalus-fourier dep: CMake find_package, pinned to tagged release
7. Codec scope: H.264 first; HEVC/10-bit/interlaced/FMO/ASO firmly out;
VP9 + AV1 deferred to Phase 5+ but NOT firmly out (scope expansion
vs the initial draft which had grouped them with HEVC)
Section heading renamed "Open questions" → "Phase 1 decisions" with
explicit user-confirmed annotations. Each item preserves the original
wording for traceability.
§8 Phasing extended with a Phase 5+ paragraph clarifying the VP9/AV1
deferral and reaffirming HEVC's firmly-out status.
No architecture changes; only decisions captured. Phase 1
implementation can now begin against this baseline.
daedalus-decoder — design
Phase: scoping / first-draft design. No code written. Date: 2026-05-23.
This document captures the architecture for an NVDEC-shaped H.264 decoder targeting Raspberry Pi 5 (V3D7). It exists because the daedalus-fourier substitution prototype — a per-block kernel pack called synchronously from inside libavcodec's H.264 decoder — measured at >600× slower than NEON in production due to per-call Vulkan synchronization overhead. The R-band methodology correctly predicted this; the "QPU is default substrate" decree (2026-05-23) overrode it for prototype purposes and confirmed the architectural mismatch. This document is the structural redesign that addresses it.
1. The shape we need
NVDEC and Vulkan Video both target dedicated decode hardware via a picture-level command: app submits a bitstream, hardware decodes a whole frame, app gets an NV12 surface. One submit per frame, one fence wait per frame, one DMA.
We don't have dedicated H.264 silicon on Pi 5 (the chip has a separate HEVC block — rpi-hevc-dec — but no H.264 decoder). We do have V3D7 Vulkan compute. The goal is to build the same shape as NVDEC using V3D7 compute as the substrate.
┌──────────────────────────────────────────────────────────────┐
│ CPU side │
│ ┌───────────────────────────────────────────────────────┐ │
│ │ libavcodec H.264 parser (existing) │ │
│ │ - NAL split + emulation prevention removal │ │
│ │ - SPS / PPS │ │
│ │ - Slice header decode │ │
│ │ - CABAC / CAVLC entropy decode │ │
│ │ - Per-MB: mb_type, qp, partition info, │ │
│ │ transform coefficients, MVs, intra modes │ │
│ └─────────────────────┬─────────────────────────────────┘ │
│ ↓ (write into frame-shaped SSBO) │
│ │ │
└────────────────────────┼───────────────────────────────────────┘
↓ Vulkan compute queue submit
┌────────────────────────┼───────────────────────────────────────┐
│ GPU (V3D7) side ↓ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Stage 1: inverse quant + IDCT 4x4 / 8x8 │ │
│ │ in: coeffs[N_MBs × 384] │ │
│ │ out: residual[N_MBs × 384] u8/i16 │ │
│ └────────────────────────┬────────────────────────────┘ │
│ ↓ vkCmdPipelineBarrier │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Stage 2: prediction (intra + MC) │ │
│ │ in: mb_descriptors, residual, │ │
│ │ DPB reference frames │ │
│ │ out: predicted[N_MBs × 384] u8 │ │
│ │ note: intra dispatched in wavefront order; │ │
│ │ MC fully parallel │ │
│ └────────────────────────┬────────────────────────────┘ │
│ ↓ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Stage 3: reconstruct = clip255(residual + predicted) │ │
│ │ out: reconstructed Y, U, V planes │ │
│ └────────────────────────┬────────────────────────────┘ │
│ ↓ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Stage 4: deblocking filter at MB edges │ │
│ │ - V-edges first (rows independent) │ │
│ │ - H-edges next (columns independent) │ │
│ │ out: NV12 in DPB slot + dma-buf-exported │ │
│ └────────────────────────┬────────────────────────────┘ │
└────────────────────────────┼───────────────────────────────────┘
↓ vkWaitFences (one wait, one frame)
NV12 ready in V4L2 CAPTURE plane
One Vulkan submit per frame. The per-call sync cost that killed the kernel-substitution prototype goes away because the dispatch granularity is the right size: thousands of macroblocks of work amortize the ~50 µs submit/wait roundtrip.
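To put rough numbers on the amortization argument, the sketch below compares total submit/wait overhead at the two granularities. The ~50 µs roundtrip is the figure quoted above. The helper names `mb_count` and `sync_overhead_us` are hypothetical, and the per-block case assumes 24 4×4 blocks per macroblock (16 luma + 4 Cb + 4 Cr in 4:2:0) with one submit each, which is an illustrative assumption rather than a measurement of the prototype.

```c
#include <assert.h>

/* Macroblocks in a frame, rounding each dimension up to a multiple of 16. */
static long mb_count(int width, int height)
{
    assert(width > 0 && height > 0);
    return (long)((width + 15) / 16) * (long)((height + 15) / 16);
}

/* Total submit/wait overhead per frame, in microseconds. */
static double sync_overhead_us(long submits_per_frame, double roundtrip_us)
{
    assert(submits_per_frame >= 0 && roundtrip_us >= 0.0);
    return (double)submits_per_frame * roundtrip_us;
}
```

Under these assumptions a 1080p frame has `mb_count(1920, 1080)` = 8160 macroblocks, so per-block submission would cost `sync_overhead_us(8160 * 24, 50.0)`, roughly 9.8 s of synchronization per frame. One submit per frame costs 50 µs, about 0.15% of a 33.3 ms frame budget at 30 fps.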
2. CPU/GPU split
The CPU keeps everything that is inherently serial. H.264's entropy decode (CABAC, and CAVLC for Baseline-equivalent paths) is bit-by-bit serial — each binary decision updates context state that the next decision reads. NVDEC has a dedicated CABAC ASIC; doing CABAC on a SIMD compute architecture like V3D is a research project of its own and not in scope.
The CPU also keeps:
- NAL-unit boundary scanning + emulation-prevention byte removal
- SPS/PPS storage and lookup
- Slice header decoding
- Reference picture list construction (DPB management at the index level)
- Display-order reorder (B-frame POC machinery)
The CPU produces a frame-shaped descriptor that the GPU consumes.
The GPU runs everything that is massively parallel per macroblock:
| Stage | What it does | Parallelism |
|---|---|---|
| 1. Inverse quant + IDCT | Multiply coeffs by Q scale, run 4×4 / 8×8 inverse transform | Fully parallel (per block) |
| 2a. Intra prediction | Predict from neighbouring decoded pixels (top, top-right, left) | Wavefront — diagonal of MBs |
| 2b. Motion compensation | qpel filter on reference pixels | Fully parallel |
| 3. Reconstruct | residual + predicted, clip255 | Fully parallel (per pixel) |
| 4. Deblocking | Filter MB edges (alpha/beta/tc0 table) | Two-pass: V-edges (rows ∥), H-edges (cols ∥) |
3. Per-MB descriptor (CPU → GPU)
Frame-shaped SSBO: mb_descriptors[N_MBs], where N_MBs = ⌈W/16⌉ × ⌈H/16⌉ (8160 for 1080p).
struct mb_descriptor {
    uint8_t  mb_type;              /* I_NxN, I_16x16, P_..., B_..., I_PCM, ... */
    uint8_t  mb_qp_y;
    uint8_t  mb_qp_uv;
    uint8_t  cbp;                  /* coded block pattern */
    uint8_t  intra_4x4_modes[16];  /* per 4x4 sub-block, when I_NxN */
    uint8_t  intra_16x16_mode;     /* when I_16x16 */
    uint8_t  intra_chroma_mode;
    uint8_t  partition_mode;       /* P_16x16 / P_16x8 / P_8x16 / P_8x8 / ... */
    uint8_t  ref_idx_l0[4];        /* up to 4 partitions */
    uint8_t  ref_idx_l1[4];        /* B only */
    int16_t  mv_l0[4][2];          /* qpel-precision (1/4 sample) */
    int16_t  mv_l1[4][2];
    uint8_t  deblock_disable;
    int8_t   deblock_alpha_c0;
    int8_t   deblock_beta;
};
Coefficient buffer separate, also frame-shaped:
int16_t coeffs[N_MBs * 384]; /* 256 luma + 64 cb + 64 cr per MB */
Sized for the worst case (every block has nonzero coefficients). Most MBs have sparse coefficients, but we don't compact the buffer: the GPU shader iterates over every slot, and skipping zero coefficients is cheap.
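As a sizing sanity check, the following sketch computes the per-frame footprint of both buffers from the definitions above. The helper names `descriptor_bytes` and `coeff_bytes` are hypothetical. Note that the C `sizeof` is only indicative: the layout the shader sees depends on the buffer layout rules chosen for the SSBO, which this document does not specify.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Copy of the descriptor from this section, repeated so the sketch compiles. */
struct mb_descriptor {
    uint8_t  mb_type, mb_qp_y, mb_qp_uv, cbp;
    uint8_t  intra_4x4_modes[16];
    uint8_t  intra_16x16_mode, intra_chroma_mode, partition_mode;
    uint8_t  ref_idx_l0[4], ref_idx_l1[4];
    int16_t  mv_l0[4][2], mv_l1[4][2];
    uint8_t  deblock_disable;
    int8_t   deblock_alpha_c0, deblock_beta;
};

#define COEFFS_PER_MB 384   /* 256 luma + 64 Cb + 64 Cr */

/* Bytes needed for the frame-shaped descriptor array (host-side layout). */
static size_t descriptor_bytes(size_t n_mbs)
{
    assert(n_mbs > 0);
    return n_mbs * sizeof(struct mb_descriptor);
}

/* Bytes needed for the worst-case, uncompacted coefficient buffer. */
static size_t coeff_bytes(size_t n_mbs)
{
    assert(n_mbs > 0);
    return n_mbs * COEFFS_PER_MB * sizeof(int16_t);
}
```

For 1080p (8160 MBs), `coeff_bytes(8160)` is 6,266,880 bytes, just under 6 MiB per frame. The descriptor array is roughly 0.5 MiB on ABIs where the struct pads to 68 bytes, so the coefficient buffer dominates the CPU-to-GPU traffic.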
4. Pipeline stage dispatch shape
All stages live in one VkCommandBuffer per frame. Inter-stage dependencies use vkCmdPipelineBarrier with VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT as both source and destination stage, and a VK_ACCESS_SHADER_WRITE_BIT → VK_ACCESS_SHADER_READ_BIT memory dependency on the relevant SSBO.
Stage 1 (IDCT) — one workgroup per MB; many MBs per dispatch. Reuses daedalus-fourier's IDCT 4×4 / 8×8 shader logic but at a different scale: a single dispatch covers all N_MBs macroblocks, not one block.
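For reference, the per-block arithmetic that each Stage 1 invocation would perform can be sketched on the CPU as below. This is a plain C rendering of the H.264 4×4 inverse core transform (row pass, column pass, then rounding by 32 and a shift by 6), not the daedalus-fourier shader itself. The function names are hypothetical, and the input is assumed to be already dequantized.

```c
#include <assert.h>
#include <stdint.h>

/* One 1-D butterfly over four values spaced `stride` apart, in place. */
static void idct4_1d(int32_t *d, int stride)
{
    int32_t a0 = d[0], a1 = d[stride], a2 = d[2 * stride], a3 = d[3 * stride];
    int32_t e0 = a0 + a2;
    int32_t e1 = a0 - a2;
    int32_t e2 = (a1 >> 1) - a3;   /* assumes arithmetic right shift */
    int32_t e3 = a1 + (a3 >> 1);
    d[0]          = e0 + e3;
    d[stride]     = e1 + e2;
    d[2 * stride] = e1 - e2;
    d[3 * stride] = e0 - e3;
}

/* 4x4 inverse transform: coeff and residual are row-major 4x4 blocks. */
static void h264_idct4x4(const int16_t coeff[16], int16_t residual[16])
{
    int32_t t[16];
    assert(coeff != 0 && residual != 0);
    for (int i = 0; i < 16; i++)
        t[i] = coeff[i];
    for (int r = 0; r < 4; r++)
        idct4_1d(t + 4 * r, 1);    /* transform each row */
    for (int c = 0; c < 4; c++)
        idct4_1d(t + c, 4);        /* then each column */
    for (int i = 0; i < 16; i++)
        residual[i] = (int16_t)((t[i] + 32) >> 6);
}
```

A DC-only block with coefficient 64 yields a flat residual of 1 in all 16 positions, which makes a convenient first check when comparing a shader against a CPU reference.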
Stage 2a (intra) — wavefront dispatched as one dispatch per diagonal. For a 1080p frame (120×68 MB grid), there are 120+68−1 = 187 diagonals; the longest has 68 MBs. Each diagonal's MBs are independent. One dispatch per diagonal, with a barrier in between → 187 dispatches in the same command buffer (still ~225k× cheaper than per-block).
Alternative: speculative CPU intra prediction, GPU only does inter blocks. Simpler, slower. Decision deferred.
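The wavefront counts can be checked with a small schedule calculator. In this sketch MB (x, y) is assigned to wave `x + slope * y`, and one dispatch is issued per wave. `slope = 1` is the plain anti-diagonal schedule described above. `slope = 2` is shown as an assumption-labelled variant: it keeps an MB and its top-right neighbour in different waves, which a plain diagonal does not. The names `wave_stats` and `wavefront_stats` are hypothetical.

```c
#include <assert.h>

struct wave_stats {
    int waves;     /* number of waves, i.e. dispatches */
    int max_mbs;   /* macroblocks in the largest wave */
};

/* Count waves and the widest wave for a w x h macroblock grid. */
static struct wave_stats wavefront_stats(int w, int h, int slope)
{
    /* slope <= w guarantees that no wave index in the range is empty. */
    assert(w > 0 && h > 0 && slope > 0 && slope <= w);
    struct wave_stats s = { (w - 1) + slope * (h - 1) + 1, 0 };
    for (int k = 0; k < s.waves; k++) {
        int n = 0;
        for (int y = 0; y < h; y++) {
            int x = k - slope * y;      /* the only x on wave k in row y */
            if (x >= 0 && x < w)
                n++;
        }
        if (n > s.max_mbs)
            s.max_mbs = n;
    }
    return s;
}
```

For the 120×68 grid, `wavefront_stats(120, 68, 1)` gives 187 waves with at most 68 MBs each, matching the figures above. With `slope = 2` the same grid needs 254 waves of at most 60 MBs, so the dispatch count stays in the low hundreds either way.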
Stage 2b (MC) — single dispatch over all inter-MBs. Reads reference frames from DPB. The 6-tap luma filter needs 5 extra source samples in each filtered direction, so an 8×8 output block reads up to a 13×13 source window (13×8 for a purely horizontal position such as mc20). daedalus-fourier's existing v3d_h264_qpel_mc20 covers mc20; we need the other 15 variants (mc00/01/02/.../33) either as separate shaders or as one parameterized shader.
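As a reference point for the qpel variants, here is a CPU sketch of the horizontal half-sample case (the mc20 position), using the standard H.264 6-tap luma filter (1, -5, 20, 20, -5, 1) with rounding by 16 and a shift by 5. It is not the existing v3d_h264_qpel_mc20 shader; the function names are hypothetical, and the caller is assumed to guarantee the filter margin around each row.

```c
#include <assert.h>
#include <stdint.h>

static uint8_t clip_u8(int v)
{
    return (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
}

/*
 * Horizontal half-sample interpolation for one row of `width` outputs.
 * Requires 2 valid samples before src[0] and 3 after src[width - 1].
 */
static void qpel_mc20_row(const uint8_t *src, uint8_t *dst, int width)
{
    assert(src != 0 && dst != 0 && width > 0);
    for (int x = 0; x < width; x++) {
        int sum = src[x - 2] - 5 * src[x - 1] + 20 * src[x]
                + 20 * src[x + 1] - 5 * src[x + 2] + src[x + 3];
        dst[x] = clip_u8((sum + 16) >> 5);   /* assumes arithmetic shift */
    }
}
```

A flat input reproduces itself, and a 0-to-255 step interpolates to 128 at the half-sample position next to the edge. An 8-wide row reads 13 source samples, which is where the filter margin in the dispatch's reference-frame reads comes from.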
Stage 3 (reconstruct) — trivial per-pixel clip255(residual + predicted). One dispatch over all pixels.
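The reconstruct step is small enough to state in full. The sketch below is a CPU equivalent of what one shader invocation per pixel would compute; the function name `reconstruct` and the flat-array interface are assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Clamp an integer to the 8-bit sample range [0, 255]. */
static uint8_t clip255(int v)
{
    return (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
}

/* out[i] = clip255(residual[i] + predicted[i]) for n samples. */
static void reconstruct(const int16_t *residual, const uint8_t *predicted,
                        uint8_t *out, size_t n)
{
    assert(residual != NULL && predicted != NULL && out != NULL);
    for (size_t i = 0; i < n; i++)
        out[i] = clip255((int)residual[i] + (int)predicted[i]);
}
```

The residual is signed, so the sum is formed in `int` before clamping; both underflow (for example -300 + 100) and overflow (10 + 250) saturate rather than wrap.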
Stage 4 (deblock) — two dispatches: vertical edges (all rows independent), horizontal edges (all columns independent). Reuses daedalus-fourier's v3d_h264deblock shader, adapted for the new buffer layout.
Total dispatches per frame: ~200, dominated by the intra wavefront (1 IDCT + 187 intra diagonals + 1 MC + 1 reconstruct + 2 deblock = 192), all in one command buffer, one submit, one fence wait.
5. Reference frame management
DPB = N reference frames in GPU memory, each a VkImage with NV12 format, exported as dma_buf. Up to 16 reference frames per H.264 spec; typical streams use 1-4.
CPU tracks the index mapping (refIdx0/refIdx1 in MV lists → DPB slot). GPU shader receives DPB slot indices via the mb_descriptor; reads pixels via VkImage descriptor binding.
At frame-decode end, the output VkImage may become a new reference (per ref_pic_marking decisions on CPU). Slot retirement is CPU-managed.
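The CPU-side bookkeeping described above can be sketched as a small slot table. This is a minimal illustration of acquire and retire only, assuming a fixed table of 16 slots; the names `dpb_table`, `dpb_acquire`, and `dpb_retire` are hypothetical, and the real design would also track the VkImage handle and the refIdx-to-slot mapping per slice.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DPB_MAX_SLOTS 16   /* H.264 allows up to 16 reference frames */

struct dpb_table {
    bool in_use[DPB_MAX_SLOTS];
};

/* Claim the lowest free slot for the frame about to be decoded; -1 if full. */
static int dpb_acquire(struct dpb_table *t)
{
    assert(t != NULL);
    for (int i = 0; i < DPB_MAX_SLOTS; i++) {
        if (!t->in_use[i]) {
            t->in_use[i] = true;
            return i;
        }
    }
    return -1;
}

/* Release a slot once reference marking says the frame is no longer needed. */
static void dpb_retire(struct dpb_table *t, int slot)
{
    assert(t != NULL && slot >= 0 && slot < DPB_MAX_SLOTS);
    t->in_use[slot] = false;
}
```

Because only slot indices cross into the mb_descriptor, the GPU side never needs to know about retirement; the CPU simply stops handing out an index once the slot is released.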
This is the right shape for V4L2 CAPTURE integration — daedalus-v4l2 today uses dma_buf-exported CAPTURE planes; reusing those as DPB entries means zero-copy from decode → V4L2 consumer.
6. libavcodec integration
The honest hard problem. libavcodec's H.264 decoder calls thousands of per-MB / per-block functions in sequence; we need to replace that with a "collect everything into the descriptor buffer, then dispatch one pipeline" flow.
Intercept point candidates:
- Slice level (ff_h264_decode_slice or decode_slice). Replace the per-MB loop entirely. Highest leverage but biggest rewrite.
- Macroblock level (ff_h264_decode_mb_*). Each call returns to libavcodec which calls pixel kernels. Replace the kernels with descriptor-write stubs, dispatch at slice end. Closer to the substitution arc; less rewrite.
The macroblock-level intercept is the natural evolution of the substitution arc. The current daedalus_recipe_dispatch_h264_* calls become append-to-buffer instead of dispatch-to-GPU, and an end-of-slice flush triggers the actual Vulkan submit.
This is the same shape as deferred rendering in GL/Vulkan — record commands, submit in bulk.
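The record-then-flush pattern can be sketched without any Vulkan calls. In the illustration below the per-MB hook only appends a record, and the end-of-slice flush is the single point where a submit would happen. Every name here (`mb_record`, `mb_batch`, `batch_append`, `batch_flush`) is hypothetical, the record is a stand-in for the full mb_descriptor, and the submit itself is represented by a counter.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BATCH_CAPACITY 8160   /* N_MBs for 1080p */

struct mb_record {            /* stand-in for the full mb_descriptor */
    uint16_t mb_addr;
    uint8_t  mb_type;
};

struct mb_batch {
    struct mb_record recs[BATCH_CAPACITY];
    size_t count;             /* records appended since the last flush */
    size_t submits;           /* bulk submits issued so far */
};

/* Per-MB hook: record the work instead of dispatching it. 0 on success. */
static int batch_append(struct mb_batch *b, uint16_t mb_addr, uint8_t mb_type)
{
    assert(b != NULL);
    if (b->count == BATCH_CAPACITY)
        return -1;
    b->recs[b->count].mb_addr = mb_addr;
    b->recs[b->count].mb_type = mb_type;
    b->count++;
    return 0;
}

/* End-of-slice hook: one submit covers everything recorded so far. */
static size_t batch_flush(struct mb_batch *b)
{
    assert(b != NULL);
    size_t n = b->count;
    if (n > 0)
        b->submits++;         /* real code would submit the pipeline here */
    b->count = 0;
    return n;
}
```

Appending three macroblocks and flushing once yields one submit covering three records; a second flush with nothing pending submits nothing.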
7. What we reuse from daedalus-fourier
Existing V3D shaders (post-PR-#7+#8 state):
| Shader | Reuse status |
|---|---|
| v3d_h264_idct4 | Yes; adapt input/output buffer layout |
| v3d_h264_idct8 | Yes; same |
| v3d_h264deblock | Yes; adapt. The existing shader handles vertical-luma; chroma + horizontal variants are new |
| v3d_h264_qpel_mc20 | Partial; only mc20. All 16 qpel positions needed for MC. |
| v3d_idct8 (VP9) | Not for H.264 |
| v3d_lpf_h_4_8, v3d_lpf_h_8_8 (VP9) | Not for H.264 |
| v3d_mc_8h (VP9 8-tap) | Not for H.264 (6-tap) |
| v3d_cdef (AV1) | Not for H.264 |
New shaders needed:
- v3d_h264_iquant: inverse quantization (can be fused with IDCT into one stage)
- v3d_h264_intra_4x4: 9 prediction modes
- v3d_h264_intra_16x16: 4 modes
- v3d_h264_intra_chroma_8x8: 4 modes
- v3d_h264_qpel_{mc00..mc33}: 15 missing variants (or one parameterized shader)
- v3d_h264_chroma_mc: 1/8-pel chroma MC (bilinear)
- v3d_h264_deblock_h_luma + chroma variants
- v3d_h264_reconstruct: trivial per-pixel add+clip
That's a substantial shader inventory. Each requires bit-exact M1 gate against an FFmpeg reference, same methodology as daedalus-fourier cycles 1-9.
8. Phasing
Phase 1 — MVP, I-frames only, Baseline profile (4-6 weeks). No CABAC (CAVLC only). No MC. No B-frames. Validates the architecture: parse → entropy → frame-pipeline → NV12 out, all bit-exact for I-frames. Test against ffmpeg-generated I-frame-only H.264 streams.
Phase 2 — P-frames, single reference, Baseline profile (+4 weeks). Add MC stage and DPB.
Phase 3 — CABAC + Main/High profile + B-frames + multi-ref (+6 weeks). CABAC stays on CPU. Full DPB management.
Phase 4 — Production-ready deblock + perf optimization + libva integration (+4 weeks). Real-world stream conformance. Plug into daedalus-v4l2 daemon as the actual decode backend.
Total H.264 budget: 4-6 months.
Phase 5+ (future codec scope, not committed): VP9 and AV1 reuse the same frame-level dispatch architecture, daedalus-fourier kernel pack, and DPB plumbing. Per §9.7, they are deferred but not firmly out-of-scope. HEVC stays firmly out (Pi 5 has rpi-hevc-dec for that).
9. Phase 1 decisions
User-confirmed 2026-05-24. All seven questions from the initial draft are now decided; this section preserves the original wording of each item for traceability.
1. Intra prediction strategy: GPU wavefront (~187 dispatches, more complex) vs CPU speculative (simpler, slower). Decision: wavefront in Phase 1; revisit if it's the perf bottleneck.
2. libavcodec intercept granularity: macroblock-level (substitution-arc evolution) vs slice-level (cleaner rewrite). Decision: macroblock-level for Phase 1; consider slice-level later if buffer accumulation overhead is non-trivial.
3. Shader parameterization: 16 qpel variants as 16 shaders, or one parameterized shader with switch on mc_position? Decision: measure both during Phase 2 (the MC phase) and pick the winner. No commitment ahead of measurement.
4. DPB allocation: Vulkan-native VkImage with dma_buf export, vs CPU-allocated dma_buf imported into Vulkan. Decision: Vulkan-native with VK_EXT_external_memory_dma_buf export; daedalus-v4l2 daemon imports.
5. Daemon integration shape: static library the daemon links, or separate process. Decision: library link.
6. Build dependency on daedalus-fourier: CMake find_package, or vendored? Decision: find_package, pinned to a tagged release. daedalus-fourier becomes the "kernel pack" upstream library.
7. Codec scope. Decision: firmly out-of-scope for daedalus-decoder are HEVC (Pi 5 has rpi-hevc-dec for that), 10-bit, interlaced, and FMO/ASO. VP9 and AV1 are not firmly out; they're future codec scope for the same framework after H.264 lands. This is a scope expansion from the initial draft, which had grouped them with HEVC under "firmly out".
10. Success criteria
Phase 1 — architecture validation, not perf:
- Bit-exact NV12 output against ffmpeg -i input.h264 -f rawvideo - for at least one synthetic I-frame-only stream
- Demonstrates one-submit-per-frame dispatch path (not one-dispatch-per-block)
- Vulkan validation layer reports no errors
- No regression in daedalus-fourier or daedalus-v4l2
Phase 4 — production target:
- 1080p H.264 at ≥30 fps on Pi 5 (matches the user-facing 30 fps floor requirement)
- Daily YouTube playback via Firefox → libva → daedalus-v4l2 daemon → daedalus-decoder works durably
- CPU consumption noticeably lower than the current libavcodec-NEON path (the "save electrons" goal)
If Phase 4 can't beat NEON-on-CPU, we acknowledge the V3D7's compute throughput isn't enough for H.264 1080p at 30fps and pivot back to NEON. The R-band methodology was clear that per-block QPU is RED on Pi 5; the bet here is that frame-level batched dispatch — the NVDEC shape — changes the ratio enough to flip the verdict.
11. References
- Khronos: An Introduction to Vulkan Video
- NVDEC Video Decoder API Programming Guide
- VK_KHR_video_decode_h264 proposal
- Mesa V3D driver docs
- daedalus-fourier substitution arc, cycles 6-9: the kernel pack we no longer use at the wrong granularity
- ITU-T H.264 spec (Rec. H.264 (08/2021)): section 7 for syntax and semantics, section 8 for the decoding process, section 9 for the parsing (entropy decoding) process