initial design doc: frame-level GPU H.264 decoder for V3D7

Path C of the 2026-05-23 architecture decision after the
daedalus-fourier substitution arc's per-block QPU dispatch was measured
to be >600x slower than NEON in production. Root cause: per-block
synchronous Vulkan dispatch from inside libavcodec's per-MB loops,
paying ~50us of queue-submit/wait round-trip per ~30ns of
NEON-equivalent arithmetic.

NVDEC and Vulkan Video escape this by dispatching at picture-level.
Pi 5 has no dedicated H.264 hardware decode block and Mesa V3DV does
not implement VK_KHR_video_decode_h264; this project builds the same
*shape* (one submit per frame, one fence wait per frame, encoded
bitstream in, NV12 out) using V3D7 Vulkan compute as the substrate.

DESIGN.md covers:
- architecture sketch (CPU side keeps entropy decode + descriptors;
  GPU runs 4-stage compute pipeline per frame)
- per-MB descriptor layout (frame-shaped SSBO, ~8160 entries for 1080p)
- inter-stage dependencies (vkCmdPipelineBarrier within one command
  buffer)
- intra prediction wavefront (~187 dispatches per frame on diagonals)
- libavcodec intercept point (macroblock-level, evolves the
  substitution shim from "dispatch now" to "append to frame buffer")
- shader inventory (existing daedalus-fourier reuse + ~14 new ones)
- 4-phase plan, 4-6 months total budget
- 7 open questions including DPB allocation, qpel parameterization,
  daemon integration shape
- explicit out-of-scope: VP9 / AV1 / HEVC / 10-bit / interlaced

This is design only. No code beyond README.md and DESIGN.md. User
review + redirect expected before Phase 1 implementation begins.
@@ -0,0 +1,266 @@
# daedalus-decoder — design

**Phase:** scoping / first-draft design. No code written. Date: 2026-05-23.

This document captures the architecture for an NVDEC-shaped H.264 decoder targeting Raspberry Pi 5 (V3D7). It exists because the [daedalus-fourier substitution prototype](https://git.reauktion.de/marfrit/daedalus-fourier), a per-block kernel pack called synchronously from inside libavcodec's H.264 decoder, measured at >600× slower than NEON in production due to per-call Vulkan synchronization overhead. The R-band methodology correctly predicted this; the "QPU is default substrate" decree (2026-05-23) overrode it for prototype purposes and confirmed the architectural mismatch. This document is the structural redesign that addresses it.

---

## 1. The shape we need

NVDEC and Vulkan Video both target dedicated decode hardware via a **picture-level command**: app submits a bitstream, hardware decodes a whole frame, app gets an NV12 surface. One submit per frame, one fence wait per frame, one DMA.

We don't have dedicated H.264 silicon on Pi 5 (the chip has a separate HEVC block, `rpi-hevc-dec`, but no H.264 decoder). We do have V3D7 Vulkan compute. The goal is to build the *same shape* as NVDEC using V3D7 compute as the substrate.

```
┌──────────────────────────────────────────────────────────────────┐
│  CPU side                                                        │
│  ┌──────────────────────────────────────────────────────────┐    │
│  │ libavcodec H.264 parser (existing)                       │    │
│  │   - NAL split + emulation prevention removal             │    │
│  │   - SPS / PPS                                            │    │
│  │   - Slice header decode                                  │    │
│  │   - CABAC / CAVLC entropy decode                         │    │
│  │   - Per-MB: mb_type, qp, partition info,                 │    │
│  │     transform coefficients, MVs, intra modes             │    │
│  └──────────────────────┬───────────────────────────────────┘    │
│                         ↓ (write into frame-shaped SSBO)         │
│                         │                                        │
└─────────────────────────┼────────────────────────────────────────┘
                          ↓ Vulkan compute queue submit
┌─────────────────────────┼────────────────────────────────────────┐
│  GPU (V3D7) side        ↓                                        │
│  ┌──────────────────────────────────────────────────────────┐    │
│  │ Stage 1: inverse quant + IDCT 4x4 / 8x8                  │    │
│  │   in:  coeffs[N_MBs × 384]                               │    │
│  │   out: residual[N_MBs × 384] i16                         │    │
│  └──────────────────────┬───────────────────────────────────┘    │
│                         ↓ vkCmdPipelineBarrier                   │
│  ┌──────────────────────────────────────────────────────────┐    │
│  │ Stage 2: prediction (intra + MC)                         │    │
│  │   in:  mb_descriptors, residual,                         │    │
│  │        DPB reference frames                              │    │
│  │   out: predicted[N_MBs × 384] u8                         │    │
│  │   note: intra dispatched in wavefront order;             │    │
│  │         MC fully parallel                                │    │
│  └──────────────────────┬───────────────────────────────────┘    │
│                         ↓                                        │
│  ┌──────────────────────────────────────────────────────────┐    │
│  │ Stage 3: reconstruct = clip255(residual + predicted)     │    │
│  │   out: reconstructed Y, U, V planes                      │    │
│  └──────────────────────┬───────────────────────────────────┘    │
│                         ↓                                        │
│  ┌──────────────────────────────────────────────────────────┐    │
│  │ Stage 4: deblocking filter at MB edges                   │    │
│  │   - V-edges first (rows independent)                     │    │
│  │   - H-edges next (columns independent)                   │    │
│  │   out: NV12 in DPB slot + dma-buf-exported               │    │
│  └──────────────────────┬───────────────────────────────────┘    │
└─────────────────────────┼────────────────────────────────────────┘
                          ↓ vkWaitFences (one wait, one frame)
           NV12 ready in V4L2 CAPTURE plane
```

**One Vulkan submit per frame.** The per-call sync cost that killed the kernel-substitution prototype goes away because the dispatch granularity is the right size: thousands of macroblocks of work amortize the ~50 µs submit/wait round trip.
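
The amortization argument is plain arithmetic, and a small sketch makes the scale visible. The ~50 µs round-trip figure comes from this document; the 30 fps budget and the helper names (`sync_cost_us`, `frame_budget_us`, `mb_count`) are illustrative assumptions, not measured values or existing code.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed figure: ~50 us per Vulkan submit/wait round trip (from the text). */
enum { SUBMIT_WAIT_US = 50 };

/* Total microseconds spent on submit/wait round trips for one frame. */
static uint64_t sync_cost_us(uint32_t submits_per_frame)
{
    return (uint64_t)submits_per_frame * SUBMIT_WAIT_US;
}

/* Per-frame time budget in microseconds at the given frame rate. */
static uint64_t frame_budget_us(uint32_t fps)
{
    assert(fps > 0);
    return 1000000u / fps;
}

/* Number of 16x16 macroblocks covering a w x h picture. */
static uint32_t mb_count(uint32_t w, uint32_t h)
{
    return ((w + 15) / 16) * ((h + 15) / 16);
}
```

With these numbers, one submit per frame costs 50 µs out of a 33,333 µs budget (about 0.15%), while even one submit per macroblock at 1080p would cost 8160 × 50 µs = 408 ms, roughly 12 frame budgets, before any arithmetic is done.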

---

## 2. CPU/GPU split

The CPU keeps everything that is **inherently serial**. H.264's entropy decode (CABAC, and CAVLC for Baseline-equivalent paths) is bit-by-bit serial: each binary decision updates context state that the next decision reads. NVDEC has a dedicated CABAC ASIC; doing CABAC on a SIMD compute architecture like V3D is a research project of its own and not in scope.

The CPU also keeps:

- NAL-unit boundary scanning + emulation-prevention byte removal
- SPS/PPS storage and lookup
- Slice header decoding
- Reference picture list construction (DPB management at the index level)
- Display-order reorder (B-frame POC machinery)

The CPU **produces** a frame-shaped descriptor that the GPU consumes.

The GPU runs **everything that is massively parallel per macroblock**:

| Stage | What it does | Parallelism |
|---|---|---|
| 1. Inverse quant + IDCT | Multiply coeffs by Q scale, run 4×4 / 8×8 inverse transform | Fully parallel (per block) |
| 2a. Intra prediction | Predict from neighbouring decoded pixels (left, top-left, top, top-right) | Wavefront: diagonal of MBs |
| 2b. Motion compensation | qpel filter on reference pixels | Fully parallel |
| 3. Reconstruct | residual + predicted, clip255 | Fully parallel (per pixel) |
| 4. Deblocking | Filter MB edges (alpha/beta/tc0 table) | Two-pass: V-edges (rows ∥), H-edges (cols ∥) |

---

## 3. Per-MB descriptor (CPU → GPU)

Frame-shaped SSBO: `mb_descriptors[N_MBs]`, where N_MBs = ⌈W/16⌉ × ⌈H/16⌉ (120 × 68 = 8160 for 1080p).

```c
#include <stdint.h>

/* One entry per macroblock: written by the CPU after entropy decode,
 * read by the GPU stages. */
struct mb_descriptor {
    uint8_t  mb_type;             /* I_NxN, I_16x16, P_..., B_..., I_PCM, ... */
    uint8_t  mb_qp_y;
    uint8_t  mb_qp_uv;
    uint8_t  cbp;                 /* coded block pattern */

    uint8_t  intra_4x4_modes[16]; /* per 4x4 sub-block, when I_NxN */
    uint8_t  intra_16x16_mode;    /* when I_16x16 */
    uint8_t  intra_chroma_mode;

    uint8_t  partition_mode;      /* P_16x16 / P_16x8 / P_8x16 / P_8x8 / ... */
    uint8_t  ref_idx_l0[4];       /* up to 4 partitions */
    uint8_t  ref_idx_l1[4];       /* B only */
    int16_t  mv_l0[4][2];         /* qpel-precision (1/4 sample) */
    int16_t  mv_l1[4][2];         /* B only; note that sub-8x8 partitions of
                                     P_8x8 can carry up to 16 MVs per list,
                                     which this first-draft layout cannot hold */

    uint8_t  deblock_disable;
    int8_t   deblock_alpha_c0;
    int8_t   deblock_beta;
};
```

The coefficient buffer is separate and also frame-shaped:

```c
int16_t coeffs[N_MBs * 384];     /* 256 luma + 64 cb + 64 cr per MB */
```

Sized for the worst case (every block has non-zero coefficients): 8160 × 384 × 2 bytes ≈ 6.0 MiB per 1080p frame. Most MBs have sparse coefficients, but we don't compact: the GPU shader simply iterates over every block, and skipping all-zero blocks is cheap.
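
A minimal sketch of the indexing this implies, assuming raster-order macroblock addressing and the 256 + 64 + 64 layout from the comment above. The function names and the plane-offset constants are hypothetical, chosen only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum {
    COEFFS_PER_MB = 384,  /* 256 luma + 64 Cb + 64 Cr */
    CB_OFFSET     = 256,  /* assumed: Cb follows luma inside one MB */
    CR_OFFSET     = 320   /* assumed: Cr follows Cb */
};

/* Width or height in macroblocks for a picture dimension in pixels. */
static uint32_t mbs_for(uint32_t pixels)
{
    return (pixels + 15) / 16;
}

/* Index of the first coefficient of MB (mb_x, mb_y) in raster order. */
static size_t coeff_offset(uint32_t mb_x, uint32_t mb_y, uint32_t mb_width)
{
    assert(mb_x < mb_width);
    return ((size_t)mb_y * mb_width + mb_x) * COEFFS_PER_MB;
}

/* Worst-case size in bytes of the frame-shaped int16_t coefficient buffer. */
static size_t coeff_buffer_bytes(uint32_t width, uint32_t height)
{
    return (size_t)mbs_for(width) * mbs_for(height) * COEFFS_PER_MB * sizeof(int16_t);
}
```

Because the buffer is not compacted, every stage can compute an MB's coefficient address from its grid position alone, with no prefix sum or indirection table.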

---

## 4. Pipeline stage dispatch shape

All stages live in **one VkCommandBuffer per frame**. Inter-stage dependencies use `vkCmdPipelineBarrier` with `VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT` as both source and destination stage, and a `VK_ACCESS_SHADER_WRITE_BIT` → `VK_ACCESS_SHADER_READ_BIT` memory dependency on the relevant SSBO.
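
**Stage 1 (IDCT):** one workgroup per MB; many MBs per dispatch. Reuses daedalus-fourier's IDCT 4×4 / 8×8 shader logic but at a different scale: a single dispatch covers all N_MBs macroblocks, not one block.

**Stage 2a (intra):** wavefront, issued as one dispatch per diagonal. For a 1080p frame (120×68 MB grid), there are 120+68−1 = 187 diagonals; the longest has 68 MBs. The MBs on a diagonal are treated as independent, so the command buffer records one dispatch per diagonal with a barrier between consecutive diagonals: 187 dispatches in the same command buffer (still ~225k× cheaper than per-block). Two constraints need checking before Phase 1. First, intra prediction reads *reconstructed* (pre-deblock) neighbour samples, so the residual of intra MBs has to be added inside the wavefront pass rather than deferred to Stage 3. Second, an MB that predicts from its top-right neighbour sits on the same x+y diagonal as that neighbour, so a 2:1 skewed wavefront (wave index x+2y, 120+2×67 = 254 waves at 1080p) may be required for bit-exactness.

Alternative: speculative CPU intra prediction, GPU only does inter blocks. Simpler, slower. Decision deferred.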
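
Whether a given wave ordering respects the intra neighbour dependencies can be checked mechanically on the CPU before any shader exists. The sketch below is an assumption-laden model, not project code: it treats left, top-left, top and top-right as the MB-level dependencies (top-right is only read by some intra modes), and `wave_diag`, `wave_skewed` and `neighbours_precede` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>

typedef int (*wave_fn)(int x, int y);

static int wave_diag(int x, int y)   { return x + y; }      /* plain anti-diagonal */
static int wave_skewed(int x, int y) { return x + 2 * y; }  /* assumed 2:1 skew */

/* Number of waves (dispatches) for a w x h MB grid under wave function f. */
static int wave_count(wave_fn f, int w, int h)
{
    assert(w > 0 && h > 0);
    return f(w - 1, h - 1) + 1;
}

/* True if every existing neighbour (left, top-left, top, top-right) of
 * every MB lies in a strictly earlier wave than the MB itself. */
static bool neighbours_precede(wave_fn f, int w, int h)
{
    static const int dx[4] = { -1, -1,  0,  1 };
    static const int dy[4] = {  0, -1, -1, -1 };
    for (int y = 0; y < h; y++)
        for (int x = 0; x < w; x++)
            for (int k = 0; k < 4; k++) {
                int nx = x + dx[k], ny = y + dy[k];
                if (nx < 0 || nx >= w || ny < 0)
                    continue;               /* neighbour outside the picture */
                if (f(nx, ny) >= f(x, y))
                    return false;           /* dependency not yet decoded */
            }
    return true;
}
```

For the 120×68 grid, `wave_diag` gives 187 waves but fails the check because of the top-right neighbour, while `wave_skewed` passes at the cost of 254 waves. If streams in scope never use top-right-dependent modes at MB boundaries, the plain diagonal would suffice; that is a property to measure, not assume.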

**Stage 2b (MC):** single dispatch over all inter MBs. Reads reference frames from the DPB. The 6-tap luma filter reads up to a 13×13 source window per 8×8 output block (13×8 for the horizontal-only mc20 case). daedalus-fourier's existing `v3d_h264_qpel_mc20` covers mc20; we need the other 15 positions (mc00 through mc33), either as separate shaders or as one parameterized shader.
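
As a CPU-side reference for what one parameterized shader would have to select between, here is a minimal sketch of splitting a quarter-sample MV into integer and fractional parts, plus the standard H.264 horizontal half-sample filter with taps (1, −5, 20, 20, −5, 1). The helper names and the `x + 4*y` variant index are illustrative assumptions; only the filter taps and rounding come from the H.264 spec.

```c
#include <assert.h>
#include <stdint.h>

/* Floor division by 4, written out so negative MVs do not depend on the
 * compiler's right-shift behaviour. */
static int mv_int_part(int mv)  { return (mv >= 0) ? mv / 4 : -((-mv + 3) / 4); }
static int mv_frac_part(int mv) { return mv - 4 * mv_int_part(mv); }

/* Hypothetical variant index for mcXY: X = horizontal, Y = vertical
 * quarter-sample offset, so mc20 maps to 2 and mc33 to 15. */
static int qpel_variant(int mv_x, int mv_y)
{
    return mv_frac_part(mv_x) + 4 * mv_frac_part(mv_y);
}

static uint8_t clip255(int v)
{
    return (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
}

/* Half-sample value between src[0] and src[1]; reads src[-2] .. src[3]. */
static uint8_t halfpel_h(const uint8_t *src)
{
    int sum = src[-2] - 5 * src[-1] + 20 * src[0]
            + 20 * src[1] - 5 * src[2] + src[3];
    int rounded = sum + 16;
    return clip255(rounded < 0 ? 0 : rounded >> 5);
}

/* Source samples needed along one axis for n_out filtered output samples. */
static int src_extent(int n_out)
{
    assert(n_out > 0);
    return n_out + 5;
}
```

`src_extent(8)` is where the 13-sample window comes from: two samples before and three after each 8-sample run. A bit-exact gate for the other 15 positions would compare against FFmpeg's own qpel functions rather than against this sketch.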

**Stage 3 (reconstruct):** trivial per-pixel `clip255(residual + predicted)`. One dispatch over all pixels.
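
A CPU reference for this stage is short enough to serve as the bit-exact oracle for the shader. This is a sketch under the assumption that the residual is signed 16-bit and the prediction is unsigned 8-bit; `reconstruct_row` is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Clamp to the 8-bit sample range. */
static uint8_t clip255(int v)
{
    return (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
}

/* out[i] = clip255(residual[i] + predicted[i]) for n samples. */
static void reconstruct_row(uint8_t *out, const int16_t *residual,
                            const uint8_t *predicted, size_t n)
{
    assert(out && residual && predicted);
    for (size_t i = 0; i < n; i++)
        out[i] = clip255((int)residual[i] + (int)predicted[i]);
}
```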
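
**Stage 4 (deblock):** two dispatches: vertical edges (all rows independent), then horizontal edges (all columns independent). Reuses daedalus-fourier's `v3d_h264deblock` shader, adapted for the new buffer layout. The H.264 reference order filters MB by MB in raster order (vertical edges, then horizontal edges, within each MB), so the frame-wide two-pass ordering has to be verified as bit-exact against FFmpeg; if it is not, this stage needs a wavefront like Stage 2a.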
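
Whatever the dispatch order ends up being, each edge sample line starts with the same activity test from the H.264 deblocking process: filter only if the boundary strength is non-zero and the local gradients are below the alpha and beta thresholds. The sketch below shows just that predicate; the alpha/beta table lookup and the filter itself are not shown, and the function name is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* p1 p0 | q0 q1 are the samples on either side of the edge.
 * bs is the boundary strength (0..4); alpha and beta are the thresholds
 * the caller derived from QP and the slice offsets. */
static bool deblock_filter_samples(int bs, uint8_t p1, uint8_t p0,
                                   uint8_t q0, uint8_t q1,
                                   int alpha, int beta)
{
    assert(bs >= 0 && bs <= 4);
    return bs != 0
        && abs((int)p0 - (int)q0) < alpha
        && abs((int)p1 - (int)p0) < beta
        && abs((int)q1 - (int)q0) < beta;
}
```

A large step across the edge (a real image edge) fails the alpha test and is left alone, which is what keeps the filter from blurring content.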

Total dispatches per frame: ~200 (1 IDCT + 187 intra diagonals + 1 MC + 1 reconstruct + 2 deblock = 192, mostly the intra wavefront), all in one command buffer, one submit, one fence wait.
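
The per-frame dispatch list can be modelled as plain data before any Vulkan code is written. The sketch below builds that list for a given MB grid using the stage order from this section; the `dispatch` struct, the barrier placement and all names are assumptions for illustration, and a real recorder would translate each entry into `vkCmdDispatch` and `vkCmdPipelineBarrier` calls.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum stage { STAGE_IDCT, STAGE_INTRA, STAGE_MC, STAGE_RECON,
             STAGE_DEBLOCK_V, STAGE_DEBLOCK_H };

struct dispatch {
    enum stage stage;
    uint32_t   mbs;             /* upper bound on MBs covered by the dispatch */
    int        barrier_before;  /* 1 if a compute->compute barrier precedes it */
};

/* Number of MBs on anti-diagonal d (x + y == d) of a w x h MB grid. */
static uint32_t diag_len(uint32_t d, uint32_t w, uint32_t h)
{
    uint32_t lo = (d >= h) ? d - (h - 1) : 0;  /* smallest x on the diagonal */
    uint32_t hi = (d < w) ? d : w - 1;         /* largest x on the diagonal */
    return hi - lo + 1;
}

static size_t push(struct dispatch *out, size_t cap, size_t n,
                   enum stage s, uint32_t mbs, int barrier)
{
    assert(n < cap);
    out[n].stage = s;
    out[n].mbs = mbs;
    out[n].barrier_before = barrier;
    return n + 1;
}

/* Fills out[] with the dispatch sequence for one frame; returns its length. */
static size_t build_frame_plan(uint32_t w, uint32_t h,
                               struct dispatch *out, size_t cap)
{
    size_t n = 0;
    uint32_t all = w * h;
    assert(w > 0 && h > 0);
    n = push(out, cap, n, STAGE_IDCT, all, 0);
    for (uint32_t d = 0; d < w + h - 1; d++)
        n = push(out, cap, n, STAGE_INTRA, diag_len(d, w, h), 1);
    n = push(out, cap, n, STAGE_MC, all, 0);      /* reads only the DPB */
    n = push(out, cap, n, STAGE_RECON, all, 1);
    n = push(out, cap, n, STAGE_DEBLOCK_V, all, 1);
    n = push(out, cap, n, STAGE_DEBLOCK_H, all, 1);
    return n;
}
```

Keeping the plan as data also makes it cheap to swap the diagonal enumeration for a different wave ordering and see how the dispatch count changes.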

---

## 5. Reference frame management

DPB = N reference frames in GPU memory, each a VkImage in NV12 format, exported as a dma_buf. The H.264 spec allows up to 16 reference frames; typical streams use 1-4.

The CPU tracks the index mapping (`ref_idx_l0` / `ref_idx_l1` entries → DPB slot). The GPU shader receives DPB slot indices via the mb_descriptor and reads pixels through a VkImage descriptor binding.

At the end of frame decode, the output VkImage may become a new reference (per `ref_pic_marking` decisions on the CPU). Slot retirement is CPU-managed.
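
The CPU-side bookkeeping described here amounts to a small slot table. The following is a minimal sketch, assuming 17 slots (16 references plus the frame being decoded) and a per-slice list that maps `ref_idx` to a slot; every name in it is hypothetical, and real marking logic (sliding window, MMCO, long-term references) is deliberately left out.

```c
#include <assert.h>
#include <stdint.h>

enum { DPB_SLOTS = 17, DPB_NO_SLOT = -1 };  /* assumed: 16 refs + current */

struct dpb {
    uint8_t in_use[DPB_SLOTS];  /* 1 while the slot holds a live picture */
};

/* Claim the lowest free slot for the picture about to be decoded. */
static int dpb_acquire(struct dpb *d)
{
    for (int i = 0; i < DPB_SLOTS; i++) {
        if (!d->in_use[i]) {
            d->in_use[i] = 1;
            return i;
        }
    }
    return DPB_NO_SLOT;
}

/* Release a slot once the picture is neither a reference nor pending output. */
static void dpb_retire(struct dpb *d, int slot)
{
    assert(slot >= 0 && slot < DPB_SLOTS);
    d->in_use[slot] = 0;
}

/* Translate a ref_idx from one reference list into a DPB slot index. */
static int dpb_resolve(const int8_t *ref_list, int list_len, int ref_idx)
{
    if (ref_idx < 0 || ref_idx >= list_len)
        return DPB_NO_SLOT;
    return ref_list[ref_idx];
}
```

Resolving `ref_idx` to a slot on the CPU means the shader never needs to see reference list reordering; it only indexes into a fixed array of image bindings.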

This is the **right shape for V4L2 CAPTURE integration**: daedalus-v4l2 today uses dma_buf-exported CAPTURE planes; reusing those as DPB entries means zero-copy from decode → V4L2 consumer.

---

## 6. libavcodec integration

The honest hard problem. libavcodec's H.264 decoder calls thousands of per-MB / per-block functions in sequence; we need to replace that with a "collect everything into the descriptor buffer, then dispatch one pipeline" flow.

**Intercept point candidates:**

- **Slice level** (the static `decode_slice` loop in `h264_slice.c`, driven by `ff_h264_execute_decode_slices`). Replace the per-MB loop entirely. Highest leverage but biggest rewrite.
- **Macroblock level** (`ff_h264_decode_mb_cabac` / `ff_h264_decode_mb_cavlc`). Each call returns to libavcodec, which then calls the pixel kernels via `ff_h264_hl_decode_mb`. Replace the kernels with descriptor-write stubs and dispatch once the last slice of the picture has been parsed. Closer to the substitution arc; less rewrite.

The macroblock-level intercept is the natural evolution of the substitution arc. The current `daedalus_recipe_dispatch_h264_*` calls become **append-to-buffer** instead of **dispatch-to-GPU**, and an end-of-picture flush (after the last slice of the frame) triggers the actual Vulkan submit, so multi-slice frames still cost one submit.

This is the same shape as **deferred command recording** in GL/Vulkan: record commands, submit in bulk.
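
The append-then-flush behaviour can be sketched independently of both libavcodec and Vulkan. Everything below is hypothetical: `frame_recorder`, the trimmed `mb_record` and the submit counter are stand-ins that only demonstrate the control flow, with the real Vulkan record/submit/wait replaced by a counter increment.

```c
#include <assert.h>
#include <stdint.h>

enum { MAX_MBS = 8160 };  /* assumption: sized for 1080p */

/* Trimmed stand-in for the real per-MB descriptor. */
struct mb_record {
    uint8_t mb_type;
    uint8_t qp;
};

struct frame_recorder {
    struct mb_record mbs[MAX_MBS];
    uint32_t appended;  /* MBs recorded since the last begin_frame */
    uint32_t submits;   /* how many GPU submits have been issued */
};

static void recorder_begin_frame(struct frame_recorder *r)
{
    r->appended = 0;
}

/* Called where the prototype used to dispatch to the GPU immediately. */
static void recorder_append_mb(struct frame_recorder *r, uint32_t mb_addr,
                               struct mb_record rec)
{
    assert(mb_addr < MAX_MBS);
    r->mbs[mb_addr] = rec;
    r->appended++;
}

/* Called once after the last slice of the picture; stands in for
 * command-buffer recording, queue submit and the single fence wait. */
static void recorder_end_frame(struct frame_recorder *r)
{
    if (r->appended > 0)
        r->submits++;
}
```

In use, the decoder would call `recorder_begin_frame` at the first slice of a picture, `recorder_append_mb` from each descriptor-write stub, and `recorder_end_frame` when the picture is complete, so the submit count tracks frames rather than blocks.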

---

## 7. What we reuse from daedalus-fourier

Existing V3D shaders (post-PR-#7+#8 state):

| Shader | Reuse status |
|---|---|
| `v3d_h264_idct4` | Yes: adapt input/output buffer layout |
| `v3d_h264_idct8` | Yes: same |
| `v3d_h264deblock` | Yes: adapt; the existing shader handles vertical-luma; chroma + horizontal variants are new |
| `v3d_h264_qpel_mc20` | Partial: only mc20. All 16 qpel positions needed for MC. |
| `v3d_idct8` (VP9) | Not for H.264 |
| `v3d_lpf_h_4_8`, `v3d_lpf_h_8_8` (VP9) | Not for H.264 |
| `v3d_mc_8h` (VP9 8-tap) | Not for H.264 (6-tap) |
| `v3d_cdef` (AV1) | Not for H.264 |

New shaders needed:

- `v3d_h264_iquant`: inverse quantization (can be fused with IDCT into one stage)
- `v3d_h264_intra_4x4`: 9 prediction modes
- `v3d_h264_intra_16x16`: 4 modes
- `v3d_h264_intra_chroma_8x8`: 4 modes
- `v3d_h264_qpel_{mc00..mc33}`: 15 missing variants (or one parameterized shader)
- `v3d_h264_chroma_mc`: 1/8-pel chroma MC (bilinear)
- `v3d_h264_deblock_h_luma` + chroma variants
- `v3d_h264_reconstruct`: trivial per-pixel add+clip

That's a substantial shader inventory. Each requires a bit-exact M1 gate against an FFmpeg reference, the same methodology as daedalus-fourier cycles 1-9.

---

## 8. Phasing

**Phase 1: MVP, I-frames only, Baseline profile** (4-6 weeks). No CABAC (CAVLC only). No MC. No B-frames. Validates the architecture: parse → entropy → frame-pipeline → NV12 out, all bit-exact for I-frames. Test against ffmpeg-generated I-frame-only H.264 streams.

**Phase 2: P-frames, single reference, Baseline profile** (+4 weeks). Add MC stage and DPB.

**Phase 3: CABAC + Main/High profile + B-frames + multi-ref** (+6 weeks). CABAC stays on CPU. Full DPB management.

**Phase 4: Production-ready deblock + perf optimization + libva integration** (+4 weeks). Real-world stream conformance. Plug into daedalus-v4l2 daemon as the actual decode backend.

**Total budget:** 4-6 months (the phase estimates above sum to 18-20 weeks).

---

## 9. Open questions

1. **Intra prediction strategy:** GPU wavefront (~187 dispatches, more complex) vs CPU speculative (simpler, slower). Plan: wavefront in Phase 1; revisit if it's the perf bottleneck.

2. **libavcodec intercept granularity:** macroblock-level (substitution-arc evolution) vs slice-level (cleaner rewrite). Plan: macroblock-level for Phase 1; consider slice-level later if buffer accumulation overhead is non-trivial.

3. **Shader parameterization:** 16 qpel variants as 16 shaders, or one parameterized shader with a switch on mc_position? V3D's compiler might inline-optimize either; needs measurement.

4. **DPB allocation:** Vulkan-native VkImage with dma_buf export, vs CPU-allocated dma_buf imported into Vulkan. Affects V4L2 integration story. Plan: Vulkan-native with `VK_EXT_external_memory_dma_buf` export; daedalus-v4l2 daemon imports.

5. **Daemon integration shape:** does daedalus-decoder ship as a static library the daemon links, or as a separate process the daemon talks to? Library, almost certainly; a process boundary would multiply IPC cost.

6. **Build dependency on daedalus-fourier:** as a CMake `find_package`, or vendored? `find_package`, pinned to a tagged release. daedalus-fourier becomes the "kernel pack" upstream library.

7. **Out-of-scope for daedalus-decoder (firmly):** VP9, AV1, HEVC (Pi 5 has rpi-hevc-dec for that), 10-bit, interlaced, FMO/ASO.

---

## 10. Success criteria

Phase 1 (**architecture validation, not perf**):

- Bit-exact NV12 output against `ffmpeg -i input.h264 -pix_fmt nv12 -f rawvideo -` for at least one synthetic I-frame-only stream (without `-pix_fmt nv12`, ffmpeg emits planar yuv420p, which would never match byte for byte)
- Demonstrates the one-submit-per-frame dispatch path (not one-dispatch-per-block)
- Vulkan validation layers report no errors
- No regression in daedalus-fourier or daedalus-v4l2
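
The bit-exact criterion reduces to a byte comparison of two NV12 buffers. A minimal sketch of that check follows, assuming tightly packed frames with even dimensions and no row padding; the function names are hypothetical and reading the two files is left to the caller.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bytes in one tightly packed NV12 frame: a full-resolution Y plane
 * followed by an interleaved half-resolution CbCr plane. */
static size_t nv12_frame_bytes(size_t w, size_t h)
{
    assert(w % 2 == 0 && h % 2 == 0);
    return w * h + (w * h) / 2;
}

/* Offset of the first differing byte, or n if the buffers are identical. */
static size_t first_mismatch(const uint8_t *a, const uint8_t *b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (a[i] != b[i])
            return i;
    return n;
}
```

Reporting the first mismatching offset rather than a boolean makes failures actionable: an offset below `w * h` points at luma, and dividing by the row width recovers the MB row where decoding diverged.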

Phase 4 (**production target**):

- 1080p H.264 at ≥30 fps on Pi 5 (matches the [30fps floor](https://git.reauktion.de/marfrit/daedalus-fourier/src/branch/main/CLAUDE-CODE-MEMORY/project_30fps_floor_is_fine.md) user-facing requirement)
- Daily YouTube playback via Firefox → libva → daedalus-v4l2 daemon → daedalus-decoder works durably
- CPU consumption noticeably lower than the current libavcodec-NEON path (the "save electrons" goal)

If Phase 4 can't beat NEON-on-CPU, we acknowledge that V3D7's compute throughput isn't enough for 1080p H.264 at 30 fps and pivot back to NEON. The R-band methodology was clear that per-block QPU is RED on Pi 5; the bet here is that frame-level batched dispatch (the NVDEC shape) changes the ratio enough to flip the verdict.

---

## 11. References

- [Khronos: An Introduction to Vulkan Video](https://www.khronos.org/blog/an-introduction-to-vulkan-video)
- [NVDEC Video Decoder API Programming Guide](https://docs.nvidia.com/video-technologies/video-codec-sdk/13.0/nvdec-video-decoder-api-prog-guide/)
- [VK_KHR_video_decode_h264 proposal](https://docs.vulkan.org/features/latest/features/proposals/VK_KHR_video_decode_h264.html)
- [Mesa V3D driver docs](https://docs.mesa3d.org/drivers/v3d.html)
- [daedalus-fourier substitution arc](https://git.reauktion.de/marfrit/daedalus-fourier): cycles 6-9; the kernel pack that we no longer call at the wrong granularity
- ITU-T Rec. H.264 (08/2021): clause 7 for syntax and semantics, clause 8 for the decoding process, clause 9 for the parsing process (CAVLC/CABAC)
@@ -0,0 +1,13 @@
# daedalus-decoder

Frame-level GPU H.264 decoder for Raspberry Pi 5 / V3D7. **Design phase: not implemented yet.**

The objective: build the NVDEC-equivalent shape on Pi 5. One Vulkan submit per frame, one fence wait per frame, encoded H.264 bitstream in, NV12 frame out. Reuses [daedalus-fourier](https://git.reauktion.de/marfrit/daedalus-fourier)'s V3D compute primitives at the right granularity, not the per-block-call granularity that the kernel-substitution prototype exposed as architecturally wrong.

Sibling projects:

- [daedalus-fourier](https://git.reauktion.de/marfrit/daedalus-fourier): V3D + NEON kernel pack (IDCT, MC, deblock primitives). Stays as research/microbench artifact.
- [daedalus-v4l2](https://git.reauktion.de/reauktion/daedalus-v4l2): V4L2 stateless decoder shim + userspace daemon for Pi 5. The eventual consumer of this decoder.
- [libva-v4l2-request-fourier](https://git.reauktion.de/reauktion/libva-v4l2-request-fourier): VAAPI ↔ V4L2 stateless bridge. End consumer.

See [DESIGN.md](DESIGN.md) for the architecture sketch.