daedalus-decoder/DESIGN.md
claude-noether 59885dd868 initial design doc — frame-level GPU H.264 decoder for V3D7
Path C of the 2026-05-23 architecture decision after the daedalus-
fourier substitution arc's per-block QPU dispatch was measured to be
>600x slower than NEON in production.  Root cause: per-block synchronous
Vulkan dispatch from inside libavcodec's per-MB loops, paying ~50us of
queue-submit/wait round-trip per ~30ns of NEON-equivalent arithmetic.

NVDEC and Vulkan Video escape this by dispatching at picture-level.
Pi 5 has no dedicated H.264 hardware decode block and Mesa V3DV does
not implement VK_KHR_video_decode_h264; this project builds the same
*shape* (one submit per frame, one fence wait per frame, encoded
bitstream in, NV12 out) using V3D7 Vulkan compute as the substrate.

DESIGN.md covers:

  - architecture sketch (CPU side keeps entropy decode + descriptors;
    GPU runs 4-stage compute pipeline per frame)
  - per-MB descriptor layout (frame-shaped SSBO, ~8160 entries for 1080p)
  - inter-stage dependencies (vkCmdPipelineBarrier within one command
    buffer)
  - intra prediction wavefront (~187 dispatches per frame on diagonals)
  - libavcodec intercept point (macroblock-level, evolves the
    substitution shim from "dispatch now" to "append to frame buffer")
  - shader inventory (existing daedalus-fourier reuse + ~14 new ones)
  - 4-phase plan, 4-6 months total budget
  - 7 open questions including DPB allocation, qpel parameterization,
    daemon integration shape
  - explicit out-of-scope: VP9 / AV1 / HEVC / 10-bit / interlaced

This is design only.  No code beyond README.md and DESIGN.md.  User
review + redirect expected before Phase 1 implementation begins.
2026-05-23 22:44:03 +02:00

# daedalus-decoder — design
**Phase:** scoping / first-draft design. No code written. Date: 2026-05-23.
This document captures the architecture for an NVDEC-shaped H.264 decoder targeting Raspberry Pi 5 (V3D7). It exists because the [daedalus-fourier substitution prototype](https://git.reauktion.de/marfrit/daedalus-fourier) — a per-block kernel pack called synchronously from inside libavcodec's H.264 decoder — measured at >600× slower than NEON in production due to per-call Vulkan synchronization overhead. The R-band methodology correctly predicted this; the "QPU is default substrate" decree (2026-05-23) overrode it for prototype purposes and confirmed the architectural mismatch. This document is the structural redesign that addresses it.
---
## 1. The shape we need
NVDEC and Vulkan Video both target dedicated decode hardware via a **picture-level command**: app submits a bitstream, hardware decodes a whole frame, app gets an NV12 surface. One submit per frame, one fence wait per frame, one DMA.
We don't have dedicated H.264 silicon on Pi 5 (the chip has a separate HEVC block — `rpi-hevc-dec` — but no H.264 decoder). We do have V3D7 Vulkan compute. The goal is to build the *same shape* as NVDEC using V3D7 compute as the substrate.
```
┌──────────────────────────────────────────────────────────────────┐
│ CPU side                                                         │
│  ┌────────────────────────────────────────────────────────────┐  │
│  │ libavcodec H.264 parser (existing)                         │  │
│  │   - NAL split + emulation prevention removal               │  │
│  │   - SPS / PPS                                              │  │
│  │   - Slice header decode                                    │  │
│  │   - CABAC / CAVLC entropy decode                           │  │
│  │   - Per-MB: mb_type, qp, partition info,                   │  │
│  │     transform coefficients, MVs, intra modes               │  │
│  └─────────────────────────────┬──────────────────────────────┘  │
│                                ↓ (write into frame-shaped SSBO)  │
│                                │                                 │
└────────────────────────────────┼─────────────────────────────────┘
                                 ↓ Vulkan compute queue submit
┌────────────────────────────────┼─────────────────────────────────┐
│ GPU (V3D7) side                ↓                                 │
│  ┌────────────────────────────────────────────────────────────┐  │
│  │ Stage 1: inverse quant + IDCT 4x4 / 8x8                    │  │
│  │   in:   coeffs[N_MBs × 384]                                │  │
│  │   out:  residual[N_MBs × 384] u8/i16                       │  │
│  └─────────────────────────────┬──────────────────────────────┘  │
│                                ↓ vkCmdPipelineBarrier            │
│  ┌────────────────────────────────────────────────────────────┐  │
│  │ Stage 2: prediction (intra + MC)                           │  │
│  │   in:   mb_descriptors, residual,                          │  │
│  │         DPB reference frames                               │  │
│  │   out:  predicted[N_MBs × 384] u8                          │  │
│  │   note: intra dispatched in wavefront order;               │  │
│  │         MC fully parallel                                  │  │
│  └─────────────────────────────┬──────────────────────────────┘  │
│                                ↓                                 │
│  ┌────────────────────────────────────────────────────────────┐  │
│  │ Stage 3: reconstruct = clip255(residual + predicted)       │  │
│  │   out:  reconstructed Y, U, V planes                       │  │
│  └─────────────────────────────┬──────────────────────────────┘  │
│                                ↓                                 │
│  ┌────────────────────────────────────────────────────────────┐  │
│  │ Stage 4: deblocking filter at MB edges                     │  │
│  │   - V-edges first (rows independent)                       │  │
│  │   - H-edges next (columns independent)                     │  │
│  │   out:  NV12 in DPB slot + dma-buf-exported                │  │
│  └─────────────────────────────┬──────────────────────────────┘  │
└────────────────────────────────┼─────────────────────────────────┘
                                 ↓ vkWaitFences (one wait, one frame)
                 NV12 ready in V4L2 CAPTURE plane
```
**One Vulkan submit per frame.** The per-call sync cost that killed the kernel-substitution prototype goes away because the dispatch granularity is the right size: thousands of macroblocks of work amortize the ~50 µs submit/wait roundtrip.
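
To put rough numbers on the amortization argument, here is a small sketch that uses only figures quoted in this document (the ~50 µs round trip, a 1080p frame, 30 fps). The function names are illustrative and not part of any daedalus API, and the model counts only the fixed submit/wait cost, not the compute work itself.

```c
#include <assert.h>

/* Number of 16x16 macroblocks covering a width x height picture
 * (ceiling division in both dimensions). */
static unsigned mb_count(unsigned width, unsigned height)
{
    assert(width > 0 && height > 0);
    return ((width + 15u) / 16u) * ((height + 15u) / 16u);
}

/* Submit/wait overhead attributed to each macroblock, in nanoseconds,
 * when one round trip of roundtrip_us covers n_mbs macroblocks. */
static double overhead_ns_per_mb(double roundtrip_us, unsigned n_mbs)
{
    assert(n_mbs > 0);
    return roundtrip_us * 1000.0 / (double)n_mbs;
}

/* Fraction of wall-clock time spent in submit/wait at a given frame rate,
 * assuming exactly one round trip per frame. */
static double overhead_fraction(double roundtrip_us, double fps)
{
    return roundtrip_us * 1e-6 * fps;
}
```

For 1920×1080 this gives 8160 macroblocks, so one 50 µs round trip costs about 6 ns per macroblock, and at 30 fps the round trips add up to roughly 0.15% of wall-clock time.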
---
## 2. CPU/GPU split
The CPU keeps everything that is **inherently serial**. H.264's entropy decode (CABAC, and CAVLC for Baseline-equivalent paths) is bit-by-bit serial — each binary decision updates context state that the next decision reads. NVDEC has a dedicated CABAC ASIC; doing CABAC on a SIMD compute architecture like V3D is a research project of its own and not in scope.
The CPU also keeps:
- NAL-unit boundary scanning + emulation-prevention byte removal
- SPS/PPS storage and lookup
- Slice header decoding
- Reference picture list construction (DPB management at the index level)
- Display-order reorder (B-frame POC machinery)
The CPU **produces** a frame-shaped descriptor that the GPU consumes.
The GPU runs **everything that is massively parallel per macroblock**:
| Stage | What it does | Parallelism |
|---|---|---|
| 1. Inverse quant + IDCT | Multiply coeffs by Q scale, run 4×4 / 8×8 inverse transform | Fully parallel (per block) |
| 2a. Intra prediction | Predict from neighbouring decoded pixels (top, top-right, left) | Wavefront — diagonal of MBs |
| 2b. Motion compensation | qpel filter on reference pixels | Fully parallel |
| 3. Reconstruct | residual + predicted, clip255 | Fully parallel (per pixel) |
| 4. Deblocking | Filter MB edges (alpha/beta/tc0 table) | Two-pass: V-edges (rows ∥), H-edges (cols ∥) |
---
## 3. Per-MB descriptor (CPU → GPU)
Frame-shaped SSBO: `mb_descriptors[N_MBs]`, where N_MBs = ⌈W/16⌉ × ⌈H/16⌉ (8160 for 1080p).
```c
struct mb_descriptor {
    uint8_t  mb_type;              /* I_NxN, I_16x16, P_..., B_..., I_PCM, ... */
    uint8_t  mb_qp_y;
    uint8_t  mb_qp_uv;
    uint8_t  cbp;                  /* coded block pattern */
    uint8_t  intra_4x4_modes[16];  /* per 4x4 sub-block, when I_NxN */
    uint8_t  intra_16x16_mode;     /* when I_16x16 */
    uint8_t  intra_chroma_mode;
    uint8_t  partition_mode;       /* P_16x16 / P_16x8 / P_8x16 / P_8x8 / ... */
    uint8_t  ref_idx_l0[4];        /* up to 4 partitions */
    uint8_t  ref_idx_l1[4];        /* B only */
    int16_t  mv_l0[4][2];          /* qpel-precision (1/4 sample) */
    int16_t  mv_l1[4][2];
    uint8_t  deblock_disable;
    int8_t   deblock_alpha_c0;
    int8_t   deblock_beta;
};
```
Coefficient buffer separate, also frame-shaped:
```c
int16_t coeffs[N_MBs * 384]; /* 256 luma + 64 cb + 64 cr per MB */
```
Sized for the worst case (every block has nonzero coefficients). Most MBs have sparse coefficients, but we don't compact the buffer: the GPU shader iterates over every slot, on the assumption that skipping all-zero blocks is cheap.
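
A minimal sketch of the sizing and addressing this layout implies. The 384-per-MB split (256 luma + 64 Cb + 64 Cr) comes from the declaration above; the ordering of 4×4 blocks inside each plane and the helper names are assumptions made for illustration only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { COEFFS_PER_MB = 384, LUMA_COEFFS = 256, CHROMA_COEFFS = 64 };

/* Bytes needed for the worst-case (uncompacted) coefficient buffer. */
static size_t coeff_buffer_bytes(size_t n_mbs)
{
    return n_mbs * COEFFS_PER_MB * sizeof(int16_t);
}

/* Offset, in int16_t elements, of the 16 coefficients of one 4x4 block.
 * plane: 0 = luma (16 blocks), 1 = Cb (4 blocks), 2 = Cr (4 blocks).
 * Assumed layout: blocks stored back to back within each plane. */
static size_t coeff_offset(size_t mb_index, int plane, int block)
{
    size_t base = mb_index * COEFFS_PER_MB;
    assert(plane >= 0 && plane <= 2);
    assert(block >= 0 && block < (plane == 0 ? 16 : 4));
    if (plane == 1) base += LUMA_COEFFS;
    if (plane == 2) base += LUMA_COEFFS + CHROMA_COEFFS;
    return base + (size_t)block * 16;
}
```

For 1080p (8160 MBs) the buffer comes to 6,266,880 bytes, a little under 6 MiB per frame in flight.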
---
## 4. Pipeline stage dispatch shape
All stages live in **one VkCommandBuffer per frame**. Inter-stage dependencies use `vkCmdPipelineBarrier` with `VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT` as both source and destination stage, and a memory barrier from `VK_ACCESS_SHADER_WRITE_BIT` to `VK_ACCESS_SHADER_READ_BIT` on the relevant SSBO.
**Stage 1 (IDCT)** — one workgroup per MB; many MBs per dispatch. Reuses daedalus-fourier's IDCT 4×4 / 8×8 shader logic but at a different scale: a single dispatch covers all N_MBs macroblocks, not one block.
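
For orientation, a scalar CPU reference of the H.264 4×4 inverse core transform that a Stage 1 shader would have to match bit-exactly. This is a sketch written from the spec's row-then-column butterfly with the final `(x + 32) >> 6` rounding; it is not the daedalus-fourier shader or FFmpeg's code, the function names are made up here, and it assumes coefficients are already dequantized and stored row-major.

```c
#include <assert.h>
#include <stdint.h>

/* One 1-D pass of the 4x4 inverse core transform over four values
 * spaced `stride` apart (stride 1 = a row, stride 4 = a column). */
static void idct4_1d(int32_t *v, int stride)
{
    int32_t a = v[0], b = v[stride], c = v[2 * stride], d = v[3 * stride];
    int32_t e0 = a + c;
    int32_t e1 = a - c;
    int32_t e2 = (b >> 1) - d;
    int32_t e3 = b + (d >> 1);
    v[0]          = e0 + e3;
    v[stride]     = e1 + e2;
    v[2 * stride] = e1 - e2;
    v[3 * stride] = e0 - e3;
}

/* in: 16 dequantized coefficients (row-major); out: 16 residual samples. */
static void h264_idct4x4_ref(const int16_t in[16], int16_t out[16])
{
    int32_t t[16];
    /* The butterflies rely on arithmetic right shift of negative values,
     * which C leaves implementation-defined; check the assumption. */
    assert((-2 >> 1) == -1);
    for (int i = 0; i < 16; i++) t[i] = in[i];
    for (int r = 0; r < 4; r++) idct4_1d(t + 4 * r, 1);  /* rows */
    for (int c = 0; c < 4; c++) idct4_1d(t + c, 4);      /* columns */
    for (int i = 0; i < 16; i++) out[i] = (int16_t)((t[i] + 32) >> 6);
}
```

A DC-only block with coefficient 64 produces a flat residual of 1, which makes a convenient first sanity check before comparing against an FFmpeg reference.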
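**Stage 2a (intra)** — wavefront, one dispatch per diagonal. For a 1080p frame (120×68 MB grid), there are 120 + 68 − 1 = 187 anti-diagonals; the longest has 68 MBs. MBs on one diagonal are independent as long as prediction reads only the left, top-left, and top neighbours. Intra 4×4 and 8×8 modes can also read the top-right MB, which lies on the same anti-diagonal, so a strictly correct schedule for those modes needs a 2:1 skew (wave = x + 2y, giving 120 + 2·67 = 254 waves). Either way: one dispatch per wave, with a barrier in between → 187 (or 254) dispatches in the same command buffer (still ~225k× cheaper than per-block).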
Alternative: speculative CPU intra prediction, GPU only does inter blocks. Simpler, slower. Decision deferred.
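
The wavefront schedule can be sketched on the CPU as below. `skew = 1` is the plain anti-diagonal order described above; `skew = 2` is a variant, shown here as an assumption rather than a decided design, that also places the top-right neighbour (x+1, y−1) in a strictly earlier wave. All names are illustrative.

```c
#include <assert.h>

/* Wave index of macroblock (x, y). */
static int wave_of(int x, int y, int skew) { return x + skew * y; }

/* Number of waves, i.e. intra dispatches, for a w_mbs x h_mbs grid. */
static int wave_count(int w_mbs, int h_mbs, int skew)
{
    assert(w_mbs > 0 && h_mbs > 0 && skew >= 1);
    return w_mbs + skew * (h_mbs - 1);
}

/* Number of macroblocks in one wave, i.e. the size of that dispatch. */
static int wave_len(int w_mbs, int h_mbs, int skew, int wave)
{
    int n = 0;
    for (int y = 0; y < h_mbs; y++) {
        int x = wave - skew * y;
        if (x >= 0 && x < w_mbs) n++;
    }
    return n;
}

/* Largest dispatch over all waves. */
static int longest_wave(int w_mbs, int h_mbs, int skew)
{
    int best = 0;
    for (int w = 0; w < wave_count(w_mbs, h_mbs, skew); w++) {
        int n = wave_len(w_mbs, h_mbs, skew, w);
        if (n > best) best = n;
    }
    return best;
}
```

For the 120×68 grid, `skew = 1` yields 187 waves with at most 68 MBs each, and `skew = 2` yields 254 waves with at most 60. The per-wave lengths are what a command-buffer recorder would pass as dispatch sizes.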
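**Stage 2b (MC)** — single dispatch over all inter-MBs. Reads reference frames from DPB. The 6-tap qpel filter needs 5 extra samples along each filtered axis, so a horizontal half-sample position such as mc20 reads a 13×8 source window per 8×8 output block, and positions that filter in both directions read up to 13×13. daedalus-fourier's existing `v3d_h264_qpel_mc20` covers mc20; we need the other 15 variants (mc00/01/02/.../33), either as separate shaders or as one parameterized shader.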
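
As a sketch of what a parameterized MC shader would key on: the two low bits of each quarter-sample MV component select the mcXY position, and the remaining bits give the integer offset. The scalar mc20 row filter below uses the standard H.264 6-tap kernel (1, −5, 20, 20, −5, 1) with `(v + 16) >> 5` rounding. The helper names and the 0..15 index encoding are assumptions for illustration, not the existing `v3d_h264_qpel_mc20` interface.

```c
#include <assert.h>
#include <stdint.h>

static uint8_t clip_u8(int v) { return (uint8_t)(v < 0 ? 0 : v > 255 ? 255 : v); }

/* Split a quarter-sample MV component into integer offset and fraction,
 * with floor semantics: -3 becomes integer -1, fraction 1. */
static void qpel_split(int mv, int *int_part, int *frac)
{
    assert(int_part && frac);
    *frac = mv & 3;
    *int_part = (mv - *frac) / 4;
}

/* Hypothetical 0..15 selector: X (horizontal fraction) in the high two
 * bits, Y (vertical fraction) in the low two bits, so mc20 maps to 8. */
static int qpel_position(int mvx, int mvy) { return ((mvx & 3) << 2) | (mvy & 3); }

/* Horizontal half-sample (mc20) for n outputs. src must have 2 valid
 * samples left of src[0] and 3 right of src[n-1]: 13 inputs for n = 8. */
static void qpel_mc20_row(const uint8_t *src, uint8_t *dst, int n)
{
    assert(n > 0);
    for (int x = 0; x < n; x++) {
        int v = src[x - 2] - 5 * src[x - 1] + 20 * src[x]
              + 20 * src[x + 1] - 5 * src[x + 2] + src[x + 3];
        int r = v + 16;
        dst[x] = r < 0 ? 0 : clip_u8(r >> 5);
    }
}
```

The kernel taps sum to 32, so a flat input row passes through unchanged, which is a cheap first check before bit-exact comparison against FFmpeg.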
**Stage 3 (reconstruct)** — trivial per-pixel `clip255(residual + predicted)`. One dispatch over all pixels.
**Stage 4 (deblock)** — two dispatches: vertical edges (all rows independent), horizontal edges (all columns independent). Reuses daedalus-fourier's `v3d_h264deblock` shader, adapted for the new buffer layout.
Total dispatches per frame: ~200 (mostly the intra wavefront), all in one command buffer, one submit, one fence wait.
---
## 5. Reference frame management
DPB = N reference frames in GPU memory, each a VkImage with NV12 format, exported as dma_buf. Up to 16 reference frames per H.264 spec; typical streams use 1-4.
CPU tracks the index mapping (refIdx0/refIdx1 in MV lists → DPB slot). GPU shader receives DPB slot indices via the mb_descriptor; reads pixels via VkImage descriptor binding.
At frame-decode end, the output VkImage may become a new reference (per `ref_pic_marking` decisions on CPU). Slot retirement is CPU-managed.
This is the **right shape for V4L2 CAPTURE integration** — daedalus-v4l2 today uses dma_buf-exported CAPTURE planes; reusing those as DPB entries means zero-copy from decode → V4L2 consumer.
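
A minimal sketch of the CPU-side slot bookkeeping described in this section, assuming a fixed pool of 16 slots tracked in a bitmask. The type and function names are hypothetical; the real tracker would also need to hold a slot until the V4L2 consumer has returned the buffer.

```c
#include <assert.h>
#include <stdint.h>

enum { DPB_MAX_SLOTS = 16 };   /* H.264 upper bound quoted above */

struct dpb_tracker {
    uint16_t in_use;           /* bit i set = slot i holds a live picture */
};

/* Returns the lowest free slot index, or -1 if every slot is occupied. */
static int dpb_acquire(struct dpb_tracker *d)
{
    for (int i = 0; i < DPB_MAX_SLOTS; i++) {
        if (!(d->in_use & (1u << i))) {
            d->in_use |= (uint16_t)(1u << i);
            return i;
        }
    }
    return -1;
}

/* Called once a picture is no longer a reference and no longer displayed. */
static void dpb_release(struct dpb_tracker *d, int slot)
{
    assert(slot >= 0 && slot < DPB_MAX_SLOTS);
    assert(d->in_use & (1u << slot));
    d->in_use &= (uint16_t)~(1u << slot);
}
```

The slot index returned by `dpb_acquire` is the value that would be written into the per-MB descriptor for the GPU to resolve against its bound images.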
---
## 6. libavcodec integration
The honest hard problem. libavcodec's H.264 decoder calls thousands of per-MB / per-block functions in sequence; we need to replace that with a "collect everything into the descriptor buffer, then dispatch one pipeline" flow.
**Intercept point candidates:**
- **Slice level** (`ff_h264_decode_slice` or `decode_slice`). Replace the per-MB loop entirely. Highest leverage but biggest rewrite.
- **Macroblock level** (`ff_h264_decode_mb_*`). Each call returns to libavcodec which calls pixel kernels. Replace the kernels with descriptor-write stubs, dispatch at slice end. Closer to the substitution arc; less rewrite.
The macroblock-level intercept is the natural evolution of the substitution arc. The current `daedalus_recipe_dispatch_h264_*` calls become **append-to-buffer** instead of **dispatch-to-GPU**, and an end-of-slice flush triggers the actual Vulkan submit.
This is the same shape as **deferred rendering** in GL/Vulkan — record commands, submit in bulk.
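
The append-then-flush flow can be sketched as a small accumulator. Everything here is hypothetical: `struct mb_record` stands in for the real per-MB descriptor, and the `submit` callback stands in for whatever records and submits the Vulkan command buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct mb_record { uint16_t mb_x, mb_y; uint8_t mb_type; };  /* trimmed stand-in */

struct frame_batch {
    struct mb_record *records;   /* caller-provided storage */
    size_t capacity;             /* number of entries in records */
    size_t count;                /* entries appended since last flush */
    unsigned submits;            /* how many flushes actually submitted */
    void (*submit)(const struct mb_record *, size_t, void *);
    void *user;
};

/* Replaces "dispatch now": only records the work. Returns -1 when full. */
static int frame_batch_append(struct frame_batch *b, struct mb_record r)
{
    assert(b && b->records);
    if (b->count == b->capacity) return -1;
    b->records[b->count++] = r;
    return 0;
}

/* End of slice or frame: hand everything to the backend in one call. */
static void frame_batch_flush(struct frame_batch *b)
{
    assert(b);
    if (b->count == 0) return;
    if (b->submit) b->submit(b->records, b->count, b->user);
    b->submits++;
    b->count = 0;
}
```

Appending 8160 records and flushing once results in a single submit, which is the property the Phase 1 success criteria ask the real path to demonstrate.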
---
## 7. What we reuse from daedalus-fourier
Existing V3D shaders (post-PR-#7+#8 state):
| Shader | Reuse status |
|---|---|
| `v3d_h264_idct4` | Yes — adapt input/output buffer layout |
| `v3d_h264_idct8` | Yes — same |
| `v3d_h264deblock` | Yes — adapt; the existing shader handles vertical-luma; chroma + horizontal variants are new |
| `v3d_h264_qpel_mc20` | Partial — only mc20. All 16 qpel positions needed for MC. |
| `v3d_idct8` (VP9) | Not for H.264 |
| `v3d_lpf_h_4_8`, `v3d_lpf_h_8_8` (VP9) | Not for H.264 |
| `v3d_mc_8h` (VP9 8-tap) | Not for H.264 (6-tap) |
| `v3d_cdef` (AV1) | Not for H.264 |
New shaders needed:
- `v3d_h264_iquant` — inverse quantization (can be fused with IDCT into one stage)
- `v3d_h264_intra_4x4` — 9 prediction modes
- `v3d_h264_intra_16x16` — 4 modes
- `v3d_h264_intra_chroma_8x8` — 4 modes
- `v3d_h264_qpel_{mc00..mc33}` — 15 missing variants (or one parameterized shader)
- `v3d_h264_chroma_mc` — 1/8-pel chroma MC (bilinear)
- `v3d_h264_deblock_h_luma` + chroma variants
- `v3d_h264_reconstruct` — trivial per-pixel add+clip
That's a substantial shader inventory. Each requires bit-exact M1 gate against an FFmpeg reference, same methodology as daedalus-fourier cycles 1-9.
---
## 8. Phasing
**Phase 1 — MVP, I-frames only, Baseline profile** (4-6 weeks). No CABAC (CAVLC only). No MC. No B-frames. Validates the architecture: parse → entropy → frame-pipeline → NV12 out, all bit-exact for I-frames. Test against ffmpeg-generated I-frame-only H.264 streams.
**Phase 2 — P-frames, single reference, Baseline profile** (+4 weeks). Add MC stage and DPB.
**Phase 3 — CABAC + Main/High profile + B-frames + multi-ref** (+6 weeks). CABAC stays on CPU. Full DPB management.
**Phase 4 — Production-ready deblock + perf optimization + libva integration** (+4 weeks). Real-world stream conformance. Plug into daedalus-v4l2 daemon as the actual decode backend.
**Total budget:** 4-6 months.
---
## 9. Open questions
1. **Intra prediction strategy:** GPU wavefront (~187 dispatches, more complex) vs CPU speculative (simpler, slower). Plan: wavefront in Phase 1; revisit if it's the perf bottleneck.
2. **libavcodec intercept granularity:** macroblock-level (substitution-arc evolution) vs slice-level (cleaner rewrite). Plan: macroblock-level for Phase 1; consider slice-level later if buffer accumulation overhead is non-trivial.
3. **Shader parameterization:** 16 qpel variants as 16 shaders, or one parameterized shader with switch on mc_position? V3D's compiler might inline-optimize either; needs measurement.
4. **DPB allocation:** Vulkan-native VkImage with dmabuf export, vs CPU-allocated dma_buf imported into Vulkan. Affects V4L2 integration story. Plan: Vulkan-native with `VK_EXT_external_memory_dma_buf` export; daedalus-v4l2 daemon imports.
5. **Daemon integration shape:** does daedalus-decoder ship as a static library the daemon links, or as a separate process the daemon talks to? Library, almost certainly — process boundary would multiply IPC cost.
6. **Build dependency on daedalus-fourier:** as a CMake `find_package`, or vendored? `find_package`, pinned to a tagged release. daedalus-fourier becomes the "kernel pack" upstream library.
7. **Out-of-scope for daedalus-decoder (firmly):** VP9, AV1, HEVC (Pi 5 has rpi-hevc-dec for that), 10-bit, interlaced, FMO/ASO.
---
## 10. Success criteria
Phase 1 — **architecture validation, not perf**:
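- Bit-exact NV12 output against `ffmpeg -i input.h264 -f rawvideo -pix_fmt nv12 -` for at least one synthetic I-frame-only stream (without `-pix_fmt nv12`, ffmpeg emits planar yuv420p for typical 8-bit streams, which would not compare byte-for-byte)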
- Demonstrates one-submit-per-frame dispatch path (not one-dispatch-per-block)
- Vulkan validation layer reports no errors
- No regression in daedalus-fourier or daedalus-v4l2
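
The bit-exact check in the first criterion could be as simple as the comparison below, applied to the decoder's NV12 output and the ffmpeg reference for the same frame. The helper names are illustrative, and the sketch assumes both buffers are tightly packed (no row padding) at the cropped display size.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bytes in one 8-bit NV12 frame: a full-resolution Y plane followed by
 * a half-height plane of interleaved Cb/Cr samples. */
static size_t nv12_frame_bytes(size_t width, size_t height)
{
    assert(width % 2 == 0 && height % 2 == 0);
    return width * height + width * (height / 2);
}

/* Byte offset of the first difference, or -1 if the buffers match.
 * Offsets below width * height fall in the Y plane, the rest in CbCr. */
static long long first_mismatch(const uint8_t *a, const uint8_t *b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (a[i] != b[i]) return (long long)i;
    return -1;
}
```

A 1920×1080 frame is 3,110,400 bytes. Note that the comparison should use the cropped 1080-row picture, not a 1088-row macroblock-aligned buffer.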
Phase 4 — **production target**:
- 1080p H.264 at ≥30 fps on Pi 5 (matches the [30fps floor](https://git.reauktion.de/marfrit/daedalus-fourier/src/branch/main/CLAUDE-CODE-MEMORY/project_30fps_floor_is_fine.md) user-facing requirement)
- Daily YouTube playback via Firefox → libva → daedalus-v4l2 daemon → daedalus-decoder works durably
- CPU consumption noticeably lower than the current libavcodec-NEON path (the "save electrons" goal)
If Phase 4 can't beat NEON-on-CPU, we acknowledge the V3D7's compute throughput isn't enough for H.264 1080p at 30fps and pivot back to NEON. The R-band methodology was clear that per-block QPU is RED on Pi 5; the bet here is that frame-level batched dispatch — the NVDEC shape — changes the ratio enough to flip the verdict.
---
## 11. References
- [Khronos: An Introduction to Vulkan Video](https://www.khronos.org/blog/an-introduction-to-vulkan-video)
- [NVDEC Video Decoder API Programming Guide](https://docs.nvidia.com/video-technologies/video-codec-sdk/13.0/nvdec-video-decoder-api-prog-guide/)
- [VK_KHR_video_decode_h264 proposal](https://docs.vulkan.org/features/latest/features/proposals/VK_KHR_video_decode_h264.html)
- [Mesa V3D driver docs](https://docs.mesa3d.org/drivers/v3d.html)
- [daedalus-fourier substitution arc](https://git.reauktion.de/marfrit/daedalus-fourier) — cycles 6-9; the kernel pack we're not using at the wrong granularity anymore
- ITU-T H.264 spec (Rec. H.264 (08/2021)): clause 7 for syntax and semantics, clause 8 for the decoding process, clause 9 for CAVLC/CABAC parsing