daedalus-decoder — design

Phase: scoping / first-draft design. No code written. Date: 2026-05-23.

This document captures the architecture for an NVDEC-shaped H.264 decoder targeting Raspberry Pi 5 (V3D7). It exists because the daedalus-fourier substitution prototype — a per-block kernel pack called synchronously from inside libavcodec's H.264 decoder — measured at >600× slower than NEON in production due to per-call Vulkan synchronization overhead. The R-band methodology correctly predicted this; the "QPU is default substrate" decree (2026-05-23) overrode it for prototype purposes and confirmed the architectural mismatch. This document is the structural redesign that addresses it.


1. The shape we need

NVDEC and Vulkan Video both target dedicated decode hardware via a picture-level command: app submits a bitstream, hardware decodes a whole frame, app gets an NV12 surface. One submit per frame, one fence wait per frame, one DMA.

We don't have dedicated H.264 silicon on Pi 5 (the chip has a separate HEVC block — rpi-hevc-dec — but no H.264 decoder). We do have V3D7 Vulkan compute. The goal is to build the same shape as NVDEC using V3D7 compute as the substrate.

┌──────────────────────────────────────────────────────────────┐
│ CPU side                                                      │
│  ┌───────────────────────────────────────────────────────┐   │
│  │ libavcodec H.264 parser (existing)                    │   │
│  │   - NAL split + emulation prevention removal          │   │
│  │   - SPS / PPS                                          │   │
│  │   - Slice header decode                                │   │
│  │   - CABAC / CAVLC entropy decode                       │   │
│  │   - Per-MB: mb_type, qp, partition info,               │   │
│  │            transform coefficients, MVs, intra modes    │   │
│  └─────────────────────┬─────────────────────────────────┘   │
│                        ↓ (write into frame-shaped SSBO)       │
│                        │                                       │
└────────────────────────┼───────────────────────────────────────┘
                         ↓ Vulkan compute queue submit
┌────────────────────────┼───────────────────────────────────────┐
│ GPU (V3D7) side        ↓                                       │
│  ┌─────────────────────────────────────────────────────┐      │
│  │ Stage 1: inverse quant + IDCT 4x4 / 8x8             │      │
│  │   in:  coeffs[N_MBs × 384]                           │      │
│  │   out: residual[N_MBs × 384] u8/i16                  │      │
│  └────────────────────────┬────────────────────────────┘      │
│                           ↓  vkCmdPipelineBarrier              │
│  ┌─────────────────────────────────────────────────────┐      │
│  │ Stage 2: prediction (intra + MC)                     │      │
│  │   in:  mb_descriptors, residual,                     │      │
│  │        DPB reference frames                          │      │
│  │   out: predicted[N_MBs × 384] u8                     │      │
│  │   note: intra dispatched in wavefront order;         │      │
│  │         MC fully parallel                            │      │
│  └────────────────────────┬────────────────────────────┘      │
│                           ↓                                    │
│  ┌─────────────────────────────────────────────────────┐      │
│  │ Stage 3: reconstruct = clip255(residual + predicted) │      │
│  │   out: reconstructed Y, U, V planes                  │      │
│  └────────────────────────┬────────────────────────────┘      │
│                           ↓                                    │
│  ┌─────────────────────────────────────────────────────┐      │
│  │ Stage 4: deblocking filter at MB edges               │      │
│  │   - V-edges first (rows independent)                 │      │
│  │   - H-edges next  (columns independent)              │      │
│  │   out: NV12 in DPB slot + dma-buf-exported           │      │
│  └────────────────────────┬────────────────────────────┘      │
└────────────────────────────┼───────────────────────────────────┘
                             ↓ vkWaitFences (one wait, one frame)
                       NV12 ready in V4L2 CAPTURE plane

One Vulkan submit per frame. The per-call sync cost that killed the kernel-substitution prototype goes away because the dispatch granularity is the right size: thousands of macroblocks of work amortize the ~50 µs submit/wait roundtrip.
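The amortization argument can be checked with back-of-envelope arithmetic. The sketch below is illustrative only: it takes the ~50 µs roundtrip quoted above at face value, and the one-submit-per-macroblock case is a hypothetical comparison point, not a measured configuration. The function names are made up for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Total submit/wait overhead per frame, in microseconds, for a given
 * number of submits and a fixed per-submit roundtrip cost. */
static uint64_t sync_overhead_us(uint64_t submits_per_frame,
                                 uint64_t roundtrip_us)
{
    return submits_per_frame * roundtrip_us;
}

/* Time available per frame at a given frame rate, in microseconds
 * (integer division, so 30 fps gives 33333). */
static uint64_t frame_budget_us(uint32_t fps)
{
    assert(fps > 0);
    return 1000000u / fps;
}
```

With 8160 macroblocks at 1080p, `sync_overhead_us(8160, 50)` is 408000 µs of pure synchronization against a `frame_budget_us(30)` of 33333 µs, whereas a single submit costs 50 µs, well under 1% of the budget.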


2. CPU/GPU split

The CPU keeps everything that is inherently serial. H.264's entropy decode (CABAC, and CAVLC for Baseline-equivalent paths) is bit-by-bit serial — each binary decision updates context state that the next decision reads. NVDEC has a dedicated CABAC ASIC; doing CABAC on a SIMD compute architecture like V3D is a research project of its own and not in scope.

The CPU also keeps:

  • NAL-unit boundary scanning + emulation-prevention byte removal
  • SPS/PPS storage and lookup
  • Slice header decoding
  • Reference picture list construction (DPB management at the index level)
  • Display-order reorder (B-frame POC machinery)

The CPU produces a frame-shaped descriptor that the GPU consumes.

The GPU runs everything that is massively parallel per macroblock:

| Stage | What it does | Parallelism |
|---|---|---|
| 1. Inverse quant + IDCT | Multiply coeffs by Q scale, run 4×4 / 8×8 inverse transform | Fully parallel (per block) |
| 2a. Intra prediction | Predict from neighbouring decoded pixels (top, top-right, left) | Wavefront — diagonal of MBs |
| 2b. Motion compensation | qpel filter on reference pixels | Fully parallel |
| 3. Reconstruct | residual + predicted, clip255 | Fully parallel (per pixel) |
| 4. Deblocking | Filter MB edges (alpha/beta/tc0 table) | Two-pass: V-edges (rows ∥), H-edges (cols ∥) |
| 5. NV12 → RGBA (optional) | YUV→RGB matrix conversion via H.264 VUI colourspace | Fully parallel (per pixel) |

3. Per-MB descriptor (CPU → GPU)

Frame-shaped SSBO: mb_descriptors[N_MBs], where N_MBs = ⌈W/16⌉ × ⌈H/16⌉ (8160 for 1080p).

struct mb_descriptor {
    uint8_t  mb_type;            /* I_NxN, I_16x16, P_..., B_..., I_PCM, ... */
    uint8_t  mb_qp_y;
    uint8_t  mb_qp_uv;
    uint8_t  cbp;                /* coded block pattern */

    uint8_t  intra_4x4_modes[16]; /* per 4x4 sub-block, when I_NxN */
    uint8_t  intra_16x16_mode;    /* when I_16x16 */
    uint8_t  intra_chroma_mode;

    uint8_t  partition_mode;     /* P_16x16 / P_16x8 / P_8x16 / P_8x8 / ... */
    uint8_t  ref_idx_l0[4];      /* up to 4 partitions */
    uint8_t  ref_idx_l1[4];      /* B only */
    int16_t  mv_l0[4][2];        /* qpel-precision (1/4 sample) */
    int16_t  mv_l1[4][2];

    uint8_t  deblock_disable;
    int8_t   deblock_alpha_c0;
    int8_t   deblock_beta;
};

Coefficient buffer separate, also frame-shaped:

int16_t coeffs[N_MBs * 384];  /* 256 luma + 64 cb + 64 cr per MB */

Sized for the worst case (every block has non-zero coefficients). Most MBs have sparse coefficients, but we don't compact: the GPU shader iterates over every slot, and skipping zeros is cheap.
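The frame-shaped addressing of both buffers can be sketched as follows. This is a minimal illustration, assuming raster-order indexing (mb_y * mb_width + mb_x, the same formula Appendix B uses); the helper names and the `COEFFS_PER_MB` macro are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define COEFFS_PER_MB 384u   /* 256 luma + 64 Cb + 64 Cr */

/* Number of 16-pixel macroblocks needed to cover a dimension (ceiling). */
static uint32_t mb_count_1d(uint32_t pixels)
{
    return (pixels + 15u) / 16u;
}

/* N_MBs for a frame. */
static uint32_t n_mbs(uint32_t width, uint32_t height)
{
    return mb_count_1d(width) * mb_count_1d(height);
}

/* Raster-order index into mb_descriptors[]. */
static uint32_t mb_index(uint32_t mb_x, uint32_t mb_y, uint32_t mb_width)
{
    assert(mb_x < mb_width);
    return mb_y * mb_width + mb_x;
}

/* Element offset of a macroblock's first coefficient in coeffs[]. */
static size_t coeff_offset(uint32_t mb_idx)
{
    return (size_t)mb_idx * COEFFS_PER_MB;
}

/* Worst-case size of the coefficient buffer in bytes. */
static size_t coeff_buffer_bytes(uint32_t width, uint32_t height)
{
    return (size_t)n_mbs(width, height) * COEFFS_PER_MB * sizeof(int16_t);
}
```

For 1080p this gives 8160 macroblocks and a coefficient buffer of 6,266,880 bytes per frame, which is the amount of CPU-to-GPU traffic the no-compaction choice accepts.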


4. Pipeline stage dispatch shape

All stages live in one VkCommandBuffer per frame. Inter-stage dependencies use vkCmdPipelineBarrier with VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT as both source and destination stage, and a memory barrier from VK_ACCESS_SHADER_WRITE_BIT to VK_ACCESS_SHADER_READ_BIT covering the relevant SSBO.

Stage 1 (IDCT) — one workgroup per MB; many MBs per dispatch. Reuses daedalus-fourier's IDCT 4×4 / 8×8 shader logic but at a different scale: a single dispatch covers all N_MBs macroblocks, not one block.
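For reference, the arithmetic each Stage 1 invocation performs on a 4×4 block can be written as a scalar routine. This is a sketch of the standard H.264 4×4 inverse core transform (rows, then columns, then rounding by (x + 32) >> 6), not a transcription of the daedalus-fourier shader. It assumes the input is already dequantized and stored row-major; the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* One 1-D butterfly of the H.264 4x4 inverse core transform.
 * Relies on >> being an arithmetic shift for negative ints (true on
 * GCC and Clang; implementation-defined in ISO C). */
static void idct4_1d(const int in[4], int out[4])
{
    int z0 = in[0] + in[2];
    int z1 = in[0] - in[2];
    int z2 = (in[1] >> 1) - in[3];
    int z3 = in[1] + (in[3] >> 1);
    out[0] = z0 + z3;
    out[1] = z1 + z2;
    out[2] = z1 - z2;
    out[3] = z0 - z3;
}

/* coeffs: 16 dequantized coefficients, row-major.
 * residual: 16 output residual samples, row-major. */
static void h264_idct4x4_ref(const int16_t coeffs[16], int16_t residual[16])
{
    int tmp[16];
    assert(coeffs && residual);

    /* Horizontal pass over each row. */
    for (int r = 0; r < 4; r++) {
        int in[4] = { coeffs[4*r], coeffs[4*r + 1],
                      coeffs[4*r + 2], coeffs[4*r + 3] };
        idct4_1d(in, &tmp[4*r]);
    }
    /* Vertical pass over each column, then round and scale. */
    for (int c = 0; c < 4; c++) {
        int in[4] = { tmp[c], tmp[4 + c], tmp[8 + c], tmp[12 + c] };
        int out[4];
        idct4_1d(in, out);
        for (int r = 0; r < 4; r++)
            residual[4*r + c] = (int16_t)((out[r] + 32) >> 6);
    }
}
```

A routine of this kind is also a natural CPU-side oracle for the bit-exact gate: a DC-only block with coefficient 64 must produce a residual of 1 in all 16 positions.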

Stage 2a (intra) — wavefront dispatched as one dispatch per diagonal. For a 1080p frame (120×68 MB grid), there are 120 + 68 − 1 = 187 diagonals; the longest has 68 MBs. Each diagonal's MBs are independent. One dispatch per diagonal, with a barrier in between → 187 dispatches in the same command buffer (still ~225k× cheaper than per-block).
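The diagonal schedule can be sketched with three small helpers. With skew = 1 they reproduce the plain anti-diagonal count used above. The skew = 2 case is an assumption of this sketch, not something the design states: it is the stagger one would use if a macroblock must wait for its top-right neighbour (x+1, y−1), which sits on the same plain anti-diagonal. All names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Wave (dispatch) index of macroblock (mb_x, mb_y).
 * skew = 1: plain anti-diagonals, x + y.
 * skew = 2: staggered wavefront, x + 2y, so that (x+1, y-1) always
 *           lands in an earlier wave than (x, y). */
static uint32_t wave_index(uint32_t mb_x, uint32_t mb_y, uint32_t skew)
{
    return mb_x + skew * mb_y;
}

/* Number of waves (= number of dispatches) for an mb_w x mb_h grid. */
static uint32_t wave_count(uint32_t mb_w, uint32_t mb_h, uint32_t skew)
{
    assert(mb_w > 0 && mb_h > 0);
    return (mb_w - 1) + skew * (mb_h - 1) + 1;
}

/* Number of macroblocks in wave d, i.e. the size of that dispatch. */
static uint32_t wave_len(uint32_t mb_w, uint32_t mb_h,
                         uint32_t skew, uint32_t d)
{
    uint32_t n = 0;
    for (uint32_t y = 0; y < mb_h && skew * y <= d; y++)
        if (d - skew * y < mb_w)
            n++;
    return n;
}
```

For the 120×68 grid, `wave_count(120, 68, 1)` is 187 and the longest wave holds 68 macroblocks. The staggered variant gives 254 waves, so the choice of dependency model changes the dispatch count by roughly a third and is worth settling before profiling.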

Alternative: speculative CPU intra prediction, GPU only does inter blocks. Simpler, slower. Decision deferred.

Stage 2b (MC) — single dispatch over all inter-MBs. Reads reference frames from DPB. The 6-tap qpel filter reads a 13×8 source window per 8×8 output block for a horizontal-only position such as mc20 (8 outputs plus 5 extra taps per row); daedalus-fourier's existing v3d_h264_qpel_mc20 covers mc20 — we need the other 15 variants (mc00/01/02/.../33) either as separate shaders or as one parameterized shader.
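As a point of reference for the mc20 position, the horizontal half-sample filter is the standard H.264 6-tap kernel (1, −5, 20, 20, −5, 1) with rounding and a shift by 5. The scalar sketch below filters one row; it is not the daedalus-fourier shader, and the function names are hypothetical. It assumes the caller has padded the row so that src[-2] through src[width + 2] are readable, which is 13 samples for 8 outputs.

```c
#include <assert.h>
#include <stdint.h>

static uint8_t clip255(int v)
{
    return (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
}

/* Horizontal half-sample interpolation ("put" variant) for one row.
 * dst[x] is the sample halfway between src[x] and src[x + 1].
 * Reads src[-2] .. src[width + 2]. */
static void qpel_mc20_row_ref(const uint8_t *src, uint8_t *dst, int width)
{
    assert(width > 0);
    for (int x = 0; x < width; x++) {
        int sum = src[x - 2] - 5 * src[x - 1] + 20 * src[x]
                + 20 * src[x + 1] - 5 * src[x + 2] + src[x + 3];
        /* Negative sums clip to 0 either way; handling them first
         * avoids right-shifting a negative value. */
        dst[x] = clip255(sum < 0 ? 0 : (sum + 16) >> 5);
    }
}
```

The taps sum to 32, so a flat region passes through unchanged, and a 0-to-255 step edge produces 128 at the half-sample position on the edge.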

Stage 3 (reconstruct) — trivial per-pixel clip255(residual + predicted). One dispatch over all pixels.
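Stage 3 is small enough to state in full. The scalar sketch below assumes signed 16-bit residuals and 8-bit predictions laid out identically; the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static uint8_t clip255(int v)
{
    return (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
}

/* out[i] = clip255(residual[i] + predicted[i]) for n samples. */
static void reconstruct_ref(const int16_t *residual,
                            const uint8_t *predicted,
                            uint8_t *out, size_t n)
{
    assert(residual && predicted && out);
    for (size_t i = 0; i < n; i++)
        out[i] = clip255((int)residual[i] + (int)predicted[i]);
}
```

On the GPU each loop iteration would map to one invocation, which is why the stage needs only a single dispatch.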

Stage 4 (deblock) — two dispatches: vertical edges (all rows independent), horizontal edges (all columns independent). Reuses daedalus-fourier's v3d_h264deblock shader, adapted for the new buffer layout.

Stage 5 (optional, NV12 → RGBA conversion) — a single per-pixel compute dispatch that converts the just-deblocked NV12 to packed RGBA8. Roughly:

// BT.709 limited-range example; matrix selected per H.264 VUI.
int Y  = int(luma[...]) - 16;
int CB = int(chroma_uv[...]) - 128;
int CR = int(chroma_uv[... + 1]) - 128;
int R = clip255((298*Y + 459*CR + 128) >> 8);
int G = clip255((298*Y - 55*CB - 136*CR + 128) >> 8);
int B = clip255((298*Y + 541*CB + 128) >> 8);
out_rgba[r*W + c] = pack32(R, G, B, 255);
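A self-contained version of the per-pixel arithmetic, for one pixel, might look like the sketch below. The coefficients are the BT.709 limited-range factors (Kr = 0.2126, Kb = 0.0722, scaled by 255/219 for luma and 255/224 for chroma) rounded to 8 fractional bits; note that in the green channel the Cb term is the smaller one. The function names and the byte-array output (used instead of a packed word to sidestep endianness) are choices of this sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Convert an 8.8 fixed-point value (rounding term already added)
 * to a clamped 8-bit sample without shifting a negative number. */
static uint8_t fix8_to_u8(int v)
{
    if (v < 0)
        return 0;
    v >>= 8;
    return (uint8_t)(v > 255 ? 255 : v);
}

/* BT.709, limited range: Y in [16, 235], Cb/Cr in [16, 240]. */
static void bt709_limited_to_rgba(uint8_t y, uint8_t cb, uint8_t cr,
                                  uint8_t rgba[4])
{
    assert(rgba != 0);
    int Y  = (int)y  - 16;
    int CB = (int)cb - 128;
    int CR = (int)cr - 128;
    rgba[0] = fix8_to_u8(298 * Y + 459 * CR + 128);            /* R */
    rgba[1] = fix8_to_u8(298 * Y - 55 * CB - 136 * CR + 128);  /* G */
    rgba[2] = fix8_to_u8(298 * Y + 541 * CB + 128);            /* B */
    rgba[3] = 255;                                             /* A */
}
```

Neutral inputs make a quick sanity check: (16, 128, 128) maps to black, (235, 128, 128) to white, and (126, 128, 128) to mid-grey 128.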

Off by default. Most consumers want NV12 because:

  • V4L2 stateless decoder output is canonically NV12; the daedalus-v4l2 CAPTURE plane is already plumbed for NV12 dmabuf.
  • Wayland compositors (sway, mutter, KWin, weston) advertise NV12 via zwp_linux_dmabuf_v1 and convert during composition — essentially free because it's one texture sample with a swizzle in the composite fragment shader.
  • Firefox / mpv VAAPI paths use NV12 dmabuf zero-copy to the compositor (cf. project memory pi5_h264_hwaccel_landed).
  • RGBA8 is about 2.7× the bandwidth of NV12 (8 MB/frame vs 3 MB/frame at 1080p) — converting in the decoder when no consumer needs it burns DMA bandwidth and electrons.
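The per-frame sizes behind the bandwidth bullet follow directly from the formats: NV12 carries one full-resolution luma plane plus one half-resolution interleaved chroma plane (1.5 bytes per pixel), RGBA8 carries 4 bytes per pixel. A small sketch, with hypothetical helper names:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bytes in one tightly packed NV12 frame: Y plane + interleaved CbCr
 * plane at half resolution in both dimensions. */
static size_t nv12_frame_bytes(uint32_t width, uint32_t height)
{
    assert(width > 0 && height > 0);
    size_t luma   = (size_t)width * height;
    size_t chroma = 2 * (size_t)((width + 1) / 2) * ((height + 1) / 2);
    return luma + chroma;
}

/* Bytes in one tightly packed RGBA8 frame. */
static size_t rgba8_frame_bytes(uint32_t width, uint32_t height)
{
    assert(width > 0 && height > 0);
    return (size_t)width * height * 4;
}
```

At 1920×1080 these come to 3,110,400 and 8,294,400 bytes, a ratio of exactly 8:3. Real surfaces usually have padded strides, so allocated sizes will be somewhat larger.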

Opt in via a decoder config flag for consumers that genuinely need RGBA pre-compositor (SHM-buffer-only clients, future X11/headless paths, GL contexts without YUV sampler extension support).

Colourspace metadata MUST be plumbed through when Stage 5 is on: the H.264 SPS's vui_parameters carry colour_primaries, transfer_characteristics, matrix_coefficients, and video_full_range_flag. Picking the wrong matrix (BT.601 vs BT.709) or range silently mis-saturates skin tones and shadows. Default to the BT.709 limited-range combination when VUI is absent (the dominant case for HD content); log a warning in that case.
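The selection rule can be sketched as a small pure function. The matrix_coefficients values used here (1 = BT.709, 5 and 6 = the two BT.601 variants, 7 = SMPTE 240M, 2 = unspecified) are the ones defined in the H.264 VUI table; the enum, struct, and function names are hypothetical, and collapsing "VUI present" into one flag is a simplification of the nested presence flags in the real syntax.

```c
#include <assert.h>

enum yuv_matrix {
    YUV_MATRIX_BT601,
    YUV_MATRIX_BT709,
    YUV_MATRIX_SMPTE240M
};

struct colour_choice {
    enum yuv_matrix matrix;
    int full_range;   /* 1 = full range, 0 = limited range */
    int defaulted;    /* 1 = fell back to BT.709; caller should log */
};

static struct colour_choice choose_colour(int vui_present,
                                          int matrix_coefficients,
                                          int video_full_range_flag)
{
    struct colour_choice c = { YUV_MATRIX_BT709, 0, 1 };
    assert(video_full_range_flag == 0 || video_full_range_flag == 1);

    if (!vui_present)
        return c;                  /* BT.709 limited range, defaulted */

    c.full_range = video_full_range_flag;
    switch (matrix_coefficients) {
    case 1:                        /* BT.709 */
        c.matrix = YUV_MATRIX_BT709;     c.defaulted = 0; break;
    case 5:                        /* BT.470BG (BT.601, 625-line) */
    case 6:                        /* SMPTE 170M (BT.601, 525-line) */
        c.matrix = YUV_MATRIX_BT601;     c.defaulted = 0; break;
    case 7:                        /* SMPTE 240M */
        c.matrix = YUV_MATRIX_SMPTE240M; c.defaulted = 0; break;
    default:                       /* 2 = unspecified, or unsupported */
        break;                     /* keep the BT.709 fallback */
    }
    return c;
}
```

Returning the `defaulted` flag rather than logging inside the function keeps the rule testable and leaves the warning policy to the caller.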

Total dispatches per frame: ~190-200 with the intra wavefront (187 of them are the Stage 2a diagonals), not counting Stage 5; +1 with Stage 5. All in one command buffer, one submit, one fence wait.


5. Reference frame management

DPB = N reference frames in GPU memory, each a VkImage with NV12 format, exported as dma_buf. Up to 16 reference frames per H.264 spec; typical streams use 1-4.

CPU tracks the index mapping (refIdx0/refIdx1 in MV lists → DPB slot). GPU shader receives DPB slot indices via the mb_descriptor; reads pixels via VkImage descriptor binding.

At frame-decode end, the output VkImage may become a new reference (per ref_pic_marking decisions on CPU). Slot retirement is CPU-managed.

This is the right shape for V4L2 CAPTURE integration — daedalus-v4l2 today uses dma_buf-exported CAPTURE planes; reusing those as DPB entries means zero-copy from decode → V4L2 consumer.


6. libavcodec integration

The honest hard problem. libavcodec's H.264 decoder calls thousands of per-MB / per-block functions in sequence; we need to replace that with a "collect everything into the descriptor buffer, then dispatch one pipeline" flow.

Intercept point candidates:

  • Slice level (ff_h264_decode_slice or decode_slice). Replace the per-MB loop entirely. Highest leverage but biggest rewrite.
  • Macroblock level (ff_h264_decode_mb_*). Each call returns to libavcodec which calls pixel kernels. Replace the kernels with descriptor-write stubs, dispatch at slice end. Closer to the substitution arc; less rewrite.

The macroblock-level intercept is the natural evolution of the substitution arc. The current daedalus_recipe_dispatch_h264_* calls become append-to-buffer instead of dispatch-to-GPU, and an end-of-slice flush triggers the actual Vulkan submit.

This is the same shape as deferred rendering in GL/Vulkan — record commands, submit in bulk.


7. What we reuse from daedalus-fourier

Existing V3D shaders (post-PR-#7+#8 state):

| Shader | Reuse status |
|---|---|
| v3d_h264_idct4 | Yes — adapt input/output buffer layout |
| v3d_h264_idct8 | Yes — same |
| v3d_h264deblock | Yes — adapt; the existing shader handles vertical-luma; chroma + horizontal variants are new |
| v3d_h264_qpel_mc20 | Partial — only mc20. All 16 qpel positions needed for MC. |
| v3d_idct8 (VP9) | Not for H.264 |
| v3d_lpf_h_4_8, v3d_lpf_h_8_8 (VP9) | Not for H.264 |
| v3d_mc_8h (VP9 8-tap) | Not for H.264 (6-tap) |
| v3d_cdef (AV1) | Not for H.264 |

New shaders needed:

  • v3d_h264_iquant — inverse quantization (can be fused with IDCT into one stage)
  • v3d_h264_intra_4x4 — 9 prediction modes
  • v3d_h264_intra_16x16 — 4 modes
  • v3d_h264_intra_chroma_8x8 — 4 modes
  • v3d_h264_qpel_{mc00..mc33} — 15 missing variants (or one parameterized shader)
  • v3d_h264_chroma_mc — 1/8-pel chroma MC (bilinear)
  • v3d_h264_deblock_h_luma + chroma variants
  • v3d_h264_reconstruct — trivial per-pixel add+clip
  • v3d_h264_yuv_to_rgba (optional Stage 5) — BT.601 / BT.709 / SMPTE-240M matrix selectable; limited vs full range selectable from VUI

That's a substantial shader inventory. Each requires bit-exact M1 gate against an FFmpeg reference, same methodology as daedalus-fourier cycles 1-9.


8. Phasing

Phase 1 — MVP, I-frames only, Baseline profile (4-6 weeks). No CABAC (CAVLC only). No MC. No B-frames. Validates the architecture: parse → entropy → frame-pipeline → NV12 out, all bit-exact for I-frames. Test against ffmpeg-generated I-frame-only H.264 streams.

Phase 2 — P-frames, single reference, Baseline profile (+4 weeks). Add MC stage and DPB.

Phase 3 — CABAC + Main/High profile + B-frames + multi-ref (+6 weeks). CABAC stays on CPU. Full DPB management.

Phase 4 — Production-ready deblock + perf optimization + libva integration (+4 weeks). Real-world stream conformance. Plug into daedalus-v4l2 daemon as the actual decode backend.

Total H.264 budget: 4-6 months.

Phase 5+ (future codec scope, not committed): VP9 and AV1 reuse the same frame-level dispatch architecture, daedalus-fourier kernel pack, and DPB plumbing. Per §9.7, they are deferred but not firmly out-of-scope. HEVC stays firmly out (Pi 5 has rpi-hevc-dec for that).


9. Phase 1 decisions

User-confirmed 2026-05-24. All seven questions from the initial draft are now decided; this section preserves the original wording of each item for traceability.

  1. Intra prediction strategy: GPU wavefront (~187 dispatches, more complex) vs CPU speculative (simpler, slower). Decision: wavefront in Phase 1; revisit if it's the perf bottleneck.

  2. libavcodec intercept granularity: macroblock-level (substitution-arc evolution) vs slice-level (cleaner rewrite). Decision: macroblock-level for Phase 1; consider slice-level later if buffer accumulation overhead is non-trivial.

  3. Shader parameterization: 16 qpel variants as 16 shaders, or one parameterized shader with switch on mc_position? Decision: measure both during Phase 2 (the MC phase) and pick the winner. No commit ahead of measurement.

  4. DPB allocation: Vulkan-native VkImage with dmabuf export, vs CPU-allocated dma_buf imported into Vulkan. Decision: Vulkan-native with VK_EXT_external_memory_dma_buf export; daedalus-v4l2 daemon imports.

  5. Daemon integration shape: static library the daemon links, or separate process. Decision: library link.

  6. Build dependency on daedalus-fourier: CMake find_package, or vendored? Decision: find_package, pinned to a tagged release. daedalus-fourier becomes the "kernel pack" upstream library.

  7. Codec scope. Decision: firmly out-of-scope for daedalus-decoder are HEVC (Pi 5 has rpi-hevc-dec for that), 10-bit, interlaced, and FMO/ASO. VP9 and AV1 are not firmly out — they're future codec scope for the same framework after H.264 lands. This is a scope expansion from the initial draft, which had grouped them with HEVC under "firmly out".


10. Success criteria

Phase 1 — architecture validation, not perf:

  • Bit-exact NV12 output against ffmpeg -i input.h264 -pix_fmt nv12 -f rawvideo - for at least one synthetic I-frame-only stream
  • Demonstrates one-submit-per-frame dispatch path (not one-dispatch-per-block)
  • Vulkan validation layer reports no errors
  • No regression in daedalus-fourier or daedalus-v4l2

Phase 4 — production target:

  • 1080p H.264 at ≥30 fps on Pi 5 (matches the 30fps floor user-facing requirement)
  • Daily YouTube playback via Firefox → libva → daedalus-v4l2 daemon → daedalus-decoder works durably
  • CPU consumption noticeably lower than the current libavcodec-NEON path (the "save electrons" goal)

If Phase 4 can't beat NEON-on-CPU, we acknowledge the V3D7's compute throughput isn't enough for H.264 1080p at 30fps and pivot back to NEON. The R-band methodology was clear that per-block QPU is RED on Pi 5; the bet here is that frame-level batched dispatch — the NVDEC shape — changes the ratio enough to flip the verdict.



Appendix A — daedalus-fourier shader reuse audit

Inventory of cycles 1-9 V3D shaders against the daedalus-decoder shader requirements. Done 2026-05-23 against post-PR-#8 main.

Directly reusable (just call at frame scale):

| daedalus-fourier shader | Cycle | Used in daedalus-decoder for | Notes |
|---|---|---|---|
| v3d_h264_idct4.comp | 6 | Stage 1 IDCT 4×4 | 16 blocks/WG; scales freely. Just call with n_blocks = N_MBs × 16 instead of n_blocks = 1. Can be replaced with a fused iquant+IDCT variant if measurement justifies it. |
| v3d_h264_idct8.comp | 7 | Stage 1 IDCT 8×8 | 8 blocks/WG; same reuse pattern. High-profile only. |

Partially reusable (existing shader is the template; need siblings):

| daedalus-fourier shader | Cycle | Used in daedalus-decoder for | Sibling shaders needed |
|---|---|---|---|
| v3d_h264deblock.comp | 8 | Stage 4 deblocking | Handles only vertical luma, non-intra (bS<4). Need: horizontal luma, vertical chroma, horizontal chroma, intra (bS=4) variants — 4 new files using this as template. |
| v3d_h264_qpel_mc20.comp | 9 | Stage 2b motion comp | Handles only mc20 8×8 "put". Need 15 more positional variants (mc00..mc33), 16×16 size variants, and "avg" variants for B-frame compound MC — order-of-magnitude more code but all from the same template. |

Not reusable (different codec or different math):

| daedalus-fourier shader | Cycle | Why not |
|---|---|---|
| v3d_idct8.comp | 1 | VP9 cospi-multiplier IDCT; H.264 uses integer butterfly |
| v3d_lpf_h_4_8.comp | 2 | VP9 loop filter, different from H.264 deblock |
| v3d_lpf_h_8_8.comp | 4 | Same |
| v3d_mc_8h.comp | 3 | VP9 8-tap subpel; H.264 qpel is 6-tap |
| v3d_cdef.comp | 5 | AV1 CDEF |

Brand-new shaders needed (not seeded by daedalus-fourier):

  • v3d_h264_iquant — inverse quantization (potentially fused into IDCT)
  • v3d_h264_intra_4x4 — 9 prediction modes
  • v3d_h264_intra_16x16 — 4 modes
  • v3d_h264_intra_chroma_8x8 — 4 modes
  • v3d_h264_chroma_mc — 1/8-pel chroma MC (bilinear)
  • v3d_h264_reconstruct — trivial per-pixel residual+predicted clip255
  • v3d_h264_yuv_to_rgba (Stage 5, optional)

Summary: ~22 H.264-related shaders total; 2 directly reusable, 2 partial-reuse (5+15 = ~20 sibling variants), 7 brand-new. All small (100-250 lines GLSL each). Each carries its own M1 bit-exact gate against an FFmpeg reference — same methodology as daedalus-fourier cycles 1-9. Estimate 1-3 days per shader including the gate methodology; total shader work ~6-10 weeks if done in order.


Appendix B — libavcodec intercept point

Located in libavcodec/h264_slice.c:decode_slice (FFmpeg 8.1 Kwiboo pin, function starts at line 2598). The relevant loop:

for (;;) {
    ret = ff_h264_decode_mb_cabac(h, sl);   /* or _cavlc */
    if (ret >= 0)
        ff_h264_hl_decode_mb(h, sl);        /* ← pixel-level decode */
    ...
    eos = get_cabac_terminate(&sl->cabac);
    if (sl->cabac.bytestream > sl->cabac.bytestream_end + 2)
        break;
    if (eos || ...) break;
}

After ff_h264_decode_mb_cabac returns, sl and h are populated with everything we need:

  • sl->mb_x, sl->mb_y — current macroblock coordinates
  • sl->mb[] — coefficient buffer (residual coefficients, post-quant)
  • sl->mv_cache[], sl->ref_cache[] — motion vectors, ref indices
  • h->cur_pic.mb_type[] — mb_type field
  • sl->intra4x4_pred_mode_cache[] — 9-mode intra4x4 predictions
  • h->cur_pic.qscale_table[] — per-MB QP
  • sl->non_zero_count_cache[] — non-zero coefficient count per 4×4 block

ff_h264_hl_decode_mb is the function that calls the pixel-level kernels (IDCT, intra prediction, MC, deblock). This is where the daedalus-fourier substitution arc plugged in via H264DSPContext function pointers (PR #6 / #76 et seq).

The daedalus-decoder intercept: replace ff_h264_hl_decode_mb with a stub that does no pixel work. Instead:

static inline void daedalus_decoder_append_mb(H264Context *h,
                                               H264SliceContext *sl)
{
    /* Snapshot the populated sl/h fields into the frame-shaped
     * mb_descriptors[] buffer indexed by (mb_y * mb_width + mb_x).
     * Snapshot coeffs[] into the frame-shaped coeffs[] SSBO at the
     * matching offset.  No GPU dispatch yet — just memory writes. */
    ...
}

End-of-slice (or end-of-frame, depending on slice grouping) calls daedalus_decoder_flush_frame() which builds the VkCommandBuffer with all 4-5 pipeline stages and submits once.
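The append-then-flush flow can be modelled without any libavcodec or Vulkan types. In the sketch below everything is a hypothetical stand-in: `struct mb_record` is a trimmed placeholder for the real descriptor, and incrementing `submits` stands in for building the command buffer and issuing the single queue submit.

```c
#include <assert.h>
#include <stdint.h>

struct mb_record {            /* hypothetical, trimmed descriptor */
    uint8_t mb_type;
    uint8_t qp;
};

struct frame_accum {
    struct mb_record *mbs;    /* caller-owned, mb_width * mb_height entries */
    uint32_t mb_width, mb_height;
    uint32_t appended;        /* MBs recorded since the last flush */
    uint32_t submits;         /* number of (simulated) GPU submits */
};

/* Record one macroblock: a plain memory write, no GPU work. */
static void accum_append_mb(struct frame_accum *fa,
                            uint32_t mb_x, uint32_t mb_y,
                            struct mb_record rec)
{
    assert(mb_x < fa->mb_width && mb_y < fa->mb_height);
    fa->mbs[mb_y * fa->mb_width + mb_x] = rec;
    fa->appended++;
}

/* Flush: one submit for everything recorded so far.
 * Returns the number of macroblocks covered by that submit. */
static uint32_t accum_flush_frame(struct frame_accum *fa)
{
    uint32_t n = fa->appended;
    if (n > 0)
        fa->submits++;        /* stand-in for build + vkQueueSubmit */
    fa->appended = 0;
    return n;
}
```

The property worth asserting in an integration test is the one this model makes explicit: after a full 1080p frame of appends and one flush, the submit counter reads 1, not 8160.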

Implementation shape: this is a new FFmpeg patch in marfrit-packages, sibling to the existing 0003-0007 substitution patches. Tentatively 0008-h264-daedalus-decoder-frame-pipeline.patch:

libavcodec/h264_slice.c:
  - decode_slice() per-MB loop: replace
    `ff_h264_hl_decode_mb(h, sl)` with
    `if (daedalus_decoder_active(h)) daedalus_decoder_append_mb(h, sl);
     else ff_h264_hl_decode_mb(h, sl);`
  - After the loop, if (daedalus_decoder_active(h))
    daedalus_decoder_flush_frame(h);

libavcodec/h264_decoder.c (or new daedalus_decoder_intercept.c):
  - daedalus_decoder_active(h) — checks env var DAEDALUS_DECODER=1
    or a per-codec flag
  - daedalus_decoder_append_mb / flush_frame — call into libdaedalus_decoder.so
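Since the activation check sits inside the per-MB loop, it should not call getenv on every macroblock. One way to write the environment-variable half of the check is sketched below; the helper names are hypothetical, the per-codec flag is left out, and the one-time cache is not thread-safe as written, so a real patch would more likely read the variable once at decoder init.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Pure helper: is the variable's value exactly "1"? */
static int flag_enabled(const char *value)
{
    return value != NULL && strcmp(value, "1") == 0;
}

/* Reads DAEDALUS_DECODER once and caches the answer. */
static int daedalus_decoder_env_active(void)
{
    static int cached = -1;          /* -1 = not read yet */
    if (cached < 0)
        cached = flag_enabled(getenv("DAEDALUS_DECODER"));
    assert(cached == 0 || cached == 1);
    return cached;
}
```

Keeping the string comparison in a separate pure function makes the default-off behaviour easy to test: an unset variable, "0", and an empty string all leave the original ff_h264_hl_decode_mb path in place.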

Crucially, CABAC and CAVLC stay in libavcodec — they're the fastest open-source CABAC implementations and the structure they produce in sl->mb_cache already has the right shape. We don't re-implement entropy decoding; we just intercept after it runs and feed the GPU pipeline from there.

Backwards compatibility: when daedalus_decoder_active(h) returns false (default), the patch is a no-op — ff_h264_hl_decode_mb runs as before. Coexists with the kernel-pack substitution arc without conflict.


Appendix C — risk register

| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| Intra prediction wavefront serialization dominates wall time | Medium | Phase 1 misses 30 fps target | Speculative CPU intra fallback; profile early |
| qpel shader explosion (16 variants × put/avg × 8/16 size) maintenance burden | Medium | Phase 2 schedule slips | Single parameterized shader with switch on (mc_pos, size, op) — V3D compiler should inline |
| H.264 VUI matrix selection error in Stage 5 | Low | Wrong colours in HD content | Default BT.709 limited-range; load test against ffmpeg -filter_complex eq=... reference |
| Mesa V3DV concurrency bug under heavy compute dispatch | Low | Sporadic GPU hang | Single-thread the libavcodec H.264 decode side (we're already running it in deferred-write mode; the dispatch is serial by construction) |
| daedalus-fourier shader changes break daedalus-decoder | Medium | Phase 1 dev velocity drops | Pin daedalus-fourier to a tagged release; bump deliberately |
| Phase 1 succeeds but Phase 4 perf misses 30 fps@1080p | Medium-High | Project fails to deliver user-facing improvement | Acknowledged from project start; explicit success-criteria language in §10 covers the pivot |