# daedalus-decoder — design
**Phase:** scoping / first-draft design. No code written. Date: 2026-05-23.
This document captures the architecture for an NVDEC-shaped H.264 decoder targeting Raspberry Pi 5 (V3D7). It exists because the [daedalus-fourier substitution prototype](https://git.reauktion.de/marfrit/daedalus-fourier) — a per-block kernel pack called synchronously from inside libavcodec's H.264 decoder — measured at >600× slower than NEON in production due to per-call Vulkan synchronization overhead. The R-band methodology correctly predicted this; the "QPU is default substrate" decree (2026-05-23) overrode it for prototype purposes and confirmed the architectural mismatch. This document is the structural redesign that addresses it.
---
## 1. The shape we need
NVDEC and Vulkan Video both target dedicated decode hardware via a **picture-level command**: app submits a bitstream, hardware decodes a whole frame, app gets an NV12 surface. One submit per frame, one fence wait per frame, one DMA.
We don't have dedicated H.264 silicon on Pi 5 (the chip has a separate HEVC block — `rpi-hevc-dec` — but no H.264 decoder). We do have V3D7 Vulkan compute. The goal is to build the *same shape* as NVDEC using V3D7 compute as the substrate.
```
┌──────────────────────────────────────────────────────────────┐
│ CPU side │
│ ┌───────────────────────────────────────────────────────┐ │
│ │ libavcodec H.264 parser (existing) │ │
│ │ - NAL split + emulation prevention removal │ │
│ │ - SPS / PPS │ │
│ │ - Slice header decode │ │
│ │ - CABAC / CAVLC entropy decode │ │
│ │ - Per-MB: mb_type, qp, partition info, │ │
│ │ transform coefficients, MVs, intra modes │ │
│ └─────────────────────┬─────────────────────────────────┘ │
│ ↓ (write into frame-shaped SSBO) │
│ │ │
└────────────────────────┼───────────────────────────────────────┘
↓ Vulkan compute queue submit
┌────────────────────────┼───────────────────────────────────────┐
│ GPU (V3D7) side ↓ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Stage 1: inverse quant + IDCT 4x4 / 8x8 │ │
│ │ in: coeffs[N_MBs × 384] │ │
│ │ out: residual[N_MBs × 384] u8/i16 │ │
│ └────────────────────────┬────────────────────────────┘ │
│ ↓ vkCmdPipelineBarrier │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Stage 2: prediction (intra + MC) │ │
│ │ in: mb_descriptors, residual, │ │
│ │ DPB reference frames │ │
│ │ out: predicted[N_MBs × 384] u8 │ │
│ │ note: intra dispatched in wavefront order; │ │
│ │ MC fully parallel │ │
│ └────────────────────────┬────────────────────────────┘ │
│ ↓ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Stage 3: reconstruct = clip255(residual + predicted) │ │
│ │ out: reconstructed Y, U, V planes │ │
│ └────────────────────────┬────────────────────────────┘ │
│ ↓ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Stage 4: deblocking filter at MB edges │ │
│ │ - V-edges first (rows independent) │ │
│ │ - H-edges next (columns independent) │ │
│ │ out: NV12 in DPB slot + dma-buf-exported │ │
│ └────────────────────────┬────────────────────────────┘ │
└────────────────────────────┼───────────────────────────────────┘
↓ vkWaitFences (one wait, one frame)
NV12 ready in V4L2 CAPTURE plane
```
**One Vulkan submit per frame.** The per-call sync cost that killed the kernel-substitution prototype goes away because the dispatch granularity is the right size: thousands of macroblocks of work amortize the ~50 µs submit/wait roundtrip.
---
## 2. CPU/GPU split
The CPU keeps everything that is **inherently serial**. H.264's entropy decode (CABAC, and CAVLC for Baseline-equivalent paths) is bit-by-bit serial — each binary decision updates context state that the next decision reads. NVDEC has a dedicated CABAC ASIC; doing CABAC on a SIMD compute architecture like V3D is a research project of its own and not in scope.
The CPU also keeps:
- NAL-unit boundary scanning + emulation-prevention byte removal
- SPS/PPS storage and lookup
- Slice header decoding
- Reference picture list construction (DPB management at the index level)
- Display-order reorder (B-frame POC machinery)
The CPU **produces** a frame-shaped descriptor that the GPU consumes.
The GPU runs **everything that is massively parallel per macroblock**:
| Stage | What it does | Parallelism |
|---|---|---|
| 1. Inverse quant + IDCT | Multiply coeffs by Q scale, run 4×4 / 8×8 inverse transform | Fully parallel (per block) |
| 2a. Intra prediction | Predict from neighbouring decoded pixels (top, top-right, left) | Wavefront — diagonal of MBs |
| 2b. Motion compensation | qpel filter on reference pixels | Fully parallel |
| 3. Reconstruct | residual + predicted, clip255 | Fully parallel (per pixel) |
| 4. Deblocking | Filter MB edges (alpha/beta/tc0 table) | Two-pass: V-edges (rows ∥), H-edges (cols ∥) |
| 5. NV12 → RGBA (optional) | YUV→RGB matrix conversion via H.264 VUI colourspace | Fully parallel (per pixel) |
---
## 3. Per-MB descriptor (CPU → GPU)
Frame-shaped SSBO: `mb_descriptors[N_MBs]`, where N_MBs = ⌈W/16⌉ × ⌈H/16⌉ (8160 for 1080p).
```c
#include <stdint.h>

struct mb_descriptor {
    uint8_t mb_type;              /* I_NxN, I_16x16, P_..., B_..., I_PCM, ... */
    uint8_t mb_qp_y;
    uint8_t mb_qp_uv;
    uint8_t cbp;                  /* coded block pattern */
    uint8_t intra_4x4_modes[16];  /* per 4x4 sub-block, when I_NxN */
    uint8_t intra_16x16_mode;     /* when I_16x16 */
    uint8_t intra_chroma_mode;
    uint8_t partition_mode;       /* P_16x16 / P_16x8 / P_8x16 / P_8x8 / ... */
    uint8_t ref_idx_l0[4];        /* up to 4 partitions */
    uint8_t ref_idx_l1[4];        /* B only */
    int16_t mv_l0[4][2];          /* qpel-precision (1/4 sample) */
    int16_t mv_l1[4][2];
    uint8_t deblock_disable;
    int8_t  deblock_alpha_c0;
    int8_t  deblock_beta;
};
```
Coefficient buffer separate, also frame-shaped:
```c
int16_t coeffs[N_MBs * 384]; /* 256 luma + 64 cb + 64 cr per MB */
```
Sized for the worst case (every block has nonzero coefficients). Most MBs have sparse coefficients, but we don't compact: the GPU shader iterates over every slot, and skipping zeros is cheap.
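The sizing above can be checked with a few lines of C. This is a minimal sketch: the helper names `mb_count` and `coeff_buffer_bytes` are hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define COEFFS_PER_MB 384u /* 256 luma + 64 Cb + 64 Cr, as in the text */

/* Number of 16x16 macroblocks covering a width x height frame. */
static size_t mb_count(unsigned width, unsigned height)
{
    assert(width > 0 && height > 0);
    size_t mb_w = (width + 15u) / 16u;  /* ceil(W / 16) */
    size_t mb_h = (height + 15u) / 16u; /* ceil(H / 16) */
    return mb_w * mb_h;
}

/* Bytes needed for the uncompacted int16_t coefficient buffer. */
static size_t coeff_buffer_bytes(unsigned width, unsigned height)
{
    return mb_count(width, height) * COEFFS_PER_MB * sizeof(int16_t);
}
```

For 1920×1080 this gives 8160 macroblocks and 6,266,880 bytes (about 6 MiB) of coefficients per frame.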
---
## 4. Pipeline stage dispatch shape
All stages live in **one VkCommandBuffer per frame**. Inter-stage dependencies use `vkCmdPipelineBarrier` with `VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT` as both source and destination stage, and a memory barrier from `VK_ACCESS_SHADER_WRITE_BIT` to `VK_ACCESS_SHADER_READ_BIT` on the relevant SSBO.
**Stage 1 (IDCT)** — one workgroup per MB; many MBs per dispatch. Reuses daedalus-fourier's IDCT 4×4 / 8×8 shader logic but at a different scale: a single dispatch covers all N_MBs macroblocks, not one block.
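As a CPU-side reference for what one Stage 1 invocation computes per 4×4 block, here is a minimal C sketch of the standard H.264 4×4 inverse integer transform (rows, then columns, then rounding by 6 bits). It is not taken from the daedalus-fourier shader; the function names are hypothetical, the block is assumed row-major and already dequantized, and right shifts of negative values are assumed to be arithmetic, as on GCC and Clang.

```c
#include <assert.h>
#include <stdint.h>

/* One 1-D pass of the H.264 4-point inverse integer transform.
 * stride is 1 when walking a row and 4 when walking a column. */
static void idct4_1d(int *b, int stride)
{
    int a0 = b[0], a1 = b[stride], a2 = b[2 * stride], a3 = b[3 * stride];
    int z0 = a0 + a2;
    int z1 = a0 - a2;
    int z2 = (a1 >> 1) - a3;
    int z3 = a1 + (a3 >> 1);
    b[0]          = z0 + z3;
    b[stride]     = z1 + z2;
    b[2 * stride] = z1 - z2;
    b[3 * stride] = z0 - z3;
}

/* Full 4x4 inverse transform: 16 dequantized coefficients in,
 * 16 residual samples out (both row-major). */
static void idct4x4(const int16_t coeffs[16], int16_t residual[16])
{
    int tmp[16];
    assert(coeffs && residual);
    for (int i = 0; i < 16; i++)
        tmp[i] = coeffs[i];
    for (int r = 0; r < 4; r++)
        idct4_1d(tmp + 4 * r, 1);   /* horizontal pass */
    for (int c = 0; c < 4; c++)
        idct4_1d(tmp + c, 4);       /* vertical pass */
    for (int i = 0; i < 16; i++)
        residual[i] = (int16_t)((tmp[i] + 32) >> 6);
}
```

A DC-only block with coefficient 64 produces a flat residual of 1, which makes a convenient first check for a bit-exact gate.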
**Stage 2a (intra)** — wavefront, issued as one dispatch per diagonal with a barrier between consecutive diagonals. For a 1080p frame (120×68 MB grid), there are 120+68−1 = 187 diagonals; the longest has 68 MBs. Each diagonal's MBs are independent of each other, so the whole wavefront is 187 dispatches in the same command buffer (still ~225k× cheaper than per-block).
Alternative: speculative CPU intra prediction, GPU only does inter blocks. Simpler, slower. Decision deferred.
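The diagonal arithmetic for Stage 2a can be reproduced with a small C sketch. All names here are hypothetical. The `skew` parameter is an assumption of this sketch, not something the design specifies: `skew = 1` gives the plain anti-diagonals counted in the text, while `skew = 2` is the stricter ordering that would be needed if an MB must also wait for its top-right neighbour.

```c
#include <assert.h>

/* Wave index of macroblock (x, y): MBs sharing an index go in one dispatch. */
static int wave_index(int x, int y, int skew)
{
    return x + skew * y;
}

/* Number of waves (dispatches) needed to cover an mb_w x mb_h grid. */
static int wave_count(int mb_w, int mb_h, int skew)
{
    assert(mb_w > 0 && mb_h > 0 && skew > 0);
    return wave_index(mb_w - 1, mb_h - 1, skew) + 1;
}

/* Number of macroblocks that belong to wave d. */
static int wave_len(int mb_w, int mb_h, int skew, int d)
{
    int n = 0;
    for (int y = 0; y < mb_h; y++) {
        int x = d - skew * y;
        if (x >= 0 && x < mb_w)
            n++;
    }
    return n;
}

/* Size of the largest wave, i.e. the peak parallelism per dispatch. */
static int longest_wave(int mb_w, int mb_h, int skew)
{
    int best = 0;
    int waves = wave_count(mb_w, mb_h, skew);
    for (int d = 0; d < waves; d++) {
        int n = wave_len(mb_w, mb_h, skew, d);
        if (n > best)
            best = n;
    }
    return best;
}
```

For the 120×68 grid, `skew = 1` yields 187 waves with at most 68 MBs each, matching the figures above; `skew = 2` would yield 254 waves with at most 60 MBs each.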
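**Stage 2b (MC)** — single dispatch over all inter-MBs. Reads reference frames from DPB. The 6-tap qpel filter reads a source window larger than the output block: 13×8 samples per 8×8 output block for horizontal-only positions such as mc20, and up to 13×13 for positions that filter in both directions. daedalus-fourier's existing `v3d_h264_qpel_mc20` covers mc20; we need the other 15 variants (mc00/01/02/.../33) either as separate shaders or as one parameterized shader.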
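For reference, the mc20 position is the horizontal half-sample case, which applies the H.264 6-tap filter (1, −5, 20, 20, −5, 1) with rounding and a 5-bit shift. The following C sketch is a scalar model of that one position, not the daedalus-fourier shader; the function names are hypothetical. It also shows why a horizontal pass needs 5 extra source columns (13 for an 8-wide block).

```c
#include <assert.h>
#include <stdint.h>

/* Round, shift right by 5 and clamp to 8 bits. Negative sums are clamped
 * before shifting so the result does not depend on how the compiler
 * shifts negative integers. */
static uint8_t round_shift5_clip(int v)
{
    v += 16;
    if (v < 0)
        return 0;
    v >>= 5;
    return v > 255 ? 255 : (uint8_t)v;
}

/* Horizontal half-sample ("mc20") luma interpolation for a w x h block.
 * src points at the block's top-left integer sample. The filter reads
 * src[x-2] .. src[x+3], so the caller must provide 2 columns of margin
 * on the left and 3 on the right. */
static void qpel_mc20(uint8_t *dst, int dst_stride,
                      const uint8_t *src, int src_stride, int w, int h)
{
    assert(dst && src && w > 0 && h > 0);
    for (int y = 0; y < h; y++) {
        for (int x = 0; x < w; x++) {
            const uint8_t *s = src + y * src_stride + x;
            int v = s[-2] - 5 * s[-1] + 20 * s[0] + 20 * s[1]
                  - 5 * s[2] + s[3];
            dst[y * dst_stride + x] = round_shift5_clip(v);
        }
    }
}
```

The taps sum to 32, so a flat source region passes through unchanged, which is a cheap sanity check before running the bit-exact comparison against FFmpeg.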
**Stage 3 (reconstruct)** — trivial per-pixel `clip255(residual + predicted)`. One dispatch over all pixels.
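A scalar C model of the Stage 3 operation, useful as the reference side of a bit-exact check. This is a sketch with a hypothetical `reconstruct` helper; it assumes 8-bit prediction samples and 16-bit signed residuals in matching order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Clamp an integer to the 8-bit sample range. */
static uint8_t clip255(int v)
{
    return v < 0 ? 0 : (v > 255 ? 255 : (uint8_t)v);
}

/* out[i] = clip255(predicted[i] + residual[i]) for n samples. */
static void reconstruct(uint8_t *out, const uint8_t *predicted,
                        const int16_t *residual, size_t n)
{
    assert(out && predicted && residual);
    for (size_t i = 0; i < n; i++)
        out[i] = clip255((int)predicted[i] + residual[i]);
}
```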
**Stage 4 (deblock)** — two dispatches: vertical edges (all rows independent), horizontal edges (all columns independent). Reuses daedalus-fourier's `v3d_h264deblock` shader, adapted for the new buffer layout.
**Stage 5 (optional, NV12 → RGBA conversion)** — a single per-pixel
compute dispatch that converts the just-deblocked NV12 to packed
RGBA8. Roughly:
```glsl
// BT.709 limited-range example; matrix selected per H.264 VUI.
int Y = int(luma[...]) - 16;
int CB = int(chroma_uv[...]) - 128;
int CR = int(chroma_uv[... + 1]) - 128;
int R = clip255((298*Y + 459*CR + 128) >> 8);
int G = clip255((298*Y - 55*CB - 136*CR + 128) >> 8);
int B = clip255((298*Y + 541*CB + 128) >> 8);
out_rgba[r*W + c] = pack32(R, G, B, 255);
```
Off by default. Most consumers want NV12 because:
- V4L2 stateless decoder output is canonically NV12; the daedalus-v4l2
CAPTURE plane is already plumbed for NV12 dmabuf.
- Wayland compositors (sway, mutter, KWin, weston) advertise NV12 via
`zwp_linux_dmabuf_v1` and convert during composition — essentially
free because it's one texture sample with a swizzle in the
composite fragment shader.
- Firefox / mpv VAAPI paths use NV12 dmabuf zero-copy to the
compositor (cf. project memory `pi5_h264_hwaccel_landed`).
- RGBA8 is about 2.7× the bandwidth of NV12 (roughly 8.3 MB/frame vs
3.1 MB/frame at 1080p), so converting in the decoder when no consumer
needs it burns DMA bandwidth and electrons.
Opt in via a decoder config flag for consumers that genuinely need
RGBA pre-compositor (SHM-buffer-only clients, future X11/headless
paths, GL contexts without YUV sampler extension support).
Colourspace metadata MUST be plumbed through when Stage 5 is on:
the H.264 SPS's `vui_parameters` carry `colour_primaries`,
`transfer_characteristics`, `matrix_coefficients`, and `video_full_range_flag`.
Picking the wrong matrix (BT.601 vs BT.709) or range silently mis-
saturates skin tones and shadows. Default to the BT.709 limited-range
combination when VUI is absent (the dominant case for HD content);
log a warning in that case.
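The matrix selection and fallback described above could look roughly like the following C sketch. The struct and function names are hypothetical. The coefficients are the usual ×256 fixed-point limited-range values for BT.709 and BT.601; the `matrix_coefficients` codes 1, 5 and 6 follow the H.264 VUI table. Full-range input and the SMPTE 240M matrix are left out of this sketch, as is the warning log.

```c
#include <assert.h>
#include <stdint.h>

/* Fixed-point (x256) limited-range YCbCr -> RGB coefficients. */
struct yuv2rgb {
    int y;      /* luma scale */
    int r_cr;   /* Cr contribution to R */
    int g_cb;   /* Cb contribution to G (subtracted) */
    int g_cr;   /* Cr contribution to G (subtracted) */
    int b_cb;   /* Cb contribution to B */
};

static const struct yuv2rgb BT709_LIMITED = { 298, 459, 55, 136, 541 };
static const struct yuv2rgb BT601_LIMITED = { 298, 409, 100, 208, 516 };

/* matrix_coefficients: 1 = BT.709, 5 = BT.470BG, 6 = SMPTE 170M (the last
 * two share the BT.601 matrix). Anything else, or no VUI at all, falls
 * back to BT.709 as the text proposes. */
static const struct yuv2rgb *select_matrix(int vui_present,
                                           int matrix_coefficients)
{
    if (vui_present && (matrix_coefficients == 5 || matrix_coefficients == 6))
        return &BT601_LIMITED;
    return &BT709_LIMITED;
}

/* Shift right by 8 and clamp to 8 bits; negatives clamp to 0 first. */
static uint8_t shift8_clip(int v)
{
    if (v < 0)
        return 0;
    v >>= 8;
    return v > 255 ? 255 : (uint8_t)v;
}

/* Convert one limited-range YCbCr sample triple to 8-bit RGB. */
static void yuv_to_rgb(const struct yuv2rgb *m, int y, int cb, int cr,
                       uint8_t rgb[3])
{
    assert(m && rgb);
    int yy = m->y * (y - 16) + 128;   /* scaled luma plus rounding term */
    cb -= 128;
    cr -= 128;
    rgb[0] = shift8_clip(yy + m->r_cr * cr);
    rgb[1] = shift8_clip(yy - m->g_cb * cb - m->g_cr * cr);
    rgb[2] = shift8_clip(yy + m->b_cb * cb);
}
```

Limited-range black (16, 128, 128) and white (235, 128, 128) map to (0, 0, 0) and (255, 255, 255) under either matrix, so a saturated colour patch is the more useful test for catching a wrong matrix.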
Total dispatches per frame: ~190-200 with the intra wavefront (187
diagonal dispatches plus a handful for Stages 1, 2b, 3, and 4), not
counting Stage 5; +1 with Stage 5. All in one command buffer, one
submit, one fence wait.
---
## 5. Reference frame management
DPB = N reference frames in GPU memory, each a VkImage with NV12 format, exported as dma_buf. Up to 16 reference frames per H.264 spec; typical streams use 1-4.
CPU tracks the index mapping (refIdx0/refIdx1 in MV lists → DPB slot). GPU shader receives DPB slot indices via the mb_descriptor; reads pixels via VkImage descriptor binding.
At frame-decode end, the output VkImage may become a new reference (per `ref_pic_marking` decisions on CPU). Slot retirement is CPU-managed.
This is the **right shape for V4L2 CAPTURE integration** — daedalus-v4l2 today uses dma_buf-exported CAPTURE planes; reusing those as DPB entries means zero-copy from decode → V4L2 consumer.
---
## 6. libavcodec integration
The honest hard problem. libavcodec's H.264 decoder calls thousands of per-MB / per-block functions in sequence; we need to replace that with a "collect everything into the descriptor buffer, then dispatch one pipeline" flow.
**Intercept point candidates:**
- **Slice level** (`ff_h264_decode_slice` or `decode_slice`). Replace the per-MB loop entirely. Highest leverage but biggest rewrite.
- **Macroblock level** (`ff_h264_decode_mb_*`). Each call returns to libavcodec which calls pixel kernels. Replace the kernels with descriptor-write stubs, dispatch at slice end. Closer to the substitution arc; less rewrite.
The macroblock-level intercept is the natural evolution of the substitution arc. The current `daedalus_recipe_dispatch_h264_*` calls become **append-to-buffer** instead of **dispatch-to-GPU**, and an end-of-slice flush triggers the actual Vulkan submit.
This is the same shape as **deferred rendering** in GL/Vulkan — record commands, submit in bulk.
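The append-then-flush pattern can be modelled on the CPU without any Vulkan code. The sketch below is an illustration only: `struct frame_batch`, `batch_append_mb` and `batch_flush` are hypothetical names, and the `submits` counter stands in for the real queue submit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define COEFFS_PER_MB 384

/* Hypothetical CPU-side frame accumulator; not an existing API. */
struct frame_batch {
    int      mb_width, mb_height;
    int16_t *coeffs;    /* mb_width * mb_height * COEFFS_PER_MB entries */
    size_t   appended;  /* MBs written since the last flush */
    size_t   submits;   /* stand-in for the number of queue submits */
};

/* Append path: plain memory writes at the MB's frame position, no GPU work. */
static void batch_append_mb(struct frame_batch *b, int mb_x, int mb_y,
                            const int16_t mb_coeffs[COEFFS_PER_MB])
{
    assert(b && mb_coeffs);
    assert(mb_x >= 0 && mb_x < b->mb_width);
    assert(mb_y >= 0 && mb_y < b->mb_height);
    size_t idx = (size_t)mb_y * (size_t)b->mb_width + (size_t)mb_x;
    memcpy(b->coeffs + idx * COEFFS_PER_MB, mb_coeffs,
           COEFFS_PER_MB * sizeof(int16_t));
    b->appended++;
}

/* Flush path: one submit covers everything appended so far. */
static void batch_flush(struct frame_batch *b)
{
    assert(b);
    if (b->appended == 0)
        return;         /* nothing recorded, nothing to submit */
    b->submits++;       /* real code would record and submit here */
    b->appended = 0;
}
```

However many macroblocks are appended, the submit counter advances once per flush, which is the property the per-block substitution prototype lacked.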
---
## 7. What we reuse from daedalus-fourier
Existing V3D shaders (post-PR-#7+#8 state):
| Shader | Reuse status |
|---|---|
| `v3d_h264_idct4` | Yes — adapt input/output buffer layout |
| `v3d_h264_idct8` | Yes — same |
| `v3d_h264deblock` | Yes — adapt; the existing shader handles vertical-luma; chroma + horizontal variants are new |
| `v3d_h264_qpel_mc20` | Partial — only mc20. All 16 qpel positions needed for MC. |
| `v3d_idct8` (VP9) | Not for H.264 |
| `v3d_lpf_h_4_8`, `v3d_lpf_h_8_8` (VP9) | Not for H.264 |
| `v3d_mc_8h` (VP9 8-tap) | Not for H.264 (6-tap) |
| `v3d_cdef` (AV1) | Not for H.264 |
New shaders needed:
- `v3d_h264_iquant` — inverse quantization (can be fused with IDCT into one stage)
- `v3d_h264_intra_4x4` — 9 prediction modes
- `v3d_h264_intra_16x16` — 4 modes
- `v3d_h264_intra_chroma_8x8` — 4 modes
- `v3d_h264_qpel_{mc00..mc33}` — 15 missing variants (or one parameterized shader)
- `v3d_h264_chroma_mc` — 1/8-pel chroma MC (bilinear)
- `v3d_h264_deblock_h_luma` + chroma variants
- `v3d_h264_reconstruct` — trivial per-pixel add+clip
- `v3d_h264_yuv_to_rgba` (optional Stage 5) — BT.601 / BT.709 / SMPTE-240M
matrix selectable; limited vs full range selectable from VUI
That's a substantial shader inventory. Each shader requires a bit-exact M1 gate against an FFmpeg reference, the same methodology as daedalus-fourier cycles 1-9.
---
## 8. Phasing
**Phase 1 — MVP, I-frames only, Baseline profile** (4-6 weeks). No CABAC (CAVLC only). No MC. No B-frames. Validates the architecture: parse → entropy → frame-pipeline → NV12 out, all bit-exact for I-frames. Test against ffmpeg-generated I-frame-only H.264 streams.
**Phase 2 — P-frames, single reference, Baseline profile** (+4 weeks). Add MC stage and DPB.
**Phase 3 — CABAC + Main/High profile + B-frames + multi-ref** (+6 weeks). CABAC stays on CPU. Full DPB management.
**Phase 4 — Production-ready deblock + perf optimization + libva integration** (+4 weeks). Real-world stream conformance. Plug into daedalus-v4l2 daemon as the actual decode backend.
**Total H.264 budget:** 4-6 months.
**Phase 5+ (future codec scope, not committed):** VP9 and AV1 reuse the same frame-level dispatch architecture, daedalus-fourier kernel pack, and DPB plumbing. Per §9.7, they are deferred but *not firmly out-of-scope*. HEVC stays firmly out (Pi 5 has `rpi-hevc-dec` for that).
---
## 9. Phase 1 decisions
User-confirmed 2026-05-24. All seven questions from the initial
draft are now decided; this section preserves the original wording
of each item for traceability.
1. **Intra prediction strategy:** GPU wavefront (~187 dispatches, more complex) vs CPU speculative (simpler, slower). **Decision: wavefront in Phase 1; revisit if it's the perf bottleneck.**
2. **libavcodec intercept granularity:** macroblock-level (substitution-arc evolution) vs slice-level (cleaner rewrite). **Decision: macroblock-level for Phase 1; consider slice-level later if buffer accumulation overhead is non-trivial.**
3. **Shader parameterization:** 16 qpel variants as 16 shaders, or one parameterized shader with switch on mc_position? **Decision: measure both during Phase 2 (the MC phase) and pick the winner. No commit ahead of measurement.**
4. **DPB allocation:** Vulkan-native VkImage with dmabuf export, vs CPU-allocated dma_buf imported into Vulkan. **Decision: Vulkan-native with `VK_KHR_external_memory_dma_buf` export; daedalus-v4l2 daemon imports.**
5. **Daemon integration shape:** static library the daemon links, or separate process. **Decision: library link.**
6. **Build dependency on daedalus-fourier:** CMake `find_package`, or vendored? **Decision: `find_package`, pinned to a tagged release. daedalus-fourier becomes the "kernel pack" upstream library.**
7. **Codec scope.** **Decision: firmly out-of-scope for daedalus-decoder are HEVC (Pi 5 has `rpi-hevc-dec` for that), 10-bit, interlaced, and FMO/ASO.** VP9 and AV1 are *not* firmly out — they're future codec scope for the same framework after H.264 lands. This is a scope expansion from the initial draft, which had grouped them with HEVC under "firmly out".
---
## 10. Success criteria
Phase 1 — **architecture validation, not perf**:
- Bit-exact NV12 output against `ffmpeg -i input.h264 -f rawvideo -` for at least one synthetic I-frame-only stream
- Demonstrates one-submit-per-frame dispatch path (not one-dispatch-per-block)
- Vulkan validation layer reports no errors
- No regression in daedalus-fourier or daedalus-v4l2
Phase 4 — **production target**:
- 1080p H.264 at ≥30 fps on Pi 5 (matches the [30fps floor](https://git.reauktion.de/marfrit/daedalus-fourier/src/branch/main/CLAUDE-CODE-MEMORY/project_30fps_floor_is_fine.md) user-facing requirement)
- Daily YouTube playback via Firefox → libva → daedalus-v4l2 daemon → daedalus-decoder works durably
- CPU consumption noticeably lower than the current libavcodec-NEON path (the "save electrons" goal)
If Phase 4 can't beat NEON-on-CPU, we acknowledge the V3D7's compute throughput isn't enough for H.264 1080p at 30fps and pivot back to NEON. The R-band methodology was clear that per-block QPU is RED on Pi 5; the bet here is that frame-level batched dispatch — the NVDEC shape — changes the ratio enough to flip the verdict.
---
## 11. References
- [Khronos: An Introduction to Vulkan Video](https://www.khronos.org/blog/an-introduction-to-vulkan-video)
- [NVDEC Video Decoder API Programming Guide](https://docs.nvidia.com/video-technologies/video-codec-sdk/13.0/nvdec-video-decoder-api-prog-guide/)
- [VK_KHR_video_decode_h264 proposal](https://docs.vulkan.org/features/latest/features/proposals/VK_KHR_video_decode_h264.html)
- [Mesa V3D driver docs](https://docs.mesa3d.org/drivers/v3d.html)
- [daedalus-fourier substitution arc](https://git.reauktion.de/marfrit/daedalus-fourier) — cycles 6-9; the kernel pack we are no longer using at the wrong granularity
- ITU-T H.264 spec (Rec. H.264 (08/2021)) — clauses 7 and 9 for syntax and parsing; clause 8 for the decoding process
---
## Appendix A — daedalus-fourier shader reuse audit
Inventory of cycles 1-9 V3D shaders against the daedalus-decoder
shader requirements. Done 2026-05-23 against post-PR-#8 main.
**Directly reusable (just call at frame scale):**
| daedalus-fourier shader | Cycle | Used in daedalus-decoder for | Notes |
|---|---|---|---|
| `v3d_h264_idct4.comp` | 6 | Stage 1 IDCT 4×4 | 16 blocks/WG; scales freely. Just call with n_blocks = N_MBs × 16 instead of n_blocks = 1. Can be replaced with a fused iquant+IDCT variant if measurement justifies it. |
| `v3d_h264_idct8.comp` | 7 | Stage 1 IDCT 8×8 | 8 blocks/WG; same reuse pattern. High-profile only. |
**Partially reusable (existing shader is the template; need siblings):**
| daedalus-fourier shader | Cycle | Used in daedalus-decoder for | Sibling shaders needed |
|---|---|---|---|
| `v3d_h264deblock.comp` | 8 | Stage 4 deblocking | Handles only vertical luma, non-intra (bS<4). Need: horizontal luma, vertical chroma, horizontal chroma, intra (bS=4) variants — 4 new files using this as template. |
| `v3d_h264_qpel_mc20.comp` | 9 | Stage 2b motion comp | Handles only mc20 8×8 "put". Need 15 more positional variants (mc00..mc33), 16×16 size variants, and "avg" variants for B-frame compound MC — order-of-magnitude more code but all from the same template. |
**Not reusable (different codec or different math):**
| daedalus-fourier shader | Cycle | Why not |
|---|---|---|
| `v3d_idct8.comp` | 1 | VP9 cospi-multiplier IDCT; H.264 uses integer butterfly |
| `v3d_lpf_h_4_8.comp` | 2 | VP9 loop filter, different from H.264 deblock |
| `v3d_lpf_h_8_8.comp` | 4 | Same |
| `v3d_mc_8h.comp` | 3 | VP9 8-tap subpel; H.264 qpel is 6-tap |
| `v3d_cdef.comp` | 5 | AV1 CDEF |
**Brand-new shaders needed (not seeded by daedalus-fourier):**
- `v3d_h264_iquant` — inverse quantization (potentially fused into IDCT)
- `v3d_h264_intra_4x4` — 9 prediction modes
- `v3d_h264_intra_16x16` — 4 modes
- `v3d_h264_intra_chroma_8x8` — 4 modes
- `v3d_h264_chroma_mc` — 1/8-pel chroma MC (bilinear)
- `v3d_h264_reconstruct` — trivial per-pixel residual+predicted clip255
- `v3d_h264_yuv_to_rgba` (Stage 5, optional)
**Summary**: ~22 H.264-related shaders total; 2 directly reusable, 2
partial-reuse (5+15 = ~20 sibling variants), 7 brand-new. All small
(100-250 lines GLSL each). Each carries its own M1 bit-exact gate
against an FFmpeg reference — same methodology as daedalus-fourier
cycles 1-9. Estimate 1-3 days per shader including the gate
methodology; total shader work ~6-10 weeks if done in order.
---
## Appendix B — libavcodec intercept point
Located in `libavcodec/h264_slice.c:decode_slice` (FFmpeg 8.1
Kwiboo pin, function starts at line 2598). The relevant loop:
```c
for (;;) {
    ret = ff_h264_decode_mb_cabac(h, sl); /* or _cavlc */
    if (ret >= 0)
        ff_h264_hl_decode_mb(h, sl); /* ← pixel-level decode */
    ...
    eos = get_cabac_terminate(&sl->cabac);
    if (sl->cabac.bytestream > sl->cabac.bytestream_end + 2)
        break;
    if (eos || ...) break;
}
```
After `ff_h264_decode_mb_cabac` returns, `sl` and `h` are populated
with everything we need:
- `sl->mb_x`, `sl->mb_y` — current macroblock coordinates
- `sl->mb[]` — coefficient buffer (residual coefficients, post-quant)
- `sl->mv_cache[]`, `sl->ref_cache[]` — motion vectors, ref indices
- `h->cur_pic.mb_type[]` — mb_type field
- `sl->intra4x4_pred_mode_cache[]` — 9-mode intra4x4 predictions
- `h->cur_pic.qscale_table[]` — per-MB QP
- `sl->non_zero_count_cache[]` — non-zero coefficient count per 4×4 block
`ff_h264_hl_decode_mb` is the function that calls the pixel-level
kernels (IDCT, intra prediction, MC, deblock). This is where the
daedalus-fourier substitution arc plugged in via `H264DSPContext`
function pointers (PR #6 / #76 et seq).
**The daedalus-decoder intercept**: replace `ff_h264_hl_decode_mb`
with a stub that does *no pixel work*. Instead:
```c
static inline void daedalus_decoder_append_mb(H264Context *h,
                                              H264SliceContext *sl)
{
    /* Snapshot the populated sl/h fields into the frame-shaped
     * mb_descriptors[] buffer indexed by (mb_y * mb_width + mb_x).
     * Snapshot coeffs[] into the frame-shaped coeffs[] SSBO at the
     * matching offset. No GPU dispatch yet — just memory writes. */
    ...
}
```
End-of-slice (or end-of-frame, depending on slice grouping) calls
`daedalus_decoder_flush_frame()` which builds the VkCommandBuffer
with all 4-5 pipeline stages and submits once.
**Implementation shape**: this is a new FFmpeg patch in
marfrit-packages, sibling to the existing 0003-0007 substitution
patches. Tentatively `0008-h264-daedalus-decoder-frame-pipeline.patch`:
```
libavcodec/h264_slice.c:
  - decode_slice() per-MB loop: replace
      `ff_h264_hl_decode_mb(h, sl)` with
      `if (daedalus_decoder_active(h)) daedalus_decoder_append_mb(h, sl);
       else ff_h264_hl_decode_mb(h, sl);`
  - After the loop, if (daedalus_decoder_active(h))
      daedalus_decoder_flush_frame(h);
libavcodec/h264_decoder.c (or new daedalus_decoder_intercept.c):
  - daedalus_decoder_active(h) — checks env var DAEDALUS_DECODER=1
    or a per-codec flag
  - daedalus_decoder_append_mb / flush_frame — call into libdaedalus_decoder.so
```
Crucially, **CABAC and CAVLC stay in libavcodec** — they're the
fastest open-source CABAC implementations and the structure they
produce in `sl->mb_cache` already has the right shape. We don't
re-implement entropy decoding; we just *intercept after it runs*
and feed the GPU pipeline from there.
**Backwards compatibility**: when `daedalus_decoder_active(h)`
returns false (default), the patch is a no-op — `ff_h264_hl_decode_mb`
runs as before. Coexists with the kernel-pack substitution arc
without conflict.
---
## Appendix C — risk register
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| Intra prediction wavefront serialization dominates wall time | Medium | Phase 1 misses 30 fps target | Speculative CPU intra fallback; profile early |
| qpel shader explosion (16 variants × put/avg × 8/16 size) maintenance burden | Medium | Phase 2 schedule slips | Single parameterized shader with switch on (mc_pos, size, op) — V3D compiler should inline |
| H.264 VUI matrix selection error in Stage 5 | Low | Wrong colours in HD content | Default BT.709 limited-range; load test against `ffmpeg -filter_complex eq=...` reference |
| Mesa V3DV concurrency bug under heavy compute dispatch | Low | Sporadic GPU hang | Single-thread the libavcodec H.264 decode side (we're already running it in deferred-write mode; the dispatch is serial by construction) |
| daedalus-fourier shader changes break daedalus-decoder | Medium | Phase 1 dev velocity drops | Pin daedalus-fourier to a tagged release; bump deliberately |
| Phase 1 succeeds but Phase 4 perf misses 30 fps@1080p | Medium-High | Project fails to deliver user-facing improvement | Acknowledged from project start; explicit success-criteria language in §10 covers the pivot |