# daedalus-decoder — design

**Phase:** scoping / first-draft design. No code written. Date: 2026-05-23.

This document captures the architecture for an NVDEC-shaped H.264 decoder targeting Raspberry Pi 5 (V3D7). It exists because the [daedalus-fourier substitution prototype](https://git.reauktion.de/marfrit/daedalus-fourier) — a per-block kernel pack called synchronously from inside libavcodec's H.264 decoder — measured at >600× slower than NEON in production due to per-call Vulkan synchronization overhead. The R-band methodology correctly predicted this; the "QPU is default substrate" decree (2026-05-23) overrode it for prototype purposes and confirmed the architectural mismatch. This document is the structural redesign that addresses it.

---

## 1. The shape we need

NVDEC and Vulkan Video both target dedicated decode hardware via a **picture-level command**: app submits a bitstream, hardware decodes a whole frame, app gets an NV12 surface. One submit per frame, one fence wait per frame, one DMA.

We don't have dedicated H.264 silicon on Pi 5 (the chip has a separate HEVC block — `rpi-hevc-dec` — but no H.264 decoder). We do have V3D7 Vulkan compute. The goal is to build the *same shape* as NVDEC using V3D7 compute as the substrate.
```
┌──────────────────────────────────────────────────────────────┐
│ CPU side                                                     │
│  ┌────────────────────────────────────────────────────────┐  │
│  │ libavcodec H.264 parser (existing)                     │  │
│  │   - NAL split + emulation prevention removal           │  │
│  │   - SPS / PPS                                          │  │
│  │   - Slice header decode                                │  │
│  │   - CABAC / CAVLC entropy decode                       │  │
│  │   - Per-MB: mb_type, qp, partition info,               │  │
│  │     transform coefficients, MVs, intra modes           │  │
│  └─────────────────────┬──────────────────────────────────┘  │
│                        ↓ (write into frame-shaped SSBO)      │
│                        │                                     │
└────────────────────────┼─────────────────────────────────────┘
                         ↓ Vulkan compute queue submit
┌────────────────────────┼─────────────────────────────────────┐
│ GPU (V3D7) side        ↓                                     │
│  ┌────────────────────────────────────────────────────────┐  │
│  │ Stage 1: inverse quant + IDCT 4x4 / 8x8                │  │
│  │   in:  coeffs[N_MBs × 384]                             │  │
│  │   out: residual[N_MBs × 384] u8/i16                    │  │
│  └─────────────────────┬──────────────────────────────────┘  │
│                        ↓ vkCmdPipelineBarrier                │
│  ┌────────────────────────────────────────────────────────┐  │
│  │ Stage 2: prediction (intra + MC)                       │  │
│  │   in:  mb_descriptors, residual,                       │  │
│  │        DPB reference frames                            │  │
│  │   out: predicted[N_MBs × 384] u8                       │  │
│  │   note: intra dispatched in wavefront order;           │  │
│  │         MC fully parallel                              │  │
│  └─────────────────────┬──────────────────────────────────┘  │
│                        ↓                                     │
│  ┌────────────────────────────────────────────────────────┐  │
│  │ Stage 3: reconstruct = clip255(residual + predicted)   │  │
│  │   out: reconstructed Y, U, V planes                    │  │
│  └─────────────────────┬──────────────────────────────────┘  │
│                        ↓                                     │
│  ┌────────────────────────────────────────────────────────┐  │
│  │ Stage 4: deblocking filter at MB edges                 │  │
│  │   - V-edges first (rows independent)                   │  │
│  │   - H-edges next (columns independent)                 │  │
│  │   out: NV12 in DPB slot + dma-buf-exported             │  │
│  └─────────────────────┬──────────────────────────────────┘  │
└────────────────────────┼─────────────────────────────────────┘
                         ↓ vkWaitFences (one wait, one frame)
                  NV12 ready in V4L2 CAPTURE plane
```

**One Vulkan submit per frame.** The per-call sync cost that killed the kernel-substitution prototype goes away because the dispatch granularity is the right size: thousands of macroblocks of work amortize the ~50 µs submit/wait roundtrip.

---

## 2. CPU/GPU split

The CPU keeps everything that is **inherently serial**. H.264's entropy decode (CABAC, and CAVLC for Baseline-equivalent paths) is bit-by-bit serial — each binary decision updates context state that the next decision reads. NVDEC has a dedicated CABAC ASIC; doing CABAC on a SIMD compute architecture like V3D is a research project of its own and not in scope.

The CPU also keeps:

- NAL-unit boundary scanning + emulation-prevention byte removal
- SPS/PPS storage and lookup
- Slice header decoding
- Reference picture list construction (DPB management at the index level)
- Display-order reorder (B-frame POC machinery)

The CPU **produces** a frame-shaped descriptor that the GPU consumes.

The GPU runs **everything that is massively parallel per macroblock**:

| Stage | What it does | Parallelism |
|---|---|---|
| 1. Inverse quant + IDCT | Multiply coeffs by Q scale, run 4×4 / 8×8 inverse transform | Fully parallel (per block) |
| 2a. Intra prediction | Predict from neighbouring decoded pixels (top, top-right, left) | Wavefront — diagonal of MBs |
| 2b. Motion compensation | qpel filter on reference pixels | Fully parallel |
| 3. Reconstruct | residual + predicted, clip255 | Fully parallel (per pixel) |
| 4. Deblocking | Filter MB edges (alpha/beta/tc0 table) | Two-pass: V-edges (rows ∥), H-edges (cols ∥) |
| 5. NV12 → RGBA (optional) | YUV→RGB matrix conversion via H.264 VUI colourspace | Fully parallel (per pixel) |

---

## 3. Per-MB descriptor (CPU → GPU)

Frame-shaped SSBO: `mb_descriptors[N_MBs]`, where N_MBs = ⌈W/16⌉ × ⌈H/16⌉ (8160 for 1080p).

```c
#include <stdint.h>               /* fixed-width integer types */

struct mb_descriptor {
    uint8_t  mb_type;             /* I_NxN, I_16x16, P_..., B_..., I_PCM, ... */
    uint8_t  mb_qp_y;
    uint8_t  mb_qp_uv;
    uint8_t  cbp;                 /* coded block pattern */
    uint8_t  intra_4x4_modes[16]; /* per 4x4 sub-block, when I_NxN */
    uint8_t  intra_16x16_mode;    /* when I_16x16 */
    uint8_t  intra_chroma_mode;
    uint8_t  partition_mode;      /* P_16x16 / P_16x8 / P_8x16 / P_8x8 / ... */
    uint8_t  ref_idx_l0[4];       /* up to 4 partitions */
    uint8_t  ref_idx_l1[4];       /* B only */
    int16_t  mv_l0[4][2];         /* qpel-precision (1/4 sample) */
    int16_t  mv_l1[4][2];
    uint8_t  deblock_disable;
    int8_t   deblock_alpha_c0;
    int8_t   deblock_beta;
};
```

Coefficient buffer separate, also frame-shaped:

```c
int16_t coeffs[N_MBs * 384]; /* 256 luma + 64 cb + 64 cr per MB */
```

Sized for the worst case (every block has non-zero coefficients). Most MBs have sparse coefficients, but we don't compact: the GPU shader iterates over every block, and skipping all-zero blocks is cheap.

---

## 4. Pipeline stage dispatch shape

All stages live in **one VkCommandBuffer per frame**. Inter-stage dependencies use `vkCmdPipelineBarrier` (`VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT` → `VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT`, `VK_ACCESS_SHADER_WRITE_BIT` → `VK_ACCESS_SHADER_READ_BIT`) on the relevant SSBO.

**Stage 1 (IDCT)** — one workgroup per MB; many MBs per dispatch. Reuses daedalus-fourier's IDCT 4×4 / 8×8 shader logic but at a different scale: a single dispatch covers all N_MBs macroblocks, not one block.

**Stage 2a (intra)** — wavefront dispatched as one dispatch per diagonal. For a 1080p frame (120×68 MB grid), there are 120+68−1 = 187 diagonals; the longest has 68 MBs. MBs on a diagonal are independent as long as each MB depends only on its left, top-left and top neighbours. One dispatch per diagonal, with a barrier in between → 187 dispatches in the same command buffer (still ~225k× cheaper than per-block). Caveat: Intra 4×4 modes that read top-right samples pull from MB (x+1, y−1), which sits on the same diagonal; honouring that dependency needs the 2:1 wavefront (x + 2y constant), i.e. 120 + 2×67 = 254 waves with at most 60 MBs each. Alternative: speculative CPU intra prediction, GPU only does inter blocks. Simpler, slower. Decision deferred.

**Stage 2b (MC)** — single dispatch over all inter-MBs. Reads reference frames from DPB.
The 6-tap qpel filter reads a 13×8 source window per 8×8 output block in the horizontal half-pel case (8 + 5 columns); daedalus-fourier's existing `v3d_h264_qpel_mc20` covers mc20 — we need the other 15 variants (mc00/01/02/.../33) either as separate shaders or as one parameterized shader.

**Stage 3 (reconstruct)** — trivial per-pixel `clip255(residual + predicted)`. One dispatch over all pixels.

**Stage 4 (deblock)** — two dispatches: vertical edges (all rows independent), horizontal edges (all columns independent). Reuses daedalus-fourier's `v3d_h264deblock` shader, adapted for the new buffer layout.

**Stage 5 (optional, NV12 → RGBA conversion)** — a single per-pixel compute dispatch that converts the just-deblocked NV12 to packed RGBA8. Roughly:

```glsl
// BT.709 limited-range example; matrix selected per H.264 VUI.
// Coefficients are the BT.709 matrix scaled by 256 and rounded.
int Y  = int(luma[...]) - 16;
int CB = int(chroma_uv[...]) - 128;
int CR = int(chroma_uv[... + 1]) - 128;
int R = clip255((298*Y            + 459*CR + 128) >> 8);
int G = clip255((298*Y -  55*CB - 136*CR + 128) >> 8);
int B = clip255((298*Y + 541*CB            + 128) >> 8);
out_rgba[r*W + c] = pack32(R, G, B, 255);
```

Off by default. Most consumers want NV12 because:

- V4L2 stateless decoder output is canonically NV12; the daedalus-v4l2 CAPTURE plane is already plumbed for NV12 dmabuf.
- Wayland compositors (sway, mutter, KWin, weston) advertise NV12 via `zwp_linux_dmabuf_v1` and convert during composition — essentially free because it's one texture sample with a swizzle in the composite fragment shader.
- Firefox / mpv VAAPI paths use NV12 dmabuf zero-copy to the compositor (cf. project memory `pi5_h264_hwaccel_landed`).
- RGBA8 is about 2.7× the bandwidth of NV12 (8 MB/frame vs 3 MB/frame at 1080p) — converting in the decoder when no consumer needs it burns DMA bandwidth and electrons.

Opt in via a decoder config flag for consumers that genuinely need RGBA pre-compositor (SHM-buffer-only clients, future X11/headless paths, GL contexts without YUV sampler extension support).
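For cross-checking Stage 5 output on the CPU, the per-pixel arithmetic can be mirrored in plain C. The following is a minimal sketch, not part of the design: the names `ref_yuv709_limited_to_rgba` and `fix8_to_u8` and the R-in-low-byte packing order are assumptions made for illustration. The coefficients are the BT.709 matrix (Kr = 0.2126, Kb = 0.0722) with limited-range scaling (255/219 for luma, 255/224 for chroma), multiplied by 256 and rounded.

```c
#include <stdint.h>

/* Convert an 8.8 fixed-point intermediate to an 8-bit sample.
 * Negative inputs are clamped before shifting, so the result does not
 * depend on how the compiler right-shifts negative integers. */
static inline int fix8_to_u8(int v)
{
    if (v < 0)
        return 0;
    v >>= 8;
    return v > 255 ? 255 : v;
}

/* Hypothetical CPU reference for one pixel: BT.709, limited range.
 * Packing order (R in the low byte, A in the high byte) is an assumption. */
static inline uint32_t ref_yuv709_limited_to_rgba(uint8_t y, uint8_t cb, uint8_t cr)
{
    const int Y  = (int)y  - 16;
    const int CB = (int)cb - 128;
    const int CR = (int)cr - 128;

    const int r = fix8_to_u8(298 * Y            + 459 * CR + 128);
    const int g = fix8_to_u8(298 * Y -  55 * CB - 136 * CR + 128);
    const int b = fix8_to_u8(298 * Y + 541 * CB            + 128);

    return (uint32_t)r | ((uint32_t)g << 8) | ((uint32_t)b << 16) | 0xFF000000u;
}
```

Limited-range black (16, 128, 128) and white (235, 128, 128) map to RGB 0 and 255 respectively, which makes them convenient first probes before comparing whole frames.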
Colourspace metadata MUST be plumbed through when Stage 5 is on: the H.264 SPS's `vui_parameters` carry `colour_primaries`, `transfer_characteristics`, `matrix_coefficients`, and `video_full_range_flag`. Picking the wrong matrix (BT.601 vs BT.709) or range silently mis-saturates skin tones and shadows. Default to the BT.709 limited-range combination when VUI is absent (the dominant case for HD content); log a warning in that case.

Total dispatches per frame: roughly 190-200 with the intra wavefront and without Stage 5 (187 diagonal dispatches plus a handful for Stages 1, 2b, 3 and 4); +1 with Stage 5. All in one command buffer, one submit, one fence wait.

---

## 5. Reference frame management

DPB = N reference frames in GPU memory, each a VkImage with NV12 format, exported as dma_buf. Up to 16 reference frames per H.264 spec; typical streams use 1-4.

CPU tracks the index mapping (refIdx0/refIdx1 in MV lists → DPB slot). GPU shader receives DPB slot indices via the mb_descriptor; reads pixels via VkImage descriptor binding.

At frame-decode end, the output VkImage may become a new reference (per `ref_pic_marking` decisions on CPU). Slot retirement is CPU-managed.

This is the **right shape for V4L2 CAPTURE integration** — daedalus-v4l2 today uses dma_buf-exported CAPTURE planes; reusing those as DPB entries means zero-copy from decode → V4L2 consumer.

---

## 6. libavcodec integration

The honest hard problem. libavcodec's H.264 decoder calls thousands of per-MB / per-block functions in sequence; we need to replace that with a "collect everything into the descriptor buffer, then dispatch one pipeline" flow.

**Intercept point candidates:**

- **Slice level** (`ff_h264_decode_slice` or `decode_slice`). Replace the per-MB loop entirely. Highest leverage but biggest rewrite.
- **Macroblock level** (`ff_h264_decode_mb_*`). Each call returns to libavcodec which calls pixel kernels. Replace the kernels with descriptor-write stubs, dispatch at slice end. Closer to the substitution arc; less rewrite.
The macroblock-level intercept is the natural evolution of the substitution arc. The current `daedalus_recipe_dispatch_h264_*` calls become **append-to-buffer** instead of **dispatch-to-GPU**, and an end-of-slice flush triggers the actual Vulkan submit.

This is the same shape as **deferred rendering** in GL/Vulkan — record commands, submit in bulk.

---

## 7. What we reuse from daedalus-fourier

Existing V3D shaders (post-PR-#7+#8 state):

| Shader | Reuse status |
|---|---|
| `v3d_h264_idct4` | Yes — adapt input/output buffer layout |
| `v3d_h264_idct8` | Yes — same |
| `v3d_h264deblock` | Yes — adapt; the existing shader handles vertical-luma; chroma + horizontal variants are new |
| `v3d_h264_qpel_mc20` | Partial — only mc20. All 16 qpel positions needed for MC. |
| `v3d_idct8` (VP9) | Not for H.264 |
| `v3d_lpf_h_4_8`, `v3d_lpf_h_8_8` (VP9) | Not for H.264 |
| `v3d_mc_8h` (VP9 8-tap) | Not for H.264 (6-tap) |
| `v3d_cdef` (AV1) | Not for H.264 |

New shaders needed:

- `v3d_h264_iquant` — inverse quantization (can be fused with IDCT into one stage)
- `v3d_h264_intra_4x4` — 9 prediction modes
- `v3d_h264_intra_16x16` — 4 modes
- `v3d_h264_intra_chroma_8x8` — 4 modes
- `v3d_h264_qpel_{mc00..mc33}` — 15 missing variants (or one parameterized shader)
- `v3d_h264_chroma_mc` — 1/8-pel chroma MC (bilinear)
- `v3d_h264_deblock_h_luma` + chroma variants
- `v3d_h264_reconstruct` — trivial per-pixel add+clip
- `v3d_h264_yuv_to_rgba` (optional Stage 5) — BT.601 / BT.709 / SMPTE-240M matrix selectable; limited vs full range selectable from VUI

That's a substantial shader inventory. Each requires bit-exact M1 gate against an FFmpeg reference, same methodology as daedalus-fourier cycles 1-9.

---

## 8. Phasing

**Phase 1 — MVP, I-frames only, Baseline profile** (4-6 weeks). No CABAC (CAVLC only). No MC. No B-frames. Validates the architecture: parse → entropy → frame-pipeline → NV12 out, all bit-exact for I-frames. Test against ffmpeg-generated I-frame-only H.264 streams.

**Phase 2 — P-frames, single reference, Baseline profile** (+4 weeks). Add MC stage and DPB.

**Phase 3 — CABAC + Main/High profile + B-frames + multi-ref** (+6 weeks). CABAC stays on CPU. Full DPB management.

**Phase 4 — Production-ready deblock + perf optimization + libva integration** (+4 weeks). Real-world stream conformance. Plug into daedalus-v4l2 daemon as the actual decode backend.

**Total H.264 budget:** 4-6 months.

**Phase 5+ (future codec scope, not committed):** VP9 and AV1 reuse the same frame-level dispatch architecture, daedalus-fourier kernel pack, and DPB plumbing. Per §9.7, they are deferred but *not firmly out-of-scope*. HEVC stays firmly out (Pi 5 has `rpi-hevc-dec` for that).

---

## 9. Phase 1 decisions

User-confirmed 2026-05-24. All seven questions from the initial draft are now decided; this section preserves the original wording of each item for traceability.

1. **Intra prediction strategy:** GPU wavefront (~187 dispatches, more complex) vs CPU speculative (simpler, slower). **Decision: wavefront in Phase 1; revisit if it's the perf bottleneck.**
2. **libavcodec intercept granularity:** macroblock-level (substitution-arc evolution) vs slice-level (cleaner rewrite). **Decision: macroblock-level for Phase 1; consider slice-level later if buffer accumulation overhead is non-trivial.**
3. **Shader parameterization:** 16 qpel variants as 16 shaders, or one parameterized shader with switch on mc_position? **Decision: measure both during Phase 2 (the MC phase) and pick the winner. No commit ahead of measurement.**
4. **DPB allocation:** Vulkan-native VkImage with dmabuf export, vs CPU-allocated dma_buf imported into Vulkan. **Decision: Vulkan-native with `VK_KHR_external_memory_dma_buf` export; daedalus-v4l2 daemon imports.**
5. **Daemon integration shape:** static library the daemon links, or separate process. **Decision: library link.**
6. **Build dependency on daedalus-fourier:** CMake `find_package`, or vendored?
   **Decision: `find_package`, pinned to a tagged release. daedalus-fourier becomes the "kernel pack" upstream library.**
7. **Codec scope.** **Decision: firmly out-of-scope for daedalus-decoder are HEVC (Pi 5 has `rpi-hevc-dec` for that), 10-bit, interlaced, and FMO/ASO.** VP9 and AV1 are *not* firmly out — they're future codec scope for the same framework after H.264 lands. This is a scope expansion from the initial draft, which had grouped them with HEVC under "firmly out".

---

## 10. Success criteria

Phase 1 — **architecture validation, not perf**:

- Bit-exact NV12 output against `ffmpeg -i input.h264 -pix_fmt nv12 -f rawvideo -` for at least one synthetic I-frame-only stream (the explicit `-pix_fmt nv12` matters: without it ffmpeg emits the decoder's native planar layout rather than NV12)
- Demonstrates one-submit-per-frame dispatch path (not one-dispatch-per-block)
- Vulkan validation layer reports no errors
- No regression in daedalus-fourier or daedalus-v4l2

Phase 4 — **production target**:

- 1080p H.264 at ≥30 fps on Pi 5 (matches the [30fps floor](https://git.reauktion.de/marfrit/daedalus-fourier/src/branch/main/CLAUDE-CODE-MEMORY/project_30fps_floor_is_fine.md) user-facing requirement)
- Daily YouTube playback via Firefox → libva → daedalus-v4l2 daemon → daedalus-decoder works durably
- CPU consumption noticeably lower than the current libavcodec-NEON path (the "save electrons" goal)

If Phase 4 can't beat NEON-on-CPU, we acknowledge the V3D7's compute throughput isn't enough for H.264 1080p at 30fps and pivot back to NEON. The R-band methodology was clear that per-block QPU is RED on Pi 5; the bet here is that frame-level batched dispatch — the NVDEC shape — changes the ratio enough to flip the verdict.

---

## 11. References

- [Khronos: An Introduction to Vulkan Video](https://www.khronos.org/blog/an-introduction-to-vulkan-video)
- [NVDEC Video Decoder API Programming Guide](https://docs.nvidia.com/video-technologies/video-codec-sdk/13.0/nvdec-video-decoder-api-prog-guide/)
- [VK_KHR_video_decode_h264 proposal](https://docs.vulkan.org/features/latest/features/proposals/VK_KHR_video_decode_h264.html)
- [Mesa V3D driver docs](https://docs.mesa3d.org/drivers/v3d.html)
- [daedalus-fourier substitution arc](https://git.reauktion.de/marfrit/daedalus-fourier) — cycles 6-9; the kernel pack we are no longer using at the wrong granularity
- ITU-T H.264 spec (Rec. H.264 (08/2021)) — section 7 for syntax and semantics; section 8 for the decoding process; section 9 for the parsing process (CAVLC/CABAC)

---

## Appendix A — daedalus-fourier shader reuse audit

Inventory of cycles 1-9 V3D shaders against the daedalus-decoder shader requirements. Done 2026-05-23 against post-PR-#8 main.

**Directly reusable (just call at frame scale):**

| daedalus-fourier shader | Cycle | Used in daedalus-decoder for | Notes |
|---|---|---|---|
| `v3d_h264_idct4.comp` | 6 | Stage 1 IDCT 4×4 | 16 blocks/WG; scales freely. Just call with n_blocks = N_MBs × 16 instead of n_blocks = 1. Can be replaced with a fused iquant+IDCT variant if measurement justifies it. |
| `v3d_h264_idct8.comp` | 7 | Stage 1 IDCT 8×8 | 8 blocks/WG; same reuse pattern. High-profile only. |

**Partially reusable (existing shader is the template; need siblings):**

| daedalus-fourier shader | Cycle | Used in daedalus-decoder for | Sibling shaders needed |
|---|---|---|---|
| `v3d_h264deblock.comp` | 8 | Stage 4 deblocking | Handles only vertical luma, non-intra (bS<4). Need: horizontal luma, vertical chroma, horizontal chroma, intra (bS=4) variants — 4 new files using this as template. |
| `v3d_h264_qpel_mc20.comp` | 9 | Stage 2b motion comp | Handles only mc20 8×8 "put". Need 15 more positional variants (mc00..mc33), 16×16 size variants, and "avg" variants for B-frame compound MC — order-of-magnitude more code but all from the same template. |

**Not reusable (different codec or different math):**

| daedalus-fourier shader | Cycle | Why not |
|---|---|---|
| `v3d_idct8.comp` | 1 | VP9 cospi-multiplier IDCT; H.264 uses integer butterfly |
| `v3d_lpf_h_4_8.comp` | 2 | VP9 loop filter, different from H.264 deblock |
| `v3d_lpf_h_8_8.comp` | 4 | Same |
| `v3d_mc_8h.comp` | 3 | VP9 8-tap subpel; H.264 qpel is 6-tap |
| `v3d_cdef.comp` | 5 | AV1 CDEF |

**Brand-new shaders needed (not seeded by daedalus-fourier):**

- `v3d_h264_iquant` — inverse quantization (potentially fused into IDCT)
- `v3d_h264_intra_4x4` — 9 prediction modes
- `v3d_h264_intra_16x16` — 4 modes
- `v3d_h264_intra_chroma_8x8` — 4 modes
- `v3d_h264_chroma_mc` — 1/8-pel chroma MC (bilinear)
- `v3d_h264_reconstruct` — trivial per-pixel residual+predicted clip255
- `v3d_h264_yuv_to_rgba` (Stage 5, optional)

**Summary**: ~22 H.264-related shaders total; 2 directly reusable, 2 partial-reuse (5+15 = ~20 sibling variants), 7 brand-new. All small (100–250 lines GLSL each). Each carries its own M1 bit-exact gate against an FFmpeg reference — same methodology as daedalus-fourier cycles 1-9. Estimate 1-3 days per shader including the gate methodology; total shader work ~6-10 weeks if done in order.

---

## Appendix B — libavcodec intercept point

Located in `libavcodec/h264_slice.c:decode_slice` (FFmpeg 8.1 Kwiboo pin, function starts at line 2598). The relevant loop:

```c
for (;;) {
    ret = ff_h264_decode_mb_cabac(h, sl);   /* or _cavlc */
    if (ret >= 0)
        ff_h264_hl_decode_mb(h, sl);        /* ← pixel-level decode */
    ...
    eos = get_cabac_terminate(&sl->cabac);
    if (sl->cabac.bytestream > sl->cabac.bytestream_end + 2)
        break;
    if (eos || ...)
        break;
}
```

After `ff_h264_decode_mb_cabac` returns, `sl` and `h` are populated with everything we need:

- `sl->mb_x`, `sl->mb_y` — current macroblock coordinates
- `sl->mb[]` — coefficient buffer (residual coefficients, post-quant)
- `sl->mv_cache[]`, `sl->ref_cache[]` — motion vectors, ref indices
- `h->cur_pic.mb_type[]` — mb_type field
- `sl->intra4x4_pred_mode_cache[]` — 9-mode intra4x4 predictions
- `h->cur_pic.qscale_table[]` — per-MB QP
- `sl->non_zero_count_cache[]` — non-zero coefficient count per 4×4 block

`ff_h264_hl_decode_mb` is the function that calls the pixel-level kernels (IDCT, intra prediction, MC, deblock). This is where the daedalus-fourier substitution arc plugged in via `H264DSPContext` function pointers (PR #6 / #76 et seq).

**The daedalus-decoder intercept**: replace `ff_h264_hl_decode_mb` with a stub that does *no pixel work*. Instead:

```c
static inline void daedalus_decoder_append_mb(H264Context *h, H264SliceContext *sl)
{
    /* Snapshot the populated sl/h fields into the frame-shaped
     * mb_descriptors[] buffer indexed by (mb_y * mb_width + mb_x).
     * Snapshot coeffs[] into the frame-shaped coeffs[] SSBO at the
     * matching offset. No GPU dispatch yet — just memory writes. */
    ...
}
```

End-of-slice (or end-of-frame, depending on slice grouping) calls `daedalus_decoder_flush_frame()` which builds the VkCommandBuffer with all 4-5 pipeline stages and submits once.

**Implementation shape**: this is a new FFmpeg patch in marfrit-packages, sibling to the existing 0003-0007 substitution patches.
Tentatively `0008-h264-daedalus-decoder-frame-pipeline.patch`:

```
libavcodec/h264_slice.c:
  - decode_slice() per-MB loop: replace
      `ff_h264_hl_decode_mb(h, sl)`
    with
      `if (daedalus_decoder_active(h))
           daedalus_decoder_append_mb(h, sl);
       else
           ff_h264_hl_decode_mb(h, sl);`
  - After the loop, if (daedalus_decoder_active(h))
      daedalus_decoder_flush_frame(h);

libavcodec/h264dec.c (or new daedalus_decoder_intercept.c):
  - daedalus_decoder_active(h) — checks env var DAEDALUS_DECODER=1 or a per-codec flag
  - daedalus_decoder_append_mb / flush_frame — call into libdaedalus_decoder.so
```

Crucially, **CABAC and CAVLC stay in libavcodec** — they're among the fastest open-source entropy-decode implementations, and the state they leave in `sl` and `h` (the fields listed above) already has the right shape. We don't re-implement entropy decoding; we just *intercept after it runs* and feed the GPU pipeline from there.

**Backwards compatibility**: when `daedalus_decoder_active(h)` returns false (default), the patch is a no-op — `ff_h264_hl_decode_mb` runs as before. Coexists with the kernel-pack substitution arc without conflict.
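The append-then-flush flow described in this appendix can be sketched on the host side without any FFmpeg or Vulkan dependency. This is an illustration under stated assumptions, not the planned implementation: `struct frame_staging`, `struct mb_desc_min` and the `staging_*` functions are hypothetical names, the descriptor is cut down to two fields, and the flush only counts what a real version would record into the command buffer and submit.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define COEFFS_PER_MB 384           /* 256 luma + 64 Cb + 64 Cr */

/* Cut-down stand-in for the section 3 descriptor (hypothetical). */
struct mb_desc_min {
    uint8_t mb_type;
    uint8_t mb_qp_y;
};

/* Hypothetical host-side staging state for one frame. */
struct frame_staging {
    unsigned mb_width, mb_height;   /* frame size in macroblocks */
    unsigned appended;              /* MBs written since the last flush */
    unsigned submits;               /* number of flushes so far */
    struct mb_desc_min *desc;       /* mb_width * mb_height entries */
    int16_t *coeffs;                /* mb_width * mb_height * 384 entries */
};

/* Allocate frame-shaped buffers; returns 0 on success, -1 on failure. */
static int staging_init(struct frame_staging *s, unsigned width_px, unsigned height_px)
{
    memset(s, 0, sizeof(*s));
    s->mb_width  = (width_px  + 15) / 16;   /* ceil(W / 16) */
    s->mb_height = (height_px + 15) / 16;   /* ceil(H / 16) */
    size_t n = (size_t)s->mb_width * s->mb_height;
    s->desc   = calloc(n, sizeof(*s->desc));
    s->coeffs = calloc(n * COEFFS_PER_MB, sizeof(*s->coeffs));
    if (!s->desc || !s->coeffs) {
        free(s->desc);
        free(s->coeffs);
        memset(s, 0, sizeof(*s));
        return -1;
    }
    return 0;
}

/* Memory writes only: copy one MB's descriptor and coefficients into
 * the slot indexed by (mb_y * mb_width + mb_x). */
static void staging_append_mb(struct frame_staging *s, unsigned mb_x, unsigned mb_y,
                              const struct mb_desc_min *d, const int16_t *mb_coeffs)
{
    assert(mb_x < s->mb_width && mb_y < s->mb_height);
    size_t idx = (size_t)mb_y * s->mb_width + mb_x;
    s->desc[idx] = *d;
    memcpy(&s->coeffs[idx * COEFFS_PER_MB], mb_coeffs,
           COEFFS_PER_MB * sizeof(*mb_coeffs));
    s->appended++;
}

/* Stand-in for the once-per-frame submit: returns how many MBs the
 * single submit would cover and resets the counter. */
static unsigned staging_flush(struct frame_staging *s)
{
    unsigned covered = s->appended;
    s->appended = 0;
    s->submits++;
    return covered;
}

static void staging_free(struct frame_staging *s)
{
    free(s->desc);
    free(s->coeffs);
    memset(s, 0, sizeof(*s));
}
```

For a 1080p frame this stages 120 × 68 = 8160 descriptors and 8160 × 384 coefficients (about 6 MB of `int16_t`), and a full frame of appends is followed by exactly one flush.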

---

## Appendix C — risk register

| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| Intra prediction wavefront serialization dominates wall time | Medium | Phase 1 misses 30 fps target | Speculative CPU intra fallback; profile early |
| qpel shader explosion (16 variants × put/avg × 8/16 size) maintenance burden | Medium | Phase 2 schedule slips | Single parameterized shader with switch on (mc_pos, size, op) — V3D compiler should inline |
| H.264 VUI matrix selection error in Stage 5 | Low | Wrong colours in HD content | Default BT.709 limited-range; load test against `ffmpeg -filter_complex eq=...` reference |
| Mesa V3DV concurrency bug under heavy compute dispatch | Low | Sporadic GPU hang | Single-thread the libavcodec H.264 decode side (we're already running it in deferred-write mode; the dispatch is serial by construction) |
| daedalus-fourier shader changes break daedalus-decoder | Medium | Phase 1 dev velocity drops | Pin daedalus-fourier to a tagged release; bump deliberately |
| Phase 1 succeeds but Phase 4 perf misses 30 fps@1080p | Medium-High | Project fails to deliver user-facing improvement | Acknowledged from project start; explicit success-criteria language in §10 covers the pivot |
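
The first risk in the table depends on how many serialized waves the Stage 2a intra wavefront needs, which in turn depends on the neighbour set each MB waits on. The sketch below counts waves for two schedules; the function names are hypothetical and the code is only a planning aid, not part of the decoder. It assumes MB-granular dependencies: with left, top-left and top only, the wave index is x + y; if the top-right MB (x+1, y−1) must also be finished, the wave index becomes x + 2y.

```c
#include <assert.h>

/* Waves when MB (x, y) waits on left, top-left and top only:
 * wave index = x + y. */
static unsigned waves_diagonal(unsigned mb_w, unsigned mb_h)
{
    assert(mb_w > 0 && mb_h > 0);
    return mb_w + mb_h - 1;
}

/* Waves when the top-right MB (x+1, y-1) must also be complete:
 * wave index = x + 2*y. */
static unsigned waves_with_top_right(unsigned mb_w, unsigned mb_h)
{
    assert(mb_w > 0 && mb_h > 0);
    return mb_w + 2 * (mb_h - 1);
}

/* Number of MBs in a given wave for wave index = x + slope*y,
 * where slope is 1 (plain diagonal) or 2 (top-right aware). */
static unsigned mbs_in_wave(unsigned mb_w, unsigned mb_h,
                            unsigned wave, unsigned slope)
{
    unsigned count = 0;
    for (unsigned y = 0; y < mb_h && slope * y <= wave; y++) {
        if (wave - slope * y < mb_w)    /* x = wave - slope*y is in range */
            count++;
    }
    return count;
}
```

For the 120×68 grid of a 1080p frame this gives 187 waves (widest: 68 MBs) for the plain diagonal and 254 waves (widest: 60 MBs) for the top-right-aware schedule, so the dispatch and barrier count grows by roughly a third if top-right dependencies have to be honoured.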