design: appendices A (shader reuse audit) + B (libavcodec intercept) + C (risk register)

Read-only research done autonomously while push to marfrit/daedalus-decoder is blocked on user perms. All findings appended to DESIGN.md; no new files, no architecture changes.

Appendix A — daedalus-fourier shader reuse audit
- 2 shaders directly reusable (v3d_h264_idct4, v3d_h264_idct8) just at frame scale instead of n_blocks=1 per call
- 2 shaders partial-reuse (v3d_h264deblock + v3d_h264_qpel_mc20) serve as templates for ~20 sibling variants (horizontal/chroma deblock variants, 15 missing qpel positions + 16x16 size + avg)
- 5 daedalus-fourier shaders not reusable (VP9/AV1 codec-specific)
- 7 brand-new shaders required (iquant, intra prediction modes, chroma MC, reconstruct, optional yuv→rgba)
- ~22 H.264 shaders total; estimate 6-10 weeks for the inventory if done in sequence with M1 bit-exact gate methodology

Appendix B — libavcodec intercept point
- decode_slice() at libavcodec/h264_slice.c:2598 is the loop site
- Per-MB sequence: ff_h264_decode_mb_cabac → ff_h264_hl_decode_mb
- Intercept replaces ff_h264_hl_decode_mb with a stub that snapshots sl->mb[] (coefficients), MV/ref caches, intra modes, mb_type, QP, non_zero_count_cache into a frame-shaped descriptor SSBO
- End-of-slice flush builds + submits the GPU pipeline
- CABAC/CAVLC stay in libavcodec (we don't re-implement entropy)
- New FFmpeg patch in marfrit-packages, sibling to 0003-0007: 0008-h264-daedalus-decoder-frame-pipeline.patch
- daedalus_decoder_active(h) gates the intercept; default OFF = no-op = full coexistence with the kernel-pack substitution arc

Appendix C — risk register
- 6 risks catalogued: intra wavefront perf, qpel shader explosion, Stage 5 colourspace bugs, Mesa V3DV concurrency, daedalus-fourier pin drift, Phase 4 30fps@1080p target miss
- Highest impact: project fails to beat NEON. Acknowledged from project start (§10), explicit pivot language.
2026-05-23 23:10:39 +02:00
parent 4182b32adf
commit 8a4fb10a7f
1 changed files with 147 additions and 0 deletions
@@ -310,3 +310,150 @@ If Phase 4 can't beat NEON-on-CPU, we acknowledge the V3D7's compute throughput
 - [Mesa V3D driver docs](https://docs.mesa3d.org/drivers/v3d.html)
 - [daedalus-fourier substitution arc](https://git.reauktion.de/marfrit/daedalus-fourier) — cycles 6-9; the kernel pack we're not using at the wrong granularity anymore
 - ITU-T H.264 spec (Rec. H.264 (08/2021)) — sections 7-9 for syntax; section 8 for decoding process
+
+---
+
+## Appendix A — daedalus-fourier shader reuse audit
+
+Inventory of cycles 1-9 V3D shaders against the daedalus-decoder
+shader requirements.  Done 2026-05-23 against post-PR-#8 main.
+
+**Directly reusable (just call at frame scale):**
+
+| daedalus-fourier shader | Cycle | Used in daedalus-decoder for | Notes |
+|---|---|---|---|
+| `v3d_h264_idct4.comp` | 6 | Stage 1 IDCT 4×4 | 16 blocks/WG; scales freely.  Just call with n_blocks = N_MBs × 16 instead of n_blocks = 1.  Can be replaced with a fused iquant+IDCT variant if measurement justifies it. |
+| `v3d_h264_idct8.comp` | 7 | Stage 1 IDCT 8×8 | 8 blocks/WG; same reuse pattern.  High-profile only. |
+
+**Partially reusable (existing shader is the template; need siblings):**
+
+| daedalus-fourier shader | Cycle | Used in daedalus-decoder for | Sibling shaders needed |
+|---|---|---|---|
+| `v3d_h264deblock.comp` | 8 | Stage 4 deblocking | Handles only vertical luma, non-intra (bS<4).  Need: horizontal luma, vertical chroma, horizontal chroma, intra (bS=4) variants — 4 new files using this as template. |
+| `v3d_h264_qpel_mc20.comp` | 9 | Stage 2b motion comp | Handles only mc20 8×8 "put".  Need 15 more positional variants (mc00..mc33), 16×16 size variants, and "avg" variants for B-frame compound MC — order-of-magnitude more code but all from the same template. |
+
+**Not reusable (different codec or different math):**
+
+| daedalus-fourier shader | Cycle | Why not |
+|---|---|---|
+| `v3d_idct8.comp` | 1 | VP9 cospi-multiplier IDCT; H.264 uses integer butterfly |
+| `v3d_lpf_h_4_8.comp` | 2 | VP9 loop filter, different from H.264 deblock |
+| `v3d_lpf_h_8_8.comp` | 4 | Same |
+| `v3d_mc_8h.comp` | 3 | VP9 8-tap subpel; H.264 qpel is 6-tap |
+| `v3d_cdef.comp` | 5 | AV1 CDEF |
+
+**Brand-new shaders needed (not seeded by daedalus-fourier):**
+
+- `v3d_h264_iquant` — inverse quantization (potentially fused into IDCT)
+- `v3d_h264_intra_4x4` — 9 prediction modes
+- `v3d_h264_intra_16x16` — 4 modes
+- `v3d_h264_intra_chroma_8x8` — 4 modes
+- `v3d_h264_chroma_mc` — 1/8-pel chroma MC (bilinear)
+- `v3d_h264_reconstruct` — trivial per-pixel clip of (predicted + residual) to [0, 255]
+- `v3d_h264_yuv_to_rgba` (Stage 5, optional)
+
+**Summary**: ~22 H.264-related shaders total; 2 directly reusable, 2
+partial-reuse (5+15 = ~20 sibling variants), 7 brand-new.  All small
+(100–250 lines GLSL each).  Each carries its own M1 bit-exact gate
+against an FFmpeg reference — same methodology as daedalus-fourier
+cycles 1-9.  Estimate 1-3 days per shader including the gate
+methodology; total shader work ~6-10 weeks if done in order.
+
+---
+
+## Appendix B — libavcodec intercept point
+
+The intercept point is `decode_slice()` in `libavcodec/h264_slice.c` (FFmpeg 8.1
+Kwiboo pin; the function starts at line 2598).  The relevant loop:
+
+```c
+for (;;) {
+    ret = ff_h264_decode_mb_cabac(h, sl);   /* or _cavlc */
+    if (ret >= 0)
+        ff_h264_hl_decode_mb(h, sl);        /* ← pixel-level decode */
+    ...
+    eos = get_cabac_terminate(&sl->cabac);
+    if (sl->cabac.bytestream > sl->cabac.bytestream_end + 2)
+        break;
+    if (eos || ...) break;
+}
+```
+
+After `ff_h264_decode_mb_cabac` returns, `sl` and `h` are populated
+with everything we need:
+
+- `sl->mb_x`, `sl->mb_y` — current macroblock coordinates
+- `sl->mb[]` — coefficient buffer (residual coefficients, post-quant)
+- `sl->mv_cache[]`, `sl->ref_cache[]` — motion vectors, ref indices
+- `h->cur_pic.mb_type[]` — mb_type field
+- `sl->intra4x4_pred_mode_cache[]` — 9-mode intra4x4 predictions
+- `h->cur_pic.qscale_table[]` — per-MB QP
+- `sl->non_zero_count_cache[]` — non-zero coefficient count per 4×4 block
+
+`ff_h264_hl_decode_mb` is the function that calls the pixel-level
+kernels (IDCT, intra prediction, MC, deblock).  This is where the
+daedalus-fourier substitution arc plugged in via `H264DSPContext`
+function pointers (PR #6 / #76 et seq).
+
+**The daedalus-decoder intercept**: replace `ff_h264_hl_decode_mb`
+with a stub that does *no pixel work*.  Instead:
+
+```c
+static inline void daedalus_decoder_append_mb(H264Context *h,
+                                               H264SliceContext *sl)
+{
+    /* Snapshot the populated sl/h fields into the frame-shaped
+     * mb_descriptors[] buffer indexed by (mb_y * mb_width + mb_x).
+     * Snapshot coeffs[] into the frame-shaped coeffs[] SSBO at the
+     * matching offset.  No GPU dispatch yet — just memory writes. */
+    ...
+}
+```
+
+End-of-slice (or end-of-frame, depending on slice grouping) calls
+`daedalus_decoder_flush_frame()` which builds the VkCommandBuffer
+with all 4-5 pipeline stages and submits once.
+
+**Implementation shape**: this is a new FFmpeg patch in
+marfrit-packages, sibling to the existing 0003-0007 substitution
+patches.  Tentatively `0008-h264-daedalus-decoder-frame-pipeline.patch`:
+
+```
+libavcodec/h264_slice.c:
+  - decode_slice() per-MB loop: replace
+    `ff_h264_hl_decode_mb(h, sl)` with
+    `if (daedalus_decoder_active(h)) daedalus_decoder_append_mb(h, sl);
+     else ff_h264_hl_decode_mb(h, sl);`
+  - After the loop, if (daedalus_decoder_active(h))
+    daedalus_decoder_flush_frame(h);
+
+libavcodec/h264dec.c (or new daedalus_decoder_intercept.c):
+  - daedalus_decoder_active(h) — checks env var DAEDALUS_DECODER=1
+    or a per-codec flag
+  - daedalus_decoder_append_mb / flush_frame — call into libdaedalus_decoder.so
+```
+
+Crucially, **CABAC and CAVLC stay in libavcodec**.  Its entropy
+decoders are among the fastest open-source implementations, and the
+structure they produce in `sl->mb_cache` already has the right shape.
+We don't re-implement entropy decoding; we just *intercept after it
+runs* and feed the GPU pipeline from there.
+
+**Backwards compatibility**: when `daedalus_decoder_active(h)`
+returns false (default), the patch is a no-op — `ff_h264_hl_decode_mb`
+runs as before.  Coexists with the kernel-pack substitution arc
+without conflict.
+
+---
+
+## Appendix C — risk register
+
+| Risk | Likelihood | Impact | Mitigation |
+|---|---|---|---|
+| Intra prediction wavefront serialization dominates wall time | Medium | Phase 1 misses 30 fps target | Speculative CPU intra fallback; profile early |
+| qpel shader explosion (16 variants × put/avg × 8/16 size) maintenance burden | Medium | Phase 2 schedule slips | Single parameterized shader with switch on (mc_pos, size, op) — V3D compiler should inline |
+| H.264 VUI matrix selection error in Stage 5 | Low | Wrong colours in HD content | Default BT.709 limited-range; load test against `ffmpeg -filter_complex eq=...` reference |
+| Mesa V3DV concurrency bug under heavy compute dispatch | Low | Sporadic GPU hang | Single-thread the libavcodec H.264 decode side (we're already running it in deferred-write mode; the dispatch is serial by construction) |
+| daedalus-fourier shader changes break daedalus-decoder | Medium | Phase 1 dev velocity drops | Pin daedalus-fourier to a tagged release; bump deliberately |
+| Phase 1 succeeds but Phase 4 perf misses 30 fps@1080p | Medium-High | Project fails to deliver user-facing improvement | Acknowledged from project start; explicit success-criteria language in §10 covers the pivot |
+