diff --git a/DESIGN.md b/DESIGN.md
index cb5648e..cea5149 100644
--- a/DESIGN.md
+++ b/DESIGN.md
@@ -310,3 +310,150 @@ If Phase 4 can't beat NEON-on-CPU, we acknowledge the V3D7's compute throughput
 - [Mesa V3D driver docs](https://docs.mesa3d.org/drivers/v3d.html)
 - [daedalus-fourier substitution arc](https://git.reauktion.de/marfrit/daedalus-fourier) — cycles 6-9; the kernel pack we're not using at the wrong granularity anymore
 - ITU-T H.264 spec (Rec. H.264 (08/2021)) — sections 7-9 for syntax; section 8 for decoding process
+
+---
+
+## Appendix A — daedalus-fourier shader reuse audit
+
+Inventory of cycles 1-9 V3D shaders against the daedalus-decoder
+shader requirements. Done 2026-05-23 against post-PR-#8 main.
+
+**Directly reusable (just call at frame scale):**
+
+| daedalus-fourier shader | Cycle | Used in daedalus-decoder for | Notes |
+|---|---|---|---|
+| `v3d_h264_idct4.comp` | 6 | Stage 1 IDCT 4×4 | 16 blocks/WG; scales freely. Just call with n_blocks = N_MBs × 16 instead of n_blocks = 1. Can be replaced with a fused iquant+IDCT variant if measurement justifies it. |
+| `v3d_h264_idct8.comp` | 7 | Stage 1 IDCT 8×8 | 8 blocks/WG; same reuse pattern. High-profile only. |
+
+**Partially reusable (existing shader is the template; need siblings):**
+
+| daedalus-fourier shader | Cycle | Used in daedalus-decoder for | Sibling shaders needed |
+|---|---|---|---|
+| `v3d_h264deblock.comp` | 8 | Stage 4 deblocking | Handles only vertical luma, non-intra (bS<4). Need: horizontal luma, vertical chroma, horizontal chroma, intra (bS=4) variants — 4 new files using this as template. |
+| `v3d_h264_qpel_mc20.comp` | 9 | Stage 2b motion comp | Handles only mc20 8×8 "put". Need 15 more positional variants (mc00..mc33), 16×16 size variants, and "avg" variants for B-frame compound MC — order-of-magnitude more code but all from the same template. |
+
+**Not reusable (different codec or different math):**
+
+| daedalus-fourier shader | Cycle | Why not |
+|---|---|---|
+| `v3d_idct8.comp` | 1 | VP9 cospi-multiplier IDCT; H.264 uses integer butterfly |
+| `v3d_lpf_h_4_8.comp` | 2 | VP9 loop filter, different from H.264 deblock |
+| `v3d_lpf_h_8_8.comp` | 4 | Same |
+| `v3d_mc_8h.comp` | 3 | VP9 8-tap subpel; H.264 qpel is 6-tap |
+| `v3d_cdef.comp` | 5 | AV1 CDEF |
+
+**Brand-new shaders needed (not seeded by daedalus-fourier):**
+
+- `v3d_h264_iquant` — inverse quantization (potentially fused into IDCT)
+- `v3d_h264_intra_4x4` — 9 prediction modes
+- `v3d_h264_intra_16x16` — 4 modes
+- `v3d_h264_intra_chroma_8x8` — 4 modes
+- `v3d_h264_chroma_mc` — 1/8-pel chroma MC (bilinear)
+- `v3d_h264_reconstruct` — trivial per-pixel residual+predicted clip255
+- `v3d_h264_yuv_to_rgba` (Stage 5, optional)
+
+**Summary**: ~22 H.264-related shaders total; 2 directly reusable, 2
+partial-reuse (5+15 = ~20 sibling variants), 7 brand-new. All small
+(100–250 lines GLSL each). Each carries its own M1 bit-exact gate
+against an FFmpeg reference — same methodology as daedalus-fourier
+cycles 1-9. Estimate 1-3 days per shader including the gate
+methodology; total shader work ~6-10 weeks if done in order.
+
+---
+
+## Appendix B — libavcodec intercept point
+
+Located in `libavcodec/h264_slice.c:decode_slice` (FFmpeg 8.1
+Kwiboo pin, function starts at line 2598). The relevant loop:
+
+```c
+for (;;) {
+    ret = ff_h264_decode_mb_cabac(h, sl);  /* or _cavlc */
+    if (ret >= 0)
+        ff_h264_hl_decode_mb(h, sl);       /* ← pixel-level decode */
+    ...
+    eos = get_cabac_terminate(&sl->cabac);
+    if (sl->cabac.bytestream > sl->cabac.bytestream_end + 2)
+        break;
+    if (eos || ...)
+        break;
+}
+```
+
+After `ff_h264_decode_mb_cabac` returns, `sl` and `h` are populated
+with everything we need:
+
+- `sl->mb_x`, `sl->mb_y` — current macroblock coordinates
+- `sl->mb[]` — coefficient buffer (residual coefficients, post-quant)
+- `sl->mv_cache[]`, `sl->ref_cache[]` — motion vectors, ref indices
+- `h->cur_pic.mb_type[]` — mb_type field
+- `sl->intra4x4_pred_mode_cache[]` — 9-mode intra4x4 predictions
+- `h->cur_pic.qscale_table[]` — per-MB QP
+- `sl->non_zero_count_cache[]` — non-zero coefficient count per 4×4 block
+
+`ff_h264_hl_decode_mb` is the function that calls the pixel-level
+kernels (IDCT, intra prediction, MC, deblock). This is where the
+daedalus-fourier substitution arc plugged in via `H264DSPContext`
+function pointers (PR #6 / #76 et seq).
+
+**The daedalus-decoder intercept**: replace `ff_h264_hl_decode_mb`
+with a stub that does *no pixel work*. Instead:
+
+```c
+static inline void daedalus_decoder_append_mb(H264Context *h,
+                                              H264SliceContext *sl)
+{
+    /* Snapshot the populated sl/h fields into the frame-shaped
+     * mb_descriptors[] buffer indexed by (mb_y * mb_width + mb_x).
+     * Snapshot coeffs[] into the frame-shaped coeffs[] SSBO at the
+     * matching offset. No GPU dispatch yet — just memory writes. */
+    ...
+}
+```
+
+End-of-slice (or end-of-frame, depending on slice grouping) calls
+`daedalus_decoder_flush_frame()` which builds the VkCommandBuffer
+with all 4-5 pipeline stages and submits once.
+
+**Implementation shape**: this is a new FFmpeg patch in
+marfrit-packages, sibling to the existing 0003-0007 substitution
+patches. Tentatively `0008-h264-daedalus-decoder-frame-pipeline.patch`:
+
+```
+libavcodec/h264_slice.c:
+  - decode_slice() per-MB loop: replace
+      `ff_h264_hl_decode_mb(h, sl)` with
+      `if (daedalus_decoder_active(h)) daedalus_decoder_append_mb(h, sl);
+       else ff_h264_hl_decode_mb(h, sl);`
+  - After the loop, if (daedalus_decoder_active(h))
+      daedalus_decoder_flush_frame(h);
+
+libavcodec/h264_decoder.c (or new daedalus_decoder_intercept.c):
+  - daedalus_decoder_active(h) — checks env var DAEDALUS_DECODER=1
+    or a per-codec flag
+  - daedalus_decoder_append_mb / flush_frame — call into libdaedalus_decoder.so
+```
+
+Crucially, **CABAC and CAVLC stay in libavcodec** — they're the
+fastest open-source CABAC implementations and the structure they
+produce in `sl->mb_cache` already has the right shape. We don't
+re-implement entropy decoding; we just *intercept after it runs*
+and feed the GPU pipeline from there.
+
+**Backwards compatibility**: when `daedalus_decoder_active(h)`
+returns false (default), the patch is a no-op — `ff_h264_hl_decode_mb`
+runs as before. Coexists with the kernel-pack substitution arc
+without conflict.
+
+---
+
+## Appendix C — risk register
+
+| Risk | Likelihood | Impact | Mitigation |
+|---|---|---|---|
+| Intra prediction wavefront serialization dominates wall time | Medium | Phase 1 misses 30 fps target | Speculative CPU intra fallback; profile early |
+| qpel shader explosion (16 variants × put/avg × 8/16 size) maintenance burden | Medium | Phase 2 schedule slips | Single parameterized shader with switch on (mc_pos, size, op) — V3D compiler should inline |
+| H.264 VUI matrix selection error in Stage 5 | Low | Wrong colours in HD content | Default BT.709 limited-range; load test against `ffmpeg -filter_complex eq=...` reference |
+| Mesa V3DV concurrency bug under heavy compute dispatch | Low | Sporadic GPU hang | Single-thread the libavcodec H.264 decode side (we're already running it in deferred-write mode; the dispatch is serial by construction) |
+| daedalus-fourier shader changes break daedalus-decoder | Medium | Phase 1 dev velocity drops | Pin daedalus-fourier to a tagged release; bump deliberately |
+| Phase 1 succeeds but Phase 4 perf misses 30 fps@1080p | Medium-High | Project fails to deliver user-facing improvement | Acknowledged from project start; explicit success-criteria language in §10 covers the pivot |
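
Note on the qpel row of the risk register: the "single parameterized shader" mitigation can be prototyped on the CPU first, which also gives the bit-exact gate something to compare against. The sketch below is an assumption-laden illustration, not code from daedalus-fourier or FFmpeg. The names `qpel_op`, `clip_u8`, and `qpel_mc20` are hypothetical, and it covers only the mc20 (horizontal half-sample) position. What it does take from the H.264 spec is the 6-tap luma filter (1, -5, 20, 20, -5, 1) with rounding `(v + 16) >> 5`, and the rounded average used for "avg".

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names throughout; not taken from daedalus-fourier. */
enum qpel_op { QPEL_PUT, QPEL_AVG };

static uint8_t clip_u8(int v)
{
    return (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
}

/* Horizontal half-sample luma filter (the "mc20" position).
 * One function handles both block sizes and both ops, mirroring the
 * (size, op) part of the proposed shader switch.
 * Assumption: src has 2 readable pixels to the left and 3 to the
 * right of every row, and src/dst share the same stride. */
static void qpel_mc20(uint8_t *dst, const uint8_t *src,
                      int stride, int size, enum qpel_op op)
{
    assert(size == 8 || size == 16);
    for (int y = 0; y < size; y++) {
        for (int x = 0; x < size; x++) {
            const uint8_t *p = src + y * stride + x;
            int v = p[-2] - 5 * p[-1] + 20 * p[0]
                  + 20 * p[1] - 5 * p[2] + p[3];
            int r = v + 16;
            /* Clamp negatives before shifting to avoid relying on
             * arithmetic right shift of a negative int. */
            uint8_t px = clip_u8(r < 0 ? 0 : r >> 5);
            uint8_t *d = dst + y * stride + x;
            /* "avg" blends with what is already in dst, rounding up. */
            *d = (op == QPEL_AVG) ? (uint8_t)((*d + px + 1) >> 1) : px;
        }
    }
}
```

Extending this to the other 15 positions would add an `mc_pos` parameter and the vertical and diagonal filter paths. Whether the V3D compiler actually specializes such a switch well is the open question the risk register flags, so this only pins down the arithmetic.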