design: appendices A (shader reuse audit) + B (libavcodec intercept) + C (risk register)

Read-only research done autonomously while the push to marfrit/daedalus-decoder
is blocked on user permissions. All findings are appended to DESIGN.md; no new
files, no architecture changes.

Appendix A — daedalus-fourier shader reuse audit
- 2 shaders directly reusable (v3d_h264_idct4, v3d_h264_idct8),
  just dispatched at frame scale instead of n_blocks=1 per call
- 2 shaders partial-reuse (v3d_h264deblock + v3d_h264_qpel_mc20);
  they serve as templates for ~20 sibling variants (horizontal/chroma
  deblock variants, 15 missing qpel positions + 16x16 size + avg)
- 5 daedalus-fourier shaders not reusable (VP9/AV1 codec-specific)
- 7 brand-new shaders required (iquant, intra prediction modes,
  chroma MC, reconstruct, optional yuv→rgba)
- ~22 H.264 shaders total; estimate 6-10 weeks for the inventory
  if done in sequence with M1 bit-exact gate methodology

Appendix B — libavcodec intercept point
- decode_slice() at libavcodec/h264_slice.c:2598 is the loop site
- Per-MB sequence: ff_h264_decode_mb_cabac → ff_h264_hl_decode_mb
- Intercept replaces ff_h264_hl_decode_mb with a stub that snapshots
  sl->mb[] (coefficients), MV/ref caches, intra modes, mb_type, QP,
  non_zero_count_cache into a frame-shaped descriptor SSBO
- End-of-slice flush builds + submits the GPU pipeline
- CABAC/CAVLC stay in libavcodec (we don't re-implement entropy)
- New FFmpeg patch in marfrit-packages, sibling to 0003-0007:
  0008-h264-daedalus-decoder-frame-pipeline.patch
- daedalus_decoder_active(h) gates the intercept; default OFF =
  no-op = full coexistence with the kernel-pack substitution arc

Appendix C — risk register
- 6 risks catalogued: intra wavefront perf, qpel shader explosion,
  Stage 5 colourspace bugs, Mesa V3DV concurrency, daedalus-fourier
  pin drift, Phase 4 30fps@1080p target miss
- Highest impact: project fails to beat NEON. Acknowledged from
  project start (§10), explicit pivot language.

@@ -310,3 +310,150 @@ If Phase 4 can't beat NEON-on-CPU, we acknowledge the V3D7's compute throughput
- [Mesa V3D driver docs](https://docs.mesa3d.org/drivers/v3d.html)
- [daedalus-fourier substitution arc](https://git.reauktion.de/marfrit/daedalus-fourier) — cycles 6-9; the kernel pack we no longer use at the wrong granularity
- ITU-T H.264 spec (Rec. H.264 (08/2021)) — sections 7 and 9 for syntax and parsing; section 8 for the decoding process

---

## Appendix A — daedalus-fourier shader reuse audit

Inventory of cycles 1-9 V3D shaders against the daedalus-decoder
shader requirements. Done 2026-05-23 against post-PR-#8 main.

**Directly reusable (just call at frame scale):**

| daedalus-fourier shader | Cycle | Used in daedalus-decoder for | Notes |
|---|---|---|---|
| `v3d_h264_idct4.comp` | 6 | Stage 1 IDCT 4×4 | 16 blocks/WG; scales freely. Just call with n_blocks = N_MBs × 16 instead of n_blocks = 1. Can be replaced with a fused iquant+IDCT variant if measurement justifies it. |
| `v3d_h264_idct8.comp` | 7 | Stage 1 IDCT 8×8 | 8 blocks/WG; same reuse pattern. High-profile only. |

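The frame-scale dispatch arithmetic behind "n_blocks = N_MBs × 16" can be sketched as follows. This is a minimal illustration, not code from daedalus-fourier: the helper names (`mb_count`, `workgroups_for`, `idct4_workgroups`) are hypothetical, and it assumes the worst case where every macroblock contributes all 16 of its 4×4 luma blocks.

```c
#include <assert.h>
#include <stdint.h>

/* Blocks handled per workgroup, taken from the table above. */
enum { IDCT4_BLOCKS_PER_WG = 16, IDCT8_BLOCKS_PER_WG = 8 };

/* Macroblocks in a frame; dimensions are rounded up to whole 16x16 MBs. */
static uint32_t mb_count(uint32_t width, uint32_t height)
{
    assert(width > 0 && height > 0);
    return ((width + 15) / 16) * ((height + 15) / 16);
}

/* Workgroups needed to cover n_blocks, rounding up the last partial group. */
static uint32_t workgroups_for(uint32_t n_blocks, uint32_t blocks_per_wg)
{
    assert(blocks_per_wg > 0);
    return (n_blocks + blocks_per_wg - 1) / blocks_per_wg;
}

/* Upper bound for the 4x4 luma IDCT: every MB contributes 16 blocks
 * (assumption: no skipped or 8x8-transform MBs are filtered out). */
static uint32_t idct4_workgroups(uint32_t width, uint32_t height)
{
    uint32_t n_blocks = mb_count(width, height) * 16;
    return workgroups_for(n_blocks, IDCT4_BLOCKS_PER_WG);
}
```

For a 1920×1080 frame this gives 120 × 68 = 8160 macroblocks, so one dispatch covers up to 130560 blocks instead of one block per call.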
**Partially reusable (existing shader is the template; need siblings):**

| daedalus-fourier shader | Cycle | Used in daedalus-decoder for | Sibling shaders needed |
|---|---|---|---|
| `v3d_h264deblock.comp` | 8 | Stage 4 deblocking | Handles only vertical luma, non-intra (bS<4). Need: horizontal luma, vertical chroma, horizontal chroma, intra (bS=4) variants — 4 new files using this as template. |
| `v3d_h264_qpel_mc20.comp` | 9 | Stage 2b motion comp | Handles only mc20 8×8 "put". Need 15 more positional variants (mc00..mc33), 16×16 size variants, and "avg" variants for B-frame compound MC — order-of-magnitude more code but all from the same template. |

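As a point of reference for what the mc20 template computes and how "put" differs from "avg", here is a scalar C sketch of the horizontal half-pel case. It uses the standard H.264 6-tap luma filter (1, -5, 20, 20, -5, 1) with rounding and a shift by 5. The function names and the `qpel_op` enum are hypothetical; this is not the shader source, only the kind of CPU reference a bit-exact gate could compare against.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum qpel_op { QPEL_PUT, QPEL_AVG };

static uint8_t clip_u8(int v)
{
    return v < 0 ? 0 : v > 255 ? 255 : (uint8_t)v;
}

/* Horizontal half-pel sample between src[0] and src[1].
 * src[-2] .. src[3] must be readable. */
static uint8_t h264_halfpel_h(const uint8_t *src)
{
    int sum = src[-2] - 5 * src[-1] + 20 * src[0]
            + 20 * src[1] - 5 * src[2] + src[3];
    return clip_u8((sum + 16) >> 5);
}

/* mc20 on one 8x8 block: "put" stores the filtered sample,
 * "avg" rounds it together with what is already in dst. */
static void qpel_mc20_8x8(uint8_t *dst, ptrdiff_t dst_stride,
                          const uint8_t *src, ptrdiff_t src_stride,
                          enum qpel_op op)
{
    assert(dst != NULL && src != NULL);
    for (int y = 0; y < 8; y++) {
        for (int x = 0; x < 8; x++) {
            uint8_t p = h264_halfpel_h(src + x);
            dst[x] = (op == QPEL_AVG) ? (uint8_t)((dst[x] + p + 1) >> 1) : p;
        }
        dst += dst_stride;
        src += src_stride;
    }
}
```

The other positional variants combine this filter with its vertical counterpart and with averaging against neighbouring samples, which is why they can share one template.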
**Not reusable (different codec or different math):**

| daedalus-fourier shader | Cycle | Why not |
|---|---|---|
| `v3d_idct8.comp` | 1 | VP9 cospi-multiplier IDCT; H.264 uses integer butterfly |
| `v3d_lpf_h_4_8.comp` | 2 | VP9 loop filter, different from H.264 deblock |
| `v3d_lpf_h_8_8.comp` | 4 | Same |
| `v3d_mc_8h.comp` | 3 | VP9 8-tap subpel; H.264 qpel is 6-tap |
| `v3d_cdef.comp` | 5 | AV1 CDEF |

**Brand-new shaders needed (not seeded by daedalus-fourier):**

- `v3d_h264_iquant` — inverse quantization (potentially fused into IDCT)
- `v3d_h264_intra_4x4` — 9 prediction modes
- `v3d_h264_intra_16x16` — 4 modes
- `v3d_h264_intra_chroma_8x8` — 4 modes
- `v3d_h264_chroma_mc` — 1/8-pel chroma MC (bilinear)
- `v3d_h264_reconstruct` — trivial per-pixel residual+predicted clip255
- `v3d_h264_yuv_to_rgba` (Stage 5, optional)

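Two of the entries above are small enough to state as per-sample formulas. The sketch below shows the reconstruct step (prediction plus residual, clipped to 0..255) and the H.264 1/8-pel bilinear chroma interpolation with its 64-weight sum and `+32 >> 6` rounding. The function names are hypothetical and the code is a scalar CPU reference for 8-bit samples, not the planned GLSL.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reconstruct one sample: prediction plus residual, clipped to 8 bits. */
static uint8_t reconstruct_px(uint8_t pred, int residual)
{
    int v = pred + residual;
    return v < 0 ? 0 : v > 255 ? 255 : (uint8_t)v;
}

/* 1/8-pel bilinear chroma interpolation; dx and dy are in 0..7.
 * Reads src[0], src[1], src[stride] and src[stride + 1]. */
static uint8_t chroma_mc_px(const uint8_t *src, ptrdiff_t stride,
                            int dx, int dy)
{
    assert(dx >= 0 && dx < 8 && dy >= 0 && dy < 8);
    int a = (8 - dx) * (8 - dy);   /* weight of top-left sample     */
    int b = dx * (8 - dy);         /* weight of top-right sample    */
    int c = (8 - dx) * dy;         /* weight of bottom-left sample  */
    int d = dx * dy;               /* weight of bottom-right sample */
    return (uint8_t)((a * src[0] + b * src[1] +
                      c * src[stride] + d * src[stride + 1] + 32) >> 6);
}
```

The four weights always sum to 64, so the result stays within 0..255 and no clip is needed in the chroma path.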
**Summary**: ~22 H.264-related shaders total; 2 directly reusable, 2
partial-reuse (5+15 = ~20 sibling variants), 7 brand-new. All small
(100–250 lines GLSL each). Each carries its own M1 bit-exact gate
against an FFmpeg reference — same methodology as daedalus-fourier
cycles 1-9. Estimate 1-3 days per shader including the gate
methodology; total shader work ~6-10 weeks if done in order.

---

## Appendix B — libavcodec intercept point

Located in `libavcodec/h264_slice.c:decode_slice` (FFmpeg 8.1
Kwiboo pin, function starts at line 2598). The relevant loop:

```c
for (;;) {
    ret = ff_h264_decode_mb_cabac(h, sl);   /* or _cavlc */
    if (ret >= 0)
        ff_h264_hl_decode_mb(h, sl);        /* ← pixel-level decode */
    ...
    eos = get_cabac_terminate(&sl->cabac);
    if (sl->cabac.bytestream > sl->cabac.bytestream_end + 2)
        break;
    if (eos || ...) break;
}
```

After `ff_h264_decode_mb_cabac` returns, `sl` and `h` are populated
with everything we need:

- `sl->mb_x`, `sl->mb_y` — current macroblock coordinates
- `sl->mb[]` — coefficient buffer (residual coefficients, post-quant)
- `sl->mv_cache[]`, `sl->ref_cache[]` — motion vectors, ref indices
- `h->cur_pic.mb_type[]` — mb_type field
- `sl->intra4x4_pred_mode_cache[]` — 9-mode intra4x4 predictions
- `h->cur_pic.qscale_table[]` — per-MB QP
- `sl->non_zero_count_cache[]` — non-zero coefficient count per 4×4 block

`ff_h264_hl_decode_mb` is the function that calls the pixel-level
kernels (IDCT, intra prediction, MC, deblock). This is where the
daedalus-fourier substitution arc plugged in via `H264DSPContext`
function pointers (PR #6 / #76 et seq).

**The daedalus-decoder intercept**: replace `ff_h264_hl_decode_mb`
with a stub that does *no pixel work*. Instead:

```c
static inline void daedalus_decoder_append_mb(H264Context *h,
                                              H264SliceContext *sl)
{
    /* Snapshot the populated sl/h fields into the frame-shaped
     * mb_descriptors[] buffer indexed by (mb_y * mb_width + mb_x).
     * Copy the sl->mb[] coefficients into the frame-shaped coeffs[]
     * SSBO at the matching offset. No GPU dispatch yet, just memory
     * writes. */
    ...
}
```

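To make "frame-shaped" concrete, the sketch below shows one possible descriptor record and the index arithmetic the stub comment describes. Everything here is an assumption for illustration: the `DaedalusMBDesc` layout, its field widths, the list-0-only motion data, and the 4:2:0 coefficient count are hypothetical and not the layout DESIGN.md commits to.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-MB record mirroring the fields listed earlier. */
typedef struct DaedalusMBDesc {
    uint32_t mb_type;           /* copy of h->cur_pic.mb_type[] entry      */
    int8_t   qp;                /* copy of h->cur_pic.qscale_table[] entry */
    int8_t   intra4x4_mode[16]; /* one prediction mode per 4x4 luma block  */
    uint8_t  nnz[16];           /* non-zero count per 4x4 luma block       */
    int16_t  mv[16][2];         /* list-0 MV per 4x4 block (simplified)    */
    int8_t   ref[4];            /* list-0 ref index per 8x8 partition      */
} DaedalusMBDesc;

/* Assumes 4:2:0: 256 luma + 2 * 64 chroma coefficients per MB. */
enum { COEFFS_PER_MB = 384 };

/* Slot of a macroblock in the frame-shaped descriptor array. */
static size_t mb_desc_index(int mb_x, int mb_y, int mb_width)
{
    assert(mb_x >= 0 && mb_x < mb_width && mb_y >= 0);
    return (size_t)mb_y * (size_t)mb_width + (size_t)mb_x;
}

/* Offset (in coefficients) of that macroblock in the coeffs[] buffer. */
static size_t mb_coeff_offset(int mb_x, int mb_y, int mb_width)
{
    return mb_desc_index(mb_x, mb_y, mb_width) * COEFFS_PER_MB;
}
```

Because both buffers are addressed purely by macroblock position, slices can append in any order and the flush sees one contiguous frame.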
End-of-slice (or end-of-frame, depending on slice grouping) calls
`daedalus_decoder_flush_frame()` which builds the VkCommandBuffer
with all 4-5 pipeline stages and submits once.

**Implementation shape**: this is a new FFmpeg patch in
marfrit-packages, sibling to the existing 0003-0007 substitution
patches. Tentatively `0008-h264-daedalus-decoder-frame-pipeline.patch`:

```
libavcodec/h264_slice.c:
  - decode_slice() per-MB loop: replace
      `ff_h264_hl_decode_mb(h, sl)` with
      `if (daedalus_decoder_active(h)) daedalus_decoder_append_mb(h, sl);
       else ff_h264_hl_decode_mb(h, sl);`
  - After the loop, if (daedalus_decoder_active(h))
      daedalus_decoder_flush_frame(h);

libavcodec/h264dec.c (or new daedalus_decoder_intercept.c):
  - daedalus_decoder_active(h) — checks env var DAEDALUS_DECODER=1
      or a per-codec flag
  - daedalus_decoder_append_mb / flush_frame — call into libdaedalus_decoder.so
```

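The environment-variable half of the `daedalus_decoder_active(h)` gate could look roughly like this. It is a sketch under assumptions: the helper names are hypothetical, the real gate takes an `H264Context *` and may also consult a per-codec flag, and the cached variant is not safe against concurrent first calls from multiple decode threads.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Pure parsing step: only the exact string "1" enables the decoder. */
static int daedalus_flag_is_on(const char *value)
{
    return value != NULL && strcmp(value, "1") == 0;
}

/* Reads the named environment variable and applies the rule above. */
static int daedalus_env_enabled(const char *name)
{
    assert(name != NULL);
    return daedalus_flag_is_on(getenv(name));
}

/* Caches the lookup so the per-MB loop does not call getenv() for
 * every macroblock. Not thread-safe on first use (assumption: it is
 * first called from a single thread during decoder init). */
static int daedalus_decoder_active_cached(void)
{
    static int cached = -1;   /* -1 means "not queried yet" */
    if (cached < 0)
        cached = daedalus_env_enabled("DAEDALUS_DECODER");
    return cached;
}
```

Treating anything other than "1" as off keeps the default path identical to an unpatched build.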
Crucially, **CABAC and CAVLC stay in libavcodec**. Its entropy
decoders are among the fastest open-source implementations, and the
per-MB state they leave in `sl` (`sl->mb[]` and the `sl->*_cache`
arrays) already has the right shape. We don't re-implement entropy
decoding; we just *intercept after it runs* and feed the GPU pipeline
from there.

**Backwards compatibility**: when `daedalus_decoder_active(h)`
returns false (default), the patch is a no-op — `ff_h264_hl_decode_mb`
runs as before. Coexists with the kernel-pack substitution arc
without conflict.

---

## Appendix C — risk register

| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| Intra prediction wavefront serialization dominates wall time | Medium | Phase 1 misses 30 fps target | Speculative CPU intra fallback; profile early |
| qpel shader explosion (16 variants × put/avg × 8/16 size) maintenance burden | Medium | Phase 2 schedule slips | Single parameterized shader with switch on (mc_pos, size, op) — V3D compiler should inline |
| H.264 VUI matrix selection error in Stage 5 | Low | Wrong colours in HD content | Default BT.709 limited-range; load test against `ffmpeg -filter_complex eq=...` reference |
| Mesa V3DV concurrency bug under heavy compute dispatch | Low | Sporadic GPU hang | Single-thread the libavcodec H.264 decode side (we're already running it in deferred-write mode; the dispatch is serial by construction) |
| daedalus-fourier shader changes break daedalus-decoder | Medium | Phase 1 dev velocity drops | Pin daedalus-fourier to a tagged release; bump deliberately |
| Phase 1 succeeds but Phase 4 perf misses 30 fps@1080p | Medium-High | Project fails to deliver user-facing improvement | Acknowledged from project start; explicit success-criteria language in §10 covers the pivot |

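The "single parameterized shader" mitigation for the qpel row implies some compact encoding of (mc_pos, size, op) that the host passes to the shader. One way to sketch that on the CPU side is shown below. The bit layout, enum names, and helpers are hypothetical; the only convention taken from FFmpeg is that position mcXY maps to index X + 4·Y, so mc20 is index 2.

```c
#include <assert.h>
#include <stdint.h>

enum { QPEL_SIZE_8 = 0, QPEL_SIZE_16 = 1 };
enum { QPEL_OP_PUT = 0, QPEL_OP_AVG = 1 };

/* Total variants covered: 16 positions x 2 sizes x 2 ops. */
enum { QPEL_VARIANT_COUNT = 16 * 2 * 2 };

/* Pack a variant into one word (hypothetical layout):
 * bits 0-3 = mc_pos, bit 4 = size, bit 5 = op. */
static uint32_t qpel_key(int frac_x, int frac_y, int size, int op)
{
    assert(frac_x >= 0 && frac_x < 4 && frac_y >= 0 && frac_y < 4);
    assert(size == QPEL_SIZE_8 || size == QPEL_SIZE_16);
    assert(op == QPEL_OP_PUT || op == QPEL_OP_AVG);
    uint32_t mc_pos = (uint32_t)(frac_x + 4 * frac_y);   /* mc20 -> 2 */
    return mc_pos | ((uint32_t)size << 4) | ((uint32_t)op << 5);
}

/* Accessors a shader-side switch would mirror. */
static int qpel_key_mc_pos(uint32_t key) { return (int)(key & 15u); }
static int qpel_key_size(uint32_t key)   { return (int)((key >> 4) & 1u); }
static int qpel_key_op(uint32_t key)     { return (int)((key >> 5) & 1u); }
```

With this packing the keys run from 0 to 63, so the same value can double as an index into a table of bit-exact gate results, one per variant.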