claude-noether 8a4fb10a7f design: appendices A (shader reuse audit) + B (libavcodec intercept) + C (risk register)
Read-only research done autonomously while the push to marfrit/daedalus-decoder
is blocked on user permissions.  All findings are appended to DESIGN.md; no new
files, no architecture changes.

Appendix A — daedalus-fourier shader reuse audit
  - 2 shaders directly reusable (v3d_h264_idct4, v3d_h264_idct8),
    dispatched at frame scale instead of with n_blocks=1 per call
  - 2 shaders partial-reuse (v3d_h264deblock + v3d_h264_qpel_mc20)
    serve as templates for ~20 sibling variants (horizontal/chroma
    deblock variants, 15 missing qpel positions + 16x16 size + avg)
  - 5 daedalus-fourier shaders not reusable (VP9/AV1 codec-specific)
  - 7 brand-new shaders required (iquant, intra prediction modes,
    chroma MC, reconstruct, optional yuv→rgba)
  - ~22 H.264 shaders total; estimate 6-10 weeks for the inventory
    if done in sequence with M1 bit-exact gate methodology
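
The "15 missing qpel positions" figure follows from the 4x4 grid of quarter-sample offsets in H.264 luma motion compensation: 16 positions, of which only mc20 has a shader today. A minimal sketch that enumerates the remaining variants, assuming the FFmpeg-style `mcXY` naming (X = horizontal quarter offset, Y = vertical); the helper name `missing_qpel_variants` is hypothetical and not part of daedalus-fourier:

```c
#include <assert.h>
#include <stdio.h>

#define QPEL_POSITIONS 16   /* 4 horizontal x 4 vertical quarter-sample offsets */
#define QPEL_NAME_LEN  5    /* "mcXY" plus terminating NUL */

/* Fill `names` with every mcXY position except the one already implemented
 * (have_x, have_y). Returns how many names were written. */
static int missing_qpel_variants(char names[][QPEL_NAME_LEN],
                                 int have_x, int have_y)
{
    assert(have_x >= 0 && have_x < 4 && have_y >= 0 && have_y < 4);
    int n = 0;
    for (int y = 0; y < 4; y++) {
        for (int x = 0; x < 4; x++) {
            if (x == have_x && y == have_y)
                continue;               /* already covered by an existing shader */
            snprintf(names[n], QPEL_NAME_LEN, "mc%d%d", x, y);
            n++;
        }
    }
    return n;
}
```

Calling `missing_qpel_variants(names, 2, 0)` with a `char names[QPEL_POSITIONS][QPEL_NAME_LEN]` buffer skips mc20 and yields the other 15 names; the 16x16 size and avg variants from the audit would multiply that list further.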

Appendix B — libavcodec intercept point
  - decode_slice() at libavcodec/h264_slice.c:2598 is the loop site
  - Per-MB sequence: ff_h264_decode_mb_cabac → ff_h264_hl_decode_mb
  - Intercept replaces ff_h264_hl_decode_mb with a stub that snapshots
    sl->mb[] (coefficients), MV/ref caches, intra modes, mb_type, QP,
    non_zero_count_cache into a frame-shaped descriptor SSBO
  - End-of-slice flush builds + submits the GPU pipeline
  - CABAC/CAVLC stay in libavcodec (we don't re-implement entropy)
  - New FFmpeg patch in marfrit-packages, sibling to 0003-0007:
    0008-h264-daedalus-decoder-frame-pipeline.patch
  - daedalus_decoder_active(h) gates the intercept; default OFF =
    no-op = full coexistence with the kernel-pack substitution arc
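
The intercept shape described above can be sketched with mock types. This is an illustration only, not the contents of the 0008 patch: `FrameCtx`, `MbDescriptor`, `snapshot_mb`, and `flush_slice` are hypothetical names, the `active` flag stands in for `daedalus_decoder_active(h)`, and the coefficient count assumes 8-bit 4:2:0 (256 luma + 128 chroma per macroblock). The MV/ref caches, intra modes, and non_zero_count_cache are omitted for brevity.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define COEFFS_PER_MB 384   /* assumption: 8-bit 4:2:0, 256 luma + 2*64 chroma */
#define DEMO_MAX_MBS  8     /* tiny for the sketch; 1080p needs 8160 */

/* Hypothetical per-macroblock record, one slot of the frame-shaped descriptor. */
typedef struct {
    int16_t  coeffs[COEFFS_PER_MB];
    uint32_t mb_type;
    int8_t   qp;
} MbDescriptor;

typedef struct {
    int          active;    /* stands in for daedalus_decoder_active(h) */
    int          mb_count;  /* macroblocks captured since the last flush */
    int          submits;   /* mock counter for GPU pipeline submissions */
    MbDescriptor mbs[DEMO_MAX_MBS];
} FrameCtx;

/* Stub in place of the reconstruction call: record the macroblock, do not
 * reconstruct it. Returns 0 when the gate is off (caller keeps the CPU path),
 * 1 when captured, -1 when the descriptor buffer is full. */
static int snapshot_mb(FrameCtx *f, const int16_t *coeffs,
                       uint32_t mb_type, int qp)
{
    assert(f != NULL && coeffs != NULL);
    if (!f->active)
        return 0;
    if (f->mb_count >= DEMO_MAX_MBS)
        return -1;
    MbDescriptor *d = &f->mbs[f->mb_count++];
    memcpy(d->coeffs, coeffs, sizeof d->coeffs);
    d->mb_type = mb_type;
    d->qp = (int8_t)qp;
    return 1;
}

/* End-of-slice flush: one mock submit covering everything captured so far.
 * Returns the number of macroblocks handed off. */
static int flush_slice(FrameCtx *f)
{
    assert(f != NULL);
    if (!f->active || f->mb_count == 0)
        return 0;
    int n = f->mb_count;
    f->submits++;
    f->mb_count = 0;
    return n;
}
```

With `active` left at 0, `snapshot_mb` captures nothing and `flush_slice` submits nothing, which mirrors the default-OFF coexistence behaviour described in the last bullet.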

Appendix C — risk register
  - 6 risks catalogued: intra wavefront perf, qpel shader explosion,
    Stage 5 colourspace bugs, Mesa V3DV concurrency, daedalus-fourier
    pin drift, Phase 4 30fps@1080p target miss
  - Highest-impact risk: the project fails to beat NEON.  Acknowledged
    since project start (§10), with explicit pivot language.
2026-05-23 23:10:39 +02:00

daedalus-decoder

Frame-level GPU H.264 decoder for Raspberry Pi 5 / V3D7. Design phase — not implemented yet.

The objective: build the NVDEC-equivalent shape on Pi 5. One Vulkan submit per frame, one fence wait per frame, encoded H.264 bitstream in, NV12 frame out. Reuses daedalus-fourier's V3D compute primitives at the right granularity — not the per-block-call granularity that the kernel-substitution prototype exposed as architecturally wrong.
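
To make the granularity argument concrete, the sketch below counts how many 4x4 residual blocks a frame contains, which is the worst-case number of dispatches if every block were its own call. It assumes 4:2:0 content (16 luma + 8 chroma 4x4 blocks per macroblock) and is not a measurement of the prototype; the function names are illustrative.

```c
#include <assert.h>
#include <stdint.h>

/* Number of 16x16 macroblocks covering a frame, rounding each dimension up. */
static uint32_t macroblocks_per_frame(uint32_t width, uint32_t height)
{
    assert(width > 0 && height > 0);
    return ((width + 15u) / 16u) * ((height + 15u) / 16u);
}

/* 4x4 blocks per frame, assuming 4:2:0: 16 luma + 4 Cb + 4 Cr per macroblock. */
static uint32_t blocks_4x4_per_frame(uint32_t width, uint32_t height)
{
    return macroblocks_per_frame(width, height) * 24u;
}
```

For 1920x1080 this gives 120 x 68 = 8160 macroblocks and 195,840 4x4 blocks per frame, against the single submit and single fence wait per frame that the frame-level design targets.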

Sibling projects:

  • daedalus-fourier — V3D + NEON kernel pack (IDCT, MC, deblock primitives). Stays as a research/microbenchmark artifact.
  • daedalus-v4l2 — V4L2 stateless decoder shim + userspace daemon for Pi 5. The eventual consumer of this decoder.
  • libva-v4l2-request-fourier — VAAPI ↔ V4L2 stateless bridge. End consumer.

See DESIGN.md for the architecture sketch.

Description
Frame-level GPU H.264 decoder for Raspberry Pi 5 V3D7. NVDEC-shaped pipeline (encoded bitstream in, NV12 out, one Vulkan submit per frame) built on daedalus-fourier's V3D compute primitives. Phase 1 design exploration.