daedalus-decoder

marfrit/daedalus-decoder


Commit Graph


Author

SHA1

Message

Date

claude-noether

8a4fb10a7f

design: appendices A (shader reuse audit) + B (libavcodec intercept) + C (risk register)

Read-only research done autonomously while push to marfrit/daedalus-decoder
is blocked on user perms.  All findings appended to DESIGN.md; no new
files, no architecture changes.

Appendix A — daedalus-fourier shader reuse audit
  - 2 shaders directly reusable (v3d_h264_idct4, v3d_h264_idct8)
    just at frame scale instead of n_blocks=1 per call
  - 2 shaders partial-reuse (v3d_h264deblock + v3d_h264_qpel_mc20)
    serve as templates for ~20 sibling variants (horizontal/chroma
    deblock variants, 15 missing qpel positions + 16x16 size + avg)
  - 5 daedalus-fourier shaders not reusable (VP9/AV1 codec-specific)
  - 7 brand-new shaders required (iquant, intra prediction modes,
    chroma MC, reconstruct, optional yuv→rgba)
  - ~22 H.264 shaders total; estimate 6-10 weeks for the inventory
    if done in sequence with M1 bit-exact gate methodology
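The "15 missing qpel positions" figure can be checked with a short enumeration. This is only a sketch: it assumes the FFmpeg-style `mcXY` naming (X = horizontal, Y = vertical quarter-sample offset), and it assumes that `mc20` is the only position with an existing shader, as the audit states. The `v3d_h264_qpel_` prefix on the generated names is extrapolated from the one existing shader name and is hypothetical.

```python
# Enumerate the 16 quarter-pel luma positions, mcXY with X, Y in 0..3.
ALL_QPEL = [f"mc{x}{y}" for y in range(4) for x in range(4)]

# Assumption: per the audit, only the mc20 variant exists today.
EXISTING = {"mc20"}

# Positions that would still need a shader. mc00 is the integer-pel copy;
# whether it deserves its own shader is a design choice, it is counted here.
missing = [p for p in ALL_QPEL if p not in EXISTING]

# Hypothetical shader names, following the existing v3d_h264_qpel_mc20 pattern.
missing_shaders = [f"v3d_h264_qpel_{p}" for p in missing]

print(len(ALL_QPEL), len(missing))
```

The 16x16 size and `avg` variants mentioned in the audit would multiply this list further, which is presumably why qpel parameterization appears among the open questions.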

Appendix B — libavcodec intercept point
  - decode_slice() at libavcodec/h264_slice.c:2598 is the loop site
  - Per-MB sequence: ff_h264_decode_mb_cabac → ff_h264_hl_decode_mb
  - Intercept replaces ff_h264_hl_decode_mb with a stub that snapshots
    sl->mb[] (coefficients), MV/ref caches, intra modes, mb_type, QP,
    non_zero_count_cache into a frame-shaped descriptor SSBO
  - End-of-slice flush builds + submits the GPU pipeline
  - CABAC/CAVLC stay in libavcodec (we don't re-implement entropy)
  - New FFmpeg patch in marfrit-packages, sibling to 0003-0007:
    0008-h264-daedalus-decoder-frame-pipeline.patch
  - daedalus_decoder_active(h) gates the intercept; default OFF =
    no-op = full coexistence with the kernel-pack substitution arc
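The intercept described in Appendix B amounts to "append per-MB state now, submit once at flush". The following is a minimal Python model of that control flow only, not the libavcodec patch: `MbDescriptor`, `FrameDescriptorBuffer`, `decode_slice_sketch` and the field set are all hypothetical names chosen for illustration, and the `submit` callback stands in for building and submitting the GPU pipeline.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class MbDescriptor:
    """Hypothetical per-MB snapshot; the real field set lives in DESIGN.md."""
    mb_x: int
    mb_y: int
    mb_type: int
    qp: int
    coeffs: bytes = b""  # stand-in for the sl->mb[] coefficient snapshot


@dataclass
class FrameDescriptorBuffer:
    """Accumulates descriptors and hands them over in one batch."""
    submit: Callable[[List[MbDescriptor]], None]
    entries: List[MbDescriptor] = field(default_factory=list)

    def append(self, mb: MbDescriptor) -> None:
        # Replaces the per-MB reconstruction call: record state, do no work.
        self.entries.append(mb)

    def flush(self) -> None:
        # End-of-slice: one submit for everything accumulated so far.
        if self.entries:
            batch, self.entries = self.entries, []
            self.submit(batch)


def decode_slice_sketch(buf: FrameDescriptorBuffer, mb_w: int, mb_h: int) -> None:
    """Models the per-MB loop: entropy decode (omitted), then snapshot."""
    for mb_y in range(mb_h):
        for mb_x in range(mb_w):
            buf.append(MbDescriptor(mb_x, mb_y, mb_type=0, qp=26))
    buf.flush()


submitted: List[List[MbDescriptor]] = []
buf = FrameDescriptorBuffer(submit=submitted.append)
decode_slice_sketch(buf, 120, 68)  # 1080p: 120 x 68 macroblocks
print(len(submitted), len(submitted[0]))
```

In this model a whole 1080p picture in a single slice produces exactly one submit, which is the property the design is after.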

Appendix C — risk register
  - 6 risks catalogued: intra wavefront perf, qpel shader explosion,
    Stage 5 colourspace bugs, Mesa V3DV concurrency, daedalus-fourier
    pin drift, Phase 4 30fps@1080p target miss
  - Highest impact: project fails to beat NEON.  Acknowledged from
    project start (§10), explicit pivot language.

2026-05-23 23:10:39 +02:00

claude-noether

4182b32adf

design: optional Stage 5 NV12 → RGBA conversion

User question 2026-05-23: 'Wayland does need a conversion of NV12 to
its output format. Could we cram that in?'

Yes — trivially.  Added Stage 5 to the pipeline doc with:

  - 5-line per-pixel compute shader (BT.709 limited-range example
    given; matrix selected from H.264 VUI at runtime)
  - explicit OPT-IN flag, off by default
  - rationale for default-off: most consumers (V4L2 stateless,
    Wayland zwp_linux_dmabuf NV12 passthrough, Firefox/mpv VAAPI
    paths) want NV12 because compositors convert during composition
    essentially for free.  RGBA8 is ~2.7x the bandwidth of 8-bit NV12
    (4 bytes vs 1.5 bytes per pixel), so it is not worth burning
    DMA + electrons when no downstream needs it
  - colourspace metadata plumbing requirement: SPS vui_parameters
    (colour_primaries, transfer_characteristics, matrix_coefficients,
    video_full_range_flag) MUST flow through to the shader; default
    BT.709 limited-range with warning if VUI absent

Updated the new-shader inventory to include v3d_h264_yuv_to_rgba.
Total dispatches/frame remains ~190-200; Stage 5 adds one.
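For reference, the per-pixel arithmetic that a BT.709 limited-range conversion performs can be sketched on the CPU as below. This is an illustration of the standard math, not the `v3d_h264_yuv_to_rgba` shader itself: the function name is hypothetical, the coefficients are hard-coded to BT.709 (the design selects the matrix from the VUI at runtime), and it handles one 8-bit sample triple at a time.

```python
# BT.709 luma coefficients; KG follows from KR + KG + KB = 1.
KR, KB = 0.2126, 0.0722
KG = 1.0 - KR - KB


def yuv709_limited_to_rgb8(y: int, cb: int, cr: int) -> tuple:
    """Convert one 8-bit limited-range Y'CbCr sample to 8-bit R'G'B'."""
    # Undo limited-range offsets and scaling: Y' in [16, 235], C in [16, 240].
    yn = (y - 16) / 219.0
    pb = (cb - 128) / 224.0
    pr = (cr - 128) / 224.0

    # Invert the Y'PbPr definition for the chosen KR/KB.
    r = yn + 2.0 * (1.0 - KR) * pr
    b = yn + 2.0 * (1.0 - KB) * pb
    g = (yn - KR * r - KB * b) / KG

    def to8(v: float) -> int:
        # Clamp to [0, 1] before quantising back to 8 bits.
        return int(round(min(max(v, 0.0), 1.0) * 255.0))

    return (to8(r), to8(g), to8(b))


print(yuv709_limited_to_rgb8(16, 128, 128))   # limited-range black
print(yuv709_limited_to_rgb8(235, 128, 128))  # limited-range white
```

A full-range stream (`video_full_range_flag` set) would skip the 16/219/224 scaling, and other `matrix_coefficients` values would change `KR` and `KB`, which is why the VUI fields have to reach the shader.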

2026-05-23 22:46:45 +02:00

claude-noether

59885dd868

initial design doc — frame-level GPU H.264 decoder for V3D7

Path C of the 2026-05-23 architecture decision after the daedalus-
fourier substitution arc's per-block QPU dispatch was measured to be
>600x slower than NEON in production.  Root cause: per-block synchronous
Vulkan dispatch from inside libavcodec's per-MB loops, paying ~50us of
queue-submit/wait round-trip per ~30ns of NEON-equivalent arithmetic.
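The overhead argument above is simple arithmetic and can be sketched as follows. The two timing constants are the approximate figures quoted in the message; the 30 fps frame budget is borrowed from the project's 1080p target, and the variable names are illustrative only.

```python
SUBMIT_WAIT_S = 50e-6  # ~50 us queue-submit/wait round-trip (from the text)
NEON_WORK_S = 30e-9    # ~30 ns of NEON-equivalent arithmetic (from the text)

# Overhead paid per unit of useful work when dispatching per block.
overhead_ratio = SUBMIT_WAIT_S / NEON_WORK_S

# Assumed frame budget at 30 fps.
frame_budget_s = 1.0 / 30.0

# How many synchronous round-trips fit into one frame budget at all.
round_trips_per_frame = frame_budget_s / SUBMIT_WAIT_S

# Share of the frame budget consumed by a single per-frame round-trip.
single_submit_share = SUBMIT_WAIT_S / frame_budget_s

print(f"overhead ratio      ~{overhead_ratio:.0f}x")
print(f"round-trips / frame ~{round_trips_per_frame:.0f}")
print(f"one submit / frame  ~{single_submit_share:.2%} of budget")
```

Under these assumptions the round-trip alone costs on the order of a thousand times the work it wraps, which is compatible with the measured ">600x" slowdown, while one submit per frame reduces the same cost to a fraction of a percent of the frame budget.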

NVDEC and Vulkan Video escape this by dispatching at picture-level.
Pi 5 has no dedicated H.264 hardware decode block and Mesa V3DV does
not implement VK_KHR_video_decode_h264; this project builds the same
*shape* (one submit per frame, one fence wait per frame, encoded
bitstream in, NV12 out) using V3D7 Vulkan compute as the substrate.

DESIGN.md covers:

  - architecture sketch (CPU side keeps entropy decode + descriptors;
    GPU runs 4-stage compute pipeline per frame)
  - per-MB descriptor layout (frame-shaped SSBO, ~8160 entries for 1080p)
  - inter-stage dependencies (vkCmdPipelineBarrier within one command
    buffer)
  - intra prediction wavefront (~187 dispatches per frame on diagonals)
  - libavcodec intercept point (macroblock-level, evolves the
    substitution shim from "dispatch now" to "append to frame buffer")
  - shader inventory (existing daedalus-fourier reuse + ~14 new ones)
  - 4-phase plan, 4-6 months total budget
  - 7 open questions including DPB allocation, qpel parameterization,
    daemon integration shape
  - explicit out-of-scope: VP9 / AV1 / HEVC / 10-bit / interlaced
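The two figures quoted in the list (~8160 descriptor entries and ~187 wavefront dispatches for 1080p) follow from the macroblock grid, as this small sketch shows. The helper names are hypothetical, and the diagonal schedule assumes each macroblock waits only on its left, top-left and top neighbours; if a top-right dependency has to be honoured, the schedule would be longer.

```python
def mb_dims(width: int, height: int) -> tuple:
    """Macroblock grid size; dimensions round up to a multiple of 16."""
    return ((width + 15) // 16, (height + 15) // 16)


def wavefront_steps(mb_w: int, mb_h: int) -> int:
    """Number of anti-diagonals when MB (x, y) runs at step x + y."""
    return mb_w + mb_h - 1


def diagonal(mb_w: int, mb_h: int, step: int) -> list:
    """All (x, y) macroblocks that can run together at a given step."""
    return [(x, step - x) for x in range(mb_w) if 0 <= step - x < mb_h]


mb_w, mb_h = mb_dims(1920, 1080)
print(mb_w, mb_h, mb_w * mb_h, wavefront_steps(mb_w, mb_h))
```

Each diagonal is one dispatch in this model, and the diagonals together cover every macroblock exactly once.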

This is design only.  No code beyond README.md and DESIGN.md.  User
review + redirect expected before Phase 1 implementation begins.

2026-05-23 22:44:03 +02:00

3 Commits