claude-noether 59885dd868 initial design doc — frame-level GPU H.264 decoder for V3D7

This is Path C of the 2026-05-23 architecture decision, made after the
daedalus-fourier substitution arc's per-block QPU dispatch was measured
to be >600x slower than NEON in production.  Root cause: per-block
synchronous Vulkan dispatch from inside libavcodec's per-MB loops, paying
~50us of queue-submit/wait round-trip per ~30ns of NEON-equivalent
arithmetic.
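
The overhead argument above can be checked with a few lines of arithmetic. This is a rough sketch, not a measurement: the two constants are the approximate figures quoted in the text, and the "one dispatch per macroblock" count for a 1080p frame is a hypothetical lower bound chosen only for illustration.

```python
# Approximate figures quoted in the text above.
SUBMIT_WAIT_S = 50e-6  # ~50 us queue-submit/wait round trip per dispatch
NEON_WORK_S = 30e-9    # ~30 ns of NEON-equivalent arithmetic per block


def overhead_ratio(overhead_s, work_s):
    """Fixed dispatch cost expressed as a multiple of the useful work."""
    return overhead_s / work_s


def per_frame_overhead_s(dispatches, overhead_s=SUBMIT_WAIT_S):
    """Total submit/wait cost for a frame, given a dispatch count."""
    return dispatches * overhead_s


ratio = overhead_ratio(SUBMIT_WAIT_S, NEON_WORK_S)

# Hypothetical: at least one synchronous dispatch per macroblock at 1080p.
per_block_cost = per_frame_overhead_s(8160)
# Frame-level shape: a single submit and a single fence wait.
per_frame_cost = per_frame_overhead_s(1)

print(f"overhead / work per block: ~{ratio:.0f}x")
print(f"per-block dispatch: {per_block_cost * 1e3:.0f} ms of overhead per frame")
print(f"per-frame dispatch: {per_frame_cost * 1e6:.0f} us of overhead per frame")
```

Under these assumptions the round trip dwarfs the arithmetic by three orders of magnitude, which is in line with the measured >600x slowdown, and the overhead only becomes small once it is paid once per frame.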

NVDEC and Vulkan Video escape this by dispatching at picture-level.
Pi 5 has no dedicated H.264 hardware decode block and Mesa V3DV does
not implement VK_KHR_video_decode_h264; this project builds the same
*shape* (one submit per frame, one fence wait per frame, encoded
bitstream in, NV12 out) using V3D7 Vulkan compute as the substrate.

DESIGN.md covers:

  - architecture sketch (CPU side keeps entropy decode + descriptors;
    GPU runs 4-stage compute pipeline per frame)
  - per-MB descriptor layout (frame-shaped SSBO, ~8160 entries for 1080p)
  - inter-stage dependencies (vkCmdPipelineBarrier within one command
    buffer)
  - intra prediction wavefront (~187 dispatches per frame on diagonals)
  - libavcodec intercept point (macroblock-level, evolves the
    substitution shim from "dispatch now" to "append to frame buffer")
  - shader inventory (existing daedalus-fourier reuse + ~14 new ones)
  - 4-phase plan, 4-6 months total budget
  - 7 open questions including DPB allocation, qpel parameterization,
    daemon integration shape
  - explicit out-of-scope: VP9 / AV1 / HEVC / 10-bit / interlaced
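
The two counts in the list above (~8160 descriptor entries and ~187 wavefront dispatches for 1080p) can be reproduced with a short sketch. It assumes 16x16 macroblocks with the frame padded up to a whole number of macroblocks, and a plain anti-diagonal wavefront where every macroblock with the same x + y runs in one dispatch. The function names are illustrative; whether DESIGN.md uses exactly this ordering is not stated here.

```python
MB_SIZE = 16  # H.264 macroblocks are 16x16 luma samples


def mb_grid(width, height):
    """Macroblock columns and rows, rounding partial macroblocks up."""
    mbs_x = (width + MB_SIZE - 1) // MB_SIZE
    mbs_y = (height + MB_SIZE - 1) // MB_SIZE
    return mbs_x, mbs_y


def antidiagonals(mbs_x, mbs_y):
    """Group macroblock coordinates (x, y) by diagonal index d = x + y."""
    waves = []
    for d in range(mbs_x + mbs_y - 1):
        y_lo = max(0, d - mbs_x + 1)
        y_hi = min(mbs_y, d + 1)
        waves.append([(d - y, y) for y in range(y_lo, y_hi)])
    return waves


mbs_x, mbs_y = mb_grid(1920, 1080)
waves = antidiagonals(mbs_x, mbs_y)

print(f"grid: {mbs_x} x {mbs_y} = {mbs_x * mbs_y} macroblocks")
print(f"wavefront dispatches: {len(waves)}")
print(f"widest diagonal: {max(len(w) for w in waves)} macroblocks")
```

With this ordering the left and top neighbours of a macroblock always sit on an earlier diagonal. The top-right neighbour does not (it shares the same x + y), so a wavefront that also honours that dependency would need a steeper skew and more dispatches; that variant is not shown.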

This is design only.  No code beyond README.md and DESIGN.md.  User
review + redirect expected before Phase 1 implementation begins.

2026-05-23 22:44:03 +02:00

daedalus-decoder

Frame-level GPU H.264 decoder for Raspberry Pi 5 / V3D7. Design phase — not implemented yet.

The objective: build the NVDEC-equivalent shape on Pi 5. One Vulkan submit per frame, one fence wait per frame, encoded H.264 bitstream in, NV12 frame out. Reuses daedalus-fourier's V3D compute primitives at the right granularity — not the per-block-call granularity that the kernel-substitution prototype exposed as architecturally wrong.
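
The shift from "dispatch now" to "append to frame buffer" can be illustrated with a minimal sketch. Everything here is hypothetical: `FrameBatcher`, `fake_submit`, and the descriptor dictionaries stand in for the real descriptor SSBO and Vulkan queue submit, which are not defined at this design stage.

```python
class FrameBatcher:
    """Collects per-macroblock descriptors and submits them once per frame."""

    def __init__(self, submit_fn):
        self._submit_fn = submit_fn  # stand-in for one queue submit + fence wait
        self._pending = []
        self.submits = 0

    def append(self, descriptor):
        """Called per macroblock; records work without touching the GPU."""
        self._pending.append(descriptor)

    def end_frame(self):
        """Hand the whole frame's descriptors over in a single submit."""
        batch, self._pending = self._pending, []
        self.submits += 1
        return self._submit_fn(batch)


def fake_submit(batch):
    # Placeholder for the GPU work: just report how many entries arrived.
    return len(batch)


batcher = FrameBatcher(fake_submit)
for mb_index in range(8160):  # one descriptor per macroblock at 1080p
    batcher.append({"mb": mb_index})
processed = batcher.end_frame()

print(f"{processed} descriptors handled in {batcher.submits} submit(s)")
```

The point of the shape is that the per-macroblock path only appends to memory, so the fixed submit and wait cost is paid once per frame regardless of how many macroblocks the frame contains.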

Sibling projects:

- daedalus-fourier: V3D + NEON kernel pack (IDCT, MC, deblock primitives). Stays as research/microbench artifact.
- daedalus-v4l2: V4L2 stateless decoder shim + userspace daemon for Pi 5. The eventual consumer of this decoder.
- libva-v4l2-request-fourier: VAAPI ↔ V4L2 stateless bridge. End consumer.

See DESIGN.md for the architecture sketch.