daedalus-decoder

2 Commits 3 Branches 0 Tags


Author

SHA1

Message

Date

claude-noether

7cbf4ce15b

design: §9 open questions → Phase 1 decisions (user confirmed 2026-05-24)

All seven questions from the initial design draft were decided in the
user's 2026-05-24 review:

  1. Intra prediction: GPU wavefront in Phase 1, revisit if bottleneck
  2. libavcodec intercept: macroblock-level for Phase 1
  3. Shader parameterisation: measure both during Phase 2 MC, pick winner
  4. DPB allocation: Vulkan-native VkImage with dma_buf export
  5. Daemon integration: library link
  6. daedalus-fourier dep: CMake find_package, pinned to tagged release
  7. Codec scope: H.264 first; HEVC/10-bit/interlaced/FMO/ASO firmly out;
     VP9 + AV1 deferred to Phase 5+ but NOT firmly out (scope expansion
     vs the initial draft which had grouped them with HEVC)

Section heading renamed "Open questions" → "Phase 1 decisions" with
explicit user-confirmed annotations.  Each item preserves the original
wording for traceability.

§8 Phasing extended with a Phase 5+ paragraph clarifying the VP9/AV1
deferral and reaffirming HEVC's firmly-out status.

No architecture changes; only decisions captured.  Phase 1
implementation can now begin against this baseline.

2026-05-24 21:57:20 +02:00

claude-noether

59885dd868

initial design doc — frame-level GPU H.264 decoder for V3D7

Path C of the 2026-05-23 architecture decision, taken after the
daedalus-fourier substitution arc's per-block QPU dispatch was measured
to be >600x slower than NEON in production.  Root cause: per-block synchronous
Vulkan dispatch from inside libavcodec's per-MB loops, paying ~50us of
queue-submit/wait round-trip per ~30ns of NEON-equivalent arithmetic.
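
As a rough check on those figures, the sketch below compares the round-trip cost with the arithmetic it wraps, and totals the synchronous overhead for one frame. It uses only the approximate numbers quoted above. The function names and the per-frame dispatch count are hypothetical, chosen for illustration, and are not taken from the project.

```python
# Back-of-envelope model of dispatch overhead.
# The two constants are the approximate figures quoted in the commit message.
SUBMIT_WAIT_S = 50e-6   # ~50 us queue-submit/wait round trip per dispatch
NEON_WORK_S = 30e-9     # ~30 ns of NEON-equivalent arithmetic per block


def overhead_ratio(submit_s=SUBMIT_WAIT_S, work_s=NEON_WORK_S):
    """Round-trip cost relative to the useful arithmetic it wraps."""
    return submit_s / work_s


def sync_overhead_per_frame(dispatches, submit_s=SUBMIT_WAIT_S):
    """Total time spent in submit/wait round trips for one frame."""
    return dispatches * submit_s


ratio = overhead_ratio()

# Hypothetical dispatch counts: thousands of synchronous dispatches per
# frame (assumed 8160 here) versus a single submit per frame.
per_block_s = sync_overhead_per_frame(8160)
per_frame_s = sync_overhead_per_frame(1)

print(f"overhead/work ratio: ~{ratio:.0f}x")
print(f"per-block dispatch overhead: {per_block_s * 1e3:.0f} ms per frame")
print(f"per-frame dispatch overhead: {per_frame_s * 1e6:.0f} us per frame")
```

Under these assumptions the round trip costs on the order of a thousand times the work it wraps, which is consistent with the measured slowdown being dominated by submit/wait overhead rather than by the arithmetic itself.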

NVDEC and Vulkan Video escape this by dispatching at picture-level.
Pi 5 has no dedicated H.264 hardware decode block and Mesa V3DV does
not implement VK_KHR_video_decode_h264; this project builds the same
*shape* (one submit per frame, one fence wait per frame, encoded
bitstream in, NV12 out) using V3D7 Vulkan compute as the substrate.

DESIGN.md covers:

  - architecture sketch (CPU side keeps entropy decode + descriptors;
    GPU runs 4-stage compute pipeline per frame)
  - per-MB descriptor layout (frame-shaped SSBO, ~8160 entries for 1080p)
  - inter-stage dependencies (vkCmdPipelineBarrier within one command
    buffer)
  - intra prediction wavefront (~187 dispatches per frame on diagonals)
  - libavcodec intercept point (macroblock-level, evolves the
    substitution shim from "dispatch now" to "append to frame buffer")
  - shader inventory (existing daedalus-fourier reuse + ~14 new ones)
  - 4-phase plan, 4-6 months total budget
  - 7 open questions including DPB allocation, qpel parameterisation,
    daemon integration shape
  - explicit out-of-scope: VP9 / AV1 / HEVC / 10-bit / interlaced
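
The two counts in the list above (~8160 descriptor entries and ~187 wavefront dispatches for 1080p) can be reproduced with a short sketch. It assumes 16x16 macroblocks, a coded height rounded up to a whole macroblock, and one dispatch per anti-diagonal `x + y` of the macroblock grid. That grouping matches the quoted figure, but the actual dependency pattern in DESIGN.md is not shown here, and the helper names are hypothetical.

```python
MB_SIZE = 16  # H.264 macroblocks cover 16x16 luma samples


def mb_grid(width, height, mb=MB_SIZE):
    """Macroblock columns and rows, rounding each dimension up."""
    return (width + mb - 1) // mb, (height + mb - 1) // mb


def wavefront_diagonals(cols, rows):
    """Group macroblock coordinates by anti-diagonal index d = x + y.

    Assumption: every macroblock on one anti-diagonal can be processed
    in the same dispatch, once the previous diagonal has finished.
    """
    diagonals = [[] for _ in range(cols + rows - 1)]
    for y in range(rows):
        for x in range(cols):
            diagonals[x + y].append((x, y))
    return diagonals


cols, rows = mb_grid(1920, 1080)
diags = wavefront_diagonals(cols, rows)

print("macroblock grid:", cols, "x", rows)          # 120 x 68
print("descriptor entries:", cols * rows)           # 8160
print("wavefront dispatches:", len(diags))          # 187
print("widest diagonal:", max(len(d) for d in diags))  # 68
```

In this model the widest diagonal holds only 68 macroblocks, so each wavefront dispatch is small compared with the full frame.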

This is design only.  No code beyond README.md and DESIGN.md.  User
review + redirect expected before Phase 1 implementation begins.

2026-05-23 22:44:03 +02:00