Commit Graph

9 Commits

Author SHA1 Message Date
marfrit abd94e9db5 Merge pull request 'phase1/stage1: frame-scaled luma IDCT 4×4 — first GPU round-trip' (#3) from noether/phase1-stage1-idct into main
Reviewed-on: #3
2026-05-24 20:18:43 +00:00
claude-noether 69b124adf1 phase1/stage1: frame-scaled luma IDCT 4x4 dispatch — first GPU round-trip
flush_frame now performs a real GPU dispatch via the daedalus-fourier
public API at frame batch granularity, in contrast to the substitution-
arc shim that paid Vulkan sync overhead per-block.

What's wired:

  - Build per-frame luma-4x4 meta[] in raster order across all MBs
    (N_MBs × 16 entries; 130,560 for 1080p)
  - Repack per-MB coeffs[] (384 int16; first 256 are luma) into a flat
    block-major coeffs buffer (n_blocks × 16 int16)
  - Allocate a frame-sized scratch Y plane, zero-initialised — no intra
    prediction yet so "predicted" = 0
  - daedalus_recipe_dispatch_h264_idct4(ctx, scratch_y, stride, coeffs,
    n_blocks, meta) — ONE call, ONE vkQueueSubmit, ONE vkQueueWaitIdle
  - Copy result to caller's out_y at requested stride
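
The repack step above can be sketched roughly as follows.  This is an
illustration only, not the code in this commit: the helper name
repack_luma_coeffs and the assumption that every MB's 384-entry coeffs[]
sits back to back in one staging array are hypothetical.  It keeps the
flat per-MB block order described here (no z-scan permutation).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum {
    COEFFS_PER_MB      = 384, /* 256 luma + 128 chroma int16 per MB */
    COEFFS_PER_BLOCK   = 16,  /* one 4x4 block */
    LUMA_BLOCKS_PER_MB = 16,
    LUMA_COEFFS_PER_MB = LUMA_BLOCKS_PER_MB * COEFFS_PER_BLOCK /* 256 */
};

/* Hypothetical helper: copy the luma part of every MB's coeffs[] into a
 * flat block-major buffer (n_mbs * 16 blocks, 16 int16 each).
 * Returns the number of 4x4 blocks written. */
size_t repack_luma_coeffs(const int16_t *mb_coeffs, size_t n_mbs,
                          int16_t *flat)
{
    assert(mb_coeffs != NULL && flat != NULL);
    for (size_t mb = 0; mb < n_mbs; mb++) {
        /* Luma is the first 256 entries of each MB; chroma is skipped. */
        memcpy(flat + mb * LUMA_COEFFS_PER_MB,
               mb_coeffs + mb * COEFFS_PER_MB,
               LUMA_COEFFS_PER_MB * sizeof(int16_t));
    }
    return n_mbs * LUMA_BLOCKS_PER_MB;
}
```

For a 1080p frame (8160 MBs) the return value would be 130,560, matching
the meta[] entry count above.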

Measured on hertz (Pi 5 / V3D 7.1 / daedalus-fourier 0.1.0 post-pool):

  $ time ./build/test_smoke
  daedalus-decoder version: 0.0.1
  ctx created: 1920x1088, has_qpu=1
  appended 8160 MBs (120x68)
  flush_frame rc=0
  Y non-zero bytes: 0 / 2088960
  UV non-128 bytes: 0 / 1044480
  smoke OK
  real  0m0.163s

163 ms wall-clock for a full 1080p frame, including ctx-create (Vulkan
init).  Per-block dispatch via the substitution arc would have paid an
estimated 130,560 × ~50us ≈ 6.5s on the same workload, so dispatching
at frame granularity is roughly 40x faster.
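
The estimate above can be reproduced with a few lines of arithmetic.
A small sketch using only the figures quoted in this message (8160 MBs,
16 luma blocks per MB, ~50us per submit/wait round-trip); the function
names are illustrative and not part of the decoder.

```c
#include <assert.h>
#include <stdint.h>

/* Number of luma 4x4 blocks in a frame of n_mbs macroblocks. */
uint32_t luma_blocks_per_frame(uint32_t n_mbs)
{
    return n_mbs * 16u;
}

/* Estimated wall time in seconds if every block pays one round-trip. */
double per_block_dispatch_seconds(uint32_t n_blocks, double roundtrip_us)
{
    assert(roundtrip_us >= 0.0);
    return (double)n_blocks * roundtrip_us * 1e-6;
}
```

With n_mbs = 8160 this gives 130,560 blocks and about 6.53 s at 50us
each, which is where the ~40x ratio against the measured 0.163 s comes
from.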

Smoke validates:
  - flush_frame succeeds (rc=0) on a complete frame
  - Zero-coefficient input → zero-pixel Y output (clip255(IDCT(zeros))=0)
  - UV plane filled with neutral grey 128 (placeholder until chroma
    dispatch lands)
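
The two byte counts printed by the smoke test can be computed with a
check of roughly this shape.  A sketch under assumptions: the helper
name is hypothetical and the actual test_smoke.c may be structured
differently.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Count bytes in buf[0..len) that differ from `expected`. */
size_t count_bytes_not_equal(const uint8_t *buf, size_t len, uint8_t expected)
{
    assert(buf != NULL || len == 0);
    size_t n = 0;
    for (size_t i = 0; i < len; i++)
        n += (buf[i] != expected);
    return n;
}
```

Called with expected = 0 on the Y plane and expected = 128 on the UV
plane, a result of 0 for both corresponds to the "Y non-zero bytes" and
"UV non-128 bytes" lines in the output above.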

What's deliberately deferred to follow-on sub-PRs:

  - Intra prediction wavefront (Stage 2a) — predicted=0 means output
    pixels are residual-only, not a valid frame decode.  Sufficient for
    Vulkan round-trip validation; not bit-exact vs FFmpeg yet.
  - Motion compensation (Stage 2b) for inter MBs
  - High-profile IDCT 8x8 (Stage 1 extension)
  - Deblocking filter (Stage 4)
  - Chroma 4x4 IDCT — needs separate dispatch with chroma stride
  - Z-scan permutation of per-MB 4x4 block order (currently flat
    raster; FFmpeg's per-MB coeffs[] uses spec §6.4.3 z-scan).
    Bit-exact against FFmpeg requires this permutation; deferred to
    the test-vector PR.
  - dmabuf export (still memcpy-out)
  - Stage 5 RGBA opt-in
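
For reference, the deferred z-scan permutation is small.  The sketch
below maps a 4x4 luma block index in z-scan order to its raster position
inside the macroblock, following the block layout of spec §6.4.3.  The
function name is hypothetical, and where the mapping would be applied
(meta[] construction or the coeffs repack) is left open here.

```c
#include <assert.h>

/* Map a 4x4 luma block index in z-scan order to its raster index within
 * the macroblock (row * 4 + column, both in 4x4-block units). */
int zscan_to_raster_4x4(int blk_idx)
{
    assert(blk_idx >= 0 && blk_idx < 16);
    /* Upper two bits pick the 8x8 quadrant, lower two the block inside it. */
    int x = ((blk_idx >> 2) & 1) * 2 + (blk_idx & 1);
    int y = ((blk_idx >> 3) & 1) * 2 + ((blk_idx >> 1) & 1);
    return y * 4 + x;
}
```

The mapping only swaps two bits of the index, so applying it twice
returns the original index; the same function serves as raster-to-z-scan.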

API surface unchanged from the scaffold PR; only the body of
flush_frame becomes non-stub.  Internal helpers stay file-local.

Stacks on noether/repo-scaffold (PR #2).  Rebase on main after #2
lands; the diff is purely additive against the scaffold.
2026-05-24 22:15:35 +02:00
marfrit 90d7c546bd Merge pull request 'scaffold: CMake + API skeleton + smoke test' (#2) from noether/repo-scaffold into main
Reviewed-on: #2
2026-05-24 20:10:25 +00:00
claude-noether 08080f062c scaffold: CMake + API skeleton + smoke test
First code on daedalus-decoder per the Phase 1 decisions merged 2026-05-24.
Repo skeleton only — no Vulkan pipeline yet, no shaders, no libavcodec
intercept.  Establishes the build shape so subsequent work has a place
to land.

Layout:

  LICENSE                          BSD-2-Clause (matches daedalus-fourier)
  .gitignore                       build/, CMake artefacts, *.spv
  CMakeLists.txt                   top-level — finds daedalus-fourier
                                   ≥0.1.0 via pkg-config (per §9.6
                                   decision: find_package, pinned to
                                   tagged release; .pc consumed via
                                   pkg_check_modules until we ship a
                                   CMake config), Vulkan via
                                   find_package, builds static lib
                                   + smoke test, GNUInstallDirs install
  include/daedalus_decoder.h       public API surface:
                                     - daedalus_decoder_{create,destroy,
                                                         version,has_qpu}
                                     - daedalus_decoder_set_output_format
                                       (NV12 default, RGBA opt-in per §5)
                                     - daedalus_decoder_append_mb +
                                       struct daedalus_decoder_mb_input
                                       (matches §3 per-MB descriptor)
                                     - daedalus_decoder_flush_frame
                                       (per-frame submit + wait)
                                     - daedalus_decoder_export_dmabuf
                                       (Vulkan-native VkImage export per
                                       §9.4 decision)
                                   Dimensions are CODED frame size
                                   (mod-16), not displayed — caller
                                   translates from SPS + crop offsets.
  src/internal.h                   internal mb_desc struct (matches
                                   shader std430 layout, to be nailed
                                   down once shaders exist) + per-ctx
                                   state
  src/daedalus_decoder.c           stub bodies:
                                     - create/destroy with proper resource
                                       lifecycle
                                     - append_mb validates + writes CPU
                                       staging buffers (no GPU yet)
                                     - flush_frame returns -2 (not
                                       implemented) — Phase 1 work
                                     - export_dmabuf returns -1
                                     - has_qpu / version diagnostics
  tests/test_smoke.c               link + lifecycle test: bad dims
                                   reject, OOB MB reject, null inputs
                                   reject, raster-order enforcement,
                                   mid-frame format-change reject,
                                   incomplete-frame flush reject.
                                   On hosts without V3D7 Vulkan,
                                   SKIPs gracefully (returns 0).

Verified on hertz (Pi 5 / V3D 7.1 / Mesa V3DV via daedalus-fourier
0.1.0):

  $ cmake -B build -G Ninja -DCMAKE_BUILD_TYPE=Release
  $ cmake --build build
  $ ctest --test-dir build --output-on-failure
  Test #1: smoke ... Passed

  $ ./build/test_smoke
  daedalus-decoder version: 0.0.1
  ctx created: 1920x1088, has_qpu=1
  smoke OK

Note the coded-vs-displayed dims trap: 1080p H.264 has coded height
1088, with 8 rows cropped via the SPS frame_cropping_flag /
frame_crop_*_offset fields.  The header docstring on
daedalus_decoder_create() spells this out so future callers don't hit
the multiple-of-16 reject (the smoke test caught it while the scaffold
was being written).
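
The relationship between displayed and coded size can be sketched as a
round-up to the macroblock size.  The helper name is hypothetical; a
real caller would normally take the coded size from the SPS
(pic_width_in_mbs_minus1, pic_height_in_map_units_minus1) rather than
rounding a display size.

```c
#include <assert.h>
#include <stdint.h>

/* Round a displayed dimension up to the next multiple of 16 (one MB). */
uint32_t coded_dim_from_displayed(uint32_t displayed)
{
    assert(displayed <= UINT32_MAX - 15u);
    return (displayed + 15u) & ~15u;
}
```

For 1920x1080 this yields 1920x1088, the size passed to ctx creation in
the transcript above.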

Next: Phase 1 implementation begins — IDCT 4×4 / 8×8 frame-scaled
dispatch (reusing daedalus-fourier shaders per Appendix A), intra
prediction wavefront, reconstruct stage, NV12 output via dmabuf
export.  Smoke test grows from "ctx lifecycle works" to
"I-frame-only Baseline decode bit-exact vs FFmpeg reference".
2026-05-24 22:08:46 +02:00
marfrit 4c9f07f082 Merge pull request 'design: §9 open questions → Phase 1 decisions (user confirmed 2026-05-24)' (#1) from noether/design-decisions into main
Reviewed-on: #1
2026-05-24 19:58:41 +00:00
claude-noether 7cbf4ce15b design: §9 open questions → Phase 1 decisions (user confirmed 2026-05-24)
All seven questions from the initial design draft decided in the
user's 2026-05-24 review:

  1. Intra prediction: GPU wavefront in Phase 1, revisit if bottleneck
  2. libavcodec intercept: macroblock-level for Phase 1
  3. Shader parameterisation: measure both during Phase 2 MC, pick winner
  4. DPB allocation: Vulkan-native VkImage with dma_buf export
  5. Daemon integration: library link
  6. daedalus-fourier dep: CMake find_package, pinned to tagged release
  7. Codec scope: H.264 first; HEVC/10-bit/interlaced/FMO/ASO firmly out;
     VP9 + AV1 deferred to Phase 5+ but NOT firmly out (scope expansion
     vs the initial draft which had grouped them with HEVC)

Section heading renamed "Open questions" → "Phase 1 decisions" with
explicit user-confirmed annotations.  Each item preserves the original
wording for traceability.

§8 Phasing extended with a Phase 5+ paragraph clarifying the VP9/AV1
deferral and reaffirming HEVC's firmly-out status.

No architecture changes; only decisions captured.  Phase 1
implementation can now begin against this baseline.
2026-05-24 21:57:20 +02:00
claude-noether 8a4fb10a7f design: appendices A (shader reuse audit) + B (libavcodec intercept) + C (risk register)
Read-only research done autonomously while push to marfrit/daedalus-decoder
is blocked on user perms.  All findings appended to DESIGN.md; no new
files, no architecture changes.

Appendix A — daedalus-fourier shader reuse audit
  - 2 shaders directly reusable (v3d_h264_idct4, v3d_h264_idct8)
    just at frame scale instead of n_blocks=1 per call
  - 2 shaders partial-reuse (v3d_h264deblock + v3d_h264_qpel_mc20)
    serve as templates for ~20 sibling variants (horizontal/chroma
    deblock variants, 15 missing qpel positions + 16x16 size + avg)
  - 5 daedalus-fourier shaders not reusable (VP9/AV1 codec-specific)
  - 7 brand-new shaders required (iquant, intra prediction modes,
    chroma MC, reconstruct, optional yuv→rgba)
  - ~22 H.264 shaders total; estimate 6-10 weeks for the inventory
    if done in sequence with M1 bit-exact gate methodology

Appendix B — libavcodec intercept point
  - decode_slice() at libavcodec/h264_slice.c:2598 is the loop site
  - Per-MB sequence: ff_h264_decode_mb_cabac → ff_h264_hl_decode_mb
  - Intercept replaces ff_h264_hl_decode_mb with a stub that snapshots
    sl->mb[] (coefficients), MV/ref caches, intra modes, mb_type, QP,
    non_zero_count_cache into a frame-shaped descriptor SSBO
  - End-of-slice flush builds + submits the GPU pipeline
  - CABAC/CAVLC stay in libavcodec (we don't re-implement entropy)
  - New FFmpeg patch in marfrit-packages, sibling to 0003-0007:
    0008-h264-daedalus-decoder-frame-pipeline.patch
  - daedalus_decoder_active(h) gates the intercept; default OFF =
    no-op = full coexistence with the kernel-pack substitution arc

Appendix C — risk register
  - 6 risks catalogued: intra wavefront perf, qpel shader explosion,
    Stage 5 colourspace bugs, Mesa V3DV concurrency, daedalus-fourier
    pin drift, Phase 4 30fps@1080p target miss
  - Highest impact: project fails to beat NEON.  Acknowledged from
    project start (§10), which carries explicit pivot language.
2026-05-23 23:10:39 +02:00
claude-noether 4182b32adf design: optional Stage 5 NV12 → RGBA conversion
User question 2026-05-23: 'Wayland does need a conversion of NV12 to
its output format. Could we cram that in?'

Yes — trivially.  Added Stage 5 to the pipeline doc with:

  - 5-line per-pixel compute shader (BT.709 limited-range example
    given; matrix selected from H.264 VUI at runtime)
  - explicit OPT-IN flag, off by default
  - rationale for default-off: most consumers (V4L2 stateless,
    Wayland zwp_linux_dmabuf NV12 passthrough, Firefox/mpv VAAPI
    paths) want NV12 because compositors convert during composition
    essentially for free.  RGBA8 is ~2.7x the bandwidth of NV12 (4 vs
    1.5 bytes per pixel), so it is not worth burning DMA + electrons
    when no downstream needs it
  - colourspace metadata plumbing requirement: SPS vui_parameters
    (colour_primaries, transfer_characteristics, matrix_coefficients,
    video_full_range_flag) MUST flow through to the shader; default
    BT.709 limited-range with warning if VUI absent

Updated the new-shader inventory to include v3d_h264_yuv_to_rgba.
Total dispatches/frame remains ~190-200; Stage 5 adds one.
2026-05-23 22:46:45 +02:00
claude-noether 59885dd868 initial design doc — frame-level GPU H.264 decoder for V3D7
Path C of the 2026-05-23 architecture decision after the daedalus-
fourier substitution arc's per-block QPU dispatch was measured to be
>600x slower than NEON in production.  Root cause: per-block synchronous
Vulkan dispatch from inside libavcodec's per-MB loops, paying ~50us of
queue-submit/wait round-trip per ~30ns of NEON-equivalent arithmetic.

NVDEC and Vulkan Video escape this by dispatching at picture-level.
Pi 5 has no dedicated H.264 hardware decode block and Mesa V3DV does
not implement VK_KHR_video_decode_h264; this project builds the same
*shape* (one submit per frame, one fence wait per frame, encoded
bitstream in, NV12 out) using V3D7 Vulkan compute as the substrate.

DESIGN.md covers:

  - architecture sketch (CPU side keeps entropy decode + descriptors;
    GPU runs 4-stage compute pipeline per frame)
  - per-MB descriptor layout (frame-shaped SSBO, ~8160 entries for 1080p)
  - inter-stage dependencies (vkCmdPipelineBarrier within one command
    buffer)
  - intra prediction wavefront (~187 dispatches per frame on diagonals)
  - libavcodec intercept point (macroblock-level, evolves the
    substitution shim from "dispatch now" to "append to frame buffer")
  - shader inventory (existing daedalus-fourier reuse + ~14 new ones)
  - 4-phase plan, 4-6 months total budget
  - 7 open questions including DPB allocation, qpel parameterization,
    daemon integration shape
  - explicit out-of-scope: VP9 / AV1 / HEVC / 10-bit / interlaced

This is design only.  No code beyond README.md and DESIGN.md.  User
review + redirect expected before Phase 1 implementation begins.
2026-05-23 22:44:03 +02:00