claude-noether 44e92fa3dc Stage 2 PR-A3b: real H.264 coefficients through daedalus-decoder, byte-exact

Final option-A deliverable.  CLI now extracts real per-MB
coefficients from libavcodec via the inspection callback +
side-buffer (marfrit-packages 0016 + 0017), reconstructs the
pre-residual predicted samples P via inverse-of-IDCT-add, and
feeds daedalus-decoder with real (P, C, no edges).  Daedalus
output BYTE-EXACT against libavcodec's pre-deblock AVFrame
across 5 frames at 320x240 and 3 frames at 1920x1088, all three
substrates (auto / cpu / qpu).

Path summary
------------

avctx->thread_count = 1                  (single-threaded decode — 0017's
                                           side buffer is per-H264Context;
                                           multi-threaded would race)
avctx->skip_loop_filter = AVDISCARD_ALL  (AVFrame stays pre-deblock so the
                                           P-recovery subtraction is exact)
ff_h264_set_mb_inspect_cb               (registers the callback)

Inspection callback (per MB, fires post-hl_decode_mb):
  - Gate on IS_INTRA4x4 && !IS_8x8DCT && !IS_INTRA_PCM (MBs that fail
    the gate fall back to identity-passthrough in the main loop)
  - Snapshot pre-deblock pixels from h->cur_pic.f->data[0]
  - Read coefficients from h->mb_inspect_coeffs (= sl->mb copy, the
    0017 side buffer)
  - For each 4x4 block (16/MB in raster order, indexed via
    raster_to_zscan[] to find its slot in the z-scan-ordered side
    buffer): compute IDCT(C) using a transcribed H.264 C reference,
    derive P = clip(pre_deblock - ((IDCT + 32) >> 6))
  - Stash per-MB capture (P + C) for the main loop
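
The per-block recovery step above can be sketched roughly as follows. This
is a minimal illustration, not the code in this PR: the function names are
hypothetical, the transform is a spec-style 4x4 inverse transform that reads
the 16 coefficients as row-major, and `>>` on negative values is assumed to
be an arithmetic shift (true for GCC and Clang).

```c
#include <stdint.h>

/* Raster 4x4-block index (0..15, row by row inside the MB) to its z-scan
 * slot, obtained by interleaving the bits of the block's x and y. */
int raster_to_zscan_idx(int raster)
{
    int x = raster & 3, y = raster >> 2;
    return (x & 1) | ((y & 1) << 1) | ((x >> 1) << 2) | ((y >> 1) << 3);
}

/* Spec-style 4x4 inverse transform without the final (+32) >> 6 rounding.
 * Assumes arithmetic right shift for negative operands. */
void idct4x4_raw(const int16_t c[16], int out[16])
{
    int f[16];
    for (int i = 0; i < 4; i++) {               /* horizontal pass */
        int d0 = c[4*i], d1 = c[4*i+1], d2 = c[4*i+2], d3 = c[4*i+3];
        int e0 = d0 + d2, e1 = d0 - d2;
        int e2 = (d1 >> 1) - d3, e3 = d1 + (d3 >> 1);
        f[4*i] = e0 + e3; f[4*i+1] = e1 + e2;
        f[4*i+2] = e1 - e2; f[4*i+3] = e0 - e3;
    }
    for (int j = 0; j < 4; j++) {               /* vertical pass */
        int f0 = f[j], f1 = f[4+j], f2 = f[8+j], f3 = f[12+j];
        int g0 = f0 + f2, g1 = f0 - f2;
        int g2 = (f1 >> 1) - f3, g3 = f1 + (f3 >> 1);
        out[j] = g0 + g3; out[4+j] = g1 + g2;
        out[8+j] = g1 - g2; out[12+j] = g0 - g3;
    }
}

uint8_t clip_u8(int v) { return v < 0 ? 0 : v > 255 ? 255 : (uint8_t)v; }

/* Decoder-side add: recon = clip(P + ((IDCT + 32) >> 6)). */
void reconstruct4x4(const uint8_t pred[16], const int16_t c[16],
                    uint8_t *recon, int stride)
{
    int r[16];
    idct4x4_raw(c, r);
    for (int y = 0; y < 4; y++)
        for (int x = 0; x < 4; x++)
            recon[y*stride + x] = clip_u8(pred[4*y+x] + ((r[4*y+x] + 32) >> 6));
}

/* Inverse of the add: P = clip(pre_deblock - ((IDCT + 32) >> 6)). */
void recover_pred4x4(const uint8_t *recon, int stride,
                     const int16_t c[16], uint8_t pred[16])
{
    int r[16];
    idct4x4_raw(c, r);
    for (int y = 0; y < 4; y++)
        for (int x = 0; x < 4; x++)
            pred[4*y+x] = clip_u8(recon[y*stride + x] - ((r[4*y+x] + 32) >> 6));
}
```

Where the original add saturated at 0 or 255, the recovered P can differ
from the decoder's true prediction, but re-adding the same residual still
lands on the same pixel, so the round trip stays byte-exact.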

Main loop:
  - Default identity-passthrough (predicted = AVFrame pixels, coeffs = 0)
  - For real-coeffs-valid MBs: override luma with captured P + C
  - flush_frame, byte-exact compare against AVFrame
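
A byte-exact plane compare of the kind used in the last step might look
like the sketch below. The function name and signature are hypothetical;
the only assumption carried over from libavcodec is that each plane has its
own stride (AVFrame linesize), which can be wider than the visible width.

```c
#include <stddef.h>
#include <stdint.h>

/* Count differing bytes between two planes. Only the first `width` bytes
 * of each row are compared, so row padding never counts as a difference. */
size_t count_plane_diff(const uint8_t *a, ptrdiff_t a_stride,
                        const uint8_t *b, ptrdiff_t b_stride,
                        int width, int height)
{
    size_t diff = 0;
    for (int y = 0; y < height; y++) {
        const uint8_t *row_a = a + (ptrdiff_t)y * a_stride;
        const uint8_t *row_b = b + (ptrdiff_t)y * b_stride;
        for (int x = 0; x < width; x++)
            diff += (row_a[x] != row_b[x]);
    }
    return diff;
}
```

A frame passes when the count is 0 for every plane, which is what the
"Y diff 0/N" and "UV diff 0/N" figures report.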

A diagnostic also asserts (silently when passing) that the
callback's pre_deblock snapshot equals AVFrame at each real-coeffs
MB position — i.e. h->cur_pic.f IS the eventual AVFrame buffer
under skip_loop_filter=AVDISCARD_ALL with thread_count=1.

Bug hunted in this PR
---------------------

Initial implementation transposed the coefficients from row-major
(sl->mb) to "column-major" (the layout that daedalus_decoder.h's
mb_input.coeffs docstring describes).  This caused ~0.2% Y pixel
divergence on real streams (~150/frame at 320x240).  Root cause
identified via a standalone /tmp/idct_compare.c harness running
daedalus's C ref IDCT and FFmpeg's reference C IDCT on identical
int16[16] inputs: outputs IDENTICAL.  The two functions implement
the spec H.264 IDCT on the array regardless of layout
interpretation; the "column-major" label is decoration.  Removed
the transpose; PR is now byte-exact.
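
A comparison harness in the spirit of the one described above could be
sketched as follows. This is not the /tmp/idct_compare.c from this PR: the
names are hypothetical, idct4x4_ref is a spec-style stand-in rather than
daedalus's or FFmpeg's actual code, and `>>` on negative values is assumed
to be an arithmetic shift. The second function models the bug by
transposing the input before transforming it.

```c
#include <stdint.h>
#include <string.h>

typedef void (*idct4x4_fn)(const int16_t in[16], int out[16]);

/* Spec-style 4x4 inverse transform with the final (+32) >> 6 rounding. */
void idct4x4_ref(const int16_t c[16], int out[16])
{
    int f[16];
    for (int i = 0; i < 4; i++) {               /* horizontal pass */
        int d0 = c[4*i], d1 = c[4*i+1], d2 = c[4*i+2], d3 = c[4*i+3];
        int e0 = d0 + d2, e1 = d0 - d2;
        int e2 = (d1 >> 1) - d3, e3 = d1 + (d3 >> 1);
        f[4*i] = e0 + e3; f[4*i+1] = e1 + e2;
        f[4*i+2] = e1 - e2; f[4*i+3] = e0 - e3;
    }
    for (int j = 0; j < 4; j++) {               /* vertical pass */
        int f0 = f[j], f1 = f[4+j], f2 = f[8+j], f3 = f[12+j];
        int g0 = f0 + f2, g1 = f0 - f2;
        int g2 = (f1 >> 1) - f3, g3 = f1 + (f3 >> 1);
        out[j]     = (g0 + g3 + 32) >> 6;
        out[4+j]   = (g1 + g2 + 32) >> 6;
        out[8+j]   = (g1 - g2 + 32) >> 6;
        out[12+j]  = (g0 - g3 + 32) >> 6;
    }
}

/* Models the mistake: same transform, but fed a transposed input. */
void idct4x4_transposed_input(const int16_t c[16], int out[16])
{
    int16_t t[16];
    for (int r = 0; r < 4; r++)
        for (int col = 0; col < 4; col++)
            t[4*col + r] = c[4*r + col];
    idct4x4_ref(t, out);
}

/* Small deterministic generator so runs are reproducible. */
static uint32_t lcg_next(uint32_t *s)
{
    *s = *s * 1664525u + 1013904223u;
    return *s >> 16;
}

/* Feed identical pseudo-random inputs to both functions and count the
 * inputs on which their 16 outputs are not all equal. */
int count_idct_mismatches(idct4x4_fn a, idct4x4_fn b, int trials, uint32_t seed)
{
    int mismatches = 0;
    for (int t = 0; t < trials; t++) {
        int16_t in[16];
        int out_a[16], out_b[16];
        for (int i = 0; i < 16; i++)
            in[i] = (int16_t)((int)(lcg_next(&seed) % 1025) - 512);
        a(in, out_a);
        b(in, out_b);
        if (memcmp(out_a, out_b, sizeof out_a) != 0)
            mismatches++;
    }
    return mismatches;
}
```

Two implementations that agree on the raw int16[16] array report zero
mismatches, while an extra transpose on one side shows up immediately.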

Follow-up task #184: clarify daedalus_decoder.h's mb_input.coeffs
docstring so future integrators don't repeat this transpose
mistake.

Result on hertz (Pi 5 V3D 7.1)
------------------------------

testsrc2 I-only via libx264 -bf 0 -g 1:

  320x240,    5 frames, substrate=auto:  Y diff 0/76800,    UV diff 0/38400   PASS
  320x240,    5 frames, substrate=cpu:   Y diff 0/76800,    UV diff 0/38400   PASS
  320x240,    5 frames, substrate=qpu:   Y diff 0/76800,    UV diff 0/38400   PASS
  1920x1088,  3 frames, substrate=auto:  Y diff 0/2088960,  UV diff 0/1044480 PASS

Real-coeffs path engaged for 77-95 MBs per 320x240 frame and
598-643 MBs per 1080p frame (testsrc2 is mostly flat → many
Intra_16x16 MBs that fall back to identity passthrough; richer
content streams would engage real-coeffs more).
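
For reference, the denominators in the table and the per-frame MB totals
follow from plain 4:2:0 arithmetic: a 320x240 frame holds 300 macroblocks
and a 1920x1088 frame holds 8160. The helper names below are hypothetical
and only illustrate the calculation.

```c
#include <stddef.h>

/* Luma samples in one frame. */
size_t luma_samples(int width, int height)
{
    return (size_t)width * (size_t)height;
}

/* Cb + Cr samples in one 4:2:0 frame (each plane is half size per axis). */
size_t chroma_samples_420(int width, int height)
{
    return 2 * (size_t)(width / 2) * (size_t)(height / 2);
}

/* Number of 16x16 macroblocks covering the frame. */
int mb_count(int width, int height)
{
    return ((width + 15) / 16) * ((height + 15) / 16);
}
```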

Followups
---------

  - PR-A4: extend the gate to Intra_16x16 (chroma DC Hadamard +
    Intra_16x16 luma DC Hadamard pre-pass) — currently ~30-60%
    of MBs fall back to identity-passthrough due to this.
  - PR-A5: extend to 8x8 transform (separate IDCT 8x8 dispatch
    path on the daedalus-decoder side, similar plumbing).
  - PR-A6: enable libavcodec's deblock (skip_loop_filter=AVDISCARD_NONE)
    and have daedalus's deblock produce the post-deblock output
    that matches AVFrame.  Closes the loop on the full I-only
    pipeline.
  - Task #184: daedalus_decoder.h coeffs docstring clarification.

2026-05-26 11:19:11 +02:00

include

wip: deblock dispatch

2026-05-25 23:14:24 +02:00

src

Stage 2 PR-b: deblock dispatch in flush_frame — luma + chroma, up to 8 submits

2026-05-25 23:30:37 +02:00

tests

Stage 2 PR-b: deblock dispatch in flush_frame — luma + chroma, up to 8 submits

2026-05-25 23:30:37 +02:00

tools

Stage 2 PR-A3b: real H.264 coefficients through daedalus-decoder, byte-exact

2026-05-26 11:19:11 +02:00

.gitignore

scaffold: CMake + API skeleton + smoke test

2026-05-24 22:08:46 +02:00

CMakeLists.txt

Stage 2 PR-A3b: real H.264 coefficients through daedalus-decoder, byte-exact

2026-05-26 11:19:11 +02:00

DESIGN.md

Merge pull request 'design: §9 open questions → Phase 1 decisions (user confirmed 2026-05-24)' (#1) from noether/design-decisions into main

2026-05-24 19:58:41 +00:00

LICENSE

scaffold: CMake + API skeleton + smoke test

2026-05-24 22:08:46 +02:00

README.md

initial design doc — frame-level GPU H.264 decoder for V3D7

2026-05-23 22:44:03 +02:00

README.md

daedalus-decoder

Frame-level GPU H.264 decoder for Raspberry Pi 5 / V3D7. Design phase — not implemented yet.

The objective: build the NVDEC-equivalent shape on Pi 5. One Vulkan submit per frame, one fence wait per frame, encoded H.264 bitstream in, NV12 frame out. Reuses daedalus-fourier's V3D compute primitives at the right granularity — not the per-block-call granularity that the kernel-substitution prototype exposed as architecturally wrong.

Sibling projects:

- daedalus-fourier: V3D + NEON kernel pack (IDCT, MC, deblock primitives). Stays as a research/microbench artifact.
- daedalus-v4l2: V4L2 stateless decoder shim + userspace daemon for Pi 5. The eventual consumer of this decoder.
- libva-v4l2-request-fourier: VAAPI ↔ V4L2 stateless bridge. End consumer.

See DESIGN.md for the architecture sketch.