Stage 2 PR-A3b: real H.264 coefficients through daedalus-decoder, byte-exact #15

Merged
marfrit merged 1 commits from noether/tools-h264-real-coeffs into main 2026-05-26 09:36:04 +00:00
Owner

Final option-A deliverable. CLI extracts real per-MB coefficients from libavcodec via the inspection callback + side-buffer (marfrit-packages 0016 + 0017, both merged), reconstructs pre-residual predicted samples P via inverse-of-IDCT-add, and feeds daedalus-decoder with real (P, C, no edges). Byte-exact against libavcodec's pre-deblock AVFrame across all three substrates and both test resolutions.

Path summary

avctx->thread_count     = 1;                  // 0017 side buffer is per-H264Context
avctx->skip_loop_filter = AVDISCARD_ALL;      // AVFrame stays pre-deblock
ff_h264_set_mb_inspect_cb(avctx, inspect_cb, &state);

Inspection callback (per MB, fires post-hl_decode_mb):

  1. Gate on IS_INTRA4x4 && !IS_8x8DCT && !IS_INTRA_PCM (other MB flavours fall back to identity-passthrough in the main loop).
  2. Snapshot pre-deblock pixels from h->cur_pic.f->data[0].
  3. Read coefficients from h->mb_inspect_coeffs (the 0017 side buffer — sl->mb snapshot taken before IDCT zeros it).
  4. For each 4×4 block (16/MB in raster order, indexed via raster_to_zscan[] to find its slot in the z-scan-ordered side buffer): compute IDCT(C) using a transcribed H.264 C ref, derive P = clip(pre_deblock - ((IDCT + 32) >> 6)).
  5. Stash per-MB capture for the main loop.

A diagnostic asserts (silently on pass) that the callback's pre_deblock snapshot equals AVFrame at each real-coeffs MB position — i.e. h->cur_pic.f IS the eventual AVFrame buffer under skip_loop_filter=AVDISCARD_ALL + thread_count=1.

Bug hunted in this PR

Initial implementation transposed the coefficients from row-major (sl->mb) to "column-major" (the layout that daedalus_decoder.h's mb_input.coeffs docstring describes). This caused ~0.2% Y pixel divergence on real streams (~150/frame at 320×240).

Root cause identified via a standalone /tmp/idct_compare.c harness running daedalus's C ref IDCT and FFmpeg's reference C IDCT on identical int16[16] inputs: outputs are IDENTICAL. The two functions implement the spec H.264 IDCT on the array regardless of layout interpretation; the "column-major" label is decoration. Removed the transpose; PR is now byte-exact.

Follow-up task #184: clarify daedalus_decoder.h's mb_input.coeffs docstring so future integrators don't repeat this transpose mistake.

Result on hertz (Pi 5 V3D 7.1)

testsrc2 I-only via libx264 -bf 0 -g 1:

dims frames substrate Y diff UV diff real-coeffs MBs
320×240 5 auto 0/76800 0/38400 77-95/300
320×240 5 cpu 0/76800 0/38400 77-95/300
320×240 5 qpu 0/76800 0/38400 77-95/300
1920×1088 3 auto 0/2088960 0/1044480 598-643/8160

testsrc2 is mostly flat → many Intra_16x16 MBs that fall back to identity-passthrough; richer content streams would engage real-coeffs path more.

Followups

  • PR-A4: extend gate to Intra_16x16 (chroma DC Hadamard + Intra_16x16 luma DC Hadamard pre-pass).
  • PR-A5: extend to 8x8 transform (separate IDCT 8x8 dispatch path).
  • PR-A6: enable libavcodec's deblock (skip_loop_filter=AVDISCARD_NONE) and have daedalus's deblock produce the post-deblock output. Closes the loop on full I-only pipeline.
  • Task #184: daedalus_decoder.h coeffs docstring clarification.
Final option-A deliverable. CLI extracts **real per-MB coefficients** from libavcodec via the inspection callback + side-buffer (marfrit-packages 0016 + 0017, both merged), reconstructs pre-residual predicted samples `P` via inverse-of-IDCT-add, and feeds daedalus-decoder with real `(P, C, no edges)`. **Byte-exact against libavcodec's pre-deblock AVFrame across all three substrates and both test resolutions.** ## Path summary ```c avctx->thread_count = 1; // 0017 side buffer is per-H264Context avctx->skip_loop_filter = AVDISCARD_ALL; // AVFrame stays pre-deblock ff_h264_set_mb_inspect_cb(avctx, inspect_cb, &state); ``` **Inspection callback** (per MB, fires post-`hl_decode_mb`): 1. Gate on `IS_INTRA4x4 && !IS_8x8DCT && !IS_INTRA_PCM` (other MB flavours fall back to identity-passthrough in the main loop). 2. Snapshot pre-deblock pixels from `h->cur_pic.f->data[0]`. 3. Read coefficients from `h->mb_inspect_coeffs` (the 0017 side buffer — `sl->mb` snapshot taken before IDCT zeros it). 4. For each 4×4 block (16/MB in raster order, indexed via `raster_to_zscan[]` to find its slot in the z-scan-ordered side buffer): compute `IDCT(C)` using a transcribed H.264 C ref, derive `P = clip(pre_deblock - ((IDCT + 32) >> 6))`. 5. Stash per-MB capture for the main loop. A diagnostic asserts (silently on pass) that the callback's `pre_deblock` snapshot equals AVFrame at each real-coeffs MB position — i.e. `h->cur_pic.f` IS the eventual AVFrame buffer under `skip_loop_filter=AVDISCARD_ALL` + `thread_count=1`. ## Bug hunted in this PR Initial implementation transposed the coefficients from row-major (sl->mb) to "column-major" (the layout that `daedalus_decoder.h`'s `mb_input.coeffs` docstring describes). This caused ~0.2% Y pixel divergence on real streams (~150/frame at 320×240). Root cause identified via a standalone `/tmp/idct_compare.c` harness running daedalus's C ref IDCT and FFmpeg's reference C IDCT on identical `int16[16]` inputs: **outputs are IDENTICAL.** The two functions implement the spec H.264 IDCT on the array regardless of layout interpretation; the "column-major" label is decoration. Removed the transpose; PR is now byte-exact. Follow-up **task #184**: clarify `daedalus_decoder.h`'s `mb_input.coeffs` docstring so future integrators don't repeat this transpose mistake. ## Result on hertz (Pi 5 V3D 7.1) testsrc2 I-only via `libx264 -bf 0 -g 1`: | dims | frames | substrate | Y diff | UV diff | real-coeffs MBs | |---|---|---|---|---|---| | 320×240 | 5 | auto | 0/76800 | 0/38400 | 77-95/300 | | 320×240 | 5 | cpu | 0/76800 | 0/38400 | 77-95/300 | | 320×240 | 5 | qpu | 0/76800 | 0/38400 | 77-95/300 | | 1920×1088 | 3 | auto | 0/2088960 | 0/1044480 | 598-643/8160 | testsrc2 is mostly flat → many Intra_16x16 MBs that fall back to identity-passthrough; richer content streams would engage real-coeffs path more. ## Followups - **PR-A4**: extend gate to Intra_16x16 (chroma DC Hadamard + Intra_16x16 luma DC Hadamard pre-pass). - **PR-A5**: extend to 8x8 transform (separate IDCT 8x8 dispatch path). - **PR-A6**: enable libavcodec's deblock (`skip_loop_filter=AVDISCARD_NONE`) and have daedalus's deblock produce the post-deblock output. Closes the loop on full I-only pipeline. - **Task #184**: daedalus_decoder.h coeffs docstring clarification.
marfrit added 1 commit 2026-05-26 09:19:41 +00:00
Final option-A deliverable.  CLI now extracts real per-MB
coefficients from libavcodec via the inspection callback +
side-buffer (marfrit-packages 0016 + 0017), reconstructs the
pre-residual predicted samples P via inverse-of-IDCT-add, and
feeds daedalus-decoder with real (P, C, no edges).  Daedalus
output BYTE-EXACT against libavcodec's pre-deblock AVFrame
across 5 frames at 320x240 and 3 frames at 1920x1088, all three
substrates (auto / cpu / qpu).

Path summary
------------

avctx->thread_count = 1                  (single-threaded decode — 0017's
                                           side buffer is per-H264Context;
                                           multi-threaded would race)
avctx->skip_loop_filter = AVDISCARD_ALL  (AVFrame stays pre-deblock so the
                                           P-recovery subtraction is exact)
ff_h264_set_mb_inspect_cb               (registers the callback)

Inspection callback (per MB, fires post-hl_decode_mb):
  - Gate on IS_INTRA4x4 && !IS_8x8DCT && !IS_INTRA_PCM (skipped MBs
    fall back to identity-passthrough in the main loop)
  - Snapshot pre-deblock pixels from h->cur_pic.f->data[0]
  - Read coefficients from h->mb_inspect_coeffs (= sl->mb copy, the
    0017 side buffer)
  - For each 4x4 block (16/MB in raster order, indexed via
    raster_to_zscan[] to find its slot in the z-scan-ordered side
    buffer): compute IDCT(C) using a transcribed H.264 C reference,
    derive P = clip(pre_deblock - ((IDCT + 32) >> 6))
  - Stash per-MB capture (P + C) for the main loop

Main loop:
  - Default identity-passthrough (predicted = AVFrame pixels, coeffs = 0)
  - For real-coeffs-valid MBs: override luma with captured P + C
  - flush_frame, byte-exact compare against AVFrame

A diagnostic also asserts (silently when passing) that the
callback's pre_deblock snapshot equals AVFrame at each real-coeffs
MB position — i.e. h->cur_pic.f IS the eventual AVFrame buffer
under skip_loop_filter=AVDISCARD_ALL with thread_count=1.

Bug hunted in this PR
---------------------

Initial implementation transposed the coefficients from row-major
(sl->mb) to "column-major" (the layout that daedalus_decoder.h's
mb_input.coeffs docstring describes).  This caused ~0.2% Y pixel
divergence on real streams (~150/frame at 320x240).  Root cause
identified via a standalone /tmp/idct_compare.c harness running
daedalus's C ref IDCT and FFmpeg's reference C IDCT on identical
int16[16] inputs: outputs IDENTICAL.  The two functions implement
the spec H.264 IDCT on the array regardless of layout
interpretation; the "column-major" label is decoration.  Removed
the transpose; PR is now byte-exact.

Follow-up task #184: clarify daedalus_decoder.h's mb_input.coeffs
docstring so future integrators don't repeat this transpose
mistake.

Result on hertz (Pi 5 V3D 7.1)
------------------------------

testsrc2 I-only via libx264 -bf 0 -g 1:

  320x240,    5 frames, substrate=auto:  Y diff 0/76800,    UV diff 0/38400   PASS
  320x240,    5 frames, substrate=cpu:   Y diff 0/76800,    UV diff 0/38400   PASS
  320x240,    5 frames, substrate=qpu:   Y diff 0/76800,    UV diff 0/38400   PASS
  1920x1088,  3 frames, substrate=auto:  Y diff 0/2088960,  UV diff 0/1044480 PASS

Real-coeffs path engaged for 77-95 MBs per 320x240 frame and
598-643 MBs per 1080p frame (testsrc2 is mostly flat → many
Intra_16x16 MBs that fall back to identity passthrough; richer
content streams would engage real-coeffs more).

Followups
---------

  - PR-A4: extend the gate to Intra_16x16 (chroma DC Hadamard +
    Intra_16x16 luma DC Hadamard pre-pass) — currently ~30-60%
    of MBs fall back to identity-passthrough due to this.
  - PR-A5: extend to 8x8 transform (separate IDCT 8x8 dispatch
    path on the daedalus-decoder side, similar plumbing).
  - PR-A6: enable libavcodec's deblock (skip_loop_filter=AVDISCARD_NONE)
    and have daedalus's deblock produce the post-deblock output
    that matches AVFrame.  Closes the loop on the full I-only
    pipeline.
  - Task #184: daedalus_decoder.h coeffs docstring clarification.
marfrit merged commit 35b4f163c6 into main 2026-05-26 09:36:04 +00:00
marfrit deleted branch noether/tools-h264-real-coeffs 2026-05-26 09:36:04 +00:00
Sign in to join this conversation.
No Reviewers
No Label
1 Participants
Notifications
Due Date
No due date set.
Dependencies

No dependencies set.

Reference: marfrit/daedalus-decoder#15