Stage 2 PR-A3b: real H.264 coefficients through daedalus-decoder, byte-exact #15
Reference in New Issue
Block a user
Delete Branch "noether/tools-h264-real-coeffs"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Final option-A deliverable. CLI extracts real per-MB coefficients from libavcodec via the inspection callback + side-buffer (marfrit-packages 0016 + 0017, both merged), reconstructs pre-residual predicted samples
Pvia inverse-of-IDCT-add, and feeds daedalus-decoder with real(P, C, no edges). Byte-exact against libavcodec's pre-deblock AVFrame across all three substrates and both test resolutions.Path summary
Inspection callback (per MB, fires post-
hl_decode_mb):IS_INTRA4x4 && !IS_8x8DCT && !IS_INTRA_PCM(other MB flavours fall back to identity-passthrough in the main loop).h->cur_pic.f->data[0].h->mb_inspect_coeffs(the 0017 side buffer —sl->mbsnapshot taken before IDCT zeros it).raster_to_zscan[]to find its slot in the z-scan-ordered side buffer): computeIDCT(C)using a transcribed H.264 C ref, deriveP = clip(pre_deblock - ((IDCT + 32) >> 6)).A diagnostic asserts (silently on pass) that the callback's
pre_deblocksnapshot equals AVFrame at each real-coeffs MB position — i.e.h->cur_pic.fIS the eventual AVFrame buffer underskip_loop_filter=AVDISCARD_ALL+thread_count=1.Bug hunted in this PR
Initial implementation transposed the coefficients from row-major (sl->mb) to "column-major" (the layout that
daedalus_decoder.h'smb_input.coeffsdocstring describes). This caused ~0.2% Y pixel divergence on real streams (~150/frame at 320×240).Root cause identified via a standalone
/tmp/idct_compare.charness running daedalus's C ref IDCT and FFmpeg's reference C IDCT on identicalint16[16]inputs: outputs are IDENTICAL. The two functions implement the spec H.264 IDCT on the array regardless of layout interpretation; the "column-major" label is decoration. Removed the transpose; PR is now byte-exact.Follow-up task #184: clarify
daedalus_decoder.h'smb_input.coeffsdocstring so future integrators don't repeat this transpose mistake.Result on hertz (Pi 5 V3D 7.1)
testsrc2 I-only via
libx264 -bf 0 -g 1:testsrc2 is mostly flat → many Intra_16x16 MBs that fall back to identity-passthrough; richer content streams would engage real-coeffs path more.
Followups
skip_loop_filter=AVDISCARD_NONE) and have daedalus's deblock produce the post-deblock output. Closes the loop on full I-only pipeline.Final option-A deliverable. CLI now extracts real per-MB coefficients from libavcodec via the inspection callback + side-buffer (marfrit-packages 0016 + 0017), reconstructs the pre-residual predicted samples P via inverse-of-IDCT-add, and feeds daedalus-decoder with real (P, C, no edges). Daedalus output BYTE-EXACT against libavcodec's pre-deblock AVFrame across 5 frames at 320x240 and 3 frames at 1920x1088, all three substrates (auto / cpu / qpu). Path summary ------------ avctx->thread_count = 1 (single-threaded decode — 0017's side buffer is per-H264Context; multi-threaded would race) avctx->skip_loop_filter = AVDISCARD_ALL (AVFrame stays pre-deblock so the P-recovery subtraction is exact) ff_h264_set_mb_inspect_cb (registers the callback) Inspection callback (per MB, fires post-hl_decode_mb): - Gate on IS_INTRA4x4 && !IS_8x8DCT && !IS_INTRA_PCM (skipped MBs fall back to identity-passthrough in the main loop) - Snapshot pre-deblock pixels from h->cur_pic.f->data[0] - Read coefficients from h->mb_inspect_coeffs (= sl->mb copy, the 0017 side buffer) - For each 4x4 block (16/MB in raster order, indexed via raster_to_zscan[] to find its slot in the z-scan-ordered side buffer): compute IDCT(C) using a transcribed H.264 C reference, derive P = clip(pre_deblock - ((IDCT + 32) >> 6)) - Stash per-MB capture (P + C) for the main loop Main loop: - Default identity-passthrough (predicted = AVFrame pixels, coeffs = 0) - For real-coeffs-valid MBs: override luma with captured P + C - flush_frame, byte-exact compare against AVFrame A diagnostic also asserts (silently when passing) that the callback's pre_deblock snapshot equals AVFrame at each real-coeffs MB position — i.e. h->cur_pic.f IS the eventual AVFrame buffer under skip_loop_filter=AVDISCARD_ALL with thread_count=1. Bug hunted in this PR --------------------- Initial implementation transposed the coefficients from row-major (sl->mb) to "column-major" (the layout that daedalus_decoder.h's mb_input.coeffs docstring describes). This caused ~0.2% Y pixel divergence on real streams (~150/frame at 320x240). Root cause identified via a standalone /tmp/idct_compare.c harness running daedalus's C ref IDCT and FFmpeg's reference C IDCT on identical int16[16] inputs: outputs IDENTICAL. The two functions implement the spec H.264 IDCT on the array regardless of layout interpretation; the "column-major" label is decoration. Removed the transpose; PR is now byte-exact. Follow-up task #184: clarify daedalus_decoder.h's mb_input.coeffs docstring so future integrators don't repeat this transpose mistake. Result on hertz (Pi 5 V3D 7.1) ------------------------------ testsrc2 I-only via libx264 -bf 0 -g 1: 320x240, 5 frames, substrate=auto: Y diff 0/76800, UV diff 0/38400 PASS 320x240, 5 frames, substrate=cpu: Y diff 0/76800, UV diff 0/38400 PASS 320x240, 5 frames, substrate=qpu: Y diff 0/76800, UV diff 0/38400 PASS 1920x1088, 3 frames, substrate=auto: Y diff 0/2088960, UV diff 0/1044480 PASS Real-coeffs path engaged for 77-95 MBs per 320x240 frame and 598-643 MBs per 1080p frame (testsrc2 is mostly flat → many Intra_16x16 MBs that fall back to identity passthrough; richer content streams would engage real-coeffs more). Followups --------- - PR-A4: extend the gate to Intra_16x16 (chroma DC Hadamard + Intra_16x16 luma DC Hadamard pre-pass) — currently ~30-60% of MBs fall back to identity-passthrough due to this. - PR-A5: extend to 8x8 transform (separate IDCT 8x8 dispatch path on the daedalus-decoder side, similar plumbing). - PR-A6: enable libavcodec's deblock (skip_loop_filter=AVDISCARD_NONE) and have daedalus's deblock produce the post-deblock output that matches AVFrame. Closes the loop on the full I-only pipeline. - Task #184: daedalus_decoder.h coeffs docstring clarification.