PR-A6: enable libavcodec deblock + drive daedalus deblock on real streams #16

Merged

marfrit merged 1 commit from noether/tools-h264-deblock-validation into main

2026-05-26 10:12:31 +00:00


Author

SHA1

Message

Date

claude-noether

b597fc0098

PR-A6: enable libavcodec deblock + drive daedalus deblock on real streams

PARTIAL PASS — full I-frame pipeline (IDCT + deblock) running on real
H.264 streams via daedalus-decoder's frame-major dispatch.  Residual
divergence vs libavcodec reference: 0.09% to 0.86% Y / 0.35% to 2.0%
UV depending on substrate + resolution.  Kernel-level off-by-one
issues remain; structurally same family as task #179.

Architecture (verified against `dejavu` memory before coding)
-------------------------------------------------------------

  - NO new libavcodec patches.  Uses existing 0016 + 0017 callback
    infrastructure.
  - daedalus-decoder is the consumer-side frame-major dispatch path;
    libavcodec runs to produce the post-deblock reference.  Daedalus
    is NOT substituted into libavcodec's deblock path.
  - Edge derivation is a one-time spec implementation in the CLI, not
    a per-block function-pointer hijack.

Different shape from the banned per-kernel substitution arc.  Hard
re-check vs the magic word memory before any tool call (per the user's
explicit instruction "make sure no dejavu").

What changed in the CLI
-----------------------

  1. avctx->skip_loop_filter dropped — libavcodec's deblock now runs
     and AVFrame is post-deblock (the new reference).
  2. Per-MB callback captures pre-deblock pixels for all 3 planes
     (Y/Cb/Cr) at MB(N)'s own callback time.  The capture is pure
     pre-deblock for MB(N) regardless of incremental deblock timing
     for neighbours: filter_mb runs AFTER hl_decode_mb returns, so
     the callback sees freshly decoded, not-yet-filtered pixels.
  3. Per-MB callback also captures qp_y, mb_type_intra,
     transform_8x8.  Slice-level: slice_alpha_c0_offset,
     slice_beta_offset, slice_deblocking_filter.
  4. Transcribed H.264 §8.7.2 alpha_table[156], beta_table[156],
     tc0_table[156][4] from FFmpeg's h264_loopfilter.c (LGPL-2.1+
     transcription; algorithm/values normative-spec, unpatented).
  5. Transcribed §8.5.11 / Table 8-11 chroma_qp_table[52] for
     qP_Y → qP_C conversion (chroma_qp_index_offset assumed 0,
     which matches x264 default).
  6. Main loop: for each MB, build daedalus_decoder_mb_input.edges
     from spec rules.  16 edges/MB (4 V-luma + 4 H-luma + 2 V-Cb +
     2 V-Cr + 2 H-Cb + 2 H-Cr).  bS=4 at MB boundary, bS=3 internal,
     bS=0 at frame boundary.  8x8 DCT MBs skip internal edges at
     col/row 4 and 12 (only the 8x8-block boundary at col/row 8 fires).
  7. Daedalus's flush_frame runs IDCT-add for real-coeffs MBs +
     identity passthrough for skipped MBs, THEN dispatches the 4
     deblock kernels (luma V/H + chroma V/H, plus their bS=4 intra
     variants) across the frame.
  8. Compare daedalus output to AVFrame (post-deblock).
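
The luma part of the edge rules in step 6 can be sketched as below.
This is a minimal illustration, not the CLI's actual code: the function
name and signature are hypothetical, and it assumes progressive
I-only streams with no slice-boundary special case (the
deblocking_filter == 2 mode is not modelled).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical helper (illustration only).  Fills the boundary strength
 * of the four vertical (bs_v) and four horizontal (bs_h) luma edges of
 * one intra MB.  Edge e sits at column/row 4*e inside the MB. */
void derive_intra_luma_bs(int mb_x, int mb_y, bool transform_8x8,
                          bool deblock_off, int bs_v[4], int bs_h[4])
{
    assert(mb_x >= 0 && mb_y >= 0);

    for (int e = 0; e < 4; e++) {
        if (deblock_off) {
            /* Deblocking disabled for the slice: emit no edges. */
            bs_v[e] = bs_h[e] = 0;
        } else if (e == 0) {
            /* MB boundary: bS=4, except on the left/top frame boundary. */
            bs_v[e] = (mb_x == 0) ? 0 : 4;
            bs_h[e] = (mb_y == 0) ? 0 : 4;
        } else {
            /* Internal edge: bS=3; 8x8-transform MBs skip cols/rows 4, 12. */
            int bs = (transform_8x8 && (e & 1)) ? 0 : 3;
            bs_v[e] = bs_h[e] = bs;
        }
    }
}
```

Under the same assumptions, the chroma edges follow the same pattern
with two edges per direction per plane, which gives the 16 edges/MB
total quoted in step 6.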

Subtle bug hunted: sl->deblocking_filter convention inversion
-------------------------------------------------------------

FFmpeg's h264_slice.c line 1901 does `sl->deblocking_filter ^= 1`
(applied only to values below 2) to invert the spec's
`disable_deblocking_filter_idc` semantics.  Internal convention:
  - 0 = DISABLED (was 1 in spec)
  - 1 = ENABLED  (was 0 in spec)
  - 2 = enabled-but-not-across-slice-boundaries (unchanged)

Initial implementation treated `== 1` as "disabled" per spec
semantics, which silently skipped all edge emission (deblock_off=1)
and gave the same diff count as the no-edges baseline.  Corrected
to `deblock_off = (sl->deblocking_filter == 0)`; edges then flowed
and divergence dropped from 5346 to 438 Y diffs per frame (a 92%
reduction).
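
The mapping can be written out as a small sketch.  The two function
names are hypothetical; the body mirrors the convention listed above,
assuming the XOR is applied only to values below 2 (which is what the
"unchanged" entry for 2 implies).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical helper: spec value -> FFmpeg-internal value. */
int internal_deblocking_filter(int disable_deblocking_filter_idc)
{
    assert(disable_deblocking_filter_idc >= 0 &&
           disable_deblocking_filter_idc <= 2);

    int v = disable_deblocking_filter_idc;
    if (v < 2)
        v ^= 1;     /* swap 0 and 1; 2 keeps its meaning */
    return v;
}

/* Hypothetical helper: the corrected test from the paragraph above.
 * Takes the INTERNAL value, where 0 means "disabled". */
bool deblock_off(int internal_value)
{
    return internal_value == 0;
}
```

With this mapping, a stream that signals
disable_deblocking_filter_idc = 0 (the common case) arrives in the
callback as 1, which is why comparing against 1 disabled every edge.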

Results on hertz (Pi 5 V3D 7.1)
-------------------------------

testsrc2 I-only via libx264 -bf 0 -g 1:

  320×240, 5 frames, substrate=cpu:
    Y diff 2009/384000 (0.52%), UV diff 3876/192000 (2.02%)
  320×240, 5 frames, substrate=qpu:
    Y diff 3288/384000 (0.86%), UV diff 3577/192000 (1.86%)
  1920×1088, 3 frames, substrate=auto:
    Y diff 5810/6266880 (0.09%), UV diff 10921/3133440 (0.35%)

The 1080p rate is lower than QVGA's, presumably because the content
has fewer edges relative to total pixels at higher resolution.
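
For reference, the diff figures above are plain counts of differing
samples over the total sample count (e.g. 320 x 240 x 5 frames = 384000
Y samples).  A minimal sketch of that arithmetic follows; the function
names are hypothetical and the stride parameters are an assumption
modelled on how padded planes are usually laid out, not a copy of the
CLI's comparison code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: count samples that differ between two planes.
 * Each plane may have its own stride (bytes per row, >= width);
 * padding bytes beyond `width` are ignored. */
size_t count_plane_diffs(const uint8_t *a, size_t a_stride,
                         const uint8_t *b, size_t b_stride,
                         int width, int height)
{
    size_t diffs = 0;
    for (int y = 0; y < height; y++)
        for (int x = 0; x < width; x++)
            if (a[(size_t)y * a_stride + x] != b[(size_t)y * b_stride + x])
                diffs++;
    return diffs;
}

/* Hypothetical helper: diff rate as a percentage of all samples. */
double diff_percent(size_t diffs, size_t total)
{
    assert(total > 0);
    return 100.0 * (double)diffs / (double)total;
}
```

For example, diff_percent(2009, 384000) evaluates to roughly 0.52,
matching the QVGA cpu-substrate Y figure.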

Residual divergence — root cause analysis
-----------------------------------------

  - CPU substrate uses ff_h264_*_loop_filter_*_neon (same kernel
    libavcodec uses).  Same kernel + same alpha/beta/tc0/bS → output
    SHOULD be identical, yet a 0.52% Y diff remains.
  - Likely cause: edge dispatch ORDER mismatch.  libavcodec serialises
    per-MB (filter MB(N)'s edges, then MB(N+1)'s).  Daedalus batches
    frame-wide (all V luma across the frame, then all H luma, etc.).
    For overlapping-pixel zones (e.g., MB(N)'s col 12 internal edge
    + MB(N+1)'s col 0 boundary edge both touch cols 13-15), the
    order affects the final pixel.
  - QPU substrate has slightly higher divergence (0.86% Y) — additional
    kernel-level off-by-one between daedalus's V3D shader and the NEON
    reference, in the same family as task #179's chroma divergence.
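
Why dispatch order alone can change pixels is easy to show with a toy
model.  The sketch below is NOT the H.264 filter: it uses a made-up
2-tap blend, 2x2 "MBs", and a 4x2 frame, all hypothetical, purely to
show that per-MB ordering (V then H for each MB in turn) and
frame-wide batching (all V, then all H) need not commute.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

enum { FRAME_W = 4, FRAME_H = 2, MB_SIZE = 2, MB_COUNT = 2 };

/* Toy 2-tap blend across an edge (stand-in for a real deblock kernel). */
static void blend_pair(uint8_t *a, uint8_t *b)
{
    int pa = *a, pb = *b;
    *a = (uint8_t)((3 * pa + pb + 2) >> 2);
    *b = (uint8_t)((pa + 3 * pb + 2) >> 2);
}

/* Vertical edge on the left boundary of MB `mb` (requires mb >= 1). */
static void filter_v_edge(uint8_t *f, int mb)
{
    int x = mb * MB_SIZE;
    for (int y = 0; y < FRAME_H; y++)
        blend_pair(&f[y * FRAME_W + x - 1], &f[y * FRAME_W + x]);
}

/* Internal horizontal edge of MB `mb`, between its row 0 and row 1. */
static void filter_h_edge(uint8_t *f, int mb)
{
    for (int x = mb * MB_SIZE; x < (mb + 1) * MB_SIZE; x++)
        blend_pair(&f[x], &f[FRAME_W + x]);
}

/* Per-MB order: each MB's V edge, then its H edge, MB by MB. */
void deblock_per_mb(uint8_t *f)
{
    for (int mb = 0; mb < MB_COUNT; mb++) {
        if (mb > 0)
            filter_v_edge(f, mb);
        filter_h_edge(f, mb);
    }
}

/* Batched order: every V edge in the frame, then every H edge. */
void deblock_batched(uint8_t *f)
{
    for (int mb = 1; mb < MB_COUNT; mb++)
        filter_v_edge(f, mb);
    for (int mb = 0; mb < MB_COUNT; mb++)
        filter_h_edge(f, mb);
}

/* True when both dispatch orders give the same frame for this input. */
bool dispatch_orders_agree(const uint8_t in[FRAME_W * FRAME_H])
{
    uint8_t a[FRAME_W * FRAME_H], b[FRAME_W * FRAME_H];
    memcpy(a, in, sizeof a);
    memcpy(b, in, sizeof b);
    deblock_per_mb(a);
    deblock_batched(b);
    return memcmp(a, b, sizeof a) == 0;
}
```

In this toy, the last column of MB0 is H-filtered before MB1's left
edge reads it in the per-MB order, but after it in the batched order,
so the two orders diverge on any non-flat input near that boundary.
The real filters add alpha/beta threshold decisions on top, which can
only make the result more order-sensitive.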

These are kernel-level / dispatch-order issues, not CLI bugs.  Task #179
is extended in scope (now includes luma + cross-MB edge ordering); root
cause investigation belongs in daedalus-fourier.

PR-A6 verifies the INFRASTRUCTURE: real coefficients flow through, real
edges are derived per spec, daedalus runs IDCT + deblock in one
frame-major dispatch, and output differs from the libavcodec reference
in under 1% of Y samples and about 2% or less of UV samples on real
H.264 streams.  Full byte-exact closure depends on the daedalus-fourier
deblock kernel/dispatch investigation.

Followups
---------

  - Extend task #179 to cover luma edges + cross-MB edge ordering on
    real-stream layouts.
  - PR-A4: Intra_16x16 + chroma DC Hadamard.  Would also help the UV
    diff rate since currently chroma is identity-passthrough (no real
    chroma residual coefficients flowing through daedalus).
  - Q3 deferred: daemon refactor in daedalus-v4l2.

2026-05-26 11:53:23 +02:00
