PR-A6: enable libavcodec deblock + drive daedalus deblock on real streams #16

Merged
marfrit merged 1 commit from noether/tools-h264-deblock-validation into main 2026-05-26 10:12:31 +00:00

1 Commits

Author SHA1 Message Date
claude-noether b597fc0098 PR-A6: enable libavcodec deblock + drive daedalus deblock on real streams
PARTIAL PASS — full I-frame pipeline (IDCT + deblock) running on real
H.264 streams via daedalus-decoder's frame-major dispatch.  Residual
divergence vs libavcodec reference: 0.09% to 0.86% Y / 0.35% to 2.0%
UV depending on substrate + resolution.  Kernel-level off-by-one
issues remain; structurally same family as task #179.

Architecture (verified against `dejavu` memory before coding)
-------------------------------------------------------------

  - NO new libavcodec patches.  Uses existing 0016 + 0017 callback
    infrastructure.
  - daedalus-decoder is the consumer-side frame-major dispatch path;
    libavcodec runs to produce the post-deblock reference.  Daedalus
    is NOT substituted into libavcodec's deblock path.
  - Edge derivation is a one-time spec implementation in the CLI, not
    a per-block function-pointer hijack.

This is a different shape from the banned per-kernel substitution arc.
The design was re-checked against the `dejavu` magic-word memory before
any tool call, per the user's explicit instruction "make sure no dejavu".

What changed in the CLI
-----------------------

  1. The avctx->skip_loop_filter setting was dropped: libavcodec's
     deblock now runs, so AVFrame is post-deblock (the new reference).
  2. The per-MB callback captures pre-deblock pixels for all 3 planes
     (Y/Cb/Cr) at MB(N)'s own callback time.  These are pure pre-deblock
     for MB(N) regardless of incremental deblock timing for neighbours:
     filter_mb runs AFTER hl_decode_mb returns, so the callback sees
     freshly decoded, not-yet-deblocked pixels.
  3. Per-MB callback also captures qp_y, mb_type_intra,
     transform_8x8.  Slice-level: slice_alpha_c0_offset,
     slice_beta_offset, slice_deblocking_filter.
  4. Transcribed H.264 §8.7.2 alpha_table[156], beta_table[156],
     tc0_table[156][4] from FFmpeg's h264_loopfilter.c (LGPL-2.1+
     transcription; algorithm/values normative-spec, unpatented).
  5. Transcribed §8.5.11 / Table 8-11 chroma_qp_table[52] for
     qP_Y → qP_C conversion (chroma_qp_index_offset assumed 0; x264's
     nominal default is 0, but it lowers the offset when psy-RD is
     active, so the stream's PPS value should be checked).
  6. Main loop: for each MB, build daedalus_decoder_mb_input.edges
     from spec rules.  16 edges/MB (4 V-luma + 4 H-luma + 2 V-Cb +
     2 V-Cr + 2 H-Cb + 2 H-Cr).  bS=4 at MB boundary, bS=3 internal,
     bS=0 at frame boundary.  8x8 DCT MBs skip internal edges at
     col/row 4 and 12 (only the 8x8-block boundary fires).
  7. Daedalus's flush_frame runs IDCT-add for MBs with real
     coefficients + identity passthrough for skipped MBs, THEN
     dispatches the 4 deblock kernel families (luma V/H + chroma V/H,
     each with a bS<4 and a bS=4 intra variant) across the frame.
  8. Compare daedalus output to AVFrame (post-deblock).
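
The intra-only bS rule from step 6 can be sketched for the luma edges
as below.  This is a minimal illustration, not the CLI's code:
`sketch_luma_edges` and `sketch_intra_luma_bs` are hypothetical names,
the real `daedalus_decoder_mb_input.edges` layout is not shown in this
message, and progressive (frame) MBs are assumed.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical container for one MB's luma edge strengths.
 * Index i is the edge at column (bs_v) or row (bs_h) 4*i. */
typedef struct {
    uint8_t bs_v[4];
    uint8_t bs_h[4];
} sketch_luma_edges;

/* Intra-only rule from step 6: bS=4 on MB boundaries, bS=3 on internal
 * edges, bS=0 on frame boundaries.  8x8-transform MBs have no edges at
 * offsets 4 and 12, so those get bS=0 as well.  Assumes frame MBs. */
static void sketch_intra_luma_bs(sketch_luma_edges *e, int mb_x, int mb_y,
                                 bool transform_8x8, bool deblock_off)
{
    for (int i = 0; i < 4; i++) {
        /* Odd indices are the edges at offsets 4 and 12. */
        bool absent = deblock_off || (transform_8x8 && (i & 1));
        uint8_t bs = absent ? 0 : 3;
        e->bs_v[i] = bs;
        e->bs_h[i] = bs;
    }
    if (!deblock_off) {
        e->bs_v[0] = (mb_x > 0) ? 4 : 0; /* left MB edge */
        e->bs_h[0] = (mb_y > 0) ? 4 : 0; /* top MB edge */
    }
}
```

The chroma edges follow the same pattern with 2 edges per direction
per plane, which gives the 16 edges/MB total quoted in step 6.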

Subtle bug hunted: sl->deblocking_filter convention inversion
-------------------------------------------------------------

FFmpeg's h264_slice.c line 1901 does `sl->deblocking_filter ^= 1`
to invert the spec's `disable_deblocking_filter_idc` semantics.
Internal convention:
  - 0 = DISABLED (was 1 in spec)
  - 1 = ENABLED  (was 0 in spec)
  - 2 = enabled-but-not-across-slice-boundaries (unchanged)

The initial implementation treated `== 1` as "disabled" per spec
semantics, which silently skipped all edge emission (deblock_off=1)
and gave the same diff count as the no-edges baseline.  After changing
the test to `deblock_off = (sl->deblocking_filter == 0)`, edges flowed
and divergence dropped from 5346 to 438 Y diffs per frame (a 92%
reduction).
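
The convention mapping above can be written out as a small sketch.
The helper names are hypothetical and exist only for illustration; the
`< 2` guard reflects the statement above that value 2 is left unchanged.

```c
#include <stdbool.h>

/* Map the spec's disable_deblocking_filter_idc to the internal
 * convention described above: 0 and 1 are swapped, 2 is unchanged. */
static int sketch_internal_deblocking_filter(int disable_deblocking_filter_idc)
{
    int v = disable_deblocking_filter_idc;
    if (v < 2)
        v ^= 1;
    return v;
}

/* Under the internal convention, deblocking is off only for value 0. */
static bool sketch_deblock_off(int internal_deblocking_filter)
{
    return internal_deblocking_filter == 0;
}
```

With this mapping, a stream that signals idc 0 (filter enabled) yields
internal value 1, so a check written as `== 1` means "enabled", which is
the opposite of what the spec value would suggest.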

Results on hertz (Pi 5 V3D 7.1)
-------------------------------

testsrc2 I-only via libx264 -bf 0 -g 1:

  320×240, 5 frames, substrate=cpu:
    Y diff 2009/384000 (0.52%), UV diff 3876/192000 (2.02%)
  320×240, 5 frames, substrate=qpu:
    Y diff 3288/384000 (0.86%), UV diff 3577/192000 (1.86%)
  1920×1088, 3 frames, substrate=auto:
    Y diff 5810/6266880 (0.09%), UV diff 10921/3133440 (0.35%)

The 1080p rate is lower than QVGA's, presumably because the testsrc2
content has fewer image edges relative to total pixels at higher
resolution.
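
The diff figures above are per-sample mismatch counts over all compared
frames.  A minimal sketch of such a count follows; the function names
are hypothetical and the CLI's actual comparison code is not shown in
this message.

```c
#include <stddef.h>
#include <stdint.h>

/* Count samples that differ between two 8-bit planes.  Each plane has
 * its own stride, as is typical for decoder output buffers. */
static size_t sketch_count_plane_diffs(const uint8_t *a, ptrdiff_t a_stride,
                                       const uint8_t *b, ptrdiff_t b_stride,
                                       int width, int height)
{
    size_t diffs = 0;
    for (int y = 0; y < height; y++) {
        const uint8_t *ra = a + (ptrdiff_t)y * a_stride;
        const uint8_t *rb = b + (ptrdiff_t)y * b_stride;
        for (int x = 0; x < width; x++)
            diffs += (ra[x] != rb[x]);
    }
    return diffs;
}

/* Mismatch rate as a percentage of all compared samples. */
static double sketch_diff_percent(size_t diffs, size_t total)
{
    return total ? 100.0 * (double)diffs / (double)total : 0.0;
}
```

For the 320×240 CPU run, 2009 differing samples out of
320 × 240 × 5 = 384000 luma samples give about 0.52%.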

Residual divergence — root cause analysis
-----------------------------------------

  - CPU substrate uses ff_h264_*_loop_filter_*_neon (same kernel
    libavcodec uses).  Same kernel + same alpha/beta/tc0/bS → output
    SHOULD be identical.  But still 0.52% Y diff.
  - Likely cause: edge dispatch ORDER mismatch.  libavcodec serialises
    per-MB (filter MB(N)'s edges, then MB(N+1)'s).  Daedalus batches
    frame-wide (all V luma across the frame, then all H luma, etc.).
    Where filter footprints overlap (e.g., MB(N)'s col 12 internal edge
    can write cols 10-13, and MB(N+1)'s col 0 boundary edge at bS=4 can
    write MB(N)'s cols 13-15), the order affects the final pixel.
  - QPU substrate has slightly higher divergence (0.86% Y) — additional
    kernel-level off-by-one between daedalus's V3D shader and the NEON
    reference, in the same family as task #179's chroma divergence.
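
The order dependence hypothesised above can be illustrated with a toy
1-D filter.  This is NOT the H.264 deblock filter and not daedalus
code; `toy_filter_edge` is a hypothetical smoothing step that only
shares the relevant property, namely that each edge reads samples a
neighbouring edge may already have written.

```c
/* Toy 4-tap smoothing across the edge between p[e-1] and p[e].
 * Reads p[e-2]..p[e+1], writes p[e-1] and p[e]. */
static void toy_filter_edge(int *p, int e)
{
    int a = p[e - 2], b = p[e - 1], c = p[e], d = p[e + 1];
    p[e - 1] = (a + 2 * b + c + 2) >> 2;
    p[e]     = (b + 2 * c + d + 2) >> 2;
}

/* Apply the same two edges in a caller-chosen order. */
static void toy_filter_two_edges(int *p, int first, int second)
{
    toy_filter_edge(p, first);
    toy_filter_edge(p, second);
}
```

Starting from {0, 0, 100, 100, 0, 0} with edges at indices 2 and 4,
filtering 2 then 4 gives {0, 25, 75, 69, 25, 0}, while 4 then 2 gives
{0, 25, 69, 75, 25, 0}.  Identical kernels and parameters still produce
different samples once the dispatch order changes.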

These are kernel-level / dispatch-order issues, not CLI bugs.  Task #179
extended in scope (now includes luma + cross-MB edge ordering); root
cause investigation belongs in daedalus-fourier.

PR-A6 verifies the INFRASTRUCTURE: real coefficients flow through, real
edges are derived per spec, daedalus runs IDCT + deblock in one
frame-major dispatch, and on real H.264 streams fewer than 1% of Y
samples and at most about 2% of UV samples differ from the libavcodec
reference.  Full byte-exact closure depends on the daedalus-fourier
deblock kernel/dispatch investigation.

Followups
---------

  - Extend task #179 to cover luma edges + cross-MB edge ordering on
    real-stream layouts.
  - PR-A4: Intra_16x16 + chroma DC Hadamard.  Would also help the UV
    diff rate since currently chroma is identity-passthrough (no real
    chroma residual coefficients flowing through daedalus).
  - Q3 deferred: daemon refactor in daedalus-v4l2.
2026-05-26 11:53:23 +02:00