Stage 2 PR-a: predicted samples plumbing — caller-supplied per-MB pixels #11

2026-05-25T21:01:46Z

marfrit commented

2026-05-25 21:01:46 +00:00

First concrete deliverable on the daedalus-decoder Stage 2 path following the 2026-05-25 architecture re-pin (memory: dejavu / frame-major UMA).

Q2 decision applies: CPU intra prediction. libavcodec's existing NEON intra prediction kernels generate predicted samples per MB; daedalus-decoder accepts those samples through the API and uses them as the IDCT-add starting state. FFmpeg's idct_add semantics (dst += idct(coeffs); clip255) fold DESIGN.md's Stage 3 reconstruction into the existing Stage 1 IDCT dispatch for free. No new GPU work.
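The add-and-clip half of the `idct_add` semantics can be sketched as follows. This is a minimal illustration, not FFmpeg's or daedalus-decoder's actual code: it assumes the inverse transform has already produced a block of `int16_t` residuals, and the names `clip_u8` and `add_residual_clip` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Clamp an int to the 8-bit sample range [0, 255]. */
static uint8_t clip_u8(int v)
{
    return (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
}

/*
 * Hypothetical helper: add an n x n block of already inverse-transformed
 * residuals onto dst in place. dst holds the predicted samples on entry
 * and the reconstructed samples on return.
 */
static void add_residual_clip(uint8_t *dst, ptrdiff_t stride,
                              const int16_t *residual, int n)
{
    assert(dst && residual && n > 0);
    for (int y = 0; y < n; y++) {
        for (int x = 0; x < n; x++) {
            uint8_t *p = &dst[y * stride + x];
            *p = clip_u8(*p + residual[y * n + x]);
        }
    }
}
```

With an all-zero `dst`, the result is simply the clipped residual, which is why the old zero-filled scratch buffers and the new predicted-sample starting state can share one dispatch.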

API change

daedalus_decoder_mb_input gains a const uint8_t *predicted field:

| range | content |
|---|---|
| `[ 0 .. 256)` | 16×16 luma, row-major raster |
| `[256 .. 320)` | 8×8 Cb, row-major raster |
| `[320 .. 384)` | 8×8 Cr, row-major raster |

NULL is legal and equivalent to all-zero predicted samples — preserves the existing IDCT-isolation test contract.
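A caller-side sketch of that 384-byte layout, assuming the caller holds the three predicted blocks separately. The constant names and both helpers (`pack_predicted`, `predicted_luma_at`) are hypothetical and not part of the daedalus-decoder API; only the offsets and the NULL-means-zero rule come from the description above.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Offsets into the per-MB predicted buffer, as listed in the table. */
enum {
    PRED_LUMA_OFF = 0,   /* 16x16 luma */
    PRED_CB_OFF   = 256, /* 8x8 Cb */
    PRED_CR_OFF   = 320, /* 8x8 Cr */
    PRED_BYTES    = 384
};

/* Hypothetical helper: pack three row-major blocks into one buffer. */
static void pack_predicted(uint8_t out[PRED_BYTES],
                           const uint8_t luma[256],
                           const uint8_t cb[64],
                           const uint8_t cr[64])
{
    assert(out && luma && cb && cr);
    memcpy(out + PRED_LUMA_OFF, luma, 256);
    memcpy(out + PRED_CB_OFF, cb, 64);
    memcpy(out + PRED_CR_OFF, cr, 64);
}

/* Read one luma sample; a NULL buffer behaves as all-zero samples. */
static uint8_t predicted_luma_at(const uint8_t *predicted, int x, int y)
{
    assert(x >= 0 && x < 16 && y >= 0 && y < 16);
    return predicted ? predicted[PRED_LUMA_OFF + y * 16 + x] : 0;
}
```

The packed buffer would then be the value assigned to the `predicted` field of `daedalus_decoder_mb_input`.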

Internal changes

- daedalus_decoder gains predicted_y (W×H) and predicted_uv (planar Cb‖Cr, W×H/2) buffers allocated at create, zeroed at end of every flush_frame so NULL mb->predicted is indistinguishable from explicit zeros across frame boundaries.
- append_mb splats mb->predicted into predicted_y/_uv at raster (mb_y*16, mb_x*16) for luma and (mb_y*8, mb_x*8) for each chroma component.
- flush_frame replaces calloc(scratch_y) / calloc(scratch_uv) with malloc + memcpy from predicted_y/_uv; the IDCT dispatch then writes residual on top, clip-adding to the predicted samples in place.
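The splat step might look roughly like the sketch below. The function name and signature are hypothetical, and two details are assumptions rather than facts from this PR: that "planar Cb‖Cr" means a full Cb plane of stride W/2 followed by a full Cr plane, and that a NULL `predicted` pointer simply skips the copy because the planes were zeroed at the previous flush.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical sketch of the append_mb splat. pred_y is W*H bytes;
 * pred_uv is W*H/2 bytes, assumed to be a Cb plane followed by a Cr
 * plane, each (W/2) x (H/2).
 */
static void splat_predicted(uint8_t *pred_y, uint8_t *pred_uv,
                            int width, int height,
                            int mb_x, int mb_y,
                            const uint8_t *predicted)
{
    assert(pred_y && pred_uv);
    assert(width > 0 && height > 0 && width % 16 == 0 && height % 16 == 0);
    assert(mb_x >= 0 && mb_x < width / 16);
    assert(mb_y >= 0 && mb_y < height / 16);

    if (!predicted)
        return; /* assumed: planes are already zero from the last flush */

    const int cw = width / 2;  /* chroma plane width */
    const int ch = height / 2; /* chroma plane height */
    uint8_t *cb = pred_uv;
    uint8_t *cr = pred_uv + (size_t)cw * ch;

    /* Luma: 16 rows of 16 bytes at raster (mb_y*16, mb_x*16). */
    for (int r = 0; r < 16; r++)
        memcpy(pred_y + (size_t)(mb_y * 16 + r) * width + mb_x * 16,
               predicted + r * 16, 16);

    /* Chroma: 8 rows of 8 bytes per component at raster (mb_y*8, mb_x*8). */
    for (int r = 0; r < 8; r++) {
        size_t off = (size_t)(mb_y * 8 + r) * cw + mb_x * 8;
        memcpy(cb + off, predicted + 256 + r * 8, 8);
        memcpy(cr + off, predicted + 320 + r * 8, 8);
    }
}
```

Under the same assumptions, flush_frame would copy both planes into the scratch buffers, run the IDCT dispatch, and then `memset` the predicted planes back to zero.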

Test

test_idct_bitexact extended:

- Generates random predicted samples (uint8_t) per MB alongside random coeffs.
- Pre-fills reference ref_y / ref_cb / ref_cr with those predicted samples at corresponding raster positions BEFORE applying ref_idct4_add / ref_idct8_add per block.
- Compares GPU output to reference byte-for-byte.
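The byte-for-byte comparison reduces to a per-plane mismatch count. The sketch below is not the code in `test_idct_bitexact`; `count_diff` and `report_plane` are hypothetical names, and the output line only imitates the format seen in the test log.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Count the bytes that differ between decoder output and reference. */
static size_t count_diff(const uint8_t *got, const uint8_t *ref, size_t n)
{
    assert(got && ref);
    size_t diff = 0;
    for (size_t i = 0; i < n; i++)
        diff += (got[i] != ref[i]);
    return diff;
}

/* Print one plane's result; returns 1 when the plane is bit-exact. */
static int report_plane(const char *name, const uint8_t *got,
                        const uint8_t *ref, size_t n)
{
    size_t d = count_diff(got, ref, n);
    printf("%s bytes diff: %zu/%zu (%.4f%%)\n",
           name, d, n, n ? 100.0 * (double)d / (double)n : 0.0);
    return d == 0;
}
```

A run passes only if `report_plane` returns 1 for Y, Cb and Cr on every substrate.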

Result on hertz (Pi 5 V3D 7.1), all three substrates:

substrate=cpu:  Y diff 0/76800   Cb diff 0/19200   Cr diff 0/19200   BIT-EXACT PASS
substrate=qpu:  Y diff 0/76800   Cb diff 0/19200   Cr diff 0/19200   BIT-EXACT PASS
substrate=auto: Y diff 0/76800   Cb diff 0/19200   Cr diff 0/19200   BIT-EXACT PASS

Followups

- Stage 2 PR-b: deblock dispatch in flush_frame.
- Stage 2 daemon refactor (parallel, daedalus-v4l2 daemon): replace avcodec_send_packet/receive_frame with a parser-only path that drives daedalus_decoder_append_mb in raster order and calls flush_frame at slice boundary.


marfrit added 1 commit 2026-05-25 21:01:47 +00:00

Stage 2 PR-a: predicted samples plumbing — caller-supplied per-MB pixels a7a0d56ecd

First concrete deliverable on the daedalus-decoder Stage 2 path post
the 2026-05-25 architecture re-pin (memory: dejavu / frame-major UMA).

Q2 decision: CPU intra prediction.  libavcodec's existing NEON intra
prediction kernels generate predicted samples per MB; daedalus-decoder
accepts those samples through the API and uses them as the IDCT-add
starting state.  FFmpeg's `idct_add` semantics — dst += idct(coeffs);
clip255 — fold DESIGN.md's Stage 3 reconstruction into the existing
Stage 1 IDCT dispatch for free.  No new GPU work.

API change
----------

`daedalus_decoder_mb_input` gains a `const uint8_t *predicted` field:

    predicted [  0 .. 256) — 16×16 luma, row-major raster
    predicted [256 .. 320) — 8×8  Cb,   row-major raster
    predicted [320 .. 384) — 8×8  Cr,   row-major raster

NULL is legal and equivalent to all-zero predicted samples — preserves
the existing IDCT-isolation test contract.

Internal changes
----------------

  - `daedalus_decoder` gains predicted_y (W×H) and predicted_uv (planar
    Cb||Cr, W×H/2) buffers allocated at create, zeroed at end of every
    flush_frame so NULL `mb->predicted` is indistinguishable from
    explicit zeros from one frame to the next.
  - `append_mb` splats mb->predicted into predicted_y/_uv at raster
    (mb_y*16, mb_x*16) for luma and (mb_y*8, mb_x*8) for each chroma
    component.
  - `flush_frame` replaces `calloc(scratch_y)` and `calloc(scratch_uv)`
    with `malloc + memcpy from predicted_y/_uv` — the IDCT dispatch
    then writes residual on top, clip-adding to the predicted samples
    in place.

Test
----

`test_idct_bitexact` extended:

  - Generates random predicted samples (uint8_t) per MB alongside the
    existing random coeffs.
  - Pre-fills the reference ref_y / ref_cb / ref_cr planes with those
    same predicted samples at the corresponding raster positions
    BEFORE applying ref_idct4_add / ref_idct8_add per block.
  - Compares GPU output to reference byte-for-byte.

Result on hertz (Pi 5 V3D 7.1), all three substrates:

  test_idct_bitexact 320 240 0xfeedface5a5a5a5a {cpu, qpu, auto}
    Y bytes diff:  0/76800 (0.0000%)
    Cb bytes diff: 0/19200 (0.0000%)
    Cr bytes diff: 0/19200 (0.0000%)
  BIT-EXACT PASS on all three substrates

Catches any silent drift between substrates and any predicted-samples
plumbing mistake on either the API or the dispatch side.

Followups
---------

  - Stage 2 PR-b: deblock dispatch in flush_frame.
  - Stage 2 daemon refactor (parallel, daedalus-v4l2 daemon): replace
    avcodec_send_packet/receive_frame with a libavcodec-parser-only
    path that drives daedalus_decoder_append_mb in raster order +
    flush_frame at slice boundary.

marfrit merged commit 418053db8d into main

2026-05-25 21:07:29 +00:00

marfrit deleted branch noether/stage2-predicted-samples

2026-05-25 21:07:29 +00:00

claude-noether referenced this issue from a commit

2026-05-25 21:30:40 +00:00

Stage 2 PR-b: deblock dispatch in flush_frame — luma + chroma, up to 8 submits

marfrit referenced this pull request

2026-05-25 21:31:03 +00:00

Stage 2 PR-b: deblock dispatch in flush_frame — luma + chroma, up to 8 submits #12
