47 Commits

Author SHA1 Message Date
marfrit fd56eca3cb Merge pull request 'mesa-panvk-bifrost{,-video}: fix url= to real Gitea repo' (#91) from claude-noether/marfrit-packages:noether/panvk-bifrost-url-fix into main
Reviewed-on: marfrit/marfrit-packages#91
2026-05-23 03:18:37 +00:00
marfrit 91022b390e mesa-panvk-bifrost{,-video}: fix url= to real Gitea repo
Both PKGBUILDs referenced url=https://github.com/marfrit/panvk-bifrost,
which was a hallucinated URL — no such repo existed. The campaign's
real source-of-truth home was just created at
https://git.reauktion.de/marfrit/panvk-bifrost (mfritsche, 2026-05-23).

Point both PKGBUILDs at the real URL so `pacman -Si` and any consumer
reading package metadata follows a working link.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-23 05:17:41 +02:00
marfrit b736dd0529 Merge pull request 'ffmpeg-v4l2-request-fourier: substitute H.264 qpel mc20 → daedalus-fourier' (#90) from claude-noether/marfrit-packages:noether/ffmpeg-fourier-qpel-mc20-daedalus into main
Reviewed-on: marfrit/marfrit-packages#90
2026-05-23 01:34:04 +00:00
claude-noether 0bfc4ab03e ffmpeg-v4l2-request-fourier: substitute H.264 qpel mc20 → daedalus-fourier
H264QpelContext.put_h264_qpel_pixels_tab[1][2] (8x8 luma horizontal
half-pel, 6-tap "put" — the canonical representative of the H.264
luma motion-compensation family) now dispatches through
daedalus_recipe_dispatch_h264_qpel_mc20 instead of
ff_put_h264_qpel8_mc20_neon.

Cycle 9 of the daedalus-v4l2#11 step 2 substitution arc; closes the
4-cycle libavcodec.so substitution sequence:

  cycle 6 (PR #76)  H.264 IDCT 4x4         done
  cycle 7 (PR #85)  H.264 IDCT 8x8         done
  cycle 8 (PR #86)  H.264 luma-v deblock   done
  cycle 9 (this)    H.264 qpel mc20

Bumps daedalus-fourier pin d87239d → 209a421 (PR #2 — public API
gains daedalus_recipe_dispatch_h264_qpel_mc20 +
DAEDALUS_KERNEL_H264_QPEL_MC20).

Verdict per docs/k9_h264qpel_mc20.md: CPU NEON.  Per-block 7.6 ns at
131 Mblock/s gives 135× margin over 30 fps 1080p; QPU dispatch floor
at ~250 ns makes any V3D shader strictly worse.  Substitution is
plumbing-only — same daedalus_ctx_create_no_qpu pthread_once shape
the cycles 6/7/8 shims already own (kept SEPARATE from the H264DSP
shim's ctx because H264QPEL is its own libavcodec Makefile module
and link order does not guarantee a single .o owns the ctx symbol;
one extra ~µs init per process, paid lazily on first MC call).

Other H.264 luma MC variants (mc02, mc11, mc22 etc.) and the 16x16
size tier stay on the in-tree NEON .S code per the cycle-9 phase-1
rationale (mc20 8x8 is representative; remaining variants would
multiply recipe-lookup overhead without changing the substrate
verdict).

Bit-exact against ff_put_h264_qpel8_mc20_neon (daedalus-fourier
cycle 9 green; 10000/10000 random blocks bit-exact, M3 = 131 Mblock/s).

No SONAME change, no Depends change.  PKGREL 9 → 10.

Refs reauktion/daedalus-v4l2#11 — substitution arc step 2 cycle 9.
2026-05-23 03:32:29 +02:00
marfrit 8729c2db92 Merge pull request 'daedalus-v4l2 + daedalus-v4l2-dkms: bump to 872eec5 — PROTO_MAX_PAYLOAD 1 MiB (#20)' (#89) from claude-noether/marfrit-packages:noether/daedalus-bump-872eec5-1mib-payload into main
Reviewed-on: marfrit/marfrit-packages#89
2026-05-22 18:52:53 +00:00
marfrit d449ec1073 daedalus-v4l2 + daedalus-v4l2-dkms: bump to 872eec5 — PROTO_MAX_PAYLOAD 1 MiB (#20)
Picks up reauktion/daedalus-v4l2 PR #20 (closes #19): wire-protocol
cap DAEDALUS_PROTO_MAX_PAYLOAD raised from 64 KiB to 1 MiB.
DAEDALUS_MAX_BITSTREAM follows; daedalus_fill_output_fmt now reports
OUTPUT_MPLANE sizeimage = ~1 MiB.

Fixes the Firefox YouTube avc1 SW-fallback observed on higgs when
any H.264 slice exceeded 64 KiB (routine on 720p+ streams).
libva-v4l2-request-fourier's S_FMT-driven OUTPUT-pool resize was
clamping back to 65484 and Firefox lost the slice; now the kernel
honours the larger sizeimage.

Both packages bumped to 0.1.0+r45+g872eec5-1:

  - daedalus-v4l2 (daemon): r43 -> r45.  Daemon-side allocations
    are dynamic, so the only growth is one ~1 MiB read buffer per
    daemon process at startup.
  - daedalus-v4l2-dkms (kernel module): r33 -> r45.  Skips the
    daemon-only bumps r37/r39/r41/r43 (no kernel/include change in
    that range) and lands the PROTO_MAX_PAYLOAD bump.

LOCK-STEP INSTALL REQUIRED: effective cap is min(kernel, daemon).
A stale kernel with a new daemon (or vice versa) still rejects
>64 KiB payloads.  apt/pacman should pick both up in one
transaction since they share the same upstream pin.

Wire-protocol value-only change in include/daedalus_v4l2_proto.h;
struct layout unchanged.  DAEDALUS_PROTO_VERSION stays at 0.
2026-05-22 20:50:04 +02:00
marfrit 9d30c34be9 Merge pull request 'daedalus-v4l2: 6e6dfa1 -> 1d8f5af — pause-time tiny-bitstream filter (#18)' (#88) from claude-noether/marfrit-packages:noether/daedalus-bump-1d8f5af-pause-filter into main
Reviewed-on: marfrit/marfrit-packages#88
2026-05-22 16:20:14 +00:00
marfrit 1ca18ac130 daedalus-v4l2: 6e6dfa1 -> 1d8f5af — pause-time tiny-bitstream filter (#18)
Picks up reauktion/daedalus-v4l2 PR #18 (closes #17): daemon drops
degenerate (<4 byte) bitstreams at REQ_DECODE entry instead of
letting avcodec_send_packet emit AVERROR_INVALIDDATA, replies
RESP_FRAME NO_FRAME so libva's V4L2 surface pool stays alive.

Fixes the Firefox YouTube avc1 pause→resume regression observed on
higgs: libva-v4l2-request-fourier flushes a 3-byte stub into
OUTPUT_MPLANE at the pause boundary; the old daemon path turned
that into a decode failure, Firefox marked H.264-via-VAAPI as
broken for the session, and routed every subsequent frame to
libmozavcodec SW.  After this bump the daemon logs 'tiny bitstream
3 bytes — dropping as no-op' and the next real REQ_DECODE
proceeds normally.

Wire protocol unchanged.  daedalus-v4l2-dkms bump not needed.
2026-05-22 18:16:33 +02:00
marfrit cf9eef6cfa Merge pull request 'ffmpeg-v4l2-request-fourier: restore AV_CODEC_FLAG_LOW_DELAY in H.264 decoder' (#87) from claude-noether/marfrit-packages:noether/ffmpeg-fourier-restore-low-delay into main
Reviewed-on: marfrit/marfrit-packages#87
2026-05-22 14:27:43 +00:00
marfrit 5c69460722 ffmpeg-v4l2-request-fourier: restore AV_CODEC_FLAG_LOW_DELAY in H.264 decoder
FFmpeg 8.x dropped the H.264 decoder's low_delay code path —
AV_CODEC_FLAG_LOW_DELAY no longer prevents h264_select_output_frame
from running the display-order DPB output queue.  The daedalus-v4l2
daemon's `ctx->flags |= AV_CODEC_FLAG_LOW_DELAY` at
daemon/src/decoder.c:202 has been a silent no-op since the SONAME
61→62 jump landed in reauktion/daedalus-v4l2 PR #16; on Firefox
YouTube this re-introduced the 2-1-4-3 B-frame pair-swap that PR
#12's daemon flag was supposed to prevent.

Fix lives in libavcodec, not the daemon: restore the documented
LOW_DELAY semantics so the daemon (and any other V4L2-stateless-
style consumer) keeps the one-frame-per-send_packet decode-order
output contract it already declares.

## Patch

0006-h264-restore-low-delay.patch touches libavcodec/h264_slice.c:

- h264_select_output_frame: early-exit when LOW_DELAY is set.
  Emit the just-decoded picture as next_output_pic, mirror the
  corruption / recovery-point tracking the main path performs,
  skip delayed_pic[] / POC reorder machinery entirely.

- h264_field_start: suppress the SPS-driven
  `has_b_frames = sps->num_reorder_frames` clobber when LOW_DELAY
  is set.  Without this the per-slice bitstream_restriction_flag
  re-pickup would reintroduce a nonzero reorder buffer mid-stream
  even after the daemon set has_b_frames=0 at avcodec_open2.

## Why not daemon-side

A daemon SPS-rewrite (`num_reorder_frames=0`) was considered but
rejected: it works only for the daemon's reconstructed SPS NAL,
not for any in-band SPS the daemon dlopens libavformat to parse
in other code paths.  Restoring documented FFmpeg flag semantics
is the smaller, more durable change and keeps the daemon
interface stable.

## Packaging

- PKGREL/pkgrel bump to 9.
- No new build-deps, no Depends change.
- Substitution arc cycles 6/7/8 unchanged.

## Refs

- reauktion/daedalus-v4l2#11 / #12 (LOW_DELAY half-measure on
  daemon side, originally landed against FFmpeg 7.x).
- daemon/src/decoder.c:202 (`ctx->flags |= AV_CODEC_FLAG_LOW_DELAY`
  for H.264 only — unchanged, but now actually has effect again).
2026-05-22 14:20:37 +02:00
marfrit d11a52405d Merge pull request 'ffmpeg-v4l2-request-fourier: substitute H.264 luma-v deblock → daedalus-fourier' (#86) from claude-noether/marfrit-packages:noether/ffmpeg-fourier-deblock-luma-v-daedalus into main
Reviewed-on: marfrit/marfrit-packages#86
2026-05-22 10:29:09 +00:00
marfrit 29e0852d11 ffmpeg-v4l2-request-fourier: substitute H.264 luma-v deblock → daedalus-fourier
Cycle 8 of the libavcodec.so substitution arc (reauktion/daedalus-v4l2#11
step 2).  H264DSPContext.v_loop_filter_luma — non-intra bS<4 vertical
luma deblock, called per macroblock-row edge from the slice deblock
loop in libavcodec/h264_loopfilter.c — now dispatches through
daedalus_recipe_dispatch_h264_deblock_luma_v instead of
ff_h264_v_loop_filter_luma_neon.

## What

- Add 0005-h264-deblock-luma-v-daedalus-fourier.patch (in both arch/
  and debian/ ffmpeg-v4l2-request-fourier/).  Extends
  libavcodec/aarch64/h264_idct_daedalus.c with
  ff_h264_v_loop_filter_luma_daedalus (constructs a
  daedalus_h264_deblock_meta from FFmpeg's (alpha, beta, tc0[4]) and
  calls daedalus_recipe_dispatch_h264_deblock_luma_v with n_edges=1).
  Patches libavcodec/aarch64/h264dsp_init_aarch64.c to wire
  c->v_loop_filter_luma to the new shim.
- arch/PKGBUILD + debian/build-deb.sh: append patch + bump pkgrel/PKGREL
  to 8.
- No new build-deps, no Depends change, no daedalus-fourier rev — the
  d87239d pin already exposes daedalus_recipe_dispatch_h264_deblock_luma_v.

## Why

Cycle 8 is marked "CPU primary; QPU opportunistic" in the daedalus-
fourier API docstring.  Per the hybrid substrate philosophy
("if there's a coprocessor, use it") we eventually want the QPU
opportunism active here.  But the libavcodec.so context is
process-global and shared with cycles 6/7 via pthread_once, and it
uses daedalus_ctx_create_no_qpu deliberately to avoid implicit
Vulkan init in arbitrary host processes (Firefox content, mpv-fourier,
ffmpeg-fourier CLI, ...).  Switching to daedalus_ctx_create here
without a feature flag would be a footgun.

So cycle 8 lands as plumbing-only NEON-by-recipe substitution for
now; opportunistic QPU enablement is a separate follow-up that adds
a DAEDALUS_FOURIER_ENABLE_QPU env var or equivalent.

## Scope NOT covered

- Intra (bS=4) loop filter c->v_loop_filter_luma_intra — daedalus's
  daedalus_h264_deblock_meta only covers the non-intra path.
- Horizontal-edge variant c->h_loop_filter_luma — separate kernel
  (not yet in daedalus-fourier API).
- Chroma loop filters — separate kernels.
- Bulk batching — single-edge dispatch wastes the kernel's n_edges>1
  amortization.  Same caveat as cycles 6/7; follow-up.
- QPU opportunism — see "Why" above.

## SONAME

Unchanged.  libavcodec.so.62 / libavformat.so.62 / libavutil.so.60.

## Refs

- reauktion/daedalus-v4l2 issue #11: reauktion/daedalus-v4l2#11
- marfrit-packages PR #76 (cycle 6 IDCT 4×4)
- marfrit-packages PR #85 (cycle 7 IDCT 8×8)
- marfrit/daedalus-fourier cycle 8 close (deblock luma-v NEON green)
2026-05-22 12:17:14 +02:00
marfrit 510a31622c Merge pull request 'ffmpeg-v4l2-request-fourier: substitute H.264 IDCT 8×8 → daedalus-fourier' (#85) from claude-noether/marfrit-packages:noether/ffmpeg-fourier-idct8-daedalus into main
Reviewed-on: marfrit/marfrit-packages#85
2026-05-22 08:32:15 +00:00
marfrit db9ae16da9 Merge pull request 'mesa-panvk-bifrost-video: regenerate 0005 patch from POST-review snapshot' (#84) from claude-noether/marfrit-packages:noether/mesa-panvk-bifrost-video-retrigger into main
Reviewed-on: marfrit/marfrit-packages#84
2026-05-22 08:20:34 +00:00
marfrit 493c762967 ffmpeg-v4l2-request-fourier: substitute H.264 IDCT 8×8 → daedalus-fourier
Cycle 7 of the libavcodec.so substitution arc (reauktion/daedalus-v4l2#11
step 2).  H264DSPContext.idct8_add — called per 8×8 block from the
High-profile intra-8×8-DCT decode path in libavcodec/h264_mb.c — now
dispatches through daedalus_recipe_dispatch_h264_idct8 instead of
ff_h264_idct8_add_neon.

## What

- Add 0004-h264-idct8-daedalus-fourier.patch (in both arch/ and debian/
  ffmpeg-v4l2-request-fourier/).  Extends libavcodec/aarch64/
  h264_idct_daedalus.c (introduced by 0003) with ff_h264_idct8_add_daedalus
  and a daedalus_recipe_dispatch_h264_idct8 call; patches
  libavcodec/aarch64/h264dsp_init_aarch64.c to wire c->idct8_add to
  the new shim.
- arch/PKGBUILD + debian/build-deb.sh: append the new patch to the
  apply list; bump pkgrel/PKGREL to 7.
- No new build-deps, no Depends change, no daedalus-fourier rev — the
  d87239d pin already exposes daedalus_recipe_dispatch_h264_idct8.

## Why

The recipe layer picks the substrate; for cycle 7 (H.264 IDCT 8×8)
the recipe is CPU NEON, so this is effectively a NEON-to-NEON
substitution layered on top of cycle 6.  Production validation of
cycle 6 on higgs Firefox YouTube: 3040 frames decoded cleanly,
avg_decode_us=3388 (no regression vs the pre-substitution ~4 ms
baseline).  Cycle 7 inherits the same shim's pthread_once context.

Bit-exact against ff_h264_idct8_add_neon (daedalus-fourier cycle 7
green; FFmpeg 8×8 block storage block[r + 8*c] matches daedalus
column-major convention).

## Scope NOT covered (deferred)

- Bulk c->idct8_add4 (inter 8×8-DCT macroblocks) stays on the
  in-tree NEON .S code; batched substitution with n_blocks>1 lands
  later alongside the cycle-6 bulk-paths work.
- High-bit-depth (10-bit) path untouched.
- Cycles 8/9 — separate PRs.

## SONAME

Unchanged.  libavcodec.so.62 / libavformat.so.62 / libavutil.so.60.

## Refs

- reauktion/daedalus-v4l2 issue #11 (substitution arc): reauktion/daedalus-v4l2#11
- marfrit-packages PR #76 (cycle 6 IDCT 4×4)
- marfrit-packages PR #78 (libxml2 ABI-skew workaround)
- marfrit/daedalus-fourier cycle 7 close (H.264 IDCT 8×8 NEON green)
2026-05-22 10:20:27 +02:00
marfrit 7ecbcb3c1b mesa-panvk-bifrost-video: regenerate 0005 patch from POST-review snapshot
The original 0005 patch was generated from the pre-Phase-5-review source
snapshot (phase5_review_input_2026-05-21.tgz), missing the four
load-bearing review fixes that landed in the post-review snapshot:
  - probe_hantro gate on KHR_video_* extension advertisement
  - per-session ts_counter (was process-global static)
  - panvk_v4l2_session_finish full unwind (munmap + STREAMOFF + REQBUFS=0)
  - MIN2(rb.count, 18) clamp on num_*_buffers

Run #162 (job 17032) failed in prepare() because the PKGBUILD sanity
check 'grep -q "KHR_video_queue = PAN_ARCH < 9 && panvk_v4l2_probe_hantro()"'
didn't match the actual patched output (which still had the pre-review
'KHR_video_queue = PAN_ARCH < 9,').

This patch (regenerated from phase5_post_review_2026-05-21.tgz) carries
all four review fixes. Validated locally: vanilla mesa-26.0.6 + r1..r4 +
this patch reproduces prepare()-OK byte-for-byte.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 10:18:11 +02:00
marfrit 360e8eb6bf Merge pull request 'mesa-panvk-bifrost-video: r1-r4 patches as real files (symlinks broke CI)' (#83) from claude-noether/marfrit-packages:noether/mesa-panvk-bifrost-video-retrigger into main
Reviewed-on: marfrit/marfrit-packages#83
2026-05-22 07:55:59 +00:00
marfrit 4db64917bc mesa-panvk-bifrost-video: r1-r4 patches as real files (symlinks broke CI)
The original PR #79 used symlinks for 0001..0004 patches (pointing into
../mesa-panvk-bifrost/) to avoid drift between siblings. CI's
"cp -r arch/mesa-panvk-bifrost-video /tmp/build-..." preserves the
symlinks, but the destination /tmp/build-... has no sibling dir to
resolve them against, so makepkg errors with:

  ==> ERROR: 0001-panvk-expose-robustness2-nullDescriptor-bifrost.patch
             was not found in the build directory and is not a URL.

Each Arch PKGBUILD owns its source files per convention; the
duplication risk is low because r1..r4 are closed-release patches.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 09:49:59 +02:00
marfrit 6288536223 Merge pull request 'ci: fix duplicate run: key in build.yml wipe-secrets step (unblocks all builds since 2026-05-21)' (#82) from claude-noether/marfrit-packages:noether/fix-build-yaml-duplicate-run into main
Reviewed-on: marfrit/marfrit-packages#82
2026-05-22 07:30:37 +00:00
claude-noether 09d8813507 ci: fix duplicate run: key in build.yml wipe-secrets step
PR #79 (6ee8f2748, mesa-panvk-bifrost-video) added a second `run:`
mapping key on the next line of the same step:

    - name: wipe secrets
      if: always()
      run: rm -f /root/repo_pass /root/.ssh/id_ed25519
      run: rm -f /root/.ssh/id_ed25519_hertz    ← duplicate `run:` key

YAML doesn't allow two mappings with the same key in one node, so
Gitea's workflow parser rejected the entire file:

  actions/workflows.go:124:DetectWorkflows() [W]
    ignore invalid workflow "build.yml": yaml: unmarshal errors:
      line 1423: mapping key "run" already defined at line 1422

Result: every push to main since 6ee8f2748 (2026-05-21 23:14 CEST)
silently failed to enqueue ANY action run.  PR #80's "re-trigger by
README touch" had no chance — workflow file was invalid before #80
even existed.  Runs #161-163 do not exist; #160 (pre-#79) is the
last successful enqueue.

Fix: merge the two single-line `run:` invocations into one literal
block.  Functionally identical, YAML-valid.

Post-merge: workflow file becomes valid again, new push to main
triggers a fresh build run covering the backlog (#79's
mesa-panvk-bifrost-video build that #80 wanted re-triggered).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 09:15:18 +02:00
marfrit 8a3186b53c Merge pull request 'mesa-panvk-bifrost-video: re-trigger Actions for PR #79' (#80) from claude-noether/marfrit-packages:noether/mesa-panvk-bifrost-video-retrigger into main
Reviewed-on: marfrit/marfrit-packages#80
2026-05-22 06:32:42 +00:00
marfrit b81e2251c2 mesa-panvk-bifrost-video: re-trigger Actions for PR #79
The merge commit for PR #79 (e7cc22e42) did not auto-fire the
Gitea Actions workflow despite touching paths matched by the
build.yml filter (arch/** + .gitea/workflows/**). No run row
exists between #160 (PR #78 merge) and now. This README touch
is a no-op content change to force a fresh workflow_dispatch
through the standard push trigger.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 07:51:01 +02:00
marfrit e7cc22e42d Merge pull request 'mesa-panvk-bifrost-video: sibling package adding VK_KHR_video_decode_h264' (#79) from claude-noether/marfrit-packages:noether/mesa-panvk-bifrost-video-phase8 into main
Reviewed-on: marfrit/marfrit-packages#79
2026-05-21 21:33:53 +00:00
marfrit 62b6b0a700 Merge pull request 'ffmpeg-v4l2-request-fourier (debian): drop --enable-libxml2 (runner SONAME skew)' (#78) from claude-noether/marfrit-packages:noether/ffmpeg-fourier-drop-libxml2 into main
Reviewed-on: marfrit/marfrit-packages#78
2026-05-21 21:24:07 +00:00
marfrit a8f4a70887 ffmpeg-v4l2-request-fourier (debian): drop --enable-libxml2 (runner SONAME skew)
The Gitea debian-aarch64 runner has been upgraded past Debian trixie
and now ships libxml2 ≥ 2.14 (SONAME 16) while higgs (and any other
trixie target) still has libxml2 2.12 (SONAME 2).  -5 built cleanly,
but on higgs the daedalus-v4l2 daemon's dlopen of libavformat.so.62
fails:

    dlopen(libavformat.so.62): libxml2.so.16:
    cannot open shared object file: No such file or directory

Drop --enable-libxml2 from the Debian configure invocation; remove
the libxml2 entry from Depends; remove libxml2-dev from the CI
build-deps.  FFmpeg's libxml2-backed DASH demuxer is unused on the
Fourier fleet — daedalus-v4l2 daemon feeds AVPackets straight to
avcodec_send_packet (no demux); mpv-fourier uses ytdlp + mpv's own
stream code; firefox-fourier uses gecko-media's DASH demux.

Bumps PKGREL 5 → 6.  No source code or substitution-patch change.
Mirrors the libva trixie/runner ABI-skew workaround pattern
(marfrit-packages PR #62).

Arch PKGBUILD unaffected — Arch runner + Arch consumers both
rolling, libxml2 SONAMEs match.

After this lands, re-deploy on higgs via:
    sudo apt update && sudo apt install -y ffmpeg-v4l2-request-fourier
    sudo systemctl restart daedalus-v4l2
2026-05-21 23:18:00 +02:00
marfrit 6ee8f2748e mesa-panvk-bifrost-video: sibling package adding VK_KHR_video_decode_h264
panvk-bifrost-video campaign close. Phase 4 byte-exact validated
2026-05-21 on RK3566/PineTab2 (Mali-G52 r1 MC1 + hantro VPU): 48/48
unique BBB display frames decoded by this driver are byte-identical
to ffmpeg+libva-v4l2-request-fourier on the same hantro hardware
(frame 42 Y md5 = 54b9b396e6cd377256eb4bce0efc0bed both ways).
Phase 5 second-model review passed; load-bearing findings applied.

Co-installs at /usr/lib/panvk-bifrost-video/ parallel to the r4
sibling at /usr/lib/panvk-bifrost/; opt-in via VK_ICD_FILENAMES.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-21 23:14:01 +02:00
marfrit 711a921e66 Merge pull request 'ffmpeg-v4l2-request-fourier: PKGREL 3 → 5 (force rebuild past orphan -4 .deb)' (#77) from claude-noether/marfrit-packages:noether/ffmpeg-fourier-debian-pkgrel-5 into main
Reviewed-on: marfrit/marfrit-packages#77
2026-05-21 20:36:15 +00:00
marfrit 9bf97fdb49 ffmpeg-v4l2-request-fourier: PKGREL 3 → 5 (force rebuild past orphan -4 .deb)
PR #76 (H.264 IDCT 4×4 daedalus-fourier substitution) was merged but
the resulting .deb was not actually built: an orphan
ffmpeg-v4l2-request-fourier_8.1+rfourier+gb57fbbe-4_arm64.deb (dated
2026-05-19, no matching source commit in main) sat in the apt pool.
.gitea/scripts/check-already-published.sh's debian branch compares
`dpkg --compare-versions $pool_ver ge $source_full` — pool -4
≥ source -3, so CI's skip-check emitted skip=1 and short-circuited
the build.  The ffmpeg-v4l2-request-debian Action reported success
without actually publishing.

Bump source PKGREL past -4 so the next CI run sees source >= pool
and proceeds to build + publish.

No source code change beyond PKGREL + changelog.  Arch side
unaffected (its skip-check is exact-URL-match, not pool-head-ge).
2026-05-21 22:17:00 +02:00
marfrit a536e20218 Merge pull request 'ffmpeg-v4l2-request-fourier: substitute H.264 IDCT 4×4 → daedalus-fourier' (#76) from claude-noether/marfrit-packages:noether/ffmpeg-fourier-idct4-daedalus into main
Reviewed-on: marfrit/marfrit-packages#76
2026-05-21 19:57:31 +00:00
marfrit a1dba5f630 Merge remote-tracking branch 'origin/main' into noether/ffmpeg-fourier-idct4-daedalus 2026-05-21 21:56:41 +02:00
marfrit 88a65cb6d0 CI: add cmake / ninja-build / libvulkan-dev / glslang-tools to ffmpeg-debian deps
The ffmpeg-v4l2-request-debian job now needs to build daedalus-fourier
into a temp prefix before configuring FFmpeg (substitution patch
0003-h264-idct4-daedalus-fourier.patch links libdaedalus_core.a into
libavcodec.so).  Mirror the build-deps the daedalus-v4l2-debian job
already declared for the same reason.

No-op on Arch — makepkg --syncdeps auto-installs cmake/ninja/
vulkan-headers from the PKGBUILD makedepends.
2026-05-21 21:47:48 +02:00
marfrit e641d679d3 ffmpeg-v4l2-request-fourier: substitute H.264 IDCT 4×4 → daedalus-fourier
First cycle of the libavcodec.so substitution arc (reauktion/daedalus-v4l2#11
step 2).  H264DSPContext.idct_add — called per 4×4 block from the
intra-4×4 decode path in libavcodec/h264_mb.c — now dispatches through
daedalus_recipe_dispatch_h264_idct4 instead of ff_h264_idct_add_neon.

## What

- Add 0003-h264-idct4-daedalus-fourier.patch (in both arch/ and
  debian/ ffmpeg-v4l2-request-fourier/).  Creates
  libavcodec/aarch64/h264_idct_daedalus.c (ff_h264_idct_add_daedalus
  shim + lazy pthread_once context init via
  daedalus_ctx_create_no_qpu), patches
  libavcodec/aarch64/h264dsp_init_aarch64.c to wire c->idct_add to
  the shim, adds the new .o to libavcodec/aarch64/Makefile.
- arch/PKGBUILD + debian/build-deb.sh: fetch + build
  daedalus-fourier (pinned at d87239d — lockstep with the
  daedalus-v4l2 daemon's inline build) with
  -DCMAKE_POSITION_INDEPENDENT_CODE=ON into a per-build temp prefix,
  then pass --extra-cflags=-I.../include --extra-ldflags=-L.../lib
  --extra-libs="-ldaedalus_core -lvulkan -lpthread" to FFmpeg
  configure.  daedalus_core.a is static-linked into libavcodec.so.62.
- debian/control Depends gains libvulkan1 (daedalus_core PUBLIC-links
  Vulkan::Vulkan for the queryable QPU substrate; the no-QPU
  constructor still works at runtime but the loader needs
  libvulkan.so.1 present to dlopen libavcodec.so.62).
- arch/PKGBUILD depends gains vulkan-icd-loader, makedepends gains
  cmake / ninja / vulkan-headers.

## Why

The recipe layer picks the substrate; for cycle 6 (H.264 IDCT 4×4)
the recipe is CPU NEON, so this is effectively a NEON-to-NEON
substitution with one extra dispatch call and recipe-table lookup.
The point of this first cycle isn't perf wins — it's plumbing.  Once
the path is wired and stable, follow-up patches batch through the
bulk paths (idct_add16 / idct_add16intra / idct_add8) and stack
cycles 7/8/9 (IDCT 8×8, luma-v deblock, qpel mc20).

Bit-exact against ff_h264_idct_add_neon (daedalus-fourier cycle 6
green; FFmpeg's 4×4 block storage matches daedalus's column-major
convention).

## Scope NOT covered

- Bulk paths (idct_add16 / idct_add16intra / idct_add8) — most IDCT
  4×4 calls in real H.264 streams go through these, not the per-
  block c->idct_add path; intra-4×4-only macroblocks are a minority.
  Batched substitution lands in a follow-up.
- High-bit-depth (10-bit) path — not touched; 8-bit only.
- Cycles 7/8/9 — separate PRs.

## SONAME

Unchanged.  libavcodec.so.62 / libavformat.so.62 / libavutil.so.60.
No daedalus-v4l2-dkms or daedalus-v4l2 bump required.

## Refs

- reauktion/daedalus-v4l2 issue #11 (substitution arc): reauktion/daedalus-v4l2#11
- marfrit/daedalus-fourier cycle 6 close (H.264 IDCT 4×4 NEON green)
2026-05-21 21:44:35 +02:00
marfrit 877238bd1b Merge pull request 'daedalus-v4l2: 3bc0da1 -> 6e6dfa1 — dlopen Kwiboo soname 62 + CI build-deps swap' (#75) from claude-noether/marfrit-packages:noether/daedalus-bump-6e6dfa1-soname62 into main
Reviewed-on: marfrit/marfrit-packages#75
2026-05-21 19:26:39 +00:00
claude-noether 27617e4cb0 daedalus-v4l2: 3bc0da1 -> 6e6dfa1 — dlopen Kwiboo soname 62 (#16) 2026-05-21 21:24:03 +02:00
marfrit a2daab1b28 Merge pull request 'daedalus-v4l2: 77e14e5 -> 3bc0da1 — decode_us + periodic stats' (#74) from claude-noether/marfrit-packages:noether/daedalus-bump-3bc0da1 into main
Reviewed-on: marfrit/marfrit-packages#74
2026-05-21 18:50:46 +00:00
claude-noether 9146e83710 daedalus-v4l2: 77e14e5 -> 3bc0da1 — decode_us + periodic stats (#15) 2026-05-21 20:29:07 +02:00
marfrit abf8fb3077 Merge pull request 'ci: add libvulkan-dev + glslang-tools for daedalus-fourier build dep' (#73) from claude-noether/marfrit-packages:noether/ci-fourier-build-deps into main
Reviewed-on: marfrit/marfrit-packages#73
2026-05-21 18:05:59 +00:00
claude-noether 1414dfeac2 .gitea/workflows: add libvulkan-dev + glslang-tools to daedalus-v4l2 Debian build deps
The daedalus-v4l2 build-deb.sh (post marfrit-packages#72) now fetches
+ cmake-builds daedalus-fourier into a per-build temp prefix before
building the daemon, so the static-archive can be linked in.
daedalus-fourier's CMakeLists requires Vulkan headers and glslangValidator
(for SPIR-V compilation of the .comp compute shaders).  Without them
the configure step on the debian-aarch64 runner fails with:

  CMake Error at FindPackageHandleStandardArgs.cmake:233 (message):
    Could NOT find Vulkan (missing: Vulkan_LIBRARY Vulkan_INCLUDE_DIR)

(Observed on Gitea Actions run 1056.)

Add `libvulkan-dev` and `glslang-tools` to the apt-get install line so
the in-build daedalus-fourier compile succeeds and the daemon can link.
2026-05-21 19:58:19 +02:00
marfrit 41c1e0b6b9 Merge pull request 'daedalus-v4l2: 5d8b436 -> 77e14e5 — #12 (LOW_DELAY) + #13 (daedalus-fourier linkage)' (#72) from claude-noether/marfrit-packages:noether/daedalus-bump-77e14e5-with-fourier into main
Reviewed-on: marfrit/marfrit-packages#72
2026-05-21 17:15:12 +00:00
claude-noether c9a4b82f2c daedalus-v4l2: 5d8b436 -> 77e14e5 — picks up #12 (LOW_DELAY) + #13 (daedalus-fourier linkage)
Daemon-only bump (no daedalus-v4l2-dkms change needed; PROTO_VERSION
stays at 0).

#12 (LOW_DELAY half-measure): daemon sets AV_CODEC_FLAG_LOW_DELAY on
the H.264 AVCodecContext so libavcodec emits frames in decode order
~99% of the time (a few stragglers at GOP boundaries when the
stream's SPS num_reorder_frames overrides the flag).  Visible
improvement vs the 2-1-4-3 pair-swap on Firefox + mpv playback;
not the permanent fix — see daedalus-v4l2#11 for the architectural
plan to substitute daedalus-fourier kernels for libavcodec's
pixel math one cycle at a time.

#13 (daedalus-fourier linkage): daemon now pkg-config-links against
the daedalus-fourier kernel library (marfrit/daedalus-fourier) and
logs substrate availability at startup.  No kernels dispatched yet
— this is the build-time foundation for the substitution work.

build-deb.sh updated to fetch + build + install daedalus-fourier
(pinned at d87239d, marfrit/daedalus-fourier PR #1) into a per-
build temp prefix before invoking the daemon's cmake, exposing it
via PKG_CONFIG_PATH.  Static-linked, so the resulting .deb has no
new runtime deps.  Requires libvulkan-dev + glslang-tools on the
CI runner.

Arch PKGBUILD bumped to the same upstream commit but Arch packaging
for daedalus-fourier itself is a follow-up; until that lands the
Arch build expects daedalus-fourier installed by the user (AUR-style).
Debian-side is end-to-end self-contained via build-deb.sh.

Refs:
  * reauktion/daedalus-v4l2#12
  * reauktion/daedalus-v4l2#13
  * reauktion/daedalus-v4l2#11
  * marfrit/daedalus-fourier#1
2026-05-21 18:39:22 +02:00
marfrit 736b6da176 Merge pull request 'daedalus-v4l2{,-dkms}: 79256dc/6ffe92b -> 5d8b436 — revert parking design' (#71) from claude-noether/marfrit-packages:noether/daedalus-revert-bump-5d8b436 into main
Reviewed-on: marfrit/marfrit-packages#71
2026-05-21 14:54:18 +00:00
claude-noether 34972ae9c1 daedalus-v4l2{,-dkms}: 79256dc/6ffe92b -> 5d8b436 — revert parking design
Lock-step downgrade of both packages to the revert tip of
daedalus-v4l2 (PR #10 closed PRs #7 + #8).  After
0.1.0+r28+g79256dc-1 / 0.1.0+r30+g6ffe92b-1 landed in production,
mpv (--hwdec=vaapi-copy) failed pre-playing with "Unable to dequeue
buffer: Resource temporarily unavailable" because the daemon
parked CAPTURE buffers waiting for libavcodec's display-order
reorder, violating libva's V4L2 stateless 1:1 contract.  See
daedalus-v4l2#9 for the diagnostic, #10 for the revert PR.

DAEDALUS_PROTO_VERSION drops 1 → 0; install both .debs in the same
apt transaction.  Userspace ABI returns to the f0d4186-equivalent
behaviour, plus PR #4 (cosmetic H.264 menu controls).  The
daedalus-v4l2-dkms #64 multi-kernel postinst behaviour stays in
build-deb.sh.

Visible regression: H.264 B-frame streams in Firefox return to the
"2 1 4 3 6 5" pair-swap visual.  Proper fix (concurrent in-flight
requests in daemon + display-order reorder moved into libva-v4l2-
request-fourier) tracked at daedalus-v4l2#11.

Refs:
  * reauktion/daedalus-v4l2#9
  * reauktion/daedalus-v4l2#10  (merged)
  * reauktion/daedalus-v4l2#11
2026-05-21 15:42:03 +02:00
marfrit a9f1b833b9 Merge pull request 'mesa-panvk-bifrost: r3 -> r4 — iter17 XFB primitive decomposition' (#70) from claude-noether/marfrit-packages:noether/mesa-panvk-bifrost-r4-iter17-xfb-decomp into main
Reviewed-on: marfrit/marfrit-packages#70
2026-05-21 12:18:23 +00:00
marfrit 83e8eca56d mesa-panvk-bifrost: r3 -> r4 — iter17 XFB primitive decomposition
iter17 closes the 162 winding_* CTS failures from iter15's baseline by
replacing the upstream pan_nir_lower_xfb call with a panvk-specific NIR
pass (panvk_per_arch(nir_lower_xfb)) that handles per-primitive
decomposition for non-LIST topologies (LINE_STRIP, TRIANGLE_STRIP,
TRIANGLE_FAN, and the four _WITH_ADJACENCY variants).

Topology + per-instance output vertex count are threaded as new sysvals
(vs.xfb_topology + vs.xfb_output_count) so the NIR pass can dispatch
per-topology at runtime without compiling 7+ shader variants.

dEQP-VK.transform_feedback.simple.* result (133596 cases total):
                  iter15 baseline  ->  iter17
  Pass:             796               958   (+162)
  Fail:             243               81    (-162; resume_* by-design only)
  NotSupported:     132551            132551
  Fatal-skip:       6                 6
  Pass rate of runnable: 76.2% -> 91.7% (+15.5pp)

100% of the iter15 winding-fail cluster closed. The remaining 81 fails
are all resume_* (pause/resume XFB, by design — we advertise
transformFeedbackDraw=false).

Second-model review (janet) produced 3 findings; Findings 1+2 were
already fixed in the in-tree applied state (stale applied_state/ snapshot
read by reviewer), Finding 3 (degenerate N underflow on N<2) addressed
by gating non-LIST emission on `output_count > 0` predicate.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-21 14:07:00 +02:00
marfrit 1c8c186681 Merge pull request 'daedalus-v4l2-dkms: 79256dc -> 6ffe92b — fix kernel panic regression from #67' (#69) from claude-noether/marfrit-packages:noether/daedalus-dkms-bump-6ffe92b into main
Reviewed-on: marfrit/marfrit-packages#69
2026-05-21 12:00:15 +00:00
claude-noether a0be2dcc9f daedalus-v4l2-dkms: 79256dc -> 6ffe92b — fix kernel panic from #7
Kernel-only bump.  Fixes the hard-reboot regression introduced by
the daedalus-v4l2#7 split-completion design and observed on higgs
(Pi CM5) during the first mpv vaapi-copy playback of 720p H.264:
device_run now removes src + dst from m2m_ctx's rdy_queue at the
moment it picks them up, not at buf_done time.  Without this, a
parked dst_buf (waiting for libavcodec's display-order release)
stayed in the rdy_queue and got re-picked by the next device_run
after SRC_CONSUMED's job_finish released the scheduler — two
inflight entries on the same vb2_buffer, later HAS_PIXELS calls
list_del on an already-detached list_head, panic.

DAEDALUS_PROTO_VERSION stays at 1 — daemon (userspace
daedalus-v4l2) need NOT bump in lockstep with this DKMS update.
The existing daedalus-v4l2 0.1.0+r28+g79256dc is wire-compatible
with daedalus-v4l2-dkms 0.1.0+r30+g6ffe92b.

Refs:
  * reauktion/daedalus-v4l2#8
2026-05-21 13:56:42 +02:00
marfrit eb89f12c3e Merge pull request 'libva-v4l2-request-fourier: bump pin to c454618 (#15 transparent resize)' (#68) from claude-noether/marfrit-packages:bump-libva-fourier-c454618-issue-15 into main
Reviewed-on: marfrit/marfrit-packages#68
2026-05-21 11:25:39 +00:00
30 changed files with 6255 additions and 48 deletions
+161 -10
View File
@@ -930,12 +930,13 @@ jobs:
# map 1:1 to the previous Arch list; libav*-dev intentionally # map 1:1 to the previous Arch list; libav*-dev intentionally
# absent (we are FFmpeg itself, providing those libs). # absent (we are FFmpeg itself, providing those libs).
retry apt-get install -y --no-install-recommends \ retry apt-get install -y --no-install-recommends \
build-essential git pkg-config nasm yasm \ build-essential cmake ninja-build git pkg-config nasm yasm \
linux-libc-dev libgl1-mesa-dev libasound2-dev libbz2-dev \ linux-libc-dev libgl1-mesa-dev libasound2-dev libbz2-dev \
libfontconfig-dev libfribidi-dev libgmp-dev libgnutls28-dev \ libfontconfig-dev libfribidi-dev libgmp-dev libgnutls28-dev \
libmp3lame-dev libass-dev libdav1d-dev libdrm-dev \ libmp3lame-dev libass-dev libdav1d-dev libdrm-dev \
libfreetype-dev libpulse-dev libva-dev libvorbis-dev libvpx-dev \ libfreetype-dev libpulse-dev libva-dev libvorbis-dev libvpx-dev \
libwebp-dev libx264-dev libx265-dev libxml2-dev libopus-dev \ libwebp-dev libx264-dev libx265-dev libopus-dev \
libvulkan-dev glslang-tools \
v4l-utils liblzma-dev zlib1g-dev \ v4l-utils liblzma-dev zlib1g-dev \
curl ca-certificates openssh-client rsync dpkg-dev curl ca-certificates openssh-client rsync dpkg-dev
@@ -1159,16 +1160,30 @@ jobs:
retry() { for i in 1 2 3; do "$@" && return 0; rc=$?; echo "retry $i (exit=$rc)" >&2; sleep $((i*5)); done; return 1; } retry() { for i in 1 2 3; do "$@" && return 0; rc=$?; echo "retry $i (exit=$rc)" >&2; sleep $((i*5)); done; return 1; }
export DEBIAN_FRONTEND=noninteractive export DEBIAN_FRONTEND=noninteractive
retry apt-get update -qq retry apt-get update -qq
# libav*-dev provide the headers daedalus daemon dlopens at # FFmpeg headers + sonames the daemon dlopens. As of
# runtime — Debian's stock packages match the trixie ABI the # daedalus-v4l2 PR #16 (commit 514da29), the daemon targets
# daemon will encounter on Pi 5 hosts (both ship libavcodec # the Kwiboo fork's libavcodec.so.62 / libavformat.so.62 /
# 61.x). The fourier ffmpeg fork isn't needed here; the # libavutil.so.60 at /opt/fourier — so the build needs
# daemon never link-binds against libav (Option γ — dlopen # /opt/fourier/include and /opt/fourier/lib/pkgconfig.
# at runtime), so any header set with the right struct # ffmpeg-v4l2-request-fourier provides both (plus the
# definitions works. # runtime libs the .deb will dlopen on the target host;
# we install it as a build-dep here and the dpkg-shlibdeps
# step pulls it into the daemon .deb's Depends automatically).
# Debian-stock libav*-dev removed — would conflict on
# /usr/include/libavcodec/avcodec.h vs /opt/fourier's copy.
#
# libvulkan-dev + glslang-tools: needed by the in-build
# daedalus-fourier fetch (build-deb.sh fetches the sibling
# library, cmake-builds it into a temp prefix, then the
# daedalus daemon static-links against it via pkg-config).
# Without these, daedalus-fourier's find_package(Vulkan)
# and glslangValidator find_program both fail at configure
# time. See marfrit/daedalus-fourier PR #1 +
# reauktion/daedalus-v4l2 PR #13.
retry apt-get install -y --no-install-recommends \ retry apt-get install -y --no-install-recommends \
build-essential cmake ninja-build pkg-config git \ build-essential cmake ninja-build pkg-config git \
libavcodec-dev libavformat-dev libavutil-dev libdrm-dev \ ffmpeg-v4l2-request-fourier libdrm-dev \
libvulkan-dev glslang-tools \
linux-libc-dev \ linux-libc-dev \
curl ca-certificates openssh-client rsync dpkg-dev curl ca-certificates openssh-client rsync dpkg-dev
@@ -1402,6 +1417,142 @@ jobs:
-e 'ssh -i /root/.ssh/id_ed25519' \ -e 'ssh -i /root/.ssh/id_ed25519' \
./ mfritsche@nc.reauktion.de:arch/aarch64/ ./ mfritsche@nc.reauktion.de:arch/aarch64/
- name: wipe secrets
if: always()
run: |
rm -f /root/repo_pass /root/.ssh/id_ed25519
rm -f /root/.ssh/id_ed25519_hertz
# -------------------------------------------------------------------------
# mesa-panvk-bifrost-video (aarch64 only) — sibling adding VK_KHR_video_decode_h264
# via the V4L2 hantro VPU. Phase 4 byte-exact validated 2026-05-21.
# Co-installs at /usr/lib/panvk-bifrost-video/ (parallel to r4); opt-in
# via VK_ICD_FILENAMES (no launcher shipped — uses standard Vulkan loader).
#
# Build is slow (~30-60min on actrunner-aarch64): full Mesa-from-source.
# Standalone job — no `needs:` since it doesn't depend on the fourier
# codec stack. continue-on-error so a build hiccup doesn't block other
# jobs in the same workflow run.
# -------------------------------------------------------------------------
mesa-panvk-bifrost-video-aarch64:
runs-on: arch-aarch64
continue-on-error: true
steps:
- uses: actions/checkout@v4
- name: skip if already published
id: skip-check
run: |
set -e
result=$(./.gitea/scripts/check-already-published.sh arch/mesa-panvk-bifrost-video)
echo "$result" >> "$GITHUB_OUTPUT"
echo "decision: $result"
- name: bootstrap runner (idempotent)
if: steps.skip-check.outputs.skip != '1'
run: |
set -e
retry() { for i in 1 2 3; do "$@" && return 0; rc=$?; echo "retry $i (exit=$rc)" >&2; sleep $((i*5)); done; return 1; }
retry pacman -Syu --noconfirm --needed base-devel git rsync gnupg openssh sudo
- name: import signing key
if: steps.skip-check.outputs.skip != '1'
env:
PRIV: ${{ secrets.MARFRIT_REPO_PRIVATE_KEY }}
PASS: ${{ secrets.MARFRIT_REPO_PASSPHRASE }}
run: |
set -e
gpgconf --homedir /root/.gnupg --kill all 2>/dev/null || true
rm -rf /root/.gnupg /root/repo_pass
mkdir -m700 -p /root/.gnupg
printf '%s' "$PASS" > /root/repo_pass
chmod 600 /root/repo_pass
printf '%s\n' "$PRIV" | gpg --batch --import
echo "92D5E96D8F63C75E4116AA1FF5C8C4603D0D250C:6:" | gpg --import-ownertrust
- name: install deploy ssh key
if: steps.skip-check.outputs.skip != '1'
env:
KEY: ${{ secrets.MARFRIT_REPO_DEPLOY_KEY }}
run: |
mkdir -m700 -p /root/.ssh
printf '%s\n' "$KEY" > /root/.ssh/id_ed25519
chmod 600 /root/.ssh/id_ed25519
ssh-keyscan -t ed25519 nc.reauktion.de > /root/.ssh/known_hosts 2>/dev/null
- name: makepkg mesa-panvk-bifrost-video
if: steps.skip-check.outputs.skip != '1'
run: |
set -e
rm -rf /tmp/build-mesa-panvk-bifrost-video
cp -r arch/mesa-panvk-bifrost-video /tmp/build-mesa-panvk-bifrost-video
chown -R builder:builder /tmp/build-mesa-panvk-bifrost-video
cd /tmp/build-mesa-panvk-bifrost-video
# MAKEFLAGS for parallel build; runner is multi-core.
# --skipinteg because sha256sums=SKIP in PKGBUILD (matches the
# fourier-fork PKGBUILD convention).
sudo -u builder -H env MAKEFLAGS="-j60" \
makepkg --nocheck --noconfirm --syncdeps --cleanbuild --skipinteg
ls -la *.pkg.tar.* | grep -v "\.sig$"
- name: sign mesa-panvk-bifrost-video
if: steps.skip-check.outputs.skip != '1'
run: |
set -e
cd /tmp/build-mesa-panvk-bifrost-video
for f in *.pkg.tar.xz *.pkg.tar.zst *.pkg.tar.gz; do
[ -f "$f" ] || continue
gpg --batch --pinentry-mode loopback --passphrase-file /root/repo_pass \
--detach-sign --yes -u 92D5E96D8F63C75E4116AA1FF5C8C4603D0D250C "$f"
done
- name: update aarch64 repo db
if: steps.skip-check.outputs.skip != '1'
run: |
set -e
mkdir -p /tmp/arch-stage-mesa-panvk-video
cd /tmp/arch-stage-mesa-panvk-video
rm -f *
for f in marfrit.db.tar.gz marfrit.db.tar.gz.sig marfrit.files.tar.gz marfrit.files.tar.gz.sig; do
curl -sSLf "https://packages.reauktion.de/arch/aarch64/$f" -o "$f" || rm -f "$f"
done
for ext in xz zst gz; do
ls /tmp/build-mesa-panvk-bifrost-video/*.pkg.tar.$ext 2>/dev/null && \
mv /tmp/build-mesa-panvk-bifrost-video/*.pkg.tar.$ext /tmp/build-mesa-panvk-bifrost-video/*.pkg.tar.$ext.sig .
done || true
export GNUPGHOME=/root/.gnupg
printf 'pinentry-mode loopback\npassphrase-file /root/repo_pass\n' > /root/.gnupg/gpg.conf
printf 'allow-loopback-pinentry\n' > /root/.gnupg/gpg-agent.conf
gpg-connect-agent reloadagent /bye
pkgs=()
for ext in xz zst gz; do
for f in *.pkg.tar.$ext; do [ -f "$f" ] && pkgs+=("$f"); done
done
if [ -f marfrit.db.tar.gz ]; then
for f in "${pkgs[@]}"; do
name=$(echo "$f" | sed -E 's/-[0-9].*//')
repo-remove --sign --key 92D5E96D8F63C75E4116AA1FF5C8C4603D0D250C \
marfrit.db.tar.gz "$name" 2>/dev/null || true
done
fi
repo-add --new --sign --key 92D5E96D8F63C75E4116AA1FF5C8C4603D0D250C \
--verify marfrit.db.tar.gz "${pkgs[@]}"
ln -sf marfrit.db.tar.gz marfrit.db
ln -sf marfrit.files.tar.gz marfrit.files
ln -sf marfrit.db.tar.gz.sig marfrit.db.sig
rm -f marfrit.files.sig
- name: publish to aarch64
if: steps.skip-check.outputs.skip != '1'
run: |
set -e
retry() { for i in 1 2 3; do "$@" && return 0; rc=$?; echo "retry $i (exit=$rc)" >&2; sleep $((i*5)); done; return 1; }
cd /tmp/arch-stage-mesa-panvk-video
retry rsync -avL --copy-unsafe-links \
-e 'ssh -i /root/.ssh/id_ed25519' \
./ mfritsche@nc.reauktion.de:arch/aarch64/
- name: wipe secrets - name: wipe secrets
if: always() if: always()
run: rm -f /root/repo_pass /root/.ssh/id_ed25519 run: rm -f /root/repo_pass /root/.ssh/id_ed25519
+8 -5
View File
@@ -18,12 +18,15 @@ _module=daedalus_v4l2
# Same pin as arch/daedalus-v4l2 — keep kernel module + daemon # Same pin as arch/daedalus-v4l2 — keep kernel module + daemon
# bit-versioned together so the chardev wire protocol stays in sync. # bit-versioned together so the chardev wire protocol stays in sync.
# PROTO_VERSION 0 → 1 at this pin (H.264 B-frame reorder fix); must # 5d8b436 reverts PRs #7 + #8 (parking design that broke libva's
# install both packages atomically. # 1:1 contract — see daedalus-v4l2#9 + #10). Tree is
_commit=79256dc7ef41f83873ca9c23db20f5888858e65d # content-equivalent to f0d4186 plus PR #4 (cosmetic menu ctrls).
# PROTO_VERSION drops 1 → 0; lock-step install with
# daedalus-v4l2 0.1.0.r33.5d8b436 REQUIRED.
_commit=872eec505eb91b561892d02a0526749348ddc121
pkgver=0.1.0.r28.79256dc pkgver=0.1.0.r45.872eec5
pkgrel=1 # reset for new upstream pin (79256dc — H.264 B-frame reorder fix) pkgrel=1 # reset for new upstream pin (872eec5 — PROTO_MAX_PAYLOAD 64 KiB -> 1 MiB, closes #19); lock-step with daedalus-v4l2 0.1.0.r45.872eec5 REQUIRED
pkgdesc="V4L2 stateless decoder shim kernel module (DKMS) — Pi 5 / CM5" pkgdesc="V4L2 stateless decoder shim kernel module (DKMS) — Pi 5 / CM5"
arch=('any') arch=('any')
url="https://git.reauktion.de/reauktion/daedalus-v4l2" url="https://git.reauktion.de/reauktion/daedalus-v4l2"
+11 -10
View File
@@ -16,18 +16,19 @@
pkgname=daedalus-v4l2 pkgname=daedalus-v4l2
_upstreampkg=daedalus-v4l2 _upstreampkg=daedalus-v4l2
# Pin the daedalus-v4l2 tip. 79256dc = "kernel + daemon: H.264 B-frame # 6e6dfa1 = picks up daedalus-v4l2 PR #16 — daemon now dlopens
# display reorder fix (closes #6)" — adds the wire-protocol src_pts / # the Kwiboo fourier fork's libavcodec.so.62 / libavformat.so.62 /
# output_src_pts / RESP_FRAME flags split that lets H.264 streams with # libavutil.so.60 at /opt/fourier instead of Debian-stock soname
# B-frames preserve display order through libva → kernel → daemon. # 61/61/59. First step on the daedalus-fourier substitution arc
# PROTO_VERSION bumps 0 → 1; lock-step userspace + kernel rebuild # (daedalus-v4l2#11). Daemon still needs daedalus-fourier at
# REQUIRED (daedalus-v4l2-dkms PKGBUILD pinned to the same commit). # build time (Arch packaging for that is a follow-up; Debian side
_commit=79256dc7ef41f83873ca9c23db20f5888858e65d # fetches inline via build-deb.sh).
_commit=872eec505eb91b561892d02a0526749348ddc121
# 0.1.0 (pre-1.0) + commit count + short sha. Bump the .Y on each # 0.1.0 (pre-1.0) + commit count + short sha. Bump the .Y on each
# Phase 8.x close. pkgver() recomputes at build time. # Phase 8.x close. pkgver() recomputes at build time.
pkgver=0.1.0.r28.79256dc pkgver=0.1.0.r45.872eec5
pkgrel=1 # reset for new upstream pin (79256dc — H.264 B-frame reorder fix) pkgrel=1 # reset for new upstream pin (872eec5 — PROTO_MAX_PAYLOAD 64 KiB -> 1 MiB, closes #19); lock-step with daedalus-v4l2-dkms 0.1.0.r45.872eec5 REQUIRED
pkgdesc="Userspace daemon for the daedalus-v4l2 V4L2 stateless decoder shim (VP9/AV1/H.264 on Pi 5 / CM5)" pkgdesc="Userspace daemon for the daedalus-v4l2 V4L2 stateless decoder shim (VP9/AV1/H.264 on Pi 5 / CM5)"
arch=('aarch64') arch=('aarch64')
url="https://git.reauktion.de/reauktion/daedalus-v4l2" url="https://git.reauktion.de/reauktion/daedalus-v4l2"
@@ -35,7 +36,7 @@ license=('BSD-2-Clause' 'GPL-2.0-or-later')
# Daemon dlopens libavformat.so.61 / libavcodec.so.61 / libavutil.so.59 # Daemon dlopens libavformat.so.61 / libavcodec.so.61 / libavutil.so.59
# at runtime (Option γ — see daemon/src/ffmpeg_loader.h). ffmpeg # at runtime (Option γ — see daemon/src/ffmpeg_loader.h). ffmpeg
# provides those; we don't link them. # provides those; we don't link them.
depends=('ffmpeg' 'libdrm') depends=('ffmpeg-v4l2-request-fourier' 'libdrm')
# Headers from libav*-dev needed at compile time for type-safe function # Headers from libav*-dev needed at compile time for type-safe function
# pointer signatures; pkg-config locates them. # pointer signatures; pkg-config locates them.
makedepends=('cmake' 'ninja' 'pkgconf' 'git' 'ffmpeg') makedepends=('cmake' 'ninja' 'pkgconf' 'git' 'ffmpeg')
@@ -0,0 +1,137 @@
From f760c0541586f43334c02611fcb4c212c08ad576 Mon Sep 17 00:00:00 2001
From: Markus Fritsche <mfritsche@reauktion.de>
Date: Thu, 21 May 2026 21:40:22 +0200
Subject: [PATCH] avcodec/aarch64/h264dsp: route H.264 4x4 IDCT through
daedalus-fourier
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
H264DSPContext.idct_add (called per 4x4 block from the intra-4x4
decode path in h264_mb.c) now dispatches through
daedalus_recipe_dispatch_h264_idct4 instead of ff_h264_idct_add_neon.
The recipe layer picks the substrate; for cycle 6 (H.264 IDCT 4x4)
the recipe is CPU NEON, so this is effectively a NEON-to-NEON
substitution with one extra dispatch call and recipe-table lookup.
Provides the first end-to-end exercise of the daedalus-fourier
kernel pack inside the libavcodec.so decode hot path; follow-up
patches wire IDCT 8x8, luma-v deblock, and qpel mc20.
The library context is process-global, lazily initialised under
pthread_once on first call. We pick the no-QPU constructor because
libavcodec.so is loaded into arbitrary host processes
(firefox-fourier, mpv-fourier, daedalus_v4l2_daemon, ...) and we
cannot assume the host has a usable Vulkan instance. Higher cycles
(deblock luma-v, MC) that benefit from the QPU will provision their
own recipe-selected context once that path is wired.
Bulk paths (idct_add16, idct_add16intra, idct_add8 — used for
non-intra4x4 macroblocks) remain on the stock NEON .S implementations
and will be batched through daedalus_recipe_dispatch_h264_idct4 with
n_blocks>1 in a follow-up.
Bit-exact against ff_h264_idct_add_neon (daedalus-fourier cycle 6
green; see marfrit/daedalus-fourier/CYCLE_LOGS.md).
Refs reauktion/daedalus-v4l2#11 — substitution arc step 2.
---
libavcodec/aarch64/Makefile | 3 +-
libavcodec/aarch64/h264_idct_daedalus.c | 49 +++++++++++++++++++++++
libavcodec/aarch64/h264dsp_init_aarch64.c | 3 +-
3 files changed, 53 insertions(+), 2 deletions(-)
create mode 100644 libavcodec/aarch64/h264_idct_daedalus.c
diff --git a/libavcodec/aarch64/Makefile b/libavcodec/aarch64/Makefile
index 41ab025..7b95fb1 100644
--- a/libavcodec/aarch64/Makefile
+++ b/libavcodec/aarch64/Makefile
@@ -3,7 +3,8 @@ OBJS-$(CONFIG_AC3DSP) += aarch64/ac3dsp_init_aarch64.o
OBJS-$(CONFIG_FDCTDSP) += aarch64/fdctdsp_init_aarch64.o
OBJS-$(CONFIG_FMTCONVERT) += aarch64/fmtconvert_init.o
OBJS-$(CONFIG_H264CHROMA) += aarch64/h264chroma_init_aarch64.o
-OBJS-$(CONFIG_H264DSP) += aarch64/h264dsp_init_aarch64.o
+OBJS-$(CONFIG_H264DSP) += aarch64/h264dsp_init_aarch64.o \
+ aarch64/h264_idct_daedalus.o
OBJS-$(CONFIG_HUFFYUVDSP) += aarch64/huffyuvdsp_init_aarch64.o
OBJS-$(CONFIG_H264PRED) += aarch64/h264pred_init.o
OBJS-$(CONFIG_H264QPEL) += aarch64/h264qpel_init_aarch64.o
diff --git a/libavcodec/aarch64/h264_idct_daedalus.c b/libavcodec/aarch64/h264_idct_daedalus.c
new file mode 100644
index 0000000..538d223
--- /dev/null
+++ b/libavcodec/aarch64/h264_idct_daedalus.c
@@ -0,0 +1,49 @@
+/*
+ * H.264 4x4 IDCT + add — daedalus-fourier substitution shim.
+ *
+ * Routes H264DSPContext.idct_add through
+ * daedalus_recipe_dispatch_h264_idct4 instead of ff_h264_idct_add_neon.
+ * The recipe layer picks the substrate (CPU NEON by default for
+ * cycle 6; future cycles may dispatch to V3D opportunistically).
+ *
+ * FFmpeg's 4x4 block memory layout matches daedalus's column-major
+ * convention: block[r + 4*c] = coefficient at (row r, col c). Both
+ * sides destructively zero the block after the transform.
+ *
+ * The library context is process-global and lazily initialised under
+ * pthread_once. We pick the no-QPU constructor here because
+ * libavcodec.so is loaded into arbitrary host processes
+ * (firefox-fourier, mpv-fourier, daedalus_v4l2_daemon, ...) and we
+ * cannot assume the host has a usable Vulkan instance. Higher cycles
+ * (deblock, MC) that benefit from the QPU initialise their own
+ * recipe-selected context once that path is wired.
+ */
+
+#include <pthread.h>
+#include <stddef.h>
+#include <stdint.h>
+
+#include <daedalus.h>
+
+#include "libavutil/attributes.h"
+#include "libavcodec/h264dsp.h"
+
+static daedalus_ctx *g_dctx;
+static pthread_once_t g_dctx_once = PTHREAD_ONCE_INIT;
+
+static void daedalus_ctx_init_once(void)
+{
+ g_dctx = daedalus_ctx_create_no_qpu();
+}
+
+void ff_h264_idct_add_daedalus(uint8_t *dst, int16_t *block, int stride);
+
+void ff_h264_idct_add_daedalus(uint8_t *dst, int16_t *block, int stride)
+{
+ static const daedalus_h264_block_meta meta = { .dst_off = 0 };
+
+ pthread_once(&g_dctx_once, daedalus_ctx_init_once);
+
+ daedalus_recipe_dispatch_h264_idct4(g_dctx, dst, (size_t)stride,
+ block, 1, &meta);
+}
diff --git a/libavcodec/aarch64/h264dsp_init_aarch64.c b/libavcodec/aarch64/h264dsp_init_aarch64.c
index c684574..b993df2 100644
--- a/libavcodec/aarch64/h264dsp_init_aarch64.c
+++ b/libavcodec/aarch64/h264dsp_init_aarch64.c
@@ -66,6 +66,7 @@ void ff_biweight_h264_pixels_4_neon(uint8_t *dst, uint8_t *src, ptrdiff_t stride
int weights, int offset);
void ff_h264_idct_add_neon(uint8_t *dst, int16_t *block, int stride);
+void ff_h264_idct_add_daedalus(uint8_t *dst, int16_t *block, int stride);
void ff_h264_idct_dc_add_neon(uint8_t *dst, int16_t *block, int stride);
void ff_h264_idct_add16_neon(uint8_t *dst, const int *block_offset,
int16_t *block, int stride,
@@ -139,7 +140,7 @@ av_cold void ff_h264dsp_init_aarch64(H264DSPContext *c, const int bit_depth,
c->biweight_pixels_tab[1] = ff_biweight_h264_pixels_8_neon;
c->biweight_pixels_tab[2] = ff_biweight_h264_pixels_4_neon;
- c->idct_add = ff_h264_idct_add_neon;
+ c->idct_add = ff_h264_idct_add_daedalus;
c->idct_dc_add = ff_h264_idct_dc_add_neon;
c->idct_add16 = ff_h264_idct_add16_neon;
c->idct_add16intra = ff_h264_idct_add16intra_neon;
--
2.47.3
@@ -0,0 +1,107 @@
From 1b286ddb4efaca26ec9b9e290e989fec77dc1c77 Mon Sep 17 00:00:00 2001
From: Markus Fritsche <mfritsche@reauktion.de>
Date: Fri, 22 May 2026 10:18:21 +0200
Subject: [PATCH] avcodec/aarch64/h264dsp: route H.264 8x8 IDCT through
daedalus-fourier
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
H264DSPContext.idct8_add (called per 8x8 block from the High-profile
intra-8x8-DCT decode path in h264_mb.c) now dispatches through
daedalus_recipe_dispatch_h264_idct8 instead of ff_h264_idct8_add_neon.
The recipe layer picks the substrate; for cycle 7 (H.264 IDCT 8x8)
the recipe is CPU NEON, so this is effectively a NEON-to-NEON
substitution layered on top of the cycle-6 IDCT 4x4 wiring. Same
pthread_once global context, same destructive-zero semantics; FFmpeg
column-major 8x8 storage block[r + 8*c] matches daedalus's convention.
Bulk path c->idct8_add4 (used for inter 8x8-DCT macroblocks) remains
on the in-tree NEON .S code and will be batched through
daedalus_recipe_dispatch_h264_idct8 with n_blocks>1 in a follow-up.
Bit-exact against ff_h264_idct8_add_neon (daedalus-fourier cycle 7
green).
Refs reauktion/daedalus-v4l2#11 — substitution arc step 2 cycle 7.
---
libavcodec/aarch64/h264_idct_daedalus.c | 29 ++++++++++++++++-------
libavcodec/aarch64/h264dsp_init_aarch64.c | 3 ++-
2 files changed, 23 insertions(+), 9 deletions(-)
diff --git a/libavcodec/aarch64/h264_idct_daedalus.c b/libavcodec/aarch64/h264_idct_daedalus.c
index 538d223..cbb98af 100644
--- a/libavcodec/aarch64/h264_idct_daedalus.c
+++ b/libavcodec/aarch64/h264_idct_daedalus.c
@@ -1,14 +1,16 @@
/*
- * H.264 4x4 IDCT + add — daedalus-fourier substitution shim.
+ * H.264 4x4 / 8x8 IDCT + add — daedalus-fourier substitution shims.
*
- * Routes H264DSPContext.idct_add through
- * daedalus_recipe_dispatch_h264_idct4 instead of ff_h264_idct_add_neon.
- * The recipe layer picks the substrate (CPU NEON by default for
- * cycle 6; future cycles may dispatch to V3D opportunistically).
+ * Routes H264DSPContext.idct_add → daedalus_recipe_dispatch_h264_idct4
+ * H264DSPContext.idct8_add → daedalus_recipe_dispatch_h264_idct8
+ * instead of the in-tree ff_h264_idct{,8}_add_neon assembly. The
+ * recipe layer picks the substrate (CPU NEON by default for cycles
+ * 6 + 7; future cycles may dispatch to V3D opportunistically).
*
- * FFmpeg's 4x4 block memory layout matches daedalus's column-major
- * convention: block[r + 4*c] = coefficient at (row r, col c). Both
- * sides destructively zero the block after the transform.
+ * FFmpeg's 4x4 and 8x8 block memory layouts match daedalus's
+ * column-major convention: block[r + N*c] = coefficient at
+ * (row r, col c) for N ∈ {4, 8}. Both sides destructively zero the
+ * block after the transform.
*
* The library context is process-global and lazily initialised under
* pthread_once. We pick the no-QPU constructor here because
@@ -37,6 +39,7 @@ static void daedalus_ctx_init_once(void)
}
void ff_h264_idct_add_daedalus(uint8_t *dst, int16_t *block, int stride);
+void ff_h264_idct8_add_daedalus(uint8_t *dst, int16_t *block, int stride);
void ff_h264_idct_add_daedalus(uint8_t *dst, int16_t *block, int stride)
{
@@ -47,3 +50,13 @@ void ff_h264_idct_add_daedalus(uint8_t *dst, int16_t *block, int stride)
daedalus_recipe_dispatch_h264_idct4(g_dctx, dst, (size_t)stride,
block, 1, &meta);
}
+
+void ff_h264_idct8_add_daedalus(uint8_t *dst, int16_t *block, int stride)
+{
+ static const daedalus_h264_block_meta meta = { .dst_off = 0 };
+
+ pthread_once(&g_dctx_once, daedalus_ctx_init_once);
+
+ daedalus_recipe_dispatch_h264_idct8(g_dctx, dst, (size_t)stride,
+ block, 1, &meta);
+}
diff --git a/libavcodec/aarch64/h264dsp_init_aarch64.c b/libavcodec/aarch64/h264dsp_init_aarch64.c
index b993df2..741e551 100644
--- a/libavcodec/aarch64/h264dsp_init_aarch64.c
+++ b/libavcodec/aarch64/h264dsp_init_aarch64.c
@@ -79,6 +79,7 @@ void ff_h264_idct_add8_neon(uint8_t **dest, const int *block_offset,
const uint8_t nnzc[15 * 8]);
void ff_h264_idct8_add_neon(uint8_t *dst, int16_t *block, int stride);
+void ff_h264_idct8_add_daedalus(uint8_t *dst, int16_t *block, int stride);
void ff_h264_idct8_dc_add_neon(uint8_t *dst, int16_t *block, int stride);
void ff_h264_idct8_add4_neon(uint8_t *dst, const int *block_offset,
int16_t *block, int stride,
@@ -146,7 +147,7 @@ av_cold void ff_h264dsp_init_aarch64(H264DSPContext *c, const int bit_depth,
c->idct_add16intra = ff_h264_idct_add16intra_neon;
if (chroma_format_idc <= 1)
c->idct_add8 = ff_h264_idct_add8_neon;
- c->idct8_add = ff_h264_idct8_add_neon;
+ c->idct8_add = ff_h264_idct8_add_daedalus;
c->idct8_dc_add = ff_h264_idct8_dc_add_neon;
c->idct8_add4 = ff_h264_idct8_add4_neon;
} else if (have_neon(cpu_flags) && bit_depth == 10) {
--
2.47.3
@@ -0,0 +1,121 @@
From 68731c41d7ea68be0e912b128cb4e71fb56e8263 Mon Sep 17 00:00:00 2001
From: Markus Fritsche <mfritsche@reauktion.de>
Date: Fri, 22 May 2026 12:15:16 +0200
Subject: [PATCH] avcodec/aarch64/h264dsp: route H.264 luma-v deblock through
daedalus-fourier
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
H264DSPContext.v_loop_filter_luma (non-intra bS<4 vertical luma
deblock, called per macroblock-row edge from the slice deblock
loop) now dispatches through
daedalus_recipe_dispatch_h264_deblock_luma_v instead of
ff_h264_v_loop_filter_luma_neon.
The recipe layer picks the substrate; for cycle 8 the daedalus
docstring marks the kernel "CPU primary; QPU opportunistic", but
the libavcodec.so context here is built with
daedalus_ctx_create_no_qpu — process-global pthread_once init,
shared with cycles 6/7. QPU opportunism stays gated off until a
follow-up adds an explicit feature flag (no implicit Vulkan init
in arbitrary host processes). In the meantime cycle 8 is a
plumbing-only substitution, NEON-to-NEON via the daedalus recipe.
Intra (bS=4) loop filter — c->v_loop_filter_luma_intra — stays on
the in-tree NEON .S code; daedalus's daedalus_h264_deblock_meta
only covers the non-intra path per its docstring.
FFmpeg `int alpha/beta/int8_t tc0[4]` → daedalus_h264_deblock_meta
(int32_t alpha/beta + inline int8_t tc0[4]). pix already points
to row 0 of the bottom block per FFmpeg's deblock convention,
satisfying daedalus's `dst_off >= 4 * dst_stride` constraint.
Refs reauktion/daedalus-v4l2#11 — substitution arc step 2 cycle 8.
---
libavcodec/aarch64/h264_idct_daedalus.c | 36 +++++++++++++++++++----
libavcodec/aarch64/h264dsp_init_aarch64.c | 4 ++-
2 files changed, 33 insertions(+), 7 deletions(-)
diff --git a/libavcodec/aarch64/h264_idct_daedalus.c b/libavcodec/aarch64/h264_idct_daedalus.c
index cbb98af..92365fa 100644
--- a/libavcodec/aarch64/h264_idct_daedalus.c
+++ b/libavcodec/aarch64/h264_idct_daedalus.c
@@ -1,11 +1,14 @@
/*
- * H.264 4x4 / 8x8 IDCT + add — daedalus-fourier substitution shims.
+ * H.264 4x4 / 8x8 IDCT + luma-v deblock — daedalus-fourier substitution shims.
*
- * Routes H264DSPContext.idct_add → daedalus_recipe_dispatch_h264_idct4
- * H264DSPContext.idct8_add → daedalus_recipe_dispatch_h264_idct8
- * instead of the in-tree ff_h264_idct{,8}_add_neon assembly. The
- * recipe layer picks the substrate (CPU NEON by default for cycles
- * 6 + 7; future cycles may dispatch to V3D opportunistically).
+ * Routes H264DSPContext.idct_add → daedalus_recipe_dispatch_h264_idct4
+ * H264DSPContext.idct8_add → daedalus_recipe_dispatch_h264_idct8
+ * H264DSPContext.v_loop_filter_luma → daedalus_recipe_dispatch_h264_deblock_luma_v
+ * instead of the in-tree ff_h264_*_neon assembly. The recipe layer
+ * picks the substrate (CPU NEON for cycles 6 + 7 by default; cycle 8
+ * is CPU primary with QPU opportunistic — the ctx below is no-QPU,
+ * so cycle 8 stays on the CPU NEON path until a separate change
+ * gates QPU init on a daedalus-fourier feature flag).
*
* FFmpeg's 4x4 and 8x8 block memory layouts match daedalus's
* column-major convention: block[r + N*c] = coefficient at
@@ -40,6 +43,8 @@ static void daedalus_ctx_init_once(void)
void ff_h264_idct_add_daedalus(uint8_t *dst, int16_t *block, int stride);
void ff_h264_idct8_add_daedalus(uint8_t *dst, int16_t *block, int stride);
+void ff_h264_v_loop_filter_luma_daedalus(uint8_t *pix, ptrdiff_t stride,
+ int alpha, int beta, int8_t *tc0);
void ff_h264_idct_add_daedalus(uint8_t *dst, int16_t *block, int stride)
{
@@ -60,3 +65,22 @@ void ff_h264_idct8_add_daedalus(uint8_t *dst, int16_t *block, int stride)
daedalus_recipe_dispatch_h264_idct8(g_dctx, dst, (size_t)stride,
block, 1, &meta);
}
+
+void ff_h264_v_loop_filter_luma_daedalus(uint8_t *pix, ptrdiff_t stride,
+ int alpha, int beta, int8_t *tc0)
+{
+ daedalus_h264_deblock_meta meta = {
+ .dst_off = 0,
+ .alpha = alpha,
+ .beta = beta,
+ };
+ meta.tc0[0] = tc0[0];
+ meta.tc0[1] = tc0[1];
+ meta.tc0[2] = tc0[2];
+ meta.tc0[3] = tc0[3];
+
+ pthread_once(&g_dctx_once, daedalus_ctx_init_once);
+
+ daedalus_recipe_dispatch_h264_deblock_luma_v(g_dctx, pix, (size_t)stride,
+ 1, &meta);
+}
diff --git a/libavcodec/aarch64/h264dsp_init_aarch64.c b/libavcodec/aarch64/h264dsp_init_aarch64.c
index 741e551..85ac381 100644
--- a/libavcodec/aarch64/h264dsp_init_aarch64.c
+++ b/libavcodec/aarch64/h264dsp_init_aarch64.c
@@ -27,6 +27,8 @@
void ff_h264_v_loop_filter_luma_neon(uint8_t *pix, ptrdiff_t stride, int alpha,
int beta, int8_t *tc0);
+void ff_h264_v_loop_filter_luma_daedalus(uint8_t *pix, ptrdiff_t stride,
+ int alpha, int beta, int8_t *tc0);
void ff_h264_h_loop_filter_luma_neon(uint8_t *pix, ptrdiff_t stride, int alpha,
int beta, int8_t *tc0);
void ff_h264_v_loop_filter_luma_intra_neon(uint8_t *pix, ptrdiff_t stride, int alpha,
@@ -114,7 +116,7 @@ av_cold void ff_h264dsp_init_aarch64(H264DSPContext *c, const int bit_depth,
int cpu_flags = av_get_cpu_flags();
if (have_neon(cpu_flags) && bit_depth == 8) {
- c->v_loop_filter_luma = ff_h264_v_loop_filter_luma_neon;
+ c->v_loop_filter_luma = ff_h264_v_loop_filter_luma_daedalus;
c->h_loop_filter_luma = ff_h264_h_loop_filter_luma_neon;
c->v_loop_filter_luma_intra= ff_h264_v_loop_filter_luma_intra_neon;
c->h_loop_filter_luma_intra= ff_h264_h_loop_filter_luma_intra_neon;
--
2.47.3
@@ -0,0 +1,82 @@
From 0d1292ea99bc4e5fa2da438259fa01a2374e3e04 Mon Sep 17 00:00:00 2001
From: Markus Fritsche <mfritsche@reauktion.de>
Date: Fri, 22 May 2026 14:18:25 +0200
Subject: [PATCH] avcodec/h264: restore AV_CODEC_FLAG_LOW_DELAY semantics
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
FFmpeg 8.x dropped the H.264 decoder's low_delay path —
AV_CODEC_FLAG_LOW_DELAY no longer prevents
h264_select_output_frame from running the display-order DPB
output queue. V4L2-stateless-style consumers (daedalus-v4l2
daemon, libva-v4l2-request-fourier) that set the flag end up
seeing the 2-1-4-3 pair-swap pattern on B-frame streams again.
Restore the documented semantics:
- Early-exit at the top of h264_select_output_frame when the
flag is set: emit the just-decoded picture immediately as
next_output_pic, mirror the corruption / recovery-point
tracking the main path performs, and skip the entire
delayed_pic[] / POC reorder machinery.
- Suppress the SPS-driven has_b_frames clobber in
h264_field_start when the flag is set, so the per-slice
bitstream_restriction_flag re-pickup cannot reintroduce a
nonzero reorder buffer mid-stream.
This is a fork-only change required by the daedalus-v4l2 daemon's
one-frame-per-send_packet contract; upstream FFmpeg consumers that
expect display-order output remain untouched (flag default = off).
Refs reauktion/daedalus-v4l2#11 — substitution arc step 2 deblock
+ flag-restoration follow-up.
---
libavcodec/h264_slice.c | 23 +++++++++++++++++++++++
1 file changed, 23 insertions(+)
diff --git a/libavcodec/h264_slice.c b/libavcodec/h264_slice.c
index 97fab70..a7bfbd6 100644
--- a/libavcodec/h264_slice.c
+++ b/libavcodec/h264_slice.c
@@ -1308,6 +1308,28 @@ static int h264_select_output_frame(H264Context *h)
cur->mmco_reset = h->mmco_reset;
h->mmco_reset = 0;
+ /* AV_CODEC_FLAG_LOW_DELAY restore (FFmpeg 8.x dropped the H.264
+ * decoder's low_delay path). Bypass the display-order DPB
+ * output queue: emit the just-decoded picture immediately, in
+ * decode order, one per send_packet. V4L2-stateless-style
+ * consumers (daedalus-v4l2 daemon, libva-v4l2-request-fourier)
+ * do their own POC-based reorder downstream and require this
+ * behaviour. */
+ if (h->avctx->flags & AV_CODEC_FLAG_LOW_DELAY) {
+ h->next_output_pic = cur;
+ h->next_outputed_poc = cur->poc;
+ h->frame_recovered |= cur->recovered;
+ cur->recovered |= h->frame_recovered & FRAME_RECOVERED_SEI;
+ if (!cur->recovered) {
+ if (!(h->avctx->flags & AV_CODEC_FLAG_OUTPUT_CORRUPT) &&
+ !(h->avctx->flags2 & AV_CODEC_FLAG2_SHOW_ALL))
+ h->next_output_pic = NULL;
+ else
+ cur->f->flags |= AV_FRAME_FLAG_CORRUPT;
+ }
+ return 0;
+ }
+
if (sps->bitstream_restriction_flag ||
h->avctx->strict_std_compliance >= FF_COMPLIANCE_STRICT) {
h->avctx->has_b_frames = FFMAX(h->avctx->has_b_frames, sps->num_reorder_frames);
@@ -1415,6 +1437,7 @@ static int h264_field_start(H264Context *h, const H264SliceContext *sl,
sps = h->ps.sps;
if (sps->bitstream_restriction_flag &&
+ !(h->avctx->flags & AV_CODEC_FLAG_LOW_DELAY) &&
h->avctx->has_b_frames < sps->num_reorder_frames) {
h->avctx->has_b_frames = sps->num_reorder_frames;
}
--
2.47.3
@@ -0,0 +1,139 @@
From 0000000000000000000000000000000000000000 Mon Sep 17 00:00:00 2001
From: Markus Fritsche <mfritsche@reauktion.de>
Date: Sat, 23 May 2026 12:00:00 +0200
Subject: [PATCH] avcodec/aarch64/h264qpel: route 8x8 mc20 through
daedalus-fourier
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
H264QpelContext.put_h264_qpel_pixels_tab[1][2] (8x8 luma horizontal
half-pel, 6-tap "put" variant — the canonical representative of the
H.264 luma motion-compensation family) now dispatches through
daedalus_recipe_dispatch_h264_qpel_mc20 instead of
ff_put_h264_qpel8_mc20_neon.
Cycle 9 of the daedalus-v4l2#11 step 2 substitution arc; closes the
4-cycle libavcodec.so substitution sequence (6 IDCT 4x4 / 7 IDCT 8x8 /
8 luma-v deblock / 9 qpel mc20).
The recipe layer picks the substrate. Per docs/k9_h264qpel_mc20.md
the verdict is CPU NEON: per-block 7.6 ns at 131 Mblock/s gives 135x
margin over 30 fps 1080p, and the QPU dispatch floor (~250 ns)
makes any V3D shader strictly worse. Substitution is plumbing-only,
NEON-by-recipe — same daedalus_ctx_create_no_qpu pthread_once
context shape the cycles 6/7/8 shims already own (kept SEPARATE
from the H264DSP shim's ctx because H264QPEL is its own libavcodec
Makefile module and link order does not guarantee a single .o
owns the ctx symbol; one extra ~µs init per process, paid lazily).
Other H.264 luma MC variants (mc02, mc11, mc22 etc.) and the 16x16
size tier stay on the in-tree NEON .S code. Per the cycle-9 phase-1
rationale, mc20 8x8 is representative of the whole family's per-block
cost — extending the substitution to other variants would multiply
recipe-lookup overhead without changing the substrate verdict.
Bit-exact against ff_put_h264_qpel8_mc20_neon (daedalus-fourier
cycle 9 green; M1 = 100% bit-exact across 10000 random blocks).
No SONAME change, no Depends change.
Refs reauktion/daedalus-v4l2#11 — substitution arc step 2 cycle 9.
---
libavcodec/aarch64/Makefile | 3 +-
libavcodec/aarch64/h264_qpel_daedalus.c | 50 ++++++++++++++++++++++
libavcodec/aarch64/h264qpel_init_aarch64.c | 4 +-
3 files changed, 55 insertions(+), 2 deletions(-)
create mode 100644 libavcodec/aarch64/h264_qpel_daedalus.c
diff --git a/libavcodec/aarch64/Makefile b/libavcodec/aarch64/Makefile
--- a/libavcodec/aarch64/Makefile
+++ b/libavcodec/aarch64/Makefile
@@ -7,7 +7,8 @@ OBJS-$(CONFIG_H264DSP) += aarch64/h264dsp_init_aarch64.o \
aarch64/h264_idct_daedalus.o
OBJS-$(CONFIG_HUFFYUVDSP) += aarch64/huffyuvdsp_init_aarch64.o
OBJS-$(CONFIG_H264PRED) += aarch64/h264pred_init.o
-OBJS-$(CONFIG_H264QPEL) += aarch64/h264qpel_init_aarch64.o
+OBJS-$(CONFIG_H264QPEL) += aarch64/h264qpel_init_aarch64.o \
+ aarch64/h264_qpel_daedalus.o
OBJS-$(CONFIG_HPELDSP) += aarch64/hpeldsp_init_aarch64.o
OBJS-$(CONFIG_IDCTDSP) += aarch64/idctdsp_init_aarch64.o
OBJS-$(CONFIG_ME_CMP) += aarch64/me_cmp_init_aarch64.o
diff --git a/libavcodec/aarch64/h264_qpel_daedalus.c b/libavcodec/aarch64/h264_qpel_daedalus.c
new file mode 100644
--- /dev/null
+++ b/libavcodec/aarch64/h264_qpel_daedalus.c
@@ -0,0 +1,50 @@
+/*
+ * H.264 luma qpel mc20 (8x8, horizontal half-pel, 6-tap "put")
+ * — daedalus-fourier substitution shim.
+ *
+ * Routes H264QpelContext.put_h264_qpel_pixels_tab[1][2] through
+ * daedalus_recipe_dispatch_h264_qpel_mc20 instead of
+ * ff_put_h264_qpel8_mc20_neon. The recipe layer picks the substrate
+ * (CPU NEON for cycle 9; QPU not viable — per-block 7.6 ns vs
+ * ~250 ns QPU dispatch floor, see docs/k9_h264qpel_mc20.md).
+ *
+ * Sibling to libavcodec/aarch64/h264_idct_daedalus.c. We keep a
+ * SEPARATE process-global pthread_once context here instead of
+ * sharing the H264DSP one because H264QPEL is its own libavcodec
+ * Makefile module and link order does not guarantee a single .o
+ * owns the ctx symbol. The cost is one extra
+ * daedalus_ctx_create_no_qpu (~µs) per process; daemon and host
+ * processes pay this lazily on first MC call.
+ *
+ * FFmpeg H264QpelContext convention: both dst and src use a SINGLE
+ * stride and `src` already points at the leftmost OUTPUT column
+ * (col 0); the 6-tap filter reads cols -2..+3. This matches
+ * daedalus_recipe_dispatch_h264_qpel_mc20's documented contract
+ * directly, so dst_off = src_off = 0.
+ */
+
+#include <pthread.h>
+#include <stddef.h>
+#include <stdint.h>
+
+#include <daedalus.h>
+
+#include "libavutil/attributes.h"
+
+static daedalus_ctx *g_dctx;
+static pthread_once_t g_dctx_once = PTHREAD_ONCE_INIT;
+
+static void daedalus_ctx_init_once(void)
+{
+ g_dctx = daedalus_ctx_create_no_qpu();
+}
+
+void ff_put_h264_qpel8_mc20_daedalus(uint8_t *dst, const uint8_t *src, ptrdiff_t stride);
+
+void ff_put_h264_qpel8_mc20_daedalus(uint8_t *dst, const uint8_t *src, ptrdiff_t stride)
+{
+ static const daedalus_h264_qpel_meta meta = { .dst_off = 0, .src_off = 0 };
+ pthread_once(&g_dctx_once, daedalus_ctx_init_once);
+ daedalus_recipe_dispatch_h264_qpel_mc20(g_dctx, dst, src, (size_t)stride,
+ 1, &meta);
+}
diff --git a/libavcodec/aarch64/h264qpel_init_aarch64.c b/libavcodec/aarch64/h264qpel_init_aarch64.c
--- a/libavcodec/aarch64/h264qpel_init_aarch64.c
+++ b/libavcodec/aarch64/h264qpel_init_aarch64.c
@@ -47,6 +47,8 @@ void ff_put_h264_qpel8_mc00_neon(uint8_t *dst, const uint8_t *src, ptrdiff_t str
void ff_put_h264_qpel8_mc10_neon(uint8_t *dst, const uint8_t *src, ptrdiff_t stride);
void ff_put_h264_qpel8_mc20_neon(uint8_t *dst, const uint8_t *src, ptrdiff_t stride);
void ff_put_h264_qpel8_mc30_neon(uint8_t *dst, const uint8_t *src, ptrdiff_t stride);
+void ff_put_h264_qpel8_mc20_daedalus(uint8_t *dst, const uint8_t *src,
+ ptrdiff_t stride);
void ff_put_h264_qpel8_mc01_neon(uint8_t *dst, const uint8_t *src, ptrdiff_t stride);
void ff_put_h264_qpel8_mc11_neon(uint8_t *dst, const uint8_t *src, ptrdiff_t stride);
void ff_put_h264_qpel8_mc21_neon(uint8_t *dst, const uint8_t *src, ptrdiff_t stride);
@@ -184,7 +186,7 @@ av_cold void ff_h264qpel_init_aarch64(H264QpelContext *c, int bit_depth)
c->put_h264_qpel_pixels_tab[1][ 0] = ff_put_h264_qpel8_mc00_neon;
c->put_h264_qpel_pixels_tab[1][ 1] = ff_put_h264_qpel8_mc10_neon;
- c->put_h264_qpel_pixels_tab[1][ 2] = ff_put_h264_qpel8_mc20_neon;
+ c->put_h264_qpel_pixels_tab[1][ 2] = ff_put_h264_qpel8_mc20_daedalus;
c->put_h264_qpel_pixels_tab[1][ 3] = ff_put_h264_qpel8_mc30_neon;
c->put_h264_qpel_pixels_tab[1][ 4] = ff_put_h264_qpel8_mc01_neon;
c->put_h264_qpel_pixels_tab[1][ 5] = ff_put_h264_qpel8_mc11_neon;
--
2.47.3
+41 -3
View File
@@ -24,8 +24,13 @@ _srcname=FFmpeg
_version='8.1' _version='8.1'
_commit='b57fbbe50c9b2656fad86a1a7eeabfd2b2a50935' # v4l2-request-n8.1 tip 2026-04-24 _commit='b57fbbe50c9b2656fad86a1a7eeabfd2b2a50935' # v4l2-request-n8.1 tip 2026-04-24
pkgver=8.1.r123329.b57fbbe pkgver=8.1.r123329.b57fbbe
pkgrel=5 pkgrel=10 # pkgrel=10 — H.264 luma qpel mc20 daedalus-fourier substitution (cycle 9, 2026-05-23)
epoch=2 epoch=2
# daedalus-fourier pin. 209a421 = PR #2 merge (Phase 8c — public API
# gains daedalus_recipe_dispatch_h264_qpel_mc20 + DAEDALUS_KERNEL_H264_QPEL_MC20).
# Cycle 9 closes the libavcodec.so substitution arc started at cycle 6.
_daedalus_fourier_commit='209a4218bcb98b91c04f07ad61513bb04adb13ad'
pkgdesc='FFmpeg with V4L2 Request API hwaccel (Rockchip / Allwinner stateless decode)' pkgdesc='FFmpeg with V4L2 Request API hwaccel (Rockchip / Allwinner stateless decode)'
arch=('aarch64') arch=('aarch64')
url='https://github.com/Kwiboo/FFmpeg' url='https://github.com/Kwiboo/FFmpeg'
@@ -34,6 +39,7 @@ depends=(
alsa-lib alsa-lib
bzip2 bzip2
fontconfig fontconfig
vulkan-icd-loader
fribidi fribidi
gmp gmp
gnutls gnutls
@@ -59,10 +65,13 @@ depends=(
zlib zlib
) )
makedepends=( makedepends=(
cmake
git git
linux-api-headers linux-api-headers
mesa mesa
nasm nasm
ninja
vulkan-headers
) )
provides=( provides=(
libavcodec.so libavcodec.so
@@ -78,9 +87,15 @@ provides=(
conflicts=(ffmpeg) conflicts=(ffmpeg)
replaces=(ffmpeg ffmpeg-v4l2-request-git) replaces=(ffmpeg ffmpeg-v4l2-request-git)
source=("git+https://github.com/Kwiboo/FFmpeg.git#commit=${_commit}" source=("git+https://github.com/Kwiboo/FFmpeg.git#commit=${_commit}"
"daedalus-fourier-${_daedalus_fourier_commit}.tar.gz::https://git.reauktion.de/marfrit/daedalus-fourier/archive/${_daedalus_fourier_commit}.tar.gz"
'0001-libudev-bypass-fallback.patch' '0001-libudev-bypass-fallback.patch'
'0002-nv15-to-p010-unpack.patch') '0002-nv15-to-p010-unpack.patch'
sha256sums=('SKIP' 'SKIP' 'SKIP') '0003-h264-idct4-daedalus-fourier.patch'
'0004-h264-idct8-daedalus-fourier.patch'
'0005-h264-deblock-luma-v-daedalus-fourier.patch'
'0006-h264-restore-low-delay.patch'
'0007-h264-qpel-mc20-daedalus-fourier.patch')
sha256sums=('SKIP' 'SKIP' 'SKIP' 'SKIP' 'SKIP' 'SKIP' 'SKIP' 'SKIP' 'SKIP')
pkgver() { pkgver() {
cd "${_srcname}" cd "${_srcname}"
@@ -93,9 +108,29 @@ prepare() {
cd "${_srcname}" cd "${_srcname}"
patch -Np1 -i "${srcdir}/0001-libudev-bypass-fallback.patch" patch -Np1 -i "${srcdir}/0001-libudev-bypass-fallback.patch"
patch -Np1 -i "${srcdir}/0002-nv15-to-p010-unpack.patch" patch -Np1 -i "${srcdir}/0002-nv15-to-p010-unpack.patch"
patch -Np1 -i "${srcdir}/0003-h264-idct4-daedalus-fourier.patch"
patch -Np1 -i "${srcdir}/0004-h264-idct8-daedalus-fourier.patch"
patch -Np1 -i "${srcdir}/0005-h264-deblock-luma-v-daedalus-fourier.patch"
patch -Np1 -i "${srcdir}/0006-h264-restore-low-delay.patch"
patch -Np1 -i "${srcdir}/0007-h264-qpel-mc20-daedalus-fourier.patch"
} }
build() { build() {
# --- daedalus-fourier: build static .a with PIC, install to a
# per-build prefix; libavcodec.so links it into the shared object so
# H264DSPContext.idct_add (and follow-up kernels) dispatch through
# the daedalus recipe layer instead of the in-tree NEON .S code. ---
local _fourier_prefix="${srcdir}/fourier-prefix"
mkdir -p "${_fourier_prefix}"
pushd "${srcdir}"/daedalus-fourier >/dev/null
cmake -B build -G Ninja \
-DCMAKE_BUILD_TYPE=Release \
-DCMAKE_POSITION_INDEPENDENT_CODE=ON \
-DCMAKE_INSTALL_PREFIX="${_fourier_prefix}"
cmake --build build --target daedalus_core
cmake --install build
popd >/dev/null
cd "${_srcname}" cd "${_srcname}"
# FFmpeg's configure resolves the compiler via `which` and bakes the # FFmpeg's configure resolves the compiler via `which` and bakes the
@@ -147,6 +182,9 @@ build() {
--enable-libx265 \ --enable-libx265 \
--enable-libwebp \ --enable-libwebp \
\ \
--extra-cflags="-I${_fourier_prefix}/include" \
--extra-ldflags="-L${_fourier_prefix}/lib" \
--extra-libs="-ldaedalus_core -lvulkan -lpthread" \
--host-cflags='-fPIC' --host-cflags='-fPIC'
make make
@@ -0,0 +1,57 @@
From: claude-noether (on behalf of mfritsche)
Date: 2026-05-19
Subject: panvk: expose VK_KHR/EXT_robustness2 + nullDescriptor on Bifrost (PAN_ARCH 6/7)
Without this, Mesa's Zink driver refuses to use PanVk-Bifrost as its Vulkan
backend, falling back silently to llvmpipe (software rasterizer) for all
GL-via-Zink on Bifrost SBCs. That defeats the entire purpose of having a
Vulkan driver on Bifrost — GL acceleration via Zink is the most natural
near-term consumer.
panvk_vX_nir_lower_descriptors.c:1309 and panvk_vX_shader.c:1355 already
plumb dev->vk.enabled_features.nullDescriptor arch-agnostically — the gate
at panvk_vX_physical_device.c was set conservatively when Bifrost was
unmaintained, not because of hardware incapability.
iter17 of the panvk-bifrost campaign proved fundamental driver functions
on Mali-G52 r1 MC1 (PAN_ARCH=7). This patch is the iter8 follow-up.
robustBufferAccess2 and robustImageAccess2 are NOT flipped — they're
independent rb2 features Zink doesn't require, gated differently
(robustBufferAccess2 = PAN_ARCH >= 11, robustImageAccess2 = false), and
out of scope for iter8.
---
src/panfrost/vulkan/panvk_vX_physical_device.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/src/panfrost/vulkan/panvk_vX_physical_device.c b/src/panfrost/vulkan/panvk_vX_physical_device.c
--- a/src/panfrost/vulkan/panvk_vX_physical_device.c
+++ b/src/panfrost/vulkan/panvk_vX_physical_device.c
@@ -91,7 +91,7 @@ get_device_extensions(const struct panvk_physical_device *device,
.KHR_pipeline_binary = true,
.KHR_pipeline_executable_properties = true,
.KHR_pipeline_library = true,
- .KHR_robustness2 = PAN_ARCH >= 10,
+ .KHR_robustness2 = true,
.KHR_sampler_mirror_clamp_to_edge = true,
.KHR_sampler_ycbcr_conversion = true,
.KHR_separate_depth_stencil_layouts = true,
@@ -168,7 +168,7 @@ get_device_extensions(const struct panvk_physical_device *device,
.EXT_queue_family_foreign = true,
.EXT_robustness = pan_arch(device->kmod.dev->props.gpu_id) >= 9,
.EXT_image_robustness = true,
- .EXT_robustness2 = PAN_ARCH >= 10,
+ .EXT_robustness2 = true,
.EXT_sampler_filter_minmax = PAN_ARCH >= 10,
.EXT_scalar_block_layout = true,
.EXT_separate_stencil_usage = true,
@@ -493,7 +493,7 @@ get_device_features(const struct panvk_physical_device *device,
/* VK_KHR_robustness2 */
.robustBufferAccess2 = PAN_ARCH >= 11,
.robustImageAccess2 = false,
- .nullDescriptor = PAN_ARCH >= 10,
+ .nullDescriptor = true,
/* VK_KHR_shader_clock */
.shaderSubgroupClock = device->kmod.dev->props.gpu_can_query_timestamp,
@@ -0,0 +1,47 @@
From: claude-noether (on behalf of mfritsche)
Date: 2026-05-20
Subject: panvk: expose Vulkan 1.1 + 1.2 on Bifrost (PAN_ARCH 6/7)
ANGLE (Chromium's GL stack) requires apiVersion >= 1.1 to initialize. Without
this, Brave / Chromium's GPU process fails at GL info collection:
vk_renderer.cpp:2659 (initialize): ANGLE Requires a minimum Vulkan device
version of 1.1
Display::initialize error 0: Internal Vulkan error (-9): The requested
version of Vulkan is not supported by the driver
Stack-up with iter8's robustness2 patch enables ANGLE → PanVk-Bifrost →
Skia (via --enable-features=Vulkan) on Bifrost SBCs.
PanVk-Bifrost already supports the bulk of 1.1-promoted features as extensions
(multiview, maintenance1-3, descriptor update template, 16-bit storage,
descriptor update template, sampler ycbcr, variable pointers, etc. — all
visible in iter0 vulkaninfo). The version bump primarily bundles them.
Risk: Vulkan 1.1 has features beyond what iter17 exercised (protected memory,
full subgroup ops). Specific app failures will be characterizable.
1.2 is also flipped — Brave's Vulkan path may want descriptor indexing,
buffer device address, etc. (all listed in iter0 vulkaninfo as supported
extensions, just gated as 1.0-with-extensions, not 1.2-core).
---
src/panfrost/vulkan/panvk_vX_physical_device.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/src/panfrost/vulkan/panvk_vX_physical_device.c b/src/panfrost/vulkan/panvk_vX_physical_device.c
--- a/src/panfrost/vulkan/panvk_vX_physical_device.c
+++ b/src/panfrost/vulkan/panvk_vX_physical_device.c
@@ -38,8 +38,8 @@ get_device_extensions(const struct panvk_physical_device *device,
struct vk_device_extension_table *ext)
{
*ext = (struct vk_device_extension_table){
- .KHR_8bit_storage = true,
- .KHR_16bit_storage = true,
- bool has_vk1_1 = PAN_ARCH >= 10;
- bool has_vk1_2 = PAN_ARCH >= 10;
+ .KHR_8bit_storage = true,
+ .KHR_16bit_storage = true,
+ bool has_vk1_1 = true;
+ bool has_vk1_2 = true;
*ext = (struct vk_device_extension_table){
@@ -0,0 +1,328 @@
--- a/src/panfrost/vulkan/panvk_shader.h 2026-04-29 22:19:00.000000000 +0200
+++ b/src/panfrost/vulkan/panvk_shader.h 2026-05-20 18:52:53.312698258 +0200
@@ -150,6 +150,10 @@
struct {
#if PAN_ARCH < 9
int32_t raw_vertex_offset;
+ uint32_t num_vertices; /* iter13: XFB needs per-draw vertex count */
+ /* aligned_u64 attribute below inserts the 4-byte alignment gap
+ * after num_vertices automatically — no explicit pad needed. */
+ aligned_u64 xfb_address[4]; /* iter13: 4 transform feedback buffer base addresses */
#endif
int32_t first_vertex;
int32_t base_instance;
--- a/src/panfrost/vulkan/panvk_vX_physical_device.c 2026-05-20 19:09:29.711145446 +0200
+++ b/src/panfrost/vulkan/panvk_vX_physical_device.c 2026-05-20 18:52:54.832720445 +0200
@@ -169,6 +169,7 @@
.EXT_provoking_vertex = true,
.EXT_queue_family_foreign = true,
.EXT_robustness2 = true,
+ .EXT_transform_feedback = PAN_ARCH < 9, /* iter13: JM-class only for now */
.EXT_sampler_filter_minmax = PAN_ARCH >= 10,
.EXT_scalar_block_layout = true,
.EXT_separate_stencil_usage = true,
@@ -495,6 +496,10 @@
.robustImageAccess2 = false,
.nullDescriptor = true,
+ /* VK_EXT_transform_feedback (iter13) */
+ .transformFeedback = PAN_ARCH < 9,
+ .geometryStreams = false,
+
/* VK_KHR_shader_clock */
.shaderSubgroupClock = device->kmod.dev->props.gpu_can_query_timestamp,
.shaderDeviceClock = device->kmod.dev->props.timestamp_device_coherent,
@@ -1020,6 +1025,18 @@
.robustStorageBufferAccessSizeAlignment = 1,
.robustUniformBufferAccessSizeAlignment = 1,
+ /* VK_EXT_transform_feedback (iter13) */
+ .maxTransformFeedbackStreams = 1,
+ .maxTransformFeedbackBuffers = 4,
+ .maxTransformFeedbackBufferSize = UINT32_MAX,
+ .maxTransformFeedbackStreamDataSize = 512,
+ .maxTransformFeedbackBufferDataSize = 512,
+ .maxTransformFeedbackBufferDataStride = 2048,
+ .transformFeedbackQueries = false,
+ .transformFeedbackStreamsLinesTriangles = false,
+ .transformFeedbackRasterizationStreamSelect = false,
+ .transformFeedbackDraw = false,
+
/* VK_EXT_shader_object */
/* We do not currently support VK_EXT_shader_object but this is used
* internally by vk_shader
--- a/src/panfrost/vulkan/panvk_vX_shader.c 2026-04-29 22:19:00.000000000 +0200
+++ b/src/panfrost/vulkan/panvk_vX_shader.c 2026-05-20 18:52:56.556745611 +0200
@@ -21,6 +21,7 @@
#include "panvk_physical_device.h"
#include "panvk_sampler.h"
#include "panvk_shader.h"
+#include "pan_nir.h" /* iter13: pan_nir_lower_xfb */
#include "spirv/nir_spirv.h"
#include "util/memstream.h"
@@ -100,6 +101,20 @@
case nir_intrinsic_load_raw_vertex_offset_pan:
val = load_sysval(b, graphics, bit_size, vs.raw_vertex_offset);
break;
+ case nir_intrinsic_load_num_vertices: /* iter13: XFB index calc */
+ val = load_sysval(b, graphics, bit_size, vs.num_vertices);
+ break;
+ case nir_intrinsic_load_xfb_address: { /* iter13: XFB buffer N base address */
+ unsigned idx = nir_intrinsic_base(intr);
+ switch (idx) {
+ case 0: val = load_sysval(b, graphics, bit_size, vs.xfb_address[0]); break;
+ case 1: val = load_sysval(b, graphics, bit_size, vs.xfb_address[1]); break;
+ case 2: val = load_sysval(b, graphics, bit_size, vs.xfb_address[2]); break;
+ case 3: val = load_sysval(b, graphics, bit_size, vs.xfb_address[3]); break;
+ default: return false;
+ }
+ break;
+ }
case nir_intrinsic_load_layer_id:
assert(b->shader->info.stage == MESA_SHADER_FRAGMENT);
val = load_sysval(b, graphics, bit_size, layer_id);
@@ -457,6 +472,7 @@
core_max_id);
pan_preprocess_nir(nir, pdev->kmod.dev->props.gpu_id);
+
}
static void
@@ -870,6 +886,18 @@
nir_var_shader_in | nir_var_shader_out, UINT32_MAX);
NIR_PASS(_, nir, nir_lower_io, nir_var_shader_in | nir_var_shader_out,
glsl_type_size, nir_lower_io_use_interpolated_input_intrinsics);
+
+#if PAN_ARCH < 9
+ /* iter13: VK_EXT_transform_feedback — runs AFTER nir_lower_io so that
+ * shader outputs are now store_output intrinsics that pan_nir_lower_xfb
+ * can rewrite to nir_store_global+nir_load_xfb_address. */
+ if (nir->info.stage == MESA_SHADER_VERTEX &&
+ nir->info.has_transform_feedback_varyings) {
+ NIR_PASS(_, nir, nir_opt_constant_folding);
+ NIR_PASS(_, nir, nir_io_add_intrinsic_xfb_info);
+ NIR_PASS(_, nir, pan_nir_lower_xfb);
+ }
+#endif
}
static VkResult
@@ -1288,6 +1316,9 @@
.view_mask = (state && state->rp) ? state->rp->view_mask : 0,
.robust2_modes = robust2_modes,
.robust_descriptors = dev->vk.enabled_features.nullDescriptor,
+ /* iter13: XFB shaders must disable IDVS (matches Panfrost-Gallium). */
+ .no_idvs = (info->stage == MESA_SHADER_VERTEX) &&
+ info->nir->info.has_transform_feedback_varyings,
};
switch (info->stage) {
--- a/src/panfrost/vulkan/panvk_cmd_draw.h 2026-04-29 22:19:00.000000000 +0200
+++ b/src/panfrost/vulkan/panvk_cmd_draw.h 2026-05-20 18:52:57.748763011 +0200
@@ -135,6 +135,19 @@
struct panvk_graphics_sysvals sysvals;
#if PAN_ARCH < 9
+ /* iter13: VK_EXT_transform_feedback state (JM-class only for now). */
+ struct {
+ bool active;
+ uint32_t buffer_count;
+ struct {
+ uint64_t addr;
+ uint64_t offset;
+ uint64_t size;
+ } buffers[4];
+ } xfb;
+#endif
+
+#if PAN_ARCH < 9
struct panvk_shader_link link;
#endif
--- a/src/panfrost/vulkan/panvk_vX_cmd_draw.c 2026-04-29 22:19:00.000000000 +0200
+++ b/src/panfrost/vulkan/panvk_vX_cmd_draw.c 2026-05-20 19:10:23.031919662 +0200
@@ -10,6 +10,7 @@
#include "panvk_entrypoints.h"
#include "pan_desc.h"
+#include "pan_compiler.h" /* PAN_SHADER_OOB_ADDRESS */
#include "pan_util.h"
static void
@@ -722,6 +723,35 @@
set_gfx_sysval(cmdbuf, dirty_sysvals, vs.raw_vertex_offset,
info->vertex.raw_offset);
set_gfx_sysval(cmdbuf, dirty_sysvals, layer_id, info->layer_id);
+
+ /* iter13: VK_EXT_transform_feedback sysvals — always set (per draw),
+ * reflect bound XFB state. set_gfx_sysval is a no-op if value unchanged. */
+ set_gfx_sysval(cmdbuf, dirty_sysvals, vs.num_vertices, info->vertex.count);
+ {
+ const struct panvk_cmd_graphics_state *_gfx = &cmdbuf->state.gfx;
+ /* iter13: default each XFB buffer address to PAN_SHADER_OOB_ADDRESS
+ * (= 1<<63). This is the Panfrost-Gallium memory-sink idiom — the
+ * Bifrost MMU silently discards stores to this address, so a pipeline
+ * with XFB outputs used in a non-XFB draw (or in an XFB draw with
+ * fewer bound buffers than the shader declares) is safe instead of
+ * faulting. See gallium/drivers/panfrost/pan_cmdstream.c PAN_SYSVAL_XFB. */
+ uint64_t _xa0 = PAN_SHADER_OOB_ADDRESS, _xa1 = PAN_SHADER_OOB_ADDRESS,
+ _xa2 = PAN_SHADER_OOB_ADDRESS, _xa3 = PAN_SHADER_OOB_ADDRESS;
+ if (_gfx->xfb.active) {
+ if (_gfx->xfb.buffer_count > 0 && _gfx->xfb.buffers[0].addr)
+ _xa0 = _gfx->xfb.buffers[0].addr + _gfx->xfb.buffers[0].offset;
+ if (_gfx->xfb.buffer_count > 1 && _gfx->xfb.buffers[1].addr)
+ _xa1 = _gfx->xfb.buffers[1].addr + _gfx->xfb.buffers[1].offset;
+ if (_gfx->xfb.buffer_count > 2 && _gfx->xfb.buffers[2].addr)
+ _xa2 = _gfx->xfb.buffers[2].addr + _gfx->xfb.buffers[2].offset;
+ if (_gfx->xfb.buffer_count > 3 && _gfx->xfb.buffers[3].addr)
+ _xa3 = _gfx->xfb.buffers[3].addr + _gfx->xfb.buffers[3].offset;
+ }
+ set_gfx_sysval(cmdbuf, dirty_sysvals, vs.xfb_address[0], _xa0);
+ set_gfx_sysval(cmdbuf, dirty_sysvals, vs.xfb_address[1], _xa1);
+ set_gfx_sysval(cmdbuf, dirty_sysvals, vs.xfb_address[2], _xa2);
+ set_gfx_sysval(cmdbuf, dirty_sysvals, vs.xfb_address[3], _xa3);
+ }
#endif
if (dyn_gfx_state_dirty(cmdbuf, CB_BLEND_CONSTANTS)) {
--- a/src/panfrost/vulkan/meson.build 2026-04-29 22:19:00.000000000 +0200
+++ b/src/panfrost/vulkan/meson.build 2026-05-20 18:53:04.484861338 +0200
@@ -73,6 +73,7 @@
jm_inc_dir = ['jm']
jm_files = [
'jm/panvk_vX_bind_queue.c',
+ 'jm/panvk_vX_cmd_xfb.c', # iter13
'jm/panvk_vX_cmd_buffer.c',
'jm/panvk_vX_cmd_dispatch.c',
'jm/panvk_vX_cmd_draw.c',
--- a/src/panfrost/vulkan/jm/panvk_vX_cmd_buffer.c 2026-04-29 22:19:00.000000000 +0200
+++ b/src/panfrost/vulkan/jm/panvk_vX_cmd_buffer.c 2026-05-20 19:10:26.163965149 +0200
@@ -473,5 +473,12 @@
vk_command_buffer_begin(&cmdbuf->vk, pBeginInfo);
+#if PAN_ARCH < 9
+ /* iter13: clear XFB state on Begin so a reused command buffer does not
+ * inherit stale xfb.buffer_count / xfb.active / xfb.buffers[] from a
+ * prior recording. */
+ memset(&cmdbuf->state.gfx.xfb, 0, sizeof(cmdbuf->state.gfx.xfb));
+#endif
+
return VK_SUCCESS;
}
--- a/src/panfrost/vulkan/jm/panvk_vX_cmd_xfb.c 2026-05-18 12:50:53.067999996 +0200
+++ b/src/panfrost/vulkan/jm/panvk_vX_cmd_xfb.c 2026-05-20 19:10:27.175979847 +0200
@@ -0,0 +1,111 @@
+/*
+ * Copyright © 2026 mfritsche / claude-noether
+ * SPDX-License-Identifier: MIT
+ *
+ * iter13: VK_EXT_transform_feedback command handlers for the JM
+ * architecture path (Bifrost v6/v7 + Valhall-JM v9).
+ *
+ * The runtime contract:
+ * - vkCmdBindTransformFeedbackBuffersEXT: stash (gpu_addr, offset, size)
+ * for each slot into cmdbuf->state.gfx.xfb.buffers[].
+ * - vkCmdBeginTransformFeedbackEXT: set cmdbuf->state.gfx.xfb.active = true.
+ * Mark sysvals dirty so the next draw re-emits vs.xfb_address[].
+ * - vkCmdEndTransformFeedbackEXT: set active = false.
+ *
+ * Counter buffers (firstCounterBuffer/counterBufferCount/pCounterBuffers/
+ * pCounterBufferOffsets) are accepted by API but ignored — v1 doesn't
+ * support pause/resume. transformFeedbackDraw is advertised as false.
+ *
+ * Per-draw integration: jm/panvk_vX_cmd_draw.c reads cmdbuf->state.gfx.xfb
+ * and populates vs.xfb_address[i] for shader use. The pan_nir_lower_xfb
+ * pass in panvk_vX_shader.c emits nir_load_xfb_address(i) which lowers
+ * (via panvk_vX_shader.c sysval handler) to a load from the per-draw
+ * sysval push area.
+ */
+
+#include "vk_log.h"
+#include "util/log.h"
+
+#include "panvk_cmd_buffer.h"
+#include "panvk_cmd_draw.h"
+#include "panvk_buffer.h"
+#include "panvk_entrypoints.h"
+
+VKAPI_ATTR void VKAPI_CALL
+panvk_per_arch(CmdBindTransformFeedbackBuffersEXT)(
+ VkCommandBuffer commandBuffer,
+ uint32_t firstBinding,
+ uint32_t bindingCount,
+ const VkBuffer *pBuffers,
+ const VkDeviceSize *pOffsets,
+ const VkDeviceSize *pSizes)
+{
+ VK_FROM_HANDLE(panvk_cmd_buffer, cmdbuf, commandBuffer);
+ struct panvk_cmd_graphics_state *gfx = &cmdbuf->state.gfx;
+
+ for (uint32_t i = 0; i < bindingCount; i++) {
+ uint32_t slot = firstBinding + i;
+ if (slot >= 4)
+ continue;
+
+ VK_FROM_HANDLE(panvk_buffer, buf, pBuffers[i]);
+ gfx->xfb.buffers[slot].addr = panvk_buffer_gpu_ptr(buf, 0);
+ gfx->xfb.buffers[slot].offset = pOffsets[i];
+ gfx->xfb.buffers[slot].size =
+ (pSizes != NULL && pSizes[i] != VK_WHOLE_SIZE)
+ ? pSizes[i]
+ : (buf->vk.size - pOffsets[i]);
+ }
+
+ if (firstBinding + bindingCount > gfx->xfb.buffer_count)
+ gfx->xfb.buffer_count = firstBinding + bindingCount;
+}
+
+VKAPI_ATTR void VKAPI_CALL
+panvk_per_arch(CmdBeginTransformFeedbackEXT)(
+ VkCommandBuffer commandBuffer,
+ uint32_t firstCounterBuffer,
+ uint32_t counterBufferCount,
+ const VkBuffer *pCounterBuffers,
+ const VkDeviceSize *pCounterBufferOffsets)
+{
+ VK_FROM_HANDLE(panvk_cmd_buffer, cmdbuf, commandBuffer);
+ struct panvk_cmd_graphics_state *gfx = &cmdbuf->state.gfx;
+
+ /* Counter buffers ignored in v1 — see VkPhysicalDeviceTransformFeedback
+ * PropertiesEXT.transformFeedbackDraw = false in panvk_vX_physical_device.c.
+ * App is spec-compliant if it does not pass counter buffers (which our
+ * features advertisement allows), but warn loudly if it does so we do not
+ * silently produce wrong capture state. */
+ (void)firstCounterBuffer;
+ (void)pCounterBufferOffsets;
+ if (counterBufferCount > 0 && pCounterBuffers != NULL) {
+ mesa_logw("panvk: CmdBeginTransformFeedbackEXT: counter buffers not "
+ "implemented (transformFeedbackDraw=false); XFB resume will "
+ "restart at buffer offset 0");
+ }
+
+ gfx->xfb.active = true;
+ /* Per-draw set_gfx_sysval picks up the change automatically — no
+ * explicit dirty marking required (set_gfx_sysval uses memcmp +
+ * BITSET to detect state diffs and re-emit sysvals). */
+}
+
+VKAPI_ATTR void VKAPI_CALL
+panvk_per_arch(CmdEndTransformFeedbackEXT)(
+ VkCommandBuffer commandBuffer,
+ uint32_t firstCounterBuffer,
+ uint32_t counterBufferCount,
+ const VkBuffer *pCounterBuffers,
+ const VkDeviceSize *pCounterBufferOffsets)
+{
+ VK_FROM_HANDLE(panvk_cmd_buffer, cmdbuf, commandBuffer);
+ struct panvk_cmd_graphics_state *gfx = &cmdbuf->state.gfx;
+
+ (void)firstCounterBuffer;
+ (void)counterBufferCount;
+ (void)pCounterBuffers;
+ (void)pCounterBufferOffsets;
+
+ gfx->xfb.active = false;
+}
@@ -0,0 +1,629 @@
diff -urN a/src/panfrost/vulkan/meson.build b/src/panfrost/vulkan/meson.build
--- a/src/panfrost/vulkan/meson.build 2026-05-21 14:04:02.529474145 +0200
+++ b/src/panfrost/vulkan/meson.build 2026-05-21 14:04:04.106755486 +0200
@@ -123,6 +123,7 @@
'panvk_vX_nir_lower_input_attachment_loads.c',
'panvk_vX_sampler.c',
'panvk_vX_shader.c',
+ 'panvk_vX_xfb_lower.c',
sha1_h,
]
diff -urN a/src/panfrost/vulkan/panvk_shader.h b/src/panfrost/vulkan/panvk_shader.h
--- a/src/panfrost/vulkan/panvk_shader.h 2026-05-21 14:04:02.525251986 +0200
+++ b/src/panfrost/vulkan/panvk_shader.h 2026-05-21 14:04:04.084251800 +0200
@@ -154,6 +154,8 @@
/* aligned_u64 attribute below inserts the 4-byte alignment gap
* after num_vertices automatically — no explicit pad needed. */
aligned_u64 xfb_address[4]; /* iter13: 4 transform feedback buffer base addresses */
+ uint32_t xfb_topology; /* iter17: panvk_xfb_topology enum value */
+ uint32_t xfb_output_count; /* iter17: per-instance output verts after decomp */
#endif
int32_t first_vertex;
int32_t base_instance;
@@ -569,4 +571,76 @@
struct pan_compute_dim local_size, const void *bin_ptr, size_t bin_size,
struct panvk_shader **shader_out);
+
+#if PAN_ARCH < 9
+/* iter17: encoding for vs.xfb_topology sysval. Maps VkPrimitiveTopology values
+ * we need to distinguish at shader runtime for XFB capture. LIST topologies
+ * use the iter13 single-store fast path; non-LIST need per-vertex decomposition. */
+enum panvk_xfb_topology {
+ PANVK_XFB_TOPO_LIST = 0,
+ PANVK_XFB_TOPO_LINE_STRIP = 1,
+ PANVK_XFB_TOPO_TRI_STRIP = 2,
+ PANVK_XFB_TOPO_TRI_FAN = 3,
+ PANVK_XFB_TOPO_LINE_LIST_ADJ = 4,
+ PANVK_XFB_TOPO_LINE_STRIP_ADJ = 5,
+ PANVK_XFB_TOPO_TRI_LIST_ADJ = 6,
+ PANVK_XFB_TOPO_TRI_STRIP_ADJ = 7,
+};
+
+#include "panvk_macros.h"
+struct nir_shader;
+bool panvk_per_arch(nir_lower_xfb)(struct nir_shader *nir);
+
+/* Map VkPrimitiveTopology to panvk_xfb_topology enum (driver-side helper). */
+static inline uint32_t
+panvk_vk_topology_to_xfb_enum(VkPrimitiveTopology topo)
+{
+ switch (topo) {
+ case VK_PRIMITIVE_TOPOLOGY_LINE_STRIP:
+ return PANVK_XFB_TOPO_LINE_STRIP;
+ case VK_PRIMITIVE_TOPOLOGY_TRIANGLE_STRIP:
+ return PANVK_XFB_TOPO_TRI_STRIP;
+ case VK_PRIMITIVE_TOPOLOGY_TRIANGLE_FAN:
+ return PANVK_XFB_TOPO_TRI_FAN;
+ case VK_PRIMITIVE_TOPOLOGY_LINE_LIST_WITH_ADJACENCY:
+ return PANVK_XFB_TOPO_LINE_LIST_ADJ;
+ case VK_PRIMITIVE_TOPOLOGY_LINE_STRIP_WITH_ADJACENCY:
+ return PANVK_XFB_TOPO_LINE_STRIP_ADJ;
+ case VK_PRIMITIVE_TOPOLOGY_TRIANGLE_LIST_WITH_ADJACENCY:
+ return PANVK_XFB_TOPO_TRI_LIST_ADJ;
+ case VK_PRIMITIVE_TOPOLOGY_TRIANGLE_STRIP_WITH_ADJACENCY:
+ return PANVK_XFB_TOPO_TRI_STRIP_ADJ;
+ case VK_PRIMITIVE_TOPOLOGY_POINT_LIST:
+ case VK_PRIMITIVE_TOPOLOGY_LINE_LIST:
+ case VK_PRIMITIVE_TOPOLOGY_TRIANGLE_LIST:
+ default:
+ return PANVK_XFB_TOPO_LIST;
+ }
+}
+
+/* Compute the per-instance output vertex count for a given (topology, input count). */
+static inline uint32_t
+panvk_xfb_output_count(VkPrimitiveTopology topo, uint32_t input_count)
+{
+ switch (topo) {
+ case VK_PRIMITIVE_TOPOLOGY_LINE_STRIP:
+ return input_count >= 1 ? 2u * (input_count - 1u) : 0u;
+ case VK_PRIMITIVE_TOPOLOGY_TRIANGLE_STRIP:
+ case VK_PRIMITIVE_TOPOLOGY_TRIANGLE_FAN:
+ return input_count >= 2 ? 3u * (input_count - 2u) : 0u;
+ case VK_PRIMITIVE_TOPOLOGY_LINE_LIST_WITH_ADJACENCY:
+ return (input_count / 4u) * 2u;
+ case VK_PRIMITIVE_TOPOLOGY_LINE_STRIP_WITH_ADJACENCY:
+ return input_count >= 3 ? 2u * (input_count - 3u) : 0u;
+ case VK_PRIMITIVE_TOPOLOGY_TRIANGLE_LIST_WITH_ADJACENCY:
+ return (input_count / 6u) * 3u;
+ case VK_PRIMITIVE_TOPOLOGY_TRIANGLE_STRIP_WITH_ADJACENCY:
+ return input_count >= 6 ? 3u * (input_count / 2u - 2u) : 0u;
+ default:
+ return input_count; /* LIST topologies: 1:1 mapping */
+ }
+}
+#endif
+
+
#endif
diff -urN a/src/panfrost/vulkan/panvk_vX_cmd_draw.c b/src/panfrost/vulkan/panvk_vX_cmd_draw.c
--- a/src/panfrost/vulkan/panvk_vX_cmd_draw.c 2026-05-21 14:04:02.528576354 +0200
+++ b/src/panfrost/vulkan/panvk_vX_cmd_draw.c 2026-05-21 14:04:04.091357598 +0200
@@ -727,6 +727,20 @@
/* iter13: VK_EXT_transform_feedback sysvals — always set (per draw),
* reflect bound XFB state. set_gfx_sysval is a no-op if value unchanged. */
set_gfx_sysval(cmdbuf, dirty_sysvals, vs.num_vertices, info->vertex.count);
+
+ /* iter17: XFB primitive-decomposition sysvals.
+ * xfb_topology = enum value for the current bound topology.
+ * xfb_output_count = per-instance output vertex count after decomposition.
+ * For LIST topologies, output_count == input vertex count and the shader
+ * takes the iter13 single-store fast path. */
+ {
+ VkPrimitiveTopology vk_topo =
+ cmdbuf->vk.dynamic_graphics_state.ia.primitive_topology;
+ uint32_t topo_enum = panvk_vk_topology_to_xfb_enum(vk_topo);
+ uint32_t out_count = panvk_xfb_output_count(vk_topo, info->vertex.count);
+ set_gfx_sysval(cmdbuf, dirty_sysvals, vs.xfb_topology, topo_enum);
+ set_gfx_sysval(cmdbuf, dirty_sysvals, vs.xfb_output_count, out_count);
+ }
{
const struct panvk_cmd_graphics_state *_gfx = &cmdbuf->state.gfx;
/* iter13: default each XFB buffer address to PAN_SHADER_OOB_ADDRESS
diff -urN a/src/panfrost/vulkan/panvk_vX_shader.c b/src/panfrost/vulkan/panvk_vX_shader.c
--- a/src/panfrost/vulkan/panvk_vX_shader.c 2026-05-21 14:04:02.527576494 +0200
+++ b/src/panfrost/vulkan/panvk_vX_shader.c 2026-05-21 14:04:04.098356619 +0200
@@ -895,7 +895,10 @@
nir->info.has_transform_feedback_varyings) {
NIR_PASS(_, nir, nir_opt_constant_folding);
NIR_PASS(_, nir, nir_io_add_intrinsic_xfb_info);
- NIR_PASS(_, nir, pan_nir_lower_xfb);
+ /* iter17: panvk-specific replacement for pan_nir_lower_xfb that handles
+ * primitive decomposition for non-LIST topologies. Single-store LIST
+ * fast path matches iter13 behavior. */
+ NIR_PASS(_, nir, panvk_per_arch(nir_lower_xfb));
}
#endif
}
diff -urN a/src/panfrost/vulkan/panvk_vX_xfb_lower.c b/src/panfrost/vulkan/panvk_vX_xfb_lower.c
--- a/src/panfrost/vulkan/panvk_vX_xfb_lower.c 1970-01-01 01:00:00.000000000 +0100
+++ b/src/panfrost/vulkan/panvk_vX_xfb_lower.c 2026-05-21 14:04:04.115354242 +0200
@@ -0,0 +1,486 @@
+/*
+ * Copyright © 2026 mfritsche / claude-noether
+ * SPDX-License-Identifier: MIT
+ *
+ * iter17: panvk-specific replacement for pan_nir_lower_xfb that handles
+ * primitive decomposition for transform_feedback on non-LIST topologies
+ * (TRIANGLE_STRIP/FAN, LINE_STRIP, *_WITH_ADJACENCY).
+ *
+ * Approach: emit a topology dispatch at the start of each store_output
+ * lowering. The shader reads vs.xfb_topology sysval at runtime and branches
+ * into per-topology emission logic. For each affected topology, the lowered
+ * code emits guarded conditional stores — one per primitive this vertex
+ * contributes to, computing the output buffer position via primitive index
+ * and slot within the decomposed primitive.
+ *
+ * For LIST topologies (POINT/LINE/TRIANGLE LIST), takes a fast path that
+ * matches iter13's single-store behavior.
+ *
+ * For TRIANGLE_FAN, the central vertex (v=0) contributes to ALL primitives
+ * as slot 2 — handled via a NIR loop bounded by num_vertices.
+ *
+ * See ~/src/panvk-bifrost/iter17/phase{0,1,2}_*.md for full design context.
+ */
+
+#include "panvk_macros.h"
+
+#if PAN_ARCH < 9
+
+#include "panvk_shader.h"
+
+#include "compiler/nir/nir_builder.h"
+#include "pan_nir.h"
+
+#include <vulkan/vulkan_core.h>
+
+/* ----- Address arithmetic ----- */
+
+static nir_def *
+xfb_store_addr(nir_builder *b, nir_def *buf, nir_def *out_idx,
+ uint16_t stride, uint16_t offset_bytes)
+{
+ nir_def *byte_off = nir_iadd_imm(b,
+ nir_imul_imm(b, out_idx, stride), offset_bytes);
+ return nir_iadd(b, buf, nir_u2u64(b, byte_off));
+}
+
+static void
+emit_list_store(nir_builder *b, nir_def *buf, nir_def *output_count,
+ nir_def *instance_id, nir_def *raw_vid, nir_def *value,
+ uint16_t stride, uint16_t offset_bytes)
+{
+ nir_def *out_idx = nir_iadd(b,
+ nir_imul(b, instance_id, output_count), raw_vid);
+ nir_def *addr = xfb_store_addr(b, buf, out_idx, stride, offset_bytes);
+ nir_store_global(b, value, addr);
+}
+
+static void
+emit_prim_store(nir_builder *b, nir_def *buf, nir_def *output_count,
+ nir_def *instance_id, nir_def *eligible,
+ nir_def *prim_idx, nir_def *slot,
+ uint32_t verts_per_prim,
+ nir_def *value, uint16_t stride, uint16_t offset_bytes)
+{
+ nir_push_if(b, eligible);
+ {
+ nir_def *out_idx = nir_iadd(b,
+ nir_imul(b, instance_id, output_count),
+ nir_iadd(b, nir_imul_imm(b, prim_idx, verts_per_prim), slot));
+ nir_def *addr = xfb_store_addr(b, buf, out_idx, stride, offset_bytes);
+ nir_store_global(b, value, addr);
+ }
+ nir_pop_if(b, NULL);
+}
+
+/* ----- Per-topology emission ----- */
+
+/* TRIANGLE_STRIP: vertex v contributes to prims v, v-1, v-2 (per eligibility). */
+static void
+emit_tri_strip(nir_builder *b, nir_def *v, nir_def *N,
+ nir_def *buf, nir_def *output_count, nir_def *instance_id,
+ nir_def *value, uint16_t stride, uint16_t offset_bytes)
+{
+ nir_def *Nm2 = nir_iadd_imm(b, N, -2);
+ nir_def *Nm1 = nir_iadd_imm(b, N, -1);
+
+ /* Prim v, slot 0: v < N-2 */
+ emit_prim_store(b, buf, output_count, instance_id,
+ nir_ult(b, v, Nm2),
+ v, nir_imm_int(b, 0), 3, value, stride, offset_bytes);
+
+ /* Prim v-1, slot = 1 if prim even else 2: 1 <= v < N-1 */
+ {
+ nir_def *prim = nir_iadd_imm(b, v, -1);
+ nir_def *parity = nir_iand_imm(b, prim, 1u);
+ nir_def *slot = nir_iadd_imm(b, parity, 1);
+ nir_def *eligible = nir_iand(b,
+ nir_uge(b, v, nir_imm_int(b, 1)),
+ nir_ult(b, v, Nm1));
+ emit_prim_store(b, buf, output_count, instance_id, eligible,
+ prim, slot, 3, value, stride, offset_bytes);
+ }
+
+ /* Prim v-2, slot = 2 if prim even else 1: 2 <= v < N */
+ {
+ nir_def *prim = nir_iadd_imm(b, v, -2);
+ nir_def *parity = nir_iand_imm(b, prim, 1u);
+ nir_def *slot = nir_isub(b, nir_imm_int(b, 2), parity);
+ nir_def *eligible = nir_iand(b,
+ nir_uge(b, v, nir_imm_int(b, 2)),
+ nir_ult(b, v, N));
+ emit_prim_store(b, buf, output_count, instance_id, eligible,
+ prim, slot, 3, value, stride, offset_bytes);
+ }
+}
+
+/* LINE_STRIP: vertex v contributes to prim v slot 0 + prim v-1 slot 1. */
+static void
+emit_line_strip(nir_builder *b, nir_def *v, nir_def *N,
+ nir_def *buf, nir_def *output_count, nir_def *instance_id,
+ nir_def *value, uint16_t stride, uint16_t offset_bytes)
+{
+ nir_def *Nm1 = nir_iadd_imm(b, N, -1);
+
+ /* Prim v, slot 0: v < N-1 */
+ emit_prim_store(b, buf, output_count, instance_id,
+ nir_ult(b, v, Nm1),
+ v, nir_imm_int(b, 0), 2, value, stride, offset_bytes);
+
+ /* Prim v-1, slot 1: 1 <= v < N */
+ {
+ nir_def *prim = nir_iadd_imm(b, v, -1);
+ nir_def *eligible = nir_iand(b,
+ nir_uge(b, v, nir_imm_int(b, 1)),
+ nir_ult(b, v, N));
+ emit_prim_store(b, buf, output_count, instance_id, eligible,
+ prim, nir_imm_int(b, 1), 2, value, stride, offset_bytes);
+ }
+}
+
+/* TRIANGLE_FAN: prim p emits {p+1, p+2, 0}.
+ * vertex v=0: contributes to ALL prims as slot 2 (loop required)
+ * vertex v>=1: contributes to prim v-1 as slot 0 (if 1 <= v <= N-2)
+ * vertex v>=2: contributes to prim v-2 as slot 1 (if 2 <= v <= N-1)
+ */
+static void
+emit_tri_fan(nir_builder *b, nir_def *v, nir_def *N,
+ nir_def *buf, nir_def *output_count, nir_def *instance_id,
+ nir_def *value, uint16_t stride, uint16_t offset_bytes)
+{
+ nir_def *Nm1 = nir_iadd_imm(b, N, -1);
+ nir_def *Nm2 = nir_iadd_imm(b, N, -2);
+
+ /* Prim v-1, slot 0: 1 <= v < N-1 */
+ {
+ nir_def *prim = nir_iadd_imm(b, v, -1);
+ nir_def *eligible = nir_iand(b,
+ nir_uge(b, v, nir_imm_int(b, 1)),
+ nir_ult(b, v, Nm1));
+ emit_prim_store(b, buf, output_count, instance_id, eligible,
+ prim, nir_imm_int(b, 0), 3, value, stride, offset_bytes);
+ }
+
+ /* Prim v-2, slot 1: 2 <= v < N */
+ {
+ nir_def *prim = nir_iadd_imm(b, v, -2);
+ nir_def *eligible = nir_iand(b,
+ nir_uge(b, v, nir_imm_int(b, 2)),
+ nir_ult(b, v, N));
+ emit_prim_store(b, buf, output_count, instance_id, eligible,
+ prim, nir_imm_int(b, 1), 3, value, stride, offset_bytes);
+ }
+
+ /* Central vertex (v == 0): loop over all prims, write to slot 2. */
+ nir_push_if(b, nir_ieq_imm(b, v, 0));
+ {
+ nir_variable *p_var = nir_local_variable_create(b->impl,
+ glsl_uint_type(), "fan_p");
+ nir_store_var(b, p_var, nir_imm_int(b, 0), 0x1);
+ nir_push_loop(b);
+ {
+ nir_def *p = nir_load_var(b, p_var);
+ nir_push_if(b, nir_uge(b, p, Nm2));
+ {
+ nir_jump(b, nir_jump_break);
+ }
+ nir_pop_if(b, NULL);
+
+ nir_def *out_idx = nir_iadd(b,
+ nir_imul(b, instance_id, output_count),
+ nir_iadd_imm(b, nir_imul_imm(b, p, 3), 2));
+ nir_def *addr = xfb_store_addr(b, buf, out_idx, stride, offset_bytes);
+ nir_store_global(b, value, addr);
+
+ nir_store_var(b, p_var, nir_iadd_imm(b, p, 1), 0x1);
+ }
+ nir_pop_loop(b, NULL);
+ }
+ nir_pop_if(b, NULL);
+}
+
+/* LINE_LIST_WITH_ADJACENCY: 4-vertex groups [4i..4i+3]; output {4i+1, 4i+2}.
+ * v contributes if v%4 == 1: prim v/4 slot 0
+ * v contributes if v%4 == 2: prim v/4 slot 1
+ */
+static void
+emit_line_list_adj(nir_builder *b, nir_def *v, nir_def *N,
+ nir_def *buf, nir_def *output_count, nir_def *instance_id,
+ nir_def *value, uint16_t stride, uint16_t offset_bytes)
+{
+ (void)N; /* eligibility is mod-based, not range-based */
+ nir_def *vmod4 = nir_iand_imm(b, v, 3u);
+ nir_def *prim = nir_ushr_imm(b, v, 2); /* v / 4 */
+
+ emit_prim_store(b, buf, output_count, instance_id,
+ nir_ieq_imm(b, vmod4, 1),
+ prim, nir_imm_int(b, 0), 2, value, stride, offset_bytes);
+
+ emit_prim_store(b, buf, output_count, instance_id,
+ nir_ieq_imm(b, vmod4, 2),
+ prim, nir_imm_int(b, 1), 2, value, stride, offset_bytes);
+}
+
+/* LINE_STRIP_WITH_ADJACENCY: prim p emits {p+1, p+2}.
+ * v contributes to prim v-1 slot 0 (1 <= v <= N-2)
+ * v contributes to prim v-2 slot 1 (2 <= v <= N-1)
+ */
+static void
+emit_line_strip_adj(nir_builder *b, nir_def *v, nir_def *N,
+ nir_def *buf, nir_def *output_count, nir_def *instance_id,
+ nir_def *value, uint16_t stride, uint16_t offset_bytes)
+{
+ nir_def *Nm1 = nir_iadd_imm(b, N, -1);
+ nir_def *Nm2 = nir_iadd_imm(b, N, -2);
+
+ /* Prim v-1, slot 0: 1 <= v <= N-2 ⇔ v >= 1 AND v <= N-2 ⇔ v >= 1 AND v < N-1 */
+ {
+ nir_def *prim = nir_iadd_imm(b, v, -1);
+ nir_def *eligible = nir_iand(b,
+ nir_uge(b, v, nir_imm_int(b, 1)),
+ nir_ult(b, v, Nm1));
+ (void)Nm2;
+ emit_prim_store(b, buf, output_count, instance_id, eligible,
+ prim, nir_imm_int(b, 0), 2, value, stride, offset_bytes);
+ }
+
+ /* Prim v-2, slot 1: 2 <= v <= N-1 ⇔ v >= 2 AND v < N */
+ {
+ nir_def *prim = nir_iadd_imm(b, v, -2);
+ nir_def *eligible = nir_iand(b,
+ nir_uge(b, v, nir_imm_int(b, 2)),
+ nir_ult(b, v, N));
+ emit_prim_store(b, buf, output_count, instance_id, eligible,
+ prim, nir_imm_int(b, 1), 2, value, stride, offset_bytes);
+ }
+}
+
+/* TRIANGLE_LIST_WITH_ADJACENCY: 6-vertex groups; output {6i, 6i+2, 6i+4}.
+ * v contributes if v%6 == 0: prim v/6 slot 0
+ * v contributes if v%6 == 2: prim v/6 slot 1
+ * v contributes if v%6 == 4: prim v/6 slot 2
+ */
+static void
+emit_tri_list_adj(nir_builder *b, nir_def *v, nir_def *N,
+ nir_def *buf, nir_def *output_count, nir_def *instance_id,
+ nir_def *value, uint16_t stride, uint16_t offset_bytes)
+{
+ (void)N;
+ nir_def *vmod6 = nir_umod_imm(b, v, 6);
+ nir_def *prim = nir_udiv_imm(b, v, 6);
+
+ for (uint32_t slot = 0; slot < 3; slot++) {
+ emit_prim_store(b, buf, output_count, instance_id,
+ nir_ieq_imm(b, vmod6, slot * 2),
+ prim, nir_imm_int(b, slot), 3, value, stride, offset_bytes);
+ }
+}
+
+/* TRIANGLE_STRIP_WITH_ADJACENCY: prim i emits:
+ * even i: {2i, 2i+2, 2i+4} (slots 0, 1, 2 ← input indices 2i, 2i+2, 2i+4)
+ * odd i: {2i, 2i+4, 2i+2} (slots 0, 1, 2 ← input indices 2i, 2i+4, 2i+2)
+ *
+ * Only EVEN input vertices contribute (since all output indices are 2*something).
+ * For even input v:
+ * prim v/2 slot 0 (always, if v/2 < N/2-2)
+ * prim (v-2)/2 slot 1 if (v-2)/2 even, slot 2 if odd (when v >= 2)
+ * prim (v-4)/2 slot 2 if (v-4)/2 even, slot 1 if odd (when v >= 4)
+ */
+static void
+emit_tri_strip_adj(nir_builder *b, nir_def *v, nir_def *N,
+ nir_def *buf, nir_def *output_count, nir_def *instance_id,
+ nir_def *value, uint16_t stride, uint16_t offset_bytes)
+{
+ /* Bail for odd input vertices — they never contribute. */
+ nir_def *v_is_even = nir_ieq_imm(b, nir_iand_imm(b, v, 1u), 0);
+ nir_push_if(b, v_is_even);
+ {
+ nir_def *N_half = nir_ushr_imm(b, N, 1);
+ nir_def *max_prim = nir_iadd_imm(b, N_half, -2); /* N/2 - 2 */
+ nir_def *v_half = nir_ushr_imm(b, v, 1);
+
+ /* Prim v/2 slot 0: v/2 < N/2 - 2 */
+ emit_prim_store(b, buf, output_count, instance_id,
+ nir_ult(b, v_half, max_prim),
+ v_half, nir_imm_int(b, 0), 3, value, stride, offset_bytes);
+
+ /* Prim (v-2)/2 = v/2 - 1: v >= 2 AND prim < N/2-2 */
+ {
+ nir_def *prim = nir_iadd_imm(b, v_half, -1);
+ nir_def *parity = nir_iand_imm(b, prim, 1u);
+ nir_def *slot = nir_iadd_imm(b, parity, 1); /* even→1, odd→2 */
+ nir_def *eligible = nir_iand(b,
+ nir_uge(b, v, nir_imm_int(b, 2)),
+ nir_ult(b, prim, max_prim));
+ emit_prim_store(b, buf, output_count, instance_id, eligible,
+ prim, slot, 3, value, stride, offset_bytes);
+ }
+
+ /* Prim (v-4)/2 = v/2 - 2: v >= 4 AND prim < N/2-2 */
+ {
+ nir_def *prim = nir_iadd_imm(b, v_half, -2);
+ nir_def *parity = nir_iand_imm(b, prim, 1u);
+ nir_def *slot = nir_isub(b, nir_imm_int(b, 2), parity); /* even→2, odd→1 */
+ nir_def *eligible = nir_iand(b,
+ nir_uge(b, v, nir_imm_int(b, 4)),
+ nir_ult(b, prim, max_prim));
+ emit_prim_store(b, buf, output_count, instance_id, eligible,
+ prim, slot, 3, value, stride, offset_bytes);
+ }
+ }
+ nir_pop_if(b, NULL);
+}
+
+/* ----- Main lowering: per store_output XFB channel ----- */
+
+static void
+lower_xfb_output_iter17(nir_builder *b, nir_intrinsic_instr *intr,
+ unsigned channel_idx, unsigned num_components,
+ unsigned buffer, unsigned offset_words)
+{
+ assert(buffer < MAX_XFB_BUFFERS);
+ assert(nir_intrinsic_component(intr) == 0);
+
+ uint16_t stride = b->shader->info.xfb_stride[buffer] * 4;
+ assert(stride != 0);
+ uint16_t offset_bytes = offset_words * 4;
+
+ BITSET_SET(b->shader->info.system_values_read, SYSTEM_VALUE_VERTEX_ID_ZERO_BASE);
+ BITSET_SET(b->shader->info.system_values_read, SYSTEM_VALUE_INSTANCE_ID);
+
+ nir_def *topology = load_sysval(b, graphics, 32, vs.xfb_topology);
+ nir_def *out_count = load_sysval(b, graphics, 32, vs.xfb_output_count);
+ nir_def *N = nir_load_num_vertices(b);
+ nir_def *v = nir_load_raw_vertex_id_pan(b);
+ nir_def *instance = nir_load_instance_id(b);
+ nir_def *buf = nir_load_xfb_address(b, 64, .base = buffer);
+
+ nir_def *src = intr->src[0].ssa;
+ nir_component_mask_t mask = nir_component_mask(num_components);
+ nir_def *value = nir_channels(b, src, mask << channel_idx);
+
+ /* Topology dispatch ladder. LIST first (fast path). */
+ nir_push_if(b, nir_ieq_imm(b, topology, PANVK_XFB_TOPO_LIST));
+ {
+ emit_list_store(b, buf, out_count, instance, v, value,
+ stride, offset_bytes);
+ }
+ nir_push_else(b, NULL);
+ {
+ /* iter17 Janet Finding 3: gate all non-LIST emission on
+ * output_count > 0. For degenerate input counts (N < min required
+ * for the topology), output_count is 0 and we must emit NO stores
+ * — otherwise N-2 / N-3 / etc. arithmetic underflows in the
+ * eligibility predicates and we falsely fire stores. */
+ nir_push_if(b, nir_ult(b, nir_imm_int(b, 0), out_count));
+ {
+ nir_push_if(b, nir_ieq_imm(b, topology, PANVK_XFB_TOPO_TRI_STRIP));
+ {
+ emit_tri_strip(b, v, N, buf, out_count, instance, value,
+ stride, offset_bytes);
+ }
+ nir_push_else(b, NULL);
+ {
+ nir_push_if(b, nir_ieq_imm(b, topology, PANVK_XFB_TOPO_LINE_STRIP));
+ {
+ emit_line_strip(b, v, N, buf, out_count, instance, value,
+ stride, offset_bytes);
+ }
+ nir_push_else(b, NULL);
+ {
+ nir_push_if(b, nir_ieq_imm(b, topology, PANVK_XFB_TOPO_TRI_FAN));
+ {
+ emit_tri_fan(b, v, N, buf, out_count, instance, value,
+ stride, offset_bytes);
+ }
+ nir_push_else(b, NULL);
+ {
+ nir_push_if(b, nir_ieq_imm(b, topology, PANVK_XFB_TOPO_LINE_LIST_ADJ));
+ {
+ emit_line_list_adj(b, v, N, buf, out_count, instance, value,
+ stride, offset_bytes);
+ }
+ nir_push_else(b, NULL);
+ {
+ nir_push_if(b, nir_ieq_imm(b, topology, PANVK_XFB_TOPO_LINE_STRIP_ADJ));
+ {
+ emit_line_strip_adj(b, v, N, buf, out_count, instance, value,
+ stride, offset_bytes);
+ }
+ nir_push_else(b, NULL);
+ {
+ nir_push_if(b, nir_ieq_imm(b, topology, PANVK_XFB_TOPO_TRI_LIST_ADJ));
+ {
+ emit_tri_list_adj(b, v, N, buf, out_count, instance, value,
+ stride, offset_bytes);
+ }
+ nir_push_else(b, NULL);
+ {
+ /* TRI_STRIP_ADJ — last case */
+ emit_tri_strip_adj(b, v, N, buf, out_count, instance, value,
+ stride, offset_bytes);
+ }
+ nir_pop_if(b, NULL);
+ }
+ nir_pop_if(b, NULL);
+ }
+ nir_pop_if(b, NULL);
+ }
+ nir_pop_if(b, NULL);
+ }
+ nir_pop_if(b, NULL);
+ }
+ nir_pop_if(b, NULL);
+ }
+ nir_pop_if(b, NULL); /* Janet Finding 3: close output_count > 0 guard */
+ }
+ nir_pop_if(b, NULL);
+}
+
+/* Mirror of pan_nir_lower_xfb's lower_xfb: load_vertex_id rewrite +
+ * dispatch store_output through our topology-aware emission. */
+static bool
+lower_xfb_iter17(nir_builder *b, nir_intrinsic_instr *intr,
+ UNUSED void *data)
+{
+ if (intr->intrinsic == nir_intrinsic_load_vertex_id) {
+ b->cursor = nir_instr_remove(&intr->instr);
+ nir_def *repl = nir_iadd(b, nir_load_raw_vertex_id_pan(b),
+ nir_load_raw_vertex_offset_pan(b));
+ nir_def_rewrite_uses(&intr->def, repl);
+ return true;
+ }
+
+ if (intr->intrinsic != nir_intrinsic_store_output)
+ return false;
+
+ bool progress = false;
+ b->cursor = nir_before_instr(&intr->instr);
+
+ /* io_xfb has only out[0,1]; the other 2 channels are in io_xfb2.
+ * Outer loop selects which annotation; inner picks which channel. */
+ for (unsigned i = 0; i < 2; ++i) {
+ nir_io_xfb xfb = i ? nir_intrinsic_io_xfb2(intr)
+ : nir_intrinsic_io_xfb(intr);
+ for (unsigned j = 0; j < 2; ++j) {
+ if (!xfb.out[j].num_components)
+ continue;
+ lower_xfb_output_iter17(b, intr, i * 2 + j, xfb.out[j].num_components,
+ xfb.out[j].buffer, xfb.out[j].offset);
+ progress = true;
+ }
+ }
+
+ if (progress)
+ nir_instr_remove(&intr->instr);
+ return progress;
+}
+
+bool
+panvk_per_arch(nir_lower_xfb)(nir_shader *nir)
+{
+ return nir_shader_intrinsics_pass(
+ nir, lower_xfb_iter17, nir_metadata_control_flow, NULL);
+}
+
+#endif /* PAN_ARCH < 9 */
+181
View File
@@ -0,0 +1,181 @@
# Maintainer: Markus Fritsche <fritsche.markus@gmail.com>
#
# mesa-panvk-bifrost-video — sibling of mesa-panvk-bifrost (r4) that adds
# VK_KHR_video_decode_h264 on Mali Bifrost SBCs (PAN_ARCH 6/7) backed by
# the SoC's V4L2-stateless hantro VPU (RK3566/RK3568).
#
# Campaign: ~/src/panvk-bifrost-video/ — Phase 4 byte-exact validated
# 2026-05-21 (48/48 BBB display frames match ffmpeg+libva-v4l2-request-
# fourier byte-for-byte on the same hantro). Phase 5 second-model review
# completed; load-bearing findings (output_map OOB, static counter,
# session_init unwind, probe_hantro gate) all applied.
#
# What it does (on top of r4):
# - 0001..0004: inherited from mesa-panvk-bifrost (robustness2/null-
# descriptor, vk1.1/1.2 advertisement, EXT_transform_feedback, XFB
# primitive decomposition) — symlinked from the r4 package directory
# so the patches don't drift between siblings.
# - 0005: VK_KHR_video_queue + VK_KHR_video_decode_queue +
# VK_KHR_video_decode_h264 backed by V4L2-stateless hantro.
# Touches 14 files in src/panfrost/vulkan/; full diff in
# 0005-panvk-bifrost-video-KHR-video-decode-h264.patch.
#
# Co-existence:
# - Installs to /usr/lib/panvk-bifrost-video/ (parallel to r4's
# /usr/lib/panvk-bifrost/). Pick at runtime via VK_ICD_FILENAMES.
# - r4 stays the recommended default for the Chromium-GPU-process
# consumer (no video needed there). Use this package when the
# consumer wants Vulkan video decode (mpv-fourier, ffmpeg-vulkan,
# future Chromium-VulkanVideoDecoder).
#
# Phase 1 limitations to know about (documented in source comments):
# - Single video session per device (active_video singleton)
# - Synchronous decode at record time — no pipelining yet
# - Hardcoded /dev/video1 + /dev/media0 (matches RK3566/68, blocks
# other SoCs without a topology-walk port)
# - Bitstream source buffer assumed HOST_VISIBLE (true on panvk-
# bifrost, would need fallback on other backends)
#
# Build target: arch-aarch64 runner via marfrit-packages Gitea Actions.
# Mesa build is slow (~30-60min on Cortex-A55).
pkgname=mesa-panvk-bifrost-video
_mesaver=26.0.6
pkgver=26.0.6.r5.video1
pkgrel=1
pkgdesc="Patched Mesa libvulkan_panfrost.so adding VK_KHR_video_decode_h264 on Bifrost SBCs (sibling of mesa-panvk-bifrost-r4)"
arch=('aarch64')
url="https://git.reauktion.de/marfrit/panvk-bifrost"
license=('MIT')
depends=(
'mesa' # for shared mesa runtime libs
'libdrm'
'wayland'
'libxcb'
'libx11'
'libxshmfence'
'zlib'
'zstd'
'libelf'
'libffi'
'expat'
'llvm-libs'
'lm_sensors'
)
makedepends=(
'meson'
'ninja'
'glslang'
'python-mako'
'python-packaging'
'wayland-protocols'
'libxrandr'
'xorgproto'
'libdrm'
'llvm'
'libclc'
'spirv-llvm-translator'
'spirv-tools'
'rust-bindgen'
'patch'
)
source=(
"https://archive.mesa3d.org/mesa-${_mesaver}.tar.xz"
"0001-panvk-expose-robustness2-nullDescriptor-bifrost.patch"
"0002-panvk-expose-vulkan-1.1-1.2-on-bifrost.patch"
"0003-panvk-bifrost-vk-ext-transform-feedback.patch"
"0004-panvk-bifrost-xfb-primitive-decomposition.patch"
"0005-panvk-bifrost-video-KHR-video-decode-h264.patch"
"icd.json"
)
# Mesa tarball checksum matches the sibling r4 package — same upstream version.
sha256sums=(
'SKIP' # mesa tarball — co-trust w/ r4 sibling
'SKIP' # patches are local
'SKIP'
'SKIP'
'SKIP'
'SKIP'
'SKIP' # icd.json
)
prepare() {
cd "mesa-${_mesaver}"
# r1+r2: small sed-based edits inherited from r4 (verbatim from the
# sibling PKGBUILD — keep in sync).
sed -i 's|\.KHR_robustness2 = PAN_ARCH >= 10,|.KHR_robustness2 = true,|' src/panfrost/vulkan/panvk_vX_physical_device.c
sed -i 's|\.EXT_robustness2 = PAN_ARCH >= 10,|.EXT_robustness2 = true,|' src/panfrost/vulkan/panvk_vX_physical_device.c
sed -i 's|\.nullDescriptor = PAN_ARCH >= 10,|.nullDescriptor = true,|' src/panfrost/vulkan/panvk_vX_physical_device.c
sed -i 's|bool has_vk1_1 = PAN_ARCH >= 10;|bool has_vk1_1 = true;|' src/panfrost/vulkan/panvk_vX_physical_device.c
sed -i 's|bool has_vk1_2 = PAN_ARCH >= 10;|bool has_vk1_2 = true;|' src/panfrost/vulkan/panvk_vX_physical_device.c
# r3: EXT_transform_feedback for Bifrost.
patch -p1 < "${srcdir}/0003-panvk-bifrost-vk-ext-transform-feedback.patch"
# r4: XFB primitive decomposition NIR pass.
patch -p1 < "${srcdir}/0004-panvk-bifrost-xfb-primitive-decomposition.patch"
# video: VK_KHR_video_decode_h264 via V4L2-hantro.
patch -p1 < "${srcdir}/0005-panvk-bifrost-video-KHR-video-decode-h264.patch"
# Sanity-check r1..r4 (inherited).
grep -q "KHR_robustness2 = true," src/panfrost/vulkan/panvk_vX_physical_device.c
grep -q "EXT_robustness2 = true," src/panfrost/vulkan/panvk_vX_physical_device.c
grep -q "nullDescriptor = true," src/panfrost/vulkan/panvk_vX_physical_device.c
grep -q "has_vk1_1 = true;" src/panfrost/vulkan/panvk_vX_physical_device.c
grep -q "has_vk1_2 = true;" src/panfrost/vulkan/panvk_vX_physical_device.c
grep -q "EXT_transform_feedback = PAN_ARCH < 9," src/panfrost/vulkan/panvk_vX_physical_device.c
test -f src/panfrost/vulkan/jm/panvk_vX_cmd_xfb.c
grep -q "panvk_per_arch(nir_lower_xfb)" src/panfrost/vulkan/panvk_vX_shader.c
test -f src/panfrost/vulkan/panvk_vX_xfb_lower.c
# Sanity-check video patch landed.
grep -q "KHR_video_queue = PAN_ARCH < 9 && panvk_v4l2_probe_hantro()" \
src/panfrost/vulkan/panvk_vX_physical_device.c
grep -q "PANVK_QUEUE_FAMILY_VIDEO_DECODE" src/panfrost/vulkan/panvk_device.h
test -f src/panfrost/vulkan/panvk_video_decode.c
test -f src/panfrost/vulkan/panvk_video_decode.h
test -f src/panfrost/vulkan/panvk_v4l2.c
test -f src/panfrost/vulkan/panvk_v4l2_h264.c
test -f src/panfrost/vulkan/panvk_v4l2_h264_slice_header.c
test -f src/panfrost/vulkan/panvk_v4l2_h264_slice_header.h
grep -q "panvk_v4l2_h264_slice_header.c" src/panfrost/vulkan/meson.build
grep -q "panvk_video_queue_submit_noop" src/panfrost/vulkan/panvk_vX_device.c
}
build() {
cd "mesa-${_mesaver}"
# Mirror r4's narrow build profile.
meson setup build/ \
--prefix=/usr \
--libdir=lib \
--buildtype=release \
-Dvulkan-drivers=panfrost \
-Dgallium-drivers= \
-Dplatforms=wayland,x11 \
-Dglx=disabled \
-Degl=disabled \
-Dgles1=disabled \
-Dgles2=disabled \
-Dvulkan-layers= \
-Dtools= \
-Dgallium-rusticl=false \
-Dmicrosoft-clc=disabled
meson compile -C build
}
package() {
cd "${srcdir}/mesa-${_mesaver}"
# Co-install path — parallel to r4's /usr/lib/panvk-bifrost/.
install -Dm755 build/src/panfrost/vulkan/libvulkan_panfrost.so \
"$pkgdir/usr/lib/panvk-bifrost-video/libvulkan_panfrost.so"
# ICD JSON pointing at the video build. Opt-in via VK_ICD_FILENAMES;
# NOT in /usr/share/vulkan/icd.d/ so it doesn't override stock or r4.
install -Dm644 "$srcdir/icd.json" \
"$pkgdir/usr/lib/panvk-bifrost-video/icd.json"
}
+40
View File
@@ -0,0 +1,40 @@
# mesa-panvk-bifrost-video
Patched Mesa `libvulkan_panfrost.so` that **adds `VK_KHR_video_decode_h264`** on Mali Bifrost SBCs (PAN_ARCH 6/7, RK3566/RK3568 class hardware), backed by the SoC's V4L2-stateless **hantro** VPU.
This is a **sibling** of [mesa-panvk-bifrost](../mesa-panvk-bifrost/) (the r4 package that exposes Bifrost to Chromium's Vulkan compositor). Pick this one when the consumer wants Vulkan **video decode** in addition; pick r4 for compositor-only.
## Status
Phase 4 byte-exact validated 2026-05-21: 48/48 unique BBB display frames decoded by this package are byte-identical to `ffmpeg+libva-v4l2-request-fourier` running on the same hantro hardware. Phase 5 second-model review completed; all load-bearing findings addressed. First publish via marfrit-packages CI 2026-05-22 (PR #79 merge did not auto-fire Actions; this re-trigger restores the standard build/sign/publish path).
## How to use
```sh
# Co-installs alongside r4 and stock mesa.
sudo pacman -S mesa-panvk-bifrost-video
# Opt in (not on the default loader search path).
export VK_ICD_FILENAMES=/usr/lib/panvk-bifrost-video/icd.json
export PAN_I_WANT_A_BROKEN_VULKAN_DRIVER=1 # mesa-upstream gate
# Run a Vulkan video consumer.
vulkan-video-dec-simple-test -i your.h264 --codec h264 --noPresent --maxFrameCount 50
# or
ffmpeg -hwaccel vulkan -i your.mp4 ...
```
## Phase 1 limitations
Documented in source comments and worth knowing before relying on this in production:
- **Single video session per device.** Concurrent `VkVideoSessionKHR` on the same device clobber each other (`active_video` singleton). Sufficient for current single-stream consumers.
- **Synchronous decode at record time.** The full V4L2 ioctl dance runs to completion inside `vkCmdDecodeVideoKHR`. No pipelining. Throughput is bounded by hantro's ~1.16× realtime on 1080p H.264.
- **Hardcoded `/dev/video1` + `/dev/media0`.** Matches RK3566/68 but won't work on other SoCs without a topology-walk port (see `libva-v4l2-request-fourier` for the full version).
- **Bitstream source buffer assumed HOST_VISIBLE.** True on panvk-bifrost (no DEVICE_LOCAL-only memory types exist), but the code silently skips decode if the app bound the buffer to non-host-visible memory.
## Co-existence
- Installs to `/usr/lib/panvk-bifrost-video/` — parallel to r4's `/usr/lib/panvk-bifrost/` and stock `/usr/lib/`.
- Opt-in via `VK_ICD_FILENAMES`; does NOT register itself in `/usr/share/vulkan/icd.d/`.
- Three drivers coexist without conflict; the user picks at runtime which to use.
+7
View File
@@ -0,0 +1,7 @@
{
"ICD": {
"api_version": "1.4.335",
"library_path": "/usr/lib/panvk-bifrost-video/libvulkan_panfrost.so"
},
"file_format_version": "1.0.1"
}
@@ -0,0 +1,629 @@
diff -urN a/src/panfrost/vulkan/meson.build b/src/panfrost/vulkan/meson.build
--- a/src/panfrost/vulkan/meson.build 2026-05-21 14:04:02.529474145 +0200
+++ b/src/panfrost/vulkan/meson.build 2026-05-21 14:04:04.106755486 +0200
@@ -123,6 +123,7 @@
'panvk_vX_nir_lower_input_attachment_loads.c',
'panvk_vX_sampler.c',
'panvk_vX_shader.c',
+ 'panvk_vX_xfb_lower.c',
sha1_h,
]
diff -urN a/src/panfrost/vulkan/panvk_shader.h b/src/panfrost/vulkan/panvk_shader.h
--- a/src/panfrost/vulkan/panvk_shader.h 2026-05-21 14:04:02.525251986 +0200
+++ b/src/panfrost/vulkan/panvk_shader.h 2026-05-21 14:04:04.084251800 +0200
@@ -154,6 +154,8 @@
/* aligned_u64 attribute below inserts the 4-byte alignment gap
* after num_vertices automatically — no explicit pad needed. */
aligned_u64 xfb_address[4]; /* iter13: 4 transform feedback buffer base addresses */
+ uint32_t xfb_topology; /* iter17: panvk_xfb_topology enum value */
+ uint32_t xfb_output_count; /* iter17: per-instance output verts after decomp */
#endif
int32_t first_vertex;
int32_t base_instance;
@@ -569,4 +571,76 @@
struct pan_compute_dim local_size, const void *bin_ptr, size_t bin_size,
struct panvk_shader **shader_out);
+
+#if PAN_ARCH < 9
+/* iter17: encoding for vs.xfb_topology sysval. Maps VkPrimitiveTopology values
+ * we need to distinguish at shader runtime for XFB capture. LIST topologies
+ * use the iter13 single-store fast path; non-LIST need per-vertex decomposition. */
+enum panvk_xfb_topology {
+ PANVK_XFB_TOPO_LIST = 0,
+ PANVK_XFB_TOPO_LINE_STRIP = 1,
+ PANVK_XFB_TOPO_TRI_STRIP = 2,
+ PANVK_XFB_TOPO_TRI_FAN = 3,
+ PANVK_XFB_TOPO_LINE_LIST_ADJ = 4,
+ PANVK_XFB_TOPO_LINE_STRIP_ADJ = 5,
+ PANVK_XFB_TOPO_TRI_LIST_ADJ = 6,
+ PANVK_XFB_TOPO_TRI_STRIP_ADJ = 7,
+};
+
+#include "panvk_macros.h"
+struct nir_shader;
+bool panvk_per_arch(nir_lower_xfb)(struct nir_shader *nir);
+
+/* Map VkPrimitiveTopology to panvk_xfb_topology enum (driver-side helper). */
+static inline uint32_t
+panvk_vk_topology_to_xfb_enum(VkPrimitiveTopology topo)
+{
+ switch (topo) {
+ case VK_PRIMITIVE_TOPOLOGY_LINE_STRIP:
+ return PANVK_XFB_TOPO_LINE_STRIP;
+ case VK_PRIMITIVE_TOPOLOGY_TRIANGLE_STRIP:
+ return PANVK_XFB_TOPO_TRI_STRIP;
+ case VK_PRIMITIVE_TOPOLOGY_TRIANGLE_FAN:
+ return PANVK_XFB_TOPO_TRI_FAN;
+ case VK_PRIMITIVE_TOPOLOGY_LINE_LIST_WITH_ADJACENCY:
+ return PANVK_XFB_TOPO_LINE_LIST_ADJ;
+ case VK_PRIMITIVE_TOPOLOGY_LINE_STRIP_WITH_ADJACENCY:
+ return PANVK_XFB_TOPO_LINE_STRIP_ADJ;
+ case VK_PRIMITIVE_TOPOLOGY_TRIANGLE_LIST_WITH_ADJACENCY:
+ return PANVK_XFB_TOPO_TRI_LIST_ADJ;
+ case VK_PRIMITIVE_TOPOLOGY_TRIANGLE_STRIP_WITH_ADJACENCY:
+ return PANVK_XFB_TOPO_TRI_STRIP_ADJ;
+ case VK_PRIMITIVE_TOPOLOGY_POINT_LIST:
+ case VK_PRIMITIVE_TOPOLOGY_LINE_LIST:
+ case VK_PRIMITIVE_TOPOLOGY_TRIANGLE_LIST:
+ default:
+ return PANVK_XFB_TOPO_LIST;
+ }
+}
+
+/* Compute the per-instance output vertex count for a given (topology, input count). */
+static inline uint32_t
+panvk_xfb_output_count(VkPrimitiveTopology topo, uint32_t input_count)
+{
+ switch (topo) {
+ case VK_PRIMITIVE_TOPOLOGY_LINE_STRIP:
+ return input_count >= 1 ? 2u * (input_count - 1u) : 0u;
+ case VK_PRIMITIVE_TOPOLOGY_TRIANGLE_STRIP:
+ case VK_PRIMITIVE_TOPOLOGY_TRIANGLE_FAN:
+ return input_count >= 2 ? 3u * (input_count - 2u) : 0u;
+ case VK_PRIMITIVE_TOPOLOGY_LINE_LIST_WITH_ADJACENCY:
+ return (input_count / 4u) * 2u;
+ case VK_PRIMITIVE_TOPOLOGY_LINE_STRIP_WITH_ADJACENCY:
+ return input_count >= 3 ? 2u * (input_count - 3u) : 0u;
+ case VK_PRIMITIVE_TOPOLOGY_TRIANGLE_LIST_WITH_ADJACENCY:
+ return (input_count / 6u) * 3u;
+ case VK_PRIMITIVE_TOPOLOGY_TRIANGLE_STRIP_WITH_ADJACENCY:
+ return input_count >= 6 ? 3u * (input_count / 2u - 2u) : 0u;
+ default:
+ return input_count; /* LIST topologies: 1:1 mapping */
+ }
+}
+#endif
+
+
#endif
diff -urN a/src/panfrost/vulkan/panvk_vX_cmd_draw.c b/src/panfrost/vulkan/panvk_vX_cmd_draw.c
--- a/src/panfrost/vulkan/panvk_vX_cmd_draw.c 2026-05-21 14:04:02.528576354 +0200
+++ b/src/panfrost/vulkan/panvk_vX_cmd_draw.c 2026-05-21 14:04:04.091357598 +0200
@@ -727,6 +727,20 @@
/* iter13: VK_EXT_transform_feedback sysvals — always set (per draw),
* reflect bound XFB state. set_gfx_sysval is a no-op if value unchanged. */
set_gfx_sysval(cmdbuf, dirty_sysvals, vs.num_vertices, info->vertex.count);
+
+ /* iter17: XFB primitive-decomposition sysvals.
+ * xfb_topology = enum value for the current bound topology.
+ * xfb_output_count = per-instance output vertex count after decomposition.
+ * For LIST topologies, output_count == input vertex count and the shader
+ * takes the iter13 single-store fast path. */
+ {
+ VkPrimitiveTopology vk_topo =
+ cmdbuf->vk.dynamic_graphics_state.ia.primitive_topology;
+ uint32_t topo_enum = panvk_vk_topology_to_xfb_enum(vk_topo);
+ uint32_t out_count = panvk_xfb_output_count(vk_topo, info->vertex.count);
+ set_gfx_sysval(cmdbuf, dirty_sysvals, vs.xfb_topology, topo_enum);
+ set_gfx_sysval(cmdbuf, dirty_sysvals, vs.xfb_output_count, out_count);
+ }
{
const struct panvk_cmd_graphics_state *_gfx = &cmdbuf->state.gfx;
/* iter13: default each XFB buffer address to PAN_SHADER_OOB_ADDRESS
diff -urN a/src/panfrost/vulkan/panvk_vX_shader.c b/src/panfrost/vulkan/panvk_vX_shader.c
--- a/src/panfrost/vulkan/panvk_vX_shader.c 2026-05-21 14:04:02.527576494 +0200
+++ b/src/panfrost/vulkan/panvk_vX_shader.c 2026-05-21 14:04:04.098356619 +0200
@@ -895,7 +895,10 @@
nir->info.has_transform_feedback_varyings) {
NIR_PASS(_, nir, nir_opt_constant_folding);
NIR_PASS(_, nir, nir_io_add_intrinsic_xfb_info);
- NIR_PASS(_, nir, pan_nir_lower_xfb);
+ /* iter17: panvk-specific replacement for pan_nir_lower_xfb that handles
+ * primitive decomposition for non-LIST topologies. Single-store LIST
+ * fast path matches iter13 behavior. */
+ NIR_PASS(_, nir, panvk_per_arch(nir_lower_xfb));
}
#endif
}
diff -urN a/src/panfrost/vulkan/panvk_vX_xfb_lower.c b/src/panfrost/vulkan/panvk_vX_xfb_lower.c
--- a/src/panfrost/vulkan/panvk_vX_xfb_lower.c 1970-01-01 01:00:00.000000000 +0100
+++ b/src/panfrost/vulkan/panvk_vX_xfb_lower.c 2026-05-21 14:04:04.115354242 +0200
@@ -0,0 +1,486 @@
+/*
+ * Copyright © 2026 mfritsche / claude-noether
+ * SPDX-License-Identifier: MIT
+ *
+ * iter17: panvk-specific replacement for pan_nir_lower_xfb that handles
+ * primitive decomposition for transform_feedback on non-LIST topologies
+ * (TRIANGLE_STRIP/FAN, LINE_STRIP, *_WITH_ADJACENCY).
+ *
+ * Approach: emit a topology dispatch at the start of each store_output
+ * lowering. The shader reads vs.xfb_topology sysval at runtime and branches
+ * into per-topology emission logic. For each affected topology, the lowered
+ * code emits guarded conditional stores — one per primitive this vertex
+ * contributes to, computing the output buffer position via primitive index
+ * and slot within the decomposed primitive.
+ *
+ * For LIST topologies (POINT/LINE/TRIANGLE LIST), takes a fast path that
+ * matches iter13's single-store behavior.
+ *
+ * For TRIANGLE_FAN, the central vertex (v=0) contributes to ALL primitives
+ * as slot 2 — handled via a NIR loop bounded by num_vertices.
+ *
+ * See ~/src/panvk-bifrost/iter17/phase{0,1,2}_*.md for full design context.
+ */
+
+#include "panvk_macros.h"
+
+#if PAN_ARCH < 9
+
+#include "panvk_shader.h"
+
+#include "compiler/nir/nir_builder.h"
+#include "pan_nir.h"
+
+#include <vulkan/vulkan_core.h>
+
+/* ----- Address arithmetic ----- */
+
+static nir_def *
+xfb_store_addr(nir_builder *b, nir_def *buf, nir_def *out_idx,
+ uint16_t stride, uint16_t offset_bytes)
+{
+ nir_def *byte_off = nir_iadd_imm(b,
+ nir_imul_imm(b, out_idx, stride), offset_bytes);
+ return nir_iadd(b, buf, nir_u2u64(b, byte_off));
+}
+
+static void
+emit_list_store(nir_builder *b, nir_def *buf, nir_def *output_count,
+ nir_def *instance_id, nir_def *raw_vid, nir_def *value,
+ uint16_t stride, uint16_t offset_bytes)
+{
+ nir_def *out_idx = nir_iadd(b,
+ nir_imul(b, instance_id, output_count), raw_vid);
+ nir_def *addr = xfb_store_addr(b, buf, out_idx, stride, offset_bytes);
+ nir_store_global(b, value, addr);
+}
+
+static void
+emit_prim_store(nir_builder *b, nir_def *buf, nir_def *output_count,
+ nir_def *instance_id, nir_def *eligible,
+ nir_def *prim_idx, nir_def *slot,
+ uint32_t verts_per_prim,
+ nir_def *value, uint16_t stride, uint16_t offset_bytes)
+{
+ nir_push_if(b, eligible);
+ {
+ nir_def *out_idx = nir_iadd(b,
+ nir_imul(b, instance_id, output_count),
+ nir_iadd(b, nir_imul_imm(b, prim_idx, verts_per_prim), slot));
+ nir_def *addr = xfb_store_addr(b, buf, out_idx, stride, offset_bytes);
+ nir_store_global(b, value, addr);
+ }
+ nir_pop_if(b, NULL);
+}
+
+/* ----- Per-topology emission ----- */
+
+/* TRIANGLE_STRIP: vertex v contributes to prims v, v-1, v-2 (per eligibility). */
+static void
+emit_tri_strip(nir_builder *b, nir_def *v, nir_def *N,
+ nir_def *buf, nir_def *output_count, nir_def *instance_id,
+ nir_def *value, uint16_t stride, uint16_t offset_bytes)
+{
+ nir_def *Nm2 = nir_iadd_imm(b, N, -2);
+ nir_def *Nm1 = nir_iadd_imm(b, N, -1);
+
+ /* Prim v, slot 0: v < N-2 */
+ emit_prim_store(b, buf, output_count, instance_id,
+ nir_ult(b, v, Nm2),
+ v, nir_imm_int(b, 0), 3, value, stride, offset_bytes);
+
+ /* Prim v-1, slot = 1 if prim even else 2: 1 <= v < N-1 */
+ {
+ nir_def *prim = nir_iadd_imm(b, v, -1);
+ nir_def *parity = nir_iand_imm(b, prim, 1u);
+ nir_def *slot = nir_iadd_imm(b, parity, 1);
+ nir_def *eligible = nir_iand(b,
+ nir_uge(b, v, nir_imm_int(b, 1)),
+ nir_ult(b, v, Nm1));
+ emit_prim_store(b, buf, output_count, instance_id, eligible,
+ prim, slot, 3, value, stride, offset_bytes);
+ }
+
+ /* Prim v-2, slot = 2 if prim even else 1: 2 <= v < N */
+ {
+ nir_def *prim = nir_iadd_imm(b, v, -2);
+ nir_def *parity = nir_iand_imm(b, prim, 1u);
+ nir_def *slot = nir_isub(b, nir_imm_int(b, 2), parity);
+ nir_def *eligible = nir_iand(b,
+ nir_uge(b, v, nir_imm_int(b, 2)),
+ nir_ult(b, v, N));
+ emit_prim_store(b, buf, output_count, instance_id, eligible,
+ prim, slot, 3, value, stride, offset_bytes);
+ }
+}
+
+/* LINE_STRIP: vertex v contributes to prim v slot 0 + prim v-1 slot 1. */
+static void
+emit_line_strip(nir_builder *b, nir_def *v, nir_def *N,
+ nir_def *buf, nir_def *output_count, nir_def *instance_id,
+ nir_def *value, uint16_t stride, uint16_t offset_bytes)
+{
+ nir_def *Nm1 = nir_iadd_imm(b, N, -1);
+
+ /* Prim v, slot 0: v < N-1 */
+ emit_prim_store(b, buf, output_count, instance_id,
+ nir_ult(b, v, Nm1),
+ v, nir_imm_int(b, 0), 2, value, stride, offset_bytes);
+
+ /* Prim v-1, slot 1: 1 <= v < N */
+ {
+ nir_def *prim = nir_iadd_imm(b, v, -1);
+ nir_def *eligible = nir_iand(b,
+ nir_uge(b, v, nir_imm_int(b, 1)),
+ nir_ult(b, v, N));
+ emit_prim_store(b, buf, output_count, instance_id, eligible,
+ prim, nir_imm_int(b, 1), 2, value, stride, offset_bytes);
+ }
+}
+
+/* TRIANGLE_FAN: prim p emits {p+1, p+2, 0}.
+ * vertex v=0: contributes to ALL prims as slot 2 (loop required)
+ * vertex v>=1: contributes to prim v-1 as slot 0 (if 1 <= v <= N-2)
+ * vertex v>=2: contributes to prim v-2 as slot 1 (if 2 <= v <= N-1)
+ */
+static void
+emit_tri_fan(nir_builder *b, nir_def *v, nir_def *N,
+ nir_def *buf, nir_def *output_count, nir_def *instance_id,
+ nir_def *value, uint16_t stride, uint16_t offset_bytes)
+{
+ nir_def *Nm1 = nir_iadd_imm(b, N, -1);
+ nir_def *Nm2 = nir_iadd_imm(b, N, -2);
+
+ /* Prim v-1, slot 0: 1 <= v < N-1 */
+ {
+ nir_def *prim = nir_iadd_imm(b, v, -1);
+ nir_def *eligible = nir_iand(b,
+ nir_uge(b, v, nir_imm_int(b, 1)),
+ nir_ult(b, v, Nm1));
+ emit_prim_store(b, buf, output_count, instance_id, eligible,
+ prim, nir_imm_int(b, 0), 3, value, stride, offset_bytes);
+ }
+
+ /* Prim v-2, slot 1: 2 <= v < N */
+ {
+ nir_def *prim = nir_iadd_imm(b, v, -2);
+ nir_def *eligible = nir_iand(b,
+ nir_uge(b, v, nir_imm_int(b, 2)),
+ nir_ult(b, v, N));
+ emit_prim_store(b, buf, output_count, instance_id, eligible,
+ prim, nir_imm_int(b, 1), 3, value, stride, offset_bytes);
+ }
+
+ /* Central vertex (v == 0): loop over all prims, write to slot 2. */
+ nir_push_if(b, nir_ieq_imm(b, v, 0));
+ {
+ nir_variable *p_var = nir_local_variable_create(b->impl,
+ glsl_uint_type(), "fan_p");
+ nir_store_var(b, p_var, nir_imm_int(b, 0), 0x1);
+ nir_push_loop(b);
+ {
+ nir_def *p = nir_load_var(b, p_var);
+ nir_push_if(b, nir_uge(b, p, Nm2));
+ {
+ nir_jump(b, nir_jump_break);
+ }
+ nir_pop_if(b, NULL);
+
+ nir_def *out_idx = nir_iadd(b,
+ nir_imul(b, instance_id, output_count),
+ nir_iadd_imm(b, nir_imul_imm(b, p, 3), 2));
+ nir_def *addr = xfb_store_addr(b, buf, out_idx, stride, offset_bytes);
+ nir_store_global(b, value, addr);
+
+ nir_store_var(b, p_var, nir_iadd_imm(b, p, 1), 0x1);
+ }
+ nir_pop_loop(b, NULL);
+ }
+ nir_pop_if(b, NULL);
+}
+
+/* LINE_LIST_WITH_ADJACENCY: 4-vertex groups [4i..4i+3]; output {4i+1, 4i+2}.
+ * v contributes if v%4 == 1: prim v/4 slot 0
+ * v contributes if v%4 == 2: prim v/4 slot 1
+ */
+static void
+emit_line_list_adj(nir_builder *b, nir_def *v, nir_def *N,
+ nir_def *buf, nir_def *output_count, nir_def *instance_id,
+ nir_def *value, uint16_t stride, uint16_t offset_bytes)
+{
+ (void)N; /* eligibility is mod-based, not range-based */
+ nir_def *vmod4 = nir_iand_imm(b, v, 3u);
+ nir_def *prim = nir_ushr_imm(b, v, 2); /* v / 4 */
+
+ emit_prim_store(b, buf, output_count, instance_id,
+ nir_ieq_imm(b, vmod4, 1),
+ prim, nir_imm_int(b, 0), 2, value, stride, offset_bytes);
+
+ emit_prim_store(b, buf, output_count, instance_id,
+ nir_ieq_imm(b, vmod4, 2),
+ prim, nir_imm_int(b, 1), 2, value, stride, offset_bytes);
+}
+
+/* LINE_STRIP_WITH_ADJACENCY: prim p emits {p+1, p+2}.
+ * v contributes to prim v-1 slot 0 (1 <= v <= N-2)
+ * v contributes to prim v-2 slot 1 (2 <= v <= N-1)
+ */
+static void
+emit_line_strip_adj(nir_builder *b, nir_def *v, nir_def *N,
+ nir_def *buf, nir_def *output_count, nir_def *instance_id,
+ nir_def *value, uint16_t stride, uint16_t offset_bytes)
+{
+ nir_def *Nm1 = nir_iadd_imm(b, N, -1);
+ nir_def *Nm2 = nir_iadd_imm(b, N, -2);
+
+ /* Prim v-1, slot 0: 1 <= v <= N-2 ⇔ v >= 1 AND v <= N-2 ⇔ v >= 1 AND v < N-1 */
+ {
+ nir_def *prim = nir_iadd_imm(b, v, -1);
+ nir_def *eligible = nir_iand(b,
+ nir_uge(b, v, nir_imm_int(b, 1)),
+ nir_ult(b, v, Nm1));
+ (void)Nm2;
+ emit_prim_store(b, buf, output_count, instance_id, eligible,
+ prim, nir_imm_int(b, 0), 2, value, stride, offset_bytes);
+ }
+
+ /* Prim v-2, slot 1: 2 <= v <= N-1 ⇔ v >= 2 AND v < N */
+ {
+ nir_def *prim = nir_iadd_imm(b, v, -2);
+ nir_def *eligible = nir_iand(b,
+ nir_uge(b, v, nir_imm_int(b, 2)),
+ nir_ult(b, v, N));
+ emit_prim_store(b, buf, output_count, instance_id, eligible,
+ prim, nir_imm_int(b, 1), 2, value, stride, offset_bytes);
+ }
+}
+
+/* TRIANGLE_LIST_WITH_ADJACENCY: 6-vertex groups; output {6i, 6i+2, 6i+4}.
+ * v contributes if v%6 == 0: prim v/6 slot 0
+ * v contributes if v%6 == 2: prim v/6 slot 1
+ * v contributes if v%6 == 4: prim v/6 slot 2
+ */
+static void
+emit_tri_list_adj(nir_builder *b, nir_def *v, nir_def *N,
+ nir_def *buf, nir_def *output_count, nir_def *instance_id,
+ nir_def *value, uint16_t stride, uint16_t offset_bytes)
+{
+ (void)N;
+ nir_def *vmod6 = nir_umod_imm(b, v, 6);
+ nir_def *prim = nir_udiv_imm(b, v, 6);
+
+ for (uint32_t slot = 0; slot < 3; slot++) {
+ emit_prim_store(b, buf, output_count, instance_id,
+ nir_ieq_imm(b, vmod6, slot * 2),
+ prim, nir_imm_int(b, slot), 3, value, stride, offset_bytes);
+ }
+}
+
+/* TRIANGLE_STRIP_WITH_ADJACENCY: prim i emits:
+ * even i: {2i, 2i+2, 2i+4} (slots 0, 1, 2 ← input indices 2i, 2i+2, 2i+4)
+ * odd i: {2i, 2i+4, 2i+2} (slots 0, 1, 2 ← input indices 2i, 2i+4, 2i+2)
+ *
+ * Only EVEN input vertices contribute (since all output indices are 2*something).
+ * For even input v:
+ * prim v/2 slot 0 (always, if v/2 < N/2-2)
+ * prim (v-2)/2 slot 1 if (v-2)/2 even, slot 2 if odd (when v >= 2)
+ * prim (v-4)/2 slot 2 if (v-4)/2 even, slot 1 if odd (when v >= 4)
+ */
+static void
+emit_tri_strip_adj(nir_builder *b, nir_def *v, nir_def *N,
+ nir_def *buf, nir_def *output_count, nir_def *instance_id,
+ nir_def *value, uint16_t stride, uint16_t offset_bytes)
+{
+ /* Bail for odd input vertices — they never contribute. */
+ nir_def *v_is_even = nir_ieq_imm(b, nir_iand_imm(b, v, 1u), 0);
+ nir_push_if(b, v_is_even);
+ {
+ nir_def *N_half = nir_ushr_imm(b, N, 1);
+ nir_def *max_prim = nir_iadd_imm(b, N_half, -2); /* N/2 - 2 */
+ nir_def *v_half = nir_ushr_imm(b, v, 1);
+
+ /* Prim v/2 slot 0: v/2 < N/2 - 2 */
+ emit_prim_store(b, buf, output_count, instance_id,
+ nir_ult(b, v_half, max_prim),
+ v_half, nir_imm_int(b, 0), 3, value, stride, offset_bytes);
+
+ /* Prim (v-2)/2 = v/2 - 1: v >= 2 AND prim < N/2-2 */
+ {
+ nir_def *prim = nir_iadd_imm(b, v_half, -1);
+ nir_def *parity = nir_iand_imm(b, prim, 1u);
+ nir_def *slot = nir_iadd_imm(b, parity, 1); /* even→1, odd→2 */
+ nir_def *eligible = nir_iand(b,
+ nir_uge(b, v, nir_imm_int(b, 2)),
+ nir_ult(b, prim, max_prim));
+ emit_prim_store(b, buf, output_count, instance_id, eligible,
+ prim, slot, 3, value, stride, offset_bytes);
+ }
+
+ /* Prim (v-4)/2 = v/2 - 2: v >= 4 AND prim < N/2-2 */
+ {
+ nir_def *prim = nir_iadd_imm(b, v_half, -2);
+ nir_def *parity = nir_iand_imm(b, prim, 1u);
+ nir_def *slot = nir_isub(b, nir_imm_int(b, 2), parity); /* even→2, odd→1 */
+ nir_def *eligible = nir_iand(b,
+ nir_uge(b, v, nir_imm_int(b, 4)),
+ nir_ult(b, prim, max_prim));
+ emit_prim_store(b, buf, output_count, instance_id, eligible,
+ prim, slot, 3, value, stride, offset_bytes);
+ }
+ }
+ nir_pop_if(b, NULL);
+}
+
+/* ----- Main lowering: per store_output XFB channel ----- */
+
+static void
+lower_xfb_output_iter17(nir_builder *b, nir_intrinsic_instr *intr,
+ unsigned channel_idx, unsigned num_components,
+ unsigned buffer, unsigned offset_words)
+{
+ assert(buffer < MAX_XFB_BUFFERS);
+ assert(nir_intrinsic_component(intr) == 0);
+
+ uint16_t stride = b->shader->info.xfb_stride[buffer] * 4;
+ assert(stride != 0);
+ uint16_t offset_bytes = offset_words * 4;
+
+ BITSET_SET(b->shader->info.system_values_read, SYSTEM_VALUE_VERTEX_ID_ZERO_BASE);
+ BITSET_SET(b->shader->info.system_values_read, SYSTEM_VALUE_INSTANCE_ID);
+
+ nir_def *topology = load_sysval(b, graphics, 32, vs.xfb_topology);
+ nir_def *out_count = load_sysval(b, graphics, 32, vs.xfb_output_count);
+ nir_def *N = nir_load_num_vertices(b);
+ nir_def *v = nir_load_raw_vertex_id_pan(b);
+ nir_def *instance = nir_load_instance_id(b);
+ nir_def *buf = nir_load_xfb_address(b, 64, .base = buffer);
+
+ nir_def *src = intr->src[0].ssa;
+ nir_component_mask_t mask = nir_component_mask(num_components);
+ nir_def *value = nir_channels(b, src, mask << channel_idx);
+
+ /* Topology dispatch ladder. LIST first (fast path). */
+ nir_push_if(b, nir_ieq_imm(b, topology, PANVK_XFB_TOPO_LIST));
+ {
+ emit_list_store(b, buf, out_count, instance, v, value,
+ stride, offset_bytes);
+ }
+ nir_push_else(b, NULL);
+ {
+ /* iter17 Janet Finding 3: gate all non-LIST emission on
+ * output_count > 0. For degenerate input counts (N < min required
+ * for the topology), output_count is 0 and we must emit NO stores
+ * — otherwise N-2 / N-3 / etc. arithmetic underflows in the
+ * eligibility predicates and we falsely fire stores. */
+ nir_push_if(b, nir_ult(b, nir_imm_int(b, 0), out_count));
+ {
+ nir_push_if(b, nir_ieq_imm(b, topology, PANVK_XFB_TOPO_TRI_STRIP));
+ {
+ emit_tri_strip(b, v, N, buf, out_count, instance, value,
+ stride, offset_bytes);
+ }
+ nir_push_else(b, NULL);
+ {
+ nir_push_if(b, nir_ieq_imm(b, topology, PANVK_XFB_TOPO_LINE_STRIP));
+ {
+ emit_line_strip(b, v, N, buf, out_count, instance, value,
+ stride, offset_bytes);
+ }
+ nir_push_else(b, NULL);
+ {
+ nir_push_if(b, nir_ieq_imm(b, topology, PANVK_XFB_TOPO_TRI_FAN));
+ {
+ emit_tri_fan(b, v, N, buf, out_count, instance, value,
+ stride, offset_bytes);
+ }
+ nir_push_else(b, NULL);
+ {
+ nir_push_if(b, nir_ieq_imm(b, topology, PANVK_XFB_TOPO_LINE_LIST_ADJ));
+ {
+ emit_line_list_adj(b, v, N, buf, out_count, instance, value,
+ stride, offset_bytes);
+ }
+ nir_push_else(b, NULL);
+ {
+ nir_push_if(b, nir_ieq_imm(b, topology, PANVK_XFB_TOPO_LINE_STRIP_ADJ));
+ {
+ emit_line_strip_adj(b, v, N, buf, out_count, instance, value,
+ stride, offset_bytes);
+ }
+ nir_push_else(b, NULL);
+ {
+ nir_push_if(b, nir_ieq_imm(b, topology, PANVK_XFB_TOPO_TRI_LIST_ADJ));
+ {
+ emit_tri_list_adj(b, v, N, buf, out_count, instance, value,
+ stride, offset_bytes);
+ }
+ nir_push_else(b, NULL);
+ {
+ /* TRI_STRIP_ADJ — last case */
+ emit_tri_strip_adj(b, v, N, buf, out_count, instance, value,
+ stride, offset_bytes);
+ }
+ nir_pop_if(b, NULL);
+ }
+ nir_pop_if(b, NULL);
+ }
+ nir_pop_if(b, NULL);
+ }
+ nir_pop_if(b, NULL);
+ }
+ nir_pop_if(b, NULL);
+ }
+ nir_pop_if(b, NULL);
+ }
+ nir_pop_if(b, NULL); /* Janet Finding 3: close output_count > 0 guard */
+ }
+ nir_pop_if(b, NULL);
+}
+
+/* Mirror of pan_nir_lower_xfb's lower_xfb: load_vertex_id rewrite +
+ * dispatch store_output through our topology-aware emission. */
+static bool
+lower_xfb_iter17(nir_builder *b, nir_intrinsic_instr *intr,
+ UNUSED void *data)
+{
+ if (intr->intrinsic == nir_intrinsic_load_vertex_id) {
+ b->cursor = nir_instr_remove(&intr->instr);
+ nir_def *repl = nir_iadd(b, nir_load_raw_vertex_id_pan(b),
+ nir_load_raw_vertex_offset_pan(b));
+ nir_def_rewrite_uses(&intr->def, repl);
+ return true;
+ }
+
+ if (intr->intrinsic != nir_intrinsic_store_output)
+ return false;
+
+ bool progress = false;
+ b->cursor = nir_before_instr(&intr->instr);
+
+ /* io_xfb has only out[0,1]; the other 2 channels are in io_xfb2.
+ * Outer loop selects which annotation; inner picks which channel. */
+ for (unsigned i = 0; i < 2; ++i) {
+ nir_io_xfb xfb = i ? nir_intrinsic_io_xfb2(intr)
+ : nir_intrinsic_io_xfb(intr);
+ for (unsigned j = 0; j < 2; ++j) {
+ if (!xfb.out[j].num_components)
+ continue;
+ lower_xfb_output_iter17(b, intr, i * 2 + j, xfb.out[j].num_components,
+ xfb.out[j].buffer, xfb.out[j].offset);
+ progress = true;
+ }
+ }
+
+ if (progress)
+ nir_instr_remove(&intr->instr);
+ return progress;
+}
+
+bool
+panvk_per_arch(nir_lower_xfb)(nir_shader *nir)
+{
+ return nir_shader_intrinsics_pass(
+ nir, lower_xfb_iter17, nir_metadata_control_flow, NULL);
+}
+
+#endif /* PAN_ARCH < 9 */
+18 -3
View File
@@ -30,11 +30,11 @@
pkgname=mesa-panvk-bifrost pkgname=mesa-panvk-bifrost
_mesaver=26.0.6 _mesaver=26.0.6
pkgver=26.0.6.r3 pkgver=26.0.6.r4
pkgrel=1 pkgrel=1
pkgdesc="Patched Mesa libvulkan_panfrost.so exposing Bifrost-gen Mali to Vulkan apps (panvk-bifrost campaign)" pkgdesc="Patched Mesa libvulkan_panfrost.so exposing Bifrost-gen Mali to Vulkan apps (panvk-bifrost campaign)"
arch=('aarch64') arch=('aarch64')
url="https://github.com/marfrit/panvk-bifrost" url="https://git.reauktion.de/marfrit/panvk-bifrost"
license=('MIT') license=('MIT')
# We co-install at /usr/lib/panvk-bifrost/ so no conflicts with stock mesa. # We co-install at /usr/lib/panvk-bifrost/ so no conflicts with stock mesa.
@@ -80,6 +80,7 @@ source=(
"0001-panvk-expose-robustness2-nullDescriptor-bifrost.patch" "0001-panvk-expose-robustness2-nullDescriptor-bifrost.patch"
"0002-panvk-expose-vulkan-1.1-1.2-on-bifrost.patch" "0002-panvk-expose-vulkan-1.1-1.2-on-bifrost.patch"
"0003-panvk-bifrost-vk-ext-transform-feedback.patch" "0003-panvk-bifrost-vk-ext-transform-feedback.patch"
"0004-panvk-bifrost-xfb-primitive-decomposition.patch"
"brave-vulkan" "brave-vulkan"
"icd.json" "icd.json"
) )
@@ -90,6 +91,7 @@ sha256sums=(
'SKIP' 'SKIP'
'SKIP' 'SKIP'
'SKIP' 'SKIP'
'SKIP'
) )
prepare() { prepare() {
@@ -116,6 +118,15 @@ prepare() {
# reports "Hardware accelerated" across the board for the affected paths). # reports "Hardware accelerated" across the board for the affected paths).
patch -p1 < "${srcdir}/0003-panvk-bifrost-vk-ext-transform-feedback.patch" patch -p1 < "${srcdir}/0003-panvk-bifrost-vk-ext-transform-feedback.patch"
# iter17: XFB primitive decomposition for non-LIST topologies (TRI_STRIP,
# TRI_FAN, LINE_STRIP, *_WITH_ADJACENCY). Replacement panvk-specific
# NIR pass (panvk_per_arch(nir_lower_xfb)) substituted for upstream
# pan_nir_lower_xfb. Closes the 162 dEQP-VK winding_* failures from
# iter15 (958 P / 81 F / 0 Crash on full XFB CTS — remaining 81 fails
# are by-design resume_* tests, transformFeedbackDraw=false).
# Phase-doc context: ~/src/panvk-bifrost/iter17/phase{0,1,2,4,5,6,8}_*.md.
patch -p1 < "${srcdir}/0004-panvk-bifrost-xfb-primitive-decomposition.patch"
# Sanity-check the patches landed. # Sanity-check the patches landed.
grep -q "KHR_robustness2 = true," src/panfrost/vulkan/panvk_vX_physical_device.c grep -q "KHR_robustness2 = true," src/panfrost/vulkan/panvk_vX_physical_device.c
grep -q "EXT_robustness2 = true," src/panfrost/vulkan/panvk_vX_physical_device.c grep -q "EXT_robustness2 = true," src/panfrost/vulkan/panvk_vX_physical_device.c
@@ -124,8 +135,12 @@ prepare() {
grep -q "has_vk1_2 = true;" src/panfrost/vulkan/panvk_vX_physical_device.c grep -q "has_vk1_2 = true;" src/panfrost/vulkan/panvk_vX_physical_device.c
# iter13 sanity: # iter13 sanity:
grep -q "EXT_transform_feedback = PAN_ARCH < 9," src/panfrost/vulkan/panvk_vX_physical_device.c grep -q "EXT_transform_feedback = PAN_ARCH < 9," src/panfrost/vulkan/panvk_vX_physical_device.c
grep -q "pan_nir_lower_xfb" src/panfrost/vulkan/panvk_vX_shader.c
test -f src/panfrost/vulkan/jm/panvk_vX_cmd_xfb.c test -f src/panfrost/vulkan/jm/panvk_vX_cmd_xfb.c
# iter17 sanity: pan_nir_lower_xfb call site has been replaced; new file present.
grep -q "panvk_per_arch(nir_lower_xfb)" src/panfrost/vulkan/panvk_vX_shader.c
grep -q "xfb_topology" src/panfrost/vulkan/panvk_shader.h
grep -q "panvk_xfb_topology" src/panfrost/vulkan/panvk_shader.h
test -f src/panfrost/vulkan/panvk_vX_xfb_lower.c
} }
build() { build() {
+3 -3
View File
@@ -14,9 +14,9 @@
# Sibling userspace package: ../daedalus-v4l2/build-deb.sh # Sibling userspace package: ../daedalus-v4l2/build-deb.sh
set -euo pipefail set -euo pipefail
UPSTREAM_COMMIT=79256dc7ef41f83873ca9c23db20f5888858e65d UPSTREAM_COMMIT=872eec505eb91b561892d02a0526749348ddc121
PKGVER=0.1.0+r28+g79256dc PKGVER=0.1.0+r45+g872eec5
PKGREL=1 # reset for new upstream pin (79256dc — H.264 B-frame reorder fix); still carries the #64 multi-kernel postinst fix PKGREL=1 # reset for new upstream pin (872eec5 — PROTO_MAX_PAYLOAD 64 KiB -> 1 MiB, closes #19); lock-step with daedalus-v4l2 0.1.0+r45+g872eec5 REQUIRED
MODULE_NAME=daedalus_v4l2 MODULE_NAME=daedalus_v4l2
HERE=$(dirname "$(readlink -f "$0")") HERE=$(dirname "$(readlink -f "$0")")
+57
View File
@@ -1,3 +1,60 @@
daedalus-v4l2-dkms (0.1.0+r45+g872eec5-1) bookworm trixie; urgency=medium
* Bump to 872eec5 — picks up daedalus-v4l2 PR #20 (closes #19).
Wire-protocol cap DAEDALUS_PROTO_MAX_PAYLOAD raised from 64 KiB
to 1 MiB in include/daedalus_v4l2_proto.h. The kernel module
inherits the larger DAEDALUS_MAX_BITSTREAM via the same #define
and daedalus_fill_output_fmt now reports OUTPUT_MPLANE
sizeimage = ~1 MiB instead of 65484.
* Skips the r33 -> r45 commit range — between 5d8b436 and 872eec5
only one kernel/include change landed (the PROTO_MAX_PAYLOAD
bump above). The intervening daemon-only bumps (r37 / r39 /
r41 / r43) didn't touch kernel/ or include/ at all.
* Effective wire cap is min(kernel, daemon) — lock-step install
WITH daedalus-v4l2 0.1.0+r45+g872eec5 REQUIRED.
* Allocations (kmemdup / kmalloc on payload, vb2 plane backing)
are dynamic and sized per-payload at runtime; the bump only
sets the ceiling. KMALLOC_MAX_SIZE on aarch64 SLUB is several
MiB so 1 MiB is well within bounds.
-- Markus Fritsche <mfritsche@reauktion.de> Fri, 22 May 2026 21:00:00 +0000
daedalus-v4l2-dkms (0.1.0+r33+g5d8b436-1) bookworm trixie; urgency=medium
* Bump to 5d8b436 — reverts daedalus-v4l2 PRs #7 + #8. Kernel
module returns to the pre-#7 buf_done_and_job_finish completion
model: no src/dst lifecycle decoupling, no parked dst_bufs, no
1:1-contract violation against libva-v4l2-request-fourier
(closes daedalus-v4l2#9 + #10 as won't-fix at this layer; proper
fix tracked at daedalus-v4l2#11).
* Wire-protocol drops 1 → 0; lock-step install with daedalus-v4l2
0.1.0+r33+g5d8b436 REQUIRED.
* Carries forward the #64 multi-kernel postinst fix.
-- Markus Fritsche <mfritsche@reauktion.de> Thu, 21 May 2026 14:50:00 +0000
daedalus-v4l2-dkms (0.1.0+r30+g6ffe92b-1) bookworm trixie; urgency=medium
* Bump to 6ffe92b — fixes the kernel panic regression introduced
by 79256dc's split-completion design (closes daedalus-v4l2#8).
`device_run` now removes both src + dst from `m2m_ctx`'s
rdy_queue at pickup time, not at `buf_done` time. Without
this, after `SRC_CONSUMED`'s `job_finish` released the m2m
scheduler, the NEXT `device_run` saw the still-queued parked
dst_buf and paired it with a fresh src — two inflight entries
referencing the same vb2_buffer, the later `HAS_PIXELS`
triggered list_del on an already-detached list_head, smashing
the rdy_queue → hard reboot on Pi CM5 during `mpv vaapi-copy`
playback of 720p H.264 (2026-05-21).
* Wire protocol unchanged — DAEDALUS_PROTO_VERSION stays at 1.
Daemon (userspace daedalus-v4l2 package) need NOT bump in
lockstep with this DKMS update; the existing
daedalus-v4l2 0.1.0+r28+g79256dc is wire-compatible with
daedalus-v4l2-dkms 0.1.0+r30+g6ffe92b.
* Carries forward the #64 multi-kernel postinst fix.
-- Markus Fritsche <mfritsche@reauktion.de> Thu, 21 May 2026 14:00:00 +0000
daedalus-v4l2-dkms (0.1.0+r28+g79256dc-1) bookworm trixie; urgency=medium daedalus-v4l2-dkms (0.1.0+r28+g79256dc-1) bookworm trixie; urgency=medium
* Bump to 79256dc — H.264 B-frame display reorder fix (closes * Bump to 79256dc — H.264 B-frame display reorder fix (closes
+41 -11
View File
@@ -11,16 +11,23 @@
# Upstream repo: https://git.reauktion.de/reauktion/daedalus-v4l2 # Upstream repo: https://git.reauktion.de/reauktion/daedalus-v4l2
set -euo pipefail set -euo pipefail
# Same pin as the Arch PKGBUILD. 79256dc = "kernel + daemon: H.264 # 6e6dfa1 = picks up daedalus-v4l2 PR #16 — daemon now dlopens
# B-frame display reorder fix (closes #6)" — adds the wire-protocol # the Kwiboo fourier fork's libavcodec.so.62 / libavformat.so.62 /
# src_pts / output_src_pts / RESP_FRAME flags split that lets H.264 # libavutil.so.60 at /opt/fourier instead of Debian-stock soname
# streams with B-frames preserve display order through libva → kernel # 61/61/59. First step on the daedalus-fourier substitution arc
# daemon. PROTO_VERSION bumps 0 → 1; lock-step userspace + kernel # (daedalus-v4l2#11): routes the daemon through the libavcodec
# rebuild REQUIRED (daedalus-v4l2-dkms build-deb.sh pinned to the same # source tree we own in marfrit-packages. Headers + .pc files
# commit). # come from ffmpeg-v4l2-request-fourier (installed by the CI
UPSTREAM_COMMIT=79256dc7ef41f83873ca9c23db20f5888858e65d # workflow before this script runs; see PKG_CONFIG_PATH below).
PKGVER=0.1.0+r28+g79256dc UPSTREAM_COMMIT=872eec505eb91b561892d02a0526749348ddc121
PKGREL=1 # reset for new upstream pin (79256dc — H.264 B-frame reorder fix) PKGVER=0.1.0+r45+g872eec5
PKGREL=1 # reset for new upstream pin (872eec5 — PROTO_MAX_PAYLOAD 64 KiB -> 1 MiB, closes #19); lock-step with daedalus-v4l2-dkms 0.1.0+r45+g872eec5 REQUIRED
# daedalus-fourier pin. d87239d = marfrit/daedalus-fourier PR #1 merge
# (install rules + pkg-config, enables this consumer to find_package
# + link). Bump in lockstep with the upstream daemon when daedalus-
# fourier's API or installed shaders are changed by a new consumer.
DAEDALUS_FOURIER_COMMIT=d87239d8172307d9a1b93c95cbed116d175b85cc
HERE=$(dirname "$(readlink -f "$0")") HERE=$(dirname "$(readlink -f "$0")")
@@ -30,14 +37,37 @@ export SOURCE_DATE_EPOCH=1779231600
work=$(mktemp -d) work=$(mktemp -d)
trap "rm -rf $work" EXIT trap "rm -rf $work" EXIT
# --- daedalus-fourier: fetch + build + install to per-build prefix ---
#
# Static-linked into the daemon, so the temp prefix is only for the
# duration of this build script. Requires libvulkan-dev + glslang-tools
# on the runner (already needed for the daedalus-fourier benches).
FOURIER_PREFIX=$work/fourier-prefix
mkdir -p "$FOURIER_PREFIX"
cd "$work"
curl --connect-timeout 10 --max-time 600 --retry 3 --retry-delay 5 -sSLfo daedalus-fourier.tar.gz \
"https://git.reauktion.de/marfrit/daedalus-fourier/archive/${DAEDALUS_FOURIER_COMMIT}.tar.gz"
tar xzf daedalus-fourier.tar.gz
cd daedalus-fourier
cmake -B build -G Ninja \
-DCMAKE_BUILD_TYPE=Release \
-DCMAKE_INSTALL_PREFIX="$FOURIER_PREFIX"
cmake --build build --target daedalus_core
cmake --install build
# --- daedalus-v4l2: fetch + build daemon against installed daedalus-fourier ---
cd "$work" cd "$work"
curl --connect-timeout 10 --max-time 600 --retry 3 --retry-delay 5 -sSLfo daedalus-v4l2.tar.gz \ curl --connect-timeout 10 --max-time 600 --retry 3 --retry-delay 5 -sSLfo daedalus-v4l2.tar.gz \
"https://git.reauktion.de/reauktion/daedalus-v4l2/archive/${UPSTREAM_COMMIT}.tar.gz" "https://git.reauktion.de/reauktion/daedalus-v4l2/archive/${UPSTREAM_COMMIT}.tar.gz"
tar xzf daedalus-v4l2.tar.gz tar xzf daedalus-v4l2.tar.gz
SRCDIR=daedalus-v4l2 SRCDIR=daedalus-v4l2
# Build daemon (CMake) # Build daemon (CMake) — point pkg-config at the daedalus-fourier
# temp prefix so pkg_check_modules(DAEDALUS_FOURIER …) resolves to it.
cd "$SRCDIR/daemon" cd "$SRCDIR/daemon"
PKG_CONFIG_PATH="$FOURIER_PREFIX/lib/pkgconfig:/opt/fourier/lib/pkgconfig" \
cmake -B build -G Ninja \ cmake -B build -G Ninja \
-DCMAKE_BUILD_TYPE=Release \ -DCMAKE_BUILD_TYPE=Release \
-DCMAKE_INSTALL_PREFIX=/usr -DCMAKE_INSTALL_PREFIX=/usr
+136
View File
@@ -1,3 +1,139 @@
daedalus-v4l2 (0.1.0+r45+g872eec5-1) bookworm trixie; urgency=medium
* Bump to 872eec5 — picks up daedalus-v4l2 PR #20 (closes #19).
Wire-protocol cap DAEDALUS_PROTO_MAX_PAYLOAD raised from 64 KiB
to 1 MiB. DAEDALUS_MAX_BITSTREAM follows; daedalus_fill_output_fmt
now reports OUTPUT_MPLANE sizeimage = ~1 MiB instead of 65484.
libva-v4l2-request-fourier's S_FMT-driven OUTPUT-pool resize
finally succeeds; Firefox no longer falls off to libmozavcodec
SW when an H.264 slice exceeds 64 KiB (routine on any
720p+ stream).
* #define-only change in include/daedalus_v4l2_proto.h; struct
layout unchanged. But effective cap is min(kernel, daemon) —
lock-step install of this package WITH
daedalus-v4l2-dkms 0.1.0+r45+g872eec5 REQUIRED.
* Daemon-side allocations are dynamic (malloc-on-payload), so
the practical growth is one ~1 MiB read buffer per daemon
process at startup. Negligible on Pi 5 / 8 GB.
* Picks up the same r43 -> r45 transition as daedalus-v4l2-dkms
(which had been stuck at r33+g5d8b436 since the parking-design
revert because the kernel module didn't change in r37/r39/r41/r43).
-- Markus Fritsche <mfritsche@reauktion.de> Fri, 22 May 2026 21:00:00 +0000
daedalus-v4l2 (0.1.0+r43+g1d8f5af-1) bookworm trixie; urgency=medium
* Bump to 1d8f5af — picks up daedalus-v4l2 PR #18 (closes #17).
Daemon now drops degenerate (<4 byte) bitstreams at the REQ_DECODE
entry instead of letting avcodec_send_packet return
AVERROR_INVALIDDATA. Reply RESP_FRAME with status=
DAEDALUS_DECODE_NO_FRAME so libva's V4L2 surface pool stays
healthy.
* Fixes the Firefox YouTube avc1 pause→resume regression observed
on higgs: libva-v4l2-request-fourier flushes a 3-byte stub
(presumably a bare NAL start code) into OUTPUT_MPLANE at the
pause boundary; the old INVALIDDATA error path made Firefox
fall off to libmozavcodec SW for the rest of the session. With
this filter the daemon logs the sentinel as 'tiny bitstream 3
bytes — dropping as no-op' and the next real REQ_DECODE
proceeds normally.
* Wire protocol unchanged. No daedalus-v4l2-dkms bump needed.
-- Markus Fritsche <mfritsche@reauktion.de> Fri, 22 May 2026 17:30:00 +0000
daedalus-v4l2 (0.1.0+r41+g6e6dfa1-1) bookworm trixie; urgency=medium
* Bump to 6e6dfa1 — daedalus-v4l2 PR #16. Daemon dlopens Kwiboo
fourier fork's libavcodec.so.62 / libavformat.so.62 /
libavutil.so.60 at /opt/fourier instead of Debian-stock
soname 61/61/59. First step on the daedalus-fourier
substitution arc (daedalus-v4l2#11): the next PR series
layers daedalus_recipe_dispatch_h264_* substitution patches
into ffmpeg-v4l2-request-fourier's H264DSPContext NEON init,
reaching the daemon's production decode path.
* Build: PKG_CONFIG_PATH now includes /opt/fourier/lib/pkgconfig
so daemon's pkg_check_modules picks up the Kwiboo .pc files.
* CI workflow build-deps: libavcodec-dev / libavformat-dev /
libavutil-dev (Debian stock 7.1.3) → ffmpeg-v4l2-request-fourier
(provides /opt/fourier/include + .pc files).
* Wire protocol unchanged. No daedalus-v4l2-dkms bump.
-- Markus Fritsche <mfritsche@reauktion.de> Thu, 21 May 2026 21:30:00 +0000
daedalus-v4l2 (0.1.0+r39+g3bc0da1-1) bookworm trixie; urgency=medium
* Bump to 3bc0da1 — picks up daedalus-v4l2 PR #15. Per-frame
`decoder: OK ...` log line gains `decode_us=N` (libavcodec
send_packet + receive_frame wall-clock cost in microseconds).
New `decoder stats` summary line every 60 decoded frames with
codec, fps, avg decode_us, MBs/s throughput, B/MB bitrate.
* Pure observability — no decode-path behaviour change.
Establishes baseline metrics for the substitution work in
daedalus-v4l2#11 step 2 (replacing libavcodec primitives with
daedalus-fourier kernels one cycle at a time).
* On Pi CM5 / bbb 720p H.264 baseline: ~4 ms decode_us / 24 fps
/ 90 K MBs/s — workload is well under 1 % of any single
daedalus-fourier kernel's NEON ceiling.
* Wire protocol unchanged. No daedalus-v4l2-dkms bump needed.
-- Markus Fritsche <mfritsche@reauktion.de> Thu, 21 May 2026 18:30:00 +0000
daedalus-v4l2 (0.1.0+r37+g77e14e5-1) bookworm trixie; urgency=medium
* Bump to 77e14e5 — picks up daedalus-v4l2 PRs #12 + #13.
* #12 (LOW_DELAY half-measure): the daemon now sets
AV_CODEC_FLAG_LOW_DELAY on the H.264 AVCodecContext so libavcodec
emits frames in decode order ~99% of the time (a few stragglers
at GOP boundaries when the stream's SPS num_reorder_frames
overrides the flag). Visible improvement vs the 2-1-4-3
pair-swap on Firefox YouTube + mpv playback; not a permanent
fix (see #11 for the architectural plan).
* #13 (daedalus-fourier linkage): the daemon now pkg-config-links
against the daedalus-fourier kernel library (marfrit/
daedalus-fourier) and logs substrate availability at startup.
No kernels dispatched yet — this is the build-time / link-time
foundation for the H.264 daemon-rewrite plan in #11
(substituting daedalus-fourier IDCT 4×4 / IDCT 8×8 / luma
deblock primitives for libavcodec's per-MB pixel math, one
cycle at a time, measuring CPU saved per substitution).
* Build-deb.sh now fetches + builds + installs daedalus-fourier
(pinned at d87239d, marfrit/daedalus-fourier PR #1) into a
per-build temp prefix, then builds the daemon with
PKG_CONFIG_PATH pointing at it. daedalus-fourier is
statically linked into the daemon binary, so the resulting
.deb has no new runtime deps. Requires libvulkan-dev +
glslang-tools on the CI runner (the daedalus-fourier benches
already needed those).
* Wire protocol unchanged — DAEDALUS_PROTO_VERSION stays at 0.
No daedalus-v4l2-dkms bump needed.
-- Markus Fritsche <mfritsche@reauktion.de> Thu, 21 May 2026 16:30:00 +0000
daedalus-v4l2 (0.1.0+r33+g5d8b436-1) bookworm trixie; urgency=medium
* Bump to 5d8b436 — reverts daedalus-v4l2 PRs #7 + #8 (the parking
design that broke libva-v4l2-request-fourier's 1:1 CAPTURE
contract; see daedalus-v4l2#9 + #10). After daemon-r28+g79256dc
landed, mpv (--hwdec=vaapi-copy) failed pre-playing with
"Unable to dequeue buffer: Resource temporarily unavailable" /
"Failed to end picture decode" because the daemon parked CAPTURE
buffers waiting for libavcodec to release H.264 B-frames in
display order — violating the V4L2 stateless 1:1 contract.
Firefox tolerated the mess (visible "2 1 4 3" pair-swap); mpv
bailed.
* This bump restores f0d4186-equivalent behaviour, plus PR #4
(cosmetic H.264 DECODE_MODE / START_CODE menu controls). PR #7
+ PR #8 wire-protocol additions (src_pts / output_src_pts /
RESP_FRAME flags) are reverted — DAEDALUS_PROTO_VERSION drops
back from 1 → 0. Lock-step install with daedalus-v4l2-dkms
0.1.0+r33+g5d8b436 REQUIRED.
* Visible regression: H.264 B-frame streams in Firefox revert to
the original "2 1 4 3 6 5" pair-swap visual. The proper fix
(concurrent in-flight requests in daemon + display-order reorder
in libva-v4l2-request-fourier) is tracked at daedalus-v4l2#11.
-- Markus Fritsche <mfritsche@reauktion.de> Thu, 21 May 2026 14:50:00 +0000
daedalus-v4l2 (0.1.0+r28+g79256dc-1) bookworm trixie; urgency=medium daedalus-v4l2 (0.1.0+r28+g79256dc-1) bookworm trixie; urgency=medium
* Bump to 79256dc — H.264 B-frame display reorder fix (closes * Bump to 79256dc — H.264 B-frame display reorder fix (closes
@@ -0,0 +1,137 @@
From f760c0541586f43334c02611fcb4c212c08ad576 Mon Sep 17 00:00:00 2001
From: Markus Fritsche <mfritsche@reauktion.de>
Date: Thu, 21 May 2026 21:40:22 +0200
Subject: [PATCH] avcodec/aarch64/h264dsp: route H.264 4x4 IDCT through
daedalus-fourier
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
H264DSPContext.idct_add (called per 4x4 block from the intra-4x4
decode path in h264_mb.c) now dispatches through
daedalus_recipe_dispatch_h264_idct4 instead of ff_h264_idct_add_neon.
The recipe layer picks the substrate; for cycle 6 (H.264 IDCT 4x4)
the recipe is CPU NEON, so this is effectively a NEON-to-NEON
substitution with one extra dispatch call and recipe-table lookup.
Provides the first end-to-end exercise of the daedalus-fourier
kernel pack inside the libavcodec.so decode hot path; follow-up
patches wire IDCT 8x8, luma-v deblock, and qpel mc20.
The library context is process-global, lazily initialised under
pthread_once on first call. We pick the no-QPU constructor because
libavcodec.so is loaded into arbitrary host processes
(firefox-fourier, mpv-fourier, daedalus_v4l2_daemon, ...) and we
cannot assume the host has a usable Vulkan instance. Higher cycles
(deblock luma-v, MC) that benefit from the QPU will provision their
own recipe-selected context once that path is wired.
Bulk paths (idct_add16, idct_add16intra, idct_add8 — used for
non-intra4x4 macroblocks) remain on the stock NEON .S implementations
and will be batched through daedalus_recipe_dispatch_h264_idct4 with
n_blocks>1 in a follow-up.
Bit-exact against ff_h264_idct_add_neon (daedalus-fourier cycle 6
green; see marfrit/daedalus-fourier/CYCLE_LOGS.md).
Refs reauktion/daedalus-v4l2#11 — substitution arc step 2.
---
libavcodec/aarch64/Makefile | 3 +-
libavcodec/aarch64/h264_idct_daedalus.c | 49 +++++++++++++++++++++++
libavcodec/aarch64/h264dsp_init_aarch64.c | 3 +-
3 files changed, 53 insertions(+), 2 deletions(-)
create mode 100644 libavcodec/aarch64/h264_idct_daedalus.c
diff --git a/libavcodec/aarch64/Makefile b/libavcodec/aarch64/Makefile
index 41ab025..7b95fb1 100644
--- a/libavcodec/aarch64/Makefile
+++ b/libavcodec/aarch64/Makefile
@@ -3,7 +3,8 @@ OBJS-$(CONFIG_AC3DSP) += aarch64/ac3dsp_init_aarch64.o
OBJS-$(CONFIG_FDCTDSP) += aarch64/fdctdsp_init_aarch64.o
OBJS-$(CONFIG_FMTCONVERT) += aarch64/fmtconvert_init.o
OBJS-$(CONFIG_H264CHROMA) += aarch64/h264chroma_init_aarch64.o
-OBJS-$(CONFIG_H264DSP) += aarch64/h264dsp_init_aarch64.o
+OBJS-$(CONFIG_H264DSP) += aarch64/h264dsp_init_aarch64.o \
+ aarch64/h264_idct_daedalus.o
OBJS-$(CONFIG_HUFFYUVDSP) += aarch64/huffyuvdsp_init_aarch64.o
OBJS-$(CONFIG_H264PRED) += aarch64/h264pred_init.o
OBJS-$(CONFIG_H264QPEL) += aarch64/h264qpel_init_aarch64.o
diff --git a/libavcodec/aarch64/h264_idct_daedalus.c b/libavcodec/aarch64/h264_idct_daedalus.c
new file mode 100644
index 0000000..538d223
--- /dev/null
+++ b/libavcodec/aarch64/h264_idct_daedalus.c
@@ -0,0 +1,49 @@
+/*
+ * H.264 4x4 IDCT + add — daedalus-fourier substitution shim.
+ *
+ * Routes H264DSPContext.idct_add through
+ * daedalus_recipe_dispatch_h264_idct4 instead of ff_h264_idct_add_neon.
+ * The recipe layer picks the substrate (CPU NEON by default for
+ * cycle 6; future cycles may dispatch to V3D opportunistically).
+ *
+ * FFmpeg's 4x4 block memory layout matches daedalus's column-major
+ * convention: block[r + 4*c] = coefficient at (row r, col c). Both
+ * sides destructively zero the block after the transform.
+ *
+ * The library context is process-global and lazily initialised under
+ * pthread_once. We pick the no-QPU constructor here because
+ * libavcodec.so is loaded into arbitrary host processes
+ * (firefox-fourier, mpv-fourier, daedalus_v4l2_daemon, ...) and we
+ * cannot assume the host has a usable Vulkan instance. Higher cycles
+ * (deblock, MC) that benefit from the QPU initialise their own
+ * recipe-selected context once that path is wired.
+ */
+
+#include <pthread.h>
+#include <stddef.h>
+#include <stdint.h>
+
+#include <daedalus.h>
+
+#include "libavutil/attributes.h"
+#include "libavcodec/h264dsp.h"
+
+static daedalus_ctx *g_dctx;
+static pthread_once_t g_dctx_once = PTHREAD_ONCE_INIT;
+
+static void daedalus_ctx_init_once(void)
+{
+ g_dctx = daedalus_ctx_create_no_qpu();
+}
+
+void ff_h264_idct_add_daedalus(uint8_t *dst, int16_t *block, int stride);
+
+void ff_h264_idct_add_daedalus(uint8_t *dst, int16_t *block, int stride)
+{
+ static const daedalus_h264_block_meta meta = { .dst_off = 0 };
+
+ pthread_once(&g_dctx_once, daedalus_ctx_init_once);
+
+ daedalus_recipe_dispatch_h264_idct4(g_dctx, dst, (size_t)stride,
+ block, 1, &meta);
+}
diff --git a/libavcodec/aarch64/h264dsp_init_aarch64.c b/libavcodec/aarch64/h264dsp_init_aarch64.c
index c684574..b993df2 100644
--- a/libavcodec/aarch64/h264dsp_init_aarch64.c
+++ b/libavcodec/aarch64/h264dsp_init_aarch64.c
@@ -66,6 +66,7 @@ void ff_biweight_h264_pixels_4_neon(uint8_t *dst, uint8_t *src, ptrdiff_t stride
int weights, int offset);
void ff_h264_idct_add_neon(uint8_t *dst, int16_t *block, int stride);
+void ff_h264_idct_add_daedalus(uint8_t *dst, int16_t *block, int stride);
void ff_h264_idct_dc_add_neon(uint8_t *dst, int16_t *block, int stride);
void ff_h264_idct_add16_neon(uint8_t *dst, const int *block_offset,
int16_t *block, int stride,
@@ -139,7 +140,7 @@ av_cold void ff_h264dsp_init_aarch64(H264DSPContext *c, const int bit_depth,
c->biweight_pixels_tab[1] = ff_biweight_h264_pixels_8_neon;
c->biweight_pixels_tab[2] = ff_biweight_h264_pixels_4_neon;
- c->idct_add = ff_h264_idct_add_neon;
+ c->idct_add = ff_h264_idct_add_daedalus;
c->idct_dc_add = ff_h264_idct_dc_add_neon;
c->idct_add16 = ff_h264_idct_add16_neon;
c->idct_add16intra = ff_h264_idct_add16intra_neon;
--
2.47.3
@@ -0,0 +1,107 @@
From 1b286ddb4efaca26ec9b9e290e989fec77dc1c77 Mon Sep 17 00:00:00 2001
From: Markus Fritsche <mfritsche@reauktion.de>
Date: Fri, 22 May 2026 10:18:21 +0200
Subject: [PATCH] avcodec/aarch64/h264dsp: route H.264 8x8 IDCT through
daedalus-fourier
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
H264DSPContext.idct8_add (called per 8x8 block from the High-profile
intra-8x8-DCT decode path in h264_mb.c) now dispatches through
daedalus_recipe_dispatch_h264_idct8 instead of ff_h264_idct8_add_neon.
The recipe layer picks the substrate; for cycle 7 (H.264 IDCT 8x8)
the recipe is CPU NEON, so this is effectively a NEON-to-NEON
substitution layered on top of the cycle-6 IDCT 4x4 wiring. Same
pthread_once global context, same destructive-zero semantics; FFmpeg
column-major 8x8 storage block[r + 8*c] matches daedalus's convention.
Bulk path c->idct8_add4 (used for inter 8x8-DCT macroblocks) remains
on the in-tree NEON .S code and will be batched through
daedalus_recipe_dispatch_h264_idct8 with n_blocks>1 in a follow-up.
Bit-exact against ff_h264_idct8_add_neon (daedalus-fourier cycle 7
green).
Refs reauktion/daedalus-v4l2#11 — substitution arc step 2 cycle 7.
---
libavcodec/aarch64/h264_idct_daedalus.c | 29 ++++++++++++++++-------
libavcodec/aarch64/h264dsp_init_aarch64.c | 3 ++-
2 files changed, 23 insertions(+), 9 deletions(-)
diff --git a/libavcodec/aarch64/h264_idct_daedalus.c b/libavcodec/aarch64/h264_idct_daedalus.c
index 538d223..cbb98af 100644
--- a/libavcodec/aarch64/h264_idct_daedalus.c
+++ b/libavcodec/aarch64/h264_idct_daedalus.c
@@ -1,14 +1,16 @@
/*
- * H.264 4x4 IDCT + add — daedalus-fourier substitution shim.
+ * H.264 4x4 / 8x8 IDCT + add — daedalus-fourier substitution shims.
*
- * Routes H264DSPContext.idct_add through
- * daedalus_recipe_dispatch_h264_idct4 instead of ff_h264_idct_add_neon.
- * The recipe layer picks the substrate (CPU NEON by default for
- * cycle 6; future cycles may dispatch to V3D opportunistically).
+ * Routes H264DSPContext.idct_add → daedalus_recipe_dispatch_h264_idct4
+ * H264DSPContext.idct8_add → daedalus_recipe_dispatch_h264_idct8
+ * instead of the in-tree ff_h264_idct{,8}_add_neon assembly. The
+ * recipe layer picks the substrate (CPU NEON by default for cycles
+ * 6 + 7; future cycles may dispatch to V3D opportunistically).
*
- * FFmpeg's 4x4 block memory layout matches daedalus's column-major
- * convention: block[r + 4*c] = coefficient at (row r, col c). Both
- * sides destructively zero the block after the transform.
+ * FFmpeg's 4x4 and 8x8 block memory layouts match daedalus's
+ * column-major convention: block[r + N*c] = coefficient at
+ * (row r, col c) for N ∈ {4, 8}. Both sides destructively zero the
+ * block after the transform.
*
* The library context is process-global and lazily initialised under
* pthread_once. We pick the no-QPU constructor here because
@@ -37,6 +39,7 @@ static void daedalus_ctx_init_once(void)
}
void ff_h264_idct_add_daedalus(uint8_t *dst, int16_t *block, int stride);
+void ff_h264_idct8_add_daedalus(uint8_t *dst, int16_t *block, int stride);
void ff_h264_idct_add_daedalus(uint8_t *dst, int16_t *block, int stride)
{
@@ -47,3 +50,13 @@ void ff_h264_idct_add_daedalus(uint8_t *dst, int16_t *block, int stride)
daedalus_recipe_dispatch_h264_idct4(g_dctx, dst, (size_t)stride,
block, 1, &meta);
}
+
+void ff_h264_idct8_add_daedalus(uint8_t *dst, int16_t *block, int stride)
+{
+ static const daedalus_h264_block_meta meta = { .dst_off = 0 };
+
+ pthread_once(&g_dctx_once, daedalus_ctx_init_once);
+
+ daedalus_recipe_dispatch_h264_idct8(g_dctx, dst, (size_t)stride,
+ block, 1, &meta);
+}
diff --git a/libavcodec/aarch64/h264dsp_init_aarch64.c b/libavcodec/aarch64/h264dsp_init_aarch64.c
index b993df2..741e551 100644
--- a/libavcodec/aarch64/h264dsp_init_aarch64.c
+++ b/libavcodec/aarch64/h264dsp_init_aarch64.c
@@ -79,6 +79,7 @@ void ff_h264_idct_add8_neon(uint8_t **dest, const int *block_offset,
const uint8_t nnzc[15 * 8]);
void ff_h264_idct8_add_neon(uint8_t *dst, int16_t *block, int stride);
+void ff_h264_idct8_add_daedalus(uint8_t *dst, int16_t *block, int stride);
void ff_h264_idct8_dc_add_neon(uint8_t *dst, int16_t *block, int stride);
void ff_h264_idct8_add4_neon(uint8_t *dst, const int *block_offset,
int16_t *block, int stride,
@@ -146,7 +147,7 @@ av_cold void ff_h264dsp_init_aarch64(H264DSPContext *c, const int bit_depth,
c->idct_add16intra = ff_h264_idct_add16intra_neon;
if (chroma_format_idc <= 1)
c->idct_add8 = ff_h264_idct_add8_neon;
- c->idct8_add = ff_h264_idct8_add_neon;
+ c->idct8_add = ff_h264_idct8_add_daedalus;
c->idct8_dc_add = ff_h264_idct8_dc_add_neon;
c->idct8_add4 = ff_h264_idct8_add4_neon;
} else if (have_neon(cpu_flags) && bit_depth == 10) {
--
2.47.3
@@ -0,0 +1,121 @@
From 68731c41d7ea68be0e912b128cb4e71fb56e8263 Mon Sep 17 00:00:00 2001
From: Markus Fritsche <mfritsche@reauktion.de>
Date: Fri, 22 May 2026 12:15:16 +0200
Subject: [PATCH] avcodec/aarch64/h264dsp: route H.264 luma-v deblock through
daedalus-fourier
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
H264DSPContext.v_loop_filter_luma (non-intra bS<4 vertical luma
deblock, called per macroblock-row edge from the slice deblock
loop) now dispatches through
daedalus_recipe_dispatch_h264_deblock_luma_v instead of
ff_h264_v_loop_filter_luma_neon.
The recipe layer picks the substrate; for cycle 8 the daedalus
docstring marks the kernel "CPU primary; QPU opportunistic", but
the libavcodec.so context here is built with
daedalus_ctx_create_no_qpu — process-global pthread_once init,
shared with cycles 6/7. QPU opportunism stays gated off until a
follow-up adds an explicit feature flag (no implicit Vulkan init
in arbitrary host processes). In the meantime cycle 8 is a
plumbing-only substitution, NEON-to-NEON via the daedalus recipe.
Intra (bS=4) loop filter — c->v_loop_filter_luma_intra — stays on
the in-tree NEON .S code; daedalus's daedalus_h264_deblock_meta
only covers the non-intra path per its docstring.
FFmpeg `int alpha/beta/int8_t tc0[4]` → daedalus_h264_deblock_meta
(int32_t alpha/beta + inline int8_t tc0[4]). pix already points
to row 0 of the bottom block per FFmpeg's deblock convention,
satisfying daedalus's `dst_off >= 4 * dst_stride` constraint.
Refs reauktion/daedalus-v4l2#11 — substitution arc step 2 cycle 8.
---
libavcodec/aarch64/h264_idct_daedalus.c | 36 +++++++++++++++++++----
libavcodec/aarch64/h264dsp_init_aarch64.c | 4 ++-
2 files changed, 33 insertions(+), 7 deletions(-)
diff --git a/libavcodec/aarch64/h264_idct_daedalus.c b/libavcodec/aarch64/h264_idct_daedalus.c
index cbb98af..92365fa 100644
--- a/libavcodec/aarch64/h264_idct_daedalus.c
+++ b/libavcodec/aarch64/h264_idct_daedalus.c
@@ -1,11 +1,14 @@
/*
- * H.264 4x4 / 8x8 IDCT + add — daedalus-fourier substitution shims.
+ * H.264 4x4 / 8x8 IDCT + luma-v deblock — daedalus-fourier substitution shims.
*
- * Routes H264DSPContext.idct_add → daedalus_recipe_dispatch_h264_idct4
- * H264DSPContext.idct8_add → daedalus_recipe_dispatch_h264_idct8
- * instead of the in-tree ff_h264_idct{,8}_add_neon assembly. The
- * recipe layer picks the substrate (CPU NEON by default for cycles
- * 6 + 7; future cycles may dispatch to V3D opportunistically).
+ * Routes H264DSPContext.idct_add → daedalus_recipe_dispatch_h264_idct4
+ * H264DSPContext.idct8_add → daedalus_recipe_dispatch_h264_idct8
+ * H264DSPContext.v_loop_filter_luma → daedalus_recipe_dispatch_h264_deblock_luma_v
+ * instead of the in-tree ff_h264_*_neon assembly. The recipe layer
+ * picks the substrate (CPU NEON for cycles 6 + 7 by default; cycle 8
+ * is CPU primary with QPU opportunistic — the ctx below is no-QPU,
+ * so cycle 8 stays on the CPU NEON path until a separate change
+ * gates QPU init on a daedalus-fourier feature flag).
*
* FFmpeg's 4x4 and 8x8 block memory layouts match daedalus's
* column-major convention: block[r + N*c] = coefficient at
@@ -40,6 +43,8 @@ static void daedalus_ctx_init_once(void)
void ff_h264_idct_add_daedalus(uint8_t *dst, int16_t *block, int stride);
void ff_h264_idct8_add_daedalus(uint8_t *dst, int16_t *block, int stride);
+void ff_h264_v_loop_filter_luma_daedalus(uint8_t *pix, ptrdiff_t stride,
+ int alpha, int beta, int8_t *tc0);
void ff_h264_idct_add_daedalus(uint8_t *dst, int16_t *block, int stride)
{
@@ -60,3 +65,22 @@ void ff_h264_idct8_add_daedalus(uint8_t *dst, int16_t *block, int stride)
daedalus_recipe_dispatch_h264_idct8(g_dctx, dst, (size_t)stride,
block, 1, &meta);
}
+
+void ff_h264_v_loop_filter_luma_daedalus(uint8_t *pix, ptrdiff_t stride,
+ int alpha, int beta, int8_t *tc0)
+{
+ daedalus_h264_deblock_meta meta = {
+ .dst_off = 0,
+ .alpha = alpha,
+ .beta = beta,
+ };
+ meta.tc0[0] = tc0[0];
+ meta.tc0[1] = tc0[1];
+ meta.tc0[2] = tc0[2];
+ meta.tc0[3] = tc0[3];
+
+ pthread_once(&g_dctx_once, daedalus_ctx_init_once);
+
+ daedalus_recipe_dispatch_h264_deblock_luma_v(g_dctx, pix, (size_t)stride,
+ 1, &meta);
+}
diff --git a/libavcodec/aarch64/h264dsp_init_aarch64.c b/libavcodec/aarch64/h264dsp_init_aarch64.c
index 741e551..85ac381 100644
--- a/libavcodec/aarch64/h264dsp_init_aarch64.c
+++ b/libavcodec/aarch64/h264dsp_init_aarch64.c
@@ -27,6 +27,8 @@
void ff_h264_v_loop_filter_luma_neon(uint8_t *pix, ptrdiff_t stride, int alpha,
int beta, int8_t *tc0);
+void ff_h264_v_loop_filter_luma_daedalus(uint8_t *pix, ptrdiff_t stride,
+ int alpha, int beta, int8_t *tc0);
void ff_h264_h_loop_filter_luma_neon(uint8_t *pix, ptrdiff_t stride, int alpha,
int beta, int8_t *tc0);
void ff_h264_v_loop_filter_luma_intra_neon(uint8_t *pix, ptrdiff_t stride, int alpha,
@@ -114,7 +116,7 @@ av_cold void ff_h264dsp_init_aarch64(H264DSPContext *c, const int bit_depth,
int cpu_flags = av_get_cpu_flags();
if (have_neon(cpu_flags) && bit_depth == 8) {
- c->v_loop_filter_luma = ff_h264_v_loop_filter_luma_neon;
+ c->v_loop_filter_luma = ff_h264_v_loop_filter_luma_daedalus;
c->h_loop_filter_luma = ff_h264_h_loop_filter_luma_neon;
c->v_loop_filter_luma_intra= ff_h264_v_loop_filter_luma_intra_neon;
c->h_loop_filter_luma_intra= ff_h264_h_loop_filter_luma_intra_neon;
--
2.47.3
@@ -0,0 +1,82 @@
From 0d1292ea99bc4e5fa2da438259fa01a2374e3e04 Mon Sep 17 00:00:00 2001
From: Markus Fritsche <mfritsche@reauktion.de>
Date: Fri, 22 May 2026 14:18:25 +0200
Subject: [PATCH] avcodec/h264: restore AV_CODEC_FLAG_LOW_DELAY semantics
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
FFmpeg 8.x dropped the H.264 decoder's low_delay path —
AV_CODEC_FLAG_LOW_DELAY no longer prevents
h264_select_output_frame from running the display-order DPB
output queue. V4L2-stateless-style consumers (daedalus-v4l2
daemon, libva-v4l2-request-fourier) that set the flag end up
seeing the 2-1-4-3 pair-swap pattern on B-frame streams again.
Restore the documented semantics:
- Early-exit at the top of h264_select_output_frame when the
flag is set: emit the just-decoded picture immediately as
next_output_pic, mirror the corruption / recovery-point
tracking the main path performs, and skip the entire
delayed_pic[] / POC reorder machinery.
- Suppress the SPS-driven has_b_frames clobber in
h264_field_start when the flag is set, so the per-slice
bitstream_restriction_flag re-pickup cannot reintroduce a
nonzero reorder buffer mid-stream.
This is a fork-only change required by the daedalus-v4l2 daemon's
one-frame-per-send_packet contract; upstream FFmpeg consumers that
expect display-order output remain untouched (flag default = off).
Refs reauktion/daedalus-v4l2#11 — substitution arc step 2 deblock
+ flag-restoration follow-up.
---
libavcodec/h264_slice.c | 23 +++++++++++++++++++++++
1 file changed, 23 insertions(+)
diff --git a/libavcodec/h264_slice.c b/libavcodec/h264_slice.c
index 97fab70..a7bfbd6 100644
--- a/libavcodec/h264_slice.c
+++ b/libavcodec/h264_slice.c
@@ -1308,6 +1308,28 @@ static int h264_select_output_frame(H264Context *h)
cur->mmco_reset = h->mmco_reset;
h->mmco_reset = 0;
+ /* AV_CODEC_FLAG_LOW_DELAY restore (FFmpeg 8.x dropped the H.264
+ * decoder's low_delay path). Bypass the display-order DPB
+ * output queue: emit the just-decoded picture immediately, in
+ * decode order, one per send_packet. V4L2-stateless-style
+ * consumers (daedalus-v4l2 daemon, libva-v4l2-request-fourier)
+ * do their own POC-based reorder downstream and require this
+ * behaviour. */
+ if (h->avctx->flags & AV_CODEC_FLAG_LOW_DELAY) {
+ h->next_output_pic = cur;
+ h->next_outputed_poc = cur->poc;
+ h->frame_recovered |= cur->recovered;
+ cur->recovered |= h->frame_recovered & FRAME_RECOVERED_SEI;
+ if (!cur->recovered) {
+ if (!(h->avctx->flags & AV_CODEC_FLAG_OUTPUT_CORRUPT) &&
+ !(h->avctx->flags2 & AV_CODEC_FLAG2_SHOW_ALL))
+ h->next_output_pic = NULL;
+ else
+ cur->f->flags |= AV_FRAME_FLAG_CORRUPT;
+ }
+ return 0;
+ }
+
if (sps->bitstream_restriction_flag ||
h->avctx->strict_std_compliance >= FF_COMPLIANCE_STRICT) {
h->avctx->has_b_frames = FFMAX(h->avctx->has_b_frames, sps->num_reorder_frames);
@@ -1415,6 +1437,7 @@ static int h264_field_start(H264Context *h, const H264SliceContext *sl,
sps = h->ps.sps;
if (sps->bitstream_restriction_flag &&
+ !(h->avctx->flags & AV_CODEC_FLAG_LOW_DELAY) &&
h->avctx->has_b_frames < sps->num_reorder_frames) {
h->avctx->has_b_frames = sps->num_reorder_frames;
}
--
2.47.3
@@ -0,0 +1,139 @@
From 0000000000000000000000000000000000000000 Mon Sep 17 00:00:00 2001
From: Markus Fritsche <mfritsche@reauktion.de>
Date: Sat, 23 May 2026 12:00:00 +0200
Subject: [PATCH] avcodec/aarch64/h264qpel: route 8x8 mc20 through
daedalus-fourier
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
H264QpelContext.put_h264_qpel_pixels_tab[1][2] (8x8 luma horizontal
half-pel, 6-tap "put" variant — the canonical representative of the
H.264 luma motion-compensation family) now dispatches through
daedalus_recipe_dispatch_h264_qpel_mc20 instead of
ff_put_h264_qpel8_mc20_neon.
Cycle 9 of the daedalus-v4l2#11 step 2 substitution arc; closes the
4-cycle libavcodec.so substitution sequence (6 IDCT 4x4 / 7 IDCT 8x8 /
8 luma-v deblock / 9 qpel mc20).
The recipe layer picks the substrate. Per docs/k9_h264qpel_mc20.md
the verdict is CPU NEON: per-block 7.6 ns at 131 Mblock/s gives 135x
margin over 30 fps 1080p, and the QPU dispatch floor (~250 ns)
makes any V3D shader strictly worse. Substitution is plumbing-only,
NEON-by-recipe — same daedalus_ctx_create_no_qpu pthread_once
context shape the cycles 6/7/8 shims already own (kept SEPARATE
from the H264DSP shim's ctx because H264QPEL is its own libavcodec
Makefile module and link order does not guarantee a single .o
owns the ctx symbol; one extra ~µs init per process, paid lazily).
Other H.264 luma MC variants (mc02, mc11, mc22 etc.) and the 16x16
size tier stay on the in-tree NEON .S code. Per the cycle-9 phase-1
rationale, mc20 8x8 is representative of the whole family's per-block
cost — extending the substitution to other variants would multiply
recipe-lookup overhead without changing the substrate verdict.
Bit-exact against ff_put_h264_qpel8_mc20_neon (daedalus-fourier
cycle 9 green; M1 = 100% bit-exact across 10000 random blocks).
No SONAME change, no Depends change.
Refs reauktion/daedalus-v4l2#11 — substitution arc step 2 cycle 9.
---
libavcodec/aarch64/Makefile | 3 +-
libavcodec/aarch64/h264_qpel_daedalus.c | 50 ++++++++++++++++++++++
libavcodec/aarch64/h264qpel_init_aarch64.c | 4 +-
3 files changed, 55 insertions(+), 2 deletions(-)
create mode 100644 libavcodec/aarch64/h264_qpel_daedalus.c
diff --git a/libavcodec/aarch64/Makefile b/libavcodec/aarch64/Makefile
--- a/libavcodec/aarch64/Makefile
+++ b/libavcodec/aarch64/Makefile
@@ -7,7 +7,8 @@ OBJS-$(CONFIG_H264DSP) += aarch64/h264dsp_init_aarch64.o \
aarch64/h264_idct_daedalus.o
OBJS-$(CONFIG_HUFFYUVDSP) += aarch64/huffyuvdsp_init_aarch64.o
OBJS-$(CONFIG_H264PRED) += aarch64/h264pred_init.o
-OBJS-$(CONFIG_H264QPEL) += aarch64/h264qpel_init_aarch64.o
+OBJS-$(CONFIG_H264QPEL) += aarch64/h264qpel_init_aarch64.o \
+ aarch64/h264_qpel_daedalus.o
OBJS-$(CONFIG_HPELDSP) += aarch64/hpeldsp_init_aarch64.o
OBJS-$(CONFIG_IDCTDSP) += aarch64/idctdsp_init_aarch64.o
OBJS-$(CONFIG_ME_CMP) += aarch64/me_cmp_init_aarch64.o
diff --git a/libavcodec/aarch64/h264_qpel_daedalus.c b/libavcodec/aarch64/h264_qpel_daedalus.c
new file mode 100644
--- /dev/null
+++ b/libavcodec/aarch64/h264_qpel_daedalus.c
@@ -0,0 +1,50 @@
+/*
+ * H.264 luma qpel mc20 (8x8, horizontal half-pel, 6-tap "put")
+ * — daedalus-fourier substitution shim.
+ *
+ * Routes H264QpelContext.put_h264_qpel_pixels_tab[1][2] through
+ * daedalus_recipe_dispatch_h264_qpel_mc20 instead of
+ * ff_put_h264_qpel8_mc20_neon. The recipe layer picks the substrate
+ * (CPU NEON for cycle 9; QPU not viable — per-block 7.6 ns vs
+ * ~250 ns QPU dispatch floor, see docs/k9_h264qpel_mc20.md).
+ *
+ * Sibling to libavcodec/aarch64/h264_idct_daedalus.c. We keep a
+ * SEPARATE process-global pthread_once context here instead of
+ * sharing the H264DSP one because H264QPEL is its own libavcodec
+ * Makefile module and link order does not guarantee a single .o
+ * owns the ctx symbol. The cost is one extra
+ * daedalus_ctx_create_no_qpu (~µs) per process; daemon and host
+ * processes pay this lazily on first MC call.
+ *
+ * FFmpeg H264QpelContext convention: both dst and src use a SINGLE
+ * stride and `src` already points at the leftmost OUTPUT column
+ * (col 0); the 6-tap filter reads cols -2..+3. This matches
+ * daedalus_recipe_dispatch_h264_qpel_mc20's documented contract
+ * directly, so dst_off = src_off = 0.
+ */
+
+#include <pthread.h>
+#include <stddef.h>
+#include <stdint.h>
+
+#include <daedalus.h>
+
+#include "libavutil/attributes.h"
+
+static daedalus_ctx *g_dctx;
+static pthread_once_t g_dctx_once = PTHREAD_ONCE_INIT;
+
+static void daedalus_ctx_init_once(void)
+{
+ g_dctx = daedalus_ctx_create_no_qpu();
+}
+
+void ff_put_h264_qpel8_mc20_daedalus(uint8_t *dst, const uint8_t *src, ptrdiff_t stride);
+
+void ff_put_h264_qpel8_mc20_daedalus(uint8_t *dst, const uint8_t *src, ptrdiff_t stride)
+{
+ static const daedalus_h264_qpel_meta meta = { .dst_off = 0, .src_off = 0 };
+ pthread_once(&g_dctx_once, daedalus_ctx_init_once);
+ daedalus_recipe_dispatch_h264_qpel_mc20(g_dctx, dst, src, (size_t)stride,
+ 1, &meta);
+}
diff --git a/libavcodec/aarch64/h264qpel_init_aarch64.c b/libavcodec/aarch64/h264qpel_init_aarch64.c
--- a/libavcodec/aarch64/h264qpel_init_aarch64.c
+++ b/libavcodec/aarch64/h264qpel_init_aarch64.c
@@ -47,6 +47,8 @@ void ff_put_h264_qpel8_mc00_neon(uint8_t *dst, const uint8_t *src, ptrdiff_t str
void ff_put_h264_qpel8_mc10_neon(uint8_t *dst, const uint8_t *src, ptrdiff_t stride);
void ff_put_h264_qpel8_mc20_neon(uint8_t *dst, const uint8_t *src, ptrdiff_t stride);
void ff_put_h264_qpel8_mc30_neon(uint8_t *dst, const uint8_t *src, ptrdiff_t stride);
+void ff_put_h264_qpel8_mc20_daedalus(uint8_t *dst, const uint8_t *src,
+ ptrdiff_t stride);
void ff_put_h264_qpel8_mc01_neon(uint8_t *dst, const uint8_t *src, ptrdiff_t stride);
void ff_put_h264_qpel8_mc11_neon(uint8_t *dst, const uint8_t *src, ptrdiff_t stride);
void ff_put_h264_qpel8_mc21_neon(uint8_t *dst, const uint8_t *src, ptrdiff_t stride);
@@ -184,7 +186,7 @@ av_cold void ff_h264qpel_init_aarch64(H264QpelContext *c, int bit_depth)
c->put_h264_qpel_pixels_tab[1][ 0] = ff_put_h264_qpel8_mc00_neon;
c->put_h264_qpel_pixels_tab[1][ 1] = ff_put_h264_qpel8_mc10_neon;
- c->put_h264_qpel_pixels_tab[1][ 2] = ff_put_h264_qpel8_mc20_neon;
+ c->put_h264_qpel_pixels_tab[1][ 2] = ff_put_h264_qpel8_mc20_daedalus;
c->put_h264_qpel_pixels_tab[1][ 3] = ff_put_h264_qpel8_mc30_neon;
c->put_h264_qpel_pixels_tab[1][ 4] = ff_put_h264_qpel8_mc01_neon;
c->put_h264_qpel_pixels_tab[1][ 5] = ff_put_h264_qpel8_mc11_neon;
--
2.47.3
+49 -3
View File
@@ -33,7 +33,19 @@ FFMPEG_VERSION=8.1
# epoch 2 matches Debian's stock ffmpeg (currently 7:7.1.x in trixie); # epoch 2 matches Debian's stock ffmpeg (currently 7:7.1.x in trixie);
# +rfourier suffix to avoid colliding with upstream/Debian rebuilds. # +rfourier suffix to avoid colliding with upstream/Debian rebuilds.
PKGVER=2:${FFMPEG_VERSION}+rfourier+gb57fbbe PKGVER=2:${FFMPEG_VERSION}+rfourier+gb57fbbe
PKGREL=2 # pkgrel=2Path A move to /opt/fourier prefix (2026-05-19) PKGREL=10 # pkgrel=10H.264 luma qpel mc20 daedalus-fourier substitution
# (cycle 9 of the daedalus-v4l2#11 step 2 substitution arc; closes
# the libavcodec.so substitution sequence 6 IDCT4 / 7 IDCT8 /
# 8 luma-v deblock / 9 qpel mc20). Pulls daedalus-fourier PR #2
# which extends the public API with
# daedalus_recipe_dispatch_h264_qpel_mc20. (2026-05-23)
# daedalus-fourier pin. 209a421 = daedalus-fourier PR #2 merge — public
# API now exposes daedalus_recipe_dispatch_h264_qpel_mc20 +
# DAEDALUS_KERNEL_H264_QPEL_MC20. Cycle 9 plumbs the last H.264 NEON
# kernel through the recipe layer. Daemon-side build (debian/daedalus-v4l2)
# can bump in a follow-up; this PR only changes the libavcodec.so consumer.
DAEDALUS_FOURIER_COMMIT=209a4218bcb98b91c04f07ad61513bb04adb13ad
HERE=$(dirname "$(readlink -f "$0")") HERE=$(dirname "$(readlink -f "$0")")
@@ -57,6 +69,38 @@ fi
# Apply patches (same as Arch). # Apply patches (same as Arch).
patch -Np1 -i "$HERE/0001-libudev-bypass-fallback.patch" patch -Np1 -i "$HERE/0001-libudev-bypass-fallback.patch"
patch -Np1 -i "$HERE/0002-nv15-to-p010-unpack.patch" patch -Np1 -i "$HERE/0002-nv15-to-p010-unpack.patch"
patch -Np1 -i "$HERE/0003-h264-idct4-daedalus-fourier.patch"
patch -Np1 -i "$HERE/0004-h264-idct8-daedalus-fourier.patch"
patch -Np1 -i "$HERE/0005-h264-deblock-luma-v-daedalus-fourier.patch"
patch -Np1 -i "$HERE/0006-h264-restore-low-delay.patch"
patch -Np1 -i "$HERE/0007-h264-qpel-mc20-daedalus-fourier.patch"
# --- daedalus-fourier: fetch + build static .a with PIC, install to a
# per-build prefix; libavcodec.so links it into the shared object so
# H264DSPContext.idct_add (and follow-up kernels) dispatch through the
# daedalus recipe layer instead of the in-tree NEON .S code. ---
#
# PIC is mandatory — the static .a is linked into a .so, so all object
# code must be relocatable. Vulkan is PUBLIC-linked by daedalus_core
# (queryable QPU substrate); we add libvulkan1 to Debian Depends below
# so dlopen of libavcodec.so.62 succeeds on stock trixie.
FOURIER_PREFIX=$work/fourier-prefix
mkdir -p "$FOURIER_PREFIX"
pushd "$work" >/dev/null
curl --connect-timeout 10 --max-time 600 --retry 3 --retry-delay 5 -sSLfo daedalus-fourier.tar.gz \
"https://git.reauktion.de/marfrit/daedalus-fourier/archive/${DAEDALUS_FOURIER_COMMIT}.tar.gz"
tar xzf daedalus-fourier.tar.gz
pushd daedalus-fourier >/dev/null
cmake -B build -G Ninja \
-DCMAKE_BUILD_TYPE=Release \
-DCMAKE_POSITION_INDEPENDENT_CODE=ON \
-DCMAKE_INSTALL_PREFIX="$FOURIER_PREFIX"
cmake --build build --target daedalus_core
cmake --install build
popd >/dev/null
popd >/dev/null
cd "$work/FFmpeg"
# Configure with Arch-parity flags. Drops the same set of features # Configure with Arch-parity flags. Drops the same set of features
# (X11, AMF, CUDA, FireWire, AviSynth, Bluray, OpenMPT, JPEG-XL, # (X11, AMF, CUDA, FireWire, AviSynth, Bluray, OpenMPT, JPEG-XL,
@@ -73,6 +117,9 @@ patch -Np1 -i "$HERE/0002-nv15-to-p010-unpack.patch"
--mandir=/opt/fourier/share/man \ --mandir=/opt/fourier/share/man \
--extra-ldexeflags='-Wl,-rpath,/opt/fourier/lib' \ --extra-ldexeflags='-Wl,-rpath,/opt/fourier/lib' \
--extra-ldsoflags='-Wl,-rpath,/opt/fourier/lib' \ --extra-ldsoflags='-Wl,-rpath,/opt/fourier/lib' \
--extra-cflags="-I${FOURIER_PREFIX}/include" \
--extra-ldflags="-L${FOURIER_PREFIX}/lib" \
--extra-libs="-ldaedalus_core -lvulkan -lpthread" \
--disable-debug \ --disable-debug \
--disable-static \ --disable-static \
--disable-doc \ --disable-doc \
@@ -95,7 +142,6 @@ patch -Np1 -i "$HERE/0002-nv15-to-p010-unpack.patch"
--enable-libass \ --enable-libass \
--enable-libfreetype \ --enable-libfreetype \
--enable-libfribidi \ --enable-libfribidi \
--enable-libxml2 \
--enable-libpulse \ --enable-libpulse \
--enable-libdav1d \ --enable-libdav1d \
--enable-libopus \ --enable-libopus \
@@ -147,10 +193,10 @@ Priority: optional
Architecture: arm64 Architecture: arm64
Depends: libc6, Depends: libc6,
libdrm2, libdrm2,
libvulkan1,
libfontconfig1, libfontconfig1,
libfreetype6, libfreetype6,
libfribidi0, libfribidi0,
libxml2,
libpulse0, libpulse0,
libdav1d7 | libdav1d6, libdav1d7 | libdav1d6,
libopus0, libopus0,
+168
View File
@@ -1,3 +1,171 @@
ffmpeg-v4l2-request-fourier (2:8.1+rfourier+gb57fbbe-10) bookworm trixie; urgency=medium
* Add 0007-h264-qpel-mc20-daedalus-fourier.patch —
H264QpelContext.put_h264_qpel_pixels_tab[1][2] (8x8 luma
horizontal half-pel, 6-tap "put" — the canonical representative
of the H.264 luma motion-compensation family) now dispatches
through daedalus_recipe_dispatch_h264_qpel_mc20 instead of
ff_put_h264_qpel8_mc20_neon. Cycle 9 of the daedalus-v4l2#11
step 2 substitution arc; closes the 4-cycle libavcodec.so
substitution sequence (6 IDCT4 / 7 IDCT8 / 8 luma-v deblock /
9 qpel mc20).
* Bumps daedalus-fourier pin d87239d → 209a421 (PR #2 — public
API extended with daedalus_recipe_dispatch_h264_qpel_mc20 +
DAEDALUS_KERNEL_H264_QPEL_MC20).
* Cycle 9 is "CPU primary; QPU pointless" per
docs/k9_h264qpel_mc20.md. Per-block 7.6 ns at 131 Mblock/s
gives 135x margin over 30 fps 1080p; QPU dispatch floor at
~250 ns makes any V3D shader strictly worse. Substitution
is plumbing-only, NEON-by-recipe — same
daedalus_ctx_create_no_qpu pthread_once shape the cycles 6/7/8
shims already own (kept SEPARATE from the H264DSP shim's ctx
because H264QPEL is its own libavcodec Makefile module and
link order does not guarantee a single .o owns the ctx symbol;
one extra ~µs init per process, paid lazily on first MC call).
* Other H.264 luma MC variants (mc02, mc11, mc22 etc.) and the
16x16 size tier stay on the in-tree NEON .S code. Per the
cycle-9 phase-1 rationale, mc20 8x8 is representative of the
whole family's per-block cost.
* Bit-exact against ff_put_h264_qpel8_mc20_neon (daedalus-fourier
cycle 9 green; 10000/10000 random blocks).
* No SONAME change, no Depends change.
-- Markus Fritsche <mfritsche@reauktion.de> Sat, 23 May 2026 12:00:00 +0000
ffmpeg-v4l2-request-fourier (2:8.1+rfourier+gb57fbbe-9) bookworm trixie; urgency=medium
* Add 0006-h264-restore-low-delay.patch — restore the documented
AV_CODEC_FLAG_LOW_DELAY semantics in the H.264 decoder. FFmpeg
8.x dropped the H.264 low_delay code path entirely; setting the
flag at avcodec_open2 no longer prevents the display-order DPB
output queue from running. Visible on Firefox YouTube as the
2-1-4-3 B-frame pair-swap, re-introduced silently by the
SONAME 61→62 jump in daedalus-v4l2 PR #16.
* h264_select_output_frame: early-exit when LOW_DELAY is set;
emit the just-decoded picture as next_output_pic, mirror the
corruption / recovery-point tracking, skip delayed_pic[] and
the POC reorder machinery entirely.
* h264_field_start: suppress the SPS-driven
has_b_frames = sps->num_reorder_frames clobber when LOW_DELAY
is set — without this the per-slice bitstream_restriction_flag
re-pickup would reintroduce a nonzero reorder buffer mid-
stream.
* Restores the same one-frame-per-send_packet contract the
daedalus-v4l2 daemon's decoder.c already relies on (the flag
is set unconditionally for H.264). No daemon side change.
* No SONAME change, no Depends change.
-- Markus Fritsche <mfritsche@reauktion.de> Fri, 22 May 2026 13:30:00 +0000
ffmpeg-v4l2-request-fourier (2:8.1+rfourier+gb57fbbe-8) bookworm trixie; urgency=medium
* Add 0005-h264-deblock-luma-v-daedalus-fourier.patch —
H264DSPContext.v_loop_filter_luma (non-intra bS<4 vertical luma
deblock, called per macroblock-row edge from the slice deblock
loop in libavcodec/h264_loopfilter.c) now dispatches through
daedalus_recipe_dispatch_h264_deblock_luma_v instead of
ff_h264_v_loop_filter_luma_neon. Cycle 8 of the daedalus-v4l2#11
step 2 substitution arc.
* Cycle 8 is marked "CPU primary; QPU opportunistic" in
daedalus-fourier, but the libavcodec.so context here uses
daedalus_ctx_create_no_qpu (process-global pthread_once,
shared with cycles 6/7). Opportunistic QPU is deferred to a
separate change that gates Vulkan init on a feature flag, to
avoid implicit Vulkan init in arbitrary host processes. For
now cycle 8 is plumbing-only — NEON-by-recipe.
* Intra (bS=4) loop filter c->v_loop_filter_luma_intra stays on
the in-tree NEON .S code; daedalus's daedalus_h264_deblock_meta
only covers the non-intra path per its API docstring.
* Bit-exact against ff_h264_v_loop_filter_luma_neon (daedalus-fourier
cycle 8 green).
* No SONAME change, no Depends change.
-- Markus Fritsche <mfritsche@reauktion.de> Fri, 22 May 2026 12:30:00 +0000
ffmpeg-v4l2-request-fourier (2:8.1+rfourier+gb57fbbe-7) bookworm trixie; urgency=medium
* Add 0004-h264-idct8-daedalus-fourier.patch — H264DSPContext.idct8_add
(per-block 8x8 IDCT, called from the High-profile intra-8x8-DCT
macroblock path in libavcodec/h264_mb.c) now dispatches through
daedalus_recipe_dispatch_h264_idct8 instead of
ff_h264_idct8_add_neon. Cycle 7 of the daedalus-v4l2#11 step 2
substitution arc — NEON-by-recipe, same pthread_once context the
cycle-6 IDCT 4x4 shim already owns.
* Bit-exact against ff_h264_idct8_add_neon (daedalus-fourier cycle 7
green; FFmpeg 8x8 block storage block[r + 8*c] matches daedalus
column-major convention).
* Bulk c->idct8_add4 (inter 8x8-DCT macroblocks) stays on the
in-tree NEON .S code; batched substitution lands later.
* No SONAME change, no Depends change.
-- Markus Fritsche <mfritsche@reauktion.de> Fri, 22 May 2026 10:30:00 +0000
ffmpeg-v4l2-request-fourier (2:8.1+rfourier+gb57fbbe-6) bookworm trixie; urgency=medium
* Drop --enable-libxml2 + libxml2 Depends — the Gitea
debian-aarch64 runner ships libxml2 ≥ 2.14 (SONAME 16) while
Debian trixie targets 2.12 (SONAME 2). -5 built fine, then
failed to load on higgs trixie:
dlopen(libavformat.so.62): libxml2.so.16:
cannot open shared object file
Neither the daedalus-v4l2 daemon (direct AVPacket feed —
libavformat used only for the in-tree v4l2request hwaccel
glue) nor mpv-fourier (Lua + ytdlp + mpv's stream code do
DASH/HLS) nor firefox-fourier (gecko-media DASH demux)
consumes FFmpeg's libxml2-backed DASH demuxer, so dropping is
feature-neutral. Mirrors the libva trixie/runner ABI-skew
workaround documented in PR #62.
* CI workflow build-deps lose libxml2-dev for the same reason.
* No source code change beyond configure flags + Depends.
Substitution stays as PRs #76/#77 landed.
-- Markus Fritsche <mfritsche@reauktion.de> Thu, 21 May 2026 23:30:00 +0000
ffmpeg-v4l2-request-fourier (2:8.1+rfourier+gb57fbbe-5) bookworm trixie; urgency=medium
* pkgrel-only bump (3 → 5) to force a rebuild of the H.264 IDCT 4x4
daedalus-fourier substitution that landed in marfrit-packages PR
#76. An orphan -4 .deb already sat in the apt pool (dated
2026-05-19, no matching source commit in main); CI's
check-already-published.sh compares with `dpkg --compare-versions
pool_ver ge source_full`, which short-circuited PR #76's -3
build. Skipping past -4 lets the CI workflow actually publish the
substitution.
* No source code change beyond PKGREL and this changelog entry.
Substitution + control + build-deb.sh wiring stay as PR #76 left
them.
-- Markus Fritsche <mfritsche@reauktion.de> Thu, 21 May 2026 21:30:00 +0000
ffmpeg-v4l2-request-fourier (2:8.1+rfourier+gb57fbbe-3) bookworm trixie; urgency=medium
* Add 0003-h264-idct4-daedalus-fourier.patch — H264DSPContext.idct_add
(per-block 4x4 IDCT, called from the intra-4x4 decode path in
libavcodec/h264_mb.c) now dispatches through
daedalus_recipe_dispatch_h264_idct4 instead of
ff_h264_idct_add_neon. First end-to-end exercise of the
daedalus-fourier kernel pack inside libavcodec.so on the
production decode hot path (daedalus-v4l2#11 step 2 — cycle 6
H.264 IDCT 4x4, NEON-by-recipe).
* build-deb.sh: fetches + builds daedalus-fourier (pinned at
d87239d, lockstep with the daemon's static link) with
-fPIC into a per-build temp prefix, then passes
--extra-cflags=-I.../include --extra-ldflags=-L.../lib
--extra-libs="-ldaedalus_core -lvulkan -lpthread" to FFmpeg
configure. Static-linked into libavcodec.so.62.
* Bulk paths (idct_add16 / idct_add16intra / idct_add8) remain on
the stock NEON .S code and will be batched through
daedalus_recipe_dispatch_h264_idct4 with n_blocks>1 in a
follow-up. Cycles 7/8/9 (IDCT 8x8 / luma-v deblock / qpel mc20)
land in subsequent patches.
* Depends gains libvulkan1 — daedalus_core PUBLIC-links Vulkan
(queryable QPU substrate); the no-QPU constructor still works,
but the loader refuses libavcodec.so.62 at dlopen time without
libvulkan.so.1 present.
* No ABI change; SONAMEs stay 62/62/60.
-- Markus Fritsche <mfritsche@reauktion.de> Thu, 21 May 2026 20:00:00 +0000
ffmpeg-v4l2-request-fourier (2:8.1+rfourier+gb57fbbe-1) bookworm trixie; urgency=medium ffmpeg-v4l2-request-fourier (2:8.1+rfourier+gb57fbbe-1) bookworm trixie; urgency=medium
* Initial Debian packaging for the Kwiboo FFmpeg fork with V4L2 * Initial Debian packaging for the Kwiboo FFmpeg fork with V4L2